Archive

Archive for December, 2012

2012 in review

December 31, 2012

The WordPress.com stats helper monkeys prepared a 2012 annual report for this blog.

Here’s an excerpt:

About 55,000 tourists visit Liechtenstein every year. This blog was viewed about 170,000 times in 2012. If it were Liechtenstein, it would take about 3 years for that many people to see it. Your blog had more visits than a small country in Europe!

Click here to see the complete report.

Categories: Uncategorized

Free online polls

December 27, 2012 1 comment
Categories: Uncategorized

Scraping AJAX web pages

December 27, 2012
Categories: Uncategorized

Scraping AJAX web pages (Part 4)

December 27, 2012 8 comments

Don’t forget to check out the rest of the series too!

I managed to solve a problem that had bugged me for a long time. Namely, (1) I want to download the generated source of an AJAX-powered webpage; (2) I want a headless solution, i.e. no browser window; and (3) I want to wait until the AJAX content is fully loaded.

During the past 1.5 years I got quite close :) I could solve everything except issue #3. Now I’m proud to present a complete solution that satisfies all the criteria above.

#!/usr/bin/env python

import sys

from PySide.QtCore import QTimer, QUrl
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebPage

SEC = 1000 # 1 sec. is 1000 msec.
USER_AGENT = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0'

class JabbaWebkit(QWebPage):
    # 'html' is a class variable
    def __init__(self, url, wait, app, parent=None):
        super(JabbaWebkit, self).__init__(parent)
        JabbaWebkit.html = ''

        if wait:
            QTimer.singleShot(wait * SEC, app.quit)
        else:
            self.loadFinished.connect(app.quit)

        self.mainFrame().load(QUrl(url))

    def save(self):
        JabbaWebkit.html = self.mainFrame().toHtml()

    def userAgentForUrl(self, url):
        return USER_AGENT

def get_page(url, wait=None):
    # The trick that allows calling get_page() several times: only one
    # QApplication may exist per process, so reuse it if it is already there.
    app = QApplication.instance()
    if not app:
        app = QApplication(sys.argv)
    form = JabbaWebkit(url, wait, app)
    app.aboutToQuit.connect(form.save)
    app.exec_()
    return JabbaWebkit.html

#############################################################################

if __name__ == "__main__":
    url = 'http://simile.mit.edu/crowbar/test.html'
    print(get_page(url))

It’s also on GitHub. The GitHub version contains more documentation and more examples.
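
The generated source comes back from get_page() as an ordinary string, so any HTML parser can take over from there. Below is a minimal sketch (Python 3, standard library only) that collects the links from such a string. LinkCollector, extract_links and the sample HTML are illustrative names that are not part of Jabba-Webkit; the sample string merely stands in for what a call like get_page(url, wait=5) might return.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(html):
    """Return the link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Stand-in for the generated source; in real use this would be the
# string returned by get_page().
sample = ('<html><body><a href="/a">A</a><p>text</p>'
          '<a href="http://example.com/b">B</a></body></html>')
print(extract_links(sample))
```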

[ reddit comments ]

Update (20121228)
Jabba-Webkit got included in Pycoder’s Weekly #46. Awesome.

GitHub: create a new repository and start using it

December 27, 2012

Problem
You want to create a new GitHub repository and you want to use it right away, i.e. you want to upload some content.

In the past, GitHub showed detailed step-by-step instructions for all this, but they got removed :(

Solution
On the main page of GitHub, there is a button called “New repository”. Click on it, fill out the fields and create the repo. Now it’s on GitHub.

The next step is to clone it to your local machine:

git clone git@github.com:username/project.git

Use the URL that starts with “git@github.com” here, not the one with “https://”. I once cloned the “https://” one, and then it kept asking for my username and password at each push :(
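
If you only have the “https://” address at hand, the SSH form can be derived from it mechanically. Here is a small sketch of that conversion; https_to_ssh is a hypothetical helper, and it assumes the usual https://github.com/username/project(.git) layout.

```python
import re


def https_to_ssh(url):
    """Turn a GitHub HTTPS clone URL into the git@github.com form.

    Hypothetical helper; assumes https://github.com/<user>/<repo>[.git].
    """
    match = re.match(r'^https://github\.com/([^/]+)/([^/]+?)(?:\.git)?/?$', url)
    if match is None:
        raise ValueError('not a GitHub HTTPS URL: %r' % url)
    user, repo = match.groups()
    return 'git@github.com:%s/%s.git' % (user, repo)


print(https_to_ssh('https://github.com/username/project.git'))
```

For a repository that was already cloned via “https://”, the result can be passed to git remote set-url origin to switch the existing clone over.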

Now you can make your local changes. When ready, commit them and upload them to GitHub:

git push origin master

More info here.

Update (20131216)
If you need to upload your SSH key, follow this guide.

Free list of Elite proxy servers

December 27, 2012

Problem
You want to collect a list of free Elite proxies.

Solution
Currently I scrape these pages to maintain a list of Elite proxies:

It’s enough for me now. If my needs change in the future, I will update this list.

Wikipedia APIs for bots

December 18, 2012 1 comment
Categories: Uncategorized

Image Gallery from a list of URLs

December 18, 2012

Problem
I have several scrapers that extract images. How can I view them? One way is to open each one in a new browser tab, but it’s slow, and who wants to have several hundred tabs? Is there a way to browse these images in one single tab?

Solution
A primitive solution would be to create an HTML page that lists all the images one below the other. But again, what if you have lots of images?
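
For reference, the primitive approach could be sketched like this (Python 3). build_page and the example.com URLs are made up for illustration and have nothing to do with the gallery generator mentioned below.

```python
from html import escape


def build_page(urls, title="Images"):
    """Return an HTML page that shows every image URL one below the other."""
    lines = ["<!DOCTYPE html>", "<html>", "<head>",
             "<title>%s</title>" % escape(title), "</head>", "<body>"]
    for url in urls:
        # Escape the URL so that characters like & stay valid inside the attribute.
        lines.append('<p><img src="%s"></p>' % escape(url, quote=True))
    lines += ["</body>", "</html>"]
    return "\n".join(lines)


# Placeholder URLs; in practice these would come from a scraper.
urls = ["http://example.com/1.jpg", "http://example.com/2.png?a=1&b=2"]
page = build_page(urls)
print(page)
```

Saving the returned string to a file and opening it in a browser is enough for a few dozen images.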

A better way is to organize the images in a gallery. There are tons of image gallery generators out there, but most of them work with local images. I want to browse remote images when only their URLs are available. So I made my own image gallery generator that works with URLs. It is available on GitHub.

There is also a live demo, check it out.

The software is written in Python. See the README file for usage examples.

Categories: html, python

Put a text on the clipboard from your webpage

December 18, 2012

Problem
From an HTML page you want to copy some text to the clipboard by pressing a button.

Example: on a page you present a list of URLs. Next to each URL there is a button. If the user clicks on the button, the corresponding URL is copied to his/her clipboard.

Solution
You can use clippy for this task. “Clippy is a very simple Flash widget that makes it possible to place arbitrary text onto the client’s clipboard.”

Here is an HTML template that you must paste in your HTML: clippy.html. Simply replace “{{ clippy_text }}” and “{{ bgcolor }}” with the values you want.
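
If the page is generated from Python, the two placeholders can be filled in programmatically. The following is only a sketch: fill_template is a hypothetical helper, and the one-line template is a stand-in for the real clippy.html, whose contents are not reproduced here.

```python
import re
from html import escape

# Matches placeholders written exactly like "{{ name }}".
PLACEHOLDER = re.compile(r"\{\{ (\w+) \}\}")


def fill_template(template, values):
    """Replace every known {{ name }} placeholder in a single pass."""
    def substitute(match):
        key = match.group(1)
        if key in values:
            # Escape the value so it is safe inside an HTML attribute.
            return escape(values[key], quote=True)
        return match.group(0)  # leave unknown placeholders untouched
    return PLACEHOLDER.sub(substitute, template)


# Stand-in template, NOT the real clippy.html.
template = '<span data-text="{{ clippy_text }}" style="background: {{ bgcolor }}"></span>'
result = fill_template(template, {
    "clippy_text": "http://example.com/?a=1&b=2",
    "bgcolor": "#FFFFFF",
})
print(result)
```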

Update (20130103)
GitHub also used clippy but recently they switched to ZeroClipboard. Here is their announcement.

Categories: html

What does REPL mean?

December 12, 2012

REPL stands for “Read–eval–print loop”.

“A read–eval–print loop (REPL) is a simple, interactive computer programming environment. REPLs facilitate exploratory programming and debugging because the read–eval–print loop is usually much faster than the classic edit-compile-run loop. In a REPL, the user enters one or more expressions (rather than an entire compilation unit), which are then evaluated, and the results displayed. The name read–eval–print loop comes from the names of the Lisp primitive functions.” (source)

More info here.
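
To make the three steps concrete, here is a toy REPL sketched in Python 3. It only handles expressions (no statements), and the repl function with its read/write parameters is an illustrative design, not how any real interpreter is implemented. A scripted session stands in for the keyboard so the example runs unattended.

```python
def repl(read, write):
    """A toy read-eval-print loop for Python expressions."""
    env = {}
    while True:                        # loop
        try:
            line = read()              # read
        except EOFError:
            break
        if not line.strip():
            continue
        try:
            result = eval(line, env)   # eval (expressions only)
        except Exception as exc:
            write("error: %s" % exc)
            continue
        write(repr(result))            # print


# Scripted session instead of keyboard input.
inputs = iter(["1 + 2", "'ab' * 2", "1 / 0"])


def scripted_read():
    try:
        return next(inputs)
    except StopIteration:
        raise EOFError


outputs = []
repl(scripted_read, outputs.append)
print(outputs)
```

For interactive use, repl(lambda: input('>>> '), print) does the same thing at the terminal; Ctrl-D (end of input) exits.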

Categories: Uncategorized