Scraping AJAX web pages (Part 2)

Don’t forget to check out the rest of the series too!

In this post we’ll see how to get the generated source of an HTML page. That is, we want the source as it looks after the embedded JavaScript calls have been evaluated.

Here is my solution:

#!/usr/bin/env python

"""
Simple webkit.
"""

import sys
from PyQt4 import QtGui, QtCore, QtWebKit

class SimpleWebkit():
    def __init__(self, url):
        self.url = url
        self.webView = QtWebKit.QWebView()

    def save(self):
        print self.webView.page().mainFrame().toHtml()
        sys.exit(0)

    def process(self):
        # Connect the handler before starting the load so the signal cannot be missed.
        QtCore.QObject.connect(self.webView, QtCore.SIGNAL("loadFinished(bool)"), self.save)
        self.webView.load(QtCore.QUrl(self.url))

def process(url):
    app = QtGui.QApplication(sys.argv)
    s = SimpleWebkit(url)
    s.process()
    sys.exit(app.exec_())

#############################################################################

if __name__ == "__main__":
    #url = 'http://simile.mit.edu/crowbar/test.html'
    if len(sys.argv) > 1:
        process(sys.argv[1])
    else:
        print >>sys.stderr, "{0}: error: specify a URL.".format(sys.argv[0])
        sys.exit(1)

You can also find this script in my jabbapylib library.

Usage:

./simple_webkit.py 'http://dl.dropbox.com/u/144888/hello_js.html'

That is, just specify the URL of the page to be fetched. The generated HTML is printed to standard output, so you can easily redirect it to a file.

Pros
As you can see, it’s very simple. It uses a WebKit instance to fetch and evaluate the page, which means that JavaScript (and AJAX) calls are executed. Also, the WebKit instance is not shown in a window (headless browsing).

Cons
This solution is not yet perfect. The biggest problem is that AJAX calls can take some time, and this script doesn’t wait for them: it prints the source as soon as loadFinished fires. In fact, there is no general way to know when all AJAX calls have finished, so we cannot know for sure when the page is completely loaded :( The best approach might be to build a waiting mechanism into the script, say “wait 5 seconds before printing the source”. Unfortunately, I didn’t manage to add this feature. It should be possible with QTimer somehow. If you can add this functionality to the script, please let me know.
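One possible shape for such a QTimer-based wait is sketched below. It is only a sketch under assumptions, not a verified fix: the names DelayedCallback, print_generated_html and wait_ms are made up for illustration, and a fixed delay still cannot guarantee that every AJAX call has finished. The idea is that QTimer.singleShot returns immediately and fires later from the Qt event loop, so WebKit keeps running during the wait. The scheduling function is passed in as a parameter so that the small class can be tried even without PyQt4 installed.

```python
import sys


class DelayedCallback(object):
    """Run `on_ready` a fixed time after the page reports loadFinished.

    `single_shot(msec, callback)` must return immediately and run `callback`
    later, the way QtCore.QTimer.singleShot does. Unlike time.sleep(), this
    keeps the Qt event loop (and therefore WebKit) running while we wait.
    """

    def __init__(self, single_shot, on_ready, wait_ms=5000):
        self.single_shot = single_shot
        self.on_ready = on_ready
        self.wait_ms = wait_ms  # hypothetical default: 5 seconds

    def load_finished(self, ok):
        # Slot for loadFinished(bool): schedule the callback, do not block.
        self.single_shot(self.wait_ms, self.on_ready)


def print_generated_html(url, wait_ms=5000):
    """Load `url`, wait `wait_ms` after loadFinished, then print the HTML."""
    # Imported here so DelayedCallback can be tried without PyQt4 installed.
    from PyQt4 import QtCore, QtGui, QtWebKit

    app = QtGui.QApplication(sys.argv)
    view = QtWebKit.QWebView()

    def dump_and_quit():
        # Like the original script, print whatever toHtml() returns.
        print(view.page().mainFrame().toHtml())
        app.quit()  # leave the event loop instead of calling sys.exit()

    waiter = DelayedCallback(QtCore.QTimer.singleShot, dump_and_quit, wait_ms)
    view.loadFinished.connect(waiter.load_finished)
    view.load(QtCore.QUrl(url))
    app.exec_()
```

Calling print_generated_html(sys.argv[1], wait_ms=10000) would then wait 10 seconds after loadFinished before printing. Because it leaves the event loop with app.quit() rather than sys.exit(), control returns to the caller afterwards.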

Challenge:
Try to download this page: CP002059.1. If you open it in Firefox, for instance, you’ll see a progress bar at the bottom. For me the complete download takes about 10 seconds. The script above will only fetch the beginning of the page :( Some help: the end of the downloaded sequence is this:

ORIGIN
//

If you can modify the script above to work correctly with this particular page, let me know.
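Instead of a fixed delay, one could poll the page and stop once the end marker shown above appears. The helper below is a minimal sketch of such a check. It assumes the page text is available as a plain string (for example from mainFrame().toPlainText()) and that “ORIGIN” and “//” each sit on their own line. Neither assumption has been checked against the real page, and the name record_looks_complete is made up for illustration.

```python
def record_looks_complete(text):
    """Return True if an 'ORIGIN' line is later followed by a '//' line."""
    seen_origin = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ORIGIN"):
            seen_origin = True
        elif seen_origin and stripped == "//":
            # Terminator found after ORIGIN: treat the record as complete.
            return True
    return False
```

A timer could call this check every second or so and print the source once it returns True, with an upper time limit as a fallback in case the marker never shows up.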

Another difficulty is how to integrate this downloader into a larger project. In the end, “app.exec_()” must be called, otherwise no output is produced. But once you call it, the script blocks in the Qt event loop and then exits via sys.exit(). My current workaround is to call this script as an external command and capture its standard output. If you have a better idea, let me know.
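The external-command workaround can be sketched with the standard subprocess module. This is a minimal sketch, not the code from jabbapylib: the function names are made up, the script path is assumed to be ./simple_webkit.py, the script is assumed to run under the same interpreter as the caller, and its output is assumed to be UTF-8.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run `command` (a list of arguments) and return its stdout as text."""
    # check_output raises CalledProcessError if the command exits non-zero,
    # e.g. when simple_webkit.py is called without a URL.
    output = subprocess.check_output(command)
    return output.decode("utf-8")  # assumption: the script prints UTF-8


def get_generated_html(url, script="./simple_webkit.py"):
    """Hypothetical wrapper: run the downloader script and return its output."""
    # Run the script with the same Python interpreter as the caller.
    return run_and_capture([sys.executable, script, url])
```

The larger project then only needs to call get_generated_html(url) and work with the returned string, while the Qt event loop lives and dies in the child process.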

Update (20110921)
I just found an even simpler solution here. And this one doesn’t exit(), so it can be integrated in another project easily (without the need for calling it as an external command). However, the “waiting problem” is still there.

What’s next
In the next part of this series (Part 3) we will see another way to download an AJAX page and address the problem of waiting X seconds for AJAX calls. Stay tuned.

Troubleshooting
If you get the following error message:

Gtk-WARNING **: Unable to locate theme engine in module_path: "pixmap",

Then install this package:

sudo apt-get install gtk2-engines-pixbuf

This tip is from here.

  1. September 21, 2011 at 06:41

    Have you tried using Selenium RC along with Scrapy? I find it useful because you can perform actions like click, mouseover, etc. on AJAX-loaded content and wait until the AJAX content loads. Pretty neat.

    • September 21, 2011 at 11:31

      Could you provide a solution for the challenge mentioned in the post using your own method? I have no time right now to dig into the Selenium documentation. Thanks. I tried Splinter, which is built on Selenium, and I will write about it in the next post. However, it opens a browser window and I’d prefer a headless solution.

      Edit: Part 3 with Splinter is ready.

  2. January 21, 2012 at 19:18

    Did you try http://docs.python.org/library/time.html#time.sleep ? You can simply put the thread to bed for 5 seconds.

    But this won’t solve your problem with recurrent ajax requests or with client side sockets. You should try to find a way to monitor if there are any active connections between your client and the server.

    • January 21, 2012 at 22:49

      @Tudor: Thanks. I tried sleep() but it also blocks the WebKit engine. The engine has to keep running so that it has enough time to evaluate the JavaScript calls. So I couldn’t solve the problem with sleep().
