Scraping AJAX web pages (Part 3)

November 8, 2011 Jabba Laci

Don’t forget to check out the rest of the series too!

In Part 2 we saw how to download an Ajax-powered webpage. However, that approach had a problem: it sometimes terminated too early and fetched only part of the page. The difficulty with Ajax is that we cannot tell for sure when a page has finished loading.

So the solution is to build a waiting mechanism into the script. That is, we need the following: “open a given page, wait X seconds, then get the HTML source”. Hopefully all Ajax calls will have finished within X seconds. You decide how many seconds to wait. Alternatively, you can analyze the partially downloaded HTML and, if something is missing, wait some more.
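The "check, and wait some more if something is missing" idea can be sketched as a small polling helper with an upper time limit. This is only a sketch for Python 3: the name `wait_for_html` and its parameters are made up for illustration, and `get_html` stands for any callable that returns the current page source.

```python
import time

def wait_for_html(get_html, is_complete, timeout=60.0, interval=1.0):
    """Poll get_html() until is_complete(html) is true or the timeout elapses.

    Returns (html, done): the last snapshot and whether the check passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if is_complete(html):
            return html, True
        if time.monotonic() >= deadline:
            # Give up, but hand back what we have so the caller can decide.
            return html, False
        time.sleep(interval)
```

With a browser object that exposes the source as `browser.html`, the call could look like `wait_for_html(lambda: browser.html, lambda h: 'ORIGIN' in h)`. Returning the last snapshot even on timeout lets you inspect what was missing.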

Here I will use Splinter for this task. It opens a browser window that you can control from Python. Since a real browser loads the page, the JavaScript on it gets executed. The only disadvantage is that the browser window is visible.

Example
Let’s see how to fetch the page CP002059.1. If you open it in a browser, you’ll see a status bar at the bottom that indicates the download progress. For me it takes about 20 seconds to fully load this page. Looking at the content of the page, we can see that the string “ORIGIN” appears just once, at the end of the page. So we’ll check for its presence in a loop and wait until it arrives.

#!/usr/bin/env python3

from time import sleep
from splinter import Browser

url = 'http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'

def main():
    browser = Browser()   # opens a visible browser window
    browser.visit(url)

    # variation A: poll until the marker string shows up in the page
    while 'ORIGIN' not in browser.html:
        sleep(5)

    # variation B:
    # sleep(30)   # if you think everything arrives in 30 seconds

    # save the source in a file
    with open("/tmp/source.html", "w", encoding="utf-8") as f:
        print(browser.html, file=f)

    browser.quit()
    print('__END__')

#############################################################################

if __name__ == "__main__":
    main()

You might be tempted to check for the presence of ‘</html>’. However, don’t forget that the browser first downloads the plain source, from ‘<html><body>…’ to ‘</body></html>’. Only then does it start to interpret the source, and if it finds Ajax calls, it executes them, and their results expand something in the body of the HTML. So ‘</html>’ is there right from the beginning.
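To make the difference concrete, here is a tiny illustration in Python 3. The two snapshots are invented placeholders, not the real NCBI source; they only mimic a page before and after its Ajax calls have filled in the body.

```python
# Hypothetical snapshots of one page (made up for illustration).
before_ajax = "<html><body><div id='content'></div></body></html>"
after_ajax = "<html><body><div id='content'>... ORIGIN ...</div></body></html>"

def looks_complete(html, marker):
    """Return True if the marker string occurs in the HTML snapshot."""
    return marker in html

# '</html>' is already present before any Ajax call has run,
# so it cannot tell the two snapshots apart.
print(looks_complete(before_ajax, "</html>"))   # True
print(looks_complete(after_ajax, "</html>"))    # True

# A string that only the Ajax-loaded content contains can.
print(looks_complete(before_ajax, "ORIGIN"))    # False
print(looks_complete(after_ajax, "ORIGIN"))     # True
```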

Future work
This is not bad, but I’m still not fully satisfied. I’d like something like this but without a browser window. If you have a headless solution, let me know. I think it’s possible with PhantomJS and/or Zombie.js, but I haven’t had time to investigate them yet.

Categories: html, javascript, python, web Tags: ajax, generated html source, PhantomJS, post-ajax, scraper, Zombie.js

Comments (2)

Don H.

September 25, 2012 at 02:15

I needed something that waits for all the JavaScript on the page to settle, so I came up with this: it hashes the entire page and returns when the hashes match repeatedly.

import md5
import time   # needed for time.sleep() below
from collections import deque

def funcIsPageLoadComplete_Hash(browserIn1):
    # browserIn1 is a splinter.driver.webdriver.firefox.WebDriver
    md5HtmlPage = md5.new(browserIn1.html);
    md5HtmlPageHex = md5HtmlPage.hexdigest();
    fifoMd5Hist = deque('0000');
    
    # Wait a maximum of 60 seconds for page to load
    for i in range(0,60):
        print '---------------------'
        bolWaitDone = True;
        # Check each digest against the previous ones.
        for hex in fifoMd5Hist:
            print 'comparing',hex,' to ', md5HtmlPageHex
            if hex != md5HtmlPageHex:
                # One of the prev digests doesn't match so we have to wait
                bolWaitDone = False;
        
        if bolWaitDone == False:
            md5HtmlPage = md5.new(browserIn1.html);
            md5HtmlPageHex = md5HtmlPage.hexdigest();
            # Remove one location from the left side of the deque
            fifoMd5Hist.popleft();
            # Add the current digest to the right side of the deque
            fifoMd5Hist.append(md5HtmlPageHex);
            time.sleep(1);
        else:
            # Page load has completed
            return True;
    return True;
# End funcIsPageLoadComplete_Hash

Jabba Laci

September 25, 2012 at 06:26

Thanks, it’s a nice idea to check the hash value of the HTML source repeatedly. (One remark: you don’t need “;” at the end of lines in Python.)
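For reference, the hash-comparison idea from the comment above could look roughly like this in Python 3. It is a sketch under a few assumptions, not a drop-in replacement: the name `wait_until_stable` and its parameters are invented here, SHA-256 from `hashlib` is swapped in for the old `md5` module, and `get_html` is any callable returning the current source (for example `lambda: browser.html`). Unlike the original, it reports a timeout as `False`.

```python
import hashlib
import time

def wait_until_stable(get_html, stable_checks=4, timeout=60.0, interval=1.0):
    """Poll until `stable_checks` consecutive snapshots have the same digest.

    Returns (html, settled); settled is False if the timeout was hit first.
    """
    deadline = time.monotonic() + timeout
    last_digest = None
    streak = 0   # number of consecutive identical snapshots seen so far
    while True:
        html = get_html()
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        streak = streak + 1 if digest == last_digest else 1
        last_digest = digest
        if streak >= stable_checks:
            return html, True
        if time.monotonic() >= deadline:
            return html, False
        time.sleep(interval)
```

Keep in mind that a page can pause between two Ajax calls, so a stable hash is a heuristic; raising `stable_checks` or `interval` makes a false "settled" less likely at the cost of a longer wait.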
