
Scraping AJAX web pages (Part 4)

Don’t forget to check out the rest of the series too!

I managed to solve a problem that bugged me for a long time. Namely, (1) I want to download the generated source of an AJAX-powered webpage; (2) I want a headless solution, i.e. no browser window; and (3) I want to wait until the AJAX content is fully loaded.

During the past 1.5 years I got quite close :) I could solve everything except issue #3. Now I’m proud to present a complete solution that satisfies all the criteria above.

#!/usr/bin/env python

import os
import sys

from PySide.QtCore import *
from PySide.QtGui import *
from PySide.QtWebKit import QWebPage

SEC = 1000 # 1 sec. is 1000 msec.
USER_AGENT = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0'

class JabbaWebkit(QWebPage):
    # 'html' is a class variable
    def __init__(self, url, wait, app, parent=None):
        super(JabbaWebkit, self).__init__(parent)
        JabbaWebkit.html = ''

        if wait:
            # quit after a fixed number of seconds, even if loadFinished
            # has not fired yet (or fired before the AJAX content arrived)
            QTimer.singleShot(wait * SEC, app.quit)
        else:
            # quit as soon as the page reports that it has finished loading
            self.loadFinished.connect(app.quit)

        self.mainFrame().load(QUrl(url))

    def save(self):
        # called just before quitting; store the generated HTML
        JabbaWebkit.html = self.mainFrame().toHtml()

    def userAgentForUrl(self, url):
        return USER_AGENT

def get_page(url, wait=None):
    # this is the trick that allows calling it several times
    app = QApplication.instance()  # check if a QApplication already exists
    if not app:  # create a QApplication if it doesn't exist
        app = QApplication(sys.argv)
    #
    form = JabbaWebkit(url, wait, app)
    app.aboutToQuit.connect(form.save)
    app.exec_()
    return JabbaWebkit.html

#############################################################################

if __name__ == "__main__":
    url = 'http://simile.mit.edu/crowbar/test.html'
    print get_page(url)
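
Since get_page() reuses an already-existing QApplication instance, it can also be called several times from the same script. A minimal sketch of such a usage (the 10-second wait value is just an example):

import jabba_webkit as jw

# fetch the generated source of two AJAX pages, one after the other;
# the optional second argument makes get_page() return after that many seconds
html_1 = jw.get_page('http://simile.mit.edu/crowbar/test.html')
html_2 = jw.get_page('http://simile.mit.edu/crowbar/test.html', 10)
print len(html_1), len(html_2)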

It’s also on GitHub. The GitHub version contains more documentation and more examples.

[ reddit comments ]

Update (20121228)
Jabba-Webkit got included in Pycoder’s Weekly #46. Awesome.

  1. Rodolfo
    January 16, 2013 at 17:51

    Hi

    I found your site when I was looking for a web scraper.
    I succeeded in implementing the scraper for a single URL, but when I try to use more than one, it freezes and no HTML data is read.

    Look at my code:

    #!/usr/bin/env python
    import jabba_webkit as jw
      
    url_sky1 = "http://www.skyscanner.com/flights/saoa/ams/130628/130727/?flt=1&language=EN&ccy=USD"
    url_sky2 = "http://www.skyscanner.com/flights/saoa/cdg/130628/130727/?flt=1&language=EN&ccy=USD"
    
    sky1_html = ""
    sky2_html = ""
    
    sky1_html = (jw.get_page(url_sky1)).encode('ascii', 'ignore')
    sky2_html = (jw.get_page(url_sky2)).encode('ascii', 'ignore')
    
    print "-------------------- SKY AMS"
    print sky1_html
    print "-------------------- SKY CDG"
    print sky2_html
    

    It only works if I comment out sky1_html or sky2_html. It does not work for the two URLs at the same time.

    Could you please help me?

    Thanks
    Rodolfo.

    • January 16, 2013 at 18:10

      Strange. The 1st HTML is read, but it hangs on the 2nd. With a timeout, I managed to get both sources. Try it like this: jw.get_page(url, 10), for instance. Then after 10 seconds it will stop.
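
      With the timeout, the two calls in your snippet would look roughly like this (10 seconds is just an example value):

      sky1_html = (jw.get_page(url_sky1, 10)).encode('ascii', 'ignore')
      sky2_html = (jw.get_page(url_sky2, 10)).encode('ascii', 'ignore')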

      • Rodolfo
        January 17, 2013 at 19:14

        Hi:

        Thanks for the help.
        Using the timeout it doesn't hang anymore, but the second call does not execute the scripts in the webpage.
        I saved the 2 HTMLs to files and analyzed them. In the first I have all the information, but in the second I have none.
        I added the following code to save the HTML:

        f = open('sky1.html', 'w+')
        f.write(sky1_html)
        f.close()
        f = open('sky2.html', 'w+')
        f.write(sky2_html)
        f.close()
        

        Could you please take a look?

        Thanks
        Rodolfo

  2. Mariano
    March 20, 2013 at 21:21

    Hi Jabba Laci, when I read the 3 issues you wrote, I saw they are exactly what I need!
    I work in PHP, so I'm not good with Python (although I can more or less understand the code).
    Is it possible to "make" an executable script that can be run from the Linux console?
    I mean, it would be great if I could exec this: "jabbaws 5 http://someurl.com" and it retrieves the entire post-AJAX source code, where 5 is the wait value (in seconds) in your code.
    If that's possible, it could be really useful, because I could run a PHP script on Linux that calls that script like this:
    $source = exec( "jabbaws 5 http://someurl.com" );

    Sorry, my English isn't good, but I think I was clear.

    Hoping for your answer, thanks!

    PS: I googled a lot, and there's no PHP version of a scraper like this one.
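
    A minimal wrapper along those lines (the jabbaws name is just the hypothetical one suggested above; it assumes jabba_webkit is importable) could look like this:

    #!/usr/bin/env python

    # hypothetical "jabbaws" command-line wrapper
    # usage: jabbaws <wait_in_seconds> <url>

    import sys
    import jabba_webkit as jw

    def main():
        if len(sys.argv) != 3:
            print "usage: jabbaws <wait_in_seconds> <url>"
            sys.exit(1)
        wait = int(sys.argv[1])
        url = sys.argv[2]
        # print the post-AJAX source to stdout so it can be captured with exec()
        print jw.get_page(url, wait).encode('utf-8')

    if __name__ == "__main__":
        main()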

  3. Deepak
    July 23, 2013 at 16:37

    Hi There!

    Is it simple to extend this? What I want is *very close* to what you are offering here.
    I want to fill in a “User ID” and “Password” field. Then I want the page that is loaded after the submit to be saved as HTML.

    And this needs to be automated (i.e. auto-fill the form) and headless.

    Any suggestions?
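
    One way to extend it in that direction is to keep the same QWebPage, fill in the fields with evaluateJavaScript() after the login page has loaded, submit the form, and save the HTML after the second loadFinished. A rough, untested sketch (the URL and the field ids are made up):

    #!/usr/bin/env python

    import sys

    from PySide.QtCore import QUrl
    from PySide.QtGui import QApplication
    from PySide.QtWebKit import QWebPage

    LOGIN_URL = 'http://example.com/login'  # placeholder

    class LoginScraper(QWebPage):
        def __init__(self, app):
            super(LoginScraper, self).__init__()
            self.app = app
            self.step = 0
            self.html = ''
            self.loadFinished.connect(self.on_load)
            self.mainFrame().load(QUrl(LOGIN_URL))

        def on_load(self, ok):
            if self.step == 0:
                # first load: fill in the form fields and submit the form
                self.mainFrame().evaluateJavaScript(
                    "document.getElementById('user').value = 'my_user_id';"
                    "document.getElementById('passwd').value = 'my_password';"
                    "document.forms[0].submit();")
                self.step = 1
            else:
                # second load: the page after the submit is ready
                self.html = self.mainFrame().toHtml()
                self.app.quit()

    def main():
        app = QApplication.instance() or QApplication(sys.argv)
        scraper = LoginScraper(app)
        app.exec_()
        print scraper.html

    if __name__ == "__main__":
        main()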

  4. eon01
    October 28, 2013 at 18:52

    Hi, thank you for publishing this script. I tried to use it with YouTube.

    I entered:
    ./jabba_webkit.py https://www.youtube.com | grep -c "/watch"

    and I got 0 as a result, so I don't think it scrapes everything, or there is a problem somewhere. This is the output:

    QFont::setPixelSize: Pixel size <= 0 (0)
    QFont::setPixelSize: Pixel size <= 0 (0)
    java version "1.7.0_25"
    OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.10.2)
    OpenJDK Server VM (build 23.7-b01, mixed mode)
    0

    • October 28, 2013 at 19:13

      Hi,

      When I launched it from the command line, it also behaved strangely for me. It worked well when I wrote it :) But launching it from Python produced a good result. Try this script:

      #!/usr/bin/env python
      
      import jabba_webkit as jw
      
      URL = "https://www.youtube.com/"
      
      def main():
          html = jw.get_page(URL)
          for line in html.splitlines():
              if "/watch" in line:
                  print line
      
      ##########
      
      if __name__ == "__main__":
          main()
      
  5. September 26, 2014 at 08:19

    Hi.

    Nice hack, thank you, that was really useful for us.
    But it seems that under Linux, I need an X11 interface.

    Here is a very simple version using PhantomJS.

    cat wgetJavaScript.js :

    var page = require('webpage').create();
    page.open('http://simile.mit.edu/crowbar/test.html', function () {
        console.log(page.content);
        phantom.exit();
    });
    

    and then call it:

    phantomjs wgetJavaScript.js
    