
Scraping AJAX web pages (Part 4)

Don’t forget to check out the rest of the series too!

I managed to solve a problem that bugged me for a long time. Namely, (1) I want to download the generated source of an AJAX-powered webpage; (2) I want a headless solution, i.e. no browser window; and (3) I want to wait until the AJAX content is fully loaded.

During the past 1.5 years I got quite close :) I could solve everything except issue #3. Now I’m proud to present a complete solution that satisfies all the criteria above.

#!/usr/bin/env python

import os
import sys

from PySide.QtCore import *
from PySide.QtGui import *
from PySide.QtWebKit import QWebPage

SEC = 1000 # 1 sec. is 1000 msec.
USER_AGENT = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0'

class JabbaWebkit(QWebPage):
    # 'html' is a class variable
    def __init__(self, url, wait, app, parent=None):
        super(JabbaWebkit, self).__init__(parent)
        JabbaWebkit.html = ''

        if wait:
            # quit after a fixed number of seconds, even if loadFinished
            # has not fired yet (or fired before the AJAX content arrived)
            QTimer.singleShot(wait * SEC, app.quit)
        else:
            # quit as soon as the page reports that it has finished loading
            self.loadFinished.connect(app.quit)

        self.mainFrame().load(QUrl(url))

    def save(self):
        # called just before quitting; store the generated HTML
        JabbaWebkit.html = self.mainFrame().toHtml()

    def userAgentForUrl(self, url):
        return USER_AGENT

def get_page(url, wait=None):
    # this is the trick that allows calling it several times
    app = QApplication.instance()  # check if a QApplication already exists
    if not app:  # create a QApplication if it doesn't exist
        app = QApplication(sys.argv)
    #
    form = JabbaWebkit(url, wait, app)
    app.aboutToQuit.connect(form.save)
    app.exec_()
    return JabbaWebkit.html

#############################################################################

if __name__ == "__main__":
    url = 'http://simile.mit.edu/crowbar/test.html'
    print get_page(url)
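
Since get_page() reuses an already-existing QApplication instance, it can also be called several times from the same script. A minimal sketch of such a usage (the 10-second wait value is just an example):

import jabba_webkit as jw

# fetch the generated source of two AJAX pages, one after the other;
# the optional second argument makes get_page() return after that many seconds
html_1 = jw.get_page('http://simile.mit.edu/crowbar/test.html')
html_2 = jw.get_page('http://simile.mit.edu/crowbar/test.html', 10)
print len(html_1), len(html_2)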

It’s also on GitHub. The GitHub version contains more documentation and more examples.

[ reddit comments ]

Update (20121228)
Jabba-Webkit got included in Pycoder’s Weekly #46. Awesome.

  1. Rodolfo
    January 16, 2013 at 17:51

    Hi

    I found your site when I was looking for a web scraper.
    I succeeded in implementing the scraper for a single URL, but when I try to use more than one, it freezes and no HTML data is read.

    Look at my code:

    #!/usr/bin/env python
    import jabba_webkit as jw
      
    url_sky1 = "http://www.skyscanner.com/flights/saoa/ams/130628/130727/?flt=1&language=EN&ccy=USD"
    url_sky2 = "http://www.skyscanner.com/flights/saoa/cdg/130628/130727/?flt=1&language=EN&ccy=USD"
    
    sky1_html = ""
    sky2_html = ""
    
    sky1_html = (jw.get_page(url_sky1)).encode('ascii', 'ignore')
    sky2_html = (jw.get_page(url_sky2)).encode('ascii', 'ignore')
    
    print "-------------------- SKY AMS"
    print sky1_html
    print "-------------------- SKY CDG"
    print sky2_html
    

    It only works if I comment out sky1_html or sky2_html. It does not work for the two URLs at the same time.

    Could you please help me?

    Thanks
    Rodolfo.

    • January 16, 2013 at 18:10

      Strange. The 1st HTML is read, but it hangs on the 2nd. With a timeout, I managed to get both sources. Try it like this: jw.get_page(url, 10), for instance. Then after 10 seconds it will stop.
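
      With the timeout, the two calls in your snippet would look roughly like this (10 seconds is just an example value):

      sky1_html = (jw.get_page(url_sky1, 10)).encode('ascii', 'ignore')
      sky2_html = (jw.get_page(url_sky2, 10)).encode('ascii', 'ignore')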

      • Rodolfo
        January 17, 2013 at 19:14

        Hi:

        Thanks for the help.
        Using the timeout it doesn't hang anymore, but the second call does not execute the scripts in the webpage.
        I saved the 2 HTMLs to files and analyzed them. In the first I have all the information, but in the second I have none.
        I added the following code to save the HTML:

        f = open('sky1.html', 'w+')
        f.write(sky1_html)
        f.close()
        f = open('sky2.html', 'w+')
        f.write(sky2_html)
        f.close()
        

        Could you please take a look?

        Thanks
        Rodolfo

  2. Mariano
    March 20, 2013 at 21:21

    Hi Jabba Laci, when I read the 3 issues you wrote, I saw they are exactly what I need!
    I work in PHP, so I'm not good with Python (although I can more or less understand the code).
    Is it possible to "make" an executable script that can be run from the Linux console?
    I mean, it would be great if I could exec this: "jabbaws 5 http://someurl.com" and it retrieves the entire post-AJAX source code, where 5 is the wait value (in seconds) in your code.
    If that's possible, it could be really useful, because I could run a PHP script on Linux that calls that script like this:
    $source = exec( "jabbaws 5 http://someurl.com" );

    Sorry, my English isn't good, but I think I was clear.

    Hoping for your answer, thanks!

    PS: I googled a lot, and there's no PHP version of a scraper like this one.
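
    A minimal wrapper along those lines (the jabbaws name is just the hypothetical one suggested above; it assumes jabba_webkit is importable) could look like this:

    #!/usr/bin/env python

    # hypothetical "jabbaws" command-line wrapper
    # usage: jabbaws <wait_in_seconds> <url>

    import sys
    import jabba_webkit as jw

    def main():
        if len(sys.argv) != 3:
            print "usage: jabbaws <wait_in_seconds> <url>"
            sys.exit(1)
        wait = int(sys.argv[1])
        url = sys.argv[2]
        # print the post-AJAX source to stdout so it can be captured with exec()
        print jw.get_page(url, wait).encode('utf-8')

    if __name__ == "__main__":
        main()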

  3. Deepak
    July 23, 2013 at 16:37

    Hi There!

    Is it simple to extend this? What I want is *very close* to what you are offering here.
    I want to fill in a “User ID” and “Password” field. Then I want the page that is loaded after the submit to be saved as HTML.

    And this needs to be automated (i.e. auto-fill the form) and headless.

    Any suggestions?
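
    One way to extend it in that direction is to keep the same QWebPage, fill in the fields with evaluateJavaScript() after the login page has loaded, submit the form, and save the HTML after the second loadFinished. A rough, untested sketch (the URL and the field ids are made up):

    #!/usr/bin/env python

    import sys

    from PySide.QtCore import QUrl
    from PySide.QtGui import QApplication
    from PySide.QtWebKit import QWebPage

    LOGIN_URL = 'http://example.com/login'  # placeholder

    class LoginScraper(QWebPage):
        def __init__(self, app):
            super(LoginScraper, self).__init__()
            self.app = app
            self.step = 0
            self.html = ''
            self.loadFinished.connect(self.on_load)
            self.mainFrame().load(QUrl(LOGIN_URL))

        def on_load(self, ok):
            if self.step == 0:
                # first load: fill in the form fields and submit the form
                self.mainFrame().evaluateJavaScript(
                    "document.getElementById('user').value = 'my_user_id';"
                    "document.getElementById('passwd').value = 'my_password';"
                    "document.forms[0].submit();")
                self.step = 1
            else:
                # second load: the page after the submit is ready
                self.html = self.mainFrame().toHtml()
                self.app.quit()

    def main():
        app = QApplication.instance() or QApplication(sys.argv)
        scraper = LoginScraper(app)
        app.exec_()
        print scraper.html

    if __name__ == "__main__":
        main()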

  4. eon01
    October 28, 2013 at 18:52

    Hi, thank you for publishing this script. I tried to use it with YouTube.

    I entered:
    ./jabba_webkit.py https://www.youtube.com | grep -c "/watch"

    and I got 0 as a result, so I don't think it scrapes everything, or there is a problem somewhere. This is the output:

    QFont::setPixelSize: Pixel size <= 0 (0)
    QFont::setPixelSize: Pixel size <= 0 (0)
    java version "1.7.0_25"
    OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.10.2)
    OpenJDK Server VM (build 23.7-b01, mixed mode)
    0

    • October 28, 2013 at 19:13

      Hi,

      When I launched it from the command line, it also behaved strangely for me. It worked well when I wrote it :) But launching it from Python produced a good result. Try this script:

      #!/usr/bin/env python
      
      import jabba_webkit as jw
      
      URL = "https://www.youtube.com/"
      
      def main():
          html = jw.get_page(URL)
          for line in html.splitlines():
              if "/watch" in line:
                  print line
      
      ##########
      
      if __name__ == "__main__":
          main()
      
  5. September 26, 2014 at 08:19

    Hi.

    Nice hack, thank you, that was really useful for us.
    But it seems that under Linux, I need an X11 interface.

    Here is a very simple version using PhantomJS.

    cat wgetJavaScript.js :

    var page = require('webpage').create();
    page.open('http://simile.mit.edu/crowbar/test.html', function () {
        console.log(page.content);
        phantom.exit();
    });
    

    and then call it:

    phantomjs wgetJavaScript.js
    