Scraping AJAX web pages (Part 4)
December 27, 2012
Don’t forget to check out the rest of the series too!
I managed to solve a problem that had bugged me for a long time. Namely, (1) I want to download the generated source of an AJAX-powered webpage; (2) I want a headless solution, i.e. no browser window; and (3) I want to wait until the AJAX content is fully loaded.
During the past 1.5 years I got quite close :) I could solve everything except issue #3. Now I’m proud to present a complete solution that satisfies all the criteria above.
```python
#!/usr/bin/env python

import os
import sys

from PySide.QtCore import *
from PySide.QtGui import *
from PySide.QtWebKit import QWebPage

SEC = 1000  # 1 sec. is 1000 msec.
USER_AGENT = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0'


class JabbaWebkit(QWebPage):
    # 'html' is a class variable
    def __init__(self, url, wait, app, parent=None):
        super(JabbaWebkit, self).__init__(parent)
        JabbaWebkit.html = ''

        if wait:
            # quit after a fixed delay, giving the AJAX calls time to finish
            QTimer.singleShot(wait * SEC, app.quit)
        else:
            # quit as soon as the initial page load is finished
            self.loadFinished.connect(app.quit)

        self.mainFrame().load(QUrl(url))

    def save(self):
        JabbaWebkit.html = self.mainFrame().toHtml()

    def userAgentForUrl(self, url):
        return USER_AGENT


def get_page(url, wait=None):
    # here is the trick how to call it several times
    app = QApplication.instance()  # checks if QApplication already exists
    if not app:  # create QApplication if it doesn't exist
        app = QApplication(sys.argv)
    #
    form = JabbaWebkit(url, wait, app)
    app.aboutToQuit.connect(form.save)  # grab the HTML just before the event loop exits
    app.exec_()
    return JabbaWebkit.html

#############################################################################

if __name__ == "__main__":
    url = 'http://simile.mit.edu/crowbar/test.html'
    print(get_page(url))
```
It’s also on GitHub. The GitHub version contains more documentation and more examples.
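One practical question is how long the `wait` value should be, since the script simply quits after a fixed number of seconds. Here is a minimal sketch of one way to handle that, assuming you know some text (`marker`) that only shows up once the AJAX content has arrived. The helper `fetch_when_ready`, the `marker` argument, the list of wait times and the `fake_fetch` stand-in are all hypothetical and not part of Jabba-Webkit; in real use you would pass `get_page` as the `fetch` argument.

```python
def fetch_when_ready(fetch, url, marker, waits=(None, 2, 5, 10)):
    """Call fetch(url, wait) with increasing waits until marker is in the HTML.

    fetch  -- a callable with the same signature as get_page(url, wait)
    marker -- text expected only after the AJAX content has loaded (assumption)
    waits  -- wait times in seconds to try, in order; None means "no extra wait"

    Returns the HTML of the first attempt that contains marker,
    or None if no attempt did.
    """
    for wait in waits:
        html = fetch(url, wait)
        if marker in html:
            return html
    return None


def fake_fetch(url, wait=None):
    # Stand-in for get_page, for demonstration only: it pretends that the
    # AJAX content needs at least 3 seconds to appear.
    if wait is not None and wait >= 3:
        return '<html><div id="result">loaded</div></html>'
    return '<html><div id="result"></div></html>'


if __name__ == "__main__":
    html = fetch_when_ready(fake_fetch, 'http://example.invalid/', 'loaded')
    print(html)
```

Each attempt reloads the page from scratch, so this trades extra requests for not having to guess one wait time up front.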
Update (20121228)
Jabba-Webkit got included in Pycoder’s Weekly #46. Awesome.