Scraping AJAX web pages (Part 5)

Don’t forget to check out the rest of the series too!

I’ve already written about PhantomJS, for instance here. Recall: PhantomJS is a headless WebKit browser that is scriptable through a JavaScript API.

The problem is still the same: we have a webpage that contains lots of JavaScript code, and we want to get the final HTML that is produced after that JavaScript has been executed.

Example: http://simile.mit.edu/crowbar/test.html. If you download it with “wget”, for instance, you get the text “Hi lame crawler” in the source. In a browser, however, a piece of JavaScript changes this text to “Hi Crowbar!”, and it is this generated source that we want to get. How?

This time we’ll use PhantomJS. We also need a JavaScript file that tells PhantomJS what to do. Let’s call it printSource.js:

var system = require('system');
var page   = require('webpage').create();
// system.args[0] is the script's filename, so system.args[1] is the first real argument
var url    = system.args[1];
// load the page, then run the callback function
page.open(url, function () {
  // page.content is the current source of the page (after its JavaScript has run)
  console.log(page.content);
  // call phantom.exit() here, otherwise PhantomJS keeps running and the command hangs
  phantom.exit();
});

Note that this code comes from here.
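The script above prints the page as soon as the load callback fires, and it does not notice when the page fails to load. Here is a minimal sketch of a more defensive variant. The filename printSourceChecked.js, the helper parseArgs, the optional delay argument and the 500 ms default are all made up for this illustration. PhantomJS passes a status string ('success' or 'fail') to the page.open callback. The fixed delay is only a crude way to give slow AJAX calls some time to finish, not a guarantee that they have.

```javascript
// printSourceChecked.js (hypothetical name): like printSource.js, but it
// reports load failures and waits a little before printing the source.

// Helper (made up for this sketch): extract the URL and an optional delay.
// args[0] is the script name, args[1] the URL, args[2] an optional delay in ms.
function parseArgs(args) {
  if (!args || args.length < 2) {
    return null; // no URL given
  }
  var delay = parseInt(args[2], 10);
  return {
    url: args[1],
    // 500 ms is an arbitrary default, tune it for the page you scrape
    delayMs: (isNaN(delay) || delay < 0) ? 500 : delay
  };
}

// Run the scraping part only when the file is executed by PhantomJS.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page   = require('webpage').create();
  var opts   = parseArgs(system.args);

  if (opts === null) {
    console.error('Usage: phantomjs printSourceChecked.js URL [delayMs]');
    phantom.exit(1);
  } else {
    page.open(opts.url, function (status) {
      if (status !== 'success') {
        console.error('Failed to load ' + opts.url);
        phantom.exit(1);
        return;
      }
      // give the page's own JavaScript some time before reading the source
      window.setTimeout(function () {
        console.log(page.content);
        phantom.exit();
      }, opts.delayMs);
    });
  }
}
```

It is called like the original script, with an optional delay in milliseconds after the URL, for example "phantomjs printSourceChecked.js http://simile.mit.edu/crowbar/test.html 1000".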

If you want to set the user agent, use this modified script:

var system = require('system');
var page   = require('webpage').create();
// set the user agent before opening the page
page.settings.userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)';
// system.args[0] is the script's filename, so system.args[1] is the first real argument
var url    = system.args[1];
// load the page, then run the callback function
page.open(url, function () {
  // page.content is the current source of the page (after its JavaScript has run)
  console.log(page.content);
  // call phantom.exit() here, otherwise PhantomJS keeps running and the command hangs
  phantom.exit();
});
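If you don’t want to hard-code the user agent, it can be read from the command line instead. The following is only a sketch: the filename printSourceUA.js and the helper pickUserAgent are invented for this example. It assumes the user agent, if any, is passed as the second argument, and it leaves PhantomJS’s default user agent alone when nothing is given.

```javascript
// printSourceUA.js (hypothetical name)
// usage: phantomjs printSourceUA.js URL ["User-Agent string"]

// Helper (made up for this sketch): return the user agent passed as the
// second real argument, or null if it is missing or blank.
function pickUserAgent(args) {
  if (!args || args.length < 3) {
    return null;
  }
  // trim by hand, written in plain ES5 style
  var candidate = String(args[2]).replace(/^\s+|\s+$/g, '');
  return candidate === '' ? null : candidate;
}

// Run the scraping part only when the file is executed by PhantomJS.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page   = require('webpage').create();
  var ua     = pickUserAgent(system.args);

  // the setting must be in place before page.open() sends the request
  if (ua !== null) {
    page.settings.userAgent = ua;
  }

  page.open(system.args[1], function () {
    console.log(page.content);
    phantom.exit();
  });
}
```

Remember to quote the user agent string in the shell, since it usually contains spaces.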

Then launch the following command:

$ phantomjs printSource.js http://simile.mit.edu/crowbar/test.html

The output is printed to the standard output.

If you want to do the same thing from a Python script, check out Part 5.5 of the series.
