Scraping AJAX web pages (Part 1.5)

Don’t forget to check out the rest of the series too!

Before attacking Part 2, I think it would be useful to investigate what the generated source of a page looks like.

Consider the following source:

<script>document.write("Hello World!");</script>

If you open it, you’ll see the text “Hello World!”. It’s not a big surprise :) But what is the generated source? How is the original html above interpreted by the browser?

Option A:

Hello World!

Option B:

<script>document.write("Hello World!");</script>Hello World!

Well, the correct answer is B. The script element stays in the document, and the text written by document.write is inserted right after it. If you install the Web Developer add-on for Firefox, you’ll be able to see both sources: the original one (the HTML downloaded from the web server) and the generated one (the HTML produced by the browser after interpreting the original source).

If you don’t want to install Web Developer, there is another option. In Firefox, you can save a page in two different ways. If you save it as “Web Page, complete”, you’ll get the generated source. If you choose “Web Page, HTML only”, you’ll get the original source. However, if you save the “Hello World!” example as “Web Page, complete” and open it from your local machine, you’ll see the text “Hello World!” twice! The saved file contains both the script and the text it already wrote, so when you open the generated source, the embedded JavaScript code is executed again and writes the text a second time.
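The double “Hello World!” can be reproduced with a toy model. The sketch below is not a JavaScript engine: the `render` function and the `WRITE_CALL` pattern are hypothetical helpers that only recognise the literal document.write call from the example, which is enough to show how the generated source is built and what happens when it is interpreted a second time.

```python
import re

# Toy model only: it recognises a literal document.write("...") call inside a
# <script> element. A real browser runs a full JavaScript engine instead.
WRITE_CALL = re.compile(r'<script>document\.write\("([^"]*)"\);</script>')


def render(source: str) -> str:
    """Return the generated source: each script is kept, followed by its output."""
    return WRITE_CALL.sub(lambda m: m.group(0) + m.group(1), source)


original = '<script>document.write("Hello World!");</script>'

generated = render(original)   # roughly what "Web Page, complete" saves
reopened = render(generated)   # the saved file gets interpreted again

print(generated)  # the script followed by one "Hello World!" (Option B)
print(reopened)   # the script followed by "Hello World!" twice
```

The first call gives Option B, and the second call shows why the saved file displays the text twice.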

So, if you scrape AJAX pages, don’t be surprised if the resulting HTML source is still full of JavaScript code. A scraper that simply downloads the page only gets the original source. But if you use a tool that actually executes JavaScript, then the interpreted result will be in the source too. In the next part we will see how to download web pages with Python and WebKit.
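To make the difference concrete, here is a minimal sketch of what a scraper that does not understand JavaScript can read from each source. It uses only Python’s standard html.parser module. The `VisibleText` class and the `visible_text` function are names made up for this illustration, and the generated source is typed in by hand rather than produced by a browser.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <script> elements, as a JavaScript-unaware scraper would."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies are code, not page text, so they are skipped.
        if not self.in_script:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


original = '<script>document.write("Hello World!");</script>'
generated = original + "Hello World!"  # Option B, written out by hand

print(repr(visible_text(original)))   # ''
print(repr(visible_text(generated)))  # 'Hello World!'
```

In the original source the text exists only as a string inside the script, so the scraper finds nothing to extract. It shows up as page content only in the generated source.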

