Archive

Posts Tagged ‘PhantomJS’

Scraping AJAX web pages (Part 5.5)

July 13, 2016

Don’t forget to check out the rest of the series too!

This post is very similar to the previous one (Part 5), which scraped a webpage using PhantomJS from the command line and printed the output to stdout.

This time we use PhantomJS again, but we drive it from a Python script by wrapping Selenium around PhantomJS. The generated HTML source will be available in a variable. Here is the source:

#!/usr/bin/env python3
# encoding: utf-8

"""
required packages:
* selenium
optional packages:
* bs4
* lxml
"""

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
# from bs4 import BeautifulSoup

url = "http://simile.mit.edu/crowbar/test.html"

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
html = driver.page_source
print(html)
# soup = BeautifulSoup(driver.page_source, "lxml")  # page_source fetches the page after rendering is complete
# driver.save_screenshot('screen.png') # save a screenshot to disk

driver.quit()

The script sets the user agent (optional but recommended), and the rendered source is captured in the html variable. The last two lines are commented out, but they work: feed the source to BeautifulSoup and you can extract parts of the HTML; uncomment the save_screenshot line and a screenshot of the webpage is written to disk.
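As a minimal sketch of the BeautifulSoup step: the sample HTML string below stands in for driver.page_source, and the <h1> tag is only an assumption for illustration (the real test page may structure its content differently).

```python
# Sketch: extract part of the rendered source with BeautifulSoup.
# The hard-coded HTML stands in for driver.page_source.
from bs4 import BeautifulSoup

html = "<html><body><h1>Hi Crowbar!</h1></body></html>"

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h1")      # locate the element the JavaScript filled in
print(heading.get_text())      # prints: Hi Crowbar!
```

With the optional lxml package installed, you can pass "lxml" instead of "html.parser" for a faster parser.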


Scraping AJAX web pages (Part 5)

July 13, 2016

Don’t forget to check out the rest of the series too!

I’ve already written about PhantomJS, for instance here. Recall: PhantomJS is a headless WebKit scriptable with a JavaScript API.

The problem is still the same: we have a webpage that contains lots of JavaScript code, and we want to get the final HTML that is produced after the JavaScript has been executed.

Example: http://simile.mit.edu/crowbar/test.html. If you download it with “wget” for instance, you get the text “Hi lame crawler” in the source. However, JavaScript changes this text to “Hi Crowbar!” in the browser, and it is this generated source that we want. How?

This time we’ll use PhantomJS. We also need a JavaScript script that will instruct PhantomJS what to do. Let’s call it printSource.js:

var system = require('system');
var page   = require('webpage').create();
// system.args[0] is the filename, so system.args[1] is the first real argument
var url    = system.args[1];
// render the page, and run the callback function
page.open(url, function () {
  // page.content is the source
  console.log(page.content);
  // need to call phantom.exit() to prevent from hanging
  phantom.exit();
});

Note that this code comes from here.

If you want to set the user agent, use this modified script:

var system = require('system');
var page   = require('webpage').create();
page.settings.userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)';
// system.args[0] is the filename, so system.args[1] is the first real argument
var url    = system.args[1];
// render the page, and run the callback function
page.open(url, function () {
  // page.content is the source
  console.log(page.content);
  // need to call phantom.exit() to prevent from hanging
  phantom.exit();
});

Then launch the following command:

$ phantomjs printSource.js http://simile.mit.edu/crowbar/test.html

The output is printed to the standard output.
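Since the output goes to stdout, you could also capture it from Python with subprocess. A sketch only, assuming phantomjs is installed and printSource.js sits in the current directory (the command tuple is parameterized purely for illustration):

```python
# Sketch: run the PhantomJS script above and capture its stdout.
import subprocess

def get_rendered_source(url, command=("phantomjs", "printSource.js")):
    result = subprocess.run(
        [*command, url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# usage (requires PhantomJS):
# print(get_rendered_source("http://simile.mit.edu/crowbar/test.html"))
```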

If you want to do the same thing from a Python script, check out Part 5.5 of the series.


taking a screenshot of a webpage

July 11, 2015

Problem

You know the URL of a webpage and you want to take a screenshot of it. For instance, you want a thumbnail of the webpage.

Solution

It can be done very nicely with PhantomJS.

What is PhantomJS?
“PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.” (source)

How to install PhantomJS?
Follow the instructions here. Under Ubuntu I compiled it from source to get the latest version. Note that it takes a lot of time (about 30 minutes). Under Manjaro I could install it via yaourt and it took 1 minute (and got the newest version). The good news is that installation is not a problem.

How to take a screenshot?
If you download the source, you get a lot of example scripts. One of them is called rasterize.js, and this is exactly what we need.

$ phantomjs rasterize.js 
Usage: rasterize.js URL filename [paperwidth*paperheight|paperformat] [zoom]
  paper (pdf output) examples: "5in*7.5in", "10cm*20cm", "A4", "Letter"
  image (png/jpg output) examples: "1920px" entire page, window width 1920px
                                   "800px*600px" window, clipped to 800x600

Example #1:

phantomjs rasterize.js http://raphaeljs.com/polar-clock.html clock.png


Example #2:

phantomjs rasterize.js https://www.reddit.com/ red.png

It produced an image with dimensions 600×3304. It’s too narrow; let’s fix that.

Example #3:

phantomjs rasterize.js https://www.reddit.com/ red.png 1024px

Its dimensions are 1024×2432. Looks much better.

Example #4:
The previous image was too tall. Let’s capture only the part that would be visible on our screen. For this we need to clip a window.

phantomjs rasterize.js https://www.reddit.com/ red.png "1024px*768px"

Great. Now scale it down to get a thumbnail.

Scaling down an image to thumbnail size

$ phantomjs rasterize.js https://www.reddit.com/ screenshot.png "1024px*768px"
$ convert -resize 250 screenshot.png thumb.jpg

The command convert comes from the ImageMagick package. Here we resize the image to a width of 250px. Convert keeps the aspect ratio, i.e. it figures out the height by itself.
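The aspect-ratio arithmetic convert performs can be sketched in a couple of lines (the exact rounding ImageMagick applies may differ slightly):

```python
def scaled_height(width, height, new_width):
    # keep the aspect ratio: new_height / new_width == height / width
    return round(height * new_width / width)

# height of the 250px-wide thumbnail made from the 1024x768 screenshot
print(scaled_height(1024, 768, 250))
```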


Scraping AJAX web pages (Part 3)

November 8, 2011

Don’t forget to check out the rest of the series too!

In Part 2 we saw how to download an Ajax-powered webpage. However, there was a problem with that approach: sometimes it terminated too quickly and fetched only part of a page. The problem with Ajax is that we cannot tell for sure when a page has completely downloaded.

So the solution is to integrate a waiting mechanism into the script. That is, we need the following: “open a given page, wait X seconds, then get the HTML source”. Hopefully all Ajax calls will have finished within X seconds. It is up to you to decide how many seconds to wait. Alternatively, you can analyze the partially downloaded HTML and, if something is missing, wait some more.

Here I will use Splinter for this task. It opens a browser window that you can control from Python. Thanks to the browser, it can interpret JavaScript. The only disadvantage is that the browser window is visible.

Example
Let’s see how to fetch the page CP002059.1. If you open it in a browser, you’ll see a status bar at the bottom that indicates the download progress. For me it takes about 20 seconds to fully get this page. By analyzing the content of the page, we can notice that the string “ORIGIN” appears just once, at the end of the page. So we’ll check its presence in a loop and wait until it arrives.

#!/usr/bin/env python

from time import sleep
from splinter.browser import Browser

url = 'http://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'

def main():
    browser = Browser()
    browser.visit(url)

    # variation A:
    while 'ORIGIN' not in browser.html:
        sleep(5)

    # variation B:
    # sleep(30)   # if you think everything arrives in 30 seconds

    f = open("/tmp/source.html", "w")   # save the source in a file
    print >>f, browser.html
    f.close()

    browser.quit()
    print '__END__'

#############################################################################

if __name__ == "__main__":
    main()

You might be tempted to check for the presence of ‘</html>’. However, don’t forget that the browser first downloads the plain source, from ‘<html><body>…’ to ‘</body></html>’, and only then starts interpreting it; any Ajax calls it finds will expand content inside the body of the HTML. So ‘</html>’ is there right from the beginning.
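One caveat about variation A: it loops forever if ‘ORIGIN’ never shows up, e.g. if the page fails to load. A sketch of the same idea with an upper bound; get_html is a stand-in for lambda: browser.html and the function name is my own, not part of Splinter:

```python
from time import sleep, time

def wait_for_marker(get_html, marker, timeout=60, interval=5):
    # poll get_html() until `marker` appears, or give up after `timeout` seconds
    deadline = time() + timeout
    while marker not in get_html():
        if time() > deadline:
            raise RuntimeError("%r did not appear within %s seconds" % (marker, timeout))
        sleep(interval)
    return get_html()

# with Splinter this would be called as:
# html = wait_for_marker(lambda: browser.html, 'ORIGIN')
```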

Future work
This is not bad, but I’m still not fully satisfied. I’d like something like this but without any browser window. If you have a headless solution, let me know. I think it’s possible with PhantomJS and/or Zombie.js, but I haven’t had time to investigate them yet.

Zombie.js, PhantomJS

September 23, 2011

This is not a real post, just a reminder for me. I should look at these projects in detail in the future.

Zombie.js is a fast, headless, full-stack testing framework for Node.js: a lightweight framework for testing client-side JavaScript code in a simulated environment. No browser required. There is a Python driver for it called python-zombie.

PhantomJS is a headless WebKit with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG. PhantomJS is an optimal solution for headless testing of web-based applications, site scraping, page capture, SVG rendering, PDF conversion, and many other use cases.