Archive

Posts Tagged ‘scraping’

Scraping AJAX web pages (Part 5.5)

July 13, 2016

Don’t forget to check out the rest of the series too!

This post is very similar to the previous one (Part 5), which scraped a webpage with PhantomJS from the command line and sent the output to stdout.

This time we use PhantomJS again, but we drive it from a Python script through Selenium. The generated HTML source will be available in a variable. Here is the source:

#!/usr/bin/env python3
# encoding: utf-8

"""
required packages:
* selenium
optional packages:
* bs4
* lxml
"""

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
# from bs4 import BeautifulSoup

url = "http://simile.mit.edu/crowbar/test.html"

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
html = driver.page_source
print(html)
# soup = BeautifulSoup(driver.page_source, "lxml")  # page_source fetches the page after rendering is complete
# driver.save_screenshot('screen.png') # save a screenshot to disk

driver.quit()

The script sets the user agent (optional but recommended). The generated source ends up in the html variable. The last two lines are commented out, but they work: you could feed the source to BeautifulSoup and extract just the parts of the HTML you need, and if you uncomment the last line, you get a screenshot of the webpage. A BeautifulSoup sketch follows.
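For instance, a minimal extraction sketch building on the html variable above (the <a> tag is just an example; pick whatever element you need):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
first_link = soup.find("a")  # the first <a> tag, as an example
if first_link is not None:
    print(first_link.get_text(), first_link.get("href"))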

Scraping AJAX web pages (Part 5)

July 13, 2016

Don’t forget to check out the rest of the series too!

I’ve already written about PhantomJS, for instance here. Recall: PhantomJS is a headless WebKit, scriptable with a JavaScript API.

The problem is still the same: we have a webpage that contains lots of JavaScript code, and we want to get the final HTML that is produced after the JavaScript code has been executed.

Example: http://simile.mit.edu/crowbar/test.html. If you download it with “wget”, for instance, you get the text “Hi lame crawler” in the source. However, JavaScript changes this text to “Hi Crowbar!” in the browser, and it is this generated source that we want. How?
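To see the problem for yourself, first fetch the page with a plain HTTP client; a quick sketch, assuming the requests package is installed:

import requests

# no JavaScript engine here, so we get the raw, un-executed source
raw = requests.get("http://simile.mit.edu/crowbar/test.html").text
print("Hi lame crawler" in raw)  # True: the JS never ran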

This time we’ll use PhantomJS. We also need a JavaScript script that will instruct PhantomJS what to do. Let’s call it printSource.js:

var system = require('system');
var page   = require('webpage').create();
// system.args[0] is the filename, so system.args[1] is the first real argument
var url    = system.args[1];
// render the page, and run the callback function
page.open(url, function () {
  // page.content is the source
  console.log(page.content);
  // need to call phantom.exit() to prevent from hanging
  phantom.exit();
});

Note that this code comes from here.

If you want to set the user agent, use this modified script:

var system = require('system');
var page   = require('webpage').create();
page.settings.userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)';
// system.args[0] is the filename, so system.args[1] is the first real argument
var url    = system.args[1];
// render the page, and run the callback function
page.open(url, function () {
  // page.content is the source
  console.log(page.content);
  // need to call phantom.exit() to prevent from hanging
  phantom.exit();
});

Then launch the following command:

$ phantomjs printSource.js http://simile.mit.edu/crowbar/test.html

The output is printed to the standard output.
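If you just want to capture that output from Python, you don’t even need Selenium; a minimal sketch with subprocess, assuming phantomjs is on your PATH and printSource.js is in the current directory:

import subprocess

# run PhantomJS and capture whatever it prints to stdout
html = subprocess.check_output(
    ["phantomjs", "printSource.js", "http://simile.mit.edu/crowbar/test.html"]
)
print(html.decode("utf-8"))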

If you want to do the same thing from a Python script, check out Part 5.5 of the series.


[manjaro] ping needs special privileges

September 5, 2015

Problem

$ ping -c 1 www.google.com
ping: icmp open socket: Operation not permitted

Solution

Set the setuid bit on the ping binary so that it can open a raw ICMP socket even when run by an ordinary user:

$ sudo chmod u+s `which ping`
$ ping -c 1 www.google.com
PING www.google.com (173.194.65.104) 56(84) bytes of data.
64 bytes from ee-in-f104.1e100.net (173.194.65.104): icmp_seq=1 ttl=45 time=38.6 ms

--- www.google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 38.681/38.681/38.681/0.000 ms

Scraping AJAX web pages

December 27, 2012

Scraping AJAX web pages (Part 4)

December 27, 2012

Don’t forget to check out the rest of the series too!

I managed to solve a problem that bugged me for a long time. Namely: (1) I want to download the generated source of an AJAX-powered webpage; (2) I want a headless solution, i.e. no browser window; and (3) I want to wait until the AJAX content is fully loaded.

During the past 1.5 years I got quite close :) I could solve everything except issue #3. Now I’m proud to present a complete solution that satisfies all the criteria above.

#!/usr/bin/env python

import sys

from PySide.QtCore import *
from PySide.QtGui import *
from PySide.QtWebKit import QWebPage

SEC = 1000 # 1 sec. is 1000 msec.
USER_AGENT = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0'

class JabbaWebkit(QWebPage):
    # 'html' is a class variable
    def __init__(self, url, wait, app, parent=None):
        super(JabbaWebkit, self).__init__(parent)
        JabbaWebkit.html = ''

        if wait:
            QTimer.singleShot(wait * SEC, app.quit)
        else:
            self.loadFinished.connect(app.quit)

        self.mainFrame().load(QUrl(url))

    def save(self):
        JabbaWebkit.html = self.mainFrame().toHtml()

    def userAgentForUrl(self, url):
        return USER_AGENT

def get_page(url, wait=None):
    # the trick that allows calling this function several times in one process:
    app = QApplication.instance()  # check if a QApplication already exists
    if not app:  # create a QApplication if it doesn't exist yet
        app = QApplication(sys.argv)
    #
    form = JabbaWebkit(url, wait, app)
    app.aboutToQuit.connect(form.save)
    app.exec_()
    return JabbaWebkit.html

#############################################################################

if __name__ == "__main__":
    url = 'http://simile.mit.edu/crowbar/test.html'
    print get_page(url)

It’s also on GitHub. The GitHub version contains more documentation and more examples.
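A quick usage sketch (jabba_webkit is a hypothetical module name, assuming you saved the script above as jabba_webkit.py):

from jabba_webkit import get_page  # hypothetical module name for the script above

# wait 5 seconds for the AJAX content instead of quitting on loadFinished
html = get_page('http://simile.mit.edu/crowbar/test.html', wait=5)
print html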

[ reddit comments ]

Update (20121228)
Jabba-Webkit got included in Pycoder’s Weekly #46. Awesome.

Get IMDB ratings without any scraping

February 12, 2012

Update (20150712): If you know Python, check out the awesome IMDbPY library. It does the hard work for you; you just need to call some simple functions. Here are the docs.

Update (20130130): imdbapi.com seems to have moved to omdbapi.com. Links below are updated accordingly.

Problem
You want to get some data (e.g. the rating) of a movie from IMDB. How can you do it without any web scraping?

Solution #1
Someone made a simple API for this task, available at http://www.omdbapi.com/. You can search by ID or title.

The result is a JSON string that contains basic movie info, rating included.
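For illustration, a minimal query from Python with the requests package (the t and y parameters follow the omdbapi.com docs; note that current versions of the API also require an apikey parameter, and the imdbRating key name is an assumption about the response):

import requests

# search by title and year; add an apikey parameter if the API requires one
resp = requests.get("http://www.omdbapi.com/", params={"t": "True Grit", "y": "1969"})
data = resp.json()
print(data.get("Title"), data.get("imdbRating"))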

Solution #2
IMDB has a secret API too, made for mobile applications (available at http://app.imdb.com/). Here they say “For use only by clients authorized in writing by IMDb. Authors and users of unauthorized clients accept full legal exposure/liability for their actions.” So what comes below is strictly for educational purposes.

The result is a JSON string. Find more info about this API here.

Thanks reddit.

Update (20120222)
Python code for solution #1 is here.

Firebug: sitescraper’s best friend

November 23, 2011

When you do site scraping, you usually know exactly which part of a webpage you want to extract. The naive way is to download the page and comb through its source code, trying to identify the interesting part(s). But there is a better way: use Firebug.

Firebug is a Firefox add-on for web developers. You can edit, debug, and monitor CSS, HTML, and JavaScript live in any web page. The feature that interests us here is that you can point at any element of a webpage and Firebug shows you its exact location in the source. You can also get the CSS Path and/or the XPath of the given element.

First, install Firebug and restart the browser. In the top right corner of the browser you’ll see a little bug (part A on the figure below). Clicking it opens the Firebug console. On the console, click the second icon from the left at the top (part B). Then click on the element in the browser that you want to inspect (part C). The relevant HTML source code is highlighted in the console (part D). Right-click on it and choose CSS Path or XPath from the popup menu. Now you only have to write a script that extracts this part of the page, as sketched below.
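For example, a minimal extraction sketch with requests and lxml (the XPath is a hypothetical one you would copy out of Firebug):

import requests
from lxml import html

doc = html.fromstring(requests.get("http://simile.mit.edu/crowbar/test.html").content)
# paste the XPath you copied from Firebug; //h1 is just an illustration
for elem in doc.xpath("//h1"):
    print(elem.text_content())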
