Download login-protected pages with Python using Mechanize and Splinter (Part 3)

November 8, 2011 Jabba Laci

In Part 2 of this series we saw how to fetch a cookie-protected page. The example site (Project Euler) required you to log in before you could access the page we were interested in (statistics). We extracted the cookies from Firefox’s cookies.sqlite and used them to access the protected page. The Python script needed no username or password, but you had to log in with Firefox first so that the cookies existed.

The example site, Project Euler (PE), changed completely a few weeks ago, and the method presented in Part 2 no longer works with this particular site :( So I had to find an alternative solution, and this is how I got to Mechanize and Splinter.

Mechanize is a headless programmable browser: it doesn’t open any browser window, but you can drive it from code as if it were a real browser. It’s quite fast, but it doesn’t handle JavaScript.

Splinter is similar to Mechanize, but it opens a browser window. In theory it also supports a headless browser (zope.testbrowser), but I couldn’t make it work. Since the navigation is done in a real browser window, it’s slower than Mechanize, but it handles JavaScript! (Note that zope.testbrowser doesn’t handle JavaScript.)

Let’s see a concrete example: fetch Project Euler’s countries page. This page requires a login.

Example #1: Mechanize
The test site doesn’t use Ajax calls, so Mechanize is the better choice here.

#!/usr/bin/env python

import mechanize
from jabbapylib.filesystem import fs   # author's helper library, only used to read the credentials below

PE_LOGIN = 'http://projecteuler.net/login'
PE_COUNTRIES = 'http://projecteuler.net/countries'

USERNAME = fs.read_first_line('/home/jabba/secret/project_euler/username.txt')
PASSWORD = fs.read_first_line('/home/jabba/secret/project_euler/password.txt')

def main():
    browser = mechanize.Browser()
    browser.open(PE_LOGIN)

    browser.select_form(name="login_form")
    browser['username'] = USERNAME
    browser['password'] = PASSWORD

    res = browser.submit()
    #print res.get_data()

    res = browser.open(PE_COUNTRIES)
    print res.get_data()   # HTML source of the page

#############################################################################

if __name__ == "__main__":
    main()

This script requires your PE username and password. For security reasons, I don’t like hard-coding such data in scripts, so I store them in files on a TrueCrypt volume. To try this script, you can simply drop the “jabbapylib” import and assign your username and password directly to USERNAME and PASSWORD.
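If you would rather keep the credentials in files but don't have jabbapylib installed, a small stand-in is easy to write. The sketch below assumes that fs.read_first_line simply returns the first line of a file without its line ending; the real jabbapylib helper may behave differently, and the temporary file and the sample value "my_username" are made up for the demonstration.

```python
import os
import tempfile


def read_first_line(path):
    """Return the first line of the file at `path`, without the line ending.

    Hypothetical stand-in for jabbapylib's fs.read_first_line; the real
    helper may behave differently.
    """
    with open(path) as f:
        return f.readline().rstrip("\r\n")


if __name__ == "__main__":
    # Demonstrate with a throwaway file instead of a real credentials file.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("my_username\nignored second line\n")
    try:
        print(read_first_line(tmp.name))
    finally:
        os.remove(tmp.name)
```

With such a function defined in the script, the USERNAME and PASSWORD lines could call read_first_line(...) directly instead of fs.read_first_line(...).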

Example #2: Splinter
This example is here for the sake of completeness. Since PE doesn’t use Ajax, there is no real need to drive a real browser window, so Mechanize would be the better (faster) choice here.

#!/usr/bin/env python

from splinter.browser import Browser
from jabbapylib.filesystem import fs

PE_LOGIN = 'http://projecteuler.net/login'
PE_COUNTRIES = 'http://projecteuler.net/countries'

USERNAME = fs.read_first_line('/home/jabba/secret/project_euler/username.txt')
PASSWORD = fs.read_first_line('/home/jabba/secret/project_euler/password.txt')

def main():
    # browser = Browser('chrome')    # alternative: drive Chrome instead
    browser = Browser()    # opens a Firefox instance
    browser.visit(PE_LOGIN)
    
    browser.fill('username', USERNAME)
    browser.fill('password', PASSWORD)
    button = browser.find_by_name('login')
    button.click()

    browser.visit(PE_COUNTRIES)
       
    # the with-block closes the file even if writing fails
    with open("/tmp/stat.html", "w") as f:
        f.write(browser.html)    # HTML source of the page
    
    browser.quit()    # close the browser window

    print '__END__'

#############################################################################

if __name__ == "__main__":
    main()

Basically, it works just like Mechanize: you tell the browser which page to open, which fields to fill in, where to click, and so on. At the end we save the HTML source to a file.

Both Mechanize and Splinter handle the cookies after login, so we don’t have to bother with cookielib.
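To make that bookkeeping visible, here is a minimal sketch of what a browser object does with cookies on our behalf. It is not how Mechanize or Splinter are implemented, and it has nothing to do with Project Euler's real login: it uses only the Python 3 standard library (http.cookiejar is the Python 3 name of cookielib) and a toy local server. The names DemoSite, fetch_protected and run_demo, the demo/secret credentials and the session=abc123 cookie are all invented for the illustration.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoSite(BaseHTTPRequestHandler):
    """Toy site: POST /login sets a cookie, any GET requires that cookie."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        ok = form.get("username") == ["demo"] and form.get("password") == ["secret"]
        self.send_response(200 if ok else 403)
        if ok:
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "session=abc123" in self.headers.get("Cookie", "")
        body = b"members only" if logged_in else b"please log in"
        self.send_response(200 if logged_in else 403)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_protected(base_url, username, password):
    """Log in, then fetch the protected page through the same cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    data = urllib.parse.urlencode({"username": username, "password": password})
    opener.open(base_url + "/login", data.encode()).close()  # cookie is stored in jar
    with opener.open(base_url + "/secret") as res:  # jar sends the cookie back
        return res.read().decode()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoSite)  # port 0: pick any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_protected("http://127.0.0.1:%d" % server.server_port, "demo", "secret")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The cookie jar attached to the opener plays the role that the browser object plays in the two examples above: the login response stores the session cookie, and the next request sends it back without any extra code.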

Categories: python Tags: authentication, cookies, credentials, html source, mechanize, splinter

Comments (2)

Douglas Camata

February 14, 2012 at 14:03

If you want Splinter to not open a real browser, you can instantiate it with Browser('zope.testbrowser'). This way Splinter will use zope.testbrowser instead of Firefox and still have the same API.
- Jabba Laci
  
  February 14, 2012 at 15:52
  
  Thanks, I will give it another try. Last time I couldn’t get zope.testbrowser to work under Linux. Can it process JavaScript?
