
Download login-protected pages with Python using Mechanize and Splinter (Part 3)

November 8, 2011

In Part 2 of this series we saw how to fetch a cookie-protected page. The site used in the example (Project Euler) required you to log in to access the page we were interested in (the statistics page). We extracted the cookies from Firefox’s cookies.sqlite, and with those we could access the protected page. No username/password had to be specified in the Python script, but you did have to log in with Firefox first to obtain the cookies.

The example site, Project Euler (PE), was completely redesigned a few weeks ago, and the method presented in Part 2 no longer works with this particular site :( So I had to find an alternative solution, and that is how I arrived at Mechanize and Splinter.

Mechanize is a headless programmable browser, which means it doesn’t open any browser window, yet you can navigate with it as if it were a real browser. It’s quite fast, but it doesn’t handle JavaScript.

Splinter is similar to Mechanize, but it opens a browser window. (In theory it also supports a headless backend, zope.testbrowser, but I couldn’t make it work.) Since the navigation happens in a real browser window, it’s slower than Mechanize, but it handles JavaScript! (Note that zope.testbrowser doesn’t handle JavaScript.)

Let’s see a concrete example: fetch Project Euler’s countries page. This page requires a login.

Example #1: Mechanize
The test site doesn’t use Ajax calls, so Mechanize is the better choice here.

#!/usr/bin/env python

import mechanize
from jabbapylib.filesystem import fs   # you can ignore it

PE_LOGIN = 'http://projecteuler.net/login'
PE_COUNTRIES = 'http://projecteuler.net/countries'

USERNAME = fs.read_first_line('/home/jabba/secret/project_euler/username.txt')
PASSWORD = fs.read_first_line('/home/jabba/secret/project_euler/password.txt')

def main():
    browser = mechanize.Browser()
    browser.open(PE_LOGIN)

    browser.select_form(name="login_form")
    browser['username'] = USERNAME
    browser['password'] = PASSWORD

    res = browser.submit()
    #print res.get_data()

    res = browser.open(PE_COUNTRIES)
    print res.get_data()   # HTML source of the page

#############################################################################

if __name__ == "__main__":
    main()

This script requires your username and password for the PE site. For security reasons I don’t like embedding such data in scripts, so I store them in files on a TrueCrypt volume. To try this script, you can simply drop the “jabbapylib” import and specify your username and password directly.
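
If you prefer to keep the credentials in files without pulling in jabbapylib, a minimal stand-in for fs.read_first_line could look like this (the paths are illustrative):

def read_first_line(fname):
    """Return the first line of a file, without the trailing newline."""
    with open(fname) as f:
        return f.readline().strip()

USERNAME = read_first_line('/path/to/username.txt')   # illustrative path
PASSWORD = read_first_line('/path/to/password.txt')   # illustrative path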

Example #2: Splinter
This example is here for the sake of completeness. Since no Ajax is used on the PE site, there is no real need to drive a real browser window; that is, Mechanize would be the better (faster) choice here.

#!/usr/bin/env python

from splinter.browser import Browser
from jabbapylib.filesystem import fs

PE_LOGIN = 'http://projecteuler.net/login'
PE_COUNTRIES = 'http://projecteuler.net/countries'

USERNAME = fs.read_first_line('/home/jabba/secret/project_euler/username.txt')
PASSWORD = fs.read_first_line('/home/jabba/secret/project_euler/password.txt')

def main():
#    browser = Browser('chrome')
    browser = Browser()    # opens a Firefox instance
    browser.visit(PE_LOGIN)
    
    browser.fill('username', USERNAME)
    browser.fill('password', PASSWORD)
    button = browser.find_by_name('login')
    button.click()

    browser.visit(PE_COUNTRIES)
       
    f = open("/tmp/stat.html", "w")
    print >>f, browser.html    # HTML source of the page
    f.close()
    
    browser.quit()    # close the browser window

    print '__END__'

#############################################################################

if __name__ == "__main__":
    main()

Basically, it works just like Mechanize: you tell the browser which page to open, which fields to fill in, where to click, etc. At the end we save the HTML source to a file.
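
Since Splinter drives a real browser, it can also wait for content that JavaScript inserts after the page has loaded. PE doesn’t need this, but here is a hedged sketch (the CSS selector is illustrative, and the exact method names may differ between Splinter versions, so check the docs of the version you use):

# wait up to 10 seconds for an element that JavaScript is expected to insert
# ('div.stats' is an illustrative selector, not something taken from the PE site)
if browser.is_element_present_by_css('div.stats', wait_time=10):
    print browser.find_by_css('div.stats').first.text
else:
    print 'element did not appear'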

Both Mechanize and Splinter handle the cookies after login, so we don’t have to bother with cookielib.
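
If you do want to keep the cookies between runs, Mechanize can be given a cookiejar of its own to save and reload; a minimal sketch, assuming the login steps from Example #1 (the file name is illustrative):

import mechanize

cj = mechanize.LWPCookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(cj)          # Mechanize stores the session cookies here
# ... log in as in Example #1 ...
cj.save('pe_cookies.lwp', ignore_discard=True, ignore_expires=True)
# later: cj.load('pe_cookies.lwp', ignore_discard=True, ignore_expires=True)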

Download cookie-protected pages with Python using cookielib (Part 2)

September 11, 2011

Warning! In this post I use the Project Euler site as an example. However, it seems that this method doesn’t work with that site anymore: the PE site was updated recently and something has changed. The method described below might still work well with other sites.

Update (20111108): If you want to scrape the Project Euler site, check out Part 3 of this series.


In Part 1 we showed how to download a cookie-protected page with Python + wget. First, the cookies of a given site were extracted from Firefox’s cookies.sqlite file and stored in a plain-text file called cookies.txt. Then this cookies.txt file was passed to wget, and wget fetched the protected page.

The solution above works but it has some drawbacks. First, an external command (wget) is called to fetch the webpage. Second, the extracted cookies must be written in a file for wget.

In this post, we provide a clean, full-Python solution: the extracted cookies are not written to the file system, and the pages are downloaded with modules from the standard library.

Step 1: extracting cookies and storing them in a cookiejar
On the blog of Guy Rutenberg I found a post that explains this step. Here is my slightly refactored version:

#!/usr/bin/env python

import os
import sqlite3
import cookielib
import urllib2

COOKIE_DB = "{home}/.mozilla/firefox/cookies.sqlite".format(home=os.path.expanduser('~'))
CONTENTS = "host, path, isSecure, expiry, name, value"
COOKIEFILE = 'cookies.lwp'          # the path and filename that you want to use to save your cookies in
URL = 'http://projecteuler.net/index.php?section=statistics'

def get_cookies(host):
    cj = cookielib.LWPCookieJar()       # This is a subclass of FileCookieJar that has useful load and save methods
    con = sqlite3.connect(COOKIE_DB)
    cur = con.cursor()
    sql = "SELECT {c} FROM moz_cookies WHERE host LIKE '%{h}%'".format(c=CONTENTS, h=host)
    cur.execute(sql)
    for item in cur.fetchall():
        c = cookielib.Cookie(0, item[4], item[5],
            None, False,
            item[0], item[0].startswith('.'), item[0].startswith('.'),
            item[1], False,
            item[2],
            item[3], item[3]=="",
            None, None, {})
        cj.set_cookie(c)

    return cj

def main():
    host = 'projecteuler'
    cj = get_cookies(host)
    for index, cookie in enumerate(cj):
        print index,':',cookie
    #cj.save(COOKIEFILE)    # save the cookies if you want (not necessary)

if __name__=="__main__":
    main()

Step 2: download the protected page using the previously filled cookiejar
Now we need to download the protected page:

def get_page_with_cookies(cj):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)

    theurl = URL    # the protected page we want to fetch
    txdata = None   # for a POST request, encode a dictionary of values here with urllib.urlencode
    #params = {}
    #txdata = urllib.urlencode(params)
    txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}  # fake a user agent; some sites don't like automated visitors

    req = urllib2.Request(theurl, txdata, txheaders)    # create a request object
    handle = urllib2.urlopen(req)                       # and open it to return a handle on the URL

    return handle.read()
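
Putting the two steps together, main() from Step 1 can simply pass the filled cookiejar on:

def main():
    host = 'projecteuler'
    cj = get_cookies(host)
    print get_page_with_cookies(cj)   # HTML source of the protected page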

See the full source code here. This code is also part of my jabbapylib library (see the “web” module). For one more example, see this project of mine, where I had to download a cookie-protected page.

What’s next
In Part 3 we show how to use Mechanize and Splinter (two programmable browsers) to log in to a password-protected site and get the HTML source of a page.

Download pages with wget that are protected with cookies (Part 1)

September 5, 2011

Problem
I wanted to write a script that analyzes the statistics page of the Project Euler site. However, the statistics page is only visible if you are logged in to the site, and the authentication is done via cookies. Thus, when I tried to fetch the statistics page from the command line with wget, for instance, I didn’t get the correct page.

How to tell wget to use my Firefox cookies?

Solution
Firefox stores your cookies in a file called cookies.sqlite. You can pass cookies to wget in a text file with the “--load-cookies” option, so the first task is to extract the cookies that belong to the host you want to download pages from.
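
For reference, the cookies.txt that wget expects is in the Netscape format: one cookie per line, with seven tab-separated fields (domain, a TRUE/FALSE flag for subdomains, path, a TRUE/FALSE secure flag, expiry as a Unix timestamp, cookie name, cookie value). An illustrative line (the values are made up):

.projecteuler.net	TRUE	/	FALSE	1325000000	PHPSESSID	0123456789abcdef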

At old.0x7be.de I found a Python script written by Dirk Sohler that does this job. Here is my slightly refactored version of his script:

#!/usr/bin/env python

import os
import sys
import sqlite3 as db

USERDIR = 'w3z7c6j4.default'

COOKIEDB = os.path.expanduser('~') + '/.mozilla/firefox/' + USERDIR + '/cookies.sqlite'
OUTPUT = 'cookies.txt'
CONTENTS = "host, path, isSecure, expiry, name, value"

def extract(host):
    conn = db.connect(COOKIEDB)
    cursor = conn.cursor()

    sql = "SELECT {c} FROM moz_cookies WHERE host LIKE '%{h}%'".format(c=CONTENTS, h=host)
    cursor.execute(sql)

    out = open(OUTPUT, 'w')
    cnt = 0
    for row in cursor.fetchall():
        s = "{0}\tTRUE\t{1}\t{2}\t{3}\t{4}\t{5}\n".format(row[0], row[1],
                 str(bool(row[2])).upper(), row[3], str(row[4]), str(row[5]))
        out.write(s)
        cnt += 1

    print "Gesucht nach: {0}".format(host)
    print "Exportiert: {0}".format(cnt)

    out.close()
    conn.close()

if __name__ == "__main__":
    if len(sys.argv) == 1:
        print "{0}: error: specify the host.".format(sys.argv[0])
        sys.exit()
    else:
        extract(sys.argv[1])

You can also find the latest version of this script in my Bash-Utils repository.

Customize the constant USERDIR and you are done (a sketch for detecting the profile directory automatically follows below). Here is how to extract the cookies of the Project Euler site:

$ ./export_firefox_cookies.py projecteuler
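
If you’d rather not hard-code the profile directory, here is a hedged sketch that picks it up with glob (it assumes a single *.default profile under ~/.mozilla/firefox):

import glob
import os

# take the first *.default profile; adjust if you have several profiles
profile = glob.glob(os.path.expanduser('~/.mozilla/firefox/*.default'))[0]
COOKIEDB = os.path.join(profile, 'cookies.sqlite')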

Now, let’s fetch that protected page:

$ wget --cookies=on --load-cookies=cookies.txt --keep-session-cookies "http://projecteuler.net/index.php?section=statistics" -O stat.html

Notes
Get the base URL and save its cookies to a file:

$ wget --cookies=on --keep-session-cookies --save-cookies=cookie.txt http://first_page

Get protected content using stored cookies:

$ wget --referer=http://first_page --cookies=on --load-cookies=cookie.txt --keep-session-cookies --save-cookies=cookie.txt http://second_page

What’s next
In this post we showed how to download a cookie-protected page with Python + wget. In Part 2 we provide a full-Python solution.