Download cookie-protected pages with Python using cookielib (Part 2)

September 11, 2011, Jabba Laci

Warning! This post uses the Project Euler site as an example, but the method no longer seems to work with that site: the PE site was updated recently and something changed. The method described below may still work well with other sites.

Update (20111108): If you want to scrape the Project Euler site, check out Part 3 of this series.

In Part 1 we showed how to download a cookie-protected page with Python + wget. First, cookies of a given site were extracted from Firefox’s cookies.sqlite file and they were stored in a plain-text file called cookies.txt. Then this cookies.txt file was passed to wget and wget fetched the protected page.

The solution above works, but it has some drawbacks. First, an external command (wget) is called to fetch the webpage. Second, the extracted cookies must be written to a file for wget.

In this post, we provide a clean, full-Python solution. The extracted cookies are not stored in the file system and the pages are downloaded with a Python module from the standard library.

Step 1: extracting cookies and storing them in a cookiejar
On the blog of Guy Rutenberg I found a post that explains this step. Here is my slightly refactored version:

#!/usr/bin/env python

import os
import sqlite3
import cookielib
import urllib2

COOKIE_DB = "{home}/.mozilla/firefox/cookies.sqlite".format(home=os.path.expanduser('~'))  # here: a symlink to xxxxxxxx.default/cookies.sqlite
CONTENTS = "host, path, isSecure, expiry, name, value"
COOKIEFILE = 'cookies.lwp'          # file to save the cookies in (optional)
URL = 'http://projecteuler.net/index.php?section=statistics'

def get_cookies(host):
    cj = cookielib.LWPCookieJar()       # a FileCookieJar subclass with load() and save() methods
    con = sqlite3.connect(COOKIE_DB)
    cur = con.cursor()
    sql = "SELECT {c} FROM moz_cookies WHERE host LIKE ?".format(c=CONTENTS)
    cur.execute(sql, ('%' + host + '%',))   # bind the host pattern instead of formatting it into the SQL
    for item in cur.fetchall():
        c = cookielib.Cookie(0, item[4], item[5],       # version, name, value
            None, False,                                # port, port_specified
            item[0], item[0].startswith('.'), item[0].startswith('.'),  # domain, domain_specified, domain_initial_dot
            item[1], False,                             # path, path_specified
            item[2],                                    # secure
            item[3], item[3]=="",                       # expires, discard
            None, None, {})                             # comment, comment_url, rest
        cj.set_cookie(c)
    con.close()

    return cj

def main():
    host = 'projecteuler'
    cj = get_cookies(host)
    for index, cookie in enumerate(cj):
        print index, ':', cookie
    #cj.save(COOKIEFILE)    # save the cookies if you want (not necessary)

if __name__ == "__main__":
    main()
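
The script above is written for Python 2. In Python 3 the cookielib module is called http.cookiejar, so here is a rough sketch of the same extraction step for Python 3. The function name load_cookies and its db_path parameter are my own choices for illustration, and I assume the moz_cookies table has the same columns as in the query above.

```python
import sqlite3
from http.cookiejar import Cookie, CookieJar  # Python 3 name of cookielib

COLUMNS = "host, path, isSecure, expiry, name, value"


def load_cookies(db_path, host):
    """Return a CookieJar holding every cookie whose host contains `host`.

    db_path is assumed to point to a Firefox-style cookies.sqlite file
    that has a moz_cookies table with the columns listed in COLUMNS.
    """
    jar = CookieJar()
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT {c} FROM moz_cookies WHERE host LIKE ?".format(c=COLUMNS),
            ("%" + host + "%",),
        ).fetchall()
    finally:
        con.close()

    for domain, path, secure, expiry, name, value in rows:
        jar.set_cookie(Cookie(
            version=0, name=name, value=value,
            port=None, port_specified=False,
            domain=domain,
            domain_specified=domain.startswith("."),
            domain_initial_dot=domain.startswith("."),
            path=path, path_specified=False,
            secure=bool(secure),
            expires=expiry, discard=False,
            comment=None, comment_url=None, rest={},
        ))
    return jar
```

Calling load_cookies(path_to_cookies_sqlite, 'projecteuler') should then give a jar comparable to the one built by get_cookies() above.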

Step 2: download the protected page using the previously filled cookiejar
Now we need to download the protected page:

def get_page_with_cookies(cj):
    # build an opener that sends the cookies stored in cj and install it globally
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)

    theurl = URL    # the cookie-protected page to download
    txdata = None   # for a POST request, pass a dictionary of values encoded with urllib.urlencode
    #params = {}
    #txdata = urllib.urlencode(params)
    txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}  # fake a user agent; some websites (like Google) don't like automated exploration

    req = urllib2.Request(theurl, txdata, txheaders)    # create a request object
    handle = urllib2.urlopen(req)                       # open it; this returns a handle on the URL

    return handle.read()
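
If you get the logged-out version of a page, it helps to check which cookies would actually be sent for a given URL before going over the network. Below is a small Python 3 sketch of that idea (urllib2 became urllib.request in Python 3). The helper names make_cookie, preview_request and fetch, as well as the example.com domain and the sessionid cookie, are made up for illustration.

```python
import urllib.request
from http.cookiejar import Cookie, CookieJar

USER_AGENT = "Mozilla/5.0"  # placeholder user agent string


def make_cookie(name, value, domain):
    """Build a simple version 0 session cookie for `domain` (illustrative helper)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain,
        domain_specified=domain.startswith("."),
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def preview_request(url, jar):
    """Return a Request with the Cookie header the jar would attach. No network access."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    jar.add_cookie_header(req)
    return req


def fetch(url, jar):
    """Download `url`, sending the matching cookies from `jar`."""
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with opener.open(req) as response:
        return response.read()


if __name__ == "__main__":
    jar = CookieJar()
    jar.set_cookie(make_cookie("sessionid", "abc123", ".example.com"))
    req = preview_request("http://www.example.com/page", jar)
    print(req.get_header("Cookie"))
```

If preview_request() attaches no Cookie header for your URL, the jar holds no cookie that matches the host and path, and fetch() will most likely return the logged-out page.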

See the full source code here. This code is also part of my jabbapylib library (see the “web” module). For one more example, see this project of mine, where I had to download a cookie-protected page.

What’s next
In Part 3 we show how to use Mechanize and Splinter (two programmable browsers) to log in to a password-protected site and get the HTML source of a page.

Categories: python, security Tags: authentication, cookie protection, cookielib, cookies, cookies.lwp, cookies.sqlite, cookies.txt, cookijar, credentials, download protected page, wget

Comments (13)

bulkanevcimen

September 12, 2011 at 07:46

Why don’t you just use the requests module?

import requests

def main():
    with requests.session() as s:
        # post through the session so that it keeps the login cookies
        resp = s.post("http://projecteuler.net/index.php",
            data=dict(username="username", password="password", login="Login")
        )
        print resp.cookies

if __name__ == "__main__":
    main()
- Jabba Laci
  
  September 12, 2011 at 12:32
  
  I’ve never used requests but I’ll try it. However, I wanted to avoid integrating my username and password in the script.
Andrew

November 7, 2011 at 16:19

This just doesn’t work… I think Firefox changed something
- Jabba Laci
  
  November 7, 2011 at 16:31
  
  Does COOKIE_DB point to your cookies.sqlite file? Verify that. Here I use a symbolic link which points to xxxxxxxx.default/cookies.sqlite.
Andrew

November 7, 2011 at 16:23

Specifically:

line 18, in get_cookies
cur.execute(sql)
DatabaseError: file is encrypted or is not a database

Edit: Thanks for the fast response. Indeed it does point to the right file. I’ve since gotten it to stop throwing that error by upgrading my sqlite dll file, but it doesn’t seem to be handling the cookies properly. When the page is printed, it’s just the title page (the one you see when you’re logged out).
- Jabba Laci
  
  November 7, 2011 at 16:52
  
  In order to make it work, you must log in to the site (here, Project Euler) with Firefox. It’ll store the cookies in cookies.sqlite that the script extracts. Verify in Firefox that you are logged in.
Andrew

November 7, 2011 at 17:34

Yes, I’m logged in. Still the same result, unfortunately.
- Jabba Laci
  
  November 7, 2011 at 17:44
  
  Wait, I just checked it and I also receive the title page :( The PE site changed a few weeks ago; this script worked with the old one. Hmm, we need to find an alternative solution then.
Andrew

November 7, 2011 at 17:49

The requests method listed in the first comment also gives me the title page
- Jabba Laci
  
  November 7, 2011 at 17:56
  
  You can try it with splinter. However, I couldn’t make it work in headless mode, i.e. it opens a browser window for me. With “browser.html” you can access the HTML source of the page.
Andrew

November 7, 2011 at 18:49

I just installed Splinter — what code are you using to access?

Edit: Got things working with Splinter — thanks for introducing me to it!
- Jabba Laci
  
  November 7, 2011 at 22:54
  
  Hey, post the relevant code! :)
  
  Edit: see Part 3 for accessing the new Project Euler site.
Rajesh

January 12, 2012 at 04:11

Great script. Have been trying for weeks to do this and got a complete solution! Thanks much!
