
Download pages with wget that are protected with cookies (Part 1)

Problem
I wanted to write a script that analyzes the statistics page of the Project Euler site. However, the statistics page is only visible if you are logged in, and the authentication is done via cookies. Thus, when I tried to fetch the page from the command line with wget, for instance, I didn’t get the correct page.

How can I tell wget to use my Firefox cookies?

Solution
Firefox stores your cookies in a file called cookies.sqlite. You can pass cookies to wget in a text file with the “--load-cookies” option. So the first task is to extract the cookies that belong to the host from which you want to download pages.
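
For reference, the text file that “--load-cookies” accepts uses the Netscape cookie file format: one cookie per line with seven tab-separated fields (domain, include-subdomains flag, path, secure flag, expiry as a Unix timestamp, name, value). A sample line might look like this (the values here are purely illustrative):

# Netscape HTTP Cookie File
projecteuler.net	TRUE	/	FALSE	1440000000	PHPSESSID	0123456789abcdef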

At old.0x7be.de I found a Python script written by Dirk Sohler that does this job. Here is my slightly refactored version of his script:

#!/usr/bin/env python

import os
import sys
import sqlite3 as db

# Name of your Firefox profile directory (customize this).
USERDIR = 'w3z7c6j4.default'

COOKIEDB = os.path.expanduser('~') + '/.mozilla/firefox/' + USERDIR + '/cookies.sqlite'
OUTPUT = 'cookies.txt'
CONTENTS = "host, path, isSecure, expiry, name, value"

def extract(host):
    conn = db.connect(COOKIEDB)
    cursor = conn.cursor()

    # Select the cookies whose host contains the given string.
    # The host is passed as a bound parameter, not pasted into the SQL.
    sql = "SELECT {c} FROM moz_cookies WHERE host LIKE ?".format(c=CONTENTS)
    cursor.execute(sql, ('%{0}%'.format(host),))

    # Write each row in the Netscape cookie file format that wget expects:
    # domain <TAB> flag <TAB> path <TAB> secure <TAB> expiry <TAB> name <TAB> value
    out = open(OUTPUT, 'w')
    cnt = 0
    for row in cursor.fetchall():
        s = "{0}\tTRUE\t{1}\t{2}\t{3}\t{4}\t{5}\n".format(row[0], row[1],
                 str(bool(row[2])).upper(), row[3], str(row[4]), str(row[5]))
        out.write(s)
        cnt += 1

    print("Searched for: {0}".format(host))
    print("Exported: {0}".format(cnt))

    out.close()
    conn.close()

if __name__ == "__main__":
    if len(sys.argv) == 1:
        print("{0}: error: specify the host.".format(sys.argv[0]))
        sys.exit(1)
    else:
        extract(sys.argv[1])

You can also find the latest version of this script in my Bash-Utils repository.
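
If you’d rather not hard-code the profile directory name, you could look it up in Firefox’s profiles.ini instead. Here is a minimal, untested sketch (the helper name default_profile is mine; it assumes a standard ~/.mozilla/firefox layout where the default profile section carries a Default=1 entry):

import os
try:
    from configparser import ConfigParser   # Python 3
except ImportError:
    from ConfigParser import ConfigParser   # Python 2

def default_profile():
    """Return the directory name of the default Firefox profile."""
    ini = os.path.expanduser('~/.mozilla/firefox/profiles.ini')
    cp = ConfigParser()
    cp.read(ini)
    # Profile sections are named [Profile0], [Profile1], ...
    for section in cp.sections():
        if section.startswith('Profile') and cp.has_option(section, 'Default'):
            return cp.get(section, 'Path')
    # Fall back to the first profile if none is marked as default.
    for section in cp.sections():
        if section.startswith('Profile'):
            return cp.get(section, 'Path')

You could then set USERDIR = default_profile() instead of the constant.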

Customize the constant USERDIR (the name of your Firefox profile directory) and you are done. Here is how to extract the cookies of the Project Euler site:

$ ./export_firefox_cookies.py projecteuler
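
If everything is configured correctly, the script reports what it did, something like the following (the exported count naturally depends on your cookie store):

Searched for: projecteuler
Exported: 2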

Now, let’s fetch that protected page:

$ wget --cookies=on --load-cookies=cookies.txt --keep-session-cookies "http://projecteuler.net/index.php?section=statistics" -O stat.html


Notes
Fetch the base URL and save its cookies to a file:

$ wget --cookies=on --keep-session-cookies --save-cookies=cookie.txt http://first_page

Fetch the protected content using the stored cookies:

$ wget --referer=http://first_page --cookies=on --load-cookies=cookie.txt --keep-session-cookies --save-cookies=cookie.txt http://second_page
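
If the first page is a login form rather than a plain page, the session cookie usually has to be established by submitting that form. wget can do this too with its “--post-data” option; a hedged sketch (the login URL and the field names username/password are assumptions, check the actual form’s HTML):

$ wget --keep-session-cookies --save-cookies=cookie.txt --post-data="username=USER&password=PASS" http://first_page/login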

What’s next
In this post we showed how to download a cookie-protected page with Python + wget. In Part 2 we will present a pure-Python solution.
