Download pages with wget that are protected with cookies (Part 1)
Problem
I wanted to write a script that analyzes the statistics page on the Project Euler site. However, the statistics page is only visible if you are logged in, and the authentication is done via cookies. So when I tried to fetch the page from the command line with wget, I didn’t get the correct page.
How can I tell wget to use my Firefox cookies?
Solution
Firefox stores your cookies in an SQLite database file called cookies.sqlite. wget can read cookies from a text file with the “--load-cookies” option. So the first task is to extract the cookies that belong to the host you want to download pages from and write them to such a text file.
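The text file that “--load-cookies” reads uses the tab-separated cookies.txt layout originally used by Netscape: domain, subdomain flag, path, secure flag, expiry, name and value. Here is a minimal sketch of formatting one cookie as such a line. The function name and the sample values are made up for illustration:

```python
def netscape_cookie_line(host, path, is_secure, expiry, name, value):
    """Format one cookie as a line of a Netscape-style cookies.txt file."""
    fields = [
        host,
        "TRUE",                            # subdomain flag, hard-coded here
        path,
        "TRUE" if is_secure else "FALSE",  # send over HTTPS only?
        str(expiry),                       # expiry as a Unix timestamp
        name,
        value,
    ]
    return "\t".join(fields) + "\n"


if __name__ == "__main__":
    # Hypothetical cookie values, for illustration only.
    line = netscape_cookie_line(".example.com", "/", 0, 1893456000, "session", "abc123")
    print(line, end="")
```

Each row of Firefox's moz_cookies table becomes one such line in the output file.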
At old.0x7be.de I found a Python script written by Dirk Sohler that does this job. Here is my slightly refactored version of his script:
#!/usr/bin/env python3
"""Export the Firefox cookies of one host to a cookies.txt file for wget."""
import os
import sys
import sqlite3 as db

# Name of your Firefox profile directory; customize this.
USERDIR = 'w3z7c6j4.default'
COOKIEDB = os.path.join(os.path.expanduser('~'), '.mozilla', 'firefox',
                        USERDIR, 'cookies.sqlite')
OUTPUT = 'cookies.txt'
CONTENTS = "host, path, isSecure, expiry, name, value"


def extract(host):
    conn = db.connect(COOKIEDB)
    cursor = conn.cursor()
    # Bind the host as a query parameter instead of formatting it into the SQL.
    sql = "SELECT {c} FROM moz_cookies WHERE host LIKE ?".format(c=CONTENTS)
    cursor.execute(sql, ('%' + host + '%',))

    cnt = 0
    with open(OUTPUT, 'w') as out:
        for row in cursor.fetchall():
            # One line per cookie: domain, flag, path, secure, expiry, name, value
            s = "{0}\tTRUE\t{1}\t{2}\t{3}\t{4}\t{5}\n".format(
                row[0], row[1], str(bool(row[2])).upper(), row[3], row[4], row[5])
            out.write(s)
            cnt += 1

    print("Searched for: {0}".format(host))
    print("Exported: {0}".format(cnt))
    conn.close()


if __name__ == "__main__":
    if len(sys.argv) == 1:
        print("{0}: error: specify the host.".format(sys.argv[0]))
        sys.exit(1)
    else:
        extract(sys.argv[1])
You can also find the latest version of this script in my Bash-Utils repository.
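The script expects the name of your Firefox profile directory in USERDIR. If you are not sure what yours is called, a small helper along these lines can list the candidates. This is only a sketch: the function name is made up, and it assumes the Linux location ~/.mozilla/firefox used in the script above.

```python
import glob
import os


def find_cookie_dbs(base=None):
    """Return the paths of all cookies.sqlite files one level below base."""
    if base is None:
        # Assumed default location on Linux, same as in the script above.
        base = os.path.join(os.path.expanduser('~'), '.mozilla', 'firefox')
    return sorted(glob.glob(os.path.join(base, '*', 'cookies.sqlite')))


if __name__ == "__main__":
    for path in find_cookie_dbs():
        # The directory name is the value to use for USERDIR.
        print(os.path.basename(os.path.dirname(path)))
```

If it prints more than one directory, you have several profiles and need to pick the one you use to log in to the site.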
Set the constant USERDIR to the name of your own Firefox profile directory and you are done. Here is how to extract the cookies of the site Project Euler:
$ ./export_firefox_cookies.py projecteuler
Now, let’s fetch that protected page:
$ wget --cookies=on --load-cookies=cookies.txt --keep-session-cookies "http://projecteuler.net/index.php?section=statistics" -O stat.html
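Since the goal is a script that analyzes the page, the same wget call can also be started from Python. The following is a rough sketch using the standard subprocess module. The function names are made up, and it assumes that wget is installed and that cookies.txt was created as described above.

```python
import subprocess


def build_wget_command(url, output, cookie_file='cookies.txt'):
    """Build the wget invocation shown above as an argument list."""
    return [
        'wget',
        '--cookies=on',
        '--load-cookies=' + cookie_file,
        '--keep-session-cookies',
        url,
        '-O', output,
    ]


def fetch_with_cookies(url, output, cookie_file='cookies.txt'):
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_command(url, output, cookie_file), check=True)
```

Calling fetch_with_cookies("http://projecteuler.net/index.php?section=statistics", "stat.html") should then do the same as the command line above. Passing the arguments as a list avoids shell quoting problems with the "?" in the URL.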
Related links
- Using the cookies.sqlite from Firefox 3 in wget
- Using wget to download content protected by referer and cookies (see the “Notes” below)
- View Cookies add-on for Firefox (to easily figure out which cookies belong to a site)
Notes
Fetch the base URL and save its cookies to a file:
$ wget --cookies=on --keep-session-cookies --save-cookies=cookie.txt http://first_page
Get protected content using stored cookies:
$ wget --referer=http://first_page --cookies=on --load-cookies=cookie.txt --keep-session-cookies --save-cookies=cookie.txt http://second_page
What’s next
In this post we showed how to download a cookie-protected page with Python + wget. In Part 2 we provide a pure Python solution.