Archive

Posts Tagged ‘wget’

wget with no output

August 13, 2013

Problem
You want to call wget from a script but you want wget to stay silent, i.e. it shouldn’t produce any output.

Solution

wget -q URL
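
Note that -q also suppresses error messages, so in a script you may still want to check wget's exit status. A minimal sketch (the URL is a placeholder):

if ! wget -q "http://example.com/file.txt"; then
    echo "download failed" >&2
fi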

How to download an entire website for off-line reading?

November 30, 2012

Problem

You want to download an entire website (e.g. a blog) for offline reading.

Solution

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

Tip from here.
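
A slightly extended variant of the same command that I find useful as a starting point (the extra flags are optional; --adjust-extension needs wget >= 1.12, and --wait just keeps the crawl polite):

wget --mirror -p --convert-links --adjust-extension --no-parent --wait=1 -P ./LOCAL-DIR WEBSITE-URL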


Download files with wget from sites that verify your user-agent

March 28, 2012

Problem
You want to download a file from a given site with your favourite wget utility but you get a “403 Forbidden” error in your face. Of course, everything works from your browser. What to do?

Solution
If it works from the browser but it fails with wget, then the site must check your user-agent. If it sees “User-Agent: Wget/1.12 (linux-gnu)” (version may vary), then it simply blocks you.

But don’t fear for a second. Simply fake a different user agent with wget and continue downloading.

Solution 1:

wget --user-agent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0" http://host/file.jpg

Solution 2:
If you don’t want to provide a user agent each time, put the following in your ~/.wgetrc file:

# custom .wgetrc file
user_agent = Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0

Then:

wget http://host/file.jpg
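
If you want to double-check which User-Agent header wget actually sends, the debug output shows the request headers (a quick throwaway check; the grep only filters the noise):

wget -d -O /dev/null http://host/file.jpg 2>&1 | grep "User-Agent"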

Download cookie-protected pages with Python using cookielib (Part 2)

September 11, 2011

Warning! In this post I use the Project Euler site as an example. However, it seems that this method doesn’t work with that site anymore: the PE site was updated recently and something has changed. The method described below might still work well with other sites.

Update (20111108): If you want to scrape the Project Euler site, check out Part 3 of this series.


In Part 1 we showed how to download a cookie-protected page with Python + wget. First, cookies of a given site were extracted from Firefox’s cookies.sqlite file and they were stored in a plain-text file called cookies.txt. Then this cookies.txt file was passed to wget and wget fetched the protected page.

The solution above works but it has some drawbacks. First, an external command (wget) is called to fetch the webpage. Second, the extracted cookies must be written to a file for wget.

In this post, we provide a clean, full-Python solution. The extracted cookies are not stored in the file system and the pages are downloaded with a Python module from the standard library.

Step 1: extracting cookies and storing them in a cookiejar
On the blog of Guy Rutenberg I found a post that explains this step. Here is my slightly refactored version:

#!/usr/bin/env python

import os
import sqlite3
import cookielib
import urllib2

COOKIE_DB = "{home}/.mozilla/firefox/cookies.sqlite".format(home=os.path.expanduser('~'))
CONTENTS = "host, path, isSecure, expiry, name, value"
COOKIEFILE = 'cookies.lwp'          # the path and filename that you want to use to save your cookies in
URL = 'http://projecteuler.net/index.php?section=statistics'

def get_cookies(host):
    cj = cookielib.LWPCookieJar()       # This is a subclass of FileCookieJar that has useful load and save methods
    con = sqlite3.connect(COOKIE_DB)
    cur = con.cursor()
    sql = "SELECT {c} FROM moz_cookies WHERE host LIKE '%{h}%'".format(c=CONTENTS, h=host)
    cur.execute(sql)
    for item in cur.fetchall():
        c = cookielib.Cookie(0, item[4], item[5],
            None, False,
            item[0], item[0].startswith('.'), item[0].startswith('.'),
            item[1], False,
            item[2],
            item[3], item[3]=="",
            None, None, {})
        cj.set_cookie(c)

    return cj

def main():
    host = 'projecteuler'
    cj = get_cookies(host)
    for index, cookie in enumerate(cj):
        print index,':',cookie
    #cj.save(COOKIEFILE)    # save the cookies if you want (not necessary)

if __name__=="__main__":
    main()

Step 2: download the protected page using the previously filled cookiejar
Now we need to download the protected page:

def get_page_with_cookies(cj):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)

    theurl = URL    # an example url that sets a cookie, try different urls here and see the cookie collection you can make !
    txdata = None   # if we were making a POST type request, we could encode a dictionary of values here - using urllib.urlencode
    #params = {}
    #txdata = urllib.urlencode(params)
    txheaders =  {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}  # fake a user agent, some websites (like google) don't like automated exploration

    req = urllib2.Request(theurl, txdata, txheaders)    # create a request object
    handle = urllib2.urlopen(req)                       # and open it to return a handle on the url

    return handle.read()

See the full source code here. This code is also part of my jabbapylib library (see the “web” module). For one more example, see this project of mine, where I had to download a cookie-protected page.


What’s next
In Part 3 we show how to use Mechanize and Splinter (two programmable browsers) to log in to a password-protected site and get the HTML source of a page.

Download pages with wget that are protected with cookies (Part 1)

September 5, 2011

Problem
I wanted to write a script that analyzes the statistics page of the Project Euler site. However, that page is only visible if you are logged in, and the authentication is done via cookies. Thus, when I tried to fetch the statistics page from the command line with wget, I didn’t get the correct page.

How to tell wget to use my Firefox cookies?

Solution
Firefox stores your cookies in a file called cookies.sqlite. You can pass cookies to wget in a text file with the “--load-cookies” option. So the first task is to extract cookies that belong to the host from where you want to download some pages.

At old.0x7be.de I found a Python script written by Dirk Sohler that does this job. Here is my slightly refactored version of his script:

#!/usr/bin/env python

import os
import sys
import sqlite3 as db

USERDIR = 'w3z7c6j4.default'

COOKIEDB = os.path.expanduser('~') + '/.mozilla/firefox/' + USERDIR + '/cookies.sqlite'
OUTPUT = 'cookies.txt'
CONTENTS = "host, path, isSecure, expiry, name, value"

def extract(host):
    conn = db.connect(COOKIEDB)
    cursor = conn.cursor()

    sql = "SELECT {c} FROM moz_cookies WHERE host LIKE '%{h}%'".format(c=CONTENTS, h=host)
    cursor.execute(sql)

    out = open(OUTPUT, 'w')
    cnt = 0
    for row in cursor.fetchall():
        s = "{0}\tTRUE\t{1}\t{2}\t{3}\t{4}\t{5}\n".format(row[0], row[1],
                 str(bool(row[2])).upper(), row[3], str(row[4]), str(row[5]))
        out.write(s)
        cnt += 1

    print "Gesucht nach: {0}".format(host)
    print "Exportiert: {0}".format(cnt)

    out.close()
    conn.close()

if __name__ == "__main__":
    if len(sys.argv) == 1:
        print "{0}: error: specify the host.".format(sys.argv[0])
        sys.exit()
    else:
        extract(sys.argv[1])

You can also find the latest version of this script in my Bash-Utils repository.

Customize the constant USERDIR and you are done. Here is how to extract the cookies of the site Project Euler:

$ ./export_firefox_cookies.py projecteuler
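
If you are not sure what to put into USERDIR, Firefox lists the available profiles in its profiles.ini file (assuming the default ~/.mozilla location):

$ grep Path ~/.mozilla/firefox/profiles.ini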

Now, let’s fetch that protected page:

$ wget --cookies=on --load-cookies=cookies.txt --keep-session-cookies "http://projecteuler.net/index.php?section=statistics" -O stat.html
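
As a rough sanity check that you really got the logged-in version of the page, grep for something that only appears when you are logged in (the marker string below is just a guess, adjust it to the site):

$ grep -i "logout" stat.html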


Notes
Get the base URL and save its cookies in a file:

$ wget --cookies=on --keep-session-cookies --save-cookies=cookie.txt http://first_page

Get protected content using stored cookies:

$ wget --referer=http://first_page --cookies=on --load-cookies=cookie.txt --keep-session-cookies --save-cookies=cookie.txt http://second_page

What’s next
In this post we showed how to download a cookie-protected page with Python + wget. In Part 2 we provide a full-Python solution.

Download a webpage and print it to the standard output

August 10, 2011

Problem
You want to download a webpage and print its content to the standard output, for instance to push it through a pipe for further processing.

Solution
The easiest way is to use “curl”, since by default it prints the downloaded content to stdout:

curl http://www.python.org | less

You might want to add the switch “-s” to make curl silent, i.e. hide the progress bar.

With “wget” it’s a bit more complicated:

wget -qO- http://www.python.org | less
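
As a quick illustration of the pipe idea, here the output is fed into grep instead of less (the pattern is arbitrary, just for demonstration):

wget -qO- http://www.python.org | grep -i "download"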

If you want to add syntax highlighting to less, see this post.

References
Redirecting wget to STDOUT – now with Syntax Highlighting

wget examples

February 15, 2011

Let’s see some use cases with wget.

Download a page with http password authentication.

wget --http-user=name --http-password=password URL

Download a complete site with http password authentication. Make the copy locally browsable, i.e. convert links to point to local pages.

wget --convert-links -r --http-user=name --http-password=password URL

Collect some URLs in a file (one URL per line) and download ‘em all:

wget -i file.txt
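
If the URLs follow a pattern, the list itself can be generated from the shell. A small sketch (host and paths are made up; seq -w zero-pads the numbers):

for i in $(seq -w 1 20); do
    echo "http://example.com/gallery/image${i}.jpg"
done > file.txt

wget -i file.txt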

Download a page and save it under a different name:

wget "http://example.com?a=1&b=2&looks_stupid=true" -O simple.html

Download gallery files (requires bash; the brace range is expanded by bash, not by wget) [more info here]:

wget http://example.com/Tiffany/beach{01..20}.jpg

Update (20130903)

For more examples, refer to this page.


bash loop without for

February 3, 2011

The following examples require bash >= 3.0.

Example 1

Print numbers from 1 to 10 without a for loop:

echo {1..10}

Output:

1 2 3 4 5 6 7 8 9 10

What happens is bash expands the range this way:

echo 1 2 3 4 5 6 7 8 9 10

Example 2

Produce file names from 01.pdf to 10.pdf:

echo {01..10}.pdf

Output:

01.pdf 02.pdf 03.pdf 04.pdf 05.pdf 06.pdf 07.pdf 08.pdf 09.pdf 10.pdf

Example 3

Download a gallery of images with wget:

wget http://example.com/gallery/image{01..10}.jpg

bash will expand it like this:

wget http://example.com/gallery/image01.jpg http://example.com/gallery/image02.jpg http://example.com/gallery/image03.jpg http://example.com/gallery/image04.jpg http://example.com/gallery/image05.jpg http://example.com/gallery/image06.jpg http://example.com/gallery/image07.jpg http://example.com/gallery/image08.jpg http://example.com/gallery/image09.jpg http://example.com/gallery/image10.jpg
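
One caveat: brace expansion happens before variable expansion, so the bounds cannot come from variables. A quick demonstration:

n=10
echo {1..$n}       # not expanded, prints the literal text: {1..10}
echo $(seq 1 $n)   # prints: 1 2 3 4 5 6 7 8 9 10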



Download all issues of Full Circle Magazine

January 27, 2011

Problem

You want to get all the issues of Full Circle Magazine but you don’t want to download ‘em one by one. Is there an easy and painless way to get them in a bundle?

Solution

Here are the necessary URLs till issue 47:


http://dl.fullcirclemagazine.org/issue0_en.pdf
http://dl.fullcirclemagazine.org/issue1_en.pdf
http://dl.fullcirclemagazine.org/issue2_en.pdf
http://dl.fullcirclemagazine.org/issue3_en.pdf
http://dl.fullcirclemagazine.org/issue4_en.pdf
http://dl.fullcirclemagazine.org/issue5_en.pdf
http://dl.fullcirclemagazine.org/issue6_en.pdf
http://dl.fullcirclemagazine.org/issue7_en.pdf
http://dl.fullcirclemagazine.org/issue8_en.pdf
http://dl.fullcirclemagazine.org/issue9_en.pdf
http://dl.fullcirclemagazine.org/issue10_en.pdf
http://dl.fullcirclemagazine.org/issue11_en.pdf
http://dl.fullcirclemagazine.org/issue12_en.pdf
http://dl.fullcirclemagazine.org/issue13_en.pdf
http://dl.fullcirclemagazine.org/issue14_en.pdf
http://dl.fullcirclemagazine.org/issue15_en.pdf
http://dl.fullcirclemagazine.org/issue16_en.pdf
http://dl.fullcirclemagazine.org/issue17_en.pdf
http://dl.fullcirclemagazine.org/issue18_en.pdf
http://dl.fullcirclemagazine.org/issue19_en.pdf
http://dl.fullcirclemagazine.org/issue20_en.pdf
http://dl.fullcirclemagazine.org/issue21_en.pdf
http://dl.fullcirclemagazine.org/issue22_en.pdf
http://dl.fullcirclemagazine.org/issue23_en.pdf
http://dl.fullcirclemagazine.org/issue24_en.pdf
http://dl.fullcirclemagazine.org/issue25_en.pdf
http://dl.fullcirclemagazine.org/issue26_en.pdf
http://dl.fullcirclemagazine.org/issue27_en.pdf
http://test.fullcirclemagazine.org/wp-content/uploads/2009/08/fullcircle-issue28-eng1.pdf
http://dl.fullcirclemagazine.org/issue29_en.pdf
http://dl.fullcirclemagazine.org/issue30_en.pdf
http://dl.fullcirclemagazine.org/issue31_en.pdf
http://dl.fullcirclemagazine.org/issue32_en.pdf
http://dl.fullcirclemagazine.org/issue33_en.pdf
http://dl.fullcirclemagazine.org/issue34_en.pdf
http://dl.fullcirclemagazine.org/issue35_en.pdf
http://dl.fullcirclemagazine.org/issue36_en.pdf
http://dl.fullcirclemagazine.org/issue37_en.pdf
http://dl.fullcirclemagazine.org/issue38_en.pdf
http://dl.fullcirclemagazine.org/issue39_en.pdf
http://dl.fullcirclemagazine.org/issue40_en.pdf
http://dl.fullcirclemagazine.org/issue41_en.pdf
http://dl.fullcirclemagazine.org/issue42_en.pdf
http://dl.fullcirclemagazine.org/issue43_en.pdf
http://dl.fullcirclemagazine.org/issue44_en.pdf
http://dl.fullcirclemagazine.org/issue45_en.pdf
http://dl.fullcirclemagazine.org/issue46_en.pdf
http://dl.fullcirclemagazine.org/issue47_en.pdf

Save it to a file called down.txt, then download them all:

wget -i down.txt

Update (20110130) #1:

Unfortunately, issues below 10 are named issueX_en.pdf rather than issue0X_en.pdf. Thus, if you download all the files and list them with ‘ls -al‘, issues < 10 will not sort in order with the others. Here is how to fix it:

rename -n 's/issue(\d)_en(.*)/issue0$1_en$2/' *.pdf

It will just print the planned renames (without executing them). If the result looks OK, remove the ‘-n‘ switch and run the command again. Now the file names will sort in the correct order.
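
Note that this is the Perl-based rename. On distributions that ship the util-linux rename instead (its syntax differs), a plain bash loop does the same job; a sketch using mv with no-clobber:

for f in issue?_en.pdf; do
    mv -n "$f" "issue0${f#issue}"
done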

Update (20110130) #2:

This post was picked up by Ubuntu Life, where the user Capitán suggested an easier solution in a comment:

wget http://dl.fullcirclemagazine.org/issue{0..45}_en.pdf

I didn’t know about this wget feature :) Now I see why issues < 10 are named issueX_en.pdf and not issue0X_en.pdf.

Update (20110203): That {0..45} thing is actually expanded by bash, not by wget! See this post for more info.

Update (20110130) #3:

Another reader of Ubuntu Life, marco, suggests a bash script solution:

for i in {0..45}
do
    wget http://dl.fullcirclemagazine.org/issue${i}_en.pdf
done

Or, in one line:

for i in {0..45}; do wget http://dl.fullcirclemagazine.org/issue${i}_en.pdf; done
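
If you want the zero-padded file names right away (combining this loop with the renaming issue from Update #1), here is one way to do it with printf for the padding (a sketch; note that issue 28 lives at a different URL and has to be fetched separately):

for i in {0..45}; do
    wget -O "issue$(printf '%02d' "$i")_en.pdf" "http://dl.fullcirclemagazine.org/issue${i}_en.pdf"
done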