Posts Tagged ‘wget’

How to download files in parallel?

October 10, 2017

So far I've mostly used wget to download files. You can collect the URLs in a file (one URL per line) and pass it to wget: "wget -i list.txt". However, wget fetches them one by one, which can be time-consuming. Is there a way to parallelize the download?

Use aria2 for this purpose. It's similar to wget, but it opens several connections and downloads the files in parallel. By default it downloads up to five files at once; you can change this with the -j option. Its basic usage is the same:

aria2c -i list.txt
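If you'd rather stay in Python, the same idea can be sketched with a thread pool from the standard library (a minimal sketch, not aria2's actual implementation; the list.txt file and the save-to-current-directory behaviour are assumptions):

```python
#!/usr/bin/env python3
# Sketch of parallel downloading in pure Python (assumes Python 3).
# Like `aria2c -i list.txt`, it fetches several URLs at the same time.
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def download(url, dest_dir="."):
    """Fetch one URL and save it under its basename in dest_dir."""
    name = os.path.basename(url.rstrip("/")) or "index.html"
    local = os.path.join(dest_dir, name)
    urllib.request.urlretrieve(url, local)
    return local

def download_all(urls, workers=5):
    """Download the given URLs in parallel with up to `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download, urls))

# usage (hypothetical list.txt, one URL per line):
#   download_all([line.strip() for line in open("list.txt") if line.strip()])
```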

copy large files between computers at home over the network

January 2, 2015

I have a desktop machine at home with a Windows 7 virtual machine on it. I mainly keep it around because of PowerPoint. These days I prefer working on my laptop in the living room. Today I needed PowerPoint, so I decided to copy the whole Windows virtual machine over to my laptop. The only problem: it was 67 GB, and I didn't have that much free space on my external HDDs :(

Don’t panic. On my desktop machine I entered the folder that I wanted to copy and started a web server:

python -m SimpleHTTPServer

With "ifconfig" I checked the local IP address of the machine.

On my laptop I opened a browser and navigated to the server's address. All the files I needed were there. Since I'm lazy and I didn't want to click on each link one by one, I issued the following command (tip from here):

wget -r --no-parent

The download speed was about 10 MB/sec, so it took almost 2 hours.
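Note that "python -m SimpleHTTPServer" is Python 2; on Python 3 the equivalent one-liner is "python3 -m http.server". The same throwaway file server can also be started from code (a sketch, assuming Python 3.7+ for the directory argument; the path and port are placeholders):

```python
#!/usr/bin/env python3
# Start an ad-hoc file server from code instead of the command line.
# `python -m SimpleHTTPServer` (Python 2) became `python3 -m http.server`.
import functools
import http.server
import threading

def serve_directory(directory, port=8000):
    """Serve `directory` over HTTP in a background thread; return the server."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                                directory=directory)
    server = http.server.ThreadingHTTPServer(("0.0.0.0", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

# usage: server = serve_directory("/path/to/share")
# then, on the other machine: wget -r --no-parent http://<local-ip>:8000/
```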


mirror a website with wget

June 15, 2014

I want to crawl a website recursively and download it for local use.


wget -c --mirror -p --html-extension --convert-links --no-parent --reject "index.html*" $url

The options:

  •  -c: continue (if you stop the process with CTRL+C and relaunch it, it will continue)
  • --mirror: turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings
  • -p: get all images, etc. needed to display HTML page
  • --html-extension: save HTML docs with .html extensions (in newer wget versions this option was renamed to --adjust-extension)
  • --convert-links: make links in downloaded HTML point to local files
  • --no-parent: Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
  • --reject "index.html*": don’t download index.html* files

Tips from here.

Get URLs only
If you want to spider a website and get the URLs only, check this post out. In short:

wget --spider --force-html -r -l2 $url 2>&1 | grep '^--' | awk '{ print $3 }'

Here -l2 limits the recursion to a maximum depth of 2. You may have to change this value.
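If you only need the links of a single page, the standard library can do that too (a sketch using html.parser; unlike the wget command above, it does no recursion):

```python
#!/usr/bin/env python3
# Collect the <a href="..."> targets of one HTML page with the stdlib only.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Gather absolute URLs from the anchor tags of a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # resolve relative links against the page's URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return the list of absolute link targets found in `html`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```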


wget with no output

August 13, 2013

You want to call wget from a script but you want wget to stay silent, i.e. it shouldn’t produce any output.


wget -q URL

How to download an entire website for off-line reading?

November 30, 2012


You want to download an entire website (e.g. a blog) for offline reading.


wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

Tip from here.


Download files with wget from sites that verify your user-agent

March 28, 2012

You want to download a file from a given site with your favourite wget utility but you get a “403 Forbidden” error in your face. Of course, everything works from your browser. What to do?

If it works from the browser but it fails with wget, then the site must check your user-agent. If it sees “User-Agent: Wget/1.12 (linux-gnu)” (version may vary), then it simply blocks you.

But don’t fear for a second. Simply fake a different user agent with wget and continue downloading.

Solution 1:

wget --user-agent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0" http://host/file.jpg

Solution 2:
If you don't want to provide a user agent each time, put the following in your ~/.wgetrc file:

# custom .wgetrc file
user_agent = Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0

Then wget will use this user agent automatically:

wget http://host/file.jpg
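The same trick works from Python, too (a sketch with urllib from Python 3; the user-agent string is just an example, like the one above):

```python
#!/usr/bin/env python3
# Fake a browser user agent from Python, like `wget --user-agent ...`.
import urllib.request

BROWSER_UA = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:11.0) "
              "Gecko/20100101 Firefox/11.0")

def build_request(url, user_agent=BROWSER_UA):
    """Build a request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url, user_agent=BROWSER_UA):
    """Download `url` pretending to be a browser; return the raw body."""
    with urllib.request.urlopen(build_request(url, user_agent)) as resp:
        return resp.read()
```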

Download cookie-protected pages with Python using cookielib (Part 2)

September 11, 2011

Warning! In this post I use the Project Euler site as an example. However, it seems that this method doesn’t work anymore with that site. The PE site was updated recently and they have changed something. However, the method described below might work well with other sites.

Update (20111108): If you want to scrape the Project Euler site, check out Part 3 of this series.

In Part 1 we showed how to download a cookie-protected page with Python + wget. First, cookies of a given site were extracted from Firefox’s cookies.sqlite file and they were stored in a plain-text file called cookies.txt. Then this cookies.txt file was passed to wget and wget fetched the protected page.

The solution above works but it has some drawbacks. First, an external command (wget) is called to fetch the webpage. Second, the extracted cookies must be written in a file for wget.

In this post, we provide a clean, full-Python solution. The extracted cookies are not stored in the file system and the pages are downloaded with a Python module from the standard library.

Step 1: extracting cookies and storing them in a cookiejar
On the blog of Guy Rutenberg I found a post that explains this step. Here is my slightly refactored version:

#!/usr/bin/env python

import os
import sqlite3
import cookielib
import urllib2

COOKIE_DB = "{home}/.mozilla/firefox/cookies.sqlite".format(home=os.path.expanduser('~'))
CONTENTS = "host, path, isSecure, expiry, name, value"
COOKIEFILE = 'cookies.lwp'          # the path and filename that you want to use to save your cookies in
URL = ''

def get_cookies(host):
    """Extract the cookies of `host` from Firefox's cookie DB into a cookiejar."""
    cj = cookielib.LWPCookieJar()       # a FileCookieJar subclass with useful load and save methods
    con = sqlite3.connect(COOKIE_DB)
    cur = con.cursor()
    sql = "SELECT {c} FROM moz_cookies WHERE host LIKE ?".format(c=CONTENTS)
    cur.execute(sql, ('%{h}%'.format(h=host),))
    for item in cur.fetchall():
        # item: (host, path, isSecure, expiry, name, value)
        c = cookielib.Cookie(0, item[4], item[5],
            None, False,
            item[0], item[0].startswith('.'), item[0].startswith('.'),
            item[1], False,
            item[2],
            item[3], item[3] == "",
            None, None, {})
        cj.set_cookie(c)
    con.close()

    return cj

def main():
    host = 'projecteuler'
    cj = get_cookies(host)
    for index, cookie in enumerate(cj):
        print index,':',cookie    # save the cookies if you want (not necessary)

if __name__ == "__main__":
    main()

Step 2: download the protected page using the previously filled cookiejar
Now we need to download the protected page:

def get_page_with_cookies(cj):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    theurl = URL    # an example url that sets a cookie, try different urls here and see the cookie collection you can make !
    txdata = None   # if we were making a POST type request, we could encode a dictionary of values here - using urllib.urlencode
    #params = {}
    #txdata = urllib.urlencode(params)
    txheaders =  {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}  # fake a user agent, some websites (like google) don't like automated exploration

    req = urllib2.Request(theurl, txdata, txheaders)    # create a request object
    handle = urllib2.urlopen(req)                       # and open it to return a handle on the url
    return handle.read()                                # the HTML source of the page


See the full source code here. This code is also part of my jabbapylib library (see the “web” module). For one more example, see this project of mine, where I had to download a cookie-protected page.
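For the record, in Python 3 the two modules used above were renamed: cookielib became http.cookiejar and urllib2 became urllib.request. A sketch of the download step in Python 3 terms:

```python
#!/usr/bin/env python3
# Python 3 version of step 2: cookielib -> http.cookiejar,
# urllib2 -> urllib.request.
import http.cookiejar
import urllib.request

def get_page_with_cookies(cj, url, user_agent="Mozilla/5.0 (compatible)"):
    """Fetch `url`, sending the cookies already stored in cookiejar `cj`."""
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener.open(req) as resp:
        return resp.read()
```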


What’s next
In Part 3 we show how to use Mechanize and Splinter (two programmable browsers) to log in to a password-protected site and get the HTML source of a page.