Archive

Posts Tagged ‘wget’

[wget] downloading images

December 11, 2017

Problem
Working from a Python script, I wanted to download images from various websites. I handed this job to wget, which I called as an external program. However, downloading some images failed. I checked those URLs, and they opened nicely in my browser. What da hell?

Solution
Some web servers check what client is connecting, and if it doesn’t look like a browser, they simply block the request. Our job is to make wget pretend to be a normal browser. Put the following content in your “~/.wgetrc”:

header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
referer = /
robots = off

Problem solved. I found this tip here.
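If you only need this for a single download and don’t want to change “~/.wgetrc” globally, the same settings can also be passed on the command line. A rough equivalent (the image URL is only a placeholder):

wget -e robots=off \
     --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0" \
     --header="Accept-Language: en-us,en;q=0.5" \
     --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
     --referer="/" \
     "http://example.com/some-image.jpg"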

Categories: bash, web

Linux Journal shuts down after 23 years

December 5, 2017

It’s sad, but Linux Journal is shutting down :( Here is the announcement: https://www.linuxjournal.com/content/linux-journal-ceases-publication .

LJ made all the issues available in various digital formats: https://secure2.linuxjournal.com/pdf/dljdownload.php .

Here is a simple script to produce individual wget commands to download the PDFs one by one:

#!/usr/bin/env python3

# download URL template; {n} is the issue number
template = "http://download.linuxjournal.com/pdf/get-doc.php?code=dlj{n}.pdf&tcode=pdf-{n}-1234&action=spit"

def main():
    # issues 132 to 283, one wget command per issue
    for i in range(132, 283 + 1):
        url = template.format(n=i)
        cmd = 'wget "{url}" -O {n}.pdf'.format(url=url, n=i)
        print(cmd)

if __name__ == "__main__":
    main()

Usage:

$ python3 lj.py > down.sh
$ sh down.sh
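If you’d rather skip the intermediate file, the generated commands can also be piped straight into a shell:

$ python3 lj.py | sh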
Categories: linux

How to download files in parallel?

October 10, 2017

Problem
So far I’ve mostly used wget to download files. You can collect the URLs in a file (one URL per line) and pass it to wget: “wget -i list.txt”. However, wget fetches them one by one, which can be time-consuming. Is there a way to parallelize the downloads?

Solution
Use Aria2 for this purpose. It’s similar to wget, but it starts several downloads in parallel. The number of parallel downloads has a default value, but you can change that too. Its basic usage is the same:

aria2c -i list.txt
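The number of parallel downloads can be tuned with the -j (--max-concurrent-downloads) option; for instance, to run 8 downloads at once (the value is just an example):

aria2c -j 8 -i list.txt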
Categories: bash

copy large files between computers at home over the network

January 2, 2015

Problem
I have a desktop machine at home with a Windows 2007 virtual machine. I mainly keep it because of PowerPoint. Lately I prefer to work on my laptop in the living room. Today I needed PowerPoint, so I decided to copy the whole Windows virtual machine to my laptop. The only problem was that it was 67 GB, and I didn’t have that much free space on my external HDDs :(

Solution
Don’t panic. On my desktop machine I entered the folder that I wanted to copy and started a web server:

python -m SimpleHTTPServer
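Note: SimpleHTTPServer is the Python 2 module name; with Python 3 the equivalent is http.server, which also serves the current directory on port 8000 by default:

python3 -m http.server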

With “ifconfig” I checked the local IP address of the machine; it was 192.168.0.53.

On my laptop I opened a browser and navigated to “http://192.168.0.53:8000“. All the files I needed were there. Since I’m lazy and I didn’t want to click on each link one by one, I issued the following command (tip from here):

wget -r --no-parent http://192.168.0.53:8000

The download speed was about 10 MB/sec, so it took almost 2 hours.
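One note: by default wget -r puts everything under a directory named after the host (here 192.168.0.53:8000). If you prefer the files directly in the current directory, the -nH (--no-host-directories) option should skip that extra level:

wget -r --no-parent -nH http://192.168.0.53:8000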

Categories: network, python

mirror a website with wget

June 15, 2014

Problem
I want to crawl a website recursively and download it for local use.

Solution

wget -c --mirror -p --html-extension --convert-links --no-parent --reject "index.html*" $url

The options:

  •  -c: continue (if you stop the process with CTRL+C and relaunch it, it will continue)
  • --mirror: turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings
  • -p: get all images, etc. needed to display HTML page
  • --html-extension: save HTML docs with .html extensions
  • --convert-links: make links in downloaded HTML point to local files
  • --no-parent: Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
  • --reject "index.html*": don’t download index.html* files

Credits
Tips from here.
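For a concrete run, $url just needs to be set first; the URL below is only a placeholder:

url="http://example.com/docs/"
wget -c --mirror -p --html-extension --convert-links --no-parent --reject "index.html*" $url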

Get URLs only
If you want to spider a website and get the URLs only, check this post out. In short:

wget --spider --force-html -r -l2 $url 2>&1 | grep '^--' | awk '{ print $3 }'

Here -l2 sets the maximum recursion depth to 2. You may have to change this value.

Categories: bash

wget with no output

August 13, 2013

Problem
You want to call wget from a script but you want wget to stay silent, i.e. it shouldn’t produce any output.

Solution

wget -q URL
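Since -q suppresses all output, the only feedback a script gets is wget’s exit status, so it’s worth checking it; a minimal sketch (the URL and the output file are placeholders):

#!/bin/sh
if wget -q "http://example.com/file.txt" -O file.txt; then
    echo "download OK"
else
    echo "download failed" >&2
fi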
Categories: bash

How to download an entire website for off-line reading?

November 30, 2012

Problem

You want to download an entire website (e.g. a blog) for offline reading.

Solution

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

Tip from here.
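For a large site it may be polite to throttle the crawl; wget’s --wait and --limit-rate options can be added for that (the values below are just examples):

wget --mirror -p --convert-links --wait=1 --limit-rate=200k -P ./LOCAL-DIR WEBSITE-URL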

Categories: bash