
mirror a website with wget

I want to crawl a website recursively and download it for local use; a small wrapper script is sketched after the option list below.


wget -c --mirror -p --html-extension --convert-links --no-parent --reject "index.html*" "$url"

The options:

  • -c: continue partially downloaded files (if you stop the process with CTRL+C and relaunch it, the download resumes where it left off)
  • --mirror: turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings
  • -p: get all images, etc. needed to display HTML page
  • --html-extension: save HTML docs with .html extensions
  • --convert-links: make links in downloaded HTML point to local files
  • --no-parent: Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
  • --reject "index.html*": don’t download index.html* files
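Putting it together, here is a minimal sketch of a wrapper script. The script name mirror.sh and the example URL are my own placeholders, not part of the original post:

#!/usr/bin/env bash
# mirror.sh -- mirror a site for offline browsing (hypothetical helper)
# Usage: ./mirror.sh https://example.com/docs/
set -euo pipefail

url="${1:?usage: $0 <url>}"   # first argument: the URL to mirror

wget -c --mirror -p --html-extension --convert-links --no-parent \
     --reject "index.html*" "$url"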

Tips from here.

Get URLs only
If you want to spider a website and get the URLs only, check this post out. In short:

wget --spider --force-html -r -l2 "$url" 2>&1 | grep '^--' | awk '{ print $3 }'

Here -l2 sets the maximum recursion depth to 2; adjust this value as needed for your site.
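For example, to collect the discovered URLs into a file and deduplicate them (assuming the same $url variable as above; the output file name urls.txt is only an example):

# Spider up to depth 2 and keep each discovered URL once
wget --spider --force-html -r -l2 "$url" 2>&1 \
  | grep '^--' \
  | awk '{ print $3 }' \
  | sort -u > urls.txt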
