"Linux Gazette...making Linux just a little more fun!"

Downloading without a Browser

By Adrian J Chung

Ever had to download a file so huge over a link so slow that you'd need to keep the web browser open for hours or days? What if you had 40 files linked from a single web page, all of which you needed -- would you tediously click on each one? What if the browser crashes before it can finish? GNU/Linux comes equipped with a handy set of tools for downloading in the background, independent of the browser. This allows you to log out, resume interrupted downloads, and even schedule them to occur during off-peak Net usage hours.

When interactivity stands in the way

Web browsers are designed to make the Web interactive -- click and expect results within seconds. But many files take far longer than a few seconds to download, even over the quickest of connections. One example is the ISO images popular among those burning their own GNU/Linux distribution CD-ROMs. Some web browsers, especially poorly coded ones, do not behave well over long durations, leaking memory or crashing at the most inopportune moment. Despite the fusion of some browsers with file managers, many still do not support the multi-selection and rubber-banding operations that make it easy to transfer several files in one go. You also have to stay logged in until the entire file has arrived. Finally, you have to be present at the office to click the link that starts the download, thus angering the coworkers with whom you share the office bandwidth.

Downloading large files is a task better suited to a different suite of tools. This article discusses how to combine various GNU/Linux utilities, including lynx, wget, curl, at, and crontab, to handle a variety of file transfer situations. A small amount of simple scripting is also employed, so a little knowledge of the bash shell will help.

The `wget` utility

All the major distributions include the wget downloading tool.

  bash$ wget http://place.your.url/here

wget can also handle FTP, compare date stamps, and recursively mirror entire web site directory trees -- and, if you're not careful, entire web sites and whatever other sites they link to:

  bash$ wget -m http://target.web.site/subdirectory

Because of the high load this tool can place on servers, it obeys the "robots.txt" protocol when mirroring. Several command options control exactly what gets mirrored, limiting the types of links followed and the file types downloaded. For example, to follow only relative links and skip GIF images:

  bash$ wget -m -L --reject=gif http://target.web.site/subdirectory

wget can also resume an interrupted download (the "-c" option) when run in the directory holding the incomplete file, appending the remaining data to it. The server must support this operation.

  bash$ wget -c http://the.url.of/incomplete/file

The resumption and mirroring can be combined, allowing one to mirror a large collection of files over the period of many separate download sessions. More on how to automate this later.

If you're experiencing download interruptions as often as I do in my office, you can tell wget to retry the URL several times:

  bash$ wget -t 5 http://place.your.url/here

Here we give up after 5 attempts. Use "-t inf" to never give up.

What about proxy firewalls? Use the http_proxy environment variable or the .wgetrc configuration file to specify a proxy through which to download. One problem with proxies on an intermittent connection is that resumption can sometimes fail. If a proxied download is interrupted, the proxy server caches only an incomplete copy of the file. When you then use "wget -c" to fetch the remainder, the proxy checks its cache and erroneously reports that you already have the entire file. You can coax most proxies into bypassing their cache by adding a special header to your download request:

  bash$ wget -c --header="Pragma: no-cache" http://place.your.url/here
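As a minimal sketch of the proxy setup mentioned above: wget reads the http_proxy and ftp_proxy environment variables when it starts. The proxy host and port shown here are hypothetical placeholders, so substitute whatever your own network uses.

```shell
# Hypothetical proxy address; replace it with your site's own.
http_proxy="http://proxy.example.com:8080/"
ftp_proxy="$http_proxy"
export http_proxy ftp_proxy

# Any wget started from this shell now goes through the proxy, e.g.
#   wget -c --header="Pragma: no-cache" http://place.your.url/here
```

Placing the same lines in your shell start-up file should make the setting apply to every login session.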

The "--header" option can add any number and manner of headers, by which one can modify the behaviour of web servers and proxies. Some sites refuse to serve files via externally sourced links; content is delivered to browsers only if they access it via some other page on the same site. You can get around this by appending a "Referer:" header:

  bash$ wget --header="Referer: http://coming.from.this/page" http://surfing.to.this/page

Some particularly anti-social web sites will only serve content to a specific brand of browser. Get around this with a "User-Agent:" header:

  bash$ wget --header="User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" http://msie.only.url/here

(Warning: the above tip may be considered circumvention of a content licensing mechanism, and there exist anti-social legal systems that have deemed such actions illegal. Check your local legislation. Your mileage may vary.)
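wget also offers dedicated "--referer" and "--user-agent" options that do the same job as the hand-written headers. The sketch below wraps them in a small shell function; the name fetch_from and the UA variable are made up for this illustration, and the default browser string is simply the one used earlier.

```shell
# Browser identity to present; set UA beforehand to override the default.
UA="${UA:-Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)}"

# fetch_from REFERRING_PAGE TARGET_URL
# Download TARGET_URL as though following a link on REFERRING_PAGE.
fetch_from() {
  wget --referer="$1" --user-agent="$UA" "$2"
}

# Example:
#   fetch_from http://coming.from.this/page http://surfing.to.this/page
```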

Downloading `at` what hour?

If you are downloading large files on your office computer over a connection shared with easily angered coworkers who don't like their streaming media slowed to a crawl, you should consider starting your file transfers in the off-peak hours. You do not have to stay in the office after everyone has left, nor remember to do a remote login from home after dinner. Make use of the at job scheduler:

  bash$ at 2300
  warning: commands will be executed using /bin/sh
  at> wget http://place.your.url/here
  at> press Ctrl-D

Here, we want to begin downloading at 11:00pm. Make sure that the atd scheduling daemon is running in the background for this to work.
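The at command also reads its job from standard input, which saves typing at the at> prompt. Here is a rough sketch; download_at is a name invented for this example, and the single quotes around the URL are only there to protect characters such as "&" and "?" from /bin/sh (a URL that itself contains a single quote would need more care).

```shell
# download_at TIME URL
# Queue a resumable download of URL with up to 5 retries, to start at TIME.
download_at() {
  printf "wget -c -t 5 '%s'\n" "$2" | at "$1"
}

# Example: start fetching at 11pm tonight.
#   download_at 23:00 http://place.your.url/here
```

Running "atq" afterwards lists the queued job, and "atrm" followed by the job number cancels it.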

It'll take how many days?

When there is a lot of data to download in one or several files, and your bandwidth is comparable to the carrier pigeon protocol, you will often find that the download you scheduled to occur has not yet completed when you arrive at work in the morning. Being a good neighbour, you kill the job and submit another at job, this time using "wget -c", repeating as necessary over as many days as it'll take. It is better to automate this using a crontab. Create a plain text file, called "crontab.txt", containing something like the following:

0 23 * * 1-5    wget -c -N http://place.your.url/here
0  6 * * 1-5    killall wget

This will be your crontab file which specifies what jobs to execute at periodic intervals. The first five columns say when to execute the command, and the remainder of each line says what to execute. The first two columns indicate the time of day -- 0 minutes past 11pm to start wget, 0 minutes past 6am to killall wget. The * in the 3rd and 4th columns indicates that these actions are to occur every day of every month. The 5th column indicates on which days of the week to schedule each operation -- "1-5" is Monday to Friday.

So every weekday at 11pm your download will begin, and at 6am every weekday any wget still in progress will be terminated. To activate this crontab schedule you need to issue the command:

  bash$ crontab crontab.txt

The "-N" option makes wget compare the timestamp and size of the local file with those of the remote file and skip the download when the local copy is already complete and up to date. So you can just set it and forget it. "crontab -r" will remove this schedule. I've downloaded many an ISO image over shared dial-up connections using this approach.
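One caveat: "crontab crontab.txt" replaces your whole schedule, so any jobs you had already installed would be lost. A possible way around this, sketched below, is to merge the current schedule with the new entries first. The function name merge_crontab and the file name current.txt are assumptions made for this example.

```shell
# merge_crontab EXISTING NEW
# Print the lines of EXISTING followed by those lines of NEW that are
# not already present, keeping the original order.
merge_crontab() {
  awk '!seen[$0]++' "$1" "$2"
}

# Example: keep the installed jobs and add the download schedule.
#   crontab -l > current.txt 2>/dev/null
#   merge_crontab current.txt crontab.txt | crontab -
```

Afterwards "crontab -l" shows what is actually installed.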

Dynamically Generated Web Pages

Some web pages are generated on demand because they are subject to frequent changes, sometimes several times a day. Since the target is technically not a file, there is no file length and resuming a download becomes meaningless -- the "-c" option fails to work. Example: a PHP-generated page at Linux Weekly News:

  bash$ wget http://lwn.net/bigpage.php3

If you interrupt the download and try to resume, it starts over from scratch. My office Net connection is at times so poor that I've written a simple script detecting when a dynamic HTML page has been delivered completely:

#!/bin/bash

#create it if absent
touch bigpage.php3

#check if we got the whole thing
while ! grep -qi '</html>' bigpage.php3
do
  rm -f bigpage.php3

  #download LWN in one big page
  wget http://lwn.net/bigpage.php3

done

The above bash script keeps downloading the document until the string "</html>", which marks the end of the file, can be found in it.
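The same idea can be sketched as a reusable function that gives up after a fixed number of attempts, so a page that never ends in "</html>" cannot keep the loop running forever. The names page_complete and fetch_page and the default of 10 attempts are choices made for this illustration only.

```shell
#!/bin/bash

# page_complete FILE
# Succeed if FILE exists and contains a closing </html> tag.
page_complete() {
  [ -f "$1" ] && grep -qi '</html>' "$1"
}

# fetch_page URL FILE [MAX_TRIES]
# Download URL into FILE until it looks complete, or fail after
# MAX_TRIES attempts (10 if not given).
fetch_page() {
  local url="$1" out="$2" max="${3:-10}" n=0
  until page_complete "$out"; do
    [ "$n" -ge "$max" ] && return 1
    n=$((n + 1))
    rm -f "$out"
    wget -q -O "$out" "$url"
  done
  return 0
}

# Example:
#   fetch_page http://lwn.net/bigpage.php3 bigpage.php3 20
```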

SSL and Cookies

URLs beginning with "https://" must access remote files through the Secure Sockets Layer. You will find another download utility, called curl, to be handy in these situations.

Some web sites force-feed cookies to the browser before serving the requested content. One must add a "Cookie:" header with the correct information which can be obtained from your web browser's cookie file. For lynx and Mozilla cookie file formats:

  bash$ cookie=$( grep nytimes ~/.lynx_cookies |awk '{printf("%s=%s;",$6,$7)}' )

will construct the required cookie for downloading stuff from http://www.nytimes.com, assuming that you have already registered with the site using this browser. w3m uses a slightly different cookie file format:

  bash$ cookie=$( grep nytimes ~/.w3m/cookie |awk '{printf("%s=%s;",$2,$3)}' )

Downloading can now be carried out thus:

  bash$ wget --header="Cookie: $cookie" http://www.nytimes.com/reuters/technology/tech-tech-supercomput.html

or using the curl tool:

  bash$ curl -v -b "$cookie" -o supercomp.html http://www.nytimes.com/reuters/technology/tech-tech-supercomput.html
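The grep and awk pipeline above matches "nytimes" anywhere on a line. A slightly stricter sketch follows, assuming the seven tab-separated columns (domain, flag, path, secure, expiry, name, value) that the $6 and $7 fields above imply. The function name cookie_for is invented here.

```shell
# cookie_for SITE COOKIE_FILE
# Print "name=value; name=value" for every cookie whose domain column
# contains SITE. Assumes seven tab-separated columns per cookie line.
cookie_for() {
  awk -F '\t' -v site="$1" '
    NF >= 7 && index($1, site) {
      printf "%s%s=%s", sep, $6, $7
      sep = "; "
    }' "$2"
}

# Example:
#   cookie=$(cookie_for nytimes ~/.lynx_cookies)
#   wget --header="Cookie: $cookie" http://www.nytimes.com/...
```

Matching on the domain column alone avoids picking up cookies from other sites that merely mention the name in a path or value.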

Making Lists of URLs

So far, we've only been downloading single files or mirroring entire web site directories. Sometimes one wants to download a large number of files whose URLs are given on a web page without performing a full-scale mirror of the entire site. An example would be downloading the top 20 music files from a site that displays the top 100 in order. Here the "--accept" and "--reject" options wouldn't work, since they filter on file name suffixes and patterns rather than on a link's position in the page. Instead, make use of "lynx -dump".

  bash$ lynx -dump ftp://ftp.ssc.com/pub/lg/ |grep 'gz$' |tail -10 |awk '{print $2}' > urllist.txt

The output from lynx can then be filtered using the various GNU text processing utilities. In the above example, we extract URLs ending in "gz" and store the last 10 of these in a file. A tiny bash scripting command will automatically download any URLs listed in this file:

  bash$ for x in $(cat urllist.txt)
  > do
  >   wget $x
  > done

We've succeeded in downloading the last 10 issues of Linux Gazette.
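The filtering step can be packaged as a small function so that the suffix and the count become parameters. This is only a sketch: last_urls is a made-up name, and it assumes, as the pipeline above does, that the URL is the second word of each matching line in the lynx output.

```shell
# last_urls SUFFIX COUNT
# Read "lynx -dump" output on standard input and print the last COUNT
# URLs whose line ends in SUFFIX.
last_urls() {
  grep -- "$1\$" | awk '{print $2}' | tail -n "$2"
}

# Example:
#   lynx -dump ftp://ftp.ssc.com/pub/lg/ | last_urls gz 10 > urllist.txt
#   wget -c -i urllist.txt
```

The "-i" option tells wget to read its URLs from the named file, which replaces the explicit loop, and "-c" lets a second run pick up where an interrupted one left off.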

Swimming in bandwidth

If you're one of the select few to be drowning in bandwidth, and your file downloads are slowed only by bottlenecks at the web server end, this trick can help "shotgun" the file transfer process. It requires the use of curl and several mirror web sites where identical copies of the target file are located. For example, suppose you want to download the Mandrake 8.0 ISO from the following three locations:


url1=http://ftp.eecs.umich.edu/pub/linux/mandrake/iso/Mandrake80-inst.iso
url2=http://ftp.rpmfind.net/linux/Mandrake/iso/Mandrake80-inst.iso
url3=http://ftp.wayne.edu/linux/mandrake/iso/Mandrake80-inst.iso

The length of the file is 677281792 bytes, so initiate three simultaneous downloads using curl's "--range" option:


bash$ curl -r 0-199999999 -o mdk-iso.part1 $url1 &
bash$ curl -r 200000000-399999999 -o mdk-iso.part2 $url2 &
bash$ curl -r 400000000- -o mdk-iso.part3 $url3 &

This creates three background download processes, each transferring a different part of the ISO image from a different server. The "-r" option specifies a subrange of bytes to extract from the target file. When all three have completed, simply cat the parts together -- cat mdk-iso.part? > mdk-80.iso. (Checking the md5 hash before burning to CD-R is highly recommended.) Launching each curl in its own window with the "--verbose" option allows one to track the progress of each transfer.
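The byte ranges above were picked by hand. For a different file size or number of mirrors they could be computed instead, as in this sketch. The function name split_ranges is invented here, and the parts come out equal in size, with the last one left open-ended so that it absorbs any remainder.

```shell
# split_ranges TOTAL_BYTES PARTS
# Print one "start-end" byte range per line, suitable for curl -r.
# The final range has no end, so it runs to the end of the file.
split_ranges() {
  total=$1
  parts=$2
  chunk=$(( total / parts ))
  i=0
  while [ "$i" -lt "$parts" ]; do
    start=$(( i * chunk ))
    if [ "$i" -eq $(( parts - 1 )) ]; then
      echo "${start}-"
    else
      echo "${start}-$(( start + chunk - 1 ))"
    fi
    i=$(( i + 1 ))
  done
}

# Example:
#   split_ranges 677281792 3
```

For the Mandrake ISO this prints 0-225760596, 225760597-451521193 and 451521194-, each of which can be handed to a separate "curl -r" command pointed at a different mirror.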

Conclusion

Do not be afraid to use non-interactive methods for effecting your remote file transfers. No matter how hard web designers may try to force you to surf their sites interactively, there will always be free tools to help automate the process, thus enriching our overall Net experience.

Adrian J Chung

When not teaching undergraduate computing at the University of the West Indies, Trinidad, Adrian writes scripts to automate web email downloads, and experiments with interfacing various scripting environments with homebrew computer graphics renderers and data visualization libraries.