Use rsync (see http://rsync.samba.org/) to mirror the LinuxFocus server. rsync minimizes network traffic and mirror time, and it is the fastest and easiest way to keep your site up to date. Don't use a web crawler or FTP! Those methods are slow and generate a lot of load on the main server.
Take a look at the following script. You should use such a script to mirror LinuxFocus. Note that the domain to mirror from is rsync.linuxfocus.org and not www.linuxfocus.org.
#!/bin/sh
# Please contact [email protected] if you have any
# questions.
#-----------------------------------------
## put something like this into the crontab of a user who has write
## permissions on your web-server:
## run the synclf script every day at 2:33 in the night:
#33 2 * * * /home/xxx/synclf
#
#-----------------------------------------
# ensure that your webserver can read new files:
umask 022
#-----------------------------------------
# the directory of the LinuxFocus mirror page (please edit this line):
target=/http/linuxfocus
#
# You can uncomment the following line for debug purposes:
#DEBUG="yes"
#
if [ "$DEBUG" = "yes" ]; then
    echo "debug output will be written to the file /tmp/synclf.$$ ..."
    echo "start rsync with rsync.linuxfocus.org" > /tmp/synclf.$$
    date >> /tmp/synclf.$$
    rsync -rLtz -vv --delete rsync.linuxfocus.org::lf/ "$target" >> /tmp/synclf.$$ 2>&1
    exit 0
fi
# Normally (debug off) the following will be executed:
#
rsync -rLtz --delete rsync.linuxfocus.org::lf/ "$target"
#
#-------------- End of rsync script ---------------
# You can get rsync at ftp://rsync.samba.org/pub/rsync/
# or http://rsync.samba.org/
#

The above script is an example for downloading the dynamic HTML pages. As an alternative, you can get static pages from rsync.linuxfocus.org::statichtml/ instead of rsync.linuxfocus.org::lf/
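The choice between the dynamic and the static pages, and a trial run before the first real sync, could be sketched roughly as follows. The helper names lf_source and preview_sync are hypothetical and not part of the original script. The -n option is rsync's dry run: it lists what would be transferred or deleted without changing anything.

```sh
#!/bin/sh
# Hypothetical helpers, not part of the original synclf script.

# Print the rsync source for a module name ("lf" or "statichtml").
lf_source() {
    case "$1" in
        lf|statichtml) printf 'rsync.linuxfocus.org::%s/\n' "$1" ;;
        *) echo "unknown module: $1" >&2; return 1 ;;
    esac
}

# Show what a sync would do without touching the target directory.
# $1 = module name, $2 = local mirror directory
preview_sync() {
    src=$(lf_source "$1") || return 1
    rsync -rLtz -n -v --delete "$src" "$2"
}
```

A call such as preview_sync statichtml /http/linuxfocus would then print the planned changes. Because --delete removes local files that are not on the server, a dry run is a cheap way to confirm that the target directory is the right one.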
You should mirror LinuxFocus once a day, during the low-traffic hours between 23:00 and 5:00 (UTC / GMT).
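To check a planned start time against this window, a small test along these lines may help. The function names in_mirror_window and utc_hour are hypothetical, and the sketch treats 5:00 itself as outside the window, which is an assumption.

```sh
#!/bin/sh
# Hypothetical helper: succeed if the given hour (0-23, UTC) lies in the
# suggested mirror window from 23:00 up to, but not including, 05:00.
in_mirror_window() {
    hour=${1#0}        # strip one leading zero: "08" -> "8", "00" -> "0"
    hour=${hour:-0}    # a plain "0" would otherwise become empty
    [ "$hour" -ge 23 ] || [ "$hour" -lt 5 ]
}

# Print the current hour in UTC, for example "02".
utc_hour() {
    date -u +%H
}
```

For example, in_mirror_window "$(utc_hour)" succeeds only while the current UTC time is inside the window. Note that cron normally interprets times in the machine's local time zone, so the UTC window has to be converted before it is written into a crontab.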
Create a text file, called crontab.txt, with the following data (please vary the time a bit):
# run the synclf script every day at 2:45 in the night:
45 2 * * * /home/where/ever/you/put/it/synclf

and then activate it with the crontab command.
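Creating the file with a slightly varied time might look like the following sketch. The helper names cron_line and write_crontab_file are hypothetical, and the paths are placeholders to be replaced with your own.

```sh
#!/bin/sh
# Hypothetical helper: print a daily crontab entry.
# $1 = minute, $2 = hour, $3 = full path to the synclf script
cron_line() {
    printf '%s %s * * * %s\n' "$1" "$2" "$3"
}

# Write crontab.txt into the directory given as $4.
write_crontab_file() {
    {
        echo "# run the synclf script every day:"
        cron_line "$1" "$2" "$3"
    } > "$4/crontab.txt"
}

# Installing is left commented out because "crontab FILE" replaces the
# user's whole existing crontab with the contents of FILE:
#   crontab crontab.txt
#   crontab -l        # list the installed entries to verify
```

If the user already has other cron jobs, it is safer to add the line with crontab -e than to install a file that contains only this one entry.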
# To use server-parsed HTML files
AddType text/html .shtml
AddHandler server-parsed .shtml

Both the #exec command and #include must be enabled (see http://www.apache.org/docs and search for SSI).
The #exec command is needed because the LinuxFocus web pages execute a Perl script called lfdynahead.pl. This script sets the links between the different languages. You can take a look at it if you want. It is in the document root directory of linuxfocus.org.
You can see that SSI is working if the articles have a line at the top that says "This article is available in:....." as shown in the following picture:
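The same check can be approximated from the command line. The sketch below assumes that a page has been fetched through your web server (not copied from the file system) and saved to a local file. The function name has_raw_ssi is hypothetical. If the server is not parsing the pages, the SSI directives are delivered unchanged and can be found with grep.

```sh
#!/bin/sh
# Hypothetical check: succeed if the saved page still contains raw SSI
# directives such as <!--#exec ...> or <!--#include ...>, which suggests
# that the server delivered the file without parsing it.
has_raw_ssi() {
    grep -q -e '<!--#exec' -e '<!--#include' "$1"
}
```

For example, has_raw_ssi page.html && echo "SSI is not being processed" prints the warning only when unparsed directives are still present in the saved page.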
/usr/bin/perl is the standard path to perl under Linux. Any common Linux distribution will have perl in that location.
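Whether this holds on a particular mirror host can be verified with a short check such as the one below. The function names perl_at and report_perl are hypothetical.

```sh
#!/bin/sh
# Hypothetical check: succeed if an executable file (not a directory)
# exists at the given path.
perl_at() {
    [ -x "$1" ] && [ ! -d "$1" ]
}

# Report whether perl is at the standard path; otherwise print the
# location found on PATH, if any.
report_perl() {
    if perl_at /usr/bin/perl; then
        echo "perl found at /usr/bin/perl"
    else
        echo "perl not at /usr/bin/perl; PATH lookup gives: $(command -v perl)"
    fi
}
```

Running report_perl on the web server shows at a glance whether the standard path applies there.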