Site Spider
Wulfy
wulfmann at tiscali.co.uk
Sun Jan 20 01:30:32 UTC 2008
Paul Lemmons wrote:
> For it to "spider" through, it will open the initial page (usually
> index.html, index.php or default.htm), follow its links to new pages,
> then follow their links, and so on until it has the whole site.
>
> If you want to create a complete backup, including files that are not
> linked to, you will want to use the ftp protocol instead of http.
>
> wget -rc ftp://userid:password@www.your-site.com
>
> wget --help gives you some help remembering the options. "man wget" gives
> you a lot more detail. Googling will turn up lots of examples.
>
I decided to try it anyway before I got your answer.
wget -rc with the http protocol downloads the index.html and the
robots.txt files... It doesn't go on from there.
Since you suggested using the ftp protocol, I tried again. It doesn't
even find the site... it can't change to the directory.
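Thinking about it, wget honours robots.txt by default, so if that file
forbids crawling, the recursion would stop after the first page. One thing
I could try next (just a sketch; the address is only a placeholder):

wget -rc -e robots=off http://www.example.com/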
Let me explain what I want to do. There is an archived copy at
web.archive.org of a website that no longer exists. I want to retrieve
the data from that site. I can go through it, page by page, and save as
I go. But I thought that seemed a bit long-winded when wget could grab
the lot in one command. (I also tried it on a live site and got the
same results.)
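For the archived copy, the sort of command I have in mind is roughly the
following (the timestamp and site name are made up; -np keeps it from
climbing above that branch, -k fixes up the links for local browsing and
-p fetches images and stylesheets):

wget -rc -np -k -p -e robots=off \
  http://web.archive.org/web/20070101000000/http://www.example.com/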
--
Blessings
Wulfmann
Wulf Credo:
Respect the elders. Teach the young. Co-operate with the pack.
Play when you can. Hunt when you must. Rest in between.
Share your affections. Voice your opinion. Leave your Mark.
Copyright July 17, 1988 by Del Goetz