how to find dead links

Derek Broughton derek at pointerstop.ca
Mon Nov 16 02:35:35 UTC 2009


Loïc Grenié wrote:

> 2009/11/14 Derek Broughton <derek at pointerstop.ca>:
>> Loïc Grenié wrote:
>>
>>> 2009/11/14 Eugeneapolinary Ju <eugeneapolinary81 at yahoo.com>:
>>>> wget -r -p -U Firefox "http://www.somesite.com/" 2>&1 | grep 404 > 404.txt
>>>>
>>>> Why does 404.txt end up 0 bytes?  How do I get wget's output into a file?
>>>
>>> Have you tried
>>>
>>> wget -r -p -U Firefox "http://www.somesite.com/"
>>>
>>> There is no 404 message (at least here), because there is no web
>>> server at that address to return one.  A page fetch can fail for (at
>>> least) three different reasons:
>>
>> I imagine that "somesite.com" was an example, likely because his actual
>> site isn't accessible to the Internet.
>>
>> The real problem is:
>>
>> $ wget http://localhost/test.htm
>> --2009-11-14 10:43:23--  http://localhost/test.htm
>> Resolving localhost... 127.0.0.1, ::1
>> Connecting to localhost|127.0.0.1|:80... connected.
>> HTTP request sent, awaiting response... 404 Not Found
>> 2009-11-14 10:43:23 ERROR 404: Not Found.
>>
>>
>> In this case, 404 is ONLY a status, and not a page.
> 
>    Of course, but the status is delivered by a web server.
>    We'll need a better understanding of what the original poster
>    wants: detecting non-existent sites, or non-existent pages
>    on an existing site (or both).

Why does it matter?  My point is that even if the site exists, a dead link 
_still_ won't return a page.  So you need to be checking the server 
responses, not the contents of a page.
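
Something along these lines ought to do it (untested, and the URL is just 
the placeholder from the original post).  wget writes its progress messages 
to stderr, so -o is a handy way to drop the whole log into a file, and then 
you pick out the requests that actually drew a 404 from the server:

wget -r -p --spider -U Firefox -o wget.log "http://www.somesite.com/"
awk '/^--/ { url = $NF } /ERROR 404/ { print url }' wget.log > 404.txt

--spider makes wget check the links without keeping the pages; drop it if 
you want the mirror as well.  The awk just remembers the last URL wget 
announced (the lines starting with --) and prints it whenever the following 
status line says ERROR 404, so 404.txt ends up as a plain list of the dead 
links.
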
-- 
derek




