how to find out dead links

Loïc Grenié loic.grenie at gmail.com
Mon Nov 16 09:25:14 UTC 2009


2009/11/16 Derek Broughton <derek at pointerstop.ca>:
> Loïc Grenié wrote:
>
>> 2009/11/14 Derek Broughton <derek at pointerstop.ca>:
>>> Loïc Grenié wrote:
>>>
>>>> 2009/11/14 Eugeneapolinary Ju <eugeneapolinary81 at yahoo.com>:
>>>>> wget -r -p -U Firefox "http://www.somesite.com/" 2>&1 | grep 404 >
>>>>> 404.txt
>>>>>
>>>>>
>>>>> how come 404.txt is 0 bytes? how do I put the STDOUT into a file with wget?
>>>>
>>>> Have you tried
>>>>
>>>> wget -r -p -U Firefox "http://www.somesite.com/"
>>>>
>>>> There is no 404 message (at least here). To be more precise, there is
>>>> no 404 message because there is no web server that can output the
>>>> 404 message. A web page can fail for (at least) three different reasons:
>>>
>>> I imagine that "somesite.com" was an example, likely because his actual
>>> site isn't accessible to the Internet.
>>>
>>> The real problem is:
>>>
>>> $ wget http://localhost/test.htm
>>> --2009-11-14 10:43:23--  http://localhost/test.htm
>>> Resolving localhost... 127.0.0.1, ::1
>>> Connecting to localhost|127.0.0.1|:80... connected.
>>> HTTP request sent, awaiting response... 404 Not Found
>>> 2009-11-14 10:43:23 ERROR 404: Not Found.
>>>
>>>
>>> In this case, 404 is ONLY a status, and not a page.
>>
>>    Of course, but the status is delivered by a web server.
>>   We'll need a better understanding of what the first user
>>   wants: detect non-existing sites or non-existing pages
>>   on an existing site (or both).
>
> Why does it matter?  My point is that if the site exists, you _still_ won't
> get a page.  So you need to be checking the server responses, not the
> contents of a page.

    Well, yes, but since the original poster greps for 404, s/he will get a
  result *only if* the site exists (and the page does not). S/he will never
  know when the site *does not* exist: a failed DNS lookup or a refused
  connection produces no 404 at all. Testing for 404 is therefore the correct
  answer *only if* s/he knows beforehand that the site exists. Otherwise
  s/he *must* behave differently: s/he could try something like

if wget -q -U Firefox -O - "http://www.somesite.com/" | grep -q '<body'
then
    # Page exists: download it and its dependencies, reporting broken links
    wget -r -p -U Firefox "http://www.somesite.com/" 2>&1 | grep 404
else
    # Either the site is unreachable or the page does not exist
    echo "Page does not exist" >&2
    exit 1
fi
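    For what it's worth, recent versions of wget (1.12 and later, if I
  remember correctly) document distinct exit statuses, so the two failure
  modes can be told apart without grepping the output at all. A rough sketch
  (the helper name and the URL are only illustrations, not anything the
  original poster used):

```shell
# Sketch under an assumption: wget >= 1.12 exits with 0 on success,
# 4 on a network failure (e.g. DNS lookup failed, connection refused)
# and 8 when the server issued an error response (e.g. 404).
classify_wget_status() {
    case "$1" in
        0) echo "page exists" ;;
        4) echo "site unreachable (network/DNS failure)" ;;
        8) echo "server returned an error (e.g. 404)" ;;
        *) echo "other wget failure (exit code $1)" ;;
    esac
}

# Usage (URL is a placeholder; --spider checks without downloading):
#   wget -q --spider -U Firefox "http://www.somesite.com/"
#   classify_wget_status $?
```

  With that, "site does not exist" and "page does not exist on an existing
  site" become two different branches instead of one silent case.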

    I still think my question stands: does the original poster *know* that
  the initial page (and, more precisely, the site) exists?

     Cheers,

           Loïc




More information about the ubuntu-users mailing list