how to find out dead links

Loïc Grenié loic.grenie at
Mon Nov 16 09:25:14 UTC 2009

2009/11/16 Derek Broughton <derek at>:
> Loïc Grenié wrote:
>> 2009/11/14 Derek Broughton <derek at>:
>>> Loïc Grenié wrote:
>>>> 2009/11/14 Eugeneapolinary Ju <eugeneapolinary81 at>:
>>>>> wget -r -p -U Firefox "" 2>&1 | grep 404 > 404.txt
>>>>> Why is 404.txt 0 bytes? How do I put wget's output into a file?
>>>> Have you tried
>>>> wget -r -p -U Firefox ""
>>>> There is no 404 message (at least here). To be more precise, there is
>>>> no 404 message because there is no web server that can output the
>>>> 404 message. A web page can fail for (at least) three different reasons:
>>> I imagine that "" was an example, likely because his actual
>>> site isn't accessible to the Internet.
>>> The real problem is:
>>> $ wget http://localhost/test.htm
>>> --2009-11-14 10:43:23--  http://localhost/test.htm
>>> Resolving localhost..., ::1
>>> Connecting to localhost||:80... connected.
>>> HTTP request sent, awaiting response... 404 Not Found
>>> 2009-11-14 10:43:23 ERROR 404: Not Found.
>>> In this case, 404 is ONLY a status, and not a page.
>>    Of course, but the status is delivered by a web server.
>>   We'll need a better understanding of what the first user
>>   wants: detect non-existing sites or non-existing pages
>>   on an existing site (or both).
> Why does it matter?  My point is that if the site exists, you _still_ won't
> get a page.  So you need to be checking the server responses, not the
> contents of a page.

    Well, yes, but since the first person greps for 404, s/he will get a result
  *only if* the site exists (and the page does not). S/he will never know when
  the site *does not* exist. If s/he wants to know whether the page is
  available, testing for 404 is the correct answer *only if* s/he knows
  beforehand that the site exists. Otherwise s/he *must* behave differently:
  s/he could try something like

if wget -q -U Firefox -O - "" | grep -q '<body'
then
    # Page exists, download it and its dependencies
    wget -r -p -U Firefox "" 2>&1 | grep 404
else
    # Problem: the site or the page could not be fetched
    echo "Page does not exist" >&2
    exit 1
fi
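
    Another way to tell "site does not exist" from "page does not exist",
  without looking at the page contents, might be wget's exit status. This is
  only a sketch: it assumes wget 1.12 or later, where the manual documents
  exit status 4 as a network failure and 8 as an error response from the
  server. The names classify_status and check_url are made up for the
  example.

```shell
#!/bin/sh
# Sketch only: assumes wget >= 1.12, which documents these exit statuses.

# classify_status (hypothetical helper): turn a wget exit status into a message.
classify_status() {
    case "$1" in
        0) echo "page exists" ;;
        4) echo "network failure: site unreachable or nonexistent" ;;
        8) echo "server error response: page missing or refused" ;;
        *) echo "other wget error (status $1)" ;;
    esac
}

# check_url (hypothetical helper): probe a URL without saving it, then classify.
check_url() {
    wget -q --spider -U Firefox "$1"
    classify_status "$?"
}
```

    Something like check_url "http://localhost/test.htm" should then report
  a server error response for the 404 case quoted above. Note that status 8
  covers any error response, not only 404.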

    I still think my question remains: does the first person *know* that the
  initial page (and, more precisely, the site) exists?
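
    If the site is known to exist, a plain "grep 404" on the log can also
  match timestamps or sizes, and it does not say which link is dead. A
  rough sketch of a filter that prints the URL of each request ending in
  "ERROR 404" follows. It assumes the log looks like the localhost example
  quoted above (a line starting with "--" and ending with the URL, later a
  line containing "ERROR 404"), and that wget's messages are in English.
  The name list_dead_links is made up.

```shell
#!/bin/sh
# list_dead_links (hypothetical helper): read a wget log on stdin and print
# the URL of every request whose log entry ends in "ERROR 404".
list_dead_links() {
    awk '
        /^--/        { url = $NF }    # request line: remember the URL (last field)
        / ERROR 404/ { print url }    # error line: report the remembered URL
    '
}
```

    It could be used as
  wget -r -p -U Firefox "$url" 2>&1 | list_dead_links > 404.txt
  where $url stands for the site to check. wget writes its log to stderr,
  which is why the 2>&1 is needed before the pipe.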



More information about the ubuntu-users mailing list