how to find out dead links
Loïc Grenié
loic.grenie at gmail.com
Mon Nov 16 09:25:14 UTC 2009
2009/11/16 Derek Broughton <derek at pointerstop.ca>:
> Loïc Grenié wrote:
>
>> 2009/11/14 Derek Broughton <derek at pointerstop.ca>:
>>> Loïc Grenié wrote:
>>>
>>>> 2009/11/14 Eugeneapolinary Ju <eugeneapolinary81 at yahoo.com>:
>>>>> wget -r -p -U Firefox "http://www.somesite.com/" 2>&1 | grep 404 >
>>>>> 404.txt
>>>>>
>>>>>
>>>>> why come 404.txt is 0 Byte? how to put the STDOUT to a file with wget?
>>>>
>>>> Have you tried
>>>>
>>>> wget -r -p -U Firefox "http://www.somesite.com/"
>>>>
>>>> There is no 404 message (at least here). To be more precise, there is
>>>> no 404 message because there is no web server that can output the
>>>> 404 message. A web page can fail for (at least) three different reasons:
>>>
>>> I imagine that "somesite.com" was an example, likely because his actual
>>> site isn't accessible to the Internet.
>>>
>>> The real problem is:
>>>
>>> $ wget http://localhost/test.htm
>>> --2009-11-14 10:43:23-- http://localhost/test.htm
>>> Resolving localhost... 127.0.0.1, ::1
>>> Connecting to localhost|127.0.0.1|:80... connected.
>>> HTTP request sent, awaiting response... 404 Not Found
>>> 2009-11-14 10:43:23 ERROR 404: Not Found.
>>>
>>>
>>> In this case, 404 is ONLY a status, and not a page.
>>
>> Of course, but the status is delivered by a web server.
>> We'll need a better understanding of what the first user
>> wants: detect non-existing sites or non-existing pages
>> on an existing site (or both).
>
> Why does it matter? My point is that if the site exists, you _still_ won't
> get a page. So you need to be checking the server responses, not the
> contents of a page.
Well, yes, but since the original poster greps for 404, s/he will get a result
*only if* the site exists (and the page does not). S/he will never find out
that the site *does not* exist. If s/he wants to know whether the page is
available, testing for 404 is the correct answer *only if* s/he knows
beforehand that the site exists. Otherwise s/he *must* proceed differently:
s/he could try something like
if wget -q -U Firefox -O - "http://www.somesite.com/" | grep -q '<body'
then
    # Page exists, download it and its dependencies
    wget -r -p -U Firefox "http://www.somesite.com/" 2>&1 | grep 404
else
    # Problem
    echo "Page does not exist" >&2
    exit 1
fi
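Note that a plain "grep 404" on the wget log only shows the response lines, which do not contain the URL. As a rough sketch (not tested against every wget version), the log can be scanned so that each 404 is paired with the request line that precedes it. The function name list_404_urls is made up for this example, and it assumes the English log format shown earlier in this thread, so it is safer to run wget with LC_ALL=C.

```shell
# list_404_urls (hypothetical helper): read a wget log on stdin and
# print the URL of every request that was answered with a 404.
# Assumes the untranslated log format quoted earlier in the thread:
#   --2009-11-14 10:43:23-- http://localhost/test.htm
#   HTTP request sent, awaiting response... 404 Not Found
list_404_urls() {
    awk '
        # Request line: starts with "--<timestamp>--", last field is the URL.
        /^--[0-9]/ { url = $NF }
        # Response line reporting a 404: print the URL remembered above.
        /awaiting response\.\.\. 404/ { print url }
    '
}
```

It could then be used as

LC_ALL=C wget -r -p -U Firefox "http://www.somesite.com/" 2>&1 | list_404_urls > 404.txt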
I still think my question remains: does the original poster *know* that the
initial page (and more precisely, the site) exists?
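If the answer is "no", one way to tell the two cases apart without looking at the page contents is wget's exit status. This is only a sketch and rests on an assumption: the wget manual documents distinct exit codes (4 for a network failure, 8 for an error response from the server) starting with version 1.12, as far as I know, so older versions will not behave this way. The names classify_wget_status and check_url are invented for the example.

```shell
# classify_wget_status (hypothetical helper): translate a wget exit
# status into the kind of failure it indicates. The codes are the ones
# listed in the wget manual; assumed to apply to wget 1.12 or later.
classify_wget_status() {
    case "$1" in
        0) echo "ok" ;;                      # page was retrieved
        4) echo "network failure" ;;         # e.g. host not found, no connection
        8) echo "server error response" ;;   # site answered, e.g. with a 404
        *) echo "other failure" ;;
    esac
}

# check_url (hypothetical helper): probe one URL without saving it
# and report which of the cases above applies.
check_url() {
    wget -q --spider -U Firefox "$1"
    classify_wget_status "$?"
}
```

For example, check_url "http://www.somesite.com/" would print "network failure" when the site cannot be reached and "server error response" when the site exists but the page does not.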
Cheers,
Loïc