extract info from web pages

Adam Funk a24061 at ducksburg.com
Thu Mar 22 12:52:47 UTC 2007


On 2007-03-22, Dimitri Mallis wrote:

> no, it's for a university 3rd-year computer science project.
> i thought wget only downloads the whole website, i.e. it makes a mirror
> of it on my hard drive, which isn't quite what i want, but i'll man wget
> in case you are talking about something else.
>
> i was hoping for some script where i could type the URL and the key
> words, and it would extract the information into a new page on my hard
> drive...

You could use `lynx -dump http://www.example.com/foo.html` to get a
"textualized" version of the page, then process it with perl, sed and
awk, or other tools.  Using `lynx -dump` instead of wget saves you
from having to parse the HTML.  
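
For example, something along these lines (the URL and the keyword are
just placeholders) would keep only the lines of the dumped text that
mention your keyword and save them to a new file:

    lynx -dump http://www.example.com/foo.html \
        | perl -ne 'print if /keyword/i' > extracted.txt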

Perl's LWP modules can do the fetching and HTML-stripping for you, but
if you don't already know some Perl, you'd need to learn quite a bit.
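
Here's a rough sketch of that approach.  Note that the actual
HTML-stripping below comes from HTML::TreeBuilder and HTML::FormatText
(the HTML-Tree and HTML-Format distributions) rather than LWP itself,
and the script name and keyword handling are only illustrative:

    #!/usr/bin/perl
    # Sketch: fetch a page, flatten it to plain text, and print the
    # lines that mention a keyword given on the command line.
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TreeBuilder;
    use HTML::FormatText;

    my ($url, $keyword) = @ARGV;
    die "usage: $0 URL KEYWORD\n" unless defined $keyword;

    my $html = get($url);                       # LWP::Simple fetch
    die "couldn't fetch $url\n" unless defined $html;

    # Parse the HTML and render it as plain text, much like lynx -dump
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => 72)
                               ->format($tree);
    $tree->delete;

    # Keep only the lines that mention the keyword
    for my $line (split /\n/, $text) {
        print "$line\n" if $line =~ /\Q$keyword\E/i;
    }

You'd run it as something like
`./extract.pl http://www.example.com/foo.html keyword > page.txt`.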


Also, please don't top-post or post in HTML.

http://www.expita.com/nomime.html
http://www.xs4all.nl/~hanb/documents/quotingguide.html


-- 
()  ascii ribbon campaign - against html e-mail 
/\  www.asciiribbon.org   - against proprietary attachments




