Off-Topic: Parse an html file and transfer the text found

Leo Cacciari leo.cacciari at gmail.com
Wed Aug 6 09:22:43 UTC 2008


On Wed, 06/08/2008 at 10.08 +0200, Markus Schönhaber wrote:
> John Toliver wrote:
> 
> > So my question to start is which language should I use to pull the
> > data out of an html file? 
> 
> The one that you're familiar with is, IMO, the primary choice.
> 
> > Is perl better for this application, or is
> > python better or some other language?
> 
> I'm not too familiar with Perl but have done quite some Python
> programming over the years. Therefore I don't have an unbiased view in
> this regard, nevertheless I doubt that one has a massive advantage over
> the other when it comes to text processing.
> 

Well, I tend to disagree, but then I'm Perl-biased, so maybe my
advice should be taken "cum grano salis" :)

> > I'm probably going to need to brush up on my regular expressions for
> > this but that's ok too.
> > 
> > Any thoughts would be appreciated...
> 
There is a wonderful book on regular expressions in the O'Reilly series
that explains how to use them in different languages: "Mastering Regular
Expressions", by Jeffrey Friedl.

If you decide on Perl (not PERL, that is another thing...), you might
find the HTML::Tree module useful
(http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Tree.pm)
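
For readers leaning towards Python instead, here is a minimal sketch of
the same idea: let a real HTML parser walk the document and collect the
text, rather than relying on regular expressions alone. It is not
HTML::Tree; it assumes Python 3 and uses only the standard html.parser
module. The names TextExtractor and extract_text are made up for
illustration, and skipping <script> and <style> is an assumption about
what counts as "text".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text chunks, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: these hold no useful text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of text chunks found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = "<html><body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body></html>"
    for chunk in extract_text(sample):
        print(chunk)
```

To use it on a file, read the file into a string and pass it to
extract_text(); what to do with the chunks afterwards (write them to
another file, a database, ...) depends on where the text has to go.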



> ...snip....
> To sum things up: IMO there is not the one best way or the one best
> programming language to get the desired result. What's best for you
> largely depends on what you're familiar with and what matches your
> personal preference best.
> 

And this is nothing but the truth :) 
 
Enjoy
-- 
Leo Cacciari
