Off-Topic: Parse an html file and transfer the text found

Markus Schönhaber ubuntu-users at list-post.mks-mail.de
Wed Aug 6 08:08:55 UTC 2008


John Toliver wrote:

> So my question to start is which language should I use to pull the
> data out of an html file? 

The one that you're familiar with is, IMO, the primary choice.

> Is perl better for this application, or is
> python better or some other language?

I'm not too familiar with Perl but have done quite a bit of Python
programming over the years, so my view is not unbiased. Nevertheless,
I doubt that either has a massive advantage over the other when it
comes to text processing.

> I'm probably going to need to brush up on my regular expressions for
> this but that's ok too.
> 
> Any thoughts would be appreciated...

To extract data from HTML, there are two ways to approach the problem
that seem obvious to me:
1. See HTML as text.
2. See HTML as structured data.

In the first case, you could use regular expressions (REs) to extract
the wanted data. To me, it seems that this is what you have in mind.
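
A minimal sketch of the text-based approach in Python, assuming the
wanted data sits in plain <td> cells. The sample HTML and the names
CELL_RE and extract_cells are made up for illustration; your real file
will need its own pattern.

```python
import re

# Hypothetical input: the actual HTML file is not shown in this thread.
html = """<table>
<tr><td>Alice</td><td>42</td></tr>
<tr><td>Bob</td><td>17</td></tr>
</table>"""

# Non-greedy match of everything between <td ...> and </td>.
# DOTALL lets a cell span several lines; IGNORECASE accepts <TD> too.
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)

def extract_cells(text):
    """Return the stripped contents of every table cell found in text."""
    return [cell.strip() for cell in CELL_RE.findall(text)]

print(extract_cells(html))
```

This works as long as the markup is regular. Nested tables or tags
inside the cells would probably need a more careful pattern or a real
parser.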

In the second case, you could use an appropriate parser that helps you
navigate the document and access the wanted data.
For example: depending on the quality of the HTML document, it might
already be well-formed XML (or could easily be converted to it using
something like HTML Tidy). You could then load it with an XML parser and
use its methods to navigate to the data you're interested in.
You could even use XSLT to print out the desired SQL statements and do
no Python/Perl/whatever programming at all.
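
A rough sketch of the structured approach in Python, using the standard
library's xml.etree.ElementTree. It assumes the document is already
well-formed XML and carries no XML namespace. The sample markup, the
table name "people" and the helpers sql_quote and rows_to_sql are
invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed input (e.g. after a cleanup pass).
xhtml = """<html><body><table>
<tr><td>Alice</td><td>42</td></tr>
<tr><td>Bob</td><td>17</td></tr>
</table></body></html>"""

def sql_quote(value):
    """Quote a string as an SQL literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"

def rows_to_sql(document, table="people"):
    """Build one INSERT statement per table row that contains cells."""
    root = ET.fromstring(document)
    statements = []
    for tr in root.iter("tr"):
        # itertext() also collects text nested in child tags like <b>.
        values = ["".join(td.itertext()).strip() for td in tr.findall("td")]
        if values:
            quoted = ", ".join(sql_quote(v) for v in values)
            statements.append("INSERT INTO %s VALUES (%s);" % (table, quoted))
    return statements

for statement in rows_to_sql(xhtml):
    print(statement)
```

If the file declares the XHTML namespace, the tag names passed to
iter() and findall() would have to include it. For real database work,
parameterized queries from your database module are safer than building
SQL strings by hand.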

To sum things up: IMO there is no single best way and no single best
programming language to get the desired result. What's best for you
largely depends on what you're familiar with and what matches your
personal preference best.

Regards
  mks



