How can I extract sentences from text documents

Matt Palmer mpalmer at hezmatt.org
Thu Dec 22 22:14:22 UTC 2005


On Thu, Dec 22, 2005 at 01:09:24PM -0600, Wade Smart wrote:
> Ok, this may be totally impossible but, I have about 1800 documents that 
> have sentences inside [QUOTE] and sometimes [QUOTE] [QUOTE] or [QUOTE] 
> [/QUOTE]. I don't know how many lines each document has - maybe 8 to 
> 20k. Is there a way to copy all the sentences between the [QUOTE] 
> [QUOTE] or [QUOTE] [/QUOTE] to a new file?  
> 
> This is way beyond my knowledge but if someone knows how this is done, 
> if they would point me in the right direction - I would greatly 
> appreciate it..

It's not a particularly difficult problem, from a programming perspective. 
You simply scan the document looking for your opening tag ([QUOTE]) and then
start writing everything you subsequently read until you hit your closing
tag ([/QUOTE]).
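
A minimal sketch of that scan in Python, for illustration only.  It assumes
the tags appear exactly as [QUOTE] and [/QUOTE], that a whole document fits
in memory, and that a quote with no closing tag should be ignored.  The
function name extract_quotes is made up for this example.

```python
OPEN_TAG = "[QUOTE]"
CLOSE_TAG = "[/QUOTE]"


def extract_quotes(text):
    """Return the text found between each OPEN_TAG and the next CLOSE_TAG."""
    quotes = []
    pos = 0
    while True:
        # Look for the next opening tag, starting where we left off.
        start = text.find(OPEN_TAG, pos)
        if start == -1:
            break  # no more opening tags
        start += len(OPEN_TAG)
        # Everything up to the next closing tag belongs to this quote.
        end = text.find(CLOSE_TAG, start)
        if end == -1:
            break  # unterminated quote: ignored in this sketch
        quotes.append(text[start:end].strip())
        pos = end + len(CLOSE_TAG)
    return quotes
```

To process a pile of documents you would read each file, call the function
on its contents, and append the results to one output file.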

As an aside, the Ruby language has a language primitive that makes coding
this near-trivial -- its range operator can take regexes or strings as its
start and end conditions, and it'll select everything between the opening
and closing tags.  But even in a lesser language <duck!> a small state
machine to handle your cases should be straightforward to write.
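
As a rough idea of what such a state machine could look like, here is a
Python sketch.  It rests on a guess about the question: that a quote may be
closed either by [/QUOTE] or by a second [QUOTE].  If the documents actually
behave differently, the transitions would need adjusting.  The name
extract_quotes_lenient is invented for this example.

```python
import re

# Split on either tag, keeping the tags themselves as separate tokens.
TAG_PATTERN = re.compile(r"(\[/?QUOTE\])")


def extract_quotes_lenient(text):
    """Collect text between [QUOTE] and the next [QUOTE] or [/QUOTE]."""
    quotes = []
    inside = False  # the single piece of state: are we inside a quote?
    current = []
    for token in TAG_PATTERN.split(text):
        is_tag = token in ("[QUOTE]", "[/QUOTE]")
        if not inside:
            if token == "[QUOTE]":
                inside = True
                current = []
            # Plain text and stray closing tags outside a quote are skipped.
        elif is_tag:
            # Assumption: either tag ends the current quote.
            quotes.append("".join(current).strip())
            inside = False
        else:
            current.append(token)
    # A quote still open at end of input is dropped in this sketch.
    return quotes
```

The only state is the inside flag, so supporting another case means adding
or changing one branch rather than restructuring the loop.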

Of course, if you can't program, then you're kinda up the spout.  There's
probably search-and-replace functionality in editors you could use, but it
would end up being programming of a sort anyway.  You could likely get some
kind-hearted soul to spend 10 minutes whipping something up for you to run.
Worst case, hire a programmer for an hour or two to write the program for
you; it wouldn't take longer than that, end-to-end, including internal
overheads.  I'd pay someone a couple of hundred bucks to avoid manually
cutting-and-pasting my way through 14.5 million lines of text...

- Matt

-- 
"[the average computer user] has been served so poorly that he expects his
system to crash all the time, and we witness a massive worldwide
distribution of bug-ridden software for which we should be deeply ashamed."
		-- Edsger Dijkstra




More information about the ubuntu-users mailing list