How can I extract sentences from text documents
Matt Palmer
mpalmer at hezmatt.org
Thu Dec 22 22:14:22 UTC 2005
On Thu, Dec 22, 2005 at 01:09:24PM -0600, Wade Smart wrote:
> Ok, this may be totally impossible but, I have about 1800 documents that
> have sentences inside [QUOTE] and sometimes [QUOTE] [QUOTE] or [QUOTE]
> [/QUOTE]. I don't know how many lines each document has - maybe 8 to
> 20k. Is there a way to copy all the sentences between the [QUOTE]
> [QUOTE] or [QUOTE] [/QUOTE] to a new file?
>
> This is way beyond my knowledge but if someone knows how this is done,
> if they would point me in the right direction - I would greatly
> appreciate it..
It's not a particularly difficult problem, from a programming perspective.
You simply scan the document looking for your opening tag ([QUOTE]) and then
start writing everything you subsequently read until you hit your closing
tag ([/QUOTE]).
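To make that concrete, here is a minimal sketch of the scan in Ruby. The
method name extract_quotes is made up for illustration, and two things are
assumptions on my part rather than anything stated in the question: quotes
are not nested, and a second [QUOTE] seen while already inside a quote is
treated as the closing tag (to cover the "[QUOTE] ... [QUOTE]" case).

```ruby
# Hypothetical helper: return the text found between quote tags.
# Assumptions (not from the original question): quotes are not nested, and
# a [QUOTE] seen while already inside a quote acts as a closing tag.
def extract_quotes(text)
  quotes  = []
  current = nil                          # nil means "outside a quote"

  # Splitting on a captured group keeps the tags themselves as tokens.
  text.split(/(\[\/?QUOTE\])/).each do |token|
    case token
    when "[QUOTE]"
      if current                         # already inside: treat as a closer
        quotes << current.strip
        current = nil
      else
        current = String.new             # start capturing
      end
    when "[/QUOTE]"
      if current
        quotes << current.strip
        current = nil
      end
    else
      current << token if current        # keep text only while inside
    end
  end

  quotes                                 # an unterminated quote is dropped
end
```

Run over each of the 1800 files in turn (reading with File.read and
appending the results to one output file), this should be all the program
needs to do, though the tag-handling rules would have to be checked against
the real documents.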
As an aside, the Ruby language has a primitive that makes coding
this near-trivial -- its range operator, used as a condition, can take
regex matches as its endpoints and stays true from the opening tag through
the closing tag, so printing everything in between is a one-liner. But even
in a lesser language <duck!> a simple state machine to handle your cases
should be a straightforward programming exercise.
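For the curious, the Ruby range trick looks roughly like the sketch below.
When a range of two boolean tests appears in a condition, Ruby treats it as
a "flip-flop": it switches on when the first test succeeds and off after
the second one does. The method name quoted_lines is hypothetical, and the
sketch works line by line, so it assumes a quote is always closed with
[/QUOTE] and it keeps whole lines, tags included.

```ruby
# Hypothetical helper: keep every line from one containing [QUOTE] through
# the next line containing [/QUOTE], inclusive.
# Assumption: every quote is closed with [/QUOTE]; the tag lines themselves
# are kept, as is any other text sharing a line with a tag.
def quoted_lines(lines)
  lines.select do |line|
    # A range inside a condition is a flip-flop: true from the line where
    # the left test succeeds until the line where the right test succeeds.
    true if (line.include?("[QUOTE]"))..(line.include?("[/QUOTE]"))
  end
end
```

Something like quoted_lines(File.readlines(path)) would then give the
quoted lines of one file. It is shorter than an explicit state machine, but
it does not cope with the "[QUOTE] ... [QUOTE]" case.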
Of course, if you can't program, then you're kinda up the spout. There's
probably search-and-replace functionality in editors you could use, but it
would probably end up being programming of a sort anyway. You could
probably get some kind-hearted soul to spend 10 minutes whipping something
up for you to run, or worst-case hire a programmer for an hour or two to
write the program for you. It wouldn't take longer than that, end-to-end,
including internal overheads, and I'd pay someone a couple of hundred bucks
to avoid manually cutting-and-pasting my way through 14.5 million lines of
text...
- Matt
--
"[the average computer user] has been served so poorly that he expects his
system to crash all the time, and we witness a massive worldwide
distribution of bug-ridden software for which we should be deeply ashamed."
-- Edsger Dijkstra