How can I extract sentences from text documents

Mike Bird mgb-ubuntu at yosemite.net
Fri Dec 23 04:11:04 UTC 2005


On Thu, 2005-12-22 at 18:45, Wade Smart wrote:
> Take this email for example. She writes as simple as we speak back and
> forth. And then she was just drop in a quote, [QUOTE] The positive
> thing about writing is that you connect with yourself in the deepest
> way, and that's heaven. You get a chance to know who you are, to know
> what you think. You begin to have a relationship with your mind.
> [/QUOTE]  And then keep talking from there. 

This sed script works provided that you don't have square
brackets anywhere except in those tags.  If you do, your best
bet is to find a couple of characters which aren't used
anywhere, use them to replace the [QUOTE] and [/QUOTE] tags, and
work from there.  If you can't find two such characters, you'd be
better off scanning the files with Perl.

sed -n -e H -e '${x;
s/\n/ /g;
s/^[^[]*\[QUOTE\]//;
s/\[\/QUOTE\][^]]*$//;
s/\[\/QUOTE\]\([^[]*\)\[QUOTE\]/\n/g;
p}' <infile.txt >quotefile
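
If your files do contain other square brackets, the two-character
fallback could be sketched roughly like this.  It assumes GNU sed
(for \n in the replacement) and that { and } never occur in your
text.  Those two characters and the function name are only my picks
for illustration, so swap in whatever is unused in your files.

```shell
# Sketch only: assumes GNU sed and that { and } appear nowhere in the input.
# extract_quotes_marked is a made-up name for illustration.
extract_quotes_marked() {
  sed -n -e H -e '${x;
s/\n/ /g;
s/\[QUOTE\]/{/g;
s/\[\/QUOTE\]/}/g;
s/^[^{]*{//;
s/}[^}]*$//;
s/}[^{]*{/\n/g;
p}'
}

# Example run on inline text that has stray brackets outside the quotes.
printf 'see [1] a [QUOTE] one [/QUOTE] b [2]\nc [QUOTE] two [/QUOTE] d\n' |
  extract_quotes_marked
```

Once the tags are swapped for single characters, the remaining
patterns only have to avoid those characters, so ordinary brackets
in the surrounding text no longer get in the way.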

You can skip the newlines after the semicolons.  One normally
does.  I only put them in so the mail program wouldn't break
the lines at places that wouldn't be OK.
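
For reference, the same commands joined on one line would look
something like the following.  This assumes GNU sed, and the
extract_quotes wrapper is just a made-up name so the script can be
fed from a pipe instead of infile.txt.

```shell
# Same commands as the multi-line script, joined on one line.
# Assumes GNU sed; extract_quotes is a hypothetical wrapper name.
extract_quotes() {
  sed -n -e H -e '${x;s/\n/ /g;s/^[^[]*\[QUOTE\]//;s/\[\/QUOTE\][^]]*$//;s/\[\/QUOTE\]\([^[]*\)\[QUOTE\]/\n/g;p}'
}

# Example run: prints one extracted quote per line.
printf 'a [QUOTE] one [/QUOTE] b\nc [QUOTE] two [/QUOTE] d\n' | extract_quotes
```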

If you have a bunch of text files and want one quote file with
duplicates removed, it could look like this:

for f in *.txt; do
sed -n -e H -e '${x;
s/\n/ /g;
s/^[^[]*\[QUOTE\]//;
s/\[\/QUOTE\][^]]*$//;
s/\[\/QUOTE\]\([^[]*\)\[QUOTE\]/\n/g;
p}' <"$f"; done | sort | uniq >quotefile
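
One thing to watch with a whole directory: a file with no [QUOTE]
in it matches none of the substitutions, so its entire text gets
printed as one line.  A possible variation, again assuming GNU sed,
skips such files and trims the padding spaces so that duplicates
differing only in whitespace collapse.  The collect_quotes name is
made up for this sketch, and sort -u stands in for sort | uniq.

```shell
# Sketch only: assumes GNU sed. collect_quotes is a hypothetical helper name.
collect_quotes() {
  for f in "$@"; do
    # Skip files without an opening tag; sed would print them whole.
    grep -qF '[QUOTE]' "$f" || continue
    sed -n -e H -e '${x;
s/\n/ /g;
s/^[^[]*\[QUOTE\]//;
s/\[\/QUOTE\][^]]*$//;
s/\[\/QUOTE\][^[]*\[QUOTE\]/\n/g;
p}' <"$f"
  done |
    sed 's/^ *//; s/ *$//' |   # trim leading and trailing spaces
    sort -u                    # sort and drop duplicate quotes
}
```

You would call it as: collect_quotes *.txt >quotefile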

Enjoy,

--Mike Bird




