Make a word list from a text

Donn donn.ingle at gmail.com
Sun Aug 3 03:17:32 UTC 2008


On Saturday, 02 August 2008 05:52:46 Wulfy wrote:
> I want to take a text file and extract all the words and sort them into
> a unique list. 
I gave it a go and this is the best I can do:
cat myfile | sed "s/'//g" | tr -s '[:space:][:punct:]' "\n" | sort | uniq -c

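The sed step removes single quotes so that words like "didn't" don't get
split into "didn" and "t". tr then replaces each run of whitespace or
punctuation with a single newline, and the result goes to sort and uniq -c,
which prints each distinct word with a count of how often it appears.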
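
If a plain unique list is wanted rather than counts, the same pipeline can be
adapted. This is only a sketch: the function name wordlist is made up for
illustration, and the case folding and blank-line removal are additions that
are not part of the command above. It assumes a tr that pads the second set
with its last character, as the GNU and BSD versions do.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): read text on standard
# input and print each distinct word once, lowercased, one per line.
wordlist() {
    sed "s/'//g" |                       # drop apostrophes so "didn't" stays one word
    tr '[:upper:]' '[:lower:]' |         # fold case so "The" and "the" are the same word
    tr -s '[:space:][:punct:]' '\n' |    # each run of separators becomes one newline
    sed '/^$/d' |                        # drop the empty line left by leading punctuation
    sort -u                              # sort and keep one copy of each word
}

# Sample input for illustration only.
printf '%s\n' "The cat didn't see the other cat." | wordlist
```

Run it against a file with `wordlist < myfile`. The sample line at the end
should print cat, didnt, other, see and the, one per line.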

I find text parsing very hard to do. There seem to be corner cases everywhere.
What is a word, really? How do you define its edges? Ah well, HTH.
\d



