How to format text for normal reading

Aart Koelewijn aart at mtack.xs4all.nl
Sat Nov 6 14:44:27 UTC 2010


On Sat, 06 Nov 2010 08:58:43 +0000, user1 wrote:

> I tried to do this:
> 
> for file in *.html; do html2text -o "${file%.*}.txt" "$file" ; done
> 
> I found it here:
> http://commandline.org.uk/command-line/converting-html-to-text/
> 
> That works fine, but when I then cat all the single text files into one
> big text file I still need to format this big file, to make it readable.
> 
> So my problem is not really only an HTML problem, but how to make any
> badly formatted text file readable. That is, to make each paragraph
> stand out, with full lines ending in a period, and to remove strange
> characters and character sequences.
> 
> Here are 3 examples of characters I want removed:
> 
> � &quot; ’

It looks like you have a problem with character encoding. Such problems
can usually be tackled with the program "recode". The &quot; gives the
impression that there is still some HTML entity encoding in place. To
convert this to UTF-8 you could use "recode HTML..UTF-8 file". You can do
much more with recode; see "man recode" for all the possibilities.
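
As a rough sketch of how this could be combined with your loop, assuming
recode is installed and the html2text output is in *.txt files in the
current directory: the function name "decode_entities" and the
".clean.txt" suffix are only illustrative names I made up here. Note that
recode rewrites a file in place when given a file name, so this sketch
uses it as a filter instead and leaves the originals alone.

```shell
#!/bin/sh
# Sketch only: decode leftover HTML entities in each converted text file.
# "decode_entities" and the ".clean.txt" suffix are hypothetical names.

decode_entities() {
    # With no file argument, recode filters stdin to stdout,
    # so the input file is never overwritten.
    recode HTML..UTF-8
}

for file in *.txt; do
    # If there are no .txt files the glob stays literal; skip it.
    [ -e "$file" ] || continue
    # Skip output files left over from an earlier run.
    case "$file" in
        *.clean.txt) continue ;;
    esac
    decode_entities < "$file" > "${file%.txt}.clean.txt"
done
```

If the result looks right, the .clean.txt files can be concatenated with
cat in place of the originals. This only deals with the HTML entities; a
character such as � usually means bytes were already read in the wrong
charset, and that would need a separate look at the source encoding.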

-- 
Aart





More information about the ubuntu-users mailing list