How to format text for normal reading
Aart Koelewijn
aart at mtack.xs4all.nl
Sat Nov 6 14:44:27 UTC 2010
On Sat, 06 Nov 2010 08:58:43 +0000, user1 wrote:
> I tried to do this:
>
> for file in *.html; do html2text -o "${file%.*}.txt" "$file" ; done
>
> I found it here:
> http://commandline.org.uk/command-line/converting-html-to-text/
>
> That works fine, but when I then cat all the single text files into one
> big text file I still need to format this big file, to make it readable.
>
> So my problem is not really only an HTML problem, but how to make any
> badly formatted text file readable. That is, to make each paragraph
> stand out, with full lines ending in a dot, and with strange
> characters/character phrases removed.
>
> Here follow 3 examples of some characters I want removed:
>
> � &quot;
It looks like you have a problem with character encoding. This can
usually be tackled with the program "recode". The &quot; gives the
impression that some HTML character entities are still in place. To
convert these to UTF-8 you could use "recode HTML..UTF-8 file", which
rewrites the file in place. recode can do much more; see "man recode"
for all the possibilities.
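
If you would rather not overwrite the file, recode also works as a
filter when it is given no file name. Below is a minimal sketch of
that, not a tested recipe for your files. The function name
decode_entities and the sed fallback are my own assumptions; the
fallback only knows four common entities and is no substitute for
recode.

```shell
#!/bin/sh
# Decode HTML entities read from stdin and write the result to stdout.
# decode_entities is a hypothetical helper name, not part of recode.
decode_entities() {
    if command -v recode >/dev/null 2>&1; then
        # With no file argument, recode reads stdin and writes stdout.
        recode HTML..UTF-8
    else
        # Fallback (assumption): handles only &quot; &lt; &gt; &amp;.
        # &amp; is decoded last so "&amp;quot;" is not decoded twice.
        sed -e 's/&quot;/"/g' \
            -e 's/&lt;/</g' \
            -e 's/&gt;/>/g' \
            -e 's/&amp;/\&/g'
    fi
}

# Small demonstration on one line of text.
printf '%s\n' 'He said &quot;hello&quot; &amp; left' | decode_entities
```

For the big concatenated file, something like
"decode_entities < big.txt > big-clean.txt" should do, where both
file names are only placeholders.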
--
Aart