Removing 'bad'chrs from a file

Sandy Harris sandyinchina at gmail.com
Sat Oct 9 13:06:11 UTC 2010


On Sat, Oct 9, 2010 at 11:45 AM, rikona <rikona at sonic.net> wrote:

First thing is to look for docs on the file format involved.
If there's a man page, it is in section 5 on Linux (section 4
on some older System V-derived Unixes). If not, the format
should be documented somewhere. If you really need
the data and cannot find docs, "Use the source, Luke."

> The kate error mentioned utf8 [IIRC]. I was not able to determine if
> an all-utf8 file was OK for TB, or even useful. It may not index utf8
> at all, for all I know. Perhaps it is worth a try to convert all the
> chrs to utf8 and/or plain ASCII and see what happens - but I don't
> know how to do that. If you know a way, perhaps that is worth a try.

I'm a Unix old fart, using it for almost 30 years, so what I suggest
may be out of date. There's almost certainly an easier way with
Perl or Python. However, here's what I know. This has all been
in Unix since at least 7th Edition, 1979 or thereabouts.

All of it was once documented in man pages. GNU versions
may be documented in info. Either way, there should be
complete docs available.

tr will do anything you like that involves only a one-to-one
mapping on 8-bit characters. You express what you want
as two lists of characters: any character on the first
list is always replaced by the matching character
from the second list. Or give a list and a single
character, to replace anything from that list with the
character, e.g. tr -c 'a-zA-Z' '\012' turns all
non-alphabetic characters into newlines. Quote the
list so the shell does not try to expand it; GNU tr
wants the range without square brackets.
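
A minimal sketch of both forms, assuming GNU tr as shipped
with Ubuntu. The sample strings are made up, and the -s
option (squeeze repeated output characters into one) is my
addition so that runs of punctuation give a single newline.

```shell
# Two lists: each character in the first list is replaced by the
# character at the same position in the second list.
upper=$(printf 'hello world\n' | tr 'a-z' 'A-Z')
echo "$upper"

# A complemented list and a single character: every character that
# is NOT a letter becomes a newline; -s squeezes runs of newlines,
# leaving one word per line.
words=$(printf 'one, two\n' | tr -cs 'a-zA-Z' '\012')
echo "$words"
```

Piping the second form into sort | uniq -c is the classic
way to get a word count from a text file.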

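tr can change all non-ASCII bytes (values >= 128)
into ASCII characters of your choosing, or into
blanks or whatever. Or delete them, or delete null
characters, or ... Note that GNU tr works byte by
byte, so one multi-byte UTF-8 character looks like
several characters to it.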
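
As a rough sketch of that, again assuming GNU tr: the sample
input is invented ("caf\303\251" is "café" encoded as UTF-8,
where the é takes two bytes), and LC_ALL=C is set so the
octal range is read as plain byte values.

```shell
# Delete NUL bytes and every byte with the high bit set
# (octal 200 through 377, i.e. values 128-255).
clean=$(printf 'caf\303\251 ok\000\n' | LC_ALL=C tr -d '\000\200-\377')
echo "$clean"     # caf ok

# Or keep a visible marker instead: map each high-bit byte to '?'.
# '[?*]' means "repeat ? as often as needed to match the first list".
marked=$(printf 'caf\303\251 ok\n' | LC_ALL=C tr '\200-\377' '[?*]')
echo "$marked"    # caf?? ok  (two bytes, so two question marks)
```

On a real file the same commands read from it with
< infile > outfile, as in the sed examples. Whether the
stripped file is then acceptable to the program that reads
it is something only a test will show.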

If anything you want done needs to depend on
more than a single character of input, tr cannot
help.

Sed(1) is a stream editor with roughly the same
commands as the vi(1) command line. Example:

sed 's/abc/ABC/' < infile > outfile

Reads one file, applies the editor command
to every line, writes another file.
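
A small sketch of what that substitute command does, using
made-up input on a pipe instead of files. One thing worth
seeing for yourself: without a trailing g, only the first
match on each line is changed.

```shell
# Without the g flag, sed replaces the first match on each line only.
first=$(printf 'abc abc\nxyz abc\n' | sed 's/abc/ABC/')
echo "$first"

# With the g flag, every match on every line is replaced.
all=$(printf 'abc abc\nxyz abc\n' | sed 's/abc/ABC/g')
echo "$all"
```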

sed -f scriptfile < infile > outfile

Applies a whole script of editor commands
to every line. I always find I need a lot of:

sed -f scriptfile < infile | more

before I get it right, but it does work well
eventually.
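
For illustration, here is what such a script file might look
like. The three commands and the sample text are invented
for the example; the script goes in a temporary file made
with mktemp so nothing is left behind.

```shell
# Write a three-command sed script to a temporary file.
scriptfile=$(mktemp)
cat > "$scriptfile" <<'EOF'
# strip trailing whitespace
s/[[:space:]]*$//
# drop lines that are now empty
/^$/d
# fix a common typo everywhere on the line
s/teh/the/g
EOF

# Apply the whole script to every line of the input.
result=$(printf 'teh cat   \n\nteh dog\n' | sed -f "$scriptfile")
echo "$result"

rm -f "$scriptfile"
```

The commands run in order on each line, so the order in the
script file matters: here the whitespace is stripped first
so that lines holding only blanks are also deleted.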

If sed(1) cannot do it, you might try awk(1),
a complete programming language for
text manipulation. Or use Perl or ...
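
To give a flavour of awk, a tiny sketch with invented input.
awk splits each line into fields for you and keeps counters
such as NR (line number) and NF (number of fields), which is
the kind of bookkeeping that gets awkward in sed.

```shell
# For each input line, print its line number, how many
# whitespace-separated fields it has, and its length.
report=$(printf 'alpha beta\ngamma\n' |
    awk '{ print NR ": " NF " fields, " length($0) " chars" }')
echo "$report"
```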



