Removing 'bad' chrs from a file

Sat Oct 9 03:45:03 UTC 2010

Hello Sandy,

Friday, October 8, 2010, 12:26:28 AM, Sandy wrote:

First, thanks for the reply...

SH> On Fri, Oct 8, 2010 at 3:13 PM, rikona <rikona at sonic.net> wrote:

>> Or, is there a way to just fix/remove the 'bad chrs' from a file? All
>> [reasonable] {gotta watch out on this list :-) } ideas welcome.

SH> You can do that with tr(1). You'd have to know what the
SH> app means by "bad chrs", though.

That's THE problem - kate doesn't say, and TB is silent during
indexing - it apparently just stops. Kate was impossibly slow on a 1
gig file. I tried cream [a gui for vim] and it is MUCH better for very
large files. Cream loads the file without any error message - but
perhaps it's just less fussy than kate. I can now see the whole file
relatively quickly, though.

At first, I thought I might be able to find the problem with a text
editor. The last email indexed by TB is on line # 7,588,969, which is
only about 3/4 of the way through the file. The idea of looking,
visually, at perhaps 2,500,000 more lines to find the error is, to say
the least, not appealing. :-(( I need a better way to find the 'bad'
chr(s) - whatever they are....
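
One possible way to narrow the search without reading the file by eye is
to let grep report the line numbers. This is only a sketch: it assumes the
'bad' chrs are bytes outside printable ASCII and ordinary whitespace, which
nothing above confirms, and the function name find_bad_lines is made up for
illustration.

```shell
# Hypothetical helper: print the line numbers of the first few lines that
# contain a byte outside printable ASCII and ordinary whitespace.
#   $1 = file to scan
#   $2 = how many line numbers to show (defaults to 10)
find_bad_lines() {
    # LC_ALL=C : match byte by byte instead of decoding multibyte characters
    # -a       : treat the file as text even if it contains NUL bytes
    # -n       : prefix each matching line with its line number
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$1" \
        | cut -d: -f1 \
        | head -n "${2:-10}"
}
```

Called as "find_bad_lines mailfile 5" (mailfile being a placeholder name),
it would list up to five suspect line numbers. Any mail with legitimate
accented or non-English text will also show up, so the first hit after the
last indexed message is probably the most interesting one.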

The kate error mentioned utf8 [IIRC]. I was not able to determine if
an all-utf8 file was OK for TB, or even useful. It may not index utf8
at all, for all I know. Perhaps it is worth a try to convert all the
chrs to utf8 and/or plain ASCII and see what happens - but I don't
know how to do that. If you know a way, perhaps that is worth a try.
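
For the conversion idea, iconv(1) may be worth a look. The sketch below
assumes the file is meant to be UTF-8 already and that simply discarding
whatever does not decode is acceptable; both are assumptions, and the
function name strip_invalid_utf8 is hypothetical.

```shell
# Hypothetical helper: copy a file, discarding byte sequences that are not
# valid UTF-8.
#   $1 = input file, $2 = output file (the input is left untouched)
strip_invalid_utf8() {
    # -c tells iconv to omit characters it cannot convert.
    # iconv can still exit non-zero after discarding input, so the exit
    # status is ignored here; note that this also hides genuine failures.
    iconv -c -f UTF-8 -t UTF-8 < "$1" > "$2" || true
}
```

If the file is really in some other encoding (Latin-1, say), the -f value
would need to name that encoding instead, and the result should be checked
on a copy before trusting it.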

SH> tr -d '\0' < infile > outfile

SH> should delete (-d) all null characters. To replace them with
SH> newlines

SH> tr '\0' '\012' < infile > outfile

SH> That's off the top of my head, quite likely has syntax
SH> errors, but it should give you the idea. Check the man
SH> page and it should be easy from there.
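
Building on the tr commands above, here is a sketch of a broader variant
that keeps only tab, newline, carriage return and printable ASCII, and
deletes everything else. It assumes that losing all non-ASCII text is
acceptable, which may not be true for this mail file; the function name
keep_plain_ascii is made up for illustration.

```shell
# Hypothetical helper: write a copy of a file containing only tab (\11),
# newline (\12), carriage return (\15) and printable ASCII (\40-\176).
#   $1 = input file, $2 = output file
keep_plain_ascii() {
    # -c complements the set, -d deletes: every byte NOT listed is removed.
    # LC_ALL=C keeps tr working on single bytes.
    LC_ALL=C tr -cd '\11\12\15\40-\176' < "$1" > "$2"
}
```

This removes null characters as well as any bytes above 127, so accented
letters and other non-English text are dropped along with the 'bad' chrs.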

Thanks,

-- 

 rikona