Moving to non-Word formats [long] (was: Re: Having trouble finding a word in multiple files)
Mike Marchywka
marchywka at hotmail.com
Tue Jun 16 20:21:50 UTC 2020
On Tue, Jun 16, 2020 at 12:39:02PM -0700, rikona wrote:
> On Mon, 15 Jun 2020 20:22:25 -0400
> H <agents at meddatainc.com> wrote:
>
> > On June 15, 2020 7:02:49 PM EDT, Liam Proven <lproven at gmail.com>
> > wrote:
> > >On Tue, 16 Jun 2020 at 00:38, Mike Marchywka <marchywka at hotmail.com>
> > >wrote:
>
> [big snip...]
>
> > I have been missing the old DOS outliners for a very long time, to
> > use exactly as you describe. I think I used PC Outline the most if I
> > remember correctly...
>
> Agreed!! I still have many hundreds of .pco files with historically
> important info.
>
> But, they are coded binary files with no known way to convert them
> [that I know of]. I can visually access the 'text' portions of the file
> if I need to, and that is still useful.
>
> What I'd really like is some kind of 'text extraction' tool/pgm/method
> that would extract the text portions, make an ordered list of them and
> let me put that text in a file.
>
> Recoll does not know how to read .pco files, but with such a tool it
> could then read some of my historical info and that would appear in
> my searches if relevant.
>
> Anyone know how to do that?
First of all, I hate the idea of some indexing tool taking
up resources, and I turn them off when I notice them. However,
I do find myself floundering around for historical (month-old,
or migrated from another machine) data or code or text.
Usually I can't remember a good keyword, so indexing would not help.
If there is just plain text interspersed with junk,
"strings" should generally work in place of cat, and you can
break its output up on whitespace to make a vocabulary list.
Otherwise you need to decode the thing with something (I often
had to find symbol names in object files for example, and can't
remember now whether the mangled names are always greppable).
Something like

  strings file.pco | sed -e 's/[[:space:]]\+/\n/g' | sort | uniq

should work (this leaves in punctuation and case etc. but you
get the idea).
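For example, to build one word list across every .pco file in a
directory and then see which files mention a given word (the file
names and the search word here are just made up):

  for f in *.pco; do strings "$f" | sed -e 's/[[:space:]]\+/\n/g'; done | sort -u > pco_words.txt
  grep -ail 'invoice' *.pco

grep -a treats the binary files as text and -l just lists the
matching file names, so once a word in pco_words.txt looks
promising you can see which files to open. (For object files,
by the way, nm -C demangles the symbol names so plain grep
works on them.)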
I guess the politics of file formats is strong, but two
pillars for my rant: 1) human readability is not exclusive
of resource efficiency, and 2) compression should be informative.
There may be similar text compression systems already, but I wrote
one based on indexing the words in csv files and just replacing them
with ascii tokens (a rough sketch of the idea is below). That is not
optimal, but you can still gzip the result too. In any case, the
dictionary it generates should be great for word searches. In video
and audio you often see signals decomposed into meaningful units,
motion or vocal tract parameters for example. I was going to pull
that code back out and play with it, but there may be existing stuff
like that already.
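To illustrate the word-indexing idea (just a sketch, not the program
I mentioned; data.csv, data.tok and words.dict are made-up names),
something like this replaces every distinct field with a small
integer token and writes the dictionary on the side:

  awk -F',' '{
    for (i = 1; i <= NF; i++) {
      # first time a word is seen, assign it the next token
      if (!($i in id)) { id[$i] = ++n; dict[n] = $i }
      printf "%s%d", (i > 1 ? "," : ""), id[$i]
    }
    print ""
  }
  END { for (j = 1; j <= n; j++) print j, dict[j] > "words.dict" }' data.csv > data.tok

The .tok file still gzips on top of that, and words.dict by itself
tells you whether a word appears anywhere without touching the data
again.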
>
> Rik
>
--
mike marchywka
306 charles cox
canton GA 30115
USA, Earth
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X