Moving to non-Word formats [long] (was: Re: Having trouble finding a word in multiple files)

Wed Jun 17 00:04:46 UTC 2020

On Tue, 16 Jun 2020 16:21:50 -0400
Mike Marchywka <marchywka at hotmail.com> wrote:

> On Tue, Jun 16, 2020 at 12:39:02PM -0700, rikona wrote:
> > On Mon, 15 Jun 2020 20:22:25 -0400
> > H <agents at meddatainc.com> wrote:
> >   
> > > On June 15, 2020 7:02:49 PM EDT, Liam Proven <lproven at gmail.com>
> > > wrote:  
> > > >On Tue, 16 Jun 2020 at 00:38, Mike Marchywka
> > > ><marchywka at hotmail.com> wrote:  
> > 
> > [big snip...]
> >   
> > > I have been missing the old DOS outliners for a very long time, to
> > > use exactly as you describe. I think I used PC Outline the most
> > > if I remember correctly...  
> > 
> > Agreed!! I still have many hundreds of .pco files with historically
> > important info.
> > 
> > But, they are coded binary files with no known way to convert them
> > [that I know of]. I can visually access the 'text' portions of the
> > file if I need to, and that is still useful.
> > 
> > What I'd really like is some kind of 'text extraction'
> > tool/pgm/method that would extract the text portions, make an
> > ordered list of them and let me put that text in a file. 
> > 
> > Recoll does not know how to read .pco files, but it could then read
> > some of my historical info and that would appear in my searches if
> > relevant.
> > 
> > Anyone know how to do that?  
> 
> First of all, I hate the idea of some indexing tool taking
> up resources, and I turn them off when I notice them.

If you have a huge number of files, the initial index does take a long
time. But, once you have that, recoll just updates the existing index.
It can be scheduled to do that at 3 AM, which may not impact your work
in any way. And, if you haven't added or subtracted much, the update
may take only a few minutes, and is done at a priority that
doesn't impact performance. I don't find that to be a problem at all.
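
The 3 AM update mentioned above could be scheduled with a crontab entry
along these lines. recollindex is Recoll's command-line indexer; the
exact time and the absence of any extra options are only assumptions for
illustration.

```
# m h dom mon dow  command
0 3 * * * recollindex
```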

> However,
> I do find myself floundering around for historical ( month
> old or migrated from another machine ) data or code or text.

Some of my historical data is ~50 years old, migrated many times. :-)
I've always done a LOT of research, and still do. Back in the mainframe
days, I once asked for a modestly broad set of references in a
literature search - they came back in about 6-8 large cardboard file
boxes completely filled with paper! After a week or so of looking at
that mess, I quickly learned to do MUCH more sophisticated, narrow
searches. :-))

> Usually I can't remember a good keyword, so indexing would not help.

I'm familiar with that problem too. :-) I usually do multi-stage
searches, using the first ones to remind/suggest words to narrow
subsequent more complex searches.

> If there is just plain text interspersed with junk, 
> generally "strings " should work in place of cat and you can 
> break those up on white space and make a vocabulary list.
> Otherwise you need to decode the thing with something ( I often
> had to find symbol names in object files for example and can't
> remember now if the mangled names are always greppable ). 
> ( this leaves in punctuation and case etc but you get the idea )
> strings FILE | tr -s ' ' '\n' | sort | uniq
> 
> should work. 

Does that produce just words? What's important is phrases of multiple
words. That is what really conveys info. Can that keep together
'strings' that include spaces between words as part of the 'string'?
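
A note on the question above: as far as I can tell, strings on its own
already keeps phrases together, because a space counts as a printable
character. It is the later step that splits on spaces to build a word
list. Below is a minimal sketch of phrase extraction. It assumes the
text in a .pco file is stored as plain ASCII runs, which I have not
verified. The function names extract_phrases and pco_to_txt and the
4-character minimum are made up for illustration.

```shell
# Print every run of 4 or more printable characters in a binary file,
# one run per line, in file order. Spaces count as printable, so
# multi-word phrases stay together.
extract_phrases() {
    if command -v strings >/dev/null 2>&1; then
        strings -a -n 4 "$1"
    else
        # Fallback when strings is not installed: turn every
        # non-printable byte into a newline, then keep lines of 4+ chars.
        LC_ALL=C tr -c '[:print:]' '\n' < "$1" | LC_ALL=C grep -E '.{4,}'
    fi
}

# Write NAME.txt next to each NAME.pco in directory $1 (default: current).
pco_to_txt() {
    for f in "${1:-.}"/*.pco; do
        [ -e "$f" ] || continue      # no .pco files: nothing to do
        extract_phrases "$f" > "${f%.pco}.txt"
    done
}
```

Running pco_to_txt on a directory should leave one .txt file per
.pco file, which an indexer such as Recoll could then read like any
other text file.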

> I guess the politics of file formats is strong but two
> pillars for my rant : 1) human readability is not exclusive
> of resource efficiency and 2) compression should be informative.
> There may be similar text compression systems but I wrote one
> based on indexing words in csv files and just replacing them
> with ascii tokens. This is not optimal but then you can gzip the thing
> too. In any case, the dictionary it generates should be great for
> word searches. In video and audio you often see signals decomposed
> into meaningful units - motion or vocal tract parameters for example.
> I was going to pull this back out and play with it but
> there may be existing stuff like that.  
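
For anyone wondering what the word-token scheme described above might
look like, here is a minimal sketch. It is not Mike's program. The
function name tokenize, the T<n> token form, and the tab-separated
dictionary file are all assumptions for illustration.

```shell
# Replace each whitespace-separated word in a text file with a short
# ASCII token and record the word/token mapping in a dictionary file.
#   $1: input text file
#   $2: dictionary file to create (lines of "number<TAB>word")
# The tokenized text is written to standard output.
tokenize() {
    awk -v dict="$2" '
    {
        out = ""
        for (i = 1; i <= NF; i++) {
            # First time this word is seen: give it the next number
            # and append it to the dictionary.
            if (!($i in id)) {
                id[$i] = ++n
                print n "\t" $i > dict
            }
            out = out (i > 1 ? " " : "") "T" id[$i]
        }
        print out
    }' "$1"
}
```

The dictionary file lists every distinct word once, so a plain grep
over it is enough to check whether a word occurs anywhere in the
original, and the tokenized output can still be passed through gzip.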

The problem here is mostly binary data that is part of an overall coded format,
not file compression. That binary adds a lot to the size - the
opposite of compression. And yes, there's a lot more in general than
just text.