Having trouble finding a word in multiple files

Peter Flynn peter at silmaril.ie
Mon Jun 15 15:46:50 UTC 2020


On 15/06/2020 12:36, Mike Marchywka wrote:
[...]
> Does this try to just extract data from the xml?

Not "try" :-) it actually *does*. XML is a plain text file format, and 
there is a stack of tools to use with it.

> Isn't there any scriptable way to render the document into say
> pdf and then run pdftotext or some similar thing that executes the
> formatting junk? I know this sounds like it goes deeper into the
> swamp but it may be a better path back out lol.

I'm not sure I understand "executes the formatting junk". Do you mean 
"removes the formatting junk"?

Again, it depends on whether you just want to identify the files 
containing the string, OR whether you actually want to find out *where* 
in the document the string occurs, which is more complex.

You could certainly use LO or AbiWord on each file in turn to save it to 
PDF, and then run pdftotext. That gives you one CRLF-delimited "line" 
per "line" of the typeset PDF, which in most cases is fine for grep -l, 
and that is normally all people need. But I think unzipping the XML from 
the .docx file and grepping it for a simple occurrence is likely to be 
several orders of magnitude faster than firing up LO each time.
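
For example (a rough, untested sketch, assuming .docx packages with the 
body text in word/document.xml, and using "covid" as the search term; 
the filenames are made up):

   # Route 1: grep the XML straight out of the package (no LO start-up)
   for f in *.docx; do
       # unzip -p streams the member to stdout, no temp files needed
       if unzip -p "$f" word/document.xml | grep -q -i 'covid'; then
           echo "$f"
       fi
   done

   # Route 2: typeset to PDF first, then search the extracted text
   soffice --headless --convert-to pdf report.docx   # writes report.pdf
   pdftotext report.pdf - | grep -q -i 'covid' && echo report.pdf

One caveat with route 1: Word sometimes splits a word across adjacent 
runs in the XML, so a plain grep on the raw markup can occasionally miss 
a hit that is really there.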

However, in the process it would lose any sense of structure, so you 
cannot see whether the hits occurred in the Introduction or the Conclusion 
(if that is what you wanted). If you need that kind of accuracy (and 
many people who work with large and complex documents do need it), then 
a language like XQuery will do that. You could phrase a query that, for 
each hit, returned the section number (and heading) and the number of 
the paragraph, e.g.

"covid" occurs in
section 4.2 "Evacuation plans", paragraph 7
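
As a very rough sketch (untested here) of the simpler half of that, run 
against the unzipped document.xml with a command-line XQuery processor 
such as BaseX (assuming its -q option for an inline query); the namespace 
is the standard WordprocessingML one, but recovering the section number 
and heading as above would need extra logic that walks back to the 
nearest heading-styled paragraph:

   basex -q '
     declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main";
     for $p at $n in doc("document.xml")//w:body/w:p
     where contains(lower-case(string($p)), "covid")
     return concat("covid occurs in paragraph ", $n)
   '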

I would class *that* as "productivity software" because it will save you 
a metric shedload of time.

Peter



