Having trouble finding a word in multiple files
Peter Flynn
peter at silmaril.ie
Mon Jun 15 15:46:50 UTC 2020
On 15/06/2020 12:36, Mike Marchywka wrote:
[...]
> Does this try to just extract data from the xml?
Not "try" :-) it actually *does*. XML is a plain text file format, and
there is a stack of tools to use with it.
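For instance, assuming the files are .docx (a ZIP container holding
word/document.xml; an old binary .doc would need something like
antiword instead), a rough sketch, with "covid" standing in for
whatever word you are after:

for f in *.docx; do
  # -p streams the member to stdout without unpacking the archive
  unzip -p "$f" word/document.xml | grep -q -i 'covid' && echo "$f"
done

One caveat: Word sometimes splits a word across several <w:t> runs,
in which case a plain grep on the raw XML can miss it.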
> Isn't there anyway scriptable way to render the document into say
> pdf and then run pdftotext or some similar thing that executes the
> formatting junk? I know this sounds like it goes deeper into the
> swamp but it may be a better path back out lol.
I'm not sure I understand "executes the formatting junk". Do you mean
"removes the formatting junk"?
Again, it depends on whether you just want to identify the files
containing the string, OR whether you actually want to find out
*where* in the document the string occurs, which is more complex.
You could certainly use LO or AbiWord on each file in turn to save to
PDF, and then run pdftotext. That gives you one CRLF-delimited "line"
per "line" of the typeset PDF, which in most cases is fine for grep -l,
which is normally all people need. But I think unzipping the XML from
the .doc file and grepping it for a simple occurrence is likely to be
several orders of magnitude faster than firing up LO each time.
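If you did want the PDF route, it would look something like this (a
quick untested sketch, assuming LibreOffice's soffice binary and
poppler's pdftotext are on the PATH):

mkdir -p converted
for f in *.docx; do
  soffice --headless --convert-to pdf --outdir converted "$f"
done
for p in converted/*.pdf; do
  pdftotext "$p"    # writes converted/<name>.txt next to the PDF
done
grep -l -i 'covid' converted/*.txt   # -l just lists the matching files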
However, either route loses any sense of structure, so you cannot see
whether the hits occurred in the Introduction or the Conclusion (if
that is what you wanted). If you need that kind of accuracy (and
many people who work with large and complex documents do need it), then
a language like XQuery will do that. You could phrase a query that,
for each hit, returned the section number (and heading) and the number
of the paragraph, e.g.
"covid" occurs in
section 4.2 "Evacuation plans", paragraph 7
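As a rough, untested sketch of the kind of query I mean, here run with
BaseX (any XQuery processor would do) against the document.xml pulled
out of one hypothetical file, report.docx. It reports the text of the
nearest preceding Heading-styled paragraph and the paragraph's position
in the body, so a dotted section number like 4.2 would take a little
more work, and it assumes Word's stock Heading1/Heading2/... style
names:

unzip -p report.docx word/document.xml > document.xml

cat > findword.xq <<'EOF'
declare namespace w =
  "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
for $p at $i in //w:body/w:p
let $text := string-join($p//w:t, '')
let $head := $p/preceding-sibling::w:p
               [w:pPr/w:pStyle[starts-with(@w:val, 'Heading')]][1]
where contains(lower-case($text), 'covid')
return concat('"covid" occurs under "', string-join($head//w:t, ''),
              '", paragraph ', $i)
EOF

basex -i document.xml findword.xq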
I would class *that* as "productivity software" because it will save you
a metric shedload of time.
Peter