Having trouble finding a word in multiple files

Mon Jun 15 11:36:02 UTC 2020

On Mon, Jun 15, 2020 at 12:14:12PM +0100, Peter Flynn wrote:
> On 15/06/2020 11:40, Liam Proven wrote:
> > On Sun, 14 Jun 2020 at 09:56, Pat Brown <pat.mysterywriter at gmail.com> wrote:
> > > 
> > > I've tried a variety of grep commands but I can't find the specific
> > > word I'm searching for that is in a file or files somewhere in my
> > > Dropbox folder. The word I'm trying to find is Blowback. Can someone
> > > please help me with the correct command?
> > 
> > [Reading down the thread]
> > 
> > They are files from proprietary Windows/Mac apps?
> > 
> > Then you can't. Grep only searches plain text.
> > 
> > You can't search in proprietary binary files. At all. Forget all the
> > ideas about converting them; you cannot efficiently convert or filter
> > these -- every file would need to be converted every time, which would
> > be _ludicrously_ slow.
> 
> The script I posted does the job in seconds.
> Word files are just zip files, so they unzip
> easily, and the document inside is XML, so it's
> plain text. Finding all files containing
> 'Blowback' is fast.
> 
> BUT...Word and ODT XML documents are stored
> without linebreaks: like HTML, you can have the
> end of one paragraph butting up against the
> start of the next with no white-space, e.g. like
> this</w:p><w:p>Next para, so finding *where* in
> the file (as opposed to *whether* the file) is a
> second stage, and that would slow it down a
> little, although doing this search as below (for
> my name, not for 'Blowback') on every Word file
> (a few hundred) on my disk took under a minute.
> 
> So the document.xml or content.xml file in the
> zip is just one long string with markup. To
> search *within* it you need a tool that will
> separate the markup from the text; fortunately
> there are dozens, if not hundreds, of these
> (every CS student at some stage writes an XML
> parser). For use in a script, the easiest I find
> are the LTxml2 tools from the Language
> Technology Group in Edinburgh
> (https://www.ltg.ed.ac.uk/software/ltxml2/), so
> after extracting the file from the zip into a
> pipe you could say
> 
> ...unzip -qc "$wordfile.docx" word/document.xml |\
>    lxprintf -e 'w:p[contains(.,"Blowback")]' "%s\n" -
> 
> and you'll get the text of any paragraph
> containing 'Blowback' (that conditional in
> [square brackets] is the XPath language used to
> identify pieces of an XML document).

Does this try to just extract data from the XML?
Isn't there any scriptable way to render the
document into, say, PDF and then run pdftotext
or some similar thing that interprets the formatting
junk? I know this sounds like it goes deeper into the swamp,
but it may be a better path back out lol.
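
For comparison, staying on the XML route: a rough sketch of the
"which paragraph" step using only sed and grep instead of LTxml2.
This is not the script Peter posted, and the function names
docx_paragraphs and docx_grep are made up for illustration. It
assumes GNU sed (for \n in the replacement), Info-ZIP unzip, and
.docx files whose body lives in word/document.xml.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed, unzip and grep are installed.

# docx_paragraphs: read WordprocessingML on stdin and write plain text,
# one paragraph per line. It breaks the single long string at each
# </w:p>, then deletes every remaining tag. Entities such as &amp; are
# left undecoded.
docx_paragraphs() {
    sed -e 's|</w:p>|\n|g' -e 's|<[^>]*>||g'
}

# docx_grep WORD FILE...: for each .docx FILE, print "file: paragraph"
# for every paragraph that contains WORD as a fixed string.
docx_grep() {
    word=$1
    shift
    for f in "$@"; do
        # unzip -p writes the named archive member to stdout.
        unzip -p "$f" word/document.xml 2>/dev/null |
            docx_paragraphs |
            grep -F -- "$word" |
            while IFS= read -r line; do
                printf '%s: %s\n' "$f" "$line"
            done
    done
}
```

After sourcing that, something like
docx_grep Blowback ~/Dropbox/*.docx
would list matching paragraphs (the ~/Dropbox path is a guess at
where the folder lives). Because the tags are stripped before the
grep, a word that Word has split across several runs should still
match, which a plain grep over the raw XML would miss.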

> 
> > You need a desktop search tool. There are not many for Linux and in my
> > experience they do not work well. I recently tried Catfish and it was
> > unable to search inside LibreOffice files.
> 
> Then the people who wrote it need to add some
> code. This stuff is not rocket science (or if it
> is, I know any number of unemployed rocket
> scientists who can do it for you) — it just
> means knowing what scripted text utilities can
> offer. XML is easy and fast to handle when it is
> used appropriately for what we designed it for:
> normal running text documents, not rectangular
> or columnar data, which is the province of CSV
> and JSON.
> 
> 2¢
> Peter
> 

-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X