Having trouble finding a word in multiple files

Sun Jun 14 15:07:17 UTC 2020

On Sun, Jun 14, 2020 at 03:01:29PM +0100, Peter Flynn wrote:
> On 14/06/2020 10:17, Pat Brown wrote:
> [...]
> > Unfortunately, none of those suggestions worked. Perhaps it's
> > because the files I'm searching are either .doc, .docx  or .odt
> > files.
> 
> Right.
> 
> The .doc files have been obsolete since Word 2007 and
> cannot be read by normal utilities because they
> are a proprietary binary format (which is why
> Microsoft sensibly moved to .docx). There *is*
> an ancient word-to-text utility in existence
> somewhere, but that's just prolonging the agony:
> I ***very strongly*** suggest you open all your
> .doc files and Save As...docx and get rid of the
> .doc versions (and tell anyone who sends them to
> you to do the same), because...

I just ran into this with PDF files, thinking that
strings xxx.pdf | grep

would work. It usually doesn't, because the text inside a PDF
is typically compressed. For that there is pdftotext.
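
As a rough sketch of what that looks like in practice: the function
name pdf_grep is made up for illustration, and it assumes pdftotext
(packaged in poppler-utils on Ubuntu) is installed. The "-" argument
tells pdftotext to write the text to stdout instead of a file.

```shell
# Hypothetical helper: print the lines of a PDF's text that match a word.
pdf_grep() {
    term=$1
    file=$2
    # pdftotext converts the PDF to plain text; "-" sends it to stdout
    pdftotext "$file" - | grep -i -e "$term"
}
```

Usage would be something like: pdf_grep blowback xxx.pdf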

The other suggestion, which might have detected the problem,
was to use something like "od" to make a histogram of
byte values. I have a script I use like this:

myod -histo  mikemail.tex  | head
    162 
    158 0a
     35 20
     26 23
    460 25
     36 31
      4 32
      4 3a
     14 5b
    101 5c

which counts how often each byte value (in hex) occurs. Note that
the only value below the space "20" is the linefeed "0a" (the 162
with no value next to it is spurious output, lol).
A binary file would be obvious, but the problem I ran into was
escaped color highlighting (terminal escape sequences), which
fools grep.
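
A rough stand-in for that kind of histogram, using only od, tr, sort
and uniq, might look like the following. This is a sketch and not the
myod script itself; byte_histo is a made-up name.

```shell
# Hypothetical stand-in for a byte histogram (not the actual myod script).
# Prints "count hexvalue" pairs, sorted by hex value.
byte_histo() {
    # -An: no offsets, -v: show repeated lines too, -tx1: one hex byte per field
    od -An -v -tx1 "$1" |
        tr -s ' ' '\n' |   # put one byte value on each line
        grep . |           # drop the empty line left by od's leading space
        sort | uniq -c
}
```

Usage: byte_histo mikemail.tex | head
Color escape sequences begin with the ESC character, so a line ending
in "1b" in that output is a hint that they are present.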

You can also write a Perl script to build a vocabulary list
and see if anything in it is close to the word you want. A list
is easier to browse, etc. Not sure what that does with binary files :)
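
The same idea can be sketched in plain shell rather than Perl. This
is only an illustration (vocab_list is a made-up name), and it assumes
that any run of letters counts as a word.

```shell
# Hypothetical helper: list each distinct word in a text file with its count,
# most frequent first.
vocab_list() {
    tr -cs '[:alpha:]' '\n' < "$1" |   # turn every non-letter run into a newline
        tr '[:upper:]' '[:lower:]' |   # fold case so Blowback == blowback
        grep . |                       # drop any empty line
        sort | uniq -c | sort -rn
}
```

Usage: vocab_list mikemail.tex | less
Piping the list through something like "grep -i blow" then shows
near misses as well as exact matches.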

> 
> ...the .docx, .pptx, .xlsx, and .odt files ARE
> JUST ZIP FILES containing XML, so they are
> basically plain text inside and can be searched,
> especially if all you want to find is *which*
> files contain "Blowback", rather than *where* in
> the file it occurs. You'll need to do it twice
> (because the document *inside* the zip file is
> named differently between .docx and .odt files),
> so this script does both. Set
> SEARCHTERM=blowback first
> 
> for TYPE in docx odt; do
>     if [ "$TYPE" = "docx" ]; then
> 	DOC="word/document.xml"
>     else
> 	DOC="content.xml"
>     fi
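>     find ~/ -type f -name "*.$TYPE" | while read -r filename; do
> 	# extract only the main XML document into the current directory
> 	unzip -qo "$filename" "$DOC"
> 	# search the file that was just extracted (differs for docx and odt)
> 	HIT=`grep -i "$SEARCHTERM" "$DOC"`
> 	if [ -n "$HIT" ]; then echo "$filename"; fi
> 	# remove it so a failed unzip cannot reuse stale content
> 	rm -f "$DOC"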
>     done
> done
> 
> I don't know a reliable way to pass an unzip -c
> file content into grep from within find *and*
> preserve the {} filename, hence doing it this
> way. You will get some weirdo files that don't
> contain a content.xml or document.xml file, and
> you'll get some Permission denied errors from
> non-Word/ODF files during find.
> 
> Peter
> 

-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X