Having trouble finding a word in multiple files

Karl Auer kauer at biplane.com.au
Sun Jun 14 10:33:19 UTC 2020


On Sun, 2020-06-14 at 05:17 -0400, Pat Brown wrote:
> > Unfortunately, none of those suggestions worked. Perhaps it's
> > because the files I'm searching are either .doc, .docx  or .odt
> > files. The

docx and odt files are zip archives, so the text in them is compressed
and your target word has probably been rendered unrecognisable to grep.
Not sure about doc; it is an older binary format, so grep may or may
not find the word there.

Anyway, if you uncompress the document using e.g. unzip, then look at
word/document.xml (for docx) or content.xml (for odt), you may have
better luck finding your target word.
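
For a quick manual check of a single file, something like the sketch
below may help. The function name doc_xml and the file name report.docx
are made up for illustration, and it assumes unzip is installed. It
tries the docx part first and falls back to the odt one.

```shell
#!/bin/sh
# Sketch only: doc_xml is an illustrative name, not an existing tool.
# Writes the main XML part of a docx or odt file to standard output.
# Returns non-zero if the file is not a zip archive or has neither part.
doc_xml() {
    unzip -p "$1" word/document.xml 2>/dev/null ||
        unzip -p "$1" content.xml 2>/dev/null
}

# Example call (report.docx is a placeholder):
# doc_xml report.docx | grep -c "Blowback"
```

The XML often sits on very few lines, so the count from grep -c is more
of a yes/no answer than a tally of occurrences.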

While you can probably do it on a single command line, I think you
would be better off writing a script for that, then using find
to run the script over the relevant directory.

Because lines can be very long in this machine-generated XML, you are
probably better off discarding the actual grep output, and just using
the return value to determine whether anything was found. So in broad
terms your script would look like this:

#!/bin/sh
# Usage: script.sh FILE TEXT
# Prints FILE and exits 0 if TEXT is found in it, otherwise exits 1.
FILE="$1"
TXT="$2"
# Scratch file for the extracted XML
TMP=$(mktemp) || exit 2
RC=1
# Try for a DOCX
if unzip -p "$FILE" word/document.xml > "$TMP" 2>/dev/null && \
      grep -q -e "$TXT" "$TMP" 2>/dev/null ; then
   RC=0
# Try for an ODT
elif unzip -p "$FILE" content.xml > "$TMP" 2>/dev/null && \
      grep -q -e "$TXT" "$TMP" 2>/dev/null ; then
   RC=0
# Try a straight grep
elif grep -q -e "$TXT" "$FILE" 2>/dev/null ; then
   RC=0
fi
[ "$RC" -eq 0 ] && echo "$FILE"
rm -f "$TMP"
exit "$RC"

And you would call it from find like this:

find /wherever/your/files/are   \
      -iname "*.docx"           \
      -exec /path/to/script.sh {} "Blowback" \;

Run it more than once for multiple file extensions - or figure out a
clever regexp :-)
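
One way to cover several extensions in a single pass is to OR together
a few -iname tests. This is only a sketch: search_docs and its three
arguments are names I have invented here, and it runs the per-file
script through sh so the script does not need to be executable.

```shell
#!/bin/sh
# Sketch only: search_docs and its arguments are illustrative names.
# $1 = directory, $2 = text to look for, $3 = path to the per-file script
search_docs() {
    find "$1" -type f \
         \( -iname "*.doc" -o -iname "*.docx" -o -iname "*.odt" \) \
         -exec sh "$3" {} "$2" \;
}

# Example call (paths are placeholders):
# search_docs /wherever/your/files/are "Blowback" /path/to/script.sh
```

The parentheses have to be escaped so the shell passes them to find,
and they keep the -exec applying to all three extensions rather than
just the last one.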

Regards, K.
