Having trouble finding a word in multiple files

Peter Flynn peter at silmaril.ie
Sun Jun 14 14:01:29 UTC 2020


On 14/06/2020 10:17, Pat Brown wrote:
[...]
> Unfortunately, none of those suggestions worked. Perhaps it's
> because the files I'm searching are either .doc, .docx  or .odt
> files. 

Right.

The .doc format has been obsolete since Office 2007 made .docx the 
default, and normal utilities cannot read it because it is a proprietary 
binary format (which is why Microsoft sensibly moved to .docx). There 
*is* an ancient word-to-text utility in existence somewhere, but that's 
just prolonging the agony: I ***very strongly*** suggest you open all 
your .doc files, Save As... .docx, and get rid of the .doc versions (and 
tell anyone who sends them to you to do the same), because...

...the .docx, .pptx, .xlsx, and .odt files ARE JUST ZIP FILES containing 
XML, so they are basically plain text inside and can be searched, 
especially if all you want to find is *which* files contain "Blowback", 
rather than *where* in the file it occurs. You'll need to do it twice 
(because the document *inside* the zip file is named differently between 
.docx and .odt files), so this script does both. Set SEARCHTERM=blowback 
first, and run it from a scratch directory, because unzip extracts into 
the current directory:

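for TYPE in docx odt; do
     # the main document part has a different name in each format
     if [ "$TYPE" = "docx" ]; then
	DOC="word/document.xml"
     else
	DOC="content.xml"
     fi
     find ~/ -type f -name "*.$TYPE" | while IFS= read -r filename; do
	# extract just that part; skip archives that don't contain it
	unzip -qo "$filename" "$DOC" || continue
	# search the part we just extracted (not always word/document.xml)
	if grep -qi -- "$SEARCHTERM" "$DOC"; then echo "$filename"; fi
	# remove it so a stale copy can't produce a false hit later
	rm -f "$DOC"
     done
done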

I don't know a reliable way to pass an unzip -c file content into grep 
from within find *and* preserve the {} filename, hence doing it this 
way. unzip will complain about some weirdo files that don't contain a 
content.xml or document.xml file, and you'll get some Permission denied 
errors from find for directories you aren't allowed to read.

Peter



