Having trouble finding a word in multiple files
Peter Flynn
peter at silmaril.ie
Sun Jun 14 14:01:29 UTC 2020
On 14/06/2020 10:17, Pat Brown wrote:
[...]
> Unfortunately, none of those suggestions worked. Perhaps it's
> because the files I'm searching are either .doc, .docx or .odt
> files.
Right.
The .doc format has been obsolete since Word 2007 made .docx the
default, and .doc files cannot be searched with the normal text
utilities because they are a proprietary binary format (which is why
Microsoft sensibly moved to .docx). There *are* ancient word-to-text
utilities around (antiword, for one), but that's just prolonging the
agony: I ***very strongly*** suggest you open all your .doc files and
Save As... .docx and get rid of the .doc versions (and tell anyone who
sends them to you to do the same), because...
...the .docx, .pptx, .xlsx, and .odt files ARE JUST ZIP FILES containing
XML, so they are basically plain text inside and can be searched,
especially if all you want to find is *which* files contain "Blowback",
rather than *where* in the file it occurs. You'll need to do it twice
(because the document *inside* the zip file is named differently between
.docx and .odt files), so this script does both. Set SEARCHTERM=blowback
first, and run it from a scratch directory, because it unpacks into the
current directory:
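for TYPE in docx odt; do
    # The main document part is named differently in each format
    if [ "$TYPE" = "docx" ]; then
        DOC="word/document.xml"
    else
        DOC="content.xml"
    fi
    find ~/ -type f -name "*.$TYPE" | while IFS= read -r filename; do
        # Delete the copy left by the previous file, so that a failed
        # unzip cannot produce a false hit
        rm -f "$DOC"
        unzip -qo "$filename" "$DOC"
        # Search the part just extracted (not always word/document.xml)
        if [ -f "$DOC" ] && grep -qi -- "$SEARCHTERM" "$DOC"; then
            echo "$filename"
        fi
    done
done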
I don't know a reliable way to pass the output of unzip -c into grep
from within find *and* preserve the {} filename, hence doing it this
way. You will get complaints about some weirdo files that don't contain
a content.xml or document.xml file, and find will report some
Permission denied errors for directories it isn't allowed to read.
Peter
More information about the ubuntu-users
mailing list