Having trouble finding a word in multiple files
Mike Marchywka
marchywka at hotmail.com
Sun Jun 14 15:07:17 UTC 2020
On Sun, Jun 14, 2020 at 03:01:29PM +0100, Peter Flynn wrote:
> On 14/06/2020 10:17, Pat Brown wrote:
> [...]
> > Unfortunately, none of those suggestions worked. Perhaps it's
> > because the files I'm searching are either .doc, .docx or .odt
> > files.
>
> Right.
>
> The .doc files are obsolete since 2003 and
> cannot be read by normal utilities because they
> are a proprietary binary format (which is why
> Microsoft sensibly moved to .docx). There *is*
> an ancient word-to-text utility in existence
> somewhere, but that's just prolonging the agony:
> I ***very strongly*** suggest you open all your
> .doc files and Save As...docx and get rid of the
> .doc versions (and tell anyone who sends them to
> you to do the same), because...
I just ran into this with a PDF, thinking that
strings xxx.pdf | grep
would work. It didn't; for that there is pdftotext.
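
A minimal sketch of the pdftotext route, assuming the poppler-utils version of pdftotext is installed (it writes to stdout when the output file is given as "-"). The function name pdf_grep is made up for illustration:

```shell
# pdf_grep TERM FILE...: print the name of each PDF whose extracted
# text contains TERM (case-insensitive). pdf_grep is a hypothetical helper.
pdf_grep() {
    term=$1
    shift
    for f in "$@"; do
        # "-" as the output file asks pdftotext to write to stdout
        if pdftotext "$f" - 2>/dev/null | grep -qi -- "$term"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Usage would be something like: pdf_grep blowback ~/docs/*.pdf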
The other suggestion, which might have detected
the problem, was to use something like "od" to make
a histogram of byte values. I have a script I use
like this:
myod -histo mikemail.tex | head
162
158 0a
35 20
26 23
460 25
36 31
4 32
4 3a
14 5b
101 5c
It prints a count for each hex byte value. Note that linefeed "0a"
is the only value below the space "20" (the 162 is spurious output, lol).
A binary file would be obvious,
but the problem I ran into was escaped color highlighting
that fools grep.
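
myod is my own script, so here is a rough stand-in built only from standard tools (od, tr, sort, uniq, awk, sed); it is a sketch, not the script itself. The names byte_histo and strip_color are made up. The second function assumes the highlighting is the usual ANSI form, an ESC byte (hex 1b) followed by "[", digits or semicolons, and "m":

```shell
# byte_histo FILE: print "count hex" for every byte value present in FILE.
byte_histo() {
    od -An -v -tx1 < "$1" |      # every byte as two hex digits, no offsets
        tr -s ' ' '\n' |         # one value per line
        grep -v '^$' |           # drop empty lines left by od's padding
        sort | uniq -c |         # count each distinct value
        awk '{ print $1, $2 }'   # normalise to "count hex"
}

# strip_color: remove ANSI color sequences (ESC [ ... m) from stdin.
strip_color() {
    esc=$(printf '\033')         # a literal ESC byte, portable across seds
    sed "s/${esc}\[[0-9;]*m//g"
}
```

With this, "byte_histo somefile | head" shows the lowest byte values first, so an escape byte would appear as a "1b" line, and "strip_color < somefile | grep -i blowback" searches the text with the color codes removed.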
You can also write a perl script to make a vocabulary list
and see if there is anything close to the word you want; a
list is easier to browse. Not sure what that does with binary files :)
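
For what it's worth, the same idea can be sketched without perl, as a plain shell pipeline. This is a swapped-in technique, not the perl script mentioned above, and the name vocab is made up. It treats any run of letters as a word and lowercases everything:

```shell
# vocab FILE: print "count word" for each distinct word in FILE,
# most frequent first. A "word" here is any run of letters.
vocab() {
    tr -cs '[:alpha:]' '\n' < "$1" |     # every non-letter run becomes a newline
        tr '[:upper:]' '[:lower:]' |     # fold case so Blowback == blowback
        grep -v '^$' |                   # drop the empty line from leading junk
        sort | uniq -c |                 # count each distinct word
        sort -k1,1nr -k2,2 |             # highest count first, then by word
        awk '{ print $1, $2 }'           # normalise to "count word"
}
```

Something like "vocab notes.txt | grep -i blow" then shows near misses such as a misspelled or hyphenated variant.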
>
> ...the .docx, .pptx, .xlsx, and .odt files ARE
> JUST ZIP FILES containing XML, so they are
> basically plain text inside and can be searched,
> especially if all you want to find is *which*
> files contain "Blowback", rather than *where* in
> the file it occurs. You'll need to do it twice
> (because the document *inside* the zip file is
> named differently between .docx and .odt files),
> so this script does both. Set
> SEARCHTERM=blowback first
>
> for TYPE in docx odt; do
>   if [ "$TYPE" = "docx" ]; then
>     DOC="word/document.xml"
>   else
>     DOC="content.xml"
>   fi
>   find ~/ -type f -name "*.$TYPE" | while IFS= read -r filename; do
>     # extract only the main document part; skip files that lack it
>     unzip -qo "$filename" "$DOC" || continue
>     if grep -qi -- "$SEARCHTERM" "$DOC"; then echo "$filename"; fi
>   done
> done
>
> I don't know a reliable way to pass an unzip -c
> file content into grep from within find *and*
> preserve the {} filename, hence doing it this
> way. You will get some weirdo files that don't
> contain a content.xml or document.xml file, and
> you'll get some Permission denied errors from
> non-Word/ODF files during find.
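
One possible way around the filename problem, as a sketch only: keep the while-read loop (so the filename stays in a shell variable) and use unzip -p, which writes the named archive member to stdout, so nothing is extracted to disk. The function name search_docs is made up. A caveat: Word can split a run of text across several XML elements, so a multi-word phrase may not match even when it is in the document.

```shell
# search_docs TERM DIR: list .docx/.odt files under DIR whose main
# XML part contains TERM (case-insensitive). Hypothetical helper.
search_docs() {
    term=$1
    dir=$2
    find "$dir" -type f \( -name '*.docx' -o -name '*.odt' \) 2>/dev/null |
    while IFS= read -r f; do
        # the main document part is named differently in each format
        case $f in
            *.docx) part=word/document.xml ;;
            *)      part=content.xml ;;
        esac
        # -p pipes the member to stdout; archives lacking it just fail quietly
        if unzip -p "$f" "$part" 2>/dev/null | grep -qi -- "$term"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Usage would be: search_docs blowback ~/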
>
> Peter
--
mike marchywka
306 charles cox
canton GA 30115
USA, Earth
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X