Having trouble finding a word in multiple files

Liam Proven lproven at gmail.com
Wed Jun 17 19:37:07 UTC 2020


On Mon, 15 Jun 2020 at 18:34, Chris Green <cl at isbd.net> wrote:
> >
> ??? Ay?  Plain text is *way* more compact than other ways of storing
> the data surely. Even the largest books (e.g. the whole of Harry
> Potter or the Bible) are only around a million words so, at the very
> most, I'd guess just 10Mb.

Er, no?

Up to MS Office 2007, DOC files (etc) were data + markup and
formatting info. Lots of it. So they were bigger than text files.
Considerably bigger.

Then, with the rise of the web, JavaScript, JSON and the like, XML got trendy.

Before that, things like CSV and other separated formats were used for
data. This is fine and compact but fragile. A typical format is
something like:

header, col1, col2, col3, col4
jan, 42, 44, 43, 46
feb, 32, 46,, 76
mar,  51, 32, 54, 56

See the problem? There is one fewer datum in the "feb" row -- line 2,
or possibly line 3 or 1, depending on how you count (do you include
the header line? Do you start at 0 or 1?). And what is that empty
field: a zero, a null, or something to skip?

The results vary according to human designer and programming language.
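The ambiguity is easy to reproduce with Python's standard csv module (the table is the one above; nothing here is a real-world schema):

```python
import csv
import io

# The sample table above, with the missing datum in the "feb" row.
raw = """header, col1, col2, col3, col4
jan, 42, 44, 43, 46
feb, 32, 46,, 76
mar,  51, 32, 54, 56
"""

rows = [[field.strip() for field in row] for row in csv.reader(io.StringIO(raw))]

# The csv module hands back an empty string for the missing datum --
# it is up to the caller to decide whether that means 0, None, or "skip".
print(rows[2])           # ['feb', '32', '46', '', '76']
print(rows[2][3] == "")  # True
```

Every language's CSV reader makes some choice here, and they don't all make the same one.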

XML replaces this with something like:

<row n="0">
  <month name="jan">
    <cell count="0">42</cell><cell count="1">44</cell>
    <cell count="2">43</cell><cell count="3">46</cell>
  </month>
</row>

There is no ambiguity or room for error here, but it's extremely
inefficient: there is more markup than data.
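As a rough illustration of that overhead, here is a sketch using Python's xml.etree: it builds a well-formed version of the row above (the element and attribute names are illustrative, not any real schema) and compares its size to the bare data:

```python
import xml.etree.ElementTree as ET

# A well-formed version of the row sketched above (the names are
# illustrative, not any real schema).
row = ET.Element("row", n="0")
month = ET.SubElement(row, "month", name="jan")
for i, value in enumerate(["42", "44", "43", "46"]):
    cell = ET.SubElement(month, "cell", count=str(i))
    cell.text = value

xml_bytes = ET.tostring(row)
data_bytes = b"jan,42,44,43,46"

print(xml_bytes.decode())
print(len(xml_bytes), "bytes of XML vs", len(data_bytes), "bytes of CSV")
# The markup is several times larger than the data it carries.
```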

So, MS just decided to hide that file sizes had bloated by an order of
magnitude... by Zipping the data.

So a DOCX file (or XLSX or whatever) is a single Zip archive
containing multiple compressed chunks of XML.
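The container is easy to poke at with Python's zipfile module. This is a toy stand-in -- a real .docx has more parts ([Content_Types].xml, relationships and so on) -- but it shows the shape:

```python
import io
import zipfile

# Build a toy stand-in for a .docx: a Zip archive whose members are XML.
# (A real .docx has more parts; this is just the shape of the container.)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("word/document.xml",
                "<w:document><w:body><w:p><w:r><w:t>Hello world"
                "</w:t></w:r></w:p></w:body></w:document>")

with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())        # ['word/document.xml']
    xml = zf.read("word/document.xml").decode()
    print("Hello" in xml)       # True -- after decompression

# But on "disk", none of the document's text exists as readable bytes:
print(b"Hello" in buf.getvalue())   # False
```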

There is no longer any readable text. If your Zip data structures get
corrupted, damaged, or lost, the entire file is trash. Unrecoverable
garbage. Running `strings` against it won't help. There aren't any,
only garbage.

But the result is smaller, and that looks good: they could advertise
it as more efficient file storage.

The highly repetitive XML markup compresses _very well_ so as a
result, formatted text is now _smaller_ than plain ASCII text. It's
smaller still than Unicode text.
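A sketch of that effect with Python's zlib (which implements Deflate, the algorithm inside Zip). The markup here is hypothetical word-per-element tagging, loosely in the shape of Word's <w:t> runs, not real WordprocessingML:

```python
import zlib

# Some plain text, and the same text drowned in hypothetical markup.
sentence = "the quick brown fox jumps over the lazy dog "
plain = (sentence * 200).encode()
xml = "".join(f"<w:t>{word}</w:t>"
              for word in (sentence * 200).split()).encode()

packed = zlib.compress(xml)   # Deflate, as used inside Zip archives
print(len(plain), len(xml), len(packed))
# The markup is highly repetitive, so although the XML is several
# times larger raw, once compressed it comes out smaller than the
# uncompressed plain text.
```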

And if you email it or whatever, it's pre-compressed, so it sends much
faster. In the dialup modem era, connections were typically compressed
by the modems themselves with on-the-fly streaming algorithms such as
MNP5 (run-length plus adaptive Huffman coding) or V.42bis (a
Lempel-Ziv variant).
This is not super-efficient but it means you don't need the whole file
to compress it: you can compress the data stream as it goes by when
you're transmitting it, and decompress it on the fly as it's received.
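Python's zlib exposes the same streaming idea: with a sync flush after each chunk, the sender deflates data as it passes and the receiver inflates chunks as they arrive, with neither side ever holding the whole file:

```python
import zlib

# Streaming compression in the spirit of a modem's on-the-fly compressor.
compressor = zlib.compressobj()
decompressor = zlib.decompressobj()

received = b""
for chunk in [b"hello " * 100, b"world " * 100, b"again " * 100]:
    # Z_SYNC_FLUSH forces out everything compressed so far, so the
    # receiver can decode this chunk without waiting for end-of-stream.
    wire = compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
    received += decompressor.decompress(wire)

print(received == b"hello " * 100 + b"world " * 100 + b"again " * 100)  # True
```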

So sending emails and HTML and so on was quick, but sending compressed
file formats like JPEG was slow. It can't be compressed further -- try
Zipping a JPG and you'll find it doesn't get significantly smaller --
and you waste time trying. But only a little.
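You can see both effects with zlib. Random bytes stand in here for a JPEG body, since both look like noise to a second compressor:

```python
import os
import zlib

# Random bytes simulate already-compressed data such as a JPEG body.
jpeg_like = os.urandom(100_000)
text_like = b"the quick brown fox jumps over the lazy dog " * 2000

print(len(zlib.compress(text_like)), "from", len(text_like))  # shrinks a lot
print(len(zlib.compress(jpeg_like)), "from", len(jpeg_like))  # barely changes
```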

Overall it all worked well. This is why in the dialup era, comms
programs, including the status bar of early web browsers, showed you
the transmission speed. Because it varied: both for _that connection_
depending on the noise on the phone line for that particular call, and
for the type of data being sent. For text, TX/RX speed went up, as it
was being compressed on TX and decompressed on RX. For Zip files or
JPEGs the speed seemed to go down as the modems couldn't compress it.

Microsoft Word always optimised for slow media -- Word's Fast Save
feature (which corrupted many a document) is in part a consequence of
its internal storage structure, the Piece Table:
http://1017.songtrellisopml.com/whatsBeenWroughtUsingPieceTables
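For illustration, here is a toy piece table in Python -- a sketch of the idea in the linked article, not Word's actual implementation. The original text is never modified and edits only append, which is why a Fast Save could just append new pieces to the end of the file:

```python
class PieceTable:
    """Toy piece table: the document is a list of (buffer, start, length)
    spans over two append-only buffers."""

    def __init__(self, text):
        self.original = text       # never modified
        self.added = ""            # edits are appended here
        self.pieces = [("orig", 0, len(text))]

    def text(self):
        bufs = {"orig": self.original, "add": self.added}
        return "".join(bufs[b][start:start + length]
                       for b, start, length in self.pieces)

    def insert(self, pos, text):
        # Append the new text to the add-buffer, then split the piece
        # covering `pos` -- no existing bytes move.
        where = len(self.added)
        self.added += text
        offset = 0
        for i, (b, start, length) in enumerate(self.pieces):
            if offset <= pos <= offset + length:
                cut = pos - offset
                self.pieces[i:i + 1] = [p for p in (
                    (b, start, cut),
                    ("add", where, len(text)),
                    (b, start + cut, length - cut),
                ) if p[2] > 0]
                return
            offset += length

pt = PieceTable("Hello world")
pt.insert(5, ",")
print(pt.text())  # Hello, world
```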

Well, when they switched to XML, file sizes were suddenly *huge*, but
they wanted it to _look_ better, and that meant smaller file sizes --
so the internal Zip stuff.

It means the words in the document are no longer present as plain text
in the file on disk, which brings us full circle back to why you can't
search for them with grep.
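Which suggests the fix for the original question: do the decompression yourself before searching. A sketch with zipfile and a regex -- built against a toy archive here, but you would point it at a real .docx the same way:

```python
import io
import re
import zipfile

# A toy .docx-shaped archive to search (a real one has more members).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("word/document.xml",
                "<w:p><w:r><w:t>the word you were grepping for</w:t></w:r></w:p>")

def docx_contains(fileobj, needle):
    """Decompress each XML member and search its character data."""
    with zipfile.ZipFile(fileobj) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                # Strip the markup, keep the text, then search.
                text = re.sub(r"<[^>]+>", " ", zf.read(name).decode("utf-8"))
                if needle in text:
                    return True
    return False

print(docx_contains(buf, "grepping"))  # True
```

Grep itself can't do this, but the same loop over `*.docx` files answers the question that started this thread.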

-- 
Liam Proven – Profile: https://about.me/liamproven
Email: lproven at cix.co.uk – gMail/gTalk/gHangouts: lproven at gmail.com
Twitter/Facebook/LinkedIn/Flickr: lproven – Skype: liamproven
UK: +44 7939-087884 – ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053

More information about the ubuntu-users mailing list