Having trouble finding a word in multiple files

Peter Flynn peter at silmaril.ie
Wed Jun 17 22:03:17 UTC 2020


On 17/06/2020 20:37, Liam Proven wrote:
> On Mon, 15 Jun 2020 at 18:34, Chris Green <cl at isbd.net> wrote:
>>>
>> ??? Ay?  Plain text is *way* more compact than other ways of storing
>> the data surely. Even the largest books (e.g. the whole of Harry
>> Potter or the Bible) are only around a million words so, at the very
>> most, I'd guess just 10Mb.
> 
> Er, no?
> 
> Up to MS Office 2007, DOC files (etc) were data + markup and
> formatting info. Lots of it. So they were bigger than text files.
> Considerably bigger.

I think that's what Chris meant. The .doc files are bigger than the text 
they contain.

> Then due to the rise of the web, Javascript, JSON etc., XML got trendy.

Also the use of XML in publishing. Plus it's based on an international 
standard, and Microsoft was desperate to keep those of its very rich 
customers like governments and military who were legally obliged to use 
formal standards. Fortunately not every use of XML can be assigned to 
trendiness, although goddess knows there are plenty of them :-)

> Before that, things like CSV and other separated formats were used for
> data. 
[snip excellent example]
> XML replaces this with
> <row 0>
> <month = "jan"> <cell count=0>42</cell><cell count=1>44</cell>

I don't know where that comes from. It's not any XML I have ever seen. 
Is that some application's XML representation of CSV data?

> There is no redundancy or room for error here, but it's extremely
> inefficient. You have more markup than data.

The XML Specification is clear on this: "Terseness in XML markup is of 
minimal importance."

> So, MS just decided to hide that file sizes had bloated by an order of
> magnitude... by Zipping the data.

Their bloat can even reach several orders of magnitude. You can test this 
by extracting the document.xml file from a .docx file, and comparing it 
with the DocBook file exported by AbiWord from the same .docx file. They 
*had* to zip it: the original (2003) release of Office used actual .xml 
files as they stand, with images in Base64 encoding, and of course with 
their crazy markup, the file sizes were ludicrous.
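
If you want to see how much the zipping hides, something like this rough 
Python sketch should do. It assumes the main text lives in the 
word/document.xml member, which is the usual place in a .docx; the 
function name member_sizes is just something I made up for illustration.

```python
import zipfile


def member_sizes(docx_path, member="word/document.xml"):
    """Return (uncompressed_bytes, compressed_bytes) for one zip member.

    Assumption: the file is an ordinary zip container and `member`
    exists inside it; a KeyError is raised if it does not.
    """
    with zipfile.ZipFile(docx_path) as zf:
        info = zf.getinfo(member)
        # file_size is the size of the XML once unpacked;
        # compress_size is what it actually occupies inside the zip.
        return info.file_size, info.compress_size
```

Calling member_sizes("report.docx") (a hypothetical filename) returns the 
two byte counts, so you can compare the unpacked XML against what sits on 
disk.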

> There is no longer any readable text. If your Zip data structures get
> corrupted, damaged, or lost, the entire file is trash. Unrecoverable
> garbage. Running `strings` against it won't help. There aren't any,
> only garbage.

I haven't had a corrupted zip file in 20 years or more, and then only 
from other people. Maybe I'm lucky. But yes, if that happens, forgeddit.

> But it's smaller so that looks good. They could advertise as more
> efficient file storage.

They did for a while :-)

> It means the words in the document are not present in the file on
> disk, which brings us full circle back to why you can't search for
> them with Grep.

Amen.
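
For what it's worth, the workaround is to unzip before searching. Here is 
a minimal Python sketch, assuming the usual .docx layout (body text in 
word/document.xml, inside w:t elements in the WordprocessingML 
namespace). The names docx_text and files_containing are my own, not 
from any tool, and it ignores headers, footnotes and the like.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx, one line per paragraph."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # Word often splits one word across several runs, so join the
        # text of all w:t elements in a paragraph with no separator.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def files_containing(word, paths):
    """Return the paths whose body text contains word (case-insensitive)."""
    hits = []
    for path in paths:
        try:
            if word.lower() in docx_text(path).lower():
                hits.append(path)
        except (zipfile.BadZipFile, KeyError, ET.ParseError, OSError):
            # Not a zip, no word/document.xml, broken XML, or unreadable:
            # skip it rather than abort the whole search.
            continue
    return hits
```

So files_containing("grep", ["a.docx", "b.docx"]) (hypothetical 
filenames) gives back whichever of the two mention the word, which is 
what grep would have told you if the text were actually in the file.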

P



