Having trouble finding a word in multiple files
Liam Proven
lproven at gmail.com
Thu Jun 18 11:16:00 UTC 2020
On Thu, 18 Jun 2020 at 00:04, Peter Flynn <peter at silmaril.ie> wrote:
>
> I think that's what Chris meant. The .doc files are bigger than the text
> they contain.
DOC files were bigger, yes. But no, plain text isn't more compact than
_any_ other form. Text is highly compressible, so a representation
with internal compression can easily be smaller than the plain text
it encodes.
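A rough way to see how compressible text is, as a minimal sketch using
Python's standard zlib module. The helper name compression_ratio and the
sample string are made up for illustration, and the sample is far more
repetitive than real prose, so it overstates the effect:

```python
import zlib


def compression_ratio(text: str, level: int = 9) -> float:
    """Return compressed size divided by raw size for UTF-8 encoded text."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, level)
    return len(packed) / len(raw)


# Deliberately repetitive sample; ordinary documents will compress less well.
sample = "The quick brown fox jumps over the lazy dog. " * 200

print(f"raw bytes: {len(sample.encode('utf-8'))}")
print(f"compressed/raw ratio: {compression_ratio(sample):.3f}")
```

Running the same function over the text of a real document gives a more
honest figure than the toy sample does.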
> Also the use of XML in publishing. Plus it's based on an international
> standard, and Microsoft was desperate to keep those of its very rich
> customers like governments and military who were legally obliged to use
> formal standards. Fortunately not every use of XML can be assigned to
> trendiness, although goddess knows there are plenty of them :-)
A good point.
From what I have read -- I have not looked personally -- MS's XML
format is only _technically_ open and documented. It is documented,
but the docs are huge, extremely complex, and at various points they
basically say "contents of this field may include an embedded BLOB
containing any of the older MS Office file formats" -- so in order to
decode them, you still need to maintain existing MS Office file
import/rendering code.
Whereas the OpenOffice one was genuinely open and free.
But this is hearsay from informed observers, not personal observation.
> I don't know where that comes from. It's not any XML I have ever seen.
> Is that some application's XML representation of CSV data?
It comes from my imagination, Peter. :-)
> The XML Specification is clear on this. "Terseness is of minimal
> importance."
Yerse. I strongly disagree with them on that one.
> Their bloat is even several orders of magnitude. You can test this by
> extracting the document.xml file from a .docx file, and comparing it
> with the DocBook file exported by AbiWord from the .docx file. They
> *had* to zip it: the original (2003) release of Office used actual .xml
> files as they stand, with images in Base64 encoding, and of course with
> their crazy markup, the file sizes were ludicrous.
I can easily believe that.
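For anyone who wants to check the size of document.xml without unzipping
by hand, here is a minimal sketch using Python's standard zipfile module.
The function name member_sizes is invented for illustration; it assumes
the main body of a .docx lives at word/document.xml inside the zip
container, which is the usual location:

```python
import zipfile


def member_sizes(source, member="word/document.xml"):
    """Return (uncompressed_bytes, compressed_bytes) for one zip member.

    `source` may be a path to a .docx file or an open binary file object.
    """
    with zipfile.ZipFile(source) as zf:
        info = zf.getinfo(member)  # raises KeyError if the member is absent
        return info.file_size, info.compress_size
```

Calling member_sizes("report.docx") (a hypothetical file name) returns
both figures, so the unzipped XML can be compared with the size of any
other export of the same document.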
> I haven't had a corrupted zip file in 20 years or more, and then only
> from other people. Maybe I'm lucky. But yes, if that happens, forgeddit.
Agreed. Zip itself is very stable and I've not had problems since the
early 1990s.
But filesystems and media are fragile.
> They did for a while :-)
Quite.
> > It means the words in the document are not present in the file on
> > disk, which brings us full circle back to why you can't search for
> > them with Grep.
>
> Amen.
Which is one of several reasons I personally try to have nothing to do
with MS Office post 2007.
I do run it on Macs, because I have no choice -- the older versions
are PowerPC code and no longer execute. Also, on Mac OS X, I can turn
off the ribbon and just use the menus. But I keep most of my files in
.DOC format.
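On the point above about the words not being present in the file on
disk: a .docx is a zip archive, so the text has to be unpacked before it
can be searched. The following is a minimal sketch, not a robust tool.
The function names docx_text and docx_contains are invented for
illustration, and it assumes the body text sits in w:t elements of
word/document.xml, ignoring headers, footers, footnotes and comments:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(source):
    """Return the body text of a .docx, one line per paragraph.

    `source` may be a path or an open binary file object.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    lines = []
    for para in root.iter(W_NS + "p"):
        # A word can be split across several runs, so join the runs
        # of each paragraph before searching.
        lines.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(lines)


def docx_contains(source, word):
    """Case-insensitive substring search over the extracted body text."""
    return word.lower() in docx_text(source).lower()
```

Looping docx_contains over a list of paths gives a crude grep for a
directory of .docx files. Old binary .doc files are a different format
and are not handled by this sketch.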
--
Liam Proven – Profile: https://about.me/liamproven
Email: lproven at cix.co.uk – gMail/gTalk/gHangouts: lproven at gmail.com
Twitter/Facebook/LinkedIn/Flickr: lproven – Skype: liamproven
UK: +44 7939-087884 – ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
More information about the ubuntu-users mailing list