More LiveCD space optimizations
Louis Simard
louis.simard at gmail.com
Thu Oct 7 22:22:33 UTC 2010
* LONG MESSAGE WARNING *
While I've tried to trim the quotes and quote nesting as much as I
could, this message is still long. It is still worth reading when you
have time.
2010-10-07 16:07 GMT John McCabe-Dansted <gmatht at gmail.com>:
> On Thu, Oct 7, 2010 at 10:05 AM, Louis Simard <louis.simard at gmail.com> wrote:
>> <snipped>
>
> I think this will be discussed at UDS-N, see:
> http://archives.free.net.ph/message/20101004.065026.e553efd1.en.html
Awesome! Should a digest of this conversation be posted to
ubuntu-devel once we're done, while continuing on ubuntu-devel-discuss
for now?
>> 2010-10-06 16:08 GMT John McCabe-Dansted <gmatht at gmail.com>:
>>> [...] I note that we can save further space by:
>>>
>>> 1) Using advdef on the png files in addition to optipng. This is what
>>> optimizegraphics does, and this shrinks the pngs on the Maverick RC
>>> liveCD from about 100.1MB to 85.3MB providing a saving of 14.8MB.
>
> We could test each file [after using advpng on them]
> to ensure the image is identical, perhaps
> using pngtopnm, and md5sum. This would be especially important for
> jpegrescan/jpgcrush, which is at version 0.0.0-1.
Good idea. I may be able to integrate this test into my script as an option.
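Something like this is what I have in mind for the check (a sketch
only: the pngtopnm decoder from netpbm is assumed, and I've added a
DECODER override of my own so other decoders can be substituted):

```shell
# png_pixels_identical ORIG NEW -- succeed if the decoded pixel data of
# the two files match. DECODER defaults to pngtopnm (netpbm); it can be
# overridden, e.g. for testing or for other image formats.
png_pixels_identical() {
    decoder="${DECODER:-pngtopnm}"
    sum_a=$("$decoder" "$1" | md5sum | cut -d' ' -f1)
    sum_b=$("$decoder" "$2" | md5sum | cut -d' ' -f1)
    [ "$sum_a" = "$sum_b" ]
}
```

The script would run this on every image after advpng/jpegrescan and
refuse the recompressed copy on a mismatch.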
>>> 2) Recompressing gz files with advdef. Using advdef, we can shrink the
>>> gz files from 89.5MB to 84.8MB, [...] a saving of 4.7MB.
>>
>> [...] I did use 7zip's Deflate compressor to recompress a
>> .zip file of OpenOffice.org's from 5.9 MB to 5.4 MB. [...]
>
> You mean images_human.zip?
Yes, thanks. :) I had forgotten the name.
> I have a hunch that compressing that file
> wouldn't actually save space on the liveCD as I can gzip it down to
> 3.9MB. It may be better to leave it as an uncompressed zip, and let
> squashfs deal with it.
Per that "Performance - Disk footprint" thread from ubuntu-devel
[brainstorm], we may want to care about the installed size as well,
and use the 7zip recompression. While it's not going to be *perfectly
optimal*, reducing both the CD footprint and the installed size by
0.5 MB with 7zip sounds better than reducing the CD footprint by 2 MB
while increasing the installed size by more than 2 MB. And if you
managed to re-gzip the zip, squashfs will also manage to re-lzma it
for more savings while keeping a decent installed size. I think you
should test this again with lzma.
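A quick way to run that comparison, as a sketch (the compressor
commands are the caller's choice, not fixed; an xz/lzma-utils command
could stand in for lzma --best):

```shell
# compare_compressors FILE CMD... -- print each compressor command's
# output size for FILE, to estimate what an outer compressor like
# squashfs could still squeeze out of it.
compare_compressors() {
    f="$1"; shift
    for c in "$@"; do
        # $c is intentionally unquoted so "gzip -9" splits into cmd+args
        printf '%s: %s bytes\n' "$c" "$($c -c < "$f" | wc -c | tr -d ' ')"
    done
}
```

Usage would be along the lines of
`compare_compressors images_human.zip "gzip -9" "lzma -9"`.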
> Recompressing the pngs contained in the zip
> sounds worthwhile though. Strangely, even running advzip -z -0
> images_human.zip shrinks it by 3%, and even shrinks the corresponding
> images_human.zip.gz file
I believe you there, if only because the original situation has a
deflated container (png) within another deflated container (zip).
Counter-intuitive, but something to consider.
> Also, there are 12MB of jar files, which are basically zip files. We
> can also shrink those by 5MB or so with advzip, but that doesn't seem
> to shrink a .tgz of them so it may not shrink the liveCD. Since zip
> files compress file by file, we may be able to save space on the
> liveCD by running "advzip -z -0" on them. That would expand them to
> 24MB, but reduces the size of a .tgz of them to 4.6MB, possibly saving
> space on the liveCD if squashfs is similarly efficient.
<Later post by Matthias Klose>
> same for jar files. are these extracted as fast as without your changes by the
> jvm? if not, then these should be left alone (and afaik there shouldn't be any
> jar files on the live CD).
Aha! I completely forgot .jar files. The OpenJDK package itself may
become much smaller after this, because of the huge runtime rt.jar.
Must test and benchmark this!
I believe OpenOffice.org is a huge user of Java, so there would be
.jar files on the LiveCD from that too.
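To estimate whether storing the jars uncompressed actually helps the
LiveCD, an outer tar+gzip pass can stand in for squashfs. This is a
rough approximation only (squashfs, especially with lzma, will differ
in detail):

```shell
# outer_size FILE... -- size of the given files after an outer
# tar+gzip pass, a crude proxy for how squashfs would store them.
outer_size() {
    tar -cf - "$@" | gzip -9 | wc -c | tr -d ' '
}
```

Comparing `outer_size foo.jar` before and after `advzip -z -0 foo.jar`
would show whether the expanded jar really costs less on the image.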
>>> A further 10MB could be saved by recompressing the gz files as lzma.
>> At what LZMA compression level? Default (7) or --best (9)?
> --best
I just want to add that blanket recompression of gzip files as lzma
with --best could be harmful, though for small files it's probably OK.
LZMA uses a huge dictionary to do its work, which must be allocated on
the decompressing side as well, and on larger files --best may exhaust
the memory of low-end computers.
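A guard against that could choose the level from the input size. As a
sketch only: the 1 MiB threshold and the fallback level 7 here are
purely illustrative assumptions, not measured limits:

```shell
# pick_level FILE -- choose an lzma compression level from the input
# size, so --best (-9) is only used where the thread considers it
# probably safe (small files). Threshold is an arbitrary placeholder.
pick_level() {
    size=$(wc -c < "$1")
    if [ "$size" -lt 1048576 ]; then
        echo 9    # small input: per the discussion, --best is probably OK
    else
        echo 7    # larger input: stick to the default level
    fi
}
```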
> Also, if we want to take replacing deflate with lzma to extremes, we
> could replace the deflate compression in the png files with lzma. A
> command that does this is "advpng -z -0 $f && lzma --best $f". I found
> that this could save 18.7MB. However, it may also degrade performance
> slightly, but I doubt it would be too significant on modern CPUs.
> Running unlzma on all 66MB of the .png.lzma files takes:
> real 1m2.666s
> user 0m6.540s
> sys 0m5.610s
>
> I think the user/sys are the relevant ones, and taking 12s to read
> every png doesn't seem too bad. The main thing is that I doubt that it
> would work out of the box.
>
> If we use lzma in the squashfs, just deflating them all with advpng -z
> -0 could reduce the liveCD size. Probably wouldn't help the installed
> size though.
Indeed.
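For reference, that advpng -z -0 pass over a whole tree could be
sketched like this (the DRY_RUN switch is my own addition for
previewing the commands first; advpng comes from AdvanceCOMP):

```shell
# store_pngs_uncompressed DIR -- rewrite every PNG under DIR with its
# deflate layer removed (advpng -z -0), so the squashfs compressor
# sees raw pixel data. With DRY_RUN=1, only print the commands.
store_pngs_uncompressed() {
    find "$1" -name '*.png' | while read -r f; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "advpng -z -0 $f"
        else
            advpng -z -0 "$f"
        fi
    done
}
```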
> There are over a dozen different types of file to be tested (and
> there may be more than one application that wants to read them). For
> reference, I have attached them. Probably the most important thing to
> check is that printing still works, as many of the gz files seem to
> be e.g. ppd files.
>
> Maybe if you added it to your script and just gave the resulting iso a
> spin in a VM to see if there was obvious breakage?
I have no printer supported by OpenPrinting PPDs to test this with,
but a VM is exactly what I used to test SVG, XML and PNG optimisations
in May (and to realise that librsvg had a bug that needed to be worked
around in Scour! [librsvgbug]). I'll do this, but the PPDs would still
need testing afterwards.
A separate thread already exists for the PPD gzip compression
([openprinting-ppds-gzip]), along with possible contact people;
perhaps it would be best to get in touch and have them test
AdvanceCOMP and add it to their gzipping.
> Hmm, the biggest difference I could find was that advdef can shrink
> libidn11/AUTHORS.gz 170 bytes smaller than lzma. In total 4720 files
> are smaller as .gz, in total we can save another 165KB by letting some
> gz files remain gz files (and this is not counting the 9K we save in
> the directories as ".gz" is two bytes smaller than ".lzma" ;)
Hehe, counting directory entry sizes too. For the sake of accuracy! :)
For completeness, that 9 KB would be saved only on the LiveCD, as
inodes on a disk take a minimum fixed size anyway. 165 KB is also
small, though it's more significant than that 9 KB.
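The per-file decision you describe boils down to keeping whichever
candidate is smaller. A sketch of the picking logic (in practice the
two candidates would be the advdef'd .gz and the lzma --best output):

```shell
# smaller_of A B -- print the path of whichever file is smaller
# (ties go to A, which here would be the .gz candidate).
smaller_of() {
    size_a=$(wc -c < "$1")
    size_b=$(wc -c < "$2")
    if [ "$size_a" -le "$size_b" ]; then echo "$1"; else echo "$2"; fi
}
```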
> Still something feels a bit unclean about arbitrarily picking gz or lzma.
If you mean mixing the two, then I agree: if a vulnerability is found
in either the gzip or the lzma support of a tool reading these files,
it can be exploited, and mixing them is unclean because it doubles the
attack surface.
>> Do you want me to add to my script any of the optimisations discussed
>> in your email? They are: Using AdvanceCOMP to recompress .png images
>> and gzipped files; using either of jpegoptim or jpegrescan to
>> losslessly recompress .jpg images; "transcoding" man pages from .gz to
>> .lzma. I'm not going to add untested optimisations yet, such as
>> transcoding *all* .gz files to .lzma.
>
> Sure. This could help with testing that these actually work ;).
Will do, and attach it to a separate email posting.
>> [... Should I then] file a single bug report on all of the
>> packages that would benefit the most from optimisations? That way,
>> package maintainers could opt in rather easily.
>
> I wouldn't file bugreports until it has been discussed at UDS-N end of
> October. However it does seem this could be useful for upstream, e.g.
> if OO could drop the size of their 150MB windows installer by a few
> MB.
Ok, I'll hold off until then.
> P.S. You mentioned html files previously. I tried running Webpack with
> the HTML::Clean backend. This shrunk the html files by 1MB, but only
> shrunk the corresponding .tgz file by 100k. Also on many files it gave
> warnings that it was removing whitespace even though the file had a
> <pre> tag which made whitespace important. We could fix this, but it
> seems like a low priority.
Relatedly, gbrainy crashed back in May when I ran xmllint on its
questions file: the data was identical afterwards, but the whitespace
had been removed. It may also have been caused by 'xmllint --nsclean',
though. So yeah, automated HTML cleanup is a low priority :)
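If someone does pick it up later, a minimal guard could skip
whitespace-sensitive files up front. A sketch only: a plain grep is an
approximation, not a real HTML parse:

```shell
# safe_to_clean FILE -- succeed unless FILE contains a <pre> tag,
# inside which whitespace is significant and must not be collapsed.
safe_to_clean() {
    ! grep -qi '<pre' "$1"
}
```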
- Louis
[brainstorm] http://archives.free.net.ph/message/20101004.065026.e553efd1.en.html
[librsvgbug] https://bugzilla.gnome.org/show_bug.cgi?id=620923
[openprinting-ppds-gzip]
https://lists.ubuntu.com/archives/ubuntu-devel-discuss/2010-August/011965.html