groupcompress extraction 10x faster

Robert Collins robert.collins at canonical.com
Thu Feb 19 21:37:33 GMT 2009


On Thu, 2009-02-19 at 14:52 -0600, John Arbash Meinel wrote:
> 
> I don't have a great answer here, but I have some thoughts:
> 
> 1) The first time a given key prefix (file_id) is seen in a group
> compressor, insert all lines.

I don't think this will help that much, and in fact I think it would be
useful to be able to compress things like Makefile across these
boundaries (because they may well be very similar).

> 2) Insert all lines whenever the 'compressed' form exceeds a certain
> number of 'operations' or 'total length'. (certainly if
> len(comp_bytes) > len(raw_lines), but perhaps even if it is > 0.5 the
> raw lines.)

This is a more interesting metric - it's one I've been considering for
triggering 'time for a new group' (perhaps with a counter, e.g. when
the last 5 texts have been more complex), and perhaps even removing
them from the old group and putting them into a fresh one.
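
A rough sketch of what I mean, in case the shape is unclear (these
names are made up for illustration, not the real GroupCompressor API):

    # Hypothetical policy: start a new group once the last few texts
    # have compressed poorly. Both thresholds are illustrative.
    class GroupRestartPolicy(object):

        def __init__(self, poor_ratio=0.5, max_poor_run=5):
            self.poor_ratio = poor_ratio
            self.max_poor_run = max_poor_run
            self._poor_run = 0

        def note_text(self, raw_len, compressed_len):
            """Record one text; return True if a new group is warranted."""
            if compressed_len > self.poor_ratio * raw_len:
                self._poor_run += 1
            else:
                self._poor_run = 0
            return self._poor_run >= self.max_poor_run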

> 3) My 'optimal' ordering is sorting by file-id, and then doing
> reverse-topological sorting from there. When I look closely at it, it
> seems that this ends up inserting at least 20 completely empty texts
> as the first few texts. I don't quite understand that, but it is
> certainly odd, and would certainly cause bad matches.

So that's 'newest first'?

The ordering we should aim for (because we know it performs well and is
a good starting point) is something like basename, then size; fileid
with newest-first is a decent proxy for that, so that sort should
behave ok.
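
For concreteness, the kind of key I mean (a toy version only, not what
the GCVF computes today):

    import os

    # Toy ordering: similar basenames (e.g. every Makefile in the
    # tree) end up adjacent, with similarly-sized texts next to each
    # other, which is what the matcher wants to see.
    def sort_key(path, size):
        return (os.path.basename(path), size)

    texts = [('a/Makefile', 1200), ('lib/foo.c', 5400), ('b/Makefile', 1250)]
    texts.sort(key=lambda item: sort_key(*item))
    # -> both Makefiles adjacent, then foo.c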

> 4) The "insert if copy is longer" code could probably be improved.
> I've seen cases where we have a regular insert, followed by one of
> these pseudo-inserts. The problem with a pseudo-insert is that it is
> not added to the index of lines to match against.

Yes, we probably need to do that.
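
Something along these lines, roughly (a sketch only; line_index is a
stand-in for however the matcher tracks candidate lines, not the real
internals):

    # When a copy would encode longer than the literal text and we
    # emit the lines as an insert instead, also feed them into the
    # match index so later texts can copy from them.
    def index_emitted_lines(line_index, emitted_lines, start_offset):
        for i, line in enumerate(emitted_lines):
            line_index.setdefault(line, []).append(start_offset + i)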

> 5) Maybe the insertion of lines could be triggered if there is a
> series of copy instructions with no insertions, as that could be
> indicative of a code block that is going to be repeated often in the
> future.

Possibly. Alternatively we can look at starting new groups.

> I went ahead and tried a quick implementation of (1). And for the most
> part, it worked. The compression time was quite a bit slower (10m =>
> 24m). Probably more regions that it had to check for the longest
> match,

Likely. Note that while we don't aim for unique lines, by being fairly
sensitive we shouldn't end up with many collisions in the current
algorithm (and it's collisions that will drive up the inner-loop time,
sorting out which of the multiple possible matches is really valid).
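
To illustrate what I mean by the inner loop (a toy matcher, not the
real groupcompress code): every colliding entry in a bucket is one
more candidate that has to be fully extended before we know which
match is real.

    # Toy longest-match search: each candidate offset for the first
    # line costs a full extension pass, so N collisions means N passes.
    def longest_match(line_index, output_lines, new_lines, pos):
        best_len, best_start = 0, None
        for start in line_index.get(new_lines[pos], []):
            length = 0
            while (pos + length < len(new_lines)
                   and start + length < len(output_lines)
                   and new_lines[pos + length] == output_lines[start + length]):
                length += 1
            if length > best_len:
                best_len, best_start = length, start
        return best_start, best_len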

> I would assume exacerbated by something like a common copyright
> header.
> However, the total compression was better:
> 
> Commits: 1043
>                       Raw    %    Compressed    %  Objects
> Revisions:       3990 KiB   0%       826 KiB   3%     1043
> Inventories:    31012 KiB   3%     12617 KiB  50%    12090
> Texts:         882272 KiB  96%     11659 KiB  46%     7226
> Signatures:         0 KiB   0%         0 KiB   0%        0
> Total:         917275 KiB 100%     25102 KiB 100%    20359
> 
> Though it is still overall *worse* than the knit compression :(.
> 
> I'm still playing with it, tweaking some of the parameters (like
> avoiding all matches against empty texts, and starting new compression
> groups earlier when we run into a new file_id.)

One easy knob is to lower the max raw group size. I think we should be
wary of tying compression choices to fileids or other higher-layer data
- the compression results should dictate the low-level behaviour, while
the higher layers (the GCVF) should structure the input to make good
use of the lower level.

Another separate but useful tool would be to get some stats - a basic
histogram or scatter plot - of the encoded size, position in file,
#copies, #inserts, bytes-in-output-from-copies, and
bytes-in-output-from-inserts. We could use that to spot common
pathologies, get ratios, etc.
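
Even something this crude would do (the record fields are invented for
illustration; they'd come from walking the groups):

    # Crude histogram: bucket each text by the fraction of its output
    # that came from copy instructions rather than literal inserts.
    def copy_ratio_histogram(records):
        """records: iterable of (copy_bytes, insert_bytes) per text."""
        buckets = [0] * 11
        for copy_bytes, insert_bytes in records:
            total = copy_bytes + insert_bytes
            if not total:
                continue
            buckets[int(10.0 * copy_bytes / total)] += 1
        for i, count in enumerate(buckets):
            print '%3d%%-from-copies: %s' % (i * 10, '*' * count)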

What we should expect to see is that long files that are older than tip
are encoded as many copy instructions, with a few insertions (to add
back content that was deleted between that version and tip). So 711
copy instructions isn't actually bad, unless the file is only 711 lines
long :).

-Rob


