[brisbane:MERGE] Lazy Groupcompress streaming

John Arbash Meinel john at arbash-meinel.com
Tue Mar 17 20:59:59 GMT 2009


I'll try to summarize the changes in the attached patch:

1) GroupCompressVersionedFiles.get_record_stream() now returns a lazy
factory object for each request, which references the underlying
block. This means extraction is no longer done by get_record_stream()
itself; it happens either at 'get_bytes_as()' time or inside
'insert_record_stream()'.
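
As a rough sketch of the shape of this (class and method names here
are illustrative, not the actual bzrlib API):

    class LazyRecordFactory(object):
        """Hands out one record's bytes on demand, sharing one block."""

        def __init__(self, key, manager):
            self.key = key
            self._manager = manager  # manager holds the compressed block

        def get_bytes_as(self, storage_kind):
            if storage_kind == 'fulltext':
                # Extraction is deferred until someone actually asks.
                return self._manager.extract(self.key)
            if storage_kind == 'groupcompress-block':
                # Hand the (possibly trimmed) compressed block onward,
                # for the wire or for direct insertion.
                return self._manager.wire_bytes()
            raise ValueError('unknown storage_kind %r' % (storage_kind,))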

2) insert_record_stream() now defaults to copying the blocks as they
come in. This should bring 'bzr branch' times much closer to
acceptable, since we no longer repack everything at 'branch' time.

3) _insert_record_stream(reuse_blocks=False) is now how autopack and
pack force the blocks to be rebuilt.
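
In pseudo-ish terms, (2) and (3) together look something like this (a
simplification, not the real code path):

    def insert_record_stream(self, stream, reuse_blocks=True):
        for record in stream:
            if reuse_blocks and record.storage_kind == 'groupcompress-block':
                # Fast path for 'bzr branch': copy the compressed
                # block as-is, without recompressing anything.
                self._copy_block(record.get_bytes_as('groupcompress-block'))
            else:
                # Path used by pack/autopack: extract each text and
                # feed it to the compressor to build fresh groups.
                self._compressor.compress(
                    record.key, record.get_bytes_as('fulltext'))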

4) get_record_stream().get_bytes_as('groupcompress-block') now has a
wire-stream format suitable for our network streaming code.
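
I won't repeat the header layout here (it is in the patch), but the
idea is roughly: a small header naming which records the block
carries and their byte ranges, followed by the compressed block
itself. A hypothetical illustration only, not the real format:

    def to_wire_bytes(entries, compressed_block):
        # entries: [(key, start, end)] byte ranges within the
        # uncompressed block for each record being carried.
        header = ''.join('%s %d %d\n' % (key, start, end)
                         for key, start, end in entries)
        # Length-prefix the header so the receiver can split it from
        # the compressed block bytes that follow.
        return '%d\n%s%s' % (len(header), header, compressed_block)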

5) When handing out a block from get_record_stream() we do a couple
of quick checks based on the content we will actually be copying. We
perform these checks when converting to wire bytes for streaming, and
again just before inserting the content. The former lets the smart
server stream less data; the latter keeps local copies from 'bloating'
the target.

This is fairly necessary after (2). If your repository is
suboptimally packed (such as after a .get_record_stream('unordered')
from a pack-0.92 repo), then an ordered (topo or groupcompress) fetch
tends to cause 'group scatter', meaning you read randomly from many
groups ('unordered' is already optimized to minimize group churn).
Without this change, you would end up inserting an entire 1MB group
for every little bit of it that was requested.
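
The check itself can be cheap; something shaped like this (the
thresholds here are made up for illustration):

    def _check_rebuild_action(block, content_ranges):
        # content_ranges: (start, end) offsets of the texts we will
        # actually copy out of this block.
        bytes_needed = sum(end - start for start, end in content_ranges)
        max_offset = max(end for start, end in content_ranges)
        if bytes_needed > 0.75 * block.uncompressed_length:
            return 'copy'     # most of the block is wanted, reuse as-is
        if max_offset < 0.5 * block.uncompressed_length:
            return 'trim'     # wanted content is all near the front
        return 'rebuild'      # scattered use, extract and recompress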

On a trivial conversion of bzrtools without proper autopack, this
bloated the repository from 6MB to 60MB (optimally packed it is
1.2MB). With just trimming and rebuilding, it was back down to 8MB.

There is room for more work here, both in tuning when we 'trim'
versus 'rebuild', and in a third 'strip' option that still needs work
but would fall between the two in cost and expected gain.

trim is nice and fast: it is just decompression capped at the maximum
offset we need, followed by recompression of that prefix.
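
With zlib that looks roughly like this (a sketch, not the actual
GroupCompressBlock code):

    import zlib

    def trim_block(compressed_bytes, max_offset_needed):
        d = zlib.decompressobj()
        # max_length stops decompression early, so we never inflate
        # the tail of the block that nobody asked for.
        prefix = d.decompress(compressed_bytes, max_offset_needed)
        # Recompress just the prefix as a new, smaller block.
        return zlib.compress(prefix)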

6) GroupCompressBlock is now designed to manage the block in both
compressed and uncompressed form, and to extract content from it. For
zlib compression we can now do "partial decompression": if the text
we are looking for is in the first 1kB, we don't have to decompress
the whole 2MB. This made a 5s difference (out of 10s) for
'bzr ls -r-1' on the launchpad tree, which makes me think more that
lzma isn't going to work well in the general case, given how much of
this win comes from cheap partial decompression. (We might still
consider it for a 'compress the old stuff very tightly because we
don't expect to access it often' case.)
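
The zlib side of the partial decompression is just the max_length
argument; a minimal sketch:

    import zlib

    def extract_text(compressed_bytes, start, end):
        d = zlib.decompressobj()
        # Inflate only up to 'end'. If the text ends at 1kB, we touch
        # ~1kB of output even if the whole block would inflate to 2MB.
        plain = d.decompress(compressed_bytes, end)
        return plain[start:end]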

7) It changes the default to _NO_LABELS = True. We have to generate a
different header to send blocks over the wire anyway. Also, inventory
and revision texts have the revision-id in them, so they are
self-describing, and CHK pages are self-describing since their key is
their hash. That only leaves file texts. I would still consider
moving the *content* into a CHK page and keeping just the per-file
graphs in another index. But anyway, (file_id, revision_id) is the
only thing I would want to put into labels. We can look at
re-enabling that...


8) I'm a bit concerned that _LazyGroupCompressFactory and
_LazyGroupContentManager end up creating a reference cycle (just like
LazyKnitContentFactory does). We probably can change
Manager.get_record_stream() to clear self._factories when it is done.
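
Something along these lines might be enough (hypothetical sketch):

    def get_record_stream(self):
        for factory in self._factories:
            yield factory
        # Drop the manager->factory references once the stream is
        # exhausted; the factories may still point at the manager,
        # but that is no longer a cycle.
        self._factories = []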


Overall, I'm pretty happy with this change. There is still a
reasonable amount of work to do before something like 'bzr branch'
performs well. (During _find_file_ids_to_fetch, we call
_bytes_to_entry, which has to do a .decode('utf-8') on every
record....)

John
=:->
Attachment: lazy_gc_stream.patch
https://lists.ubuntu.com/archives/bazaar/attachments/20090317/ec725d8a/attachment-0001.diff

