RFC: get_record_stream and groupcompress

Robert Collins robertc at robertcollins.net
Fri Aug 15 00:07:25 BST 2008


On Thu, 2008-08-14 at 17:18 -0500, John Arbash Meinel wrote:


> I feel like the heuristics for determining when to pay attention to the
> hint are somewhat ill-defined, and would require a non-trivial amount of
> tweaking to make sure they are correct. I'd rather postpone than wait a
> long time to get this API right on top of everything else.

I don't want to write another fetch API. If both knit->gc and gc->gc
don't fit well into get_record_stream, then get_record_stream is broken
and I need to fix it. This thread is about *both* the knit->gc and
gc->gc scenarios - neither is being sacrificed for the other.

If it takes non-trivial time to get right in this API, it will take
non-trivial time to get right outside this API, and would also give us
another API to manage.
 
> >>> We could buffer to a local store - for instance we could do like git
> >>> does some form of file-per-object store [but still using atomic
> >>> insertion to make data visible to other readers] and then compress that
> >>> store.
> >> Sounds like a lot of work to avoid doing "bzr pack".
> > 
> > The difference is that pack has to consider the whole repo; doing a
> > short term buffer operation (heck even a totally uncompressed temporary
> > pack) would only consider transmitted data, so be O(change) not
> > O(history).
> >  
> 
> I think this is an interesting idea. I'm not very convinced about
> needing visibility before it gets repacked. I'm *not* convinced that it
> needs to be there before we allow groupcompress to actually get into
> bzr.dev so that it is tracked properly, or even before it is a non-dev
> format to be used. It feels a lot like something we can add later.

I asserted it should *not* be visible. Being visible would require
either atomicity or topological insertion.
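
For illustration only, the file-per-object buffering could look like
this (an assumed layout, not git's or bzr's actual scheme).
Write-to-temporary plus rename keeps each buffered object intact on
disk, without exposing anything as repository data:

import os
import hashlib
import tempfile

def buffer_object(store_dir, content):
    # Name the object by its content hash, git-style.
    name = hashlib.sha1(content).hexdigest()
    path = os.path.join(store_dir, name)
    fd, tmp = tempfile.mkstemp(dir=store_dir)
    try:
        os.write(fd, content)
    finally:
        os.close(fd)
    # rename() is atomic on POSIX filesystems, so a concurrent reader
    # of the buffer directory sees a whole object or nothing.
    os.rename(tmp, path)
    return name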

I think groupcompress is best developed outside bzrlib for the moment;
there are only two developers to date, and the best input I'd like from
other folk is either design review or code. The biggest hurdle to
contributing to groupcompress is needing index2 + pybloom as well as
groupcompress itself.

> >> My feeling is to cross the bridge when we get there. Let's get basic
> >> groupcompress working.
> > 
> > It is.
> 
> I don't know what updates you've done and published. Last I checked
> there wasn't anything for me to pull. It may be sitting on your disk.
> 
> 'time bzr branch gc-repo1/bzr.dev gc-repo2/bzr.dev'
> 
> is not strictly "usable" at the moment. Without the LRU text cache, it
> has been 8 min+ and it hasn't gotten past "Transferring 0/4".

And how much memory is it using? Tonnes? Well, that's get_record_stream
again, the very API we're talking about improving.

> With an LRU cache (which at least simulates transferring in a better
> order) the last time I tested it took 7 min to transfer from a pack repo
> to a gc repo. I'm currently at 9m30s between gc repos and still on 0/4.
> 
> It does work, but it takes far too long to be considered something I
> would ever have people actually use.

I'm not interested in optimising gc->gc separately from getting the
layering right - I have enough of a feel for all the issues that I
think we can do both at this point.
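
For reference, the LRU text cache John mentions is roughly this shape.
This is a generic sketch, not bzrlib's own lru_cache module: keep the
most recently extracted fulltexts around so that building the next
delta doesn't force a re-extraction.

from collections import OrderedDict

class LRUTextCache(object):

    def __init__(self, max_items=256):
        self._max = max_items
        self._cache = OrderedDict()

    def get(self, key):
        text = self._cache.pop(key, None)
        if text is not None:
            # Re-insert to mark this entry as most recently used.
            self._cache[key] = text
        return text

    def add(self, key, text):
        self._cache.pop(key, None)
        self._cache[key] = text
        while len(self._cache) > self._max:
            # Evict from the least recently used end.
            self._cache.popitem(last=False)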

> >>  Get fetch working such that it doesn't transmit
> >> all full-texts and require a full recompression of everything.
> > 
> > Half done. In fact this is why I'm working on the ordering issue.
> 
> I can't say I've heard much about how you plan to avoid the
> full-recompression problem.

I think I've said this before, but I think we should at least
fully decompress everything coming in, to be sure it's intact etc. I
know that hg diffs everything on insert and is nice and fast. I think
it's a bug if insertion is too slow to compress on the fly; we should
profile against conversions into other systems to see if this is
realistic.
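
Concretely, the check I mean is roughly this. It's a sketch: the record
attributes (key, sha1, get_bytes_as) follow bzrlib's ContentFactory
shape, but treat the details as assumed.

import hashlib

def verified_stream(stream):
    # Fully expand each incoming record and compare its sha1 before
    # passing it on; this catches corruption in transit or in the
    # compressed form.
    for record in stream:
        text = record.get_bytes_as('fulltext')
        if record.sha1 is not None:
            if hashlib.sha1(text).hexdigest() != record.sha1:
                raise ValueError('corrupt record: %r' % (record.key,))
        yield record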

I do agree that it would be nice not to pay compression costs that we
don't need to, and I think we can do that with get_record_stream.
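
One possible shape for that: when a record's storage kind already
matches a form the target can insert directly, pass the raw bytes
through rather than extracting and recompressing. Note that
'groupcompress-block' here is an assumed storage-kind name for
illustration, not an existing bzrlib value.

NATIVE_KINDS = frozenset(['groupcompress-block'])  # assumed name

def adapt_record(record, target_kinds=NATIVE_KINDS):
    if record.storage_kind in target_kinds:
        # Already in a form the target can absorb; no re-pack needed.
        return record.get_bytes_as(record.storage_kind)
    # Otherwise fall back to the fulltext and let the target compress.
    return record.get_bytes_as('fulltext')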


> I would actually guess the file extension is one of the better bits to
> use, but I don't know that for sure.

Possibly; testing needed :).

> Anyway, the first portion of a file-id (before the first '-') is
> actually the basename. So just sorting by file_id is probably going to
> get us close. Sure, it doesn't change with a rename, etc, etc, but it is
> still a decent hint.

It's not the basename for svn file ids and various others.
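
The heuristic as code, for what it's worth. As noted, it's only a hint,
since svn-derived and other foreign file ids don't follow bzr's
basename-date-random layout:

def basename_hint(file_id):
    # bzr-generated ids look like 'basename-YYYYMMDDHHMMSS-xxxx-N',
    # so the segment before the first '-' approximates the original
    # basename. Foreign ids (e.g. from svn) do not follow this layout.
    return file_id.split('-', 1)[0]

# e.g. group texts of similarly-named files before compressing:
# keys.sort(key=lambda key: basename_hint(key[0]))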


> We don't have gc in bzr.dev, and I don't think hinting is a strict
> requirement for getting it in. Thus I don't think the hinting should be
> the current highest priority. Maybe I'm just in "let's get things
> moving" mode from being a release manager.

I'm not putting gc up for inclusion until it fits cleanly in; that means
patches to bzr.dev to make it a cleaner fit, which is precisely what
this thread is about.

> Getting a way to not have to spend 7 minutes of CPU to branch bzr.dev
> locally would be something more important than passing something other
> than "unordered" to get_record_stream(), IMO.

I don't think these are separate problems.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.