RFC: get_record_stream and groupcompress

Robert Collins robertc at robertcollins.net
Thu Aug 14 22:45:52 BST 2008


On Thu, 2008-08-14 at 15:34 -0500, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Robert Collins wrote:
> > So I've a thorny little tradeoff to encapsulate in code.
> > 
> > The best approach I've thought of to date is a parameter to
> > VF.get_record_stream called 'sort_is_hint'. If you think that that
> > sounds fine and dandy, say so now and skip the detail below.
> > 
> 
> It is okay, but I'm not sure it is the best way. Why can't gc and packs
> just pass 'unordered' if the sort is a hint anyway.

It's not a hint for knits or weaves; for them it's a correctness
requirement. Knit-based packs don't strictly care [though they do
remember what is being inserted in case its compression basis is not
satisfied]. Even if it's only a hint, that doesn't stop it being
_useful_.
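To make the proposal concrete, here is a rough sketch of how a store
might treat such a flag. Everything here (`RecordStreamSource`,
`supports_unordered`) is a made-up illustration, not the real bzrlib
API:

```python
class RecordStreamSource:
    """Toy record store; hypothetical names, not the bzrlib
    VersionedFiles API."""

    def __init__(self, records, supports_unordered):
        # records: {key: text}, kept in storage (insertion) order
        self._records = records
        self._supports_unordered = supports_unordered

    def get_record_stream(self, keys, ordering, sort_is_hint=False):
        # For knits/weaves the requested order is a correctness
        # requirement, so the flag is ignored; a pack/gc-style store is
        # free to downgrade a hinted order to cheap storage order.
        if sort_is_hint and self._supports_unordered:
            ordering = 'unordered'
        if ordering == 'unordered':
            wanted = set(keys)
            keys = [k for k in self._records if k in wanted]
        return [(k, self._records[k]) for k in keys]


# A gc/pack store treats the hinted sort as advisory...
gc_store = RecordStreamSource({'a': 'A', 'b': 'B'}, True)
# ...while a knit store honours the requested order regardless.
knit_store = RecordStreamSource({'a': 'A', 'b': 'B'}, False)
```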


> > We could insert in any order from remote repositories, this would tend
> > to convert from packs poorly, but a 'pack' operation would fix things
> > up. Fetching from a gc repository over SFTP or the smart server would
> > tend to behave well because they would preserve as much ordering as the
> > base repository has.
> 
> That is the route I would take.

This feels like it's going to end up with a LOT of users going 'my
repository got bigger; I thought this was an upgrade'. That will
confuse people, and it will also be rather nasty for interoperation
with knits, knit-packs, etc.

> > We could buffer to a local store - for instance we could do like git
> > does some form of file-per-object store [but still using atomic
> > insertion to make data visible to other readers] and then compress that
> > store.
> 
> Sounds like a lot of work to avoid doing "bzr pack".

The difference is that pack has to consider the whole repo; a
short-term buffer operation (heck, even a totally uncompressed
temporary pack) would only consider the transmitted data, so it would
be O(change), not O(history).
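As a sketch of that buffering idea (all names here are hypothetical;
nothing below is real bzrlib code):

```python
def fetch_via_buffer(incoming, preferred_order):
    """Buffer-then-compress sketch. Texts arrive in whatever order the
    network delivers them and land in an uncompressed staging area; a
    second pass emits them in a compression-friendly order. Only the
    transmitted data is touched, so the re-ordering cost is O(change),
    not O(history)."""
    staging = {}
    for key, text in incoming:      # network order, possibly arbitrary
        staging[key] = text
    # Insert into the real store in the preferred order, e.g. grouped
    # by file id so related texts delta-compress against each other.
    return [(key, staging[key]) for key in preferred_order(staging)]
```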
 

> My feeling is to cross the bridge when we get there. Let's get basic
> groupcompress working.

It is.

>  Get fetch working such that it doesn't transmit
> all full-texts and require a full recompression of everything.

Half done. In fact this is why I'm working on the ordering issue.

> And then get a "bzr pack" that can figure out how to put everything
> "optimally".

Sure.

> I realized the other day that "topo_sort" *doesn't* guarantee grouping
> across the whole file_id.

Yes, that's why the new sort order is called reverse_topo_grouped:
it wants the output grouped by file id.
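For illustration only, a minimal toy reconstruction of what a grouped
topological order could look like; this is my own sketch, not the
actual implementation:

```python
from collections import defaultdict


def reverse_topo_grouped(parent_map):
    """Sketch: emit all keys of one file_id together, topologically
    sorted within the group and then reversed (newest first).
    parent_map: {(file_id, rev): [(file_id, parent_rev), ...]}"""
    by_file = defaultdict(dict)
    for key, parents in parent_map.items():
        # Keep only same-file parents; cross-file deltas aren't modelled.
        by_file[key[0]][key] = [p for p in parents if p[0] == key[0]]
    result = []
    for file_id in sorted(by_file):
        graph = by_file[file_id]
        order, visited = [], set()

        def visit(key):
            if key in visited:
                return
            visited.add(key)
            for parent in graph.get(key, ()):
                visit(parent)   # parents before children
            order.append(key)

        for key in sorted(graph):
            visit(key)
        # Reverse so the newest text of each file comes out first.
        result.extend(reversed(order))
    return result
```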


> It would also be really neat if we could find a way to do appropriate
> cross-file grouping. I can't really think of much from just a file-id
> stand-point, though. Given file size, we might try to insert large ones
> first. Though I would be careful to not insert all the large files into
> one chunk, and have only tiny ones for the next.

Basename is likely the best hint to use for that; I would use it to
group files with the same name, emitting all texts of a given file id,
then all of the next, etc. But we don't currently have that
information pushed down to the text layer.
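A toy example of what such a basename-grouping sort key might look
like (purely illustrative; as noted, nothing plumbs path information
down to the text layer today):

```python
import posixpath


def basename_group_key(entry):
    """Hypothetical sort key: group texts whose paths share a basename
    (so every setup.py compresses near the other setup.py texts),
    keeping all texts of one file id together inside that group.
    entry: (path, file_id, revision_id)"""
    path, file_id, revision_id = entry
    return (posixpath.basename(path), file_id, revision_id)


texts = [
    ('a/setup.py', 'id-1', 'r1'),
    ('README', 'id-2', 'r1'),
    ('b/setup.py', 'id-3', 'r1'),
]
ordered = sorted(texts, key=basename_group_key)
```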

> So I think the hint is a nice thing, but I feel it is a bit premature.

I don't :).

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.