Bazaar, OLPC, and the "no-packs" repository format.

Thu Dec 20 14:12:44 GMT 2007

On Št, 2007-12-20 at 07:54 -0600, John Arbash Meinel wrote:
> Lukáš Lalinský wrote:
> > On Št, 2007-12-20 at 06:25 -0600, John Arbash Meinel wrote:
> >> Robert has at least conceptualized a hash-based index, so lookups become more
> >> of an O(1) operation. Which makes it O(N) for N packs, but that is better than
> >> O(N log(M)) for our current indexes. I wonder if you could even combine the
> >> hash maps to make it O(1) across all indexes that you have read...
> > 
> > I've implemented a prototype of a similar idea about two weeks ago, but
> > gave up on finishing it because I found out that the main bottleneck for
> > me was reading data from all over the big pack files. The indexing code
> > is in the attachment, but it's only a part of the plugin, which itself
> > requires patched bzrlib and breaks 'packs-0.92', so I did't bother with
> > attaching the whole package (it's not very usable, anyway).
> 
> For what operations?

Various logs were my primary tests, since I was playing with this
because of the per-file log performance bug I've reported. To get all
revisions it had to seek so many times that even though I had faster
index than with knits, the whole operation is slower than with knits.

> > The general idea is that the index is split into segments, which are
> > grouped by their left-hand history (similar to toposort, but backwards)
> > and the index header contains 32 bit hashes of each key of each segment.
> > The combined index also uses a global cache, so in most cases you don't
> > search in all indexes, but only the ones that actually contain the key
> > (for 15k revisions in bzr.dev I get no hash collisisions, so it's
> > something like O(1+very_small_constant*N)). But that helps only in cases
> > where you have a few big indexes, not a *lot* of tiny ones.
>
> That seems like it would be difficult to look up an arbitrary key. Since you
> don't know where it fits relative to other keys.

No, lookup for an arbitrary key within one index is O(1) operation. You
have a hash map of all revisions IDs pointing to segment number in which
they are. This makes decision whether a revision ID is in the index
really trivial, and even for Python SVN repository with ~33k revisions I
had no hash collisions, so in order to lookup one key I had to load only
one segment. Though, the main advantage is that for most operations you
need the whole ancestry of a key, and again you read only the
interesting segments (even if they are in multiple indexes). You never
read segments you don't need, and the related segments are close
together, so in most cases you can read them without seeking.

Lukas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Toto je =?ISO-8859-1?Q?digit=E1lne?=
	=?ISO-8859-1?Q?_podp=EDsan=E1?= =?UTF-8?Q?_=C4=8Das=C5=A5?=
	=?ISO-8859-1?Q?_spr=E1vy?=
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20071220/33330fc5/attachment.pgp