'bzr pack' a little *too* good

John Arbash Meinel john at arbash-meinel.com
Thu Dec 6 13:20:54 GMT 2007



I'm working with Mattias Eriksson on seeing if Bazaar will work for the
OpenOffice group. And we came across some interesting results.

After doing the conversion, he ended up with a repository containing approximately
16 pack files.

After the conversion, if he tried doing:

  bzr branch repo/a_branch standalone_branch

It took 13 minutes to copy the data. But then he found that if he did:

  bzr branch standalone_branch another_branch

It only took 5 minutes. I thought having a single pack file was probably
helping, but I also thought it might be affected by not having the other
branches' data in there. However, after running:

  bzr pack

He found that:

  bzr branch repo/a_branch standalone_branch

Only took 5 minutes. So it seems that simply having 16 pack files on the local
filesystem was enough to make "bzr branch" take 2.6x as long.

I'm attaching the callgrind files for people who are interested. But I'm pretty
sure the short answer is that the indexing layer is spending a lot of time, for
each node, working out which index it might be in.

When you have 1 pack file, it bisects through the index for a given (file_id,
revision_id) pair. When you have 16 pack files, it bisects through each until
it finds it. I *think* it picks them based on whatever random order they were
found on disk. So there isn't specifically a bias to the "largest pack because
a random key is most likely to be there" nor a bias to the "smallest packs
because they should be fast to search and have newer revisions".
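To make that cost concrete, here is a toy sketch of the search pattern (my
own illustration, not bzrlib's actual index code): each pack index is bisected
in turn, in arbitrary order, so every miss adds another O(log n) search before
the key is found.

```python
from bisect import bisect_left

def find_key(indexes, key):
    """Search each pack index in turn; indexes is a list of sorted
    key lists, in whatever order the packs were found on disk."""
    probes = 0
    for idx in indexes:
        probes += 1
        pos = bisect_left(idx, key)  # one O(log n) bisection per index
        if pos < len(idx) and idx[pos] == key:
            return probes  # how many indexes we had to search
    return None

# The same 160 keys, either as one pack or spread round-robin over 16.
one_pack = [sorted("key%04d" % i for i in range(160))]
many_packs = [sorted("key%04d" % i for i in range(j, 160, 16))
              for j in range(16)]
```

With one pack every lookup is a single bisection; with 16 packs a key averages
8.5 index searches before it is found, which is the kind of per-key multiplier
that could explain the branch slowdown.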

For this sort of bulk copy of most of history, I'm fairly sure "buffer_all"
would show a big win. I'm guessing that Robert's "buffer_all if we access
>XX% of keys" heuristic would also be a win.
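A minimal sketch of how that heuristic could look (the class, the 25%
threshold, and the method names are all assumptions of mine, not Robert's
design or bzrlib's API): individual lookups bisect "on disk", but once a
bulk operation has touched more than the threshold fraction of keys, the
index is read into memory in one go.

```python
from bisect import bisect_left

class ThresholdIndex:
    BUFFER_ALL_FRACTION = 0.25  # the "XX%" is an assumption here

    def __init__(self, sorted_keys):
        self._keys = sorted_keys   # stands in for the on-disk index
        self._seen = set()
        self._buffered = None      # becomes a set after buffer_all

    def contains(self, key):
        if self._buffered is not None:
            return key in self._buffered
        self._seen.add(key)
        if len(self._seen) > self.BUFFER_ALL_FRACTION * len(self._keys):
            # "buffer_all": one full read, then every lookup is in-memory
            self._buffered = set(self._keys)
            return key in self._buffered
        pos = bisect_left(self._keys, key)  # the slow per-key path
        return pos < len(self._keys) and self._keys[pos] == key

idx = ThresholdIndex(["rev%03d" % i for i in range(100)])
for i in range(30):  # a bulk scan trips the threshold
    idx.contains("rev%03d" % i)
```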

I'm curious if Robert's alternate index layout (hash map) would show a decent
win here. At least testing for the presence of a key becomes:
  hash_full = hash(key)
  for idx in indexes:
    idx_hash = hash_full[:idx.hash_num_bytes]
    if idx_hash in idx.hash_map:
      offset = idx.hash_map[idx_hash]
      record = idx.get_record(offset)
      if record == key:
        return record

At least, I think that is what Robert had in mind. Of course, the big win is
that we aren't doing log(N) bisections for every key lookup.
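For the curious, here is a runnable toy version of the above (again, my
reading of the idea; the class names, the 4-byte truncation, and the sha1
choice are assumptions): membership is one dict probe per index, with a
full-key check to guard against truncated-hash collisions.

```python
import hashlib

class HashIndex:
    hash_num_bytes = 4  # truncation length is an assumption

    def __init__(self, records):
        # records: dict mapping key -> record payload
        self._records = list(records.items())
        self.hash_map = {}
        for offset, (key, _) in enumerate(self._records):
            self.hash_map[self._hash(key)] = offset

    def _hash(self, key):
        return hashlib.sha1(key.encode()).digest()[:self.hash_num_bytes]

    def get_record(self, offset):
        return self._records[offset]

def lookup(indexes, key):
    for idx in indexes:
        idx_hash = idx._hash(key)
        if idx_hash in idx.hash_map:
            offset = idx.hash_map[idx_hash]
            record_key, value = idx.get_record(offset)
            if record_key == key:  # guard against a truncated-hash collision
                return value
    return None

idx = HashIndex({"file-1 rev-a": "delta1", "file-2 rev-b": "delta2"})
```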

Anyway, in the short term we might want to make the auto-pack code more
aggressive. I know we thought that with 9999 revisions it would be reasonable
to have no more than 36 pack files. But if we degrade to 2.6x the branch time
with just 16 packs...

John
=:->
Attachment: branch_ooo.tar.bz2 (application/octet-stream, 98168 bytes)
https://lists.ubuntu.com/archives/bazaar/attachments/20071206/4759a7b6/attachment-0001.obj

