[MERGE] Packs. Kthxbye.

John Arbash Meinel john at arbash-meinel.com
Fri Oct 19 00:56:32 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> I think the moment has come to get the pack repository format reviewed.
> 
> Caveats:
>   - committing merges is extremely slow. I plan to fix this tomorrow.
>   - The format string and short hand still claim it's experimental. I'll
> fix this when the review is done, so that any bugs found are not liable
> to leak into user data.
>   - There's a bunch of fallout from the latest merge of bzr.dev that I
> haven't factored out into separate patches. To avoid infinite regression
> I don't intend to do so.
>   - Pack branches can inherit bad indices, but they don't do anything
> for reconcile, this should be fixed before large scale adoption (e.g.
> making it the default format).
> 
> Cheers, enjoy.
> 
> -Rob
> 

More comments:

+        request_groups = {}
+        for index, key, value, references in nodes:
+            if index not in request_groups:
+                request_groups[index] = []
+            request_groups[index].append((key, value, references))

^- I used to think the idiom
request_groups.setdefault(index, []).append((key, value, references))
would be faster.

I did some timing, though, and at least for integer keys, doing

import random

unique_vals = 100
total_count = 100 * 1000
data = [random.randint(0, unique_vals) for x in xrange(total_count)]

# Method 1
d = {}
for x in data:
  if x not in d:
    d[x] = []
  d[x].append(x)

# Method 2
d = {}
for x in data:
  d.setdefault(x, []).append(x)

I get the table:

unique_vals    Method 1    Method 2
        100        73ms       107ms
       1000       102ms       142ms
      10000       154ms       188ms

So it seems the extra dict lookup for the membership test is cheaper than
allocating a temporary empty list for every setdefault() call, at least for
integer keys.

I wonder if that holds true for strings, though.
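
Something like this should answer it. This is an untested sketch; the
'rev-%d' key shape is just a stand-in for real revision ids:

import random
import time

unique_vals = 1000
total_count = 100 * 1000
# String keys roughly shaped like revision ids.
data = ['rev-%d' % random.randint(0, unique_vals)
        for x in xrange(total_count)]

def method_1(data):
    d = {}
    for x in data:
        if x not in d:
            d[x] = []
        d[x].append(x)

def method_2(data):
    d = {}
    for x in data:
        d.setdefault(x, []).append(x)

for func in (method_1, method_2):
    start = time.time()
    func(data)
    print '%s: %.0fms' % (func.__name__, (time.time() - start) * 1000)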




+    def _copy_nodes_graph(self, nodes, index_map, writer, write_index,

^- I find it a little strange that this function does double duty as a
generator if you supply output_lines=True, versus just repacking the data if
you don't.

I understand why you do it; it just seems a little odd.

Also the statement:
+        else:
+            # eat the iterator to cause it to execute.
+            list(inv_lines)
+            text_filter = None

^- This still builds up a list holding every line of every inventory, which
seems like fairly large memory consumption.
Maybe:

# Iterate the lines to execute the copy
for l in inv_lines:
    pass

Or at least:

[None for l in inv_lines]

which still builds up a huge list, but at least every entry is the same
reused None object rather than a distinct line string.
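
If this pattern shows up in more than one place, a tiny helper might be worth
having. Just a sketch, and the name 'consume' is mine, not something in bzrlib:

def consume(iterator):
    """Drain an iterator for its side effects, without keeping any items."""
    for _ in iterator:
        pass

Then the else branch above just becomes consume(inv_lines).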


Am I correct in saying that you don't currently do any intelligent ordering of
the output? You put revisions at the front, then inventories and then texts.
And the Indexes are sorted by key (so you can bisect). But you don't do
anything like try to group related revisions close together, right?

+        self.allocate(new_pack)
+        if 'fetch' in debug.debug_flags:
+            # XXX: size might be interesting?
+            mutter('%s: create_pack: pack renamed into place: %s%s->%s%s t+%6.3fs',
+                time.ctime(), self._upload_transport.base, new_pack.random_name,
+                new_pack.pack_transport, new_pack.name,
+                time.time() - new_pack.start_time)
+        if 'fetch' in debug.debug_flags:
+            # XXX: size might be interesting?
+            mutter('%s: create_pack: finished: %s%s t+%6.3fs',
+                time.ctime(), self._upload_transport.base, new_pack.random_name,
+                time.time() - new_pack.start_time)

I'm not sure what the difference between these two blocks is; they look
nearly identical.


In _obsolete_packs you have:
+            # TODO: Probably needs to know all possible indexes for this pack
+            # - or maybe list the directory and move all indexes matching this
+            # name whether we recognize it or not?
+            for suffix in ('.iix', '.six', '.tix', '.rix'):
+                self._index_transport.rename(pack.name + suffix,
+                    '../obsolete_packs/' + pack.name + suffix)


You could do:

for suffix, num in NewPack.indices.itervalues():

Or maybe we should be adding a list on packs themselves. So instead you would do:

+        for pack in packs:
+            for suffix in pack.index_suffixes:
+                self._index_transport.rename(pack.name + suffix,
+                    '../obsolete_packs/' + pack.name + suffix)
+            pack.pack_transport.rename(pack.file_name(),
+                '../obsolete_packs/' + pack.file_name())

It does seem to me that we should be renaming the indexes before we rename the
pack file itself, especially if we have a preferred order to read the indexes in.

...

+class GraphKnitRevisionStore(KnitRevisionStore):
+    """An object to adapt access from RevisionStore's to use GraphKnits.
+
+    This should not live through to production: by production time we should
+    have fully integrated the new indexing and have new data for the
+    repository classes; also we may choose not to do a Knit1 compatible
+    new repository, just a Knit3 one. If neither of these happen, this
+    should definately be cleaned up before merging.

^-          definitely
Obviously this is living in production.
I would personally prefer to just have the Knit3 form. It will take longer to
upgrade (since we are rebuilding every inventory, AIUI), but we should really
be moving forward.


...
+    def _refresh_data(self):
+        if self._write_lock_count == 1 or self.control_files._lock_count==1:

^- Missing spaces around the second '=='

+            # forget what names there are
+            self._packs.reset()
+            # XXX: Better to do an in-memory merge, factor out code from
+            # _save_pack_names.


During fetch() you have:
+        if pack is not None:
+            self.target._packs._save_pack_names()
+            # Trigger an autopack. This may duplicate effort as we've just done
+            # a pack creation, but for now it is simpler to think about as
+            # 'upload data, then repack if needed'.
+            self.target._packs.autopack()
+            return pack.get_revision_count()

Do we want to auto-repack a remote repository? For bzr+ssh it makes sense. But
for sftp/ftp/http it means we have to download all of that content and
re-upload it all. I really think it would make more sense for it not to do this.
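
Something along these lines is what I have in mind; the is_smart_or_local()
check is made up, I don't know off-hand what the right predicate would be:

+        if pack is not None:
+            self.target._packs._save_pack_names()
+            # Only autopack when the repack can happen close to the data
+            # (local disk or bzr+ssh); over sftp/ftp/http it means pulling
+            # down all the pack content and uploading it again.
+            if self.target.is_smart_or_local():  # hypothetical check
+                self.target._packs.autopack()
+            return pack.get_revision_count()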

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHF/KwJdeBCYSNAAMRAvz2AJ9lYU9PxUhzVI56rDehJKIyHG4XvQCff6Yk
hI5ks14GJiZluUhLzKVvXCg=
=NTiX
-----END PGP SIGNATURE-----


