[Fwd: Re: [MERGE] _generate_text_key_index]

John Arbash Meinel john at arbash-meinel.com
Mon Nov 19 03:47:22 GMT 2007


I would consider this a review of your earlier patch.

I'm not sure why you say "no reviews were given".

John
=:->


-------- Original Message --------
Subject: Re: [MERGE] _generate_text_key_index
Date: Thu, 15 Nov 2007 20:27:06 -0500 (EST)
From: John Arbash Meinel <john at arbash-meinel.com>
Reply-To: bazaar at lists.canonical.com
To: bazaar at lists.canonical.com

John Arbash Meinel has voted tweak.
Status is now: Conditionally approved
Comment:
Overall, I see that you chose to go in topologically sorted order and
then build up the per-file graphs, which I think is the better way to
go for what we discussed. The algorithm looks correct; I just have some
small comments on odd constructs that you used.

Also, you might want to consider splitting it up into helper functions.
It seems to be getting a bit dense to follow everything. You could
probably break it into a helper that handles a single revision id.


I just did a quick test. I don't know that this is strictly performance
sensitive, but you are doing:

+    def get_parents(self, revisions):
+        return [self.ancestry.get(r, None) for r in revisions]
+

And I found that it has to do an attribute lookup on each call.
timeit says:

TIMEIT -s 'import random' \
  -s 'd = dict((random.randint(0,1000), random.randint(0,1000))
      for x in xrange(1000))' \
  -s 'y = [random.randint(0,1000) for x in xrange(200)]' \

  '[d.get(x, None) for x in y]'
460 usec

  'g = d.get; [g(x, None) for x in y]'
330 usec

  'g = d.get; [g(x) for x in y]'
306 usec

I think the extra 24 usec is because of the INCREF/DECREF around the
None object. For me, that cost is worth the clarity of making it
obvious that we are expecting None if the key isn't present. (Yes, it
is part of get()'s contract to do so, but it isn't nearly as obvious.)

If I set it up so that we have the dictionary as an attribute of a
class, it gets even worse:
    '[o.d.get(x, None) for x in y]'
1000 loops, best of 3: 564 usec per loop
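
For reference, here is a small script that reproduces all four
measurements with the timeit module (Python 2, to match the xrange
usage above; the Holder class is just a stand-in for the attribute
case, and exact numbers will vary by machine):

import timeit

setup = '''
import random
d = dict((random.randint(0, 1000), random.randint(0, 1000))
         for x in xrange(1000))
y = [random.randint(0, 1000) for x in xrange(200)]
class Holder(object):
    pass
o = Holder()
o.d = d
'''

for stmt in ('[d.get(x, None) for x in y]',
             'g = d.get; [g(x, None) for x in y]',
             'g = d.get; [g(x) for x in y]',
             '[o.d.get(x, None) for x in y]'):
    # best of 3 repeats of 1000 loops, like the command-line tool
    best = min(timeit.Timer(stmt, setup).repeat(3, 1000))
    print '%-40s %.0f usec per loop' % (stmt, best * 1000)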

Anyway, we should keep in mind that hoisting such lookups into a local
variable in the function that runs the loop is a good thing.
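
Applied to the method above, the hoist is a one-line change (a sketch,
keeping the explicit None argued for earlier):

    def get_parents(self, revisions):
        # Bind the method once; the comprehension then avoids a
        # fresh attribute lookup on self.ancestry per revision.
        get_ancestry = self.ancestry.get
        return [get_ancestry(r, None) for r in revisions]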

+    def _generate_text_key_index(self):
+        """Generate a new text key index for the repository.
+
+        This is an expensive function that will take considerable
+        time to run.
+
+        :return: A dict mapping text keys ((file_id, revision_id)
+            tuples) to a list of parents, also text keys. When a
+            given key has no parents, the parents list will be
+            [NULL_REVISION].

"to a list of parents, also text keys." is unclear.
It might be easier to say:

:return: A dict mapping text keys to a list of parents
    (eg, (file_id, revision_id) => []). When a given ...

Oh, maybe you meant "and invalid text keys", and just left off the word
"invalid". I think I prefer 'and' to 'also'.

...
+        revision_graph = self.get_revision_graph_with_ghosts()
+        ancestors = revision_graph.get_ancestors()

^- I wish there were a better way to do that. It is a bit inefficient
to pay all the overhead of deprecated_graph.Graph() [which also builds
a map of parents to children] just to get the children-to-parents map.

Also, I noticed that get_ancestors() is defined as:
    def get_ancestors(self):
        """Return a dictionary of graph node:ancestor_list entries."""
        return dict(self._graph_ancestors.items())

Isn't that an expensive way of saying "self._graph_ancestors.copy()"?
It has to build up a list of tuples in memory, which then gets
collapsed back into another dictionary.
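
That is, a sketch of the simpler spelling:

    def get_ancestors(self):
        """Return a dictionary of graph node:ancestor_list entries."""
        # dict.copy() builds the new dict directly, skipping the
        # intermediate list of (key, value) tuples from items().
        return self._graph_ancestors.copy()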

...
+        revision_keys = {}
+        for revision_id in revision_order:
+            revision_keys[revision_id] = set()

maybe:
revision_keys = dict((revision_id, set())
                     for revision_id in revision_order)

I realize this isn't really the performance sensitive portion of the
code.
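
For what it's worth, the generator form also dodges a classic trap:

revision_keys = dict((revision_id, set())
                     for revision_id in revision_order)
# Not dict.fromkeys(revision_order, set()) - that would share a
# single set object across every key.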

...
+        # Set a cache with a size of 10 - this suffices for bzr.dev
+        # but may be too small for large or very branchy trees.
+        # However, for 55K path trees, it would be easy to use too
+        # much memory trivially. Ideally we could gauge this by
+        # looking at available real memory etc, but this is always a
+        # tricky proposition.
+        inventory_cache = lru_cache.LRUCache(10)

You could instead use an LRUSizeCache for the inventories. Inventories
already have __len__ defined, and we know that a single Inventory
entry is approximately a fixed size (a couple of shared file ids, a
sha string, a size, etc.).
That would be a lot more accurate than a fixed size of 10; instead you
could say: store approximately 10,000 Inventory entries.
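
Something along these lines (a sketch, assuming LRUSizeCache's
max_size and compute_size arguments; compute_size defaults to len, so
it could even be omitted here):

from bzrlib import lru_cache

# Evict by total inventory entries rather than inventory count;
# len(inv) is the number of entries and each entry is roughly a
# fixed size, so max_size=10000 means "about 10,000 entries".
inventory_cache = lru_cache.LRUSizeCache(max_size=10000,
                                         compute_size=len)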

+        for offset in xrange(batch_count):
+            to_query = revision_order[offset * batch_size:
+                                      (offset + 1) * batch_size]
^- Isn't it a lot easier to do:
for offset in xrange(0, text_count, batch_size):
  to_query = revision_order[offset:offset+batch_size]
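
A toy run shows the slicing already handles the short final batch:

revision_order = ['r%d' % i for i in xrange(10)]
batch_size = 4
for offset in xrange(0, len(revision_order), batch_size):
    print revision_order[offset:offset + batch_size]
# ['r0', 'r1', 'r2', 'r3']
# ['r4', 'r5', 'r6', 'r7']
# ['r8', 'r9']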


CONTINUED

If we are going to cache the tuples, shouldn't we also check that the
file_ids and revision_ids are cached as well? We don't necessarily have
to have tests for it, but it seems like sharing the strings is going to
be a bigger win than just saving the tuples.
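
A sketch of one way to do that, using plain dicts as intern tables
(the cache names here are made up, not from the patch):

def _intern(cache, value):
    # Return the canonical copy of value, storing it on first
    # sight; later duplicates then all share one object.
    return cache.setdefault(value, value)

file_id = _intern(file_id_cache, file_id)
revision_id = _intern(revision_id_cache, revision_id)
text_key = _intern(text_key_cache, (file_id, revision_id))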

+                            # look at the parent commit details
+                            # inventories to determine possible
+                            # candidates in the per file graph.
+                            # TODO: cache here.
+                            try:
+                                inv = inventory_cache[parent_id]
+                            except KeyError:

^- Shouldn't the TODO be removed?

+                    parent_heads = text_graph.heads(candidate_parents)
+                    new_parents = list(parent_heads)
+                    new_parents.sort(
+                        key=lambda x: candidate_parents.index(x))

^- Interesting, I wonder if this is better than:
parent_heads = text_graph.heads(candidate_parents)
new_parents = sorted(parent_heads, key=...)

or my preference:
new_parents = [p for p in candidate_parents if p in parent_heads]

I'm guessing that sorted() is better than list().sort(). But the list of
candidate parents is always going to be small (maybe 8 for a large
'octopus' merge).

It would be interesting to see if there is a difference between them.
Certainly 'candidate_parents.index(x)' is a linear search.
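
In the spirit of the timeit runs above, the three variants could be
raced directly (made-up eight-way data):

import timeit

setup = '''
candidate_parents = ['p%d' % i for i in xrange(8)]
parent_heads = set(candidate_parents[::2])
'''
for stmt in (
        "l = list(parent_heads); "
        "l.sort(key=lambda x: candidate_parents.index(x))",
        "sorted(parent_heads, key=candidate_parents.index)",
        "[p for p in candidate_parents if p in parent_heads]"):
    best = min(timeit.Timer(stmt, setup).repeat(3, 100000))
    print '%-55s %.2f usec per loop' % (stmt, best * 10)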

+                    if new_parents == []:
+                        new_parents = [NULL_REVISION]
+
^- I think this has to create a new temporary list on each pass, so
I'd suggest 'if len(new_parents) == 0:', though I believe
'if not new_parents:' is faster still.
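
That is, the cheap spelling would be:

# 'new_parents == []' allocates a fresh empty list on every
# comparison; the plain truth test does not.
if not new_parents:
    new_parents = [NULL_REVISION]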



For details, see:
http://bundlebuggy.aaronbentley.com/request/%3C1195161806.27807.2.camel%40lifeless-64%3E

