[MERGE] _generate_text_key_index

John Arbash Meinel john at arbash-meinel.com
Fri Nov 16 01:27:06 GMT 2007


John Arbash Meinel has voted tweak.
Status is now: Conditionally approved
Comment:
Overall, I see that you chose to go in topological sorted order, and 
then to build up the per file graphs, which I think is the better way to 
go for what we discussed. The algorithm looks correct, I just have some 
small comments on odd constructs that you used.

Also, you might want to consider splitting it up into helper functions. 
It seems to be getting a bit dense to follow everything. You could 
probably break it into a helper that handles a single revision id.


I just did a quick test. I don't know that this is strictly performance 
sensitive but you are doing:

+    def get_parents(self, revisions):
+        return [self.ancestry.get(r, None) for r in revisions]
+

And I found that it has to do an attribute lookup on each iteration.
timeit says:

TIMEIT -s 'import random' \
  -s 'd = dict((random.randint(0,1000), random.randint(0,1000)) \
       for x in xrange(1000))' \
  -s 'y = [random.randint(0,1000) for x in xrange(200)]'

   '[d.get(x, None) for x in y]'
460 usec

   'g = d.get; [g(x, None) for x in y]'
330 usec

   'g = d.get; [g(x) for x in y]'
306 usec

I think the extra 24us is because of the INCREF/DECREF around the None 
object. For me, it is worth the clarity of making it obvious that we are 
expecting None if the key isn't present. (Yes it is part of get() to do 
so, but it isn't nearly as obvious.)

If I set it up so that we have the dictionary as an attribute of a 
class, it gets even worse:
     '[o.d.get(x, None) for x in y]'
1000 loops, best of 3: 564 usec per loop

Anyway, we should keep in mind that hoisting the method lookup into a 
local variable before the loop is a good thing.
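
As a rough illustration, here is the hoisting pattern as a 
self-contained sketch (written with Python 3's range rather than the 
xrange used above; the data is random placeholder data):

```python
import random

# Placeholder data, as in the timeit runs above.
d = dict((random.randint(0, 1000), random.randint(0, 1000))
         for x in range(1000))
y = [random.randint(0, 1000) for x in range(200)]

# Attribute lookup on every iteration: 'd.get' is resolved 200 times.
slow = [d.get(x, None) for x in y]

# Hoist the bound method into a local variable once; each iteration
# then does a cheap local-variable load instead of an attribute lookup.
g = d.get
fast = [g(x) for x in y]

assert slow == fast
```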

+    def _generate_text_key_index(self):
+        """Generate a new text key index for the repository.
+
+        This is an expensive function that will take considerable time
+        to run.
+
+        :return: A dict mapping text keys ((file_id, revision_id) tuples)
+            to a list of parents, also text keys. When a given key has
+            no parents, the parents list will be [NULL_REVISION].

"to a list of parents, also text keys." is unclear.
It might be easier to say:

:return: A dict mapping text keys to a list of parents (eg,
    (file_id, revision_id) => []). When a given ...

Oh, maybe you meant "and invalid text keys", and just left off the word 
"invalid". I think I prefer 'and' to 'also'.

...
+        revision_graph = self.get_revision_graph_with_ghosts()
+        ancestors = revision_graph.get_ancestors()

^- I wish there was a better way to do that. It is a bit inefficient to 
incur all the overhead of deprecated_graph.Graph() [which is adding a 
map of parents to children] just to get the children-to-parents map.

Also, I noticed that get_ancestors() is defined as:
     def get_ancestors(self):
         """Return a dictionary of graph node:ancestor_list entries."""
         return dict(self._graph_ancestors.items())

Isn't that an expensive way of saying "self._graph_ancestors.copy()"?
It has to build up a list of tuples in memory, which then gets 
collapsed back into another dictionary.
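
A small sketch of the difference (illustrative only; the names are 
placeholders):

```python
# Both spellings produce an equal, independent shallow copy, but
# dict(d.items()) first materializes all the items as tuples, while
# d.copy() copies the mapping in one step.
graph_ancestors = {'rev-1': [], 'rev-2': ['rev-1']}

via_items = dict(graph_ancestors.items())
via_copy = graph_ancestors.copy()

assert via_items == via_copy == graph_ancestors
assert via_copy is not graph_ancestors  # still an independent dict
```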

...
+        revision_keys = {}
+        for revision_id in revision_order:
+            revision_keys[revision_id] = set()

maybe:
revision_keys = dict((revision_id, set())
                     for revision_id in revision_order)

I realize this isn't really the performance sensitive portion of the 
code.
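
For what it's worth, the one-liner behaves identically (sketched here 
with placeholder revision ids):

```python
revision_order = ['rev-a', 'rev-b', 'rev-c']

revision_keys = dict((revision_id, set())
                     for revision_id in revision_order)

# Each value is a distinct set, not one shared set.
revision_keys['rev-a'].add(('file-1', 'rev-a'))
assert revision_keys['rev-b'] == set()
assert sorted(revision_keys) == ['rev-a', 'rev-b', 'rev-c']
```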

...
+        # Set a cache with a size of 10 - this suffices for bzr.dev but
+        # may be too small for large or very branchy trees. However, for
+        # 55K path trees, it would be easy to use too much memory
+        # trivially. Ideally we could gauge this by looking at available
+        # real memory etc, but this is always a tricky proposition.
+        inventory_cache = lru_cache.LRUCache(10)

You could instead use an LRUSizeCache of the inventories. Inventories 
already have __len__ defined, and we know that a single Inventory entry 
is approximately a fixed size (a couple of shared file ids, a sha 
string, a size, etc.).
That would be a lot more accurate than a fixed size of 10. Instead you 
could say store approx 10,000 Inventory entries.
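
To illustrate the idea (this is only a toy sketch of size-based LRU 
eviction, not the real lru_cache.LRUSizeCache API): evict 
least-recently-used entries once the summed len() of the cached values 
exceeds a budget.

```python
from collections import OrderedDict

class TinySizeCache(object):
    """Toy LRU cache bounded by the total len() of its values."""

    def __init__(self, max_size):
        self._max_size = max_size
        self._size = 0
        self._items = OrderedDict()

    def __setitem__(self, key, value):
        if key in self._items:
            self._size -= len(self._items.pop(key))
        self._items[key] = value
        self._size += len(value)
        # Evict oldest entries until we are back under budget.
        while self._size > self._max_size:
            _, evicted = self._items.popitem(last=False)
            self._size -= len(evicted)

    def __getitem__(self, key):
        # Move the entry to the most-recently-used end.
        value = self._items.pop(key)
        self._items[key] = value
        return value

    def __contains__(self, key):
        return key in self._items

cache = TinySizeCache(max_size=5)
cache['inv-a'] = [1, 2, 3]
cache['inv-b'] = [4, 5]
cache['inv-c'] = [6]          # total size hits 6, so 'inv-a' is evicted
assert 'inv-a' not in cache
assert 'inv-b' in cache and 'inv-c' in cache
```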

+        for offset in xrange(batch_count):
+            to_query = revision_order[offset * batch_size:
+                                      (offset + 1) * batch_size]
^- Isn't it a lot easier to do:
for offset in xrange(0, text_count, batch_size):
    to_query = revision_order[offset:offset+batch_size]
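
A quick sketch of the simpler loop on placeholder data (using Python 
3's range in place of xrange):

```python
revision_order = ['r%d' % i for i in range(10)]
batch_size = 3

batches = []
for offset in range(0, len(revision_order), batch_size):
    batches.append(revision_order[offset:offset + batch_size])

# The final batch is simply shorter; no batch_count arithmetic needed.
assert batches == [['r0', 'r1', 'r2'], ['r3', 'r4', 'r5'],
                   ['r6', 'r7', 'r8'], ['r9']]
```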


CONTINUED

If we are going to cache the tuples, shouldn't we also check that the 
file_ids and revision_ids are cached as well? We don't necessarily have 
to have tests for it, but it seems like sharing the strings is going to 
be a bigger win than just saving the tuples.
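
One hypothetical way to share the strings as well as the tuples is a 
setdefault-based interning dict (the names here are made up for 
illustration):

```python
_seen = {}

def shared(s):
    # Return the first instance of an equal string we have seen.
    return _seen.setdefault(s, s)

# Build two equal but distinct str objects (literals may be interned
# by the compiler, so construct them at runtime instead).
fid1 = ''.join(['file', '-1'])
fid2 = ''.join(['file', '-1'])
assert fid1 == fid2 and fid1 is not fid2

key1 = (shared(fid1), shared('rev-1'))
key2 = (shared(fid2), shared('rev-1'))
assert key1 == key2
assert key1[0] is key2[0]  # the file_id string is now shared
```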

+                            # look at the parent commit details
+                            # inventories to determine possible
+                            # candidates in the per file graph.
+                            # TODO: cache here.
+                            try:
+                                inv = inventory_cache[parent_id]
+                            except KeyError:

^- Shouldn't the TODO be removed?

+                    parent_heads = text_graph.heads(candidate_parents)
+                    new_parents = list(parent_heads)
+                    new_parents.sort(
+                        key=lambda x: candidate_parents.index(x))

^- Interesting, I wonder if this is better than:
parent_heads = text_graph.heads(candidate_parents)
new_parents = sorted(parent_heads, key=...)

or my preference:
new_parents = [p for p in candidate_parents if p in parent_heads]

I'm guessing that sorted() is better than list().sort(). But the list of 
candidate parents is always going to be small (maybe 8 for a large 
'octopus' merge).

It would be interesting to see if there is a difference between them. 
Certainly 'candidate_parents.index(x)' is a linear search.
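
For comparison, the two order-preserving spellings on a hypothetical 
candidate list; text_graph.heads() is stubbed out as a plain set here:

```python
candidate_parents = [('f', 'r1'), ('f', 'r2'), ('f', 'r3')]
parent_heads = set([('f', 'r1'), ('f', 'r3')])  # stand-in for heads()

# sorted() with a linear-search key ...
a = sorted(parent_heads, key=lambda x: candidate_parents.index(x))
# ... versus filtering the already-ordered candidates.
b = [p for p in candidate_parents if p in parent_heads]

assert a == b == [('f', 'r1'), ('f', 'r3')]
```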

+                    if new_parents == []:
+                        new_parents = [NULL_REVISION]
+
^- I think this has to create a new temporary list for each comparison, 
so I'd suggest 'if len(new_parents) == 0:', though I believe 'if not 
new_parents:' is faster.
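
The three emptiness checks are equivalent for a list (sketched below; 
'null:' stands in for NULL_REVISION):

```python
new_parents = []

# All three spellings agree; 'not new_parents' avoids building a
# temporary empty list just for the comparison.
assert (new_parents == []) == (len(new_parents) == 0) == (not new_parents)

if not new_parents:
    new_parents = ['null:']
assert new_parents == ['null:']
```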



For details, see: 
http://bundlebuggy.aaronbentley.com/request/%3C1195161806.27807.2.camel%40lifeless-64%3E



More information about the bazaar mailing list