[MERGE] _generate_text_key_index

John Arbash Meinel john at arbash-meinel.com
Thu Nov 15 23:18:29 GMT 2007


John Arbash Meinel has voted comment.
Status is now: Semi-approved
Comment:
I just did a quick test. I don't know that this is strictly 
performance-sensitive code, but you are doing:

+    def get_parents(self, revisions):
+        return [self.ancestry.get(r, None) for r in revisions]
+

And I found that it has to do the `.get` attribute lookup on every 
pass through the list. timeit says:

TIMEIT -s 'import random' \
  -s 'd = dict((random.randint(0, 1000), random.randint(0, 1000))
       for x in xrange(1000))' \
  -s 'y = [random.randint(0, 1000) for x in xrange(200)]'

   '[d.get(x, None) for x in y]'           460 usec
   'g = d.get; [g(x, None) for x in y]'    330 usec
   'g = d.get; [g(x) for x in y]'          306 usec

I think the extra 24 usec is because of the INCREF/DECREF around the 
None object. For me, that cost is worth the clarity of making it 
obvious that we expect None when the key isn't present. (Yes, get() 
already returns None by default, but it isn't nearly as obvious.)

If I set it up so that we have the dictionary as an attribute of a 
class, it gets even worse:
     '[o.d.get(x, None) for x in y]'
1000 loops, best of 3: 564 usec per loop

Anyway, we should keep in mind that binding a method called in a loop 
to a local variable is a good thing.
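A minimal sketch of the rewrite that suggestion implies, written here as a standalone function rather than the patch's method (the `ancestry` dict and names are illustrative, not the actual patch):

```python
def get_parents(ancestry, revisions):
    # Bind the bound method once, instead of looking up .get on every
    # iteration of the list comprehension.
    get = ancestry.get
    return [get(r, None) for r in revisions]
```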

+    def _generate_text_key_index(self):
+        """Generate a new text key index for the repository.
+
+        This is an expensive function that will take considerable time
+        to run.
+
+        :return: A dict mapping text keys ((file_id, revision_id)
+            tuples) to a list of parents, also text keys. When a given
+            key has no parents, the parents list will be
+            [NULL_REVISION].

"to a list of parents, also text keys." is unclear.
It might be easier to say:

:return: A dict mapping text keys to a list of parents (eg, (file_id, 
revision_id) => []). When a given ...

Oh, maybe you meant "and invalid text keys", and just left off the word 
"invalid". I think I prefer 'and' to 'also'.

...
+        revision_graph = self.get_revision_graph_with_ghosts()
+        ancestors = revision_graph.get_ancestors()

^- I wish there was a better way to do that. It is a bit inefficient 
to do all the overhead of deprecated_graph.Graph() [which is adding a 
map of parents to children] just to get the children-to-parents map.

Also, I noticed that get_ancestors() is defined as:
     def get_ancestors(self):
         """Return a dictionary of graph node:ancestor_list entries."""
         return dict(self._graph_ancestors.items())

Isn't that an expensive way of saying "self._graph_ancestors.copy()"?
It has to build up a list of tuples in memory, which then gets 
collapsed back into another dictionary.
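A quick check of that claim (the dict contents here are made up): both spellings produce the same shallow copy, but .copy() skips materializing the intermediate list of tuples (which .items() builds on the Python 2 this code ran on).

```python
d = {'rev-1': ['rev-0'], 'rev-2': ['rev-1']}

via_items = dict(d.items())  # what get_ancestors() does
via_copy = d.copy()          # the cheaper equivalent

assert via_items == via_copy == d
assert via_copy is not d  # still a copy, not an alias
```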

...
+        revision_keys = {}
+        for revision_id in revision_order:
+            revision_keys[revision_id] = set()

maybe:
revision_keys = dict((revision_id, set())
                     for revision_id in revision_order)

I realize this isn't really the performance sensitive portion of the 
code.

...
+        # Set a cache with a size of 10 - this suffices for bzr.dev
+        # but may be too small for large or very branchy trees.
+        # However, for 55K path trees, it would be easy to use too
+        # much memory trivially. Ideally we could gauge this by
+        # looking at available real memory etc, but this is always a
+        # tricky proposition.
+        inventory_cache = lru_cache.LRUCache(10)

You could instead use an LRUSizeCache of the inventories. Inventories 
already have __len__ defined, and we know that a single Inventory 
entry is approximately a fixed size (a couple of shared file ids, a 
sha string, a size, etc). That would be a lot more accurate than a 
fixed count of 10: instead you could say "store approx 10,000 
Inventory entries".
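To illustrate the idea without depending on bzrlib, here is a minimal stand-in for a size-bounded LRU cache: the class name, API, and eviction policy are illustrative sketches, not bzrlib's actual LRUSizeCache implementation. The point is that eviction is driven by the summed sizes of the cached values (e.g. len() of each inventory) rather than by a fixed entry count.

```python
from collections import OrderedDict

class SizeLRUCache:
    """Minimal LRU cache bounded by total value size, not entry count.

    compute_size is called on each value (e.g. len() of an inventory);
    least-recently-used entries are evicted once the running total
    exceeds max_size.
    """

    def __init__(self, max_size, compute_size=len):
        self.max_size = max_size
        self.compute_size = compute_size
        self._items = OrderedDict()
        self._total = 0

    def __setitem__(self, key, value):
        if key in self._items:
            self._total -= self.compute_size(self._items.pop(key))
        self._items[key] = value
        self._total += self.compute_size(value)
        # Evict least-recently-used entries until back under budget,
        # always keeping at least the newest entry.
        while self._total > self.max_size and len(self._items) > 1:
            _, old = self._items.popitem(last=False)
            self._total -= self.compute_size(old)

    def __getitem__(self, key):
        # Pop and re-insert so the entry becomes most-recently-used.
        value = self._items.pop(key)
        self._items[key] = value
        return value
```

With something like `SizeLRUCache(10000)`, the budget is expressed in inventory entries rather than whole inventories, which adapts to both small and 55K-path trees.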

+        for offset in xrange(batch_count):
+            to_query = revision_order[offset * batch_size:(offset + 1)
+                * batch_size]
^- Isn't it a lot easier to do:
for offset in xrange(0, text_count, batch_size):
   to_query = revision_order[offset:offset+batch_size]
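A quick check that the two batching styles slice identically (the data and variable names here are made up; `range` stands in for the original's `xrange`):

```python
revision_order = list('abcdefghij')
batch_size = 3
text_count = len(revision_order)

# Patch's version: loop over batch numbers, multiply inside the slice.
batch_count = (text_count + batch_size - 1) // batch_size
by_count = [revision_order[i * batch_size:(i + 1) * batch_size]
            for i in range(batch_count)]

# Suggested version: step the loop variable by batch_size directly.
by_step = [revision_order[offset:offset + batch_size]
           for offset in range(0, text_count, batch_size)]

assert by_count == by_step
```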


I didn't get time to finish this review... family time right now. But 
I'll try to finish it tonight.


For details, see: 
http://bundlebuggy.aaronbentley.com/request/%3C1195161806.27807.2.camel%40lifeless-64%3E



More information about the bazaar mailing list