[MERGE] _generate_text_key_index

Robert Collins robertc at robertcollins.net
Mon Nov 19 19:08:05 GMT 2007


Thanks for the review. I'm going to reply to it as an action list for
me to consult later when I make the changes. (I'll let all the component
branches be reviewed and then merge one resulting larger patch.)

On Thu, 2007-11-15 at 20:27 -0500, John Arbash Meinel wrote:
> John Arbash Meinel has voted tweak.
> Status is now: Conditionally approved
> Comment:

> Also, you might want to consider splitting it up into helper functions. 
> It seems to be getting a bit dense to follow everything. You could 
> probably break it into a helper that handles a single revision id.

It seems to me that that will require more pre-work. It'll certainly
require more sort and split passes. bzr itself has 36K file texts so the
function call overhead is tolerable - about 30ms there - compared to the
logic processing overhead elsewhere in the code path.

> I just did a quick test. I don't know that this is strictly performance 
> sensitive but you are doing:
> 
> +    def get_parents(self, revisions):
> +        return [self.ancestry.get(r, None) for r in revisions]
> +
> 
> And I found that that has to do an attribute lookup for each pass. 
> timeit says:
> 
> TIMEIT -s 'import random' \
> -s 'd = dict((random.randint(0,1000), random.randint(0,1000)) \
>      for x in xrange(1000))' \
> -s 'y = [random.randint(0,1000) for x in xrange(200)]' \
> 
>    '[d.get(x, None) for x in y]'
> 460 usec
> 
>    'g = d.get; [g(x, None) for x in y]'
> 330 usec
> 
>    'g = d.get; [g(x) for x in y]'
> 306 usec
> 
> I think the extra 24us is because of the INCREF/DECREF around the None 
> object. For me, it is worth the clarity of making it obvious that we are 
> expecting None if the key isn't present. (Yes it is part of get() to do 
> so, but it isn't nearly as obvious.)
> 
> If I set it up so that we have the dictionary as an attribute of a 
> class, it gets even worse:
>      '[o.d.get(x, None) for x in y]'
> 1000 loops, best of 3: 564 usec per loop
> 
> Anyway, we should keep in mind that creating a local variable of the 
> loop function is a good thing.

So the problem is that it's typically a small number of revisions being
queried, I think. Still, perhaps:

def get_parents(self, revisions):
    # Avoid name lookups - costs a local variable assignment. It's a
    # shame python isn't smart enough to do this optimisation for us.
    get = self.ancestry.get
    return [get(r, None) for r in revisions]

We can't easily avoid the None without a subclassed dict and I'm
reasonably sure that that would be slower in the first case.
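For what it's worth, the subclassed-dict approach would look something
like this (a speculative sketch, not something I'm proposing to merge):

```python
# Speculative sketch: a dict whose __missing__ hook returns None, so
# callers can index directly instead of passing a default to get().
class AncestryDict(dict):
    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is absent;
        # note it does NOT insert the key into the dict.
        return None

ancestry = AncestryDict({'rev-2': ('rev-1',)})
parents = [ancestry[r] for r in ['rev-2', 'ghost-rev']]
# parents == [('rev-1',), None]
```

That trades the per-item default argument for a Python-level method call
on every miss, which is why I suspect it would be slower in practice.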

> +    def _generate_text_key_index(self):
> +        """Generate a new text key index for the repository.
> +
> +        This is an expensive function that will take considerable time 
> to run.
> +
> +        :return: A dict mapping text keys ((file_id, revision_id) 
> tuples) to a
> +            list of parents, also text keys. When a given key has no 
> parents,
> +            the parents list will be [NULL_REVISION].
> 
> "to a list of parents, also text keys." is unclear.
> It might be easier to say:
> 
> :return: A dict mapping text keys to a list of parents (eg, (file_id, 
> revision_id) => []). When a given ...

Ok

> Oh, maybe you meant "and invalid text keys", and just left off the word 
> "invalid". I think I prefer 'and' to 'also'.

'And' and 'also' have quite different meanings though :). I don't see
where they can be substituted.

> ...
> +        revision_graph = self.get_revision_graph_with_ghosts()
> +        ancestors = revision_graph.get_ancestors()
> 
> ^- I wish there was a better way to do that. It is a bit inefficient to 
> do all the overhead of deprecated_graph.Graph() [which is adding a map 
> of parents to children] just to get the children to parents map.
> 
> Also, I noticed that get_ancestors() is defined as:
>      def get_ancestors(self):
>          """Return a dictionary of graph node:ancestor_list entries."""
>          return dict(self._graph_ancestors.items())
> 
> Isn't that an expensive way of saying "self._graph_ancestors.copy()" ?
> Since it has to build up a list of tuples in memory, which then gets 
> collapsed back into another dictionary.

Looks like it yes. This is not a significant part of the runtime at the
moment though, so I'm inclined to ignore it for now. Full graph
operations are hopefully rare.
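For reference, the two spellings build the same mapping; copy() just
skips the intermediate pairs:

```python
# dict(d.items()) and d.copy() are equivalent in result; the former
# (under the Python of the day) first built a list of (key, value)
# tuples, then folded it back into a new dict.
ancestors = {'rev-1': [], 'rev-2': ['rev-1']}
via_items = dict(ancestors.items())
via_copy = ancestors.copy()
# via_items == via_copy == ancestors, but both are new dict objects
```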

> ...
> +        revision_keys = {}
> +        for revision_id in revision_order:
> +            revision_keys[revision_id] = set()
> 
> maybe:
> revision_keys = dict((revision_id, set()) for revision_id in 
> revision_order)
> 
> I realize this isn't really the performance sensitive portion of the 
> code.

I think the current for loop is clearer; and yes, this isn't inside the
inner loop yet.
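For the record, both spellings build the same mapping:

```python
revision_order = ['rev-1', 'rev-2', 'rev-3']

# Explicit loop (the version in the patch):
revision_keys = {}
for revision_id in revision_order:
    revision_keys[revision_id] = set()

# Generator-expression alternative John suggested:
alternative = dict((revision_id, set()) for revision_id in revision_order)
# revision_keys == alternative
```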

> ...
> +        # Set a cache with a size of 10 - this suffices for bzr.dev but 
> may be
> +        # too small for large or very branchy trees. However, for 55K 
> path
> +        # trees, it would be easy to use too much memory trivially. 
> Ideally we
> +        # could gauge this by looking at available real memory etc, but 
> this is
> +        # always a tricky proposition.
> +        inventory_cache = lru_cache.LRUCache(10)
> 
> You could instead use a the LRUSizeCache of the inventories. Inventories 
> already have __len__ defined, and we know that a single Inventory entry 
> is approximately a fixed size. (a couple shared file ids, sha string, 
> size, etc).
> That would be a lot more accurate than a fixed size of 10. Instead you 
> could say store approx 10,000 Inventory entries.

Ok. What does that look like? Separately and speculatively, I've been
wondering about using the tail-sharing strategies some functional
languages use for optimising multiple inventory extraction. That is,
reusing common inventory entries by having them present in multiple
dicts, and copying them only when they are actually altered.
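Guessing at the shape of it, a size-aware cache would meter eviction on
entry counts rather than a fixed number of inventories. A rough sketch
(this is not bzrlib's actual LRUSizeCache API; len(value) stands in for
the number of entries in an inventory):

```python
from collections import OrderedDict

class SizeLRUCache(object):
    """LRU cache bounded by total len() of values, not item count."""

    def __init__(self, max_size):
        self._max_size = max_size
        self._size = 0
        self._items = OrderedDict()

    def __getitem__(self, key):
        # Pop and re-insert so the key moves to the most-recent end;
        # raises KeyError if absent, like a plain dict.
        value = self._items.pop(key)
        self._items[key] = value
        return value

    def __setitem__(self, key, value):
        if key in self._items:
            self._size -= len(self._items.pop(key))
        self._items[key] = value
        self._size += len(value)
        # Evict least-recently-used entries until under budget, but
        # always keep at least the entry just added.
        while self._size > self._max_size and len(self._items) > 1:
            _, evicted = self._items.popitem(last=False)
            self._size -= len(evicted)
```

So "store approx 10,000 inventory entries" would be SizeLRUCache(10000),
with whole inventories as the values.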

> +        for offset in xrange(batch_count):
> +            to_query = revision_order[offset * batch_size:(offset + 1) 
> *
> +                batch_size]
> ^- Isn't it a lot easier to do:
> for offset in xrange(0, text_count, batch_size):
>    to_query = revision_order[offset:offset+batch_size]

Yah, clearly I don't use xrange's features enough.
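In current Python terms (range rather than xrange), the batching loop
John suggests reads:

```python
revision_order = ['rev-%d' % i for i in range(1, 8)]
batch_size = 3

batches = []
for offset in range(0, len(revision_order), batch_size):
    # Slicing past the end just truncates, so the last batch may be
    # shorter than batch_size; no separate batch_count is needed.
    batches.append(revision_order[offset:offset + batch_size])
# batches == [['rev-1', 'rev-2', 'rev-3'],
#             ['rev-4', 'rev-5', 'rev-6'],
#             ['rev-7']]
```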

> 
> CONTINUED
> 
> If we are going to cache the tuples, shouldn't we also check that the 
> file_ids and revision_ids are cached as well? We don't necessarily have 
> to have tests for it, but it seems like sharing the strings is going to 
> be a bigger win than just saving the tuples.

We do share the strings because the lower level uses the utf8 cache.
Accessing many historical references occurs several times - initial push
being the most obvious - so I think having tests for memory pressure
there is a good thing. Not sure that at this level we want to test it
though; it seems... redundant.

> +                            # TODO: cache here.
> +                            try:
> +                                inv = inventory_cache[parent_id]
> +                            except KeyError:
> 
> ^- Shouldn't the TODO be removed?

Yup.

> +                    parent_heads = text_graph.heads(candidate_parents)
> +                    new_parents = list(parent_heads)
> +                    new_parents.sort(key=lambda 
> x:candidate_parents.index(x))
> 
> ^- Interesting, I wonder if this is better than:
> parent_heads = text_graph.heads(candidate_parents)
> new_parents = sorted(parent_heads, key=...)

This might be a tiny bit faster.

> or my preferences
> new_parents = [p for p in candidate_parents if p in parent_heads]

This is wrong:
candidate_parents = ['x', 'x']
parent_heads = set(['x'])
=>
new_parents = ['x', 'x']
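Spelled out:

```python
# The filtering form preserves duplicates from candidate_parents,
# which is wrong when the same parent text appears twice:
candidate_parents = ['x', 'x']
parent_heads = set(['x'])

filtered = [p for p in candidate_parents if p in parent_heads]
# filtered == ['x', 'x']  - the duplicate survives

# Sorting the heads by first-occurrence order deduplicates, because
# parent_heads is a set:
new_parents = sorted(parent_heads, key=lambda x: candidate_parents.index(x))
# new_parents == ['x']
```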

> I'm guessing that sorted() is better than list().sort(). But the list of 
> candidate parents is always going to be small (maybe 8 for a large 
> 'octopus' merge).

Right. The common case in fact is 1. (Not a merge).

> It would be interesting to see if there is a difference between them. 
> Certainly 'candidate_parents.index(x)' is a linear search.
> 
> +                    if new_parents == []:
> +                        new_parents = [NULL_REVISION]
> +
> ^- I think this has to create a new temporary list for each pass, so I'm 
> guessing 'if len(new_parents) == 0:', though I believe 'if not 
> new_parents' is faster.

if not new_parents is faster; I had my maximum clarity cap on when
writing this code. I'll change it.

Thanks!
-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.