[MERGE] _generate_text_key_index
Robert Collins
robertc at robertcollins.net
Mon Nov 19 19:08:05 GMT 2007
Thanks for the review. I'm going to reply to it as an action list for
me to consult later when I make the changes. (I'll let all the component
branches be reviewed and then merge one resulting larger patch.)
On Thu, 2007-11-15 at 20:27 -0500, John Arbash Meinel wrote:
> John Arbash Meinel has voted tweak.
> Status is now: Conditionally approved
> Comment:
> Also, you might want to consider splitting it up into helper functions.
> It seems to be getting a bit dense to follow everything. You could
> probably break it into a helper that handles a single revision id.
It seems to me that that will require more pre-work. It'll certainly
require more sort and split passes. bzr itself has 36K file texts so the
function call overhead is tolerable - about 30ms there - compared to the
logic processing overhead elsewhere in the code path.
> I just did a quick test. I don't know that this is strictly performance
> sensitive but you are doing:
>
> + def get_parents(self, revisions):
> + return [self.ancestry.get(r, None) for r in revisions]
> +
>
> And I found that that has to do an attribute lookup for each pass.
> timeit says:
>
> TIMEIT -s 'import random' \
> -s 'd = dict((random.randint(0,1000), random.randint(0,1000)) \
> for x in xrange(1000))' \
> -s 'y = [random.randint(0,1000) for x in xrange(200)]' \
>
> '[d.get(x, None) for x in y]'
> 460 usec
>
> 'g = d.get; [g(x, None) for x in y]'
> 330 usec
>
> 'g = d.get; [g(x) for x in y]'
> 306 usec
>
> I think the extra 24us is because of the INCREF/DECREF around the None
> object. For me, it is worth the clarity of making it obvious that we are
> expecting None if the key isn't present. (Yes it is part of get() to do
> so, but it isn't nearly as obvious.)
>
> If I set it up so that we have the dictionary as an attribute of a
> class, it gets even worse:
> '[o.d.get(x, None) for x in y]'
> 1000 loops, best of 3: 564 usec per loop
>
> Anyway, we should keep in mind that creating a local variable of the
> loop function is a good thing.
So the problem is that it's typically a small number of revisions being
queried, I think. Still, perhaps:
    def get_parents(self, revisions):
        # Avoid name lookups - costs a local variable assignment. It's a
        # shame Python isn't smart enough to do this optimisation for us.
        get = self.ancestry.get
        return [get(r, None) for r in revisions]
We can't easily avoid the None without a subclassed dict, and I'm
reasonably sure that would be slower in the first case.
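For reference, John's measurement can be reproduced with the timeit module directly (the sizes here mirror his setup; absolute numbers will of course vary by machine):

```python
import random
import timeit

setup = """
import random
d = dict((random.randint(0, 1000), random.randint(0, 1000))
         for x in range(1000))
y = [random.randint(0, 1000) for x in range(200)]
g = d.get
"""

# Attribute lookup on d repeated for every pass of the comprehension.
slow = timeit.timeit('[d.get(x, None) for x in y]', setup=setup, number=1000)
# Bound method hoisted into a local name first.
fast = timeit.timeit('[g(x, None) for x in y]', setup=setup, number=1000)
print(slow, fast)
```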
> + def _generate_text_key_index(self):
> + """Generate a new text key index for the repository.
> +
> + This is an expensive function that will take considerable time to run.
> +
> + :return: A dict mapping text keys ((file_id, revision_id) tuples) to a
> + list of parents, also text keys. When a given key has no parents,
> + the parents list will be [NULL_REVISION].
> "to a list of parents, also text keys." is unclear.
> It might be easier to say:
>
> :return: A dict mapping text keys to a list of parents (eg, (file_id,
> revision_id) => []). When a given ...
Ok
> Oh, maybe you meant "and invalid text keys", and just left off the word
> "invalid". I think I prefer 'and' to 'also'.
'And' and 'also' have quite different meanings, though :). I don't see
where they can substitute for each other.
> ...
> + revision_graph = self.get_revision_graph_with_ghosts()
> + ancestors = revision_graph.get_ancestors()
>
> ^- I wish there was a better way to do that. It is a bit inefficient to
> do all the overhead of deprecated_graph.Graph() [which is adding a map
> of parents to children] just to get the children to parents map.
>
> Also, I noticed that get_ancestors() is defined as:
> def get_ancestors(self):
> """Return a dictionary of graph node:ancestor_list entries."""
> return dict(self._graph_ancestors.items())
>
> Isn't that an expensive way of saying "self._graph_ancestors.copy()" ?
> Since it has to build up a list of tuples in memory, which then gets
> collapsed back into another dictionary.
Looks like it, yes. This is not a significant part of the runtime at the
moment though, so I'm inclined to ignore it for now. Full graph
operations are hopefully rare.
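A quick check confirms the two spellings produce the same shallow copy, with .copy() skipping the intermediate sequence of (key, value) tuples that items() materialises:

```python
graph_ancestors = {'rev-1': [], 'rev-2': ['rev-1']}

# dict(items()) round-trips through a sequence of (key, value) pairs...
via_items = dict(graph_ancestors.items())
# ...while copy() builds the same shallow copy directly.
via_copy = graph_ancestors.copy()

assert via_items == via_copy
# Both copies are shallow: the value lists are still shared.
assert via_copy['rev-2'] is graph_ancestors['rev-2']
```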
> ...
> + revision_keys = {}
> + for revision_id in revision_order:
> + revision_keys[revision_id] = set()
>
> maybe:
> revision_keys = dict((revision_id, set()) for revision_id in revision_order)
>
> I realize this isn't really the performance sensitive portion of the
> code.
I think the current for loop is clearer; and yes, this isn't inside the
inner loop yet.
> ...
> + # Set a cache with a size of 10 - this suffices for bzr.dev but may be
> + # too small for large or very branchy trees. However, for 55K path
> + # trees, it would be easy to use too much memory trivially. Ideally we
> + # could gauge this by looking at available real memory etc, but this is
> + # always a tricky proposition.
> + inventory_cache = lru_cache.LRUCache(10)
>
> You could instead use a the LRUSizeCache of the inventories. Inventories
> already have __len__ defined, and we know that a single Inventory entry
> is approximately a fixed size. (a couple shared file ids, sha string,
> size, etc).
> That would be a lot more accurate than a fixed size of 10. Instead you
> could say store approx 10,000 Inventory entries.
Ok. What does that look like? Separately and speculatively, I've been
wondering about using the tail-joining strategies some functional
languages use for optimising multiple inventory extraction: reusing
common inventory entries by having them in multiple dicts, and doing a
copy only when they are actually altered.
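A size-bounded LRU, evicting by the summed len() of cached values rather than by object count, could be sketched like this (a standalone toy standing in for bzrlib's LRUSizeCache, not its actual implementation):

```python
from collections import OrderedDict

class SizeLRUCache(object):
    """LRU cache bounded by the summed len() of its values.

    A toy stand-in for an LRUSizeCache: budget roughly 10,000
    inventory entries in total, rather than a fixed 10 inventories.
    """

    def __init__(self, max_size):
        self._max_size = max_size
        self._size = 0
        self._entries = OrderedDict()

    def __getitem__(self, key):
        value = self._entries.pop(key)  # raises KeyError if absent
        self._entries[key] = value      # re-insert as most recent
        return value

    def __setitem__(self, key, value):
        if key in self._entries:
            self._size -= len(self._entries.pop(key))
        self._entries[key] = value
        self._size += len(value)
        # Evict least-recently-used entries until under budget.
        while self._size > self._max_size:
            _, old = self._entries.popitem(last=False)
            self._size -= len(old)

cache = SizeLRUCache(max_size=5)
cache['a'] = [1, 2, 3]
cache['b'] = [4, 5]
cache['c'] = [6]  # total would be 6, so the oldest entry 'a' is evicted
assert 'a' not in cache._entries
assert 'b' in cache._entries and 'c' in cache._entries
```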
> + for offset in xrange(batch_count):
> + to_query = revision_order[offset * batch_size:(offset + 1) * batch_size]
> ^- Isn't it a lot easier to do:
> for offset in xrange(0, text_count, batch_size):
> to_query = revision_order[offset:offset+batch_size]
Yah, clearly I don't use xrange's features enough.
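The three-argument form does collapse the batch arithmetic nicely (in Python 3, range plays the role xrange did here):

```python
revision_order = ['rev-%d' % i for i in range(10)]
batch_size = 4

batches = []
# range(start, stop, step) yields the slice offsets directly, so no
# separate batch_count computation is needed.
for offset in range(0, len(revision_order), batch_size):
    batches.append(revision_order[offset:offset + batch_size])

assert [len(b) for b in batches] == [4, 4, 2]
```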
>
> CONTINUED
>
> If we are going to cache the tuples, shouldn't we also check that the
> file_ids and revision_ids are cached as well? We don't necessarily have
> to have tests for it, but it seems like sharing the strings is going to
> be a bigger win than just saving the tuples.
We do share the strings because the lower level uses the utf8 cache.
Accessing many historical references occurs several times - initial push
being the most obvious - so I think having tests for memory pressure
there is a good thing. Not sure that at this level we want to test it
though; it seems... redundant.
> + # TODO: cache here.
> + try:
> + inv = inventory_cache[parent_id]
> + except KeyError:
>
> ^- Shouldn't the TODO be removed?
Yup.
> + parent_heads = text_graph.heads(candidate_parents)
> + new_parents = list(parent_heads)
> + new_parents.sort(key=lambda x: candidate_parents.index(x))
>
> ^- Interesting, I wonder if this is better than:
> parent_heads = text_graph.heads(candidate_parents)
> new_parents = sorted(parent_heads, key=...)
This might be a tiny bit faster.
> or my preferences
> new_parents = [p for p in candidate_parents if p in parent_heads]
This is wrong when candidate_parents contains duplicates:
    candidate_parents = ['x', 'x']
    parent_heads = set(['x'])
    =>
    new_parents = ['x', 'x']
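Concretely: with duplicate candidates, the list comprehension keeps both copies, while sorting the heads set yields each head exactly once:

```python
candidate_parents = ['x', 'x']
parent_heads = set(['x'])  # heads() collapses duplicates

# John's suggested comprehension preserves the duplicate:
comprehension = [p for p in candidate_parents if p in parent_heads]
assert comprehension == ['x', 'x']

# The sort-over-heads version emits each head once:
sorted_heads = sorted(parent_heads, key=lambda p: candidate_parents.index(p))
assert sorted_heads == ['x']
```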
> I'm guessing that sorted() is better than list().sort(). But the list of
> candidate parents is always going to be small (maybe 8 for a large
> 'octopus' merge).
Right. The common case in fact is 1. (Not a merge).
> It would be interesting to see if there is a difference between them.
> Certainly 'candidate_parents.index(x)' is a linear search.
>
> + if new_parents == []:
> + new_parents = [NULL_REVISION]
> +
> ^- I think this has to create a new temporary list for each pass, so I'm
> guessing 'if len(new_parents) == 0:', though I believe 'if not
> new_parents' is faster.
'if not new_parents' is faster; I had my maximum-clarity cap on when
writing this code. I'll change it.
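i.e. the hunk becomes a plain truthiness test, avoiding the throwaway empty list built for the == comparison on every pass (NULL_REVISION here is a stand-in string for bzrlib's constant):

```python
NULL_REVISION = 'null:'  # stand-in for bzrlib's NULL_REVISION constant

def normalise_parents(new_parents):
    # An empty list is falsy, so no temporary [] is constructed
    # just to compare against.
    if not new_parents:
        return [NULL_REVISION]
    return new_parents

assert normalise_parents([]) == ['null:']
assert normalise_parents(['a']) == ['a']
```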
Thanks!
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.