loggerhead takes a while to fetch revisions

John Arbash Meinel john at arbash-meinel.com
Thu Jan 4 20:50:22 GMT 2007


Robey Pointer wrote:
...

> With a couple of other tweaks, that knocks the load time for the
> revision-list down to around 6-7 seconds, which is almost good enough to
> do without the cache.  Not quite (I would prefer around 1-2 seconds,
> max) but much, much better.  At least generating the cache will no
> longer take all night.
> 
> One thing that would also help is including the two RevisionTree objects
> in a Delta.  Would that break any abstractions?  The reason it would
> help is: When I fetch a pile of deltas, the repository collects a
> temporary cache of RevisionTrees to build them up, but doesn't save
> them.  If I want to compute 'diffs' for these deltas, I have to go fetch
> these same RevisionTrees all over again.

I think we could return them with an alternative api.

The original point (IMO) of having something like get_revision_delta()
was so that you could create a delta without having to build two full
RevisionTree objects. Theoretically, most of the info for the delta is
already stored in the files, so reading it directly would be a lot
better than creating two full revision trees and then finding the delta
between them.

However, I see your point, and it is reasonable to have a better api for it.

Something like:

def get_deltas_for_revisions_with_trees(self, revisions):
    # Sketch only: I'm assuming Repository.revision_trees() and
    # revision_tree(None) (the empty tree) behave as in current bzr.dev.
    required_trees = set()
    for revision in revisions:
        required_trees.add(revision.revision_id)
        required_trees.update(revision.parent_ids[:1])
    # Fetch every needed tree in one pass, keyed by revision id,
    # so nothing is read from the repository twice.
    trees = dict((t.get_revision_id(), t)
                 for t in self.revision_trees(required_trees))
    for revision in revisions:
        if not revision.parent_ids:
            old_tree = self.revision_tree(None)  # compare against empty
        else:
            old_tree = trees[revision.parent_ids[0]]
        tree = trees[revision.revision_id]
        yield tree, tree.changes_from(old_tree)


Basically, it is just a copy of the old code with the yield statement
changed.

Actually, everything in there is a public api, so you could write the
same thing as a helper function.
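
For example, loggerhead could then call it roughly like this (just a
sketch: do_something_with() is a placeholder, and I'm assuming the
helper above rewritten as a free function that takes the repository as
its first argument):

repo = branch.repository
revisions = [repo.get_revision(rev_id) for rev_id in revision_ids]
for tree, delta in get_deltas_for_revisions_with_trees(repo, revisions):
    # 'tree' is the new RevisionTree, so a later file-level diff can
    # reuse it instead of fetching the same tree all over again.
    do_something_with(tree, delta)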

> 
> 
>> 2) We need to work on getting _KnitIndex.__init__() faster anyway.
>> Dmitry has a couple of optimizations he has proposed, and I have
>> approved, but nobody else has looked at them.
>> http://bundlebuggy.aaronbentley.com/request/%3C45700C04.6040706@hlabs.spb.ru%3E
>>
>>
>> In deference to Aaron, I want to ask that we avoid spurious
>> whitespace changes (changing whitespace just for cleanliness' sake,
>> since it might make future merges conflict more).
> 
> I tried this patch, and it didn't appear to help (even after the above
> improvements) so it may be a red herring.  But also the patch didn't
> apply cleanly (at least a half dozen conflicts) so I can't honestly say
> that I tried a functional version of the patch.  It may have been messed
> up.  A new, clean-applying version of the patch would probably be worth
> trying.
> 

I believe the patch has now been applied, so you might try again. But I
don't know that you'll see massive improvements either.


> 
>> 3) We really want to switch so that revision_ids are just UTF-8 strings
>> in memory, rather than having to encode/decode them as they are
>> read/written. It means adding another encode() and decode() step in the
>> XML serializers, because cElementTree wants to work in Unicode objects.
>> So in the short term, it won't help a whole lot. But it will help while
>> parsing the knit index files. And in the long run, we'll probably switch
>> away from cElementTree, and then it will help a lot there too.
> 
> My gut says that cElementTree versus pickle is the root cause of the
> remaining speed difference (which looks like another 10x).
> 

Unfortunately, pickle isn't considered a "safe" format: loading a
pickle can execute arbitrary code, so it isn't something that we can
use as a general storage format.
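
To make that concrete, here is roughly why loading a pickle from an
untrusted branch would be dangerous (this is standard pickle behaviour,
nothing bzr-specific):

import os
import pickle

class Payload(object):
    def __reduce__(self):
        # pickle calls os.system(...) while loading this object
        return (os.system, ('echo arbitrary code execution',))

data = pickle.dumps(Payload())
pickle.loads(data)   # runs the shell command above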

But I have played around with custom storage formats enough to know that
we can do better than cElementTree.

> 
>> 4) We might also be able to greatly revise how the index files are
>> parsed. Right now, we do a readlines() call and then iterate through the
>> lines. It might actually be a lot faster to do:
>>
>> text = fp.read()
>> recs = [line.split() for line in text.split('\n')
>>                        if line.endswith(':')]
>> for rec in recs:
>>   parents = []
>>   for value in rec[4:-1]:
>>     if value.startswith('.'):
>>       parents.append(value[1:]) # No decode step, right?
>>     else:
>>       parents.append(self._history[int(value)])
> 
> Obviously this means reading the entire index file into memory.  Are
> those generally small?

Indexes are quite a bit smaller than the contents, and at the moment
the code specifically assumes that it reads the whole thing.

It might be possible to read only part of an index, but if you think
through the consequences, doing it *correctly* turns out to be a pretty
hard problem.
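
For reference, here is a runnable version of the sketch quoted above
(parse_index and history are names I made up, and the field layout of
name, options, position, size, then parent refs, with a trailing ':'
marking a complete record, is assumed from the discussion rather than
checked against the real knit index format):

def parse_index(fp, history):
    text = fp.read()  # read the whole index in one call
    # Only lines ending in ':' are complete, trustworthy records.
    recs = [line.split() for line in text.split('\n')
            if line.endswith(':')]
    for rec in recs:
        parents = []
        for value in rec[4:-1]:
            if value.startswith('.'):
                # '.'-prefixed values are literal revision ids
                parents.append(value[1:])  # no decode step
            else:
                # bare integers index into revisions already seen
                parents.append(history[int(value)])
        yield rec[0], parents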

...

> I took a new lsprof snapshot, this time of get_changes() fetching 100
> revisions, and posted it here:
> 
>     http://www.lag.net/~robey/code/get_change2.html
> 
> The one thing I notice right away is that 4 seconds out of 9 seem to be
> spent in xml_serializer.py.
> 
> robey

This is probably exacerbated by having to create a RevisionTree a second
time if you have already created one. Does 'get_changes()' actually
compute the file-level diffs for everything? Or is it just the
inventory-level diffs?

I'm guessing it is just inventory-level, since I don't see any other
diffs going on in the lsprof results.
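
To illustrate the distinction (with tree and old_tree as in the helper
above, and the APIs written from memory, so treat this as a sketch):

from StringIO import StringIO
from bzrlib import diff

# Inventory-level: compares the two inventories, no file texts read.
delta = tree.changes_from(old_tree)

# File-level: has to extract the actual file texts to produce a diff.
out = StringIO()
diff.show_diff_trees(old_tree, tree, out)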

One thing to note: lsprof penalizes xml_serializer a little more than
other functions, since the profiling overhead is paid per function call
and the serializer makes very many small calls. So while it is slow, it
isn't quite as slow as lsprof says.

You might try writing a helper like the
get_deltas_for_revisions_with_trees() above, and see if that helps at
all.

John
=:->
