'bzr reconcile' *really* slow

John Arbash Meinel john at arbash-meinel.com
Sun Oct 14 18:44:41 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> John Arbash Meinel wrote:
> [...]
>> On my desktop it has been going for 2 hours now. And if I SIGQUIT in to see
>> what it is doing, it seems to be in _fix_text_parents() having gotten to 17 out
>> of 4000 entries. Which means that it should finish after 470 hours or something
>> like 20 *days*.
>>
>> I'm guessing there is something fishy going on here. It seems to be extracting
>> a whole lot of inventories (a complete inventory for every revision of every
>> file?).
> 
> Yeah, that does seem to be what's going on.  A quick hack to cache inventory
> lookups seems to bring the time down from 45 seconds to 9 seconds on a small
> branch with 258 revisions.
> 
> I've attached that patch.
> 
> -Andrew.
> 

I'm guessing this will help, but I'm worried about:

1) Memory overflow on large trees. I have an LRUCache implementation
that (for whatever reason) never made it into core. I'll probably try to
separate it out again, and submit it.

2) It may not be quite enough. I'm currently at just shy of 48 hours of
runtime, and I'm at 533/2068. That puts me at about 186 hours of total
runtime, or just shy of 8 days of operation. A 5-fold improvement makes
this about 1.5 days, which is quite a bit better, but I would certainly
expect something like this to be in the hours range, not the days range.

Then again, my repository seems to be populated with a lot of "extra"
stuff, which isn't referenced by bzr.dev. A clean bzr.dev tree seems to
have 957 knits. So somehow (plugins most likely, possibly some other
stuff) I have another 1111 knits to reconcile.

3) Is there any way that this stuff could be stopped and resumed, or is
most of the time spent just reassuring itself that it really is correct?
(I have a feeling it is the latter.)
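
To make the concern in point 1 concrete, here is a minimal sketch of a
size-bounded, least-recently-used inventory cache. This is not the LRUCache
implementation mentioned above and not part of bzrlib: the class name
BoundedInventoryCache, the fetch callable, and the max_size default are all
made up for illustration, and it assumes a modern Python 3 (for
OrderedDict.move_to_end). The fetch callable stands in for
repository.get_inventory() plus the RevisionNotPresent handling in the patch.

```python
from collections import OrderedDict


class BoundedInventoryCache:
    """Hypothetical LRU-bounded cache keyed by revision id (illustration only)."""

    def __init__(self, fetch, max_size=100):
        # fetch: callable taking a revision_id and returning an inventory,
        # or None when the revision is not present. This is an assumption
        # standing in for the repository lookup in the patch.
        self._fetch = fetch
        self._max_size = max_size
        self._entries = OrderedDict()

    def get_inventory(self, revision_id):
        if revision_id in self._entries:
            # Cache hit: mark this entry as the most recently used.
            self._entries.move_to_end(revision_id)
            return self._entries[revision_id]
        # Cache miss: fetch, store, and evict the oldest entry if over size.
        inv = self._fetch(revision_id)
        self._entries[revision_id] = inv
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)
        return inv

    def __len__(self):
        return len(self._entries)
```

The point of the bound is that memory use stays proportional to max_size
rather than to the number of revisions visited, at the cost of re-fetching
inventories that have been evicted.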

Anyway, I went ahead and included your patch and started running again.
I'll let you know what I find out.

By the way, at least when converting the first couple of weaves,
--lsprof seems to say that the bulk of the time is being spent in
get_text_version(), and a non-trivial amount of time is spent in
graph.heads(). (This is with your inventory caching logic included.)

I suppose it is possible that these times may average out, as you get
more entries that have already been seen. I'm attaching the callgrind
output that I've seen so far.
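
For reference, the runtime estimate in point 2 is a straight linear
extrapolation from the progress counter. A rough sketch of that arithmetic
follows; the function name is made up for illustration, and it assumes the
cost per knit stays constant, which may not hold if later entries mostly hit
data that has already been seen.

```python
def estimate_total_hours(elapsed_hours, done, total):
    """Linearly extrapolate total runtime, assuming constant cost per item."""
    return elapsed_hours * total / done


# Figures from point 2: about 48 hours elapsed, 533 of 2068 knits processed.
total_hours = estimate_total_hours(48, 533, 2068)
total_days = total_hours / 24
# Assumed 5-fold speedup, taken from the 45 s -> 9 s measurement quoted above.
days_with_speedup = total_days / 5

print("total hours: %.0f" % total_hours)
print("total days: %.1f" % total_days)
print("days with 5x speedup: %.1f" % days_with_speedup)
```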

John
=:->

> 
> 
> ------------------------------------------------------------------------
> 
> === modified file 'bzrlib/repository.py'
> --- bzrlib/repository.py	2007-10-12 08:18:54 +0000
> +++ bzrlib/repository.py	2007-10-13 12:48:37 +0000
> @@ -2489,6 +2489,7 @@
>      def __init__(self, repository):
>          self.repository = repository
>          self.revision_versions = {}
> +        self.inventories = {}
>  
>      def add_revision_text_versions(self, tree):
>          """Cache text version data from the supplied revision tree"""
> @@ -2507,6 +2508,16 @@
>              inv_revisions = self.add_revision_text_versions(tree)
>          return inv_revisions.get(file_id)
>  
> +    def get_inventory(self, revision_id):
> +        try:
> +            return self.inventories[revision_id]
> +        except KeyError:
> +            try:
> +                inv = self.repository.get_inventory(revision_id)
> +            except errors.RevisionNotPresent:
> +                inv = None
> +            self.inventories[revision_id] = inv
> +            return inv
>  
>  class VersionedFileChecker(object):
>  
> @@ -2514,6 +2525,7 @@
>          self.planned_revisions = planned_revisions
>          self.revision_versions = revision_versions
>          self.repository = repository
> +        self.repo_graph = self.repository.get_graph()
>      
>      def calculate_file_version_parents(self, revision_id, file_id):
>          text_revision = self.revision_versions.get_text_version(
> @@ -2526,19 +2538,15 @@
>          for parent in parents_of_text_revision:
>              if parent == _mod_revision.NULL_REVISION:
>                  continue
> -            try:
> -                inventory = self.repository.get_inventory(parent)
> -            except errors.RevisionNotPresent:
> -                pass
> -            else:
> +            inventory = self.revision_versions.get_inventory(parent)
> +            if inventory is not None:
>                  try:
>                      introduced_in = inventory[file_id].revision
>                  except errors.NoSuchId:
>                      pass
>                  else:
>                      parents_from_inventories.append(introduced_in)
> -        graph = self.repository.get_graph()
> -        heads = set(graph.heads(parents_from_inventories))
> +        heads = set(self.repo_graph.heads(parents_from_inventories))
>          new_parents = []
>          for parent in parents_from_inventories:
>              if parent in heads and parent not in new_parents:
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHElWIJdeBCYSNAAMRAgl5AJ9Teh2D2rr63YGHt6WyKQnd7CDGhgCghrv6
lFb8wKflPail48W9TvpmKSY=
=9Ud6
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reconcile.callgrind.bz2
Type: application/x-bzip
Size: 17839 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20071014/af355446/attachment-0001.bin 

