[merge] InterKnit.join() doesn't need the complete revision graph

John Arbash Meinel john at arbash-meinel.com
Sat Sep 9 02:27:18 BST 2006


John Arbash Meinel wrote:
> I'm still doing some profiling, and I found another surprise. When
> pushing a small branch into another small branch, but using a local and
> remote repository with a lot of revisions, InterKnit.join() touches too
> many revisions.
> 
>      4  5.6062  0.1274   bzrlib.knit:1560(join)
> +21453  0.2398  0.1607   +bzrlib.knit:509(has_version)
>     +4  0.4071  0.1089   +bzrlib.versionedfile:307(get_graph)
>     +4  1.6765  0.0048   +bzrlib.tsort:27(topo_sort)
>   +152  0.0017  0.0011   +bzrlib.knit:872(get_parents_with_ghosts)
>     +8  0.0036  0.0002   +bzrlib.knit:1454(read_records_iter_raw)
>     +4  3.1369  0.0002   +bzrlib.knit:384(_add_raw_records)
> 
> The actual plugin I'm pushing has only 25 revisions, as does the
> target. There's no need to check all 9181 revisions in my repository.
> (The count of 21453 comes from checking inventory.knit, revisions.knit,
> signatures.knit, and the text knit, where the signatures and text knits
> have very different numbers of revisions).
> 
> The attached patch just culls out the unnecessary revisions ahead of
> time. I haven't profiled it enough to say how much it helps for cases
> like this, but it should at least cut down the topo_sort time, if not
> the get_graph time.
> 
> Also, would it be reasonable to use merge_sort(), since it seems to be
> faster than topo_sort?
> 
> John
> =:->
> 
> 
> ------------------------------------------------------------------------
> 
> === modified file 'bzrlib/knit.py'
> --- bzrlib/knit.py	2006-09-06 21:56:58 +0000
> +++ bzrlib/knit.py	2006-09-09 01:13:46 +0000
> @@ -1601,7 +1601,7 @@
>      
>              if not needed_versions and not mismatched_versions:
>                  return 0
> -            full_list = topo_sort(self.source.get_graph())
> +            full_list = topo_sort(self.source.get_graph(version_ids))
>      
>              version_list = [i for i in full_list if (not self.target.has_version(i)
>                              and i in needed_versions)]
> 
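For illustration, the culling that passing version_ids enables can be
sketched like this (a standalone model with made-up revision ids, not the
actual bzrlib VersionedFile code): restrict the graph to the ancestry of
the requested versions before topo-sorting, so topo_sort never sees the
thousands of unrelated revisions.

```python
def cull_graph(graph, version_ids):
    """Return the subgraph of `graph` reachable from `version_ids`.

    `graph` maps revision-id -> tuple of parent revision-ids.
    """
    needed = {}
    pending = list(version_ids)
    while pending:
        rev = pending.pop()
        if rev in needed:
            continue
        parents = graph.get(rev, ())
        needed[rev] = parents
        pending.extend(parents)
    return needed

# A repository with five revisions, where only 'c' and its ancestry
# are being pushed; 'x' and 'y' are unrelated history.
graph = {'a': (), 'b': ('a',), 'c': ('b',), 'x': (), 'y': ('x',)}
cull_graph(graph, ['c'])   # only 'a', 'b', 'c' survive the cull
```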

Another thing I just thought of: we should probably change the second
line to:
            version_list = [i for i in full_list
                            if (i in needed_versions
                                and not self.target.has_version(i))]

I'm guessing that "i in needed_versions" is going to be faster than "not
self.target.has_version(i)", so it should come first and short-circuit
the more expensive check, because:

1) has_version() is a complete function call with frame overhead, versus
just a set lookup.
2) needed_versions is going to be a smaller set() for branches that are
up-to-date with each other, which means fewer possible hash collisions.
I know dict() and set() lookups are supposed to be approximately O(1),
but I've found that they aren't quite: a really small dict is faster to
probe than a really full one. I don't know the specific curve, but I'm
90% positive that this is the case.
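A quick way to sanity-check the first point (a rough standalone sketch;
Target and has_version here are stand-ins for the real knit API, not the
actual bzrlib code):

```python
import timeit

class Target:
    """Stand-in for the target knit; has_version() is a full Python
    method call, so every test pays the frame-setup overhead."""
    def __init__(self, versions):
        self._versions = set(versions)

    def has_version(self, version_id):
        return version_id in self._versions

# 25 revisions being pushed vs. 9181 already in the repository,
# matching the numbers from the profile above.
needed_versions = set('rev-%d' % i for i in range(25))
target = Target('rev-%d' % i for i in range(9181))

t_set = timeit.timeit(lambda: 'rev-12' in needed_versions, number=100000)
t_call = timeit.timeit(lambda: target.has_version('rev-12'), number=100000)
# Expect the bare set lookup to be cheaper, mostly because it skips
# the method-call overhead; exact numbers will vary by machine.
```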

John
=:->
