[merge] InterKnit.join() doesn't need the complete revision graph
John Arbash Meinel
john at arbash-meinel.com
Sat Sep 9 02:27:18 BST 2006
John Arbash Meinel wrote:
> I'm still doing some profiling, and I found another surprise. When
> pushing a small branch into another small branch, but using a local and
> remote repository with a lot of revisions, InterKnit.join() touches too
> many revisions.
>
> 4 5.6062 0.1274 bzrlib.knit:1560(join)
> +21453 0.2398 0.1607 +bzrlib.knit:509(has_version)
> +4 0.4071 0.1089 +bzrlib.versionedfile:307(get_graph)
> +4 1.6765 0.0048 +bzrlib.tsort:27(topo_sort)
> +152 0.0017 0.0011 +bzrlib.knit:872(get_parents_with_ghosts)
> +8 0.0036 0.0002 +bzrlib.knit:1454(read_records_iter_raw)
> +4 3.1369 0.0002 +bzrlib.knit:384(_add_raw_records)
>
> The actual plugin I'm pushing only has 25 revisions, as does the
> target. I don't need to check all 9181 revisions in my repository.
> (21453 is because you have inventory.knit, revisions.knit,
> signatures.knit, and the text.knit, where signatures and text have a
> very different number of revisions).
>
> The attached patch just culls out the unnecessary revisions ahead of
> time. I haven't profiled it a lot to see how much it helps for cases
> like this. But it seems it should cut down the topo_sort time, if not
> the get_graph time.
>
> Also, would it be reasonable to use merge_sort() since that seems to be
> faster than topo_sort?
>
> John
> =:->
>
>
> ------------------------------------------------------------------------
>
> === modified file 'bzrlib/knit.py'
> --- bzrlib/knit.py 2006-09-06 21:56:58 +0000
> +++ bzrlib/knit.py 2006-09-09 01:13:46 +0000
> @@ -1601,7 +1601,7 @@
>
> if not needed_versions and not mismatched_versions:
> return 0
> - full_list = topo_sort(self.source.get_graph())
> + full_list = topo_sort(self.source.get_graph(version_ids))
>
> version_list = [i for i in full_list if (not self.target.has_version(i)
> and i in needed_versions)]
>
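For anyone following along, here is a rough sketch (not bzrlib's actual
code; the helper names are mine) of what passing version_ids to
get_graph() buys: the graph handed to topo_sort only contains the
requested versions and their ancestors, so the sort scales with the size
of the branch being pushed rather than the whole repository.

```python
def pruned_graph(parents_map, version_ids):
    """Return the sub-graph containing only version_ids and their ancestors.

    parents_map maps each version to a tuple of its parent versions.
    """
    needed = {}
    pending = list(version_ids)
    while pending:
        v = pending.pop()
        if v in needed:
            continue
        parents = parents_map.get(v, ())
        needed[v] = parents
        pending.extend(parents)
    return needed

def topo_sort(graph):
    """Minimal parents-before-children topological sort (iterative DFS)."""
    result = []
    done = set()
    for start in graph:
        stack = [start]
        while stack:
            v = stack[-1]
            if v in done:
                stack.pop()
                continue
            unfinished = [p for p in graph.get(v, ()) if p not in done]
            if unfinished:
                stack.extend(unfinished)
            else:
                done.add(v)
                result.append(v)
                stack.pop()
    return result

# 'other' is unrelated to r3's ancestry, so it gets culled before sorting:
graph = {'r1': (), 'r2': ('r1',), 'r3': ('r2',), 'other': ()}
sub = pruned_graph(graph, ['r3'])
order = topo_sort(sub)
```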
Another thing which I just thought of: we probably should change the
second line to:

version_list = [v for v in full_list
                if (v in needed_versions
                    and not self.target.has_version(v))]

I'm guessing that "v in needed_versions" is going to be faster than "not
self.target.has_version(v)"
Because

1) has_version() is a complete function call with frame overhead, versus
just a set lookup.

2) needed_versions is going to be a smaller set for branches that are
up-to-date with each other, which means there are fewer possible
collisions. I know dict and set lookups are supposed to be approximately
O(1), but I've found that they aren't quite: a really small dict is
faster than a really full one. I don't know the specific curve, but I'm
90% positive that this is the case.
John
=:->