[MERGE] Make push/pull/merge faster.

John Arbash Meinel john at arbash-meinel.com
Tue Oct 30 19:21:01 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Tue, 2007-10-30 at 13:55 -0500, John Arbash Meinel wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Robert Collins wrote:
>>> On Tue, 2007-10-30 at 13:15 -0500, John Arbash Meinel wrote:
>>>
>>>> You seem to set "supports_ghosts = True" for KnitRepo, but not for PackRepo. It
>>>> may be inherited, but setting it explicitly would be good.
>>> Hmm, I was deliberately not setting it for PackRepo.
>> Umm... that seems a bit odd. Shouldn't repositories have that attribute if you
>> are going to be using it? Heck, even if you set it to None so that it would be
>> considered "I'm not making an assertion."
>> I realize we will need to use getattr() anyway (so that foreign repository
>> implementations don't have to have it), but I would prefer if we didn't have
>> that for formats under our control.
> 
> I wasn't setting it because I thought it was inherited, but it's not, so
> I'm on crack.
> 
> The compatibility thing is RepositoryFormat.supports_ghosts = None.
> 
>> ..
>>
>>>> ^- Why do you need to remove the null revisions? Just so you don't query the
>>>> target index for them?
>>> Yes. Searching for 'null:' when it will never be there is wasteful.
>> Does it always return 'null:'? Is it more wasteful than having to search for
>> and remove it on every call? If you have 10,000 calls, and only 1 of them has
>> 'null:', it seems a lot cheaper to leave it in than to pay 10,000 checks to see
>> if it needs to be removed. But as the remote lookup is a round trip, perhaps it
>> is worthwhile.
> 
> Exactly. And if we're doing 10K calls then we're probably doing an initial pull, so ...
> 
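
As a side note on the compatibility point quoted above, the getattr() fallback amounts to something like the following sketch. The class names here are stand-ins for illustration, not bzrlib's actual classes:

```python
class RepositoryFormat(object):
    # None means "I'm not making an assertion" -- the compatibility
    # default for formats that predate the attribute.
    supports_ghosts = None

class KnitRepositoryFormat(RepositoryFormat):
    # Formats under our control set the flag explicitly.
    supports_ghosts = True

class ForeignFormat(object):
    # A foreign repository implementation: no attribute at all.
    pass

def format_supports_ghosts(fmt):
    # getattr() with a default keeps this safe for foreign formats
    # that never declare the attribute.
    return getattr(fmt, 'supports_ghosts', None)
```

With this, a missing attribute and an explicit None both read as "no assertion", while our own formats answer definitively.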

So we've already loaded the whole index, and an extra round trip wouldn't
matter (and probably wouldn't even be a round trip)? I'm not sure what your
point is.

If you are only doing the last 100 entries, then you won't have 'null:'. So it
seems to me that if you are getting 'null:' back, you are at the point where it
isn't that expensive to just look for it.

Overall, this is a lot of discussion over something that is likely to have very
little effect in practice. If you feel strongly about keeping it, that is ok.
But it seems like doing more work (and adding complexity) for minimal gain, or
possibly a loss.
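
For concreteness, the filtering in question amounts to something like this sketch ('null:' is bzr's NULL_REVISION sentinel; the function and surrounding names are made up for illustration):

```python
NULL_REVISION = 'null:'

def missing_keys(index, keys):
    """Return the requested keys not present in the target index."""
    keys = set(keys)
    # Dropping NULL_REVISION up front avoids querying the target index
    # for a key that can never be present -- at the cost of one check
    # per call even when 'null:' was never in the set.
    keys.discard(NULL_REVISION)
    return keys - set(index)
```

The tradeoff being debated is exactly that discard(): one wasted check on most calls, versus one wasted index lookup on the rare call that includes 'null:'.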


>> As "iter_entries()" can return the entries in any order, it seems better to use
>> a list than a generator here. Lists are actually generally faster than
>> generators, as long as the memory cost is not too high.
>> And iter_entries() seems like it would want to use a list so that it could plan
>> what entries are optimal to return.
>> % TIMEIT -s 'x = range(10000)' 'y = [(z,) for z in x]'
>> 100 loops, best of 3: 3.72 msec per loop
>>
>> % TIMEIT -s 'x = range(10000)' 'y = list((z,) for z in x)'
>> 100 loops, best of 3: 4.56 msec per loop
>>
>> Without the tuple creation (just 'z for z in x'), the times drop to 1.53ms
>> and 2.65ms. That shows that list comprehensions are genuinely faster than
>> generators, especially if you are just going to put the result into a list anyway.
> 
> I think this is good to examine in more detail; its definitely a
> different thread though.
> 
> iter_entries is a loop around parallel bisection though - it doesn't do
> any 'planning' at a global scale, and we also have to handle aggregating
> indices together - have a look at CombinedGraphIndex.

Sure, but parallel bisection is going to occur on a list, right? So you already
need to know about all the entries that you are going to look at. (I certainly
hope you don't bisect for each entry one at a time.)

You actually do:
keys = set(keys)

So instead we can check:
% TIMEIT -s 'x = range(10000)' 'y = set([(z,) for z in x])'
100 loops, best of 3: 7.5 msec per loop

% TIMEIT -s 'x = range(10000)' 'y = set((z,) for z in x)'
100 loops, best of 3: 9.09 msec per loop

So again, unless there is memory pressure that makes holding a full list in
memory undesirable, it is generally faster to use a list comprehension than a
generator. Aaron was actually the one who turned me on to that fact.

Generators can be nice if they let you do some work while waiting for more
information. But generally they are slower than batching everything up in lists.
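
The comparison above is easy to reproduce with the standard timeit module (absolute numbers will vary by machine and Python version, but the shape of the result should match):

```python
import timeit

setup = 'x = range(10000)'

# Build a set from a list comprehension vs. from a generator expression.
list_comp = timeit.timeit('y = set([(z,) for z in x])', setup=setup, number=100)
gen_expr = timeit.timeit('y = set((z,) for z in x)', setup=setup, number=100)

# number=100 runs total; multiply by 10 to get ms per loop.
print('list comprehension: %.2f msec per loop' % (list_comp * 10))
print('generator expr:     %.2f msec per loop' % (gen_expr * 10))
```

Both forms produce the same set, of course; the difference is purely in iteration overhead.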

...

>> When someone comes in who isn't 100% positive about everything, it will lead
>> to accidents (where they thought it was returning the 4 entries, but it was
>> really only returning 3).
>>
>> I don't think the disk signature needs to change. I just think it is a crufty
>> API that is harder to use than it has to be.
>> If it is for compatibility then I'm okay with it. But generally changing the
>> returned signature of a function is a bit ugly.
> 
> The index API was in 0.90, and is not changed by this patch. I really
> don't want to block this merge by sidetracking onto the index API.
> 
> -Rob

This was never about blocking the merge. I was just commenting on an API that
struck me as crufty once I noticed it. I don't expect you to change it in order
to merge the other changes. I probably should have been clearer about that.

Your changes are still approved. I would recommend using a list rather than a
generator, but that is a fairly minor point.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHJ4QdJdeBCYSNAAMRAlfQAKDY9DzFDs4/pBfAjp4c/cOzk2XorgCgjO77
37jGgeDACVxV98ilKYxVypI=
=Y59M
-----END PGP SIGNATURE-----


