[MERGE] Make push/pull/merge faster.

John Arbash Meinel john at arbash-meinel.com
Tue Oct 30 18:55:17 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Tue, 2007-10-30 at 13:15 -0500, John Arbash Meinel wrote:
> 
>> You seem to set "supports_ghosts = True" for KnitRepo, but not for PackRepo. It
>> may be inherited, but setting it explicitly would be good.
> 
> Hmm, I was deliberately not setting it for PackRepo.

Umm... that seems a bit odd. Shouldn't repositories have that attribute if you
are going to be using it? Heck, even setting it to None would work, so that it
reads as "I'm not making an assertion either way."
I realize we will need to use getattr() anyway (so that foreign repository
implementations don't have to define it), but I would prefer not to rely on
that fallback for formats under our control.
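
Something like this is all I'd expect on the caller side (a quick sketch, not
the actual patch; the helper name is just for illustration):

def repo_supports_ghosts(repo):
    # getattr() with a default of None covers foreign repository
    # implementations that never define the attribute at all -- it reads
    # as "no assertion made" rather than an explicit True/False.
    return getattr(repo, 'supports_ghosts', None)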

..

>>
>> ^- Why do you need to remove the null revisions? Just so you don't query the
>> target index for them?
> 
> Yes. Searching for 'null:' when it will never be there is wasteful.

Does the search always return 'null:'? Is it really more wasteful than having
to check for and remove it on every call? If you have 10,000 calls and only 1
of them contains 'null:', it seems a lot cheaper to leave it in than to do
10,000 checks to see whether it needs to be removed. But as the remote lookup
is a round trip, perhaps it is worthwhile.
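
For concreteness, the filtering under discussion is roughly this (a sketch
only; I'm assuming NULL_REVISION is bzrlib's 'null:' constant and next_revs is
a set of revision ids):

from bzrlib.revision import NULL_REVISION  # the 'null:' marker

def keys_without_null(next_revs):
    # The per-call cost in question: strip 'null:' from the query set
    # before building keys for the target index.  discard() is a cheap
    # no-op when 'null:' is absent, which is the common case.
    revs = set(next_revs)
    revs.discard(NULL_REVISION)
    return [(rev,) for rev in revs]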

> 
>> +                target_keys =list ((key,) for key in next_revs)
>>
>> ^- This has a PEP8 typo, but even more, it seems odd to use list() on a
>> generator, rather than just doing a list comprehension
> 
> Bah. Debug code wins again (printing generators is not useful).
> 
>> target_keys = [(key,) for key in next_revs]

As "iter_entries()" can return the entries in any order, it seems better to use
a list than a generator here. Lists are actually generally faster than
generators, as long as the memory cost is not too high.
And iter_entries() seems like it would want to use a list so that it could plan
what entries are optimal to return.
% timeit -s 'x = range(10000)' 'y = [(z,) for z in x]'
100 loops, best of 3: 3.72 msec per loop

% timeit -s 'x = range(10000)' 'y = list((z,) for z in x)'
100 loops, best of 3: 4.56 msec per loop

Without the tuple creation, i.e. just (z for z in x), the times drop to 1.53ms
and 2.65ms respectively, which shows that list comprehensions are genuinely
faster than generators, especially if you are just going to put the result into
a list anyway.

% timeit -s 'x = range(10000)' 'y = list([(z,) for z in x])'
100 loops, best of 3: 3.84 msec per loop

So even if you end up copying it into yet another list, it is still a lot
faster than using a generator.
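
If anyone wants to reproduce the comparison, the timeit module will do it
directly (a sketch; the absolute numbers will obviously vary by machine):

import timeit

setup = 'x = range(10000)'
statements = ['y = [(z,) for z in x]',        # list comprehension
              'y = list((z,) for z in x)',    # generator fed to list()
              'y = list([(z,) for z in x])']  # comprehension copied again
for stmt in statements:
    # Best of 3 runs of 100 loops each, reported as msec per loop.
    best = min(timeit.repeat(stmt, setup, repeat=3, number=100))
    print '%.2f msec per loop: %s' % (best * 1000 / 100, stmt)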

>>
>>
>> +                have_revs = frozenset(node[1][0] for node in
>> target_index.iter_entries(target_keys))
>>
>> ^- This is an overly long line.
>>
>> have_revs = frozenset(node[1][0] for node
>>                       in target_index.iter_entries(target_keys))
>> or something along those lines. (we each seem to have a slightly different
>> preference when wrapping)
> 
> Yah, missed that one - thanks.
> 
>> +                missing_revs.update(next_revs - have_revs)
>> +                searcher.stop_searching_any(have_revs)
>> +            return missing_revs
>>
>>
>> The doc string for "iter_entries" has:
>>         :return: An iterable as per iter_all_entries, but restricted to the
>>             keys supplied. No additional keys will be returned, and every
>>             key supplied that is in the index will be returned.
>>
>> It isn't perfectly clear that this will skip requested entries that are not
>> present. Or maybe it does, but it takes a long time to get there.
>>
>> :return: Iterate over tuples (like iter_all_entries) for all keys which are
>> present. Keys not present will be skipped, keys present but not requested will
>> not be returned.
> 
> Hmmm, I'm not sure this is more clear to be honest. And saying that the
> return value is 'Iterate' is IMO wrong - it is an 'Iterable'.
> 
>> It also seems weird to have the returned tuples be 3 or 4 based on what is present:
>>         :return: An iterable of (index, key, value) or (index, key, value,
>> reference_lists).
>>             The former tuple is used when there are no reference lists in the
>>             index, making the API compatible with simple key:value index types.
>>             There is no defined order for the result iteration - it will be in
>>             the most efficient order for the index.
>>
>> Since you still have the "index" object being returned, it isn't like you can
>> do: dict(foo.iter_all_entries). So having it have a fixed return value seems
>> better. (Yes this is late to realize, but still seems worthwhile.)
> 
> I don't understand what you mean. Do you mean 'having a reference-list
> free return value when reference lists are not being used is weird'?
> 
> I discussed this way back, basically this is GraphIndex offering a plain
> Index interface without requiring a wrapping layer to strip off the 4th
> field if it is not being used. I agree that it's a touch weird, but it
> didn't seem worthwhile having the class and disk signature change just
> for the signatures .six file.

I'm saying that returning tuples that sometimes have 4 entries and sometimes
have 3 is a strange API.

When someone who isn't 100% sure about everything comes along, it will lead to
accidents (where they thought it was returning 4-tuples, but it was really only
returning 3-tuples).

I don't think the disk signature needs to change. I just think it is a crufty,
harder-to-use-than-it-has-to-be API.
If it is for compatibility then I'm okay with it, but in general, varying the
return signature of a function is a bit ugly.
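
To make the hazard concrete (a sketch only, reusing the target_index and
target_keys names from the hunk quoted above):

# A caller written against the 4-tuple form, run against an index that
# was built without reference lists:
for index, key, value, refs in target_index.iter_entries(target_keys):
    pass
# -> ValueError: need more than 3 values to unpack
# which is exactly the sort of accident a fixed-width return would avoid.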

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHJ34UJdeBCYSNAAMRAnv0AJ9MUsYE4veMRz09BLPUvircGpGzdQCeKJr6
vl81E8UBdrDPqLOXr7lUtqM=
=Daw0
-----END PGP SIGNATURE-----
