[MERGE] Make push/pull/merge faster.

Tue Oct 30 19:10:26 GMT 2007

On Tue, 2007-10-30 at 13:55 -0500, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Robert Collins wrote:
> > On Tue, 2007-10-30 at 13:15 -0500, John Arbash Meinel wrote:
> > 
> >> You seem to set "supports_ghosts = True" for KnitRepo, but not for PackRepo. It
> >> may be inherited, but setting it explicitly would be good.
> > 
> > Hmm, I was deliberately not setting it for PackRepo.
> 
> Umm... that seems a bit odd. Shouldn't repositories have that attribute if you
> are going to be using it? Heck, even if you set it to None so that it would be
> considered "I'm not making an assertion."
> I realize we will need to use getattr() anyway (so that foreign repository
> implementations don't have to have it), but I would prefer if we didn't have
> that for formats under our control.

I wasn't setting it because I thought it was inherited, but its not, so
I'm on crack.

The compatibility thing is RepositoryFormat.supports_ghosts = None.

> ..
> 
> >>
> >> ^- Why do you need to remove the null revisions? Just so you don't query the
> >> target index for them?
> > 
> > Yes. Searching for 'null:' when it will never be there is wasteful.
> 
> Does it always return 'null:'? Is it more wasteful that having to search and
> remove it for every call. If you have 10,000 calls, and only 1 of them has
> null:, it seems a lot cheaper to leave it in than to have 10,000 checks to see
> if it needs to be removed. But as the remote lookup is a round trip, perhaps it
> is worthwhile.

Exactly. And if we're doing 10K calls then we're probably doing an initial pull so ...

> As "iter_entries()" can return the entries in any order, it seems better to use
> a list than a generator here. Lists are actually generally faster than
> generators, as long as the memory cost is not too high.
> And iter_entries() seems like it would want to use a list so that it could plan
> what entries are optimal to return.
> % TIMEIT -s 'x = range(10000)' 'y = [(z,) for z in x]'
> 100 loops, best of 3: 3.72 msec per loop
> 
> % TIMEIT -s 'x = range(10000)' 'y = list((z,) for z in x)'
> 100 loops, best of 3: 4.56 msec per loop
> 
> Without the tuple creation, (z for z in x) it drops to 1.53ms and 2.65ms. Which
> shows that list comprehensions are genuinely faster than generators. Especially
> if you are just going to put it into a list anyway.

I think this is good to examine in more detail; its definitely a
different thread though.

iter_entries is a loop around parallel bisection though - it doesn't do
any 'planning' at a global scale, and we also have to handle aggregating
indices together - have a look at CombinedGraphIndex.

> >> +                missing_revs.update(next_revs - have_revs)
> >> +                searcher.stop_searching_any(have_revs)
> >> +            return missing_revs
> >>
> >>
> >> The doc string for "iter_entries" has:
> >>         :return: An iterable as per iter_all_entries, but restricted to the
> >>             keys supplied. No additional keys will be returned, and every
> >>             key supplied that is in the index will be returned.
> >>
> >> It isn't perfectly clear that this will skip requested entries that are not
> >> present. Or maybe it does, but it takes a long time to get there.
> >>
> >> :return: Iterate over tuples (like iter_all_entries) for all keys which are
> >> present. Keys not present will be skipped, keys present but not requested will
> >> not be returned.
> > 
> > Hmmm, I'm not sure this is more clear to be honest. And saying that the
> > return value is 'Iterate' is IMO wrong - it is an 'Iterable'.
> > 
> >> It also seems weird to have the returned tuples be 3 or 4 based on what is present:
> >>         :return: An iterable of (index, key, value) or (index, key, value,
> >> reference_lists).
> >>             The former tuple is used when there are no reference lists in the
> >>             index, making the API compatible with simple key:value index types.
> >>             There is no defined order for the result iteration - it will be in
> >>             the most efficient order for the index.
> >>
> >> Since you still have the "index" object being returned, it isn't like you can
> >> do: dict(foo.iter_all_entries). So having it have a fixed return value seems
> >> better. (Yes this is late to realize, but still seems worthwhile.)
> > 
> > I don't understand what you mean. Do you mean 'having a reference-list
> > free return value when reference lists are not being used is weird'?
> > 
> > I discussed this way back, basically this is GraphIndex offering a plain
> > Index interface without requiring a wrapping layer to strip off the 4th
> > field if it is not being used. I agree that its a touch weird, but it
> > didn't seem worthwhile having the class and disk signature change just
> > for the signatures .six file.
> 
> I'm saying that returning a list that sometimes has 4 entries and sometimes has
> 3 entries is a strange API.
> 
> When someone comes in that isn't 100% positive about everything, it will lead
> to accidents (where they thought it was returning the 4 entries, but really
> only returning 3).
> 
> I don't think the disk signature needs to change. I just think it is a crufty
> and harder-than-it-has-to-be to use API.
> If it is for compatibility then I'm okay with it. But generally changing the
> returned signature of a function is a bit ugly.

The index API was in 0.90, and is not changed by this patch. I really
don't want to block this merge by sidetracking onto the index API.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20071031/c31ef7f7/attachment.pgp