[MERGE] Native dirstate update-basis-via-delta
robertc at robertcollins.net
Wed Oct 24 01:45:08 BST 2007
On Tue, 2007-10-23 at 18:49 -0500, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> Robert Collins wrote:
> > This enables the delta-to-basis-from-commit that Martin and I
> > collaborated on over the last fortnight.
> > For my tests, 'bzr commit FOO' drops from 15 seconds to 10 on a mozilla
> > tree. Initial commit and regular incremental commit is also faster.
> > -Rob
> _discard_merge_parents iterates over all entries and just removes everything
> past the first node.
> What if instead it was an iterator itself and would yield all the nodes. It
> just seems a little weird to remove all of the ('a', 'a') entries when you may
> be introducing more. It would also mean that you could do 1 pass instead of
We can't do the change in one pass because of ordering with renames. I
noted that in the delta expansion logic.
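For what it's worth, the ordering constraint can be sketched outside
bzrlib. This is an illustrative toy (the dict, apply_delta, and the
tuple shape are made up for the example, not dirstate's real
structures): when a delta swaps two paths, every delete must be applied
before any add, or a new entry collides with a stale one.

```python
# Illustrative toy, not dirstate code: `tree` maps path -> file id, and
# a delta row is (old_path, new_path, file_id) with None for "absent".
def apply_delta(tree, delta):
    deletes = [(old, fid) for old, new, fid in delta if old is not None]
    adds = [(new, fid) for old, new, fid in delta if new is not None]
    for old, fid in deletes:      # pass 1: clear every old location
        del tree[old]
    for new, fid in adds:         # pass 2: additions cannot now collide
        assert new not in tree, 'path collision at %r' % new
        tree[new] = fid
    return tree

# Swapping two paths: a single interleaved pass would try to add 'b'
# for id-1 while the original 'b' entry was still present.
tree = {'a': 'id-1', 'b': 'id-2'}
apply_delta(tree, [('a', 'b', 'id-1'), ('b', 'a', 'id-2')])
```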
> But from a correctness perspective it looks good.
> You are using "cache_utf8.encode" to convert path names. But they are unlikely
> to actually be repeated. So the caching probably ends up hurting (by bloating
> the cache, and having a key lookup failure for each entry.)
Currently they are repeated (because all the paths have previously been
handed to _get_entry in _sha_from_stat in WorkingTree4). Possibly this
should change along with my other patch I just sent.
> You might want to use:
> encode = cache_utf8._utf8_encode
> path = encode(path)
> + # Because renames must preserve their children we must have
> + # processed all relocations and removes beforehand. The sort
> + self._update_basis_apply_deletes(deletes)
> + deletes = 
> ^- In the inner loop you seem to apply all deletes whenever you see a renamed
> file. This seems a little odd to me, though I won't say it is incorrect.
The comment right above it explains why in its first sentence; if you
could see why it didn't gel for you that would be great.
> + new_deletes = list(reversed(list(self._iter_child_entries(1,
> + encode(old_path)))))
> Does _iter_child_entries return an empty list for a file?
Yes, as per its docstring ;).
> list(reversed(list( seems a little suspect
> Especially since the only thing you do with it is
> for entry in new_deletes
Yah, slipped in when testing.
> So at a minimum
> new_deletes = reversed(list(self._iter_child_entries...)
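To spell out the simplification: reversed() over a list already yields
items back-to-front lazily, so the outer list(...) buys nothing when the
result is only iterated once. A minimal illustration with made-up
entries:

```python
children = [('', 'a', 'id-a'), ('', 'b', 'id-b'), ('', 'c', 'id-c')]

# Redundant: materializes the sequence twice.
wasteful = list(reversed(list(iter(children))))

# Sufficient for a single `for entry in new_deletes` loop: reversed()
# on a list returns a lazy reverse iterator with no extra copy.
lean = reversed(list(iter(children)))

assert wasteful == list(lean)
```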
> + self._update_basis_apply_deletes(deletes)
> + self._update_basis_apply_changes(changes)
> + self._update_basis_apply_adds(adds)
> + # remove all deletes
> + self._dirblock_state = DirState.IN_MEMORY_MODIFIED
> + self._header_state = DirState.IN_MEMORY_MODIFIED
> + return
> ^- This comment is misplaced. It either goes before
> _update_basis_apply_deletes, or it should just be removed.
> Explicitly using:
> absent = ('r', 'a')
> if kind not in absent:
> is probably slower than either
> x in 'ar'
> or the equivalent
> y = set('ar')
> x in y
> Both are around 220 us.
> y = ('a', 'r')
> If x == 'a' then it is faster (209us), if it is 'r' then it is slower (250us),
> if it is not present it is much slower (278us).
> My guess is that "x in 'ar'" is implemented as a set check. And that "x in
> ('a', 'r')" is implemented as a comparison against each entry in the tuple.
I doubt that 'x in "ar"' is a set check; rather, it's probably
special-cased - IIRC there are even CPU-specific instructions that can
perform this. Changing to use strings though.
I wonder whether
'aa' in 'aa ar ra rr'
is better than
'aa' in set(['aa', 'ar', 'ra', 'rr'])
(obviously with the construction outside the loop).
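The variants under discussion can be timed directly; a hedged sketch
(absolute numbers will differ by machine, and the microsecond figures
quoted above are John's, not reproduced here). One subtlety with the
space-separated haystack: `in` on a string is a substring scan, so it
can match strings that were never meant to be keys.

```python
import timeit

setup = "s = 'ar'; t = ('a', 'r'); f = frozenset(('a', 'r')); x = 'r'"
for expr in ("x in s", "x in t", "x in f"):
    # Time each membership spelling over 100k iterations.
    print('%-8s %.4fs' % (expr, timeit.timeit(expr, setup, number=100000)))

# The space-separated spelling works for the intended two-char keys...
assert 'aa' in 'aa ar ra rr'
# ...but substring matching also accepts strings that are not keys:
assert 'a a' in 'aa ar ra rr'
# A real set has no such false positives:
assert 'a a' not in frozenset(['aa', 'ar', 'ra', 'rr'])
```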
> + if kind == 'd':
> + next_pending_dirs.append('/'.join(entry[0:2]))
> I'm pretty sure that
> entry[0] + '/' + entry[1]
> is faster than the '/'.join statement. Because the join has to create a new
> 2-entry list, as well as build up the string.
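Both spellings produce the same path; the difference is purely
allocation. A quick sketch with a hypothetical (dirname, basename) pair
standing in for a real dirstate entry key:

```python
entry = ('subdir', 'child')  # hypothetical (dirname, basename) key

joined = '/'.join(entry[0:2])             # builds a tuple slice, then joins
concatenated = entry[0] + '/' + entry[1]  # two concatenations, no slice

assert joined == concatenated == 'subdir/child'
```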
> + def create_dirstate_with_two_trees(self):
> + """This dirstate contains multiple files and directories.
> ^- You directly create the dirblock entries. This seems more error prone than
> doing it through a Tree interface. I won't say it is wrong, just that small
> errors are a lot harder to detect.
I was using the style other tests here used.
> Overall, I don't see anything obviously wrong :). I would say that I would like
> to see a whole lot more tests.
Any specific area? We have a number of tests of this already -
wt_implementations.get_parents tests just about every permutation I
could think of.
> One problem we have with dirstate, is that if the basis inventory gets skewed
> from reality, it can do nasty things, and we have very limited ability to
> detect it.
> Probably 'bzr check' should assert that "tree.basis_tree()" gives the same
> result at "tree.branch.repository.revision_tree(tree.last_revision())". That
> would at least go a way towards giving security in something like this which is
> mutating over time. (There is no point where we forcibly "sync" and verify that
> everything is 100% correct.)
I think that's a good idea. Also, bzr check should call tree._validate().
Do you want these done as part of this patch?
Also, do you think the _id_index should be adjusted/reset by this? I
think resetting it is ok and sufficient.
GPG key available at: <http://www.robertcollins.net/keys.txt>.