[MERGE] Native dirstate update-basis-via-delta

Wed Oct 24 01:45:08 BST 2007

On Tue, 2007-10-23 at 18:49 -0500, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Robert Collins wrote:
> > This enables the delta-to-basis-from-commit that Martin and I
> > collaborated on over the last fortnight.
> > 
> > For my tests, 'bzr commit FOO' drops from 15 seconds to 10 on a mozilla
> > tree. Initial commit and regular incremental commit is also faster.
> > 
> > -Rob
> > 
> 
> _discard_merge_parents iterates over all entries and just removes everything
> past the first node.
>
> What if instead it was an iterator itself and would yield all the nodes. It
> just seems a little weird to remove all of the ('a', 'a') entries when you may
> be introducing more. It would also mean that you could do 1 pass instead of
> multiple.

We can't do the change in one pass because of ordering with renames. I
noted that in the delta expansion logic.

> But from a correctness perspective it looks good.
> 
> 
> You are using "cache_utf8.encode" to convert path names. But they are unlikely
> to actually be repeated. So the caching probably ends up hurting (by bloating
> the cache, and having a key lookup failure for each entry.)

Currently they are repeated (because all the paths have previously been
handed to _get_entry in _sha_from_stat in working tree 4). Possibly with
my other patch I just sent this should be changed.

> You might want to use:
> 
> encode = cache_utf8._utf8_encode
> 
> path = encode(path)[0]
> 
> +                # Because renames must preserve their children we must have
> +                # processed all reloations and removes before hand. The sort
> 
> relocations

Thanks.

> ...
> +                self._update_basis_apply_deletes(deletes)
> +                deletes = []
> 
> ^- In the inner loop you seem to apply all deletes whenever you see a renamed
> file. This seems a little odd to me, though I won't say it is incorrect.

The comment right above it explains why in it's first sentence; if you
could see why it didn't gel for you that would be great.

> ...
> +                new_deletes = list(reversed(list(self._iter_child_entries(1,
> +                    encode(old_path)))))
> 
> Does _iter_child_entries return an empty list for a file?

Yes, as per its docstring ;).

> list(reversed(list( seems a little suspect
> Especially since the only thing you do with it is
> for entry in new_deletes

Yah, slipped in when testing.

> So at a minimum
> new_deletes = reversed(list(self._iter_child_entries...)
> 
> ...
> +        self._update_basis_apply_deletes(deletes)
> +        self._update_basis_apply_changes(changes)
> +        self._update_basis_apply_adds(adds)
> +
> +        # remove all deletes
> +        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
> +        self._header_state = DirState.IN_MEMORY_MODIFIED
> +        return
> 
> ^- This comment is misplaced. It either goes before
> _update_basis_apply_deletes, or it should just be removed.

Removed.

> 
> Explicitly using:
> 
> absent = ('r', 'a')
> ...
> if kind not in absent:
>   ...
> 
> Is probably slower than either
> 

...
> x in 'ar'
> 
> is equivalent to
> y = set('ar')
> x in y
> 
> Both are around 220 us.
> 
> but
> y = ('a', 'r')
> 
> If x == 'a' then it is faster (209us), if it is 'r' then it is slower (250us),
> if it is not present it is much slower (278us).
> 
> My guess is that "x in 'ar'" is implemented as a set check. And that "x in
> ('a', 'r')" is implemented as a comparison against each entry in the tuple.

I doubt that 'x in "ar"' is a set check, rather its probably special
cased - IIRC there are even CPU specific instructions that can perform
this. Changing to use strings though.
I wonder if for 
'aa' in 'aa ar ra rr'
is better than
'aa' in set('aa', 'ar', 'ra, 'rr')
(obviously with the construction outside the loop).

> ...
> +                    if kind == 'd':
> +                        next_pending_dirs.append('/'.join(entry[0][0:2]))
> 
> I'm pretty sure that
> entry[0][0] + '/' + entry[0][1]
> is faster than the '/'.join statement. Because the join has to create a new
> 2-entry list, as well as build up the string.

Will change.

> ...
> 
> +    def create_dirstate_with_two_trees(self):
> +        """This dirstate contains multiple files and directories.
> 
> ^- You directly create the dirblock entries. This seems more error prone than
> doing it through a Tree interface. I won't say it is wrong, just that small
> errors are a lot harder to detect.

I was using the style other tests here used.

> Overall, I don't see anything obviously wrong :). I would say that I would like
> to see a whole lot more tests.

Any specific area? We have a number of tests of this already -
wt_implementations.get_parents tests just about every permutation I
could think of.

> One problem we have with dirstate, is that if the basis inventory gets skewed
> from reality, it can do nasty things, and we have very limited ability to
> detect it.
> 
> Probably 'bzr check' should assert that "tree.basis_tree()" gives the same
> result at "tree.branch.repository.revision_tree(tree.last_revision())". That
> would at least go a way towards giving security in something like this which is
> mutating over time. (There is no point where we forcibly "sync" and verify that
> everything is 100% correct.)

I think thats a good idea. Also bzr check should call tree._validate().

Do you want these done as part of this patch?

Also, do you think the _id_index should be adjusted/reset by this? I
think resetting it is ok and sufficient.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20071024/64196a74/attachment.pgp