[MERGE] Native dirstate update-basis-via-delta

John Arbash Meinel john at arbash-meinel.com
Wed Oct 24 00:49:30 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> This enables the delta-to-basis-from-commit that Martin and I
> collaborated on over the last fortnight.
> 
> For my tests, 'bzr commit FOO' drops from 15 seconds to 10 on a mozilla
> tree. Initial commit and regular incremental commit is also faster.
> 
> -Rob
> 

_discard_merge_parents iterates over all entries and just removes everything
past the first node.

What if instead it was an iterator itself and would yield all the nodes. It
just seems a little weird to remove all of the ('a', 'a') entries when you may
be introducing more. It would also mean that you could do 1 pass instead of
multiple.

But from a correctness perspective it looks good.


You are using "cache_utf8.encode" to convert path names. But they are unlikely
to actually be repeated. So the caching probably ends up hurting (by bloating
the cache, and having a key lookup failure for each entry.)

You might want to use:

encode = cache_utf8._utf8_encode

path = encode(path)[0]

+                # Because renames must preserve their children we must have
+                # processed all reloations and removes before hand. The sort

relocations

...
+                self._update_basis_apply_deletes(deletes)
+                deletes = []

^- In the inner loop you seem to apply all deletes whenever you see a renamed
file. This seems a little odd to me, though I won't say it is incorrect.

...
+                new_deletes = list(reversed(list(self._iter_child_entries(1,
+                    encode(old_path)))))

Does _iter_child_entries return an empty list for a file?

list(reversed(list( seems a little suspect
Especially since the only thing you do with it is
for entry in new_deletes

So at a minimum
new_deletes = reversed(list(self._iter_child_entries...)

...
+        self._update_basis_apply_deletes(deletes)
+        self._update_basis_apply_changes(changes)
+        self._update_basis_apply_adds(adds)
+
+        # remove all deletes
+        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._header_state = DirState.IN_MEMORY_MODIFIED
+        return

^- This comment is misplaced. It either goes before
_update_basis_apply_deletes, or it should just be removed.


Explicitly using:

absent = ('r', 'a')
...
if kind not in absent:
  ...

Is probably slower than either

absent = 'ar'

or

absent = set('ar')

TIMEIT -s 's = set("ar")' '"f" in s'     0.218 usec
TIMEIT -s 's = set("ar")' '"a" in s'     0.222 usec
TIMEIT -s 's = set("ar")' '"r" in s'     0.217 usec
TIMEIT -s 's = set("ar")' '"r" in "ar"'  0.222 usec
TIMEIT -s 's = set("ar")' '"a" in "ar"'  0.232 usec
TIMEIT -s 's = set("ar")' '"a" in "ar"'  0.22 usec
TIMEIT -s 's = set("ar")' '"f" in "ar"'  0.219 usec
TIMEIT -s 's = ("a", "r")' '"f" in s'    0.278 usec
TIMEIT -s 's = ("a", "r")' '"a" in s'    0.209 usec
TIMEIT -s 's = ("a", "r")' '"r" in s'    0.253 usec

So it seems like:

x in 'ar'

is equivalent to
y = set('ar')
x in y

Both are around 220 us.

but
y = ('a', 'r')

If x == 'a' then it is faster (209us), if it is 'r' then it is slower (250us),
if it is not present it is much slower (278us).

My guess is that "x in 'ar'" is implemented as a set check. And that "x in
('a', 'r')" is implemented as a comparison against each entry in the tuple.

...
+                    if kind == 'd':
+                        next_pending_dirs.append('/'.join(entry[0][0:2]))

I'm pretty sure that
entry[0][0] + '/' + entry[0][1]
is faster than the '/'.join statement. Because the join has to create a new
2-entry list, as well as build up the string.

...

+    def create_dirstate_with_two_trees(self):
+        """This dirstate contains multiple files and directories.

^- You directly create the dirblock entries. This seems more error prone than
doing it through a Tree interface. I won't say it is wrong, just that small
errors are a lot harder to detect.

Overall, I don't see anything obviously wrong :). I would say that I would like
to see a whole lot more tests.

One problem we have with dirstate, is that if the basis inventory gets skewed
from reality, it can do nasty things, and we have very limited ability to
detect it.

Probably 'bzr check' should assert that "tree.basis_tree()" gives the same
result at "tree.branch.repository.revision_tree(tree.last_revision())". That
would at least go a way towards giving security in something like this which is
mutating over time. (There is no point where we forcibly "sync" and verify that
everything is 100% correct.)

BB:tweak

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHHoiJJdeBCYSNAAMRAgc0AJ9x6hav2FyEYUO5pFdAWyzOKA/aqgCeKmXI
kicVlTN1DiHKRAjajr9iuJk=
=Yy3L
-----END PGP SIGNATURE-----



More information about the bazaar mailing list