fast delta generation in brisbane-core - advice/direction needed
John Arbash Meinel
john at arbash-meinel.com
Sat Mar 7 23:41:38 GMT 2009
...
>
> I was kind of hoping that we were storing inventory deltas
> *directly* and generating inventories from them. If we were,
> I'd then be able to hook in at a low layer and do something
> like
>
...
> It doesn't seem like that's going to be possible yet?
> Whatever delta-ing we're doing seems to be at the storage layer
> (as a space(/time?) optimisation) and lost by the time we
> extract the text?
>
> Reading through John's emails, we're certainly making great
> progress in terms of:
>
> * faster text lookup
> * reduced storage size.
>
> But I fear that delta generation will always be slow on large
> trees like OOo if the algorithm remains:
>
> 1. get inventory
> 2. get previous inventory
> 3. calculate changes.
>
> Thoughts?
>
> Ian C.
>
>
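For reference, the three-step loop quoted above comes out roughly like
this against the bzrlib API (a minimal sketch: get_inventory is real,
but I've reduced "calculate changes" to added/removed file ids, and
naive_inventory_delta is just an illustrative name):

    def naive_inventory_delta(repo, rev_id, parent_id):
        # assumes the repository is already read-locked
        new_inv = repo.get_inventory(rev_id)     # 1. get inventory
        old_inv = repo.get_inventory(parent_id)  # 2. get previous inventory
        # 3. calculate changes. This step walks both *full* inventories,
        # so it is O(tree size) no matter how small the commit was.
        old_ids = set(old_inv)  # Inventory iterates over file ids
        new_ids = set(new_inv)
        return new_ids - old_ids, old_ids - new_ids
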
So, I don't know what numbers you were seeing, but I'm seeing a rather
good improvement. Perhaps the bigger issues were the recent bugs that we
fixed with _iter_nodes and the hash layouts?
'time bzr log -v --short -r -10..-1 mysql-525'
For a 1.9 format repo, that was taking 2.0s; for a gc-chk255-bigpage
repository, I'm seeing 0.7-0.8s. And keep in mind that is with a
'bzr rocks' time of 0.35-0.4s.
Changing that to "log -v --no-aliases" took 15s for the chk branch, and
2m15s for the 1.9 format.
Now, I think we can certainly look into the remaining efficiency issues,
like whether chk_map._deserialise is as good as it could be.
But 2m15s => 15s is a pretty big win, IMO. :)
Under lsprof, 50% of the time is spent in deserialise(), but I should
note that lsprof lies :). Specifically, under lsprof the time goes from
15s up to 1m+, and deserialise is a pure python function, so it incurs
more relative overhead than the C code, which skews the results slightly.
I did check, though, and if I just add a trap in _deserialise() to see
how many times a given page is processed, I get:
26071 / 7030 (3.7), max 225
So of 7030 unique chk pages, _deserialise is called 26,071 times. On
average, a given page is deserialised 3.7 times, with a max count of 225.
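The trap itself is nothing fancy, just a counting wrapper around the
module function, something like this (a sketch, and I'm quoting the
(bytes, key, search_key_func) signature from memory):

    from collections import defaultdict
    from bzrlib import chk_map

    _page_counts = defaultdict(int)
    _real_deserialise = chk_map._deserialise

    def _counting_deserialise(bytes, key, search_key_func):
        # count how many times each chk page key gets deserialised
        _page_counts[key] += 1
        return _real_deserialise(bytes, key, search_key_func)

    chk_map._deserialise = _counting_deserialise

    def report():
        # prints "calls / unique (avg), max N" as quoted above
        total = sum(_page_counts.itervalues())
        print "%d / %d (%.1f), max %d" % (total, len(_page_counts),
            total / float(len(_page_counts)), max(_page_counts.itervalues()))
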
Now, that 225 could easily have to do with the
parent_id_basename_to_file_id map, which is very similar across all
revisions. However, it certainly seems inefficient to deserialise the
same page 225 times.
If I split it up by type, I get:
LeafNode 24989 / 5988 (4.2), max 225
InternalNode 1082 / 1042 (1.0), max 2
Which makes me think that Robert was right about the _ensure_root issues
(it doesn't help to cache the root node, because it changes every revision.)
I'm a bit concerned that the numbers are too high for LeafNodes, so we
may need to investigate if iter_changes() is doing a good job of not
descending into nodes it doesn't need to. I know the iter_changes()
code is a bit complex, and I wonder if we could use the hash properties
to simplify it a bit. (The tree is going to be a lot flatter, so we
don't need to try hard to match up sha1's at different depths.)
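The property we'd be exploiting is simple: in a hash trie of fixed,
shallow depth, a child whose sha1 key matches on both sides is an
identical subtree and never needs to be deserialised at all. As a
sketch (the node shape here is made up for illustration, not chk_map's
actual classes):

    def iter_changed_pages(get_node, old_key, new_key):
        """Yield keys of pages reachable from new_key but not old_key."""
        if old_key == new_key:
            return  # same sha1 => identical subtree, prune the walk
        yield new_key
        new = get_node(new_key)
        if new.is_leaf:
            return
        old = get_node(old_key) if old_key is not None else None
        old_children = {}
        if old is not None and not old.is_leaf:
            old_children = dict(old.children)  # prefix -> child sha1 key
        for prefix, child_key in new.children:
            for key in iter_changed_pages(get_node,
                    old_children.get(prefix), child_key):
                yield key
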
John
=:->