fast delta generation in brisbane-core - advice/direction needed

John Arbash Meinel john at arbash-meinel.com
Sat Mar 7 23:41:38 GMT 2009



...

>
> I was kind of hoping that we were storing inventory deltas
> *directly* and generating inventories from them. If we were,
> I'd then be able to hook in at a low layer and do something
> like
>
...

> It doesn't seem like that's going to be possible yet?
> Whatever delta-ing we're doing seems to be at the storage layer
> (as a space(/time?) optimisation) and lost by the time we
> extract the text?
>
> Reading through John's emails, we're certainly making great
> progress in terms of:
>
> * faster text lookup
> * reduced storage size.
>
> But I fear that delta generation will always be slow on large
> trees like OOo if the algorithm remains:
>
> 1. get inventory
> 2. get previous inventory
> 3. calculate changes.
>
> Thoughts?
>
> Ian C.
>
>
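
To make those three steps concrete, the naive approach is roughly the
sketch below (written against bzrlib from memory; the comparison in
step 3 is grossly simplified and ignores deletions):

from bzrlib import branch

# Naive delta generation: materialise two complete inventories and
# compare them.  The cost scales with tree size, not change size,
# which is exactly why it hurts on OOo-sized trees.
b = branch.Branch.open('mysql-525')
b.lock_read()
try:
    repo = b.repository
    rev_id = b.last_revision()
    parent_id = repo.get_revision(rev_id).parent_ids[0]
    new_inv = repo.get_inventory(rev_id)     # 1. get inventory
    old_inv = repo.get_inventory(parent_id)  # 2. get previous inventory
    # 3. calculate changes: compare entries by file_id
    changed = [file_id for file_id in new_inv
               if file_id not in old_inv
               or new_inv[file_id] != old_inv[file_id]]
finally:
    b.unlock()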

So, I don't know what numbers you were seeing, but I'm seeing a rather
good improvement. Perhaps the bigger issues were the recent bugs that we
fixed with _iter_nodes and the hash layouts?

'time bzr log -v --short -r -10..-1 mysql-525'

For a 1.9 format repo, that takes 2.0s; for a gc-chk255-bigpage
repository, I'm seeing 0.7-0.8s. And keep in mind that includes the
'bzr rocks' startup baseline of 0.35-0.4s.

Changing that to "log -v --no-aliases" took 15s on the chk branch,
versus 2m15s for the 1.9 format.

Now, I think we can certainly look into the remaining efficiency issues,
like whether chk_map._deserialise is as good as it could be.

But 2m15s => 15s is a pretty big win, IMO. :)

Under lsprof, 50% of the time is spent in _deserialise(), but I should
note that lsprof lies :). Specifically, under lsprof the runtime climbs
from 15s to over a minute, and _deserialise is a pure-Python function,
so it picks up relatively more profiling overhead than the C code does,
which skews the results slightly.

I did check, though, and if I just add a trap in _deserialise() to see
how many times a given page is processed, I get:

26071 / 7030 (3.7), max 225

So of 7030 unique chk pages, _deserialise is called 26,071 times. On
average, a given page is deserialised 3.7 times, with a max count of 225.
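
(The trap itself is nothing fancy; roughly the following, monkey-patched
in. The _deserialise signature here is from memory, so treat it as a
sketch:)

from collections import defaultdict
from bzrlib import chk_map

_counts = defaultdict(int)
_real_deserialise = chk_map._deserialise

def _counting_deserialise(bytes, key, search_key_func):
    # Tally every call, keyed on the page's key, so we can see how
    # often the same page gets parsed.
    _counts[key] += 1
    return _real_deserialise(bytes, key, search_key_func)

chk_map._deserialise = _counting_deserialise

# After the 'log -v' run:
total, unique = sum(_counts.values()), len(_counts)
print '%d / %d (%.1f), max %d' % (
    total, unique, float(total) / unique, max(_counts.values()))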

Now, that 225 could easily have to do with the
parent_id_basename_to_file_id map, which is very similar across all
revisions. However, it certainly seems inefficient to deserialise the
same page 225 times.

If I split it up by type, I get:
LeafNode     24989 / 5988 (4.2), max 225
InternalNode  1082 / 1042 (1.0), max 2

Which makes me think that Robert was right about the _ensure_root issue
(caching the root node doesn't help, because the root changes with
every revision).

I'm a bit concerned that the numbers are too high for LeafNodes, so we
may need to investigate whether iter_changes() is doing a good job of
not descending into nodes it doesn't need to.  I know the
iter_changes() code is a bit complex, and I wonder if we could use the
hash properties to simplify it a bit. (The tree is going to be a lot
flatter, so we don't need to try as hard to match up sha1s at
different depths.)
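
If we can't avoid revisiting pages entirely, another mitigation would be
to memoise _deserialise() so each unique page is only parsed once. A
minimal sketch, assuming read-only callers (a plain dict here; a real
patch would want bzrlib.lru_cache.LRUCache to bound memory):

from bzrlib import chk_map

_node_cache = {}
_real_deserialise = chk_map._deserialise

def _cached_deserialise(bytes, key, search_key_func):
    # Parse each unique page once; the page we currently parse 225
    # times would be parsed exactly once.  Assumes callers treat the
    # returned node as read-only.
    try:
        return _node_cache[key]
    except KeyError:
        node = _real_deserialise(bytes, key, search_key_func)
        _node_cache[key] = node
        return node

chk_map._deserialise = _cached_deserialise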

John
=:->