dirstate: close to merging, and some performance analysis.

Robert Collins robertc at robertcollins.net
Mon Mar 5 05:07:28 GMT 2007


Hi, 
	Dirstate is -very- close to mergeable now. I'm working on upgrade
code for it, to let people upgrade their trees, and we've got some
pretty darn good performance results. There's *still* a regression on
xml support from 0.14 - and I suspect that whatever it's related to may
affect dirstate trees too - which is good, because we're currently nice
and fast even with it. But as we fall back to some of the same code
paths, fixing it before release is important.

The following is from a tree with 4000 entries in it, on my laptop.
Each command is run twice, to reduce cold-cache effects.

bzr 0.14:
$ time bzr --no-plugins  st xml/
real    0m4.556s
user    0m1.764s
sys     0m0.216s
$ time bzr --no-plugins  st xml/
real    0m1.976s
user    0m1.756s
sys     0m0.128s

dirstate on xml:
$ time bzr --no-plugins st xml/
real    0m3.550s
user    0m3.160s
sys     0m0.196s
$ time bzr --no-plugins st xml/
real    0m3.200s
user    0m2.864s
sys     0m0.204s

dirstate with dirstate:
$ time bzr --no-plugins st dirstate/
real    0m2.265s
user    0m0.732s
sys     0m0.176s
$ time bzr --no-plugins st dirstate/
real    0m0.935s
user    0m0.728s
sys     0m0.128s

Comparing the best user + sys times for 0.14 to dirstate with dirstate
we have:
0.14    : 1.884s
dirstate: 0.856s

or a 54% time reduction: quite nice.

IIRC John has measured startup time at ~200ms, and removing that from
the comparison:
0.14    : 1.684s
dirstate: 0.656s

Gives a 61% reduction.
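The arithmetic is simple enough to check mechanically; a quick Python sketch (the inputs are the best user+sys times from the runs above, and the 200ms startup is John's estimate):

```python
# Time reduction from 0.14 to dirstate, using the best user+sys times
# from the runs above (percentages truncated, as in the text).
def reduction_pct(old, new):
    """Percent of `old` saved when the time drops to `new`."""
    return int((old - new) / old * 100)

best_014 = 1.756 + 0.128       # best user + sys for bzr 0.14
best_dirstate = 0.728 + 0.128  # best user + sys for dirstate
startup = 0.200                # estimated bzr startup time

print(reduction_pct(best_014, best_dirstate))                      # 54
print(reduction_pct(best_014 - startup, best_dirstate - startup))  # 61
```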

Comparing partial-tree operations on the same tree, using a subdir with
3380 paths under it:

0.14:
$ time bzr --no-plugins st xml/subtree
real    0m1.716s
user    0m1.576s
sys     0m0.136s
$ time bzr --no-plugins st xml/subtree
real    0m1.714s
user    0m1.580s
sys     0m0.136s

dirstate xml:
$ time bzr --no-plugins st xml/subtree
real    0m2.690s
user    0m2.472s
sys     0m0.220s
$ time bzr --no-plugins st xml/subtree
real    0m2.692s
user    0m2.500s
sys     0m0.188s

dirstate dirstate:
$ time bzr --no-plugins st dirstate/subtree
real    0m0.771s
user    0m0.672s
sys     0m0.100s
$ time bzr --no-plugins st dirstate/subtree
real    0m0.774s
user    0m0.696s
sys     0m0.076s

Again, comparing best times from 0.14 to dirstate with dirstate:
           including startup     excluding startup
0.14:        1.712s                1.512s
dirstate:    0.772s                0.572s
reduction:   55%                   62%

These figures are in line with the whole tree reduction, which seems
fine.

Lastly, what about status of a single file in the same tree?
0.14:
$ time bzr --no-plugins st xml/subtree/file
real    0m0.959s
user    0m0.848s
sys     0m0.112s
$ time bzr --no-plugins st xml/subtree/file
real    0m0.959s
user    0m0.884s
sys     0m0.072s

dirstate xml:
$ time bzr --no-plugins st xml/subtree/file
real    0m1.046s
user    0m0.944s
sys     0m0.100s
$ time bzr --no-plugins st xml/subtree/file
real    0m1.046s
user    0m0.960s
sys     0m0.084s

dirstate dirstate:
$ time bzr --no-plugins st dirstate/subtree/file
real    0m0.356s
user    0m0.308s
sys     0m0.040s
$ time bzr --no-plugins st dirstate/subtree/file
real    0m0.355s
user    0m0.292s
sys     0m0.060s

Again, comparing best times from 0.14 to dirstate with dirstate:
           including startup     excluding startup
0.14:        0.956s                0.756s
dirstate:    0.348s                0.148s
reduction:   63%                   80%

An interesting question is 'how long does it take to process each
file?'. I don't think I've plotted enough data, or done enough runs,
to say with great confidence, but here are a few interesting figures.

Firstly, let's eliminate the time required to get the data: opening the
tree and loading the dirstate or inventory. For 0.14 on xml that's
0.265s, and for dirstate it's 0.058s (both according to timeit).
$ PYTHONPATH=~/source/baz/dirstate.dogfood/ python -m timeit \
    -s "import bzrlib.workingtree" \
    "tree=bzrlib.workingtree.WorkingTree.open('dirstate')" \
    "tree.lock_read(); tree.current_dirstate()._read_dirblocks_if_needed(); tree.unlock()"
$ python -m timeit \
    -s "import bzrlib.workingtree" \
    "tree=bzrlib.workingtree.WorkingTree.open('xml')" \
    "tree.lock_read(); tree.inventory; tree.unlock()"

These trees have 4153 files (to be precise :).
The startup cost-per-file (assuming neither format has partial-reading
facilities) is then 0.265/4153 and 0.058/4153
0.14    : 0.265/4153 = .0000638s
dirstate: 0.058/4153 = .0000140s
reduction               78%
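As a sanity check, those per-file load costs work out like this (a sketch, using the timeit figures above):

```python
# Amortized state-loading cost per file, over the 4153 files in the tree.
files = 4153
xml_load = 0.265       # seconds to open the tree and read the xml inventory
dirstate_load = 0.058  # seconds to open the tree and read the dirstate

print("%.4f ms" % (xml_load / files * 1000))       # 0.0638 ms
print("%.4f ms" % (dirstate_load / files * 1000))  # 0.0140 ms
print(int((xml_load - dirstate_load) / xml_load * 100))  # 78
```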

Subtracting all these overheads gives us a hopefully raw cost:
           Full(4153)  subtree(3380)   one file(1)
0.14    :  1.419s      1.247s          0.491s
dirstate:  0.598s      0.514s          0.090s

However, it's clearly not the real base cost, because the one-file
figure is still too high a proportion: we're missing some constant
overheads (massive ones in the 0.14 case, apparently :). Profiling is
needed to ascertain what they are. We can, however, approximate the
cost per path by subtracting the one-file total cost and dividing by
the remaining file count:
           Full(4152)      subtree(3379)   one file(0)
0.14    :  0.928s=>0.22ms  0.756s=>0.22ms  0/0 ???
dirstate:  0.508s=>0.12ms  0.424s=>0.13ms  0/0 ???
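That approximation can be sketched directly (figures from the tables above; the 0.2s startup is an assumed constant):

```python
# Approximate per-path processing cost: strip startup, state loading,
# and the one-file run (which absorbs the remaining constant overheads),
# then divide by the files left over.
startup = 0.200  # assumed bzr startup time

def per_path_ms(full_total, one_file_total, load, nfiles):
    full_raw = full_total - startup - load
    one_file_raw = one_file_total - startup - load
    return (full_raw - one_file_raw) / (nfiles - 1) * 1000

# 0.14 on xml: full-tree 1.884s, one-file 0.956s, inventory load 0.265s
print(round(per_path_ms(1.884, 0.956, 0.265, 4153), 2))  # 0.22
# dirstate: full-tree 0.856s, one-file 0.348s, dirstate load 0.058s
print(round(per_path_ms(0.856, 0.348, 0.058, 4153), 2))  # 0.12
```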

I think these figures are fairly representative: once we eliminate bzr
startup (0.2 seconds), tree opening and state reading (0.265s for xml,
0.058s for dirstate), and (unknown startup time + time to do one file)
(half a second for xml!, 0.09s for dirstate), we get a figure of 0.22ms
per path for 0.14 xml trees, and 0.12-0.13ms per path for dirstate
trees. (Note that startups will include is_ignored regex compilation.)

Going back to the totals:
          total   startups   perfile-total   ratio
0.14    : 1.884s  0.956       0.928s          49%
dirstate: 0.856s  0.348       0.508s          59%

So dirstate has saved about 1 second, made up of 600ms from total setup
overheads and 400ms from per-file processing. 59% of the remaining time
dirstate spends on 4000-entry trees is in per-file processing, so that
sounds like a good place to focus on reducing the total time for large
trees: at 4000 entries it is (just) dominating, and for the 55K entries
of mozilla, well, here's a guess:

0.348s - 0.058s + 55000*(0.1224ms + 0.0140ms) = 7.79s

At that scale, about 96% of the time is spent in the part of the
equation that scales per path, and of that, 89% is in the
process-a-single-path overhead, not in the loading of entries from
disk.
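That guess can be reproduced as follows (a sketch; the 0.1224ms figure is the full-tree per-path cost before rounding, and 55,000 entries is an assumed size for mozilla):

```python
# Extrapolate dirstate status time to a ~55,000 entry tree:
# constant overheads (one-file cost minus the load, which scales)
# plus per-path processing and per-path state loading.
entries = 55000
constant = 0.348 - 0.058       # one-file total minus dirstate load (s)
per_path_process = 0.1224e-3   # per-path processing cost (s)
per_path_load = 0.0140e-3      # per-path dirstate loading cost (s)

scaling = entries * (per_path_process + per_path_load)
print(round(constant + scaling, 2))                     # 7.79
print(int(entries * per_path_process / scaling * 100))  # 89
```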

So the next challenge for performance tuning is, IMO, dropping that
~0.13ms per-path cost dramatically: if we can cut the cost per path by
another 50%, we would be sitting really pretty for performance, I
think.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.