[MERGE/CHK] Use the page cache during 'check_remap'

John Arbash Meinel john at arbash-meinel.com
Wed Dec 24 18:00:29 GMT 2008


John Arbash Meinel wrote:
> This update changes the _check_remap code a little bit, so that it can
> avoid unnecessary calls to get_record_stream() during _check_remap.
> 
> It also adds some trivial local functions that allow lsprof to show what
> the common results of remap are. (Whether it actually does need to
> remap, or whether it determines it doesn't need to because of existing
> children, etc.)
> 
> This patch has a pretty big benefit for conversion speed, because we
> have to call check_remap whenever we unmap or an entry shrinks. And
> because of the structure, lots of things are in the page cache that
> aren't unpacked in the local chk_map. (create_by_apply_delta starts from
> an empty chk_map each time.)

To give a specific number here: the time to convert 1k revs of mysql was
3m32s with 255-way fan out, and with this patch it drops to 2m11s. It
seems likely that with 255-way fan out, the check_remap code had a lot
more children it could potentially evaluate, and without the page_cache
being used it would actually page in all 255 children using
get_record_stream, which was already a little slower because of the
delta chains.
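
To make the change concrete, here is a rough sketch of the short-circuit
(hypothetical names, not the actual bzrlib code): before asking the
store for a child's bytes via get_record_stream(), check a module-level
cache of recently seen serialized pages first.

_page_cache = {}  # key -> serialized node bytes, shared across chk_maps

def _deserialise(data, key):
    # Placeholder: the real code would rebuild an internal/leaf node
    # object from its serialized form.
    return (key, data)

def _read_nodes(store, keys):
    """Yield (key, node) pairs, preferring cached bytes over the store."""
    missing = []
    for key in keys:
        data = _page_cache.get(key)
        if data is not None:
            yield key, _deserialise(data, key)  # cache hit, no stream call
        else:
            missing.append(key)
    if missing:
        # Only the uncached keys pay for get_record_stream(), which may
        # have to walk a delta chain to reconstruct each page.
        for record in store.get_record_stream(missing, 'unordered', True):
            data = record.get_bytes_as('fulltext')
            _page_cache[record.key] = data
            yield record.key, _deserialise(data, record.key)

With 255-way fan out a single _check_remap can touch up to 255 children,
so every hit in that cache is one less record to reconstruct.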

Unfortunately, 255-way fan out with simple line-delta compression
doesn't seem to be getting me to the ideal "25% overhead" I was hoping
for.

It looks good for the first 1k revisions (24%), but by 21k revisions it
has gotten a bit bloated (41%). The other information seems reasonable,
though:

$ wbzr repository-details d4-256-delta-mysql/
Commits: 21200
                      Raw    %    Compressed    %  Objects
Revisions:      42737 KiB   0%     16740 KiB   8%    21200
Inventories:   457272 KiB   5%     77110 KiB  41%   256731
Texts:        7431463 KiB  93%     93227 KiB  49%    83243
Signatures:         0 KiB   0%         0 KiB   0%        0
Total:        7931473 KiB 100%    187077 KiB 100%   361174

Extra Info:           count    total  avg stddev  min  max
internal node refs   116941  7568327   64   90.2   11  255
internal p_id refs     4734   526578  111   88.2    2  235
inv depth            106794   309482    2    2.0    1    3
leaf node items      106794   225401    2    3.2    1   14
leaf p_id items        7062    43409    6   10.2    1   58
p_id depth             7062    22736    3    1.4    1    4


The depth is still 2-3, which is expected given the size of the tree.
The number of internal pages is comparable to the number of leaf nodes,
which means that on average each leaf change comes with one internal
node change. I'm guessing that is because:

1) For every commit the root node changes, but when multiple leaves
change, there is still only 1 root node change.
2) You would still expect 1 more internal node change per leaf change,
because of the intermediate layer. However, since we average 5 leaf
changes per commit, it may be that we get more overlap there.

Further, we probably still have some leaf nodes hanging directly off of
the root node.

I also wonder whether the average number of leaf changes is skewed by
large merges: most commits probably change only 1-2 leaves, but a few
change 100s, which brings the average up. 100s of changes also get more
overlap in the internal nodes that need updating.
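
Spelling out the arithmetic behind those guesses, using the counts from
the repository-details output above (a trivial back-of-the-envelope
check, not part of the patch):

commits = 21200
internal_nodes = 116941  # 'internal node refs' count column
leaf_nodes = 106794      # 'leaf node items' count column

print(leaf_nodes / float(commits))         # ~5.0 leaf changes per commit
print(internal_nodes / float(leaf_nodes))  # ~1.09 internal changes per leaf

# And with 255-way fan out, two internal levels already address
# 255 * 255 = 65025 leaf pages, so depth 2-3 is what you'd expect here.

So the table really does show roughly one internal node change per leaf
change.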

I'm going to play with it and see if I can get a "more optimal" delta
ordering, and whether that makes a bigger impact.

John
=:->