brisbane:CHKMap.iteritems() tweaks
John Arbash Meinel
john at arbash-meinel.com
Tue Mar 24 21:43:19 GMT 2009
Ian Clatworthy wrote:
> John Arbash Meinel wrote:
>>> 1) Change 'bzr pack' so that it creates a separate set of groups for
>>> items that are available in the newest XX (say 100) revisions. Or
>>> possibly group everything into 100 rev chunks.
>> This was easy to implement for CHK streams, and it changes "time bzr ls
>> -r-1" from 4.4s down to 1.6s. (I implemented it as splitting at 10
>> revisions.) The time without the patch is 2.2s, up from 1.6s, so the
>> patch makes a bigger difference when we aren't swamped with extract time.
>
> Nice.
>
>> The difference in total size on disk after packing is barely noticeable:
>> 125666
>> 125981
>>
>> I guess that is 300KiB. But out of 125MiB, that is only 0.2%.
>
> Well worth it IMO.
So I've done a bit of playing with this. Specifically, I felt that if
you want to pull out the most recent 100 revs of chk pages, then you
probably also want to pull out the most recent texts.
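To make that concrete, the split I'm experimenting with looks roughly
like this. It's only a sketch: the function and parameter names are made
up (the real change lives in the pack/fetch streaming code), but it shows
the partitioning step.

def split_keys_by_recency(all_keys, recent_rev_ids, keys_for_revision):
    """Partition keys into (recent, old) groups.

    recent_rev_ids: the newest N revision ids (e.g. the last 100).
    keys_for_revision: callable returning the chk page / text keys that
        a given revision needs.
    """
    recent_keys = set()
    for rev_id in recent_rev_ids:
        # Everything reachable from a recent revision goes into its own
        # set of groups.
        recent_keys.update(keys_for_revision(rev_id))
    recent = [key for key in all_keys if key in recent_keys]
    old = [key for key in all_keys if key not in recent_keys]
    return recent, old

The packer then compresses 'recent' and 'old' into separate groups, so
reading the latest tree only has to decompress the recent groups rather
than seeking into the big mixed ones.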
Pulling out the texts has a much larger impact on total size, going from
124MB => 140MB. My best guess is that we are now storing a fulltext for
every file in the working tree. In testing, it didn't matter whether I
set the threshold at 10, 100, or 1000 revs; 1000 revs was actually bigger
than 100 revs.
But I think that fundamentally the issue is that we are creating another
fulltext for the whole tree. (Once on the 'recent' side, and again for
all of the really old versions of those files.)
I tested the "bzr co" time for the two different cases, to see if we see
a big difference.
So in testing, 'bzr ls -r-1' time, it didn't seem to matter whether I
packed at 10 revs or 1000 revs. They both gave me a time of 6-8s without
special sorting, and 2.7s with special sorting.
As for sorting the file texts: with the split at 100 revs I see:
  time bzr co lp-100             10.9s
  time bzr co lp-100-no-texts    12.9s
So we save ~2s of text extraction time. (This is with my fix to
TT.create_file(string) to use f.write() rather than f.writelines().)
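For anyone curious, that fix is basically the following pattern. This is
a simplified sketch, not the actual bzrlib.transform code; create_file()
here is just a stand-in that takes either a byte string or an iterable of
chunks.

def create_file(path, contents):
    # f.writelines() on a plain string iterates it one character at a
    # time, issuing a tiny write() per byte, which is very slow for big
    # files.
    with open(path, 'wb') as f:
        if isinstance(contents, bytes):
            f.write(contents)       # one big write
        else:
            f.writelines(contents)  # already chunked, this is fine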
I'm not as convinced that this part is worthwhile yet. Considering that we
spend 4.7s in 'get_build_details', taking get_bytes_as() from 2.5s to 1.0s
doesn't really seem worth the 10% increase in repository size.
I guess I can say "maybe", but it isn't as clear-cut as the benefit of
changing the chk pages.
Thoughts?
John
=:->