brisbane:CHKMap.iteritems() tweaks

Wed Mar 25 02:33:56 GMT 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> So in short - the recent content we need is not at the front of the
> groups?
> 
> -Rob

Correct. If you consider the chk layout

   root
   / | \
  A  B  C

Over revisions 1-6

And then those get packed into a group as:

R1 R2 R3 R4 R5 R6 # texts at a different level are in their own group
A1 A2 A3 A4 A5 A6 B1 B2 B3 B4 B5 B6 C1 C2 C3 C4 C5

If that is all one group, then after reading A1 we still have to
decompress() all of A2-A6 to get the content of B1.
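
As a rough illustration of that cost (a toy model, not bzr code): assume
a group is one sequential compressed stream, so reading a text means
decompressing everything stored ahead of it. The helper name
texts_to_decompress is hypothetical.

```python
def texts_to_decompress(group, wanted):
    """Return the texts that must be decompressed to read `wanted`.

    Assumption: a group is a single sequential stream, so every text
    stored ahead of `wanted` has to be decompressed first.
    """
    return group[:group.index(wanted) + 1]


# The single-group layout from the example: A1..A6 B1..B6 C1..C5
single_group = [
    f"{page}{rev}"
    for page, last in (("A", 6), ("B", 6), ("C", 5))
    for rev in range(1, last + 1)
]

prefix = texts_to_decompress(single_group, "B1")
print(prefix)  # ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'B1']
```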

The 'recent' patch that I put together changes this to

R1 R2
A1 A2 B1 B2 C1 C2
R3 R4 R5 R6
A3 A4 A5 A6 B3 B4 B5 B6 C3 C4 C5

So, 2 groups.
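
A minimal sketch of that split, again as a toy model rather than the
actual patch: split_recent is a hypothetical helper, and treating
revisions 1 and 2 as the 'recent' ones is simply read off the layout
above.

```python
def split_recent(pages, recent_revs):
    """Split page texts into a 'recent' group and an 'older' group.

    `pages` maps a page name to the revisions it has texts for.
    Within each group, texts stay ordered by page, then by revision.
    """
    recent, older = [], []
    for page, revs in pages.items():
        for rev in revs:
            target = recent if rev in recent_revs else older
            target.append(f"{page}{rev}")
    return recent, older


pages = {"A": range(1, 7), "B": range(1, 7), "C": range(1, 6)}
recent, older = split_recent(pages, recent_revs={1, 2})

print(" ".join(recent))  # A1 A2 B1 B2 C1 C2
print(" ".join(older))   # A3 A4 A5 A6 B3 B4 B5 B6 C3 C4 C5

# Under the same sequential-stream assumption, reaching B1 in the
# recent group only costs A1 and A2.
print(recent[:recent.index("B1") + 1])  # ['A1', 'A2', 'B1']
```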

I did propose changing the chk grouping to be pure

R1 R2 R3 R4 R5 R6
A1 A2 A3 A4 A5 A6
B1 B2 B3 B4 B5 B6
C1 C2 C3 C4 C5
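
For comparison, the pure layout can be sketched the same way
(group_by_page is a hypothetical name, and this says nothing about how
the extra streams would actually be created):

```python
def group_by_page(pages):
    """Put every page's texts in a group of their own."""
    return {
        page: [f"{page}{rev}" for rev in revs]
        for page, revs in pages.items()
    }


pages = {"A": range(1, 7), "B": range(1, 7), "C": range(1, 6)}
groups = group_by_page(pages)

# B1 sits at the front of its own group, so no A texts are in the way.
print(groups["B"].index("B1"))  # 0
```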

But to do that easily, I would need to create more streams. Also, when I
was trying to tweak this, the initial results showed things being worse
for file texts. So it may need some other work to teach the compress
code what it is compressing, so that it knows when it is dealing with
file texts, which would have a different expected behavior than chk
texts, etc.

File texts are sometimes 'similar but not quite exact', while chk texts
are pretty much always either very similar or very different. I guess I
could see getting a small amount of 'revision_id' correspondence between
unmatched pages, but all the sha1 sums and file_ids would be different.

I'll poke at it a bit. But anyway, the same thing holds true for the
file texts, because it was one of the ways that we got better
compression. (I'm guessing it is cross-file compression of stuff like
copyright headers.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknJmBQACgkQJdeBCYSNAAMC0ACgpaZTVHQZ0j5a2Uvjg1rgfTZj
+WsAoMMTmd2ad54O40wFWnsiIOJN95kW
=LZXa
-----END PGP SIGNATURE-----