[MERGE][RFC] Add simple revision serializer based on RIO.

Tue May 12 17:47:25 BST 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Alexander Belchenko wrote:
> John Szakmeister wrote:
>> On Tue, May 12, 2009 at 10:07 AM, John Arbash Meinel
>> <john at arbash-meinel.com> wrote:
>> [snip]
>>> If you get 'corrupted' data somehow, it is much easier to see what
>>> should be there with something like RIO, rather than bencode. If only
>>> because of the natural line-breaks in RIO.
>>
>> FWIW, I've repaired well over 400 broken Subversion revisions by
>> hand... and having the FSFS format be so human-readable has been
>> tremendously beneficial.  Note that the FSFS format isn't entirely
>> human-readable: the actual deltas are stored as binary.  Despite that
>> fact, it's still been very useful.
> 
> This is an interesting point, because inside bzr there are places that
> do not always follow this paradigm.
> 
> Look at `pack-names` control file.
> 
> In the pack-0.92 format this file is a plain text file and human readable
> (and I guess human writable). So (in theory) a user can restore the
> repository if there are some obsolete_packs around.
> 
> In the 1.9 format this file contains some binary data (and not even
> bencode!) so I don't see a way to repair it if needed.
> 
> Why?

Actually, it is just stored zlib compressed on 4096 byte pages. You can
use 'bzr dump-btree --raw' if you want to see the decompressed bytes.
And in this form it *is* rather obvious.
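As a rough illustration of why zlib-compressed pages are still recoverable, here is a minimal sketch in Python. It is not bzr's btree index code: the page layout (one zlib stream padded with zero bytes), the helper names make_page and read_page, and the sample lines are all assumptions made up for the example. Only the 4096 byte page size comes from the description above.

```python
import zlib

PAGE_SIZE = 4096  # page size mentioned above; everything else here is a toy


def make_page(lines):
    """Compress text lines into one fixed-size page (hypothetical layout)."""
    payload = zlib.compress("\n".join(lines).encode("utf-8"))
    if len(payload) > PAGE_SIZE:
        raise ValueError("lines do not fit in a single page")
    # Pad with zero bytes so every page has the same on-disk size.
    return payload.ljust(PAGE_SIZE, b"\x00")


def read_page(page):
    """Decompress one page; the zero padding after the stream is ignored."""
    decompressor = zlib.decompressobj()
    text = decompressor.decompress(page)
    return text.decode("utf-8").split("\n")


if __name__ == "__main__":
    # Made-up entries, only to show that the text comes back line by line.
    page = make_page(["pack-aaaa 1 2 3", "pack-bbbb 4 5 6"])
    for line in read_page(page):
        print(line)
```

Once decompressed, the content is line-oriented text again, which is the point being made about 'bzr dump-btree --raw'.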

The reason we have this format is that we use the same 'index' format
for 'pack-names' that we use for all of the .bzr/repository/indices/* files.

I think we could have used the same format for 'pack-names', as it isn't
particularly performance critical. However, getting the indexes
compressed, with an O(log N) layout, was very important.

Notice that my first comment was (paraphrased) 'if the formats are
roughly equivalent for processing'. The new btree format is much more
efficient for remote processing. At a minimum it averages 1/2 the
on-disk size. Better still, it allows searching with 100-way fan out,
rather than the 2-way fan out of bisection. So on average you have
log100(N) round trips rather than log2(N). Yes, both are 'logarithmic',
but in this case it is about 6.6x better. As for why we went with zlib
compression... if it gives us another 2x reduction in the data we have
to read, that is another 2x win.
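The round-trip arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name round_trips and the example key count of one million are assumptions for illustration, and real lookups also depend on caching and page contents.

```python
import math


def round_trips(n_keys, fan_out):
    """Smallest number of levels such that fan_out ** levels >= n_keys."""
    levels, reach = 0, 1
    while reach < n_keys:
        reach *= fan_out
        levels += 1
    return levels


if __name__ == "__main__":
    n = 1_000_000  # hypothetical index size
    print("100-way fan out:", round_trips(n, 100), "round trips")
    print("2-way bisection:", round_trips(n, 2), "round trips")
    # The ratio log2(N) / log100(N) is log2(100) for any N.
    print("ratio: %.2f" % math.log2(100))
```

The ratio does not depend on N, since log2(N) / log100(N) = log2(100), which is where the 6.6x figure comes from.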

> 
> Revision objects are stored in pack files that again uses very complex
> binary format, not human-readable. Perhaps I'm missing something important?

'revision object stored in pack files'? I think the primary complexity
is the fact that things are zlib compressed, so you can't 'just see' the
data.

The new groupcompress format does use a binary encoding of copies and
inserts. It does so because it turned out to be significantly more
efficient at:

1) size
2) processing speed
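To make 'copies and inserts' concrete, here is a toy sketch of applying such a delta in Python. It is not the groupcompress wire format: the real encoding is binary, while this example uses readable tuples, and the name apply_delta and the sample texts are made up for illustration.

```python
def apply_delta(source, instructions):
    """Rebuild a text from a source text plus copy/insert instructions.

    Each instruction is either ("copy", offset, length), which reuses bytes
    from the source, or ("insert", data), which adds new bytes verbatim.
    """
    out = bytearray()
    for op in instructions:
        if op[0] == "copy":
            _, offset, length = op
            out += source[offset:offset + length]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError("unknown instruction: %r" % (op[0],))
    return bytes(out)


if __name__ == "__main__":
    source = b"committer: John\nmessage: fix bug\n"
    delta = [
        ("copy", 0, 16),                       # reuse the committer line
        ("insert", b"message: add feature\n"),  # new content
    ]
    print(apply_delta(source, delta).decode("utf-8"))
```

A binary form of the same instructions avoids parsing text for offsets and lengths, which fits the size and speed wins listed above.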

So again, *if* things are equivalent, use the one that is friendlier to
humans. If there can be big wins by going for a binary format, do it.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkoJqB0ACgkQJdeBCYSNAAPeBACfayJpWFAs8US0Cw7cl5MTHMLE
vRAAnibtRc7vTEc6t91zx+GyeOH38ogq
=fuSy
-----END PGP SIGNATURE-----