Bazaar-NG vs. Mercurial -- speed comparison

Thu May 18 21:52:09 BST 2006

Bryan O'Sullivan wrote:
> On 5/18/06, Jan Hudec <bulb at ucw.cz> wrote:
> 
>> No, it's not a plain http. It's a mercurial protocol over http and
>> requires
>> mercurial server.
> 
> No, you can serve a plain repository over HTTP (i.e. just the files in
> .hg) without a CGI server. It's just quite slow (i.e. much slower than
> using the CGI), so we don't push it as a feature.

And I think this is a very valid statement. I think it would be nice for
bzr to also support a more advanced protocol (and this is indeed on our
TODO list), but be able to fall back to plain http. In the meantime, I
think we want to make plain http support as fast as possible, since this
will likely also speed up ftp and sftp support, and it means people don't
need to do anything on the server end.

>> What I don't know is how knits and revlogs compare in number of blocks
>> in the
>> scatter/gather read request.
> 
> My observation was that knit files and indices seem to be bigger than
> our files (i.e. .bzr is almost 2x the size of .hg when storing the
> same data), so I don't know how they compare on individual accesses,
> but more data on disk presumably translates to more reading at some
> point.
> 

Well, there have been a few specific design differences.

1) revlog index files are fixed-size binary records
   knit index files are ASCII text, with records delimited by ':\n'

   We chose this because:
   a) You can open up a .kndx file in a text editor, which is good while
      debugging
   b) Our revision ids are not a fixed size like revlog's. So while we
      could pick a size that should hold everything, that isn't
      guaranteed

2) knit files are chunks compressed with gzip rather than zlib (which I
think is what revlog uses)

  a) You pay about 10% for this; in return you can do
	zcat foo.knit | vim -
     and read the raw data. This is good both for debugging and
     because, if there were ever a problem, common tools would still
     give you access to the original data
  b) We annotate each line with the complete revision id, whereas
     revlog doesn't annotate at all. At one point we annotated with a
     dictionary-compressed integer, but the full revision ids were only
     10% bigger after gzip compression, and keeping them means you
     don't have to modify the annotations when you merge them into a
     different branch (so you don't have to uncompress them at all).
  c) mercurial uses a binary delta algorithm. I assume this means
     that it stores deltas that are smaller than one line. So if I do:

      some long sentence with a small typeo
      =>
      some long sentence with a small typo

      mercurial can just store the 'remove "e"' while bzr will store a
      line delta, which requires the whole line.

  d) Our inventory is still XML, and we store all attributes on a single
     line, which means that any change to a file causes all of the
     attributes for that file to be saved again. This only affects
     inventory.knit, but it has a pretty large effect on it. (I did some
     testing on changing the inventory format, and found you could get
     rather large savings if you were careful about the format and how
     it interacted with the delta algorithm.)
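
The line-versus-sub-line difference in (c) can be sketched in a few
lines of Python. This is only an illustration built on the standard
difflib module; it is not the delta code of either bzr or mercurial, and
the helper names line_delta_size and char_delta_ops are made up for the
example. The character-level diff merely stands in for a binary delta.

```python
import difflib

old = "some long sentence with a small typeo\n"
new = "some long sentence with a small typo\n"


def line_delta_size(old_lines, new_lines):
    """Count the characters of new text a line-based delta has to store.

    Any line that differs at all is stored in full.
    """
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    return sum(
        len("".join(new_lines[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag in ("replace", "insert")
    )


def char_delta_ops(old_text, new_text):
    """List character-level edits (a stand-in for a sub-line binary delta)."""
    matcher = difflib.SequenceMatcher(None, old_text, new_text, autojunk=False)
    return [
        (tag, old_text[i1:i2], new_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


line_cost = line_delta_size([old], [new])  # the whole replacement line
char_ops = char_delta_ops(old, new)        # just the removed character

print("line delta stores", line_cost, "characters of new text")
print("character-level edits:", char_ops)
```

A real delta format also has to record where each change applies, so the
saving in practice is smaller than the raw character counts suggest.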

Anyway, we should be aware of what the differences are, so we can decide
which ones we want to keep, and what we can get rid of. Just to say, I
wouldn't be surprised if bzr's knits take 50% more space, but I would be
surprised if it was >2x.
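
One way to get a feel for the gzip-versus-zlib cost from 2a is to
compress the same chunk both ways and compare. The sketch below uses
only Python's standard gzip and zlib modules; the sample data and the
helper name framing_overhead are invented for the example, and it does
not read real knit or revlog files. The gzip container (RFC 1952) has a
10-byte header and 8-byte trailer, against 2 and 4 bytes for zlib
(RFC 1950), so the gap is a fixed per-chunk cost that weighs most on
small chunks.

```python
import gzip
import zlib


def framing_overhead(chunk, level=9):
    """Return (zlib_size, gzip_size) for one chunk at the same level."""
    zlib_size = len(zlib.compress(chunk, level))
    gzip_size = len(gzip.compress(chunk, compresslevel=level))
    return zlib_size, gzip_size


# Hypothetical sample chunks: one short text, one longer repetitive text.
small = b"some long sentence with a small typo\n"
large = small * 200

for name, chunk in (("small", small), ("large", large)):
    z, g = framing_overhead(chunk)
    print(f"{name}: zlib={z} gzip={g} extra={(g - z) / z:.1%}")

# Both containers must round-trip to the original data.
assert zlib.decompress(zlib.compress(small)) == small
assert gzip.decompress(gzip.compress(small)) == small
```

Running this against chunk sizes typical of a real repository would
show whether the 10% figure holds there.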

John
=:->
