[RFC] Blob format

John Arbash Meinel john at arbash-meinel.com
Wed Jun 28 19:36:34 BST 2006


Johan Rydberg wrote:
> John Arbash Meinel <john at arbash-meinel.com> writes:
> 
>>> Absolutely.  We would be a lot better off if we had fixed-size file
>>> and revision identifiers, but it is a bit too late to change that.  So
>>> we probably need a format with back-pointers to earlier entries in the
>>> index.
>> That is possible, but realize it is much better to read 10k all at once,
>> rather than reading 200 bytes, seeking back 200 bytes, reading another
>> 200 bytes, etc.
>>
>> So while backpointers are good, just knowing that the entry is earlier
>> than the current record means you can simply jump back and read
>> another 10k.
> 
> You can of course read as much as you like; the backpointers are
> useful to identify where data starts as well.
> 
>> I don't know where the specific tradeoff would come, but for local disk
>> access the granularity is at the very least a disk block, and more
>> likely an entire stripe. So you are talking 4k -> 100+k.
>> hg revlog files actually contain the data inside the index when the
>> whole file is < 128k. So it is potentially a single disk read to get
>> everything, rather than jumping around a lot.
> 
> I have not studied mercurial's revlogs in a while: Do they rewrite the
> whole file if it grows above the threshold, or does it keep to the
> append-only policy but put pointers to the data file for the new data?

The index type is stored at the beginning of the file. Once it grows
past a certain size it changes type, so they rewrite it. (The data goes
into a .d file, and the index stays in the .i file).

They do append-only until the file grows big enough to split.

They also only work locally (even over ssh, they are just controlling a
remote hg which is operating locally).
So their constraints are a little different.


> 
>> For remote access to an HTTP file, it depends on latency versus
>> bandwidth, but my bandwidth is 160KB/s, and my latency to bazaar-vcs.org
>> bounces between 100ms and 400ms.
>> So at 100ms, I can download 16K, and at 400ms I can download 65K.
>> So unless your back pointer is pointing further back than 16Kbytes, it
>> is better to just leave it out, and read a bigger chunk.
>> I think I have a moderate bandwidth system. I think we could definitely
>> tune round-trip algorithms for around 20-30KB per read and do quite well.
> 
> I have about ~1MB/s with a ~35ms RTT, so yes, your bandwidth is
> moderate.
> 

You have a pretty nice setup. Though you might just live closer to London.
My fastest network in the US was a 3Mbit or 300KB/s on a cable modem
(though they seem to be offering 6Mbit now). DSL is 1.5Mbit or 3Mbit
(but they don't offer 3Mbit in my area). I can generally stream data at
160KB/s. So I'm maxing out the 1.5Mbit.
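
Just to make the break-even arithmetic above concrete, here it is as a
trivial snippet (plain Python, nothing bzr-specific):

def break_even_bytes(bandwidth, rtt):
    # Bytes that arrive "for free" during one round trip. If a
    # backpointer lands closer than this, reading a bigger chunk is
    # cheaper than paying for a second request.
    return bandwidth * rtt

print break_even_bytes(160 * 1024, 0.1)  # 100ms RTT -> ~16K
print break_even_bytes(160 * 1024, 0.4)  # 400ms RTT -> ~65K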

> Networks are a beast of their own, so probably the best way to do this
> would be to hack up some small test implementations and do a set of
> tests in different environments.

There were discussions about creating a test server with configurable
bandwidth/latency and maybe even packet loss. I think loss is hard to
simulate, but bandwidth and latency wouldn't be too bad.

It would be nice to have for our benchmark tests.
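
The shaping part should be easy to fake. A rough sketch of the kind of
relay I mean (not tied to any of our test infrastructure, and the names
are mine):

import time

def throttled_relay(src, dst, bandwidth, latency):
    # Crude traffic shaping: delay each chunk by `latency` seconds and
    # never relay faster than `bandwidth` bytes/sec. Applying latency
    # per chunk is a rough model, but good enough for benchmarks.
    # Real packet loss would have to happen below TCP, which is why
    # loss is the hard part.
    while True:
        data = src.recv(4096)
        if not data:
            break
        time.sleep(latency)
        time.sleep(len(data) / float(bandwidth))
        dst.sendall(data)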

> 
>>> HTTP byte ranges support negative offsets.  Not sure our readv()
>>> interface does though.
>> Unfortunately, it doesn't. But I think we should make it support them.
>> file.seek() can do negative offsets, and I believe sftp can as well.
>> (I'm not positive on that one, though.)
> 
> Maybe Robey could answer this.

Looking at the paramiko code we have:

def seek(self, offset, whence=0):
    self.flush()
    if whence == self.SEEK_SET:
        self._realpos = self._pos = offset
    elif whence == self.SEEK_CUR:
        self._pos += offset
        self._realpos = self._pos
    else:
        # SEEK_END: needs the file size, which costs a stat() round trip
        self._realpos = self._pos = self._get_size() + offset
    self._rbuffer = ''

Where _get_size() is defined as a stat() that returns the st_size
attribute. And stat() is a round trip to the server (and the result
doesn't seem to be cached).

So we can certainly supply negative offsets, but it doesn't gain us much.
He has said in the past that all reads are 'pread()' style, so a seek is
a local-only operation. That is why I guessed it didn't really support
proper reading from the end of the file.

The _read() function does:

self.sftp._request(CMD_READ, self.handle,
                   long(self._realpos), int(size))

So it is requesting a start + length.

Digging deeper, it doesn't look like a negative offset would work, so
I'm guessing the SFTP protocol doesn't support it.

But our SFTPTransport could know about the inefficiency, cache the
st_size result, and transform a readv() with a negative offset into a
positive one.
The real trick is that it should only cache the size for as long as a
lock is held, and knowing about locking at the Transport level would be
an API violation.
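
Something like the following is what I have in mind; the names here are
hypothetical, not current bzrlib API:

def _absolute_offsets(offsets, get_size):
    # Translate negative (from-EOF) offsets into positive ones, paying
    # for the stat() round trip at most once per readv() call.
    size = None
    for start, length in offsets:
        if start < 0:
            if size is None:
                size = get_size()
            start += size
        yield start, length

The locking question then only shows up in whether get_size() is
allowed to remember its answer between calls.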

> 
>> And since we want to support reading just the end of index files,
>> readv() should be changed. You have my blessing to do so :)
> 
> I'll take a look at it.
> 
> ~j
> 
> 
> 

By the way, I did a more accurate test of just file sizes, and these are
my results:

Threshold    files < threshold    files > threshold
10K              993                  246
100K            1204                   35
So switching to a mixed index format would save us a separate data file
in the vast majority of cases.

And for an hg kernel tree, there are 21,229 *.i files, but only 30 *.d
files.

All of their .d files are >120K, so I'm assuming that their cutoff is 128K.
Looking at the .i files, they end up with 21211 files that are <100K,
and that leaves only 18 index files > 100K.
And only 2 of them are > 128K, so I'm assuming the big ones just haven't
split yet.

They read the first 4 bytes to determine the file type, then if the
index file is <10K, or the data is inlined, they just read the whole thing.

And I can see their point. Though I would argue that the first read
should just be 10K, since no matter what format the file is, they end
up reading at least that much anyway. (I guess if it is >10K they would
preferentially read from the end of the file.)
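
Roughly like this (the 4-byte big-endian version field is my guess at
their layout, and the names are made up):

import struct

CHUNK_SIZE = 10 * 1024

def read_index_start(f):
    # One read covers both the type field and, in the common case,
    # the entire index.
    data = f.read(CHUNK_SIZE)
    (version,) = struct.unpack('>I', data[:4])
    complete = len(data) < CHUNK_SIZE  # whole file already in hand
    return version, data, complete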


Anyway, I think it is worth looking into.

John
=:->