[MERGE] Container format basic implementation

Aaron Bentley aaron.bentley at utoronto.ca
Fri Jun 15 17:55:19 BST 2007


Andrew Bennetts wrote:
> Yeah, we definitely want to support seeking over undesired records, if the
> source supports that — files can seek, but sockets can't.  The best we can do
> with sockets is minimise processing of bytes we don't care about.

Full ACK.

> Of course,
> with containers streamed over sockets we shouldn't normally want to seek, as we
> shouldn't be sending uninteresting records if we can help it.

Well, it would be nice with bundles.  Sometimes I merge a remote bundle
multiple times, e.g. Ian's kcachegrind stuff.  After the first merge,
all I really need from that is the revision id.

I believe SFTP supports true seeking.  But seeking in bundles is
problematic, because they're compressed and base64-encoded.

Perhaps we can special-case bundles to just abort early if we already
have the target revision in our repository.

But if we use this for local repositories, seek could definitely be a win.
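
Something along these lines is what I have in mind (a rough sketch with
made-up names, not the existing container code): skip an unwanted record
body with a relative seek when the source gives us a seekable file, and
fall back to reading and discarding chunks when it doesn't, e.g. sockets:

    def skip_record_body(source, length, chunk_size=65536):
        """Advance 'source' past 'length' unwanted bytes as cheaply as possible."""
        # 'source' is any file-like object; this helper is illustrative only.
        seek = getattr(source, 'seek', None)
        if seek is not None:
            try:
                seek(length, 1)  # relative seek straight past the body
                return
            except (IOError, OSError):
                pass  # an unseekable stream pretending to be a file
        remaining = length
        while remaining > 0:
            data = source.read(min(chunk_size, remaining))
            if not data:
                raise ValueError('container truncated while skipping a record')
            remaining -= len(data)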

>> I've discovered that for processing a 56K bundle, 95% of its time is
>> spent in BaseReader._readline.  It is called 946 times, resulting in
>> 29,914 calls to BaseReader.reader_func.
> 
> That's interesting, although not surprising, as I haven't yet tried to profile
> or optimise this code.  It's probably time I wrote my first addition to our
> benchmark suite.  I do want this layer to be as fast as possible.

I should mention IterableFile is participating in this slowdown.
Replacing IterableFile with a cStringIO makes the whole operation much
faster.  It seems to be more an issue of calling IterableFile.read
repeatedly than of the cost of retrieving that quantity of data,
because the record bodies barely factor into the cost at all: they
account for about 316 calls and 0.8% of total time, a difference of two
orders of magnitude.
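
A throwaway comparison along these lines (assuming IterableFile is
constructed from an iterable of strings, as in bzrlib.iterablefile) is
enough to show the per-call overhead of many small reads:

    import time
    from cStringIO import StringIO
    from bzrlib.iterablefile import IterableFile

    data = 'x' * (56 * 1024)

    def drain(f, step=64):
        # mimic the many small reads that _readline ends up doing
        while f.read(step):
            pass

    for name, factory in [('cStringIO', lambda: StringIO(data)),
                          ('IterableFile', lambda: IterableFile([data]))]:
        start = time.time()
        for i in range(100):
            drain(factory())
        print name, time.time() - start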

>> So I recommend fixing BaseReader._read_line, preferably by passing in a
>> file-like object.
> 
> My tentative plan is to change it to be a push-based API, like
> bzrlib.smart.protocol, rather than the current pull-based one.

That sounds like a good approach.  When you have to use something like
IterableFile, it suggests you've got your producer/consumer
relationships confused.

> I'm not sure what the best way for the reader to inform the source that it
> should seek is in this proposed scheme.  Perhaps just a return value from
> accept_bytes...

Perhaps two return values?
skip_len, read_len = BundleReader.accept_bytes(bytes)

read_len would be a hint about how many bytes to send next.
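
To make that concrete, here's a rough sketch of the shape I'm imagining
(the names are made up and the parsing is stubbed out; this isn't the
real bzrlib.smart.protocol interface):

    class PushBundleReader(object):
        """Hypothetical push-based reader; real record parsing omitted."""

        def __init__(self):
            self._buffer = ''

        def accept_bytes(self, bytes):
            """Consume 'bytes' and return (skip_len, read_len).

            skip_len: bytes the source may seek over (or read and discard)
                before the next interesting record.
            read_len: a hint about how many bytes to send next.
            """
            self._buffer += bytes
            # Real header/body parsing would go here; this stub skips
            # nothing and just asks for another block.
            return 0, 4096

    # The driving loop a seekable source might use:
    #     skip_len, read_len = 0, 4096
    #     while True:
    #         if skip_len:
    #             source.seek(skip_len, 1)   # read-and-discard for sockets
    #         data = source.read(read_len)
    #         if not data:
    #             break
    #         skip_len, read_len = reader.accept_bytes(data)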

> Thank you very much for this feedback, it's very helpful!

Glad to hear it.

Aaron


