[MERGE] Container format basic implementation
Aaron Bentley
aaron.bentley at utoronto.ca
Fri Jun 15 17:55:19 BST 2007
Andrew Bennetts wrote:
> Yeah, we definitely want to support seeking over undesired records, if the
> source supports that — files can seek, but sockets can't. The best we can do
> with sockets is minimise processing of bytes we don't care about.
Full ACK.
> Of course,
> with containers streamed over sockets we shouldn't normally want to seek, as we
> shouldn't be sending uninteresting records if we can help it.
Well, it would be nice with bundles. Sometimes I merge a remote bundle
multiple times, e.g. Ian's kcachegrind stuff. After the first merge,
all I really need from the bundle is the revision id.
I believe SFTP supports true seeking. But seeking in bundles is
problematic because they're compressed and base64-encoded.
Perhaps we can special-case bundles to just abort early if we already
have the target revision in our repository.
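
To make that early-abort idea concrete, here's a rough sketch of the
control flow I have in mind. None of the names below are real bzrlib
API; they're placeholders for "read only enough of the bundle to learn
the revision id before touching the compressed body":

    # Hypothetical sketch, not bzrlib code.
    def read_bundle_revision_id(bundle_file):
        # Pretend the target revision id can be recovered from the
        # leading records, without decoding the whole bundle.
        return bundle_file.readline().strip()

    def merge_bundle_if_needed(bundle_file, repository, install_bundle):
        revision_id = read_bundle_revision_id(bundle_file)
        if repository.has_revision(revision_id):
            # Already present locally: skip the rest of the bundle.
            return revision_id
        return install_bundle(repository, bundle_file)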
But if we use this for local repositories, seek could definitely be a win.
>> I've discovered that for processing a 56K bundle, 95% of its time is
>> spent in BaseReader._readline. It is called 946 times, resulting in
>> 29,914 calls to BaseReader.reader_func.
>
> That's interesting, although not surprising, as I haven't yet tried to profile
> or optimise this code. It's probably time I wrote my first addition to our
> benchmark suite. I do want this layer to be as fast as possible.
I should mention that IterableFile is participating in this slowdown.
Replacing IterableFile with a cStringIO makes the whole operation much
faster. The cost seems to come from calling IterableFile.read
repeatedly rather than from the quantity of data retrieved, because the
record bodies hardly factor into the costs at all: they account for
about 316 calls and 0.8% of total time, a difference of two orders of
magnitude.
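
For what it's worth, the overhead is easy to reproduce outside bzrlib.
Here's a rough micro-benchmark (SimpleIterableFile below is a stand-in
I wrote for illustration, not bzrlib's IterableFile) comparing many
small read() calls on a generator-backed wrapper against the same reads
on a cStringIO:

    import timeit
    from cStringIO import StringIO

    class SimpleIterableFile(object):
        # Stand-in for IterableFile: buffers chunks from an iterable
        # and serves read(size) requests out of that buffer.
        def __init__(self, iterable):
            self._iter = iter(iterable)
            self._buffer = ''
        def read(self, size):
            while len(self._buffer) < size:
                try:
                    self._buffer += self._iter.next()
                except StopIteration:
                    break
            result, self._buffer = self._buffer[:size], self._buffer[size:]
            return result

    chunks = ['x' * 1024] * 56   # roughly a 56K bundle in 1K pieces

    def read_all(f):
        # Many small reads, like a reader pulling record headers.
        while f.read(64):
            pass

    print timeit.timeit(lambda: read_all(SimpleIterableFile(chunks)), number=100)
    print timeit.timeit(lambda: read_all(StringIO(''.join(chunks))), number=100)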
>> So I recommend fixing BaseReader._read_line, preferably by passing in a
>> file-like object.
>
> My tentative plan is to change it to be a push-based API, like
> bzrlib.smart.protocol, rather than the current pull-based one.
That sounds like a good approach. When you have to use something like
IterableFile, it suggests you've got your producer/consumer
relationships confused.
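
For reference, the shape I'd expect a push-based reader to take is
something like the sketch below. This isn't the bzrlib.smart.protocol
code, and the framing is deliberately toy (newline-delimited) rather
than the real container record format:

    class PushContainerReader(object):
        # Sketch of a push-based reader: the producer calls accept_bytes
        # whenever data arrives, and complete records are emitted through
        # a callback instead of being pulled from a file-like object.
        def __init__(self, record_callback):
            self._buffer = ''
            self._record_callback = record_callback

        def accept_bytes(self, bytes):
            self._buffer += bytes
            # Toy framing: one newline-terminated "record" per callback.
            # A real reader would parse the container's record headers.
            while '\n' in self._buffer:
                record, self._buffer = self._buffer.split('\n', 1)
                self._record_callback(record)

The nice property is that the same reader works whether the bytes come
from a file read loop or a socket's data-received handler, which is
exactly the inversion that makes wrappers like IterableFile unnecessary.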
> I'm not sure what the best way for the reader to inform the source that it
> should seek is in this proposed scheme. Perhaps just a return value from
> accept_bytes...
Perhaps two?
skip_len, read_len = BundleReader.accept_bytes(bytes)
read_len would be a hint about how many bytes to send next.
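
On the driving side that might look something like this (the
accept_bytes contract here is just the proposal above, not existing
API, and the loop assumes a seekable source):

    def drive_reader(source_file, reader, default_chunk=4096):
        # Sketch of a producer loop honouring (skip_len, read_len) hints
        # from the proposed accept_bytes.  Hypothetical, for illustration.
        skip_len, read_len = 0, default_chunk
        while True:
            if skip_len:
                # A seekable source can jump past an uninteresting record;
                # a socket would have to read and discard skip_len bytes.
                source_file.seek(skip_len, 1)
            data = source_file.read(read_len or default_chunk)
            if not data:
                break
            skip_len, read_len = reader.accept_bytes(data)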
> Thank you very much for this feedback, it's very helpful!
Glad to hear it.
Aaron