[MERGE] Transport support for pack repositories
John Arbash Meinel
john at arbash-meinel.com
Wed Aug 8 15:05:01 BST 2007
Robert Collins wrote:
> This adds two transport features supporting pack repositories:
> * New method
> ``bzrlib.transport.Transport.get_recommended_page_size``.
> This provides a hint to users of transports as to the reasonable
> minimum data to read. In principle this can take latency and
> bandwidth into account on a per-connection basis, but for now it
> just has hard coded values based on the url. (e.g. http:// has a
> large page size, file:// has a small one.) (Robert Collins)
>
> * New methods on ``bzrlib.transport.Transport`` ``open_file_stream``
> and ``close_file_stream`` allow incremental addition of data to a
> file without requiring that all the data be buffered in memory.
> (Robert Collins)
>
> With the open_file_stream support doing a pull from rev 0 to 2000 of
> bzr.dev from a knit repo to a pack repo locally drops from 46 to 36
> seconds.
>
> I'm not 100% sure that the SFTP implementation of open_file_stream
> is /correct/ - input on that would be good. However, even if it's
> wrong I think that we probably want a local 64K or so buffer in
> there, unless write() is non-blocking on SFTP file objects.
>
> -Rob
>
I had started writing this to you several days ago, before my Mac
decided to start randomly segfaulting programs again. (It does that
sometimes... :( )
Anyway, I wanted to comment that paramiko already buffers SFTP file
writes. The default buffer size is 8192 bytes, but you can change it
with the 'bufsize' parameter to the open() call.
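For reference, here is roughly how that looks through paramiko directly
(the host, credentials, and path are made up for illustration;
SFTPClient.open() takes the buffer size as its third argument):

    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect('sftp.example.com', username='user')
    sftp = client.open_sftp()

    # paramiko buffers writes internally; bufsize defaults to 8192 bytes.
    f = sftp.open('/tmp/some-file', 'wb', 32768)
    f.write('some data')   # may sit in paramiko's buffer until flush/close
    f.close()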
This actually makes me worry about synchronization, since paramiko will
be doing its own buffering (and I don't think it makes any effort to
ensure that a read returns bytes which are still queued up to be
written).
That said, I don't quite see why we would need to guarantee
synchronization before a flush()/close() happens. So I would *prefer*
that we not guarantee synchronization between readv() and the stream
unless we actually need to. (Maybe you were thinking that we would
stream data from somewhere, save it locally, but not "activate" it
until we have reprocessed it; if so, perhaps we could provide a
'.flush()' for just that situation.)
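Concretely, something like this is the usage I have in mind
(open_file_stream/close_file_stream are from your patch; the
flush_file_stream() call and the file name are hypothetical, just to
show where the explicit sync point would sit):

    from bzrlib.transport import get_transport

    t = get_transport('file:///tmp/demo')
    write = t.open_file_stream('foo.pack')
    write('some bytes')               # may still sit in a buffer somewhere
    t.flush_file_stream('foo.pack')   # hypothetical: force bytes to be visible
    # Only after the explicit flush would readv() be guaranteed to see them.
    for offset, data in t.readv('foo.pack', [(0, 10)]):
        print offset, repr(data)
    t.close_file_stream('foo.pack')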
I also felt it was a little odd to return a writable function rather
than a file-like object, though I understood why you did it. What
worried me more was that you were sticking different kinds of objects
into the _file_streams dict (sometimes a file object, sometimes a
callable, etc.).
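A trivial wrapper would let every transport hand back the same kind of
object regardless of what it keeps internally (this class is just my
sketch, not code from your patch):

    class FileStream(object):
        # Sketch: a uniform, file-like handle over an open file stream.

        def __init__(self, transport, relpath, write_func):
            self._transport = transport
            self._relpath = relpath
            self._write_func = write_func

        def write(self, bytes):
            self._write_func(bytes)

        def close(self):
            # Assumes close_file_stream() takes the relative path.
            self._transport.close_file_stream(self._relpath)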
It would probably be good to have a way to detect that _file_streams
isn't empty at shutdown, so we can complain about file handles that
were never closed.
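For example (a self-contained sketch; the dict name follows your patch,
the warning itself is mine):

    import atexit
    import warnings

    _file_streams = {}

    def _warn_unclosed_streams():
        # Complain at interpreter shutdown about streams never closed.
        if _file_streams:
            warnings.warn('unclosed file streams: %r'
                          % (sorted(_file_streams.keys()),))

    atexit.register(_warn_unclosed_streams)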
In general, I'd rather avoid global dicts unless we have real cause.
Oh, and what happens if we call open_file_stream() for the same URL 2
times? It is bad practice, but that doesn't mean it won't ever happen.
If you really want the dict, it might be good to add
_save_file_stream() to Transport, so that all implementations can
share it (as sketched below).
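Something of roughly this shape (a sketch only; abspath() is the real
Transport method, and the duplicate-open check is my addition):

    _file_streams = {}

    def _save_file_stream(transport, relpath, stream):
        # Shared helper so every Transport implementation records streams
        # the same way, and the double-open case is handled in one place.
        key = transport.abspath(relpath)
        if key in _file_streams:
            raise AssertionError('file stream already open: %s' % key)
        _file_streams[key] = stream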
John
=:->