[MERGE] Transport support for pack repositories
Robert Collins
robertc at robertcollins.net
Wed Aug 8 22:41:51 BST 2007
On Wed, 2007-08-08 at 09:05 -0500, John Arbash Meinel wrote:
>
> Anyway, I wanted to comment that paramiko auto-buffers SFTP file
> writes. The default buffer size is 8192 bytes, but you can set that
> with the 'buffer' parameter to the open() command.
That's interesting. The buffer doesn't seem to show up during tests; or
perhaps it's just that the latency is so small that writes in the test
suite are effectively synced. Robey - can you comment?
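
For reference, a minimal sketch of the buffering John describes - all
the connection details and paths below are placeholders, and in
paramiko's SFTPClient.open() the argument is spelled 'bufsize':

    import paramiko

    # Placeholder host/credentials, not from this thread.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect('sftp.example.com', username='user')
    sftp = client.open_sftp()

    # Writes smaller than bufsize accumulate in a local buffer and
    # only go out on flush()/close().
    f = sftp.open('upload/example.pack', 'wb', bufsize=8192)
    f.write(b'x' * 100)   # buffered; may not hit the wire yet
    f.flush()             # pushes the buffered bytes out
    f.close()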
> This actually makes me worry about synchronization, since paramiko
> will be doing buffering (and I don't think it tries to make sure
> reading the bytes that are in the queue to be written are returned).
>
> I don't quite understand why we need to ensure synchronization before
> a flush()/close() is done. So I would *prefer* if we could not
> guarantee synchronization between readv() and stream() unless we need
> to. (Maybe you were thinking that we would stream data from somewhere,
> save it locally but not "activate" it until we have reprocessed it,
> but maybe we could provide a '.flush()' for just that situation).
There are several things here that I wanted to balance.
* We use the knit read-after-write-before-commit facility today. I
didn't want to have to 'fix' that as part of getting reasonable
performance out of packs *at all*.
* flush() to ensure that the read-after-write reads the correct data is
in fact bad for performance; what we should rather do is know what data
has been written, and what hasn't, and satisfy readv/get from local data
as and when that's possible. flush's semantics are 'let another process
see this data'; what we need is 'let reads of this data see it'. So we
could have a sync() call with our precise semantics, and I'd be happy
enough to refactor it to do that, but it looked (and looks) like
something we can deterministically detect: the transport layer can
always tell (for the way we do this - same url prefix, as it's the same
repository object) when synchronisation of some form is needed, and
could in fact do that by having its own cache for the file stream and
satisfying readv/get from a combination of external storage and the
local write cache (a sketch of this follows below).
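
To make the cache idea concrete, here is a rough sketch - the class,
its methods and the backing interface are all hypothetical, not the
actual bzrlib transport code:

    class CachingWriteTransport(object):
        """Serve reads of files that are still open for write from the
        local write buffer, so read-after-write within one process
        needs no flush() and no round trip."""

        def __init__(self, backing):
            # 'backing' is assumed to offer append_bytes() and readv()
            # in the style of the transport interface.
            self._backing = backing
            # relpath -> bytes written so far on the open stream.
            self._write_cache = {}

        def write(self, relpath, data):
            self._write_cache[relpath] = (
                self._write_cache.get(relpath, b'') + data)
            self._backing.append_bytes(relpath, data)

        def readv(self, relpath, offsets):
            """Yield (offset, bytes) pairs, preferring the cache."""
            local = self._write_cache.get(relpath)
            for offset, length in offsets:
                if local is not None and offset + length <= len(local):
                    yield offset, local[offset:offset + length]
                else:
                    # Not locally cached: ask the real storage.
                    for hit in self._backing.readv(
                            relpath, [(offset, length)]):
                        yield hit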
>
> I also felt it was a little odd to return a writable function rather
> than a file-like object. But I understood why you did it.
>
> What worried me more was that you were sticking different kinds of
> objects into the _file_streams dict (sometimes a file, sometimes a
> callable, etc).
Well, it seems to me that the consumer of a particular slot is
deterministic, so there's no need for the data to be homogeneous.
> It would probably be good to have a way to detect that _file_streams
> wasn't empty on shutdown, so we can complain about not closing your
> file handles.
That gets into __del__ or exit hooks; I'm really not a fan of these,
because of the various vagaries of python and memory management. More
importantly here, we are using these streams to allow incremental pack
writing *to an upload directory*. The 'write place' :) to do such a
check would actually be the rename() call, so that you don't rename an
open stream.
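
As a purely illustrative sketch (hypothetical names, not the merged
code), the rename-time check could look like:

    class StreamTrackingTransport(object):
        def __init__(self):
            # relpath -> open write stream object.
            self._file_streams = {}

        def rename(self, rel_from, rel_to):
            # Refuse to rename a path whose write stream is still
            # open, instead of relying on __del__ or atexit hooks.
            if rel_from in self._file_streams:
                raise AssertionError(
                    'cannot rename %r: unclosed write stream'
                    % (rel_from,))
            self._do_rename(rel_from, rel_to)  # hypothetical helper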
> In general, I don't prefer global dicts unless we have real cause. Oh,
> and what happens if we call "open_stream()" for the same URL 2 times?
> It is bad practice, but that doesn't mean it won't ever happen. If you
> really want it, it might be good to add _save_file_stream() to
> Transport, so that all implementations can share it.
Ugh. It should raise an error, I think.
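
Extending the hypothetical sketch above, the open path could guard
against a second open of the same relpath (method and helper names are
again made up, not the merged API):

    def open_write_stream(self, relpath):
        # On StreamTrackingTransport: refuse a second open for the
        # same path rather than silently clobbering the first stream.
        if relpath in self._file_streams:
            raise AssertionError(
                '%r already has an open write stream' % (relpath,))
        stream = self._open_for_write(relpath)  # hypothetical helper
        self._file_streams[relpath] = stream
        return stream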
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.