[MERGE] Transport support for pack repositories

John Arbash Meinel john at arbash-meinel.com
Wed Aug 8 23:54:33 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Wed, 2007-08-08 at 09:05 -0500, John Arbash Meinel wrote:
>>
>> Anyway, I wanted to comment that paramiko auto-buffers SFTP file
>> writes.
>> The default buffer size is 8192 bytes, but you can set that with the
>> 'buffer' parameter to the open() command.
> 
> That's interesting. The buffer doesn't seem to show up during tests; or
> perhaps it's just that the latency is so small that writes in the test
> suite are effectively synced. Robey - can you comment?

Well, I think you also need "pipelined = True". (file.set_pipelined(True)?)
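To make the buffering concrete: with paramiko the relevant calls are
sftp.open(path, 'wb', bufsize) and fh.set_pipelined(True), but those need a
live SFTP server, so here is a pure-Python analogue using a local buffered
file. It shows the same visibility behaviour: bytes written through a
buffered handle are invisible to other readers until flush().

```python
# Pure-Python analogue of paramiko's SFTP write buffering: writes smaller
# than the buffer stay client-side until flush(), so a concurrent reader
# does not see them.  (8192 matches paramiko's default buffer size.)
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.bin")
writer = open(path, "wb", buffering=8192)
writer.write(b"x" * 100)                 # well under the buffer size

before_flush = open(path, "rb").read()   # another "reader": sees nothing yet
writer.flush()
after_flush = open(path, "rb").read()    # now the bytes are visible
writer.close()
```

This is exactly the synchronization worry above: a readv() issued against the
server between the write() and the flush() would miss the buffered bytes.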

> 
>> This actually makes me worry about synchronization, since paramiko
>> will be doing buffering (and I don't think it tries to ensure that
>> reads return bytes that are still queued to be written).
>>
>> I don't quite understand why we need to ensure synchronization before
>> a
>> flush()/close() is done. So I would *prefer* if we could not guarantee
>> synchronization between readv() and stream() unless we need to. (Maybe
>> you were thinking that we would stream data from somewhere, save it
>> locally but not "activate" it until we have reprocessed it, but maybe
>> we
>> could provide a '.flush()' for just that situation).
> 
> There are several things here that I wanted to balance.
>  * We use the knit read-after-write-before-commit facility today. I
> didn't want to have to 'fix' that as part of getting reasonable
> performance out of packs *at all*.
>  * flush() to ensure that the read-after-write reads the correct data is
> in fact bad for performance; what we should rather do is know what data
> has been written, and what hasn't, and satisfy readv/get from local data
> as and when it's possible. flush's semantics are 'let another process see
> this data', we need 'let reads on this data see it'. So we could have a
> sync() call with our precise semantics, and I'd be happy enough to
> refactor it to do that, but it looked (and looks) like something we can
> deterministically detect: the transport layer can always tell (for the
> way we do this - same url prefix, as its the same repository object)
> when synchronisation of some form is needed, and could in fact do that
> by having its own cache with the file stream and satisfying readv/get
> from a combination of external storage and the local write cache.
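The scheme described above can be sketched as follows. This is illustrative,
not bzrlib's real Transport API: the class, method names, and the
dict-backed "server" are all invented for the example, and it assumes
append-only streams, which is the pack-writing case.

```python
# Sketch of the idea: the transport keeps its own cache alongside the open
# write stream, so readv() is satisfied from a combination of external
# storage and the local (unflushed) write cache -- no flush() needed for
# read-after-write.

class CachingWriteTransport(object):
    def __init__(self, backing):
        self._backing = backing      # name -> bytes already on the server
        self._write_cache = {}       # name -> bytearray of unflushed data

    def open_write_stream(self, name):
        buf = bytearray()
        self._write_cache[name] = buf
        def write(data):             # the 'writable function' style
            buf.extend(data)
        return write

    def readv(self, name, offsets):
        # Unify flushed and unflushed bytes; the caller never notices
        # which side of the buffer its requested range fell on.
        data = (self._backing.get(name, b"") +
                bytes(self._write_cache.get(name, b"")))
        for start, length in offsets:
            yield start, data[start:start + length]

t = CachingWriteTransport({"a.pack": b"header"})
write = t.open_write_stream("a.pack")
write(b"-body")                      # buffered locally, not yet "uploaded"
chunks = list(t.readv("a.pack", [(0, 6), (6, 5)]))
```

Because the same transport object serves both the write stream and the
readv, it can always tell when a range overlaps unflushed data, which is the
deterministic-detection point made above.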
> 
>> I also felt it was a little odd to return a writable function rather
>> than a file-like object. But I understood why you did it.
>>
>> What worried me more was that you were sticking different kinds of
>> objects into the _file_streams dict (sometimes a file, sometimes a
>> callable, etc).
> 
> Well, it seems to me that the user of a particular slot is
> deterministic, so there's no need for it to be regular data.
> 
>> It would probably be good to have a way to detect that _file_streams
>> wasn't empty on shutdown, so we can complain about not closing your
>> file
>> handles.
> 
> That gets into __del__ or exit hooks; I'm really not a fan of these,
> because of the various vagaries of python and memory management. More
> importantly here, we are using these streams to allow incremental pack
> writing *to an upload directory*. The 'write place' :) to do such a
> check would actually be the rename() call, so that you don't rename an
> open stream.

Well, isn't that being done at close time? Or are you saying that you
would stream up to a temporary file, and then rename that over, and
*that* rename should check that the file is closed? (I don't think you
have a way to check that with your current api.)

Again, it still has an issue that if you never get to that point, you
still don't know that you forgot to close.

I agree that __del__ isn't a great solution, though we use it for our
file locks, etc. I think __del__ is only a problem when you have a
circular reference, and I'm not sure that is possible with how we are
using this class. But I do agree we should generally avoid __del__.

> 
>> In general, I'd rather avoid global dicts unless we have real cause. Oh,
>> and what happens if we call "open_stream()" for the same URL 2 times?
>> It
>> is bad practice, but that doesn't mean it won't ever happen. If you
>> really want it, it might be good to add _save_file_stream() to
>> Transport, so that all implementations can share it.
> 
> Ugh. It should error I think.
> 
> -Rob
> 

I'm happy enough with that. As long as it doesn't just override the one
you already have.
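A minimal sketch of the erroring behaviour agreed on above, i.e. a second
open for the same URL raising instead of silently replacing the stream
already registered. The class, method, and attribute names here are
illustrative stand-ins, not bzrlib's actual Transport implementation:

```python
# Opening a stream twice for one URL is a programming error; raise rather
# than overwrite the entry already tracked in _file_streams.

class Transport(object):
    def __init__(self):
        self._file_streams = {}       # url -> open stream

    def open_write_stream(self, url):
        if url in self._file_streams:
            raise AssertionError("stream already open for %r" % url)
        stream = bytearray()          # stand-in for a real file stream
        self._file_streams[url] = stream
        return stream

    def close_write_stream(self, url):
        # forgetting this call is the 'unclosed handle' case discussed above
        del self._file_streams[url]

t = Transport()
t.open_write_stream("a.pack")
try:
    t.open_write_stream("a.pack")     # second open: must fail
    duplicate_error = False
except AssertionError:
    duplicate_error = True
t.close_write_stream("a.pack")
```

Keeping the dict per-transport-instance rather than module-global also
addresses the "global dicts" concern, since each repository's transport
tracks only its own streams.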

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGukmpJdeBCYSNAAMRAuR9AJ9+U1jJZld1vxfePaig/nVMxlNNGACgwQws
r/nU/oDW4rCHQsScMkEIYaM=
=SWeR
-----END PGP SIGNATURE-----


