Transport w/ delta / offset

John A Meinel john at arbash-meinel.com
Thu Jul 21 01:30:45 BST 2005


Aaron Bentley wrote:
> John A Meinel wrote:
>
>>>Aaron Bentley wrote:
>
>
>>>Well, you could have a SmartTransport that coupled with a SmartStorage,
>>>such that it could fulfill the standard get()/put() requests, but would
>>>also have advanced knowledge. There would be tighter coupling between
>>>Storage & Transport in that case, but it is probably okay.
>
>
> Okay, but why is there value in combining high-level operations with
> low-level ones?
>
> Transports operate in terms of files and directories.  Smart servers
> operate in terms of revisions, inventories, trees, lines of ancestry,
> etc.  There's not a whole lot of overlap, so I don't see an advantage in
> combining the two.
>

You could pretty easily implement a smart server in terms of a smart
storage and transport.
You could even talk to the server over http, which would mean you could
potentially leave the transport unchanged. It may not be worth adding
the complication.

>
>>>I'm thinking you could have:
>>>bzr://some/sort/of/path
>>>Which would instantiate the SmartTransport, which could even connect to
>>>the smart server, and start asking for files. In Branch.__init__()
>>>(right now in _check_format()) it would realize that it should
>>>instantiate SmartStorage instances.
>>>
>>>Would it be okay to have Storage be able to yield a diff?
>
>
> Right now, Stores have a clean and simple API that's easy to
> reimplement.  If it were just one function, I'd probably not care.  But
> I think it's actually a lot of functions, if you go down that path.
>
>

Sort of true. But at the same time if you want to utilize the
effectiveness of weave or revfile storage, you need to expose that
functionality.

What was your idea for enabling weave merge? I don't think you want to
unpack all of the weave files into full revisions. Perhaps storage then
needs a "get_weave(file_id, revision_ids)" method.

Yes, it complicates the Storage interface, but you have to do something
if you want to save it in a certain format, and then retrieve it.
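
As a rough sketch of what such a method could look like: the class below
is a toy, not bzrlib code. It assumes a weave can be modelled as a list of
(line, revisions-containing-that-line) pairs, which is a simplification
of the real format, and ToyWeaveStore and add_weave are made-up names.

```python
class ToyWeaveStore:
    """Hypothetical store that keeps one toy 'weave' per file_id."""

    def __init__(self):
        # file_id -> list of (line, frozenset of revision ids containing it)
        self._weaves = {}

    def add_weave(self, file_id, weave):
        self._weaves[file_id] = [(line, frozenset(revs)) for line, revs in weave]

    def get(self, file_id, revision_id):
        """The plain Storage operation: rebuild one full text."""
        return [line for line, revs in self._weaves[file_id]
                if revision_id in revs]

    def get_weave(self, file_id, revision_ids):
        """Return only the part of the weave relevant to revision_ids,
        without unpacking any full texts."""
        wanted = frozenset(revision_ids)
        result = []
        for line, revs in self._weaves[file_id]:
            present = revs & wanted
            if present:
                result.append((line, present))
        return result
```

A store that cannot do this cheaply could simply not define get_weave,
and callers would fall back to get().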

>
>>>To me, there are 2 aspects. First, with a smart server, when you say
>>>"give me this file" you might have the previous file and want to only
>>>get the diff, and re-create it on your end. That can be internal to a
>>>SmartStorage, and SmartTransport (basically smart transport has a
>>>get_diff, which SmartStorage knows how to use, but it isn't generally
>>>exported at the Transport level).
>>>
>>>The second aspect, is that frequently a Branch wants a diff, and certain
>>>storage formats are going to optimize for that. For instance, depending
>>>on the specific request, revfiles might store exactly the diff that you
>>>want, no need to re-create it.
>
>
> I suppose that if Revfiles made diffs easily accessible, there would be
> a case for this.  But bear in mind that revfiles can only produce a few
> diffs for a text (diff from an arbitrary parent, diff to any
> descendant), so their value may be limited.

You are missing an obvious application, though. If I am
branching/pulling a huge series of revisions, it would be nice to pull
them all as diffs rather than pulling the full text for each one.

Now, we do have the copy_multi interface, which can determine who
"other" is, to make the copying faster. Except it makes all the Storages
need to know the specifics about each other, rather than giving them a
generic interface that they can work through.
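
A minimal sketch of that kind of generic interface, under some
assumptions: DictStore, get_delta, make_delta, apply_delta and
copy_series are all hypothetical names, and the delta format (a list of
line-range replacements built with difflib) is just one possible choice.
The point is only that the copying code needs nothing beyond get(), put()
and an optional get_delta().

```python
import difflib


def make_delta(old_lines, new_lines):
    """Delta = list of (start, end, replacement_lines) against old_lines."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    return [(i1, i2, new_lines[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_delta(old_lines, delta):
    result = list(old_lines)
    # Apply from the end so that earlier offsets stay valid.
    for start, end, replacement in reversed(delta):
        result[start:end] = replacement
    return result


class DictStore:
    """Hypothetical in-memory store; a real one would sit on a Transport."""

    def __init__(self):
        self.texts = {}

    def get(self, text_id):
        return self.texts[text_id]

    def put(self, text_id, lines):
        self.texts[text_id] = list(lines)

    def get_delta(self, base_id, text_id):
        return make_delta(self.texts[base_id], self.texts[text_id])


def copy_series(source, target, ids):
    """Copy ids in order; return how many full texts had to be sent."""
    full_texts = 0
    previous = None
    for text_id in ids:
        if previous is not None and hasattr(source, "get_delta"):
            delta = source.get_delta(previous, text_id)
            target.put(text_id, apply_delta(target.get(previous), delta))
        else:
            target.put(text_id, source.get(text_id))
            full_texts += 1
        previous = text_id
    return full_texts
```

Here only the first text crosses over in full. Neither store has to know
what kind of store the other one is.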

>
> So here's another way you can do it:
>
> SmartBranch can have the methods store_get(self, namespace, id),
> store_put(self, namespace, id, stream) and text_diff(self, from_id, to_id)
>
> SmartStore.__getitem__ can be implemented in terms of
> SmartBranch.store_get and SmartBranch.store_put.

Why would you have Store use a Branch level interface? Isn't that
backwards? Doesn't branch use store?

>
> e.g. SmartBranch.__init__ would include:
>
>         self.inventory_store = SmartStore(self, "inventory")
>
> That's even simpler than the other Stores.

It seems like you are trading adding a little bit of complexity to the
Storage layer, for inverting the hierarchy.

>
> SmartBranch.get_text_diff could be invoked if the user didn't specify
> any special diff options.  But it would fail if it did not have one
> of the texts available.  And if the user did supply special options, it
> would be bad to run 'diff' remotely, so the interface doesn't permit
> special options.
>
> So you'd do something like
>
> def diff_texts(b, from_id, to_id, external_diff_options=None):
>     if external_diff_options:
>         # Special options were given, so always diff locally.
>         diff = external_diff(...)
>     elif hasattr(b, "text_diff"):
>         try:
>             diff = b.text_diff(from_id, to_id)
>         except IDNotPresent:
>             diff = None
>     else:
>         diff = None
>     if diff is None:
>         diff = internal_diff(...)
>     return diff
>
> A SmartBranch.revision_tree could produce a ChangesetTree.
> SmartBranch.append_revision could operate server-side.  It could iterate
> through all a revision's ancestors on the server side.
>
>
>>>That's why I was wondering where the weave stuff was going. Are the
>>>higher level operations going to *require* a weave? So that if you are
>>>using the "CompressedTextStore" doing a merge requires it to rebuild the
>>>weave?
>
>
> annotate and weave merge will probably require weave data.  But for
> weave merge, you only need to go back as far as the most recent common
> ancestor(s), so it's not that bad.  bzrtools provides an annotate that
> works with CompressedTextStores, but its performance over a network
> connection would be hideous.
>

There are ways to optimize annotate access, even with
CompressedTextStores, perhaps something like Tom's 'revision changed
bits'. (I don't think his idea is extremely well fleshed out, but it
might turn into something.)

>
>>>My thought was to have get_partial() take a list of files and ranges,
>>>something like:
>>>
>>>def get_partial(self, ranges):
>
>
> Yes, that could work nicely.
>

Any preference on 'multiple ranges per file' versus being able to
request the same file multiple times?
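
To show the two shapes side by side, here is a small sketch.
MemoryTransport and group_ranges are hypothetical, and the
(path, offset, length) tuple layout is an assumption; only the
get_partial(self, ranges) signature comes from the discussion above.

```python
from collections import OrderedDict


class MemoryTransport:
    """Hypothetical transport backed by a dict of path -> bytes."""

    def __init__(self, files):
        self._files = dict(files)

    def get_partial(self, ranges):
        """'Same file multiple times' shape.

        ranges is an iterable of (path, offset, length); a path may repeat.
        Chunks are yielded in the order they were requested.
        """
        for path, offset, length in ranges:
            yield self._files[path][offset:offset + length]


def group_ranges(ranges):
    """Convert to the 'multiple ranges per file' shape:
    [(path, [(offset, length), ...]), ...], keeping first-seen path order."""
    grouped = OrderedDict()
    for path, offset, length in ranges:
        grouped.setdefault(path, []).append((offset, length))
    return list(grouped.items())
```

The flat form is easy to convert into the grouped form, so a transport
that prefers one request per file can group internally. The flat form
also preserves the caller's ordering across files.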

>
>>>>I don't think transports should be concerned with diffing.  I'd say
>>>>that's a branch-level concern.
>>>
>>>
>>>I generally agree. And I think SmartStorage can be more intelligent
>>>about a SmartTransport, but it doesn't have to be exposed at the
>>>Transport layer.
>>>
>>>But would it be okay to expose it as part of the Storage layer? Possibly
>>>with a way of doing "storage.get_diff()" returning 'I don't have it,
>>>build it yourself'.
>
>
> Again, I don't see why we'd want to include diffs in the Store
> interface, unless CompressedTextStore and/or WeaveStore actually did
> produce diffs.
>

Well, it seems like WeaveStore needs to at least provide an in-memory
Weave, and the weave format would allow for producing diffs different
from getting plain text and running diff.

For instance, to get the diff from the previous version, you just pull
out whatever changes occurred in this version. I may be misunderstanding
what information is stored in a weave, but from what I've seen it should
make pulling out a diff a lot easier than trying to compare textual lines.
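
A toy illustration of the idea, with the same caveat: this assumes each
weave entry records which versions contain a given line, which is a
simplified model and may not match the real weave format. weave_diff is
a hypothetical helper.

```python
def weave_diff(weave, old_rev, new_rev):
    """Return (removed_lines, added_lines) between two versions.

    weave is a list of (line, set_of_revisions_containing_line) pairs in
    file order. No full text is rebuilt and no line comparison is done;
    membership in each revision set is enough.
    """
    removed = [line for line, revs in weave
               if old_rev in revs and new_rev not in revs]
    added = [line for line, revs in weave
             if new_rev in revs and old_rev not in revs]
    return removed, added
```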

>
>>>I am tempted to pass get_diff a function that can generate the diff from
>>>2 texts in the case that the storage doesn't already have it.
>
>
> I'd be inclined to do it the other way: if branch.get_diff(old_id,
> new_id) throws an IDNotPresent, then you fall back to internal_diff.

Well, diff_texts then becomes the simple interface. You always have a
simple interface somewhere, it just depends what makes the most sense.
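
For comparison, the callback variant could be sketched like this.
DiffStore and its constructor arguments are hypothetical, and
internal_diff here is just a stand-in built on difflib, not the bzrlib
function of the same name.

```python
import difflib


def internal_diff(old_lines, new_lines):
    """Stand-in for a local diff routine."""
    return list(difflib.unified_diff(old_lines, new_lines))


class DiffStore:
    """Hypothetical store that may already hold some diffs."""

    def __init__(self, texts, stored_diffs=None):
        self._texts = dict(texts)
        self._diffs = dict(stored_diffs or {})

    def get_diff(self, from_id, to_id, make_diff):
        """Return a stored diff if there is one; otherwise build it from
        the two texts with the caller-supplied make_diff function."""
        try:
            return self._diffs[(from_id, to_id)]
        except KeyError:
            return make_diff(self._texts[from_id], self._texts[to_id])
```

With this shape the fallback lives inside the store call. With the
exception-based shape it lives in diff_texts. Either way the caller ends
up with a diff.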

>
> Aaron

John
=:->