[CHK/MERGE] Use chk map to determine file texts to fetch

Robert Collins robertc at robertcollins.net
Tue Nov 11 23:14:42 GMT 2008


On Tue, 2008-11-11 at 16:50 -0600, John Arbash Meinel wrote:

> 1) It doesn't stream, that is correct
> 2) We should be doing it across all inventories we are copying at once,
> rather than one inventory at a time.
> 
> 3) We *do* read the root, then copy the unique keys + root. I'm not sure
> what you are thinking that it does.

It doesn't read all the roots. And because only transactional stores use
CHKs, ordering isn't an issue, so memory isn't stressed here.

> In the end, I think a better solution is to get a better streaming fetch
> written, but this is in the "get it to the point it is usable", rather
> than rewriting everything. (Such as having a usertest run complete in
> less-than-overnight fashion.)

Well, current fetch is streaming - I overhauled fetch.py a couple of
releases ago so it's all unidirectional. This would be a regression of
that. It's not 'smart server verb' streaming, but it should be in a
state to be shifted without being rewritten.

> Our logic for streaming data from the remote isn't really complete. We
> have the "get_record_stream()" functionality, which is a decent start,
> but you have to know all of the keys you are going to need ahead of time.

You can use the revision search; that is enough to tell you all the text
keys and inventory CHK keys, by doing processing on the far end.
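To make that concrete, here is a minimal sketch of the far-end processing: given the revision ids a search yields, the server can enumerate every text key and inventory CHK root itself, so the client never needs the full key list up front. The inventory structures and names here are illustrative stand-ins, not bzrlib's actual APIs.

```python
# Hypothetical sketch: derive the full key set server-side from a
# revision search. 'inventories' is a stand-in dict, not a bzrlib type.

def keys_for_search(revision_ids, inventories):
    """Given revision ids and a map rev_id -> inventory, return the
    text keys and inventory CHK root keys the stream must include."""
    text_keys = set()
    chk_roots = set()
    for rev_id in revision_ids:
        inv = inventories[rev_id]
        chk_roots.add(inv['chk_root'])
        # Each inventory names the (file_id, revision) texts it uses.
        for file_id, text_rev in inv['texts'].items():
            text_keys.add((file_id, text_rev))
    return text_keys, chk_roots

inventories = {
    'rev-1': {'chk_root': ('sha1:aaa',),
              'texts': {'file-a': 'rev-1'}},
    'rev-2': {'chk_root': ('sha1:bbb',),
              'texts': {'file-a': 'rev-1', 'file-b': 'rev-2'}},
}
texts, roots = keys_for_search(['rev-1', 'rev-2'], inventories)
```

The point is only that the key closure is computable from the search result alone; a real implementation would walk CHK pages rather than a dict.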

> What we really want is something that could be given a "search" on
> revision-ids, and then fill in all the details from there on down. This
> can be layered on top of our work for improving
> "item_keys_introduced_by" which can figure out what *texts* need to be
> transmitted, but at the same time, we could figure out what inventory
> pages need to be transmitted. I don't know if this fits into a "generic"
> fetch code, though.

We had this pre 1.6, but it had several major flaws: it buffered a lot
of data, it reimplemented rather than reusing the mainline fetch logic,
and it was too closely coupled to knit representation to handle even the
moderate change of keyspace done for 1.6.

> You could either have the namespace unified, or the first entry would
> define what index needs to be used, etc.

Aaron proposed a unified keyspace in his Storage work; I'm totally sold
on that now - and note that the VersionedFiles work is an N-length tuple
keyspace, so it should be straightforward to adapt code to a single
keyspace inside of fetch.
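A unified N-tuple keyspace along those lines might look like the following sketch, where the first tuple element names the storage kind and so replaces per-store dispatch. The prefixes and index names are my own illustration, not Aaron's Storage design or bzrlib's actual layout.

```python
# Illustrative unified keyspace: the first element of each key tuple
# says which underlying store/index it belongs to.

def route(key):
    """Return which underlying index a unified key belongs to."""
    kind = key[0]
    if kind == 'texts':
        return 'text-index'       # ('texts', file_id, revision_id)
    if kind == 'inventories':
        return 'inventory-index'  # ('inventories', revision_id)
    if kind == 'chk':
        return 'chk-index'        # ('chk', sha1)
    raise KeyError('unknown keyspace prefix: %r' % (kind,))

print(route(('texts', 'file-a', 'rev-1')))
```

With this shape, a single record stream can interleave texts, inventories, and CHK pages, and the receiving side routes each record without the stream defining per-type sub-protocols.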

> Perhaps all we really need is to update item_keys_introduced_by() to
> allow it to return the chk pages that need to be copied.

I don't think so; that function is really a bit awkward, because it
drives duplicate processing of data. Really we need to walk the data,
obtaining information about what to copy, and copy the data as we read
it. Repositories which are not atomic need ordering constraints on
writes that are the reverse of reads; they should buffer on the client
before writing data.
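The reverse-ordering constraint above can be sketched in a few lines: reads proceed parent-first, but a non-atomic store must write child-first so a reader never observes a record whose references are missing. The record shapes and function names are illustrative only.

```python
# Minimal sketch of the ordering constraint: read records in
# topological (parent-first) order, buffer them client-side, then
# write in reverse so every referenced record exists before anything
# that points at it.

def copy_with_buffering(read_order, write):
    """Buffer the read stream, then flush writes in reverse order."""
    buffered = list(read_order)      # client-side buffer
    for record in reversed(buffered):
        write(record)

written = []
copy_with_buffering(['root', 'mid', 'leaf'], written.append)
```

This is also why the buffering belongs on the client: the server can emit records in its natural read order, and only a non-atomic receiver pays the memory cost of reordering.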

> The other concern is needing the ability to adapt between repository
> formats. Certainly streaming chk inventory records to a Knit repository
> needs someone to do the translation. (And is it possible to do the
> translation if you are only given a the minimal stream you would want if
> you were streaming into another chk repository.)

It's not necessarily possible, given only new byte values for a
repository, to generate an inventory delta. So streaming CHKMap contents
to a knit client won't help the client.

I think that when the representation is different we should stream
inventory deltas or something similar.
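As a sketch of what "stream inventory deltas" could mean: rather than raw CHK pages, the wire carries (old_path, new_path, file_id, entry) tuples that any receiving format can apply to its own inventory representation. The tuple shape and entry values here are illustrative, not a proposed wire format.

```python
# Hedged sketch of applying a streamed inventory delta. Each item is
# (old_path, new_path, file_id, entry): old_path None means an add,
# new_path None would mean a delete, both set means rename/modify.

def apply_delta(inventory, delta):
    """Apply an inventory delta in place and return the inventory."""
    for old_path, new_path, file_id, entry in delta:
        if old_path is not None:
            inventory.pop(old_path, None)
        if new_path is not None:
            inventory[new_path] = (file_id, entry)
    return inventory

inv = {'a.txt': ('file-a', 'v1')}
delta = [('a.txt', 'b.txt', 'file-a', 'v2'),   # rename + modify
         (None, 'c.txt', 'file-c', 'v1')]      # add
apply_delta(inv, delta)
```

Because the delta is expressed in path/entry terms rather than bytes, a knit-backed client can apply it without understanding CHK pages at all, which is exactly the cross-format case above.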

-Rob


-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.