[RFC] insert_data_stream/get_data_stream returning more keys

Mon Nov 17 23:13:11 GMT 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I've been thinking a bit about how to structure a more complete
"universal fetch" that can be a bit more optimal for pack repositories
than our current generic fetch.

The current general-case code has to double-handle the inventory data,
because it needs to determine what file texts to transmit and send them
before it sends the inventory data. What we would really like is to be
able to stream the inventory records across and, as we go, look at them
to see what file texts need to be transmitted. (This is how the "Packer"
code does the transmission.)

This becomes even more important with CHK repositories, because there we
have CHK pages that store the real info on the inventory, more than just
the actual inventory records.

It may also have some bearing on stacked (shallow?) branches, which
will occasionally want to get more records stored.

What I was thinking of was a sort of iterator that could tell you about
more keys that it wants you to fetch. I'm not sure whether
get_record_stream or insert_record_stream should be the one saying
this, though right now I'm thinking insert_record_stream.

The idea would be a loop something like:

keys = [('revision', revision_id) for revision_id
        in revision_ids]

while keys:
  stream = source.get_record_stream(keys)
  keys = target.insert_record_stream(stream)
  # filter out keys which are already present locally

The idea is that you could start with "Revision" texts, which could then
tell you about the inventory texts you need to fetch (though currently
these start with the same revision_ids, and that is mostly unlikely to
change).
Then as those are inserted it would tell you about the chk_map pages you
would want to get, and when those are inserted it could tell you about
the file texts (or on released formats, the inventories would tell you
right away about the file texts).
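
To make the cascade above concrete, here is a minimal toy sketch of the
loop. It is not bzrlib code: ToyRepo, its dict-based storage, and the
(key, payload, references) record shape are all assumptions made up for
illustration. Only the method names get_record_stream and
insert_record_stream come from the proposal.

```python
# Toy model: each "repository" is a dict mapping key -> (payload, references).
# Every name here is hypothetical; this is not bzrlib's API.

class ToyRepo:
    def __init__(self, records=None):
        self.records = dict(records or {})

    def get_record_stream(self, keys):
        # Yield (key, payload, references) for each requested key.
        for key in keys:
            payload, refs = self.records[key]
            yield key, payload, refs

    def insert_record_stream(self, stream):
        # Store every record, remembering which keys the records refer to.
        wanted = []
        for key, payload, refs in stream:
            self.records[key] = (payload, refs)
            wanted.extend(refs)
        # Filter out keys which are already present locally (and duplicates).
        missing = []
        for key in wanted:
            if key not in self.records and key not in missing:
                missing.append(key)
        return missing


def fetch(source, target, revision_ids):
    # Start from the revisions and keep going until nothing new is requested.
    keys = [('revision', revision_id) for revision_id in revision_ids]
    rounds = 0
    while keys:
        stream = source.get_record_stream(keys)
        keys = target.insert_record_stream(stream)
        rounds += 1
    return rounds
```

In this sketch the "filter out" step lives inside insert_record_stream,
so the caller's loop stays as small as the one proposed above. Whether
the filtering belongs there or in the caller is left open.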

This also needs to be done on Repository, since it would need to figure
out which VersionedFile each record goes into (revisions, inventories,
and file texts each have a separate index, etc.)
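
As a rough illustration of that dispatch step, the sketch below groups
records by the kind named in the first element of each key, mirroring
the ('revision', revision_id) keys used earlier. The function name
route_records and the (key, payload) record shape are hypothetical; a
real Repository would hand each group to the matching VersionedFile
rather than return a dict.

```python
from collections import defaultdict


def route_records(stream):
    """Group (key, payload) records by the kind stored in key[0].

    Hypothetical helper: the returned dict maps a kind such as
    'revision' or 'text' to a list of (remaining_key, payload) pairs.
    """
    by_kind = defaultdict(list)
    for key, payload in stream:
        kind, rest = key[0], key[1:]
        by_kind[kind].append((rest, payload))
    return dict(by_kind)
```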

For the smart-server protocol, we could adapt it to be in 2 parts. On
the server it would be:

def handle_request(self, keys):
  while keys:
    stream = self.get_record_stream(keys)
    keys = self.send_and_analyze(stream)

If we allow the "send_and_analyze" to include a "more pending" flag,
then the client would receive all of the revisions, inventories,
chk_maps, and texts in a single extra-long stream. And while it would
then reference texts that it was already sending, that can be handled by
the simple "filter out" step.
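
One way to picture the single long stream is a generator that emits
records as it analyzes them, followed by a "more pending" marker after
each round. This is only a sketch under assumptions: the serve function,
the ('record', ...) and ('more_pending', ...) tuples, and the
(key, payload, references) record shape are invented here and are not
part of the smart-server protocol.

```python
def serve(get_record_stream, keys):
    """Yield one long stream of records plus 'more pending' markers.

    get_record_stream(keys) is assumed to yield
    (key, payload, references) tuples. Hypothetical sketch only.
    """
    sent = set()
    keys = list(keys)
    while keys:
        sent.update(keys)
        next_keys = []
        for key, payload, refs in get_record_stream(keys):
            # Send the record straight away...
            yield ('record', key, payload)
            # ...and analyze it for keys that still need sending.
            for ref in refs:
                # The simple "filter out" step: skip anything already sent
                # or already queued for the next round.
                if ref not in sent and ref not in next_keys:
                    next_keys.append(ref)
        # Tell the client whether another round of records follows.
        yield ('more_pending', bool(next_keys))
        keys = next_keys
```

The client would keep reading until it sees a ('more_pending', False)
marker, so a single request carries every round of the loop.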

The other possibility would be to turn the return value into an
iterator, rather than all-at-once. But since you can't place another
request until the current one finishes, I don't think it would be a big
win. The callers would just end up buffering the values until they can
place the next request, so we might as well do the work for them.

What I really like is that this could allow us to unify the smart and
non-smart code paths. The smart path would just return more data.

Having it on "insert_record_stream()" could allow stacked branches to
say that they need to get the basis texts that they don't have yet.
(Possibly asking for them as a fulltext?)

Also, if we did it right, we could make it so that a smart request could
indicate that you've gotten 100 of the 500 revisions you requested, but
that the revisions you've gotten so far are "complete", so you can commit
them to your repository.

As we discussed in the taxi in Sydney, it would actually work really
well to be able to transmit things in a "mixed" order. The idea is that
if you had:

A
|
B
|
C
|
D

You really want to have it stored on disk as "D C B A", but you should
transmit it as "A B C D". As a tradeoff, you could send it as "B A", "D
C", which would give you 2 pack files, both in reverse-topological
order, and allow you to commit the data permanently as you go (allowing
resume to work). And when you are done, an autopack that tries to do the
minimum amount of seeking could still preserve the reverse-topological
ordering.
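
The "B A", "D C" tradeoff can be sketched as a small batching function.
The name transmission_batches and the batch_size parameter are
hypothetical; the input is assumed to be a linear history listed oldest
first.

```python
def transmission_batches(topo_order, batch_size):
    """Split oldest-first revisions into batches, each listed newest first.

    Every batch can be written as its own pack file in
    reverse-topological order, and batches arrive oldest first, so each
    one can be committed as soon as it is complete.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    for start in range(0, len(topo_order), batch_size):
        batch = topo_order[start:start + batch_size]
        batches.append(list(reversed(batch)))
    return batches


# The four-revision example from above, sent two revisions at a time.
print(transmission_batches(['A', 'B', 'C', 'D'], 2))
# → [['B', 'A'], ['D', 'C']]
```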

Thoughts?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkkh+ocACgkQJdeBCYSNAANo6QCghUoESbqBkN2X3GoyOj7RwKam
EG8An3CmPwicae81ocHbSGsLW9Ki2CfA
=tHJb
-----END PGP SIGNATURE-----