[RFC] two-phase version add?

John Arbash Meinel john at arbash-meinel.com
Mon Jun 25 17:25:09 BST 2007


Aaron Bentley wrote:
> We've been able to take advantage of the reference dependencies in our
> storage to avoid having half-added data being a problem.
> File entries are irrelevant unless there's an inventory that points at
> them.  Inventory entries are irrelevant unless there's a revision that
> points at them.
> So when we write, we write in this order:
> 1. files
> 2. inventories
> 3. revisions
> This ordering ensures correctness, because nothing becomes visible until
> the revisions are added.  So if we are interrupted, nothing is visible.
> But when we read, we must read the inventories before we can read the files.
> For fetch, we do both reads and writes.  This means that we have to read
> the inventory, then read the files, then read the inventory again.  Now,
> we could read the inventory to a temp file.  That would at least avoid
> server round-trips.  But it seems inelegant and a bit inefficient.
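The write-order invariant Aaron describes can be sketched roughly like this (a toy model for illustration — the names and the dict-based store are mine, not bzrlib's actual API):

```python
# Toy sketch of write-order crash safety: data only becomes
# visible when the revision record lands, because readers always
# start from the revisions index.

def add_revision(store, rev_id, file_texts, inventory):
    # 1. file texts first -- unreachable until an inventory points at them
    for file_id, text in file_texts.items():
        store['files'][file_id] = text
    # 2. the inventory -- unreachable until a revision points at it
    store['inventories'][rev_id] = inventory
    # 3. the revision last -- this single write makes everything visible
    store['revisions'][rev_id] = {'inventory': rev_id}

def visible_revisions(store):
    # Readers only ever start from the revisions index, so anything
    # written in steps 1-2 of an interrupted add_revision is never reached.
    return set(store['revisions'])
```

If the process dies between steps 1 and 3, the half-written file texts are orphaned but harmless, which is exactly the property that makes the read/write ordering mismatch during fetch awkward.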

Actually, we explicitly cache the inventory.knit data in memory until we finish
pulling, and then we copy it into the target file. I thought there was a test
for this, as I know it has regressed in the past. (It was one of the big
performance issues with 'pull', because inventory.knit is often pretty big.)
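The in-memory caching amounts to something like the following (a minimal sketch — the class name and interface are illustrative, not the real bzrlib fetch code):

```python
import io

# Sketch of buffering knit record data in RAM during a pull,
# then appending it to the target file in a single write at the end.

class BufferedKnitWriter:
    def __init__(self, target_path):
        self.target_path = target_path
        self.buffer = io.BytesIO()

    def add_record(self, data):
        # cheap in-memory append; no write to the target file yet
        self.buffer.write(data)

    def flush(self):
        # one append to the real file once the pull has finished
        with open(self.target_path, 'ab') as f:
            f.write(self.buffer.getvalue())
        self.buffer = io.BytesIO()
```

The downside, as noted below, is that the whole inventory data sits in RAM until flush time, which hurts with large trees and lots of fulltexts.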

> It would be nice if we could first copy the inventory into local storage
> (e.g. inventory.knit), then read the files into local storage, then
> mark the inventory active* (e.g. by updating inventory.kndx).

That is one thing we have considered. It might also be nice, if we are pulling
50 revisions, to be able to mark 1-45 as finished even when we haven't been
able to grab everything yet. Our current pull isn't quite all-or-nothing,
because some of the files will have had their .knit + .kndx updated, but we
*do* still have to download all of inventory.knit again.

> That would also be nice for bundle files.  Version 4 is specifically
> designed to be installed, so it contains the entries in write order.
> That makes it an inappropriate choice to behave as a repository.  Read
> order makes much more sense there, and would allow us to, say, generate
> a revision tree by streaming through the file.
> With bundles being in read order, we have to seek.  But the bundles are
> bzip2-compressed, which hampers seeking backwards.  So in order to
> construct a revision tree, we have to stream through the file at least
> twice-- once to get the inventory, and once to get the file diffs.
> I think it would be preferable to implement two-phase version adds, so
> that we could write the revision and inventory at the beginning of the
> bundle, then the files, then activate the inventory, then activate the
> revision.
> Aaron
> * The term here would usually be "commit", as in "commit a transaction",
> but I thought that would be confusing to use.

As I mentioned, we already do this a bit by buffering the inventory in RAM. It
is a little inefficient, and becomes more of an issue when you have large trees
(and lots of fulltext inventories).

I think it would be okay to be able to buffer to the inventory.knit file, though.
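A two-phase add along those lines might look like this (an illustrative sketch under my own assumptions about the .knit/.kndx split — the class and index format here are simplified, not bzrlib's actual knit code):

```python
import os

# Sketch of a two-phase version add: phase 1 appends record bytes
# to foo.knit; phase 2 "activates" them by appending index entries
# to foo.kndx.  A crash between the phases leaves unreferenced
# bytes in the .knit, which readers -- who go through the index --
# never see.

class TwoPhaseKnit:
    def __init__(self, basename):
        self.knit_path = basename + '.knit'
        self.kndx_path = basename + '.kndx'
        self.pending = []           # (version_id, offset, length)

    def add_version(self, version_id, data):
        # phase 1: data is on disk, but not yet indexed
        with open(self.knit_path, 'ab') as f:
            offset = f.tell()
            f.write(data)
        self.pending.append((version_id, offset, len(data)))

    def activate(self):
        # phase 2: one index write makes all pending versions visible
        with open(self.kndx_path, 'a') as f:
            for version_id, offset, length in self.pending:
                f.write('%s %d %d\n' % (version_id, offset, length))
        self.pending = []

    def versions(self):
        # readers consult only the index
        if not os.path.exists(self.kndx_path):
            return []
        with open(self.kndx_path) as f:
            return [line.split()[0] for line in f]
```

That keeps the crash-safety property (nothing visible until the index is written) without holding the buffered data in RAM.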


