[RFC] two-phase version add?
John Arbash Meinel
john at arbash-meinel.com
Mon Jun 25 17:25:09 BST 2007
Aaron Bentley wrote:
> We've been able to take advantage of the reference dependencies in our
> storage to avoid having half-added data being a problem.
> File entries are irrelevant unless there's an inventory that points at
> them. Inventory entries are irrelevant unless there's a revision that
> points at them.
> So when we write, we write in this order:
> 1. files
> 2. inventories
> 3. revisions
> This ordering ensures correctness, because nothing becomes visible until
> the revisions are added. So if we are interrupted, nothing is visible.
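(As an aside, the ordering rule described here boils down to something like the following toy sketch. The class and names are purely illustrative, not bzrlib code:)

```python
# Illustration of write-order correctness: file and inventory data are
# written first but stay "invisible"; the revision record, written last,
# is the single step that makes everything reachable.
# Hypothetical in-memory store, not the bzrlib API.

class Store:
    def __init__(self):
        self.files = {}        # file_id -> text
        self.inventories = {}  # rev_id -> {path: file_id}
        self.revisions = set() # only ids in here are visible

    def add(self, rev_id, inventory, texts):
        # 1. files first: unreferenced, so harmless if we crash here
        self.files.update(texts)
        # 2. inventories next: unreferenced until the revision exists
        self.inventories[rev_id] = inventory
        # 3. revision last: this one step makes the data visible
        self.revisions.add(rev_id)

    def visible(self, rev_id):
        return rev_id in self.revisions
```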
> But when we read, we must read the inventories before we can read the files.
> For fetch, we do both reads and writes. This means that we have to read
> the inventory, then read the files, then read the inventory again. Now,
> we could read the inventory to a temp file. That would at least avoid
> server round-trips. But it seems inelegant and a bit inefficient.
Actually, we explicitly cache the inventory.knit data in memory until we finish
pulling, and then we copy it into the target file. I thought there was a test
for this, as I know it has regressed in the past. (It was one of the big 'pull'
performance issues, because inventory.knit is often pretty big.)
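Roughly, what I mean is something like this (a sketch of the buffering idea, not the real bzrlib fetch code; the class name is made up):

```python
# Sketch of caching fetched knit data in memory until the pull
# completes, then writing it to the target file in one go, so we
# never have to re-read it from the server. Hypothetical helper.

class BufferedKnitCopy:
    def __init__(self, target_path):
        self.target_path = target_path
        self.chunks = []

    def add(self, data):
        # accumulate downloaded records in RAM as they arrive
        self.chunks.append(data)

    def commit(self):
        # one append to the target file at the end of the pull
        with open(self.target_path, 'ab') as f:
            f.write(b''.join(self.chunks))
        self.chunks = []
```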
> It would be nice if we could first copy the inventory into local storage
> (e.g. inventory.knit), then read the files into local storage, then
> mark the inventory active* (e.g. by updating inventory.kndx).
That is one thing we have considered. It might also be nice if, when we are
pulling 50 revisions, we could mark 1-45 as finished even if we haven't been able
to grab everything yet. Our current pull isn't quite all-or-nothing, because
some of the files will have had their .knit + .kndx updated, but we *do* still
have to download all of inventory.knit again.
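Something like the following could capture that: data records land in whatever order they arrive, but a revision only gets its index entry (and so becomes visible) once every text it references has been fetched. This is a hypothetical structure, not bzrlib's actual fetch code:

```python
# Sketch of incrementally marking pulled revisions as finished.
# `needed_texts` maps each revision to the set of file texts it
# references; `index` holds the revisions that are already visible.

def activate_complete(pulled, needed_texts, have_texts, index):
    """Add index entries for revisions whose referenced texts have
    all arrived; the rest stay invisible until a later pass."""
    for rev_id in pulled:
        if rev_id not in index and needed_texts[rev_id] <= have_texts:
            index.add(rev_id)
```

With this, an interrupted pull of 50 revisions could leave revisions 1-45 fully activated rather than forcing everything to be re-fetched.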
> That would also be nice for bundle files. Version 4 is specifically
> designed to be installed, so it contains the entries in write order.
> That makes it an inappropriate choice to behave as a repository. Read
> order makes much more sense there, and would allow us to, say, generate
> a revision tree by streaming through the file.
> With bundles being in read order, we have to seek. But the bundles are
> bzip2-compressed, which hampers seeking backwards. So in order to
> construct a revision tree, we have to stream through the file at least
> twice-- once to get the inventory, and once to get the file diffs.
> I think it would be preferable to implement two-phase version adds, so
> that we could write the revision and inventory at the beginning of the
> bundle, then the files, then activate the inventory, then activate the
> revision.
> * The term here would usually be "commit", as in "commit a transaction",
> but I thought that would be confusing to use.
As I mentioned, we already do this a bit by buffering the inventory in RAM. It
is a little inefficient, and becomes more of an issue when you have large trees
(and lots of fulltext inventories).
I think it would be okay to be able to buffer to the inventory.knit file, though.
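In other words, phase one appends the record bytes to the .knit data file, and phase two "activates" them by appending the index entry to the .kndx file. Since readers go through the index, a crash between the two phases leaves only dead bytes behind. A minimal sketch, with an illustrative record layout rather than the real knit format:

```python
# Two-phase version add against knit-style storage (illustrative).

def add_record(knit_path, version_id, data):
    # phase 1: append raw bytes; without an index entry they are
    # unreachable, so an interruption here is harmless
    with open(knit_path, 'ab') as f:
        offset = f.tell()
        f.write(data)
    return (version_id, offset, len(data))

def activate(kndx_path, entry):
    # phase 2: one small index append makes the record visible
    version_id, offset, length = entry
    with open(kndx_path, 'a') as f:
        f.write('%s %d %d\n' % (version_id, offset, length))

def visible_versions(kndx_path):
    # readers only see versions that made it into the index
    try:
        with open(kndx_path) as f:
            return [line.split()[0] for line in f]
    except FileNotFoundError:
        return []
```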