[RFC] bzr.jrydberg.versionedfile
John Arbash Meinel
john at arbash-meinel.com
Wed Dec 21 16:55:20 GMT 2005
Johan Rydberg wrote:
> John Arbash Meinel <john at arbash-meinel.com> writes:
>
>
>>>Far from optimal, but uses the defined APIs.
>>
>>What would you consider optimal, and how different would it be to get us
>>there? I don't think we are stuck on any specific API; we won't reach
>>'stable' until February. :) Far better to do the right thing now than to
>>be hackish.
>
>
> I would consider the current implementation optimal in the sense that
> it does not have to compare any inventories to find out what file
> versions to pull. There are a few small implementation problems of
> course, but those can be fixed quite easily.
>
> Regarding the API: No, it is not written in stone. But I have defined
> an API that I am quite fond of. I'm talking about the VersionedFile,
> VersionedFileStore and RevisionStore classes. Using these I think we
> can implement almost any history format. They need a little bit more
> love to be complete, esp the store classes.
>
Well, you were the one who said it was 'far from optimal'. But the
proposed implementation would still use 'changes', and thus wouldn't
have to compare inventories.
>
>
>>>> 1) Grab a list of revisions
>>>> 2) Figure out the set of files involved. This is either done by
>>>> reading inventories, or with your delta object.
>>>> 3) For each file, either:
>>>> a) Pull in only changes which match the list of
>>>> revisions you are expecting to fetch
>>>> b) Pull in everything, because usually the waste
>>>> will be very small (usually none)
>>>> 4) Fetch the text of the inventory, and check all of the
>>>> associated texts, to make sure they have what you need
>>>> 5) Commit this inventory, then commit the revision
>>>> 6) Go back to 2 for the next inventory.
>>
>>I don't see any specific problems. I think it is pretty much what I was
>>suggesting. You can do everything direct to disk, you just have to do it
>>in the right order. Changes can include revisions which aren't fully
>>added yet, since it isn't part of the contract.
>
>
> Yes, with the exception of steps 5 and 6.
I'm not sure what you mean here. Let me provide my revised version of
your set of steps:
1) Calculate a list of what revisions to fetch.
2) Create an in-memory copy of the local 'revision' knit, and merge
remote versions into the in-memory copy.
3) Merge the 'changes' knit directly to disk (.join)
4) Iterate over the pulled versions of the 'changes' file,
and record them in a list.
5) Iterate over the list, on a per-file basis, and merge the versions
directly to disk.
6) Merge the 'inventory' knit directly to disk (.join)
7) Copy in-memory 'revision' knit to disk (using .join)
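If it helps to make the ordering concrete, here is roughly what I mean
in code. This is only a sketch with toy stand-ins: the Knit class, the
one-line-per-file 'changes' format, and the fetch helper are all
invented for illustration, not the real VersionedFile API. The point is
purely the ordering: file texts and inventories land on disk before the
'revision' knit does.

```python
# Sketch of the proposed fetch ordering, with toy stand-ins for the
# real knit/VersionedFile classes.  Only the ordering of the steps
# is meant seriously; every name here is an assumption.

class Knit:
    """Toy versioned file: maps version-id -> (parents, text)."""
    def __init__(self):
        self.versions = {}

    def add(self, version_id, parents, text):
        self.versions[version_id] = (parents, text)

    def join(self, other, version_ids=None):
        """Merge versions from another knit (steps 3, 6, 7 use this)."""
        wanted = version_ids if version_ids is not None else other.versions
        for vid in wanted:
            if vid not in self.versions:
                self.versions[vid] = other.versions[vid]

def fetch(local, remote, revision_ids):
    # 2) merge remote 'revision' versions into an in-memory copy
    revision_mem = Knit()
    revision_mem.join(local['revision'])
    revision_mem.join(remote['revision'], revision_ids)
    # 3) merge the 'changes' knit directly to disk
    local['changes'].join(remote['changes'], revision_ids)
    # 4) record which file versions the pulled 'changes' entries name
    #    (toy format: one "file-id file-version" pair per line)
    needed = []
    for rid in revision_ids:
        _, text = local['changes'].versions[rid]
        for line in text.splitlines():
            file_id, file_version = line.split()
            needed.append((file_id, file_version))
    # 5) merge each file's versions directly to disk
    for file_id, file_version in needed:
        local[file_id].join(remote[file_id], [file_version])
    # 6) merge the 'inventory' knit directly to disk
    local['inventory'].join(remote['inventory'], revision_ids)
    # 7) only now copy the in-memory 'revision' knit to disk
    local['revision'].join(revision_mem)
```

If we crash anywhere before step 7, the 'revision' knit on disk never
mentions the half-fetched revisions, so nothing considers them present.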
The only step I might add is a 5b), which, before merging an inventory
to disk, would actually extract the full text, generate the in-memory
representation, and verify that all referenced files have all of the
referenced revisions. This could be more of an integrity-check stage,
which could be optional (and in the future removed).
We would need a code path for this sort of thing anyway, in the case
that a remote 'changes' did not exist. Or is 'changes' going to be a
required file?
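The 5b check itself could be quite small. A sketch, where the
one-entry-per-line inventory format and the check_inventory helper are
both made up for illustration (the real inventory is XML, of course):

```python
# Hypothetical step 5b: before merging an inventory to disk, extract
# its full text and verify that every (file_id, revision) it references
# is already present in the corresponding file knit.
# Toy inventory format: one "file-id revision" pair per line.

def check_inventory(inventory_text, file_knits):
    """Return the (file_id, revision) references that are missing.

    file_knits maps file-id -> set of version-ids present on disk.
    An empty result means the inventory is safe to commit.
    """
    missing = []
    for line in inventory_text.splitlines():
        file_id, revision = line.split()
        knit = file_knits.get(file_id)
        if knit is None or revision not in knit:
            missing.append((file_id, revision))
    return missing
```

An empty list means every referenced text already landed in step 5, so
committing the inventory cannot leave a dangling reference.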
>
>
>>One alternative, would be to have a WAL of sorts, which is just a list
>>of revision-ids which have been committed to the store. So the
>>transaction id then becomes the revision id (which is really what we are
>>doing right now, we are just using the revision-store as the WAL).
>
>
> Sorry, but what does WAL mean?
Write Ahead Log.
I'm using the term rather loosely here. It is the same as the Journal in
a journaling filesystem. Postgres uses the term WAL a lot, and Robert
used WALF (I assume write ahead log file).
The idea is that you have a small file which can be written quickly,
which talks about what you are doing, so that upon recovery, you can
read it, and know what should(n't) exist.
I suppose it would actually be a write-after-log, where anything that
was not mentioned in the log is actually invalid.
We have this for knits, in the 'knit index' file. I assume you only
write an entry to the index if you completed the write to the knit. So
if I write half of an entry, and then get canceled, that chunk is
implicitly marked as bad, because there is no index entry which
references it.
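In code, that write ordering and the implicit invalidation look
something like this (a sketch with an invented file layout and helper
names, not the real knit format):

```python
# Minimal sketch of the "index as write-ahead log" idea: data is
# appended to the knit file first and fsync'd, and an index entry is
# written only after the data write completed.  On recovery, bytes not
# covered by any index entry are implicitly invalid.

import os

def append_record(data_path, index_path, version_id, payload):
    with open(data_path, 'ab') as data:
        offset = data.tell()
        data.write(payload)
        data.flush()
        os.fsync(data.fileno())   # the data must be durable first...
    # ...only then does the index entry make it "exist"
    with open(index_path, 'a') as index:
        index.write('%s %d %d\n' % (version_id, offset, len(payload)))

def read_records(data_path, index_path):
    """Only index-covered chunks are returned; a torn tail is ignored."""
    records = {}
    with open(index_path) as index, open(data_path, 'rb') as data:
        for line in index:
            version_id, offset, length = line.split()
            data.seek(int(offset))
            records[version_id] = data.read(int(length))
    return records
```

If a write is interrupted after the data append but before the index
line, the orphaned bytes at the end of the knit are simply never read.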
>
> I had an idea some time ago of having a 'revision-graph' file in .bzr
> that contains (revision-id, parents) tuples of all revisions available
> in the revision-store. I think that using such a file is cleaner
> design wise, than to rely on the index of the inventory or revision
> knits to extract ancestry and graph information about the branch --
> esp in the case where the inventory and revision knits are shared
> between several branches.
>
> ~j
Well, you are adding more information into the file, which isn't
terrible, just extra.
The question is: why have that extra file if we already have the
information in the index? Isn't it redundant, with the potential to
disagree? (breaking the idea of normalization)
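That is, both the knit index and a separate graph file would carry the
same (revision-id, parents) tuples, and an ancestry walk works
identically over either source. A sketch, with an invented
line-per-revision format for the hypothetical 'revision-graph' file:

```python
# Hypothetical 'revision-graph' file: one line per revision, the
# revision-id followed by its parent ids.  The same {rev: [parents]}
# mapping could equally be read out of the knit index, which is why
# the separate file is redundant.

def parse_graph_file(text):
    """Parse 'revision-id parent-id ...' lines into {rev: [parents]}."""
    graph = {}
    for line in text.splitlines():
        parts = line.split()
        graph[parts[0]] = parts[1:]
    return graph

def ancestry(graph, revision_id):
    """All ancestors of revision_id (inclusive), walking parent links."""
    seen = set()
    pending = [revision_id]
    while pending:
        rev = pending.pop()
        if rev in seen:
            continue
        seen.add(rev)
        pending.extend(graph.get(rev, []))
    return seen
```

Whichever store supplies the graph, the walk is the same; keeping two
copies only adds the chance that they drift apart.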
John
=:->