please check out weave-format branch

Sat Sep 24 00:51:46 BST 2005

Martin Pool wrote:
> On 24/09/05, John A Meinel <john at arbash-meinel.com> wrote:
> 

...

> 
> This is because *insertions* always do nest properly; insertions
> happen at a point and so never cross existing blocks.  We can omit the
> closing number because it always matches the last-closed block.  (At
> first I didn't do this, and perhaps it's an over-optimization.)

What if two revisions on parallel branches insert the same line, but 
with different surrounding lines? For example:

rev 0:
A
B
C

rev 1: parent 0
A
B
D
E
C

rev 2: parent 0
A
B
E
F
C

Now, in testing, I see that you just let the duplication occur:
$ ./bzrlib/weave.py init ,,test
$ echo -e "A\nB\nC" | ./bzrlib/weave.py add ,,test 0
$ echo -e "A\nB\nD\nE\nC" | ./bzrlib/weave.py add ,,test 1 0
$ echo -e "A\nB\nE\nF\nC" | ./bzrlib/weave.py add ,,test 2 0
$ cat ,,test
# bzr weave file v5
i
1 a4f3c4f6fb6ac5ffffe009c9d26a33c875d240f3
n 0

i 0
1 cf975d0eb01f54d329385be987f9494638c28064
n 1

i 0
1 4750c8b1973c6835ef4cf6ede1c30cffb2d9bc3d
n 2

w
{ 0
. A
. B
{ 1
. D
. E
}
{ 2
. E
. F
}
. C
}
W

That may well be the correct behaviour, because otherwise there is no 
way to state that a line was added in two revisions. To take it to the 
logical extreme:
$ ./bzrlib/weave.py init ,,test
$ echo -e "A\nB\nC" | ./bzrlib/weave.py add ,,test 0
$ echo -e "A\nB\nD\nC" | ./bzrlib/weave.py add ,,test 1 0
$ echo -e "A\nB\nD\nC" | ./bzrlib/weave.py add ,,test 2 0
$ cat ,,test
# bzr weave file v5
i
1 a4f3c4f6fb6ac5ffffe009c9d26a33c875d240f3
n 0

i 0
1 207ab437e3bb81d0906d403a29d56251956ce0bc
n 1

i 0
1 207ab437e3bb81d0906d403a29d56251956ce0bc
n 2

w
{ 0
. A
. B
{ 1
. D
}
{ 2
. D
}
. C
}
W

So that shows that the exact same line was added in each revision, with 
full duplication.
Probably not a big deal, and has to be this way, because each revision 
only knows about the lines added in its ancestors, not its siblings.
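To make the duplication concrete, here is a minimal sketch of reading the 
body of the second weave above. It is not the bzrlib code: the function 
name "extract" is made up, it only understands the "{ N", "}" and ". text" 
lines shown in this mail, and it assumes a line is visible whenever the 
innermost insertion block around it belongs to an included revision.

```python
def extract(weave_body, included):
    """Return the text lines visible to the set of versions `included`."""
    open_blocks = []  # versions of the insertion blocks we are currently inside
    text = []
    for line in weave_body.splitlines():
        if line.startswith("{ "):
            open_blocks.append(int(line[2:]))
        elif line == "}":
            open_blocks.pop()
        elif line.startswith(". "):
            # Assumption: a line belongs to the innermost insertion block.
            if open_blocks[-1] in included:
                text.append(line[2:])
    return text


# Body of the second weave from this mail (between the "w" and "W" markers).
WEAVE = """\
{ 0
. A
. B
{ 1
. D
}
{ 2
. D
}
. C
}
"""

print(extract(WEAVE, {0, 1}))     # rev 1 plus its parent
print(extract(WEAVE, {0, 2}))     # rev 2 plus its parent
print(extract(WEAVE, {0, 1, 2}))  # a hypothetical merge that includes both
```

The first two calls both print ['A', 'B', 'D', 'C']. The third prints 
['A', 'B', 'D', 'D', 'C'], which is where the duplicated line would show up 
if something included both siblings.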

> 
> 
>>The problem I have with switching to the current weave format, is that
>>it feels like I should be making snapshots, in case something gets
>>messed up. (Think about svn with the Berkeley DB backend, where you copy
>>it periodically).
>>But why am I using an SCM if I have to back up its metadata periodically?
>> Coming from Arch, the SCM *was* the backup method. Naturally you should
>>keep a mirror for redundancy, but mirroring a corrupted branch will give
>>you a corrupted branch.
> 
> 
> As you say, you need backups anyhow, especially with the history in
> the working directory.  The question is whether corruption will
> propagate to mess up your backups, necessitating having multiple
> levels as we needed with svn.

Well, if your only backup is a mirror, whatever corrupted the local 
copy is likely to spread. Obviously if you are backing up to tape, what 
is already on tape isn't going to be affected.

I suppose we could write a "bzr backup" which would create something 
like the .bzr.backup directory (possibly in another place, probably 
linked to the date), which would check all of the sha sums before doing 
the backup. You could at least run this as a cron job.
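A rough sketch of that check-then-copy idea, to show the shape of it. 
Nothing here is an existing bzr command or API: "checked_backup", 
"sha1_of" and the "expected" mapping (relative path to hex sha1) are all 
made up for illustration, and a real version would get the expected sums 
from the branch itself.

```python
import hashlib
import shutil
from pathlib import Path


def sha1_of(path):
    """Return the hex SHA-1 of the file at `path`, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def checked_backup(src, dest, expected):
    """Copy `src` to `dest` only if every file in `expected` checks out.

    `expected` is a hypothetical manifest mapping a path relative to `src`
    to its recorded hex SHA-1. Returns the sorted list of paths that are
    missing or do not match; the copy happens only when that list is empty.
    """
    src = Path(src)
    bad = [rel for rel, digest in sorted(expected.items())
           if not (src / rel).is_file() or sha1_of(src / rel) != digest]
    if not bad:
        # copytree refuses to overwrite, which suits a fresh dated directory.
        shutil.copytree(src, dest)
    return bad
```

From a cron job you would call it with a destination named after the date 
and complain loudly if the returned list is not empty, so a corrupted 
branch never overwrites or joins the good backups.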

> 
> The data is guarded by sha1 both at the inventory/revision level, and
> also within the weave.  If you run the 'weave check' command it will
> extract every version and make sure that its sha1 is what it should
> be.  I am fairly confident that if any corruption does occur this will
> trap it, and so if you check the branch before making the backup the
> result will always be usable.

By the way, weave.py imports "bzrlib.progress", so it cannot be run 
directly unless there is a "bzrlib" directory somewhere on your path 
(or relative to weave.py).
We might consider trying "import bzrlib.progress" first and, if that 
raises ImportError, falling back to a plain "import progress".
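Something along these lines is what I have in mind, written as a small 
helper so it can be tried without bzrlib installed. "import_first" is a 
made-up name, not anything in bzrlib; in weave.py itself a plain 
try/except ImportError around the two import statements would do the 
same job.

```python
import importlib


def import_first(*names):
    """Return the first module in `names` that imports successfully."""
    last_error = None
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember the failure and fall through to the next candidate.
            last_error = exc
    raise ImportError("none of %r could be imported" % (names,)) from last_error


# Intended use in weave.py (left as a comment: it needs one of the two
# modules to be importable):
#     progress = import_first("bzrlib.progress", "progress")
```

The package-qualified name goes first, so a normal install behaves as it 
does today, and the bare name is only tried when weave.py is run on its 
own from inside the bzrlib directory.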

> 
> 
>>Now maybe you feel like the current weavefile format is obvious enough
>>that it is difficult to corrupt. From my experience of trying to look
>>through the .weave file, I don't quite feel the same.
> 
> 
> It is a somewhat indirect format.  On the other hand I think that
> one can at least partially understand it by looking at it, at least if
> you start with files you're familiar with.  But maybe I'm biased.  One
> reason why I did make it text and line based is to help with this.  An
> append-only weave (good though it would be in other ways) might be
> less obvious.

Well, with any append-only format you won't be able to look at the text 
in order. That is certainly true. I suppose with the current format, you 
could mentally mask out the regions, and figure out what the final text 
would be. But with complex ancestry, that would get rather tricky.

If I developed an append-only format, would you consider merging it? I 
know *I* would feel better with that sort of format. Though naturally, 
whichever code base gets the most use is the one that is probably the 
most trustworthy.

It might make a good case for the Storage abstraction, and having bzr be 
able to instantiate multiple back-end stores.

Speaking of which, are you planning to merge the Transport stuff, or is 
it waiting for bzr.newformat to land first?

My favorite bzr.newformat feature is probably the ancestry file, though 
cutting down the .bzr/ directory size because of weaves is also nice, as 
it makes copying faster.

John
=:->

> 
> --
> Martin
> 
> 
