Dotted revno "algebra"

Fri Apr 30 22:27:09 BST 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I was talking to Ian on the phone yesterday, and he wanted me to post
about an idea he had.

The idea was to find a way to describe revisions, that looks like a
dotted revno, but is, in essence, an algebra for how to reach the revision.

The basic idea is to keep a 3-digit number, but to change how those
digits get assigned. Specifically:

1) First digit is based on the mainline revision that merged this revision.

2) Second digit uses an 'algebra' to compute a psuedo-number. More on
this later.

3) Third digit defines a distance along the lefthand parent history.

For (2) his idea was to, essentially, append another number for every
time that you follow a right-hand parent. And this number would
represent *which* parent was followed.

As an example (time goes up):

 I      5
 |\
 G H    4   5.1.1
 | |\
 D E F  3   5.1.2   5.11.1
 | |/
 B C    2   5.1.3
 |/
 A      1

So in this case, F gets something like 5.1 1.1, but you just smush the
first two together. To get enough points so that you can handle wide
octopus merge, you could use letters (5.a.10). I guess one of the Linux
Kernel revisions is 31-revs-wide. but 10digits + 26 letters is still
36-wide, and you could get into extra chars if you really wanted.

The really nice thing about this layout is that you can look up the
revision by walking the ancestry graph, without having to know anything
ahead of time.

Some problems that I see:

1) This graph:

 G      2
 |\
 | F        2.1.1
 | |\
 | E \      2.1.2
 | |\ \
 | | C D            2.11.1  2.11.1
 | |/ /
 | B-'      2.1.3
 |/
 A      1

^- C and D have the same number, because you didn't base the new digit
on what came before.

As such, you'd really have to go back to the extending-dotted form, and
the above would become:

 G      2
 |\
 | F        2.1.1
 | |\
 | E \      2.1.2
 | |\ \
 | | C D            2.1.2.1.1  2.1.1.1.1
 | |/ /
 | B-'      2.1.3
 |/
 A      1

Now, we used to have this, but based on the branch point. And we found
that for mysql revisions, the dotted revno would get enormous.
(sometimes 150 chars or so). Longer than a revision id.

However, switching to base it on merge point rather than branch point
might make a significant difference here. I would have to test it to
know for sure.

2) To be genuinely useful, I think we would have to number the third
digit with 1 as the most recent revision, and its LH parent getting 2,
etc. Otherwise you don't know how far back to go until you get to 1,
without actually computing the graph difference information.

Which means that the numbers along the mainline count upwards with time,
but the revisions that are merged count down with time. Vincent
mentioned in another thread that we could start counting into negative
numbers, but if you just omit the '-' sign, then you're still counting
upwards.

3) Dotted revnos no longer would have an absolute single value
associated with them. This is both good and bad. If we made it so that
'bzr log' could show revisions as --topo or --by-date, etc (aka not
merge sorted), then it could compute *a* number to assign to any
revision it displays. It might just give a different number based on
phase-of-the-moon (how things end up getting sorted).

In the above example, 'B' is the left-hand parent of both E, C, and D.
So it could be reached as:

2.1.3, 2.1.2.1.2, 2.1.1.1.2

If log always displayed things in merge-sort order, it could have a
stable numbering, but we could still allow people to specify a revision
by a different path if they so chose.

4) I wonder if we would be better off choosing a more obvious grammar.
For example, imagine if a number indicated to more N parents back, and .
symbol indicated to move to a right-hand parent. Then the question
becomes whether we want to special case the first number, to count up
from zero, or to count back from the tip. (is 1 the first commit, or is
it last commit.)

The above would become:

 G      2
 |\
 | F        2.1
 | |\
 | E \      2.2
 | |\ \
 | | C D        2.2.1 2.1.1
 | |/ /
 | B-'      2.3
 |/
 A      1

Now, it gets a bit uglier if you have multiple merges:

 G      2
 |\
 | F        2.1
 | |\
 | E \      2.2
 | | |\
 | | C D        2.1.1 2.1..1
 | |/ /
 | B-'      2.3
 |/
 A      1

And the 31-way merge would look like:

 1000...............................1

5) We could instead switch to letters after the first right-hand parent:

 G      2
 |\
 | F        2.1
 | |\
 | E \      2.2
 | | |\
 | | C D        2.1.1 2.1a1
 | |/ /
 | B-'      2.3
 |/
 A      1

6) Or always use letters:
 G      2
 |\
 | F        2a1
 | |\
 | E \      2a2
 | | |\
 | | C D        2a1a1 2a1b1
 | |/ /
 | B-'      2a3
 |/
 A      1

The numbers could still get pretty wide here, but they wouldn't grow as
fast. We still have to count steps to parents, which means older
revisions get bigger numbers, and mainline revs get smaller.

Right now, I actually think my favorite version is (5).

 I      5
 |\
 G H    4   5.1
 | |\
 D E F  3   5.2   5.1.1
 | |/
 B C    2   5.3
 |/
 A      1

 G      2
 |\
 | F        2.1
 | |\
 | E \      2.2
 | |\ \
 | | C D        2.2.1 2.1.1
 | |/ /
 | B-'      2.3
 |/
 A      1

 G      2
 |\
 | F        2.1
 | |\
 | E \      2.2
 | | |\
 | | C D        2.1.1 2.1a1
 | |/ /
 | B-'      2.3
 |/
 A      1

We would need to see them on a real tree to know for sure how we felt,
and how wide they get, etc.

I like that you can (as a human) follow the numbers fairly well, and you
can make up your own without having to have seen them from log, etc. I
don't really like that a given revision can be reached from multiple
values, or that the final number counts up instead of down.

Performance wise, they should be fast to compute, because as you walk
from tip, you should immediately be able to assign a correct value to
any revision you reach, even if you may later chose a different value
for it. In the above example, A could also be reached as 2.1a3, and we
may want to simplify it back to 1, which also has implications for
things you numbered *from* A.

We might be able to define a preferred order (like gdfo) that would give
the nicest values. (I think prefer-left gives us optimal values, but it
suffers from the same problem merge-sort suffers from today, you walk
the whole left hand history, and then their merges back to root, and
then.) Possibly prefer-left subject to gdfo. But then you still have
"walk back to the point where this was branch" sort if issues. As an
example:

 I
 |\
 F H
 | |
 E |
 | |
 D |
 |/
 C
 |\
 B G
 |/
 A

The shortcut H means you'll hit the mainline rev C far before you would
normally get there by walking the mainline. And if you did strictly GDFO
walking, you'd still walk all the way back to D, given that gdfo(H) ==
4. while gdfo(I) == 7.

I think there is some merit to the idea, and we might be able to refine
it a bit to be more palatable. I would be okay with (4) except it
gets confused with our range operator:

 bzr log -r 2..1

Is that 2..1 or 2 .. 1. 2a1 doesn't suffer from that. though it suffers
from starting to not look like a simple dotted revno. You could also go
to something like:

2-5-1.1.1

Where any parent other than the first-right-hand parent gets a
long-form. 2?5?1.1.1 and 2-5-1.1.1 come to mind.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvbSy0ACgkQJdeBCYSNAAOJtgCePXnwOK5RBhV5iSW9DpQMNeWQ
rBIAoIue47xp3L0F1T718xsMnr93AWIX
=ZSeg
-----END PGP SIGNATURE-----