[MERGE REVIEW] Revert destroys file contents produced by merge

Mon Mar 6 09:18:36 GMT 2006

On 27 Feb 2006, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:

> Okay, so this is also similar to XML in that it doesn't guarantee a
> particular number of lines per value.
> 
> That is one of the reasons why I think Boroncelli's hack for detecting
> revision-ids by reading the weaved XML is a dirty hack.  I would be much
> more comfortable with it if the format been designed so that it could be
> read directly from weave.  I think splatfile is better-suited to this
> than rio, because its dictionary form is line-per-key/value pair.  In
> fact, I think you could build an API that supported annotated values,
> and then be able to history-based scalar merges.

A format with one key/value pair per line is possibly easier to treat as
a scalar merge from a weave-like form.  (It's probably still possible
with a form that allows multi-line values, just by treating the
most-recently-modified line as representative for the whole field.)

An early draft of rio used escape characters as splatfile does.  Someone
(njs?) pointed out that if things like commit messages are allowed to
extend across lines it makes it much more palatable for human readers.
The question then would be whether you want it to be only just human
readable, or actually palatable to read.  For most files that will
only be seen in debugging (such as a record of merge results) it's not a
big deal.  If we want to include it in externalized changesets it's more
important.

We don't use rio in the permanent data of any release so I'm happy to
change the format if we think it'll be a better tradeoff.  I would like
to settle it on a reasonably good tradeoff before the release.

> You might represent inventory entries like this (the first line is to
> assert that there is an entry whose revision-id is bzqr):
> 
> bzqr.file_id bzqr
> bzqr.type file
> bzqr.name README
> bzqr.text_sha1 24352edfsdg34
> bzqr.text_size 3
> bzqr.revision rev at lambdafoo
> 
> Hmm.  Perhaps I should reserve a path separator, like '.' or '/'.
> 
> > I had originally thought of the rio field names as being similar to
> > Python attribute names or to XML attribute names, both of which have to
> > be ascii subject to some constraints.  That's just kind of design by
> > analogy so there's no strong reason to keep it if having unicode there
> > would be more useful.  We can escape it in the same way.  There's no
> > sense having callers escaping the contents too.
> 
> I think this is definitely true of the values.  You've managed to plant
> a seed of doubt about keys, though.

Did anything grow?

> > For human readability and so that you can add new optional fields in the
> > future I wanted to have fields explicitly labelled, at least briefly.
> > In some cases like the ancestry cache or the stat cache perhaps the
> > labels are not worth the time or space, but I suppose you can always
> > just make them small.  Ordered multimaps are perhaps less natural in
> > Python than regular dicts but it seems like a natural way to store
> > repeated attributes such as parent ids.
> 
> I suppose I could have written it as a multimap of {"revision-id":
> "foo", "parent-id": "bar", "parent-id": "baz"}.  It just seemed natural
> to write it as {"foo": "bar", "foo":"baz"}, at the time.

Explicitly labelling them seems to give more room for future growth.  Of
course we now have more of a pattern of explicitly labelling formats and
upgrading them wholesale, which may mean this is less important.  But I
think it's still useful.

Labelling fields does use some space but of course a particular format
could choose to use small (even single-byte) names.  In this case giving
the child revision-id only once is likely to be a space win.

> > I think allowing multiline values to span lines makes it much more
> > readable for humans than escaping them.
> 
> But you're not wrapping at any particular width, are you?  If so, you
> may have an argument, but most file viewers will auto-wrap.

The main case for multiline values, it seemed to me, was for multiline
text entered by humans: at the moment only commit messages; possibly
others in the future.  Those are likely to be of reasonable width.  If
a viewer does wrap them that seems harmless to human legibility.

> >  For the field names it might be
> > best to just escape special characters - or we could allow unicode but
> > not newlines, colon etc.  Perhaps that would be too much of a restriction
> > since they can turn up in file ids.
> 
> If we want to use file or revision ids as tags at all, we should support
> any of the characters they may contain.  It's nice to be able to parse
> the file straight into a dict; using 'file-id' as a tag means that we
> need to externally specify which tag value to use as a dictionary key.

I think I wrote the rio code under the impression we would restrict the
characters in file-ids.

-- 
Martin