[fastimport/MERGE] Train fixes

Wed Mar 12 05:26:25 GMT 2008

James Westby <jw+debian at jameswestby.net> wrote:
> On Mon, 2008-03-10 at 20:23 -0400, Shawn O. Pearce wrote:
> > Nothing better comes to mind.
> > 
> > On Git we would have to have the SHA-1 of the "ghost parent" in order
> > to generate the correct SHA-1 of the child commit that references
> > it, otherwise we cannot store the child commit.  Mark or no mark,
> > it has to in the end boil down to a SHA-1 before we can finish the
> > child commit.
> > 
> > Rough idea of a BNF:
> > 
> > 	new_ghost ::= 'ghost' lf
> > 	  mark
> > 	  ('id' sp hexsha1 lf)?
> > 	  lf?;
> > 
> > and require that at least on Git to import a ghost the frontend
> > must give us the "id" subcommand.  What do the bzr folks think?
> 
> (Aside, the optional terminal LFs are a bit of a pain for
> our current parser, it would be great to have them compulsory.
> If there is ever another version of the format then I would
> like to suggest that it mandates them. The documentation
> suggests that they were an addition, hence the optional
> nature, so I guess this is the way you would go anyway)

Heh.  The format used to _require_ the terminal LFs.  This turned
out to be difficult for a lot of frontend authors.  I kept getting
complaints: "why did fast-import reject my stream!  it was just
missing this silly blank line here!" so I made all of the command
terminal LFs optional.

However.  We have had a pretty long-standing bug in git-fast-import
where the terminal LF was required on a "reset" command (the optional
part of "lf?" was not implemented correctly).  That was finally
fixed just this past week or so, and was shipped in Git 1.5.4.4.

If you haven't looked at fast-import.c in git, don't; it's ugly.
But the trick to our command parser is we buffer a full line
of text and call a subroutine for each command in the grammar.
If the subroutine recognizes the command it eats it and reloads
the buffer.  If the subroutine doesn't recognize the command it
returns silently and allows someone else to look at it.  At the
end of most subroutines we check if the buffer is the empty line,
and if so reload it again (thus eating the optional LF).
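
As a rough illustration of that model, here is a small Python sketch.
It is not the code in fast-import.c; the class and method names are
made up for this example, only two commands are handled, and the
"ghost" command is just the BNF proposed earlier in this thread, not
something fast-import accepts.

```python
import io


class StreamParser:
    """Toy line-buffered command parser (illustrative, not git's code)."""

    def __init__(self, stream):
        self.stream = stream
        self.line = None     # the single buffered line; None at end of input
        self.commands = []   # recognized commands, kept for inspection
        self.reload()

    def reload(self):
        """Read the next line into the buffer, dropping its newline."""
        raw = self.stream.readline()
        self.line = raw.rstrip("\n") if raw else None

    def skip_optional_lf(self):
        """Implement 'lf?': eat the buffered line only if it is empty."""
        if self.line == "":
            self.reload()

    def parse_reset(self):
        if self.line is None or not self.line.startswith("reset "):
            return False     # not ours; let another subroutine look at it
        self.commands.append(("reset", self.line[len("reset "):]))
        self.reload()
        self.skip_optional_lf()
        return True

    def parse_ghost(self):
        # Hypothetical command, following the BNF proposed in this thread.
        if self.line != "ghost":
            return False
        self.reload()
        if self.line is None or not self.line.startswith("mark :"):
            raise ValueError("ghost requires a mark line")
        mark = self.line[len("mark "):]
        self.reload()
        ghost_id = None
        if self.line is not None and self.line.startswith("id "):
            ghost_id = self.line[len("id "):]
            self.reload()
        self.skip_optional_lf()
        self.commands.append(("ghost", mark, ghost_id))
        return True

    def run(self):
        # Offer the buffered line to each subroutine in turn.
        while self.line is not None:
            if not (self.parse_reset() or self.parse_ghost()):
                raise ValueError("unsupported command: %r" % self.line)
        return self.commands
```

Feeding it io.StringIO("reset refs/heads/master\n\nghost\nmark :1\n")
gives the same command list with or without the blank line, since
every subroutine ends by checking for the empty line.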

The stream format is somewhat built around this parsing model.
Using other techniques to parse it may be harder.  Or maybe
easier.  I've only implemented it once.  :-)

> Generating the id would be a little problematic for us, as it
> is part of the git format, but it is certainly possible. Should
> we make it compulsory? Another solution would be for git to drop
> the parent if it is a ghost with no id. This would break the
> history representation and the round trip, but it would at least
> allow you to get a working git repo.

I think you pointed this out already; it's a ghost, you don't have
the original data, so you cannot generate the Git SHA-1 for it.

In Git I think the only way a ghost can happen would be some sort
of sequence like this:

	- Obtain a clean import of a repository from say SVN.
	- Decide "rm -f .git/objects/xx/yyyyyy...yyyyy" is fun.
	- Reconvert the tree:
	   git fast-export | git fast-import

If the object you deleted happens to be a commit we now have what you
are referring to as a ghost.  The descendant (subsequent) revision
will contain a reference to this deleted commit, but we have no
data for this deleted commit and therefore cannot create its SHA-1.
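
To make the dependency concrete, here is a minimal Python sketch of
how a Git object name is derived.  The helper names and the identity
string used below are invented for the example, and real commits can
carry more header lines than this.  The point is only that each
"parent" line sits inside the bytes being hashed, so the child's
SHA-1 cannot be computed until every parent's SHA-1 is known.

```python
import hashlib


def git_object_id(obj_type, body):
    """SHA-1 over '<type> <size>\\0' followed by the object body."""
    header = ("%s %d" % (obj_type, len(body))).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, ident, message):
    """Build a simplified commit body; parents are 40-char hex names."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + ident, "committer " + ident, "", message]
    return "\n".join(lines).encode("utf-8")


# The empty tree's name is derived the same way as any other object's.
EMPTY_TREE = git_object_id("tree", b"")

# Hypothetical identity and timestamp, for illustration only.
IDENT = "A U Thor <author@example.com> 1205299585 +0000"


def child_id(parent_sha1):
    body = commit_body(EMPTY_TREE, [parent_sha1], IDENT, "child\n")
    return git_object_id("commit", body)
```

Calling child_id() with two different parent names yields two
different commit names, which is why the ghost's SHA-1 has to be
supplied by the frontend if the history is to come out unchanged.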

During fast-export we want to mark this as a ghost, so that during
fast-import we can still obtain the same revision history as we
had before.  Yes, your repository still has this ghost corruption,
but at least your revision identifiers have not changed.

I don't know how Bzr works, so I cannot begin to imagine what a
roundtrip would need to do to obtain the same result, especially if
you have lost a revision but you know where it was lost.

The "id" line of a ghost command was meant to carry this VCS-specific
payload of what the ghost's name is, so the VCS can later say "yeah,
I should know about this thing, but I don't have it, sorry".

A clean git->bzr->git conversion is probably just not possible
with ghosts.  Bzr probably won't store the Git SHA-1 ghost data, and
Git won't store Bzr's ghost id ... assuming Bzr even has something
it could include in the stream for its own bzr-fast-import to make
use of.

-- 
Shawn.