bzr/LP issues from work discussed at UDS

Thu Dec 3 17:33:29 GMT 2009

Hi John,

Thanks for the comments.

On Thu Dec 03 11:07:24 -0500 2009 John Arbash Meinel wrote:
> So would "bzr merge --by-path" help this? Is it just that you need to
> merge just a subdir like 'debian'?

A two-way merge based on path might be sufficient. I'm not sure that
you would want to keep relying on that, though, as every merge would
be the same. It would be possible to put a branch in the middle that
joined the two.

Merging a subdir may be desirable; I don't have a feel yet for whether
that is what people would want to do. My impression is that they would
not in most cases.

> bzr stitch
> 
> Look in the ancestry of both branches, and try to figure out any
> revisions where the two trees were identical. This isn't perfect for the
> 'debian/' case, because the trees will never be identical. Both because
> you have a "debian/", though that can probably be trivially ignored, but
> also I think because you have debian/patches or whatever. So the
> *actual* content is meant to be after patching?
> 
> 
> What about a gui tool that let you create a new ancestry graph by
> selectively marking the revisions you want to sync up? So if you had:
> 
> A   X
> |   |
> B   Y
> |   |
> C   Z
> 
> You could then do:
> 
> A   X
> |\ /|
> B L Y
> |\|/|
> C M Z
>  \|/
>   N
> 
> Just a thought.
> 
> If you have cases where X is identical to A, then we could do this
> somewhat automatically. Or if X is identical to a subset of A (ignoring
> debian/ for example).

That would be an interesting approach. As I mention below, we have some
plans to do this, and it is how we will join them in the end. Giving
people the tools to do it themselves could be a great move.
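
To make the "identical apart from debian/" idea concrete, here is a
minimal sketch. It assumes each revision's tree is available as a plain
{path: bytes} mapping; tree_fingerprint and find_stitch_points are
hypothetical helpers, not bzrlib API. It does not handle the
debian/patches problem you raise, where the real content only exists
after patching.

```python
import hashlib


def tree_fingerprint(tree, ignore_prefixes=("debian/",)):
    """Hash a {path: content_bytes} mapping, skipping ignored paths."""
    digest = hashlib.sha1()
    for path in sorted(tree):
        if path.startswith(tuple(ignore_prefixes)):
            continue
        digest.update(path.encode("utf-8") + b"\0")
        digest.update(hashlib.sha1(tree[path]).digest())
    return digest.hexdigest()


def find_stitch_points(packaging_revs, upstream_revs):
    """Return (packaging_id, upstream_id) pairs whose trees match.

    Both arguments map a revision id to a {path: content_bytes} tree.
    """
    by_fingerprint = {}
    for rev_id, tree in upstream_revs.items():
        by_fingerprint.setdefault(tree_fingerprint(tree), []).append(rev_id)
    pairs = []
    for rev_id, tree in packaging_revs.items():
        for other in by_fingerprint.get(tree_fingerprint(tree), []):
            pairs.append((rev_id, other))
    return pairs
```

Each pair returned would be a candidate for one of the L/M style joins
in your diagram, to be confirmed by a person rather than applied
blindly.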

> We've had a lot of discussion on this point. I'm pretty sure we ended up
> with
> 
> http://host/path/to/X;branch=Y
> 
> As the preferred syntax. It requires quoting on the command line, and
> using 'branch=' is a bit more verbose if you were typing it manually.
> But it ends up being the "least evil". So I'm guessing we should JFDI
> and get something.

Sounds good to me.

I'm sure the question has been asked, but does git have a syntax for
doing this?
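
For what it's worth, splitting that syntax back apart is simple. A
minimal sketch, assuming the parameter is always spelled "branch=" and
follows a ";" as in your example; split_branch_url is a hypothetical
helper, not an existing bzrlib function, and percent-decoding is
ignored.

```python
def split_branch_url(url):
    """Split 'base;branch=Y' into (base, branch).

    branch is None when no 'branch=' parameter is present.
    """
    base, _sep, params = url.partition(";")
    branch = None
    for param in params.split(";"):
        key, eq, value = param.partition("=")
        if key == "branch" and eq:
            branch = value
    return base, branch
```

The quoting you mention is needed because an unquoted ";" ends the
command in most shells.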

> I think we currently have 8k or so, with some fraction of that failing.
> At least, I thought I remembered about 1-2k failures, and a 25% failure
> rate. So 2/.25 = 8k.

Thanks. That means we are looking at about doubling the number of imports
this cycle.

> I think you need to have some Launchpad interfaces, so that people can
> garden their own branches. Gardening needs to be done from time to time.
> If we aren't going to do it ourselves, then we need to expose a way for
> others to do it.

I agree. In this case however, we are taking the locations from Debian
metadata, so we can at least semi-automatically do it.

> > 5) API for requesting a code import be tried ASAP
> > 
> > Do Branch.requestMirror() and Branch.last_mirror_attempt refer to
> > importing to the code if the branch is a vcs-imports one?
> > 
> > If not, can we get an API similar to the above for vcs-imports?
> > 
> > We would want to say “try now,” and then spend a while waiting
> > for an indication it tried to import, so that we could be reasonably
> > sure the import was up to date.
> 
> It sounds like you want a synchronous api, but probably something like:
> 
> startMirror() # return when the attempt has started, or failed
> waitForFinish() # wait until the current mirroring has finished, include
>         # info about how much has been imported.

I think synchronous is impossible in the LP API currently, if only for
the simple reason that the request will time out after a short time,
and a mirror is likely to take longer than that.

The two things I highlighted at least allow us to approximate this with
polling. I am told that polling is probably the best we can do for
at least the medium term with the LP API.
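
The polling approximation might look something like the sketch below.
It assumes that Branch.requestMirror() and Branch.last_mirror_attempt
(or equivalents) do apply to vcs-imports branches, which is exactly the
open question in point 5. The refresh callable is a placeholder for
however the client re-reads the branch from the API, and
wait_for_mirror_attempt is a hypothetical name.

```python
import time


def wait_for_mirror_attempt(branch, refresh, timeout=600.0, interval=15.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Request a mirror, then poll until last_mirror_attempt changes.

    Returns the new value of last_mirror_attempt, or None if the
    timeout expires first. refresh(branch) must update the attributes
    of branch from the server; it is an assumption of this sketch.
    """
    before = branch.last_mirror_attempt
    branch.requestMirror()
    deadline = clock() + timeout
    while clock() < deadline:
        sleep(interval)
        refresh(branch)
        if branch.last_mirror_attempt != before:
            return branch.last_mirror_attempt
    return None
```

A changed last_mirror_attempt only says that an attempt happened, not
that it succeeded, so this gives "reasonably sure" rather than
certain.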

> Though I have to ask, how important is it to be at the current tip? What
> do you want to do if the tip is 'active' and there is more that can be
> pulled as soon as you finish the previous pull? Are you going to loop
> until convergence?
> 
> If you aren't waiting for convergence, is there harm in having an import
> be < 24 hours out of date?

Consider this:

  * Debian maintainer upgrades to a new upstream version in their VCS.
  * They test and upload the package.
  * They then commit/push as appropriate for their VCS.
  * We see the upload on average 3 hours later.
  * The probability of the import running in those 3 hours is small.
  * Therefore we won't be able to see the revision corresponding to
    the upload and so can't add it as a parent.

Therefore when we see the upload, I would like to trigger the code import
system to make a best effort to be up to date at that point. It's not
perfect, but it will cover the common case. (If the maintainer forgets
to push then the time until the revision can be mirrored may be
unbounded.)

> > 6) Guessing parent relationships
> > 
> > We currently infer parent relationships from debian/changelog, as
> > if you include changelog entries of another upload then we presume
> > you merged the changes.
> 
> What about the imports that are from upstream (and presumably don't have
> a debian/ directory at all)?

That is out of scope for this phase. We will have to solve this at some
point, possibly soon for daily builds, as you suggest above.

> > We will need to start inferring parent relationships in some cases
> > though, as there are some uses that mean the code that was uploaded
> > is never exactly committed as a single revision. (Such as never
> > committing the revision that changes the target from UNRELEASED
> > to unstable, or files modified in the clean target.)
> > 
> > The heuristics shouldn't have to be too fuzzy, but any fuzziness
> > makes me a little nervous, do the bzr developers agree? Do you
> > have any advice on how to do it well, so that it doesn't cause
> > mis-merges and the like?
> 
> Merging content is generally a bit fuzzy anyway. Which is one reason why
> we don't auto-commit it...
> 
> I don't have great answers here, but I'm guessing we'll have to be
> satisfied with a 75% solution, because you really can't do much better
> than that.

That's how I feel too. I will code the heuristics conservatively, but
there will be times when a mistake is made.
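
For reference, the changelog-based inference from point 6 can be
sketched roughly as below. This is a simplified illustration, not the
code we actually run: included_uploads and changelog_versions are
hypothetical names, and the sketch reports every earlier known upload
mentioned in the changelog rather than pruning down to direct parents.

```python
import re

# First line of a debian/changelog entry: "package (version) dist; ..."
HEADER = re.compile(r"^(?P<package>\S+) \((?P<version>[^)]+)\) ")


def changelog_versions(changelog_text):
    """Return the versions listed in a changelog, in file order."""
    versions = []
    for line in changelog_text.splitlines():
        match = HEADER.match(line)
        if match:
            versions.append(match.group("version"))
    return versions


def included_uploads(changelog_text, known_uploads):
    """Return known upload versions whose entries this changelog includes.

    The first entry describes the upload itself, so it is skipped.
    """
    seen = set()
    found = []
    for version in changelog_versions(changelog_text)[1:]:
        if version in known_uploads and version not in seen:
            seen.add(version)
            found.append(version)
    return found
```

The fuzzier heuristics would sit alongside this, for the cases where
the changelog alone does not identify the revision that was uploaded.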

> > 7) Migration over branch history rewrites
> > 
> > In order to include new history into the branch we need to
> > rewrite their history. This means changing revision ids.
> > 
> > In order to make the new branches mergeable with existing other
> > branches we need to change file ids.
> > 
> > We can do this fine for all the branches we control, but it
> > will instantly make developers' local branches unrelated.
> 
> You can always mark them as merges rather than throwing away that
> history. But then you have to carry around all the extra history...

That's true.

> Well, you can store the maps in the revision graph, or you could weaken
> it and just store it as a revision property (which is mostly what
> bzr-rewrite and bzr-svn/git/hg are doing today, IIRC)

Would putting the file-id maps there make sense too?

> > Also, we have a terrible user experience on the flag day:
> > 
> >   # day before
> >   $ bzr pull
> >   ...
> >   # flag day
> >   $ bzr pull
> >   bzr: ERROR: there is no common ancestor.
> > 
> > Any suggestions on how to improve on that would be gratefully
> > received.

> If you put it in the rev graph, then pull 'just works', but if you are
> changing file-ids, then they get 2x the history, their repo gets big,
> and we touch every file on their filesystem.
> 
> Though if you don't include the revision graph, then pull fails, they
> have to start a new fetch, and their repo doubles in size (or they just
> start a new repo, but still...)
> 
> You could always have "bzr flag-day-pull" or some sort of command that
> knows that the ancestry is being re-written, and to pull across based on
> that.

That's the kind of thing that I want, but nothing in the above example
gives any clue that this command exists or that they should run it now.
That's my concern.

Thanks,

James