bzr/LP issues from work discussed at UDS

John Arbash Meinel john.meinel at canonical.com
Thu Dec 3 16:07:24 GMT 2009


James Westby wrote:
> Hi,
> 
> I'd like to provide some information about some of the discussions that
> went on at UDS about UDD, and in particular some open questions related
> to bzr and Launchpad.
> 
> I have just written up two specs from the sessions:
> 
>   https://blueprints.launchpad.net/ubuntu/+spec/foundations-lucid-daily-builds
> 
> about all things daily builds, and
> 
>   https://blueprints.launchpad.net/ubuntu/+spec/foundations-lucid-distributed-development/
> 
> about using bzr for Ubuntu development.
> 
> There are a whole bunch of topics tied up in them, so I'd like to pull apart
> some of the threads for discussion. Most of these things are open questions,
> but some are a “please help do this” request. Some of them will be blocking
> things we want to roll out over the next 6 months.
> 
> 1) Merging unrelated branches in a recipe.
> 
> We currently have a rather unfortunate situation, but one we entered into
> knowingly. We can associate lp:<project> with an lp:ubuntu/<package> to know
> they contain the same code, and this would make it dead easy to set up
> the first cut of a daily build. However, as it currently stands the two
> branches will, except for a minority of projects, share no revision history,
> meaning that they can't be merged, and so can't be combined in a recipe.
> 
> There are a couple of main drawbacks to this, namely that starting a daily
> build is more work than it could be, and that changes made in the packaging
> of the daily build aren't directly mergeable back to the packaging.
> 

So would "bzr merge --by-path" help here? Is it just that you need to
merge a single subdirectory like 'debian/'?

> We have a plan to rectify this. It however is a multi-year plan, so we may
> want to sidestep the issue somewhat. There are good reasons for it being
> a long term plan, but it's not out of the question that this issue, and
> others below, force us to re-evaluate this. It should not be done lightly
> though.
> 
> One way to alleviate the pressure on this issue would be to make it possible
> to combine lp:<project> and lp:ubuntu/<package>, even though they are not
> mergeable.
> 
> https://code.edge.launchpad.net/~spiv/bzr-builder/merge-subdirs-479705/+merge/14979
> is said to go some way towards doing this, but as I say within, I think
> I am missing something, as it can't be the whole solution on its own.
> 
> What I am looking for here is suggestions on how we can elegantly allow
> people to combine the two trees in a system that isn't too fragile.

Path tokens are a "nice" fix for this, but fairly involved. And they
don't solve everything. (You still end up with 2x the history in the
parallel import case. Switching to a content-hash storage for file texts
would make this a little bit better.)

merge --by-path

would theoretically try to do a 2-way merge of the file contents for
every file. The problem with 2-way merge is that without a BASE, where
things differ, you don't know which one is "newer". (So if a line
differs, we don't know whether you changed A => B or the other side
changed B => A.)
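The ambiguity is easy to see in a toy line-merge function (illustrative
only, not bzr's actual merge code):

```python
# Minimal sketch of why 2-way merge is ambiguous without a BASE.
# With only THIS and OTHER, a differing line gives no way to tell which
# side changed it; with BASE, the side that differs from BASE wins.

def three_way_line(base, this, other):
    """Resolve one line given the common ancestor's version."""
    if this == other:
        return this
    if this == base:          # only OTHER changed it
        return other
    if other == base:         # only THIS changed it
        return this
    return "<<CONFLICT>>"     # both sides changed it differently

def two_way_line(this, other):
    """Without BASE, any difference is a conflict: we cannot tell
    whether THIS changed A => B or OTHER changed B => A."""
    return this if this == other else "<<CONFLICT>>"
```

So three_way_line("A", "A", "B") can safely pick "B", while
two_way_line("A", "B") can only conflict.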


bzr stitch

Look in the ancestry of both branches, and try to figure out any
revisions where the two trees were identical. This isn't perfect for the
'debian/' case, because the trees will never be identical: partly
because one tree has a "debian/" directory, though that can probably be
trivially ignored, but also, I think, because of debian/patches. So the
*actual* content is meant to be what you get after patching?


What about a gui tool that lets you create a new ancestry graph by
selectively marking the revisions you want to sync up? So if you had:

A   X
|   |
B   Y
|   |
C   Z

You could then do:

A   X
|\ /|
B L Y
|\|/|
C M Z
 \|/
  N

Just a thought.

If you have cases where X is identical to A, then we could do this
somewhat automatically. Or if X is identical to a subset of A (ignoring
debian/ for example).
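A rough sketch of that matching step (purely illustrative: trees are
modelled here as path-to-content-hash dicts, not real bzrlib trees, and
the helper names are made up):

```python
# Sketch of the "bzr stitch" idea: walk two ancestries and find pairs
# of revisions whose trees are identical once debian/ is ignored.
# A real implementation would compare bzrlib revision trees instead.

def strip_packaging(tree, ignored_prefix="debian/"):
    """Reduce a {path: content_hash} tree to a comparable key,
    dropping everything under the ignored prefix."""
    return frozenset((p, h) for p, h in tree.items()
                     if not p.startswith(ignored_prefix))

def find_stitch_points(upstream_history, packaging_history):
    """Each history is a list of (revid, tree) pairs, oldest first.
    Returns [(upstream_revid, packaging_revid)] candidate sync points."""
    by_content = {}
    for revid, tree in upstream_history:
        by_content.setdefault(strip_packaging(tree), revid)
    matches = []
    for revid, tree in packaging_history:
        key = strip_packaging(tree)
        if key in by_content:
            matches.append((by_content[key], revid))
    return matches
```

This only finds exact matches modulo debian/; the debian/patches
problem above means the packaging tree may still never match upstream
exactly.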

> 
> 2) Importing non-master branches
> 
> I know this is being discussed elsewhere right now, but this is another
> area where this came up as being useful/a blocker. I don't want to split
> the discussion, but just wanted to register another vote for being
> able to do this.
> 
> We may also want to do some interesting things with SVN imports, depending
> on how they are laid out. I haven't looked into it, but I imagine that
> switching to bzr-svn could change what we can do.
> 

We've had a lot of discussion on this point. I'm pretty sure we ended up
with

http://host/path/to/X;branch=Y

as the preferred syntax. It requires quoting on the command line, and
using 'branch=' is a bit more verbose if you were typing it manually,
but it ends up being the "least evil". So I'm guessing we should JFDI
and get something in place.
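Splitting that suffix back off is simple enough; a hypothetical helper
for the proposed syntax (not bzr's real URL parser):

```python
# Hypothetical helper for the proposed ";branch=NAME" location suffix.
# Splits "http://host/path/to/X;branch=Y" into its two parts.

def split_branch_param(location):
    """Return (base_url, branch_name_or_None)."""
    base, sep, param = location.partition(';branch=')
    return (base, param) if sep else (location, None)
```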


> 3) Importing a lot more branches
> 
> We want to import a lot more branches this cycle, all of those used
> for maintaining packages in Debian. I don't have a definite number
> that we want to import, but
> 
>   http://upsilon.cc/~zack/stuff/vcs-usage/
> 
> declares that there are 6881 source packages using a VCS. Therefore,
> what would happen if tomorrow we increased the number of vcs-imports
> by 5000? (What is the current number?)

I think we currently have 8k or so, with some fraction of that failing.
At least, I thought I remembered about 1-2k failures and a 25% failure
rate, so 2k / 0.25 = 8k.

> 
> It may be that the answer here is just “deal with the failures,” but
> maybe there needs to be infrastructure work done before this. jml
> says that it may just be a case of throwing more machines at it,
> as the system is already built to be scalable.

Well, I would assume that growing from 1 puller to 2 pullers would be a
significant growing pain. But growing from there to N pullers would be
mostly a matter of throwing hardware at the problem.

And while the system is probably designed to support >1 pullers, until
you actually have 2 running concurrently, I don't think you can claim
anything :).


> 
> 4) API for creating code imports
> 
> I don't want to set up those 5000 new imports by hand. I also don't
> want to have to maintain it as the locations change and the like.
> Launchpad has an API to allow scripts to be written to manipulate it.
> It would be great if that could be used to avoid doing it all by
> hand. Is this just a case where we need to expose something, or
> is there more involved than that?
> 
> Also, I don't think the CHR would be happy if we created 5000 import
> requests tomorrow. Can the review step be removed, or at least
> waived here?
> 
> I imagine that the lifecycle of these things will require locations
> to be changed sometimes, is just requesting a new import the best
> thing there?

I think you need to have some Launchpad interfaces, so that people can
garden their own branches. Gardening needs to be done from time to time.
If we aren't going to do it ourselves, then we need to expose a way for
others to do it.

> 
> 5) API for requesting a code import be tried ASAP
> 
> Do Branch.requestMirror() and Branch.last_mirror_attempt refer to
> importing to the code if the branch is a vcs-imports one?
> 
> If not, can we get an API similar to the above for vcs-imports?
> 
> We would want to say “try now,” and then spend a while waiting
> for an indication it tried to import, so that we could be reasonably
> sure the import was up to date.

It sounds like you want a synchronous API, but probably something like:

startMirror()     # return when the attempt has started, or failed
waitForFinish()   # wait until the current mirroring has finished,
                  # including info about how much has been imported.

Though I have to ask, how important is it to be at the current tip? What
do you want to do if the tip is 'active' and there is more that can be
pulled as soon as you finish the previous pull? Are you going to loop
until convergence?

If you aren't waiting for convergence, is there harm in having an import
be < 24 hours out of date?
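In caller terms, I imagine something like this polling loop, written
against a hypothetical import client (the method names are made up for
illustration, not the real Launchpad API):

```python
import time

# Sketch of the "try now, then wait" flow. Asks for an import, then
# polls until an attempt that started after our request has finished.
# Deliberately does NOT loop to convergence: one pass, then report
# whether the imported tip matched the upstream tip at that moment.

def update_import(client, timeout=600, poll=10):
    """client needs: request_import(), last_import_finished() -> float
    timestamp, and upstream_tip() / imported_tip() -> revision ids."""
    asked_at = time.time()
    client.request_import()
    deadline = asked_at + timeout
    while time.time() < deadline:
        if client.last_import_finished() >= asked_at:
            # Tolerate being slightly behind an active tip rather than
            # looping until convergence.
            return client.imported_tip() == client.upstream_tip()
        time.sleep(poll)
    raise TimeoutError("import did not complete in time")
```

The convergence question above shows up as the return value here: a
False result means the tip moved again while we were importing, and the
caller has to decide whether that matters.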

> 
> 6) Guessing parent relationships
> 
> We currently infer parent relationships from debian/changelog: if
> you include the changelog entries of another upload, we presume you
> merged its changes.

What about the imports that are from upstream (and presumably don't have
a debian/ directory at all)?

> 
> We will need to start inferring parent relationships in some cases
> though, as there are some workflows that mean the uploaded code is
> never committed exactly as a single revision. (Such as never
> committing the revision that changes the target from UNRELEASED
> to unstable, or files modified in the clean target.)
> 
> The heuristics shouldn't have to be too fuzzy, but any fuzziness
> makes me a little nervous, do the bzr developers agree? Do you
> have any advice on how to do it well, so that it doesn't cause
> mis-merges and the like?

Merging content is generally a bit fuzzy anyway. Which is one reason why
we don't auto-commit it...

I don't have great answers here, but I'm guessing we'll have to be
satisfied with a 75% solution, because you really can't do much better
than that.


> 
> 7) Migration over branch history rewrites
> 
> In order to include new history in to the branch we need to
> rewrite their history. This means changing revision ids.
> 
> In order to make the new branches mergeable with existing other
> branches we need to change file ids.
> 
> We can do this fine for all the branches we control, but it
> will instantly make developers' local branches unrelated.

You can always mark them as merges rather than throwing away that
history. But then you have to carry around all the extra history...

> 
> Therefore we need to provide a way to change a local branch
> to make it mergeable again.
> 
> This means rewriting all the revision ids using a map that
> we create when we import the new branches, and generating
> new ones for revisions not in the map.
> 
> It also means rewriting file ids for all revisions not in the
> map, and any working trees, for revisions not in the map.
> 
> There's obviously a lot of risk in this, and also a whole
> lot of code needed to do this well. Jelmer said that
> bzr-rewrite(?) already contains some code to do something like
> this, so we will be able to start from there.
> 
> Here I'm looking for advice on how to do this well, and also
> things like how to distribute the maps and where to put the code
> to do the work. I'm also very keen on any suggestions you
> may have for doing this in a way that avoids these issues.

Well, you can store the maps in the revision graph, or you could weaken
that and just store them as a revision property (which is mostly what
bzr-rewrite and bzr-svn/git/hg are doing today, IIRC).
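Applying such a map to a local ancestry graph is conceptually simple;
an illustrative sketch only (bzr-rewrite's real code also has to
rewrite file ids and working trees, which is where the risk lives):

```python
# Sketch of applying a published flag-day revid map to a local graph.
# Revisions in the map get their new ids; local-only revisions get
# fresh ids derived from the old ones. The graph is modelled as
# {revid: [parent_revids]}.

def remap_graph(graph, revid_map):
    def new_id(old):
        # Suffix for unmapped (local) revisions is illustrative only.
        return revid_map.get(old, old + "-remapped")
    return {new_id(rev): [new_id(p) for p in parents]
            for rev, parents in graph.items()}
```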

> 
> Also, we have a terrible user experience on the flag day:
> 
>   # day before
>   $ bzr pull
>   ...
>   # flag day
>   $ bzr pull
>   bzr: ERROR: there is no common ancestor.
> 
> any suggestions on how to improve on that would be gratefully
> received.
> 
> 
> 
> That's all I have for now. Thanks to anyone that read this far,
> your input will be valued.
> 
> Thanks,
> 
> James
> 

If you put it in the rev graph, then pull 'just works', but if you are
changing file-ids, then they get 2x the history, their repo gets big,
and we touch every file on their filesystem.

Though if you don't include the revision graph, then pull fails, they
have to start a new fetch, and their repo doubles in size (or they just
start a new repo, but still...)

You could always have "bzr flag-day-pull" or some sort of command that
knows that the ancestry is being re-written, and to pull across based on
that.

John
=:->



More information about the ubuntu-distributed-devel mailing list