Understanding pull

Mon Mar 26 16:06:20 BST 2007

Ian Clatworthy wrote:
...

>>
> So trying to summarise the various development models and matching
> Bazaar 'recipes' in my head, I initially found myself at a loss to
> understand why pull existed at all:
> 
> 1. central repository model: checkout + update + commit(local or central)
> 2. distributed repository model: branch + merge + commit(local) + bundle
> + email to gatekeeper
> 
> There are excellent reasons for merging from the 'master' code base and
> retesting before committing. But there are also plenty of times when the
> best time to resync your working tree is immediately *after* completing
> one fix before you start on the next. 'pull' would be ok in the former
> case but almost always fail in the latter case - given the "as long as
> local changes aren't committed" rule.
> 
> So what recipes involve pull? Do developers use it commonly in
> day-to-day development? I can see its applicability when I want a
> pristine local mirror of a master repository in order to run/test
> against when reporting bugs, say. But even in that case, what is pull
> buying me that merge isn't?
> 
> Ian C.
> 
> PS: Apologies if the questions above are dumb ones. I am truly impressed
> by just how flexible and powerful bazaar is. But with that power comes
> the need for multiple recipes for beginners where previously just one
> 'sufficed' most of the time. The barrier to entry is low IMHO except in
> this area: CVS just has 'update' while bazaar has update/merge/pull to
> choose from.
> 

No apologies needed. While we may not have exact answers all of the
time, it is questions like this which should be clarified, so that we
all understand things a bit better. (It is easy to get used to things,
and lose track of what it all means).

Having written this, it seems like it should be a FAQ or Wiki, or
something that we can refer people to.

So as far as "pull" versus "merge"...

One thing bzr lets you do is maintain a "branch", which acknowledges a
difference between commits that you have merged and ones that you
generated yourself. We frequently refer to these as "merged" versus
"mainline".

One way to see this difference is to do "bzr log --long" on a bzr.dev
tree, versus "bzr log --short". --long will show you all of the merged
revisions, while --short only shows you the mainline, which is generally
a summary of what was merged into bzr.dev.
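
As a rough illustration of that distinction, here is a toy model in
Python. It is only a sketch and not bzrlib code: the revision ids, the
PARENTS table and both helper functions are made up for this example.
The one fact it relies on is that the mainline is the chain of
left-most (first) parents, while merged revisions are reachable only
through the other parents.

```python
# Toy revision graph (hypothetical ids): revision -> list of parents.
# The first parent is the branch's own previous revision; any further
# parents are the tips that were merged in by that commit.
PARENTS = {
    "main-1": [],
    "feat-1": ["main-1"],
    "feat-2": ["feat-1"],
    "main-2": ["main-1", "feat-2"],  # merge commit summarising feat-1..2
    "main-3": ["main-2"],
}


def mainline(tip):
    """Follow only first parents: roughly what a short log lists."""
    revs = []
    while tip is not None:
        revs.append(tip)
        parents = PARENTS[tip]
        tip = parents[0] if parents else None
    return revs


def ancestry(tip):
    """Every revision reachable from tip, mainline and merged alike."""
    seen, todo = set(), [tip]
    while todo:
        rev = todo.pop()
        if rev not in seen:
            seen.add(rev)
            todo.extend(PARENTS[rev])
    return seen


main = mainline("main-3")
merged = sorted(ancestry("main-3") - set(main))
print("mainline:", main)
print("merged:", merged)
```

In this model the two "feat" revisions are kept in the history, but
they only show up when you ask for the full ancestry.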

I don't know about other people, but I *really* like 'bzr log -r -10..-1
--short --forward', enough so that I've aliased it to "bzr log". It
gives me a nice summary of the last 10 changes on a branch. And usually
that summary fills about 1 screen-full.

So why the distinction between a mainline commit and a merged commit?
We have a few use cases for it...

1) "These are the patches reviewed by me". I don't review every single
change someone makes. But I *do* review the merge before I commit it.

2) "Every commit on the mainline passes the test suite". This is a
pretty big one for our bzr.dev process.

3) "Summary of changes". It gives an obvious place to summarize the many
(potentially hundreds or more) commits someone made. You frequently want
to keep all of those hundreds of revisions around, because it gives you
nice, fine grained details about things that have changed. (Useful for
annotate, or any sort of digging that needs to be done).

But having a single summary revision also helps the people who *don't*
want to wade through 100 revisions just to understand "Implemented
bound branches".

'bzr pull' is generally a statement of "I want an exact copy of the
other branch", versus 'bzr merge' "I want to include the changes from
that other branch".

There are people who don't really care about maintaining a mainline, or
about the summary commits, which is why we have "bzr merge --pull". It
is a statement of "I just want those revisions, pull if you can, if not
merge, and I'll commit". That is actually (AFAIK) the only real
workflow that 'git' lets you use, since its merge is always a
fast-forward if possible. Also, git doesn't seem to prefer the "merge,
review, commit" workflow, because 'merge' automatically updates your
branch history (aka commits the changes). Their workflow is
"merge+commit; maybe review and commit --amend".

I hope I've made clear why we need at least 2 commands: users can give
bzr a hint of what their intentions are, and bzr can try to choose the
best strategy. ('bzr pull' fails if the branches have diverged, because
we *can't* make an exact copy. People have argued that it should fall
back to 'merge', but they don't realize the potential problems with
uncommitted changes.)
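
The three intentions can be sketched in Python. This is a toy model
under stated assumptions, not how bzrlib implements anything: the
graph, the revision ids, the Diverged exception and the function names
are all hypothetical, and merge() here stands for "merge, then the user
commits".

```python
class Diverged(Exception):
    """Raised when an exact copy of the other branch is impossible."""


def ancestry(graph, tip):
    """All revisions reachable from tip by following parents."""
    seen, todo = set(), [tip]
    while todo:
        rev = todo.pop()
        if rev not in seen:
            seen.add(rev)
            todo.extend(graph[rev])
    return seen


def pull(graph, local_tip, other_tip):
    """'I want an exact copy': only move the tip, never create history."""
    if local_tip in ancestry(graph, other_tip):
        return other_tip  # other is ahead of us: adopt its tip
    if other_tip in ancestry(graph, local_tip):
        return local_tip  # we already contain other: nothing to pull
    raise Diverged("%s and %s have diverged" % (local_tip, other_tip))


def merge(graph, local_tip, other_tip, new_id):
    """'Include their changes': a new mainline revision with 2 parents."""
    graph[new_id] = [local_tip, other_tip]
    return new_id


def merge_pull(graph, local_tip, other_tip, new_id):
    """'Pull if you can, if not merge' (the merge --pull intention)."""
    try:
        return pull(graph, local_tip, other_tip)
    except Diverged:
        return merge(graph, local_tip, other_tip, new_id)


graph = {"base": [], "a": ["base"], "b": ["base"]}
print(pull(graph, "base", "b"))  # local has no commits of its own
try:
    pull(graph, "a", "b")  # both sides committed on top of "base"
except Diverged as exc:
    print("pull refused:", exc)
print(merge_pull(graph, "a", "b", "a+b"))
```

The model leaves out working trees entirely, so it does not capture
the uncommitted-changes problem that makes an automatic fallback from
pull to merge risky.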

So, on to why we have 'bzr pull' versus 'bzr update'.

#1 reason... hysterical raisins (historical reasons). 'bzr pull'
existed long before 'bzr update' did, because 'update' really only does
something useful when you have a checkout (bound branch). And it wasn't
until even more recently that you could have a checkout of a readonly
location. So if you wanted to mirror
'http://bazaar-vcs.org/bzr/bzr.dev', you had to use 'branch' and 'pull'.

Within the last few versions (0.11 according to NEWS; 'bzr checkout'
itself dates from 0.8), we have gained the ability to do a checkout of
a readonly URL. And I've actually switched all of my mirrors of
'bzr.dev' over to a checkout. (I have one on every machine that I use,
and it is my primary 'bzr' command).

I do this because it makes it *very* clear that it is only meant as a
mirror. I cannot merge or commit in that branch, which means I *know*
it is always an exact mirror of the upstream (barring local uncommitted
changes). So 'bzr checkout' + 'bzr update' could very well take the
place of 'bzr branch' + 'bzr pull'.

There is at least one case that it fails for:

Mirroring known public branches.

Now that we have the bazaar.launchpad.net mirroring branches, I don't
try as hard, but at one point, any branch that someone mentioned I would
add to:
http://bzr.arbash-meinel.com/mirrors/

There were a few goals. One was to be able to see if people are making
changes (I have a cron script to update it a couple times a day, and
email me any changes). Another was just to have a mirror in case hosts
disappear. And the third was to have those revisions locally. At one
point 'pull' was much more expensive (weaves), so that allowed me to
have an automated script do the downloading, rather than having me wait
for it.

I still use it in case things disappear (especially for plugins). Though
I try to recommend people register branches on LP, so that I don't have
to spend my bandwidth updating mirrors.

Anyway, right now 'bzr update' and 'bzr checkout' only work on working
trees. We have no way in the UI to update a bound branch that doesn't
have a working tree. (bzrlib does it fine, and my 'update-mirrors'
plugin does just that). So I can do "bzr checkout FOO; bzr remove-tree
FOO" and I have a bound branch but no way to update it. (pull might
work, but logically it could also fail because you are trying to update
a readonly branch, unless we special-cased a 'pull' from the master
branch).

Some interesting numbers... I have 126 mirrors of just bzr branches. If
each of those had a working tree, it would be about 8MB*126=1GB of disk
space. If we didn't have shared repositories, it would be about
53MB*126=6.6GB of space. On top of that I have 211 of my own bzr
branches (no mirrors), for 337 branches in total. So I can say that
shared repositories with no working trees are very important to me.
Rather than taking approx 60MB*337=20GB, I'm using about 150 MB (I have
a couple repositories, and some standalone branches in there).
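
For anyone who wants to redo the arithmetic, here is a small Python
sketch. It assumes the 8, 53 and 60 figures are megabytes per branch
(the totals only work out that way) and treats them as rough averages.
The variable names are invented for the example.

```python
# Branch counts quoted in the text above.
mirrors = 126
own_branches = 211
total_branches = mirrors + own_branches

# Assumed per-branch sizes in MB (rough averages, see lead-in).
tree_mb = 8         # one working tree
repo_mb = 53        # one standalone, unshared repository
standalone_mb = 60  # tree plus repository, rounded

trees_total_mb = tree_mb * mirrors
repos_total_mb = repo_mb * mirrors
everything_mb = standalone_mb * total_branches
actual_mb = 150     # shared repositories, no working trees

print("working trees for the mirrors:", trees_total_mb, "MB")
print("unshared repositories for the mirrors:", repos_total_mb, "MB")
print("all", total_branches, "branches standalone:", everything_mb, "MB")
print("saving factor: about %d times" % (everything_mb // actual_mb))
```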

John
=:->