[MERGE] Merge directive format 2, Bundle format 4

Fri Jun 29 22:24:42 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> Aaron Bentley wrote:

> As I understand it, that means you would now need to use "bzr
> merge-directive > ../something" whereas before you would use "bzr bundle
>> ../something".

You're right.  We should provide an option to write to a file.

> Also, I think 'merge-directive' has a different set of ideas about what
> branches you are actually basing your changes on. At least from what
> 'bzr help merge-directive' says, you *have* to supply a submit branch.
> It won't use parent.

Actually, it will use parent:
3563        if submit_branch is None:
3564            submit_branch = branch.get_parent()

Looks like a documentation problem.

> It doesn't specifically effect me because after the 'bzr cbranch'
> changes, I have to specify the source all the time, so I just have a
> shell alias (since my source is *always* bzr.dev).

You should be able to set submit_branch in locations.conf.

> So in general, it seems like 'bzr merge-directive' is going to be more
> to type than 'bzr bundle'. (Both because it is longer, and because it
> generally needs a source url).

I think the command name's a bit too long also.  Any suggestions?

> The help for merge-directive should be clarified, since you aren't
> recommending it as the way to send patches to a mailing list, since it
> doesn't give you a chance to write a discussion.

Okay.

> I'm still not 100% happy with how merge directives verify their texts.
> Basing it on a heuristic delta format seems incorrect.

What I'm trying to verify is that the preview patch matches the actual
requested changes.  Are you uncomfortable with the goal or the strategy?

The only other strategy I can think of is to apply the preview patch,
record the result, then see if applying the requested changes produces
the same result.  I think that would be much harder to do right, due to
whitespace munging.  And not worth it, since patience diff is deterministic.

> So it is probably ok, enough that I'm okay having it. It just feels like
> several layers of "fuzziness" trying to guarantee that your data is correct.

>> The new bundles are much smaller in the vast majority of situations.
>> They use a multi-parent delta format, they use bzip2, and they compress
>> the entire file at once.  The combination of merge-directive and bundle
>> will not be smaller than a single-commit v09 delta.
>>
> 
> Because they now have the data twice, right? (Once as the human-readable
> diff, and once as the the bzip2+base64 hunk).

Right.  Once you get up to two commits, you're probably neck-and-neck.
Unless you have merges, which would favour bundle 4.

>> Features
>> ========
>> Merge directives can now specify a cherrypick, and merge will honor
>> cherrypicks.  Revisions containing tree-references can now be included
>> in a bundle.  Formats are well-documented.  New bundle-info command
>> provides statistics and other data about bundles.
>> Repository.create_bundle method provides targetted API.

Oh, I also forgot to mention that they are immune to mail-transport damage.

> v- Generally, you seem to be missing doc strings on most of your
> functions. Several of them would be a little more obvious if you
> documented what the parameters were. (For example 'repo_kind', I
> initially thought this was the repository kind (knit1/knit3), but it is
> actually file/revision/etc.)

Okay, will fix.

> +    def begin(self):
> +        self._fileobj.write(serializer._get_bundle_header('4alpha'))
> +        self._fileobj.write('#\n')
> +        self._container.begin()
> 
> 
> ^- 4alpha looks like something that needs to be cleaned up before it is
> merged into core. Can you move it out somewhere else as a constant to
> make it easier to find? And maybe keep a note somewhere so we don't forget.

You bet.

> v- You have 5 kinds, and 4 of them cannot have file ids. It seems better to:
> 
> if content_kind == 'file_id':
>     assert file_id is not None
> else:
>     assert file_id is None

Sure.

> +        return '/'.join(names)
> 
> ^- I don't believe that '/' is strictly forbidden in revision ids,
> though I'm not positive. It *is* forbidden for file ids as a side effect
> of our 'Store' objects. It might have been forbidden for older revision
> ids, as they were stored in a similar way.
> 
> But this would be an argument for using whitespace, given a container
> change that allowed characters that are explicitly denied to file ids
> and revision ids.

I like the whitespace idea.  If that doesn't work out, I guess I can put
the revision at the end of the name.

> +    def _add_record(self, bytes, metadata, repo_kind, revision_id,
> file_id):
> +        name = self.encode_name(repo_kind, revision_id, file_id)
> +        encoded_metadata = bencode.bencode(metadata)
> +        self._container.add_bytes_record(encoded_metadata, [name])
> +        if metadata['storage_kind'] != 'header':
> +            self._container.add_bytes_record(bytes, [])
> 
> 
> ^- Is it possible for anything to be None? I at least thought there was
> a bug in bencode in that it couldn't represent None. And would fail with
> an exception if you tried.

So far, we're storing parents, storage_kind, sha1, serializer_format,
and supports_rich_root.  It's not legal for any of those to be None.
(The closest is parents, which can be an empty list.)

> ^- I find it a little odd that Bundles are *defined* as bzip2 streams,
> rather than that just being a 'how it was used' implementation detail.

If the bundles were completely bzip2ed, then old clients would raise
"NotABundle" when they encountered them.

Starting the bundle with "# Bundle format v4" in plaintext allows old
clients to handle new bundles correctly.  But starting with plaintext
means making the bzipping part of the bundle format.

Also, there's no great advantage to allowing bundles to be unbzip2ed,
since they're essentially single-use files.  Once they're installed in
your repository, you don't need 'em anymore.

> v- If a record happens to contain more than one name, this will fail
> with a really weird error. As you are processing stuff generated by the
> user, I would recommend catching 'names' and having a real exception if
> there is more than one name for a section.

I guess we can raise BadBundle, since that's a format error.

> +            if metadata['storage_kind'] == 'header':
> +                bytes = None
> +            else:
> +                _unused, bytes = iterator.next()
> +                bytes = bytes(None)
> +            yield (bytes, metadata) + self.decode_name(name)
> 
> ^ So each effective object has 2 records, right? One meta record, and
> one content record.

Except header records, which are all-metadata.

> Also, it would probably be better to return the 'bytes_reader()' rather
> than the actual string of bytes, so that we don't have to hold the whole
> text in memory. But I'm willing to accept it in this pass.

We *must* be able to hold each record in memory.  The repository APIs
require us to deal in memory chunks.

> V- I don't quite understand why there is a BundleSerializer separate
> from the BundleWriter separate from the BundleWriteOperation.

The BundleSerializer is glue code to support the existing bundle API.

The BundleWriter transforms the raw ContainerWriter API into a
higher-level one for writing bundles in particular.

The BundleWriteOperation is the bridge between repositories and Bundle
containers.  It determines what should be written, retrieves it from the
repository, formats it as necessary, and requests the BundleWriter to
store it.

> It might help if there were docstrings.

Okay.

> Unfortunately, I'm running out of time to review the rest (I have to put
> together a presentation to give this weekend). But I figured I could
> send you what I did complete.

Thanks.  I'll update with those changes.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGhXia0F+nu1YWqI0RAjyWAJ9J8LHYxS5KnY0WYoIGIGZtS15wFACfdDKq
O63QY7U2byvhvzI3HEsBbxk=
=mo6q
-----END PGP SIGNATURE-----