[MERGE] Merge directive format 2, Bundle format 4
John Arbash Meinel
john at arbash-meinel.com
Fri Jun 29 21:26:50 BST 2007
Aaron Bentley wrote:
> Hi all,
>
> I've got the new merge directive and bundle formats feature-complete.
>
> If we're going to get them into Bazaar 0.18, they'll need review pretty
> quickly. Unfortunately, they depend on the new container format, so
> they can't go in before that.
>
> Anyhow, here goes nothing:
>
> Changes in role
> ===============
> Merge directives now provide the human-readable aspects that bundles
> had. Bundles are now simply a revision transport format. Merge
> directives provide patch previews, and verify that the preview matches
> the requested change.
As I understand it, that means you would now need to use
"bzr merge-directive > ../something" whereas before you would use
"bzr bundle > ../something".
Also, I think 'merge-directive' has a different set of ideas about which
branch you are actually basing your changes on. At least from what
'bzr help merge-directive' says, you *have* to supply a submit branch;
it won't fall back to the parent branch.
So the semi-common case of:
bzr branch http://bazaar-vcs.org/bzr/bzr.dev mystuff
cd mystuff
<hack hack, commit, commit>
bzr bundle > ../mystuff.patch
won't work (you have to type the source URL again). That's a bit of
extra work when you have a small change.
It doesn't specifically affect me, because after the 'bzr cbranch'
changes I have to specify the source all the time anyway, so I just have
a shell alias (since my source is *always* bzr.dev).
So in general, it seems like 'bzr merge-directive' is going to be more
to type than 'bzr bundle', both because the command itself is longer and
because it generally needs a source URL.
The help for merge-directive should be clarified, since presumably you
aren't recommending it as the way to send patches to a mailing list (it
doesn't give you a chance to write a discussion).
I'm still not 100% happy with how merge directives verify their texts.
Basing it on a heuristic delta format seems incorrect. I guess patience
diff is at least deterministic, and while it uses a dict (which can
randomize sorting) I don't think it uses it in ways that would affect
the output. (We've run into issues with different platforms having
different sort orders for the same version of Python.)
So it is probably ok, enough that I'm okay having it. It just feels like
several layers of "fuzziness" trying to guarantee that your data is correct.
>
> Performance
> ===========
> The new bundles are much faster to generate. I know how to make bundle
> installation really fast, but I don't have time to do that before 0.18.
This is quite nice.
>
> The new bundles are much smaller in the vast majority of situations.
> They use a multi-parent delta format, they use bzip2, and they compress
> the entire file at once. The combination of merge-directive and bundle
> will not be smaller than a single-commit v09 delta.
>
Because they now have the data twice, right? (Once as the human-readable
diff, and once as the bzip2+base64 hunk.)
> Features
> ========
> Merge directives can now specify a cherrypick, and merge will honor
> cherrypicks. Revisions containing tree-references can now be included
> in a bundle. Formats are well-documented. New bundle-info command
> provides statistics and other data about bundles.
> Repository.create_bundle method provides a targeted API.
>
...
v- Generally, you seem to be missing docstrings on most of your
functions. Several of them would be a little more obvious if you
documented what the parameters were. (For example 'repo_kind': I
initially thought this was the repository kind (knit1/knit3), but it is
actually file/revision/etc.)
+
+from cStringIO import StringIO
+import bz2
+
+from bzrlib import (
+    diff,
+    errors,
+    iterablefile,
+    multiparent,
+    osutils,
+    pack,
+    revision as _mod_revision,
+    trace,
+    xml_serializer,
+    )
+from bzrlib.bundle import bundle_data, serializer
+from bzrlib.util import bencode
+
+
+class BundleWriter(object):
+
+    def __init__(self, fileobj):
+        self._container = pack.ContainerWriter(self._write_encoded)
+        self._fileobj = fileobj
+        self._compressor = bz2.BZ2Compressor()
+
+    def begin(self):
+        self._fileobj.write(serializer._get_bundle_header('4alpha'))
+        self._fileobj.write('#\n')
+        self._container.begin()
^- 4alpha looks like something that needs to be cleaned up before it is
merged into core. Can you move it out somewhere else as a constant to
make it easier to find? And maybe keep a note somewhere so we don't forget.
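Something like this (a hypothetical standalone sketch; the constant name is
invented and the header text is assumed, not copied from bzrlib) would make
it greppable:

```python
# Hypothetical sketch of the suggested cleanup: name the format marker as
# a module-level constant instead of burying the '4alpha' literal in begin().
BUNDLE_FORMAT_MARKER = '4alpha'  # TODO: finalize before format 4 is released


def get_bundle_header(version):
    # stand-in for serializer._get_bundle_header; header text is assumed
    return '# Bazaar revision bundle v%s\n' % version


print(get_bundle_header(BUNDLE_FORMAT_MARKER))
```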
...
v- You have 5 kinds, and 4 of them cannot have file ids. It seems better to:
    if content_kind == 'file':
        assert file_id is not None
    else:
        assert file_id is None
+    @staticmethod
+    def encode_name(content_kind, revision_id, file_id=None):
+        assert content_kind in ('revision', 'file', 'inventory',
+                                'signature', 'info')
+        if content_kind in ('revision', 'inventory', 'signature', 'info'):
+            assert file_id is None
+        else:
+            assert file_id is not None
+        if content_kind == 'info':
+            assert revision_id is None
+        else:
+            assert revision_id is not None
+        names = [content_kind]
+        if revision_id is not None:
+            names.append(revision_id)
+        if file_id is not None:
+            names.append(file_id)
+        return '/'.join(names)
^- I don't believe that '/' is strictly forbidden in revision ids,
though I'm not positive. It *is* forbidden for file ids as a side effect
of our 'Store' objects. It might have been forbidden for older revision
ids, as they were stored in a similar way.
But this would be an argument for using whitespace as the separator
instead, given a container format change that allowed characters which
*are* explicitly forbidden in file ids and revision ids.
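To make that concern concrete, here is a standalone sketch (mimicking the
quoted encode_name/decode_name logic, not calling into bzrlib) of how a
revision id containing '/' would be mis-split on decode:

```python
def encode_name(content_kind, revision_id=None, file_id=None):
    # Join the parts with '/', like the patch's encode_name.
    names = [content_kind]
    if revision_id is not None:
        names.append(revision_id)
    if file_id is not None:
        names.append(file_id)
    return '/'.join(names)


def decode_name(name):
    # Split on '/', like the patch's decode_name.
    names = name.split('/')
    content_kind = names[0]
    revision_id = names[1] if len(names) > 1 else None
    file_id = names[2] if len(names) > 2 else None
    return content_kind, revision_id, file_id


# A plain revision id round-trips cleanly:
print(decode_name(encode_name('revision', 'rev-1')))
# ('revision', 'rev-1', None)

# A revision id containing '/' does not: the tail is misread as a file id.
print(decode_name(encode_name('revision', 'joe@example.com-20070629/x')))
# ('revision', 'joe@example.com-20070629', 'x')
```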
+
+    def _add_record(self, bytes, metadata, repo_kind, revision_id, file_id):
+        name = self.encode_name(repo_kind, revision_id, file_id)
+        encoded_metadata = bencode.bencode(metadata)
+        self._container.add_bytes_record(encoded_metadata, [name])
+        if metadata['storage_kind'] != 'header':
+            self._container.add_bytes_record(bytes, [])
^- Is it possible for any of the metadata values to be None? I at least
thought there was a limitation in bencode in that it couldn't represent
None, and would fail with an exception if you tried.
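For reference, here is a minimal bencode encoder sketch (not
bzrlib.util.bencode) illustrating the point: the format only defines
integers, strings, lists, and dicts, so None has no encoding at all.

```python
def bencode(obj):
    """Minimal bencode sketch: i<digits>e, <len>:<bytes>, l...e, d...e."""
    if isinstance(obj, int):
        return 'i%de' % obj
    if isinstance(obj, str):
        return '%d:%s' % (len(obj), obj)
    if isinstance(obj, list):
        return 'l%se' % ''.join(bencode(x) for x in obj)
    if isinstance(obj, dict):
        # keys must be strings, emitted in sorted order for determinism
        items = sorted(obj.items())
        return 'd%se' % ''.join(bencode(k) + bencode(v) for k, v in items)
    raise TypeError('cannot bencode %r' % (obj,))


print(bencode({'storage_kind': 'header'}))
# d12:storage_kind6:headere
```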
+
+
+class BundleReader(object):
+
+    def __init__(self, fileobj):
+        line = fileobj.readline()
+        if line != '\n':
+            fileobj.readline()
+        self.patch_lines = []
+        self._container = pack.ContainerReader(
+            iterablefile.IterableFile(self.iter_decode(fileobj)))
+
+    @staticmethod
+    def iter_decode(fileobj):
+        decompressor = bz2.BZ2Decompressor()
+        for line in fileobj:
+            yield decompressor.decompress(line)
^- I find it a little odd that Bundles are *defined* as bzip2 streams,
rather than that just being a 'how it was used' implementation detail.
(bzr bundle would generate bz2 streams, as would 'bzr merge-directive',
but it wouldn't be an intrinsic property of the bundle itself). There
certainly are cases where it would be faster to not use bz2 (local
network, etc).
As the only current use case is for a bz2 stream, I'm not super sticky
on this. It just feels like it is being done at the wrong level.
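For what it's worth, the line-at-a-time decode above works because bz2's
decompressor is incremental; a small standalone sketch of the same pattern:

```python
import bz2

# A BZ2Decompressor can be fed arbitrary chunks (for a bundle, lines of
# the compressed stream) and yields plaintext incrementally, so the whole
# payload never has to be decompressed into memory at once.
payload = b'some bundle content\n' * 100
compressed = bz2.compress(payload)

decompressor = bz2.BZ2Decompressor()
pieces = []
for i in range(0, len(compressed), 16):  # feed 16-byte chunks
    pieces.append(decompressor.decompress(compressed[i:i + 16]))
result = b''.join(pieces)

print(result == payload)
# True
```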
+
+    @staticmethod
+    def decode_name(name):
v- See my earlier discussion as to why this *may* be unsafe.
+        names = name.split('/')
+        content_kind = names[0]
+        revision_id = None
+        file_id = None
+        if len(names) > 1:
+            revision_id = names[1]
+        if len(names) > 2:
+            file_id = names[2]
+        return content_kind, revision_id, file_id
+
+    def iter_records(self):
+        iterator = self._container.iter_records()
v- If a record happens to contain more than one name, this will fail
with a really weird error. As you are processing stuff generated by the
user, I would recommend checking the names and raising a real exception
if there is more than one name for a section.
+        for (name,), meta_bytes in iterator:
+            metadata = bencode.bdecode(meta_bytes(None))
+            if metadata['storage_kind'] == 'header':
+                bytes = None
+            else:
+                _unused, bytes = iterator.next()
+                bytes = bytes(None)
+            yield (bytes, metadata) + self.decode_name(name)
^ So each effective object has 2 records, right? One meta record, and
one content record.
Also, it would probably be better to return the 'bytes_reader()' rather
than the actual string of bytes, so that we don't have to hold the whole
text in memory. But I'm willing to accept it in this pass.
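If I'm reading it right, the pairing can be modeled like this (a standalone
sketch with invented record names and payloads, not the container API):

```python
def pair_records(records):
    """Yield (name, metadata, content) triples from a flat record stream.

    Each named metadata record is followed by an anonymous content record,
    except 'header' entries, which have no content record at all.
    """
    iterator = iter(records)
    for name, metadata in iterator:
        if metadata.get('storage_kind') == 'header':
            yield name, metadata, None
        else:
            # the very next record carries the content, with no name
            unused_name, content = next(iterator)
            yield name, metadata, content


# Invented example stream: one header record, then one metadata/content pair.
stream = [
    ('info', {'storage_kind': 'header'}),
    ('revision/rev-1', {'storage_kind': 'mpdiff'}),
    (None, 'revision text for rev-1'),
]
print(list(pair_records(stream)))
```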
+
+
+class BundleSerializerV4(serializer.BundleSerializer):
+
+    def write(self, repository, revision_ids, forced_bases, fileobj):
+        write_op = BundleWriteOperation.from_old_args(repository,
+            revision_ids, forced_bases, fileobj)
+        return write_op.do_write()
+
+    def write_bundle(self, repository, target, base, fileobj):
+        write_op = BundleWriteOperation(base, target, repository, fileobj)
+        return write_op.do_write()
+
+    def read(self, file):
+        bundle = BundleInfoV4(file, self)
+        return bundle
+
+    @staticmethod
+    def get_source_serializer(info):
+        return xml_serializer.format_registry.get(info['serializer'])
+
v- I don't quite understand why there is a BundleSerializer separate
from the BundleWriter separate from the BundleWriteOperation.
It might help if there were docstrings.
+
+class BundleWriteOperation(object):
+
+    @classmethod
+    def from_old_args(cls, repository, revision_ids, forced_bases, fileobj):
+        base, target = cls.get_base_target(revision_ids, forced_bases,
+                                           repository)
+        return BundleWriteOperation(base, target, repository, fileobj,
+                                    revision_ids)
+
Unfortunately, I'm running out of time to review the rest (I have to put
together a presentation to give this weekend). But I figured I could
send you what I did complete.
John
=:->