[RFC] Knit format 2

Sat Sep 9 14:32:22 BST 2006

Aaron Bentley wrote:
> Hi all,
> 

...

> I have tested bzr with this as the default format, and all tests except
> bundle tests pass.  Those tests that didn't pass at first were usually
> added to the repository_implementations tests, to ensure that all
> formats are adequately tested.
> 
> I believe this can be merged at this stage, barring any quibbles with
> the implementation.  Though probably not for 1.0. :-)
> 
> Remaining work is:
>  - a new bundle format
>  - a new testament format
>  - optimizing the converters
> 
> Thoughts?
> 
> Aaron

This review took far too long. :)

As long as it is not actually setup as the default format, I think it is
reasonable to add it, so we can have some time to play around with it,
and see if it is reasonable.

There are a few things were I see cause for concern. If you can address
them, I think it would be reasonable to fold the rest of this into
bzr.dev, so it can be hacked on and fixed up as we get a chance to
really test it in real life. (contingent +1)

...

=== modified file 'bzrlib/fetch.py'

--- bzrlib/fetch.py	2006-07-28 16:05:23 +0000
+++ bzrlib/fetch.py	2006-08-25 21:49:55 +0000
@@ -281,6 +281,66 @@
         to_rf.join(from_rf, version_ids=revs)


v- I would like to see doc strings on this, so that I can understand
what it is actually trying to do.

It seems really odd to me to have it process specific functions, but be
based on 'object' and have no '__init__' function. Meaning it doesn't
have any of its own state.

Neither the class, nor any of the functions have a docstring.

Also, I realize there is some disagreement, but I think we prefer:

while revs:

to

while len(revs):


Any particular reason why you chose 500 as a magic number? I believe I
understand that you are wanting to take advantage of reading a set of
trees at a time (rather than 1 at a time).

I'm a little concerned that this has to unserialize 500 inventories at a
time. I probably would be a little more comfortable with 100. But we
still will have issues with something like the freebsd tree. Where we
have >5MB tree. So 500*5 == 2.5GB of memory. (In the full freebsd ports
tree, a committed inventory is 30MB in size. So even 100*30MB == 3GB,
and 500* 30=15GB is ludicrously huge).

This is a good argument for for lazy parsing of inventories, though.

It might be better if we had some way to know the estimated inventory
size (maybe read the first one and use it). And then tune the number of
trees to read at a time based on that, rather than having a fixed #. I
realize that converting a freebsd ports tree will never be fun, but it
should be possible.

+class Inter1and2Helper(object):
+
+    def iter_rev_trees(self, revs, source):
+        while len(revs):
+            for tree in source.revision_trees(revs[:500]):
+                if tree.inventory.revision_id is None:
+                    tree.inventory.revision_id = tree.get_revision_id()
+                yield tree
+            revs = revs[500:]
+
+    def generate_root_texts(self, fetcher, revs):
+        inventory_weave = fetcher.from_repository.get_inventory_weave()
+        for tree in self.iter_rev_trees(revs, fetcher.from_repository):
+            parent_texts = {}
+            revision_id = tree.inventory.root.revision
+            parents = inventory_weave.get_parents(revision_id)
+            to_store = fetcher.to_repository.weave_store
+            versionedfile = to_store.get_weave_or_empty(
+                tree.inventory.root.file_id,
+                fetcher.to_repository.get_transaction())
+            parent_texts = versionedfile.add_lines(revision_id,
parents, [],
+                                                   parent_texts)
+        versionedfile.clear_cache()

^-- shouldn't the 'clear_cache()' call be inside the loop? You aren't
even guaranteed to have a versionedfile variable if there are no 'revs'
to be processed.

If you are intending to have the versioned file be persistent and
caching across tree additions, then you probably need to not be creating
a new versioned file object for every revision.

+
+
+    def regenerate_inventory(self, fetcher, revs):
+        inventory_weave = fetcher.from_repository.get_inventory_weave()
+        for tree in self.iter_rev_trees(revs, fetcher.from_repository):
+            parents = inventory_weave.get_parents(tree.get_revision_id())
+            if tree.inventory.revision_id is None:
+                tree.inventory.revision_id = tree.get_revision_id()
+            fetcher.to_repository.add_inventory(tree.get_revision_id(),
+                                                tree.inventory, parents)

^-- tree.inventory.revision_id will never be None here, because you take
care of that in 'iter_rev_trees()' I assume you just were refactoring
stuff and forgot to remove it.

...

-    # _optimisers = set()
-    # Each concrete InterObject type should have its own optimisers set.
+    # _optimisers = list()
+    # Each concrete InterObject type should have its own optimisers list.

^-- what if instead of a list in ordered precedence, we registered them
with a priority level. So that a plugin could register with a higher
priority, but if the core comes out with an even better one, it would be
used. I'm okay with this, but think registration order is a little bit
loose. I realize the code or switching to a list from a set is very
simple. So I'm not sticking on this.

...

-        assert isinstance(self._format, (RepositoryFormat5,
-                                         RepositoryFormat6,
-                                         RepositoryFormat7,
-                                         RepositoryFormatKnit1)), \
+        assert self._format.has_unnested_inventory(), \
             ("fileids_altered_by_revision_ids only supported for
branches "
              "which store inventory as unnested xml, not on %r" % self)
         selected_revision_ids = set(revision_ids)

^- I'm not sure that 'has_unnested_inventory()' is a sufficient
statement. Because the text has to also be in a known XML format. Even
changing the XML format a little bit would potentially break
fileids_altered_by_revision_ids. It is a very fragile function.
I realize it makes it difficult to add a new RepositoryFormat by a
plugin, etc. So we should consider changing this. But we need to be careful.

...

v- Put a '\n' at the end of this string.
It is the raw text that goes into the .bzr/repository/format, and it is
much better to end with a newline.
(There is an interesting bug with zsh when you cat a single line file
without a trailing newline, the prompt will cover up the contents. So
doing 'cat .bzr/repository/format' just gives you a new prompt. As if
the file contents were empty. That was how I found out about our current
format problems :)

...

=== modified file
'bzrlib/tests/repository_implementations/test_fileid_involved.py'
--- bzrlib/tests/repository_implementations/test_fileid_involved.py
2006-07-29 08:23:47 +0000
+++ bzrlib/tests/repository_implementations/test_fileid_involved.py
2006-08-24 00:38:48 +0000
@@ -173,13 +173,21 @@
             self.branch.repository.fileids_altered_by_revision_ids(
                 ['rev-G', 'rev-F', 'rev-C', 'rev-B', 'rev-<D>',
'rev-K', 'rev-J']))

v-- Is it better to strip this? It seems like it may be important to
know if it exists. (Though I realize this is a test against all
repository implementations, it does mean that different repo
implementations now cause different info to be returned. Which is not
really what we want).

+    def fileids_altered_by_revision_ids(self, revision_ids):
+        """This is a wrapper to strip TREE_ROOT if it occurs"""
+        repo = self.branch.repository
+        result = repo.fileids_altered_by_revision_ids(revision_ids)
+        if 'TREE_ROOT' in result:
+            del result['TREE_ROOT']
+        return result
+

...

=== modified file 'bzrlib/xml_serializer.py'
--- bzrlib/xml_serializer.py	2006-07-28 16:05:23 +0000
+++ bzrlib/xml_serializer.py	2006-08-24 02:40:28 +0000
@@ -37,6 +37,7 @@
     import util.elementtree as elementtree

 from bzrlib.errors import BzrError
+from bzrlib import errors


^- the 'from bzrlib import errors' should come first.


v- Should this be catching SyntaxError? It seems the
UnexpectedInventoryFormat() should be raised at the 'assert format == 5'
level, and the rest would be SyntaxError. Or maybe
'InvalidSerializedInventory', or something. But it doesn't seem like all
exceptions should raise UnexpectedInventoryFormat. Since that sound
specifically like the format == 5 vs 6 stuff.

 class Serializer(object):
@@ -50,10 +51,16 @@
         return tostring(self._pack_inventory(inv)) + '\n'

     def read_inventory_from_string(self, xml_string):
-        return self._unpack_inventory(fromstring(xml_string))
+        try:
+            return self._unpack_inventory(fromstring(xml_string))
+        except SyntaxError, e:
+            raise errors.UnexpectedInventoryFormat(e)

     def read_inventory(self, f):
-        return self._unpack_inventory(self._read_element(f))
+        try:
+            return self._unpack_inventory(self._read_element(f))
+        except SyntaxError, e:
+            raise errors.UnexpectedInventoryFormat(e)

     def write_revision(self, rev, f):
         self._write_element(self._pack_revision(rev), f)



John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060909/638cb128/attachment.pgp