[RFC] Knit format 2

Sat Sep 9 21:40:08 BST 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:

> This review took far too long. :)

Thank you for doing it.

> === modified file 'bzrlib/fetch.py'
> --- bzrlib/fetch.py	2006-07-28 16:05:23 +0000
> +++ bzrlib/fetch.py	2006-08-25 21:49:55 +0000
> @@ -281,6 +281,66 @@
>          to_rf.join(from_rf, version_ids=revs)
> 
> 
> v- I would like to see doc strings on this, so that I can understand
> what it is actually trying to do.
> 
> It seems really odd to me to have it process specific functions, but be
> based on 'object' and have no '__init__' function. Meaning it doesn't
> have any of its own state.

Future enhancements may give it state.  We currently iterate through
inventories twice: once to determine the root id for creating
versionedfiles, and once to create a new inventory knit.  There's got to
be a way to combine those steps.

> Neither the class, nor any of the functions have a docstring.

Okay, but they will probably sound a bit repetitive.

> Also, I realize there is some disagreement, but I think we prefer:
> 
> while revs:
> 
> to
> 
> while len(revs):

Well, I actually prefer len(revs) != 0.  I think Python's boolean
evaluation is too loose.  Combined with dynamic typing, you don't know
whether
a. revs is a boolean that may be true or false
b. revs is a number that may be 0 or non-0
c. revs is a sequence (possibly a string) that may be empty or non-empty

But I'll change it for the sake of focusing on more important things.

> Any particular reason why you chose 500 as a magic number? 

By experimentation with the Bazaar tree, I found that at 500 revisions,
there was a noticeable, but not significant, increase in memory use.  So
I thought that even a large tree would not exhaust memory keeping 500
revisions at once.

> I believe I
> understand that you are wanting to take advantage of reading a set of
> trees at a time (rather than 1 at a time).

Right
  > I'm a little concerned that this has to unserialize 500 inventories at a
> time. I probably would be a little more comfortable with 100. But we
> still will have issues with something like the freebsd tree. Where we
> have >5MB tree. So 500*5 == 2.5GB of memory. (In the full freebsd ports
> tree, a committed inventory is 30MB in size. So even 100*30MB == 3GB,
> and 500* 30=15GB is ludicrously huge).

Hmm.  Perhaps we should cap the number of revisions that log -v will
load at once, because it too is capable of loading hundreds of
inventories at once.  The problem is that there's a clear performance
advantage to doing so.

I will drop this to 100.

> This is a good argument for for lazy parsing of inventories, though.
> 
> It might be better if we had some way to know the estimated inventory
> size (maybe read the first one and use it). And then tune the number of
> trees to read at a time based on that, rather than having a fixed #. I
> realize that converting a freebsd ports tree will never be fun, but it
> should be possible.

Full-ACK.  Without that info, all we can do is use the lowest common
denominator.  And if we drop the number too low, then we trade
small-tree performance against the ability to use large trees.  That
said, I don't think it would be terrible if, in order to upgrade the
ports tree, you had to set up a big swap file.

> +class Inter1and2Helper(object):
> +
> +    def iter_rev_trees(self, revs, source):
> +        while len(revs):
> +            for tree in source.revision_trees(revs[:500]):
> +                if tree.inventory.revision_id is None:
> +                    tree.inventory.revision_id = tree.get_revision_id()
> +                yield tree
> +            revs = revs[500:]
> +
> +    def generate_root_texts(self, fetcher, revs):
> +        inventory_weave = fetcher.from_repository.get_inventory_weave()
> +        for tree in self.iter_rev_trees(revs, fetcher.from_repository):
> +            parent_texts = {}
> +            revision_id = tree.inventory.root.revision
> +            parents = inventory_weave.get_parents(revision_id)
> +            to_store = fetcher.to_repository.weave_store
> +            versionedfile = to_store.get_weave_or_empty(
> +                tree.inventory.root.file_id,
> +                fetcher.to_repository.get_transaction())
> +            parent_texts = versionedfile.add_lines(revision_id,
> parents, [],
> +                                                   parent_texts)
> +        versionedfile.clear_cache()
> 
> ^-- shouldn't the 'clear_cache()' call be inside the loop? You aren't
> even guaranteed to have a versionedfile variable if there are no 'revs'
> to be processed.

There's probably no advantage in the call at all.  IIRC, I copied that
code from elsewhere.  I will zap it.

If there were an advantage, it would make sense to do the call at the
end, because the versionedfile will almost always be the same.

I think I see some other crack in that function, though: the
parents_texts shouldn't be initialized to {} inside the loop.

> If you are intending to have the versioned file be persistent and
> caching across tree additions, then you probably need to not be creating
> a new versioned file object for every revision.

There's a small chance that it's a different versionedfile every time.
I'll rewrite this a bit.

> +    def regenerate_inventory(self, fetcher, revs):
> +        inventory_weave = fetcher.from_repository.get_inventory_weave()
> +        for tree in self.iter_rev_trees(revs, fetcher.from_repository):
> +            parents = inventory_weave.get_parents(tree.get_revision_id())
> +            if tree.inventory.revision_id is None:
> +                tree.inventory.revision_id = tree.get_revision_id()
> +            fetcher.to_repository.add_inventory(tree.get_revision_id(),
> +                                                tree.inventory, parents)
> 
> ^-- tree.inventory.revision_id will never be None here, because you take
> care of that in 'iter_rev_trees()' I assume you just were refactoring
> stuff and forgot to remove it.

Old inventories didn't have revision_id set, and the deserialising code
doesn't know what the revision_id should be.  I'm pretty sure it can be
None.  Why do you think it can't?

> ^-- what if instead of a list in ordered precedence, we registered them
> with a priority level. So that a plugin could register with a higher
> priority, but if the core comes out with an even better one, it would be
> used. I'm okay with this, but think registration order is a little bit
> loose. I realize the code or switching to a list from a set is very
> simple. So I'm not sticking on this.

I think using a priority makes policy a lot more complicated.  Once we
get the possibility of a series of converters, things will really get
messy.  So for now, I think this is a YAGNI.

> -        assert isinstance(self._format, (RepositoryFormat5,
> -                                         RepositoryFormat6,
> -                                         RepositoryFormat7,
> -                                         RepositoryFormatKnit1)), \
> +        assert self._format.has_unnested_inventory(), \
>              ("fileids_altered_by_revision_ids only supported for
> branches "
>               "which store inventory as unnested xml, not on %r" % self)
>          selected_revision_ids = set(revision_ids)
> 
> ^- I'm not sure that 'has_unnested_inventory()' is a sufficient
> statement. Because the text has to also be in a known XML format. Even
> changing the XML format a little bit would potentially break
> fileids_altered_by_revision_ids. It is a very fragile function.
> I realize it makes it difficult to add a new RepositoryFormat by a
> plugin, etc. So we should consider changing this. But we need to be careful.

I see two options:
1. Depend on the inventory format, not the repository format
2. Use polymorphism so that this implementation cannot be used if the
inventory format is incompatible.  We could implement it on
RepositoryFormats or on InventoryFormats, I think.

> v- Put a '\n' at the end of this string.
> It is the raw text that goes into the .bzr/repository/format, and it is
> much better to end with a newline.
> (There is an interesting bug with zsh when you cat a single line file
> without a trailing newline, the prompt will cover up the contents. So
> doing 'cat .bzr/repository/format' just gives you a new prompt. As if
> the file contents were empty. That was how I found out about our current
> format problems :)

Wouldn't it be better to fix the code that sets ./bzr/repository/format,
so that it uses a newline?

> === modified file
> 'bzrlib/tests/repository_implementations/test_fileid_involved.py'
> --- bzrlib/tests/repository_implementations/test_fileid_involved.py
> 2006-07-29 08:23:47 +0000
> +++ bzrlib/tests/repository_implementations/test_fileid_involved.py
> 2006-08-24 00:38:48 +0000
> @@ -173,13 +173,21 @@
>              self.branch.repository.fileids_altered_by_revision_ids(
>                  ['rev-G', 'rev-F', 'rev-C', 'rev-B', 'rev-<D>',
> 'rev-K', 'rev-J']))
> 
> v-- Is it better to strip this? It seems like it may be important to
> know if it exists. 

I don't think these tests should fail if it does.

(Though I realize this is a test against all
> repository implementations, it does mean that different repo
> implementations now cause different info to be returned. Which is not
> really what we want).

No, it is what we want.  The purpose of this new format is to provide
the same information about the root that other formats provide about all
other file ids.  And we can't make
Repository.filids_altered_by_revision_ids always include TREE_ROOT,
because that implies there's a versionedfile for TREE_ROOT, and for old
Repositories, that's not true.  So we must permit this difference in output.

> v- Should this be catching SyntaxError?

Yes.  The supplied inventory may not be in XML format.  The serializers
shouldn't behave differently depending on whether the supplied inventory
is XML.  So a rio inventory should produce the same exception that a
wrong-format XML inventory would produce.

I would rather trap NotAnXmlDocument, or something, but SyntaxError is
all I get.

> It seems the
> UnexpectedInventoryFormat() should be raised at the 'assert format == 5'
> level, and the rest would be SyntaxError. Or maybe
> 'InvalidSerializedInventory', or something. But it doesn't seem like all
> exceptions should raise UnexpectedInventoryFormat. Since that sound
> specifically like the format == 5 vs 6 stuff.

No, that is "You gave me a format to deserialize that I do not
recognize.  I make no claims as to whether it's valid.  I only claim
that I cannot handle it."

I need an error like that, to handle basis caching properly.  I had to
rename the inventory cache file because old clients don't gracefully
handle situations in which the cached inventory is in an unexpected
format.  I don't want to have to do that again.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFAyao0F+nu1YWqI0RAlcTAJ43tSe/LV7gLY7nH+kZ1G2gUjJX4gCdGaB4
hkWMEHfWATDtXT5mLMrHyxQ=
=V4M7
-----END PGP SIGNATURE-----