[MERGE] Move serialisation of inventories into CommitBuilder

Andrew Bennetts andrew at canonical.com
Wed Oct 3 03:00:09 BST 2007


Robert Collins wrote:
> On Tue, 2007-10-02 at 19:20 +1000, Andrew Bennetts wrote:
[...]
> > > === modified file 'bzrlib/commit.py'
> > > --- bzrlib/commit.py	2007-09-21 04:22:53 +0000
> > > +++ bzrlib/commit.py	2007-10-02 01:06:55 +0000
> > [...]
> > > +            content_summary = self.work_tree.path_content_summary(path)
> > > +            if not specific_files or is_inside_any(specific_files, path):
> > > +                if content_summary[0] == 'missing':
> > > +                    deleted_paths.add(path)
> > > +                    self.reporter.missing(path)
> > > +                    deleted_ids.append(file_id)
> > > +                    continue
> > > +            # TODO: have the builder do the nested commit just-in-time IF and
> > > +            # only if needed.
> > > +            if content_summary[0] == 'tree-reference':
> > > +                # enforce repository nested tree policy.
> > > +                if (not self.work_tree.supports_tree_reference() or
> > 
> > Is it really possible that we could have a tree-reference here if work_tree
> > doesn't support tree references?  That seems odd.  Can you explain to me how
> > that might happen?   (I'm guessing this seems odd to me because of my ignorance,
> > rather than because it's a problem.)
> 
> path_content_summary reports the status of what's on disk, not the
> logical support for a feature. The check here is a policy check, to keep
> that method from having to lie ('no, it's not a subtree, even if it is'),
> which keeps it general enough to be reused e.g. by add.

Ah.  I thought that what can be on disk is restricted by the working tree
format, i.e. that there would be no way to express "here is a subtree" on disk
in a working tree that didn't support them.  For some reason I had the strange
notion that subtrees were something different from a plain old directory with
a ".bzr" directory in them.

> > > === modified file 'bzrlib/memorytree.py'
> > > --- bzrlib/memorytree.py	2007-08-23 00:12:35 +0000
> > > +++ bzrlib/memorytree.py	2007-09-05 05:51:34 +0000
> > > @@ -98,6 +98,26 @@
> > >              return None, False, None
> > >          return entry.kind, entry.executable, None
> > >  
> > > +    def path_content_summary(self, path):
> > > +        """See Tree.path_content_summary."""
> > > +        id = self.path2id(path)
> > > +        if id is None:
> > > +            return 'missing', None, None, None
> > > +        kind = self.kind(id)
> > > +        if kind == 'file':
> > > +            bytes = self._file_transport.get_bytes(path)
> > > +            size = len(bytes)
> > > +            executable = self._inventory[id].executable
> > 
> > It doesn't make sense to me to get the executable bit from the inventory but the
> > file size from transport.  Shouldn't you get both from the same source?
> 
> This is what we do to support filesystems that don't natively support the
> execute bit; as this is primarily test harness support it seems sufficient
> unto the purpose to me.

Two things:

I wondered about the purpose of MemoryTree (i.e. is it just for testing?), but I
couldn't find the answer in memorytree.py.  It would be good to mention this in
the MemoryTree docstring (or the memorytree.py docstring).

Also, a brief comment in this method like "we intentionally use the executable
bit from the inventory, in case we are testing on a filesystem that doesn't
support it natively" would be good.
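
For concreteness, the kind of comment I mean, shown here in a standalone toy
(not the real MemoryTree), would read something like:

    class FakeMemoryTree(object):
        """A purely in-memory tree, intended for use in tests."""

        def __init__(self):
            self._bytes = {}        # path -> file content
            self._executable = {}   # path -> recorded executable flag

        def add_file(self, path, data, executable=False):
            self._bytes[path] = data
            self._executable[path] = executable

        def path_content_summary(self, path):
            """Return (kind, size, executable, sha1-or-None) for path."""
            if path not in self._bytes:
                return ('missing', None, None, None)
            # We deliberately take the executable flag from the recorded
            # metadata rather than from any filesystem, so the answer is the
            # same even on filesystems with no native execute bit.
            data = self._bytes[path]
            return ('file', len(data), self._executable[path], None)

    tree = FakeMemoryTree()
    tree.add_file('hello.py', b'print 1\n', executable=True)
    print(tree.path_content_summary('hello.py'))     # ('file', 8, True, None)
    print(tree.path_content_summary('missing.txt'))  # ('missing', None, None, None)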

> > > +                else:
> > > +                    # Either there is only a hash change(no hash cache entry,
> > > +                    # or same size content change), or there is no change on
> > > +                    # this file at all.
> > > +                    # There is a race condition when inserting content into the
> > > +                    # knit though that can result in different content being
> > > +                    # inserted so even though we may have had a hash cache hit
> > > +                    # here we still tell the store the hash we would *not*
> > > +                    # store a new text on, which means that it can avoid for us
> > > +                    # without a race condition and without double-shaing the
> > > +                    # lines.
> > > +                    nostore_sha = parent_entry.text_sha1
> > 
> > I don't quite follow that comment.  That sixty-four word sentence is just a bit
> > too convoluted for me to be able to detangle as I read it.  Could you please try
> > rephrasing it using shorter sentences?
> 
> Uhm, nothing comes to mind; but I've been up on the coal face on this
> bit of code for some time. Care to suggest a rephrase?

“Because the user might be changing the file on disk while we are committing, a
stat cache hit here is not enough for us to be sure of what content we will
actually store.  So we pass the cached hash to the store so that it can
determine whether the content it would store is the same as what we already
had (and thus whether it should not be stored).  This way we only calculate
the expensive SHA-1 hash once, in the store, and we still do the right thing
even if the file has changed since the stat lookup.”

It's a bit more verbose, and could probably be edited down a little, but at
least it doesn't make my eyes glaze over 2/3rds of the way through the comment
:)
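
And in case it helps, here's a toy illustration (made-up names, not the real
knit/store API) of the nostore_sha dance as I understand it:

    import hashlib

    class ExistingContent(Exception):
        """The content to store matches nostore_sha; nothing was written."""

    class FakeStore(object):
        def __init__(self):
            self._texts = {}  # sha1 -> bytes

        def add_text(self, data, nostore_sha=None):
            # The one expensive SHA-1 calculation happens here, in the store.
            sha1 = hashlib.sha1(data).hexdigest()
            if nostore_sha is not None and sha1 == nostore_sha:
                # The bytes we were actually given are what we already had,
                # so skip storing a duplicate and tell the caller.
                raise ExistingContent(sha1)
            self._texts[sha1] = data
            return sha1

    store = FakeStore()
    old_sha = store.add_text(b'original contents\n')
    try:
        # The stat cache said "unchanged", but the store decides from the
        # bytes it is actually handed, so a last-second edit still wins.
        store.add_text(b'original contents\n', nostore_sha=old_sha)
    except ExistingContent:
        print('unchanged: no new text stored')
    print('changed:', store.add_text(b'edited contents\n', nostore_sha=old_sha)[:8])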

-Andrew.



