[RFC] bzr.jrydberg.versionedfile
Johan Rydberg
jrydberg at gnu.org
Wed Dec 21 09:01:27 GMT 2005
Robert Collins <robertc at robertcollins.net> writes:
Thanks for your input, Robert. I would like to respond to some of your
comments, leaving out the obvious things, such as the remarks about comment style.
> At the ui level - I think 'bzr compare' is not clear enough. Perhaps
> 'cross-check' or perhaps an option to 'bzr check' i.e. 'bzr check
> --against other'.
Fair enough.
> Some design level thoughts:
> I'm not sure what the goal of the rollback stuff is: we're aiming for
> lock-free read operation. This means that writing cannot in general ever
> rollback.
The whole idea of transactions is to encapsulate several operations
into a single atomic operation; the concept has been used in databases
for ages.
> heres why with your current implementation:
> ----
> lock a branch
> write a bunch of data to the end of knits, including some revisions.
> Decide to rollback
> -> a reader reading one of those revisions is now fucked.
> ----
>
> So I'd like some use cases for the rollback stuff or for 'commit' to not
> be a no-op. That is:
> Either there is no rollback,
> OR
> no data is actually available to lock-free readers until we commit.
The reason rollback is important is that several parts of the knit
design depend on the notion that changes can be rolled back in the
event of a fatal error. For example, the fetcher simply merges all
remote revisions into the local knit _BEFORE_ it starts to look at
texts and inventories. If something goes bad during the later phases,
all fetched revisions must be thrown away.
Rolling back changes is also needed when we end up in situations
there is no clean way out of; running out of disk space is one. We want to
leave the history exactly as it was _before_ the modifying operation began.
I agree that the current implementation is flawed in the sense that it
exposes changes before they have been committed. This has to be fixed.
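One direction I am considering (the names and details here are only a
sketch, not the actual knit or transaction API): append data records as
we go, but buffer the index entries and write them only at commit, so a
lock-free reader never sees uncommitted versions.

    # Illustrative sketch only; not the real knit/transaction API.
    # Records are appended to the data file as we go, but the index
    # entries that make them reachable are buffered and written only
    # on commit().  A reader that trusts only the index never sees
    # uncommitted records; rollback drops the buffer and truncates
    # the data file back to its original length.
    class SketchTransaction(object):

        def __init__(self, data_file, index_file):
            self._data_file = data_file
            self._index_file = index_file
            self._start_offset = data_file.tell()
            self._pending_index = []

        def add_record(self, version_id, lines):
            offset = self._data_file.tell()
            self._data_file.writelines(lines)
            size = self._data_file.tell() - offset
            self._pending_index.append((version_id, offset, size))

        def commit(self):
            # Only now do the new versions become visible to readers.
            for version_id, offset, size in self._pending_index:
                self._index_file.write('%s %d %d\n' % (version_id, offset, size))
            self._index_file.flush()
            self._pending_index = []

        def rollback(self):
            # Nothing was ever reachable through the index, so it is
            # safe to throw the appended data away.
            self._data_file.truncate(self._start_offset)
            self._pending_index = []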
> Inside BzrBranch we are getting cramped with format support. I'd like
> use to move all the storage stuff off to the side as Aaron Bentley and I
> have been working on. I think that that branch should land before yours
> - or be merged into yours. Lets start with merging it first, and I'll
> spend some time getting a TODO list for its merging done.
Sounds reasonable. When I get a few hours to spare I'll try to merge your
repository branch into mine.
> I think the trailing newline requirement in knit is probably a
> showstopper to merging :).
That has been addressed. If the knit sees that a new text does not have a
trailing newline, it records the text as "incomplete" and appends a \n.
When extracting, it strips the newline again.
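Roughly, the trick looks like this (a simplified sketch, not the actual
knit code):

    # Simplified sketch of the no-trailing-newline handling.  Texts are
    # always stored with a final '\n'; a flag remembers when the original
    # text lacked one, so the exact bytes come back out on extraction.
    def store_lines(lines):
        if lines and not lines[-1].endswith('\n'):
            return lines[:-1] + [lines[-1] + '\n'], True   # True == "incomplete"
        return lines, False

    def extract_lines(stored_lines, incomplete):
        if incomplete and stored_lines:
            return stored_lines[:-1] + [stored_lines[-1][:-1]]
        return stored_lines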
> + def apply_delta(self, delta):
> + """Apply delta to this content."""
> + offset = 0
> + for start, end, c, lines in delta:
>
> ^ whats 'c' here ? (I really cannot tell without cracking open a
> debugger!) [it could be a character, a count, a code...?] ; this happens
> in a number of routines.
'c' means count, or counter, in the same way that 'i' usually means
'iterator'.
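Point taken, though. Spelled out, the loop would read something like the
following; the body is illustrative (it assumes the content is held as a
list of lines in self._lines), not a verbatim copy of the branch:

    def apply_delta(self, delta):
        """Apply a line delta to this content.

        Each hunk is (start, end, count, lines): replace the current
        lines[start:end] with `lines`, where count == len(lines).
        """
        offset = 0
        for start, end, count, lines in delta:
            self._lines[offset + start:offset + end] = lines
            offset = offset + count - (end - start)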
> + def _check_versions_present(self, version_ids):
> + """Check that all specified versions are present."""
> + version_ids = set(version_ids)
> + for r in list(version_ids):
> Whats 'r' here - is it a version_id? EDOESNOTCOMPUTE :).
Yes. Will fix that.
> I'm confused here - is it or is it not an error:
> + if self.factory.annotated:
> + # FIXME: need to re-annotate the fulltext since
> + # the indexes may be out of sync.
> + #raise NotImplementedError
> + pass
> If it is, don't 'pass', if its not - that comment should be deleted no?
No, it should actually raise the exception, since that functionality
has not been implemented yet. :) I put a 'pass' there so I could focus
on other things for a while.
> +class _KnitIndex(_KnitComponentFile):
> + """Manages knit index file.
> +
> + The index is already kept in memory and read on startup, to enable
> + fast lookups of revision information. The cursor of the index
>
> That docstring is confusing. "already in memory" and "Read on startup"
> dont fit together for me.
Should say 'always'.
> and
> + parent references; a index entry may only have parents that with a
> + lover index number. As a result, the index is topological sorted.
> is quite funny, but probably wrong :).
>
>
> +class ChangeEntry(object):
> + """A change."""
>
> Could you enlarge this? what sort of change is it? where should we
> expect it? Does it include content? or just a mark that a change
> occured ?
>
> more PEP8 - and
> + """Revision changes holds information about what changes a
> + particular revision.
> That sentence does not parse for me.
s/revision./revision introduced./
> this is an issue:
>
> + def add_object(self, id, obj):
> + """Add a object to the map with a given id."""
> + if id in self._map:
> + raise errors.BzrError('object %r already in the identity map' % id)
> + self._map[id] = obj
> + self._reverse_map[obj] = id
>
>
> The point of the map is to deal in logical identity. Raw objects
> dont have logical identity, which is why a key function is needed,
> or you can end up with two knits that represent the same logical
> file id in the identity map. The weave api you removed from there
> did this, your replacement api should preserve that.
You could still end up with that under the old API, for the same
reasons; the difference is that you could only get conflicts between
weaves.
My opinion is that instead of having a key function, a more specific id
can be used. For knits, the id of a knit is (store, file_id).
> The id that was passed in was the logical id of the weave - the file
> id. The thing I think I object to here is making the api too
> generic. I'd rather have specific calls on it for the types of
> things it needs to keep unique.
The mapping is keyed on a (store, file_id) id.
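In other words, something along these lines; the method names here are
hypothetical, not what the final API will look like:

    # Sketch of keying the identity map on (store, file_id) rather than
    # on a raw object or a caller-supplied key function.
    from bzrlib import errors

    class IdentityMap(object):

        def __init__(self):
            self._map = {}

        def add_knit(self, store, file_id, knit):
            key = (store, file_id)
            if key in self._map:
                raise errors.BzrError('knit %r already in the identity map' % (key,))
            self._map[key] = knit

        def find_knit(self, store, file_id):
            return self._map.get((store, file_id))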
> The remove_* functions should probably not exist either - the point of
> the map is to exist for the whole transaction, when things are finished
> with the whole map should go away when the transaction does.
I put in the remove_object method to handle cache invalidation; it
originates from the need to flush the cache in WeaveStore.copy_file.
My personal opinion is that weaves should never be copied, but instead
always merged (using WeaveVersionedFile.join); since I did not know the
performance implications, I left the method in.
> +def clonefile(fromfile, tofile):
> + if isinstance(fromfile, basestring):
> + fromfile = StringIO(fromfile)
> + pumpfile(fromfile, tofile)
>
> We're trying to move away from magic apis - should should probably be
> def pumpstring(from_string, to_file):
> from_file = StringIO(from_string)
> pumpfile(from_file, to_file)
Yes, I agree. This needs to be cleaned up.
> + def get_text(self, file_id, version_id, transaction):
> + """Return a sequence of lines that makes out version-id of
> + specified file."""
> + file = self.get_file(file_id, transaction)
> + return file.get_text(version_id)
> +
> + get_lines = get_text
> +
>
> I'd expect get_text and get_lines to return (respectively) a string and
> a sequence of lines. Having them the same is rather, confusing.
I would rather have get_text return a sequence of lines, and have a
get_string method that returns the lines concatenated.
The only reason get_lines is there is that I was lazy and did not want to
change a large set of tests.
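That is, roughly (a sketch of the split I have in mind, not a committed
API):

    def get_text(self, file_id, version_id, transaction):
        """Return version_id of the specified file as a sequence of lines."""
        file = self.get_file(file_id, transaction)
        return file.get_text(version_id)

    def get_string(self, file_id, version_id, transaction):
        """Return version_id of the specified file as a single string."""
        return ''.join(self.get_text(file_id, version_id, transaction))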
> + def reigster_clean(self, an_object, precious=False):
> + """Register an_object as being clean. Iff the precious hint is
> + True, the object will not be ejected from the object identity
> + map ever."""
> + pass
>
> Typo :). Also, IFF is probably opaque to some coders. And PEP8 on the
> docstring layout :).
>
>
>
> I'm not going to review the transaction changes just yet, I'd like to
> talk through the rollback etc stuff first.
>
>
>
> I think there is a bug here:
> +class WeaveUpgrade(object):
> + def __init__(self, base):
> + self.base = base
> + self.text_weaves = {}
> + self.converted_revs = set()
> +
> + def convert(self):
> note('note: upgrade may be faster if all store files are ungzipped first')
> - if not os.path.isdir(self.base + '/.bzr/weaves'):
> - os.mkdir(self.base + '/.bzr/weaves')
> - self.inv_weave = Weave('inventory')
> + base = self.base
> + if not os.path.isdir(base.base_dir + '/.bzr/weaves'):
> + os.mkdir(base.base_dir + '/.bzr/weaves')
> + self.inv_weave = WeaveVersionedFile('inventory')
> # holds in-memory weaves for all files
> - self.text_weaves = {}
> - os.remove(self.branch.controlfilename('branch-format'))
> + os.remove(base.branch.controlfilename('branch-format'))
>
>
> Base is a path according to how its used later, but you seem to be
> treating it like a branch. Also, "WeaveUpgrade" is not unambigous, have
> you considered 'UpgradeStoreToWeave' or similar ?
base is a BaseUpgrade object that contains information that can be
shared across an upgrade pipeline. For example, it holds revision
information so that it does not have to be re-read in every stage.
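To illustrate the idea (BaseUpgrade is the shared object mentioned above,
but the attributes and the stage handling below are only a sketch, not
the exact code in my branch):

    # Sketch of an upgrade pipeline: one BaseUpgrade instance carries
    # state that every stage can reuse, so expensive data such as the
    # revision information is read only once.
    class BaseUpgrade(object):
        """State shared between all stages of an upgrade pipeline."""

        def __init__(self, branch, base_dir):
            self.branch = branch
            self.base_dir = base_dir
            self.revisions = {}       # revision-id -> revision, read once

    def run_pipeline(base, stages):
        # Each stage (e.g. WeaveUpgrade) takes the shared base object
        # and exposes a convert() method.
        for stage_class in stages:
            stage_class(base).convert()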
> Whew, thats it. Hope you got this far :).
Thanks again. Actually, you had fewer comments than I thought you
would.
~j