[RFC] bzr.jrydberg.versionedfile

Wed Dec 21 10:18:03 GMT 2005

On Wed, 2005-12-21 at 10:01 +0100, Johan Rydberg wrote:
> Robert Collins <robertc at robertcollins.net> writes:

> > Some design level thoughts:
> > I'm not sure what the goal of the rollback stuff is: we're aiming for
> > lock-free read operation. This means that writing cannot in general ever
> > rollback. 
> 
> The whole idea with transactions is to encapsulate several operations
> into one atomic operation.  The concept has been used for databases
> for ages.

Well duh :).

> > heres why with your current implementation:
> > ----
> > lock a branch
> > write a bunch of data to the end of knits, including some revisions.
> > Decide to rollback
> >  -> a reader reading one of those revisions is now fucked.
> > ----
> >
> > So I'd like some use cases for the rollback stuff or for 'commit' to not
> > be a no-op. That is:
> > Either there is no rollback,
> > OR
> > no data is actually available to lock-free readers until we commit. 
> 
> The reason rollback is important is that several parts of the
> knit-design depends in the notion that changes can be rolled back in
> the event of a fatal error.  For example, the fetcher simply merges
> all remote revisions into the local knit, _BEFORE_ it starts to look
> at texts and inventories.  If something goes bad during the latter
> phases, all fetched revisions must be thrown away.

Except we have no guarantee (unless we get into WALF implementations)
that this throw away will occur. That is, all it takes is a SEGSIGV and
the rollback will not occur. Thus the use case of 'ensure a consistent
repository' CANNOT be satisfied with a rollback mechanism on our
transactions.

> Rolling back changes are also needed when we end up in situations that
> there is no clean way out of; out-of-disk being one.  We want to leave
> the history exactly as it was _before_ the modifying operation begun.

This is more interesting - but just stopping cold would be fine, if we
have maintained our invariant of 'all data is always valid all the
time.'

> I agree that the current implementation is flawed in the sense that it
> exposes changes before they have been committed.  This has to be fixed.

I don't think thats the key current flaw: The key current flaw is that
critical invariants in branch data are being violated in a way that we
cannot guarantee recovery from :p.

> > I think the trailing newline requirement in knit is probably a
> > showstopper to merging :).
> 
> That has been addressed.  If it sees that the new text does not have a
> trailing newline, it records the text as "incomplete" and adds a \n.
> When extracting, it removes the newline.

Theres a XXX comment in the knit file :)

> > +    def apply_delta(self, delta):
> > +        """Apply delta to this content."""
> > +        offset = 0
> > +        for start, end, c, lines in delta:
> >
> > ^ whats 'c' here ? (I really cannot tell without cracking open a
> > debugger!) [it could be a character, a count, a code...?] ; this happens
> > in a number of routines.
> 
> 'c' means count, or counter, in the same way that 'i' usually mean
> 'iterator'.

If you could expand that in the code, then other folk reading it won't
need to ask the same question ;).

> > I'm confused here - is it or is it not an error:
> > +                if self.factory.annotated:
> > +                    # FIXME: need to re-annotate the fulltext since
> > +                    # the indexes may be out of sync.
> > +                    #raise NotImplementedError
> > +                    pass
> > If it is, don't 'pass', if its not - that comment should be deleted no?
> 
> No, it should actually raise the exception, since that functionality
> has not been implemented yet. :) I put a 'pass' there so I could focus
> on other things for a while.

If its not in your TODO list for knits, could you add that there ?
...

> s/revision./revision introduced./

Where I'm asking for clarification, its mainly a hint for you to address
in your branch :).

> > The point of the map is to deal in logical identity. Raw objects
> > dont have logical identity, which is why a key function is needed,
> > or you can end up with two knits that represent the same logical
> > file id in the identity map. The weave api you removed from there
> > did this, your replacement api should preserve that.
> 
> You could still end up with that with the old API for the same
> reasons.  The difference is that you could only get conflicts between
> weaves.
> 
> My opinion is that instead of having a key function a more specific id
> can be used.  For knits, the ID of a knit is (store, file_id).
> 
> 
> > The id that was passed in was the logical id of the weave - the file
> > id. The thing I think I object to here is making the api too
> > generic. I'd rather have specific calls on it for the types of
> > things it needs to keep unique.
> 
> The mapping is kept using using a (store, file_id) id.

Which is now the responsibility of *all* callers that want to put an
object in the map. I'm concerned here about the ability for callers to
skew their apis - having a semantic api to the map prevents that. Its
fine that the key be more specific - but we should not encourage skew.

> > The remove_* functions should probably not exist either - the point of
> > the map is to exist for the whole transaction, when things are finished
> > with the whole map should go away when the transaction does.
> 
> I put in the remove_object method to handle cache invalidations.  This
> originates from the need to flush the cache in WeaveStore.copy_file.
> My personal opinion is that weaves should never be copied, but instead
> always merged (using WeaveVersionedFile.join), but I did not know the
> performance implications so I left it in.

mmm, I'll need to look at this. I'll do so next time I get a good shot
to eyeball the code.

> > +    def get_text(self, file_id, version_id, transaction):
> > +        """Return a sequence of lines that makes out version-id of
> > +        specified file."""
> > +        file = self.get_file(file_id, transaction)
> > +        return file.get_text(version_id)
> > +
> > +    get_lines = get_text
> > +
> >
> > I'd expect get_text and get_lines to return (respectively) a string and
> > a sequence of lines. Having them the same is rather, confusing.
> 
> I rather have get_text return a sequence of lines, and have a
> get_string method that returns the lines concatenated.
> 
> The reason get_lines is there is that I am lazy and didn't want to
> change a large set of tests.

If I have a variable 'foo_text' with no extra explanation, would you
assume that to be a string or a list of lines ? I'm pretty sure the
former would be the case -> and thats why I'm encouring get_text return
a string.

> > Base is a path according to how its used later, but you seem to be
> > treating it like a branch. Also, "WeaveUpgrade" is not unambigous, have
> > you considered 'UpgradeStoreToWeave' or similar ?
> 
> base is a BaseUpgrade object that contains information that can be
> shared in a upgrade pipeline.  For example, it holds revision
> information so it does not have to be re-read in every stage.

Hmm. base might be a improvable name then. How about 'upgrade_state' or
something ?

Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20051221/3f01fda5/attachment.pgp