[RFC] bzr.jrydberg.versionedfile

Wed Dec 21 06:09:02 GMT 2005

On Mon, 2005-12-19 at 04:24 +0100, Johan Rydberg wrote:
> Hi,
...
> As I said, there is still work that needs to be done.  It is far from
> finished.  For example, partial accesses via http is not supported in
> the http-transport yet.  But comments are welcome, as well as bug
> reports.

Generally this is looking good. I'd like to get this branch to the point
that its not a regression on any of the 3 metrics we have for merging,
and then merge, with the new format not the default. I think that that
will be very useful. And minimise the 8K line delta. (What is it with
multi-K line deltas today ?!)


At the ui level - I think 'bzr compare' is not clear enough. Perhaps
'cross-check' or perhaps an option to 'bzr check' i.e. 'bzr check
--against other'.


Some design level thoughts:
I'm not sure what the goal of the rollback stuff is: we're aiming for
lock-free read operation. This means that writing cannot in general ever
rollback. heres why with your current implementation:
----
lock a branch
write a bunch of data to the end of knits, including some revisions.
Decide to rollback
 -> a reader reading one of those revisions is now fucked.
----

So I'd like some use cases for the rollback stuff or for 'commit' to not
be a no-op. That is:
Either there is no rollback,
OR
no data is actually available to lock-free readers until we commit. 


Inside BzrBranch we are getting cramped with format support. I'd like
use to move all the storage stuff off to the side as Aaron Bentley and I
have been working on. I think that that branch should land before yours
- or be merged into yours. Lets start with merging it first, and I'll
spend some time getting a TODO list for its merging done.


on testing: I note the interformat test stuff in test_compare.py. I
think that this should only be needed if there are different code paths
in the compare routine based on the branch. That is - this matrix grows
with the square of the branch formats we have, which is not good.
One possible thing is to compare a known format (say format 5) against
the current default format. (no growth as new formats are added).
Another is to only add cross-format checks when there is cross format
code in that routine (actually, to add cross format tests before adding
such code ;)).


Some code level thoughts:

+def compare_branches(a, b, strict=False, pb=None):
+    """Compare branches a and b."""
+


This might be easier to find as 'Branch.compare(self, other,
strict=False, pb=None).


I think the trailing newline requirement in knit is probably a
showstopper to merging :).


+    def apply_delta(self, delta):
+        """Apply delta to this content."""
+        offset = 0
+        for start, end, c, lines in delta:

^ whats 'c' here ? (I really cannot tell without cracking open a
debugger!) [it could be a character, a count, a code...?] ; this happens
in a number of routines.

+        """Construct a knit at location specified by relpath.  The
+        given transaction will be used throughout the lifetime of the
+        knit."""


That should have the trailing """ on a new line, and the first line a
dedicated sentence with a line between:

+        """Construct a knit at location specified by relpath.
+
+        The given transaction will be used throughout the lifetime of
+        the knit.
+        """

like wise for _merge_annotations etc.

+    def _check_versions_present(self, version_ids):
+        """Check that all specified versions are present."""
+        version_ids = set(version_ids)
+        for r in list(version_ids):
Whats 'r' here - is it a version_id? EDOESNOTCOMPUTE :).



A small syntax error here:

+        if parents and delta:
+            # To speed to extract of texts the delta chain is limited
+            # to a fixed number of deltas.  This should minimize both
+            # I/O and the time spend applying deltas.

ITYM 'To speed the extraction of ...'



I'm confused here - is it or is it not an error:
+                if self.factory.annotated:
+                    # FIXME: need to re-annotate the fulltext since
+                    # the indexes may be out of sync.
+                    #raise NotImplementedError
+                    pass
If it is, don't 'pass', if its not - that comment should be deleted no?




PEP8:
+class _KnitComponentFile(object):
+    """One of the files used to implement a knit database"""
+    def __init__(self, transport, filename, mode, transaction):

One blank line before the def __init__ please.




+class _KnitIndex(_KnitComponentFile):
+    """Manages knit index file.
+
+    The index is already kept in memory and read on startup, to enable
+    fast lookups of revision information.  The cursor of the index

That docstring is confusing. "already in memory" and "Read on startup"
dont fit together for me.
and
+    parent references; a index entry may only have parents that with a
+    lover index number.  As a result, the index is topological sorted.
is quite funny, but probably wrong :).


+class ChangeEntry(object):
+    """A change."""

Could you enlarge this? what sort of change is it? where should we
expect it? Does it include content? or just a mark that a change
occured ?

more PEP8 - and
+    """Revision changes holds information about what changes a
+    particular revision.
That sentence does not parse for me.


this is another PEP8 glitch:
+    return changes
+
+def write_revision_changes(writer, changes):

->Two blank lines between top level symbols please:
+    return changes
+
+
+def write_revision_changes(writer, changes):


This is noise:
-            self.revision_store = get_store('revision-store')
+            self.inventory_store = get_store(u'inventory-store')

->get_store should unicodify all its parameters.


errors.py has a bunch of PEP8 vertical white space issues.



+def select_fetcher(to_branch, from_branch, revision=None, pb=None):
+    if to_branch._branch_format == from_branch._branch_format:
+        if to_branch._branch_format == 5:
+            f = WeaveFetcher(to_branch, from_branch, revision, pb)
+        elif to_branch._branch_format == 6:
+            f = WeaveFetcher(to_branch, from_branch, revision, pb)
+        elif to_branch._branch_format == 7:
+            f = KnitFetcher(to_branch, from_branch, revision, pb)
+    else:
+        f = Fetcher(to_branch, from_branch, revision, pb)
+    return f


This is slightly wrong: format 5 and 6 have the same underlying format.
def select_fetcher(to_branch, from_branch, revision=None, pb=None):
    """Return the best fetcher available for two branches."""
    if to_branch.format in (5, 6) and from_branch.format in (5, 6):
        return WeaveFetcher(to_branch, from_branch, revision, pb)
    elif to_branch.format in (7,) and from_branch.format in (7,):
        return KnitFetcher(to_branch, from_branch, revision, pb)
    else:
        return Fetcher(to_branch, from_branch, revision, pb)
 
might be a better way to express it.




this is an issue:

+    def add_object(self, id, obj):
+        """Add a object to the map with a given id."""
+        if id in self._map:
+            raise errors.BzrError('object %r already in the identity
map' % id)
+        self._map[id] = obj
+        self._reverse_map[obj] = id


The point of the map is to deal in logical identity. Raw objects dont
have logical identity, which is why a key function is needed, or you can
end up with two knits that represent the same logical file id in the
identity map. The weave api you removed from there did this, your
replacement api should preserve that. The id that was passed in was the
logical id of the weave - the file id. The thing I think I object to
here is making the api too generic. I'd rather have specific calls on it
for the types of things it needs to keep unique. So for instance, it
might know about 'versioned files' rather than weaves. 

The remove_* functions should probably not exist either - the point of
the map is to exist for the whole transaction, when things are finished
with the whole map should go away when the transaction does.


+def clonefile(fromfile, tofile):
+    if isinstance(fromfile, basestring):
+        fromfile = StringIO(fromfile)
+    pumpfile(fromfile, tofile)

We're trying to move away from magic apis - should should probably be
def pumpstring(from_string, to_file):
    from_file = StringIO(from_string)
    pumpfile(from_file, to_file)


+    def get_text(self, file_id, version_id, transaction):
+        """Return a sequence of lines that makes out version-id of
+        specified file."""
+        file = self.get_file(file_id, transaction)
+        return file.get_text(version_id)
+
+    get_lines = get_text
+

I'd expect get_text and get_lines to return (respectively) a string and
a sequence of lines. Having them the same is rather, confusing.



=== modified file 'bzrlib/tests/test_transport.py'

--- bzrlib/tests/test_transport.py      
+++ bzrlib/tests/test_transport.py      
@@ -39,6 +39,9 @@
     def test_urlescape(self):
         self.assertEqual('%25', urlescape('%'))

+def pumpstring(f, str):
+    f.write(str)
+    f.close()

should that use your new osutils pump routine ?


+    def reigster_clean(self, an_object, precious=False):
+        """Register an_object as being clean.  Iff the precious hint is
+        True, the object will not be ejected from the object identity
+        map ever."""
+        pass

Typo :). Also, IFF is probably opaque to some coders. And PEP8 on the
docstring layout :).



I'm not going to review the transaction changes just yet, I'd like to
talk through the rollback etc stuff first.



I think there is a bug here:
+class WeaveUpgrade(object):
+    def __init__(self, base):
+        self.base = base
+        self.text_weaves = {}
+        self.converted_revs = set()
+
+    def convert(self):
         note('note: upgrade may be faster if all store files are
ungzipped first')
-        if not os.path.isdir(self.base + '/.bzr/weaves'):
-            os.mkdir(self.base + '/.bzr/weaves')
-        self.inv_weave = Weave('inventory')
+        base = self.base
+        if not os.path.isdir(base.base_dir + '/.bzr/weaves'):
+            os.mkdir(base.base_dir + '/.bzr/weaves')
+        self.inv_weave = WeaveVersionedFile('inventory')
         # holds in-memory weaves for all files
-        self.text_weaves = {}
-        os.remove(self.branch.controlfilename('branch-format'))
+        os.remove(base.branch.controlfilename('branch-format'))


Base is a path according to how its used later, but you seem to be
treating it like a branch. Also, "WeaveUpgrade" is not unambigous, have
you considered 'UpgradeStoreToWeave' or similar ?

Whew, thats it. Hope you got this far :).

Rob




-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20051221/c0f7d0cb/attachment.pgp