Rev 3155: First journalling test up and passing. in http://people.ubuntu.com/~robertc/baz2.0/inventory.journalled

Thu Jan 3 02:59:15 GMT 2008

At http://people.ubuntu.com/~robertc/baz2.0/inventory.journalled

------------------------------------------------------------
revno: 3155
revision-id:robertc at robertcollins.net-20080103025910-vztidj6e9v3jrqsi
parent: robertc at robertcollins.net-20080103014016-7o8mxov3jxx0tuwo
committer: Robert Collins <robertc at robertcollins.net>
branch nick: inventory.journalled
timestamp: Thu 2008-01-03 13:59:10 +1100
message:
  First journalling test up and passing.
added:
  bzrlib/journalled_inventory.py journalled_inventory-20080103020931-0ht5n40kwc0p7fy1-1
  bzrlib/tests/test_journalled_inv.py test_journalled_inv.-20080103012121-ny2w9slze5jgty8i-1
modified:
  bzrlib/tests/__init__.py       selftest.py-20050531073622-8d0e3c8845c97a64
  doc/developers/inventory.txt   inventory.txt-20080103013957-opkrhxy6lmywmx4i-1
=== added file 'bzrlib/journalled_inventory.py'

--- a/bzrlib/journalled_inventory.py	1970-01-01 00:00:00 +0000
+++ b/bzrlib/journalled_inventory.py	2008-01-03 02:59:10 +0000
@@ -0,0 +1,50 @@
+# Copyright (C) 2008 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+"""Journalled inventory serialisation.
+
+See doc/developers/inventory.txt for the design and rationalisation.
+
+In this module the interesting classes are:
+ - EntryAccess - interface for getting text-lines for an inventory.
+ - InventoryJournal - object to read/write journalled inventories.
+"""
+
+__all__ = ['EntryAccess', 'InventoryJournal']
+
+
+class EntryAccess(object):
+    """Provide access to named bytesequences of the journal entries."""
+
+
+class InventoryJournal(object):
+    """Serialise and deserialise inventories using a journal."""
+
+    FORMAT_1 = 'bzr journalled inventory v1 (bzr 1.1)'
+
+    def delta_to_lines(self, old_inventory_name, delta_to_new):
+        """Return a line sequence for delta_to_new.
+
+        :param old_inventory_name: A UTF8 revision id for the old inventory.
+            May be NULL_REVISION if there is no older inventory and
+            delta_to_new includes the entire inventory contents.
+        :param delta_to_new: An inventory delta such as Inventory.apply_delta
+            takes.
+        """
+        lines = []
+        lines.append("format: %s\n" % InventoryJournal.FORMAT_1)
+        lines.append("parent: %s\n" % old_inventory_name)
+        return lines

=== modified file 'bzrlib/tests/__init__.py'
--- a/bzrlib/tests/__init__.py	2007-12-19 08:12:34 +0000
+++ b/bzrlib/tests/__init__.py	2008-01-03 02:59:10 +0000
@@ -2587,6 +2587,7 @@
                    'bzrlib.tests.test_index',
                    'bzrlib.tests.test_info',
                    'bzrlib.tests.test_inv',
+                   'bzrlib.tests.test_journalled_inv',
                    'bzrlib.tests.test_knit',
                    'bzrlib.tests.test_lazy_import',
                    'bzrlib.tests.test_lazy_regex',

=== added file 'bzrlib/tests/test_journalled_inv.py'
--- a/bzrlib/tests/test_journalled_inv.py	1970-01-01 00:00:00 +0000
+++ b/bzrlib/tests/test_journalled_inv.py	2008-01-03 02:59:10 +0000
@@ -0,0 +1,73 @@
+# Copyright (C) 2005 Canonical Ltd
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+"""Tests for the journalled inventory logic.
+
+See doc/developer/inventory.txt for more information.
+"""
+
+from cStringIO import StringIO
+
+from bzrlib import (
+    errors,
+    inventory,
+    journalled_inventory,
+    )
+from bzrlib.inventory import Inventory
+from bzrlib.revision import NULL_REVISION
+from bzrlib.tests import TestCase
+
+### DO NOT REFLOW THESE TEXTS. NEW LINES ARE SIGNIFICANT. ###
+empty_lines = """format: bzr journalled inventory v1 (bzr 1.1)
+parent: null:
+"""
+empty_lines_validator = ""
+
+root_only_lines = """format: bzr journalled inventory v1 (bzr 1.1)
+parent: null:
+/ TREE_ROOT  a at e\xe5ample.com--2004 dir
+"""
+root_only_lines_validator = ""
+
+
+class TestSerializer(TestCase):
+    """Test journalled inventory serialisation."""
+
+    def make_inv_delta(self, old, new):
+        """Make an inventory delta from two inventories."""
+        old_ids = set(old._byid.iterkeys())
+        new_ids = set(new._byid.iterkeys())
+        adds = new_ids - old_ids
+        deletes = old_ids - new_ids
+        common = old_ids.intersection(new_ids)
+        delta = []
+        for file_id in deletes:
+            delta.append((old.id2path(file_id), None, file_id, None))
+        for file_id in adds:
+            delta.append((None, new.id2path(file_id), file_id, new[file_id]))
+        for file_id in common:
+            if old[file_id] != new[file_id]:
+                delta.append((old.id2path(file_id), new.id2path(file_id),
+                    file_id, new[file_id]))
+        return delta
+
+    def test_empty_delta_to_lines(self):
+        old_inv = Inventory()
+        new_inv = Inventory()
+        delta = self.make_inv_delta(old_inv, new_inv)
+        journal = journalled_inventory.InventoryJournal()
+        self.assertEqual(StringIO(empty_lines).readlines(),
+            journal.delta_to_lines(NULL_REVISION, delta))

=== modified file 'doc/developers/inventory.txt'
--- a/doc/developers/inventory.txt	2008-01-03 01:40:16 +0000
+++ b/doc/developers/inventory.txt	2008-01-03 02:59:10 +0000
@@ -45,3 +45,141 @@
 The in development journalled inventory serializer generates a single byte
 string during serialization, but may require many byte strings to deserialize,
 and these are discovered recursively.
+
+
+Detailed documentation
+======================
+
+Journalled inventories
+----------------------
+
+In a journalled inventory we represent hte inventory as a tree of
+inventory-lines texts. A given journalled inventory has three data items: A
+basis inventory pointer, a current lines pointer, and a validator. As the
+definition is recursive we need to supply a starting point - the empty
+journalled inventory. 
+
+An empty journalled inventory has no basis inventory pointer, a constant
+empty inventory lines text, and a constant validator which is the secure
+hash (we might choose sha1 for this) of its empty inventory lines text.
+
+To create a new inventory, one takes an existing inventory, which
+becomes the basis inventory, and a inventory lines object which is a
+serialised inventory delta to update the inventory to a new state. The
+validator for this new inventory is obtained by creating a secure hash
+from the validator of the basis inventory, and the current inventory
+lines.
+
+If we create a new journalled inventory with every commit we build up a
+structure like so::
+
+
+    EMPTY   rev1lines        rev2lines   rev3lines
+      |        |                 |            |
+      +--------+-----------------+------------+
+         rev1validator    rev2validator rev3validator
+
+etc.
+
+
+We can truncate this at any commit by starting with a pointer to EMPTY
+again. Using a heuristic to decide when to do this can limit the depth
+of the chain that we need to process to reconstruct a given revision. A
+similar heuristic to knits, where we store a full text when the size of
+the full text is roughly the same size as the size of the aggregate
+deltas (so on average we read 1.5 times the full text size after
+compression to reconstruct any given text) should give similarly good
+results.
+
+We will serialise the lines text for a new inventory as:
+'format: bzr journalled inventory v1' NL
+'parent:' SP BASIS_INVENTORY NL DELTA_LINES
+DELTA_LINES ::= DELTA_LINE (NL DELTA_LINE)*
+DELTA_LINE ::= NEWPATH SP file-id SP PARENT_ID SP LAST_MODIFIED SP CONTENT
+SP ::= ' '
+NEWPATH ::= NONE | PATH
+NONE ::= 'None'
+PATH ::= '/' path
+PARENT_ID ::= FILE_ID | ''
+CONTENT ::= FILE_CONTENT | DIR_CONTENT | TREE_CONTENT | LINK_CONTENT
+FILE_CONTENT ::= 'file' text_size text_sha1
+DIR_CONTENT ::= 'dir'
+TREE_CONTENT ::= 'tree' tree-revision
+LINK_CONTENT ::= 'link' link-target
+BASIS_INVENTORY ::= NULL_OR_REVISION
+LAST_MODIFIED ::= REVISION
+NULL_OR_REVISION ::= 'null:' | REVISION
+REVISION ::= revision-id-in-utf8-no-whitespace
+
+DELTA_LINES is lexographically sorted.
+
+At any commit the validator is the root of a tree. Changing the deltas -
+e.g. by rewriting an inventory in the history to be a full inventory
+itself will change validators for all inventories built upon that one.
+
+Diff
+^^^^
+
+We can diff between two adjacent inventories by examining their
+serialised deltas. This scales with size of change, not size of
+inventory - a significant win. We can diff between inventories that can
+be reached via the tree of different inventories by composing deltas.
+This will scale with the amount of history back to their common
+ancestor. And we can of course generate both inventories and perform our
+current diff as a fallback.
+
+Commit
+^^^^^^
+
+There are two ways that we could create new journalled inventories. We
+could take the in memory inventory that we create during commit, and
+that for the basis, then make a new delta and serialise it to create the
+new inventory text lines and validator that we incorporate into the
+revision. Additionally doing this we could serialise the inventory in
+its regular form to get a legacy validator that has no reference to
+existing data.
+
+Alternatively, we already generate an inventory delta during commit for
+updating the working tree. This delta can be serialised directly and
+should produce identical results. We will not get a standalone validator
+for this inventory though. On the other hand the memory and effort
+required to create this representation is proportional to changes in the
+inventory rather than to the size of the inventory, and as such should
+outperform our current serialisation significantly.
+
+Fetch
+^^^^^
+
+Fetching with a history horizon where some revisions are not fetched
+will need to copy the inventory components that make up the aggregate
+journalled inventory even if the revision objects are not being copied.
+As the validator in each revision is derived from the data in the
+inventories there is no need to copy the revision objects too, so our
+fetch invariants are able to be safely preserved. There may be some
+additional round trips to query for inventory components to copy during
+fetch, but the common case will be that we are copying the revision
+objects so our first query will find all inventory components that we
+need.
+
+Check/reconcile
+^^^^^^^^^^^^^^^
+
+The definition of 'garbage inventory' changes with this, but it should
+be trivial to accomodate. We can cache the validators of each inventory
+we process and that will be more efficient than the current requirement
+of parsing every inventory.
+
+Access a single inventory
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+By using the same hueristic as knits do, we need to read the same amount
+of data to construct a given inventory. We will need to parse the entire
+data set of data rather than just the result of applying the text data.
+On the other hand we will not have to parse knit hunks which will save
+the current double-parsing that occurs.
+
+Iter-inventories
+^^^^^^^^^^^^^^^^
+We can apply deltas forward or backwards to move along a chain of
+inventories trivially, rather than the current requirement of
+deserialising them all individually.