[RFC][PATCH] join two weave without expanding the text

Goffredo Baroncelli kreijack at alice.it
Sun Dec 18 18:19:09 GMT 2005


On Sunday 11 December 2005 04:21, you (John Arbash Meinel) wrote:
> Goffredo Baroncelli wrote:
> > Hi all,
> > 
> > the patch below contains an optimization for the function
> > Wave.join( ), so this function doesn't perform unnecessary content expansion if the
> > revision to be merged is already present. 
> > [...] 
> Thanks for the optimization. It seems generally worthwile.
> 
> I honestly think another optimization is available, which could merge
> revisions into a weave file, without unpacking the raw text, and merging
> it again (by diffing again).

Ok, the patch below try to do so.

The idea of extracting the diff from the weave is good, but this works only
if the base of the merge is the same. This is true for every revision with only one parent;
but if the parents are two or more, the parents may be in a different order
in the two weaves ( due to a different merging histories ). 

The patch adds two functions ( _iterate_diff(), _add_weave_revision() ) and
a class (_weave_inserter).
The function _iterate_diff( ) extracts the diff from the weave; the function
_add_weave_revision() iterates over the result of _iterate_diff() and injects 
( via the class _weave_inserter ) the diff in the weave. The revision with
more than one parents or with different parents [ghost revision ?] are processed 
with the old method.
 
With my big surprise if we insert a revision at time the gain is very low. In
order to get the best of performance we have to insert more revision. In fact
we insert a "line" of the revision graph.

On my test the gain is about 2x; my test is based on merging of two inventory,
the first with ~2000 revisions, the second with ~2950 revisions. On my (old)
Duron 800, the test with the original code took ~4.50minutes, with the new style 
they took about 2 minutes.

The code fail the test_merge_with_missing_file test (?). I have to investigate why.
I don't have anymore space to publish the repository of this code, so
I can send only the patch inline :-(


Any feedback is welcome.
Goffredo


=== modified file 'bzrlib/weave.py'
--- bzrlib/weave.py	
+++ bzrlib/weave.py	
@@ -744,8 +744,154 @@
                        state
 
 
+    def _iterate_diff( self, parent_no, nos):
+        """ extract the difference of the revision 'idx'
+            against its parent """
+
+        parents = self.inclusions([parent_no,])
+
+        lineno = 0
+        dset = set()
+        istack = []
+
+        for l in self._weave:
+            if isinstance(l, tuple):
+                c, v = l
+                isactive = None
+                if c == '{':
+                    assert v not in istack
+                    istack.append(v)
+                elif c == '}':
+                    v = istack[-1]
+                    istack.pop()
+                elif c == '[':
+                    if v in parents:
+                        assert v not in dset
+                        dset.add(v)
+                else:
+                    assert c == ']'
+                    if v in parents:
+                        assert v in dset
+                        dset.remove(v)
+
+                if v in nos:
+                    yield c,v,lineno
+
+            else:
+                assert isinstance(l, basestring)
+                if isactive is None:
+                    isactive = (not dset) and istack and (istack[-1] in parents)
+                if isactive:
+                    lineno += 1
+
+                if istack[-1] in nos:
+                    yield ('.',l,lineno)
+
+        if istack:
+            raise WeaveFormatError("unclosed insertion blocks "
+                    "at end of weave: %s" % istack)
+        if dset:
+            raise WeaveFormatError("unclosed deletion blocks at end of weave: %s"
+                                   % dset)
+
+    class _weave_inserter:
+
+
+        def __init__(self, parent_idx, w):
+            self.parent_idx = parent_idx
+            self._weave = w
+            self.pos = 0
+            self.istack = []
+            self.dset = set( )
+            self.line = 0
+
+        def jump(self, lineno ):
+            len_weave = len(self._weave)
+            isactive = None
+            while self.pos < len_weave:
+                l = self._weave[self.pos]
+                if isinstance(l, tuple):
+                    c, v = l
+                    isactive = None
+                    if c == '{':
+                        assert v not in self.istack
+                        self.istack.append(v)
+                    elif c == '}':
+                        self.istack.pop()
+                    elif c == '[':
+                        if v in self.parent_idx:
+                            assert v not in self.dset
+                            self.dset.add(v)
+                    else:
+                        assert c == ']'
+                        if v in self.parent_idx:
+                            assert v in self.dset
+                            self.dset.remove(v)
+                else:
+                    assert isinstance(l, basestring)
+                    if isactive is None:
+                        isactive = (not self.dset) and self.istack and \
+                            (self.istack[-1] in self.parent_idx)
+                    if isactive:
+                        if self.line == lineno:
+                            return self.pos
+                        self.line += 1
+                self.pos += 1
+            assert (lineno == self.line)
+            return self.line
+
+        def insert(self,l ):
+            self._weave.insert(self.pos,l)
+            self.pos += 1
+
+    def _add_weave_revision(self,other, revids):
+        
+        other_nos = set(map(other.lookup,revids))
+        other_no0 = other.lookup(revids[0])
+        assert( revids[0] in other._names )
+        other_parent_no = other._parents[other_no0][0]
+        other_parent = other.idx_to_name(other_parent_no)
+        assert( other_parent in self._names )
+        parent_no = self.lookup(other_parent)
+
+        map_no={other_parent_no:parent_no}
+        for revid in revids:
+
+            new_version = len(self._parents)
+
+            other_no = other.lookup(revid)
+            map_no[other_no] = new_version
+
+            assert( len(other._parents[other_no]) == 1 )
+
+            # if we abort after here the (in-memory) weave will be corrupt because only
+            # some fields are updated
+            self._parents.append([map_no[other._parents[other_no][0]]])
+            self._sha1s.append(other._sha1s[other_no])
+            self._names.append(revid)
+            self._name_map[revid] = new_version
+
+        included = self.inclusions([parent_no])
+        state = None
+        insert_level = 0
+        ins = self._weave_inserter(included, self._weave)
+        for c,v,l in other._iterate_diff(other_parent_no, other_nos):
+            if c in '[]{}':
+                if c in '[]{':
+                    if insert_level == 0:
+                        pos = ins.jump(l)
+                        assert( pos >= 0 )
+                    ins.insert((c,map_no[v]))
+                else:
+                    ins.insert((c,None))
+
+                if c == '{': insert_level += 1
+                elif c == '}': insert_level -= 1
+            elif c=='.':
+                assert insert_level>0
+                ins.insert(v)
+
     def join(self, other):
-        import sys, time
         """Integrate versions from other into this weave.
 
         The resulting weave contains all the history of both weaves; 
@@ -763,18 +909,14 @@
         for other_idx, name in enumerate(other._names):
             self._check_version_consistent(other, other_idx, name)
 
-        merged = 0
-        processed = 0
-        time0 = time.time( )
+        to_be_merge = []
         for other_idx, name in enumerate(other._names):
             # TODO: If all the parents of the other version are already
             # present then we can avoid some work by just taking the delta
             # and adjusting the offsets.
-            new_parents = self._imported_parents(other, other_idx)
+            # Goffredo: 18/12/2005, done
             sha1 = other._sha1s[other_idx]
 
-            processed += 1
-           
             if name in self._names:
                 idx = self.lookup(name)
                 n1 = map(other.idx_to_name, other._parents[other_idx] )
@@ -782,16 +924,29 @@
                 if sha1 ==  self._sha1s[idx] and n1 == n2:
                         continue
 
-            merged += 1
+            if len(other._parents[other_idx]) == 1 and \
+                name not in self._names:
+                    if len(to_be_merge):
+                        other_parent_name = other.idx_to_name(
+                            other._parents[other_idx][0])
+                        if other_parent_name == to_be_merge[-1]:
+                            to_be_merge.append(name)
+                            continue
+                    else:
+                        to_be_merge.append(name)
+                        continue
+
+            if len(to_be_merge):
+                self._add_weave_revision(other,to_be_merge)
+                to_be_merge = []
+
+            new_parents = self._imported_parents(other, other_idx)
             lines = other.get_lines(other_idx)
             self.add(name, new_parents, lines, sha1)
 
-        mutter("merged = %d, processed = %d, file_id=%s; deltat=%d"%(
-                merged,processed,self._weave_name, time.time( )-time0))
-
-
-
- 
+        if len(to_be_merge):
+            self._add_weave_revision(other,to_be_merge)
+
     def _imported_parents(self, other, other_idx):
         """Return list of parents in self corresponding to indexes in other."""
         new_parents = []
@@ -953,6 +1108,9 @@
         Auto-merge two versions and display conflicts.
     weave diff WEAVEFILE VERSION1 VERSION2 
         Show differences between two versions.
+    weave join WEAVEFILE1 WEAVEFILE2 
+        Show the join of two weaves.
+        
 
 example:
 
@@ -1062,7 +1220,16 @@
 
     elif cmd == 'stats':
         weave_stats(argv[2], ProgressBar())
-        
+
+    elif cmd == 'join':
+        w1 = read_weave(file(argv[2], 'rb'))
+        w2 = read_weave(file(argv[3], 'rb'))
+        if len(argv) > 4:
+            w2 = w2._shrink_num(int(argv[4]))
+        w1.join(w2)
+        sys.stderr.write("join\n")
+        write_weave(w1,sys.stdout)
+
     elif cmd == 'check':
         w = readit()
         pb = ProgressBar()


-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack @ inwind.it>
Key fingerprint = CE3C 7E01 6782 30A3 5B87  87C0 BB86 505C 6B2A CFF9
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20051218/7b818933/attachment.pgp 


More information about the bazaar mailing list