[RFC] BIG speed improvment in clone/branch on http transport

Sun Dec 4 23:15:33 GMT 2005

On Sunday 04 December 2005 23:28, you (Robert Collins) wrote:
> On Thu, 2005-12-01 at 22:48 +0100, Goffredo Baroncelli wrote:
> > Hi all,
> > 
> > the patch below introduces a new function to the branch class. The function
> > is named file_involved(rev1,rev2 ) and returns the file_id 
> > involved in changes between the revision1 and the revisions2. Moreover
> > this function can be called with a set of revisions as argument: in this case
> > file_involved() returns the file_id(s) involved in the revisions set.
> > 
> > This function is __very__ useful during the clone/branch function because
> > with this is very easy know which weave we need to update/download.
> 
> 
> I haven't done a code review per se at this point, but let me say -
> cool! This is something talked about before but I'm ecstatic to see that
> someone has done it ;).

Ok, and I have found a little bug in my code; so as soon as I will post a 
review...

> 
> A few key things though that should be done before this is merged:
> 
>  * We need test cases that drive that function hard - its going to be
> core functionality and an error there -> corrupt repository. In
> particular I think we need some tests that use a format 6 branch, so
> that as new branch formats come in we know that the top level behaviour
> is preserved across formats (open a format 6 branch from a format 7
> branch for instance). Another important test is when we have
> ghosts/import revisions:
> Imagine that we have an inventory weave that starts with 
> rev A
> 
> then
> 
> we add in rev C :
> rev A content 
> rev C content
> 
> and then we add B which is a merge of A and C
> rev A content
> with B new lines/gone lines
> rev C content
> 
> the gap from A to B will not show any of C, but we need to get all of C
> that was not in A or B to be complete - we should test that this happens
> (and if it doesn't then rethink the idea).

The function file_involved, in order to detect the file involved between the
changes A->B performs:
1) compute all ancestry of A ( and there isn't C )
2) compute all ancestry of B ( and  there _is_ C )
3) compute the difference between B and A ( and there _is_ C )
4) select all the file touched in the revision detected at step 3.
So the function should return even the file touched in C ( in theory :-) ) 

However I will select some test case; but one question: what is the definition 
of a 'ghost revisions' ? I know their existence but it isn't clear, to me, what they are.
On the basis of the bzrlib.check.check_one_rev( ), it seems that a ghost revision
is a revision referred by another revision via its parent_ids, without
existing in the revision store: it is correct ? But what are the difference between 
a ghost revision and a missing revision ?

> 
> 
>  * It would be nice to use that in fetch_greedy too, just to make the
> code simpler. It would (obviously) not make things faster as most weaves
> will need merging rather than bulk copying.

I am working on it; the speed increase are not so good, but there are ( about 1.5x - 2x );
but i need some time in order to tweak the code.

However, the weave.join code can be faster: it is possible to check the existence
of a revision without expanding its contents: the gain should be another 2x...
My problem is  to understand why sometime it is raised a WeaveParentMismatch
exception, and why, if it happens, instead the weave.join( ) the
function reweave( ) is called. The comment refers to the problem of the ghost revisions
... but it isn't very clear

> 
> Rob
>
Goffredo

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijack @ inwind.it>
Key fingerprint = CE3C 7E01 6782 30A3 5B87  87C0 BB86 505C 6B2A CFF9

[RFC] __BIG__ speed improvment in clone/branch on http transport

[RFC] BIG speed improvment in clone/branch on http transport