[RFC] Strawman replacement local directory state

Wed Jun 14 20:59:04 BST 2006

On 14/06/2006, at 2:29 AM, Michael Ellerman wrote:

> On 6/14/06, Robert Collins <robertc at robertcollins.net> wrote:
>> I've put a strawman working tree state file format up at
>> http://bazaar-vcs.org/Classes/WorkingTree.

I like it; it's at least a reasonable place to start experimenting.   
The refactoring to make it possible to keep these things together can  
probably only help.

>> Its got some interesting aspects, and some things that we might  
>> choose
>> to address...
>>
>> Its designed as a 'parse it fully every time, and write it fully too,
>> but make it fast'. format.
>
> I'm not really across this stuff enough to comment much, you've
> clearly thought about it a lot more than most people.
>
> Having said that I'm a tad wary of any format involves reading/writing
> in its entirety.

(Incidentally: is 2e6 files a good upper bound for what ought to be  
possible?  It's starting to get to the point where you probably  
cannot keep a whole inventory or even a description of a whole-tree  
change in memory on a 32-bit machine.  But let's continue anyhow.)

It is a bit of a concern.  For a tree with 2e6 files and with records  
of ~100 bytes each that will be 200MB of data to pull in every time,  
and it will likely use a similar amount in memory.  Those records  
could be larger if a merge has been committed or they're deeply  
nested, as they may well be for such a tree.

However the point is not the absolute size but rather how it compares  
to other operations on such a tree.  If you do 'status' there we'll  
need to at minimum stat all the files, which will take some time and  
also need to read ~100MB of inodes and directories.  If any have  
actually changed and we need to recalculate their sha1 then it will  
be much larger.

The worst case is probably diffing one particular named file which  
has changed.  Then we'll need to load the whole dirstate, search for  
the file, and (what's worse) write it all back out with a new working- 
copy sha1.  There are some things that can be done though:

  - since the file is in sorted order, we can do random access  
through it to find the path we're interested in

  - if we recompute the sha1 for a single file it might sometimes not  
be worth storing it; just leave the invalid one and reread it next time

There are a few possibilities: store one dirstate per source  
directory (or some similar split) or use an indirection like tdb or  
sqlite that allows inserting/removing data from the middle of a  
conceptual file.  But putting this in makes it easier to do those in  
the future.

One nice thing about keeping it in sorted order is that we can  
potentially diff/status/commit files as they stream through.

-- 
Martin