please check out weave-format branch

John A Meinel john at arbash-meinel.com
Sat Sep 24 00:23:12 BST 2005


Martin Pool wrote:
> On 24/09/05, John A Meinel <john at arbash-meinel.com> wrote:
> 
>>Another small point as I get deeper into the code.
>>Why do you write out the revision xml as uncompressed?
>>
>>I guess in my testing, you don't save a lot of space, since the files
>>are small. --apparent-size is 660k vs 755k, but because of filesystem
>>blocks, du -ksh reports 6.3M versus 6.2M.
> 
> 
> For just this reason - the actual saving is small, and it seems like
> python gzip is actually somewhat expensive to run.  It's not a final
> decision, and perhaps even if it's less effective on disk it'd be
> better to have it compressed to help with http.
> 
> (Actually I overestimated the cost of gzip because my upgrade cost was
> doing some redundant work, so perhaps it doesn't matter so much.)
> 
> If merging back to your code meant they had to be compressed I
> wouldn't really mind.
> 
> 
>>But the above du does raise an interesting issue. That we are losing
>>about 10x disk space because of a bunch of very small files. It isn't a
>>lot of space, and I don't know if people will really care, but I thought
>>I would mention it.
> 
> 
> Yes, it is quite noticeable.  On the other hand it's only an overhead
> of 2k per revision, which looked at that way is not so bad.
> 
> You can imagine designing an append-only file that stores them more
> compactly but allows fast random access, but perhaps it's the
> filesystem's job.
> 
> If we do keep them as separate files it might be good to eventually
> allow for hashed subdirectories to accommodate filesystems that can't
> handle having many files in a directory.
> 

Sure, but what is the actual number of files that you are thinking about
supporting? 10k? 100k?
100k starts to get a little painful,
especially if you are on Windows and have the DOS 8.3 short names enabled
(which they are unless you explicitly disable them).
Since most revision ids start with the email address, they are all going
to collide at the 8.3 level, and cause no end of headaches.

So I guess I agree that the revision-store should probably try to nest 
things. One obvious place to do it would be to use the email address as 
the first level of nesting, and then maybe the date as the second. So 
you would have:

mbp at sourcefrog.net-20050716000740-f2dcb8894a23fd2d =>
mbp at sourcefrog.net/20050716000740/f2dcb8894a23fd2d.gz

Though I'm thinking that probably fans out far too quickly (you are very
likely to have lots of folders at the second level, only a few at
the first, and only 1 file in the final folder). So maybe go by month:

mbp at sourcefrog.net/200507/16000740-f2dcb8894a23fd2d.gz

It would have been nice to separate only at the "-" because then it would
be a simple transformation. But I think this will work anyway.

We could just do it as "until first -", next 6 characters, rest of text. 
So it doesn't have to fit the formula (though almost all revisions will).
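
A minimal Python sketch of that rule ("until first -", next 6 characters,
rest of text). The function name nested_store_path and the ".gz" default
suffix are only illustrative assumptions, not existing bzr code, and
leaving ids that are too short in a flat layout is likewise an assumption:

```python
def nested_store_path(revision_id, suffix=".gz"):
    """Map a revision id to a nested relative path (hypothetical helper).

    Rule from the message: everything up to the first "-" is the first
    directory, the next 6 characters are the second directory, and the
    remainder is the file name.
    """
    head, sep, tail = revision_id.partition("-")
    if not sep or len(tail) <= 6:
        # Assumption: ids too short to split are stored flat.
        return revision_id + suffix
    # tail[:6] is normally YYYYMM, giving one directory per month.
    return "/".join([head, tail[:6], tail[6:]]) + suffix


example = nested_store_path(
    "mbp at sourcefrog.net-20050716000740-f2dcb8894a23fd2d")
print(example)
```

An id whose address part itself contains a "-" would be split at that
earlier hyphen. The mapping is still deterministic; it just does not
produce the address/month layout for that id.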

I forgot about the DOS name mangling. Even just 1k files with the same
prefix start to get painful.

John
=:->

> --
> Martin
> 
