large files and storage formats
Chad Dombrova
chadrik at gmail.com
Fri Jul 9 20:01:17 BST 2010
>
>
> I don't think you'll be able to make it such that the file in a 2a
> repository is a hardlink to the file in the working tree. Ultimately
> there is quite a bit of risk in that. Accidental modification of the
> working tree copy, and suddenly your archive is corrupted for everyone,
> and you don't have another copy to easily restore from.
True. My loose-object plan involves ensuring that objects going into the
store are made read-only (and chown'd to a "repouser" via a small
setuid'd program) so that hardlinks generated from them are not editable. If
I were to use 'checkout --hardlink', I'd most likely take a similar approach
to ensure that users don't accidentally edit each other's files across
working trees.
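Roughly what I have in mind for the store side (a minimal Python sketch; the
function names are placeholders, not bzr or dulwich API, and the chown to
"repouser" is assumed to happen in the separate setuid helper):

    import os
    import shutil
    import stat

    def store_object(src_path, store_path):
        # Move a file into the object store and strip write permission so
        # that hardlinks made from it later cannot be edited in place.
        # (The chown to a dedicated "repouser" would be done by a small
        # setuid helper, not shown here.)
        shutil.move(src_path, store_path)
        os.chmod(store_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

    def checkout_hardlink(store_path, worktree_path):
        # Materialize a working-tree file as a hardlink to the read-only
        # store copy; editing it requires deliberately breaking the link.
        os.link(store_path, worktree_path)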
> With 'bzr co --lightweight' and '--hardlink', you can easily get to the
> point where there is 1 copy on the centralized repository, and all
> working trees have hardlinked data.
>
This could be a good compromise solution, but it currently seems to be
missing a critical component: "bzr update --hardlink". "checkout
--hardlink" is great for getting started, but its benefits disappear once bzr
stops using hardlinks for new updates. Ideally, when updating with
"--hardlink", each repo would have a list of shared-repo paths that it would
search first (recursively?) to generate hardlinks, falling back to the
normal behavior if nothing suitable is found. That search pattern would
also be useful for "checkout --hardlink", which currently only looks at the
source repo's working tree. For instance, if the -r flag is passed to
checkout to get an older revision, the working tree of the source repo
might not have anything to hardlink from, but other clones (sorry for the
hg/git terminology!) very well might.
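The lookup I'm imagining would go something like this (a rough Python
sketch; the function names, the hash-every-candidate search, and
write_fallback are all my own placeholders, not anything bzr actually
exposes):

    import hashlib
    import os

    def find_hardlink_source(search_roots, wanted_sha1):
        # Walk each configured shared-repo / working-tree path looking for
        # a file whose content matches the wanted SHA-1; return its path,
        # or None if nothing suitable is found.
        for root in search_roots:
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    try:
                        with open(path, "rb") as f:
                            if hashlib.sha1(f.read()).hexdigest() == wanted_sha1:
                                return path
                    except OSError:
                        continue
        return None

    def update_file(search_roots, wanted_sha1, dest_path, write_fallback):
        # Hardlink from a matching source if one exists; otherwise fall
        # back to the normal behavior (write_fallback stands in for
        # extracting the text from the repository as usual).
        source = find_hardlink_source(search_roots, wanted_sha1)
        if source is not None:
            os.link(source, dest_path)
        else:
            write_fallback(dest_path)

A real implementation would of course keep an index keyed by content hash
rather than hashing every candidate file on each lookup; the point is just
the search-then-fall-back order.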
> 'bzr co --lightweight' isn't as fast as it could be, but it is decent,
> and IMO would be much more worth your time than a new storage format.
Agreed. I gave --lightweight a test drive and it works like Mercurial's
"shared" repo, which I like conceptually more than git's shared repos.
> I'll note, though, that the git 'loose blobs' are not simply file content.
> IIRC they have a 'blob ####' header (# is the length, IIRC), and then the
> whole content is DEFLATE compressed. So even that doesn't fit his goal of
> having one copy of the text between the repository and the working tree.
True. As a proof of concept, I modified the git C code to separate the
header and leave the blob uncompressed. Doing the same to the dulwich Python
code would be trivial.
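For reference, the standard layout versus the variant I prototyped looks
roughly like this in Python (a sketch only, not actual dulwich code; the
".hdr" sidecar naming is something I made up for the proof of concept):

    import hashlib
    import os
    import zlib

    def write_loose_blob_standard(objects_dir, content):
        # Standard git loose blob: "blob <size>\0" header prepended to the
        # content, the whole thing DEFLATE-compressed, stored under
        # objects/<first 2 hex chars>/<remaining 38>.
        raw = b"blob %d\x00" % len(content) + content
        sha = hashlib.sha1(raw).hexdigest()
        path = os.path.join(objects_dir, sha[:2], sha[2:])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(raw))
        return sha

    def write_loose_blob_uncompressed(objects_dir, content):
        # Proof-of-concept variant: keep the header in a sidecar file and
        # store the content verbatim, so the working-tree file can simply
        # be a hardlink to the store copy.  (The ".hdr" sidecar is my own
        # convention, not git's.)
        raw = b"blob %d\x00" % len(content) + content
        sha = hashlib.sha1(raw).hexdigest()
        path = os.path.join(objects_dir, sha[:2], sha[2:])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path + ".hdr", "wb") as f:
            f.write(b"blob %d\x00" % len(content))
        with open(path, "wb") as f:
            f.write(content)
        return sha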
I'm going to start playing with the patch that John sent to disable delta
compression. BTW, a similar patch can be made to Mercurial to boost
performance with large files, but unlike bzr, Mercurial does not have a
handy pack command to recompress later.
One more question about packs: is it possible to *quickly* and *easily*
punch/prune/obliterate a file revision from the repo? Git does this well;
Mercurial does not. I checked `bzr help pack` and it did not show any
options for this, but it also does not show "--clean-obsolete-packs" (on
version 2.2b1). What's up with that?
Thanks again for all the great responses. This is definitely the most
informative mailing list that I've spammed lately :)
-chad