Path Tokens

Fri Jul 24 19:43:13 BST 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...
> 
> == Storage implications ==
> 
> I said above that storing the mappings themselves wouldn't be too
> difficult, but it does have implications for the existing storage.
> 
> If you have two trees that join all the files on every merge then you
> can end up with many changes of ids over time, and this will
> lead to inefficiencies in the storage formats that we currently
> have. This may mean we need to change some of that in order
> to accommodate this.

Currently we store a separate "copy" of the text for every (file_id, revision)
tuple, which means that you'll also likely have lots of copies of the same
text in the repository.

Now, if the (file_id, revision) key sorts appropriately, the --2a format
should be able to compress it very well. (I didn't finish implementing it, but
if identical content appears in the same *group* we can fairly easily just
copy the reference, rather than writing another delta recipe.)

However, we've also discussed changing the *text content* storage so that it
is simply addressed by hash, leaving the (file_id, revision) tuples as graph
keys, history info, etc.
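
To make the difference concrete, here is a rough Python sketch of the two
layouts. It is only a toy model: the class and method names are made up for
illustration and are not bzrlib APIs, and the use of SHA-1 as the content hash
is an assumption. It ignores deltas and compression entirely and just counts
how many full texts each layout ends up holding when the same content shows
up under several file_ids.

```python
import hashlib


class KeyedTextStore:
    """Toy model: one stored text per (file_id, revision) key."""

    def __init__(self):
        self.texts = {}  # (file_id, revision) -> bytes

    def add(self, file_id, revision, text):
        self.texts[(file_id, revision)] = text

    def stored_bytes(self):
        return sum(len(t) for t in self.texts.values())


class HashAddressedStore:
    """Toy model: texts addressed by hash, keys only point at a hash."""

    def __init__(self):
        self.blobs = {}  # sha1 hex digest -> bytes
        self.keys = {}   # (file_id, revision) -> sha1 hex digest

    def add(self, file_id, revision, text):
        digest = hashlib.sha1(text).hexdigest()
        # Identical content is stored once, whatever its file_id.
        self.blobs.setdefault(digest, text)
        self.keys[(file_id, revision)] = digest

    def get(self, file_id, revision):
        return self.blobs[self.keys[(file_id, revision)]]

    def stored_bytes(self):
        return sum(len(t) for t in self.blobs.values())


def demo():
    # The same text reachable under three different (hypothetical) file_ids,
    # as could happen when two trees join their files on merge.
    text = b"same content\n" * 100
    keyed, hashed = KeyedTextStore(), HashAddressedStore()
    for file_id, rev in [(b"id-a", b"rev-1"),
                         (b"id-b", b"rev-1"),
                         (b"id-ab", b"rev-2")]:
        keyed.add(file_id, rev, text)
        hashed.add(file_id, rev, text)
    return keyed.stored_bytes(), hashed.stored_bytes()
```

In the hash-addressed model the (file_id, revision) keys still exist, so the
per-file graph and mapping overhead remains; only the text payload stops
being duplicated.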

So in pre-2a formats we are likely to see significant bloat, as we cannot delta
between content with a different file_id. In 2a we will probably see *some*
bloat, and in the future we may get to the point where there isn't any (other
than the bloat in the per-file graphs, mapping storage, etc.).

Otherwise this at least is how *I* understand path tokens.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFKagDBJdeBCYSNAAMRAvRrAJ9i1GXaRyMNkbZ0OVsAZFSOQoyNDgCgoGcR
P6ihy161txOKoYd9m86tdTs=
=TBTp
-----END PGP SIGNATURE-----