Thoughts on file ids

John Arbash Meinel john at arbash-meinel.com
Fri May 6 12:59:09 UTC 2011


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 05/06/2011 12:50 PM, Robert Collins wrote:
> On Fri, May 6, 2011 at 3:07 AM, Jelmer Vernooij <jelmer at samba.org> wrote:
>> The main problem I see with something like this is that it doesn't help
>> with the problem of texts with the same content having a different file
>> id/revision and being stored multiple times in the repository.
> 
> 2a already handles this more-or-less.
> 
> -Rob
> 

It can combine arbitrary texts together, and if they are identical they
will share a single record. (So (f1, r1) == blob 10, offset 10, and
(f2, r2) == blob 10, offset 10.)
At the moment, it only shares exact records if the texts would otherwise
be put into the same GroupcompressBlock. In theory, 'pack' could
optimize for cases like this. Right now we sort based on file-id, and
file-id is loosely based on filenames, so similarly named files are
likely to land in the same groupcompress block and get de-duped.
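
To make the "only within the same block" limitation concrete, here is a
rough sketch. It is not the bzrlib code: pack_into_blocks, block_size
and the list-of-lists "blocks" are all made up for illustration, and a
real block is delta-compressed rather than a plain list. Texts are
sorted by (file_id, revision_id), and an identical text is shared only
if its twin is still in the block currently being filled.

```python
import hashlib


def pack_into_blocks(texts, block_size=2):
    """Toy model of per-block de-duplication (hypothetical helper).

    texts maps (file_id, revision_id) -> bytes.
    Returns (index, blocks), where index maps each key to
    (block_number, offset) and blocks is a list of lists of bytes.
    """
    index = {}
    blocks = []
    current = []  # texts stored in the block being filled
    seen = {}     # sha1 -> offset, for the current block only
    for key in sorted(texts):  # sorting by file-id first, as in the mail
        content = texts[key]
        digest = hashlib.sha1(content).hexdigest()
        if digest in seen:
            # Identical text already in this block: share its record.
            index[key] = (len(blocks), seen[digest])
            continue
        if len(current) >= block_size:
            # Block is full: start a new one and forget what we saw.
            blocks.append(current)
            current, seen = [], {}
        seen[digest] = len(current)
        index[key] = (len(blocks), len(current))
        current.append(content)
    if current:
        blocks.append(current)
    return index, blocks
```

With small blocks, two same-content files whose file-ids sort next to
each other share one record, while a third copy whose file-id sorts far
away lands in a later block and is stored again.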

An idea I like is that if we could determine that two file-ids shared a
sha1 hash at any point in history, then they could be promoted to
'buddies' and made more likely to share blocks.
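
A minimal sketch of how such 'buddies' might be found, assuming we can
walk every (file_id, revision_id, content) triple in history. The name
find_buddies and the input shape are invented for this example; nothing
like it exists in bzrlib as far as this mail says. It groups file-ids
with a small union-find, so the relation is transitive.

```python
import hashlib
from collections import defaultdict


def find_buddies(history):
    """Group file-ids that ever shared a sha1 (hypothetical helper).

    history is an iterable of (file_id, revision_id, content_bytes).
    Returns a list of sets, one per group of two or more file-ids.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_owner = {}  # sha1 -> first file-id seen with that content
    for file_id, _revision_id, content in history:
        find(file_id)  # register the file-id even if it has no buddy
        digest = hashlib.sha1(content).hexdigest()
        owner = first_owner.setdefault(digest, file_id)
        union(owner, file_id)

    groups = defaultdict(set)
    for file_id in list(parent):
        groups[find(file_id)].add(file_id)
    return [group for group in groups.values() if len(group) > 1]
```

A packer could then sort by buddy group before file-id, so that texts
from buddy files are placed next to each other.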

There are lots of things like this that an advanced 'compressor' could
do, and all current clients would still read the result correctly (so
it is fully backwards compatible). Though the next time *they* repack,
they'll probably write the data differently.

That said, I've also considered using a text key (say, sha1 of content)
for the actual data storage, and just storing the (file_id, revision_id)
graphs as separate knowledge. (You can't use sha1 alone for the graphs
because it creates cycles: if a file is changed and later reverted, the
same sha1 shows up as its own ancestor. You have to have sha1 +
something.)
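
Roughly what I have in mind, as a toy sketch only. TextStore, its
attributes and has_cycle are all hypothetical names, not existing
bzrlib API. Content is stored once per sha1, the graph is kept per
(file_id, revision_id), and collapsing the graph onto sha1 alone shows
the cycle problem.

```python
import hashlib


class TextStore:
    """Toy content-addressed store with a separate per-file graph."""

    def __init__(self):
        self.blobs = {}    # sha1 -> content, stored once
        self.sha_of = {}   # (file_id, revision_id) -> sha1
        self.parents = {}  # (file_id, revision_id) -> tuple of parent keys

    def add(self, file_id, revision_id, content, parents=()):
        digest = hashlib.sha1(content).hexdigest()
        self.blobs.setdefault(digest, content)
        key = (file_id, revision_id)
        self.sha_of[key] = digest
        self.parents[key] = tuple(parents)
        return key

    def get(self, file_id, revision_id):
        return self.blobs[self.sha_of[(file_id, revision_id)]]

    def sha1_graph(self):
        """Collapse the graph onto sha1 only (the thing that breaks)."""
        graph = {}
        for key, parents in self.parents.items():
            node = graph.setdefault(self.sha_of[key], set())
            node.update(self.sha_of[p] for p in parents)
        return graph


def has_cycle(graph):
    """Depth-first search for a cycle in a node -> parents mapping."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = GREY
        for parent in graph.get(node, ()):
            colour = state.get(parent, WHITE)
            if colour == GREY or (colour == WHITE and visit(parent)):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in list(graph))
```

Add content A, then B, then A again as three revisions of one file: the
store holds two blobs, the (file_id, revision_id) graph stays acyclic,
and the sha1-only graph has A and B as each other's ancestors.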

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk3D8J0ACgkQJdeBCYSNAANg8wCgi8yK251FhP8iU3z6yjVOlu3h
+9wAn3wBVeQ0ZAcNT1K6qL8kSp58jwcD
=YnB3
-----END PGP SIGNATURE-----


