[RFC] sha1 of fileid naming for knit files

John Arbash Meinel john at arbash-meinel.com
Tue Oct 31 23:40:15 GMT 2006


Lachlan Patrick wrote:
...

> If the problem is 'bad' characters, one solution is to remove the bad
> characters and add the sha1 hash at the end. I've used systems where all
> non-ASCII is mapped to ASCII, spaces map to _ and most punctuation gets
> omitted, so you get mappings from Renée's pie chart.jpg to
> Renees_pie_chart.jpg which is pretty close to the original, but might
> not be unique. Adding a sha1 guarantees uniqueness:
> Renees_pie_chart-a7b64b8f2.jpg but this is even longer.
> 
> If the problem is long names, this may not help at all.
> 
> I don't fully understand the problem, can someone explain it? I *think*
> you might be talking about the double escaping problem on some web
> server setups?
> 
> Loki
> 


We already URL-encode the names, so that everything is ASCII
characters. That interacts poorly with some web servers, though, because
they decode the URL escapes twice. It is the web server that is
violating the RFC, but we are the ones triggering the problem.
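
To make the double-decode failure concrete, here is a small sketch. The
file id below is made up for illustration, and the "server" is just a
second call to unquote(); it is not a model of any particular web server.

```python
from urllib.parse import quote, unquote

# Hypothetical file id that happens to contain a literal '%' sequence.
file_id = "report%20draft-20061031"

# Escape it for use in a URL; the literal '%' becomes '%25'.
escaped = quote(file_id, safe="")

# A well-behaved server decodes exactly once and recovers the id.
once = unquote(escaped)

# A server that decodes twice turns the literal '%20' into a space,
# so it ends up looking for a different name.
twice = unquote(once)

print(escaped)
print(once == file_id)
print(twice == file_id)
```

Any id that contains something that looks like an escape after the
first decode is at risk, which is why not needing escapes at all is
attractive.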

So we've been trying to avoid doing that.

Just to be clear, these are file ids, not file names. It happens that we
take the original filename, remove any non-ASCII characters, lowercase
it, take the first 20 or so characters, and then add some random
characters on the end to make it unique. Overall, this works pretty
well, and avoids us having to URL escape new file ids.
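
Roughly, the steps above look like the sketch below. This is not the
actual bzr code: make_file_id, the safe character set, the "-" separator
and the suffix length are all assumptions made for illustration. In
particular, dropping punctuation and spaces is my assumption about what
it takes to avoid escaping; the description above only mentions
non-ASCII characters.

```python
import random
import string

# Assumed set of characters that never need URL escaping.
SAFE_CHARS = set(string.ascii_letters + string.digits + "._-")
SUFFIX_CHARS = string.ascii_lowercase + string.digits


def make_file_id(filename, prefix_len=20, suffix_len=12):
    """Build a file id from a filename (hypothetical helper)."""
    # Drop non-ASCII characters and anything outside the assumed safe set.
    kept = "".join(ch for ch in filename if ch in SAFE_CHARS)
    # Lowercase and keep only the first prefix_len characters.
    prefix = kept.lower()[:prefix_len]
    # Random characters on the end make the id unique in practice.
    suffix = "".join(random.choice(SUFFIX_CHARS) for _ in range(suffix_len))
    return prefix + "-" + suffix


print(make_file_id("Renée's pie chart.jpg"))
```

The result stays recognisable to a human but contains nothing a URL
would need to escape.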

So the problems are:
1) Old file ids that were generated by bzr before we did as much sanitizing.
2) File ids that exist because of conversions from other systems. For
example, converting an Arch project to bzr re-uses the Arch ids, as in
Arch-1:arch at name--foobar%category--branch--1.1--patch-0

Re-using the ids means that two people converting the same project from
Arch to bzr get the same file ids, but it has the downside that we don't
have as much control over generating "nice to the filesystem" file ids.

The bigger abuser of this is bzr-svn, because it has to use rather long
ids to get a unique SVN id. ("svn1:" + UUID of repository + path to
branch + revno in which the file was added, ...)

Anyway, mapping through sha1 would be okay, and it would mean we have
fixed size records, and don't have to worry about escaping the path on
disk. But we really need the reverse mapping first.
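
As a sketch of what that could look like: the name on disk is the sha1
hex digest of the file id, and since sha1 is one-way, something has to
remember which id produced which name. The FileIdIndex class and the
in-memory dict are purely illustrative assumptions, not an existing bzr
structure; where the mapping would really be stored is the open
question.

```python
import hashlib


def sha1_name(file_id):
    """Return a fixed-size, filesystem-safe name for a file id."""
    return hashlib.sha1(file_id.encode("utf-8")).hexdigest()


class FileIdIndex:
    """Hypothetical reverse mapping from sha1 name back to file id."""

    def __init__(self):
        self._by_name = {}

    def add(self, file_id):
        # Record the pair so the original id can be recovered later.
        name = sha1_name(file_id)
        self._by_name[name] = file_id
        return name

    def lookup(self, name):
        # sha1 cannot be inverted, so this table is the only way back.
        return self._by_name[name]


index = FileIdIndex()
arch_id = "Arch-1:arch at name--foobar%category--branch--1.1--patch-0"
name = index.add(arch_id)
print(len(name))
print(index.lookup(name) == arch_id)
```

Every name is 40 hex characters regardless of how long or ugly the
original id was.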

John
=:->
