Help with a unicode issue

John Arbash Meinel john at arbash-meinel.com
Thu Jun 28 22:59:34 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> John,
> 
> I'm experimenting with building the commit data using an iter_changes
> like API instead of walking the inventory of the work_tree. Of the
> almost 7000 tests, only *one* is failing and I'm wondering if it's a
> latent bug elsewhere.
> 
> The line numbers won't help you much, but the output is attached.
> Here's what I think is happening ...
> 
> 1. a legal inventory is being built
> 2. a file within that is renamed with a unicode path
> 
> In the current code base, the commit is succeeding because the inventory
> entry isn't rechecked. In my code base, inventory.make_entry() is being
> stricter in its checking of the normalisation and is falling over.
> 
> Does that sound feasible? If so, any ideas on where the root cause of
> the bug is? FWIW, the current code base does a make_entry iff the kind
> is changed (so the existing code is fragile anyhow if I'm right).
> 
> Thoughts?
> 
> Ian C.

I believe we fail to check normalization after a rename. The real
question is whether we should be checking normalization at all.

I used to think we should, but it is a lot of work, both in developer
time and in CPU time when adding 20k files. If you are on a Mac, it also
costs a lot to re-normalize all of your files as you pass through for
stuff like 'status'. *And* we still get complaints, because a lot of
people want to version whatever they can get their OS to spit out, since
they don't always have control over it. For example, Japanese MS Office
likes to use wide-character parentheses, which fail our "normalization"
check. But people feel it is inelegant for us to ask them to change the
names.
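
To make the wide-character case concrete, here is a minimal sketch using
Python's standard unicodedata module. The helper is_normalized() and the
file name are made up for illustration, and which normalization form the
check under discussion uses is not stated in this mail. The sketch only
shows that such parentheses pass a canonical form (NFC) but would be
rejected by a compatibility form (NFKC), which folds them to plain
ASCII parentheses.

```python
import unicodedata


def is_normalized(name, form):
    """Return True if `name` is already in the given normalization form."""
    return unicodedata.normalize(form, name) == name


# Hypothetical file name using fullwidth parentheses, U+FF08 and U+FF09.
wide_name = "report\uff08draft\uff09.doc"

# Canonical composition leaves fullwidth parentheses untouched.
nfc_ok = is_normalized(wide_name, "NFC")

# Compatibility composition folds them to "(" and ")", so the name
# no longer round-trips and a strict check would reject it.
nfkc_ok = is_normalized(wide_name, "NFKC")
folded = unicodedata.normalize("NFKC", wide_name)
```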

The only reason we do it is because of Mac (which is arguably broken).
Nobody else does it that I've found.

So in the end, I've given up. It is easier not to do it, and we get
about the same number of people who are unhappy. *But* we have someone
to blame it on (Mac's HFS+). And only people who use Mac have to pay the
penalty.
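
For context, HFS+ stores file names in a decomposed form (a variant of
NFD, as far as I know), so a name typed as a single composed character
can come back from the filesystem as a base letter plus a combining
mark. The sketch below, with a made-up file name and a hypothetical
same_name() helper, shows why a plain string comparison fails and why
every name has to be re-normalized before comparing.

```python
import unicodedata

# "é" as a single code point (U+00E9), the way most tools would type it.
composed = "caf\u00e9.txt"

# The decomposed spelling: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", composed)


def same_name(a, b):
    """Compare two file names after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Without normalization the two spellings are different strings.
raw_equal = composed == decomposed

# With normalization they compare equal, at the price of one
# normalize() call per name on every pass over the tree.
normalized_equal = same_name(composed, decomposed)
```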


All this to say....

a) Right now, we *do* still assert that filenames should be normalized
when added. So we should do it for renames.

b) We probably don't want to do it at all.

c) Right now our code is probably pretty broken for non-normalized names
on a Mac: you can't add them, you have no way to "fix" them on a Mac,
and 'bzr status' will show them as missing+unknown. So we probably need
to do (b), i.e. drop the check, sooner rather than later.
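
A rough sketch of what (a) and (c) mean in practice. None of these names
come from the real code base: ToyInventory, check_normalized(),
NotNormalizedError and status() are hypothetical stand-ins, and the
choice of NFC is an assumption. The point is only that rename goes
through the same check as add, and that a name whose spelling differs
between the inventory and the disk shows up twice in a naive status.

```python
import unicodedata


class NotNormalizedError(ValueError):
    """Hypothetical error for names that fail the normalization check."""


def check_normalized(name, form="NFC"):
    # The form is an assumption; the mail does not say which one is used.
    if unicodedata.normalize(form, name) != name:
        raise NotNormalizedError(name)
    return name


class ToyInventory:
    """Hypothetical stand-in for an inventory: maps file id -> name."""

    def __init__(self):
        self.names = {}

    def add(self, file_id, name):
        self.names[file_id] = check_normalized(name)

    def rename(self, file_id, new_name):
        if file_id not in self.names:
            raise KeyError(file_id)
        # Point (a): apply the same check on rename as on add.
        self.names[file_id] = check_normalized(new_name)


def status(inventory, names_on_disk):
    """Return (missing, unknown) by comparing raw, un-normalized names."""
    versioned = set(inventory.names.values())
    on_disk = set(names_on_disk)
    return sorted(versioned - on_disk), sorted(on_disk - versioned)
```

With this toy model, renaming to "cafe\u0301.txt" raises
NotNormalizedError, while an inventory holding "caf\u00e9.txt" compared
against a disk listing of "cafe\u0301.txt" reports the file as both
missing and unknown, which is the symptom described in (c).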

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGhC9GJdeBCYSNAAMRAiuyAKCtyCeIThCLfrtZ7kFxTahDi9APzACdFLPS
+Q+3MYk19Biq4dE+/hsSdSo=
=pqm2
-----END PGP SIGNATURE-----


