[RFC] case sensitivity on Windows

Mon Nov 3 00:54:49 GMT 2008

On Mon, Nov 3, 2008 at 10:52 AM, Mark Hammond <mhammond at skippinet.com.au> wrote:
>> That's still pretty nice progress.  I'll read the patch now, but as a
>> tip, you should really update the subject line to include [merge] so
>> that BB will catch it.  It's a good safety net to make sure it does
>> get an answer one way or the other.
>
> Thanks for the reply.  FWIW, I used [RFC] in preference to [MERGE] in the
> subject line as I didn't consider it ready for merging.

Does it make anything worse, or is it just not a full solution?

(Oh, I see you answered that at the end.  I don't propose to teach you
how to suck eggs but we've found it can save confusion or roundtrips
if code that's not yet ready to merge for some reason says why in the
cover letter.)

>> So this routine, as it currently stands, is not win32-specific (aside
>> from the comment), but in fact is the kind of thing we'd like to also
>> use on os x or linux case-insensitive filesystems.  So even if we're
>> not handling those cases at present, I'd name it accordingly,
>> cicp_canonical_relpath or something.
>
> How do you see us deciding to use cicp_canonical_relpath at runtime?  Would
> we still unconditionally use it for Windows, or would we attempt to
> determine this dynamically for each tree object (ie, by sniffing the working
> tree for this property?)

I liked the idea of looking for the existence of eg
.bzr/checkout/FORMAT - it's true this does not tell us whether it's
cicp or case-folding, but for the stuff discussed so far (and perhaps
for the whole thing) it may not make any difference.  Alternatively
perhaps it could be a configuration setting.

> It has been mentioned here before that caching this per-directory might be
> good for performance.  Would it be preferred to 'implicitly cache' these
> results (eg, have cicp_canonical_relpath() keep a hidden, private cache), or
> build caching into the signatures of the functions - eg, something like:
>
>  entries = cicp_canonical_prime (dir_name)
>  filename = cicp_canonical_relpath(filename, entries) # entries could even
> be None to avoid caching?
>
> ?  The former looks much easier for consumers, but tricky to get right in
> the face of threads etc....

In general we avoid implicit caching because it's hard to control (in
for memory use, expiry, isolated testing), and instead look for an
object with the appropriate lifetime that can hold the cache.  (This
doesn't mean that the cache should be fused into another object; it
can be enough just to hold a reference to it.)

I would think about something like a class that holds a LRUCache of
these objects, and then one of these can be held by the WorkingTree.
The normal user interface then is wt.canonical_path.

This may not be enough if we need to do these queries before opening a tree.

We could later add a multi-variant that looks up many paths in one go,
but maybe yagni until it's clear this is heavily used.

ie

  canonical_relpaths(['goo']) => {'goo': 'Goo'}

>> > (the above, but the last 'add' is implicit)
>> > % touch foo
>> > % bzr add foo
>> > % mv foo Foo
>> > % bzr add
>> > => '' (ie, bzr should ignore the filesystem case for an existing
>> entry)
>> > % bzr diff/status
>> > => '' (ie, nothing has changed)
>>
>> This one is a bit interesting because it's saying that if the file was
>> renamed externally, we don't consider that it was "semantically"
>> renamed.  On Linux it would show as the newly added file being removed
>> and a new one present, until you confirmed that it had moved.
>
> Yes.  I think it is very important to decide on these semantics early on if
> at all possible.
>
> ie, is my example above "correct but interesting" or "interesting, but it's
> not clear if it's correct yet"?  [FWIW, I'm not particularly fussed about
> what semantics we decide, but think it worth drawing a line in the sand as
> early as possible.]

I agree about it being good to set clear semantics, so that we don't
have inconsistencies between different particular cases.  So the
examples are a great way to start, but we should also spell out the
rules.

The rule here seems to be: if the file has its case externally
changed, this does not count as a change.  This probably works best on
case-folding systems anyhow.

>
>> Also
>>
>>   touch foo
>>   bzr mkdir goo
>>   bzr add
>>   bzr mv FOO GOO
>>   => renamed ./foo ->./goo/foo
>>  (or arguably ./foo -> ./goo/FOO)
>
> Right.  I think we seem to be moving towards a policy of 'for items already
> in the inventory, ignore the case on the file-system' - so I'd suggest your
> first response is the most appropriate.

Right

>
>> bzr mv has two almost separate ways of interpreting the command line
>>
>>   bzr mv OLDNAME NEWNAME
>>   bzr mv OLDNAME... NEWDIR
>>
>> and you've mostly been talking about the former.  In the latter, I
>> suppose we want to match case-insensitive against all the parameters,
>> which is not so hard.
>
> Yes - this is another limitation in the patch I sent - it blindly 'fixes'
> every filename arg without considering the context.  'mv' is the obvious
> example, but I fear others may also exist.  My 'problem' is that the list of
> filenames is passed to tree_files(), which walks subdirs etc - so once
> 'tree_files()' has returned I've lost the ability to fixup the filenames as
> specified by the user.  But tree_files() doesn't have the context to know
> that the last item shouldn't be modified, for example.
>
> In other words, I'm aware of that problem, but struggling to find a way to
> modify cmd_mv to handle this.  Thoughts on this are very welcome and it is
> basically exactly where I am stuck - one of my new tests fails for this
> exact reason.

Well, it would be reasonable to for example just not pass the last
parameter to tree_files.  But I don't think I understand the problem
properly.  Is it that if you map the last argument to filesystem case,
then mv can't find the directory in its own inventory?

So at present you have a method which, given a name in arbitrary case,
tells you what case it actually has on disk.  It seems like we also
need a method which, given a name in arbitrary case, looks up that
name in the tree.  I don't think this exists yet, and it probably has
to, because the inventory and tree code at present is case sensitive.
We can't just use the first because in general the case present on
disk is not the same as what's used in memory.

I can think of a few ways to tackle this: one is to have the
inventory/tree code have a 'mode' (conceptually a bit on the object;
possibly actually composition with a different object) that tells it
to do everything case-insensitively.

A couple of issues here: there are several interfaces that take in
paths; presumably all would need to be updated.  (It may be possible
to rationalize that set by eliminating old ones.)  The internals of
the dirstate are defined to be in a particular collation order, so
doing case insensitivity there depends on doing something like your
canonical_relpath on the tree control data - searching each directory
on the way down.

So that's probably how I would begin - a function that flips case to
match the inventory.

This suggests that the existing canonical_relpath maybe should be
named to indicate it's matching the external filesystem case.

Hope that helps,
-- 
Martin <http://launchpad.net/~mbp/>