[RFC] Improvements to is_ignored.

Fri Jan 20 23:57:24 GMT 2006

On 20 Jan 2006, Jan Hudec <bulb at ucw.cz> wrote:
> Hello All,
> 
> I have added some tests on my improvements to is_ignored. 

This sounds useful; thank you.

> However today
> Robert Collins raised several issues on IRC that should be discussed before
> this work can be merged in.
> 
> First the changes the branch contains:
> 
>   * All ignore patterns are compiled in one huge regular expression, which is
>     then matched in one go for each filename. Earlier measurements showed the
>     speedup to status is significant. The is_ignored method now does not
>     return the matching glob, because it does not know.
>   * For cases where we want to find which pattern matched, the is_ignored_by
>     method wraps the patterns in capturing groups instead and uses lastindex
>     to find which pattern matched. Unfortunately the Python regex engine has
>     an arbitrary limit on the number of capturing groups, so the patterns are
>     grouped in batches of 50.
>   - These two changes together should noticeably improve the speed of status
>     and other commands that walk the tree. This part is backward compatible
>     and could possibly be merged separately after some minor cleanup.
> 
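To make the two-regex approach described above concrete, here is a minimal
sketch. It is not the code from the branch: the function names, the sample
patterns and the use of \Z for a full match are all assumptions made for
illustration, and it assumes the translated patterns contain no capturing
groups of their own (otherwise lastindex would no longer map to a pattern).

```python
import re

# Hypothetical regex fragments, assumed to be already translated from globs
# and to contain no capturing groups of their own.
PATTERNS = [r".*\.pyc", r".*~", r"build/.*"]


def compile_combined(patterns):
    """Join all patterns into one non-capturing alternation (fast yes/no)."""
    return re.compile("(?:%s)\\Z" % "|".join("(?:%s)" % p for p in patterns))


def compile_chunks(patterns, chunk_size=50):
    """Build one regex per chunk, with one capturing group per pattern."""
    chunks = []
    for start in range(0, len(patterns), chunk_size):
        chunk = patterns[start:start + chunk_size]
        regex = re.compile("(?:%s)\\Z" % "|".join("(%s)" % p for p in chunk))
        chunks.append((regex, chunk))
    return chunks


def is_ignored(combined, filename):
    """Answer only whether some pattern matches, not which one."""
    return combined.match(filename) is not None


def is_ignored_by(chunks, filename):
    """Return the first matching pattern, or None if nothing matches."""
    for regex, chunk in chunks:
        match = regex.match(filename)
        if match is not None:
            # lastindex is the 1-based index of the group that matched.
            return chunk[match.lastindex - 1]
    return None


combined = compile_combined(PATTERNS)
chunks = compile_chunks(PATTERNS, chunk_size=2)  # small chunks for the demo
```

The small chunk_size in the last line is only there to exercise the
multi-chunk path with three patterns.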

>   * New glob->regex converter that more closely follows shell semantics.
>     It has * and ? not matching .-files, **/ matching any number of path
>     components (zsh-style - whole components only!) and ***/ that also
>     matches .-dirs, named character groups ([[:alnum:]], [[:digit:]],
>     [[:space:]] are unicode-aware) and RE:regexp.
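As a rough illustration of the semantics quoted above, the following sketch
translates a small subset of the glob syntax: *, ? and a zsh-style **/ that
matches whole components only. The function name glob_to_regex is
hypothetical, and ***/, character classes and RE: patterns are deliberately
left out, so this is not the converter from the branch.

```python
import re


def glob_to_regex(glob):
    """Translate a glob (subset: *, ?, **/) into a compiled regex."""
    out = []
    i = 0
    while i < len(glob):
        # A wildcard at the start of a path component must not match a dot.
        at_start = i == 0 or glob[i - 1] == "/"
        if at_start and glob.startswith("**/", i):
            # Zero or more whole components, none of them starting with a dot.
            out.append(r"(?:[^./][^/]*/)*")
            i += 3
            continue
        ch = glob[i]
        if ch == "*":
            out.append(r"(?!\.)[^/]*" if at_start else r"[^/]*")
        elif ch == "?":
            out.append(r"[^./]" if at_start else r"[^/]")
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out) + r"\Z")
```

With this sketch, "*.o" matches "foo.o" but not ".hidden.o" or "a/b.o", and
"src/**/*.c" matches "src/x.c" and "src/a/b/x.c" but not "src/.git/x.c".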

There was some discussion last time about whether ** should match only
whole components, or any substring.  The zsh form seems a bit more
precise.  Does anyone have an opinion?

I don't expect it will be used very often in ignore patterns; it's
probably more likely in the branch configuration, where we might have

[/home/mbp/**/bzr.mbp.*]

and even in that case it makes no difference which meaning we use.

>   * The .bzrignore file is read in utf-8. If decoding fails, it is not used.

I'd like at least a warning to be shown if it can't be decoded, rather
than just being unused.
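Something along these lines is what I have in mind. It is only a sketch: the
function name read_ignore_patterns is made up, and using the standard
warnings module is an assumption, not how bzr necessarily reports things.

```python
import io
import warnings


def read_ignore_patterns(path):
    """Read ignore patterns as UTF-8; warn and return [] if decoding fails."""
    try:
        with io.open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
    except UnicodeDecodeError as e:
        # Tell the user the file is being skipped instead of failing silently.
        warnings.warn("%s is not valid UTF-8 (%s); its patterns are not used"
                      % (path, e))
        return []
    # Blank lines carry no pattern.
    return [line.strip() for line in lines if line.strip()]
```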

>   . Plus a test-suite checking patterns really have described semantics.
>   - These further changes are not backward compatible. It should thus be
>     discussed under what conditions they could be merged.
> 
> Robert Collins was concerned that users will suddenly get different
> results from ignore and automatic add and that compatibility should be
> provided. That would mean:
>   * Include the new glob engine along with branch revision change and
>     + Provide automatic conversion.
>     + Use the old semantics for older branches.
>     - Or not provide automatic conversion, just explain to the user what the
>       difference is when upgrading a branch.
> 

> I actually think the differences will matter far less than it seems. They
> only affect three cases, all of which I think are minimally used:
>   * There is an absolute path containing * that should match over /
>     - Since ignored directories are not searched, usually only the directory
>       name is given, which does not contain wildcards most of the time.
>   * There are .-files in the tree that should be matched with a pattern
>     starting with * or ?.
>     - If .-files are present, they usually have their own rules and different
>       extension anyway.
>   * Non-ascii filenames are present and matched.
>     - Given that the charset was not defined and relied on the default
>       conversion, which is usually ascii, it's unlikely anyone did that.
>       I don't even think it worked.

I would tend to agree that they're unlikely to be relied upon.  In
general I don't think we need to have an option to support old buggy
behaviour just in case someone relied on it.

The changes should be in the Changes section of the news file, and there
should be more description in tutorial.txt.

> Other things to consider:
> 
>   * Do we actually want to use the semantics that * and ? don't match . at the
>     beginning of a path component? It is closer to what people expect, since
>     it is what shell does. However semantics is more useful.

(I don't understand that last sentence.)

I think being consistent with the shell is best.  The only drawback is
that it may confuse Windows users who are not familiar with that aspect
of Unix globs, but they're less likely to have dot files anyhow.

>   * Do we want to use the new globs for finding per-branch sections in
>     ~/.bazaar/bazaar.conf? Probably yes I think. This will be a user-visible
>     change.

Yes, I think so.

>   * Do we want to use the new glob semantics for expanding command-line
>     arguments on Windows? I think I can do that (with the exception of the
>     RE: 'pseudo'-globs) relatively easily.

That would be good.  I'd like sometime to do this more systematically by
defining which arguments should be glob-expanded, rather than doing it
inside the run method.
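A rough sketch of what "declaring which arguments are expanded" could look
like follows. The expand_args helper and the idea of passing the expandable
positions as a set are purely hypothetical, and it uses the standard glob
module rather than the new glob engine discussed in this thread.

```python
import glob


def expand_args(args, expandable):
    """Expand only the arguments whose positions are listed in expandable."""
    expanded = []
    for index, arg in enumerate(args):
        matches = sorted(glob.glob(arg)) if index in expandable else []
        # Like a Unix shell, keep the argument verbatim when nothing matches.
        expanded.extend(matches or [arg])
    return expanded
```

A command would then state up front that, say, only its file arguments are
expandable, and the expansion would happen before its run method is called.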

-- 
Martin