[RFC] Improvements to is_ignored.

Jan Hudec bulb at ucw.cz
Fri Jan 20 15:21:17 GMT 2006


Hello All,

I have added some tests for my improvements to is_ignored. However, today
Robert Collins raised several issues on IRC that should be discussed before
this work can be merged.

First the changes the branch contains:

  * All ignore patterns are compiled into one huge regular expression, which
    is then matched in one go against each filename. Earlier measurements
    showed that the speedup to status is significant. The is_ignored method
    no longer returns the matching glob, because it does not know which one
    matched.
  * For cases where we want to find which pattern matched, the is_ignored_by
    method wraps the patterns in capturing groups instead and uses lastindex
    to find which pattern matched. Unfortunately the Python regex engine has
    an arbitrary limit on the number of capturing groups, so the patterns
    are combined in groups of 50 only.
  - These two changes together should noticeably improve the speed of status
    and other commands that walk the tree. This part is backward compatible
    and could possibly be merged separately after some minor cleanup.
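
To make the two approaches above concrete, here is a minimal sketch, not the
code from the branch. The helper names (simple_glob_to_regex, compile_any,
compile_chunks) are made up for illustration, and the tiny translator only
understands * and ?; only is_ignored_by and the chunk size of 50 come from
the description above.

```python
import re

def simple_glob_to_regex(glob):
    """Hypothetical stand-in translator: handles only '*' and '?'."""
    out = []
    for ch in glob:
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)

def compile_any(globs):
    """One big regex: tells us that something matched, but not which glob."""
    parts = ["(?:%s)\\Z" % simple_glob_to_regex(g) for g in globs]
    return re.compile("|".join(parts), re.S)

def compile_chunks(globs, size=50):
    """Regexes with one capturing group per glob, at most `size` per regex."""
    chunks = []
    for start in range(0, len(globs), size):
        chunk = globs[start:start + size]
        parts = ["(%s)\\Z" % simple_glob_to_regex(g) for g in chunk]
        chunks.append((re.compile("|".join(parts), re.S), chunk))
    return chunks

def is_ignored_by(chunks, filename):
    """Return the glob that matched filename, or None."""
    for regex, chunk in chunks:
        match = regex.match(filename)
        if match:
            # Only the group of the alternative that matched participates,
            # so lastindex is its 1-based position within this chunk.
            return chunk[match.lastindex - 1]
    return None

globs = ["*.pyc", "*~", "build"]
any_regex = compile_any(globs)
chunks = compile_chunks(globs, size=2)  # small size just to exercise chunking
```

The stand-in translator must not emit capturing groups of its own, otherwise
lastindex would no longer map directly onto the position in the chunk.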

  * New glob->regex converter that more closely follows shell semantics.
    It has * and ? not matching .-files, **/ matching any number of path
    components (zsh-style - whole components only!) and ***/ that also
    matches .-dirs, named character groups ([[:alnum:]], [[:digit:]],
    [[:space:]] are unicode-aware) and RE:regexp.
  * The .bzrignore file is read in utf-8. If decoding fails, it is not used.
  * Plus a test suite checking that the patterns really have the described
    semantics.
  - These further changes are not backward compatible. It should thus be
    discussed under what conditions they could be merged.
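
The glob semantics described above can be sketched roughly as follows. This
is an illustration under assumptions, not the converter from the branch: the
function name shell_glob_to_regex is hypothetical, I assume **/ and ***/ may
also match zero components, and character groups and RE: patterns are left
out entirely.

```python
import re

def shell_glob_to_regex(glob):
    """Hypothetical sketch of the *, ?, **/ and ***/ rules only."""
    out = []
    i = 0
    at_start = True  # True while we are at the start of a path component
    while i < len(glob):
        if at_start and glob.startswith("***/", i):
            # Any number of whole components, .-dirs included.
            out.append("(?:[^/]+/)*")
            i += 4
            continue
        if at_start and glob.startswith("**/", i):
            # Any number of whole components that do not start with a dot.
            out.append(r"(?:(?!\.)[^/]+/)*")
            i += 3
            continue
        ch = glob[i]
        i += 1
        if ch in "*?":
            if at_start:
                # A wildcard opening a component must not match a leading dot.
                out.append(r"(?!\.)")
            # Wildcards never cross a path separator.
            out.append("[^/]*" if ch == "*" else "[^/]")
        else:
            out.append(re.escape(ch))
        at_start = (ch == "/")
    return "".join(out) + r"\Z"

def matches(glob, path):
    return re.match(shell_glob_to_regex(glob), path) is not None
```

With this sketch, matches("*.c", "foo.c") holds but matches("*.c", ".c") and
matches("*.c", "a/foo.c") do not, and matches("***/*.o", ".git/x.o") holds
where matches("**/*.o", ".git/x.o") does not.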

Robert Collins was concerned that users will suddenly get different
results from ignore and automatic add, and that compatibility should be
provided. That would mean:
  * Include the new glob engine along with a branch revision change and
    + Provide automatic conversion.
    + Use the old semantics for older branches.
    - Or not provide automatic conversion, and just explain to the user what
      the difference is when upgrading a branch.

I actually think the difference will matter far less than it seems. The
changes only affect three cases, all of which I think are rarely used:
  * There is an absolute path containing * that should match over /
    - Since ignored directories are not searched, usually only the directory
      name is given, which does not contain wildcards most of the time.
  * There are .-files in the tree that should be matched with a pattern
    starting with * or ?.
    - If .-files are present, they usually have their own rules and a
      different extension anyway.
  * Non-ascii filenames are present and matched.
    - Given that the charset was not defined and relied on the default
      conversion, which is usually ascii, it's unlikely anyone did that.
      I don't even think it worked.
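
The first two cases can be seen with Python's fnmatch module, used here as
a stand-in for the old matching (that it behaves the same as the current
is_ignored is my assumption; the example paths are made up). Under the
proposed semantics neither of these would match any more.

```python
from fnmatch import fnmatchcase

# (glob, filename) pairs; fnmatchcase stands in for the old matching here.
cases = [
    ("build/*.o", "build/sub/dir/foo.o"),  # '*' matching across '/'
    ("*.swp", ".foo.c.swp"),               # leading '*' matching a .-file
]

old_results = [fnmatchcase(name, glob) for glob, name in cases]

for (glob, name), matched in zip(cases, old_results):
    print("%-10s %-20s %s" % (glob, name, matched))
```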
    
Other things to consider:

  * Do we actually want to use the semantics that * and ? don't match . at the
    beginning of a path component? It is closer to what people expect, since
    it is what the shell does. However, the old semantics is more useful.

  * Do we want to use the new globs for finding per-branch sections in
    ~/.bazaar/bazaar.conf? Probably yes, I think. This will be a
    user-visible change.

  * Do we want to use the new glob semantics for expanding command-line
    arguments on Windows? I think I can do that (with the exception of the
    RE: 'pseudo'-globs) relatively easily.

The branch is at http://drak.ucw.cz/~bulb/bzr/bzr.ignore
I won't post a diff this time, because I want to discuss the semantics first.

Regards,

Jan Hudec

-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>