[RFC] Improvements to is_ignored.
Jan Hudec
bulb at ucw.cz
Fri Jan 20 15:21:17 GMT 2006
Hello All,
I have added some tests for my improvements to is_ignored. However, today
Robert Collins raised several issues on IRC that should be discussed before
this work can be merged in.
First, the changes the branch contains:
* All ignore patterns are compiled into one large regular expression, which
is then matched in one go for each filename. Earlier measurements showed the
speedup to status is significant. The is_ignored method no longer returns
the matching glob, because a single combined match cannot tell which
pattern matched.
* For cases where we want to find which pattern matched, the is_ignored_by
method wraps the patterns in capturing groups instead and uses lastindex
to find which pattern matched. Unfortunately the Python regex engine has an
arbitrary limit on the number of capturing groups, so the patterns are
compiled in batches of 50.
- These two changes together should noticeably improve the speed of status
and other commands that walk the tree. This part is backward compatible and
could possibly be merged separately after some minor cleanup.
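
To make the two approaches concrete, here is a minimal sketch, not the code from the branch. The names glob_to_regex, compile_all, compile_chunks and CHUNK are made up for illustration, and the glob translator is a deliberately tiny stand-in that only knows * and ?.

```python
import re

CHUNK = 50  # patterns per compiled regex, the figure quoted above


def glob_to_regex(glob):
    """Tiny stand-in translator: '*' and '?' only, everything else literal."""
    out = []
    for ch in glob:
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)


def compile_all(globs):
    """One alternation over every pattern: answers yes/no, not which one."""
    return re.compile("|".join("(?:%s)" % glob_to_regex(g) for g in globs))


def compile_chunks(globs, chunk=CHUNK):
    """Wrap each pattern in a capturing group, at most `chunk` per regex."""
    chunks = []
    for start in range(0, len(globs), chunk):
        part = globs[start:start + chunk]
        regex = re.compile("|".join("(%s)" % glob_to_regex(g) for g in part))
        chunks.append((regex, part))
    return chunks


def is_ignored_by(chunks, filename):
    """Return the glob that matched filename, or None if nothing matched."""
    for regex, part in chunks:
        match = regex.fullmatch(filename)
        if match:
            # lastindex is the 1-based number of the group that matched.
            return part[match.lastindex - 1]
    return None
```

The fast path is compile_all(globs).fullmatch(name); the chunked variant is only needed when the caller wants to know which pattern was responsible.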
* New glob->regex converter that more closely follows shell semantics.
With it, * and ? do not match .-files, **/ matches any number of path
components (zsh-style, whole components only!), ***/ also matches
.-dirs, named character groups ([[:alnum:]], [[:digit:]], [[:space:]])
are unicode-aware, and RE:regexp patterns are accepted.
* The .bzrignore file is read in utf-8. If decoding fails, it is not used.
* Plus a test suite checking that the patterns really have the described
semantics.
- These further changes are not backward compatible. It should thus be
discussed under what conditions they could be merged.
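
For discussion, the wildcard rules described above could be sketched roughly as follows. This is an assumption-laden illustration, not the converter in the branch: glob_to_regex and matches are hypothetical names, and character groups and RE: patterns are left out entirely.

```python
import re


def glob_to_regex(glob):
    """Translate a subset of the proposed glob syntax to a regex string."""
    out = []
    i = 0
    at_start = True  # True at the beginning of a path component
    while i < len(glob):
        if at_start and glob.startswith("***/", i):
            out.append(r"(?:[^/]+/)*")  # any directories, .-dirs included
            i += 4
        elif at_start and glob.startswith("**/", i):
            out.append(r"(?:[^/.][^/]*/)*")  # whole non-dot directories only
            i += 3
        elif glob[i] == "*":
            # At the start of a component, refuse to begin on a '.'.
            out.append(r"(?!\.)[^/]*" if at_start else r"[^/]*")
            i += 1
            at_start = False
        elif glob[i] == "?":
            out.append(r"[^/.]" if at_start else r"[^/]")
            i += 1
            at_start = False
        else:
            out.append(re.escape(glob[i]))
            at_start = glob[i] == "/"
            i += 1
    return "".join(out)


def matches(glob, path):
    """True if the whole path matches the glob under the rules above."""
    return re.fullmatch(glob_to_regex(glob), path) is not None
```

With this sketch, matches("*.o", "foo.o") holds while matches("*.o", ".hidden.o") and matches("*.o", "dir/foo.o") do not; "**/*.o" reaches a/b/foo.o but not .git/foo.o, which needs "***/*.o".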
Robert Collins was concerned that users will suddenly get different
results from ignore and automatic add, and that compatibility should be
provided. That would mean:
* Include the new glob engine along with branch revision change and
+ Provide automatic conversion.
+ Use the old semantics for older branches.
- Or not provide automatic conversion, and just explain to the user what the
difference is when upgrading a branch.
I actually think the differences will matter far less than it seems. They
only affect three cases, all of which I think are rarely used:
* There is an absolute path containing * that should match over /
- Since ignored directories are not searched, usually only the directory
name is given, which does not contain wildcards most of the time.
* There are .-files in the tree that should be matched with a pattern
starting with * or ?.
- If .-files are present, they usually have their own rules and a different
extension anyway.
* Non-ascii filenames are present and matched.
- Given that the charset was not defined and relied on the default
conversion, which is usually ascii, it's unlikely anyone did that.
I don't even think it worked.
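
The first two cases can be illustrated with Python's fnmatch module, on the assumption (not stated above) that the old matching behaves like fnmatch; the patterns and paths are invented examples.

```python
from fnmatch import fnmatchcase

# (pattern, path) pairs for the first two cases listed above.
cases = [
    ("build/*.o", "build/sub/dir/out.o"),  # '*' would have to cross '/'
    ("*.swp", ".notes.swp"),               # '*' would have to match a .-file
]

# fnmatch treats neither '/' nor a leading '.' specially, so both match here.
results = [fnmatchcase(path, pattern) for pattern, path in cases]
print(results)
```

Under the new semantics neither pair would match, so only trees relying on patterns like these would see a change.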
Other things to consider:
* Do we actually want to use the semantics that * and ? don't match . at the
beginning of a path component? It is closer to what people expect, since
it is what the shell does. However, the old semantics is more useful.
* Do we want to use the new globs for finding per-branch sections in
~/.bazaar/bazaar.conf? Probably yes, I think. This will be a user-visible
change.
* Do we want to use the new glob semantics for expanding command-line
arguments on Windows? I think I can do that (with the exception of the RE:
'pseudo'-globs) relatively easily.
The branch is at http://drak.ucw.cz/~bulb/bzr/bzr.ignore
I won't post a diff this time, since I want to discuss the semantics first.
Regards,
Jan Hudec
--
Jan 'Bulb' Hudec <bulb at ucw.cz>