[MERGE] Improvements to is_ignored, now with new glob converter and tests

Sat Jan 14 20:26:25 GMT 2006

Hello All,

Fortunately for bzr, I had sufficient supply of toits (and 茶) today, so
I finished the is_ignored improvements.

Merge location: http://www.ucw.cz/~bulb/bzr/bzr.ignore

Summary:

 * All ignore patterns are compiled in one huge regular expression, which is
   then matched in one go for each filename. Earlier measurements showed the
   speedup to status is significant. The is_ignored method now does not
   return the matching glob, because it does not know.
 * For cases where we want to find which pattern matched, the is_ignored_by
   method wraps the patterns in capturing groups instead and uses lastindex
   to find which pattern matched. Unfortunately python regex engine has
   arbitrary limit on number of capturing groups, so the patterns are grouped
   by 50 only.
 * New glob->regex convertor is implemented (as private to workingtree for
   the time being). This convertor handles * and ? not matching . at the
   component start, zsh-style **/ (matching any number of whole components
   not starting with .) and several named character groups. The [:alnum:],
   [:digit:] and [:space:] are all unicode aware (they must be used inside
   [], so it is [[:digit:]] -- and [^[:digit:]]). Regex special characters
   are escaped and anything preceeded with backslash is passed through as is,
   including the backslash. That means \* matches literal * and also that
   \xHH and \uHHHH might work (for some mysertious reason they work for me in
   one test, but not when I tried to write another). The converter should be
   relatively easy to extend to things like {x,y} or zsh-style () and
   # (repetition).
 * The .bzrignore file is now read in utf-8. If decoding fails, it is not
   used and a warning is printed.
 * Test suite checking of glob behaviour with respect to leading dots,
   character classes, multiple components with **/ and unicode characters.

Diff against current (revno 1520) bzr.dev, obtained by merging, follows, for
easier review. Please use above URL for merging. Comments welcome.

Regards,

Jan Hudec

=== added file 'bzrlib/tests/test_ignore.py'

--- /dev/null
+++ bzrlib/tests/test_ignore.py
@@ -0,0 +1,191 @@
+# Copyright (C) 2006 by Jan Hudec
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+import re
+
+import logging
+from cStringIO import StringIO
+import bzrlib.trace
+
+from bzrlib.branch import Branch
+from bzrlib.tests import TestCase, TestCaseInTempDir
+
+from bzrlib.workingtree import _glob_to_regex
+
+class TestGlobs(TestCase):
+
+    def assertMatch(self, glob, names):
+        rx = re.compile(_glob_to_regex(glob), re.UNICODE)
+        for name in names:
+            if not rx.match(name):
+                raise AssertionError(repr(
+                        u'name "%s" does not match glob "%s" (rx="%s")' %
+                        (name, glob, rx.pattern)))
+
+    def assertNotMatch(self, glob, names):
+        rx = re.compile(_glob_to_regex(glob), re.UNICODE)
+        for name in names:
+            if rx.match(name):
+                raise AssertionError(repr(
+                        u'name "%s" does match glob "%s" (rx="%s")' %
+                        (name, glob, rx.pattern)))
+
+    def test_char_groups(self):
+        # The definition of digit this uses includes arabic digits from
+        # non-latin scripts (arabic, indic, etc.) and subscript/superscript
+        # digits, but neither roman numerals nor vulgar fractions.
+        self.assertMatch(u'[[:digit:]]', [u'0', u'5', u'\u0663', u'\u06f9',
+                u'\u0f21', u'\xb9'])
+        self.assertNotMatch(u'[[:digit:]]', [u'T', u'q', u' ', u'\u8336'])
+
+        self.assertMatch(u'[[:space:]]', [u' ', u'\t', u'\n', u'\xa0',
+                u'\u2000', u'\u2002'])
+        self.assertMatch(u'[^[:space:]]', [u'a', u'-', u'\u8336'])
+        self.assertNotMatch(u'[[:space:]]', [u'a', u'-', u'\u8336'])
+
+        self.assertMatch(u'[[:alnum:]]', [u'a', u'Z', u'\u017e', u'\u8336'])
+        self.assertMatch(u'[^[:alnum:]]', [u':', u'-', u'\u25cf'])
+        self.assertNotMatch(u'[[:alnum:]]', [u':', u'-', u'\u25cf'])
+
+        self.assertMatch(u'[[:ascii:]]', [u'a', u'Q', u'^'])
+        self.assertNotMatch(u'[^[:ascii:]]', [u'a', u'Q', u'^'])
+        self.assertNotMatch(u'[[:ascii:]]', [u'\xcc', u'\u8336'])
+
+        self.assertMatch(u'[[:blank:]]', [u'\t'])
+        self.assertNotMatch(u'[^[:blank:]]', [u'\t'])
+        self.assertNotMatch(u'[[:blank:]]', [u'x', u'y', u'z'])
+
+        self.assertMatch(u'[[:cntrl:]]', [u'\b', u'\t', '\x7f'])
+        self.assertNotMatch(u'[[:cntrl:]]', [u'a', u'Q', u'\u8336'])
+
+        self.assertMatch(u'[a-z]', [u'a', u'q', u'f'])
+        self.assertNotMatch(u'[a-z]', [u'A', u'Q', u'F'])
+        self.assertMatch(u'[^a-z]', [u'A', u'Q', u'F'])
+        self.assertNotMatch(u'[^a-z]', [u'a', u'q', u'f'])
+
+        self.assertMatch(ur'[\x20-\x30\u8336]', [u'\040', u'\044', u'\u8336'])
+        self.assertNotMatch(ur'[^\x20-\x30\u8336]', [u'\040', u'\044', u'\u8336'])
+
+    def test_question_mark(self):
+        self.assertMatch(u'?foo', [u'xfoo', u'bar/xfoo', u'bar/\u8336foo'])
+        self.assertNotMatch(u'?foo', [u'.foo', u'bar/.foo', u'bar/foo',
+                u'foo'])
+
+        self.assertMatch(u'foo?bar', [u'fooxbar', u'foo.bar',
+                u'foo\u8336bar', u'qyzzy/foo.bar'])
+        self.assertNotMatch(u'foo?bar', [u'foo/bar'])
+
+        self.assertMatch(u'foo/?bar', [u'foo/xbar'])
+        self.assertNotMatch(u'foo/?bar', [u'foo/.bar', u'foo/bar',
+                u'bar/foo/xbar'])
+
+    def test_asterisk(self):
+        self.assertMatch(u'*.x', [u'foo/bar/baz.x', u'\u8336/Q.x'])
+        self.assertNotMatch(u'*.x', [u'.foo.x', u'bar/.foo.x', u'.x'])
+
+        self.assertMatch(u'x*x', [u'xx', u'x.x', u'x\u8336..x',
+                u'\u8336/x.x'])
+        self.assertNotMatch(u'x*x', [u'x/x', u'bar/x/bar/x', u'bax/abaxab'])
+
+        self.assertMatch(u'*/*x', [u'\u8336/x', u'foo/bax', u'x/a.x'])
+        self.assertNotMatch(u'*/*x', [u'.foo/x', u'\u8336/.x', u'foo/.q.x',
+                u'foo/bar/bax'])
+
+    def test_double_asterisk(self):
+        self.assertMatch(u'**/\u8336', [u'\u8336', u'foo/\u8336',
+                u'q/y/z/z/y/\u8336'])
+        self.assertNotMatch(u'**/\u8336', [u'that\u8336',
+                u'q/y/z/.z/y/\u8336', u'a/b\u8336'])
+
+        self.assertMatch(u'x**x', [u'xaaargx', u'boo/x-x'])
+        self.assertNotMatch(u'x**x', [u'x/y/z/bax', u'boo/x/x'])
+        # ... because it's not **, but rather **/ and it is only recognized
+        # after / or at the begining.
+
+        self.assertMatch(u'x**/x', [u'xaarg/x', u'x/x'])
+        self.assertNotMatch(u'x**/x', [u'xa/b/x', u'foo/xfoo/x'])
+
+    def test_leading_dotslash(self):
+        self.assertMatch(u'./foo', [u'foo'])
+        self.assertNotMatch(u'./foo', [u'\u8336/foo', u'barfoo', u'x/y/foo'])
+
+    def test_end_anchor(self):
+        self.assertMatch(u'*.333', [u'foo.333'])
+        self.assertNotMatch(u'*.3', [u'foo.333'])
+
+class TestBzrignore(TestCaseInTempDir):
+
+    shape = None
+
+    def setUp(self):
+        super(TestBzrignore, self).setUp()
+        self.branch = Branch.initialize(u'.')
+        self.wt = self.branch.working_tree()
+
+    def putIgnores(self, ignores):
+        bzrignore = file(u'.bzrignore', 'wb')
+        bzrignore.write(ignores)
+
+    def assertIgnored(self, name):
+        if not self.wt.is_ignored(name):
+            raise AssertionError(repr(u'name "%s" is not ignored' % name))
+
+    def assertNotIgnored(self, name):
+        if self.wt.is_ignored(name):
+            raise AssertionError(repr(u'name "%s" is ignored' % name))
+
+    def assertIgnoredBy(self, name, pattern):
+        by = self.wt.is_ignored_by(name)
+        if by != pattern:
+            raise AssertionError(repr(
+                        u'name "%s" ignored by "%s" instead of "%s"' %
+                        (name, by, pattern)))
+
+
+    def test_unicode(self):
+        self.putIgnores(u'*.\u8336\n'.encode('utf-8'))
+        self.assertIgnored(u'foo.\u8336')
+        self.assertNotIgnored(u'.boo.\u8336')
+        self.assertNotIgnored(u'this')
+
+    def test_misencoded(self):
+        stderr = StringIO()
+        handler = logging.StreamHandler(stderr)
+        handler.setFormatter(bzrlib.trace.QuietFormatter())
+        handler.setLevel(logging.INFO)
+        logger = logging.getLogger('')
+        logger.addHandler(handler)
+        self.putIgnores(u'*.[1\xc1]\n*.2\n'.encode('iso8859-1'))
+
+        self.assertNotIgnored(u'whatever.1')
+        self.assertNotIgnored(u'whatever.2')
+        self.assertContainsRe(stderr.getvalue(), 'WARN.*not utf-8')
+
+    def test_long(self):
+        self.putIgnores(u''.join([u'*.%i\n' % i
+                                    for i in range(1, 999)]).encode('utf-8'))
+        self.assertIgnoredBy(u'foo.333', u'*.333')
+        self.assertIgnoredBy(u'qyzzy.666', u'*.666')
+        self.assertIgnoredBy(u'\u8336.42', u'*.42')
+        self.assertNotIgnored(u'\u8336')
+        self.assertIgnoredBy(u'42', None)
+
+    def test_escapes(self):
+        self.putIgnores(u'*.\\*\n\\[*\\]\n'.encode('utf-8'))
+        # XXX: \uNNNN sequence does not work for me here, though it does
+        # work for me in TestGlob.test_char_groups above.
+        self.assertIgnored(u'foo.*')
+        self.assertIgnored(u'[foo]')

=== modified file 'bzrlib/add.py'
--- bzrlib/add.py
+++ bzrlib/add.py
@@ -155,7 +155,7 @@
                 if subf == bzrlib.BZRDIR:
                     mutter("skip control directory %r", subp)
                 else:
-                    ignore_glob = tree.is_ignored(subp)
+                    ignore_glob = tree.is_ignored_by(subp)
                     if ignore_glob is not None:
                         mutter("skip ignored sub-file %r", subp)
                         if ignore_glob not in ignored:

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py
+++ bzrlib/builtins.py
@@ -1135,7 +1135,7 @@
             if file_class != 'I':
                 continue
             ## XXX: Slightly inefficient since this was already calculated
-            pat = tree.is_ignored(path)
+            pat = tree.is_ignored_by(path)
             print '%-50s %s' % (path, pat)
 
 

=== modified file 'bzrlib/tests/__init__.py'
--- bzrlib/tests/__init__.py
+++ bzrlib/tests/__init__.py
@@ -692,6 +692,7 @@
                    'bzrlib.tests.test_hashcache',
                    'bzrlib.tests.test_http',
                    'bzrlib.tests.test_identitymap',
+                   'bzrlib.tests.test_ignore',
                    'bzrlib.tests.test_inv',
                    'bzrlib.tests.test_log',
                    'bzrlib.tests.test_merge',

=== modified file 'bzrlib/workingtree.py'
--- bzrlib/workingtree.py
+++ bzrlib/workingtree.py
@@ -46,6 +46,7 @@
 import os
 import stat
 import fnmatch
+import re
  
 from bzrlib.branch import (Branch,
                            is_control_file,
@@ -75,7 +76,7 @@
                             rename)
 from bzrlib.textui import show_status
 import bzrlib.tree
-from bzrlib.trace import mutter
+from bzrlib.trace import mutter, warning
 import bzrlib.xml5
 
 
@@ -85,7 +86,6 @@
     This should probably generate proper UUIDs, but for the moment we
     cope with just randomness because running uuidgen every time is
     slow."""
-    import re
     from binascii import hexlify
     from time import time
 
@@ -111,6 +111,106 @@
 def gen_root_id():
     """Return a new tree-root file id."""
     return gen_file_id('TREE_ROOT')
+
+
+"""Do a multiple-pattern substitution.
+
+The patterns and substitutions are combined into one, so the result of
+one replacement is never substituted again. Add the patterns and
+replacements via the add method and then call the object. The patterns
+must not contain capturing groups.
+"""
+class _replacer(object):
+    _expand = re.compile(ur'\\&')
+    def __init__(self):
+        self._pat = None
+        self._pats = []
+        self._funs = []
+
+    r"""Add a pattern and replacement.
+
+    The pattern must not contain capturing groups.
+    The replacement might be either a string template in which \& will be
+    replaced with the match, or a function that will get the matching text as
+    argument. It does not get match object, because capturing is forbidden
+    anyway.
+    """
+    def add(self, pat, fun):
+        self._pat = None
+        self._pats.append(pat)
+        self._funs.append(fun)
+
+    def __call__(self, text):
+        if self._pat is None:
+            self._pat = re.compile(
+                    u'|'.join([u'(%s)' % p for p in self._pats]),
+                    re.UNICODE)
+        return self._pat.sub(self._do_sub, text)
+
+    def _do_sub(self, m):
+        fun = self._funs[m.lastindex - 1]
+        if hasattr(fun, '__call__'):
+            return fun(m.group(0))
+        else:
+            return self._expand.sub(m.group(0), fun)
+
+_sub_named = _replacer()
+_sub_named.add(u'\[:digit:\]', ur'\d')
+_sub_named.add(u'\[:space:\]', ur'\s')
+_sub_named.add(u'\[:alnum:\]', ur'\w')
+_sub_named.add(u'\[:ascii:\]', ur'\0-\x7f')
+_sub_named.add(u'\[:blank:\]', ur' \t')
+_sub_named.add(u'\[:cntrl:\]', ur'\0-\x1f\x7f-\x9f')
+# I would gladly support many others like [:alpha:], [:graph:] and [:print:]
+# but python regular expression engine does not provide their equivalents.
+
+def _sub_group(m):
+    if m[1] == u'!':
+        m[1] = u'^'
+    return u'[' + _sub_named(m[1:-1]) + u']'
+
+_sub_glob = _replacer()
+_sub_glob.add(ur'^\./', u'') # leading ./ with nothing (match is used)
+_sub_glob.add(ur'\\.', ur'\&') # keep anything backslashed...
+_sub_glob.add(ur'[(){}|^$+.]', ur'\\&') # escape specials...
+_sub_glob.add(ur'(?:(?<=/)|^)\*\*/', ur'(?:[^./][^/]*/)*') # **, zsh-style
+_sub_glob.add(ur'(?:(?<=/)|^)\*\.', ur'(?:[^./][^/]*)\.') # *. after / or at start
+_sub_glob.add(ur'(?:(?<=/)|^)\*', ur'(?:[^./][^/]*)?') # * after / or at start
+_sub_glob.add(ur'\*', ur'[^/]*') # * elsewhere
+_sub_glob.add(ur'/\?', ur'/[^./]') # * after /
+_sub_glob.add(ur'^\?', ur'[^./]') # * at the begining
+_sub_glob.add(ur'\?', ur'[^/]') # * elsewhere
+_sub_glob.add(ur'\[\^?\]?(?:[^][]|\[:[^]]+:\])+\]', _sub_group) # character group
+
+def _glob_to_regex(pat):
+    r"""Convert a unix glob to regular expression.
+    
+    Patterns containing '/' need to match whole path; others match
+    against only the last component - as per requirement of
+    WorkingTree.is_ignored().
+    
+    Globs implement *, ?, [] character groups (both ! and ^ negate), named
+    character classes [:digit:], [:space:], [:alnum:], [:ascii:], [:blank:],
+    [:cntrl:] (use /inside/ []), zsh-style **/ and escapes regular expression
+    special characters.
+
+    Pattern is returned as string.
+    """
+    # XXX: Shouldn't the globs be actually UNICODE?
+
+    # XXX: Is the path normalized? Should we match [/\\] ?
+    if '/' in pat:
+        return _sub_glob(pat) + u'$'
+    else:
+        return u'(?:.*/)?' + _sub_glob(pat) + u'$'
+
+def _glob_list_to_regex(pats, wrap=u'(?:%s)'):
+    """Convert a list of unix globs to a regular expression.
+
+    The pattern is returned as string. The wrap is % format applied to each
+    individual glob pattern. It has to apply group.
+    """
+    return u'|'.join([wrap % _glob_to_regex(x) for x in pats])
 
 
 class TreeEntry(object):
@@ -712,11 +812,10 @@
 
 
     def ignored_files(self):
-        """Yield list of PATH, IGNORE_PATTERN"""
+        """Yield list of paths"""
         for subp in self.extras():
-            pat = self.is_ignored(subp)
-            if pat != None:
-                yield subp, pat
+            if self.is_ignored(subp):
+                yield subp
 
 
     def get_ignore_list(self):
@@ -730,10 +829,43 @@
         l = bzrlib.DEFAULT_IGNORE[:]
         if self.has_filename(bzrlib.IGNORE_FILENAME):
             f = self.get_file_byname(bzrlib.IGNORE_FILENAME)
-            l.extend([line.rstrip("\n\r") for line in f.readlines()])
+            try:
+                l.extend([line.decode('utf-8').rstrip("\n\r")
+                        for line in f.readlines()])
+            except UnicodeDecodeError:
+                warning("'%s' is not utf-8 encoded, not reading ignore patterns"
+                        % bzrlib.IGNORE_FILENAME)
         self._ignorelist = l
         return l
 
+    def _get_ignore_regex(self):
+        """Return a regular expression composed of ignore patterns.
+
+        Cached in the Tree object after the first call.
+        """
+        if not hasattr(self, '_ignoreregex'):
+            self._ignoreregex = re.compile(
+                    _glob_list_to_regex(self.get_ignore_list()))
+        return self._ignoreregex
+
+    def _get_ignore_by_regex_list(self):
+        """Return regex list for is_ignored_by method.
+
+        Cached in the Tree object after the first call.
+
+        The return is a list of lists, each having pattern as the first
+        element, followed by list of globs it is composed from.
+        """
+        if not hasattr(self, '_ignore_by_regex_list'):
+            pats = list(self.get_ignore_list()) # So we can shift...
+            self._ignore_by_regex_list = []
+            while pats:
+                self._ignore_by_regex_list.append(
+                        [re.compile(_glob_list_to_regex(pats[0:50],
+                                    wrap=u'(%s)'))]
+                        + pats[0:50])
+                pats[0:50] = ()
+        return self._ignore_by_regex_list
 
     def is_ignored(self, filename):
         r"""Check whether the filename matches an ignore pattern.
@@ -741,37 +873,27 @@
         Patterns containing '/' or '\' need to match the whole path;
         others match against only the last component.
 
-        If the file is ignored, returns the pattern which caused it to
-        be ignored, otherwise None.  So this can simply be used as a
-        boolean if desired."""
-
-        # TODO: Use '**' to match directories, and other extended
-        # globbing stuff from cvs/rsync.
-
-        # XXX: fnmatch is actually not quite what we want: it's only
-        # approximately the same as real Unix fnmatch, and doesn't
-        # treat dotfiles correctly and allows * to match /.
-        # Eventually it should be replaced with something more
-        # accurate.
-        
-        for pat in self.get_ignore_list():
-            if '/' in pat or '\\' in pat:
-                
-                # as a special case, you can put ./ at the start of a
-                # pattern; this is good to match in the top-level
-                # only;
-                
-                if (pat[:2] == './') or (pat[:2] == '.\\'):
-                    newpat = pat[2:]
-                else:
-                    newpat = pat
-                if fnmatch.fnmatchcase(filename, newpat):
-                    return pat
-            else:
-                if fnmatch.fnmatchcase(splitpath(filename)[-1], pat):
-                    return pat
-        else:
-            return None
+        If the file is ignored, returns a match object, otherwise None. So
+        this can simply be used as a boolean if desired. The match object is
+        really not very useful, because the individual patterns are not
+        captured.
+        """
+        pat = self._get_ignore_regex()
+        return pat.match(filename)
+
+    def is_ignored_by(self, filename):
+        r"""Check whether the filename matches and return the pattern it matches.
+
+        This method is similar to is_ignored, but makes the extra effort to
+        return the pattern that matched.
+        """
+
+        pats = self._get_ignore_by_regex_list()
+        for pat in pats:
+            m = pat[0].match(filename)
+            if m:
+                return pat[m.lastindex]
+        return None
 
     def kind(self, file_id):
         return file_kind(self.id2abspath(file_id))


-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060114/6caab05b/attachment.pgp