[MERGE] is_ignored improvements...

Wed May 17 21:25:06 BST 2006

Hello,

On Tue, May 16, 2006 at 19:03:40 +1000, Robert Collins wrote:
> Jan Hudec was working on is_ignored improvements... they account for a
> nontrivial amount of the time of a 'bzr add' call.
> 
> I was wondering if we can get the compatible replacement in - leaving
> the format changing etc stuff for a later time.

Ok, here it is. All tests pass, and there are some new tests for globs and
ignore patterns. Please review. To merge, please use the branch
http://drak.ucw.cz/~bulb/bzr/bzr/matcher/

It splits the WorkingTree.is_ignored method into ignored, which only tells
whether the file is ignored (the result is usable as a boolean), and
ignored_by, which returns the matching pattern. is_ignored is now a
deprecated forwarder to ignored_by (to remain compatible).
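
The split described above can be sketched roughly as follows. This is only
an illustration, not the code in the branch: glob_to_regex is a simplified
stand-in for bzrlib.glob.translate (it handles only * and ?), and the
helper names build_ignore_regex and build_chunks are hypothetical. The
chunk size of 50 and the anchoring prefix are taken from the patch.

```python
import re

CHUNK_SIZE = 50  # the patch compiles the ignore list 50 globs at a time


def glob_to_regex(glob):
    """Simplified stand-in for translate(): supports only '*' and '?'."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    body = "".join(parts)
    if "/" not in glob:
        # Globs without '/' match the last path component only; this is
        # the same prefix the patch emits for '***/' anchored globs.
        body = "(?:.*/)?(?!.*/)" + body
    return body + r"\Z"


def build_ignore_regex(globs):
    """One big regex with non-capturing groups: enough for a yes/no answer."""
    return re.compile("|".join("(?:%s)" % glob_to_regex(g) for g in globs))


def build_chunks(globs, chunk_size=CHUNK_SIZE):
    """Regexes with one capturing group per glob, chunk_size globs each."""
    chunks = []
    for start in range(0, len(globs), chunk_size):
        group = globs[start:start + chunk_size]
        regex = re.compile("|".join("(%s)" % glob_to_regex(g) for g in group))
        chunks.append((regex, group))
    return chunks


def ignored_by(chunks, path):
    """Return the glob that matched path, or None."""
    for regex, group in chunks:
        m = regex.match(path)
        if m:
            # Exactly one top-level group took part in the match, so
            # lastindex (1-based) identifies the glob within this chunk.
            return group[m.lastindex - 1]
    return None
```

The fast path needs a single match call against one compiled regex, while
ignored_by pays for capturing groups and for walking the chunks, which is
why the two are kept separate.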

I fixed the WorkingTree.ignored_files method, but it's neither used anywhere
nor tested. Perhaps it should be deprecated.

There is no zero_nine in bzrlib.symbol_versioning yet, so I just used
zero_eight (just one deprecated method -- WorkingTree.is_ignored). Please fix
as appropriate for the version that will eventually get the changes.
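
For readers unfamiliar with the forwarding pattern, here is a minimal
sketch using only the standard warnings module. It is not how
bzrlib.symbol_versioning implements deprecated_method; the decorator
deprecated_forwarder and the Tree class below are hypothetical stand-ins,
and the matching is plain fnmatch on the last path component.

```python
import fnmatch
import functools
import warnings


def deprecated_forwarder(replacement_name):
    """Hypothetical decorator: warn on every call, then run the method."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            warnings.warn(
                "%s is deprecated; use %s instead"
                % (func.__name__, replacement_name),
                DeprecationWarning,
                stacklevel=2,
            )
            return func(self, *args, **kwargs)
        return wrapper
    return decorator


class Tree(object):
    """Toy tree that holds only a list of ignore globs."""

    def __init__(self, patterns):
        self._patterns = list(patterns)

    def ignored_by(self, filename):
        """Return the first glob matching the last path component, or None."""
        basename = filename.rsplit("/", 1)[-1]
        for pat in self._patterns:
            if fnmatch.fnmatchcase(basename, pat):
                return pat
        return None

    def ignored(self, filename):
        """Return True if any glob matches."""
        return self.ignored_by(filename) is not None

    @deprecated_forwarder("ignored_by")
    def is_ignored(self, filename):
        # Old name kept for compatibility; same return value as before.
        return self.ignored_by(filename)
```

Existing callers of is_ignored keep getting the pattern back and see a
DeprecationWarning pointing at their own call site.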

-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>
-------------- next part --------------
=== added file 'bzrlib/glob.py'

--- /dev/null	
+++ bzrlib/glob.py	
@@ -0,0 +1,144 @@
+# Copyright (C) 2006 Jan Hudec
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+"""Tools for converting globs to regular expressions.
+
+This module provides functions for converting shell-like globs to regular
+expressions. See translate function for implemented glob semantics.
+"""
+
+
+import fnmatch
+import os
+import os.path
+import re
+
+from bzrlib.trace import mutter, warning
+from bzrlib.osutils import lexists, pathjoin
+
+
+def FNMATCH(pat, sep):
+    """Convert globs to regexp using standard fnmatch module.
+
+    Converts a glob to a regular expression using the fnmatch module. It is
+    kept for compatibility with the legacy glob semantics.
+
+    This is intended for use as the style argument to translate and related
+    functions. The sep argument is the separator canonicalizer (POSIX or
+    NATIVE) applied to the glob before translation.
+    """
+    import fnmatch
+    def xlate(pat):
+        return fnmatch.translate(sep(pat)).rstrip(u'$') # $ will be re-added
+    if pat.startswith(u'***/'): # Processed with anchor_glob
+        # This cheats a bit, assuming no one would ever write *** in old globs
+        return u'(?:.*/)?(?!.*/)' + xlate(pat[4:])
+    else:
+        return xlate(pat)
+
+
+def POSIX(pat):
+    """Canonicalize glob using / as directory separator.
+
+    This is intended for passing in the sep argument to translate and
+    related functions.
+    """
+    return pat
+
+
+def NATIVE(pat):
+    """Canonicalize glob using native directory separator.
+
+    This converts a glob using native directory separator to a normalized
+    glob for matching posix-style paths.
+
+    This is intended for passing in the sep argument to translate and
+    related functions.
+    """
+    return pat.replace(os.sep, u'/')
+
+
+# Default style so it's consistent between all funcs that take that argument.
+_default_style = FNMATCH
+
+
+def translate(pat, style=_default_style, sep=POSIX):
+    r"""Convert a glob to regular expression.
+
+    The style argument is the actual translator to be used. FNMATCH is the
+    only translator defined for now. See its documentation for the exact
+    interpretation of globbing chars. The default translator is FNMATCH.
+    
+    Pattern is returned as string.
+    """
+    return style(pat, sep) + u'$'
+
+
+def translate_list(pats, wrap=u'(?:%s)', style=_default_style, sep=POSIX):
+    """Convert a list of unix globs to a regular expression.
+
+    The pattern is returned as a string. The wrap is a % format applied to
+    each individual glob pattern. It has to wrap the pattern in a group.
+
+    See translate for glob semantics.
+    """
+    return u'|'.join([wrap % translate(x, style, sep) for x in pats])
+
+
+def compile(pat, style=_default_style, sep=POSIX):
+    """Convert a unix glob to regular expression and compile it.
+
+    This converts a glob to regex via translate and compiles the regex. See
+    translate for glob semantics.
+    """
+    return re.compile(translate(pat, style, sep), re.UNICODE)
+
+
+def compile_list(pats, wrap=u'(?:%s)', style=_default_style, sep=POSIX):
+    """Convert a list of unix globs to a regular expression and compile it.
+
+    The pattern is returned as a compiled regex object. The wrap is a % format
+    applied to each individual glob pattern. It has to wrap it in a group.
+    """
+    return re.compile(translate_list(pats, wrap, style, sep), re.UNICODE)
+
+
+def anchor_glob(pat):
+    """Convert file-glob to path glob as used in ignore patterns.
+
+    Ignore patterns not containing '/' should match against the filename only.
+    This function prepends such pattern with '***/' so they can be compiled
+    together with whole-path globs (containing '/') and matched against the
+    whole path.
+    """
+    if pat.startswith(u'RE:'):
+        return pat
+    if u'/' in pat:
+        return pat
+    else:
+        return u'***/' + pat
+
+
+def anchor_globs(pats):
+    """Convert file-globs to path globs as used in ignore patterns.
+
+    Ignore patterns not containing '/' should match against the filename only.
+    This function prepends such patterns with '***/' so they can be compiled
+    together with whole-path globs (containing '/') and matched against the
+    whole path.
+    """
+    return [anchor_glob(pat) for pat in pats]
+

=== added file 'bzrlib/tests/test_glob.py'
--- /dev/null	
+++ bzrlib/tests/test_glob.py	
@@ -0,0 +1,82 @@
+# Copyright (C) 2006 by Jan Hudec
+# -*- coding: utf-8 -*-
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+from bzrlib.tests import TestCase, TestCaseInTempDir
+
+from bzrlib.glob import (
+        anchor_glob,
+        compile,
+        translate,
+        FNMATCH, NATIVE,
+        )
+from bzrlib.osutils import abspath
+
+
+class TestFnmatchGlobs(TestCase):
+
+    def assertMatch(self, glob, positive, negative):
+        rx = compile(glob, style=FNMATCH)
+        for name in positive:
+            self.failUnless(rx.match(name), repr(
+                        u'name "%s" does not match glob "%s" (rx="%s")' %
+                        (name, glob, rx.pattern)))
+        for name in negative:
+            self.failIf(rx.match(name), repr(
+                        u'name "%s" does match glob "%s" (rx="%s")' %
+                        (name, glob, rx.pattern)))
+
+
+    def test_asterisk(self):
+        self.assertMatch(u'foo*bar', [u'foobar', u'foo-bar', u'foo\u8336bar',
+                u'foo/barbar'], [u'booboofoobar', u'foobary'])
+        self.assertMatch(u'*foo', [u'foo', u'boo/foo', u'.foo', u'boo/.foo'],
+                [u'foobaz'])
+
+    def test_brackets(self):
+        self.assertMatch(u'f[oaq]o', [u'foo', u'fao', u'fqo'], [u'fzo',
+                u'f[oaq]o'])
+        self.assertMatch(u'f[!azm]o', [u'foo', u'f\u8336o'], [u'fao',
+                u'fzo'])
+
+    def test_regexp_specials(self):
+        self.assertMatch(u'(foo|bar)', [u'(foo|bar)'], [u'foo', u'bar'])
+        self.assertMatch(u'f{2,4}', [u'f{2,4}'], [u'ff', u'ffff', u'f2',
+                u'f4'])
+        self.assertMatch(u'*.*', [u'foo.bar', u'foo.', u'.bar'], [u'foobar',
+                u'qyzzy'])
+
+
+class TestAnchoredFnmatchGlobs(TestCase):
+
+    def assertMatch(self, glob, positive, negative):
+        rx = compile(anchor_glob(glob), style=FNMATCH)
+        for name in positive:
+            self.failUnless(rx.match(name), repr(
+                        u'name "%s" does not match glob "%s" (rx="%s")' %
+                        (name, glob, rx.pattern)))
+        for name in negative:
+            self.failIf(rx.match(name), repr(
+                        u'name "%s" does match glob "%s" (rx="%s")' %
+                        (name, glob, rx.pattern)))
+
+
+    def test_anchoring(self):
+        self.assertMatch(u'foo', [u'foo', u'bar/foo', u'qyzzy/bar/foo'],
+                [u'qyzzyfoo', u'foo/.foo', u'boo/goo/afoo'])
+        self.assertMatch(u'bar/foo', [u'bar/foo'], [u'zyqqy/bar/foo'])
+        self.assertMatch(u'foo.*', [u'bar.bar/foo.bar'], [u'foo.bar/bar.bar'])
+

=== added file 'bzrlib/tests/test_ignore.py'
--- /dev/null	
+++ bzrlib/tests/test_ignore.py	
@@ -0,0 +1,61 @@
+# Copyright (C) 2006 by Jan Hudec
+# -*- coding: utf-8 -*-
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+import logging
+from cStringIO import StringIO
+import bzrlib.trace
+
+from bzrlib.workingtree import WorkingTree
+from bzrlib.tests import TestCase, TestCaseInTempDir
+
+
+class TestBzrignore(TestCaseInTempDir):
+
+    shape = None
+
+    def setUp(self):
+        super(TestBzrignore, self).setUp()
+        self.wt = WorkingTree.create_standalone(u'.')
+
+    def putIgnores(self, ignores):
+        bzrignore = file(u'.bzrignore', 'wb')
+        bzrignore.write(ignores)
+
+    def assertIgnored(self, name):
+        if not self.wt.ignored(name):
+            raise AssertionError(repr(u'name "%s" is not ignored' % name))
+
+    def assertNotIgnored(self, name):
+        if self.wt.ignored(name):
+            raise AssertionError(repr(u'name "%s" is ignored' % name))
+
+    def assertIgnoredBy(self, name, pattern):
+        by = self.wt.ignored_by(name)
+        if by != pattern:
+            raise AssertionError(repr(
+                        u'name "%s" ignored by "%s" instead of "%s"' %
+                        (name, by, pattern)))
+
+
+    def test_long(self):
+        self.putIgnores(u''.join([u'*.%i\n' % i
+                                    for i in range(1, 999)]).encode('utf-8'))
+        self.assertIgnoredBy(u'foo.333', u'*.333')
+        self.assertIgnoredBy(u'qyzzy.666', u'*.666')
+        self.assertIgnoredBy(u'\u8336.42', u'*.42')
+        self.assertNotIgnored(u'\u8336')
+        self.assertIgnoredBy(u'42', None)

=== modified file 'bzrlib/add.py'
--- bzrlib/add.py	
+++ bzrlib/add.py	
@@ -155,7 +155,7 @@
                 if tree.is_control_filename(subp):
                     mutter("skip control directory %r", subp)
                 else:
-                    ignore_glob = tree.is_ignored(subp)
+                    ignore_glob = tree.ignored_by(subp)
                     if ignore_glob is not None:
                         mutter("skip ignored sub-file %r", subp)
                         if ignore_glob not in ignored:

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py	
+++ bzrlib/builtins.py	
@@ -1422,7 +1422,7 @@
             if file_class != 'I':
                 continue
             ## XXX: Slightly inefficient since this was already calculated
-            pat = tree.is_ignored(path)
+            pat = tree.ignored_by(path)
             print '%-50s %s' % (path, pat)
 
 

=== modified file 'bzrlib/info.py'
--- bzrlib/info.py	
+++ bzrlib/info.py	
@@ -200,7 +200,7 @@
 
     ignore_cnt = unknown_cnt = 0
     for path in working.extras():
-        if working.is_ignored(path):
+        if working.ignored(path):
             ignore_cnt += 1
         else:
             unknown_cnt += 1

=== modified file 'bzrlib/tests/__init__.py'
--- bzrlib/tests/__init__.py	
+++ bzrlib/tests/__init__.py	
@@ -1058,11 +1058,13 @@
                    'bzrlib.tests.test_errors',
                    'bzrlib.tests.test_escaped_store',
                    'bzrlib.tests.test_fetch',
+                   'bzrlib.tests.test_glob',
                    'bzrlib.tests.test_gpg',
                    'bzrlib.tests.test_graph',
                    'bzrlib.tests.test_hashcache',
                    'bzrlib.tests.test_http',
                    'bzrlib.tests.test_identitymap',
+                   'bzrlib.tests.test_ignore',
                    'bzrlib.tests.test_inv',
                    'bzrlib.tests.test_knit',
                    'bzrlib.tests.test_lockdir',

=== modified file 'bzrlib/tree.py'
--- bzrlib/tree.py	
+++ bzrlib/tree.py	
@@ -239,7 +239,7 @@
 
     if not new_id and not old_id:
         # easy: doesn't exist in either; not versioned at all
-        if new_tree.is_ignored(filename):
+        if new_tree.ignored(filename):
             return 'I', None, None
         else:
             return '?', None, None

=== modified file 'bzrlib/workingtree.py'
--- bzrlib/workingtree.py	
+++ bzrlib/workingtree.py	
@@ -44,6 +44,7 @@
 import errno
 import fnmatch
 import os
+import re
 import stat
  
 
@@ -65,6 +66,7 @@
                            MergeModifiedFormatError,
                            UnsupportedOperation,
                            )
+from bzrlib.glob import anchor_globs, compile_list, FNMATCH
 from bzrlib.inventory import InventoryEntry, Inventory
 from bzrlib.lockable_files import LockableFiles, TransportLock
 from bzrlib.lockdir import LockDir
@@ -93,8 +95,8 @@
 from bzrlib.symbol_versioning import *
 from bzrlib.textui import show_status
 import bzrlib.tree
+from bzrlib.trace import mutter, note, warning
 from bzrlib.transform import build_tree
-from bzrlib.trace import mutter, note
 from bzrlib.transport import get_transport
 from bzrlib.transport.local import LocalTransport
 import bzrlib.ui
@@ -107,7 +109,6 @@
     This should probably generate proper UUIDs, but for the moment we
     cope with just randomness because running uuidgen every time is
     slow."""
-    import re
     from binascii import hexlify
     from time import time
 
@@ -669,7 +670,7 @@
     def file_class(self, filename):
         if self.path2id(filename):
             return 'V'
-        elif self.is_ignored(filename):
+        elif self.ignored(filename):
             return 'I'
         else:
             return '?'
@@ -707,7 +708,7 @@
                 f_ie = inv.get_child(from_dir_id, f)
                 if f_ie:
                     c = 'V'
-                elif self.is_ignored(fp):
+                elif self.ignored(fp):
                     c = 'I'
                 else:
                     c = '?'
@@ -886,7 +887,7 @@
         [u'foo']
         """
         for subp in self.extras():
-            if not self.is_ignored(subp):
+            if not self.ignored(subp):
                 yield subp
 
     @deprecated_method(zero_eight)
@@ -970,7 +971,7 @@
     def ignored_files(self):
         """Yield list of PATH, IGNORE_PATTERN"""
         for subp in self.extras():
-            pat = self.ignored(subp)
+            pat = self.ignored_by(subp)
             if pat != None:
                 yield subp, pat
 
@@ -986,48 +987,84 @@
         l = bzrlib.DEFAULT_IGNORE[:]
         if self.has_filename(bzrlib.IGNORE_FILENAME):
             f = self.get_file_byname(bzrlib.IGNORE_FILENAME)
-            l.extend([line.rstrip("\n\r") for line in f.readlines()])
+            try:
+                l.extend([line.decode('utf-8').rstrip("\n\r")
+                        for line in f.readlines()])
+            except UnicodeDecodeError:
+                warning(u"'%s' is not utf-8 encoded, not reading ignore patterns"
+                        % bzrlib.IGNORE_FILENAME)
         self._ignorelist = l
         return l
 
-
+    def _get_ignore_regex(self):
+        """Return a regular expression composed of ignore patterns.
+
+        Cached in the Tree object after the first call.
+        """
+        if not hasattr(self, '_ignoreregex'):
+            self._ignoreregex = compile_list(
+                    anchor_globs(self.get_ignore_list()),
+                    style=FNMATCH)
+        return self._ignoreregex
+
+    def _get_ignore_by_regex_list(self):
+        """Return regex list for ignored_by method.
+
+        Cached in the Tree object after the first call.
+
+        The return value is a list of lists, each having the compiled pattern
+        as its first element, followed by the globs it is composed from.
+        """
+        if not hasattr(self, '_ignore_by_regex_list'):
+            pats = self.get_ignore_list() # So we can shift...
+            self._ignore_by_regex_list = []
+            while pats:
+                self._ignore_by_regex_list.append(
+                        [compile_list(anchor_globs(pats[0:50]),
+                                    wrap=u'(%s)',
+                                    style=FNMATCH)]
+                        + pats[0:50])
+                pats = pats[50:]
+        return self._ignore_by_regex_list
+
+    @deprecated_method(zero_eight)
     def is_ignored(self, filename):
         r"""Check whether the filename matches an ignore pattern.
 
+        This method was split in two: a faster ignored method that returns a
+        true value when the filename matches some ignore pattern, and a slower
+        ignored_by that returns the (first) matching pattern. For backward
+        compatibility this returns the pattern.
+        """
+        return self.ignored_by(filename)
+
+    def ignored(self, filename):
+        r"""Check whether the filename matches an ignore pattern.
+
         Patterns containing '/' or '\' need to match the whole path;
         others match against only the last component.
 
-        If the file is ignored, returns the pattern which caused it to
-        be ignored, otherwise None.  So this can simply be used as a
-        boolean if desired."""
-
-        # TODO: Use '**' to match directories, and other extended
-        # globbing stuff from cvs/rsync.
-
-        # XXX: fnmatch is actually not quite what we want: it's only
-        # approximately the same as real Unix fnmatch, and doesn't
-        # treat dotfiles correctly and allows * to match /.
-        # Eventually it should be replaced with something more
-        # accurate.
-        
-        for pat in self.get_ignore_list():
-            if '/' in pat or '\\' in pat:
-                
-                # as a special case, you can put ./ at the start of a
-                # pattern; this is good to match in the top-level
-                # only;
-                
-                if (pat[:2] == './') or (pat[:2] == '.\\'):
-                    newpat = pat[2:]
-                else:
-                    newpat = pat
-                if fnmatch.fnmatchcase(filename, newpat):
-                    return pat
-            else:
-                if fnmatch.fnmatchcase(splitpath(filename)[-1], pat):
-                    return pat
-        else:
-            return None
+        If the file is ignored, returns a match object, otherwise None. So
+        this can simply be used as a boolean if desired. The match object is
+        really not very useful, because the individual patterns are not
+        captured.
+        """
+        pat = self._get_ignore_regex()
+        return pat.match(filename)
+
+    def ignored_by(self, filename):
+        r"""Check whether the filename matches and return the pattern it matches.
+
+        This method is similar to ignored, but makes the extra effort to
+        return the pattern that matched.
+        """
+
+        pats = self._get_ignore_by_regex_list()
+        for pat in pats:
+            m = pat[0].match(filename)
+            if m:
+                return pat[m.lastindex]
+        return None
 
     def kind(self, file_id):
         return file_kind(self.id2abspath(file_id))
@@ -1157,7 +1194,7 @@
             mutter("remove inventory entry %s {%s}", quotefn(f), fid)
             if verbose:
                 # having remove it, it must be either ignored or unknown
-                if self.is_ignored(f):
+                if self.ignored(f):
                     new_status = 'I'
                 else:
                     new_status = '?'
