[PATCH][MERGE] Improvements to is_ignored

Tue Jan 10 22:39:52 GMT 2006

Hello All,

I refactored WorkingTree.is_ignored to compile all patterns to one big regexp
and match against that.

The method was split in two, is_ignored and is_ignored_by. is_ignored now
returns only boolean value (the match object actually, but it's mostly
useless) and uses non-capturing groups in the pattern to avoid the
bookkeeping of captures. The is_ignored_by returns the (last) matching
pattern for the cases where it is used.

Due to how match object work, the _last_ matching pattern is returned from
is_ignored_by.

The _glob_to_regex function still uses the same logic of fnmatch. However it
should now be easy to replace with better convertor (which is however pretty
hard to write).

Brief measurements on the bzr tree itself show, that is faster by some 10%.
Please if you can, compare the performance on some large tree with many
ignore patterns. I don't have any such lying around here.

The branch is on http://www.ucw.cz/~bulb/bzr/bzr.ignore
(revision 1562; pushed, so no working tree)

Diff follows:

=== modified file 'bzrlib/add.py'

--- bzrlib/add.py
+++ bzrlib/add.py
@@ -155,7 +155,7 @@
                 if subf == bzrlib.BZRDIR:
                     mutter("skip control directory %r", subp)
                 else:
-                    ignore_glob = tree.is_ignored(subp)
+                    ignore_glob = tree.is_ignored_by(subp)
                     if ignore_glob is not None:
                         mutter("skip ignored sub-file %r", subp)
                         if ignore_glob not in ignored:

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py
+++ bzrlib/builtins.py
@@ -1135,7 +1135,7 @@
             if file_class != 'I':
                 continue
             ## XXX: Slightly inefficient since this was already calculated
-            pat = tree.is_ignored(path)
+            pat = tree.is_ignored_by(path)
             print '%-50s %s' % (path, pat)
 
 

=== modified file 'bzrlib/workingtree.py'
--- bzrlib/workingtree.py
+++ bzrlib/workingtree.py
@@ -113,6 +113,45 @@
     return gen_file_id('TREE_ROOT')
 
 
+def _glob_to_regex(pat):
+    r"""Convert a unix glob to regular expression.
+    
+    Patterns containing '/' or '\' need to match whole path; others match
+    against only the last component - as per requirement of
+    WorkingTree.is_ignored().
+    
+    Pattern is returned as string.
+    """
+    # TODO: For now we use fnmatch.translate, which is Broken(tm). New,
+    # correct, translator that would handle '**' for matching paths and other
+    # extended globbing stuff from cvs/rsync should be implemented.
+
+    # XXX: Shouldn't the globs be actually UNICODE?
+
+    # XXX: fnmatch is actually not quite what we want: it's only
+    # approximately the same as real Unix fnmatch, and doesn't
+    # treat dotfiles correctly and allows * to match /.
+    # Eventually it should be replaced with something more
+    # accurate.
+        
+    if '/' in pat or '\\' in pat:
+        if (pat[:2] == './') or (pat[:2] == '.\\'):
+            pat = pat[2:]
+        return fnmatch.translate(pat)
+    else:
+        # XXX: Is the path normalized? Should we match [/\\] ?
+        return '(?:.*/)?' + fnmatch.translate(pat)
+
+
+def _glob_list_to_regex(pats, wrap='(?:%s)'):
+    """Convert a list of unix globs to a regular expression.
+
+    The pattern is returned as string. The wrap is % format applied to each
+    individual glob pattern. It has to apply group.
+    """
+    return '|'.join([wrap % _glob_to_regex(x) for x in pats])
+
+
 class TreeEntry(object):
     """An entry that implements the minium interface used by commands.
 
@@ -186,6 +225,8 @@
         (branch.base is not cross checked, because for remote branches that
         would be meaningless).
         """
+        self._ignoreregex = {}
+
         from bzrlib.hashcache import HashCache
         from bzrlib.trace import note, mutter
         assert isinstance(basedir, basestring), \
@@ -712,11 +753,10 @@
 
 
     def ignored_files(self):
-        """Yield list of PATH, IGNORE_PATTERN"""
+        """Yield list of paths"""
         for subp in self.extras():
-            pat = self.is_ignored(subp)
-            if pat != None:
-                yield subp, pat
+            if self.is_ignored(subp):
+                yield subp
 
 
     def get_ignore_list(self):
@@ -734,6 +774,21 @@
         self._ignorelist = l
         return l
 
+    def get_ignore_regex(self, wrap='(?:%s)'):
+        """Return a regular expression composed of ignore patterns.
+
+        Cached in the Tree object after the first call.
+
+        The wrap argument is a % format to be applied to each separate
+        pattern. It can be used to switch between capturing/noncapturing
+        grouping.
+        """
+        import re
+        if not wrap in self._ignoreregex:
+            self._ignoreregex[wrap] = re.compile(
+                    _glob_list_to_regex(self.get_ignore_list(), wrap))
+        return self._ignoreregex[wrap]
+
 
     def is_ignored(self, filename):
         r"""Check whether the filename matches an ignore pattern.
@@ -741,37 +796,26 @@
         Patterns containing '/' or '\' need to match the whole path;
         others match against only the last component.
 
-        If the file is ignored, returns the pattern which caused it to
-        be ignored, otherwise None.  So this can simply be used as a
-        boolean if desired."""
-
-        # TODO: Use '**' to match directories, and other extended
-        # globbing stuff from cvs/rsync.
-
-        # XXX: fnmatch is actually not quite what we want: it's only
-        # approximately the same as real Unix fnmatch, and doesn't
-        # treat dotfiles correctly and allows * to match /.
-        # Eventually it should be replaced with something more
-        # accurate.
-        
-        for pat in self.get_ignore_list():
-            if '/' in pat or '\\' in pat:
-                
-                # as a special case, you can put ./ at the start of a
-                # pattern; this is good to match in the top-level
-                # only;
-                
-                if (pat[:2] == './') or (pat[:2] == '.\\'):
-                    newpat = pat[2:]
-                else:
-                    newpat = pat
-                if fnmatch.fnmatchcase(filename, newpat):
-                    return pat
-            else:
-                if fnmatch.fnmatchcase(splitpath(filename)[-1], pat):
-                    return pat
-        else:
-            return None
+        If the file is ignored, returns a match object, otherwise None. So
+        this can simply be used as a boolean if desired. The match object is
+        really not very useful, because the individual patterns are not
+        captured.
+        """
+
+        pat = self.get_ignore_regex()
+        return pat.match(filename)
+
+    def is_ignored_by(self, filename):
+        r"""Check whether the filename matches and return the pattern it matches.
+
+        This method is similar to is_ignored, but makes the extra effort to
+        return the pattern that matched.
+        """
+
+        pat = self.get_ignore_regex(wrap='(%s)')
+        m = pat.match(filename)
+        if m:
+            return self.get_ignore_list()[m.lastindex - 1]
 
     def kind(self, file_id):
         return file_kind(self.id2abspath(file_id))


-- 
						 Jan 'Bulb' Hudec <bulb at ucw.cz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060110/e5786fd1/attachment.pgp