[PATCH][Bug 86451]Normalise path separators in ignore patterns

Thu Mar 1 14:35:13 GMT 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Kent Gibson wrote:
> 
> 
> Aaron Bentley wrote:
>>> It would be much better to fix the glob-to-regex conversion, and
>>> normalize those slashes.
>>>
>>>
> OK, here is a patch that normalises the path separator in ignore
> patterns, rather than just chopping trailing backslashes.
> 
> I had a go at making the normalisation part of the glob-to-regex
> conversion proper, but that turned out to be impossible since the
> Replacer only does a single match/replace pass over each pattern, and
> you need to match backslashes twice - once for normalisation and once
> for canonicalisation.  Rather than change the Replacer, I added
> normalisation as a separate first pass.
> 
> There are also another couple of trivial fixed in there.
> Firstly, cmd_ignore now checks for DOS style absolute paths as well as
> Unix style, just to be consistent across platforms.
> 
> Secondly, globbing now checks for and removes trailing backslashes in
> REs that can escape the bracket added when aggregating patterns.  That
> is another corner case that can crash the pattern matcher, so I
> thought it was worth fixing it while I was there.
> 
> 
> Cheers,
> Kent.
> 

Thanks for updating your patch. I'd really like to get this in 0.15.
There are a few small things, though.

...

v- Your spacing here is a little odd, I think it is better as:

  if (len(m) % 2) != 0:

  Also, while not strictly required, it would be good to have a simple
docstring. Maybe "check for trailing backslashes in the pattern"


+def _trailing_backslashes_regex(m):
+    if len(m) %2  != 0:
+        warning(u"Regular expressions cannot end with an odd number of
'\\'. "
+                "Dropping the final '\\'.")
+        return m[:-1]
+    return m
+
+
 _sub_re = Replacer()
 _sub_re.add(u'^RE:', u'')
 _sub_re.add(u'\((?!\?)', u'(?:')
 _sub_re.add(u'\(\?P<.*>', _invalid_regex(u'(?:'))
 _sub_re.add(u'\(\?P=[^)]*\)', _invalid_regex(u''))
+_sub_re.add(ur'\\+$', _trailing_backslashes_regex)


 _sub_fullpath = Replacer()
@@ -169,6 +178,7 @@
         base_patterns = []
         ext_patterns = []
         for pat in patterns:
+            pat = normalize_pattern(pat)
             if pat.startswith(u'RE:') or u'/' in pat:
                 path_patterns.append(pat)
             elif pat.startswith(u'*.'):
@@ -200,3 +210,15 @@
                 return patterns[match.lastindex -1]
         return None

+
+_normalize_re = re.compile(ur'\\(?!x\d\d|\d\d\d|u\d\d\d\d)')
+
+
+def normalize_pattern(pattern):
+    """Converts backslashes in path patterns to forward slashes.
+
+    Avoids converting escape backslashes.
+    """
+    if not pattern.startswith('RE:'):
+        pattern = _normalize_re.sub('/', pattern)
+    return pattern.rstrip('/')

^- If I understand correctly, you are allowing backslashes for character
codes (hex) \x00 or (octal) \000 (unicode decimal) \u0000.  There are a
few problems, though.

1) I don't think you need to do this. I'm guessing you misunderstood the
tests. But the data in .bzrignore is meant to be utf-8 data. So
non-ascii characters are expected to be present, but not escapes.

2) If we do need this, then both "\x**" and "\u****" should allow
hexidecimal, not just decimal characters. For example "\xb5" and
"\uabcd" are both valid characters. (I have no idea what \uabcd is, but
u'\u03bc' == µ (mu)).

3) If we do leave this in, we need to make it a little bit clearer in
the docstring. "Avoids converting escape backslashes (for numerical
codes such as \xaa)". Or something like that. It took me a little while
to figure out what you were doing. Also \d{3} might be nicer than \d\d\d.

My guess is that you were confused by some of the tests, specifically these:

In these tests we use "u'\u8336'" because we are checking against a
Unicode character string: 茶/foo

Because when we are reading the filesystem, we end up with Unicode
strings in memory. So we want to make sure that .\foo doesn't match foo
in a subdirectory, even if it the parent directory is not an ascii string.

I might also be happier if we emitted warnings that ignore patterns
should use forward slashes. Rather than silently translating. But I'm
not going to be picky on that part.

+    def test_backslash(self):
+        self.assertMatch([
+            (u'.\\foo',
+             [u'foo'],
+             [u'\u8336/foo', u'barfoo', u'x/y/foo']),
+            (u'.\\f*',
+             [u'foo'],
+             [u'foo/bar', u'foo/.bar', u'x/foo/y']),
+            (u'foo\\**\\x',
+             [u'foo/x', u'foo/bar/x'],
+             [u'foox', u'foo/bax', u'foo/.x', u'foo/bar/bax']),
+            ])
+
+    def test_trailing_slash(self):
+        self.assertMatch([
+            (u'./foo/',
+             [u'foo'],
+             [u'\u8336/foo', u'barfoo', u'x/y/foo']),
+            (u'.\\foo\\',
+             [u'foo'],
+             [u'foo/', u'\u8336/foo', u'barfoo', u'x/y/foo']),
+            ])
+
     def test_leading_asterisk_dot(self):
         self.assertMatch([
             (u'*.x',


So +1 (conditional). I think you need to fix the whitespace in one
place, and unless I'm misunderstanding something you probably want to
take out the special case for numerical escapes.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFF5uSgJdeBCYSNAAMRAj18AJ9zL0t1vpr0uumlHna6atE9i04ZFwCgkqif
Rjg5Ob9Oik4v6Qb5ZZbLyiE=
=RFZj
-----END PGP SIGNATURE-----