[MERGE/RFC] Working tree content filtering

Thu Apr 17 18:09:58 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
| Here's a patch providing my progress so far on this. A lot more tests
| are needed but there's an outside chance it basically works now. Brave
| souls may wish to:
|
| * write some experimental content filters
| * patch in a _content_filter_stack() method to the WorkingTree class (in
|   workingtree.py) to enable them
| * tell me how broken or otherwise things are.
|
| To state the obvious, this isn't production code. I'm publishing it now
| in order to get some feedback on things like:
|
| * how easy or otherwise it is for developers to create filters
| * whether the implementation is heading in the right direction
|   (before I add a heap more tests)
| * places I've missed making changes to, e.g. PreviewTrees?
|
| FWIW, until earlier tonight, the delta was much smaller than it is now
| because I had a function in bzrlib.filters that returned the stack of
| content filters given an absolute path. I decided to make the
| content-filter-stack lookup a responsibility of a Tree and to use paths
| relative to the top of the tree. That feels a better design IMO. Let me
| know if you think the responsibility better belongs somewhere else.
|
| Finally, a big thank-you to Alexander who did the hard work on finding
| most of the spots that needed changes made. Had he not done the original
| code, this patch would have taken me much longer.
|
| Ian C.
|

First off, I like the idea of making it a content filter, which is controlled at
the tree level.

I also completely agree that if it is path based, it should be path relative to
the working tree root (relpath). If we want to support absolute path based, we
can do that by having the filters customized for the WT based on their absolute
path. At least, that is my opinion. (The only use case I can think of for that
is because you are on linux, with a mounted Samba share, and you want the files
to be CRLF on the samba share, but LF on your local hard-drive.)

Anyway, some specific code comments:

@dirstate.py
- -            link_or_sha1 = self._sha1_file(abspath)
+            if self._cfs_provider is None:
+                filters_ = None
+            else:
+                relpath = osutils.pathjoin(entry[0][0], entry[0][1])
+                filters_ = self._cfs_provider(relpath)
+            link_or_sha1 = self._sha1_file(abspath, filters_)

^- I don't really like using variables with a trailing '_'. Better to call it
'filter_list' or something like that (IMO).

Further, at this point you have the file_id (entry[0][2] is the file_id). I
would highly recommend that the _cfs_provider api allow you to supply as much
information as you have. So it can be:

self._cfs_provider(path=relpath, file_id=entry[0][2])

*I* really prefer named arguments, mostly because it helps clarify when the
parameter itself is unclear.

foo(True) is not very descriptive versus foo(make_it_correct=True).

Anyway, DirState._get_entry() already uses this, as do a few of the Tree apis.
Which means you get to supply what you know, and it can do it in the most
efficient way. The only question becomes whether we want to *require* both
pieces of information. While we are getting it working, I would probably not
require it, as it simplifies the calling code (at the expense of the filter code).

Note also, it may be less than trivial to write a filter that has enough context
to go from file_id <=> path. At a minimum you need the tree context itself.

+def filtered_input_file(f, filters):
+    """Get an input file that converts external to internal content.
+
+    :param f: the original input file
+    :param filters: the stack of filters to apply
+    :return: a file-like object
+    """
+    if filters:
+        lines = f.readlines()
+        for filter in filters:
+            lines = filter.reader(lines)
+        return cStringIO.StringIO(''.join(lines))
+    else:
+        return f
+

^- So a little discussion about the filter api. This assumes they are
line-based, and work on the whole content at once. It might be interesting to do
it a different way. Specifically, you could do:

if filters:
~  out_lines = []
~  for line in f:
~    for filter in filters:
~      line = filter.reader(line)
~    out_lines.append(line)
~  return cStringIO.StringIO(''.join(lines))

This is nice from an I/O perspective, in that you can start processing some of
the lines before you have finished buffering everything into memory. The flip
side is that it is probably more CPU intensive (you are calling a function for
every line, rather than evaluating over the collection of lines.)

The downside is that it becomes tricky to create and delete lines. (Technically
you can return '' to remove a line, and 'foo\nbar\n' would be multiple lines,
but it makes things trickier than I think it has to be.)

Another alternative would be to have the .reader(some_lines) which could be some
set of the lines, but not guaranteed to be all of them. However, then you have
difficulties if you want a filter that can look "back" in time. (If the next
line is 'foo' delete this line.)

So, overall, I'm happy enough with not requiring filters to be streamable, and
to process the full set of lines at a time. But I think the decision should be
documented in the Filter design documentation.

Oh, but I realized one thing, 'f.readlines()' is not going to be safe before we
know the actual line endings of the file. Specifically consider a "CR" file from
a Mac system that is currently residing on a Windows system. I just tested it:

|>> open('foo', 'wb').write('foo\rbar\rbaz\r')
|>> open('foo', 'rb').readlines()
['foo\rbar\rbaz\r']
|>> open('foo', 'r').readlines()
['foo\rbar\rbaz\r']
|>> open('foo', 'rU').readlines()
['foo\n', 'bar\n', 'baz\n']

So I would actually prefer the API to be either 1 big string:

text = f.read()
for filter in filters:
~  text = filter.reader(text)

The downside is that munging a large string takes more memory copies, etc, than
just working on a bunch of small strings. So instead, we could make the api work
on "chunks" of a file. Which means that:

chunks = [f.read()]
for filter in filters:
~  chunks = filter.reader(chunks)

is just as valid as doing

chunks = f.readlines()

AKA, all filters will be given the full content of the file, but the chunking is
arbitrary. In fact, they should probably handle cases like:

['foo', 'bar\nbaz', 'bing\n']

It makes the filters a little bit more tricky, but as we can't guarantee from
the beginning that we have line endings correct, we need to have people writing
filters *know* that.

Note that even worse is a UTF-16-le file, if you do:

open('foo', 'wb').write(u'foo\nbar\nbaz\n'.encode('utf-16-le'))
open('foo', 'rb').readlines()
['f\x00o\x00o\x00\n', '\x00b\x00a\x00r\x00\n', '\x00b\x00a\x00z\x00\n', '\x00']

Note that the '\n\x00' which is the newline character is actually split between
the strings.
Otherwise you could chain a "UTF-16-LE => UTF-8" filter, as a step before the
"CRLF" encoder, etc.

We probably also want to make it clear that the filters work on 8-bit
byte-streams, *not* Unicode strings.

I'm also thinking about a filter like the $Log$ keyword expansion. Which expands
into many lines, and would thus want to be collapsed back into a single line.

Similarly for the "filter.writer()" function, btw.

I'm not sure about having ContentFilter be a plain data object, that has members
passed in at init time. I'm a little more comfortable with an abstract base
class, and using inheritance to overload the functionality. Though I suppose if
we are planning to have a given set of singleton Filters, it doesn't really matter.

Why use: _content_filter_stack() => None to indicate 0 filters rather than just
a simple list? You already are using:

if filters:
~  # do filtered form
else:
~  # Do other form

and that 'if' check will trigger for both None and [] (and False, and other
things, which I know Aaron doesn't like.)

So if you really want None, I would probably recommend:

if filters is None:
~  XXX
else:
~  YYY

though *I* prefer an empty list.

I also think that:

X
for y in []:
~  X = y(X)

is just as fast as:

X
if []:
~  for y in []:
~    X = y(X)

Though I realize in some places you might change more than just how you loop.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkgHhGYACgkQJdeBCYSNAAPzrgCfaYiQ/k7VOeuSuBNPCmWPst/y
sb8An1NMC5wMonA0ZbSa1Vxo5fdGPSQu
=HsEi
-----END PGP SIGNATURE-----