[MERGE/RFC] Working tree content filtering

Ian Clatworthy ian.clatworthy at internode.on.net
Fri Apr 18 08:02:59 BST 2008


John Arbash Meinel wrote:

> Further, at this point you have the file_id (entry[0][2] is the
> file_id). I would highly recommend that the _cfs_provider api allow you
> to supply as much information as you have. So it can be:
> 
> self._cfs_provider(path=relpath, file_id=entry[0][2])

I know what you mean and I've explicitly made exactly this change in the
past to parts of our code while performance testing. I'll probably do
this here as well but I'd like to think about it some more.
Paradoxically, if we do go the patterns way (which is my preference at
this point), *knowing* that the relpath is required as a parameter may
result in faster overall code. For example, if only the file-id is known
at a particular level but the relpath was known at the level above, then
passing the relpath down to the lower level will give better performance
than converting id -> relpath inside the cfs_provider. Of course, if we
go with versioned properties per file, then needing to look up the
file-id inside that routine would be equally dumb.
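
To make the trade-off concrete, here's a rough sketch of the kind of
provider I have in mind. Everything except the id2path call is
illustrative; in particular, the (pattern, filter) registry and the
fnmatch-based lookup are just placeholders, not part of the patch:

  import fnmatch

  def make_cfs_provider(tree, registered_filters):
      # registered_filters: assumed list of (glob_pattern, filter) pairs
      def provider(path=None, file_id=None):
          if path is None:
              if file_id is None:
                  return []
              # Only pay for the id -> path conversion when the caller
              # couldn't hand us the relpath directly.
              path = tree.id2path(file_id)
          return [f for pattern, f in registered_filters
                  if fnmatch.fnmatch(path, pattern)]
      return provider

With that shape, calling self._cfs_provider(path=relpath,
file_id=entry[0][2]) costs nothing extra when the relpath is already
known, and still works when only the file-id is to hand.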

> Note also, it may be less than trivial to write a filter that has
> enough context to go from file_id <=> path. At a minimum you need the
> tree context itself.

The filters ought to operate purely on the content, so I don't expect
them to need to bother with this either way.

> +def filtered_input_file(f, filters):
> +    """Get an input file that converts external to internal content.
> +
> +    :param f: the original input file
> +    :param filters: the stack of filters to apply
> +    :return: a file-like object
> +    """
> +    if filters:
> +        lines = f.readlines()
> +        for filter in filters:
> +            lines = filter.reader(lines)
> +        return cStringIO.StringIO(''.join(lines))
> +    else:
> +        return f
> +
> 
> ^- So a little discussion about the filter api.

Please. (That was the main reason for getting this code out there now
for review/feedback.)

> So, overall, I'm happy enough with not requiring filters to be
> streamable, and to process the full set of lines at a time. But I think
> the decision should be documented in the Filter design documentation.

Agreed.

> Oh, but I realized one thing, 'f.readlines()' is not going to be safe
> before we know the actual line endings of the file. Specifically
> consider a "CR" file from a Mac system that is currently residing on a
> Windows system. I just tested it:
> 
> >>> open('foo', 'wb').write('foo\rbar\rbaz\r')
> >>> open('foo', 'rb').readlines()
> ['foo\rbar\rbaz\r']
> >>> open('foo', 'r').readlines()
> ['foo\rbar\rbaz\r']
> >>> open('foo', 'rU').readlines()
> ['foo\n', 'bar\n', 'baz\n']

Ahh - my mistake. I *meant* to write ...

  chunks = [f.read()]

as you suggested as one option, not

  lines = f.readlines()

but ...

> So I would actually prefer the API to be either 1 big string:
> 
> text = f.read()
> for filter in filters:
> ~  text = filter.reader(text)

Having spent time today trying to write a filter, I now think ...

  text = f.read()

is definitely better. Asking filter writers to process a sequence of
chunks makes their life *much* harder. With the exception of filters
that match a *single* character with no other context, the filters
basically always need to do

  ''.join(chunks)

and then return [result] or something like that. The moment there are
multiple characters to match (e.g. $Date$ or \r\n), some may be in one
chunk and the rest in the next. I'm concerned that we'll get obscure
bugs when a filter writer forgets that's possible. So either we:

* pass in and return a blob of text, or
* we explicitly pass around true "lines" and filters can *know*
  they are to receive and return complete, logical lines - not partial
  ones.

The first seems like the path of minimum bugs. Practically, the vast
majority of files on which filtering will be done will be text files of
small to moderate sizes, so I'm not expecting much impact on either
performance or memory pressure. I'm sure some corner case will prove me
wrong though. :-)
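
To make that concrete, here's the sort of filter I was trying to write
today, expressed against the "one big string" API. The keyword regex
and the function names are just examples, not part of the patch:

  import re

  _DATE_KEYWORD = re.compile(r'\$Date[^$\n]*\$')

  def date_keyword_reader(text):
      # Collapse an expanded $Date: ...$ keyword back to $Date$ when
      # converting external content to internal form.
      return _DATE_KEYWORD.sub('$Date$', text)

  def crlf_reader(text):
      # Normalise Windows line endings for internal storage.
      return text.replace('\r\n', '\n')

With chunked input, '$Da' can end one chunk and 'te$' start the next,
so each of these would have to ''.join(chunks) first anyway; with one
big string they stay one-liners.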

> We probably also want to make it clear that the filters work on 8-bit
> byte-streams, *not* Unicode strings.

Yes.

> I'm not sure about having ContentFilter be a plain data object, that
> has members passed in at init time. I'm a little more comfortable with
> an abstract base class, and using inheritance to overload the
> functionality. Though I suppose if we are planning to have a given set
> of singleton Filters, it doesn't really matter.

I thought about that. In this case, my feeling is that it doesn't make
much difference either way.
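
For the record, the two shapes being compared look something like this
(the CRLF filter is just a stand-in example, not a proposed filter):

  # Plain data object, roughly as in the patch: callables given at init.
  class ContentFilter(object):
      def __init__(self, reader, writer):
          self.reader = reader   # external -> internal content
          self.writer = writer   # internal -> external content

  crlf_filter = ContentFilter(
      reader=lambda text: text.replace('\r\n', '\n'),
      writer=lambda text: text.replace('\n', '\r\n'))

  # Abstract-base-class alternative: behaviour supplied by subclassing.
  class ContentFilterBase(object):
      def reader(self, text):
          raise NotImplementedError(self.reader)
      def writer(self, text):
          raise NotImplementedError(self.writer)

  class CrLfFilter(ContentFilterBase):
      def reader(self, text):
          return text.replace('\r\n', '\n')
      def writer(self, text):
          return text.replace('\n', '\r\n')

If the filters end up being a fixed set of singletons registered
somewhere, the two shapes collapse to much the same thing, which is why
I don't feel strongly about it.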

> Why use: _content_filter_stack() => None to indicate 0 filters rather
> than just a simple list?

Well, [] does mean an empty stack. We could make None mean something
else, like the default stack, but the default stack is empty anyhow, so
I simply treat None as equivalent to an empty stack.

> You already are using:
> 
> if filters:
> ~  # do filtered form
> else:
> ~  # Do other form
> 
> and that 'if' check will trigger for both None and [] (and False, and other
> things, which I know Aaron doesn't like.)
> 
> So if you really want None, I would probably recommend:
> 
> if filters is None:
> ~  XXX
> else:
> ~  YYY
> 
> though *I* prefer an empty list.

An empty list must mean an empty list. As mentioned above, I'm just
being robust in the case where None is passed (as well).
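
Putting the two threads above together, filtered_input_file would end
up looking something like this sketch (blob API plus the tolerant
empty/None check), though the exact form may still change:

  import cStringIO

  def filtered_input_file(f, filters):
      """Get an input file that converts external to internal content."""
      if not filters:
          # Covers both None and [], defensively.
          return f
      text = f.read()
      for filter in filters:
          text = filter.reader(text)
      return cStringIO.StringIO(text)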

Ian C.


