[MERGE/RFC] Working tree content filtering

John Arbash Meinel john at arbash-meinel.com
Fri Apr 18 17:39:34 BST 2008

Previous message: [MERGE/RFC] Working tree content filtering
Next message: [MERGE] Make the test suite use the new test loader
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
| John Arbash Meinel wrote:
|
|> Further, at this point you have the file_id (entry[0][2] is the file_id). I
|> would highly recommend that the _cfs_provider api allow you to supply as
|> much
|> information as you have. So it can be:
|>
|> self._cfs_provider(path=relpath, file_id=entry[0][2])
|
| I know what you mean and I've explicitly made exactly this change in the
| past to parts of our code while performance testing. I'll probably do
| this here as well but I'd like to think about it some more.
| Paradoxically, if we do go the patterns way (which is my preference at
| this point), *knowing* that the relpath is required as a parameter may
| result in faster overall code. For example, if there is only the file-id
| known about at a particularly level but the relpath was known at a level
| above, then passing the relpath down to the lower level will give better
| performance than converting id -> relpath inside the cfs_provider. Of
| course, if we go with versioned properties per file, then needing to
| lookup the file-id inside that routine would be equally dumb.

Yeah. That was sort of why I wanted to provide both, since it would let a given
filter decide how it wanted to operate without incurring a massive penalty in
common cases.

|
|> Note also, it may be less than trivial to write a filter that has enough
|> context
|> to go from file_id <=> path. At a minimum you need the tree context itself.
|
| The filters ought to operate on the content so I don't expect them to
| have to bother with this either way.

Ok, maybe I should have said the content_filter_provider needs to have that much
context.

|
|> +def filtered_input_file(f, filters):
|> +    """Get an input file that converts external to internal content.
|> +
|> +    :param f: the original input file
|> +    :param filters: the stack of filters to apply
|> +    :return: a file-like object
|> +    """
|> +    if filters:
|> +        lines = f.readlines()
|> +        for filter in filters:
|> +            lines = filter.reader(lines)
|> +        return cStringIO.StringIO(''.join(lines))
|> +    else:
|> +        return f
|> +
|>
|> ^- So a little discussion about the filter api.
|
| Please. (That was the main reason for getting this code out there now
| for review/feedback.)
|
|> So, overall, I'm happy enough with not requiring filters to be
|> streamable, and
|> to process the full set of lines at a time. But I think the decision
|> should be
|> documented in the Filter design documentation.
|
| Agreed.
|
|> Oh, but I realized one thing, 'f.readlines()' is not going to be safe
|> before we
|> know the actual line endings of the file. Specifically consider a "CR"
|> file from
|> a Mac system that is currently residing on a Windows system. I just
|> tested it:
|>
|> |>> open('foo', 'wb').write('foo\rbar\rbaz\r')
|> |>> open('foo', 'rb').readlines()
|> ['foo\rbar\rbaz\r']
|> |>> open('foo', 'r').readlines()
|> ['foo\rbar\rbaz\r']
|> |>> open('foo', 'rU').readlines()
|> ['foo\n', 'bar\n', 'baz\n']
|
| Ahh - my mistake. I *meant* to write ...
|
|   chunks = [f.read()]
|
| as you suggested as one option, not
|
|   lines = f.readlines()
|
| but ...
|
|> So I would actually prefer the API to be either 1 big string:
|>
|> text = f.read()
|> for filter in filters:
|> ~  text = filter.reader(text)
|
| Having spent time today trying to write a filter, I now think ...
|
|   text = f.read()
|
| is definitely better. Asking filter writers to process a sequence of
| chunks makes their life *much* harder. With the exception of filters
| that match a *single* character with no other context, the filters
| basically need to always do
|
|   ''.join(chunks)
|
| and then return [result] or something like that. The moment there are
| multiple characters to match (e.g. $Date$ or \r\n), then some may be in
| one chunk and some in the next. I'm concerned that we'll get obscure
| bugs when a filter writer forgets that's possible. So either we:
|
| * pass in and return a blob of text, or
| * we explicitly pass around true "lines" and filters can *know*
|   they are to receive and return complete, logical lines - not partial
|   ones.
|
| The first seems like the path of minimum bugs. Practically, the vast
| majority of files on which filtering will be done will be text files of
| small to moderate sizes, so I'm not expecting much impact on either
| performance or memory pressure. I'm sure some corner case will prove me
| wrong though. :-)

I think the latter isn't possible without having a whole lot more
meta-information (which may be out of date) on the file.

I like chunks from a "don't have to reallocate the whole 10MB string for every
character you change" perspective, but I agree that "whole text as a byte
stream" is easier to program for, and in most common cases won't perform poorly.

|
|> We probably also want to make it clear that the filters work on 8-bit
|> byte-streams, *not* Unicode strings.
|
| Yes.
|
|> I'm not sure about having ContentFilter be a plain data object, that has
|> members
|> passed in at init time. I'm a little more comfortable with an abstract base
|> class, and using inheritance to overload the functionality. Though I
|> suppose if
|> we are planning to have a given set of singleton Filters, it doesn't
|> really matter.
|
| I thought about that. In this case, my feeling is that it doesn't make
| much difference either way.
|
|> Why use: _content_filter_stack() => None to indicate 0 filters rather
|> than just
|> a simple list?
|
| Well [] does mean an empty stack. We could make None mean something else
| like the default stack, but the default stack is empty anyhow.
|
| You already are using:
|> if filters:
|> ~  # do filtered form
|> else:
|> ~  # Do other form
|>
|> and that 'if' check will trigger for both None and [] (and False, and other
|> things, which I know Aaron doesn't like.)
|>
|> So if you really want None, I would probably recommend:
|>
|> if filters is None:
|> ~  XXX
|> else:
|> ~  YYY
|>
|> though *I* prefer an empty list.
|
| An empty list must mean an empty list. As mentioned above, I'm just
| being robust in the case where None is passed (as well).
|
| Ian C.
|

I would rather not allow None, just because they you have a clean type and you
can just write your simple for loop when it doesn't matter.

if foo:
~  for x in foo:
~    do stuff

is a bit uglier than just

for x in foo:
~  do stuff

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkgIzsYACgkQJdeBCYSNAAPHxACglFHRyuzNt4klg+GYV8rYsm0X
ktIAn20m0v5ZRtMoBoHS1poD5aXspzKc
=lJnv
-----END PGP SIGNATURE-----

Previous message: [MERGE/RFC] Working tree content filtering
Next message: [MERGE] Make the test suite use the new test loader
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the bazaar mailing list