[MERGE] Content filtering (EOL part 2 of 3)

Tue May 27 08:33:00 BST 2008

Aaron Bentley wrote:
> Hi Ian,
> 
> Thanks for continuing this work.

Thanks for the thoughtful review. This particular bit is actually quite far
reaching so getting the detailed design clean will be mandatory if this
change isn't to introduce a bunch of bugs, both across the core and in
plugins.

>> +        self._cfs_provider = content_filter_stack_provider
> 
> ^^^ I think "_filter_provider" or "filter_stack" would be a much clearer
> name.

I've gone with _filter_provider both here and in hashcache.py.

> Wouldn't just the basename work here?  Most filters are only going to
> care about file extension.

Many but certainly not all. If we're going to support the same pattern
language in rules as in ignores, then relative paths must be used, not
just the basename.

>>      @staticmethod
>> -    def on_file(path):
>> +    def on_file(path, content_filter_stack_provider=None):
> 
> ^^^ I would be more comfortable if content_filter_stack_provider were a
> mandatory parameter.  That would ensure we hit all the right callsites.

Making it a mandatory parameter would be an API break on a public method.
I'd prefer to avoid that unless you feel really strongly that it's the
right trade-off to do that.

> Don't you need to apply the same approach to file size that you've
> applied to sha1s?

I have a strong feeling you're right. I'm pretty concerned about the
performance impact of doing that, and the bug impact of not doing it.
I'll look deeper tomorrow on this.

>> === added directory 'bzrlib/filters'
> 
> ^^^ This looks like it can be a module ('bzrlib/filters.py') instead of
> a package.

Perhaps. My thinking was that it was likely we'd have standard filters
bundled and they ought to live under bzrlib/filters. I could treat this
as yagni but it seems a fair decision now?

> Thank you for specifying content as a series of bytestrings.
> 
> I'm not sure it's reasonable to require an encoding.

Agreed. Fixed in attached version.

> 
> Is context fixed?  i.e. is it always an instance of
> ContentFilterContext?  If not, how does it vary?

Yes and yes. Clearer explanation in the attached version.

>> +def filtered_input_file(f, filters):
>> +    """Get an input file that converts external to internal content.
>> +
>> +    :param f: the original input file
>> +    :param filters: the stack of filters to apply
>> +    :return: a file-like object
>> +    """
>> +    if filters:
>> +        chunks = [f.read()]
>> +        for filter in filters:
>> +            if filter.reader is not None:
>> +                chunks = filter.reader(chunks)
>> +        return cStringIO.StringIO(''.join(chunks))
>> +    else:
>> +        return f
> 
> ^^^ I can see how this might be necessary in some cases, but perhaps it
> should a -Devil method due to its memory characteristics.  I'd prefer to
> see filtered_input_bytes as the main way of accessing this, with
> filtered_input_file used to satisfy deprecated interfaces.

OK - I'll reconsider that approach. I originally started there and
moved to filtered_input_file because that matched the needs of the
only client calling this. A lot depends on just how intrusive or
otherwise the filtering is. More on this below.

>> +def filtered_output_lines(chunks, filters, context=None):
>> +    """Convert output lines from internal to external format.
> 
> ^^^ I think the name is wrong; There's no guarantee that these chunks
> are lines.  We wouldn't want to guarantee that chunks are lines, because
> a binary file might contain no \n bytes, and a line would therefore take
> up the total size of the file.

I was aiming for parity with output_lines in the file API, but I agree
the potential for confusion is too high in doing that. Changed to
filtered_output_bytes.

>> +# The registry of filter stacks indexed by name.
>> +# (This variable should be treated as private to this module as its
>> +# implementation may well change in the future.)
>> +_filter_stacks_registry = {}
> 
> ^^^ Wouldn't it be more future-proof to use a registry.Registry?

Done.

>> === added file 'bzrlib/help_topics/en/content-filters.txt'
>> --- bzrlib/help_topics/en/content-filters.txt	1970-01-01 00:00:00 +0000
>> +++ bzrlib/help_topics/en/content-filters.txt	2008-05-16 14:51:56 +0000
>> @@ -0,0 +1,87 @@
>> +Content Filters
>> +===============

>> +Read converters are only applied to commands that read content from
>> +a working tree, e.g. status, diff and commit. By default, write
>> +converters are only applied by the ``cat`` command
> 
> I'm not sure which representation cat should produce.  I would kinda
> expect it to return the canonical representation, but I can see that
> there are cases where the convenience representation is desirable.

After submitting the original patch, I thought about 'cat' a lot.
I'm pretty convinced now that cat ought to display the *canonical*
format by default. Likewise, I'm pretty convinced export ought to
do the same. I'd expect a further patch to add options to these
commands to optionally apply the filters, perhaps selectively.
For example, it might well make sense to export with canonical
line endings but with keywords expanded! Right now though in the
first cut, I believe cat ought not to be applying any content filters.

>>  and by commands
>> +that **create** files in a working tree, e.g. branch, checkout, update.
>> +In particular, note that ``bzr commit`` does not implicitly apply
>> +write converters after comitting files. (If this makes sense for
>> +a given plugin providing a content filter, the plugin can always
>> +reigster a ``post_commit`` hook to achieve this effect.)
> 
> ^^^ Well, really it would make sense to provide a precommit filter, and
> not use this framework at all.

In many cases, that's true. It doesn't always apply though, e.g.
keyword expansion is a good example where users have an expectation
of the files being modified after commit and using a pre-commit
(or more often start-commit now it exists) hook doesn't help.

I've improved the text though to reflect your point.

>> + * To reapply write converters, ensure there are no local changes,
>> +   delete the relevant files and run ``bzr update``.
> 
> ^^^ You mean "revert", right?  Update is not going to resurrect dead files.

Oops - I did.

>> +Note: In the future, it is likely that additional options will be added
>> +to commands to make this refreshing process faster and safer.
> 
> Perhaps the simplest approach is to provide an "invalidate-hashcache"
> command.

I'd prefer a more user-friendly name though, e.g. reset-cache perhaps.
(I'll still like to keep this out of scope of this patch either way.)

> ^^^ There should be tests of iter_files_bytes.  It's our key method
> going forward.

> And shouldn't there be tests for other tree types?

Hmm. I could test each and every api on each and every tree type and
perhaps that's the only way we can be sure things are working. The basis
of the implementation design though is this:

1. Read converters should *always* be implicitly applied when asking for
   content from a working tree file. If we don't do this, there's a huge
   risk that code/plugins will read content and see the convenience format
   when they ought to see the canonical format. That would be very bad IMO.

2. Read converters should *never* be applied to revision trees - they should
   only be applied to working trees. Perhaps the boundary needs to be mutable
   trees, not working trees, but it's a hard boundary where ever we draw it.

3. Write converters are applied selectively based on command, not tree
   source.

So I honestly think that extending iter_files_bytes to have a filtering
flag isn't required. We certainly never want it to implicitly apply
the write converters and, beyond usages inside tests, I can't see why
we've ever need to make the read filtering optional.

To repeat the core design assumption: working trees must always apply read
converters while other trees must never apply them. Tests need a back-door
though so, for that reason alone so far, I've extended some low-level
apis on working tree - get_file, get_file_byname and get_file_text - to
have a filtered=True parameter.

Does that reasoning make sense? If so, I agree that explicit tests need
to be added to workingtree_implementations for each of the apis extended
this way.

> What about testing how some of our core operations behave with content
> filtering?  I'd suggest diff, merge, revert...

I think that would be most valuable. Merge in particular scares me. Taking
keyword expansion as an example, I expect merge to work on the canonical
content. I wonder though about whether conflict marker lines need special
treatment or not. For example, those lines would need newline conversion
applied to get the right results for Windows users, yes?

I'll add these tests - they aren't present yet in the updated patch.

>> === modified file 'bzrlib/transform.py'
>> --- bzrlib/transform.py	2008-05-08 04:33:38 +0000
>> +++ bzrlib/transform.py	2008-05-13 01:31:50 +0000
>> @@ -33,6 +33,7 @@
>>                             ReusingTransform, NotVersionedError, CantMoveRoot,
>>                             ExistingLimbo, ImmortalLimbo, NoFinalPath,
>>                             UnableCreateSymlink)
>> +from bzrlib.filters import filtered_output_lines, ContentFilterContext
>>  from bzrlib.inventory import InventoryEntry
>>  from bzrlib.osutils import (file_kind, supports_executable, pathjoin, lexists,
>>                              delete_any, has_symlinks)
>> @@ -326,7 +327,12 @@
>>                  os.unlink(name)
>>                  raise
> 
>> -            f.writelines(contents)
>> +            # Is this the best way to get the relative path?
>> +            # (The +1 adjusts for the path separator.)
>> +            relpath = name[len(self._limbodir) + 1:]
> 
> ^^^ No.  You should assume the limboname is completely unrelated to the
> pathname.
> 
> You want FinalPaths.get_path.  But note that the final path is not
> required to be defined at this time.  There may be no path defined, or
> the path may change due to conflict resolution or caller perversity.
> 
> The true final path can only be known once conflict resolution is
> complete, but you probably don't want to use that name anyway, because
> e.g. conflict resolution could cause foo.txt to be renamed to
> foo.txt.moved, and not get content filtering applied.
> 
> I would imagine you really want to use the name that the file has in the
> source tree, which would mean doing the operation in the create_file
> caller instead.

Ah. That all makes sense. I've moved the application of write converters
up several layers accordingly.

>> === modified file 'bzrlib/tree.py'
>> --- bzrlib/tree.py	2008-05-08 04:33:38 +0000
>> +++ bzrlib/tree.py	2008-05-16 12:55:25 +0000
>> @@ -24,6 +24,7 @@
>>  import bzrlib
>>  from bzrlib import (
>>      delta,
>> +    filters,
>>      osutils,
>>      revision as _mod_revision,
>>      conflicts as _mod_conflicts,
>> @@ -435,7 +436,9 @@
>>      def print_file(self, file_id):
>>          """Print file with id `file_id` to stdout."""
>>          import sys
>> -        sys.stdout.write(self.get_file_text(file_id))
>> +        sys.stdout.writelines(filters.filtered_output_lines(
>> +            self.get_file_lines(file_id),
>> +            self._content_filter_stack(file_id=file_id)))
> 
> ^^^ Please just deprecate print_file instead.  It's a really bad method.

It's pretty gross, yes. I've backed out this change given my revised
view on how 'cat' needs to work. I'll deprecate (or fix) this method
but I'll do that separately to this patch if you don't mind.

> You need to update the interface of Tree.get_file, Tree.get_file_byname,
> etc to match what you've done in WorkingTree.
> 
> Similarly, the changes need to be applied to other trees like
> RevisionTree and DirStateRevisionTree.
> 
> I'm quite surprised to see you haven't implemented a filter flag for
> iter_files_bytes.  That's the one we'll be using more and more.

See comments above.

>> @@ -828,11 +855,7 @@
>>              if kind[0] != kind[1]:
>>                  changed_content = True
>>              elif from_kind == 'file':
>> -                from_size = self.source._file_size(from_entry, from_stat)
>> -                to_size = self.target._file_size(to_entry, to_stat)
>> -                if from_size != to_size:
>> -                    changed_content = True
>> -                elif (self.source.get_file_sha1(file_id, from_path, from_stat) !=
>> +                if (self.source.get_file_sha1(file_id, from_path, from_stat) !=
>>                      self.target.get_file_sha1(file_id, to_path, to_stat)):
>>                      changed_content = True
> 
> ^^^ This is a pessimization.  The original code did not need to
> calculate a SHA1 unless the file sizes were the same.

This is iter_changes code so I've very reluctant to do anything that would
slow it down. I do wonder though whether this "sizes are different check"
is buying that much. In the case of *any* file with content filtering
enabled, I currently calculate both the size and the sha in one pass. So
we gain nothing in terms of reduced processing. And if we're going to provide
"smart" eol conversion ala hg and git, then every file is potentially always
a candidate for content filtering.

> Also, shouldn't you be preventing .bzrignore parsing from going through
> content filters?

It's a good question. I asked poolie on irc re this and we can't see
a good reason to exclude it. OTOH, Windows users may have a fair expectation
for the lines in that file to be ended in crlf, so there is a case for
applying content filtering.

I've attached an updated patch which addresses many, but not all, of the
points you raised. I'd like your feedback on the open issues above please.
In the meantime, I'll work on the following:

* the rules-based preferences patch
* the high levels tests requested above for diff, merge, revert, etc.
* tracking canonical size, not just sha, in dirstate (with minimal
  impact on performance hopefully).

Thanks again for your time on this. This bit really is the core of
the eol work and small errors could have nasty consequences for
users, so I'm keen to ensure we get it as rock solid as we can first
time around.

Ian C.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: content-filters-2.patch
Type: text/x-diff
Size: 69139 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080527/34816bb2/attachment-0001.bin