[MERGE] Content filtering (EOL part 2 of 3)

Aaron Bentley aaron at aaronbentley.com
Fri May 23 03:18:53 BST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Ian,

Thanks for continuing this work.

bb:resubmit

Ian Clatworthy wrote:
> === modified file 'bzrlib/dirstate.py'
> --- bzrlib/dirstate.py	2008-05-08 04:12:06 +0000
> +++ bzrlib/dirstate.py	2008-05-13 01:31:50 +0000
> @@ -215,6 +215,7 @@
>      cache_utf8,
>      debug,
>      errors,
> +    filters,
>      inventory,
>      lock,
>      osutils,
> @@ -303,10 +304,14 @@
>      HEADER_FORMAT_2 = '#bazaar dirstate flat format 2\n'
>      HEADER_FORMAT_3 = '#bazaar dirstate flat format 3\n'
>  
> -    def __init__(self, path):
> +    def __init__(self, path, content_filter_stack_provider=None):
>          """Create a  DirState object.
>  
>          :param path: The path at which the dirstate file on disk should live.
> +        :param content_filter_stack_provider: a function that takes a
> +            path (relative to the top of the tree) and a file-id as
> +            parameters and returns a stack of ContentFilter's.

Plurals shouldn't have apostrophes.

> @@ -341,13 +346,15 @@
>          if 'hashcache' in debug.debug_flags:
>              self._sha1_file = self._sha1_file_and_mutter
>          else:
> -            self._sha1_file = osutils.sha_file_by_name
> +            self._sha1_file = filters.sha_file_by_name
>          # These two attributes provide a simple cache for lookups into the
>          # dirstate in-memory vectors. By probing respectively for the last
>          # block, and for the next entry, we save nearly 2 bisections per path
>          # during commit.
>          self._last_block_index = None
>          self._last_entry_index = None
> +        # Content filtering setup
> +        self._cfs_provider = content_filter_stack_provider

^^^ I think "_filter_provider" or "filter_stack" would be a much clearer
name.


> @@ -1492,7 +1499,13 @@
>          # process this entry.
>          link_or_sha1 = None
>          if minikind == 'f':
> -            link_or_sha1 = self._sha1_file(abspath)
> +            if self._cfs_provider is None:
> +                filter_list = []
> +            else:
> +                relpath = osutils.pathjoin(entry[0][0], entry[0][1])
> +                file_id=entry[0][2]

^^^ Need space around = operator.

> +                filter_list = self._cfs_provider(relpath, file_id)

Wouldn't just the basename work here?  Most filters are only going to
care about file extension.
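
i.e. (just a sketch, assuming the provider contract were relaxed
accordingly):

    filter_list = self._cfs_provider(osutils.basename(relpath), file_id)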

> +            link_or_sha1 = self._sha1_file(abspath, filter_list)
>              executable = self._is_executable(stat_value.st_mode,
>                                               saved_executable)
>              if self._cutoff_time is None:

Ideally, changing the filter setting could change the sha1.  This
doesn't permit that, because the stored_stat will still match the
packed_stat.  I'm okay with this for a first cut.

> @@ -1943,12 +1956,16 @@
>          return len(self._parents) - len(self._ghosts)
>  
>      @staticmethod
> -    def on_file(path):
> +    def on_file(path, content_filter_stack_provider=None):

^^^ I would be more comfortable if content_filter_stack_provider were a
mandatory parameter.  That would ensure we hit all the right callsites.

>          """Construct a DirState on the file at path path.
>  
> +        :param content_filter_stack_provider: a function that takes a
> +            path (relative to the top of the tree) and a file-id as
> +            parameters and returns a stack of ContentFilter's.

^^^ No apostrophes in plurals.

> +            If None, no content filtering is performed.
>          :return: An unlocked DirState object, associated with the given path.
>          """
> -        result = DirState(path)
> +        result = DirState(path, content_filter_stack_provider)
>          return result



>  
>      def _read_dirblocks_if_needed(self):

Don't you need to apply the same approach to file size that you've
applied to sha1s?

> === added directory 'bzrlib/filters'

^^^ This looks like it can be a module ('bzrlib/filters.py') instead of
a package.

> --- bzrlib/filters/__init__.py	1970-01-01 00:00:00 +0000
> +++ bzrlib/filters/__init__.py	2008-05-16 12:55:25 +0000
> +"""Working tree content filtering support.
> +
> +Filters have the following signatures::
> +
> +    read_filter(chunks) -> chunks
> +    write_filter(chunks, context) -> chunks
> +
> +where:
> +
> + * chunks is an iterator over a sequence of 8-bit utf-8 strings
> +
> + * context is an optional object (possibly None) providing filters access
> +   to interesting information, e.g. the relative path of the file.
> +
> +Note that context is currently only supported for write filters.
> +"""

Thank you for specifying content as a series of bytestrings.

I'm not sure it's reasonable to require an encoding.  That requires bzr
to know the source encoding of the file, transform it into utf-8, and
provide it.

If we expect that bzrlib may know the file encoding, perhaps we
should pass the encoding as a parameter to the read/write filter
methods, and then provide a filter base class that allows users to
implement read_utf8 / write_utf8 instead of read/write.
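
A rough sketch of what I mean (the class name and the encoding
parameter are both hypothetical -- nothing here is in the patch):

    class TextContentFilter(ContentFilter):
        """Base class for filters that operate on decoded text.

        Subclasses implement read_utf8/write_utf8; this class decodes
        the incoming bytestrings and re-encodes the results.
        """

        def __init__(self):
            ContentFilter.__init__(self, self._read, self._write)

        def _read(self, chunks, encoding='utf-8'):
            text = ''.join(chunks).decode(encoding)
            return [t.encode(encoding) for t in self.read_utf8([text])]

        def _write(self, chunks, context=None, encoding='utf-8'):
            text = ''.join(chunks).decode(encoding)
            return [t.encode(encoding)
                    for t in self.write_utf8([text], context)]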

Is context fixed?  i.e. is it always an instance of
ContentFilterContext?  If not, how does it vary?

> +import cStringIO
> +from bzrlib.lazy_import import lazy_import
> +lazy_import(globals(), """
> +from bzrlib import (
> +    errors,
> +    osutils,
> +    )
> +""")
> +
> +
> +class ContentFilter(object):
> +
> +    def __init__(self, reader, writer):
> +        """Create a filter that converts content while reading and
writing.
> +
> +        :param reader: function for converting external to internal
content
> +        :param writer: function for converting internal to external
content
> +        """
> +        self.reader = reader
> +        self.writer = writer
> +
> +
> +class ContentFilterContext(object):
> +    """Object providing information that filters can use.
> +
> +    In the future, this is likely to be expanded to include
> +    details like the Revision when this file was last updated.
> +    """
> +
> +    def __init__(self, relpath=None):
> +        """Create a context.
> +
> +        :param relpath: the relative path or None if this context doesn't
> +           support that information.
> +        """
> +        self._relpath = relpath
> +
> +    def relpath(self):
> +        """Relative path of file to tree-root."""
> +        if self._relpath is None:
> +            raise NotImplementedError(self.relpath)
> +        else:
> +            return self._relpath
> +
> +
> +def filtered_input_file(f, filters):
> +    """Get an input file that converts external to internal content.
> +
> +    :param f: the original input file
> +    :param filters: the stack of filters to apply
> +    :return: a file-like object
> +    """
> +    if filters:
> +        chunks = [f.read()]
> +        for filter in filters:
> +            if filter.reader is not None:
> +                chunks = filter.reader(chunks)
> +        return cStringIO.StringIO(''.join(chunks))
> +    else:
> +        return f

^^^ I can see how this might be necessary in some cases, but perhaps it
should be a -Devil method due to its memory characteristics.  I'd prefer
to see filtered_input_bytes as the main way of accessing this, with
filtered_input_file used to satisfy deprecated interfaces.
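
Roughly (filtered_input_bytes is a hypothetical name; I'm just
sketching the split):

    def filtered_input_bytes(chunks, filters):
        """Convert an iterable of external bytestrings to internal form."""
        for filter in filters:
            if filter.reader is not None:
                chunks = filter.reader(chunks)
        return chunks


    def filtered_input_file(f, filters):
        # Kept only to satisfy deprecated interfaces -- buffers the
        # whole file in memory.
        return cStringIO.StringIO(
            ''.join(filtered_input_bytes([f.read()], filters)))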

> +def filtered_output_lines(chunks, filters, context=None):
> +    """Convert output lines from internal to external format.

^^^ I think the name is wrong; there's no guarantee that these chunks
are lines.  We wouldn't want to guarantee that chunks are lines, because
a binary file might contain no \n bytes, and a line would therefore take
up the total size of the file.

> +def sha_file_by_name(name, filters):
> +    """Get sha of internal content given external content.
> +
> +    :param name: path to file
> +    :param filters: the stack of filters to apply
> +    """
> +    if filters:
> +        f = open(name, 'rb', 65000)
> +        return osutils.sha_strings(filtered_input_file(f, filters))

^^^ It looks like you're using filtered_input_file purely as an
iterator of lines.  That seems like a really memory-intensive way of
doing it.  Why not use IterableFile instead?
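
Untested, but something along these lines:

    from bzrlib.iterablefile import IterableFile

    f = open(name, 'rb', 65000)
    try:
        chunks = osutils.file_iterator(f)
        for filter in filters:
            if filter.reader is not None:
                chunks = filter.reader(chunks)
        return osutils.sha_file(IterableFile(chunks))
    finally:
        f.close()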


> +    else:
> +        return osutils.sha_file_by_name(name)
> +
> +
> +# The registry of filter stacks indexed by name.
> +# (This variable should be treated as private to this module as its
> +# implementation may well change in the future.)
> +_filter_stacks_registry = {}

^^^ Wouldn't it be more future-proof to use a registry.Registry?
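
i.e.:

    from bzrlib import registry

    _filter_stacks_registry = registry.Registry()

That buys you duplicate-registration checking and lazy registration
for free.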


> === modified file 'bzrlib/hashcache.py'
> --- bzrlib/hashcache.py	2008-04-24 07:22:53 +0000
> +++ bzrlib/hashcache.py	2008-05-13 01:31:50 +0000
> @@ -32,7 +32,8 @@
>  import os, stat, time
>  import sha
>  
> -from bzrlib.osutils import sha_file, pathjoin, safe_unicode
> +from bzrlib.filters import sha_file_by_name
> +from bzrlib.osutils import pathjoin, safe_unicode
>  from bzrlib.trace import mutter, warning
>  from bzrlib.atomicfile import AtomicFile
>  from bzrlib.errors import BzrError
> @@ -80,8 +81,15 @@
>      """
>      needs_write = False
>  
> -    def __init__(self, root, cache_file_name, mode=None):
> -        """Create a hash cache in base dir, and set the file mode to mode."""
> +    def __init__(self, root, cache_file_name, mode=None,
> +            content_filter_stack_provider=None):
> +        """Create a hash cache in base dir, and set the file mode to mode.
> +
> +        :param content_filter_stack_provider: a function that takes a
> +            path (relative to the top of the tree) and a file-id as
> +            parameters and returns a stack of ContentFilter's.

^^^ No apostrophes in plurals.

> === added file 'bzrlib/help_topics/en/content-filters.txt'
> --- bzrlib/help_topics/en/content-filters.txt	1970-01-01 00:00:00 +0000
> +++ bzrlib/help_topics/en/content-filters.txt	2008-05-16 14:51:56 +0000
> @@ -0,0 +1,87 @@
> +Content Filters
> +===============
> +
> +Content formats
> +---------------
> +
> +At times, it is useful to have files formatted in the working tree
> +differently to how they are stored when committed. For example,
> +a cross-platform project needs to select their preferred newline

Should "a project" be a thing, rather than a group of people, and have
"its preferred newline convention"?

I'd encourage using a slightly less formal tone.

Maybe something like this:

"Bazaar's format conversion allows you to store files in a different
format from the copy in your working tree.  This lets you (or your
co-developers) use Windows development tools that expect CRLF files on
projects that use other line-ending conventions, among other things."

> +Format conversion
> +-----------------
> +
> +The conversion between these formats is done by content filters.
> +A content filter has two parts:
> +
> +* a read converter - converts from convenience to canonical format
> +* a write converter - converts from canonical to convenience format.
> +
> +In many cases, these converters *round-trip* files

How about "Many of these converters will provide round-trip conversions"

> This is not strictly required though.

"However, others may provide an asymmetric conversion." ?

> +For example, a read
> +converter might strip trailing whitespace off lines in source code
> +while the matching write converter might pass content through unchanged.



> +Enabling content filters
> +------------------------
> +
> +Content filters are typically provided by plugins so the first step

Comma needed: "provided by plugins, so"


> +in using them is to install the relevant plugins and read their
> +documentation. 

Can I suggest not talking about "in some cases", and saying "some
plugins" instead?  It's more direct.

> In some cases, the plugin might be very specific

> +Read converters are only applied to commands that read content from
> +a working tree, e.g. status, diff and commit. By default, write
> +converters are only applied by the ``cat`` command

I'm not sure which representation cat should produce.  I would kinda
expect it to return the canonical representation, but I can see that
there are cases where the convenience representation is desirable.

>  and by commands
> +that **create** files in a working tree, e.g. branch, checkout, update.
> +In particular, note that ``bzr commit`` does not implicitly apply
> +write converters after comitting files. (If this makes sense for
> +a given plugin providing a content filter, the plugin can always
> +reigster a ``post_commit`` hook to achieve this effect.)

^^^ Well, really it would make sense to provide a precommit filter, and
not use this framework at all.

> +Note: In the future, it is likely that additional options will be
> +added to commands like ``export`` so that content filtering can
> +optionally be enabled for them. If you have a particular need like
> +this, please raise it in community formums (e.g. the mailing list or
> +bug tracker) so these enhancements can be correctly prioritized.
> +
> +
> +Refreshing your working tree
> +----------------------------


> + * To reapply write converters, ensure there are no local changes,
> +   delete the relevant files and run ``bzr update``.

^^^ You mean "revert", right?  Update is not going to resurrect dead files.

> +Note: In the future, it is likely that additional options will be added
> +to commands to make this refreshing process faster and safer.

Perhaps the simplest approach is to provide an "invalidate-hashcache"
command.
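
Something like this, perhaps (the command name and the helper it calls
are both made up):

    from bzrlib.commands import Command
    from bzrlib.workingtree import WorkingTree

    class cmd_invalidate_hashcache(Command):
        """Discard cached file hashes so they are recomputed."""

        def run(self):
            tree = WorkingTree.open_containing(u'.')[0]
            tree.lock_write()
            try:
                # hypothetical helper that forgets the cached sha1s
                tree._discard_hashcache()
            finally:
                tree.unlock()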

> +# sample filter stacks
> +_swapcase = lambda chunks, context=None: [s.swapcase() for s in chunks]
> +_addjunk = lambda chunks: ['junk\n'] + [s for s in chunks]
> +_deljunk = lambda chunks, context: [s for s in chunks[1:]]
> +_stack_1 = [
> +    ContentFilter(_swapcase, _swapcase),
> +    ]
> +_stack_2 = [
> +    ContentFilter(_swapcase, _swapcase),
> +    ContentFilter(_addjunk, _deljunk),
> +    ]
> +
> +# sample data
> +_sample_external = ['Hello\n', 'World\n']
> +_internal_1 = ['hELLO\n', 'wORLD\n']
> +_internal_2 = ['junk\n', 'hELLO\n', 'wORLD\n']
> +
> +
> +class TestContentFilterContext(TestCase):
> +
> +    def test_empty_filter_context(self):
> +        ctx = ContentFilterContext()
> +        self.assertRaises(NotImplementedError, ctx.relpath)

^^^ Wouldn't it be more usual to set relpath to None instead of raising
NotImplementedError?  Or even raising AttributeError...


> +class TestFilterStackMaps(TestCase):
> +
> +    def _register_map(self, pref, stk1, stk2):
> +        register_filter_stack_map(pref, {'v1': stk1, 'v2': stk2})
> +
> +    def test_filter_stack_maps(self):
> +        # Save the current registry
> +        original_registry = filters._reset_registry()
> +        # Test registration
> +        a_stack = [ContentFilter('b', 'c')]
> +        z_stack = [ContentFilter('y', 'x'), ContentFilter('w', 'v')]
> +        self._register_map('foo', a_stack, z_stack)
> +        self.assertEqual(['foo'], _get_registered_names())
> +        self._register_map('bar', z_stack, a_stack)
> +        self.assertEqual(['foo', 'bar'], _get_registered_names())
> +        # Test re-registration raises an error
> +        self.assertRaises(errors.BzrError, self._register_map, 'foo', [], [])
> +        # Restore the real registry
> +        filters._reset_registry(original_registry)

^^^ This should be in a try/finally, or using addCleanup
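
i.e.:

    original_registry = filters._reset_registry()
    def restore_registry():
        filters._reset_registry(original_registry)
    self.addCleanup(restore_registry)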

> === added file 'bzrlib/tests/workingtree_implementations/test_content_filters.py'
> --- bzrlib/tests/workingtree_implementations/test_content_filters.py	1970-01-01 00:00:00 +0000
> +++ bzrlib/tests/workingtree_implementations/test_content_filters.py	2008-05-16 15:18:32 +0000
> +def _swapcase(chunks, context=None):
> +    """A filter that swaps the case of text."""
> +    result = []
> +    for chunk in chunks:
> +        result.append(chunk.swapcase())
> +    return iter(result)
> +
> +
> +def _uppercase(chunks, context=None):
> +    """A filter that converts text to uppercase."""
> +    result = []
> +    for chunk in chunks:
> +        result.append(chunk.upper())
> +    return iter(result)
> +
> +def _lowercase(chunks, context=None):
> +    """A filter that converts text to lowercase."""
> +    result = []
> +    for chunk in chunks:
> +        result.append(chunk.lower())
> +    return iter(result)

^^^ Save the electrons!  Reuse code!
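
e.g.:

    def _transform_filter(transform):
        """Return a filter that applies transform to each chunk."""
        def _filter(chunks, context=None):
            return iter([transform(chunk) for chunk in chunks])
        return _filter

    _swapcase = _transform_filter(lambda s: s.swapcase())
    _uppercase = _transform_filter(lambda s: s.upper())
    _lowercase = _transform_filter(lambda s: s.lower())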

> +class TestWorkingTreeWithContentFilters(TestCaseWithWorkingTree):
> +
> +    def create_cf_tree(self, txt_read_filter, txt_write_filter):
> +        t = self.make_branch_and_tree('t')

           ^^^ Please use a somewhat longer name, like 'tree'.

> +        def _content_filter_stack(path=None, file_id=None):
> +            if path.endswith('.txt'):
> +                return [ContentFilter(txt_read_filter, txt_write_filter)]
> +            else:
> +                return []
> +        t._content_filter_stack = _content_filter_stack
> +        t.add(['file1.txt'], ids=['file1-id'], kinds=['file'])
> +        t.put_file_bytes_non_atomic('file1-id', 'Foo Txt')
> +        t.add(['file2.bin'], ids=['file2-id'], kinds=['file'])
> +        t.put_file_bytes_non_atomic('file2-id', 'Foo Bin')

^^^ The conventional way of building a tree is self.build_tree_contents()
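
i.e. roughly:

    self.build_tree_contents([('t/file1.txt', 'Foo Txt'),
                              ('t/file2.bin', 'Foo Bin')])
    t.add(['file1.txt', 'file2.bin'], ['file1-id', 'file2-id'])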

> +    def test_symmetric_content_filtering(self):
> +        # test handling when read then write gives back the initial content
> +        t = self.create_cf_tree(txt_read_filter=_swapcase,
> +            txt_write_filter=_swapcase)
> +        # Check that the basis tree has the transformed content
> +        basis = t.basis_tree()
> +        basis.lock_read()
> +        self.addCleanup(basis.unlock)
> +        self.assertEqual('fOO tXT', basis.get_file('file1-id').read())
> +        self.assertEqual('Foo Bin', basis.get_file('file2-id').read())
> +        # Check that the working tree has the original content
> +        t.lock_read()
> +        self.addCleanup(t.unlock)
> +        self.assertEqual('Foo Txt', t.get_file('file1-id',
> +            filtered=False).read())
> +        self.assertEqual('Foo Bin', t.get_file('file2-id',
> +            filtered=False).read())
> +
> +    def test_readonly_content_filtering(self):
> +        # test handling with a read filter but no write filter
> +        t = self.create_cf_tree(txt_read_filter=_uppercase,
> +            txt_write_filter=None)
> +        # Check that the basis tree has the transformed content
> +        basis = t.basis_tree()
> +        basis.lock_read()
> +        self.addCleanup(basis.unlock)
> +        self.assertEqual('FOO TXT', basis.get_file('file1-id').read())
> +        self.assertEqual('Foo Bin', basis.get_file('file2-id').read())
> +        # We expect the workingtree content to be unchanged (for now at least)
> +        t.lock_read()
> +        self.addCleanup(t.unlock)
> +        self.assertEqual('Foo Txt', t.get_file('file1-id',
> +            filtered=False).read())
> +        self.assertEqual('Foo Bin', t.get_file('file2-id',
> +            filtered=False).read())

^^^ There should be tests of iter_files_bytes.  It's our key method
going forward.
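
An untested sketch of the kind of test I mean (this assumes working
tree reads are meant to apply the read converters):

    def test_iter_files_bytes_applies_read_filters(self):
        tree = self.create_cf_tree(txt_read_filter=_uppercase,
            txt_write_filter=None)
        tree.lock_read()
        self.addCleanup(tree.unlock)
        contents = dict((key, ''.join(chunks)) for key, chunks in
            tree.iter_files_bytes([('file1-id', 'txt'),
                                   ('file2-id', 'bin')]))
        self.assertEqual('FOO TXT', contents['txt'])
        self.assertEqual('Foo Bin', contents['bin'])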

And shouldn't there be tests for other tree types?

What about testing how some of our core operations behave with content
filtering?  I'd suggest diff, merge, revert...

> === modified file 'bzrlib/transform.py'
> --- bzrlib/transform.py	2008-05-08 04:33:38 +0000
> +++ bzrlib/transform.py	2008-05-13 01:31:50 +0000
> @@ -33,6 +33,7 @@
>                             ReusingTransform, NotVersionedError, CantMoveRoot,
>                             ExistingLimbo, ImmortalLimbo, NoFinalPath,
>                             UnableCreateSymlink)
> +from bzrlib.filters import filtered_output_lines, ContentFilterContext
>  from bzrlib.inventory import InventoryEntry
>  from bzrlib.osutils import (file_kind, supports_executable, pathjoin, lexists,
>                              delete_any, has_symlinks)
> @@ -326,7 +327,12 @@
>                  os.unlink(name)
>                  raise
>  
> -            f.writelines(contents)
> +            # Is this the best way to get the relative path?
> +            # (The +1 adjusts for the path separator.)
> +            relpath = name[len(self._limbodir) + 1:]

^^^ No.  You should assume the limboname is completely unrelated to the
pathname.

You want FinalPaths.get_path.  But note that the final path is not
required to be defined at this time.  There may be no path defined, or
the path may change due to conflict resolution or caller perversity.

The true final path can only be known once conflict resolution is
complete, but you probably don't want to use that name anyway, because
e.g. conflict resolution could cause foo.txt to be renamed to
foo.txt.moved, and not get content filtering applied.

I would imagine you really want to use the name that the file has in the
source tree, which would mean doing the operation in the create_file
caller instead.

> === modified file 'bzrlib/tree.py'
> --- bzrlib/tree.py	2008-05-08 04:33:38 +0000
> +++ bzrlib/tree.py	2008-05-16 12:55:25 +0000
> @@ -24,6 +24,7 @@
>  import bzrlib
>  from bzrlib import (
>      delta,
> +    filters,
>      osutils,
>      revision as _mod_revision,
>      conflicts as _mod_conflicts,
> @@ -435,7 +436,9 @@
>      def print_file(self, file_id):
>          """Print file with id `file_id` to stdout."""
>          import sys
> -        sys.stdout.write(self.get_file_text(file_id))
> +        sys.stdout.writelines(filters.filtered_output_lines(
> +            self.get_file_lines(file_id),
> +            self._content_filter_stack(file_id=file_id)))

^^^ Please just deprecate print_file instead.  It's a really bad method.
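
i.e. something like:

    from bzrlib import symbol_versioning

    @symbol_versioning.deprecated_method(symbol_versioning.one_five)
    def print_file(self, file_id):
        """Print file with id `file_id` to stdout."""
        import sys
        sys.stdout.write(self.get_file_text(file_id))

(substituting whichever deprecation marker is current).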

> +    def _content_filter_stack(self, path=None, file_id=None):
> +        """The stack of content filters for a path.
> +        
> +        Readers will be applied in first-to-last order.
> +        Writers will be applied in last-to-first order.
> +        Either the path or the file-id needs to be provided.
> +
> +        :param path: path relative to the root of the tree
> +            or None if unknown
> +        :param file_id: file_id or None if unknown
> +        :return: the list of filters - [] if there are none
> +        """
> +        filter_pref_names = filters._get_registered_names()
> +        if len(filter_pref_names) == 0:
> +            return []
> +        if path is None:
> +            path = self.id2path(file_id)
> +        prefs = None
> +        # TODO: Replace the line above with the line below
> +        # (when the rule-based preferences branch is merged)
> +        #prefs = rules.iter_search_rules(self.branch, [path],
> +        #    filter_pref_names).next()
> +        return filters._get_filter_stack_for(prefs)

You need to update the interface of Tree.get_file, Tree.get_file_byname,
etc. to match what you've done in WorkingTree.

Similarly, the changes need to be applied to other trees like
RevisionTree and DirStateRevisionTree.

I'm quite surprised to see you haven't implemented a filter flag for
iter_files_bytes.  That's the one we'll be using more and more.

> +
>  
>  class EmptyTree(Tree):
>  
> @@ -828,11 +855,7 @@
>              if kind[0] != kind[1]:
>                  changed_content = True
>              elif from_kind == 'file':
> -                from_size = self.source._file_size(from_entry, from_stat)
> -                to_size = self.target._file_size(to_entry, to_stat)
> -                if from_size != to_size:
> -                    changed_content = True
> -                elif (self.source.get_file_sha1(file_id, from_path, from_stat) !=
> +                if (self.source.get_file_sha1(file_id, from_path, from_stat) !=
>                      self.target.get_file_sha1(file_id, to_path, to_stat)):
>                      changed_content = True

^^^ This is a pessimization.  The original code did not need to
calculate a SHA1 unless the file sizes were the same.
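
A rough, untested sketch -- keep the shortcut whenever no filter stack
is in play:

    filtered = bool(self.source._content_filter_stack(from_path) or
                    self.target._content_filter_stack(to_path))
    if not filtered:
        from_size = self.source._file_size(from_entry, from_stat)
        to_size = self.target._file_size(to_entry, to_stat)
    if not filtered and from_size != to_size:
        changed_content = True
    elif (self.source.get_file_sha1(file_id, from_path, from_stat) !=
            self.target.get_file_sha1(file_id, to_path, to_stat)):
        changed_content = True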

Also, shouldn't you be preventing .bzrignore parsing from going through
content filters?

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFINimL0F+nu1YWqI0RAknvAJ9fVcjKjhEy085rBVDRHLv4WUZaNACeOPjj
EEyEbfj9IU9hDkwEk3RVgAY=
=9Wdf
-----END PGP SIGNATURE-----


