[MERGE] Content filtering (EOL part 2 of 3)

Tue Jul 1 07:37:22 BST 2008

bb: comment

Ian said he's only done some of the items Aaron pointed out so this might
not be ready to merge yet, but since it's still in the queue I thought I'd
read through it now.

=== modified file 'bzrlib/dirstate.py'

--- bzrlib/dirstate.py	2008-05-08 04:12:06 +0000
+++ bzrlib/dirstate.py	2008-05-27 04:42:36 +0000

@@ -1943,12 +1956,16 @@
         return len(self._parents) - len(self._ghosts)

     @staticmethod
-    def on_file(path):
+    def on_file(path, content_filter_stack_provider=None):
         """Construct a DirState on the file at path path.

+        :param content_filter_stack_provider: a function that takes a
+            path (relative to the top of the tree) and a file-id as
+            parameters and returns a stack of ContentFilters.
+            If None, no content filtering is performed.
         :return: An unlocked DirState object, associated with the given path.
         """
-        result = DirState(path)
+        result = DirState(path, content_filter_stack_provider)
         return result

     def _read_dirblocks_if_needed(self):

You're changing this api, and it doesn't seem to add much value beyond the
constructor, so maybe we should just delete it?

=== added directory 'bzrlib/filters'
=== added file 'bzrlib/filters/__init__.py'
--- bzrlib/filters/__init__.py	1970-01-01 00:00:00 +0000
+++ bzrlib/filters/__init__.py	2008-05-27 03:13:04 +0000
@@ -0,0 +1,212 @@

+"""Working tree content filtering support.
+
+A filter consists of a read converter, write converter pair.
+Converters have the following signatures::
+
+    read_converter(chunks) -> chunks
+    write_converter(chunks, context) -> chunks
+
+where:
+
+ * chunks is an iterator over a sequence of byte strings
+
+ * context is an optional ContentFilterContent object (possibly None)
+   providing converters access to interesting information, e.g. the
+   relative path of the file.
+
+Note that context is currently only supported for write converters.
+"""

In the documentation you use the names 'canonical' and 'convenient' forms,
which I think are pretty good names, so I suggest you use them here too.

+class ContentFilter(object):
+
+    def __init__(self, reader, writer):
+        """Create a filter that converts content while reading and writing.
+
+        :param reader: function for converting external to internal content
+        :param writer: function for converting internal to external content
+        """
+        self.reader = reader
+        self.writer = writer

and here.


+
+
+class ContentFilterContext(object):
+    """Object providing information that filters can use.
+
+    In the future, this is likely to be expanded to include
+    details like the Revision when this file was last updated.
+    """
+
+    def __init__(self, relpath=None):
+        """Create a context.
+
+        :param relpath: the relative path or None if this context doesn't
+           support that information.
+        """
+        self._relpath = relpath
+
+    def relpath(self):
+        """Relative path of file to tree-root."""
+        return self._relpath

This seems like the kind of place where we want to be careful not to
commit to an api that requires allocating a complex object for every
working tree file.  Would it be enough to just pass a dict to the filters?

Passing a real object does make it a bit easier to make some parts lazy
in future, but on the other hand for things we think will be expensive we
could have the dict contain callables.

Similarly it seems like it would be nice for the api to allow the same
filter stacks to be reused for multiple files.  (This should work at
present.)

+def filtered_input_file(f, filters):
+    """Get an input file that converts external to internal content.
+
+    :param f: the original input file
+    :param filters: the stack of filters to apply
+    :return: a file-like object
+    """
+    if filters:
+        chunks = [f.read()]
+        for filter in filters:
+            if filter.reader is not None:
+                chunks = filter.reader(chunks)
+        return cStringIO.StringIO(''.join(chunks))
+    else:
+        return f

As Aaron points out, we might want to check whether any callers really
need a file-like interface.

+
+
+def filtered_output_bytes(chunks, filters, context=None):
+    """Convert byte chunks from internal to external format.
+
+    :param chunks: an iterator containing the original content
+    :param filters: the stack of filters to apply
+    :param context: a ContentFilterContext object passed to
+        each filter
+    :return: an iterator containing the content to output
+    """
+    if filters:
+        for filter in reversed(filters):
+            if filter.writer is not None:
+                chunks = filter.writer(chunks, context)
+    return chunks
+
+
+def internal_size_sha_file_byname(name, filters):
+    """Get size and sha of internal content given external content.
+
+    :param name: path to file
+    :param filters: the stack of filters to apply
+    """
+    f = open(name, 'rb', 65000)
+    try:
+        if filters:
+            f = filtered_input_file(f, filters)
+        size = 0
+        s = sha.new()
+        BUFSIZE = 128<<10
+        while True:
+            b = f.read(BUFSIZE)
+            if not b:
+                break
+            size += len(b)
+            s.update(b)
+        return size, s.hexdigest()
+    finally:
+        f.close()
+
+

Looking at the way these are called, it seems like they would be better as
a FilterStack (or just Filter composite object?) with methods, rather than
global functions that are passed a list of objects that do the actual
work...

+# The registry of filter stacks indexed by name.
+_filter_stacks_registry = registry.Registry()

This sounds like it's going from a name (like 'eol') to a filter class,
but in fact it seems to go to a dict?  Either way you should document it.
I think going to a callable is better because there may be some filters
that can be parameterized in arbitrary different ways not just by a small
set of values.

+
+
+# Cache of preferences -> stack
+_stack_cache = {}

I would add at least a todo here that the cache should be inside another
object (a working tree?) rather than in global scope.

+
+
+def register_filter_stack_map(name, stack_map):
+    """Register the filter stacks to use for various preference values.
+
+    :param name: the preference/filter-stack name
+    :param stack_map: a dictionary where
+      the keys are preference values to match and
+      the values are the matching stack of filters for each
+    """
+    if name in _filter_stacks_registry:
+        raise errors.BzrError(
+            "filter stack for %s already installed" % name)
+    _filter_stacks_registry.register(name, stack_map)

Why not just have the caller do registry.register directly?

+
+
+def _get_registered_names():
+    """Get the list of names with filters registered."""
+    # Note: We may want to intelligently order these later.
+    # If so, the register_ fn will need to support an optional priority.
+    return _filter_stacks_registry.keys()

Likewise.

+
+def _get_filter_stack_for(preferences):
+    """Get the filter stack given a sequence of preferences.
+
+    :param preferences: a sequence of (name,value) tuples where
+      name is the preference name and
+      value is the key into the filter stack map regsitered
+      for that preference.
+    """

spelled 'registered'

+    if preferences is None:
+        return []
+    stack = _stack_cache.get(preferences)
+    if stack is not None:
+        return stack
+    stack = []
+    for k, v in preferences:
+        try:
+            stacks_by_values = _filter_stacks_registry.get(k)
+        except KeyError:
+            continue
+        items = stacks_by_values.get(v)
+        if items:
+            stack.extend(items)
+    _stack_cache[preferences] = stack
+    return stack

This seems odd that you'd just do nothing if the filter value is not found
there..


+
+
+def _reset_registry(value=None):
+    """Reset the filter stack registry.
+
+    This function is provided to aid testing. The expected usage is::
+
+      old = _reset_registry()
+      # run tests
+      _reset_registry(old)
+
+    :param value: the value to set the registry to or None for an empty one.
+    :return: the existing value before it reset.
+    """
+    global _filter_stacks_registry
+    original = _filter_stacks_registry
+    if value is None:
+        _filter_stacks_registry = registry.Registry()
+    else:
+        _filter_stacks_registry = value
+    _stack_cache.clear()
+    return original

This function smells a bit funny because it doesn't do anything much
specific to filters.  The simplest thing is probably to just have the
tests assign a new Registry() to the value.

(Putting new objects into global variable is kind of risky because you
don't know if someone else is holding a reference but in this case it
seems reasonable.)

=== modified file 'bzrlib/hashcache.py'
@@ -199,9 +212,9 @@
             self._cache[path] = (digest, file_fp)
         return digest

-    def _really_sha1_file(self, abspath):
+    def _really_sha1_file(self, abspath, filters):
         """Calculate the SHA1 of a file by reading the full text"""
-        return sha_file(file(abspath, 'rb', buffering=65000))
+        return internal_size_sha_file_byname(abspath, filters)[1]

It seems like we ought to document somewhere (like in this class's
docstring or maybe the developer documents?) that it holds the
convenience-form sha1?


     def write(self):
         """Write contents of cache to file."""

=== added file 'bzrlib/help_topics/en/content-filters.txt'
--- bzrlib/help_topics/en/content-filters.txt	1970-01-01 00:00:00 +0000
+++ bzrlib/help_topics/en/content-filters.txt	2008-05-26 06:36:40 +0000
@@ -0,0 +1,99 @@
+Content Filters
+===============
+
+Content formats
+---------------
+
+Bazaar's format conversion allows you to store files in a different
+format from the copy in your working tree.  This lets you, or your
+co-developers, use Windows development tools that expect CRLF files
+on projects that use other line-ending conventions. Among other things,
+content filters also let Linux developers more easily work on projects
+using Windows line-ending conventions, keyword expansion/compression,
+and trailing spaces on lines in text files to be implicitly stripped
+when committed.

I would say 'content filtering' rather than 'format' to avoid confusion
with component formats, throughout this file.

This is a really nice document.

I read the tests and have no particular comments beyond the above at the
moment.

-- 
Martin