[MERGE] Content filtering (EOL part 2 of 3)

Wed Jul 9 02:49:17 BST 2008

Martin Pool wrote:

> === modified file 'bzrlib/dirstate.py'
> --- bzrlib/dirstate.py	2008-05-08 04:12:06 +0000
> +++ bzrlib/dirstate.py	2008-05-27 04:42:36 +0000
> 
> @@ -1943,12 +1956,16 @@
>          return len(self._parents) - len(self._ghosts)
> 
>      @staticmethod
> -    def on_file(path):
> +    def on_file(path, content_filter_stack_provider=None):
>          """Construct a DirState on the file at path path.
> 
> +        :param content_filter_stack_provider: a function that takes a
> +            path (relative to the top of the tree) and a file-id as
> +            parameters and returns a stack of ContentFilters.
> +            If None, no content filtering is performed.
>          :return: An unlocked DirState object, associated with the given path.
>          """
> -        result = DirState(path)
> +        result = DirState(path, content_filter_stack_provider)
>          return result
> 
>      def _read_dirblocks_if_needed(self):
> 
> You're changing this api, and it doesn't seem to add much value beyond the
> constructor, so maybe we should just delete it?

It's a public API so it would need to go through the usual deprecation
process rather than just be deleted. We can do that but I'd prefer to
keep it out of scope of this change myself.

> === added directory 'bzrlib/filters'
> === added file 'bzrlib/filters/__init__.py'
> --- bzrlib/filters/__init__.py	1970-01-01 00:00:00 +0000
> +++ bzrlib/filters/__init__.py	2008-05-27 03:13:04 +0000
> @@ -0,0 +1,212 @@
> 
> +"""Working tree content filtering support.
> +
> +A filter consists of a read converter, write converter pair.
> +Converters have the following signatures::
> +
> +    read_converter(chunks) -> chunks
> +    write_converter(chunks, context) -> chunks
> +
> +where:
> +
> + * chunks is an iterator over a sequence of byte strings
> +
> + * context is an optional ContentFilterContent object (possibly None)
> +   providing converters access to interesting information, e.g. the
> +   relative path of the file.
> +
> +Note that context is currently only supported for write converters.
> +"""
> 
> In the documentation you use the names 'canonical' and 'convenient' forms,
> which I think are pretty good names, so I suggest you use them here too.

Done.

> +class ContentFilter(object):
> +
> +    def __init__(self, reader, writer):
> +        """Create a filter that converts content while reading and writing.
> +
> +        :param reader: function for converting external to internal content
> +        :param writer: function for converting internal to external content
> +        """
> +        self.reader = reader
> +        self.writer = writer
> 
> and here.

Done.

> +class ContentFilterContext(object):
> +    """Object providing information that filters can use.
> +
> +    In the future, this is likely to be expanded to include
> +    details like the Revision when this file was last updated.
> +    """
> +
> +    def __init__(self, relpath=None):
> +        """Create a context.
> +
> +        :param relpath: the relative path or None if this context doesn't
> +           support that information.
> +        """
> +        self._relpath = relpath
> +
> +    def relpath(self):
> +        """Relative path of file to tree-root."""
> +        return self._relpath
> 
> This seems like the kind of place where we want to be careful not to
> commit to an api that requires allocating a complex object for every
> working tree file.  Would it be enough to just pass a dict to the filters?
> 
> Passing a real object does make it a bit easier to make some parts lazy
> in future, but on the other hand for things we think will be expensive we
> could have the dict contain callables.

I don't feel strongly about an object vs a dict. Being able to make things
lazily evaluated is important though and it was the reason I selected the
object.

> Similarly it seems like it would be nice for the api to allow the same
> filter stacks to be reused for multiple files.  (This should work at
> present.)

Precisely. I think that is essential for performance.

> Looking at the way these are called, it seems like they would be better as
> a FilterStack (or just Filter composite object?) with methods, rather than
> global functions that are passed a list of objects that do the actual
> work...

Perhaps. We'd need to cleanly handle the "empty stack" case.

> +# The registry of filter stacks indexed by name.
> +_filter_stacks_registry = registry.Registry()
> 
> This sounds like it's going from a name (like 'eol') to a filter class,
> but in fact it seems to go to a dict?  Either way you should document it.

Done.

> I think going to a callable is better because there may be some filters
> that can be parameterized in arbitrary different ways not just by a small
> set of values.

You're probably right but I was treating this as YAGNI. I hope there's
enough abstraction in the API for this to be supported via refactoring later.

> +# Cache of preferences -> stack
> +_stack_cache = {}
> 
> I would add at least a todo here that the cache should be inside another
> object (a working tree?) rather than in global scope.

Done.

> +def register_filter_stack_map(name, stack_map):
> +    """Register the filter stacks to use for various preference values.
> +
> +    :param name: the preference/filter-stack name
> +    :param stack_map: a dictionary where
> +      the keys are preference values to match and
> +      the values are the matching stack of filters for each
> +    """
> +    if name in _filter_stacks_registry:
> +        raise errors.BzrError(
> +            "filter stack for %s already installed" % name)
> +    _filter_stacks_registry.register(name, stack_map)
> 
> Why not just have the caller do registry.register directly?
> 
> +
> +
> +def _get_registered_names():
> +    """Get the list of names with filters registered."""
> +    # Note: We may want to intelligently order these later.
> +    # If so, the register_ fn will need to support an optional priority.
> +    return _filter_stacks_registry.keys()
> 
> Likewise.

The first API is very public - it needs to be called by each and
every plugin registering a content-filter. I felt that providing a
layer of abstraction there was worth it. The second API was a natural
consequence of the first decision.

> +
> +def _get_filter_stack_for(preferences):
> +    """Get the filter stack given a sequence of preferences.
> +
> +    :param preferences: a sequence of (name,value) tuples where
> +      name is the preference name and
> +      value is the key into the filter stack map regsitered
> +      for that preference.
> +    """
> 
> spelled 'registered'

Fixed.

> +    if preferences is None:
> +        return []
> +    stack = _stack_cache.get(preferences)
> +    if stack is not None:
> +        return stack
> +    stack = []
> +    for k, v in preferences:
> +        try:
> +            stacks_by_values = _filter_stacks_registry.get(k)
> +        except KeyError:
> +            continue
> +        items = stacks_by_values.get(v)
> +        if items:
> +            stack.extend(items)
> +    _stack_cache[preferences] = stack
> +    return stack
> 
> This seems odd that you'd just do nothing if the filter value is not found
> there..

I fully expect some preferences to not have associated filters, e.g. we
might want a preference to indicate large binary files to speed their
internal handling. We don't want to pre-mask those out IMO. I've added
a short comment inside the except clause to make this a little clearer.

> +def _reset_registry(value=None):
> +    """Reset the filter stack registry.
> +
> +    This function is provided to aid testing. The expected usage is::
> +
> +      old = _reset_registry()
> +      # run tests
> +      _reset_registry(old)
> +
> +    :param value: the value to set the registry to or None for an empty one.
> +    :return: the existing value before it reset.
> +    """
> +    global _filter_stacks_registry
> +    original = _filter_stacks_registry
> +    if value is None:
> +        _filter_stacks_registry = registry.Registry()
> +    else:
> +        _filter_stacks_registry = value
> +    _stack_cache.clear()
> +    return original
> 
> This function smells a bit funny because it doesn't do anything much
> specific to filters.  The simplest thing is probably to just have the
> tests assign a new Registry() to the value.
> 
> (Putting new objects into global variable is kind of risky because you
> don't know if someone else is holding a reference but in this case it
> seems reasonable.)

I agree it smells. I added it to abstract the tests from the internal
implementation of how things were registered. It can go if you like.

> === modified file 'bzrlib/hashcache.py'
> @@ -199,9 +212,9 @@
>              self._cache[path] = (digest, file_fp)
>          return digest
> 
> -    def _really_sha1_file(self, abspath):
> +    def _really_sha1_file(self, abspath, filters):
>          """Calculate the SHA1 of a file by reading the full text"""
> -        return sha_file(file(abspath, 'rb', buffering=65000))
> +        return internal_size_sha_file_byname(abspath, filters)[1]
> 
> It seems like we ought to document somewhere (like in this class's
> docstring or maybe the developer documents?) that it holds the
> convenience-form sha1?

Done.

>      def write(self):
>          """Write contents of cache to file."""
> 
> === added file 'bzrlib/help_topics/en/content-filters.txt'
> --- bzrlib/help_topics/en/content-filters.txt	1970-01-01 00:00:00 +0000
> +++ bzrlib/help_topics/en/content-filters.txt	2008-05-26 06:36:40 +0000
> @@ -0,0 +1,99 @@
> +Content Filters
> +===============
> +
> +Content formats
> +---------------
> +
> +Bazaar's format conversion allows you to store files in a different
> +format from the copy in your working tree.  This lets you, or your
> +co-developers, use Windows development tools that expect CRLF files
> +on projects that use other line-ending conventions. Among other things,
> +content filters also let Linux developers more easily work on projects
> +using Windows line-ending conventions, keyword expansion/compression,
> +and trailing spaces on lines in text files to be implicitly stripped
> +when committed.
> 
> I would say 'content filtering' rather than 'format' to avoid confusion
> with component formats, throughout this file.

Done.

> This is a really nice document.

Thanks.

> I read the tests and have no particular comments beyond the above at the
> moment.

Updated patch and incremental diff attached.

Ian C.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: content-filters-3.patch
Type: text/x-diff
Size: 70420 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080709/54726659/attachment-0002.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: content-filter.diff
Type: text/x-diff
Size: 2649 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080709/54726659/attachment-0003.bin