[MERGE/RFC] Content filters - take 6

Martin Pool mbp at sourcefrog.net
Fri Feb 20 07:22:54 GMT 2009


bb: approve

> Attached is a patch implementing content filtering
> if and only if trees say they support it. While I'm
> reasonably happy with it and it may be sufficient for
> others to start experimenting with it (when used in
> combination with the 1.12-previews formats posted
> separately), I'm not 100% convinced that the code is
> now correct, given changes in bzr.dev since I wrote
> the core code 6 months back.
>
> In particular, I'd like some feedback from the
> dirstate gurus about these changes. It's already
> taken me a few days :-( to get the code working again

This patch has unfortunately been awaiting review for ages now.
Let's try to find a way to either land it, or identify a specific
step forward that we can take.

It has been reviewed in reasonable depth a couple of times before, I
believe.  But it's also a bit risky because, as you say, it touches
something that has a fairly heavily tuned implementation in both C
and Python, and that code is active whether you're using the feature
or not.

There's also been a long thread leading up to this and I'm
intentionally not going to re-read it all today, but rather just
read the patch itself.  So if I miss something, please excuse me.

> but I'm afraid it may break when C extensions are
> used for example? I also fear that the new _observed_sha1
> API implies changes I'm yet to grok?

We should, generally, have tests that run against both the Python
and C implementations of such an interface.  Of course having the
tests is no guarantee, but I'd be looking in the first place to add
more tests there.
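
For instance, something shaped like this -- a rough sketch, where the
module and function names are illustrative rather than the exact
bzrlib ones -- would run one set of assertions against whichever
implementations are available:

  from bzrlib import tests

  class _UpdateEntryTests:
      # assertions shared by the Python and C implementations

      def test_cached_sha1_reused_for_unchanged_file(self):
          # a real test would build a dirstate, call self.update_entry
          # twice on an unchanged file, and assert that the second call
          # reused the cached (filtered) hash instead of re-reading
          pass

  class TestUpdateEntryPy(tests.TestCaseWithTransport, _UpdateEntryTests):

      def setUp(self):
          tests.TestCaseWithTransport.setUp(self)
          from bzrlib.dirstate import update_entry
          self.update_entry = update_entry

  class TestUpdateEntryC(tests.TestCaseWithTransport, _UpdateEntryTests):

      def setUp(self):
          tests.TestCaseWithTransport.setUp(self)
          try:
              from bzrlib._dirstate_helpers_c import update_entry
          except ImportError:
              raise tests.TestSkipped('compiled dirstate helpers not built')
          self.update_entry = update_entry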

> My other big concern about landing this is the lack of
> tree/branch specific rules. Global rules (BZR_HOME/rules)
> *might* be ok for EOL support but it scares me in
> other places when this feature is applicable, e.g.
> keyword expansion. In the name of keeping metadata
> out of the tree and making life bearable for bzr-svn,
> we've agreed to not using a .bzrrules file. Perhaps
> we still ought to support .bzr/branch/rules say, or
> .bzr/checkout/rules, even if that implies/requires
> local editing (ala branch.conf)? If we can get agreement
> on a location, I can easily add the code back in to
> lookup non-global rules.

We do need that for this to ultimately be useful, but let's discuss
it separately.

>
> Anyway, it's nice to have this code back on the table
> even if it still needs a few more rounds in the ring.

# Bazaar merge directive format 2 (Bazaar 0.90)
# revision_id: ian.clatworthy at canonical.com-20081218065348-\
#   1jq86r29aj9kabvx
# target_branch: file:///home/ian/bzr/repo.packs/bzr.wt5/
# testament_sha1: 5179b315af50a99aad43716a1ba74535f9660fef
# timestamp: 2008-12-18 17:36:48 +1000
# base_revision_id: pqm at pqm.ubuntu.com-20081216015655-5wn3k66fkt8wv4i9
#
# Begin patch
=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py	2008-12-12 12:14:01 +0000
+++ bzrlib/builtins.py	2008-12-17 06:13:38 +0000
@@ -2184,12 +2184,13 @@
                help="Type of file to export to.",
                type=unicode),
         'revision',
+        Option('filters', help='Apply content filters.'),

Might be a bit clearer as "Apply content filters to export the
convenient form", or something else that gives a clue that the point
is to make the output the same as in the working tree.  Similarly for
cat.
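
Concretely, the option line could read (wording illustrative):

  Option('filters',
         help='Apply content filters to export the convenient form.'),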

=== modified file 'bzrlib/dirstate.py'
--- bzrlib/dirstate.py	2008-09-26 04:57:33 +0000
+++ bzrlib/dirstate.py	2008-12-17 06:13:38 +0000
@@ -215,6 +215,7 @@
     cache_utf8,
     debug,
     errors,
+    filters,
     inventory,
     lock,
     osutils,
@@ -320,10 +321,14 @@
     HEADER_FORMAT_2 = '#bazaar dirstate flat format 2\n'
     HEADER_FORMAT_3 = '#bazaar dirstate flat format 3\n'

-    def __init__(self, path):
+    def __init__(self, path, content_filter_stack_provider=None):
         """Create a  DirState object.

         :param path: The path at which the dirstate file on disk should live.
+        :param content_filter_stack_provider: a function that takes a
+            path (relative to the top of the tree) and a file-id as
+            parameters and returns a stack of ContentFilters.
+            If None, no content filtering is performed.
         """
         # _header_state and _dirblock_state represent the current state
         # of the dirstate metadata and the per-row data respectively.

I'd like to have it described in here somewhere just why this needs
to be hooked into the dirstate, rather than sitting on top of it as a
separate layer.  I know it's been discussed before, so maybe it's too
much to hope that we'll find a way to do this without modifying the
dirstate, though it'd be nice, both for layering and for reducing
risk.

Is it just that the dirstate needs to report the working copy size,
and that would not work together with filtering?

(Digression: in the next update to the wt code I think we should
just have it hold, for each file, the stat value from when that file
was known to be unchanged from the basis tree.  I suspect it's
uncommon for us to ever make use of file content hashes to do much
more than that, and also relatively uncommon for files to be
discovered to have that value other than when bzr writes the tree.
The main other case would be if we write a tree too quickly for it to
have a safe stat value, but then status runs later and can record
that they're unchanged.)

So these changes seem to mean that where the dirstate currently
stores the sha1, it will now hold the sha1 that results from running
the input filters over the working file.  That would be worth
documenting.

So there are a few cases to consider:

If there are no filters then it should be storing just the same hash
that it does at present; we'd want to check that it's not doing so
inefficiently.

If there are filters and they're symmetric then when we check the
working file we'll convert it back to the original form, and see
that it hasn't changed.

If the filters are asymmetrical, then I think we may get the
following situation: when the tree is constructed it'll be given the
canonical hashes, but if we re-hash a file we'll see the hash that
comes from input filtering on the working copy.  That might be
reasonable if we only saw them as changed when the user had
specifically touched those files, but the problem is that we'll
sometimes re-hash them because the original timestamp was not safe.

I'd like to look into how the hashes are set when the working tree
is first built.

So this could lead to the strange situation where if you get a
checkout with asymmetrical filters then immediately run status, some
of the files are shown as modified and some are not.  I'm not sure
what should happen there, but consistently showing them all as
modified is probably the best thing.

Asymmetric filters are not the common case but I hoped considering
those cases would shed light on how it should work.
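
To make the symmetric/asymmetric distinction concrete, here is a toy
illustration (the converter functions are hypothetical, not part of
the patch):

  # toy converter pair -- hypothetical, for illustration only
  def crlf_to_lf(chunks, context=None):
      # read converter: convenient -> canonical
      return [chunk.replace('\r\n', '\n') for chunk in chunks]

  def lf_to_crlf(chunks, context=None):
      # write converter: canonical -> convenient
      return [chunk.replace('\n', '\r\n') for chunk in chunks]

  canonical = ['a\nb\n']
  # symmetric: reading back the working copy recovers the canonical
  # bytes, so a re-hash agrees with the stored hash
  assert crlf_to_lf(lf_to_crlf(canonical)) == canonical

  def expand_id(chunks, context=None):
      # asymmetric: a write converter with no inverse read converter
      return [chunk.replace('$Id$', '$Id: 42 $') for chunk in chunks]

  # once the keyword is expanded in the working file, re-hashing it
  # cannot reproduce the canonical hash recorded at checkout time
  assert expand_id(['$Id$\n']) != ['$Id$\n']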

@@ -54,8 +59,17 @@
                 item.mode = 0755
             else:
                 item.mode = 0644
-            item.size = ie.text_size
-            fileobj = tree.get_file(ie.file_id)
+            if filtered:
+                chunks = tree.get_file_lines(ie.file_id)
+                filters = tree._content_filter_stack(dp)
+                context = ContentFilterContext(dp, tree, ie)
+                contents = filtered_output_bytes(chunks, filters, context)
+                content = ''.join(contents)
+                item.size = len(content)
+                fileobj = StringIO.StringIO(content)
+            else:
+                item.size = ie.text_size
+                fileobj = tree.get_file(ie.file_id)
         elif ie.kind == "directory":
             item.type = tarfile.DIRTYPE
             item.name += '/'

This smells a bit like we're doing a lot of string manipulation that
might be inefficient in splitting, joining, and converting to and
from files.  Probably it wants to be done as chunk iterators, even if
for ease of implementing the filters they operate on the whole thing
as one chunk.  But we can do that later.  (And it may be that if this
is always dealing with one chunk the join operations don't actually
hurt.)
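
For example, a small wrapper like this (hypothetical, sketched for
illustration) would let the export code size the entry from the
chunks without building one big string:

  class _ChunkedFile(object):
      # minimal read-only file-like wrapper over a list of byte chunks

      def __init__(self, chunks):
          self._buffer = ''
          self._iter = iter(chunks)

      def read(self, size=-1):
          # accumulate chunks until we can satisfy the request
          while size < 0 or len(self._buffer) < size:
              try:
                  self._buffer += self._iter.next()
              except StopIteration:
                  break
          if size < 0:
              result, self._buffer = self._buffer, ''
          else:
              result, self._buffer = self._buffer[:size], self._buffer[size:]
          return result

  # the export loop could then avoid the single big join:
  #   contents = list(filtered_output_bytes(chunks, filters, context))
  #   item.size = sum(len(c) for c in contents)
  #   fileobj = _ChunkedFile(contents)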

+def filtered_output_bytes(chunks, filters, context=None):
+    """Convert byte chunks from internal to external format.
+
+    :param chunks: an iterator containing the original content
+    :param filters: the stack of filters to apply
+    :param context: a ContentFilterContext object passed to
+        each filter
+    :return: an iterator containing the content to output
+    """
+    if filters:
+        for filter in reversed(filters):
+            if filter.writer is not None:
+                chunks = filter.writer(chunks, context)
+    return chunks

This would be a bit simpler without the 'if' statement, just letting
the loop do nothing...
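
i.e. roughly this, where the 'or []' keeps the None case working:

  def filtered_output_bytes(chunks, filters, context=None):
      """Convert byte chunks from internal to external format."""
      for filter in reversed(filters or []):
          if filter.writer is not None:
              chunks = filter.writer(chunks, context)
      return chunks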

+
+
+def internal_size_sha_file_byname(name, filters):
+    """Get size and sha of internal content given external content.
+
+    :param name: path to file
+    :param filters: the stack of filters to apply
+    """
+    f = open(name, 'rb', 65000)
+    try:
+        if filters:
+            f = filtered_input_file(f, filters)
+        size = 0
+        s = sha.new()
+        BUFSIZE = 128<<10
+        while True:
+            b = f.read(BUFSIZE)
+            if not b:
+                break
+            size += len(b)
+            s.update(b)
+        return size, s.hexdigest()
+    finally:
+        f.close()

It would be a bit better if filtered_input_file and
filtered_output_bytes used the terms 'canonical' and 'convenient' in
their names, because of course we read and write both formats at
different times.

Also, it seems like there should be an osutils.size_sha1_file that
this calls.  In fact there is already osutils.fingerprint_file
though that's arguably not the nicest or fastest api, since it
returns a dictionary.
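
e.g. something like this (the name is hypothetical), factored out of
the function above:

  import sha  # same module the patch uses

  def size_sha1_file(f, buffer_size=128 << 10):
      """Return (size, sha1 hexdigest) for an open file-like object."""
      size = 0
      s = sha.new()
      while True:
          b = f.read(buffer_size)
          if not b:
              break
          size += len(b)
          s.update(b)
      return size, s.hexdigest()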

+def _reset_registry(value=None):
+    """Reset the filter stack registry.
+
+    This function is provided to aid testing. The expected usage is::
+
+      old = _reset_registry()
+      # run tests
+      _reset_registry(old)
+
+    :param value: the value to set the registry to or None for an empty one.
+    :return: the existing value before it was reset.
+    """
+    global _filter_stacks_registry
+    original = _filter_stacks_registry
+    if value is None:
+        _filter_stacks_registry = registry.Registry()
+    else:
+        _filter_stacks_registry = value
+    _stack_cache.clear()
+    return original

It seems like that ought to be a generic Registry (or maybe
selftest?) facility.  Maybe there is one?
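
If there isn't one, the generic shape might look like this (names
hypothetical; the filter-specific _stack_cache.clear() would still
need handling somewhere):

  from bzrlib import registry

  def override_registry(test, module, name, new_registry=None):
      """Replace module.<name> for one test, restoring it on cleanup."""
      if new_registry is None:
          new_registry = registry.Registry()
      original = getattr(module, name)
      setattr(module, name, new_registry)
      def restore():
          setattr(module, name, original)
      test.addCleanup(restore)
      return original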

=== modified file 'bzrlib/hashcache.py'
--- bzrlib/hashcache.py	2008-09-26 07:09:50 +0000
+++ bzrlib/hashcache.py	2008-12-17 06:13:38 +0000
@@ -31,6 +31,7 @@

 import os, stat, time

+from bzrlib.filters import internal_size_sha_file_byname
 from bzrlib.osutils import sha_file, sha_string, pathjoin, safe_unicode
 from bzrlib.trace import mutter, warning
 from bzrlib.atomicfile import AtomicFile
@@ -79,8 +80,15 @@
     """
     needs_write = False

-    def __init__(self, root, cache_file_name, mode=None):
-        """Create a hash cache in base dir, and set the file mode to mode."""
+    def __init__(self, root, cache_file_name, mode=None,
+            content_filter_stack_provider=None):
+        """Create a hash cache in base dir, and set the file mode to mode.
+
+        :param content_filter_stack_provider: a function that takes a
+            path (relative to the top of the tree) and a file-id as
+            parameters and returns a stack of ContentFilters.
+            If None, no content filtering is performed.
+        """
         self.root = safe_unicode(root)
         self.root_utf8 = self.root.encode('utf8') # where is the filesystem encoding ?
         self.hit_count = 0

I think this class is only active for deprecated wt formats, and so
it probably does not need to be updated for new features like
filtering that presumably won't be active there.  On the other hand,
since it's still here and might conceivably be derived from, there's
no harm in updating it.

=== added file 'bzrlib/help_topics/en/content-filters.txt'
--- bzrlib/help_topics/en/content-filters.txt	1970-01-01 00:00:00 +0000
+++ bzrlib/help_topics/en/content-filters.txt	2008-12-17 10:16:47 +0000
@@ -0,0 +1,93 @@
+Content Filters
+===============
+
+Content formats
+---------------
+
+Bazaar's content filtering allows you to store files in a different
+format from the copy in your working tree.  This lets you, or your
+co-developers, use Windows development tools that expect CRLF files
+on projects that use other line-ending conventions. Among other things,
+content filters also let Linux developers more easily work on projects
+using Windows line-ending conventions. They also make it possible to
+support keyword expansion/compression and to implicitly strip trailing
+spaces from lines in text files when they are committed.
+
+To generalize, there are two content formats supported by Bazaar:
+
+ * a canonical format - how files are stored internally
+ * a convenience format - how files are created in a working tree.

(Tiny nit: ReST lists should have no leading space or they're doubly
indented.)

+Impact on commands
+------------------
+
+Read converters are only applied to commands that read content from
+a working tree, e.g. status, diff and commit. For example, ``bzr diff``
+will apply read converters to files in the working tree, then compare
+the results to the content last committed.
+
+Write converters are only applied by commands that **create files in a
+working tree**, e.g. branch, checkout, update. If you wish to see the
+canonical format of a file or tree, use ``bzr cat`` or ``bzr export``
+respectively.
+
+Note: ``bzr commit`` does not implicitly apply write converters after
+committing files. If this makes sense for a given plugin providing a
+content filter, the plugin can usually achieve this effect by using a
+``start_commit`` or ``post_commit`` hook, say. See ``bzr help hooks`` and
+`Using hooks`_ in the Bazaar User Guide for more information on hooks.

That makes sense, thanks for mentioning it.
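
For reference, a plugin could install such a hook roughly like this
(a sketch, assuming the standard post_commit hook signature; the
refresh logic itself is left as a placeholder):

  from bzrlib import branch

  def _refresh_convenient_form(local, master, old_revno, old_revid,
                               new_revno, new_revid):
      # a real plugin would re-run its write converters over the
      # affected working tree files here
      pass

  branch.Branch.hooks.install_named_hook(
      'post_commit', _refresh_convenient_form, 'refresh content filters')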

+
+
+Refreshing your working tree
+----------------------------
+
+For performance reasons, Bazaar caches the timestamps of files in
+a working tree, and assumes files are unchanged if their timestamps
+match the cached values. As a consequence, there are times when
+you may need to explicitly ask for content filtering to be reapplied
+in one or both directions, e.g. after installing or reconfiguring
+plugins providing it.
+
+Here are some general guidelines for doing this:
+
+ * To reapply read converters, ``touch`` files, i.e. update their
+   timestamp. Operations like ``bzr status`` should then reapply the
+   relevant read converters and compare the end result with the
+   canonical format.
+
+ * To reapply write converters, ensure there are no local changes,
+   delete the relevant files and run ``bzr revert`` on those files.
+
+Note: In the future, it is likely that additional options will be added
+to commands to make this refreshing process faster and safer.

OK this is good too.

=== modified file 'bzrlib/tests/blackbox/test_cat.py'
--- bzrlib/tests/blackbox/test_cat.py	2007-12-05 22:42:37 +0000
+++ bzrlib/tests/blackbox/test_cat.py	2008-07-23 08:12:59 +0000
@@ -118,6 +118,19 @@
         out, err = self.run_bzr_subprocess(['cat', url])
         self.assertEqual('contents of README\n', out)

+    def test_cat_filters(self):
+        wt = self.make_branch_and_tree('.')
+        self.build_tree(['README'])
+        wt.add('README')
+        wt.commit('Making sure there is a basis_tree available')
+        url = self.get_readonly_url() + '/README'
+        # Test unfiltered output
+        out, err = self.run_bzr_subprocess(['cat', url])
+        self.assertEqual('contents of README\n', out)
+        # TODO: Test filtering applied to output, not just a legal option
+        out, err = self.run_bzr_subprocess(['cat', '--filters', url])
+        self.assertEqual('contents of README\n', out)
+
     def test_cat_no_working_tree(self):
         wt = self.make_branch_and_tree('.')
         self.build_tree(['README'])

It would be a lot nicer if the filters were actually tested, even if
you have to hook them up specially in memory because this patch does
not provide a policy layer...  Maybe by importing from test_filters.
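
For instance a test shaped like this (a sketch; the stack
registration step is assumed, since the patch shows only
_reset_registry):

  def test_cat_filters_output(self):
      # swap in a throwaway filter registry for this test
      from bzrlib import filters
      old = filters._reset_registry()
      def restore():
          filters._reset_registry(old)
      self.addCleanup(restore)
      # here, register a stack whose read/write filters are
      # str.swapcase for all files -- registration API assumed
      wt = self.make_branch_and_tree('.')
      self.build_tree_contents([('README', 'contents\n')])
      wt.add('README')
      wt.commit('setup')
      # with such a stack installed, --filters should change the
      # output, e.g.:
      #   out, err = self.run_bzr(['cat', '--filters', 'README'])
      #   self.assertEqual('CONTENTS\n', out)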

=== added file 'bzrlib/tests/test_filters.py'
--- bzrlib/tests/test_filters.py	1970-01-01 00:00:00 +0000
+++ bzrlib/tests/test_filters.py	2008-05-26 06:36:40 +0000

+# sample filter stacks
+_swapcase = lambda chunks, context=None: [s.swapcase() for s in chunks]
+_addjunk = lambda chunks: ['junk\n'] + [s for s in chunks]
+_deljunk = lambda chunks, context: [s for s in chunks[1:]]

They might as well be regular 'def'.
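
That is:

  def _swapcase(chunks, context=None):
      # swap the case of each chunk
      return [s.swapcase() for s in chunks]

  def _addjunk(chunks):
      # prepend a junk line
      return ['junk\n'] + [s for s in chunks]

  def _deljunk(chunks, context):
      # drop the leading junk line
      return [s for s in chunks[1:]]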

=== modified file 'bzrlib/workingtree.py'
@@ -423,22 +425,36 @@
     def has_filename(self, filename):
         return osutils.lexists(self.abspath(filename))

-    def get_file(self, file_id, path=None):
-        return self.get_file_with_stat(file_id, path)[0]
+    def get_file(self, file_id, path=None, filtered=True):
+        return self.get_file_with_stat(file_id, path, filtered=filtered)[0]

-    def get_file_with_stat(self, file_id, path=None, _fstat=os.fstat):
+    def get_file_with_stat(self, file_id, path=None, filtered=True,
+        _fstat=os.fstat):
         """See MutableTree.get_file_with_stat."""
         if path is None:
             path = self.id2path(file_id)
-        file_obj = self.get_file_byname(path)
-        return (file_obj, _fstat(file_obj.fileno()))
-
-    def get_file_byname(self, filename):
-        return file(self.abspath(filename), 'rb')
-
-    def get_file_lines(self, file_id, path=None):
+        file_obj = self.get_file_byname(path, filtered=False)
+        stat_value = _fstat(file_obj.fileno())
+        if self.supports_content_filtering() and filtered:
+            filters = self._content_filter_stack(path)
+            file_obj = filtered_input_file(file_obj, filters)
+        return (file_obj, stat_value)

This method get_file_with_stat seems a bit dangerous because the
size in the stat object may be inconsistent with what's actually
returned; I'd like to see if we could deprecate this method.
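
Deprecating it would presumably look something like this (version
number illustrative), using the existing symbol_versioning machinery:

  import os

  from bzrlib import symbol_versioning

  @symbol_versioning.deprecated_method(
      symbol_versioning.deprecated_in((1, 13, 0)))
  def get_file_with_stat(self, file_id, path=None, filtered=True,
                         _fstat=os.fstat):
      # real method body unchanged; this sketch elides it
      pass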

So to sum up: I think we should move forward with this, and I don't
see anything that's actually buggy.  It would be very nice if we
could do this in a manner entirely separate from the tree, either in
the current format or in a future dirstate update, and it seems just
about possible.  But perhaps the right way to get there is to merge
it, and then pick it up later.

-- 
Martin


