Rev 4646: (mbp) developer documentation about content filtering in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Patch Queue Manager pqm at
Tue Aug 25 08:08:59 BST 2009

At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

revno: 4646 [merge]
revision-id: pqm at
parent: pqm at
parent: mbp at
committer: Patch Queue Manager <pqm at>
branch nick: +trunk
timestamp: Tue 2009-08-25 08:08:58 +0100
message:
  (mbp) developer documentation about content filtering
added:
  doc/developers/content-filtering.txt contentfiltering.txt-20090825034801-16mynkm76jpmi4tp-1
modified:
  NEWS                           NEWS-20050323055033-4e00b5db738777ff
  doc/developers/index.txt       index.txt-20070508041241-qznziunkg0nffhiw-1
=== modified file 'NEWS'
--- a/NEWS	2009-08-24 22:32:53 +0000
+++ b/NEWS	2009-08-25 07:08:58 +0000
@@ -61,6 +61,9 @@
+* New developer documentation for content filtering.
+  (Martin Pool)
 API Changes

=== added file 'doc/developers/content-filtering.txt'
--- a/doc/developers/content-filtering.txt	1970-01-01 00:00:00 +0000
+++ b/doc/developers/content-filtering.txt	2009-08-25 05:50:50 +0000
@@ -0,0 +1,147 @@
+Content Filtering
+=================
+
+Content filtering is the feature by which Bazaar can do line-ending
+conversion or keyword expansion so that the files that appear in the
+working tree are not precisely the same as the files stored in the
+repository.
+
+This document describes the implementation; see the user guide for how to
+use it.
+
+We distinguish between the *canonical form* which is stored in the
+repository and the *convenient form* which is stored in the working tree.
+The convenient form will for example use OS-local newline conventions or
+have keywords expanded, and the canonical form will not.  We use these
+names rather than e.g. "filtered" and "unfiltered" because filters are
+applied when both reading and writing so those names might cause
+confusion.
+
+Content filtering is only active on working trees that support it, which
+is format 2a and later.
+
+Content filtering is configured by rules that match file patterns.
+
+Filters come in pairs: a read filter (reading convenient->canonical) and
+a write filter (writing canonical->convenient).  There is no requirement
+that they be symmetric or that they be deterministic from the input,
+though in general both these properties will be true.  Filters are
+allowed to change the size of the content, and things like line-ending
+conversion commonly will.
+
+Filters are fed a sequence of byte chunks (so that they don't have to
+hold the whole file in memory).  There is no guarantee that the chunks
+will be aligned with line endings.  Write filters are passed a context
+object through which they can obtain some information about e.g. which
+file they're working on.  (See ``bzrlib.filters`` docstring.)
+
+These are at the moment strictly *content* filters: they can't make
+changes to the tree like changing the execute bit, file types, or
+adding/removing entries.
+
+bzrlib interfaces that aren't explicitly specified to deal with the
+convenient form should return the canonical form.  Whenever we have the
+SHA1 hash of a file, it's the hash of the canonical form.
+
+Dirstate interactions
+---------------------
+
+The dirstate file should store, in the column for the working copy, the cached
+hash and size of the canonical form, and the packed stat fingerprint for
+which that cache is valid.  This implies that the stored size will
+in general be different to the size in the packed stat.  (However, it
+may not always do this correctly - see
+
+The dirstate is given a SHA1Provider instance by its tree.  This class
+can calculate the (canonical) hash and size given a filename.  This
+provides a hook by which the working tree can make sure that when the
+dirstate needs to get the hash of the file, it takes the filters into
+account.
+
+User interface
+--------------
+
+Most commands that deal with the text of files present the
+canonical form.  Some have options to choose.
+
+Performance considerations
+--------------------------
+
+Content filters can have serious performance implications.  For example,
+getting the size of (the canonical form of) a file is easy and fast when
+there are no content filters: we simply stat it.  However, when there
+are filters that might change the size of the file, determining the
+length of the canonical form requires reading in and filtering the whole
+file.
+
+Formats from 1.14 onwards support content filtering, so having fast
+paths for the case where content filtering is not possible is not
+generally worthwhile.  In fact, they're probably harmful, because they
+add extra edge cases to both test coverage and performance.
+
+We need to have things be fast even when filters are in use and then
+possibly do a bit less work when there are no filters configured.
+
+Future ideas and open issues
+----------------------------
+
+* We might benefit from having filters declare some of their properties
+  statically, for example that they're deterministic or can round-trip
+  or won't change the length of the file.  However, common cases like
+  crlf conversion are not guaranteed to round-trip and may change the
+  length, so perhaps adding separate cases will just complicate the code
+  and tests.  So overall this does not seem worthwhile.
+* In a future workingtree format, it might be better not to separately
+  store the working-copy hash and size, but rather just a stat fingerprint
+  at which point it was known to have the same canonical form as the 
+  basis tree.
+* It may be worthwhile to have a virtual Tree-like object that does
+  filtering, so there's a clean separation of filtering from the on-disk
+  state and the meaning of any object is clear.  This would have some
+  risk of bugs where either code holds the wrong object, or their state
+  becomes inconsistent.
+  This would be useful in allowing you to get a filtered view of a
+  historical tree, eg to export it or diff it.  At the moment export
+  needs to have its own code to do the filtering.
+  The convenient-form tree would talk to disk, and the canonical-form
+  tree would sit on top of that and be used by most other bzr code.
+  If we do this, we'd need to handle the fact that the on-disk tree,
+  which generally deals with all of the IO and generally works entirely
+  in convenient form, would also need to be told the canonical hash to
+  store in the dirstate.  This can perhaps be handled by the
+  SHA1Provider or a similar hook.
+* Content filtering at the moment is a bit specific to on-disk trees:
+  for instance ``SHA1Provider`` goes directly to disk, but it seems like
+  this is not necessary.
+
+See also
+--------
+
+* `Developer Documentation <index.html>`_
+* ``bzrlib.filters``
+
+.. vim: ft=rst tw=72
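The new document notes that filters receive a sequence of byte chunks with no guarantee that chunk boundaries fall on line endings. The sketch below illustrates what that implies for a line-ending filter pair. It is written for Python 3 and is not taken from ``bzrlib.filters``: the function names are hypothetical, and the unused ``context`` parameter only stands in for the context object the document mentions.

```python
def crlf_to_lf(chunks, context=None):
    """Hypothetical read filter (convenient -> canonical): CRLF becomes LF.

    A CR at the end of a chunk is held back until the next chunk shows
    whether it is the first half of a CRLF pair.
    """
    pending_cr = False
    for chunk in chunks:
        if pending_cr:
            # Re-attach the CR held back from the previous chunk.
            chunk = b"\r" + chunk
            pending_cr = False
        if chunk.endswith(b"\r"):
            # Cannot decide yet whether this CR starts a CRLF pair.
            chunk = chunk[:-1]
            pending_cr = True
        if chunk:
            yield chunk.replace(b"\r\n", b"\n")
    if pending_cr:
        # A lone CR at end of input is not part of a pair, so keep it.
        yield b"\r"


def lf_to_crlf(chunks, context=None):
    """Hypothetical write filter (canonical -> convenient): LF becomes CRLF."""
    for chunk in chunks:
        yield chunk.replace(b"\n", b"\r\n")


# A CRLF pair split across two chunks is still converted.
canonical = b"".join(crlf_to_lf([b"one\r", b"\ntwo\r\n"]))

# Mixed line endings do not survive a read followed by a write.
original = b"one\ntwo\r\n"
round_tripped = b"".join(lf_to_crlf(crlf_to_lf([original])))
```

Here ``canonical`` is ``b"one\ntwo\n"`` even though the first CRLF straddles a chunk boundary, and ``round_tripped`` differs from ``original`` in both content and length, which matches the document's remark that crlf conversion is not guaranteed to round-trip.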

=== modified file 'doc/developers/index.txt'
--- a/doc/developers/index.txt	2009-08-12 08:03:11 +0000
+++ b/doc/developers/index.txt	2009-08-25 03:48:12 +0000
@@ -126,6 +126,8 @@
 * `Computing last_modified values <last-modified.html>`_ for inventory
+* `Content filtering <content-filtering.html>`_
 * `LCA Tree Merging <lca_tree_merging.html>`_ |--| Merging tree-shape when
   there is not a single unique ancestor (criss-cross merge).
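The "Dirstate interactions" and "Performance considerations" sections of the new content-filtering.txt say that the cached hash and size describe the canonical form, and that obtaining them with filters active means reading and filtering the whole file. A minimal Python 3 sketch of that cost follows. The helper names and the ``strip_cr`` stand-in filter are hypothetical and do not reproduce the real ``SHA1Provider`` interface.

```python
import hashlib


def read_chunks(path, chunk_size=65536):
    """Yield the on-disk (convenient form) content of path in byte chunks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk


def strip_cr(chunks):
    """Stand-in read filter (convenient -> canonical): drop every CR byte."""
    for chunk in chunks:
        yield chunk.replace(b"\r", b"")


def canonical_sha1_and_size(path, read_filters=()):
    """Return (sha1 hex digest, size) of the canonical form of path.

    With filters configured there is no shortcut through stat(): every
    byte has to be read and filtered before the size is known.
    """
    chunks = read_chunks(path)
    for read_filter in read_filters:
        chunks = read_filter(chunks)
    sha = hashlib.sha1()
    size = 0
    for chunk in chunks:
        sha.update(chunk)
        size += len(chunk)
    return sha.hexdigest(), size
```

For a file containing ``b"one\r\ntwo\r\n"``, ``os.path.getsize`` reports 10 bytes, while ``canonical_sha1_and_size(path, [strip_cr])`` reports a size of 8 and the SHA1 of ``b"one\ntwo\n"``. That gap is why the stored size generally differs from the size in the packed stat.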
