[MERGE/RFC] Content filters - take 6

Martin Pool mbp at sourcefrog.net
Sun Feb 22 22:40:20 GMT 2009

2009/2/21 John Arbash Meinel <john at arbash-meinel.com>:
>> If there are no filters then it should be storing just the same hash
>> that it does at present; we'd want to check that it's not doing so
>> inefficiently.
>> If there are filters and they're symmetric then when we check the
>> working file we'll convert it back to the original form, and see
>> that it hasn't changed.
>> If the filters are asymmetrical, then I think we may get the
>> following situation: when the tree is constructed it'll be given the
>> canonical hashes, but if we re-hash a file we'll see the hash that
>> comes from input filtering on the working copy.  That might be
>> reasonable if we saw them as changed when the user specifically
>> touched those files, but the problem is that we'll sometimes re-hash
>> them because the original timestamp was not safe.
> Do we want to support asymmetrical filters? It seems like we should only
> really allow filters where "in_filt(out_filt(bytes)) == bytes". Or do
> you have some otehr meaning here?

Well he does document later in the patch that such filters are allowed.

If we're going to run arbitrary code for them it seems a bit hard to
forbid them.  And anyhow, there are interesting plausible cases, such
as normalizing something on checkout but always committing precisely
what was written.

A filter that makes all files modified whenever you check out would be
strange I agree, but I'd like to at least know how it should work.

>> Asymmetric filters are not the common case but I hoped considering
>> those cases would shed light on how it should work.
> IMO, if you have an asymmetric filter, you have just modified the whole
> working tree.

Agree - or at least, all the files for which it produced an asymmetric result.

>> -    def get_file_with_stat(self, file_id, path=None, _fstat=os.fstat):
>> +    def get_file_with_stat(self, file_id, path=None, filtered=True,
>> +        _fstat=os.fstat):
>>          """See MutableTree.get_file_with_stat."""
>>          if path is None:
>>              path = self.id2path(file_id)
>> -        file_obj = self.get_file_byname(path)
>> -        return (file_obj, _fstat(file_obj.fileno()))
>> -
>> -    def get_file_byname(self, filename):
>> -        return file(self.abspath(filename), 'rb')
>> -
>> -    def get_file_lines(self, file_id, path=None):
>> +        file_obj = self.get_file_byname(path, filtered=False)
>> +        stat_value = _fstat(file_obj.fileno())
>> +        if self.supports_content_filtering() and filtered:
>> +            filters = self._content_filter_stack(path)
>> +            file_obj = filtered_input_file(file_obj, filters)
>> +        return (file_obj, stat_value)
>> This method get_file_with_stat seems a bit dangerous because the
>> size in the stat object may be inconsistent with what's actually
>> returned; I'd like to see if we could deprecate this method.
> The reason to have it was to have a "non-racing" implementation to get
> the content and the stat. Otherwise if you read at a separate time than
> you stat, something could have changed in the mean-time.
> I believe the stat is only really used to pass back to the dirstate to
> have it mark the file as "unchanged since last commit".

There is an important timing dependency here, which is that the stat
call must be taken before we read from the file, in case it's changed
while we're reading.  That's still safe in Ian's code but maybe a
comment would be safer.

Martin <http://launchpad.net/~mbp/>

More information about the bazaar mailing list