bzr diff --filter= or equivalent?

Doug Lee dgl at dlee.org
Thu Feb 3 18:34:31 UTC 2011


I see that bzr diff allows --using for an external differ, but I want
a filter applied to files before the internal differ is used.  Sample
usage:

	bzr diff --filter=docstream figures.xlsx

where docstream is a program that converts a Microsoft Office
Word/Excel file into a more diff-friendly format.  I have written a
quick example of such a filter.  Bzr would send each version of the
file through docstream and compare the output instead of the original
content.
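
The flow described above can be sketched outside Bazaar roughly as
follows.  This is only an illustration under stated assumptions: it is
written for Python 3, the names filtered_diff, docstream_text and
make_zip are hypothetical, nothing here is a Bazaar API, and binary
members are reduced to a size placeholder for brevity (the script later
in this message embeds them raw).

```python
import difflib
import io
import zipfile


def docstream_text(data):
    """Hypothetical filter: zip file bytes in, list of text lines out."""
    lines = []
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        for name in z.namelist():
            lines.append("File: %s" % name)
            raw = z.read(name)
            if raw[:1] == b"<":
                # Presumably XML: start a new line at every "<".
                text = raw.decode("utf-8", "replace").replace("<", "\n<")
            else:
                # Assumption for brevity: summarize binary members.
                text = "<binary, %d bytes>" % len(raw)
            lines.extend(line for line in text.splitlines() if line)
    return lines


def filtered_diff(old_data, new_data, text_filter, label="file"):
    """Diff the filtered text of two versions instead of the raw bytes."""
    return list(difflib.unified_diff(
        text_filter(old_data), text_filter(new_data),
        fromfile="old/" + label, tofile="new/" + label, lineterm=""))


def make_zip(members):
    """Build a small in-memory zip so the example needs no real .xlsx."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name, content in members.items():
            z.writestr(name, content)
    return buf.getvalue()


old = make_zip({"xl/sheet1.xml": "<row><c>1</c></row>"})
new = make_zip({"xl/sheet1.xml": "<row><c>2</c></row>"})
for line in filtered_diff(old, new, docstream_text, "figures.xlsx"):
    print(line)
```

Because the filter is passed in as an ordinary callable, any other
converter that returns a list of lines could be swapped in the same way.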

Since I don't know if attachments are allowed here and it's a short
program, I'll just drop its 49-line self right here; pardon any
presumptuousness this demonstrates. :)  This currently requires the
name of the source file on the command line, but of course it would be
easy enough to default to stdin or add support for the filename "-".
No license restrictions.
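
A minimal sketch of what the stdin / "-" support might look like,
assuming Python 3.  The helper name open_zip and its stdin parameter are
hypothetical and not part of the script in this message.  The one
wrinkle is that zipfile needs a seekable file and a pipe is not
seekable, so the input is buffered in memory first.

```python
import io
import sys
import zipfile


def open_zip(name=None, stdin=None):
    """Open the named zip file, or buffer stdin when name is None or "-"."""
    if name in (None, "-"):
        source = sys.stdin if stdin is None else stdin
        # Text-mode stdin exposes the raw bytes through .buffer.
        stream = getattr(source, "buffer", source)
        # Read everything so that zipfile gets a seekable object.
        return zipfile.ZipFile(io.BytesIO(stream.read()))
    return zipfile.ZipFile(name)
```

With this, a call such as open_zip(sys.argv[1] if len(sys.argv) == 2
else "-") would cover both the named-file case and the piped case.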

Is this planned, currently possible somewhere I missed, ... or should
I file it as a bug/feature request?

==========
#! /usr/bin/env python
"""DocStream - Make a (mostly) text stream out of a Microsoft Office 2007+ file.
Usage: docstream <filename>, where <filename> is an Office 2007+ file.
Example: docstream document.docx, or docstream wb.xlsx
The result is sent to stdout.
The result is a stream like
	Zip file: wb.xlsx
	File: [Content_Types].xml
	<... content of that file ...>
	File: _rels/.rels
	<... content of that file ...>
	<... other files ...>
XML files in the stream have "<" globally prepended with a Newline,
so that the file breaks down into logical segments by line.

The point of all this is to produce more easily/meaningfully diffable data:
Streams from two office documents can be compared with a standard diff utility.

Caveats:
Binary files are not textified before being inserted into the stream.
Example from a .xlsx file: xl/printerSettings/printerSettings1.bin.
They may also have a Newline appended (so the next "File:" line is flush left).

Author:  Doug Lee of SSB BART Group
"""

import sys, zipfile

if len(sys.argv) != 2:
	sys.exit(__doc__)

zname = sys.argv[1]
z = zipfile.ZipFile(zname)
print "Zip file: %s" % (zname)
for fname in z.namelist():
	print ("File: %s" % (fname)),
	f = z.open(fname)
	txt = f.read()
	if txt[:1] == "<":  # Slice, so an empty member does not raise IndexError.
		# Presumably XML.
		txt = txt.replace("<", "\n<")
	else:
		# Probably a binary file.
		# End the "File:" line first.
		print
		# TODO: Binary content just dropped in without being textified first.
	if txt[-1:] != "\n":  # Slice again, for the same empty-member case.
		txt += "\n"
	print txt,
==========



-- 
Doug Lee                 dgl at dlee.org                http://www.dlee.org
SSB BART Group           doug.lee at ssbbartgroup.com   http://www.ssbbartgroup.com
"The U. S. Constitution doesn't guarantee happiness, only the pursuit
of it. You have to catch up with it yourself." --Benjamin Franklin


