Binary file handling discussion

Sat Nov 4 10:47:44 GMT 2006

On 11/3/06, John Arbash Meinel <john at arbash-meinel.com> wrote:
>
> Jari Aalto wrote:
> > Nicholas Allen <allen at ableton.com> writes:
> >
> >> The way CVS does it is really bad and it often
> >> makes mistakes by assuming that all files are text files unless the
> >> user specifies that they are binary (and often users forget this). So
> >> CVS's policy is one of data destruction by default and I do not think
> >> this would be a good idea for bzr!
> >
> > I understood that a VCS is primarily for text files and only
> > secondarily used for binary files.
> >
> > The CVS way of assuming text by default is for the typical situation and
> > explicitly marking other types of files as binary is logical. CVS has a
> > similar list to "bzr ignore", where patterns can be added to the
> > server to automatically treat certain files as binary. So, the "-kb"
> > tagging of individual files is quite transparent to the casual users:
> >
> >         *.jpg
> >         *.xls
> >         *.doc
>
> CVS uses a bad policy. You really don't want to start out assuming
> everything is text, and then switch to binary on request. Because by the
> time the switch is requested, you've already corrupted the data.
> Especially if you start doing stuff like keyword expansion. But even
> line-ending conversions can quickly corrupt files. As an example, PNG
> files have an explicit \r\n as part of their magic number. It was
> explicitly put there to detect "accidental" but unnoticed corruption.
>
> But that is also why the storage aspect could be kept separate from the
> diff/merge logic. Because by default you would assume that you could
> diff and merge, but by default you would store the exact text.
>
> The suggestion is for .bzrtypes or some other file versioned in the source tree.
> We've had some discussions about whether it is the "right" thing, though.
>
> Being in the source tree makes it easier for users to edit. And makes
> merging and versioning come naturally from the rest of the system.
>
> On the downside, if we change how we interpret those files, we have no
> good way to maintain compatibility with old or new versions. We could
> have a format field in the file, which would at least give us some
> flexibility.

This is my fault for not doing things in the right order. I have been
proposing a solution where I should have written a feature request. What I
wanted to say with the .bzrtypes was that I would like to see a central
place to configure those things. I consider metadata that is attached to
files and scattered over hundreds of subdirectories a bad thing.

The advantages of a single configuration 'file' are that it is easier to
understand than a combination of several dozen independent
meta attributes, and that you can copy and paste the configuration
from old projects to new ones. You do not have to write scripts or
documentation on how to configure a repository for a new project.

> Do we want to allow people to change meta information like this for old
> versions? Is that necessary or desirable? If it is desired, then we need
> a different mechanism.

CVS has the metadata that is global to the repository stored in special
files in the special directory CVSROOT. A user can check out this directory,
change the files and then commit it. In CVS the CVSROOT in the repository
and in the working tree are of identical format. This does not have to be
the case.

Bzr could use a similar mechanism, where BZRMETA is only a logical
entity in some internal format. Only when a user wants to change a
configuration does he have to 'materialize' it in the working tree and make
his changes; on commit the files 'vanish' into the repository. Bzr
'translates' the internal format into a human-readable, editable format and
back. This way we can change the internal representation and the external
one as we need to. OK, that would imply that we cannot have a real history
of this metadata, but I do not know if this is really necessary.

> What about merging/conflicts. Do the normal conflict mechanisms make
> sense? They make more sense for something like this than they do for
> .bzrignore. For .bzrignore it really is a set operation, not a series of
> lines information. So 2 people adding different entries in approximately
> the same spot isn't really a conflict. Though in this case if order is
> important for pattern matching, then users *do* need to resolve a
> conflict because they need to give an explicit priority.

Good point. This will allow for very nasty misconfiguration. At the moment
I have no good solution for the 'priority' problem.

What about a set of user-specified default options (simple glob style,
saying that all *.xml have flag1 and flag2) and a set of exceptions, which
are expressed in a more complex syntax (regexp, for example)? In each
set, all entries are unordered and of the same priority, but the exception
set has higher priority than the default set.
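
A minimal sketch of how such a lookup could work, just to make the idea
concrete. Everything here is hypothetical: the flag names, the patterns, the
function name flags_for, and in particular the rule that several matching
entries within one set contribute the union of their flags. None of this is
existing bzr behaviour.

```python
import fnmatch
import re

# Hypothetical default set: glob pattern -> flags (unordered, same priority).
DEFAULTS = {
    "*.xml": {"flag1", "flag2"},
    "*.png": {"binary"},
}

# Hypothetical exception set: regexp -> flags (unordered, same priority).
EXCEPTIONS = {
    r"^vendor/.*\.xml$": {"binary"},
}


def flags_for(path, defaults=DEFAULTS, exceptions=EXCEPTIONS):
    """Return the flags for path; any matching exception beats all defaults."""
    basename = path.rsplit("/", 1)[-1]

    # Exceptions are matched against the whole path with a regexp.
    from_exceptions = set()
    for regexp, flags in exceptions.items():
        if re.search(regexp, path):
            from_exceptions |= flags  # assumption: union of all matches
    if from_exceptions:
        return from_exceptions

    # Defaults are matched against the file name with a simple glob.
    from_defaults = set()
    for glob, flags in defaults.items():
        if fnmatch.fnmatchcase(basename, glob):
            from_defaults |= flags  # assumption: union of all matches
    return from_defaults
```

Since no entry inside a set depends on its position, the order in which the
entries are written down does not change the result.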

This would make merging configurations easier, and would allow for
two ways of specifying the configuration. The first uses one or two files
which have to be edited (like editing .bzrignore); the second allows
specification using a bzr command (analogous to bzr ignore).
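
To illustrate why unordered sets would merge more easily, here is a rough
sketch of a three-way merge over pattern -> flags mappings. The function
name merge_rules and the data layout are my own assumptions, not anything
bzr provides. Two people adding different patterns never conflict; a
conflict remains only when both sides changed the same pattern in different
ways.

```python
def merge_rules(base, ours, theirs):
    """Three-way merge of unordered pattern -> flags mappings.

    Returns (merged, conflicts), where conflicts lists the patterns
    that both sides changed differently.
    """
    merged = {}
    conflicts = []
    for pattern in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(pattern)
        o = ours.get(pattern)
        t = theirs.get(pattern)
        if o == t:
            value = o          # both sides agree (or both removed it)
        elif o == b:
            value = t          # only their side changed this pattern
        elif t == b:
            value = o          # only our side changed this pattern
        else:
            conflicts.append(pattern)
            continue
        if value is not None:  # None means the pattern was removed
            merged[pattern] = value
    return merged, conflicts
```

For example, if one branch adds *.jpg and another adds *.doc to the same
base, the merged result simply contains both entries.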

> One thing I really like about the proposal is that it gives an easy way
> to give values for lots of files in the tree. And to update that
> property. Tracking stuff like this in a per-file method like SVN means
> that you need to remember to set them for any new file. One of my
> biggest beefs is that svn:ignore doesn't have a way to make it
> recursive, so adding a new sub-project tends to add all the things that
> you just asked it to ignore in the other project.

Yes. I really like centralized configurations. I really do :-)

> John
> =:->

Ciao,
  Steffen