Storage internals: UUID

Tue Jun 5 18:43:00 UTC 2012

Daniel Carrera writes:

 > > 1) you can pick an identifier before you finish with the revision. This
 > > lets you write things like indexes while you are writing out the data.
 > 
 > This is probably a stupid question, but why is this important? Does
 > it help with speed or something? Clearly it must be important
 > because apparently hg and git have thought about it too...

Can't speak to hg, but I'm pretty sure Linus did *not* think about it
much.  git tracks file-sized chunks of content, not files.  If you
don't have content, you've got nothing to track, so Linus/git is happy
to assign ids once content is complete.

bzr on the other hand is designed to be able to track abstract files
(which need not ever actually exist until you try to commit them).

 > > Git handles it by not having the concept of an individual file
 > > history. You have to infer the history by walking through the
 > > inventory info.
 > 
 > Interesting. I always thought that not tracking files was just a
 > weird idiosyncrasy.

Idiosyncratic, yes, but not weird.  It's a fundamental design decision.

 > > 4) it decouples your identifiers from their current
 > > representation. If, for example, git decided it really wanted
 > > their tree entry to be in XML,

*snort*  Now, THAT ain't gonna happen!

 > > they would have to regenerate the sha hashes for the whole
 > > history.

This doesn't bother git, since (a) it does that all the time anyway
(for filter and rebase) and (b) it's quite fast enough because trees
and commits are small, and blobs "are" their hashes so you don't ever
need to regenerate their hashes.
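
To make 'blobs "are" their hashes' concrete, here is a minimal sketch in
Python, assuming git's SHA-1 object format (a "blob <size>\0" header
followed by the raw content).  The helper name git_blob_id is made up
for illustration; it is not part of git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id git would assign to a blob holding `content`."""
    # git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The id depends only on the bytes: no file name, no history, no parents.
same_a = git_blob_id(b"hello\n")
same_b = git_blob_id(b"hello\n")
other = git_blob_id(b"hello, world\n")
print(same_a == same_b, same_a == other)
```

Because nothing but the content goes into the hash, rewriting history
around a blob never requires rehashing the blob itself.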

 > > And without a map file, you couldn't incrementally pull
 > > in more data from another person who branched from somewhere in
 > > your history.

Git does have map files (they're called "grafts", but they're not very
well documented).

 > I appreciate you taking the time to explain. This is all very
 > interesting. I would like to understand what bzr does to ensure the
 > integrity of the repository. I am coming from Mercurial. I am
 > interested in security,

As far as I know, of git, hg, and monotone, only monotone claims that
the hashing implements security in any sense related to actual
attacks.  For the others, it's purely a matter of integrity, and it
"might" help with true security if you really know what you're doing.

So for example in git, if you use rebase at all, commits can get new
ids although nothing changes but their parent.  So you may have to
trust all the way back to the root to get "security", because under
normal operation you're not verifying trees, you're verifying commits,
but the latter have ids that are unstable.  So a digital signature on
a branch head doesn't help you much with the provenance of the rest of
the history.
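
A rough sketch of why the ids are unstable, again in Python and again
assuming git's SHA-1 object format.  The function names, the identity
line, and the two parent ids are all hypothetical, and a real rebase
also changes the committer timestamp, which this ignores.

```python
import hashlib


def git_object_id(kind: bytes, body: bytes) -> str:
    """Hash an object the way git does: '<kind> <size>\\0' + body."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree: str, parent: str, message: str) -> str:
    """Id of a single-parent commit; the identity line is a placeholder."""
    ident = "A U Thor <author@example.com> 1338921780 +0000"
    body = (
        f"tree {tree}\n"
        f"parent {parent}\n"
        f"author {ident}\n"
        f"committer {ident}\n"
        f"\n{message}\n"
    )
    return git_object_id(b"commit", body.encode("utf-8"))


TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # git's empty tree
OLD_PARENT = "1" * 40  # hypothetical parent before the rebase
NEW_PARENT = "2" * 40  # hypothetical parent after the rebase

before = commit_id(TREE, OLD_PARENT, "fix typo")
after = commit_id(TREE, NEW_PARENT, "fix typo")
print(before != after)  # same tree, same message, different commit id
```

The tree id is identical in both commits; only the parent line differs,
and that alone is enough to give the commit a new id.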

 > reliability, etc. Though I like Mercurial, I am starting a new
 > project and I'm eager to use this chance to learn a new VCS.