Storage internals: UUID

Mon Jun 4 20:17:21 UTC 2012

We don't use "UUID" in the sense of the spec, just IDs that are similarly
likely to be globally unique (enough).
We don't use the sha hash as the identifier, for a variety of reasons. We do
track the sha1 hash of revisions for integrity/security checking. Some of
the reasons to use a separate identifier:
1) You can pick an identifier before you finish with the revision. This
lets you write things like indexes while you are writing out the data. Hg
'cheats' this by using a reference of "the revision at position $int".
However, if a commit fails partway, the recovery is to truncate the files so
that they don't contain invalid pointers. Git handles it by not having the
concept of an individual file history. You have to infer file history by
walking through the inventory info.
2) Reflection of data in 3rd-party storage. Via bzr-svn/git/hg we are able
to treat another VCS as just another bzr-compatible branch (e.g. you can use
bzr log "svn://...."). It is similar to using a map file, but the mapping is
stored in the identifier itself, rather than having to transmit, store, and
share another file.
3) Along those lines, it lets you talk about revisions that you've never
seen. So if such a revision gets converted in the future, it gets
auto-grafted into the right location in history.
4) It decouples your identifiers from their current representation. If, for
example, git decided it really wanted its tree entries to be in XML, they
would have to regenerate the sha hashes for the whole history. And without
a map file, you couldn't incrementally pull in more data from another
person who branched from somewhere in your history. (Format upgrades can be
done independently by different users and over time, not in lock-step.)
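
To make the four points a bit more concrete, here is a minimal Python
sketch. It is not bzr's actual code or on-disk format: the ID layouts, the
Store class and every function name below are made up for illustration. It
only shows the general shape: the ID is chosen before the content exists,
the sha1 is tracked separately for integrity checks, foreign revisions get
a deterministic ID, and a parent can be named before it has been seen.

```python
import hashlib
import os
import time


def make_revision_id(committer: str) -> str:
    """Pick an ID before any content exists (hypothetical format)."""
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime())
    return f"{committer}-{stamp}-{os.urandom(8).hex()}"


def foreign_revision_id(vcs: str, repo: str, revnum: int) -> str:
    """Deterministic ID for a revision in another VCS (hypothetical format).

    Everyone who converts the same foreign revision derives the same ID,
    so no separate map file has to be shared.
    """
    return f"{vcs}:{repo}:{revnum}"


class Store:
    """Toy in-memory store; a stand-in for a real repository."""

    def __init__(self) -> None:
        self.parents = {}  # revision ID -> list of parent IDs (the "index")
        self.sha1 = {}     # revision ID -> sha1 of the serialized revision

    def commit(self, rev_id, parents, chunks):
        # The index entry is written first: the ID does not depend on content.
        self.parents[rev_id] = list(parents)
        digest = hashlib.sha1()
        for chunk in chunks:
            digest.update(chunk)
        # The hash is recorded for integrity checking only.
        self.sha1[rev_id] = digest.hexdigest()

    def verify(self, rev_id, data: bytes) -> bool:
        return hashlib.sha1(data).hexdigest() == self.sha1[rev_id]

    def missing_parents(self, rev_id):
        """Parents that are referenced but not (yet) present locally."""
        return [p for p in self.parents[rev_id] if p not in self.parents]


store = Store()
svn_parent = foreign_revision_id("svn", "example-repo", 42)  # never fetched
rev_id = make_revision_id("john@example.com")
store.commit(rev_id, [svn_parent], [b"tree ", b"contents"])
```

If the svn revision is later converted and committed under the same ID,
missing_parents(rev_id) becomes empty without touching rev_id, which is the
"auto-grafted" behaviour from point 3. Likewise, reserializing the content
would change the stored sha1 but not the identifier (point 4).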

There are downsides to it, but it isn't just that we do it differently
without cause. These were explicit choices made along the way.
John
=:->
On Jun 4, 2012 9:16 PM, "Daniel Carrera" <dcarrera at hush.com> wrote:

> Hello,
>
> I'm interested in getting to know how Bazaar stores data internally (links
> would be welcome, the "developer" pages seem to cover more of the API
> rather than how things work). I have read in some forum somewhere that bzr
> uses UUIDs instead of SHA1 hashes like Mercurial and Git. If this is
> correct, I'd like to ask a few questions:
>
> 1. Why was the decision made for UUIDs instead of SHA1? What were the pros
> and cons discussed?
>
> 2. Which version of UUID does bzr use? There are three versions. Versions
> 3 and 5 use hashes (MD5 and SHA1 resp) and as I understand it, there is no
> set rule as to how to generate the hash. In other words, my impression is
> that it is legal to take a SHA1 of the revision contents and metadata and
> use that to produce the UUID. In fact, I wonder if this might be what bzr
> already does.
>
> 3. Is anyone watching the evolution of the SHA3 specification? NIST is
> supposed to select the SHA3 algorithm this year. This means that the next
> revision of the bzr format could use the freshly minted SHA3 algorithm for
> its UUIDs. You don't have to wait for RFC 4122 to be updated. In their
> wisdom, the creators of UUID included version 4 which is "random". Since
> SHA3 is a valid pseudo-random number generator, you could use SHA3 to make
> the UUID.
>
> Thoughts?
>
> Cheers,
> Daniel.
> --
> Linux: Because rebooting is for installing hardware.