Google Summer of Code: Encrypted branch/repository format status

Tue Jul 17 16:34:37 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> On 7/17/07, John Arbash Meinel <john at arbash-meinel.com> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Bogdano Arendartchuk wrote:
>> > Hello,
>> >
>> > I'm working on the encrypted repository and branch format for Bazaar.
>> >
>> > Currently I'm coding a repository format that is intended to write all
>> > the data to disk slightly scrambled. This is a prototype and nothing is
>> > encrypted at all; the objective is to get to know the Bazaar code/design
>> > better and also to plan what can be reused and what should be
>> > reimplemented in order to fit the application's needs.
> 
> The other question in my mind is whether this really needs to come in
> at the knit level.  Could you instead interpose an encrypting
> transport for access to some files?  I realize random access may be a
> bit hard to get just right, but that's probably no worse than for
> doing it in a knit... The transport interface is pretty stable.

Transport would probably be interesting. I'm not sure how it would
encrypt the filenames in the same way, but it at least seems possible.
You could have a custom BzrDir that at .open() time would re-wrap its
Transport in an EncryptedTransportWrapper, which would munge filenames
and file contents as appropriate.
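
A minimal sketch of such a wrapper, under loud assumptions: this is not
bzrlib code. `DictTransport` is a hypothetical in-memory stand-in for a real
Transport, the `get_bytes`/`put_bytes` method names are used only for
illustration, and the HMAC keystream XOR is a placeholder scrambler, not
real encryption.

```python
import hashlib
import hmac


class DictTransport:
    """Hypothetical in-memory stand-in for a real transport."""

    def __init__(self):
        self.files = {}

    def put_bytes(self, relpath, data):
        self.files[relpath] = data

    def get_bytes(self, relpath):
        return self.files[relpath]


class EncryptedTransportWrapper:
    """Wrap another transport, munging filenames and scrambling contents."""

    def __init__(self, transport, key, munge_filenames=True):
        self._transport = transport
        self._key = key
        self._munge_filenames = munge_filenames

    def _munge(self, relpath):
        if not self._munge_filenames:
            return relpath
        # Keyed hash per path segment: the directory layout survives,
        # but the names themselves are no longer readable.
        return "/".join(
            hmac.new(self._key, part.encode("utf-8"), hashlib.sha256).hexdigest()
            for part in relpath.split("/")
        )

    def _keystream(self, relpath, length):
        # Placeholder only: a real implementation would use a vetted cipher.
        out = b""
        counter = 0
        while len(out) < length:
            msg = relpath.encode("utf-8") + b"\0" + str(counter).encode("ascii")
            out += hmac.new(self._key, msg, hashlib.sha256).digest()
            counter += 1
        return out[:length]

    def _scramble(self, relpath, data):
        # XOR is its own inverse, so this both scrambles and unscrambles.
        stream = self._keystream(relpath, len(data))
        return bytes(a ^ b for a, b in zip(data, stream))

    def put_bytes(self, relpath, data):
        self._transport.put_bytes(self._munge(relpath),
                                  self._scramble(relpath, data))

    def get_bytes(self, relpath):
        stored = self._transport.get_bytes(self._munge(relpath))
        return self._scramble(relpath, stored)


backing = DictTransport()
wrapped = EncryptedTransportWrapper(backing, key=b"not-a-real-key")
wrapped.put_bytes("knits/foo.knit", b"secret text")
```

Callers read back the original bytes through `wrapped.get_bytes()`, while
`backing.files` holds only munged names and scrambled contents.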

I really think that any security feature of this sort deserves a bit of
discussion about what sort of attacks are possible, and what it is
trying to defend against.

Specifically, some things that I don't think he is trying to hide or
defend against:

a) That there is a Bazaar Branch of some sort at that location. (This
would affect discovery, and some other things). Specifically, you would
probably want to obfuscate the '.bzr' directory if you were trying to
avoid this. Since we aren't, it means that .bzr/ and the general
meta-information files can stay put (you don't have to hide where
.bzr/repository/inventory.knit is, you just want to hide the *contents*
of the file).

b) When information is added to the repository. It would be possible to
pre-allocate a certain amount of data and fill it with randomness, and
always modify some of the randomness to hide just what is being updated
and when. I'm pretty sure we aren't trying for this level of security.
If someone really wants more, they should look into something like
TrueCrypt and mount an encrypted filesystem, and then publish their
encrypted volume.

c) We aren't trying to prevent a hacker who has local root access from
reading the contents from memory. Not to mention that once you have
done a checkout, you have the real contents on disk.

d) We aren't trying to prevent someone who *has* access from
accidentally checking out a working copy in the public location (maybe
we are).

e) I think we are looking mostly at an "I want to host my project on a
'public' location, but don't want everyone to have access to it" use case.
In this case, I'm not sure if we need to hide file-ids, though we might,
since in the current bzr code they reveal some information about filenames.

I believe the current codebase is decent about having a transport
pointed at '.bzr/repository' and another one pointed at
'.bzr/repository/knits/', and then all access for all Knits goes through
the 'weave_transport' (which is the one pointing at Knits).

So it would be possible to override '_get_versioned_file_store()' and
set the 'munge_filenames=True' flag for that transport, while the other
transports would only have the 'encrypt_file_contents=True'.

You *wouldn't* want to encrypt the .bzr/branch-format file, and probably
want to leave the README, etc alone. But most likely all other files
could have their contents encrypted.
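
The per-path policy described above could be sketched as a small lookup.
This is only an illustration: `transport_policy` and the two constants are
hypothetical names, the flag names are the ones floated above rather than an
existing bzrlib option, and the exact list of plaintext files is an
assumption.

```python
import posixpath

# Files that stay readable so the format can still be detected (assumed list).
PLAINTEXT_FILES = {".bzr/branch-format", ".bzr/README"}
# Everything under here goes through the knit ('weave') transport.
KNIT_PREFIX = ".bzr/repository/knits/"


def transport_policy(relpath):
    """Return (munge_filenames, encrypt_file_contents) for a path."""
    path = posixpath.normpath(relpath)
    if path in PLAINTEXT_FILES:
        return (False, False)
    if path.startswith(KNIT_PREFIX):
        # Knit filenames can leak file-ids, so hide names as well as contents.
        return (True, True)
    # Other files keep their location but hide their contents.
    return (False, True)
```

With this, '.bzr/repository/inventory.knit' keeps its name but has its
contents encrypted, while anything under '.bzr/repository/knits/' gets both.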

> 
>> And we really should add the same tests I did for DirState. Such that if
>> we can import _knit_load_data_c, then knit._load_data should be the right
>> function.
>> (Modulo any naming changes).
>>
>> It also brings up the thought of what we should name the extension module,
>> since the name is now changing. We could do "_knit_helpers" to be closer to
>> _dirstate_helpers (which is also better if we add more extension
>> functions).
> 
> That sounds reasonable - what was it in your current code?

The code that was merged used '_knit_load_data_c.pyx' and
'_knit_load_data_py.py', because there was only one function, which was
'knit._load_data'. But in general my pyrex code has been written at
different times, and thus with different mindsets about how things could
be laid out.

I've also considered having a single pyrex extension file, which just
includes all of the extensions we have written, and just calling that
_bzrlib. I think it can be done with something like:

Extension('_bzrlib', ['bzrlib/_dirstate_helpers_c.pyx',
                      'bzrlib/_knit_helpers.pyx',
                     ],
          libraries = [],
         )

However, I'm not 100% positive how to take multiple .pyx files and
combine them into a single extension. Mostly from a 'from foo import
bar' standpoint, I don't know how that works exactly.

I kind of like having a single .so file, rather than lots of them, but
in the long term it probably doesn't matter. (There are tradeoffs in
both directions).

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGnOGNJdeBCYSNAAMRAiosAKC2BWOU7aM9+bugIet2RxLUIBwLjQCfUnrh
Vhre6RSOOFLoYSej5wxUrHM=
=H1Jg
-----END PGP SIGNATURE-----