[MERGE] sha_file_by_name using raw os files; -Dhashcache

John Arbash Meinel john at arbash-meinel.com
Fri Oct 5 15:46:38 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> While profiling towards https://bugs.edge.launchpad.net/bzr/+bug/146176
> it seemed that we were double-buffering files while hashing them.  This
> seems about 10% faster but it's somewhat unstable to measure.  If someone
> else would like to confirm or deny it that would be useful.
> 
> 

If you were going to do this, why not just mmap the file?

Also, would this be a case for using O_DIRECT? (I'm guessing not, but it's just
a thought.)

Otherwise, you probably still have another buffer in there anyway (disk => OS
=> userspace).

So you could do:

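 import mmap
 import os
 import sha

 def sha_file_by_name(fname):
     """Calculate the SHA1 of a file by mmap'ing the full text"""
     mem = None
     f = os.open(fname, os.O_RDONLY)
     try:
         # ACCESS_READ lives in mmap, not os. Note that mmap'ing a
         # zero-length file fails on some platforms.
         mem = mmap.mmap(f, 0, access=mmap.ACCESS_READ)
         # Return the hex digest rather than the hash object
         return sha.new(mem).hexdigest()
     finally:
         if mem is not None:
             mem.close()
         os.close(f)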

I don't know if we would want to loop around sections of the mmap'd string, but
I thought the above construct would be an overall good thing (avoiding any
user-space buffers).
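
For what it's worth, looping over sections of the mapping could look roughly
like the sketch below. It is only an illustration under some assumptions: it
uses Python 3 and hashlib in place of 'sha', and the name
sha_file_by_name_mmap and the chunk_size parameter are made up for the
example. Note that slicing the mmap copies each section into a bytes object, so
this variant does reintroduce a (bounded) user-space buffer.

```python
import hashlib
import mmap
import os


def sha_file_by_name_mmap(fname, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of fname, hashing an mmap in sections.

    Hypothetical helper: the name and chunk_size are illustrative only.
    """
    s = hashlib.sha1()
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    fd = os.open(fname, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            # An empty file cannot be mmap'd; return the digest of b"".
            return s.hexdigest()
        mem = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        try:
            for start in range(0, size, chunk_size):
                # Each slice is copied out of the mapping before hashing.
                s.update(mem[start:start + chunk_size])
        finally:
            mem.close()
    finally:
        os.close(fd)
    return s.hexdigest()
```

Passing the whole mapping to the hash in one call avoids the per-section copy;
the loop mainly matters if you want to bound how much is handed over at once.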


I'm curious whether that will be better than something like:

import subprocess

_have_sha1sum = True # Assume that we have it at first

def sha_file_by_name(fname):
  global _have_sha1sum

  if _have_sha1sum:
    try:
      p = subprocess.Popen(['sha1sum', fname], stdout=subprocess.PIPE)
    except (IOError, OSError):
      # Check for ENOENT?
      _have_sha1sum = False
      return sha_file_by_name(fname)
    else:
      # communicate() returns (stdout, stderr); the hex digest is the
      # first 40 characters of stdout
      out, _ = p.communicate()
      return out[:40]
  else:
    ... # Pick your favorite in-process method

I believe 'sha' got a lot faster in Python 2.5 because it uses the SSL
libraries, but I don't know how that compares to 'sha1sum'. Note that I don't
think Windows has 'sha1sum', and I don't have it (by default) on my Mac.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBk5OJdeBCYSNAAMRAiYVAJ9vWFteRFYJCz3SD1rPeAPZ8outpwCgubhV
IwOdJ0APLCyvzE9z+CwLuVA=
=vs3V
-----END PGP SIGNATURE-----


