[MERGE] sha_file_by_name using raw os files; -Dhashcache
John Arbash Meinel
john at arbash-meinel.com
Fri Oct 5 18:10:07 BST 2007
Martin Pool wrote:
> While profiling towards https://bugs.edge.launchpad.net/bzr/+bug/146176
> it seemed that we were double-buffering files while hashing them. This
> seems about 10% faster but it's somewhat unstable to measure. If someone
> else would like to confirm or deny it that would be useful.
I did some testing on Manganese. And with the Mozilla tree (and forcing every
cached value to miss), I get:
time bzr.dev status
10 loops, best of 3: 8.03 sec per loop
time bzr.patched status
10 loops, best of 3: 6.08 sec per loop
time bzr.mmap status [1]
~6.5s
time bzr.subprocess status
9min 26s (Obviously not the way to go, and it seemed to give the wrong values
anyway. :)
I measure about 2ms to spawn sha1sum; times 49k files that is 98 sec, so I
think it will always be slower.
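For reference, a per-file sha1sum call would look roughly like this (a sketch, not necessarily the exact code benchmarked above; it shells out once per file, which is where the per-spawn overhead comes from):

```python
import subprocess

def sha_file_by_subprocess(fname):
    """Hash one file by spawning sha1sum (slow: one process per file).

    sha1sum prints '<hexdigest>  <filename>', so take the first field.
    """
    out = subprocess.check_output(['sha1sum', fname])
    return out.split()[0].decode('ascii')
```

At ~2ms of process-spawn overhead per call, 49k files costs on the order of 100 seconds before any hashing happens.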
[1]: this is the code I was using for mmap:
import mmap
import os
import sha

def sha_file_by_name(fname):
    """Calculate the SHA1 of a file by reading the full text"""
    fn = os.open(fname, os.O_RDONLY)
    try:
        # The documentation says you can use 0 to set it to the full
        # size of the file, but in testing this does not work
        size = os.fstat(fn).st_size
        if size == 0:
            return sha.new().hexdigest()
        mem = mmap.mmap(fn, size, access=mmap.ACCESS_READ)
        try:
            digest = sha.new(mem).hexdigest()
        finally:
            mem.close()
    finally:
        os.close(fn)
    return digest
I'm guessing that creating a couple of extra objects (an fstat result and an
mmap object) is why this is slower than just doing the read directly.
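For comparison, the read-based variant presumably looks something like this (a sketch only; hashlib.sha1 stands in for the old sha module, and the chunk size is my assumption, not taken from the patch):

```python
import hashlib
import os

def sha_file_by_name_read(fname, chunk_size=1 << 16):
    """Hash a file with raw os.read calls, feeding one hash object.

    No mmap object and no extra buffering layer: just read fixed-size
    chunks from the file descriptor and update the digest as we go.
    """
    s = hashlib.sha1()
    fd = os.open(fname, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            s.update(chunk)
    finally:
        os.close(fd)
    return s.hexdigest()
```

This also handles the empty-file case for free, since an empty read loop leaves the hash object in its initial state.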
So
BB:approve
It seems to be genuinely better for me. (Without sha1 the run is about 2s, so
the sha1 portion goes from 6s to 4s, or 50% faster.)
John
=:->