Benefits of walkdirs in utf8 instead of unicode
John Arbash Meinel
john at arbash-meinel.com
Wed Feb 28 18:13:28 GMT 2007
Well, I've been profiling stuff like 'bzr status' with the new dirstate
code. And I've been seeing a lot of calls to encodings/utf_8.py
decode/encode. And when you look closer, you can see that a lot of these
calls are actually happening because of os.listdir() calls.
And since dirstate itself has all the paths in utf8, on systems that use
utf8 filesystem encoding, we don't really need to decode just to encode
again.
So I wrote _walkdirs_utf8, with the idea that on platforms with either
ASCII or UTF-8 filesystem encoding we can just walk the paths without
any encode/decode steps. On Windows, we can actually walk in Unicode and
encode each path into utf8. And I'm thinking that for all other
encodings we'll just do something like the Windows walk.
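To make the dispatch concrete, here is a minimal sketch of the idea in Python 3. It is not the bzrlib code: the names walk_utf8, _walk_bytes and _walk_unicode_then_encode are made up for illustration, and it only yields directory listings, where the real walkdirs returns more information per entry.

```python
import codecs
import os
import sys


def _walk_bytes(top):
    # Bytes in, bytes out: os.listdir() does no decoding for bytes paths,
    # so on a UTF-8 or ASCII filesystem the names can be used as-is.
    pending = [top]
    while pending:
        current = pending.pop()
        names = sorted(os.listdir(current))
        yield current, names
        for name in names:
            path = os.path.join(current, name)
            if os.path.isdir(path) and not os.path.islink(path):
                pending.append(path)


def _walk_unicode_then_encode(top):
    # Walk with unicode paths and encode each result into UTF-8.
    # "surrogateescape" lets names that could not be decoded pass through
    # as their original bytes, which may not be valid UTF-8.
    pending = [top]
    while pending:
        current = pending.pop()
        names = sorted(os.listdir(current))
        yield (current.encode("utf-8", "surrogateescape"),
               [n.encode("utf-8", "surrogateescape") for n in names])
        for name in names:
            path = os.path.join(current, name)
            if os.path.isdir(path) and not os.path.islink(path):
                pending.append(path)


def walk_utf8(top="."):
    """Yield (directory, [names]) pairs, with every path as UTF-8 bytes."""
    # codecs.lookup() normalizes aliases such as "UTF8" or "ANSI_X3.4-1968".
    fs_encoding = codecs.lookup(sys.getfilesystemencoding()).name
    if fs_encoding in ("utf-8", "ascii"):
        return _walk_bytes(os.fsencode(top))
    return _walk_unicode_then_encode(os.fsdecode(top))
```

Calling walk_utf8('.') gives bytes paths in either branch, so the caller can compare them directly against utf8 paths stored elsewhere.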
It means that we might also get some paths which are not valid utf8. But
I'm expecting those to be taken care of at a higher level. For example
"bzr add" can fail if trying to add a non-utf8 path, but _iter_changes
can just ignore the fact.
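As a rough sketch of what "taken care of at a higher level" could look like, a caller that needs valid utf8 can check each raw path itself. The helper names below are hypothetical and are not part of bzrlib.

```python
def is_valid_utf8(raw_path):
    """Return True if the bytes path decodes cleanly as UTF-8."""
    try:
        raw_path.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def split_by_validity(raw_paths):
    """Separate bytes paths into (valid, invalid) lists, keeping order."""
    valid, invalid = [], []
    for raw_path in raw_paths:
        (valid if is_valid_utf8(raw_path) else invalid).append(raw_path)
    return valid, invalid
```

A strict command such as "add" could refuse anything in the invalid list, while a read-only comparison could simply carry those paths along untouched.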
There is a pretty big difference in performance. On the large Mozilla
tree (5,766 directories, approx 55k total entries), a plain
walkdirs(u'.') call takes 2.5s, while a _walkdirs_utf8('.') call takes
1.75 - 1.8s.
For a launchpad sized tree (~4600 entries) the difference is 145ms
versus 100ms.
As a comparison,
% python -m timeit -s "from subprocess import call, PIPE; import os;" \
-s "devnull = open('/dev/null', 'rb+')" \
"call(['find', '.', '-size', '+0'], stdout=devnull)"
seems to say that find takes about 38ms (time find ... takes 60ms).
So how about a summary table (num is the number of entries, times are
in ms):

              num   find  walkdir  _utf8  os.walk
bzr.dev      1018   11.6     29.4   22.8     33.6
launchpad    4638   39      143     99      167
mozilla     54775  600     2210   1740     2590
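For anyone wanting a similar comparison on their own tree, here is a small sketch using only the standard library. It times os.walk() with a unicode and a bytes starting path; it does not call bzrlib's walkdirs, the function names are made up, and the numbers it produces will not match the table above.

```python
import os
import timeit


def count_entries(top):
    """Count every file and directory name below top using os.walk()."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(top):
        total += len(dirnames) + len(filenames)
    return total


def time_walk_ms(top, repeat=3):
    """Best-of-N wall-clock time for one full walk, in milliseconds."""
    timings = timeit.repeat(lambda: count_entries(top), number=1, repeat=repeat)
    return min(timings) * 1000.0
```

Comparing time_walk_ms('.') with time_walk_ms(b'.') gives a rough idea of the cost of decoding names. As with the numbers above, repeated runs measure cached speed.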
I don't think we'll really get to the speed of find. Though it would be
nice. I'm a little surprised that we aren't a lot faster than os.walk,
but then again all of this is cached speed, so the change in walk
pattern doesn't really show up yet. Also, walkdir() is returning a lot
more information (so we don't have to stat the files again, etc).
I'll let you know how it affects _iter_changes().
John
=:->