[MERGE] fast read-dir support for unix too
John Arbash Meinel
john at arbash-meinel.com
Mon Sep 22 22:56:58 BST 2008
John Arbash Meinel has voted tweak.
Status is now: Conditionally approved
Comment:
Your NEWS changes are slotting cleanly into the old 1.7 section, so
we'll need to update that when merging. (Topic for discussion, should
we change the indenting of all NEWS entries as part of a release, to
force conflicts if someone merges NEWS into an old area?)
If we are going to be switching how some of the extensions are named, we
need to change the ignore list a bit.
=== modified file '.bzrignore'
--- .bzrignore 2008-09-03 06:31:11 +0000
+++ .bzrignore 2008-09-22 20:22:35 +0000
@@ -47,8 +47,8 @@
doc/en/developer-guide/HACKING.txt
doc/en/user-reference/bzr_man.txt
# built extension modules
-bzrlib/_*_c.dll
-bzrlib/_*_c.so
-bzrlib/_*_c.pyd
+bzrlib/_*.dll
+bzrlib/_*.so
+bzrlib/_*.pyd
# generated help topics
doc/en/user-reference/*.txt
Is probably sufficient.
=== modified file 'bzrlib/_readdir_py.py'
...
+def _kind_from_mode(stat_mode, _formats=_formats, _unknown='unknown'):
+ """Generate a file kind from a stat mode. This is used in walkdirs.
+
+ Its performance is critical: Do not mutate without careful
benchmarking.
"""
- return [(None, name) for name in os.listdir(path)]
+ try:
+ return _formats[stat_mode & 0170000]
+ except KeyError:
+ return _unknown
s/Its/It's/
But it is kind of odd to have a "performance critical" comment in the
python
implementation.
It is a bit odd to have "python setup.py build_ext -i" fail to build the
_readdir_pyx extension on win32 and have that be okay (and correct, in
fact). I
think we probably want to change the setup.py logic so that it just
skips the
file. This is probably sufficient:
=== modified file 'setup.py'
--- setup.py 2008-09-04 12:03:01 +0000
+++ setup.py 2008-09-22 20:25:57 +0000
@@ -228,13 +228,14 @@
add_pyrex_extension('bzrlib._btree_serializer_c')
add_pyrex_extension('bzrlib._dirstate_helpers_c')
add_pyrex_extension('bzrlib._knit_load_data_c')
-add_pyrex_extension('bzrlib._readdir_pyx')
if sys.platform == 'win32':
# pyrex uses the macro WIN32 to detect the platform, even though it
should
# be using something like _WIN32 or MS_WINDOWS, oh well, we can
give it the
# right value.
add_pyrex_extension('bzrlib._walkdirs_win32',
define_macros=[('WIN32', None)])
+else:
+ add_pyrex_extension('bzrlib._readdir_pyx')
ext_modules.append(Extension('bzrlib._patiencediff_c',
['bzrlib/_patiencediff_c.c']))
=== modified file 'bzrlib/_readdir_pyx.pyx'
...
+cdef extern from 'sys/stat.h':
+ cdef struct stat:
+ int st_mode
+ int st_size
+ int st_dev
+ int st_ino
+ int st_mtime
+ int st_ctime
^- I think these types are actually wrong. The only one I'm actually
concerned
about is "int st_size". It is actually defined as "off_t" which is
actually a
64-bit integer.
The only reason it matters is for files > 2^32. Defining it as a plain
'int'
will cause pyrex to use PyInt_FromLong() rather than PyLongFrom... Which
will
mean a 2^32+1 size file will be returned as size '1' in python. Even
worse a
2^31+1 size file will be returned as some negative size.
(This is extra important, because you are directly exposing the
conversion via
your properties.)
I think to get the correct conversion we need either:
long long st_size
or
unsigned long long st_size
I'm not terribly concerned about whether we get a file of size 2^64
correct,
but I think we should get the files of size 2^32 correct.
Also, as this only compiles on linux, we don't need to worry that the
correct
type is "__int64" for visual studio.
I *do* wonder if doing:
property st_XXX:
def __get__(self):
return self._st.st_XXX
Is faster/slower than:
cdef class _Stat:
cdef readonly int st_XXX
I know the former is translated into an explicit function, while the
latter is
just put into the 'members' struct.
It might be worth benchmarking.
On the flip side, having an explicit wrapping of a 'struct stat' may be
a great
way to have other pyrex code access the raw 'struct stat' directly.
...
+ cdef _kind_from_mode(self, int mode):
+ # in order of frequency:
+ if S_ISREG(mode):
+ return self._file
+ if S_ISDIR(mode):
+ return self._directory
+ if S_ISCHR(mode):
+ return self._chardev
+ if S_ISBLK(mode):
+ return self._block
+ if S_ISLNK(mode):
+ return self._symlink
+ if S_ISFIFO(mode):
+ return self._fifo
+ if S_ISSOCK(mode):
+ return self._socket
+ return self._unknown
^- I would assume symlinks are more common in working trees than CHR and
BLK
devices. Not that it is a huge issue after you've handled files and
directories.
I notice that you pushed "lstat" into _read_dir(). Did you decide that
stat-ing
in inode order was not a sufficient win to continue using it? Does that
point
us to simplifying the _read_dir api?
I'll note that the docstring is now incorrect:
cdef _read_dir(path):
"""Like os.listdir, this reads the contents of a directory.
:param path: the directory to list.
:return: a list of (sort_key, basename) tuples.
"""
^- as _read_dir is now returning:
PyList_Append(result, (entry.d_ino, entry.d_name, None,
statvalue, None))
I'll note that it is a little bit ugly to create a tuple and modify it
*in-place*, but as long as it is contained in the C code, and you
haven't
called hash(tuple), etc, it is probably fine to do. Perhaps we should
make it a
bit clearer that we are doing so?
+ # We have inode, name, None, statvalue, None
+ # inode -> path_from_top
+ # direct concat - faster than operator +.
# We have (inode, name, None, statvalue, None), which we will replace
with
# (prefix_path, name, kind, statvalue, top_path)
Otherwise the read_dir() function looks like it is properly streamlined
on the
standard path. We use direct C access for all of the common path code.
I'll note, though, that you no longer return entries in sorted order. I
guess
that is okay, since you use a sorted() call in _walkdirs_utf8()
However, you need to change the DirReader api documentation:
def read_dir(self, prefix, top):
"""Read a specific dir.
:param prefix: A utf8 prefix to be preprended to the path
basenames.
:param top: A natively encoded path to read.
:return: A sorted list of the directories contents. Each item
contains:
(utf8_relpath, utf8_name, kind, lstatvalue, native_abspath)
"""
raise NotImplementedError(self.read_dir)
I'll also note that we can get rid of the sorted() in
UnicodeDirReader.read_dir() and the same code in walkdirs_win32.pyx
Also, things are quite broken for win32, I'm getting infinite recursion
in
"file_kind_from_stat_mode_thunk". You are missing this line:
+ file_kind_from_stat_mode = _file_kind_from_stat_mode
After you import the correct version from the _readdir_py.py module.
So we should probably should have test that it can fallback correctly.
Oh, and you broke the _walkdirs_win32 code, because it was expecting to
grab
"formats" from osutils.py. This fixes it for me:
=== modified file 'bzrlib/_walkdirs_win32.pyx'
--- bzrlib/_walkdirs_win32.pyx 2008-09-09 16:13:37 +0000
+++ bzrlib/_walkdirs_win32.pyx 2008-09-22 21:48:52 +0000
@@ -71,7 +71,7 @@
import operator
import stat
-from bzrlib import osutils
+from bzrlib import osutils, _readdir_py
cdef class _Win32Stat:
@@ -158,8 +158,8 @@
cdef object _file_kind
def __init__(self):
- self._directory_kind = osutils._directory_kind
- self._file_kind = osutils._formats[stat.S_IFREG]
+ self._directory_kind = _readdir_py._directory
+ self._file_kind = _readdir_py._file
def top_prefix_to_starting_dir(self, top, prefix=""):
"""See DirReader.top_prefix_to_starting_dir."""
=== modified file 'bzrlib/osutils.py'
--- bzrlib/osutils.py 2008-09-22 21:47:31 +0000
+++ bzrlib/osutils.py 2008-09-22 21:48:20 +0000
@@ -1565,8 +1565,9 @@
file_kind_from_stat_mode = UTF8DirReader().kind_from_mode
except ImportError:
from bzrlib._readdir_py import (
- _kind_from_mode as _file_kind_from_stat_mode
+ _kind_from_mode
)
+ file_kind_from_stat_mode = _kind_from_mode
return file_kind_from_stat_mode(mode)
file_kind_from_stat_mode = file_kind_from_stat_mode_thunk
I also think it would be reasonable to rename the "_walkdirs_win32.pyx"
file to
something like "_readdir_win32_pyx.pyx" to fit the other naming.
It would be nice if bzrlib.test_osutils.TestWalkdirs was parameterized
so that
it would run against all available DirReaders. But that isn't critical,
and I
think you cover that with this comment in test_osutils:
+ # Use the unicode reader. TODO: split into driver-and-driven
unit
+ # tests.
+ self._save_platform_info()
+ osutils._selected_dir_reader = osutils.UnicodeDirReader()
For details, see:
http://bundlebuggy.aaronbentley.com/project/bzr/request/%3C1221617332.13146.0.camel%40lifeless-64%3E
Project: Bazaar
More information about the bazaar
mailing list