[MERGE] fast read-dir support for unix too

Mon Sep 22 22:56:58 BST 2008

John Arbash Meinel has voted tweak.
Status is now: Conditionally approved
Comment:
Your NEWS changes are slotting cleanly into the old 1.7 section, so 
we'll need to update that when merging.  (Topic for discussion, should 
we change the indenting of all NEWS entries as part of a release, to 
force conflicts if someone merges NEWS into an old area?)

If we are going to be switching how some of the extensions are named, we 
need to change the ignore list a bit.

=== modified file '.bzrignore'

--- .bzrignore  2008-09-03 06:31:11 +0000
+++ .bzrignore  2008-09-22 20:22:35 +0000
@@ -47,8 +47,8 @@
  doc/en/developer-guide/HACKING.txt
  doc/en/user-reference/bzr_man.txt
  # built extension modules
-bzrlib/_*_c.dll
-bzrlib/_*_c.so
-bzrlib/_*_c.pyd
+bzrlib/_*.dll
+bzrlib/_*.so
+bzrlib/_*.pyd
  # generated help topics
  doc/en/user-reference/*.txt

Is probably sufficient.


=== modified file 'bzrlib/_readdir_py.py'
...
+def _kind_from_mode(stat_mode, _formats=_formats, _unknown='unknown'):
+    """Generate a file kind from a stat mode. This is used in walkdirs.
+
+    Its performance is critical: Do not mutate without careful 
benchmarking.
      """
-    return [(None, name) for name in os.listdir(path)]
+    try:
+        return _formats[stat_mode & 0170000]
+    except KeyError:
+        return _unknown

s/Its/It's/

But it is kind of odd to have a "performance critical" comment in the 
python
implementation.


It is a bit odd to have "python setup.py build_ext -i" fail to build the
_readdir_pyx extension on win32 and have that be okay (and correct, in 
fact). I
think we probably want to change the setup.py logic so that it just 
skips the
file. This is probably sufficient:

=== modified file 'setup.py'
--- setup.py    2008-09-04 12:03:01 +0000
+++ setup.py    2008-09-22 20:25:57 +0000
@@ -228,13 +228,14 @@
  add_pyrex_extension('bzrlib._btree_serializer_c')
  add_pyrex_extension('bzrlib._dirstate_helpers_c')
  add_pyrex_extension('bzrlib._knit_load_data_c')
-add_pyrex_extension('bzrlib._readdir_pyx')
  if sys.platform == 'win32':
      # pyrex uses the macro WIN32 to detect the platform, even though it 
should
      # be using something like _WIN32 or MS_WINDOWS, oh well, we can 
give it the
      # right value.
      add_pyrex_extension('bzrlib._walkdirs_win32',
                          define_macros=[('WIN32', None)])
+else:
+    add_pyrex_extension('bzrlib._readdir_pyx')
  ext_modules.append(Extension('bzrlib._patiencediff_c', 
['bzrlib/_patiencediff_c.c']))


=== modified file 'bzrlib/_readdir_pyx.pyx'
...
+cdef extern from 'sys/stat.h':
+    cdef struct stat:
+        int st_mode
+        int st_size
+        int st_dev
+        int st_ino
+        int st_mtime
+        int st_ctime

^- I think these types are actually wrong. The only one I'm actually 
concerned
about is "int st_size". It is actually defined as "off_t" which is 
actually a
64-bit integer.
The only reason it matters is for files > 2^32. Defining it as a plain 
'int'
will cause pyrex to use PyInt_FromLong() rather than PyLongFrom... Which 
will
mean a 2^32+1 size file will be returned as size '1' in python. Even 
worse a
2^31+1 size file will be returned as some negative size.

(This is extra important, because you are directly exposing the 
conversion via
your properties.)
I think to get the correct conversion we need either:

long long st_size
or
unsigned long long st_size

I'm not terribly concerned about whether we get a file of size 2^64 
correct,
but I think we should get the files of size 2^32 correct.

Also, as this only compiles on linux, we don't need to worry that the 
correct
type is "__int64" for visual studio.

I *do* wonder if doing:

  property st_XXX:
    def __get__(self):
      return self._st.st_XXX

Is faster/slower than:


cdef class _Stat:

   cdef readonly int st_XXX


I know the former is translated into an explicit function, while the 
latter is
just put into the 'members' struct.
It might be worth benchmarking.

On the flip side, having an explicit wrapping of a 'struct stat' may be 
a great
way to have other pyrex code access the raw 'struct stat' directly.

...

+    cdef _kind_from_mode(self, int mode):
+        # in order of frequency:
+        if S_ISREG(mode):
+            return self._file
+        if S_ISDIR(mode):
+            return self._directory
+        if S_ISCHR(mode):
+            return self._chardev
+        if S_ISBLK(mode):
+            return self._block
+        if S_ISLNK(mode):
+            return self._symlink
+        if S_ISFIFO(mode):
+            return self._fifo
+        if S_ISSOCK(mode):
+            return self._socket
+        return self._unknown

^- I would assume symlinks are more common in working trees than CHR and 
BLK
devices. Not that it is a huge issue after you've handled files and
directories.



I notice that you pushed "lstat" into _read_dir(). Did you decide that 
stat-ing
in inode order was not a sufficient win to continue using it? Does that 
point
us to simplifying the _read_dir api?

I'll note that the docstring is now incorrect:
cdef _read_dir(path):
     """Like os.listdir, this reads the contents of a directory.

     :param path: the directory to list.
     :return: a list of (sort_key, basename) tuples.
     """

^- as _read_dir is now returning:
PyList_Append(result, (entry.d_ino, entry.d_name, None,
     statvalue, None))


I'll note that it is a little bit ugly to create a tuple and modify it
*in-place*, but as long as it is contained in the C code, and you 
haven't
called hash(tuple), etc, it is probably fine to do. Perhaps we should 
make it a
bit clearer that we are doing so?


+            # We have inode, name, None, statvalue, None
+            # inode -> path_from_top
+            # direct concat - faster than operator +.

# We have (inode, name, None, statvalue, None), which we will replace 
with
# (prefix_path, name, kind, statvalue, top_path)

Otherwise the read_dir() function looks like it is properly streamlined 
on the
standard path. We use direct C access for all of the common path code.


I'll note, though, that you no longer return entries in sorted order. I 
guess
that is okay, since you use a sorted() call in _walkdirs_utf8()

However, you need to change the DirReader api documentation:
  def read_dir(self, prefix, top):
      """Read a specific dir.

      :param prefix: A utf8 prefix to be preprended to the path 
basenames.
      :param top: A natively encoded path to read.
      :return: A sorted list of the directories contents. Each item 
contains:
          (utf8_relpath, utf8_name, kind, lstatvalue, native_abspath)
      """
      raise NotImplementedError(self.read_dir)

I'll also note that we can get rid of the sorted() in
UnicodeDirReader.read_dir() and the same code in walkdirs_win32.pyx



Also, things are quite broken for win32, I'm getting infinite recursion 
in
"file_kind_from_stat_mode_thunk". You are missing this line:
+            file_kind_from_stat_mode = _file_kind_from_stat_mode

After you import the correct version from the _readdir_py.py module.

So we should probably should have test that it can fallback correctly.


Oh, and you broke the _walkdirs_win32 code, because it was expecting to 
grab
"formats" from osutils.py. This fixes it for me:
=== modified file 'bzrlib/_walkdirs_win32.pyx'
--- bzrlib/_walkdirs_win32.pyx  2008-09-09 16:13:37 +0000
+++ bzrlib/_walkdirs_win32.pyx  2008-09-22 21:48:52 +0000
@@ -71,7 +71,7 @@
  import operator
  import stat

-from bzrlib import osutils
+from bzrlib import osutils, _readdir_py


  cdef class _Win32Stat:
@@ -158,8 +158,8 @@
      cdef object _file_kind

      def __init__(self):
-        self._directory_kind = osutils._directory_kind
-        self._file_kind = osutils._formats[stat.S_IFREG]
+        self._directory_kind = _readdir_py._directory
+        self._file_kind = _readdir_py._file

      def top_prefix_to_starting_dir(self, top, prefix=""):
          """See DirReader.top_prefix_to_starting_dir."""

=== modified file 'bzrlib/osutils.py'
--- bzrlib/osutils.py   2008-09-22 21:47:31 +0000
+++ bzrlib/osutils.py   2008-09-22 21:48:20 +0000
@@ -1565,8 +1565,9 @@
              file_kind_from_stat_mode = UTF8DirReader().kind_from_mode
          except ImportError:
              from bzrlib._readdir_py import (
-                _kind_from_mode as _file_kind_from_stat_mode
+                _kind_from_mode
                  )
+            file_kind_from_stat_mode = _kind_from_mode
      return file_kind_from_stat_mode(mode)
  file_kind_from_stat_mode = file_kind_from_stat_mode_thunk


I also think it would be reasonable to rename the "_walkdirs_win32.pyx" 
file to
something like "_readdir_win32_pyx.pyx" to fit the other naming.


It would be nice if bzrlib.test_osutils.TestWalkdirs was parameterized 
so that
it would run against all available DirReaders. But that isn't critical, 
and I
think you cover that with this comment in test_osutils:

+        # Use the unicode reader. TODO: split into driver-and-driven 
unit
+        # tests.
+        self._save_platform_info()
+        osutils._selected_dir_reader = osutils.UnicodeDirReader()



For details, see: 
http://bundlebuggy.aaronbentley.com/project/bzr/request/%3C1221617332.13146.0.camel%40lifeless-64%3E
Project: Bazaar