[MERGE] Transport.readv() closes the file handle

John Arbash Meinel john at arbash-meinel.com
Mon Oct 27 15:23:12 GMT 2008


Vincent Ladeuil wrote:
>>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:
>     john> The default implementation of Transport.readv(), and
>     john> the one used by LocalTransport, is to do a regular
>     john> ".get()" and then call .seek() and .read() on the
>     john> resulting file object.
>     john> This works fine, but it turns out that it leaves the
>     john> file handle open on Windows.
> It leaves it open everywhere ;)

Sure, but I only noticed on Windows because an open handle locks the file.

>     john> Probably the biggest reason is that most of the callers
>     john> of .readv() actually iterate in lock-step with some
>     john> other iterator (which names the return values).
> I challenge the 'probably', I seem to remember that callers call
> the iterator to completion, but _seek_and_read lacks a close
> anyway.

So, my analysis might be slightly off, but part of the problem is this
code at the end of _seek_and_read():

while cur_offset_and_size in data_map:
    this_data = data_map.pop(cur_offset_and_size)
    yield cur_offset_and_size[0], this_data
    cur_offset_and_size = offset_stack.next()

Specifically, when the 'offset_stack' is empty, this will raise a
StopIteration exception, not exit cleanly. We *could do*:

try:
    cur_offset_and_size = offset_stack.next()
except StopIteration:
    return  # or 'raise'
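A minimal, self-contained sketch of that fix (hypothetical names, and plain next() so it runs on modern Python as well):

```python
def yield_ready(offset_stack, data_map):
    # Sketch of the loop at the end of _seek_and_read(): pull offsets off
    # the stack and yield the matching data, but turn the terminating
    # StopIteration into a clean return instead of letting it escape.
    cur = next(offset_stack)
    while cur in data_map:
        yield cur[0], data_map.pop(cur)
        try:
            cur = next(offset_stack)
        except StopIteration:
            return  # stack exhausted; exit the generator cleanly

pairs = list(yield_ready(iter([(0, 1), (5, 1)]),
                         {(0, 1): b'1', (5, 1): b'6'}))
# pairs is [(0, b'1'), (5, b'6')]
```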

>     john> Because of this, the readv() generator never runs to
>     john> completion, and thus the file handle doesn't get closed
>     john> and garbage collected right away.
> I can't find any place where that file handle was closed before
> your patch so your description is a bit misleading no ?

File handles get closed when the last reference to them is dropped. If the
readv generator is not fully consumed, it still holds a reference to the
file until the generator itself goes out of scope.
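To illustrate (a hypothetical readv_like helper, with io.BytesIO standing in for a real file):

```python
import io

def readv_like(fobj, offsets):
    # Simplified readv-style generator: while suspended, its frame keeps
    # a live reference to fobj, so fobj cannot be garbage collected.
    for start, length in offsets:
        fobj.seek(start)
        yield start, fobj.read(length)

buf = io.BytesIO(b'123456789\n')
gen = readv_like(buf, [(0, 1), (5, 1)])
first = next(gen)  # generator is now suspended, still holding buf
# Only when gen is exhausted, closed, or collected does that reference
# go away; with a real file, that is when the OS handle is freed.
del gen
```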

>     john> The attached patch changes it so that we know when we
>     john> are done reading data from the file, and close it.
>     john> I ran into this because in my "pack retry" code the
>     john> test was failing on Windows because the .pack file
>     john> handle was still open, even though I know we were done
>     john> reading that file.
> Since you have a failing test, can you try doing the close in the
> generator before a final return, to get a better understanding of
> the issue ? Or are you already sure that the generator never runs
> to completion ?

Before a final return? You mean outside the loops? I guarantee it isn't
reached as I can do:
=== modified file 'bzrlib/transport/__init__.py'
--- bzrlib/transport/__init__.py        2008-10-06 06:34:36 +0000
+++ bzrlib/transport/__init__.py        2008-10-27 13:56:52 +0000
@@ -690,6 +690,7 @@
                 this_data = data_map.pop(cur_offset_and_size)
                 yield cur_offset_and_size[0], this_data
                 cur_offset_and_size = offset_stack.next()
+        badcode

     def _sort_expand_and_combine(self, offsets, upper_limit):
         """Helper for readv.

And nothing breaks.

>     john>          # turn the list of offsets into a stack
>     john> @@ -688,6 +689,12 @@
>     john>              # Now that we've read some data, see if we can yield anything back
>     john>              while cur_offset_and_size in data_map:
>     john>                  this_data = data_map.pop(cur_offset_and_size)
>     john> +                remaining -= 1
>     john> +                if remaining <= 0:
>     john> +                    # Close the file handle as we are done yielding data.
> I think we need a more strong comment here explaining the issue
> if anybody encounters it in the future.
> Also, you're relying on the callers consuming all the data, if
> one of them ever consume only part of it, your fix is not
> triggered.
> I don't know of any case for the nominal uses, but if an error
> occurs during the readv, the file will remain open anyway.

Sure, but we don't get try/finally in a generator unless we depend
strictly on Python 2.5+, and file handles are closed when the file object
is cleaned up.
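For completeness, on Python 2.5 and later the generator could own the handle outright. A sketch with a hypothetical helper, not the actual Transport code:

```python
import os
import tempfile

def readv_with_cleanup(path, offsets):
    # try/finally inside a generator is legal from Python 2.5 on, so the
    # handle gets closed whether the generator is exhausted normally or
    # finalized early via its close() method / garbage collection.
    fobj = open(path, 'rb')
    try:
        for start, length in offsets:
            fobj.seek(start)
            yield start, fobj.read(length)
    finally:
        fobj.close()

# Usage sketch:
fd, path = tempfile.mkstemp()
os.write(fd, b'123456789\n')
os.close(fd)
data = list(readv_with_cleanup(path, [(0, 1), (8, 1)]))
os.remove(path)
```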

> An alternative may be to keep track of the file handle and close
> it whenever another readv (or get ?) is done on the transport.
> But this has drawbacks too.
>     john> +                    close = getattr(fp, 'close', None)
>     john> +                    if close is not None:
> StringIO *has* a close() attribute, I think you're overly
> cautious here and may hide a potential problem if _seek_and_read
> is used with a file handle *without* a one (do you have one class
> in mind ?).

Just that we've never had .close() as part of the Transport contract for
what is returned from Transport.get(). We could add it, but it hasn't
been there previously. (We *have* required seek/tell/read because we
have needed them at various times.)
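The getattr dance in the patch amounts to a tiny defensive helper (illustrative only; close_if_possible is not a real bzrlib function):

```python
import io

def close_if_possible(fp):
    # Transport.get() has only ever guaranteed seek/tell/read on the
    # returned object, so probe for close() rather than assuming it.
    close = getattr(fp, 'close', None)
    if close is not None:
        close()

buf = io.BytesIO(b'data')
close_if_possible(buf)        # closes the buffer
close_if_possible(object())   # harmless no-op on objects without close()
```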

> Overall, I'd more convinced by this patch if you were more
> aggressive: explicitly close before the final generator return,
> warn about potential problems in the comment if the readv is not
> totally consumed.
> But since this fix is needed and better than what we have, I let
> you judge whether it's worth the effort.
> BB:tweak
>         Vincent

So, I don't know how to have a direct test for this other than on
windows. Specifically, I don't know that there is any way to tell that a
file handle is still open under linux. On windows I can write a test like:

def test_readv_closes_at_completion(self):
    t = get_transport('.')
    t.put_bytes('foo', '123456789\n')
    requests = [(0, 1), (5, 1), (8, 1)]
    res = t.readv('foo', requests)
    for x, y in zip(res, range(len(requests))):
        pass  # consume every hunk that readv() yields
    # The file handle should be closed now
    t.rename('foo', 'bar')

This passes, but if I do:
  for x, y in zip(range(len(requests)), res):

It fails, because zip() stops whenever one of its iterators is exhausted.
And it turns out that when they are the same length, the readv() generator
is never fully consumed when it is the second argument.
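This is easy to demonstrate in isolation (a hypothetical tracked() generator, independent of Transport):

```python
done = []

def tracked(items):
    for item in items:
        yield item
    done.append(True)  # only reached if the generator runs to completion

requests = [1, 2, 3]

done[:] = []
list(zip(tracked(requests), range(len(requests))))
# zip pulls its first argument once more, sees StopIteration, and the
# generator body completes.
first_finished = bool(done)

done[:] = []
list(zip(range(len(requests)), tracked(requests)))
# The range runs out first, so the generator is left suspended after
# its last yield and never completes.
second_finished = bool(done)
```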

I'm not sure about the other cases where it doesn't get fully consumed.
But I notice that in the code which reads pack files we have:

    def _iter_record_objects(self):
        while True:
            record_kind = self.reader_func(1)
            if record_kind == 'B':
                # Bytes record.
                reader = BytesRecordReader(self._source)
                yield reader
            elif record_kind == 'E':
                # End marker.  There are no more records.
                return
            elif record_kind == '':
                # End of stream encountered, but no End Marker record
                # seen, so this container is incomplete.
                raise errors.UnexpectedEndOfContainerError()
            else:
                # Unknown record type.
                raise errors.UnknownRecordTypeError(record_kind)

However, I've also noticed that the last record in the index ends 1 byte
before the actual file ends (it ends at 1340, and the file is 1341 bytes
long), so the final 'E' doesn't seem to be attributed to any of the members.

So I don't think it is being read and this generator is returning early.

Anyway, I can't say that I've tracked down exactly which caller isn't
iterating past the last record, which would let _seek_and_read() finish
normally (raising its final StopIteration and closing the handle).

I *can* say that it is happening, and it is a bit hard to debug or test,
because it only fails if you are doing something concurrently (which we
don't often do in bzr) and if you are on a platform that doesn't like
file handles left open.

So if you prefer, I can make '.close()' a strictly required attribute,
but I don't think I can assume all callers run the iterator to
completion (because of the issue with zip()). I could track down all the
different places to put readv() first, but remember that sometimes
generators are stacked, and that gets tricky to resolve.



More information about the bazaar mailing list