Worsening of selftest thread leaking

Wed May 19 09:24:24 BST 2010

For those not familiar with the threading problems with
selftest[bug392127], the root cause is essentially that a number of the
client-server tests spin up a new thread and don't rejoin it. These
threads generally finish up of their own accord, or quietly block till
process termination. They can cause some knock-on problems[bug531746]
but are generally harmless provided you have the RAM for their stacks.
However, they've gone from being an annoyance (issue seen by other
people) to a real problem (issue seen by *me*!) causing random
failures, hangs, and crazy memory usage.
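
For anyone who wants to see the pattern in isolation, here is a minimal
sketch of a "test" that starts a thread without joining it, plus a
helper that counts what got left behind. This is not bzrlib code; the
names leaky_test and count_leaked are made up for illustration, and the
thread is marked as a daemon only so the sketch itself exits cleanly.

```python
import threading


def leaky_test(release):
    """Start a server-like thread and return without joining it."""
    t = threading.Thread(target=release.wait)
    t.daemon = True  # only so this sketch can always exit
    t.start()
    return t


def count_leaked(test, *args):
    """Return how many threads started by test are still alive afterwards."""
    before = set(threading.enumerate())
    test(*args)
    after = threading.enumerate()
    return len([t for t in after if t not in before and t.is_alive()])


release = threading.Event()
leaked = count_leaked(leaky_test, release)
print("leaked threads:", leaked)
release.set()  # unblock the leftover thread so it can finish
```

Each leftover thread holds on to its stack until it finishes, which is
why a thousand-odd leaking tests add up.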

To try and work out when my box started having troubles, I ran the
suite on a bunch of different past versions of Bazaar. There are a few
problems with doing this. The whole suite needs to run to get the ill
effects, which takes a while. Also the symptoms vary from run to run,
for no apparent reason. So though I've recorded some tangible
differences, as Vincent pointed out to me, this may not actually help
track down a specific problem.

The results for a recent revision[r5235] record a bunch of leaks, and
a random hang late on. Some runs[r5200] finish, but record loads of
"can't start new thread" and OOM failures.

Comparing with my known-good version[r4919], there are actually a
similar number of leaking tests, but vastly different memory usage.
Testing the next revision is problematic, as it crashes[r4920] due to
the introduction of testtools changing the teardown
semantics[testtools].

However, the next working revision[r4937] looks similar to the current
one, with high memory usage and a hang. The next ten or so revisions I
tried reliably deadlock in bt.test_remote.TestStacking as well, but
none of them have the large numbers of thread-related failures before
the hang that current versions have. To narrow the range down, I ran
a version of the testtools introduction change with the crash fix
backported[r4920+4936..4937], which behaves much the same.
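
To put the memory difference in more readable terms, here is a rough
sketch that converts the working set figures from the results at the
end of this message into MiB and compares each against r4919. The byte
counts are copied from those results; the names working_set and growth
are just for illustration.

```python
# Working set sizes in bytes, copied from the results at the end of
# this message.
working_set = {
    "r4919": 177070080,
    "r4937": 501645312,
    "r5235": 507011072,
}

MIB = 1024 * 1024
baseline = working_set["r4919"]


def growth(rev):
    """Working set of rev as a multiple of the r4919 working set."""
    return working_set[rev] / baseline


for rev, size in working_set.items():
    print("%-6s %6.1f MiB  %.2fx r4919" % (rev, size / MIB, growth(rev)))
```

So the leak counts barely move between these revisions, while the
working set is nearly three times that of the known-good revision.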

So, while I can't say what exactly, *something* is perhaps the fault
of testtools. As Vincent's workaround of segmenting the tests is no
longer sufficient to avoid these problems on the Windows buildbot, we
need to at least get back to the old level of brokenness. I have
the detailed results if anyone is interested, and am happy to try
experiments.

Martin

[bug392127]: selftest fails with "can't start new thread"
<https://bugs.launchpad.net/bzr/+bug/392127>

[bug531746]: Intermittent test failure during _finishLogFile
<https://bugs.launchpad.net/bzr/+bug/531746>

[testtools]: Questions after testtools merge
<https://lists.ubuntu.com/archives/bazaar/2009q4/065580.html>

[r4919]: Results of bzr selftest for r4919 from 2009-12-22
    22593 tests run in total, of which:
        19393 Passed without problems
          805 Parameters of test do not apply
         1039 Lacking required feature to run test
         1310 Skipped for another reason
           29 Known to fail a particular assertion
            6 Failed a given assertion
           11 Raised an unexpected exception
    Also 1562 tests leaked threads.

    Time     real  4676.9688 seconds
             user  2970.3594 seconds
             sys   1086.0938 seconds
    Working set    177070080 bytes
    Pagefile       207454208 bytes

[r4920]: Results of bzr selftest for r4920 from 2009-12-23
    1142 tests run in total, of which:
           965 Passed without problems
             4 Parameters of test do not apply
            30 Lacking required feature to run test
           138 Skipped for another reason
             2 Known to fail a particular assertion
             3 Raised an unexpected exception
    Also 21 tests leaked threads.

    CRASH: Access violation
        after bb.test_version.TestVersionUnicodeOutput.test_unicode_bzr_home

    Time     real   449.9219 seconds
             user   255.5938 seconds
             sys    140.3438 seconds
    Working set    111452160 bytes
    Pagefile       117743616 bytes

[r4920+4936..4937]: Results of bzr selftest for r4921 from 2010-05-18
    20780 tests run in total, of which:
        17651 Passed without problems
          804 Parameters of test do not apply
          982 Lacking required feature to run test
         1304 Skipped for another reason
           23 Known to fail a particular assertion
            4 Failed a given assertion
           12 Raised an unexpected exception
    Also 1583 tests leaked threads.

    HANG: Possible deadlock
        after bt.test_remote.TestStacking.test_stacked_get_stream_groupcompress

    Time     real  5772.5938 seconds
             user  3822.8438 seconds
             sys   1084.1094 seconds
    Working set    500199424 bytes
    Pagefile       547721216 bytes

[r4937]: Results of bzr selftest for r4937 from 2010-01-07
    20812 tests run in total, of which:
        17685 Passed without problems
          804 Parameters of test do not apply
          982 Lacking required feature to run test
         1304 Skipped for another reason
           23 Known to fail a particular assertion
            3 Failed a given assertion
           11 Raised an unexpected exception
    Also 1660 tests leaked threads.

    HANG: Possible deadlock
        after bt.test_remote.TestStacking.test_stacked_get_stream_topological

    Time     real  5449.9375 seconds
             user  3681.0938 seconds
             sys   1052.7813 seconds
    Working set    501645312 bytes
    Pagefile       549441536 bytes

[r5200]: Results of bzr selftest for r5200 from 2010-05-03
    23166 tests run in total, of which:
        19735 Passed without problems
          829 Parameters of test do not apply
         1054 Lacking required feature to run test
         1323 Skipped for another reason
           34 Known to fail a particular assertion
           10 Failed a given assertion
          181 Raised an unexpected exception
    Also 1770 tests leaked threads.

    Time     real  3444.0469 seconds
             user  2169.2813 seconds
             sys    509.2500 seconds
    Working set    557449216 bytes
    Pagefile       613650432 bytes

[r5235]: Results of bzr selftest for r5235 from 2010-05-14
    20432 tests run in total, of which:
        17278 Passed without problems
          826 Parameters of test do not apply
          971 Lacking required feature to run test
         1317 Skipped for another reason
           22 Known to fail a particular assertion
            3 Failed a given assertion
           15 Raised an unexpected exception
    Also 1673 tests leaked threads.

    HANG: Possible deadlock
        after bt.test_lockable_files.TestLockableFiles_LockDir.test_unlock_after_lock_write_with_token

    Time     real  3085.7813 seconds
             user  1901.3906 seconds
             sys    454.5938 seconds
    Working set    507011072 bytes
    Pagefile       563527680 bytes