[Lucid] SRU: Some testing on the larger writeback set
tim.gardner at canonical.com
Fri Aug 20 13:20:29 UTC 2010
On 08/20/2010 03:59 AM, Stefan Bader wrote:
> The upstream discussion about which solution to take for the writeback umount
> regression does not seem close to a final conclusion. So I think we should make a
> decision to move forward with the bigger set (which also seems to have good
> effects on normal performance and responsiveness).
> I ran two tests of my own, plus the xfstests test suite on ext4, and saw no
> regression compared to before. Run-times were usually shorter with the patchset:
> mount-umount of tmpfs with other IO: 0.33s -> 0.02s
> mount-cp-umount of ext4 : 9.00s -> 8.00s
> xfstests on ext4 : 24m30.00s -> 19m40.00s
> The xfstests run failed two aio testcases (239, 240) on both kernels, with very
> similar-looking errors. My kernels are based on the 2.6.32-24.41 release, so there
> may be fixes to ext4 in an upcoming stable update.
> Then I tried the xfstests on xfs and got scared by this on the new kernel:
> INFO: task xfs_io:5764 blocked for more than 120 seconds.
> "echo 0> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> xfs_io D ffff880206439650 0 5764 3651 0x00000000
> ffff8801f29dfd38 0000000000000082 0000000000015bc0 0000000000015bc0
> ffff8801de759ab0 ffff8801f29dffd8 0000000000015bc0 ffff8801de7596f0
> 0000000000015bc0 ffff8801f29dffd8 0000000000015bc0 ffff8801de759ab0
> Call Trace:
> [<ffffffff815595f7>] __mutex_lock_slowpath+0xe7/0x170
> [<ffffffff8114f1e1>] ? path_put+0x31/0x40
> [<ffffffff81559033>] mutex_lock+0x23/0x50
> [<ffffffff81152b59>] do_filp_open+0x3d9/0xba0
> [<ffffffff810f4487>] ? unlock_page+0x27/0x30
> [<ffffffff81112a19>] ? __do_fault+0x439/0x500
> [<ffffffff81115b78>] ? handle_mm_fault+0x1a8/0x3c0
> [<ffffffff8115e4ca>] ? alloc_fd+0x10a/0x150
> [<ffffffff81142219>] do_sys_open+0x69/0x170
> [<ffffffff81142360>] sys_open+0x20/0x30
> [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
> However, the test run completes after 118m30s (and fails 10 out of 146 tests: [017 109
> 194 198 225 229 232 238 239 240]). I did not see the dump on the old
> kernel, but that might just be because writeback is too slow to expose that race.
> I would re-run the test on the old kernel to record the run-time and the failing
> tests, though the run-time is tremendous (which is why I forgot to note things
> down yesterday).
> All in all, though, I think it should be safe for larger regression testing in
> -proposed. If there is no veto (and enough acks), I would add the set to our
> master branch.
I think the bigger patch set makes the most sense. It's certainly had the
most testing; I've been running it for a couple of weeks now. Let's get
it into -proposed soon, since lots of folks are hating life because of this issue.
In the meantime, maybe you should rerun the XFS tests without all of the
dmesg hung_task noise (echo 0 > /proc/sys/kernel/hung_task_timeout_secs)
just to be sure.
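A minimal sketch of that rerun, assuming a stock xfstests checkout (the sysctl path comes from the kernel's own hung-task warning message; the ./check invocation and the device/mountpoint values are assumptions about the test setup, not something stated in this thread):

```shell
# Disable hung-task warnings so the 120-second blocked-task dumps
# don't flood dmesg during the run (needs root; path is from the
# kernel's own warning message).
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# Re-run the suite against XFS. TEST_DEV/SCRATCH_DEV and the mount
# points are placeholders for whatever devices the tests actually use.
cd xfstests
TEST_DEV=/dev/sdb1 TEST_DIR=/mnt/test \
SCRATCH_DEV=/dev/sdb2 SCRATCH_MNT=/mnt/scratch ./check
```

Note this only silences the warning; any task that was genuinely stuck on the do_filp_open mutex would still hang, so a clean, comparable run-time is the signal to look for.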
Tim Gardner tim.gardner at canonical.com