Server load too high when using qemu-img
Alvin
info at alvin.be
Tue Feb 1 09:35:29 UTC 2011
I have long-standing performance problems on Lucid when handling large files.
I notice this on several servers, but here is a detailed example of a scenario
I encountered yesterday.
The server (stilgar) is a quad-core machine with 8 GB RAM and 3 disks. One disk
contains the operating system; the other two form an mdadm RAID0 array with LVM
on top. I need to reassemble the RAID manually[1] on most boots, but otherwise
it works fine.
(Before there are any heart attacks from reading 'raid0': the data on it is
NOT important, and only meant for testing.)
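For reference, the manual reassembly after a boot comes down to roughly the
following (a sketch from memory; actual device and volume group names will
differ on your setup):

  # reassemble the md array from its component disks
  mdadm --assemble --scan
  # re-activate the LVM logical volumes on top of it
  vgchange -ay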
The server runs 4 virtual machines (KVM).
- 2 Lucid servers on qcow images, residing on the local (non-RAID) disk.
- 1 Lucid server on an fstab-mounted NFS4 share (example entry below).
- 1 Windows desktop on a logical volume.
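The NFS4 share is mounted through /etc/fstab with an entry roughly like this
(hostname and paths here are only placeholders):

  nfsserver:/vmstore  /srv/vmstore  nfs4  defaults  0  0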
I have an NFS-mounted backup disk. When I restore the Windows image (60 GB)
from the backup, I run into bug 658131[2]. All running virtual machines start
showing errors like those in bug 522014[3] in their logs
(hung_task_timeout_secs) and services on them are no longer reachable. The
load on the server can climb above 30. Libvirt is no longer able to shut down
the virtual machines. Nothing can be done except rebooting the whole machine.
From the bug report, it looks like this might be NFS-related, but I'm not
convinced. If I first copy the image locally and then restore it, the load
also climbs insanely high and the virtual machines end up on the verge of
crashing. Services become temporarily unavailable.
The tools used are qemu-img or dd. In all cases I run the commands with
'ionice -c 3'.
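Concretely, the restore is roughly one of the following (a sketch; paths,
image names and the volume group name are placeholders):

  # convert the backup image straight onto the logical volume
  ionice -c 3 qemu-img convert -O raw /mnt/backup/windows.qcow2 /dev/vg0/windows
  # or a plain block copy with dd
  ionice -c 3 dd if=/mnt/backup/windows.img of=/dev/vg0/windows bs=1M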
This is only one example. Any heavy I/O can bring a Lucid server to its knees;
even an ordinary rsync of large files, like the sketch below, is enough to
trigger the problem.
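A minimal illustration (paths are placeholders):

  ionice -c 3 rsync -a /mnt/backup/images/ /srv/images/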
What should I do? Sometimes it is necessary to copy large files, and that
should be possible without taking down the entire server. Any thoughts on the
matter?
Links:
[1] https://bugs.launchpad.net/bugs/27037
[2] https://bugs.launchpad.net/bugs/658131
[3] https://bugs.launchpad.net/bugs/522014
Example from /var/log/messages (kernel) on the server:
kvm D 0000000000000000 0 9632 1 0x00000000
ffff8801a4269ca8 0000000000000086 0000000000015bc0 0000000000015bc0
ffff8802004fdf38 ffff8801a4269fd8 0000000000015bc0 ffff8802004fdb80
0000000000015bc0 ffff8801a4269fd8 0000000000015bc0 ffff8802004fdf38
Call Trace:
[<ffffffff815596b7>] __mutex_lock_slowpath+0x107/0x190
[<ffffffff815590b3>] mutex_lock+0x23/0x50
[<ffffffff810f5899>] generic_file_aio_write+0x59/0xe0
[<ffffffff811d7879>] ext4_file_write+0x39/0xb0
[<ffffffff81143a8a>] do_sync_write+0xfa/0x140
[<ffffffff81084380>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81252316>] ? security_file_permission+0x16/0x20
[<ffffffff81143d88>] vfs_write+0xb8/0x1a0
[<ffffffff81144722>] sys_pwrite64+0x82/0xa0
[<ffffffff810121b2>] system_call_fastpath+0x16/0x1b
kdmflush D 0000000000000002 0 396 2 0x00000000
ffff88022eeb3d10 0000000000000046 0000000000015bc0 0000000000015bc0
ffff88022f489a98 ffff88022eeb3fd8 0000000000015bc0 ffff88022f4896e0
0000000000015bc0 ffff88022eeb3fd8 0000000000015bc0 ffff88022f489a98
Call Trace:
[<ffffffff815589a7>] io_schedule+0x47/0x70
[<ffffffff81435383>] dm_wait_for_completion+0xa3/0x160
[<ffffffff81059b90>] ? default_wake_function+0x0/0x20
[<ffffffff81435d47>] ? __split_and_process_bio+0x127/0x190
[<ffffffff81435dda>] dm_flush+0x2a/0x70
[<ffffffff81435e6c>] dm_wq_work+0x4c/0x1c0
[<ffffffff81435e20>] ? dm_wq_work+0x0/0x1c0
[<ffffffff8107f7e7>] run_workqueue+0xc7/0x1a0
[<ffffffff8107f963>] worker_thread+0xa3/0x110
[<ffffffff81084380>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8107f8c0>] ? worker_thread+0x0/0x110
[<ffffffff81084006>] kthread+0x96/0xa0
[<ffffffff810131ea>] child_rip+0xa/0x20
[<ffffffff81083f70>] ? kthread+0x0/0xa0
[<ffffffff810131e0>] ? child_rip+0x0/0x20