Server load too high when using qemu-img

Serge E. Hallyn serge.hallyn at canonical.com
Thu Feb 3 14:01:37 UTC 2011


Quoting Alvin (info at alvin.be):
> I have long standing performance problems on Lucid when handling large files.
> 
> I notice this on several servers, but here is a detailed example of a scenario 
> I encountered yesterday.
> 
> Server (stilgar) is a Quad-core with 8 GB ram. The server has 3 disks. 1 Disk 
> contains the operating system. The other two are mdadm RAID0 with LVM. I need 
> to recreate the RAID manually[1] on most boots, but otherwise it is working 
> fine.
> (Before there are any heart attacks from reading 'raid0': the data on it is 
> NOT important, and only meant for testing.)
> The server runs 4 virtual machines (KVM).
> - 2 Lucid servers on qcow, residing on the local (non-raid) disk.
> - 1 Lucid server on a fstab mounted NFS4 share.
> - 1 Windows desktop on a logical volume.
> 
> I have an NFS mounted backup disk. When I restore the Windows image from the 
> backup (60GB), I encounter bug 658131[2]. All running virtual machines will 
> start showing errors like in bug 522014[3] in their logs 
> (hung_task_timeout_secs) and services on them will no longer be reachable. The 
> load on the server can climb to >30. Libvirt will no longer be able to 

Is it possible for you to use CIFS instead of NFS?

It's been a few years, but when I had my NAS at home I found CIFS far more
stable and reliable than NFS.
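For reference, a CIFS mount in /etc/fstab might look roughly like this (the server path, mount point, and credentials file are placeholders, not details from this thread):

```
# /etc/fstab -- hypothetical CIFS mount in place of the NFS4 share
//nas/backup  /mnt/backup  cifs  credentials=/etc/cifs-creds,iocharset=utf8  0  0
```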

> shutdown the virtual machines. Nothing else can be done than a reboot of the 
> whole machine.
> 
> From the bug report, it looks like this might be NFS related, but I'm not 
> convinced. If I copy the image first and then restore it, the load also climbs 
> insanely high and the virtual machines will be on the verge of crashing. 
> Services will be temporarily unavailable.

(Not trying to be critical)  What do you expect to happen?  I.e., what do you
think the bug is there?  Is it that ionice seems to be insufficient?  I'm
asking in particular about the conversion by itself, not the copy, as I agree
that the copy pinning the CPU must be a (kernel) bug.
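One caveat worth keeping in mind here: as far as I know, the idle class (-c 3) only takes effect under the CFQ scheduler, and even then it does not stop a huge copy from flooding the page cache. A minimal sketch of an idle-class copy (the file names below are small stand-ins, not the actual 60GB image):

```shell
# Create a small stand-in source file (the real case is a 60GB image).
dd if=/dev/zero of=/tmp/src.img bs=1M count=4 2>/dev/null

# Idle-class copy: only issues IO when no other class wants the disk.
# Note: ionice -c 3 is honored by CFQ only; with other schedulers it
# is effectively a no-op.
ionice -c 3 dd if=/tmp/src.img of=/tmp/dst.img bs=1M 2>/dev/null

ls -l /tmp/dst.img
```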

> The software used is qemu-img or dd. In all cases I'm running the commands 
> with 'ionice -c 3'.
> 
> This is only an example. Any high IO (e.g. rsync with large files) can crash 
> Lucid servers,

Over NFS, or any rsync?

For that matter, rsync tries to be smart and slice and dice the file to
minimize network traffic.  What about a simple ftp/scp?

> but what should I do? Sometimes it is necessary to copy large 
> files. That should be something that can be done without taking down the 
> entire server. Any thoughts on the matter?

It might be worth testing other IO schedulers.
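A quick sketch of how one might check and switch schedulers per disk (the device name is an assumption -- substitute the RAID member disks; switching requires root and lasts until reboot):

```shell
# Print the available schedulers for each block device; the one in
# brackets is active, e.g. "noop anticipatory deadline [cfq]".
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] && echo "$f: $(cat "$f")"
done

# To try deadline on one disk (sda here is a placeholder), as root:
#   echo deadline > /sys/block/sda/queue/scheduler
echo "scheduler check done"
```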

It also might be worth testing a more current kernel.  The kernel team
does produce backports of newer kernels to Lucid which, while not
officially supported, should work and may fix these issues.

-serge




More information about the ubuntu-server mailing list