Server load too high when using qemu-img
Alvin
info at alvin.be
Thu Feb 3 15:43:15 UTC 2011
On Thursday 03 February 2011 15:01:37 Serge E. Hallyn wrote:
> Quoting Alvin (info at alvin.be):
> > I have long standing performance problems on Lucid when handling large
> > files.
> >
> > I notice this on several servers, but here is a detailed example of a
> > scenario I encountered yesterday.
> >
> > Server (stilgar) is a quad-core with 8 GB RAM. The server has 3 disks. One
> > disk contains the operating system. The other two are mdadm RAID0 with
> > LVM. I need to recreate the RAID manually[1] on most boots, but
> > otherwise it is working fine.
> > (Before there are any heart attacks from reading 'raid0': the data on it
> > is NOT important, and only meant for testing.)
> > The server runs 4 virtual machines (KVM).
> > - 2 Lucid servers on qcow, residing on the local (non-raid) disk.
> > - 1 Lucid server on a fstab mounted NFS4 share.
> > - 1 Windows desktop on a logical volume.
> >
> > I have an NFS-mounted backup disk. When I restore the Windows image from
> > the backup (60GB), I encounter bug 658131[2]. All running virtual
> > machines will start showing errors like in bug 522014[3] in their logs
> > (hung_task_timeout_secs) and services on them will no longer be
> > reachable. The load on the server can climb to >30. Libvirt will no
> > longer be able to
>
> Is it possible for you to use CIFS instead of NFS?
>
> It's been a few years, but when I had my NAS at home I found CIFS far more
> stable and reliable than NFS.
Yes. I know NFS is somewhat neglected in Ubuntu, but why use MS Windows file
sharing between Linux machines? That makes no sense. NFS is easier to set up.
In short: I could try CIFS, but to rule the network share out of this issue I
copied the image file locally first. It is true that NFS (and maybe CIFS too)
has an impact here; the load gets even higher when it is used.
> > shut down the virtual machines. Nothing can be done except a reboot of
> > the whole machine.
> >
> > From the bug report, it looks like this might be NFS related, but I'm not
> > convinced. If I copy the image first and then restore it, the load also
> > climbs insanely high and the virtual machines will be on the verge of
> > crashing. Services will be temporarily unavailable.
>
> (Not trying to be critical) What do you expect to happen? I.e. what do you
> think is the bug there? Is it that ionice seems to be insufficient? I'm
> asking in particular about the conversion by itself, not the copy, as I
> agree the copy pinning CPU must be a (kernel) bug.
Well, I expect a performance hit, but not hung tasks, especially when using
ionice.
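For reference, the restore commands look roughly like this (the paths and
volume names are placeholders, not the exact ones I used):

  # restore the backup image to the logical volume, with idle I/O priority
  ionice -c 3 qemu-img convert -O raw /backup/windows.qcow2 /dev/vg0/windows
  # or, for a raw image, a plain block copy
  ionice -c 3 dd if=/backup/windows.img of=/dev/vg0/windows bs=1M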
> > The software used is qemu-img or dd. In all cases I'm running the
> > commands with 'ionice -c 3'.
> >
> > This is only an example. Any high IO (e.g. rsync with large files) can
> > crash Lucid servers,
>
> Over NFS, or any rsync?
Both. In the example, NFS/rsync was not used. I only mentioned them because
I've had the same trouble when using them on other servers.
> For that matter, rsync tries to be smart and slice and dice the file to
> minimize network traffic. What about a simple ftp/scp?
>
> > but what should I do? Sometimes it is necessary to copy large
> > files. That should be something that can be done without taking down the
> > entire server. Any thoughts on the matter?
>
> It might be worth testing other IO schedulers.
>
> It also might be worth testing a more current kernel. The kernel team
> does produce backports of newer kernels to lucid which, while surely not
> officially supported, should work and may fix these issues.
I might try those. I see you found my new bug report[1]. You're on to
something there! I didn't remove a USB drive, but there are similar issues
that I had not linked to this before:
- mdadm does not auto-assemble [2]
- I have an LVM snapshot present on that system! Even worse, the snapshot is
100% full and thus corrupt.
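On the mdadm point: what I do after a failed auto-assembly is roughly the
following (device names are from memory and may not match exactly):

  # reassemble the RAID0 array by hand, then reactivate the volume group
  mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1
  vgchange -ay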
Now, I hadn't thought of the snapshot. The presence of an LVM snapshot is a
huge I/O performance hit by itself, so that explains the extreme load: in my
example I was reading the raw image from the snapshot's parent volume.
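A quick way to confirm this is to check how full the snapshot is, and since a
snapshot that has hit 100% is unusable anyway, to drop it (volume names below
are placeholders):

  # the Snap% column shows how full each snapshot is
  lvs
  # a snapshot at 100% is invalid and can only be removed
  lvremove /dev/vg0/windows-snap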
Because of your comment I also found a blog post[3] about the issue:
"Non-existent Device Mapper Volumes Causing I/O Errors?"
So, I will first contact all users and find a moment to take the server
offline for some testing. Then I'll post my findings in the bug report.
Thanks for the tips.
Links:
[1] https://bugs.launchpad.net/bugs/712392
[2] https://bugs.launchpad.net/bugs/27037
[3] http://slated.org/device_mapper_weirdness
--
Alvin