[Bug 1847361] Re: Upgrade of qemu binaries causes running instances not able to dynamically load modules
Victor Tapia
1847361 at bugs.launchpad.net
Wed Oct 7 09:34:20 UTC 2020
** Patch removed: "qemu-stein.debdiff"
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1847361/+attachment/5415315/+files/qemu-stein.debdiff
** Patch added: "qemu-stein.debdiff"
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1847361/+attachment/5418856/+files/qemu-stein.debdiff
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1847361
Title:
Upgrade of qemu binaries causes running instances not able to
dynamically load modules
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive stein series:
Triaged
Status in libvirt package in Ubuntu:
Fix Released
Status in qemu package in Ubuntu:
Fix Released
Status in libvirt source package in Bionic:
Fix Released
Status in qemu source package in Bionic:
Fix Released
Status in libvirt source package in Eoan:
Fix Released
Status in qemu source package in Eoan:
Fix Released
Bug description:
[Impact]
* An infrequent but annoying issue is that QEMU cannot hot-add
capabilities if qemu has been upgraded since the instance was started.
This is because qemu modules only work with exactly the same build.
* We brought changes upstream that allow the packaging to keep the old
files around and make qemu look for them as a fallback.
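As a quick illustration, once the fix is in place both module locations
can be listed side by side (a sketch; the version-named directory layout
is described in the test case below, actual listings will vary):
$ ls /usr/lib/x86_64-linux-gnu/qemu/  # modules of the currently installed build
$ ls /var/run/qemu/                   # one subdirectory per replaced version
$ ls /var/run/qemu/*/                 # preserved .so files available as fallback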
[Test Case]
I:
* $ apt install uvtool-libvirt
$ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=bionic
$ uvt-kvm create --password ubuntu lateload arch=amd64 release=bionic label=daily
$ cat > curldisk.xml << EOF
<disk type='network' device='disk'>
<driver name='qemu' type='raw'/>
<source protocol="http" name="ubuntu/dists/bionic-updates/main/installer-amd64/current/images/netboot/mini.iso">
<host name="archive.ubuntu.com" port="80"/>
</source>
<target dev='vdc' bus='virtio'/>
<readonly/>
</disk>
EOF
# Here, up- or downgrade the installed packages; even a minor
# version bump or a rebuild of the same version will do.
# Alternatively, if you prefer (easier), you can run
$ apt install --reinstall qemu-*
Next, check that the .so files appeared (placed there by the maintainer
scripts) in the /var/run/qemu/<version> directory.
# Then rm/mv the original .so files of qemu-block-extra.
# Without the preserved copies, trying to load a .so after an upgrade
# would fail, as the running qemu refuses modules with a different build id.
$ virsh attach-device lateload curldisk.xml
Reported issue happens on attach:
root@b:~# virsh attach-device lateload curldisk.xml
error: Failed to attach device from curldisk.xml
error: internal error: unable to execute QEMU command 'device_add': Property 'virtio-blk-device.drive' can't find value 'drive-virtio-disk2'
In the log we can see:
Failed to initialize module: /usr/lib/x86_64-linux-gnu/qemu/block-curl.so
One can also check the files mapped into the qemu process; the
/var/run/... path should now be in use, as shown below.
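For example, a quick sketch (pidof and the binary name are assumptions;
pick the right PID if several guests are running):
$ pid=$(pidof qemu-system-x86_64)
$ grep /var/run/qemu /proc/$pid/maps
# any hit means the running guest mapped one of the preserved modules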
II:
* As the first iteration of the fix had issues here, also worth a
try is the use of an environment variable for an extra path:
$ QEMU_MODULE_DIR="/tmp/" qemu-system-x86_64 -cdrom localhost::/foo
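A sketch to make the effect observable (assuming the original module
was removed as in test I, so that the module can only be found via the
extra path):
$ mv /usr/lib/x86_64-linux-gnu/qemu/block-curl.so /tmp/
$ QEMU_MODULE_DIR="/tmp/" qemu-system-x86_64 -cdrom localhost::/foo
# then inspect /proc/<pid>/maps to verify the /tmp/ copy got mapped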
[Regression Potential]
I:
* libvirt just allows a few more paths to be read in the apparmor
isolation, which is usually safe unless these paths are considered
sensitive. But /var/run/qemu is new, and /var/run in general is not
meant for permanent or secure data; as always, if people want to ramp
up isolation they can add deny rules to the local overrides.
II:
* the qemu change has two components.
In qemu code, it looks in another path if the former ones failed.
I see no issues there yet, but can imagine that odd versions might
make it access odd paths which would then be denied by apparmor or
just not exist. But that is no different from the former built-in
paths it tries, so nothing bad should happen.
The change to the maintainer scripts has to back up the files.
If that goes wrong, upgrades could be broken, but so far no tests have
shown issues.
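For reference, a minimal sketch of what the backup step amounts to
(hypothetical commands; the real maintainer scripts handle versions
and corner cases more carefully):
# version of the package currently being replaced (hypothetical query):
version="$(dpkg-query -W -f='${Version}' qemu-block-extra)"
mkdir -p "/var/run/qemu/${version}"
cp -a /usr/lib/x86_64-linux-gnu/qemu/*.so "/var/run/qemu/${version}/"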
[Other Info]
* To really use the functionality, users will need the new qemu AND the
new libvirt that are uploaded for this bug.
But it felt wrong to add versioned dependencies from qemu->libvirt
(that is the semantically correct direction); also, conflicts/breaks
might cause issues in many places that want to control these. OTOH,
while the fix is great for some installations, the majority of users
won't care and will therefore be happy if extra dependencies don't
cause any oddity on apt upgrade. Therefore, versioned dependencies
were intentionally not added.
---
[Feature Freeze Exception]
Hi,
this is IMHO just a bugfix. But since it involves some bigger changes, I wanted to be on the safe side and get an ack from the release team.
Problem:
- on upgrade qemu processes are left running as they
represent a guest VM
- later, trying to add features (e.g. ceph disk hot-add) will
need to load .so files, e.g. from the qemu-block-extra package
- those modules can only be loaded by the same build, but those are
gone after the upgrade
Solution:
- If qemu fails to load a module from its usual paths it will
now also look in /var/run/qemu/<version>/
- the package upgrade code will place the .so's there
- things will be cleaned up on reboot, which is much simpler
and less error-prone than trying to detect which versions of
the binaries are running
- libvirt has a change to allow just reading and
mapping from that path (apparmor)
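The reboot cleanup comes for free because /var/run (i.e. /run) is a
tmpfs on Ubuntu, so the preserved copies vanish with it; this can be
verified with:
$ findmnt -n -o FSTYPE /run
tmpfs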
@Release team it would be great if you would agree to this being safe
for an FFe.
--- initial report ---
Upgrading the qemu binaries causes the on-disk versions to change, but
in-memory running instances still attempt to dynamically load a
library that matches their own version. This can cause running
instances to fail actions like hotplugging devices. It can be
alleviated by migrating the instance to a new host or restarting the
instance; however, in cloud-type environments there may be instances
that cannot be migrated (sriov, etc), or the cloud operator may not
have permission to reboot them.
This may be resolvable for many situations by changing the packaging
to keep older versions of qemu libraries around on disk (similar to
how the kernel package keeps older kernel versions around).
--- solution options (WIP) ---
For a packaging solution we would need:
- qemu-block-extra / qemu-system-gui binary packages would need
some sort of -$buildid in the name. That could be the version
string (sanitized for a package name)
- /usr/lib/x86_64-linux-gnu/qemu/*.so would need a -$buildid
- loading of modules in qemu would need to consider $buildid
when creating module names; see util/module.c,
module_load_one / module_load_file.
It already searches multiple dirs, so maybe it could insert
the $buildid there
- We'd need a way of detecting running versions of qemu binaries
and only make the old packages removable once the binaries are
all gone. I have not seen something like that in apt yet (the
kernel is easy in comparison, as only one can be loaded at a time).
ALTERNATIVES:
- disable loadable module support
- add an option to load all modules in advance (unlikely to be
liked upstream, and not desirable for many setups using qemu,
especially not as default)
- add an option to load a module (e.g. via QMP/HMP), which would
allow an admin to decide to do so for the few setups that benefit
- down the road, that could even get a libvirt interface
for easier consumption
Heads up - None of the above would be SRUable
--- mitigation options ---
- live migrate for upgrades
- prohibited by SR-IOV usage
- Tech to get SR-IOV migratable is coming (e.g. via net_failover,
bonding in DPDK, ...)
- load the modules you need in advance
- Note: lacking an explicit "load module" command makes this
slightly odd for now
- but using iscsi or ceph is never spontaneous; a deployment
either has the setup to use those or it doesn't
- Create a single small read-only node and attach it to each guest;
that will load the driver and render you immune to the issue. While
clunkier, this isn't so different from how it would be with an
explicit "load module" command.
Actually, the target doesn't even have to exist: the attach can fail
and still achieve what is needed; comment #17 has an example (see
the sketch below).
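A sketch of that mitigation, reusing the curldisk.xml from the test
case above (the guest name is a placeholder):
$ virsh attach-device <guest> curldisk.xml
# Even if the attach fails (e.g. the source is unreachable), qemu has
# already loaded block-curl.so, so a later upgrade can no longer break it.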
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1847361/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list