[Bug 1904585] Re: opal-prd: Have a worker process handle page offlining (Fixes "PlatServices: dyndealloc memory_error() failed" is getting reported in error log (opal-prd))
Łukasz Zemczak
1904585 at bugs.launchpad.net
Mon Jan 11 22:52:49 UTC 2021
Hello bugproxy, or anyone else affected,
Accepted skiboot into groovy-proposed. The package will build now and be
available at
https://launchpad.net/ubuntu/+source/skiboot/6.5.2-1ubuntu0.20.10.1 in a
few hours, and then in the -proposed repository.
Please help us by testing this new package. See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Your feedback will aid us getting this
update out to other Ubuntu users.
If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from
verification-needed-groovy to verification-done-groovy. If it does not
fix the bug for you, please add a comment stating that, and change the
tag to verification-failed-groovy. In either case, without details of
your testing we will not be able to proceed.
Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in
advance for helping!
N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.
** Changed in: skiboot (Ubuntu Groovy)
Status: Incomplete => Fix Committed
** Tags added: verification-needed verification-needed-groovy
** Changed in: skiboot (Ubuntu Focal)
Status: In Progress => Fix Committed
** Tags added: verification-needed-focal
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to skiboot in Ubuntu.
Matching subscriptions: foundations-bugs-skiboot
https://bugs.launchpad.net/bugs/1904585
Title:
opal-prd: Have a worker process handle page offlining (Fixes
"PlatServices: dyndealloc memory_error() failed" is getting reported
in error log (opal-prd))
Status in The Ubuntu-power-systems project:
In Progress
Status in skiboot package in Ubuntu:
Fix Released
Status in skiboot source package in Xenial:
Fix Committed
Status in skiboot source package in Bionic:
Fix Committed
Status in skiboot source package in Focal:
Fix Committed
Status in skiboot source package in Groovy:
Fix Committed
Status in skiboot source package in Hirsute:
Fix Released
Bug description:
[Impact]
This impacts the opal-prd userspace command from the skiboot package.
The memory_error() hservice interface expects the memory_error() call to
just accept the offline request and return without actually offlining the
memory. Currently we will attempt to offline the marked pages before
returning to HBRT, which can result in an excessively long time spent in
the memory_error() hservice call, which blocks HBRT from processing other
errors.
[Test Case]
Unfortunately, due to the specific hardware requirement, I wasn't able
to reproduce this problem and provide a test case for it. However, I
was able to build this package into a PPA and got the IBM team to
confirm this problem was resolved for groovy, focal, bionic and xenial
(see comments #4 and #6).
Another verification test will be done again (as part of the SRU
process) by the IBM Power team.
[What could go wrong]
To avoid long delays (that may block HBRT from processing other
errors), the memory offlining is now moved into a dedicated worker
process, so that it can be handled in the background.
If broken, this can introduce further issues, such as hangs in the
worker process, workers that never return and pile up, or, in the worst
case, memory pages that are not offlined but are reported as offlined.
The latter would be a significant memory management problem that may
even break the system entirely over time.
But the code seems to have taken this into account with 'sigaction',
the return-code/exit-status check and the reaping of the worker
processes.
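For illustration only, the reaping and exit-status check mentioned above
can be sketched as follows. This is not the opal-prd code (which is C and
uses sigaction()); it is a minimal Python analogue, and the names
results, reap_workers, install_reaper and spawn_worker are hypothetical.

```python
import os
import signal

# Hypothetical bookkeeping: pid -> True if the worker exited with status 0.
results = {}

def reap_workers(signum=None, frame=None):
    """Collect every finished child without blocking and record its outcome."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left to wait for
        if pid == 0:
            return  # children exist, but none has exited yet
        # A worker counts as successful only on a normal exit with code 0.
        results[pid] = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0

def install_reaper():
    """Python counterpart of registering a SIGCHLD handler via sigaction()."""
    signal.signal(signal.SIGCHLD, reap_workers)

def spawn_worker(exit_code):
    """Stand-in worker: the child exits at once with the given code."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    return pid
```

Looping on waitpid() with WNOHANG matters because several workers may
exit before the handler runs once; reaping only one child per signal
would let finished workers pile up as zombies.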
The fix was prepared back in September and was accepted upstream,
hence regressions are unlikely; in the meantime it has already
landed in hirsute.
On top of that, a PPA with a patched skiboot package was created, shared
and eventually successfully tested by IBM (the initial bug reporter).
[Original Description]
https://github.com/open-power/skiboot/commit/8cbd0de88d162e387f11569eee1bdecef8fad2e3
opal-prd: Have a worker process handle page offlining
The memory_error() hservice interface expects the memory_error() call to
just accept the offline request and return without actually offlining the
memory. Currently we will attempt to offline the marked pages before
returning to HBRT which can result in an excessively long time spent in the
memory_error() hservice call which blocks HBRT from processing other
errors. Fix this by adding a worker process which performs the page
offlining via the sysfs memory error interfaces.
Reviewed-by: Vasant Hegde - hegdevasant at linux.vnet.ibm.com
Signed-off-by: Oliver O'Halloran - oohall at gmail.com
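As a rough sketch of the approach the commit message describes (not the
actual opal-prd implementation, which is written in C), the request can
be accepted in the parent while a forked worker writes each page address
to the kernel's sysfs memory error interface. The function names, the
inclusive address range and the 64 KiB default page size are assumptions
made for this example; writing to the real sysfs file requires root.

```python
import os

# Kernel soft-offline interface; present only on kernels with memory-failure support.
SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"

def offline_pages(start, end, page_size, path=SOFT_OFFLINE):
    """Write every page address in [start, end] to path; return the failure count."""
    failures = 0
    for addr in range(start, end + 1, page_size):
        try:
            with open(path, "w") as f:
                f.write(f"{addr:#x}\n")
        except OSError:
            failures += 1
    return failures

def memory_error(start, end, page_size=65536, path=SOFT_OFFLINE):
    """Accept the request and return immediately; a forked worker does the offlining."""
    pid = os.fork()
    if pid == 0:
        # Child: do the slow work, then exit without returning into the caller's code.
        code = 1
        try:
            code = 1 if offline_pages(start, end, page_size, path) else 0
        finally:
            os._exit(code)
    # Parent: hand back the worker pid so it can be reaped and its status checked later.
    return pid
```

The parent never waits for the writes, so the caller is not blocked no
matter how long the kernel takes to offline a page; a failure shows up
later as a non-zero exit status of the worker.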
Thanks in advance for your support.
Machine Type = Power8 and Power9 OPAL systems
---Steps to Reproduce---
* Inject memory error (UE)
* Verify that opal-prd doesn't return asynchronously to the platform after requesting the memory offlining operation
Userspace tool common name: opal-prd
We need this fix for the 16.04.x and 18.04.x LTS releases.
The fix is also needed for 20.04 and 20.10.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1904585/+subscriptions