[Bug 1758273] [NEW] DD2.2 freezes/hangs after 20mins of uptime
Launchpad Bug Tracker
1758273 at bugs.launchpad.net
Fri Mar 23 08:40:09 UTC 2018
You have been subscribed to a public bug:
== Comment: #0 - Application Cdeadmin - 2018-03-19 09:30:53 ==
== Comment: #1 - Application Cdeadmin <> - 2018-03-19 09:30:55 ==
== Comment: #2 - Application Cdeadmin <> - 2018-03-19 09:30:57 ==
------- Comment From brihh 2018-03-19 09:10:30 EDT -------
Needless to say, machine is pretty unusable. Needed for Performance testing for release.
== Comment: #3 - Application Cdeadmin <> - 2018-03-19 10:10:59 ==
------- Comment From vaibhav92 2018-03-19 10:06:17 EDT -------
@pridhiviraj Any idea who can look into this ?
------- Comment From dougmill-ibm 2018-03-19 10:09:59 EDT -------
If machine is locked-up and console is unresponsive, try collecting eSEL data from the BMC.
== Comment: #4 - Application Cdeadmin <c> - 2018-03-19 10:40:59 ==
------- Comment From pridhiviraj 2018-03-19 10:38:47 EDT -------
@brihh Can you use latest 03/15 PNOR and re-create the issue. And also before it hangs please collect OPAL and kernel logs. @vaibhav92 If it is re-creatable OPAL/EM team need to look at it.
== Comment: #5 - Application Cdeadmin <> - 2018-03-19 11:01:58 ==
------- Comment From vaibhav92 2018-03-19 10:41:53 EDT -------
Saw this in the kernel-log of the system:
[ 1247.404962] PM: suspend entry (s2idle)
[ 1247.404970] PM: Syncing filesystems ... done.
Looks like its getting suspended after 20 mins of inactivity. Looking at the /etc/systemd/logind.conf see IdleAction by default is 'ignore':
#IdleAction=ignore
#IdleActionSec=30min
But clearly someone is issuing a suspend to the system. So this probably need to be looked by the distro/Power-Management/EM team
------- Comment From megcurry 2018-03-19 10:43:14 EDT -------
Please advise re. Assignments and Labels that would get the right team working on this.....Mirror label gets a Bz opened, and that is necessary or at least useful for some of the LTC teams to look at things, right?
------- Comment From brihh 2018-03-19 10:47:45 EDT -------
odd about inactivity - i once had a test running for 20mins and it still froze:
`08:52:23 up 20 min, 4 users, load average: 64.40, 98.94, 64.32`
------- Comment From brihh 2018-03-19 10:49:30 EDT -------
@pridhiviraj where is the latest 3/15 PNOR that I can load?
```
== Comment: #8 - Application Cdeadmin <> - 2018-03-19 12:40:54 ==
------- Comment From pridhiviraj 2018-03-19 12:34:18 EDT -------
```
Mar 19 11:35:01 p215n15 rsyslogd-2007: action 'action 13' suspended, next retry is Mon Mar 19 11:35:31 2018 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Mar 19 11:35:01 p215n15 CRON[6464]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 19 11:42:39 p215n15 systemd[1]: Starting Cleanup of Temporary Directories...
Mar 19 11:42:39 p215n15 rsyslogd-2007: action 'action 13' suspended, next retry is Mon Mar 19 11:43:39 2018 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Mar 19 11:42:40 p215n15 systemd-tmpfiles[6504]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Mar 19 11:42:40 p215n15 systemd[1]: Started Cleanup of Temporary Directories.
Mar 19 11:45:01 p215n15 rsyslogd-2007: action 'action 13' suspended, next retry is Mon Mar 19 11:46:01 2018 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Mar 19 11:45:01 p215n15 CRON[6524]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 19 11:47:39 p215n15 systemd[1]: Starting Message of the Day...
Mar 19 11:47:39 p215n15 rsyslogd-2007: action 'action 13' suspended, next retry is Mon Mar 19 11:48:39 2018 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Mar 19 11:47:40 p215n15 50-motd-news[6528]: * Meltdown, Spectre and Ubuntu: What are the attack vectors,
Mar 19 11:47:40 p215n15 50-motd-news[6528]: how the fixes work, and everything else you need to know
Mar 19 11:47:40 p215n15 50-motd-news[6528]: - https://ubu.one/u2Know
Mar 19 11:47:40 p215n15 systemd[1]: Started Message of the Day.
Mar 19 11:48:02 p215n15 NetworkManager[4662]: <info> [1521474482.7252] manager: sleep: sleep requested (sleeping: no enabled: yes)
Mar 19 11:48:02 p215n15 NetworkManager[4662]: <info> [1521474482.7259] manager: NetworkManager state is now ASLEEP
Mar 19 11:48:02 p215n15 gnome-shell[5538]: Screen lock is locked down, not locking
Mar 19 11:48:02 p215n15 whoopsie[5264]: [11:48:02] offline
Mar 19 11:48:02 p215n15 systemd[1]: Reached target Sleep.
Mar 19 11:48:02 p215n15 systemd[1]: Starting Suspend...
Mar 19 11:48:02 p215n15 systemd-sleep[6576]: Suspending system...
Mar 19 11:48:02 p215n15 kernel: [ 1246.906320] PM: suspend entry (s2idle)
```
@vaibhav92 You are right, from the above messages looks like system is suspending by some cron job i guess.
== Comment: #9 - Application Cdeadmin <> - 2018-03-19 13:30:57 ==
------- Comment From brihh 2018-03-19 13:28:36 EDT -------
@pridhiviraj interesting. know offhand how i can shut that off? seems a bit annoying .. :-)
== Comment: #11 - Application Cdeadmin <> - 2018-03-19 17:10:51 ==
------- Comment From mzipse 2018-03-19 16:52:11 EDT -------
At our daily defect call, it was suggested that we check to see if Opal-PRD is running, which is a prereq for the Firmware recovery to work properly. Opal-PRD is an app that should be part of the distro and should automatically be started. Never-the-less, if you are want to check on it, here's the command to run at the OS.....
sudo service opal-prd status
You'll need root authority to do this.
The output will look something like this.....
# sudo service opal-prd status
? opal-prd.service - OPAL PRD daemon
Loaded: loaded (/lib/systemd/system/opal-prd.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2018-03-14 19:04:36 CDT; 4 days ago
Docs: man:opal-prd(8)
Main PID: 5085 (opal-prd)
Tasks: 1
CGroup: /system.slice/opal-prd.service
??5085 /usr/sbin/opal-prd --pnor /dev/mtd0
Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x00d0] 1d000000 bd021c00 0000bf02 1c000000 *................*
Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x00e0] c1021b00 0000c302 1d000000 00030000 *................*
Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x00f0] 0000ff06 21465245 51000206 16000000 *....!FREQ.......*
Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x0100] 5d08f600 00006008 f6000000 6308f600 *].....`.....c...*
Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x0110] 00006608 f6000000 6908f600 00006c08 *..f.....i.....l.*
Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x0120] f6000000 6f08f600 00007208 *....o.....r. *
Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:OCC1 rsp status=0x00, length=0x01CC
Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:rsp data: (up to 16 bytes)
Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x0000] 03100200 03000000 00000000 00000000 *................*
Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:<<processOccError()
== Comment: #13 - Application Cdeadmin <> - 2018-03-19 17:30:51 ==
------- Comment From brihh 2018-03-19 17:22:06 EDT -------
@mzipse seems to be there and running:
```
root at p215n15:~# service opal-prd status
? opal-prd.service - OPAL PRD daemon
Loaded: loaded (/lib/systemd/system/opal-prd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2018-03-19 18:19:58 EDT; 1min 35s ago
Docs: man:opal-prd(8)
Main PID: 4648 (opal-prd)
Tasks: 1 (limit: 27033)
CGroup: /system.slice/opal-prd.service
??4648 /usr/sbin/opal-prd
Mar 19 18:19:58 p215n15 opal-prd[4648]: SCOM: read: chip 0x8, addr 0x8010a50, val 0x0, rc 0
Mar 19 18:19:58 p215n15 opal-prd[4648]: SCOM: read: chip 0x8, addr 0x8010a54, val 0x0, rc 0
Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: PRDF:<<PRDF::noLock_initialize()
Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: PRDF:<<PRDF::initialize()
Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: ATTN_SLOW:I>Service::enableAttns() enter
Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: ATTN_SLOW:I>Service::enableAttns() exit
Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: ATTN_SLOW:I><<ATTN_RT::enableAttns rc: 0
Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: calling get_ipoll_events
Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: enabling IPOLL events 0x5b90000000000000
Mar 19 18:19:58 p215n15 opal-prd[4648]: FW: writing init message
```
== Comment: #14 - Application Cdeadmin <> - 2018-03-20 03:50:53 ==
------- Comment From vaibhav92 2018-03-20 03:47:08 EDT -------
Did some kernel tracing and it seems that someone is starting the systemd-suspend.service that invokes systemd-sleep which then forces a suspend by writing to /sys/power/state file. Further investigation is needed as to who is invoking the systemd-suspend.service. In the meantime you can issue command on the host that will disable the service and prevent system from getting suspended:
`systemctl mask systemd-suspend.service`
Actually, it is also possible that the presence of the suspend config indicates user error. If Ubuntu Server was installed, I don't think suspend should have been configured. Did someone install "desktop Ubuntu" (or some other non-server config) on a server?
Also, 18.04 should be used, not 17.10 (and never 17.04), on P9.
== Comment: #25 - Application Cdeadmin <> - 2018-03-23 00:20:53 ==
------- Comment From vaibhav92 2018-03-23 00:19:40 EDT -------
Hi @mzipse @stewart-ibm. AFAIK this issue is not related to CAPP or CAPI at all. @brian_horton had confirmed that he was seeing this even without enabling/running any CAPI workloads. Disabling the systemd-suspend.service made the problem go away. Hence asked the bug to be mirrored to canonical. @mzipse not sure what CAPP errors PRD team saw. Can you please ask them to get in touch with me.
== Comment: #26 - Vaibhav Jain <> - 2018-03-23 00:39:45 ==
Summary of the issue:
Ubuntu 18.04 is forcing a system suspend after 20 mins of system boot. Suspend is forced even if system is running a workload or user is logged on to the terminal and performing any activity. The issue goes away if systemd-suspend service is disabled via:
"systemctl mask systemd-suspend.service"
So requesting canonical to look into this issue as a possible bug in
systemd or user inactivity monitor.
~ Vaibhav
** Affects: systemd (Ubuntu)
Importance: Undecided
Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
Status: New
** Tags: architecture-ppc64le bugnameltc-165829 severity-critical targetmilestone-inin1804
--
DD2.2 freezes/hangs after 20mins of uptime
https://bugs.launchpad.net/bugs/1758273
You received this bug notification because you are a member of Ubuntu Foundations Bugs, which is subscribed to systemd in Ubuntu.
More information about the foundations-bugs
mailing list