[Bug 1758273] Re: DD2.2 freezes/hangs after 20mins of uptime

Steve Langasek steve.langasek at canonical.com
Fri Mar 23 18:51:32 UTC 2018


There are certainly no user inactivity monitors in the default Ubuntu
Server install and we have never seen systemd-suspend.service triggered
on any of our systems.  I don't think this is a systemd bug - it's
either a configuration error, or it's a bug in whatever is triggering
systemd-suspend.service (which systemd itself does not do).

This bug report indicates that there is desktop software installed on
the system where this bug was encountered.  There certainly may be idle
timers in desktop software that trigger a suspend; and there may be bugs
in how those triggers detect that the system is "idle"; possibly
including platform-specific bugs.  Canonical does not support the Ubuntu
Desktop on POWER.  Is this problem reproducible for you with a pristine
install of Ubuntu Server?

** Changed in: systemd (Ubuntu)
       Status: New => Incomplete

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to systemd in Ubuntu.
https://bugs.launchpad.net/bugs/1758273

Title:
  DD2.2 freezes/hangs after 20mins of uptime

Status in The Ubuntu-power-systems project:
  Triaged
Status in systemd package in Ubuntu:
  Incomplete

Bug description:
  == Comment: #0 - Application Cdeadmin  - 2018-03-19 09:30:53 ==
  == Comment: #1 - Application Cdeadmin <> - 2018-03-19 09:30:55 ==
  == Comment: #2 - Application Cdeadmin <> - 2018-03-19 09:30:57 ==
  ------- Comment From brihh 2018-03-19 09:10:30 EDT -------
  Needless to say, machine is pretty unusable. Needed for Performance testing for release.

  == Comment: #3 - Application Cdeadmin <> - 2018-03-19 10:10:59 ==
  ------- Comment From vaibhav92 2018-03-19 10:06:17 EDT -------
  @pridhiviraj Any idea who can look into this ?
  ------- Comment From dougmill-ibm 2018-03-19 10:09:59 EDT -------
  If machine is locked-up and console is unresponsive, try collecting eSEL data from the BMC.

  == Comment: #4 - Application Cdeadmin <c> - 2018-03-19 10:40:59 ==
  ------- Comment From pridhiviraj 2018-03-19 10:38:47 EDT -------
  @brihh Can you use latest 03/15 PNOR and re-create the issue. And also before it hangs please collect OPAL and kernel logs. @vaibhav92 If it is re-creatable OPAL/EM team need to look at it.

  == Comment: #5 - Application Cdeadmin <> - 2018-03-19 11:01:58 ==
  ------- Comment From vaibhav92 2018-03-19 10:41:53 EDT -------
  Saw this in the kernel-log of the system:

  [ 1247.404962] PM: suspend entry (s2idle)
  [ 1247.404970] PM: Syncing filesystems ... done.

  Looks like its getting suspended after 20 mins of inactivity. Looking at the /etc/systemd/logind.conf see IdleAction by default is 'ignore':
  #IdleAction=ignore
  #IdleActionSec=30min

  But clearly someone is issuing a suspend to the system. So this probably need to be looked by the distro/Power-Management/EM team
  ------- Comment From megcurry 2018-03-19 10:43:14 EDT -------
  Please advise re. Assignments and Labels that would get the right team working on this.....Mirror label gets a Bz opened, and that is necessary or at least useful for some of the LTC teams to look at things, right?
  ------- Comment From brihh 2018-03-19 10:47:45 EDT -------
  odd about inactivity - i once had a test running for 20mins and it still froze:

  `08:52:23 up 20 min,  4 users,  load average: 64.40, 98.94, 64.32`
  ------- Comment From brihh 2018-03-19 10:49:30 EDT -------
  @pridhiviraj  where is the latest 3/15 PNOR that I can load?

  ```

  == Comment: #8 - Application Cdeadmin <> - 2018-03-19 12:40:54 ==
  ------- Comment From pridhiviraj 2018-03-19 12:34:18 EDT -------
  ```
  Mar 19 11:35:01 p215n15 rsyslogd-2007: action 'action 13' suspended, next retry is Mon Mar 19 11:35:31 2018 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
  Mar 19 11:35:01 p215n15 CRON[6464]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
  Mar 19 11:42:39 p215n15 systemd[1]: Starting Cleanup of Temporary Directories...
  Mar 19 11:42:39 p215n15 rsyslogd-2007: action 'action 13' suspended, next retry is Mon Mar 19 11:43:39 2018 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
  Mar 19 11:42:40 p215n15 systemd-tmpfiles[6504]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
  Mar 19 11:42:40 p215n15 systemd[1]: Started Cleanup of Temporary Directories.
  Mar 19 11:45:01 p215n15 rsyslogd-2007: action 'action 13' suspended, next retry is Mon Mar 19 11:46:01 2018 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
  Mar 19 11:45:01 p215n15 CRON[6524]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
  Mar 19 11:47:39 p215n15 systemd[1]: Starting Message of the Day...
  Mar 19 11:47:39 p215n15 rsyslogd-2007: action 'action 13' suspended, next retry is Mon Mar 19 11:48:39 2018 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
  Mar 19 11:47:40 p215n15 50-motd-news[6528]:  * Meltdown, Spectre and Ubuntu: What are the attack vectors,
  Mar 19 11:47:40 p215n15 50-motd-news[6528]:    how the fixes work, and everything else you need to know
  Mar 19 11:47:40 p215n15 50-motd-news[6528]:    - https://ubu.one/u2Know
  Mar 19 11:47:40 p215n15 systemd[1]: Started Message of the Day.
  Mar 19 11:48:02 p215n15 NetworkManager[4662]: <info>  [1521474482.7252] manager: sleep: sleep requested (sleeping: no  enabled: yes)
  Mar 19 11:48:02 p215n15 NetworkManager[4662]: <info>  [1521474482.7259] manager: NetworkManager state is now ASLEEP
  Mar 19 11:48:02 p215n15 gnome-shell[5538]: Screen lock is locked down, not locking
  Mar 19 11:48:02 p215n15 whoopsie[5264]: [11:48:02] offline
  Mar 19 11:48:02 p215n15 systemd[1]: Reached target Sleep.
  Mar 19 11:48:02 p215n15 systemd[1]: Starting Suspend...
  Mar 19 11:48:02 p215n15 systemd-sleep[6576]: Suspending system...
  Mar 19 11:48:02 p215n15 kernel: [ 1246.906320] PM: suspend entry (s2idle)
  ```
  @vaibhav92 You are right, from the above messages looks like system is suspending by some cron job i guess.

  == Comment: #9 - Application Cdeadmin <> - 2018-03-19 13:30:57 ==
  ------- Comment From brihh 2018-03-19 13:28:36 EDT -------
  @pridhiviraj interesting. know offhand how i can shut that off? seems a bit annoying .. :-)

  == Comment: #11 - Application Cdeadmin <> - 2018-03-19 17:10:51 ==
  ------- Comment From mzipse 2018-03-19 16:52:11 EDT -------
  At our daily defect call, it was suggested that we check to see if Opal-PRD is running, which is a prereq for the Firmware recovery to work properly.  Opal-PRD is an app that should be part of the distro and should automatically be started.  Never-the-less, if you are want to check on it, here's the command to run at the OS..... 

  sudo service opal-prd status

  You'll need root authority to do this.

  The output will look something like this.....
  # sudo service opal-prd status
  ? opal-prd.service - OPAL PRD daemon
     Loaded: loaded (/lib/systemd/system/opal-prd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2018-03-14 19:04:36 CDT; 4 days ago
       Docs: man:opal-prd(8)
   Main PID: 5085 (opal-prd)
      Tasks: 1
     CGroup: /system.slice/opal-prd.service
             ??5085 /usr/sbin/opal-prd --pnor /dev/mtd0

  Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x00d0] 1d000000 bd021c00 0000bf02 1c000000    *................*
  Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x00e0] c1021b00 0000c302 1d000000 00030000    *................*
  Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x00f0] 0000ff06 21465245 51000206 16000000    *....!FREQ.......*
  Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x0100] 5d08f600 00006008 f6000000 6308f600    *].....`.....c...*
  Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x0110] 00006608 f6000000 6908f600 00006c08    *..f.....i.....l.*
  Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x0120] f6000000 6f08f600 00007208             *....o.....r.    *
  Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:OCC1 rsp status=0x00, length=0x01CC
  Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:rsp data: (up to 16 bytes)
  Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:~[0x0000] 03100200 03000000 00000000 00000000    *................*
  Mar 18 19:52:21 ws003p1 opal-prd[5085]: HBRT: HTMGT:<<processOccError()

  == Comment: #13 - Application Cdeadmin <> - 2018-03-19 17:30:51 ==
  ------- Comment From brihh 2018-03-19 17:22:06 EDT -------
  @mzipse seems to be there and running:
  ```
  root at p215n15:~# service opal-prd status
  ? opal-prd.service - OPAL PRD daemon
     Loaded: loaded (/lib/systemd/system/opal-prd.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2018-03-19 18:19:58 EDT; 1min 35s ago
       Docs: man:opal-prd(8)
   Main PID: 4648 (opal-prd)
      Tasks: 1 (limit: 27033)
     CGroup: /system.slice/opal-prd.service
             ??4648 /usr/sbin/opal-prd

  Mar 19 18:19:58 p215n15 opal-prd[4648]: SCOM: read: chip 0x8, addr 0x8010a50, val 0x0, rc 0
  Mar 19 18:19:58 p215n15 opal-prd[4648]: SCOM: read: chip 0x8, addr 0x8010a54, val 0x0, rc 0
  Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: PRDF:<<PRDF::noLock_initialize()
  Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: PRDF:<<PRDF::initialize()
  Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: ATTN_SLOW:I>Service::enableAttns() enter
  Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: ATTN_SLOW:I>Service::enableAttns() exit
  Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: ATTN_SLOW:I><<ATTN_RT::enableAttns rc: 0
  Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: calling get_ipoll_events
  Mar 19 18:19:58 p215n15 opal-prd[4648]: HBRT: enabling IPOLL events 0x5b90000000000000
  Mar 19 18:19:58 p215n15 opal-prd[4648]: FW: writing init message
  ```

  == Comment: #14 - Application Cdeadmin <> - 2018-03-20 03:50:53 ==
  ------- Comment From vaibhav92 2018-03-20 03:47:08 EDT -------
  Did some kernel tracing and it seems that someone is starting the systemd-suspend.service that invokes systemd-sleep which then forces a suspend by writing to /sys/power/state file. Further investigation is needed as to who is invoking the systemd-suspend.service. In the meantime you can issue command on the host that will disable the service and prevent system from getting suspended:
  `systemctl mask systemd-suspend.service`

  
  Actually, it is also possible that the presence of the suspend config indicates user error. If Ubuntu Server was installed, I don't think suspend should have been configured. Did someone install "desktop Ubuntu" (or some other non-server config) on a server?

  Also, 18.04 should be used, not 17.10 (and never 17.04), on P9.

  == Comment: #25 - Application Cdeadmin <> - 2018-03-23 00:20:53 ==
  ------- Comment From vaibhav92 2018-03-23 00:19:40 EDT -------
  Hi @mzipse @stewart-ibm. AFAIK this issue is not related to CAPP or CAPI at all. @brian_horton had confirmed that he was seeing this even without enabling/running any CAPI workloads. Disabling the systemd-suspend.service made the problem go away. Hence asked the bug to be mirrored to canonical.  @mzipse not sure what CAPP errors PRD team saw. Can you please ask them to get in touch with me.

  == Comment: #26 - Vaibhav Jain <> - 2018-03-23 00:39:45 ==
  Summary of the issue:

  Ubuntu 18.04 is forcing a system suspend after 20 mins of system boot. Suspend is forced even if system is running a workload or user is logged on to the terminal and performing any activity.  The issue goes away if systemd-suspend service is disabled via:
  "systemctl mask systemd-suspend.service"

  So requesting canonical to look into this issue as a possible bug in
  systemd or user inactivity monitor.

  ~ Vaibhav

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1758273/+subscriptions



More information about the foundations-bugs mailing list