[Bug 1477225] Re: ceph-radosgw restart fails

Fri Sep 4 13:39:52 UTC 2015

** Description changed:

  [Impact]

  On 14.04 the restart target of the sysvinit script brings the service down
- but almost always fails to bring the service back up again.
+ but sometimes fails to bring the service back up again. There is a race between stop and start: in the failure case, the attempt to bring the service up runs before the service has stopped, so the start command is never issued.

  The proposed fix updates /etc/init.d/radosgw so that the stop target
  waits for up to 30 seconds for the service to stop cleanly.

  [Test Case]

- sudo apt-get install --yes radosgw
- sudo mkdir /etc/ceph
- sudo su -
- cat <<-EOF > /etc/ceph/ceph.conf
- [global]
+ Bundle:

- auth cluster required = cephx
- auth service required = cephx
- auth client required = cephx
+ openstack-services:
+   services:
+     mysql:
+       branch: lp:~openstack-charmers/charms/trusty/percona-cluster/next
+       constraints: mem=1G
+       options:
+         dataset-size: 50%
+     ceph:
+       branch: lp:~openstack-charmers/charms/trusty/ceph/next
+       num_units: 3
+       constraints: mem=1G
+       options:
+         monitor-count: 3
+         fsid: 6547bd3e-1397-11e2-82e5-53567c8d32dc
+         monitor-secret: AQCXrnZQwI7KGBAAiPofmKEXKxu5bUzoYLVkbQ==
+         osd-devices: /dev/vdb
+         osd-reformat: "yes"
+         ephemeral-unmount: /mnt
+     keystone:
+       branch: lp:~openstack-charmers/charms/trusty/keystone/next
+       constraints: mem=1G
+       options:
+         admin-password: openstack
+         admin-token: ubuntutesting
+     ceph-radosgw:
+       branch: lp:~openstack-charmers/charms/trusty/ceph-radosgw/next
+       options:
+         use-embedded-webserver: True
+   relations:
+     - [ keystone, mysql ]
+     - [ ceph-radosgw, keystone ]
+     - [ ceph-radosgw, ceph ]
+ # kilo
+ trusty-kilo:
+   inherits: openstack-services
+   series: trusty
+   overrides:
+     openstack-origin: cloud:trusty-kilo
+     source: cloud:trusty-kilo

- mon host = 127.0.0.1:6789

- [client.radosgw.gateway]
- host = $(hostname -s)
- keyring = /etc/ceph/keyring.rados.gateway
- rgw socket path = /tmp/radosgw.sock
- log file = /var/log/ceph/radosgw.log
- rgw frontends = civetweb port=70
- EOF
- 
- cat <<-EOF > /etc/ceph/keyring.rados.gateway
- [client.radosgw.gateway]
-         key = BBBBBBBBBBBBBBBBB/kkkkkkkkkkkkkkkkkkkk==
- EOF
- 
- service radosgw stop
- service radosgw start
- service radosgw status
- service radosgw restart
- service radosgw status
- 
- At this point /usr/bin/radosgw will not be running
+ $ juju-deployer -c next.yaml trusty-kilo
+ $ juju ssh ceph-radosgw/0
+ $ sudo su -
+ # service radosgw status
+ /usr/bin/radosgw is running.
+ # service radosgw restart
+ Starting client.radosgw.gateway...
+ /usr/bin/radosgw already running.
+ /usr/bin/radosgw is running.
+ # service radosgw status
+ /usr/bin/radosgw is not running.

  [Regression Potential]

   * The only change in behaviour that would result from this fix is that
     running the stop target in the init script will wait for up to 30s before
     exiting rather than returning immediately. I cannot think of any use cases
     where this would be an issue.

  [Original Bug Report]
  job handler:
  Jul 22 16:03:44 job-handler-1 ERR Failed to execute job: PUT request for http://10.96.4.129:80/swift/v1/simplestreams failed with code 500 Internal Server Error: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator at \n ceph at ubuntu.com to inform them of the time this error occurred,\n and the actions you performed just before this error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n</body></html>\n'#012Traceback (most recent call last):#012 File "/opt/canonical/landscape/canonical/landscape/model/activity/jobrunner.py", line 38, in run#012 yield self._run_activity(account_id, activity_id)#012HTTPError: PUT request for http://10.96.4.129:80/swift/v1/simplestreams failed with code 500 Internal Server Error: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator at \n ceph at ubuntu.com to inform them of the time this error occurred,\n and the actions you performed just before this error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n</body></html>\n'

  Other logs attached.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1477225

Title:
  ceph-radosgw restart fails

Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Trusty:
  Triaged
Status in ceph source package in Wily:
  Fix Released
Status in ceph-radosgw package in Juju Charms Collection:
  Invalid

Bug description:
  [Impact]

  On 14.04 the restart target of the sysvinit script brings the service down
  but sometimes fails to bring the service back up again. There is a race between stop and start: in the failure case, the attempt to bring the service up runs before the service has stopped, so the start command is never issued.

  The proposed fix updates /etc/init.d/radosgw so that the stop target
  waits for up to 30 seconds for the service to stop cleanly.
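
  The waiting behaviour described above could be sketched roughly as
  follows. This is an illustration only, not the actual patch to
  /etc/init.d/radosgw: the wait_for_exit helper, its arguments, and the
  one-second polling interval are assumptions made for the example.

```shell
# Hypothetical helper (not taken from the real init script):
# wait up to $2 seconds (default 30) for the process with PID $1 to exit.
# Returns 0 once the process is gone, 1 if it is still alive at the timeout.
wait_for_exit() {
    pid=$1
    timeout=${2:-30}
    elapsed=0
    # kill -0 sends no signal; it only checks whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

  A stop target built this way would signal the daemon and then call
  something like wait_for_exit "$pid" 30 before returning, so that a
  following start no longer finds the old process still running.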

  [Test Case]

  Bundle:

  openstack-services:
    services:
      mysql:
        branch: lp:~openstack-charmers/charms/trusty/percona-cluster/next
        constraints: mem=1G
        options:
          dataset-size: 50%
      ceph:
        branch: lp:~openstack-charmers/charms/trusty/ceph/next
        num_units: 3
        constraints: mem=1G
        options:
          monitor-count: 3
          fsid: 6547bd3e-1397-11e2-82e5-53567c8d32dc
          monitor-secret: AQCXrnZQwI7KGBAAiPofmKEXKxu5bUzoYLVkbQ==
          osd-devices: /dev/vdb
          osd-reformat: "yes"
          ephemeral-unmount: /mnt
      keystone:
        branch: lp:~openstack-charmers/charms/trusty/keystone/next
        constraints: mem=1G
        options:
          admin-password: openstack
          admin-token: ubuntutesting
      ceph-radosgw:
        branch: lp:~openstack-charmers/charms/trusty/ceph-radosgw/next
        options:
          use-embedded-webserver: True
    relations:
      - [ keystone, mysql ]
      - [ ceph-radosgw, keystone ]
      - [ ceph-radosgw, ceph ]
  # kilo
  trusty-kilo:
    inherits: openstack-services
    series: trusty
    overrides:
      openstack-origin: cloud:trusty-kilo
      source: cloud:trusty-kilo

  $ juju-deployer -c next.yaml trusty-kilo
  $ juju ssh ceph-radosgw/0
  $ sudo su -
  # service radosgw status
  /usr/bin/radosgw is running.
  # service radosgw restart
  Starting client.radosgw.gateway...
  /usr/bin/radosgw already running.
  /usr/bin/radosgw is running.
  # service radosgw status
  /usr/bin/radosgw is not running.
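
  Because the failure is intermittent, a small check like the one below
  may help when verifying the fix. It is a sketch only: the check_restart
  name and the settle delay are assumptions, and it relies solely on the
  "is not running" status text shown in the transcript above.

```shell
# Hypothetical check: restart radosgw, give the old process time to exit,
# then look for the "is not running" text in the status output.
check_restart() {
    settle=${1:-5}    # assumed settle time in seconds, not from the bug report
    service radosgw restart
    sleep "$settle"
    if service radosgw status | grep -q "is not running"; then
        echo "FAIL: radosgw did not come back after restart"
        return 1
    fi
    echo "OK: radosgw is running after restart"
    return 0
}
```

  Running check_restart several times in a row on the ceph-radosgw unit
  should make the race easier to observe on an unpatched system.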

  [Regression Potential]

   * The only change in behaviour that would result from this fix is that
     running the stop target in the init script will wait for up to 30s before
     exiting rather than returning immediately. I cannot think of any use cases
     where this would be an issue.

  [Original Bug Report]
  job handler:
  Jul 22 16:03:44 job-handler-1 ERR Failed to execute job: PUT request for http://10.96.4.129:80/swift/v1/simplestreams failed with code 500 Internal Server Error: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator at \n ceph at ubuntu.com to inform them of the time this error occurred,\n and the actions you performed just before this error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n</body></html>\n'#012Traceback (most recent call last):#012 File "/opt/canonical/landscape/canonical/landscape/model/activity/jobrunner.py", line 38, in run#012 yield self._run_activity(account_id, activity_id)#012HTTPError: PUT request for http://10.96.4.129:80/swift/v1/simplestreams failed with code 500 Internal Server Error: '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator at \n ceph at ubuntu.com to inform them of the time this error occurred,\n and the actions you performed just before this error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n</body></html>\n'

  Other logs attached.
