[Bug 1664435] [NEW] Upgrade timeout logic is faulty

Billy Olsen billy.olsen at canonical.com
Tue Feb 14 03:24:29 UTC 2017


Public bug reported:

When upgrading a ceph cluster, the charm code orders the nodes and
performs the upgrade one node at a time. The charms use ceph's ability
to provide an arbitrary key/value store in the monitor and record the
progress of the upgrade in this key storage.

This allows each node to watch this central storage for the progress of
the upgrade. As a node begins its upgrade path, it stores its start time
(via time.time()) in the ceph monitor's key/value storage. The node
scheduled to upgrade after the current node reads the value stored in
the key and compares it to a timestamp from 10 minutes ago to determine
whether the previous node should be considered timed out.
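For illustration, here is a minimal sketch of that handshake, using a
dict as a stand-in for the monitor's key/value store and hypothetical
helper names (monitor_key_set/monitor_key_get); the charm's real
helpers may differ:

    import time

    # Stand-in for the monitor's key/value store; the real helpers
    # talk to the ceph monitor, which persists values as strings.
    _store = {}

    def monitor_key_set(key, value):
        _store[key] = str(value)

    def monitor_key_get(key):
        return _store.get(key)

    def mark_upgrade_started(node):
        # Record this node's upgrade start time as given by time.time().
        monitor_key_set('{}_start'.format(node), time.time())

    def previous_node_timed_out(prev_node):
        # 'started' comes back as a string, so this is the faulty
        # str-vs-float comparison described below.
        started = monitor_key_get('{}_start'.format(prev_node))
        return started < time.time() - 600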

The problem is that the value read from the monitor's key is returned
as a string, which is then compared to the floating-point value from
the time.time() call. As a result, the node never times out the
previous node.
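Concretely, the reported behaviour is what Python 2 comparison
semantics produce: CPython 2 orders numbers before every other type, so
any str compares greater than any float (Python 3 would instead raise a
TypeError):

    import time

    stored = str(time.time() - 3600)   # started an hour ago, stored as a string
    cutoff = time.time() - 600         # "ten minutes ago" threshold

    # Python 2: str > float is always True, so the recorded start time
    # always looks recent and the previous node is never timed out.
    print(stored > cutoff)             # True, regardless of the actual times

    # Casting back to float gives the intended answer.
    print(float(stored) > cutoff)      # False: this node did time out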

This is, however, a good thing. In the currently released form of the
charms (16.10), the upgrade path always recursively chowns the OSD
directories, which on a production cluster is unlikely to finish within
10 minutes. If the timeout worked as written, each node would give up
waiting after 10 minutes and begin its own upgrade, so the ceph charms
would stop all services on the cluster at the same time, effectively
causing an entire-cluster outage.
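For context on the cost, a recursive chown amounts to one chown per
directory and file in the OSD tree; a minimal illustration in plain
Python (the charm's actual implementation may differ):

    import os
    import pwd
    import grp

    def chown_osd_tree(path, owner='ceph', group='ceph'):
        # One os.chown() per directory and file under the OSD data
        # directory. With millions of objects per OSD, this easily
        # runs longer than the ten-minute timeout.
        uid = pwd.getpwnam(owner).pw_uid
        gid = grp.getgrnam(group).gr_gid
        os.chown(path, uid, gid)
        for root, dirs, files in os.walk(path):
            for name in dirs + files:
                os.chown(os.path.join(root, name), uid, gid)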

Instead of fixing this comparison so that the timeout works, I propose
that the timeout logic be removed completely and that the error
conditions be revisited in order to prevent a sweeping cluster outage.
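One possible shape for that, reusing the hypothetical
monitor_key_get() stand-in from the sketch above: block until the
previous node records completion, with no deadline, so a stuck node is
surfaced to the operator instead of silently overtaken:

    import time

    def wait_on_previous_node(prev_node):
        # Poll the monitor's key/value store until the previous node
        # marks its upgrade as done. There is deliberately no timeout:
        # overtaking a slow node would stop services on two (or, in
        # cascade, all) nodes at once.
        while not monitor_key_get('{}_done'.format(prev_node)):
            time.sleep(60)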

** Affects: charms.ceph
     Importance: Medium
         Status: Triaged

** Affects: ceph (Juju Charms Collection)
     Importance: Undecided
         Status: New

** Affects: ceph-mon (Juju Charms Collection)
     Importance: Undecided
         Status: New

** Affects: ceph-osd (Juju Charms Collection)
     Importance: Undecided
         Status: New


** Tags: sts

** Also affects: ceph-osd (Juju Charms Collection)
   Importance: Undecided
       Status: New

** Also affects: ceph (Ubuntu)
   Importance: Undecided
       Status: New

** No longer affects: ceph (Ubuntu)

** Also affects: ceph-mon (Juju Charms Collection)
   Importance: Undecided
       Status: New

** Also affects: ceph (Juju Charms Collection)
   Importance: Undecided
       Status: New

--
https://bugs.launchpad.net/bugs/1664435