[Bug 1987355] Re: Error validating X-Service-Token
Jorge Merlino
1987355 at bugs.launchpad.net
Mon Aug 14 20:15:46 UTC 2023
** Description changed:
- I found this issue when nova calls cinder with an expired X-Auth-Token
- but it is configured to also send a X-Service-Token. The traffic goes
- like this:
+ [Impact]
+ This bug can cause a race condition for long-running services that reuse their token (e.g. the Kubernetes Cinder CSI plugin) when the following occurs:
- nova-compute -> cinder: post with X-Auth-Token and X-Service-Token
- cinder -> keystone: validate X-Auth-Token
- keystone -> cinder: returns 404
- cinder -> nova-compute: returns 401
- nova-compute -> cinder: retry post with new X-Service-Token
- cinder -> keystone: validate X-Service-Token
- keystone -> cinder: returns 200 showing that the token is valid
- cinder -> nova-compute: returns 401
+ 1 [service] Asks nova to attach a volume to a server
+ 2 ...the user's token expires
+ 3 [service] Asks cinder if the volume has been attached
+ 4 [nova] Asks cinder to attach the volume
- As I understand it, Cinder should return 200 in the last message as the
- token is valid.
+ In step 3 the token is marked as invalid in the cache, and step 4 fails
+ even if the token is accompanied by a valid service token. The key point
+ is that step 3 has to happen before step 4, which is infrequent, hence
+ the race condition.
- My test client is a long running service that uses the same token to
- communicate to nova until it receives a 401 and then generates a new
- one. Sometimes the token is invalidated in the middle of a transaction
- and nova returns 200 to the client but cinder returns 401 to nova.
+ Also, the client will ask for a new user token if it is not authorized
+ in the calls in steps 1 or 3, but if the token is marked as invalid in
+ step 3, then step 4 fails and the volume becomes stuck in "detaching"
+ status.
- I have managed to reproduce this both on ussuri and yoga (the code I
- mentioned has not been changed in 7 years).
+ [Test Plan]
+ It is hard to reproduce this bug as it depends on the timing of packets and the token expiration. I was able to reproduce it by reducing the token expiration to 60 seconds and running a Go script that constantly attaches and detaches volumes. Even then it may take some time for the bug to occur.
+
+ [Where problems could occur]
+ The patch removes code that works as an optimization to save the time needed to recheck invalid tokens, so it should not introduce problems besides the loss of that optimization. The new code will return all tokens from the cache for validation instead of throwing an exception. If a token is actually invalid, that will be detected later on.
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1987355
Title:
Error validating X-Service-Token
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive antelope series:
Fix Released
Status in Ubuntu Cloud Archive bobcat series:
Fix Released
Status in Ubuntu Cloud Archive ussuri series:
New
Status in Ubuntu Cloud Archive victoria series:
New
Status in Ubuntu Cloud Archive wallaby series:
New
Status in Ubuntu Cloud Archive xena series:
New
Status in Ubuntu Cloud Archive yoga series:
New
Status in Ubuntu Cloud Archive zed series:
New
Status in keystonemiddleware:
Fix Released
Status in python-keystonemiddleware package in Ubuntu:
Fix Released
Status in python-keystonemiddleware source package in Focal:
New
Status in python-keystonemiddleware source package in Jammy:
New
Status in python-keystonemiddleware source package in Lunar:
Fix Released
Status in python-keystonemiddleware source package in Mantic:
Fix Released
Bug description:
[Impact]
This bug can cause a race condition for long-running services that reuse their token (e.g. the Kubernetes Cinder CSI plugin) when the following occurs:
1 [service] Asks nova to attach a volume to a server
2 ...the user's token expires
3 [service] Asks cinder if the volume has been attached
4 [nova] Asks cinder to attach the volume
In step 3 the token is marked as invalid in the cache, and step 4 fails
even if the token is accompanied by a valid service token. The key
point is that step 3 has to happen before step 4, which is infrequent,
hence the race condition.
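For illustration only, the failing call in step 4 is a cinder API request that carries both tokens as headers. The sketch below uses python-requests with placeholder URL, IDs and token values (and a simple volume GET rather than the real attach call), purely to show the two headers involved:

    import requests

    # Placeholders for illustration; not values from a real deployment.
    cinder_url = "http://cinder.example:8776/v3/PROJECT_ID"
    user_token = "gAAAAAB..."      # the user's (possibly expired) token
    service_token = "gAAAAAB..."   # a fresh token for nova's service user

    # Nova forwards both tokens. Per this report, keystonemiddleware on
    # the cinder side should accept the request thanks to the valid
    # X-Service-Token, but a cached "invalid" entry for the user token
    # (written in step 3) makes it return 401 instead.
    resp = requests.get(
        cinder_url + "/volumes/VOLUME_ID",
        headers={
            "X-Auth-Token": user_token,
            "X-Service-Token": service_token,
        },
    )
    print(resp.status_code)  # 401 when the race is hit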
Also, the client will ask for a new user token if it is not authorized
in the calls in steps 1 or 3, but if the token is marked as invalid in
step 3, then step 4 fails and the volume becomes stuck in "detaching"
status.
[Test Plan]
It is hard to reproduce this bug as it depends on the timing of packets and the token expiration. I was able to reproduce it by reducing the token expiration to 60 seconds and running a Go script that constantly attaches and detaches volumes. Even then it may take some time for the bug to occur.
The code used is here: https://paste.ubuntu.com/p/CbGNzGxYt9/
The OpenStack auth information should be set in lines 99-105 of the script, and then the script should be called with 3 parameters: the ID of a volume and the IDs of two servers. The script attaches and detaches the volume between those two servers.
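The linked reproducer is a Go program; a rough Python equivalent of the same attach/detach loop, using openstacksdk, might look like the following sketch (the cloud name, wait times and positional arguments are placeholders, not taken from the actual script):

    import sys

    import openstack

    # Rough sketch only; the actual reproducer linked above is written in Go.
    conn = openstack.connect(cloud="mycloud")  # auth taken from clouds.yaml

    volume_id, server_a, server_b = sys.argv[1], sys.argv[2], sys.argv[3]

    while True:
        # Attach the volume to one server, wait, detach, wait, then
        # repeat with the other server, indefinitely.
        for server in (server_a, server_b):
            attachment = conn.compute.create_volume_attachment(
                server, volume_id=volume_id)
            conn.block_storage.wait_for_status(
                conn.block_storage.get_volume(volume_id),
                status="in-use", failures=["error"], wait=300)
            conn.compute.delete_volume_attachment(attachment, server)
            conn.block_storage.wait_for_status(
                conn.block_storage.get_volume(volume_id),
                status="available", failures=["error"], wait=300)

With the keystone token expiration lowered to 60 seconds, such a loop eventually hits the window described in [Impact] and the volume gets stuck in "detaching".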
[Where problems could occur]
The patch removes code that works as an optimization to save the time needed to recheck invalid tokens, so it should not introduce problems besides the loss of that optimization. The new code will return all tokens from the cache for validation instead of throwing an exception. If a token is actually invalid, that will be detected later on.
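As a hypothetical simplification of the behaviour described above (the names and structure below are invented for illustration and do not match keystonemiddleware's real code or the exact patch), the change is essentially from "raise on a cached invalid marker" to "fall through and revalidate":

    # Hypothetical sketch only; identifiers are made up for illustration.
    INVALID_MARKER = "invalid"

    class InvalidToken(Exception):
        pass

    def fetch_token(token, cache, identity_server):
        cached = cache.get(token)

        # Old behaviour (the removed optimization): a cached "invalid"
        # marker raised immediately, before a valid X-Service-Token on
        # the request could be taken into account.
        #
        #     if cached == INVALID_MARKER:
        #         raise InvalidToken()

        if cached is not None and cached != INVALID_MARKER:
            return cached  # a still-valid cached entry, unchanged path

        # New behaviour: fall through to revalidation; a genuinely
        # invalid token is still rejected here, just without the
        # shortcut, while a request carrying a valid service token can
        # now succeed.
        data = identity_server.validate(token)
        cache.set(token, data)
        return data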
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1987355/+subscriptions