[SRU T/X/A/B][C][PATCH 0/1] LP: #1774336 - fix FS-Cache assert

Daniel Axtens daniel.axtens at canonical.com
Mon Jun 4 06:42:15 UTC 2018


From: Daniel Axtens <dja at axtens.net>

== SRU Justification ==

[Impact]
Oops during heavy NFS + FSCache use:

[81738.886634] FS-Cache:
[81738.888281] FS-Cache: Assertion failed
[81738.889461] FS-Cache: 6 == 5 is false
[81738.890625] ------------[ cut here ]------------
[81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!

6 == 5 represents an operation being DEAD when it was not expected to be.

[Cause]
There is a race in fscache and cachefiles.

One thread is in cachefiles_read_waiter:
 1) object->work_lock is taken.
 2) the operation is added to the to_do list.
 3) the work lock is dropped.
 4) fscache_enqueue_retrieval is called, which takes a reference.

Another thread is in cachefiles_read_copier:
 1) object->work_lock is taken.
 2) an item is popped off the to_do list.
 3) object->work_lock is dropped.
 4) some processing is done on the item, and fscache_put_retrieval()
    is called, dropping a reference.

If this sequence in cachefiles_read_copier runs *between* steps 3 and 4
of cachefiles_read_waiter, a reference is dropped before it is taken.
The object's reference count then hits zero, so lifecycle events for
the object happen too soon, which leads to the assertion failure later
on.

(This is simplified and clarified from the original upstream analysis
for this patch at
https://www.redhat.com/archives/linux-cachefs/2018-February/msg00001.html
and from a similar patch with a different approach to fixing the bug
at
https://www.redhat.com/archives/linux-cachefs/2017-June/msg00002.html)
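
The effect of the two orderings on the reference count can be shown
with a small userspace model. This is only a sketch, not kernel code:
lowest_count(), the enum and the starting count of 1 are assumptions
made for illustration. Only the order of the two step 4 calls comes
from the analysis above.

```c
/* Userspace model of the reference count only; not kernel code.
 * All names and the starting count of 1 are assumptions. */

enum step4_order {
    WAITER_FIRST,   /* reference taken, then dropped (intended order) */
    COPIER_FIRST    /* reference dropped, then taken (the race) */
};

/* Returns the lowest value the modelled count reaches. */
static int lowest_count(enum step4_order order)
{
    int usage = 1;      /* assumed baseline reference */
    int lowest = usage;
    /* +1 models fscache_enqueue_retrieval(),
     * -1 models fscache_put_retrieval() */
    int first = (order == WAITER_FIRST) ? +1 : -1;
    int deltas[2] = { first, -first };

    for (int i = 0; i < 2; i++) {
        usage += deltas[i];
        if (usage < lowest)
            lowest = usage;
    }
    return lowest;
}
```

In this model lowest_count(WAITER_FIRST) stays at 1, while
lowest_count(COPIER_FIRST) touches 0, which is the point at which the
premature lifecycle events would begin.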

[Fix]
Move the fscache_enqueue_retrieval call under the lock in
cachefiles_read_waiter. The item then cannot be popped off the to_do
list until it is in a fully consistent state, with the reference
taken.
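
The locking pattern can be sketched in userspace with pthreads. This is
an illustration under assumptions, not the patch itself: struct
model_object, the model_* functions, the one-slot to_do flag and the
starting count of 1 are all hypothetical stand-ins, and a pthread mutex
stands in for object->work_lock. Error checking is omitted for brevity.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the cachefiles object; not kernel code. */
struct model_object {
    pthread_mutex_t work_lock;  /* stands in for object->work_lock */
    bool on_to_do;              /* one-slot stand-in for the to_do list */
    atomic_int usage;           /* stands in for the reference count */
    int lowest_seen;            /* written only by the copier thread */
};

/* Waiter with the fix: the reference is taken before the lock drops. */
static void *model_read_waiter(void *arg)
{
    struct model_object *obj = arg;

    pthread_mutex_lock(&obj->work_lock);
    obj->on_to_do = true;               /* add to the to_do list */
    atomic_fetch_add(&obj->usage, 1);   /* models the enqueue's get */
    pthread_mutex_unlock(&obj->work_lock);
    return NULL;
}

/* Copier: pop under the lock, then drop a reference outside it. */
static void *model_read_copier(void *arg)
{
    struct model_object *obj = arg;
    bool popped = false;

    while (!popped) {
        pthread_mutex_lock(&obj->work_lock);
        popped = obj->on_to_do;
        obj->on_to_do = false;
        pthread_mutex_unlock(&obj->work_lock);
        if (!popped)
            sched_yield();
    }
    /* models fscache_put_retrieval(); record the count after the put */
    obj->lowest_seen = atomic_fetch_sub(&obj->usage, 1) - 1;
    return NULL;
}

/* Runs one waiter and one copier; returns the count after the put. */
static int run_fixed_model(void)
{
    struct model_object obj = { .on_to_do = false, .lowest_seen = 1 };
    pthread_t waiter, copier;

    pthread_mutex_init(&obj.work_lock, NULL);
    atomic_init(&obj.usage, 1);         /* assumed baseline reference */

    pthread_create(&copier, NULL, model_read_copier, &obj);
    pthread_create(&waiter, NULL, model_read_waiter, &obj);
    pthread_join(waiter, NULL);
    pthread_join(copier, NULL);

    pthread_mutex_destroy(&obj.work_lock);
    return obj.lowest_seen;
}
```

Because the item only becomes visible on the list in the same critical
section that takes the reference, the copier's put in this model always
leaves the count at 1, whichever thread is scheduled first.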

[Testcase]
A user has run ~100 hours of NFS stress tests and not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.
 - The change makes things more conservative (doing more under lock)
   so that's reassuring.
 - There may be performance impacts but none have been observed so far.

Lei Xue (1):
  UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race

 fs/cachefiles/rdwr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.17.0




