APPLIED: [SRU T/X/A/B][C][PATCH 0/1] LP: #1774336 - fix FS-Cache assert

Khaled Elmously khalid.elmously at canonical.com
Wed Jun 6 23:25:28 UTC 2018


Applied to T/X/A/B


On 2018-06-04 16:42:15, Daniel Axtens wrote:
> From: Daniel Axtens <dja at axtens.net>
> 
> == SRU Justification ==
> 
> [Impact]
> Oops during heavy NFS + FSCache use:
> 
> [81738.886634] FS-Cache:
> [81738.888281] FS-Cache: Assertion failed
> [81738.889461] FS-Cache: 6 == 5 is false
> [81738.890625] ------------[ cut here ]------------
> [81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!
> 
> 6 == 5 represents an operation being DEAD when it was not expected to be.
> 
> [Cause]
> There is a race in fscache and cachefiles.
> 
> One thread is in cachefiles_read_waiter:
>  1) object->work_lock is taken.
>  2) the operation is added to the to_do list.
>  3) the work lock is dropped.
>  4) fscache_enqueue_retrieval is called, which takes a reference.
> 
> Another thread is in cachefiles_read_copier:
>  1) object->work_lock is taken.
>  2) an item is popped off the to_do list.
>  3) object->work_lock is dropped.
>  4) some processing is done on the item, and fscache_put_retrieval()
>     is called, dropping a reference.
> 
> Now if this sequence in cachefiles_read_copier takes place
> *between* steps 3 and 4 in cachefiles_read_waiter, a reference will be
> dropped before it is taken, which leads to the object's reference count
> hitting zero, which leads to lifecycle events for the object happening
> too soon, leading to the assertion failure later on.
> 
> (This is simplified and clarified from the original upstream analysis
> for this patch at
> https://www.redhat.com/archives/linux-cachefs/2018-February/msg00001.html
> and from a similar patch with a different approach to fixing the bug
> at
> https://www.redhat.com/archives/linux-cachefs/2017-June/msg00002.html)
> 
> [Fix]
> Move fscache_enqueue_retrieval under the lock in
> cachefiles_read_waiter. This means that the object cannot be popped
> off the to_do list until it is in a fully consistent state with the
> reference taken.
> 
> [Testcase]
> A user has run ~100 hours of NFS stress tests and not seen this bug recur.
> 
> [Regression Potential]
>  - Limited to fscache/cachefiles.
>  - The change makes things more conservative (doing more under lock)
>    so that's reassuring.
>  - There may be performance impacts but none have been observed so far.
> 
> Lei Xue (1):
>   UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race
> 
>  fs/cachefiles/rdwr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> -- 
> 2.17.0
> 
> 
> -- 
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team

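For anyone reading this later, the ordering problem described under [Cause]
and the reordering under [Fix] can be sketched as a small single-threaded toy
model. This is not the kernel code: struct toy_op, toy_get(), toy_put(),
toy_copier() and toy_op_dies_early() are hypothetical names, and the starting
reference count of 1 is an assumption made only to keep the example short. The
copier is simply run at the worst possible moment, right after the waiter
drops the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a retrieval operation (hypothetical, not a kernel type). */
struct toy_op {
    int  usage;   /* reference count; starting value of 1 is an assumption */
    bool dead;    /* set once the count reaches zero */
    bool queued;  /* true while the op sits on the to_do list */
};

static void toy_get(struct toy_op *op)
{
    op->usage++;
}

static void toy_put(struct toy_op *op)
{
    assert(op->usage > 0);
    if (--op->usage == 0)
        op->dead = true;  /* lifecycle events would start here */
}

/* Copier side: pop the op off the to_do list, process it, drop a reference. */
static void toy_copier(struct toy_op *op)
{
    if (!op->queued)
        return;
    op->queued = false;
    toy_put(op);
}

/*
 * Waiter side, with the copier scheduled immediately after the unlock.
 * Returns true if the op was marked dead before the waiter was done with it.
 */
bool toy_op_dies_early(bool enqueue_under_lock)
{
    struct toy_op op = { .usage = 1, .dead = false, .queued = false };
    bool died_early;

    /* lock taken; op added to the to_do list */
    op.queued = true;
    if (enqueue_under_lock)
        toy_get(&op);     /* patched ordering: reference taken before unlock */

    /* lock dropped; the copier gets to run in this window */
    toy_copier(&op);
    died_early = op.dead;

    if (!enqueue_under_lock)
        toy_get(&op);     /* original ordering: reference taken too late */

    return died_early;
}
```

In this model toy_op_dies_early(false) reports the early death that matches
the unexpected DEAD state in the oops, while toy_op_dies_early(true) does not,
because the copier's put can no longer land before the waiter's get.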