[SRU][B/E/F][PATCH 1/1] hugetlbfs: take read_lock on i_mmap for PMD sharing

Gavin Guo gavin.guo at canonical.com
Mon Jun 22 23:02:11 UTC 2020


On Mon, Jun 22, 2020 at 11:51 PM Gerald Yang <gerald.yang at canonical.com> wrote:
>
> Hi Colin,
>
> The steps to simulate the issue are as follows:
> 1. add a kprobe to huge_pmd_share, with its offset set between taking the write lock on
> the mapping and vma_interval_tree_foreach:
> ==================================================
> i_mmap_lock_write(mapping);
> *kprobe*
> vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> ==================================================
> and add a 2 ms delay in the kprobe handler to simulate a long search time
>
> 2. use a BPF tool to insert a kprobe before the above kprobe and record the timestamp when a process reaches this point
>
> 3. use libhugetlbfs (https://github.com/libhugetlbfs/libhugetlbfs.git) to create 4 processes and trigger the VMA interval tree scan simultaneously
>
> So we can observe how much time each process takes to search the VMA interval tree
>
> Without this patch, each process needs to take the write lock before searching the tree,
> so only one process can search at a time, and each search takes around 2 ms to complete. The following is the output of the bcc tool:
>
> root@testvm:~/kprobe# /usr/share/bcc/tools/sharepmd
> kernel function: huge_pmd_share
>   2 233713.406179438 11499  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   3 233715.392836926 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c90 HUGEPMD
>   0 233717.378148059 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1fbfc90 HUGEPMD
>   1 233719.366618585 11498  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
>   2 233721.589174146 11499  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   0 233723.577443027 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc90 HUGEPMD
>   3 233725.565996868 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1fbfc80 HUGEPMD
>   1 233727.788305070 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   0 233729.773605019 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c90 HUGEPMD
>   2 233731.761853316 11499  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc80 HUGEPMD
>   1 233733.988826353 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   3 233735.977068987 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c90 HUGEPMD
>   0 233737.965477337 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
>   1 233740.186784801 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   3 233742.174938960 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c90 HUGEPMD
>   2 233744.163177097 11499  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c80 HUGEPMD
>   1 233746.382554150 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   0 233748.370968889 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc90 HUGEPMD
>   3 233750.359224507 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
>   2 233752.563115019 11499  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>
> With the patch, all processes take the read lock before searching the VMA interval tree,
> so they can search it simultaneously, as shown below:
>
> root@testvm:~/kprobe# /usr/share/bcc/tools/sharepmd
> kernel function: pcibios_pm_ops
>   6 471.183414969 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   5 471.183428598 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   4 471.183414908 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   3 471.183414962 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   6 476.111138155 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   4 476.111007350 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   1 476.111408005 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   5 476.111808625 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   5 480.421521033 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   4 480.421618975 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   1 480.421988461 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   6 480.422808252 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   4 484.732804127 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   6 484.733568351 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   5 484.733698494 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   1 484.733473733 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   4 489.037550955 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   6 489.037880018 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   5 489.038004825 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   1 489.038070483 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>
> On Thu, Jun 4, 2020 at 6:13 PM Gavin Guo <gavin.guo at canonical.com> wrote:
>>
>> Hi Colin,
>>
>> On Thu, Jun 4, 2020 at 5:57 PM Colin Ian King <colin.king at canonical.com> wrote:
>> >
>> > On 04/06/2020 10:39, Gavin Guo wrote:
>> > > From: Waiman Long <longman at redhat.com>
>> > >
>> > > BugLink: https://bugs.launchpad.net/bugs/1882039
>> > >
>> > > A customer with large SMP systems (up to 16 sockets) and an application
>> > > that uses a large amount of static hugepages (~500-1500GB) is
>> > > experiencing random multisecond delays.  These delays were caused by the
>> > > long time it took to scan the VMA interval tree with mmap_sem held.
>> > >
>> > > The sharing of huge PMD does not require changes to the i_mmap at all.
>> > > Therefore, we can just take the read lock and let other threads
>> > > searching for the right VMA share it in parallel.  Once the right VMA is
>> > > found, either the PMD lock (2M huge page for x86-64) or the
>> > > mm->page_table_lock will be acquired to perform the actual PMD sharing.
>> > >
>> > > Lock contention, if present, will happen in the spinlock.  That is much
>> > > better than contention in the rwsem where the time needed to scan
>> > > the interval tree is indeterminate.
>> > >
>> > > With this patch applied, the customer is seeing significant performance
>> > > improvement over the unpatched kernel.
>> > >
>> > > Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
>> > > Signed-off-by: Waiman Long <longman at redhat.com>
>> > > Suggested-by: Mike Kravetz <mike.kravetz at oracle.com>
>> > > Reviewed-by: Mike Kravetz <mike.kravetz at oracle.com>
>> > > Cc: Davidlohr Bueso <dave at stgolabs.net>
>> > > Cc: Peter Zijlstra <peterz at infradead.org>
>> > > Cc: Ingo Molnar <mingo at redhat.com>
>> > > Cc: Will Deacon <will.deacon at arm.com>
>> > > Cc: Matthew Wilcox <willy at infradead.org>
>> > > Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
>> > > Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
>> > > (cherry picked from commit 930668c34408ba983049322e04f13f03b6f1fafa)
>> > > Signed-off-by: Gavin Guo <gavin.guo at canonical.com>
>> > > ---
>> > >  mm/hugetlb.c | 4 ++--
>> > >  1 file changed, 2 insertions(+), 2 deletions(-)
>> > >
>> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> > > index 2af1831596f2..9c871b47ef86 100644
>> > > --- a/mm/hugetlb.c
>> > > +++ b/mm/hugetlb.c
>> > > @@ -4891,7 +4891,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>> > >       if (!vma_shareable(vma, addr))
>> > >               return (pte_t *)pmd_alloc(mm, pud, addr);
>> > >
>> > > -     i_mmap_lock_write(mapping);
>> > > +     i_mmap_lock_read(mapping);
>> > >       vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>> > >               if (svma == vma)
>> > >                       continue;
>> > > @@ -4921,7 +4921,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>> > >       spin_unlock(ptl);
>> > >  out:
>> > >       pte = (pte_t *)pmd_alloc(mm, pud, addr);
>> > > -     i_mmap_unlock_write(mapping);
>> > > +     i_mmap_unlock_read(mapping);
>> > >       return pte;
>> > >  }
>> > >
>> >
>> > I don't see any testing evidence of this working on B/E/F. Mind you this
>> > looks straight forward, and that's what concerns me as a good win with
>> > minimal change. What testing has been performed on this fix?
>>
>> I agree that this is really straightforward. However, I have asked the
>> customer many times for detailed data, and this is all I have.
>> Currently, I cannot reproduce it locally. I'll keep trying and also
>> ask the customer to see if they can give me more details on how they
>> performed the testing.
>>
>> >
>> > Colin
>>

As a supplement to Gerald's steps (thanks to Gerald for explaining the
idea), the reproducer source code can be found at:


shm-fork.c (To start the processes that use huge pages based on the
shared memory mechanism)
https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/shm-fork.c
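
In case the repository is not handy, the core idea of that test is roughly
the following. This is a simplified sketch, not the actual libhugetlbfs
source; the argument handling and defaults are illustrative only:

==================================================
/* Fork N children; each attaches a SHM_HUGETLB segment and touches every
 * huge page, so huge_pmd_share() runs concurrently in all of them. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumes 2MB hugepages */

int main(int argc, char **argv)
{
	int nprocs = (argc > 1) ? atoi(argv[1]) : 4;
	unsigned long npages = (argc > 2) ? strtoul(argv[2], NULL, 0) : 1024;
	size_t size = npages * HPAGE_SIZE;
	int shmid = shmget(IPC_PRIVATE, size, SHM_HUGETLB | IPC_CREAT | 0600);

	if (shmid < 0) {
		perror("shmget");
		return 1;
	}

	for (int i = 0; i < nprocs; i++) {
		if (fork() == 0) {
			char *p = shmat(shmid, NULL, 0);

			if (p == (void *)-1) {
				perror("shmat");
				_exit(1);
			}
			/* touch every huge page to fault it in */
			for (size_t off = 0; off < size; off += HPAGE_SIZE)
				p[off] = 1;
			shmdt(p);
			_exit(0);
		}
	}
	for (int i = 0; i < nprocs; i++)
		wait(NULL);
	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}
==================================================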

kprobe_hugepmd.c (To simulate the 2s search in the VMA tree)
http://paste.ubuntu.com/p/t3WqCQKgsR/
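
Since the paste may not be permanent, below is a minimal sketch of what such
a delay kprobe module could look like. This is an assumption of the approach,
not the actual kprobe_hugepmd.c; in particular the .offset has to be chosen
per build so the probe lands between i_mmap_lock_write() and
vma_interval_tree_foreach():

==================================================
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/delay.h>

static unsigned int delay_ms = 2;	/* delay injected on every hit */
module_param(delay_ms, uint, 0644);

/* Busy-wait in the pre-handler to stretch the time spent while the
 * i_mmap lock is held, simulating a long VMA interval tree search. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	mdelay(delay_ms);
	return 0;
}

static struct kprobe kp = {
	.symbol_name	= "huge_pmd_share",
	.offset		= 0,	/* set per build: the instruction after i_mmap_lock_write() */
	.pre_handler	= handler_pre,
};

static int __init kprobe_hugepmd_init(void)
{
	return register_kprobe(&kp);
}

static void __exit kprobe_hugepmd_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(kprobe_hugepmd_init);
module_exit(kprobe_hugepmd_exit);
MODULE_LICENSE("GPL");
==================================================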

shm-fork.sh (To start the reproducer with 4 processes going through
1024 2MB hugepages)
http://paste.ubuntu.com/p/DVmS239KZk/

bcc tool sharepmd (To monitor the contention timing between the
processes)
http://paste.ubuntu.com/p/J5DfdTnfNJ/
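
Again, in case the paste expires: the BPF C program such a bcc tool could
embed might look like the sketch below. The Python side that loads it with
BPF(text=...) and attaches it via attach_kprobe(event="huge_pmd_share",
fn_name="trace_huge_pmd_share") is omitted, and the names are illustrative,
not taken from the actual sharepmd script:

==================================================
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct event_t {
	u64 ts_ns;			/* timestamp at function entry */
	u32 pid;
	char comm[TASK_COMM_LEN];
};

BPF_PERF_OUTPUT(events);

/* Attached as a kprobe on huge_pmd_share(): record who reached the
 * function and when, so user space can print per-process timing. */
int trace_huge_pmd_share(struct pt_regs *ctx)
{
	struct event_t ev = {};

	ev.ts_ns = bpf_ktime_get_ns();
	ev.pid = bpf_get_current_pid_tgid() >> 32;
	bpf_get_current_comm(&ev.comm, sizeof(ev.comm));
	events.perf_submit(ctx, &ev, sizeof(ev));
	return 0;
}
==================================================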


