mirror of https://github.com/torvalds/linux.git, synced 2025-11-04 02:30:34 +02:00

Hugh has pointed out that a compound_head() call can be unsafe in some
contexts. Here is one example:

	CPU0					CPU1
isolate_migratepages_block()
  page_count()
    compound_head()
      !!PageTail() == true
					put_page()
					  tail->first_page = NULL
      head = tail->first_page
					alloc_pages(__GFP_COMP)
					   prep_compound_page()
					     tail->first_page = head
					     __SetPageTail(p);
      !!PageTail() == true
    <head == NULL dereferencing>

The race is purely theoretical. I don't think it's possible to trigger it
in practice. But who knows.

We can fix the race by changing how we encode PageTail() and compound_head()
within struct page so that both can be updated in one shot.

The patch introduces page->compound_head into the third double word block,
in front of compound_dtor and compound_order. Bit 0 encodes PageTail(), and
the remaining bits are a pointer to the head page if bit 0 is set.

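The encoding described above can be modelled in userspace C roughly as
follows. This is only a sketch: struct model_page and every model_*
function are hypothetical stand-ins, not kernel code. The point it
illustrates is that the tail flag and the head pointer live in one word,
so a reader that loads the word once can never see "tail" paired with a
stale or NULL head.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page: only the word under discussion. */
struct model_page {
	uintptr_t compound_head;	/* bit 0: tail flag, rest: head pointer */
};

/* Mark 'tail' as a tail page of 'head' with a single store. */
static void model_set_compound_head(struct model_page *tail,
				    struct model_page *head)
{
	/* The struct's alignment keeps bit 0 of a valid pointer clear. */
	assert(((uintptr_t)head & 1) == 0);
	tail->compound_head = (uintptr_t)head | 1;
}

/* Clear flag and pointer together, again with a single store. */
static void model_clear_compound_head(struct model_page *page)
{
	page->compound_head = 0;
}

static int model_page_tail(const struct model_page *page)
{
	return (int)(page->compound_head & 1);
}

/*
 * Load the word exactly once, then decode it. The flag and the pointer
 * come from the same snapshot, which is what closes the race window.
 * (Kernel code would need a READ_ONCE()-style access; a volatile read
 * plays that role in this model.)
 */
static struct model_page *model_compound_head(struct model_page *page)
{
	uintptr_t head = *(volatile uintptr_t *)&page->compound_head;

	if (head & 1)
		return (struct model_page *)(head - 1);
	return page;
}
```

A non-tail page decodes to itself, so callers need no separate
PageTail() check before asking for the head.
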
The patch moves page->pmd_huge_pte out of that word, just in case an
architecture defines pgtable_t as something that can have bit 0 set.

hugetlb_cgroup uses page->lru.next in the second tail page to store a
pointer to struct hugetlb_cgroup. The patch switches it to use page->private
in the second tail page instead. The space is free since ->first_page is
removed from the union.

The patch also opens the possibility of removing the HUGETLB_CGROUP_MIN_ORDER
limitation, since there's now space in the first tail page to store the
struct hugetlb_cgroup pointer. But that's out of scope of the patch.

That means page->compound_head shares storage space with:

 - page->lru.next;
 - page->next;
 - page->rcu_head.next;

That's too long a list to be absolutely sure, but it looks like nobody uses
bit 0 of the word.

page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we use
call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a future
call_rcu_lazy() is not allowed, as it makes use of the bit and we could
get a false positive PageTail().

[1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
		
	
			
		
			
				
	
	
		
Split page table lock
=====================

Originally, the mm->page_table_lock spinlock protected all page tables of
the mm_struct. But this approach leads to poor page fault scalability of
multi-threaded applications due to high contention on the lock. To improve
scalability, split page table lock was introduced.

With split page table lock we have a separate per-table lock to serialize
access to the table. At the moment we use split lock for PTE and PMD
tables. Access to higher level tables is protected by mm->page_table_lock.

There are helpers to lock/unlock a table and other accessor functions:
 - pte_offset_map_lock()
	maps pte and takes PTE table lock, returns pointer to the pte and
	passes back pointer to the taken lock;
 - pte_unmap_unlock()
	unlocks and unmaps PTE table;
 - pte_alloc_map_lock()
	allocates PTE table if needed and takes the lock, returns pointer
	to the pte and passes back pointer to the taken lock, or returns
	NULL if allocation failed;
 - pte_lockptr()
	returns pointer to PTE table lock;
 - pmd_lock()
	takes PMD table lock, returns pointer to taken lock;
 - pmd_lockptr()
	returns pointer to PMD table lock.

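The calling pattern of the PTE helpers can be sketched as a small userspace
model. This is an illustration under stated assumptions, not kernel code:
struct model_mm, struct model_table, the test-and-set lock and every
model_* function are hypothetical, and the "split" flag stands in for the
compile-time choice between per-table locks and mm->page_table_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PTRS_PER_TABLE 8

/* Minimal test-and-set spinlock, a hypothetical stand-in for spinlock_t. */
struct model_lock { atomic_flag busy; };

static void model_lock_init(struct model_lock *l)
{
	atomic_flag_clear(&l->busy);
}

static void model_spin_lock(struct model_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire))
		;	/* spin until the previous holder releases the lock */
}

static void model_spin_unlock(struct model_lock *l)
{
	atomic_flag_clear_explicit(&l->busy, memory_order_release);
}

struct model_table {
	struct model_lock ptl;			/* per-table (split) lock */
	unsigned long entry[MODEL_PTRS_PER_TABLE];
};

struct model_mm {
	struct model_lock page_table_lock;	/* one lock for the whole mm */
	bool split;				/* models the config choice */
};

static void model_mm_init(struct model_mm *mm, bool split)
{
	model_lock_init(&mm->page_table_lock);
	mm->split = split;
}

static void model_table_init(struct model_table *t)
{
	model_lock_init(&t->ptl);
	for (size_t i = 0; i < MODEL_PTRS_PER_TABLE; i++)
		t->entry[i] = 0;
}

/* Shape of pte_lockptr(): pick the lock that guards this table. */
static struct model_lock *model_pte_lockptr(struct model_mm *mm,
					    struct model_table *t)
{
	return mm->split ? &t->ptl : &mm->page_table_lock;
}

/*
 * Shape of pte_offset_map_lock(): take the table's lock, return the
 * entry and pass the taken lock back through *ptlp.
 */
static unsigned long *model_pte_offset_map_lock(struct model_mm *mm,
						struct model_table *t,
						size_t idx,
						struct model_lock **ptlp)
{
	struct model_lock *ptl = model_pte_lockptr(mm, t);

	assert(idx < MODEL_PTRS_PER_TABLE);
	model_spin_lock(ptl);
	*ptlp = ptl;
	return &t->entry[idx];
}

/* Shape of pte_unmap_unlock(): there is nothing to unmap in this model. */
static void model_pte_unmap_unlock(unsigned long *pte, struct model_lock *ptl)
{
	(void)pte;
	model_spin_unlock(ptl);
}
```

With split enabled, two tables of the same mm resolve to different locks,
so updates to them do not contend; with split disabled, both resolve to
the single mm-wide lock.
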
Split page table lock for PTE tables is enabled at compile time if
CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS.
If split lock is disabled, all tables are guarded by mm->page_table_lock.

Split page table lock for PMD tables is enabled if it's enabled for PTE
tables and the architecture supports it (see below).

Hugetlb and split page table lock
---------------------------------

Hugetlb can support several page sizes. We use split lock only for PMD
level, but not for PUD.

Hugetlb-specific helpers:
 - huge_pte_lock()
	takes the PMD split lock for a PMD_SIZE page, mm->page_table_lock
	otherwise;
 - huge_pte_lockptr()
	returns pointer to table lock.

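The lock selection rule for hugetlb can be sketched as below. This is a
userspace model with hypothetical names (model_huge_*), and the 2 MiB value
of MODEL_PMD_SIZE is an assumption made for the example only; the real
PMD_SIZE depends on the architecture and configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch only: a PMD maps 2 MiB. */
#define MODEL_PMD_SIZE (2UL << 20)

/* Placeholder lock type; only its address matters in this model. */
struct model_huge_lock { int unused; };

struct model_huge_mm {
	struct model_huge_lock page_table_lock;	/* mm-wide lock */
};

struct model_huge_pmd_table {
	struct model_huge_lock ptl;		/* split lock of the PMD table */
};

/*
 * Shape of huge_pte_lockptr(): PMD_SIZE pages are covered by the split
 * lock of the PMD table, every other huge page size falls back to the
 * mm-wide lock.
 */
static struct model_huge_lock *
model_huge_pte_lockptr(unsigned long huge_page_size,
		       struct model_huge_mm *mm,
		       struct model_huge_pmd_table *table)
{
	if (huge_page_size == MODEL_PMD_SIZE)
		return &table->ptl;
	return &mm->page_table_lock;
}
```
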
Support of split page table lock by an architecture
---------------------------------------------------

No special enabling is needed for PTE split page table lock:
everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
which must be called on PTE table allocation / freeing.

Make sure the architecture doesn't use the slab allocator for page table
allocation: slab uses page->slab_cache for its pages.
This field shares storage with page->ptl.

PMD split lock only makes sense if you have more than two page table
levels.

Enabling PMD split lock requires a pgtable_pmd_page_ctor() call on PMD
table allocation and pgtable_pmd_page_dtor() on freeing.

Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
paths: e.g. X86_PAE preallocates a few PMDs in pgd_alloc().

With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.

NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- this must
be handled properly.

page->ptl
---------

page->ptl is used to access the split page table lock, where 'page' is the
struct page of the page containing the table. It shares storage with
page->private (and a few other fields in the union).

To avoid increasing the size of struct page and to get the best
performance, we use a trick:
 - if spinlock_t fits into long, we use page->ptl as the spinlock itself,
   so we can avoid indirect access and save a cache line;
 - if the size of spinlock_t is larger than the size of long, we use
   page->ptl as a pointer to spinlock_t and allocate it dynamically. This
   allows split lock to be used with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC
   enabled, but costs one more cache line for indirect access.

The spinlock_t is allocated in pgtable_page_ctor() for a PTE table and in
pgtable_pmd_page_ctor() for a PMD table.

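The embedded-versus-allocated trick, together with a constructor that can
fail, can be modelled in userspace along these lines. Everything here is a
hypothetical sketch: struct model_spinlock, struct model_page and the
model_* functions are invented for illustration, and the MODEL_DEBUG_LOCK
macro merely imitates what a lock-debugging option does to the lock size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical lock type; MODEL_DEBUG_LOCK makes it larger than a long. */
struct model_spinlock {
	unsigned int slock;
#ifdef MODEL_DEBUG_LOCK
	void *owner;
	unsigned long magic;
#endif
};

#ifdef MODEL_DEBUG_LOCK
#define MODEL_ALLOC_SPLIT_PTLOCKS 1	/* lock too big: store a pointer */
#else
#define MODEL_ALLOC_SPLIT_PTLOCKS 0	/* lock fits: embed it */
#endif

/* Check that the chosen mode really matches the size relation. */
_Static_assert(MODEL_ALLOC_SPLIT_PTLOCKS ==
	       (sizeof(struct model_spinlock) > sizeof(long)),
	       "allocation mode must follow the lock size");

struct model_page {
	union {
		unsigned long private;
#if MODEL_ALLOC_SPLIT_PTLOCKS
		struct model_spinlock *ptl;	/* allocated separately */
#else
		struct model_spinlock ptl;	/* stored in place */
#endif
	};
};

/* The one accessor callers should use, whichever layout is active. */
static struct model_spinlock *model_ptlock_ptr(struct model_page *page)
{
#if MODEL_ALLOC_SPLIT_PTLOCKS
	return page->ptl;
#else
	return &page->ptl;
#endif
}

/* Constructor: may fail when the lock has to be allocated. */
static bool model_pgtable_page_ctor(struct model_page *page)
{
#if MODEL_ALLOC_SPLIT_PTLOCKS
	page->ptl = malloc(sizeof(*page->ptl));
	if (!page->ptl)
		return false;
#endif
	model_ptlock_ptr(page)->slock = 0;
	return true;
}

static void model_pgtable_page_dtor(struct model_page *page)
{
#if MODEL_ALLOC_SPLIT_PTLOCKS
	free(page->ptl);
	page->ptl = NULL;
#else
	(void)page;	/* nothing to release for an embedded lock */
#endif
}

/* Allocation path that handles constructor failure instead of ignoring it. */
static struct model_page *model_pte_alloc_one(void)
{
	struct model_page *page = calloc(1, sizeof(*page));

	if (!page)
		return NULL;
	if (!model_pgtable_page_ctor(page)) {
		free(page);
		return NULL;
	}
	return page;
}

static void model_pte_free(struct model_page *page)
{
	model_pgtable_page_dtor(page);
	free(page);
}
```

Because callers only ever go through model_ptlock_ptr(), switching between
the two layouts needs no change at the call sites, which is the reason for
going through a helper rather than touching the field directly.
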
Please never access page->ptl directly -- use the appropriate helper.