mm: Protect operations adding pages to page cache with invalidate_lock

Currently, serializing operations such as page fault, read, or readahead
against hole punching is rather difficult. The basic race scheme is:

    fallocate(FALLOC_FL_PUNCH_HOLE)             read / fault / ..
      truncate_inode_pages_range()
                                                <create pages in page cache here>
      <update fs block mapping and free blocks>

Now the problem is in this way read / page fault / readahead can
instantiate pages in page cache with potentially stale data (if blocks get
quickly reused). Avoiding this race is not simple - page locks do not work
because we want to make sure there are *no* pages in given range.
inode->i_rwsem does not work because page fault happens under mmap_sem
which ranks below inode->i_rwsem. Also using it for reads makes the
performance for mixed read-write workloads suffer.

So create a new rw_semaphore in the address_space - invalidate_lock - that
protects adding of pages to page cache for page faults / reads /
readahead.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
parent c625b4cc57
commit 730633f0b7

7 changed files with 181 additions and 57 deletions
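To illustrate the serialization scheme the commit message describes, below
is a minimal sketch of the hole-punch side as a filesystem might implement
it on top of this patch. fs_punch_hole() and fs_remove_blocks() are
hypothetical names used only for illustration; they are not part of this
patch.

	/*
	 * Sketch: taking invalidate_lock exclusively closes the window in
	 * which fault / read / readahead could repopulate the page cache
	 * from blocks that are about to be freed.
	 */
	static long fs_punch_hole(struct inode *inode, loff_t start, loff_t len)
	{
		struct address_space *mapping = inode->i_mapping;
		long ret;

		inode_lock(inode);			/* serialize against write(2) */
		filemap_invalidate_lock(mapping);	/* block fault/read/readahead */

		truncate_inode_pages_range(mapping, start, start + len - 1);
		ret = fs_remove_blocks(inode, start, len);	/* hypothetical */

		filemap_invalidate_unlock(mapping);
		inode_unlock(inode);
		return ret;
	}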
Documentation/filesystems/locking.rst

@@ -271,19 +271,19 @@ prototypes::

 locking rules:
 	All except set_page_dirty and freepage may block
 
-======================	======================== =========
-ops			PageLocked(page)	 i_rwsem
-======================	======================== =========
+======================	======================== =========	===============
+ops			PageLocked(page)	 i_rwsem	invalidate_lock
+======================	======================== =========	===============
 writepage:		yes, unlocks (see below)
-readpage:		yes, unlocks
+readpage:		yes, unlocks				shared
 writepages:
 set_page_dirty		no
-readahead:		yes, unlocks
-readpages:		no
+readahead:		yes, unlocks				shared
+readpages:		no					shared
 write_begin:		locks the page		 exclusive
 write_end:		yes, unlocks		 exclusive
 bmap:
-invalidatepage:		yes
+invalidatepage:		yes					exclusive
 releasepage:		yes
 freepage:		yes
 direct_IO:
@@ -378,7 +378,10 @@ keep it that way and don't breed new callers.

 ->invalidatepage() is called when the filesystem must attempt to drop
 some or all of the buffers from the page when it is being truncated. It
 returns zero on success. If ->invalidatepage is zero, the kernel uses
-block_invalidatepage() instead.
+block_invalidatepage() instead. The filesystem must exclusively acquire
+invalidate_lock before invalidating page cache in truncate / hole punch path
+(and thus calling into ->invalidatepage) to block races between page cache
+invalidation and page cache filling functions (fault, read, ...).
 
 ->releasepage() is called when the kernel is about to try to drop the
 buffers from the page in preparation for freeing it.  It returns zero to
@@ -573,6 +576,25 @@ in sys_read() and friends.

 the lease within the individual filesystem to record the result of the
 operation
 
+->fallocate implementation must be really careful to maintain page cache
+consistency when punching holes or performing other operations that invalidate
+page cache contents. Usually the filesystem needs to call
+truncate_inode_pages_range() to invalidate relevant range of the page cache.
+However the filesystem usually also needs to update its internal (and on disk)
+view of file offset -> disk block mapping. Until this update is finished, the
+filesystem needs to block page faults and reads from reloading now-stale page
+cache contents from the disk. Since VFS acquires mapping->invalidate_lock in
+shared mode when loading pages from disk (filemap_fault(), filemap_read(),
+readahead paths), the fallocate implementation must take the invalidate_lock
+to prevent reloading.
+
+->copy_file_range and ->remap_file_range implementations need to serialize
+against modifications of file data while the operation is running. For
+blocking changes through write(2) and similar operations inode->i_rwsem can be
+used. To block changes to file contents via a memory mapping during the
+operation, the filesystem must take mapping->invalidate_lock to coordinate
+with ->page_mkwrite.
+
 dquot_operations
 ================
 
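As a hedged illustration of the ->copy_file_range / ->remap_file_range rule
above, a filesystem could serialize a remap roughly as follows. fs_do_remap()
is a hypothetical helper standing in for the filesystem's actual remapping
logic; it is not part of this patch.

	static loff_t fs_remap_file_range(struct file *file_in, loff_t pos_in,
					  struct file *file_out, loff_t pos_out,
					  loff_t len, unsigned int remap_flags)
	{
		struct inode *inode_out = file_inode(file_out);
		struct address_space *mapping = inode_out->i_mapping;
		loff_t ret;

		inode_lock(inode_out);			/* blocks write(2) and friends */
		filemap_invalidate_lock(mapping);	/* blocks ->page_mkwrite */

		ret = fs_do_remap(file_in, pos_in, file_out, pos_out, len);	/* hypothetical */

		filemap_invalidate_unlock(mapping);
		inode_unlock(inode_out);
		return ret;
	}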
@@ -630,11 +652,11 @@ pfn_mkwrite:	yes

 access:		yes
 =============	=========	===========================
 
-->fault() is called when a previously not present pte is about
-to be faulted in. The filesystem must find and return the page associated
-with the passed in "pgoff" in the vm_fault structure. If it is possible that
-the page may be truncated and/or invalidated, then the filesystem must lock
-the page, then ensure it is not already truncated (the page lock will block
+->fault() is called when a previously not present pte is about to be faulted
+in. The filesystem must find and return the page associated with the passed in
+"pgoff" in the vm_fault structure. If it is possible that the page may be
+truncated and/or invalidated, then the filesystem must lock invalidate_lock,
+then ensure the page is not already truncated (invalidate_lock will block
 subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
 locked. The VM will unlock the page.
 
@@ -647,12 +669,14 @@ page table entry. Pointer to entry associated with the page is passed in

 "pte" field in vm_fault structure. Pointers to entries for other offsets
 should be calculated relative to "pte".
 
-->page_mkwrite() is called when a previously read-only pte is
-about to become writeable. The filesystem again must ensure that there are
-no truncate/invalidate races, and then return with the page locked. If
-the page has been truncated, the filesystem should not look up a new page
-like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
-will cause the VM to retry the fault.
+->page_mkwrite() is called when a previously read-only pte is about to become
+writeable. The filesystem again must ensure that there are no
+truncate/invalidate races or races with operations such as ->remap_file_range
+or ->copy_file_range, and then return with the page locked. Usually
+mapping->invalidate_lock is suitable for proper serialization. If the page has
+been truncated, the filesystem should not look up a new page like the ->fault()
+handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
+retry the fault.
 
 ->pfn_mkwrite() is the same as page_mkwrite but when the pte is
 VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
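A minimal sketch of a ->page_mkwrite handler following the rule above,
assuming a filesystem that needs no block allocation in this path. The
fs_page_mkwrite() name is hypothetical and not part of this patch.

	static vm_fault_t fs_page_mkwrite(struct vm_fault *vmf)
	{
		struct page *page = vmf->page;
		struct inode *inode = file_inode(vmf->vma->vm_file);
		struct address_space *mapping = inode->i_mapping;
		vm_fault_t ret = VM_FAULT_LOCKED;

		sb_start_pagefault(inode->i_sb);
		filemap_invalidate_lock_shared(mapping);	/* vs truncate/remap */
		lock_page(page);
		if (page->mapping != mapping) {
			/* Raced with truncate: let the VM retry the fault. */
			unlock_page(page);
			ret = VM_FAULT_NOPAGE;
		} else {
			/* Page stays locked, as VM_FAULT_LOCKED requires. */
			set_page_dirty(page);
			wait_for_stable_page(page);
		}
		filemap_invalidate_unlock_shared(mapping);
		sb_end_pagefault(inode->i_sb);
		return ret;
	}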
fs/inode.c

@@ -190,6 +190,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)

 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
 	mapping->private_data = NULL;
 	mapping->writeback_index = 0;
+	__init_rwsem(&mapping->invalidate_lock, "mapping.invalidate_lock",
+		     &sb->s_type->invalidate_lock_key);
 	inode->i_private = NULL;
 	inode->i_mapping = mapping;
 	INIT_HLIST_HEAD(&inode->i_dentry);	/* buggered by rcu freeing */
include/linux/fs.h

@@ -436,6 +436,10 @@ int pagecache_write_end(struct file *, struct address_space *mapping,

  * struct address_space - Contents of a cacheable, mappable object.
  * @host: Owner, either the inode or the block_device.
  * @i_pages: Cached pages.
+ * @invalidate_lock: Guards coherency between page cache contents and
+ *   file offset->disk block mappings in the filesystem during invalidates.
+ *   It is also used to block modification of page cache contents through
+ *   memory mappings.
  * @gfp_mask: Memory allocation flags to use for allocating pages.
  * @i_mmap_writable: Number of VM_SHARED mappings.
  * @nr_thps: Number of THPs in the pagecache (non-shmem only).
@@ -453,6 +457,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,

 struct address_space {
 	struct inode		*host;
 	struct xarray		i_pages;
+	struct rw_semaphore	invalidate_lock;
 	gfp_t			gfp_mask;
 	atomic_t		i_mmap_writable;
 #ifdef CONFIG_READ_ONLY_THP_FOR_FS
@@ -814,6 +819,33 @@ static inline void inode_lock_shared_nested(struct inode *inode, unsigned subclass)

 	down_read_nested(&inode->i_rwsem, subclass);
 }
 
+static inline void filemap_invalidate_lock(struct address_space *mapping)
+{
+	down_write(&mapping->invalidate_lock);
+}
+
+static inline void filemap_invalidate_unlock(struct address_space *mapping)
+{
+	up_write(&mapping->invalidate_lock);
+}
+
+static inline void filemap_invalidate_lock_shared(struct address_space *mapping)
+{
+	down_read(&mapping->invalidate_lock);
+}
+
+static inline int filemap_invalidate_trylock_shared(
+					struct address_space *mapping)
+{
+	return down_read_trylock(&mapping->invalidate_lock);
+}
+
+static inline void filemap_invalidate_unlock_shared(
+					struct address_space *mapping)
+{
+	up_read(&mapping->invalidate_lock);
+}
+
 void lock_two_nondirectories(struct inode *, struct inode*);
 void unlock_two_nondirectories(struct inode *, struct inode*);
 
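Usage note (a sketch, mirroring what filemap_update_page() does later in
this patch): non-blocking callers pair the trylock variant with IOCB_NOWAIT,
while everyone else blocks on the shared lock.

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!filemap_invalidate_trylock_shared(mapping))
			return -EAGAIN;		/* caller retries without NOWAIT */
	} else {
		filemap_invalidate_lock_shared(mapping);
	}
	/* ... fill page cache ... */
	filemap_invalidate_unlock_shared(mapping);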
@@ -2487,6 +2519,7 @@ struct file_system_type {

 
 	struct lock_class_key i_lock_key;
 	struct lock_class_key i_mutex_key;
+	struct lock_class_key invalidate_lock_key;
 	struct lock_class_key i_mutex_dir_key;
 };
 

mm/filemap.c (99 changed lines)
@@ -77,7 +77,8 @@

  *        ->i_pages lock
  *
  *  ->i_rwsem
- *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
+ *    ->invalidate_lock		(acquired by fs in truncate path)
+ *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
  *
  *  ->mmap_lock
  *    ->i_mmap_rwsem
@@ -85,7 +86,8 @@

  *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
  *
  *  ->mmap_lock
- *    ->lock_page		(access_process_vm)
+ *    ->invalidate_lock		(filemap_fault)
+ *      ->lock_page		(filemap_fault, access_process_vm)
  *
  *  ->i_rwsem			(generic_perform_write)
  *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
@@ -2368,20 +2370,30 @@ static int filemap_update_page(struct kiocb *iocb,

 {
 	int error;
 
-	if (!trylock_page(page)) {
-		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!filemap_invalidate_trylock_shared(mapping))
 			return -EAGAIN;
+	} else {
+		filemap_invalidate_lock_shared(mapping);
+	}
+
+	if (!trylock_page(page)) {
+		error = -EAGAIN;
+		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
+			goto unlock_mapping;
 		if (!(iocb->ki_flags & IOCB_WAITQ)) {
+			filemap_invalidate_unlock_shared(mapping);
 			put_and_wait_on_page_locked(page, TASK_KILLABLE);
 			return AOP_TRUNCATED_PAGE;
 		}
 		error = __lock_page_async(page, iocb->ki_waitq);
 		if (error)
-			return error;
+			goto unlock_mapping;
 	}
 
+	error = AOP_TRUNCATED_PAGE;
 	if (!page->mapping)
-		goto truncated;
+		goto unlock;
 
 	error = 0;
 	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page))
@@ -2392,15 +2404,13 @@ static int filemap_update_page(struct kiocb *iocb,

 		goto unlock;
 
 	error = filemap_read_page(iocb->ki_filp, mapping, page);
-	if (error == AOP_TRUNCATED_PAGE)
-		put_page(page);
-	return error;
-truncated:
-	unlock_page(page);
-	put_page(page);
-	return AOP_TRUNCATED_PAGE;
+	goto unlock_mapping;
 unlock:
 	unlock_page(page);
+unlock_mapping:
+	filemap_invalidate_unlock_shared(mapping);
+	if (error == AOP_TRUNCATED_PAGE)
+		put_page(page);
 	return error;
 }
 
@@ -2415,6 +2425,19 @@ static int filemap_create_page(struct file *file,

 	if (!page)
 		return -ENOMEM;
 
+	/*
+	 * Protect against truncate / hole punch. Grabbing invalidate_lock here
+	 * assures we cannot instantiate and bring uptodate new pagecache pages
+	 * after evicting page cache during truncate and before actually
+	 * freeing blocks.  Note that we could release invalidate_lock after
+	 * inserting the page into page cache as the locked page would then be
+	 * enough to synchronize with hole punching. But there are code paths
+	 * such as filemap_update_page() filling in partially uptodate pages or
+	 * ->readpages() that need to hold invalidate_lock while mapping blocks
+	 * for IO so let's hold the lock here as well to keep locking rules
+	 * simple.
+	 */
+	filemap_invalidate_lock_shared(mapping);
 	error = add_to_page_cache_lru(page, mapping, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2426,9 +2449,11 @@ static int filemap_create_page(struct file *file,

 	if (error)
 		goto error;
 
+	filemap_invalidate_unlock_shared(mapping);
 	pagevec_add(pvec, page);
 	return 0;
 error:
+	filemap_invalidate_unlock_shared(mapping);
 	put_page(page);
 	return error;
 }
@@ -2967,6 +2992,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)

 	pgoff_t max_off;
 	struct page *page;
 	vm_fault_t ret = 0;
+	bool mapping_locked = false;
 
 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
 	if (unlikely(offset >= max_off))
@@ -2976,25 +3002,39 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)

 	 * Do we have something in the page cache already?
 	 */
 	page = find_get_page(mapping, offset);
-	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
+	if (likely(page)) {
 		/*
-		 * We found the page, so try async readahead before
-		 * waiting for the lock.
+		 * We found the page, so try async readahead before waiting for
+		 * the lock.
 		 */
-		fpin = do_async_mmap_readahead(vmf, page);
-	} else if (!page) {
+		if (!(vmf->flags & FAULT_FLAG_TRIED))
+			fpin = do_async_mmap_readahead(vmf, page);
+		if (unlikely(!PageUptodate(page))) {
+			filemap_invalidate_lock_shared(mapping);
+			mapping_locked = true;
+		}
+	} else {
 		/* No page in the page cache at all */
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 		fpin = do_sync_mmap_readahead(vmf);
 retry_find:
+		/*
+		 * See comment in filemap_create_page() why we need
+		 * invalidate_lock
+		 */
+		if (!mapping_locked) {
+			filemap_invalidate_lock_shared(mapping);
+			mapping_locked = true;
+		}
 		page = pagecache_get_page(mapping, offset,
 					  FGP_CREAT|FGP_FOR_MMAP,
 					  vmf->gfp_mask);
 		if (!page) {
 			if (fpin)
 				goto out_retry;
+			filemap_invalidate_unlock_shared(mapping);
 			return VM_FAULT_OOM;
 		}
 	}
@@ -3014,8 +3054,20 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)

 	 * We have a locked page in the page cache, now we need to check
 	 * that it's up-to-date. If not, it is going to be due to an error.
 	 */
-	if (unlikely(!PageUptodate(page)))
+	if (unlikely(!PageUptodate(page))) {
+		/*
+		 * The page was in cache and uptodate and now it is not.
+		 * Strange but possible since we didn't hold the page lock all
+		 * the time. Let's drop everything get the invalidate lock and
+		 * try again.
+		 */
+		if (!mapping_locked) {
+			unlock_page(page);
+			put_page(page);
+			goto retry_find;
+		}
 		goto page_not_uptodate;
+	}
 
 	/*
 	 * We've made it this far and we had to drop our mmap_lock, now is the
@@ -3026,6 +3078,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)

 		unlock_page(page);
 		goto out_retry;
 	}
+	if (mapping_locked)
+		filemap_invalidate_unlock_shared(mapping);
 
 	/*
 	 * Found the page and have a reference on it.
@@ -3056,6 +3110,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)

 
 	if (!error || error == AOP_TRUNCATED_PAGE)
 		goto retry_find;
+	filemap_invalidate_unlock_shared(mapping);
 
 	return VM_FAULT_SIGBUS;
 
@@ -3067,6 +3122,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)

 	 */
 	if (page)
 		put_page(page);
+	if (mapping_locked)
+		filemap_invalidate_unlock_shared(mapping);
 	if (fpin)
 		fput(fpin);
 	return ret | VM_FAULT_RETRY;
@@ -3437,6 +3494,8 @@ static struct page *do_read_cache_page(struct address_space *mapping,

  *
  * If the page does not get brought uptodate, return -EIO.
  *
+ * The function expects mapping->invalidate_lock to be already held.
+ *
  * Return: up to date page on success, ERR_PTR() on failure.
  */
 struct page *read_cache_page(struct address_space *mapping,
@@ -3460,6 +3519,8 @@ EXPORT_SYMBOL(read_cache_page);

  *
  * If the page does not get brought uptodate, return -EIO.
  *
+ * The function expects mapping->invalidate_lock to be already held.
+ *
  * Return: up to date page on success, ERR_PTR() on failure.
  */
 struct page *read_cache_page_gfp(struct address_space *mapping,
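Per the comment added above, callers of read_cache_page() and
read_cache_page_gfp() are now expected to hold the lock themselves. A
minimal sketch of a conforming caller; fs_read_page() is a hypothetical
wrapper, not part of this patch:

	static struct page *fs_read_page(struct address_space *mapping,
					 pgoff_t index, void *data)
	{
		struct page *page;

		filemap_invalidate_lock_shared(mapping);
		/* NULL filler: fall back to mapping->a_ops->readpage */
		page = read_cache_page(mapping, index, NULL, data);
		filemap_invalidate_unlock_shared(mapping);
		return page;
	}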
mm/readahead.c

@@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,

 	 */
 	unsigned int nofs = memalloc_nofs_save();
 
+	filemap_invalidate_lock_shared(mapping);
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -236,6 +237,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,

 	 * will then handle the error.
 	 */
 	read_pages(ractl, &page_pool, false);
+	filemap_invalidate_unlock_shared(mapping);
 	memalloc_nofs_restore(nofs);
 }
 EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);

mm/rmap.c (37 changed lines)
						 | 
					@ -22,24 +22,25 @@
 | 
				
			||||||
 *
 | 
					 *
 | 
				
			||||||
 * inode->i_rwsem	(while writing or truncating, not reading or faulting)
 | 
					 * inode->i_rwsem	(while writing or truncating, not reading or faulting)
 | 
				
			||||||
 *   mm->mmap_lock
 | 
					 *   mm->mmap_lock
 | 
				
			||||||
 *     page->flags PG_locked (lock_page)   * (see hugetlbfs below)
 | 
					 *     mapping->invalidate_lock (in filemap_fault)
 | 
				
			||||||
 *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
 | 
					 *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
 | 
				
			||||||
 *         mapping->i_mmap_rwsem
 | 
					 *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
 | 
				
			||||||
 *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
 | 
					 *           mapping->i_mmap_rwsem
 | 
				
			||||||
 *           anon_vma->rwsem
 | 
					 *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
 | 
				
			||||||
 *             mm->page_table_lock or pte_lock
 | 
					 *             anon_vma->rwsem
 | 
				
			||||||
 *               swap_lock (in swap_duplicate, swap_info_get)
 | 
					 *               mm->page_table_lock or pte_lock
 | 
				
			||||||
 *                 mmlist_lock (in mmput, drain_mmlist and others)
 | 
					 *                 swap_lock (in swap_duplicate, swap_info_get)
 | 
				
			||||||
 *                 mapping->private_lock (in __set_page_dirty_buffers)
 | 
					 *                   mmlist_lock (in mmput, drain_mmlist and others)
 | 
				
			||||||
 *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
 | 
					 *                   mapping->private_lock (in __set_page_dirty_buffers)
 | 
				
			||||||
 *                     i_pages lock (widely used)
 | 
					 *                     lock_page_memcg move_lock (in __set_page_dirty_buffers)
 | 
				
			||||||
 *                       lruvec->lru_lock (in lock_page_lruvec_irq)
 | 
					 *                       i_pages lock (widely used)
 | 
				
			||||||
 *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
 | 
					 *                         lruvec->lru_lock (in lock_page_lruvec_irq)
 | 
				
			||||||
 *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
 | 
					 *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty)
 | 
				
			||||||
 *                   sb_lock (within inode_lock in fs/fs-writeback.c)
 | 
					 *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
 | 
				
			||||||
 *                   i_pages lock (widely used, in set_page_dirty,
 | 
					 *                     sb_lock (within inode_lock in fs/fs-writeback.c)
 | 
				
			||||||
 *                             in arch-dependent flush_dcache_mmap_lock,
 | 
					 *                     i_pages lock (widely used, in set_page_dirty,
 | 
				
			||||||
 *                             within bdi.wb->list_lock in __sync_single_inode)
 | 
					 *                               in arch-dependent flush_dcache_mmap_lock,
 | 
				
			||||||
 | 
					 *                               within bdi.wb->list_lock in __sync_single_inode)
 | 
				
			||||||
 *
 | 
					 *
 | 
				
			||||||
 * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
 | 
					 * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
 | 
				
			||||||
 *   ->tasklist_lock
 | 
					 *   ->tasklist_lock
 | 
				
			||||||
| 
						 | 
					
 | 
				
			||||||
| 
						 | 
mm/truncate.c

@@ -412,7 +412,8 @@ EXPORT_SYMBOL(truncate_inode_pages_range);

  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
  *
- * Called under (and serialised by) inode->i_rwsem.
+ * Called under (and serialised by) inode->i_rwsem and
+ * mapping->invalidate_lock.
  *
  * Note: When this function returns, there can be a page in the process of
  * deletion (inside __delete_from_page_cache()) in the specified range.  Thus