mirror of https://github.com/torvalds/linux.git (synced 2025-10-31 00:28:52 +02:00)

mm: Protect operations adding pages to page cache with invalidate_lock

Currently, serializing operations such as page fault, read, or readahead
against hole punching is rather difficult. The basic race scheme is like:

fallocate(FALLOC_FL_PUNCH_HOLE)              read / fault / ..
  truncate_inode_pages_range()
                                               <create pages in page
                                                cache here>
  <update fs block mapping and free blocks>

Now the problem is in this way read / page fault / readahead can
instantiate pages in page cache with potentially stale data (if blocks
get quickly reused). Avoiding this race is not simple - page locks do
not work because we want to make sure there are *no* pages in given
range. inode->i_rwsem does not work because page fault happens under
mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
the performance for mixed read-write workloads suffer.

So create a new rw_semaphore in the address_space - invalidate_lock -
that protects adding of pages to page cache for page faults / reads /
readahead.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
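
The race described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_rwsem`, `toy_mapping`, `toy_read()` and `toy_punch_hole()` are hypothetical names, not kernel APIs, the "disk" is a four-element array, and the model is single-threaded, so the lock only asserts that the protocol is followed instead of actually blocking anyone.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_HOLE (-1)

/* Hypothetical stand-in for struct rw_semaphore; it never sleeps and only
 * checks that callers follow the shared/exclusive protocol. */
struct toy_rwsem { int readers; bool writer; };

static void toy_down_read(struct toy_rwsem *s)  { assert(!s->writer); s->readers++; }
static void toy_up_read(struct toy_rwsem *s)    { assert(s->readers > 0); s->readers--; }
static void toy_down_write(struct toy_rwsem *s) { assert(!s->writer && s->readers == 0); s->writer = true; }
static void toy_up_write(struct toy_rwsem *s)   { assert(s->writer); s->writer = false; }

/* Hypothetical one-page "address_space". */
struct toy_mapping {
	struct toy_rwsem invalidate_lock;
	bool page_cached;	/* is the page present in the page cache? */
	int page_data;		/* contents of the cached page */
	int block;		/* file offset -> disk block mapping, or TOY_HOLE */
};

static int toy_disk[4] = { 10, 11, 12, 13 };	/* contents of each disk block */

/* Read path: pages are added to the cache only under the shared lock. */
static int toy_read(struct toy_mapping *m)
{
	int data;

	toy_down_read(&m->invalidate_lock);
	if (!m->page_cached) {
		m->page_data = (m->block == TOY_HOLE) ? 0 : toy_disk[m->block];
		m->page_cached = true;
	}
	data = m->page_data;
	toy_up_read(&m->invalidate_lock);
	return data;
}

/* Hole punch: eviction and the block mapping update form one exclusive
 * critical section, so no reader can refill the page in between. */
static void toy_punch_hole(struct toy_mapping *m, int reused_value)
{
	toy_down_write(&m->invalidate_lock);
	m->page_cached = false;			/* models truncate_inode_pages_range() */
	if (m->block != TOY_HOLE) {
		toy_disk[m->block] = reused_value;	/* freed block is quickly reused */
		m->block = TOY_HOLE;			/* update fs block mapping */
	}
	toy_up_write(&m->invalidate_lock);
}
```

In this model a `toy_read()` issued after `toy_punch_hole()` sees zeroes from the hole. Without the exclusive section, a read slipping in between the eviction and the mapping update could cache the old block's contents, which is the stale-data case the commit message describes.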
parent c625b4cc57
commit 730633f0b7
7 changed files with 181 additions and 57 deletions
		|  | @ -271,19 +271,19 @@ prototypes:: | ||||||
| locking rules: | locking rules: | ||||||
| 	All except set_page_dirty and freepage may block | 	All except set_page_dirty and freepage may block | ||||||
| 
 | 
 | ||||||
| ======================	======================== ========= | ======================	======================== =========	=============== | ||||||
| ops			PageLocked(page)	 i_rwsem | ops			PageLocked(page)	 i_rwsem	invalidate_lock | ||||||
| ======================	======================== ========= | ======================	======================== =========	=============== | ||||||
| writepage:		yes, unlocks (see below) | writepage:		yes, unlocks (see below) | ||||||
| readpage:		yes, unlocks | readpage:		yes, unlocks				shared | ||||||
| writepages: | writepages: | ||||||
| set_page_dirty		no | set_page_dirty		no | ||||||
| readahead:		yes, unlocks | readahead:		yes, unlocks				shared | ||||||
| readpages:		no | readpages:		no					shared | ||||||
| write_begin:		locks the page		 exclusive | write_begin:		locks the page		 exclusive | ||||||
| write_end:		yes, unlocks		 exclusive | write_end:		yes, unlocks		 exclusive | ||||||
| bmap: | bmap: | ||||||
| invalidatepage:		yes | invalidatepage:		yes					exclusive | ||||||
| releasepage:		yes | releasepage:		yes | ||||||
| freepage:		yes | freepage:		yes | ||||||
| direct_IO: | direct_IO: | ||||||
|  | @ -378,7 +378,10 @@ keep it that way and don't breed new callers. | ||||||
| ->invalidatepage() is called when the filesystem must attempt to drop | ->invalidatepage() is called when the filesystem must attempt to drop | ||||||
| some or all of the buffers from the page when it is being truncated. It | some or all of the buffers from the page when it is being truncated. It | ||||||
| returns zero on success. If ->invalidatepage is zero, the kernel uses | returns zero on success. If ->invalidatepage is zero, the kernel uses | ||||||
| block_invalidatepage() instead. | block_invalidatepage() instead. The filesystem must exclusively acquire | ||||||
|  | invalidate_lock before invalidating page cache in truncate / hole punch path | ||||||
|  | (and thus calling into ->invalidatepage) to block races between page cache | ||||||
|  | invalidation and page cache filling functions (fault, read, ...). | ||||||
| 
 | 
 | ||||||
| ->releasepage() is called when the kernel is about to try to drop the | ->releasepage() is called when the kernel is about to try to drop the | ||||||
| buffers from the page in preparation for freeing it.  It returns zero to | buffers from the page in preparation for freeing it.  It returns zero to | ||||||
|  | @ -573,6 +576,25 @@ in sys_read() and friends. | ||||||
| the lease within the individual filesystem to record the result of the | the lease within the individual filesystem to record the result of the | ||||||
| operation | operation | ||||||
| 
 | 
 | ||||||
|  | ->fallocate implementation must be really careful to maintain page cache | ||||||
|  | consistency when punching holes or performing other operations that invalidate | ||||||
|  | page cache contents. Usually the filesystem needs to call | ||||||
|  | truncate_inode_pages_range() to invalidate relevant range of the page cache. | ||||||
|  | However the filesystem usually also needs to update its internal (and on disk) | ||||||
|  | view of file offset -> disk block mapping. Until this update is finished, the | ||||||
|  | filesystem needs to block page faults and reads from reloading now-stale page | ||||||
|  | cache contents from the disk. Since VFS acquires mapping->invalidate_lock in | ||||||
|  | shared mode when loading pages from disk (filemap_fault(), filemap_read(), | ||||||
|  | readahead paths), the fallocate implementation must take the invalidate_lock to | ||||||
|  | prevent reloading. | ||||||
|  | 
 | ||||||
|  | ->copy_file_range and ->remap_file_range implementations need to serialize | ||||||
|  | against modifications of file data while the operation is running. For | ||||||
|  | blocking changes through write(2) and similar operations inode->i_rwsem can be | ||||||
|  | used. To block changes to file contents via a memory mapping during the | ||||||
|  | operation, the filesystem must take mapping->invalidate_lock to coordinate | ||||||
|  | with ->page_mkwrite. | ||||||
|  | 
 | ||||||
| dquot_operations | dquot_operations | ||||||
| ================ | ================ | ||||||
| 
 | 
 | ||||||
|  | @ -630,11 +652,11 @@ pfn_mkwrite:	yes | ||||||
| access:		yes | access:		yes | ||||||
| =============	=========	=========================== | =============	=========	=========================== | ||||||
| 
 | 
 | ||||||
| ->fault() is called when a previously not present pte is about | ->fault() is called when a previously not present pte is about to be faulted | ||||||
| to be faulted in. The filesystem must find and return the page associated | in. The filesystem must find and return the page associated with the passed in | ||||||
| with the passed in "pgoff" in the vm_fault structure. If it is possible that | "pgoff" in the vm_fault structure. If it is possible that the page may be | ||||||
| the page may be truncated and/or invalidated, then the filesystem must lock | truncated and/or invalidated, then the filesystem must lock invalidate_lock, | ||||||
| the page, then ensure it is not already truncated (the page lock will block | then ensure the page is not already truncated (invalidate_lock will block | ||||||
| subsequent truncate), and then return with VM_FAULT_LOCKED, and the page | subsequent truncate), and then return with VM_FAULT_LOCKED, and the page | ||||||
| locked. The VM will unlock the page. | locked. The VM will unlock the page. | ||||||
| 
 | 
 | ||||||
|  | @ -647,12 +669,14 @@ page table entry. Pointer to entry associated with the page is passed in | ||||||
| "pte" field in vm_fault structure. Pointers to entries for other offsets | "pte" field in vm_fault structure. Pointers to entries for other offsets | ||||||
| should be calculated relative to "pte". | should be calculated relative to "pte". | ||||||
| 
 | 
 | ||||||
| ->page_mkwrite() is called when a previously read-only pte is | ->page_mkwrite() is called when a previously read-only pte is about to become | ||||||
| about to become writeable. The filesystem again must ensure that there are | writeable. The filesystem again must ensure that there are no | ||||||
| no truncate/invalidate races, and then return with the page locked. If | truncate/invalidate races or races with operations such as ->remap_file_range | ||||||
| the page has been truncated, the filesystem should not look up a new page | or ->copy_file_range, and then return with the page locked. Usually | ||||||
| like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which | mapping->invalidate_lock is suitable for proper serialization. If the page has | ||||||
| will cause the VM to retry the fault. | been truncated, the filesystem should not look up a new page like the ->fault() | ||||||
|  | handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to | ||||||
|  | retry the fault. | ||||||
| 
 | 
 | ||||||
| ->pfn_mkwrite() is the same as page_mkwrite but when the pte is | ->pfn_mkwrite() is the same as page_mkwrite but when the pte is | ||||||
| VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is | VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is | ||||||
|  |  | ||||||
|  | @ -190,6 +190,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode) | ||||||
| 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); | 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); | ||||||
| 	mapping->private_data = NULL; | 	mapping->private_data = NULL; | ||||||
| 	mapping->writeback_index = 0; | 	mapping->writeback_index = 0; | ||||||
|  | 	__init_rwsem(&mapping->invalidate_lock, "mapping.invalidate_lock", | ||||||
|  | 		     &sb->s_type->invalidate_lock_key); | ||||||
| 	inode->i_private = NULL; | 	inode->i_private = NULL; | ||||||
| 	inode->i_mapping = mapping; | 	inode->i_mapping = mapping; | ||||||
| 	INIT_HLIST_HEAD(&inode->i_dentry);	/* buggered by rcu freeing */ | 	INIT_HLIST_HEAD(&inode->i_dentry);	/* buggered by rcu freeing */ | ||||||
|  |  | ||||||
|  | @ -436,6 +436,10 @@ int pagecache_write_end(struct file *, struct address_space *mapping, | ||||||
|  * struct address_space - Contents of a cacheable, mappable object. |  * struct address_space - Contents of a cacheable, mappable object. | ||||||
|  * @host: Owner, either the inode or the block_device. |  * @host: Owner, either the inode or the block_device. | ||||||
|  * @i_pages: Cached pages. |  * @i_pages: Cached pages. | ||||||
|  |  * @invalidate_lock: Guards coherency between page cache contents and | ||||||
|  |  *   file offset->disk block mappings in the filesystem during invalidates. | ||||||
|  |  *   It is also used to block modification of page cache contents through | ||||||
|  |  *   memory mappings. | ||||||
|  * @gfp_mask: Memory allocation flags to use for allocating pages. |  * @gfp_mask: Memory allocation flags to use for allocating pages. | ||||||
|  * @i_mmap_writable: Number of VM_SHARED mappings. |  * @i_mmap_writable: Number of VM_SHARED mappings. | ||||||
|  * @nr_thps: Number of THPs in the pagecache (non-shmem only). |  * @nr_thps: Number of THPs in the pagecache (non-shmem only). | ||||||
|  | @ -453,6 +457,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping, | ||||||
| struct address_space { | struct address_space { | ||||||
| 	struct inode		*host; | 	struct inode		*host; | ||||||
| 	struct xarray		i_pages; | 	struct xarray		i_pages; | ||||||
|  | 	struct rw_semaphore	invalidate_lock; | ||||||
| 	gfp_t			gfp_mask; | 	gfp_t			gfp_mask; | ||||||
| 	atomic_t		i_mmap_writable; | 	atomic_t		i_mmap_writable; | ||||||
| #ifdef CONFIG_READ_ONLY_THP_FOR_FS | #ifdef CONFIG_READ_ONLY_THP_FOR_FS | ||||||
|  | @ -814,6 +819,33 @@ static inline void inode_lock_shared_nested(struct inode *inode, unsigned subcla | ||||||
| 	down_read_nested(&inode->i_rwsem, subclass); | 	down_read_nested(&inode->i_rwsem, subclass); | ||||||
| } | } | ||||||
| 
 | 
 | ||||||
|  | static inline void filemap_invalidate_lock(struct address_space *mapping) | ||||||
|  | { | ||||||
|  | 	down_write(&mapping->invalidate_lock); | ||||||
|  | } | ||||||
|  | 
 | ||||||
|  | static inline void filemap_invalidate_unlock(struct address_space *mapping) | ||||||
|  | { | ||||||
|  | 	up_write(&mapping->invalidate_lock); | ||||||
|  | } | ||||||
|  | 
 | ||||||
|  | static inline void filemap_invalidate_lock_shared(struct address_space *mapping) | ||||||
|  | { | ||||||
|  | 	down_read(&mapping->invalidate_lock); | ||||||
|  | } | ||||||
|  | 
 | ||||||
|  | static inline int filemap_invalidate_trylock_shared( | ||||||
|  | 					struct address_space *mapping) | ||||||
|  | { | ||||||
|  | 	return down_read_trylock(&mapping->invalidate_lock); | ||||||
|  | } | ||||||
|  | 
 | ||||||
|  | static inline void filemap_invalidate_unlock_shared( | ||||||
|  | 					struct address_space *mapping) | ||||||
|  | { | ||||||
|  | 	up_read(&mapping->invalidate_lock); | ||||||
|  | } | ||||||
|  | 
 | ||||||
| void lock_two_nondirectories(struct inode *, struct inode*); | void lock_two_nondirectories(struct inode *, struct inode*); | ||||||
| void unlock_two_nondirectories(struct inode *, struct inode*); | void unlock_two_nondirectories(struct inode *, struct inode*); | ||||||
| 
 | 
 | ||||||
|  | @ -2487,6 +2519,7 @@ struct file_system_type { | ||||||
| 
 | 
 | ||||||
| 	struct lock_class_key i_lock_key; | 	struct lock_class_key i_lock_key; | ||||||
| 	struct lock_class_key i_mutex_key; | 	struct lock_class_key i_mutex_key; | ||||||
|  | 	struct lock_class_key invalidate_lock_key; | ||||||
| 	struct lock_class_key i_mutex_dir_key; | 	struct lock_class_key i_mutex_dir_key; | ||||||
| }; | }; | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
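
The helpers added above are thin wrappers around the rw_semaphore primitives. The one detail worth noting is the return convention of the trylock variant: `down_read_trylock()` returns 1 when the lock was taken and 0 on contention. The following userspace sketch mimics that convention with C11 atomics; every `toy_` name is hypothetical, and the counter-based lock is an assumption made for illustration, not how the kernel implements rw_semaphore.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock: count == -1 means one writer, n >= 0 means n readers. */
struct toy_rwsem { atomic_int count; };

struct toy_mapping { struct toy_rwsem invalidate_lock; };

static void toy_init_rwsem(struct toy_rwsem *sem)
{
	atomic_init(&sem->count, 0);
}

/* Returns 1 if the shared lock was taken, 0 if a writer holds the lock. */
static int toy_down_read_trylock(struct toy_rwsem *sem)
{
	int old = atomic_load(&sem->count);

	while (old >= 0) {
		/* On failure, 'old' is reloaded and the loop condition rechecked. */
		if (atomic_compare_exchange_weak(&sem->count, &old, old + 1))
			return 1;
	}
	return 0;
}

/* Returns 1 if the exclusive lock was taken, 0 if anyone holds the lock. */
static int toy_down_write_trylock(struct toy_rwsem *sem)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&sem->count, &expected, -1);
}

static void toy_up_read(struct toy_rwsem *sem)
{
	int old = atomic_fetch_sub(&sem->count, 1);

	assert(old > 0);	/* caller must hold the lock in shared mode */
	(void)old;
}

/* Wrappers shaped like the helpers in the hunk above. */
static int toy_filemap_invalidate_trylock_shared(struct toy_mapping *mapping)
{
	return toy_down_read_trylock(&mapping->invalidate_lock);
}

static void toy_filemap_invalidate_unlock_shared(struct toy_mapping *mapping)
{
	toy_up_read(&mapping->invalidate_lock);
}
```

Several shared holders can coexist, while an exclusive attempt fails until all of them have dropped the lock. That matches the intent of letting reads, faults and readahead run in parallel while hole punching excludes them all.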
							
								
								
									
99 mm/filemap.c
							|  | @ -77,7 +77,8 @@ | ||||||
|  *        ->i_pages lock |  *        ->i_pages lock | ||||||
|  * |  * | ||||||
|  *  ->i_rwsem |  *  ->i_rwsem | ||||||
|  *    ->i_mmap_rwsem		(truncate->unmap_mapping_range) |  *    ->invalidate_lock		(acquired by fs in truncate path) | ||||||
|  |  *      ->i_mmap_rwsem		(truncate->unmap_mapping_range) | ||||||
|  * |  * | ||||||
|  *  ->mmap_lock |  *  ->mmap_lock | ||||||
|  *    ->i_mmap_rwsem |  *    ->i_mmap_rwsem | ||||||
|  | @ -85,7 +86,8 @@ | ||||||
|  *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock) |  *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock) | ||||||
|  * |  * | ||||||
|  *  ->mmap_lock |  *  ->mmap_lock | ||||||
|  *    ->lock_page		(access_process_vm) |  *    ->invalidate_lock		(filemap_fault) | ||||||
|  |  *      ->lock_page		(filemap_fault, access_process_vm) | ||||||
|  * |  * | ||||||
|  *  ->i_rwsem			(generic_perform_write) |  *  ->i_rwsem			(generic_perform_write) | ||||||
|  *    ->mmap_lock		(fault_in_pages_readable->do_page_fault) |  *    ->mmap_lock		(fault_in_pages_readable->do_page_fault) | ||||||
|  | @ -2368,20 +2370,30 @@ static int filemap_update_page(struct kiocb *iocb, | ||||||
| { | { | ||||||
| 	int error; | 	int error; | ||||||
| 
 | 
 | ||||||
| 	if (!trylock_page(page)) { | 	if (iocb->ki_flags & IOCB_NOWAIT) { | ||||||
| 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) | 		if (!filemap_invalidate_trylock_shared(mapping)) | ||||||
| 			return -EAGAIN; | 			return -EAGAIN; | ||||||
|  | 	} else { | ||||||
|  | 		filemap_invalidate_lock_shared(mapping); | ||||||
|  | 	} | ||||||
|  | 
 | ||||||
|  | 	if (!trylock_page(page)) { | ||||||
|  | 		error = -EAGAIN; | ||||||
|  | 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) | ||||||
|  | 			goto unlock_mapping; | ||||||
| 		if (!(iocb->ki_flags & IOCB_WAITQ)) { | 		if (!(iocb->ki_flags & IOCB_WAITQ)) { | ||||||
|  | 			filemap_invalidate_unlock_shared(mapping); | ||||||
| 			put_and_wait_on_page_locked(page, TASK_KILLABLE); | 			put_and_wait_on_page_locked(page, TASK_KILLABLE); | ||||||
| 			return AOP_TRUNCATED_PAGE; | 			return AOP_TRUNCATED_PAGE; | ||||||
| 		} | 		} | ||||||
| 		error = __lock_page_async(page, iocb->ki_waitq); | 		error = __lock_page_async(page, iocb->ki_waitq); | ||||||
| 		if (error) | 		if (error) | ||||||
| 			return error; | 			goto unlock_mapping; | ||||||
| 	} | 	} | ||||||
| 
 | 
 | ||||||
|  | 	error = AOP_TRUNCATED_PAGE; | ||||||
| 	if (!page->mapping) | 	if (!page->mapping) | ||||||
| 		goto truncated; | 		goto unlock; | ||||||
| 
 | 
 | ||||||
| 	error = 0; | 	error = 0; | ||||||
| 	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page)) | 	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page)) | ||||||
|  | @ -2392,15 +2404,13 @@ static int filemap_update_page(struct kiocb *iocb, | ||||||
| 		goto unlock; | 		goto unlock; | ||||||
| 
 | 
 | ||||||
| 	error = filemap_read_page(iocb->ki_filp, mapping, page); | 	error = filemap_read_page(iocb->ki_filp, mapping, page); | ||||||
| 	if (error == AOP_TRUNCATED_PAGE) | 	goto unlock_mapping; | ||||||
| 		put_page(page); |  | ||||||
| 	return error; |  | ||||||
| truncated: |  | ||||||
| 	unlock_page(page); |  | ||||||
| 	put_page(page); |  | ||||||
| 	return AOP_TRUNCATED_PAGE; |  | ||||||
| unlock: | unlock: | ||||||
| 	unlock_page(page); | 	unlock_page(page); | ||||||
|  | unlock_mapping: | ||||||
|  | 	filemap_invalidate_unlock_shared(mapping); | ||||||
|  | 	if (error == AOP_TRUNCATED_PAGE) | ||||||
|  | 		put_page(page); | ||||||
| 	return error; | 	return error; | ||||||
| } | } | ||||||
| 
 | 
 | ||||||
|  | @ -2415,6 +2425,19 @@ static int filemap_create_page(struct file *file, | ||||||
| 	if (!page) | 	if (!page) | ||||||
| 		return -ENOMEM; | 		return -ENOMEM; | ||||||
| 
 | 
 | ||||||
|  | 	/*
 | ||||||
|  | 	 * Protect against truncate / hole punch. Grabbing invalidate_lock here | ||||||
|  | 	 * assures we cannot instantiate and bring uptodate new pagecache pages | ||||||
|  | 	 * after evicting page cache during truncate and before actually | ||||||
|  | 	 * freeing blocks.  Note that we could release invalidate_lock after | ||||||
|  | 	 * inserting the page into page cache as the locked page would then be | ||||||
|  | 	 * enough to synchronize with hole punching. But there are code paths | ||||||
|  | 	 * such as filemap_update_page() filling in partially uptodate pages or | ||||||
|  | 	 * ->readpages() that need to hold invalidate_lock while mapping blocks | ||||||
|  | 	 * for IO so let's hold the lock here as well to keep locking rules | ||||||
|  | 	 * simple. | ||||||
|  | 	 */ | ||||||
|  | 	filemap_invalidate_lock_shared(mapping); | ||||||
| 	error = add_to_page_cache_lru(page, mapping, index, | 	error = add_to_page_cache_lru(page, mapping, index, | ||||||
| 			mapping_gfp_constraint(mapping, GFP_KERNEL)); | 			mapping_gfp_constraint(mapping, GFP_KERNEL)); | ||||||
| 	if (error == -EEXIST) | 	if (error == -EEXIST) | ||||||
|  | @ -2426,9 +2449,11 @@ static int filemap_create_page(struct file *file, | ||||||
| 	if (error) | 	if (error) | ||||||
| 		goto error; | 		goto error; | ||||||
| 
 | 
 | ||||||
|  | 	filemap_invalidate_unlock_shared(mapping); | ||||||
| 	pagevec_add(pvec, page); | 	pagevec_add(pvec, page); | ||||||
| 	return 0; | 	return 0; | ||||||
| error: | error: | ||||||
|  | 	filemap_invalidate_unlock_shared(mapping); | ||||||
| 	put_page(page); | 	put_page(page); | ||||||
| 	return error; | 	return error; | ||||||
| } | } | ||||||
|  | @ -2967,6 +2992,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) | ||||||
| 	pgoff_t max_off; | 	pgoff_t max_off; | ||||||
| 	struct page *page; | 	struct page *page; | ||||||
| 	vm_fault_t ret = 0; | 	vm_fault_t ret = 0; | ||||||
|  | 	bool mapping_locked = false; | ||||||
| 
 | 
 | ||||||
| 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); | 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); | ||||||
| 	if (unlikely(offset >= max_off)) | 	if (unlikely(offset >= max_off)) | ||||||
|  | @ -2976,25 +3002,39 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) | ||||||
| 	 * Do we have something in the page cache already? | 	 * Do we have something in the page cache already? | ||||||
| 	 */ | 	 */ | ||||||
| 	page = find_get_page(mapping, offset); | 	page = find_get_page(mapping, offset); | ||||||
| 	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) { | 	if (likely(page)) { | ||||||
| 		/*
 | 		/*
 | ||||||
| 		 * We found the page, so try async readahead before | 		 * We found the page, so try async readahead before waiting for | ||||||
| 		 * waiting for the lock. | 		 * the lock. | ||||||
| 		 */ | 		 */ | ||||||
| 		fpin = do_async_mmap_readahead(vmf, page); | 		if (!(vmf->flags & FAULT_FLAG_TRIED)) | ||||||
| 	} else if (!page) { | 			fpin = do_async_mmap_readahead(vmf, page); | ||||||
|  | 		if (unlikely(!PageUptodate(page))) { | ||||||
|  | 			filemap_invalidate_lock_shared(mapping); | ||||||
|  | 			mapping_locked = true; | ||||||
|  | 		} | ||||||
|  | 	} else { | ||||||
| 		/* No page in the page cache at all */ | 		/* No page in the page cache at all */ | ||||||
| 		count_vm_event(PGMAJFAULT); | 		count_vm_event(PGMAJFAULT); | ||||||
| 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); | 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); | ||||||
| 		ret = VM_FAULT_MAJOR; | 		ret = VM_FAULT_MAJOR; | ||||||
| 		fpin = do_sync_mmap_readahead(vmf); | 		fpin = do_sync_mmap_readahead(vmf); | ||||||
| retry_find: | retry_find: | ||||||
|  | 		/*
 | ||||||
|  | 		 * See comment in filemap_create_page() why we need | ||||||
|  | 		 * invalidate_lock | ||||||
|  | 		 */ | ||||||
|  | 		if (!mapping_locked) { | ||||||
|  | 			filemap_invalidate_lock_shared(mapping); | ||||||
|  | 			mapping_locked = true; | ||||||
|  | 		} | ||||||
| 		page = pagecache_get_page(mapping, offset, | 		page = pagecache_get_page(mapping, offset, | ||||||
| 					  FGP_CREAT|FGP_FOR_MMAP, | 					  FGP_CREAT|FGP_FOR_MMAP, | ||||||
| 					  vmf->gfp_mask); | 					  vmf->gfp_mask); | ||||||
| 		if (!page) { | 		if (!page) { | ||||||
| 			if (fpin) | 			if (fpin) | ||||||
| 				goto out_retry; | 				goto out_retry; | ||||||
|  | 			filemap_invalidate_unlock_shared(mapping); | ||||||
| 			return VM_FAULT_OOM; | 			return VM_FAULT_OOM; | ||||||
| 		} | 		} | ||||||
| 	} | 	} | ||||||
|  | @ -3014,8 +3054,20 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) | ||||||
| 	 * We have a locked page in the page cache, now we need to check | 	 * We have a locked page in the page cache, now we need to check | ||||||
| 	 * that it's up-to-date. If not, it is going to be due to an error. | 	 * that it's up-to-date. If not, it is going to be due to an error. | ||||||
| 	 */ | 	 */ | ||||||
| 	if (unlikely(!PageUptodate(page))) | 	if (unlikely(!PageUptodate(page))) { | ||||||
|  | 		/*
 | ||||||
|  | 		 * The page was in cache and uptodate and now it is not. | ||||||
|  | 		 * Strange but possible since we didn't hold the page lock all | ||||||
|  | 		 * the time. Let's drop everything get the invalidate lock and | ||||||
|  | 		 * try again. | ||||||
|  | 		 */ | ||||||
|  | 		if (!mapping_locked) { | ||||||
|  | 			unlock_page(page); | ||||||
|  | 			put_page(page); | ||||||
|  | 			goto retry_find; | ||||||
|  | 		} | ||||||
| 		goto page_not_uptodate; | 		goto page_not_uptodate; | ||||||
|  | 	} | ||||||
| 
 | 
 | ||||||
| 	/*
 | 	/*
 | ||||||
| 	 * We've made it this far and we had to drop our mmap_lock, now is the | 	 * We've made it this far and we had to drop our mmap_lock, now is the | ||||||
|  | @ -3026,6 +3078,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) | ||||||
| 		unlock_page(page); | 		unlock_page(page); | ||||||
| 		goto out_retry; | 		goto out_retry; | ||||||
| 	} | 	} | ||||||
|  | 	if (mapping_locked) | ||||||
|  | 		filemap_invalidate_unlock_shared(mapping); | ||||||
| 
 | 
 | ||||||
| 	/*
 | 	/*
 | ||||||
| 	 * Found the page and have a reference on it. | 	 * Found the page and have a reference on it. | ||||||
|  | @ -3056,6 +3110,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) | ||||||
| 
 | 
 | ||||||
| 	if (!error || error == AOP_TRUNCATED_PAGE) | 	if (!error || error == AOP_TRUNCATED_PAGE) | ||||||
| 		goto retry_find; | 		goto retry_find; | ||||||
|  | 	filemap_invalidate_unlock_shared(mapping); | ||||||
| 
 | 
 | ||||||
| 	return VM_FAULT_SIGBUS; | 	return VM_FAULT_SIGBUS; | ||||||
| 
 | 
 | ||||||
|  | @ -3067,6 +3122,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) | ||||||
| 	 */ | 	 */ | ||||||
| 	if (page) | 	if (page) | ||||||
| 		put_page(page); | 		put_page(page); | ||||||
|  | 	if (mapping_locked) | ||||||
|  | 		filemap_invalidate_unlock_shared(mapping); | ||||||
| 	if (fpin) | 	if (fpin) | ||||||
| 		fput(fpin); | 		fput(fpin); | ||||||
| 	return ret | VM_FAULT_RETRY; | 	return ret | VM_FAULT_RETRY; | ||||||
|  | @ -3437,6 +3494,8 @@ static struct page *do_read_cache_page(struct address_space *mapping, | ||||||
|  * |  * | ||||||
|  * If the page does not get brought uptodate, return -EIO. |  * If the page does not get brought uptodate, return -EIO. | ||||||
|  * |  * | ||||||
|  |  * The function expects mapping->invalidate_lock to be already held. | ||||||
|  |  * | ||||||
|  * Return: up to date page on success, ERR_PTR() on failure. |  * Return: up to date page on success, ERR_PTR() on failure. | ||||||
|  */ |  */ | ||||||
| struct page *read_cache_page(struct address_space *mapping, | struct page *read_cache_page(struct address_space *mapping, | ||||||
|  | @ -3460,6 +3519,8 @@ EXPORT_SYMBOL(read_cache_page); | ||||||
|  * |  * | ||||||
|  * If the page does not get brought uptodate, return -EIO. |  * If the page does not get brought uptodate, return -EIO. | ||||||
|  * |  * | ||||||
|  |  * The function expects mapping->invalidate_lock to be already held. | ||||||
|  |  * | ||||||
|  * Return: up to date page on success, ERR_PTR() on failure. |  * Return: up to date page on success, ERR_PTR() on failure. | ||||||
|  */ |  */ | ||||||
| struct page *read_cache_page_gfp(struct address_space *mapping, | struct page *read_cache_page_gfp(struct address_space *mapping, | ||||||
|  |  | ||||||
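
The reworked exit paths of filemap_update_page() follow a common unwind pattern: take the shared lock first (with a trylock for IOCB_NOWAIT), and route every later exit through labels that release what was acquired, in reverse order. Below is a simplified userspace sketch of that control flow. All `toy_`/`TOY_` names are hypothetical, the model never sleeps (a busy page lock is reported as -EAGAIN regardless of flags), and the page read is reduced to setting a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_NOWAIT		0x1	/* stands in for IOCB_NOWAIT */
#define TOY_TRUNCATED_PAGE	1	/* stands in for AOP_TRUNCATED_PAGE */

struct toy_mapping { int shared_holders; bool exclusive; };
struct toy_page { bool locked; bool has_mapping; bool uptodate; int refcount; };

static bool toy_trylock_shared(struct toy_mapping *m)
{
	if (m->exclusive)
		return false;
	m->shared_holders++;
	return true;
}

static int toy_update_page(struct toy_mapping *m, struct toy_page *page,
			   unsigned int flags)
{
	int error;

	if (flags & TOY_NOWAIT) {
		if (!toy_trylock_shared(m))
			return -EAGAIN;		/* nothing acquired yet */
	} else {
		bool ok = toy_trylock_shared(m);

		assert(ok);	/* a real lock would sleep here instead */
		(void)ok;
	}

	if (page->locked) {			/* page lock is busy */
		error = -EAGAIN;
		goto unlock_mapping;
	}
	page->locked = true;

	error = TOY_TRUNCATED_PAGE;
	if (!page->has_mapping)			/* page was truncated meanwhile */
		goto unlock;

	error = 0;
	page->uptodate = true;			/* stands in for reading the page */
unlock:
	page->locked = false;
unlock_mapping:
	m->shared_holders--;
	if (error == TOY_TRUNCATED_PAGE)
		page->refcount--;		/* models put_page() */
	return error;
}
```

The point of the sketch is that `shared_holders` returns to zero on every path that got past the initial lock acquisition, and that the page reference is dropped only for the truncated-page result, mirroring the single `unlock_mapping` label in the hunk above.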
|  | @ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, | ||||||
| 	 */ | 	 */ | ||||||
| 	unsigned int nofs = memalloc_nofs_save(); | 	unsigned int nofs = memalloc_nofs_save(); | ||||||
| 
 | 
 | ||||||
|  | 	filemap_invalidate_lock_shared(mapping); | ||||||
| 	/*
 | 	/*
 | ||||||
| 	 * Preallocate as many pages as we will need. | 	 * Preallocate as many pages as we will need. | ||||||
| 	 */ | 	 */ | ||||||
|  | @ -236,6 +237,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, | ||||||
| 	 * will then handle the error. | 	 * will then handle the error. | ||||||
| 	 */ | 	 */ | ||||||
| 	read_pages(ractl, &page_pool, false); | 	read_pages(ractl, &page_pool, false); | ||||||
|  | 	filemap_invalidate_unlock_shared(mapping); | ||||||
| 	memalloc_nofs_restore(nofs); | 	memalloc_nofs_restore(nofs); | ||||||
| } | } | ||||||
| EXPORT_SYMBOL_GPL(page_cache_ra_unbounded); | EXPORT_SYMBOL_GPL(page_cache_ra_unbounded); | ||||||
|  |  | ||||||
							
								
								
									
37 mm/rmap.c
							|  | @ -22,24 +22,25 @@ | ||||||
|  * |  * | ||||||
|  * inode->i_rwsem	(while writing or truncating, not reading or faulting) |  * inode->i_rwsem	(while writing or truncating, not reading or faulting) | ||||||
|  *   mm->mmap_lock |  *   mm->mmap_lock | ||||||
|  *     page->flags PG_locked (lock_page)   * (see hugetlbfs below) |  *     mapping->invalidate_lock (in filemap_fault) | ||||||
|  *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share) |  *       page->flags PG_locked (lock_page)   * (see hugetlbfs below) | ||||||
|  *         mapping->i_mmap_rwsem |  *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share) | ||||||
|  *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex) |  *           mapping->i_mmap_rwsem | ||||||
|  *           anon_vma->rwsem |  *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex) | ||||||
|  *             mm->page_table_lock or pte_lock |  *             anon_vma->rwsem | ||||||
|  *               swap_lock (in swap_duplicate, swap_info_get) |  *               mm->page_table_lock or pte_lock | ||||||
|  *                 mmlist_lock (in mmput, drain_mmlist and others) |  *                 swap_lock (in swap_duplicate, swap_info_get) | ||||||
|  *                 mapping->private_lock (in __set_page_dirty_buffers) |  *                   mmlist_lock (in mmput, drain_mmlist and others) | ||||||
|  *                   lock_page_memcg move_lock (in __set_page_dirty_buffers) |  *                   mapping->private_lock (in __set_page_dirty_buffers) | ||||||
|  *                     i_pages lock (widely used) |  *                     lock_page_memcg move_lock (in __set_page_dirty_buffers) | ||||||
|  *                       lruvec->lru_lock (in lock_page_lruvec_irq) |  *                       i_pages lock (widely used) | ||||||
|  *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty) |  *                         lruvec->lru_lock (in lock_page_lruvec_irq) | ||||||
|  *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) |  *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty) | ||||||
|  *                   sb_lock (within inode_lock in fs/fs-writeback.c) |  *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) | ||||||
|  *                   i_pages lock (widely used, in set_page_dirty, |  *                     sb_lock (within inode_lock in fs/fs-writeback.c) | ||||||
|  *                             in arch-dependent flush_dcache_mmap_lock, |  *                     i_pages lock (widely used, in set_page_dirty, | ||||||
|  *                             within bdi.wb->list_lock in __sync_single_inode) |  *                               in arch-dependent flush_dcache_mmap_lock, | ||||||
|  |  *                               within bdi.wb->list_lock in __sync_single_inode) | ||||||
|  * |  * | ||||||
|  * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon) |  * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon) | ||||||
|  *   ->tasklist_lock |  *   ->tasklist_lock | ||||||
|  |  | ||||||
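
The top of the updated ordering comment (i_rwsem, then mmap_lock, then invalidate_lock, then the page lock, then i_mmap_rwsem) can be expressed as a rank check. The sketch below is a hypothetical, much simplified checker in the spirit of lock ordering validation; it is not lockdep, it covers only these five locks, and all `toy_`/`TOY_` names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Locks listed outermost first, following the updated comment above. */
enum toy_lock {
	TOY_I_RWSEM,
	TOY_MMAP_LOCK,
	TOY_INVALIDATE_LOCK,
	TOY_PAGE_LOCK,
	TOY_I_MMAP_RWSEM,
	TOY_NR_LOCKS
};

struct toy_ctx { bool held[TOY_NR_LOCKS]; };

/* A lock may be taken only if no lock of equal or inner rank is held. */
static bool toy_may_acquire(const struct toy_ctx *ctx, enum toy_lock lock)
{
	for (int l = lock; l < TOY_NR_LOCKS; l++)
		if (ctx->held[l])
			return false;
	return true;
}

static bool toy_acquire(struct toy_ctx *ctx, enum toy_lock lock)
{
	if (!toy_may_acquire(ctx, lock))
		return false;
	ctx->held[lock] = true;
	return true;
}

static void toy_release(struct toy_ctx *ctx, enum toy_lock lock)
{
	assert(ctx->held[lock]);
	ctx->held[lock] = false;
}
```

Under this model a fault path that already holds mmap_lock can take invalidate_lock and then the page lock, but cannot take i_rwsem, which is the reason the commit message gives for not reusing i_rwsem. A truncate path taking i_rwsem, invalidate_lock and i_mmap_rwsem in that order passes the check.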
|  | @ -412,7 +412,8 @@ EXPORT_SYMBOL(truncate_inode_pages_range); | ||||||
|  * @mapping: mapping to truncate |  * @mapping: mapping to truncate | ||||||
|  * @lstart: offset from which to truncate |  * @lstart: offset from which to truncate | ||||||
|  * |  * | ||||||
|  * Called under (and serialised by) inode->i_rwsem. |  * Called under (and serialised by) inode->i_rwsem and | ||||||
|  |  * mapping->invalidate_lock. | ||||||
|  * |  * | ||||||
|  * Note: When this function returns, there can be a page in the process of |  * Note: When this function returns, there can be a page in the process of | ||||||
|  * deletion (inside __delete_from_page_cache()) in the specified range.  Thus |  * deletion (inside __delete_from_page_cache()) in the specified range.  Thus | ||||||
|  |  | ||||||
	 Jan Kara