sched: Add documentation for bandwidth control

Basic description of usage and effect for CFS Bandwidth Control.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.498036116@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

parent d8b4986d3d
commit 88ebc08ea9

1 changed file with 122 additions and 0 deletions

122  Documentation/scheduler/sched-bwc.txt  (Normal file)

@@ -0,0 +1,122 @@
CFS Bandwidth Control
=====================

[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]

CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
each given "period" (microseconds), a group is allowed to consume only up to
"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
group exceeds this limit (for that period), the tasks belonging to its
hierarchy will be throttled and are not allowed to run again until the next
period.

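As a rough illustration of the arithmetic, the ratio quota/period gives the
ceiling in units of CPUs.  The sketch below uses shell integer arithmetic only
and does not touch cgroupfs; the function name cpu_limit_percent is made up
for this example and is not part of any kernel interface.

```shell
#!/bin/sh
# Hypothetical helper (not a kernel interface): print the CPU ceiling implied
# by a quota/period pair, as a percentage of one CPU.
cpu_limit_percent() {
	quota_us=$1
	period_us=$2
	# Integer arithmetic; 100 means one full CPU, 200 means two CPUs.
	echo $(( quota_us * 100 / period_us ))
}

cpu_limit_percent 250000 250000		# quota equal to the period
cpu_limit_percent 10000 50000		# quota is one fifth of the period
```

The two calls print 100 and 20, i.e. one full CPU and 20% of one CPU.
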
A group's unused runtime is tracked globally and is refreshed with the quota
units described above at each period boundary.  As threads consume this
bandwidth it is transferred to cpu-local "silos" on a demand basis.  The amount
transferred within each of these updates is tunable and described as the
"slice".

Management
----------
Quota and period are managed within the cpu subsystem via cgroupfs.

cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
cpu.stat: exports throttling statistics [explained further below]

The default values are:
	cpu.cfs_period_us=100ms
	cpu.cfs_quota_us=-1

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place; such a group is described as an unconstrained
bandwidth group.  This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit.
The minimum value allowed for the quota or period is 1ms.  There is also an
upper bound on the period length of 1s.  Additional restrictions exist when
bandwidth limits are used in a hierarchical fashion; these are explained in
more detail below.

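The stated bounds can be checked before any value is written.  The following
is a minimal sketch of such a pre-flight check; bwc_values_ok and its
constants are hypothetical names used for illustration, and the hierarchical
restrictions are deliberately not covered.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not a kernel interface): mirror the limits
# stated in the text before writing values to cgroupfs.
MIN_US=1000		# 1ms lower bound for quota and period
MAX_PERIOD_US=1000000	# 1s upper bound for the period

bwc_values_ok() {
	quota_us=$1
	period_us=$2
	[ "$period_us" -ge "$MIN_US" ] || return 1
	[ "$period_us" -le "$MAX_PERIOD_US" ] || return 1
	# A negative quota means "no limit", so only the period matters.
	[ "$quota_us" -lt 0 ] && return 0
	[ "$quota_us" -ge "$MIN_US" ]
}

bwc_values_ok 10000 50000 && echo "10ms/50ms: ok"
bwc_values_ok 500 50000 || echo "0.5ms/50ms: quota below 1ms"
bwc_values_ok 10000 2000000 || echo "10ms/2s: period above 1s"
```

The kernel remains the authority on what it accepts; a check like this only
catches the obvious cases early.
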
Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
and return the group to an unconstrained state once more.

Any updates to a group's bandwidth specification will result in it becoming
unthrottled if it is in a constrained state.

System wide settings
--------------------
For efficiency, run-time is transferred between the global pool and CPU local
"silos" in a batch fashion.  This greatly reduces global accounting pressure
on large systems.  The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs:
	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)

Larger slice values will reduce transfer overheads, while smaller values allow
for more fine-grained consumption.

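To get a feel for the trade-off, the number of transfers needed to hand out a
whole quota can be estimated.  This is a back-of-the-envelope sketch that
assumes every transfer moves exactly one full slice, which the text does not
guarantee; min_transfers is a hypothetical name.

```shell
#!/bin/sh
# Back-of-the-envelope estimate, assuming every transfer moves one full
# slice: how many transfers are needed to hand out a whole quota.
min_transfers() {
	quota_us=$1
	slice_us=$2
	# Round up so a final partial slice still counts as a transfer.
	echo $(( (quota_us + slice_us - 1) / slice_us ))
}

min_transfers 250000 5000	# 250ms quota, default 5ms slice
min_transfers 250000 1000	# same quota, 1ms slice
```

Under that assumption the calls print 50 and 250: a five times smaller slice
means five times as many trips to the global pool per period.
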
Statistics
----------
A group's bandwidth statistics are exported via 3 fields in cpu.stat.

cpu.stat:
- nr_periods: Number of enforcement intervals that have elapsed.
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
  of the group have been throttled.

This interface is read-only.

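The three fields combine naturally into a throttling ratio.  The sketch below
summarises cpu.stat text supplied on stdin, assuming one "name value" pair per
line; throttle_summary is a hypothetical helper and the sample numbers are
made up, not measurements.

```shell
#!/bin/sh
# Hypothetical helper: summarise cpu.stat text supplied on stdin.
# Assumes one "name value" pair per line.
throttle_summary() {
	awk '
		$1 == "nr_periods"     { periods = $2 }
		$1 == "nr_throttled"   { throttled = $2 }
		$1 == "throttled_time" { ns = $2 }
		END {
			# Guard against division by zero before any period elapsed.
			pct = (periods > 0) ? 100 * throttled / periods : 0
			printf "throttled in %d%% of periods, %d ms total\n", pct, ns / 1000000
		}
	'
}

# Made-up sample values, not real measurements.
printf 'nr_periods 200\nnr_throttled 50\nthrottled_time 1500000000\n' | throttle_summary
```

For the sample input this prints "throttled in 25% of periods, 1500 ms total".
Against a live group it could be invoked as "throttle_summary < cpu.stat" from
within the group's cgroup directory.
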
Hierarchical considerations
---------------------------
The interface enforces that an individual entity's bandwidth is always
attainable, that is: max(c_i) <= C. However, over-subscription in the
aggregate case is explicitly allowed to enable work-conserving semantics
within a hierarchy.
  e.g. \Sum (c_i) may exceed C
[ Where C is the parent's bandwidth, and c_i its children ]


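The rule max(c_i) <= C can be sketched as a comparison of quota/period ratios.
The check below is illustrative only: child_fits_parent is a hypothetical
name, both groups are assumed to be constrained (no -1 quota), and ratios are
compared by cross-multiplication to stay in integer arithmetic.

```shell
#!/bin/sh
# Hypothetical check of the rule max(c_i) <= C, comparing quota/period
# ratios by cross-multiplication to stay in integer arithmetic.
child_fits_parent() {
	child_quota=$1; child_period=$2
	parent_quota=$3; parent_period=$4
	[ $(( child_quota * parent_period )) -le $(( parent_quota * child_period )) ]
}

# Parent C: 100ms quota per 100ms period (1 CPU).
# A child at 80ms/100ms fits.  Two such children would sum to 1.6 CPUs,
# which exceeds C but is the over-subscription the text permits.
child_fits_parent 80000 100000 100000 100000 && echo "child at 0.8 CPU: fits"
child_fits_parent 150000 100000 100000 100000 || echo "child at 1.5 CPUs: exceeds parent"
```

Only the per-child bound is tested; the sum over children is intentionally
left unchecked.
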
There are two ways in which a group may become throttled:
	a. it fully consumes its own quota within a period
	b. a parent's quota is fully consumed within its period

In case b) above, even though the child may have runtime remaining, it will not
be allowed to run until the parent's runtime is refreshed.

Examples
--------
1. Limit a group to 1 CPU worth of runtime.

	If period is 250ms and quota is also 250ms, the group will get
	1 CPU worth of runtime every 250ms.

	# echo 250000 > cpu.cfs_quota_us	# quota = 250ms
	# echo 250000 > cpu.cfs_period_us	# period = 250ms

2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.

	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
	runtime every 500ms.

	# echo 1000000 > cpu.cfs_quota_us	# quota = 1000ms
	# echo 500000 > cpu.cfs_period_us	# period = 500ms

	The larger period here allows for increased burst capacity.

3. Limit a group to 20% of 1 CPU.

	With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.

	# echo 10000 > cpu.cfs_quota_us	# quota = 10ms
	# echo 50000 > cpu.cfs_period_us	# period = 50ms

	By using a small period here we are ensuring a consistent latency
	response at the expense of burst capacity.

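The quota values in the three examples all follow from the desired ceiling and
the chosen period.  A minimal sketch of that calculation is shown here;
quota_for is a hypothetical helper, and the ceiling is expressed as a
percentage of one CPU (so 200 means 2 CPUs).

```shell
#!/bin/sh
# Hypothetical helper: derive the quota (in microseconds) that yields a
# given ceiling, expressed as a percentage of one CPU, for a chosen period.
quota_for() {
	percent=$1
	period_us=$2
	echo $(( percent * period_us / 100 ))
}

quota_for 100 250000	# example 1: 1 CPU, 250ms period
quota_for 200 500000	# example 2: 2 CPUs, 500ms period
quota_for 20 50000	# example 3: 20% of 1 CPU, 50ms period
```

The calls print 250000, 1000000 and 10000, matching the values written to
cpu.cfs_quota_us in the examples.
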