mirror of https://github.com/torvalds/linux.git (synced 2025-11-04 02:30:34 +02:00)

157 commits

	| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
| 
							 | 
						8e1c4961f4 | 
							
							
								
								mm/damon/core: avoid destroyed target reference from DAMOS quota
							
							
							
							
							
							
							
							When the number of the monitoring targets in running contexts is reduced,
there may be DAMOS quotas referencing the targets that will be destroyed.
Applying the scheme action for such a DAMOS scheme is then skipped forever,
because the scheme keeps looking for the starting part of the region of
the destroyed monitoring target.
To fix this issue, when the monitoring target is destroyed, reset the
starting part for all DAMOS quotas that reference the target.
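The fix can be pictured with a minimal Python sketch. It is not the kernel code: the `Target`, `Quota`, and `Context` classes and the `charge_target_from` / `charge_addr_from` field names are assumptions made for illustration only. The sketch only shows the idea of clearing every quota's saved starting point that refers to a target before that target goes away.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality: two targets are never "equal"
class Target:
    name: str


@dataclass
class Quota:
    # Hypothetical fields: where the next scheme application should resume.
    charge_target_from: Optional[Target] = None
    charge_addr_from: int = 0


@dataclass
class Context:
    targets: List[Target] = field(default_factory=list)
    quotas: List[Quota] = field(default_factory=list)


def destroy_target(ctx: Context, target: Target) -> None:
    """Remove a target, resetting any quota that still points at it."""
    for quota in ctx.quotas:
        if quota.charge_target_from is target:
            # Without this reset the quota would wait forever for a
            # target that no longer exists.
            quota.charge_target_from = None
            quota.charge_addr_from = 0
    ctx.targets.remove(target)


a, b = Target("a"), Target("b")
ctx = Context(targets=[a, b],
              quotas=[Quota(a, 4096), Quota(b, 8192)])
destroy_target(ctx, a)
```

After the call, only the quota that referenced the destroyed target is reset; the quota that referenced the surviving target keeps its starting point.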
Link: https://lkml.kernel.org/r/20250517141852.142802-1-akinobu.mita@gmail.com
Fixes: 
							
						 | 
						
							||
| 
							 | 
						591c4c78be | 
							
							
								
								mm/damon/core: warn and fix nr_accesses[_bp] corruption
							
							
							
							
							
							
							
Patch series "mm/damon: minor fixups and improvements for code, tests, and documents". Yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. This patch (of 6): For a bug such as double aggregation reset[1], ->nr_accesses and/or ->nr_accesses_bp of damon_region could be corrupted. Such corruption can make monitoring results quite inaccurate, so the root-cause bug should be investigated. Meanwhile, the corruption itself can easily be fixed, but silently fixing it would hide the bug. Fix the corruption as soon as it is found, but WARN_ONCE() so that we can be aware of the existence of the bug while keeping the system running in a saner way. Link: https://lkml.kernel.org/r/20250513002715.40126-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250513002715.40126-2-sj@kernel.org Link: https://lore.kernel.org/20250302214145.356806-1-sj@kernel.org [1] Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						0e1c773b50 | 
							
							
								
								mm/damon/core: introduce damos quota goal metrics for memory node utilization
							
							
							
							
							
							
							
							Patch series "mm/damon: auto-tune DAMOS for NUMA setups including tiered
memory".
Utilizing DAMON for memory tiering usually requires manual tuning and/or
tedious controls.  Let it self-tune hotness and coldness thresholds for
promotion and demotion, aiming for high utilization of high memory tiers,
by introducing new DAMOS quota goal metrics representing the used and the
free memory ratios of specific NUMA nodes.  Also introduce a sample DAMON
module that demonstrates how the new feature can be used for memory
tiering use cases.
Background
==========
A type of tiered memory system exposes the memory tiers as NUMA nodes.  A
straightforward page placement strategy for such systems is placing
access-hot and cold pages on upper and lower tiers, respectively,
pursuing higher utilization of upper tiers.  Since access temperature can
be dynamic, periodically finding and migrating hot pages and cold pages to
proper tiers (promoting and demoting) is also required.  The Linux kernel
provides several features for such dynamic and transparent page
placement.
Page Faults and LRU
-------------------
One widely known way is using NUMA balancing in tiering mode (a.k.a.
NUMAB-2) and reclaim-based demotion features.  In this setup, NUMAB-2 finds
hot pages using access check-purpose page faults (a.k.a. prot_none) and
promotes them inside each process' context, until there are no more pages
to promote, or the upper tier is filled up and memory pressure happens.
In the latter case, LRU-based reclaim logic wakes up as a response to the
memory pressure and demotes cold pages to lower tiers in asynchronous
(kswapd) and/or synchronous ways (direct reclaim).
DAMON
-----
Yet another available solution is using DAMOS with migrate_hot and
migrate_cold DAMOS actions for promotions and demotions, respectively.  To
make it optimal, users need to specify aggressiveness and access
temperature thresholds for promotions and demotions in a good balance that
results in high utilization of upper tiers.  The number of parameters is
not small, and optimal parameter values depend on characteristics of the
underlying hardware and the workload.  As a result, it often requires
manual, time-consuming and repetitive tuning of the DAMOS schemes for
given workload and system combinations.
Self-tuned DAMON-based Memory Tiering
=====================================
To solve such manual tuning problems, DAMOS provides aim-oriented
feedback-driven quotas self-tuning.  Using the feature, we design a
self-tuned DAMON-based memory tiering for general multi-tier memory
systems.
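As a rough illustration of what "aim-oriented feedback-driven" tuning means, the sketch below adjusts a quota in proportion to how far a measured score is from its goal. This is a simplified proportional feedback loop written for this explanation, not the kernel's implementation: the function name `next_quota`, the constants `GOAL_BP` and `MIN_QUOTA`, and the convention that a score of 10,000 basis points means "goal exactly achieved" are all assumptions.

```python
GOAL_BP = 10_000   # assumed convention: 10,000 bp == goal exactly achieved
MIN_QUOTA = 1      # assumed lower bound so the quota never reaches zero


def next_quota(last_quota: int, score_bp: int) -> int:
    """Return the next quota given the last quota and the achieved score.

    Under-achieving the goal grows the quota (be more aggressive);
    over-achieving shrinks it (be less aggressive).
    """
    diff_bp = abs(GOAL_BP - score_bp)
    compensation = last_quota * diff_bp // GOAL_BP
    if score_bp < GOAL_BP:
        return last_quota + compensation
    return max(MIN_QUOTA, last_quota - compensation)


half_achieved = next_quota(1000, 5_000)     # goal only 50 % achieved
over_achieved = next_quota(1000, 12_000)    # goal over-achieved by 20 %
on_target = next_quota(1000, 10_000)        # goal exactly achieved
```

With these inputs the quota grows to 1500 when the goal is only half achieved, shrinks to 800 when it is over-achieved by 20 %, and stays at 1000 when the goal is met.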
For each memory tier node, if it has a lower tier, run a DAMOS scheme that
demotes cold pages of the node, auto-tuning the aggressiveness aiming for
an amount of free space on the node.  The free space is for keeping the
headroom that avoids significant memory pressure during upper tier memory
usage spikes, and for promoting hot pages from the lower tier.
For each memory tier node, if it has an upper tier, run a DAMOS scheme
that promotes hot pages of the current node to the upper tier, auto-tuning
the aggressiveness aiming for a high utilization ratio of the upper tier.
The target ratio is to ensure higher tiers are utilized as much as
possible.  It should match the headroom for the demotion scheme, but with a
slight overlap, to ensure promotion and demotion are not entirely stopped.
The aim-oriented aggressiveness auto-tuning of DAMOS is already available.
Hence, to implement such a tiering solution, only new quota goal metrics
for the utilization and free space ratios of a specific NUMA node need to
be developed.
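The two metrics and the "slight overlap" between the goals can be illustrated with a small Python sketch. The metric names `node_mem_used_bp` and `node_mem_free_bp` and the 99.7 % / 0.5 % targets appear elsewhere in this message; everything else (the function signatures, the byte-count inputs, and the rule that a scheme keeps working while its metric is below its target) is an assumption for illustration, not the kernel implementation.

```python
GIB = 1 << 30
PROMOTION_GOAL_USED_BP = 9_970   # 99.7 % utilization target, in basis points
DEMOTION_GOAL_FREE_BP = 50       # 0.5 % free-space target, in basis points


def node_mem_used_bp(total_bytes: int, free_bytes: int) -> int:
    """Used-memory ratio of a node in basis points (1 bp = 0.01 %)."""
    return (total_bytes - free_bytes) * 10_000 // total_bytes


def node_mem_free_bp(total_bytes: int, free_bytes: int) -> int:
    """Free-memory ratio of a node in basis points."""
    return free_bytes * 10_000 // total_bytes


def active_schemes(total_bytes: int, free_bytes: int) -> dict:
    """Which schemes still have work to do (assumed: metric below target)."""
    return {
        "promote": node_mem_used_bp(total_bytes, free_bytes)
        < PROMOTION_GOAL_USED_BP,
        "demote": node_mem_free_bp(total_bytes, free_bytes)
        < DEMOTION_GOAL_FREE_BP,
    }


dram = 250 * GIB
plenty_free = active_schemes(dram, 2 * GIB)    # 0.8 % free
in_overlap = active_schemes(dram, 1 * GIB)     # 0.4 % free
nearly_full = active_schemes(dram, GIB // 2)   # 0.2 % free
```

Because 99.7 % used plus 0.5 % free exceeds 100 %, there is a band (between 0.3 % and 0.5 % free) in which both schemes stay active. In the sketch, only promotion is active with 2 GiB free, both are active with 1 GiB free, and only demotion is active with 0.5 GiB free.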
Discussions
===========
The design raises the discussion points below.
Expected Behaviors
------------------
The system will let upper tier memory nodes accommodate as much hot data
as possible.  If the total amount of data is less than the top tier
memory's promotion/demotion target utilization, all of the data will simply
be placed on the top tier.  The promotion scheme will do nothing since
there is no data to promote.  The demotion scheme will also do nothing
since the free space ratio of the top tier is higher than the goal.
Only if the amount of data is larger than the top tier's target utilization
ratio will the demotion scheme demote cold pages and ensure the headroom
free space.  Since the promotion and demotion schemes for a single node
have a small overlap in their target utilization and free space goals,
promotions and demotions will continue working with a moderate
aggressiveness level.  This keeps all data placed according to access
hotness under dynamic access patterns, while minimizing the migration
overhead.
In any case, each node will keep headroom free space, and upper tiers will
be utilized as much as possible.
Ease of Use
-----------
Users still need to set the target utilization and free space ratios, but
they will be easier to set.  We argue 99.7 % utilization and 0.5 % free
space ratios can be good default values.  They can be easily adjusted based
on the desired headroom size of a given use case.  Users are also still
required to set the minimum coldness and hotness thresholds.  Together with
monitoring intervals auto-tuning[2], DAMON will always show a meaningful
amount of hot and cold memory.  And DAMOS quota's prioritization mechanism
will make good decisions as long as the source information is that
varied.  Hence, users can set the minimum criteria very naively.  We
believe any access observation and no access observation within the last
aggregation interval are enough as the minimum hot and cold region
criteria.
General Tiered Memory Setup Applicability
-----------------------------------------
The design can be applied to any number of tiers having any performance
characteristics, as long as they can be hierarchical.  Hence, applying the
system to a different tiered memory system will be straightforward.  Note
that this assumes only the single CPU NUMA node case.  Because today's
DAMON is not aware of which CPU made each access, applying this on systems
having multiple CPU NUMA nodes can be complicated.  We are planning to
extend DAMON for the use case, but that's out of the scope of this patch
series.
How To Use
----------
Users can implement the auto-tuned DAMON-based memory tiering using the
DAMON sysfs interface.  It can be easily done using a DAMON user-space
tool.  The evaluation section below shows an example DAMON user-space
tool command for that.
For wider and simpler deployment, having a kernel module that sets up and
runs the DAMOS schemes via the DAMON kernel API can be useful.  The module
can enable the memory tiering at boot time via a kernel command line
parameter or at run time with a single command.  This patch series
implements a sample DAMON kernel module that shows how such a module can be
implemented.
Comparison To Page Faults and LRU-based Approaches
--------------------------------------------------
The existing page faults based promotion (NUMAB-2) does hot page
detection and migration in the process context.  When there are many pages
to promote, it can block the progress of the application's real work.
DAMOS works in an asynchronous worker thread, so it doesn't block the real
work.
NUMAB-2 doesn't provide a way to control aggressiveness of promotion other
than the maximum amount of pages to promote per given time window.  If hot
pages are found, promotions can happen at the upper-bound speed,
regardless of the upper tier's memory pressure.  If the maximum speed is
not well set for the given workload, it can result in slow promotion or
unnecessary memory pressure.  Self-tuned DAMON-based memory tiering
alleviates the problem by adjusting the speed based on current utilization
of the upper tier.
LRU-based demotion can be triggered in both asynchronous (kswapd) and
synchronous (direct reclaim) ways.  Other than the way of finding cold
pages, asynchronous LRU-based demotion and DAMON-based demotion have no big
difference.  DAMON-based demotion can balance better with DAMON-based
promotion, though.  LRU-based demotion can do better than DAMON-based
demotion when the tier is under significant memory pressure.  It would be
wise to use DAMON-based demotion as the proactive and primary one, while
utilizing LRU-based demotion together as a fast backup solution.
Evaluation
==========
In short, under a setup that requires fast and frequent promotions,
self-tuned DAMON-based memory tiering's hot page promotion improves
performance by about 4.43 %.  We believe this shows self-tuned DAMON-based
promotion's effectiveness.  Meanwhile, NUMAB-2's hot page promotion
degrades the performance by about 7.36 %.  We suspect the degradation is
mostly due to NUMAB-2's synchronous nature that can block the
application's progress, which highlights the advantage of the DAMON-based
solution's asynchronous nature.
Note that the test was done with the RFC version of this patch series.  We
didn't run it again since this patch series got no meaningful change after
the RFC, while the test takes a pretty long time.
Setup
-----
Hardware.  Use a machine equipped with a 250 GiB DRAM memory tier and a
50 GiB CXL memory tier.  The tiers are exposed as NUMA nodes 0 and 1,
respectively.
Kernel.  Use Linux kernel v6.13 modified as follows.  Add all DAMON
patches that are available on the mm tree of 2025-03-15, and this patch
series.  Also modify it to ignore mempolicy() system calls, to avoid bad
effects from the application's optimizations that assume traditional NUMA
systems.
Workload.  Use a modified version of the Taobench benchmark[3] that is
available in the DCPerf benchmark suite.  It represents an in-memory
caching workload.  We set its 'memsize', 'warmup_time', and 'test_time'
parameters to 340 GiB, 2,500 seconds and 1,440 seconds.  The parameters are
chosen to ensure the workload uses more than the DRAM memory tier.  Its RSS
under these parameters grows to 270 GiB within the warmup time.
It turned out the workload has a very static access pattern.  Only about
13 % of the RSS is frequently accessed from the beginning to the end.
Hence promotion shows no meaningful performance difference regardless of
the design and implementation.  We therefore modify the kernel to
periodically demote up to 10 GiB of hot pages and promote up to 10 GiB of
cold pages once per minute.  The intention is to simulate periodic access
pattern changes.  The hotness and coldness thresholds are set very naively,
so that it is more like a random access pattern change than a strict
hot/cold page exchange.  This is why we call the workload "modified".
It is implemented as two DAMOS schemes, each running on an asynchronous
thread.  It can be reproduced with the DAMON user-space tool as below.
    # ./damo start \
        --ops paddr --numa_node 0 --monitoring_intervals 10s 200s 200s \
            --damos_action migrate_hot 1 \
            --damos_quota_interval 60s --damos_quota_space 10G \
        --ops paddr --numa_node 1 --monitoring_intervals 10s 200s 200s \
            --damos_action migrate_cold 0 \
            --damos_quota_interval 60s --damos_quota_space 10G \
        --nr_schemes 1 1 --nr_targets 1 1 --nr_ctxs 1 1
System configurations.  Use the system configuration variants below.
- Baseline.  No memory tiering features are turned on.
- Numab_tiering.  On the baseline, enable NUMAB-2 and reclaim-based
  demotion.  In detail, the following commands are executed:
  echo 2 > /proc/sys/kernel/numa_balancing;
  echo 1 > /sys/kernel/mm/numa/demotion_enabled;
  echo 7 > /proc/sys/vm/zone_reclaim_mode
- DAMON_tiering.  On the baseline, utilize the self-tuned DAMON-based
  memory tiering implementation via the DAMON user-space tool.  It utilizes
  two kernel threads, namely a promotion thread and a demotion thread.  The
  demotion thread monitors the access pattern of the DRAM node using DAMON
  with auto-tuned monitoring intervals aiming for a 4% DAMON-observed
  access ratio, and demotes the coldest pages at up to 200 MiB per second,
  aiming for 0.5% free space on the DRAM node.  The promotion thread
  monitors the CXL node using the same intervals auto-tuning, and promotes
  hot pages in the same way but aiming for 99.7% utilization of the DRAM
  node.  Because DAMON provides only best-effort accuracy, add young page
  DAMOS filters to allow only young pages when promoting and reject all
  young pages when demoting.  It can be reproduced with the DAMON
  user-space tool as below.
    # ./damo start \
        --numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
            --damos_action migrate_cold 1 --damos_access_rate 0% 0% \
            --damos_apply_interval 1s \
            --damos_quota_interval 1s --damos_quota_space 200MB \
            --damos_quota_goal node_mem_free_bp 0.5% 0 \
            --damos_filter reject young \
        --numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
            --damos_action migrate_hot 0 --damos_access_rate 5% max \
            --damos_apply_interval 1s \
            --damos_quota_interval 1s --damos_quota_space 200MB \
            --damos_quota_goal node_mem_used_bp 99.7% 0 \
            --damos_filter allow young \
            --damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
        --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1
Measurement Results
-------------------
On each system configuration, run the modified version of Taobench and
collect 'score'.  'score' is a metric that is calculated and provided by
Taobench to represent the performance of the run on the system.  To
handle measurement errors, repeat the measurement five times.  The
results are as below.
    Config         Score   Stdev   (%)     Normalized
    Baseline       1.6165  0.0319  1.9764  1.0000
    Numab_tiering  1.4976  0.0452  3.0209  0.9264
    DAMON_tiering  1.6881  0.0249  1.4767  1.0443
The 'Config' column shows the system config of the measurement.  The
'Score' column shows the average 'score' of the five runs on the system
config.  The 'Stdev' column shows the standard deviation of the five
score measurements.  The '(%)' column shows the 'Stdev' to 'Score'
ratio in percentage.  Finally, the 'Normalized' column shows the averaged
score values of the configs, normalized to that of 'Baseline'.
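The 'Normalized' column and the percentage changes quoted for each config can be re-derived from the 'Score' column. The short Python sketch below does that arithmetic; the variable and function names are chosen here for illustration, and only the score values come from the table.

```python
# Average scores copied from the 'Score' column of the results table.
scores = {
    "Baseline": 1.6165,
    "Numab_tiering": 1.4976,
    "DAMON_tiering": 1.6881,
}


def normalized(config: str) -> float:
    """Score of a config relative to the Baseline score."""
    return scores[config] / scores["Baseline"]


def change_percent(config: str) -> float:
    """Performance change against Baseline, in percent."""
    return (normalized(config) - 1.0) * 100.0


for name in scores:
    print(f"{name:14s} {normalized(name):.4f} {change_percent(name):+.2f} %")
```

Rounded to four digits, the ratios match the 'Normalized' column (0.9264 and 1.0443), which correspond to a degradation of about 7.36 % for Numab_tiering and an improvement of about 4.43 % for DAMON_tiering.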
The periodic hot page demotion and cold page promotion that was
conducted to simulate a dynamic access pattern was started from the
beginning of the workload.  It resulted in the DRAM tier utilization
always staying under the watermark, and hence no real demotion happened in
any test run.  This means the above results show no difference between
LRU-based and DAMON-based demotions.  Only the difference between NUMAB-2
and DAMON-based promotions is represented in the results.
The Numab_tiering config degraded the performance by about 7.36 %.  We
suspect this happened because NUMAB-2's synchronous promotion was blocking
Taobench's real work progress.
The DAMON_tiering config improved the performance by about 4.43 %.  We
believe this shows the effectiveness of DAMON-based promotion, which didn't
block Taobench's real work progress due to its asynchronous nature.  This
also means DAMON's monitoring results are accurate enough to provide a
visible amount of improvement.
Evaluation Limitations
----------------------
As mentioned above, this evaluation shows only a comparison of promotion
mechanisms.  DAMON-based tiering is recommended to be used together with
reclaim-based demotion as a faster backup under significant memory
pressure, though.
From some perspectives, the modified version of Taobench may seem to
distort the picture too much.  It would be better to evaluate with more
realistic workloads, or more finely tuned micro benchmarks.
Patch Sequence
==============
The first patch (patch 1) implements two new quota goal metrics on the core
layer and exposes them to the DAMON core kernel API.  The second and third
ones (patches 2 and 3) further link them to the DAMON sysfs interface.  The
three following patches (patches 4-6) document the new feature and sysfs
files on the design, usage, and ABI documents.  The final one (patch 7)
implements a working version of a self-tuned DAMON-based memory tiering
solution in an incomplete but easy to understand form as a kernel module
under the samples/damon/ directory.
References
==========
[1] https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org/
[2] https://lore.kernel.org/20250303221726.484227-1-sj@kernel.org
[3] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md
This patch (of 7):
Used and free space ratios for specific NUMA nodes can be useful inputs
for NUMA-specific DAMOS schemes' aggressiveness self-tuning feedback loop.
Implement DAMOS quota goal metrics for such self-tuned schemes.
Link: https://lkml.kernel.org/r/20250420194030.75838-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250420194030.75838-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yunjeong Mun <yunjeong.mun@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
							
						 | 
						
							||
| 
							 | 
						7f29070f4c | 
							
							
								
								mm/damon/core: simplify control flow in damon_register_ops()
							
							
							
							
							
							
							
							The function logic is not complex, so using goto is unnecessary. Replace it with a straightforward if-else to simplify control flow and improve readability. Link: https://lkml.kernel.org/r/Z9vxcPCw8tDsjKw1@OneApple Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						105f830fa3 | 
							
							
								
								mm/damon: remove damon_operations->reset_aggregated
							
							
							
							
							
							
							
The operations layer hook was introduced to let the operations set do any aggregation data reset if needed. But it is not really used now. Remove it. Link: https://lkml.kernel.org/r/20250306175908.66300-14-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						99ce7c9c6d | 
							
							
								
								mm/damon: remove damon_callback->before_damos_apply
							
							
							
							
							
							
							
The hook was introduced to let DAMON kernel API users access DAMOS schemes-eligible regions in a safe way. Now it is no longer used by anyone, and the functionality is provided in a better way by damos_walk(). Remove it. Link: https://lkml.kernel.org/r/20250306175908.66300-13-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						cedee98f68 | 
							
							
								
								mm/damon: remove damon_callback->after_sampling
							
							
							
							
							
							
							
The callback was used by the DAMON sysfs interface for reading DAMON internal data. But it is no longer used, and damon_call() can do similar work in a better way. Remove it. Link: https://lkml.kernel.org/r/20250306175908.66300-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						07da21855b | 
							
							
								
								mm/damon: remove ->before_start of damon_callback
							
							
							
							
							
							
							
The function pointer field was added to be used as a place to do some initialization work just before DAMON starts working. However, nobody is using it now. Remove it. Link: https://lkml.kernel.org/r/20250306175908.66300-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						258d941e58 | 
							
							
								
								mm/damon/core: make damon_set_attrs() be safe to be called from damon_call()
							
							
							
							
							
							
							
Currently all DAMON kernel API callers do online DAMON parameters commit from damon_callback->after_aggregation because only that is a safe place to call the DAMON monitoring attributes update function, namely damon_set_attrs(). Because damon_callback hooks provide no synchronization, the callers work in asynchronous ways or implement their own inefficient and complicated synchronization mechanisms. It also means online DAMON parameters commit can take up to one aggregation interval. On large systems having long aggregation intervals, that can be too slow. The synchronization can be done in a more efficient and simple way while removing the latency constraint if it can be done using damon_call(). The fact that damon_call() can be executed in the middle of the aggregation makes damon_set_attrs() unsafe to be called from it, though. Two real problems can occur in the case. First, converting the not yet completely aggregated nr_accesses for new user-set intervals can arguably degrade the accuracy or at least make the logic complicated. Second, kdamond_reset_aggregated() will not be called after the monitoring results update, so the next aggregation starts from an unclean state. This can result in inconsistent and unexpected nr_accesses_bp. Make it safe as follows. Catch the middle-of-the-aggregation case from damon_set_attrs() by checking the passed_sample_intervals and next_aggregation_sis of the context. And pass the information to the nr_accesses conversion logic. The logic works as before if it is not the case (called after the current aggregation is completed). If it is the case (committing parameters in the middle of the aggregation), it drops the nr_accesses information that has been aggregated so far, and makes the status the same as at the beginning of this aggregation, but as if the last aggregation was started with the updated sampling/aggregation intervals. The middle-of-aggregation check introduces yet another edge case, though. 
This happens because kdamond_tune_intervals() can also call damon_set_attrs() with the middle-of-aggregation check. Consider the case where damon_call() for parameters commit and kdamond_tune_intervals() are called in the same iteration of the kdamond main loop. Because kdamond_tune_intervals() is called for aggregation intervals, it should be the end of the aggregation. The first damon_set_attrs() call from kdamond_call() understands it is the end of the aggregation and correctly handles it. But that damon_set_attrs() call has updated next_aggregation_sis of the context. Hence, the second damon_set_attrs() invocation from kdamond_tune_intervals() believes it is called in the middle of the aggregation. It therefore resets the information aggregated so far. After that, kdamond_reset_interval() is called and double-resets the aggregated information. Avoid this case, too, by setting the next_aggregation_sis before kdamond_tune_intervals() is invoked. Link: https://lkml.kernel.org/r/20250306175908.66300-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						bf74bdfd2e | 
							
							
								
								mm/damon/core: invoke kdamond_call() after merging is done if possible
							
							
							
							
							
							
							
kdamond_call() callers may iterate the regions, so it is better to call it when the number of regions is as small as possible. That is when kdamond_merge_regions() is finished. Invoke it at that point. This change is also aimed at making future changes for carrying online parameters commit with damon_call() easier. The commit operation should be able to sequence itself against other aggregation interval based operations, including regions merging and aggregation reset. Placing the damon_call() invocation after the regions merging makes the sequence handling simpler. Link: https://lkml.kernel.org/r/20250306175908.66300-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						f7f0d88b7d | 
							
							
								
								mm/damon/core: expose damos_filter_for_ops() to DAMON kernel API callers
							
							
							
							
							
							
							
							damos_filter_for_ops() can be useful to avoid putting wrong type of filters in wrong place. Make it be exposed to DAMON kernel API callers. Link: https://lkml.kernel.org/r/20250305222733.59089-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						961df88e46 | 
							
							
								
								mm/damon/core: set damos_filter default allowance behavior based on installed filters
							
							
							
							
							
							
							
Decide whether to allow or reject by default on core and operations layer
handled filters evaluation stages.  It is decided as the opposite of the
last installed filter's behavior.  If there is no filter at all, allow by
default.  If there are any operations layer handled filters, the core
layer's filtering stage sets allowing as the default behavior regardless of
the last filter of core layer-handled ones, since the last filter of core
layer handled filters in the case is not really the last filter of the
entire filtering stage.
Also, make the core layer's DAMOS filters handling stage use the newly
set behavior field.
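The decision rule described above can be sketched in a few lines of Python. This is an illustration of the rule as stated in this message, not the kernel code: the `Filter` class and the function names are hypothetical, and a filter is reduced to a single `allow` flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Filter:
    allow: bool  # True for an allow filter, False for a reject filter


def filters_default_reject(filters: List[Filter]) -> bool:
    """Default is the opposite of the last filter; no filter means allow."""
    if not filters:
        return False
    # Last filter allows -> reject by default; last rejects -> allow.
    return filters[-1].allow


def core_default_reject(core: List[Filter], ops: List[Filter]) -> bool:
    """Core stage default; always allow if ops-layer filters follow."""
    if ops:
        return False
    return filters_default_reject(core)


def ops_default_reject(ops: List[Filter]) -> bool:
    """Ops stage default, decided by the last ops-layer filter."""
    return filters_default_reject(ops)


allow, reject = Filter(True), Filter(False)
no_filters = (core_default_reject([], []), ops_default_reject([]))
core_only = core_default_reject([allow], [])
core_then_ops = (core_default_reject([allow], [allow]),
                 ops_default_reject([allow]))
```

With no filters both stages allow by default. A lone core-layer allow filter makes the core stage reject by default. When an ops-layer allow filter follows a core-layer allow filter, the core stage allows by default and the ops stage rejects by default.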
[sj@kernel.org: setup damos->{core,ops}_filters_default_reject for initial start]
  Link: https://lkml.kernel.org/r/20250315222610.35245-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250304211913.53574-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
							
						 | 
						
							||
| 
							 | 
						2a689e4e83 | 
							
							
								
								mm/damon/core: put ops-handled filters to damos->ops_filters
							
							
							
							
							
							
							
damos->ops_filters was introduced to be used for all operations layer handled filters. But DAMON kernel API callers can put any type of DAMOS filters to any of damos->filters and damos->ops_filters. DAMON user-space ABI users have no way to use ->ops_filters at all. Update damos_add_filter(), which should be used by API callers to install DAMOS filters, to add filters to ->filters and ->ops_filters depending on their handling layer. The change forces both API callers and ABI users to use proper lists since ABI users use the API internally. Link: https://lkml.kernel.org/r/20250304211913.53574-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						3607cc590f | 
							
							
								
								mm/damon/core: support committing ops_filters
							
							
							
							
							
							
							
							DAMON kernel API callers should use damon_commit_ctx() to install DAMON parameters including DAMOS filters. But damos_commit_ops_filters(), which is called by damon_commit_ctx() for filters installing, is not handling damos->ops_filters. Hence, no DAMON kernel API caller can use damos->ops_filters. Do the committing of the ops_filters to make it usable. Link: https://lkml.kernel.org/r/20250304211913.53574-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						ab82e57981 | 
							
							
								
								mm/damon/core: introduce damos->ops_filters
							
							
							
							
							
							
							
Patch series "mm/damon: make allow filters after reject filters useful and intuitive". DAMOS filters allow or reject elements of memory for a given DAMOS scheme only if those match the filter criteria. For elements that don't match any DAMOS filter, 'allowing' is the default behavior. This makes allow-filters that don't have any reject-filter after them meaningless sources of overhead. The decision was made to keep the behavior consistent with that before the introduction of allow-filters. This, however, makes usage of DAMOS filters confusing and inefficient. It is more intuitive and still consistent behavior to reject by default unless there is no filter at all or the last filter is a reject filter. Update the filtering logic in that way and update documents to clarify the behavior. Note that this is changing the old behavior. But the old behavior for the problematic filter combination was definitely confusing, inefficient and anyway useless. Also, the behavior was introduced relatively recently. It is difficult to anticipate any user that depends on the behavior. Hence this is not a user-breaking behavior change but an obvious improvement. This patch (of 9): DAMOS filters can be categorized into two groups depending on which layer handles them, namely core layer and ops layer. The groups are important because the filtering behavior depends on the evaluation sequence of filters, and core layer-handled filters are evaluated before operations layer-handled ones. The behavior is clearly documented, but the implementation is a bit inefficient and complicated. All filters are maintained mixed in a single list (damos->filters). The filter evaluation logic in the core layer and the operations layer iterates all the filters on the list, while skipping filters that should not be handled by the layer of the logic. It is inefficient. Making future extensions that differentiate filters of different handling layers will also be complicated. 
Add a new list that will hold all operations layer-handled DAMOS filters to the DAMOS scheme data structure. Also add support for its initialization and basic traversal functions. Link: https://lkml.kernel.org/r/20250304211913.53574-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250304211913.53574-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						f04b0fedbe | 
							
							
								
								mm/damon/core: implement intervals auto-tuning
							
							
							
							
							
							
							
							Implement the DAMON sampling and aggregation intervals auto-tuning mechanism as briefly described on 'struct damon_intervals_goal'. The core part for deciding the direction and amount of the changes is implemented reusing the feedback loop function which is being used for DAMOS quotas auto-tuning. Unlike the DAMOS quotas auto-tuning use case, though, limit the maximum decrease per adjustment to 50% of the current value. This is because rapid reductions of the intervals have no clear merit and could unnecessarily increase the monitoring overhead. Link: https://lkml.kernel.org/r/20250303221726.484227-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						1eb3471bf5 | 
							
							
								
								mm/damon: add data structure for monitoring intervals auto-tuning
							
							
							
							
							
							
							
							Patch series "mm/damon: auto-tune aggregation interval".
DAMON requires time-consuming and repetitive aggregation interval tuning. 
Introduce a feature for automating it using a feedback loop that aims for an
amount of observed access events, like auto-exposing cameras.
Background: Access Frequency Monitoring and Aggregation Interval
================================================================
DAMON checks if each memory element (damon_region) is accessed or not for
every user-specified time interval called 'sampling interval'.  It
aggregates the check results on a per-element counter called
'nr_accesses'.  DAMON users can read the counters to get the access
temperature of a given element.  The counters are reset at the end of
another user-specified time interval called 'aggregation interval'.
This can be illustrated as DAMON continuously capturing a snapshot of
access events that happened and were captured within the last aggregation
interval.  This implies the aggregation interval plays a key role for the
quality of the snapshots, like the camera exposure time.  If it is too
short, the amount of access events that happened and were captured for
each snapshot is small, so each snapshot will not show many interesting
things but just a cold and dark world with hopefully one pale blue dot or
two.  If it is too long, too many events are aggregated in a single shot,
so each snapshot will look like a world of flames, or Muspellheim.  It
will be difficult to find practical insights in both cases.
Problem: Time Consuming and Repetitive Tuning
=============================================
The appropriate length of the aggregation interval depends on how
frequently the system and workloads are making access events that DAMON
can observe.  Hence, users have to tune the interval with an excessive
amount of tests with the target system and workloads.  If the system and
workloads are changed, the tuning should be done again.  If the
characteristic of the workloads is dynamic, it becomes more challenging. 
It is therefore time-consuming and repetitive.
The tuning challenge mainly stems from the wrong question.  It is not
asking users what quality of monitoring results they want, but how DAMON
should operate for their hidden goal.  To give the right answer, users
need to fully understand DAMON's mechanisms and the characteristics of
their workloads.  Users shouldn't be asked to understand the underlying
mechanism.  Understanding the characteristics of the workloads shouldn't
be the role of users but DAMON.
Aim-oriented Feedback-driven Auto-Tuning
=========================================
Fortunately, the appropriate length of the aggregation interval can be
inferred using a feedback loop.  If the current snapshots are not showing
much interesting information, in other words, if they show only rare
access events, increasing the aggregation interval helps, and vice versa.
We tested this theory on a few real-world workloads, and documented one of
the experiences in an official DAMON monitoring intervals tuning
guideline.  Since it is a simple theory that requires repeated tries, it
can be a good job for machines.
Based on the guideline's theory, we design an automation of aggregation
interval tuning, in a way similar to that of the camera auto-exposure
feature.  It defines the amount of interesting information as the ratio of
access events that DAMON actually observed to the theoretical maximum
amount of them within each snapshot.  Events are accounted in byte and
sampling attempts granularity.  For example, let's say there is a region
of 'X' bytes size.  DAMON tried access check sampling for the region 'Y'
times in total for a given aggregation.  Among the 'Y' attempts, 'Z' times
it showed positive results.  Then, the theoretical maximum number of
access events for the region is 'X * Y'.  And the number of access events
that DAMON has observed for the region is 'X * Z'.  The amount of the
interesting information is '(X * Z) / (X * Y)'.  Note that each snapshot
would have multiple regions.
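The ratio described above could be sketched as follows.  This is only an
illustration in Python, not the kernel implementation.  The 'Region' type
and the function name are hypothetical, and summing the events over all
regions of a snapshot is an assumption based on the note that each
snapshot has multiple regions.

```python
from dataclasses import dataclass


@dataclass
class Region:
    size_bytes: int   # 'X': size of the region in bytes
    nr_samples: int   # 'Y': access check sampling attempts in the aggregation
    nr_positive: int  # 'Z': attempts that showed a positive result


def observed_access_ratio(regions):
    """Return observed access events over their theoretical maximum."""
    # Events are counted in bytes * sampling attempts, per region.
    observed = sum(r.size_bytes * r.nr_positive for r in regions)
    maximum = sum(r.size_bytes * r.nr_samples for r in regions)
    # An empty snapshot has no events to observe (assumed to yield 0).
    return observed / maximum if maximum else 0.0


# Hypothetical snapshot: one always-accessed page, three never-accessed pages.
snapshot = [
    Region(size_bytes=4096, nr_samples=20, nr_positive=20),
    Region(size_bytes=12288, nr_samples=20, nr_positive=0),
]
print(observed_access_ratio(snapshot))
```

Weighting each region by its size means a large cold region pulls the
ratio down more than a small hot region pulls it up.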
Users can set an arbitrary value of the ratio as their target.  Once the
target is set, the automation periodically measures the current value of
the ratio and increases or decreases the aggregation interval if the ratio
value is lower or higher than the target, respectively.  The amount of the
change is proportional to the distance between the current and the target
values.
To keep the auto-tuning from going too far, let users set the minimum and
the maximum aggregation interval times.  Changing only the aggregation
interval while the sampling interval is kept makes the maximum level of
access frequency in each snapshot, or discernment of regions,
inconsistent.  Also, an unnecessarily short sampling interval causes
meaningless monitoring overhead.  The automation therefore adjusts the
sampling interval together with the aggregation interval, while keeping
the ratio between the two intervals.  Users can set the ratio, or the
discernment.
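One tuning step along these lines might look like the sketch below.  It
is a simplified Python illustration under stated assumptions, not the
kernel's feedback loop function: the function and parameter names are
hypothetical, the change is assumed to scale with the distance relative
to the target, and the sampling to aggregation intervals ratio is
expressed as an integer number of samples per aggregation (20, matching
a 0.05 ratio).

```python
def tune_intervals(aggr_us, current_ratio, target_ratio,
                   min_aggr_us, max_aggr_us, samples_per_aggr=20):
    """Return (sampling interval, aggregation interval) in microseconds."""
    # Positive when too few access events were observed, negative otherwise.
    error = (target_ratio - current_ratio) / target_ratio
    # Lengthen the interval for a low ratio, shorten it for a high ratio,
    # by an amount proportional to the distance from the target.
    new_aggr_us = int(aggr_us * (1.0 + error))
    # Keep the result inside the user-given min/max aggregation intervals.
    new_aggr_us = max(min_aggr_us, min(max_aggr_us, new_aggr_us))
    # Adjust the sampling interval together, keeping the intervals ratio.
    new_sample_us = new_aggr_us // samples_per_aggr
    return new_sample_us, new_aggr_us


# Hypothetical case: 2 % observed against a 4 % target, 100 ms aggregation.
print(tune_intervals(100_000, 0.02, 0.04, 100_000, 400_000_000))
```

Clamping after the proportional step keeps a single extreme measurement
from pushing the intervals outside the range the user asked for.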
Discussion
==========
The modified question (aimed amount of access events, or lights, in each
snapshot) is easy for both the users and the kernel to answer.  If users
are interested in finding more cold regions, the value should be lower,
and vice versa.  If users have no idea, the kernel can suggest a fair
default value based on some theories and experiments.  For example, based
on the Pareto principle (80/20 rule), we could expect a 20% target ratio
will capture 80% of real access events.  Since 80% might be too high,
applying the rule once again, 4% (20% * 20%) may capture about 64% (80% *
80%) of real access events.
Sampling to aggregation intervals ratio and min/max aggregation intervals
are also arguably easy to answer.  What users want is discernment of
regions for efficient system operation, for example, X amount of colder
regions or Y amount of warmer regions, not exactly how many times each
cache line is accessed at nanosecond granularity.  The appropriate min/max
aggregation interval can be set relatively naively, and may be better set
for the aimed monitoring overhead.  Since the sampling interval directly
decides the overhead, setting it based on the sampling interval can be
easy.  From my experience, I'd argue an intervals ratio of 0.05, and a 5
milliseconds to 20 seconds sampling interval range (100 milliseconds to
400 seconds aggregation interval) can be a good default suggestion.
Evaluation
==========
On a machine running a real world server workload, I ran DAMON to monitor
its physical address space for about 23 hours, with this feature turned
on.  We set it to tune sampling interval in a range from 5 milliseconds to
10 seconds, aiming 4 % DAMON-observed access ratio per three aggregation
intervals.  The exact command I used is as below.
    damo start --monitoring_intervals_goal 4% 3 5ms 10s --damos_action stat
During the test run, DAMON continuously updated sampling and aggregation
intervals as designed, within the given range.  For all the time, DAMON
was able to find the intervals that meet the target access events ratio
in the given intervals range (sampling interval between 5 milliseconds and
10 seconds).
For most of the time, the tuned sampling interval converged to 300-400
milliseconds.  It made only a small amount of changes within the range.
The average of the tuned sampling interval during the test was about 380
milliseconds.
The workload periodically gets less load and decreases its CPU usage.
Presumably this also caused it to make fewer memory access events.
In reaction to such events, DAMON also increased the intervals as expected.
It was still able to find the optimum interval that satisfies the target
access ratio within the given intervals range.  Usually it converged
to about 5 seconds.  Once the workload got a normal amount of load again,
DAMON reactively reduced the intervals to the normal range.
I collected and visualized DAMON's monitoring results on the server a few
times.  Every time, the visualized access pattern looked not biased to only
cold or hot pages but diverse and balanced.  Let me show some of the
snapshots that I collected near the end of the test (after about 23
hours had passed since starting DAMON on the server).
The recency histogram looks as below.  Please note that this visualization
shows only very coarse-grained information.  For more details about the
visualization format, please refer to DAMON user-space tool
documentation[1].
    # ./damo report access --style recency-sz-hist --tried_regions_of 0 0 0 --access_rate 0 0
    <last accessed time (us)> <total size>
    [-19 h 7 m 45.514 s, -17 h 12 m 58.963 s)  6.198 GiB  |****                |
    [-17 h 12 m 58.963 s, -15 h 18 m 12.412 s) 0 B        |                    |
    [-15 h 18 m 12.412 s, -13 h 23 m 25.860 s) 0 B        |                    |
    [-13 h 23 m 25.860 s, -11 h 28 m 39.309 s) 0 B        |                    |
    [-11 h 28 m 39.309 s, -9 h 33 m 52.757 s)  0 B        |                    |
    [-9 h 33 m 52.757 s, -7 h 39 m 6.206 s)    0 B        |                    |
    [-7 h 39 m 6.206 s, -5 h 44 m 19.654 s)    0 B        |                    |
    [-5 h 44 m 19.654 s, -3 h 49 m 33.103 s)   0 B        |                    |
    [-3 h 49 m 33.103 s, -1 h 54 m 46.551 s)   0 B        |                    |
    [-1 h 54 m 46.551 s, -0 ns)                16.967 GiB |*********           |
    [-0 ns, --6886551440000 ns)                38.835 GiB |********************|
    memory bw estimate: 9.425 GiB per second
    total size: 62.000 GiB
It shows about 38 GiB of memory was accessed at least once within the last
aggregation interval (given the ~300 milliseconds tuned sampling interval,
this is about six seconds).  This is about 61 % of the total memory.  In
other words, DAMON found the warmest 61 % of the system's memory.  The
number is particularly interesting given our Pareto principle based theory
for the tuning goal value.  We set it as 20 % of 20 % (4 %), thinking it
would capture 80 % of 80 % (64 %) of real access events.  And it found
61 % hot memory, or working set.  Nevertheless, to make the theory
clearer, much more discussion and tests would be needed.  At the moment,
nonetheless, we can say making the target value higher helps finding more
hot memory regions.
The histogram also shows an amount of cold memory.  About 17 GiB memory of
the system has not been accessed at least for the last aggregation
interval (about six seconds), and at most for about the last two hours.
The real longest unaccessed time of the 17 GiB memory was about 19
minutes, though.  This is a limitation of this visualization format.
It further found very cold 6 GiB memory.  It has not been accessed at
least for the last 17 hours and at most 19 hours.
What about hot memory distribution?  To see this, I captured and visualized
the snapshot in an access temperature histogram.  Again, please refer to
the DAMON user-space tool documentation[1] for the format and what access
temperature means.  Both the visualization and the metric show only very
coarse-grained and limited information.  The resulting histogram looks
like below.
    # ./damo report access --style temperature-sz-hist --tried_regions_of 0 0 0
    <temperature> <total size>
    [-6,840,763,776,000, -5,501,580,939,800) 6.198 GiB  |***                 |
    [-5,501,580,939,800, -4,162,398,103,600) 0 B        |                    |
    [-4,162,398,103,600, -2,823,215,267,400) 0 B        |                    |
    [-2,823,215,267,400, -1,484,032,431,200) 0 B        |                    |
    [-1,484,032,431,200, -144,849,595,000)   0 B        |                    |
    [-144,849,595,000, 1,194,333,241,200)    55.802 GiB |********************|
    [1,194,333,241,200, 2,533,516,077,400)   4.000 KiB  |*                   |
    [2,533,516,077,400, 3,872,698,913,600)   4.000 KiB  |*                   |
    [3,872,698,913,600, 5,211,881,749,800)   8.000 KiB  |*                   |
    [5,211,881,749,800, 6,551,064,586,000)   12.000 KiB |*                   |
    [6,551,064,586,000, 7,890,247,422,200)   4.000 KiB  |*                   |
    memory bw estimate: 5.178 GiB per second
    total size: 62.000 GiB
We can see most of the memory is in a similar access temperature range, and
definitely some pages are extremely hot.
To see the picture in more detail, let's capture and visualize the
snapshot per DAMON-region, sorted by their access temperature.  The total
number of the regions was about 300.  Due to the limited space, I'm
showing only a few parts of the output here.
    # ./damo report access --style hot --tried_regions_of 0 0 0
    heatmap: 00000000888888889999999888888888888888888888888888888888888888888888888888888888
    # min/max temperatures: -6,827,258,184,000, 17,589,052,500, column size: 793.600 MiB
     |999999999999999999999999999999999999999| 4.000 KiB   access 100 % 18 h 9 m 43.918 s
     |999999999999999999999999999999999999999| 8.000 KiB   access 100 % 17 h 56 m 5.351 s
     |999999999999999999999999999999999999999| 4.000 KiB   access 100 % 15 h 24 m 19.634 s
     |999999999999999999999999999999999999999| 4.000 KiB   access 100 % 14 h 10 m 55.606 s
     |999999999999999999999999999999999999999| 4.000 KiB   access 100 % 11 h 34 m 18.993 s
    [...]
               |99999999999999999999999999999| 8.000 KiB   access 100 % 1 m 27.945 s
               |11111111111111111111111111111| 80.000 KiB  access 15 %  1 m 21.180 s
               |00000000000000000000000000000| 24.000 KiB  access 5 %   1 m 21.180 s
               |00000000000000000000000000000| 5.919 GiB   access 10 %  1 m 14.415 s
               |99999999999999999999999999999| 12.000 KiB  access 100 % 1 m 7.650 s
    [...]
                                           |0| 4.000 KiB   access 5 %   0 ns
                                           |0| 12.000 KiB  access 5 %   0 ns
                                           |0| 188.000 KiB access 0 %   0 ns
                                           |0| 24.000 KiB  access 0 %   0 ns
                                           |0| 48.000 KiB  access 0 %   0 ns
    [...]
             |0000000000000000000000000000000| 8.000 KiB   access 0 %   6 m 45.901 s
            |00000000000000000000000000000000| 36.000 KiB  access 0 %   7 m 26.491 s
            |00000000000000000000000000000000| 4.000 KiB   access 0 %   12 m 37.682 s
           |000000000000000000000000000000000| 8.000 KiB   access 0 %   18 m 9.168 s
           |000000000000000000000000000000000| 16.000 KiB  access 0 %   19 m 3.288 s
    |0000000000000000000000000000000000000000| 6.198 GiB   access 0 %   18 h 57 m 52.582 s
    memory bw estimate: 8.798 GiB per second
    total size: 62.000 GiB
We can see DAMON found small and extremely hot regions that were accessed
for all access check sampling (once per about 300 milliseconds) for more
than 10 hours.  The access temperature rapidly decreases.  DAMON was also
able to find small and big regions that were not accessed for up to about
19 minutes.  It even found an outlier cold region of 6 GiB that was not
accessed for about 19 hours.  It is unclear what the outlier region is, as
of this writing.
For the testing, DAMON was consuming about 0.1% of single CPU time.  This
is again an expected result, since DAMON was using about 370 milliseconds
sampling interval in most cases.
    # ps -p $kdamond_pid -o %cpu
    %CPU
     0.1
I also ran similar tests against a kernel build workload and an in-memory
cache workload benchmark[2].  Detailed results including tuned intervals
and captured access patterns were of course different since those depend on
the workloads.  But the auto-tuning feature was always working as expected
like the above results for the real world workload.
To wrap up, with the intervals auto-tuning feature, DAMON was able to
capture access pattern snapshots of good quality on a real world server
workload.  The auto-tuning feature was able to adaptively react to the
dynamic access patterns of the workload and reliably provide consistent
monitoring results without manual human interventions.  Also, the
auto-tuning made DAMON consume only the necessary amount of resources for
the required quality.
References
==========
[1] https://github.com/damonitor/damo/blob/next/USAGE.md#access-report-styles
[2] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md
This patch (of 8):
Add data structures for DAMON sampling and aggregation intervals automatic
tuning that aims for a specific amount of DAMON-observed access events per
snapshot.  In more detail, define the data structure for the tuning goal,
link it to the monitoring attributes data structure so that DAMON kernel
API callers can make the request, and update the DAMON parameters setup
function to respect the new parameter.
Link: https://lkml.kernel.org/r/20250303221726.484227-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250303221726.484227-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
							
						 | 
						
							||
| 
							 | 
						0431c42622 | 
							
							
								
								mm/damon: introduce DAMOS filter type hugepage_size
							
							
							
							
							
							
							
							Patch series "mm/damon: add support for hugepage_size DAMOS filter", v5. hugepage_size DAMOS filter can be used to gather statistics to check if memory regions of specific access temperatures are backed by hugepages of a size in a specific range. This filter can help to observe and prove the effectiveness of different schemes for shrinking/collapsing hugepages. This patch (of 4): This is to gather statistics to check if memory regions of specific access temperatures are backed by pages of a size in a specific range. This filter can help to observe and prove the effectiveness of different schemes for shrinking/collapsing hugepages. [sj@kernel.org: add kernel-doc comment for damos_filter->sz_range] Link: https://lkml.kernel.org/r/20250218223058.52459-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250211124437.278873-1-usamaarif642@gmail.com Link: https://lkml.kernel.org/r/20250211124437.278873-2-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						6fa70372c8 | 
							
							
								
								mm/damon/core: do damos walking in entire regions granularity
							
							
							
							
							
							
							
							damos_walk_control can be installed while DAMOS is walking the regions. 
This means the walk callback function invocations can be started from a
region in the middle of the regions list.  This makes it hard to use
reliably.  In particular, the DAMOS tried regions update for collecting
monitoring results produces problematic results.  Extend the
walk_control_lock critical section to do walking in entire regions
granularity.
Link: https://lkml.kernel.org/r/20250210182737.134994-4-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						40eb655b41 | 
							
							
								
								mm/damon/core: do not call damos_walk_control->walk() if walk is completed
							
							
							
							
							
							
							
							damos_walk() invokes callback functions of schemes until all schemes
finish at least one round of walks.  If there are multiple DAMOS schemes
having different apply_interval, the callback functions for longer apply
interval scheme will be called for more than a round of the walk.
The behavior is different from the document (see damos_walk() kernel-doc
comment), and not useful.  Make the behavior the same as the documented
one, by stopping invoking the callback if the walk for the given scheme is
completed.
Link: https://lkml.kernel.org/r/20250210182737.134994-3-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						c32696ca5e | 
							
							
								
								mm/damon/core: unset damos->walk_completed after confirmed set
							
							
							
							
							
							
							
							Patch series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors".
damos_walk() can finish working earlier or later than expected, and start
earlier than practical.  The first two behaviors are clearly wrong (they
don't follow the documentation) and all three behaviors only make the
feature useless.  Fix those.
This patch (of 3):
damos->walk_completed is only set, not unset.  This can cause the next
damos_walk() to finish earlier than expected.  Unset it after
walk_completed is confirmed to be set for all schemes.
Link: https://lkml.kernel.org/r/20250210182737.134994-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250210182737.134994-2-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						94ba17adab | 
							
							
								
								mm/damon: avoid applying DAMOS action to same entity multiple times
							
							
							
							
							
							
							
							'paddr' DAMON operations set can apply a DAMOS scheme's action to a large
folio multiple times in a single DAMOS-regions-walk if the folio is laid on
multiple DAMON regions.  Add a field for DAMOS scheme object that can be
used by the underlying ops to know what was the last entity that the
scheme's action has been applied to.  The core layer unsets the field when
each DAMOS-regions-walk is done for the given scheme.  And update 'paddr'
ops to use the infrastructure to avoid the problem.
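The idea could be sketched as below.  This is a simplified Python
illustration, not the kernel code: 'Scheme', 'last_applied' and both
functions are hypothetical stand-ins, and it assumes regions are walked
in address order so that a folio spanning two regions is the last entity
of one region and the first entity of the next.

```python
class Scheme:
    def __init__(self):
        # Hypothetical stand-in for the new field on the scheme object.
        self.last_applied = None


def apply_to_region(scheme, folios_in_region, applied_log):
    """Apply the action to each folio, skipping the one just handled."""
    for folio in folios_in_region:
        if folio == scheme.last_applied:
            continue  # already handled while walking the previous region
        applied_log.append(folio)
        scheme.last_applied = folio


def walk_regions(scheme, regions):
    """Walk all regions once and return the folios the action was applied to."""
    applied = []
    for folios in regions:
        apply_to_region(scheme, folios, applied)
    # The core layer unsets the field when the walk is done for the scheme.
    scheme.last_applied = None
    return applied


# Hypothetical layout: folio "B" spans regions 0 and 1, "C" spans 1 and 2.
regions = [["A", "B"], ["B", "C"], ["C"]]
print(walk_regions(Scheme(), regions))
```

Remembering only the last entity is enough under the address-order
assumption, so no per-walk set of visited folios has to be kept.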
Link: https://lkml.kernel.org/r/20250207212033.45269-3-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						73d7a69de2 | 
							
							
								
								mm/damon/core: initialize damos->walk_completed in damon_new_scheme()
							
							
							
							
							
							
							
							The function for allocating and initializing a 'struct damos' object,
damon_new_scheme(), is not initializing damos->walk_completed field.  Only
damos_walk_complete() is setting the field.  Hence the field will be
eventually set and used correctly from second damos_walk() call for the
scheme.  But the first damos_walk() could mistakenly not walk on the
regions.  Actually, a common usage of DAMOS for taking an access pattern
snapshot is installing a monitoring-purpose DAMOS scheme, doing
damos_walk() to retrieve the snapshot, and then removing the scheme. 
DAMON user-space tool (damo) also gets runtime snapshot in this way.  Hence
the problem can continuously happen in such use cases.  Initialize it
properly in the allocation function.
Link: https://lkml.kernel.org/r/20250228174450.41472-1-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						39a326e6da | 
							
							
								
								mm/damon: respect core layer filters' allowance decision on ops layer
							
							
							
							
							
							
							
							Filtering decisions are made in filters evaluation order.  Once a decision
is made by a filter, filters that are scheduled to be evaluated after the
decision-making filter should just respect it.  This is the intended and
documented behavior.  Since core layer-handled filters are evaluated
before operations layer-handled filters, decisions made on core layer
should be respected by ops layer.
In case of reject filters, the decision is respected, since core
layer-rejected regions are not passed to ops layer.  But in case of allow
filters, ops layer filters don't know if the region was passed to them
because it was allowed by core filters or just because it didn't match
any core layer filter.  The current wrong implementation assumes it was
due to not being matched by any core filters.  As a result, the decision
is not respected.  Pass the missing information to ops layer using a new
field in 'struct damos', and make the ops layer filters respect it.
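The intended decision flow could be sketched as follows.  This is a
simplified Python illustration, not the kernel code: the 'Filter' type,
the function names and the dict-based memory description are all
hypothetical, and allowing memory that matches no filter is assumed as
the default of that time.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Filter:
    matches: Callable[[dict], bool]  # whether the filter matches the memory
    allow: bool                      # True: allow filter, False: reject filter


def first_decision(filters: List[Filter], mem: dict) -> Optional[bool]:
    """Return the decision of the first matching filter, or None."""
    for f in filters:
        if f.matches(mem):
            return f.allow
    return None


def should_apply(core_filters, ops_filters, mem, default_allow=True):
    core = first_decision(core_filters, mem)
    if core is False:
        return False  # core-rejected memory is never passed to the ops layer
    # The "missing information": was the memory explicitly allowed by core?
    core_allowed = core is True
    if core_allowed:
        return True   # the ops layer respects the earlier allow decision
    ops = first_decision(ops_filters, mem)
    return default_allow if ops is None else ops


# Hypothetical setup: core allows low addresses, ops rejects anonymous pages.
core_filters = [Filter(lambda m: m["addr"] < 100, allow=True)]
ops_filters = [Filter(lambda m: m["anon"], allow=False)]
print(should_apply(core_filters, ops_filters, {"addr": 50, "anon": True}))
```

Without the 'core_allowed' information, the first case would fall through
to the ops layer reject filter, which is the bug described above.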
Link: https://lkml.kernel.org/r/20250228175336.42781-1-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						56dff92932 | 
							
							
								
								mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
							
							
							
							
							
							
							
							Remove hard-coded strings by using the str_high_low() helper function. Link: https://lkml.kernel.org/r/20250116204216.106999-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						e2fbfedad0 | 
							
							
								
								mm/damon: add 'allow' argument to damos_new_filter()
							
							
							
							
							
							
							
							DAMON API users should set damos_filter->allow manually to use a DAMOS allow-filter, since damos_new_filter() always unsets the field. It is cumbersome and error-prone. Add an argument for setting the field to damos_new_filter(). Link: https://lkml.kernel.org/r/20250109175126.57878-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						491fee286e | 
							
							
								
								mm/damon/core: support damos_filter->allow
							
							
							
							
							
							
							
							DAMOS filters support allowing behavior, but the core layer's DAMOS filters handling logic still assumes only rejecting (filtering-out) behavior. Update the logic to be aware of and respect the behavioral decision by reading damos_filter->allow when making the decision to exclude a region or not. Link: https://lkml.kernel.org/r/20250109175126.57878-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						fe6d7fdd62 | 
							
							
								
								mm/damon/core: add damos_filter->allow field
							
							
							
							
							
							
							
							DAMOS filters work only as exclusive (reject) filters. This makes them confusing, and restrictive when combining multiple filters for covering various types of memory. Add a field named 'allow' to damos_filter. The field will be used to indicate whether the filter should work for inclusion or exclusion. To keep the old behavior, set it as 'false' (work as exclusive filter) by default, from damos_new_filter(). The following two commits will make the core and operations set layers, which handle damos_filter objects, respect the field, respectively. Link: https://lkml.kernel.org/r/20250109175126.57878-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						cfc33a7d2d | 
							
							
								
								mm/damon/core: pass per-region filter-passed bytes to damos_walk_control->walk_fn()
							
							
							
							
							
							
							
							Total size of memory that passed DAMON operations set layer-handled DAMOS filters per scheme is provided to DAMON core API and ABI (sysfs interface) users. Having it per-region in non-accumulated way can provide it in finer granularity. Provide it to damos_walk() core API users, by passing the data to damos_walk_control->walk_fn(). Link: https://lkml.kernel.org/r/20250106193401.109161-13-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						60fa9355a6 | 
							
							
								
								mm/damon/core: implement per-scheme ops-handled filter-passed bytes stat
							
							
							
							
							
							
							
							Implement a new per-DAMOS scheme statistic field, namely sz_ops_filter_passed, using the changed damon_operations->apply_scheme() interface. It counts the total bytes of memory that the given DAMOS action was tried on and that passed the operations layer-handled region-internal filters of the scheme. DAMON API users can access it using DAMON-internal safe access features such as damon_call() and/or damos_walk(). Link: https://lkml.kernel.org/r/20250106193401.109161-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						b5bbe9c08f | 
							
							
								
								mm/damon: ask apply_scheme() to report filter-passed region-internal bytes
							
							
							
							
							
							
							
							Some DAMOS filter types including those for young page, anon page, and belonging memcg are handled by underlying DAMON operations set implementation, via damon_operations->apply_scheme() interface. How many bytes of the region have passed the filter can be useful for DAMOS scheme tuning and access pattern monitoring. Modify the interface to let the callback implementation report back the number if possible. Link: https://lkml.kernel.org/r/20250106193401.109161-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						bf0eaba0ff | 
							
							
								
								mm/damon/core: implement damos_walk()
							
							
							
							
							
							
							
							Introduce a new core layer interface, damos_walk(). It aims to replace some damon_callback usages that access DAMOS schemes applied regions of ongoing kdamond with additional synchronizations. It receives a function pointer and asks kdamond to invoke it for any region that it tried to apply any DAMOS action within one scheme apply interval for every scheme of it. The function further waits until the kdamond finishes the invocations for every scheme, or cancels the request, and returns. The kdamond invokes the function as requested within the main loop. If it is deactivated by DAMOS watermarks or going out of the main loop, it marks the request as canceled, so that damos_walk() can wakeup and return. Link: https://lkml.kernel.org/r/20250103174400.54890-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						42b7491af1 | 
							
							
								
								mm/damon/core: introduce damon_call()
							
							
							
							
							
							
							
							Introduce a new DAMON core API function, damon_call(). It aims to replace some damon_callback usages that access the damon_ctx of an ongoing kdamond with additional synchronizations. It receives a function pointer, lets the parallel kdamond invoke the function, and returns after the invocation is finished, or canceled due to some races. The kdamond invokes the function inside the main loop after sampling is done. If it is deactivated by DAMOS watermarks or already out of the main loop, it marks the request as canceled so that damon_call() can wake up and return. Link: https://lkml.kernel.org/r/20250103174400.54890-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						6653995262 | 
							
							
								
								mm/damon/core: remove duplicate list_empty quota->goals check
							
							
							
							
							
							
							
							damos_set_effective_quota() checks quota conditions, but there are some duplicate checks for quota->goals inside. This patch removes one of the if statements to simplify the esz calculation logic by setting esz to ULONG_MAX by default. Link: https://lkml.kernel.org/r/20241125184307.41746-1-sj@kernel.org Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						7d390b5306 | 
							
							
								
								mm/damon/core: fix ignored quota goals and filters of newly committed schemes
							
							
							
							
							
							
							
							damon_commit_schemes() ignores quota goals and filters of the newly
committed schemes.  This makes users confused about the behaviors. 
Correctly handle those inputs.
Link: https://lkml.kernel.org/r/20241222231222.85060-3-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						8debfc5b1a | 
							
							
								
								mm/damon/core: fix new damon_target objects leaks on damon_commit_targets()
							
							
							
							
							
							
							
							Patch series "mm/damon/core: fix memory leaks and ignored inputs from
damon_commit_ctx()".
Due to two bugs in damon_commit_targets() and damon_commit_schemes(),
which are called from damon_commit_ctx(), some user inputs can be ignored,
and some memory objects can be leaked.  Fix those.
Note that only DAMON sysfs interface users are affected.  Other DAMON core
API user modules that are more focused on simple and dedicated production
usages, including DAMON_RECLAIM and DAMON_LRU_SORT, are not using the buggy
function in that way, so they are not affected.
This patch (of 2):
When new DAMON targets are added via damon_commit_targets(), the newly
created targets are not deallocated when updating the internal data
(damon_commit_target()) fails.  Worse yet, even if the setup is
successfully done, the new target is not linked to the context.  Hence,
the new targets are always leaked regardless of the internal data setup
failure.  Fix the leaks.
Link: https://lkml.kernel.org/r/20241222231222.85060-2-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						bf9aa14fc5 | 
							
							
								
								A rather large update for timekeeping and timers:
							
							
							
							
							
							
							
Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "A rather large update for timekeeping and timers:
   - The final step to get rid of auto-rearming posix-timers
     posix-timers are currently auto-rearmed by the kernel when the
     signal of the timer is ignored so that the timer signal can be
     delivered once the corresponding signal is unignored.
     This requires to throttle the timer to prevent a DoS by small
     intervals and keeps the system pointlessly out of low power states
     for no value. This is a long standing non-trivial problem due to
     the lock order of posix-timer lock and the sighand lock along with
     life time issues as the timer and the sigqueue have different life
     time rules.
     Cure this by:
       - Embedding the sigqueue into the timer struct to have the same
         life time rules. Aside from that, this also avoids the lookup of
         the timer in the signal delivery and rearm path as it's just an
         always valid container_of() now.
       - Queuing ignored timer signals onto a separate ignored list.
       - Moving queued timer signals onto the ignored list when the
         signal is switched to SIG_IGN before it could be delivered.
       - Walking the ignored list when SIG_IGN is lifted and requeue the
         signals to the actual signal lists. This allows the signal
         delivery code to rearm the timer.
     This also required to consolidate the signal delivery rules so they
     are consistent across all situations. With that all self test
     scenarios finally succeed.
   - Core infrastructure for VFS multigrain timestamping
     This is required to allow the kernel to use coarse grained time
     stamps by default and switch to fine grained time stamps when inode
     attributes are actively observed via getattr().
     These changes have been provided to the VFS tree as well, so that
     the VFS specific infrastructure could be built on top.
   - Cleanup and consolidation of the sleep() infrastructure
       - Move all sleep and timeout functions into one file
       - Rework udelay() and ndelay() into proper documented inline
         functions and replace the hardcoded magic numbers by proper
         defines.
       - Rework the fsleep() implementation to take the reality of the
         timer wheel granularity on different HZ values into account.
         Right now the boundaries are hard coded time ranges which fail
         to provide the requested accuracy on different HZ settings.
       - Update documentation for all sleep/timeout related functions
         and fix up stale documentation links all over the place
       - Fixup a few usage sites
   - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
     clocks
     A system can have multiple PTP clocks which are participating in
     separate and independent PTP clock domains. So far the kernel only
     considers the PTP clock which is based on CLOCK TAI relevant as
     that's the clock which drives the timekeeping adjustments via the
     various user space daemons through adjtimex(2).
     The non TAI based clock domains are accessible via the file
     descriptor based posix clocks, but their usability is very limited.
     They can't be accessed fast as they always go all the way out to
     the hardware and they cannot be utilized in the kernel itself.
     As Time Sensitive Networking (TSN) gains traction it is required to
     provide fast user and kernel space access to these clocks.
     The approach taken is to utilize the timekeeping and adjtimex(2)
     infrastructure to provide this access in a similar way how the
     kernel provides access to clock MONOTONIC, REALTIME etc.
     Instead of creating a duplicated infrastructure this rework
     converts timekeeping and adjtimex(2) into generic functionality
     which operates on pointers to data structures instead of using
     static variables.
     This allows to provide time accessors and adjtimex(2) functionality
     for the independent PTP clocks in a subsequent step.
   - Consolidate hrtimer initialization
     hrtimers are set up by initializing the data structure and then
     separately setting the callback function for historical reasons.
     That's an extra unnecessary step and makes Rust support less
     straight forward than it should be.
     Provide a new set of hrtimer_setup*() functions and convert the
     core code and a few usage sites of the less frequently used
     interfaces over.
     The bulk of the hrtimer_init() to hrtimer_setup() conversion is
     already prepared and scheduled for the next merge window.
   - Drivers:
       - Ensure that the global timekeeping clocksource is utilizing the
         cluster 0 timer on MIPS multi-cluster systems.
         Otherwise CPUs on different clusters use their cluster specific
         clocksource which is not guaranteed to be synchronized with
         other clusters.
       - Mostly boring cleanups, fixes, improvements and code movement"
* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
  posix-timers: Fix spurious warning on double enqueue versus do_exit()
  clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
  clocksource/drivers/gpx: Remove redundant casts
  clocksource/drivers/timer-ti-dm: Fix child node refcount handling
  dt-bindings: timer: actions,owl-timer: convert to YAML
  clocksource/drivers/ralink: Add Ralink System Tick Counter driver
  clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
  clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
  clocksource/drivers:sp804: Make user selectable
  clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
  hrtimers: Delete hrtimer_init_on_stack()
  alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
  io_uring: Switch to use hrtimer_setup_on_stack()
  sched/idle: Switch to use hrtimer_setup_on_stack()
  hrtimers: Delete hrtimer_init_sleeper_on_stack()
  wait: Switch to use hrtimer_setup_sleeper_on_stack()
  timers: Switch to use hrtimer_setup_sleeper_on_stack()
  net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
  futex: Switch to use hrtimer_setup_sleeper_on_stack()
  fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
  ...
							
						 | 
						
							||
| 
							 | 
						4401e9d10a | 
							
							
								
								mm/damon/core: avoid overflow in damon_feed_loop_next_input()
							
							
							
							
							
							
							
							damon_feed_loop_next_input() is inefficient and fragile to overflows.
Specifically, the 'score_goal_diff_bp' calculation can overflow when 'score'
is high.  The calculation is actually entirely unnecessary because 'goal' is
a constant of value 10,000.  The calculation of 'compensation' is again
fragile to overflow.  The final calculation of the return value is again
fragile to overflow when the current score is under-achieving the target.
Add handling of two corner cases at the beginning of the function to make the
body easier to read, and rewrite the body of the function to avoid
overflows and the unnecessary bp value calculation.
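
The overflow-avoiding shape described above could look roughly like the following user-space sketch. It is not the kernel's code: the function name, the `MIN_INPUT` floor, and the choice of the two corner cases (goal exactly met, score at least twice the goal) are assumptions made for illustration; only the goal constant of 10,000 comes from the message.

```c
#include <limits.h>

/* Illustrative sketch only; names and corner cases are assumptions. */
#define GOAL_BP		10000UL	/* target score (the constant named in the message) */
#define MIN_INPUT	10000UL	/* hypothetical floor so the input never reaches zero */

unsigned long next_input_sketch(unsigned long last_input, unsigned long score)
{
	unsigned long diff, compensation;

	/* Assumed corner case 1: goal exactly met, keep the input as is. */
	if (score == GOAL_BP)
		return last_input;
	/* Assumed corner case 2: score is at least twice the goal. */
	if (score >= 2 * GOAL_BP)
		return MIN_INPUT;

	/* From here, diff is in [1, GOAL_BP], so divisions below are safe. */
	diff = score > GOAL_BP ? score - GOAL_BP : GOAL_BP - score;

	/* Multiply first only when the product cannot overflow. */
	if (last_input < ULONG_MAX / diff)
		compensation = last_input * diff / GOAL_BP;
	else
		compensation = last_input / GOAL_BP * diff;

	if (score > GOAL_BP) {
		/* Over-achieving: shrink the input, but not below the floor. */
		unsigned long next = last_input - compensation;

		return next > MIN_INPUT ? next : MIN_INPUT;
	}
	/* Under-achieving: grow the input, saturating instead of wrapping. */
	if (last_input < ULONG_MAX - compensation)
		return last_input + compensation;
	return ULONG_MAX;
}
```

With these assumptions, an under-achieving score of 5,000 grows an input of 20,000 to 30,000, while an input already at ULONG_MAX stays saturated rather than wrapping around.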
Link: https://lkml.kernel.org/r/20241031161203.47751-1-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						8e7bde615f | 
							
							
								
								mm/damon/core: handle zero schemes apply interval
							
							
							
							
							
							
							
							DAMON's logic to determine whether it is time to apply DAMOS schemes
assumes next_apply_sis is always set larger than the current
passed_sample_intervals.  It therefore assumes that continuously
incrementing passed_sample_intervals will make it reach next_apply_sis in
the future.  The logic hence applies the scheme and updates next_apply_sis
only if passed_sample_intervals is equal to next_apply_sis.
If the schemes apply interval is set to zero, however, next_apply_sis is set
equal to the current passed_sample_intervals.  And
passed_sample_intervals is incremented before doing the next_apply_sis
check.  Hence, passed_sample_intervals becomes larger than next_apply_sis,
and the logic says it is not the time to apply schemes and update
next_apply_sis.  In other words, DAMON stops applying schemes until
passed_sample_intervals overflows.
Based on the documents and common sense, a reasonable behavior for
such inputs would be applying the schemes for every sampling interval.
Handle the case by removing the assumption.
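
The failure mode can be reproduced with a small user-space model. The sketch below is only an illustration: `count_fires()` and its parameters are hypothetical names, and using a `>=` comparison is one plausible way to remove the assumption, not necessarily the exact change made by the patch.

```c
/*
 * Illustrative model only: count how often a periodic action fires
 * during `ticks` sampling intervals. All names are hypothetical.
 */
unsigned long count_fires(unsigned long interval_sis, unsigned long ticks,
			  int use_relaxed_check)
{
	unsigned long passed = 0;			/* models passed_sample_intervals */
	unsigned long next = passed + interval_sis;	/* models next_apply_sis */
	unsigned long fires = 0;
	unsigned long i;

	for (i = 0; i < ticks; i++) {
		int due;

		passed++;	/* incremented before the check, as described */
		/* Strict equality misses the deadline once passed > next. */
		due = use_relaxed_check ? passed >= next : passed == next;
		if (!due)
			continue;
		fires++;
		next = passed + interval_sis;
	}
	return fires;
}
```

In this model a zero interval never fires under the equality check but fires on every tick under the relaxed check, while a non-zero interval behaves identically under both.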
Link: https://lkml.kernel.org/r/20241031183757.49610-3-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						3488af0970 | 
							
							
								
								mm/damon/core: handle zero {aggregation,ops_update} intervals
							
							
							
							
							
							
							
							Patch series "mm/damon/core: fix handling of zero non-sampling intervals".
DAMON's internal intervals accounting logic does not correctly handle
non-sampling intervals of zero values due to a wrong assumption.  This could
cause unexpected monitoring behavior, and even result in an infinite hang of
DAMON sysfs interface user threads in the case of a zero aggregation interval.
Fix those by updating the intervals accounting logic.  For details of the
root cause and solutions, please refer to the commit messages of the fixes.
This patch (of 2):
DAMON's logic to determine whether it is time to do aggregation and ops
update assumes next_{aggregation,ops_update}_sis are always set larger
than the current passed_sample_intervals.  It therefore further assumes that
continuously incrementing passed_sample_intervals every sampling interval
will make it reach next_{aggregation,ops_update}_sis in the future.
The logic therefore performs the action and updates
next_{aggregation,ops_update}_sis only if passed_sample_intervals is equal
to the respective count.
If the aggregation interval or the ops update interval is zero, however,
next_aggregation_sis or next_ops_update_sis is set equal to the current
passed_sample_intervals, respectively.  And passed_sample_intervals is
incremented before doing the next_{aggregation,ops_update}_sis check.
Hence, passed_sample_intervals becomes larger than
next_{aggregation,ops_update}_sis, and the logic says it is not the time
to do the action and update next_{aggregation,ops_update}_sis forever,
until an overflow happens.  In other words, DAMON stops doing aggregations
or ops updates effectively forever, and users cannot get monitoring
results.
Based on the documents and the common sense, a reasonable behavior for
such inputs is doing an aggregation and an ops update for every sampling
interval.  Handle the case by removing the assumption.
Note that this could cause a real issue for DAMON sysfs interface users in
particular, in the case of a zero aggregation interval.  When a user starts
DAMON with a zero aggregation interval and asks for online DAMON parameter
tuning via the DAMON sysfs interface, the request is handled by the
aggregation callback.
Until the callback finishes the work, the user who requested the online
tuning just waits.  Hence, the user will be stuck until
passed_sample_intervals overflows.
Link: https://lkml.kernel.org/r/20241031183757.49610-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20241031183757.49610-2-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						ef0245582e | 
							
							
								
								mm/damon/core: Use generic upper bound recommendation for usleep_range()
							
							
							
							
							
							
							
							The upper bound for usleep_range_idle() was taken from the outdated documentation. As the recommendation for the upper bound of usleep_range() depends on the HZ configuration, it is not possible to hard code it. Use the define "USLEEP_RANGE_UPPER_BOUND" instead. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-8-dc8b907cb62f@linutronix.de  | 
						
							||
| 
							 | 
						102f085d84 | 
							
							
								
								timers: Rename usleep_idle_range() to usleep_range_idle()
							
							
							
							
							
							
							
							usleep_idle_range() is a variant of usleep_range(). Both are using usleep_range_state() as a base. To be able to find all the related functions in one go, rename usleep_idle_range() to usleep_range_idle(). No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-4-dc8b907cb62f@linutronix.de  | 
						
							||
| 
							 | 
						25e8acbcf1 | 
							
							
								
								mm/damon/tests/core-kunit: skip damon_test_nr_accesses_to_accesses_bp() if aggr_interval is zero
							
							
							
							
							
							
							
							The aggregation interval of the test-purpose damon_attrs for
damon_test_nr_accesses_to_accesses_bp() becomes zero on 32 bit
architectures, since the sizes of the int and long types are the same.  As a
result, the damon_nr_accesses_to_accesses_bp() call with the test data
triggers a divide-by-zero exception.  damon_nr_accesses_to_accesses_bp()
shouldn't be called with such data, and the non-test code avoids that by
checking the case in damon_update_monitoring_results().  Skip the test code
in that case, and add an explicit caution about the case to the comment for
the test target function.
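
The wrap-around can be illustrated in user space with fixed-width types. This is a sketch under assumptions: the message does not show the test data, so the expression `(UINT_MAX + 1) * 10` used here is only an assumed example of a value that fits in a 64-bit long but wraps to zero in a 32-bit one, and `accesses_to_bp()` is a hypothetical stand-in for the conversion, not the kernel function.

```c
#include <stdint.h>

/* Emulates a 32-bit unsigned long: (UINT32_MAX + 1) * 10 wraps to 0. */
uint32_t aggr_interval_32bit(void)
{
	return (uint32_t)((UINT32_MAX + 1u) * 10u);
}

/* The same expression evaluated with a 64-bit unsigned long: no wrap. */
uint64_t aggr_interval_64bit(void)
{
	return ((uint64_t)UINT32_MAX + 1u) * 10u;
}

/*
 * Hypothetical conversion of an access count to basis points of the
 * maximum possible count. Returns -1 instead of dividing by zero.
 */
int accesses_to_bp(unsigned int nr_accesses, uint64_t aggr_interval,
		   uint64_t sample_interval, unsigned int *bp)
{
	uint64_t max_nr_accesses;

	if (!sample_interval)
		return -1;
	max_nr_accesses = aggr_interval / sample_interval;
	if (!max_nr_accesses)
		return -1;	/* the zero-interval case a caller must skip */
	*bp = (unsigned int)((uint64_t)nr_accesses * 10000 / max_nr_accesses);
	return 0;
}
```

The guard in `accesses_to_bp()` mirrors the idea of checking for the zero case before converting, rather than relying on the input always being sane.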
Link: https://lkml.kernel.org/r/20240905162423.74053-1-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						304b95847f | 
							
							
								
								mm/damon/core: replace per-quota regions priority histogram buffer usage with per-context one
							
							
							
							
							
							
							
							Replace the usage of per-quota region priorities histogram buffer with the per-context one. After this change, the per-quota histogram is not used by anyone, and hence it is ready to be removed. Link: https://lkml.kernel.org/r/20240826042323.87025-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						b7315fbb64 | 
							
							
								
								mm/damon/core: introduce per-context region priorities histogram buffer
							
							
							
							
							
							
							
							Patch series "replace per-quota region priorities histogram buffer with
per-context one".
Each DAMOS quota (struct damos_quota) maintains a histogram for total
regions size per its prioritization score.  DAMOS calculates the minimum
prioritization score of regions that are ok to apply the DAMOS action to
while respecting the quota.  The histogram is constructed only for the
calculation of the minimum score in damos_adjust_quota() for each quota,
which is called by kdamond_fn().
Hence, there is no real reason to have a per-quota histogram.  Only a
per-kdamond histogram is needed, since parallel kdamonds could have races
otherwise.  The current implementation is only wasting memory, and can
easily cause unintended stack usage[1].
So, introducing a per-kdamond histogram and replacing the per-quota one
with it would be the right solution for the issue.  However, supporting
multiple DAMON contexts per kdamond is still an ongoing work[2] without a
clear estimated time of arrival.  Meanwhile, a per-context histogram could
be an effective and straightforward solution having no blocker.  Let's fix
the problem that way first.
This patch (of 4):
Introduce a per-context buffer for the region priority scores-total size
histogram.  Like the per-quota one (->histogram of struct damos_quota),
the new buffer is hidden from DAMON API users by being defined as a
private field of the DAMON context structure.  It is dynamically allocated
and de-allocated at the beginning and ending of the execution of the kdamond
by kdamond_fn() itself.
[1] commit 0742cadf5e4c ("mm/damon/lru_sort: adjust local variable to dynamic allocation")
[2] https://lore.kernel.org/20240531122320.909060-1-yorha.op@gmail.com
Link: https://lkml.kernel.org/r/20240826042323.87025-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240826042323.87025-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
							
						 | 
						
							||
| 
							 | 
						9bfbaa5e44 | 
							
							
								
								mm/damon: move kunit tests to tests/ subdirectory with _kunit suffix
							
							
							
							
							
							
							
							There was a discussion about better places for kunit test code[1] and test file name suffix[2]. Following the conclusion, move kunit tests for DAMON to the mm/damon/tests/ subdirectory and rename those. [1] https://lore.kernel.org/CABVgOS=pUdWb6NDHszuwb1HYws4a1-b1UmN=i8U_ED7HbDT0mg@mail.gmail.com [2] https://lore.kernel.org/CABVgOSmKwPq7JEpHfS6sbOwsR0B-DBDk_JP-ZD9s9ZizvpUjbQ@mail.gmail.com Link: https://lkml.kernel.org/r/20240827030336.7930-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						8ef6fd0e9e | 
							
							
								
								Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm: fix
							
							
							
							
							
							
							
							crashes from deferred split racing folio migration", needed by "mm: migrate: split folio_migrate_mapping()".  | 
						
							||
| 
							 | 
						310d6c15e9 | 
							
							
								
								mm/damon/core: merge regions aggressively when max_nr_regions is unmet
							
							
							
							
							
							
							
							DAMON keeps the number of regions under max_nr_regions by skipping region
split operations when doing so can make the number higher than the limit.
It works well for preventing violation of the limit.  But, if somehow the
violation happens, it cannot recover well depending on the situation.  In
detail, if the real number of regions having different access patterns is
higher than the limit, the mechanism cannot reduce the number below the
limit.  In such a case, the system could suffer from high monitoring
overhead of DAMON.
The violation can actually happen.  For example, the user could reduce
max_nr_regions while DAMON is running, to be lower than the current number
of regions.  Fix the problem by repeating the merge operations with
increasing aggressiveness in kdamond_merge_regions() for the case, until
the limit is met.
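
The "repeat with increasing aggressiveness" idea can be sketched in user space as follows. This is a simplified illustration, not kdamond_merge_regions(): the struct, the function names, the doubling of the threshold, and the rule of keeping the left region's access count on merge are all assumptions made to keep the example short.

```c
#include <stddef.h>

struct region_sketch {
	unsigned long start, end;	/* [start, end) address range */
	unsigned int nr_accesses;
};

/*
 * One merge pass: fold a region into its left neighbour when the two are
 * adjacent and their access counts differ by at most `threshold`.
 * Returns the new number of regions.
 */
static size_t merge_pass(struct region_sketch *r, size_t n,
			 unsigned int threshold)
{
	size_t out = 0, i;

	for (i = 0; i < n; i++) {
		if (out > 0) {
			struct region_sketch *prev = &r[out - 1];
			unsigned int diff = prev->nr_accesses > r[i].nr_accesses ?
				prev->nr_accesses - r[i].nr_accesses :
				r[i].nr_accesses - prev->nr_accesses;

			if (prev->end == r[i].start && diff <= threshold) {
				prev->end = r[i].end;	/* simplification: keep left count */
				continue;
			}
		}
		r[out++] = r[i];
	}
	return out;
}

/* Repeat merging with a growing threshold until the limit is met. */
size_t merge_until_limit(struct region_sketch *r, size_t n,
			 size_t max_nr_regions, unsigned int threshold,
			 unsigned int max_threshold)
{
	for (;;) {
		n = merge_pass(r, n, threshold);
		if (n <= max_nr_regions || threshold >= max_threshold)
			return n;
		/* Grow the threshold without overflowing past the cap. */
		if (threshold == 0)
			threshold = 1;
		else if (threshold > max_threshold / 2)
			threshold = max_threshold;
		else
			threshold *= 2;
	}
}
```

The cap on the threshold guarantees termination even when the regions cannot be merged down to the limit, for example because they are not adjacent.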
[sj@kernel.org: increase regions merge aggressiveness while respecting min_nr_regions]
  Link: https://lkml.kernel.org/r/20240626164753.46270-1-sj@kernel.org
[sj@kernel.org: ensure max threshold attempt for max_nr_regions violation]
  Link: https://lkml.kernel.org/r/20240627163153.75969-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240624175814.89611-1-sj@kernel.org
Fixes: 
							
						 | 
						
							||
| 
							 | 
						9cb3d0b9df | 
							
							
								
								mm/damon/core: implement DAMON context commit function
							
							
							
							
							
							
							
							Implement functions for supporting online DAMON context level parameters update. The function receives two DAMON context structs. One is the struct that is currently being used by a kdamond and therefore to be updated. The other one contains the parameters to be applied to the first one. The function applies the new parameters to the destination struct while keeping/updating the internal status and operation results. The function should be called from a DAMON context-update-safe place, like DAMON callbacks. Link: https://lkml.kernel.org/r/20240618181809.82078-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  | 
						
							||
| 
							 | 
						3ad1dce6c3 | 
							
							
								
								mm/damon/core: implement DAMOS quota goals online commit function
							
							
							
							
							
							
							
							Patch series "mm/damon: introduce DAMON parameters online commit function". DAMON context struct (damon_ctx) contains user requests (parameters), internal status, and operation results. For flexible usages, DAMON API users are encouraged to manually manipulate the struct. That works well for simple use cases. However, it has turned out that it is not that simple, at least for online parameters update. It is easy to forget to properly maintain internal status and operation results. Also, such manual manipulation for online tuning is implemented multiple times in DAMON API users including DAMON sysfs interface, DAMON_RECLAIM and DAMON_LRU_SORT. As a result, we have multiple sources of bugs for the same problem. Actually we found and fixed a few bugs from online parameter updating of DAMON API users. Implement a function for online DAMON parameters update in the core layer, and replace DAMON API users' manual manipulation code for the use case. The core layer function could still have bugs, but this change reduces the source of bugs for the problem to one place. This patch (of 12): Implement functions for supporting online DAMOS quota goals parameters update. The function receives two DAMOS quota structs. One is the struct that is currently being used by a kdamond and therefore to be updated. The other one contains the parameters to be applied to the first one. The function applies the new parameters to the destination struct while keeping/updating the internal status. The function should be called from a parameters-update-safe place, like DAMON callbacks. Link: https://lkml.kernel.org/r/20240618181809.82078-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240618181809.82078-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  |