PELT - Per-Entity Load Tracking

Posted Nov 20, 2025

12 min read

简介

负载计算公式

\[u = u_0 + u_1 * y + u_2 * y^2 + u_3 * y_3 + ... \tag{1} \\ y ^ {32} = 0.5\]

也就是说, 当前的负载会随着时间衰减, 经过 32 个时间单位会衰减为之前的$\frac{1}{2}$

该算法会对一段时间内的负载做估计，以更好的预测该se未来的负载以便实现per-se负载均衡。

假如 $u’$ 为 $u$的下个周期(1ms), 则:

\[\begin{align*} u' &= u_0' + y * u \\ &= u_0' + y(u_0 + u_1 * y + u_2 * y^2 + ...) \\ \end{align*}\]

当将 $u_i$ 当作 $u_{i+1}$ 时, 可得公式(1). 这种计算方式非常方便。

上实战.

内核中计算更新负载

我们以 load_sum 为例, accumulate_sum函数用来更新负载, 大致流程为:

因为我们不可能在每个1024us间隔处更新负载, 所以我们可能得到下面的图:

   last_update                     now(current_update)
        |                           |
... |---x---|------| ... |------|---x---| (now)

所以我们得到的delta为current_update - last_update, 可能跨越多个1024us,

另外last_update和current_update 可能没有位于间隔的边界处。该函数计算将delta 划分为三个subdelta: d1, d2, d3

          d1          d2           d3
          ^           ^            ^
          |           |            |
        |<->|<----------------->|<--->|
... |---x---|------| ... |------|-----x (now)

在now处平均负载 $u’$为, 我们假设 $d_2 = (p - 1) * 1024$

得

\[u' = (u + d_1) * y ^ p + 1024 \sum _{n=1} ^{p-1} y ^ n + d_3 * y^0 \\\]

我们可以根据 last_update 为分割点，将负载计算分为两部分:

$[-\infty, LastUpdate)$
$[LastUpdate, CurrentoUpdate)$

即

$u * y^p$
$d_1 * y^p + 1024 \sum _{n=1} ^{p-1} y ^ n + d_3 * y^0$

为什么要这么计算呢?

因为如果se 在本次负载计算时，处于idle状态，也就就意味这 $d1, d2, d3$ 贡献的平均负载为0，不需要计算这两部分的负载，而只需要衰减过去的负载。

在什么时候会这样调用呢? – update_blocked_averages()这里我们先略过这部分:

来看下 accumulate_sum实现

accumulate_sum

函数参数:

Pname	Type	Desc
delta	u64	距离上次更新负载的时间间隔
sa	sched_avg	se的记录平均负载的数据结构
load	unsigned long	表示任务正在运行??
runnable	unsigned long	表示该任务可运行, 在队列中正在运行或者等待运行
running	int	表示该任务正在运行

具体代码:

  
static __always_inline u32
accumulate_sum(u64 delta, struct sched_avg *sa,
           unsigned long load, unsigned long runnable, int running)
{
    u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
    u64 periods;

    //上次更新的d3， 如(3) 所示
    delta += sa->period_contrib;
    periods = delta / 1024; /* A period is 1024us (~1ms) */

    /*
     * Step 1: decay old *_sum if we crossed period boundaries.
     */

    /* 
     * periods如果!=0, 说明本轮跨越了一个period, 需要对之前的负载，以及
     * d1 做衰减，否则不用衰减，直接当作d3 处理。(d3 的衰减因子为 p ^ 0 = 1)
     *
     * 另外，period的值恰好为p，即 d1/早起负载的衰减因子, 所以 decay_load()
     * 函数的作用就是 sa->load_sum * y ^ period, 见 (2)
     */
    if (periods) {
        sa->load_sum = decay_load(sa->load_sum, periods);
        sa->runnable_sum =
            decay_load(sa->runnable_sum, periods);
        sa->util_sum = decay_load((u64)(sa->util_sum), periods);

        /*
         * Step 2
         */
        //表示d3的值
        delta %= 1024;
        //load表示在这个delta
        if (load) {
            /*
             * This relies on the:
             *
             * if (!load)
             *  runnable = running = 0;
             *
             * clause from ___update_load_sum(); this results in
             * the below usage of @contrib to disappear entirely,
             * so no point in calculating it.
             */
            contrib = __accumulate_pelt_segments(periods,
                    1024 - sa->period_contrib, delta);
        }
    }
    sa->period_contrib = delta;

    if (load)
        sa->load_sum += load * contrib;
    if (runnable)
        sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
    if (running)
        sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

    return periods;
}

sa->period_contrib 为上次更新 avg_load时的 $d_3$，如下图所示

          d1          d2           d3
          ^           ^            ^
          |           |            |
        |<->|<----------------->|<--->|
... |---x---|------| ... |------|-----x (now)
    |<->|
      |
      .
 last update
    d3

delay_load 代码

delay load 代码及解释展开

  
/*
 * Approximate:
 *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
 */
static u64 decay_load(u64 val, u64 n)
{
    unsigned int local_n;

    // 这里表示 n 已经非常大了，所以得到的 y ^ n 已经非常接近0
    // 直接 ~= 0
    if (unlikely(n > LOAD_AVG_PERIOD * 63))
        return 0;
   
    /* after bounds checking we can collapse to 32-bit */
    local_n = n;
   
    /*
     * As y^PERIOD = 1/2, we can combine
     *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
     * With a look-up table which covers y^n (n<PERIOD)
     * 
     * 详细见(a)
     *
     * To achieve constant time decay_load.
     */
    // 如果local_n >  LOAD_AVG_PERIOD(32), 说明 n/PERIOD > 1
    if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
        // n/PERIOD = local_n / LOAD_AVG_PERIOD
        // val >> X = (1/2) ^ X
        val >>= local_n / LOAD_AVG_PERIOD;

        //计算 n % PERIOD
        local_n %= LOAD_AVG_PERIOD;
    }

    //这里搞了一个静态数组, 用来记录 y ^ n >> 32 (n < 32) 的值
    //详细见(b)
    val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
    return val;
}

首先将n 进行拆解:
\[q = n / PERIOD \\ r = n \% PERIOD \\ n = q * PERIOD + r\]
$y ^ n$ 为:
\[\begin{align*} y ^ n &= y ^ {q * PERIOD + r} \\ &= {(y ^ {PERIOD})} ^ q * y ^ r \\ &= {(\frac{1}{2})} ^ q * y ^ r \\ &= {(\frac{1}{2})} ^ {(n/PERIOD)} * y ^{(n \% PERIOD)} \end{align*}\]
val 为 ${(y ^ {PERIOD})} ^ q$ 的值, 需要通过此处乘 $y ^ {local_n}$ 而 $y ^ {local_n} * 2 ^ {32}$ 保存在 runnable_avg_yN_inv[] 数组中
我们来具体看下是不是这么计算的
该数组由 Documentation/scheduler/sched-pelt.c 程序生成，里面的核心代码为:
1 2 y = pow(0.5, 1/(double)HALFLIFE); // HALFLIFE 为 32 x = ((1UL<<32)-1)*pow(y, i); //i 为 n
第一步是计算 $y$ 的值, 第二步 x 计算为 y ^ n * 1 << 32, 我们主要来看下y的计算: $y ^ {32} = \frac{1}{2} \\ y = {(\frac{1}{2})} ^ {\frac{1}{32}} \\$

__accumulate_pelt_segments

`__accumulate_pelt_segments` 代码展开

  
static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
{
    u32 c1, c2, c3 = d3; /* y^0 == 1 */

    /*
     * c1 = d1 y^p
     */
    c1 = decay_load((u64)d1, periods);

    /*
     *            p-1
     * c2 = 1024 \Sum y^n
     *            n=1
     *
     *              inf        inf
     *    = 1024 ( \Sum y^n - \Sum y^n - y^0 )
     *              n=0        n=p
     */
    c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;

    return c1 + c2 + c3;
}

$d_1$, $d_3$ 的处理没有啥说的, 主要来看下 $d_2$

公式中c2首先做了下转换:

\[\begin{align*} c_2 &= 1024 \sum _{n=1}^{p-1} y ^ n \\ &= 1024 (\sum_{n=0}^{p-1} y^n - y^0) \\ &= 1024 (\sum_{n=0}^{\infty} y ^ n - \sum_{n=p}^{\infty} y ^ n - y ^0) \end{align*}\]

首先关于 LOAD_AVG_MAX, 同样是Documentation/scheduler/sched-pelt.c 程序生成，可以理解为一个应用程序连续运行无限时间后的load sum. 不再展开

其次 $\sum_{n=p}^{\infty} y ^ n$ 可以理解为 $\sum_{n=0}^{\infty} y ^ n$ 经过了 p 个周期后衰减的值。

所以, decay_load(LOAD_AVG_MAX, periods) 来计算LOAD_AVG_MAX 经过period周期衰减后的值.

其他

`__update_load_sum` 注释

/*
 * We can represent the historical contribution to runnable average as the
 * coefficients of a geometric series.  To do this we sub-divide our runnable
 * history into segments of approximately 1ms (1024us); label the segment that
 * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.

 我们可以将历史对可运行平均值的贡献表示为一个几何级数的系数。为此，我们将可运
 行历史划分为大约1 毫秒（1024 微秒）左右的若干片段；将发生在 N 毫秒前的片段标记
 为 p_N，其中p_0 对应当前时间段，例如：

 *
 * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
 *      p0            p1           p2
 *     (now)       (~1ms ago)  (~2ms ago)
 *
 * Let u_i denote the fraction of p_i that the entity was runnable.

   让 (u_i) 表示实体在片段 (p_i) 中可运行的比例。

 * We then designate the fractions u_i as our co-efficients, yielding the
 * following representation of historical load:

   然后我们将这些比例 (u_i) 作为我们的系数，从而得到如下对历史负载的表示：

 *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
 * We choose y based on the with of a reasonably scheduling period, fixing:

   我们根据一个合理的调度周期的宽度来选择 (y)，并确定如下：

 *   y^32 = 0.5
 *
 * This means that the contribution to load ~32ms ago (u_32) will be weighted
 * approximately half as much as the contribution to load within the last ms
 * (u_0).

   这意味着大约 32 毫秒前的负载贡献（u_32）的权重将大约是最近 1 毫秒内负载贡献
  （u_0）的一半。

 *
 * When a period "rolls over" and we have new u_0`, multiplying the previous
 * sum again by y is sufficient to update:

	 当一个周期“轮换”并且我们有了新的 (u_0’) 时，再将之前的总和乘以 (y)
   就足以完成更新：   

 *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
 *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
 */

参考链接

LWN: Per-entity load tracking

schedule, pelt

sched

This post is licensed under CC BY 4.0 by the author.