PELT stands for Per-Entity Load Tracking. As the name suggests, it is load tracking based on the scheduling entity (se): PELT reports the system's current load on a per-entity basis. Concretely, it converts historical running time into a prediction of current load via exponential decay, which feeds load balancing, CPU frequency scaling, and energy management.
In the earlier article <调度(10)-CFS调度reweight> (Scheduling (10): CFS reweight) we saw that calc_group_shares relies on PELT values in its computation. From this it is clear what would be lost without PELT: a traditional windowed average can only describe the mean load over some past interval, whereas PELT tracks the current per-entity load continuously.
PELT was merged into Linux 3.8 in 2013, originally authored by Paul Turner, with substantial later rework by Vincent Guittot and others; it fundamentally changed the load awareness of CFS.
PELT is designed around the idea of estimating the present load from past behavior. It divides wall-clock time into periods of 1024µs (~1ms). Within each period, the new load an se contributes is proportional to the fraction of that period the se was runnable on the rq; on top of that, the load the entity generated in earlier periods is carried forward with exponential decay and folded into every new computation. As a formula:
L_t = l_t + y*l_{t-1} + y^2*l_{t-2} + ... + y^t*l_0
Here l_x is the load recorded in period x, L_t is the total load computed at time t, and y is the decay factor, whose value is fixed by
y^32 = 0.5
which gives y = 0.5^(1/32) ≈ 0.97857.
Now write the same sum at time t-1:

L_{t-1} = l_{t-1} + y*l_{t-2} + y^2*l_{t-3} + ... + y^{t-1}*l_0
Multiplying L_{t-1} by y gives

y*L_{t-1} = y*l_{t-1} + y^2*l_{t-2} + y^3*l_{t-3} + ... + y^t*l_0
Comparing this with the expression for L_t at the top yields the recurrence

L_t = l_t + y * L_{t-1}
The kernel documents this formula in a comment, quoted below.
```c
/*
 * We can represent the historical contribution to runnable average as the
 * coefficients of a geometric series.  To do this we sub-divide our runnable
 * history into segments of approximately 1ms (1024us); label the segment that
 * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
 *
 * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
 *       p0            p1           p2
 *      (now)       (~1ms ago)  (~2ms ago)
 *
 * Let u_i denote the fraction of p_i that the entity was runnable.
 *
 * We then designate the fractions u_i as our co-efficients, yielding the
 * following representation of historical load:
 *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
 *
 * We choose y based on the with of a reasonably scheduling period, fixing:
 *   y^32 = 0.5
 *
 * This means that the contribution to load ~32ms ago (u_32) will be weighted
 * approximately half as much as the contribution to load within the last ms
 * (u_0).
 *
 * When a period "rolls over" and we have new u_0`, multiplying the previous
 * sum again by y is sufficient to update:
 *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
 *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
 */
```
```c
/*
 * The load/runnable/util_avg accumulates an infinite geometric series
 * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
 *
 * [load_avg definition]
 *
 *   load_avg = runnable% * scale_load_down(load)
 *
 * [runnable_avg definition]
 *
 *   runnable_avg = runnable% * SCHED_CAPACITY_SCALE
 *
 * [util_avg definition]
 *
 *   util_avg = running% * SCHED_CAPACITY_SCALE
 *
 * where runnable% is the time ratio that a sched_entity is runnable and
 * running% the time ratio that a sched_entity is running.
 *
 * For cfs_rq, they are the aggregated values of all runnable and blocked
 * sched_entities.
 *
 * The load/runnable/util_avg doesn't directly factor frequency scaling and CPU
 * capacity scaling. The scaling is done through the rq_clock_pelt that is used
 * for computing those signals (see update_rq_clock_pelt())
 *
 * N.B., the above ratios (runnable% and running%) themselves are in the
 * range of [0, 1]. To do fixed point arithmetics, we therefore scale them
 * to as large a range as necessary. This is for example reflected by
 * util_avg's SCHED_CAPACITY_SCALE.
 *
 * [Overflow issue]
 *
 * The 64-bit load_sum can have 4353082796 (=2^64/47742/88761) entities
 * with the highest load (=88761), always runnable on a single cfs_rq,
 * and should not overflow as the number already hits PID_MAX_LIMIT.
 *
 * For all other cases (including 32-bit kernels), struct load_weight's
 * weight will overflow first before we do, because:
 *
 *   Max(load_avg) <= Max(load.weight)
 *
 * Then it is the load_weight's responsibility to consider overflow
 * issues.
 */
struct sched_avg {
	u64			last_update_time;
	u64			load_sum;
	u64			runnable_sum;
	u32			util_sum;
	u32			period_contrib;
	unsigned long		load_avg;
	unsigned long		runnable_avg;
	unsigned long		util_avg;
	struct util_est		util_est;
} ____cacheline_aligned;
```
sched_avg records the load information accumulated per period (1024µs). The fields: last_update_time is the timestamp of the last update (used to compute the elapsed delta); load_sum/runnable_sum/util_sum are the decayed geometric-series accumulations; load_avg/runnable_avg/util_avg are the averages derived from them; period_contrib holds the part of the current, not-yet-complete period that has already been accumulated; util_est caches an estimated utilization.
```c
static __always_inline int
___update_load_sum(u64 now, struct sched_avg *sa,
		   unsigned long load, unsigned long runnable, int running)
{
	u64 delta;

	delta = now - sa->last_update_time;
	/*
	 * This should only happen when time goes backwards, which it
	 * unfortunately does during sched clock init when we swap over to TSC.
	 */
	if ((s64)delta < 0) {
		sa->last_update_time = now;
		return 0;
	}

	/*
	 * Use 1024ns as the unit of measurement since it's a reasonable
	 * approximation of 1us and fast to compute.
	 */
	delta >>= 10;
	if (!delta)
		return 0;

	sa->last_update_time += delta << 10;

	/*
	 * running is a subset of runnable (weight) so running can't be set if
	 * runnable is clear. But there are some corner cases where the current
	 * se has been already dequeued but cfs_rq->curr still points to it.
	 * This means that weight will be 0 but not running for a sched_entity
	 * but also for a cfs_rq if the latter becomes idle. As an example,
	 * this happens during idle_balance() which calls
	 * update_blocked_averages().
	 *
	 * Also see the comment in accumulate_sum().
	 */
	if (!load)
		runnable = running = 0;

	/*
	 * Now we know we crossed measurement unit boundaries. The *_avg
	 * accrues by two steps:
	 *
	 * Step 1: accumulate *_sum since last_update_time. If we haven't
	 * crossed period boundaries, finish.
	 */
	if (!accumulate_sum(delta, sa, load, runnable, running))
		return 0;

	return 1;
}
```
___update_load_sum updates the *_sum members of sched_avg; the input arguments load/runnable/running correspond to load_sum/runnable_sum/util_sum respectively. It converts the nanosecond delta into ~1µs units (delta >>= 10) and then delegates the accumulation to accumulate_sum, so that function is what we need to look at next.
```c
/*
 * Accumulate the three separate parts of the sum; d1 the remainder
 * of the last (incomplete) period, d2 the span of full periods and d3
 * the remainder of the (incomplete) current period.
 *
 *           d1          d2           d3
 *           ^           ^            ^
 *           |           |            |
 *         |<->|<----------------->|<--->|
 * ... |---x---|------| ... |------|-----x (now)
 *
 *                           p-1
 * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
 *                           n=1
 *
 *    = u y^p +					(Step 1)
 *
 *                     p-1
 *      d1 y^p + 1024 \Sum y^n + d3 y^0		(Step 2)
 *                     n=1
 */
static __always_inline u32
accumulate_sum(u64 delta, struct sched_avg *sa,
	       unsigned long load, unsigned long runnable, int running)
{
	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
	u64 periods;

	delta += sa->period_contrib;
	periods = delta / 1024; /* A period is 1024us (~1ms) */

	/*
	 * Step 1: decay old *_sum if we crossed period boundaries.
	 */
	if (periods) {
		sa->load_sum = decay_load(sa->load_sum, periods);
		sa->runnable_sum =
			decay_load(sa->runnable_sum, periods);
		sa->util_sum = decay_load((u64)(sa->util_sum), periods);

		/*
		 * Step 2
		 */
		delta %= 1024;
		if (load) {
			/*
			 * This relies on the:
			 *
			 *   if (!load)
			 *	runnable = running = 0;
			 *
			 * clause from ___update_load_sum(); this results in
			 * the below usage of @contrib to dissapear entirely,
			 * so no point in calculating it.
			 */
			contrib = __accumulate_pelt_segments(periods,
					1024 - sa->period_contrib, delta);
		}
	}
	sa->period_contrib = delta;

	if (load)
		sa->load_sum += load * contrib;
	if (runnable)
		sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
	if (running)
		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

	return periods;
}
```
accumulate_sum first calls decay_load to apply the u·y^p decay to each of the three sums (load_sum, runnable_sum and util_sum), then calls __accumulate_pelt_segments to compute the new contribution d1·y^p + 1024·Σ y^n + d3·y^0; adding this to the decayed sum yields the updated load.
Suppose an se had an accumulated load u at the previous update, and delta time has passed since; how is the decay over this delta computed? The formula we derived earlier simply multiplies the old load by y once per period, but the actual code splits the delta into three segments.
The kernel comment illustrates the split with the following diagram:

```
          d1          d2           d3
          ^           ^            ^
          |           |            |
        |<->|<----------------->|<--->|
... |---x---|------| ... |------|-----x (now)
```
To work out how many periods p this delta spans, first add back the part of the old period that had already elapsed before the delta started, i.e. period_contrib = 1024 - d1, and then divide by the period length:

p = (1024 - d1 + (d1 + d2 + d3)) / 1024

Since d3 is shorter than a full period, there is no need to subtract it before dividing: integer division truncates it away, and the remainder becomes the new period_contrib.
Substituting this p into the formula from the comment:
u' = (u + d1) × y^p + 1024 × y^(p-1) + 1024 × y^(p-2) + ... + 1024 × y^1 + d3 × y^0
Expanding the first parenthesis gives
u' = u × y^p + d1 × y^p + 1024 × Σ(y^n, n=1 to p-1) + d3 × y^0
Back in the code, this maps directly onto the two steps: decay_load performs the u·y^p decay (Step 1), and __accumulate_pelt_segments computes the d1/d2/d3 contributions of the new window (Step 2).
This article has given a high-level view of PELT's sched_avg structure and how the *_sum values are accumulated, and briefly walked through the PELT formula; the detailed computations inside decay_load and __accumulate_pelt_segments are left for a follow-up article.
Overall, PELT is at the core of Linux load balancing and an indispensable source of statistics in CFS; understanding PELT in depth is essential to grasping the essence of CFS scheduling.