PELT stands for Per-Entity Load Tracking. As the name suggests, it is load tracking based on the scheduling entity (se): PELT reports the system's current load on a per-entity basis. Concretely, it converts historical running time into a prediction of current load via exponential decay, which feeds load balancing, CPU frequency scaling, and energy management.
In the earlier article <调度(10)-CFS调度reweight> (Scheduling (10): CFS reweight) we saw that calc_group_shares relies on PELT values in its computation. From this it is clear what would be lost without PELT: a traditional windowed average can only describe the mean load over some past interval, whereas PELT tracks the current per-entity load continuously.
PELT was merged into Linux 3.8 in 2013, originally authored by Paul Turner, with substantial later rework by Vincent Guittot and others; it fundamentally changed the load awareness of CFS.
PELT is designed around the idea of estimating the present load from past behavior. It divides wall-clock time into periods of 1024µs (~1ms). Within each period, the new load an se contributes is proportional to the fraction of that period the se was runnable on the rq; on top of that, the load the entity generated in earlier periods is carried forward with exponential decay and folded into every new computation. As a formula:
L_t = l_t + y*l_{t-1} + y^2*l_{t-2} + ... + y^t*l_0
Here l_x is the load recorded in period x, L_t is the total load computed at time t, and y is the decay factor, whose value is fixed by
y^32 = 0.5
which gives y = 0.5^(1/32) ≈ 0.97857.
Now write the same sum at time t-1:

L_{t-1} = l_{t-1} + y*l_{t-2} + y^2*l_{t-3} + ... + y^{t-1}*l_0
Multiplying L_{t-1} by y gives

y*L_{t-1} = y*l_{t-1} + y^2*l_{t-2} + y^3*l_{t-3} + ... + y^t*l_0
Comparing this with the expression for L_t at the top yields the recurrence

L_t = l_t + y * L_{t-1}
The kernel documents this formula in a comment, quoted below.
```c
/*
 * We can represent the historical contribution to runnable average as the
 * coefficients of a geometric series.  To do this we sub-divide our runnable
 * history into segments of approximately 1ms (1024us); label the segment that
 * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
 *
 * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
 *       p0            p1           p2
 *      (now)       (~1ms ago)  (~2ms ago)
 *
 * Let u_i denote the fraction of p_i that the entity was runnable.
 *
 * We then designate the fractions u_i as our co-efficients, yielding the
 * following representation of historical load:
 *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
 *
 * We choose y based on the with of a reasonably scheduling period, fixing:
 *   y^32 = 0.5
 *
 * This means that the contribution to load ~32ms ago (u_32) will be weighted
 * approximately half as much as the contribution to load within the last ms
 * (u_0).
 *
 * When a period "rolls over" and we have new u_0`, multiplying the previous
 * sum again by y is sufficient to update:
 *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
 *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
 */
```
```c
/*
 * The load/runnable/util_avg accumulates an infinite geometric series
 * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
 *
 * [load_avg definition]
 *
 *   load_avg = runnable% * scale_load_down(load)
 *
 * [runnable_avg definition]
 *
 *   runnable_avg = runnable% * SCHED_CAPACITY_SCALE
 *
 * [util_avg definition]
 *
 *   util_avg = running% * SCHED_CAPACITY_SCALE
 *
 * where runnable% is the time ratio that a sched_entity is runnable and
 * running% the time ratio that a sched_entity is running.
 *
 * For cfs_rq, they are the aggregated values of all runnable and blocked
 * sched_entities.
 *
 * The load/runnable/util_avg doesn't directly factor frequency scaling and CPU
 * capacity scaling. The scaling is done through the rq_clock_pelt that is used
 * for computing those signals (see update_rq_clock_pelt())
 *
 * N.B., the above ratios (runnable% and running%) themselves are in the
 * range of [0, 1]. To do fixed point arithmetics, we therefore scale them
 * to as large a range as necessary. This is for example reflected by
 * util_avg's SCHED_CAPACITY_SCALE.
 *
 * [Overflow issue]
 *
 * The 64-bit load_sum can have 4353082796 (=2^64/47742/88761) entities
 * with the highest load (=88761), always runnable on a single cfs_rq,
 * and should not overflow as the number already hits PID_MAX_LIMIT.
 *
 * For all other cases (including 32-bit kernels), struct load_weight's
 * weight will overflow first before we do, because:
 *
 *   Max(load_avg) <= Max(load.weight)
 *
 * Then it is the load_weight's responsibility to consider overflow
 * issues.
 */
struct sched_avg {
	u64			last_update_time;
	u64			load_sum;
	u64			runnable_sum;
	u32			util_sum;
	u32			period_contrib;
	unsigned long		load_avg;
	unsigned long		runnable_avg;
	unsigned long		util_avg;
	struct util_est		util_est;
} ____cacheline_aligned;
```
sched_avg records the load information accumulated per period (1024µs). The fields: last_update_time is the timestamp of the last update (used to compute the elapsed delta); load_sum/runnable_sum/util_sum are the decayed geometric-series accumulations; load_avg/runnable_avg/util_avg are the averages derived from them; period_contrib holds the part of the current, not-yet-complete period that has already been accumulated; util_est caches an estimated utilization.
```c
static __always_inline int
___update_load_sum(u64 now, struct sched_avg *sa,
		   unsigned long load, unsigned long runnable, int running)
{
	u64 delta;

	delta = now - sa->last_update_time;
	/*
	 * This should only happen when time goes backwards, which it
	 * unfortunately does during sched clock init when we swap over to TSC.
	 */
	if ((s64)delta < 0) {
		sa->last_update_time = now;
		return 0;
	}

	/*
	 * Use 1024ns as the unit of measurement since it's a reasonable
	 * approximation of 1us and fast to compute.
	 */
	delta >>= 10;
	if (!delta)
		return 0;

	sa->last_update_time += delta << 10;

	/*
	 * running is a subset of runnable (weight) so running can't be set if
	 * runnable is clear. But there are some corner cases where the current
	 * se has been already dequeued but cfs_rq->curr still points to it.
	 * This means that weight will be 0 but not running for a sched_entity
	 * but also for a cfs_rq if the latter becomes idle. As an example,
	 * this happens during idle_balance() which calls
	 * update_blocked_averages().
	 *
	 * Also see the comment in accumulate_sum().
	 */
	if (!load)
		runnable = running = 0;

	/*
	 * Now we know we crossed measurement unit boundaries. The *_avg
	 * accrues by two steps:
	 *
	 * Step 1: accumulate *_sum since last_update_time. If we haven't
	 * crossed period boundaries, finish.
	 */
	if (!accumulate_sum(delta, sa, load, runnable, running))
		return 0;

	return 1;
}
```
___update_load_sum updates the *_sum members of sched_avg; the input arguments load/runnable/running correspond to load_sum/runnable_sum/util_sum respectively. It converts the nanosecond delta into ~1µs units (delta >>= 10) and then delegates the accumulation to accumulate_sum, so that function is what we need to look at next.
```c
/*
 * Accumulate the three separate parts of the sum; d1 the remainder
 * of the last (incomplete) period, d2 the span of full periods and d3
 * the remainder of the (incomplete) current period.
 *
 *           d1          d2           d3
 *           ^           ^            ^
 *           |           |            |
 *         |<->|<----------------->|<--->|
 * ... |---x---|------| ... |------|-----x (now)
 *
 *                           p-1
 * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
 *                           n=1
 *
 *    = u y^p +					(Step 1)
 *
 *                     p-1
 *      d1 y^p + 1024 \Sum y^n + d3 y^0		(Step 2)
 *                     n=1
 */
static __always_inline u32
accumulate_sum(u64 delta, struct sched_avg *sa,
	       unsigned long load, unsigned long runnable, int running)
{
	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
	u64 periods;

	delta += sa->period_contrib;
	periods = delta / 1024; /* A period is 1024us (~1ms) */

	/*
	 * Step 1: decay old *_sum if we crossed period boundaries.
	 */
	if (periods) {
		sa->load_sum = decay_load(sa->load_sum, periods);
		sa->runnable_sum =
			decay_load(sa->runnable_sum, periods);
		sa->util_sum = decay_load((u64)(sa->util_sum), periods);

		/*
		 * Step 2
		 */
		delta %= 1024;
		if (load) {
			/*
			 * This relies on the:
			 *
			 *   if (!load)
			 *	runnable = running = 0;
			 *
			 * clause from ___update_load_sum(); this results in
			 * the below usage of @contrib to dissapear entirely,
			 * so no point in calculating it.
			 */
			contrib = __accumulate_pelt_segments(periods,
					1024 - sa->period_contrib, delta);
		}
	}
	sa->period_contrib = delta;

	if (load)
		sa->load_sum += load * contrib;
	if (runnable)
		sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
	if (running)
		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

	return periods;
}
```
accumulate_sum first calls decay_load to apply the u·y^p decay to each of the three sums (load_sum, runnable_sum and util_sum), then calls __accumulate_pelt_segments to compute the new contribution d1·y^p + 1024·Σ y^n + d3·y^0; adding this to the decayed sum yields the updated load.
Suppose an se had an accumulated load u at the previous update, and delta time has passed since; how is the decay over this delta computed? The formula we derived earlier simply multiplies the old load by y once per period, but the actual code splits the delta into three segments.
The kernel comment illustrates the split with the following diagram:

```
          d1          d2           d3
          ^           ^            ^
          |           |            |
        |<->|<----------------->|<--->|
... |---x---|------| ... |------|-----x (now)
```
To work out how many periods p this delta spans, first add back the part of the old period that had already elapsed before the delta started, i.e. period_contrib = 1024 - d1, and then divide by the period length:

p = (1024 - d1 + (d1 + d2 + d3)) / 1024

Since d3 is shorter than a full period, there is no need to subtract it before dividing: integer division truncates it away, and the remainder becomes the new period_contrib.
Substituting this p into the formula from the comment:
u' = (u + d1) × y^p + 1024 × y^(p-1) + 1024 × y^(p-2) + ... + 1024 × y^1 + d3 × y^0
Expanding the first parenthesis gives
u' = u × y^p + d1 × y^p + 1024 × Σ(y^n, n=1 to p-1) + d3 × y^0
Back in the code, this maps directly onto the two steps: decay_load performs the u·y^p decay (Step 1), and __accumulate_pelt_segments computes the d1/d2/d3 contributions of the new window (Step 2).
This article has given a high-level view of PELT's sched_avg structure and how the *_sum values are accumulated, and briefly walked through the PELT formula; the detailed computations inside decay_load and __accumulate_pelt_segments are left for a follow-up article.
Overall, PELT is at the core of Linux load balancing and an indispensable source of statistics in CFS; understanding PELT in depth is essential to grasping the essence of CFS scheduling.