本文介绍一下CFS的另一个关键行为,选中下一个任务运行,当任务schedule的时候,会主动调用pick_next_task函数选出下一个需要调度的进程的任务
__schedule pick_next_task __pick_next_task_fair pick_next_task_fair pick_next_entity set_next_entity context_switch
函数原型如下
static inline struct task_struct * pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { const struct sched_class *class; struct task_struct *p; /* * Optimization: we know that if all tasks are in the fair class we can * call that function directly, but only if the @prev task wasn't of a * higher scheduling class, because otherwise those loose the * opportunity to pull in more work from other CPUs. */ if (likely(prev->sched_class <= &fair_sched_class && rq->nr_running == rq->cfs.h_nr_running)) { p = pick_next_task_fair(rq, prev, rf); if (unlikely(p == RETRY_TASK)) goto restart; /* Assumes fair_sched_class->next == idle_sched_class */ if (!p) { put_prev_task(rq, prev); p = pick_next_task_idle(rq); } return p; } restart: put_prev_task_balance(rq, prev, rf); for_each_class(class) { p = class->pick_next_task(rq); if (p) return p; } /* The idle class should always have a runnable task: */ BUG(); }
根据代码,这里实际上有两个路径.区分代码为
if (likely(prev->sched_class <= &fair_sched_class && rq->nr_running == rq->cfs.h_nr_running))
也就是上一个任务是cfs调度,且所有任务都在cfs调度类中
直接调用pick_next_task_fair,如果没有任务了,运行idle任务.如果有任务,则将上一个任务put_prev_task放回本地队列
因为有rt相关的实时任务,所以先put_prev_task_balance将其负载均衡检查,让rt任务拥有更高优先级权限,降低此任务因为负载均衡导致put_prev_task的时候响应时间增加.
然后遍历所有的调度器,运行pick_next_task_fair.
函数简要实现如下
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { again: if (!sched_fair_runnable(rq)) goto idle; #ifdef CONFIG_FAIR_GROUP_SCHED if (!prev || prev->sched_class != &fair_sched_class) goto simple; do { se = pick_next_entity(cfs_rq, curr); cfs_rq = group_cfs_rq(se); } while (cfs_rq); p = task_of(se); /* * Since we haven't yet done put_prev_entity and if the selected task * is a different task than we started out with, try and touch the * least amount of cfs_rqs. */ if (prev != p) { struct sched_entity *pse = &prev->se; while (!(cfs_rq = is_same_group(se, pse))) { int se_depth = se->depth; int pse_depth = pse->depth; if (se_depth <= pse_depth) { put_prev_entity(cfs_rq_of(pse), pse); pse = parent_entity(pse); } if (se_depth >= pse_depth) { set_next_entity(cfs_rq_of(se), se); se = parent_entity(se); } } put_prev_entity(cfs_rq, pse); set_next_entity(cfs_rq, se); } goto done; simple: #endif if (prev) put_prev_task(rq, prev); if (repick) { for_each_sched_entity(se) set_next_entity(cfs_rq_of(se), se); goto done; } do { se = pick_next_entity(cfs_rq, NULL); set_next_entity(cfs_rq, se); cfs_rq = group_cfs_rq(se); } while (cfs_rq); p = task_of(se); idle: new_tasks = newidle_balance(rq, rf); /* * Because newidle_balance() releases (and re-acquires) rq->lock, it is * possible for any higher priority task to appear. In that case we * must re-start the pick_next_entity() loop. */ if (new_tasks < 0) return RETRY_TASK; if (new_tasks > 0) goto again; /* * rq is about to be idle, check if we need to update the * lost_idle_time of clock_pelt */ update_idle_rq_clock_pelt(rq); return NULL; }
先用sched_fair_runnable检查是否有任务,如果没任务,调用 newidle_balance,此函数会Attempts to pull tasks from other CPUs尝试从其他cpu上拉一个任务来运行,所以有三种情况
if (!prev || prev->sched_class != &fair_sched_class)会判断没有prev或prev不是cfs,如果不是,那么走simple路径,simple路径为
如果有prev,且prev是cfs中,那么任务走主路径,也叫做优化路径,其流程为
如果它们不在一个组,那么走优化代码,最小化需要更新的层级数量,通过对于选中se的depth和prev的se的depth. 这里depth是从root开始到se所属的组的深度
举个例子
task_a: root-->group_a-->se_a task_b: root-->se_a
那么 depth_a = 2,depth_b = 1,所以depth_a > depth_b
回到代码,两种情况
可以看到,其思路是通过depth来逐步上移,直到他们depth相等,且,当depth相等的时候,se和prev se都上移一层,这样能够更新组调度下他们共属组的se
本文介绍了pick_next_task的核心行为,如果从简单理解来看,就是单纯的找到rb-tree的最左节点,但是实际上,我们需要考虑调度组,需要考虑idle任务的场景,需要考虑遍历调度域的代码性能,需要考虑遍历组的代码性能,所以在pick_next_task中有上述的额外考虑代码.不过这些细节不用太过慌张
我们仍可以简单的认为pick_next_task找到了红黑树的最小实体.