本文介绍一下CFS的另一个关键行为,选中下一个任务运行,当任务schedule的时候,会主动调用pick_next_task函数选出下一个需要调度的进程的任务

代码流程


__schedule
  pick_next_task
    __pick_next_task_fair
      pick_next_task_fair
        pick_next_entity
        set_next_entity
  context_switch

pick_next_task

函数原型如下


static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
        const struct sched_class *class;
        struct task_struct *p;

        /*
         * Optimization: we know that if all tasks are in the fair class we can
         * call that function directly, but only if the @prev task wasn't of a
         * higher scheduling class, because otherwise those loose the
         * opportunity to pull in more work from other CPUs.
         */
        if (likely(prev->sched_class <= &fair_sched_class &&
                   rq->nr_running == rq->cfs.h_nr_running)) {

                p = pick_next_task_fair(rq, prev, rf);
                if (unlikely(p == RETRY_TASK))
                        goto restart;

                /* Assumes fair_sched_class->next == idle_sched_class */
                if (!p) {
                        put_prev_task(rq, prev);
                        p = pick_next_task_idle(rq);
                }

                return p;
        }

restart:
        put_prev_task_balance(rq, prev, rf);

        for_each_class(class) {
                p = class->pick_next_task(rq);
                if (p)
                        return p;
        }

        /* The idle class should always have a runnable task: */
        BUG();
}

根据代码,这里实际上有两个路径.区分代码为


if (likely(prev->sched_class <= &fair_sched_class &&
                   rq->nr_running == rq->cfs.h_nr_running))

也就是上一个任务是cfs调度,且所有任务都在cfs调度类中

路径1: 所有任务都在cfs调度类中

直接调用pick_next_task_fair,如果没有任务了,运行idle任务.如果有任务,则将上一个任务put_prev_task放回本地队列

路径2: 如果有rt类型的任务

因为有rt相关的实时任务,所以先put_prev_task_balance将其负载均衡检查,让rt任务拥有更高优先级权限,降低此任务因为负载均衡导致put_prev_task的时候响应时间增加.

然后遍历所有的调度器,运行pick_next_task_fair.

pick_next_task_fair

函数简要实现如下


pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
again:
        if (!sched_fair_runnable(rq))
                goto idle;

#ifdef CONFIG_FAIR_GROUP_SCHED
        if (!prev || prev->sched_class != &fair_sched_class)
                goto simple;

        do {
                se = pick_next_entity(cfs_rq, curr);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

        p = task_of(se);
        /*
         * Since we haven't yet done put_prev_entity and if the selected task
         * is a different task than we started out with, try and touch the
         * least amount of cfs_rqs.
         */
        if (prev != p) {
                struct sched_entity *pse = &prev->se;

                while (!(cfs_rq = is_same_group(se, pse))) {
                        int se_depth = se->depth;
                        int pse_depth = pse->depth;

                        if (se_depth <= pse_depth) {
                                put_prev_entity(cfs_rq_of(pse), pse);
                                pse = parent_entity(pse);
                        }
                        if (se_depth >= pse_depth) {
                                set_next_entity(cfs_rq_of(se), se);
                                se = parent_entity(se);
                        }
                }

                put_prev_entity(cfs_rq, pse);
                set_next_entity(cfs_rq, se);
        }

        goto done;
simple:
#endif
        if (prev)
                put_prev_task(rq, prev);

        if (repick) {
                for_each_sched_entity(se)
                        set_next_entity(cfs_rq_of(se), se);
                goto done;
        }

        do {
                se = pick_next_entity(cfs_rq, NULL);
                set_next_entity(cfs_rq, se);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

        p = task_of(se);

idle:
        new_tasks = newidle_balance(rq, rf);

        /*
         * Because newidle_balance() releases (and re-acquires) rq->lock, it is
         * possible for any higher priority task to appear. In that case we
         * must re-start the pick_next_entity() loop.
         */
        if (new_tasks < 0)
                return RETRY_TASK;

        if (new_tasks > 0)
                goto again;

        /*
         * rq is about to be idle, check if we need to update the
         * lost_idle_time of clock_pelt
         */
        update_idle_rq_clock_pelt(rq);

        return NULL;
}

检查是否有任务

先用sched_fair_runnable检查是否有任务,如果没任务,调用 newidle_balance,此函数会Attempts to pull tasks from other CPUs尝试从其他cpu上拉一个任务来运行,所以有三种情况

newidle_balance 返回负数,但是因为newidle_balance持有锁,持锁期间担心有高优先级任务出现,所以重试 if (unlikely(p == RETRY_TASK)) goto restart;
newidle_balance 拉到任务了,跳转again
newidle_balance 拉不到任务,让pelt计算rq->lost_idle_time

检查上一个任务不为cfs

if (!prev || prev->sched_class != &fair_sched_class)会判断没有prev或prev不是cfs,如果不是,那么走simple路径,simple路径为

put_prev_task 将其放回本地队列
如果repick为1,则遍历所有se,调用set_next_entity重新选一个
否则逐步按组往下遍历,选择最优的se

如果上一个任务是cfs,且选中的任务不是prev

如果有prev,且prev是cfs中,那么任务走主路径,也叫做优化路径,其流程为

用pick_next_entity选中最优的se
用task_of拿到se的task_struct
如果获取的se和prev不是同一个进程
put_prev_entity将其放回rb-tree
将当前se作为即将运行的se

如果选中的任务不是prev,且它们不在同一个组

如果它们不在一个组,那么走优化代码,最小化需要更新的层级数量,通过对于选中se的depth和prev的se的depth. 这里depth是从root开始到se所属的组的深度

举个例子


task_a: root-->group_a-->se_a
task_b: root-->se_a

那么 depth_a = 2,depth_b = 1,所以depth_a > depth_b

回到代码,两种情况

如果当前选中的se depth <= prev se depth,将prev放回rb-tree,然后找prev se的parent
如果当前选中的se depth >= prev se depth,将se放回rb-tree,然后找se的parent

可以看到,其思路是通过depth来逐步上移,直到他们depth相等,且,当depth相等的时候,se和prev se都上移一层,这样能够更新组调度下他们共属组的se

总结

本文介绍了pick_next_task的核心行为,如果从简单理解来看,就是单纯的找到rb-tree的最左节点,但是实际上,我们需要考虑调度组,需要考虑idle任务的场景,需要考虑遍历调度域的代码性能,需要考虑遍历组的代码性能,所以在pick_next_task中有上述的额外考虑代码.不过这些细节不用太过慌张

我们仍可以简单的认为pick_next_task找到了红黑树的最小实体.

目录