linux進程調(diào)度機制剖析（基于3.16

老匹夫 2014-08-12

展開全文

進程調(diào)度所使用到的數(shù)據(jù)結(jié)構(gòu)：

1.就緒隊列

內(nèi)核為每一個cpu創(chuàng)建一個進程就緒隊列，該隊列上的進程均由該cpu執(zhí)行，代碼如下（kernel/sched/core.c）。

1 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

定義了一個struct rq結(jié)構(gòu)體數(shù)組，每個數(shù)組元素是一個就緒隊列，對應(yīng)一個cpu。下面看下struct rq結(jié)構(gòu)體（kernel/sched/sched.h）：

struct rq

該結(jié)構(gòu)體是本地cpu所有進程組成的就緒隊列，在linux內(nèi)核中，進程被分為普通進程和實時進程，這兩種進程的調(diào)度策略是不同的，因此在31-32行可以看到rq結(jié)構(gòu)體中又內(nèi)嵌了struct cfs_rq cfs和struct rt_rq rt兩個子就緒隊列，分別來組織普通進程和實時進程（普通進程將采用完全公平調(diào)度策略cfs，而實時進程將采用實時調(diào)度策略），第33行struct dl_rq dl調(diào)度空閑進程，暫且不討論。所以，如果咱們研究的是普通進程的調(diào)度，需要關(guān)心的就是struct cfs_rq cfs隊列；如果研究的是實時進程，就只關(guān)心struct rt_rq rt隊列。

1.1普通進程的就緒隊列struct cfs_rq（kernel/sched/sched.h）

struct cfs_rq

cfs_rq就緒隊列是以紅黑樹的形式來組織調(diào)度實體。第12行tasks_timeline成員就是紅黑樹的樹根。第13行rb_leftmost指向了紅黑樹最左邊的左孩子（下一個可調(diào)度的實體）。第19行curr指向當(dāng)前正運行的實體，next指向?qū)⒈粏拘训倪M程，last指向喚醒next進程的進程，next和last用法后邊會提到。第55行rq指向了該cfs_rq就緒隊列所屬的rq隊列。

1.2實時進程的就緒隊列struct rt_rq（kernel/sched/sched.h）

struct rt_rq

2.調(diào)度實體（include/linux/sched.h）

2.1普通進程的調(diào)度實體sched_entity

 1 struct sched_entity {
 2     struct load_weight    load;        /* for load-balancing */
 3     struct rb_node        run_node;
 4     struct list_head    group_node;
 5     unsigned int        on_rq;
 6 
 7     u64            exec_start;
 8     u64            sum_exec_runtime;
 9     u64            vruntime;
10     u64            prev_sum_exec_runtime;
11 
12     u64            nr_migrations;
13 
14 #ifdef CONFIG_SCHEDSTATS
15     struct sched_statistics statistics;
16 #endif
17 
18 #ifdef CONFIG_FAIR_GROUP_SCHED
19     int            depth;
20     struct sched_entity    *parent;
21     /* rq on which this entity is (to be) queued: */
22     struct cfs_rq        *cfs_rq;
23     /* rq "owned" by this entity/group: */
24     struct cfs_rq        *my_q;
25 #endif
26 
27 #ifdef CONFIG_SMP
28     /* Per-entity load-tracking */
29     struct sched_avg    avg;
30 #endif
31 };

每個進程描述符中均包含一個該結(jié)構(gòu)體變量，內(nèi)核使用該結(jié)構(gòu)體來將普通進程組織到采用完全公平調(diào)度策略的就緒隊列中（struct rq中的cfs隊列中，上邊提到過），該結(jié)構(gòu)體有兩個作用，一是包含有進程調(diào)度的信息（比如進程的運行時間，睡眠時間等等，調(diào)度程序參考這些信息決定是否調(diào)度進程），二是使用該結(jié)構(gòu)體來組織進程，第3行的struct rb_node類型結(jié)構(gòu)體變量run_node是紅黑樹節(jié)點，因此struct sched_entity調(diào)度實體將被組織成紅黑樹的形式，同時意味著普通進程也被組織成紅黑樹的形式。第18-25行是和組調(diào)度有關(guān)的成員，需要開啟組調(diào)度。第20行parent顧名思義指向了當(dāng)前實體的上一級實體，后邊再介紹。第22行的cfs_rq指向了該調(diào)度實體所在的就緒隊列。第24行my_q指向了本實體擁有的就緒隊列（調(diào)度組），該調(diào)度組（包括組員實體）屬于下一個級別，和本實體不在同一個級別，該調(diào)度組中所有成員實體的parent域指向了本實體，這就解釋了上邊的parent，此外，第19行depth代表了此隊列（調(diào)度組）的深度，每個調(diào)度組都比其parent調(diào)度組深度大1。內(nèi)核依賴my_q域?qū)崿F(xiàn)組調(diào)度。

2.2實時進程的調(diào)度實體 sched_rt_entity

 1 struct sched_rt_entity {
 2     struct list_head run_list;
 3     unsigned long timeout;
 4     unsigned long watchdog_stamp;
 5     unsigned int time_slice;
 6 
 7     struct sched_rt_entity *back;
 8 #ifdef CONFIG_RT_GROUP_SCHED
 9     struct sched_rt_entity    *parent;
10     /* rq on which this entity is (to be) queued: */
11     struct rt_rq        *rt_rq;
12     /* rq "owned" by this entity/group: */
13     struct rt_rq        *my_q;
14 #endif
15 };

該結(jié)構(gòu)體和上個結(jié)構(gòu)體是類似的，只不過用來組織實時進程，實時進程被組織到struct rq中的rt隊列中，上邊有提到。每個進程描述符中也包含一個該結(jié)構(gòu)體。該結(jié)構(gòu)體中并未包含struct rb_node類型結(jié)構(gòu)體變量，而在第1行出現(xiàn)了struct list_head類型結(jié)構(gòu)體變量run_list，因此可以看出實時進程的就緒隊列是雙向鏈表形式，而不是紅黑數(shù)的形式。

3.調(diào)度類（kernel/sched/sched.h）

 1 struct sched_class {
 2     const struct sched_class *next;
 3 
 4     void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 5     void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 6     void (*yield_task) (struct rq *rq);
 7     bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
 8 
 9     void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
10 
11     /*
12      * It is the responsibility of the pick_next_task() method that will
13      * return the next task to call put_prev_task() on the @prev task or
14      * something equivalent.
15      *
16      * May return RETRY_TASK when it finds a higher prio class has runnable
17      * tasks.
18      */
19     struct task_struct * (*pick_next_task) (struct rq *rq,
20                         struct task_struct *prev);
21     void (*put_prev_task) (struct rq *rq, struct task_struct *p);
22 
23 #ifdef CONFIG_SMP
24     int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
25     void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
26 
27     void (*post_schedule) (struct rq *this_rq);
28     void (*task_waking) (struct task_struct *task);
29     void (*task_woken) (struct rq *this_rq, struct task_struct *task);
30 
31     void (*set_cpus_allowed)(struct task_struct *p,
32                  const struct cpumask *newmask);
33 
34     void (*rq_online)(struct rq *rq);
35     void (*rq_offline)(struct rq *rq);
36 #endif
37 
38     void (*set_curr_task) (struct rq *rq);
39     void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
40     void (*task_fork) (struct task_struct *p);
41     void (*task_dead) (struct task_struct *p);
42 
43     void (*switched_from) (struct rq *this_rq, struct task_struct *task);
44     void (*switched_to) (struct rq *this_rq, struct task_struct *task);
45     void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
46                  int oldprio);
47 
48     unsigned int (*get_rr_interval) (struct rq *rq,
49                      struct task_struct *task);
50 
51 #ifdef CONFIG_FAIR_GROUP_SCHED
52     void (*task_move_group) (struct task_struct *p, int on_rq);
53 #endif
54 };

內(nèi)核聲明了一個調(diào)度類sched_class的結(jié)構(gòu)體類型，用來實現(xiàn)不同的調(diào)度策略，可以看到該結(jié)構(gòu)體成員都是函數(shù)指針，這些指針指向的函數(shù)就是調(diào)度策略的具體實現(xiàn)，所有和進程調(diào)度有關(guān)的函數(shù)都直接或者間接調(diào)用了這些成員函數(shù)，來實現(xiàn)進程調(diào)度。此外，每個進程描述符中都包含一個指向該結(jié)構(gòu)體類型的指針sched_class，指向了所采用的調(diào)度類。下面我們看下完全公平調(diào)度策略類的定義和初始化（kernel/sched/fair.c）。

1 const struct sched_class fair_sched_class;

定義了一個全局的調(diào)度策略變量。初始化如下：

 1 const struct sched_class fair_sched_class = {
 2     .next            = &idle_sched_class,
 3     .enqueue_task        = enqueue_task_fair,
 4     .dequeue_task        = dequeue_task_fair,
 5     .yield_task        = yield_task_fair,
 6     .yield_to_task        = yield_to_task_fair,
 7 
 8     .check_preempt_curr    = check_preempt_wakeup,
 9 
10     .pick_next_task        = pick_next_task_fair,
11     .put_prev_task        = put_prev_task_fair,
12 
13 #ifdef CONFIG_SMP
14     .select_task_rq        = select_task_rq_fair,
15     .migrate_task_rq    = migrate_task_rq_fair,
16 
17     .rq_online        = rq_online_fair,
18     .rq_offline        = rq_offline_fair,
19 
20     .task_waking        = task_waking_fair,
21 #endif
22 
23     .set_curr_task          = set_curr_task_fair,
24     .task_tick        = task_tick_fair,
25     .task_fork        = task_fork_fair,
26 
27     .prio_changed        = prio_changed_fair,
28     .switched_from        = switched_from_fair,
29     .switched_to        = switched_to_fair,
30 
31     .get_rr_interval    = get_rr_interval_fair,
32 
33 #ifdef CONFIG_FAIR_GROUP_SCHED
34     .task_move_group    = task_move_group_fair,
35 #endif
36 };

可以看到該結(jié)構(gòu)體變量中函數(shù)成員很多，它們實現(xiàn)了不同的功能，待會用到時我們再做分析。

4.進程描述符task_struct（include/linux/sched.h）

 1 struct task_struct {
 2     volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
 3     .....
 4     int on_rq;
 5 
 6     int prio, static_prio, normal_prio;
 7     unsigned int rt_priority;
 8     const struct sched_class *sched_class;
 9     struct sched_entity se;
10     struct sched_rt_entity rt;
11 #ifdef CONFIG_CGROUP_SCHED
12     struct task_group *sched_task_group;
13 #endif
14     struct sched_dl_entity dl;
15     .....
16     .....
17     unsigned int policy;
18     .....
19     .....
20     struct sched_info sched_info;
21     .....
22     .....
23 };

只列出了和進程調(diào)度有關(guān)的成員。第6行三個變量代表了普通進程的三個優(yōu)先級，第7行的變量代表了實時進程的優(yōu)先級。關(guān)于進程優(yōu)先級的概念，大家可以看看《深入理解linux內(nèi)核架構(gòu)》這本書的進程管理章節(jié)。第8-10行就是我們上邊提到的那些結(jié)構(gòu)體在進程描述符中的定義。第17行的policy代表了當(dāng)前進程的調(diào)度策略，內(nèi)核給出了宏定義，它可以在這些宏中取值，關(guān)于詳細(xì)的講解還是去看《深入理解linux內(nèi)核架構(gòu)》這本書的進程管理部分。policy取了什么值，sched_class也應(yīng)該取相應(yīng)的調(diào)度類指針。

進程調(diào)度過程分析：

進程調(diào)度過程分為兩部分，一是對進程信息進行修改，主要是修改和調(diào)度相關(guān)的信息，比如進程的運行時間，睡眠時間，進程的狀態(tài)，cpu的負(fù)荷等等，二是進程的切換。和進程調(diào)度相關(guān)的所有函數(shù)中，只有schedule函數(shù)是用來進行進程切換的，其他函數(shù)都是用來修改進程的調(diào)度信息。schedule函數(shù)我們在前邊的博文中已經(jīng)探討過了，這里不再分析。對于其他函數(shù)，我們將按照其功能，逐類來分析。

1.scheduler_tick（kernel/sched/core.c ）

 1 void scheduler_tick(void)
 2 {
 3     int cpu = smp_processor_id();
 4     struct rq *rq = cpu_rq(cpu);
 5     struct task_struct *curr = rq->curr;
 6 
 7     sched_clock_tick();
 8 
 9     raw_spin_lock(&rq->lock);
10     update_rq_clock(rq);
11     curr->sched_class->task_tick(rq, curr, 0);
12     update_cpu_load_active(rq);
13     raw_spin_unlock(&rq->lock);
14 
15     perf_event_task_tick();
16 
17 #ifdef CONFIG_SMP
18     rq->idle_balance = idle_cpu(cpu);
19     trigger_load_balance(rq);
20 #endif
21     rq_last_tick_reset(rq);
22 }

該函數(shù)被時鐘中斷處理程序調(diào)用，將當(dāng)前cpu的負(fù)載情況記載到運行隊列struct rq的某些成員中，并更新當(dāng)前進程的時間片。第3行獲取當(dāng)前的cpu號，第4行獲取當(dāng)前cpu的就緒隊列（每個cpu有一個）rq，第5行從就緒隊列中獲取當(dāng)前運行進程的描述符，第10行更新就緒隊列中的clock和clock_task成員值，代表當(dāng)前的時間，一般我們會用到clock_task。第11行進入當(dāng)前進程的調(diào)度類的task_tick函數(shù)中，更新當(dāng)前進程的時間片，不同調(diào)度類的該函數(shù)實現(xiàn)不同，待會我們分析下完全公平調(diào)度類的該函數(shù)。第12行更新就緒隊列的cpu負(fù)載信息。第18行判斷當(dāng)前cpu是否是空閑的，是的話idle_cpu返回1，否則返回0。第19行掛起SCHED_SOFTIRQ軟中斷函數(shù)，去做周期性的負(fù)載平衡操作。第21行將最新的時鐘滴答數(shù)jiffies存入就緒隊列的last_sched_tick域中。再來看下task_tick_fair函數(shù)（kernel/sched/fair.c）：

 1 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 2 {
 3     struct cfs_rq *cfs_rq;
 4     struct sched_entity *se = &curr->se;
 5 
 6     for_each_sched_entity(se) {
 7         cfs_rq = cfs_rq_of(se);
 8         entity_tick(cfs_rq, se, queued);
 9     }
10 
11     if (numabalancing_enabled)
12         task_tick_numa(rq, curr);
13 
14     update_rq_runnable_avg(rq, 1);
15 }

如果某個進程的調(diào)度類采用完全公平調(diào)度類的話，那么上個函數(shù)scheduler_tick第11行所執(zhí)行的task_tick函數(shù)指針，就指向了本函數(shù)?？梢曰仡^看看完全公平調(diào)度對象的初始化，第24行的賦值語句.task_tick = task_tick_fair?；氐奖竞瘮?shù)，第4行獲取當(dāng)前進程的普通調(diào)度實體，將指針存放到se中，第6-8行遍歷當(dāng)前調(diào)度實體的上一級實體，以及上上一級實體，以此類推，然后在entity_tick函數(shù)中更新調(diào)度實體的運行時間等信息。在這里用循環(huán)來遍歷的原因是當(dāng)開啟了組調(diào)度后，調(diào)度實體的parent域就存儲了它的上一級節(jié)點，該實體和它parent指向的實體不是同一級別，因此使用循環(huán)就把從當(dāng)前級別（組）到最頂層的級別遍歷完了；如果未選擇組支持，則循環(huán)只執(zhí)行一次，僅對當(dāng)前調(diào)度實體進行更新。下面看下entity_tick的代碼（kernel/sched/fair.c）：

 1 static void
 2 entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 3 {
 4     /*
 5      * Update run-time statistics of the 'current'.
 6      */
 7     update_curr(cfs_rq);
 8 
 9     /*
10      * Ensure that runnable average is periodically updated.
11      */
12     update_entity_load_avg(curr, 1);
13     update_cfs_rq_blocked_load(cfs_rq, 1);
14     update_cfs_shares(cfs_rq);
15 
16 #ifdef CONFIG_SCHED_HRTICK
17     /*
18      * queued ticks are scheduled to match the slice, so don't bother
19      * validating it and just reschedule.
20      */
21     if (queued) {
22         resched_task(rq_of(cfs_rq)->curr);
23         return;
24     }
25     /*
26      * don't let the period tick interfere with the hrtick preemption
27      */
28     if (!sched_feat(DOUBLE_TICK) &&
29             hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
30         return;
31 #endif
32 
33     if (cfs_rq->nr_running > 1)
34         check_preempt_tick(cfs_rq, curr);
35 }

在該函數(shù)中對調(diào)度實體（進程）的運行時間等信息進行更新。第7行update_curr函數(shù)對當(dāng)前進程的運行時間進行更新，隨后分析。第21行如果傳進來的參數(shù)queued不為0的話，當(dāng)前進程會被無條件設(shè)置重新調(diào)度標(biāo)志（允許被搶占了）。第33-34行如果當(dāng)前cfs_rq隊列等待調(diào)度的進程數(shù)量大于1，那么就執(zhí)行check_preempt_tick函數(shù)檢查當(dāng)前進程的時間片是否用完，用完的話就需要調(diào)度別的進程來運行（具體來說，如果當(dāng)前進程“真實時間片”用完，該函數(shù)給當(dāng)前進程設(shè)置need_resched標(biāo)志，然后schedule程序就可以重新在就緒隊列中調(diào)度新的進程），下面分析update_curr函數(shù)（kernel/sched/fair.c）：

 1 static void update_curr(struct cfs_rq *cfs_rq)
 2 {
 3     struct sched_entity *curr = cfs_rq->curr;
 4     u64 now = rq_clock_task(rq_of(cfs_rq));
 5     u64 delta_exec;
 6 
 7     if (unlikely(!curr))
 8         return;
 9 
10     delta_exec = now - curr->exec_start;
11     if (unlikely((s64)delta_exec <= 0))
12         return;
13 
14     curr->exec_start = now;
15 
16     schedstat_set(curr->statistics.exec_max,
17               max(delta_exec, curr->statistics.exec_max));
18 
19     curr->sum_exec_runtime += delta_exec;
20     schedstat_add(cfs_rq, exec_clock, delta_exec);
21 
22     curr->vruntime += calc_delta_fair(delta_exec, curr);
23     update_min_vruntime(cfs_rq);
24 
25     if (entity_is_task(curr)) {
26         struct task_struct *curtask = task_of(curr);
27 
28         trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
29         cpuacct_charge(curtask, delta_exec);
30         account_group_exec_runtime(curtask, delta_exec);
31     }
32 
33     account_cfs_rq_runtime(cfs_rq, delta_exec);
34 }

該函數(shù)是更新進程運行時間最核心的一個函數(shù)。第3行獲取當(dāng)前的調(diào)度實體，第4行從就緒隊列rq的clock_task成員中獲取當(dāng)前時間，存入now中，該成員我們在scheduler_tick函數(shù)中提到過。第10行用當(dāng)前時間減去進程在上次時鐘中斷tick中的開始時間得到進程運行的時間間隔，存入delta_exec中。第14行當(dāng)前時間又成為進程新的開始時間。第19行將進程運行的時間間隔delta_exec累加到調(diào)度實體的sum_exec_runtime成員中，該成員代表進程到目前為止運行了多長時間。第20行將進程運行的時間間隔delta_exec也累加到公平調(diào)度就緒隊列cfs_rq的exec_clock成員中。第22行calc_delta_fair函數(shù)很關(guān)鍵，它將進程執(zhí)行的真實運行時間轉(zhuǎn)換成虛擬運行時間，然后累加到調(diào)度實體的vruntime域中，進程的虛擬時間非常重要，完全公平調(diào)度策略就是依賴該時間進行調(diào)度。關(guān)于進程的真實時間和虛擬時間的概念，以及它們之間的轉(zhuǎn)換算法，文章的后面會提到，詳細(xì)的內(nèi)容大家可以看看《深入理解linux內(nèi)核架構(gòu)》的進程管理章節(jié)。第23行更新cfs_rq隊列中的最小虛擬運行時間min_vruntime，該時間是就緒隊列中所有進程包括當(dāng)前進程的已運行的最小虛擬時間，只能單調(diào)遞增，待會我們分析update_min_vruntime函數(shù)，該函數(shù)比較重要。第25-30行，如果調(diào)度單位是進程的話（不是組），會更新進程描述符中的運行時間。第33行更新cfs_rq隊列的剩余運行時間，并計算出期望運行時間，必要的話可以對進程重新調(diào)度。下面我們先分析update_min_vruntime函數(shù)，然后分析calc_delta_fair函數(shù)（kernel/sched/fair.c）：

 1 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 2 {
 3     u64 vruntime = cfs_rq->min_vruntime;
 4 
 5     if (cfs_rq->curr)
 6         vruntime = cfs_rq->curr->vruntime;
 7 
 8     if (cfs_rq->rb_leftmost) {
 9         struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
10                            struct sched_entity,
11                            run_node);
12 
13         if (!cfs_rq->curr)
14             vruntime = se->vruntime;
15         else
16             vruntime = min_vruntime(vruntime, se->vruntime);
17     }
18 
19     /* ensure we never gain time by being placed backwards. */
20     cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
21 #ifndef CONFIG_64BIT
22     smp_wmb();
23     cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
24 #endif
25 }

每個cfs_rq隊列均有一個min_vruntime成員，裝的是就緒隊列中所有進程包括當(dāng)前進程已運行的虛擬時間中最小的那個時間。本函數(shù)來更新這個時間。第5-6行如果當(dāng)前有進程正在執(zhí)行，將當(dāng)前進程已運行的虛擬時間保存在vruntime變量中。第8-17行如果就緒隊列中有下一個要被調(diào)度的進程（由rb_leftmost指針指向），則進入if體，第13-16行從當(dāng)前進程和下個被調(diào)度進程中，選擇最小的已運行虛擬時間，保存到vruntime中。第20行從當(dāng)前隊列的min_vruntime域和vruntime變量中，選最大的保存到隊列的min_vruntime域中，完成更新。其實第13-17行是最關(guān)鍵的，保證了隊列的min_vruntime域中存放的是就緒隊列中最小的虛擬運行時間，而第20行的作用僅僅是保證min_vruntime域中的值單調(diào)遞增，沒有別的含義了。隊列中的min_vruntime成員非常重要，用于在睡眠進程被喚醒后以及新進程被創(chuàng)建好時，進行虛擬時間補償或者懲罰，后面會分析到。

1 static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
2 {
3     if (unlikely(se->load.weight != NICE_0_LOAD))
4         delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
5 
6     return delta;
7 }

第3行判斷當(dāng)前進程nice值是否為0，如果是的話，直接返回真實運行時間delta（nice0級別的進程真實運行時間和虛擬運行時間值相等）；如果不是的話，第4行將真實時間轉(zhuǎn)換成虛擬時間。下面我們分析__calc_delta函數(shù)（kernel/sched/fair.c）：

 1 static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
 2 {
 3     u64 fact = scale_load_down(weight);
 4     int shift = WMULT_SHIFT;
 5 
 6     __update_inv_weight(lw);
 7 
 8     if (unlikely(fact >> 32)) {
 9         while (fact >> 32) {
10             fact >>= 1;
11             shift--;
12         }
13     }
14 
15     /* hint to use a 32x32->64 mul */
16     fact = (u64)(u32)fact * lw->inv_weight;
17 
18     while (fact >> 32) {
19         fact >>= 1;
20         shift--;
21     }
22 
23     return mul_u64_u32_shr(delta_exec, fact, shift);
24 }

該函數(shù)執(zhí)行了兩種算法：要么是delta_exec * weight / lw.weight，要么是(delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT。計算的結(jié)果就是轉(zhuǎn)換后的虛擬時間。至此，scheduler_tick函數(shù)大致就分析完了，當(dāng)然還有一些細(xì)節(jié)沒有分析到，進程調(diào)度這塊非常龐雜，要想把所有函數(shù)分析完非常耗費時間和精力，以后如果分析到的話，再慢慢補充。scheduler_tick函數(shù)主要更新進程的運行時間，包括物理時間和虛擬時間。

2.進程優(yōu)先級設(shè)置的函數(shù)，進程的優(yōu)先級和調(diào)度關(guān)系密切，通過上邊分析可以看到，計算進程的虛擬運行時間要用到優(yōu)先級，優(yōu)先級決定進程權(quán)重，權(quán)重決定進程虛擬時間的增加速度，最終決定進程可運行時間的長短。權(quán)重越大的進程可以執(zhí)行的時間越長。從effective_prio函數(shù)開始（kernel/sched/core.c）：

 1 static int effective_prio(struct task_struct *p)
 2 {
 3     p->normal_prio = normal_prio(p);
 4     /*
 5      * If we are RT tasks or we were boosted to RT priority,
 6      * keep the priority unchanged. Otherwise, update priority
 7      * to the normal priority:
 8      */
 9     if (!rt_prio(p->prio))
10         return p->normal_prio;
11     return p->prio;
12 }

該函數(shù)在進程創(chuàng)建時或者在用戶的nice系統(tǒng)調(diào)用中都會被調(diào)用到，來設(shè)置進程的活動優(yōu)先級（進程的三種優(yōu)先級：活動優(yōu)先級prio，靜態(tài)優(yōu)先級static_prio，普通優(yōu)先級normal_prio），該函數(shù)設(shè)計的有一定技巧性，函數(shù)的返回值是用來設(shè)置進程的活動優(yōu)先級，但是在函數(shù)體中也把進程的普通優(yōu)先級設(shè)置了。第3行設(shè)置進程的普通優(yōu)先級，隨后分析normal_prio函數(shù)。第9-11行，通過進程的活動優(yōu)先級可以判斷出該進程是不是實時進程，如果是實時進程，則執(zhí)行11行，返回p->prio，實時進程的活動優(yōu)先級不變。否則返回p->normal_prio，這說明普通進程的活動優(yōu)先級等于它的普通優(yōu)先級。下面我們看看normal_prio函數(shù)（kernel/sched/core.c）：

 1 static inline int normal_prio(struct task_struct *p)
 2 {
 3     int prio;
 4 
 5     if (task_has_dl_policy(p))
 6         prio = MAX_DL_PRIO-1;
 7     else if (task_has_rt_policy(p))
 8         prio = MAX_RT_PRIO-1 - p->rt_priority;
 9     else
10         prio = __normal_prio(p);
11     return prio;
12 }

該函數(shù)用來設(shè)置進程的普通優(yōu)先級。第5行判斷當(dāng)前進程是不是空閑進程，是的話設(shè)置進程的普通優(yōu)先級為-1（-1是空閑進程的優(yōu)先級），否則的話第7行判斷進程是不是實時進程，是的話設(shè)置實時進程的普通優(yōu)先級為0-99（數(shù)字越小優(yōu)先級越高），可以看到這塊減去了p->rt_priority，比較奇怪，這是因為實時進程描述符的rt_priority成員中事先存放了它自己的優(yōu)先級（數(shù)字也是0-99，但在這里數(shù)字越大，優(yōu)先級越高），因此往p->prio中倒換的時候，需要處理一下，MAX_RT_PRIO值為100，因此MAX_RT_PRIO-1-（0，99）就倒換成了（99，0），這僅僅是個小技巧。如果當(dāng)前進程也不是實時進程（說明是普通進程嘍），則第10行將進程的靜態(tài)優(yōu)先級存入prio中，然后返回（也就是說普通進程的普通優(yōu)先級等于其靜態(tài)優(yōu)先級）。

總結(jié)下，總共有三種進程：空閑進程，實時進程，普通進程；每種進程都包含三種優(yōu)先級：活動優(yōu)先級，普通優(yōu)先級，靜態(tài)優(yōu)先級?？臻e進程的普通優(yōu)先級永遠(yuǎn)-1，實時進程的普通優(yōu)先級是根據(jù)p->rt_priority設(shè)定的（0-99），普通進程的普通優(yōu)先級是根據(jù)其靜態(tài)優(yōu)先級設(shè)定的（100-139）。

3.進程權(quán)重設(shè)置的函數(shù)，上邊我們提到，進程的優(yōu)先級決定進程的權(quán)重。權(quán)重進而決定進程運行時間的長短。我們先分析和權(quán)重相關(guān)的數(shù)據(jù)結(jié)構(gòu)。

struct load_weight（include/linux/sched.h）

1 struct load_weight {
2     unsigned long weight;
3     u32 inv_weight;
4 };

每個進程描述符，調(diào)度實體，就緒對列中都包含一個該類型變量，用來保存各自的權(quán)重。成員weight中存放進程優(yōu)先級所對應(yīng)的權(quán)重。成員inv_weight中存放NICE_0_LOAD/weight的結(jié)果，這個結(jié)果乘以進程運行的時間間隔delta_exec就是進程運行的虛擬時間。因此引入權(quán)重就是為了計算進程的虛擬時間。在這里將中間的計算結(jié)果保存下來，后邊就不用再計算了，直接可以用。

數(shù)組prio_to_weight[40]（kernel/sched/sched.h）

 1 static const int prio_to_weight[40] = {
 2  /* -20 */     88761,     71755,     56483,     46273,     36291,
 3  /* -15 */     29154,     23254,     18705,     14949,     11916,
 4  /* -10 */      9548,      7620,      6100,      4904,      3906,
 5  /*  -5 */      3121,      2501,      1991,      1586,      1277,
 6  /*   0 */      1024,       820,       655,       526,       423,
 7  /*   5 */       335,       272,       215,       172,       137,
 8  /*  10 */       110,        87,        70,        56,        45,
 9  /*  15 */        36,        29,        23,        18,        15,
10 };

該數(shù)組是普通進程的優(yōu)先級和權(quán)重對應(yīng)關(guān)系。數(shù)組元素是權(quán)重值，分別是優(yōu)先級從100-139或者nice值從-20-+19所對應(yīng)的權(quán)重值。nice值（-20-+19）是從用戶空間看到的普通進程的優(yōu)先級，和內(nèi)核空間的優(yōu)先級（100-139）一一對應(yīng)。struct load_weight中的weight成員存放正是這些權(quán)重值。

中間結(jié)果數(shù)組prio_to_wmult[40] （kernel/sched/sched.h）

 1 static const u32 prio_to_wmult[40] = {
 2  /* -20 */     48388,     59856,     76040,     92818,    118348,
 3  /* -15 */    147320,    184698,    229616,    287308,    360437,
 4  /* -10 */    449829,    563644,    704093,    875809,   1099582,
 5  /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 6  /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 7  /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 8  /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 9  /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
10 };

該數(shù)組元素就是上個數(shù)組元素被NICE_0_LOAD除的結(jié)果，一一對應(yīng)。struct load_weight中的inv_weight成員存放正是這些值。

下邊我們分析和權(quán)重設(shè)置相關(guān)的函數(shù)。從set_load_weight函數(shù)開始（kernel/sched/core.c）：

 1 static void set_load_weight(struct task_struct *p)
 2 {
 3     int prio = p->static_prio - MAX_RT_PRIO;
 4     struct load_weight *load = &p->se.load;
 5 
 6     /*
 7      * SCHED_IDLE tasks get minimal weight:
 8      */
 9     if (p->policy == SCHED_IDLE) {
10         load->weight = scale_load(WEIGHT_IDLEPRIO);
11         load->inv_weight = WMULT_IDLEPRIO;
12         return;
13     }
14 
15     load->weight = scale_load(prio_to_weight[prio]);
16     load->inv_weight = prio_to_wmult[prio];
17 }

該函數(shù)用來設(shè)置進程p的權(quán)重。第3行將進程p的靜態(tài)優(yōu)先級轉(zhuǎn)換成數(shù)組下標(biāo)（減去100，從100-139--->0-39）。第4行獲取當(dāng)前進程的調(diào)度實體中的權(quán)重結(jié)構(gòu)體指針，存入load中。第9-12行，如果當(dāng)前進程是空閑進程，則設(shè)置相應(yīng)的權(quán)重和中間計算結(jié)果。第15-16行，設(shè)置實時進程和普通進程的權(quán)重和中間計算結(jié)果。

由于就緒隊列中也包含權(quán)重結(jié)構(gòu)體，所以也要對它們進行設(shè)置。使用以下函數(shù)（kernel/sched/fair.c）：

1 static inline void update_load_set(struct load_weight *lw, unsigned long w)
2 {
3     lw->weight = w;
4     lw->inv_weight = 0;
5 }

該函數(shù)用來設(shè)置就緒隊列的權(quán)重。

1 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
2 {
3     lw->weight += inc;
4     lw->inv_weight = 0;
5 }

當(dāng)有進程加入就緒隊列，使用該函數(shù)增加就緒隊列的權(quán)重。

1 static inline void update_load_sub(struct load_weight *lw, unsigned long dec)
2 {
3     lw->weight -= dec;
4     lw->inv_weight = 0;
5 }

當(dāng)有進程從就緒隊列移除時，使用該函數(shù)減少就緒隊列的權(quán)重。就緒隊列的load_weight.inv_weight成員總是0，因為不會使用到就緒隊列的該域。

4.計算進程延遲周期的相關(guān)函數(shù)。進程的延遲周期指的是將所有進程執(zhí)行一遍的時間。當(dāng)就緒隊列中的進程數(shù)量不超出規(guī)定的時候，內(nèi)核有一個固定的延遲周期供調(diào)度使用，但是當(dāng)進程數(shù)量超出規(guī)定以后，就需要對該固定延遲周期進行擴展（不然隨著進程越多，每個進程分配的執(zhí)行時間會越少）。下面看看sched_slice函數(shù)（kernel/sched/fair.c）：

 1 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 2 {
 3     u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
 4 
 5     for_each_sched_entity(se) {
 6         struct load_weight *load;
 7         struct load_weight lw;
 8 
 9         cfs_rq = cfs_rq_of(se);
10         load = &cfs_rq->load;
11 
12         if (unlikely(!se->on_rq)) {
13             lw = cfs_rq->load;
14 
15             update_load_add(&lw, se->load.weight);
16             load = &lw;
17         }
18         slice = __calc_delta(slice, se->load.weight, load);
19     }
20     return slice;
21 }

直接看第18行，__calc_delta用來計算當(dāng)前進程在延遲周期中可占的時間（相當(dāng)于“時間片”，就是當(dāng)前進程可用的時間）。__calc_delta函數(shù)很強大，記得前邊在entity_tick函數(shù)中也調(diào)用過該函數(shù)（entity_tick--->update_curr--->calc_delta_fair--->__calc_delta），當(dāng)時使用該函數(shù)是為了將進程運行過的物理時間（真實時間）轉(zhuǎn)換成虛擬時間；而在此處，我們用它來計算當(dāng)前進程在延遲周期中可占的時間（相當(dāng)于“時間片”）。前邊提過，__calc_delta中用到param1 * param2 / param3.weight這個公式（param代表該函數(shù)接收的參數(shù)），當(dāng)param2的值固定不變（等于NICE_0_LOAD），param3（代表當(dāng)前進程的權(quán)重）在變化時，該函數(shù)是用來轉(zhuǎn)換真實時間和虛擬時間的；當(dāng)param3（代表當(dāng)前cfs_rq的權(quán)重，cfs_rq->load->weight）的值固定不變，param2在變化（代表當(dāng)前進程的權(quán)重）時，該函數(shù)則是用來計算當(dāng)前進程應(yīng)該運行的時間。因此第18行計算結(jié)果slice就是當(dāng)前進程應(yīng)該運行的真實時間。下面再看一個函數(shù)sched_vslice（kernel/sched/fair.c）：

1 static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
2 {
3     return calc_delta_fair(sched_slice(cfs_rq, se), se);
4 }

該函數(shù)用來將進程應(yīng)該運行的真實時間轉(zhuǎn)換成應(yīng)該運行的虛擬時間，以供調(diào)度使用。可以看到該函數(shù)調(diào)用了cals_delta_fair來進行時間轉(zhuǎn)換，前邊已分析過，不再贅述。

5.選擇下一個需要調(diào)度的進程。所使用的函數(shù)是pick_next_task_fair，代碼如下（kernel/sched/fair.c）：

pick_next_task_fair

該函數(shù)會被賦給公平調(diào)度類的pick_next_task成員（.pick_next_task = pick_next_task_fair），在schedule函數(shù)中會調(diào)用該函數(shù)選擇下一個需要調(diào)用的進程，然后進行進程切換。第11-12行如果當(dāng)前就緒隊列中的進程數(shù)量為0，則退出函數(shù)。第25-49行對所有的調(diào)度組進行遍歷，從中選擇下一個可調(diào)度的進程，而不只局限在當(dāng)前隊列的當(dāng)前組。第26行獲取當(dāng)前調(diào)度實體，第34-37行如果存在一個當(dāng)前調(diào)度實體（進程）并且正在運行，則更新進程的運行時間，否則就緒隊列cfs_rq.curr置為null，表示當(dāng)前無進程運行。第47行獲取下一個調(diào)度實體，待會來分析該函數(shù)。第48行獲取下個調(diào)度實體所擁有的就緒隊列my_q（代表一個調(diào)度組），如果調(diào)度組非空，則進入下一次循環(huán)，在新的就緒隊列（調(diào)度組）中挑選下一個可調(diào)度進程，直到某個實體沒有自己的就緒隊列為止（遍歷完所有的調(diào)度組了）。退出循環(huán)后，可以發(fā)現(xiàn)此時的se是所遍歷的最后一個調(diào)度組的下個可運行實體。第51行獲取se對應(yīng)的進程描述符，第58-77行，如果當(dāng)前進程和下一個進程（se所對應(yīng)的進程）不是同一個的話，則執(zhí)行if體，第59行將當(dāng)前進程的調(diào)度實體指針裝入pse中，第61-72行如果當(dāng)前進程和下一個調(diào)度的進程（se對應(yīng)的進程）不屬于同一調(diào)度組，則進入循環(huán)。否則，執(zhí)行第75-76行，將當(dāng)前進程放回就緒隊列，將下個進程從就緒隊列中拿出，這兩個函數(shù)涉及到了就緒隊列的出隊和入隊操作，我們在下邊分析。第61-73的循環(huán)根據(jù)當(dāng)前實體和下個調(diào)度實體的節(jié)點深度進行調(diào)整（同一個級別的進程才能切換）。下面看看pick_next_entity，put_prev_entity和set_prev_entity函數(shù)代碼（kernel/sched/fair.c）：

 1 static struct sched_entity *
 2 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 3 {
 4     struct sched_entity *left = __pick_first_entity(cfs_rq);
 5     struct sched_entity *se;
 6 
 7     /*
 8      * If curr is set we have to see if its left of the leftmost entity
 9      * still in the tree, provided there was anything in the tree at all.
10      */
11     if (!left || (curr && entity_before(curr, left)))
12         left = curr;
13 
14     se = left; /* ideally we run the leftmost entity */
15 
16     /*
17      * Avoid running the skip buddy, if running something else can
18      * be done without getting too unfair.
19      */
20     if (cfs_rq->skip == se) {
21         struct sched_entity *second;
22 
23         if (se == curr) {
24             second = __pick_first_entity(cfs_rq);
25         } else {
26             second = __pick_next_entity(se);
27             if (!second || (curr && entity_before(curr, second)))
28                 second = curr;
29         }
30 
31         if (second && wakeup_preempt_entity(second, left) < 1)
32             se = second;
33     }
34 
35     /*
36      * Prefer last buddy, try to return the CPU to a preempted task.
37      */
38     if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
39         se = cfs_rq->last;
40 
41     /*
42      * Someone really wants this to run. If it's not unfair, run it.
43      */
44     if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
45         se = cfs_rq->next;
46 
47     clear_buddies(cfs_rq, se);
48 
49     return se;
50 }

該函數(shù)選擇下一個調(diào)度的實體。第4行將紅黑樹的最左邊實體賦給left，第11-12行如果紅黑樹的最左邊實體為空或者當(dāng)前實體運行的虛擬時間小于下一個實體（繼續(xù)當(dāng)前的實體），把當(dāng)前實體賦給left，實際上left指向的是下一個要執(zhí)行的進程（該進程要么還是當(dāng)前進程，要么是下一個進程），第14行將left賦給se，第20-33行如果se進程是需要跳過的進程（不能被調(diào)度），執(zhí)行if體，如果se進程是當(dāng)前進程，則選擇紅黑數(shù)最左的進程賦給second，否則se進程已經(jīng)是最左的進程，就把next指向的進程賦給second（next指向的是剛剛被喚醒的進程），第32行將second再次賦給se，se是挑選出來的下個進程。第38-45，再次判斷，要么把last指向的進程賦給se，要么把next指向進程賦給se，內(nèi)核更傾向于把last賦給se，因為last是喚醒next進程的進程（一般就是當(dāng)前進程），所以執(zhí)行l(wèi)ast就意味著不用切換進程，效率最高。第47行清理掉next和last域。第31，38，44行使用到的wakeup_preempt_entity函數(shù)我們在進程喚醒那塊再分析。

 1 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 2 {
 3     /*
 4      * If still on the runqueue then deactivate_task()
 5      * was not called and update_curr() has to be done:
 6      */
 7     if (prev->on_rq)
 8         update_curr(cfs_rq);
 9 
10     /* throttle cfs_rqs exceeding runtime */
11     check_cfs_rq_runtime(cfs_rq);
12 
13     check_spread(cfs_rq, prev);
14     if (prev->on_rq) {
15         update_stats_wait_start(cfs_rq, prev);
16         /* Put 'current' back into the tree. */
17         __enqueue_entity(cfs_rq, prev);
18         /* in !on_rq case, update occurred at dequeue */
19         update_entity_load_avg(prev, 1);
20     }
21     cfs_rq->curr = NULL;
22 }

該函數(shù)將當(dāng)前調(diào)度實體放回就緒隊列。第7-8行如果當(dāng)前實體正在運行，更新其時間片。第17行將當(dāng)前實體加入到就緒隊列中，待會分析__enqueue_entity函數(shù)。第21行將就緒隊列的curr域置為null，因為當(dāng)前進程已經(jīng)放回就緒隊列了，就表示當(dāng)前沒有進程正在執(zhí)行了，因此curr為空。

 1 static void
 2 set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 3 {
 4     /* 'current' is not kept within the tree. */
 5     if (se->on_rq) {
 6         /*
 7          * Any task has to be enqueued before it get to execute on
 8          * a CPU. So account for the time it spent waiting on the
 9          * runqueue.
10          */
11         update_stats_wait_end(cfs_rq, se);
12         __dequeue_entity(cfs_rq, se);
13     }
14 
15     update_stats_curr_start(cfs_rq, se);
16     cfs_rq->curr = se;
17 #ifdef CONFIG_SCHEDSTATS
18     /*
19      * Track our maximum slice length, if the CPU's load is at
20      * least twice that of our own weight (i.e. dont track it
21      * when there are only lesser-weight tasks around):
22      */
23     if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
24         se->statistics.slice_max = max(se->statistics.slice_max,
25             se->sum_exec_runtime - se->prev_sum_exec_runtime);
26     }
27 #endif
28     se->prev_sum_exec_runtime = se->sum_exec_runtime;
29 }

該函數(shù)將下一個被調(diào)度實體從就緒隊列中取出。第12行實現(xiàn)取出操作，待會分析該函數(shù)。第16行將取出的調(diào)度實體指針賦給就緒隊列的curr，那么此時就有了正在運行的進程了。后邊的代碼更新當(dāng)前進程的時間統(tǒng)計信息。

6.就緒隊列的入隊和出隊操作（kernel/sched/fair.c）。

 1 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 2 {
 3     struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
 4     struct rb_node *parent = NULL;
 5     struct sched_entity *entry;
 6     int leftmost = 1;
 7 
 8     /*
 9      * Find the right place in the rbtree:
10      */
11     while (*link) {
12         parent = *link;
13         entry = rb_entry(parent, struct sched_entity, run_node);
14         /*
15          * We dont care about collisions. Nodes with
16          * the same key stay together.
17          */
18         if (entity_before(se, entry)) {
19             link = &parent->rb_left;
20         } else {
21             link = &parent->rb_right;
22             leftmost = 0;
23         }
24     }
25 
26     /*
27      * Maintain a cache of leftmost tree entries (it is frequently
28      * used):
29      */
30     if (leftmost)
31         cfs_rq->rb_leftmost = &se->run_node;
32 
33     rb_link_node(&se->run_node, parent, link);
34     rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
35 }

該函數(shù)實現(xiàn)入隊操作。第3行獲取就緒隊列中紅黑樹的根節(jié)點，將樹根指針保存在link中。第12行parent暫時指向樹根。第13行獲得樹根節(jié)點的調(diào)度實體，保存在entry中。第18-22行，比較要入隊的實體中的已運行虛擬時間和樹根實體中的該信息，如果前者小的話，就要插入到樹的左子樹上（link指向樹根的左孩子，再次進入循環(huán)，類似于遞歸），否則就要插入到樹的右子樹上（同上）。這塊就將進程的調(diào)度策略展現(xiàn)的淋漓盡致：根據(jù)進程已運行的虛擬時間來決定進程的調(diào)度，紅黑樹的左子樹比右子樹要先被調(diào)度，已運行的虛擬時間越小的進程越在樹的左側(cè)。第30-31行，如果入隊的實體最終被插在左孩子上（該入隊實體仍是就緒隊列上最靠前的實體，下次就會被調(diào)用），那么就要讓就緒隊列的rb_leftmost指向入隊實體。rb_leftmost指針始終指向下次要被調(diào)度的實體（進程）。最后兩行要給紅黑數(shù)重新著色。

 1 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 2 {
 3     if (cfs_rq->rb_leftmost == &se->run_node) {
 4         struct rb_node *next_node;
 5 
 6         next_node = rb_next(&se->run_node);
 7         cfs_rq->rb_leftmost = next_node;
 8     }
 9 
10     rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
11 }

該函數(shù)實現(xiàn)出隊操作。第3行判斷要出隊的實體是不是紅黑樹最左側(cè)的孩子（rb_leftmost所指向的），如果不是的話，第10行直接刪除該出隊實體。否則執(zhí)行if體，第6行找到出隊實體的下一個實體（中序遍歷的下一個節(jié)點，也就是當(dāng)出隊實體刪除后，最左邊的孩子），賦給next_node。第7行讓rb_leftmost指向next_node。在刪除掉要出隊實體后，下一個需要被調(diào)度的實體也就設(shè)置好了。

7.睡眠進程被喚醒后搶占當(dāng)前進程。當(dāng)某個資源空出來后，等待該資源的進程就會被喚醒，喚醒后也許就要搶占當(dāng)前進程，因此這塊的函數(shù)也需要分析（kernel/sched/core.c）。

 1 static int
 2 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 3 {
 4     unsigned long flags;
 5     int cpu, success = 0;
 6 
 7     /*
 8      * If we are going to wake up a thread waiting for CONDITION we
 9      * need to ensure that CONDITION=1 done by the caller can not be
10      * reordered with p->state check below. This pairs with mb() in
11      * set_current_state() the waiting thread does.
12      */
13     smp_mb__before_spinlock();
14     raw_spin_lock_irqsave(&p->pi_lock, flags);
15     if (!(p->state & state))
16         goto out;
17 
18     success = 1; /* we're going to change ->state */
19     cpu = task_cpu(p);
20 
21     if (p->on_rq && ttwu_remote(p, wake_flags))
22         goto stat;
23 
24 #ifdef CONFIG_SMP
25     /*
26      * If the owning (remote) cpu is still in the middle of schedule() with
27      * this task as prev, wait until its done referencing the task.
28      */
29     while (p->on_cpu)
30         cpu_relax();
31     /*
32      * Pairs with the smp_wmb() in finish_lock_switch().
33      */
34     smp_rmb();
35 
36     p->sched_contributes_to_load = !!task_contributes_to_load(p);
37     p->state = TASK_WAKING;
38 
39     if (p->sched_class->task_waking)
40         p->sched_class->task_waking(p);
41 
42     cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
43     if (task_cpu(p) != cpu) {
44         wake_flags |= WF_MIGRATED;
45         set_task_cpu(p, cpu);
46     }
47 #endif /* CONFIG_SMP */
48 
49     ttwu_queue(p, cpu);
50 stat:
51     ttwu_stat(p, cpu, wake_flags);
52 out:
53     raw_spin_unlock_irqrestore(&p->pi_lock, flags);
54 
55     return success;
56 }

該函數(shù)會喚醒參數(shù)p指定的進程，將進程加入就緒隊列中等待調(diào)度。第15行判斷p進程的狀態(tài)碼和傳進來的狀態(tài)碼是否一致，不一致的話函數(shù)結(jié)束（不一致則說明進程等待的條件未滿足）。第19行獲取要喚醒進程p原先所在的cpu號。第37行設(shè)置要喚醒進程p的狀態(tài)為TASK_WAKING。第40行執(zhí)行進程p的調(diào)度類中的task_waking函數(shù)，該函數(shù)指針指向了task_waking_fair函數(shù)，該函數(shù)主要是對睡眠進程的已運行虛擬時間進行補償，待會分析該函數(shù)。第42行為剛喚醒進程p選擇一個合適的就緒隊列供其加入，返回就緒隊列所在的cpu號。第43行如果進程p所在的原先cpu和為它挑選的cpu不是同一個的話，第45行更改進程p的cpu號。

 1 void wake_up_new_task(struct task_struct *p)
 2 {
 3     unsigned long flags;
 4     struct rq *rq;
 5 
 6     raw_spin_lock_irqsave(&p->pi_lock, flags);
 7 #ifdef CONFIG_SMP
 8     /*
 9      * Fork balancing, do it here and not earlier because:
10      *  - cpus_allowed can change in the fork path
11      *  - any previously selected cpu might disappear through hotplug
12      */
13     set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
14 #endif
15 
16     /* Initialize new task's runnable average */
17     init_task_runnable_average(p);
18     rq = __task_rq_lock(p);
19     activate_task(rq, p, 0);
20     p->on_rq = 1;
21     trace_sched_wakeup_new(p, true);
22     check_preempt_curr(rq, p, WF_FORK);
23 #ifdef CONFIG_SMP
24     if (p->sched_class->task_woken)
25         p->sched_class->task_woken(rq, p);
26 #endif
27     task_rq_unlock(rq, p, &flags);
28 }

該函數(shù)用來喚醒剛創(chuàng)建好的進程，而上個函數(shù)是用來喚醒睡眠中的進程。第13行為將喚醒的進程p設(shè)置合適的cpu。第17行設(shè)置進程p的可運行時間長度（類似“時間片”），第19行activate_task函數(shù)主要將喚醒的進程p加入就緒隊列,并更新隊列的時間,初始化進程p的時間等，該函數(shù)中的調(diào)用關(guān)系大致為activate_task->enqueue_task->enqueue_task_fair(p->sched_class->enqueue_task)->enqueue_entity。第22行check_preempt_curr函數(shù)調(diào)用了check_preempt_wakeup函數(shù)，來檢查喚醒進程是否能搶占當(dāng)前進程，下面分析該函數(shù)（kernel/sched/fair.c）。

 1 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 2 {
 3     struct task_struct *curr = rq->curr;
 4     struct sched_entity *se = &curr->se, *pse = &p->se;
 5     struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 6     int scale = cfs_rq->nr_running >= sched_nr_latency;
 7     int next_buddy_marked = 0;
 8 
 9     if (unlikely(se == pse))
10         return;
11 
12     /*
13      * This is possible from callers such as move_task(), in which we
14      * unconditionally check_prempt_curr() after an enqueue (which may have
15      * lead to a throttle).  This both saves work and prevents false
16      * next-buddy nomination below.
17      */
18     if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
19         return;
20 
21     if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
22         set_next_buddy(pse);
23         next_buddy_marked = 1;
24     }
25 
26     /*
27      * We can come here with TIF_NEED_RESCHED already set from new task
28      * wake up path.
29      *
30      * Note: this also catches the edge-case of curr being in a throttled
31      * group (e.g. via set_curr_task), since update_curr() (in the
32      * enqueue of curr) will have resulted in resched being set.  This
33      * prevents us from potentially nominating it as a false LAST_BUDDY
34      * below.
35      */
36     if (test_tsk_need_resched(curr))
37         return;
38 
39     /* Idle tasks are by definition preempted by non-idle tasks. */
40     if (unlikely(curr->policy == SCHED_IDLE) &&
41         likely(p->policy != SCHED_IDLE))
42         goto preempt;
43 
44     /*
45      *  do not preempt non-idle tasks (their preemption
46      * is driven by the tick):
47      */
48     if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
49         return;
50 
51     find_matching_se(&se, &pse);
52     update_curr(cfs_rq_of(se));
53     BUG_ON(!pse);
54     if (wakeup_preempt_entity(se, pse) == 1) {
55         /*
56          * Bias pick_next to pick the sched entity that is
57          * triggering this preemption.
58          */
59         if (!next_buddy_marked)
60             set_next_buddy(pse);
61         goto preempt;
62     }
63 
64     return;
65 
66 preempt:
67     resched_task(curr);
68     /*
69      * Only set the backward buddy when the current task is still
70      * on the rq. This can happen when a wakeup gets interleaved
71      * with schedule on the ->pre_schedule() or idle_balance()
72      * point, either of which can * drop the rq lock.
73      *
74      * Also, during early boot the idle thread is in the fair class,
75      * for obvious reasons its a bad idea to schedule back to it.
76      */
77     if (unlikely(!se->on_rq || curr == rq->idle))
78         return;
79 
80     if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
81         set_last_buddy(se);
82 }

第21-24行，如果開啟了NEXT_BUDDY并且喚醒的進程不是新進程（而是睡眠進程），那么第22行就將cfs_rq的next指針指向喚醒的進程，并且設(shè)置標(biāo)記。第36行如果當(dāng)前進程可以被搶占，函數(shù)直接返回。否則，第40-42行如果當(dāng)前進程是空閑進程并且被喚醒的進程不是空閑進程，則跳到preempt處，設(shè)置need_resched標(biāo)志，完成搶占設(shè)置。第48行，如果被喚醒進程是空閑進程或者批處理進程，直接返回，這些進程不能搶占別的進程。第51行如果當(dāng)前進程和被喚醒進程不在同一級別（同一個調(diào)度組），則尋找它們的祖先parent，把它們調(diào)整到同一級別，才能進行虛擬運行時間的比較，進而決定能否搶占。第54行，對當(dāng)前進程和被喚醒進程的虛擬運行時間進行比較，可以搶占的話（喚醒進程的虛擬時間小于當(dāng)前進程）執(zhí)行if體，跳到preempt處完成搶占。否則所有都不滿足的話，當(dāng)前進程不能被搶占，執(zhí)行第64行返回，隨后分析該函數(shù)。第80-81行如果開啟了LAST_BUDDY，就將cfs_rq的last指針指向喚醒進程的進程。在pick_next_entity函數(shù)中，next和last所指的進程會先于就緒隊列的left進程被選擇。下面分析下wakeup_preempt_entity函數(shù)（kernel/sched/fair.c）。

 1 static int
 2 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 3 {
 4     s64 gran, vdiff = curr->vruntime - se->vruntime;
 5 
 6     if (vdiff <= 0)
 7         return -1;
 8 
 9     gran = wakeup_gran(curr, se);
10     if (vdiff > gran)
11         return 1;
12 
13     return 0;
14 }

該函數(shù)是要保證在se實體在搶占curr實體時，curr實體已經(jīng)運行過一段時間（具體而言，物理時間1ms），第9行wakeup_gran函數(shù)是將sysctl_sched_wakeup_granularity的值（1ms）轉(zhuǎn)換成se實體的虛擬時間，保存在gran中，第10行比較vdiff和gran大小，實際上是比較curr->vruntime 和 se->vruntime+gran，因此就是想讓curr實體多執(zhí)行g(shù)ran時間，才能被搶占。

最后我們再分析下 try_to_wake_up函數(shù)中第40行遺留的那個函數(shù)指針task_waking，該指針指向了task_waking_fair函數(shù)，代碼如下（kernel/sched/fair.c）：

 1 static void task_waking_fair(struct task_struct *p)
 2 {
 3     struct sched_entity *se = &p->se;
 4     struct cfs_rq *cfs_rq = cfs_rq_of(se);
 5     u64 min_vruntime;
 6 
 7 #ifndef CONFIG_64BIT
 8     u64 min_vruntime_copy;
 9 
10     do {
11         min_vruntime_copy = cfs_rq->min_vruntime_copy;
12         smp_rmb();
13         min_vruntime = cfs_rq->min_vruntime;
14     } while (min_vruntime != min_vruntime_copy);
15 #else
16     min_vruntime = cfs_rq->min_vruntime;
17 #endif
18 
19     se->vruntime -= min_vruntime;
20     record_wakee(p);
21 }

該函數(shù)完成對睡眠進程的虛擬時間補償。考慮到睡眠時間長時間沒有運行，因此第19行從喚醒進程se的已運行虛擬時間中減去就緒隊列的最小虛擬時間，做點補償，讓其可以多運行一點時間。

8.新進程的處理函數(shù)（kernel/sched/fair.c）：

 1 static void task_fork_fair(struct task_struct *p)
 2 {
 3     struct cfs_rq *cfs_rq;
 4     struct sched_entity *se = &p->se, *curr;
 5     int this_cpu = smp_processor_id();
 6     struct rq *rq = this_rq();
 7     unsigned long flags;
 8 
 9     raw_spin_lock_irqsave(&rq->lock, flags);
10 
11     update_rq_clock(rq);
12 
13     cfs_rq = task_cfs_rq(current);
14     curr = cfs_rq->curr;
15 
16     /*
17      * Not only the cpu but also the task_group of the parent might have
18      * been changed after parent->se.parent,cfs_rq were copied to
19      * child->se.parent,cfs_rq. So call __set_task_cpu() to make those
20      * of child point to valid ones.
21      */
22     rcu_read_lock();
23     __set_task_cpu(p, this_cpu);
24     rcu_read_unlock();
25 
26     update_curr(cfs_rq);
27 
28     if (curr)
29         se->vruntime = curr->vruntime;
30     place_entity(cfs_rq, se, 1);
31 
32     if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
33         /*
34          * Upon rescheduling, sched_class::put_prev_task() will place
35          * 'current' within the tree based on its new key value.
36          */
37         swap(curr->vruntime, se->vruntime);
38         resched_task(rq->curr);
39     }
40 
41     se->vruntime -= cfs_rq->min_vruntime;
42 
43     raw_spin_unlock_irqrestore(&rq->lock, flags);
44 }

該函數(shù)在do_fork--->copy_process函數(shù)中調(diào)用，用來設(shè)置新創(chuàng)建進程的虛擬時間信息。第5行獲取當(dāng)前的cpu號，第23行為新進程設(shè)置該cpu號。第29行將當(dāng)前進程（父進程）的虛擬運行時間拷貝給新進程（子進程）。第30行的place_entity函數(shù)完成新進程的“時間片”計算以及虛擬時間懲罰，之后將新進程加入紅黑數(shù)中，待會分析。第32行如果設(shè)置了子進程先于父進程運行的標(biāo)志并且當(dāng)前進程不為空且當(dāng)前進程已運行的虛擬時間比新進程小，則執(zhí)行if體，第37行交換當(dāng)前進程和新進程的虛擬時間（新進程的虛擬時間變小，就排在了紅黑樹的左側(cè)，當(dāng)前進程之前，下次就能被調(diào)度），第38行設(shè)置重新調(diào)度標(biāo)志。第41行給新進程的虛擬運行時間減去隊列的最小虛擬時間來做一點補償（因為在上邊的place_entity函數(shù)中給新進程的虛擬時間加了一次min_vruntime，所以在這里要減去），再來看看place_entity函數(shù)（kernel/sched/fair.c）：

 1 static void
 2 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 3 {
 4     u64 vruntime = cfs_rq->min_vruntime;
 5 
 6     /*
 7      * The 'current' period is already promised to the current tasks,
 8      * however the extra weight of the new task will slow them down a
 9      * little, place the new task so that it fits in the slot that
10      * stays open at the end.
11      */
12     if (initial && sched_feat(START_DEBIT))
13         vruntime += sched_vslice(cfs_rq, se);
14 
15     /* sleeps up to a single latency don't count. */
16     if (!initial) {
17         unsigned long thresh = sysctl_sched_latency;
18 
19         /*
20          * Halve their sleep time's effect, to allow
21          * for a gentler effect of sleepers:
22          */
23         if (sched_feat(GENTLE_FAIR_SLEEPERS))
24             thresh >>= 1;
25 
26         vruntime -= thresh;
27     }
28 
29     /* ensure we never gain time by being placed backwards. */
30     se->vruntime = max_vruntime(se->vruntime, vruntime);
31 }

該函數(shù)完成新進程的“時間片”計算和虛擬時間懲罰，并且將新進程加入就緒隊列。第4行將就緒隊列的min_vruntime值存入vruntime中，第12-13行，如果initial標(biāo)志為1的話（說明當(dāng)前計算的是新進程的時間），將計算出的新進程的虛擬時間片累加到vruntime中，累加到原因是調(diào)度系統(tǒng)要保證先把就緒隊列中的所有的進程執(zhí)行一遍之后才能執(zhí)行新進程，一會具體解釋。第16-17行，如果當(dāng)前計算的不是新進程（睡眠的進程），把一個延遲周期的長度sysctl_sched_latency（6ms）賦給thresh，第24行thresh減半，第26行睡眠進程的虛擬運行時間減去減半后的thresh，因為睡眠進程好長時間未運行，因此要進行虛擬時間補償，把它已運行的虛擬時間減小一點，使得它能多運行一會。第30行將設(shè)置好的虛擬時間保存到進程調(diào)度實體的vruntime域。下面解釋下為什么要對新進程進行虛擬時間懲罰，其實原因只有一個，就是調(diào)度系統(tǒng)要保證將就緒隊列中現(xiàn)有的進程執(zhí)行一遍之后再執(zhí)行新進程，那么就必須使新進程的 vruntime=cfs_rq->min_vruntime+新進程的虛擬時間片，才能使得新進程插入到紅黑樹的右邊，最后參與調(diào)度，不然無法保證所有進程在新進程之前執(zhí)行。

最后，分析下和調(diào)度相關(guān)的這些函數(shù)執(zhí)行的時機

前面在介紹這些函數(shù)的時候，基本上都提到了會在哪里調(diào)用這些函數(shù)，最后，我們再系統(tǒng)總結(jié)一下：

進程調(diào)度分為兩個部分：一是進程信息的修改，二是進程切換。進程切換只有一個函數(shù)schedule，schedule的運行時機我們最后分析。其它函數(shù)的運行時機如下：

1.scheduler_tick函數(shù)是在每個時鐘中斷中被調(diào)用，用來更新當(dāng)前進程運行的時間。

2.effective_prio函數(shù)是在創(chuàng)建一個新進程或者用戶使用nice系統(tǒng)調(diào)用設(shè)置進程的優(yōu)先級時調(diào)用，用來設(shè)置進程的在內(nèi)核中優(yōu)先級（不同于nice值）。

3.set_load_weight函數(shù)也是在創(chuàng)建新進程或者用戶使用nice()設(shè)置進程的優(yōu)先級時調(diào)用，用來設(shè)置進程的權(quán)重。該函數(shù)和2中的函數(shù)其實是配套使用的，當(dāng)設(shè)置或者改變了一個進程的優(yōu)先級時，要么就要為這個進程設(shè)置或者改變該優(yōu)先級對應(yīng)的權(quán)重。

4.sched_slice函數(shù)主要是在scheduler_tick->...->check_preempt_tick中調(diào)用（別的地方也有），因此也是每個時鐘周期調(diào)用一次，用來計算當(dāng)前進程應(yīng)該執(zhí)行的“時間片”，進而才能判斷進程是否已經(jīng)超出它的時間片，超出的話就要設(shè)置搶占標(biāo)志，切換別的進程。

5.pick_next_task_fair函數(shù)schedule中調(diào)用，用來選擇下一個要被調(diào)度的進程，然后才能切換進程。它的執(zhí)行時機就是schedule的執(zhí)行時機，隨后分析。

6.__enqueue_entity/__dequeue_entity函數(shù)是在需要入就緒隊列或者出就緒隊列的地方被調(diào)用，因此使用它們的地方較多，比如schedule中調(diào)用（間接調(diào)用），就不一一分析了。

7.try_to_wake_up/wake_up_new_task函數(shù)，前者在進程所等待的資源滿足時被調(diào)用（一般在中斷內(nèi)調(diào)用）；后者是在創(chuàng)建好新進程后被調(diào)用。都是用來喚醒進程的，前者喚醒睡眠進程，后者喚醒新進程并將進程加入就緒隊列。

8.task_fork_fair函數(shù)也是在新進程被創(chuàng)建好后調(diào)用，用來設(shè)置新進程的“時間片”等信息，設(shè)置好以后新進程就可以被喚醒了。所以該函數(shù)和wake_up_new_task函數(shù)調(diào)用時機是一樣的。

最后，我們分析schedule函數(shù)的調(diào)用時機。該函數(shù)是唯一實現(xiàn)進程切換的函數(shù)。

在分析schedule函數(shù)的調(diào)用時機之前，我們先為大家介紹下“內(nèi)核控制路徑“的概念。所謂內(nèi)核控制路徑，就是由中斷或者異常引出的執(zhí)行路徑，說白了，執(zhí)行中斷或者異常處理程序時就處在內(nèi)核控制路徑中，此時代表的也是當(dāng)前進程。內(nèi)核控制路徑可以嵌套（也就是可以嵌套中斷），但是無論嵌套多少，最終都要回到當(dāng)前進程中，也就是說要從內(nèi)核控制路徑中返回，不可以在內(nèi)核控制路徑中進行進程切換（因此中斷處理程序中不允許調(diào)用能引起進程睡眠的函數(shù)）。說到底，內(nèi)核要么處在進程中，要么處在內(nèi)核控制路徑中（其實內(nèi)核控制路徑也代表當(dāng)前進程，因為其有特殊性，所以我們單列出來談），不會再有別的什么路徑了。那么進程切換的時機，或者說schedule函數(shù)被調(diào)用的時機，也就只可能發(fā)生于上述兩種路徑中。那么，1.當(dāng)在進程中調(diào)用schedule函數(shù)時（就是ULK這本書上說的直接調(diào)用），表明當(dāng)前進程因為等待資源或者別的原因需要掛起，主動放棄使用cpu，調(diào)用schedule函數(shù)切換到別的進程中；2.當(dāng)在內(nèi)核控制路徑中調(diào)用schedule函數(shù)時（上邊說過了，內(nèi)核控制路徑中不允許切換進程），其實是在內(nèi)核控制路徑返回進程時調(diào)用（該時機就是ULK上說的延遲調(diào)用），說明有更重要的進程等待執(zhí)行，需要搶占當(dāng)前進程，因此在中斷處理程序/異常處理程序返回時都要去檢查當(dāng)前進程能否被搶占，可以搶占的話就要調(diào)用schedule函數(shù)進行進程切換，包括從系統(tǒng)調(diào)用中返回用戶空間時也要檢查（這是統(tǒng)一的，因為系統(tǒng)調(diào)用本身也是異常，因此從系統(tǒng)調(diào)用中返回相當(dāng)于從異常處理程序中返回，通過系統(tǒng)調(diào)用進入到內(nèi)核態(tài)也可以說是內(nèi)核控制路徑，但是一般不這么叫）當(dāng)前進程的搶占標(biāo)志，能發(fā)生搶占的話就要去調(diào)用schedule函數(shù)。至此，進程切換的時機就分析完了。很好記的，要么是進程上下文發(fā)生進程切換（主動調(diào)用schedule），要么是從中斷返回時切換，因此每次中斷返回時必須要檢查能否發(fā)生搶占，包括從系統(tǒng)調(diào)用返回也屬于這種情形。

至此，進程調(diào)度機制咱們就分析完了（其實只分析了CFS調(diào)度）。實時進程調(diào)度以后再分析！

參考書籍：《深入理解linux內(nèi)核》

　　　　　《深入理解linux內(nèi)核架構(gòu)》

參考文章：blog.csdn.net/wudongxu/article/details/8574737

　　　　　blog.csdn.net/dog250/article/details/5302869

　　　　 chxxxyg.blog.163.com/blog/static/1502811932012912546208/