
async pf

Introduction

On architectures that support EPT, the GVA->HPA translation is generally split into two mappings:

  • GVA->GPA
  • GPA->HPA

The host kernel (kvm) only has to care about the GPA->HPA mapping; the main things the host has to do are:

  1. Catch the relevant VM-exit event (EPT violation) and obtain the GPA
  2. Allocate a page
  3. Establish the mapping (this is not only the GPA->HPA mmu pgtable, there is also HVA – GPA, which we will not expand on here; in short, once a concrete page (an HPA) is allocated and the mmu pgtable is built for it, handling of the event is complete)

As shown in the figure below:

[Figure] guest: access a VA -> trigger #PF in VMX non-root operation -> map GVA->GPA in the guest #PF handler -> fix up the #PF and retry the access -> EPT mapping (GPA->HPA) missing, EPT violation -> VM exit; host: find HVA from the GPA -> GUP(HVA) -> map GPA->HPA -> fix up the EPT violation, VM entry, and the access resumes.

However, a page whose mapping has already been established is still part of the qemu process's virtual address space (anonymous pages) and can be swapped out. After it has been swapped out, a GUEST access to the GVA/GPA backed by that HPA triggers an EPT violation again. That means another VM-exit, and the three steps above have to be done again; in the second step, allocating the page now means swapping in the page that was swapped out earlier, so the path is rather long:

VM-exit
  handle_ept_violation
    kvm_mmu_page_fault
      tdp_page_fault
        gfn_to_pfn
          hva_to_pfn
            get_user_pages --slow path

get_user_pages goes down the slow path here, and because it has to go through the swap-in flow it is slow. So the idea came up: can this be done asynchronously? Let the vcpu not complete the instruction that caused the EPT violation for now, go do something else, and only re-execute the instruction once the page is present. In addition, hand get_user_pages off to a dedicated thread; from the virtual machine's point of view this is like adding an extra piece of hardware dedicated to handling swap-in, freeing up vcpu compute.

NOTE

Think about it: to achieve this, the GUEST must, knowingly or not, schedule out the process that caused the EPT violation.

The flow above is summarized below:



[Figure] TASK1 in the guest accesses a memory address and the page is not present, triggering an EPT violation (1); the host kvm vcpu thread finds the page swapped out (2) and, not wanting to spend time on the slow path itself, starts a work item on a dedicated thread (3); it tells the guest that the work is in progress and that it should do other things (4-5), so TASK1 is scheduled out in favor of task2 (6); when the dedicated thread finishes the swap-in (a) it tells the guest (b), which schedules TASK1 back in (c).


As the figure shows, the point of introducing async pf is that, after an EPT violation is triggered, the VCPU can be scheduled to another task, blocking the process that triggered the EPT violation. To achieve this, the following changes were made:

  • When the VCPU thread runs get_user_page(), it only takes the fast path; if the page is not present, the interface returns immediately and the remaining work is handed to a dedicated thread
  • KVM uses certain mechanisms to make the GUEST schedule, so that the instruction that triggered the EPT violation is not executed again right away. Once the dedicated thread has completed the swap-in, it notifies the guest to wake up the process that was scheduled out earlier

Code details

para virt interface

A typical paravirtualized implementation usually has the following characteristics:

  • use CPUID to report the feature
  • use an MSR to convey a small amount of information, e.g.:
    • a shared memory address
    • enable/disable
  • use shared memory to convey larger amounts of information

para virt async PF is implemented the same way.

In v1, Add shared memory hypercall to PV Linux guest, the author implemented the paravirtualized interface as a hypercall, but avi then suggested (link) using an MSR instead, because an MSR already has save/restore interfaces for INIT and live migration.

The original text:

Better to set this up as an MSR (with bit zero enabling, bits 1-5 
features, and 64-byte alignment).  This allows auto-reset on INIT and 
live migration using the existing MSR save/restore infrastructure.


Interface flow diagram

[Figure] The cpuid bit KVM_FEATURE_ASYNC_PF indicates that MSR_KVM_ASYNC_PF_EN may be accessed; bits 63-6 of that MSR hold the 64-byte aligned GPA of the shared memory, bits 5-1 are reserved, bit 0 enables the feature. Page-not-present APF (1-6): the host generates a token and writes it to CR2, sets the APF reason in shared memory to 1 and injects the #APF; the guest #PF handler reads the reason from shared memory, reads the token from CR2, binds the token to the thread to be scheduled out, and schedules it out. Page-present APF (a-f): the host writes the previous token to CR2, sets the reason to 2 and injects the #APF; the guest reads the reason, finds the sleeping thread by the token from CR2 and wakes it up.
The figure shows the register/shared-memory operations performed by host and guest while handling async pf.

As can be seen, cpuid, MSR_KVM_ASYNC_PF_EN and shared memory are involved; and because the async pf implementation injects #PF, CR2 is involved as well.

cpuid

A new paravirt cpuid bit is added: KVM_FEATURE_ASYNC_PF

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
+#define KVM_FEATURE_ASYNC_PF		4

Documentation for this bit:

diff --git a/Documentation/kvm/cpuid.txt b/Documentation/kvm/cpuid.txt
+KVM_FEATURE_ASYNC_PF               ||     4 || async pf can be enabled by
+                                   ||       || writing to msr 0x4b564d02

Roughly: if this cpuid bit is set, async pf can be enabled by writing to MSR 0x4b564d02.

MSR – shared mem addr && enable bit

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
+#define MSR_KVM_ASYNC_PF_EN 0x4b564d02

Documentation:

diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
+ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
+   data: Bits 63-6 hold 64-byte aligned physical address of a
+   64 byte memory area which must be in guest RAM and must be
+   zeroed. Bits 5-1 are reserved and should be zero. Bit 0 is 1
+   when asynchronous page faults are enabled on the vcpu 0 when
+   disabled.
    

This section describes the MSR bit layout:

  • MSR bits
    • Bits [63, 6]: a 64-byte aligned physical address
    • Bits [5, 1]: reserved
    • Bit 0: enable bit

The documentation also describes the shared memory format and CR2; to keep things readable we split those into the subsections below.
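
For illustration, a guest would enable the feature by OR-ing the 64-byte aligned GPA of the shared area into the MSR together with the enable bit. Below is a minimal sketch of that register write; the helper name apf_enable() and the variable apf_reason_gpa are made up for illustration, only MSR_KVM_ASYNC_PF_EN, KVM_ASYNC_PF_ENABLED and the bit layout come from the patch:

#define MSR_KVM_ASYNC_PF_EN   0x4b564d02
#define KVM_ASYNC_PF_ENABLED  (1 << 0)     /* bit 0: enable; bits 5-1: reserved */

static void apf_enable(u64 apf_reason_gpa)
{
        /* bits 63-6 carry the 64-byte aligned GPA, bit 0 turns the feature on */
        u64 val = (apf_reason_gpa & ~0x3fULL) | KVM_ASYNC_PF_ENABLED;

        wrmsrl(MSR_KVM_ASYNC_PF_EN, val);
}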

shared memory structure – APF reason

diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
    ...
+   First 4 byte of 64 byte memory location will be written to by
+   the hypervisor at the time of asynchronous page fault (APF)
+   injection to indicate type of asynchronous page fault. Value
+   of 1 means that the page referred to by the page fault is not
+   present. Value 2 means that the page is now available. Disabling
+   interrupt inhibits APFs. Guest must not enable interrupt
+   before the reason is read, or it may be overwritten by another
+   APF. Since APF uses the same exception vector as regular page
+   fault guest must reset the reason to 0 before it does
+   something that can generate normal page fault.  If during page
+   fault APF reason is 0 it means that this is regular page
+   fault.


The shared memory is 64 bytes in total; the first 4 bytes (32 bits) indicate the apf type. Host kvm writes the type to this address before injecting the apf (a sketch of the guest-side layout follows the list below).

There are two APF types (APF reasons):

  • 1: the page is not present
  • 2: a not-present page has become available

In addition, host and guest must respect the following constraints while handling an APF:

  • if the guest has interrupts disabled, the host must not inject an apf
  • the guest must finish handling the current apf before enabling interrupts
  • the guest must finish handling the current apf, and reset the reason to 0, before doing anything that can trigger a normal #PF
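
As a sketch, the guest-side view of the 64-byte area could look like the following; the layout follows the upstream struct kvm_vcpu_pv_apf_data as I understand it, so treat the exact fields as an assumption (only the 4-byte reason and the two reason values are stated by the documentation above):

#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
#define KVM_PV_REASON_PAGE_READY       2

/* 64-byte, 64-byte aligned area whose GPA is programmed into MSR_KVM_ASYNC_PF_EN */
struct kvm_vcpu_pv_apf_data {
        __u32 reason;        /* APF reason, written by the host right before injection */
        __u8  pad[60];       /* rest of the area, must be zeroed by the guest */
} __attribute__((aligned(64)));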

CR2

diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
    ...
+   During delivery of type 1 APF cr2 contains a token that will
+   be used to notify a guest when missing page becomes
+   available. When page becomes available type 2 APF is sent with
+   cr2 set to the token associated with the page. There is special
+   kind of token 0xffffffff which tells vcpu that it should wake
+   up all processes waiting for APFs and no individual type 2 APFs
+   will be sent.

+   If APF is disabled while there are outstanding APFs, they will
+   not be delivered.

+   Currently type 2 APF will be always delivered on the same vcpu as
+   type 1 was, but guest should not rely on that.


cr2 carries a token that uniquely identifies the APF currently in flight. There is one special value, 0xffffffff, which tells the vcpu to wake up all processes waiting for an APF (type 2); no individual type 2 APFs will be sent for them.

There are a few more constraints and limitations:

  • if APF is disabled while there are still outstanding APFs, they will not be delivered
  • the guest should not rely on the type 2 APF being delivered on the same vcpu as the corresponding type 1 APF, even though that is how it is implemented today

Think about why a wake-up-all API like this is needed.

Picture the live migration scenario.

During live migration we first suspend the vcpus and then migrate memory; at that point all pages are swapped in before being migrated, but the source side can no longer inject the exceptions, so injection can only happen on the destination.

On the destination side all memory is present, so injecting a single wakeup-all is enough to wake every waiting task. (Of course, pages may get swapped out again in the meantime; that simply triggers another async pf.)

GUP change

The details of the GUP changes are covered separately in:

link

STRUCT – host

Overall data structure diagram

It is fairly simple, as follows:



[Figure] Each struct kvm_vcpu carries its own kvm_async_pf work items: the queue list links every work item belonging to the vcpu, while the done list links only the completed ones; vcpu 1, 2 and 3 have the same layout as vcpu 0.

Each vcpu has its own lists stringing together the async pf work that belongs to it; there are two of them:

  • queue:

    links all work items

  • done:

    links all completed work items



struct kvm_async_pf

This structure mainly describes the dedicated-thread work mentioned above.

struct kvm_async_pf {
        struct work_struct work;
        struct list_head link;
        struct list_head queue;
        struct kvm_vcpu *vcpu;
        struct mm_struct *mm;
        gva_t gva;
        unsigned long addr;
        struct kvm_arch_async_pf arch;
        struct page *page;
        bool done;
};
  • work: the dedicated-thread instance, using the workqueue mechanism
  • link: in this patch there is essentially one list it hooks into: the vcpu's completed-work list
  • queue: links all kvm_async_pf of this vcpu
  • gva: the GVA that triggered the EPT violation and needs get_user_page slow path
  • addr: hva
  • done: indicates whether this work has completed
  • kvm_arch_async_pf:

    struct kvm_arch_async_pf {
            u32 token;
            gfn_t gfn;
    };

    • token: uniquely identifies one async PF; it is computed from kvm_vcpu.arch.apf.id and vcpu->vcpu_id (see the sketch below). When the #PF is injected it is passed to the GUEST as CR2, so that the guest can keep track of each async PF.
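
A minimal sketch of how the token could be derived from the per-vcpu counter and the vcpu id inside kvm_arch_setup_async_pf(); the shift width is an assumption for illustration, only the fact that apf.id and vcpu_id are combined comes from the description above:

struct kvm_arch_async_pf arch;

arch.gfn   = gfn;
/* per-vcpu running id in the upper bits, vcpu id in the lower bits */
arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;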

The lists that kvm_async_pf->link and kvm_async_pf->queue hook into are shown below:

CHANGE of struct kvm_vcpu

@@ -104,6 +125,15 @@ struct kvm_vcpu {
    gpa_t mmio_phys_addr;
 #endif

+#ifdef CONFIG_KVM_ASYNC_PF
+   struct {
+       u32 queued;
+       struct list_head queue;
+       struct list_head done;
+       spinlock_t lock;
+   } async_pf;
+#endif
  • queue: links all kvm_async_pf (work)
  • done: links the completed kvm_async_pf (work)
  • lock: lock protecting the lists

change of struct kvm_vcpu_arch

struct kvm_vcpu_arch {
    ...
+   struct {
+           bool halted;
+           gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
+           struct gfn_to_hva_cache data;
+           u64 msr_val;
+           u32 id;
+           bool send_user_only;
+   } apf;
    ...
}

This structure was changed across several patches; the final layout is listed here.

  • halted: whether the vcpu was halted because of async PF
  • gfns: an array recording the gfn of every outstanding async pf work
  • data: effectively an HVA->HPA cache; the mapping exists and stays unchanged (in most cases, unless __kvm_set_memory_region changes it), and the HPA points to the shared memory mentioned above

    The author turned this into a generic facility, essentially memslot-cached kvm_put_guest() and kvm_get_guest(); its introduction and implementation are covered in another article.

  • msr_val: records the msr value set by the guest
  • id: the id of the next async pf work; together with kvm_vcpu->vcpu_id it uniquely identifies one async PF
  • send_user_only: when set, the host only performs async PF if the EPT violation was triggered from guest user space

STRUCT - GUEST

The guest data structures are mainly used to manage the tasks scheduled out because of async PF.

Data structure diagram

[Figure] async_pf_sleepers (kvm_task_sleep_head[]) is a hash table; each bucket links the kvm_task_sleep_node entries that hash to the same key. cpu0 (RUNNING) has two nodes, token [id=0, vcpu=0] for task0 and [id=1, vcpu=0] for task1, both halted=false, and is currently running task2; cpu1 (HALT) has a node for task3 (halted=false) and one for task4 (halted=true) and is halted in task4's context.
  • The figure shows 4 tasks involved in async PF; each task is associated with a kvm_task_sleep_node
  • kvm_task_sleep_head[]->link strings together all sleep_nodes with the same key, for easy lookup
  • Each kvm_task_sleep_node has a unique identifier, kvm_task_sleep_node->token
  • cpu0 triggered async PF twice earlier and the tasks involved were scheduled out; it is currently running task2
  • cpu1 also triggered async PF twice: when task3 triggered it, task3 was successfully scheduled out; when task4 triggered it, the guest vcpu could not schedule at that moment, so the cpu was halted. It is currently halted in task4's context.

kvm_task_sleep_head

static struct kvm_task_sleep_head {
        spinlock_t lock;
        struct hlist_head list;
} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];

This structure is a hash map keyed by the token (a sketch of the bucket lookup follows).

  • lock: note that each hash bucket has its own lock, which reduces contention
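
For reference, a minimal sketch of the bucket lookup used by the handlers below, based on the _find_apf_task() helper they call; treat the exact body as an assumption:

static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
                                                  u32 token)
{
        struct hlist_node *p;

        /* walk this bucket and return the node whose token matches */
        hlist_for_each(p, &b->list) {
                struct kvm_task_sleep_node *n =
                        hlist_entry(p, typeof(*n), link);
                if (n->token == token)
                        return n;
        }
        return NULL;
}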

kvm_task_sleep_node

struct kvm_task_sleep_node {
        struct hlist_node link;
        wait_queue_head_t wq;
        u32 token;
        int cpu;
        bool halted;
        struct mm_struct *mm;
};

This structure is the hash node; it describes each task that was scheduled out because of async pf.

It does not necessarily describe a task that has already been scheduled out; it may also carry information about a task that is about to be scheduled out, as we will see below.

  • wq: wait queue
  • token: as described above, uniquely identifies one async PF
  • halted: sometimes, at the moment kvm injects the async PF, the guest cannot schedule; to still avoid re-executing the faulting code path, the only option is to halt the cpu. This flag records whether the task halted the cpu

initiate async pf->inject async pf

As mentioned above, in order to use the GUP noio interface, the gfn_to_pfn call in tdp_page_fault was changed to try_async_pf (a sketch of the call site follows). Let us look at that interface.
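
For context, a hedged sketch of what the changed call site in tdp_page_fault() roughly looks like; the exact parameter list in the patch may differ:

pfn_t pfn;

/* was: pfn = gfn_to_pfn(vcpu->kvm, gfn); */
if (try_async_pf(vcpu, gfn, gva, &pfn))
        return 0;       /* an async pf work was queued, the fault will be retried later */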

try_async_pf

static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
            pfn_t *pfn)
{
   bool async;
   //==(1)==
   *pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);

   //==(2)==
   if (!async)
       return false; /* *pfn has correct page already */

   //==(3)==
   put_page(pfn_to_page(*pfn));

   //==(4)==
   if (can_do_async_pf(vcpu)) {
       trace_kvm_try_async_get_page(async, *pfn);
       //==(5)==
       if (kvm_find_async_pf_gfn(vcpu, gfn)) {
           trace_kvm_async_pf_doublefault(gva, gfn);
           kvm_make_request(KVM_REQ_APF_HALT, vcpu);
           return true;
       //==(6)==
       } else if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
           return true;
   }

   //==(7)==
   *pfn = gfn_to_pfn(vcpu->kvm, gfn);

   return false;
}
  1. As mentioned earlier, try_async_pf calls gfn_to_pfn_async(); async is an out-parameter indicating whether async pf is needed, and the return value is the pfn obtained for the gfn in the process
  2. If async comes back false, no async pf is needed and the pfn was certainly obtained, so return false directly
  3. put_page
  4. Check whether the current vcpu state allows doing async pf
    Details of can_do_async_pf
    
    +static bool can_do_async_pf(struct kvm_vcpu *vcpu)
    +{
    +	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
    +		     kvm_event_needs_reinjection(vcpu)))
    +		return false;
    +
    +	return kvm_x86_ops->interrupt_allowed(vcpu);
    +}
    

    Let us go over these three conditions in detail:

    • irqchip_in_kernel()
    • kvm_event_needs_reinjection():
      
      static inline bool kvm_event_needs_reinjection(struct kvm_vcpu *vcpu)
      {
              return vcpu->arch.exception.pending || vcpu->arch.interrupt.pending ||
                      vcpu->arch.nmi_injected;
      }
      

      As we can see, async pf is not allowed when other events are pending.

      My own understanding

      For pending events, refer to __vmx_complete_interrupts; without going too deep: during event injection at VM entry, something may trigger a VM exit before the VM entry completes, so those events were not actually injected and need to be injected at the next VM entry. In that situation the order looks like this:
      
      inject_event1->
        VM entry->
        VM exit(get uncomplete event)->
        get vm exit reason: EPT violation PAGE not present->
        (do some handler)->
        VM entry
      

      Now the question is: should this VM entry inject the async PF, or the uncompleted event?

      I believe the uncompleted event should be injected. It happened first; injecting the async pf without it would make the guest see that event as delayed.

      Moreover, the uncompleted event was preempted by the EPT violation, so right after this EPT violation is handled is exactly the time to inject it, and most likely another VM exit during event injection will not occur.

      This is my own understanding, and I am not sure that, while handling tdp_page_fault(), every pending event necessarily comes from a previously failed injection.

      Open question

      • interrupt_allowed: let us look at the intel vmx code
        
        static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
        {
                return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
                        !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
                                (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
        }
        

        This code mainly checks whether the interrupt window is currently open; we will not explain the individual conditions here, see the virtual interrupt article for details

        That article is not written yet

        Open item

        But we do need to understand why we care whether the guest can take interrupts at all, given that async pf injects a #PF. First, let us be clear about the following:

        My own understanding
        • Q: what is the purpose of async pf?
          • A: scheduling
        • Q: can that scheduling happen at any arbitrary point of guest execution?
          • A: it has to respect the guest's own willingness

        So, to sum up, async pf only makes sense when the guest itself believes it can schedule. Otherwise, even if we start a dedicated thread and ask the guest to schedule, the guest will not, and the whole exercise is pointless.

        With that in mind, consider the cases:

        • non-para virt: halt

          After the vcpu is halted, two kinds of events can wake it up:

          • interrupt
          • async pf work completion

          If interrupts cannot be injected into the guest, only the second kind can wake it up, which degenerates into the synchronous behaviour. Pointless.

        • para virt: this is the paravirtualized path, which amounts to asking the guest to actively perform a schedule, and that still has to respect the guest's willingness. It is essentially a negotiation with the guest, so we must care whether the guest can schedule at this moment. The author made it clear when documenting MSR_KVM_ASYNC_PF_EN that the host must not inject another async PF while the guest has interrupts disabled: the guest may still be inside the APF handler, and injecting another APF in that window could overwrite the APF information, for example:
          
          host                       guest             cr2
          write token(a) to cr2
                                                       value: a
          inject APF1
                                     trigger #PF
                                       (disable 
                                       interrupt
                                       in VM-entry)
                                     do some thing...
          
          write token(b) to cr2
                                                       value: b
          inject APF2
                                     intend to read
                                      cr2 to get 
                                      APF1 token, 
                                      loss it !!!
          

          In avi's self-Q&A we can also find an explanation of the interrupt-allowed check.

  5. This means the vcpu has already triggered an EPT violation on this address and set up an async pf for it, and now hits it again; the fault rate is apparently high, so just halt the vcpu (see the sketch after this list)

    ??? ??? ???

  6. Covered in the next subsection
  7. If none of the above conditions hold, just do the work synchronously.
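
Step 5 (and kvm_arch_async_page_not_present below) only raises KVM_REQ_APF_HALT; the request is consumed on the next iteration of the vcpu run loop. A rough sketch of that check, based on my reading of the patch (the surrounding variables r/out belong to the run-loop context and details may differ):

if (kvm_check_request(KVM_REQ_APF_HALT, vcpu)) {
        /* page is swapped out: do a synthetic halt instead of spinning on the fault */
        vcpu->arch.apf.halted = true;
        r = 1;
        goto out;
}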

kvm_setup_async_pf

int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
                       struct kvm_arch_async_pf *arch)
{
        struct kvm_async_pf *work;
        //==(1)==
        if (vcpu->async_pf.queued >= ASYNC_PF_PER_VCPU)
                return 0;

        /* setup delayed work */

        /*
         * do alloc nowait since if we are going to sleep anyway we
         * may as well sleep faulting in page
         */
        //==(2)==
        work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
        if (!work)
                return 0;

        work->page = NULL;
        work->done = false;
        work->vcpu = vcpu;
        work->gva = gva;
        work->addr = gfn_to_hva(vcpu->kvm, gfn);
        work->arch = *arch;
        work->mm = current->mm;
        atomic_inc(&work->mm->mm_count);
        kvm_get_kvm(work->vcpu->kvm);

        /* this can't really happen otherwise gfn_to_pfn_async
           would succeed */
        if (unlikely(kvm_is_error_hva(work->addr)))
                goto retry_sync;

        //==(2.1)==
        INIT_WORK(&work->work, async_pf_execute);
        //==(3)==
        if (!schedule_work(&work->work))
                goto retry_sync;

        //==(4)==
        list_add_tail(&work->queue, &vcpu->async_pf.queue);
        vcpu->async_pf.queued++;
        //==(5)==
        kvm_arch_async_page_not_present(vcpu, work);
        return 1;
retry_sync:
        kvm_put_kvm(work->vcpu->kvm);
        mmdrop(work->mm);
        kmem_cache_free(async_pf_cache, work);
        return 0;
}
  1. The number of per-vcpu async_pf (work) items has exceeded the limit – ASYNC_PF_PER_VCPU
  2. Allocate the work and do the related initialization; in (2.1) the work hook is set to async_pf_execute
  3. schedule the work
  4. Add the work to the vcpu->async_pf.queue list
  5. The code is as follows:
    
    void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
                                         struct kvm_async_pf *work)
    {
            trace_kvm_async_pf_not_present(work->arch.token, work->gva);
            kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
            //==(1)==
            if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
                (vcpu->arch.apf.send_user_only &&
                 kvm_x86_ops->get_cpl(vcpu) == 0))
                    kvm_make_request(KVM_REQ_APF_HALT, vcpu);
            //==(2)==
            else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
                    vcpu->arch.fault.error_code = 0;
                    vcpu->arch.fault.address = work->arch.token;
                    kvm_inject_page_fault(vcpu);
            }
    }
    
    1. Like can_do_async_pf, there are some conditions here as well that decide whether the current state is suitable for injecting async pf into the guest; we cover them below
    2. If injection is allowed, KVM_PV_REASON_PAGE_NOT_PRESENT is written into the guest/host shared memory (via apf_put_user, sketched below), indicating that a page-not-present async pf is being injected; the address and error code of the injected exception are also set up
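
apf_put_user() is what writes the reason into the shared area through the cached mapping mentioned earlier; a minimal sketch based on the patch (the exact body may differ):

static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
{
        /* write the APF reason into the 64-byte shared area via the gfn_to_hva_cache */
        return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &val,
                                      sizeof(val));
}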

async pf work

static void async_pf_execute(struct work_struct *work)
{
        struct page *page = NULL;
        struct kvm_async_pf *apf =
                container_of(work, struct kvm_async_pf, work);
        struct mm_struct *mm = apf->mm;
        struct kvm_vcpu *vcpu = apf->vcpu;
        unsigned long addr = apf->addr;
        gva_t gva = apf->gva;

        might_sleep();

        use_mm(mm);
        down_read(&mm->mmap_sem);
        //==(1)==
        get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
        up_read(&mm->mmap_sem);
        unuse_mm(mm);

        spin_lock(&vcpu->async_pf.lock);
        //==(2)==
        list_add_tail(&apf->link, &vcpu->async_pf.done);
        apf->page = page;
        apf->done = true;
        spin_unlock(&vcpu->async_pf.lock);

        /*
         * apf may be freed by kvm_check_async_pf_completion() after
         * this point
         */

        trace_kvm_async_pf_completed(addr, page, gva);

        //==(3)==
        if (waitqueue_active(&vcpu->wq))
                wake_up_interruptible(&vcpu->wq);

        mmdrop(mm);
        kvm_put_kvm(vcpu->kvm);
}
  1. Call get_user_pages; this interface can handle MAJOR faults

    If the pages parameter of get_user_pages() is not NULL, FOLL_GET is set

    
    int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
               unsigned long start, int nr_pages, int write, int force,
               struct page **pages, struct vm_area_struct **vmas)
    {
            int flags = FOLL_TOUCH;
    
            if (pages)
                    flags |= FOLL_GET;
            ...
    }
    

    If FOLL_GET is set, the page is pinned during get_user_pages(), i.e. get_page(). Note, however, that this interface can fail, and this flow does not appear to check whether it succeeded. IOW, whether or not it succeeds, the work is considered complete and the blocking GUEST thread is woken up again.

  2. Link the work into the vcpu->async_pf.done list
  3. If the vcpu is on the wait queue (halted), wake it up

Next, let us look at how the host detects the page-present event and injects the page-present async pf.

host inject PAGE PRESENT async pf

@@ -5272,6 +5288,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 			vcpu->run->exit_reason = KVM_EXIT_INTR;
 			++vcpu->stat.request_irq_exits;
 		}
+		
+		kvm_check_async_pf_completion(vcpu);
+
 		if (signal_pending(current)) {
 			r = -EINTR;

After a vm exit, check whether any async pf has completed

void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
{
        struct kvm_async_pf *work;
        //==(1)==
        if (list_empty_careful(&vcpu->async_pf.done) ||
            !kvm_arch_can_inject_async_page_present(vcpu))
                return;

        spin_lock(&vcpu->async_pf.lock);
        work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
        list_del(&work->link);
        spin_unlock(&vcpu->async_pf.lock);
        //==(2)==
        if (work->page)
                kvm_arch_async_page_ready(vcpu, work);
        //==(3)==
        kvm_arch_async_page_present(vcpu, work);

        list_del(&work->queue);
        vcpu->async_pf.queued--;
        if (work->page)
                put_page(work->page);
        kmem_cache_free(async_pf_cache, work);
}
  1. There are two conditions:
    • whether there is any completed work
    • whether the guest is currently in a state suitable for injecting a page-present async PF (see the sketch after this list)
  2. If work->page is NULL, get_user_pages() in the async work failed, so there is no need to call kvm_arch_async_page_ready() this time. That function re-runs tdp_page_fault; if the page is ready, only the get_user_page fast path and __direct_map are needed to establish the GPA->HPA mapping.

    If the page is not ready (work->page is NULL), the author's approach is to VM enter anyway, wake up the blocking guest thread, and let it trigger the EPT violation again and start another async pf; so there is no point in doing another kvm_arch_async_page_ready->tdp_page_fault here. One might ask: why not have the HOST wait until get_user_pages() definitely succeeds before injecting the page-present #PF?

    Honestly, I do not know either; it feels like the author simply did not want to add complicated logic here. Worth watching later patches to see whether this part gets optimized

    Open question

  3. kvm_arch_async_page_present
    
    void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
                                     struct kvm_async_pf *work)
    {
            trace_kvm_async_pf_ready(work->arch.token, work->gva);
            //==(1)==
            if (is_error_page(work->page))
                    work->arch.token = ~0; /* broadcast wakeup */
            else
                    kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
       
            //==(2)==
            if ((vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) &&
                !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
                    vcpu->arch.fault.error_code = 0;
                    vcpu->arch.fault.address = work->arch.token;
                    kvm_inject_page_fault(vcpu);
            }
    }
    
    1. Error pages are covered in another article.

      Open question

    2. Write KVM_PV_REASON_PAGE_READY (provided KVM_ASYNC_PF_ENABLED is set in the msr), preparing to inject the page-present async #PF
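
The second condition in step 1, kvm_arch_can_inject_async_page_present(), roughly mirrors the checks used on the not-present side; a hedged sketch based on my reading of the patch (details may differ):

bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
{
        /*
         * If the guest never enabled APF, the "present" event is only used to
         * unhalt the vcpu, so it can always be handled.
         */
        if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
                return true;

        /* otherwise a real #PF must be injected: same rules as the not-present case */
        return !kvm_event_needs_reinjection(vcpu) &&
                kvm_x86_ops->interrupt_allowed(vcpu);
}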

guest handle async PF

dotraplinkage void __kprobes
do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
{
        //==(1)==
        switch (kvm_read_and_reset_pf_reason()) {
        default:
        //==(2)==
                do_page_fault(regs, error_code);
                break;
        case KVM_PV_REASON_PAGE_NOT_PRESENT:
        //==(3)==
                /* page is swapped out by the host. */
                kvm_async_pf_task_wait((u32)read_cr2());
                break;
        //==(4)==
        case KVM_PV_REASON_PAGE_READY:
                kvm_async_pf_task_wake((u32)read_cr2());
                break;
        }
}

The logic here is straightforward: the async PF event reuses the original #PF exception vector, so in the exception handler the guest has to determine what kind of #PF this is and run the corresponding handler.

  1. Read the async pf reason from the shared memory (a sketch of the helper follows this list)
  2. indicates a NORMAL #PF
  3. indicates a PAGE NOT PRESENT async pf
  4. indicates a PAGE PRESENT async pf
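
kvm_read_and_reset_pf_reason() performs the read-then-reset required by the documentation (the reason must be cleared before anything that can take a normal fault). A minimal sketch, assuming apf_reason is the per-cpu shared area registered through MSR_KVM_ASYNC_PF_EN; the real helper also checks whether APF is enabled on this cpu:

static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);

static u32 kvm_read_and_reset_pf_reason(void)
{
        u32 reason = __get_cpu_var(apf_reason).reason;

        /* reset to 0 so a later regular #PF is not mistaken for an APF */
        __get_cpu_var(apf_reason).reason = 0;

        return reason;
}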

page not present async pf

void kvm_async_pf_task_wait(u32 token)
{
        u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
        struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
        struct kvm_task_sleep_node n, *e;
        DEFINE_WAIT(wait);
        int cpu, idle;

        cpu = get_cpu();
        idle = idle_cpu(cpu);
        put_cpu();

        spin_lock(&b->lock);
        //===(1)==
        e = _find_apf_task(b, token);
        if (e) {
                /* dummy entry exist -> wake up was delivered ahead of PF */
                hlist_del(&e->link);
                kfree(e);
                spin_unlock(&b->lock);
                return;
        }

        //===(2)==
        n.token = token;
        n.cpu = smp_processor_id();
        n.mm = current->active_mm;
        //===(2.1)==
        n.halted = idle || preempt_count() > 1;
        atomic_inc(&n.mm->mm_count);
        init_waitqueue_head(&n.wq);
        //===(3)==
        hlist_add_head(&n.link, &b->list);
        spin_unlock(&b->lock);

        for (;;) {
                //===(4)==
                if (!n.halted)
                        prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
                if (hlist_unhashed(&n.link))
                        break;

                //===(4)==
                if (!n.halted) {
                        local_irq_enable();
                        schedule();
                        local_irq_disable();
                } else {
                        /*
                         * We cannot reschedule. So halt.
                         */
                        native_safe_halt();
                        local_irq_disable();
                }
        }
        if (!n.halted)
                finish_wait(&n.wq, &wait);

        return;
}
  1. When documenting MSR_KVM_ASYNC_PF_EN in the kernel doc, the author mentioned that a [type 2 APF, type 1 APF] pair is not guaranteed to be triggered on the same vcpu, which means the two may effectively be handled in parallel (the current host kvm does not do this, but the guest must not rely on that), as follows:
    
    kvm                vcpu1      vcpu2
    1.inject type1 
      APF to VCPU1
                                          
    2. inject type2 
      APF to VCPU2
                                   3. handle type2 APF
                       4. handle
                        type1 APF
    

    As shown, kvm injects the type 1 APF and the type 2 APF in order, but onto different vcpus. When they are handled, the type 2 APF handler runs first; at that point the page is already present and there is no need to sched out. The type 2 APF handler pre-inserts a sleep_node carrying that token into the head, so that the type 1 APF handler can skip the sched out this time (see the type 2 APF handler – kvm_async_pf_task_wake() – below).

  2. Bind the task (current->active_mm) to the token, so that when the type 2 APF fires the blocked task can be found by token
    • Note that in some situations the guest cannot sched out; in that case the only option is to halt the current cpu

      We cover this in another article

      Open question

  3. Link the sleep_node into the sleep_head
  4. If the guest can schedule at this point, put the process into uninterruptible (D) sleep and sched it out

page present async pf

void kvm_async_pf_task_wake(u32 token)
{
        u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
        struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
        struct kvm_task_sleep_node *n;

        if (token == ~0) {
                apf_task_wake_all();
                return;
        }

again:
        spin_lock(&b->lock);
        //===(1)==
        n = _find_apf_task(b, token);
        //===(2)==
        if (!n) {
                /*
                 * async PF was not yet handled.
                 * Add dummy entry for the token.
                 */
                n = kmalloc(sizeof(*n), GFP_ATOMIC);
                if (!n) {
                        /*
                         * Allocation failed! Busy wait while other cpu
                         * handles async PF.
                         */
                        spin_unlock(&b->lock);
                        cpu_relax();
                        goto again;
                }
                n->token = token;
                n->cpu = smp_processor_id();
                n->mm = NULL;
                init_waitqueue_head(&n->wq);
                hlist_add_head(&n->link, &b->list);
        } else
        //===(3)==
                apf_task_wake_one(n);
        spin_unlock(&b->lock);
        return;
}
  1. Look up the sleep_node in the sleep_head by token
  2. The type 1 APF handler may not have run yet: the type 2 APF can be handled first, on a different cpu from the type 1 APF, in which case no sleep_node for this token is found in the sleep_head. A new sleep_node is created and added to the sleep_head so that the type 1 APF handler can find it and avoid blocking the task
  3. If a node is found, the type 1 APF handler has already run and the task is blocked; wake that task up (a sketch of apf_task_wake_one() follows)
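
apf_task_wake_one(), called in case 3, distinguishes the halted case from the sleeping case; a minimal sketch based on the patch (treat the details as assumptions):

static void apf_task_wake_one(struct kvm_task_sleep_node *n)
{
        hlist_del_init(&n->link);

        if (n->halted)
                /* the vcpu halted instead of scheduling: kick it out of halt */
                smp_send_reschedule(n->cpu);
        else if (waitqueue_active(&n->wq))
                /* the task is sleeping on its wait queue: wake it up */
                wake_up(&n->wq);
}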

References

  1. MAIL list:
    v1

    v2

    v3

    v4

    v5

    v6

    v7

This post is licensed under CC BY 4.0 by the author.