
async pf

Introduction

On architectures that support EPT, the GVA->HPA translation is generally split into two mappings:

  • GVA->GPA
  • GPA->HPA

The host kernel (kvm) only has to care about the GPA->HPA mapping; the main things the host has to do are:

  1. Catch the relevant VM-exit event (EPT violation) and obtain the GPA
  2. Allocate a page
  3. Establish the mapping (this is not only the GPA->HPA mmu pgtable, there is also HVA – GPA, which we will not expand on here; in short, once a concrete page (an HPA) is allocated and the mmu pgtable is built for it, handling of the event is complete)

As shown in the figure below:

[Figure] guest: access a VA -> trigger #PF in VMX non-root operation -> map GVA->GPA in the guest #PF handler -> fix up the #PF and retry the access -> EPT mapping (GPA->HPA) missing, EPT violation -> VM exit; host: find HVA from the GPA -> GUP(HVA) -> map GPA->HPA -> fix up the EPT violation, VM entry, and the access resumes.

However, a page whose mapping has already been established is still part of the qemu process's virtual address space (anonymous pages) and can be swapped out. After it has been swapped out, a GUEST access to the GVA/GPA backed by that HPA triggers an EPT violation again. That means another VM-exit, and the three steps above have to be done again; in the second step, allocating the page now means swapping in the page that was swapped out earlier, so the path is rather long:

VM-exit
  handle_ept_violation
    kvm_mmu_page_fault
      tdp_page_fault
        gfn_to_pfn
          hva_to_pfn
            get_user_pages --slow path

get_user_pages goes down the slow path here, and because it has to go through the swap-in flow it is slow. So the idea came up: can this be done asynchronously? Let the vcpu not complete the instruction that caused the EPT violation for now, go do something else, and only re-execute the instruction once the page is present. In addition, hand get_user_pages off to a dedicated thread; from the virtual machine's point of view this is like adding an extra piece of hardware dedicated to handling swap-in, freeing up vcpu compute.

NOTE

Think about it: to achieve this, the GUEST must, knowingly or not, schedule out the process that caused the EPT violation.

The flow above is summarized below:



[Figure] TASK1 in the guest accesses a memory address and the page is not present, triggering an EPT violation (1); the host kvm vcpu thread finds the page swapped out (2) and, not wanting to spend time on the slow path itself, starts a work item on a dedicated thread (3); it tells the guest that the work is in progress and that it should do other things (4-5), so TASK1 is scheduled out in favor of task2 (6); when the dedicated thread finishes the swap-in (a) it tells the guest (b), which schedules TASK1 back in (c).


As the figure shows, the point of introducing async pf is that, after an EPT violation is triggered, the VCPU can be scheduled to another task, blocking the process that triggered the EPT violation. To achieve this, the following changes were made:

  • When the VCPU thread runs get_user_page(), it only takes the fast path; if the page is not present, the interface returns immediately and the remaining work is handed to a dedicated thread
  • KVM uses certain mechanisms to make the GUEST schedule, so that the instruction that triggered the EPT violation is not executed again right away. Once the dedicated thread has completed the swap-in, it notifies the guest to wake up the process that was scheduled out earlier

Code details

para virt interface

A typical paravirtualized implementation usually has the following characteristics:

  • use CPUID to report the feature
  • use an MSR to convey a small amount of information, e.g.:
    • a shared memory address
    • enable/disable
  • use shared memory to convey larger amounts of information

para virt async PF is implemented the same way.

In v1, Add shared memory hypercall to PV Linux guest, the author implemented the paravirtualized interface as a hypercall, but avi then suggested (link) using an MSR instead, because an MSR already has save/restore interfaces for INIT and live migration.

The original text:

Better to set this up as an MSR (with bit zero enabling, bits 1-5 
features, and 64-byte alignment).  This allows auto-reset on INIT and 
live migration using the existing MSR save/restore infrastructure.


Interface flow diagram

[Figure] The cpuid bit KVM_FEATURE_ASYNC_PF indicates that MSR_KVM_ASYNC_PF_EN may be accessed; bits 63-6 of that MSR hold the 64-byte aligned GPA of the shared memory, bits 5-1 are reserved, bit 0 enables the feature. Page-not-present APF (1-6): the host generates a token and writes it to CR2, sets the APF reason in shared memory to 1 and injects the #APF; the guest #PF handler reads the reason from shared memory, reads the token from CR2, binds the token to the thread to be scheduled out, and schedules it out. Page-present APF (a-f): the host writes the previous token to CR2, sets the reason to 2 and injects the #APF; the guest reads the reason, finds the sleeping thread by the token from CR2 and wakes it up.
The figure shows the register/shared-memory operations performed by host and guest while handling async pf.

As can be seen, cpuid, MSR_KVM_ASYNC_PF_EN and shared memory are involved; and because the async pf implementation injects #PF, CR2 is involved as well.

cpuid

A new paravirt cpuid bit is added: KVM_FEATURE_ASYNC_PF

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
+#define KVM_FEATURE_ASYNC_PF		4

Documentation for this bit:

diff --git a/Documentation/kvm/cpuid.txt b/Documentation/kvm/cpuid.txt
+KVM_FEATURE_ASYNC_PF               ||     4 || async pf can be enabled by
+                                   ||       || writing to msr 0x4b564d02

Roughly: if this cpuid bit is set, async pf can be enabled by writing to MSR 0x4b564d02.

MSR – shared mem addr && enable bit

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
+#define MSR_KVM_ASYNC_PF_EN 0x4b564d02

Documentation:

diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
+ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
+   data: Bits 63-6 hold 64-byte aligned physical address of a
+   64 byte memory area which must be in guest RAM and must be
+   zeroed. Bits 5-1 are reserved and should be zero. Bit 0 is 1
+   when asynchronous page faults are enabled on the vcpu 0 when
+   disabled.
    

This section describes the MSR bit layout:

  • MSR bits
    • Bits [63, 6]: a 64-byte aligned physical address
    • Bits [5, 1]: reserved
    • Bit 0: enable bit

The documentation also describes the shared memory format and CR2; to keep things readable we split those into the subsections below.
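
For illustration, a guest would enable the feature by OR-ing the 64-byte aligned GPA of the shared area into the MSR together with the enable bit. Below is a minimal sketch of that register write; the helper name apf_enable() and the variable apf_reason_gpa are made up for illustration, only MSR_KVM_ASYNC_PF_EN, KVM_ASYNC_PF_ENABLED and the bit layout come from the patch:

#define MSR_KVM_ASYNC_PF_EN   0x4b564d02
#define KVM_ASYNC_PF_ENABLED  (1 << 0)     /* bit 0: enable; bits 5-1: reserved */

static void apf_enable(u64 apf_reason_gpa)
{
        /* bits 63-6 carry the 64-byte aligned GPA, bit 0 turns the feature on */
        u64 val = (apf_reason_gpa & ~0x3fULL) | KVM_ASYNC_PF_ENABLED;

        wrmsrl(MSR_KVM_ASYNC_PF_EN, val);
}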

shared memory structure – APF reason

diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
    ...
+   First 4 byte of 64 byte memory location will be written to by
+   the hypervisor at the time of asynchronous page fault (APF)
+   injection to indicate type of asynchronous page fault. Value
+   of 1 means that the page referred to by the page fault is not
+   present. Value 2 means that the page is now available. Disabling
+   interrupt inhibits APFs. Guest must not enable interrupt
+   before the reason is read, or it may be overwritten by another
+   APF. Since APF uses the same exception vector as regular page
+   fault guest must reset the reason to 0 before it does
+   something that can generate normal page fault.  If during page
+   fault APF reason is 0 it means that this is regular page
+   fault.


The shared memory is 64 bytes in total; the first 4 bytes (32 bits) indicate the apf type. Host kvm writes the type to this address before injecting the apf (a sketch of the guest-side layout follows the list below).

There are two APF types (APF reasons):

  • 1: the page is not present
  • 2: a not-present page has become available

In addition, host and guest must respect the following constraints while handling an APF:

  • if the guest has interrupts disabled, the host must not inject an apf
  • the guest must finish handling the current apf before enabling interrupts
  • the guest must finish handling the current apf, and reset the reason to 0, before doing anything that can trigger a normal #PF
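
As a sketch, the guest-side view of the 64-byte area could look like the following; the layout follows the upstream struct kvm_vcpu_pv_apf_data as I understand it, so treat the exact fields as an assumption (only the 4-byte reason and the two reason values are stated by the documentation above):

#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
#define KVM_PV_REASON_PAGE_READY       2

/* 64-byte, 64-byte aligned area whose GPA is programmed into MSR_KVM_ASYNC_PF_EN */
struct kvm_vcpu_pv_apf_data {
        __u32 reason;        /* APF reason, written by the host right before injection */
        __u8  pad[60];       /* rest of the area, must be zeroed by the guest */
} __attribute__((aligned(64)));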

CR2

diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
    ...
+   During delivery of type 1 APF cr2 contains a token that will
+   be used to notify a guest when missing page becomes
+   available. When page becomes available type 2 APF is sent with
+   cr2 set to the token associated with the page. There is special
+   kind of token 0xffffffff which tells vcpu that it should wake
+   up all processes waiting for APFs and no individual type 2 APFs
+   will be sent.

+   If APF is disabled while there are outstanding APFs, they will
+   not be delivered.

+   Currently type 2 APF will be always delivered on the same vcpu as
+   type 1 was, but guest should not rely on that.


cr2 carries a token that uniquely identifies the APF currently in flight. There is one special value, 0xffffffff, which tells the vcpu to wake up all processes waiting for an APF (type 2); no individual type 2 APFs will be sent for them.

There are a few more constraints and limitations:

  • if APF is disabled while there are still outstanding APFs, they will not be delivered
  • the guest should not rely on the type 2 APF being delivered on the same vcpu as the corresponding type 1 APF, even though that is how it is implemented today

Think about why a wake-up-all API like this is needed.

Picture the live migration scenario.

During live migration we first suspend the vcpus and then migrate memory; at that point all pages are swapped in before being migrated, but the source side can no longer inject the exceptions, so injection can only happen on the destination.

On the destination side all memory is present, so injecting a single wakeup-all is enough to wake every waiting task. (Of course, pages may get swapped out again in the meantime; that simply triggers another async pf.)

GUP change

The details of the GUP changes are covered separately in:

link

STRUCT – host

Overall data structure diagram

It is fairly simple, as follows:



[Figure] Each struct kvm_vcpu carries its own kvm_async_pf work items: the queue list links every work item belonging to the vcpu, while the done list links only the completed ones; vcpu 1, 2 and 3 have the same layout as vcpu 0.

Each vcpu has its own lists stringing together the async pf work that belongs to it; there are two of them:

  • queue:

    links all work items

  • done:

    links all completed work items



struct kvm_async_pf

This structure mainly describes the dedicated-thread work mentioned above.

struct kvm_async_pf {
        struct work_struct work;
        struct list_head link;
        struct list_head queue;
        struct kvm_vcpu *vcpu;
        struct mm_struct *mm;
        gva_t gva;
        unsigned long addr;
        struct kvm_arch_async_pf arch;
        struct page *page;
        bool done;
};
  • work: the dedicated-thread instance, using the workqueue mechanism
  • link: in this patch there is essentially one list it hooks into: the vcpu's completed-work list
  • queue: links all kvm_async_pf of this vcpu
  • gva: the GVA that triggered the EPT violation and needs get_user_page slow path
  • addr: hva
  • done: indicates whether this work has completed
  • kvm_arch_async_pf:

    struct kvm_arch_async_pf {
            u32 token;
            gfn_t gfn;
    };

    • token: uniquely identifies one async PF; it is computed from kvm_vcpu.arch.apf.id and vcpu->vcpu_id (see the sketch below). When the #PF is injected it is passed to the GUEST as CR2, so that the guest can keep track of each async PF.
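
A minimal sketch of how the token could be derived from the per-vcpu counter and the vcpu id inside kvm_arch_setup_async_pf(); the shift width is an assumption for illustration, only the fact that apf.id and vcpu_id are combined comes from the description above:

struct kvm_arch_async_pf arch;

arch.gfn   = gfn;
/* per-vcpu running id in the upper bits, vcpu id in the lower bits */
arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;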

The lists that kvm_async_pf->link and kvm_async_pf->queue hook into are shown below:

CHANGE of struct kvm_vcpu

@@ -104,6 +125,15 @@ struct kvm_vcpu {
    gpa_t mmio_phys_addr;
 #endif

+#ifdef CONFIG_KVM_ASYNC_PF
+   struct {
+       u32 queued;
+       struct list_head queue;
+       struct list_head done;
+       spinlock_t lock;
+   } async_pf;
+#endif
  • queue: links all kvm_async_pf (work)
  • done: links the completed kvm_async_pf (work)
  • lock: lock protecting the lists

change of struct kvm_vcpu_arch

struct kvm_vcpu_arch {
    ...
+   struct {
+           bool halted;
+           gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
+           struct gfn_to_hva_cache data;
+           u64 msr_val;
+           u32 id;
+           bool send_user_only;
+   } apf;
    ...
}

This structure was changed across several patches; the final layout is listed here.

  • halted: whether the vcpu was halted because of async PF
  • gfns: an array recording the gfn of every outstanding async pf work
  • data: effectively an HVA->HPA cache; the mapping exists and stays unchanged (in most cases, unless __kvm_set_memory_region changes it), and the HPA points to the shared memory mentioned above

    The author turned this into a generic facility, essentially memslot-cached kvm_put_guest() and kvm_get_guest(); its introduction and implementation are covered in another article.

  • msr_val: records the msr value set by the guest
  • id: the id of the next async pf work; together with kvm_vcpu->vcpu_id it uniquely identifies one async PF
  • send_user_only: when set, the host only performs async PF if the EPT violation was triggered from guest user space

STRUCT - GUEST

The guest data structures are mainly used to manage the tasks scheduled out because of async PF.

Data structure diagram

[Figure] async_pf_sleepers (kvm_task_sleep_head[]) is a hash table; each bucket links the kvm_task_sleep_node entries that hash to the same key. cpu0 (RUNNING) has two nodes, token [id=0, vcpu=0] for task0 and [id=1, vcpu=0] for task1, both halted=false, and is currently running task2; cpu1 (HALT) has a node for task3 (halted=false) and one for task4 (halted=true) and is halted in task4's context.
  • The figure shows 4 tasks involved in async PF; each task is associated with a kvm_task_sleep_node
  • kvm_task_sleep_head[]->link strings together all sleep_nodes with the same key, for easy lookup
  • Each kvm_task_sleep_node has a unique identifier, kvm_task_sleep_node->token
  • cpu0 triggered async PF twice earlier and the tasks involved were scheduled out; it is currently running task2
  • cpu1 also triggered async PF twice: when task3 triggered it, task3 was successfully scheduled out; when task4 triggered it, the guest vcpu could not schedule at that moment, so the cpu was halted. It is currently halted in task4's context.

kvm_task_sleep_head

static struct kvm_task_sleep_head {
        spinlock_t lock;
        struct hlist_head list;
} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];

This structure is a hash map keyed by the token (a sketch of the bucket lookup follows).

  • lock: note that each hash bucket has its own lock, which reduces contention
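
For reference, a minimal sketch of the bucket lookup used by the handlers below, based on the _find_apf_task() helper they call; treat the exact body as an assumption:

static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
                                                  u32 token)
{
        struct hlist_node *p;

        /* walk this bucket and return the node whose token matches */
        hlist_for_each(p, &b->list) {
                struct kvm_task_sleep_node *n =
                        hlist_entry(p, typeof(*n), link);
                if (n->token == token)
                        return n;
        }
        return NULL;
}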

kvm_task_sleep_node

struct kvm_task_sleep_node {
        struct hlist_node link;
        wait_queue_head_t wq;
        u32 token;
        int cpu;
        bool halted;
        struct mm_struct *mm;
};

This structure is the hash node; it describes each task that was scheduled out because of async pf.

It does not necessarily describe a task that has already been scheduled out; it may also carry information about a task that is about to be scheduled out, as we will see below.

  • wq: wait queue
  • token: as described above, uniquely identifies one async PF
  • halted: sometimes, at the moment kvm injects the async PF, the guest cannot schedule; to still avoid re-executing the faulting code path, the only option is to halt the cpu. This flag records whether the task halted the cpu

initiate async pf->inject async pf

As mentioned above, in order to use the GUP noio interface, the gfn_to_pfn call in tdp_page_fault was changed to try_async_pf (a sketch of the call site follows). Let us look at that interface.
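
For context, a hedged sketch of what the changed call site in tdp_page_fault() roughly looks like; the exact parameter list in the patch may differ:

pfn_t pfn;

/* was: pfn = gfn_to_pfn(vcpu->kvm, gfn); */
if (try_async_pf(vcpu, gfn, gva, &pfn))
        return 0;       /* an async pf work was queued, the fault will be retried later */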

try_async_pf

static bool try_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
            pfn_t *pfn)
{
   bool async;
   //==(1)==
   *pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async);

   //==(2)==
   if (!async)
       return false; /* *pfn has correct page already */

   //==(3)==
   put_page(pfn_to_page(*pfn));

   //==(4)==
   if (can_do_async_pf(vcpu)) {
       trace_kvm_try_async_get_page(async, *pfn);
       //==(5)==
       if (kvm_find_async_pf_gfn(vcpu, gfn)) {
           trace_kvm_async_pf_doublefault(gva, gfn);
           kvm_make_request(KVM_REQ_APF_HALT, vcpu);
           return true;
       //==(6)==
       } else if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
           return true;
   }

   //==(7)==
   *pfn = gfn_to_pfn(vcpu->kvm, gfn);

   return false;
}
  1. As mentioned earlier, try_async_pf calls gfn_to_pfn_async(); async is an out-parameter indicating whether async pf is needed, and the return value is the pfn obtained for the gfn in the process
  2. If async comes back false, no async pf is needed and the pfn was certainly obtained, so return false directly
  3. put_page
  4. Check whether the current vcpu state allows doing async pf
    Details of can_do_async_pf
    
    +static bool can_do_async_pf(struct kvm_vcpu *vcpu)
    +{
    +	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
    +		     kvm_event_needs_reinjection(vcpu)))
    +		return false;
    +
    +	return kvm_x86_ops->interrupt_allowed(vcpu);
    +}
    

    Let us go over these three conditions in detail:

    • irqchip_in_kernel()
    • kvm_event_needs_reinjection():
      
      static inline bool kvm_event_needs_reinjection(struct kvm_vcpu *vcpu)
      {
              return vcpu->arch.exception.pending || vcpu->arch.interrupt.pending ||
                      vcpu->arch.nmi_injected;
      }
      

      As we can see, async pf is not allowed when other events are pending.

      My own understanding

      For pending events, refer to __vmx_complete_interrupts; without going too deep: during event injection at VM entry, something may trigger a VM exit before the VM entry completes, so those events were not actually injected and need to be injected at the next VM entry. In that situation the order looks like this:
      
      inject_event1->
        VM entry->
        VM exit(get uncomplete event)->
        get vm exit reason: EPT violation PAGE not present->
        (do some handler)->
        VM entry
      

      Now the question is: should this VM entry inject the async PF, or the uncompleted event?

      I believe the uncompleted event should be injected. It happened first; injecting the async pf without it would make the guest see that event as delayed.

      Moreover, the uncompleted event was preempted by the EPT violation, so right after this EPT violation is handled is exactly the time to inject it, and most likely another VM exit during event injection will not occur.

      This is my own understanding, and I am not sure that, while handling tdp_page_fault(), every pending event necessarily comes from a previously failed injection.

      Open question

      • interrupt_allowed: let us look at the intel vmx code
        
        static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
        {
                return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
                        !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
                                (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
        }
        

        This code mainly checks whether the interrupt window is currently open; we will not explain the individual conditions here, see the virtual interrupt article for details

        That article is not written yet

        Open item

        But we do need to understand why we care whether the guest can take interrupts at all, given that async pf injects a #PF. First, let us be clear about the following:

        My own understanding
        • Q: what is the purpose of async pf?
          • A: scheduling
        • Q: can that scheduling happen at any arbitrary point of guest execution?
          • A: it has to respect the guest's own willingness

        So, to sum up, async pf only makes sense when the guest itself believes it can schedule. Otherwise, even if we start a dedicated thread and ask the guest to schedule, the guest will not, and the whole exercise is pointless.

        With that in mind, consider the cases:

        • non-para virt: halt

          After the vcpu is halted, two kinds of events can wake it up:

          • interrupt
          • async pf work completion

          If interrupts cannot be injected into the guest, only the second kind can wake it up, which degenerates into the synchronous behaviour. Pointless.

        • para virt: this is the paravirtualized path, which amounts to asking the guest to actively perform a schedule, and that still has to respect the guest's willingness. It is essentially a negotiation with the guest, so we must care whether the guest can schedule at this moment. The author made it clear when documenting MSR_KVM_ASYNC_PF_EN that the host must not inject another async PF while the guest has interrupts disabled: the guest may still be inside the APF handler, and injecting another APF in that window could overwrite the APF information, for example:
          
          host                       guest             cr2
          write token(a) to cr2
                                                       value: a
          inject APF1
                                     trigger #PF
                                       (disable 
                                       interrupt
                                       in VM-entry)
                                     do some thing...
          
          write token(b) to cr2
                                                       value: b
          inject APF2
                                     intend to read
                                      cr2 to get 
                                      APF1 token, 
                                      loss it !!!
          

          In avi's self-Q&A we can also find an explanation of the interrupt-allowed check.

  5. This means the vcpu has already triggered an EPT violation on this address and set up an async pf for it, and now hits it again; the fault rate is apparently high, so just halt the vcpu (see the sketch after this list)

    ??? ??? ???

  6. Covered in the next subsection
  7. If none of the above conditions hold, just do the work synchronously.
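
Step 5 (and kvm_arch_async_page_not_present below) only raises KVM_REQ_APF_HALT; the request is consumed on the next iteration of the vcpu run loop. A rough sketch of that check, based on my reading of the patch (the surrounding variables r/out belong to the run-loop context and details may differ):

if (kvm_check_request(KVM_REQ_APF_HALT, vcpu)) {
        /* page is swapped out: do a synthetic halt instead of spinning on the fault */
        vcpu->arch.apf.halted = true;
        r = 1;
        goto out;
}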

kvm_setup_async_pf

int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
                       struct kvm_arch_async_pf *arch)
{
        struct kvm_async_pf *work;
        //==(1)==
        if (vcpu->async_pf.queued >= ASYNC_PF_PER_VCPU)
                return 0;

        /* setup delayed work */

        /*
         * do alloc nowait since if we are going to sleep anyway we
         * may as well sleep faulting in page
         */
        //==(2)==
        work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
        if (!work)
                return 0;

        work->page = NULL;
        work->done = false;
        work->vcpu = vcpu;
        work->gva = gva;
        work->addr = gfn_to_hva(vcpu->kvm, gfn);
        work->arch = *arch;
        work->mm = current->mm;
        atomic_inc(&work->mm->mm_count);
        kvm_get_kvm(work->vcpu->kvm);

        /* this can't really happen otherwise gfn_to_pfn_async
           would succeed */
        if (unlikely(kvm_is_error_hva(work->addr)))
                goto retry_sync;

        //==(2.1)==
        INIT_WORK(&work->work, async_pf_execute);
        //==(3)==
        if (!schedule_work(&work->work))
                goto retry_sync;

        //==(4)==
        list_add_tail(&work->queue, &vcpu->async_pf.queue);
        vcpu->async_pf.queued++;
        //==(5)==
        kvm_arch_async_page_not_present(vcpu, work);
        return 1;
retry_sync:
        kvm_put_kvm(work->vcpu->kvm);
        mmdrop(work->mm);
        kmem_cache_free(async_pf_cache, work);
        return 0;
}
  1. The number of per-vcpu async_pf (work) items has exceeded the limit – ASYNC_PF_PER_VCPU
  2. Allocate the work and do the related initialization; in (2.1) the work hook is set to async_pf_execute
  3. schedule the work
  4. Add the work to the vcpu->async_pf.queue list
  5. The code is as follows:
    
    void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
                                         struct kvm_async_pf *work)
    {
            trace_kvm_async_pf_not_present(work->arch.token, work->gva);
            kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
            //==(1)==
            if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
                (vcpu->arch.apf.send_user_only &&
                 kvm_x86_ops->get_cpl(vcpu) == 0))
                    kvm_make_request(KVM_REQ_APF_HALT, vcpu);
            //==(2)==
            else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
                    vcpu->arch.fault.error_code = 0;
                    vcpu->arch.fault.address = work->arch.token;
                    kvm_inject_page_fault(vcpu);
            }
    }
    
    1. Like can_do_async_pf, there are some conditions here as well that decide whether the current state is suitable for injecting async pf into the guest; we cover them below
    2. If injection is allowed, KVM_PV_REASON_PAGE_NOT_PRESENT is written into the guest/host shared memory (via apf_put_user, sketched below), indicating that a page-not-present async pf is being injected; the address and error code of the injected exception are also set up
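
apf_put_user() is what writes the reason into the shared area through the cached mapping mentioned earlier; a minimal sketch based on the patch (the exact body may differ):

static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
{
        /* write the APF reason into the 64-byte shared area via the gfn_to_hva_cache */
        return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &val,
                                      sizeof(val));
}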

async pf work

static void async_pf_execute(struct work_struct *work)
{
        struct page *page = NULL;
        struct kvm_async_pf *apf =
                container_of(work, struct kvm_async_pf, work);
        struct mm_struct *mm = apf->mm;
        struct kvm_vcpu *vcpu = apf->vcpu;
        unsigned long addr = apf->addr;
        gva_t gva = apf->gva;

        might_sleep();

        use_mm(mm);
        down_read(&mm->mmap_sem);
        //==(1)==
        get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
        up_read(&mm->mmap_sem);
        unuse_mm(mm);

        spin_lock(&vcpu->async_pf.lock);
        //==(2)==
        list_add_tail(&apf->link, &vcpu->async_pf.done);
        apf->page = page;
        apf->done = true;
        spin_unlock(&vcpu->async_pf.lock);

        /*
         * apf may be freed by kvm_check_async_pf_completion() after
         * this point
         */

        trace_kvm_async_pf_completed(addr, page, gva);

        //==(3)==
        if (waitqueue_active(&vcpu->wq))
                wake_up_interruptible(&vcpu->wq);

        mmdrop(mm);
        kvm_put_kvm(vcpu->kvm);
}
  1. Call get_user_pages; this interface can handle MAJOR faults

    If the pages parameter of get_user_pages() is not NULL, FOLL_GET is set

    
    int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
               unsigned long start, int nr_pages, int write, int force,
               struct page **pages, struct vm_area_struct **vmas)
    {
            int flags = FOLL_TOUCH;
    
            if (pages)
                    flags |= FOLL_GET;
            ...
    }
    

    If FOLL_GET is set, the page is pinned during get_user_pages(), i.e. get_page(). Note, however, that this interface can fail, and this flow does not appear to check whether it succeeded. IOW, whether or not it succeeds, the work is considered complete and the blocking GUEST thread is woken up again.

  2. Link the work into the vcpu->async_pf.done list
  3. If the vcpu is on the wait queue (halted), wake it up

Next, let us look at how the host detects the page-present event and injects the page-present async pf.

host inject PAGE PRESENT async pf

@@ -5272,6 +5288,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 			vcpu->run->exit_reason = KVM_EXIT_INTR;
 			++vcpu->stat.request_irq_exits;
 		}
+		
+		kvm_check_async_pf_completion(vcpu);
+
 		if (signal_pending(current)) {
 			r = -EINTR;

After a vm exit, check whether any async pf has completed

void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
{
        struct kvm_async_pf *work;
        //==(1)==
        if (list_empty_careful(&vcpu->async_pf.done) ||
            !kvm_arch_can_inject_async_page_present(vcpu))
                return;

        spin_lock(&vcpu->async_pf.lock);
        work = list_first_entry(&vcpu->async_pf.done, typeof(*work), link);
        list_del(&work->link);
        spin_unlock(&vcpu->async_pf.lock);
        //==(2)==
        if (work->page)
                kvm_arch_async_page_ready(vcpu, work);
        //==(3)==
        kvm_arch_async_page_present(vcpu, work);

        list_del(&work->queue);
        vcpu->async_pf.queued--;
        if (work->page)
                put_page(work->page);
        kmem_cache_free(async_pf_cache, work);
}
  1. There are two conditions:
    • whether there is any completed work
    • whether the guest is currently in a state suitable for injecting a page-present async PF (see the sketch after this list)
  2. If work->page is NULL, get_user_pages() in the async work failed, so there is no need to call kvm_arch_async_page_ready() this time. That function re-runs tdp_page_fault; if the page is ready, only the get_user_page fast path and __direct_map are needed to establish the GPA->HPA mapping.

    If the page is not ready (work->page is NULL), the author's approach is to VM enter anyway, wake up the blocking guest thread, and let it trigger the EPT violation again and start another async pf; so there is no point in doing another kvm_arch_async_page_ready->tdp_page_fault here. One might ask: why not have the HOST wait until get_user_pages() definitely succeeds before injecting the page-present #PF?

    Honestly, I do not know either; it feels like the author simply did not want to add complicated logic here. Worth watching later patches to see whether this part gets optimized

    Open question

  3. kvm_arch_async_page_present
    
    void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
                                     struct kvm_async_pf *work)
    {
            trace_kvm_async_pf_ready(work->arch.token, work->gva);
            //==(1)==
            if (is_error_page(work->page))
                    work->arch.token = ~0; /* broadcast wakeup */
            else
                    kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
       
            //==(2)==
            if ((vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) &&
                !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
                    vcpu->arch.fault.error_code = 0;
                    vcpu->arch.fault.address = work->arch.token;
                    kvm_inject_page_fault(vcpu);
            }
    }
    
    1. Error pages are covered in another article.

      Open question

    2. Write KVM_PV_REASON_PAGE_READY (provided KVM_ASYNC_PF_ENABLED is set in the msr), preparing to inject the page-present async #PF
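
The second condition in step 1, kvm_arch_can_inject_async_page_present(), roughly mirrors the checks used on the not-present side; a hedged sketch based on my reading of the patch (details may differ):

bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
{
        /*
         * If the guest never enabled APF, the "present" event is only used to
         * unhalt the vcpu, so it can always be handled.
         */
        if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
                return true;

        /* otherwise a real #PF must be injected: same rules as the not-present case */
        return !kvm_event_needs_reinjection(vcpu) &&
                kvm_x86_ops->interrupt_allowed(vcpu);
}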

guest handle async PF

dotraplinkage void __kprobes
do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
{
        //==(1)==
        switch (kvm_read_and_reset_pf_reason()) {
        default:
        //==(2)==
                do_page_fault(regs, error_code);
                break;
        case KVM_PV_REASON_PAGE_NOT_PRESENT:
        //==(3)==
                /* page is swapped out by the host. */
                kvm_async_pf_task_wait((u32)read_cr2());
                break;
        //==(4)==
        case KVM_PV_REASON_PAGE_READY:
                kvm_async_pf_task_wake((u32)read_cr2());
                break;
        }
}

The logic here is straightforward: the async PF event reuses the original #PF exception vector, so in the exception handler the guest has to determine what kind of #PF this is and run the corresponding handler.

  1. Read the async pf reason from the shared memory (a sketch of the helper follows this list)
  2. indicates a NORMAL #PF
  3. indicates a PAGE NOT PRESENT async pf
  4. indicates a PAGE PRESENT async pf
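
kvm_read_and_reset_pf_reason() performs the read-then-reset required by the documentation (the reason must be cleared before anything that can take a normal fault). A minimal sketch, assuming apf_reason is the per-cpu shared area registered through MSR_KVM_ASYNC_PF_EN; the real helper also checks whether APF is enabled on this cpu:

static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);

static u32 kvm_read_and_reset_pf_reason(void)
{
        u32 reason = __get_cpu_var(apf_reason).reason;

        /* reset to 0 so a later regular #PF is not mistaken for an APF */
        __get_cpu_var(apf_reason).reason = 0;

        return reason;
}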

page not present async pf

void kvm_async_pf_task_wait(u32 token)
{
        u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
        struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
        struct kvm_task_sleep_node n, *e;
        DEFINE_WAIT(wait);
        int cpu, idle;

        cpu = get_cpu();
        idle = idle_cpu(cpu);
        put_cpu();

        spin_lock(&b->lock);
        //===(1)==
        e = _find_apf_task(b, token);
        if (e) {
                /* dummy entry exist -> wake up was delivered ahead of PF */
                hlist_del(&e->link);
                kfree(e);
                spin_unlock(&b->lock);
                return;
        }

        //===(2)==
        n.token = token;
        n.cpu = smp_processor_id();
        n.mm = current->active_mm;
        //===(2.1)==
        n.halted = idle || preempt_count() > 1;
        atomic_inc(&n.mm->mm_count);
        init_waitqueue_head(&n.wq);
        //===(3)==
        hlist_add_head(&n.link, &b->list);
        spin_unlock(&b->lock);

        for (;;) {
                //===(4)==
                if (!n.halted)
                        prepare_to_wait(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
                if (hlist_unhashed(&n.link))
                        break;

                //===(4)==
                if (!n.halted) {
                        local_irq_enable();
                        schedule();
                        local_irq_disable();
                } else {
                        /*
                         * We cannot reschedule. So halt.
                         */
                        native_safe_halt();
                        local_irq_disable();
                }
        }
        if (!n.halted)
                finish_wait(&n.wq, &wait);

        return;
}
  1. When documenting MSR_KVM_ASYNC_PF_EN in the kernel doc, the author mentioned that a [type 2 APF, type 1 APF] pair is not guaranteed to be triggered on the same vcpu, which means the two may effectively be handled in parallel (the current host kvm does not do this, but the guest must not rely on that), as follows:
    
    kvm                vcpu1      vcpu2
    1.inject type1 
      APF to VCPU1
                                          
    2. inject type2 
      APF to VCPU2
                                   3. handle type2 APF
                       4. handle
                        type1 APF
    

    As shown, kvm injects the type 1 APF and the type 2 APF in order, but onto different vcpus. When they are handled, the type 2 APF handler runs first; at that point the page is already present and there is no need to sched out. The type 2 APF handler pre-inserts a sleep_node carrying that token into the head, so that the type 1 APF handler can skip the sched out this time (see the type 2 APF handler – kvm_async_pf_task_wake() – below).

  2. Bind the task (current->active_mm) to the token, so that when the type 2 APF fires the blocked task can be found by token
    • Note that in some situations the guest cannot sched out; in that case the only option is to halt the current cpu

      We cover this in another article

      Open question

  3. Link the sleep_node into the sleep_head
  4. If the guest can schedule at this point, put the process into uninterruptible (D) sleep and sched it out

page present async pf

void kvm_async_pf_task_wake(u32 token)
{
        u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
        struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
        struct kvm_task_sleep_node *n;

        if (token == ~0) {
                apf_task_wake_all();
                return;
        }

again:
        spin_lock(&b->lock);
        //===(1)==
        n = _find_apf_task(b, token);
        //===(2)==
        if (!n) {
                /*
                 * async PF was not yet handled.
                 * Add dummy entry for the token.
                 */
                n = kmalloc(sizeof(*n), GFP_ATOMIC);
                if (!n) {
                        /*
                         * Allocation failed! Busy wait while other cpu
                         * handles async PF.
                         */
                        spin_unlock(&b->lock);
                        cpu_relax();
                        goto again;
                }
                n->token = token;
                n->cpu = smp_processor_id();
                n->mm = NULL;
                init_waitqueue_head(&n->wq);
                hlist_add_head(&n->link, &b->list);
        } else
        //===(3)==
                apf_task_wake_one(n);
        spin_unlock(&b->lock);
        return;
}
  1. Look up the sleep_node in the sleep_head by token
  2. The type 1 APF handler may not have run yet: the type 2 APF can be handled first, on a different cpu from the type 1 APF, in which case no sleep_node for this token is found in the sleep_head. A new sleep_node is created and added to the sleep_head so that the type 1 APF handler can find it and avoid blocking the task
  3. If a node is found, the type 1 APF handler has already run and the task is blocked; wake that task up (a sketch of apf_task_wake_one() follows)
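
apf_task_wake_one(), called in case 3, distinguishes the halted case from the sleeping case; a minimal sketch based on the patch (treat the details as assumptions):

static void apf_task_wake_one(struct kvm_task_sleep_node *n)
{
        hlist_del_init(&n->link);

        if (n->halted)
                /* the vcpu halted instead of scheduling: kick it out of halt */
                smp_send_reschedule(n->cpu);
        else if (waitqueue_active(&n->wq))
                /* the task is sleeping on its wait queue: wake it up */
                wake_up(&n->wq);
}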

References

  1. MAIL list:
    v1

    v2

    v3

    v4

    v5

    v6

    v7

This post is licensed under CC BY 4.0 by the author.