Introduction

Anything that can go wrong will eventually go wrong. --Murphy's Law

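While analyzing CVE-2026-31431 Copy Fail and a few other recent vulnerabilities, I saw them described as related to the Dirty COW vulnerability, so I decided to analyze Dirty COW itself. A quick search online turned up no document that walks through the kernel code path in detail. The article below is the most accurate at marking how the kernel reaches the vulnerable spot, but it only lists the function path and lacks detail. The rest of this analysis follows the path given in that article: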

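PoC used:

Dirty COW exploits a race between a memory write that triggers COW (mmap/ptrace, etc.) and a page drop (madvise, etc.). The PoC chosen here is fairly simple and reproduces with high probability.

Kernel notes: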

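Discovery of the vulnerability

Dirty COW was found by a security researcher named Phil Oester in the course of his day-to-day work, while analyzing a real network attack against a server he managed. It was not found through theoretical analysis or an internal code audit.

Discoverer: the vulnerability was found and reported by Phil Oester, a Linux developer and security researcher.

Trigger: at the time, a website Oester managed was running on a server that appeared to have been compromised.

Method: Oester was in the habit of capturing all inbound HTTP traffic to his servers for forensic analysis. By analyzing those captured packets he extracted the exploit that the attacker had uploaded and executed. He then tested and analyzed the extracted binary in an isolated sandbox to confirm its behavior.

Severity: after the analysis, Oester confirmed that the exploit was extremely stable and had a high success rate. In his tests he found that an attacker "can gain root very reliably in less than 5 seconds".

Disclosure and response: the vulnerability, which had been lurking in the kernel for about 9 years, was publicly disclosed in October 2016. Oester reported the details to the Linux kernel maintainers, and a fix was developed and released quickly.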

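The kernel source used here is version 4.7.9. The upstream fix for this CVE (commit 19be0eaffa3a) was merged during the 4.9 development cycle and backported to the stable series.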

Getting started

Back then the Linux kernel allowed writes to a file mapping like the following (some distributions do not allow it):

map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0);

The write simply lands in a page owned by the process itself: it is never written back to disk and is visible only to that process.
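
The private-copy behavior described above can be observed from user space. The following is a minimal sketch, not part of the PoC: it assumes a Linux/POSIX system with a writable /tmp, and unlike the PoC it maps the file with PROT_READ | PROT_WRITE so that an ordinary store is allowed. The function name and scratch path are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a store through a MAP_PRIVATE mapping left the file
 * unchanged, 0 if the file changed, -1 if the setup failed. */
int private_write_leaves_file_untouched(void)
{
    char path[] = "/tmp/cow_demo_XXXXXX";   /* hypothetical scratch file */
    const char original[] = "original";
    char on_disk[sizeof(original)] = {0};
    char *map = MAP_FAILED;
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file lives until fd is closed */

    if (write(fd, original, sizeof(original)) != (ssize_t)sizeof(original))
        goto out;

    map = mmap(NULL, sizeof(original), PROT_READ | PROT_WRITE,
               MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    memcpy(map, "MODIFIED", 8);             /* write fault: the kernel COWs the page */
    assert(memcmp(map, "MODIFIED", 8) == 0);/* the process sees its own copy */

    /* Read the file itself: the bytes on the file side are untouched. */
    if (pread(fd, on_disk, sizeof(original), 0) == (ssize_t)sizeof(original))
        result = memcmp(on_disk, original, sizeof(original)) == 0;
out:
    if (map != MAP_FAILED)
        munmap(map, sizeof(original));
    close(fd);
    return result;
}
```

Calling private_write_leaves_file_untouched() from a small main() should return 1 on a normally configured system.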

For background on copy-on-write, see this article:

Let's look at how it is implemented:

The PoC first calls mmap(PROT_READ, MAP_PRIVATE):

  map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0);

  pthread_create(&pth2,NULL,procselfmemThread,argv[2]);

Then, in a thread, it moves the file "cursor" to the address of the mapping and writes the content:

void *procselfmemThread(void *arg)
{
  int f=open("/proc/self/mem",O_RDWR);

  for(i=0;i<100000000;i++) {
    lseek(f,(uintptr_t) map,SEEK_SET);
    c+=write(f,str,strlen(str));

The write goes through /proc/self/mem because the vulnerability sits on the mem_write -> __get_user_pages path. A direct *addr = val does not trigger it. Writing via ptrace works as well.
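
To make the difference from a direct store concrete, here is a hedged sketch that performs a single forced write through /proc/self/mem into a PROT_READ, MAP_PRIVATE file mapping and then checks the file. It does no racing and starts no madvise thread, so it is not an exploit. The function name, scratch path and return convention are assumptions for illustration; some systems restrict /proc/self/mem or forced writes, in which case the probe reports -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*  1: file unchanged after the forced write (expected without a race)
 *  0: file content changed
 * -1: the experiment could not be carried out on this system */
int proc_mem_write_probe(void)
{
    char path[] = "/tmp/procmem_demo_XXXXXX";  /* hypothetical scratch file */
    const char original[] = "original";
    char on_disk[sizeof(original)] = {0};
    char *map = MAP_FAILED;
    int memfd = -1, result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, original, sizeof(original)) != (ssize_t)sizeof(original))
        goto out;

    /* Same mapping type as the PoC: read-only and private. */
    map = mmap(NULL, sizeof(original), PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    memfd = open("/proc/self/mem", O_RDWR);
    if (memfd < 0)
        goto out;

    /* A direct "*map = 'M'" would raise SIGSEGV here; this path instead
     * goes through mem_write and __get_user_pages in the kernel. */
    if (lseek(memfd, (off_t)(uintptr_t)map, SEEK_SET) == (off_t)-1)
        goto out;
    if (write(memfd, "MODIFIED", 8) != 8)
        goto out;

    if (pread(fd, on_disk, sizeof(original), 0) != (ssize_t)sizeof(original))
        goto out;
    result = memcmp(on_disk, original, sizeof(original)) == 0;
out:
    if (memfd >= 0)
        close(memfd);
    if (map != MAP_FAILED)
        munmap(map, sizeof(original));
    close(fd);
    return result;
}
```

Without a concurrent madvise the forced write should only ever reach the private copy, so the probe is expected to return 1 (or -1 where /proc/self/mem is unavailable).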

Inside the kernel, execution enters through the write system call, and the main work happens in __get_user_pages:

-exec bt
#2  0xffffffff8116b439 in __get_user_pages (tsk=<optimized out>, mm=<optimized out>, start=18446612133000729416, nr_pages=9223372037488988261, gup_flags=8215, pages=<optimized out>, vmas=0xffff880028eabd78, nonblocking=0x0 <irq_stack_union>) at mm/gup.c:574
#3  0xffffffff8116be04 in __get_user_pages_locked (flags=<optimized out>, notify_drop=<optimized out>, locked=<optimized out>, vmas=<optimized out>, pages=<optimized out>, force=<optimized out>, write=<optimized out>, nr_pages=<optimized out>, start=<optimized out>, mm=<optimized out>, tsk=<optimized out>) at mm/gup.c:754
#4  get_user_pages_remote (tsk=<optimized out>, mm=<optimized out>, start=<optimized out>, nr_pages=<optimized out>, write=<optimized out>, force=<optimized out>, pages=0xffff880028eabd80, vmas=0xffff880028eabd78) at mm/gup.c:962
#5  0xffffffff8116c7e8 in __access_remote_vm (tsk=<optimized out>, mm=<optimized out>, addr=139704570122240, buf=0xffff880027976000, len=1, write=<optimized out>) at mm/memory.c:3737
#6  0xffffffff8117206f in access_remote_vm (mm=<optimized out>, addr=<optimized out>, buf=<optimized out>, len=<optimized out>, write=<optimized out>) at mm/memory.c:3797
#7  0xffffffff812043e8 in mem_rw (buf=0x7fff97fc8ee4 "b", count=1, ppos=<optimized out>, write=1, file=<optimized out>) at fs/proc/base.c:875
#8  0xffffffff8120445b in mem_write (file=<optimized out>, buf=<optimized out>, count=<optimized out>, ppos=<optimized out>) at fs/proc/base.c:909
#9  0xffffffff8119c2f8 in __vfs_write (file=0xffff880028d10a00, p=<optimized out>, count=<optimized out>, pos=0x8000000025cd5065) at fs/read_write.c:510
#10 0xffffffff8119d012 in vfs_write (file=0x8000000025cd5065, buf=0xffffea0000000000 "", count=1, pos=0xffff880028eabf00) at fs/read_write.c:560
#11 0xffffffff8119e376 in SYSC_write (count=<optimized out>, buf=<optimized out>, fd=<optimized out>) at fs/read_write.c:607
#12 SyS_write (fd=<optimized out>, buf=140735743299300, count=1) at fs/read_write.c:599
#13 0xffffffff81001947 in do_syscall_64 (regs=0x8000000025cd5065) at arch/x86/entry/common.c:350

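To modify memory, the kernel first uses follow_page_mask to find the page that backs the address:

Note the retry label here: this code is entered several times.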

mm/gup.c: 516

long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, unsigned long nr_pages,
		unsigned int gup_flags, struct page **pages,
		struct vm_area_struct **vmas, int *nonblocking)
{
    retry:
        page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {
			int ret;
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);
			switch (ret) {
			case 0:
				goto retry;

First pass: getting a private page

On the first pass, follow_page_mask -> follow_page_pte returns NULL because the page table entry has not been set up yet

static struct page *follow_page_pte(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags)
{
	if (!pte_present(pte)) {
		swp_entry_t entry;
		if (likely(!(flags & FOLL_MIGRATION)))
			goto no_page;
		if (pte_none(pte))
			goto no_page;

It then tries faultin_page:

    retry:
        page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {
			int ret;
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);
			switch (ret) {
			case 0:
				goto retry;

handle_mm_fault, called from faultin_page, is also the main function used by the page-fault exception handler

mm/gup.c: 350

static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
		unsigned long address, unsigned int *flags, int *nonblocking)
{
	ret = handle_mm_fault(mm, vma, address, fault_flags);

In handle_mm_fault -> handle_pte_fault, the page table mapping does not exist yet, so execution reaches do_fault

/root/qemu/linux-4.7.9/mm/memory.c: 3341

static int handle_pte_fault(struct mm_struct *mm,
		     struct vm_area_struct *vma, unsigned long address,
		     pte_t *pte, pmd_t *pmd, unsigned int flags)
{
	if (!pte_present(entry)) { // first pass: the pte is not present
		if (pte_none(entry)) { // !pte.pte
				return do_fault(mm, vma, address, pte, pmd,
						flags, entry);

This time the fault carries the FAULT_FLAG_WRITE flag, so it goes to do_cow_fault

static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		unsigned int flags, pte_t orig_pte)
{
	if (!(vma->vm_flags & VM_SHARED)) // private mapping
		return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
				orig_pte);

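do_cow_fault does two main things:

  • Through vma->vm_ops->fault, it copies the file content into a freshly allocated new_page. This page does not belong to the file's page cache: it is standalone and is never written back to disk
  • do_set_pte installs the new pte and sets its permissions
    • maybe_mkwrite(pte_mkdirty(entry), vma);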
static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd,
		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
    new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);

	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
			 &fault_entry); // the page is returned via vma->vm_ops->fault

	do_set_pte(vma, address, new_page, pte, true, true);

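In do_set_pte -> maybe_mkwrite, the pte permissions are derived from the permissions given to mmap. The mapping was created read-only, so the pte gets no write permission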

/root/qemu/linux-4.7.9/mm/memory.c: 2878

void do_set_pte(struct vm_area_struct *vma, unsigned long address,
		struct page *page, pte_t *pte, bool write, bool anon)
{
	if (write)
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);

/root/qemu/linux-4.7.9/include/linux/mm.h: 597

static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (likely(vma->vm_flags & VM_WRITE))
		pte = pte_mkwrite(pte);
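
The effect of maybe_mkwrite(pte_mkdirty(entry), vma) on a read-only VMA can be illustrated with a toy model. This is a sketch under stated assumptions: the bit values and the toy_ names are invented for illustration and do not match the kernel's real pte encoding.

```c
#include <assert.h>

/* Toy bit layout, hypothetical values (not the kernel's real encoding). */
enum { TOY_PTE_PRESENT = 0x1, TOY_PTE_RW = 0x2, TOY_PTE_DIRTY = 0x4 };
enum { TOY_VM_WRITE = 0x2 };

typedef unsigned int toy_pte_t;

static toy_pte_t toy_pte_mkdirty(toy_pte_t pte) { return pte | TOY_PTE_DIRTY; }
static toy_pte_t toy_pte_mkwrite(toy_pte_t pte) { return pte | TOY_PTE_RW; }

/* Same shape as maybe_mkwrite(): writable only if the VMA allows writes. */
toy_pte_t toy_maybe_mkwrite(toy_pte_t pte, unsigned int vm_flags)
{
    if (vm_flags & TOY_VM_WRITE)
        pte = toy_pte_mkwrite(pte);
    return pte;
}

/* What do_set_pte(..., write = true, ...) computes for the new entry. */
toy_pte_t toy_set_pte_for_write_fault(unsigned int vm_flags)
{
    toy_pte_t entry = TOY_PTE_PRESENT;
    return toy_maybe_mkwrite(toy_pte_mkdirty(entry), vm_flags);
}
```

For a mapping created with PROT_READ only (no TOY_VM_WRITE), the result is present and dirty but not writable, which is exactly the state the second pass has to deal with.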

OK, that is everything the first pass does:

faultin_page
  handle_mm_fault
    __handle_mm_fault
      handle_pte_fault
        do_fault <- pte is not present
	  do_cow_fault <- FAULT_FLAG_WRITE
	    alloc_set_pte
	      maybe_mkwrite(pte_mkdirty(entry), vma) <- mark the page dirty
	      						but keep it RO 
# Returns with 0 and retry
  • [dirtycow/dirtycow.github.io VulnerabilityDetails](https://github.com/dirtycow/dirtycow.github.io/wiki/VulnerabilityDetails)

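Second pass: COW

On the first pass through __get_user_pages the page table was not yet populated, so follow_page_mask returned NULL, and faultin_page->handle_mm_fault->__handle_mm_fault->handle_pte_fault->do_fault->do_cow_fault established the mapping with a read-only pte

Then comes the retry, entering follow_page_mask again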

/root/qemu/linux-4.7.9/mm/gup.c: 516

long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,	
		unsigned long start, unsigned long nr_pages,
		unsigned int gup_flags, struct page **pages,
		struct vm_area_struct **vmas, int *nonblocking)
{

    retry:
        page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {
			int ret;
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);
			switch (ret) {
			case 0:
				goto retry;

Unfortunately, one more check in follow_page_mask->follow_page_pte fails: this access wants to write, but the page is read-only...

/root/qemu/linux-4.7.9/mm/gup.c: 63

static struct page *follow_page_pte(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags)
{
    if (!pte_present(pte)) { // failed here on the first pass; passes now
        ......
    }
    if ((flags & FOLL_WRITE) && !pte_write(pte)) { // a write is requested, but this page is not writable
		return NULL;

follow_page_mask fails, so execution goes to faultin_page->handle_mm_fault again

    retry:
        page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {
			int ret;
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);

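On the second pass, faultin_page->handle_mm_fault->__handle_mm_fault->handle_pte_fault takes a different path and ends up in COW handling.

With COW between tasks (fork), write permission is removed from the page tables of both parent and child. The first task to write triggers COW, which copies the page and updates the page table. Here, however, it is not task COW but COW on a read-only private mmap.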

/root/qemu/linux-4.7.9/mm/memory.c: 3341

static int handle_pte_fault(struct mm_struct *mm,
		     struct vm_area_struct *vma, unsigned long address,
		     pte_t *pte, pmd_t *pmd, unsigned int flags)
{
	if (!pte_present(entry)) { // first pass: the pte is not present
		if (pte_none(entry)) { // !pte.pte
				return do_fault(mm, vma, address, pte, pmd,
						flags, entry);
		}
		return do_swap_page(mm, vma, address,
					pte, pmd, flags, entry);
	}
	
    if (flags & FAULT_FLAG_WRITE) { // the second pass arrives here
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					pte, pmd, ptl, entry);

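do_wp_page->wp_page_reuse together complete the COW and return VM_FAULT_WRITE

Once COW is done, faultin_page sees VM_FAULT_WRITE in the return value and concludes that COW has happened. Since the current task now has its own private page, the read/write distinction is considered irrelevant, and the FOLL_WRITE flag is cleared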

/root/qemu/linux-4.7.9/mm/gup.c: 350

static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
		unsigned long address, unsigned int *flags, int *nonblocking)
{
	ret = handle_mm_fault(mm, vma, address, fault_flags);

	/*
	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
	 * can thus safely do subsequent page lookups as if they were reads.
	 * But only do so when looping for pte_write is futile: in some cases
	 * userspace may also be wanting to write to the gotten user page,
	 * which a read fault here might prevent (a readonly page might get
	 * reCOWed by userspace write).
	 */
	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
		*flags &= ~FOLL_WRITE;  
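
The flag handling in the pre-fix faultin_page can be condensed into two small functions. This is a toy sketch: the constants and toy_ names are invented for illustration and are not the kernel's values, but the two conditions mirror the code quoted above.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel flags; values are arbitrary. */
enum { TOY_FOLL_WRITE = 0x01 };
enum { TOY_FAULT_FLAG_WRITE = 0x01 };
enum { TOY_VM_FAULT_WRITE = 0x08 };
enum { TOY_VM_WRITE = 0x02 };

/* GUP flags -> fault flags: only a FOLL_WRITE lookup produces a write fault. */
unsigned int toy_fault_flags(unsigned int foll_flags)
{
    return (foll_flags & TOY_FOLL_WRITE) ? TOY_FAULT_FLAG_WRITE : 0;
}

/* Pre-fix post-processing: once do_wp_page reports VM_FAULT_WRITE on a VMA
 * without VM_WRITE, the lookup is downgraded to a read. */
unsigned int toy_foll_flags_after_fault(unsigned int foll_flags,
                                        unsigned int fault_ret,
                                        unsigned int vm_flags)
{
    if ((fault_ret & TOY_VM_FAULT_WRITE) && !(vm_flags & TOY_VM_WRITE))
        foll_flags &= ~(unsigned int)TOY_FOLL_WRITE;
    return foll_flags;
}
```

The important consequence is the composition of the two: after the downgrade, toy_fault_flags() yields 0, so any later fault on the same address is handled as a read fault.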

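In plain terms, the comment says:

VM_FAULT_WRITE tells us that do_wp_page has already broken COW where needed, even though maybe_mkwrite chose not to set pte_write, so later page lookups can be treated as reads. This is done only when looping until pte_write becomes true would be futile (the VMA lacks VM_WRITE). Otherwise userspace might also want to write to the page, and a read-only page obtained through a read fault could be COWed again by that userspace write.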

As a result, on the retry of follow_page_mask -> follow_page_pte, the earlier flags &= ~FOLL_WRITE means the lookup no longer returns NULL, and everything works normally.

    retry:
        page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {
			int ret;
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);
			switch (ret) {
			case 0:
				goto retry;
static struct page *follow_page_pte(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags)
{
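    if (!pte_present(pte)) { // failed here on the first pass; passes now
        ......
    }
	if ((flags & FOLL_WRITE) && !pte_write(pte)) { // FOLL_WRITE was cleared after the COW in the second pass, so this passes too
		pte_unmap_unlock(ptep, ptl);
		return NULL;
	}
	
	/* from here on the page is obtained normally */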

All conditions in follow_page_pte now pass, and the write to the mmap(PROT_READ, MAP_PRIVATE) mapping proceeds normally:

follow_page_mask
  follow_page_pte
    (flags & FOLL_WRITE) && !pte_write(pte) <- retry fault
faultin_page
  handle_mm_fault
    __handle_mm_fault
      handle_pte_fault
        FAULT_FLAG_WRITE && !pte_write
	  do_wp_page
	    PageAnon() <- this is CoWed page already
	    reuse_swap_page <- page is exclusively ours
	    wp_page_reuse
	      maybe_mkwrite <- dirty but RO again
	      ret = VM_FAULT_WRITE
((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE)) <- we drop FOLL_WRITE
# Returns with 0 and retry as a read fault

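Third pass: a third party steps in!

However, the flow above contains a bug:

After the second pass clears FOLL_WRITE, the page is looked up again through follow_page_mask. What if the page has been dropped in between?

The madvise(MADV_DONTNEED) call in the PoC causes exactly that: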

  pthread_create(&pth1,NULL,madviseThread,argv[1]);
  
void *madviseThread(void *arg)
{
  char *str;
  str=(char*)arg;
  int i,c=0;
  for(i=0;i<100000000;i++)
  {
/*
You have to race madvise(MADV_DONTNEED) :: https://access.redhat.com/security/vulnerabilities/2706661
> This is achieved by racing the madvise(MADV_DONTNEED) system call
> while having the page of the executable mmapped in memory.
*/
    c+=madvise(map,100,MADV_DONTNEED);

The madvise system call drops the pages mapped at map by clearing their page table entries
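
The effect of MADV_DONTNEED on a private file mapping can be seen from user space without any race. The sketch below is an illustration under assumptions (Linux semantics, a writable /tmp, invented function name and scratch path): it dirties a private copy, calls madvise, and checks that the next read is served from the file again.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if, after MADV_DONTNEED, the mapping shows the file's bytes
 * again (the private copy was discarded), 0 if not, -1 on setup failure. */
int dontneed_discards_private_copy(void)
{
    char path[] = "/tmp/dontneed_demo_XXXXXX";  /* hypothetical scratch file */
    const char original[] = "original";
    char *map = MAP_FAILED;
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, original, sizeof(original)) != (ssize_t)sizeof(original))
        goto out;

    map = mmap(NULL, sizeof(original), PROT_READ | PROT_WRITE,
               MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    memcpy(map, "MODIFIED", 8);                 /* creates the private COW copy */
    if (memcmp(map, "MODIFIED", 8) != 0)
        goto out;

    /* Clears the pte: the private copy is thrown away. */
    if (madvise(map, sizeof(original), MADV_DONTNEED) != 0)
        goto out;

    /* The next access faults the page in again, from the file. */
    result = memcmp(map, original, sizeof(original)) == 0;
out:
    if (map != MAP_FAILED)
        munmap(map, sizeof(original));
    close(fd);
    return result;
}
```

On Linux this should return 1: after the call, the address range is back in the "no pte" state that the first pass started from.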

mm/memory.c: 1100

static unsigned long zap_pte_range(struct mmu_gather *tlb,
				struct vm_area_struct *vma, pmd_t *pmd,
				unsigned long addr, unsigned long end,
				struct zap_details *details)
{
	do {
		if (pte_present(ptent)) {
			ptent = ptep_get_and_clear_full(mm, addr, pte,
							tlb->fullmm); // clear the pte entry
-exec bt
#0  zap_pte_range (details=<optimized out>, end=<optimized out>, addr=<optimized out>, pmd=<optimized out>, vma=<optimized out>, tlb=<optimized out>) at mm/memory.c:1188
#1  zap_pmd_range (details=<optimized out>, end=<optimized out>, addr=<optimized out>, pud=<optimized out>, vma=<optimized out>, tlb=<optimized out>) at mm/memory.c:1242
#2  zap_pud_range (details=<optimized out>, end=<optimized out>, addr=<optimized out>, pgd=<optimized out>, vma=<optimized out>, tlb=<optimized out>) at mm/memory.c:1263
#3  unmap_page_range (tlb=0xffff880028ddbdc8, vma=0xffff880028dc5508, addr=139890012123136, end=139890012127232, details=0x0 <irq_stack_union>) at mm/memory.c:1284
#4  0xffffffff8116eb79 in unmap_single_vma (tlb=0xffff880028ddbdc8, vma=0xffff880028dc5508, start_addr=<optimized out>, end_addr=<optimized out>, details=<optimized out>) at mm/memory.c:1329
#5  0xffffffff8116ef13 in zap_page_range (vma=0xffff880028dc5508, start=139890012123136, size=<optimized out>, details=0x0 <irq_stack_union>) at mm/memory.c:1384
#6  0xffffffff81180d40 in madvise_dontneed (end=<optimized out>, start=<optimized out>, prev=<optimized out>, vma=<optimized out>) at mm/madvise.c:478
#7  madvise_vma (behavior=<optimized out>, end=<optimized out>, start=<optimized out>, prev=<optimized out>, vma=<optimized out>) at mm/madvise.c:584
#8  SYSC_madvise (behavior=<optimized out>, len_in=<optimized out>, start=<optimized out>) at mm/madvise.c:747
#9  SyS_madvise (start=<optimized out>, len_in=<optimized out>, behavior=<optimized out>) at mm/madvise.c:676
#10 0xffffffff81001947 in do_syscall_64 (regs=0xffff880028ddbdc8) at arch/x86/entry/common.c:350

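madviseThread and procselfmemThread run concurrently as threads!

Boom!

At the end of the second pass FOLL_WRITE has been dropped. If madvise happens to drop the page at exactly that moment, follow_page_mask once again fails to find the pte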

/root/qemu/linux-4.7.9/mm/gup.c: 516

long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,	
		unsigned long start, unsigned long nr_pages,
		unsigned int gup_flags, struct page **pages,
		struct vm_area_struct **vmas, int *nonblocking)
{

    retry:
        page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {
			int ret;
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);
			switch (ret) {
			case 0:
				goto retry;

Just as on the first pass, follow_page_mask -> follow_page_pte returns NULL because the page table entry is not set up

/root/qemu/linux-4.7.9/mm/gup.c: 63

static struct page *follow_page_pte(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags)
{
	if (!pte_present(pte)) {
		swp_entry_t entry;
		if (likely(!(flags & FOLL_MIGRATION)))
			goto no_page;
		if (pte_none(pte))
			goto no_page;

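faultin_page runs again, but this time FOLL_WRITE is gone, so the fault no longer carries FAULT_FLAG_WRITE, and execution starts down a path it should never take:

faultin_page->handle_mm_fault->__handle_mm_fault->handle_pte_fault: as on the first pass, there is no page table entry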

/root/qemu/linux-4.7.9/mm/memory.c: 3341

static int handle_pte_fault(struct mm_struct *mm,
		     struct vm_area_struct *vma, unsigned long address,
		     pte_t *pte, pmd_t *pmd, unsigned int flags)
{
	if (!pte_present(entry)) { // pte not present on the first pass, and again on the third
		if (pte_none(entry)) { // !pte.pte
				return do_fault(mm, vma, address, pte, pmd,
						flags, entry);

But where it should have reached do_cow_fault, it now reaches do_read_fault

/root/qemu/linux-4.7.9/mm/memory.c: 3178

static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		unsigned int flags, pte_t orig_pte)
{
	if (!(flags & FAULT_FLAG_WRITE))  // BOOM: execution ends up here
		return do_read_fault(mm, vma, address, pmd, pgoff, flags,
				orig_pte);
	if (!(vma->vm_flags & VM_SHARED)) // the normal path, taken on the first pass
		return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
				orig_pte);

Unlike do_cow_fault, which returns a page private to the mapping, do_read_fault does not allocate a new page: it uses the file's own page-cache page directly
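
The dispatch in do_fault boils down to two flag tests. Here is a toy sketch of that decision; the enum names and flag values are invented for illustration, and only the order of the tests follows the kernel excerpt above.

```c
#include <assert.h>

/* Hypothetical flag values for illustration only. */
enum { TOY_FAULT_FLAG_WRITE = 0x01 };
enum { TOY_VM_SHARED = 0x08 };

enum toy_handler {
    TOY_DO_READ_FAULT,    /* maps the page-cache page itself */
    TOY_DO_COW_FAULT,     /* allocates a private copy */
    TOY_DO_SHARED_FAULT   /* writable shared mapping */
};

/* Same test order as do_fault(): read fault first, then private vs shared. */
enum toy_handler toy_do_fault(unsigned int fault_flags, unsigned int vm_flags)
{
    if (!(fault_flags & TOY_FAULT_FLAG_WRITE))
        return TOY_DO_READ_FAULT;
    if (!(vm_flags & TOY_VM_SHARED))
        return TOY_DO_COW_FAULT;
    return TOY_DO_SHARED_FAULT;
}
```

For the private mapping in the PoC, the only thing that separates the safe outcome (TOY_DO_COW_FAULT) from the dangerous one (TOY_DO_READ_FAULT) is whether the write flag survived until this point.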

/root/qemu/linux-4.7.9/mm/memory.c: 3008

static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd,
		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
    ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);

Fourth pass: defeat

Now, on the fourth pass through the loop, FOLL_WRITE is gone and the pte maps the file's real page

/root/qemu/linux-4.7.9/mm/gup.c: 516

long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,	
		unsigned long start, unsigned long nr_pages,
		unsigned int gup_flags, struct page **pages,
		struct vm_area_struct **vmas, int *nonblocking)
{

    retry:
        page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {
			int ret;
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);
			switch (ret) {
			case 0:
				goto retry;

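On the fourth pass, follow_page_mask->follow_page_pte no longer blocks anything: the pte is present again and FOLL_WRITE is no longer set, so the page is returned normally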

static struct page *follow_page_pte(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags)
{
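    if (!pte_present(pte)) { // the read fault in the third pass populated the pte, so this passes
        ......
    }
	if ((flags & FOLL_WRITE) && !pte_write(pte)) { // FOLL_WRITE was cleared in the second pass, so this passes too
		pte_unmap_unlock(ptep, ptl);
		return NULL;
	}
	
	/* from here on the page is obtained normally */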
cond_resched -> different thread will now unmap via madvise
follow_page_mask
  !pte_present && pte_none
faultin_page
  handle_mm_fault
    __handle_mm_fault
      handle_pte_fault
        do_fault <- pte is not present
	  do_read_fault <- this is a read fault and we will get pagecache
	  		   page!
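
The four passes can be replayed in a small user-space simulation. This is a toy model, not kernel code: every toy_ name and flag value is invented for illustration, the "madvise thread" is reduced to a parameter that clears the pte before a chosen pass, and the VMA is assumed to be a read-only private file mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: names and values are hypothetical, not kernel definitions. */
enum toy_page { TOY_PAGE_NONE, TOY_PAGE_ANON_COPY, TOY_PAGE_CACHE };
enum { TOY_FOLL_WRITE = 0x01, TOY_FOLL_FORCE = 0x10 };

struct toy_pte {
    enum toy_page page;   /* what the pte points at */
    bool writable;        /* pte_write() */
};

/* follow_page_pte(): TOY_PAGE_NONE means "no usable page, fault it in". */
static enum toy_page toy_follow_page(const struct toy_pte *pte, unsigned int flags)
{
    if (pte->page == TOY_PAGE_NONE)
        return TOY_PAGE_NONE;
    if ((flags & TOY_FOLL_WRITE) && !pte->writable)
        return TOY_PAGE_NONE;
    return pte->page;
}

/* faultin_page() with the pre-fix flag handling. */
static void toy_faultin_page(struct toy_pte *pte, unsigned int *flags, bool vm_write)
{
    bool write_fault = (*flags & TOY_FOLL_WRITE) != 0;

    if (pte->page == TOY_PAGE_NONE) {
        /* do_fault(): write -> do_cow_fault, read -> do_read_fault */
        pte->page = write_fault ? TOY_PAGE_ANON_COPY : TOY_PAGE_CACHE;
        pte->writable = write_fault && vm_write;
    } else if (write_fault && !pte->writable) {
        /* do_wp_page(): COW is broken and VM_FAULT_WRITE is reported */
        pte->page = TOY_PAGE_ANON_COPY;
        pte->writable = vm_write;
        if (!vm_write)
            *flags &= ~(unsigned int)TOY_FOLL_WRITE;   /* the buggy step */
    }
}

/* The retry loop. zap_at is the pass before which the simulated madvise
 * thread clears the pte (0 = never). Returns the page that a forced
 * write would end up modifying. */
enum toy_page toy_get_user_page(int zap_at)
{
    struct toy_pte pte = { TOY_PAGE_NONE, false };
    unsigned int flags = TOY_FOLL_WRITE | TOY_FOLL_FORCE;

    for (int pass = 1; pass <= 16; pass++) {
        if (pass == zap_at) {               /* madvise(MADV_DONTNEED) wins the race */
            pte.page = TOY_PAGE_NONE;
            pte.writable = false;
        }
        enum toy_page page = toy_follow_page(&pte, flags);
        if (page != TOY_PAGE_NONE)
            return page;
        toy_faultin_page(&pte, &flags, false /* VMA lacks VM_WRITE */);
    }
    return TOY_PAGE_NONE;
}
```

In this model toy_get_user_page(0) ends on the private copy after three passes, while toy_get_user_page(3), where the pte is cleared right after FOLL_WRITE was dropped, ends on the page-cache page after four passes, matching the walkthrough above.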

How the vulnerability came about

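Back around 2.6.13, Linus himself noticed this bug and tried to fix it. Unfortunately that fix was flawed and was later undone because of problems on s390, and no further fix followed. Then, around 2.6.22, Pandora's box was opened…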

Anything that can go wrong will eventually go wrong. --Murphy's Law

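The patch

The patch changes two things:

After the COW in the second pass, FOLL_WRITE is no longer cleared. A new FOLL_COW flag is set instead

FOLL_WRITE is never dropped any more, which avoids later passes going astray and avoids the awkward semantics of using "FOLL_WRITE is absent" to mean "COW already done"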

@@ -2232,6 +2232,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
+#define FOLL_COW       0x4000  /* internal GUP flag */

@@ -412,7 +422,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
         * reCOWed by userspace write).
         */
        if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
-               *flags &= ~FOLL_WRITE;
+               *flags |= FOLL_COW;
        return 0;
 }

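The check in follow_page_mask->follow_page_pte changes accordingly. Previously, FOLL_WRITE being set meant "COW not done yet", and combined with pte_write(pte) the code let the lookup through once COW was done even though the pte lacked write permission. Now that FOLL_WRITE is no longer cleared, the "COW already done" case is recognized by FOLL_FORCE && FOLL_COW together with pte_dirty(pte)

FOLL_FORCE means the access is forced even without write permission. From /root/qemu/linux-4.7.9/mm/gup.c:920: @force: whether to force access even when user mapping is currently protected (but never forces write access to shared mapping).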

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -60,6 +60,16 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
        return -EEXIST;
 }

+/*
+ * FOLL_FORCE can write to even unwritable pte's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+{
+       return pte_write(pte) ||
+               ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+}
+
 static struct page *follow_page_pte(struct vm_area_struct *vma,
                unsigned long address, pmd_t *pmd, unsigned int flags)
 {
@@ -95,7 +105,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
        }
        if ((flags & FOLL_NUMA) && pte_protnone(pte))
                goto no_page;
-       if ((flags & FOLL_WRITE) && !pte_write(pte)) {
+       if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
                pte_unmap_unlock(ptep, ptl);
                return NULL;
        }
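
To see why the patched check closes the race, the same kind of toy simulation can be run with the new rule. As before, this is a sketch under assumptions: all toy_ names and values are invented, the madvise thread is a parameter that clears the pte before a chosen pass, and the VMA is a read-only private file mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: names and values are hypothetical, not kernel definitions. */
enum toy_page { TOY_PAGE_NONE, TOY_PAGE_ANON_COPY, TOY_PAGE_CACHE };
enum { TOY_FOLL_WRITE = 0x01, TOY_FOLL_FORCE = 0x10, TOY_FOLL_COW = 0x4000 };

struct toy_pte {
    enum toy_page page;
    bool writable;   /* pte_write() */
    bool dirty;      /* pte_dirty() */
};

/* Same shape as can_follow_write_pte() in the patch. */
static bool toy_can_follow_write_pte(const struct toy_pte *pte, unsigned int flags)
{
    return pte->writable ||
           ((flags & TOY_FOLL_FORCE) && (flags & TOY_FOLL_COW) && pte->dirty);
}

static enum toy_page toy_follow_page(const struct toy_pte *pte, unsigned int flags)
{
    if (pte->page == TOY_PAGE_NONE)
        return TOY_PAGE_NONE;
    if ((flags & TOY_FOLL_WRITE) && !toy_can_follow_write_pte(pte, flags))
        return TOY_PAGE_NONE;
    return pte->page;
}

/* faultin_page() after the fix: FOLL_WRITE stays, FOLL_COW is added. */
static void toy_faultin_page(struct toy_pte *pte, unsigned int *flags, bool vm_write)
{
    bool write_fault = (*flags & TOY_FOLL_WRITE) != 0;

    if (pte->page == TOY_PAGE_NONE) {
        pte->page = write_fault ? TOY_PAGE_ANON_COPY : TOY_PAGE_CACHE;
        pte->writable = write_fault && vm_write;
        pte->dirty = write_fault;           /* do_cow_fault marks the pte dirty */
    } else if (write_fault && !pte->writable) {
        pte->page = TOY_PAGE_ANON_COPY;     /* do_wp_page */
        pte->writable = vm_write;
        pte->dirty = true;
        if (!vm_write)
            *flags |= TOY_FOLL_COW;         /* the fixed step */
    }
}

/* zap_at: pass before which the simulated madvise clears the pte (0 = never). */
enum toy_page toy_get_user_page_fixed(int zap_at)
{
    struct toy_pte pte = { TOY_PAGE_NONE, false, false };
    unsigned int flags = TOY_FOLL_WRITE | TOY_FOLL_FORCE;

    for (int pass = 1; pass <= 16; pass++) {
        if (pass == zap_at)
            pte = (struct toy_pte){ TOY_PAGE_NONE, false, false };
        enum toy_page page = toy_follow_page(&pte, flags);
        if (page != TOY_PAGE_NONE)
            return page;
        toy_faultin_page(&pte, &flags, false /* VMA lacks VM_WRITE */);
    }
    return TOY_PAGE_NONE;
}
```

Because FOLL_WRITE is never cleared, a fault after the pte has been zapped is still a write fault and lands in the do_cow_fault branch, so in this model the function returns the private copy no matter when the pte is cleared.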

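Commit 5535be309971 removed FOLL_COW again in 2022

Summary

Dirty COW in short: in __get_user_pages, the span between breaking COW and actually obtaining the page is not atomic. A race exists in which the page can be dropped in between, so that __get_user_pages obtains the file's real page and content can be written into any file the attacker can open for reading (including read-only and setuid files)

Although it is a race and therefore probabilistic, it reproduces reliably and does not take long. Some bugs, such as ones that leak a little memory to other processes, are extremely hard to exploit. This one is effectively always exploitable and very dangerous, because it allows privilege escalation.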

References
