CVE-2022-0492

Container Escape via Linux cgroup v1 Abuse

May 21, 2023

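Preface

Compared with the earlier technique of abusing notify_on_release and writing to release_agent, this exploit no longer requires a container started with --privileged or one that holds CAP_SYS_ADMIN. What is new here is the use of unprivileged user namespace creation.

For details on the earlier technique, see: https://github.com/cdk-team/CDK/wiki/Exploit:-mount-cgroup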

Advisory

The vulnerability can also allow root host processes with no capabilities, or non-root host processes with the CAP_DAC_OVERRIDE capability, to escalate privileges and attain all capabilities.
It has been discovered that under certain circumstances, the Linux kernel’s
cgroups v1 release_agent feature can be used to escalate privilege and
bypass namespace isolation unexpectedly.
CVE-2022-0492 has been assigned to this issue, which is corrected by
requiring CAP_SYS_ADMIN in the initial user namespace when setting
release_agent. This has been included upstream in commit
24f6008564183aa120d07c03d9289519c2fe02af. ( https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24f6008564183aa120d07c03d9289519c2fe02af )
Thank you to Yiqi Sun and Kevin Wang of Huawei Security Team for disclosing
their work that led to this fix.

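Prerequisites

The process inside the container must run as root. There is, however, a chance that root inside the container was mapped to a non-zero host UID when the container was started; in that case the exploit does not work. The technique also requires cgroup v1, because the release_agent + notify_on_release mechanism exists only in cgroup v1. Since v243, systemd boots with cgroup v2 by default, so such systems are not affected. However, we observed that many downstream distributions boot in hybrid cgroup v1 & v2 mode instead of adopting the upstream v2-only default. Because a large number of releases still default to cgroup v1, the impact remains broad.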
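The two prerequisites can be checked from inside a container before anything else is attempted. The following is a minimal sketch, not part of the original PoC: the function names `root_maps_to_host_root` and `cgroup_mode` are hypothetical, and both operate on text passed in as an argument (the contents of /proc/self/uid_map and /proc/self/cgroup). Classifying the cgroup setup from /proc/self/cgroup is a heuristic, assuming v1 entries look like `N:controllers:/path` and the v2 entry looks like `0::/path`.

```shell
# Succeeds if UID 0 inside the namespace maps to UID 0 outside it.
# $1: text of /proc/<pid>/uid_map, lines of "<inside> <outside> <count>".
root_maps_to_host_root() {
    printf '%s\n' "$1" |
        awk 'NF == 3 && $1 == 0 && $2 == 0 { ok = 1 } END { exit (ok ? 0 : 1) }'
}

# Prints v1, v2, hybrid or unknown.
# $1: text of /proc/<pid>/cgroup.
cgroup_mode() {
    printf '%s\n' "$1" | awk -F: '
        NF >= 3 && $1 == "0" && $2 == "" { v2 = 1; next }  # unified (v2) entry
        NF >= 3                          { v1 = 1 }        # any v1 hierarchy
        END {
            if (v1 && v2)  print "hybrid"
            else if (v1)   print "v1"
            else if (v2)   print "v2"
            else           print "unknown"
        }'
}
```

Typical use would be `root_maps_to_host_root "$(cat /proc/self/uid_map)"` and `cgroup_mode "$(cat /proc/self/cgroup)"`. A result of `v2`, or a failing UID check, suggests the technique described here does not apply.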

PoC

#!/bin/bash

echo "[*] Testing whether CVE-2022-0492 can be exploited for container escape"

# Set up a scratch directory to use as the mount point
test_dir=/tmp/.cve-2022-0492-test
if ! mkdir -p "$test_dir" ; then
    echo "ERROR: failed to create test directory at $test_dir"
    exit 1
fi

# Test whether escape via CAP_SYS_ADMIN is possible
if mount -t cgroup -o memory cgroup "$test_dir" >/dev/null 2>&1 ; then
    if test -w "$test_dir/release_agent" ; then
        echo "[!] Exploitable: the container can escape as it runs with CAP_SYS_ADMIN"
        umount "$test_dir" && rm -rf "$test_dir"
        exit 0
    fi
    umount "$test_dir"
fi

# Test whether escape via user namespaces is possible:
# try every cgroup v1 subsystem listed for this process
while read -r subsys
do
    if unshare -UrmC --propagation=unchanged bash -c "mount -t cgroup -o $subsys cgroup $test_dir >/dev/null 2>&1 && test -w $test_dir/release_agent" >/dev/null 2>&1 ; then
        echo "[!] Exploitable: the container can abuse user namespaces to escape"
        rm -rf "$test_dir"
        exit 0
    fi
done < <(grep -Eo '[0-9]+:[^:]+' "/proc/$$/cgroup" | grep -Eo '[^:]+$')

# Cannot escape via either method
rm -rf "$test_dir"
echo "[+] Contained: cannot escape via CVE-2022-0492"

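Usable cgroups

Note that cgroups on the host are all mounted read-write, whereas inside a container they are all mounted read-only.

By default, a Docker container uses a child cgroup, which lives on the host at /sys/fs/cgroup/<subsystem>/docker/<container-id>. Kubernetes by default places its containers under kubepods.slice in the root cgroup. There is one exception: for RDMA (Remote Direct Memory Access, used only as the example in this article), containers of both runtimes sit directly in the root cgroup.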
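To see which subsystems place the current process directly in the root cgroup, the paths in /proc/self/cgroup can be inspected. This is a minimal sketch under the assumption that the container shares the host's cgroup namespace, so the paths shown are the real ones; the function name `root_cgroup_subsystems` is hypothetical.

```shell
# Prints the controller field of every cgroup v1 entry whose path is "/".
# $1: text of /proc/<pid>/cgroup, lines of "<id>:<controllers>:<path>".
root_cgroup_subsystems() {
    printf '%s\n' "$1" |
        awk -F: 'NF >= 3 && $2 != "" && $3 == "/" { print $2 }'
}
```

Called as `root_cgroup_subsystems "$(cat /proc/self/cgroup)"`, it would print `rdma` for an entry such as `12:rdma:/` and nothing for an entry such as `11:memory:/docker/<container-id>`.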

Cgroup v1 / v2

What the fxxk is hybrid mode?

Reference: systemd/CGROUP_DELEGATION.md at main · systemd/systemd (github.com)

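cgroup has the following subsystems:

devices: controls which devices processes may access
cpuset: assigns the CPUs and memory nodes that processes may use
cpu: controls the share of CPU time
cpuacct: accounts CPU usage, for example run time and throttled time
memory: caps memory usage
freezer: suspends the processes in a cgroup
net_cls: works with tc (traffic control) to limit network bandwidth
net_prio: sets the priority of processes' network traffic
hugetlb: limits HugeTLB usage
perf_event: lets the perf tool monitor performance per cgroup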

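Patch

Patch commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24f6008564183aa120d07c03d9289519c2fe02af . The fix is simple and blunt: whenever release_agent is set, the kernel now checks that the caller holds CAP_SYS_ADMIN and belongs to the initial (privileged) user namespace, and rejects the request otherwise.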

diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 41e0837a5a0bd..0e877dbcfeea9 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -549,6 +549,14 @@ static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of,
 
 	BUILD_BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX);
 
+	/*
+	 * Release agent gets called with all capabilities,
+	 * require capabilities to set release agent.
+	 */
+	if ((of->file->f_cred->user_ns != &init_user_ns) ||
+	    !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!cgrp)
 		return -ENODEV;
@@ -954,6 +962,12 @@ int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		/* Specifying two release agents is forbidden */
 		if (ctx->release_agent)
 			return invalfc(fc, "release_agent respecified");
+		/*
+		 * Release agent gets called with all capabilities,
+		 * require capabilities to set release agent.
+		 */
+		if ((fc->user_ns != &init_user_ns) || !capable(CAP_SYS_ADMIN))
+			return invalfc(fc, "Setting release_agent not allowed");
 		ctx->release_agent = param->string;
 		param->string = NULL;
 		break;

Analysis

When a user namespace is created, the kernel records the effective user ID of the creating process as being the “owner” of the namespace. A process whose effective user ID matches that of the owner of a user namespace and which is a member of the parent namespace has all capabilities in the namespace. By virtue of the previous rule, those capabilities propagate down into all descendant namespaces as well. This means that after creation of a new user namespace, other processes owned by the same user in the parent namespace have all capabilities in the new namespace.
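One way to observe this rule is to compare the CapEff line of /proc/self/status before and after entering a new user namespace. The following is a minimal sketch for decoding that mask. The function name `has_cap_sys_admin` is hypothetical, and the bit position 21 is the value of CAP_SYS_ADMIN in linux/capability.h.

```shell
CAP_SYS_ADMIN_BIT=21   # CAP_SYS_ADMIN as defined in linux/capability.h

# Succeeds if the given hexadecimal capability mask contains CAP_SYS_ADMIN.
# $1: the hex value from a CapEff/CapPrm line of /proc/<pid>/status,
#     without a leading "0x".
has_cap_sys_admin() {
    [ $(( (0x$1 >> $CAP_SYS_ADMIN_BIT) & 1 )) -eq 1 ]
}
```

For example, `has_cap_sys_admin 00000000a80425fb` (a mask commonly seen with Docker's default capability set) fails, while `has_cap_sys_admin 000001ffffffffff` succeeds. Running `unshare -Ur grep CapEff /proc/self/status` in a container that permits unprivileged user namespaces is expected to show a full mask, although those capabilities apply only inside the new namespace.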

Exploitation

feat(exploit/abuse_unpriv_userns.go): exploit of CVE-2022-0492 by kmahyyg · Pull Request #41 ·… (github.com)

co-operate with PR #40. Use reexec technique to let a multi-thread program (such as this golang program) runs in a…

Implement mount-cgroup in Golang style by kmahyyg · Pull Request #40 · cdk-team/CDK (github.com)

feat(exp/mount_cgroup.go): completely fix #35 in golang-style This implemented mount-cgroup exploit totally in Golang…

Dockerfile for testing:

FROM ubuntu:21.04
LABEL maintainer="kmahyyg <16604643+kmahyyg@users.noreply.github.com>"

COPY sleep.elf /
RUN echo "nameserver 223.5.5.5" > /etc/resolv.conf
RUN sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list && \
    apt update -y && \
    apt install -y ca-certificates wget curl nano socat libcap2-bin && \
    rm -rf /var/cache/apt

CMD ["/sleep.elf"]

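Bash script for testing:

#!/bin/bash
# Build the test image from the Dockerfile above
docker build . -t cve-2022-0492:latest

# Allow unprivileged user namespace creation (this sysctl exists on Debian/Ubuntu kernels)
sysctl -w kernel.unprivileged_userns_clone=1
# Put SELinux into permissive mode (only relevant on hosts running SELinux)
setenforce 0
# Start the container with seccomp and AppArmor confinement disabled
cnt1=$(docker run -d --rm --security-opt "seccomp=unconfined" --security-opt "apparmor=unconfined" cve-2022-0492:latest)
echo "$cnt1"
echo "Container created. Try Exec."
docker exec -it "$cnt1" bash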
Some development issues

These may need to be fixed together:

Related explanations

The man page already explains the relevant mechanism clearly: https://man7.org/linux/man-pages/man7/cgroups.7.html

Cgroups v1 release notification
Two files can be used to determine whether the kernel provides
notifications when a cgroup becomes empty. A cgroup is
considered to be empty when it contains no child cgroups and no
member processes.
A special file in the root directory of each cgroup hierarchy,
release_agent, can be used to register the pathname of a program
that may be invoked when a cgroup in the hierarchy becomes empty.
The pathname of the newly empty cgroup (relative to the cgroup
mount point) is provided as the sole command-line argument when
the release_agent program is invoked. The release_agent program
might remove the cgroup directory, or perhaps repopulate it with
a process.
The default value of the release_agent file is empty, meaning
that no release agent is invoked.
The content of the release_agent file can also be specified via a
mount option when the cgroup filesystem is mounted:
mount -o release_agent=pathname ...
Whether or not the release_agent program is invoked when a
particular cgroup becomes empty is determined by the value in the
notify_on_release file in the corresponding cgroup directory. If
this file contains the value 0, then the release_agent program is
not invoked. If it contains the value 1, the release_agent
program is invoked. The default value for this file in the root
cgroup is 0. At the time when a new cgroup is created, the value
in this file is inherited from the corresponding file in the
parent cgroup.
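The two files described above can also be inspected read-only, for example when auditing a host for an unexpected release agent. The following is a minimal sketch, assuming it is pointed at the root directory of a mounted cgroup v1 hierarchy (such as /sys/fs/cgroup/memory); the function name `audit_release_agent` is hypothetical.

```shell
# Prints the hierarchy's release_agent, then every cgroup directory whose
# notify_on_release flag is set to 1. Reads files only; writes nothing.
# $1: root directory of a mounted cgroup v1 hierarchy.
audit_release_agent() (
    root=$1
    agent=$(cat "$root/release_agent" 2>/dev/null)
    echo "release_agent=${agent:-<empty>}"
    find "$root" -name notify_on_release -type f | sort |
        while IFS= read -r f; do
            if [ "$(cat "$f" 2>/dev/null)" = "1" ]; then
                echo "notify_on_release=1 ${f%/notify_on_release}"
            fi
        done
)
```

A non-empty release_agent combined with any cgroup reporting notify_on_release=1 means the kernel will run that program when the cgroup becomes empty, so an unexpected path here deserves a closer look.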

For the official documentation of the notify_on_release mechanism, see: https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt

Each cgroup is represented by a directory in the cgroup file system
containing the following files describing that cgroup:
 - tasks: list of tasks (by PID) attached to that cgroup. This list
   is not guaranteed to be sorted. Writing a thread ID into this file
   moves the thread into this cgroup.
 - cgroup.procs: list of thread group IDs in the cgroup. This list is
   not guaranteed to be sorted or free of duplicate TGIDs, and userspace
   should sort/uniquify the list if this property is required.
   Writing a thread group ID into this file moves all threads in that
   group into this cgroup.
 - notify_on_release flag: run the release agent on exit?
 - release_agent: the path to use for release notifications (this file
   exists in the top cgroup only)

1.4 What does notify_on_release do ?
------------------------------------
If the notify_on_release flag is enabled (1) in a cgroup, then
whenever the last task in the cgroup leaves (exits or attaches to
some other cgroup) and the last child cgroup of that cgroup
is removed, then the kernel runs the command specified by the contents
of the release_agent file in that hierarchy's root directory,
supplying the pathname (relative to the mount point of the cgroup
file system) of the abandoned cgroup. This enables automatic
removal of abandoned cgroups. The default value of
notify_on_release in the root cgroup at system boot is disabled
(0). The default value of other cgroups at creation is the current
value of their parents’ notify_on_release settings. The default value of
a cgroup hierarchy’s release_agent path is empty.

References

LWN's "Namespaces in operation" article series:


Written by Patrick Young
