CVE-2022-0492
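Container escape through abuse of Linux cgroup v1
Preface
Compared with the earlier technique of enabling notify_on_release and writing to release_agent, this exploit no longer requires a container started with --privileged or one that holds CAP_SYS_ADMIN. Its novelty lies in the use of unprivileged user namespace creation.
For details of the earlier technique, see: https://github.com/cdk-team/CDK/wiki/Exploit:-mount-cgroup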
Advisory
The vulnerability can also allow root host processes with no capabilities, or non-root host processes with the CAP_DAC_OVERRIDE capability, to escalate privileges and attain all capabilities.
It has been discovered that under certain circumstances, the Linux kernel's cgroups v1 release_agent feature can be used to escalate privilege and bypass namespace isolation unexpectedly.
CVE-2022-0492 has been assigned to this issue, which is corrected by requiring CAP_SYS_ADMIN in the initial user namespace when setting release_agent. This has been included upstream in commit 24f6008564183aa120d07c03d9289519c2fe02af ( https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24f6008564183aa120d07c03d9289519c2fe02af ).
Thank you to Yiqi Sun and Kevin Wang of Huawei Security Team for disclosing their work that led to this fix.
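Prerequisites
The container must run its processes as root. There is some chance, however, that the container's root user is mapped at startup to a non-zero uid on the host, in which case the exploit does not work. The exploit also requires cgroup v1, because the release_agent + notify_on_release mechanism exists only in cgroup v1. Systemd boots with cgroup v2 by default since v243 and is therefore unaffected, but we have observed that many downstream distributions boot in hybrid cgroup v1 & v2 mode instead of adopting the upstream v2-only default. Because many releases still default to cgroup v1, the affected population remains large.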
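The cgroup requirement can be checked quickly from inside a container. The following is a minimal sketch, assuming a /proc/mounts-style listing whose third field is the filesystem type (`cgroup` for v1, `cgroup2` for v2). The function name `classify_cgroup_mode` is hypothetical and exists only for illustration.

```shell
#!/bin/bash
# Classify the cgroup setup from a /proc/mounts-style listing on stdin.
# Prints one of: hybrid, v1, v2, none.
classify_cgroup_mode() {
    awk '
        $3 == "cgroup"  { v1 = 1 }   # at least one cgroup v1 hierarchy is mounted
        $3 == "cgroup2" { v2 = 1 }   # the unified v2 hierarchy is mounted
        END {
            if (v1 && v2)  print "hybrid"
            else if (v1)   print "v1"
            else if (v2)   print "v2"
            else           print "none"
        }'
}

# Only query the live system when /proc/mounts is readable.
if [ -r /proc/mounts ]; then
    classify_cgroup_mode < /proc/mounts
fi
```

A result of `v1` or `hybrid` means at least one v1 hierarchy is visible, so the release_agent mechanism may be reachable; `v2` means it is not.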
PoC
#!/bin/bash
echo "[*] Testing whether CVE-2022-0492 can be exploited for container escape"

# Setup test dir
test_dir=/tmp/.cve-2022-0492-test
if ! mkdir -p "$test_dir" ; then
    echo "ERROR: failed to create test directory at $test_dir"
    exit 1
fi

# Test whether escape via CAP_SYS_ADMIN is possible
if mount -t cgroup -o memory cgroup "$test_dir" >/dev/null 2>&1 ; then
    if test -w "$test_dir/release_agent" ; then
        echo "[!] Exploitable: the container can escape as it runs with CAP_SYS_ADMIN"
        umount "$test_dir" && rm -rf "$test_dir"
        exit 0
    fi
    umount "$test_dir"
fi

# Test whether escape via user namespaces is possible
# (one attempt per cgroup v1 subsystem listed in /proc/$$/cgroup)
while read -r subsys
do
    if unshare -UrmC --propagation=unchanged bash -c "mount -t cgroup -o $subsys cgroup $test_dir 2>&1 >/dev/null && test -w $test_dir/release_agent" >/dev/null 2>&1 ; then
        echo "[!] Exploitable: the container can abuse user namespaces to escape"
        rm -rf "$test_dir"
        exit 0
    fi
done <<< "$(cat /proc/$$/cgroup | grep -Eo '[0-9]+:[^:]+' | grep -Eo '[^:]+$')"

# Cannot escape via either method
rm -rf "$test_dir"
echo "[+] Contained: cannot escape via CVE-2022-0492"
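Usable cgroups
Note that on the host every cgroup hierarchy is mounted read-write, whereas inside a container they are all mounted read-only.
A Docker container uses a child cgroup by default, located on the host at /sys/fs/cgroup/<subsystem>/docker/<container-id>. Kubernetes, by contrast, places containers under kubepods.slice of the root cgroup by default. There is one exception: for RDMA (Remote Direct Memory Access, used only as the example in this article), both end up in the root cgroup.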
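Since release_agent exists only in the root directory of a hierarchy, it helps to know which controllers place the current process in the root cgroup. Below is a minimal sketch, assuming the `hierarchy-ID:controller-list:cgroup-path` line format of /proc/<pid>/cgroup. The function name `root_cgroup_controllers` is hypothetical. With a private cgroup namespace every path appears as `/`, so the output is only meaningful when the host's view of the hierarchy is visible.

```shell
#!/bin/bash
# Read /proc/<pid>/cgroup-style lines on stdin and print the cgroup v1
# controller lists whose cgroup path is the hierarchy root ("/").
# The cgroup v2 entry ("0::/...") has an empty controller field and is skipped.
root_cgroup_controllers() {
    awk -F: '$2 != "" && $3 == "/" { print $2 }'
}

# Only query the live system when the file is readable.
if [ -r /proc/self/cgroup ]; then
    root_cgroup_controllers < /proc/self/cgroup
fi
```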
Cgroup v1 / v2
What the fxxk is Hybrid Mode?
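cgroup provides the following subsystems:
- devices: device access permissions for a group of processes
- cpuset: assigns the CPUs and memory nodes that processes may use
- cpu: controls the share of CPU time
- cpuacct: accounts CPU usage, e.g. run time and throttled time
- memory: caps memory usage
- freezer: suspends the processes in a cgroup
- net_cls: works with tc (traffic controller) to limit network bandwidth
- net_prio: sets the priority of a process's network traffic
- hugetlb: limits HugeTLB usage
- perf_event: lets the perf tool do performance monitoring grouped by cgroup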
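Which of these subsystems are actually attached to a v1 hierarchy on a given machine can be read from /proc/cgroups. The sketch below assumes the documented column layout of that file (`subsys_name hierarchy num_cgroups enabled`), where a hierarchy ID of 0 means the controller is not mounted on any v1 hierarchy. The function name `v1_attached_controllers` is hypothetical.

```shell
#!/bin/bash
# Read /proc/cgroups-style lines on stdin and print the controllers that are
# enabled and attached to a cgroup v1 hierarchy (hierarchy ID != 0).
v1_attached_controllers() {
    awk '!/^#/ && $2 != 0 && $4 == 1 { print $1 }'
}

# Only query the live system when the file is readable.
if [ -r /proc/cgroups ]; then
    v1_attached_controllers < /proc/cgroups
fi
```

On a hybrid-mode host this typically lists only the controllers that were left on v1; an empty result suggests no v1 controller hierarchy is in use.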
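Patch
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24f6008564183aa120d07c03d9289519c2fe02af
The fix is blunt: whenever release_agent is set, the kernel now checks that the caller holds CAP_SYS_ADMIN and that the user_ns involved is the initial (privileged) one. If either condition fails, the request is rejected.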
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 41e0837a5a0bd..0e877dbcfeea9 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -549,6 +549,14 @@ static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of,
 	BUILD_BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX);
+	/*
+	 * Release agent gets called with all capabilities,
+	 * require capabilities to set release agent.
+	 */
+	if ((of->file->f_cred->user_ns != &init_user_ns) ||
+	    !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!cgrp)
 		return -ENODEV;
@@ -954,6 +962,12 @@ int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		/* Specifying two release agents is forbidden */
 		if (ctx->release_agent)
 			return invalfc(fc, "release_agent respecified");
+		/*
+		 * Release agent gets called with all capabilities,
+		 * require capabilities to set release agent.
+		 */
+		if ((fc->user_ns != &init_user_ns) || !capable(CAP_SYS_ADMIN))
+			return invalfc(fc, "Setting release_agent not allowed");
 		ctx->release_agent = param->string;
 		param->string = NULL;
 		break;
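The two conditions in the patch can be approximated from userspace. The following is a rough sketch, not the kernel's logic: it assumes that CAP_SYS_ADMIN is capability number 21, that the CapEff field of /proc/self/status is a hexadecimal bitmask, and that a uid_map consisting of the single line `0 0 4294967295` indicates the initial user namespace (a heuristic, since a child namespace could in principle be given the same mapping). The function names `has_cap_sys_admin` and `looks_like_init_userns` are hypothetical.

```shell
#!/bin/bash
# Return success if bit 21 (CAP_SYS_ADMIN) is set in a hex capability mask.
has_cap_sys_admin() {
    local mask=$1
    (( (16#$mask >> 21) & 1 ))
}

# Heuristic: read a uid_map on stdin and succeed only if it is the single
# identity mapping "0 0 4294967295" that the initial user namespace shows.
looks_like_init_userns() {
    awk 'NR == 1 && $1 == 0 && $2 == 0 && $3 == 4294967295 { ok = 1 }
         END { exit !(ok && NR == 1) }'
}

# Only query the live system when the proc files are readable.
if [ -r /proc/self/status ] && [ -r /proc/self/uid_map ]; then
    capeff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
    if [ -n "$capeff" ] && has_cap_sys_admin "$capeff" \
        && looks_like_init_userns < /proc/self/uid_map; then
        echo "this process would likely pass the patched check"
    else
        echo "this process would likely be rejected by the patched check"
    fi
fi
```

Note that the kernel evaluates the credentials attached to the opened file (`of->file->f_cred`) or the filesystem context (`fc->user_ns`), so this check of the current process is only an approximation.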
Analysis
When a user namespace is created, the kernel records the effective user ID of the creating process as being the “owner” of the namespace. A process whose effective user ID matches that of the owner of a user namespace and which is a member of the parent namespace has all capabilities in the namespace. By virtue of the previous rule, those capabilities propagate down into all descendant namespaces as well. This means that after creation of a new user namespace, other processes owned by the same user in the parent namespace have all capabilities in the new namespace.
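The rule quoted above can be observed directly by comparing the effective capability mask before and after entering a new user namespace. This is a minimal sketch, assuming the util-linux `unshare` tool with its `-U` (new user namespace) and `-r` (map the current user to root) options, and a kernel that permits unprivileged user namespace creation. The function name `userns_capeff` is hypothetical.

```shell
#!/bin/bash
# Print the CapEff mask seen by a process started inside a fresh user
# namespace, or "unavailable" if the namespace cannot be created.
userns_capeff() {
    unshare -Ur awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null \
        || echo "unavailable"
}

echo "CapEff in the current namespace:  $(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null)"
echo "CapEff in a new user namespace:   $(userns_capeff)"
```

When user namespace creation is allowed, the second mask should show a full capability set even if the first one is restricted. Those capabilities apply only to resources owned by the new namespace, which is why being able to set release_agent from there mattered.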
Exploitation
Test Dockerfile:
FROM ubuntu:21.04
LABEL MAINTAINER kmahyyg<16604643+kmahyyg@users.noreply.github.com>
COPY sleep.elf /
RUN echo "nameserver 223.5.5.5" > /etc/resolv.conf
RUN sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list && \
    apt update -y && \
    apt install -y ca-certificates wget curl nano socat libcap2-bin && \
    rm -rf /var/cache/apt
CMD ["/sleep.elf"]
Test Bash script:
#!/bin/bash
docker build . -t cve-2022-0492:latest
sysctl -w kernel.unprivileged_userns_clone=1
setenforce 0
cnt1=$(docker run -d --rm --security-opt "seccomp=unconfined" --security-opt "apparmor=unconfined" cve-2022-0492:latest)
echo $cnt1
echo "Container created. Try Exec."
docker exec -it $cnt1 bash
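Some development issues
These may need to be fixed at the same time:
Related explanations
The man page already explains the relevant mechanism clearly: https://man7.org/linux/man-pages/man7/cgroups.7.html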
Cgroups v1 release notification
Two files can be used to determine whether the kernel provides notifications when a cgroup becomes empty. A cgroup is considered to be empty when it contains no child cgroups and no member processes.
A special file in the root directory of each cgroup hierarchy, release_agent, can be used to register the pathname of a program that may be invoked when a cgroup in the hierarchy becomes empty. The pathname of the newly empty cgroup (relative to the cgroup mount point) is provided as the sole command-line argument when the release_agent program is invoked. The release_agent program might remove the cgroup directory, or perhaps repopulate it with a process.
The default value of the release_agent file is empty, meaning that no release agent is invoked.
The content of the release_agent file can also be specified via a mount option when the cgroup filesystem is mounted:
mount -o release_agent=pathname ...
Whether or not the release_agent program is invoked when a particular cgroup becomes empty is determined by the value in the notify_on_release file in the corresponding cgroup directory. If this file contains the value 0, then the release_agent program is not invoked. If it contains the value 1, the release_agent program is invoked. The default value for this file in the root cgroup is 0. At the time when a new cgroup is created, the value in this file is inherited from the corresponding file in the parent cgroup.
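The two files described above can be inspected read-only. Here is a minimal sketch, assuming a cgroup v1 hierarchy mounted at a path such as /sys/fs/cgroup/memory (the conventional location, which may differ on a given system). The function name `show_release_config` is hypothetical.

```shell
#!/bin/bash
# Print the release_agent of a hierarchy root and the notify_on_release
# flag of one cgroup inside it. Nothing is written.
#   $1: mount point (root directory) of the cgroup v1 hierarchy
#   $2: cgroup directory to inspect (defaults to the root itself)
show_release_config() {
    local root=$1 cgroup=${2:-$1}
    printf 'release_agent=%s\n' "$(cat "$root/release_agent" 2>/dev/null)"
    printf 'notify_on_release=%s\n' "$(cat "$cgroup/notify_on_release" 2>/dev/null)"
}

# Only query the live system when the assumed mount point exists.
if [ -d /sys/fs/cgroup/memory ]; then
    show_release_config /sys/fs/cgroup/memory
fi
```

An empty release_agent together with notify_on_release=0 is the default state the man page describes.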
For the official documentation of the notify_on_release mechanism, see: https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
Each cgroup is represented by a directory in the cgroup file system containing the following files describing that cgroup:
- tasks: list of tasks (by PID) attached to that cgroup. This list is not guaranteed to be sorted. Writing a thread ID into this file moves the thread into this cgroup.
- cgroup.procs: list of thread group IDs in the cgroup. This list is not guaranteed to be sorted or free of duplicate TGIDs, and userspace should sort/uniquify the list if this property is required. Writing a thread group ID into this file moves all threads in that group into this cgroup.
- notify_on_release flag: run the release agent on exit?
- release_agent: the path to use for release notifications (this file exists in the top cgroup only)
1.4 What does notify_on_release do ?
If the notify_on_release flag is enabled (1) in a cgroup, then whenever the last task in the cgroup leaves (exits or attaches to some other cgroup) and the last child cgroup of that cgroup is removed, then the kernel runs the command specified by the contents of the release_agent file in that hierarchy's root directory, supplying the pathname (relative to the mount point of the cgroup file system) of the abandoned cgroup. This enables automatic removal of abandoned cgroups. The default value of notify_on_release in the root cgroup at system boot is disabled (0). The default value of other cgroups at creation is the current value of their parents' notify_on_release settings. The default value of a cgroup hierarchy's release_agent path is empty.
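Because a release agent only runs for cgroups whose notify_on_release flag is 1, a defender may want to list such cgroups under a hierarchy. The sketch below assumes a cgroup v1 hierarchy mounted at a path such as /sys/fs/cgroup/memory; the function name `list_notifying_cgroups` is hypothetical.

```shell
#!/bin/bash
# Print every cgroup directory under the given hierarchy root whose
# notify_on_release flag is set to 1. Read-only.
list_notifying_cgroups() {
    local root=$1 flag
    find "$root" -type f -name notify_on_release 2>/dev/null | sort |
    while IFS= read -r flag; do
        if [ "$(cat "$flag" 2>/dev/null)" = "1" ]; then
            dirname "$flag"
        fi
    done
}

# Only query the live system when the assumed mount point exists.
if [ -d /sys/fs/cgroup/memory ]; then
    list_notifying_cgroups /sys/fs/cgroup/memory
fi
```

Any hit is worth cross-checking against the release_agent value in the root of the same hierarchy.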
References
- https://thehackernews.com/2022/03/new-linux-kernel-cgroups-vulnerability.html
- https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1850667
- https://github.com/PaloAltoNetworks/can-ctr-escape-cve-2022-0492
- https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/
- https://www.openwall.com/lists/oss-security/2022/02/04/1
- https://ubuntu.com/security/CVE-2022-0492
- https://www.kernel.org/doc/Documentation/cgroup-v2.txt
- https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
- https://man7.org/linux/man-pages/man7/cgroups.7.html
LWN's "Namespaces in operation" article series: