Why Container Escapes Are the Highest-Severity K8s Runtime Event
Most K8s security incidents stay inside a container's namespaces. A cryptominer running in a compromised pod is a cost problem. Lateral movement between pods in the same namespace is a data access problem. But a container escape, where an attacker exits the container's namespace isolation and gains access to the host OS, is an entirely different class of event. From the host, an attacker can read the kubelet's credentials (for example the client certificate under /var/lib/kubelet/pki/), access other containers' runtime filesystems via /proc/<pid>/root, steal the ServiceAccount token of any pod on the node (mounted inside each container at /var/run/secrets/kubernetes.io/serviceaccount/token), and potentially compromise the entire cluster by gaining control of a high-privilege ServiceAccount.
This is why container escape appears in the MITRE ATT&CK Containers matrix as Escape to Host (T1611) under tactic TA0004 (Privilege Escalation). An attacker who achieves host access on even one node has a lateral movement path to every workload on that node and a credential theft path toward cluster-admin access if any privileged ServiceAccount token is present on that node.
The detection challenge is that escape attempts look like normal syscalls until they don't. A process making a setns() call could be a legitimate container runtime operation, or it could be an attacker attempting to join the host network namespace. The syscall number alone doesn't tell you which. Context — process ancestry, which namespace the call originates from, whether it's running inside a container cgroup vs. the kubelet — is what distinguishes the two.
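As a rough illustration of the "container cgroup" part of that context, the sketch below extracts a container ID from the text of /proc/<pid>/cgroup. The marker substrings and the 64-hex-character ID pattern are assumptions about common runtime layouts (containerd, CRI-O, Docker); real cgroup paths vary with the runtime and cgroup driver, so treat this as a starting point rather than a complete classifier.

```python
import re
from typing import Optional

# Substrings that commonly appear in the cgroup paths of containerized
# processes. Assumption: the exact layout depends on runtime and cgroup driver.
_CONTAINER_MARKERS = ("kubepods", "docker", "containerd", "crio")
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return a 64-hex container ID found in /proc/<pid>/cgroup text, or None."""
    for line in cgroup_text.splitlines():
        # Each line has the form "hierarchy-ID:controllers:path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        path = parts[2]
        if any(marker in path for marker in _CONTAINER_MARKERS):
            match = _CONTAINER_ID.search(path)
            if match:
                return match.group(0)
    return None
```

A process whose cgroup text yields no ID (the kubelet's own service cgroup, for instance) would be treated as a host process by this heuristic.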
The Syscall Signatures of a Container Escape
Container escapes exploit one of several categories of isolation boundary weakness. Each category has a characteristic syscall signature that eBPF probes can observe:
Namespace Reassignment (setns / unshare)
The kernel syscall setns(2) allows a process with appropriate capabilities to join an existing namespace. An attacker who gains CAP_SYS_ADMIN within a container (often via a misconfigured securityContext or a vulnerability in a SUID binary) can call setns() on file descriptors for the host's PID and mount namespaces: one call per namespace with nstype CLONE_NEWPID or CLONE_NEWNS, or a single call with both flags on a pidfd on Linux 5.8 and later. The detection signal: setns() called from a process within a container cgroup (as identified by its cgroup hierarchy) targeting namespace file descriptors that resolve to the host's /proc/1/ns/ entries, process 1 being init, which runs in the host namespace. Inside an isolated container /proc/1 refers to the container's own init, so the comparison should be made on the namespace inode rather than on the path string. A container process joining the namespaces of the host's PID 1 is strong evidence of an escape attempt.
The unshare variant (CVE-2022-0492): a missing capability check in the kernel's cgroup v1 release_agent handling allowed a process inside a container to create a new user namespace via unshare(CLONE_NEWUSER), write to /proc/self/uid_map to map itself to UID 0 inside that new namespace, mount a cgroup v1 hierarchy, and then abuse the release_agent mechanism by writing a command path to the hierarchy's release_agent file (for example /sys/fs/cgroup/memory/release_agent). When the cgroup is released, the release_agent program executes with root privileges in the host context. The eBPF signature: an unshare(CLONE_NEWUSER) call from within a container cgroup, followed by writes to /proc/self/uid_map or /proc/self/gid_map.
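One way to express the setns() check in user space is to compare namespace identities rather than paths. The sketch below assumes the detector has already recorded the host's namespace link targets (the strings returned by readlink on /proc/1/ns/* from the host PID namespace, such as "pid:[4026531836]") and receives the link target of the descriptor passed to setns(). The function names and the event shape are hypothetical.

```python
import re
from typing import Optional, Set, Tuple

# Namespace symlinks read as "<type>:[<inode>]", e.g. "pid:[4026531836]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link: str) -> Optional[Tuple[str, int]]:
    """Split a namespace link target into (namespace type, inode number)."""
    match = _NS_LINK.match(link)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_host_namespace_join(caller_in_container: bool,
                           target_link: str,
                           host_links: Set[str]) -> bool:
    """True when a containerized caller passes a host namespace to setns()."""
    if parse_ns_link(target_link) is None:
        return False  # not a namespace descriptor at all
    return caller_in_container and target_link in host_links
```

Comparing link targets sidesteps the fact that a path such as /proc/1/ns/pid means different things depending on which PID namespace resolves it.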
runC / Container Runtime Exploitation (open on /proc/self/exe)
CVE-2019-5736 demonstrated that a malicious container could overwrite the runC binary on the host. When runC executes inside the container (for example while entering it for an exec), /proc/self/exe for that process, and /proc/<pid>/exe as seen by other processes in the container, resolves to the host's runC binary. The attack: obtain a file descriptor to that binary, then open it for writing with O_WRONLY | O_TRUNC, which succeeds once the runC process has exited and the binary is no longer being executed, and overwrite it with a malicious payload. The eBPF signal: a write-mode open of /proc/self/exe (or of another process's /proc/<pid>/exe) from a process in a container cgroup is highly anomalous. Almost no legitimate container workload needs to open its own executable for writing. This pattern also applies to variations exploiting the Docker or containerd runtime init binary path.
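The write-mode check can be sketched as a small predicate over an open() event. The function name and the event fields (path, flags, in_container) are hypothetical; the flag constants come from Python's os module and mirror the kernel's open(2) flags on Linux. A path-string match like this would miss variants that reopen an already-held descriptor through /proc/self/fd/<n>, which need the resolved target instead.

```python
import os
import re

# Matches /proc/self/exe and /proc/<pid>/exe.
_PROC_EXE = re.compile(r"^/proc/(self|\d+)/exe$")


def is_suspicious_exe_open(path: str, flags: int, in_container: bool) -> bool:
    """Flag write-capable opens of a /proc/.../exe link from a container."""
    access = flags & os.O_ACCMODE  # isolate the access-mode bits
    wants_write = access in (os.O_WRONLY, os.O_RDWR)
    return in_container and wants_write and bool(_PROC_EXE.match(path))
```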
Privileged Container Misuse
Privileged pods (securityContext.privileged: true) run with every capability and with access to the host's devices. From a privileged pod, an attacker can trivially mount the host filesystem (mount /dev/sda1 /mnt), read /mnt/etc/passwd, chroot into /mnt, and achieve full host access. The eBPF detection signal here is the sys_enter_mount tracepoint firing for a process in a container cgroup with a source argument that corresponds to a host block device. Additionally, sys_enter_chroot to a path outside the container's expected filesystem root is a strong indicator.
Admission controllers enforcing the Kubernetes Pod Security Standards (PSS) should block privileged pods from being scheduled in the first place: both the baseline and the restricted profiles disallow privileged: true. But teams that haven't enforced PSS cluster-wide, or that have exceptions for specific namespaces, leave this vector open. Runtime detection catches it when admission control fails or is bypassed.
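A minimal sketch of the mount check follows. The device-name pattern is an assumption that covers common SCSI/SATA, NVMe, virtio, Xen, and device-mapper names; it is not exhaustive, and a production rule would more likely resolve the source to a block device by its major and minor numbers.

```python
import re

# Common block-device naming schemes (assumed, not exhaustive).
_BLOCK_DEV = re.compile(
    r"^/dev/(sd[a-z]+\d*|vd[a-z]+\d*|xvd[a-z]+\d*|nvme\d+n\d+(p\d+)?|dm-\d+)$"
)


def is_host_device_mount(source: str, in_container: bool) -> bool:
    """True when a containerized process mounts what looks like a host disk."""
    return in_container and bool(_BLOCK_DEV.match(source))
```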
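To make the admission-side check concrete, here is a small sketch that scans a pod spec (as a plain dictionary, for example parsed from a manifest) for the escape-relevant settings named in this article. It is not a reimplementation of Pod Security Admission and covers only a handful of fields; the function name and finding strings are hypothetical.

```python
def escape_risk_findings(pod_spec: dict) -> list:
    """List escape-relevant settings found in a pod spec dictionary."""
    findings = []
    if pod_spec.get("hostPID"):
        findings.append("hostPID")
    if pod_spec.get("hostNetwork"):
        findings.append("hostNetwork")
    containers = (pod_spec.get("containers") or []) + (
        pod_spec.get("initContainers") or []
    )
    for container in containers:
        name = container.get("name", "?")
        sec = container.get("securityContext") or {}
        if sec.get("privileged"):
            findings.append(f"{name}: privileged")
        added = (sec.get("capabilities") or {}).get("add") or []
        # Kubernetes manifests usually omit the CAP_ prefix; accept both forms.
        if "SYS_ADMIN" in added or "CAP_SYS_ADMIN" in added:
            findings.append(f"{name}: adds SYS_ADMIN")
    return findings
```

An empty list means none of these particular settings were present, not that the pod passes a PSS profile.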
Kernel Exploit Patterns (CVE-2022-0185, CVE-2024-1086)
Kernel vulnerabilities that grant local privilege escalation are the hardest to stop at admission time because no image scan catches a zero-day. CVE-2022-0185, a heap overflow in the filesystem context code reached through the fsconfig syscall (part of the new-style mount API), was exploited by passing crafted parameters to fsconfig() that overflowed a kernel heap buffer and yielded an arbitrary write. CVE-2024-1086, a use-after-free in the nf_tables netfilter component, was exploited via carefully crafted nftables rule operations to gain kernel code execution.
For these, the eBPF detection point isn't the vulnerability itself — it's the post-exploitation behavior. After achieving kernel code execution, attackers typically call commit_creds(prepare_kernel_cred(NULL)) (or equivalent) to escalate to UID 0, then attempt to call setns(), chroot(), or exec a shell. The observable signal is an unusual capability set appearing on a process that shouldn't have it, combined with subsequent namespace or filesystem operations that are inconsistent with the container's expected behavior.
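The "unusual capability set" signal can be approximated by diffing a process's effective capability mask against a baseline. The sketch below parses the CapEff line of /proc/<pid>/status; the baseline used in the usage note is Docker's well-known default mask and is only an assumption about what a given workload should hold. CAP_SYS_ADMIN is bit 21 in linux/capability.h.

```python
CAP_SYS_ADMIN = 21  # bit index from linux/capability.h


def effective_caps(status_text: str) -> int:
    """Parse the CapEff hex mask out of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def gained_caps(baseline: int, current: int) -> int:
    """Return the capability bits present now that the baseline lacked."""
    return current & ~baseline


def has_cap(mask: int, cap: int) -> bool:
    """Check a single capability bit in a mask."""
    return bool((mask >> cap) & 1)
```

For example, with a baseline of 0x00000000a80425fb, a process that later reports CapEff 000001ffffffffff has gained CAP_SYS_ADMIN among others, which is worth correlating with any subsequent setns() or chroot() call.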
Building an eBPF Detection Rule for Namespace Escapes
In our internal red-team simulations reproducing CVE-2022-0492-class escape patterns on test clusters, we built detection around three correlated signals rather than a single rule:
- Trigger: unshare(CLONE_NEWUSER) from a container cgroup, the initial step of the user namespace escape. This alone generates some false positives in clusters that legitimately use user namespaces for container builds (e.g., rootless buildah). Filter: correlate with whether the calling process's parent is a container runtime process or an application process.
- Follow-up: write to /proc/self/uid_map or /proc/self/gid_map within 500ms of the unshare. This narrows the false positive set dramatically. Legitimate user namespace usage typically maps UIDs immediately, but does so in a recognizable pattern tied to the container image's expected startup behavior.
- Escalation: subsequent open() or write() to the cgroupfs release_agent path /sys/fs/cgroup/*/release_agent. This is the exploitation step. No legitimate container workload writes to a cgroup release_agent path.
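The correlation of these three signals can be sketched as a small state machine keyed by process ID. This is an illustrative model, not our production detector: the event kinds, severity names, and the EscapeCorrelator class are hypothetical, and keying on a single PID is a simplification, since a real exploit may spread the steps across child processes.

```python
from dataclasses import dataclass, field

UID_MAP_WINDOW_S = 0.5  # the 500 ms follow-up window from the list above
ID_MAP_PATHS = ("/proc/self/uid_map", "/proc/self/gid_map")


@dataclass
class EscapeCorrelator:
    # pid -> timestamp (seconds) of its last unshare(CLONE_NEWUSER)
    unshare_at: dict = field(default_factory=dict)

    def observe(self, pid: int, kind: str, ts: float, path: str = "") -> str:
        """Return a severity for one event: none, notice, warning, critical."""
        if kind == "unshare_newuser":
            self.unshare_at[pid] = ts
            return "notice"  # trigger alone: may be a rootless build
        if kind == "write" and path in ID_MAP_PATHS:
            started = self.unshare_at.get(pid)
            if started is not None and ts - started <= UID_MAP_WINDOW_S:
                return "warning"  # follow-up inside the window
            return "none"
        if (kind in ("open", "write")
                and path.startswith("/sys/fs/cgroup/")
                and path.endswith("/release_agent")):
            return "critical"  # exploitation step, never expected
        return "none"
```

Feeding it an unshare at t=10.0, a uid_map write at t=10.2, and a release_agent write at t=11.0 for the same PID yields notice, warning, and critical in turn.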
A Falco-compatible rule for the first signal looks like this:
# Process names allowed to create user namespaces (tune per environment).
- list: allowed_container_build_tools
  items: [newuidmap, buildah]

- rule: Container User Namespace Escape Attempt
  desc: A process inside a container unshares the user namespace (potential escape attempt).
  condition: >
    evt.type = unshare
    and container.id != host
    and evt.arg.flags contains CLONE_NEWUSER
    and not proc.name in (allowed_container_build_tools)
  output: >
    User namespace escape attempt detected (container=%container.name
    pid=%proc.pid parent=%proc.pname cmdline=%proc.cmdline)
  priority: CRITICAL
  tags: [container_escape, privilege_escalation]
The allowed_container_build_tools list would contain process names such as newuidmap and buildah for environments that legitimately build container images inside pods. Without that exception, this rule fires on rootless Podman builds, which is an example of why detection rules need environment-aware tuning, not just raw pattern matching.
The Counterpoint: Admission Control Is Not Optional
We're not saying runtime detection replaces admission control for container escape prevention. We're saying runtime detection catches what admission control misses after the fact.
The Pod Security Standards restricted profile eliminates a large fraction of the escape attack surface: no privileged containers, no added capabilities such as CAP_SYS_ADMIN, no hostPID, no hostNetwork, no privilege escalation (allowPrivilegeEscalation: false), and no containers running as root. Applying restricted to all production namespaces and baseline as a cluster default is correct and you should do it. The three-layer K8s security model covers this at the deploy layer.
But admission control has gaps. Teams create namespace exceptions for operators and system components that need elevated permissions. New vulnerabilities (like kernel CVEs) enable escapes that no PodSpec restriction would have blocked — the attack vector is the kernel itself, not a misconfigured SecurityContext. And in the window between a vulnerability being exploited and a pod being evicted, the runtime detector is your only active sensor.
The right architecture is defense in depth: Pod Security Standards at admission, NetworkPolicy to limit blast radius, and runtime detection (via an eBPF-based DaemonSet) for the events that slip through. Container escape detection is what covers the gap between what admission control blocks and what actually executes on the node.
If you're building escape detection rules for your own Falco deployment or want to understand the full escape detection matrix Kubesentry covers, see the container escape use case page or review the detection rule documentation.