How eBPF Makes Runtime Security Practical for K8s Teams

Before eBPF, runtime security meant kernel modules (fragile) or sidecar agents (overhead). eBPF changed the equation: attach a probe at the kernel level, with negligible overhead for your application and full syscall visibility. Here's how it works under the hood.

Dev Anand, CTO, Kubesentry

[Figure: eBPF probes hooking into the Linux kernel syscall table]

The Problem with Pre-eBPF Runtime Security

Before eBPF became viable for production use (roughly kernel 4.18 and stabilizing around 5.8), teams who wanted runtime visibility in Kubernetes had two bad options. The first was kernel modules: loadable kernel objects that could intercept syscalls but required matching the exact kernel version, broke on upgrades, and introduced crash risk directly in ring 0. One bad pointer dereference and you're looking at a kernel panic on a production node. The second was sidecar injection: an agent container running alongside every application pod, intercepting process events via ptrace or shared-memory tricks. Sidecar approaches work, but they add per-pod CPU and memory overhead, require mutations to every PodSpec, and still can't see kernel-level events that happen before the sidecar registers.

Neither option was acceptable for a team running 50-node EKS clusters with mixed workloads and an SRE team of four. So most mid-size K8s teams simply didn't have runtime visibility. They had image scanners that ran at build time and admission controllers that ran at deploy time — and then a gap the size of the Mariana Trench once the container actually started executing.

eBPF changed this by solving the attachment problem at the kernel level, safely.

How eBPF Actually Works: The Verifier and the Hook Points

eBPF is a virtual machine inside the Linux kernel. You write a program — typically in restricted C, compiled to eBPF bytecode via LLVM — and submit it to the kernel's verifier. The verifier performs static analysis: no unbounded loops, no unsafe memory access, bounded stack depth (512 bytes), no kernel function calls outside a controlled set. If your program passes verification, it gets JIT-compiled to native machine code and loaded. If it doesn't pass, it's rejected — your process gets an error, not a kernel panic.
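To make the idea of static verification concrete, here is a toy model in Python. It is not the kernel's algorithm: the opcodes, the `Insn` type, and `toy_verify` are all invented for illustration, and rejecting every backward jump is stricter than the real verifier, which accepts loops it can prove are bounded. The point it tries to show is that the program is checked without ever being run, and a failed check produces an error rather than a crash.

```python
from dataclasses import dataclass

MAX_STACK_BYTES = 512  # stack limit cited in the text


@dataclass(frozen=True)
class Insn:
    op: str       # toy opcodes: "store_stack", "jump", "exit" (not real eBPF opcodes)
    arg: int = 0  # byte offset below the frame pointer, or jump target index


def toy_verify(prog):
    """Statically check a toy program; returns (accepted, reason). Nothing is executed."""
    if not prog or prog[-1].op != "exit":
        return False, "program must end with exit"
    for idx, insn in enumerate(prog):
        if insn.op == "store_stack":
            if not 1 <= insn.arg <= MAX_STACK_BYTES:
                return False, f"insn {idx}: stack offset {insn.arg} out of bounds"
        elif insn.op == "jump":
            if not 0 <= insn.arg < len(prog):
                return False, f"insn {idx}: jump target out of range"
            if insn.arg <= idx:
                return False, f"insn {idx}: backward jump, loop may be unbounded"
        elif insn.op != "exit":
            return False, f"insn {idx}: unknown opcode {insn.op!r}"
    return True, "ok"


good = [Insn("store_stack", 8), Insn("jump", 2), Insn("exit")]
bad_stack = [Insn("store_stack", 600), Insn("exit")]
bad_loop = [Insn("store_stack", 8), Insn("jump", 0), Insn("exit")]

for name, prog in [("good", good), ("bad_stack", bad_stack), ("bad_loop", bad_loop)]:
    print(name, toy_verify(prog))
```

Only the first program is accepted; the other two are rejected with a reason string, which mirrors the "error code, not kernel panic" behavior described above.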

This is the fundamental safety property that makes eBPF usable in production: the verifier is the gatekeeper. You can't accidentally crash the host kernel with a buggy eBPF program the way you can with a kernel module. The worst outcome for a rejected program is an error code; the worst outcome for a running verified program is bounded by the verifier's invariants.

Once loaded, eBPF programs attach to hook points. The hook points relevant to runtime security are:

tracepoints: stable kernel instrumentation points. sys_enter_* and sys_exit_* tracepoints fire for every syscall, giving you both the entry arguments and the return value.
kprobes/kretprobes: dynamic probes on arbitrary kernel functions. More powerful but less stable across kernel versions — function signatures can change.
LSM hooks (Linux Security Module): the same hook points used by SELinux and AppArmor. With eBPF LSM (kernel 5.7+), you can attach security enforcement logic without writing a full LSM module.
uprobes: probes on userspace function calls. Useful for language-specific instrumentation (Go runtime events, glibc calls) but with higher overhead than kernel probes.

For container escape detection, the most important hook points are sys_enter_setns (namespace reassignment), sys_enter_unshare (namespace unsharing), and the LSM hook task_alloc. For cryptomining detection, sys_enter_connect combined with sys_enter_clone and process name inspection via the bpf_get_current_comm() helper (which fills a TASK_COMM_LEN-byte buffer, 16 bytes) covers the main signal sources.
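Before relying on these syscall tracepoints, it can help to confirm they exist on a given node. The sketch below assumes tracefs exposes each syscall tracepoint as a directory under events/syscalls (on a real node the root is usually /sys/kernel/tracing and needs root access). The function name `missing_tracepoints` is made up for this example, and the demo builds a fake directory tree so it runs anywhere.

```python
import os
import tempfile

# Syscall tracepoints named in the text above.
WANTED = ("sys_enter_setns", "sys_enter_unshare", "sys_enter_connect", "sys_enter_clone")


def missing_tracepoints(tracefs_root, names=WANTED):
    """Return the names that have no directory under <tracefs_root>/events/syscalls."""
    base = os.path.join(tracefs_root, "events", "syscalls")
    return [n for n in names if not os.path.isdir(os.path.join(base, n))]


# Demo against a fake tracefs tree that lacks sys_enter_clone.
with tempfile.TemporaryDirectory() as demo_root:
    for name in WANTED[:3]:
        os.makedirs(os.path.join(demo_root, "events", "syscalls", name))
    demo_missing = missing_tracepoints(demo_root)

print(demo_missing)  # ['sys_enter_clone']
```

On a node, calling `missing_tracepoints("/sys/kernel/tracing")` and getting an empty list back would suggest all four hooks are present.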

eBPF Maps: How Data Gets from Kernel to Userspace

eBPF programs themselves can't make system calls or write to files — they're sandboxed by the verifier. They communicate with userspace through eBPF maps: kernel-resident data structures that both the eBPF program and a userspace process can read and write. The common map types used in security tools are:

BPF_MAP_TYPE_HASH: key-value lookups. Used to store per-container-ID state (syscall counts, baseline profiles, allowlists).
BPF_MAP_TYPE_PERF_EVENT_ARRAY: high-throughput per-CPU buffers for streaming events from kernel to userspace. This is how raw alert events get to the detection engine without blocking the syscall path.
BPF_MAP_TYPE_RINGBUF (kernel 5.8+): successor to the perf event array. It is a single buffer shared across CPUs, with better ordering guarantees and lower overhead for high-event-rate scenarios.
BPF_MAP_TYPE_LRU_HASH: least-recently-used hash for bounded-size caches — useful for storing per-process behavioral state without unbounded memory growth.
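The bounded-memory behavior of an LRU map can be modeled in userspace. This is only a Python stand-in, not the kernel data structure: `LruTable` is a hypothetical class, and the kernel's eviction is not guaranteed to be strictly least-recently-used the way this model is. It shows why a fixed `max_entries` keeps per-process state from growing without limit.

```python
from collections import OrderedDict


class LruTable:
    """Userspace model of a fixed-capacity map that evicts the least recently used key."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def update(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)      # refresh recency on overwrite
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # drop the oldest entry

    def lookup(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # a read also counts as recent use
        return self._data[key]

    def __len__(self):
        return len(self._data)


table = LruTable(max_entries=2)
table.update(101, {"execs": 1})   # keys stand in for PIDs
table.update(102, {"execs": 4})
table.lookup(101)                 # PID 101 is now the most recently used
table.update(103, {"execs": 2})   # capacity exceeded: PID 102 is evicted

print(len(table), table.lookup(102), table.lookup(101))
```

The table never holds more than two entries, and the state for the idle process is what gets dropped.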

The architecture in practice: the eBPF probe fires on a syscall, writes a minimal event record (syscall number, PID, container cgroup ID, timestamp, relevant arguments) to a ring buffer, and returns in microseconds. A userspace agent — running as a DaemonSet pod on each node — reads the ring buffer continuously, enriches events with container metadata from the K8s API, and ships them to the policy evaluation engine.
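On the userspace side, each record read from the ring buffer is just a fixed-size byte string that has to be decoded. The layout below is an assumption made for this sketch (timestamp, cgroup ID, PID, syscall number, first argument, little-endian); a real agent defines its own struct shared between the probe and the reader. The sample bytes are packed locally rather than read from a kernel buffer.

```python
import struct
from typing import NamedTuple

# Hypothetical record layout: u64 ts_ns, u64 cgroup_id, u32 pid, u32 syscall_nr, u64 arg0
EVENT = struct.Struct("<QQIIQ")


class Event(NamedTuple):
    ts_ns: int
    cgroup_id: int
    pid: int
    syscall_nr: int
    arg0: int


def decode(raw: bytes) -> Event:
    """Decode one fixed-size record; reject anything with the wrong length."""
    if len(raw) != EVENT.size:
        raise ValueError(f"expected {EVENT.size} bytes, got {len(raw)}")
    return Event(*EVENT.unpack(raw))


# Simulated record: unshare (syscall 272 on x86-64) called with CLONE_NEWUSER.
sample = EVENT.pack(1_700_000_000_000, 4242, 1234, 272, 0x10000000)
evt = decode(sample)
print(evt.pid, evt.syscall_nr, hex(evt.arg0))  # 1234 272 0x10000000
```

Keeping the record small and fixed-size is what lets the probe write it and return quickly; enrichment with pod metadata happens after decoding, in the agent.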

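This design has a critical property: the eBPF probe adds almost nothing to the application's critical path. The probe runs inline with the syscall, but it does only a small, bounded amount of work: it writes to the ring buffer and returns, and the application's syscall continues immediately. There is no blocking and no ptrace-style stop-the-process. Enrichment and policy evaluation happen asynchronously in userspace. The per-call cost on the probe path is typically on the order of a microsecond or less, small compared to the cost of most syscalls, though the exact figure depends on the hook type and how much data the probe copies.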

Container Identity: The cgroup Trick

One non-obvious challenge in K8s runtime security with eBPF: how do you know which container a syscall came from? PIDs alone aren't enough. PID 1234 in the host's initial PID namespace and PID 1234 inside a container's PID namespace are different processes, so the number alone is ambiguous without its PID namespace context.

The reliable anchor is the cgroup ID. Every container in Kubernetes gets a unique cgroup hierarchy rooted at a path like /sys/fs/cgroup/kubepods/burstable/pod<UID>/<containerID>. The eBPF helper bpf_get_current_cgroup_id() returns the numeric cgroup ID for the current task, which the userspace agent maps to a container ID by reading the cgroupfs hierarchy. This gives you a stable, kernel-level container identity that works regardless of PID namespace remapping.
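The userspace half of this lookup can be sketched as a walk over the cgroup hierarchy. This assumes cgroup v2 on a 64-bit system, where the ID reported by the kernel is expected to match the inode number of the cgroup's directory; verify that on your own nodes. The function names and the `pod...` directory convention check are illustrative, and the demo uses a temporary directory in place of /sys/fs/cgroup.

```python
import os
import tempfile


def build_cgroup_index(cgroup_root):
    """Map directory inode number -> path relative to the cgroup root."""
    index = {}
    for dirpath, _dirnames, _filenames in os.walk(cgroup_root):
        index[os.stat(dirpath).st_ino] = os.path.relpath(dirpath, cgroup_root)
    return index


def container_for(cgroup_id, index):
    """Return the container directory name for a cgroup ID, or None if unknown."""
    path = index.get(cgroup_id)
    if path is None:
        return None
    parent = os.path.basename(os.path.dirname(path))
    # Assumed layout: .../pod<UID>/<containerID>
    return os.path.basename(path) if parent.startswith("pod") else None


# Demo on a fake hierarchy; names are placeholders, not real IDs.
with tempfile.TemporaryDirectory() as root:
    leaf = os.path.join(root, "kubepods", "burstable", "pod-demo-uid", "container-demo-id")
    os.makedirs(leaf)
    index = build_cgroup_index(root)
    demo_container = container_for(os.stat(leaf).st_ino, index)
    demo_pod_dir = container_for(os.stat(os.path.dirname(leaf)).st_ino, index)

print(demo_container)  # container-demo-id
```

A real agent would refresh this index as pods start and stop, and would need to handle the different directory naming used by the systemd cgroup driver.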

Tools like Falco, Tracee (from Aqua Security), and Cilium's Tetragon all use this cgroup-anchored approach. Cilium Tetragon additionally enriches with Kubernetes pod metadata via its direct integration with the Kubernetes API — so you get pod name, namespace, labels, and service account directly in every event record.

Where eBPF Runtime Security Has Real Limits

eBPF is not magic, and it's worth being precise about what it can and cannot catch.

First: eBPF operates at the syscall level. It sees what syscalls a process makes, not why. A process calling execve("/bin/sh", ...) after a legitimate container build step looks identical at the syscall level to a process calling execve("/bin/sh", ...) after an exploitation. The difference is context: process ancestry, which container it's in, what the normal syscall pattern for that container image looks like. This is why behavioral baselining — learning the expected syscall profile per container image — matters. Without it, you're writing brittle rules that will fire on legitimate workloads.
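Behavioral baselining can be reduced to a very small core: record which syscalls each image makes during a learning window, then flag anything outside that set. The sketch below is a deliberately naive illustration with hypothetical names (`SyscallBaseline`, the image tag `web:1.0`); a production profile would also weigh arguments, process ancestry, and frequency.

```python
from collections import defaultdict


class SyscallBaseline:
    """Per-image allowlist of syscalls learned from observed behavior."""

    def __init__(self):
        self._seen = defaultdict(set)  # image -> syscall names seen while learning

    def learn(self, image, syscall):
        self._seen[image].add(syscall)

    def is_anomalous(self, image, syscall):
        known = self._seen.get(image)
        if not known:
            return False  # no baseline yet: stay quiet rather than alert on everything
        return syscall not in known


baseline = SyscallBaseline()
for call in ("read", "write", "accept4", "epoll_wait"):
    baseline.learn("web:1.0", call)

print(baseline.is_anomalous("web:1.0", "read"))      # False
print(baseline.is_anomalous("web:1.0", "execve"))    # True
print(baseline.is_anomalous("batch:2.3", "execve"))  # False (no baseline learned)
```

The same execve is normal for an image that routinely spawns shells and anomalous for one that never has, which is the context the raw syscall alone cannot provide.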

Second: encrypted payloads are opaque. A probe on sys_enter_write tells you that a write happened to a file descriptor, and it can even read the buffer, but for TLS traffic that buffer is already ciphertext (unless you add a uprobe on the application's TLS library, which is considerably more complex). For data exfiltration detection, you're working with metadata signals (destination IP, port, connection frequency, byte volume) rather than payload inspection.

Third: kernel version matters. You need kernel 5.8+ for the most reliable map types and for LSM hook support (which arrived in 5.7). Clusters running Amazon Linux 2 (kernel 4.14 by default), older Ubuntu 18.04 images (kernel 4.15), or any kernel below 4.18 won't support eBPF programs of this complexity. In practice, this is less of a constraint than it was two years ago, since EKS AL2023, GKE's default node images, and AKS Ubuntu nodes are all kernel 5.10+, but self-managed or legacy clusters require attention.
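A coarse way to triage nodes is to compare the kernel release string against the minimum versions mentioned in this article. The sketch below uses only those thresholds; the feature labels and function names are invented here, and a version check is a rough proxy because distributions backport features and some (such as BPF LSM) must also be enabled in the kernel configuration.

```python
import re

# Minimum kernel versions as stated in the article.
FEATURE_MIN_KERNEL = {
    "bpf_lsm": (5, 7),   # eBPF LSM hooks
    "ringbuf": (5, 8),   # BPF_MAP_TYPE_RINGBUF
    "cap_bpf": (5, 8),   # CAP_BPF + CAP_PERFMON split
}


def parse_release(release):
    """Extract (major, minor) from a release string such as '5.10.0-example'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def available_features(release):
    version = parse_release(release)
    return sorted(name for name, minimum in FEATURE_MIN_KERNEL.items() if version >= minimum)


for release in ("4.14.0-example", "5.7.0-example", "5.10.0-example"):
    print(release, available_features(release))
```

On a node, `platform.release()` from the standard library returns the string to feed in.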

We're not saying eBPF solves everything. We're saying it solves the attachment problem — the kernel-level observation point — better than any prior approach. The detection logic on top of that data still requires careful engineering.

How This Translates to a K8s DaemonSet Deployment

The practical deployment model for eBPF-based runtime security in Kubernetes is a DaemonSet: one agent pod per node, running with elevated Linux capabilities. Specifically, the agent needs CAP_SYS_ADMIN (for eBPF program loading on kernels below 5.8, which predate the CAP_BPF + CAP_PERFMON split), access to the host PID namespace (hostPID: true), and access to the host cgroupfs and procfs (/sys/fs/cgroup and /proc mounted read-only).
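The requirements above can be written down as a manifest. The sketch below builds one as a Python dictionary and prints it as JSON, which kubectl accepts. Everything named here is a placeholder (the image, the `runtime-security` namespace, the `ebpf-agent` labels, the `/host` mount prefix), and the capability list is the minimum implied by the paragraph above; a real agent will likely need more, and the BPF and PERFMON capability names require a reasonably recent container runtime.

```python
import json


def agent_daemonset(kernel=(5, 10), image="registry.example/ebpf-agent:dev"):
    """Build a minimal DaemonSet manifest for a hypothetical eBPF agent."""
    # Kernels before 5.8 have no CAP_BPF/CAP_PERFMON, so fall back to SYS_ADMIN.
    caps = ["BPF", "PERFMON"] if kernel >= (5, 8) else ["SYS_ADMIN"]
    host_mounts = {"cgroup": "/sys/fs/cgroup", "proc": "/proc"}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "ebpf-agent", "namespace": "runtime-security"},
        "spec": {
            "selector": {"matchLabels": {"app": "ebpf-agent"}},
            "template": {
                "metadata": {"labels": {"app": "ebpf-agent"}},
                "spec": {
                    "hostPID": True,
                    "containers": [{
                        "name": "agent",
                        "image": image,
                        "securityContext": {
                            "capabilities": {"drop": ["ALL"], "add": caps},
                        },
                        "volumeMounts": [
                            {"name": name, "mountPath": "/host" + path, "readOnly": True}
                            for name, path in host_mounts.items()
                        ],
                    }],
                    "volumes": [
                        {"name": name, "hostPath": {"path": path}}
                        for name, path in host_mounts.items()
                    ],
                },
            },
        },
    }


print(json.dumps(agent_daemonset(), indent=2))
```

Dropping all capabilities and adding back only what the kernel version requires keeps the agent's privileges explicit, which makes the audit described in the next paragraph easier.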

These requirements are a legitimate security concern — a compromised DaemonSet agent has significant host access. The mitigations are: run the agent in its own dedicated namespace with tight NetworkPolicy restrictions, use PodSecurityAdmission with a custom allowance for the agent's namespace only, and audit the agent's own syscall footprint against its stated behavior. The agent itself should have a well-defined, auditable surface area.

Consider a scenario from our internal red-team work: a 40-node EKS cluster running a multi-tenant SaaS application, with the Kubesentry DaemonSet deployed across all nodes. A simulated attacker with code execution in an application pod attempts a namespace escape by calling unshare(CLONE_NEWUSER) followed by writing to /proc/self/uid_map — a standard unprivileged user namespace escape vector that CVE-2022-0492 demonstrated in cgroup v1 release_agent scenarios. The eBPF LSM hook on task_alloc combined with the tracepoint on sys_enter_unshare fires within milliseconds. The event includes the container cgroup ID, the calling PID and process name, and the namespace flags requested. The alert reaches the policy engine in under 200ms from syscall to processed event — well within the detection window needed to interrupt the escape sequence before host filesystem access occurs.
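The "namespace flags requested" field in such an event is a bitmask, and turning it into names is a small step in the detection path. The flag values below are the CLONE_NEW* constants from the Linux headers; the two functions and the `container_allows_userns` parameter are hypothetical, and the rule shown is far simpler than a real policy.

```python
# CLONE_NEW* namespace flags as defined in <linux/sched.h>.
NS_FLAGS = {
    0x00020000: "CLONE_NEWNS",
    0x02000000: "CLONE_NEWCGROUP",
    0x04000000: "CLONE_NEWUTS",
    0x08000000: "CLONE_NEWIPC",
    0x10000000: "CLONE_NEWUSER",
    0x20000000: "CLONE_NEWPID",
    0x40000000: "CLONE_NEWNET",
}
CLONE_NEWUSER = 0x10000000


def namespace_flags(flags):
    """Return the names of the namespace flags set in an unshare() argument."""
    return [name for bit, name in NS_FLAGS.items() if flags & bit]


def is_suspicious_unshare(flags, container_allows_userns=False):
    """Toy rule: a new user namespace from a container not expected to create one."""
    return bool(flags & CLONE_NEWUSER) and not container_allows_userns


requested = 0x10000000 | 0x00020000
print(namespace_flags(requested))        # ['CLONE_NEWNS', 'CLONE_NEWUSER']
print(is_suspicious_unshare(requested))  # True
```

In practice this check would be combined with the container identity and baseline for the workload, since some build and sandboxing tools create user namespaces legitimately.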

That's the practical value proposition: catching the syscall pattern that indicates an escape attempt, before the escape completes, without adding sidecar overhead to the application pods sharing those 40 nodes.

If you want to explore eBPF runtime security further, the open-source Tracee project (Apache 2.0, maintained by Aqua Security) and Tetragon (Apache 2.0, by Cilium/Isovalent) are the best places to understand what's possible at the kernel level. Falco's eBPF driver is the most production-deployed reference implementation for the detection-rule-on-top-of-eBPF pattern. The Kubesentry platform is built on this same foundation — see how our approach compares to running Falco OSS directly, or explore container escape detection specifically.