Why SREs Are Becoming the First Line of Security Response

Security teams write policies. SREs get paged at 2 AM when something breaks. The convergence of those two roles is reshaping how runtime incidents get detected, triaged, and resolved.

The Pager Doesn't Care Whether It's a Security Incident

A PagerDuty alert at 2:47 AM reads: "Pod payment-processor-7d8b9c — CPU 98%, 15 minutes sustained." The on-call SRE acknowledges it. They start with the usual runbook: check recent deploys, check HPA, check whether a batch job kicked off, look at resource limits. Nothing obvious. They exec into the pod to see what's running. There's an unfamiliar process — kswapd0 — consuming all the CPU. Not a kernel process. An application process named to look like one. They're looking at a cryptominer, not a hung application. The incident escalated from "noisy pod" to "active security incident" in under two minutes of investigation, on a rotation with no security engineer on call.
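The check the SRE made by eye can be approximated in code. On Linux, genuine kernel threads have an empty /proc/<pid>/cmdline, so a process that borrows a kernel thread name but carries a userspace command line is worth a second look. The sketch below is a heuristic under that assumption, not a definitive detector; the ProcInfo record, the prefix list, and the sample command line are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Hypothetical, partial list of kernel thread name prefixes (illustration only).
KERNEL_THREAD_PREFIXES = ("kswapd", "kworker", "ksoftirqd", "kthreadd", "migration")


@dataclass
class ProcInfo:
    pid: int
    comm: str     # short process name, as found in /proc/<pid>/comm
    cmdline: str  # contents of /proc/<pid>/cmdline; empty for kernel threads


def looks_like_masquerade(proc: ProcInfo) -> bool:
    """Flag a process whose name imitates a kernel thread but which has a
    userspace command line. Heuristic only: it can miss or misfire."""
    imitates_kernel = proc.comm.startswith(KERNEL_THREAD_PREFIXES)
    has_userspace_cmdline = proc.cmdline.strip() != ""
    return imitates_kernel and has_userspace_cmdline


# Sample records (made up): a real kernel thread and an impostor.
real_kswapd = ProcInfo(pid=87, comm="kswapd0", cmdline="")
impostor = ProcInfo(pid=4312, comm="kswapd0", cmdline="./kswapd0 --threads 8")
```

Here looks_like_masquerade(impostor) is true and looks_like_masquerade(real_kswapd) is false, which mirrors the distinction drawn in the incident above.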

This scenario plays out regularly in SaaS engineering teams. The security team wrote the policy: "no privileged pods, no unvetted base images." But when the event happened — a dependency confusion attack that slipped a malicious package into the Python dependencies, activating a dormant miner after a few days of clean operation — it was the SRE who detected it (incidentally, while investigating CPU), contained it (kubectl delete pod), and filed the post-mortem. The security team learned about it from the post-mortem.

This is not a failure mode unique to any particular company size. It's structural.

Why SREs End Up Owning Runtime Incidents

The organizational split between SRE and security functions made sense in a pre-cloud, pre-container era. Security operated at the network perimeter (firewalls, IDS), and reliability was a separate concern (load balancers, databases). In a Kubernetes environment, those boundaries have dissolved. The attack surface is the runtime environment: the cluster, the pods, the syscall behavior of running containers. The reliability engineers who own the runtime are closest to the events when they happen.

SREs have the operational context that's necessary to triage runtime security events quickly. They know which pods are expected to call the payment API and which aren't. They know which container images were deployed last Tuesday and what changed. They have cluster access. They're already running incident response processes — runbooks, post-mortems, on-call rotations, MTTR tracking. Security teams often don't have cluster-level operational context; they have policy frameworks and compliance requirements.

The convergence doesn't mean SREs should become security analysts. It means the tools and processes designed for security operations need to fit into the operational patterns SREs already use — not require a separate SOC workflow that exists in parallel and rarely intersects.

The Alert Quality Problem

When teams do deploy runtime detection tools, Falco being the most common, the first thing SREs notice is alert volume. A default Falco deployment against a production cluster with active workloads generates a steady stream of alerts from legitimate activity. Rules like "Write below root" fire on package managers, log rotators, and a dozen other legitimate operations. Rules like "Terminal shell in container" fire on every kubectl exec from a developer investigating a bug. The rules are technically correct, since these are unusual events that could indicate compromise, but they're not tuned to the specific workload context.

The result: the security alert channel fills with false positives, the SRE on-call stops reading it within a week, and the detection capability you installed provides no practical protection because no one acts on its output. This is the alert fatigue problem applied specifically to runtime security, and it's widespread enough to be a named failure pattern in incident response post-mortems.

The fix is not to disable detection. The fix is behavioral baselining: learning the expected syscall and process patterns for each container image in your specific workload, suppressing alerts that match the known-good pattern, and escalating deviations. A container image that legitimately runs sh as part of its health check or startup script should not generate a "shell spawned in container" alert on every pod restart. The same alert from a container image that has never run sh is genuinely anomalous.
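A minimal sketch of the baselining idea, assuming the baseline is simply the set of executables observed per container image during a learning period. Real baselines would also cover syscalls, arguments, and network behavior. The ProcessBaseline class and the image names are hypothetical and do not describe any particular product's implementation.

```python
from collections import defaultdict


class ProcessBaseline:
    """Tracks which executables each container image has been seen running."""

    def __init__(self):
        self._known = defaultdict(set)  # image name -> set of executable paths

    def learn(self, image: str, executable: str) -> None:
        """Record an executable as expected behavior for this image."""
        self._known[image].add(executable)

    def is_anomalous(self, image: str, executable: str) -> bool:
        """True if this image has never been observed running the executable."""
        # .get avoids creating an empty entry as a side effect of the lookup.
        return executable not in self._known.get(image, set())


baseline = ProcessBaseline()
# Learning period: the worker image runs sh in its startup script, the API image never does.
baseline.learn("registry.example/worker:1.4", "/bin/sh")
baseline.learn("registry.example/api:2.1", "/usr/bin/python3")

suppress = not baseline.is_anomalous("registry.example/worker:1.4", "/bin/sh")
escalate = baseline.is_anomalous("registry.example/api:2.1", "/bin/sh")
```

The same "shell spawned" event is suppressed for the worker image and escalated for the API image, which is the behavior the paragraph above describes.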

What "SRE-Friendly" Runtime Detection Actually Requires

We've talked with SREs at growing SaaS companies about what makes a security alert actionable versus ignorable. The requirements that come up consistently:

Process tree context, not just the leaf event. An alert that says "unexpected exec: /bin/sh in container api-server-5f8d" is not triageable without knowing what spawned it. Was it nginx → sh (suspicious), or was it containerd-shim → entrypoint.sh → sh (expected startup behavior)? The full process ancestry from the container init process to the offending call is the minimum context for a 2 AM triage.
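To make the ancestry idea concrete, here is a small sketch that rebuilds the chain from a leaf process back to the container's root process. It assumes the detector has already recorded a pid-to-parent map and a pid-to-name map for the container; the ancestry function and the sample PIDs are hypothetical.

```python
def ancestry(pid: int, parents: dict, names: dict) -> list:
    """Walk from a process up to the root and return the names root-first."""
    chain = []
    seen = set()  # guards against cycles in malformed data
    current = pid
    while current in names and current not in seen:
        seen.add(current)
        chain.append(names[current])
        current = parents.get(current, 0)
    return list(reversed(chain))


# Sample data (made up): a shell spawned by nginx inside a container.
names = {1: "containerd-shim", 7: "nginx", 42: "sh"}
parents = {1: 0, 7: 1, 42: 7}

tree = " -> ".join(ancestry(42, parents, names))
```

With this sample data, tree is "containerd-shim -> nginx -> sh", the suspicious variant from the example above. An alert carrying that string saves the on-call from reconstructing it by hand.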

Container and namespace identification. "A pod on node ip-10-0-1-47 made a suspicious syscall" requires the SRE to run kubectl get pods --all-namespaces -o wide | grep ip-10-0-1-47, which only narrows the search to every pod scheduled on that node, not to the one that raised the alert. That's friction that increases time-to-contain. The alert should include pod name, namespace, and labels, the same identifiers the SRE already uses to navigate the cluster.
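One way to picture the enrichment step is a lookup from container ID to pod identity that runs before the alert is sent. The sketch below assumes such an index already exists, for example maintained from the Kubernetes API. The PodIdentity type, the enrich function, the event field names, and the sample namespace and labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PodIdentity:
    name: str
    namespace: str
    labels: dict = field(default_factory=dict)


def enrich(raw_event: dict, pods_by_container_id: dict) -> dict:
    """Return a copy of the event with pod identity attached, if it is known."""
    enriched = dict(raw_event)  # never mutate the caller's event
    pod = pods_by_container_id.get(raw_event.get("container_id", ""))
    if pod is not None:
        enriched.update(pod=pod.name, namespace=pod.namespace, labels=dict(pod.labels))
    return enriched


# Sample data (made up).
index = {"abc123": PodIdentity("api-server-5f8d", "prod", {"app": "api-server"})}
event = {"rule": "unexpected exec", "node": "ip-10-0-1-47", "container_id": "abc123"}

alert = enrich(event, index)
```

The resulting alert carries pod, namespace, and labels alongside the node name, so no kubectl lookup is needed before triage can start.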

Integration with existing alerting channels. SREs have a pager rotation. They have a Slack channel for operational alerts. A security incident that pages into a separate tool that SREs don't normally monitor will be seen later rather than sooner. Security alerts should route through the same PagerDuty or OpsGenie setup that already manages operational on-call, with appropriate severity tiers — a container escape attempt is Critical and wakes someone up; an anomalous DNS query from a dev namespace is Warning and goes into the monitoring queue.
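The severity tiers above amount to a small routing table. The following sketch shows the shape of that logic only; it does not call PagerDuty, OpsGenie, or Slack. The alert type identifiers and destination names are hypothetical, and sending unknown alert types to the queue rather than the pager is an assumption made here, not a recommendation from the text.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"


# Hypothetical alert type identifiers, mirroring the two examples in the text.
SEVERITY_BY_ALERT_TYPE = {
    "container_escape_attempt": Severity.CRITICAL,
    "anomalous_dns_query": Severity.WARNING,
}

# Hypothetical destination names; substitute your own integrations.
DESTINATION_BY_SEVERITY = {
    Severity.CRITICAL: "pager",
    Severity.WARNING: "monitoring-queue",
}


def route(alert_type: str) -> str:
    """Pick a destination for an alert based on its severity tier."""
    # Assumption: unknown alert types are treated as warnings.
    severity = SEVERITY_BY_ALERT_TYPE.get(alert_type, Severity.WARNING)
    return DESTINATION_BY_SEVERITY[severity]
```

Keeping the mapping in data rather than in branching code makes it easy to review with the security team, who can own the severity assignments while SREs own the destinations.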

Response runbooks tied to alert types. "Container escape attempt detected" is not an actionable alert unless the on-call knows what to do. The first response steps should be documented and linked directly from the alert: isolate the pod (remove it from service), capture its state (kubectl describe and kubectl logs), identify the escape vector, and check whether the host OS shows evidence of filesystem modification. This is standard SRE runbook practice applied to security response.
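Linking a runbook from the alert can be as simple as a lookup keyed by alert type. In this sketch the runbook URL, the alert type identifier, and the attach_runbook helper are hypothetical. The first steps are the ones listed in the paragraph above.

```python
RUNBOOKS = {
    "container_escape_attempt": {
        # Hypothetical URL; point this at wherever your runbooks live.
        "url": "https://runbooks.example.internal/container-escape",
        "first_steps": [
            "Isolate the pod (remove it from service)",
            "Capture pod state (kubectl describe and logs)",
            "Identify the escape vector",
            "Check the host OS for evidence of filesystem modification",
        ],
    },
}


def attach_runbook(alert: dict) -> dict:
    """Return a copy of the alert with runbook details, or None values if no runbook exists."""
    runbook = RUNBOOKS.get(alert.get("type", ""))
    enriched = dict(alert)
    enriched["runbook_url"] = runbook["url"] if runbook else None
    enriched["first_steps"] = list(runbook["first_steps"]) if runbook else []
    return enriched


paged = attach_runbook({"type": "container_escape_attempt", "pod": "api-server-5f8d"})
```

An alert type that comes back with no runbook is itself useful information: it marks a gap to close before that alert is allowed to page anyone.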

The MTTD and MTTR Framing

SRE teams have metrics for reliability: mean time between failures (MTBF), mean time to detect (MTTD), and mean time to recover (MTTR). These same metrics apply to security incidents and are often missing from security team reporting. Industry incident response data indicates that median detection time for container-based cryptomining events is measured in days. The billing-alert detection mechanism that most teams rely on has a latency of days to weeks between attack start and detection. Contrast that with kernel-level detection, where the execve of the miner process fires an alert within seconds of start. The difference is measured in days of compute cost and, for higher-severity incidents like container escapes, in the difference between early containment and host-level compromise.
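The cost side of that comparison is simple arithmetic: compute consumed by the miner scales linearly with detection latency. The sketch below uses made-up figures (ten affected instances at $0.40 per instance-hour, five days versus thirty seconds to detect) purely to show the shape of the calculation. None of these numbers come from incident data.

```python
from datetime import timedelta


def wasted_compute_cost(detection_latency: timedelta, hourly_rate: float, instances: int = 1) -> float:
    """Compute cost incurred between attack start and detection."""
    hours = detection_latency.total_seconds() / 3600
    return hours * hourly_rate * instances


# Hypothetical figures, for illustration only.
HOURLY_RATE = 0.40  # dollars per instance-hour
INSTANCES = 10

billing_alert_cost = wasted_compute_cost(timedelta(days=5), HOURLY_RATE, INSTANCES)
kernel_level_cost = wasted_compute_cost(timedelta(seconds=30), HOURLY_RATE, INSTANCES)
```

Under these assumptions the five-day case costs about $480 and the thirty-second case costs a few cents. The ratio matters more than the absolute numbers, and it grows with every instance the miner reaches.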

MTTR for Kubernetes runtime incidents handled by SREs tends to run longer than for reliability incidents of comparable complexity, partly because security incidents require evidence preservation (not just service restoration) and partly because the triage workflow isn't as mature. The path to lower MTTR runs through alert quality (less time to triage), process tree context (faster root cause), and clear response runbooks (less judgment required at 2 AM).
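Tracking these metrics for security incidents needs nothing more than three timestamps per incident. The sketch below assumes MTTD is measured from attack start to detection and MTTR from detection to resolution. Teams define these intervals differently, so treat the definitions, the Incident record, and the sample timestamps as assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mttd(incidents: list) -> timedelta:
    """Mean time from attack start to detection."""
    return timedelta(seconds=mean([(i.detected - i.started).total_seconds() for i in incidents]))


def mttr(incidents: list) -> timedelta:
    """Mean time from detection to resolution (assumed definition)."""
    return timedelta(seconds=mean([(i.resolved - i.detected).total_seconds() for i in incidents]))


# Sample incidents (made up): detected after 10 and 30 minutes, resolved 1 and 3 hours later.
t0 = datetime(2024, 1, 1, 2, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=10, hours=1)),
    Incident(t0, t0 + timedelta(minutes=30), t0 + timedelta(minutes=30, hours=3)),
]
```

For this sample, mttd(incidents) is 20 minutes and mttr(incidents) is 2 hours. Reporting the two numbers separately shows whether the bottleneck is detection or response.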

We're not saying SREs need to become security engineers, and we're not saying security teams should be the first responders to runtime incidents. The observation is that runtime security events happen in the environment that SREs already operate, and the tools used to detect and respond to them need to match the operational patterns of that environment — fast triage, clear context, integration with existing alert routing, and documented response steps. See how Kubesentry integrates with PagerDuty, OpsGenie, and Slack, or explore the alert context model that surfaces process tree and namespace information in every detection event.