seccomp & container syscall filtering
seccomp-BPF is the Linux kernel mechanism that lets a process restrict the set of syscalls that it and its descendants are allowed to make. Mainstream container runtimes build on it, and it is a cheap, durable mitigation against entire classes of post-exploitation moves.
What seccomp is
A seccomp filter is a small classic-BPF program attached to a process. On every syscall the kernel runs the filter with the syscall number, architecture, and arguments as input; the filter returns an action: allow, return an error to userspace, kill the thread or process, or notify a userspace supervisor. Arguments are visible only as raw register values; the filter cannot dereference pointers, so it cannot inspect a path or a buffer. Once installed (via prctl(PR_SET_SECCOMP) or the seccomp() syscall), the filter is inherited across fork and clone and preserved across execve, so it applies for the lifetime of the process and all its descendants; it cannot be relaxed, only further restricted by adding more filters.
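To make the mechanics concrete, here is a minimal sketch of a raw classic-BPF filter and of the "cannot be relaxed" property. It assumes an x86-64 or aarch64 Linux target with kernel headers installed; the names deny_getppid, allow_all, install and getppid_errno_after_relax_attempt are hypothetical, and getppid is chosen only because denying it is harmless. A forked child installs a filter that denies getppid with EPERM, then stacks an allow-everything filter on top, and reports what getppid returns afterwards.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: this sketch only covers these two architectures. */
#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "sketch covers x86-64 and aarch64 only"
#endif

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Filter 1: deny getppid with EPERM, allow everything else. */
static struct sock_filter deny_getppid[] = {
    /* A = seccomp_data.arch; syscall numbers differ per ABI, so check it */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* A = seccomp_data.nr (the syscall number) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Filter 2: allow everything. Stacked on filter 1, it relaxes nothing. */
static struct sock_filter allow_all[] = {
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Attach one more filter to the calling thread. */
static int install(struct sock_filter *insns, unsigned short len) {
    struct sock_fprog prog = { .len = len, .filter = insns };
    /* Unprivileged callers must set no_new_privs before adding a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns the errno getppid() yields in a child that installed both filters,
 * 0 if the call succeeded, or -1 if the setup itself failed. */
static int getppid_errno_after_relax_attempt(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install(deny_getppid, COUNT(deny_getppid)) != 0 ||
            install(allow_all, COUNT(allow_all)) != 0)
            _exit(255);
        errno = 0;
        long r = syscall(__NR_getppid);
        _exit(r == -1 ? errno : 0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

Running the probe in a child keeps the filter out of the calling process, which matters because a filter can never be removed once attached.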
Filter actions
SECCOMP_RET_ALLOW
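Permit the syscall normally. A BPF filter has no implicit default: every path through the program must end in an explicit return, so unlisted syscalls get whatever action the filter's final fallthrough returns (allow in a deny-list, an error or kill in an allow-list).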
SECCOMP_RET_ERRNO
Return -1 with errno set to the specified value, without entering the syscall. Useful for graceful denial of optional functionality (e.g. mlock for a container that shouldn't be locking pages).
SECCOMP_RET_TRAP
Deliver SIGSYS to the calling thread. The handler can inspect siginfo_t->si_syscall to log and decide.
SECCOMP_RET_KILL_PROCESS
Kill the entire thread group immediately. The strongest setting; preferred for production hardening where a denied syscall always indicates compromise.
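SECCOMP_RET_LOG / SECCOMP_RET_USER_NOTIF
SECCOMP_RET_LOG (Linux 4.14+) allows the syscall but records it in the audit log. SECCOMP_RET_USER_NOTIF (Linux 5.0+) suspends the syscall and hands it to a userspace supervisor that can inspect it and emulate or reject it; container managers such as LXD use this to handle selected privileged syscalls on a container's behalf.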
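A filter's return value is a single 32-bit word: the upper bits select the action and the low 16 bits carry data such as the errno for SECCOMP_RET_ERRNO. When several filters are attached, the kernel runs all of them and applies the most restrictive result, in the order documented in seccomp(2). The sketch below illustrates both points; action_name, action_data and strictest are hypothetical helpers, not kernel or libseccomp APIs, and the code assumes headers from Linux 5.0 or later.

```c
#include <errno.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Name of the action encoded in a filter return value. */
static const char *action_name(uint32_t ret) {
    switch (ret & SECCOMP_RET_ACTION_FULL) {
    case SECCOMP_RET_KILL_PROCESS: return "KILL_PROCESS";
    case SECCOMP_RET_KILL_THREAD:  return "KILL_THREAD";
    case SECCOMP_RET_TRAP:         return "TRAP";
    case SECCOMP_RET_ERRNO:        return "ERRNO";
    case SECCOMP_RET_USER_NOTIF:   return "USER_NOTIF";
    case SECCOMP_RET_TRACE:        return "TRACE";
    case SECCOMP_RET_LOG:          return "LOG";
    case SECCOMP_RET_ALLOW:        return "ALLOW";
    default:                       return "unknown";
    }
}

/* Low 16 bits: the errno for ERRNO, or si_errno for TRAP. */
static uint16_t action_data(uint32_t ret) {
    return (uint16_t)(ret & SECCOMP_RET_DATA);
}

/* Pick the result that wins when two stacked filters disagree.
 * The action bits are compared as a signed 32-bit value and the lowest one
 * wins, which yields the order KILL_PROCESS, KILL_THREAD, TRAP, ERRNO,
 * USER_NOTIF, TRACE, LOG, ALLOW. The cast relies on two's-complement
 * conversion, as provided by GCC and Clang. */
static uint32_t strictest(uint32_t a, uint32_t b) {
    int32_t sa = (int32_t)(a & SECCOMP_RET_ACTION_FULL);
    int32_t sb = (int32_t)(b & SECCOMP_RET_ACTION_FULL);
    return sb < sa ? b : a;
}
```

For example, strictest(SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO | EPERM) yields the ERRNO value, which is why stacking a permissive filter on top of a strict one never widens what the process may do.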
Docker / Podman default profile
Docker, Podman and the major Kubernetes runtimes ship a default seccomp profile that allows roughly 300 syscalls and blocks the rest. The default blocks are mostly things a workload should never legitimately need: kernel-module manipulation, advanced ptrace, raw bpf(), namespace creation, low-level system administration, and obsolete or compat ABI variants.
Kubernetes integration
Kubernetes exposes seccomp via the securityContext.seccompProfile field on Pods and containers (GA in 1.19). RuntimeDefault applies the container runtime's default profile (the one containerd or CRI-O ships, similar to Docker's). Localhost points at a custom profile JSON on the node, given as a path relative to the kubelet's seccomp directory. Unconfined disables seccomp entirely; never set this in production.
# Pod-level seccomp via securityContext (Kubernetes 1.19+)
apiVersion: v1
kind: Pod
metadata:
  name: app                        # placeholder; a Pod manifest requires a name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault         # use the container runtime's default profile
  containers:
  - name: app
    image: my-app:latest
    securityContext:
      seccompProfile:              # container-level setting overrides the Pod-level one
        type: Localhost
        localhostProfile: my-profiles/app.json   # relative to the kubelet's seccomp directory
Writing a custom filter
Default-deny with a small allow-list is the strongest pattern. The libseccomp helper makes this readable; equivalent BPF can be generated for embedding directly:
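#include <errno.h>      /* EPERM */
#include <stddef.h>     /* NULL */
#include <seccomp.h>    /* libseccomp; link with -lseccomp */
#include <sys/mman.h>   /* PROT_EXEC */

int harden(void) {
    /* default action: anything not allowed below fails with EPERM */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    if (ctx == NULL)
        return -1;
    int rc = 0;  /* stays 0 only if every call below succeeds */
    /* allow only what we need (illustrative; a real program needs more) */
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    /* W^X: allow mmap only when PROT_EXEC is clear; executable mappings fall
       through to the default EPERM. A deny rule whose action equals the
       default is rejected by libseccomp, so the rule is written as an allow.
       mprotect is not allowed at all. */
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap),
                           1, SCMP_A2(SCMP_CMP_MASKED_EQ, PROT_EXEC, 0));
    if (rc == 0)
        rc = seccomp_load(ctx);
    seccomp_release(ctx);  /* the loaded filter stays active in the kernel */
    return rc;
}
Always test against your real workload: with this default action, a missing syscall surfaces as a hard-to-diagnose EPERM somewhere deep in your stack. Tools such as docker-slim can record the syscalls a workload makes and generate a profile from them, and falco can help observe syscall activity at runtime.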
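One way to test a filter before trusting it with a real workload is to load it in a throwaway child and probe individual syscalls. The sketch below does that for a raw-BPF version of the default-deny pattern, so it needs only kernel headers rather than libseccomp. It assumes x86-64 or aarch64; ALLOW_SYSCALL, allow_list and probe are hypothetical names, and the two-entry allow-list is far smaller than any real program would need.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: this sketch only covers these two architectures. */
#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "sketch covers x86-64 and aarch64 only"
#endif

/* If A == nr, return ALLOW; otherwise skip to the next entry. */
#define ALLOW_SYSCALL(nr) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

static struct sock_filter allow_list[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    ALLOW_SYSCALL(__NR_getpid),
    ALLOW_SYSCALL(__NR_exit_group),   /* needed so the child can exit */
    /* default action: everything else fails with EPERM */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
};

/* Call the argument-less syscall `nr` inside a filtered child.
 * Returns 0 if it succeeded, its errno if it failed, -1 on setup failure. */
static int probe(long nr) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sock_fprog prog = {
            .len = sizeof(allow_list) / sizeof(allow_list[0]),
            .filter = allow_list,
        };
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(255);
        errno = 0;
        long r = syscall(nr);
        _exit(r == -1 ? errno : 0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

Under these assumptions, probe(__NR_getpid) reports success while probe(__NR_getppid) reports EPERM, which is the kind of check worth running for every syscall your workload depends on.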
Syscalls worth filtering carefully
- mmap(2) Map files or anonymous memory into a process's address space.
- execve(2) Replace the current process image with a new program.
- clone(2) Create a new process that can selectively share memory, file descriptors, namespaces, and other resources with its parent.
- ptrace(2) Inspect and control another process — the kernel primitive behind gdb, strace, and process-injection malware.
- ioctl(2) Catch-all device-control syscall: perform a driver-defined operation on a file descriptor.
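For syscalls like these, blocking the whole call is often too coarse, so filters match on arguments instead. As an illustrative sketch, the filter below denies only ioctl requests equal to TIOCSTI (terminal input injection, a commonly filtered request) and lets every other ioctl through. It assumes a little-endian x86-64 or aarch64 target; ARG1_LO, deny_tiocsti and ioctl_errno_under_filter are hypothetical names. Only the low 32 bits of the request are compared, on the assumption that the kernel treats the ioctl command as a 32-bit value.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: this sketch only covers these two little-endian architectures. */
#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "sketch covers x86-64 and aarch64 only"
#endif

/* Offset of the low 32 bits of args[1] on a little-endian target. */
#define ARG1_LO offsetof(struct seccomp_data, args[1])

static struct sock_filter deny_tiocsti[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* not ioctl: skip the next three instructions and land on ALLOW */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, 3),
    /* A = request argument, compared as a raw value (no pointer chasing) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG1_LO),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, TIOCSTI, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Issue ioctl(-1, request) in a filtered child and return the errno seen,
 * or -1 on setup failure. The invalid fd means an unfiltered request
 * reaches the kernel and fails with EBADF. */
static int ioctl_errno_under_filter(unsigned long request) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sock_fprog prog = {
            .len = sizeof(deny_tiocsti) / sizeof(deny_tiocsti[0]),
            .filter = deny_tiocsti,
        };
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(255);
        errno = 0;
        long r = syscall(__NR_ioctl, -1, request, 0);
        _exit(r == -1 ? errno : 0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

The same load-and-compare shape applies to clone flags or mmap protection bits, with a BPF_AND step before the comparison when testing individual flag bits.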