Security & Credentials · Section 2
seccomp(2)
Install a BPF-based syscall filter for the calling thread — the backbone of container sandboxing on Linux.
Signature
#include <linux/seccomp.h>
#include <sys/syscall.h>
int seccomp(unsigned int operation, unsigned int flags, void *args);
- operation
- SECCOMP_SET_MODE_STRICT, SECCOMP_SET_MODE_FILTER, SECCOMP_GET_ACTION_AVAIL, or SECCOMP_GET_NOTIF_SIZES.
- flags
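- Operation-specific. For SET_MODE_FILTER: TSYNC (apply the filter to all threads), LOG (log every action except allow), SPEC_ALLOW (do not enable the Speculative Store Bypass mitigation), NEW_LISTENER (return a notification fd), TSYNC_ESRCH (with TSYNC, report a failed sync as ESRCH instead of returning a thread ID).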
- args
- Operation-specific. For SET_MODE_FILTER, a pointer to a struct sock_fprog describing the BPF program.
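glibc provides no seccomp() wrapper, so callers normally go through syscall(2). The sketch below shows one way to do that, using the harmless SECCOMP_GET_ACTION_AVAIL operation as the example. The helper names seccomp_call and action_available are illustrative, not part of any library, and the code assumes kernel headers from Linux 4.14 or later.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw syscall. */
static int seccomp_call(unsigned int operation, unsigned int flags, void *args)
{
    return (int)syscall(SYS_seccomp, operation, flags, args);
}

/* Returns 1 if the running kernel supports the given SECCOMP_RET_* action,
 * 0 if it reports EOPNOTSUPP, and -1 (errno set) on any other failure,
 * for example when an outer filter blocks seccomp() itself. */
static int action_available(uint32_t action)
{
    if (seccomp_call(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0)
        return 1;
    return errno == EOPNOTSUPP ? 0 : -1;
}
```

A caller might check action_available(SECCOMP_RET_USER_NOTIF) before building a filter that depends on user notification, and fall back to SECCOMP_RET_ERRNO otherwise.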
Description
seccomp() configures the kernel's secure-computing facility for the calling thread (and, with TSYNC, for all threads of the process). There are two main operations. SECCOMP_SET_MODE_STRICT restricts the thread to read(), write(), _exit(), and sigreturn(): enough for a pure compute worker that talks over already-open descriptors, but too narrow for most real workloads. SECCOMP_SET_MODE_FILTER installs a classic-BPF program; on every subsequent syscall the kernel runs the program with a struct seccomp_data (arch, nr, args) as input, and the return value selects an action: allow, fail with an errno, kill the thread or process, raise SIGSYS, log, or notify a userspace supervisor (SECCOMP_RET_USER_NOTIF).
Filters can be stacked but never removed: every installed filter runs on each syscall and the most restrictive result wins, so a later filter can tighten the policy but cannot relax it. With SECCOMP_FILTER_FLAG_NEW_LISTENER, installing the filter returns a listener fd from which a supervisor process reads user-notification events.
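The filter path described above can be sketched end to end. The example forks a child so the filter never touches the caller, sets no_new_privs, and installs a seven-instruction program that validates seccomp_data.arch and makes getppid() fail with EPERM. The function name demo_deny_getppid, the DEMO_AUDIT_ARCH macro, and the choice of getppid() as the denied call are assumptions made for illustration only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_AARCH64
#elif defined(__i386__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_I386
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Returns 0 if the child saw getppid() fail with EPERM, 1 if the filter
 * could not be installed, 2 if the filter had no effect, -1 on fork/wait
 * failure or abnormal child exit. */
static int demo_deny_getppid(void)
{
    struct sock_filter insns[] = {
        /* Kill the process if the syscall ABI is not the one we built for. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, DEMO_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Load the syscall number; getppid gets EPERM, the rest is allowed. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Without no_new_privs (or CAP_SYS_ADMIN) the install fails with EACCES. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            _exit(1);
        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog) != 0)
            _exit(1);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 2);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The flags argument is 0 here because the child is single-threaded; a multi-threaded caller would likely pass SECCOMP_FILTER_FLAG_TSYNC instead.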
Architecture mapping
| Architecture | Number | ABI | Entry point |
|---|---|---|---|
| x86 (i386) | 354 | i386 | sys_seccomp |
| x64 (x86_64) | 317 | common | sys_seccomp |
| ARM64 (aarch64) | 277 | — | sys_seccomp |
Kernel history
Introduced in Linux 3.17.
2.6.12
Strict mode (read/write/exit/sigreturn only) shipped. This was the original seccomp: adequate for tightly scoped compute workers, but not enough for general workloads.
3.5
Filter mode (BPF-driven) was added, enabling Chromium's renderer sandbox and the modern container ecosystem. The kernel runs the filter on every syscall; the BPF verifier ensures the program terminates.
3.17
The seccomp() syscall was added so filters could be installed without going through prctl(). The same release added SECCOMP_FILTER_FLAG_TSYNC, fixing the long-standing "each thread must install its own filter" footgun.
5.0
SECCOMP_RET_USER_NOTIF and the listener-fd mechanism were added, letting a userspace supervisor intercept selected syscalls and emulate them on behalf of the sandboxed process. This became a major sandboxing primitive for container runtimes and managers.
5.9
User notification gained SECCOMP_IOCTL_NOTIF_ADDFD, which lets the supervisor's notify handler inject a file descriptor into the target process. This closed the last big gap that prevented seccomp-based syscall emulators from cleanly handling calls that return a descriptor, such as open().
seccomp & containers
Docker default profile
Allowed
Podman default profile
Allowed
seccomp() is allowed by the Docker and Podman default profiles because any containerized program that installs its own filter needs it. Blocking it rarely buys anything: an inner filter can only tighten the outer one, never relax it, so denying seccomp() merely stops the workload from sandboxing itself further. Argument filtering on the operation code (allow only SECCOMP_SET_MODE_FILTER, deny GET_*) is rarely worthwhile.
libseccomp
// Allow seccomp() itself only for SECCOMP_SET_MODE_FILTER
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(seccomp),
                 1, SCMP_A0(SCMP_CMP_EQ, SECCOMP_SET_MODE_FILTER));
strace example
$ strace -e seccomp,prctl docker run --rm alpine:3 echo hi 2>&1 | grep -E 'prctl|seccomp' | head -5
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0
seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC, {len=…, filter=0x…}) = 0
seccomp() in strace usually appears once per process at startup, immediately after a prctl(PR_SET_NO_NEW_PRIVS) call. strace prints the BPF program length and pointer; -e seccomp narrows the trace to just these calls. A repeated seccomp() call at runtime is unusual and worth investigating.
Security & observability
seccomp() is the primary syscall-level sandboxing primitive on Linux: the backbone of Docker's default profile, Chromium's renderer, OpenSSH's privsep, systemd's SystemCallFilter=, and gVisor's syscall interception. Vulnerabilities in the seccomp implementation itself are rare but serious; CVE-2019-5736, a runc container escape unrelated to seccomp, illustrates the blast radius when a sandbox boundary fails. For threat hunting, an unexpected seccomp() call from a workload that should already be running under an installed filter is rare and worth flagging, since a process that installs a filter at runtime may be trying to interfere with the syscalls made by AV/EDR hooks. The syscalls:sys_enter_seccomp tracepoint, usable from eBPF, captures every call. From the inside, the Seccomp: line of /proc/<pid>/status shows the current mode (0=disabled, 1=strict, 2=filter).
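Reading the Seccomp: line is simple enough to do from a monitoring agent. The sketch below parses it from a status file; the function name seccomp_mode_from_status is hypothetical, and the parser assumes the usual "Seccomp:<whitespace><digit>" layout of /proc/<pid>/status.

```c
#include <stdio.h>

/* Returns the value of the "Seccomp:" line (0 = disabled, 1 = strict,
 * 2 = filter), or -1 if the file cannot be opened or has no such line. */
static int seccomp_mode_from_status(const char *status_path)
{
    FILE *f = fopen(status_path, "r");
    if (f == NULL)
        return -1;

    char line[256];
    int mode = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* "Seccomp_filters:" does not match because of the literal ':'. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }
    fclose(f);
    return mode;
}
```

Passing "/proc/self/status" reports the mode of the calling process; any other readable pid directory works the same way.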
Errors
- EACCES
- SECCOMP_SET_MODE_FILTER called without PR_SET_NO_NEW_PRIVS and without CAP_SYS_ADMIN — the safety interlock that prevents unprivileged callers from bypassing set-UID semantics by filtering execve.
- EFAULT
- args was not a valid address.
- EINVAL
- Bad operation, unknown flag, malformed BPF program, or BPF program rejected by the verifier (jumps out of range, etc.).
- ENOMEM
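- The kernel is out of memory for filter storage, or the combined length of all filters attached to the thread would exceed the kernel's limit. (A single program that is too long is rejected with EINVAL.)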
- ENOSYS
- —
- EOPNOTSUPP
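- SECCOMP_GET_ACTION_AVAIL was requested, but the kernel does not support the return action named in args.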
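One way to tell "the syscall is missing" apart from "the arguments were bad" is to call seccomp() with an operation number that no kernel defines and look at errno. This is a sketch under assumptions: the helper name seccomp_syscall_present is invented here, EINVAL is taken to mean the syscall exists, ENOSYS is taken to mean the kernel lacks it, and any other result (for example EPERM from an outer filter) is treated as inconclusive.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if seccomp() is implemented, 0 if the kernel reports ENOSYS,
 * and -1 if the probe is inconclusive. Nothing is ever installed, because
 * 0xffffffff is not a defined operation. */
static int seccomp_syscall_present(void)
{
    long r = syscall(SYS_seccomp, 0xffffffffU, 0U, NULL);
    if (r == -1 && errno == EINVAL)
        return 1;
    if (r == -1 && errno == ENOSYS)
        return 0;
    return -1;
}
```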
Flags
- SECCOMP_SET_MODE_STRICT
- 0
- —
- SECCOMP_SET_MODE_FILTER
- 1
- Install a BPF filter program. The caller needs either CAP_SYS_ADMIN or the no_new_privs bit, set beforehand with prctl(PR_SET_NO_NEW_PRIVS).
- SECCOMP_GET_ACTION_AVAIL
- 2
- —
- SECCOMP_GET_NOTIF_SIZES
- 3
- —
- SECCOMP_FILTER_FLAG_TSYNC
- 0x1
- After installing, synchronise the filter to every other thread in the process. Without TSYNC, only the calling thread is filtered (a common bug in custom sandboxes).
- SECCOMP_FILTER_FLAG_LOG
- 0x2
- —
- SECCOMP_FILTER_FLAG_SPEC_ALLOW
- 0x4
- —
- SECCOMP_FILTER_FLAG_NEW_LISTENER
- 0x8
- Return a listener fd that a parent or supervisor process can use to handle SECCOMP_RET_USER_NOTIF syscalls. This is the basis of supervisor-style syscall interception.
- SECCOMP_FILTER_FLAG_TSYNC_ESRCH
- 0x10
- —
- SECCOMP_RET_KILL_PROCESS
- 0x80000000
- Kill the entire thread group on a matching syscall. Strongest action; preferred for production hardening.
- SECCOMP_RET_KILL_THREAD
- 0x00000000
- —
- SECCOMP_RET_TRAP
- 0x00030000
- —
- SECCOMP_RET_ERRNO
- 0x00050000
- —
- SECCOMP_RET_USER_NOTIF
- 0x7fc00000
- Block the syscall and deliver it to a userspace supervisor via the listener fd. The supervisor inspects the call, decides what to do, and replies with a return value or errno.
- SECCOMP_RET_TRACE
- 0x7ff00000
- —
- SECCOMP_RET_LOG
- 0x7ffc0000
- —
- SECCOMP_RET_ALLOW
- 0x7fff0000
- —
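The SECCOMP_RET_* values above are 32-bit words: the upper 16 bits select the action and the lower 16 bits carry data such as the errno for SECCOMP_RET_ERRNO. The sketch below shows how the two halves combine and how the kernel's precedence rule can be modelled, namely that the action with the lowest value, compared as a signed 32-bit integer, wins when several filters disagree. The helper names are illustrative, and SECCOMP_RET_ACTION_FULL assumes headers from Linux 4.14 or later.

```c
#include <linux/seccomp.h>
#include <stdint.h>

/* Build a "fail with this errno" return value. */
static uint32_t ret_errno(uint16_t err)
{
    return SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA);
}

/* Split a filter return value into its action and data halves. */
static uint32_t ret_action(uint32_t ret) { return ret & SECCOMP_RET_ACTION_FULL; }
static uint16_t ret_data(uint32_t ret)   { return (uint16_t)(ret & SECCOMP_RET_DATA); }

/* Model of filter stacking: of two results, the one whose action is lower
 * as a signed 32-bit value takes precedence. KILL_PROCESS (0x80000000) is
 * negative when viewed that way, so it outranks everything else. */
static uint32_t stricter(uint32_t a, uint32_t b)
{
    return (int32_t)ret_action(a) < (int32_t)ret_action(b) ? a : b;
}
```

For instance, stricter(SECCOMP_RET_ALLOW, ret_errno(1)) yields the errno result, which is why an inner filter that allows a call cannot undo an outer filter that denies it.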