Security & Credentials · Section 2

seccomp(2)

Install a BPF-based syscall filter for the calling thread — the backbone of container sandboxing on Linux.

Signature

#include <linux/seccomp.h>
#include <sys/syscall.h>

int seccomp(unsigned int operation, unsigned int flags, void * args);
operation
SECCOMP_SET_MODE_STRICT, SECCOMP_SET_MODE_FILTER, SECCOMP_GET_ACTION_AVAIL, or SECCOMP_GET_NOTIF_SIZES.
flags
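Operation-specific; must be 0 for every operation except SECCOMP_SET_MODE_FILTER. For SECCOMP_SET_MODE_FILTER, a bitmask of SECCOMP_FILTER_FLAG_* values: TSYNC (apply to all threads), LOG (log every action except allow), SPEC_ALLOW (disable the Speculative Store Bypass mitigation), NEW_LISTENER (return a notification fd), TSYNC_ESRCH (report a TSYNC failure as ESRCH).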
args
Operation-specific. For SET_MODE_FILTER, a pointer to a struct sock_fprog describing the BPF program.
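
Many C libraries do not ship a seccomp() wrapper, so callers commonly go through syscall(2). The sketch below shows one way to match the signature above; the name seccomp_syscall is hypothetical, and it assumes the libc headers define SYS_seccomp.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Hypothetical thin wrapper: forwards the three arguments unchanged.
 * Returns the raw result; on failure that is -1 with errno set. */
int seccomp_syscall(unsigned int operation, unsigned int flags, void *args)
{
    return (int)syscall(SYS_seccomp, operation, flags, args);
}
```

An unknown operation value should fail with -1 and set errno, which makes a cheap smoke test for the wrapper.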

Description

seccomp() configures the kernel's secure-computing facility for the calling thread (and, with SECCOMP_FILTER_FLAG_TSYNC, for all threads of the process). It has two main operations.

SECCOMP_SET_MODE_STRICT restricts the thread to read(), write(), _exit(), and sigreturn(). That is enough for tiny compute-only sandboxes that talk over already-open file descriptors, but useless for real workloads.

SECCOMP_SET_MODE_FILTER installs a classic-BPF program. On every subsequent syscall the kernel runs the program with a struct seccomp_data (nr, arch, instruction_pointer, args) as input, and the return value selects an action: allow, fail with an errno, kill the thread or process, deliver SIGSYS (trap), notify a tracer, log, or hand the call to a userspace supervisor (SECCOMP_RET_USER_NOTIF).

Filters can only be tightened. An installed filter cannot be removed; a later filter is stacked on top of it, every installed filter runs on each syscall, and the most restrictive result wins. With SECCOMP_FILTER_FLAG_NEW_LISTENER, installation returns a listener fd from which a supervisor process reads user-notification events.
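
To make the filter mechanics concrete, here is a minimal sketch of a classic-BPF program that checks seccomp_data.arch, fails one syscall with an errno, and allows everything else. The function name build_deny_filter and the DENY_FILTER_LEN constant are hypothetical; the sketch assumes reasonably recent kernel headers (SECCOMP_RET_KILL_PROCESS) and only builds the program, it does not install it.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define DENY_FILTER_LEN 7

/* Hypothetical helper: fill `out` with a filter that kills the process on
 * an unexpected architecture, makes syscall `nr` fail with `err`, and
 * allows every other syscall. Returns the number of instructions. */
size_t build_deny_filter(struct sock_filter out[DENY_FILTER_LEN],
                         uint32_t audit_arch, uint32_t nr, uint16_t err)
{
    /* Load seccomp_data.arch into the accumulator. */
    out[0] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                          offsetof(struct seccomp_data, arch));
    /* Expected arch: skip the kill. Anything else: fall through to it. */
    out[1] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                          audit_arch, 1, 0);
    out[2] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                                          SECCOMP_RET_KILL_PROCESS);
    /* Load the syscall number. */
    out[3] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                          offsetof(struct seccomp_data, nr));
    /* Match: fall through to the errno return. No match: skip to allow. */
    out[4] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                          nr, 0, 1);
    out[5] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                                          SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA));
    out[6] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    return DENY_FILTER_LEN;
}
```

The resulting array would be wrapped in a struct sock_fprog (len plus filter pointer) and passed as args to SECCOMP_SET_MODE_FILTER. A real filter on x86_64 would likely also need to consider the x32 ABI, which this sketch ignores.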

Architecture mapping

Architecture      Number  ABI     Entry point
x86 (i386)        354     i386    sys_seccomp
x64 (x86_64)      317     common  sys_seccomp
ARM64 (aarch64)   277     -       sys_seccomp
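
Because the number differs per architecture, code that reasons about raw syscall numbers (a filter, or a tool decoding seccomp_data) has to key on the architecture first. The lookup below is a small sketch of the table above; the function name seccomp_nr_for_arch is hypothetical, and the AUDIT_ARCH_* constants are assumed to come from <linux/audit.h>.

```c
#include <stdint.h>
#include <linux/audit.h>

/* Hypothetical lookup: map an AUDIT_ARCH_* value (as found in
 * seccomp_data.arch) to the seccomp() syscall number from the table.
 * Returns -1 for architectures the table does not list. */
int seccomp_nr_for_arch(uint32_t audit_arch)
{
    switch (audit_arch) {
    case AUDIT_ARCH_I386:
        return 354;
    case AUDIT_ARCH_X86_64:
        return 317;
    case AUDIT_ARCH_AARCH64:
        return 277;
    default:
        return -1;
    }
}
```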

Kernel history

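The seccomp() syscall was introduced in Linux 3.17; the underlying facility is older and was originally reached through prctl().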

  1. 2.6.12

    Strict mode (read/write/exit/sigreturn only) shipped. This was the original seccomp: adequate for tiny compute-only sandboxes, but not enough for general workloads.

  2. 3.5

    Filter mode (BPF-driven) was added, enabling Chromium's renderer sandbox and the modern container ecosystem. The kernel runs the filter on every syscall; the BPF verifier ensures the program terminates.

  3. 3.17

    The seccomp() syscall was added so that filters could be installed without going through prctl(). It also introduced SECCOMP_FILTER_FLAG_TSYNC, fixing the long-standing footgun where each thread had to install its own filter.

  4. 5.0

    SECCOMP_RET_USER_NOTIF and the listener-fd mechanism were added, enabling userspace supervisors to intercept and emulate syscalls on behalf of a sandboxed process. This became a major sandboxing primitive for container managers.

  5. 5.9

    User notification gained SECCOMP_IOCTL_NOTIF_ADDFD, which lets the supervisor's notify handler inject a file descriptor into the target process. This closed the last big gap that kept seccomp-based syscall emulators from cleanly handling calls that return a file descriptor, such as open().
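
Since these features arrived over many releases, a program may prefer to probe the running kernel rather than parse its version. The sketch below uses the SECCOMP_GET_ACTION_AVAIL operation for that; the function name seccomp_action_available is hypothetical, and any failure (including a kernel without seccomp()) is simply reported as "not available".

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Hypothetical probe: returns 1 if the kernel reports that it supports
 * the given filter return action, 0 otherwise (including on any error). */
int seccomp_action_available(uint32_t action)
{
    return syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0;
}
```

For example, a supervisor could call seccomp_action_available(SECCOMP_RET_USER_NOTIF) before deciding whether to rely on user notification.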

seccomp & containers

Docker default profile

Allowed

Podman default profile

Allowed

seccomp() is allowed by the Docker and Podman default profiles because any container process that installs its own filter needs it. Blocking it makes sense only for workloads that should never install further sandboxes, and even then it buys little, because an inner filter cannot relax the outer one. Filtering on the operation argument (allow only SECCOMP_SET_MODE_FILTER, deny the SECCOMP_GET_* operations) is rarely worthwhile.

libseccomp

// Allow seccomp() itself only for SECCOMP_SET_MODE_FILTER
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(seccomp),
    1, SCMP_A0(SCMP_CMP_EQ, SECCOMP_SET_MODE_FILTER));

strace example

$ strace -e seccomp,prctl docker run --rm alpine:3 echo hi 2>&1 | grep -E 'prctl|seccomp' | head -5
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)  = 0
seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC, {len=…, filter=0x…}) = 0

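In strace output, seccomp() usually appears once per process at startup, immediately after a prctl(PR_SET_NO_NEW_PRIVS) call. strace prints the BPF program length and pointer; -e seccomp,prctl limits the trace to these two calls. Note that the docker client only talks to the daemon, so the calls are made by the container runtime rather than by the client process; trace the runtime (or use strace -f with a daemonless runtime) to capture them. A repeated seccomp() at runtime is unusual and worth investigating.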

Security & observability

seccomp() is the primary syscall-level sandboxing primitive on Linux: it underpins Docker's default profile, Chromium's renderer sandbox, OpenSSH's privsep, systemd's SystemCallFilter=, and gVisor's syscall interception. CVEs in the seccomp implementation itself are rare but devastating (CVE-2019-5736, a runc container escape, was orthogonal but illustrates the blast radius). For threat hunting, an unexpected seccomp() call from a workload that should already be running under an installed filter is rare but worth flagging; when malware does call seccomp(), it is typically to install a filter that interferes with AV/EDR hooks. The syscalls:sys_enter_seccomp tracepoint, usable from eBPF, captures every call. From userspace, the Seccomp: line of /proc/<pid>/status shows the current mode (0 = disabled, 1 = strict, 2 = filter).
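
As an illustration of the /proc check, the sketch below extracts the mode from the text of a status file. The function names parse_seccomp_mode and read_seccomp_mode are hypothetical, and the parser assumes the usual "Seccomp:<tab><digit>" layout.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parser: scan status text line by line and return the value
 * on the "Seccomp:" line (0, 1 or 2), or -1 if no such line exists.
 * "Seccomp_filters:" does not match because the colon is required. */
int parse_seccomp_mode(const char *status_text)
{
    const char *p = status_text;
    while (p && *p) {
        int mode;
        if (sscanf(p, "Seccomp: %d", &mode) == 1)
            return mode;
        p = strchr(p, '\n');
        if (p)
            p++; /* advance to the start of the next line */
    }
    return -1;
}

/* Hypothetical convenience: read a status file such as /proc/self/status
 * and parse it. Returns -1 if the file cannot be opened. */
int read_seccomp_mode(const char *path)
{
    char buf[8192];
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_seccomp_mode(buf);
}
```

Calling read_seccomp_mode("/proc/self/status") from inside a container running under a default profile would be expected to report filter mode.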

Errors

EACCES
SECCOMP_SET_MODE_FILTER called without PR_SET_NO_NEW_PRIVS and without CAP_SYS_ADMIN — the safety interlock that prevents unprivileged callers from bypassing set-UID semantics by filtering execve.
EFAULT
args is not a valid address.
EINVAL
Bad operation, unknown flag, malformed BPF program, or BPF program rejected by the verifier (jumps out of range, etc.).
ENOMEM
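Kernel out of memory for filter storage, or the combined length of all filters attached to the thread would exceed the kernel's limit. (A single program that is too long is rejected with EINVAL instead.)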
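ENOSYS
The kernel does not implement the seccomp() syscall (older than Linux 3.17, or built without seccomp support).
EOPNOTSUPP
SECCOMP_GET_ACTION_AVAIL was used, but the kernel does not support the return action named in args.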
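
The EACCES interlock is usually satisfied by setting no_new_privs immediately before installing the filter. The sketch below shows that sequence with errors passed back to the caller; the function name install_filter is hypothetical, and the choice to return a negated errno is just one possible convention.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical helper: set no_new_privs, then install `prog` for the
 * calling thread. Returns 0 on success or a negated errno value such as
 * -EACCES, -EFAULT or -EINVAL. Note that no_new_privs stays set. */
int install_filter(const struct sock_fprog *prog)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -errno;
    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, prog) != 0)
        return -errno;
    return 0;
}
```

A caller can then branch on the result, for example treating -ENOSYS as "seccomp unavailable" and any other failure as a bug in the filter or its setup.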

Flags

SECCOMP_SET_MODE_STRICT
0
SECCOMP_SET_MODE_FILTER
1
Install a BPF filter program. Requires either CAP_SYS_ADMIN or no_new_privs, set beforehand with prctl(PR_SET_NO_NEW_PRIVS).
SECCOMP_GET_ACTION_AVAIL
2
SECCOMP_GET_NOTIF_SIZES
3
SECCOMP_FILTER_FLAG_TSYNC
0x1
After installing, synchronise the filter to every other thread in the process. Without TSYNC, only the calling thread is filtered (a common bug in custom sandboxes).
SECCOMP_FILTER_FLAG_LOG
0x2
SECCOMP_FILTER_FLAG_SPEC_ALLOW
0x4
SECCOMP_FILTER_FLAG_NEW_LISTENER
0x8
Return a listener fd that a parent or supervisor process can use to handle syscalls for which the filter returns SECCOMP_RET_USER_NOTIF.
SECCOMP_FILTER_FLAG_TSYNC_ESRCH
0x10
SECCOMP_RET_KILL_PROCESS
0x80000000
Kill the entire thread group on a matching syscall. Strongest action; preferred for production hardening.
SECCOMP_RET_KILL_THREAD
0x00000000
SECCOMP_RET_TRAP
0x00030000
SECCOMP_RET_ERRNO
0x00050000
SECCOMP_RET_USER_NOTIF
0x7fc00000
Block the syscall and deliver it to a userspace supervisor via the listener fd. The supervisor inspects the call, decides how to handle it, and replies with a return value or an errno.
SECCOMP_RET_TRACE
0x7ff00000
SECCOMP_RET_LOG
0x7ffc0000
SECCOMP_RET_ALLOW
0x7fff0000
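
A filter's 32-bit return value packs an action in the upper 16 bits and action-specific data (for example the errno) in the lower 16. The sketch below splits the two and picks the stricter of two results by comparing the action bits as signed integers, which is how the values above end up ordered from KILL_PROCESS (most restrictive) to ALLOW. The function names are hypothetical, and the sketch assumes headers that define SECCOMP_RET_ACTION_FULL.

```c
#include <stdint.h>
#include <linux/seccomp.h>

/* Hypothetical helpers for decoding a filter return value. */
uint32_t seccomp_ret_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION_FULL; /* upper 16 bits */
}

uint32_t seccomp_ret_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA; /* lower 16 bits */
}

/* Return whichever result is more restrictive. Actions are compared as
 * signed 32-bit values, so 0x80000000 (KILL_PROCESS) sorts lowest. */
uint32_t seccomp_stricter(uint32_t a, uint32_t b)
{
    int32_t aa = (int32_t)seccomp_ret_action(a);
    int32_t bb = (int32_t)seccomp_ret_action(b);
    return aa < bb ? a : b;
}
```

With stacked filters, folding seccomp_stricter over each filter's result gives a rough model of why a later filter can tighten but never relax an earlier one.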

Related syscalls