Why Quadlet Is Different, Part 2: Defence-in-Depth Cybersecurity at the Service Level

From Resource Limits to Security Boundaries

Part 1 of this series showed how Quadlet inherits resource governance capabilities from systemd that Compose structurally cannot express — soft memory throttles, CPU core pinning, I/O bandwidth control, and hierarchical resource budgets via slices. Those features protect the device from its workloads.

This post addresses the harder problem: protecting the device — and the plant — from a compromised workload.

IEC 62443 defines a defence-in-depth security model for industrial control systems. The principle is straightforward: no single security mechanism should be the last line of defence. What follows are systemd hardening directives that Quadlet exposes and Compose structurally cannot, each implementing a distinct layer of defence.

Filesystem Isolation: Beyond Read-Only Containers

Compose supports read_only: true for the root filesystem. That’s one directive. systemd provides a graduated, fine-grained filesystem isolation model:

[Service]
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/sensor-data
InaccessiblePaths=/etc/shadow /etc/ssh
TemporaryFileSystem=/var:ro

Each directive does something specific and auditable:

ProtectSystem=strict mounts the entire filesystem hierarchy read-only, except for /dev, /proc, and /sys. Unlike Compose’s read_only, this is enforced by the OS at the service level, not by the container runtime. A compromised container engine cannot override it.

ProtectHome=true makes /home, /root, and /run/user completely inaccessible, not just read-only.

PrivateTmp=true gives the service private /tmp and /var/tmp mounts that are not shared with the rest of the system. Other services, containerised or not, cannot read or write to them.

ReadWritePaths= whitelists specific writable paths. Everything else remains read-only. This is positive-list security: deny everything, then permit only what’s needed.

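InaccessiblePaths= makes specific paths, and everything beneath them, completely inaccessible to the service. Not read-only: inaccessible. The paths cannot be opened, read, or listed from within the service’s filesystem view.

TemporaryFileSystem=/var:ro mounts an empty, read-only tmpfs over /var, hiding whatever the host keeps there. Directories the service still needs under /var have to be bound back in, for example with BindPaths= or BindReadOnlyPaths=, so a data directory such as /var/lib/sensor-data is typically paired with a BindPaths= entry when this directive is used.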

Why Compose can’t do this: Compose delegates filesystem isolation to the container runtime’s mount namespace. The read_only flag and volumes are the only controls. There’s no way to make specific host paths invisible, no way to create per-service /tmp isolation enforced by the OS, and no way to make filesystem controls survive a container runtime compromise.

IEC 62443 mapping: These directives implement least privilege at the filesystem level — Security Level 2 (SL-2) requirement FR 2 (Use Control). An auditor can read the Quadlet file and verify that the service cannot access anything it doesn’t need. With Compose, the auditor must examine the container image’s internal filesystem, the bind mounts, the volume configuration, and the runtime’s seccomp profile — four artefacts instead of one.

Kernel Attack Surface Reduction

A container doesn’t need to load kernel modules, change kernel tunables, modify cgroups, access the hardware clock, or inspect other processes. But unless you explicitly deny these capabilities, a compromised process inside the container might exploit them. systemd provides per-service controls that Compose cannot express:

[Service]
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true
ProtectClock=true
ProtectProc=invisible

ProtectKernelTunables=true makes /proc/sys and /sys read-only. A compromised container cannot change IP forwarding, TCP keepalive, or other kernel parameters.

ProtectKernelModules=true blocks insmod, modprobe, and any mechanism to load kernel modules. This prevents a container from loading a malicious driver.

ProtectKernelLogs=true denies access to dmesg and the kernel log ring buffer. Kernel logs can leak sensitive information about hardware, memory layout, and kernel addresses.

ProtectControlGroups=true makes the cgroup filesystem read-only for this service. A compromised container cannot modify its own resource limits or interfere with other services’ limits.

ProtectClock=true prevents writes to the hardware or system clock. In OT environments where time synchronisation (PTP/IEEE 1588) is critical for event ordering, a compromised clock can corrupt audit trails.

ProtectProc=invisible hides processes owned by other users from /proc. The service can see its own processes, but not the PID list of the entire system.

None of these is available in the Compose specification. Some can be partially achieved through custom seccomp profiles passed via security_opt, but that requires building and maintaining a JSON seccomp filter per application: a significant operational burden that lives outside the Compose file instead of being declared in it.

Network Restriction: Per-Service Kernel Firewalls

Compose supports port mapping and network selection. But it cannot restrict which types of network communication a container is allowed to use.

[Service]
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
IPAddressAllow=10.0.0.0/8 192.168.0.0/16
IPAddressDeny=any

RestrictAddressFamilies= limits which socket address families the process can use. An MQTT protocol adapter needs AF_INET (IPv4) and maybe AF_UNIX (local sockets). It doesn’t need AF_NETLINK (kernel comms), AF_PACKET (raw packets), or AF_BLUETOOTH. Denying these address families eliminates entire classes of network-level exploits.

IPAddressAllow= and IPAddressDeny= implement source/destination IP filtering at the cgroup level. This is a kernel-enforced firewall per service — not a container network policy, not iptables rules that a daemon manages, but BPF-based filtering attached directly to the service’s cgroup.

Industrial scenario: A protocol gateway bridges a PLC network (10.0.1.0/24) to a SCADA server (10.0.2.5). The Quadlet file declares IPAddressAllow=10.0.1.0/24 10.0.2.5/32 and IPAddressDeny=any. If the gateway container is compromised, the attacker cannot use it to reach the corporate network, the internet, or any other OT segment. This is microsegmentation enforced by the kernel, declared in a single file, auditable with systemctl show.
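The gateway policy can also be reasoned about offline before it reaches a device. The Python sketch below models the evaluation order that systemd documents for these two directives: an address that matches IPAddressAllow= is permitted, otherwise an address that matches IPAddressDeny= is refused, and anything unmatched is permitted. The helper names expand and is_permitted and the sample peer addresses are hypothetical, and the model deliberately ignores lists inherited from parent slices.

```python
import ipaddress

# CIDR lists copied from the scenario's Quadlet file.
ALLOW = ["10.0.1.0/24", "10.0.2.5/32"]
DENY = ["any"]  # systemd shorthand covering all IPv4 and IPv6 addresses


def expand(entries):
    """Turn systemd-style entries into ip_network objects."""
    nets = []
    for entry in entries:
        if entry == "any":
            nets += [ipaddress.ip_network("0.0.0.0/0"), ipaddress.ip_network("::/0")]
        else:
            nets.append(ipaddress.ip_network(entry))
    return nets


def is_permitted(peer, allow=ALLOW, deny=DENY):
    """Allow list wins, then the deny list, and unmatched traffic passes."""
    addr = ipaddress.ip_address(peer)
    if any(addr in net for net in expand(allow)):
        return True
    if any(addr in net for net in expand(deny)):
        return False
    return True


# Hypothetical peers: a PLC, the SCADA server, its neighbour, and an internet host.
for peer in ("10.0.1.17", "10.0.2.5", "10.0.2.6", "8.8.8.8"):
    print(peer, "allowed" if is_permitted(peer) else "blocked")
```

A check like this is only a model of the policy. The authoritative view of what a running unit enforces remains systemctl show on the device.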

Syscall Filtering: Declared Inline, Not in a Separate JSON

systemd provides curated syscall filter groups that restrict which kernel system calls a process can make:

[Service]
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @mount @keyring @debug

SystemCallArchitectures=native ensures only native syscalls (x86-64 on x86-64) are permitted — blocking 32-bit compatibility syscalls that have historically been a source of privilege escalation exploits.

SystemCallFilter=@system-service permits the common syscall set for system services. The second line, ~@privileged @mount @keyring @debug, then denies privilege escalation syscalls, mount operations, kernel keyring access, and debugging syscalls such as ptrace.

Compose supports seccomp profiles via security_opt: - seccomp:profile.json. But the profile is a separate JSON file that must be authored, distributed, and maintained independently of the Compose file. The systemd approach is declarative within the service definition itself — the security policy lives with the service, versioned together, auditable as a unit.

Rootless Mode and Hardening: A Required Trade-off

⚠️ Technical note (rootless hardening conflicts): Several of the sandboxing directives described above interact with Podman’s rootless execution model in ways that require careful configuration.

Directives such as RestrictAddressFamilies=, SystemCallFilter=~@privileged, ProtectKernelTunables=true, and MemoryDenyWriteExecute=true implicitly set NoNewPrivileges=yes when the unit is started by the per-user service manager (or by the system manager without CAP_SYS_ADMIN, for example with User= set). In rootless mode, this creates a conflict: the newuidmap and newgidmap helper binaries require elevated privileges (via setuid bits or file capabilities) to establish the user namespace that rootless containers depend on. When NoNewPrivileges=yes is active, these helpers are blocked from acquiring the necessary privileges, and the container fails to start.

One solution is a namespace setup service. Podman keeps the rootless user namespace alive (through its pause process) after the command that created it has exited, so the initialisation can be separated from the hardened workload:

# userns-setup@.service: runs BEFORE the hardened container
[Service]
Type=oneshot
ExecStart=/usr/bin/podman unshare true
# No hardening directives: this service needs newuidmap

# sensor-gateway.container: the hardened workload
[Unit]
After=userns-setup@sensor-gateway.service
Requires=userns-setup@sensor-gateway.service

[Service]
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
SystemCallFilter=@system-service
NoNewPrivileges=yes
# Safe: user namespace is already established

The setup service runs once without hardening restrictions, establishes the user namespace, and exits. The hardened container then starts within the pre-existing namespace. This pattern maintains full rootless operation while enabling the complete hardening stack. On devices where containers run as root (common in locked-down industrial appliances with no interactive users), this conflict does not apply.

Measurable Security: `systemd-analyze security`

Here’s something that has no equivalent anywhere in the Compose ecosystem. systemd provides a built-in security audit tool:

$ systemd-analyze security sensor-gateway.service

This command generates a scored report evaluating every hardening directive — filesystem isolation, network restrictions, capability restrictions, syscall filtering, namespace isolation — and produces a numerical exposure rating from 0.0 (fully hardened) to 10.0 (no hardening).

A typical unhardened service scores 9.6 (UNSAFE). A well-hardened Quadlet service scores 2.0–4.0, which the tool rates as OK. The audit is deterministic, reproducible, and can be integrated into CI/CD pipelines: “reject any Quadlet file whose generated service scores above 5.0.”
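A CI gate of this kind can be sketched in a few lines of Python. The sketch assumes the report ends with the usual "Overall exposure level for UNIT: SCORE RATING" summary line. The function names exposure_score, passes_gate, and run_audit are hypothetical, and the 5.0 limit is simply the figure quoted above. For rootless units the audit would need to run against the user manager, which this sketch does not handle.

```python
import re
import subprocess

# Matches the summary line that closes a `systemd-analyze security UNIT` report, e.g.
#   → Overall exposure level for sensor-gateway.service: 9.6 UNSAFE
SUMMARY = re.compile(r"Overall exposure level for (\S+): (\d+(?:\.\d+)?) (\w+)")


def exposure_score(report):
    """Return (unit, score, rating) parsed from the report text."""
    match = SUMMARY.search(report)
    if match is None:
        raise ValueError("no overall exposure line found in report")
    unit, score, rating = match.groups()
    return unit, float(score), rating


def passes_gate(report, limit=5.0):
    """True when the unit's exposure score does not exceed the limit."""
    _, score, _ = exposure_score(report)
    return score <= limit


def run_audit(unit):
    """Run the audit on the current host and return its text output."""
    result = subprocess.run(
        ["systemd-analyze", "security", "--no-pager", unit],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

In a pipeline, a step would call passes_gate(run_audit("sensor-gateway.service")) on a build host that has the generated unit installed and fail the job when it returns False.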

IEC 62443 mapping: This tool directly supports the verification requirement in SL-2: “The control system shall provide the capability to audit the use of system services.” The auditor runs a single command and receives a quantified security assessment. For Compose-managed containers, there is no equivalent single-command verification — the auditor must inspect the image, the Compose file, the daemon configuration, and the runtime seccomp profile separately.

The Audit Question

When an IEC 62443 security assessor reviews an edge device, they ask: “Show me the security policy for each running workload.”

With Quadlet, the answer is a single file per workload. Every resource limit, every filesystem restriction, every network policy, every capability constraint, every syscall filter — declared in the same file, enforced by the kernel, verifiable with systemd-analyze security, scored numerically.

With Compose, the answer is a set of distributed artefacts: the Compose file (which has 15 security-relevant knobs), the container image (which must be inspected for setuid binaries, default capabilities, and internal file permissions), the daemon configuration (which controls seccomp defaults, user namespace mapping, and logging drivers), and any custom seccomp profile JSON files (which must be maintained separately). No single command produces a security score. No single file contains the complete policy.

This isn’t a criticism of Compose; it’s a consequence of its architectural position. Compose sits above the OS and delegates security to the container engine. Quadlet sits inside the OS and inherits security from the kernel. On the industrial edge, where security audits are expensive and device fleets number in thousands, the difference between “one file per workload” and “four artefacts per workload” is a material operational cost.
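Because the whole policy lives in one file, a pre-audit check can be automated against that file alone. The Python sketch below reads the [Service] section of a Quadlet file and lists which directives from a baseline are absent. The BASELINE tuple, the function names, and the image reference registry.example/sensor-gateway:1.0 are hypothetical choices for illustration, and the parser is deliberately minimal: it does not handle backslash line continuations or drop-in overrides.

```python
# Hypothetical baseline: directive names a fleet policy might require in every workload.
BASELINE = (
    "ProtectSystem", "ProtectHome", "PrivateTmp",
    "ProtectKernelTunables", "ProtectKernelModules",
    "RestrictAddressFamilies", "IPAddressDeny", "SystemCallFilter",
)

# Illustrative Quadlet file; the image reference is a placeholder.
EXAMPLE = """\
[Container]
Image=registry.example/sensor-gateway:1.0

[Service]
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @mount @keyring @debug
"""


def service_directives(unit_text):
    """Collect key -> list of values from the [Service] section."""
    found = {}
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, value = line.split("=", 1)
            # Repeated keys (such as SystemCallFilter=) accumulate in order.
            found.setdefault(key.strip(), []).append(value.strip())
    return found


def missing_directives(unit_text, baseline=BASELINE):
    """Return baseline directive names that the [Service] section omits."""
    present = service_directives(unit_text)
    return [name for name in baseline if name not in present]


print(missing_directives(EXAMPLE))
```

A presence check like this complements the scored audit and does not replace it: it confirms that a directive is declared, not that its value is strict enough.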

Security Directives Compose Cannot Express — Summary

Security Layer	Compose	Quadlet
Root filesystem read-only	✅ `read_only: true`	✅ `ProtectSystem=strict`
Home directory isolation	❌	✅ `ProtectHome=true`
Private /tmp per service	❌	✅ `PrivateTmp=true`
Make paths inaccessible	❌	✅ `InaccessiblePaths=`
Whitelist writable paths	❌	✅ `ReadWritePaths=`
Block kernel module loading	❌	✅ `ProtectKernelModules=true`
Read-only /proc/sys	❌	✅ `ProtectKernelTunables=true`
Block dmesg access	❌	✅ `ProtectKernelLogs=true`
Read-only cgroups	❌	✅ `ProtectControlGroups=true`
Block clock writes	❌	✅ `ProtectClock=true`
Hide other users’ processes	❌	✅ `ProtectProc=invisible`
Socket type restriction	❌	✅ `RestrictAddressFamilies=`
Per-service IP firewall	❌	✅ `IPAddressAllow=/Deny=`
Inline syscall filtering	❌ (separate JSON)	✅ `SystemCallFilter=`
Native-only syscalls	❌	✅ `SystemCallArchitectures=native`
Security score audit	❌	✅ `systemd-analyze security`

Up Next

Security defines what each workload is allowed to do. But industrial production also demands precise control over how workloads start, stop, depend on each other, and recover from failure — behaviours that touch hardware watchdog timers, GPIO safety relays, and services that exist entirely outside the container world.

In Part 3, we cover the lifecycle directives that Compose can’t express: push-based readiness (Type=notify), kernel-level watchdog supervision (WatchdogSec=), hard lifecycle coupling to non-container services (BindsTo=), conditional execution based on hardware presence, and event-driven failure actions that trigger safety fallbacks.

→ Continue to Part 3: Production Lifecycle on the Factory Floor

This series accompanies a Margo specification enhancement proposal for quadlet.v1 as a deployment profile. Discussion and feedback welcome on discourse.margo.org and the Margo specification GitHub.