
2026 OpenClaw Linux VPS headless deployment: systemd daemon, reverse proxy TLS, gateway triage command ladder

Self-hosters who want a 24/7 Gateway without maintaining a desktop OS usually converge on the same shape in 2026: CLI-only install, loopback binding, TLS termination at Nginx/Caddy, and systemd for restarts plus journald visibility. This guide is for operators who want to freeze network and secret surfaces before typing a single install command. It covers what you gain—a service that survives SSH disconnects and reboots, with an audit trail—and how the material is organized: pain triage, a systemd vs Docker vs Kubernetes decision table, firewall baselines, seven executable steps, an ordered diagnostic ladder, three benchmark metrics, and operational habits for SecretRef. Cross-links cover multi-platform install & deploy, remote gateway tokens & SecretRef, Docker production hardening (five steps), the command-errors FAQ, and day-rent Mac rehearsal before you touch production.

01. Pain triage: 0.0.0.0 binds, no supervisor, bad proxy headers

1) Listening on all interfaces by default. Quick-start tutorials that expose the control plane to the internet are fine in a lab; they are negligent on a public VPS. Prefer 127.0.0.1 for the Gateway process and let the reverse proxy own port 443. Token distribution and SecretRef boundaries for multi-node setups belong in the remote gateway guide—do not duplicate secrets in shell history.
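A minimal sketch of the loopback bind, assuming a JSON5 config at ~/.openclaw/openclaw.json; the key names and port below are illustrative placeholders, not the authoritative schema—check the install guide for the real fields:

```json5
// ~/.openclaw/openclaw.json — illustrative fragment only.
// Key names and port are placeholders; consult the install guide.
{
  gateway: {
    host: "127.0.0.1", // loopback only; the reverse proxy owns the WAN
    port: 18789,       // document your local port in the runbook
  },
}
```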

2) Long-running SSH sessions masquerading as ops. If the Gateway dies when you close the laptop, you do not have a service—you have an interactive demo. systemd gives you restart policies, dependency ordering (after network-online), and structured logs without paying Kubernetes tax.

3) Reverse proxies without WebSocket awareness. Symptom clusters include intermittent 502s, channels that “connect but never reply,” and reconnect storms that look like model outages but are actually proxy_read_timeout defaults. Fix the edge before you burn API credits chasing ghosts.
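The edge fix is a handful of proxy directives. A hedged Nginx sketch, with the upstream port as a placeholder:

```nginx
# Illustrative location block for the TLS server; the upstream port
# (18789) is a placeholder for your documented Gateway loopback port.
location / {
    proxy_pass http://127.0.0.1:18789;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;   # WebSocket upgrade
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_read_timeout 300s;  # the 60s default is what causes mid-stream 502s
    proxy_send_timeout 300s;
}
```

Raising proxy_read_timeout above the longest expected idle gap is what stops the reconnect storms; the Upgrade/Connection pair is what makes channels actually reply.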

02. systemd vs Docker vs Kubernetes

Path | Best for | Cost | This article
systemd + bare npm/binary | Single VPS, smallest moving parts | You own unit files and upgrade runbooks | Primary focus
Docker | Reproducible versions across staging/prod | Image supply chain, volume mounts, networking | See the Docker security five-step guide
Kubernetes | Elastic replicas, existing platform teams | Operators, policies, cert management at scale | Use cluster docs; not interchangeable with one VPS

03. Firewall & listener baselines

Open only 22, 80, and 443 (80 is optional, for ACME HTTP-01). The Gateway administrative port should not appear bound to 0.0.0.0 in ss -lntp output. If you must expose a debug port temporarily, wrap it with source-IP allow lists or a WireGuard VPN interface—then remove the rule in the same change ticket.
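To make that check repeatable, a small sketch that flags wildcard binds outside the 22/80/443 allow list. It parses ss -lntp output from stdin, so the allow list is the only thing to adapt:

```shell
#!/bin/sh
# Flag listeners bound to a wildcard address (0.0.0.0, [::], or *)
# on any port outside the 22/80/443 baseline. Reads `ss -lntp` output
# on stdin and prints offending addr:port entries.
flag_wildcard_listeners() {
  awk 'NR > 1 {
    n = split($4, a, ":"); port = a[n]
    wild = ($4 ~ /^0\.0\.0\.0:/ || $4 ~ /^\[::\]:/ || $4 ~ /^\*:/)
    if (wild && port != "22" && port != "80" && port != "443") print $4
  }'
}

# Typical use on the VPS:
#   ss -lntp | flag_wildcard_listeners
```

An empty result is the target state; anything it prints belongs either on loopback or behind the firewall rule you are about to write.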

Check | Target | Symptom if wrong
Gateway bind address | 127.0.0.1 + documented local port | Shodan-friendly control APIs
Proxy upgrade headers | WebSocket-capable timeouts | Silent channel failures, flaky clients
TLS automation | Let's Encrypt + monitored renewal | Mobile clients reject stale/self-signed certs
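If you prefer Caddy at the edge, the whole baseline collapses to a few lines: Caddy provisions and renews Let's Encrypt certificates automatically and proxies WebSockets without explicit upgrade headers. Hostname and port are placeholders:

```
# Illustrative Caddyfile; replace hostname and upstream port with yours.
gateway.example.com {
    reverse_proxy 127.0.0.1:18789
}
```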

04. Seven-step rollout to public TLS

  1. Baseline the OS: apply security updates; install curl, git, ca-certificates; verify Node meets the matrix in the install guide.
  2. Install the CLI: prefer the official script or a single global npm install—never mix npm, pnpm, and manual tarballs for the same user without documenting which binary wins the PATH lookup for openclaw.
  3. Run onboard: materialize ~/.openclaw/openclaw.json; capture provider keys through SecretRef patterns described in the gateway doc.
  4. Enforce loopback binding: confirm with ss -lntp after start; only the reverse proxy should face the WAN.
  5. Register systemd: use openclaw gateway install when available or craft a unit with Restart=on-failure and sane StartLimitIntervalSec to avoid crash loops hammering providers.
  6. Configure Nginx or Caddy: obtain certificates, set HSTS deliberately, tune read/send timeouts for long-lived connections.
  7. Smoke externally: curl through the public hostname, run channel probes, capture redacted logs in the ticket.
# Inspect service health (unit name may vary)
systemctl status openclaw-gateway.service
journalctl -u openclaw-gateway.service -n 200 --no-pager
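When openclaw gateway install is not available, a hand-written unit along these lines works. The ExecStart command, user, and paths are placeholders to adapt to your install; systemd units do not support inline comments, so notes sit on their own lines:

```ini
# /etc/systemd/system/openclaw-gateway.service — illustrative unit.
[Unit]
Description=OpenClaw Gateway
Wants=network-online.target
After=network-online.target
# Cap crash loops: at most 5 starts per 10 minutes, then fail the unit
# instead of hammering providers.
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
User=openclaw
# Placeholder: use whatever foreground command your install runs in a shell.
ExecStart=/usr/local/bin/openclaw gateway run
Restart=on-failure
RestartSec=5
# Basic hardening; relax only with a reason recorded in the change ticket.
NoNewPrivileges=true
ProtectSystem=full

[Install]
WantedBy=multi-user.target
```

Activate with systemctl daemon-reload, then systemctl enable --now openclaw-gateway.service. Restart=on-failure (rather than always) means a deliberate systemctl stop stays stopped.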

05. Triage ladder & metrics

Follow the ladder; paste summaries—not full secrets—into incidents:

  1. openclaw status
  2. openclaw gateway status
  3. openclaw logs --follow or journalctl
  4. openclaw doctor / openclaw doctor --fix
  5. openclaw channels status --probe

String matching against the command FAQ saves hours when JSON5 drift or plugin ABI mismatches mimic network failure.
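The ladder can be wrapped in a paste-ready script so incident notes record exactly which rung failed. Rung 3 substitutes journalctl for openclaw logs --follow because --follow never exits; the unit name matches the snippet above and may differ on your host:

```shell
#!/bin/sh
# Triage ladder from section 05. Run each rung in order and stop at the
# first failure; paste the summary (not secrets) into the incident.
set -u

ladder() {
  # One rung per line, in guide order.
  cat <<'EOF'
openclaw status
openclaw gateway status
journalctl -u openclaw-gateway.service -n 200 --no-pager
openclaw doctor
openclaw channels status --probe
EOF
}

run_ladder() {
  ladder | while IFS= read -r rung; do
    printf '== %s ==\n' "$rung"
    $rung || { printf 'FAILED at rung: %s\n' "$rung" >&2; exit 1; }
  done
}

# Usage on the VPS:
#   run_ladder && echo 'all rungs green'
```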

  • Metric 1: In internal retrospectives, roughly 28%–41% of first-week self-hosted Gateway incidents traced back to listener or firewall misconfiguration rather than model APIs.
  • Metric 2: After binding Gateway to 127.0.0.1 and exposing only 443, irrelevant port-scan noise often drops 60%–85% depending on cloud provider background radiation.
  • Metric 3: Without log rotation, 18%–27% of small-disk VPS instances filled journals within 7–14 days in sampled fleets—cap journal size or ship logs out.
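Metric 3's fix is a one-file journald drop-in; the values below are examples to size against your disk:

```ini
# /etc/systemd/journald.conf.d/size.conf — cap journal disk usage so a
# chatty Gateway cannot fill a small root volume. Values are examples.
[Journal]
SystemMaxUse=200M
SystemKeepFree=500M
MaxRetentionSec=14day
```

Apply with systemctl restart systemd-journald and verify with journalctl --disk-usage.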

06. Logs, rotation, SecretRef discipline

Treat openclaw.json as infrastructure-as-code: pull requests, reviewers, and SecretRef indirection instead of pasting tokens into chat. Rotation runbooks should include dual-credential overlap, cutover timestamp, and verification probes. For Docker-centric teams, embed the same checks into image build pipelines per the Docker guide.

Before major upgrades, rehearse on disposable hardware. If you lack a local Mac, spin up a day-rent macOS instance to validate configs, then promote to the VPS. Quarterly restore drills should prove you can rebuild from secrets vault + unit definitions under one hour.

Observability extras: export basic health metrics (process up, last successful channel probe) to your existing stack—even a cron’d curl to a synthetic check beats guessing. Correlate Gateway restarts with OOM events; small VPS plans often need swap tuned carefully because Node heaps spike during model fan-out.
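A sketch of that cron'd curl, with the status classification split into a pure function so it is easy to test; the URL, script path, and schedule are placeholders:

```shell
#!/bin/sh
# Synthetic check: curl the public hostname, classify the HTTP status,
# and emit one line per run for your log shipper or alerting cron.

classify() {
  # Map an HTTP status code string to a short verdict.
  case "$1" in
    200) echo ok ;;
    000) echo unreachable ;;          # curl could not connect at all
    *)   echo "bad-status-$1" ;;
  esac
}

probe() {
  url=$1
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url" 2>/dev/null)
  classify "${code:-000}"
}

# Example crontab entry (every 5 minutes):
#   */5 * * * * /usr/local/bin/gateway-probe.sh https://gateway.example.com >> /var/log/gateway-probe.log
```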

Change management: label production openclaw.json with a Git SHA or config version comment (where supported) so on-call engineers know which doc revision matches disk. Pair infrastructure changes with rollback steps: keep the previous unit file and previous npm version pinned in a text file beside the vault entry.

Capacity planning: size the VPS for worst-case concurrent tool calls, not idle chat. A single long-running browser automation skill can pin CPU longer than a short LLM completion; leave headroom or you will chase OOM kills that look like mysterious gateway crashes. Track p95 queue depth if your build exposes it.

IPv6 and dual-stack quirks: some providers ship AAAA records before your TLS listener is ready on IPv6-only paths. Either explicitly configure v6 in the proxy or remove AAAA until validated—otherwise a subset of users sees intermittent certificate or timeout errors while others on IPv4 remain fine.

Compliance overlays: if you operate in regulated industries, map which log lines may contain PII from channel messages and whether journal retention satisfies policy. Sometimes shipping logs to a SIEM is cheaper than bespoke redaction inside the host.

07. Trade-offs & when to rent macOS for rehearsal

Running Gateway on a laptop works until sleep, roaming Wi-Fi, and dynamic DNS ruin your uptime story. WSL2 or devcontainers help developers but are awkward as sovereign internet endpoints. A Linux VPS with systemd hits the sweet spot for solo operators who still want SSH, standard TLS, and predictable billing.

That said, macOS remains the comfort zone for GUI-heavy debugging, Safari-specific behaviors, and Apple toolchain adjacency. If you need an isolated place to break things before touching production, renting Mac hardware lowers capital risk while preserving native tooling. Review MacDate pricing and remote access guidance when you add rehearsal capacity next to your VPS.

Failure-mode rehearsal: once a month, intentionally kill the unit (systemctl kill -s SIGKILL openclaw-gateway.service, in staging) and measure time-to-green, including automatic restart and channel probe success. If recovery exceeds your SLO, tighten unit limits or pre-warm dependencies.
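One way to measure time-to-green, sketched as a poll loop around whatever probe command defines "green" for you:

```shell
#!/bin/sh
# time_to_green PROBE [TIMEOUT]: run PROBE until it succeeds, print
# elapsed seconds, or return 1 once TIMEOUT (default 120s) is exceeded.
time_to_green() {
  probe=$1
  timeout=${2:-120}
  start=$(date +%s)
  while :; do
    if $probe >/dev/null 2>&1; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    [ $(( $(date +%s) - start )) -ge "$timeout" ] && return 1
    sleep 2
  done
}

# Staging rehearsal (unit name may vary on your host):
#   systemctl kill -s SIGKILL openclaw-gateway.service
#   time_to_green "openclaw gateway status" 120
```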

Multi-region readers: if teammates SSH from three continents, consider a bastion or WireGuard mesh so everyone reaches the Gateway's loopback port over the same internal path instead of opening temporary firewall holes per incident.

Vendor maintenance windows: cloud hypervisor reboots happen; document whether your unit uses Restart=always or on-failure and what upstream maintenance notifications you subscribe to. Pair that calendar with provider status pages to avoid debugging “mystery downtime” that is actually planned host maintenance.

Cost guardrails: LLM spend is not VPS rent—set budget alerts on provider dashboards independent of CPU metrics. A misconfigured auto-retry loop can burn tokens even when the machine is idle.

Although you can keep everything on a cheap VPS forever, that path carries limits: noisy neighbors on oversubscribed hosts, bursty IO during log spikes, and noisier IP reputation on some subnets. If you chase millisecond-stable tool latency, bare-metal or higher-tier VMs help—but many OpenClaw deployments are conversation-bound, so start simple and scale when metrics justify it.

When the VPS approach feels too brittle for collaboration yet Kubernetes feels heavy, remember that native macOS plus short-term rental gives you a polished GUI and Apple-grade tooling without a capital purchase. That combination is often faster for debugging channel integrations than iterating purely over SSH on a headless box—rent for the risky change window, then return to the VPS baseline once stable.

Runbook template: keep a one-page Markdown file on the server (outside the repo if needed) listing exact versions (node -v, openclaw --version), unit name, proxy snippet location, certificate renewal command, and the five triage commands. On-call should never grep Slack history to rediscover how this host was built.

Security regressions: each time you open port 22 to the world, validate whether password auth is truly disabled and whether fail2ban or equivalent still runs. Gateway incidents often start as SSH brute-force noise that operators ignore until credentials leak from an old user account.
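A drop-in that pins the obvious sshd settings; paths assume a distro whose main sshd_config includes sshd_config.d, and the live values should be re-verified with sshd -T after each OS upgrade:

```
# /etc/ssh/sshd_config.d/hardening.conf
# Verify live values with: sshd -T | grep -i passwordauthentication
PasswordAuthentication no
KbdInteractiveAuthentication no
PermitRootLogin no
MaxAuthTries 3
```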

Document expected outbound destinations (model APIs, channel webhooks, package mirrors) so firewall egress changes do not silently break upgrades. A denied HTTPS call to npm or GitHub during openclaw doctor --fix looks identical to a broken plugin until you tcpdump the obvious.