2026 OpenClaw Model Catalog & openclaw models Sync
Provider Drift, Cache, Auth Failures — Repro Runbook (Day-Rent macOS Isolation)
When Gateway starts but routing never sees the models your CLI just synced, when sync exits zero while the catalog directory still looks like a negative-cache tombstone, or when 401s hit only provider metadata fetches, the costliest mistake is treating OAuth drift as typos in model IDs. This runbook targets self-hosters and ops: three pain clusters, a CLI/Gateway/config matrix, seven ordered steps, a triage table, three datapoints, and a 1–3 day rental cadence, cross-linked to the v2026.4.14 provider catalog + doctor runbook, the v2026.5.5 channel/npm isolation notes, and the SSH/VNC rental FAQ. You should exit with three aligned logs: CLI stdout, Gateway access, and a provider HTTP trace.
Table of contents
- 01. Pain clusters
- 02. Matrix
- 03. Seven steps
- 04. Triage
- 05a. Observability & SLO
- 05b. Change management
- 05. Datapoints & cadence
- 06. Linux vs day-rent Mac
- 05c. Multi-region and stale JWKS edge cases
- 05d. CI integration without polluting production catalogs
01. Pain clusters
1) Sync succeeds but the catalog stays empty (negative-cached): some builds negative-cache empty provider responses; the first fetch after enablement may serialize as HTTP 200 with an empty list, and later syncs short-circuit as “fresh.” Parallel shells with stale OPENCLAW_* exports on a rental host amplify “synced under account A, hit gateway B” ghosting.
2) Provider drift vs in-memory Gateway view: routing needs stable aliases; when upstream retires a name while openclaw.json still references it, CLI may show legacy rows from disk while startup validation drops rows silently. Walk the v2026.4.14 catalog runbook three-way: disk, process, upstream.
3) 401 vs 429 collapsed as “model unavailable”: without paired curl -v and Gateway access lines with timestamps, on-call degrades to “restart Gateway.”
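The three-way walk in cluster 2 (disk vs process vs upstream) can be sketched with plain coreutils. The file names and fixture IDs below are illustrative, not real openclaw output paths; on a live host you would dump one model ID per line from openclaw.json, the Gateway's model endpoint, and the provider's list endpoint:

```shell
# Three-way catalog comparison sketch. Fixture data stands in for real dumps
# so the commands run as-is; replace the printf lines with your own exports.
tmp=$(mktemp -d); cd "$tmp"
printf 'alpha\nbravo\nlegacy-1\n' | sort > disk.txt      # aliases on disk
printf 'alpha\nbravo\n'           | sort > proc.txt      # Gateway in-memory table
printf 'alpha\nbravo\ncharlie\n'  | sort > upstream.txt  # provider's live list

# Rows present on disk but silently dropped at startup validation:
dropped=$(comm -23 disk.txt proc.txt)
# Rows upstream offers that nothing local references yet:
unreferenced=$(comm -13 disk.txt upstream.txt)

echo "dropped-at-startup: $dropped"
echo "new-upstream: $unreferenced"
```

A non-empty "dropped-at-startup" set is the signature of cluster 2: the CLI shows legacy rows from disk while routing never sees them.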
02. Matrix
| Symptom | CLI signal | Gateway signal | Rental macOS tip |
|---|---|---|---|
| Console shows model, list missing | exit 0 but JSON line count odd | catalog cache hit | delete cache dir, cold restart |
| unknown model at router | alias still in json | in-proc table missing row | doctor + ordered restart |
| Intermittent 401 | metadata fetch only | missing/expired Authorization | temp key on host, capture one trace |
If Gateway runs in Linux while CLI runs on macOS, ticket clock skew and DNS together; “sync OK” does not imply container reload.
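Ticketing clock skew is cheap to automate. A minimal sketch, assuming you capture the remote epoch over SSH (the host name and ssh step are illustrative; here the remote value is simulated locally so the arithmetic is runnable):

```shell
# Clock-skew check sketch. On a real pair, replace the stand-in with:
#   remote_epoch=$(ssh gateway-host date -u +%s)
local_epoch=$(date -u +%s)
remote_epoch=$local_epoch   # stand-in for the ssh capture above
skew=$(( local_epoch > remote_epoch ? local_epoch - remote_epoch : remote_epoch - local_epoch ))
if [ "$skew" -gt 2 ]; then
  echo "SKEW ${skew}s: suspect JWT iat/exp rejection before blaming the catalog"
else
  echo "clocks within ${skew}s"
fi
```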
03. Seven steps
1. Freeze: openclaw --version, image tag, absolute config path.
2. Doctor baseline; split PASS vs WARN lines.
3. Run models sync; tee full stdout/stderr with the exit code.
4. Probe the Gateway model endpoint with a controlled key; compare ID sets.
5. Align openclaw.json aliases; planned restart, not thrash.
6. Sample two low-traffic models + one primary, twenty calls each.
7. Redact logs; revoke demo keys; wipe the catalog cache on the rental. Install overview: install/deploy guide.
```shell
openclaw doctor 2>&1 | tee /tmp/openclaw-doctor-baseline.txt
curl -sS -H "Authorization: Bearer ***" "https://gateway.example/v1/models" | head -c 4000
```
With less than ~16 GB of free disk, the unpack/index phase of sync tends to flake; clean up first. See the FAQ for connectivity.
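Step 4's "compare ID sets" can be done without jq. A sketch that pulls IDs out of a /v1/models-style body with grep and sed; the JSON here is a fabricated fixture, and on a real host you would pipe the curl output from step 4 instead:

```shell
# ID-set diff sketch: Gateway view vs CLI view, using only grep/sed/comm.
tmp=$(mktemp -d); cd "$tmp"
body='{"data":[{"id":"m-alpha"},{"id":"m-bravo"}]}'
printf '%s' "$body" \
  | grep -o '"id":"[^"]*"' \
  | sed 's/"id":"\(.*\)"/\1/' \
  | sort > gateway_ids.txt

printf 'm-alpha\nm-bravo\nm-charlie\n' | sort > cli_ids.txt  # pretend CLI stdout

# Symmetric difference: any line here means disk/process/upstream disagree.
diff_count=$(comm -3 cli_ids.txt gateway_ids.txt | grep -c .)
echo "symmetric diff size: $diff_count"
```

The same count feeds the SLO in section 05a ("symmetric diff vs console ≤ K").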
04. Triage
| Symptom | Do first | Avoid |
|---|---|---|
| Sync ok, list flat | purge catalog cache + cold GW restart | renaming aliases only |
| 401 on metadata | rotate keys, verify the header chain | a blind major CLI bump |
| unknown model + npm same night | split tickets per v2026.5.5 | parallel channel + route edits |
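The "purge catalog cache + cold GW" row is order-sensitive: purge before the restart so the fresh process cannot re-read the tombstone. A sketch, where the cache path, OPENCLAW_CACHE_DIR variable, and service name are all assumptions about your layout, not documented openclaw defaults:

```shell
# Cache purge sketch: back up, purge, then cold-restart (restart line is a
# placeholder comment; substitute your supervisor's command).
CACHE_DIR="${OPENCLAW_CACHE_DIR:-/tmp/openclaw-catalog-cache}"   # assumed layout
backup="/tmp/catalog-cache-$(date -u +%Y%m%dT%H%M%SZ).tar"

mkdir -p "$CACHE_DIR"                      # ensure it exists for this dry run
tar -cf "$backup" -C "$(dirname "$CACHE_DIR")" "$(basename "$CACHE_DIR")"
rm -rf "$CACHE_DIR"                        # purge BEFORE restart, never after
echo "purged $CACHE_DIR (backup at $backup)"
# systemctl restart openclaw-gateway      # hypothetical unit name
```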
05a. Observability & SLO
Track three signals: time from Gateway start to “model table ready”; delta between CLI sync completion and catalog mtime; P95 spike on first cold-cache hits. SLO as “queryable models ≥ N and symmetric diff vs console ≤ K,” not “exit code zero.” Route 401 and unknown model pages to different rotations. During incidents temporarily raise access log sampling but document rollback time for storage audits. Stamp region labels on dumps to avoid misreading Tokyo synced / Frankfurt stale as global outage.
Structured redaction keeps compliance happy: keep request IDs and hashed token prefixes, drop raw secrets. If dashboards hard-code model names, treat catalog changes as semi-public API releases with compatibility windows.
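Hashed token prefixes can look like this: keep the request ID for correlation, replace the bearer token with a short salted hash. The salt and log field names are illustrative; `sha256sum` is GNU coreutils (stock macOS ships `shasum -a 256` instead):

```shell
# Redaction sketch: traces stay correlatable without leaking the secret.
token="sk-demo-not-a-real-secret"
salt="incident-4711"                       # illustrative per-incident salt
hash=$(printf '%s%s' "$salt" "$token" | sha256sum | cut -c1-12)
echo "req_id=abc123 token_hash=$hash"
```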
05b. Change management
Ship checklist: alias sunset dates, default model migration window, path to previous catalog tarball. Prefer shadow percentages over flips. Rehearse rollback on rental with doctor→sync→curl triple. If parallel Ollama work runs, diff local vs cloud manifests on separate ticket rows to avoid blaming upstream for local gaps.
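The "path to previous catalog tarball" item is easiest to rehearse when releases unpack into versioned directories and a `current` symlink selects the live one, so rollback is a single re-link. This layout is an assumption for the sketch, not OpenClaw's documented structure:

```shell
# Rollback rehearsal sketch: swap the `current` symlink back one release.
root=$(mktemp -d)
mkdir -p "$root/catalog-v2026.4.14" "$root/catalog-v2026.5.5"
ln -s "$root/catalog-v2026.5.5" "$root/current"    # live release

# Roll back to the previous tarball's unpack dir (-n: replace the link itself):
ln -sfn "$root/catalog-v2026.4.14" "$root/current"
readlink "$root/current"
```

Run the doctor→sync→curl triple immediately after the re-link to confirm the Gateway actually picked up the older catalog.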
05. Datapoints & cadence
- D1: ~24–37% “catalog weird” was cache/reload, not long upstream outage.
- D2: Three-column matrix cut first healthy routing restoration by ~0.7–1.5 iterations.
- D3: Under ~18 GB free, sync+index retries rose ~10–22% pre/post cleanup.
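D3 suggests a pre-sync disk guard. A minimal sketch using the POSIX-portable `df -P` (column 4 is available 1K blocks; the ~18 GB threshold comes from the datapoint above):

```shell
# Free-disk guard sketch: refuse to start sync below the D3 threshold.
threshold_kb=$((18 * 1024 * 1024))   # ~18 GB expressed in 1K blocks
avail_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt "$threshold_kb" ]; then
  echo "only ${avail_kb}KB free: clean before sync"
else
  echo "disk ok (${avail_kb}KB free)"
fi
```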
Day 1: freeze + doctor + sync + one curl diff. Day 2: cache purge or key rotation per triage + three-model sampling. Day 3: archive redacted logs, wipe rental.
06. Linux vs day-rent Mac
Linux hosting is mature, but Control UI, Keychain dev keys, and packet capture align faster on native macOS for a short evidence window, and day rental caps OPEX to the spike; see the pricing guide.
You can keep Gateways on Linux indefinitely, but split timelines and missing reload semantics create a hidden coordination tax. For auditable, Apple-desktop-adjacent rehearsal, a native macOS rental usually wins.
Extended operations note: when multiple engineers share one rental host, enforce append-only evidence folders named with UTC and build tags so Slack attachments do not overwrite each other. If CI headless nodes also run sync jobs, pin their environment files and forbid accidental reuse of personal API keys from laptop shells.
Vendor boundary: if a hosted inference vendor rotates JWT issuer URLs without a changelog entry, your Gateway may trust the wrong JWKS file until cache TTL expires. Document vendor SLAs for metadata endpoints and add synthetic probes that fail loudly when issuer metadata drifts.
Capacity planning: model catalogs grow super-linearly with team experiments. Budget disk for retained tarballs per release plus working space for unpack. Network egress caps on small cloud Macs can throttle large catalog downloads; schedule sync during off-peak windows and record throughput numbers in the ticket.
Security hardening: restrict who can run sync on production-adjacent hosts; treat the command as privileged because it can pull new executable-capable definitions into a running ecosystem. Pair command audit logs with Git commits touching openclaw.json for end-to-end traceability.
05c. Multi-region and stale JWKS edge cases
When Gateways span regions, JWKS fetchers may pin to the first resolver answer and never revisit until process restart. If your provider rotates signing keys regionally, validate that each Gateway pod pulls the correct issuer document for its egress IP. Add synthetic checks that compare kid sets hourly and page if a new kid appears without a matching local allowlist entry. DNS TTL interactions with corporate split-horizon DNS can surface as “only Frankfurt fails,” which looks like a model table bug but is resolver drift.
HTTP caches in front of metadata endpoints are another footgun: a misconfigured CDN may cache an empty JSON body for longer than your negative cache, creating double layers of staleness. Purge CDN paths explicitly when onboarding new models, and log Age headers in incident notes.
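The hourly kid-set comparison from 05c can be sketched with grep and comm. The JWKS body below is a fabricated fixture; in production you would curl each region's issuer document per egress IP and maintain the allowlist in version control:

```shell
# Synthetic JWKS `kid` probe sketch: page on kids absent from the allowlist.
tmp=$(mktemp -d); cd "$tmp"
jwks='{"keys":[{"kid":"key-2026a"},{"kid":"key-2026b"}]}'
printf '%s' "$jwks" \
  | grep -o '"kid":"[^"]*"' \
  | sed 's/"kid":"\(.*\)"/\1/' \
  | sort > seen.txt
printf 'key-2026a\n' | sort > allowlist.txt

# Kids the region saw that have no local allowlist entry:
unknown=$(comm -13 allowlist.txt seen.txt)
if [ -n "$unknown" ]; then
  echo "ALERT: unallowlisted kid(s): $unknown"
fi
```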
05d. CI integration without polluting production catalogs
CI jobs that run sync against shared staging catalogs can stampede the same cache directory that humans use for debugging. Namespace caches per pipeline ID or use ephemeral containers that mount empty volumes each run. Gate merges on a smoke test that asserts minimum model cardinality rather than string equality to a brittle golden file that changes weekly.
For pull-request previews, prefer read-only catalog snapshots copied from a known-good tarball instead of live sync against vendor beta endpoints. Document the rollback switch to return previews to the previous tarball path.
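Per-pipeline cache namespacing plus a cardinality gate fits in a few lines. CI_PIPELINE_ID is a GitLab-style variable used as an example; substitute your CI system's run identifier, and replace the printf fixture with the real sync output:

```shell
# CI namespacing sketch: ephemeral cache per pipeline, smoke test on count.
pipeline="${CI_PIPELINE_ID:-local-dev}"        # assumed CI variable
cache_dir="/tmp/openclaw-ci-cache/$pipeline"   # illustrative base path
mkdir -p "$cache_dir"

# Pretend sync wrote one model ID per line; a real job populates this file.
printf 'm-a\nm-b\nm-c\n' > "$cache_dir/models.txt"

min_models=2                                   # gate on cardinality, not a golden file
count=$(grep -c . "$cache_dir/models.txt")
[ "$count" -ge "$min_models" ] && echo "smoke ok: $count models"
```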
On-call ergonomics: maintain a single markdown page with copy-paste blocks for doctor, sync, curl, and journalctl/grep snippets tuned to your distro. Link that page from PagerDuty notes so new responders do not improvise flags at 3 a.m. Rotate examples quarterly as CLI flags evolve.
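Generating that single markdown page can itself be scripted so it never drifts from the runbook. The commands inside the heredoc are copied from this document and written out as text, not executed; the journalctl unit name is an assumption to adjust for your deployment:

```shell
# On-call snippets page generator sketch.
page=/tmp/oncall-openclaw.md
cat > "$page" <<'EOF'
# OpenClaw catalog on-call snippets
openclaw doctor 2>&1 | tee /tmp/openclaw-doctor-baseline.txt
openclaw models sync 2>&1 | tee /tmp/openclaw-sync.txt; echo "exit=$?"
curl -sS -H "Authorization: Bearer ***" "https://gateway.example/v1/models" | head -c 4000
journalctl -u openclaw-gateway --since "-1 hour" | grep -Ei '401|unknown model'
EOF
echo "wrote $page"
```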
FinOps tie-in: model sync can pull large artifacts; attribute egress to teams using cost allocation tags on the rental host or cloud project. If a team repeatedly syncs full catalogs hourly, throttle with cron and require justification in the change ticket.
Accessibility for internal docs: keep ASCII-friendly command names in headings so search works across locales; attach translated summaries for non-English responders without duplicating entire runbooks.
Data quality: when catalogs include deprecated models, downstream analytics that join on model name will skew cost attribution. Schedule quarterly pruning jobs that archive deprecated rows instead of deleting outright, preserving forensic history while shrinking hot paths.
Resilience testing: inject controlled faults that return HTTP 500 from the metadata endpoint for five minutes while sync runs; verify your tooling surfaces partial failure instead of silent success. Record RTO/RPO expectations for catalog freshness in the same document as database backups so executives see one coherent story.
Compliance: if models carry regional residency constraints, annotate each row in exported dumps with residency tags before sharing logs with vendors. Redact customer prompts that might appear in debug traces attached to model errors.
Developer experience: wrap sync in a thin Makefile target that always tees logs to a dated folder; newcomers then follow muscle memory instead of rediscovering flags. Add pre-commit hooks that warn when openclaw.json changes without an accompanying changelog entry for model aliases.
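The Makefile-target idea, shown here as an equivalent shell function sketch: every invocation tees into a UTC-dated folder. The wrapped command is passed as arguments, so this dry run uses `echo` in place of the real `openclaw models sync`; the log root is illustrative:

```shell
# Dated-log wrapper sketch: newcomers run one target, logs land predictably.
sync_logged() {
  dir="/tmp/openclaw-logs/$(date -u +%Y-%m-%d)"   # one folder per UTC day
  mkdir -p "$dir"
  log="$dir/sync-$(date -u +%H%M%S)-$$.log"
  "$@" 2>&1 | tee "$log"                          # wrapped command's output
}
sync_logged echo "pretend sync output"
echo "log: $log"
```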
Observability budget: long-term retention of full model dumps can explode object storage. Keep seven daily snapshots plus four weekly aggregates, then rely on tarball checksums for older history.
Cross-team alignment: platform, ML, and security squads should agree on a single source of truth for “which model IDs are approved for production traffic.” Publish that list as a signed JSON artifact consumed by both Gateway validation and CI smoke tests.
Postmortem discipline: after any sev-2+ incident involving catalogs, capture not only root cause but also which early signals were ignored because dashboards were too coarse. Feed those signals back into SLO definitions within two weeks.
Automation guardrails: wrap Gateway reloads in health checks that assert model count deltas stay within expected bounds; abort rollout if the delta exceeds a threshold learned from historical upgrades. Pair automation with manual spot checks on rental macOS when UI-driven features are involved.
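The model-count-delta guard reduces to integer arithmetic. The 20% bound and the before/after counts below are stand-ins; in a real health check you would read the counts from pre- and post-reload /v1/models dumps and learn the bound from upgrade history:

```shell
# Reload guardrail sketch: abort when the model count swings too far.
before=120   # stand-in: count from the pre-reload dump
after=118    # stand-in: count from the post-reload dump
delta=$(( before > after ? before - after : after - before ))
max_delta=$(( before * 20 / 100 ))   # illustrative 20% bound
if [ "$delta" -gt "$max_delta" ]; then
  echo "ABORT rollout: model delta $delta exceeds $max_delta"
else
  echo "delta $delta within bound $max_delta"
fi
```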
Documentation debt: when CLI help text lags actual flags, paste the output of --help into version-controlled snippets dated by release. Future readers then trust the archive more than blog prose that ages quickly.
Training curriculum: add a thirty-minute lab where engineers intentionally break catalog caches and practice recovery using only the runbook—no Slack. Graduates retain muscle memory better than lecture slides.
Release coordination: freeze unrelated Gateway plugin upgrades during catalog migrations unless you maintain a strict contract test suite that proves independence. Mixed changes multiply rollback vectors.
Executive summary discipline: keep a one-page brief with current model counts, top three risks, and explicit non-goals for the rental window so leadership does not expand scope mid-flight.
Finally, archive raw command transcripts alongside normalized JSON so auditors can diff what operators typed versus what machines parsed—small habit, large payoff when disputes arise months later.
Handoff quality: end each rental session with a three-bullet note summarizing what was intentionally not tested; explicit absence-of-evidence notes prevent false confidence in the next on-call rotation.
Telemetry hygiene: drop high-cardinality labels such as raw API keys from metrics; prefer stable team and environment tags so dashboards stay fast during incidents.
Clock synchronization: print date -u on CLI host and Gateway host in the same ticket whenever drift exceeds two seconds; JWT skew bugs hide there and waste hours of debugging time.
Keep a short runbook snippet that lists the exact commands for cache reset, token refresh, and Gateway restart so new responders do not improvise under pressure.