2026 Running DeepSeek V4 Flash Locally on a Mac with ds4 (DwarfStar 4):
antirez Engine, q2/q4 Quantization Tiers and 96/128/256/512GB Benchmark Table

Developers, researchers and privacy-sensitive teams who want frontier open-source LLMs on Apple Silicon keep asking the same questions: what exactly is the ds4 engine antirez shipped in a week, how much RAM do q2 and q4 truly demand, what do tokens-per-second look like on 128GB MacBook Pros versus 512GB Mac Studios, and at what point does daily Mac rental beat dropping six figures on a maxed-out Studio?

ds4 DwarfStar 4 running DeepSeek V4 Flash locally on a Mac, abstract circuitry visual

In May 2026, Redis creator Salvatore "antirez" Sanfilippo shipped a tiny C engine that does exactly one thing: ds4 (DwarfStar 4) is a native inference backend dedicated to DeepSeek V4 Flash. It is not a generic GGUF runner, not a wrapper around llama.cpp or Ollama, and not a framework. Its Metal backend targets Macs from 96GB upwards, while the CUDA path takes special care of NVIDIA's DGX Spark. Combined with on-disk KV cache and a built-in OpenAI-compatible API, ds4 is the first project that makes frontier-class local inference feel engineering-grade on a consumer Mac. This article is written for three audiences: independent developers who want to run DeepSeek V4 Flash on Apple Silicon, power users who want Cursor or opencode to talk to a local backend, and small studios or privacy-sensitive teams who would rather rent a maxed-out Mac Studio by the day than spend a six-figure sum up front. You will get the engineering philosophy, the q2 / q4 / MTP quantization receipts, a 96/128/256/512GB benchmark table, a five-step setup walkthrough, and the crossover point where rental beats ownership.

01. What ds4 actually is: antirez's one-week, 11k-star DeepSeek V4 engine

ds4 stands for DwarfStar 4, written by the same author who gave us Redis, Sentinel and Cluster. Within days of going public the repository crossed 11,000 GitHub stars, for a simple reason: it is currently the only engine that pushes DeepSeek V4 Flash onto the practical line of "a 128GB Mac will really run it".

The project solves an awkward reality. DeepSeek V4 Flash is a MoE architecture with roughly 284B parameters and 165 GB of original F16 weights. llama.cpp and Ollama are still wrestling with proper support; antirez instead rewrote a Metal / CUDA graph executor in plain C, paired it with his own asymmetric 2/8-bit GGUF, and reduced the "first token" experience to roughly two commands: make and ./ds4 -p.

02. ds4 versus llama.cpp / Ollama: the "narrow and deep" engineering bet

llama.cpp and Ollama are wide engines: one runtime, one hundred model families. ds4 takes the opposite bet, dedicating itself to a single family. The differences show up in three places.

  • No abstraction tax. Model loading, prompt rendering, KV state and tool calling are all hand-coded for V4 Flash. There is no overhead from "we left an interface open for the next model".
  • Official-vector validation. antirez pulls logits from the reference DeepSeek implementation and matches ds4 against them, so quantized output stays numerically close to the original rather than drifting into vibes-only territory.
  • One repo, all the pieces. You get the CLI (ds4), the OpenAI-compatible server (ds4-server), a built-in coding agent, and tooling for GGUF and imatrix generation. No glue scripts required.

antirez puts the philosophy bluntly in the README: new models ship faster than any generic runtime can chase, so ds4 picks one model at a time and pushes it to be credible on a high-end personal machine. For developers that translates into not having to read 200 issues just to keep V4 Flash from crashing on a Mac.

03. The three-tier quant receipt: q2 (80.8 GiB), q4 (153.3 GiB), MTP (3.6 GiB)

The antirez/deepseek-v4-gguf repository on Hugging Face ships exactly three files, one per memory tier:

Quant tier File size Strategy Target Mac RAM Typical use
q2 (IQ2_XXS + Q2_K) 80.8 GiB Routed experts at 2-bit; attention / shared experts at Q8_0 96 / 128 GB MacBook Pro M4/M5 Max entry tier
q4 (Q4_K experts) 153.3 GiB All experts at Q4_K; HC / Compressor / Indexer at F16 256 / 512 GB Mac Studio Ultra primary inference
MTP (speculative) 3.6 GiB Auxiliary multi-token prediction model Optional add-on Pair with q2 or q4 to boost generate t/s

Three numbers worth memorizing. First, the 80.8 GiB q2 weights plus a fully populated 26 GB KV cache barely fit a 128 GB Mac, so you will need to kill Chrome and Xcode before launch. Second, q4 weights weigh 153.3 GiB, which leaves only tens of gigabytes for context on a 256 GB box. Third, the MTP file is a 3.6 GiB optional add-on that drops on top of q2 or q4 to accelerate generation through speculative decoding.

04. Mac memory benchmark: what 96, 128, 256 and 512 GB actually deliver

The numbers below come from the ds4 repository README and community runs, expressed as tokens per second (prefill / generate):

Hardware Quant Context Prefill t/s Generate t/s
MacBook Pro M5 Max 128GB q2 short 463.0 34.0
Mac Studio M3 Ultra 512GB q2 short 384.43 36.86
Mac Studio M3 Ultra 512GB q2 11,709 tokens 250.11 27.39
Mac Studio M3 Ultra 512GB q4 short 78.95 35.50
Mac Studio M3 Ultra 512GB q4 12,018 tokens 448.82 26.62
DGX Spark GB10 128GB (reference) q2 7,047 tokens 343.81 13.75

Three takeaways. A 128 GB M5 Max MacBook Pro already pushes 463 t/s prefill on q2 short prompts, which feels far better than expected for a laptop. A 512 GB M3 Ultra running q4 on a 12k-token prompt hits 448.82 t/s prefill, the most powerful V4 Flash experience you can currently buy in a single Mac. And the DGX Spark GB10 only generates at 13.75 t/s, well behind the M3 Ultra's 36.86 t/s, illustrating how Apple Silicon's unified memory pays off structurally for MoE inference.

05. Five steps to a working ds4 on a Mac Studio M3 Ultra

Below is the shortest path from a fresh macOS install to the first generated token, roughly 30 to 45 minutes end to end (model download dominates the timeline):

  1. Clone and build. git clone https://github.com/antirez/ds4 && cd ds4 && make. macOS picks Metal automatically; no CUDA toolchain required.
  2. Download weights. Run ./download_model.sh q2 on 128 GB machines, ./download_model.sh q4 on 256 GB or larger boxes, and optionally ./download_model.sh mtp for speculative decoding.
  3. Smoke test. ./ds4 -p "Explain Redis streams in one paragraph." confirms the loader, tokenizer and Metal backend are wired up.
  4. Start the OpenAI-compatible server. ./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192 listens on 127.0.0.1:8080 by default.
  5. Record a baseline. Send a real 12k-token engineering prompt, log prefill / generate t/s and peak GPU memory, and keep those numbers as your tuning baseline.
# 1. Clone and compile (Metal) $ git clone https://github.com/antirez/ds4 && cd ds4 && make # 2. Download weights (q2 for 128GB Macs) $ ./download_model.sh q2 # 3. Launch OpenAI-compatible server with on-disk KV $ ./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192 # 4. Verify $ curl -s http://127.0.0.1:8080/v1/models | jq .

06. KV cache on disk and the safe envelope for the 1M context window

The most underrated design choice in ds4 is its persistent KV cache. On a Mac with a fast NVMe SSD, sessions no longer need a full prefill on restart; you recover a 100k-token context in seconds after restarting the server. Three boundaries to respect:

  • A full 1M context burns roughly 26 GB of GPU memory on its own, with the compressed indexer alone taking about 22 GB. On a 128 GB Mac already holding 81 GB of q2 weights, forcing 1M almost guarantees an OOM.
  • 128 GB machines should start with --ctx 100000–300000. Community reports describe 250k contexts on 96 GB Macs, but only after killing Chrome, Xcode and other memory-heavy processes.
  • --kv-disk-space-mb should be at least 8192, and 16384 or more for long sessions or multi-user workloads.
Practical advice: on a 128 GB MacBook Pro start conservatively at --ctx 100000, watch GPU and wired memory in Activity Monitor, then walk it up to 200k. If wired memory approaches the physical limit, roll back immediately or the system will freeze.

07. Plug ds4-server into Cursor and opencode as an OpenAI backend

ds4-server implements /v1/chat/completions, /v1/models and OpenAI Function Calling. From the outside it is just another OpenAI-compatible endpoint, so Cursor, opencode and Continue accept it with no code changes.

  1. In Cursor's settings, add a new custom-model provider with baseURL set to http://127.0.0.1:8080/v1 and any non-empty string for apiKey.
  2. Use deepseek-v4-flash as the model name (the id returned by ds4-server's /v1/models).
  3. For remote access, put the Mac Studio on a Tailscale mesh and point baseURL at the mesh IP. Never expose port 8080 to the public internet.
  4. Tool calls such as file editing, command execution and reading git diffs run through Function Calling. ds4's built-in coding agent has already exercised the path end to end.
  5. When debugging, log ds4-server requests to a file and diff them against Cursor's request payloads. Schema mismatches show up immediately.

08. Owning a maxed-out Mac versus daily rental: the crossover point

If you cannot afford a maxed-out Mac but still want frontier V4 Flash performance, ownership is the obvious first thought. The price tags are not gentle:

  • MacBook Pro M5 Max 128GB: roughly USD 4,200; runs q2 and sits at the entry tier.
  • Mac Studio M4 Ultra 256GB: roughly USD 8,500; handles q4 at modest context.
  • Mac Studio M3 Ultra 512GB top spec: roughly USD 15,000; the only configuration that runs q4 at long context comfortably.

Daily rental of a 512GB Mac Studio M3 Ultra falls into the range of tens of dollars per day. Three rules of thumb follow:

  • Crossover at roughly 200 usage days per year. Less than 200, rental is cheaper and you skip depreciation risk.
  • Team sharing multiplies the savings. Five engineers rotating through one rented Studio reduce effective cost another five-fold.
  • Hardware refresh risk is real. When M5 Ultra or M6 Max arrives, the secondhand value of a maxed-out Studio drops 20–30 percent overnight. Rental absorbs that risk for you.

09. Two real macOS pitfalls: CPU panics and the thermal envelope

antirez calls out two gotchas in the README, both learned the hard way:

  • The CPU backend panics on macOS. A current VM bug in macOS triggers a kernel panic when ds4 runs the CPU path. The clean conclusion: always use Metal on macOS; never make cpu. The CPU path is only for correctness checks on Linux.
  • Thermal and power walls bite. A MacBook Pro under sustained inference can hit 90C with fans wide open. Use mains power, lift the chassis, and consider a cooling pad. A Mac Studio's machined airflow channels make long runs vastly more stable than any laptop.

One more easy-to-miss detail: do not let Time Machine run a backup while inference is live. The I/O contention crushes KV-cache throughput and halves generate t/s in seconds.

10. Local inference versus commercial APIs: privacy, compliance, control

The real motivation for pulling V4 Flash onto local hardware is rarely cost; it is keeping data on the machine. Compared with hosted APIs you gain:

  • Privacy. Zero egress. Enterprise source, user logs, medical or financial data never leave the box.
  • Compliance. GDPR, sector regulations and internal residency policies all care about where weights live and where data flows. Local inference is the cleanest answer.
  • Control. Hosted vendors change rate limits, weights and protocols at will. A pinned ds4 plus V4 Flash snapshot is reproducible and auditable.
  • Predictable cost. Hosted APIs bill per token; long-context agents create budget spikes. Local inference is fixed depreciation, rental and electricity, which makes CFOs much happier.

11. A 1–3 day rental schedule from ds4 build to Cursor integration

The following three-day plan is designed for a small team that wants to try ds4 before committing to hardware:

  1. Day 0 evening. File a daily-rental ticket on macdate.com for a Mac Studio M3 Ultra 512GB and a 1–3 day window. Pre-stage your ds4 fork, SSH keys and Tailscale credentials.
  2. Day 1 morning. SSH in, install git via Homebrew, clone ds4, run make against Metal, and start ./download_model.sh q4 (153 GiB; allow 1.5–3 hours on a 1 Gbps link).
  3. Day 1 afternoon. Run ds4 -p to smoke test, then ds4-server --ctx 200000 --kv-disk-dir ~/kv --kv-disk-space-mb 16384. Push a 12k-token real workload through it and record your baseline.
  4. Day 2. Join the mesh through Tailscale; point Cursor and opencode at the mesh IP. Spend the day on actual coding tasks while logging t/s and perceived latency.
  5. Day 3 morning. Layer in MTP for speculative decoding and compare generate gains; probe the 1M context boundary, starting from --ctx 400000.
  6. Day 3 afternoon. Export your benchmark CSV, delete /tmp/ds4-kv, scrub SSH keys and the Tailscale node, then release the instance. Billing closes at actual days used.

Three quotable numbers worth keeping. First, the q4 download is roughly 153 GiB, which is 30–40 minutes on a 1 Gbps connection. Second, a single 1–3 day rental is enough to complete the "try, then decide" decision cycle. Third, the rental versus ownership crossover lands at about 200 active days per year. See also the daily Mac rental guide and the Mac mini M4 rent vs buy cost worksheet.

12. Honest limits and the better alternative

Running ds4 + DeepSeek V4 Flash locally embraces the consensus that a maxed-out Mac is the best consumer-grade inference platform for frontier MoE models in 2026. Three caveats remain:

  • Steep hardware floor. Even q2 expects 96–128 GB of unified memory; q4 needs 256 GB; the PRO route wants 512 GB. None of these are standard MacBook configurations.
  • Daily-driver pollution. 80 GiB of weights plus 100+ GB of on-disk KV cache plus sustained thermals will steal headroom from your editor, Xcode and video calls if you run it on your main machine.
  • Depreciation risk. M5 Ultra and M6 Max are coming. The three-year resale curve of a maxed-out Studio looks far worse than 1,095 days of rental.

The cleaner combination is to run ds4 + DeepSeek V4 Flash on a daily-rented physical Mac Studio M3 Ultra 512GB. You get the full q4 + long-context experience, independent bandwidth, an isolated keychain and a dedicated KV directory. When you shut down, the depreciation problem is no longer yours. Cursor and opencode reach the box through Tailscale, so you write code locally and run inference in the cloud while your daily driver stays clean. Pick ds4 + V4 Flash for the model, and let macdate.com provide the physical Mac hardware that makes it boring to operate.

Further Reading