DwarfStar · Metal first · Unified memory · TCO math · A 60-minute VNC runbook
In May 2026, Redis creator antirez released ds4 (DwarfStar), a single-binary, pure-C inference engine handcrafted for DeepSeek V4 Flash and PRO. Within a week it crossed 11k GitHub stars and became the first credible answer to the question of running a true frontier model locally on an Apple Silicon Mac. The catch is hardware. With 96GB, 256GB or 512GB of unified memory as the entry tickets, most independent developers, AI researchers and technical writers are left watching from the cheap seats. This article walks you through the actual ds4 specs and hardware floor, the structural advantage of UMA over consumer NVIDIA HBM, a sober buy-versus-rent TCO calculation, and a 60-minute VNC runbook on a rented VNCMac node for getting ds4 + DeepSeek V4 Flash from git clone to a working OpenAI-compatible endpoint. It links across to CoreWeave’s record backlog, OpenClaw + Ollama hybrid embeddings and OpenClaw egress proxy so you can keep frontier inference and your day job on the same rented Mac.
There is a reason ds4 was branded the “best local engine for DeepSeek V4 on a Mac” within days of release. antirez is not just any open-source author; he is the creator of Redis and one of the few engineers who built an aesthetic out of very little code doing very fast things. With ds4 he carried that aesthetic into LLM inference: no Python, no third-party runtime, no hidden magic. The result is a sharply scoped project that does one thing differently from general-purpose frameworks like llama.cpp, MLX, ollama or vllm. Five design decisions explain the star count.
Pure C, no third-party inference stack. The repo builds with a plain make. The resulting binary has no Python interpreter, no CUDA toolchain and no pip dependency wall. First boot drops from hours to minutes.
Metal first. Deep adaptation to Apple Silicon GPUs. On a MacBook Pro M5 Max the project reports 463 t/s prefill and 34 t/s generation, numbers that outclass most equivalently priced PC plus consumer NVIDIA setups in measured throughput.
One-million-token context. ds4 supports a 1M context window, paired with the aggressively compressed KV cache built into DeepSeek V4. Long documents and multi-turn coding sessions stop being “re-read the file every turn”.
Persistent KV cache on disk. The KV cache is serialised to the Mac’s NVMe SSD. Sessions resume across sleeps without a fresh prefill, which is a perfect match for how Mac users actually work.
2-bit aggressive quant and a built-in agent. Only routed experts are quantised hard while the rest of the model keeps precision, which is what lets Flash slip into a 128GB machine. Tool Calling is native, the API is OpenAI and Anthropic compatible, and Cursor or opencode can talk to ds4 out of the box.
The political weight of this design is larger than its raw throughput. ds4 pulls the on-ramp for frontier model inference back from “a cloud account and a five-figure GPU” to “a MacBook and one binary”. It also says something sharper, however quietly: the real barrier in 2026 is no longer software, it is hardware cost. Section 02 puts numbers on that.
The ds4 performance figures are pretty, but the table below is the one most readers actually need to see: which quantisation, which Mac, how much money. List prices are 2026 reference figures from mainstream channels and should be treated as orders of magnitude, not quotes.
| Model | Minimum unified memory | Typical Mac (2026) | Reference price (USD) | Typical use |
|---|---|---|---|---|
| DeepSeek V4 Flash · q2 | 96 GB | MacBook Pro M3/M4/M5 Max (96 GB UMA) | $4,200+ | Personal coding copilot, doc Q&A, research |
| DeepSeek V4 Flash · q4 | 256 GB | Mac Studio M3/M4 Ultra (256 GB UMA) | $8,500+ | Stable output, long-context engineering Q&A |
| DeepSeek V4 PRO · q2 | 512 GB | Mac Studio M3 Ultra top-spec (512 GB UMA) | $15,500+ | Local agent, public API, in-house agent products |
| DeepSeek V4 PRO · q4 | 1 TB+ | No single consumer machine. Multi-node or server class only. | — | Research labs, platform-scale serving |
A few caveats that are easy to miss. First, 96 GB is the floor to run Flash q2, not the floor to run comfortably. If Xcode, Chrome and a couple of Slack workspaces are open at the same time, leave 20 to 30 GB of headroom for macOS, otherwise swap will kick in mid-inference and prefill speed will drop by half. Second, q4 is more stable than q2 but its memory and disk-KV footprint scale roughly linearly, so independent developers should validate workloads on q2 before paying for q4. Third, PRO q4 has no consumer-grade integrated machine that can run it solo today. Anything resembling platform-scale serving still belongs on multi-node or server-class infrastructure.
Validate the workflow on q2 first, then decide whether to pay for 256GB or 512GB. Run it before you buy it.
ds4 lists Metal as the “first-class target”, and that is not because antirez likes macOS aesthetics. What he is actually betting on is the Unified Memory Architecture (UMA) in Apple Silicon. At the consumer tier, UMA carries a few structural wins for large model inference that NVIDIA cannot replicate.
CPU and GPU share one large pool. M3, M4 and M5 SoCs solder 96 to 512 GB directly onto the package. Model weights never have to be copied between CPU RAM and GPU VRAM, which removes PCIe transfer overhead and a whole class of OOM failure modes.
Consumer NVIDIA VRAM ceilings. Consumer NVIDIA cards still cap out at roughly 24 to 32 GB of VRAM. Fitting a 90 GB Flash q2 weight set means multi-GPU or CPU offload, both of which give back a large fraction of throughput to PCIe and cross-GPU communication.
High bandwidth, low power. M4 and M5 Max memory bandwidth is competitive with HBM-class numbers, while the whole machine draws only a few dozen watts. A home circuit can drive it. A GPU server with similar memory needs a dedicated PDU and rack cooling.
Native fit with the SSD KV cache. macOS NVMe sequential reads commonly exceed 5 GB/s, and ds4’s disk KV cache means the next session resumes in seconds. Replicating that on Linux is possible, but you must hand-roll mmap, locking and scheduler corners that Apple gives you for free.
The price you pay. UMA solders RAM permanently to the SoC. A 128 GB MacBook Pro will never become a 256 GB one. That is exactly why “rent first, buy later” is unusually rational in 2026. Section 04 puts numbers on that decision.
Restated: “why does it have to be a Mac” is not a marketing line, it is a hardware observation. At the consumer tier, only Apple Silicon ships 96GB+ of true shared memory in a single machine. At data-centre scale NVIDIA H200 and B100 remain unquestioned for training, but if you want inference for one person and one wallet, Mac is currently the only platform that engineers are seriously porting to. That is precisely why ds4 starts from a Metal-first design instead of pretending to be cross-platform.
The table below collapses the buy-versus-rent question into a single dimension, total cost in year one, so you can have the discussion with your team in five minutes. Numbers are 2026 reference values in USD, replace them with your own quotes and local electricity rates.
| Path | Up-front | Annual hidden cost | Year 1 total (light use) | Break-even / fit |
|---|---|---|---|---|
| Buy MacBook Pro M5 Max 96GB | $4,200+ | Electricity, depreciation, no upgrade path $500–700 | ~$4,800+ | 3 hours+ per day, 3-year horizon |
| Buy Mac Studio Ultra 256GB | $8,500+ | Electricity, fan noise, depreciation $800–1,200 | ~$9,500+ | Team sharing, daily heavy inference |
| Buy Mac Studio Ultra 512GB top-spec | $15,500+ | Electricity, maintenance, depreciation $1,200–1,800 | ~$17,000+ | Public API, research-grade workloads |
| Rent VNCMac 96GB+ remote Mac (monthly) | $0 | Fixed monthly fee, only when running | Often 1/3 to 1/5 of buying | Project-based, occasional inference, evaluation |
| Rent VNCMac high-memory node (hourly) | $0 | Stop when done, no idle cost | Lowest, you pay only for live hours | Short evaluation, single PoC, demo recording |
The table is not meant to point to one row as “cheapest”. It is meant to help you place yourself on the spectrum. If you genuinely run inference three or more hours per day, every day, and you can commit for three years, a 96GB MacBook Pro will likely break even by year three. If the honest description of your workload is “evaluate ds4 a few times”, “produce one client demo”, or “follow a couple of DeepSeek V4 releases”, hourly rental is a much friendlier cashflow shape and you are not exposed to three-year depreciation on a machine you cannot upgrade. The JSON below is a minimal calculator you can drop into a team doc.
{
"scenario": "ds4_deepseek_v4_flash_q2",
"daily_active_hours": 2.0,
"active_days_per_year": 180,
"owned_total_year_one": 4800,
"rental_hourly_rate": 1.2,
"rental_year_one": "daily_active_hours * active_days_per_year * rental_hourly_rate",
"break_even_years": "owned_total_year_one / rental_year_one"
}
Tip: Replace those five numbers with your real ones. Most evaluators, freelancers and small teams land at break_even_years > 3, which is exactly the case where renting first is the strongest decision.
Two costs that often get omitted from spreadsheets: electricity and fan noise. A loaded Mac Studio Ultra draws roughly 200–300 W under sustained inference. Running it 24x7 adds a non-trivial bill and the audible fan noise in a home or shared office is a real ergonomic cost. Outsourcing that to a data centre is one of the most under-stated reasons freelancers end up renting in the first place.
Sections 03 and 04 settle whether you should buy. This section gives you a copy-pasteable shortest path: from ordering a VNCMac high-memory node to chatting with DeepSeek V4 Flash in a browser, target time under 60 minutes. Steps marked with a star are the ones that will silently stall in an SSH-only session and that genuinely require a VNC graphical session.
Pick the node. On the pricing page, select a remote Mac with at least 96 GB of memory, ideally M3, M4 or M5 Max and an SSD of 1 TB or more so that weights and KV cache have room. Save the VNC and SSH credentials sent by email.
First VNC login (star). Connect with your local VNC viewer. On first desktop entry macOS will pop “allow this computer to be observed” type dialogs. SSH cannot click those; only a graphical session can.
Clone and build ds4. In Terminal run git clone https://github.com/antirez/ds4 && cd ds4 && make. ds4 needs only system Clang and the Metal SDK, so the first build typically completes in one to three minutes.
Download weights (star). Fetch the DeepSeek V4 Flash q2 weights, roughly 90 GB, from the official source or a mirror. First writes into a new folder trigger a disk-write permission and a “downloaded apps may access this folder” type prompt that SSH cannot answer.
First launch and Metal authorization (star). Run ./ds4 --model deepseek-v4-flash-q2.gguf --port 18080. The first call into Metal triggers a GPU access prompt, plus possibly a Gatekeeper or SIP warning. Approve them from the VNC desktop and, if needed, allowlist the binary in System Settings.
KV cache sanity check. In Finder open ~/.ds4/cache and verify cache files grow with each session. If the directory stays empty, authorization probably did not pass or the cache lives on a read-only volume.
Hook up Cursor or opencode. Point your client’s base URL to http://<remote-mac-ip>:18080/v1 and the model name to deepseek-v4-flash. ds4 implements the OpenAI-compatible protocol, so the first round of chat will validate Tool Calling and SSE streaming.
Stop when done. Back in the VNCMac console, release the node. Hourly billing stops the moment you release, no “forgot to turn it off” surprises tomorrow.
New users often ask whether they can skip VNC by automating everything over SSH. The honest answer is that day-to-day inference, yes; first-run authorization, no. That gap is exactly why a remote Mac with a real graphical session is more practical than a pure-SSH cloud VM. The three-column table below pins it down so you can drop it straight into a runbook.
| Checkpoint | SSH alone? | What VNC must do |
|---|---|---|
| Screen sharing first authorisation | No | Click “allow” on the top-right desktop dialog |
| Disk write permission for weights folder | No | System Settings → Privacy → Files and Folders |
| First Metal GPU call | No | Approve dialog, allowlist in SIP if asked |
| KV cache directory verification | Partial (ls) | Finder shows file size growing per session |
| Day-to-day inference and Cursor | Yes | VNC only needed when something breaks |
Watch out: blaming the Metal authorization dialog on a ds4 bug is the most common misdiagnosis. Most of the time SSH simply cannot see the dialog. A single VNC session resolves it in one click.
The links below sit on the same axis as this post: frontier model inference plus the rented Mac that makes it affordable. Read them together if you want to consolidate inference workloads and your day-to-day iOS or agent work onto a single rented node.
How the GPU half of compute-as-a-service splits from the Mac half.
Read →Agent-side small models for embeddings, complementary to ds4 full inference.
Read →Proxy and allowlist patterns for cross-border DeepSeek and Anthropic calls.
Read →ds4 is not a general GGUF loader. It is a single-purpose C engine hand-written by antirez for DeepSeek V4 Flash and PRO, optimised for the Metal backend and disk-backed KV cache. On Mac it tends to outperform general frameworks for that one model family, but it is not meant to replace llama.cpp or MLX for your other workloads.
On a 96 GB M3, M4 or M5 Max, Flash at q2 falls into the “usable” range, with prefill and generation noticeably faster than comparably priced PC setups. However you must leave 20 to 30 GB of headroom for macOS, Xcode and browsers, otherwise swap kicks in and throughput collapses.
If your annual active-use ratio is below roughly 30 percent, the depreciation, electricity and fan noise of a 512 GB Mac Studio Ultra rarely pay off. Hourly or monthly rental of a high-memory VNCMac node usually fits an “on-demand inference” cashflow far better. See Section 04 for the math.
Day-to-day inference calls happily flow through SSH and the OpenAI-compatible API. But first-run Gatekeeper, Metal driver authorization, disk write permission and KV cache directory checks still require a real GUI session. SSH alone will silently stall on those dialogs. See the three-column table in Section 05.
With ds4, antirez pulled the on-ramp for frontier model inference back from “a cloud account and a five-figure GPU” to “a Mac and one binary”. What he did not solve, and was never going to, is the harder problem: a 96 GB Mac starts at thousands of dollars and a 512 GB Mac Studio Ultra clears five figures. For most independent developers, researchers, technical writers and small teams, the gap between “I want to run DeepSeek V4” and “I can run DeepSeek V4” is not a software gap, it is a cashflow gap.
Owning brings its own hidden costs. UMA solders memory to the SoC, meaning you buy once and never upgrade. Fan noise and electricity are real ergonomic costs in home offices. Three years later your machine will only be worth its used-resale price the moment you would want to step up to PRO q4. If your honest profile is evaluation, project work, occasional inference, the three-year depreciation alone often exceeds what hourly rental on VNCMac would have cost you.
That is the point of VNCMac remote Mac rental in the ds4 era: turn the “top-spec local inference environment” that used to belong to people who could swallow a Mac Studio Ultra purchase into infrastructure anyone can rent by the hour or month. Inference data stays inside your dedicated node, no third-party API in the loop, billing stops when you stop the box. Hit the primary button below to open the English pricing page, spin up a 96 GB-class node and run the Section 05 runbook; if you then still want a Mac Studio Ultra under your desk, at least that decision will be made on numbers, not on hype. Browse the homepage first for hardware specs and plans.