ds4.c
A small, Metal-native local inference engine specifically built for DeepSeek V4 Flash, featuring disk KV cache persistence, OpenAI/Anthropic-compatible server API, and 2-bit quantization support.
At a Glance
Fully free and open-source under the MIT License. Download, use, modify, and distribute at no cost.
Listed May 2026
About ds4.c
ds4.c is a deliberately narrow, Metal-native local inference engine built exclusively for DeepSeek V4 Flash. It is not a generic GGUF runner or framework — it provides a DeepSeek V4 Flash-specific Metal graph executor with DS4-specific loading, prompt rendering, KV state management, and an OpenAI/Anthropic-compatible HTTP server API. The project bets on one model at a time, with official-vector validation, long-context tests, and agent integration to ensure the model truly works end-to-end on high-end personal machines and Mac Studios starting from 128 GB of RAM.
Key features include:
- Metal-only inference — the optimized execution path runs entirely on Apple Metal; a CPU path exists only for correctness checks
- Disk KV cache persistence — compressed KV caches are written to SSD, allowing long-context sessions to survive server restarts and session switches without re-prefilling
- 2-bit and 4-bit quantization — asymmetric quantization targeting only routed MoE experts (IQ2_XXS up/gate, Q2_K down) lets the 284B-parameter model run on 128 GB MacBooks
- 1 million token context window — the model supports up to 1M tokens; practical context is limited by available RAM
- OpenAI-compatible server — ds4-server exposes /v1/chat/completions, /v1/completions, and /v1/models endpoints with SSE streaming, tool calling, and thinking-mode controls
- Anthropic-compatible endpoint — /v1/messages supports Claude Code-style clients with tool_use blocks and thinking controls
- Thinking mode support — non-thinking, thinking, and Think Max modes are supported; reasoning is streamed natively
- Agent client integration — documented configuration for opencode, Pi, and Claude Code coding agents
- Speculative decoding (MTP) — optional multi-token prediction path for greedy decoding; currently experimental
- Test vector validation — short and long-context continuation vectors captured from the official DeepSeek V4 Flash API are used to catch tokenizer, template, or attention regressions
- Interactive CLI — multi-turn chat with /think, /nothink, /ctx, /read, and other commands; Ctrl+C interrupts generation
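Since the server speaks the OpenAI chat-completions dialect, a client request can be sketched in a few lines. This is a minimal sketch, not from the ds4.c docs: the port 8080 and the model identifier "deepseek-v4-flash" are assumptions — check ds4-server's actual options for the real values.

```python
import json
import urllib.request

def build_chat_request(prompt, stream=True):
    # OpenAI-style chat completion request body.
    # The model name below is an assumed identifier, not confirmed by ds4.c.
    return {
        "model": "deepseek-v4-flash",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def send(body, url="http://localhost:8080/v1/chat/completions"):
    # Network call; requires a running ds4-server (port is an assumption).
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

body = build_chat_request("Explain disk KV cache persistence in one sentence.")
print(json.dumps(body, indent=2))
```

Because the endpoint shape matches OpenAI's, existing SDKs pointed at the local base URL should also work unchanged.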
To get started, clone the repository, run ./download_model.sh q2 (128 GB machines) or ./download_model.sh q4 (256 GB+ machines), then make. Launch the CLI with ./ds4 or start the server with ./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192.
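With the server running, streamed responses arrive as SSE events. The sketch below parses such a stream assuming the standard OpenAI format — `data: {json}` lines ending with a `data: [DONE]` sentinel; the exact chunk fields emitted by ds4-server are an assumption based on that convention.

```python
import json

def collect_deltas(sse_lines):
    # Accumulate text deltas from OpenAI-style SSE chunk lines.
    # Assumes chat.completion.chunk shape: choices[0].delta.content.
    out = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue                      # skip blank keep-alive lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            out.append(delta)
    return "".join(out)

stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_deltas(stream))  # Hello
```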
Pricing
Open Source (MIT)
- Metal-native DeepSeek V4 Flash inference
- 2-bit and 4-bit quantization support
- OpenAI and Anthropic-compatible server API
- Disk KV cache persistence
- Interactive CLI
Capabilities
Key Features
- Metal-native inference engine for DeepSeek V4 Flash
- Disk KV cache persistence for long-context sessions
- 2-bit and 4-bit asymmetric quantization
- 1 million token context window
- OpenAI-compatible HTTP server API
- Anthropic-compatible /v1/messages endpoint
- SSE streaming with thinking-mode support
- Tool calling with DSML format mapping
- Speculative decoding via MTP (experimental)
- Interactive multi-turn CLI
- Official logit vector validation tests
- Prefix-aware KV cache reuse across sessions
- Single-session serialized Metal inference worker
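For the Anthropic-compatible /v1/messages endpoint listed above, a request would presumably mirror the Anthropic Messages API it emulates. This is a hedged sketch: the model identifier, max_tokens value, and the thinking control block are assumptions modeled on Anthropic's API, not confirmed ds4-server parameters.

```python
import json

# Anthropic Messages API-style request body for /v1/messages.
# The "thinking" block mirrors Anthropic's extended-thinking control;
# whether ds4-server accepts this exact field is an assumption.
body = {
    "model": "deepseek-v4-flash",          # assumed model identifier
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "Summarize the MTP speculative decoding path."}
    ],
    "thinking": {"type": "enabled", "budget_tokens": 4096},
}
print(json.dumps(body, indent=2))
```

Clients built for Claude Code-style tool_use blocks should be able to point at this endpoint with only a base-URL change.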
