# ds4.c

> A small, Metal-native local inference engine built specifically for DeepSeek V4 Flash, featuring disk KV cache persistence, an OpenAI/Anthropic-compatible server API, and 2-bit quantization support.

ds4.c is a deliberately narrow, Metal-native local inference engine built exclusively for DeepSeek V4 Flash. It is not a generic GGUF runner or framework: it provides a DeepSeek V4 Flash-specific Metal graph executor with DS4-specific loading, prompt rendering, KV state management, and an OpenAI/Anthropic-compatible HTTP server API. The project bets on one model at a time, relying on official-vector validation, long-context tests, and agent integration to ensure the model truly works end-to-end on high-end personal machines (MacBooks and Mac Studios) with 128 GB of RAM or more.

Key features include:

- **Metal-only inference** — *the optimized execution path runs entirely on Apple Metal; a CPU path exists only for correctness checks*
- **Disk KV cache persistence** — *compressed KV caches are written to SSD, allowing long-context sessions to survive server restarts and session switches without re-prefilling*
- **2-bit and 4-bit quantization** — *asymmetric quantization targeting only routed MoE experts (IQ2_XXS up/gate, Q2_K down) lets the 284B-parameter model run on 128 GB MacBooks*
- **1 million token context window** — *the model supports up to 1M tokens; practical context is limited by available RAM*
- **OpenAI-compatible server** — *`ds4-server` exposes `/v1/chat/completions`, `/v1/completions`, and `/v1/models` endpoints with SSE streaming, tool calling, and thinking-mode controls; see the request sketch after this list*
- **Anthropic-compatible endpoint** — *`/v1/messages` supports Claude Code-style clients with `tool_use` blocks and thinking controls; an example follows the list as well*
- **Thinking mode support** — *non-thinking, thinking, and Think Max modes are supported; reasoning is streamed natively*
- **Agent client integration** — *documented configuration for opencode, Pi, and Claude Code coding agents*
- **Speculative decoding (MTP)** — *optional multi-token prediction path for greedy decoding; currently experimental*
- **Test vector validation** — *short and long-context continuation vectors captured from the official DeepSeek V4 Flash API are used to catch tokenizer, template, or attention regressions*
- **Interactive CLI** — *multi-turn chat with `/think`, `/nothink`, `/ctx`, `/read`, and other commands; Ctrl+C interrupts generation*
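
A minimal request against the OpenAI-compatible endpoint might look like the sketch below. The host, port, and model id are assumptions, not documented ds4 defaults; query `/v1/models` on your running server for the real model id.

```sh
# Sketch only: host, port, and model id are placeholders, not ds4 defaults.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Explain KV cache reuse in one paragraph."}],
    "stream": true
  }'
```

With `"stream": true` the response arrives as SSE `data:` chunks, following the usual OpenAI streaming convention.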
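
For Claude Code-style clients, `/v1/messages` accepts Anthropic Messages-format requests. A hedged sketch, with the same placeholder host, port, and model id as above:

```sh
# Sketch only: the Anthropic Messages format requires max_tokens;
# whether ds4-server enforces it is not documented here.
curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

Clients such as Claude Code can typically be pointed at a compatible server by overriding the base URL (for example via the `ANTHROPIC_BASE_URL` environment variable); the exact agent configuration is documented in the repository.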

To get started, clone the repository, run `./download_model.sh q2` (128 GB machines) or `./download_model.sh q4` (256 GB+ machines), then `make`. Launch the CLI with `./ds4` or start the server with `./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192`.
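
Concretely, a first run on a 128 GB machine might look like this (the clone URL comes from the Links section below; the checkout directory name is assumed):

```sh
git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2    # use q4 instead on 256 GB+ machines
make
./ds4                     # interactive multi-turn CLI
# ...or expose the HTTP API with disk-backed KV persistence:
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
```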

## Features
- Metal-native inference engine for DeepSeek V4 Flash
- Disk KV cache persistence for long-context sessions
- 2-bit and 4-bit asymmetric quantization
- 1 million token context window
- OpenAI-compatible HTTP server API
- Anthropic-compatible /v1/messages endpoint
- SSE streaming with thinking-mode support
- Tool calling with DSML format mapping
- Speculative decoding via MTP (experimental)
- Interactive multi-turn CLI
- Official logit vector validation tests
- Prefix-aware KV cache reuse across sessions
- Single-session serialized Metal inference worker

## Integrations
OpenAI API, Anthropic API, Claude Code, opencode, Pi agent, Hugging Face, llama.cpp / GGML (reference), GGUF format

## Platforms
macOS, CLI, API

## Pricing
Open Source

## Version
main

## Links
- Website: https://github.com/antirez/ds4
- Documentation: https://github.com/antirez/ds4
- Repository: https://github.com/antirez/ds4
- EveryDev.ai: https://www.everydev.ai/tools/ds4-c
