ds4.c
A small, Metal-native local inference engine specifically built for DeepSeek V4 Flash, featuring disk KV cache persistence, OpenAI/Anthropic-compatible server API, and 2-bit quantization support.
At a Glance
Fully free and open-source under the MIT License. Download, use, modify, and distribute at no cost.
Listed May 2026
About ds4.c
ds4.c is a deliberately narrow, Metal-native local inference engine built exclusively for DeepSeek V4 Flash. It is not a generic GGUF runner or framework — it provides a DeepSeek V4 Flash-specific Metal graph executor with DS4-specific loading, prompt rendering, KV state management, and an OpenAI/Anthropic-compatible HTTP server API. The project bets on one model at a time, with official-vector validation, long-context tests, and agent integration to ensure the model truly works end-to-end on high-end personal machines and Mac Studios starting from 128 GB of RAM.
Key features include:
- Metal-only inference — the optimized execution path runs entirely on Apple Metal; a CPU path exists only for correctness checks
- Disk KV cache persistence — compressed KV caches are written to SSD, allowing long-context sessions to survive server restarts and session switches without re-prefilling
- 2-bit and 4-bit quantization — asymmetric quantization targeting only routed MoE experts (IQ2_XXS up/gate, Q2_K down) lets the 284B-parameter model run on 128 GB MacBooks
- 1 million token context window — the model supports up to 1M tokens; practical context is limited by available RAM
- OpenAI-compatible server — ds4-server exposes /v1/chat/completions, /v1/completions, and /v1/models endpoints with SSE streaming, tool calling, and thinking-mode controls
- Anthropic-compatible endpoint — /v1/messages supports Claude Code-style clients with tool_use blocks and thinking controls
- Thinking mode support — non-thinking, thinking, and Think Max modes are supported; reasoning is streamed natively
- Agent client integration — documented configuration for opencode, Pi, and Claude Code coding agents
- Speculative decoding (MTP) — optional multi-token prediction path for greedy decoding; currently experimental
- Test vector validation — short and long-context continuation vectors captured from the official DeepSeek V4 Flash API are used to catch tokenizer, template, or attention regressions
- Interactive CLI — multi-turn chat with /think, /nothink, /ctx, /read, and other commands; Ctrl+C interrupts generation
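Since the server speaks the OpenAI chat-completions dialect, a client request can be sketched in a few lines. This is a minimal sketch, not from the ds4.c docs: the port 8080 and the model identifier "deepseek-v4-flash" are assumptions — check ds4-server's actual options for the real values.

```python
import json
import urllib.request

def build_chat_request(prompt, stream=True):
    # OpenAI-style chat completion request body.
    # The model name below is an assumed identifier, not confirmed by ds4.c.
    return {
        "model": "deepseek-v4-flash",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def send(body, url="http://localhost:8080/v1/chat/completions"):
    # Network call; requires a running ds4-server (port is an assumption).
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

body = build_chat_request("Explain disk KV cache persistence in one sentence.")
print(json.dumps(body, indent=2))
```

Because the endpoint shape matches OpenAI's, existing SDKs pointed at the local base URL should also work unchanged.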
To get started, clone the repository, run ./download_model.sh q2 (128 GB machines) or ./download_model.sh q4 (256 GB+ machines), then make. Launch the CLI with ./ds4 or start the server with ./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192.
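With the server running, streamed responses arrive as SSE events. The sketch below parses such a stream assuming the standard OpenAI format — `data: {json}` lines ending with a `data: [DONE]` sentinel; the exact chunk fields emitted by ds4-server are an assumption based on that convention.

```python
import json

def collect_deltas(sse_lines):
    # Accumulate text deltas from OpenAI-style SSE chunk lines.
    # Assumes chat.completion.chunk shape: choices[0].delta.content.
    out = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue                      # skip blank keep-alive lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            out.append(delta)
    return "".join(out)

stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_deltas(stream))  # Hello
```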
Pricing
Open Source (MIT)
- Metal-native DeepSeek V4 Flash inference
- 2-bit and 4-bit quantization support
- OpenAI and Anthropic-compatible server API
- Disk KV cache persistence
- Interactive CLI
Capabilities
Key Features
- Metal-native inference engine for DeepSeek V4 Flash
- Disk KV cache persistence for long-context sessions
- 2-bit and 4-bit asymmetric quantization
- 1 million token context window
- OpenAI-compatible HTTP server API
- Anthropic-compatible /v1/messages endpoint
- SSE streaming with thinking-mode support
- Tool calling with DSML format mapping
- Speculative decoding via MTP (experimental)
- Interactive multi-turn CLI
- Official logit vector validation tests
- Prefix-aware KV cache reuse across sessions
- Single-session serialized Metal inference worker
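For the Anthropic-compatible /v1/messages endpoint listed above, a request would presumably mirror the Anthropic Messages API it emulates. This is a hedged sketch: the model identifier, max_tokens value, and the thinking control block are assumptions modeled on Anthropic's API, not confirmed ds4-server parameters.

```python
import json

# Anthropic Messages API-style request body for /v1/messages.
# The "thinking" block mirrors Anthropic's extended-thinking control;
# whether ds4-server accepts this exact field is an assumption.
body = {
    "model": "deepseek-v4-flash",          # assumed model identifier
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "Summarize the MTP speculative decoding path."}
    ],
    "thinking": {"type": "enabled", "budget_tokens": 4096},
}
print(json.dumps(body, indent=2))
```

Clients built for Claude Code-style tool_use blocks should be able to point at this endpoint with only a base-URL change.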
