Hypura
Storage-tier-aware LLM inference scheduler for Apple Silicon that runs models too big for your Mac's memory across GPU, RAM, and NVMe.
At a Glance
Pricing
Free and open-source under MIT license.
Listed Mar 2026
About Hypura
Hypura is a storage-tier-aware LLM inference scheduler built for Apple Silicon Macs. It solves a common problem for developers and researchers working with large language models on consumer hardware: models that exceed available memory cause swap-thrashing and out-of-memory crashes under standard inference tools like llama.cpp. Hypura addresses this by intelligently placing model tensors across three storage tiers — GPU (Metal), RAM, and NVMe — based on access patterns, bandwidth costs, and hardware capabilities.
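The tier-placement idea can be sketched as a greedy assignment: sort tensors by how "hot" they are (accesses per byte) and fill the fastest tier first, spilling down to RAM and then NVMe. This is a minimal illustrative sketch; the type names and heuristic are assumptions, not Hypura's actual API or algorithm.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Gpu,  // Metal working set: fastest, smallest
    Ram,  // unified memory outside the GPU budget
    Nvme, // streamed on demand: highest latency, effectively unlimited
}

struct Tensor {
    name: &'static str,
    bytes: u64,
    accesses_per_token: u32, // norms and embeddings touch every token
}

/// Greedy placement: hottest tensors first, spilling down the tier list.
fn place(
    tensors: &mut [Tensor],
    mut gpu_free: u64,
    mut ram_free: u64,
) -> Vec<(&'static str, Tier)> {
    // Sort by accesses per byte so small, hot tensors win GPU slots.
    tensors.sort_by(|a, b| {
        let ha = a.accesses_per_token as f64 / a.bytes as f64;
        let hb = b.accesses_per_token as f64 / b.bytes as f64;
        hb.partial_cmp(&ha).unwrap()
    });
    tensors
        .iter()
        .map(|t| {
            if t.bytes <= gpu_free {
                gpu_free -= t.bytes;
                (t.name, Tier::Gpu)
            } else if t.bytes <= ram_free {
                ram_free -= t.bytes;
                (t.name, Tier::Ram)
            } else {
                (t.name, Tier::Nvme) // too big to pin: stream from disk
            }
        })
        .collect()
}

fn main() {
    let mut tensors = vec![
        Tensor { name: "output_norm", bytes: 16 << 10, accesses_per_token: 1 },
        Tensor { name: "ffn_down.0", bytes: 512 << 20, accesses_per_token: 1 },
        Tensor { name: "tok_embd", bytes: 64 << 20, accesses_per_token: 1 },
    ];
    // Tiny, hot norm and the embedding table land on GPU; the big FFN
    // weight overflows both budgets and is marked for NVMe streaming.
    let placement = place(&mut tensors, 128 << 20, 256 << 20);
    println!("{placement:?}");
}
```

A real scheduler would weigh per-tier bandwidth costs rather than just capacity, but the spill order is the core of the idea.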
The scheduler reads a GGUF model file, profiles the host hardware (GPU working set size, RAM capacity, and NVMe sequential read bandwidth), then solves a placement optimization problem that assigns every tensor to the appropriate tier. Norms and embeddings, which are small but accessed on every token, are pinned to the GPU. For Mixture-of-Experts architectures such as Mixtral, router interception identifies which experts are selected per token and loads only the needed expert strides from NVMe, a 75 percent I/O reduction. A neuron cache tracks loaded expert slices across tokens, reaching a 99.5 percent hit rate thanks to temporal locality. Dense FFN weights stream from NVMe through a dynamically sized pool buffer while attention and norms stay GPU-resident.
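The 75 percent figure follows directly from the routing arithmetic: Mixtral routes each token to 2 of 8 experts per layer, so only 2/8 of the expert bytes must be read. A small sketch of that arithmetic, under the illustrative assumption of contiguous, equally sized expert strides (the function names are hypothetical, not Hypura's):

```rust
/// Fraction of expert I/O avoided by loading only router-selected experts.
fn io_reduction_pct(n_experts: u32, top_k: u32) -> f64 {
    100.0 * (1.0 - top_k as f64 / n_experts as f64)
}

/// Byte offset of one expert's stride inside a layer's flat expert region,
/// assuming contiguous, equally sized strides.
fn expert_stride_offset(region_base: u64, stride_bytes: u64, expert_idx: u64) -> u64 {
    region_base + expert_idx * stride_bytes
}

fn main() {
    // Mixtral 8x7B: top-2 routing over 8 experts per layer.
    println!("I/O reduction: {}%", io_reduction_pct(8, 2)); // 75%
    println!("expert 3 offset: {} bytes", expert_stride_offset(0, 256 << 20, 3));
}
```

The neuron cache then amortizes even those reads: when consecutive tokens select overlapping experts, the already-loaded strides are reused instead of re-read.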
Hypura selects between three inference modes automatically. Full-resident mode runs when the model fits entirely in GPU and RAM with no NVMe I/O and zero overhead. Expert-streaming mode handles MoE models by keeping only non-expert tensors on GPU and streaming expert weights on demand. Dense FFN-streaming mode extends this approach to non-MoE models like Llama 70B by keeping attention and norms on GPU while streaming FFN tensors from NVMe with scaled prefetch lookahead.
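The three-way mode choice described above reduces to a simple decision: if the model fits in GPU plus RAM, run fully resident; otherwise stream experts for MoE models or FFN weights for dense ones. A minimal sketch, with hypothetical names and illustrative memory budgets (not Hypura's actual thresholds):

```rust
#[derive(Debug, PartialEq)]
enum Mode {
    FullResident,      // whole model fits in GPU + RAM: no NVMe I/O
    ExpertStreaming,   // MoE: stream only router-selected expert strides
    DenseFfnStreaming, // dense: stream FFN weights, keep attention resident
}

fn select_mode(model_bytes: u64, is_moe: bool, gpu_budget: u64, ram_budget: u64) -> Mode {
    if model_bytes <= gpu_budget + ram_budget {
        Mode::FullResident
    } else if is_moe {
        Mode::ExpertStreaming
    } else {
        Mode::DenseFfnStreaming
    }
}

fn main() {
    const GIB: u64 = 1 << 30;
    // Illustrative budgets loosely mirroring a 32 GB machine.
    let (gpu, ram) = (21 * GIB, 8 * GIB);
    assert_eq!(select_mode(9 * GIB, false, gpu, ram), Mode::FullResident);
    assert_eq!(select_mode(31 * GIB, true, gpu, ram), Mode::ExpertStreaming);
    assert_eq!(select_mode(40 * GIB, false, gpu, ram), Mode::DenseFfnStreaming);
    println!("mode selection sketch ok");
}
```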
Benchmarks on an M1 Max with 32 GB unified memory and 5.1 GB per second NVMe read speed show Qwen 2.5 14B running at 21 tokens per second in full-resident mode with zero overhead versus stock llama.cpp. A 31 GB Mixtral 8x7B achieves 2.2 tokens per second in expert-streaming mode where llama.cpp crashes with OOM. A 40 GB Llama 70B runs at 0.3 tokens per second in dense FFN-streaming mode, again where llama.cpp fails entirely.
Hypura also exposes an Ollama-compatible HTTP server, making it a drop-in replacement for any tool that speaks the Ollama protocol including OpenClaw. The server supports text completion, chat completion with NDJSON streaming, model metadata queries, and health checks. Configuration is automatic with no manual tuning of pool buffer sizes, prefetch depth, or memory budgets required.
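NDJSON streaming means the server emits one JSON object per line, and HTTP chunk boundaries rarely align with line boundaries, so a client must buffer partial chunks and emit only complete lines. A std-only sketch of that client-side splitting (the helper is illustrative, not part of Hypura):

```rust
struct NdjsonSplitter {
    buf: String, // holds any trailing partial line between chunks
}

impl NdjsonSplitter {
    fn new() -> Self {
        Self { buf: String::new() }
    }

    /// Feed one chunk from the HTTP body; returns any complete JSON lines.
    fn feed(&mut self, chunk: &str) -> Vec<String> {
        self.buf.push_str(chunk);
        let mut out = Vec::new();
        // Drain every complete line (up to and including its newline).
        while let Some(pos) = self.buf.find('\n') {
            let line: String = self.buf.drain(..=pos).collect();
            let line = line.trim_end().to_string();
            if !line.is_empty() {
                out.push(line);
            }
        }
        out
    }
}

fn main() {
    let mut s = NdjsonSplitter::new();
    // A response object split across two chunks mid-line.
    assert!(s.feed("{\"response\":\"Hel").is_empty());
    let lines = s.feed("lo\"}\n{\"done\":true}\n");
    assert_eq!(lines, vec!["{\"response\":\"Hello\"}", "{\"done\":true}"]);
    println!("parsed {} objects", lines.len());
}
```

Each recovered line can then be handed to any JSON parser; tools that already speak the Ollama protocol do this buffering internally.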
The project is written in Rust and organized as a Cargo workspace with two crates: the main binary and library, and FFI bindings to a vendored llama.cpp built via CMake. It requires Rust 1.75 or newer and CMake to build from source. Hypura performs only read operations on the SSD during inference, generating zero write wear on the storage device.
Pricing
Open Source
Free and open-source under MIT license.
- Full source code access under MIT license
- All inference modes: full-resident, expert-streaming, dense FFN-streaming
- Ollama-compatible HTTP server
- Built-in hardware profiling and benchmarking
Capabilities
Key Features
- Storage-tier-aware tensor placement across GPU, RAM, and NVMe based on access patterns and bandwidth costs
- Automatic hardware profiling of GPU working set, RAM capacity, and NVMe throughput with no manual tuning required
- Expert-streaming mode for MoE models that loads only active expert strides from NVMe with 75% I/O reduction
- Neuron cache with 99.5% hit rate that tracks loaded expert slices across tokens using temporal locality
- Dense FFN-streaming mode for large dense models with dynamically sized pool buffers and scaled prefetch lookahead
- Full-resident mode with zero overhead when models fit entirely in GPU and RAM
- Automatic inference mode selection based on model size, architecture, and available memory
- Ollama-compatible HTTP server for drop-in integration with tools like OpenClaw
- Co-activation tracking that predicts which MoE experts will fire next for speculative prefetch
- Built-in A/B benchmarking harness comparing Hypura scheduling against a naive baseline
- Hardware safety checks that block baseline benchmarks when models exceed RAM minus 4 GB headroom
- Read-only NVMe I/O path that generates zero SSD write wear during inference
