Hypura
Storage-tier-aware LLM inference scheduler for Apple Silicon that runs models too big for your Mac's memory across GPU, RAM, and NVMe.
At a Glance
Pricing
Free and open-source under MIT license.
Listed Mar 2026
About Hypura
Hypura is a storage-tier-aware LLM inference scheduler built for Apple Silicon Macs. It solves a common problem for developers and researchers working with large language models on consumer hardware: models that exceed available memory cause swap-thrashing and out-of-memory crashes under standard inference tools like llama.cpp. Hypura addresses this by intelligently placing model tensors across three storage tiers — GPU (Metal), RAM, and NVMe — based on access patterns, bandwidth costs, and hardware capabilities.
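The tier-placement idea can be sketched as a greedy assignment: sort tensors by how "hot" they are (accesses per byte) and fill the fastest tier first, spilling down to RAM and then NVMe. This is a minimal illustrative sketch; the type names and heuristic are assumptions, not Hypura's actual API or algorithm.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Gpu,  // Metal working set: fastest, smallest
    Ram,  // unified memory outside the GPU budget
    Nvme, // streamed on demand: highest latency, effectively unlimited
}

struct Tensor {
    name: &'static str,
    bytes: u64,
    accesses_per_token: u32, // norms and embeddings touch every token
}

/// Greedy placement: hottest tensors first, spilling down the tier list.
fn place(
    tensors: &mut [Tensor],
    mut gpu_free: u64,
    mut ram_free: u64,
) -> Vec<(&'static str, Tier)> {
    // Sort by accesses per byte so small, hot tensors win GPU slots.
    tensors.sort_by(|a, b| {
        let ha = a.accesses_per_token as f64 / a.bytes as f64;
        let hb = b.accesses_per_token as f64 / b.bytes as f64;
        hb.partial_cmp(&ha).unwrap()
    });
    tensors
        .iter()
        .map(|t| {
            if t.bytes <= gpu_free {
                gpu_free -= t.bytes;
                (t.name, Tier::Gpu)
            } else if t.bytes <= ram_free {
                ram_free -= t.bytes;
                (t.name, Tier::Ram)
            } else {
                (t.name, Tier::Nvme) // too big to pin: stream from disk
            }
        })
        .collect()
}

fn main() {
    let mut tensors = vec![
        Tensor { name: "output_norm", bytes: 16 << 10, accesses_per_token: 1 },
        Tensor { name: "ffn_down.0", bytes: 512 << 20, accesses_per_token: 1 },
        Tensor { name: "tok_embd", bytes: 64 << 20, accesses_per_token: 1 },
    ];
    // Tiny, hot norm and the embedding table land on GPU; the big FFN
    // weight overflows both budgets and is marked for NVMe streaming.
    let placement = place(&mut tensors, 128 << 20, 256 << 20);
    println!("{placement:?}");
}
```

A real scheduler would weigh per-tier bandwidth costs rather than just capacity, but the spill order is the core of the idea.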
The scheduler reads a GGUF model file, profiles the host hardware (GPU working set size, RAM capacity, and NVMe sequential read bandwidth), then solves a placement optimization problem that assigns every tensor to the appropriate tier. Norms and embeddings, which are small but accessed on every token, are pinned to the GPU. For Mixture-of-Experts architectures such as Mixtral, router interception identifies which experts are selected per token and loads only the needed expert strides from NVMe, a 75 percent I/O reduction. A neuron cache tracks loaded expert slices across tokens, reaching a 99.5 percent hit rate thanks to temporal locality. Dense FFN weights stream from NVMe through a dynamically sized pool buffer while attention and norms stay GPU-resident.
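The 75 percent figure follows directly from the routing arithmetic: Mixtral routes each token to 2 of 8 experts per layer, so only 2/8 of the expert bytes must be read. A small sketch of that arithmetic, under the illustrative assumption of contiguous, equally sized expert strides (the function names are hypothetical, not Hypura's):

```rust
/// Fraction of expert I/O avoided by loading only router-selected experts.
fn io_reduction_pct(n_experts: u32, top_k: u32) -> f64 {
    100.0 * (1.0 - top_k as f64 / n_experts as f64)
}

/// Byte offset of one expert's stride inside a layer's flat expert region,
/// assuming contiguous, equally sized strides.
fn expert_stride_offset(region_base: u64, stride_bytes: u64, expert_idx: u64) -> u64 {
    region_base + expert_idx * stride_bytes
}

fn main() {
    // Mixtral 8x7B: top-2 routing over 8 experts per layer.
    println!("I/O reduction: {}%", io_reduction_pct(8, 2)); // 75%
    println!("expert 3 offset: {} bytes", expert_stride_offset(0, 256 << 20, 3));
}
```

The neuron cache then amortizes even those reads: when consecutive tokens select overlapping experts, the already-loaded strides are reused instead of re-read.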
Hypura selects between three inference modes automatically. Full-resident mode runs when the model fits entirely in GPU and RAM with no NVMe I/O and zero overhead. Expert-streaming mode handles MoE models by keeping only non-expert tensors on GPU and streaming expert weights on demand. Dense FFN-streaming mode extends this approach to non-MoE models like Llama 70B by keeping attention and norms on GPU while streaming FFN tensors from NVMe with scaled prefetch lookahead.
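The three-way mode choice described above reduces to a simple decision: if the model fits in GPU plus RAM, run fully resident; otherwise stream experts for MoE models or FFN weights for dense ones. A minimal sketch, with hypothetical names and illustrative memory budgets (not Hypura's actual thresholds):

```rust
#[derive(Debug, PartialEq)]
enum Mode {
    FullResident,      // whole model fits in GPU + RAM: no NVMe I/O
    ExpertStreaming,   // MoE: stream only router-selected expert strides
    DenseFfnStreaming, // dense: stream FFN weights, keep attention resident
}

fn select_mode(model_bytes: u64, is_moe: bool, gpu_budget: u64, ram_budget: u64) -> Mode {
    if model_bytes <= gpu_budget + ram_budget {
        Mode::FullResident
    } else if is_moe {
        Mode::ExpertStreaming
    } else {
        Mode::DenseFfnStreaming
    }
}

fn main() {
    const GIB: u64 = 1 << 30;
    // Illustrative budgets loosely mirroring a 32 GB machine.
    let (gpu, ram) = (21 * GIB, 8 * GIB);
    assert_eq!(select_mode(9 * GIB, false, gpu, ram), Mode::FullResident);
    assert_eq!(select_mode(31 * GIB, true, gpu, ram), Mode::ExpertStreaming);
    assert_eq!(select_mode(40 * GIB, false, gpu, ram), Mode::DenseFfnStreaming);
    println!("mode selection sketch ok");
}
```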
Benchmarks on an M1 Max with 32 GB unified memory and 5.1 GB per second NVMe read speed show Qwen 2.5 14B running at 21 tokens per second in full-resident mode with zero overhead versus stock llama.cpp. A 31 GB Mixtral 8x7B achieves 2.2 tokens per second in expert-streaming mode where llama.cpp crashes with OOM. A 40 GB Llama 70B runs at 0.3 tokens per second in dense FFN-streaming mode, again where llama.cpp fails entirely.
Hypura also exposes an Ollama-compatible HTTP server, making it a drop-in replacement for any tool that speaks the Ollama protocol including OpenClaw. The server supports text completion, chat completion with NDJSON streaming, model metadata queries, and health checks. Configuration is automatic with no manual tuning of pool buffer sizes, prefetch depth, or memory budgets required.
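NDJSON streaming means the server emits one JSON object per line, and HTTP chunk boundaries rarely align with line boundaries, so a client must buffer partial chunks and emit only complete lines. A std-only sketch of that client-side splitting (the helper is illustrative, not part of Hypura):

```rust
struct NdjsonSplitter {
    buf: String, // holds any trailing partial line between chunks
}

impl NdjsonSplitter {
    fn new() -> Self {
        Self { buf: String::new() }
    }

    /// Feed one chunk from the HTTP body; returns any complete JSON lines.
    fn feed(&mut self, chunk: &str) -> Vec<String> {
        self.buf.push_str(chunk);
        let mut out = Vec::new();
        // Drain every complete line (up to and including its newline).
        while let Some(pos) = self.buf.find('\n') {
            let line: String = self.buf.drain(..=pos).collect();
            let line = line.trim_end().to_string();
            if !line.is_empty() {
                out.push(line);
            }
        }
        out
    }
}

fn main() {
    let mut s = NdjsonSplitter::new();
    // A response object split across two chunks mid-line.
    assert!(s.feed("{\"response\":\"Hel").is_empty());
    let lines = s.feed("lo\"}\n{\"done\":true}\n");
    assert_eq!(lines, vec!["{\"response\":\"Hello\"}", "{\"done\":true}"]);
    println!("parsed {} objects", lines.len());
}
```

Each recovered line can then be handed to any JSON parser; tools that already speak the Ollama protocol do this buffering internally.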
The project is written in Rust and organized as a Cargo workspace with two crates: the main binary and library, and FFI bindings to a vendored llama.cpp built via CMake. It requires Rust 1.75 or newer and CMake to build from source. Hypura performs only read operations on the SSD during inference, generating zero write wear on the storage device.
Pricing
Open Source
Free and open-source under MIT license.
- Full source code access under MIT license
- All inference modes: full-resident, expert-streaming, dense FFN-streaming
- Ollama-compatible HTTP server
- Built-in hardware profiling and benchmarking
Capabilities
Key Features
- Storage-tier-aware tensor placement across GPU, RAM, and NVMe based on access patterns and bandwidth costs
- Automatic hardware profiling of GPU working set, RAM capacity, and NVMe throughput with no manual tuning required
- Expert-streaming mode for MoE models that loads only active expert strides from NVMe with 75% I/O reduction
- Neuron cache with 99.5% hit rate that tracks loaded expert slices across tokens using temporal locality
- Dense FFN-streaming mode for large dense models with dynamically sized pool buffers and scaled prefetch lookahead
- Full-resident mode with zero overhead when models fit entirely in GPU and RAM
- Automatic inference mode selection based on model size, architecture, and available memory
- Ollama-compatible HTTP server for drop-in integration with tools like OpenClaw
- Co-activation tracking that predicts which MoE experts will fire next for speculative prefetch
- Built-in A/B benchmarking harness comparing Hypura scheduling against a naive baseline
- Hardware safety checks that block baseline benchmarks when models exceed RAM minus 4 GB headroom
- Read-only NVMe I/O path that generates zero SSD write wear during inference
