    Hypura

    Local Inference

    Storage-tier-aware LLM inference scheduler for Apple Silicon that runs models too big for your Mac's memory across GPU, RAM, and NVMe.

    At a Glance

    Pricing

    Open Source

    Free and open-source under the MIT license.

    Available On

    macOS
    CLI

    Resources

    Website · GitHub · llms.txt

    Topics

    Local Inference · AI Infrastructure · Compute Optimization

    Alternatives

    AI Backends · Chutes AI · Synthetic

    Developer

    t8
    Developer exploring LLM-assisted software creation, focused on underutilized NVMe-backed inference for consumer hardware.

    Listed Mar 2026

    About Hypura

    Hypura is a storage-tier-aware LLM inference scheduler built for Apple Silicon Macs. It solves a common problem for developers and researchers working with large language models on consumer hardware: models that exceed available memory cause swap-thrashing and out-of-memory crashes under standard inference tools like llama.cpp. Hypura addresses this by intelligently placing model tensors across three storage tiers — GPU (Metal), RAM, and NVMe — based on access patterns, bandwidth costs, and hardware capabilities.

    The scheduler reads a GGUF model file, profiles the host hardware (GPU working set size, RAM capacity, and NVMe sequential read bandwidth), and then solves a placement optimization that assigns every tensor to the appropriate tier. Norms and embeddings, which are small but accessed on every token, are pinned to the GPU. For Mixture-of-Experts architectures like Mixtral, router interception identifies which experts each token selects and loads only the needed expert strides from NVMe, cutting I/O by 75 percent. A neuron cache tracks loaded expert slices across tokens, reaching a 99.5 percent hit rate thanks to temporal locality. Dense FFN weights stream from NVMe through a dynamically sized pool buffer while attention and norms stay GPU-resident.
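
    To make the placement step concrete, here is a minimal greedy sketch in Rust. The types, the heat-density heuristic, and every name here are assumptions for illustration; Hypura's real scheduler solves a fuller optimization over access patterns and bandwidth costs.

```rust
// Illustrative greedy tier placement; types and heuristic are
// assumptions for this sketch, not Hypura's real scheduler.

#[derive(Debug, Clone, Copy)]
enum Tier {
    Gpu,  // Metal working set: fastest, most scarce
    Ram,  // unified memory outside the GPU budget
    Nvme, // streamed from disk on demand
}

struct Tensor {
    name: String,
    bytes: u64,
    accesses_per_token: f64,
}

/// Hottest bytes go to the fastest tier with room; the rest spill down.
fn place(tensors: &mut [Tensor], gpu_budget: u64, ram_budget: u64) -> Vec<(String, Tier)> {
    // Sort by access density (accesses per byte), descending, so small
    // every-token tensors like norms and embeddings land on the GPU.
    tensors.sort_by(|a, b| {
        let da = a.accesses_per_token / a.bytes as f64;
        let db = b.accesses_per_token / b.bytes as f64;
        db.partial_cmp(&da).unwrap()
    });

    let (mut gpu, mut ram) = (0u64, 0u64);
    tensors
        .iter()
        .map(|t| {
            let tier = if gpu + t.bytes <= gpu_budget {
                gpu += t.bytes;
                Tier::Gpu
            } else if ram + t.bytes <= ram_budget {
                ram += t.bytes;
                Tier::Ram
            } else {
                Tier::Nvme
            };
            (t.name.clone(), tier)
        })
        .collect()
}

fn main() {
    let mut ts = vec![
        Tensor { name: "norms".into(), bytes: 64 << 20, accesses_per_token: 1.0 },
        Tensor { name: "ffn.0".into(), bytes: 2 << 30, accesses_per_token: 1.0 },
    ];
    for (name, tier) in place(&mut ts, 24 << 30, 6 << 30) {
        println!("{name} -> {tier:?}");
    }
}
```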

    Hypura selects among three inference modes automatically. Full-resident mode runs when the model fits entirely in GPU and RAM, incurring no NVMe I/O and zero overhead. Expert-streaming mode handles MoE models by keeping only non-expert tensors on GPU and streaming expert weights on demand. Dense FFN-streaming mode extends this approach to non-MoE models like Llama 70B by keeping attention and norms on GPU while streaming FFN tensors from NVMe with scaled prefetch lookahead.
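
    A rough sketch of that selection logic, assuming (this is our simplification, not the project's documented rule) that the decision reduces to comparing model size against the combined GPU and RAM budgets:

```rust
// Hypothetical mode selection mirroring the three modes above;
// the decision rule is simplified for illustration.

enum InferenceMode {
    FullResident,      // fits in GPU + RAM: no NVMe I/O, zero overhead
    ExpertStreaming,   // MoE: stream only active experts from NVMe
    DenseFfnStreaming, // dense: keep attention/norms on GPU, stream FFN
}

fn select_mode(model_bytes: u64, is_moe: bool, gpu_budget: u64, ram_budget: u64) -> InferenceMode {
    if model_bytes <= gpu_budget + ram_budget {
        InferenceMode::FullResident
    } else if is_moe {
        InferenceMode::ExpertStreaming
    } else {
        InferenceMode::DenseFfnStreaming
    }
}
```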

    Benchmarks on an M1 Max with 32 GB unified memory and 5.1 GB per second NVMe read speed show Qwen 2.5 14B running at 21 tokens per second in full-resident mode with zero overhead versus stock llama.cpp. A 31 GB Mixtral 8x7B achieves 2.2 tokens per second in expert-streaming mode where llama.cpp crashes with OOM. A 40 GB Llama 70B runs at 0.3 tokens per second in dense FFN-streaming mode, again where llama.cpp fails entirely.

    Hypura also exposes an Ollama-compatible HTTP server, making it a drop-in replacement for any tool that speaks the Ollama protocol, including OpenClaw. The server supports text completion, chat completion with NDJSON streaming, model metadata queries, and health checks. Configuration is automatic: no manual tuning of pool buffer sizes, prefetch depth, or memory budgets is required.
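
    Because the protocol is Ollama's, a client can talk to Hypura the same way it would talk to Ollama. The sketch below is std-only Rust and assumes the server listens on Ollama's default port 11434; the model name is a placeholder. It posts one chat request and dumps the raw response, headers first, then the NDJSON stream.

```rust
// std-only client for an Ollama-compatible /api/chat endpoint.
// Port and model name are assumptions for this example.

use std::io::{BufRead, BufReader, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    let body = r#"{"model":"mixtral","messages":[{"role":"user","content":"Hello"}]}"#;
    let mut stream = TcpStream::connect("127.0.0.1:11434")?;
    write!(
        stream,
        "POST /api/chat HTTP/1.1\r\nHost: localhost\r\n\
         Content-Type: application/json\r\nContent-Length: {}\r\n\
         Connection: close\r\n\r\n{}",
        body.len(),
        body
    )?;

    // Dump the raw response: status line and headers first, then the
    // NDJSON stream (one JSON object per chunk) until the server closes.
    for line in BufReader::new(stream).lines() {
        println!("{}", line?);
    }
    Ok(())
}
```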

    The project is written in Rust and organized as a Cargo workspace with two crates: the main binary and library, and FFI bindings to a vendored llama.cpp built via CMake. Building from source requires Rust 1.75 or newer and CMake. During inference Hypura performs only read operations on the SSD, so it generates zero write wear on the storage device.
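
    The read-only claim is easy to picture: expert strides and FFN weights are fetched with plain reads against the GGUF file. A minimal sketch, with a hypothetical function name and caller-supplied offsets:

```rust
// Sketch of a read-only NVMe fetch: open the GGUF read-only and read
// one tensor stride at a known offset. Offsets here are assumptions.

use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

fn read_stride(path: &str, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let mut f = File::open(path)?; // read-only handle: no SSD write wear
    f.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    f.read_exact(&mut buf)?;
    Ok(buf)
}
```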

    Pricing

    Open Source

    Free and open-source under the MIT license.

    • Full source code access under the MIT license
    • All inference modes: full-resident, expert-streaming, dense FFN-streaming
    • Ollama-compatible HTTP server
    • Built-in hardware profiling and benchmarking

    Capabilities

    Key Features

    • Storage-tier-aware tensor placement across GPU, RAM, and NVMe based on access patterns and bandwidth costs
    • Automatic hardware profiling of GPU working set, RAM capacity, and NVMe throughput with no manual tuning required
    • Expert-streaming mode for MoE models that loads only active expert strides from NVMe with 75% I/O reduction
    • Neuron cache with 99.5% hit rate that tracks loaded expert slices across tokens using temporal locality (a toy version is sketched after this list)
    • Dense FFN-streaming mode for large dense models with dynamically-sized pool buffers and scaled prefetch lookahead
    • Full-resident mode with zero overhead when models fit entirely in GPU and RAM
    • Automatic inference mode selection based on model size, architecture, and available memory
    • Ollama-compatible HTTP server for drop-in integration with tools like OpenClaw
    • Co-activation tracking that predicts which MoE experts will fire next for speculative prefetch
    • Built-in A/B benchmarking harness comparing Hypura scheduling against naive baseline
    • Hardware safety checks that block baseline benchmarks when models exceed RAM minus 4 GB headroom
    • Read-only NVMe I/O path that generates zero SSD write wear during inference
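
    As promised above, a toy version of the neuron cache: a map keyed by (layer, expert) that counts hits and misses across tokens. The structure and names are assumptions; the real cache also handles eviction and, per the co-activation bullet, speculative prefetch.

```rust
// Toy expert-slice cache illustrating temporal locality; all names
// and structure are assumptions, not Hypura's implementation.

use std::collections::HashMap;

struct NeuronCache {
    slices: HashMap<(u32, u32), Vec<u8>>, // (layer, expert) -> weight bytes
    hits: u64,
    misses: u64,
}

impl NeuronCache {
    fn new() -> Self {
        Self { slices: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Return the cached slice, or load it with `fetch` (e.g. an NVMe read).
    fn get_or_load(&mut self, key: (u32, u32), fetch: impl FnOnce() -> Vec<u8>) -> &[u8] {
        if self.slices.contains_key(&key) {
            self.hits += 1; // same expert fired on a recent token
        } else {
            self.misses += 1;
            self.slices.insert(key, fetch());
        }
        &self.slices[&key]
    }

    /// Fraction of lookups served without touching NVMe.
    fn hit_rate(&self) -> f64 {
        self.hits as f64 / (self.hits + self.misses).max(1) as f64
    }
}
```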

    Integrations

    llama.cpp
    Ollama
    OpenClaw
    GGUF
    Metal

    Developer

    t8

    Developer exploring LLM-assisted software creation, focused on underutilized NVMe-backed inference for consumer hardware.

    Website · GitHub
    1 tool in directory

    Similar Tools

    AI Backends

    Self-hosted open-source AI API server that exposes unified REST endpoints and supports multiple LLM providers for integration into applications.

    Chutes AI

    Serverless GPU inference platform for deploying and running AI models with pay-per-use pricing.

    Synthetic

    AI platform providing access to multiple LLMs with subscription or usage-based pricing, offering both UI and API access.

    Related Topics

    Local Inference

    Tools and platforms for running AI inference locally without cloud dependence.

    57 tools

    AI Infrastructure

    Infrastructure designed for deploying and running AI models.

    167 tools

    Compute Optimization

    Tools for optimizing computational resources and performance.

    14 tools