IonRouter
High-throughput, low-cost AI inference API powered by IonAttention, supporting LLM, vision, image, video, and audio models with OpenAI-compatible endpoints.
Listed Mar 2026
About IonRouter
IonRouter is a high-performance AI inference platform built by Cumulus Labs, powered by its custom IonAttention engine, which multiplexes models on a single GPU with millisecond swap times. It delivers significantly higher throughput than standard inference providers (over 7,000 tokens/second for Qwen2.5-7B on a single GH200) while keeping costs low with per-token, usage-based pricing. The platform is OpenAI API-compatible, so teams can switch with a one-line code change, and it supports a wide range of model types including language, vision, image, video, and audio.
- IonAttention Engine: A custom inference stack built from the ground up for NVIDIA Grace Hopper Superchips, enabling real-time model multiplexing and adaptive traffic handling for maximum throughput.
- OpenAI-Compatible API: Point any existing OpenAI client (Python, TypeScript, Go, etc.) at api.ionrouter.io/v1 with no other code changes required.
- Broad Model Support: Access frontier and open-source models across language (Qwen3.5, GPT-OSS-120B, Kimi-K2.5), vision (VLMs), image (Flux Schnell), video (Wan2.2), and audio categories.
- Custom Model Deployment: Deploy your own finetunes, custom LoRAs, or any open-source model on dedicated GPU streams with no cold starts and per-second billing.
- Usage-Based Pricing: Pay per million tokens (input and output priced separately) or per GPU-second for video/image generation — no idle costs or seat fees.
- Iondex Leaderboard: Browse and compare available models via the Iondex leaderboard and category explorer to find the best model for your use case.
- Playground: Test any supported model directly in the browser playground before integrating into your application.
- Enterprise Support: Book a call for dedicated GPU streams, custom deployments, and enterprise-grade SLAs tailored to high-volume workloads like robotics, surveillance, and AI video pipelines.
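Because the endpoints follow the OpenAI wire format, a chat completion request can be sketched with nothing but the Python standard library. The API key and model id below are placeholders for illustration, not real credentials or a confirmed model identifier:

```python
import json
import urllib.request

API_URL = "https://api.ionrouter.io/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "qwen3.5",
                       api_key: str = "ION_API_KEY") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at IonRouter."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Sending it is one call:
#   resp = urllib.request.urlopen(build_chat_request("Hello"))
#   print(json.load(resp)["choices"][0]["message"]["content"])
```

Any existing OpenAI SDK works the same way: only the base URL changes, which is the one-line switch the listing describes.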
Pricing
Free Plan Available
Join Discord to receive $5 of free credits to start building on IonRouter.
- $5 free credits
- Access to all models
- OpenAI-compatible API
- Playground access
Pay-as-you-go
Pay per million tokens (input/output priced separately) or per GPU-second for image/video generation. No idle costs.
- Per-token billing (input and output)
- Per-GPU-second billing for video/image
- No idle costs
- Access to all available models
- OpenAI-compatible API
- Playground access
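Both billing modes reduce to simple arithmetic. The rates below are hypothetical placeholders chosen for illustration, not published IonRouter prices:

```python
def token_cost(input_tokens: int, output_tokens: int,
               input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Dollar cost when input and output tokens are priced per million."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

def gpu_second_cost(seconds: float, rate_per_gpu_second: float) -> float:
    """Dollar cost under per-GPU-second billing (image/video generation)."""
    return seconds * rate_per_gpu_second

# Hypothetical rates for illustration only:
#   $0.10 / 1M input tokens, $0.40 / 1M output tokens, $0.002 / GPU-second.
chat_bill = token_cost(2_000_000, 500_000, 0.10, 0.40)   # -> 0.40
video_bill = gpu_second_cost(90, 0.002)                   # -> 0.18
```

Since there are no idle costs, a month's bill is just the sum of these per-request charges.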
Enterprise
Dedicated GPU streams, custom model deployments, and enterprise-grade SLAs for high-volume workloads.
- Dedicated GPU streams
- Custom LoRA and finetune deployment
- No cold starts
- Per-second billing
- Enterprise SLAs
- Tailored for robotics, surveillance, and AI video pipelines
Capabilities
Key Features
- IonAttention custom inference engine
- OpenAI-compatible API
- Multi-model support (language, vision, image, video, audio)
- Custom model and LoRA deployment
- Dedicated GPU streams
- No cold starts
- Per-token and per-GPU-second billing
- Iondex model leaderboard
- Browser playground
- Enterprise deployments
- NVIDIA Grace Hopper Superchip infrastructure
