Deep Infra
Cloud inference platform providing low-cost, scalable APIs and infrastructure to run, host, and deploy machine learning models and custom LLMs.
At a Glance
Pricing: Paid
About Deep Infra
Deep Infra provides developer-friendly, pay-as-you-go inference APIs and hosted infrastructure to run a large catalog of machine learning models and custom LLMs at scale. The platform offers OpenAI-compatible endpoints, native DeepInfra APIs, SDKs, and streaming support so teams can migrate or integrate with existing toolchains. Deep Infra also offers dedicated GPU instances and private deployments, with SOC 2 and ISO 27001 security controls and a zero-retention policy for user data.
- OpenAI-compatible API — Use existing OpenAI-style requests and SDKs to call models hosted on Deep Infra with minimal changes.
- Model marketplace (100+ models) — Access text, embedding, image, audio, and multimodal models and choose per-model token or execution pricing.
- Custom LLM hosting — Deploy your own model on dedicated GPUs (A100, H100, H200, B200) and pay for GPU uptime with autoscaling options.
- Token- and usage-based pricing — Separate per-input and per-output token rates for model inference, plus per-minute or per-hour execution billing for GPUs; everything is pay-as-you-go.
- Security & compliance — SOC 2 and ISO 27001 certifications and a stated zero-retention policy for inputs and outputs.
- Integrations & SDKs — Official docs and SDKs (REST, Python, JavaScript), OpenAI-compatible endpoints, and integrations like LangChain and LlamaIndex.
Getting started: create an account on the web dashboard, obtain an API token, and call the OpenAI-compatible or DeepInfra-native endpoints. The docs include Python and JavaScript SDK examples, and dedicated instances or private deployments can be enabled from the dashboard when needed.
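As a minimal sketch of what an OpenAI-style call to such an endpoint might look like, the snippet below builds the request payload without sending it. The base URL, model name, and token are placeholders chosen for illustration, not confirmed Deep Infra values:

```python
import json

# Placeholder values -- the base URL, token, and model name below are
# illustrative assumptions, not confirmed Deep Infra identifiers.
BASE_URL = "https://api.deepinfra.com/v1/openai"
API_TOKEN = "YOUR_DEEPINFRA_TOKEN"

def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat-completion request for an
    OpenAI-compatible endpoint such as Deep Infra's."""
    return {
        "url": f"{BASE_URL}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": user_message}],
            "stream": False,  # set True to use streaming responses
        }),
    }

req = build_chat_request("meta-llama/Meta-Llama-3-8B-Instruct", "Hello!")
```

Because the request shape matches the OpenAI chat-completions schema, existing OpenAI SDKs can usually be pointed at the compatible base URL instead of hand-building requests like this.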

Pricing
Token-based inference
Per-token pricing for model inference; input and output tokens are billed separately and shown per 1M tokens.
- Input tokens billed (example: $0.27 per 1M input tokens)
- Output tokens billed (example: $0.40 per 1M output tokens)
- Access to hosted model catalog and streaming
Dedicated GPU (A100 example)
A100 dedicated GPUs are billed per GPU-hour; other GPU types (H100, H200, B200) have different hourly rates.
- Dedicated GPU instances for custom model hosting
- Billed per GPU-hour with autoscaling options
- Suitable for private deployments and high-throughput inference
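A rough monthly estimate for a dedicated instance is GPU count × hours of uptime × hourly rate. The hourly rate below is a placeholder, since the listing does not state a number:

```python
def monthly_gpu_cost(gpu_count: int, hourly_rate_usd: float,
                     hours: float = 730.0) -> float:
    """Estimate monthly cost for dedicated GPUs billed per GPU-hour.
    730 hours is roughly one month of continuous uptime."""
    return gpu_count * hours * hourly_rate_usd

# Placeholder rate: $2.00/GPU-hour is illustrative, not a Deep Infra price.
estimate = monthly_gpu_cost(2, 2.00)
```

With autoscaling, `hours` becomes the actual scaled GPU-hours consumed rather than full-month uptime, so idle periods reduce the bill accordingly.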
Capabilities
Key Features
- OpenAI-compatible API and native DeepInfra API
- 100+ hosted models across text, embedding, image, audio, and multimodal
- Custom LLM deployment on dedicated GPUs
- Per-token and per-execution billing (pay-as-you-go)
- Streaming responses and SDKs for REST, Python, JavaScript
- SOC 2 and ISO 27001 certified with zero-retention policy