BentoML
AI inference platform for deploying, scaling, and optimizing any ML model in production with full control over infrastructure.
About BentoML
BentoML is an AI inference platform designed for speed and control, enabling teams to deploy any model anywhere with tailored optimization, efficient scaling, and streamlined operations. The platform offers both a managed cloud service (Bento Inference Platform) and an open-source framework for serving AI/ML models and custom inference pipelines in production.
BentoML simplifies inference infrastructure while providing full control over deployments, supporting popular open-source models like Llama, DeepSeek, Flux, and Qwen, as well as custom fine-tuned models across any architecture, framework, or modality.
- Open Model Catalog allows deploying popular open-source models with just a few clicks, including day-one access to newly released models.
- Custom Model Serving provides a unified framework for packaging and deploying models using vLLM, TRT-LLM, JAX, SGLang, PyTorch, and Transformers; see the service sketch after this list.
- Tailored Optimization automatically finds a serving configuration that meets your latency, throughput, or cost targets, with advanced performance tuning and distributed LLM inference across multiple GPUs.
- Smart Scaling features intelligent auto-scaling that adapts to inference-specific metrics and patterns, with blazing-fast cold starts and scale-to-zero capabilities.
- Advanced Serving Patterns support interactive applications, async long-running tasks (illustrated by the task endpoint in the sketch below), large-scale batch inference, and complex workflow orchestration for RAG and compound AI systems.
- Dev Codespace lets you iterate in the cloud as quickly as locally, spinning up cloud GPU runs in seconds.
- LLM Gateway provides a unified interface for all LLM providers with centralized cost control and optimization.
- Full Observability offers comprehensive monitoring including compute and performance tracking, LLM-specific metrics, and system health monitoring.
- Enterprise Features include self-hosting on any cloud or on-premises, SOC 2 Type II and ISO 27001 compliance, HIPAA support, SSO, audit logs, and dedicated support engineering.
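To make the custom-serving and task-endpoint points concrete, here is a minimal sketch of a BentoML service, assuming the library's 1.2+ Python API. The `Summarizer` class and the Transformers summarization pipeline are illustrative choices for this example, not part of the platform description above.

```python
import bentoml

# A minimal sketch of a custom BentoML service, assuming the 1.2+ Python API.
# The Summarizer class and the Transformers summarization pipeline are
# illustrative choices, not prescribed by the platform.
@bentoml.service(
    resources={"gpu": 1},      # ask the platform for one GPU
    traffic={"timeout": 60},   # per-request timeout, in seconds
)
class Summarizer:
    def __init__(self) -> None:
        # The model loads once per worker at startup, not once per request.
        from transformers import pipeline
        self.pipeline = pipeline("summarization")

    @bentoml.api
    def summarize(self, text: str) -> str:
        # Each @bentoml.api method is exposed as an HTTP endpoint.
        return self.pipeline(text)[0]["summary_text"]

    @bentoml.task
    def summarize_batch(self, texts: list[str]) -> list[str]:
        # @bentoml.task marks a long-running endpoint that clients submit
        # and poll, matching the async task pattern described above.
        return [self.pipeline(t)[0]["summary_text"] for t in texts]
```

The same definition serves locally (`bentoml serve`) and deploys unchanged to the managed platform.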
To get started, sign up for the Starter plan with free compute credits to prototype and test deployments. Use the BentoML open-source library to package your models, then deploy to the cloud with automatic scaling and monitoring. For enterprise needs, contact the team for custom SLAs and bring-your-own-cloud options.
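As a concrete starting point, here is a minimal sketch of calling a running service with BentoML's HTTP client. The URL and endpoint name assume the hypothetical `Summarizer` service sketched above, served locally via `bentoml serve`.

```python
import bentoml

# Call the hypothetical Summarizer service sketched above.
# Swap the URL for your deployment's endpoint once it runs in the cloud.
client = bentoml.SyncHTTPClient("http://localhost:3000")

summary = client.summarize(
    text="BentoML is an AI inference platform for deploying, scaling, "
    "and optimizing ML models in production."
)
print(summary)
```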

Pricing
Free Trial (until credits are exhausted)
Full access to the Bento Inference Platform with a one-time free compute credit
- Deploy open-source LLMs
- Deploy custom models with BentoML
- Spin up GPUs and test deployments
Starter
Learn and prototype with no up-front commitment
- Dedicated deployments
- Pay only for the compute you use
- Fast cold start and auto-scaling
- SOC 2 Type II compliant
- Monitoring and logging dashboard
- Community Slack support
Scale
Cost-efficient scaling for growing workloads with committed use discount
- Priority access to H100, H200 and more
- Unlimited seats and deployments
- Dedicated compute pool and cold-start guarantee
- Region selection
- Dedicated Slack channel
Enterprise
Full control and dedicated support in your environment
- Full control in your VPC or on-prem
- Tailored performance research and tuning
- Custom SLAs
- Use existing cloud commitments
- Full control over data and network policies
- Multi-cloud, hybrid compute orchestration
- Audit logs, SSO, compliance evidence kit
- Dedicated support engineering
Capabilities
Key Features
- Open model catalog with one-click deployment
- Custom model serving across any framework
- Automatic performance optimization
- Intelligent auto-scaling with scale-to-zero
- Distributed LLM inference across multiple GPUs
- Dev codespace for cloud iteration
- LLM Gateway for unified API access
- Comprehensive observability and monitoring
- Deployment automation and CI/CD
- Canary, shadow, and A/B testing
- Multi-cloud and hybrid compute orchestration
- Cross-region scaling
- Cold-start acceleration
- Batch inference processing
- SOC 2 Type II compliance
- HIPAA compliance
- SSO and audit logs