# BentoML

> AI inference platform for deploying, scaling, and optimizing any ML model in production with full control over infrastructure.

BentoML is an AI inference platform designed for speed and control, enabling teams to deploy any model anywhere with tailored optimization, efficient scaling, and streamlined operations. The platform offers both a managed cloud service (Bento Inference Platform) and an open-source framework for serving AI/ML models and custom inference pipelines in production. BentoML simplifies inference infrastructure while retaining full control over deployments, supporting popular open-source models such as Llama, DeepSeek, Flux, and Qwen, as well as custom fine-tuned models across any architecture, framework, or modality.

- **Open Model Catalog** allows deploying popular open-source models in a few clicks, including day-one access to newly released models.
- **Custom Model Serving** provides a unified framework for packaging and deploying models built with vLLM, TRT-LLM, JAX, SGLang, PyTorch, and Transformers.
- **Tailored Optimization** automatically finds serving configurations that meet latency, throughput, or cost requirements, with advanced performance tuning and distributed LLM inference across multiple GPUs.
- **Smart Scaling** provides auto-scaling that adapts to inference-specific metrics and traffic patterns, with fast cold starts and scale-to-zero.
- **Advanced Serving Patterns** support interactive applications, async long-running tasks, large-scale batch inference, and complex workflow orchestration for RAG and compound AI systems.
- **Dev Codespace** lets you iterate in the cloud as fast as locally, with cloud GPU runs starting in seconds.
- **LLM Gateway** provides a unified interface to all LLM providers with centralized cost control and optimization.
- **Full Observability** offers comprehensive monitoring, including compute and performance tracking, LLM-specific metrics, and system health.
- **Enterprise Features** include self-hosting on any cloud or on-premises, SOC 2 Type II and ISO 27001 compliance, HIPAA support, SSO, audit logs, and dedicated support engineering.

To get started, sign up for the Starter plan, which includes free compute credits for prototyping and testing deployments. Use the BentoML open-source library to package your models, then deploy to the cloud with automatic scaling and monitoring (see the example at the end of this page). For enterprise needs, contact the team about custom SLAs and bring-your-own-cloud options.

## Features

- Open model catalog with one-click deployment
- Custom model serving across any framework
- Automatic performance optimization
- Intelligent auto-scaling with scale-to-zero
- Distributed LLM inference across multiple GPUs
- Dev codespace for cloud iteration
- LLM Gateway for unified API access
- Comprehensive observability and monitoring
- Deployment automation and CI/CD
- Canary, shadow, and A/B testing
- Multi-cloud and hybrid compute orchestration
- Cross-region scaling
- Cold-start acceleration
- Batch inference processing
- SOC 2 Type II compliance
- HIPAA compliance
- SSO and audit logs

## Integrations

vLLM, TRT-LLM, JAX, SGLang, PyTorch, Transformers, AWS, GCP, Azure, Kubernetes, NVIDIA GPUs, AMD GPUs

## Platforms

WEB, API, DEVELOPER_SDK

## Pricing

Freemium: free tier available with paid upgrades

## Links

- Website: https://bentoml.com
- Documentation: https://docs.bentoml.com/
- Repository: https://github.com/bentoml/BentoML
- EveryDev.ai: https://www.everydev.ai/tools/bentoml
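
## Example

As a sketch of the packaging step described above, here is a minimal BentoML service using the open-source library's `@bentoml.service` / `@bentoml.api` pattern. The model choice, resource settings, and file name are illustrative assumptions, not platform recommendations.

```python
# service.py -- minimal sketch of a BentoML service
# (assumes BentoML >= 1.2 and the transformers package are installed)
import bentoml


@bentoml.service(
    resources={"cpu": "2"},    # illustrative resource request
    traffic={"timeout": 60},   # illustrative request timeout in seconds
)
class Summarizer:
    def __init__(self) -> None:
        # Load the model once per worker at startup; distilbart is an
        # arbitrary small summarization model chosen for this example.
        from transformers import pipeline

        self.pipeline = pipeline(
            "summarization", model="sshleifer/distilbart-cnn-12-6"
        )

    @bentoml.api
    def summarize(self, text: str) -> str:
        # Each @bentoml.api method is exposed as an HTTP endpoint.
        result = self.pipeline(text)
        return result[0]["summary_text"]
```

Running `bentoml serve service:Summarizer` serves this locally (on http://localhost:3000 by default), and `bentoml deploy` pushes the same service to the managed cloud; exact CLI invocations may vary by version, so check the documentation linked above. A served endpoint can then be called from Python:

```python
import bentoml

# Call the locally served endpoint; client methods mirror the @bentoml.api names.
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    print(client.summarize(text="BentoML is an AI inference platform ..."))
```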