Together AI
A full-stack AI cloud platform offering serverless and dedicated inference, GPU clusters, fine-tuning, and model evaluations powered by cutting-edge systems research.
At a Glance
Full-stack AI cloud for serverless and dedicated inference, GPU clusters, fine-tuning, and evaluations. A free Build tier is available for developers getting started with the APIs, with community support via Discord.
Updated Apr 2026
About Together AI
Together AI is a full-stack AI Native Cloud platform designed to accelerate every stage of the AI development lifecycle — from experimentation to large-scale production. It combines high-performance inference APIs, GPU compute clusters, fine-tuning tools, and developer environments, all backed by original systems research including FlashAttention, ThunderKittens, and ATLAS. The platform targets AI-native teams that need speed, cost efficiency, and control without managing complex infrastructure.
- Serverless Inference — Run open-source models on demand via API with no infrastructure to manage; supports chat, vision, image, audio, video, transcription, embeddings, reranking, and moderation (a minimal API sketch follows this list).
- Batch Inference — Process massive asynchronous workloads at up to 50% lower cost; scales to 30 billion tokens per model.
- Dedicated Model Inference — Deploy models on single-tenant GPU instances (H100, H200, B200) with guaranteed performance, autoscaling, and custom model support.
- Dedicated Container Inference — GPU infrastructure purpose-built for generative media workloads including video, audio, and image models.
- GPU Clusters — Self-service NVIDIA GPU clusters (H100, H200, B200, GB200, GB300) available on-demand hourly or reserved for longer durations.
- Fine-Tuning — Train open-source models using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) with LoRA or full fine-tuning; supports models as large as 100B+ parameters.
- Evaluations — Measure and compare model quality to guide model selection and fine-tuning decisions.
- Sandbox — Fast, secure code sandboxes for building full-scale development environments for AI apps and agents.
- Managed Storage — High-performance object storage and parallel filesystems optimized for AI workloads with zero egress fees.
- Model Library — Access a curated library of top open-source models from Meta, DeepSeek, Qwen, Google, Mistral, and more.
- Research-Backed Performance — Platform improvements driven by published research (FlashAttention-4, ATLAS, ThunderKittens) delivering up to 2× faster inference and 60% lower cost.
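The serverless API is OpenAI-compatible and is typically called through the official together Python SDK. The snippet below is a minimal sketch, assuming a current SDK version; the model name is illustrative and can be swapped for any chat model in the Model Library.

```python
# Minimal sketch of a serverless chat completion with the Together Python SDK
# (pip install together). The model id is illustrative, not prescriptive.
import os

from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model id
    messages=[{"role": "user", "content": "Summarize FlashAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, existing OpenAI client code can usually be repointed at Together's base URL with only a key and model change.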
Pricing
Build
Free tier for developers getting started with Together AI APIs. Community support via Discord.
- Access to serverless inference APIs
- Model library access
- Playground access
- Community support via Discord
Serverless Inference
Pay-as-you-go serverless inference for chat, vision, image, audio, video, embeddings, and more; an embeddings call is sketched after the list below.
- Chat models (from $0.02/1M tokens)
- Vision models
- Image generation models
- Audio/TTS models
- Video generation models
- Speech transcription
- Embeddings
- Reranking
- Content moderation
- Batch Inference API at up to 50% lower cost
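As an example of the pay-as-you-go modalities above, the sketch below requests embeddings for a small batch of documents through the same SDK; the embedding model id is an assumption and should be replaced with one listed in the Model Library.

```python
# Hedged sketch of a serverless embeddings request; the model id is an
# assumption, and billing is per token as with the other serverless endpoints.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

docs = [
    "Together AI offers serverless inference.",
    "GPU clusters are billed hourly.",
]
resp = client.embeddings.create(
    model="togethercomputer/m2-bert-80M-8k-retrieval",  # illustrative embedding model
    input=docs,
)
vectors = [item.embedding for item in resp.data]
print(len(vectors), len(vectors[0]))  # document count, embedding dimension
```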
Dedicated Model Inference
Single-tenant GPU instances for guaranteed performance with custom model support.
- Guaranteed performance (no sharing)
- Support for custom models
- Autoscaling & traffic spike handling
- 1x H100 80GB from $3.99/hr
- 1x H200 141GB from $5.49/hr
- 1x B200 180GB from $9.95/hr
GPU Clusters (On-Demand)
Self-service NVIDIA GPU clusters billed hourly with no long-term commitment.
- NVIDIA HGX H100 from $3.49/hr
- NVIDIA HGX H200 from $4.19/hr
- NVIDIA HGX B200 from $7.49/hr
- No long-term commitment
- Together Kernel Collection optimization
GPU Clusters (Reserved)
Reserved GPU capacity for 6+ days with discounted rates.
- NVIDIA HGX H100 from $2.55/hr (4-6 months)
- NVIDIA HGX H200 from $2.89/hr (4-6 months)
- NVIDIA HGX B200 from $6.39/hr (4-6 months)
- GB200 NVL72 and GB300 NVL72 available (contact sales)
- Minimum 6-day reservation
Fine-Tuning
Train open-source models with SFT or DPO using LoRA or full fine-tuning, priced per 1M tokens; launching a job is sketched after the list below.
- Supervised Fine-Tuning (LoRA and Full)
- Direct Preference Optimization (LoRA and Full)
- Models up to 100B parameters
- Specialized pricing for DeepSeek, Llama 4, Qwen3, and more
- LoRA from $0.48/1M tokens (up to 16B models)
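A fine-tuning run is started by uploading a JSONL dataset and creating a job. The sketch below assumes the Together Python SDK exposes files.upload and fine_tuning.create with the parameter names shown; treat those names, and the base model id, as assumptions to verify against the current docs.

```python
# Hedged sketch of launching a LoRA SFT job. Method and parameter names
# (files.upload, fine_tuning.create, lora, n_epochs) are assumptions based on
# the SDK's documented surface; the base model id is illustrative.
from together import Together

client = Together()

# Upload a JSONL file of chat-formatted training examples.
train_file = client.files.upload(file="train.jsonl")

job = client.fine_tuning.create(
    training_file=train_file.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # illustrative base model
    lora=True,   # LoRA adapter training; full fine-tuning is also offered
    n_epochs=3,
)
print(job.id, job.status)
```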
Enterprise
Custom enterprise plan with dedicated support, SLAs, and tailored pricing.
- Custom pricing and plan
- Silver or Gold support included
- Slack communication channel
- Priority queueing (Gold)
- Technical Account Manager (Gold)
- 20 hours of training/services (Gold, annual commitment)
- Enterprise trial available
Capabilities
Key Features
- Serverless Inference API
- Batch Inference API
- Dedicated Model Inference
- Dedicated Container Inference
- GPU Clusters (H100, H200, B200, GB200, GB300)
- Fine-Tuning (SFT and DPO, LoRA and Full)
- Model Evaluations
- Code Sandbox
- Managed Storage
- Model Library with 100+ open-source models
- Voice Agent support
- Playground and Together Chat
- FlashAttention-powered inference
- ATLAS runtime-learning accelerators
- Together Kernel Collection
