# SGLang

> Fast serving framework for large language models and vision language models with efficient inference and structured generation.

SGLang is a fast serving framework for large language models (LLMs) and vision language models (VLMs). It provides efficient inference with a focus on structured generation and high-performance serving, and is built to handle complex AI workloads with optimized throughput and latency, making it suitable for production deployments.

## Key Features

- **High-Performance Inference**: Fast and efficient inference for both large language models and vision language models, optimized for throughput and latency in production environments.
- **Structured Generation**: Constrains model outputs to specific formats such as JSON schemas, regular expressions, and other structured patterns.
- **RadixAttention**: An attention mechanism that enables efficient KV cache reuse across multiple requests sharing a common prefix, significantly improving serving efficiency.
- **Flexible Backend Support**: Works with various model architectures and supports multiple hardware backends for deployment flexibility.
- **OpenAI-Compatible API**: Exposes an API interface compatible with OpenAI's format, making it easy to integrate into existing applications and workflows.
- **Python Frontend**: A Pythonic interface for defining complex generation patterns and workflows, letting developers express sophisticated prompting strategies programmatically.

## Getting Started

To get started with SGLang, install it via pip and launch the server with your chosen model. The framework supports popular open-source models and can be configured for various deployment scenarios. Documentation and examples in the GitHub repository help developers quickly integrate SGLang into their AI infrastructure.
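As a quickstart sketch, installation and server launch typically look like the following; the `[all]` extra, the model path, and the port are illustrative assumptions, so check the official documentation for your setup:

```shell
# Install SGLang (the "[all]" extra pulls in optional serving dependencies; assumed name)
pip install "sglang[all]"

# Launch an OpenAI-compatible server for a chosen model
# (model path and port are illustrative assumptions)
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
```

Once running, the server accepts requests at an OpenAI-style endpoint, so existing OpenAI client code can be pointed at it by changing the base URL.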
## Features

- High-performance LLM and VLM inference
- Structured generation with JSON and regex constraints
- RadixAttention for KV cache reuse
- OpenAI-compatible API
- Python frontend for complex generation patterns
- Multi-model support
- Efficient batch processing
- Continuous batching
- Tensor parallelism support

## Integrations

OpenAI API, Hugging Face Models, PyTorch

## Platforms

Linux, API, Developer SDK

## Pricing

Open Source

## Links

- Website: https://www.sglang.io
- Documentation: https://sgl-project.github.io/
- Repository: https://github.com/sgl-project/sglang
- EveryDev.ai: https://www.everydev.ai/tools/sglang
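To illustrate how structured generation combines with the OpenAI-compatible API, the sketch below builds a chat-completions request payload with a regex constraint. The `regex` request field and the model name are assumptions about SGLang's constrained-decoding interface, not a verified API; consult the documentation for the exact parameter names.

```python
import json
import re

# Sketch of a request body for an SGLang server's OpenAI-compatible
# chat-completions endpoint, constraining the output with a regular
# expression. The "regex" field and model name are assumptions.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model name
    "messages": [
        {"role": "user", "content": "Give me an RGB hex color code."}
    ],
    "max_tokens": 16,
    "regex": r"#[0-9a-fA-F]{6}",  # constrain output to a hex color like #1a2b3c
}

# This JSON body would be POSTed to the server's /v1/chat/completions route.
body = json.dumps(payload)

# Offline sanity check: the constraint itself accepts a valid hex color.
assert re.fullmatch(payload["regex"], "#1a2b3c")
```

With a constraint like this, the server restricts decoding so every sampled token keeps the output matching the pattern, rather than validating the text after generation.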