Meta's family of open-weight large language models, available for download, fine-tuning, and deployment across cloud, on-premises, and edge environments.
At a Glance
Download Llama model weights for self-hosted deployment under the Llama Community License.
Updated May 2026
About Llama
Llama is Meta's family of large language models, released under a bespoke community license that permits broad commercial use, fine-tuning, and redistribution. The models range from lightweight 1B-parameter variants designed for edge and mobile devices to the natively multimodal mixture-of-experts models of the Llama 4 generation: the flagship Llama 4 Maverick and Llama 4 Scout, which offers a 10-million-token context window. Developers can download weights directly from llama.com, access them via the Llama API, or use them through hosting partners including Amazon Web Services, Microsoft Azure, Google Cloud, IBM Watsonx, Oracle Cloud, Snowflake, Databricks, Hugging Face, Groq, Cerebras, and SambaNova.
What It Is
Llama is a series of auto-regressive transformer language models built by Meta's AI research team. The models are intended as a foundation for developers and researchers who want to build, fine-tune, distill, or deploy AI applications without being locked into a proprietary API. They are distributed as downloadable weights, so the inference environment, and with it the privacy of inputs and outputs, stays under the licensee's control. Meta's FAQ states explicitly that Meta cannot access inputs or outputs once models are downloaded.
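To make "auto-regressive" concrete, here is a toy sketch of the decoding loop: each new token is chosen from a distribution conditioned on the tokens so far, then appended and fed back in. The lookup table `TOY_MODEL` and the function `next_token_probs` are hypothetical stand-ins for a real transformer forward pass, and nothing here reflects Meta's actual implementation.

```python
from typing import Dict, List

# Hypothetical stand-in for a trained model: maps the last token to a
# probability distribution over the next token. A real Llama model
# conditions on the whole context through a transformer.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"llama": 0.9, "models": 0.1},
    "llama": {"models": 0.8, "</s>": 0.2},
    "models": {"run": 0.7, "</s>": 0.3},
    "run": {"locally": 0.6, "</s>": 0.4},
    "locally": {"</s>": 1.0},
}


def next_token_probs(context: List[str]) -> Dict[str, float]:
    """Return next-token probabilities given the context so far."""
    return TOY_MODEL[context[-1]]


def greedy_decode(prompt: List[str], max_new_tokens: int = 10) -> List[str]:
    """Generate tokens one at a time, feeding each output back as input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # greedy: pick the most likely token
        if best == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(best)
    return tokens


if __name__ == "__main__":
    print(" ".join(greedy_decode(["<s>"])[1:]))
```

Greedy selection is only one decoding strategy; sampling with a temperature or top-p cutoff uses the same loop and changes only how `best` is chosen.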
Model Families and Capabilities
The current lineup spans two major generations:
- Llama 4 — Natively multimodal models that use early fusion to pre-train jointly on text and vision tokens. Llama 4 Maverick (128-expert MoE) and Llama 4 Scout (16-expert MoE) both support 12 languages for text-to-text tasks; Scout offers a 10M-token context window and Maverick a 1M-token window. Benchmark scores published on the site show Llama 4 Maverick at 80.5 on MMLU Pro, 69.8 on GPQA Diamond, and 94.4 on DocVQA.
- Llama 3 — The previous open-weight generation, covering Llama 3.1 (8B, 70B, 405B), Llama 3.2 (1B and 3B lightweight; 11B and 90B multimodal), and Llama 3.3 (70B multilingual). Llama 3.3 70B is positioned as a high-performance replacement for Llama 3.1 70B.
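The "128-expert MoE" and "16-expert MoE" labels refer to mixture-of-experts layers, in which a router sends each token to a small subset of experts instead of running all of them. The sketch below shows generic top-k gating on scalar inputs. The router scores, the toy experts, and the choice of `k` are illustrative assumptions, not Meta's published routing scheme.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    gates = softmax(router_scores)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top)
    return [(i, gates[i] / norm) for i in top]


def moe_forward(x: float, router_scores: Sequence[float],
                experts: Sequence[Callable[[float], float]], k: int = 1) -> float:
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(weight * experts[i](x) for i, weight in route_top_k(router_scores, k))


if __name__ == "__main__":
    # Hypothetical toy setup: 16 scalar "experts", each scaling its input.
    experts = [lambda x, m=m: x * m for m in range(16)]
    scores = [0.0] * 16
    scores[5] = 4.0  # the router strongly prefers expert 5 for this token
    print(route_top_k(scores, k=1))  # only one of the 16 experts is selected
    print(moe_forward(2.0, scores, experts, k=1))
```

The point of the design is that compute per token scales with `k`, not with the total number of experts, which is how a model can hold many experts while evaluating only a few for each token.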
Deployment and Optimization Path
Llama models run on GPUs, CPUs (x86 and ARM), TPUs, NPUs, and AI accelerators. Smaller models target system-on-chip platforms found in PCs, mobile devices, and other edge hardware. The documentation covers:
- Prompt engineering — Improving LLM performance through natural language techniques
- Fine-tuning — Adapting pre-trained weights to specific use cases; examples are in the Llama Cookbook repository on GitHub
- Quantization — Reducing computational and memory requirements
- Distillation — Teaching a smaller model to match a larger model's performance
- RAG — Reference implementations available in the developer documentation
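As a rough illustration of why quantization reduces memory requirements, the sketch below estimates weight storage at different bit widths and runs a symmetric 8-bit round trip on a few values. The per-tensor scaling scheme and the function names are assumptions chosen for clarity, not a format used by any particular Llama release. The estimate counts weights only and ignores activations, KV cache, and runtime overhead.

```python
from typing import List, Tuple


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"8B parameters at {bits}-bit: {weight_memory_gb(8e9, bits):.0f} GB")

    codes, scale = quantize_int8([0.5, -1.27, 0.01, 1.0])
    restored = dequantize_int8(codes, scale)
    print(codes, [round(v, 3) for v in restored])
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy cost paid for the smaller footprint.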
Licensing and Legal Framework
Llama models are not released under an OSI-approved open-source license. They use a bespoke Llama Community License Agreement that allows broad commercial use and derivative model creation, subject to restrictions that include an Acceptable Use Policy (AUP). Key points from the FAQ: outputs from Llama 3.1 and later can be used to train other AI models with proper attribution; products built on Llama must display "Built with Llama" prominently; and EU-based individuals and companies face additional restrictions on multimodal model usage under the Llama 3.2, 3.3, and 4 AUPs.
Safety Infrastructure
Meta publishes a suite of protection tools under the Llama Protections umbrella, including the Llama Defenders Program, which the site describes as enabling AI defenders to deploy generative AI responsibly. A Developer Use Guide accompanies each model release to help licensees navigate responsible deployment.
Update: Llama 4
The most recent major release is Llama 4, which introduces native multimodality via early fusion, a departure from the frozen, separate multimodal weights used in prior generations. The site describes this as "a step change in intelligence." Llama 4 Scout is designed to run efficiently on a single H100 GPU, while Llama 4 Maverick targets memory, personalization, and multimodal application use cases. The Llama API, which provides hosted access to these models, was available by waitlist only at the time of writing.
Pricing
Model Download
Download Llama model weights for self-hosted deployment under the Llama Community License.
- Access to Llama 4 Maverick and Scout
- Access to Llama 3.1, 3.2, 3.3 model families
- Fine-tuning and distillation permitted
- Commercial use allowed under community license
- Deploy on any hardware (GPU, CPU, TPU, edge)
Llama API
Hosted API access to Llama models with usage-based pricing.
- Hosted inference via Llama API
- Access to latest Llama 4 models
- No infrastructure management required
- Usage-based token pricing
Capabilities
Key Features
- Natively multimodal Llama 4 models with early fusion architecture
- 10-million-token context window (Llama 4 Scout)
- Downloadable model weights for self-hosted deployment
- Llama API for hosted model access
- Fine-tuning support with Llama Cookbook examples
- Quantization for reduced memory and compute requirements
- Distillation tooling to compress larger models
- RAG reference implementations
- Multilingual support (12 languages in Llama 4)
- Prompt engineering guides
- Vision capabilities for image and text reasoning
- Edge-optimized lightweight models (1B, 3B)
- Llama Protections safety toolkit
- Llama Defenders Program for responsible AI deployment
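The RAG and prompt engineering entries in this list can be illustrated with a small retrieval sketch: score passages against a question, then place the best matches into the prompt sent to the model. The bag-of-words cosine scoring, the sample passages, and the prompt template are all illustrative assumptions. Meta's reference implementations will differ, and production systems typically use learned embeddings and a vector store.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k passages most similar to the question."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: cosine(q, tokenize(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: List[str]) -> str:
    """Assemble a grounded prompt; the template wording is hypothetical."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    passages = [
        "Quantization reduces computational and memory requirements.",
        "Distillation teaches a smaller model to match a larger model's performance.",
        "Fine-tuning adapts pre-trained weights to specific use cases.",
    ]
    question = "How does quantization affect memory requirements?"
    print(build_prompt(question, retrieve(question, passages, top_k=1)))
```

The resulting string would then be passed to whichever Llama deployment is in use, whether self-hosted weights or a hosted endpoint; that call is left out here because it depends on the serving stack.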
