Perception Models

Name: Perception Models
Availability: OnlineOnly
Author: Meta AI

Meta FAIR's open-source repository of state-of-the-art vision, video, and audio encoders (Perception Encoder) and a multimodal language model (PerceptionLM) for image, video, and audio understanding.

At a Glance

Pricing

Open Source

Fully open-source under Apache 2.0. Free to use, modify, and distribute.

Available On

CLI

API

SDK

Meta AI · Menlo Park · Est. 2004 · $2.3B raised

Listed Jun 2026

About Perception Models

Perception Models is a research repository from Meta FAIR (Fundamental AI Research) that houses two flagship model families: the Perception Encoder (PE) for image, video, and audio encoding, and the Perception Language Model (PLM) for vision-language decoding. Released in April 2025 under the Apache 2.0 license, the repository provides open weights, training code, datasets, and evaluation tooling. It has since accumulated over 2,300 GitHub stars.

What It Is

Perception Models is an open-source AI research toolkit targeting multimodal perception — the ability of models to understand images, video, and audio together. It is organized around two core components:

Perception Encoder (PE): A family of CLIP-style vision and audio encoders trained with contrastive pretraining. PE comes in four checkpoint types — PE core (zero-shot classification and retrieval), PE lang (LLM-aligned for multimodal LLMs), PE spatial (dense prediction tasks like detection, depth, and tracking), and PE audio-visual (joint audio-video-text embedding space).
Perception Language Model (PLM): A family of open, fully reproducible vision-language models in 1B, 3B, and 8B parameter sizes, built on Llama-3.x base LLMs and powered by PE lang encoders.
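
The CLIP-style setup behind PE core can be illustrated with a minimal, library-free sketch of zero-shot classification: embed an image and a set of label prompts, L2-normalize both, and pick the label with the highest scaled cosine similarity. This is an illustration of the general technique, not the repository's code. The function names are hypothetical, and the vectors, labels, and `logit_scale` value are made-up placeholders rather than outputs of any PE checkpoint.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, labels, logit_scale=100.0):
    """Return (best_label, probabilities) from scaled cosine similarities."""
    image = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy 3-dimensional embeddings (placeholders, not real model outputs).
labels = ["a diagram", "a dog", "a cat"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embedding = [0.1, 0.2, 0.9]

best_label, probs = zero_shot_classify(image_embedding, text_embeddings, labels)
print(best_label)
```

In a real pipeline the embeddings would come from the image and text towers of an encoder, and the label prompts would be sentences such as "a photo of a cat". No classifier is trained at any point, which is what makes the approach zero-shot.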

The repository also releases large-scale datasets including PE-Video-Dataset (1M high-quality videos) and PLM-Video-Human (human-annotated video understanding data).

Model Architecture and Variants

PE follows a scalable contrastive pretraining recipe across multiple model sizes (Tiny, Small, Base, Large, Giant) and patch sizes. The four checkpoint families serve distinct downstream use cases:

PE core checkpoints (T/16 through G/14) target zero-shot image and video classification and retrieval. The repository reports that PE-Core-G14-448 outperforms SigLIP2 on image benchmarks and InternVideo2 on video benchmarks.
PE lang checkpoints are aligned for use as vision encoders inside multimodal LLMs. The repository reports that PLM-8B (using PE-Lang-G14-448-Tiling) is competitive with InternVL3 and Qwen2.5-VL on standard VLM benchmarks.
PE spatial checkpoints are tuned for dense prediction. The repository reports that PE-Spatial-G14-448 outperforms DINOv2 on segmentation and tracking tasks.
PE audio-visual (PE-AV) and PE audio-frame (PE-A-Frame) models embed audio, video, and text into a joint space and support audio event localization.
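
The checkpoint names quoted above (for example PE-Core-G14-448 and PE-Lang-G14-448-Tiling) appear to encode family, model size, patch size, and a trailing number. The sketch below parses names of that shape. The pattern is inferred only from the names in this listing, the reading of the trailing number as input resolution is an assumption, and `parse_checkpoint_name` and `CheckpointName` are hypothetical helpers, not part of the repository. Names such as PE-AV and PE-A-Frame are not covered.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Size letters as listed in this document: Tiny, Small, Base, Large, Giant.
SIZE_NAMES = {"T": "Tiny", "S": "Small", "B": "Base", "L": "Large", "G": "Giant"}

# Assumed layout: PE-<Family>-<SizeLetter><PatchSize>-<Resolution>[-<Suffix>]
_NAME_PATTERN = re.compile(
    r"^PE-(?P<family>[A-Za-z]+)"
    r"-(?P<size>[TSBLG])(?P<patch>\d+)"
    r"-(?P<resolution>\d+)"
    r"(?:-(?P<suffix>[A-Za-z]+))?$"
)


@dataclass(frozen=True)
class CheckpointName:
    family: str
    size: str
    patch_size: int
    resolution: int  # assumed to be the input resolution in pixels
    suffix: Optional[str]


def parse_checkpoint_name(name):
    """Split a checkpoint name into its parts, or raise ValueError if it does not fit."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name!r}")
    return CheckpointName(
        family=match["family"],
        size=SIZE_NAMES[match["size"]],
        patch_size=int(match["patch"]),
        resolution=int(match["resolution"]),
        suffix=match["suffix"],  # None when the optional suffix is absent
    )


parsed = parse_checkpoint_name("PE-Lang-G14-448-Tiling")
print(parsed.family, parsed.size, parsed.patch_size, parsed.resolution, parsed.suffix)
```

A helper like this can be handy for filtering a list of checkpoints by family or size before downloading, but the authoritative list of valid names is whatever the repository itself publishes.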

PLM releases three model sizes (1B, 3B, 8B) with full training, fine-tuning, and evaluation documentation, including an end-to-end radiology fine-tuning example.

Dataset Releases

The repository ships three major dataset releases alongside the models:

PE-Video-Dataset (PVD): 1 million high-quality, diverse videos, with 120K carrying human-verified annotations, video descriptions, and keywords. Videos are motion-centered and cover first- and third-person views.
PLM-Video-Human: Human-annotated data for fine-grained question answering (FGQA), region-temporal localization (RTLoc), region video captioning (RCap), and region dense temporal captioning (RDCap).
Auto-generated datasets: PLM-Image-Auto and PLM-Video-Auto, covering synthetic captions and QA pairs over SA-1B, OpenImages, Objects365, ArxivQA, Ego4D, and YouTube-1B.

Update: PE-AV and Ongoing Releases (April 2025 – December 2025)

The repository has seen active development since its April 2025 launch:

December 16, 2025: Perception Encoder Audio-Visual (PE-AV) and Perception Encoder Audio-Frame (PE-A-Frame) models released, extending PE to joint audio-video-text embedding and audio event localization.
July 14, 2025: PerceptionLM became available natively in Hugging Face Transformers.
July 11, 2025: Eight new PE checkpoints released, including small core models (T and S), tiling-tuned lang models (G and L), and four smaller spatial models.
May 28, 2025: PE integrated into timm (PyTorch Image Models).
April 18, 2025: PLM and PLM-VideoBench added to lmms-eval for standardized evaluation.

Setup Path

Installation requires cloning the repository, creating a Python 3.12 conda environment, installing PyTorch 2.5.1 with CUDA 12.4, and installing torchcodec for video decoding via ffmpeg. The package installs in editable mode (pip install -e .), allowing local code modifications. Pretrained weights download automatically from Hugging Face when instantiated via from_config(..., pretrained=True). A Google Colab demo notebook is provided for PE image and text feature extraction without local setup.
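
Before running the install steps, it can help to check which prerequisites are already in place. The following is a minimal preflight sketch using only the Python standard library. `preflight_report` is a hypothetical helper, not something shipped with the repository. It checks only for presence (the Python minor version, `ffmpeg` on the PATH, and the `torch` and `torchcodec` import names) and does not verify the PyTorch 2.5.1 or CUDA 12.4 versions mentioned above.

```python
import importlib.util
import shutil
import sys

REQUIRED_PYTHON = (3, 12)                  # Python version named in the setup notes
REQUIRED_MODULES = ("torch", "torchcodec")  # import names of the required packages


def preflight_report():
    """Collect a pass/fail flag per prerequisite without importing heavy packages."""
    report = {
        "python": sys.version_info[:2] == REQUIRED_PYTHON,
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }
    for module_name in REQUIRED_MODULES:
        # find_spec locates a top-level module without executing it.
        report[module_name] = importlib.util.find_spec(module_name) is not None
    return report


if __name__ == "__main__":
    for item, ok in preflight_report().items():
        print(f"{item:<12} {'ok' if ok else 'missing'}")
```

Anything reported as missing corresponds to one of the installation steps described above. A full check would also compare installed versions against the pinned ones.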

Pricing

Open source under Apache 2.0, with the following included at no cost:

All PE model weights (Core, Lang, Spatial, Audio-Visual)
All PLM model weights (1B, 3B, 8B)
Full training and fine-tuning code
Dataset releases (PVD, PLM-Video-Human, auto-generated datasets)
Evaluation tooling and benchmark integrations

Capabilities

Key Features

Perception Encoder (PE) for image, video, and audio encoding
Perception Language Model (PLM) in 1B, 3B, and 8B sizes
PE core: zero-shot image and video classification and retrieval
PE lang: LLM-aligned encoder for multimodal LLMs
PE spatial: dense prediction (detection, depth, tracking)
PE audio-visual: joint audio-video-text embedding
PE audio-frame: audio event localization with timestamps
PE-Video-Dataset: 1M high-quality videos, 120K with human-verified annotations
PLM-Video-Human: human-annotated fine-grained video QA and captioning
Auto-generated image and video datasets (PLM-Image-Auto, PLM-Video-Auto)
Hugging Face Transformers integration for PLM
timm integration for PE
lmms-eval integration for PLM and PLM-VideoBench
Google Colab demo notebooks
Full training and fine-tuning code
Apache 2.0 license for code and models

Integrations

Hugging Face Transformers

Hugging Face Hub

timm (PyTorch Image Models)

lmms-eval

PyTorch

torchcodec

ffmpeg

Detectron2

DETA

mmsegmentation

Google Colab

Llama 3.x
