# Perception Models

> Meta FAIR's open-source repository of state-of-the-art vision, video, and audio encoders (Perception Encoder) and a multimodal language model (PerceptionLM) for image, video, and audio understanding.

Perception Models is a research repository from Meta FAIR (Fundamental AI Research) that houses two flagship model families: the Perception Encoder (PE) for image, video, and audio encoding, and the Perception Language Model (PLM) for vision-language decoding. Released in April 2025 under the Apache 2.0 license, the repository provides fully open weights, training code, datasets, and evaluation tooling, and has since accumulated over 2,300 GitHub stars.

## What It Is

Perception Models is an open-source AI research toolkit targeting multimodal perception: the ability of models to understand images, video, and audio together. It is organized around two core components:

- **Perception Encoder (PE):** A family of CLIP-style vision and audio encoders trained with contrastive pretraining. PE comes in four checkpoint types: PE core (zero-shot classification and retrieval), PE lang (LLM-aligned for multimodal LLMs), PE spatial (dense prediction tasks like detection, depth, and tracking), and PE audio-visual (joint audio-video-text embedding space).
- **Perception Language Model (PLM):** A family of open, fully reproducible vision-language models in 1B, 3B, and 8B parameter sizes, built on Llama-3.x base LLMs and powered by PE lang encoders.
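
To make "CLIP-style contrastive pretraining" concrete, here is a minimal sketch of the standard symmetric contrastive loss over a batch of paired image and text embeddings. It is a generic illustration in plain Python, not the repository's training code: the function names and the `logit_scale` value of 100 are assumptions chosen for the example, and PE's actual recipe may differ in its details.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, logit_scale=100.0):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_embs is assumed to be paired with row i of text_embs.
    logit_scale is a hypothetical fixed temperature for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: image i should select text i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: text j should select image j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


# Toy batch: correctly paired embeddings versus deliberately swapped ones.
aligned_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each image embedding is closest to its own caption and high when the pairs are swapped, which is the signal that pulls matching pairs together in the shared embedding space.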

The repository also releases large-scale datasets including PE-Video-Dataset (1M high-quality videos) and PLM-Video-Human (human-annotated video understanding data).

## Model Architecture and Variants

PE follows a scalable contrastive pretraining recipe across multiple model sizes (Tiny, Small, Base, Large, Giant) and patch sizes. The four checkpoint families serve distinct downstream use cases:

- **PE core** checkpoints (T/16 through G/14) target zero-shot image/video classification and retrieval, with the repo claiming PE-Core-G14-448 outperforms SigLIP2 on image benchmarks and InternVideo2 on video benchmarks.
- **PE lang** checkpoints are aligned for use as vision encoders inside multimodal LLMs; the repo claims PLM-8B (using PE-Lang-G14-448-Tiling) achieves competitive results against InternVL3 and QwenVL2.5 on standard VLM benchmarks.
- **PE spatial** checkpoints are tuned for dense prediction, with the repo claiming PE-Spatial-G14-448 outperforms DINOv2 on segmentation and tracking tasks.
- **PE audio-visual** (PE-AV) and **PE audio-frame** (PE-A-Frame) models embed audio, video, and text into a joint space and support audio event localization.
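
As an illustration of how a joint embedding space supports zero-shot classification, the sketch below scores one query embedding against a set of text-label embeddings and turns the scaled cosine similarities into probabilities. The embeddings are hand-written toy vectors, and `zero_shot_classify` and its `logit_scale` default are hypothetical names for this example; in practice the vectors would come from a PE checkpoint's image, video, audio, and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(query_emb, label_embs, logit_scale=100.0):
    """Return a {label: probability} mapping for one query embedding.

    label_embs maps each candidate label to its text embedding.
    logit_scale is an assumed temperature, not a value taken from the repo.
    """
    labels = list(label_embs)
    logits = [logit_scale * cosine(query_emb, label_embs[name]) for name in labels]
    return dict(zip(labels, softmax(logits)))


# Toy 3-dimensional embeddings standing in for real encoder outputs.
toy_labels = {
    "a diagram": [1.0, 0.0, 0.0],
    "a dog": [0.0, 1.0, 0.0],
    "a cat": [0.0, 0.2, 1.0],
}
toy_image = [0.1, 0.3, 0.9]

probs = zero_shot_classify(toy_image, toy_labels)
best_label = max(probs, key=probs.get)
```

Retrieval works the same way with the roles reversed: a text query is scored against many image or video embeddings and the results are ranked by similarity.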

PLM releases three model sizes (1B, 3B, 8B) with full training, fine-tuning, and evaluation documentation, including an end-to-end radiology fine-tuning example.

## Dataset Releases

The repository ships three major dataset releases alongside the models:

- **PE-Video-Dataset (PVD):** 1 million high-quality, diverse videos, with 120K carrying human-verified annotations, video descriptions, and keywords. Videos are motion-centered and cover first- and third-person views.
- **PLM-Video-Human:** Human-annotated data for fine-grained question answering (FGQA), region-temporal localization (RTLoc), region video captioning (RCap), and region dense temporal captioning (RDCap).
- **Auto-generated datasets:** PLM-Image-Auto and PLM-Video-Auto, covering synthetic captions and QA pairs over SA-1B, OpenImages, Objects365, ArxivQA, Ego4D, and YouTube-1B.

## Update: PE-AV and Ongoing Releases (April – December 2025)

The repository has seen active development since its April 2025 launch:

- **December 16, 2025:** Perception Encoder Audio-Visual (PE-AV) and Perception Encoder Audio-Frame (PE-A-Frame) models released, extending PE to joint audio-video-text embedding and audio event localization.
- **July 14, 2025:** PerceptionLM became available natively in Hugging Face Transformers.
- **July 11, 2025:** Eight new PE checkpoints released, including small core models (T and S), tiling-tuned lang models (G and L), and four smaller spatial models.
- **May 28, 2025:** PE integrated into `timm` (PyTorch Image Models).
- **April 18, 2025:** PLM and PLM-VideoBench added to `lmms-eval` for standardized evaluation.

## Setup Path

Installation requires cloning the repository, creating a Python 3.12 conda environment, installing PyTorch 2.5.1 with CUDA 12.4, and installing `torchcodec` for video decoding via `ffmpeg`. The package installs in editable mode (`pip install -e .`), so local code modifications take effect without reinstalling. Pretrained weights download automatically from Hugging Face when a model is instantiated via `from_config(..., pretrained=True)`. A Google Colab demo notebook is provided for PE image and text feature extraction without local setup.
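
Before loading a checkpoint, it can help to confirm that the prerequisites listed above are in place. The following is a small, hypothetical helper (not part of the repository) that uses only the standard library to check the Python version, whether `torch` and `torchcodec` are importable, and whether an `ffmpeg` binary is on the `PATH`. It does not verify the PyTorch or CUDA versions.

```python
import importlib.util
import shutil
import sys


def check_environment(required_python=(3, 12), modules=("torch", "torchcodec")):
    """Report which setup prerequisites appear to be satisfied.

    required_python and modules mirror the setup described above; the
    function name and report keys are illustrative, not from the repo.
    """
    report = {
        # Compare only major.minor, e.g. (3, 12).
        "python_ok": tuple(sys.version_info[:2]) == tuple(required_python),
        # torchcodec needs an ffmpeg installation for video decoding.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }
    for name in modules:
        # find_spec locates a top-level package without importing it.
        report[f"{name}_installed"] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

Running the script inside the conda environment prints one line per check; any `missing` entry points to the installation step that still needs attention.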

## Features
- Perception Encoder (PE) for image, video, and audio encoding
- Perception Language Model (PLM) in 1B, 3B, and 8B sizes
- PE core: zero-shot image and video classification and retrieval
- PE lang: LLM-aligned encoder for multimodal LLMs
- PE spatial: dense prediction (detection, depth, tracking)
- PE audio-visual: joint audio-video-text embedding
- PE audio-frame: audio event localization with timestamps
- PE-Video-Dataset: 1M high-quality videos, 120K with human-verified annotations
- PLM-Video-Human: human-annotated fine-grained video QA and captioning
- Auto-generated image and video datasets (PLM-Image-Auto, PLM-Video-Auto)
- Hugging Face Transformers integration for PLM
- timm integration for PE
- lmms-eval integration for PLM and PLM-VideoBench
- Google Colab demo notebooks
- Full training and fine-tuning code
- Apache 2.0 license for code and models

## Integrations
Hugging Face Transformers, Hugging Face Hub, timm (PyTorch Image Models), lmms-eval, PyTorch, torchcodec, ffmpeg, Detectron2, DETA, mmsegmentation, Google Colab, Llama 3.x

## Platforms
CLI, API, Developer SDK

## Pricing
Open Source

## Links
- Website: https://github.com/facebookresearch/perception_models
- Documentation: https://github.com/facebookresearch/perception_models/blob/main/apps/pe/README.md
- Repository: https://github.com/facebookresearch/perception_models
- EveryDev.ai: https://www.everydev.ai/tools/perception-models
