EveryDev.ai
With AI, Everyone is a Dev. EveryDev.ai © 2026
    Perception Models

    AI Development Libraries
    Featured

    Meta FAIR's open-source repository of state-of-the-art vision, video, and audio encoders (Perception Encoder) and a multimodal language model (PerceptionLM) for image, video, and audio understanding.

    At a Glance

    Pricing
    Open Source

    Fully open-source under Apache 2.0. Free to use, modify, and distribute.

    Available On

    CLI
    API
    SDK

    Topics

    AI Development Libraries, Image, Video

    Alternatives

    STARFlow, Sana, MLX-VLM
    Developer
    Meta AI, Menlo Park, Est. 2004, $2.3B raised

    Listed Jun 2026

    About Perception Models

    Perception Models is a research repository from Meta FAIR (Fundamental AI Research) that houses two flagship model families: the Perception Encoder (PE) for image, video, and audio encoding, and the Perception Language Model (PLM) for vision-language decoding. Released in April 2025 under the Apache 2.0 license, the repository provides fully open weights, training code, datasets, and evaluation tooling. It has since accumulated over 2,300 GitHub stars.

    What It Is

    Perception Models is an open-source AI research toolkit targeting multimodal perception — the ability of models to understand images, video, and audio together. It is organized around two core components:

    • Perception Encoder (PE): A family of CLIP-style vision and audio encoders trained with contrastive pretraining. PE comes in four checkpoint types: PE core (zero-shot classification and retrieval), PE lang (LLM-aligned for multimodal LLMs), PE spatial (dense prediction tasks like detection, depth, and tracking), and PE audio-visual (joint audio-video-text embedding space).
    • Perception Language Model (PLM): A family of open, fully reproducible vision-language models in 1B, 3B, and 8B parameter sizes, built on Llama-3.x base LLMs and powered by PE lang encoders.

    The repository also releases large-scale datasets including PE-Video-Dataset (1M high-quality videos) and PLM-Video-Human (human-annotated video understanding data).

    Model Architecture and Variants

    PE follows a scalable contrastive pretraining recipe across multiple model sizes (Tiny, Small, Base, Large, Giant) and patch sizes. The four checkpoint families serve distinct downstream use cases:

    • PE core checkpoints (T/16 through G/14) target zero-shot image/video classification and retrieval, with the repo claiming PE-Core-G14-448 outperforms SigLIP2 on image benchmarks and InternVideo2 on video benchmarks.
    • PE lang checkpoints are aligned for use as vision encoders inside multimodal LLMs; the repo claims PLM-8B (using PE-Lang-G14-448-Tiling) achieves competitive results against InternVL3 and Qwen2.5-VL on standard VLM benchmarks.
    • PE spatial checkpoints are tuned for dense prediction, with the repo claiming PE-Spatial-G14-448 outperforms DINOv2 on segmentation and tracking tasks.
    • PE audio-visual (PE-AV) and PE audio-frame (PE-A-Frame) models embed audio, video, and text into a joint space and support audio event localization.
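
The zero-shot classification that PE core targets can be illustrated with a generic CLIP-style scoring step. This is a minimal sketch, not the repository's API: the toy 3-dimensional embeddings stand in for real encoder outputs, the helper names `l2_normalize` and `zero_shot_probs` are hypothetical, and the `logit_scale` of 100 is an assumed CLIP-style convention rather than a value taken from PE.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N label prompts.

    logit_scale is an assumed temperature, not a documented PE parameter.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings: in practice these would come from the image and
# text towers of a contrastively trained encoder.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best_label = labels[probs.index(max(probs))]
print(best_label)
```

Retrieval follows the same pattern: rank a gallery of image or video embeddings by cosine similarity to a single text embedding instead of applying a softmax over labels.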

    PLM releases three model sizes (1B, 3B, 8B) with full training, fine-tuning, and evaluation documentation, including an end-to-end radiology fine-tuning example.

    Dataset Releases

    The repository ships three major dataset releases alongside the models:

    • PE-Video-Dataset (PVD): 1 million high-quality, diverse videos, with 120K carrying human-verified annotations, video descriptions, and keywords. Videos are motion-centered and cover first- and third-person views.
    • PLM-Video-Human: Human-annotated data for fine-grained question answering (FGQA), region-temporal localization (RTLoc), region video captioning (RCap), and region dense temporal captioning (RDCap).
    • Auto-generated datasets: PLM-Image-Auto and PLM-Video-Auto, covering synthetic captions and QA pairs over SA-1B, OpenImages, Objects365, ArxivQA, Ego4D, and YouTube-1B.

    Update: PE-AV and Ongoing Releases (April 2025 – December 2025)

    The repository has seen active development since its April 2025 launch. Releases are listed newest first:

    • December 16, 2025: Perception Encoder Audio-Visual (PE-AV) and Perception Encoder Audio-Frame (PE-A-Frame) models released, extending PE to joint audio-video-text embedding and audio event localization.
    • July 14, 2025: PerceptionLM became available natively in Hugging Face Transformers.
    • July 11, 2025: Eight new PE checkpoints released, including small core models (T and S), tiling-tuned lang models (G and L), and four smaller spatial models.
    • May 28, 2025: PE integrated into timm (PyTorch Image Models).
    • April 18, 2025: PLM and PLM-VideoBench added to lmms-eval for standardized evaluation.

    Setup Path

    Installation requires cloning the repository, creating a Python 3.12 conda environment, installing PyTorch 2.5.1 with CUDA 12.4, and installing torchcodec for video decoding via ffmpeg. The package installs in editable mode (pip install -e .), allowing local code modifications. Pretrained weights download automatically from Hugging Face when instantiated via from_config(..., pretrained=True). A Google Colab demo notebook is provided for PE image and text feature extraction without local setup.
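
Before cloning and installing, it can help to confirm that the current environment matches the prerequisites described above. The following is a minimal preflight sketch using only the Python standard library. The helper names `installed_version` and `preflight` are hypothetical and not part of the repository, and the version pins are copied from the setup description here, so they should be checked against the repository's own README.

```python
import shutil
import sys
from importlib import metadata

# Pins taken from the setup description above (assumption: the README may
# list different or additional pins).
EXPECTED_PYTHON = (3, 12)
EXPECTED_TORCH_PREFIX = "2.5.1"


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def preflight():
    """Collect simple pass/fail checks for the documented prerequisites."""
    torch_version = installed_version("torch")
    return {
        "python_3_12": sys.version_info[:2] == EXPECTED_PYTHON,
        # Local builds may carry a suffix such as "+cu124", so match on prefix.
        "torch_2_5_1": bool(torch_version)
        and torch_version.startswith(EXPECTED_TORCH_PREFIX),
        "torchcodec_installed": installed_version("torchcodec") is not None,
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'missing or mismatched'}")
```

The check does not verify the CUDA version or import any model code; it only reports whether the interpreter, packages, and ffmpeg binary named in the setup description are present.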

    Pricing

    Open Source

    Fully open-source under Apache 2.0. Free to use, modify, and distribute.

    • All PE model weights (Core, Lang, Spatial, Audio-Visual)
    • All PLM model weights (1B, 3B, 8B)
    • Full training and fine-tuning code
    • Dataset releases (PVD, PLM-Video-Human, auto-generated datasets)
    • Evaluation tooling and benchmark integrations

    Capabilities

    Key Features

    • Perception Encoder (PE) for image, video, and audio encoding
    • Perception Language Model (PLM) in 1B, 3B, and 8B sizes
    • PE core: zero-shot image and video classification and retrieval
    • PE lang: LLM-aligned encoder for multimodal LLMs
    • PE spatial: dense prediction (detection, depth, tracking)
    • PE audio-visual: joint audio-video-text embedding
    • PE audio-frame: audio event localization with timestamps
    • PE-Video-Dataset: 1M high-quality videos, 120K with human-verified annotations
    • PLM-Video-Human: human-annotated fine-grained video QA and captioning
    • Auto-generated image and video datasets (PLM-Image-Auto, PLM-Video-Auto)
    • Hugging Face Transformers integration for PLM
    • timm integration for PE
    • lmms-eval integration for PLM and PLM-VideoBench
    • Google Colab demo notebooks
    • Full training and fine-tuning code
    • Apache 2.0 license for code and models

    Integrations

    Hugging Face Transformers
    Hugging Face Hub
    timm (PyTorch Image Models)
    lmms-eval
    PyTorch
    torchcodec
    ffmpeg
    Detectron2
    DETA
    mmsegmentation
    Google Colab
    Llama 3.x

    Developer

    Meta AI

    Meta AI Research (formerly Facebook AI Research or FAIR) is a research laboratory within Meta Platforms (formerly Facebook) dedicated to advancing the field of artificial intelligence through open research. The division focuses on making significant advancements in AI technology and freely publishing research papers, open-sourcing code, and releasing state-of-the-art AI models for the broader AI research community. Meta AI works on a broad range of fundamental and applied research areas including computer vision, natural language processing, reasoning, multimodal AI, robotics, and responsible AI development.

    Founded 2004
    1 Meta Way
    $2.3B raised
    77,986 employees

    Used by

    CNN
    Fox News
    Le Monde
    Amazon Web Services (AWS)
    +2 more
    6 tools in directory

    Similar Tools

    STARFlow

    STARFlow is Apple's open-source transformer autoregressive flow model for high-quality text-to-image and text-to-video generation, combining autoregressive models with normalizing flows.

    Sana

    SANA is an open-source, efficiency-oriented framework by NVIDIA Labs for high-resolution image and video generation using Linear Diffusion Transformers, deployable on consumer GPUs with as little as 8GB VRAM.

    MLX-VLM

    A Python library for running Vision Language Models on Apple Silicon using the MLX framework.

    Related Topics

    AI Development Libraries

    Programming libraries and frameworks that provide machine learning capabilities, model integration, and AI functionality for developers.

    228 tools

    Image

    AI tools that generate or edit still images — illustrations, photos, logos, icons, and graphics — from text prompts, references, or existing images.

    77 tools

    Video

    AI tools that generate or edit video — from text-to-video and animation to avatars, dubbing, and short-form clips.

    69 tools