VOID: Video Object and Interaction Deletion
VOID removes objects from videos together with the physical interactions they induce in the scene — for example, an object falling when the person holding it is removed.
At a Glance
Fully free and open-source under Apache License 2.0. Download model checkpoints from HuggingFace and run locally.
Listed Apr 2026
About VOID: Video Object and Interaction Deletion
VOID (Video Object and Interaction Deletion) is an open-source video inpainting model developed by Netflix Research that removes objects from videos while also eliminating the physical interactions those objects cause — not just visual artifacts like shadows, but dynamic effects like objects falling or being displaced. Built on top of CogVideoX and fine-tuned for interaction-aware video inpainting, VOID uses a novel quadmask conditioning system to distinguish between primary objects, affected regions, and background. The model runs in two sequential passes: Pass 1 for base inpainting and Pass 2 for warped-noise temporal refinement, enabling high-quality results on longer video clips.
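The quadmask conditioning described above can be illustrated with a small integer mask. The specific label values and the helper below are assumptions for illustration only, not VOID's actual encoding:

```python
import numpy as np

# Illustrative quadmask labels (assumed values; VOID's actual
# encoding may differ):
BACKGROUND = 0       # untouched scene
PRIMARY_OBJECT = 1   # the object to remove
OVERLAP = 2          # pixels where object and affected region overlap
AFFECTED_REGION = 3  # areas physically influenced by the object

def make_quadmask(object_mask: np.ndarray, affected_mask: np.ndarray) -> np.ndarray:
    """Combine a binary object mask and a binary affected-region mask
    into a single 4-value quadmask for one frame."""
    quad = np.full(object_mask.shape, BACKGROUND, dtype=np.uint8)
    quad[affected_mask] = AFFECTED_REGION
    quad[object_mask] = PRIMARY_OBJECT
    quad[object_mask & affected_mask] = OVERLAP
    return quad

# Toy 4x4 frame: object in the top-left, affected region in the
# bottom-right, with one overlapping pixel at (1, 1).
obj = np.zeros((4, 4), dtype=bool); obj[:2, :2] = True
aff = np.zeros((4, 4), dtype=bool); aff[1:, 1:] = True
quad = make_quadmask(obj, aff)
print(quad)
```

Encoding all four classes in one mask lets a single conditioning channel tell the model both what to delete and which surrounding regions must be re-synthesized to remove the object's physical effects.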
- Interaction-aware removal — Uses a 4-value quadmask (primary object, overlap, affected region, background) to model and remove physical interactions like falling objects, not just the target object itself.
- Two-pass inference pipeline — Pass 1 performs base inpainting; Pass 2 applies optical flow-warped latent initialization for improved temporal consistency on longer clips.
- VLM-powered mask generation — The VLM-MASK-REASONER pipeline uses SAM2 segmentation and Gemini for automated reasoning about interaction-affected regions, generating quadmasks from raw video.
- Manual mask refinement GUI — An included GUI editor allows frame-by-frame refinement of quadmasks with brush tools, grid toggles, and undo/redo support.
- Synthetic training data pipelines — Provides two data generation pipelines: HUMOTO (human-object interaction via Blender/motion capture) and Kubric (physics-based object interaction), both producing paired counterfactual videos.
- Google Colab notebook — A ready-to-run notebook handles setup, model download, and inference on sample videos; requires a GPU with 40GB+ VRAM (e.g., A100).
- HuggingFace model hosting — Both VOID Pass 1 and Pass 2 checkpoints are available on HuggingFace for direct download and use.
- Apache 2.0 licensed — Fully open-source; free to use, modify, and distribute under the Apache License 2.0.
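Pass 2's flow-warped latent initialization can be sketched roughly as follows. This is a simplified assumption about the approach (nearest-neighbor backward warping plus a noise blend), not VOID's actual implementation; `warp_latent` and `init_noise` are hypothetical helpers:

```python
import numpy as np

def warp_latent(prev_latent: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp an (H, W, C) latent using a per-pixel flow field
    of shape (H, W, 2) holding (dy, dx) offsets into the previous
    frame. Nearest-neighbor sampling; out-of-bounds indices clamped."""
    h, w, _ = prev_latent.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return prev_latent[src_y, src_x]

def init_noise(prev_latent: np.ndarray, flow: np.ndarray,
               strength: float = 0.5, rng=None) -> np.ndarray:
    """Blend the warped previous-frame latent with fresh Gaussian
    noise, so each frame's denoising starts near its temporally
    aligned predecessor (illustrative, not VOID's exact scheme)."""
    rng = rng or np.random.default_rng(0)
    warped = warp_latent(prev_latent, flow)
    return (1 - strength) * warped + strength * rng.standard_normal(warped.shape)

# With zero flow, warping is the identity.
lat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
same = warp_latent(lat, np.zeros((2, 3, 2)))
```

Seeding each frame's denoising from a flow-aligned copy of the previous frame's latent is a common way to reduce flicker, which matches the listing's claim of improved temporal consistency on longer clips.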
Pricing
Open Source
- Apache License 2.0
- Full source code access on GitHub
- VOID Pass 1 and Pass 2 model checkpoints on HuggingFace
- VLM-MASK-REASONER pipeline
- Training data generation pipelines (HUMOTO + Kubric)
Capabilities
Key Features
- Interaction-aware video object removal
- Two-pass inference pipeline (base + warped-noise refinement)
- Quadmask conditioning (4-value semantic mask)
- VLM-powered automated mask generation (SAM2 + Gemini)
- Manual quadmask refinement GUI
- HUMOTO-based synthetic training data generation (Blender)
- Kubric-based synthetic training data generation
- Google Colab notebook for quick start
- HuggingFace model hosting (Pass 1 and Pass 2 checkpoints)
- Batch inference support
- Optical flow-based temporal consistency (Pass 2)
- DeepSpeed ZeRO stage 2 training support
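For the DeepSpeed ZeRO stage 2 training support listed above, a typical configuration fragment looks like the following. The specific batch sizes and precision settings here are illustrative assumptions, not values taken from VOID's training recipe:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true }
}
```

ZeRO stage 2 partitions optimizer states and gradients across GPUs, which helps fit fine-tuning of a large video diffusion backbone like CogVideoX into per-GPU memory budgets.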
