VOID: Video Object and Interaction Deletion
VOID removes objects from videos together with the physical interactions they induce in the scene — for example, an object falling when the person holding it is removed.
At a Glance
Fully free and open-source under Apache License 2.0. Download model checkpoints from HuggingFace and run locally.
Listed Apr 2026
About VOID: Video Object and Interaction Deletion
VOID (Video Object and Interaction Deletion) is an open-source video inpainting model developed by Netflix Research that removes objects from videos while also eliminating the physical interactions those objects cause — not just visual artifacts like shadows, but dynamic effects like objects falling or being displaced. Built on top of CogVideoX and fine-tuned for interaction-aware video inpainting, VOID uses a novel quadmask conditioning system to distinguish between primary objects, affected regions, and background. The model runs in two sequential passes: Pass 1 for base inpainting and Pass 2 for warped-noise temporal refinement, enabling high-quality results on longer video clips.
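The quadmask conditioning described above can be illustrated with a small integer mask. The specific label values and the helper below are assumptions for illustration only, not VOID's actual encoding:

```python
import numpy as np

# Illustrative quadmask labels (assumed values; VOID's actual
# encoding may differ):
BACKGROUND = 0       # untouched scene
PRIMARY_OBJECT = 1   # the object to remove
OVERLAP = 2          # pixels where object and affected region overlap
AFFECTED_REGION = 3  # areas physically influenced by the object

def make_quadmask(object_mask: np.ndarray, affected_mask: np.ndarray) -> np.ndarray:
    """Combine a binary object mask and a binary affected-region mask
    into a single 4-value quadmask for one frame."""
    quad = np.full(object_mask.shape, BACKGROUND, dtype=np.uint8)
    quad[affected_mask] = AFFECTED_REGION
    quad[object_mask] = PRIMARY_OBJECT
    quad[object_mask & affected_mask] = OVERLAP
    return quad

# Toy 4x4 frame: object in the top-left, affected region in the
# bottom-right, with one overlapping pixel at (1, 1).
obj = np.zeros((4, 4), dtype=bool); obj[:2, :2] = True
aff = np.zeros((4, 4), dtype=bool); aff[1:, 1:] = True
quad = make_quadmask(obj, aff)
print(quad)
```

Encoding all four classes in one mask lets a single conditioning channel tell the model both what to delete and which surrounding regions must be re-synthesized to remove the object's physical effects.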
- Interaction-aware removal — Uses a 4-value quadmask (primary object, overlap, affected region, background) to model and remove physical interactions like falling objects, not just the target object itself.
- Two-pass inference pipeline — Pass 1 performs base inpainting; Pass 2 applies optical flow-warped latent initialization for improved temporal consistency on longer clips.
- VLM-powered mask generation — The VLM-MASK-REASONER pipeline uses SAM2 segmentation and Gemini for automated reasoning about interaction-affected regions, generating quadmasks from raw video.
- Manual mask refinement GUI — An included GUI editor allows frame-by-frame refinement of quadmasks with brush tools, grid toggles, and undo/redo support.
- Synthetic training data pipelines — Provides two data generation pipelines: HUMOTO (human-object interaction via Blender/motion capture) and Kubric (physics-based object interaction), both producing paired counterfactual videos.
- Google Colab notebook — A ready-to-run notebook handles setup, model download, and inference on sample videos; requires a GPU with 40GB+ VRAM (e.g., A100).
- HuggingFace model hosting — Both VOID Pass 1 and Pass 2 checkpoints are available on HuggingFace for direct download and use.
- Apache 2.0 licensed — Fully open-source; free to use, modify, and distribute under the Apache License 2.0.
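Pass 2's flow-warped latent initialization can be sketched roughly as follows. This is a simplified assumption about the approach (nearest-neighbor backward warping plus a noise blend), not VOID's actual implementation; `warp_latent` and `init_noise` are hypothetical helpers:

```python
import numpy as np

def warp_latent(prev_latent: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp an (H, W, C) latent using a per-pixel flow field
    of shape (H, W, 2) holding (dy, dx) offsets into the previous
    frame. Nearest-neighbor sampling; out-of-bounds indices clamped."""
    h, w, _ = prev_latent.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return prev_latent[src_y, src_x]

def init_noise(prev_latent: np.ndarray, flow: np.ndarray,
               strength: float = 0.5, rng=None) -> np.ndarray:
    """Blend the warped previous-frame latent with fresh Gaussian
    noise, so each frame's denoising starts near its temporally
    aligned predecessor (illustrative, not VOID's exact scheme)."""
    rng = rng or np.random.default_rng(0)
    warped = warp_latent(prev_latent, flow)
    return (1 - strength) * warped + strength * rng.standard_normal(warped.shape)

# With zero flow, warping is the identity.
lat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
same = warp_latent(lat, np.zeros((2, 3, 2)))
```

Seeding each frame's denoising from a flow-aligned copy of the previous frame's latent is a common way to reduce flicker, which matches the listing's claim of improved temporal consistency on longer clips.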
Pricing
Open Source
- Apache License 2.0
- Full source code access on GitHub
- VOID Pass 1 and Pass 2 model checkpoints on HuggingFace
- VLM-MASK-REASONER pipeline
- Training data generation pipelines (HUMOTO + Kubric)
Capabilities
Key Features
- Interaction-aware video object removal
- Two-pass inference pipeline (base + warped-noise refinement)
- Quadmask conditioning (4-value semantic mask)
- VLM-powered automated mask generation (SAM2 + Gemini)
- Manual quadmask refinement GUI
- HUMOTO-based synthetic training data generation (Blender)
- Kubric-based synthetic training data generation
- Google Colab notebook for quick start
- HuggingFace model hosting (Pass 1 and Pass 2 checkpoints)
- Batch inference support
- Optical flow-based temporal consistency (Pass 2)
- DeepSpeed ZeRO stage 2 training support
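For the DeepSpeed ZeRO stage 2 training support listed above, a typical configuration fragment looks like the following. The specific batch sizes and precision settings here are illustrative assumptions, not values taken from VOID's training recipe:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true }
}
```

ZeRO stage 2 partitions optimizer states and gradients across GPUs, which helps fit fine-tuning of a large video diffusion backbone like CogVideoX into per-GPU memory budgets.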
