Video editing has always had a dirty secret: removing an object from footage is easy; making the scene look like it was never there is brutally hard. Take out a person holding a guitar, and you’re left with a floating instrument that defies gravity. Hollywood VFX teams spend weeks fixing exactly this kind of problem. A team of researchers from Netflix and INSAIT, Sofia University ‘St. Kliment Ohridski,’ released VOID (Video Object and Interaction Deletion), a model that can do it automatically.
VOID removes objects from videos along with all interactions they induce on the scene — not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed.
What Problem Is VOID Actually Solving?
Standard video inpainting models — the kind used in most editing workflows today — are trained to fill in the pixel region where an object was. They’re essentially very sophisticated background painters. What they don’t do is reason about causality: if I remove an actor who is holding a prop, what should happen to that prop?
Existing video object removal methods excel at inpainting content ‘behind’ the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results.
VOID is built on top of CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning. The key innovation is in how the model understands the scene — not just ‘what pixels should I fill?’ but ‘what is physically plausible after this object disappears?’
The canonical example from the research paper: if a person holding a guitar is removed, VOID also removes the person’s effect on the guitar — causing it to fall naturally. That’s not trivial. The model has to understand that the guitar was being supported by the person, and that removing the person means gravity takes over.
And unlike prior work, VOID was evaluated head-to-head against real competitors. Experiments on both synthetic and real data show that the approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods including ProPainter, DiffuEraser, Runway, MiniMax-Remover, ROSE, and Gen-Omnimatte.
https://arxiv.org/pdf/2604.02296
The Architecture: CogVideoX Under the Hood
VOID is built on CogVideoX-Fun-V1.5-5b-InP — a model from Alibaba PAI — and fine-tuned for video inpainting with interaction-aware quadmask conditioning. CogVideoX is a 3D Transformer-based video generation model. Think of it like a video version of Stable Diffusion — a diffusion model that operates over temporal sequences of frames rather than single images. The specific base model (CogVideoX-Fun-V1.5-5b-InP) is released by Alibaba PAI on Hugging Face, which is the checkpoint engineers will need to download separately before running VOID.
The fine-tuned architecture specs:

- Backbone: CogVideoX 3D Transformer, 5B parameters
- Inputs: video, quadmask, and a text prompt describing the scene after removal
- Default resolution: 384×672
- Maximum frames: 197
- Scheduler: DDIM
- Precision: BF16, with FP8 quantization for memory efficiency
The quadmask is arguably the most interesting technical contribution here. Rather than a binary mask (remove this pixel / keep this pixel), the quadmask is a 4-value mask that encodes the primary object to remove, overlap regions, affected regions (falling objects, displaced items), and background to keep.
In practice, each pixel in the mask gets one of four values: 0 (primary object being removed), 63 (overlap between primary and affected regions), 127 (interaction-affected region — things that will move or change as a result of the removal), and 255 (background, keep as-is). This gives the model a structured semantic map of what’s happening in the scene, not just where the object is.
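To make the value scheme concrete, here is a minimal sketch of how such a quadmask could be composed from two boolean region masks. The four values (0, 63, 127, 255) come from the description above; the function name and the way the regions are derived are illustrative assumptions, not VOID's actual preprocessing code.

```python
import numpy as np

# Quadmask values as described in the paper's scheme.
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255

def build_quadmask(primary, affected):
    """Compose a uint8 quadmask (H, W) from boolean region masks.

    primary:  pixels of the object being removed (e.g. the person)
    affected: pixels that will move/change as a result (e.g. the guitar)
    """
    quad = np.full(primary.shape, BACKGROUND, dtype=np.uint8)
    quad[affected] = AFFECTED            # interaction-affected region
    quad[primary] = PRIMARY              # object to remove
    quad[primary & affected] = OVERLAP   # where the two regions intersect
    return quad

# Toy 4x6 frame: the "person" occupies cols 1-2, the "guitar" cols 2-4.
h, w = 4, 6
primary = np.zeros((h, w), dtype=bool); primary[1:3, 1:3] = True
affected = np.zeros((h, w), dtype=bool); affected[1:3, 2:5] = True
quad = build_quadmask(primary, affected)
```

The ordering matters: overlap is written last so that pixels belonging to both regions get the dedicated overlap value rather than being claimed by either region alone.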
Two-Pass Inference Pipeline
VOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal consistency.
Pass 1 (void_pass1.safetensors) is the base inpainting model and is sufficient for most videos. Pass 2 targets a specific failure mode: object morphing, a known weakness of smaller video diffusion models. When morphing is detected, an optional second pass re-runs inference using flow-warped noise derived from the first pass, stabilizing object shape along the newly synthesized trajectories.
It’s worth understanding the distinction: Pass 2 isn’t just for longer clips — it’s specifically a shape-stability fix. When the diffusion model produces objects that gradually warp or deform across frames (a well-documented artifact in video diffusion), Pass 2 uses optical flow to warp the latents from Pass 1 and feeds them as initialization into a second diffusion run, anchoring the shape of synthesized objects frame-to-frame.
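The core mechanic of that flow warping can be sketched in a few lines. This is a hedged toy illustration of the idea, not VOID's actual implementation: a dense optical-flow field backward-warps a Pass-1 latent frame so the second diffusion run starts from shape-consistent initialization. Nearest-neighbor sampling is used here purely for brevity.

```python
import numpy as np

def flow_warp(latent, flow):
    """Backward-warp a latent frame (H, W) by a dense flow field (H, W, 2).

    flow[..., 0] is horizontal displacement (dx), flow[..., 1] vertical (dy).
    Nearest-neighbor sampling; coordinates are clamped at the borders.
    """
    h, w = latent.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    return latent[src_y, src_x]

# Toy example: a vertical stripe in column 0, and a constant 1px rightward
# flow, so the warped latent shows the stripe shifted to column 1.
latent = np.zeros((4, 4)); latent[:, 0] = 1.0
flow = np.zeros((4, 4, 2)); flow[..., 0] = 1.0  # dx = 1 everywhere
warped = flow_warp(latent, flow)
```

In the real pipeline this warping would be applied across the latent frames of Pass 1's output, so that each frame's initialization agrees with where the previous frame's content has moved.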
How the Training Data Was Generated
This is where things get genuinely interesting. Training a model to understand physical interactions requires paired videos — the same scene, with and without the object, where the physics plays out correctly in both. Real-world paired data at this scale doesn’t exist. So the team built it synthetically.
Training used paired counterfactual videos generated from two sources: HUMOTO — human-object interactions rendered in Blender with physics simulation — and Kubric — object-only interactions using Google Scanned Objects.
HUMOTO uses motion-capture data of human-object interactions. The key mechanic is a Blender re-simulation: the scene is set up with a human and objects, rendered once with the human present, then the human is removed from the simulation and physics is re-run forward from that point. The result is a physically correct counterfactual — objects that were being held or supported now fall, exactly as they should. Kubric, developed by Google Research, applies the same idea to object-object collisions. Together, they produce a dataset of paired videos where the physics is provably correct, not approximated by a human annotator.
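The re-simulation idea can be illustrated with a deliberately tiny toy, which is not HUMOTO's or Kubric's actual code: simulate a clip once with the support present, then remove the support and re-run the physics forward, yielding a paired counterfactual where a held object now falls. All constants and names here are illustrative.

```python
# Toy counterfactual re-simulation: an object held at 1.0 m while supported,
# falling under gravity once the support is removed.

G = 9.8        # gravity, m/s^2
DT = 1 / 30    # one frame at 30 fps

def simulate(frames, supported_until):
    """Return the object's height per frame.

    The object is held fixed until frame `supported_until`; after that,
    physics takes over and it falls (clamped at the floor, y = 0).
    """
    y, v, traj = 1.0, 0.0, []
    for t in range(frames):
        if t >= supported_until:       # support removed: gravity applies
            v -= G * DT
            y = max(0.0, y + v * DT)   # simple Euler step, floor at 0
        traj.append(y)
    return traj

# Paired videos of the same scene: one with the "person" holding the
# object for the whole clip, one where the person is removed at frame 0.
with_person = simulate(frames=30, supported_until=30)
without_person = simulate(frames=30, supported_until=0)
```

The pair (`with_person`, `without_person`) is the shape of supervision the model needs: identical scenes up to the removal, physically correct divergence afterwards, with no human annotator in the loop.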
Key Takeaways
VOID goes beyond pixel-filling. Unlike existing video inpainting tools that only correct visual artifacts like shadows and reflections, VOID understands physical causality — if you remove a person holding an object, the object falls naturally in the output video.
The quadmask is the core innovation. Instead of a simple binary remove/keep mask, VOID uses a 4-value quadmask (values 0, 63, 127, 255) that encodes not just what to remove, but which surrounding regions of the scene will be physically affected — giving the diffusion model structured scene understanding to work with.
Two-pass inference solves a real failure mode. Pass 1 handles most videos; Pass 2 exists specifically to fix object morphing artifacts — a known weakness of video diffusion models — by using optical flow-warped latents from Pass 1 as initialization for a second diffusion run.
Synthetic paired data made training possible. Since real-world paired counterfactual video data doesn’t exist at scale, the research team built it using Blender physics re-simulation (HUMOTO) and Google’s Kubric framework, generating ground-truth before/after video pairs where the physics is provably correct.
Check out the Paper, Model Weights, and Repo.
The post Netflix AI Team Just Open-Sourced VOID: an AI Model That Erases Objects From Videos — Physics and All appeared first on MarkTechPost.

