A 3B-parameter open-source unified multimodal model from ByteDance supporting image and video understanding, generation, and editing within a single framework.
At a Glance
Fully open-source under Apache License 2.0. Free to use, modify, and distribute.
Listed May 2026
About Lance
Lance is a 3B-active-parameter native unified multimodal model developed by researchers at ByteDance, released on GitHub under the Apache License 2.0. It handles image generation, image editing, video generation, video editing, image understanding, and video understanding all within a single model architecture. The repository was created in May 2026 and accompanies an arXiv paper (2605.18678) titled "Lance: Unified Multimodal Modeling by Multi-Task Synergy."
What It Is
Lance is a research model in the category of unified multimodal models — systems that combine visual understanding and visual generation in one framework rather than relying on separate specialist models. The core design keeps a shared interleaved sequence for text, image, and video context, then separates semantic understanding and visual generation through dedicated experts. According to the project page, it uses semantic ViT tokens for understanding, clean/noisy VAE latents for generation, generalized 3D causal attention, and a component called MaPE to reduce positional interference among heterogeneous visual tokens. The transformer backbone is trained entirely from scratch (except for the ViT and VAE encoders) using a staged multi-task recipe on a 128-A100-GPU budget.
Supported Tasks
Lance covers six distinct inference tasks through a unified command-line interface:
- t2i — Text-to-image generation
- t2v — Text-to-video generation
- image_edit — Instruction-guided image editing
- video_edit — Instruction-guided video editing
- x2t_image — Image understanding (visual question answering, chart reasoning, OCR)
- x2t_video — Video understanding (video QA, captioning, temporal reasoning)
Multi-turn consistency editing is also demonstrated, where a sequence of linked edits (replacement, accessory addition, background rewrite, motion update) is applied to the same subject across turns.
Benchmark Performance
The project page and README publish detailed benchmark comparisons against both generation-only and unified model baselines:
- GenEVAL (image generation): Lance at 3B parameters ties the best overall score (0.90) among listed unified models, matching TUNA at 7B.
- DPG-Bench (image generation): Lance scores 84.67 overall, with particularly strong relation grounding (93.38).
- GEdit-Bench (image editing): Lance reports the best average score (7.30) among listed unified models, ahead of InternVL-U with CoT (6.88) and BAGEL (6.52).
- VBench (video generation): Lance achieves a total score of 85.11, the highest in the unified model group, above TUNA at 84.06.
- MVBench (video understanding): Lance scores 62.0 average, the best among listed unified models.
These results are self-reported: they come from the project's own paper and website, not from independent evaluation.
Architecture and Efficiency Angle
A key design goal stated by the authors is efficiency at the 3B scale. The model reports competitive results across image generation, image editing, and video generation benchmarks while using only 3B active parameters, fewer than most competing unified models (which typically range from 4B to 13B). As noted above, only the transformer backbone is trained from scratch; the ViT and VAE encoders are not. Inference requires a GPU with at least 40GB VRAM, Python 3.10+, and CUDA 12.4+.
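The stated requirements can be checked before attempting inference. The following is a minimal sketch and not part of the Lance repository: the names `check_requirements`, `parse_version`, and `probe_local_machine` are hypothetical, the thresholds are simply the figures quoted above, and reading "40GB" as 40 decimal gigabytes is an assumption.

```python
import sys

MIN_PYTHON = (3, 10)
MIN_CUDA = (12, 4)
MIN_VRAM_BYTES = 40 * 10**9  # assumption: "40GB" read as decimal gigabytes


def parse_version(text):
    """Turn a version string such as '12.4' into a comparable tuple (12, 4)."""
    return tuple(int(part) for part in text.split(".")[:2])


def check_requirements(python_version, cuda_version, vram_bytes):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10 or newer is required")
    if cuda_version is None or parse_version(cuda_version) < MIN_CUDA:
        problems.append("CUDA 12.4 or newer is required")
    if vram_bytes < MIN_VRAM_BYTES:
        problems.append("at least 40 GB of GPU memory is required")
    return problems


def probe_local_machine():
    """Collect the three inputs from the local machine via PyTorch, if installed.

    Note: torch.version.cuda is the CUDA version PyTorch was built against,
    which may differ from the installed driver or toolkit version.
    """
    import torch  # imported lazily so the pure check above works without it

    if not torch.cuda.is_available():
        return sys.version_info, torch.version.cuda, 0
    vram = torch.cuda.get_device_properties(0).total_memory
    return sys.version_info, torch.version.cuda, vram
```

On a machine with PyTorch installed, `check_requirements(*probe_local_machine())` would report any shortfall for the first GPU. Keeping the comparison separate from the probing makes the thresholds easy to adjust if the project's requirements change.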
Setup and Deployment
Lance is deployed by cloning the repository, running a setup script (setup_env.sh), and downloading model checkpoints from Hugging Face (bytedance-research/Lance). Two checkpoint variants are available: Lance_3B for image tasks and Lance_3B_Video for video tasks. A unified shell script (inference_lance.sh) handles all six task types with configurable parameters including number of GPUs, denoising steps, CFG scale, resolution preset, and frame count. A Gradio interface is also provided for interactive text-to-video and video-to-text use. Ready-to-run benchmark scripts are included under a benchmarks/ directory.
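The split between the two checkpoint variants can be expressed as a small lookup. This is a sketch under stated assumptions, not code from the repository: the function names are hypothetical, and it assumes the understanding tasks follow the same image/video split as the generation and editing tasks. The download helper uses `snapshot_download` from the `huggingface_hub` library and fetches the whole repository, since the internal file layout is not described here.

```python
# Task identifiers and checkpoint names are taken from the listing text.
IMAGE_TASKS = {"t2i", "image_edit", "x2t_image"}
VIDEO_TASKS = {"t2v", "video_edit", "x2t_video"}
HF_REPO_ID = "bytedance-research/Lance"


def checkpoint_for_task(task):
    """Pick the checkpoint variant for a task (hypothetical helper).

    Assumption: x2t_image uses the image checkpoint and x2t_video the
    video checkpoint, mirroring the generation and editing tasks.
    """
    if task in IMAGE_TASKS:
        return "Lance_3B"
    if task in VIDEO_TASKS:
        return "Lance_3B_Video"
    raise ValueError(f"unknown task: {task!r}")


def download_weights(local_dir):
    """Download the full model repository from Hugging Face into local_dir."""
    # Imported lazily so the lookup above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=HF_REPO_ID, local_dir=local_dir)
```

For example, `checkpoint_for_task("video_edit")` returns `"Lance_3B_Video"`. How the chosen checkpoint is passed to `inference_lance.sh` is not shown, because the script's parameters are not spelled out in this listing.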
Current Status
The repository was created on May 15, 2026, and last updated on May 21, 2026, so the project is very new and still changing. The authors note they are "actively updating and improving this repository." The accompanying arXiv preprint (2605.18678) was submitted in May 2026, as its identifier indicates. Model weights are publicly available on Hugging Face at bytedance-research/Lance.
Pricing
Open Source
Capabilities
Key Features
- Text-to-image generation
- Text-to-video generation
- Instruction-guided image editing
- Instruction-guided video editing
- Image understanding and visual question answering
- Video understanding and captioning
- Multi-turn consistency editing
- Unified command-line inference interface
- Gradio interactive demo
- Configurable denoising steps, CFG scale, and resolution
- Ready-to-run benchmark evaluation scripts
- Supports up to 121 frames for video generation
