Lance

Name: Lance
Availability: OnlineOnly
Author: ByteDance

A 3B-parameter open-source unified multimodal model from ByteDance supporting image and video understanding, generation, and editing within a single framework.

At a Glance

Pricing

Open Source

Fully open-source under Apache License 2.0. Free to use, modify, and distribute.

Available On

Windows

API

CLI

ByteDance
Building 2, 1733 Commercial Space
Est. 2012
$9.4B+ raised

Listed May 2026

About Lance

Lance is a 3B-active-parameter native unified multimodal model developed by researchers at ByteDance, released on GitHub under the Apache License 2.0. It handles image generation, image editing, video generation, video editing, image understanding, and video understanding all within a single model architecture. The repository was created in May 2026 and accompanies an arXiv paper (2605.18678) titled "Lance: Unified Multimodal Modeling by Multi-Task Synergy."

What It Is

Lance is a research model in the category of unified multimodal models — systems that combine visual understanding and visual generation in one framework rather than relying on separate specialist models. The core design keeps a shared interleaved sequence for text, image, and video context, then separates semantic understanding and visual generation through dedicated experts. According to the project page, it uses semantic ViT tokens for understanding, clean/noisy VAE latents for generation, generalized 3D causal attention, and a component called MaPE to reduce positional interference among heterogeneous visual tokens. The transformer backbone is trained entirely from scratch (except for the ViT and VAE encoders) using a staged multi-task recipe on a 128-A100-GPU budget.
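The split between understanding and generation experts can be pictured with a small routing sketch. This is an illustration only, not the project's code: the `Segment` class, the token-kind labels, and the rule that text and ViT tokens go to the understanding expert while VAE latents go to the generation expert are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical token-kind labels based on the description above.
# The routing rule is an assumption, not taken from the Lance code.
UNDERSTANDING_KINDS = {"text", "vit"}
GENERATION_KINDS = {"vae_clean", "vae_noisy"}


@dataclass(frozen=True)
class Segment:
    kind: str    # "text", "vit", "vae_clean", or "vae_noisy"
    length: int  # number of tokens in this segment


def route(segment: Segment) -> str:
    """Return the name of the expert assumed to process this segment."""
    if segment.kind in UNDERSTANDING_KINDS:
        return "understanding"
    if segment.kind in GENERATION_KINDS:
        return "generation"
    raise ValueError(f"unknown token kind: {segment.kind!r}")


def expert_per_token(sequence: list[Segment]) -> list[str]:
    """Expand a shared interleaved sequence into one expert label per token."""
    labels: list[str] = []
    for segment in sequence:
        labels.extend([route(segment)] * segment.length)
    return labels


# An editing-style context: instruction text, the source image as ViT tokens
# and clean latents, and the target image as noisy latents to be denoised.
sequence = [
    Segment("text", 3),
    Segment("vit", 2),
    Segment("vae_clean", 2),
    Segment("vae_noisy", 2),
]
assignment = expert_per_token(sequence)
```

All tokens stay in one sequence, so both experts can share context; only the per-token processing path differs in this sketch.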

Supported Tasks

Lance covers six distinct inference tasks through a unified command-line interface:

t2i — Text-to-image generation
t2v — Text-to-video generation
image_edit — Instruction-guided image editing
video_edit — Instruction-guided video editing
x2t_image — Image understanding (visual question answering, chart reasoning, OCR)
x2t_video — Video understanding (video QA, captioning, temporal reasoning)
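As a quick reference, the six task identifiers above can be held in a small lookup table. The sketch below is a convenience for scripting around the CLI; the grouping into generation, editing, and understanding and the helper names `describe` and `tasks_in_group` are assumptions for illustration and are not part of the Lance repository.

```python
# Task identifiers as listed above; the group labels are an illustrative assumption.
TASKS = {
    "t2i": ("generation", "Text-to-image generation"),
    "t2v": ("generation", "Text-to-video generation"),
    "image_edit": ("editing", "Instruction-guided image editing"),
    "video_edit": ("editing", "Instruction-guided video editing"),
    "x2t_image": ("understanding", "Image understanding"),
    "x2t_video": ("understanding", "Video understanding"),
}


def describe(task: str) -> str:
    """Return the human-readable description for a task identifier."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASKS)}")
    return TASKS[task][1]


def tasks_in_group(group: str) -> list[str]:
    """Return the sorted task identifiers belonging to one group."""
    return sorted(name for name, (g, _) in TASKS.items() if g == group)


print(describe("t2v"))
print(tasks_in_group("editing"))
```

Validating the identifier before launching a job catches typos early, which matters when each run occupies a large GPU.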

Multi-turn consistency editing is also demonstrated, where a sequence of linked edits (replacement, accessory addition, background rewrite, motion update) is applied to the same subject across turns.

Benchmark Performance

The project page and README publish detailed benchmark comparisons against both generation-only and unified model baselines:

GenEval (image generation): Lance at 3B parameters ties the best overall score (0.90) among listed unified models, matching TUNA at 7B.
DPG-Bench (image generation): Lance scores 84.67 overall, with particularly strong relation grounding (93.38).
GEdit-Bench (image editing): Lance reports the best average score (7.30) among listed unified models, ahead of InternVL-U with CoT (6.88) and BAGEL (6.52).
VBench (video generation): Lance achieves a total score of 85.11, the highest in the unified model group, above TUNA at 84.06.
MVBench (video understanding): Lance scores 62.0 average, the best among listed unified models.

These results are vendor-published comparisons from the project's own paper and website.
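To make the size of the reported leads concrete, the arithmetic can be written out. The sketch below uses only the scores quoted above for GEdit-Bench and VBench; the dictionary layout and the helper name `margin_over_best_baseline` are illustrative choices, and the numbers remain vendor-published.

```python
# Scores as quoted in the listing above (vendor-published).
REPORTED = {
    "GEdit-Bench": {"Lance": 7.30, "InternVL-U (CoT)": 6.88, "BAGEL": 6.52},
    "VBench": {"Lance": 85.11, "TUNA": 84.06},
}


def margin_over_best_baseline(scores: dict[str, float], model: str = "Lance") -> float:
    """Difference between the model's score and the best other listed score."""
    best_other = max(value for name, value in scores.items() if name != model)
    # Round to two decimals to hide floating-point noise in the subtraction.
    return round(scores[model] - best_other, 2)


margins = {bench: margin_over_best_baseline(s) for bench, s in REPORTED.items()}
print(margins)  # {'GEdit-Bench': 0.42, 'VBench': 1.05}
```

The two benchmarks use different scales, so the margins are not comparable with each other.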

Architecture and Efficiency Angle

A key design goal stated by the authors is efficiency at the 3B scale. The model delivers competitive results across image generation, image editing, and video generation benchmarks while using only 3B active parameters, fewer than most competing unified models, which typically range from 4B to 13B. The project acknowledges that only the transformer backbone is trained from scratch; the ViT and VAE encoders are not. Inference requires a GPU with at least 40GB VRAM, Python 3.10+, and CUDA 12.4+.
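The stated inference requirements can be checked before setup with a small preflight helper. This is a minimal sketch, not a script from the repository: the function names are hypothetical, and the caller supplies the environment values so the check stays free of GPU dependencies.

```python
# Minimums taken from the requirements stated above.
MIN_PYTHON = (3, 10)
MIN_CUDA = (12, 4)
MIN_VRAM_GB = 40


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '12.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def unmet_requirements(python_version: str, cuda_version: str, vram_gb: float) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if parse_version(python_version)[:2] < MIN_PYTHON:
        problems.append(f"Python {python_version} is older than 3.10")
    if parse_version(cuda_version)[:2] < MIN_CUDA:
        problems.append(f"CUDA {cuda_version} is older than 12.4")
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"{vram_gb} GB VRAM is below the 40 GB minimum")
    return problems


ok = unmet_requirements("3.10.12", "12.4", 80)
bad = unmet_requirements("3.9.18", "12.1", 24)
print(ok)
print(len(bad))
```

In practice the inputs could come from `sys.version_info` and, if PyTorch is installed, from `torch.version.cuda` and `torch.cuda.get_device_properties(0).total_memory`, which reports bytes.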

Setup and Deployment

Lance is deployed by cloning the repository, running a setup script (setup_env.sh), and downloading model checkpoints from Hugging Face (bytedance-research/Lance). Two checkpoint variants are available: Lance_3B for image tasks and Lance_3B_Video for video tasks. A unified shell script (inference_lance.sh) handles all six task types with configurable parameters including number of GPUs, denoising steps, CFG scale, resolution preset, and frame count. A Gradio interface is also provided for interactive text-to-video and video-to-text use. Ready-to-run benchmark scripts are included under a benchmarks/ directory.
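Because the two checkpoint variants are split by modality, a wrapper script needs to pick the right one for each task. The sketch below shows one way to do that. The assignment of the six task identifiers to image and video sets is inferred from their names, and `checkpoint_for` is a hypothetical helper, not something the repository provides.

```python
# Hugging Face repository named in the listing.
HF_REPO = "bytedance-research/Lance"

# Assumed split of task identifiers by modality (inferred from the task names).
IMAGE_TASKS = {"t2i", "image_edit", "x2t_image"}
VIDEO_TASKS = {"t2v", "video_edit", "x2t_video"}


def checkpoint_for(task: str) -> str:
    """Return the checkpoint variant assumed to serve the given task."""
    if task in IMAGE_TASKS:
        return "Lance_3B"
    if task in VIDEO_TASKS:
        return "Lance_3B_Video"
    raise ValueError(f"unknown task {task!r}")


for task in ("t2i", "video_edit"):
    print(task, "->", checkpoint_for(task), "from", HF_REPO)
```

The actual flags accepted by inference_lance.sh are not listed here, so the sketch stops at selecting the variant rather than building a command line.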

Current Status

The repository was created on May 15, 2026, and last updated on May 21, 2026. The authors note they are "actively updating and improving this repository." The accompanying arXiv preprint (2605.18678) was submitted in May 2026. Model weights are publicly available on Hugging Face at bytedance-research/Lance.

Capabilities

Key Features

Text-to-image generation
Text-to-video generation
Instruction-guided image editing
Instruction-guided video editing
Image understanding and visual question answering
Video understanding and captioning
Multi-turn consistency editing
Unified command-line inference interface
Gradio interactive demo
Configurable denoising steps, CFG scale, and resolution
Ready-to-run benchmark evaluation scripts
Supports up to 121 frames for video generation

Integrations

Hugging Face (model weights)

Gradio (interactive UI)

CUDA

Python
