STARFlow

Name: STARFlow
Availability: OnlineOnly
Author: Apple

STARFlow is Apple's open-source transformer autoregressive flow model for high-quality text-to-image and text-to-video generation, combining autoregressive models with normalizing flows.

At a Glance

Pricing

Open Source

Freely available open-source code and pretrained model weights on GitHub and Hugging Face.

Available On

CLI

API

Apple: Apple Park, California. Est. 1976. $1300+ raised.

Listed May 2026

About STARFlow

STARFlow is Apple's official open-source release of a novel transformer autoregressive flow architecture for high-quality image and video generation. The project, hosted on GitHub under the apple organization, covers both STARFlow (text-to-image) and STARFlow-V (text-to-video), with pretrained model checkpoints available on Hugging Face.

What It Is

STARFlow is a generative AI research framework that combines the expressiveness of autoregressive models with the efficiency of normalizing flows. Rather than relying on diffusion-based approaches, it introduces a "deep-shallow" transformer block architecture that processes latent representations through normalizing flow layers. The result is a family of models capable of generating high-resolution images and temporally consistent videos from text prompts.
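For orientation, the generic affine autoregressive flow that this family of models builds on can be written as below. This is the textbook formulation rather than anything taken from the STARFlow code; the symbols (x_t for the t-th latent token, z_t for its noise-space counterpart, and the transformer outputs \mu_\theta and \alpha_\theta) are notation assumed here for illustration.

```latex
\begin{align}
  % Density direction: every z_t can be computed in parallel from the known x.
  z_t &= \bigl(x_t - \mu_\theta(x_{<t})\bigr) \odot \exp\bigl(-\alpha_\theta(x_{<t})\bigr) \\
  % Sampling direction: x_t needs the already generated x_{<t}.
  x_t &= z_t \odot \exp\bigl(\alpha_\theta(x_{<t})\bigr) + \mu_\theta(x_{<t}) \\
  % Exact log-likelihood via the change of variables (triangular Jacobian).
  \log p_\theta(x) &= \sum_{t=1}^{T} \Bigl[ \log \mathcal{N}(z_t;\, 0, I)
      - \mathbf{1}^{\top} \alpha_\theta(x_{<t}) \Bigr]
\end{align}
```

The second line is why sampling is sequential by default, while the third is what lets a flow be trained by exact maximum likelihood instead of a diffusion-style denoising objective.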

Architecture and Model Family

The repository ships two model variants with code and pretrained weights:

STARFlow (3B parameters): Text-to-image generation at 256×256 resolution. Uses a 6-block deep-shallow architecture, T5-XL text encoder, SD-VAE, and RoPE positional encoding.
STARFlow-V (7B parameters): Text-to-video generation at up to 640×480 (480p). Supports up to 481 frames (~30 seconds at 16 FPS) with causal temporal attention and WAN2.2-VAE.

Two follow-on research directions, STARFlow2 and NTM (Normalizing Trajectory Models), have published papers, but their code is listed as "TBD."
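The "~30 seconds" figure quoted for STARFlow-V follows directly from the frame count and frame rate. A minimal check, where the helper name is made up for illustration:

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Length of a clip in seconds, given its frame count and frame rate."""
    return num_frames / fps


# 481 frames at 16 FPS, the maximum quoted for STARFlow-V.
duration = clip_duration_seconds(481, 16)
print(f"{duration:.4f} s")
```

At 16 FPS, 481 frames is 30 full seconds of motion plus one extra frame, which is where the approximate 30-second figure comes from.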

A key inference optimization is block-wise Jacobi iteration, which accelerates sampling by enabling parallel convergence across token blocks rather than strictly sequential decoding.
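The idea behind block-wise Jacobi iteration can be illustrated on a toy affine autoregressive flow. The sketch below is not STARFlow's implementation: the tiny causal "network" (a strictly lower-triangular weight matrix followed by tanh and sin), the function names, and the tolerance are all assumptions chosen so the example runs in plain Python. Within each block, every token is updated from the same previous guess, which is the step a real model would run as one parallel forward pass.

```python
import math
import random


def make_causal_weights(T, seed=0):
    """Toy weights: W[t][j] is nonzero only for j < t (strictly causal)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.3) if j < t else 0.0 for j in range(T)]
            for t in range(T)]


def invert_token(W, z, x, t):
    """x_t = z_t * exp(alpha_t) + mu_t, with alpha_t and mu_t computed from x_<t."""
    c = sum(W[t][j] * x[j] for j in range(t))
    return z[t] * math.exp(0.1 * math.sin(c)) + math.tanh(c)


def sample_sequential(W, z):
    """Baseline: decode one token at a time, each using the finished prefix."""
    x = [0.0] * len(z)
    for t in range(len(z)):
        x[t] = invert_token(W, z, x, t)
    return x


def sample_block_jacobi(W, z, block_size, tol=1e-10):
    """Decode block by block; inside a block, iterate a parallel fixed-point update."""
    T = len(z)
    x = [0.0] * T
    iterations = 0
    for start in range(0, T, block_size):
        stop = min(start + block_size, T)
        # A block of length L is exact after at most L Jacobi iterations.
        for _ in range(stop - start):
            # All tokens in the block read the same previous guess x.
            new = [invert_token(W, z, x, t) for t in range(start, stop)]
            delta = max(abs(n - o) for n, o in zip(new, x[start:stop]))
            x[start:stop] = new
            iterations += 1
            if delta < tol:  # converged early, stop iterating this block
                break
    return x, iterations


rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(16)]
W = make_causal_weights(16)
x_seq = sample_sequential(W, z)
x_jac, iters = sample_block_jacobi(W, z, block_size=4)
print(max(abs(a - b) for a, b in zip(x_seq, x_jac)), iters)
```

Because the dependency structure is triangular, each Jacobi iteration fixes at least one more token in the block, so the result matches sequential decoding. The speedup in practice comes from blocks that converge in fewer iterations than their length, with each iteration costing one parallel pass rather than one pass per token.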

Research Lineage and Recognition

The STARFlow paper (arXiv:2506.06276) was accepted as a NeurIPS 2025 Spotlight, and STARFlow-V (arXiv:2511.20462) received a CVPR 2026 Highlight designation, according to the repository's own badges and citations. The project cites four arXiv papers in total, reflecting an active research program at Apple spanning image synthesis, video generation, and unified multimodal generation.

Setup and Usage Path

The repository targets ML researchers and practitioners comfortable with Python and distributed training. Setup involves:

Cloning the repo, then either creating a conda environment via scripts/setup_conda.sh or installing dependencies with pip install -r requirements.txt
Downloading pretrained checkpoints from Hugging Face into a local ckpts/ directory
Running inference via torchrun with provided shell scripts for both image and video generation
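The last two steps above can be sketched as a small pre-flight helper: check that checkpoints are present under ckpts/, then assemble a single-node torchrun command. This is only an illustration. The repository provides its own shell scripts for launching, and the script name "sample.py" below is a hypothetical placeholder rather than a file known to exist in the repo; the two torchrun options used are standard ones.

```python
from pathlib import Path


def list_checkpoints(ckpt_dir="ckpts"):
    """Return the files found under the local checkpoint directory, sorted by path."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


def build_torchrun_command(script, num_gpus):
    """Assemble a single-node torchrun invocation as an argument list."""
    # --standalone and --nproc_per_node are standard torchrun options.
    return ["torchrun", "--standalone", f"--nproc_per_node={num_gpus}", script]


if __name__ == "__main__":
    if not list_checkpoints():
        print("No checkpoints found in ckpts/; download them from Hugging Face first.")
    # "sample.py" is a hypothetical placeholder for the real entry point.
    print(" ".join(build_torchrun_command("sample.py", num_gpus=1)))
```

Returning the command as a list keeps it ready for subprocess.run without shell quoting issues; any model-specific arguments would come from the repository's provided scripts.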

Training is supported via FSDP (Fully Sharded Data Parallel) for large-scale distributed runs, with gradient checkpointing available to reduce memory usage. The repository includes separate training scripts for image and video tasks, along with dry-run validation flags.
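To see why sharding matters at these model sizes, a rough back-of-the-envelope estimate helps. The sketch below assumes mixed-precision Adam at 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights and the two Adam moments). That figure is a common rule of thumb, not a number taken from the STARFlow repository, and the function name is made up for illustration.

```python
def sharded_state_gb(num_params, num_gpus, bytes_per_param=16):
    """Per-GPU memory in GB (1e9 bytes) for weights, gradients, and optimizer
    state when all three are sharded evenly across num_gpus devices.

    Activations are excluded; those are what gradient checkpointing reduces.
    """
    return num_params * bytes_per_param / num_gpus / 1e9


for name, params in [("3B model", 3_000_000_000), ("7B model", 7_000_000_000)]:
    unsharded = sharded_state_gb(params, num_gpus=1)
    sharded = sharded_state_gb(params, num_gpus=8)
    print(f"{name}: {unsharded:.0f} GB unsharded, {sharded:.0f} GB per GPU on 8 GPUs")
```

Under these assumptions the training state of a 7B-parameter model does not fit on a single typical accelerator without sharding, while FSDP divides it across devices and gradient checkpointing trades extra compute for lower activation memory.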

Update: Active Development as of May 2026

The repository was created in October 2025, and its most recent push was in May 2026; it had 563 stars and 39 forks as of the latest metadata. The codebase covers STARFlow and STARFlow-V with full training and inference support, while STARFlow2 and NTM remain paper-only releases with code marked as forthcoming. The project is released under a custom Apple license (separate LICENSE and LICENSE_MODEL files) rather than a standard OSI-approved license.

Pricing

Open Source

Full source code access
Pretrained model checkpoints via Hugging Face
Text-to-image generation (3B model)
Text-to-video generation (7B model)
Training scripts with FSDP support

Capabilities

Key Features

Text-to-image generation (256×256)
Text-to-video generation (up to 480p, ~30 seconds)
Text-image-to-video (TI2V) generation
Transformer autoregressive flow architecture
Block-wise Jacobi iteration for fast sampling
FSDP support for distributed training
Variable-length video generation
Classifier-free guidance
RoPE positional encoding
Causal temporal attention for video
Gradient checkpointing for memory efficiency
Configurable aspect ratios and resolutions

Integrations

Hugging Face (model checkpoints)

T5-XL (text encoder)

SD-VAE

WAN2.2-VAE

PyTorch

torchrun

conda

wandb (training logging)

