STARFlow
STARFlow is Apple's open-source transformer autoregressive flow model for high-quality text-to-image and text-to-video generation, combining autoregressive models with normalizing flows.
At a Glance
Freely available open-source code and pretrained model weights on GitHub and Hugging Face.
Listed May 2026
About STARFlow
STARFlow is Apple's official open-source release of a novel transformer autoregressive flow architecture for high-quality image and video generation. The project, hosted on GitHub under the apple organization, covers both STARFlow (text-to-image) and STARFlow-V (text-to-video), with pretrained model checkpoints available on Hugging Face.
What It Is
STARFlow is a generative AI research framework that combines the expressiveness of autoregressive models with the efficiency of normalizing flows. Rather than relying on diffusion-based approaches, it introduces a "deep-shallow" transformer block architecture that processes latent representations through normalizing flow layers. The result is a family of models capable of generating high-resolution images and temporally consistent videos from text prompts.
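For readers new to normalizing flows, the generic change-of-variables objective behind an autoregressive flow can be written as below. This is the textbook formulation for an affine autoregressive transform, not necessarily STARFlow's exact parameterization; the symbols f, \mu_i, \sigma_i, and D are introduced here purely for illustration.

```latex
\begin{aligned}
z_i &= \frac{x_i - \mu_i(x_{<i})}{\sigma_i(x_{<i})}, \qquad i = 1, \dots, D, \\
\log p_X(x) &= \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|
             = \log p_Z(z) - \sum_{i=1}^{D} \log \sigma_i(x_{<i}).
\end{aligned}
```

Because each z_i depends only on x_1, ..., x_i, the Jacobian is triangular and its determinant is the product of the diagonal terms 1/\sigma_i, so the likelihood is exact and cheap to evaluate.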
Architecture and Model Family
The project ships two primary model variants and lists two follow-on research directions:
- STARFlow (3B parameters): Text-to-image generation at 256×256 resolution. Uses a 6-block deep-shallow architecture, T5-XL text encoder, SD-VAE, and RoPE positional encoding.
- STARFlow-V (7B parameters): Text-to-video generation at up to 640×480 (480p). Supports up to 481 frames (~30 seconds at 16 FPS) with causal temporal attention and WAN2.2-VAE.
- STARFlow2 and NTM (Normalizing Trajectory Models): Two follow-on research directions with papers published but code listed as "TBD."
A key inference optimization is block-wise Jacobi iteration, which accelerates sampling by enabling parallel convergence across token blocks rather than strictly sequential decoding.
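The idea behind block-wise Jacobi iteration can be illustrated on a toy affine autoregressive transform. The sketch below is not the repository's implementation: the conditioner, the function names (make_conditioner, forward, sequential_inverse, blockwise_jacobi_inverse), and the block size are all hypothetical and chosen only to show the mechanics. Within a block, every position is updated from the previous iterate, so those updates are independent and could run in parallel; blocks are still processed in order.

```python
import math
import random


def make_conditioner(n, seed=0):
    """Hypothetical toy conditioner: position i only sees x[0..i-1]."""
    rng = random.Random(seed)
    w_mu = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    w_s = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]

    def cond(x, i):
        # Summing over j < i keeps the transform strictly autoregressive.
        mu = math.tanh(sum(w_mu[i][j] * x[j] for j in range(i)))
        log_scale = 0.5 * math.tanh(sum(w_s[i][j] * x[j] for j in range(i)))
        return mu, log_scale

    return cond


def forward(x, cond):
    """Data -> latent: z_i = (x_i - mu_i) * exp(-s_i)."""
    z = []
    for i in range(len(x)):
        mu, s = cond(x, i)
        z.append((x[i] - mu) * math.exp(-s))
    return z


def sequential_inverse(z, cond):
    """Baseline sampler: one position at a time, strictly in order."""
    x = []
    for i in range(len(z)):
        mu, s = cond(x, i)
        x.append(z[i] * math.exp(s) + mu)
    return x


def blockwise_jacobi_inverse(z, cond, block_size, tol=1e-9):
    """Invert block by block, using Jacobi fixed-point sweeps inside each block."""
    n = len(z)
    x = [0.0] * n
    sweeps = 0
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # Each sweep makes at least one more position in the block exact,
        # so (end - start) sweeps are always enough.
        for _ in range(end - start):
            new_block = []
            for i in range(start, end):
                # Every position reads the previous iterate only, so this
                # inner loop is what a parallel implementation would batch.
                mu, s = cond(x, i)
                new_block.append(z[i] * math.exp(s) + mu)
            delta = max(abs(a - b) for a, b in zip(new_block, x[start:end]))
            x[start:end] = new_block
            sweeps += 1
            if delta < tol:  # stop early once the block has stopped changing
                break
    return x, sweeps
```

In this toy setting the Jacobi result matches the sequential sampler; any speedup in practice comes from a block converging in fewer sweeps than it has positions, with each sweep executed as one parallel pass.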
Research Lineage and Recognition
The STARFlow paper (arXiv:2506.06276) was accepted as a NeurIPS 2025 Spotlight, and STARFlow-V (arXiv:2511.20462) received a CVPR 2026 Highlight designation, according to the repository's own badges and citations. The project cites four arXiv papers in total, reflecting an active research program at Apple spanning image synthesis, video generation, and unified multimodal generation.
Setup and Usage Path
The repository targets ML researchers and practitioners comfortable with Python and distributed training. Setup involves:
- Cloning the repo and creating a conda environment via scripts/setup_conda.sh or pip install -r requirements.txt
- Downloading pretrained checkpoints from Hugging Face into a local ckpts/ directory
- Running inference via torchrun with provided shell scripts for both image and video generation
Training is supported via FSDP (Fully Sharded Data Parallel) for large-scale distributed runs, with gradient checkpointing available to reduce memory usage. The repository includes separate training scripts for image and video tasks, along with dry-run validation flags.
Update: Active Development as of May 2026
The repository was created in October 2025 and last updated in May 2026; the latest metadata shows 563 stars and 39 forks. The codebase covers STARFlow and STARFlow-V with full training and inference support, while STARFlow2 and NTM remain paper-only releases with code marked as forthcoming. The project is licensed under a custom Apple license (separate LICENSE and LICENSE_MODEL files), not a standard OSI-approved license.
Pricing
Open Source
- Full source code access
- Pretrained model checkpoints via Hugging Face
- Text-to-image generation (3B model)
- Text-to-video generation (7B model)
- Training scripts with FSDP support
Capabilities
Key Features
- Text-to-image generation (256×256)
- Text-to-video generation (up to 480p, ~30 seconds)
- Text-image-to-video (TI2V) generation
- Transformer autoregressive flow architecture
- Block-wise Jacobi iteration for fast sampling
- FSDP support for distributed training
- Variable-length video generation
- Classifier-free guidance
- RoPE positional encoding
- Causal temporal attention for video
- Gradient checkpointing for memory efficiency
- Configurable aspect ratios and resolutions
