DiffusionBlocks
A principled framework for block-wise neural network training via a diffusion interpretation, reducing memory requirements in proportion to the number of blocks while maintaining competitive performance across transformer architectures.
At a Glance
Freely available under Apache License 2.0. Use, modify, and distribute without cost.
Listed May 2026
About DiffusionBlocks
DiffusionBlocks is an open-source research framework from Sakana AI that enables memory-efficient training of transformer-based neural networks by partitioning them into independently trainable blocks. Accepted at ICLR 2026, the work was authored by Makoto Shing, Masanori Koyama, and Takuya Akiba, and the official implementation is available on GitHub under the Apache License 2.0.
What It Is
DiffusionBlocks addresses a fundamental bottleneck in deep learning: end-to-end backpropagation requires storing activations across all layers simultaneously, which limits the size of the models that can be trained on available hardware. The framework reframes transformer residual connections as updates in a dynamical system, then converts those updates into a denoising process. Each block can then be trained independently with a score matching objective, so only one block's gradients need to be held in memory at a time, which reduces memory requirements in proportion to the number of blocks.
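The proportional memory claim can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the DiffusionBlocks repository: the function name `peak_activation_memory`, the layer count, and the per-layer activation size are all hypothetical, and it assumes equally sized layers split into blocks of equal depth.

```python
def peak_activation_memory(num_layers: int, per_layer_bytes: int, num_blocks: int = 1) -> int:
    """Rough peak activation memory when only one block is trained at a time.

    Assumes equally sized layers and blocks of equal depth; both are
    simplifications made for this illustration.
    """
    if num_layers % num_blocks != 0:
        raise ValueError("num_layers must be divisible by num_blocks in this sketch")
    layers_per_block = num_layers // num_blocks
    # Only the layers of the block being trained keep activations for backprop.
    return layers_per_block * per_layer_bytes


# Hypothetical 12-layer model with 512 MiB of stored activations per layer.
end_to_end = peak_activation_memory(num_layers=12, per_layer_bytes=512 * 2**20)
blockwise = peak_activation_memory(num_layers=12, per_layer_bytes=512 * 2**20, num_blocks=4)

print(end_to_end / 2**30, "GiB end-to-end")
print(blockwise / 2**30, "GiB with 4 blocks")
```

Under these assumptions, four blocks cut the activation footprint to a quarter of the end-to-end figure. Real savings depend on parameters, optimizer state, and other buffers that this estimate ignores.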
Core Technical Approach
The key insight in DiffusionBlocks is that residual connections in transformers naturally correspond to updates in a dynamical system. With minimal modifications, these updates can be recast as those of a denoising diffusion process, enabling each block to learn independently via score matching rather than requiring a global backpropagation pass. This is a theoretically grounded departure from prior block-wise training methods, which the paper characterizes as relying on ad-hoc local objectives.
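To make the idea of a local denoising objective concrete, here is a minimal sketch of a generic denoising loss evaluated for a single block. It is not the paper's exact formulation. It assumes that each block is responsible for one interval of noise levels, and the names `block_sigma_range` and `denoising_loss`, the log-uniform partition, and the `sigma_min` and `sigma_max` values are all illustrative choices.

```python
import math
import random
from typing import Callable, List, Tuple


def block_sigma_range(block_idx: int, num_blocks: int,
                      sigma_min: float = 0.002, sigma_max: float = 80.0) -> Tuple[float, float]:
    """Split [sigma_min, sigma_max] into equal log-width intervals, one per block.

    Hypothetical partition: block 0 gets the noisiest interval.
    """
    log_lo, log_hi = math.log(sigma_min), math.log(sigma_max)
    width = (log_hi - log_lo) / num_blocks
    hi = log_hi - block_idx * width
    lo = hi - width
    return math.exp(lo), math.exp(hi)


def denoising_loss(denoiser: Callable[[List[float], float], List[float]],
                   clean: List[float], block_idx: int, num_blocks: int,
                   rng: random.Random) -> float:
    """Mean squared error between the block's prediction and the clean target."""
    lo, hi = block_sigma_range(block_idx, num_blocks)
    # Sample a noise level log-uniformly inside this block's interval.
    sigma = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    noisy = [c + sigma * rng.gauss(0.0, 1.0) for c in clean]
    pred = denoiser(noisy, sigma)
    return sum((p - c) ** 2 for p, c in zip(pred, clean)) / len(clean)


rng = random.Random(0)
target = [0.5, -1.0, 2.0]
# A toy "denoiser" that returns its noisy input unchanged.
identity_loss = denoising_loss(lambda x, s: x, target, block_idx=0, num_blocks=4, rng=rng)
# An oracle that always returns the clean target, for comparison.
oracle_loss = denoising_loss(lambda x, s: target, target, block_idx=0, num_blocks=4, rng=rng)
print(identity_loss, oracle_loss)
```

Because the loss depends only on the block's own input, noise level, and target, it can be minimized without gradients from any other block, which is the property the paragraph above describes.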
- Each training step updates only one block at a time
- Total iterations are aligned with the end-to-end baseline by multiplying the number of epochs by the number of blocks
- Compatible with vision transformers (ViT), diffusion models, autoregressive models, recurrent-depth models, and masked diffusion architectures
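The first two points in the list above can be sketched as a simple step schedule. This is an illustration only: the function name `blockwise_schedule` is hypothetical, and the round-robin choice of which block to update is an assumption (an implementation could just as well sample the block at random).

```python
from collections import Counter
from typing import Iterator, Tuple


def blockwise_schedule(baseline_epochs: int, steps_per_epoch: int,
                       num_blocks: int) -> Iterator[Tuple[int, int]]:
    """Yield (step, block_index) pairs, updating exactly one block per step."""
    # Epochs are multiplied by the number of blocks, as described in the list above.
    total_epochs = baseline_epochs * num_blocks
    total_steps = total_epochs * steps_per_epoch
    for step in range(total_steps):
        # Round-robin block selection: an assumption made for this sketch.
        yield step, step % num_blocks


schedule = list(blockwise_schedule(baseline_epochs=10, steps_per_epoch=5, num_blocks=4))
updates_per_block = Counter(block for _, block in schedule)

print(len(schedule))
print(dict(updates_per_block))
```

With this schedule each block receives `baseline_epochs * steps_per_epoch` updates, the same number of updates every layer would get in the end-to-end baseline.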
Experimental Scope
The paper's experiments span a range of transformer architectures and tasks, going beyond the small-scale classification benchmarks that prior block-wise methods typically target. The official implementation focuses on image classification using Vision Transformers on CIFAR-100, with support for data augmentation schedules and cosine learning rate schedulers. The paper reports that DiffusionBlocks training matches the performance of end-to-end training across these diverse settings.
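For readers unfamiliar with the cosine learning rate scheduler mentioned above, the following is a generic sketch of such a schedule. It is not the repository's scheduler: the function name `cosine_lr`, the optional linear warmup, and all default values are assumptions chosen for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float,
              min_lr: float = 0.0, warmup_steps: int = 0) -> float:
    """Learning rate at `step` under cosine decay with optional linear warmup."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Half a cosine period: base_lr at progress 0, min_lr at progress 1.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, total_steps=100, base_lr=1e-3))
```

The rate starts at `base_lr`, passes through roughly half of it at the midpoint, and decays to `min_lr` at the final step.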
Setup and Requirements
The repository uses uv for dependency management and targets Python 3.12 with CUDA 12.2 on H100 GPUs. Setup requires logging into Hugging Face and Weights & Biases. The ViT implementation builds on Hugging Face Transformers, and the EDM implementation (EDM refers to the diffusion formulation from "Elucidating the Design Space of Diffusion-Based Generative Models") is based on Stability AI's generative-models codebase.
Update: ICLR 2026 Acceptance and arXiv v3
The paper was first submitted to arXiv in June 2025 (v1), revised in October 2025 (v2), and last updated in February 2026 (v3). It has been accepted to the 14th International Conference on Learning Representations (ICLR 2026). The GitHub repository was created in September 2025 and last pushed in February 2026, and had 94 stars and 5 forks as of the available data.
Pricing
Open Source
- Full source code access
- Apache License 2.0
- Block-wise ViT training on CIFAR-100
- Baseline and DiffusionBlocks training scripts
- Evaluation scripts
Capabilities
Key Features
- Block-wise transformer training with independent gradient computation per block
- Memory reduction proportional to number of blocks
- Score matching objective for local block training
- Compatible with vision, diffusion, autoregressive, recurrent-depth, and masked diffusion transformers
- CIFAR-100 image classification reference implementation
- Support for cosine learning rate scheduler and random augmentation
- Hugging Face and Weights & Biases integration
- uv-based dependency management
