DiffusionBlocks

Name: DiffusionBlocks
Availability: Online only
Author: Sakana AI

A principled framework for block-wise neural network training based on a diffusion interpretation. It reduces training memory in proportion to the number of blocks while maintaining competitive performance across transformer architectures.

At a Glance

Pricing

Open Source

Freely available under Apache License 2.0. Use, modify, and distribute without cost.

Available On

CLI

API

Sakana AI | Tokyo, Japan | Est. 2023 | $379,000,000 raised

Listed May 2026

About DiffusionBlocks

DiffusionBlocks is an open-source research framework from Sakana AI that enables memory-efficient training of transformer-based neural networks by partitioning them into independently trainable blocks. Accepted at ICLR 2026, the work was authored by Makoto Shing, Masanori Koyama, and Takuya Akiba, and the official implementation is available on GitHub under the Apache License 2.0.

What It Is

DiffusionBlocks addresses a fundamental bottleneck in deep learning: end-to-end backpropagation requires storing activations across all layers simultaneously, which limits the size of model that can be trained on available hardware. The framework reframes transformer residual connections as updates in a dynamical system, then converts those updates into a denoising process. This allows each block to be trained independently using a score matching objective, so only one block's gradients need to be held in memory at a time. Memory requirements therefore fall in proportion to the number of blocks.
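As a rough illustration of the proportional saving, the sketch below compares the activation memory held for the backward pass when all layers are trained end to end with the memory held when only one block is trained at a time. The function name, the layer count, and the per-layer activation size are illustrative assumptions, not figures from the paper, and the estimate ignores parameters, optimizer state, and framework overhead.

```python
def activation_memory_gib(num_layers: int, per_layer_gib: float, num_blocks: int = 1) -> float:
    """Estimate activation memory (GiB) kept for the backward pass.

    Assumes layers are split evenly into `num_blocks` blocks and that only
    the block currently being trained stores activations. This is a
    simplification: parameters and optimizer state are not counted.
    """
    if num_blocks < 1 or num_layers % num_blocks != 0:
        raise ValueError("num_blocks must be >= 1 and divide num_layers evenly")
    layers_per_block = num_layers // num_blocks
    return layers_per_block * per_layer_gib


# Hypothetical 12-layer transformer with 2 GiB of activations per layer.
end_to_end = activation_memory_gib(12, 2.0)               # all 12 layers held
blockwise = activation_memory_gib(12, 2.0, num_blocks=4)  # 3 layers held
print(f"end-to-end: {end_to_end} GiB, block-wise: {blockwise} GiB")
```

Under these assumptions, splitting the hypothetical model into four blocks cuts the activation footprint to a quarter of the end-to-end figure.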

Core Technical Approach

The key insight in DiffusionBlocks is that residual connections in transformers naturally correspond to updates in a dynamical system. With minimal modifications, these updates can be recast as those of a denoising diffusion process, enabling each block to learn independently via score matching rather than requiring a global backpropagation pass. This is a theoretically grounded departure from prior block-wise training methods, which the paper characterizes as relying on ad-hoc local objectives.
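The correspondence can be written out in a few lines. The notation below (x_l, f_theta, sigma, D_theta) is an assumption chosen for illustration and follows common diffusion-model conventions rather than the paper's exact formulation.

```latex
% Residual update of layer l, read as a unit-step Euler update
x_{l+1} = x_l + f_{\theta_l}(x_l)
\quad\Longleftrightarrow\quad
x(t + \Delta t) = x(t) + \Delta t \, f_\theta\bigl(x(t), t\bigr), \qquad \Delta t = 1

% Generic denoising score matching loss for a denoiser D_\theta at noise level \sigma
\mathcal{L}(\theta) =
\mathbb{E}_{y,\; \sigma,\; \epsilon \sim \mathcal{N}(0, I)}
\bigl\| D_\theta(y + \sigma \epsilon,\, \sigma) - y \bigr\|_2^2
```

Under this reading, a block can be trained as a denoiser on a loss of this form without gradients flowing in from other blocks.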

Each training step updates only one block at a time
Because each step updates a single block, the number of epochs is multiplied by the number of blocks to align total iterations with the end-to-end baseline
Compatible with vision transformers (ViT), diffusion models, autoregressive models, recurrent-depth models, and masked diffusion architectures
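The first two points can be sketched as a training schedule. This is a minimal illustration that assumes blocks are visited in round-robin order; the function name and the selection rule are hypothetical, and the official implementation may choose blocks differently (for example, by random sampling).

```python
from collections import Counter


def blockwise_schedule(baseline_epochs: int, steps_per_epoch: int, num_blocks: int) -> list[int]:
    """Return the index of the block updated at each training step.

    Epochs are multiplied by `num_blocks`, and every step updates exactly
    one block. Round-robin selection is an assumption of this sketch.
    """
    total_steps = baseline_epochs * num_blocks * steps_per_epoch
    return [step % num_blocks for step in range(total_steps)]


# Hypothetical run: 2 baseline epochs, 5 steps per epoch, 4 blocks.
schedule = blockwise_schedule(baseline_epochs=2, steps_per_epoch=5, num_blocks=4)
updates_per_block = Counter(schedule)
print(len(schedule))            # total block-wise steps
print(dict(updates_per_block))  # updates received by each block
```

With this schedule, each block receives as many updates as the whole network would in the baseline run (2 epochs of 5 steps in the example).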

Experimental Scope

The paper's experiments span a range of transformer architectures and tasks, going beyond the small-scale classification benchmarks that prior block-wise methods typically target. The official implementation focuses on image classification using Vision Transformers on CIFAR-100, with support for data augmentation schedules and cosine learning rate schedulers. The paper reports that DiffusionBlocks training matches the performance of end-to-end training across these diverse settings.
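For reference, a cosine learning rate schedule of the kind mentioned above can be written as follows. This is a generic sketch, not the repository's scheduler; the function name and the default values are assumptions, and warmup is omitted.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Cosine decay from `base_lr` at step 0 to `min_lr` at `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a hypothetical 1000-step run.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```

The rate starts at `base_lr`, passes through the midpoint between `base_lr` and `min_lr` halfway through the run, and ends at `min_lr`.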

Setup and Requirements

The repository uses uv for dependency management and targets Python 3.12 with CUDA 12.2 on H100 GPUs. Setup requires logging into Hugging Face and Weights & Biases. The ViT implementation builds on Hugging Face Transformers, and the EDM implementation (the diffusion formulation from Karras et al., "Elucidating the Design Space of Diffusion-Based Generative Models") is based on Stability AI's generative-models codebase.

Update: ICLR 2026 Acceptance and arXiv v3

The paper was first submitted to arXiv in June 2025 (v1), revised in October 2025 (v2), and last updated in February 2026 (v3). It is confirmed to appear at the 14th International Conference on Learning Representations (ICLR 2026). The GitHub repository was created in September 2025 and last pushed in February 2026, with 94 stars and 5 forks as of the available data.

Pricing

Full source code access
Apache License 2.0
Block-wise ViT training on CIFAR-100
Baseline and DiffusionBlocks training scripts
Evaluation scripts

Capabilities

Key Features

Block-wise transformer training with independent gradient computation per block
Memory reduction proportional to number of blocks
Score matching objective for local block training
Compatible with vision, diffusion, autoregressive, recurrent-depth, and masked diffusion transformers
CIFAR-100 image classification reference implementation
Support for cosine learning rate scheduler and random augmentation
Hugging Face and Weights & Biases integration
uv-based dependency management

Integrations

Hugging Face Transformers

Weights & Biases

Hugging Face Hub

Stability AI generative-models (EDM)

API Available
