cuTile Rust
A tile-based system for writing memory-safe, data-race-free GPU kernels in idiomatic Rust, extending Rust's ownership discipline across the GPU launch boundary.
At a Glance
Free to use, modify, and distribute under the Apache License 2.0.
Engagement
Available On
Listed Jun 2026
About cuTile Rust
cuTile Rust (cutile-rs) is an open-source research project from NVIDIA's NVlabs that brings Rust's ownership and safety guarantees to GPU kernel programming. It targets tile-based kernels that lower through CUDA Tile IR, with APIs built around tensor partitions and tensor-core-oriented operations. The project was created in March 2026 and reached its v0.2.0 release in June 2026.
What It Is
cuTile Rust is a domain-specific language (DSL) and runtime library for authoring GPU kernels in Rust. Rather than exposing raw CUDA primitives, it models GPU work through a tile abstraction: mutable tensors are partitioned into disjoint pieces before launch, immutable tensors are shared, and generated launchers preserve Rust ownership semantics while GPU work is in flight. The #[cutile::module] macro captures a Rust AST for each kernel in the host binary; at runtime, cuTile Rust JIT-compiles that AST through CUDA Tile IR into a GPU cubin. The same model supports synchronous launches, asynchronous pipelines, and CUDA graph replay.
Safety Model and Architecture
The core design extends Rust's borrow checker across the GPU launch boundary:
- Mutable tensors are partitioned into disjoint chunks before launch, preventing data races at the type level.
- Immutable tensors are shared across tiles as read-only inputs.
- Generated launchers hold ownership of tensor arguments while GPU work is in flight, so the host cannot alias or free them prematurely.
- Local opt-outs remain available when lower-level control is needed.
The workspace is organized into layered crates: cutile (user-facing), cutile-compiler, cutile-ir (pure Rust Tile IR builder), cuda-async, cuda-core, and cuda-bindings (NVIDIA CUDA bindings under NVIDIA Software License).
Performance and Paper
The accompanying paper, Fearless Concurrency on the GPU (arXiv:2606.15991), reports that on NVIDIA B200, cuTile Rust reaches 7 TB/s for element-wise operations and 2 PFlop/s for GEMM — approximately 91% of peak memory bandwidth and 92% of dense f16 peak, respectively. The paper states the GEMM result is competitive with cuBLAS, and that safety overhead microbenchmarks show no measurable runtime cost. The paper also evaluates Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face, which the paper reports reaches 171 tokens/s for Qwen3-4B on RTX 5090 and 82 tokens/s for Qwen3-32B on B200.
Setup Requirements
cuTile Rust has specific hardware and software requirements:
- NVIDIA GPU with compute capability
sm_80or higher - CUDA 13.3 recommended (for
sm_80+coverage and Tile IR 13.3 features such as FP4 packing and block-scaled MMA) - Rust 1.89+
- Linux (tested on Ubuntu 24.04)
A Nix flake is provided for reproducible development environments. The flake automatically locates host NVIDIA driver libraries on both NixOS and non-NixOS systems.
Update: v0.2.0
Version 0.2.0 was published on June 16, 2026, and serves as the reference version for the paper evaluation benchmarks. The project README describes it as an early-stage research release under active development, with expected bugs, incomplete features, and API breakage ahead. The repository had 380 stars and 30 forks as of the last update. Related projects include cuTile Python, TileGym, and the Hugging Face Grout inference engine.
Community Discussions
Be the first to start a conversation about cuTile Rust
Share your experience with cuTile Rust, ask questions, or help others learn from your insights.
Pricing
Open Source
Free to use, modify, and distribute under the Apache License 2.0.
- Full source code access
- Tile-based GPU kernel authoring in Rust
- JIT compilation through CUDA Tile IR
- Async and sync kernel launch support
- CUDA graph replay
Capabilities
Key Features
- Tile-based GPU kernel authoring in idiomatic Rust
- Ownership-safe tensor partitioning across GPU launch boundary
- #[cutile::module] macro for JIT kernel compilation
- JIT compilation through CUDA Tile IR to GPU cubin
- Synchronous and asynchronous kernel launch support
- CUDA graph replay support
- Tensor partition API for disjoint mutable access
- Shared read-only tensor inputs
- Local opt-outs for lower-level control
- Nix flake for reproducible development environments
- Reusable kernel library (cutile-kernels)
- Async CUDA execution via async Rust
