# cuTile Rust

> A tile-based system for writing memory-safe, data-race-free GPU kernels in idiomatic Rust, extending Rust's ownership discipline across the GPU launch boundary.

cuTile Rust (`cutile-rs`) is an open-source research project from NVIDIA's NVlabs that brings Rust's ownership and safety guarantees to GPU kernel programming. It targets tile-based kernels that lower through CUDA Tile IR, with APIs built around tensor partitions and tensor-core-oriented operations. The project was created in March 2026 and reached its v0.2.0 release in June 2026.

## What It Is

cuTile Rust is a domain-specific language (DSL) and runtime library for authoring GPU kernels in Rust. Rather than exposing raw CUDA primitives, it models GPU work through a tile abstraction: mutable tensors are partitioned into disjoint pieces before launch, immutable tensors are shared, and generated launchers preserve Rust ownership semantics while GPU work is in flight. The `#[cutile::module]` macro captures a Rust AST for each kernel in the host binary; at runtime, cuTile Rust JIT-compiles that AST through CUDA Tile IR into a GPU cubin. The same model supports synchronous launches, asynchronous pipelines, and CUDA graph replay.
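The idea that a launcher keeps ownership of its arguments while work is in flight can be illustrated with a CPU-only analogy in plain standard-library Rust. This is a minimal sketch, not the cuTile Rust API: `InFlight`, `launch_scale`, and `wait` are hypothetical names invented for this example, and an OS thread stands in for the GPU.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical handle for work that is still running.
/// It is the only way to get the buffer back.
struct InFlight {
    handle: JoinHandle<Vec<f32>>,
}

/// Hypothetical "launch": takes the buffer by value, so the caller
/// gives up ownership for as long as the work is in flight.
fn launch_scale(mut buf: Vec<f32>, factor: f32) -> InFlight {
    let handle = thread::spawn(move || {
        for x in buf.iter_mut() {
            *x *= factor;
        }
        buf // hand the buffer back when the work completes
    });
    InFlight { handle }
}

impl InFlight {
    /// Blocks until the work is done and returns ownership of the buffer.
    fn wait(self) -> Vec<f32> {
        self.handle.join().expect("worker thread panicked")
    }
}

fn main() {
    let data = vec![1.0_f32, 2.0, 3.0];
    let pending = launch_scale(data, 10.0);
    // `data` was moved into the launch: reading, aliasing, or dropping it
    // here would be rejected by the compiler.
    let data = pending.wait();
    println!("{data:?}");
}
```

Because the buffer is moved rather than borrowed, the "cannot free or alias while in flight" rule is enforced at compile time rather than by a runtime check.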

## Safety Model and Architecture

The core design extends Rust's ownership and borrowing discipline across the GPU launch boundary:

- **Mutable tensors** are partitioned into disjoint chunks before launch, preventing data races at the type level.
- **Immutable tensors** are shared across tiles as read-only inputs.
- **Generated launchers** hold ownership of tensor arguments while GPU work is in flight, so the host cannot alias or free them prematurely.
- **Local opt-outs** remain available when lower-level control is needed.
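The first two rules mirror what Rust already enforces on the CPU. The sketch below uses only the standard library (`chunks_mut` and scoped threads) to show the same pattern: the mutable output is split into non-overlapping tiles, while the immutable input is shared by every worker. It is an analogy under stated assumptions, not cuTile Rust code; `double_by_tiles` and the `tile` parameter are hypothetical names for this example.

```rust
use std::thread;

/// Writes `2 * input[i]` into `output[i]`, one scoped thread per tile.
/// `tile` is a hypothetical tile size chosen for this sketch.
fn double_by_tiles(input: &[f32], output: &mut [f32], tile: usize) {
    assert_eq!(input.len(), output.len());
    assert!(tile > 0);
    thread::scope(|s| {
        // `chunks_mut` yields disjoint `&mut [f32]` pieces, so no two
        // threads can ever hold a mutable reference to the same element.
        for (t, out_tile) in output.chunks_mut(tile).enumerate() {
            // `input` is a shared `&[f32]`; every thread gets a copy of
            // the reference and may only read through it.
            s.spawn(move || {
                for (j, o) in out_tile.iter_mut().enumerate() {
                    *o = 2.0 * input[t * tile + j];
                }
            });
        }
    });
}

fn main() {
    let input = vec![1.0_f32, 2.0, 3.0, 4.0, 5.0];
    let mut output = vec![0.0_f32; input.len()];
    double_by_tiles(&input, &mut output, 2);
    println!("{output:?}");
}
```

Trying to hand the same `&mut` tile to two threads, or to write through `input`, fails to compile, which is the CPU-side counterpart of preventing data races at the type level.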

The workspace is organized into layered crates: `cutile` (user-facing), `cutile-compiler`, `cutile-ir` (a pure-Rust Tile IR builder), `cuda-async`, `cuda-core`, and `cuda-bindings` (NVIDIA CUDA bindings, distributed under the NVIDIA Software License).

## Performance and Paper

The accompanying paper, *Fearless Concurrency on the GPU* (arXiv:2606.15991), reports that on an NVIDIA B200, cuTile Rust reaches 7 TB/s for element-wise operations and 2 PFlop/s for GEMM, which is approximately 91% of peak memory bandwidth and 92% of dense f16 peak, respectively. The paper states that the GEMM result is competitive with cuBLAS and that safety-overhead microbenchmarks show no measurable runtime cost. It also evaluates Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face, and reports 171 tokens/s for Qwen3-4B on an RTX 5090 and 82 tokens/s for Qwen3-32B on a B200.

## Setup Requirements

cuTile Rust has specific hardware and software requirements:

- **NVIDIA GPU** with compute capability `sm_80` or higher
- **CUDA 13.3** recommended (for `sm_80+` coverage and Tile IR 13.3 features such as FP4 packing and block-scaled MMA)
- **Rust 1.89+**
- **Linux** (tested on Ubuntu 24.04)

A Nix flake is provided for reproducible development environments. The flake automatically locates host NVIDIA driver libraries on both NixOS and non-NixOS systems.

## Update: v0.2.0

Version 0.2.0 was published on June 16, 2026, and serves as the reference version for the paper evaluation benchmarks. The project README describes it as an early-stage research release under active development, with expected bugs, incomplete features, and API breakage ahead. The repository had 380 stars and 30 forks as of the last update. Related projects include cuTile Python, TileGym, and the Hugging Face Grout inference engine.

## Features
- Tile-based GPU kernel authoring in idiomatic Rust
- Ownership-safe tensor partitioning across the GPU launch boundary
- `#[cutile::module]` macro for JIT kernel compilation
- JIT compilation through CUDA Tile IR to GPU cubin
- Synchronous and asynchronous kernel launch support
- CUDA graph replay support
- Tensor partition API for disjoint mutable access
- Shared read-only tensor inputs
- Local opt-outs for lower-level control
- Nix flake for reproducible development environments
- Reusable kernel library (`cutile-kernels`)
- Async CUDA execution via async Rust

## Integrations
CUDA Tile IR, NVIDIA CUDA, Hugging Face Grout, cuBLAS, Rust cargo ecosystem, Nix flakes

## Platforms
LINUX, API, DEVELOPER_SDK, CLI

## Pricing
Open Source

## Version
v0.2.0

## Links
- Website: https://nvlabs.github.io/cutile-rs/main/
- Documentation: https://nvlabs.github.io/cutile-rs/
- Repository: https://github.com/nvlabs/cutile-rs
- EveryDev.ai: https://www.everydev.ai/tools/cutile-rs
