cuTile Rust

Name: cuTile Rust
Availability: OnlineOnly
Author: NVlabs (NVIDIA Research)

A tile-based system for writing memory-safe, data-race-free GPU kernels in idiomatic Rust, extending Rust's ownership discipline across the GPU launch boundary.

At a Glance

Pricing

Open Source

Free to use, modify, and distribute under the Apache License 2.0.

Available On

Linux

API

SDK

CLI

NVlabs (NVIDIA Research): NVlabs is NVIDIA's research division, publishing open-source…

Listed Jun 2026

About cuTile Rust

cuTile Rust (cutile-rs) is an open-source research project from NVIDIA's NVlabs that brings Rust's ownership and safety guarantees to GPU kernel programming. It targets tile-based kernels that lower through CUDA Tile IR, with APIs built around tensor partitions and tensor-core-oriented operations. The project was created in March 2026 and reached its v0.2.0 release in June 2026.

What It Is

cuTile Rust is a domain-specific language (DSL) and runtime library for authoring GPU kernels in Rust. Rather than exposing raw CUDA primitives, it models GPU work through a tile abstraction: mutable tensors are partitioned into disjoint pieces before launch, immutable tensors are shared, and generated launchers preserve Rust ownership semantics while GPU work is in flight. The #[cutile::module] macro captures a Rust AST for each kernel in the host binary; at runtime, cuTile Rust JIT-compiles that AST through CUDA Tile IR into a GPU cubin. The same model supports synchronous launches, asynchronous pipelines, and CUDA graph replay.
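The partition-and-share idea can be illustrated on the CPU with nothing but the Rust standard library. The sketch below is an analogy only and does not use the cuTile Rust API: the function name scale_in_tiles and its parameters are hypothetical, and scoped threads stand in for GPU tiles. Each worker receives an exclusive chunk of the output and a read-only view of the input, which is the same borrow-checker guarantee the project is described as carrying across the launch boundary.

```rust
use std::thread;

/// CPU-only analogy (not the cuTile Rust API): every "tile" gets an
/// exclusive chunk of `output` and a shared, read-only chunk of `input`.
fn scale_in_tiles(input: &[f32], output: &mut [f32], tile: usize, factor: f32) {
    assert_eq!(input.len(), output.len());
    assert!(tile > 0);
    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut` slices, so the compiler
        // can prove that no two workers write to the same element.
        for (out_tile, in_tile) in output.chunks_mut(tile).zip(input.chunks(tile)) {
            s.spawn(move || {
                for (o, i) in out_tile.iter_mut().zip(in_tile) {
                    *o = *i * factor;
                }
            });
        }
        // All workers are joined when the scope ends.
    });
}

fn main() {
    let input: Vec<f32> = (0..8).map(|x| x as f32).collect();
    let mut output = vec![0.0f32; 8];
    scale_in_tiles(&input, &mut output, 4, 2.0);
    println!("{:?}", output);
}
```

Handing the same mutable chunk to two workers would be rejected at compile time, which is the property the tile abstraction relies on.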

Safety Model and Architecture

The core design extends Rust's borrow checker across the GPU launch boundary:

Mutable tensors are partitioned into disjoint chunks before launch, preventing data races at the type level.
Immutable tensors are shared across tiles as read-only inputs.
Generated launchers hold ownership of tensor arguments while GPU work is in flight, so the host cannot alias or free them prematurely.
Local opt-outs remain available when lower-level control is needed.
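The point about launchers holding ownership while work is in flight can be sketched with a move-in, move-out pattern. This is a minimal CPU analogy under stated assumptions, not the generated launcher code: InFlight, launch_double, and wait are hypothetical names, and a background thread stands in for the GPU.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a launched kernel: it owns the buffer
/// until the caller waits on it.
struct InFlight {
    handle: JoinHandle<Vec<f32>>,
}

/// Takes the buffer by value, so the caller cannot read, alias, or
/// free it while the background work is running.
fn launch_double(mut buf: Vec<f32>) -> InFlight {
    let handle = thread::spawn(move || {
        for x in buf.iter_mut() {
            *x *= 2.0;
        }
        buf // ownership travels back with the result
    });
    InFlight { handle }
}

impl InFlight {
    /// Blocks until the work finishes and hands ownership back.
    fn wait(self) -> Vec<f32> {
        self.handle.join().expect("worker panicked")
    }
}

fn main() {
    let data = vec![1.0, 2.0, 3.0];
    let pending = launch_double(data);
    // `data` has been moved; touching it here would not compile.
    let data = pending.wait();
    println!("{:?}", data);
}
```

Because the buffer is moved into the launch and only returned by wait, premature access is a compile error rather than a runtime hazard.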

The workspace is organized into layered crates: cutile (user-facing), cutile-compiler, cutile-ir (a pure-Rust Tile IR builder), cuda-async, cuda-core, and cuda-bindings (NVIDIA CUDA bindings, licensed under the NVIDIA Software License).

Performance and Paper

The accompanying paper, Fearless Concurrency on the GPU (arXiv:2606.15991), reports that on an NVIDIA B200, cuTile Rust reaches 7 TB/s for element-wise operations and 2 PFlop/s for GEMM, approximately 91% of peak memory bandwidth and 92% of dense f16 peak, respectively. The paper states that the GEMM result is competitive with cuBLAS and that safety-overhead microbenchmarks show no measurable runtime cost. It also evaluates Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face, reporting 171 tokens/s for Qwen3-4B on an RTX 5090 and 82 tokens/s for Qwen3-32B on a B200.
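As a rough sanity check, the achieved rates and percentages quoted above can be inverted to see what peak figures they imply. The helper implied_peak is a hypothetical name introduced for illustration. The inputs are the rounded numbers from this listing, so the results are approximate and should not be read as hardware specifications.

```rust
/// Peak rate implied by an achieved rate and the fraction of peak it
/// represents: peak = achieved / fraction.
fn implied_peak(achieved: f64, fraction_of_peak: f64) -> f64 {
    achieved / fraction_of_peak
}

fn main() {
    // Rounded figures as quoted in the listing; results are approximate.
    let bandwidth = implied_peak(7.0, 0.91); // TB/s
    let gemm = implied_peak(2.0, 0.92); // PFlop/s
    println!("implied peak bandwidth: {:.2} TB/s", bandwidth);
    println!("implied dense f16 peak: {:.2} PFlop/s", gemm);
}
```

With these rounded inputs the implied peaks come out near 7.7 TB/s and 2.2 PFlop/s. The quoted throughput figures are themselves rounded, so small differences from published hardware specifications are expected.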

Setup Requirements

cuTile Rust has specific hardware and software requirements:

NVIDIA GPU with compute capability sm_80 or higher
CUDA 13.3 recommended (for sm_80+ coverage and Tile IR 13.3 features such as FP4 packing and block-scaled MMA)
Rust 1.89+
Linux (tested on Ubuntu 24.04)

A Nix flake is provided for reproducible development environments. The flake automatically locates host NVIDIA driver libraries on both NixOS and non-NixOS systems.

Update: v0.2.0

Version 0.2.0 was published on June 16, 2026, and serves as the reference version for the paper's evaluation benchmarks. The project README describes it as an early-stage research release under active development, with bugs, incomplete features, and breaking API changes to be expected. The repository had 380 stars and 30 forks as of the last update. Related projects include cuTile Python, TileGym, and the Hugging Face Grout inference engine.

Pricing

Open Source

Free to use, modify, and distribute under the Apache License 2.0.

Full source code access
Tile-based GPU kernel authoring in Rust
JIT compilation through CUDA Tile IR
Async and sync kernel launch support
CUDA graph replay

Capabilities

Key Features

Tile-based GPU kernel authoring in idiomatic Rust
Ownership-safe tensor partitioning across GPU launch boundary
#[cutile::module] macro for JIT kernel compilation
JIT compilation through CUDA Tile IR to GPU cubin
Synchronous and asynchronous kernel launch support
CUDA graph replay support
Tensor partition API for disjoint mutable access
Shared read-only tensor inputs
Local opt-outs for lower-level control
Nix flake for reproducible development environments
Reusable kernel library (cutile-kernels)
Async CUDA execution via async Rust

Integrations

CUDA Tile IR

NVIDIA CUDA

Hugging Face Grout

cuBLAS

Rust cargo ecosystem

Nix flakes
