Spiral
A data warehouse for pre-training that maximizes model FLOPs utilization with multimodal data support and GPU saturation.
About Spiral
Spiral is a data warehouse designed specifically for pre-training machine learning models, enabling teams to maximize Model FLOPs Utilization (MFU) with multimodal data. It provides a scalable infrastructure for ingesting, processing, and enriching large datasets including tensors, audio, images, and video without the typical I/O bottlenecks that slow down GPU training pipelines.
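To make the MFU metric concrete, here is a minimal sketch of how it is commonly estimated for a dense transformer. This is not Spiral code. The rule of thumb of roughly 6 FLOPs per parameter per training token, the function names, and the example numbers (a 7B-parameter model, 8 GPUs, a peak of 989 TFLOP/s per GPU) are all assumptions chosen for illustration.

```python
def model_flops_per_token(num_params: int) -> float:
    """Approximate training FLOPs per token for a dense transformer.

    Assumption: the common ~6 * N estimate (forward + backward pass),
    which ignores attention-specific terms.
    """
    return 6.0 * num_params


def mfu(num_params: int, tokens_per_second: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s divided by peak FLOP/s."""
    achieved_flops = model_flops_per_token(num_params) * tokens_per_second
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


def tokens_per_second_with_stalls(compute_bound_rate: float,
                                  data_wait_fraction: float) -> float:
    """Throughput when the GPUs sit idle for a fraction of each step waiting on data."""
    return compute_bound_rate * (1.0 - data_wait_fraction)


# Hypothetical numbers: 7B parameters, 8 GPUs, 989 TFLOP/s assumed peak per GPU.
fed = mfu(7_000_000_000, 20_000, 8, 989e12)
starved = mfu(7_000_000_000,
              tokens_per_second_with_stalls(20_000, 0.25), 8, 989e12)
print(f"MFU with a saturated input pipeline: {fed:.3f}")
print(f"MFU with 25% of step time spent waiting on I/O: {starved:.3f}")
```

The second call shows why input throughput matters: any time the GPUs spend waiting on data lowers tokens per second, and MFU falls in direct proportion.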
- Multimodal Data Ingestion: Ingest data of any type and size, including tensors, audio, images, and video files, which suits the mixed datasets used in pre-training.
- Flexible Schema Evolution: Append columns and rows without rewriting existing data, so datasets can evolve without costly migrations or upfront schema design.
- GPU Saturation: Interactive queries deliver more bytes per second to an H100 than reading precomputed Parquet results from local disk, so data loading stops being the I/O bottleneck.
- Selective and Parameterized Reads: Push predicates down to the storage layer and read only the data you need, without building a custom data access layer.
- Massive Scale Support: Scale to millions of columns without upfront schema design, accommodating the complex metadata requirements of modern ML datasets.
- Built on Vortex: Powered by Vortex, an open-source columnar format donated to the Linux Foundation and described as Pareto-optimal, with faster performance than Apache Parquet for virtually any workload.
- Broad Ecosystem Integration: Works with popular tools including Spark, Dask, Modal, DuckDB, Polars, PyTorch, Pandas, Arrow, Iceberg, and Ray.
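The schema evolution and selective read features above can be illustrated with a toy in-memory column store. This is a conceptual sketch only: `ToyColumnTable` and its methods are hypothetical names invented for this example and are not the Spiral or Vortex API. It shows why a columnar layout lets a new column be attached without touching existing ones, and how a scan can evaluate a predicate on one column and then materialize only the selected columns for matching rows.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


class ToyColumnTable:
    """Hypothetical in-memory table that stores each column as its own list."""

    def __init__(self) -> None:
        self.columns: Dict[str, List[Any]] = {}
        self.num_rows = 0

    def append_rows(self, rows: List[Dict[str, Any]]) -> None:
        """Append rows; columns not seen before are backfilled with None."""
        for row in rows:
            for name in row:
                if name not in self.columns:
                    self.columns[name] = [None] * self.num_rows
            for name, values in self.columns.items():
                values.append(row.get(name))
            self.num_rows += 1

    def add_column(self, name: str, values: List[Any]) -> None:
        """Attach a new column; existing column storage is left untouched."""
        if name in self.columns:
            raise ValueError(f"column {name!r} already exists")
        if len(values) != self.num_rows:
            raise ValueError("new column must have one value per existing row")
        self.columns[name] = list(values)

    def scan(self, select: List[str],
             where_column: Optional[str] = None,
             predicate: Optional[Callable[[Any], bool]] = None
             ) -> Iterator[Dict[str, Any]]:
        """Filter on one column first, then read only the selected columns."""
        if where_column is None or predicate is None:
            matching = list(range(self.num_rows))
        else:
            filter_values = self.columns[where_column]
            matching = [i for i, v in enumerate(filter_values) if predicate(v)]
        for i in matching:
            yield {name: self.columns[name][i] for name in select}


table = ToyColumnTable()
table.append_rows([
    {"clip_id": "a", "duration_s": 3.5},
    {"clip_id": "b", "duration_s": 12.0},
    {"clip_id": "c", "duration_s": 7.25},
])

# Enrich the dataset later by adding a column; "clip_id" storage is not rewritten.
original_ids = table.columns["clip_id"]
table.add_column("caption", ["a dog", "a cat", "a bird"])

# Only "duration_s" is inspected for filtering; only two columns are returned.
long_clips = list(table.scan(select=["clip_id", "caption"],
                             where_column="duration_s",
                             predicate=lambda d: d > 5.0))
print(long_clips)
```

A real columnar format applies the same idea to compressed, chunked data on disk or object storage, where skipping unneeded columns and rows saves actual I/O rather than list lookups.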
To get started, request access through the Spiral website. The platform integrates with familiar data processing tools and standards, which eases adoption for teams already working with a modern data stack. Spiral is best suited to organizations building large-scale pre-training pipelines that need to manage multimodal datasets and serve them efficiently to GPU clusters.

Pricing
Enterprise
Contact for access to the data warehouse for pre-training
- Multimodal data ingestion
- Schema evolution without rewriting
- GPU saturation
- Selective and parameterized reads
- Scale to millions of columns
- Tool integrations
Capabilities
Key Features
- Multimodal data ingestion (tensors, audio, images, video)
- Schema evolution without data rewriting
- GPU saturation for maximum throughput
- Selective and parameterized push-down reads
- Scale to millions of columns
- Built on Vortex columnar format
- Pareto-optimal performance vs Parquet
- Interoperable with existing data ecosystems