# Daft

> Open-source, high-performance data engine for AI and multimodal workloads, enabling processing of images, audio, video, and structured data at any scale using a Python dataframe API.

Daft is an open-source data engine built by Eventual Inc. for AI and multimodal data pipelines, licensed under Apache 2.0. Its core is written in Rust for performance, and it exposes a Python dataframe API familiar to Pandas and Spark users. The project has over 5,500 GitHub stars and is described by the vendor as being in production at organizations including Amazon and Essential AI.

## What It Is

Daft is a distributed data processing framework designed for AI workloads, particularly pipelines that mix structured metadata with unstructured multimodal data such as images, video, audio, and embeddings. Unlike general-purpose dataframe libraries, Daft treats multimodal column types as first-class citizens and handles CPU/GPU scheduling within a single pipeline, which removes the need for separate orchestration glue code.

## Architecture and Performance

Daft's core engine is written in Rust and uses Apache Arrow for zero-copy execution. Key architectural properties include:

- **Multimodal-native column types**: Images, video, audio, text, and embeddings are native column types that can be decoded, transformed, and filtered like any other column.
- **CPU and GPU co-scheduling**: GPU inference and embeddings run alongside CPU decode and filter operations in one pipeline; Daft handles batching and scheduling automatically.
- **Lower memory footprint**: The vendor claims Daft runs the same queries with one-fifth the memory of alternatives, so that jobs that would run out of memory (OOM) on Spark or Pandas can complete.
- **Faster start time**: The vendor reports a 20x improvement in pipeline start time compared to alternatives (the baseline is not specified here).
- **Rust core**: Video decoding, transforms, and joins over multimodal data run in Rust at TB scale, avoiding Python interpreter overhead.
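
The co-scheduling and batching idea in the list above can be illustrated with a small standard-library sketch. This is not Daft's scheduler or API: `decode`, `model`, and `BATCH_SIZE` are hypothetical stand-ins, and the point is only to show a CPU stage feeding a batched model stage through a bounded queue.

```python
import queue
import threading

BATCH_SIZE = 4    # illustrative batch size, not a Daft default
_DONE = object()  # sentinel marking the end of the stream


def decode(raw: bytes) -> int:
    """Stand-in for CPU-bound decoding (e.g. image bytes -> pixels)."""
    return len(raw)


def model(batch: list) -> list:
    """Stand-in for a batched GPU inference call."""
    return [x * 2 for x in batch]


def producer(items, out_q):
    # CPU stage: decode items one at a time and hand them downstream.
    for raw in items:
        out_q.put(decode(raw))
    out_q.put(_DONE)


def consumer(in_q, results):
    # Model stage: group decoded items into fixed-size batches per call.
    batch = []
    while True:
        item = in_q.get()
        if item is _DONE:
            break
        batch.append(item)
        if len(batch) == BATCH_SIZE:
            results.extend(model(batch))
            batch = []
    if batch:  # flush the final partial batch
        results.extend(model(batch))


def run_pipeline(items):
    q = queue.Queue(maxsize=8)  # bounded queue applies backpressure to the decoder
    results = []
    t_prod = threading.Thread(target=producer, args=(items, q))
    t_cons = threading.Thread(target=consumer, args=(q, results))
    t_prod.start()
    t_cons.start()
    t_prod.join()
    t_cons.join()
    return results


outputs = run_pipeline([b"a" * n for n in range(1, 11)])
print(outputs)
```

Writing and tuning this kind of plumbing by hand is the "glue code" that an engine with built-in batching and scheduling aims to make unnecessary.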

## Ecosystem Integrations

Daft integrates with a broad set of data infrastructure and ML tooling:

- **Table formats**: Apache Iceberg, Delta Lake, Apache Hudi, Unity Catalog
- **Cloud storage**: Amazon Web Services (S3), Azure, Google Cloud Storage
- **Compute**: Ray (for distributed execution)
- **ML frameworks**: PyTorch, Hugging Face
- **Dataframe interop**: Pandas
- **Model providers**: OpenAI, Hugging Face, and custom models via UDFs

## Use Cases

The vendor highlights three primary use cases:

1. **AI Search** — Using LLMs and embedding models, Daft extracts metadata, generates vectors, and writes them to a vector database.
2. **Data Enrichment** — Enriching raw datasets with model-generated labels, captions, or structured outputs.
3. **Multimodal AI ETL** — End-to-end pipelines from raw multimodal data to training-ready datasets.
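
The AI Search flow (extract metadata, generate vectors, write them to a vector database) can be sketched in plain Python. Everything here is a toy stand-in: `embed` is a bag-of-words vectoriser rather than a real embedding model, and `InMemoryVectorStore` is a hypothetical substitute for a vector database. None of it is Daft's API.

```python
import math


def build_vocab(texts):
    """Map each distinct lowercase token in the corpus to a vector index."""
    tokens = sorted({tok for t in texts for tok in t.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def embed(text, vocab):
    """Toy embedding: L2-normalised bag-of-words over a fixed vocabulary."""
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self.rows = []

    def upsert(self, doc_id, vector, metadata):
        self.rows.append((doc_id, vector, metadata))

    def search(self, query_vector, k=1):
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(query_vector, vec)), doc_id)
            for doc_id, vec, _ in self.rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in scored[:k]]


docs = {
    "doc-1": "red apple pie",
    "doc-2": "blue ocean wave",
    "doc-3": "apple orchard harvest",
}
vocab = build_vocab(docs.values())
store = InMemoryVectorStore()
for doc_id, text in docs.items():
    metadata = {"n_tokens": len(text.split())}        # "extract metadata"
    store.upsert(doc_id, embed(text, vocab), metadata)  # "generate vectors" and "write"

top = store.search(embed("apple pie", vocab), k=2)
print(top)
```

In a real pipeline the embedding step would call a model in batches and the store would be an external service; the three-stage shape stays the same.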

## Adoption Signals

The vendor publishes several user testimonials and case studies. According to the vendor, Amazon uses Daft to manage exabytes of Apache Parquet in its S3-based data catalog, with one engineer stating it improved efficiency of a critical data processing job by over 24%, saving over 40,000 years of Amazon EC2 vCPU computing time annually. Essential AI reportedly scaled a vLLM-inference pipeline to 32,000 sustained requests per second per VM using Daft. Together AI states Daft sped up fuzzy deduplication workloads by 10x on 100TB+ text data pipelines. The vendor also reports petabytes processed daily across its user base.

## Update: v0.7.16

The latest release is v0.7.16, published on June 26, 2026. The most recent push to the repository was on July 1, 2026, and the repository has 323 open issues and 502 forks, figures that suggest ongoing development and community activity. The project has been under continuous development since its creation in April 2022.

## Features
- Multimodal-native column types (images, video, audio, embeddings)
- CPU and GPU co-scheduling in a single pipeline
- Python dataframe API that follows patterns familiar from Pandas and Spark
- Managed UDF runtime with automatic batching, retries, and error handling
- Zero-copy execution powered by Apache Arrow
- Rust core for high-performance data processing
- Local-to-production consistency: the same code runs on a laptop or a cluster
- 5x lower memory footprint vs. alternatives (vendor claim)
- Native model operators for embeddings, LLM extraction, and structured outputs
- Distributed execution via Ray integration
- Support for Apache Iceberg, Delta Lake, Apache Hudi, Unity Catalog
- Cloud-native I/O for AWS S3, Azure, Google Cloud Storage
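
To make "automatic batching, retries, and error handling" from the list above concrete, here is a plain-Python sketch of the general pattern. It is an illustration only and not Daft's UDF runtime: `apply_with_retries`, `flaky_upper`, and the parameter defaults are hypothetical.

```python
def apply_with_retries(fn, rows, batch_size=2, max_retries=2):
    """Apply fn to rows in batches; retry failed batches, then record the error.

    A batch that still fails after max_retries retries yields None for each of
    its rows instead of aborting the whole job.
    """
    outputs, errors = [], []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                outputs.extend(fn(batch))
                break
            except Exception as exc:
                if attempt == max_retries:
                    errors.append((start, repr(exc)))
                    outputs.extend([None] * len(batch))
    return outputs, errors


calls = {"n": 0}


def flaky_upper(batch):
    """Hypothetical user function: fails once transiently, and on bad input."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    if not all(isinstance(s, str) for s in batch):
        raise TypeError("expected str")
    return [s.upper() for s in batch]


outputs, errors = apply_with_retries(flaky_upper, ["a", "b", "c", 3])
print(outputs)
print(errors)
```

The first batch succeeds on its second attempt, while the second batch exhausts its retries and is reported in `errors` without stopping the run. A managed runtime takes over this bookkeeping so that user functions only contain the per-batch logic.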

## Integrations
Apache Iceberg, Delta Lake, Apache Hudi, Unity Catalog, Amazon Web Services (S3), Azure, Google Cloud Storage, Ray, Pandas, PyTorch, Hugging Face, OpenAI

## Platforms
CLI, API, Developer SDK

## Pricing
Open Source

## Version
v0.7.16

## Links
- Website: https://daft.ai
- Documentation: https://docs.daft.ai/en/stable/
- Repository: https://github.com/Eventual-Inc/Daft
- EveryDev.ai: https://www.everydev.ai/tools/daft
