Daft

Name: Daft
Availability: OnlineOnly
Author: Eventual Inc.

Open-source, high-performance data engine for AI and multimodal workloads, enabling processing of images, audio, video, and structured data at any scale using a Python dataframe API.

At a Glance

Pricing

Open Source

Fully open-source under Apache License 2.0. Free to use, modify, and distribute.

Available On

CLI

API

SDK

Eventual Inc. | San Francisco, CA | Est. 2022 | $30M raised

Listed Jul 2026

About Daft

Daft is an open-source data engine built by Eventual Inc. for AI and multimodal data pipelines, licensed under Apache 2.0. Its core is written in Rust for performance, and it exposes a Python dataframe API familiar to Pandas and Spark users. The project has over 5,500 GitHub stars and is described by the vendor as being in production at organizations including Amazon and Essential AI.

What It Is

Daft is a distributed data processing framework designed for AI workloads, particularly pipelines that mix structured metadata with unstructured multimodal data such as images, video, audio, and embeddings. Unlike general-purpose dataframe libraries, Daft treats multimodal column types as first-class citizens and handles CPU/GPU scheduling within a single pipeline, which is intended to remove the need for separate orchestration glue code.

Architecture and Performance

Daft's core engine is written in Rust and uses Apache Arrow for zero-copy execution. Key architectural properties include:

Multimodal-native column types: Images, video, audio, text, and embeddings are native column types that can be decoded, transformed, and filtered like any other column.
CPU and GPU co-scheduling: GPU inference and embeddings run alongside CPU decode and filter operations in one pipeline; Daft handles batching and scheduling automatically.
Lower memory footprint: The vendor claims Daft runs the same queries with 5x less memory than alternatives, allowing jobs that would OOM on Spark or Pandas to complete successfully.
20x faster start time: The vendor reports a 20x improvement in pipeline start time compared to alternatives.
Rust core: Video decoding, transforms, and joins over multimodal data run in the Rust engine at TB scale, without Python overhead.
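
To make the co-scheduling and batching idea above concrete, here is a minimal sketch in plain Python. It is not Daft's API or implementation: `decode`, `run_model`, `batched`, and `pipeline` are hypothetical names, and `run_model` is a CPU stand-in for what would be batched GPU inference. The sketch only shows the shape of the pattern, in which a pool of CPU workers decodes items while a downstream stage consumes them in fixed-size batches.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List


def decode(raw: bytes) -> List[int]:
    """Hypothetical CPU-bound step: turn raw bytes into a list of pixel-like ints."""
    return list(raw)


def run_model(batch: List[List[int]]) -> List[float]:
    """Hypothetical stand-in for batched GPU inference: one mean score per item."""
    return [sum(item) / max(len(item), 1) for item in batch]


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def pipeline(raw_items: List[bytes], batch_size: int = 2, cpu_workers: int = 4) -> List[float]:
    """Decode on a pool of CPU workers, then feed fixed-size batches to the model stage."""
    scores: List[float] = []
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        decoded = pool.map(decode, raw_items)  # results are yielded in input order
        for batch in batched(decoded, batch_size):
            scores.extend(run_model(batch))
    return scores


scores = pipeline([b"\x01\x02\x03", b"\x0a\x14", b"", b"\xff"])
```

In this toy version the caller picks `batch_size` and `cpu_workers` by hand; the listing above describes Daft as choosing batching and scheduling automatically.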

Ecosystem Integrations

Daft integrates with a broad set of data infrastructure and ML tooling:

Table formats: Apache Iceberg, Delta Lake, Apache Hudi, Unity Catalog
Cloud storage: Amazon Web Services (S3), Azure, Google Cloud Storage
Compute: Ray (for distributed execution)
ML frameworks: PyTorch, Hugging Face
Dataframe interop: Pandas
Model providers: OpenAI, Hugging Face, and custom models via UDFs

Use Cases

The vendor highlights three primary use cases:

AI Search: Using LLMs and embedding models, Daft extracts metadata, generates vectors, and writes them to a vector database.
Data Enrichment: Enriching raw datasets with model-generated labels, captions, or structured outputs.
Multimodal AI ETL: End-to-end pipelines from raw multimodal data to training-ready datasets.
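
The AI Search flow (extract metadata, generate vectors, write them to a store, then query) can be sketched in a few lines of standard-library Python. Everything here is an assumption for illustration: `embed` is a toy hashed bag-of-words function standing in for a real embedding model, the "vector database" is an in-memory list, and none of the names are part of Daft.

```python
import math
import zlib
from typing import Dict, List, Tuple

DIM = 16  # toy embedding width; real embedding models use far more dimensions

Row = Tuple[str, Dict[str, int], List[float]]  # (id, metadata, vector)


def embed(text: str) -> List[float]:
    """Hypothetical embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(docs: Dict[str, str]) -> List[Row]:
    """Extract simple metadata, embed each document, and store the rows in memory."""
    return [(doc_id, {"n_words": len(text.split())}, embed(text)) for doc_id, text in docs.items()]


def search(index: List[Row], query: str, k: int = 1) -> List[str]:
    """Return the ids of the k rows with the highest cosine similarity to the query."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), doc_id) for doc_id, _, vec in index]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


index = build_index({"img-1": "red bicycle on a beach", "doc-7": "quarterly revenue report"})
top = search(index, "red bicycle on a beach")
```

Because the vectors are normalized, the dot product in `search` equals cosine similarity. A production pipeline would replace `embed` with a model call and the list with a real vector database.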

Adoption Signals

The vendor publishes several user testimonials and case studies. According to the vendor, Amazon uses Daft to manage exabytes of Apache Parquet in its S3-based data catalog, with one engineer stating it improved efficiency of a critical data processing job by over 24%, saving over 40,000 years of Amazon EC2 vCPU computing time annually. Essential AI reportedly scaled a vLLM-inference pipeline to 32,000 sustained requests per second per VM using Daft. Together AI states Daft sped up fuzzy deduplication workloads by 10x on 100TB+ text data pipelines. The vendor also reports petabytes processed daily across its user base.

Update: v0.7.16

The latest release is v0.7.16, published on June 26, 2026. The most recent push to the repository was on July 1, 2026, and the repository has 323 open issues and 502 forks. The project has been under continuous development since its creation in April 2022.

Pricing

Open Source

Full data engine for AI and multimodal workloads
Python dataframe API
Multimodal-native column types
CPU and GPU co-scheduling
Distributed execution via Ray

Capabilities

Key Features

Multimodal-native column types (images, video, audio, embeddings)
CPU and GPU co-scheduling in a single pipeline
Python dataframe API compatible with Pandas and Spark patterns
Managed UDF runtime with automatic batching, retries, and error handling
Zero-copy execution powered by Apache Arrow
Rust core for high-performance data processing
Local-to-production consistency: the same code runs on a laptop or a cluster
5x lower memory footprint vs alternatives
Native model operators for embeddings, LLM extraction, and structured outputs
Distributed execution via Ray integration
Support for Apache Iceberg, Delta Lake, Apache Hudi, Unity Catalog
Cloud-native I/O for AWS S3, Azure, Google Cloud Storage
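
As an illustration of what a UDF runtime with batching, retries, and error handling has to do, here is a minimal plain-Python sketch. It is not Daft's runtime or API: `run_udf` and `flaky_upper` are hypothetical names, and the policy shown (retry the whole batch, then emit `None` for each of its rows) is one assumed design among several.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_udf(
    fn: Callable[[List[T]], List[R]],
    rows: Sequence[T],
    batch_size: int = 2,
    max_retries: int = 2,
) -> List[Optional[R]]:
    """Apply a batch function to rows, retrying failed batches and emitting None on final failure."""
    out: List[Optional[R]] = []
    for start in range(0, len(rows), batch_size):
        batch = list(rows[start:start + batch_size])
        for attempt in range(max_retries + 1):
            try:
                out.extend(fn(batch))
                break
            except Exception:
                if attempt == max_retries:
                    # Give up on this batch but keep output aligned with the input rows.
                    out.extend([None] * len(batch))
    return out


calls = {"n": 0}


def flaky_upper(batch: List[str]) -> List[str]:
    """Hypothetical UDF: fails on its first call, and always fails on empty strings."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated transient failure")
    if any(s == "" for s in batch):
        raise ValueError("empty input")
    return [s.upper() for s in batch]


result = run_udf(flaky_upper, ["a", "b", "", "c", "d"])
```

In this sketch the transient failure on the first batch is recovered by a retry, while the batch containing the empty string exhausts its retries and yields `None` for both of its rows, including the valid `"c"`. A runtime could instead fall back to row-by-row execution to isolate the bad row.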

Integrations

Apache Iceberg

Delta Lake

Apache Hudi

Unity Catalog

Amazon Web Services (S3)

Azure

Google Cloud Storage

Ray

Pandas

PyTorch

Hugging Face

OpenAI

API Available
