Needle In A Haystack

Name: Needle In A Haystack
Availability: OnlineOnly
Author: Greg Kamradt

A CLI tool that pressure-tests LLM long-context retrieval by sweeping context length and needle depth combinations to measure model accuracy.

Visit Website

At a Glance

Pricing

Open Source

Fully free and open-source CLI tool available via pip install.

Engagement

Available On

API

CLI

Greg KamradtGreg Kamradt builds open-source tools for evaluating and und…

Listed Jul 2026

About Needle In A Haystack

Needle In A Haystack is an open-source benchmarking tool created by Greg Kamradt that evaluates how well large language models retrieve information from long contexts. Originally published in November 2023, it gained wide attention for its visual heatmap results comparing GPT-4 and Claude 2.1 long-context performance. The project is now in v2, a clean refactor released in May 2026.

What It Is

Needle In A Haystack (niah) is a CLI-driven sweep framework that runs a grid of (context length × needle depth) cells against any configured LLM, scores each response, and writes one result row per cell to a JSONL file. The core idea is simple: hide a "needle" (a fact, UUID, or chain of linked values) somewhere inside a large "haystack" of text, then ask the model to retrieve it — and repeat this across many context lengths and insertion depths to build a complete accuracy map.

Built-in Tasks and Architecture

The tool ships four task types out of the box:

single — one fact placed at one depth; exact-match scored
multi — N facts spread evenly through the context; fractional score
uuid — one fresh UUID at one depth; model must repeat it verbatim
uuid_chain — a chain of A → B → C → … links spread through the context; the model must discover multi-step hops without being told the chain structure

The architecture is built around small Protocols connected by registries, so adding a new provider, task type, haystack source, or scorer requires writing one file and a registry call — the runner itself never needs to change.

Supported Providers and Configuration

Out of the box, niah supports OpenAI, Anthropic, and Cohere. Runs are driven by two small YAML files: a run config (sweep dimensions, task type, haystack source, concurrency, resume behavior) and a model config (SDK, API style, request parameters). Anything under request: is forwarded verbatim to the SDK, so provider-specific knobs like thinking, reasoning_effort, or top_p require no code changes.

Result Storage and Reconstruction

Each JSONL row stores a compact recipe rather than the full rendered context, keeping file sizes small even for 200k-token sweeps. The niah reconstruct command walks the recipe to reproduce the byte-identical prompt the model actually saw — useful when a surprising result needs manual inspection. Each row also records token usage, cost in USD, duration, score details, and seed for full reproducibility.

Update: v2.0.0 — Clean Refactor

Version 2.0.0 was published on May 30, 2026, representing a significant refactor of the original 2023 codebase. The v2 schema is not backward-compatible with the original result files (preserved in original_results/ for reference). Key improvements include the uuid_chain task for multi-step reasoning evaluation, a niah reconstruct command, YAML-driven configuration, a --dry-run flag, resume support, and a fix to the v1 multi-needle depth-reporting bug where each needle's reported depth was inflated by earlier insertions.

Why It Got Attention

The original November 2023 runs — testing GPT-4-128K and Claude 2.1 — produced heatmap visualizations that circulated widely on Twitter/X and became a reference benchmark in the LLM community for understanding long-context reliability. The repository has accumulated over 2,300 stars and 247 forks on GitHub according to its project metadata. Greg Kamradt published a behind-the-scenes video and tweet threads documenting the methodology and results for both models.

Community Discussions

Be the first to start a conversation about Needle In A Haystack

Share your experience with Needle In A Haystack, ask questions, or help others learn from your insights.

Pricing

OPEN SOURCE

Open Source

Fully free and open-source CLI tool available via pip install.

Full sweep framework
All built-in tasks
OpenAI, Anthropic, Cohere providers
YAML configuration
JSONL result storage

Capabilities

Key Features

Context length × needle depth sweep
Single-fact retrieval task
Multi-fact recall task
UUID retrieval task
UUID-chain multi-hop reasoning task
YAML-driven run and model configuration
JSONL result output with recipe-based reconstruction
niah reconstruct command for exact prompt replay
Dry-run and validate modes
Resume support for interrupted sweeps
Concurrency and retry configuration
Built-in FakeProvider for no-API-key testing
Cost tracking per cell (USD)
Token usage tracking
Plugin architecture for custom providers, tasks, haystacks, and scorers
OpenAI, Anthropic, and Cohere support out of the box

Integrations

OpenAI

Anthropic

Cohere

API Available

View Docs

Demo Video

Watch on YouTube

Back to all tools Suggest an edit

Needle In A Haystack

LLM Evaluations

A CLI tool that pressure-tests LLM long-context retrieval by sweeping context length and needle depth combinations to measure model accuracy.

Visit Website

At a Glance

Pricing

Open Source

Fully free and open-source CLI tool available via pip install.

Engagement

ratings

discussions

Available On

API

CLI

Resources

Website Docs GitHub llms.txt

Topics

LLM Evaluations AI Development Libraries Performance Metrics

Alternatives

Artificial Analysis Inspect AI BridgeBench

Developer

Greg KamradtGreg Kamradt builds open-source tools for evaluating and und…

Listed Jul 2026

About Needle In A Haystack

What It Is

Built-in Tasks and Architecture

The tool ships four task types out of the box:

single — one fact placed at one depth; exact-match scored
multi — N facts spread evenly through the context; fractional score
uuid — one fresh UUID at one depth; model must repeat it verbatim
uuid_chain — a chain of A → B → C → … links spread through the context; the model must discover multi-step hops without being told the chain structure

Supported Providers and Configuration

Result Storage and Reconstruction

Update: v2.0.0 — Clean Refactor

Why It Got Attention

Community Discussions

Be the first to start a conversation about Needle In A Haystack

Share your experience with Needle In A Haystack, ask questions, or help others learn from your insights.

Pricing

OPEN SOURCE

Open Source

Fully free and open-source CLI tool available via pip install.

Full sweep framework
All built-in tasks
OpenAI, Anthropic, Cohere providers
YAML configuration
JSONL result storage

Capabilities

Key Features

Context length × needle depth sweep
Single-fact retrieval task
Multi-fact recall task
UUID retrieval task
UUID-chain multi-hop reasoning task
YAML-driven run and model configuration
JSONL result output with recipe-based reconstruction
niah reconstruct command for exact prompt replay
Dry-run and validate modes
Resume support for interrupted sweeps
Concurrency and retry configuration
Built-in FakeProvider for no-API-key testing
Cost tracking per cell (USD)
Token usage tracking
Plugin architecture for custom providers, tasks, haystacks, and scorers
OpenAI, Anthropic, and Cohere support out of the box

Integrations

OpenAI

Anthropic

Cohere

API Available

View Docs

Demo Video

Watch on YouTube

Back to all tools Suggest an edit