# Needle In A Haystack

> A CLI tool that pressure-tests LLM long-context retrieval by sweeping context length and needle depth combinations to measure model accuracy.

Needle In A Haystack is an open-source benchmarking tool created by Greg Kamradt that evaluates how well large language models retrieve information from long contexts. Originally published in November 2023, it gained wide attention for its visual heatmap results comparing GPT-4 and Claude 2.1 long-context performance. The project is now in v2, a clean refactor released in May 2026.

## What It Is

Needle In A Haystack (`niah`) is a CLI-driven sweep framework that runs a grid of `(context length × needle depth)` cells against any configured LLM, scores each response, and writes one result row per cell to a JSONL file. The core idea is simple: hide a "needle" (a fact, UUID, or chain of linked values) somewhere inside a large "haystack" of text, then ask the model to retrieve it — and repeat this across many context lengths and insertion depths to build a complete accuracy map.
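The sweep idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: `insert_needle` and `sweep_cells` are hypothetical names, and the sketch counts words where a real run would presumably count tokens.

```python
from itertools import product

def insert_needle(haystack_words, needle, depth_pct):
    """Return a new word list with `needle` placed depth_pct percent of the way in."""
    position = round(len(haystack_words) * depth_pct / 100)
    return haystack_words[:position] + [needle] + haystack_words[position:]

def sweep_cells(context_lengths, depths_pct):
    """List every (context length, depth) pair in the grid."""
    return list(product(context_lengths, depths_pct))

# Ten filler words with the needle placed halfway through.
filler = ["hay"] * 10
context = insert_needle(filler, "NEEDLE", 50)

# Two context lengths times three depths gives six cells to evaluate.
cells = sweep_cells([1000, 2000], [0, 50, 100])
```

Each cell would then be rendered, sent to the model, scored, and written out as one result row.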

## Built-in Tasks and Architecture

The tool ships four task types out of the box:

- **`single`** — one fact placed at one depth; exact-match scored
- **`multi`** — N facts spread evenly through the context; fractional score
- **`uuid`** — one fresh UUID at one depth; model must repeat it verbatim
- **`uuid_chain`** — a chain of `A → B → C → …` links spread through the context; the model must discover multi-step hops without being told the chain structure
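To make the `multi` and `uuid_chain` descriptions concrete, here is a rough Python sketch of fractional scoring and of a chain of links that must be followed hop by hop. The function names, the substring-based scoring rule, and the dictionary representation of links are all assumptions for illustration; the tool's own implementation may differ.

```python
import uuid

def fractional_score(expected_facts, response):
    """Fraction of expected facts that appear verbatim in the response."""
    found = sum(1 for fact in expected_facts if fact in response)
    return found / len(expected_facts)

def build_chain(hops, ids=None):
    """Create hops+1 identifiers linked A -> B -> C ...; returns (start, links, end)."""
    ids = ids or [str(uuid.uuid4()) for _ in range(hops + 1)]
    links = {ids[i]: ids[i + 1] for i in range(hops)}
    return ids[0], links, ids[-1]

def follow_chain(start, links):
    """Resolve the chain by hopping until no further link exists."""
    current = start
    while current in links:
        current = links[current]
    return current

# Fixed identifiers keep the example readable; real runs would use fresh UUIDs.
start, links, end = build_chain(3, ids=["A", "B", "C", "D"])
```

In a real sweep each link would be embedded at a different depth in the haystack, so resolving `start` to `end` requires the model to find every hop.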

The architecture is built around small Protocols connected by registries, so adding a new provider, task type, haystack source, or scorer requires writing one file and a registry call — the runner itself never needs to change.
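The Protocol-plus-registry pattern described above might look roughly like the following. This is a generic sketch of the technique, assuming hypothetical names (`Scorer`, `SCORERS`, `register_scorer`, `run_cell`); it does not reproduce the project's actual interfaces.

```python
from typing import Callable, Dict, Protocol

class Scorer(Protocol):
    """Anything with a matching `score` method satisfies this Protocol."""
    def score(self, expected: str, response: str) -> float: ...

# Registry mapping a name to a factory that builds a scorer.
SCORERS: Dict[str, Callable[[], Scorer]] = {}

def register_scorer(name: str):
    """Decorator that adds a scorer class to the registry under `name`."""
    def decorator(factory):
        SCORERS[name] = factory
        return factory
    return decorator

@register_scorer("exact")
class ExactMatch:
    def score(self, expected: str, response: str) -> float:
        return 1.0 if expected.strip() == response.strip() else 0.0

def run_cell(scorer_name: str, expected: str, response: str) -> float:
    """The runner only looks names up; it never imports concrete scorers."""
    scorer = SCORERS[scorer_name]()
    return scorer.score(expected, response)
```

Adding a new scorer under this design means writing one decorated class; `run_cell` stays untouched.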

## Supported Providers and Configuration

Out of the box, `niah` supports **OpenAI**, **Anthropic**, and **Cohere**. Runs are driven by two small YAML files: a run config (sweep dimensions, task type, haystack source, concurrency, resume behavior) and a model config (SDK, API style, request parameters). Anything under `request:` is forwarded verbatim to the SDK, so provider-specific knobs like `thinking`, `reasoning_effort`, or `top_p` require no code changes.
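The verbatim forwarding of `request:` can be pictured as a plain dictionary merge. The sketch below assumes the model config has already been parsed into a Python dict; the key names other than `request`, the model name, and the `fake_sdk_call` stand-in are hypothetical and are not taken from the project.

```python
# A parsed model config, shown as a dict. Only `request` is named in the text above;
# the other keys and values are illustrative assumptions.
model_config = {
    "sdk": "openai",
    "model": "example-model",
    "request": {"top_p": 0.9, "reasoning_effort": "low"},
}

def build_call_kwargs(config, prompt):
    """Merge fixed arguments with everything under `request`, unchanged."""
    kwargs = {"model": config["model"], "prompt": prompt}
    kwargs.update(config.get("request", {}))
    return kwargs

def fake_sdk_call(**kwargs):
    """Stand-in for a provider SDK call; it simply echoes what it received."""
    return kwargs

sent = fake_sdk_call(**build_call_kwargs(model_config, "Where is the needle?"))
```

Because the runner never inspects the contents of `request`, a new provider parameter only needs a config change.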

## Result Storage and Reconstruction

Each JSONL row stores a compact **recipe** rather than the full rendered context, keeping file sizes small even for 200k-token sweeps. The `niah reconstruct` command walks the recipe to reproduce, byte for byte, the prompt the model actually saw, which is useful when a surprising result needs manual inspection. For full reproducibility, each row also records token usage, cost in USD, duration, score details, and the seed.
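Recipe-based storage works because prompt rendering is deterministic given the recipe. The following is a simplified sketch of that idea, assuming a hypothetical recipe layout (`seed`, `context_words`, `depth_pct`, `needle`) and a toy corpus; the real recipe schema is not described here.

```python
import json
import random

# Hypothetical stand-in for a haystack source.
CORPUS = ["alpha", "bravo", "charlie", "delta", "echo"]

def render_prompt(recipe):
    """Rebuild the prompt deterministically from the recipe alone."""
    rng = random.Random(recipe["seed"])  # the same seed yields the same filler text
    words = [rng.choice(CORPUS) for _ in range(recipe["context_words"])]
    position = round(len(words) * recipe["depth_pct"] / 100)
    words.insert(position, recipe["needle"])
    return " ".join(words)

recipe = {"seed": 7, "context_words": 200, "depth_pct": 25,
          "needle": "The code is 1234."}
original_prompt = render_prompt(recipe)

# The stored row holds only the recipe and the result, not the prompt itself.
row = json.dumps({"recipe": recipe, "score": 1.0})

# Later, the prompt is rebuilt from the stored row.
rebuilt = render_prompt(json.loads(row)["recipe"])
```

The stored row stays a fixed, small size regardless of how long the rendered context is.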

## Update: v2.0.0 — Clean Refactor

Version 2.0.0, published on May 30, 2026, is a significant refactor of the original 2023 codebase. The v2 result schema is not backward-compatible with the original result files, which are preserved in `original_results/` for reference. Key improvements include the `uuid_chain` task for multi-step reasoning evaluation, a `niah reconstruct` command, YAML-driven configuration, a `--dry-run` flag, resume support, and a fix for the v1 multi-needle depth-reporting bug, in which each needle's reported depth was inflated by earlier insertions.
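The arithmetic behind that kind of depth inflation can be illustrated with a small sketch. The v1 code is not shown in this document, so what follows only demonstrates the general mechanism under an assumed setup: needles are inserted front to back, and a naive report uses the shifted position rather than the intended one.

```python
def insert_all(haystack, needles, depths_pct):
    """Insert needles front to back; return the context and (intended, naive) depths."""
    base_len = len(haystack)
    context = list(haystack)
    report = []
    for offset, (needle, depth) in enumerate(zip(needles, depths_pct)):
        base_pos = round(base_len * depth / 100)
        # Every needle already inserted shifts later positions by one slot.
        actual_pos = base_pos + offset
        context.insert(actual_pos, needle)
        # Naive reporting measures the shifted position against the original length.
        naive_depth = 100 * actual_pos / base_len
        report.append((depth, naive_depth))
    return context, report

haystack = ["hay"] * 100
context, report = insert_all(haystack, ["n1", "n2", "n3"], [20, 50, 80])
# report == [(20, 20.0), (50, 51.0), (80, 82.0)]
```

The first needle is reported correctly, while each later one drifts further from its intended depth. Recording the intended depth avoids the drift.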

## Why It Got Attention

The original November 2023 runs, which tested GPT-4-128K and Claude 2.1, produced heatmap visualizations that circulated widely on Twitter/X and became a reference benchmark in the LLM community for understanding long-context reliability. According to its project metadata, the repository has accumulated over 2,300 stars and 247 forks on GitHub. Greg Kamradt published a behind-the-scenes video and tweet threads documenting the methodology and results for both models.

## Features
- Context length × needle depth sweep
- Single-fact retrieval task
- Multi-fact recall task
- UUID retrieval task
- UUID-chain multi-hop reasoning task
- YAML-driven run and model configuration
- JSONL result output with recipe-based reconstruction
- `niah reconstruct` command for exact prompt replay
- Dry-run and validate modes
- Resume support for interrupted sweeps
- Concurrency and retry configuration
- Built-in `FakeProvider` for no-API-key testing
- Cost tracking per cell (USD)
- Token usage tracking
- Plugin architecture for custom providers, tasks, haystacks, and scorers
- OpenAI, Anthropic, and Cohere support out of the box

## Integrations
OpenAI, Anthropic, Cohere

## Platforms
API, CLI

## Pricing
Open Source

## Version
v2.0.0

## Links
- Website: https://github.com/gkamradt/needle-in-a-haystack
- Documentation: https://github.com/gkamradt/needle-in-a-haystack
- Repository: https://github.com/gkamradt/needle-in-a-haystack
- EveryDev.ai: https://www.everydev.ai/tools/needle-in-a-haystack
