Needle In A Haystack
A CLI tool that pressure-tests LLM long-context retrieval by sweeping context length and needle depth combinations to measure model accuracy.
At a Glance
Fully free and open-source CLI tool available via pip install.
Listed Jul 2026
About Needle In A Haystack
Needle In A Haystack is an open-source benchmarking tool created by Greg Kamradt that evaluates how well large language models retrieve information from long contexts. Originally published in November 2023, it gained wide attention for its visual heatmap results comparing GPT-4 and Claude 2.1 long-context performance. The project is now in v2, a clean refactor released in May 2026.
What It Is
Needle In A Haystack (niah) is a CLI-driven sweep framework that runs a grid of (context length × needle depth) cells against any configured LLM, scores each response, and writes one result row per cell to a JSONL file. The core idea is simple: hide a "needle" (a fact, UUID, or chain of linked values) somewhere inside a large "haystack" of text, then ask the model to retrieve it — and repeat this across many context lengths and insertion depths to build a complete accuracy map.
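The sweep described above can be sketched in a few lines of Python. This is an illustrative sketch only, not niah's actual code: the function names (`insert_needle`, `build_grid`), the use of word counts instead of tokens, and the stand-in "model" that simply searches the prompt are all assumptions made for the example.

```python
from itertools import product

def insert_needle(haystack_words, needle, depth_pct):
    """Place the needle at depth_pct (0 = start, 100 = end) of the word list."""
    cut = round(len(haystack_words) * depth_pct / 100)
    return haystack_words[:cut] + [needle] + haystack_words[cut:]

def build_grid(context_lengths, depths):
    """Every (context length, needle depth) combination is one cell."""
    return list(product(context_lengths, depths))

# Hypothetical filler text and needle, for illustration only.
filler = ("The quick brown fox jumps over the lazy dog. " * 200).split()
needle = "The secret code is 7421."

grid = build_grid([50, 100, 200], [0, 50, 100])
rows = []
for length, depth in grid:
    prompt = " ".join(insert_needle(filler[:length], needle, depth))
    # A real run would send the prompt to an LLM; this stand-in just searches it.
    answer = "7421" if "7421" in prompt else ""
    rows.append({"context_length": length, "depth": depth,
                 "score": float(answer == "7421")})
```

Each entry in `rows` corresponds to one cell of the accuracy map; a real run would measure length in tokens and score the model's actual reply.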
Built-in Tasks and Architecture
The tool ships four task types out of the box:
- single: one fact placed at one depth; exact-match scored
- multi: N facts spread evenly through the context; fractional score
- uuid: one fresh UUID at one depth; model must repeat it verbatim
- uuid_chain: a chain of A → B → C → … links spread through the context; the model must discover multi-step hops without being told the chain structure
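To make the uuid_chain idea concrete, here is a rough Python sketch of how such a chain could be built, scattered through filler text, and resolved. The helper names (`make_chain`, `spread`, `follow`), the `" -> "` link format, and the even spacing are assumptions for illustration; the tool's real prompt format and scoring may differ.

```python
import uuid

def make_chain(n_links):
    """Create n_links + 1 fresh UUIDs and the links that connect them."""
    ids = [str(uuid.uuid4()) for _ in range(n_links + 1)]
    links = [f"{a} -> {b}" for a, b in zip(ids, ids[1:])]
    return ids, links

def spread(filler, links):
    """Insert links at evenly spaced positions in the filler lines."""
    out = list(filler)
    step = len(filler) // (len(links) + 1)
    # Insert from the back so earlier positions are not shifted.
    for i in reversed(range(len(links))):
        out.insert(step * (i + 1), links[i])
    return out

def follow(lines, start):
    """Reference solver: hop from link to link until the chain ends."""
    mapping = dict(line.split(" -> ") for line in lines if " -> " in line)
    current = start
    while current in mapping:
        current = mapping[current]
    return current

ids, links = make_chain(3)
context = spread(["filler sentence"] * 8, links)
expected = follow(context, ids[0])
# Exact-match score for a hypothetical model reply equal to the final UUID.
score = float(expected == ids[-1])
```

The model under test would only be given the first UUID and the context; `follow` plays the role of a reference answer for scoring.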
The architecture is built around small Protocols connected by registries, so adding a new provider, task type, haystack source, or scorer requires writing one file and a registry call — the runner itself never needs to change.
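A Protocol-plus-registry design of this kind might look roughly like the following. This is a generic sketch of the pattern, not niah's source: `Scorer`, `SCORERS`, `register_scorer`, and `run_cell` are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, Protocol

class Scorer(Protocol):
    """Anything with a matching score() method satisfies this Protocol."""
    def score(self, expected: str, response: str) -> float: ...

# Hypothetical registry mapping a config name to a scorer factory.
SCORERS: Dict[str, Callable[[], Scorer]] = {}

def register_scorer(name: str):
    """Decorator that adds a scorer class to the registry under `name`."""
    def decorator(factory):
        SCORERS[name] = factory
        return factory
    return decorator

@register_scorer("exact")
class ExactMatch:
    def score(self, expected, response):
        return float(expected.strip() == response.strip())

@register_scorer("contains")
class Contains:
    def score(self, expected, response):
        return float(expected in response)

def run_cell(scorer_name, expected, response):
    """The runner only looks names up; it never imports concrete scorers."""
    scorer = SCORERS[scorer_name]()
    return scorer.score(expected, response)
```

Adding a third scorer here means writing one class and one decorator call, while `run_cell` stays untouched, which is the property the paragraph above describes.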
Supported Providers and Configuration
Out of the box, niah supports OpenAI, Anthropic, and Cohere. Runs are driven by two small YAML files: a run config (sweep dimensions, task type, haystack source, concurrency, resume behavior) and a model config (SDK, API style, request parameters). Anything under request: is forwarded verbatim to the SDK, so provider-specific knobs like thinking, reasoning_effort, or top_p require no code changes.
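The verbatim forwarding of the request: block can be pictured as plain keyword-argument unpacking. In the sketch below, the config is shown as an already-parsed Python dict, and every key other than `request` (as well as `fake_sdk_create` and `call_model`) is a hypothetical stand-in rather than niah's real schema or any real SDK call.

```python
# Hypothetical model config, as it might look after the YAML file is parsed.
model_config = {
    "sdk": "fake",
    "model": "example-model",
    "request": {"top_p": 0.9, "max_tokens": 64},
}

def fake_sdk_create(**kwargs):
    """Stand-in for a provider SDK call; it just echoes what it received."""
    return kwargs

def call_model(config, prompt):
    # Everything under "request" is passed through untouched, so a new
    # provider-specific parameter needs a config change, not a code change.
    return fake_sdk_create(model=config["model"], prompt=prompt,
                           **config.get("request", {}))

sent = call_model(model_config, "Where is the needle?")
```

Because the runner never inspects the contents of `request`, a new knob reaches the SDK as soon as it appears in the config.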
Result Storage and Reconstruction
Each JSONL row stores a compact recipe rather than the full rendered context, keeping file sizes small even for 200k-token sweeps. The niah reconstruct command walks the recipe to reproduce the byte-identical prompt the model actually saw — useful when a surprising result needs manual inspection. Each row also records token usage, cost in USD, duration, score details, and seed for full reproducibility.
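The recipe idea can be illustrated with a small deterministic renderer. This is a sketch under stated assumptions: the recipe fields (`seed`, `n_sentences`, `depth`, `needle`), the `render` function, and the stored hash are invented for the example and are not niah's actual row schema.

```python
import hashlib
import json
import random

# Hypothetical haystack source: a small pool of filler sentences.
SENTENCES = [
    "The harbor was quiet that morning.",
    "Rain fell steadily on the tin roof.",
    "A train passed somewhere in the distance.",
]

def render(recipe):
    """Rebuild the full prompt from the compact recipe, deterministically."""
    rng = random.Random(recipe["seed"])
    parts = [rng.choice(SENTENCES) for _ in range(recipe["n_sentences"])]
    cut = round(len(parts) * recipe["depth"] / 100)
    parts.insert(cut, recipe["needle"])
    return " ".join(parts)

recipe = {"seed": 42, "n_sentences": 500, "depth": 25,
          "needle": "The secret code is 7421."}
prompt = render(recipe)

# Store only the recipe plus a hash of the prompt, never the prompt itself.
row = json.dumps({"recipe": recipe,
                  "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest()})

# Later: reconstruct from the stored row and verify it matches byte for byte.
loaded = json.loads(row)
rebuilt = render(loaded["recipe"])
matches = (hashlib.sha256(rebuilt.encode()).hexdigest()
           == loaded["prompt_sha256"])
```

The stored row stays a few hundred bytes regardless of how long the rendered prompt is, because all randomness is derived from the recorded seed.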
Update: v2.0.0 — Clean Refactor
Version 2.0.0 was published on May 30, 2026, representing a significant refactor of the original 2023 codebase. The v2 schema is not backward-compatible with the original result files (preserved in original_results/ for reference). Key improvements include the uuid_chain task for multi-step reasoning evaluation, a niah reconstruct command, YAML-driven configuration, a --dry-run flag, resume support, and a fix to the v1 multi-needle depth-reporting bug where each needle's reported depth was inflated by earlier insertions.
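Resume support for a JSONL-backed sweep generally comes down to skipping cells that already have a result row. The following is a minimal sketch of that idea, assuming hypothetical row keys (`context_length`, `depth`) and helper names; it does not reproduce niah's own resume logic.

```python
import json
import os
import tempfile

def completed_cells(path):
    """Collect the (context_length, depth) cells already written to the file."""
    done = set()
    if os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                if line.strip():
                    row = json.loads(line)
                    done.add((row["context_length"], row["depth"]))
    return done

def run_sweep(path, grid, run_cell):
    """Append one row per cell, skipping cells that already have a row."""
    done = completed_cells(path)
    ran = []
    with open(path, "a") as fh:
        for cell in grid:
            if cell in done:
                continue
            length, depth = cell
            fh.write(json.dumps({"context_length": length, "depth": depth,
                                 "score": run_cell(length, depth)}) + "\n")
            ran.append(cell)
    return ran

path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
grid = [(1000, 0), (1000, 50), (1000, 100)]
first = run_sweep(path, grid[:2], lambda length, depth: 1.0)  # simulated interruption
second = run_sweep(path, grid, lambda length, depth: 1.0)     # picks up the rest
```

On the second call only the missing cell is executed, so an interrupted sweep does not pay again for cells it already completed.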
Why It Got Attention
The original November 2023 runs — testing GPT-4-128K and Claude 2.1 — produced heatmap visualizations that circulated widely on Twitter/X and became a reference benchmark in the LLM community for understanding long-context reliability. The repository has accumulated over 2,300 stars and 247 forks on GitHub according to its project metadata. Greg Kamradt published a behind-the-scenes video and tweet threads documenting the methodology and results for both models.
Pricing
Open Source
- Full sweep framework
- All built-in tasks
- OpenAI, Anthropic, Cohere providers
- YAML configuration
- JSONL result storage
Capabilities
Key Features
- Context length × needle depth sweep
- Single-fact retrieval task
- Multi-fact recall task
- UUID retrieval task
- UUID-chain multi-hop reasoning task
- YAML-driven run and model configuration
- JSONL result output with recipe-based reconstruction
- niah reconstruct command for exact prompt replay
- Dry-run and validate modes
- Resume support for interrupted sweeps
- Concurrency and retry configuration
- Built-in FakeProvider for no-API-key testing
- Cost tracking per cell (USD)
- Token usage tracking
- Plugin architecture for custom providers, tasks, haystacks, and scorers
- OpenAI, Anthropic, and Cohere support out of the box