EveryDev.ai
With AI, Everyone is a Dev. EveryDev.ai © 2026
    Needle In A Haystack

    LLM Evaluations

    A CLI tool that pressure-tests LLM long-context retrieval by sweeping context length and needle depth combinations to measure model accuracy.

    At a Glance

    Pricing
    Open Source

    Fully free and open-source CLI tool available via pip install.

    Available On

    API
    CLI

    Topics

    LLM Evaluations, AI Development Libraries, Performance Metrics

    Alternatives

    Artificial Analysis, Inspect AI, BridgeBench
    Listed Jul 2026

    About Needle In A Haystack

    Needle In A Haystack is an open-source benchmarking tool created by Greg Kamradt that evaluates how well large language models retrieve information from long contexts. Originally published in November 2023, it gained wide attention for its visual heatmap results comparing GPT-4 and Claude 2.1 long-context performance. The project is now in v2, a clean refactor released in May 2026.

    What It Is

    Needle In A Haystack (niah) is a CLI-driven sweep framework that runs a grid of (context length × needle depth) cells against any configured LLM, scores each response, and writes one result row per cell to a JSONL file. The core idea is simple: hide a "needle" (a fact, UUID, or chain of linked values) somewhere inside a large "haystack" of text, then ask the model to retrieve it — and repeat this across many context lengths and insertion depths to build a complete accuracy map.
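
The sweep idea can be sketched in a few lines of Python. This is a minimal illustration, not niah's actual code: the function names `insert_needle` and `build_grid` are hypothetical, and depth is measured in characters here for simplicity, whereas a real run would measure context in tokens.

```python
from itertools import product


def insert_needle(haystack: str, needle: str, depth_pct: float) -> str:
    """Insert `needle` at `depth_pct` (0-100) of the haystack, measured in characters."""
    pos = round(len(haystack) * depth_pct / 100)
    return haystack[:pos] + needle + haystack[pos:]


def build_grid(context_lengths, depths_pct):
    """Every (context length, needle depth) combination is one cell of the sweep."""
    return list(product(context_lengths, depths_pct))


filler = "The quick brown fox jumps over the lazy dog. "
needle = "The secret code is 7421. "

cells = build_grid([200, 400], [0, 50, 100])
rows = []
for length, depth in cells:
    # Repeat the filler text and trim it to the requested haystack size.
    haystack = (filler * (length // len(filler) + 1))[:length]
    prompt = insert_needle(haystack, needle, depth)
    # One result row per cell; a real run would also call the model and score it.
    rows.append({"context_length": length, "depth_pct": depth, "prompt_chars": len(prompt)})
```

Each entry in `rows` corresponds to one cell of the grid, which is the shape a heatmap of accuracy by length and depth would be built from.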

    Built-in Tasks and Architecture

    The tool ships four task types out of the box:

    • single — one fact placed at one depth; exact-match scored
    • multi — N facts spread evenly through the context; fractional score
    • uuid — one fresh UUID at one depth; model must repeat it verbatim
    • uuid_chain — a chain of A → B → C → … links spread through the context; the model must discover multi-step hops without being told the chain structure
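
To make the scoring styles above concrete, here is a rough sketch under stated assumptions. The function names are hypothetical, "exact match" is interpreted as the expected string appearing verbatim in the response, and the chain representation (a list of source/target pairs) is an illustrative choice rather than niah's internal format.

```python
import random
import uuid


def score_single(response: str, expected: str) -> float:
    """1.0 if the expected fact appears verbatim in the response, else 0.0."""
    return 1.0 if expected in response else 0.0


def score_multi(response: str, expected_facts: list) -> float:
    """Fraction of the hidden facts that the response reproduces."""
    found = sum(1 for fact in expected_facts if fact in response)
    return found / len(expected_facts)


def make_chain(n_links: int, rng: random.Random) -> list:
    """Build A -> B -> C -> ... as (source, target) pairs of seeded UUIDs."""
    ids = [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(n_links + 1)]
    return list(zip(ids, ids[1:]))


def follow_chain(links: list, start: str) -> str:
    """Resolve the chain hop by hop; this is the answer the model has to discover."""
    mapping = dict(links)
    current = start
    while current in mapping:
        current = mapping[current]
    return current


chain = make_chain(3, random.Random(0))
final_answer = follow_chain(chain, chain[0][0])
```

In a chain task the individual links would be scattered through the haystack, so the model has to find each hop before it can reach `final_answer`.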

    The architecture is built around small Protocols connected by registries, so adding a new provider, task type, haystack source, or scorer requires writing one file and a registry call — the runner itself never needs to change.
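
The Protocol-plus-registry pattern described above might look roughly like the following. This is a generic sketch of the pattern, not niah's source: `Scorer`, `SCORERS`, `register_scorer`, and `run_cell` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Protocol


class Scorer(Protocol):
    """Structural interface: anything with a matching score() method qualifies."""

    def score(self, response: str, expected: str) -> float: ...


# Registry mapping a config name to a factory that builds a scorer.
SCORERS: Dict[str, Callable[[], Scorer]] = {}


def register_scorer(name: str):
    """Decorator that adds a scorer class to the registry under `name`."""
    def decorator(factory):
        SCORERS[name] = factory
        return factory
    return decorator


@register_scorer("exact")
class ExactScorer:
    def score(self, response: str, expected: str) -> float:
        return 1.0 if response.strip() == expected else 0.0


@register_scorer("contains")
class ContainsScorer:
    def score(self, response: str, expected: str) -> float:
        return 1.0 if expected in response else 0.0


def run_cell(scorer_name: str, response: str, expected: str) -> float:
    """The runner only looks names up; it never changes when scorers are added."""
    scorer = SCORERS[scorer_name]()
    return scorer.score(response, expected)
```

Adding `ContainsScorer` required one class and one registry call, while `run_cell` stayed untouched, which is the property the paragraph above describes.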

    Supported Providers and Configuration

    Out of the box, niah supports OpenAI, Anthropic, and Cohere. Runs are driven by two small YAML files: a run config (sweep dimensions, task type, haystack source, concurrency, resume behavior) and a model config (SDK, API style, request parameters). Anything under request: is forwarded verbatim to the SDK, so provider-specific knobs like thinking, reasoning_effort, or top_p require no code changes.
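
Verbatim forwarding of request parameters can be illustrated with plain keyword-argument unpacking. Everything here is an assumption made for illustration: the config is shown as a Python dict rather than YAML, the key names other than `request` are invented, and `EchoClient` is a hypothetical stand-in for a real SDK client.

```python
# Hypothetical model config, as it might look after the YAML file is parsed.
model_config = {
    "sdk": "echo",              # assumed key name
    "model": "example-model",   # assumed key name
    "request": {"temperature": 0.0, "top_p": 0.9},
}


class EchoClient:
    """Stand-in for an SDK client; records the keyword arguments it receives."""

    def __init__(self):
        self.last_call = None

    def create(self, **kwargs):
        self.last_call = kwargs
        return {"text": "ok"}


def call_model(client, config: dict, prompt: str) -> dict:
    # Everything under "request" is unpacked untouched, so a new provider knob
    # only needs a new config entry, not a code change.
    return client.create(model=config["model"], prompt=prompt, **config["request"])


client = EchoClient()
result = call_model(client, model_config, "Where is the needle?")
```

Because the wrapper never inspects the `request` mapping, an unknown parameter would surface as an error from the SDK itself rather than from the wrapper.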

    Result Storage and Reconstruction

    Each JSONL row stores a compact recipe rather than the full rendered context, keeping file sizes small even for 200k-token sweeps. The niah reconstruct command walks the recipe to reproduce the byte-identical prompt the model actually saw — useful when a surprising result needs manual inspection. Each row also records token usage, cost in USD, duration, score details, and seed for full reproducibility.
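
The recipe idea relies on deterministic rendering: if the same inputs and seed always yield the same prompt, only the inputs need to be stored. The sketch below shows that principle under assumed field names (`seed`, `n_sentences`, `depth_pct`, `needle`) and a toy corpus; the real row schema and the behavior of `niah reconstruct` may differ.

```python
import hashlib
import json
import random

CORPUS = [
    "The harbor was quiet before dawn.",
    "Rain fell steadily across the valley.",
    "A train rattled over the old bridge.",
    "Lanterns swayed above the market stalls.",
]


def render(recipe: dict) -> str:
    """Rebuild the full prompt from the recipe alone; the same recipe gives the same text."""
    rng = random.Random(recipe["seed"])
    sentences = [rng.choice(CORPUS) for _ in range(recipe["n_sentences"])]
    pos = round(len(sentences) * recipe["depth_pct"] / 100)
    sentences.insert(pos, recipe["needle"])
    return " ".join(sentences)


recipe = {"seed": 7, "n_sentences": 50, "depth_pct": 25, "needle": "The magic number is 12."}

# Store the recipe plus a hash of the rendered prompt instead of the prompt itself.
prompt = render(recipe)
row = json.dumps({"recipe": recipe, "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest()})

# Later: reload the row and reproduce the exact prompt for inspection.
loaded = json.loads(row)
rebuilt = render(loaded["recipe"])
matches = hashlib.sha256(rebuilt.encode()).hexdigest() == loaded["prompt_sha256"]
```

Storing a hash alongside the recipe is an optional design choice in this sketch; it lets a reconstruction be verified against what was originally sent.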

    Update: v2.0.0 — Clean Refactor

    Version 2.0.0 was published on May 30, 2026, representing a significant refactor of the original 2023 codebase. The v2 schema is not backward-compatible with the original result files (preserved in original_results/ for reference). Key improvements include the uuid_chain task for multi-step reasoning evaluation, a niah reconstruct command, YAML-driven configuration, a --dry-run flag, resume support, and a fix to the v1 multi-needle depth-reporting bug where each needle's reported depth was inflated by earlier insertions.
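
One way a depth-inflation bug of this kind can arise is by measuring each needle's offset in text that already contains the earlier needles. The sketch below illustrates that general failure mode and one way to avoid it; it is an assumption about the mechanism, not a reproduction of the v1 code, and both function names are hypothetical.

```python
def place_needles(haystack: str, needles: list, depths_pct: list):
    """Insert needles and report depths relative to the original haystack."""
    positions = [round(len(haystack) * d / 100) for d in depths_pct]
    text = haystack
    # Insert from the deepest position backwards so earlier offsets stay valid.
    for pos, needle in sorted(zip(positions, needles), reverse=True):
        text = text[:pos] + needle + text[pos:]
    reported = [100 * p / len(haystack) for p in positions]
    return text, reported


def inflated_depths(haystack: str, needles: list, depths_pct: list) -> list:
    """Illustrates the inflation: each offset includes the earlier needles' lengths."""
    shift = 0
    reported = []
    for needle, d in zip(needles, depths_pct):
        pos = round(len(haystack) * d / 100) + shift
        reported.append(100 * pos / len(haystack))
        shift += len(needle)
    return reported


haystack = "x" * 100
needles = ["A" * 10, "B" * 10, "C" * 10]
depths = [25, 50, 75]

text, correct = place_needles(haystack, needles, depths)
inflated = inflated_depths(haystack, needles, depths)
# correct  -> [25.0, 50.0, 75.0]
# inflated -> [25.0, 60.0, 95.0]
```

In this toy case the first needle is unaffected, while each later needle drifts further from its intended depth as more text is inserted ahead of it.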

    Why It Got Attention

    The original November 2023 runs — testing GPT-4-128K and Claude 2.1 — produced heatmap visualizations that circulated widely on Twitter/X and became a reference benchmark in the LLM community for understanding long-context reliability. The repository has accumulated over 2,300 stars and 247 forks on GitHub according to its project metadata. Greg Kamradt published a behind-the-scenes video and tweet threads documenting the methodology and results for both models.

    Pricing

    Open Source

    Fully free and open-source CLI tool available via pip install.

    • Full sweep framework
    • All built-in tasks
    • OpenAI, Anthropic, Cohere providers
    • YAML configuration
    • JSONL result storage

    Capabilities

    Key Features

    • Context length × needle depth sweep
    • Single-fact retrieval task
    • Multi-fact recall task
    • UUID retrieval task
    • UUID-chain multi-hop reasoning task
    • YAML-driven run and model configuration
    • JSONL result output with recipe-based reconstruction
    • niah reconstruct command for exact prompt replay
    • Dry-run and validate modes
    • Resume support for interrupted sweeps
    • Concurrency and retry configuration
    • Built-in FakeProvider for no-API-key testing
    • Cost tracking per cell (USD)
    • Token usage tracking
    • Plugin architecture for custom providers, tasks, haystacks, and scorers
    • OpenAI, Anthropic, and Cohere support out of the box

    Integrations

    OpenAI
    Anthropic
    Cohere
    API Available
    Developer

    Greg Kamradt

    Greg Kamradt builds open-source tools for evaluating and understanding large language models. He created the Needle In A Haystack benchmark, which became a widely referenced test for LLM long-context retrieval accuracy. His work focuses on practical, reproducible evaluation frameworks that help developers understand model behavior at scale.

    1 tool in directory

    Similar Tools

    Artificial Analysis

    Independent AI model benchmarking platform providing comprehensive performance analysis across intelligence, speed, cost, and quality metrics

    Inspect AI

    An open-source Python framework for large language model evaluations developed by the UK AI Security Institute, supporting agentic tasks, tool use, multi-turn dialog, and 200+ pre-built benchmarks.

    BridgeBench

    BridgeBench ranks AI coding models across UI generation, security, refactoring, hallucination, debugging, and speed benchmarks.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    99 tools

    AI Development Libraries

    Programming libraries and frameworks that provide machine learning capabilities, model integration, and AI functionality for developers.

    244 tools

    Performance Metrics

    Specialized tools for measuring, evaluating, and optimizing AI model performance across accuracy, speed, resource utilization, and other critical parameters.

    48 tools