Anserini
A Lucene-based toolkit for reproducible information retrieval research, bridging academic IR research and real-world search application development.
At a Glance
Fully free and open-source under the Apache License 2.0. No cost to use, modify, or distribute.
Listed May 2026
About Anserini
Anserini is an open-source Java toolkit built on Apache Lucene, designed to make information retrieval research reproducible and practically applicable. Maintained by the Castorini research group, it grew out of a 2016 reproducibility study of open-source retrieval engines (Lin et al., ECIR 2016) and has since been described in peer-reviewed publications at SIGIR 2017 and the Journal of Data and Information Quality (2018). The project is licensed under Apache 2.0 and is actively developed on GitHub with over 380 contributors.
What It Is
Anserini is a research toolkit that wraps Apache Lucene to provide a principled, reproducible environment for information retrieval (IR) experiments. Its core job is to let researchers index document collections, run retrieval experiments, and reproduce published baselines, all from a consistent, version-controlled codebase. The project explicitly positions itself as a bridge between academic IR research and the engineering of real-world search systems. A companion Python interface, Pyserini, exposes most Anserini features for users who prefer Python over Java.
Architecture and Setup Paths
Anserini offers two primary installation modes:
- Fatjar: A self-contained JAR downloaded via curl, requiring no repository clone. This is the fastest path for running experiments.
- Dev environment: A full repository clone for contributors or users who need to modify source code.
The toolkit is primarily written in Java (83%), with Python (14%) and Shell scripts rounding out the codebase. It is distributed on Maven Central under the io.anserini namespace, making it easy to include as a dependency in other Java projects.
Reproducibility as a First-Class Goal
The project's stated mission is reproducible IR research. It ships with prebuilt index registries and topic registries so that published experimental results can be re-run with a single command. Two reproduction workflows are documented: one from prebuilt indexes (faster) and one from raw document collections (more thorough). The repository includes dedicated runs/ and logs/ directories to capture experiment outputs in a structured way, and CI badges confirm that the build and test suite remain green on the master branch.
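As a rough illustration of what "reproducing" a result means in practice, the sketch below compares metric scores from a fresh run against reference scores and reports any that drift beyond a tolerance. The function name `find_mismatches`, the dictionary layout, and the default tolerance are assumptions made for this example; they are not part of Anserini's tooling.

```python
import math


def find_mismatches(reproduced: dict[str, float],
                    reference: dict[str, float],
                    tol: float = 1e-4) -> list[str]:
    """Return names of reference metrics that are missing from the
    reproduced scores or differ from the reference by more than tol.

    Hypothetical helper for illustration only; not an Anserini API.
    """
    mismatches = []
    for metric, expected in reference.items():
        actual = reproduced.get(metric)
        # A metric counts as reproduced only if it is present and
        # within the absolute tolerance of the reference value.
        if actual is None or not math.isclose(
                actual, expected, abs_tol=tol, rel_tol=0.0):
            mismatches.append(metric)
    return mismatches
```

An empty list means every reference metric was matched within the tolerance; any other result names the metrics that need investigation.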
Agent-Aware Workflow
Anserini has added explicit support for coding agents (such as those powered by large language models). The repository includes an .agents/skills/ directory with structured skill files for:
- Installing the dev environment or fatjar
- Running CLI commands (prebuilt-index registry, topics registry, search, REST workflows)
- Executing reproducibility experiments
The README provides direct prompt templates users can give to their coding agents, making Anserini one of the earlier research toolkits to formally document agent-oriented onboarding paths.
Update: v2.0.0 and Lucene 10.4.0
As of April 12, 2026 (commit c6eed6), Anserini was upgraded to Lucene 10.4.0 as part of the v2.0.0 release. Lucene 9 indexes remain readable by the new code, but indexes generated by Lucene 10 cannot be read by older versions of Anserini. The repository shows active development with commits as recent as May 20, 2026, including SPLADE-v3 ONNX reproduction updates and locale-stable reproduction output fixes.
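The compatibility rule above is one-directional, which is easy to get backwards. The sketch below encodes only the cases stated here as a lookup table; `READABLE` and `can_read` are hypothetical names for this example, not Anserini or Lucene APIs, and any version pair not listed is deliberately left unanswered.

```python
# Keys are (Lucene major version of the reading code,
#           Lucene major version that wrote the index).
# Only the combinations described in the text are covered.
READABLE = {
    (10, 10): True,   # new code, new index
    (10, 9): True,    # new code can still read Lucene 9 indexes
    (9, 9): True,     # old code, old index
    (9, 10): False,   # old code cannot read Lucene 10 indexes
}


def can_read(reader_major: int, index_major: int) -> bool:
    """Return whether the reader can open the index, for known pairs.

    Raises ValueError for version pairs this sketch does not cover.
    """
    try:
        return READABLE[(reader_major, index_major)]
    except KeyError:
        raise ValueError(
            f"compatibility of reader {reader_major} with index "
            f"{index_major} is not covered by this sketch"
        ) from None
```

In practice this means a shared index should be rebuilt with Lucene 10 only once every consumer has upgraded to a Lucene 10-based release.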
Pricing
Open Source
- Apache License 2.0
- Full source code access
- Maven Central distribution
- Fatjar download
- Community contributions
Capabilities
Key Features
- Lucene-based indexing and retrieval
- Reproducible IR experiment framework
- Prebuilt index registry
- Topics registry
- BM25 and dense retrieval support
- SPLADE and ONNX model support
- Fatjar self-contained distribution
- Maven Central package
- Pyserini Python interface
- Agent-oriented skill files for coding agents
- REST API workflows
- Prebuilt and raw document collection reproduction paths
