Anserini

Name: Anserini
Availability: Online only
Author: Castorini

A Lucene-based toolkit for reproducible information retrieval research, bridging academic IR research and real-world search application development.

At a Glance

Pricing

Open Source

Fully free and open-source under the Apache License 2.0. No cost to use, modify, or distribute.

Available On: CLI, API, SDK

Castorini

Castorini is a research group that builds open-source toolki…

Listed May 2026

About Anserini

Anserini is an open-source Java toolkit built on Apache Lucene, designed to make information retrieval research reproducible and practically applicable. Maintained by the Castorini research group, it grew out of a 2016 reproducibility study of open-source retrieval engines (Lin et al., ECIR 2016) and has since been described in peer-reviewed publications at SIGIR 2017 and the Journal of Data and Information Quality (2018). The project is licensed under Apache 2.0 and is actively developed on GitHub with over 380 contributors.

What It Is

Anserini is a research toolkit that wraps Apache Lucene to provide a principled, reproducible environment for information retrieval (IR) experiments. Its core job is to let researchers index document collections, run retrieval experiments, and reproduce published baselines, all from a consistent, version-controlled codebase. The project explicitly positions itself as a bridge between academic IR research and the engineering of real-world search systems. A companion Python interface, Pyserini, exposes most Anserini features for users who prefer Python over Java.

Architecture and Setup Paths

Anserini offers two primary installation modes:

Fatjar: A self-contained JAR downloaded via curl, requiring no repository clone. This is the fastest path for running experiments.
Dev environment: A full repository clone for contributors or users who need to modify source code.

The toolkit is primarily written in Java (83%), with Python (14%) and Shell scripts rounding out the codebase. It is distributed on Maven Central under the io.anserini namespace, making it easy to include as a dependency in other Java projects.

Reproducibility as a First-Class Goal

The project's stated mission is reproducible IR research. It ships with prebuilt index registries and topic registries so that published experimental results can be re-run with a single command. Two reproduction workflows are documented: one from prebuilt indexes (faster) and one from raw document collections (more thorough). The repository includes dedicated runs/ and logs/ directories that capture experiment outputs in a structured way, and CI badges report whether the build and test suite pass on the master branch.
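
Experiment outputs of this kind are commonly stored as TREC-format run files, the format that trec_eval consumes. As a minimal sketch, assuming the files under runs/ follow the standard six-column TREC layout (query ID, the literal Q0, document ID, rank, score, run tag), the snippet below parses such lines using only the Python standard library. The RunEntry type, the parse_run function, and the sample lines are illustrative and are not part of Anserini.

```python
from collections import defaultdict
from typing import NamedTuple


class RunEntry(NamedTuple):
    """One ranked document for a query (hypothetical helper type)."""
    docid: str
    rank: int
    score: float


def parse_run(lines):
    """Group TREC-format run lines by query ID, ordered by rank."""
    run = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout: qid Q0 docid rank score tag
        qid, _q0, docid, rank, score, _tag = line.split()
        run[qid].append(RunEntry(docid, int(rank), float(score)))
    for entries in run.values():
        entries.sort(key=lambda e: e.rank)
    return dict(run)


# Made-up sample lines, for illustration only.
sample = [
    "q1 Q0 docB 2 11.5 demo",
    "q1 Q0 docA 1 12.0 demo",
    "q2 Q0 docC 1 9.25 demo",
]
run = parse_run(sample)
print([e.docid for e in run["q1"]])  # prints ['docA', 'docB']
```

Grouping by query ID and sorting by rank makes it straightforward to compare two runs of the same experiment entry by entry.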

Agent-Aware Workflow

Anserini has added explicit support for coding agents (such as those powered by large language models). The repository includes an .agents/skills/ directory with structured skill files for:

Installing the dev environment or fatjar
Running CLI commands (prebuilt-index registry, topics registry, search, REST workflows)
Executing reproducibility experiments

The README provides direct prompt templates users can give to their coding agents, making Anserini one of the earlier research toolkits to formally document agent-oriented onboarding paths.

Update: v2.0.0 and Lucene 10.4.0

As of April 12, 2026 (commit c6eed6), Anserini was upgraded to Lucene 10.4.0 as part of the v2.0.0 release. Lucene 9 indexes remain readable by the new code, but indexes generated by Lucene 10 cannot be read by older versions of Anserini. The repository shows active development with commits as recent as May 20, 2026, including SPLADE-v3 ONNX reproduction updates and locale-stable reproduction output fixes.
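
The compatibility rule above can be written as a small predicate. This is a minimal sketch and not Anserini code: the function name can_read_index and the use of integer Lucene major versions are assumptions made for illustration, and the function encodes only the two facts stated here (code built on Lucene 10 reads Lucene 9 indexes; older code cannot read indexes written by Lucene 10).

```python
def can_read_index(reader_lucene_major: int, index_lucene_major: int) -> bool:
    """Return True if code built on one Lucene major version can open an
    index written by another, under the rule described in the text.

    Hypothetical helper: it only models Lucene 9 and Lucene 10.
    """
    supported = {9, 10}
    if reader_lucene_major not in supported or index_lucene_major not in supported:
        raise ValueError("this sketch only models Lucene 9 and 10")
    # Newer code can read same-version or older indexes;
    # older code cannot read indexes written by a newer major version.
    return index_lucene_major <= reader_lucene_major


print(can_read_index(10, 9))   # Lucene 10 code, Lucene 9 index
print(can_read_index(9, 10))   # Lucene 9 code, Lucene 10 index
```

In practice this means that indexes rebuilt after the upgrade should not be shared with environments still pinned to a pre-2.0.0 release.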

Pricing

Apache License 2.0
Full source code access
Maven Central distribution
Fatjar download
Community contributions

Capabilities

Key Features

Lucene-based indexing and retrieval
Reproducible IR experiment framework
Prebuilt index registry
Topics registry
BM25 and dense retrieval support
SPLADE and ONNX model support
Fatjar self-contained distribution
Maven Central package
Pyserini Python interface
Agent-oriented skill files for coding agents
REST API workflows
Prebuilt and raw document collection reproduction paths
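
BM25 is named in the feature list without further detail. As a rough reminder of what that ranking function computes, here is a textbook-style sketch in plain Python. It is not Anserini's or Lucene's implementation: the function name bm25_score, the tiny tokenized corpus, and the parameter values k1=0.9 and b=0.4 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=0.9, b=0.4):
    """Score one tokenized document against a query with textbook BM25.

    corpus is a list of tokenized documents, used for document
    frequencies and the average document length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        # Smoothed IDF that stays non-negative for common terms.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score


# Made-up corpus, for illustration only.
corpus = [
    ["lucene", "index", "search"],
    ["dense", "retrieval", "model"],
    ["lucene", "lucene", "toolkit"],
]
scores = [bm25_score(["lucene"], doc, corpus) for doc in corpus]
```

With equal-length documents, the third document scores highest for the query term because it contains the term twice, and the second scores zero because it does not contain it at all.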

Integrations

Apache Lucene

Pyserini

Maven Central

ONNX

SPLADE

trec_eval

MS MARCO

BEIR
