# SciArena

> Open evaluation platform from the Allen Institute for AI where researchers compare and rank foundation models on scientific literature tasks using head-to-head, literature-grounded responses.

SciArena is an open evaluation platform from the Allen Institute for AI (Ai2) for benchmarking foundation models on scientific literature tasks. Instead of relying on static benchmarks, SciArena collects head-to-head comparisons from human researchers: users submit research questions, see side-by-side, literature-grounded answers from two models, and vote for the better response. These votes drive a public leaderboard and power SciArena-Eval, a meta-evaluation benchmark for testing LLM-as-judge systems.

- **Arena-style model comparison** — Submit scientific questions, inspect long-form, citation-attributed answers from two foundation models, and cast a vote for the preferred output.
- **Leaderboard with Elo-style ratings** — Track how models like o3, Claude, Gemini, and DeepSeek rank overall and by scientific discipline using an Elo-style rating system (a minimal rating-update sketch appears under Examples below).
- **SciArena-Eval benchmark** — Use the released human preference data and code to study automated evaluators, LLM-as-judge setups, and model alignment with expert judgments.
- **Literature-grounded retrieval** — Behind the scenes, SciArena uses a multi-stage retrieval pipeline over the Semantic Scholar corpus to ground answers in relevant, up-to-date papers.
- **Research-grade data quality controls** — Expert annotators, training, blind ratings, and agreement checks help ensure the preference data is reliable enough for serious evaluation work.

## Features

- Semantic search across scientific literature
- AI-generated paper summaries
- Conversational Q&A over papers
- Filters for date, venue, and author, plus citation export

## Integrations

Semantic Scholar, arXiv, PubMed

## Platforms

Web, API

## Pricing

Free

## Links

- Website: https://sciarena.allen.ai
- Documentation: https://allenai.org/blog/sciarena
- EveryDev.ai: https://www.everydev.ai/tools/sciarena
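
## Examples

The leaderboard bullet above mentions an Elo-style rating system. The sketch below shows a standard Elo update driven by pairwise votes; the constants (`K`, `BASE`) and the assumption of online per-vote updates are illustrative only, since arena leaderboards often fit ratings differently (for example with a Bradley-Terry model), and SciArena's exact method is not described here.

```python
# Minimal Elo-style rating update from pairwise votes (illustrative only;
# K and BASE are assumed values, not SciArena's actual parameters).
from collections import defaultdict

K = 4           # update step size (assumed)
BASE = 1000.0   # starting rating for a new model (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_ratings(votes):
    """votes: iterable of (model_a, model_b, winner) with winner in {'a', 'b'}."""
    ratings = defaultdict(lambda: BASE)
    for model_a, model_b, winner in votes:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = 1.0 if winner == "a" else 0.0
        ratings[model_a] += K * (s_a - e_a)
        ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

if __name__ == "__main__":
    sample_votes = [("o3", "deepseek", "a"), ("claude", "o3", "b"), ("gemini", "claude", "a")]
    print(update_ratings(sample_votes))
```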
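
SciArena-Eval uses the collected human votes to meta-evaluate automated judges. The sketch below shows the general idea, scoring how often a judge picks the same response as the human voter; the record fields and the `judge` callable are hypothetical placeholders, not the released data format.

```python
# Hypothetical sketch of judge-vs-human agreement on pairwise preference data.
# Field names ("question", "response_a", "response_b", "human_vote") are assumed.
from typing import Callable, Iterable

def judge_agreement(
    examples: Iterable[dict],
    judge: Callable[[str, str, str], str],
) -> float:
    """Fraction of comparisons where the judge's pick matches the human vote."""
    matches, total = 0, 0
    for ex in examples:
        verdict = judge(ex["question"], ex["response_a"], ex["response_b"])  # "a" or "b"
        matches += int(verdict == ex["human_vote"])
        total += 1
    return matches / total if total else 0.0

if __name__ == "__main__":
    data = [{"question": "q", "response_a": "x", "response_b": "y", "human_vote": "a"}]
    always_a = lambda q, a, b: "a"  # trivial stand-in judge
    print(judge_agreement(data, always_a))
```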
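
The retrieval bullet describes a multi-stage pipeline over the Semantic Scholar corpus. As a rough picture of what a first retrieval stage can look like, the sketch below queries the public Semantic Scholar Graph API for candidate papers; SciArena's actual pipeline (reranking, recency handling, citation attribution) is more involved and is not reproduced here.

```python
# Illustrative first-stage candidate retrieval via the public Semantic Scholar
# Graph API; this is not SciArena's pipeline, only a keyword-search starting point.
import requests

S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def search_papers(query: str, limit: int = 10) -> list[dict]:
    """Return candidate papers (title, abstract, year) for a research question."""
    resp = requests.get(
        S2_SEARCH_URL,
        params={"query": query, "fields": "title,abstract,year", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

if __name__ == "__main__":
    for paper in search_papers("retrieval-augmented generation for scientific QA", limit=5):
        print(paper.get("year"), "-", paper.get("title"))
```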