SciArena
SciArena is an open evaluation platform from the Allen Institute for AI (Ai2) for benchmarking foundation models on scientific literature tasks. Instead of relying on static benchmarks, SciArena collects head-to-head comparisons from human researchers: users submit research questions, see side-by-side, literature-grounded answers from two models, and vote for the better response. These votes drive a public leaderboard and power SciArena-Eval, a meta-evaluation benchmark for testing LLM-as-judge systems.
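As a rough illustration of what "testing LLM-as-judge systems" against human votes can mean, the sketch below computes how often an automated judge picks the same side as the human voter. The data layout (parallel lists of "A"/"B" labels) and the function name `judge_accuracy` are assumptions made for this example, not the format or API of the released SciArena-Eval data.

```python
from typing import Sequence


def judge_accuracy(human_votes: Sequence[str], judge_votes: Sequence[str]) -> float:
    """Return the fraction of comparisons where the judge agrees with the human vote."""
    if len(human_votes) != len(judge_votes):
        raise ValueError("human_votes and judge_votes must have the same length")
    if not human_votes:
        raise ValueError("at least one comparison is required")
    matches = sum(h == j for h, j in zip(human_votes, judge_votes))
    return matches / len(human_votes)


# Hypothetical votes for four pairwise comparisons ("A" or "B" is the preferred answer).
human = ["A", "B", "A", "A"]
judge = ["A", "B", "B", "A"]
accuracy = judge_accuracy(human, judge)
print(f"Judge agrees with humans on {accuracy:.0%} of comparisons")
```

A higher agreement rate suggests the automated judge tracks researcher preferences more closely on the sampled comparisons.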
- Arena-style model comparison — Submit scientific questions, inspect long-form, citation-attributed answers from two foundation models, and cast a vote for the preferred output.
- Leaderboard with Elo-style ratings — Track how models like o3, Claude, Gemini, and DeepSeek rank overall and by scientific discipline using an Elo-style rating system.
- SciArena-Eval benchmark — Use the released human preference data and code to study automated evaluators, LLM-as-judge setups, and model alignment with expert judgments.
- Literature-grounded retrieval — Behind the scenes, SciArena uses a multi-stage retrieval pipeline over the Semantic Scholar corpus to ground answers in relevant, up-to-date papers.
- Research-grade data quality controls — Expert annotators, training, blind ratings, and agreement checks help ensure the preference data is reliable enough for serious evaluation work.
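The Elo-style ratings mentioned in the leaderboard item above can be sketched with the textbook Elo update applied to pairwise votes. This is a minimal sketch only: the K-factor of 32, the starting rating of 1000, the sequential update order, and the model names are all assumptions for illustration, and the actual leaderboard may fit its ratings with a different procedure.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    ratings: Dict[str, float],
    votes: Iterable[Tuple[str, str]],
    k: float = 32.0,          # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating for unseen models
) -> Dict[str, float]:
    """Apply one sequential Elo update per (winner, loser) vote."""
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        gain = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + gain  # winner gains what the loser gives up
        ratings[loser] = r_l - gain
    return ratings


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = update_ratings({}, votes)
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update moves the same number of points from the loser to the winner, the total rating across models stays constant, and an upset against a higher-rated model shifts more points than an expected win.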
Developer
The Allen Institute for AI (Ai2) is a non-profit research institute founded in 2014 by the late Microsoft co-founder Paul Allen.
Pricing and Plans
Free
Free access to core SciArena search, summarization, and conversational features.
- Core semantic search
- AI-generated summaries
- Conversational Q&A
- Basic filters and citation export
System Requirements
- Operating System: Any OS with a modern web browser
- Memory (RAM): 4 GB+
- Processor: Any modern 64-bit CPU
- Disk Space: No local storage required (cloud-based)
AI Capabilities
- Semantic search
- Summarization
- Question answering