olmOCR
olmOCR is an open-source toolkit by AI2 for converting PDFs and document images into clean, structured plain text using vision-language models.
At a Glance
Pricing
Fully free and open-source toolkit available on GitHub under a permissive license.
Listed Mar 2026
About olmOCR
olmOCR is an open-source document processing toolkit developed by the Allen Institute for AI (AI2) that converts PDFs and scanned document images into clean, structured plain text. It leverages vision-language models to accurately extract text from complex layouts, tables, and figures. Designed for large-scale data pipelines, olmOCR is optimized for processing millions of documents efficiently. It is particularly useful for researchers and engineers building training datasets for large language models.
- PDF & Image OCR: Convert PDFs and scanned images to plain text using state-of-the-art vision-language models for high accuracy on complex layouts.
- Large-Scale Processing: Built for throughput, olmOCR can handle millions of documents in batch pipelines, making it suitable for dataset construction at scale.
- Structured Text Output: Preserves document structure including headings, tables, and lists in the extracted text output.
- Open Source: Released under the permissive Apache 2.0 license, so researchers and developers can inspect, modify, and extend the codebase freely.
- CLI & Python API: Accessible via command-line interface and Python API, enabling easy integration into existing data processing workflows.
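- Model-Backed Extraction: Uses a vision-language model fine-tuned by AI2 for document understanding beyond simple character recognition. The released olmOCR checkpoints are fine-tuned from Qwen vision-language models (Qwen2-VL, later Qwen2.5-VL), not from OLMo models.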
- Batch Pipeline Support: Designed to integrate into distributed computing environments for processing large document corpora efficiently.
- Research-Grade Quality: Developed by AI2 researchers with a focus on producing high-quality text for LLM pre-training and academic research use cases.
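The batch-oriented workflow described in the features above can be pictured with a small sketch. This is not olmOCR's own code or API: `find_pdfs`, `make_batches`, and the batch size of 500 are hypothetical names and values chosen for illustration. It only shows the general idea of splitting a large document collection into fixed-size work items that a pipeline could hand to separate workers.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def make_batches(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` paths (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, group.
        yield batch


def find_pdfs(root: str) -> List[str]:
    """Return all PDF paths under `root`, sorted so batches are reproducible."""
    return sorted(str(p) for p in Path(root).rglob("*.pdf"))


if __name__ == "__main__":
    # Assumed batch size; a real pipeline would tune this to its workers.
    for i, batch in enumerate(make_batches(find_pdfs("."), batch_size=500)):
        print(f"work item {i}: {len(batch)} documents")
```

Each work item could then be passed to whatever OCR step the pipeline uses. Sorting the paths first keeps the batch boundaries stable between runs, which makes it easier to resume an interrupted job.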
Capabilities
Key Features
- PDF to plain text conversion
- Scanned image OCR
- Vision-language model-powered extraction
- Large-scale batch processing
- Structured text output
- CLI interface
- Python API
- Open-source codebase
- Table and layout preservation
- LLM training dataset construction
