olmOCR

Name: olmOCR
Availability: OnlineOnly
Author: Allen Institute for AI

olmOCR is an open-source toolkit by AI2 for converting PDFs and document images into clean, structured plain text using vision-language models.

At a Glance

Pricing

Open Source

Fully free and open-source toolkit available on GitHub under a permissive license.

Available On

API

Linux

macOS

Windows

Allen Institute for AI, Seattle, WA. Est. 2014. $40M raised.

Listed Mar 2026

About olmOCR

olmOCR is an open-source document processing toolkit developed by the Allen Institute for AI (AI2) that converts PDFs and scanned document images into clean, structured plain text. It leverages vision-language models to accurately extract text from complex layouts, tables, and figures. Designed for large-scale data pipelines, olmOCR is optimized for processing millions of documents efficiently. It is particularly useful for researchers and engineers building training datasets for large language models.

PDF & Image OCR: Convert PDFs and scanned images to plain text using state-of-the-art vision-language models for high accuracy on complex layouts.
Large-Scale Processing: Built for throughput, olmOCR can handle millions of documents in batch pipelines, making it suitable for dataset construction at scale.
Structured Text Output: Preserves document structure including headings, tables, and lists in the extracted text output.
Open Source: Fully open-source under a permissive license, allowing researchers and developers to inspect, modify, and extend the codebase freely.
CLI & Python API: Accessible via command-line interface and Python API, enabling easy integration into existing data processing workflows.
Model-Backed Extraction: Uses a vision-language model fine-tuned by AI2 (the released olmOCR checkpoints build on Qwen2-VL-family models) to power document understanding beyond simple character recognition.
Batch Pipeline Support: Designed to integrate into distributed computing environments for processing large document corpora efficiently.
Research-Grade Quality: Developed by AI2 researchers with a focus on producing high-quality text for LLM pre-training and academic research use cases.
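
As a rough illustration of the CLI and batch-pipeline features listed above, the sketch below splits a list of PDF paths into batches and builds one command line per batch without running anything. The `olmocr.pipeline` module path and the `--markdown` and `--pdfs` flags are assumptions based on the project's README at the time of writing, so check them against the installed version. The helper names `chunked` and `build_command`, the `pdfs` folder, and the per-batch workspace names are hypothetical.

```python
from pathlib import Path
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def build_command(workspace: str, pdfs: Sequence[str], markdown: bool = True) -> List[str]:
    """Build (but do not run) an olmOCR pipeline invocation.

    Assumption: the module path and flags mirror the project's README;
    they are not verified here and may differ between releases.
    """
    cmd = ["python", "-m", "olmocr.pipeline", workspace]
    if markdown:
        cmd.append("--markdown")
    cmd += ["--pdfs", *pdfs]
    return cmd


if __name__ == "__main__":
    # Hypothetical input folder; an empty or missing folder prints nothing.
    pdfs = sorted(str(p) for p in Path("pdfs").glob("*.pdf"))
    for i, batch in enumerate(chunked(pdfs, 100)):
        print(" ".join(build_command(f"./workspace_{i}", batch)))
```

Each printed line could then be handed to a job scheduler or to `subprocess.run`. Building the argument list separately from executing it keeps the batching logic testable on machines without a GPU.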

Capabilities

Key Features

PDF to plain text conversion
Scanned image OCR
Vision-language model-powered extraction
Large-scale batch processing
Structured text output
CLI interface
Python API
Open-source codebase
Table and layout preservation
LLM training dataset construction
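
For the dataset-construction use case, extracted text usually needs to be read back into a downstream pipeline. The sketch below assumes the results are stored as JSON Lines with one record per document and the extracted text under a `text` key. That layout and field name are assumptions, not something this listing confirms. The function name `iter_document_texts` is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_document_texts(jsonl_path: str, field: str = "text") -> Iterator[str]:
    """Yield the text of each record in a JSON Lines file.

    Blank lines and records without a non-empty string in `field` are skipped.
    The field name is an assumption about the output format.
    """
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            text = record.get(field)
            if isinstance(text, str) and text:
                yield text
```

Streaming records one at a time keeps memory use flat, which matters when a results file holds many documents.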

Integrations

Python

Vision-language models

PDF processing libraries


Topics

Document Management, Data Processing, Academic Research

Alternatives

Sylvian, DocForge, ChatPDF
