MolmoWeb
An open-source multimodal web agent by Ai2 that autonomously controls a browser to complete natural-language tasks via clicking, typing, scrolling, and navigating.
At a Glance
About MolmoWeb
MolmoWeb is an open multimodal web agent built by Ai2 (Allen Institute for AI) and released under the Apache 2.0 license. Given a natural-language task, it autonomously controls a web browser — clicking, typing, scrolling, and navigating — to complete the task end-to-end. The repository includes agent code, an inference client, evaluation benchmarks, training code, and everything needed to reproduce the results from the accompanying arXiv paper (2604.08516).
What It Is
MolmoWeb is a vision-language model fine-tuned specifically for web navigation. It takes screenshots of browser state as visual input and predicts the next action (click coordinates, keystrokes, scroll commands) to advance toward a user-specified goal. The system is built on top of Molmo2 pretrained checkpoints and trained with a single-stage supervised fine-tuning (SFT) pipeline on a mixture of human-annotated and synthetically generated web trajectories.
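The observe-predict-act cycle described above can be sketched as a small loop. This is a minimal illustration, not MolmoWeb's actual code: the `Action` fields, the `"done"` stop signal, passing the action history to the model, and the `run_episode` function are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One predicted browser action. Field names are illustrative assumptions."""
    kind: str                      # "click", "type", "scroll", or "done" (assumed vocabulary)
    x: Optional[int] = None        # click coordinates in screenshot pixels
    y: Optional[int] = None
    text: Optional[str] = None     # keystrokes for a "type" action
    delta_y: Optional[int] = None  # scroll amount for a "scroll" action


def run_episode(
    task: str,
    take_screenshot: Callable[[], bytes],
    predict: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Observe, predict, act; stop on a "done" action or after max_steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                # current browser state
        action = predict(task, screenshot, history)   # model picks the next action
        history.append(action)
        if action.kind == "done":                     # assumed termination signal
            break
        execute(action)                               # apply the action in the browser
    return history
```

Keeping the screenshot source, the model call, and the browser executor as injected callables lets the same loop be exercised with stubs before a real browser or model is attached.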
Model Variants and Architecture
Four model checkpoints are published on HuggingFace under the allenai organization:
- MolmoWeb-8B — 8B parameters, HuggingFace/Transformers-compatible
- MolmoWeb-4B — 4B parameters, HuggingFace/Transformers-compatible
- MolmoWeb-8B-Native — 8B parameters, molmo-native checkpoint format
- MolmoWeb-4B-Native — 4B parameters, molmo-native checkpoint format
The native checkpoints use the OLMo attention backend, which differs from vLLM's implementation; the README explicitly cautions that vLLM integration may produce unexpected behavior or reduced accuracy.
Inference and Deployment Model
The inference client (the MolmoWeb Python class) manages a browser session and queries the model through one of four inference backends: fastapi (remote HTTP endpoint), modal (serverless), native (in-process OLMo-compatible checkpoint), and hf (in-process HuggingFace Transformers checkpoint). With a remote backend, the client talks to a running model server over HTTP; that server exposes a single POST /predict endpoint accepting a text prompt and a base64-encoded screenshot. Browser environments can be either a local Chromium instance via Playwright or a Browserbase cloud browser.
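As a rough illustration of the POST /predict request described above, the sketch below builds such a request with the Python standard library. The JSON field names (`prompt`, `image_base64`), the JSON body encoding, and the `build_predict_request` helper are assumptions for the example; the server's real schema may differ, so check the repository before relying on it.

```python
import base64
import json
import urllib.request


def build_predict_request(server_url: str, prompt: str, screenshot_png: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request.

    The payload field names below are hypothetical, chosen only for illustration.
    """
    payload = {
        "prompt": prompt,  # assumed field name for the text prompt
        # assumed field name for the base64-encoded screenshot
        "image_base64": base64.b64encode(screenshot_png).decode("ascii"),
    }
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request would be a call to `urllib.request.urlopen(req)` against a running server. The response format is not described here, so parsing it is left out.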
Evaluation Framework
The benchmarks/ directory provides a unified two-stage evaluation pipeline — run (agent executes tasks) and judge (LLM scores trajectories). Six benchmarks are supported out of the box: WebVoyager, Online Mind2Web, Odysseys, DeepShop, WebTailBench, and a Custom bring-your-own-tasks mode. Judge implementations include a GPT-4o-based WebVoyager judge, a DeepShop judge, a WebJudge for Online Mind2Web, and a Gemini rubric judge for Odysseys. The same framework can generate synthetic training data by running any supported agent and collecting trajectory logs.
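The two-stage run/judge split can be pictured as two independent functions that communicate only through saved trajectory files. The sketch below is a simplified stand-in, not the code in benchmarks/: the file layout, the record fields (`task_id`, `instruction`, `steps`), and the `run_stage` and `judge_stage` names are assumptions, and the judge is a plain callable where the real pipeline uses an LLM.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def run_stage(tasks: List[Dict], agent: Callable[[str], List[str]], out_dir: Path) -> None:
    """Stage 1 (run): execute every task and write one trajectory file per task."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for task in tasks:
        steps = agent(task["instruction"])  # the agent returns its action log
        record = {
            "task_id": task["task_id"],
            "instruction": task["instruction"],
            "steps": steps,
        }
        (out_dir / f"{task['task_id']}.json").write_text(json.dumps(record))


def judge_stage(out_dir: Path, judge: Callable[[Dict], bool]) -> float:
    """Stage 2 (judge): score the saved trajectories and return the success rate."""
    records = [json.loads(p.read_text()) for p in sorted(out_dir.glob("*.json"))]
    if not records:
        return 0.0
    return sum(judge(r) for r in records) / len(records)
```

Because the judge reads only from disk, trajectories can be re-scored with a different judge without re-running the agent, and the same saved logs can double as collected trajectory data.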
Training Pipeline
Training lives in the train/ directory and is a single-stage SFT on Molmo2 pretrained checkpoints. Nine datasets are hosted on HuggingFace under the MolmoWeb Data collection, covering synthetic grounding, synthetic QA, Gemini-generated trajectories, human-annotated trajectories, synthetic and human atomic skill demonstrations, and visual grounding benchmarks (PixMoPoints, ScreenSpot, ScreenSpotV2). The training script uses torchrun and is configurable via shell variables for checkpoint path, data mixture, GPU count, batch size, sequence length, and training duration.
Current Status
The repository was created in March 2026 and last pushed in June 2026, with 574 stars and 78 forks as of the data snapshot. The project is actively maintained by Ai2 researchers including Tanmay Gupta, Piper Wolters, Zixian Ma, and others listed in the paper citation. A live demo is available at molmoweb.allen.ai and a blog post accompanies the release at allenai.org/blog/molmoweb.
Pricing
Open Source
Fully open-source under Apache 2.0. Free to use, modify, and distribute.
- Full agent source code
- 4B and 8B model checkpoints on HuggingFace
- Inference client and server
- Evaluation benchmarks (WebVoyager, Online Mind2Web, Odysseys, DeepShop, WebTailBench)
- Training pipeline with SFT code
Capabilities
Key Features
- Autonomous browser control (click, type, scroll, navigate)
- Natural-language task input
- Multimodal vision-language model backbone
- 4B and 8B parameter model variants
- HuggingFace Transformers-compatible checkpoints
- Local Chromium and Browserbase cloud browser support
- Single-query and batch-query inference
- Follow-up query continuation within a session
- Accessibility tree extraction
- Unified evaluation framework for 6 benchmarks
- Two-stage run/judge evaluation pipeline
- Synthetic training data generation via trajectory collection
- Single-stage SFT training pipeline
- Grounding evaluation on ScreenSpot and ScreenSpot-v2
- Apache 2.0 open-source license
