MolmoWeb

Name: MolmoWeb
Availability: OnlineOnly
Author: Allen Institute for AI (Ai2)

An open-source multimodal web agent by Ai2 that autonomously controls a browser to complete natural-language tasks via clicking, typing, scrolling, and navigating.

At a Glance

Pricing

Open Source

Fully open-source under Apache 2.0. Free to use, modify, and distribute.

Available On: CLI, API, SDK

Allen Institute for AI (Ai2) | Seattle, WA | Est. 2014 | $1B+ raised

Listed Jun 2026

About MolmoWeb

MolmoWeb is an open multimodal web agent built by Ai2 (Allen Institute for AI) and released under the Apache 2.0 license. Given a natural-language task, it autonomously controls a web browser — clicking, typing, scrolling, and navigating — to complete the task end-to-end. The repository includes agent code, an inference client, evaluation benchmarks, training code, and everything needed to reproduce the results from the accompanying arXiv paper (2604.08516).

What It Is

MolmoWeb is a vision-language model fine-tuned specifically for web navigation. It takes screenshots of browser state as visual input and predicts the next action (click coordinates, keystrokes, scroll commands) to advance toward a user-specified goal. The system is built on top of Molmo2 pretrained checkpoints and trained with a single-stage supervised fine-tuning (SFT) pipeline on a mixture of human-annotated and synthetically generated web trajectories.
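The screenshot-in, action-out loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the action classes, the `run_episode` function, and the three callables it takes are hypothetical names, and the sketch assumes the model signals completion with an explicit final action.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Click:
    x: float  # assumed coordinate convention; the source only says "click coordinates"
    y: float


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    dy: int  # positive scrolls down (assumption)


@dataclass(frozen=True)
class Done:
    answer: str  # hypothetical terminal action


Action = Union[Click, TypeText, Scroll, Done]


def run_episode(
    task: str,
    take_screenshot: Callable[[], bytes],
    predict: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Observe the browser, ask the model for one action, apply it, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = predict(task, screenshot, history)
        history.append(action)
        if isinstance(action, Done):
            break  # the model reports that the task is finished
        execute(action)
    return history
```

In a real setup `take_screenshot` and `execute` would wrap a browser driver and `predict` would call the model; here they are plain callables so the loop can be exercised with stubs.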

Model Variants and Architecture

Four model checkpoints are published on HuggingFace under the allenai organization:

MolmoWeb-8B — 8B parameters, HuggingFace/Transformers-compatible
MolmoWeb-4B — 4B parameters, HuggingFace/Transformers-compatible
MolmoWeb-8B-Native — 8B parameters, molmo-native checkpoint format
MolmoWeb-4B-Native — 4B parameters, molmo-native checkpoint format

The native checkpoints use the OLMo attention backend, which differs from vLLM's implementation; the README explicitly cautions that vLLM integration may produce unexpected behavior or reduced accuracy.

Inference and Deployment Model

The inference client (MolmoWeb Python class) manages a browser session and communicates with a running model server over HTTP. Four inference backends are supported: fastapi (remote HTTP endpoint), modal (serverless), native (in-process OLMo-compatible checkpoint), and hf (in-process HuggingFace Transformers checkpoint). Browser environments can be either a local Chromium instance via Playwright or a Browserbase cloud browser. The server exposes a single POST /predict endpoint accepting a text prompt and a base64-encoded screenshot.
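As a rough illustration of the POST /predict contract, the sketch below builds a request carrying a text prompt and a base64-encoded screenshot using only the Python standard library. The JSON field names (`prompt`, `image_base64`), the function name, and the server URL are assumptions made for this example; the listing does not specify the request or response schema, so check the repository before relying on them.

```python
import base64
import json
import urllib.request


def build_predict_request(
    server_url: str, prompt: str, screenshot_png: bytes
) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request."""
    payload = {
        "prompt": prompt,  # hypothetical field name
        # hypothetical field name; raw image bytes encoded as base64 text
        "image_base64": base64.b64encode(screenshot_png).decode("ascii"),
    }
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a server running, the request could be sent with `urllib.request.urlopen(request)`; how to parse the response depends on the server's actual output format, which is not described here.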

Evaluation Framework

The benchmarks/ directory provides a unified two-stage evaluation pipeline: run (the agent executes tasks) and judge (an LLM scores the resulting trajectories). Six evaluation modes are supported out of the box: five benchmarks (WebVoyager, Online Mind2Web, Odysseys, DeepShop, and WebTailBench) plus a Custom bring-your-own-tasks mode. Judge implementations include a GPT-4o-based WebVoyager judge, a DeepShop judge, a WebJudge for Online Mind2Web, and a Gemini rubric judge for Odysseys. The same framework can generate synthetic training data by running any supported agent and collecting trajectory logs.
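The separation between the run and judge stages can be illustrated with a small sketch. All names here (`Trajectory`, `run_stage`, `judge_stage`, `success_rate`) are hypothetical and do not mirror the repository's code; the agent and judge are plain callables standing in for the real model and the LLM-based judges.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    task_id: str
    task: str
    steps: List[str] = field(default_factory=list)  # logged actions, simplified to strings


def run_stage(
    tasks: Dict[str, str], agent: Callable[[str], List[str]]
) -> List[Trajectory]:
    """Stage 1: the agent attempts every task and its steps are logged."""
    return [Trajectory(task_id, task, agent(task)) for task_id, task in tasks.items()]


def judge_stage(
    trajectories: List[Trajectory], judge: Callable[[Trajectory], bool]
) -> Dict[str, bool]:
    """Stage 2: a judge scores each logged trajectory independently."""
    return {t.task_id: judge(t) for t in trajectories}


def success_rate(verdicts: Dict[str, bool]) -> float:
    """Fraction of tasks the judge marked as successful."""
    return sum(verdicts.values()) / len(verdicts) if verdicts else 0.0
```

Because the judge only reads logged trajectories, the two stages can run at different times, and the same logs can be re-scored with a different judge or reused as training data.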

Training Pipeline

Training code lives in the train/ directory and runs single-stage SFT starting from Molmo2 pretrained checkpoints. Nine datasets are hosted on HuggingFace under the MolmoWeb Data collection, covering synthetic grounding, synthetic QA, Gemini-generated trajectories, human-annotated trajectories, synthetic and human atomic skill demonstrations, and visual grounding benchmarks (PixMoPoints, ScreenSpot, ScreenSpot-v2). The training script uses torchrun and is configurable via shell variables for checkpoint path, data mixture, GPU count, batch size, sequence length, and training duration.
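To make the "configurable via shell variables" idea concrete, the sketch below collects the six settings from an environment mapping and assembles a torchrun command. The variable names, default values, and script path are all hypothetical placeholders, not the repository's actual names; only `torchrun`, `--standalone`, and `--nproc_per_node` are real torchrun options.

```python
from typing import Dict, List, Mapping, Tuple

# Hypothetical variable names and defaults, one per setting named in the text.
DEFAULTS: Dict[str, str] = {
    "CHECKPOINT": "/path/to/molmo2-checkpoint",
    "DATA_MIXTURE": "example_mixture",
    "NUM_GPUS": "8",
    "BATCH_SIZE": "64",
    "SEQ_LEN": "4096",
    "DURATION": "1000",
}


def build_launch(
    env: Mapping[str, str], script: str = "train/your_train_script.py"
) -> Tuple[List[str], Dict[str, str]]:
    """Return a single-node torchrun command plus the resolved settings."""
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    if int(settings["NUM_GPUS"]) < 1:
        raise ValueError("NUM_GPUS must be at least 1")
    command = [
        "torchrun",
        "--standalone",  # single-node launch
        f"--nproc_per_node={settings['NUM_GPUS']}",  # one worker process per GPU
        script,  # placeholder path; the real script name is not given here
    ]
    return command, settings
```

One way to use the result would be `subprocess.run(command, env={**os.environ, **settings}, check=True)`, which passes the settings to the training script as environment variables.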

Current Status

The repository was created in March 2026 and last pushed in June 2026, with 574 stars and 78 forks as of the data snapshot. The project is actively maintained by Ai2 researchers including Tanmay Gupta, Piper Wolters, Zixian Ma, and others listed in the paper citation. A live demo is available at molmoweb.allen.ai and a blog post accompanies the release at allenai.org/blog/molmoweb.

Pricing

Open Source (Apache 2.0). The release includes:

Full agent source code
4B and 8B model checkpoints on HuggingFace
Inference client and server
Evaluation benchmarks (WebVoyager, Online Mind2Web, Odysseys, DeepShop, WebTailBench)
Training pipeline with SFT code

Capabilities

Key Features

Autonomous browser control (click, type, scroll, navigate)
Natural-language task input
Multimodal vision-language model backbone
4B and 8B parameter model variants
HuggingFace Transformers-compatible checkpoints
Local Chromium and Browserbase cloud browser support
Single-query and batch-query inference
Follow-up query continuation within a session
Accessibility tree extraction
Unified evaluation framework for 6 benchmarks
Two-stage run/judge evaluation pipeline
Synthetic training data generation via trajectory collection
Single-stage SFT training pipeline
Grounding evaluation on ScreenSpot and ScreenSpot-v2
Apache 2.0 open-source license

Integrations

HuggingFace Hub
Playwright (Chromium)
Browserbase
Google Gemini API
OpenAI API (GPT-4o judge)
Modal (serverless inference)
FastAPI (HTTP inference server)
PyTorch / torchrun
uv (dependency management)
