webclaw

Name: webclaw
Availability: OnlineOnly
Author: webclaw

webclaw is an open-source web extraction engine built in Rust that turns any website into clean markdown, JSON, or LLM-ready structured data via CLI, REST API, and MCP server.

At a Glance

Pricing

Open Source

Free tier available

Self-host forever under AGPL-3.0. CLI, server, and MCP server with no usage limits on your own hardware.

Starter: $15/mo

Growth: $39/mo

Pro: $79/mo

Two more plans (Scale and Dedicated) are listed under Pricing.

Available On

macOS

Linux

Web

API

SDK

webclaw, Italy, Est. 2024

Listed May 2026

About webclaw

webclaw is a web extraction toolkit built in Rust and licensed under AGPL-3.0. It converts any URL into clean markdown, JSON, plain text, or token-optimized output without requiring a headless browser, using browser-grade TLS fingerprint impersonation instead. The project ships as three standalone binaries — a CLI, a REST API server, and an MCP server — all powered by the same extraction core. A hosted cloud API at webclaw.io complements the open-source self-hosted path.

What It Is

webclaw sits in the web scraping and data extraction category and is designed specifically for AI agent and RAG pipeline workflows. Rather than spinning up Playwright or Puppeteer, it fetches pages over raw HTTP using Chrome and Firefox TLS fingerprint profiles, which keeps requests fast and lightweight. The extraction engine (webclaw-core) is a pure Rust crate with no network I/O: it takes raw HTML and returns structured output, which makes it WASM-compatible and independently usable. The hosted API adds protected-site access, JavaScript rendering, async crawl jobs, web search, and production usage tracking on top of the open-source core.
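The idea of an extraction core with no network I/O (raw HTML in, structured output out) can be illustrated with a small, self-contained sketch. This is not webclaw-core's algorithm or API: the tag lists, the class, and the function name extract_markdown are all assumptions made for illustration, and the real engine is written in Rust with readability scoring that this toy version does not attempt.

```python
from html.parser import HTMLParser

# Illustrative tag sets only; these are assumptions, not webclaw-core's rules.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}
HEADINGS = {"h1": 1, "h2": 2, "h3": 3}
BLOCK_TAGS = {"p", "div", "li", "section", "article"} | set(HEADINGS)


class _BlockCollector(HTMLParser):
    """Collects visible text block by block, skipping noisy elements."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._chunks = []
        self._noise_depth = 0

    def emit_block(self, prefix=""):
        # Collapse whitespace and store the block if anything is left.
        text = " ".join("".join(self._chunks).split())
        self._chunks = []
        if text:
            self.blocks.append(prefix + text)

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag in BLOCK_TAGS:
            self.emit_block()

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif tag in HEADINGS:
            self.emit_block("#" * HEADINGS[tag] + " ")
        elif tag in BLOCK_TAGS:
            self.emit_block()

    def handle_data(self, data):
        if self._noise_depth == 0:
            self._chunks.append(data)


def extract_markdown(html: str) -> str:
    """Pure function: takes an HTML string, returns markdown-like text."""
    collector = _BlockCollector()
    collector.feed(html)
    collector.close()
    collector.emit_block()  # flush any trailing text outside block tags
    return "\n\n".join(collector.blocks)


SAMPLE = (
    "<html><body><nav>Home | About</nav><h1>Title</h1>"
    "<p>Hello <b>world</b>.</p><script>var x=1;</script></body></html>"
)
print(extract_markdown(SAMPLE))
```

Because the function never touches the network, it can be tested against saved HTML and reused behind any fetcher, which is the same separation the webclaw-core and webclaw-fetch split describes.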

Architecture and Deployment Model

The project is a Rust workspace split into focused crates:

webclaw-core — pure extraction engine: readability scoring, noise filtering, markdown conversion, LLM optimization, CSS selector filtering, diff engine, brand extraction
webclaw-fetch — HTTP client with browser TLS impersonation, BFS crawler, sitemap discovery, batch operations, proxy pool rotation
webclaw-llm — LLM provider chain (Ollama → OpenAI → Anthropic) for JSON schema extraction, prompt extraction, and summarization
webclaw-pdf — PDF text extraction
webclaw-server — axum-based REST API with auth, CORS, gzip, and async job management
webclaw-mcp — MCP server over stdio transport exposing tools for AI agents
webclaw-cli — command-line interface

Users can self-host the entire stack on their own hardware with no usage limits, or use the hosted cloud API with an API key.
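The "Ollama → OpenAI → Anthropic" provider chain listed for webclaw-llm reads like an ordered fallback: try the local model first, then fall through to hosted providers. The sketch below shows that general pattern under that assumption. The provider functions are stand-in stubs, and run_chain and AllProvidersFailed are hypothetical names, not part of webclaw.

```python
from typing import Callable, Sequence, Tuple

# A provider is a (name, callable) pair; the callable takes a prompt
# and returns the model's text, or raises on failure.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def run_chain(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order and return (provider_name, result)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers for illustration only; no real API is called.
def ollama_stub(prompt: str) -> str:
    raise ConnectionError("local model not running")


def openai_stub(prompt: str) -> str:
    return f"summary of: {prompt}"


def anthropic_stub(prompt: str) -> str:
    return f"fallback summary of: {prompt}"


CHAIN = [("ollama", ollama_stub), ("openai", openai_stub), ("anthropic", anthropic_stub)]
print(run_chain("hello", CHAIN))
```

Putting the local provider first means a self-hosted setup only reaches a paid API when the local model is unavailable.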

Ten Extraction Endpoints

The hosted API and self-hosted server expose ten endpoints covering the full extraction surface: /v1/scrape (single-page extraction), /v1/crawl (BFS same-origin crawling), /v1/search (web search), /v1/map (URL discovery without full extraction), /v1/batch (parallel multi-URL scraping), /v1/extract (LLM-powered structured JSON extraction), /v1/summarize, /v1/brand (colors, fonts, logos, favicon), /v1/diff (content change tracking), and /v1/research (multi-source research workflow). According to the site, the LLM-optimized output format runs a 9-step pipeline that strips navigation, ads, and boilerplate, for a claimed median 95% token reduction measured on 18 production sites.
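A request to the /v1/scrape endpoint might be assembled as in the sketch below. Only the endpoint path comes from the listing. The JSON field names (url, format), the Bearer authorization header, and the localhost address and port are assumptions for illustration, so check the official docs for the actual request schema. The code builds the request without sending it.

```python
import json
import urllib.request
from typing import Optional


def build_scrape_request(
    base_url: str,
    target_url: str,
    api_key: Optional[str] = None,
    fmt: str = "markdown",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/scrape."""
    # Assumed body shape; the real field names may differ.
    payload = {"url": target_url, "format": fmt}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumed auth scheme for the hosted API.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical self-hosted address; adjust to wherever the server listens.
req = build_scrape_request("http://localhost:3000/", "https://example.com")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the result to urllib.request.urlopen would send it. Keeping construction separate from sending makes the payload easy to inspect before pointing it at a live server.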

MCP Integration and AI Agent Workflow

webclaw ships an MCP server binary that exposes tools over the Model Context Protocol stdio transport, compatible with Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Codex, and Antigravity. The one-command setup, npx create-webclaw, auto-detects supported MCP clients and configures the server for them. The docs list 8 tools available locally (scrape, crawl, map, batch, extract, summarize, diff, brand) and 2 that require the hosted API (search, research). SDKs are available for TypeScript, Python, and Go, and the API is documented as a drop-in Firecrawl replacement with compatible /v2 endpoints.
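For readers who want to see what an MCP client entry for a stdio server looks like, the sketch below generates one in the mcpServers format that Claude Desktop reads from its configuration file. The command name webclaw-mcp (taken from the crate name) and the environment variable WEBCLAW_API_KEY are assumptions, and the actual output of npx create-webclaw may differ. The tool split (8 local, 2 hosted) follows the listing.

```python
import json
from typing import Optional, Tuple

# Tool names as listed in the docs summary above.
LOCAL_TOOLS = ("scrape", "crawl", "map", "batch", "extract", "summarize", "diff", "brand")
HOSTED_TOOLS = ("search", "research")


def available_tools(has_hosted_key: bool) -> Tuple[str, ...]:
    """Tools an agent could expect, depending on hosted API access."""
    return LOCAL_TOOLS + (HOSTED_TOOLS if has_hosted_key else ())


def mcp_server_entry(command: str = "webclaw-mcp", api_key: Optional[str] = None) -> dict:
    """Build an mcpServers config fragment for a stdio MCP server."""
    entry = {"command": command, "args": []}  # binary name is an assumption
    if api_key:
        # Hypothetical variable name; check the webclaw docs for the real one.
        entry["env"] = {"WEBCLAW_API_KEY": api_key}
    return {"mcpServers": {"webclaw": entry}}


print(json.dumps(mcp_server_entry(), indent=2))
print(available_tools(has_hosted_key=False))
```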

Update: v0.6.4

The latest release is v0.6.4, published on May 21, 2026, according to the GitHub repository. The repository was created in March 2026, and its most recent push was also on May 21, 2026. It reports 1,184 stars and 141 forks. Blog posts from May 2026 cover JavaScript rendering fallback strategies, anti-bot signal detection, and evaluation frameworks for scraping APIs in AI agent workflows, which suggests that both development and content remain focused on the AI agent use case.

Pricing

Open Source

Self-host forever under AGPL-3.0. CLI, server, and MCP server with no usage limits on your own hardware.

CLI tool
REST API server (self-hosted)
MCP server
No usage limits on your hardware
AGPL-3.0 license

Starter

Entry-level hosted plan with 10,000 credits/month and 3 research runs.

$15/mo billed annually, or $19/mo billed monthly

10,000 credits/month
3 research runs/month
Max 10 sources per research
5 concurrent requests
Email support

Growth

Popular mid-tier plan with 100,000 credits/month and 10 research runs.

$39/mo billed annually, or $49/mo billed monthly

100,000 credits/month
10 research runs/month
Max 20 sources per research
20 concurrent requests
Priority support

Pro

High-volume plan with 250,000 credits/month and 20 research runs.

$79/mo billed annually, or $99/mo billed monthly

250,000 credits/month
20 research runs/month
Max 30 sources per research
50 concurrent requests
Priority support

Scale

Large-scale plan with 1,000,000 credits/month and 60 research runs.

$319/mo billed annually, or $399/mo billed monthly

1,000,000 credits/month
60 research runs/month
Max 100 sources per research
100 concurrent requests
Priority + Slack support

Dedicated

Single-tenant deployment on your cloud with unlimited pages, unlimited research, and 200 concurrent requests.

Custom pricing (contact sales)

Unlimited pages
Unlimited research
200 concurrent requests
Single-tenant on your cloud
Your proxies, your rules
Dedicated Slack channel
SLA
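To compare the four fixed-price hosted plans, the arithmetic can be worked through in a few lines. The prices and credit counts below are copied from the plan descriptions above. The helper names are illustrative, and the Dedicated plan is left out because its price is custom.

```python
# Monthly prices in USD and monthly credits, as listed above.
PLANS = {
    "Starter": {"annual": 15, "monthly": 19, "credits": 10_000},
    "Growth": {"annual": 39, "monthly": 49, "credits": 100_000},
    "Pro": {"annual": 79, "monthly": 99, "credits": 250_000},
    "Scale": {"annual": 319, "monthly": 399, "credits": 1_000_000},
}


def annual_discount(plan: str) -> float:
    """Fraction saved per month by choosing annual billing."""
    p = PLANS[plan]
    return (p["monthly"] - p["annual"]) / p["monthly"]


def price_per_1k_credits(plan: str, billing: str = "annual") -> float:
    """USD per 1,000 credits at the given billing cadence."""
    p = PLANS[plan]
    return p[billing] * 1000 / p["credits"]


for name in PLANS:
    print(
        f"{name}: {annual_discount(name):.1%} off with annual billing, "
        f"${price_per_1k_credits(name):.3f} per 1,000 credits"
    )
```

On these figures, annual billing saves roughly 20 to 21 percent on every plan, and the cost per 1,000 credits falls from $1.50 on Starter to about $0.32 on Pro and Scale.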

Capabilities

Key Features

Single-page scraping with clean markdown, JSON, HTML, plain text, and LLM-optimized output
BFS same-origin crawler with configurable depth, concurrency, and delay
Sitemap.xml and robots.txt discovery
Batch multi-URL scraping in parallel
LLM-powered structured JSON extraction via schema or prompt
Page summarization
Content diff and change tracking
Brand identity extraction (colors, fonts, logos, favicon)
Web search with scraped results
Multi-source deep research workflow
MCP server with 8+ tools for Claude, Cursor, Windsurf, and other MCP clients
Browser-grade TLS fingerprint impersonation (Chrome and Firefox profiles)
Anti-bot and CAPTCHA handling
CSS selector include/exclude filtering
9-step LLM optimization pipeline for token reduction
PDF and DOCX auto-detection and extraction
YouTube transcript extraction
Proxy pool rotation
Drop-in Firecrawl /v2 API compatibility
Self-hostable under AGPL-3.0
TypeScript, Python, and Go SDKs
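The "BFS same-origin crawler with configurable depth" feature in the list above follows a well-known pattern, sketched below over an in-memory link graph so it runs without network access. This is a generic breadth-first crawl, not webclaw-fetch's implementation: the function names, the max_pages cap, and the SITE data are illustrative assumptions, and concurrency and per-request delay are not modeled.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urljoin, urlsplit


def same_origin(a: str, b: str) -> bool:
    """True when two URLs share scheme and host (including port)."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.netloc) == (pb.scheme, pb.netloc)


def bfs_crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 100,
) -> List[str]:
    """Visit pages breadth-first, staying on the start URL's origin."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for href in get_links(url):
            nxt = urljoin(url, href).split("#", 1)[0]  # resolve, drop fragment
            if nxt not in seen and same_origin(start, nxt):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Toy site used in place of real HTTP fetches.
SITE = {
    "https://example.com/": ["/a", "/b", "https://other.example.net/x"],
    "https://example.com/a": ["/c", "/"],
    "https://example.com/b": ["/c#top"],
    "https://example.com/c": ["/d"],
}
pages = bfs_crawl("https://example.com/", lambda u: SITE.get(u, []), max_depth=2)
print(pages)
```

With max_depth=2 the crawl reaches /c but never expands it, so /d is not visited, and the off-origin link is skipped entirely.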

Integrations

Claude Desktop

Claude Code

Cursor

Windsurf

OpenCode

Codex

Antigravity

LangChain

Ollama

OpenAI

Anthropic

Docker

Homebrew

npm (create-webclaw)

OpenClaw

Hermes Agent

API Available
