webclaw
webclaw is an open-source web extraction engine built in Rust that turns any website into clean markdown, JSON, or LLM-ready structured data via CLI, REST API, and MCP server.
At a Glance
Self-host forever under AGPL-3.0. CLI, server, and MCP server with no usage limits on your own hardware.
Listed May 2026
About webclaw
webclaw is a web extraction toolkit built in Rust and licensed under AGPL-3.0. It converts any URL into clean markdown, JSON, plain text, or token-optimized output. Instead of a headless browser, it relies on browser-grade TLS fingerprint impersonation. The project ships as three standalone binaries (a CLI, a REST API server, and an MCP server), all powered by the same extraction core. A hosted cloud API at webclaw.io complements the open-source self-hosted path.
What It Is
webclaw sits in the web scraping and data extraction category and is designed specifically for AI agent and RAG pipeline workflows. Rather than spinning up Playwright or Puppeteer, it fetches pages over raw HTTP using Chrome and Firefox TLS fingerprint profiles, which keeps requests fast and lightweight. The extraction engine (webclaw-core) is a pure Rust crate with no network I/O: it takes raw HTML and returns structured output, which makes it WASM-compatible and usable on its own. The hosted API adds protected-site access, JavaScript rendering, async crawl jobs, web search, and production usage tracking on top of the open-source core.
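To make the "raw HTML in, structured output out, no network I/O" idea concrete, here is a minimal Python sketch of a pure extraction function. It is not webclaw-core's algorithm or API. The `NOISE_TAGS` set, the `TextExtractor` class, and the `extract_text` function are hypothetical names chosen for illustration, and the noise filter is far cruder than readability scoring.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as noise in this toy example.
# This list is an assumption, not webclaw's actual filter.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping anything nested in a noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Pure function: takes an HTML string, returns text, touches no network."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = "<html><nav>Home | About</nav><h1>Title</h1><p>Body text.</p><script>var x=1;</script></html>"
print(extract_text(page))
```

Because the function only maps a string to a string, the caller decides how the HTML is fetched, which is the same separation the listing describes between the extraction core and the fetch layer.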
Architecture and Deployment Model
The project is a Rust workspace split into focused crates:
- webclaw-core — pure extraction engine: readability scoring, noise filtering, markdown conversion, LLM optimization, CSS selector filtering, diff engine, brand extraction
- webclaw-fetch — HTTP client with browser TLS impersonation, BFS crawler, sitemap discovery, batch operations, proxy pool rotation
- webclaw-llm — LLM provider chain (Ollama → OpenAI → Anthropic) for JSON schema extraction, prompt extraction, and summarization
- webclaw-pdf — PDF text extraction
- webclaw-server — axum-based REST API with auth, CORS, gzip, and async job management
- webclaw-mcp — MCP server over stdio transport exposing tools for AI agents
- webclaw-cli — command-line interface
Users can self-host the entire stack on their own hardware with no usage limits, or use the hosted cloud API with an API key.
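The provider chain listed for webclaw-llm (Ollama → OpenAI → Anthropic) reads like an ordered fallback, although the listing does not spell out the exact semantics. The Python sketch below assumes "try each provider in order and return the first success". The `run_chain` function and the stub providers are hypothetical and make no real API calls.

```python
from typing import Callable, Sequence, Tuple

# A provider is a (name, callable) pair; the callable takes a prompt and
# returns a completion, or raises on failure. Hypothetical shape.
Provider = Tuple[str, Callable[[str], str]]


def run_chain(providers: Sequence[Provider], prompt: str) -> Tuple[str, str]:
    """Try providers in order; return (name, result) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real clients (assumptions for illustration).
def ollama_stub(prompt: str) -> str:
    raise ConnectionError("local model not running")


def openai_stub(prompt: str) -> str:
    return f"summary of: {prompt}"


def anthropic_stub(prompt: str) -> str:
    return f"fallback summary of: {prompt}"


chain = [("ollama", ollama_stub), ("openai", openai_stub), ("anthropic", anthropic_stub)]
print(run_chain(chain, "page text"))
```

Putting the local provider first means a self-hosted setup only reaches a paid API when the local model is unavailable, which may be the motivation for the ordering.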
Ten Extraction Endpoints
The hosted API and self-hosted server expose ten endpoints covering the full extraction surface: /v1/scrape (single-page extraction), /v1/crawl (BFS same-origin crawling), /v1/search (web search), /v1/map (URL discovery without full extraction), /v1/batch (parallel multi-URL scraping), /v1/extract (LLM-powered structured JSON extraction), /v1/summarize, /v1/brand (colors, fonts, logos, favicon), /v1/diff (content change tracking), and /v1/research (multi-source research workflow). According to the site, the LLM-optimized output format runs a 9-step pipeline that strips navigation, ads, and boilerplate, with a claimed median token reduction of 95% measured on 18 production sites.
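As a rough illustration of calling one of these endpoints, the Python sketch below builds (but does not send) a POST request to /v1/scrape. Only the endpoint path comes from the listing. The base URL, the JSON field names `url` and `format`, and the Bearer-token header are assumptions, so check the project's API documentation for the real request shape.

```python
import json
from urllib.request import Request


def build_scrape_request(base_url: str, api_key: str, target_url: str,
                         fmt: str = "markdown") -> Request:
    """Build a POST request for /v1/scrape without sending it.

    The body fields and the auth scheme are assumptions for illustration.
    """
    body = json.dumps({"url": target_url, "format": fmt}).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/v1/scrape",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


# Hypothetical self-hosted address and key.
req = build_scrape_request("http://localhost:3000/", "test-key", "https://example.com")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. The sketch stops short of that so it runs without a server.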
MCP Integration and AI Agent Workflow
webclaw ships an MCP server binary that exposes tools over the Model Context Protocol stdio transport, compatible with Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Codex, and Antigravity. The one-command setup, npx create-webclaw, detects supported MCP clients and configures the server for them. The docs list 8 tools available locally (scrape, crawl, map, batch, extract, summarize, diff, brand) and 2 that require the hosted API (search, research). SDKs are available for TypeScript, Python, and Go, and the API is documented as a drop-in Firecrawl replacement with compatible /v2 endpoints.
Update: v0.6.4
The latest release is v0.6.4, published on May 21, 2026, according to the GitHub repository. The repository was created in March 2026, and its most recent push was also on May 21, 2026. It reports 1,184 stars and 141 forks. Blog posts from May 2026 cover JavaScript rendering fallback strategies, anti-bot signal detection, and evaluation frameworks for scraping APIs in AI agent workflows, which suggests that both development and content remain focused on the AI agent use case.
Pricing
Open Source
Self-host forever under AGPL-3.0. CLI, server, and MCP server with no usage limits on your own hardware.
- CLI tool
- REST API server (self-hosted)
- MCP server
- No usage limits on your hardware
- AGPL-3.0 license
Starter
Entry-level hosted plan with 10,000 credits/month and 3 research runs.
- 10,000 credits/month
- 3 research runs/month
- Max 10 sources per research
- 5 concurrent requests
- Email support
Growth
Popular mid-tier plan with 100,000 credits/month and 10 research runs.
- 100,000 credits/month
- 10 research runs/month
- Max 20 sources per research
- 20 concurrent requests
- Priority support
Pro
High-volume plan with 250,000 credits/month and 20 research runs.
- 250,000 credits/month
- 20 research runs/month
- Max 30 sources per research
- 50 concurrent requests
- Priority support
Scale
Large-scale plan with 1,000,000 credits/month and 60 research runs.
- 1,000,000 credits/month
- 60 research runs/month
- Max 100 sources per research
- 100 concurrent requests
- Priority + Slack support
Dedicated
Single-tenant deployment on your cloud with unlimited pages, unlimited research, and 200 concurrent requests.
- Unlimited pages
- Unlimited research
- 200 concurrent requests
- Single-tenant on your cloud
- Your proxies, your rules
- Dedicated Slack channel
- SLA
Capabilities
Key Features
- Single-page scraping with clean markdown, JSON, HTML, plain text, and LLM-optimized output
- BFS same-origin crawler with configurable depth, concurrency, and delay
- Sitemap.xml and robots.txt discovery
- Batch multi-URL scraping in parallel
- LLM-powered structured JSON extraction via schema or prompt
- Page summarization
- Content diff and change tracking
- Brand identity extraction (colors, fonts, logos, favicon)
- Web search with scraped results
- Multi-source deep research workflow
- MCP server with 8+ tools for Claude, Cursor, Windsurf, and other MCP clients
- Browser-grade TLS fingerprint impersonation (Chrome and Firefox profiles)
- Anti-bot and CAPTCHA handling
- CSS selector include/exclude filtering
- 9-step LLM optimization pipeline for token reduction
- PDF and DOCX auto-detection and extraction
- YouTube transcript extraction
- Proxy pool rotation
- Drop-in Firecrawl /v2 API compatibility
- Self-hostable under AGPL-3.0
- TypeScript, Python, and Go SDKs
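The "BFS same-origin crawler with configurable depth" item in the list above can be sketched as a breadth-first traversal that only follows links sharing the start URL's scheme and host. This Python sketch is a generic illustration, not webclaw's crawler. The `bfs_same_origin` function and the in-memory `links` map, which stands in for fetching and parsing pages, are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


def bfs_same_origin(start: str, links: dict, max_depth: int) -> list:
    """Return URLs in BFS visit order, staying on the start URL's origin.

    `links` maps a URL to the URLs found on that page; it replaces real
    fetching so the example runs offline.
    """
    origin = urlsplit(start)[:2]  # (scheme, netloc)
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for nxt in links.get(url, []):
            if nxt in seen or urlsplit(nxt)[:2] != origin:
                continue  # skip repeats and off-origin links
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return order


site = {
    "https://a.test/": ["https://a.test/docs", "https://b.test/", "https://a.test/blog"],
    "https://a.test/docs": ["https://a.test/docs/api", "https://a.test/"],
    "https://a.test/docs/api": ["https://a.test/deep"],
}
for url in bfs_same_origin("https://a.test/", site, max_depth=2):
    print(url)
```

A real crawler would add the concurrency and delay controls the feature list mentions. The sketch covers only the ordering, deduplication, origin check, and depth limit.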
