AgenticCrawler
Health Uyari
- License — License: MIT
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 5 GitHub stars
Code Gecti
- Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Gecti
- Permissions — No dangerous permissions requested
Bu listing icin henuz AI raporu yok.
acrawl — LLM-powered web crawler. Describe what you want in plain English, get structured data back. Single Rust binary, 25 providers, MCP server built-in.
█████╗ ██████╗██████╗ █████╗ ██╗ ██╗██╗ ██╔══██╗██╔════╝██╔══██╗██╔══██╗██║ ██║██║ ███████║██║ ██████╔╝███████║██║ █╗ ██║██║ ██╔══██║██║ ██╔══██╗██╔══██║██║███╗██║██║ ██║ ██║╚██████╗██║ ██║██║ ██║╚███╔███╔╝███████╗ ╚═╝ ╚═╝ ╚═════╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚══╝╚══╝ ╚══════╝
LLM-powered web crawler. Describe what you want in plain English — get structured data back.
Single binary. No Python runtime. 21 tools. 25 LLM providers. MCP server built-in.
Why acrawl?
Most web scraping still means writing code: XPath selectors, pagination logic, retry handling, anti-bot workarounds. LLMs can read pages like humans do, but wiring one up to a browser is a project in itself.
acrawl is that wiring, packaged as a single Rust binary. You describe a goal; the agent figures out which pages to visit, what to click, what to extract, and when it's done.
- No code required. Describe the goal in English. The agent plans and executes.
- One binary, zero runtimes.
cargo build --releaseproduces a self-contained executable. No Python, no Node runtime — just Rust and a Chromium download for browser automation. - Smart fetching. Static pages are served over HTTP (fast). When JavaScript or interaction is needed, acrawl detects JS framework markers (
__next_data__,__nuxt,__vue,ng-app, React roots), auth redirects, and short<noscript>bodies — then transparently escalates to a headless browser. - 21 tools, not a chatbot. The agent has real tools — navigate, click, fill forms, run JS, take screenshots, manage tabs — plus a fork/join layer to spawn parallel sub-agents across multiple browser tabs.
- 25 LLM providers. Anthropic, OpenAI, Google Gemini, DeepSeek, AWS Bedrock, Azure OpenAI, Vertex AI, GitHub Copilot, Groq, Mistral, xAI, Cohere, Alibaba DashScope, OpenRouter, and more. Or bring your own via any OpenAI-compatible endpoint.
- MCP client. Extend the agent with custom tools via Model Context Protocol servers (stdio, SSE, HTTP, WebSocket).
- MCP server.
acrawl mcpexposes all 17 browser tools plus an autonomousrun_goalagent to any MCP-compatible client — Claude Code, Cursor, Windsurf, VS Code, Zed, JetBrains, TRAE, Gemini CLI, and more. Install withacrawl mcp install.
How does it compare?
| acrawl | Scrapy | Playwright scripts | Browser-use | |
|---|---|---|---|---|
| No code needed | Yes | No | No | Yes |
| Single binary | Yes | No | No | No |
| JS rendering | Yes | No | Yes | Yes |
| LLM-powered | Yes | No | No | Yes |
| No Python required | Yes | No | No | No |
| Form filling / interaction | Yes | Limited | Yes | Yes |
| Sub-agent parallelism | Yes | N/A | No | No |
| 25 provider support | Yes | N/A | N/A | Limited |
| MCP client (use external tools) | Yes | No | No | No |
| MCP server (expose as tools) | Yes | No | No | No |
Quick Start
Install
Linux / macOS (x64 / ARM64):
curl -fsSL https://raw.githubusercontent.com/Mingye-Lu/AgenticCrawler/main/install.sh | bash
Windows (x64, PowerShell):
irm https://raw.githubusercontent.com/Mingye-Lu/AgenticCrawler/main/install.ps1 | iex
This downloads the latest binary, verifies its SHA256 checksum, and sets up CloakBrowser for stealth browser automation. Requires Node.js 20+ for browser features.
acrawl checks for updates on startup and shows a notification when a new version is available.
Build from sourcegit clone https://github.com/Mingye-Lu/AgenticCrawler.git
cd AgenticCrawler
cargo build --release
# Install CloakBrowser (required for browser automation — binary auto-downloads on first use)
npm install
Browser Extension (optional)
The acrawl Bridge extension lets acrawl control your real browser (with your sessions, cookies, and existing extensions) instead of a headless CloakBrowser instance. Download acrawl-extension.zip from the latest release, unzip it, then load it into your browser:
| Browser | Extensions page | Developer mode toggle |
|---|---|---|
| Chrome | chrome://extensions |
Top-right |
| Edge | edge://extensions |
Bottom-left |
| Brave | brave://extensions |
Top-right |
| Arc / Vivaldi / Opera | <browser>://extensions |
Varies |
Enable Developer mode, click Load unpacked, and select the unzipped folder. Then run /extension in the acrawl REPL to connect. See extension/README.md for full setup details.
Configure
# Set up your LLM provider (interactive prompt)
./target/release/acrawl auth anthropic # or: openai, other
Credentials are stored in ~/.acrawl/credentials.json. Override the config directory with ACRAWL_CONFIG_HOME.
Run
# Interactive REPL
./target/release/acrawl
# One-shot mode
./target/release/acrawl prompt "scrape all book titles and prices from books.toscrape.com"
# Resume a saved session
./target/release/acrawl --resume session.json /status /compact
Examples
Scrape a product catalog:
acrawl > scrape all book titles, prices, and ratings from books.toscrape.com
The agent navigates to the site, reads the page, extracts the data, paginates through all 50 pages, and returns structured JSON.
Fill and submit a form:
acrawl > go to example.com/contact, fill in name "Jane Doe", email "[email protected]",
message "Hello", and submit the form
The agent locates form fields, fills them in, clicks submit, and confirms the result.
Monitor a price:
acrawl > check the current price of "Rust in Action" on books.toscrape.com
Single-page extraction — the agent fetches, reads, and returns the price without unnecessary navigation.
Extract from JS-rendered pages:
acrawl > get all repository names and star counts from github.com/trending
Static HTTP won't work here. acrawl detects React/Next.js markers and automatically escalates to a headless browser to render the JavaScript.
Parallel multi-page crawl:
acrawl > scrape the title, author, and price of every book across all 50 pages on books.toscrape.com.
Fork a sub-agent for each page to speed this up.
The agent spawns up to 5 concurrent sub-agents, each on its own browser tab, to crawl pages in parallel. Results are merged when all sub-agents finish.
Features
21-Tool Toolbox
Navigation
| Tool | Description |
|---|---|
navigate |
Go to a URL (supports format: markdown/text/html). Uses HTTP first, auto-escalates to browser when JS is detected. Returns structured content with a page_map. |
go_back |
Browser back button. Returns page_state with the resulting page structure. |
scroll |
Scroll up or down by pixel amount (pixels, default: 500). Returns page_state after scrolling. |
switch_tab |
Switch to a different browser tab by index. Returns page_state of the new tab. |
wait |
Wait for a CSS selector to reach a given state (visible, hidden, attached, detached) or a fixed timeout (up to 300s). |
Interaction
| Tool | Description |
|---|---|
click |
Click an element by CSS selector. Returns page_state after the click. |
click_at |
Click at specific viewport coordinates (x, y). Use for canvas, maps, or SVGs. Returns page_state. |
fill_form |
Fill form fields by selector or name, with optional auto-submit. Returns page_state. |
select_option |
Select a dropdown option by value, label, or index. Returns page_state. |
hover |
Hover over an element to reveal tooltips or menus. Returns page_state. |
press_key |
Press a keyboard key (Enter, Escape, Tab, etc.), optionally targeting an element. Returns page_state. |
execute_js |
Run arbitrary JavaScript in the page context and return the result. |
Content Extraction
| Tool | Description |
|---|---|
page_map |
Get the page's structural map: headings, landmarks, forms, links, and interactive elements (with selectors and state). Supports scope to query within a specific element (e.g. a modal). |
read_content |
Extract text by heading name or CSS selector, with offset/limit pagination for large pages. |
list_resources |
List all links, images, and forms on the current page. |
screenshot |
Capture a full-page screenshot (base64 PNG). |
save_file |
Download a URL to the output directory (path traversal protected). |
Agent Control
| Tool | Description |
|---|---|
fork |
Spawn a sub-agent on a new browser tab with its own goal and step budget. |
wait_for_subagents |
Wait for specific or all sub-agents to finish and collect results. |
done |
Signal task completion. Auto-waits for any active sub-agents and merges their data. |
Sub-Agent Parallelism
The agent can fork child agents to crawl multiple pages concurrently. Each child gets its own browser tab, step budget, and independent state.
| Setting | Default | Description |
|---|---|---|
max_concurrent_per_parent |
5 | Max children running in parallel per parent |
max_fork_depth |
3 | Max nesting depth (agents forking agents) |
max_total_agents |
10 | Global cap across all parents |
fork_child_max_steps |
15 | Step budget per child agent |
fork_wait_timeout_secs |
60 | Timeout waiting for sub-agents |
Smart Fetch Routing
Every navigate call goes through a two-tier fetch router:
- HTTP first — fast reqwest-based fetch (30s timeout, follows up to 10 redirects).
- Auto-escalation — if any of the following are detected, the request is transparently replayed in a headless browser:
- HTTP 403, 429, or 503 responses
- JS framework markers:
__next_data__,__nuxt,__vue,ng-app,_react,data-reactroot - Auth redirects: URLs containing
/login,/signin,/auth,/oauth,accounts.google.com - Short response body (< 500 chars) with a
<noscript>tag
When --no-headless / --headed is set, all fetches go directly through the browser.
25 LLM Providers
| Category | Provider | Auth | Env Var |
|---|---|---|---|
| Popular | Anthropic | API key | ANTHROPIC_API_KEY |
| OpenAI | API key | OPENAI_API_KEY | |
| Google Gemini | API key | GEMINI_API_KEY | |
| DeepSeek | API key | DEEPSEEK_API_KEY | |
| Enterprise | Amazon Bedrock | AWS SigV4 | AWS_ACCESS_KEY_ID |
| Azure OpenAI | Azure API key | AZURE_OPENAI_API_KEY | |
| Google Vertex AI | GCP service account | GOOGLE_APPLICATION_CREDENTIALS | |
| GitHub Copilot | Device OAuth | — | |
| SAP AI Core | API key | SAP_AI_CORE_API_KEY | |
| GitLab Duo | GitLab token | GITLAB_TOKEN | |
| OSS Hosting | Groq | API key | GROQ_API_KEY |
| Cerebras | API key | CEREBRAS_API_KEY | |
| DeepInfra | API key | DEEPINFRA_API_KEY | |
| Together AI | API key | TOGETHER_API_KEY | |
| Mistral AI | API key | MISTRAL_API_KEY | |
| Specialized | Perplexity | API key | PERPLEXITY_API_KEY |
| xAI (Grok) | API key | XAI_API_KEY | |
| Cohere | API key | COHERE_API_KEY | |
| Alibaba (DashScope) | API key | DASHSCOPE_API_KEY | |
| Gateways | OpenRouter | API key | OPENROUTER_API_KEY |
| Vercel AI | API key | VERCEL_API_KEY | |
| Cloudflare Workers AI | API token | CLOUDFLARE_API_TOKEN | |
| Cloudflare AI Gateway | API token | CLOUDFLARE_API_TOKEN | |
| Other | Venice AI | API key | VENICE_API_KEY |
| Custom (OpenAI-compatible) | API key (optional) | — |
Models use the provider/model-id format: anthropic/claude-sonnet-4-6, openai/gpt-4o, amazon-bedrock/anthropic.claude-sonnet-4-6-20250514-v1:0, etc.
Interactive TUI
The default interface is a full terminal UI with:
- Markdown rendering with syntax highlighting and streaming output
- Slash command overlay — type
/to see all commands with Tab completion - Model picker —
/modelopens a searchable list grouped by provider category - Auth modal —
/authwalks through provider setup interactively - Session header — shows current model, session ID, cost, and context usage in real time
- Debug mode —
/debugtoggles raw tool call input/output in the transcript - Reasoning effort —
Ctrl+Tcycles through high/medium/low for reasoning models (o3, o4-mini)
Keybindings:
| Key | Action |
|---|---|
Enter |
Submit prompt |
Shift+Enter / Ctrl+J |
Insert newline |
PageUp / PageDown |
Scroll transcript |
Ctrl+T |
Cycle reasoning effort |
Ctrl+C |
Interrupt task (busy) or exit (idle) |
Esc Esc |
Interrupt task (double-tap while busy) |
Tab |
Auto-complete slash command |
Running acrawl without a TTY on stdout (e.g. piped or redirected) exits with an error pointing at acrawl prompt for one-shot use and acrawl --resume for session maintenance.
Session Management
- Auto-save — sessions are saved automatically on exit.
- Resume —
--resume session.jsonreloads a conversation. Resume-safe slash commands (/status,/compact,/cost,/config,/version,/export,/help,/clear) can be appended to the command line. - Export —
/export [file]writes a human-readable markdown transcript. - Auto-compaction — when context exceeds the token threshold (default 200K), acrawl summarizes older messages while preserving the most recent turns, unique tools used, and pending work items.
- Multiple sessions —
/session listto browse,/session switch <id>to switch.
Tool Allowlist
Use --allowedTools to restrict which tools the agent can invoke (comma-separated, flag is repeatable):
acrawl prompt "scrape titles" --allowedTools navigate,read_content,screenshot
Omit --allowedTools to allow all 21 tools. Useful for locking down a crawl to read-only tools or excluding fork/wait_for_subagents when sub-agent parallelism is not desired.
MCP Extensibility
acrawl supports Model Context Protocol servers as a client, allowing you to extend the agent with custom tools. MCP tools are namespaced as server_name__tool_name and available alongside the built-in 20.
Supported transports: stdio, SSE, HTTP, WebSocket.
MCP Server (expose acrawl as a tool)
acrawl mcp starts a built-in MCP server that exposes acrawl's browser automation capabilities to external agents like Claude Code, Cursor, VS Code, Zed, JetBrains, TRAE, Gemini CLI, or any MCP-compatible client.
The server provides 18 tools in two modes:
Direct browser tools (17) — fine-grained control for clients that orchestrate themselves:navigate, click, click_at, fill_form, page_map, read_content, screenshot, go_back, scroll, wait, select_option, execute_js, hover, press_key, switch_tab, list_resources, save_file
Autonomous agent (1) — delegate a full crawl task:
run_goal— Execute a high-level crawl goal autonomously. The agent plans, navigates, and extracts data using its own LLM loop. Requires~/.acrawl/credentials.jsonconfigured with a model.
Transport: stdio only (no SSE / HTTP / WebSocket in this release).
Quick install
acrawl mcp install
Interactive installer that auto-detects your IDEs, lets you toggle which to configure (Space to select, Enter to confirm), and writes the correct config for each. Supports global (user-level) and project-level scopes.
Supported clients: Claude Code, Claude Desktop, Cursor, Windsurf, VS Code (Copilot), OpenCode, Zed, TRAE, JetBrains IDEs, Gemini CLI, Qwen Code, Codex CLI, Hermes, OpenClaw, Goose, Crush, Aider.
Manual configuration
If you prefer to configure manually, add this to your IDE's MCP config file:
| IDE | Config file | Configuration |
|---|---|---|
| Claude Code | .mcp.json (project)~/.claude.json (user) |
|
| Cursor | .cursor/mcp.json | |
| Windsurf | ~/.codeium/windsurf/mcp_config.json | |
| Claude Desktop | %APPDATA%\Claude\claude_desktop_config.json (Win)~/Library/Application Support/Claude/claude_desktop_config.json (Mac) | |
| TRAE | .trae/mcp.json | |
| Gemini CLI | ~/.gemini/settings.json | |
| Qwen Code | ~/.qwen/settings.json | |
| VS Code (Copilot) | .vscode/mcp.json |
|
| OpenCode | opencode.json |
|
| Zed | ~/.config/zed/settings.json |
|
Or via the Claude Code CLI directly:
claude mcp add acrawl -- acrawl mcp
The browser tools share a persistent session across calls. run_goal creates its own isolated agent and browser.
Requirements: The 17 browser tools work without any configuration. run_goal requires ~/.acrawl/credentials.json (via acrawl auth) for its internal LLM.
Usage
acrawl [OPTIONS] [COMMAND]
Commands:
prompt <text> Run a single goal non-interactively
mcp Start MCP server (stdio transport)
mcp install Install MCP config into your IDEs interactively
auth [provider] Configure provider credentials
system-prompt Print the system prompt (for debugging)
Options:
--model MODEL Model in provider/id format (e.g. anthropic/claude-sonnet-4-6)
--output-format FORMAT text | json
--resume FILE Resume a saved session (with optional /commands)
--compact Compact history on resume
--headless[=BOOL] Force browser headless on/off
--no-headless, --headed Launch browser in visible mode
--allowedTools TOOLS Restrict available tools (comma-separated, repeatable)
-p TEXT Shorthand for prompt mode
-V, --version Print version
Slash Commands
| Command | Description | Resume-safe |
|---|---|---|
/help |
List available commands | Yes |
/status |
Session info — model, tokens, cost | Yes |
/model [name] |
Show or switch the active model | No |
/compact |
Compact conversation history | Yes |
/clear |
Start a fresh session | Yes |
/cost |
Detailed cost breakdown | Yes |
/session [list|switch] |
List or switch sessions | No |
/export [file] |
Export conversation to markdown | Yes |
/resume <path> |
Load a saved session | No |
/config [section] |
View acrawl config | Yes |
/auth [provider] |
Configure credentials | No |
/headed |
Switch to visible browser | No |
/headless |
Switch to headless browser | No |
/extension |
Start extension bridge server, show token | No |
/cloakbrowser |
Switch back to CloakBrowser mode | No |
/debug |
Toggle raw tool output | No |
/version |
Version and build info | Yes |
/exit |
Exit and save session | No |
Configuration
All config lives in ~/.acrawl/ (override with ACRAWL_CONFIG_HOME).
credentials.json
Managed via acrawl auth. Stores per-provider:
| Field | Description |
|---|---|
active_provider |
Currently selected provider |
auth_method |
api_key, oauth, or aws_sigv4 |
api_key |
Provider API key |
oauth |
OAuth tokens — access, refresh, expiry, scopes |
default_model |
Default model for this provider |
base_url |
Custom API endpoint (e.g. local Ollama, Azure resource) |
Azure additionally requires resource_name and deployment_name. Bedrock requires aws_access_key_id, aws_secret_access_key, and region. Vertex requires gcp_project_id and gcp_region.
settings.json
Created with defaults on first run.
| Field | Default | Description |
|---|---|---|
headless |
true |
Run browser without a visible window |
max_steps |
50 |
Max agent loop iterations per goal |
output_dir |
"output" |
Where save_file writes output |
auto_compact_input_tokens |
200000 |
Token threshold for auto-compaction |
reasoning_effort |
"high" |
For reasoning models: high / medium / low |
max_concurrent_per_parent |
5 |
Max concurrent sub-agents per parent |
max_fork_depth |
3 |
Max nesting depth for forked agents |
max_total_agents |
10 |
Global cap on total agents |
fork_child_max_steps |
15 |
Step budget for each child agent |
fork_wait_timeout_secs |
60 |
Timeout for wait_for_subagents |
browser_backend |
null |
Active browser backend: "extension" or null (CloakBrowser) |
extension_bridge_port |
19876 |
Port for Chrome extension bridge WebSocket server |
Environment Variables
| Variable | Description |
|---|---|
ACRAWL_CONFIG_HOME |
Override config directory (default: ~/.acrawl/) |
Provider-specific env vars (see provider table above) are read as fallbacks when no credentials.json entry exists.
How It Works
flowchart LR
Goal([Goal\nnatural language]) --> Plan
Plan --> Navigate --> Observe --> Act --> Extract
Extract -->|repeat until done| Plan
Extract --> Output([Output\nJSON / CSV])
- The agent receives a goal and builds a multi-step plan via a 7-section system prompt covering identity, operating procedure, data integrity, constraints, error recovery, completion protocol, and parallel exploration guidance.
- Each turn, it picks from its 21 tools based on what it observes on the page.
navigatehits the FetchRouter, which tries HTTP first and auto-escalates to a headless Chromium browser when JavaScript, auth redirects, or framework markers are detected.- The browser is driven by an embedded Node.js subprocess (the PlaywrightBridge) speaking newline-delimited JSON over stdio — uses CloakBrowser for stealth browsing, not stock Playwright. Alternatively, acrawl can drive the user's real browser via a Chrome extension (
/extensioncommand) using CDP over a local WebSocket bridge. - For multi-page tasks, the agent can
forkchild agents onto separate browser tabs, each with independent state and step budgets.wait_for_subagentsordonemerges results. - When context grows large, auto-compaction summarizes older messages while preserving recent turns, tool usage, and pending work items.
- The agent calls
donewhen the goal is met, or stops when the step limit is reached.
Architecture
crates/
core/ Shared types, traits, error hierarchy (acrawl-core)
api/ 25 provider clients (Anthropic, OpenAI, Gemini, DeepSeek, Bedrock, Azure, ...), SSE streaming
browser/ PlaywrightBridge, ExtensionBridge, FetchRouter, BrowserContext, WsBridgeServer
agent/ 21 tools, agent loop, sub-agent fork/join, CrawlState
runtime/ ConversationRuntime, config, sessions, MCP client stack, OAuth PKCE
render/ Markdown rendering, tool output formatting, OutputSink
mcp-server/ Built-in MCP server (JSON-RPC over stdio), IDE installer
tui/ Ratatui terminal UI (acrawl-tui)
cli/ Thin binary entry point, LiveCli orchestration, session management
commands/ 17 slash commands with resume-safety annotations
crawler/ Transitional re-export shim (will be removed)
11 crates, ~38K lines of Rust, 770 tests.
Development
cargo build --release # build
cargo test --workspace # run all tests
cargo clippy --workspace --all-targets -- -D warnings # lint (pedantic)
cargo fmt --check # format check
See CONTRIBUTING.md for the full development guide.
Changelog
See CHANGELOG.md.
Security
See SECURITY.md for the security policy and how to report vulnerabilities.
License
Yorumlar (0)
Yorum birakmak icin giris yap.
Yorum birakSonuc bulunamadi