Ask-the-Web-103
Health Warn
- License — License: MIT
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 6 GitHub stars
Code Pass
- Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Pass
- Permissions — No dangerous permissions requested
No AI report is available for this listing yet.
Production-grade Perplexity-like AI agent with real-time web search, ReACT/ReWOO/Reflexion/Tree-Search reasoning, MCP & A2A protocols, multi-agent orchestration, streaming SSE API, and full evaluation suite. Built with FastAPI, OpenAI, Anthropic, and Redis.
Project 3 - Build an "Ask-the-Web" Agent similar to Perplexity with Tool calling
A production-grade, Perplexity-like AI research agent
built with ReACT · ReWOO · Reflexion · Tree Search · MCP · A2A
Ask anything. The agent searches, reasons, verifies, and answers —
with full citations, streaming output, and production-grade reliability.To better understand this project, first visit this link for a visualization of the project and what I built: Link
Then, if you want to learn each topic in a tutorial format, read this file thoroughly: Link
Quick Start •
Architecture •
Agents •
API Reference •
Configuration •
Evaluation •
Contributing
Table of Contents
- What Is This?
- Key Features
- Architecture
- Project Structure
- Quick Start
- Agent Types
- Workflows
- Tools
- Multi-Agent Systems
- API Reference
- Configuration
- Evaluation
- Observability
- Testing
- Deployment
- Roadmap
- Contributing
- License
What Is This?
Ask-the-Web Agent is a production-ready AI research assistant that works
like Perplexity AI — but fully open, self-hosted,
and extensible.
You ask a question in natural language. The agent:
- Plans how to answer it (which strategy, how many steps)
- Searches the web in real time using Tavily or SerpAPI
- Scrapes relevant pages for detailed content
- Reasons step-by-step using one of five agent strategies
- Verifies its own answer through self-critique (Reflexion)
- Synthesizes a final, cited, markdown-formatted answer
- Streams the result token-by-token to the client
Unlike a raw LLM, this agent never makes up facts — every claim is
grounded in real-time web sources with inline citations.
Why build this?
| Problem with raw LLMs | How this agent solves it |
|---|---|
| Knowledge cutoff (training data is stale) | Real-time web search on every query |
| Hallucination (confident but wrong) | Source-grounded answers + Reflexion critique |
| No citations (can't verify claims) | Every fact linked to a URL |
| Single-shot (one chance to get it right) | Multi-step reasoning with tool loops |
| Can't handle complex multi-part questions | Orchestrator decomposes and parallelizes |
Key Features
Five Agent Strategies
Choose automatically via smart routing or manually per request:
- ReACT — Fast, iterative reason-and-act loops
- Reflexion — ReACT + self-critique and automatic revision
- ReWOO — Full plan upfront, parallel execution, single synthesis
- Orchestrator — Decomposes complex queries into parallel sub-agents
- Tree Search — Explores multiple reasoning paths, picks the best
Production Tool Stack
- Web Search — Tavily (primary) or SerpAPI (fallback)
- Web Scraper — Playwright + BeautifulSoup, cleans boilerplate
- Calculator — Safe sandboxed math expression evaluator
- Summarizer — Condenses long scraped content
- MCP Support — Connect any Model Context Protocol server
Multi-Agent Coordination
- Orchestrator-Worker — Spawn N parallel specialist agents
- A2A Protocol — Agent-to-Agent HTTP communication standard
- MultiAgentCoordinator — Route tasks to registered specialist agents
API & Streaming
- REST API — FastAPI with full OpenAPI docs
- SSE Streaming — Token-by-token answer delivery
- Redis Cache — SHA256-keyed response caching (1hr TTL)
- Rate Limiting — Per-IP sliding window
Evaluation System
- LLM-as-Judge — Multi-dimensional answer quality scoring
- Text Metrics — Citation coverage, structure, length (no LLM cost)
- Benchmark Suite — 5 built-in test cases across categories
- Parallel Voting — Majority-vote answer verification
Production Infrastructure
- Structured logging — structlog + rich, JSON in production
- Prometheus metrics —
/metricsendpoint - Docker + Compose — One-command deployment
- Retry logic — Tenacity-backed exponential backoff
- Context management — Automatic token trimming at window limits
- Multi-provider — Switch between OpenAI and Anthropic
Architecture
System Overview
┌─────────────────────────────────┐
│ Client (HTTP/SSE) │
└──────────────┬──────────────────┘
│
┌──────────────▼──────────────────┐
│ FastAPI (REST API) │
│ middleware: rate limit, logging │
│ middleware: request ID, errors │
└──────────────┬──────────────────┘
│
┌──────────────▼──────────────────┐
│ Redis Cache │
│ (SHA256 keyed, 1hr TTL) │
└──────────────┬──────────────────┘
miss │
┌─────────────▼───────────────────┐
│ Query Router │
│ rule-based pre-filter + │
│ LLM-based classification │
└──┬───────┬──────┬──────┬────────┘
│ │ │ │
┌────────────▼─┐ ┌───▼──┐ ┌▼────┐ ┌▼──────────────┐
│ ReACT Agent │ │ReWOO │ │Refl.│ │ Orchestrator │
│ (fast Q&A) │ │Agent │ │Agent│ │ (multi-part) │
└──────┬───────┘ └──┬───┘ └──┬──┘ └──────┬────────┘
│ │ │ │
┌──────▼────────────▼─────────▼────────────▼───────┐
│ Tool Executor │
│ (parallel or sequential) │
└───┬──────────┬──────────┬──────────┬─────────────┘
│ │ │ │
┌──────▼──┐ ┌─────▼───┐ ┌───▼────┐ ┌──▼──────────┐
│ Web │ │ Web │ │ Calc- │ │ MCP │
│ Search │ │ Scraper │ │ ulator │ │ Servers │
└─────────┘ └─────────┘ └────────┘ └─────────────┘
Agent Decision Flow
User Query
│
▼
┌───────────────────────────────────────────────┐
│ TaskPlanner │
│ Analyzes complexity → PlanningLevel (1-5) │
└───────────────────────┬───────────────────────┘
│
┌─────────────▼──────────────┐
│ QueryRouter │
│ Rule-based quick classify │
│ ──────────────────────── │
│ LLM-based deep classify │
└─────┬──────┬──────┬───────┘
│ │ │
┌──────────▼─┐ ┌─▼────┐ ┌▼───────────────────┐
│ simple_qa │ │ calc │ │ research / │
│ → ReACT │ │→ReACT│ │ multi_faceted / │
└────────────┘ └──────┘ │ → Reflexion / │
│ → Orchestrator │
└─────────────────────┘
│
┌─────────▼──────────┐
│ ReACT Loop │
│ ┌─────────────┐ │
│ │ THINK │ │
│ │ (LLM call) │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ ACT │ │
│ │ (tool calls)│ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ OBSERVE │ │
│ │ (results) │ │
│ └──────┬──────┘ │
│ │ │
│ done?│ no → loop │
└──────────┼───────────┘
│ yes
┌──────────▼───────────┐
│ Final Answer │
│ (with citations) │
└──────────────────────┘
Token & Context Management
Every LLM call:
messages → TokenCounter.count_messages()
│
exceeds context limit?
yes │ no
│ │
trim_to_fit() │ │ → proceed
(drop oldest │
non-system │
messages) │
└──────────►│ → LLM call
📁 Project Structure
ask_the_web_agent/
│
├── 📄 pyproject.toml # Dependencies, build config, tool settings
├── 📄 .env.example # All environment variables documented
├── 📄 docker-compose.yml # Agent + Redis + Prometheus
├── 📄 Dockerfile # Multi-stage build (builder + runtime)
├── 📄 README.md # This file
│
├── 📁 configs/ # Application configuration
│ ├── settings.py # Pydantic Settings (type-safe env loading)
│ ├── logging_config.py # structlog + rich setup
│ └── prometheus.yml # Prometheus scrape config
│
├── 📁 core/ # Shared infrastructure
│ ├── exceptions.py # Full exception hierarchy
│ ├── message_types.py # Message, ToolCall, AgentState types
│ ├── token_counter.py # tiktoken-based counter + trim
│ └── llm_client.py # Unified OpenAI + Anthropic client
│
├── 📁 tools/ # Tool layer
│ ├── base_tool.py # BaseTool ABC + ToolDefinition schema
│ ├── tool_registry.py # Central tool store
│ ├── tool_executor.py # Parallel + sequential execution
│ ├── web_search.py # Tavily / SerpAPI search
│ ├── web_scraper.py # httpx + BeautifulSoup scraper
│ ├── calculator.py # Safe sandboxed math eval
│ ├── summarizer.py # Extractive text summarizer
│ └── mcp_client.py # MCP protocol client + registry
│
├── 📁 agents/ # Agent implementations
│ ├── base_agent.py # Abstract base + shared utilities
│ ├── react_agent.py # ReACT: iterative reason-act-observe
│ ├── reflexion_agent.py # Reflexion: ReACT + self-critique
│ ├── rewoo_agent.py # ReWOO: plan-execute-solve
│ ├── orchestrator.py # Orchestrator-Worker: decompose + parallel
│ ├── tree_search_agent.py # Best-first tree search
│ ├── planner.py # Task planner + PlanningLevel
│ └── a2a.py # Agent-to-Agent protocol
│
├── 📁 workflows/ # Workflow patterns
│ ├── prompt_chaining.py # Sequential chained LLM calls
│ ├── routing.py # LLM + rule-based query router
│ ├── parallelization.py # Sectioning + voting patterns
│ ├── reflection.py # Standalone critique-revise loop
│ └── __init__.py # build_routed_pipeline()
│
├── 📁 evaluation/ # Quality assessment
│ ├── metrics.py # Fast rule-based text metrics
│ ├── evaluator.py # LLM-as-judge evaluator
│ └── benchmarks.py # Benchmark runner + built-in cases
│
├── 📁 api/ # FastAPI application
│ ├── main.py # App factory + lifespan
│ ├── routes.py # All endpoint handlers
│ ├── schemas.py # Pydantic request/response models
│ ├── middleware.py # Rate limit, logging, error handling
│ └── cache.py # Redis response cache
│
└── 📁 tests/ # Full test suite
├── test_tools.py # Tool unit tests
├── test_agents.py # Agent behavior tests
├── test_workflows.py # Workflow + metric tests
└── test_api.py # API endpoint + middleware tests
Quick Start
Prerequisites
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.11+ | Uses match statements, Self type |
| Redis | 7+ | For response caching |
| Docker | 24+ | Optional, for containerized run |
| OpenAI API Key | — | Primary LLM provider |
| Tavily API Key | — | Primary search provider |
Minimum to get started: Python 3.11 + OpenAI key + Tavily key.
Redis and Docker are optional for local development.
Installation
Option A — pip (development)
# 1. Clone the repository
git https://github.com/AdilShamim8/Ask-the-Web-103.git
cd ask-the-web-agent
# 2. Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# 3. Install with all dev dependencies
pip install -e ".[dev]"
# 4. Install Playwright browser (for web scraping)
playwright install chromium
# 5. Verify installation
python -c "import openai, fastapi, redis; print('✅ All dependencies OK')"
Option B — Docker (production)
git clone https://github.com/AdilShamim8/Ask-the-Web-103.git
cd ask-the-web-agent
cp .env.example .env
# Edit .env with your API keys
docker-compose up -d
Environment Setup
Copy the example and fill in your keys:
cp .env.example .env
Open .env and set the required values:
# ── REQUIRED ────────────────────────────────────────────────────────────────
# LLM provider (at least one required)
OPENAI_API_KEY=sk-proj-... # Get at: https://platform.openai.com
ANTHROPIC_API_KEY=sk-ant-... # Get at: https://console.anthropic.com
# Search provider (at least one required)
TAVILY_API_KEY=tvly-... # Get at: https://tavily.com (free tier available)
SERPAPI_API_KEY=... # Get at: https://serpapi.com (fallback)
# ── OPTIONAL ────────────────────────────────────────────────────────────────
# Which providers to use by default
DEFAULT_LLM_PROVIDER=openai # openai | anthropic
DEFAULT_MODEL=gpt-4o # gpt-4o | gpt-4o-mini | claude-3-5-sonnet-...
SEARCH_PROVIDER=tavily # tavily | serpapi
# Redis (skip for local dev — cache silently disabled if unavailable)
REDIS_URL=redis://localhost:6379/0
# Agent behavior
MAX_AGENT_ITERATIONS=10 # Hard cap on reasoning loops
MAX_TOKENS_PER_RESPONSE=4096 # Max tokens in any single LLM response
CONTEXT_WINDOW_LIMIT=120000 # Trim history above this token count
# Application
APP_ENV=development # development | staging | production
LOG_LEVEL=INFO # DEBUG | INFO | WARNING | ERROR
# Rate limiting
RATE_LIMIT_REQUESTS=100 # Requests per window per IP
RATE_LIMIT_WINDOW=60 # Window size in seconds
# Timeouts (seconds)
LLM_TIMEOUT=60.0
SEARCH_TIMEOUT=15.0
SCRAPE_TIMEOUT=20.0
Security note: Never commit
.envto version control.
The.gitignorealready excludes it.
Running Locally
# Start Redis (required for caching — skip if you don't need it)
docker run -d -p 6379:6379 redis:7-alpine
# Start the API server
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
# Verify it's running
curl http://localhost:8000/v1/health
Expected response:
{
"status": "ok",
"version": "1.0.0",
"providers": {
"openai": true
}
}
Open the interactive API docs:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Note: Docs are disabled in production (
APP_ENV=production).
Running with Docker
# Start everything: agent + Redis + Prometheus
docker-compose up -d
# View logs
docker-compose logs -f agent
# Scale workers (behind a load balancer)
docker-compose up -d --scale agent=3
# Stop everything
docker-compose down
# Stop and remove volumes (wipes Redis data)
docker-compose down -v
Services started by docker-compose up:
| Service | Port | Description |
|---|---|---|
agent |
8000 |
FastAPI application |
redis |
6379 |
Response cache |
prometheus |
9090 |
Metrics collection |
Agent Types
The system supports five distinct agent strategies. The auto mode
uses the smart router to pick the right one automatically.
1. ReACT Agent
Best for: Simple factual questions, current events, quick lookups.
How it works:
THINK → ACT → OBSERVE → THINK → ACT → OBSERVE → ... → FINAL ANSWER
The LLM alternates between reasoning about what to do next (THINK)
and calling tools (ACT), then observing the tool results (OBSERVE).
This continues until the LLM produces a response with no tool calls.
Example interaction:
User: "Who won the 2024 Nobel Prize in Physics?"
Agent THINKS: I need to search for this.
Agent ACTS: web_search("2024 Nobel Prize Physics winner")
Agent OBSERVES: [Search results: John Hopfield and Geoffrey Hinton...]
Agent THINKS: I have the answer.
Agent ANSWERS: "The 2024 Nobel Prize in Physics was awarded to
John Hopfield and Geoffrey Hinton..."
Configuration:
{
"query": "Who won the 2024 Nobel Prize in Physics?",
"agent_type": "react",
"max_iterations": 5
}
Token cost: Low (2 LLM calls per tool use)
Latency: Fast (2–4 seconds typical)
2. Reflexion Agent
Best for: Research questions requiring accuracy verification,
complex topics where errors are costly.
How it works:
ReACT run → Initial Answer
│
▼
Reflection LLM: "Is this answer accurate and complete?"
│
├── VERDICT: ACCEPT → Return answer
│
└── VERDICT: REVISE → ReACT run with critique context
│
▼
Revised Answer → Reflect again (max N rounds)
The agent critiques its own answer using a separate LLM call that
checks for factual accuracy, completeness, and citation quality.
Example critique output:
VERDICT: REVISE
CRITIQUE: The answer states the prize was awarded for "AI research"
but does not specify the cited contribution (artificial neural
networks and Boltzmann machines).
SUGGESTION: Search for the specific scientific contribution cited by
the Nobel Committee and include it in the answer.
Configuration:
{
"query": "Explain the mechanism behind CRISPR-Cas9 gene editing",
"agent_type": "reflexion"
}
Token cost: Medium (adds 1–2 LLM calls per reflection round)
Latency: Medium (5–12 seconds typical)
3. ReWOO Agent
Best for: Queries with a clear, known sequence of research steps.
Most token-efficient for multi-step research.
How it works (Xu et al., 2023):
Phase 1 — PLAN (1 LLM call):
Step 1: Thought: ... Tool: web_search Args: {...}
Step 2: Thought: ... Tool: scrape_webpage Args: {url: #E1.results[0].url}
Step 3: Thought: ... Tool: web_search Args: {...}
Phase 2 — EXECUTE (parallel where possible):
Steps without #E refs → run in PARALLEL
Steps with #E refs → run SEQUENTIALLY after dependencies
Phase 3 — SOLVE (1 LLM call):
LLM reads all observations → writes final answer
Why it's efficient: Instead of O(2N) LLM calls (ReACT), ReWOO
uses O(2) LLM calls regardless of how many tool steps are needed.
Example plan generated:
Step 1:
Thought: Search for recent SpaceX launches
Tool: web_search
Args: {"query": "SpaceX Starship launches 2024", "num_results": 5}
Step 2:
Thought: Get detailed info from the most relevant result
Tool: scrape_webpage
Args: {"url": "#E1"}
Step 3:
Thought: Search for launch success metrics
Tool: web_search
Args: {"query": "SpaceX Starship 2024 success rate statistics"}
Configuration:
{
"query": "What were SpaceX's key milestones in 2024?",
"agent_type": "rewoo"
}
Token cost: Lowest for multi-step (only 2 LLM calls total)
Latency: Fast (parallel execution)
4. Orchestrator Agent
Best for: Complex multi-part questions that span multiple
independent topics, comparison queries, comprehensive research reports.
How it works:
Orchestrator LLM: Decompose into sub-questions
│
├── "Sub-question 1" → Worker ReACT Agent 1 ─┐
├── "Sub-question 2" → Worker ReACT Agent 2 ─┤ (parallel)
├── "Sub-question 3" → Worker ReACT Agent 3 ─┤
└── "Sub-question N" → Worker ReACT Agent N ─┘
│
┌──────────────▼──────────────┐
│ Orchestrator LLM │
│ Synthesizes all answers │
│ into unified response │
└─────────────────────────────┘
Example decomposition:
Query: "Compare the AI strategies of the US, China, and EU in 2024"
[
"What is the United States AI strategy and major initiatives in 2024?",
"What is China's AI development strategy and investments in 2024?",
"What is the European Union's AI regulatory and investment approach in 2024?"
]
All three sub-questions are answered simultaneously by parallel
ReACT agents, then synthesized into a unified comparison.
Configuration:
{
"query": "Compare AI chip strategies of NVIDIA, AMD, and Intel in 2024",
"agent_type": "orchestrator"
}
Token cost: Higher (N parallel agents + synthesis call)
Latency: Moderate despite N agents (they run in parallel)
5. Tree Search Agent
Best for: Ambiguous questions with multiple valid approaches,
exploratory research, hypothesis generation and testing.
How it works (Beam Search over reasoning paths):
Depth 0: [Root: "Start researching..."]
│
├── Expand: K=3 candidate thoughts
│
Depth 1: [Candidate A: 0.85] [Candidate B: 0.72] [Candidate C: 0.41]
│ │
│ beam=2: keep top 2 │
▼ ▼
Depth 2: [A1: 0.91] [A2: 0.78] [B1: 0.89] [B2: 0.55]
│
│ Terminal detected (score 0.91)
▼
FINAL ANSWER from best terminal node
At each depth:
- Each beam node generates
branching_factorcandidate next thoughts - All candidates are scored in parallel (0.0–1.0)
- Top
beam_widthcandidates become the next beam - If any candidate signals a final answer, the highest-scored wins
Configuration:
{
"query": "What might cause a sudden drop in transformer model performance?",
"agent_type": "tree_search"
}
Token cost: Highest (branching factor × depth × 2 LLM calls)
Latency: Slower (but finds better answers for hard problems)
Agent Selection Guide
Is the question simple and factual?
│
YES ──────┤──────── NO
│ │
[ ReACT ] Does it have multiple
independent sub-parts?
│
YES ──────┤──────── NO
│ │
[Orchestrator] Is accuracy
critical?
│
YES ──────┤─── NO
│ │
[Reflexion] Is the
sequence
known?
│
YES ─────┤── NO
│ │
[ ReWOO ] [TreeSearch]
Or just use "agent_type": "auto" and let the router decide.
Workflows
Workflows are reusable reasoning patterns that agents are built from.
You can use them directly or compose them into custom agents.
Prompt Chaining
Execute a sequence of LLM calls where each step's output feeds the next.
from workflows.prompt_chaining import PromptChain, ChainStep
from core.llm_client import LLMClient
chain = PromptChain(llm_client=LLMClient())
chain.add_step(ChainStep(
name="identify_intent",
prompt_template="Analyze this question: {query}\nIdentify: intent, entities, time-sensitivity.",
output_key="intent",
))
chain.add_step(ChainStep(
name="generate_queries",
prompt_template="Generate 3 search queries for:\nQuestion: {query}\nIntent: {intent}",
output_key="search_queries",
transform=lambda text: text.strip().split("\n"), # parse into list
))
result = await chain.run({"query": "Latest AI breakthroughs 2024"})
print(result["search_queries"])
# → ["AI breakthroughs 2024", "machine learning advances 2024", ...]
Routing
Route queries to different handlers based on LLM or rule-based classification.
from workflows.routing import QueryRouter, QueryClassifier, Route
# Rule-based (free, no LLM call)
route = QueryClassifier.quick_classify("hello there")
# → "conversational"
# LLM-based
router = QueryRouter(llm_client=llm)
router.add_route(Route("simple_qa", "Short factual Q&A", react_handler))
router.add_route(Route("research", "Deep analysis needed", reflexion_handler))
route_name, result = await router.route("What causes inflation?")
Parallelization
Sectioning — Run the same worker on multiple items concurrently:
from workflows.parallelization import ParallelSectioning
sectioner = ParallelSectioning(max_concurrency=5)
urls = ["https://a.com", "https://b.com", "https://c.com"]
results = await sectioner.run(urls, worker=scrape_tool.execute)
Voting — Run the same prompt N times, take majority vote:
from workflows.parallelization import ParallelVoting, AnswerVerifier
voter = ParallelVoting(llm_client=llm, num_votes=5, temperature=0.7)
majority, distribution = await voter.vote(messages, extract_answer=str.strip)
# Fact verification
verifier = AnswerVerifier(llm_client=llm, num_votes=3)
result = await verifier.verify(
claim="The Eiffel Tower is 330 meters tall",
context=scraped_page_content,
)
# → {"verdict": "FALSE", "confidence": 0.85, "distribution": {...}}
Reflection
Apply critique-and-revise to any generated text:
from workflows.reflection import ReflectionWorkflow
reflector = ReflectionWorkflow(llm_client=llm, rounds=2)
result = await reflector.run(
question="What is quantum entanglement?",
initial_answer=draft_answer,
)
print(result["final_answer"]) # improved version
print(result["rounds"]) # list of {critique, revised_answer}
Tools
Built-in Tools
| Tool | Name | Description | Key Args |
|---|---|---|---|
| Web Search | web_search |
Search via Tavily or SerpAPI | query, num_results, search_depth |
| Web Scraper | scrape_webpage |
Fetch + clean page text | url, extract_links |
| Calculator | calculator |
Safe math expression eval | expression |
| Summarizer | summarize_text |
Extractive text summary | text, max_sentences |
MCP Integration
Connect any Model Context Protocol server:
from tools.mcp_client import MCPRegistry, MCPServerConfig
from tools import build_default_registry
# Define your MCP servers
mcp = MCPRegistry()
mcp.add_server(MCPServerConfig(
name="filesystem",
base_url="http://localhost:3001",
api_key="your-mcp-key",
))
mcp.add_server(MCPServerConfig(
name="database",
base_url="http://localhost:3002",
))
# Build registry: local tools + all MCP tools auto-discovered
base = build_default_registry()
registry = await mcp.build_registry(base_registry=base)
# Use with any agent
agent = build_agent(AgentType.REACT, registry=registry)
state = await agent.run("Query the database for last month's sales")
MCP tools are automatically discovered via the tools/list JSON-RPC call
and wrapped as standard BaseTool instances — the agent treats them
identically to built-in tools.
Adding Custom Tools
Create any tool by subclassing BaseTool:
from tools.base_tool import BaseTool, ToolDefinition
from core.exceptions import ToolExecutionError
class WeatherTool(BaseTool):
"""Fetch current weather for a city."""
@property
def definition(self) -> ToolDefinition:
return ToolDefinition(
name="get_weather",
description="Get current weather conditions for any city.",
parameters={
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "City name, e.g. 'Tokyo'",
},
"units": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"default": "celsius",
},
},
"required": ["city"],
},
)
async def execute(self, city: str, units: str = "celsius", **_) -> str:
async with httpx.AsyncClient() as client:
resp = await client.get(
"https://api.weather.example.com/current",
params={"city": city, "units": units},
)
data = resp.json()
return json.dumps(data)
# Register it
from tools import build_default_registry
registry = build_default_registry()
registry.register(WeatherTool())
# Use with any agent
agent = build_agent(AgentType.REACT, registry=registry)
Requirements for a valid tool:
- Subclass
BaseTool - Implement
definitionproperty → returnsToolDefinitionwith valid JSON Schema - Implement
async execute(**kwargs) -> str→ always returns a string - Raise
ToolExecutionError(tool_name, reason)on failure (never raise raw exceptions)
Multi-Agent Systems
Orchestrator-Worker Pattern
The OrchestratorAgent implements the orchestrator-worker pattern natively.
One orchestrator LLM decomposes the query; N worker ReACT agents run in
parallel; the orchestrator synthesizes all results.
from agents import AgentType, build_agent
agent = build_agent(
agent_type=AgentType.ORCHESTRATOR,
model="gpt-4o",
max_workers=4, # max parallel worker agents
max_iterations=8, # per-worker iteration cap
)
state = await agent.run(
"Compare renewable energy adoption rates in Germany, France, and the UK"
)
print(state.final_answer)
print(state.metadata["sub_questions"]) # what the orchestrator decomposed
print(state.metadata["worker_iterations"]) # how many steps each worker took
A2A (Agent-to-Agent) Protocol
Expose any agent as an A2A-compliant HTTP service, and call remote agents
from other agents using the standardized protocol.
Expose your agent as an A2A server:
from agents.a2a import AgentCard, AgentCapability, create_a2a_router
from api.main import app
# Define what your agent can do
card = AgentCard(
name="Research Specialist",
description="Deep web research agent specializing in science topics",
url="https://research-agent.yourdomain.com",
capabilities=[
AgentCapability(
name="research",
description="Research any scientific topic with citations",
input_schema={
"type": "object",
"properties": {"query": {"type": "string"}},
"required": ["query"],
},
)
],
)
# Define the handler
async def handle_task(capability: str, input_data: dict) -> dict:
agent = build_agent(AgentType.REFLEXION)
state = await agent.run(input_data["query"])
return {
"answer": state.final_answer,
"sources": state.sources,
}
# Mount A2A routes
a2a_router = create_a2a_router(card, handle_task)
app.include_router(a2a_router)
# Now serving:
# GET /.well-known/agent.json → capability card
# POST /a2a/tasks → submit task
# GET /a2a/tasks/{id} → poll status
# DELETE /a2a/tasks/{id} → cancel
Call a remote agent from another agent:
from agents.a2a import A2AClient, AgentCard, MultiAgentCoordinator
# Discover remote agent
remote_card = AgentCard(
name="Research Specialist",
url="https://research-agent.yourdomain.com",
description="...",
)
# Build coordinator
coordinator = MultiAgentCoordinator()
coordinator.register_agent(
capability="research",
client=A2AClient(remote_card, timeout=60.0),
)
# Delegate tasks
result = await coordinator.delegate(
capability="research",
input_data={"query": "Latest quantum computing breakthroughs"},
)
# Delegate multiple tasks in parallel
results = await coordinator.delegate_parallel([
("research", {"query": "US AI policy 2024"}),
("research", {"query": "EU AI Act implementation"}),
("research", {"query": "China AI investment 2024"}),
])
📡 API Reference
Endpoints
| Method | Path | Description | Auth |
|---|---|---|---|
GET |
/v1/health |
Health check + provider status | None |
POST |
/v1/ask |
Submit query (batch response) | Optional |
POST |
/v1/ask/stream |
Submit query (SSE streaming) | Optional |
POST |
/v1/evaluate |
Evaluate answer quality | Optional |
GET |
/v1/models |
List available models | None |
GET |
/metrics |
Prometheus metrics | None |
Request & Response Schemas
POST /v1/ask
Request:
{
"query": "What are the latest breakthroughs in fusion energy?",
"agent_type": "auto",
"model": "gpt-4o",
"max_iterations": 8,
"stream": false
}
| Field | Type | Default | Description |
|---|---|---|---|
query |
string |
required | Question (1–2000 chars) |
agent_type |
enum |
"auto" |
auto react reflexion rewoo orchestrator |
model |
string |
env default | Override LLM model |
max_iterations |
int |
env default | Max reasoning steps (1–20) |
stream |
bool |
false |
Enable SSE streaming |
Response:
{
"request_id": "a3f8b2c1d4e5",
"query": "What are the latest breakthroughs in fusion energy?",
"answer": "## Fusion Energy Breakthroughs in 2024\n\nSeveral significant...",
"sources": [
{
"title": "NIF achieves fusion ignition milestone",
"url": "https://www.science.org/..."
}
],
"agent_type": "react",
"iterations": 3,
"tools_called": ["web_search", "scrape_webpage"],
"model": "gpt-4o",
"cached": false,
"metadata": {}
}
POST /v1/evaluate
Request:
{
"query": "What is the capital of France?",
"answer": "## Answer\nThe capital of France is Paris.\n## Sources\n- https://example.com",
"sources": [{"title": "Example", "url": "https://example.com"}],
"ground_truth": "Paris"
}
Response:
{
"scores": {
"factual_accuracy": 0.98,
"completeness": 0.85,
"clarity": 0.95,
"source_usage": 0.90,
"hallucination_risk": 0.97,
"citation_coverage": 1.0,
"length_score": 0.72,
"structure_score": 0.70,
"has_sources_section": 1.0
},
"feedback": {
"factual_accuracy": "Claim is correct and well-supported.",
"completeness": "Could include additional context about Paris.",
"clarity": "Clear and concise.",
"source_usage": "Source is cited correctly.",
"hallucination_risk": "No hallucination detected."
},
"overall_score": 0.91,
"passed": true
}
Streaming (SSE)
The /v1/ask/stream endpoint uses
Server-Sent Events.
Each event is a JSON object.
Event types:
# 1. Metadata (sent first — before any tokens)
data: {"type": "metadata", "request_id": "abc", "sources": [...],
"iterations": 3, "tools_called": ["web_search"]}
# 2. Token stream (one per token)
data: {"type": "token", "delta": "The ", "done": false}
data: {"type": "token", "delta": "answer ", "done": false}
data: {"type": "token", "delta": "is...", "done": false}
# 3. Completion signal
data: {"type": "done", "done": true}
# On error
data: {"type": "error", "error": "Search service unavailable"}
Client example (JavaScript):
const source = new EventSource('/v1/ask/stream');
const response = await fetch('/v1/ask/stream', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({query: 'What is quantum computing?'}),
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const {done, value} = await reader.read();
if (done) break;
const lines = decoder.decode(value).split('\n');
for (const line of lines) {
if (!line.startsWith('data: ')) continue;
const event = JSON.parse(line.slice(6));
if (event.type === 'token') process.stdout.write(event.delta);
if (event.type === 'done') break;
if (event.type === 'error') console.error(event.error);
}
}
Python client example:
import httpx
async with httpx.AsyncClient() as client:
async with client.stream(
"POST",
"http://localhost:8000/v1/ask/stream",
json={"query": "Latest AI news", "agent_type": "auto"},
) as resp:
async for line in resp.aiter_lines():
if not line.startswith("data: "):
continue
import json
event = json.loads(line[6:])
if event["type"] == "token":
print(event["delta"], end="", flush=True)
Authentication
The API uses optional Bearer token authentication.
Set API_KEY in your .env to enable it:
API_KEY=your-secret-key
Then include in requests:
curl -H "Authorization: Bearer your-secret-key" \
-X POST http://localhost:8000/v1/ask \
-d '{"query": "test"}'
If API_KEY is not set, all requests are allowed (development mode).
Configuration
All settings are loaded from environment variables via Pydantic Settings.
Full reference:
| Variable | Type | Default | Description |
|---|---|---|---|
OPENAI_API_KEY |
str |
— | OpenAI API key |
ANTHROPIC_API_KEY |
str |
— | Anthropic API key |
DEFAULT_LLM_PROVIDER |
openai|anthropic |
openai |
Primary LLM |
DEFAULT_MODEL |
str |
gpt-4o |
Default model name |
FALLBACK_MODEL |
str |
gpt-4o-mini |
Fallback on error |
TAVILY_API_KEY |
str |
— | Tavily search key |
SERPAPI_API_KEY |
str |
— | SerpAPI key (fallback) |
SEARCH_PROVIDER |
tavily|serpapi |
tavily |
Search backend |
REDIS_URL |
str |
redis://localhost:6379/0 |
Redis connection |
APP_ENV |
development|staging|production |
production |
Environment |
LOG_LEVEL |
str |
INFO |
Log verbosity |
MAX_AGENT_ITERATIONS |
int |
10 |
Max reasoning loops |
MAX_TOKENS_PER_RESPONSE |
int |
4096 |
Max response tokens |
CONTEXT_WINDOW_LIMIT |
int |
120000 |
Token window cap |
RATE_LIMIT_REQUESTS |
int |
100 |
Requests per window |
RATE_LIMIT_WINDOW |
int |
60 |
Window size (seconds) |
LLM_TIMEOUT |
float |
60.0 |
LLM request timeout |
SEARCH_TIMEOUT |
float |
15.0 |
Search timeout |
SCRAPE_TIMEOUT |
float |
20.0 |
Scrape timeout |
Switching to Anthropic:
DEFAULT_LLM_PROVIDER=anthropic
DEFAULT_MODEL=claude-3-5-sonnet-20241022
Using SerpAPI instead of Tavily:
SEARCH_PROVIDER=serpapi
SERPAPI_API_KEY=your-key
Evaluation
Answer Quality Metrics
Answers are evaluated on two levels:
Level 1 — Rule-based (instant, free):
| Metric | Description | Weight |
|---|---|---|
citation_coverage |
% of sources actually cited in answer | — |
length_score |
Penalizes too-short or too-long answers | — |
structure_score |
Presence of headers, lists, sections | — |
has_sources_section |
Answer ends with ## Sources | — |
Level 2 — LLM-as-Judge (1 LLM call):
| Metric | Description | Weight |
|---|---|---|
factual_accuracy |
Claims supported by sources | 30% |
completeness |
Fully addresses the question | 20% |
clarity |
Well-written and readable | 15% |
source_usage |
Citations correct and relevant | 15% |
hallucination_risk |
Grounded in evidence | 20% |
Pass threshold: Overall score ≥ 0.70
Running Benchmarks
Programmatic:
import asyncio
from evaluation.benchmarks import BenchmarkRunner, BenchmarkCase
from agents import AgentType
# Run built-in benchmark suite
runner = BenchmarkRunner(
agent_type=AgentType.REACT,
model="gpt-4o-mini", # use cheaper model for benchmarks
)
results = asyncio.run(runner.run_all(concurrency=2))
print(f"Pass rate: {results['pass_rate']:.0%}")
print(f"Avg score: {results['avg_score']:.3f}")
print(f"Avg latency: {results['avg_latency_s']:.1f}s")
print(f"By category: {results['category_scores']}")
Custom benchmark cases:
custom_cases = [
BenchmarkCase(
id="my_test_01",
query="What is the latest version of Python?",
expected_keywords=["3.12", "3.13", "python"],
category="factual",
),
BenchmarkCase(
id="my_test_02",
query="Explain the difference between RAG and fine-tuning",
ground_truth="RAG retrieves context at inference time; fine-tuning updates weights",
expected_keywords=["retrieval", "fine-tuning", "weights"],
category="research",
),
]
results = asyncio.run(runner.run_all(cases=custom_cases))
Expected benchmark output:
╭─────────────────────────────────────────────────────╮
│ Benchmark Results Summary │
├─────────────────────────────────────────────────────┤
│ Total cases: 5 │
│ Passed: 4 (80%) │
│ Avg score: 0.812 │
│ Avg latency: 4.3s │
├─────────────────────────────────────────────────────┤
│ By category: │
│ factual: 0.891 │
│ research: 0.823 │
│ calculation: 0.950 │
│ multi_faceted: 0.754 │
│ current_events: 0.742 │
╰─────────────────────────────────────────────────────╯
Observability
Structured Logging
In development, logs are human-readable rich text:
2024-01-15 10:23:41 [info ] react_agent.start agent=ReACT query=What is the capital of France?
2024-01-15 10:23:41 [info ] tool.execute.start tool=web_search call_id=call_a3f8b
2024-01-15 10:23:42 [info ] tool.execute.success tool=web_search elapsed_s=0.823
2024-01-15 10:23:43 [info ] agent.completed iterations=2 tools_called=1 elapsed_s=2.1
In production (APP_ENV=production), logs are JSON:
{"event": "react_agent.start", "agent": "ReACT", "query": "...", "timestamp": "..."}
Prometheus Metrics
Metrics are exposed at /metrics in Prometheus format:
curl http://localhost:8000/metrics
Key metrics to monitor:
| Metric | Type | Description |
|---|---|---|
http_requests_total |
Counter | Total requests by path + status |
http_request_duration_seconds |
Histogram | Request latency |
agent_iterations_total |
Counter | Reasoning iterations |
tool_calls_total |
Counter | Tool usage by name |
cache_hits_total |
Counter | Redis cache hit rate |
Grafana dashboard (import from configs/grafana-dashboard.json):
┌─────────────────────────────────────────────────────────────┐
│ Requests/min │ P95 Latency │ Cache Hit Rate │
│ 142 │ 3.2s │ 67% │
├─────────────────────────────────────────────────────────────┤
│ Agent Type Distribution │ Tool Usage │
│ react 64% │ web_search 78% │
│ orchestrator 21% │ scrape_webpage 18% │
│ reflexion 15% │ calculator 4% │
└─────────────────────────────────────────────────────────────┘
Testing
Run the Full Test Suite
# All tests
pytest tests/ -v
# With coverage report
pytest tests/ -v --cov=. --cov-report=term-missing --cov-report=html
# Specific test file
pytest tests/test_agents.py -v
# Specific test class
pytest tests/test_agents.py::TestReACTAgent -v
# Specific test
pytest tests/test_agents.py::TestReACTAgent::test_direct_answer_no_tools -v
# Run only fast tests (skip integration)
pytest tests/ -v -m "not integration"
Test Categories
| File | What it tests | Type |
|---|---|---|
test_tools.py |
Calculator, scraper, search, registry, executor | Unit |
test_agents.py |
ReACT, Reflexion, Orchestrator behavior + edge cases | Unit |
test_workflows.py |
Chaining, routing, voting, reflection, metrics | Unit |
test_api.py |
All endpoints, caching, middleware, error handling | Integration |
Coverage Requirements
# Enforce 80% minimum coverage
pytest --cov=. --cov-fail-under=80
Mocking Strategy
All tests mock the LLM client and HTTP calls — no real API keys needed:
# Example: testing an agent without real LLM calls
from unittest.mock import AsyncMock, MagicMock
from core.llm_client import LLMClient
llm = MagicMock(spec=LLMClient)
llm.complete = AsyncMock(return_value=Message(
role=Role.ASSISTANT,
content="Mocked answer",
))
agent = ReACTAgent(llm_client=llm, registry=registry, executor=executor)
state = await agent.run("test query")
assert state.final_answer == "Mocked answer"
Deployment
Production Checklist
Before deploying to production:
☐ Set APP_ENV=production
☐ Set strong API_KEY (if using auth)
☐ Configure Redis with persistence (appendonly yes)
☐ Set MAX_AGENT_ITERATIONS to a safe limit (8-10)
☐ Configure RATE_LIMIT_REQUESTS appropriately
☐ Disable Swagger docs (automatic in production)
☐ Set up log aggregation (ELK, Datadog, etc.)
☐ Configure Prometheus + Grafana dashboards
☐ Set up health check monitoring
☐ Test /v1/health endpoint returns "ok"
Docker Production Deploy
# Build production image
docker build --target runtime -t ask-web-agent:v1.0.0 .
# Run with environment file
docker run -d \
--name ask-web-agent \
--env-file .env.production \
-p 8000:8000 \
--restart unless-stopped \
--memory 2g \
--cpus 2 \
ask-web-agent:v1.0.0
Gunicorn (Multi-Worker)
For high-throughput production use:
gunicorn api.main:app \
--worker-class uvicorn.workers.UvicornWorker \
--workers 4 \
--bind 0.0.0.0:8000 \
--timeout 120 \
--keep-alive 5 \
--log-level info
Worker count rule of thumb:
2 × CPU cores + 1
For async workloads like this, 2–4 workers is usually optimal.
Kubernetes (Helm-style manifest)
apiVersion: apps/v1
kind: Deployment
metadata:
name: ask-web-agent
spec:
replicas: 3
selector:
matchLabels:
app: ask-web-agent
template:
metadata:
labels:
app: ask-web-agent
spec:
containers:
- name: agent
image: yourregistry/ask-web-agent:v1.0.0
ports:
- containerPort: 8000
envFrom:
- secretRef:
name: ask-web-agent-secrets
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2000m"
livenessProbe:
httpGet:
path: /v1/health
port: 8000
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /v1/health
port: 8000
initialDelaySeconds: 5
periodSeconds: 10
Roadmap
v1.1 — Near-term
- Persistent memory — Cross-session conversation history via Redis
- Image understanding — Multimodal support for vision queries
- PDF/document ingestion — Upload and query documents directly
- Webhook callbacks — POST result to URL when async job completes
v1.2 — Medium-term
- Fine-tuning pipeline — Use benchmark results to fine-tune smaller models
- Vector store integration — RAG over your own knowledge base
- Agent marketplace — Plug-and-play specialist agents via A2A
- Cost tracking — Per-request token cost logging and budgets
v2.0 — Long-term
- Self-improving agents — Agents that update their own system prompts
- Multi-modal tools — Image search, chart reading, video transcription
- Federated agents — Cross-organization A2A agent networks
- On-device models — Local Ollama/llama.cpp backend support
Contributing
Contributions are welcome! Please read this section before submitting.
Development Setup
git clone https://github.com/yourorg/ask-the-web-agent.git
cd ask-the-web-agent
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pre-commit install
Pre-commit Hooks
# Runs automatically on git commit:
# - ruff (linting + formatting)
# - mypy (type checking)
# - pytest (fast unit tests only)
pre-commit run --all-files
Pull Request Guidelines
Fork the repository and create a feature branch
git checkout -b feature/my-new-agentWrite tests first — all new features need test coverage ≥ 80%
Follow the patterns — new agents extend
BaseAgent,
new tools extendBaseToolType everything — all functions must have complete type annotations
Update docs — add your feature to this README
Pass CI:
ruff check . mypy . pytest tests/ --cov=. --cov-fail-under=80Open a PR with:
- Clear description of what and why
- Example input/output
- Performance impact (latency, token cost)
Adding a New Agent
# 1. Create agents/my_agent.py
class MyAgent(BaseAgent):
async def run(self, query: str, **kwargs: Any) -> AgentState:
... # implement your strategy
# 2. Register in agents/__init__.py
class AgentType(str, Enum):
MY_AGENT = "my_agent" # add this
agent_map[AgentType.MY_AGENT] = MyAgent # add this
# 3. Add tests in tests/test_agents.py
class TestMyAgent:
async def test_basic_run(self) -> None: ...
# 4. Document in README under "Agent Types"
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
- Website: Adil Shamim
- GitHub: Adil Shamim
- Create an issue in this repository for questions or suggestions
⭐ If you find this repository helpful, please consider giving it a star! ⭐
Reviews (0)
Sign in to leave a review.
Leave a reviewNo results found