docpull
Crawl any website and convert it to clean, AI-ready Markdown — async Python CLI with MCP support, crawl profiles, caching, and RAG-optimized output
Pull documentation from any website and convert it to clean, AI-ready Markdown.
Install
pip install docpull
Usage
# Basic fetch
docpull https://docs.example.com
# With options
docpull https://aptos.dev --max-pages 100 --output-dir ./docs
# Filter paths
docpull https://docs.example.com --include-paths "/api/*" --exclude-paths "/changelog/*"
# Enable caching for incremental updates
docpull https://docs.example.com --cache
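The path filters above can be pictured as glob matching against the URL path. The following is a minimal sketch, assuming shell-style glob semantics and assuming that exclusions take precedence over inclusions; `should_crawl` is a hypothetical helper for illustration, not part of docpull, and the tool's actual matching rules may differ.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_crawl(url, include_paths=(), exclude_paths=()):
    """Return True if the URL path passes the include/exclude glob filters.

    Hypothetical helper: illustrates one plausible reading of
    --include-paths / --exclude-paths, not docpull's own code.
    """
    path = urlparse(url).path or "/"
    # Assumption: a path matching any exclude pattern is always skipped.
    if any(fnmatchcase(path, pattern) for pattern in exclude_paths):
        return False
    # With no include patterns, everything not excluded is allowed.
    if not include_paths:
        return True
    return any(fnmatchcase(path, pattern) for pattern in include_paths)


if __name__ == "__main__":
    for url in (
        "https://docs.example.com/api/users",
        "https://docs.example.com/changelog/v2",
        "https://docs.example.com/guide",
    ):
        print(url, should_crawl(url, ["/api/*"], ["/changelog/*"]))
```

Use `docpull URL --dry-run` to see which pages the real filters select before fetching anything.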
Profiles
docpull https://site.com --profile rag # Optimized for RAG/LLM (default)
docpull https://site.com --profile mirror # Full site archive with caching
docpull https://site.com --profile quick # Fast sampling (50 pages, depth 2)
Options
Crawl:
  --max-pages N          Maximum pages to fetch
  --max-depth N          Maximum crawl depth
  --include-paths P      Only crawl matching URL patterns
  --exclude-paths P      Skip matching URL patterns

Cache:
  --cache                Enable caching for incremental updates
  --cache-dir DIR        Cache directory (default: .docpull-cache)
  --cache-ttl DAYS       Days before cache expires (default: 30)

Content:
  --streaming-dedup      Real-time duplicate detection
  --language CODE        Filter by language (e.g., en)

Output:
  --output-dir, -o DIR   Output directory (default: ./docs)
  --dry-run              Show what would be fetched
  --verbose, -v          Verbose output

See docpull --help for all options.
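To make the `--cache-ttl` option concrete, here is a minimal sketch of a time-to-live check. It assumes each cache entry records the time it was fetched and that an entry counts as expired once it is at least `ttl_days` old; `is_cache_fresh` is a hypothetical name, and docpull's real cache format and expiry rules are not shown in this README.

```python
from datetime import datetime, timedelta, timezone


def is_cache_fresh(fetched_at, ttl_days=30, now=None):
    """Return True while a cached page is younger than ttl_days.

    Hypothetical helper: the default of 30 mirrors the documented
    --cache-ttl default, but the logic is an assumption.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - fetched_at < timedelta(days=ttl_days)


if __name__ == "__main__":
    now = datetime(2024, 3, 1, tzinfo=timezone.utc)
    recent = now - timedelta(days=5)
    stale = now - timedelta(days=45)
    print(is_cache_fresh(recent, now=now))  # fetched 5 days ago
    print(is_cache_fresh(stale, now=now))   # fetched 45 days ago
```

Under this reading, a stale entry would be fetched again on the next run while fresh entries are reused.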
Python API
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main():
    config = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.RAG,
        crawl={"max_pages": 100},
        cache={"enabled": True},
    )

    async with Fetcher(config) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")

        print(f"Done: {fetcher.stats.pages_fetched} pages")


asyncio.run(main())
Output
Each page becomes a Markdown file with YAML frontmatter:
---
title: "Getting Started"
source: https://docs.example.com/guide
---
# Getting Started
...
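Downstream tools usually need to separate the frontmatter from the page body. The sketch below does that with the standard library only, assuming the frontmatter is a flat block of `key: value` lines as in the example above; `split_frontmatter` is a hypothetical helper, and a real YAML parser is the safer choice if the frontmatter ever contains nested or multi-line values.

```python
def split_frontmatter(text):
    """Split a Markdown document into (metadata dict, body).

    Hypothetical helper: handles only flat `key: value` frontmatter,
    not full YAML.
    """
    lines = text.splitlines()
    # No opening delimiter: treat the whole document as body.
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:])
    return meta, body


if __name__ == "__main__":
    page = (
        "---\n"
        'title: "Getting Started"\n'
        "source: https://docs.example.com/guide\n"
        "---\n"
        "# Getting Started\n"
    )
    meta, body = split_frontmatter(page)
    print(meta["title"], meta["source"])
    print(body)
```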
Security
- HTTPS-only, mandatory robots.txt compliance
- Blocks private/internal network IPs
- Path traversal and XXE protection
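The first two guarantees can be illustrated with the standard library. This is a minimal sketch, assuming the check is applied to every address a hostname resolves to before any request is made; `is_ip_allowed` and `is_url_allowed` are hypothetical names, not docpull's API, and the tool's own checks may be stricter or structured differently.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_ip_allowed(address):
    """Reject private, loopback, link-local and other non-public addresses."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def is_url_allowed(url, resolve=socket.getaddrinfo):
    """Allow only HTTPS URLs whose host resolves to public addresses.

    `resolve` is injectable so the check can be exercised without DNS.
    """
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    # Every resolved address must be public, not just the first one.
    return bool(infos) and all(is_ip_allowed(info[4][0]) for info in infos)


if __name__ == "__main__":
    for address in ("10.0.0.1", "127.0.0.1", "169.254.169.254", "8.8.8.8"):
        print(address, is_ip_allowed(address))
```

Checking all resolved addresses, rather than the hostname string, is what stops a public-looking name from pointing at an internal service.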
Troubleshooting
docpull --doctor # Check installation
docpull URL --verbose # Verbose output
docpull URL --dry-run # Test without downloading
License
MIT