website2markdown
Convert any URL to clean Markdown. Cloudflare Worker with 14 site adapters, MCP Server, Agent Skills, llms.txt. Open source, Apache-2.0.
URL to Markdown Converter
Convert any web page to clean Markdown — JS-heavy SPAs, paywalled content, Chinese platforms (WeChat, Zhihu, Feishu), and more. Powered by Cloudflare Workers with a 5-layer fallback pipeline and 14 site adapters.
Quick Start
# Convert any URL to Markdown (try it now!)
curl -H "Accept: text/markdown" https://md.genedai.me/https://example.com
# WeChat article
curl -H "Accept: text/markdown" "https://md.genedai.me/https://mp.weixin.qq.com/s/YOUR_ARTICLE_ID"
# JSON output with metadata
curl "https://md.genedai.me/https://example.com?format=json&raw=true"
Or just open in your browser: md.genedai.me/https://example.com
Need browser-rendered pages (WeChat, Feishu, JS-heavy SPAs) or higher limits?
Get a free API key at md.genedai.me/portal/.
How It Works
https://md.genedai.me/<target-url>
Conversion Flow
Request
│
▼
Fetch target with Accept: text/markdown
│
├─ Response is text/markdown? ──▶ Path 1: Native Markdown
│
└─ Response is text/html?
│
├─ Anti-bot / JS-required detected? ──▶ Path 3: Browser Rendering → Readability + Turndown
│
└─ Normal HTML ──▶ Path 2: Readability + Turndown
| Path | When | How | X-Markdown-Method |
|---|---|---|---|
| Native | Target site supports Markdown for Agents | Cloudflare edge converts via Accept: text/markdown content negotiation | native |
| Fallback | Normal HTML pages | Readability extracts main content → Turndown converts to Markdown | readability+turndown |
| Browser | Anti-bot pages, JS-rendered content | Headless Chrome renders the page → Readability + Turndown | browser+readability+turndown |
| Firecrawl | Explicit engine=firecrawl, non-text documents, or thin local extraction | Convert via Firecrawl v2 scrape; sends no Authorization header by default so the keyless tier is used when the upstream accepts it | firecrawl |
| Jina | Explicit engine=jina or last-resort fallback | Convert via Jina Reader API while preserving the same output/query surface | jina |
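The branching in the conversion flow can be sketched as a small decision function. This is a minimal illustration, not the worker's actual code: the function name `choose_method` and the `needs_browser` flag are assumptions, and only the three paths shown in the flow diagram are modelled.

```python
from typing import Optional


def choose_method(content_type: str, needs_browser: bool = False) -> Optional[str]:
    """Map a fetched response to the X-Markdown-Method label from the table.

    needs_browser stands in for the anti-bot / JS-required detection step,
    whose real heuristics are not described in this document.
    """
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "text/markdown":
        return "native"  # Path 1: the target already serves Markdown
    if media_type == "text/html":
        if needs_browser:
            return "browser+readability+turndown"  # Path 3
        return "readability+turndown"  # Path 2
    # Anything else (for example a PDF) is outside these three paths;
    # the table lists Firecrawl and Jina for such cases.
    return None
```

For example, `choose_method("text/html", needs_browser=True)` yields the browser path label, while a non-HTML content type returns `None` so the caller can hand off to an external engine.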
API Usage
Browser (URL bar)
# Full URL
https://md.genedai.me/https://example.com/page
# Bare domain (auto-prepends https://)
https://md.genedai.me/example.com/page
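The URL scheme above is simple enough to build programmatically. The helper below is a sketch under stated assumptions: `build_convert_url` is a hypothetical name, the scheme is prepended client-side (the service does this itself for bare domains), and the target URL is assumed to carry no query string of its own.

```python
from urllib.parse import urlencode

BASE_URL = "https://md.genedai.me"


def build_convert_url(target: str, base: str = BASE_URL, **params: str) -> str:
    """Return <base>/<target>?<params> for the conversion endpoint."""
    if "://" not in target:
        # Mirror the documented bare-domain behaviour on the client side.
        target = "https://" + target
    url = f"{base.rstrip('/')}/{target}"
    if params:
        # urlencode percent-encodes values, so "#main" becomes "%23main".
        url += "?" + urlencode(params)
    return url
```

Calling `build_convert_url("example.com/page", raw="true")` produces the same URL as the raw Markdown curl examples in this README.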
Raw Markdown API
# Get raw Markdown via query param
curl "https://md.genedai.me/https://example.com/page?raw=true"
# Get raw Markdown via Accept header
curl https://md.genedai.me/https://example.com/page \
-H "Accept: text/markdown"
API Keys and Tiers
Sign up at md.genedai.me/portal/ with
your email to get an API key. No password; a sign-in link is emailed to you.
| Tier | Credits/month | Browser rendering | Proxy / Engine selection |
|---|---|---|---|
| Anonymous (no key) | — | ❌ no browser rendering | ✅ keyless engine=jina / engine=firecrawl |
| Free | 1,000 | ✅ | ✅ keyless engine=jina / engine=firecrawl |
| Pro | 50,000 | ✅ | ✅ all engines + no_cache= + force_browser= |
Credit cost is fixed per request type, not per actual conversion path
(so bills are predictable even if a site silently switches from static to
browser rendering behind the scenes):
| Endpoint | Credits |
|---|---|
| GET /<url> | 1 |
| GET /api/stream | 1 |
| POST /api/batch (per URL) | 1 |
| POST /api/extract | 3 |
| POST /api/deepcrawl (per URL) | 2 |
Cache hits on a paying tier still consume 1 credit; when your quota is
exhausted the API keeps serving cached URLs (with X-Quota-Exceeded: true)
but rejects cache-miss requests with 429.
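Because costs are fixed per request type, a monthly budget can be estimated before sending anything. The sketch below copies the credit values from the table; the dictionary keys and function names are illustrative assumptions, not part of the API.

```python
from typing import Mapping

# Credit cost per request type, taken from the table above.
CREDIT_COST = {
    "convert": 1,        # GET /<url>
    "stream": 1,         # GET /api/stream
    "batch_url": 1,      # POST /api/batch, per URL
    "extract": 3,        # POST /api/extract
    "deepcrawl_url": 2,  # POST /api/deepcrawl, per URL
}


def estimate_credits(usage: Mapping[str, int]) -> int:
    """Sum the credits for a planned mix of requests."""
    return sum(CREDIT_COST[kind] * count for kind, count in usage.items())


def fits_quota(usage: Mapping[str, int], monthly_quota: int) -> bool:
    """True when the planned usage stays within the tier's monthly credits."""
    return estimate_credits(usage) <= monthly_quota
```

A plan of 100 conversions, 10 extractions, and a 20-page deep crawl comes to 100 + 30 + 40 = 170 credits, well inside the Free tier's 1,000.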
Using your key
# Bearer header (recommended)
curl "https://md.genedai.me/https://example.com/page?raw=true" \
-H "Authorization: Bearer mk_..."
# The old ?token= query-parameter form is supported for legacy
# PUBLIC_API_TOKEN deployments, but NOT for mk_ keys. Never put a real
# API key in a query string — logs, referrers, and monitoring capture it.
Every authenticated response includes per-key rate limit headers:
X-RateLimit-Limit: 50000
X-RateLimit-Remaining: 49993
X-Request-Cost: 1
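A client can read these headers to decide when to slow down. The following is a minimal sketch: `CreditStatus` and `parse_credit_headers` are hypothetical names, and the mapping passed in is assumed to use the exact header casing shown above (HTTP libraries such as requests expose a case-insensitive mapping, a plain dict does not).

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class CreditStatus:
    limit: int      # X-RateLimit-Limit
    remaining: int  # X-RateLimit-Remaining
    cost: int       # X-Request-Cost

    @property
    def requests_left(self) -> int:
        """How many more requests of the same type the remaining credits cover."""
        return self.remaining // self.cost if self.cost else self.remaining


def parse_credit_headers(headers: Mapping[str, str]) -> CreditStatus:
    """Build a CreditStatus from the per-key headers of an authenticated response."""
    return CreditStatus(
        limit=int(headers["X-RateLimit-Limit"]),
        remaining=int(headers["X-RateLimit-Remaining"]),
        cost=int(headers["X-Request-Cost"]),
    )
```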
Portal API (session cookie)
Once signed in at /portal/, these endpoints are available under the same
session cookie:
| Endpoint | Method | Description |
|---|---|---|
| /api/me | GET | Current account (email, tier, account_id) |
| /api/keys | GET | List your keys (prefix only, never plaintext) |
| /api/keys | POST | Create a new key; plaintext returned once |
| /api/keys/:id | DELETE | Revoke a key (takes effect within 60s because of the LRU cache) |
| /api/usage | GET | Usage breakdown (tier, quota, used, daily history) |
| /api/auth/logout | POST | Destroy session, clear cookie |
/api/usage also accepts an Authorization: Bearer mk_... header so SDK
and CLI tools can poll usage without a session.
Output Formats
# Markdown (default)
curl "https://md.genedai.me/https://example.com?format=markdown&raw=true"
# Clean HTML
curl "https://md.genedai.me/https://example.com?format=html&raw=true"
# Plain text (no formatting)
curl "https://md.genedai.me/https://example.com?format=text&raw=true"
# JSON (structured: url, title, markdown, method, timestamp)
curl "https://md.genedai.me/https://example.com?format=json&raw=true"
CSS Selector Extraction
Extract specific page elements instead of the full article:
# Extract only the article body
curl "https://md.genedai.me/https://example.com?selector=.article-body&raw=true"
# Extract a specific section
curl "https://md.genedai.me/https://example.com?selector=%23main-content&raw=true"
The selector maximum length is 256 characters.
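Selectors containing characters such as # must be percent-encoded, as the second curl example shows. A small client-side check, sketched here with the hypothetical helper name `selector_param`, can enforce the 256-character limit and handle the encoding before a request is sent.

```python
from urllib.parse import quote

MAX_SELECTOR_LENGTH = 256  # documented limit for the selector parameter


def selector_param(selector: str) -> str:
    """Return a ready-to-append 'selector=...' query fragment."""
    if not selector:
        raise ValueError("selector must not be empty")
    if len(selector) > MAX_SELECTOR_LENGTH:
        raise ValueError(f"selector exceeds {MAX_SELECTOR_LENGTH} characters")
    # safe="" forces encoding of every reserved character, including '#'.
    return "selector=" + quote(selector, safe="")
```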
Force Browser Rendering
curl "https://md.genedai.me/https://example.com/js-heavy-page?raw=true&force_browser=true"
Jina Reader Engine
Use engine=jina to convert via r.jina.ai instead of the built-in pipeline. This is useful for JS-heavy pages when browser rendering is unavailable. Keyless/free tier: 20 RPM per IP without an API key; higher limits are available with a Jina key.
curl "https://md.genedai.me/https://example.com?raw=true&engine=jina"
Jina is also used automatically as a last-resort fallback when Readability extraction produces very little content and no browser/proxy path was used.
Firecrawl Keyless Fallback
Use engine=firecrawl to convert via Firecrawl v2 scrape. If FIRECRAWL_API_KEY
is not configured, the worker sends no Authorization header so Firecrawl can
use its keyless free tier when the upstream accepts the request. Keyless can
still return 403 or 429; automatic fallbacks treat that as non-fatal and
continue to Jina.
curl "https://md.genedai.me/https://example.com?raw=true&engine=firecrawl"
engine=jina and engine=firecrawl are intentionally available without a
Pro key because both upstreams provide keyless/free reader paths. Account-backed
engines such as engine=cf still require Pro. Firecrawl is also tried before
Jina when local extraction is too thin or the target is a non-text document
such as a PDF/Word/Excel URL.
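The "Firecrawl first, then Jina, with 403/429 treated as non-fatal" behaviour described above can be sketched as a generic fallback chain. Everything named here (`EngineError`, `convert_with_fallback`, the callables) is an illustrative assumption; the worker's real implementation is not shown in this README.

```python
from typing import Callable, Optional, Sequence, Tuple

NON_FATAL_STATUSES = {403, 429}  # keyless upstreams may refuse or throttle


class EngineError(Exception):
    """Raised by an engine callable when the upstream returns an error status."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


Engine = Tuple[str, Callable[[str], str]]


def convert_with_fallback(url: str, engines: Sequence[Engine]) -> Tuple[str, str]:
    """Try each engine in order; return (engine_name, markdown) from the first success."""
    last_error: Optional[EngineError] = None
    for name, convert in engines:
        try:
            return name, convert(url)
        except EngineError as err:
            if err.status not in NON_FATAL_STATUSES:
                raise  # unexpected failures are not swallowed
            last_error = err  # remember it and continue to the next engine
    if last_error is None:
        raise ValueError("no engines configured")
    raise last_error
```

Passing `[("firecrawl", firecrawl_fn), ("jina", jina_fn)]` reproduces the documented order: a 429 from the first callable moves on to the second, while a 500 propagates immediately.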
Cache Control
Results are cached in KV for fast repeat access. To bypass cache:
curl "https://md.genedai.me/https://example.com?raw=true&no_cache=true"
Batch Conversion
Convert multiple URLs in a single request:
curl -X POST https://md.genedai.me/api/batch \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com/page1",
{
"url": "https://example.com/page2",
"format": "text",
"selector": "article",
"force_browser": false,
"no_cache": true
}
]
}'
urls supports:
- String item: "https://example.com/a" (defaults to markdown)
- Object item: { "url": "...", "format?": "markdown|html|text|json", "selector?": "...", "force_browser?": boolean, "no_cache?": boolean, "engine?": "jina" }
Response:
{
"results": [
{
"url": "...",
"format": "markdown",
"content": "...",
"markdown": "...",
"title": "...",
"method": "...",
"cached": false,
"fallbacks": ["jsonld"]
},
{
"url": "...",
"format": "text",
"content": "...",
"title": "...",
"method": "...",
"cached": true
}
]
}
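Validating a batch before sending it avoids spending a round trip on a malformed request. The sketch below checks the item shapes listed above and the 10-URL limit stated in the endpoint table; `build_batch_payload` is a hypothetical helper, and the server may apply further checks not modelled here.

```python
from typing import Any, Dict, List, Union

MAX_BATCH_URLS = 10
ALLOWED_ITEM_KEYS = {"url", "format", "selector", "force_browser", "no_cache", "engine"}
ALLOWED_FORMATS = {"markdown", "html", "text", "json"}

BatchItem = Union[str, Dict[str, Any]]


def build_batch_payload(items: List[BatchItem]) -> Dict[str, List[BatchItem]]:
    """Return the JSON body for POST /api/batch after basic client-side checks."""
    if not 1 <= len(items) <= MAX_BATCH_URLS:
        raise ValueError(f"a batch needs 1 to {MAX_BATCH_URLS} URLs")
    for item in items:
        if isinstance(item, str):
            continue  # plain URL, converted to markdown by default
        if "url" not in item:
            raise ValueError("object items need a 'url' key")
        unknown = set(item) - ALLOWED_ITEM_KEYS
        if unknown:
            raise ValueError(f"unsupported keys: {sorted(unknown)}")
        if item.get("format", "markdown") not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {item['format']}")
    return {"urls": items}
```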
Structured Extraction API
Extract structured fields from URL or raw HTML.
curl -X POST https://md.genedai.me/api/extract \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json" \
-d '{
"strategy": "css",
"url": "https://example.com/article",
"schema": {
"fields": [
{ "name": "title", "selector": "h1", "type": "text", "required": true },
{ "name": "author", "selector": ".author", "type": "text" }
]
},
"include_markdown": true
}'
Batch extraction (items) is also supported (max 10 items).
Additional extraction capabilities:
- Use either top-level url/html or nested input.url/input.html.
- schema.fields[*].required fails extraction when a required field is missing.
- options supports dedupe, includeEmpty, and regexFlags.
- include_markdown: true attaches converted markdown alongside extracted data.
Job API (create / query / stream / run)
Submit crawl/extract tasks as queued jobs, then run and monitor. Jobs are persisted as queued records in KV; execution begins when you call /run:
# 1) Create job
curl -X POST https://md.genedai.me/api/jobs \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json" \
-H "Idempotency-Key: demo-job-1" \
-d '{
"type": "crawl",
"tasks": [
"https://example.com/a",
"https://example.com/b"
],
"priority": 10,
"maxRetries": 2
}'
# 2) Query status
curl -H "Authorization: Bearer <api-token>" \
https://md.genedai.me/api/jobs/<job-id>
# 3) Watch status stream (SSE)
curl -N -H "Authorization: Bearer <api-token>" \
https://md.genedai.me/api/jobs/<job-id>/stream
# 4) Execute queued tasks
curl -X POST -H "Authorization: Bearer <api-token>" \
https://md.genedai.me/api/jobs/<job-id>/run
Job API notes:
- Supports both type: "crawl" and type: "extract".
- type: "crawl" accepts string URLs or object tasks with format, selector, force_browser, and no_cache.
- type: "extract" reuses the same task shape as /api/extract.
- Idempotency-Key is keyed by both the header value and request payload: same key + same payload returns the existing job; same key + different payload returns 409 Conflict.
- priority is normalized to 1..100 (default 10), maxRetries to 0..10 (default 2).
- Up to 100 tasks are allowed per job.
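The normalization and idempotency rules in these notes can be modelled in a few lines. This is an in-memory sketch, not the KV-backed dispatcher: it assumes "normalized" means clamping out-of-range values, it fingerprints the payload with SHA-256 over sorted JSON, and the 201/200 status codes are placeholders (only 409 Conflict comes from the notes).

```python
import hashlib
import json
from typing import Any, Dict, Optional, Tuple


def clamp(value: Optional[int], low: int, high: int, default: int) -> int:
    """Apply the default when missing, otherwise clamp into [low, high]."""
    if value is None:
        return default
    return max(low, min(high, int(value)))


def normalize_job(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Enforce the task limit and normalize priority / maxRetries."""
    tasks = payload.get("tasks", [])
    if not 1 <= len(tasks) <= 100:
        raise ValueError("a job needs 1 to 100 tasks")
    return {
        **payload,
        "priority": clamp(payload.get("priority"), 1, 100, 10),
        "maxRetries": clamp(payload.get("maxRetries"), 0, 10, 2),
    }


class JobStore:
    """Toy stand-in for job persistence keyed by Idempotency-Key."""

    def __init__(self) -> None:
        self._by_key: Dict[str, Tuple[str, str]] = {}

    def create(self, idempotency_key: str, payload: Dict[str, Any]) -> Tuple[int, str]:
        digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
        existing = self._by_key.get(idempotency_key)
        if existing is None:
            job_id = f"job-{len(self._by_key) + 1}"
            self._by_key[idempotency_key] = (digest, job_id)
            return 201, job_id  # assumed status for a newly created job
        stored_digest, job_id = existing
        if stored_digest == digest:
            return 200, job_id  # same key + same payload: existing job
        return 409, job_id      # same key + different payload: conflict
```

Retrying a create call with the same key and body is therefore safe, while reusing a key for a different body is rejected rather than silently creating a second job.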
Deep Crawl API
Run BFS/BestFirst deep crawl with filters/scoring and opt-in checkpoint resume.
# non-stream
curl -X POST https://md.genedai.me/api/deepcrawl \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json" \
-d '{
"seed": "https://example.com/docs",
"max_depth": 2,
"max_pages": 20,
"strategy": "best_first",
"filters": {
"allow_domains": ["example.com"],
"url_patterns": ["https://example.com/docs/*"]
},
"scorer": {
"keywords": ["api", "reference"],
"weight": 2
},
"checkpoint": {
"crawl_id": "docs-crawl-001",
"snapshot_interval": 5
}
}'
# stream mode (SSE: start/node/done/fail)
curl -N -X POST https://md.genedai.me/api/deepcrawl \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json" \
-d '{
"seed": "https://example.com/docs",
"stream": true
}'
Deep crawl request supports:
- include_external to traverse off-domain links.
- filters.url_patterns, filters.allow_domains, filters.block_domains, filters.content_types.
- scorer.keywords, scorer.weight, scorer.score_threshold.
- output.include_markdown to attach per-page markdown.
- fetch.selector, fetch.force_browser, fetch.no_cache to control page conversion.
- checkpoint.crawl_id, checkpoint.resume, checkpoint.snapshot_interval, checkpoint.ttl_seconds.
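To illustrate how the two strategies differ, the sketch below orders pages from an in-memory link graph. It is a simplified model, not the crawler in src/deepcrawl: links come from a dictionary instead of fetched pages, and the scorer is assumed to count keyword occurrences in the URL multiplied by weight, which may not match the real scoring.

```python
import heapq
from typing import Dict, List, Sequence

LinkGraph = Dict[str, List[str]]


def keyword_score(url: str, keywords: Sequence[str], weight: float = 1.0) -> float:
    """Assumed scorer: weight times the number of keywords found in the URL."""
    return weight * sum(1 for kw in keywords if kw in url.lower())


def crawl_order(graph: LinkGraph, seed: str, strategy: str = "bfs",
                keywords: Sequence[str] = (), weight: float = 1.0,
                max_depth: int = 2, max_pages: int = 20) -> List[str]:
    """Return the visit order for 'bfs' or 'best_first' traversal."""
    counter = 0  # tie-breaker that keeps discovery order stable
    frontier = [(0.0, counter, seed, 0)]
    seen = {seed}
    order: List[str] = []
    while frontier and len(order) < max_pages:
        _, _, url, depth = heapq.heappop(frontier)
        order.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in graph.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            counter += 1
            if strategy == "best_first":
                priority = -keyword_score(link, keywords, weight)  # higher score first
            else:
                priority = float(depth + 1)  # shallower pages first
            heapq.heappush(frontier, (priority, counter, link, depth + 1))
    return order
```

With keywords such as "api" and "reference", best-first follows the API branch to its leaves before visiting lower-scoring pages, whereas BFS finishes each depth level first.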
Supported Sites
Special adapters for optimal extraction on these platforms:
| Site | Features |
|---|---|
| WeChat (mp.weixin.qq.com) | MicroMessenger UA, image proxy for hotlink bypass |
| Feishu/Lark Docs (document surfaces such as /wiki, /docx, /docs on .feishu.cn / .larksuite.com) | Virtual scroll handling, R2 image storage, UI noise removal |
| Zhihu (zhihu.com/p/) | Login wall removal, lazy image swap, hybrid proxy bypass |
| Yuque (yuque.com) | SPA rendering, sidebar/toc removal |
| Notion (notion.site, notion.so) | SPA rendering, lazy scroll loading |
| Juejin (juejin.cn/post/) | Login popup removal, code block expansion |
| Twitter/X (twitter.com, x.com) | Stealth rendering, login wall bypass |
| Reddit (reddit.com) | URL transform to old.reddit.com, content extraction |
| CSDN (csdn.net) | Login popup removal, code block expansion |
| 36Kr (36kr.com) | Stealth rendering, content extraction |
| Toutiao (toutiao.com) | Stealth rendering, content extraction |
| NetEase (163.com) | Content extraction |
| Weibo (weibo.com) | Stealth rendering, hybrid proxy bypass |
| All other sites | Generic mobile UA, lazy image handling |
JavaScript / TypeScript
const res = await fetch(
"https://md.genedai.me/https://example.com/page?raw=true"
);
const markdown = await res.text();
console.log(res.headers.get("X-Markdown-Method"));
console.log(res.headers.get("X-Cache-Status")); // "HIT" or "MISS"
Python
import requests
url = "https://md.genedai.me/https://example.com/page"
resp = requests.get(url, params={"raw": "true", "format": "json"})
data = resp.json()
print(data["title"], data["method"])
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| / | GET | Landing page with URL input form |
| /<url> | GET | Convert URL and render Markdown as HTML page |
| /<url>?raw=true | GET | Return raw Markdown as plain text |
| /<url>?format=json | GET | Return structured JSON (url, title, markdown, method) |
| /<url>?format=html | GET | Return HTML output for preview/basic rendering |
| /<url>?format=text | GET | Return plain text (no formatting) |
| /<url>?selector=.class | GET | Extract specific CSS selector |
| /<url>?force_browser=true | GET | Force browser rendering |
| /<url>?engine=jina | GET | Convert via Jina Reader API using the same output formats |
| /<url>?engine=firecrawl | GET | Convert via Firecrawl scrape using keyless mode when no key is configured |
| /<url>?no_cache=true | GET | Bypass KV cache |
| /api/stream?url=<encoded-url> | GET | SSE conversion stream (step, done, fail) with selector / force_browser / no_cache / engine / token support |
| /api/batch | POST | Batch convert multiple URLs (max 10) |
| /api/extract | POST | Structured extraction API (css / xpath / regex) |
| /api/jobs | POST | Create queued crawl/extract job record |
| /api/jobs/:id | GET | Query job status |
| /api/jobs/:id/stream | GET | SSE job status stream |
| /api/jobs/:id/run | POST | Execute queued/failed tasks in job |
| /api/deepcrawl | POST | Deep crawl API (BFS/BestFirst, stream/non-stream, checkpoint) |
| /api/og | GET | Dynamic Open Graph image for landing/rendered pages |
| /img/<encoded-url> | GET | Image proxy (bypasses hotlink protection) |
| /r2img/<key> | GET | Serve image from R2 storage |
| /api/health | GET | Health + runtime + operational metrics |
Authentication Matrix
The hosted instance at md.genedai.me uses D1-backed API keys with tiers
(see API Keys and Tiers). Self-hosted deployments
can skip the AUTH_DB binding and fall back to the legacy API_TOKEN / PUBLIC_API_TOKEN secrets.
| Route Group | Anonymous | Free tier (mk_…) | Pro tier (mk_…) |
|---|---|---|---|
| GET /<url> | ✅ cache + readability + keyless engine=jina/firecrawl | ✅ full pipeline + keyless engine=jina/firecrawl | ✅ + all engines, no_cache, force_browser |
| GET /api/stream | ✅ cache + readability + keyless engine=jina/firecrawl | ✅ full pipeline + keyless engine=jina/firecrawl | ✅ full + params |
| POST /api/batch | ❌ 401 | ✅ | ✅ |
| POST /api/extract | ❌ 401 | ✅ | ✅ |
| POST /api/deepcrawl | ❌ 401 | ✅ | ✅ |
| POST /api/jobs* | ❌ 401 | ✅ | ✅ |
| GET /api/me, /api/keys, /api/usage | — | session cookie | session cookie or Bearer key |
| POST /api/auth/magic-link, /auth/logout | public | public | public |
| GET /api/auth/verify | public (single-use token) | — | — |
| GET /portal/ (SPA) | public HTML | — | — |
| GET /api/health, /llms.txt, /robots.txt, /sitemap.xml | public | public | public |
The batch / extract / deepcrawl / jobs endpoints are always gated because
they either fan out into many conversions or touch Browser Rendering
directly.
Response Headers (Raw API)
| Header | Description |
|---|---|
| Content-Type | text/markdown, application/json, text/html, or text/plain |
| X-Source-URL | The original target URL |
| X-Markdown-Tokens | Token count (native Markdown for Agents only) |
| X-Markdown-Native | "true" when native, "false" otherwise |
| X-Markdown-Method | "native", "readability+turndown", "browser+readability+turndown", "jina", "firecrawl", or "cf" |
| X-Cache-Status | "HIT" or "MISS" |
| X-Markdown-Fallbacks | Comma-separated fallback list (when used) |
| X-Browser-Rendered | "true" when browser rendering path was used |
| X-Paywall-Detected | "true" when paywall heuristics were triggered |
| X-RateLimit-Limit | Monthly credit quota (authenticated requests only) |
| X-RateLimit-Remaining | Credits remaining this month |
| X-Request-Cost | Fixed per-request-type credit cost |
| X-Quota-Exceeded | "true" when quota is exhausted but a cached response was served |
| Retry-After | Present on 429 responses (IP rate limit or quota exceeded) |
| Access-Control-Allow-Origin | * (CORS enabled) |
Features
| Feature | Description |
|---|---|
| Any Website | Works on every site with five conversion paths |
| Site Adapters | Specialized extractors for WeChat, Feishu, Zhihu, Yuque, Notion, Juejin |
| Anti-Bot Bypass | Browser Rendering handles JS challenges, CAPTCHAs, and verification |
| 3-Tier Cache | In-memory hot cache → Cloudflare Cache API (per-colo, free) → KV (global, persistent) |
| Developer Portal | Self-service signup, API key management, real-time usage dashboard |
| Tier System | Anonymous (cache+readability only), Free (1k/mo), Pro (50k/mo) |
| R2 Image Storage | Images stored reliably, served via proxy URLs |
| Multiple Formats | Markdown, HTML, text, or structured JSON output |
| CSS Selectors | Target specific page elements for extraction |
| Batch API v2 | Convert up to 10 URLs with per-item format/selector/browser/cache options |
| Structured Extraction | CSS/XPath/Regex extraction via /api/extract with optional markdown attachment |
| Job Dispatcher | Queue + run + monitor crawl/extract workloads via /api/jobs/* |
| Deep Crawl | BFS + BestFirst traversal, filters/scorers, stream mode, checkpoint/resume |
| Table Support | Improved handling of simple and complex tables |
| Smart Extraction | Readability strips nav, ads, sidebars — extracts main article content |
| Rendered View | Dark-themed Markdown preview with GitHub CSS and tab switching |
| Session Profiles | Persist/replay cookies and localStorage for repeat authenticated crawling |
| Proxy Pool Fallback | Multi-proxy + UA/header variant rotation for challenge-prone targets |
| SSRF Protection | Blocks private IPs, IPv6 link-local, cloud metadata endpoints |
| Timeout Protection | Time-budgeted scrolling for Feishu virtual scroll documents |
| Built-in Rate Limiting | Per-IP limits for conversion, stream, and batch routes |
| Runtime Paywall Rules | Support dynamic paywall rule updates via env/KV JSON |
| Operational Health | /api/health exposes throughput/success/retry/backlog and P50/P95 latency |
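The SSRF Protection row lists private IPs, IPv6 link-local addresses, and cloud metadata endpoints as blocked targets. A minimal check for IP literals can be sketched with the standard ipaddress module. This is an illustration only: `is_blocked_ip_literal` is a hypothetical name, hostnames are not resolved here, and a real guard must also validate what a hostname resolves to.

```python
import ipaddress


def is_blocked_ip_literal(host: str) -> bool:
    """Return True when host is an IP literal in a private, loopback, or link-local range."""
    try:
        ip = ipaddress.ip_address(host.strip("[]"))  # accept bracketed IPv6 such as [::1]
    except ValueError:
        # Not an IP literal (for example "example.com"); DNS-based checks are out of scope.
        return False
    # 169.254.169.254, a common cloud metadata address, falls in the link-local range.
    return ip.is_private or ip.is_loopback or ip.is_link_local
```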
Tech Stack
| Component | Role |
|---|---|
| Cloudflare Workers | Edge runtime — global deployment |
| Cloudflare Browser Rendering | Headless Chrome for JS-heavy/anti-bot pages |
| Cloudflare KV | Edge key-value cache for converted content |
| Cloudflare R2 | Object storage for images |
| Markdown for Agents | Native HTML→Markdown at edge |
| @mozilla/readability | Article content extraction (Firefox Reader View) |
| Turndown | HTML→Markdown conversion |
| @cloudflare/puppeteer | Puppeteer API for Browser Rendering |
| LinkeDOM | Lightweight DOM for Workers |
| Vitest | Unit testing framework |
AI Agent Integration
Three ways to use Website2Markdown from AI agents:
Agent Skills (Claude Code, OpenClaw, Claw)
One command install, auto-discovered by your agent. Includes usage patterns, error handling, and guides for all 21 adapters.
# Claude Code
git clone https://github.com/Digidai/website2markdown-skills ~/.claude/skills/website2markdown
# Codex CLI
git clone https://github.com/Digidai/website2markdown-skills ~/.codex/skills/website2markdown
# Gemini CLI
git clone https://github.com/Digidai/website2markdown-skills ~/.gemini/skills/website2markdown
# OpenClaw
npx clawhub@latest install website2markdown
See the website2markdown-skills repo for full documentation.
MCP Server (Claude Desktop, Cursor IDE, Windsurf)
Standard MCP protocol with convert_url tool.
npm install -g @digidai/mcp-website2markdown
Claude Desktop config (~/.claude/claude_desktop_config.json):
{
"mcpServers": {
"website2markdown": {
"command": "mcp-website2markdown",
"env": {
"WEBSITE2MARKDOWN_API_URL": "https://md.genedai.me"
}
}
}
}
llms.txt
Machine-readable API description for AI system auto-discovery:
https://md.genedai.me/llms.txt
Which to choose?
| | Skills | MCP Server | llms.txt |
|---|---|---|---|
| Best for | CLI-based agents (Claude Code, OpenClaw) | IDE-based agents (Claude Desktop, Cursor) | Any AI with web access |
| Latency | Direct HTTP (fastest) | MCP protocol overhead | Direct HTTP |
| Context | Rich (patterns, error handling, adapters) | Tool schema only | API description |
| Install | git clone (one command) | npm install -g | None |
Project Structure
md-genedai/
├── src/
│ ├── index.ts # Router + conversion + extraction + job/deepcrawl endpoints
│ ├── types.ts # Shared TS types (Env, extraction/job payloads, adapters)
│ ├── config.ts # Limits, timeouts, UA and parser constants
│ ├── utils.ts # Shared helpers (headers, parsing, formatting)
│ ├── converter.ts # Readability + Turndown pipeline and content shaping
│ ├── security.ts # SSRF guardrails, retry wrappers, safe fetch helpers
│ ├── paywall.ts # Paywall heuristics + runtime rule updates
│ ├── proxy.ts # Forward proxy + pool parsing/selection
│ ├── browser/
│ │ ├── index.ts # Browser rendering orchestrator and capacity control
│ │ ├── stealth.ts # Anti-detection hardening
│ │ └── adapters/ # 14 site-specific browser adapters
│ ├── cache/
│ │ └── index.ts # KV conversion cache + R2 image storage
│ ├── extraction/
│ │ └── strategies.ts # CSS/XPath/Regex structured extraction
│ ├── dispatcher/
│ │ ├── model.ts # Job schema + KV persistence/idempotency
│ │ └── runner.ts # Job execution and retry orchestration
│ ├── deepcrawl/
│ │ ├── bfs.ts # BFS/BestFirst traversal core
│ │ ├── filters.ts # Crawl filters (domains, patterns, content hints)
│ │ └── scorers.ts # Keyword/domain scoring for BestFirst strategy
│ ├── session/
│ │ └── profile.ts # Session profile capture/replay (cookie/localStorage)
│ ├── observability/
│ │ └── metrics.ts # Throughput/success/retry/backlog/latency snapshots
│ ├── templates/
│ │ ├── landing.ts # Landing page HTML
│ │ ├── rendered.ts # Markdown preview page HTML
│ │ ├── loading.ts # SSE loading/progress page HTML
│ │ └── error.ts # Error page HTML
│ └── __tests__/ # 37 test files
├── docs/
│ └── slo-reference.md # SLO targets used by /api/health operational metrics
├── scripts/
│ └── smoke-api.sh # End-to-end API smoke checks for deployed/local worker
├── package.json
├── wrangler.toml # Worker config: browser, KV, R2 bindings
├── tsconfig.json
├── vitest.config.ts
└── .gitignore
Deployment
This project uses Cloudflare Git Integration — push to main and Cloudflare automatically builds and deploys.
Setup (one-time)
- Fork or push this repo to GitHub
- Create required resources:
  # Create KV namespace
  wrangler kv namespace create CACHE_KV
  # Update the namespace ID in wrangler.toml
  # Create R2 bucket
  wrangler r2 bucket create md-images
- Go to Cloudflare Dashboard > Workers & Pages > Create > Import a Git repository
- Select the GitHub repo; Cloudflare will deploy automatically on every push to main
Secrets / Runtime Variables
# Required: Bearer auth for protected write APIs
# Used by: /api/batch, /api/extract, /api/jobs, /api/deepcrawl
wrangler secret put API_TOKEN
# Optional: protect raw convert API + /api/stream
wrangler secret put PUBLIC_API_TOKEN
# Optional: dynamic paywall rules (JSON array)
wrangler secret put PAYWALL_RULES_JSON
# Optional: single upstream proxy (format: username:password@host:port)
wrangler secret put PROXY_URL
# Optional: proxy pool for rotation/fallback (comma or newline separated)
wrangler secret put PROXY_POOL
Optional KV-driven paywall rule source:
- Set PAYWALL_RULES_KV_KEY (plain env var) to a KV key that stores JSON paywall rules.
- If both PAYWALL_RULES_JSON and a KV key are configured, the KV value takes precedence.
Example plain env var in wrangler.toml:
[vars]
PAYWALL_RULES_KV_KEY = "paywall:rules:v1"
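The precedence between the two rule sources can be expressed as a small resolver. This is a sketch of the stated rule only: `resolve_paywall_rules` is a hypothetical name, the inputs are the raw JSON strings from the secret and from KV, and the shape of an individual rule is not documented here, so rules are treated as opaque values.

```python
import json
from typing import Any, List, Optional


def resolve_paywall_rules(env_json: Optional[str], kv_value: Optional[str]) -> List[Any]:
    """Pick the active rule set: the KV value wins over PAYWALL_RULES_JSON."""
    source = kv_value if kv_value is not None else env_json
    if source is None:
        return []  # neither source configured
    rules = json.loads(source)
    if not isinstance(rules, list):
        raise ValueError("paywall rules must be a JSON array")
    return rules
```

Treating a missing KV entry as "fall back to the secret" keeps a self-hosted deployment working when only one of the two sources is set.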
Browser Rendering Binding
[browser]
binding = "MYBROWSER"
Note: Browser Rendering requires a Workers Paid plan. It only works in deployed Workers or with wrangler dev --remote.
Custom Domain
- In Cloudflare Dashboard > Workers & Pages > your Worker > Settings > Domains & Routes
- Add your custom domain (e.g. md.example.com)
Local Development
npm install
npm run dev # Local dev at http://localhost:8787
npm run build # Dry-run bundle to dist/
npm run typecheck # Type check
npm test # Run unit tests
npm run test:watch # Watch mode
npm run test:coverage # Coverage
npm run smoke:api # API smoke checks (requires BASE_URL + API_TOKEN env vars)
Checkpoint behavior:
- Deep crawl checkpoint persistence is only enabled when you provide checkpoint options such as crawl_id, resume, snapshot_interval, or ttl_seconds.
- If you omit checkpoint, the API still returns a crawlId for tracing, but no checkpoint record is written.
- Resume requests must match the original crawl configuration; changing filters, scoring, or fetch options returns 409 Conflict.
Smoke example:
BASE_URL="https://md.genedai.me" \
API_TOKEN="<api-token>" \
TARGET_URL="https://example.com" \
npm run smoke:api
Validation Workflow (2026-03-06)
Use Node 22 locally (see .nvmrc) or rely on GitHub Actions in .github/workflows/ci.yml:
| Check | Command |
|---|---|
| Type safety | npm run typecheck |
| Unit/integration tests | npm test |
| Coverage | npm run test:coverage |
| Worker bundle dry-run | npm run build |
| Live health check | curl https://website2markdown.genedai.workers.dev/api/health |
| Live public conversion | GET https://website2markdown.genedai.workers.dev/https://example.com?raw=true |
Production note:
- Protected write APIs (/api/extract, /api/jobs*, /api/deepcrawl, /api/batch) require API_TOKEN.
- If API_TOKEN is not configured in the deployed Worker, these endpoints return 503 (API_TOKEN not set).
License
Apache-2.0