URL to Markdown Converter

Convert any web page to clean Markdown — JS-heavy SPAs, paywalled content, Chinese platforms (WeChat, Zhihu, Feishu), and more. Powered by Cloudflare Workers with a 5-layer fallback pipeline and 14 site adapters.

Quick Start

# Convert any URL to Markdown (try it now!)
curl -H "Accept: text/markdown" https://md.genedai.me/https://example.com

# WeChat article
curl -H "Accept: text/markdown" "https://md.genedai.me/https://mp.weixin.qq.com/s/YOUR_ARTICLE_ID"

# JSON output with metadata
curl "https://md.genedai.me/https://example.com?format=json&raw=true"

Or just open in your browser: md.genedai.me/https://example.com

Need browser-rendered pages (WeChat, Feishu, JS-heavy SPAs) or higher limits?
Get a free API key at md.genedai.me/portal/.

How It Works

https://md.genedai.me/<target-url>

Conversion Flow

Request
  │
  ▼
Fetch target with Accept: text/markdown
  │
  ├─ Response is text/markdown? ──▶ Path 1: Native Markdown
  │
  └─ Response is text/html?
       │
       ├─ Anti-bot / JS-required detected? ──▶ Path 3: Browser Rendering → Readability + Turndown
       │
       └─ Normal HTML ──▶ Path 2: Readability + Turndown

Path	When	How	`X-Markdown-Method`
Native	Target site supports Markdown for Agents	Cloudflare edge converts via `Accept: text/markdown` content negotiation	`native`
Fallback	Normal HTML pages	Readability extracts main content → Turndown converts to Markdown	`readability+turndown`
Browser	Anti-bot pages, JS-rendered content	Headless Chrome renders the page → Readability + Turndown	`browser+readability+turndown`
Firecrawl	Explicit `engine=firecrawl`, non-text documents, or thin local extraction	Converts via Firecrawl v2 scrape; when no `FIRECRAWL_API_KEY` is configured, the request is sent without an Authorization header so Firecrawl's keyless tier can serve it	`firecrawl`
Jina	Explicit `engine=jina` or last-resort fallback	Convert via Jina Reader API while preserving the same output/query surface	`jina`
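
The decision tree above can be read as a small function. The sketch below is illustrative only: `choose_method` and its `needs_browser` flag are hypothetical names, and the worker's actual anti-bot / JS detection heuristics are not described here. The returned strings are the `X-Markdown-Method` values from the table.

```python
def choose_method(content_type: str, needs_browser: bool = False) -> str:
    """Map a fetched response to an X-Markdown-Method value.

    `needs_browser` stands in for the anti-bot / JS-required detection,
    whose real heuristics are an assumption in this sketch.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/markdown":
        return "native"  # Path 1
    if media_type == "text/html":
        if needs_browser:
            return "browser+readability+turndown"  # Path 3
        return "readability+turndown"  # Path 2
    # Other content types are outside the three-path diagram.
    raise ValueError(f"content type not covered by the diagram: {media_type}")
```

Firecrawl and Jina are left out on purpose: they are selected by `engine=` or by the fallback rules, not by the content type alone.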

API Usage

Browser (URL bar)

# Full URL
https://md.genedai.me/https://example.com/page

# Bare domain (auto-prepends https://)
https://md.genedai.me/example.com/page

Raw Markdown API

# Get raw Markdown via query param
curl "https://md.genedai.me/https://example.com/page?raw=true"

# Get raw Markdown via Accept header
curl https://md.genedai.me/https://example.com/page \
  -H "Accept: text/markdown"
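
For scripted use, the URL shape shown in these examples can be assembled with a small helper. This is a minimal sketch: `build_convert_url` is a hypothetical name, and it assumes the target URL has no query string of its own, as in every example above.

```python
from urllib.parse import urlencode

BASE = "https://md.genedai.me"


def build_convert_url(target: str, **params: str) -> str:
    """Return BASE/<target>?<params>, mirroring the curl examples."""
    # Bare domains are passed through unchanged; the service prepends https://.
    url = f"{BASE}/{target}"
    if params:
        # Assumption: the target carries no query string, so "?" starts
        # the converter's own parameters.
        url += "?" + urlencode(params)
    return url


print(build_convert_url("https://example.com/page", raw="true"))
```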

API Keys and Tiers

Tier	Credits/month	Browser rendering	Proxy / Engine selection
Anonymous (no key)	—	❌ no browser rendering	✅ keyless `engine=jina` / `engine=firecrawl`
Free	1,000	✅	✅ keyless `engine=jina` / `engine=firecrawl`
Pro	50,000	✅	✅ all engines + `no_cache=` + `force_browser=`

Credit cost is fixed per request type, not per actual conversion path
(so bills are predictable even if a site silently switches from static to
browser rendering behind the scenes):

Endpoint	Credits
`GET /<url>`	1
`GET /api/stream`	1
`POST /api/batch` (per URL)	1
`POST /api/extract`	3
`POST /api/deepcrawl` (per URL)	2

Cache hits on a paying tier still consume 1 credit; when your quota is
exhausted the API keeps serving cached URLs (with X-Quota-Exceeded: true)
but rejects cache-miss requests with 429.
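
A client can tell these three situations apart from the status code and the `X-Quota-Exceeded` header. The sketch below is one way to do that; `quota_state` and its return labels are hypothetical names, not part of the API.

```python
from typing import Mapping


def quota_state(status: int, headers: Mapping[str, str]) -> str:
    """Classify a response as 'ok', 'cached-only', or 'rejected'."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 429:
        # Quota exhausted on a cache miss. Note that a 429 can also come
        # from the per-IP rate limit, which this sketch does not separate.
        return "rejected"
    if lowered.get("x-quota-exceeded", "").strip().lower() == "true":
        # Quota is exhausted, but a cached copy was served.
        return "cached-only"
    return "ok"
```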

Using your key

# Bearer header (recommended)
curl "https://md.genedai.me/https://example.com/page?raw=true" \
  -H "Authorization: Bearer mk_..."

# The old ?token= query-parameter form is supported for legacy
# PUBLIC_API_TOKEN deployments, but NOT for mk_ keys. Never put a real
# API key in a query string — logs, referrers, and monitoring capture it.

Every authenticated response includes per-key rate limit headers:

X-RateLimit-Limit:     50000
X-RateLimit-Remaining: 49993
X-Request-Cost:        1
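
These headers are enough to estimate how many more calls of the same type fit in the current month. A minimal sketch, with `CreditStatus` and `parse_credit_headers` as hypothetical helper names:

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CreditStatus:
    limit: int      # X-RateLimit-Limit
    remaining: int  # X-RateLimit-Remaining
    cost: int       # X-Request-Cost

    @property
    def requests_left(self) -> int:
        """How many more requests of this cost fit in the remaining credits."""
        return self.remaining // self.cost if self.cost > 0 else 0


def parse_credit_headers(headers: Mapping[str, str]) -> CreditStatus:
    # Header names are case-insensitive; values may carry padding spaces.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    return CreditStatus(
        limit=int(lowered["x-ratelimit-limit"]),
        remaining=int(lowered["x-ratelimit-remaining"]),
        cost=int(lowered["x-request-cost"]),
    )
```

Anonymous responses do not carry these headers, so callers should expect a `KeyError` there and handle it.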

Portal API (session cookie)

Once signed in at /portal/, these endpoints are available under the same
session cookie:

Endpoint	Method	Description
`/api/me`	GET	Current account (email, tier, account_id)
`/api/keys`	GET	List your keys (prefix only, never plaintext)
`/api/keys`	POST	Create a new key; plaintext returned once
`/api/keys/:id`	DELETE	Revoke a key (takes effect within 60s — LRU cache)
`/api/usage`	GET	Usage breakdown (tier, quota, used, daily history)
`/api/auth/logout`	POST	Destroy session, clear cookie

/api/usage also accepts an Authorization: Bearer mk_... header so SDK
and CLI tools can poll usage without a session.

Output Formats

# Markdown (default)
curl "https://md.genedai.me/https://example.com?format=markdown&raw=true"

# Clean HTML
curl "https://md.genedai.me/https://example.com?format=html&raw=true"

# Plain text (no formatting)
curl "https://md.genedai.me/https://example.com?format=text&raw=true"

# JSON (structured: url, title, markdown, method, timestamp)
curl "https://md.genedai.me/https://example.com?format=json&raw=true"

CSS Selector Extraction

Extract specific page elements instead of the full article:

# Extract only the article body
curl "https://md.genedai.me/https://example.com?selector=.article-body&raw=true"

# Extract a specific section
curl "https://md.genedai.me/https://example.com?selector=%23main-content&raw=true"

selector maximum length is 256 characters.
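
Selectors containing `#` must be percent-encoded, as in the `%23main-content` example, because a raw `#` starts a URL fragment. The sketch below encodes a selector and checks the length limit client-side; `selector_query` is a hypothetical helper, and it assumes the 256-character limit applies to the decoded selector.

```python
from urllib.parse import quote

MAX_SELECTOR_LENGTH = 256  # limit stated above


def selector_query(selector: str) -> str:
    """Return a 'selector=...' query fragment that is safe to append to a URL."""
    if not selector:
        raise ValueError("selector must not be empty")
    # Assumption: the limit counts characters of the decoded selector.
    if len(selector) > MAX_SELECTOR_LENGTH:
        raise ValueError(f"selector exceeds {MAX_SELECTOR_LENGTH} characters")
    # safe="" also encodes "/" and "#", which would otherwise alter the URL.
    return "selector=" + quote(selector, safe="")
```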

Force Browser Rendering

curl "https://md.genedai.me/https://example.com/js-heavy-page?raw=true&force_browser=true"

Jina Reader Engine

Use engine=jina to convert via r.jina.ai instead of the built-in pipeline. This is useful for JS-heavy pages when browser rendering is unavailable. Without a Jina API key, the keyless tier allows 20 requests per minute per IP; higher limits are available with a Jina key.

curl "https://md.genedai.me/https://example.com?raw=true&engine=jina"

Jina is also used automatically as a last-resort fallback when Readability extraction produces very little content and no browser/proxy path was used.

Firecrawl Keyless Fallback

Use engine=firecrawl to convert via Firecrawl v2 scrape. If FIRECRAWL_API_KEY
is not configured, the worker sends no Authorization header so Firecrawl can
use its keyless free tier when the upstream accepts the request. Keyless can
still return 403 or 429; automatic fallbacks treat that as non-fatal and
continue to Jina.

curl "https://md.genedai.me/https://example.com?raw=true&engine=firecrawl"

engine=jina and engine=firecrawl are intentionally available without a
Pro key because both upstreams provide keyless/free reader paths. Account-backed
engines such as engine=cf still require Pro. Firecrawl is also tried before
Jina when local extraction is too thin or the target is a non-text document
such as a PDF/Word/Excel URL.
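
The fallback order described here (Firecrawl first, then Jina, with keyless 403/429 treated as non-fatal) can be sketched as a loop over engines. Everything named below (`UpstreamError`, `convert_with_fallback`, the engine callables) is hypothetical and only models the rule; it is not the worker's code.

```python
from typing import Callable, List, Tuple

NON_FATAL_STATUSES = {403, 429}  # keyless rejections that should not abort


class UpstreamError(Exception):
    """Raised by an engine callable when the upstream returns an error status."""

    def __init__(self, status: int) -> None:
        super().__init__(f"upstream returned {status}")
        self.status = status


def convert_with_fallback(
    url: str, engines: List[Tuple[str, Callable[[str], str]]]
) -> Tuple[str, str, List[str]]:
    """Try engines in order; return (engine name, markdown, skipped engines)."""
    skipped: List[str] = []
    for name, convert in engines:
        try:
            return name, convert(url), skipped
        except UpstreamError as err:
            if err.status not in NON_FATAL_STATUSES:
                raise  # anything else is treated as a real failure
            skipped.append(name)
    raise RuntimeError("all engines were rejected")
```

Passing the engines as a list keeps the order explicit: `[("firecrawl", ...), ("jina", ...)]` matches the sequence described above.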

Cache Control

Results are cached in KV for fast repeat access. To bypass cache:

curl "https://md.genedai.me/https://example.com?raw=true&no_cache=true"

Batch Conversion

Convert multiple URLs in a single request:

curl -X POST https://md.genedai.me/api/batch \
  -H "Authorization: Bearer <api-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page1",
      {
        "url": "https://example.com/page2",
        "format": "text",
        "selector": "article",
        "force_browser": false,
        "no_cache": true
      }
    ]
  }'

urls supports:

String item: "https://example.com/a" (defaults to markdown)
Object item: { "url": "...", "format?": "markdown|html|text|json", "selector?": "...", "force_browser?": boolean, "no_cache?": boolean, "engine?": "jina" }

Response:

{
  "results": [
    {
      "url": "...",
      "format": "markdown",
      "content": "...",
      "markdown": "...",
      "title": "...",
      "method": "...",
      "cached": false,
      "fallbacks": ["jsonld"]
    },
    {
      "url": "...",
      "format": "text",
      "content": "...",
      "title": "...",
      "method": "...",
      "cached": true
    }
  ]
}
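
Because a batch request accepts at most 10 URLs, longer lists have to be split client-side. A minimal sketch of that chunking, assuming the request body shape shown above; `batch_payloads` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterator, List, Union

MAX_BATCH_URLS = 10  # per-request limit of /api/batch
BatchItem = Union[str, Dict[str, Any]]


def batch_payloads(items: List[BatchItem]) -> Iterator[Dict[str, List[BatchItem]]]:
    """Yield request bodies for /api/batch, each with at most 10 items."""
    for start in range(0, len(items), MAX_BATCH_URLS):
        chunk = items[start:start + MAX_BATCH_URLS]
        for item in chunk:
            # Object items must name their target; string items are the URL itself.
            if isinstance(item, dict) and "url" not in item:
                raise ValueError("object items need a 'url' key")
        yield {"urls": chunk}
```

Each yielded dictionary can be serialized with `json.dumps` and sent as the POST body; every URL in it costs one credit.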

Structured Extraction API

Extract structured fields from a URL or raw HTML.

curl -X POST https://md.genedai.me/api/extract \
  -H "Authorization: Bearer <api-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy": "css",
    "url": "https://example.com/article",
    "schema": {
      "fields": [
        { "name": "title", "selector": "h1", "type": "text", "required": true },
        { "name": "author", "selector": ".author", "type": "text" }
      ]
    },
    "include_markdown": true
  }'

Batch extraction (items) is also supported (max 10 items).

Additional extraction capabilities:

Use either top-level url / html or nested input.url / input.html.
schema.fields[*].required fails extraction when a required field is missing.
options supports dedupe, includeEmpty, and regexFlags.
include_markdown: true attaches converted markdown alongside extracted data.

Job API (create / query / stream / run)

Submit crawl/extract tasks as queued jobs, then run and monitor. Jobs are persisted as queued records in KV; execution begins when you call /run:

# 1) Create job
curl -X POST https://md.genedai.me/api/jobs \
  -H "Authorization: Bearer <api-token>" \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: demo-job-1" \
  -d '{
    "type": "crawl",
    "tasks": [
      "https://example.com/a",
      "https://example.com/b"
    ],
    "priority": 10,
    "maxRetries": 2
  }'

# 2) Query status
curl -H "Authorization: Bearer <api-token>" \
  https://md.genedai.me/api/jobs/<job-id>

# 3) Watch status stream (SSE)
curl -N -H "Authorization: Bearer <api-token>" \
  https://md.genedai.me/api/jobs/<job-id>/stream

# 4) Execute queued tasks
curl -X POST -H "Authorization: Bearer <api-token>" \
  https://md.genedai.me/api/jobs/<job-id>/run

Job API notes:

Supports both type: "crawl" and type: "extract".
type: "crawl" accepts string URLs or object tasks with format, selector, force_browser, and no_cache.
type: "extract" reuses the same task shape as /api/extract.
Idempotency-Key is keyed by both the header value and request payload: same key + same payload returns the existing job; same key + different payload returns 409 Conflict.
priority is normalized to 1..100 (default 10), maxRetries to 0..10 (default 2).
Up to 100 tasks are allowed per job.
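
The normalization and idempotency rules in these notes can be modelled in a few lines. This is an in-memory illustration under stated assumptions, not the worker's implementation: out-of-range values are assumed to be clamped, the payload is compared via a SHA-256 of its canonical JSON, and the result labels (`created`, `existing`, `conflict`) are made up for the sketch.

```python
import hashlib
import json
from typing import Any, Dict, Optional, Tuple


def clamp(value: Optional[int], low: int, high: int, default: int) -> int:
    """Use the default when missing, otherwise clamp into [low, high]."""
    return default if value is None else max(low, min(high, value))


def normalize_job(payload: Dict[str, Any]) -> Dict[str, Any]:
    if len(payload.get("tasks", [])) > 100:
        raise ValueError("a job accepts at most 100 tasks")
    job = dict(payload)
    job["priority"] = clamp(payload.get("priority"), 1, 100, 10)
    job["maxRetries"] = clamp(payload.get("maxRetries"), 0, 10, 2)
    return job


def payload_fingerprint(payload: Dict[str, Any]) -> str:
    # Sorted keys make the hash independent of key order in the request.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyStore:
    """Same key + same payload -> existing job; same key + new payload -> conflict."""

    def __init__(self) -> None:
        self._seen: Dict[str, Tuple[str, str]] = {}

    def submit(self, key: str, payload: Dict[str, Any]) -> Tuple[str, Optional[str]]:
        fingerprint = payload_fingerprint(payload)
        if key in self._seen:
            stored_fingerprint, job_id = self._seen[key]
            if stored_fingerprint == fingerprint:
                return "existing", job_id
            return "conflict", None  # the API answers 409 Conflict here
        job_id = f"job-{len(self._seen) + 1}"  # placeholder id scheme
        self._seen[key] = (fingerprint, job_id)
        return "created", job_id
```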

Deep Crawl API

Run a BFS or BestFirst deep crawl with filters, scoring, and opt-in checkpoint resume.

# non-stream
curl -X POST https://md.genedai.me/api/deepcrawl \
  -H "Authorization: Bearer <api-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "seed": "https://example.com/docs",
    "max_depth": 2,
    "max_pages": 20,
    "strategy": "best_first",
    "filters": {
      "allow_domains": ["example.com"],
      "url_patterns": ["https://example.com/docs/*"]
    },
    "scorer": {
      "keywords": ["api", "reference"],
      "weight": 2
    },
    "checkpoint": {
      "crawl_id": "docs-crawl-001",
      "snapshot_interval": 5
    }
  }'

# stream mode (SSE: start/node/done/fail)
curl -N -X POST https://md.genedai.me/api/deepcrawl \
  -H "Authorization: Bearer <api-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "seed": "https://example.com/docs",
    "stream": true
  }'

A deep crawl request supports:

include_external to traverse off-domain links.
filters.url_patterns, filters.allow_domains, filters.block_domains, filters.content_types.
scorer.keywords, scorer.weight, scorer.score_threshold.
output.include_markdown to attach per-page markdown.
fetch.selector, fetch.force_browser, fetch.no_cache to control page conversion.
checkpoint.crawl_id, checkpoint.resume, checkpoint.snapshot_interval, checkpoint.ttl_seconds.
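
To make the difference between the two strategies concrete, the sketch below runs both over an in-memory link graph. It is an illustration, not the service's crawler: the graph, `keyword_score`, and `crawl_order` are hypothetical, and scoring by keyword matches in the URL is an assumption about how `scorer.keywords` and `scorer.weight` might be applied.

```python
import heapq
from collections import deque
from typing import Dict, List, Sequence


def keyword_score(url: str, keywords: Sequence[str], weight: float) -> float:
    """Assumed scoring: weight times the number of keywords found in the URL."""
    lowered = url.lower()
    return weight * sum(1 for kw in keywords if kw.lower() in lowered)


def crawl_order(graph: Dict[str, List[str]], seed: str, max_depth: int = 2,
                max_pages: int = 20, strategy: str = "bfs",
                keywords: Sequence[str] = (), weight: float = 1.0) -> List[str]:
    """Return the order in which pages would be visited."""
    best_first = strategy == "best_first"
    visited: List[str] = []
    seen = {seed}
    counter = 0  # tie-breaker: equal scores are visited in discovery order
    first = (-keyword_score(seed, keywords, weight), counter, 0, seed)
    heap = [first] if best_first else []
    queue = deque() if best_first else deque([first])
    while (heap or queue) and len(visited) < max_pages:
        # Best-first pops the highest score; BFS pops the oldest entry.
        _, _, depth, url = heapq.heappop(heap) if best_first else queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond max_depth
        for link in graph.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            counter += 1
            entry = (-keyword_score(link, keywords, weight), counter, depth + 1, link)
            if best_first:
                heapq.heappush(heap, entry)
            else:
                queue.append(entry)
    return visited
```

With a small `max_pages`, best-first spends the page budget on the branches that match the keywords, while BFS spends it level by level.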

Supported Sites

Special adapters for optimal extraction on these platforms:

Site	Features
WeChat (`mp.weixin.qq.com`)	MicroMessenger UA, image proxy for hotlink bypass
Feishu/Lark Docs (document surfaces such as `/wiki`, `/docx`, `/docs` on `.feishu.cn` / `.larksuite.com`)	Virtual scroll handling, R2 image storage, UI noise removal
Zhihu (`zhihu.com/p/`)	Login wall removal, lazy image swap, hybrid proxy bypass
Yuque (`yuque.com`)	SPA rendering, sidebar/toc removal
Notion (`notion.site`, `notion.so`)	SPA rendering, lazy scroll loading
Juejin (`juejin.cn/post/`)	Login popup removal, code block expansion
Twitter/X (`twitter.com`, `x.com`)	Stealth rendering, login wall bypass
Reddit (`reddit.com`)	URL transform to old.reddit.com, content extraction
CSDN (`csdn.net`)	Login popup removal, code block expansion
36Kr (`36kr.com`)	Stealth rendering, content extraction
Toutiao (`toutiao.com`)	Stealth rendering, content extraction
NetEase (`163.com`)	Content extraction
Weibo (`weibo.com`)	Stealth rendering, hybrid proxy bypass
All other sites	Generic mobile UA, lazy image handling

JavaScript / TypeScript

const res = await fetch(
  "https://md.genedai.me/https://example.com/page?raw=true"
);
const markdown = await res.text();
console.log(res.headers.get("X-Markdown-Method"));
console.log(res.headers.get("X-Cache-Status")); // "HIT" or "MISS"

Python

import requests

url = "https://md.genedai.me/https://example.com/page"
resp = requests.get(url, params={"raw": "true", "format": "json"})
data = resp.json()
print(data["title"], data["method"])

API Endpoints

Endpoint	Method	Description
`/`	GET	Landing page with URL input form
`/<url>`	GET	Convert URL and render Markdown as HTML page
`/<url>?raw=true`	GET	Return raw Markdown as plain text
`/<url>?format=json`	GET	Return structured JSON (url, title, markdown, method)
`/<url>?format=html`	GET	Return HTML output for preview/basic rendering
`/<url>?format=text`	GET	Return plain text (no formatting)
`/<url>?selector=.class`	GET	Extract specific CSS selector
`/<url>?force_browser=true`	GET	Force browser rendering
`/<url>?engine=jina`	GET	Convert via Jina Reader API using the same output formats
`/<url>?engine=firecrawl`	GET	Convert via Firecrawl scrape using keyless mode when no key is configured
`/<url>?no_cache=true`	GET	Bypass KV cache
`/api/stream?url=<encoded-url>`	GET	SSE conversion stream (`step`, `done`, `fail`) with `selector` / `force_browser` / `no_cache` / `engine` / `token` support
`/api/batch`	POST	Batch convert multiple URLs (max 10)
`/api/extract`	POST	Structured extraction API (`css` / `xpath` / `regex`)
`/api/jobs`	POST	Create queued crawl/extract job record
`/api/jobs/:id`	GET	Query job status
`/api/jobs/:id/stream`	GET	SSE job status stream
`/api/jobs/:id/run`	POST	Execute queued/failed tasks in job
`/api/deepcrawl`	POST	Deep crawl API (BFS/BestFirst, stream/non-stream, checkpoint)
`/api/og`	GET	Dynamic Open Graph image for landing/rendered pages
`/img/<encoded-url>`	GET	Image proxy (bypasses hotlink protection)
`/r2img/<key>`	GET	Serve image from R2 storage
`/api/health`	GET	Health + runtime + operational metrics

Authentication Matrix

The hosted instance at md.genedai.me uses D1-backed API keys with tiers
(see API Keys and Tiers). Self-hosted deployments
can skip the AUTH_DB binding and fall back to the legacy
API_TOKEN / PUBLIC_API_TOKEN secrets.

Route Group	Anonymous	Free tier (`mk_…`)	Pro tier (`mk_…`)
`GET /<url>`	✅ cache + readability + keyless `engine=jina/firecrawl`	✅ full pipeline + keyless `engine=jina/firecrawl`	✅ + all engines, `no_cache`, `force_browser`
`GET /api/stream`	✅ cache + readability + keyless `engine=jina/firecrawl`	✅ full pipeline + keyless `engine=jina/firecrawl`	✅ full + params
`POST /api/batch`	❌ 401	✅	✅
`POST /api/extract`	❌ 401	✅	✅
`POST /api/deepcrawl`	❌ 401	✅	✅
`POST /api/jobs*`	❌ 401	✅	✅
`GET /api/me`, `/api/keys`, `/api/usage`	—	session cookie	session cookie or Bearer key
`POST /api/auth/magic-link`, `/api/auth/logout`	public	public	public
`GET /api/auth/verify`	public (single-use token)	—	—
`GET /portal/` (SPA)	public HTML	—	—
`GET /api/health`, `/llms.txt`, `/robots.txt`, `/sitemap.xml`	public	public	public

The batch / extract / deepcrawl / jobs endpoints are always gated because
they either fan out into many conversions or touch Browser Rendering
directly.

Response Headers (Raw API)

Header	Description
`Content-Type`	`text/markdown`, `application/json`, `text/html`, or `text/plain`
`X-Source-URL`	The original target URL
`X-Markdown-Tokens`	Token count (native Markdown for Agents only)
`X-Markdown-Native`	`"true"` when native, `"false"` otherwise
`X-Markdown-Method`	`"native"`, `"readability+turndown"`, `"browser+readability+turndown"`, `"jina"`, `"firecrawl"`, or `"cf"`
`X-Cache-Status`	`"HIT"` or `"MISS"`
`X-Markdown-Fallbacks`	Comma-separated fallback list (when used)
`X-Browser-Rendered`	`"true"` when browser rendering path was used
`X-Paywall-Detected`	`"true"` when paywall heuristics were triggered
`X-RateLimit-Limit`	Monthly credit quota (authenticated requests only)
`X-RateLimit-Remaining`	Credits remaining this month
`X-Request-Cost`	Fixed per-request-type credit cost
`X-Quota-Exceeded`	`"true"` when quota is exhausted but a cached response was served
`Retry-After`	Present on `429` responses (IP rate limit or quota exceeded)
`Access-Control-Allow-Origin`	`*` — CORS enabled
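
When a `429` arrives, the `Retry-After` value tells the client how long to wait. HTTP allows either a number of seconds or an HTTP date in this header, and this README does not say which form the service sends, so the sketch below accepts both. `retry_delay_seconds` is a hypothetical helper.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay_seconds(retry_after: Optional[str],
                        now: Optional[datetime] = None,
                        default: float = 1.0) -> float:
    """Convert a Retry-After header value into a wait time in seconds."""
    if retry_after is None:
        return default
    value = retry_after.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "30"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable value: fall back to the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    # A date in the past means the wait is already over.
    return max(0.0, (when - current).total_seconds())
```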

Features

Feature	Description
Any Website	Works across sites via five conversion paths (native, Readability + Turndown, browser rendering, Firecrawl, Jina)
Site Adapters	Specialized extractors for WeChat, Feishu, Zhihu, Yuque, Notion, Juejin, and more (see Supported Sites)
Anti-Bot Bypass	Browser Rendering handles JS challenges, CAPTCHAs, and verification
3-Tier Cache	In-memory hot cache → Cloudflare Cache API (per-colo, free) → KV (global, persistent)
Developer Portal	Self-service signup, API key management, real-time usage dashboard
Tier System	Anonymous (cache + readability + keyless engines, no browser rendering), Free (1k/mo), Pro (50k/mo)
R2 Image Storage	Images stored reliably, served via proxy URLs
Multiple Formats	Markdown, HTML, text, or structured JSON output
CSS Selectors	Target specific page elements for extraction
Batch API v2	Convert up to 10 URLs with per-item format/selector/browser/cache options
Structured Extraction	CSS/XPath/Regex extraction via `/api/extract` with optional markdown attachment
Job Dispatcher	Queue + run + monitor crawl/extract workloads via `/api/jobs/*`
Deep Crawl	BFS + BestFirst traversal, filters/scorers, stream mode, checkpoint/resume
Table Support	Improved handling of simple and complex tables
Smart Extraction	Readability strips nav, ads, sidebars — extracts main article content
Rendered View	Dark-themed Markdown preview with GitHub CSS and tab switching
Session Profiles	Persist/replay cookies and localStorage for repeat authenticated crawling
Proxy Pool Fallback	Multi-proxy + UA/header variant rotation for challenge-prone targets
SSRF Protection	Blocks private IPs, IPv6 link-local, cloud metadata endpoints
Timeout Protection	Time-budgeted scrolling for Feishu virtual scroll documents
Built-in Rate Limiting	Per-IP limits for conversion, stream, and batch routes
Runtime Paywall Rules	Support dynamic paywall rule updates via env/KV JSON
Operational Health	`/api/health` exposes throughput/success/retry/backlog and P50/P95 latency

Tech Stack

Component	Role
Cloudflare Workers	Edge runtime — global deployment
Cloudflare Browser Rendering	Headless Chrome for JS-heavy/anti-bot pages
Cloudflare KV	Edge key-value cache for converted content
Cloudflare R2	Object storage for images
Markdown for Agents	Native HTML→Markdown at edge
@mozilla/readability	Article content extraction (Firefox Reader View)
Turndown	HTML→Markdown conversion
@cloudflare/puppeteer	Puppeteer API for Browser Rendering
LinkeDOM	Lightweight DOM for Workers
Vitest	Unit testing framework

AI Agent Integration

Three ways to use Website2Markdown from AI agents:

Agent Skills (Claude Code, Codex CLI, Gemini CLI, OpenClaw)

One-command install, auto-discovered by your agent. Includes usage patterns, error handling, and guides for all 14 adapters.

# Claude Code
git clone https://github.com/Digidai/website2markdown-skills ~/.claude/skills/website2markdown

# Codex CLI
git clone https://github.com/Digidai/website2markdown-skills ~/.codex/skills/website2markdown

# Gemini CLI
git clone https://github.com/Digidai/website2markdown-skills ~/.gemini/skills/website2markdown

# OpenClaw
npx clawhub@latest install website2markdown

See the website2markdown-skills repo for full documentation.

MCP Server (Claude Desktop, Cursor IDE, Windsurf)

Standard MCP protocol with convert_url tool.

npm install -g @digidai/mcp-website2markdown

Claude Desktop config (claude_desktop_config.json; on macOS it lives at ~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "website2markdown": {
      "command": "mcp-website2markdown",
      "env": {
        "WEBSITE2MARKDOWN_API_URL": "https://md.genedai.me"
      }
    }
  }
}

llms.txt

Machine-readable API description for AI system auto-discovery:

https://md.genedai.me/llms.txt

Which to choose?

	Skills	MCP Server	llms.txt
Best for	CLI-based agents (Claude Code, OpenClaw)	IDE-based agents (Claude Desktop, Cursor)	Any AI with web access
Latency	Direct HTTP (fastest)	MCP protocol overhead	Direct HTTP
Context	Rich (patterns, error handling, adapters)	Tool schema only	API description
Install	`git clone` (one command)	`npm install -g`	None

Project Structure

md-genedai/
├── src/
│   ├── index.ts              # Router + conversion + extraction + job/deepcrawl endpoints
│   ├── types.ts              # Shared TS types (Env, extraction/job payloads, adapters)
│   ├── config.ts             # Limits, timeouts, UA and parser constants
│   ├── utils.ts              # Shared helpers (headers, parsing, formatting)
│   ├── converter.ts          # Readability + Turndown pipeline and content shaping
│   ├── security.ts           # SSRF guardrails, retry wrappers, safe fetch helpers
│   ├── paywall.ts            # Paywall heuristics + runtime rule updates
│   ├── proxy.ts              # Forward proxy + pool parsing/selection
│   ├── browser/
│   │   ├── index.ts          # Browser rendering orchestrator and capacity control
│   │   ├── stealth.ts        # Anti-detection hardening
│   │   └── adapters/         # 14 site-specific browser adapters
│   ├── cache/
│   │   └── index.ts          # KV conversion cache + R2 image storage
│   ├── extraction/
│   │   └── strategies.ts     # CSS/XPath/Regex structured extraction
│   ├── dispatcher/
│   │   ├── model.ts          # Job schema + KV persistence/idempotency
│   │   └── runner.ts         # Job execution and retry orchestration
│   ├── deepcrawl/
│   │   ├── bfs.ts            # BFS/BestFirst traversal core
│   │   ├── filters.ts        # Crawl filters (domains, patterns, content hints)
│   │   └── scorers.ts        # Keyword/domain scoring for BestFirst strategy
│   ├── session/
│   │   └── profile.ts        # Session profile capture/replay (cookie/localStorage)
│   ├── observability/
│   │   └── metrics.ts        # Throughput/success/retry/backlog/latency snapshots
│   ├── templates/
│   │   ├── landing.ts        # Landing page HTML
│   │   ├── rendered.ts       # Markdown preview page HTML
│   │   ├── loading.ts        # SSE loading/progress page HTML
│   │   └── error.ts          # Error page HTML
│   └── __tests__/            # 37 test files
├── docs/
│   └── slo-reference.md      # SLO targets used by /api/health operational metrics
├── scripts/
│   └── smoke-api.sh          # End-to-end API smoke checks for deployed/local worker
├── package.json
├── wrangler.toml             # Worker config: browser, KV, R2 bindings
├── tsconfig.json
├── vitest.config.ts
└── .gitignore

Deployment

This project uses Cloudflare Git Integration — push to main and Cloudflare automatically builds and deploys.

Setup (one-time)

Fork or push this repo to GitHub

Create required resources:

# Create KV namespace
wrangler kv namespace create CACHE_KV
# Update the namespace ID in wrangler.toml

# Create R2 bucket
wrangler r2 bucket create md-images

Go to Cloudflare Dashboard > Workers & Pages > Create > Import a Git repository
Select the GitHub repo — Cloudflare will deploy automatically on every push to main

Secrets / Runtime Variables

# Required: Bearer auth for protected write APIs
# Used by: /api/batch, /api/extract, /api/jobs, /api/deepcrawl
wrangler secret put API_TOKEN

# Optional: protect raw convert API + /api/stream
wrangler secret put PUBLIC_API_TOKEN

# Optional: dynamic paywall rules (JSON array)
wrangler secret put PAYWALL_RULES_JSON

# Optional: single upstream proxy (format: username:password@host:port)
wrangler secret put PROXY_URL

# Optional: proxy pool for rotation/fallback (comma or newline separated)
wrangler secret put PROXY_POOL

Optional KV-driven paywall rule source:

Set PAYWALL_RULES_KV_KEY (plain env var) to a KV key that stores JSON paywall rules.
If both PAYWALL_RULES_JSON and KV key are configured, KV value takes precedence.
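
The precedence between the two sources can be expressed as a short resolver. This is a sketch of the rule only: `resolve_paywall_rules` is a hypothetical name, the rule schema is not documented here and is treated as opaque, and it assumes the KV value is a JSON array like `PAYWALL_RULES_JSON`.

```python
import json
from typing import Any, List, Optional


def resolve_paywall_rules(env_json: Optional[str], kv_value: Optional[str]) -> List[Any]:
    """Pick the active rule set: the KV value wins over PAYWALL_RULES_JSON."""
    source = kv_value if kv_value else env_json
    if not source:
        return []  # neither source configured
    rules = json.loads(source)
    if not isinstance(rules, list):
        raise ValueError("paywall rules must be a JSON array")
    return rules
```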

Example plain env var in wrangler.toml:

[vars]
PAYWALL_RULES_KV_KEY = "paywall:rules:v1"

Browser Rendering Binding

[browser]
binding = "MYBROWSER"

Note: Browser Rendering requires a Workers Paid plan. It only works in deployed Workers or with wrangler dev --remote.

Custom Domain

In Cloudflare Dashboard > Workers & Pages > your Worker > Settings > Domains & Routes
Add your custom domain (e.g. md.example.com)

Local Development

npm install
npm run dev           # Local dev at http://localhost:8787
npm run build         # Dry-run bundle to dist/
npm run typecheck     # Type check
npm test              # Run unit tests
npm run test:watch    # Watch mode
npm run test:coverage # Coverage
npm run smoke:api     # API smoke checks (requires BASE_URL + API_TOKEN env vars)

Checkpoint behavior:

Deep crawl checkpoint persistence is only enabled when you provide checkpoint options such as crawl_id, resume, snapshot_interval, or ttl_seconds.
If you omit checkpoint, the API still returns a crawlId for tracing, but no checkpoint record is written.
Resume requests must match the original crawl configuration; changing filters, scoring, or fetch options returns 409 Conflict.

Smoke example:

BASE_URL="https://md.genedai.me" \
API_TOKEN="<api-token>" \
TARGET_URL="https://example.com" \
npm run smoke:api

Validation Workflow (2026-03-06)

Use Node 22 locally (see .nvmrc) or rely on GitHub Actions in .github/workflows/ci.yml:

Check	Command
Type safety	`npm run typecheck`
Unit/integration tests	`npm test`
Coverage	`npm run test:coverage`
Worker bundle dry-run	`npm run build`
Live health check	`curl https://website2markdown.genedai.workers.dev/api/health`
Live public conversion	`curl "https://website2markdown.genedai.workers.dev/https://example.com?raw=true"`

Production note:

Protected write APIs (/api/extract, /api/jobs*, /api/deepcrawl, /api/batch) require API_TOKEN.
If API_TOKEN is not configured in the deployed Worker, these endpoints return 503 (API_TOKEN not set).

License

MIT