octen-mcp

MCP server for Octen Extract — turn any URL into clean, LLM-ready markdown. Plug into Claude / Cursor / VS Code / Windsurf and let the model pull the live web.

Why this MCP

Most extract tools (Firecrawl, Jina Reader, Exa, Tavily) hand you the page body. Octen returns the body plus structured page labels in the same call:

category: topical labels with subcategories (e.g., Computers, Electronics & Technology / Artificial Intelligence, Health, Finance, Travel). Use it to skip out-of-vertical pages in RAG pipelines; a finance pipeline, for example, can filter out unrelated forum or entertainment pages before embedding.
page_structure: what kind of page this actually is (e.g., Content Page / Article, Homepage, Index Page, No Main Content). Use it to skip listing/navigation pages, dead links, and login-wall shells before paying for LLM calls. In real RAG pipelines, a meaningful share of fetched URLs (often 20–30%) are index pages or content-less shells.
highlights: pass a query and get the most relevant snippets for each page, ranked, instead of the full body (cheaper context, better signal).

The two labels move filtering upstream — instead of fetching everything, embedding it, then realizing a chunk of pages are useless, you skip them at fetch time. None of category / page_structure / highlights exist in Firecrawl, Jina, Exa, or Tavily today.
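A minimal sketch of that upstream filter, assuming each result is a dict shaped like the success object in the response example (with `category` and `page_structure` each holding `primary` and `secondary` strings). The helper names and the set of structures to skip are illustrative choices, not part of the server.

```python
from typing import Any

# Assumption: these are the page_structure labels worth skipping before
# embedding; adjust the set to your own pipeline.
SKIP_STRUCTURES = {"Index Page", "No Main Content"}


def labels(result: dict[str, Any], field: str) -> set[str]:
    """Collect the primary and secondary label strings for one field."""
    value = result.get(field) or {}
    return {v for v in (value.get("primary"), value.get("secondary")) if v}


def keep_for_embedding(result: dict[str, Any], wanted_category: str) -> bool:
    """True if the page fetched successfully, has real content,
    and carries the wanted category in either label slot."""
    if result.get("status") != "success":
        return False
    if labels(result, "page_structure") & SKIP_STRUCTURES:
        return False
    wanted = wanted_category.lower()
    return any(wanted in label.lower() for label in labels(result, "category"))
```

Calling `keep_for_embedding(result, "Finance")` on every result before chunking keeps index pages and off-topic pages out of the embedding step.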

When `success` isn't enough

A common failure mode for extract pipelines: the request returns success, the response body is non-empty, but the page is actually a login wall, paywall, JS shell, or "we'll be right back" stub. The agent has no signal until it pays for an LLM call to discover the page has nothing to summarize. Octen flags these at fetch time.

Take https://github.com/login — visually it looks like a normal page:

[Screenshot: GitHub's login page, showing a form with email/password fields and 'Continue with Google/Apple' buttons, and no article content]

But there's no main content to extract — it's a sign-in form. Same URL on both APIs returns very different signals:

Firecrawl `/v1/scrape`	Octen `/extract` (this server)

That single page_structure: "No Main Content" lets the agent skip the page without an LLM call. With other tools, the agent only finds out by spending tokens to summarize an empty page — at scale, a real chunk of the token bill.
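As a rough illustration of that gate, the sketch below only calls a summarizer when the result looks worth summarizing. It assumes the result dict follows the response example, and because this README does not say whether "No Main Content" appears as the primary or the secondary label, it checks both. `summarize` stands in for whatever LLM call your agent makes.

```python
from typing import Any, Callable, Optional


def summarize_if_worthwhile(
    result: dict[str, Any],
    summarize: Callable[[str], str],
) -> Optional[str]:
    """Return a summary, or None when the page should be skipped
    without spending any tokens."""
    if result.get("status") != "success":
        return None
    structure = result.get("page_structure") or {}
    names = {structure.get("primary"), structure.get("secondary")}
    if "No Main Content" in names:
        # Login wall, paywall, JS shell, or maintenance stub: skip it.
        return None
    body = result.get("full_content") or ""
    return summarize(body) if body.strip() else None
```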

Quick start

VS Code users: use the one-click install badge, which prompts for your Octen API key on install (grab one at octen.ai first).

For other clients, configure manually:

Claude Desktop

Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "octen": {
      "command": "npx",
      "args": ["-y", "octen-mcp"],
      "env": {
        "OCTEN_API_KEY": "your-key-here"
      }
    }
  }
}

Quit and reopen Claude Desktop. Ask "fetch octen.ai and summarize" — Claude routes to the extract tool automatically.
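If the config file already lists other servers, hand-editing risks clobbering them. The following is a small sketch of merging the entry programmatically; the function names are hypothetical, and `root_key` is a parameter because the VS Code workspace file uses `servers` rather than `mcpServers`.

```python
import json
from pathlib import Path


def add_octen_server(config: dict, api_key: str, root_key: str = "mcpServers") -> dict:
    """Return a copy of config with the octen entry added under root_key,
    leaving any other servers and settings untouched."""
    merged = dict(config)
    servers = dict(merged.get(root_key) or {})
    servers["octen"] = {
        "command": "npx",
        "args": ["-y", "octen-mcp"],
        "env": {"OCTEN_API_KEY": api_key},
    }
    merged[root_key] = servers
    return merged


def update_config_file(path: Path, api_key: str) -> None:
    """Read the JSON config at path (if present), merge, and write it back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    merged = add_octen_server(config, api_key)
    path.write_text(json.dumps(merged, indent=2) + "\n", encoding="utf-8")
```

Back up the original file before running anything like `update_config_file` against it.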

Cursor

Add to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "octen": {
      "command": "npx",
      "args": ["-y", "octen-mcp"],
      "env": { "OCTEN_API_KEY": "your-key-here" }
    }
  }
}

VS Code (workspace `.vscode/mcp.json`)

The one-click badges above handle the user-level install. For a per-workspace config:

{
  "servers": {
    "octen": {
      "command": "npx",
      "args": ["-y", "octen-mcp"],
      "env": { "OCTEN_API_KEY": "your-key-here" }
    }
  }
}

Claude Code (CLI)

One line, no JSON editing:

claude mcp add --scope user octen \
  -e OCTEN_API_KEY=your-key-here \
  -- npx -y octen-mcp

`--scope user` makes it available from any directory. Verify with `claude mcp list`; it should show `octen: ✓ Connected`.

Windsurf / Cline / other MCP clients

The same `npx -y octen-mcp` command, with the `OCTEN_API_KEY` environment variable set, works in any MCP-compatible client.

Tool reference: `extract`

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `urls` | `string[]` | required | 1–20 URLs per call. Bare hosts like `octen.ai` are auto-prefixed with `https://`. |
| `query` | `string` | none | Intent-focused keywords. When set, results contain `highlights` instead of `full_content`. Max 500 chars. |
| `max_age_seconds` | `int` | `86400` | Cache TTL in seconds (min 300). Lower this for time-sensitive pages (news, prices). |
| `format` | `markdown` \| `text` | `markdown` | Output content format. |
| `timeout` | `int` | `30` | Per-URL extraction timeout, 1–60 seconds. |
| `include_images` | `bool` | `false` | Include image URLs found on each page. |
| `include_videos` | `bool` | `false` | Include video URLs found on each page. |
| `include_audio` | `bool` | `false` | Include audio URLs found on each page. |
| `include_favicon` | `bool` | `false` | Include each page's favicon URL. |

Full API reference: docs.octen.ai/api-reference/extract.
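For callers that assemble `extract` arguments in their own code, the limits in the parameter table can be checked before the call is made. This is a sketch under the assumption that the documented limits are enforced as stated; the function names are hypothetical, and the server's own validation remains authoritative.

```python
from typing import Any, Optional

MAX_URLS_PER_CALL = 20  # from the `urls` row: 1-20 URLs per call


def normalize_url(url: str) -> str:
    """Mirror the documented behaviour: bare hosts get an https:// prefix."""
    url = url.strip()
    return url if "://" in url else f"https://{url}"


def batches(urls: list[str], size: int = MAX_URLS_PER_CALL) -> list[list[str]]:
    """Split a long URL list into chunks that fit one call each."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_extract_args(
    urls: list[str],
    query: Optional[str] = None,
    max_age_seconds: int = 86400,
    timeout: int = 30,
    fmt: str = "markdown",
) -> dict[str, Any]:
    """Validate against the documented limits and build the argument dict."""
    if not 1 <= len(urls) <= MAX_URLS_PER_CALL:
        raise ValueError("urls must contain 1-20 entries")
    if query is not None and len(query) > 500:
        raise ValueError("query is limited to 500 characters")
    if max_age_seconds < 300:
        raise ValueError("max_age_seconds must be at least 300")
    if not 1 <= timeout <= 60:
        raise ValueError("timeout must be between 1 and 60 seconds")
    if fmt not in ("markdown", "text"):
        raise ValueError("format must be 'markdown' or 'text'")
    args: dict[str, Any] = {
        "urls": [normalize_url(u) for u in urls],
        "max_age_seconds": max_age_seconds,
        "format": fmt,
        "timeout": timeout,
    }
    if query:
        args["query"] = query
    return args
```

For a list longer than 20 URLs, build one argument dict per chunk returned by `batches`.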

Response example

One result object per URL. Success shape:

{
  "url": "https://en.wikipedia.org/wiki/Model_Context_Protocol",
  "status": "success",
  "title": "Model Context Protocol - Wikipedia",
  "category": {
    "primary": "Computers, Electronics & Technology",
    "secondary": "Programming and Developer Software"
  },
  "page_structure": {
    "primary": "Content Page",
    "secondary": "Encyclopedia"
  },
  "time_published": "2024-11-25T00:00:00Z",
  "time_last_crawled": "2026-05-21T08:14:22Z",
  "full_content": "# Model Context Protocol\n\n…clean markdown body…"
}

When query is set, full_content is replaced by "highlights": ["…ranked snippet 1…", "…ranked snippet 2…"]. When include_images / include_videos / include_audio / include_favicon are set, the corresponding fields appear alongside.
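Because the body field changes with `query`, downstream code has to handle both shapes. A minimal sketch, assuming the field names shown above and treating any non-success result as having no text:

```python
from typing import Any


def context_text(result: dict[str, Any]) -> str:
    """Return the text to hand to the model for one result:
    ranked highlights when present, otherwise the full body."""
    if result.get("status") != "success":
        return ""
    highlights = result.get("highlights")
    if highlights:
        # Snippets arrive already ranked, so keep their order.
        return "\n\n".join(highlights)
    return result.get("full_content") or ""
```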

Failure shape (e.g., 404 / DNS / 5xx — see the edge cases section below):

{
  "url": "https://httpbin.org/status/404",
  "status": "failed",
  "error_message": "Target returned HTTP 404"
}

Example prompts to try

Differentiating use-cases (these exercise Octen's per-page labels):

Fetch these 10 URLs and only summarize the ones whose category is Finance. (filter by category)
Fetch these search results and skip any whose page_structure is Index Page or that come back as failed. (filter by page_structure)
Pull octen.ai/pricing and confirm its page_structure is a content page, not a redirect or empty shell. (page_structure validation)
Search 'pricing' across firecrawl.dev — return only the relevant highlights. (triggers query → highlights)

Basic fetch use-cases:

Fetch octen.ai and summarize the main product features.
Compare the positioning of firecrawl.dev and octen.ai.
What does the Hacker News front page say right now? Pull the top 5 story titles.

How Octen handles edge cases

For the silent-success case (login walls / shells), see When success isn't enough above. Other failure modes come back as structured status: failed results, not empty markdown:

| Scenario | Example URL | Octen response | Why it's useful |
| --- | --- | --- | --- |
| Hard 404 | `https://httpbin.org/status/404` | `status: failed`, `error_message: "Target returned HTTP 404"` | The agent knows the URL is dead, so there is no need to retry. |
| Server error (5xx) | `https://httpbin.org/status/500` | `status: failed`, `error_message: "Target server error (HTTP 500)"` | Distinguishes a server-side outage from a dead page, so the URL can safely be retried later. |
| DNS failure / dead domain | `https://nonexistent-zzz-fake-xyz.invalid` | `status: failed`, `error_message: "Failed to resolve domain"` | Distinguishes "domain doesn't exist" from "page doesn't exist", which call for different remediation. |
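One way an agent might act on these failures is to map each `error_message` to a follow-up action. The sketch below assumes the message wording matches the three examples in the table; matching on message text is fragile, so anything unrecognized falls through to `UNKNOWN`. The `Action` names are hypothetical.

```python
import re
from enum import Enum


class Action(Enum):
    DROP_URL = "drop_url"        # page is gone; do not retry
    RETRY_LATER = "retry_later"  # server-side outage; retry with backoff
    DROP_DOMAIN = "drop_domain"  # domain does not resolve
    UNKNOWN = "unknown"          # message not recognized


def classify_failure(error_message: str) -> Action:
    """Map a failed result's error_message to a follow-up action."""
    match = re.search(r"HTTP (\d{3})", error_message)
    if match:
        code = int(match.group(1))
        if 500 <= code <= 599:
            return Action.RETRY_LATER
        if code == 404:
            return Action.DROP_URL
        return Action.UNKNOWN
    if "resolve domain" in error_message.lower():
        return Action.DROP_DOMAIN
    return Action.UNKNOWN
```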

Environment variables

| Variable | Required | Default | Notes |
| --- | --- | --- | --- |
| `OCTEN_API_KEY` | ✅ | none | Get one at octen.ai |
| `OCTEN_API_URL` | optional | `https://api.octen.ai` | Override for staging or self-hosted |
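The precedence in the table (required key, optional URL with a default) can be expressed as a few lines of code. This sketch is for a script of your own that reuses the same variables; it is not the server's actual startup code, and `OctenConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_API_URL = "https://api.octen.ai"


@dataclass(frozen=True)
class OctenConfig:
    api_key: str
    api_url: str


def load_config(env: Mapping[str, str] = os.environ) -> OctenConfig:
    """Read the two variables, failing fast when the key is missing."""
    api_key = env.get("OCTEN_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OCTEN_API_KEY is required; get one at octen.ai")
    # An unset or empty OCTEN_API_URL falls back to the documented default.
    api_url = env.get("OCTEN_API_URL", "").strip() or DEFAULT_API_URL
    return OctenConfig(api_key=api_key, api_url=api_url.rstrip("/"))
```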

Local development

git clone https://github.com/Octen-Team/octen-mcp.git
cd octen-mcp
npm install
npm run build
OCTEN_API_KEY=<key> npm run inspect    # opens MCP Inspector

Tip — make Claude prefer this tool

If your client also has a built-in web-fetch tool, drop a hint in Claude Desktop's Customize / Project Instructions:

When the user asks to fetch or extract content from a URL, prefer the extract tool from the octen MCP server. Use query whenever the user is looking for something specific on the page (returns ranked highlights, not the whole body).

With the hint in place, a single tool call classifies three mixed URLs (article / homepage / discussion) in one shot:

[Image: overview demo]
