Octoparse MCP Server

Name: octoparse-mcp
Author: octoparse


This repository is the AI-native, workflow-focused version of the Octoparse MCP Server.

The server now exposes only three tools:

search_templates
execute_task
export_data

The goal is to shorten the tool chain, reduce token usage, and make the scraping flow much easier for LLMs to execute reliably.

Workflow

Canonical flow:

search_templates → execute_task → export_data

What each tool does:

search_templates finds runnable templates and returns recommendedTemplate
execute_task is a dual-mode tool: validateOnly=true runs synchronous preflight validation, while normal execution creates and starts an Octoparse cloud task
export_data is the unified export tool: non-task clients call it as the follow-up to execute_task, and task-mode clients call it after task execution completes
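
The canonical flow can be sketched from the client side as follows. This is a minimal sketch, not the server's implementation: ToolCaller is a hypothetical stand-in for whatever MCP client helper sends a tool call, and the templateName field on recommendedTemplate and the taskId field on the execute_task result are assumptions about the response shapes.

```typescript
// Hypothetical helper type: any function that sends one MCP tool call and
// resolves with the tool's structured result.
type ToolCaller = (
  tool: string,
  args: Record<string, unknown>
) => Promise<Record<string, unknown>>;

// Runs search_templates → execute_task → export_data in non-task mode.
async function runCanonicalFlow(
  callTool: ToolCaller,
  keyword: string,
  parameters: Record<string, string[]>
): Promise<Record<string, unknown>> {
  const search = await callTool("search_templates", { keyword });

  // Assumption: recommendedTemplate carries a templateName field.
  const recommended = search.recommendedTemplate as
    | { templateName?: string }
    | undefined;
  if (!recommended?.templateName) {
    throw new Error("search_templates returned no recommendedTemplate to run");
  }

  const started = await callTool("execute_task", {
    templateName: recommended.templateName,
    parameters,
  });

  // Assumption: the non-task response exposes the task id as `taskId`.
  const taskId = started.taskId;
  if (typeof taskId !== "string") {
    throw new Error("execute_task did not return a taskId");
  }

  // Follow up with export_data(taskId), as described for non-task clients.
  return callTool("export_data", { taskId });
}
```

Injecting the caller keeps the sketch independent of any particular MCP client library; a real client would pass a function that performs the actual tool call.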

Current Capabilities

`search_templates`

Searches templates by keyword
Separates API relevance from likes-based browsing order
Returns recommendedTemplate
Provides explicit local-only guidance when the best match cannot run in cloud

`execute_task`

Accepts only templateName + parameters
Builds the low-level Octoparse parameter structure server-side
Supports validateOnly=true
Supports optional MCP task execution
In task mode, follow runtime state through tasks/get and tasks/result
In non-task mode, returns accepted and a taskId as soon as create/start succeeds; follow up with export_data(taskId)
Supports targetMaxRows
targetMaxRows > 0 only takes effect in task mode and enables threshold-stop behavior
targetMaxRows = 0 or omitting the field means run to natural completion
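
The interaction between execution mode and targetMaxRows can be summarized in a few lines. The sketch below only restates the rules listed above; ExecutionMode, ExecutionPlan, and planExecution are illustrative names, not part of the server.

```typescript
type ExecutionMode = "task" | "non-task";

interface ExecutionPlan {
  thresholdStop: boolean; // whether targetMaxRows will stop the run early
  followUp: string[]; // what the client calls next
}

function planExecution(
  mode: ExecutionMode,
  targetMaxRows?: number
): ExecutionPlan {
  // Omitting the field behaves like 0: run to natural completion.
  const rows = targetMaxRows ?? 0;
  return {
    // targetMaxRows > 0 only takes effect in task mode.
    thresholdStop: mode === "task" && rows > 0,
    followUp:
      mode === "task" ? ["tasks/get", "tasks/result"] : ["export_data"],
  };
}
```

For example, planExecution("non-task", 100) reports thresholdStop as false, because the threshold is ignored outside task mode.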

`export_data`

Defaults to preview mode
mode=inline returns larger inline payloads
mode=summary returns only columns and sample rows
Marks rows as exported only when preview or inline returns the full pending set
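
The marking rule can be written as a small predicate. This is a sketch under assumptions: the literal mode value "preview" and the row-count arguments are illustrative, and treating an empty pending set as "nothing to mark" is a guess about an edge case the rules above do not cover.

```typescript
type ExportMode = "preview" | "inline" | "summary";

function shouldMarkExported(
  mode: ExportMode,
  returnedRows: number,
  pendingRows: number
): boolean {
  // summary returns only columns and sample rows, so it never marks rows.
  if (mode === "summary") return false;
  // Assumption: with no pending rows there is nothing to mark.
  if (pendingRows === 0) return false;
  // preview/inline mark rows only when the full pending set was returned.
  return returnedRows >= pendingRows;
}
```

Under this rule a partial preview leaves the rows pending, so a later call can still export the complete set.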

Requirements

Node.js 18+
npm 9+
Access to the Octoparse Client API

Install

npm install

Common Environment Variables

NODE_ENV=development
PORT=8080
HOST=0.0.0.0

SERVER_NAME=octoparse-mcp-server
SERVER_VERSION=1.0.0

CLIENTAPI_BASE_URL=https://pre-v2-clientapi.octoparse.com
OFFICIAL_SITE_URL=https://pre.octoparse.com

HTTP_TIMEOUT=30000
HTTP_RETRIES=3
HTTP_RETRY_DELAY=1000

SEARCH_TEMPLATE_PAGE_SIZE=8
EXECUTE_TASK_POLL_MAX_MINUTES=10

TRANSPORT_IDLE_TTL_SECONDS=1800
TRANSPORT_CLEANUP_INTERVAL_SECONDS=300

LOG_LEVEL=debug
LOG_ENABLE_CONSOLE=true
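
One way to read these variables with fallbacks is sketched below. It is not the server's actual loader: the ServerConfig shape and the helper names are hypothetical, only a subset of the variables is covered, and using the values listed above as defaults is an assumption.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical config shape covering part of the variables listed above.
interface ServerConfig {
  port: number;
  host: string;
  clientApiBaseUrl: string;
  httpTimeout: number;
  httpRetries: number;
  httpRetryDelay: number;
  searchTemplatePageSize: number;
  executeTaskPollMaxMinutes: number;
}

// Reads an integer variable, falling back when it is unset or blank.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed)) {
    throw new Error(`${name} must be an integer, got "${raw}"`);
  }
  return parsed;
}

function loadConfig(env: Env): ServerConfig {
  return {
    port: intFromEnv(env, "PORT", 8080),
    host: env.HOST ?? "0.0.0.0",
    clientApiBaseUrl:
      env.CLIENTAPI_BASE_URL ?? "https://pre-v2-clientapi.octoparse.com",
    httpTimeout: intFromEnv(env, "HTTP_TIMEOUT", 30000),
    httpRetries: intFromEnv(env, "HTTP_RETRIES", 3),
    httpRetryDelay: intFromEnv(env, "HTTP_RETRY_DELAY", 1000),
    searchTemplatePageSize: intFromEnv(env, "SEARCH_TEMPLATE_PAGE_SIZE", 8),
    executeTaskPollMaxMinutes: intFromEnv(
      env,
      "EXECUTE_TASK_POLL_MAX_MINUTES",
      10
    ),
  };
}
```

In a Node.js process this would typically be called as loadConfig(process.env); passing the environment as an argument keeps the function easy to test.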

Start

npm run build
npm run start

For local development:

npm run dev

MCP Endpoints

The server listens on:

POST /
GET /
DELETE /

Health checks:

GET /hc
GET /liveness

Authentication:

Authorization: Bearer <token>
or X-API-Key: <api-key>
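
A client can build either header with a small helper. This is a sketch: Credentials and buildAuthHeaders are illustrative names, and preferring the bearer token when both credentials are supplied is an assumption, not documented server behavior.

```typescript
interface Credentials {
  token?: string;
  apiKey?: string;
}

function buildAuthHeaders(creds: Credentials): Record<string, string> {
  // Assumption: when both are present, the bearer token is used.
  if (creds.token) {
    return { Authorization: `Bearer ${creds.token}` };
  }
  if (creds.apiKey) {
    return { "X-API-Key": creds.apiKey };
  }
  throw new Error("Provide either a bearer token or an API key");
}
```

The returned object can be merged into the headers of each request sent to the MCP endpoints above.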

Tool Examples

1. Search a template

{
  "tool": "search_templates",
  "arguments": {
    "keyword": "amazon"
  }
}

2. Validate parameters without creating a task

{
  "tool": "execute_task",
  "arguments": {
    "templateName": "amazon-product-scraper",
    "validateOnly": true,
    "parameters": {
      "SearchKeyword": ["iphone"]
    }
  }
}

3. Start a cloud task and return `accepted` immediately

{
  "tool": "execute_task",
  "arguments": {
    "templateName": "amazon-product-scraper",
    "parameters": {
      "SearchKeyword": ["iphone"]
    }
  }
}

4. Use threshold-stop in task mode

{
  "tool": "execute_task",
  "arguments": {
    "templateName": "amazon-product-scraper",
    "parameters": {
      "SearchKeyword": ["iphone"]
    },
    "targetMaxRows": 100
  }
}

Notes:

Send this call with MCP task augmentation, because targetMaxRows > 0 only takes effect in task mode
When targetMaxRows > 0, the server polls in the background and best-effort requests stopTask near the threshold
targetMaxRows = 0 means no threshold stop
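
The background polling described in these notes might look roughly like the loop below. It is a sketch, not the server's code: TaskProbe and its three methods are hypothetical wrappers, the poll limit stands in for a setting such as EXECUTE_TASK_POLL_MAX_MINUTES, and stopping once the row count reaches the target is a simplification of "near the threshold".

```typescript
// Hypothetical view of a running cloud task.
interface TaskProbe {
  isFinished(): Promise<boolean>;
  getRowCount(): Promise<number>;
  stopTask(): Promise<void>;
}

type PollOutcome = "completed" | "stop-requested" | "timed-out";

async function pollWithThreshold(
  probe: TaskProbe,
  targetMaxRows: number,
  maxPolls: number,
  sleep: () => Promise<void>
): Promise<PollOutcome> {
  for (let i = 0; i < maxPolls; i++) {
    if (await probe.isFinished()) return "completed";

    // targetMaxRows = 0 disables the threshold check entirely.
    if (targetMaxRows > 0 && (await probe.getRowCount()) >= targetMaxRows) {
      try {
        await probe.stopTask();
      } catch {
        // Best effort: a failed stop request is ignored in this sketch.
      }
      return "stop-requested";
    }

    await sleep();
  }
  return "timed-out";
}
```

Because the stop is best effort, the final row count can exceed targetMaxRows; callers should treat the value as a target rather than a hard cap.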

5. Export a summary

{
  "tool": "export_data",
  "arguments": {
    "taskId": "your-task-id",
    "mode": "summary"
  }
}
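
The examples above use a compact tool/arguments shape. Assuming that shape is shorthand and the server accepts standard MCP JSON-RPC messages on POST /, a client would wrap each example in a tools/call request roughly as sketched here; ToolExample and toToolsCallRequest are illustrative names.

```typescript
// The compact shape used by the examples in this section.
interface ToolExample {
  tool: string;
  arguments: Record<string, unknown>;
}

// JSON-RPC 2.0 envelope for an MCP tools/call request.
interface ToolsCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function toToolsCallRequest(
  example: ToolExample,
  id: number
): ToolsCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: example.tool, arguments: example.arguments },
  };
}
```

For instance, toToolsCallRequest({ tool: "export_data", arguments: { taskId: "your-task-id", mode: "summary" } }, 1) yields a request whose params.name is "export_data". Most MCP client SDKs build this envelope automatically.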

Validation

npm run build
npm test

The current regression coverage focuses on:

recommended template selection and local-only guidance
execute_task.validateOnly
non-task execute_task returning accepted
targetMaxRows=0 meaning natural completion
missing / unmapped parameter handling
export_data.summary not calling markExported

Design Priorities

This version optimizes for two things:

Better agent usability: fewer tools, fewer low-level parameters, clearer next actions
Better runtime stability: shorter call chains, tighter transport cleanup, lighter logging, and more compact error payloads