octoparse-mcp

Security Audit: Fail
Health Warn
  • License — License: MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 5 GitHub stars
Code Fail
  • rm -rf — Recursive force deletion command in package.json
  • network request — Outbound network request in package.json
  • network request — Outbound network request in src/api/clients/http-client-factory.ts
  • process.env — Environment variable access in src/api/protected-resource.ts
  • process.env — Environment variable access in src/config/app-config.ts
Permissions Pass
  • Permissions — No dangerous permissions requested
Purpose
This server acts as a bridge for AI models, enabling them to interact with the Octoparse platform to search for, execute, and export web scraping workflows via a standardized API.

Security Assessment
The overall risk is rated as Medium. The tool accesses environment variables to manage API keys and server configurations, which is standard practice, though these credentials must be protected locally. It makes several outbound network requests to communicate with the official Octoparse API servers to fetch and export scraped data. A notable failure in the scan was the detection of a recursive force deletion command (`rm -rf`) within the `package.json` file. While this is frequently used in build scripts to clean directories, it is a potential risk that requires manual verification to ensure it does not target unexpected paths. No hardcoded secrets or explicitly dangerous system permissions were found.

Quality Assessment
The codebase is licensed under the permissive MIT license and appears to be actively maintained, with its most recent push occurring today. However, community trust and visibility are currently very low. The repository has only 5 stars on GitHub, indicating that the project has not yet been widely tested or vetted by the broader open-source community.

Verdict
Use with caution: verify that the `rm -rf` commands in the build scripts target only expected paths, and be aware of the project's low community vetting.
Summary

Official Octoparse MCP server for AI-powered web scraping workflows.

Octoparse MCP Server

This repository is the AI-native, workflow-focused version of the Octoparse MCP Server.

The server now exposes only 3 tools:

  1. search_templates
  2. execute_task
  3. export_data

The goal is to shorten the tool chain, reduce token usage, and make the scraping flow much easier for LLMs to execute reliably.

Workflow

Canonical flow:

search_templates → execute_task → export_data

What each tool does:

  • search_templates finds runnable templates and returns recommendedTemplate
  • execute_task is a dual-mode tool: validateOnly=true runs synchronous preflight validation, while normal execution creates and starts an Octoparse cloud task
  • export_data is the follow-up entrypoint for non-task clients and the unified export tool after task execution completes
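
The three-step flow above can be sketched as a small client-side helper. This is only an illustration under stated assumptions: `ToolCaller` is a hypothetical stand-in for whatever MCP client you use, the `templateName` field inside `recommendedTemplate` is an assumed shape, and the non-task response is assumed to carry `taskId` as a string.

```typescript
// Hypothetical transport abstraction: sends one tool call and resolves with its JSON result.
type ToolArgs = Record<string, unknown>;
type ToolResult = Record<string, unknown>;
type ToolCaller = (tool: string, args: ToolArgs) => Promise<ToolResult>;

async function runCanonicalFlow(
  callTool: ToolCaller,
  keyword: string,
  parameters: Record<string, string[]>,
): Promise<ToolResult> {
  // 1. Find a runnable template. The inner field name is an assumption.
  const search = await callTool("search_templates", { keyword });
  const recommended = search["recommendedTemplate"] as
    | { templateName?: unknown }
    | undefined;
  const templateName = recommended?.templateName;
  if (typeof templateName !== "string") {
    throw new Error("search_templates did not recommend a runnable template");
  }

  // 2. Create and start the cloud task (non-task mode: returns right away).
  const started = await callTool("execute_task", { templateName, parameters });
  const taskId = started["taskId"];
  if (typeof taskId !== "string") {
    throw new Error("execute_task did not return a taskId");
  }

  // 3. Follow up with export_data once the task has produced rows.
  return callTool("export_data", { taskId, mode: "summary" });
}
```

In practice a client would likely wait or retry between steps 2 and 3, since the task may not have collected any rows immediately after it is accepted.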

Current Capabilities

search_templates

  • Searches templates by keyword
  • Separates API relevance from likes-based browsing order
  • Returns recommendedTemplate
  • Provides explicit local-only guidance when the best match cannot run in cloud

execute_task

  • Accepts only templateName + parameters
  • Builds the low-level Octoparse parameter structure server-side
  • Supports validateOnly=true
  • Supports optional MCP task execution
  • In task mode, follow runtime state through tasks/get and tasks/result
  • In non-task mode, returns accepted + taskId immediately after create/start succeeds; then follow up with export_data(taskId)
  • Supports targetMaxRows
  • targetMaxRows > 0 only takes effect in task mode and enables threshold-stop behavior
  • targetMaxRows = 0 or omitting the field means run to natural completion
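
The `targetMaxRows` rules above reduce to a small decision function. The sketch below is not the server's code; `StopPolicy` and `resolveStopPolicy` are hypothetical names used only to make the three cases explicit.

```typescript
// Hypothetical names; only the decision rules come from the README.
type StopPolicy =
  | { kind: "natural-completion" }
  | { kind: "threshold-stop"; maxRows: number };

function resolveStopPolicy(
  targetMaxRows: number | undefined,
  taskMode: boolean,
): StopPolicy {
  // A positive threshold only takes effect in task mode.
  if (taskMode && targetMaxRows !== undefined && targetMaxRows > 0) {
    return { kind: "threshold-stop", maxRows: targetMaxRows };
  }
  // 0, an omitted field, or non-task mode: run to natural completion.
  return { kind: "natural-completion" };
}
```

Modelling the outcome as a tagged union keeps callers from having to re-check the `> 0` and task-mode conditions in several places.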

export_data

  • Preview mode by default
  • mode=inline for larger inline payloads
  • mode=summary for columns + sample rows only
  • Marks rows exported only when preview/inline returns the full pending set
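
The mark-as-exported rule can be read as a predicate over the export mode and the row counts. The following is a minimal sketch, assuming the default mode is spelled `"preview"` and that the server can compare the number of rows returned with the number still pending; `shouldMarkExported` is a hypothetical helper, not part of the project.

```typescript
// "preview" as the literal default value is an assumption; the README only
// says that preview mode is the default.
type ExportMode = "preview" | "inline" | "summary";

function shouldMarkExported(
  mode: ExportMode | undefined,
  returnedRows: number,
  pendingRows: number,
): boolean {
  const effectiveMode: ExportMode = mode ?? "preview";
  // summary returns only columns and sample rows, so nothing is marked.
  if (effectiveMode === "summary") return false;
  // preview/inline: mark only when the response covers the full pending set.
  return returnedRows >= pendingRows;
}
```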

Requirements

  • Node.js 18+
  • npm 9+
  • Access to the Octoparse Client API

Install

npm install

Common Environment Variables

NODE_ENV=development
PORT=8080
HOST=0.0.0.0

SERVER_NAME=octoparse-mcp-server
SERVER_VERSION=1.0.0

CLIENTAPI_BASE_URL=https://pre-v2-clientapi.octoparse.com
OFFICIAL_SITE_URL=https://pre.octoparse.com

HTTP_TIMEOUT=30000
HTTP_RETRIES=3
HTTP_RETRY_DELAY=1000

SEARCH_TEMPLATE_PAGE_SIZE=8
EXECUTE_TASK_POLL_MAX_MINUTES=10

TRANSPORT_IDLE_TTL_SECONDS=1800
TRANSPORT_CLEANUP_INTERVAL_SECONDS=300

LOG_LEVEL=debug
LOG_ENABLE_CONSOLE=true
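
As an illustration of how a few of these variables might be parsed and validated at startup, here is a minimal sketch. It is not taken from `src/config/app-config.ts`. Several things are assumptions: the listed values are used as fallbacks, the HTTP timings are treated as milliseconds, `CLIENTAPI_BASE_URL` is treated as required, and `loadConfig` and `readInt` are hypothetical names.

```typescript
// The environment is passed in explicitly; in a Node process you would pass process.env.
type Env = Record<string, string | undefined>;

interface AppConfig {
  port: number;
  host: string;
  clientApiBaseUrl: string;
  http: { timeoutMs: number; retries: number; retryDelayMs: number };
}

// Parse a non-negative integer variable, falling back when it is unset or empty.
function readInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): AppConfig {
  const clientApiBaseUrl = env["CLIENTAPI_BASE_URL"];
  if (!clientApiBaseUrl) {
    // Assumption: the sketch refuses to start without an API base URL.
    throw new Error("CLIENTAPI_BASE_URL is required");
  }
  return {
    port: readInt(env, "PORT", 8080),
    host: env["HOST"] ?? "0.0.0.0",
    clientApiBaseUrl,
    http: {
      timeoutMs: readInt(env, "HTTP_TIMEOUT", 30000), // unit assumed to be ms
      retries: readInt(env, "HTTP_RETRIES", 3),
      retryDelayMs: readInt(env, "HTTP_RETRY_DELAY", 1000), // unit assumed to be ms
    },
  };
}
```

Failing fast on a malformed value such as `PORT=abc` tends to be easier to debug than silently falling back to a default.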

Start

npm run build
npm run start

For local development:

npm run dev

MCP Endpoints

The server listens on:

  • POST /
  • GET /
  • DELETE /

Health checks:

  • GET /hc
  • GET /liveness

Authentication:

  • Authorization: Bearer <token>
  • or X-API-Key: <api-key>
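
To show how the endpoint and authentication options fit together, the sketch below builds the pieces of a `POST /` request that carries an MCP `tools/call` message. It assumes the server speaks MCP's Streamable HTTP transport (suggested by the POST/GET/DELETE trio, but not stated in this README) and omits initialization and session handling. `buildToolCallRequest` and `Credentials` are hypothetical names.

```typescript
// Either credential style listed above may be used.
type Credentials = { bearerToken: string } | { apiKey: string };

interface HttpRequestSpec {
  method: "POST";
  path: "/";
  headers: Record<string, string>;
  body: string;
}

function buildToolCallRequest(
  id: number,
  tool: string,
  args: Record<string, unknown>,
  credentials: Credentials,
): HttpRequestSpec {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
    // Streamable HTTP clients accept both plain JSON and SSE responses.
    Accept: "application/json, text/event-stream",
  };
  if ("bearerToken" in credentials) {
    headers["Authorization"] = `Bearer ${credentials.bearerToken}`;
  } else {
    headers["X-API-Key"] = credentials.apiKey;
  }
  // MCP tool invocations are JSON-RPC 2.0 requests with method "tools/call".
  const body = JSON.stringify({
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: tool, arguments: args },
  });
  return { method: "POST", path: "/", headers, body };
}
```

The returned object can be handed to any HTTP client; a real session would normally send an `initialize` request first.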

Tool Examples

1. Search a template

{
  "tool": "search_templates",
  "arguments": {
    "keyword": "amazon"
  }
}

2. Validate parameters without creating a task

{
  "tool": "execute_task",
  "arguments": {
    "templateName": "amazon-product-scraper",
    "validateOnly": true,
    "parameters": {
      "SearchKeyword": ["iphone"]
    }
  }
}

3. Start a cloud task and return accepted immediately

{
  "tool": "execute_task",
  "arguments": {
    "templateName": "amazon-product-scraper",
    "parameters": {
      "SearchKeyword": ["iphone"]
    }
  }
}

4. Use threshold-stop in task mode

{
  "tool": "execute_task",
  "arguments": {
    "templateName": "amazon-product-scraper",
    "parameters": {
      "SearchKeyword": ["iphone"]
    },
    "targetMaxRows": 100
  }
}

Notes:

  • This call should use MCP task augmentation
  • When targetMaxRows > 0, the server polls in the background and best-effort requests stopTask near the threshold
  • targetMaxRows = 0 means no threshold stop
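
The threshold-stop behavior described in these notes might look roughly like the polling loop below. This is a sketch, not the server's implementation: `PollDeps`, `probe`, and `pollUntilThreshold` are hypothetical, and only `stopTask` is a name that appears in this README. The poll budget stands in for a limit such as `EXECUTE_TASK_POLL_MAX_MINUTES`.

```typescript
interface TaskProbe {
  rows: number;      // rows collected so far
  finished: boolean; // true once the cloud task has ended on its own
}

// Injected dependencies keep the loop testable without a real API.
interface PollDeps {
  probe(): Promise<TaskProbe>;
  stopTask(): Promise<void>;
  sleep(ms: number): Promise<void>;
}

type PollOutcome = "finished" | "stopped" | "timed-out";

async function pollUntilThreshold(
  deps: PollDeps,
  targetMaxRows: number,
  maxPolls: number,
  intervalMs: number,
): Promise<PollOutcome> {
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const { rows, finished } = await deps.probe();
    if (finished) return "finished";
    // targetMaxRows = 0 disables the threshold entirely.
    if (targetMaxRows > 0 && rows >= targetMaxRows) {
      try {
        await deps.stopTask();
      } catch {
        // Best effort: a failed stop request is not treated as fatal here.
      }
      return "stopped";
    }
    await deps.sleep(intervalMs);
  }
  return "timed-out";
}
```

Because the row count is only sampled once per poll, the task can overshoot the threshold before the stop request lands, which is consistent with the "near the threshold" wording above.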

5. Export a summary

{
  "tool": "export_data",
  "arguments": {
    "taskId": "your-task-id",
    "mode": "summary"
  }
}

Validation

npm run build
npm test

The current regression coverage focuses on:

  • recommended template selection and local-only guidance
  • execute_task.validateOnly
  • non-task execute_task returning accepted
  • targetMaxRows=0 meaning natural completion
  • missing / unmapped parameter handling
  • export_data.summary not calling markExported

Design Priorities

This version optimizes for two things:

  1. Better agent usability: fewer tools, fewer low-level parameters, clearer next actions
  2. Better runtime stability: shorter call chains, tighter transport cleanup, lighter logging, and more compact error payloads
