markitdown
mcp
Passed
Health: Passed
- License — License: MIT
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Community trust — 815 GitHub stars
Code: Passed
- Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions: Passed
- Permissions — No dangerous permissions requested
Purpose
This tool converts common office documents (PDF, Word, Excel, HTML, images, and ZIP archives) into Markdown format via a Java CLI or an MCP server. It also supports extracting text from images and scanned PDFs using configurable OCR backends.
Security Assessment
Overall risk: Low. The light code audit scanned 12 files and found no dangerous patterns, hardcoded secrets, or dangerous permissions. The tool requires standard file system read/write access to process documents. Network requests are strictly limited to user-configured external OCR services (such as remote PaddleOCR APIs). Because users must explicitly provide the endpoint and API key via command-line arguments, environment variables, or a local configuration file, the risk of unintended data exfiltration is minimal.
Quality Assessment
The project demonstrates strong health and reliability. It is actively maintained, with its most recent push occurring today, and has garnered 815 GitHub stars, indicating solid community trust. The repository is transparent, providing extensive documentation, automated tests (Maven), and a large dataset of test files to ensure stability. Additionally, the code is fully open-source under the permissive MIT license, making it highly accessible for integration and modification.
Verdict
Safe to use.
markitdown:CLI;MCP
README.md
markitdown
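markitdown is a repository for converting documents to Markdown. Its main deliverable is currently the markitdown4j Java CLI. It converts common office documents, web pages, images, archives, and some audio metadata to Markdown, and it can switch between OCR backends through a single unified configuration.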
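What this project can do
- Convert PDF, Word, Excel, PowerPoint, HTML, images, text, and ZIP files to Markdown
- Use OCR to supplement recognition of images and scanned PDFs
- Ship multi-platform artifacts: lite, full, win32, win64, linux64, mac
- Use one unified OCR configuration, so switching backends does not require changing the conversion workflow
- Use remote OCR providers, for example paddleocr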
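Repository structure
- java/README.md: main documentation for the Java CLI project
- java/COMMAND_REFERENCE.md: command and option reference
- test/README.md: test dataset and verification notes
- OCR_PROVIDER_ROADMAP.md: roadmap for OCR / VLM extensions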
Quick start
- Install Java 11 or later
- Download the release package that fits your platform
- Run the conversion command
Example:
java -jar target/markitdown4j-0.0.3-lite.jar document.pdf -o output.md
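If the conversion needs to be triggered from another Java program, one option is to launch the CLI as a child process. The sketch below only reuses the command shape shown in the example above (`java -jar <jar> <input> -o <output>`). The class name `MarkitdownRunner` and its methods are hypothetical and not part of markitdown4j, and the meaning of the exit code is not documented in this README, so the sketch just reports it.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical helper (not part of markitdown4j) that shells out to the CLI. */
final class MarkitdownRunner {

    /** Builds the same command line as the README example. */
    static List<String> buildCommand(String jarPath, String input, String output) {
        return List.of("java", "-jar", jarPath, input, "-o", output);
    }

    /** Starts the CLI, forwards its console output, and returns its exit code. */
    static int convert(String jarPath, String input, String output)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(jarPath, input, output))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Paths are the ones used in the README example; adjust to your setup.
        int exit = convert("target/markitdown4j-0.0.3-lite.jar", "document.pdf", "output.md");
        System.out.println("markitdown4j exited with code " + exit);
    }
}
```

Passing the arguments as a list, rather than one concatenated string, avoids quoting problems when file names contain spaces.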
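Which package to download
- win64: 64-bit Windows, with Windows OCR native support built in
- win32: 32-bit Windows, with Windows OCR native support built in
- linux64: Linux, recommended together with local or remote OCR
- mac: macOS, recommended together with local or remote OCR
- lite: smallest size, does not bundle tess4j
- full: complete package, including the full OCR resources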
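OCR configuration
The project uses a unified OCR configuration model. Users do not need to learn a separate configuration scheme for each OCR backend; they only change the same set of fields.
Example: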
ocr.enable=true
ocr.engine=paddleocr
ocr.endpoint=https://paddleocr.aistudio-app.com/api/v2/ocr/jobs
ocr.api.key=YOUR_TOKEN
ocr.model=PaddleOCR-VL-1.5
ocr.timeout=30000
ocr.poll.interval=5000
ocr.language=auto
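Because the example uses plain `key=value` lines, it can be read with the standard `java.util.Properties` class. The sketch below shows one way to turn those fields into typed values. `OcrSettings` is a hypothetical class, not the project's own loader. The fallback values are placeholders copied from the example rather than confirmed defaults, and the units of `ocr.timeout` and `ocr.poll.interval` are not stated in this README (milliseconds seems likely, but that is an assumption).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical typed view over the unified ocr.* fields (not the project's loader). */
final class OcrSettings {
    final boolean enabled;
    final String engine;      // null when the key is absent
    final String endpoint;    // null when the key is absent
    final String model;       // null when the key is absent
    final long timeout;       // unit not stated in the README
    final long pollInterval;  // unit not stated in the README
    final String language;

    private OcrSettings(Properties p) {
        enabled = Boolean.parseBoolean(p.getProperty("ocr.enable", "false").trim());
        engine = p.getProperty("ocr.engine");
        endpoint = p.getProperty("ocr.endpoint");
        model = p.getProperty("ocr.model");
        // Placeholder fallbacks taken from the README example, not real defaults.
        timeout = Long.parseLong(p.getProperty("ocr.timeout", "30000").trim());
        pollInterval = Long.parseLong(p.getProperty("ocr.poll.interval", "5000").trim());
        language = p.getProperty("ocr.language", "auto");
    }

    /** Parses properties-format text such as the example above. */
    static OcrSettings parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new OcrSettings(p);
    }
}
```

The API key is deliberately left out of the typed view so that it is less likely to end up in logs or debug output.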
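Configuration file location:
Unified OCR configuration fields:
- ocr.enable
- ocr.engine
- ocr.endpoint
- ocr.api.key
- ocr.model
- ocr.timeout
- ocr.poll.interval
- ocr.language
Configuration priority (highest first):
- Command-line arguments, for example --ocr-engine
- Environment variables, for example MARKITDOWN_OCR_ENGINE
- .markitdown.properties
- Built-in defaults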
Common environment variables:
- MARKITDOWN_OCR_ENGINE
- MARKITDOWN_OCR_ENDPOINT
- MARKITDOWN_OCR_API_KEY
- MARKITDOWN_OCR_MODEL
- MARKITDOWN_OCR_TIMEOUT
- MARKITDOWN_OCR_POLL_INTERVAL
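The lookup order described above (command line, then environment, then `.markitdown.properties`, then built-in defaults) can be sketched as a simple fall-through. `OcrConfigResolver` is a hypothetical class and not the project's actual code. The environment variable naming rule matches the variables listed above, but the command-line naming rule is an assumption extrapolated from the single documented flag `--ocr-engine`; check java/COMMAND_REFERENCE.md for the real option names.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical resolver illustrating the documented priority order. */
final class OcrConfigResolver {
    private final Map<String, String> cliOptions;   // e.g. {"--ocr-engine": "paddleocr"}
    private final Map<String, String> environment;  // e.g. System.getenv()
    private final Properties fileProperties;        // loaded from .markitdown.properties
    private final Map<String, String> defaults;     // stand-in for built-in defaults

    OcrConfigResolver(Map<String, String> cliOptions, Map<String, String> environment,
                      Properties fileProperties, Map<String, String> defaults) {
        this.cliOptions = cliOptions;
        this.environment = environment;
        this.fileProperties = fileProperties;
        this.defaults = defaults;
    }

    /** ocr.api.key -> MARKITDOWN_OCR_API_KEY (matches the variables listed above). */
    static String envName(String key) {
        return "MARKITDOWN_" + key.toUpperCase(Locale.ROOT).replace('.', '_');
    }

    /** ocr.engine -> --ocr-engine (assumed rule; only --ocr-engine is documented here). */
    static String cliFlag(String key) {
        return "--" + key.replace('.', '-');
    }

    /** Returns the first value found, checking sources from highest to lowest priority. */
    Optional<String> resolve(String key) {
        String value = cliOptions.get(cliFlag(key));
        if (value == null) value = environment.get(envName(key));
        if (value == null) value = fileProperties.getProperty(key);
        if (value == null) value = defaults.get(key);
        return Optional.ofNullable(value);
    }
}
```

Passing the environment in as a map, instead of calling `System.getenv()` inside the resolver, keeps the precedence logic easy to test.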
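Currently supported OCR backends:
- tess4j: suited to embedded OCR on Windows
- tesseract-cli: suited to local OCR on Linux / macOS
- paddleocr: suited to remote structured OCR
- http: suited to connecting a custom remote OCR service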
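As an illustration of how a value of `ocr.engine` might be mapped onto these four backends, the sketch below models them as an enum and records which ones are remote according to the list above. `OcrBackend` is a hypothetical type, not the project's real engine factory, and the fail-fast handling of unknown names is an assumption about reasonable behavior rather than documented behavior.

```java
import java.util.Locale;

/** Hypothetical enumeration of the documented ocr.engine values. */
enum OcrBackend {
    TESS4J("tess4j", false),
    TESSERACT_CLI("tesseract-cli", false),
    PADDLEOCR("paddleocr", true),
    HTTP("http", true);

    private final String configName;
    private final boolean remote;

    OcrBackend(String configName, boolean remote) {
        this.configName = configName;
        this.remote = remote;
    }

    String configName() {
        return configName;
    }

    /** True for backends the README describes as remote services. */
    boolean isRemote() {
        return remote;
    }

    /** Looks up a backend by its ocr.engine value, ignoring case and surrounding spaces. */
    static OcrBackend fromName(String name) {
        String normalized = name.trim().toLowerCase(Locale.ROOT);
        for (OcrBackend backend : values()) {
            if (backend.configName.equals(normalized)) {
                return backend;
            }
        }
        throw new IllegalArgumentException("Unknown OCR engine: " + name);
    }
}
```

A check such as `OcrBackend.fromName(engine).isRemote()` could then be used to decide whether an endpoint and API key are required before starting a conversion.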
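Testing and verification
The project does more than describe its features: it also provides reusable test assets and verification paths that have already been run.
Automated tests
Run: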
mvn test
Current coverage:
- Profile build and naming checks
- OCR engine factory selection
- PaddleOCR response parsing
- Streaming text conversion
- ZIP delegation and nested conversion behavior
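Test dataset
test/test.zip in the repository is the official test file package and currently contains about 104 test files. It is used for:
- Regression testing
- Compatibility verification
- Manual checks before a release
After extraction, test/README.md explains how to use these files for verification.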
Verified key paths
- lite basic conversion
- win64 + tess4j
- linux64 + tesseract-cli
- lite + paddleocr
Documentation entry points
Other subprojects
- markitdown-mcp: MCP-related content
License
This repository is licensed according to the License file currently in the repository, or according to subsequent release notes.