datamimic

mcp
Security Audit
Fail
Health Pass
  • License — License: MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Community trust — 32 GitHub stars
Code Fail
  • rm -rf — Recursive force deletion command in .github/workflows/main.yml
Permissions Pass
  • Permissions — No dangerous permissions requested


SUMMARY

🧠 Model-driven synthetic test data for CI/CD and analytics: deterministic, privacy-preserving, and domain-aware. Includes Python APIs, XML pipelines, and MCP/IDE integration to orchestrate realistic datasets for finance, healthcare, and other regulated environments.

README.md

DATAMIMIC — Governed Test Data for Regulated Enterprises

This repository contains the DATAMIMIC Community Edition (CE) — the open-source deterministic data engine at the core of the DATAMIMIC Enterprise Platform.

CE is fully usable standalone for deterministic synthetic data generation and PII-aware pseudonymization. The Enterprise Platform builds on a separately optimised EE core and adds governed workflows, PII scanning, role-based access, audit logging, scheduling, multi-system execution, and the full operational layer that regulated enterprises require.

👉 Enterprise Platform: datamimic.io | 📘 Docs: docs.datamimic.io | 📅 Book a strategy call: datamimic.io/contact


CI
Coverage
Maintainability
Python
License: MIT
MCP Ready


What is DATAMIMIC?

DATAMIMIC is the enterprise standard for governed test data operations.

Enterprises in banking, insurance, and regulated industries use DATAMIMIC to:

  • Scan source systems for PII — probability-scored field detection with configurable thresholds (EE: automated via DataWorkbench; CE: manual model definition)
  • Generate fully synthetic, deterministic datasets — model-driven, zero production data, no compliance risk
  • Pseudonymize source data — deterministic (seeded) or privacy-maximized (non-seeded) field transformation from source to target system
  • Execute repeatable workflows across complex system landscapes: Oracle, PostgreSQL, MongoDB, Kafka, JSON, XML, CSV
  • Audit every run with immutable logs, provenance hashing, and role-based traceability
  • Govern test data demand through reusable templates, approval flows, and self-service execution

Used in production at Tier-1 European banks and global payment processing enterprises for deterministic test data across Oracle, MongoDB, and Kafka pipelines.


CE vs Enterprise Platform

CE and EE are not the same engine with a feature flag. The EE core is an independently optimised execution engine built for enterprise-scale throughput and operational control.

Engine comparison

| Capability | Community Edition (CE) | Enterprise Platform (EE) |
| --- | --- | --- |
| Deterministic data generation | ✅ | ✅ |
| Pseudonymization — seeded (GDPR Art. 25) | ✅ manual model | ✅ automated via DataWorkbench |
| Pseudonymization — non-seeded (privacy-maximized) | ✅ manual model | ✅ automated via DataWorkbench |
| Python API + XML pipelines | ✅ | ✅ |
| Domain models: Finance, Healthcare, Demographics | ✅ | ✅ |
| MCP server for AI agent integration | ✅ | ✅ |
| CLI + local execution | ✅ | ✅ |
| Scale | millions of records | linearly scalable to 1,000,000,000+ records via isolated multiprocessing and Ray-based distributed execution |
| PII scanner | ❌ | ✅ probability-scored field detection, configurable threshold, DataWorkbench integration |
| Runtime configuration profiles | ❌ | ✅ Performance · Balanced · Flexibility |
| Memory management | standard | optimised for high-volume batch and streaming |
| Logging granularity | flat execution log | configurable: minimal · standard · deep nested tracing |
| Nested structure evaluation | basic | deep nested generation with extended condition + ruleset evaluation |
| Importer / exporter logging | ❌ | per-stage logging for importers and exporters |
| Error handling | standard exceptions | structured error catalog with recovery strategies |
| ML engine integration | ❌ | combine statistical models with conditions, rulesets, validators |

Platform capabilities (EE only)

| Capability | EE |
| --- | --- |
| Multi-user collaboration | ✅ |
| Role-based access control (RBAC) | ✅ |
| Audit logs + provenance dashboards | ✅ |
| PII scanner — probability scoring, threshold-based field flagging | ✅ |
| DataWorkbench — visual field mapping and pseudonymization model builder | ✅ |
| Reusable enterprise template library | ✅ |
| Scheduled execution + task runner | ✅ |
| CI/CD pipeline integration (Tosca, Jenkins, GitLab) | ✅ |
| Multi-system execution: Oracle, MongoDB, Kafka | ✅ |
| Template engine: EDIFACT, SWIFT MT, HL7 + spec-specific editors | ✅ |
| GDPR / HIPAA / PCI audit compliance layer | ✅ |
| On-premise deployment + air-gapped environments | ✅ |
| LSP-powered IDE tooling for DSL authoring | ✅ |

👉 Compare editions in detail | Book a platform demo


EE runtime profiles

The EE core supports three runtime configuration profiles, selectable per execution context:

| Profile | Optimises for | Typical use case |
| --- | --- | --- |
| Performance | Maximum throughput via isolated multiprocessing + Ray-based distributed execution | Bulk generation at nine-figure record volumes to PostgreSQL, Oracle, Kafka |
| Balanced | Throughput + full audit logging | Standard enterprise pipeline runs with compliance requirements |
| Flexibility | Deep nested evaluation, extended condition and ruleset processing | Complex domain models with ML engine combinations, multi-level referential structures |

Logging depth is independently configurable per profile — from minimal (throughput-optimised) to full nested tracing across importers, exporters, and generation stages.


EE template engine

The EE template engine generates industry-standard financial and healthcare message formats from DATAMIMIC models, with spec-specific editors for each format:

| Format | Standard | Spec-specific editor |
| --- | --- | --- |
| EDIFACT | UN/EDIFACT | ✅ |
| SWIFT MT | SWIFT MT | ✅ |
| HL7 | HL7 v2.x | ✅ |

Templates are versioned, reusable across scenarios, and fully integrated with the DATAMIMIC DSL and audit layer. Generated messages are deterministic and traceable to their source model.


Who is DATAMIMIC for?

Enterprise Platform (EE)

| Role | What DATAMIMIC solves |
| --- | --- |
| QA / Test Manager | Eliminate manual test data requests. Self-service, governed, always ready. |
| Business Analyst | Define data requirements in business-readable models — no scripting needed. |
| Platform / DevOps Engineer | Integrate deterministic test data generation into CI/CD and scheduled pipelines. |
| Compliance / Audit | Full audit trail for every generation run. Regulator-ready logs, no production data exposure. |
| Enterprise Architect | One governed standard across Oracle, MongoDB, Kafka, flat files, and custom systems. |

Community Edition (CE)

Developers and data engineers who need deterministic synthetic data generation or PII-aware pseudonymization in local environments, CI pipelines, or agent-driven workflows. PII field identification is manual — the EE DataWorkbench automates this step.


Why deterministic generation matters

Most test data tools produce random output. That breaks regression tests, audit trails, and cross-team reproducibility.

DATAMIMIC's determinism contract:

  • Same seed + same model = byte-identical output, every run, every machine
  • Frozen clocks + canonical hashing = stable temporal context
  • UUIDv5 namespaces = reproducible entity identifiers
  • Provenance hash on every output = audit-ready lineage
from datamimic_ce.domains.facade import generate_domain

request = {
    "domain": "person",
    "version": "v1",
    "count": 1,
    "seed": "regression-suite-42",       # identical seed โ†’ identical output
    "locale": "en_US",
    "clock": "2025-01-01T00:00:00Z"      # fixed clock = stable time context
}

response = generate_domain(request)
# Same input → same output, always, everywhere
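
The contract's building blocks can be illustrated with a standard-library sketch. This is not DATAMIMIC's internal code (the namespace string and field names are hypothetical); it only shows why UUIDv5 namespaces yield reproducible entity identifiers and why hashing a canonical serialization yields a provenance hash that is stable regardless of field order:

```python
import hashlib
import json
import uuid

# Hypothetical namespace; any fixed namespace gives the same guarantee.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "datamimic.example")

def entity_id(seed: str, index: int) -> uuid.UUID:
    # UUIDv5 is a hash of namespace + name: same seed and index
    # produce the same identifier on every run and every machine.
    return uuid.uuid5(NAMESPACE, f"{seed}:{index}")

def provenance_hash(record: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) makes the hash
    # independent of insertion order and formatting.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

record = {"id": str(entity_id("regression-suite-42", 0)), "name": "Jane"}
# Field order does not change the provenance hash.
assert provenance_hash(record) == provenance_hash({"name": "Jane", "id": record["id"]})
```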

Why DATAMIMIC beats Faker and generic generators

| Capability | Faker / random generators | DATAMIMIC CE | DATAMIMIC EE |
| --- | --- | --- | --- |
| Reproducible output | ❌ | ✅ | ✅ |
| Domain-aware relationships | ❌ | ✅ | ✅ |
| Business logic constraints | ❌ | ✅ | ✅ |
| Audit-ready provenance | ❌ | ✅ | ✅ |
| Source data pseudonymization | ❌ | ✅ manual | ✅ automated |
| PII field detection | ❌ | ❌ | ✅ probability-scored |
| Enterprise governance layer | ❌ | ❌ | ✅ |
| Multi-system execution | ❌ | ❌ | ✅ |
| Role-based workflows | ❌ | ❌ | ✅ |
| Regulated industry compliance | ❌ | ❌ | ✅ |
# Faker โ€” broken relationships
from faker import Faker
fake = Faker()
patient_age = fake.random_int(1, 99)
conditions  = [fake.word()]
# "25-year-old with Alzheimer's" โ€” meaningless for any real test

# DATAMIMIC โ€” domain-aware, deterministic
from datamimic_ce.domains.healthcare.services import PatientService
patient = PatientService().generate()
print(f"{patient.full_name}, {patient.age}, {patient.conditions}")
# "Shirley Thompson, 72, ['Diabetes', 'Hypertension']" โ€” every time

Quickstart โ€” Community Edition

pip install datamimic-ce

Healthcare domain

from datamimic_ce.domains.healthcare.services import PatientService

patient = PatientService().generate()
print(patient.full_name, patient.age, patient.conditions)
# Age-appropriate conditions, demographically realistic, deterministic

Finance domain

from datamimic_ce.domains.finance.services import BankAccountService

account = BankAccountService().generate()
print(account.account_number, account.balance)
# Balance-consistent, locale-correct, reproducible

Pseudonymization โ€” CE (manual model)

DATAMIMIC supports two pseudonymization modes with different privacy postures:

| Mode | How | Legal classification | Use case |
| --- | --- | --- | --- |
| Seeded (rngSeed set) | Deterministic, reproducible | Pseudonymization — GDPR Art. 25 | Regression testing, stable CI/CD pipelines |
| Non-seeded (no rngSeed) | Non-deterministic, no reversible mapping at field level | Privacy-maximized transformation | One-time data delivery, higher privacy posture |

Note on GDPR anonymization: Full anonymization status under GDPR depends on complete field coverage across all quasi-identifiers and a re-identification risk assessment on the complete record — not on individual field transformation alone. DATAMIMIC does not make anonymization claims on behalf of the customer. Non-seeded mode maximizes privacy at the transformation level; the customer is responsible for assessing re-identification risk across the full dataset.

In CE, PII fields are identified and modeled manually in the XML pipeline:

<setup>
  <generate name="customers" source="customer_export" target="customer_test">
    <key name="first_name"  converter="Mask" />
    <key name="email"       converter="anonymize_email" />
    <key name="iban"        converter="generate_iban" dataset="DE" rngSeed="42" />
    <key name="birth_date"  converter="shift_date" shiftDays="90" />
  </generate>
</setup>

datamimic run ./pseudonymize-customers/datamimic.xml

source is a controlled export or staging input — never a live production connection.

With rngSeed set: same source record → same pseudonymized output on every run. Stable for regression testing.

Without rngSeed: non-deterministic output — no reversible mapping exists at the field level. Stronger privacy posture for one-time delivery scenarios.
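
The difference between the two postures can be sketched with the standard library alone. This is illustrative, not DATAMIMIC's implementation, and the function names are hypothetical: a seeded transform keys a hash on the seed, so the same input always maps to the same pseudonym; a non-seeded transform draws fresh randomness on every run, so no reproducible mapping exists.

```python
import hashlib
import hmac
import secrets

def seeded_pseudonym(value: str, seed: str) -> str:
    # Keyed hash: same value + same seed -> same pseudonym, every run.
    return hmac.new(seed.encode(), value.encode(), hashlib.sha256).hexdigest()[:12]

def unseeded_pseudonym(value: str) -> str:
    # Fresh randomness per call: the original value never influences
    # the output, so there is no field-level mapping to reverse.
    return secrets.token_hex(6)

# Seeded mode is stable across runs; non-seeded mode is not.
assert seeded_pseudonym("jane@example.com", "42") == seeded_pseudonym("jane@example.com", "42")
```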

In the Enterprise Platform (EE): the DataWorkbench PII scanner automatically scans source schemas, assigns probability scores to each field, and flags candidates above a configurable threshold. Flagged fields are wired into the pseudonymization model automatically — no manual field mapping required.

For comparison, a fully synthetic generation scenario in the same XML DSL (no source system required):

<setup>
  <generate name="patients" count="1000" target="CSV">
    <variable name="patient" entity="Patient" dataset="US" ageMin="60" ageMax="80" rngSeed="42" />
    <key name="full_name"   script="patient.full_name" />
    <key name="age"         script="patient.age" />
    <array name="conditions" script="patient.conditions" />
  </generate>
</setup>

datamimic run ./patient-scenario/datamimic.xml

MCP Server โ€” AI Agent Integration

DATAMIMIC CE ships with a Model Context Protocol (MCP) server, making it directly callable from AI agents, Claude, Cursor, and any MCP-compatible runtime.

pip install datamimic-ce[mcp]

export DATAMIMIC_MCP_HOST=127.0.0.1
export DATAMIMIC_MCP_PORT=8765
export DATAMIMIC_MCP_API_KEY=your-key
datamimic-mcp

Agents can call generate with a domain, seed, count, and locale and receive deterministic, provenance-hashed output — making DATAMIMIC the natural test data runtime for agent-driven workflows.

import anyio, json
from fastmcp.client import Client
from datamimic_ce.mcp.models import GenerateArgs
from datamimic_ce.mcp.server import create_server

async def main():
    args = GenerateArgs(domain="person", locale="en_US", seed=42, count=2)
    payload = args.model_dump(mode="python")
    async with Client(create_server()) as c:
        a = await c.call_tool("generate", {"args": payload})
        b = await c.call_tool("generate", {"args": payload})
        # Determinism proof: identical hashes across calls
        assert (json.loads(a[0].text)["determinism_proof"]["content_hash"]
             == json.loads(b[0].text)["determinism_proof"]["content_hash"])

anyio.run(main)

📘 Full guide: docs/mcp_quickstart.md


Architecture

CE and EE have separate, independently maintained cores. CE is not a stripped-down EE. EE is not CE with features unlocked. They share the same DSL and determinism contract but diverge completely at the execution layer.

╔══════════════════════════════════════════════════════════════════╗
║              DATAMIMIC ENTERPRISE PLATFORM (EE)                  ║
║                                                                  ║
║  ┌──────────────────────────────────────────────────────────┐    ║
║  │  PLATFORM LAYER                                          │    ║
║  │  UI · RBAC · Governance · Audit Dashboards               │    ║
║  │  DataWorkbench · PII Scanner · Pseudonymization Builder  │    ║
║  │  Scheduler · Task Runner · CI/CD · Template Engine       │    ║
║  └──────────────────────────────────────────────────────────┘    ║
║                                                                  ║
║  ┌──────────────────────────────────────────────────────────┐    ║
║  │  EE CORE  (optimised, separate from CE)                  │    ║
║  │                                                          │    ║
║  │  Ray-based distributed execution                         │    ║
║  │  Isolated multiprocessing · Linear scalability           │    ║
║  │  Runtime profiles: Performance · Balanced · Flexibility  │    ║
║  │  Deep nested evaluation · Conditions · Rulesets          │    ║
║  │  ML engine integration · Structured error catalog        │    ║
║  │  Per-stage importer/exporter logging                     │    ║
║  └──────────────────────────────────────────────────────────┘    ║
╚══════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════╗
║              DATAMIMIC COMMUNITY EDITION (CE)  — this repo       ║
║                                                                  ║
║  Determinism Kit · Domain Services · Schema Validators           ║
║  Synthetic Generation · Pseudonymization (manual model)          ║
║  Python API · XML Pipelines · CLI · MCP Server                   ║
╚══════════════════════════════════════════════════════════════════╝

         ↓              ↓              ↓              ↓
    PostgreSQL       Oracle         MongoDB      Kafka / Files

Both editions share the same DATAMIMIC DSL and determinism contract. Scale, throughput, governance, and operational control are EE-only.


Supported systems (Enterprise Platform)

| System | Read | Write | Notes |
| --- | --- | --- | --- |
| PostgreSQL | ✅ | ✅ | Schema introspection, referential integrity |
| Oracle | ✅ | ✅ | Production-validated in Tier-1 banking environments |
| MongoDB | ✅ | ✅ | Nested document generation |
| Apache Kafka | ✅ | ✅ | Real-time streaming, payment scenarios |
| CSV / JSON / XML | ✅ | ✅ | Flat file pipelines |
| EDIFACT / SWIFT MT | — | ✅ | Financial message formats |

CE domains

| Domain | Models available |
| --- | --- |
| Healthcare | Patient, Doctor, Hospital, MedicalRecord |
| Finance | BankAccount, CreditCard, Transaction, LoanRecord |
| Demographics | Person (DE / US / VN locale packs), Address, Company |

All domains are versioned, seeded, and audit-ready.


CLI reference

# Run a scenario
datamimic run ./my-scenario/datamimic.xml

# Launch a demo
datamimic demo create healthcare-example
datamimic run ./healthcare-example/datamimic.xml

# Version check
datamimic version

Documentation

| Resource | Link |
| --- | --- |
| Full documentation | docs.datamimic.io |
| MCP quickstart | docs/mcp_quickstart.md |
| Developer guide | docs/developer_guide.md |
| Enterprise platform | datamimic.io |
| GitHub Discussions | Discussions |
| Issue tracker | Issues |
| Email support | [email protected] |

Contributing

See CONTRIBUTING.md. CE is MIT licensed and community contributions are welcome.

The CE engine is the foundation. If you are building integrations, domain extensions, or MCP tooling on top of DATAMIMIC, we want to hear from you.


License

MIT — see LICENSE.

The DATAMIMIC Enterprise Platform (EE) is a commercial product. Contact us for licensing.


DATAMIMIC — Make test data a standard, not a manual process.

datamimic.io | Book a demo | LinkedIn
