datamimic
Health Pass
- License: MIT
- Description: Repository has a description
- Active repo: Last push 0 days ago
- Community trust: 32 GitHub stars
Code Fail
- rm -rf: Recursive force deletion command in .github/workflows/main.yml
Permissions Pass
- Permissions: No dangerous permissions requested
Model-driven synthetic test data for CI/CD and analytics - deterministic, privacy-preserving, and domain-aware. Includes Python APIs, XML pipelines, and MCP/IDE integration to orchestrate realistic datasets for finance, healthcare, and other regulated environments.
DATAMIMIC: Governed Test Data for Regulated Enterprises
This repository contains the DATAMIMIC Community Edition (CE), the open-source deterministic data engine at the core of the DATAMIMIC Enterprise Platform.
CE is fully usable standalone for deterministic synthetic data generation and PII-aware pseudonymization. The Enterprise Platform builds on a separately optimised EE core and adds governed workflows, PII scanning, role-based access, audit logging, scheduling, multi-system execution, and the full operational layer that regulated enterprises require.
Enterprise Platform: datamimic.io | Docs: docs.datamimic.io | Book a strategy call: datamimic.io/contact
What is DATAMIMIC?
DATAMIMIC is the enterprise standard for governed test data operations.
Enterprises in banking, insurance, and regulated industries use DATAMIMIC to:
- Scan source systems for PII: probability-scored field detection with configurable thresholds (EE: automated via DataWorkbench; CE: manual model definition)
- Generate fully synthetic, deterministic datasets: model-driven, zero production data, no compliance risk
- Pseudonymize source data: deterministic (seeded) or privacy-maximized (non-seeded) field transformation from source to target system
- Execute repeatable workflows across complex system landscapes: Oracle, PostgreSQL, MongoDB, Kafka, JSON, XML, CSV
- Audit every run with immutable logs, provenance hashing, and role-based traceability
- Govern test data demand through reusable templates, approval flows, and self-service execution
Used in production at Tier-1 European banks and global payment processing enterprises for deterministic test data across Oracle, MongoDB, and Kafka pipelines.
CE vs Enterprise Platform
CE and EE are not the same engine with a feature flag. The EE core is an independently optimised execution engine built for enterprise-scale throughput and operational control.
Engine comparison
| Capability | Community Edition (CE) | Enterprise Platform (EE) |
|---|---|---|
| Deterministic data generation | ✅ | ✅ |
| Pseudonymization: seeded (GDPR Art. 25) | ✅ manual model | ✅ automated via DataWorkbench |
| Pseudonymization: non-seeded (privacy-maximized) | ✅ manual model | ✅ automated via DataWorkbench |
| Python API + XML pipelines | ✅ | ✅ |
| Domain models: Finance, Healthcare, Demographics | ✅ | ✅ |
| MCP server for AI agent integration | ✅ | ✅ |
| CLI + local execution | ✅ | ✅ |
| Scale | millions of records | linearly scalable to 1,000,000,000+ records via isolated multiprocessing and Ray-based distributed execution |
| PII scanner | ❌ | ✅ probability-scored field detection, configurable threshold, DataWorkbench integration |
| Runtime configuration profiles | ❌ | ✅ Performance · Balanced · Flexibility |
| Memory management | standard | optimised for high-volume batch and streaming |
| Logging granularity | flat execution log | configurable: minimal · standard · deep nested tracing |
| Nested structure evaluation | basic | deep nested generation with extended condition + ruleset evaluation |
| Importer / exporter logging | ❌ | per-stage logging for importers and exporters |
| Error handling | standard exceptions | structured error catalog with recovery strategies |
| ML engine integration | ❌ | combine statistical models with conditions, rulesets, validators |
Platform capabilities (EE only)
| Capability | EE |
|---|---|
| Multi-user collaboration | ✅ |
| Role-based access control (RBAC) | ✅ |
| Audit logs + provenance dashboards | ✅ |
| PII scanner: probability scoring, threshold-based field flagging | ✅ |
| DataWorkbench: visual field mapping and pseudonymization model builder | ✅ |
| Reusable enterprise template library | ✅ |
| Scheduled execution + task runner | ✅ |
| CI/CD pipeline integration (Tosca, Jenkins, GitLab) | ✅ |
| Multi-system execution: Oracle, MongoDB, Kafka | ✅ |
| Template engine: EDIFACT, SWIFT MT, HL7 + spec-specific editors | ✅ |
| GDPR / HIPAA / PCI audit compliance layer | ✅ |
| On-premise deployment + air-gapped environments | ✅ |
| LSP-powered IDE tooling for DSL authoring | ✅ |
Compare editions in detail | Book a platform demo
EE runtime profiles
The EE core supports three runtime configuration profiles, selectable per execution context:
| Profile | Optimises for | Typical use case |
|---|---|---|
| Performance | Maximum throughput via isolated multiprocessing + Ray-based distributed execution | Bulk generation at nine-figure record volumes to PostgreSQL, Oracle, Kafka |
| Balanced | Throughput + full audit logging | Standard enterprise pipeline runs with compliance requirements |
| Flexibility | Deep nested evaluation, extended condition and ruleset processing | Complex domain models with ML engine combinations, multi-level referential structures |
Logging depth is independently configurable per profile, from minimal (throughput-optimised) to full nested tracing across importers, exporters, and generation stages.
EE template engine
The EE template engine generates industry-standard financial and healthcare message formats from DATAMIMIC models, with spec-specific editors for each format:
| Format | Standard | Spec-specific editor |
|---|---|---|
| EDIFACT | UN/EDIFACT | ✅ |
| SWIFT MT | SWIFT MT | ✅ |
| HL7 | HL7 v2.x | ✅ |
Templates are versioned, reusable across scenarios, and fully integrated with the DATAMIMIC DSL and audit layer. Generated messages are deterministic and traceable to their source model.
Who is DATAMIMIC for?
Enterprise Platform (EE)
| Role | What DATAMIMIC solves |
|---|---|
| QA / Test Manager | Eliminate manual test data requests. Self-service, governed, always ready. |
| Business Analyst | Define data requirements in business-readable models; no scripting needed. |
| Platform / DevOps Engineer | Integrate deterministic test data generation into CI/CD and scheduled pipelines. |
| Compliance / Audit | Full audit trail for every generation run. Regulator-ready logs, no production data exposure. |
| Enterprise Architect | One governed standard across Oracle, MongoDB, Kafka, flat files, and custom systems. |
Community Edition (CE)
Developers and data engineers who need deterministic synthetic data generation or PII-aware pseudonymization in local environments, CI pipelines, or agent-driven workflows. PII field identification is manual; the EE DataWorkbench automates this step.
Why deterministic generation matters
Most test data tools produce random output. That breaks regression tests, audit trails, and cross-team reproducibility.
DATAMIMIC's determinism contract:
- Same seed + same model = byte-identical output, every run, every machine
- Frozen clocks + canonical hashing = stable temporal context
- UUIDv5 namespaces = reproducible entity identifiers
- Provenance hash on every output = audit-ready lineage
from datamimic_ce.domains.facade import generate_domain
request = {
    "domain": "person",
    "version": "v1",
    "count": 1,
    "seed": "regression-suite-42",  # identical seed -> identical output
    "locale": "en_US",
    "clock": "2025-01-01T00:00:00Z",  # fixed clock = stable time context
}
response = generate_domain(request)
# Same input -> same output, always, everywhere
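The ideas behind the determinism contract can be illustrated with the Python standard library alone. The sketch below is not DATAMIMIC's implementation; the namespace, the helper names (`entity_id`, `provenance_hash`, `generate`), and the record layout are assumptions made for illustration. It shows how a UUIDv5 namespace, a frozen clock, and a hash over canonical JSON combine so that the same seed always yields the same bytes.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

# Hypothetical namespace and clock, chosen for this sketch only.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.test")
FROZEN_CLOCK = datetime(2025, 1, 1, tzinfo=timezone.utc)


def entity_id(seed: str, index: int) -> str:
    """Derive a reproducible identifier from the seed and the record index."""
    return str(uuid.uuid5(NAMESPACE, f"{seed}:{index}"))


def provenance_hash(records: list) -> str:
    """Hash a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def generate(seed: str, count: int) -> list:
    """Build records that depend only on seed, index, and the frozen clock."""
    return [
        {"id": entity_id(seed, i), "created_at": FROZEN_CLOCK.isoformat()}
        for i in range(count)
    ]


first = generate("regression-suite-42", 3)
second = generate("regression-suite-42", 3)
other = generate("another-seed", 3)

# Same seed gives the same hash; a different seed gives a different one.
print(provenance_hash(first) == provenance_hash(second))
print(provenance_hash(first) == provenance_hash(other))
```

Because nothing in `generate` reads the wall clock or an unseeded random source, the hash comparison can serve as a regression check in CI.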
Why DATAMIMIC beats Faker and generic generators
| Capability | Faker / random generators | DATAMIMIC CE | DATAMIMIC EE |
|---|---|---|---|
| Reproducible output | ❌ | ✅ | ✅ |
| Domain-aware relationships | ❌ | ✅ | ✅ |
| Business logic constraints | ❌ | ✅ | ✅ |
| Audit-ready provenance | ❌ | ✅ | ✅ |
| Source data pseudonymization | ❌ | ✅ manual | ✅ automated |
| PII field detection | ❌ | ❌ | ✅ probability-scored |
| Enterprise governance layer | ❌ | ❌ | ✅ |
| Multi-system execution | ❌ | ❌ | ✅ |
| Role-based workflows | ❌ | ❌ | ✅ |
| Regulated industry compliance | ❌ | ❌ | ✅ |
# Faker: broken relationships
from faker import Faker
fake = Faker()
patient_age = fake.random_int(1, 99)
conditions = [fake.word()]
# "25-year-old with Alzheimer's" is meaningless for any real test
# DATAMIMIC: domain-aware, deterministic
from datamimic_ce.domains.healthcare.services import PatientService
patient = PatientService().generate()
print(f"{patient.full_name}, {patient.age}, {patient.conditions}")
# "Shirley Thompson, 72, ['Diabetes', 'Hypertension']", every time
Quickstart: Community Edition
pip install datamimic-ce
Healthcare domain
from datamimic_ce.domains.healthcare.services import PatientService
patient = PatientService().generate()
print(patient.full_name, patient.age, patient.conditions)
# Age-appropriate conditions, demographically realistic, deterministic
Finance domain
from datamimic_ce.domains.finance.services import BankAccountService
account = BankAccountService().generate()
print(account.account_number, account.balance)
# Balance-consistent, locale-correct, reproducible
Pseudonymization: CE (manual model)
DATAMIMIC supports two pseudonymization modes with different privacy postures:
| Mode | How | Legal classification | Use case |
|---|---|---|---|
| Seeded (rngSeed set) | Deterministic, reproducible | Pseudonymization (GDPR Art. 25) | Regression testing, stable CI/CD pipelines |
| Non-seeded (no rngSeed) | Non-deterministic, no reversible mapping at field level | Privacy-maximized transformation | One-time data delivery, higher privacy posture |
Note on GDPR anonymization: Full anonymization status under GDPR depends on complete field coverage across all quasi-identifiers and a re-identification risk assessment on the complete record, not on individual field transformation alone. DATAMIMIC does not make anonymization claims on behalf of the customer. Non-seeded mode maximizes privacy at the transformation level; the customer is responsible for assessing re-identification risk across the full dataset.
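To make the point about quasi-identifiers concrete, here is a minimal sketch of one common record-level check, k-anonymity: count how many records share each combination of quasi-identifier values. This is an illustration only, not a DATAMIMIC feature and not a complete risk assessment; the sample records, the chosen quasi-identifiers, and the helper names are assumptions.

```python
from collections import Counter

# Hypothetical records: direct identifiers already transformed,
# quasi-identifiers still present.
records = [
    {"zip": "10115", "birth_year": 1980, "gender": "F"},
    {"zip": "10115", "birth_year": 1980, "gender": "F"},
    {"zip": "10115", "birth_year": 1975, "gender": "M"},
    {"zip": "80331", "birth_year": 1990, "gender": "F"},
]
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def smallest_group(rows, keys):
    """Return k: the size of the smallest group sharing one key combination."""
    groups = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(groups.values())


def unique_rows(rows, keys):
    """Return the rows whose key combination occurs exactly once."""
    groups = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if groups[tuple(row[k] for k in keys)] == 1]


k = smallest_group(records, QUASI_IDENTIFIERS)
at_risk = unique_rows(records, QUASI_IDENTIFIERS)
print(f"k = {k}, unique records = {len(at_risk)}")
```

A result of k = 1 means at least one record is unique on its quasi-identifiers, so transforming names or emails alone would not prevent it from being singled out.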
In CE, PII fields are identified and modeled manually in the XML pipeline:
<setup>
<generate name="customers" source="customer_export" target="customer_test">
<key name="first_name" converter="Mask" />
<key name="email" converter="anonymize_email" />
<key name="iban" converter="generate_iban" dataset="DE" rngSeed="42" />
<key name="birth_date" converter="shift_date" shiftDays="90" />
</generate>
</setup>
datamimic run ./pseudonymize-customers/datamimic.xml
source is a controlled export or staging input, never a live production connection.
With rngSeed set, the same source record yields the same pseudonymized output on every run. Stable for regression testing.
Without rngSeed, output is non-deterministic and no reversible mapping exists at the field level. Stronger privacy posture for one-time delivery scenarios.
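The difference between the two modes can be sketched in plain Python. This is not how the DATAMIMIC converters are implemented; `pseudonymize` is a hypothetical helper that uses a keyed hash for the seeded case and a random token for the non-seeded case.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def pseudonymize(value: str, seed: Optional[str] = None) -> str:
    """Return a 16-hex-character token for value.

    With a seed, the token is a keyed hash: same seed + same value -> same token.
    Without a seed, the token is random and carries no link to the value.
    """
    if seed is None:
        return secrets.token_hex(8)
    digest = hmac.new(seed.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Seeded: repeatable across runs, suitable for regression tests.
print(pseudonymize("anna@example.com", seed="42") == pseudonymize("anna@example.com", seed="42"))

# Non-seeded: a fresh token on every call.
print(pseudonymize("anna@example.com") == pseudonymize("anna@example.com"))
```

In the seeded variant, anyone who holds the seed can recompute the mapping for a known input, which is why that mode is treated as pseudonymization rather than anonymization.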
In the Enterprise Platform (EE): the DataWorkbench PII scanner automatically scans source schemas, assigns probability scores to each field, and flags candidates above a configurable threshold. Flagged fields are wired into the pseudonymization model automatically; no manual field mapping is required.
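Threshold-based flagging itself is simple to picture. The sketch below is purely illustrative and does not reflect the DataWorkbench scanner or its API: the `FieldScore` type, the `flag_pii_fields` helper, the default threshold, and the scores are all made up, and whether the real scanner treats the threshold as inclusive is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldScore:
    name: str
    pii_probability: float  # assumed to lie in [0.0, 1.0]


def flag_pii_fields(scores, threshold=0.8):
    """Return the names of fields whose score meets or exceeds the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [s.name for s in scores if s.pii_probability >= threshold]


# Made-up scores; a real scanner would compute these from the source schema.
scores = [
    FieldScore("email", 0.97),
    FieldScore("iban", 0.91),
    FieldScore("order_total", 0.04),
    FieldScore("notes", 0.55),
]

print(flag_pii_fields(scores))
print(flag_pii_fields(scores, threshold=0.5))
```

Lowering the threshold trades fewer missed PII fields for more false positives, which is why it is worth keeping configurable.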
<setup>
<generate name="patients" count="1000" target="CSV">
<variable name="patient" entity="Patient" dataset="US" ageMin="60" ageMax="80" rngSeed="42" />
<key name="full_name" script="patient.full_name" />
<key name="age" script="patient.age" />
<array name="conditions" script="patient.conditions" />
</generate>
</setup>
datamimic run ./patient-scenario/datamimic.xml
MCP Server: AI Agent Integration
DATAMIMIC CE ships with a Model Context Protocol (MCP) server, making it directly callable from AI agents, Claude, Cursor, and any MCP-compatible runtime.
pip install datamimic-ce[mcp]
export DATAMIMIC_MCP_HOST=127.0.0.1
export DATAMIMIC_MCP_PORT=8765
export DATAMIMIC_MCP_API_KEY=your-key
datamimic-mcp
Agents can call generate with a domain, seed, count, and locale and receive deterministic, provenance-hashed output, making DATAMIMIC the natural test data runtime for agent-driven workflows.
import anyio, json
from fastmcp.client import Client
from datamimic_ce.mcp.models import GenerateArgs
from datamimic_ce.mcp.server import create_server
async def main():
    args = GenerateArgs(domain="person", locale="en_US", seed=42, count=2)
    payload = args.model_dump(mode="python")
    async with Client(create_server()) as c:
        a = await c.call_tool("generate", {"args": payload})
        b = await c.call_tool("generate", {"args": payload})
    # Determinism proof: identical hashes across calls
    assert (json.loads(a[0].text)["determinism_proof"]["content_hash"]
            == json.loads(b[0].text)["determinism_proof"]["content_hash"])
anyio.run(main)
Full guide: docs/mcp_quickstart.md
Architecture
CE and EE have separate, independently maintained cores. CE is not a stripped-down EE. EE is not CE with features unlocked. They share the same DSL and determinism contract but diverge completely at the execution layer.
┌──────────────────────────────────────────────────────────────────┐
│                DATAMIMIC ENTERPRISE PLATFORM (EE)                │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  PLATFORM LAYER                                            │  │
│  │  UI · RBAC · Governance · Audit Dashboards                 │  │
│  │  DataWorkbench · PII Scanner · Pseudonymization Builder    │  │
│  │  Scheduler · Task Runner · CI/CD · Template Engine         │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  EE CORE (optimised, separate from CE)                     │  │
│  │                                                            │  │
│  │  Ray-based distributed execution                           │  │
│  │  Isolated multiprocessing · Linear scalability             │  │
│  │  Runtime profiles: Performance · Balanced · Flexibility    │  │
│  │  Deep nested evaluation · Conditions · Rulesets            │  │
│  │  ML engine integration · Structured error catalog          │  │
│  │  Per-stage importer/exporter logging                       │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│           DATAMIMIC COMMUNITY EDITION (CE) - this repo           │
│                                                                  │
│  Determinism Kit · Domain Services · Schema Validators           │
│  Synthetic Generation · Pseudonymization (manual model)          │
│  Python API · XML Pipelines · CLI · MCP Server                   │
└──────────────────────────────────────────────────────────────────┘
        ↓                ↓                ↓                ↓
   PostgreSQL         Oracle           MongoDB       Kafka / Files
Both editions share the same DATAMIMIC DSL and determinism contract. Scale, throughput, governance, and operational control are EE-only.
Supported systems (Enterprise Platform)
| System | Read | Write | Notes |
|---|---|---|---|
| PostgreSQL | ✅ | ✅ | Schema introspection, referential integrity |
| Oracle | ✅ | ✅ | Production-validated in Tier-1 banking environments |
| MongoDB | ✅ | ✅ | Nested document generation |
| Apache Kafka | ✅ | ✅ | Real-time streaming, payment scenarios |
| CSV / JSON / XML | ✅ | ✅ | Flat file pipelines |
| EDIFACT / SWIFT MT | ✅ | ✅ | Financial message formats |
CE domains
| Domain | Models available |
|---|---|
| Healthcare | Patient, Doctor, Hospital, MedicalRecord |
| Finance | BankAccount, CreditCard, Transaction, LoanRecord |
| Demographics | Person (DE / US / VN locale packs), Address, Company |
All domains are versioned, seeded, and audit-ready.
CLI reference
# Run a scenario
datamimic run ./my-scenario/datamimic.xml
# Launch a demo
datamimic demo create healthcare-example
datamimic run ./healthcare-example/datamimic.xml
# Version check
datamimic version
Documentation
| Resource | Link |
|---|---|
| Full documentation | docs.datamimic.io |
| MCP quickstart | docs/mcp_quickstart.md |
| Developer guide | docs/developer_guide.md |
| Enterprise platform | datamimic.io |
| GitHub Discussions | Discussions |
| Issue tracker | Issues |
| Email support | [email protected] |
Contributing
See CONTRIBUTING.md. CE is MIT licensed and community contributions are welcome.
The CE engine is the foundation. If you are building integrations, domain extensions, or MCP tooling on top of DATAMIMIC, we want to hear from you.
License
MIT. See LICENSE.
The DATAMIMIC Enterprise Platform (EE) is a commercial product. Contact us for licensing.
DATAMIMIC: Make test data a standard, not a manual process.
datamimic.io | Book a demo | LinkedIn