oracle-aidp-samples

agent
Security Audit
Pass
Health Pass
  • License — License: UPL-1.0
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Community trust — 27 GitHub stars
Code Pass
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Pass
  • Permissions — No dangerous permissions requested
Purpose
This project provides a curated collection of Jupyter Notebook samples demonstrating how to build data pipelines, execute machine learning workloads, and integrate AI capabilities using the Oracle AI Data Platform (AIDP) Workbench.

Security Assessment
The automated code scan reviewed 12 files and found no dangerous code patterns, hardcoded secrets, or risky permission requests. Because this is an educational repository focused on data engineering and ML orchestration, the notebooks are designed to access and process your own data. While the code itself is safe, be aware that following the database and object storage tutorials inherently involves making network requests and interacting with sensitive data environments. Overall risk is rated Low.

Quality Assessment
The project is actively maintained, with its most recent code push occurring today. It is released under the permissive Oracle UPL-1.0 license, which makes it safe to integrate and modify. The repository has a community trust baseline of 27 GitHub stars, which is typical for an official, niche enterprise toolkit.

Verdict
Safe to use.
SUMMARY

Oracle AI Data Platform Workbench Samples


This repository contains a curated collection of sample notebooks demonstrating how to build data pipelines, run machine learning workloads, and integrate AI capabilities using Oracle AI Data Platform (AIDP) Workbench — a unified, governed workspace for data engineering, ML, and AI development powered by Apache Spark.

What is Oracle AI Data Platform Workbench?

Oracle AI Data Platform Workbench is a unified, governed workspace for building, managing, and deploying AI and data-driven solutions. It brings together notebooks, agent development, orchestration, and catalog management in a single collaborative platform — empowering teams to explore data, fine-tune models, and operationalize AI with trust and speed.

Learn more about AIDP Workbench →


Repository Structure

oracle-aidp-samples/
├── getting-started/          # Foundational notebooks for new users
│   ├── Delta_Lake/           # Delta Lake feature walkthroughs
│   └── migration/            # Migrating workloads to AIDP
├── data-engineering/
│   ├── ingestion/            # Connectors and data loading patterns
│   └── transformation/       # Pipeline architectures and table formats
│       ├── liquid-clustering/
│       ├── medallion-lake/
│       ├── scd/
│       └── streaming/
├── ai/
│   ├── agent-flows/          # Agent orchestration and scheduling
│   └── ml-datascience/       # ML, LLM, and AI service integrations
└── shared-utils/             # Reusable utilities and data generators

Sample Catalog

Getting Started

Foundational examples to help you get up and running on AIDP Workbench.

  • Access ALH Data — Write and query data in Oracle Autonomous AI Lakehouse (ALH) using PySpark insertInto and SQL INSERT statements with external catalogs.
  • Access Object Storage Data — Read and write data from OCI Object Storage using direct access, external volumes, and external tables.
  • Analyse Data Using PySpark — PySpark fundamentals: catalog and schema setup, table creation, data insertion, schema exploration, and matplotlib visualizations.
  • Analyse Data Using SQL — Core SQL operations on AIDP including DataFrame creation, transformations, aggregations, and simple visualizations.

Delta Lake

  • Use Delta Lake Table — Comprehensive guide covering Delta table operations: updates, merges, time travel, liquid clustering, and vacuuming.
  • Delta Change Data Feed — Capture row-level changes (inserts, updates, deletes) from Delta tables for CDC, incremental processing, and streaming pipelines.
  • Handle Schema Evolution — Add and evolve columns in Delta tables without rewriting existing data, leveraging automatic schema evolution.
  • Delta UniForm Tables — Create Delta UniForm tables that automatically synchronize Iceberg metadata for cross-format interoperability.
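
A rough mental model for the time-travel behaviour these notebooks exercise: every Delta write commits a new table version, and reads can pin any past version. Sketched here in plain Python (the VersionedTable class is illustrative, not the Delta API):

```python
import copy

class VersionedTable:
    """Toy model of a versioned table: every write commits a new snapshot,
    so old versions stay readable -- the idea behind Delta time travel."""

    def __init__(self):
        self.versions = []          # list of snapshots; index == version number

    def write(self, rows):
        self.versions.append(copy.deepcopy(rows))

    def read(self, version_as_of=None):
        # None means "latest", mirroring Spark's versionAsOf read option
        idx = len(self.versions) - 1 if version_as_of is None else version_as_of
        return self.versions[idx]

t = VersionedTable()
t.write([{"id": 1, "qty": 10}])
t.write([{"id": 1, "qty": 10}, {"id": 2, "qty": 5}])

assert t.read(version_as_of=0) == [{"id": 1, "qty": 10}]   # time travel
assert len(t.read()) == 2                                   # latest version
```

Delta additionally records per-commit change metadata, which is what the Change Data Feed notebook surfaces as row-level inserts, updates, and deletes.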

Migration

  • Migrate Files from Databricks to AIDP — Recursively export notebooks and files from a Databricks workspace to AIDP using the databricks-sdk library.
  • Download from Git to AIDP — Download notebooks and files from a Git repository as a ZIP archive and extract them directly into an AIDP workspace volume.
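
The Git download pattern above boils down to: fetch the repository as a ZIP archive, then extract it under a volume path. A stdlib-only sketch (the archive is built in memory so the example is self-contained; in the notebook the bytes would come from an HTTP request to the Git host, and the target would be a workspace volume path):

```python
import io
import tempfile
import zipfile
from pathlib import Path

# Stand-in for the bytes returned by fetching <repo>/archive/main.zip
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("repo-main/notebooks/demo.ipynb", "{}")

# Stand-in for an AIDP workspace volume path
target = Path(tempfile.mkdtemp())

# Extract the archive contents into the target directory
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    zf.extractall(target)

extracted = [p.relative_to(target).as_posix()
             for p in target.rglob("*") if p.is_file()]
assert extracted == ["repo-main/notebooks/demo.ipynb"]
```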

Data Engineering — Ingestion

Patterns for connecting to and loading data from a wide range of sources.

  • Read/Write Oracle Ecosystem Connectors — Connect to Oracle Database, Oracle Exadata, ALH, and ATP with external catalog support and SQL pushdown.
  • Read/Write External Ecosystem Connectors — Read/write operations with Hive Metastore, Microsoft SQL Server, PostgreSQL, and MySQL.
  • Read-Only Ingestion Connectors — Use read-only connectors for MySQL HeatWave, REST APIs, Oracle Fusion BICC, Kafka, and other sources.
  • Connect Using Custom JDBC Driver — Integrate custom JDBC drivers (e.g., SQLite, Snowflake) with Spark for connecting to databases not bundled by default.
  • Execute Oracle ALH SQL — Execute SQL statements directly against Oracle ALH using the oracledb Python package.
  • Ingest Data Using YAML — Config-driven ingestion from cloud storage (CSV, JSON) and JDBC sources with schema validation and data quality checks.
  • Ingest from Multi-Cloud — Ingest data from Azure Data Lake Storage (ADLS) and AWS S3 with proper JAR configuration and credential management.
  • Ingest into Apache Iceberg (OCI Native) — End-to-end Apache Iceberg workflow: table creation, querying, schema evolution, time travel, and metadata inspection using OCI native protocol and Hadoop catalog.
  • Pipe-Delimited File Ingestion — Read pipe-delimited (|) files from OCI Object Storage and register them as external tables.
  • Read Excel Files — Read Excel (.xlsx) files using the Spark Excel connector and convert them to Spark DataFrames or CSV.
  • Streaming from OCI Streaming Service — Consume messages from OCI Streaming (Kafka-compatible) using Spark Structured Streaming with SASL/OAUTHBearer authentication.
  • Streaming from Volume Path — Process CSV files from a workspace volume using one-time micro-batch streaming with Trigger.Once().
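
Config-driven ingestion, as in the YAML sample, typically means a declarative config naming the source format, schema, and quality rules, plus a small loader that applies them. A plain-Python sketch (the config structure and rule names are illustrative, not the sample's actual schema; the notebook would parse real YAML and load through Spark):

```python
import csv
import io

# Parsed form of a hypothetical YAML ingestion config
config = {
    "source": {"format": "csv", "delimiter": ","},
    "schema": {"id": int, "amount": float},
    "quality": {"not_null": ["id"]},
}

raw = "id,amount\n1,9.50\n2,3.25\n"

def ingest(text, cfg):
    """Load delimited text, apply the declared schema, then run quality rules."""
    reader = csv.DictReader(io.StringIO(text),
                            delimiter=cfg["source"]["delimiter"])
    rows = []
    for rec in reader:
        typed = {col: cast(rec[col]) for col, cast in cfg["schema"].items()}
        for col in cfg["quality"]["not_null"]:
            assert typed[col] is not None, f"null in {col}"
        rows.append(typed)
    return rows

rows = ingest(raw, config)
assert rows == [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 3.25}]
```

The benefit of the pattern is that adding a new feed becomes a config change rather than new pipeline code.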

Data Engineering — Transformation

Architectural patterns and pipeline templates for data transformation at scale.

Medallion Architecture

Implements the Bronze → Silver → Gold lakehouse pattern with data quality checks and aggregations. Industry variants available:

  • Education — Education analytics pipeline
  • Energy — Energy consumption and reporting
  • Financial Services — Financial transactions and risk
  • Healthcare — Patient records and clinical data
  • Hospitality — Hotel bookings and guest analytics
  • Insurance — Policy and claims processing
  • Manufacturing — Production line and quality data
  • Media — Content engagement and subscriptions
  • Real Estate — Property listings and transactions
  • Retail — Sales, inventory, and customer data
  • Telecommunications — Network usage and customer churn
  • Transportation — Logistics and fleet tracking
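
The medallion flow these variants implement can be sketched in plain Python: land raw data in Bronze, clean and deduplicate into Silver, then aggregate into Gold. The notebooks do this with Delta tables and Spark; this shows the logic only, on made-up rows:

```python
# Bronze: raw records exactly as landed (duplicates and bad values included)
bronze = [
    {"order_id": "1", "amount": "120.0", "region": "EMEA"},
    {"order_id": "1", "amount": "120.0", "region": "EMEA"},   # duplicate
    {"order_id": "2", "amount": "bad",   "region": "AMER"},   # unparseable
    {"order_id": "3", "amount": "80.5",  "region": "EMEA"},
]

# Silver: typed, quality-checked, deduplicated
seen, silver = set(), []
for r in bronze:
    try:
        amount = float(r["amount"])
    except ValueError:
        continue                       # quarantine rows that fail typing
    if r["order_id"] in seen:
        continue                       # drop duplicate order ids
    seen.add(r["order_id"])
    silver.append({"order_id": r["order_id"], "amount": amount,
                   "region": r["region"]})

# Gold: business-level aggregate (revenue per region)
gold = {}
for r in silver:
    gold[r["region"]] = gold.get(r["region"], 0.0) + r["amount"]

assert gold == {"EMEA": 200.5}
```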

Delta Liquid Clustering

Demonstrates Delta Lake liquid clustering for automatic query optimization and data layout management. Industry variants available:

  • Education — Student performance analytics with ML prediction
  • Energy — Smart grid monitoring and anomaly detection
  • Financial Services — Transaction analytics and reporting
  • Healthcare — Patient data access patterns
  • Hospitality — Booking and occupancy analytics
  • Insurance — Claims and policy data optimization
  • Manufacturing — Production and quality metrics
  • Media — Content and engagement data
  • Real Estate — Property and transaction data
  • Retail — Sales and inventory analytics
  • Telecommunications — Network and customer usage data
  • Transportation — Fleet and logistics optimization

Apache Iceberg UniForm Liquid Clustering

Combines Delta UniForm with Apache Iceberg Liquid Clustering for open-format, cross-engine table optimization. Industry variants available:

  • Education — Student performance data
  • Energy — Grid and sensor data
  • Financial Services — Transaction and risk data
  • Healthcare — Clinical and patient records
  • Hospitality — Booking and revenue data
  • Insurance — Policy and claims data
  • Manufacturing — Production and IoT data
  • Media — Content delivery data
  • Real Estate — Property listings data
  • Retail — Sales and inventory data
  • Telecommunications — Network usage data
  • Transportation — Fleet and route data

Other Transformation Patterns

  • Slowly Changing Dimensions (SCD Type 2) — Track historical changes to dimension records using SCD Type 2 with Jinja2-templated merge logic.
  • Streaming — Energy Delta Liquid Clustering — Real-time smart grid monitoring with streaming Delta tables, anomaly detection, and statistical baselines for energy consumption.
  • Streaming — Manufacturing Delta Liquid Clustering — Continuous ingestion and clustering of manufacturing sensor data using Spark Structured Streaming and Delta Lake.
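
The SCD Type 2 mechanics behind that notebook: when a tracked attribute changes, the current dimension row is closed out (flagged non-current with an end date) and a new current row is appended, preserving full history. The notebook implements this as a Jinja2-templated Delta MERGE; this plain-Python sketch uses illustrative field names:

```python
from datetime import date

def scd2_merge(dim, updates, as_of):
    """Apply SCD Type 2: close changed current rows, append new versions."""
    out = list(dim)
    for upd in updates:
        # Close the current row if its attributes changed
        for row in out:
            if (row["key"] == upd["key"] and row["is_current"]
                    and row["attrs"] != upd["attrs"]):
                row["is_current"] = False
                row["end_date"] = as_of
        # Insert a new current row if none remains for this key
        if not any(r["key"] == upd["key"] and r["is_current"] for r in out):
            out.append({"key": upd["key"], "attrs": upd["attrs"],
                        "start_date": as_of, "end_date": None,
                        "is_current": True})
    return out

dim = [{"key": "C1", "attrs": {"city": "Austin"},
        "start_date": date(2024, 1, 1), "end_date": None, "is_current": True}]
dim = scd2_merge(dim, [{"key": "C1", "attrs": {"city": "Dallas"}}],
                 date(2025, 6, 1))

assert len(dim) == 2                       # history preserved
assert dim[0]["is_current"] is False       # old row closed
assert dim[1]["attrs"]["city"] == "Dallas" # new current version
```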

AI & Machine Learning

Notebooks covering generative AI, NLP, ML model training, and LLM-powered analytics.

  • Sentiment Analysis with OCI GenAI — Perform sentiment analysis on text data using OCI Generative AI (Llama model) via the AIDP query_model function.
  • OCI Language Service Translation — Translate text using OCI AI Language Service via REST API, demonstrated with a round-trip English ↔ Spanish translation.
  • Customer Churn Prediction (GPU) — Train a TensorFlow neural network on GPU for customer churn prediction, including preprocessing, training, and evaluation.
  • LLM Model Output Parser — Use an LLM to parse and translate statistical model outputs into business-friendly insights and plain-language summaries.
  • Natural Language to SQL (NL2SQL) — Introspect a database schema and generate accurate SQL queries from natural language questions using an LLM, with result summarization.
  • Multi-Table NL2SQL with Grouped Analysis — Extend NL2SQL to multi-table scenarios with grouped LLM analysis for complex procurement and supplier-item intelligence.
  • Retrieval-Augmented Generation (RAG) — End-to-end RAG pipeline: ingest documents from OCI Object Storage, chunk and embed text, retrieve relevant context, and generate answers with an LLM.
  • Movie Recommendation System — Build a collaborative filtering recommendation engine using PySpark ML's ALS algorithm, trained and evaluated on movie rating data.
  • Linear Mixed Effects Model — Apply a Linear Mixed Effects Model (LME) with statsmodels and PySpark to analyze student test scores across schools, accounting for fixed and random effects.
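
For intuition on the recommendation sample: collaborative filtering scores unseen items using the preferences of similar users. The notebook uses PySpark ML's ALS matrix factorization; this stdlib sketch substitutes simple user-user cosine similarity, which captures the same idea on toy data (the ratings and names are made up):

```python
from math import sqrt

# user -> {movie: rating}; tiny stand-in for a ratings dataset
ratings = {
    "u1": {"A": 5.0, "B": 4.0},
    "u2": {"A": 4.0, "B": 5.0, "C": 1.0},
    "u3": {"B": 4.0, "C": 5.0},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[m] * b[m] for m in common)
    return num / (sqrt(sum(v * v for v in a.values()))
                  * sqrt(sum(v * v for v in b.values())))

def recommend(user):
    """Score each movie the user hasn't seen by similarity-weighted ratings."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for movie, r in their.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return max(scores, key=scores.get)

assert recommend("u1") == "C"   # the only movie u1 hasn't rated
```

ALS instead learns low-rank user and item factor matrices, which scales to millions of ratings; the neighborhood approach above is just the conceptual baseline.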

Agent Flows

  • Agent Flow Schedule Trigger — Invoke AIDP agent flows via REST API using OCI request signing, demonstrating programmatic agent orchestration with custom message handling.

Shared Utilities

  • Data Code Generator — Generate realistic multi-table synthetic datasets from a YAML configuration file, with CSV and JSON export support for testing and prototyping.
  • Data Quality Checker — Run comprehensive data quality checks including null, uniqueness, range, pattern, foreign key, and AI-powered semantic validation across single and multiple tables.
  • OCI Vault Secret Retrieval — Securely retrieve secrets (passwords, API keys, connection strings) from OCI Vault using auto-detected authentication — Resource Principal on AI Data Platform or OCI config file locally.
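
The kinds of checks the Data Quality Checker covers (null, uniqueness, range) can be expressed as a small declarative rule set evaluated against rows. A minimal stdlib sketch (rule names and report format are illustrative, not the utility's actual API):

```python
def quality_report(rows, rules):
    """Evaluate simple declarative quality rules against a list of dict rows;
    return the names of the rules that failed."""
    failures = []
    for col in rules.get("not_null", []):
        if any(r.get(col) is None for r in rows):
            failures.append(f"not_null:{col}")
    for col in rules.get("unique", []):
        vals = [r.get(col) for r in rows]
        if len(vals) != len(set(vals)):
            failures.append(f"unique:{col}")
    for col, (lo, hi) in rules.get("range", {}).items():
        if any(not (lo <= r.get(col) <= hi) for r in rows):
            failures.append(f"range:{col}")
    return failures

rows = [{"id": 1, "age": 34}, {"id": 1, "age": 210}]
rules = {"not_null": ["id"], "unique": ["id"], "range": {"age": (0, 120)}}
assert quality_report(rows, rules) == ["unique:id", "range:age"]
```

The sample's pattern and foreign-key checks follow the same shape: one declarative rule, one predicate over the rows.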

Running the Samples

Prerequisites

Before running any sample, ensure you have:

  • An active Oracle AI Data Platform Workbench environment with a compute cluster.
  • The required IAM policies configured for the services used (Object Storage, ALH, AI Services, etc.).
  • Cluster libraries installed from the requirements.txt file included in the relevant sample folder, where applicable.

General Steps

  1. Open your AIDP Workbench notebook environment.
  2. Clone or import the samples into your workspace.
  3. Navigate to the notebook of your choice and open it.
  4. Follow the instructions and prerequisites described in the notebook's opening cells.
  5. Attach the notebook to a running compute cluster and execute the cells.

MLflow Tracking Server

Several ML samples integrate with MLflow for experiment tracking. Ensure your AIDP environment has an MLflow Tracking Server configured. Refer to the AIDP documentation for setup instructions.


Documentation


Get Support

If you encounter issues with these samples, please open an issue in this repository. For questions about Oracle AI Data Platform itself, refer to the OCI Support portal.


Security

Please consult the security guide for our responsible security vulnerability disclosure process.


Contributing

This project welcomes contributions from the community. Before submitting a pull request, please review our contribution guide.


License

See LICENSE
