Business Problem
Manual document processing was creating bottlenecks: assessors spent hours extracting data from unstructured documents, decisions were inconsistent due to human bias, and rejected cases carried no explanation for compliance review.
The platform needed to automate extraction, compute risk scores in real time, and give every decision a clear, auditable explanation for compliance teams.
Architecture — End-to-End Flow
① Source Systems
- 📡 Kafka · high-throughput events · Confluent · SASL_SSL
- 📁 Blob Storage · ADLS Gen2 · Parquet · Event Grid notify
- 🌐 REST API · OAuth2 · paginated · incremental pull
- 🗄️ Synapse · JDBC connector · updated_at watermark
- 📨 Storage Queue · Azure Function trigger · async · event-driven
② Schema Registry + Data Contracts Gate
- 📜 Confluent Schema Registry · Avro/Protobuf · BACKWARD compatibility · schema_id in every message
- 📋 Data Contracts (YAML) · owner · SLA · field expectations · Unity Catalog tags enforced
- ✅ Great Expectations Gate · schema validation pre-ingest · FAIL → DLQ + alert
↓ validated data
③ Bronze Layer
bronze_transactions · bronze_documents
✦ raw_value STRING (exact JSON)
✦ Append-only — never modify
✦ CDF enabled · Snappy · 30d retention
✦ Legal audit trail · Delta ACID
↓ Delta CDF — new rows only
④ Silver Layer
silver_transactions · silver_profiles
- 🔍 Parse · from_json() → typed columns
- ✅ Validate · null checks · foreachBatch
- ⏱️ Watermark · 2hr late events · no state leak
- 🔁 Deduplicate · MERGE INTO · exactly-once
- 🤖 LLM Enrich · ai_extract() → structured fields
❌ Invalid → DLQ Table
✅ Valid → Silver Delta
🔐 PII masked → Unity Catalog
↓ clean validated rows
⑤ Feature Store — Databricks
- 🔬 Feature Engineering · debt_ratio · tx_velocity_30d · country_risk · avg_balance_90d
- 🗃️ Feature Tables · versioned + time-aware · point-in-time correct joins
- 🔗 No Training-Serving Skew · same features train + serve · feature_store.log_model()
↓ pre-computed features
⑥ ML + AI Intelligence Layer
⑥a XGBoost Model
✦ Risk score 0–100
✦ ROC-AUC 0.95+
✦ SHAP per prediction
✦ MLflow tracked
predict_proba(features) → risk_score: 73.4
⑥b LLM — Llama 3.1 405B
✦ ai_extract() structured
✦ ai_classify() risk category
✦ ai_query() recommendation
✦ No external API key
ai_query('llama-3.1-405b') → "APPROVE — low risk"
⑥c Vector Search + RAG
✦ Silver docs → embeddings
✦ Cosine similarity retrieval
✦ Context → LLM prompt
✦ Grounded — no hallucination
similarity_search(q, k=5) → inject top-k context
⑥d MLflow Tracking
✦ Every run logged
✦ SHAP artifact per run
✦ Registry: Staging → Prod
✦ Drift → auto-retrain
mlflow.start_run() · mlflow.log_metrics(r)
↓ score + recommendation + RAG context
⑦ Gold Layer
gold_decisions · gold_risk_profiles · gold_pipeline_health
Decision Fields
- risk_score ← XGBoost
- recommendation ← LLM
- decision → APPROVE / REVIEW / REJECT
- shap_reason · rag_context
Decision Logic
- score < 40 → ✅ APPROVE
- score 40–70 → 👁️ REVIEW
- score > 70 → ❌ REJECT
OPTIMIZE daily · ZORDER · VACUUM weekly
↓ business-ready decisions
⑧ Model Serving Endpoint
- 🚀 REST Endpoint · POST /invocations · P99 latency <200ms · auto-scale
- 🔄 Champion-Challenger · 90% champion / 10% challenger · shadow mode · canary rollout
- 📊 Inference Logging · input + output logged · ground-truth feedback loop
↓ scored results
⑨ Consumption
- 🖥️ Databricks SQL · live dashboard · pipeline health
- 📊 Power BI · DirectQuery · KPI dashboards
- 💬 Chat Interface · natural language · Text-to-SQL + RAG
- 🔌 REST API · external CRM · real-time scoring
Cross-Cutting — All Layers
- 🔐 Unity Catalog · column masking · row security · audit logs · PII tags · lineage
- 🚀 CI/CD · DAB · Azure DevOps · pytest · Dev → Staging → Prod
- 📡 Monitoring · consumer lag · DLQ alerts · model drift · PagerDuty
- 🗓️ Databricks Workflows · orchestrates all tasks · OPTIMIZE · VACUUM · retrain
- ⚖️ GDPR Compliance · PII masking · retention · auditable lineage · right to erase
How It Works — Step by Step
1. Multi-Source Ingestion
Five source systems — Kafka (high-throughput streaming), ADLS Blob Storage, REST APIs, Synapse JDBC, Azure Storage Queue. Each has a dedicated ingestion pattern (Structured Streaming, Auto Loader, Python job, Synapse Connector, Azure Function trigger).
2. Schema Registry + Data Contracts Gate
Confluent Schema Registry enforces Avro schema compatibility. YAML Data Contracts define the owner, SLA, and field-level expectations. Great Expectations validates at the ingestion boundary — a contract breach blocks the data and triggers an alert.
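The contract gate can be sketched in plain Python. The contract shape, field names, and `validate_record` helper below are illustrative stand-ins for the platform's actual YAML contracts and Great Expectations suite:

```python
# Minimal sketch of a contract gate: field-level expectations checked at the
# ingestion boundary. Contract shape and helper names are illustrative.

CONTRACT = {
    "owner": "risk-team",
    "sla_hours": 4,
    "fields": {
        "tx_id":   {"type": str, "required": True},
        "amount":  {"type": float, "required": True, "min": 0.0},
        "country": {"type": str, "required": False},
    },
}

def validate_record(record: dict, contract: dict) -> list:
    """Return a list of contract violations; an empty list means the record passes."""
    errors = []
    for name, rules in contract["fields"].items():
        value = record.get(name)
        if value is None:
            if rules.get("required"):
                errors.append(f"{name}: missing required field")
            continue
        if not isinstance(value, rules["type"]):
            errors.append(f"{name}: expected {rules['type'].__name__}")
        elif "min" in rules and value < rules["min"]:
            errors.append(f"{name}: below minimum {rules['min']}")
    return errors

good = {"tx_id": "t-1", "amount": 42.5}
bad = {"amount": -3.0}

assert validate_record(good, CONTRACT) == []
# A breach would route the record to the DLQ and raise an alert.
assert len(validate_record(bad, CONTRACT)) == 2  # missing tx_id, negative amount
```

In the real pipeline the same checks run as an expectation suite before data reaches Bronze, so bad records never pollute downstream tables.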
3. Bronze → Silver Medallion Processing
Bronze stores raw immutable data with CDF enabled. Silver parses, validates, deduplicates via MERGE, applies 2-hour watermarks, and enriches with LLM extraction (ai_extract) — invalid records routed to DLQ.
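The dedup-plus-watermark rule can be illustrated without Spark. This pure-Python sketch mimics the keep-latest-per-key semantics of `MERGE INTO` under a 2-hour watermark; the event shape and names are hypothetical:

```python
from datetime import datetime, timedelta

# Pure-Python sketch of the Silver dedup + watermark logic. In the pipeline
# this is Structured Streaming with withWatermark() and MERGE INTO; here the
# same rule is applied to plain dicts.

WATERMARK = timedelta(hours=2)

def upsert_batch(silver: dict, batch: list, now: datetime) -> dict:
    """Apply one micro-batch: drop events older than the watermark, then keep
    the latest record per key (MERGE-style exactly-once upsert)."""
    for event in batch:
        if now - event["ts"] > WATERMARK:
            continue  # too late: beyond the watermark, would not update state
        key = event["tx_id"]
        current = silver.get(key)
        if current is None or event["ts"] > current["ts"]:
            silver[key] = event
    return silver

now = datetime(2024, 1, 1, 12, 0)
batch = [
    {"tx_id": "a", "ts": now - timedelta(minutes=5), "amount": 10.0},
    {"tx_id": "a", "ts": now - timedelta(minutes=1), "amount": 12.0},  # newer duplicate wins
    {"tx_id": "b", "ts": now - timedelta(hours=3), "amount": 99.0},    # beyond 2h watermark
]
silver = upsert_batch({}, batch, now)
assert silver["a"]["amount"] == 12.0
assert "b" not in silver
```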
4. Feature Store — Training/Serving Parity
Databricks Feature Store computes and stores features (debt_ratio, tx_velocity_30d, country_risk). The same feature logic is used in training and real-time serving, eliminating training-serving skew.
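Point-in-time correctness is the key property here: a training row at time t must only see feature values computed at or before t, never a future value. A minimal sketch, assuming toy `(timestamp, value)` feature rows rather than the real Feature Store API:

```python
from bisect import bisect_right

# Sketch of a point-in-time correct feature lookup. Databricks Feature Store
# does this via time-aware joins; the entity id and data here are illustrative.

# (timestamp, value) pairs per entity, sorted by timestamp
features = {"cust-1": [(100, 0.30), (200, 0.42), (300, 0.55)]}

def as_of(entity: str, ts: int):
    """Return the latest feature value with timestamp <= ts, else None."""
    rows = features.get(entity, [])
    idx = bisect_right([t for t, _ in rows], ts)
    return rows[idx - 1][1] if idx else None

assert as_of("cust-1", 250) == 0.42   # the value at t=300 is in the future, so it is excluded
assert as_of("cust-1", 100) == 0.30
assert as_of("cust-1", 50) is None    # no feature existed yet
```

Without this as-of rule, training labels would silently leak future information into the features, inflating offline metrics that serving can never reproduce.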
5. XGBoost + LLM + Vector Search
XGBoost computes risk score 0–100 with SHAP explanation per prediction. LLM (Llama 3.1 405B via ai_query) generates recommendation text. Vector Search retrieves relevant policy documents — injected as RAG context into LLM for grounded answers.
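The retrieval step reduces to cosine-similarity ranking over embeddings. A self-contained sketch with toy 3-dimensional vectors standing in for real embeddings (the document names are invented):

```python
import math

# Sketch of the Vector Search retrieval step: cosine similarity over document
# embeddings, top-k results injected as RAG context into the LLM prompt.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

docs = {
    "policy_limits": [0.9, 0.1, 0.0],
    "kyc_rules":     [0.1, 0.8, 0.1],
    "fee_schedule":  [0.0, 0.2, 0.9],
}

def similarity_search(query_vec, k=2):
    """Return the k most similar document ids, best first."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

top = similarity_search([1.0, 0.0, 0.0], k=2)
assert top[0] == "policy_limits"
# The retrieved passages would then be concatenated into the LLM prompt,
# grounding the recommendation in actual policy text.
```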
6. Model Serving + Champion-Challenger
Databricks Model Serving Endpoint exposes scoring as REST API (<200ms P99). Champion 90% / Challenger 10% traffic split. New model proven in shadow mode before full rollout. Every prediction logged for drift detection.
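One common way to implement a sticky 90/10 split is deterministic hashing on a request id. The sketch below shows that routing rule; the platform's actual traffic split is handled by Databricks Model Serving configuration, so this is an illustration of the idea, not its implementation:

```python
import hashlib

# Deterministic 90/10 champion-challenger routing: hash the request id into
# [0, 100) and send the bottom 10% to the challenger. Sticky per id, so the
# same request always sees the same model.

def route(request_id: str, challenger_pct: int = 10) -> str:
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

counts = {"champion": 0, "challenger": 0}
for i in range(10_000):
    counts[route(f"req-{i}")] += 1

assert route("req-42") == route("req-42")          # deterministic for a given id
assert 0.05 < counts["challenger"] / 10_000 < 0.15  # roughly 10% of traffic
```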
7. Gold Layer + Consumption
Gold tables store final decisions with risk score, LLM recommendation, SHAP reason. Consumed by Databricks SQL dashboards, Power BI (DirectQuery), conversational chat interface, and external REST API for CRM integration.
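The decision thresholds from the Gold layer (score < 40 approve, 40–70 review, > 70 reject) map directly to a small function. A sketch of that mapping:

```python
# Gold-layer decision mapping, following the thresholds stated above.

def decide(risk_score: float) -> str:
    if risk_score < 40:
        return "APPROVE"
    if risk_score <= 70:
        return "REVIEW"
    return "REJECT"

assert decide(12.0) == "APPROVE"
assert decide(55.0) == "REVIEW"
assert decide(73.4) == "REJECT"  # matches the sample score in the diagram
```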
MLOps Practices
- Experiment tracking: Every training run logged — params, metrics, SHAP plots, LLM prompt versions
- Model registry: Staging → Production promotion with evaluation gate (AUC must exceed champion)
- Champion-Challenger: 10% traffic to new model, 2-week comparison before full rollout
- Drift monitoring: Evidently weekly report — auto-retrain triggered if drift exceeds 10%
- Feature Store: Eliminates training-serving skew — same features at training and inference time
- SHAP explainability: Every decision auditable — feature contribution per prediction logged to MLflow
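The drift gate in the list above can be illustrated with a PSI-style statistic. The platform uses Evidently, so the toy metric below and its cutoff (mirroring the 10% threshold) are illustrative stand-ins:

```python
import math

# Sketch of a drift gate: a Population Stability Index comparing a feature's
# training distribution against live traffic, triggering retraining past a
# threshold. Bin proportions here are invented.

def psi(expected: list, actual: list) -> float:
    """Population Stability Index over pre-binned proportions."""
    eps = 1e-6  # guard against log(0) on empty bins
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

train_bins = [0.25, 0.25, 0.25, 0.25]
stable     = [0.24, 0.26, 0.25, 0.25]
shifted    = [0.05, 0.15, 0.30, 0.50]

assert psi(train_bins, stable) < 0.10   # no action
assert psi(train_bins, shifted) > 0.10  # would trigger auto-retrain
```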
Governance & Compliance
- Unity Catalog: Column masking on PII fields, row-level security per business unit
- GDPR compliance: Retention policies on all tables, right-to-erasure via VACUUM, PII tagged
- Data lineage: Automatic end-to-end lineage — source to Gold — auditable in Unity Catalog
- Data Contracts: Schema evolution controlled, breaking changes blocked at ingestion boundary
- Audit logs: system.access.audit captures every read/write — full compliance trail