Business Problem
Manual document processing was creating bottlenecks: assessors spent hours extracting data from unstructured documents, decisions were inconsistent due to human bias, and rejected cases had no explainability.
The platform needed to automate extraction, compute risk scores in real time, and give every decision a clear, auditable explanation for compliance teams.
Architecture — End-to-End Flow
① Sources → Kafka (real-time) · Blob Storage · REST API · Synapse · Storage Queue
↓
② Schema Registry + Data Contracts → validation gate pre-Bronze
↓
③ Bronze Layer → raw immutable Delta tables · append-only · CDF enabled
↓
④ Silver Layer → parse · validate · deduplicate · LLM extraction (ai_extract)
↓
⑤ Feature Store → pre-computed features · training/serving parity · point-in-time correct
↓
⑥ ML + AI Layer → XGBoost scoring · LLM recommendation · Vector Search + RAG
↓
⑦ Gold Layer → decisions · risk profiles · pipeline health tables
↓
⑧ Model Serving Endpoint → REST API · A/B traffic split · inference logging
↓
⑨ Consumption → Databricks SQL · Power BI · Chat Interface · REST API
How It Works — Step by Step
1. Multi-Source Ingestion
Five source systems: Kafka (high-throughput streaming), ADLS Blob, REST APIs, Synapse JDBC, and Azure Storage Queue, each with a dedicated ingestion pattern (Structured Streaming, Auto Loader, Python job, Synapse Connector, Azure Function trigger).
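A minimal sketch of two of the five patterns; the broker, topic, path, and storage account names are illustrative assumptions, not the project's real identifiers:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Kafka -> Bronze via Structured Streaming
kafka_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
    .option("subscribe", "documents.events")            # assumed topic
    .load()
)

# ADLS Blob -> Bronze via Auto Loader (incremental file discovery)
blob_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/chk/doc_schemas")     # assumed path
    .load("abfss://landing@account.dfs.core.windows.net/docs/")  # assumed container
)
```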
2. Schema Registry + Data Contracts Gate
Confluent Schema Registry enforces Avro schema compatibility. YAML data contracts define the owner, SLA, and field-level expectations for each source. Great Expectations validates at the boundary; a contract breach blocks the data and triggers an alert.
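A hedged sketch of the validation gate, assuming Great Expectations' legacy SparkDFDataset API; the field names and thresholds are illustrative, not the project's real contract:

```python
from great_expectations.dataset import SparkDFDataset

def validation_gate(batch_df):
    """Block a batch pre-Bronze and raise (-> alert) on any contract breach."""
    suite = SparkDFDataset(batch_df)
    suite.expect_column_values_to_not_be_null("document_id")           # assumed field
    suite.expect_column_values_to_be_between("amount", 0, 10_000_000)  # assumed bounds
    result = suite.validate()
    if not result.success:
        # Contract breach: data is blocked here; alerting hooks into this path
        raise ValueError(f"Data contract breach: {result.statistics}")
    return batch_df
```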
3. Bronze → Silver Medallion Processing
Bronze stores raw, immutable data with Change Data Feed (CDF) enabled. Silver parses, validates, deduplicates via MERGE, applies 2-hour watermarks, and enriches with LLM extraction (ai_extract); invalid records are routed to a dead-letter queue (DLQ).
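A sketch of the dedup-and-MERGE step under assumed table, column, and checkpoint names; the ai_extract call mirrors the Databricks SQL AI function named above:

```python
from delta.tables import DeltaTable
# `spark` is the ambient SparkSession on Databricks

def upsert_to_silver(microbatch_df, batch_id):
    # Deduplicate within the microbatch, then MERGE new documents into Silver
    deduped = microbatch_df.dropDuplicates(["document_id"])
    (DeltaTable.forName(spark, "silver.documents").alias("t")   # assumed table
     .merge(deduped.alias("s"), "t.document_id = s.document_id")
     .whenNotMatchedInsertAll()
     .execute())

bronze_stream = (
    spark.readStream
    .option("readChangeFeed", "true")          # Bronze has CDF enabled
    .table("bronze.documents")                 # assumed table
)

(bronze_stream
 .withWatermark("ingest_ts", "2 hours")        # 2-hour late-data watermark
 .dropDuplicates(["document_id", "ingest_ts"])
 .writeStream.foreachBatch(upsert_to_silver)
 .option("checkpointLocation", "/chk/silver")  # assumed path
 .start())

# LLM enrichment with the ai_extract SQL function (assumed labels)
extracted = spark.sql("""
  SELECT *, ai_extract(raw_text, array('applicant_name', 'amount')) AS fields
  FROM silver.documents
""")
```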
4. Feature Store — Training/Serving Parity
Databricks Feature Store computes and stores features (debt_ratio, tx_velocity_30d, country_risk). The same feature logic is used in training and real-time serving, eliminating training-serving skew.
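A minimal sketch using the FeatureStoreClient API; the schema, key, and label names are assumptions:

```python
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# features_df: Spark DataFrame with applicant_id, debt_ratio, tx_velocity_30d,
# country_risk (computed upstream; shown here as an assumed input)
fs.create_table(
    name="features.applicant_features",        # assumed schema/table
    primary_keys=["applicant_id"],             # assumed key
    df=features_df,
    description="Point-in-time correct applicant risk features",
)

# The same lookup feeds both offline training and online serving
training_set = fs.create_training_set(
    df=labels_df,                              # assumed labels DataFrame
    feature_lookups=[FeatureLookup(
        table_name="features.applicant_features",
        lookup_key="applicant_id",
    )],
    label="defaulted",                         # assumed label column
)
```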
5. XGBoost + LLM + Vector Search
XGBoost computes a 0–100 risk score with a SHAP explanation per prediction. An LLM (Llama 3.1 405B via ai_query) generates the recommendation text. Vector Search retrieves relevant policy documents, which are injected as RAG context into the LLM for grounded answers.
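A sketch of the scoring and retrieval calls; the model artifact, feature matrix, endpoint, index, and column names are illustrative assumptions:

```python
import shap
import xgboost as xgb
from databricks.vector_search.client import VectorSearchClient
# `spark` is the ambient SparkSession on Databricks

model = xgb.XGBClassifier()
model.load_model("risk_model.json")            # assumed artifact

# X: feature matrix fetched from the Feature Store (assumed input)
risk_scores = model.predict_proba(X)[:, 1] * 100          # 0-100 risk score
shap_values = shap.TreeExplainer(model).shap_values(X)    # per-feature contributions

# Retrieve policy context for RAG (assumed endpoint and index names)
vsc = VectorSearchClient()
index = vsc.get_index(endpoint_name="policy-vs", index_name="docs.policy_index")
context = index.similarity_search(
    query_text="debt ratio above policy threshold", columns=["chunk"], num_results=3
)

# Grounded recommendation text via the ai_query SQL function
recs = spark.sql("""
  SELECT ai_query(
    'llama-3-1-405b',  -- assumed serving endpoint name
    CONCAT('Given the policy context and a risk score of ',
           CAST(risk_score AS STRING), ', draft a recommendation.')
  ) AS recommendation
  FROM gold.decisions
""")
```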
6. Model Serving + Champion-Challenger
A Databricks Model Serving endpoint exposes scoring as a REST API (<200 ms P99) with a champion 90% / challenger 10% traffic split. A new model is proven in shadow mode before full rollout, and every prediction is logged for drift detection.
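A hedged sketch of the traffic split using the Databricks SDK; the endpoint name, model name, and versions are assumptions:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput, ServedModelInput, TrafficConfig, Route,
)

w = WorkspaceClient()
w.serving_endpoints.create(
    name="risk-scoring",                                   # assumed endpoint
    config=EndpointCoreConfigInput(
        served_models=[
            ServedModelInput(name="champion", model_name="risk_model",
                             model_version="5", workload_size="Small",
                             scale_to_zero_enabled=False),
            ServedModelInput(name="challenger", model_name="risk_model",
                             model_version="6", workload_size="Small",
                             scale_to_zero_enabled=False),
        ],
        # Champion 90% / Challenger 10%, as described above
        traffic_config=TrafficConfig(routes=[
            Route(served_model_name="champion", traffic_percentage=90),
            Route(served_model_name="challenger", traffic_percentage=10),
        ]),
    ),
)
```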
7. Gold Layer + Consumption
Gold tables store the final decisions with risk score, LLM recommendation, and SHAP reasons. They are consumed by Databricks SQL dashboards, Power BI (DirectQuery), a conversational chat interface, and an external REST API for CRM integration.
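An illustrative query over the Gold decisions table (table and column names assumed):

```python
# `spark` is the ambient SparkSession on Databricks
spark.sql("""
  SELECT decision_id, risk_score, llm_recommendation, shap_top_reason
  FROM gold.decisions                              -- assumed table
  WHERE decision_ts >= current_date() - INTERVAL 7 DAYS
""").show()
```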
MLOps Practices
- Experiment tracking: Every training run logged — params, metrics, SHAP plots, LLM prompt versions
- Model registry: Staging → Production promotion with an evaluation gate (AUC must exceed the champion's; see the sketch after this list)
- Champion-Challenger: 10% traffic to new model, 2-week comparison before full rollout
- Drift monitoring: Evidently weekly report — auto-retrain triggered if drift exceeds 10%
- Feature Store: Eliminates training-serving skew — same features at training and inference time
- SHAP explainability: Every decision auditable — feature contribution per prediction logged to MLflow
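A sketch of the promotion gate, assuming MLflow's stage-based registry workflow; the model name and metric key are illustrative:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Latest Staging (challenger) and Production (champion) versions
challenger = client.get_latest_versions("risk_model", stages=["Staging"])[0]
champion = client.get_latest_versions("risk_model", stages=["Production"])[0]

challenger_auc = client.get_run(challenger.run_id).data.metrics["auc"]  # assumed metric key
champion_auc = client.get_run(champion.run_id).data.metrics["auc"]

# Gate: promote only if the challenger beats the champion's AUC
if challenger_auc > champion_auc:
    client.transition_model_version_stage(
        name="risk_model", version=challenger.version,
        stage="Production", archive_existing_versions=True,
    )
```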
Governance & Compliance
- Unity Catalog: Column masking on PII fields and row-level security per business unit (see the sketch after this list)
- GDPR compliance: Retention policies on all tables, right-to-erasure via VACUUM, PII tagged
- Data lineage: Automatic end-to-end lineage — source to Gold — auditable in Unity Catalog
- Data Contracts: Schema evolution controlled, breaking changes blocked at ingestion boundary
- Audit logs: system.access.audit captures every read/write — full compliance trail
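A hedged sketch of these controls, issued from PySpark for consistency with the other examples; the function, table, and column names are assumptions:

```python
# `spark` is the ambient SparkSession on Databricks

# Column mask: only the compliance group sees raw PII (assumed names)
spark.sql("""
  CREATE FUNCTION IF NOT EXISTS governance.mask_ssn(ssn STRING)
  RETURNS STRING
  RETURN CASE WHEN is_account_group_member('compliance') THEN ssn ELSE '***' END
""")
spark.sql("ALTER TABLE gold.decisions ALTER COLUMN ssn SET MASK governance.mask_ssn")

# Compliance trail: recent Unity Catalog activity from the audit system table
spark.sql("""
  SELECT event_time, user_identity.email, action_name, request_params
  FROM system.access.audit
  WHERE service_name = 'unityCatalog'
  ORDER BY event_time DESC
  LIMIT 100
""").show()
```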