Finance

Automated Anomaly Detection for ETL Pipelines: From Problem to Solution


Confidential
6 Months
5 Team Members

The Challenge

The Problem: Silent Failures and Undetected Anomalies in ETL Pipelines
With the increasing number of ETL pipelines running across the platform, a consistent pattern of issues became apparent:
a. Missing or Unexpected Row Counts
Some pipelines would extract fewer records than expected, or worse, return zero rows. Traditional monitoring only checked whether a job succeeded, not whether the output made sense.
b. No Visibility Into Historical Behavior
We had no automated system comparing current run behavior against:
• last week’s runs
• same day-of-month patterns
• weekday vs weekend volumes
• long-term baselines
• typical seasonal fluctuations
This meant errors such as sudden drops or spikes went unnoticed.
c. One-Size-Fits-All Monitoring Didn’t Work for Every Pipeline
Pipelines varied widely:
• Some ran daily with stable volumes
• Others had high variance depending on business cycles
• Newer pipelines had limited historical data

Our Solution

To overcome these problems, we built a daily anomaly detection pipeline that evaluates every ETL configuration and identifies irregular behavior in insertion row counts before it cascades into downstream issues.
The solution includes the following major components:

A. Centralized Data Retrieval and Feature Engineering
Each day, the system automatically:
• Fetches the last six months of run history for every ETL configuration
• Computes daily averages, weekday patterns, day-of-month behaviors, and overall baselines
• Identifies zero-count outputs or missing runs
• Converts dates into cyclical patterns (sin/cos transforms)
• Generates rolling metrics:
  • 28-day rolling median
  • 28-day rolling MAD (Median Absolute Deviation)
  • recent z-scores for deviation detection
These engineered features allow the system to measure “normal” behavior for each pipeline.
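The steps above can be sketched roughly as follows. The column names (`run_date`, `row_count`), window sizes, and encodings are illustrative assumptions that mirror the list, not the production code:

```python
import numpy as np
import pandas as pd

def engineer_features(history: pd.DataFrame) -> pd.DataFrame:
    """Derive baseline features from a run-history frame with
    (assumed) columns ['run_date', 'row_count']."""
    df = history.sort_values("run_date").copy()
    df["run_date"] = pd.to_datetime(df["run_date"])

    # Cyclical sin/cos encodings so day-of-week and day-of-month
    # are seen as circular rather than ordinal.
    dow = df["run_date"].dt.dayofweek
    dom = df["run_date"].dt.day
    df["dow_sin"] = np.sin(2 * np.pi * dow / 7)
    df["dow_cos"] = np.cos(2 * np.pi * dow / 7)
    df["dom_sin"] = np.sin(2 * np.pi * (dom - 1) / 31)
    df["dom_cos"] = np.cos(2 * np.pi * (dom - 1) / 31)

    # 28-day rolling median and MAD: robust location and scale.
    roll = df["row_count"].rolling(28, min_periods=7)
    df["roll_median"] = roll.median()
    df["roll_mad"] = roll.apply(
        lambda w: np.median(np.abs(w - np.median(w))), raw=True
    )

    # Robust z-score of each run against its rolling baseline.
    df["z_score"] = (df["row_count"] - df["roll_median"]) / (
        1.4826 * df["roll_mad"].replace(0, np.nan)
    )
    return df
```

The 1.4826 factor rescales the MAD so it is comparable to a standard deviation for roughly normal data, which keeps the resulting z-scores on a familiar scale.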

B. Dual Detection Approach: Statistical + Machine Learning Models
Because pipelines vary in age and history, we implemented two complementary detection mechanisms:
1. Statistical Rule-Based Detection (for short history pipelines)
For pipelines with fewer than 31 days of history, the system uses:
• deviation from historical averages
• deviation from weekday averages
• deviation from day-of-month patterns
If deviation exceeds pre-defined thresholds, it flags an anomaly.
This ensures new or infrequent pipelines are monitored reliably.
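In outline, the rule check looks something like this; the single relative-deviation threshold and the pre-computed baselines are illustrative stand-ins for the per-pipeline thresholds described above:

```python
def rule_based_flag(today_count, hist_avg, weekday_avg, dom_avg, threshold=0.5):
    """Flag an anomaly when today's count deviates from any baseline by more
    than `threshold` (relative deviation). Baseline values and the threshold
    are illustrative placeholders."""
    reasons = []
    baselines = [
        ("overall", hist_avg),
        ("weekday", weekday_avg),
        ("day-of-month", dom_avg),
    ]
    for name, baseline in baselines:
        if baseline and abs(today_count - baseline) / baseline > threshold:
            # Record which baseline was violated and by how much.
            reasons.append(
                f"{name} deviation {abs(today_count - baseline) / baseline:.0%}"
            )
    return (len(reasons) > 0), reasons
```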

2. Isolation Forest Machine Learning Model (for mature pipelines)
For pipelines with enough historical data, we train an Isolation Forest model daily to learn:
• the distribution of row counts
• recent trends
• seasonal variations
• business-day vs weekend behavior
• rolling deviations
Based on the model’s scoring:
• Runs with extremely low anomaly scores are flagged as anomalies
• A percentile-based threshold defines the outlier cut-off
This approach continuously adapts to changing data patterns and minimizes false positives.
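A hedged sketch of the daily scoring with scikit-learn’s `IsolationForest`; the feature matrix layout, tree count, and percentile cut-off here are assumptions for illustration, not the tuned production values:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def score_with_iforest(features: np.ndarray, pct: float = 5.0):
    """Train a per-pipeline Isolation Forest on engineered feature rows
    (one row per run, today's run last) and flag today's run if its score
    falls below a percentile-based threshold."""
    model = IsolationForest(n_estimators=200, contamination="auto",
                            random_state=42)
    model.fit(features)
    scores = model.score_samples(features)   # lower score = more anomalous
    threshold = np.percentile(scores, pct)   # bottom pct% treated as outliers
    today_score = scores[-1]
    return today_score, threshold, bool(today_score < threshold)
```

Retraining daily on a rolling window is what lets the cut-off drift with the data, so gradual volume growth is absorbed into the baseline instead of being flagged forever.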

C. Daily Evaluation & Summary Output
For every ETL configuration, the system generates:
• whether today’s run is normal or anomalous
• reason for detection
• statistical deviation values
• model score and threshold
• volume extracted
• number of training rows
A consolidated report is created for:
• normal behaviors
• statistically anomalous runs
• ML-detected anomalies
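The per-configuration summary can be pictured as a simple record; the field names below are illustrative, not the actual schema:

```python
def build_summary(config_id, row_count, z, score, threshold, n_train,
                  flagged, reason):
    """Assemble the daily summary record for one ETL configuration.
    Field names are hypothetical."""
    return {
        "config_id": config_id,
        "status": "anomalous" if flagged else "normal",
        "reason": reason,
        "z_score": z,
        "model_score": score,
        "model_threshold": threshold,
        "rows_extracted": row_count,
        "training_rows": n_train,
    }
```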

D. Automated Storage, Logging & Downstream Visibility
After processing, the system:
1. Stores anomaly results in S3
A daily file containing features, detection summaries, and anomaly reasoning is written to an S3 location.
2. Logs execution details to Redshift
Every run is recorded in a central ETL log table, including:
• config ID
• extraction row count
• timestamp
• success/failure status
• anomaly summary
This provides complete traceability.
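The S3 write can be sketched as follows, assuming a `boto3`-style client; the bucket, prefix, and key layout are placeholders, and the Redshift logging (a plain SQL INSERT into the central ETL log table) is omitted here:

```python
import json
from datetime import date, datetime, timezone

def persist_results(s3_client, summaries, bucket, prefix):
    """Write the day's anomaly summaries to S3 as one JSON file.
    `s3_client` is expected to expose a boto3-style put_object();
    bucket/prefix and the key layout are illustrative."""
    key = f"{prefix}/anomaly_summary_{date.today().isoformat()}.json"
    body = json.dumps({
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "results": summaries,
    })
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key
```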

Technologies Used

Python
Pandas
NumPy
Isolation Forest (scikit-learn)
Amazon S3
Amazon Redshift
SQL (for logging and retrieval)
AWS ETL logging tables
Statistical Modeling (MAD, rolling metrics, z-score)
Airflow
JSON/CSV for output storage

Results & Impact

• Detection of missing data
• Adaptive learning for changing volumes
• Automated daily monitoring
• Zero manual intervention required
• Improved reliability of downstream dashboards and analytics
The system now acts as an always-on guardian for ETL operations, continuously monitoring, scoring, and evaluating pipeline health across the platform.