I built a CLI data quality tool that goes beyond schema checks - here's what I learned

Source: DEV Community
## What SageScan does differently

SageScan is a CLI tool that runs statistical validation using a YAML config. Instead of checking rules you define manually, it checks whether your data behaves like it used to.

### 1. Distribution drift (KS test)

Compares the current vs baseline distribution. Catches:

- upstream ETL bugs
- schema changes
- silent corruption

### 2. Outlier detection (Z-score + IQR)

Flags statistically abnormal rows. Not "outside a fixed range", but "outside what the data itself considers normal".

### 3. Population Stability Index (PSI)

Used in ML pipelines for drift detection. Quantifies how much a column's distribution has shifted.

### 4. Categorical drift (Chi-square test)

Detects changes in category distribution. Example: credit card usage drops from 80% to 45%. That's not invalid data. That's a signal.

## Architecture (the controversial part)

This is where I'd love feedback. SageScan is:

- a Go CLI
- a Python engine

They communicate via JSON over stdin/stdout. Why? Go → fast, portable CLI (great for CI
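To make the four checks concrete, here are minimal sketches of the statistics involved. These are not SageScan's actual code; they're stdlib-only illustrations of each technique. First, the distribution-drift check: the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the two empirical CDFs.

```python
import bisect

def ks_statistic(baseline, current):
    """Max |ECDF_baseline(x) - ECDF_current(x)| over all observed values."""
    b_sorted, c_sorted = sorted(baseline), sorted(current)
    n, m = len(b_sorted), len(c_sorted)
    d = 0.0
    for x in set(baseline) | set(current):
        # ECDF value = fraction of samples <= x
        fb = bisect.bisect_right(b_sorted, x) / n
        fc = bisect.bisect_right(c_sorted, x) / m
        d = max(d, abs(fb - fc))
    return d

baseline = [1, 2, 3, 4, 5, 6, 7, 8]
shifted  = [5, 6, 7, 8, 9, 10, 11, 12]
print(ks_statistic(baseline, baseline))  # identical data -> 0.0
print(ks_statistic(baseline, shifted))   # half the mass moved -> 0.5
```

A large statistic means the current batch no longer "behaves like it used to", which is exactly the kind of shift an upstream ETL bug produces.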
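The outlier check can combine both rules like this (again a sketch, with common default thresholds of z > 3 and 1.5×IQR, which may differ from SageScan's): a row is flagged if *either* rule considers it abnormal relative to the data itself.

```python
import statistics

def outlier_flags(values, z_thresh=3.0, iqr_k=1.5):
    """Flag each value that is abnormal by Z-score OR by the IQR fences."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - iqr_k * iqr, q3 + iqr_k * iqr
    flags = []
    for v in values:
        z_bad = stdev > 0 and abs(v - mean) / stdev > z_thresh
        iqr_bad = v < lo or v > hi
        flags.append(z_bad or iqr_bad)
    return flags

data = [10, 11, 9, 10, 12, 10, 11, 500]  # 500 is clearly abnormal
print(outlier_flags(data))               # only the last value is flagged
```

Using both rules matters: a single extreme value inflates the standard deviation enough to hide itself from the Z-score test (here 500's z-score is under 3), while the IQR fences still catch it.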
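PSI is computed over binned percentages, under the usual convention PSI = Σ (current% − baseline%) · ln(current% / baseline%). The epsilon floor and the example bins below are illustrative choices, not SageScan's.

```python
import math

def psi(baseline_pct, current_pct, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    total = 0.0
    for b, c in zip(baseline_pct, current_pct):
        b = max(b, eps)  # floor to avoid log(0) / division by zero
        c = max(c, eps)
        total += (c - b) * math.log(c / b)
    return total

baseline_bins = [0.25, 0.25, 0.25, 0.25]  # uniform baseline
current_bins  = [0.10, 0.20, 0.30, 0.40]  # mass shifted to higher bins
print(round(psi(baseline_bins, current_bins), 4))
```

A common rule of thumb (not stated in the post) is that PSI below 0.1 means little shift and above 0.25 means significant shift, so the example above would warrant investigation.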
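For categorical drift, the chi-square goodness-of-fit statistic compares current category counts against the counts you'd expect if the baseline proportions still held. The sample size below is made up to mirror the 80% → 45% credit-card example.

```python
def chi_square_stat(baseline_props, current_counts):
    """Pearson chi-square statistic of current counts vs baseline proportions."""
    n = sum(current_counts)
    stat = 0.0
    for p, observed in zip(baseline_props, current_counts):
        expected = p * n
        stat += (observed - expected) ** 2 / expected
    return stat

# Baseline: 80% credit card, 20% other. Current batch of 1000 rows: 45% / 55%.
stat = chi_square_stat([0.80, 0.20], [450, 550])
print(stat)  # 765.625, far above the ~3.84 critical value at 1 df, alpha=0.05
```

The statistic alone doesn't say *why* usage dropped; like the post says, it's not invalid data, it's a signal to go look.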
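The Go↔Python boundary could look something like the sketch below: the Python engine reads one JSON request per line on stdin and writes one JSON response per line on stdout. The field names (`check`, `values`, `ok`) are hypothetical; the post doesn't describe SageScan's actual protocol.

```python
import json
import statistics
import sys

def handle_request(req):
    """Dispatch one decoded JSON request to a check and return a JSON-able dict."""
    if req.get("check") == "mean":
        return {"ok": True, "mean": statistics.fmean(req["values"])}
    return {"ok": False, "error": "unknown check: %s" % req.get("check")}

def main():
    # Line-delimited JSON: one request in, one response out, per line.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        resp = handle_request(json.loads(line))
        sys.stdout.write(json.dumps(resp) + "\n")
        sys.stdout.flush()  # the Go side reads responses line-by-line

if __name__ == "__main__":
    main()
```

One design consequence of this split: the Go binary stays a small static CLI for CI, while the statistics live where the numerical ecosystem is, at the cost of requiring a Python runtime next to the binary.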