Production ML Safety Nets Checklist
Purpose: Use this checklist before deploying ML models to production to ensure you have safety nets in place. Based on lessons from ads ranking, GenAI copilots, and marketplace personalization at scale.
How to use: Review each section during design reviews and pre-launch readiness checks. Track completion in your deployment docs.
1. Rollback & Traffic Control
Goal: Ensure models can be reverted quickly without data pipeline changes.
- Rollback runbook exists and has been rehearsed (not just documented)
- Feature flags or traffic switches allow gradual rollout (1% → 10% → 50% → 100%; see the bucketing sketch after this list)
- Shadow mode deployment tested (new model runs alongside old, doesn’t serve traffic yet)
- Model versioning tracked in config/metadata (not just filenames)
- Rollback trigger criteria defined (e.g., P99 latency > X ms, accuracy < Y, business metric drops by Z%)
- Rollback owner assigned and on-call during launch window
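A common way to implement the gradual ramp is deterministic bucketing on a stable ID, so each user consistently sees one model across requests. A minimal sketch, assuming the rollout percentage comes from your feature-flag system (the names `routes_to_new_model` and `ROLLOUT_PERCENT` are illustrative):

```python
import hashlib

# Assumption: the current stage of the 1% -> 10% -> 50% -> 100% ramp
# would come from a feature-flag service or config, not a constant.
ROLLOUT_PERCENT = 10

def routes_to_new_model(user_id: str, rollout_percent: int = ROLLOUT_PERCENT) -> bool:
    """Hash the user id into 100 buckets; buckets below the rollout
    percentage get the new model. Deterministic, so a given user sees
    a consistent model across requests."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

# Usage: pick the serving path per request.
model = "model_v2" if routes_to_new_model("user_1234") else "model_v1"
```

Deterministic hashing also makes rollback clean: dropping the percentage back to 0 instantly returns every user to the old model, with no cache or session state to unwind.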
2. Observability & Monitoring
Goal: Surface issues before customers report them.
- Input distribution monitoring (detect drift in feature distributions; see the PSI sketch after this list)
- Prediction distribution monitoring (e.g., % of high-confidence predictions over time)
- Latency tracking (P50, P90, P99 for model serving)
- Error rate dashboards (inference failures, timeouts, null predictions)
- Business metric tracking (CTR, conversion, revenue per session, not just ML metrics)
- Automated alerts configured with low false-positive rates
- Runbook links in alerts (engineers know what to do when paged)
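For input distribution monitoring, the Population Stability Index (PSI) is a common drift signal. A minimal sketch, assuming you keep a reference sample of training-time feature values (the thresholds shown are the usual rule of thumb, not universal):

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (e.g., training) distribution and live traffic.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 investigate, > 0.25 alert."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty buckets at a small epsilon to avoid log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Usage: compare a feature's training distribution to recent live traffic.
rng = np.random.default_rng(0)
train_vals = rng.normal(0.0, 1.0, 10_000)  # reference: training distribution
live_vals = rng.normal(1.0, 1.0, 10_000)   # live traffic, shifted upward
psi = population_stability_index(train_vals, live_vals)
if psi > 0.25:
    print(f"Drift alert: PSI={psi:.3f}")  # in production, page instead of print
```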
3. Testing & Validation
Goal: Catch regressions before production deployment.
- Golden test suite exists (curated examples that should not regress)
- Contract tests between data pipeline / model / serving layer
- A/B test plan approved (sample size, duration, success criteria)
- Counterfactual logging enabled for offline evaluation post-launch
- Schema validation for inputs (reject malformed requests early; see the sketch after this list)
- Model performance CI checks (accuracy/AUC gate in pipeline)
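For input schema validation, a JSON Schema check at the serving edge rejects malformed requests before they ever reach the model. A sketch using the `jsonschema` library, with an illustrative request shape (the field names are assumptions, not a prescribed contract):

```python
from jsonschema import ValidationError, validate

# Hypothetical request schema; field names are illustrative.
REQUEST_SCHEMA = {
    "type": "object",
    "properties": {
        "user_id": {"type": "string", "minLength": 1},
        "features": {
            "type": "object",
            "properties": {
                "session_length_sec": {"type": "number", "minimum": 0},
                "country": {"type": "string", "pattern": "^[A-Z]{2}$"},
            },
            "required": ["session_length_sec", "country"],
        },
    },
    "required": ["user_id", "features"],
}

def validate_request(payload: dict) -> None:
    """Reject malformed requests before they reach the model."""
    try:
        validate(instance=payload, schema=REQUEST_SCHEMA)
    except ValidationError as err:
        # Map to a 4xx response in the serving layer, not a model error.
        raise ValueError(f"Invalid inference request: {err.message}") from err

# Usage: this one passes; drop "country" and it raises ValueError.
validate_request({"user_id": "u1", "features": {"session_length_sec": 12.5, "country": "US"}})
```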
4. Data Quality & Freshness
Goal: Prevent garbage-in-garbage-out scenarios.
- Data freshness SLAs defined (max acceptable lag for training/inference data; see the sketch after this list)
- Missing value handling strategy documented (imputation, rejection, default)
- Outlier detection in production inference pipeline
- Feature store lineage tracked (know which upstream changes affect the model)
- Training-serving skew tests (compare feature distributions in batch vs online)
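Freshness SLAs are easy to enforce mechanically once each feature table exposes a last-updated watermark. A minimal sketch, assuming such watermarks are available from your feature store's metadata (the table names and SLA values are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Assumption: each feature table exposes a "last updated" watermark,
# e.g., from the feature store's metadata. SLAs here are illustrative.
FRESHNESS_SLAS = {
    "user_profile_features": timedelta(hours=24),
    "realtime_session_features": timedelta(minutes=5),
}

def check_freshness(watermarks: dict[str, datetime]) -> list[str]:
    """Return the tables whose data is staler than their SLA allows."""
    now = datetime.now(timezone.utc)
    return [
        table
        for table, sla in FRESHNESS_SLAS.items()
        if now - watermarks[table] > sla
    ]

# Usage: run on a schedule; page or block inference if anything is stale.
stale = check_freshness({
    "user_profile_features": datetime.now(timezone.utc) - timedelta(hours=30),
    "realtime_session_features": datetime.now(timezone.utc) - timedelta(minutes=2),
})
if stale:
    print(f"Freshness SLA breached: {stale}")
```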
5. Compliance & Safety Rails
Goal: Avoid regulatory violations and brand damage.
- PII/GDPR compliance verified (no user data leaking into logs or predictions; see the scrubbing sketch after this list)
- Bias & fairness review completed (disparate impact across demographics)
- Safety filters for GenAI outputs (toxicity, PII leakage, jailbreak attempts)
- Explainability available for high-stakes decisions (credit, hiring, ads targeting)
- Human-in-the-loop fallback for edge cases the model can’t handle confidently
- Audit logs for model predictions (especially for regulated domains)
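For keeping user data out of logs, even a simple regex scrubber on the logging path catches the obvious leaks. A minimal sketch; the patterns are illustrative and deliberately not exhaustive (real pipelines typically add NER-based detection and allowlists):

```python
import re

# Illustrative PII patterns; not exhaustive.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace matched PII with a typed placeholder so logs stay debuggable."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label.upper()}_REDACTED>", text)
    return text

# Usage: wrap every log call in the inference path.
print(scrub_pii("Contact jane.doe@example.com or +1 (555) 123-4567"))
# -> "Contact <EMAIL_REDACTED> or <PHONE_REDACTED>"
```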
6. Platform & Infra
Goal: Ensure models don’t destabilize the rest of the system.
- Resource limits configured (CPU, memory, GPU quota)
- Timeout & circuit breaker patterns implemented (see the sketch after this list)
- Load testing completed (can handle peak traffic + 2x headroom)
- Dependency failure mode tested (what happens if feature store goes down?)
- Cost monitoring enabled (inference costs don’t exceed budget)
- Deployment automation (no manual SSH into prod servers)
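Timeouts alone aren’t enough when a dependency is down hard; a circuit breaker stops hammering it and serves a fallback instead. A minimal sketch (the failure thresholds, the `fetch_features` call, and the default feature values are all illustrative; production breakers also need half-open probing and per-endpoint state):

```python
import time

class CircuitBreaker:
    """Stop calling a flaky dependency after repeated failures and use a
    fallback instead; retry after a cool-down period."""

    def __init__(self, failure_threshold: int = 5, reset_after_sec: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_sec = reset_after_sec
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_sec:
                return fallback()  # circuit open: skip the dependency entirely
            self.opened_at, self.failures = None, 0  # cool-down over: retry
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()

def fetch_features(user_id: str) -> dict:
    # Stand-in for a real feature-store client; simulates an outage here.
    raise TimeoutError("feature store unavailable")

# Usage: wrap the feature-store lookup; serve default features if it is down.
breaker = CircuitBreaker()
features = breaker.call(
    fn=lambda: fetch_features("user_1234"),
    fallback=lambda: {"country": "US", "session_length_sec": 0.0},
)
```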
7. Team Readiness
Goal: Humans are prepared to respond when things go wrong.
- On-call rotation includes someone familiar with this model
- Incident response plan documented (who to page, escalation path)
- Stakeholder communication plan ready (how to notify product/business if rollback needed)
- Post-launch review scheduled (retrospective 2 weeks after launch)
Next Steps
- If <80% of items checked: Delay launch until safety nets are in place
- If 80-95% checked: Launch with extra monitoring and quick rollback plan
- If >95% checked: You’re ready for production
Need help implementing these? Check out the Production ML pillars or book a talk.