The Gap Between Notebooks and Production
Many ML projects die in the notebook. A data scientist builds a promising model, shares impressive metrics in a slide deck... and then nothing ships. The gap between "it works on my laptop" and "it's running in production" is where MLOps lives.
MLOps (Machine Learning Operations) applies DevOps principles to ML systems: version control, automated testing, continuous deployment, and monitoring — adapted for the unique challenges of machine learning.
Why ML Is Harder to Operationalize
Traditional software has one primary artifact to version and test: code. ML systems have three:
- Code — the training pipeline, feature engineering, serving logic
- Data — training datasets, feature stores, validation sets
- Models — trained weights, hyperparameters, metadata
All three can change independently, and any change can break the system. That's why you need versioning, testing, and monitoring for all three.
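One way to make this concrete is to record a fingerprint of all three artifacts with every run, so that any change shows up as a changed digest. The sketch below is a minimal illustration that assumes each artifact fits in memory as bytes; `fingerprint`, `run_manifest`, and `changed_artifacts` are hypothetical helper names, not part of any library.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return a short, stable SHA-256 digest for a blob of bytes."""
    return hashlib.sha256(payload).hexdigest()[:12]


def run_manifest(code: bytes, data: bytes, model: bytes) -> dict:
    """Record one digest per artifact so a change in any of them is visible."""
    return {
        "code": fingerprint(code),
        "data": fingerprint(data),
        "model": fingerprint(model),
    }


def changed_artifacts(old: dict, new: dict) -> list:
    """List the artifacts whose digest differs between two manifests."""
    return sorted(key for key in old if old[key] != new.get(key))


# Illustrative stand-ins for real files (hypothetical contents).
before = run_manifest(b"train.py v1", b"rows: 1000", b"weights v1")
after = run_manifest(b"train.py v1", b"rows: 1200", b"weights v2")

print(json.dumps(after, indent=2))
print(changed_artifacts(before, after))  # ['data', 'model']
```

Here the code is unchanged while the data and the model differ, which is exactly the kind of change that code review alone would not catch.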
The MLOps Stack
1. Experiment Tracking
Track every training run: hyperparameters, metrics, datasets, and artifacts. Tools like MLflow, Weights & Biases, or Neptune reduce this to a few logging calls:
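import mlflow
import mlflow.sklearn

# `model` is assumed to be an already-fitted scikit-learn estimator.
# The context manager closes the run even if a logging call raises.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 50)
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.sklearn.log_model(model, "model")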
2. Model Registry
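A central repository for trained models with versioning, stage management (staging → production → archived), and approval workflows. MLflow Model Registry and SageMaker Model Registry are popular choices. Recent MLflow releases deprecate fixed stages in favor of model version aliases, so check which mechanism your version supports.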
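To show what versioning and stage management amount to, here is a toy in-memory registry. It is a sketch under simplifying assumptions, not the API of MLflow or SageMaker: `ModelRegistry`, `ModelVersion`, the stage names, and the example URIs are all hypothetical, and a real registry would persist its state and enforce approvals.

```python
from dataclasses import dataclass, field

STAGES = ("none", "staging", "production", "archived")


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    stage: str = "none"


@dataclass
class ModelRegistry:
    """Toy in-memory registry; a real one persists to a database."""
    versions: dict = field(default_factory=dict)  # model name -> list of ModelVersion

    def register(self, name: str, artifact_uri: str) -> ModelVersion:
        """Add a new version; version numbers increase by one per model name."""
        history = self.versions.setdefault(name, [])
        entry = ModelVersion(version=len(history) + 1, artifact_uri=artifact_uri)
        history.append(entry)
        return entry

    def promote(self, name: str, version: int, stage: str) -> None:
        """Move a version to a stage, archiving the previous production version."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        history = self.versions.get(name, [])
        target = next((e for e in history if e.version == version), None)
        if target is None:
            raise KeyError(f"{name} has no version {version}")
        if stage == "production":
            for entry in history:
                if entry.stage == "production":
                    entry.stage = "archived"
        target.stage = stage

    def current(self, name: str, stage: str = "production"):
        """Return the version currently in the given stage, or None."""
        matches = [e for e in self.versions.get(name, []) if e.stage == stage]
        return matches[-1] if matches else None


registry = ModelRegistry()
registry.register("churn", "s3://example-bucket/churn/1")  # hypothetical URIs
registry.register("churn", "s3://example-bucket/churn/2")
registry.promote("churn", 1, "production")
registry.promote("churn", 2, "staging")
registry.promote("churn", 2, "production")
print(registry.current("churn").version)  # 2
```

Serving code that asks the registry for "the production version" instead of a hard-coded file path is what makes rollbacks a metadata change rather than a redeploy.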
3. Data Versioning
Track changes to your training data the same way you track code changes. DVC (Data Version Control) works alongside Git:
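# Track a large dataset; DVC writes a small pointer file next to it
dvc add data/training_set.parquet
# Commit the pointer file so this data version is tied to a Git commit
git add data/training_set.parquet.dvc data/.gitignore
git commit -m "Track training set with DVC"
# Push data to remote storage (assumes a DVC remote is already configured)
dvc push
# Reproduce the exact training data from any commit
git checkout v1.2
# Run `dvc pull` instead if the data is not in the local cache
dvc checkout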
4. Feature Stores
A centralized system for computing, storing, and serving features. It ensures the same feature logic is used in training and serving, which removes a major source of training-serving skew. Tools: Feast, Tecton, Hopsworks.
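The core idea can be illustrated without any feature-store product: define each feature once and call that single definition from both the offline and the online path. This is a minimal sketch, not how Feast, Tecton, or Hopsworks are implemented; the function names and the record fields (`signup_date`, `orders`) are assumptions made for the example.

```python
from datetime import date


def build_features(raw: dict, today: date) -> dict:
    """Single source of truth for feature logic, shared by training and serving."""
    signup = date.fromisoformat(raw["signup_date"])
    orders = raw.get("orders", [])
    return {
        "account_age_days": (today - signup).days,
        "order_count": len(orders),
        "avg_order_value": sum(orders) / len(orders) if orders else 0.0,
    }


def training_rows(records: list, snapshot_date: date) -> list:
    """Offline path: build features for a batch of historical records."""
    return [build_features(r, snapshot_date) for r in records]


def serving_features(request: dict, now: date) -> dict:
    """Online path: build features for a single live request."""
    return build_features(request, now)


record = {"signup_date": "2024-01-01", "orders": [20.0, 40.0]}
day = date(2024, 1, 31)

# Both paths produce identical features for identical input.
assert training_rows([record], day)[0] == serving_features(record, day)
print(serving_features(record, day))
# {'account_age_days': 30, 'order_count': 2, 'avg_order_value': 30.0}
```

Skew usually creeps in when the offline path is written in SQL or a notebook and the online path is reimplemented by hand; sharing one definition removes that duplication.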
CI/CD for Machine Learning
ML CI/CD pipelines look different from those for traditional software, because they validate data and models as well as code. Typical stages:
- Code tests: Unit tests for feature engineering, data validation, and pipeline logic.
- Data validation: Check schema, distributions, missing values, and outliers on new data.
- Model training: Automated retraining triggered by new data or code changes.
- Model validation: Compare the new model against the current baseline on the same held-out data. Promote it only if it beats the baseline on the metric you agreed on in advance.
- Deployment: Blue-green or canary deployment with automatic rollback.
Test the data and model, not just the code. A pipeline that produces a bad model should fail the same way a pipeline with a bug should.
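The data-validation and model-validation stages can be sketched as a gate that raises an exception, and therefore fails the pipeline, when either check does not pass. This is a minimal illustration: the function names, the 5% missing-value threshold, and the 0.01 minimum gain are assumptions chosen for the example, not recommended defaults.

```python
def validate_batch(rows: list, schema: dict, max_missing_rate: float = 0.05) -> list:
    """Return a list of human-readable problems; an empty list means the batch passes."""
    if not rows:
        return ["batch is empty"]
    problems = []
    for column, expected_type in schema.items():
        values = [row.get(column) for row in rows]
        missing = sum(v is None for v in values)
        if missing / len(rows) > max_missing_rate:
            problems.append(f"{column}: {missing}/{len(rows)} values missing")
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            problems.append(f"{column}: expected {expected_type.__name__}")
    return problems


def should_promote(candidate: float, baseline: float, min_gain: float = 0.01) -> bool:
    """Promote only when the candidate beats the baseline by a clear margin."""
    return candidate - baseline >= min_gain


def run_gate(rows: list, schema: dict, candidate_score: float, baseline_score: float) -> None:
    """Raise, failing the pipeline, if the data or the model is not good enough."""
    problems = validate_batch(rows, schema)
    if problems:
        raise ValueError("data validation failed: " + "; ".join(problems))
    if not should_promote(candidate_score, baseline_score):
        raise ValueError(
            f"candidate {candidate_score:.3f} does not beat baseline {baseline_score:.3f}"
        )


schema = {"age": int, "income": float}
good = [{"age": 34, "income": 52000.0}, {"age": 51, "income": 87000.0}]
bad = [{"age": "34", "income": None}, {"age": 51, "income": 87000.0}]

print(validate_batch(good, schema))  # []
print(validate_batch(bad, schema))   # ['age: expected int', 'income: 1/2 values missing']
run_gate(good, schema, candidate_score=0.94, baseline_score=0.91)  # passes silently
```

In a CI system, the uncaught `ValueError` gives a non-zero exit code, which is what stops the deployment stage from running.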
Monitoring in Production
Models degrade silently. The world changes, user behavior shifts, and your training data becomes stale. You need to monitor:
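- Data drift: Has the input distribution changed? Common checks are the Kolmogorov-Smirnov (KS) test and the Population Stability Index (PSI).
- Concept drift: Has the relationship between inputs and outputs changed? This usually becomes visible only once ground-truth labels arrive.
- Prediction drift: Are prediction distributions shifting? This can be checked immediately, without waiting for labels.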
- Latency and throughput: Is the model fast enough for your SLA?
- Business metrics: Is the model actually improving the thing you care about?
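As an example of a data-drift check, the sketch below computes the Population Stability Index between a reference sample (for instance, the training data) and a live sample, using equal-width bins derived from the reference. It is a minimal standard-library version; the bin count, the `eps` floor for empty bins, and the synthetic Gaussian data are assumptions made for illustration.

```python
import math
import random


def psi(expected: list, actual: list, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(values: list) -> list:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            index = min(max(int((v - lo) / width), 0), bins - 1)
            counts[index] += 1
        # Floor empty bins at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(0)  # fixed seed so the example is repeatable
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

print(round(psi(reference, same), 3))     # small: same distribution
print(round(psi(reference, shifted), 3))  # large: mean moved by one standard deviation
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the right alert threshold depends on the feature and should be tuned against your own history.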
The Minimum Viable MLOps
You don't need the entire stack on day one. Start with:
- Version everything: Code in Git, data in DVC, models in a registry.
- Automate training: One command or trigger retrains the model end-to-end.
- Test before deploying: Compare new model vs. baseline on validation data.
- Monitor predictions: Log predictions and set up alerts for drift.
Add sophistication as your ML system matures. The goal is reliability, not complexity.
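The "monitor predictions" step can start as structured logging: one JSON line per prediction, tagged with the model version, so that drift checks and later joins against ground truth have something to work with. The sketch below is one possible shape; the field names, the `churn-v3` version string, and the in-memory buffer are assumptions for the example.

```python
import io
import json
import time
import uuid


def log_prediction(stream, model_version: str, features: dict, prediction) -> str:
    """Append one prediction as a JSON line and return its id for later joins."""
    record = {
        "prediction_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    stream.write(json.dumps(record) + "\n")
    return record["prediction_id"]


# An in-memory buffer stands in for a log file or message queue.
log = io.StringIO()
pid = log_prediction(log, "churn-v3", {"order_count": 2}, 0.81)

entry = json.loads(log.getvalue().splitlines()[0])
print(entry["model_version"], entry["prediction"])  # churn-v3 0.81
```

Recording the model version with every prediction is the piece that is hardest to add retroactively, since without it you cannot tell which model produced a bad output.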
Common Anti-Patterns
- Manual model deployment: "I'll just copy the weights to the server" — leads to unreproducible deployments.
- No baseline comparison: Deploying a model without checking if it's better than the current one.
- Ignoring data quality: The model is only as good as the data it sees in production.
- Over-engineering too early: Building Kubernetes-scale infrastructure for a model that serves 10 requests/day.
The best ML system is the one that's actually running in production, delivering value, and being monitored. Ship first, optimize later.