If you have been in the AI ecosystem for a while, you may have heard about data drift. In this post, we will dive deep into what data drift is, how it impacts our machine-learning models, and whether we can stop it.
Let's start with a scenario to understand exactly what data drift is.
Let's say you run a clothing brand, and you built a machine-learning model to forecast your customers' demand for clothing. The model worked flawlessly, and sales were great.
So, for a long time, your model’s accuracy remained quite high.
However, one day you notice a drop in your model's performance. The predicted demand no longer matches the actual sales of your clothing brand.
After looking into it, you notice a data drift.
What caused the drift? It turned out that a sudden heatwave had struck, producing an increase in demand for summer clothes.
The model, which you trained on previous data with a more balanced seasonal distribution, was caught off guard by this sudden shift.
As a result, your company is caught short by the increased demand for summer clothes. You may not have enough stock to fulfill customer orders, resulting in lost revenue and unhappy customers.
So, what exactly are data drifts?
Data drift is a phenomenon in machine learning in which the statistical properties of the data a model receives change over time. This creates a mismatch between the data used to train the model and the data it encounters after deployment.
Data drift can have a negative impact on the performance and dependability of machine-learning models in production, reducing predictive accuracy and potentially introducing biases or errors into model outputs. Data drift must be effectively monitored and managed to ensure that machine-learning systems remain accurate and valid in real-world contexts.
Here is an example of what data drift can look like on a graph.
If you have never thought about the problem of data drift, it is worth taking into account. According to a study published in Nature's Scientific Reports, 91% of AI models degrade over time because of data drift.
There are different types of data drifts that can affect your data. Let’s briefly discuss them.
Feature Drift
This sort of drift is primarily related to changes in the statistical characteristics of input features over time.
Feature drift happens when the distribution, scale, or relationships of features change. For example, in a weather forecasting model, if a temperature sensor degrades over time, it can introduce bias or errors into the temperature readings, resulting in feature drift.
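To make the sensor scenario concrete, here is a minimal sketch of flagging a drifting feature by checking how far a live window's mean has moved from the training mean, measured in training standard deviations. The readings and the 3-sigma threshold are illustrative assumptions, not production values:

```python
# Minimal sketch: flag feature drift when a live window's mean shifts
# far from the training mean (measured in training standard deviations).
# The readings and the 3-sigma threshold are illustrative assumptions.
from statistics import mean, stdev

def feature_drift_score(train_values, live_values):
    """Distance of the live mean from the training mean, in training stds."""
    mu, sigma = mean(train_values), stdev(train_values)
    return abs(mean(live_values) - mu) / sigma

train_temps = [19.8, 20.1, 20.4, 19.9, 20.2, 20.0]   # healthy sensor
live_temps  = [22.9, 23.1, 23.3, 23.0, 23.2, 23.1]   # sensor after degradation

score = feature_drift_score(train_temps, live_temps)
print("feature drift suspected" if score > 3 else "no drift detected")
```

A z-score-style check like this only catches mean shifts; real monitoring tools compare whole distributions, as we will see below.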
Concept Drift
Concept drift is about changes in the relationship between the input features and the target variable.
It happens when the underlying concepts or relationships in a dataset change over time.
For example, in a customer churn prediction model for a subscription-based business, concept drift occurs when the factors driving customer churn change (e.g., due to changes in service offerings or customer preferences).
Class Distribution Drift
This type of drift refers to variations in the distribution of classes within the target variable over time.
It occurs when the proportions of different classes shift dramatically. For example, in an online fraud detection system, if fraudsters change their tactics or security measures improve, the proportions of fraudulent and non-fraudulent transactions could shift, resulting in class distribution drift.
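As a concrete illustration of the fraud example, here is a minimal sketch that compares the fraud rate between a training window and a production window. The counts and the doubling/halving alert rule are made-up assumptions:

```python
# Minimal sketch: detect class-distribution drift by comparing the fraud
# rate at training time with the rate seen in production.
# The counts and the doubling/halving rule are illustrative assumptions.
from collections import Counter

def class_proportions(labels):
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}

train_labels = ["legit"] * 980 + ["fraud"] * 20    # 2% fraud at training time
prod_labels  = ["legit"] * 940 + ["fraud"] * 60    # 6% fraud in production

p_train = class_proportions(train_labels)["fraud"]
p_prod  = class_proportions(prod_labels)["fraud"]

if p_prod > 2 * p_train or p_prod < p_train / 2:
    print(f"class distribution drift: fraud rate {p_train:.1%} -> {p_prod:.1%}")
```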
Covariate Shift
The term “covariate shift” refers to differences in the distribution of input features between the model’s training and deployment stages. It happens when the marginal distribution of features changes while the conditional distribution of the target variable given the features remains constant.
For example, if a user base’s demographic composition changes over time while user preferences stay consistent, a covariate shift will occur.
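The demographics example can be sketched in a few lines: the marginal distribution of the feature (age group) changes, while the conditional rule mapping age group to preference stays fixed. The groups, shares, and the preference rule are all illustrative assumptions:

```python
# Minimal sketch of covariate shift: the age distribution of users changes
# between training and deployment, but the relationship between age group
# and preference, P(y | x), stays the same. All values are illustrative.

def preference(age_group):
    # Fixed conditional rule P(y | x): what each group prefers.
    return "casual" if age_group == "young" else "formal"

train_users = ["young"] * 70 + ["older"] * 30   # marginal P(x) at training time
live_users  = ["young"] * 30 + ["older"] * 70   # marginal P(x) after the shift

def share(users, group):
    return users.count(group) / len(users)

# The marginal distribution of the feature has shifted...
print(share(train_users, "young"), "->", share(live_users, "young"))

# ...but the conditional mapping is unchanged for every user.
assert all(preference(u) in ("casual", "formal") for u in train_users + live_users)
```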
So, how can we detect and manage data drift in our machine-learning models?
Machine Learning Monitoring Workflow
The best way of detecting data drifts is to have a consistent machine-learning workflow for your model.
The first and most important part is constant performance monitoring. Is your model still performing well after some time has passed? Calculate realized performance and measure the business impact.
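As a sketch of what constant performance monitoring can look like, the loop below tracks realized accuracy over weekly batches and alerts when it falls below a tolerance band around the validation accuracy. The batches, the 90% baseline, and the 5-point tolerance are all illustrative assumptions:

```python
# Minimal sketch: track realized accuracy over weekly batches and alert
# when it falls below a tolerance band around the validation accuracy.
# The batches, baseline, and tolerance are illustrative assumptions.

def accuracy(predictions, actuals):
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)

validation_accuracy = 0.90
tolerance = 0.05

weekly_batches = [
    ([1, 0, 1, 1, 0, 1, 0, 1, 1, 1], [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]),  # week 1
    ([1, 1, 0, 1, 0, 0, 1, 1, 0, 1], [0, 1, 1, 1, 0, 1, 1, 0, 0, 1]),  # week 2
]

for week, (preds, actuals) in enumerate(weekly_batches, start=1):
    acc = accuracy(preds, actuals)
    status = "OK" if acc >= validation_accuracy - tolerance else "ALERT: investigate"
    print(f"week {week}: accuracy {acc:.0%} - {status}")
```

Note that realized accuracy requires ground-truth labels, which in many businesses only arrive with a delay; this is exactly why the estimation approaches discussed later exist.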
What Separates Machine Learning Problems?
This is also a good point to mention the differences between traditional software development and machine-learning solutions.
Traditional software development usually entails creating deterministic solutions in which developers can program how the system will respond to any given request.
The system behaves predictably and consistently.
In contrast, machine learning solutions have a probabilistic nature.
Data scientists create models based on patterns in training data, which then generate probabilistic predictions or classifications. As a result, the output of machine learning models can vary probabilistically, creating intrinsic uncertainty in predictions.
When Performance Degradation is Detected…
Suppose that during performance monitoring we detect a performance degradation.
We should then run an automated root-cause analysis.
What exactly went wrong in our model?
If we detect data drift, we can retrain the model or adjust some of our business processes.
One popular strategy is to perform statistical tests to compare samples from the training and production data. These experiments provide vital insights into the nature of drift, helping data scientists to better understand how the data evolved.
Statistical techniques such as the Kolmogorov-Smirnov (KS) test, Jensen-Shannon divergence, and Wasserstein distance can be used to compare feature distributions between training data and incoming production data. Significant differences between these distributions could suggest data drift.
However, using only statistical methods to detect data drift may be insufficient.
While evaluating the difference across distributions and setting drift detection thresholds may seem simple, it is critical to understand that not all drift has the same influence on model performance. The relevance and importance of drifting features in the model can influence whether or not a drift has a major effect on performance.
So, what could be the other ways of detecting drifts?
1. Evidently.ai
Evidently is an open-source Python library that simplifies the task of checking machine-learning models for data drift.
It provides a convenient method for understanding and displaying changes in model performance and data properties over time.
You can compare your model's performance across multiple datasets, such as training data against production data. The library includes a variety of statistical indicators and visualizations that show disparities between these datasets, allowing you to discover potential drift.
One distinguishing element of Evidently is its emphasis on practical insights.
Rather than depending primarily on complicated statistical tests, Evidently focuses on interpreting results in a straightforward way. This makes the approach accessible to people from various technical backgrounds.
2. Fiddler.ai
Fiddler AI's method of managing data drift entails extensive monitoring and analysis.
They provide built-in data drift monitoring tools, allowing you to easily track differences in data distribution between baseline and production datasets. Fiddler detects potential drift that could influence model performance by comparing these distributions.
They also identify the contributing factors to the observed drift.
So you can figure out which features are causing the drift, and decide when and how to retrain models accordingly.
To measure data drift, they use popular drift metrics such as Jensen-Shannon Divergence (JSD) and the Population Stability Index (PSI). Using these indicators, you can assess the extent of drift and its impact on model accuracy, allowing for more informed decisions on model maintenance and retraining.
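To illustrate one of these metrics, here is a minimal sketch of PSI computed over two binned distributions. This is not Fiddler's API; the bucket shares and the conventional thresholds in the comment are illustrative:

```python
# Minimal sketch of the Population Stability Index (PSI): bucket both
# samples, then sum (actual% - expected%) * ln(actual% / expected%).
# The bucket shares and thresholds below are illustrative assumptions.
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """PSI between two binned distributions given as fractions per bucket."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)   # avoid log(0) for empty buckets
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.50, 0.25]   # training-time bucket shares
current  = [0.10, 0.40, 0.50]   # production bucket shares

print(f"PSI = {psi(baseline, current):.3f}")
# A common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
```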
3. NannyML
NannyML takes a rather new approach to data drift detection, prioritizing model performance estimation and using probabilistic algorithms to predict performance.
This method enables NannyML to continually assess model performance in real time without depending solely on data drift as an alerting mechanism.
They use probabilistic methods called Confidence-based Performance Estimation (CBPE) and Direct Loss Estimation (DLE) to estimate model performance for classification and regression tasks, respectively. By reviewing expected performance regularly, NannyML remains proactive in assessing the model.
They recognize potential data drifts before they affect business outcomes.
In the case of a performance drop, NannyML performs a Root Cause Analysis (RCA) using data drift detection algorithms. Their drift detection methods help identify covariate shifts and pinpoint the factors behind the performance decline.
Conclusion
Artificial Intelligence has come into our lives as a new wave, entering almost every industry and our daily routines. Of course, AI and machine-learning models are susceptible to errors of their own.
Data drift is one of the issues worth keeping an eye on.
If you are in the machine-learning business, or if you are a developer, you can stay ahead of the game by learning how to detect or mitigate data drift in your systems. Better yet, you could develop your own drift detection method, just as some of the companies we discussed have done.