The Deceiving Dangers of Data Degradation

The insights you gain are only as good as the data you feed into the model.

The key to data science is fundamentally the data. Data is not an eternal element that never changes; it ages and becomes increasingly irrelevant, a phenomenon called data degradation. Take, for example, information on ‘customer purchasing behavior.’ As a customer goes through phases in life, her purchasing behavior will change. Using her purchasing behavior data from a single point in time (say, when she was searching for a job, was financially dependent on someone, or was in her 20s) to predict what she will purchase today (when she is self-sufficient, holds a high-income job, and is married with kids in her mid-30s) is unlikely to be accurate. (Do note that data used purely as a reference point for analytical models is largely considered an exception to this rule.)

Another example is credit rating information. Banks rely on a customer’s recent history and current credit score to calculate her overall risk. A customer’s credit rating from when she was 20 will not be relevant, or add value, in the risk calculation when she is 45. This is a clear indication that the relevance of data can and does degrade over time.
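The banking example suggests a simple remedy: weight observations by recency so that stale data contributes less. The article does not prescribe a method, so the function below is a hypothetical sketch; it assumes exponential decay with a configurable half-life.

```python
def time_weighted_score(observations, half_life_days=365.0):
    """Weighted average of (age_in_days, score) pairs.

    Recent observations dominate: a score that is one half-life old
    counts half as much as a score observed today.
    """
    if not observations:
        raise ValueError("no observations")
    # Weight halves every half_life_days of age.
    weights = [0.5 ** (age / half_life_days) for age, _ in observations]
    total = sum(w * score for w, (_, score) in zip(weights, observations))
    return total / sum(weights)
```

With the default half-life, a 25-year-old credit score carries a weight near zero, so today's score dominates the estimate.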

Both the analytical model you choose and the way you apply it to the data matter, but it is the quality and validity of the underlying data that make or break the value generated from the data. Sure, you can use effective algorithms to rip apart a dataset and throw insights at the user; but what use is that insight if it was derived from a junk dataset, or from data that is no longer relevant? You must consider the aging of data when analyzing it.

This axiom can be expressed simply. When we talk about text analytics, we automatically enter the world of Big Data. To extract value from large volumes of data, we need to understand the four dimensions of data (the four V’s). The first three, Volume (Vo), Velocity (Ve), and Variety (Va), are straightforward enough. Let’s instead focus in detail on the fourth V, “Veracity” (Vr), which involves accuracy and truthfulness. Veracity is a function of four components: quality of data (Qd), volatility (Vl), validity (Vd), and quality of data density (Qn) at any given timeframe. This relationship can be represented by an equation:

Vr = f(Qd, Vl, Vd, Qn) / f(t)

Quality of data (Qd) captures completeness, correctness, clarity, and consistency. Qd is represented by a score obtained as a result of profiling the data.
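The article does not specify how a Qd profiling score is computed, so here is one hypothetical ingredient: a completeness score, the fraction of required fields that are actually filled in across a set of records.

```python
def completeness_score(records, required_fields):
    """Fraction of required fields that are present and non-empty
    across all records; one ingredient of a Qd profiling score."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total
```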

Volatility (Vl) corresponds to how long a piece of data, or a dataset, will remain relevant. Some analysts relate volatility to the rate of change of data.
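Taking the rate-of-change view, one hypothetical measure of volatility is the fraction of records that were added, removed, or modified between two snapshots of a dataset:

```python
def change_rate(old_snapshot, new_snapshot):
    """Fraction of record keys whose value was added, removed, or
    modified between two dict snapshots (record id -> value)."""
    keys = set(old_snapshot) | set(new_snapshot)
    if not keys:
        return 0.0
    changed = sum(
        1 for key in keys if old_snapshot.get(key) != new_snapshot.get(key)
    )
    return changed / len(keys)
```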

Validity (Vd) represents the value of information trends, like correlation, causation, and regression, obtained after the application of analytical models.
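Correlation is the simplest of the trends mentioned; a plain Pearson correlation coefficient can be computed from first principles as follows (a textbook formula, not anything specific to this article):

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length
    numeric sequences; +1 indicates a perfect positive linear trend,
    -1 a perfect negative one."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)
```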

Quality of data density (Qn) highlights the density of relevant data in a large dataset.
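Density can be read as the share of records in a dataset that are actually relevant to the question at hand. A hypothetical one-liner, with the notion of "relevant" supplied by the caller as a predicate:

```python
def data_density(records, is_relevant):
    """Fraction of records for which the predicate is_relevant
    returns True; a simple proxy for Qn."""
    if not records:
        return 0.0
    return sum(1 for record in records if is_relevant(record)) / len(records)
```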

All the above components depend on a time factor, f(t). Any quantity governed by time has a lifetime, and a value, associated with it. Similarly, the value of data will either remain constant or degrade over time, depending on the type of data. See, I told you it was simple!
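To make the equation concrete, the whole relationship can be sketched as one function. Everything here is an assumed instantiation, since the article leaves both f and f(t) abstract: the four components are averaged on a 0-to-1 scale, and the time factor is an exponential that halves the score every hypothetical half-life.

```python
def veracity(qd, vl, vd, qn, age_days, half_life_days=365.0):
    """Vr = f(Qd, Vl, Vd, Qn) / f(t), instantiated as the mean of
    four 0-to-1 component scores, decayed by the data's age."""
    for score in (qd, vl, vd, qn):
        if not 0.0 <= score <= 1.0:
            raise ValueError("component scores must lie in [0, 1]")
    components = (qd + vl + vd + qn) / 4.0
    # f(t) modeled as 2 ** (age / half_life), i.e. the score halves
    # each time the data ages by one half-life.
    decay = 0.5 ** (age_days / half_life_days)
    return components * decay
```

Fresh, high-quality data scores near 1; the same data a few half-lives later scores near 0, which is exactly the degradation the article describes.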