In a world of cloud computing, where storage and compute resources are more elastic and affordable than ever, more and more organizations are asking why they still limit themselves to architectures that date back decades. Today, data is abundant, with more diverse information being generated each day. If large data sets let us gain more complete insights about our customers or competitors, why would we limit ourselves to exploring only smaller segments of the data?
Sample-based data preparation techniques – where a random, smaller subset of the entire data is selected to infer general rules about the shape and quality of the full data – have their roots in statistics but are now creeping into all sorts of data projects. Like looking through a small lens that shows only part of a larger object, this paradigm is limited and flawed, and it does not reflect the full reality of the data.
More on that later, but let’s first study where the concept of sampling started.
Sampling and its roots in statistics and machine learning
Researchers and scientists rarely, if ever, work with the entire population of the data. Instead, they conduct studies using samples to make generalizations about the entire data population. Take medical researchers: they rely on a sample of patients to study the effect of a treatment on a disease. In pharmaceuticals, clinical trials are performed on a subset of subjects. Marketing specialists build campaigns based on surveys conducted across a subset of their entire customer base. And the list goes on.
This methodology – i.e. using a sample of data to model predictive outcomes or calculate risks and exposures on new data – was introduced decades ago, mainly because of two major limitations:
- Data was never available in its entirety.
- The processing power and computational resources could not handle larger datasets.
Over the last several decades, statistical tools were dominated by desktop applications which inherently have a limited capacity for data storage and compute. As a result, statisticians resorted to samples.
Data sampling problems in statistical and machine learning projects
Even though sampling is a common practice in data science projects, organizations continue to seek ways to overcome the errors introduced by sampling.
The general recommendation is that the sample size should be sufficiently large. But how large should the sample be? It depends on the population. When the population is skewed or asymmetric, a large sample is needed; when the population is symmetric, smaller samples can suffice.
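To make the point about skewed populations concrete, here is a minimal sketch that uses only the Python standard library. The two populations are synthetic (a normal and a log-normal distribution chosen purely for illustration), and `mean_relative_error` is a hypothetical helper, not part of any library. It measures how far the sample mean lands from the true mean, on average, at several sample sizes.

```python
import random
import statistics

def mean_relative_error(population, sample_size, trials=200, seed=0):
    """Average relative gap between the sample mean and the true mean."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    errors = []
    for _ in range(trials):
        # Simple random sample without replacement.
        sample = rng.sample(population, sample_size)
        gap = abs(statistics.fmean(sample) - true_mean)
        errors.append(gap / abs(true_mean))
    return statistics.fmean(errors)

# Synthetic populations (assumptions for illustration only).
rng = random.Random(42)
symmetric = [rng.gauss(100, 15) for _ in range(100_000)]       # bell-shaped
skewed = [rng.lognormvariate(0, 1.5) for _ in range(100_000)]  # long right tail

for n in (30, 300, 3000):
    print(
        f"n={n:>5}  symmetric error={mean_relative_error(symmetric, n):.4f}"
        f"  skewed error={mean_relative_error(skewed, n):.4f}"
    )
```

Under these assumptions, the skewed population should need a much larger sample to reach the error that the symmetric population reaches with a small one. The exact numbers depend on the distributions chosen.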
The size of the data sample also depends on the type of model selected. A large amount of training data plays a critical role in making deep learning models successful, whereas traditional machine learning models (e.g. linear regression) don’t require as much.
However, as this article points out, more data affects the performance of traditional machine learning models and more advanced deep learning models in much the same way. The graph below, as referenced by the article, shows how the performance of both linear and nonlinear (e.g. deep learning) methods continues to improve as you give them more data.
Scientists agree that with smaller samples, the room for bias (the gap between the model’s average prediction and the actual value) and variance (how much the model’s predictions change when it is trained on a different sample) is high. As we increase the number of data points (i.e. the size of the sample), the sample captures the true distribution of the population more faithfully. More data helps the model uncover the true relationship between data elements.
However, keep in mind that in machine learning, although selecting larger samples and adding more input data helps with the performance of the model, selecting the entire population of the data, even if technically and physically achievable, is not recommended. This is because it can overfit the model to the extent that the model learns so much of the detail and noise in the training data that it negatively impacts its performance/predictions when applied to new data. After all, we don’t want machine learning models to pick up the noise or random fluctuations in the training data so much so that they are learned as concepts by the model.
Nevertheless, the more data available for training, the better the model generally performs. This is so critical that many data scientists develop a set of procedures to overcome the problems of sampling so that they can draw more accurate conclusions about the population.
Because sample-based findings are likely to differ from what you would find using the entire population, statisticians apply mathematical principles to determine whether what you see in the sample is what you would see in the population. These techniques include generalization and inferential statistics, such as regression analysis or analysis of variance (ANOVA), which check that the sample is a good representation of the whole data and help data scientists distinguish signal (e.g. real differences) from noise (e.g. random error). To learn more about this, there are plenty of articles available.
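As one small illustration of inferential statistics, the sketch below computes a normal-approximation confidence interval for a population mean from a sample. It is a simplified example on synthetic data, not a substitute for the fuller techniques named above, and `mean_confidence_interval` is a hypothetical helper written for this article.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, from the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Synthetic population (an assumption for illustration only).
rng = random.Random(7)
population = [rng.gauss(50, 12) for _ in range(200_000)]

small_sample = rng.sample(population, 100)
large_sample = rng.sample(population, 10_000)

for name, sample in (("n=100", small_sample), ("n=10000", large_sample)):
    low, high = mean_confidence_interval(sample)
    print(f"{name}: mean is estimated to lie between {low:.2f} and {high:.2f}")
```

The interval narrows as the sample grows, which is the statistical expression of the idea that a larger sample says more about the population.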
Machine learning aside: How samples can undermine business decisions
To recap, even if a sample is created as a representation of the entire data, the sample should be large enough to improve the performance of machine learning models, and statistical techniques should be applied so that the insights are not skewed or misrepresented by the sample.
However, it should be clear that these techniques fall within the purview and expertise of SMEs in statistics or data science. So what about other types of data projects, such as analytics and reporting, creating a single view of customers or suppliers, and application migration and consolidation, where the data practitioner is not a statistician or a data science expert?
Let’s take this a step further. There are dozens of data sampling techniques, so how would we expect a general business analyst, a data analyst or even a SQL developer to know which one to apply? Common examples include:
- Simple random sampling randomly selects subjects from the whole population.
- Stratified sampling divides the data set or population into subgroups based on a common factor, then randomly collects samples from each subgroup.
- Cluster sampling divides a larger data set into subsets (clusters) based on a defined factor, then a random sampling of clusters is analyzed.
- Multistage sampling is a more complicated form of cluster sampling that involves dividing the larger population into a number of clusters, where second-stage clusters are then broken out based on a secondary factor, and those clusters are then sampled and analyzed. Note, this could continue where multiple subsets are identified, clustered and analyzed.
- Systematic sampling sets an interval at which to extract data from the larger population – for example, selecting every 10th row.
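The techniques listed above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions: the `rows` data set, its `region` field and all function names are hypothetical, and the two-stage function stands in for multistage sampling in its simplest form.

```python
import random
from collections import defaultdict

def _group_by(rows, key):
    """Group rows into a dict of lists keyed by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups

def simple_random_sample(rows, n, seed=0):
    """Pick n rows at random from the whole population."""
    return random.Random(seed).sample(rows, n)

def stratified_sample(rows, key, fraction, seed=0):
    """Randomly sample the same fraction from every subgroup."""
    rng = random.Random(seed)
    sample = []
    for group in _group_by(rows, key).values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

def cluster_sample(rows, key, n_clusters, seed=0):
    """Randomly pick whole clusters and keep every row in them."""
    rng = random.Random(seed)
    clusters = _group_by(rows, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [row for name in chosen for row in clusters[name]]

def two_stage_sample(rows, key, n_clusters, per_cluster, seed=0):
    """Randomly pick clusters, then randomly sample rows inside each one."""
    rng = random.Random(seed)
    clusters = _group_by(rows, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return sample

def systematic_sample(rows, interval, start=0):
    """Take every interval-th row, beginning at index start."""
    return rows[start::interval]

# Hypothetical data set: 1,000 customer rows spread evenly over four regions.
REGIONS = ["north", "south", "east", "west"]
rows = [{"id": i, "region": REGIONS[i % 4]} for i in range(1000)]

def by_region(row):
    return row["region"]

print(len(simple_random_sample(rows, 50)))
print(len(stratified_sample(rows, by_region, 0.1)))
print(len(cluster_sample(rows, by_region, 2)))
print(len(two_stage_sample(rows, by_region, 2, 10)))
print(len(systematic_sample(rows, 10)))
```

Each function returns a different subset of the same data, so the choice of technique, and of the factor used to group the rows, directly shapes what the analyst sees.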
As you can see, sampling only works when it is put in the hands of data science specialists. But what about general business users, data analysts or business analysts who don’t have that expertise or a programming mindset, or who have no formal training in mathematics? While they all want to prepare, shape, and clean data on their own, the reality is they only have three choices:
- Leave data cleanup and shaping to their technical counterparts and hope for the best.
- Use randomly generated samples to glean insights and formulate how the data should be shaped, cleaned and prepared, and use those findings as guidance for their technical counterparts who will clean the full data based on the sample-based guidance.
- Leverage modern technology to prepare and shape the data on their own using the entire body of data, not just samples.
For those who want to throw in the towel and call it a day (aka option #1), we understand. But for those who have a data-driven mindset and a self-service attitude, and who would naturally pick option 2 or 3, there is hope.
In Part II of this series, we will explain how to avoid problems with samples outside the context of machine learning and statistics and delve into how modern technologies can free you from samples.