In the first article in this series, we highlighted some of the motivations for using artificial intelligence in the enterprise to better understand the present and predict the future.[1] In the second article, we provided descriptions of data analytics, data science, machine learning and deep learning.[2] In this article, we continue with a practical description of machine learning and, in our follow-on article, we will do the same for deep learning, a subset of machine learning.
What is machine learning?
Machine learning is one approach to AI, not the only approach, but currently by far the most successful one in enterprise applications and beyond. Machine learning differs from explicit, rules-based approaches to AI, such as expert systems, in that its models are designed to learn from data. The algorithms at the heart of machine learning applications use data to generate and refine rules (as opposed to the programmer explicitly defining the rules). The computer then decides how to respond based on what it has learned from the data.
So, how does this work? At the highest and simplest level, all machine learning methods have two phases: training and inference.
Training
In the first phase, the algorithm, or model, is trained to recognize features in a dataset, such as the characteristics that are common to housing prices, consumer purchases or pictures of common objects. If the model sees enough data that is consistent and well-labeled, it can find patterns and “learn” the features in the data that are consistent with the labels. For example, it could be used to understand which features predict home purchase prices most accurately, which purchase attempts are fraudulent, and which objects under a scanner are apples, oranges or bananas.
Of course, you need great data to train an accurate model. But how do you know your model is accurate? You validate your model against a subset of the data that is not used for training and score the accuracy. This is an iterative process, with successive rounds of training and validation.
For a simple application, you might train your model on 70 percent of the data in a dataset and validate it on the remaining 30 percent. With a more complex application, you might use 60 percent of the data for training, 20 percent for validation and 20 percent for final testing. These ratios aren’t hard-and-fast rules, and the appropriate ratios for each phase can vary based on the dataset. It’s the data scientist’s job to look at the dataset and determine the ratios that will work best — and to determine when the trained model is accurate enough to be deployed.
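To make that split concrete, here is a minimal sketch using the open-source scikit-learn library. The file name, the column names and the 60/20/20 ratios are illustrative assumptions, not a prescription.

```python
# A minimal sketch of a 60/20/20 train/validation/test split with scikit-learn.
# The file name and column names below are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("housing.csv")  # hypothetical dataset of past home sales
features = data[["square_feet", "lot_size", "zip_code"]]
labels = data["sale_price"]

# First carve off 40 percent of the rows, then split that portion half-and-half
# into a validation set and a final test set.
X_train, X_rest, y_train, y_rest = train_test_split(
    features, labels, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 60/20/20 of the rows
```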
Inference
In this second step in the machine learning process, the rubber hits the road. You put the trained model to work with real-world data and let it infer answers based on that data. You then monitor the performance of your model over time. If it’s not meeting your accuracy goals, you might send it back to boot camp for additional training, often with new data that has been collected since the original training.
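In code, inference is often just a call to the trained model’s predict method, wrapped in some routine monitoring. The sketch below is a hypothetical helper, not part of any product: it assumes a fitted scikit-learn regressor and an illustrative error budget of $25,000.

```python
# A minimal sketch of the inference phase: score real-world data with a trained
# model, then check whether prediction error has drifted past an agreed limit.
from sklearn.metrics import mean_absolute_error

def monitor(model, new_features, actual_prices, acceptable_error=25_000):
    """Run inference and flag the model for retraining if error grows too large.

    `model` is any fitted scikit-learn regressor; the $25,000 error budget is
    an illustrative assumption, not a recommendation.
    """
    predicted_prices = model.predict(new_features)
    error = mean_absolute_error(actual_prices, predicted_prices)
    if error > acceptable_error:
        print(f"Average error ${error:,.0f}: schedule retraining with new data.")
    return predicted_prices
```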
Solving simple problems with linear regression
Machine learning has been around for ages in various forms. Linear regression, for example, is a statistical technique that is a very basic form of machine learning. Linear regression is used to show the relationship between variables, often expressed in terms of a slope and a Y-intercept in a chart.
Let’s take a deliberately simple example, for the sake of illustration. Say you want to understand the relationship between the selling prices of houses in a particular subdivision and the square footage of those houses. With linear regression techniques, you could take these data points for dozens of sales and plot them on an X-Y chart. The resulting upward-sloping line on the chart would show you that price is a function of the size of the house — when the square footage goes up, the price usually goes up.
This understanding of the relationship between these two variables would allow a machine learning model to use data to make statistical predictions about the future selling prices of homes. Stated another way, linear regression allows you to use data you have seen to define a function that makes inferences about data you haven’t. Regression is the first tool in the data scientist’s toolbox.
Figure 1. A look at linear regression analysis, from “An Introduction to Linear Regression Analysis” by data scientist David Longstreet.
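As a rough, hands-on illustration of the idea, scikit-learn can fit this one-variable regression in a few lines and report the slope and Y-intercept described above. The sales figures below are invented purely for illustration.

```python
# A minimal sketch of simple linear regression: price as a function of square
# footage. The sales figures below are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

square_feet = np.array([[1400], [1600], [1700], [1875], [2100], [2350]])
sale_price = np.array([245_000, 279_000, 299_000, 318_000, 355_000, 405_000])

model = LinearRegression().fit(square_feet, sale_price)
print("slope (price per extra square foot):", model.coef_[0])
print("Y-intercept:", model.intercept_)
print("predicted price of a 2,000 sq ft house:", model.predict([[2000]])[0])
```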
Solving harder problems with decision trees
Data in the real world, of course, isn’t as simple as it is in the previous example. There are always complexities and nuances to data. To stick with our housing market example, the value of houses might also be influenced by dwelling type, lot size, recent upgrades, proximity to a neighborhood park and intangible variables like curb appeal. And, in the real world, houses wouldn’t all be in the same neighborhood, so your machine learning model must also consider the ZIP code for the property.
To consider this wider range of variables, we need to dig deeper into the data scientist’s toolbox and pull out some more sophisticated machine learning methods, including random forests and gradient boosting. These capabilities help you train models that can make more accurate predictions based on data that is too complex to be understood with simple linear regression tools.
Random forest is a technique that builds a collection, or ensemble, of decision trees. Decision trees make predictions based on more complex relationships in data. In this case, the machine learning algorithm is trained on a set of data. As it works with the training data, the algorithm generates many decision trees, each from a random sample of the data, and explores different if-this/then-that branches in the trees. The idea is that, when one thing happens, there are consequences that lead down one branch or another.
Gradient boosting is a technique that helps with noisier data, or data that seems to be all over the place. It builds decision trees one after another, with each new tree trained to correct the errors of the trees that came before it. Along the way, it helps you determine which features described by the data really matter to a prediction and which can be safely ignored.
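To make these two ensemble methods concrete, the minimal sketch below fits both on the same data and compares their scores on held-out validation data. It uses a synthetic dataset as a stand-in for the richer housing features discussed above.

```python
# A minimal sketch comparing a random forest and gradient boosting on the same
# data. The synthetic dataset stands in for the richer housing features above.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2_000, n_features=8, noise=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
boosted = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# R^2 on held-out data: 1.0 is a perfect fit, higher is better.
print("random forest:    ", forest.score(X_val, y_val))
print("gradient boosting:", boosted.score(X_val, y_val))
```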
So how do you get there?
With the random forest technique, you feed massive amounts of data into the algorithm, and it runs the data down different paths (trees), looking for patterns in the data. It iterates over and over to improve its predictive capabilities. After a great deal of training, the machine learning algorithm settles on the decision trees that work the best, and it combines them, averaging their predictions for numeric values or taking a majority vote for categories. Along the way, no human is involved in generating the decision trees. The machine learning algorithm does it all on its own. In the end, the trained model based on decision trees can be used to make classifications and predictions with very good accuracy for many types of enterprise data.
Figure 2. A look at the random forest technique, from “Random forests and decision trees from scratch in python” by Vaibhav Kumar, published in Towards Data Science.
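One way to see that averaging in action is to query the individual trees inside a trained scikit-learn random forest: for a regression task, the forest’s prediction is simply the mean of its trees’ predictions. This snippet continues from the sketch above and reuses its forest and X_val variables.

```python
# Continuing the sketch above: a random forest regressor's prediction is the
# average of the predictions made by its individual decision trees.
import numpy as np

per_tree = np.array([tree.predict(X_val[:1]) for tree in forest.estimators_])
print("average of the individual trees:", per_tree.mean())
print("forest's own prediction:        ", forest.predict(X_val[:1])[0])
```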
At this point, data scientists and business leaders jump back into the game. They make the decision on when a trained model is ready to be validated with new data that it hasn’t seen before. Once that process is successful, the human decision makers determine when the model is ready to be put into production. They set the thresholds for this decision. For example, they might dictate that the predictive capability of the model must be at least 95 percent accurate before it goes live.
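In practice, that go/no-go check often comes down to a few lines of code run against the final, untouched test set. The sketch below is a hypothetical helper for a classification model, with the 95 percent bar passed in as a threshold.

```python
# A minimal sketch of a deployment gate: the model only goes live if its
# accuracy on the untouched test set clears the threshold the business set.
from sklearn.metrics import accuracy_score

def ready_for_production(model, X_test, y_test, threshold=0.95):
    """Return True if the trained classifier meets the agreed accuracy bar."""
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"test accuracy: {accuracy:.1%}")
    return accuracy >= threshold
```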
And, of course, the process doesn’t end there. When you put a trained machine learning model to work, you need to monitor it over time to verify that its predictions are accurate and useful. This is a bit like having an actual employee on the job. You do periodic performance reviews to make sure the employee is meeting all the expectations that come with a particular job title. If so, great! If not, you provide additional training to help improve performance and achieve business objectives.
Supervised vs. unsupervised learning
There is one additional nuance to be aware of here: the distinction between supervised and unsupervised learning. With the supervised approach, data is labeled, so the model has both features and answers to train against. For example, in addition to knowing the input factors (such as the square footage and the location of a house), the model also knows the expected answer (the sales price). For this article, we have kept the focus on the supervised approach, which is used in the vast majority of enterprise machine learning applications.
With the unsupervised approach, the data isn’t labeled, and the model has to figure things out on its own. We will take up the topic of unsupervised learning in subsequent articles, in which we talk about techniques that, for example, use clustering to find natural groupings of items in large amounts of unlabeled data. With these unsupervised learning processes, the algorithm, all on its own, goes through data and identifies groups — for example, the customer buckets that different types of shoppers belong in.
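As a small preview of the unsupervised approach, the sketch below uses k-means clustering from scikit-learn to group shoppers into buckets based only on their behavior. The shopper features are invented for illustration.

```python
# A minimal sketch of unsupervised learning: k-means groups shoppers into
# clusters using only their behavior, with no labels to train against.
import numpy as np
from sklearn.cluster import KMeans

# Invented features: [visits per month, average basket size in dollars]
shoppers = np.array([[2, 30], [3, 35], [2, 28],      # occasional, small baskets
                     [12, 40], [11, 45], [13, 38],   # frequent, small baskets
                     [3, 220], [2, 250], [4, 240]])  # occasional, big baskets

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(shoppers)
print("cluster assigned to each shopper:", kmeans.labels_)
print("cluster centers:", kmeans.cluster_centers_)
```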
Key takeaways
In subsequent articles in this series, we will talk about some of the use cases for the machine learning techniques introduced here. For now, an important point to keep in mind is simply this: machine learning can help you turn raw data into models that can be applied to produce meaningful insights. With these techniques, you can use diverse data types, including unstructured and semi-structured data, to gain the understanding that leads to system-generated actions and decisions in AI applications.
Next up: a dive into deep learning
In the next article in this series, we will discuss deep learning, which is a type of machine learning built on a deep hierarchy of interconnected “neural network” layers. We will explain how deep learning techniques take massive amounts of data and determine the common rules and features associated with the data — without any help from humans. And, looking out a little further, in follow-on articles we will share real-world use-case examples from various organizations that are capitalizing on the power of AI.
We’re glad to have you aboard and hope you will continue to follow our series as we explore how AI will forever transform business and create new products, services, and jobs.
Jay Boisseau, Ph.D., is an artificial intelligence and high performance computing technology strategist at Dell EMC.
Lucas Wilson, Ph.D., is an artificial intelligence researcher and lead data scientist in the HPC and AI Innovation Lab at Dell EMC.
_____________________________________
About this series
Artificial intelligence has long been shrouded in mystery. It’s often talked about in futuristic terms, although it is in use today and enriching our lives in countless ways. In this series, Jay Boisseau, an AI and HPC strategist for Dell EMC, and Lucas Wilson, an AI researcher for Dell EMC, cut through the talk and explain in simple terms the rapidly evolving technologies and techniques that are coming together to fuel the adoption of AI systems on a broad scale.
_______________
[1] CIO, “Enterprise AI: The Ongoing Quest for Insight and Foresight,” December 2018.
[2] CIO, “Enterprise AI: Data Analytics, Data Science and Machine Learning,” February 2019.
Copyright © 2019 IDG Communications, Inc.