My previous tutorial, “Machine Learning for Java developers,” introduced setting up a machine learning algorithm and developing a prediction function in Java. I demonstrated the inner workings of a machine learning algorithm and walked through the process of developing and training a machine learning model. This tutorial picks up where that one left off. I’ll show you how to set up a machine learning data pipeline, introduce a step-by-step process for taking your machine learning model from development into production, and briefly discuss technologies for deploying a trained machine learning model in a Java-based production environment.
Requirements and what to expect from this tutorial
Deploying a machine learning model is a separate endeavor from developing one, often implemented by a different team. Developing a machine learning model requires understanding the underlying data and having a good grasp of mathematics and statistics. Deploying a machine learning model in production is typically a job for someone with software engineering and operations experience.
This tutorial shows you how to make a machine learning model available in a highly scalable production environment. I assume you have some development experience and a basic understanding of machine learning models and algorithms; otherwise, you may want to start by reading “Machine learning for Java developers, Part 1.”
I’ll start with a brief refresher on supervised learning, including an example application that I’ll use to demonstrate how to train, deploy, and process a machine learning model for use in production.
Supervised machine learning: A refresher
I’ll use a simple, supervised machine learning model to illustrate the machine learning deployment process. The example machine learning model shown in Figure 1 can be used to predict the expected sale price of a house.
Recall that a machine learning model is a function with internal, learnable parameters that map inputs to outputs. In the above diagram, a linear regression function, hθ(x), is used to predict the sale price for a house based on a variety of features. The x variables of the function represent the input data. The θ (theta) variables represents the internal, learnable model parameters.
To predict the sale price of a house, you must first create an input data array of x variables. This array contains features such as the size of the lot or the number of rooms in a house. This array is called the feature vector.
Because most machine learning functions require a numerical representation of features, you will likely have to perform some data transformations in order to build a feature vector. For instance, a feature specifying the location of the garage could include labels such as “attached to home” or “built-in,” which have to be mapped to numerical values. When you execute the house-price prediction, the machine learning function will be applied with this input feature vector as well as the internal, trained model parameters. The function’s output is the estimated house price. This output is called a label.
Training the model
Internal, learnable model parameters (θ) are the part of the model that is learned from training data. The learnable parameters will be set during the training process. A supervised machine learning model like the one shown below has to be trained in order to make useful predictions.
Typically, the training process starts with an untrained model where all the learnable parameters are set with an initial value such as zero. The model consumes data about various house features along with real house prices. Gradually, it identifies correlations between houses features and house prices, as well as the weight of these relationships. The model adjusts its internal, learnable model parameters and uses them to make predictions.
After the training process, the model will be able to estimate the sale price of a house by assessing its features.
Machine learning algorithms in Java code
The HousePriceModel
provides two methods. One method implements the learning algorithm to train (or fit) the model. The other method is used for predictions.
The fit() method
The fit()
method is used to train the model. It consumes house features as well as house-sale prices as input parameters but returns nothing. The fit()
method requires the correct “answer” to be able to adjust its internal model parameters. Using housing listings paired with sale prices, the learning algorithm looks for patterns in the training data. From these, it produces model parameters that generalize from those patterns. As the input data becomes more accurate, the model’s internal parameters will be adjusted.
Listing 1. The fit() method is used to train a machine learning model
// load training data
// ...
// e.g. [{MSSubClass=60.0, LotFrontage=65.0, ...}, {MSSubClass=20.0, ...}]
List<Map<String, Double>> houses = ...;
// e.g. [208500.0, 181500.0, 223500.0, 140000.0, 250000.0, ...]
List<Double> prices = ...;
// create and train the model
var model = new HousePriceModel();
model.fit(houses, prices);
Note that the house features are double typed in the code. This is because the machine learning algorithm used to implemented the fit()
method requires numbers as input. All house features must be represented numerically so that they can be used as x parameters in the linear regression formula, as shown here:
hθ(x) = θ0 * x0 + … + θn * xn
The trained house price prediction model could look like what you see below:
price = -490130.8527 * 1 + -241.0244 * MSSubClass + -143.716 * LotFrontage + … * …
Here, the input house features such as MSSubClas
or LotFrontage
are represented as x variables. The learnable model parameters (θ) are set with values like -490130.8527 or -241.0244, which have been gained during the training process.
This example uses a simple machine learning algorithm, which requires just a few model parameters. A more complex algorithm, such as for a deep neural network, could require millions of model parameters; that is one of the main reasons why the process of training such algorithms requires high computation power.
The predict() method
Once you have finished training the model, you can use the predict()
method to determine the estimated sale price of a house. This method consumes data about house features and produces an estimated sale price. In practice, an agent of a real estate company could enter features such as the size of a lot (lot-area), the number of rooms, or the overall house quality in order to receive an estimated sale price for a given house.
Transforming non-numeric values
You will often be faced with datasets that contain non-numeric values. For instance, the Ames Housing dataset used for the Kaggle House Prices competition includes both numeric and textual listings of house features:
To make things more complicated, the Kaggle dataset also includes empty values (marked NA), which cannot be processed by the linear regression algorithm shown in Listing 1.
Real-world data records are often incomplete, inconsistent, lacking in desired behaviors or trends, and may contain errors. This typically occurs in cases where the input data has been joined using different sources. Input data must be converted into a clean data set before being fed into a model.
In the sample above, you would need to replace the missing (NA) numeric LotFrontage
value. You would also need to replace textual values such as MSZoning
“RL” or “RM” with numeric values. These transformations are necessary to convert the raw data into a syntactically correct format that can be processed by your model.
Once you’ve converted your data to a generally readable format, you may still need to make additional changes to improve the quality of input data. For instance, you might remove values not following the general trend of the data, or place infrequently occurring categories into a single umbrella category.
How to build your machine learning data pipeline
Often, the data preparation or preprocessing steps are arranged as a pipeline. For instance, the simplified house prediction pipeline below arranges a set of preprocessing transformer components with a final house prediction model.
The transformer components clean the raw data and transform it into a format the model is able to consume. The data becomes more suitable for the model after each stage in the transformation.
The pipeline pattern allows you to organize your transformation code so that each transformer component has a single responsibility. For instance, the CategoryToNumberTransformer
class below replaces all textual feature values with numeric ones. Because this transformer implementation does not handle null values, the transformer has to be processed after applying an AddMissingValuesTransformer
. Internally, the CategoryToNumberTransformer
holds a map using textual feature values as the key, and unique, generated numbers as values. The mapping of the MSZoning
feature might look as follows:
{FV=1, RH=2, RM=3, C=5, …, RL=8, «default»=-1}
When you call the transform()
method, textual values will be detected and transformed into numbers using the mapping collection, as shown in Listing 2.
Listing 2. Replace textual feature values with numeric ones
public class CategoryToNumberTransformer implements Transformer<Object, Double, Double> {
private final CategoryToNumberResolver categoryToNumber = new CategoryToNumberResolver();
public List<Map<String, Double>> transform(List<Map<String, Object>> houses) {
return houses.stream().map(this::transform).collect(Collectors.toList());
}
private Map<String, Double> transform(Map<String, Object> house) {
return house.entrySet()
.stream()
.collect(Collectors.toMap(feature -> feature.getKey(),
feature -> (feature.getValue() instanceof String)
? categoryToNumber.map(feature)
: (Double) feature.getValue()));
}
public void fit(List<Map<String, Object>> houses , List<Double> prices) {
houses.forEach(house -> house.entrySet()
.stream()
.filter(feature -> feature.getValue() instanceof String)
.forEach(categoryToNumber::add));
}
private static final class CategoryToNumberResolver {
private final Map<String, Double> categoryToNumberMapping = Maps.newHashMap();
void add(Map.Entry<String, Object> feature) {
// ..
}
Double map(Map.Entry<String, Object> feature) {
// ..
}
}
}
There are two ways to create the internal category-to-number map. To do it manually, you would add all possible entries to the map during development time. To do it dynamically, as shown above, you would scan all the available records at training time. In this example, the fit()
training method dynamically builds the category-to-number map. First it extracts a set of all textual values, then it uses the value set to build a map, which includes the newly generated numbers that are associated to the unique textual values.
Configuring the machine learning data pipeline
In most cases, preprocessing logic is specific to the model, so updating the logic of the preprocessing components requires re-training the model. For this reason, the preprocessing code and the model code are often packaged together, as shown below. Here, a generic Pipeline
class is used to arrange the transformers together with a final house prediction model.
Listing 3. A generic Pipeline class
var pipeline = Pipeline.add(new DropNumericOutliners("LotArea", 10))
.add(new AddMissingValuesTransformer())
.add(new CategoryToNumberTransformer())
.add(new AddComputedFeatureTransformer())
.add(new DropUnnecessaryFeatureTransformer("YrSold", "YearRemodAdd"))
.add(new HousePriceModel());
pipeline.fit(houses, prices);
// …
Some machine learning libraries provide pipeline abstractions similar to the example above. Others provides configurable and customizable preprocessing components only.