A Complete Tutorial On Implementing Lasso Regression In Python With MachineHack Data Science Hackathon

Regression, overfitting and regularization are terms that come up often in Machine Learning, Data Science, or any process that involves predictive analysis using data. Understanding regularization and the methods used to regularize can have a big impact on a predictive model's ability to produce reliable, low-variance predictions.

In this article, we will learn to implement one of the key regularization techniques in Machine Learning using scikit-learn and Python.

What is Regularization?

Overfitting is one of the most frustrating problems with a Machine Learning model. After all the time-consuming work of gathering, cleaning and preprocessing the data, the model may still fail to produce good results. Data can contain a lot of noise: variance in the target variable for identical predictors, irrelevant features, or corrupted data points. The ML model cannot tell the noise apart from the signal and so trains on the noise as well. This hurts the model's predictions on new data. This is called overfitting.

In simple words, overfitting is the result of an ML model trying to fit everything it gets from the data, including the noise.

Why regularization?

Regularization is intended to tackle the problem of overfitting. Overfitting becomes a clear menace when there is a large dataset with thousands of features and records. Ridge regression and Lasso regression are two popular regression techniques that make use of regularization.

Both techniques work by penalising the magnitude of the feature coefficients while also minimizing the error between predictions and actual values. The key difference is that Lasso regression can nullify the impact of an irrelevant feature: it can reduce the coefficient of a feature to exactly zero, eliminating it from the model. It is therefore better at reducing variance when the data contains many insignificant features. Ridge regression, by contrast, shrinks coefficients but cannot reduce them to exactly zero. Ridge regression performs better when most of the features in the data are relevant and useful.
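The difference can be seen in a small experiment. The sketch below is not part of the hackathon workflow: it assumes synthetic data from scikit-learn's make_regression and an arbitrary penalty strength (alpha=10.0), and simply counts how many coefficients each model sets to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration only):
# 20 features, of which only 5 actually influence the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# The alpha values are arbitrary illustrative choices, not tuned settings.
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

# Count the coefficients that each model sets to exactly zero.
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print("Coefficients set to zero by Ridge:", ridge_zeros)
print("Coefficients set to zero by Lasso:", lasso_zeros)
```

On data like this, Ridge typically keeps every feature with a small non-zero weight, while Lasso drops the uninformative ones entirely.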

Lasso Regression

Lasso stands for Least Absolute Shrinkage and Selection Operator. Mathematically, Lasso regression minimizes the following quantity:

Residual Sum of Squares + λ * (Sum of the absolute values of the coefficients)

Where,

λ denotes the amount of shrinkage
λ = 0 implies all features are considered, and it is equivalent to linear regression, where only the residual sum of squares is used to build a predictive model
λ = ∞ implies no feature is considered, i.e., as λ approaches infinity, it eliminates more and more features
The bias increases as λ increases
The variance increases as λ decreases
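The effect of λ listed above can be sketched by fitting the same data with increasing penalty strengths. This is a minimal illustration on synthetic data, not the hackathon dataset. In scikit-learn, λ is called alpha, and the residual sum of squares is scaled by 1/(2n), so the numeric values of alpha are not directly comparable to λ in the formula above. The variable alpha_max is a name introduced here for the smallest penalty at which every coefficient becomes zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data, assumed purely for illustration.
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=1)

# With an intercept, scikit-learn centers X and y before fitting.
# For centered data, every coefficient is zero once alpha >= max|Xc^T yc| / n.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / len(y)

# Sweep from a very weak penalty to one just above alpha_max.
alphas = [0.001 * alpha_max, 0.1 * alpha_max, 0.5 * alpha_max, 1.01 * alpha_max]
nonzero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))
    print(f"alpha = {alpha:10.4f} -> non-zero coefficients: {nonzero_counts[-1]}")
```

The last setting, just above alpha_max, leaves no feature in the model, which is the practical counterpart of the λ = ∞ case.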

Implementing Lasso Regression In Python

For this example code, we will consider a dataset from MachineHack’s Predicting Restaurant Food Cost Hackathon.

Consider going through the following article to help you with Data Cleaning and Preprocessing:

A Complete Guide to Cracking The Predicting Restaurant Food Cost Hackathon By MachineHack

After completing all the steps up to, but not including, Feature Scaling, we can proceed to building a Lasso regression model. We skip feature scaling because the Lasso regressor comes with a parameter (normalize) that lets us normalise the data while fitting it to the model.

Let's Code!

import numpy as np

Creating New Train and Validation Datasets

from sklearn.model_selection import train_test_split
# new_data_train: the cleaned and preprocessed training data
data_train, data_val = train_test_split(new_data_train, test_size=0.2, random_state=2)

Classifying Predictors and Target

# Classifying independent and dependent features
# Dependent variable (target): the last column
Y_train = data_train.iloc[:, -1].values
# Independent variables (predictors): every column except the last
X_train = data_train.iloc[:, 0:-1].values
# Independent variables for the validation set
X_test = data_val.iloc[:, 0:-1].values
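The slicing above assumes that the target column is the last column of the DataFrame. A minimal sketch on a toy DataFrame shows what each slice returns. The VOTES and RATING columns and all values are made up for illustration; only COST is a column name taken from this article.

```python
import pandas as pd

# Hypothetical toy data: VOTES and RATING are invented column names,
# COST mirrors the target column used in this article.
toy = pd.DataFrame({
    "VOTES": [120, 45, 300],
    "RATING": [4.1, 3.6, 4.5],
    "COST": [800, 300, 1500],
})

y = toy.iloc[:, -1].values    # last column only -> target
X = toy.iloc[:, 0:-1].values  # every column except the last -> predictors

print("Predictors shape:", X.shape)
print("Target values:", y)
```

If the target is not the last column, selecting it by name (for example toy["COST"]) is the safer choice.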

Evaluating The Model With RMSLE

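def score(y_pred, y_true):
    # Root mean squared logarithmic error, computed here with base-10 logs
    error = np.square(np.log10(y_pred + 1) - np.log10(y_true + 1)).mean() ** 0.5
    # Higher is better: 1 means the predictions match the true values exactly
    score = 1 - error
    return score

# Actual costs of the validation set as a NumPy array
actual_cost = list(data_val['COST'])
actual_cost = np.asarray(actual_cost)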
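To get a feel for the scale of this score, the sketch below applies the same function to two hand-made cases: perfect predictions, and predictions that are each off by one unit on the log10 scale. The input values are made up for illustration.

```python
import numpy as np

def score(y_pred, y_true):
    # Same metric as above: 1 minus the RMSLE, using base-10 logs
    error = np.square(np.log10(y_pred + 1) - np.log10(y_true + 1)).mean() ** 0.5
    return 1 - error

# Made-up values chosen so that value + 1 is a power of ten
y_true = np.array([9.0, 99.0, 999.0])

# Perfect predictions: the error is 0, so the score is 1.0
perfect = score(y_true, y_true)

# Each prediction + 1 is ten times the true value + 1,
# so the error is 1 and the score is approximately 0
off_by_tenfold = score(np.array([99.0, 999.0, 9999.0]), y_true)

print("Perfect predictions:", perfect)
print("Predictions off by a factor of ten:", off_by_tenfold)
```

A score of 0 therefore does not mean a useless model in an absolute sense; it means the predictions are off by roughly one order of magnitude on average.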


Building the Lasso Regressor

# Lasso Regression
from sklearn.linear_model import Lasso

# Initializing the Lasso regressor with normalization enabled.
# Note: the normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this line runs as written only on older versions.
lasso_reg = Lasso(normalize=True)

# Fitting the training data to the Lasso regressor
lasso_reg.fit(X_train, Y_train)

# Predicting for X_test
y_pred_lass = lasso_reg.predict(X_test)

# Printing the score with RMSLE
print("\n\nLasso SCORE : ", score(y_pred_lass, actual_cost))

Output:

0.7335508027883148

The Lasso regressor attained a score of about 0.73 on the validation set. Note that this is 1 minus the RMSLE as defined above, not a classification accuracy.
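The Lasso(normalize=True) call used in this article is not available in scikit-learn 1.2 and later. One common alternative is to scale the features explicitly in a pipeline, sketched below on synthetic data rather than the hackathon dataset. This is an assumption about how one might adapt the code, not the method used to produce the score above: StandardScaler divides by the standard deviation, whereas normalize=True divided by the L2 norm, so the same alpha does not give the same model and the reported score will not be reproduced exactly.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed hackathon data (illustration only).
X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=10.0, random_state=2)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Scale the features, then fit Lasso. alpha=1.0 is the library default,
# not a tuned value.
model = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
model.fit(X_train, y_train)

# The scaler is fitted on the training data only and reused for validation.
y_pred = model.predict(X_val)
print("Number of validation predictions:", y_pred.shape[0])
print("Fitted Lasso coefficients:", model[-1].coef_)
```

Keeping the scaler inside the pipeline ensures the validation data is transformed with statistics learned from the training split alone.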

Also, check out the following resources to help you more with this problem:

Guide To Implement StackingCVRegressor In Python With MachineHack’s Predicting Restaurant Food Cost Hackathon

Model Selection With K-fold Cross Validation — A Walkthrough with MachineHack’s Food Cost Prediction Hackathon



Amal Nair

A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies.
