**Regression
analysis is a form of predictive modelling technique which investigates the
relationship between a dependent (target) and independent variable
(s) (predictor).** This technique is used for

finding the relationship between the variables.

Example: Let’s say, you want to estimate

growth in sales of a company based on current economic conditions. You have the

recent company data which indicates that the growth in sales is around two

and a half times the growth in the economy. Using this insight, we can predict

future sales of the company based on current & past information.

It indicates the significant relationships between the dependent variable and independent variable. It indicates the strength of the impact of multiple independent variables on a dependent variable

**Content**

1. Terminologies related to regression analysis

2. Linear regression

3. Logistic regression

4. Bias – Variance

5. Regularization

6. Performance metrics – Linear Regression

7. Performance metrics – Logistic Regression

## 1. Terminologies related to regression analysis

**Outliers:** Suppose there is an observation in the dataset which is having very high or very low value as compared to the other observations in the data, i.e. it does not belong to the population, such an observation is called an outlier. In simple words, it is an extreme value. An outlier is a problem because many times it hampers the results we get.

Multicollinearity: When the independent variables are highly correlated to each other than the variables are said to be multicollinear. Many types of regression techniques assume multicollinearity should not be present in the dataset. It is because it causes problems in ranking variables based on its importance. Or it makes the job difficult in selecting the most important independent variable (factor).

Heteroscedasticity: When the dependent variable’s variability is not equal across values of an independent variable, it is called heteroscedasticity. Example – As one’s income increases; the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.

**Underfitting and Overfitting:** When we use unnecessary explanatory variables, it might lead to overfitting. Overfitting means that our algorithm works well on the training set but is unable to perform better on the test sets. It is also known as the problem of high variance. When our algorithm works so poorly that it is unable to fit even training set well then it is said to underfit the data. It is also known as the problem of high bias.

**Autocorrelation:** It states that the errors associated with one observation are not correlated with the errors of any other observation. It is a problem when you use time-series data.

Suppose you have collected data from type of food consumption in eight different states. It is likely that the food consumption trend within each state will tend to be more like one another that consumption from different states i.e. their errors are not independent.

## 2. Linear Regression

Suppose you want to predict the amount of

ice cream sales you would make based on the temperature of the day, then you

can plot a regression line that passes through all data points.

Least squares are one of the methods to

find the best fit line for a dataset using linear regression. The most common

application is to create a straight line that minimizes the sum of squares of

the errors generated from the differences in the observed value and the value

anticipated from the model. Least-squares problems fall into two categories:

linear and nonlinear squares, depending on whether the residuals are linear in

all unknowns.

### 2.1. Important assumptions in regression analysis:

**There should be a linear and additive relationship between the dependent (response) variable and the independent (predictor) variable(s).**A linear relationship suggests that a change in response Y due to one unit change in X is constant, regardless of the value of X. An additive relationship suggests that the effect of X on Y is independent of other variables.

- There should be no correlation

between the residual (error) terms. Absence of this phenomenon is known as

Autocorrelation.

- The independent variables

should not be correlated. Absence of this phenomenon is known as

multicollinearity.

- The error terms must have constant variance. This phenomenon is known as homoskedasticity. The presence of non-constant variance is referred to as heteroskedasticity.

- The error terms must be

normally distributed.

## 3. Logistic Regression

**Logistic regression is based on Maximum Likelihood (ML) Estimation which says coefficients should be chosen in such a way that it maximizes the probability of Y given X (likelihood).** With ML, the computer uses different “iterations” in which it tries different solutions until it gets the maximum likelihood estimates.

Fisher Scoring is the most popular

iterative method of estimating the regression parameters.

**logit(p)
= b0 + b1X1 + b2X2 + —— + bk Xk**

**where
logit(p) = ln(p / (1-p))**

p: the probability of the dependent

variable equaling a “success” or “event”.

**The reason for not using linear regression for such cases is that the homoscedasticity assumption is violated. Errors are not normally distributed. Y follows a binomial distribution. **

### 3.1. Interpretation of Logistic Regression Estimates:

- If X increases by one unit, the

log-odds of Y increases by k unit, given the other variables in the model are

held constant.

- In logistic regression, the odds ratio is easier to interpret. That is also called Point estimate. It is the exponential value of the estimate.

- For Continuous Predictor: An unit increase in years of experience increases the odds of getting a job by a multiplicative factor of 4.27, given the other variables in the model are held constant.

- For Binary Predictor: The odds

of a person having years of experience getting a job are 4.27 times greater

than the odds of a person having no experience.

### 3.2. Assumptions of Logistic Regression:

- The logit transformation of the

outcome variable has a linear relationship with the predictor variables.

- The one way to check the assumption is to categorize the independent variables. Transform the numeric variables to 10/20 groups and then check whether they have a linear or monotonic relationship.

- No multicollinearity problem.

- No influential observations

(Outliers).

- Large Sample Size – It requires at least 10 events per independent variable.

### 3.3. Odd’s ratio

**In logistic regression, odds ratios compare the odds of each level**The ratios

of a categorical response variable.

quantify how each predictor affects the probabilities of each response level.

- For example, suppose that you want to determine whether age and gender affect whether a customer chooses an electric car. Suppose the logistic regression procedure declares both predictors to be significant. If GENDER has an odds ratio of 2.0, the odds of a woman buying an electric car are twice the odds of a man. If AGE has an odds ratio of 1.05, then the odds that a customer buys a hybrid car increase by 5% for each additional year of age

- Odd Ratio (exp of estimate)

less than 1 ==> Negative relationship (It means negative coefficient value

of estimate coefficients)

## 4. Bias – Variance

**Error due to Bias:** It is taken as the difference between the expected prediction of the model and the actual value. Bias measures how far off, in general, the predictions are from the actuals.

**Error
due to Variance:** It is taken as the variability of

a model prediction for a given data pint. The variance is how much the

prediction varies for different models.

### 4.1. Interpretation of Bias – Variance

- Imagine that the center of the

target is a model that perfectly predicts the correct values. As we move away

from the bulls-eye, our predictions get worse and worse. - Imagine we can repeat our

entire model building process to get several separate hits on the target. - Each hit represents an

individual realization of our model, given the chance variability in the

training data we gather.

### 4.2. Bias – Variance Trade-Off

- Bias is reduced and variance is

increased in relation to model complexity. **As more and more parameters are added to a model, the complexity of**

the model rises, and variance becomes our primary concern while bias steadily

falls.

## 5. Regularization

Regularization helps to solve over fitting problem which implies model performing well on training data but performing poorly on validation (test) data. **Regularization solves this problem by adding a penalty term to the objective function and control the model complexity using that penalty term.** We do not regularize the intercept term. The constraint is just on the sum of squares of regression coefficients of X’s.

**Regularization
is generally useful in the following situations:**

- Large number of variables
- Low ratio of number

observations to number of variables - High Multi-Collinearity

**L1
Regularization: **

- In L1 regularization we try to

minimize the objective function by adding a penalty term to the sum of the

absolute values of coefficients. - This is also known as least

absolute deviations method. - Lasso Regression makes use of

L1 regularization.

**L2
Regularization: **

- In L2 regularization we try to

minimize the objective function by adding a penalty term to the sum of the

squares of coefficients. - Ridge Regression or

shrinkage regression makes use of L2 regularization.

### 5.1. Lasso Regression

**Lasso
(Least Absolute Shrinkage and Selection Operator) penalizes the absolute size
of the regression coefficients.** Lasso regression

used L1 regularization. In addition, it can reduce the variability and

improving the accuracy of linear regression models.

**Lasso regression differs from ridge regression in a way that it uses absolute values in the penalty function, instead of squares.** This leads to penalizing (or equivalently constraining the sum of the absolute values of the estimates) values which causes some of the parameter estimates to turn out exactly zero. Larger the penalty applied, further the estimates get shrunk towards absolute zero. This results in variable selection out of given n variables.

### 5.2. Ridge Regression

**Ridge
Regression is a technique used when the data suffers from multicollinearity.** Ridge regression uses L2 regularization. In multicollinearity,

even though the least squares estimate (OLS) are unbiased, their variances are

large which deviates the observed value far from the true value. By adding

a degree of bias to the regression estimates, ridge regression reduces the

standard errors.

In a linear equation, prediction errors can be decomposed into two sub components. First is due to the bias and second is due to the variance. Prediction error can occur due to any one of these two or both components. Here, we’ll discuss the error caused due to variance. **Ridge regression solves the multicollinearity problem through Shrinkage Parameter λ (lambda).**

## 6. Performance Metrics – Linear Regression Model

### 6.1. R-Squared

- It measures the proportion of

the variation in your dependent variable explained by all your independent

variables in the model. - Higher the R-squared, the

better the model fits your data.

SS-res denotes residual sum of squares

SS-tot denotes total sum of squares

### 6.2. Adjusted R-Square

- The use of adjusted R2 is an

attempt to take account of non-explanatory variables that are added to the model.

Unlike R2, the adjusted R2 increases only when the increase in R2 is due to

significant independent variable that affects dependent variable - The adjusted R2 can be

negative, and its value will always be less than or equal to that of R2.

k denotes the total number of

explanatory variables in the model

n denotes the sample size.

### 6.3. Root Mean Square Error (RMSE)

- It explains how close the actual

data points are to the model’s predicted values. - It measures standard deviation

of the residuals.

yi denotes the actual values of

dependent variable

yi-hat denotes the predicted values

n – denotes sample size

### 6.4. Mean Absolute Error (MAE)

- MAE measures the average

magnitude of the errors in a set of predictions, without considering their

direction. - It’s the average over the test

sample of the absolute differences between prediction and actual observation

where all individual differences have equal weight.

## 7. Performance Metrics – Logistic Model

### 7.1. Performance Analysis (Development Vs OOT)

- The Kolmogorov-Smirnov (KS)

test is used to measure the difference between two probability distributions. - For example, the distribution

of the good accounts and the distribution of the bad accounts with the score.

### 7.2. Model Alignment

- Model alignment is done by

calculating the PDO (Point of Double Odds). The objective of scaling is to

convert the raw (model output) score into a scaled score so that Business can assess

and track changes in risk effectively. - The PDO is calculated only on

the good and the bad observations. For a scale to be specified, the following

three measurements need to be provided.

1. Reference score: This can be any arbitrary score value, taken as a reference point.2. Odds at reference score: The odds of becoming a bad, at the reference score.3. PDO: Points to Double odds. |

### 7.3. Model Accuracy

- Model accuracy is checked by

calculating the Bad Count error percentage (BCEP) for Development and OOT

sample. - Check whether the BCEP for each

model was within the permissible limit of +/- 25%

### 7.4. Model Stability

- To confirm whether the model is

stable over time, we check the Population stability Index (PSI) in Development

and OOT sample. - Check whether the PSI is within

the range of +/- 25%