
Elastic Net Regression - Algorithm Intuition

Background

Brief

Elastic Net Regression is an algorithm that addresses the limitations of both ridge and lasso regression by combining them. Yes, you read that right: it is a combination of ridge and lasso regression, which is why it is often preferred over either one alone. It is also a regularized version of linear regression, so I will skip the explanations they share. I recommend understanding Linear Regression first and then continuing with this article, so that you can follow elastic net regression properly. Let’s understand the math behind this algorithm.

Math Behind

The cost function of elastic net regression contains both L1 and L2 penalties. The following is the cost function of elastic net regression.

Cost Function > Elastic Net Regression

$$ \operatorname{J}(\theta) = \operatorname{MSE}(\theta) + r \alpha \sum_{j=1}^{n}\left|\theta_{j}\right| + \frac{1-r}{2} \alpha \sum_{j=1}^{n} \theta_{j}^{2} $$

Where, \( \operatorname{MSE}(\boldsymbol{\theta}) \) = the mean squared error cost function of linear regression, explained in batch gradient descent > Partial Derivative of a Cost Function
r = mix ratio between ridge and lasso regression, a floating-point value between 0 and 1, where 0 is equivalent to ridge regression and 1 is equivalent to lasso regression.
\( \alpha \) = (alpha) penalty strength parameter given to the model.
n = number of features; remember the sums start from 1 because the bias value at the 0th position is not regularized.
\( \theta_{j} \) = the weight corresponding to the j-th feature.
|…| = absolute value, as used in lasso regression.
At the end, the weights are squared, as used in ridge regression.

Cost Function > Explanation

⦿ As stated in the equation above, \( \operatorname{MSE}(\boldsymbol{\theta}) \) is the mean squared error cost function, the same one minimized in batch gradient descent.

⦿ The mix ratio r is multiplied with the L1 penalty term of lasso regression.

⦿ Finally, the L2 penalty term of ridge regression is multiplied with \( \frac{1-r}{2} \).

⦿ This calculation gives us a cost value whose gradient is used to update the existing feature weights, as mentioned in batch gradient descent > updating weights; see the small sketch after this list.

⦿ Have you noticed that a mix ratio of r = 0 makes it completely a ridge regression, while r = 1 makes it a lasso regression?

⦿ Do not forget to scale the input data, as this regularized algorithm is as sensitive to feature scale as the others.
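
To make the formula concrete, here is a minimal NumPy sketch of the cost computation. The function name `elastic_net_cost` and its default values are illustrative, and `X` is assumed to already carry a leading column of 1’s so that the bias weight `theta[0]` stays out of both penalty terms.

```python
import numpy as np

def elastic_net_cost(X, y, theta, alpha=1.0, r=0.5):
    """Elastic net cost from the formula above (names and defaults are illustrative)."""
    residuals = X @ theta - y              # prediction errors
    mse = (residuals ** 2).mean()          # MSE(theta)
    l1 = np.abs(theta[1:]).sum()           # sum of |theta_j|, bias theta[0] excluded
    l2 = (theta[1:] ** 2).sum()            # sum of theta_j^2, bias theta[0] excluded
    return mse + r * alpha * l1 + (1 - r) / 2 * alpha * l2
```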

Prediction

After this computation, we get new coefficients/weights that we can use to predict with the following formula:

$$ y = x * \boldsymbol{\theta} $$

Where x is our input variable matrix, as shown below, and \( \boldsymbol{\theta} \) is the elastic net regression coefficients/weights calculated earlier, together with the bias value \( \boldsymbol{\theta_{0}} \).

$$ x = \begin{pmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1k} \\ 1 & x_{21} & x_{22} & \ldots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{nk} \end{pmatrix} $$

All the input records are present from the second column onward in x; the 1’s in the first column are kept so that the bias value \( \boldsymbol{\theta_{0}} \) multiplies them. We have n records and k features/variables in this matrix.
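
The sketch below, with made-up numbers, shows how the bias column of 1’s is prepended to the feature matrix before taking the matrix product with the learned weights; all values here are purely illustrative.

```python
import numpy as np

# Hypothetical input: 3 records, 2 features (made-up values).
X_raw = np.array([[2.0, 3.0],
                  [1.5, 0.5],
                  [4.0, 1.0]])

# Prepend the column of 1's so the bias theta_0 multiplies it.
X = np.c_[np.ones((X_raw.shape[0], 1)), X_raw]

# Hypothetical learned weights: [theta_0 (bias), theta_1, theta_2].
theta = np.array([0.4, 1.2, -0.7])

# y = x * theta, one prediction per record.
y_pred = X @ theta
print(y_pred)
```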

Conclusion

⦿ Elastic Net Regression is a middle ground between the ridge and lasso regression algorithms, hence it manages the limitations of both pretty well.

⦿ Plain linear regression without any regularization is not advised; regularized models like ridge, lasso, and elastic net should be used instead (a short scikit-learn sketch follows this list).

⦿ When the number of features is high, lasso and elastic net regression should be used, as they minimize the impact of useless features.

⦿ Elastic net regression performs better than lasso regression when the number of features exceeds the number of training records or when some of the features are strongly correlated.
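
As a hedged, end-to-end sketch of the ideas above, here is how elastic net regression might be used with scikit-learn on synthetic data. The dataset and hyperparameter values are made up for illustration, and note that scikit-learn scales its penalty terms slightly differently from the cost function above, so its alpha is not numerically identical to ours.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data purely for illustration.
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=42)

# StandardScaler handles the feature scaling this algorithm is sensitive to;
# l1_ratio plays the role of the mix ratio r and alpha is the penalty strength.
model = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))
model.fit(X, y)
print(model.predict(X[:5]))
```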

For more information, check the following articles on different topics mentioned in this article:

Batch Gradient Descent
Linear Regression
Ridge Regression (L2 Regularization)
Lasso Regression (L1 Regularization)

End Quote

“Machine Learning is going to result in a real revolution.” - Greg Papadopoulos

This post is licensed under CC BY 4.0 by the author.