Background
Brief
Lasso regression is also known as L1 regularization, and LASSO stands for Least Absolute Shrinkage and Selection Operator. It is a regularized version of linear regression that adds an L1 penalty term to the cost function, thereby shrinking some coefficients all the way to zero and eliminating their impact on the model. Let’s understand it better with its cost function.
As this is a regularized version of linear regression, check out linear regression here: Linear Regression.
Intuition
Because it is a regularized version of linear regression, I will not repeat all the shared details. I would suggest you go through linear regression first, including all the calculations up to the cost function; the rest is covered here for lasso regression. We will look directly at where it differs from linear regression.
Math Behind
Following is the cost function of lasso regression:
Cost Function > Lasso Regression
$$ \operatorname{J}(\boldsymbol{\theta}) = \operatorname{MSE}(\boldsymbol{\theta}) + \alpha\sum _{j=1}^{m} \left|w _{j} \right| $$
where,
$$ \operatorname{sign}(w_{j}) = \begin{cases} -1 & \text { if } w _{j} < 0 \cr \ \ \ 0 & \text { if } w _{j} = 0 \cr +1 & \text { if } w _{j} > 0 \end{cases} $$
here in the first equation,
m = number of features,
\( \operatorname{MSE}(\boldsymbol{\theta}) \) = the mean squared error cost of linear regression, whose gradient vector is explained here in batch gradient descent: Partial Derivative of a Cost Function,
\( \alpha \) = (alpha) a hyperparameter that controls how much L1 penalty we want to apply to the feature weights,
w = weight of a feature.
As the second equation shows, the absolute values of the weights are used in the penalty, and the sign function \( \operatorname{sign}(w_{j}) \) is what appears when we take the gradient of that penalty: it is -1 for a negative weight, 0 for a zero weight and +1 for a positive weight.
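To make the two equations above concrete, here is a minimal NumPy sketch of the cost computation. The names `lasso_cost`, `X_b` and `theta` are my own, and I assume the bias term \( \theta_{0} \) is kept out of the L1 penalty, since the sum in the formula runs only over the m feature weights.

```python
import numpy as np

def lasso_cost(theta, X_b, y, alpha):
    """Lasso cost J(theta) = MSE(theta) + alpha * sum(|w_j|).

    X_b   : (n, m+1) design matrix with a leading column of 1's for the bias
    theta : (m+1,) parameter vector; theta[0] is the bias, theta[1:] are the feature weights
    y     : (n,) target values
    alpha : strength of the L1 penalty
    """
    residuals = X_b @ theta - y                   # prediction errors
    mse = (residuals ** 2).mean()                 # mean squared error part
    l1_penalty = alpha * np.abs(theta[1:]).sum()  # sum of |w_j| over the feature weights only
    return mse + l1_penalty
```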
Cost Function > Explanation
⦿ As stated in the equation above, \( \operatorname{MSE}(\boldsymbol{\theta}) \) is the mean squared error cost function whose partial derivatives are used in gradient descent.
⦿ In the L1 penalty calculation, the weights are taken as absolute values, and their sum is then multiplied by the hyperparameter alpha.
⦿ Looking at the second equation, \( \operatorname{sign}(w_{j}) \) maps each weight to -1, 0 or +1 and appears when computing the gradient of the penalty term.
⦿ During the gradient update, this \( \alpha \cdot \operatorname{sign}(w_{j}) \) term pushes positive weights down and negative weights up, i.e. always toward zero, while weights that are already zero are left untouched by the penalty.
⦿ The old weights are then updated using the newly computed gradient and a learning rate, as mentioned in batch gradient descent (a sketch of this loop follows the list).
⦿ These new weights are used for new predictions, and the entire process is repeated until the error stops reducing.
⦿ Based on these weights and the L1 penalty calculation, we can say that lasso regression eliminates the impact of the features that are least important to the dependent variable/feature by driving their weights all the way to zero.
⦿ That way it also performs feature selection implicitly.
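The bullets above describe one pass of the update; below is a minimal sketch of the full loop in NumPy, under my own assumptions: the function name and iteration count are placeholders, the bias term (index 0) is left un-penalized, and production libraries such as scikit-learn actually use coordinate descent rather than this plain subgradient step.

```python
import numpy as np

def lasso_batch_gradient_descent(X_b, y, alpha=0.1, eta=0.01, n_iterations=1000):
    """Illustrative batch gradient descent for lasso regression.

    Adds the subgradient of the L1 penalty, alpha * sign(w_j), to the usual
    MSE gradient; the bias term (index 0) is assumed to be unpenalized.
    """
    n, n_params = X_b.shape
    theta = np.zeros(n_params)
    for _ in range(n_iterations):
        # gradient of the MSE part, same as plain linear regression
        mse_gradient = (2 / n) * X_b.T @ (X_b @ theta - y)
        # sign(w_j) term from the second equation: -1, 0 or +1 per weight
        l1_gradient = alpha * np.sign(theta)
        l1_gradient[0] = 0.0                      # assumption: bias is not regularized
        theta -= eta * (mse_gradient + l1_gradient)
    return theta
```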
Lasso Regression > Things to Remember
⦿ It is crucial to scale the input features (e.g. with StandardScaler) because regularized regression models are sensitive to the scale of the features.
⦿ Another thing to remember is that in the above equation, alpha is a hyperparameter that we give to our model while developing it, so it has to be chosen wisely.
⦿ A small alpha changes the coefficients only by a small margin and keeps the model close to plain linear regression, whereas a large alpha shrinks the coefficients heavily and can lead to under-fitting, so you should experiment with different alpha values. A wisely chosen alpha value will prevent the model from over-fitting.
⦿ Transforming our input features with polynomial features and then applying lasso regression has also proved to make a good model (see the pipeline sketch below).
Check out how to transform features to polynomial and perform polynomial regression here: Polynomial Regression
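As a hedged illustration of the scaling, alpha and polynomial points above, here is one possible scikit-learn pipeline; the toy data, the polynomial degree and the alpha grid are all placeholder choices of mine, not a prescribed setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso, LassoCV

# toy data just to make the sketch runnable
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 2))
y = 0.5 * X[:, 0] ** 2 - 1.5 * X[:, 1] + rng.normal(0, 0.3, size=200)

# polynomial expansion -> scaling -> lasso; the alpha here is an arbitrary guess
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Lasso(alpha=0.1),
)
model.fit(X, y)

# alternatively, let cross-validation pick alpha instead of guessing it
model_cv = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LassoCV(alphas=np.logspace(-3, 1, 20), cv=5),
)
model_cv.fit(X, y)
print(model_cv[-1].alpha_)   # alpha selected by cross-validation
```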
Prediction
After doing this computation, we will get new coefficients/weights that we can use to make predictions with the following formula:
$$ y = x * \boldsymbol{\theta} $$
Where x is our input variable matrix, as shown below, and \( \boldsymbol{\theta} \) is the vector of lasso regression coefficients/weights calculated earlier, including the bias term \( \boldsymbol{\theta_{0}} \).
$$ x = \pmatrix{ 1 & x_{11} & x_{12} & \ldots & x_{1k} \cr 1 & x_{21} & x_{22} & \ldots & x_{2k} \cr \vdots & \vdots & \vdots & \ldots & \vdots \cr \vdots & \vdots & \vdots & \ldots & \vdots \cr 1 & x_{n1} & x_{n2} & \ldots & x_{nk} \cr } $$
All the input records fill x from the second column onward; the 1’s in the first column are there to pick up the bias value \( \boldsymbol{\theta_{0}} \), because the bias is multiplied with that first column. This matrix has n records and k features/variables.
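A minimal sketch of this prediction step, assuming `theta` is a made-up parameter vector with the bias first and `X` holds three records of raw feature values:

```python
import numpy as np

# hypothetical learned parameters: bias theta_0 followed by the feature weights
theta = np.array([0.7, 2.1, 0.0, -1.3])

# three records with k = 3 features each (placeholder values)
X = np.array([
    [1.5, 0.2, 3.1],
    [0.4, 1.1, 0.0],
    [2.2, 0.5, 1.8],
])

# prepend the column of 1's so theta_0 multiplies it, exactly as in the matrix above
X_b = np.c_[np.ones((X.shape[0], 1)), X]

y_pred = X_b @ theta   # y = x * theta from the formula above
print(y_pred)
```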
This is where lasso regression overcomes a limitation of ridge regression: the L1 penalty can drive the weights of unimportant features all the way to exactly zero, whereas the L2 penalty of ridge regression only shrinks them toward zero without eliminating them. Adding the L1 penalty to the cost function, with alpha multiplied by the sum of the absolute values of the weights, therefore gives us sparser coefficients and implicit feature selection.
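To see this difference in a reproducible way, the sketch below fits scikit-learn’s Lasso and Ridge on the same synthetic data where only the first two of ten features actually matter; the data shape and alpha values are arbitrary choices of mine.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only the first two features influence the target; the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print(np.round(lasso.coef_, 2))  # the noise weights end up exactly 0
print(np.round(ridge.coef_, 2))  # ridge shrinks them but keeps small non-zero values
```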
Conclusion
Lasso regression improves the weight calculation: it emphasises the important features and eliminates the impact of the least important ones, which is why it is called the Least Absolute Shrinkage and Selection Operator.
Evaluation metrics are the same as for linear regression. You can use MSE, RMSE, MAE, etc., whichever suits your use case.
Check out evaluation metrics here: Evaluation Metrics
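As a small illustration of these metrics, assuming `y_true` and `y_pred` are the actual and predicted targets (placeholder values here):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# placeholder targets and predictions purely for illustration
y_true = np.array([3.0, 5.5, 2.1, 7.8])
y_pred = np.array([2.8, 5.9, 2.5, 7.1])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE is just the square root of MSE
mae = mean_absolute_error(y_true, y_pred)
print(mse, rmse, mae)
```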
Keep in mind that when we transform input features into polynomial features, they become highly correlated. In that case, lasso regression tends to keep one feature from a correlated group and set the others to zero, so the surviving coefficients can come out larger.
It is useful for high-dimensional data because it performs feature selection and handles correlated features.
There is also another good regression algorithm, Elastic Net regression, which is a combination of both ridge and lasso regression.
If you want to know more about a batch gradient descent then check: Batch Gradient Descent
For a ridge regression check this: Ridge Regression (L2 Regularization)
End Quote
“Artificial Intelligence is the new electricity.” - Andrew Ng