L1 regularization usually entails some loss of predictive power of the model.
One of the properties of L1 regularization is that it forces the smallest weights to exactly 0, thereby reducing the number of features the model takes into account. This is desirable when the number of features (n) is large compared to the number of samples (N), which is why L1 is well suited to datasets with many features.
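This zeroing effect can be illustrated with the soft-thresholding operator, the proximal step associated with an L1 penalty (a minimal sketch; the function name and threshold value are illustrative, not from the text):

```python
import numpy as np

def soft_threshold(w, t):
    # L1 proximal operator: shrink each weight toward 0 by t,
    # and set it exactly to 0 if its magnitude is below t
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.8, -0.05, 0.3, 0.02, -0.6])
print(soft_threshold(w, 0.1))  # the two smallest weights become exactly 0
```

Weights whose magnitude falls below the threshold are set to exactly 0, which is how L1 performs feature selection rather than merely shrinking weights as L2 does.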
Linear regression with L1 regularization is known as the Least Absolute Shrinkage and Selection Operator (Lasso); it can be trained with the Stochastic Gradient Descent (SGD) algorithm.
In both cases the hyperparameters of the model are as follows:
- The learning rate of the SGD algorithm
- The regularization strength, a parameter that tunes the amount of regularization added to the model
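The two hyperparameters can be seen in a minimal sketch of SGD with an L1 penalty, here applied as a soft-thresholding step after each gradient update (this is an illustrative implementation under assumed names `eta` for the learning rate and `alpha` for the regularization strength; the data is synthetic):

```python
import numpy as np

def sgd_l1(X, y, alpha=0.1, eta=0.01, epochs=50, seed=0):
    # SGD for least squares with an L1 penalty.
    # Hyperparameters: eta (learning rate) and alpha (regularization strength).
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        for i in rng.permutation(n_samples):
            grad = (X[i] @ w - y[i]) * X[i]  # per-sample squared-error gradient
            w -= eta * grad                  # plain SGD step
            # proximal L1 step: shrink weights, zeroing those below eta * alpha
            w = np.sign(w) * np.maximum(np.abs(w) - eta * alpha, 0.0)
    return w

# Synthetic data: 10 features, only the first 3 are informative
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
true_w = np.array([2.0, -1.5, 1.0] + [0.0] * 7)
y = X @ true_w + 0.01 * rng.normal(size=200)

w = sgd_l1(X, y)
print(np.round(w, 3))  # weights on the 7 noise features end up at or near 0
```

Increasing `alpha` drives more weights to 0 (stronger feature selection) at the cost of more shrinkage bias on the informative weights, which is the trade-off behind the loss of predictive power mentioned above.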