Lasso Regression adds an L1 penalty to ordinary least squares, shrinking some coefficients all the way to zero. This makes it a powerful tool for both regularization and automatic feature selection, especially when you suspect that only a handful of predictors actually matter. In this article, you will learn how Lasso works under the hood, how to implement it with scikit-learn, and how to tune its alpha parameter using cross-validation.
When you build a linear regression model with many features, overfitting becomes a real risk. The model fits noise in the training data instead of learning genuine patterns, and performance on unseen data suffers. Regularization techniques address this by penalizing large coefficients, and Lasso (Least Absolute Shrinkage and Selection Operator) is one of the most widely used approaches. Originally introduced by Robert Tibshirani in 1996, Lasso has become a standard tool in the machine learning toolkit, available through scikit-learn's linear_model module.
What Is Lasso Regression
Lasso Regression is a variation of linear regression that adds a penalty equal to the sum of the absolute values of the model's coefficients. The objective function that Lasso minimizes looks like this:
# Lasso objective function (conceptual)
# Minimize: (1 / 2n) * sum((y_i - y_hat_i)^2) + alpha * sum(|w_j|)
#
# Where:
# y_i = actual target value
# y_hat_i = predicted value
# w_j = model coefficient for feature j
# alpha = regularization strength (controls the penalty)
# n = number of samples
The first term is the standard mean squared error (MSE) that ordinary least squares minimizes. The second term is the L1 penalty, the sum of absolute values of all coefficients multiplied by the hyperparameter alpha. A larger alpha increases the penalty and forces more coefficients toward zero. When alpha is zero the objective reduces to ordinary linear regression, although scikit-learn advises against fitting Lasso with alpha=0 and recommends LinearRegression instead, since the coordinate descent solver is not designed for the unpenalized problem.
The name "LASSO" stands for Least Absolute Shrinkage and Selection Operator. The "selection" part is key -- unlike Ridge Regression, Lasso can set coefficients exactly to zero, effectively removing features from the model.
How L1 Regularization Works
The defining characteristic of L1 regularization is its ability to produce sparse models. To understand why, consider how the penalty interacts with the optimization process. The L1 penalty creates a diamond-shaped constraint region in coefficient space. When the loss function's contours intersect this diamond, they tend to hit at a corner, which corresponds to one or more coefficients being exactly zero.
This is fundamentally different from Ridge Regression's L2 penalty, which creates a circular constraint region. A circle has no corners, so the loss contours almost never touch it exactly on a coordinate axis; Ridge shrinks coefficients toward zero but rarely sets any of them exactly to zero.
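A quick way to see this geometric argument play out in code is to fit Lasso and Ridge on the same data and count exact zeros. The synthetic setup below is an assumption for illustration, not taken from the article's later examples.

```python
# Contrast sparsity: on the same assumed synthetic data, Lasso zeroes
# some coefficients exactly while Ridge merely shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=15, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso exact zeros:", np.sum(lasso.coef_ == 0))  # typically several
print("Ridge exact zeros:", np.sum(ridge.coef_ == 0))  # typically none
```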
Scikit-learn's Lasso implementation uses coordinate descent as its optimization algorithm. Coordinate descent works by optimizing one coefficient at a time while holding all others fixed, cycling through all coefficients repeatedly until convergence. This approach is efficient for the L1 penalty because the single-coordinate subproblem has a closed-form solution involving a soft-thresholding operation.
# Soft-thresholding function (the core of coordinate descent for Lasso)
import numpy as np
def soft_threshold(rho, alpha):
    """Apply soft-thresholding to a single coordinate.

    If rho is greater than alpha, shrink it by alpha.
    If rho is less than -alpha, increase it by alpha.
    Otherwise, set it to zero.
    """
    if rho > alpha:
        return rho - alpha
    elif rho < -alpha:
        return rho + alpha
    else:
        return 0.0
# Example: coefficient updates at different rho values
for rho in [3.0, 0.5, -2.0, 0.1, -0.1]:
    result = soft_threshold(rho, alpha=1.0)
    print(f"rho={rho:+.1f}, alpha=1.0 -> coefficient={result:+.1f}")
The soft-thresholding function is what gives Lasso its sparsity property. Any coefficient whose magnitude falls below the alpha threshold gets zeroed out entirely, rather than just being made smaller.
Basic Lasso with Scikit-Learn
Implementing Lasso in scikit-learn is straightforward. The Lasso class lives in sklearn.linear_model and follows the same fit-predict pattern as every other estimator in the library.
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Generate synthetic data with 20 features, only 5 of which are informative
X, y, true_coef = make_regression(
    n_samples=200,
    n_features=20,
    n_informative=5,
    noise=10,
    coef=True,
    random_state=42
)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
# Create and fit the Lasso model
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
# Make predictions
y_pred = lasso.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.4f}")
print(f"Non-zero coefficients: {np.sum(lasso.coef_ != 0)} out of {len(lasso.coef_)}")
print(f"Intercept: {lasso.intercept_:.4f}")
The key parameters you should know about when creating a Lasso instance include alpha (the regularization strength, defaulting to 1.0), max_iter (the maximum number of coordinate descent iterations, defaulting to 1000), and tol (the convergence tolerance). If your model fails to converge, try increasing max_iter or adjusting alpha.
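One way to check whether max_iter was the binding limit is the fitted model's n_iter_ attribute, which reports how many coordinate descent passes were actually run. The synthetic data below is an assumption for illustration.

```python
# Sketch: inspect n_iter_ after fitting to diagnose convergence behavior.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=42)

model = Lasso(alpha=1.0, max_iter=1000, tol=1e-4).fit(X, y)
# If this number hits the max_iter limit, the solver likely stopped early
print(f"Coordinate descent passes used: {model.n_iter_} (limit: 1000)")
```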
Always standardize your features before applying Lasso. Because the L1 penalty treats all coefficients equally, features measured on larger scales will be penalized less than those on smaller scales unless you normalize first. Use StandardScaler from sklearn.preprocessing in a pipeline.
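A minimal sketch of that advice, assuming synthetic data like the earlier examples: wrapping StandardScaler and Lasso in a single pipeline ensures the scaling parameters are learned from the training data only and applied consistently at prediction time.

```python
# Sketch, assuming synthetic data: scale and fit in one pipeline so that
# no scaling information leaks from evaluation data into training.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=42)

model = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
model.fit(X, y)
print(f"Train R^2: {model.score(X, y):.3f}")
```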
Tuning Alpha with LassoCV
Choosing the right alpha value is critical. Too small, and you get little regularization. Too large, and you zero out important features. Scikit-learn provides LassoCV, which performs cross-validation across a range of alpha values automatically.
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
# Create a pipeline with scaling and LassoCV
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso_cv', LassoCV(
        alphas=np.logspace(-4, 1, 50),  # 50 alpha values from 0.0001 to 10
        cv=5,                           # 5-fold cross-validation
        max_iter=10000,
        random_state=42
    ))
])
# Fit the pipeline
pipeline.fit(X_train, y_train)
# Extract the LassoCV model from the pipeline
lasso_cv = pipeline.named_steps['lasso_cv']
# Results
print(f"Best alpha: {lasso_cv.alpha_:.6f}")
print(f"Number of non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}")
print(f"R-squared on test set: {pipeline.score(X_test, y_test):.4f}")
# Show the coefficients
print("\nCoefficient values:")
for i, coef in enumerate(lasso_cv.coef_):
    if coef != 0:
        print(f"  Feature {i:2d}: {coef:+.4f}")
The LassoCV approach uses warm starting, meaning each model along the regularization path is initialized with the coefficients from the previous alpha value. This makes it substantially faster than wrapping a standard Lasso inside GridSearchCV. For datasets with many collinear features, LassoCV is generally the preferred choice.
An alternative is LassoLarsCV, which is based on the Least Angle Regression (LARS) algorithm. The LARS-based variant can be faster when the number of samples is very small relative to the number of features, and it explores more relevant alpha values along the path.
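The regime where LassoLarsCV shines is many more features than samples. The wide synthetic dataset below is an assumption used only to sketch its fit-and-inspect workflow.

```python
# Sketch, assuming a wide dataset (n_samples << n_features), where the
# LARS-based path solver is typically at its best.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV

X, y = make_regression(n_samples=60, n_features=100, n_informative=5,
                       noise=1.0, random_state=0)

lars_cv = LassoLarsCV(cv=5).fit(X, y)
print(f"Chosen alpha: {lars_cv.alpha_:.6f}")
print(f"Non-zero coefficients: {np.sum(lars_cv.coef_ != 0)} out of 100")
```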
Feature Selection with Lasso
One of the most practical applications of Lasso is automatic feature selection. Because Lasso drives irrelevant coefficients to zero, the surviving features with non-zero coefficients are effectively "selected" by the model. You can use this behavior to identify which features matter in your dataset.
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
import numpy as np
# Create data with known structure: 5 informative features out of 30
X, y, true_coef = make_regression(
    n_samples=300,
    n_features=30,
    n_informative=5,
    noise=5,
    coef=True,
    random_state=42
)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit Lasso with a moderate alpha
lasso = Lasso(alpha=0.5)
lasso.fit(X_scaled, y)
# Compare Lasso's selection with the true informative features
true_nonzero = np.where(true_coef != 0)[0]
lasso_nonzero = np.where(lasso.coef_ != 0)[0]
print("True informative features:", true_nonzero)
print("Lasso-selected features: ", lasso_nonzero)
# Check overlap
overlap = set(true_nonzero) & set(lasso_nonzero)
print(f"\nCorrectly identified: {len(overlap)} out of {len(true_nonzero)} true features")
print(f"False positives: {len(lasso_nonzero) - len(overlap)}")
You can also use scikit-learn's SelectFromModel meta-transformer to embed Lasso-based feature selection directly into a pipeline. This lets you chain feature selection with any downstream estimator.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Split the 30-feature dataset generated above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
# Pipeline: scale -> Lasso feature selection -> linear regression
feature_selection_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectFromModel(Lasso(alpha=0.1, max_iter=10000))),
    ('regressor', LinearRegression())
])
feature_selection_pipeline.fit(X_train, y_train)
# How many features survived?
selector = feature_selection_pipeline.named_steps['selector']
selected_mask = selector.get_support()
print(f"Features selected: {np.sum(selected_mask)} out of {X_train.shape[1]}")
print(f"R-squared: {feature_selection_pipeline.score(X_test, y_test):.4f}")
Lasso vs Ridge vs ElasticNet
Understanding when to use Lasso versus other regularized regression methods is essential for choosing the right tool for your problem.
Lasso (L1) adds the sum of absolute coefficient values as the penalty. It produces sparse models by driving some coefficients to exactly zero. Use Lasso when you believe only a few features are truly relevant and you want automatic feature selection.
Ridge (L2) adds the sum of squared coefficient values as the penalty. It shrinks all coefficients toward zero but never eliminates any entirely. Use Ridge when you believe many features contribute small amounts and multicollinearity is a concern.
ElasticNet (L1 + L2) combines both penalties. It has two hyperparameters: alpha controls overall regularization strength, and l1_ratio controls the balance between L1 and L2. When l1_ratio=1.0, ElasticNet is identical to Lasso. Use ElasticNet when you have groups of correlated features -- Lasso tends to pick one feature from each group arbitrarily, while ElasticNet tends to keep or drop them together.
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Compare all three
models = {
    'Lasso (L1)': Lasso(alpha=0.1, max_iter=10000),
    'Ridge (L2)': Ridge(alpha=1.0),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000),
}
for name, model in models.items():
    scores = cross_val_score(model, X_scaled, y_train, cv=5, scoring='r2')
    model.fit(X_scaled, y_train)
    n_nonzero = np.sum(model.coef_ != 0)
    print(f"{name:20s} | CV R2: {scores.mean():.4f} (+/- {scores.std():.4f}) | "
          f"Non-zero coefs: {n_nonzero}/{len(model.coef_)}")
When features are highly correlated, Lasso can behave unpredictably -- it may select one feature from a correlated group and ignore the others, and the choice can change with small perturbations in the data. In these scenarios, ElasticNet is often a more stable alternative.
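This instability is easy to provoke by duplicating a feature exactly. In the assumed setup below, Lasso tends to concentrate all the weight on one copy, while ElasticNet's L2 component makes the optimum unique and spreads the weight across both.

```python
# Sketch with an assumed construction: two perfectly correlated columns.
# Lasso tends to pick one copy; ElasticNet tends to share the weight.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])  # identical columns -> perfect correlation
y = 3.0 * x.ravel() + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.2).fit(X, y)

print("Lasso coefs:     ", lasso.coef_)  # weight tends to concentrate
print("ElasticNet coefs:", enet.coef_)   # weight tends to be shared
```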
Practical Example on Real Data
Here is a complete, end-to-end example using the California Housing dataset that ships with scikit-learn. This example covers data loading, preprocessing, model fitting, alpha tuning, evaluation, and coefficient interpretation.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np
# Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names
print(f"Dataset shape: {X.shape}")
print(f"Features: {feature_names}")
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Fit LassoCV to find the optimal alpha
lasso_cv = LassoCV(
    alphas=np.logspace(-5, 1, 100),
    cv=5,
    max_iter=20000,
    random_state=42
)
lasso_cv.fit(X_train_scaled, y_train)
# Predictions
y_pred = lasso_cv.predict(X_test_scaled)
# Evaluation metrics
print(f"\nOptimal alpha: {lasso_cv.alpha_:.6f}")
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
# Feature importance from coefficients
print("\nFeature Coefficients (sorted by magnitude):")
coef_importance = sorted(
    zip(feature_names, lasso_cv.coef_),
    key=lambda x: abs(x[1]),
    reverse=True
)
for name, coef in coef_importance:
    status = "" if coef != 0 else " [DROPPED]"
    print(f"  {name:12s}: {coef:+.4f}{status}")
In this example, the standardized coefficients reveal which features have the strongest influence on housing prices. Features with larger absolute coefficient values contribute more to the prediction, while any features driven to zero have been deemed uninformative by the Lasso penalty.
Key Takeaways
- Lasso adds an L1 penalty to linear regression, which is the sum of absolute coefficient values multiplied by alpha. This penalty shrinks coefficients and can force them to exactly zero.
- Sparsity is the defining advantage. Unlike Ridge Regression, Lasso performs automatic feature selection by eliminating irrelevant features entirely. This makes your model simpler and more interpretable.
- Alpha controls the trade-off. A small alpha means weak regularization (close to ordinary least squares), while a large alpha aggressively zeros out coefficients. Use LassoCV to find the optimal value through cross-validation.
- Always standardize your features before applying Lasso. The L1 penalty treats all coefficients equally, so features on larger scales will receive less effective regularization.
- Consider ElasticNet for correlated features. Lasso can behave erratically when predictors are highly correlated, selecting one from a group at random. ElasticNet blends L1 and L2 penalties to handle this scenario more gracefully.
- Scikit-learn uses coordinate descent to fit the Lasso model. This algorithm is efficient for sparse problems and has long been scikit-learn's standard solver for Lasso.
Lasso Regression strikes a balance between prediction accuracy and model simplicity that few other techniques can match. Whether you are working with high-dimensional genomics data, financial models with hundreds of potential indicators, or any dataset where you suspect many features are noise, Lasso gives you a principled way to let the data decide which features survive. Pair it with LassoCV for automatic alpha selection, and you have a production-ready workflow in just a few lines of Python.