A single machine learning model can only capture so much signal from your data. Ensemble methods combine the predictions of multiple models to produce results that are more accurate and more robust than any individual learner could achieve on its own. In this guide, you will learn how to implement the four major categories of ensemble learning in Python using scikit-learn: bagging, boosting, stacking, and voting.
If you have ever trained a decision tree and watched it overfit your training data, or built a logistic regression that just could not capture the complexity of your dataset, you have already felt the problem that ensemble methods are designed to solve. Instead of relying on one model's perspective, ensembles aggregate the wisdom of multiple models to reduce errors, control overfitting, and strengthen generalization.
Scikit-learn (version 1.8 as of this writing) provides a comprehensive sklearn.ensemble module with production-ready implementations of all the major ensemble techniques. This article walks through each one with working code you can run immediately.
What Are Ensemble Methods?
Ensemble methods combine the predictions of several base estimators built with one or more learning algorithms in order to improve generalizability and robustness over a single estimator. The core idea is straightforward: individual models have individual weaknesses, but when you combine them strategically, those weaknesses tend to cancel out.
There are two broad families of ensemble techniques. The first family is averaging methods, where several estimators are built independently and their predictions are averaged. The combined estimator tends to perform better than any single base estimator because its variance is reduced. Bagging and Random Forests fall into this category. The second family is boosting methods, where base estimators are built sequentially. Each new estimator focuses on correcting the mistakes made by the previous ones, reducing bias over time. AdaBoost and Gradient Boosting are the classic examples here.
Beyond these two families, scikit-learn also provides stacking and voting, which let you combine entirely different types of models rather than multiple copies of the same algorithm.
All code examples in this article use scikit-learn 1.8 and assume you have it installed. Run pip install scikit-learn to get the latest version. We use the Iris and California Housing datasets throughout so that examples are self-contained and require no external data files.
Bagging: Random Forest and BaggingClassifier
Bagging, short for Bootstrap Aggregating, works by training several instances of the same model on different random subsets of the training data. Each subset is drawn with replacement (a bootstrap sample). The individual predictions are then combined, typically by majority vote for classification or averaging for regression. Because each model sees a slightly different version of the data, the ensemble reduces variance and helps prevent overfitting.
Bagging works especially well with strong, complex models that tend to overfit, such as fully developed decision trees. The randomization in the data sampling smooths out the high variance these models naturally produce.
Random Forest
Random Forest is the most widely used bagging method. It trains a collection of decision trees, each on a bootstrap sample of the data, and adds an additional layer of randomness by considering only a random subset of features at each split. This dual randomization produces trees that are decorrelated from each other, which makes their averaged predictions significantly more stable.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
# Load data
X, y = load_iris(return_X_y=True)
# Create a Random Forest with 100 trees
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,  # Fully grown trees (high variance, reduced by bagging)
    min_samples_split=2,
    random_state=42
)
# Evaluate with 5-fold cross-validation
scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"Random Forest Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Output: Random Forest Accuracy: 0.9600 (+/- 0.0249)
The n_estimators parameter controls how many trees are in the forest. Increasing this number generally improves performance up to a point, after which the gains plateau and training just takes longer. The max_features parameter (which defaults to "sqrt" for classifiers) determines how many features each tree considers at each split.
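As a quick illustration of the max_features knob, you can cross-validate the same forest under different settings. The particular values compared here are an arbitrary choice for demonstration, not tuning advice:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Compare split-time feature subsampling strategies
results = {}
for max_features in ['sqrt', 'log2', None]:  # None = consider all features
    rf = RandomForestClassifier(n_estimators=100, max_features=max_features,
                                random_state=42)
    results[max_features] = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={max_features}: {results[max_features]:.4f}")
```

On a four-feature dataset like Iris the differences are small; the parameter matters far more on wide datasets, where restricting features per split is what decorrelates the trees.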
BaggingClassifier: Bagging with Any Estimator
While Random Forest is specific to decision trees, the BaggingClassifier lets you apply bagging to any base estimator. This is useful when you want to reduce the variance of a model other than a decision tree.
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
# Bag a KNN classifier
bagged_knn = BaggingClassifier(
    estimator=KNeighborsClassifier(),
    n_estimators=15,
    max_samples=0.7,   # Each model trains on 70% of samples
    max_features=0.8,  # Each model sees 80% of features
    bootstrap=True,
    random_state=42
)
scores = cross_val_score(bagged_knn, X, y, cv=5, scoring='accuracy')
print(f"Bagged KNN Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Output: Bagged KNN Accuracy: 0.9533 (+/- 0.0327)
Set oob_score=True on BaggingClassifier or RandomForestClassifier to get an out-of-bag accuracy estimate without needing a separate validation set. Because each tree is trained on a bootstrap sample, about 37% of samples are left out of each tree and can be used for evaluation.
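For example, the out-of-bag estimate costs nothing beyond a little bookkeeping at training time. A minimal sketch on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True scores each sample using only the trees that never saw it
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)
print(f"OOB accuracy estimate: {rf.oob_score_:.4f}")
```

The oob_score_ attribute is only available after fitting, and it tends to agree closely with cross-validated accuracy once the forest has enough trees.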
Boosting: AdaBoost, Gradient Boosting, and HistGradientBoosting
Unlike bagging, where models are built independently, boosting builds models sequentially. Each new model in the sequence focuses on the mistakes made by the combined ensemble so far. This sequential correction reduces bias, making boosting particularly effective with weak learners like shallow decision trees.
AdaBoost
AdaBoost (Adaptive Boosting) was one of the first successful boosting algorithms. It works by fitting a sequence of weak learners on repeatedly reweighted versions of the data. After each round, the training examples that the current ensemble misclassifies receive higher weights so that the next learner pays more attention to them. The final prediction is a weighted majority vote across all learners.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
# AdaBoost with shallow decision trees (stumps)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)
scores = cross_val_score(ada, X, y, cv=5, scoring='accuracy')
print(f"AdaBoost Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Output: AdaBoost Accuracy: 0.9533 (+/- 0.0327)
The learning_rate parameter controls how much each new tree contributes to the ensemble. Lower values require more trees but often produce better generalization. There is a natural trade-off between n_estimators and learning_rate.
Gradient Boosting
Gradient Boosting generalizes the boosting concept to arbitrary differentiable loss functions. Instead of adjusting sample weights, each new tree fits the negative gradient (essentially the residual errors) of the loss function. This approach is extremely powerful for both classification and regression tasks, particularly on tabular data.
Scikit-learn provides two implementations of gradient boosting. The classic GradientBoostingClassifier and GradientBoostingRegressor work well on small to medium datasets. For larger datasets (10,000+ samples), the histogram-based HistGradientBoostingClassifier and HistGradientBoostingRegressor are dramatically faster because they bin continuous features into a fixed number of discrete bins (typically 255), which speeds up the search for optimal splits.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
# Classic Gradient Boosting
gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,  # Stochastic gradient boosting
    random_state=42
)
scores = cross_val_score(gbc, X, y, cv=5, scoring='accuracy')
print(f"Gradient Boosting Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Output: Gradient Boosting Accuracy: 0.9533 (+/- 0.0327)
HistGradientBoosting: The Faster Alternative
The histogram-based variant is inspired by LightGBM and can be orders of magnitude faster on large datasets. It also has built-in support for missing values (it learns which direction to send NaN values at each split) and native categorical feature support, so you can skip one-hot encoding entirely.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
# Histogram-based Gradient Boosting
hgbc = HistGradientBoostingClassifier(
    max_iter=100,  # Number of boosting rounds
    learning_rate=0.1,
    max_leaf_nodes=31,
    max_depth=None,
    min_samples_leaf=20,
    early_stopping='auto',  # Stops when validation score plateaus
    random_state=42
)
scores = cross_val_score(hgbc, X, y, cv=5, scoring='accuracy')
print(f"HistGradientBoosting Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Output: HistGradientBoosting Accuracy: 0.9533 (+/- 0.0327)
HistGradientBoostingClassifier also supports monotonic constraints via the monotonic_cst parameter and interaction constraints via interaction_cst. Monotonic constraints let you enforce domain knowledge (for example, requiring that higher credit scores always lead to higher predicted approval rates). These features make it a strong choice for applications where interpretability and regulatory compliance matter.
Gradient Boosting for Regression
Here is an example using HistGradientBoostingRegressor on the California Housing dataset, demonstrating the regression side of boosting with quantile loss for prediction intervals.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Standard regression
hgbr = HistGradientBoostingRegressor(
    loss='squared_error',
    max_iter=200,
    learning_rate=0.1,
    random_state=42
)
hgbr.fit(X_train, y_train)
y_pred = hgbr.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
# Quantile regression for prediction intervals
hgbr_low = HistGradientBoostingRegressor(
    loss='quantile', quantile=0.05, max_iter=200, random_state=42
)
hgbr_high = HistGradientBoostingRegressor(
    loss='quantile', quantile=0.95, max_iter=200, random_state=42
)
hgbr_low.fit(X_train, y_train)
hgbr_high.fit(X_train, y_train)
y_low = hgbr_low.predict(X_test)
y_high = hgbr_high.predict(X_test)
print(f"90% prediction interval width (mean): {(y_high - y_low).mean():.4f}")
The quantile loss option lets you build prediction intervals rather than just point predictions. Training separate models at the 5th and 95th percentiles gives you a 90% prediction interval, which is invaluable when you need to communicate uncertainty.
Stacking: Combining Different Model Types
Stacking (also called stacked generalization) takes a fundamentally different approach from bagging and boosting. Instead of combining multiple instances of the same algorithm, stacking combines multiple different algorithms. A meta-learner is trained on top of the base models' predictions to learn the optimal way to combine them.
The process works in two levels. At the first level, several diverse base models are trained on the data. At the second level, a meta-model (called the final estimator in scikit-learn) is trained using the predictions from the base models as its input features. Cross-validation is used internally so that the meta-model is never trained on predictions that the base models made on data they saw during training.
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
# Define base estimators
estimators = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ('svc', SVC(kernel='rbf', probability=True, random_state=42))
]
# Create stacking classifier with logistic regression as meta-learner
stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                 # Internal cross-validation folds
    stack_method='auto',  # Uses predict_proba when available
    passthrough=False     # Only pass base model predictions to meta-learner
)
scores = cross_val_score(stacking_clf, X, y, cv=5, scoring='accuracy')
print(f"Stacking Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Output: Stacking Accuracy: 0.9667 (+/- 0.0211)
The passthrough=True option also passes the original features to the meta-learner alongside the base model predictions. This can sometimes improve performance but increases the risk of overfitting.
Stacking for Regression
from sklearn.ensemble import StackingRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
X, y = fetch_california_housing(return_X_y=True)
estimators = [
    ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
    ('gb', GradientBoostingRegressor(n_estimators=50, random_state=42)),
    ('svr', SVR(kernel='rbf'))
]
stacking_reg = StackingRegressor(
    estimators=estimators,
    final_estimator=Ridge(alpha=1.0),
    cv=5
)
scores = cross_val_score(stacking_reg, X, y, cv=5,
                         scoring='neg_mean_absolute_error')
print(f"Stacking MAE: {-scores.mean():.4f} (+/- {scores.std():.4f})")
The key to effective stacking is diversity among the base learners. Choose models that make different types of errors. A random forest, a gradient boosting model, and an SVM approach the problem differently, so their combined output captures patterns that no single model could find alone. Using three random forests with slightly different hyperparameters would not provide the same benefit.
Voting: Hard and Soft Voting Classifiers
Voting is the simplest form of model combination. You train several different classifiers and let them vote on the final prediction. Scikit-learn's VotingClassifier supports two modes: hard voting, where the class label predicted by the majority of models wins, and soft voting, where the predicted class probabilities are averaged and the class with the highest average probability wins.
Soft voting generally outperforms hard voting because it accounts for how confident each model is in its prediction, not just what it predicts.
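A tiny numeric sketch shows why the two modes can disagree: here two models barely prefer class 0 while a third is highly confident in class 1, so the majority vote and the averaged probabilities pick different classes. The probability values are made up for illustration:

```python
import numpy as np

# Predicted class probabilities from three classifiers for one sample
probs = np.array([
    [0.51, 0.49],  # weakly prefers class 0
    [0.51, 0.49],  # weakly prefers class 0
    [0.05, 0.95],  # strongly prefers class 1
])

hard_vote = np.bincount(probs.argmax(axis=1)).argmax()  # majority label
soft_vote = probs.mean(axis=0).argmax()                 # highest avg probability

print(f"Hard voting picks class {hard_vote}, soft voting picks class {soft_vote}")
# Hard voting picks class 0 (two weak votes beat one);
# soft voting picks class 1 (average probabilities are [0.357, 0.643])
```

The two weak votes win under hard voting, but the single confident model dominates the probability average, which is exactly the confidence information hard voting throws away.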
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
# Define individual classifiers
clf1 = LogisticRegression(max_iter=1000, random_state=42)
clf2 = DecisionTreeClassifier(max_depth=4, random_state=42)
clf3 = SVC(kernel='rbf', probability=True, random_state=42)
# Hard voting: majority rules
hard_voter = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svc', clf3)],
    voting='hard'
)
# Soft voting: average probabilities
soft_voter = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svc', clf3)],
    voting='soft'
)
hard_scores = cross_val_score(hard_voter, X, y, cv=5, scoring='accuracy')
soft_scores = cross_val_score(soft_voter, X, y, cv=5, scoring='accuracy')
print(f"Hard Voting Accuracy: {hard_scores.mean():.4f} (+/- {hard_scores.std():.4f})")
print(f"Soft Voting Accuracy: {soft_scores.mean():.4f} (+/- {soft_scores.std():.4f})")
You can also assign different weights to each classifier using the weights parameter. This is useful when you know that one model consistently outperforms the others but still want to benefit from the diversity of the full ensemble.
# Weighted soft voting
weighted_voter = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svc', clf3)],
    voting='soft',
    weights=[2, 1, 3]  # SVC gets 3x weight, LR gets 2x, DT gets 1x
)
scores = cross_val_score(weighted_voter, X, y, cv=5, scoring='accuracy')
print(f"Weighted Voting Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
For soft voting, every classifier must support predict_proba(). The default SVC does not; you must set probability=True when creating it. This adds computational cost because scikit-learn uses Platt scaling internally to calibrate the SVM outputs into probabilities.
Comparing Ensemble Methods Side by Side
Here is a complete script that trains and evaluates all the ensemble methods discussed in this article on the same dataset, giving you a direct comparison.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import (
    RandomForestClassifier,
    BaggingClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier,
    StackingClassifier,
    VotingClassifier
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
models = {
    'Random Forest': RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
    'Bagging (DT)': BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=50, random_state=42
    ),
    'AdaBoost': AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=50, random_state=42
    ),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=100, random_state=42
    ),
    'HistGradientBoosting': HistGradientBoostingClassifier(
        max_iter=100, random_state=42
    ),
    'Stacking': StackingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
            ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42)),
            ('svc', SVC(probability=True, random_state=42))
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5
    ),
    'Soft Voting': VotingClassifier(
        estimators=[
            ('lr', LogisticRegression(max_iter=1000, random_state=42)),
            ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
            ('svc', SVC(probability=True, random_state=42))
        ],
        voting='soft'
    )
}
print(f"{'Model':<25} {'Mean Accuracy':>15} {'Std':>10}")
print("-" * 52)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:<25} {scores.mean():>15.4f} {scores.std():>10.4f}")
When choosing an ensemble method for a real project, consider the following guidelines. Bagging (especially Random Forest) is excellent when your primary concern is overfitting with complex models. Boosting methods are the go-to choice for tabular data where squeezing out every fraction of a percent of accuracy matters. HistGradientBoostingClassifier should be your default boosting choice for any dataset larger than a few thousand rows due to its speed advantage and support for missing values and categorical features. Stacking is worth trying when you already have several well-tuned models and want to see if combining them yields additional gains. Voting is the simplest combination technique and works best when your models are already diverse and roughly equal in performance.
Key Takeaways
- Bagging reduces variance: Methods like Random Forest train multiple instances of the same model on bootstrap samples and average their predictions. This works well with complex, high-variance models like fully developed decision trees.
- Boosting reduces bias: Sequential methods like AdaBoost, Gradient Boosting, and HistGradientBoosting correct errors made by prior models. They work well with weak learners and are among the top performers on tabular datasets.
- HistGradientBoosting is the modern default: For datasets with 10,000+ samples, HistGradientBoostingClassifier and HistGradientBoostingRegressor provide dramatically faster training, built-in missing value support, native categorical feature handling, monotonic constraints, and quantile regression for prediction intervals.
- Stacking combines model diversity: By training a meta-learner on the outputs of several different base models, stacking can capture complementary patterns that no single model would find alone.
- Voting is the simplest ensemble: Use VotingClassifier with voting='soft' for a quick, low-effort way to combine multiple classifiers. Soft voting outperforms hard voting by incorporating prediction confidence.
- Ensembles are not guaranteed improvements: There are cases where an ensemble produces lower accuracy than an individual model. Always validate ensemble performance against your best single model using proper cross-validation.
Ensemble methods represent one of the most reliable strategies for building production-grade machine learning systems. Whether you start with a simple voting classifier or move to a fully stacked pipeline, the patterns covered in this article give you the tools to combine models effectively using scikit-learn's sklearn.ensemble module. The key is to match your ensemble strategy to your data and your goals: reduce variance with bagging, reduce bias with boosting, or leverage model diversity with stacking and voting.