Building a machine learning model is only half the challenge. The other half is knowing whether you picked the right one, tuned it correctly, and can trust its performance on unseen data. Python's scikit-learn library provides a comprehensive model_selection module that handles all of this, from splitting your data and running cross-validation to searching massive hyperparameter spaces efficiently. This guide walks through every major tool in that module with practical code you can use right away.
Scikit-learn version 1.8, released in December 2025, is the current stable release and contains the full set of model selection utilities covered here. Whether you are comparing a logistic regression against a random forest or searching hundreds of hyperparameter combinations for a gradient boosting ensemble, the tools in sklearn.model_selection give you a structured, repeatable process for finding the configuration that generalizes best to new data.
Why Model Selection Matters
Every machine learning estimator makes tradeoffs between bias and variance. A model with high bias (like a simple linear regression on nonlinear data) will underfit both the training set and new data. A model with high variance (like a very deep decision tree) will memorize the training set and fail to generalize. Model selection is the process of navigating this tradeoff by evaluating candidate models on held-out data that was not used during training.
If you simply train a model on all available data and report the training accuracy, you have no way to know whether the model has learned real patterns or just noise. The tools in scikit-learn's model_selection module exist to solve this problem systematically.
Model selection is not just about picking an algorithm. It also covers choosing the right hyperparameters, deciding how many features to use, and determining whether your dataset is large enough. The tools covered in this guide address all of these concerns.
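To make the tradeoff concrete, here is a small illustrative sketch (the synthetic make_moons dataset and the specific depth settings are assumptions for illustration, not part of the examples that follow). A depth-1 stump underfits a curved decision boundary, while an unconstrained tree memorizes the training set:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with a nonlinear boundary and label noise
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# High bias: a single split cannot capture the curved boundary
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
# High variance: an unconstrained tree fits the training set perfectly
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

print(f"Stump:     train={stump.score(X_train, y_train):.2f}, "
      f"test={stump.score(X_test, y_test):.2f}")
print(f"Deep tree: train={deep.score(X_train, y_train):.2f}, "
      f"test={deep.score(X_test, y_test):.2f}")
```

The deep tree's perfect training score paired with a noticeably lower test score is the signature of overfitting; the stump's low score on both sets is the signature of underfitting.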
Splitting Data with train_test_split
The simplest form of model evaluation is to split your dataset into two parts: one for training and one for testing. Scikit-learn's train_test_split function handles this in a single line, with options for controlling the split ratio, shuffling, random state reproducibility, and stratification.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Split 80% train, 20% test with stratification
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
The stratify parameter ensures each class is represented proportionally in both the training and test sets. This is especially important for imbalanced datasets where a random split could leave the minority class underrepresented in one of the subsets.
The random_state parameter guarantees that the same split is produced every time you run the code. Without it, each execution would create a different train/test partition, making results difficult to reproduce or compare.
A single train/test split gives you only one estimate of model performance, which can vary significantly depending on which samples end up in each set. For more reliable estimates, use cross-validation instead.
Cross-Validation Strategies
Cross-validation solves the instability problem of a single train/test split by repeating the evaluation process multiple times with different partitions of the data. The most common approach is k-fold cross-validation, where the dataset is divided into k equal parts (folds). The model is trained k times, each time using a different fold as the validation set and the remaining k-1 folds as the training set.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
The cross_val_score function returns an array of scores, one per fold. The mean gives you a more stable performance estimate than a single split, while the standard deviation tells you how much the score varies across folds.
Choosing the Right Cross-Validation Splitter
Scikit-learn provides several cross-validation splitters for different situations. The default behavior of cross_val_score uses StratifiedKFold for classification tasks and KFold for regression. However, you can pass any cross-validation iterator to the cv parameter.
from sklearn.model_selection import (
KFold,
StratifiedKFold,
LeaveOneOut,
RepeatedStratifiedKFold,
GroupKFold
)
# Standard k-fold (no stratification)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Stratified k-fold (maintains class proportions)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Repeated stratified k-fold for more stable estimates
rskf = RepeatedStratifiedKFold(
n_splits=5, n_repeats=3, random_state=42
)
# Leave-one-out (useful for very small datasets)
loo = LeaveOneOut()
# Group k-fold (ensures groups stay together)
# Useful when samples from the same source should not
# appear in both train and test sets
gkf = GroupKFold(n_splits=5)
StratifiedKFold is the go-to choice for classification because it preserves class balance in every fold. RepeatedStratifiedKFold runs the entire k-fold process multiple times with different random shuffles, producing even more stable performance estimates at the cost of longer computation.
GroupKFold is essential when your data has natural groupings, such as multiple measurements from the same patient or multiple images from the same camera. Without it, related samples could leak from the training set into the validation set, inflating your scores.
For a quick sanity check, use cross_val_score with cv=5. For final model evaluation in a publication or production pipeline, use RepeatedStratifiedKFold with at least 3 repeats to reduce variance in your performance estimate.
Hyperparameter Tuning with Grid Search and Randomized Search
Once you have decided on a model type, you need to find the best hyperparameters. Hyperparameters are settings that you configure before training begins, such as the number of trees in a random forest, the regularization strength in logistic regression, or the kernel type in a support vector machine.
GridSearchCV: Exhaustive Search
GridSearchCV evaluates every possible combination of hyperparameter values that you specify. It wraps cross-validation around the search process, so each combination is scored on held-out data rather than training data.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [0.001, 0.01, 0.1],
'kernel': ['rbf', 'linear']
}
grid_search = GridSearchCV(
SVC(),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1, # use all CPU cores
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Evaluate on the held-out test set
test_score = grid_search.score(X_test, y_test)
print(f"Test set score: {test_score:.4f}")
This example creates 4 x 3 x 2 = 24 parameter combinations, each evaluated with 5-fold cross-validation, resulting in 120 total model fits. The n_jobs=-1 argument parallelizes the work across all available CPU cores.
After fitting, the GridSearchCV object automatically refits the best model on the entire training set, so you can call grid_search.predict() directly on new data.
RandomizedSearchCV: Sampling from Distributions
When the hyperparameter space is large, exhaustive grid search becomes impractical. RandomizedSearchCV samples a fixed number of parameter combinations from specified distributions, making it far more efficient for large search spaces.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import randint, uniform
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': randint(3, 15),
'learning_rate': uniform(0.01, 0.3),
'subsample': uniform(0.6, 0.4),  # loc=0.6, scale=0.4: samples from [0.6, 1.0]
'min_samples_split': randint(2, 20)
}
random_search = RandomizedSearchCV(
GradientBoostingClassifier(random_state=42),
param_distributions,
n_iter=50, # number of random combinations to try
cv=5,
scoring='accuracy',
random_state=42,
n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
With n_iter=50, this evaluates only 50 randomly sampled combinations instead of the thousands that a full grid would require. Research has shown that random search is often more efficient than grid search because it explores more distinct values of each hyperparameter, whereas grid search wastes evaluations on duplicate values of less important parameters.
Multi-Metric Evaluation
Both GridSearchCV and RandomizedSearchCV support evaluating multiple metrics simultaneously. This is useful when you care about both precision and recall, for example.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
param_grid = {'C': [0.01, 0.1, 1, 10]}
grid_search = GridSearchCV(
LogisticRegression(max_iter=1000),
param_grid,
cv=5,
scoring=['accuracy', 'f1_weighted', 'roc_auc_ovr'],
refit='f1_weighted', # which metric to use for selecting best model
n_jobs=-1
)
grid_search.fit(X_train, y_train)
# Access results for each metric
results = grid_search.cv_results_
for i, params in enumerate(results['params']):
print(f"C={params['C']:>5} | "
f"Accuracy={results['mean_test_accuracy'][i]:.4f} | "
f"F1={results['mean_test_f1_weighted'][i]:.4f}")
The refit parameter tells scikit-learn which metric to optimize when selecting the best model. The winning model is then refit on the full training set and made available through the best_estimator_ attribute.
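One practical consequence: best_index_ locates the winning row of cv_results_, so you can read off how the model chosen by the refit metric fared on the other metrics. A minimal sketch on the full iris data (the grid values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
gs = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.01, 0.1, 1, 10]},
    cv=5,
    scoring=['accuracy', 'f1_weighted'],
    refit='f1_weighted',
)
gs.fit(X, y)

# best_index_ is the cv_results_ row of the model selected by refit
i = gs.best_index_
print(f"Chosen C: {gs.best_params_['C']}")
print(f"F1 (selection metric): {gs.cv_results_['mean_test_f1_weighted'][i]:.4f}")
print(f"Accuracy of that model: {gs.cv_results_['mean_test_accuracy'][i]:.4f}")
```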
Successive Halving: Faster Hyperparameter Search
Scikit-learn includes an experimental successive halving strategy through HalvingGridSearchCV and HalvingRandomSearchCV. These work by first evaluating all candidates with a small amount of resources (for example, a small subset of training data), then progressively eliminating the worst performers while increasing the resources allocated to survivors.
from sklearn.experimental import enable_halving_search_cv # required
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': randint(3, 30),
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10)
}
halving_search = HalvingRandomSearchCV(
RandomForestClassifier(random_state=42),
param_distributions,
n_candidates=100, # start with 100 candidates
factor=3, # eliminate 2/3 of candidates each round
cv=5,
scoring='accuracy',
random_state=42,
n_jobs=-1
)
halving_search.fit(X_train, y_train)
print(f"Best parameters: {halving_search.best_params_}")
print(f"Best CV score: {halving_search.best_score_:.4f}")
The halving search classes are still marked as experimental in scikit-learn 1.8. You must import enable_halving_search_cv from sklearn.experimental before using them. The API may change in future releases.
The factor parameter controls how aggressively candidates are eliminated. A factor of 3 means that roughly one-third of candidates survive each round. The resource parameter (which defaults to n_samples) determines what gets increased each round. You can also set it to a model hyperparameter like n_estimators so that survivors are trained with progressively more trees instead of more data.
This approach can be dramatically faster than a standard grid or random search when you have a large number of candidates. Weak candidates are identified quickly with minimal computation and discarded before expensive full-scale evaluation.
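Here is a sketch of the resource-as-hyperparameter variant described above, using HalvingGridSearchCV so that survivors are trained with progressively more trees rather than more data (the grid values and resource budgets are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# When resource is a hyperparameter, it must NOT appear in the grid itself
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
}

halving = HalvingGridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    resource='n_estimators',  # grow the forest each round, not the dataset
    min_resources=10,         # first round: forests of only 10 trees
    max_resources=300,        # final survivors: up to 300 trees
    factor=3,
    cv=5,
    random_state=42,
)
halving.fit(X, y)
print(f"Best parameters: {halving.best_params_}")
print(f"Elimination rounds: {halving.n_iterations_}")
```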
Learning Curves and Validation Curves
Before tuning hyperparameters, it helps to understand whether your model is suffering from underfitting or overfitting. Learning curves and validation curves are diagnostic tools that visualize this.
Learning Curves
A learning curve plots model performance as a function of training set size. It answers the question: would collecting more data improve my model?
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC
train_sizes, train_scores, test_scores = learning_curve(
SVC(kernel='rbf', gamma=0.01),
X, y,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=5,
scoring='accuracy',
n_jobs=-1
)
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
test_mean = test_scores.mean(axis=1)
test_std = test_scores.std(axis=1)
print("Train sizes | Train score | Test score")
print("-" * 45)
for size, tr, te in zip(train_sizes, train_mean, test_mean):
print(f" {size:>5} | {tr:.4f} | {te:.4f}")
When interpreting learning curves, look for two patterns. If the training and test scores both remain low as you add more data, the model is underfitting and you need a more complex model or better features. If the training score is high but the test score is low with a large gap between them, the model is overfitting and you need more data, regularization, or a simpler model.
Scikit-learn 1.8 also provides a LearningCurveDisplay class that can generate the plot directly from an estimator without manually handling the arrays.
from sklearn.model_selection import LearningCurveDisplay, ShuffleSplit
# Using the display class for direct plotting
LearningCurveDisplay.from_estimator(
SVC(kernel='rbf', gamma=0.01),
X, y,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=ShuffleSplit(n_splits=50, test_size=0.2, random_state=0),
score_type="both",
n_jobs=-1,
score_name="Accuracy"
)
Validation Curves
A validation curve plots model performance as a function of a single hyperparameter. It answers the question: what value of this hyperparameter gives the best generalization?
from sklearn.model_selection import validation_curve
param_range = np.logspace(-6, 2, 10)
train_scores, test_scores = validation_curve(
SVC(kernel='rbf'),
X, y,
param_name='gamma',
param_range=param_range,
cv=5,
scoring='accuracy',
n_jobs=-1
)
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
print("gamma | Train score | Test score")
print("-" * 48)
for gamma, tr, te in zip(param_range, train_mean, test_mean):
print(f" {gamma:>10.6f} | {tr:.4f} | {te:.4f}")
The sweet spot is the hyperparameter value where the test score is highest. Values to the left typically show underfitting (both scores are low), while values to the right show overfitting (the training score is high but the test score drops). Like learning curves, there is also a ValidationCurveDisplay class available for generating plots directly.
Putting It All Together: A Complete Workflow
Here is a complete model selection workflow that combines the tools covered above. It compares multiple algorithms, tunes the best candidate, and evaluates the final model on held-out data.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import (
train_test_split,
cross_val_score,
GridSearchCV,
StratifiedKFold
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
RandomForestClassifier,
GradientBoostingClassifier
)
from sklearn.svm import SVC
# Load data and create held-out test set
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Define candidate pipelines
candidates = {
'Logistic Regression': Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(max_iter=2000))
]),
'Random Forest': Pipeline([
('clf', RandomForestClassifier(random_state=42))
]),
'Gradient Boosting': Pipeline([
('clf', GradientBoostingClassifier(random_state=42))
]),
'SVM': Pipeline([
('scaler', StandardScaler()),
('clf', SVC(random_state=42))
])
}
# Step 1: Compare candidates with cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("Step 1: Comparing candidate models")
print("=" * 50)
for name, pipeline in candidates.items():
scores = cross_val_score(pipeline, X_train, y_train, cv=cv)
print(f"{name:>22}: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Step 2: Tune the best candidate
print("\nStep 2: Tuning Gradient Boosting")
print("=" * 50)
param_grid = {
'clf__n_estimators': [100, 200, 300],
'clf__max_depth': [3, 5, 7],
'clf__learning_rate': [0.01, 0.1, 0.2],
'clf__subsample': [0.8, 1.0]
}
grid_search = GridSearchCV(
candidates['Gradient Boosting'],
param_grid,
cv=cv,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Step 3: Final evaluation on the test set
test_score = grid_search.score(X_test, y_test)
print(f"\nStep 3: Final test set score: {test_score:.4f}")
There are a few important things to notice in this workflow. First, the test set is created before any modeling begins and is never used during cross-validation or hyperparameter tuning. It serves as a final, unbiased estimate of performance. Second, each candidate model is wrapped in a Pipeline so that preprocessing steps like scaling are applied correctly within each cross-validation fold, preventing data leakage. Third, when specifying hyperparameters in the grid for a pipeline, the parameter names use the double underscore syntax (like clf__n_estimators) to indicate which step of the pipeline the parameter belongs to.
Always use Pipeline objects when combining preprocessing with model training. If you scale your data before splitting it into folds, information from the validation fold leaks into the training fold through the computed mean and standard deviation, leading to overly optimistic performance estimates.
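The leakage warning above can be demonstrated by scoring the same classifier both ways. On a dataset of this size the numerical difference may be negligible, but the leaky estimate is biased and cannot be trusted in general:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Wrong: scaler fitted on the full dataset before cross-validation, so
# validation-fold statistics leak into every training fold
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(SVC(), X_leaky, y, cv=cv)

# Right: the pipeline refits the scaler on each training fold only
pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC())])
clean_scores = cross_val_score(pipe, X, y, cv=cv)

print(f"Leaky CV score:    {leaky_scores.mean():.4f}")
print(f"Pipeline CV score: {clean_scores.mean():.4f}")
```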
Key Takeaways
- Always hold out a test set: Use train_test_split at the very beginning of your workflow. This test set should never be touched during model selection or hyperparameter tuning. It exists solely for the final, unbiased performance evaluation.
- Use cross-validation for reliable estimates: A single train/test split is noisy. cross_val_score with StratifiedKFold gives you a mean and standard deviation that tell you both how well the model performs and how stable that performance is.
- Start with RandomizedSearchCV: Unless your hyperparameter space is small and well-defined, RandomizedSearchCV is more efficient than GridSearchCV. It explores more unique values per hyperparameter with fewer total evaluations.
- Consider successive halving for large searches: HalvingRandomSearchCV can evaluate hundreds of candidates in a fraction of the time by quickly eliminating poor performers before committing full resources.
- Diagnose before you tune: Use learning curves and validation curves to understand whether your model needs more data, more complexity, or more regularization. Tuning hyperparameters on a fundamentally underfitting model will not produce good results.
- Wrap everything in a Pipeline: Preprocessing steps must be included inside the cross-validation loop to prevent data leakage. Scikit-learn's Pipeline makes this automatic and clean.
Model selection is one of the most critical skills in applied machine learning. Scikit-learn's model_selection module gives you everything you need to compare models fairly, tune hyperparameters systematically, and diagnose performance issues before they reach production. The key is to treat every step as a repeatable, reproducible experiment where the test set is sacred, cross-validation is standard practice, and pipelines keep your preprocessing honest.