Python's scikit-learn library gives you access to dozens of machine learning algorithms through one consistent interface. But with so many models available, it can be hard to know which one to use, how each one actually works, and when to reach for one over another. This guide walks through every major model category in scikit-learn with plain-language explanations and runnable code examples so you can confidently choose the right tool for any dataset.
Every scikit-learn model follows the same pattern: create the model, call .fit() to train it on data, then call .predict() to generate predictions. This consistency is one of the reasons scikit-learn remains the go-to library for classical machine learning in Python, even as deep learning frameworks have grown in popularity. Recent releases have also been rolling out experimental Array API support, which lets a subset of estimators operate on GPU-backed arrays, though CPU-based work on tabular data remains the library's core strength.
Before running any examples in this article, make sure you have scikit-learn installed:
pip install scikit-learn
Let's start with the supervised learning models, which learn from labeled training data to make predictions on new, unseen examples.
Supervised Learning: Regression Models
Regression models predict a continuous numerical value. Think of predicting house prices, stock returns, temperature, or any measurement that falls on a number line. Here are all the major regression models available in scikit-learn.
Linear Regression
Linear regression is the foundation of machine learning. It draws the best-fitting straight line through your data by minimizing the sum of squared differences between predicted and actual values. The model assumes a linear relationship between input features and the target variable.
When to use it: your data has a roughly linear relationship, you need interpretable results (each coefficient tells you exactly how much a feature influences the prediction), or you want a fast baseline model before trying anything more complex.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
# Generate sample data
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate
print(f"R-squared: {model.score(X_test, y_test):.4f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.4f}")
Ridge Regression (L2 Regularization)
Ridge regression is linear regression with a penalty term that shrinks the coefficients toward zero. This penalty is proportional to the square of the coefficient values (L2 norm). The result is a model that is less likely to overfit, especially when you have many features or your features are correlated with each other.
The alpha parameter controls how strong the penalty is. Higher alpha means more shrinkage, which means simpler models that may underfit. Lower alpha means less shrinkage, behaving more like standard linear regression.
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
print(f"Ridge R-squared: {model.score(X_test, y_test):.4f}")
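To see the shrinkage in action, you can fit Ridge across a range of alpha values on the same kind of synthetic data and watch the coefficient norm fall. This is a quick illustrative sketch, separate from the earlier train/test pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

# Same kind of synthetic data as the earlier examples
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=42)

# Larger alpha -> stronger penalty -> smaller coefficient norm
norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>8}: coefficient norm {norms[-1]:.2f}")
```

At very high alpha the coefficients are crushed toward zero and the model underfits, which is exactly the tradeoff described above.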
Lasso Regression (L1 Regularization)
Lasso regression also adds a penalty, but it uses the absolute value of the coefficients (L1 norm) instead of their squares. This has a powerful side effect: lasso can drive coefficients all the way to exactly zero, effectively removing features from the model. This makes lasso a built-in feature selection tool.
When to use it: you suspect that only a few features actually matter and want the model to automatically identify them.
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
print(f"Lasso R-squared: {model.score(X_test, y_test):.4f}")
print(f"Non-zero coefficients: {sum(model.coef_ != 0)}")
ElasticNet
ElasticNet combines the L1 and L2 penalties from lasso and ridge into a single model. The l1_ratio parameter controls the mix: a value of 1.0 is pure lasso, 0.0 is pure ridge, and anything in between blends both penalties. This is useful when you have groups of correlated features because lasso tends to pick just one feature from a correlated group while ElasticNet can keep several.
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X_train, y_train)
print(f"ElasticNet R-squared: {model.score(X_test, y_test):.4f}")
Polynomial Regression
Polynomial regression extends linear regression by creating new features from powers and interactions of the original features. For example, if you have a single feature x, polynomial degree 2 will generate x, x^2, and a bias term. The model is still "linear" in the mathematical sense because it fits a linear combination of these transformed features, but it can capture curved relationships in your data.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Degree 2 polynomial regression
model = make_pipeline(
PolynomialFeatures(degree=2, include_bias=False),
LinearRegression()
)
model.fit(X_train, y_train)
print(f"Polynomial R-squared: {model.score(X_test, y_test):.4f}")
Be careful with high polynomial degrees. Degree 3 or above with many features can create thousands of new columns, dramatically slowing down training and increasing the risk of overfitting. Always combine polynomial features with regularization (ridge or lasso).
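To get a feel for how quickly this blows up, you can count the columns PolynomialFeatures produces for a modest 10-feature input (a small illustrative sketch):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# 10 input features expand combinatorially with the polynomial degree
X = np.random.rand(5, 10)
counts = {}
for degree in [1, 2, 3, 4]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    counts[degree] = poly.fit_transform(X).shape[1]
    print(f"degree {degree}: {counts[degree]} features")
```

With 10 inputs, degree 4 already produces 1,000 columns before you have fit anything.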
Support Vector Regression (SVR)
SVR applies the principles of support vector machines to regression. Instead of trying to minimize overall error like linear regression, SVR defines a "tube" of acceptable error around the predictions (controlled by the epsilon parameter). It only penalizes predictions that fall outside this tube. This makes SVR more robust to outliers. The kernel parameter lets you capture nonlinear relationships by projecting the data into higher-dimensional spaces.
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# SVR requires feature scaling
model = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100, epsilon=0.1))
model.fit(X_train, y_train)
print(f"SVR R-squared: {model.score(X_test, y_test):.4f}")
Decision Tree Regressor
A decision tree regressor splits your data into smaller and smaller groups by asking yes/no questions about feature values. At each leaf node, the prediction is the average of the target values that ended up in that group. Trees are extremely interpretable because you can trace exactly why a prediction was made, but they are prone to overfitting if you let them grow too deep.
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)
print(f"Decision Tree R-squared: {model.score(X_test, y_test):.4f}")
K-Nearest Neighbors Regressor
KNN regression makes predictions by finding the K training examples that are closest to the new data point (using a distance metric like Euclidean distance) and averaging their target values. It requires no training step beyond storing the data, which makes it a "lazy learner." The main downside is that predictions are slow on large datasets because the model must compute distances to every stored example.
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train, y_train)
print(f"KNN Regressor R-squared: {model.score(X_test, y_test):.4f}")
Supervised Learning: Classification Models
Classification models predict which category or class a data point belongs to. Spam vs. not spam, cat vs. dog, malignant vs. benign -- these are all classification problems. Here are the core classification models in scikit-learn.
Logistic Regression
Despite the name, logistic regression is a classification algorithm, not a regression one. It works by fitting a linear model and then passing the output through a sigmoid function that squashes the result into a probability between 0 and 1. If the probability is above a threshold (typically 0.5), the model predicts the positive class.
Logistic regression is fast, interpretable, and works surprisingly well on many real-world problems. It naturally handles multiclass classification through the one-vs-rest or multinomial strategy.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))
Support Vector Machine (SVM)
An SVM finds the hyperplane that best separates classes with the widest possible margin. The "support vectors" are the data points closest to this boundary -- they are the only points that actually determine where the boundary goes. Using kernel functions (linear, polynomial, RBF), SVMs can create highly nonlinear decision boundaries in the original feature space.
When to use it: you have a clear separation between classes, a moderate-sized dataset (SVMs become slow on very large datasets), or high-dimensional data where the number of features exceeds the number of samples.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# SVM requires feature scaling
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Always scale your features before using SVM. Without scaling, features with larger numeric ranges will dominate the distance calculations and the model will perform poorly. StandardScaler (zero mean, unit variance) is the standard choice.
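A quick way to see why scaling matters is to compare cross-validated accuracy with and without it on a dataset whose features live on very different scales, such as the built-in wine dataset (an illustrative sketch):

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score

# Wine features span very different ranges (e.g. proline vs. hue)
X, y = load_wine(return_X_y=True)

unscaled = cross_val_score(SVC(), X, y, cv=5).mean()
scaled = cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=5).mean()
print(f"Unscaled SVM accuracy: {unscaled:.3f}")
print(f"Scaled SVM accuracy:   {scaled:.3f}")
```

The gap is typically large: without scaling, the widest-ranging feature dominates the RBF kernel's distance computation.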
K-Nearest Neighbors (KNN) Classifier
KNN classification works just like KNN regression: find the K closest training examples to the new data point, then take a majority vote among their labels. K=1 means the model simply copies the label of the nearest neighbor, which is prone to noise. Larger values of K smooth out the decision boundary but can blur the distinction between classes.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"KNN Accuracy: {accuracy_score(y_test, y_pred):.4f}")
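To see the tradeoff that K controls, a small cross-validated sweep over K values works well. A sketch on the iris data:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Small K fits noise; large K smooths the boundary (and can blur classes)
scores = {}
for k in [1, 3, 5, 15, 25]:
    scores[k] = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:>2}: mean accuracy {scores[k]:.3f}")
```

In practice, K is usually tuned with a grid search rather than picked by hand.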
Decision Tree Classifier
A decision tree classifier splits the data at each node by choosing the feature and threshold that produces the most "pure" child groups, where purity is measured by Gini impurity or entropy. At each leaf, the predicted class is whichever class appears there most frequently. Decision trees are the building blocks of many powerful ensemble methods like random forests and gradient boosting.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Naive Bayes
Naive Bayes classifiers are based on Bayes' theorem and the "naive" assumption that every feature is independent of every other feature. Despite this simplification (which is almost never true in practice), naive Bayes classifiers are remarkably effective, especially for text classification. Scikit-learn offers three main variants:
- GaussianNB -- assumes features follow a normal distribution. Good for continuous numerical data.
- MultinomialNB -- designed for count data like word frequencies in documents. The standard choice for text classification.
- BernoulliNB -- designed for binary features (present/absent). Useful when features represent whether a word appears or not.
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Naive Bayes Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# Text classification example with MultinomialNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
documents = [
"free money now click here",
"important meeting tomorrow morning",
"win a prize today free",
"project deadline next week",
"claim your free gift card",
"quarterly report is ready"
]
labels = [1, 0, 1, 0, 1, 0] # 1 = spam, 0 = not spam
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(documents)
nb_model = MultinomialNB()
nb_model.fit(X_text, labels)
new_email = vectorizer.transform(["free prize winner click now"])
print(f"Prediction: {'spam' if nb_model.predict(new_email)[0] == 1 else 'not spam'}")
Stochastic Gradient Descent (SGD) Classifier
SGD is not a model itself but a training strategy. Instead of computing the gradient over the entire dataset (which can be very slow for millions of rows), SGD updates the model using one sample at a time. Scikit-learn's SGDClassifier can implement logistic regression, SVM, and other linear models using this approach, making it practical for datasets too large to fit in memory.
from sklearn.linear_model import SGDClassifier
model = make_pipeline(
StandardScaler(),
SGDClassifier(loss='hinge', max_iter=1000, random_state=42) # 'hinge' = SVM
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"SGD Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Ensemble Methods
Ensemble methods combine multiple individual models to produce a single, more accurate prediction. The idea is simple: a crowd of weak learners can outperform a single strong learner if their mistakes are sufficiently different from one another.
Random Forest
A random forest trains many decision trees on random subsets of the data and random subsets of the features, then combines their predictions. For classification, it uses majority vote; for regression, it averages the outputs. The randomness injected into the process ensures that individual trees make different errors, which cancel out when combined.
Random forests are one of the most reliable out-of-the-box models. They handle nonlinear relationships, high-dimensional data, and missing values reasonably well without extensive tuning.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# Feature importance
for name, importance in zip(load_iris().feature_names, model.feature_importances_):
print(f" {name}: {importance:.4f}")
Gradient Boosting
Gradient boosting builds trees sequentially rather than in parallel. Each new tree focuses specifically on correcting the errors that the previous trees made. It does this by fitting the new tree to the residual errors (the gradient of the loss function) rather than the original target values. This iterative refinement tends to produce very accurate models, often the best performers on structured/tabular data.
The tradeoff is that gradient boosting is more sensitive to hyperparameters and more prone to overfitting than random forests, so careful tuning of learning_rate, n_estimators, and max_depth is important.
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Histogram-Based Gradient Boosting
Scikit-learn's HistGradientBoosting models are inspired by LightGBM. Instead of evaluating every possible split point for every feature, they bin continuous features into discrete histograms first. This dramatically speeds up training on large datasets (often 10 to 30 times faster than standard gradient boosting) while producing comparable accuracy. These models also handle missing values natively without requiring imputation.
from sklearn.ensemble import HistGradientBoostingClassifier
model = HistGradientBoostingClassifier(
max_iter=100,
learning_rate=0.1,
max_depth=5,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"HistGradientBoosting Accuracy: {accuracy_score(y_test, y_pred):.4f}")
For large datasets (over 10,000 samples), prefer HistGradientBoostingClassifier over GradientBoostingClassifier. It trains significantly faster and handles missing values automatically. For even larger or more complex tabular data, consider XGBoost or LightGBM, which are separate libraries that integrate well with scikit-learn's API.
AdaBoost
AdaBoost (Adaptive Boosting) was one of the first successful boosting algorithms. It works by training a sequence of weak learners (typically shallow decision trees called "stumps") and assigning higher weights to samples that the previous models misclassified. Each subsequent model focuses more on the hard-to-predict examples. The final prediction is a weighted vote of all the weak learners.
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(
n_estimators=50,
learning_rate=1.0,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"AdaBoost Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Bagging Classifier
Bagging (Bootstrap Aggregating) trains multiple copies of the same base model on different random subsets of the training data (drawn with replacement). The individual model predictions are then combined by majority vote (classification) or averaging (regression). Unlike random forests, which randomize features as well, bagging only randomizes the data samples. You can use any base estimator, not just decision trees.
from sklearn.ensemble import BaggingClassifier
model = BaggingClassifier(
n_estimators=10,
max_samples=0.8,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Bagging Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Voting Classifier
A voting classifier combines predictions from completely different types of models. For "hard" voting, the final prediction is whichever class gets the majority of votes. For "soft" voting, the model averages the predicted probabilities from each estimator and picks the class with the highest average probability. Soft voting generally performs better because it takes confidence into account.
from sklearn.ensemble import VotingClassifier
estimators = [
('lr', LogisticRegression(max_iter=200)),
('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
('svc', SVC(probability=True))
]
model = VotingClassifier(estimators=estimators, voting='soft')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Voting Classifier Accuracy: {accuracy_score(y_test, y_pred):.4f}")
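If you want to compare the two voting modes directly, a cross-validated head-to-head is straightforward (an illustrative sketch on iris):

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
estimators = [
    ('lr', LogisticRegression(max_iter=200)),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('svc', SVC(probability=True, random_state=42))
]

# 'hard' counts class votes; 'soft' averages predicted probabilities
results = {}
for voting in ['hard', 'soft']:
    clf = VotingClassifier(estimators=estimators, voting=voting)
    results[voting] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{voting} voting: {results[voting]:.3f}")
```

On easy datasets like iris the two modes are often close; the soft-voting advantage shows up more when the base models' confidence varies.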
Stacking Classifier
Stacking takes the ensemble idea a step further. Instead of using a simple vote or average, it trains a "meta-model" that learns the best way to combine the predictions from the base models. The base models make predictions, and those predictions become the input features for the meta-model. This approach can learn complex relationships between different models' strengths and weaknesses.
from sklearn.ensemble import StackingClassifier
estimators = [
('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
('knn', KNeighborsClassifier(n_neighbors=5))
]
model = StackingClassifier(
estimators=estimators,
final_estimator=LogisticRegression(),
cv=5
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Stacking Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Unsupervised Learning: Clustering
Clustering algorithms group data points based on similarity without any labeled examples. They discover natural structure in your data. These are useful for customer segmentation, anomaly detection, data exploration, and finding patterns you didn't know existed.
K-Means
K-Means partitions the data into K clusters by iteratively assigning each point to the nearest cluster center (centroid) and then recalculating the centroids as the mean of all assigned points. The algorithm repeats until the centroids stop moving. You must specify the number of clusters K in advance, which is both a strength (you decide the granularity) and a limitation (you need some idea of how many groups exist).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample clustered data
X_cluster, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans.fit(X_cluster)
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Inertia (within-cluster sum of squares): {kmeans.inertia_:.2f}")
print(f"Labels: {kmeans.labels_[:10]}")
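Since you must pick K yourself, a common heuristic is the "elbow method": compute the inertia for a range of K values and look for the point where the improvement levels off. A sketch using the same synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_cluster, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Inertia drops sharply until K reaches the true cluster count, then flattens
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_cluster)
    inertias.append(km.inertia_)
    print(f"K={k}: inertia {km.inertia_:.1f}")
```

Here the big drops stop at K=4, matching the four blobs the data was generated from.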
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters by identifying regions of high density separated by regions of low density. Unlike K-Means, DBSCAN does not require you to specify the number of clusters. It has two key parameters: eps (the maximum distance between two points for them to be considered neighbors) and min_samples (the minimum number of points needed to form a dense region). Points that don't belong to any dense region are labeled as noise (cluster label -1).
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_cluster)
n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
n_noise = list(dbscan.labels_).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
Agglomerative Clustering
Agglomerative clustering is a hierarchical approach. It starts with each data point as its own cluster, then repeatedly merges the two closest clusters until only the desired number of clusters remains. The "linkage" parameter controls how the distance between clusters is measured: "ward" minimizes variance within clusters, "complete" uses the maximum distance between points in two clusters, "average" uses the mean distance, and "single" uses the minimum distance.
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = agg.fit_predict(X_cluster)
print(f"Cluster labels: {labels[:10]}")
Mean Shift
Mean Shift is a centroid-based algorithm that does not require you to specify the number of clusters. It works by placing a window over each data point and iteratively shifting it toward the area of highest density. Points that converge to the same location are grouped into the same cluster. It automatically discovers the number of clusters based on the data's density landscape.
from sklearn.cluster import MeanShift
ms = MeanShift()
ms.fit(X_cluster)
print(f"Clusters found: {len(ms.cluster_centers_)}")
print(f"Cluster centers:\n{ms.cluster_centers_}")
Gaussian Mixture Models (GMM)
A Gaussian Mixture Model assumes the data is generated from a mixture of several Gaussian (normal) distributions, each representing a cluster. Unlike K-Means, which makes hard assignments (each point belongs to exactly one cluster), GMM provides soft assignments -- probability scores indicating how likely each point is to belong to each cluster. This is useful when clusters overlap.
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4, random_state=42)
gmm.fit(X_cluster)
labels = gmm.predict(X_cluster)
probabilities = gmm.predict_proba(X_cluster)
print(f"Labels: {labels[:5]}")
print(f"Probabilities (first point): {probabilities[0].round(3)}")
Isolation Forest (Anomaly Detection)
Isolation Forest is an anomaly detection algorithm rather than a clustering method, but it belongs in the same unsupervised toolbox. It works on the principle that anomalies are "few and different" -- they are easier to isolate from the rest of the data. The algorithm builds random trees and measures how many splits it takes to isolate each point. Anomalies require fewer splits and therefore have shorter path lengths in the trees.
from sklearn.ensemble import IsolationForest
import numpy as np
# Normal data with a few anomalies injected
np.random.seed(42)
X_normal = np.random.randn(200, 2)
X_anomaly = np.random.uniform(low=-6, high=6, size=(10, 2))
X_combined = np.vstack([X_normal, X_anomaly])
iso_forest = IsolationForest(contamination=0.05, random_state=42)
predictions = iso_forest.fit_predict(X_combined)
n_anomalies = sum(predictions == -1)
print(f"Anomalies detected: {n_anomalies}")
Dimensionality Reduction
When your dataset has dozens or hundreds of features, dimensionality reduction helps you distill it down to the features or combinations of features that carry the useful information. This can speed up training, reduce overfitting, and make your data easier to visualize.
Principal Component Analysis (PCA)
PCA finds the directions (called principal components) along which the data varies the most. It then projects the data onto these directions, effectively compressing your features into a smaller number of new features that capture as much of the original variance as possible. The first principal component captures the direction of maximum variance, the second captures the next-most variance (perpendicular to the first), and so on.
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# Reduce from 4 features to 2
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_reduced.shape}")
print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total variance retained: {sum(pca.explained_variance_ratio_):.4f}")
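Rather than hard-coding the number of components, you can pass a float to n_components and PCA will keep however many components are needed to reach that fraction of total variance. A sketch using the built-in breast cancer dataset (scaled first, since PCA is sensitive to feature scale):

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to retain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"{X.shape[1]} features -> {pca.n_components_} components")
```

This is a convenient way to let the data, rather than guesswork, decide the output dimensionality.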
t-SNE
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a visualization technique that maps high-dimensional data to 2 or 3 dimensions while preserving local structure. Points that are close together in the high-dimensional space will remain close in the low-dimensional projection. t-SNE is excellent for exploring data and discovering clusters, but it should not be used for feature engineering or as a preprocessing step for other models: it has no transform method for new data, so each run produces an embedding only for the points it was fitted on.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)
print(f"t-SNE output shape: {X_embedded.shape}")
Truncated SVD
Truncated SVD (Singular Value Decomposition) is similar to PCA but works directly on sparse matrices without centering the data first. This makes it the go-to choice for reducing the dimensionality of text data represented as TF-IDF matrices. In the context of text, truncated SVD is often called Latent Semantic Analysis (LSA).
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
"machine learning algorithms",
"deep learning neural networks",
"natural language processing",
"computer vision image recognition",
"reinforcement learning agents"
]
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(documents)
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X_tfidf)
print(f"Original shape: {X_tfidf.shape}")
print(f"Reduced shape: {X_reduced.shape}")
Linear Discriminant Analysis (LDA)
LDA is both a dimensionality reduction technique and a classifier. Unlike PCA (which ignores class labels and focuses purely on variance), LDA finds the directions that maximize the separation between known classes. This makes it especially useful as a preprocessing step before classification because it reduces dimensions while preserving the information that matters for distinguishing between classes.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(f"LDA reduced shape: {X_lda.shape}")
print(f"Explained variance ratio: {lda.explained_variance_ratio_}")
Choosing the Right Model
With so many models available, selecting the right one can feel overwhelming. Here is a practical framework to guide your decision.
Start simple. Always begin with a simple model like logistic regression (for classification) or linear regression (for regression). These give you a performance baseline and are fast to train. If the simple model performs well enough, you may not need anything more complex.
Consider your data size. Small datasets (under 1,000 samples) tend to work better with simpler models or models that have strong regularization. SVMs and KNN can work well here. Large datasets (over 100,000 samples) favor gradient boosting methods (especially HistGradientBoosting) and linear models trained with SGD.
Think about interpretability. If you need to explain why the model made a particular decision (common in healthcare, finance, and cybersecurity), stick with linear models, decision trees, or logistic regression. Ensemble methods like random forests trade some interpretability for better accuracy, though feature importance scores still offer some insight.
Use cross-validation. Never evaluate a model on the same data you trained it on. Use scikit-learn's cross_val_score to get a reliable estimate of how your model will perform on unseen data.
from sklearn.model_selection import cross_val_score
models = {
'Logistic Regression': LogisticRegression(max_iter=200),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM': make_pipeline(StandardScaler(), SVC()),
'KNN': KNeighborsClassifier(n_neighbors=5),
'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}
X, y = load_iris(return_X_y=True)
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
Scikit-learn is designed for classical machine learning on structured, tabular data. For tasks involving images, audio, or text at scale, deep learning frameworks like PyTorch or TensorFlow are typically more appropriate. However, scikit-learn remains essential for preprocessing, feature engineering, evaluation metrics, and pipeline management even in deep learning workflows.
Key Takeaways
- Every scikit-learn model follows the same API. Create the model, call .fit(), then call .predict(). This consistency makes it easy to swap models and compare results.
- Regression models predict continuous values. Start with linear regression and add complexity (ridge, lasso, polynomial, SVR, decision trees) only when the baseline falls short.
- Classification models predict categories. Logistic regression is your default starting point. SVMs shine on medium-sized datasets with clear margins, and naive Bayes is hard to beat for text classification.
- Ensemble methods combine multiple models for better performance. Random forests offer strong out-of-the-box accuracy with minimal tuning. Gradient boosting (especially histogram-based) tends to produce the best results on structured data but requires more careful hyperparameter tuning.
- Clustering algorithms find hidden structure in unlabeled data. K-Means is the fastest and simplest, DBSCAN handles arbitrary shapes and noise, and Gaussian Mixture Models provide soft probabilistic assignments.
- Dimensionality reduction compresses features while retaining information. PCA is the standard approach, t-SNE is for visualization, and LDA is optimized for classification tasks.
- Always use cross-validation to evaluate your models. A single train/test split can give misleading results. Five-fold cross-validation provides a much more reliable performance estimate.
Machine learning in Python comes down to understanding your data, choosing an appropriate model, and validating your results properly. Scikit-learn gives you all the tools you need under one roof. Start simple, measure everything, and add complexity only when the data demands it.