Python Machine Learning Examples: Hands-On Guide with Scikit-Learn

Machine learning transforms raw data into predictions, decisions, and insights. Python's scikit-learn library makes this accessible with a clean, consistent API that covers everything from classification and regression to clustering and model evaluation. This guide walks through practical, runnable examples that demonstrate each core technique.

Scikit-learn (imported as sklearn) is an open-source library built on top of NumPy, SciPy, and Matplotlib. It provides efficient implementations of algorithms ranging from linear regression and decision trees to random forests and support vector machines. The library follows a consistent pattern across all models: create an estimator, call .fit() to train it, and call .predict() to generate predictions. This uniform interface makes it straightforward to swap algorithms and compare results.
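To make that uniform interface concrete, here is a small sketch (on the Iris data, introduced properly below) that trains three different classifiers with the exact same three lines. The specific models chosen here are just illustrative examples.

```python
# A sketch of the uniform estimator interface: the same fit/score calls
# work for any classifier, so comparing algorithms is just a loop.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.2, random_state=42
)

for model in [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=42),
              KNeighborsClassifier()]:
    model.fit(X_train, y_train)          # same call for every estimator
    score = model.score(X_test, y_test)  # .score() reports accuracy here
    print(f"{type(model).__name__}: {score:.2f}")
```

Swapping in a different algorithm means changing only the constructor call; the surrounding code stays identical.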

The current stable release is scikit-learn 1.8.0, released in December 2025, which includes performance improvements, better model interpretability, and expanded integration with the broader Python data science ecosystem. Every example in this article uses this version and runs as a complete, self-contained script.
Newer releases will also run these examples; the exact version number matters less than using a reasonably recent one.

Setting Up Your Environment

Before writing any machine learning code, you need scikit-learn and its dependencies installed. If you already have Python 3.10 or later (the minimum for current scikit-learn releases), installation takes a single command.

pip install scikit-learn numpy pandas matplotlib

Verify the installation by checking the version number in a Python shell:

import sklearn
print(sklearn.__version__)
# Expected output: 1.8.0 (or whichever version you installed)

Scikit-learn ships with several built-in datasets that are perfect for learning. These datasets load directly into memory without any file downloads, which means you can focus on the machine learning concepts rather than data wrangling. The examples in this article use the Iris dataset (classification), the California Housing dataset (regression), and synthetic data (clustering).

Note

All examples in this article follow the standard scikit-learn workflow: load data, split into training and testing sets, train the model, make predictions, and evaluate performance. Once you understand this pattern, you can apply it to any algorithm in the library.

Classification: Predicting Categories

Classification is the task of assigning a label to an input based on learned patterns. A spam filter labeling email as "spam" or "not spam" and a medical system identifying a tumor as "malignant" or "benign" are both classification tasks. The model learns the relationship between input features and discrete output labels during training, then applies that knowledge to new, unseen data.

Example 1: Random Forest Classifier on the Iris Dataset

The Iris dataset contains 150 samples of iris flowers, each described by four measurements (sepal length, sepal width, petal length, petal width) and labeled as one of three species. A Random Forest builds multiple decision trees during training and combines their votes to produce a final prediction, which tends to be more accurate and robust than any single tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

This script loads the Iris data, reserves 20% for testing, trains a Random Forest with 100 decision trees, and then prints a detailed performance report. The classification_report function provides precision, recall, and F1-score for each class, giving a much richer picture of model performance than accuracy alone.

Example 2: Support Vector Machine (SVM) with Feature Scaling

Support Vector Machines find the optimal boundary (hyperplane) that separates classes with the widest possible margin. SVMs are sensitive to the scale of input features, so preprocessing with StandardScaler is essential. The scaler transforms each feature to have a mean of 0 and a standard deviation of 1.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load and split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train an SVM with an RBF kernel
svm_clf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_clf.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = svm_clf.predict(X_test_scaled)
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Pro Tip

Always call fit_transform() on the training data and plain transform() on the test data. If you fit the scaler on the test data too, you introduce data leakage, which means the model indirectly "sees" information from the test set during training and produces misleadingly optimistic results.
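A quick sketch on synthetic data makes the point visible: the scaler stores the training mean and standard deviation in its mean_ and scale_ attributes, and transform() on the test set reuses exactly those statistics rather than recomputing them.

```python
# Demonstration that StandardScaler learns its statistics from the
# training data only, then reuses them on the test data.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.normal(loc=10.0, scale=2.0, size=(100, 1))  # mean ~10, std ~2
X_test = rng.normal(loc=10.0, scale=2.0, size=(20, 1))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_
X_test_scaled = scaler.transform(X_test)        # reuses those statistics

print(f"Learned mean:  {scaler.mean_[0]:.2f}")            # close to 10
print(f"Learned scale: {scaler.scale_[0]:.2f}")           # close to 2
print(f"Scaled training mean: {X_train_scaled.mean():.2f}")  # exactly 0
```

The scaled training data has mean 0 by construction; the scaled test data has a mean merely close to 0, because it was standardized with the training statistics.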

Regression: Predicting Continuous Values

Regression predicts a continuous numeric output rather than a discrete label. Predicting house prices, forecasting temperature, or estimating stock returns are all regression tasks. The model learns a function that maps input features to a continuous target value.

Example 3: Linear Regression on California Housing

The California Housing dataset contains information about housing blocks in California, including features like median income, house age, average rooms, and geographic coordinates. The target variable is the median house value for each block. Linear Regression fits a straight-line relationship (in higher dimensions, a hyperplane) between the features and the target.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Load the dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Make predictions
y_pred = reg.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R-squared Score: {r2:.4f}")

# Show feature importance
print("\nFeature Coefficients:")
for name, coef in zip(housing.feature_names, reg.coef_):
    print(f"  {name:<15} {coef:+.4f}")

The R-squared score measures how much of the variance in the target variable the model explains. A value of 1.0 means the model perfectly predicts every value, while 0.0 means it performs no better than simply predicting the mean. Root Mean Squared Error (RMSE) gives you the average magnitude of prediction errors in the same units as the target variable, which makes it easier to interpret.
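To make the R-squared definition concrete, here is the metric computed by hand on a tiny made-up set of predictions: one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
# R-squared by hand, verified against sklearn's r2_score.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 2.0, 4.0, 5.0, 6.0])
y_pred = np.array([2.8, 2.3, 3.9, 5.2, 5.8])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R2:        {r2_manual:.4f}")        # 0.9780
print(f"sklearn r2_score: {r2_score(y_true, y_pred):.4f}")
```

Both lines print the same value, which shows r2_score is exactly this ratio and nothing more exotic.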

Example 4: Gradient Boosting Regressor

Gradient Boosting builds an ensemble of weak learners (typically small decision trees) sequentially. Each new tree corrects the errors of the previous ones. This approach often achieves better accuracy than a single linear model, especially when the relationship between features and target is non-linear.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Load and split
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# Train a Gradient Boosting model
gbr = GradientBoostingRegressor(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    random_state=42
)
gbr.fit(X_train, y_train)

# Predict and evaluate
y_pred = gbr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Gradient Boosting RMSE: {rmse:.4f}")
print(f"Gradient Boosting R-squared: {r2:.4f}")

# Feature importance
print("\nFeature Importance (top 5):")
importances = gbr.feature_importances_
indices = np.argsort(importances)[::-1]
for i in range(5):
    idx = indices[i]
    print(f"  {housing.feature_names[idx]:<15} {importances[idx]:.4f}")

The learning_rate parameter controls how much each new tree contributes. Smaller values require more trees (n_estimators) but often produce better generalization. The max_depth parameter limits the complexity of each individual tree. Finding the right combination of these hyperparameters is where cross-validation becomes essential, which is covered in a later section.
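The tradeoff is easy to see directly. The following sketch uses small synthetic data (rather than California Housing, to keep it fast) and a deliberately small tree budget, so a tiny learning rate visibly underfits.

```python
# Illustrating the learning_rate / n_estimators tradeoff: with a fixed,
# small number of trees, a very small learning rate cannot learn enough.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for lr in [0.01, 0.1, 0.5]:
    gbr = GradientBoostingRegressor(n_estimators=50, learning_rate=lr,
                                    random_state=42)
    gbr.fit(X_train, y_train)
    results[lr] = r2_score(y_test, gbr.predict(X_test))
    print(f"learning_rate={lr:<5} R2: {results[lr]:.4f}")
```

With only 50 trees, learning_rate=0.01 scores far worse than 0.1; giving the small rate many more trees would close most of that gap, at the cost of training time.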

Clustering: Finding Hidden Groups

Clustering is an unsupervised learning technique. Unlike classification, there are no labels to learn from. The algorithm discovers natural groupings in the data based on similarity. Common applications include customer segmentation, document grouping, and anomaly detection.

Example 5: KMeans Clustering

KMeans partitions data into K clusters by iteratively assigning each point to its nearest cluster center, then updating the centers based on the assigned points. This example generates synthetic data with three distinct groups and lets KMeans find them.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# Generate synthetic data with 3 natural clusters
X, y_true = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=0.60,
    random_state=42
)

# Apply KMeans
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)

# Evaluate with silhouette score
score = silhouette_score(X, y_pred)
print(f"Silhouette Score: {score:.4f}")

# Cluster centers
print("\nCluster Centers:")
for i, center in enumerate(kmeans.cluster_centers_):
    print(f"  Cluster {i}: [{center[0]:.2f}, {center[1]:.2f}]")

# How many points per cluster
unique, counts = np.unique(y_pred, return_counts=True)
print("\nPoints per Cluster:")
for cluster_id, count in zip(unique, counts):
    print(f"  Cluster {cluster_id}: {count} points")

The silhouette score ranges from -1 to 1. A value near 1 means clusters are dense and well-separated. A value near 0 means clusters overlap. A negative value suggests data points may have been assigned to the wrong cluster. This metric is especially useful when you have no ground-truth labels to compare against.

Note

Choosing the right number of clusters (K) is a critical decision. Two common approaches are the Elbow Method, which plots inertia against different values of K and looks for a bend in the curve, and the Silhouette Method, which measures how well each point fits within its assigned cluster compared to neighboring clusters.

Example 6: Choosing K with the Elbow Method

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.80, random_state=42)

# Test different values of K
inertias = []
k_range = range(2, 10)

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

# Print results to find the "elbow"
print("K  |  Inertia")
print("-" * 20)
for k, inertia in zip(k_range, inertias):
    print(f"{k}  |  {inertia:.2f}")

Look for the value of K where inertia stops decreasing sharply and begins to level off. That bend is the "elbow," and it typically indicates the natural number of clusters in the data. In this example with four generated centers, you should see a clear elbow at K=4.
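The Silhouette Method mentioned in the note above gives a complementary, more automatic answer on the same kind of data: pick the K with the highest average silhouette score.

```python
# Choosing K with the silhouette method: the best K maximizes the
# average silhouette score across candidate values.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.80, random_state=42)

best_k, best_score = None, -1.0
for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"K={k}  silhouette={score:.4f}")
    if score > best_score:
        best_k, best_score = k, score

print(f"\nBest K by silhouette: {best_k}")
```

Unlike the elbow plot, which requires eyeballing a bend, the silhouette score gives a single number to maximize, which makes it easier to automate.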

Building a Complete ML Pipeline

In production workflows, you rarely run preprocessing and modeling as separate, disconnected steps. Scikit-learn's Pipeline class chains multiple transformations and a final estimator into a single object. This ensures transformations are applied consistently, prevents data leakage during cross-validation, and makes your code cleaner.

Example 7: Classification Pipeline with Preprocessing

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load and split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Build the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),          # Step 1: Scale features
    ('pca', PCA(n_components=2)),          # Step 2: Reduce to 2 dimensions
    ('classifier', RandomForestClassifier( # Step 3: Classify
        n_estimators=100, random_state=42
    ))
])

# Train the entire pipeline
pipeline.fit(X_train, y_train)

# Predict using the entire pipeline
y_pred = pipeline.predict(X_test)

print(f"Pipeline Accuracy: {accuracy_score(y_test, y_pred):.2f}")

When you call pipeline.fit(X_train, y_train), scikit-learn calls fit_transform() on every step except the last one, and fit() on the final estimator. When you call pipeline.predict(X_test), it applies transform() to each preprocessing step and predict() on the final estimator. This guarantees the test data goes through the exact same transformations as the training data.
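The fitted steps remain inspectable afterwards through the named_steps attribute. This sketch rebuilds the same pipeline structure as Example 7 and checks how much variance the two PCA components retain.

```python
# Inspecting an individual fitted step of a pipeline via named_steps.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X, y)

# How much of the original variance the two retained components explain
pca = pipeline.named_steps['pca']
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total retained: {pca.explained_variance_ratio_.sum():.2%}")
```

For the scaled Iris data, the first two principal components retain roughly 95% of the variance, which is why the dimensionality reduction costs the classifier so little accuracy.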

Example 8: Regression Pipeline with Imputation

Real-world datasets often contain missing values. This pipeline handles them automatically using SimpleImputer before scaling and modeling.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
import numpy as np

# Load data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# Simulate missing values in the training data
rng = np.random.RandomState(42)
missing_mask = rng.rand(*X_train.shape) < 0.05
X_train_missing = X_train.copy()
X_train_missing[missing_mask] = np.nan

missing_mask_test = rng.rand(*X_test.shape) < 0.05
X_test_missing = X_test.copy()
X_test_missing[missing_mask_test] = np.nan

# Build a pipeline that handles missing values
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Fill NaNs with median
    ('scaler', StandardScaler()),                    # Scale features
    ('regressor', Ridge(alpha=1.0))                  # Ridge regression
])

# Train and predict
pipeline.fit(X_train_missing, y_train)
y_pred = pipeline.predict(X_test_missing)

print(f"R-squared with missing data handled: {r2_score(y_test, y_pred):.4f}")

Pro Tip

Ridge regression adds an L2 penalty to the loss function, which discourages large coefficients and helps prevent overfitting. The alpha parameter controls the strength of this penalty. A larger alpha means stronger regularization. Use GridSearchCV to find the optimal value.

Model Evaluation and Cross-Validation

A single train-test split gives you one estimate of model performance, which can vary depending on how the data was divided. Cross-validation provides a more reliable estimate by splitting the data multiple times and averaging the results.

Example 9: K-Fold Cross-Validation

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create the model
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

print("Cross-Validation Results (5 folds):")
print(f"  Individual scores: {scores}")
print(f"  Mean accuracy:     {scores.mean():.4f}")
print(f"  Std deviation:     {scores.std():.4f}")

Five-fold cross-validation divides the data into five equal parts. In each round, one part is held out for testing while the other four are used for training. The model is trained and evaluated five times, and the results are averaged. This gives you both a performance estimate and a measure of how stable that estimate is (the standard deviation).
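The same procedure can be written out by hand with KFold, which makes each train/evaluate round explicit. One caveat: for classifiers, cross_val_score actually uses stratified splits by default, so this unshuffled-into-folds sketch is a close analogue rather than an exact replica.

```python
# What cross-validation does under the hood: an explicit loop over folds.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
import numpy as np

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X[train_idx], y[train_idx])          # train on four folds
    score = clf.score(X[test_idx], y[test_idx])  # evaluate on the held-out fold
    scores.append(score)
    print(f"Fold {fold}: {score:.4f}")

print(f"Mean: {np.mean(scores):.4f}")
```

Note that a fresh model is created inside the loop; reusing one fitted model across folds would leak information between rounds.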

Example 10: Hyperparameter Tuning with GridSearchCV

Machine learning models have hyperparameters that you set before training. Choosing the right combination can significantly affect performance. GridSearchCV systematically tests every combination of specified hyperparameter values using cross-validation.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Run grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1  # Use all CPU cores
)
grid_search.fit(X, y)

# Results
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score:   {grid_search.best_score_:.4f}")

# Access the best model directly
best_model = grid_search.best_estimator_
print(f"\nBest model: {best_model}")

The n_jobs=-1 parameter tells scikit-learn to use all available CPU cores, which speeds up the search considerably. With the grid defined above, there are 3 x 3 x 3 = 27 hyperparameter combinations, each evaluated with 5-fold cross-validation, for a total of 135 model fits. Once complete, the best_estimator_ attribute gives you a fully trained model with the optimal parameters, ready for predictions.

Warning

Grid search explores every combination in the parameter grid, which can become very slow with large grids. For high-dimensional search spaces, consider RandomizedSearchCV instead, which samples a fixed number of random combinations and often finds good results with far fewer evaluations.
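The randomized variant looks almost identical in code. This sketch reuses the grid from Example 10 but samples only 10 of the 27 combinations.

```python
# RandomizedSearchCV: sample n_iter random combinations instead of
# exhaustively evaluating the full grid.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,          # 10 sampled combinations instead of all 27
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
search.fit(X, y)

print(f"Best Parameters: {search.best_params_}")
print(f"Best CV Score:   {search.best_score_:.4f}")
```

With lists as values, combinations are sampled uniformly; you can also pass scipy.stats distributions to sample continuous hyperparameters like learning rates.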

Key Takeaways

  1. The fit/predict pattern is universal: Every scikit-learn estimator follows the same interface. Once you learn the workflow with one algorithm, switching to another requires changing just a few lines of code.
  2. Preprocessing matters as much as model choice: Scaling, imputation, and feature engineering often have a larger impact on performance than the choice of algorithm. Use pipelines to keep preprocessing consistent and prevent data leakage.
  3. Always validate with cross-validation: A single train-test split can produce misleading results. Cross-validation gives you a more reliable estimate of how your model will perform on unseen data.
  4. Start simple, then add complexity: Begin with a baseline model like Linear Regression or Logistic Regression. If that falls short, try ensemble methods like Random Forest or Gradient Boosting. Compare everything with the same evaluation metrics.
  5. Use GridSearchCV or RandomizedSearchCV for tuning: Manual hyperparameter tuning is inefficient and error-prone. Let scikit-learn's built-in search tools find the optimal configuration systematically.

These examples provide a working foundation for the core machine learning tasks in Python. Each script is self-contained and runnable as-is, so the next step is to modify them with your own data. Replace the built-in datasets with a CSV file loaded via pandas, adjust the hyperparameters, and experiment with different algorithms. The consistent scikit-learn API means the structure stays the same regardless of the dataset or problem.
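As a final sketch, here is what swapping in your own CSV data looks like. The file name and column names are placeholders; to keep the example self-contained and runnable, it writes a tiny CSV first, which you would skip with real data.

```python
# Loading your own data from CSV with pandas; the rest of the scikit-learn
# workflow is unchanged. File and column names here are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Self-contained demo only: create a tiny CSV. In practice the file exists.
pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'feature_b': [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    'label':     [0, 0, 0, 0, 1, 1, 1, 1],
}).to_csv('my_data.csv', index=False)

df = pd.read_csv('my_data.csv')
X = df[['feature_a', 'feature_b']]   # feature columns
y = df['label']                      # target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test):.2f}")
```

With only eight rows this is a toy, of course; the point is that once X and y are a DataFrame and a Series, every example above works on them unchanged.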
