Scikit-learn is the library that turns machine learning from a research paper into five lines of Python. Whether you want to classify emails, predict house prices, or segment customers into groups, the API works the same way every time — and that consistency is what makes it so powerful.
Scikit-learn started life as a Google Summer of Code project in 2007, created by David Cournapeau. Members of the French research institute INRIA took over development in 2010 and released the first public version that same year. Roughly 15 years later, the library sits at version 1.8 and is used by financial institutions, insurance companies, and data teams at organizations of every size. It does not try to compete with deep learning frameworks like TensorFlow or PyTorch. Instead, it occupies a distinct and irreplaceable role: clean, well-documented, production-ready classical machine learning.
What Scikit-learn Is (and Isn't)
Scikit-learn is an open-source Python library built on top of NumPy, SciPy, and Matplotlib. It provides tools for data preprocessing, supervised learning, unsupervised learning, model selection, and evaluation — all under a single, consistent API. You install it with a single command and import it as sklearn.
pip install scikit-learn
What it is not is a deep learning library. Scikit-learn does not include neural networks in the sense that TensorFlow or PyTorch do. It has a basic MLPClassifier and MLPRegressor for shallow neural nets, but if you need transformers, CNNs, or GPU-accelerated gradient descent, you will need a different tool. Scikit-learn shines on tabular data with structured features — the kind of data that lives in spreadsheets and databases.
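Those shallow nets are still only a few lines with the same estimator API. A minimal sketch using the bundled Iris data (the hidden-layer size and iteration cap here are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One hidden layer of 32 units -- "shallow" by deep learning standards.
# MLPs train poorly on unscaled features, so scale inside a pipeline.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=42),
)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))
```

Note that it follows the same fit()/predict()/score() conventions as every other estimator — there is nothing neural-network-specific to learn beyond the hyperparameters.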
Scikit-learn requires Python 3.9 or later. Version 1.8, released December 2025, adds support through Python 3.14, including free-threaded CPython builds.
Real companies rely on it in production. AXA uses it to speed up car accident compensation workflows and flag potential fraud. Zopa, the peer-to-peer lending platform, uses it for credit risk modelling and marketing segmentation. BNP Paribas Cardif uses it to route incoming mail and manage internal model governance through pipelines that reduce overfitting risk. These are not toy projects.
Core Concepts: The Estimator API
The entire library is built around one idea: every object that learns from data is an estimator. Estimators share a common interface, which means once you learn how to use one model, you essentially know how to use all of them.
Every estimator has a fit() method that trains the model on data. Supervised models then expose a predict() method to generate outputs. Transformers — objects that preprocess data rather than make predictions — expose a transform() method instead. Many transformers also offer fit_transform() as a convenience shortcut.
from sklearn.preprocessing import StandardScaler
import numpy as np
# Sample data: 4 observations, 2 features

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0],
              [40.0, 800.0]])
scaler = StandardScaler()
# fit() learns the mean and std of each feature
# transform() applies the scaling
X_scaled = scaler.fit_transform(X)
print(X_scaled)
The output will be a zero-mean, unit-variance version of your original data. Notice that you did not write any math — you just called the method. That is the entire promise of the Scikit-learn API.
Always fit your scaler (or any transformer) on training data only, then use transform() on your test set. Fitting on the full dataset leaks information and makes your evaluation results unreliable.
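In code, the correct pattern looks like this (the random data here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics -- no leakage
```

The test set is scaled with the training set's mean and standard deviation, so its columns will not be exactly zero-mean — and that is correct, because at prediction time you will never have the "future" data available to refit on.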
When you need to evaluate a model, Scikit-learn provides train_test_split() to divide your data, and a full suite of metrics under sklearn.metrics — accuracy, F1 score, mean squared error, ROC AUC, and more. Cross-validation helpers like cross_val_score() wrap the whole process automatically.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Score on held-out test set
print("Test accuracy:", model.score(X_test, y_test))
# Or use cross-validation for a more robust estimate
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
Classification, Regression, and Clustering
Scikit-learn organizes machine learning tasks into three broad categories, each with a large selection of algorithms. The choice of algorithm depends on your data, your problem type, and how much interpretability you need.
Classification
Classification problems ask the model to assign observations to discrete categories. Common algorithms include Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, Support Vector Machines, and k-Nearest Neighbors. The example below trains a Random Forest on the Iris dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred,
                            target_names=['setosa', 'versicolor', 'virginica']))
The classification_report() function prints precision, recall, and F1 score for every class — far more informative than accuracy alone when classes are imbalanced.
Regression
Regression problems predict a continuous numeric value. The same estimator API applies — swap the classifier for a regressor and use metrics like Mean Absolute Error or Root Mean Squared Error instead of accuracy.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.datasets import fetch_california_housing
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
reg = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1)
reg.fit(X_train, y_train)
preds = reg.predict(X_test)
print("MAE:", mean_absolute_error(y_test, preds))
Clustering
Clustering is unsupervised — there are no labels. The algorithm tries to discover natural groupings in the data. K-Means is the classic starting point, but Scikit-learn also offers DBSCAN, Agglomerative Clustering, and several others for cases where the number of clusters is unknown or the shape of clusters is irregular.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Scale first — K-Means is distance-based and sensitive to feature scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=3, random_state=42, n_init='auto')
kmeans.fit(X_scaled)
labels = kmeans.labels_
print("Cluster assignments:", labels[:10])
Setting n_init='auto' in K-Means (introduced in version 1.2) avoids the deprecation warning from the old default and lets the library choose the number of initializations intelligently based on your data size.
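DBSCAN, mentioned above for irregular cluster shapes, is worth a quick sketch: it needs no cluster count up front and instead grows clusters from dense regions. A minimal example on synthetic half-moon data (the eps and min_samples values are tuned to this toy dataset, not defaults to copy blindly):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons: a shape K-Means handles badly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# eps is the neighborhood radius; min_samples the density threshold
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# DBSCAN labels outliers as -1, so exclude them when counting clusters
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("Clusters found:", n_clusters)
```

Because it reasons about local density rather than distance to a centroid, DBSCAN recovers the two moons that K-Means would split incorrectly.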
Pipelines: Wiring It All Together
Pipelines are the feature that separates people who use Scikit-learn casually from people who use it well. A pipeline chains preprocessing steps and a final estimator into a single object. You call fit() once and the pipeline handles everything in order — no risk of accidentally transforming test data with a scaler that was fit on training data, no repeated boilerplate.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svc', SVC())
])
# Define a hyperparameter grid — note the double underscore syntax
# to reference parameters inside pipeline steps
param_grid = {
    'pca__n_components': [5, 10, 20],
    'svc__C': [0.1, 1.0, 10.0],
    'svc__kernel': ['linear', 'rbf']
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)
The double underscore syntax (pca__n_components) is how you reach inside a pipeline step to tune its hyperparameters. GridSearchCV then exhaustively tries every combination you specify. For larger search spaces, RandomizedSearchCV is faster — it samples a fixed number of parameter combinations at random rather than evaluating all of them.
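A minimal RandomizedSearchCV sketch over the same kind of pipeline — the dataset (the bundled digits set) and the sampling distributions here are illustrative choices:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svc', SVC())
])

# Unlike a grid, distributions can be continuous -- RandomizedSearchCV
# samples n_iter combinations from them rather than trying every one
param_distributions = {
    'pca__n_components': [10, 20, 30],
    'svc__C': loguniform(1e-2, 1e2),
}
search = RandomizedSearchCV(pipe, param_distributions, n_iter=10,
                            cv=3, random_state=42, n_jobs=-1)
search.fit(X, y)
print("Best params:", search.best_params_)
```

Ten sampled combinations instead of an exhaustive sweep often finds a near-optimal configuration at a fraction of the cost, especially when only one or two hyperparameters actually matter.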
"Pipelines are the closest thing sklearn has to a silver bullet. They make your code cleaner, your validation more correct, and your models easier to deploy." — common wisdom in the sklearn community
If your dataset has a mix of numeric and categorical features, ColumnTransformer lets you apply different preprocessing to different columns and then feeds the combined result into a pipeline. This is the standard approach for real-world tabular data.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
numeric_features = ['age', 'salary']
categorical_features = ['department', 'region']
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
full_pipeline.fit(X_train, y_train)
print("Score:", full_pipeline.score(X_test, y_test))
What's New in Version 1.8
Scikit-learn 1.8 was released on December 10, 2025 and represents one of the more significant updates the library has seen in recent years. The headline change is native Array API support — a step that brings meaningful GPU acceleration to Scikit-learn for the first time without requiring a complete rewrite of your codebase.
Array API and GPU Support
The Python Array API standard defines a consistent interface across array libraries including NumPy, PyTorch, and CuPy. By adopting this standard, Scikit-learn can now accept PyTorch tensors and CuPy arrays directly and dispatch computation to whatever device those arrays live on — including GPUs. In practice, this means you can push data onto a GPU before passing it to a supported estimator, and the computation happens there without converting back to NumPy first.
import os
os.environ["SCIPY_ARRAY_API"] = "1" # required before importing scipy/sklearn
import torch
import numpy as np
import sklearn
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import RidgeClassifierCV
from sklearn.calibration import CalibratedClassifierCV
# Move data to GPU
X_gpu = torch.tensor(X_train.astype(np.float32), device="cuda")
y_gpu = torch.tensor(y_train.astype(np.float32), device="cuda")
alphas = [0.1, 1.0, 10.0]
# This pipeline runs RidgeClassifierCV and calibration on the GPU
gpu_pipeline = make_pipeline(
    CalibratedClassifierCV(
        RidgeClassifierCV(alphas=alphas),
        method="temperature"
    )
)
with sklearn.config_context(array_api_dispatch=True):
    gpu_pipeline.fit(X_gpu, y_gpu)
Array API support is still experimental and must be explicitly enabled via sklearn.config_context(array_api_dispatch=True) and the SCIPY_ARRAY_API=1 environment variable. Not every estimator supports it yet — check the official docs before assuming your workflow will run on GPU.
The performance gains are real. On typical cross-validation workloads, running on a GPU through PyTorch tensors yields roughly a 10x speedup compared to a single CPU core. The library team notes that this is driven by PyTorch's multithreaded CPU operations even before you touch a GPU — simply using PyTorch tensors on CPU is already faster than NumPy for many operations.
Classical MDS Added to Manifold Module
Version 1.8 also adds Classical Multidimensional Scaling (MDS), known in some fields as Principal Coordinates Analysis, to sklearn.manifold. Classical MDS recovers pairwise scalar products from a distance matrix and embeds them through eigendecomposition. It has an exact analytic solution — unlike iterative MDS, it does not require random initialization and is fully deterministic.
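The underlying computation can be sketched in a few lines of NumPy — a from-scratch illustration of the algorithm, not the new sklearn.manifold API:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed n points in k dimensions from an n x n distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered scalar products
    eigvals, eigvecs = np.linalg.eigh(B)       # exact eigendecomposition -- deterministic
    idx = np.argsort(eigvals)[::-1][:k]        # keep the k largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))

# Distances among 4 points on a line are recovered exactly (up to sign/shift)
X = np.array([[0.0], [1.0], [2.0], [4.0]])
D = np.abs(X - X.T)
Y = classical_mds(D, k=1)
print(np.round(np.abs(Y[:, 0] - Y[0, 0]), 6))
```

Because every step is a closed-form linear algebra operation, running it twice on the same distances always gives the same embedding — which is exactly the determinism the release notes highlight.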
CPU and Memory Efficiency Improvements
Estimators and metric functions that rely on weighted percentiles received significant efficiency improvements in 1.8, with better alignment to NumPy and SciPy's unweighted implementations. Linear model fit times were also improved. These are the kinds of changes that do not make for flashy announcements but matter a great deal if you are training models on large datasets in production.
Free-Threaded Python 3.14 Support
Scikit-learn 1.8 ships free-threaded wheels for Python 3.14 across all supported platforms. Free-threaded Python removes the Global Interpreter Lock for multi-threaded workloads, which could unlock meaningful parallelism for certain Scikit-learn workflows. The library team is actively soliciting user feedback on this feature.
Key Takeaways
- The estimator API is the foundation: Every object in Scikit-learn follows the same fit()/predict()/transform() pattern. Learn it once and the rest of the library becomes intuitive.
- Use pipelines from the start: Pipelines prevent data leakage, reduce boilerplate, and make your models deployable. There is almost no situation where writing preprocessing steps outside a pipeline is the better choice.
- Scikit-learn handles tabular data better than almost anything else: For structured data with engineered features, Random Forest, Gradient Boosting, and Logistic Regression with proper preprocessing will outperform neural networks in many real-world scenarios — and they are far easier to interpret and debug.
- Version 1.8 brings meaningful GPU support: If you work with larger datasets and have access to a GPU, the new Array API dispatch path is worth evaluating. It is experimental but functional for a growing list of estimators including StandardScaler, RidgeClassifierCV, PolynomialFeatures, and CalibratedClassifierCV.
- Validation is built in: cross_val_score(), GridSearchCV, and classification_report() give you honest, rigorous evaluation with very little code. Use them. Do not evaluate your model on the same data you trained on.
Scikit-learn is one of the most mature, well-documented libraries in the Python ecosystem. Its API has remained stable across major versions, its documentation includes worked examples for nearly every estimator, and its community is active. If you are building anything with tabular data in Python — whether it is your first model or your fiftieth — it belongs in your toolkit.