Python Machine Learning Tutorial

Machine learning lets you write programs that learn from data instead of following explicit rules. This tutorial walks you through the complete workflow — from loading data and training a model to evaluating predictions — using Python and scikit-learn.

You do not need a math degree or a GPU cluster to get started with machine learning in Python. The scikit-learn library handles the heavy lifting so you can focus on understanding the concepts and building intuition through working code. By the end of this tutorial you will have trained, evaluated, and compared real models on a real dataset.

What Is Machine Learning?

Traditional programming works like this: you write rules, feed in data, and get output. Machine learning flips that model. You feed in data along with the expected output, and the algorithm figures out the rules on its own. Those rules are called a model, and finding them is called training.

There are three broad categories:

  • Supervised learning — the training data includes labeled examples (input + correct answer). This is where most beginners start, and it's what this tutorial covers.
  • Unsupervised learning — the data has no labels. The algorithm finds patterns on its own, such as grouping similar records together (clustering).
  • Reinforcement learning — an agent takes actions in an environment and learns from rewards and penalties over time.

Within supervised learning, tasks split into two types: classification (predicting a category, like spam vs. not spam) and regression (predicting a number, like a house price). This tutorial focuses on classification because the results are easy to interpret.

The supervised learning workflow:

  1. Collect & Load Data
  2. Explore & Clean
  3. Split Train / Test
  4. Preprocess Features
  5. Train Model
  6. Evaluate & Tune
  7. Predict on New Data
Note

This tutorial assumes you can read and write basic Python without looking everything up: you know how variables, strings, lists, and dictionaries work; you can write a for loop and an if statement; you have called functions and used import to bring in a library. You do not need to know pandas or any machine learning library — those are introduced here. You do not need statistics beyond knowing what an average is. If f-strings like f"accuracy: {score:.2%}" look unfamiliar, spend twenty minutes with the fundamentals articles on this site first.

Setting Up Your Environment

You need three libraries: scikit-learn for machine learning, pandas for data handling, and numpy for numerical operations. Install them with pip if you have not already:

pip install scikit-learn pandas numpy

Once installed, verify everything works by running a quick import check:

import sklearn
import pandas as pd
import numpy as np

print("scikit-learn version:", sklearn.__version__)
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)

If all three lines print version numbers without errors, your environment is ready. This tutorial was tested against scikit-learn 1.8 (released December 2025) and pandas 3.0 (released January 2026). Any version in those families will work for the examples here. Note that pandas 3.0 requires Python 3.11 or newer, and scikit-learn 1.8 requires Python 3.11 or newer as well — if you are on an older Python, install the previous stable releases instead.

pandas 3.0 Copy-on-Write

pandas 3.0 introduced copy-on-write by default: any DataFrame or Series derived from another now behaves as an independent copy, so modifying it never silently changes the original. For machine learning workflows this is almost entirely a benefit — it prevents hard-to-trace bugs where preprocessing steps accidentally mutate source data. The change is documented in the pandas 3.0 release notes.
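A quick sketch of the behavior, using a toy DataFrame made up for illustration — the subset can be modified freely without touching the source:

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, 4.9, 6.3], "species": [0, 0, 2]})

# Take a subset, then modify it
subset = df[df["species"] == 0]
subset.loc[:, "sepal_length"] = 0.0

# Under copy-on-write, the original DataFrame is untouched
print(df["sepal_length"].tolist())  # [5.1, 4.9, 6.3]
```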

Pro Tip

Use a virtual environment to keep your project dependencies isolated. Run python -m venv ml_env and then activate it before installing packages. This prevents version conflicts across projects.

Loading and Exploring Data

Every machine learning project starts with data. scikit-learn ships with several built-in datasets for practice. We'll use the Iris dataset — 150 samples of iris flowers, each described by four measurements (sepal length, sepal width, petal length, petal width), classified into three species. It's small, clean, and perfect for learning the workflow.

About the Iris Dataset

The Iris dataset originates from British statistician and geneticist Sir R.A. Fisher's 1936 paper "The use of multiple measurements in taxonomic problems" (Annals of Eugenics, 7, Part II, 179–188). Fisher introduced it to demonstrate linear discriminant analysis. scikit-learn's bundled copy comes directly from Fisher's paper rather than the UCI Machine Learning Repository — the UCI version contains two data points that differ from the original publication. The three species are Iris setosa, Iris versicolor, and Iris virginica. One species (setosa) is linearly separable from the other two; versicolor and virginica are not linearly separable from each other, making the dataset a useful illustration of where simple models succeed and where they struggle.

from sklearn.datasets import load_iris
import pandas as pd

# Load the dataset
iris = load_iris()

# Wrap it in a DataFrame for easy exploration
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

print(df.shape)        # (150, 5)
print(df.head())
print(df['species'].value_counts())

The iris.data array holds the features — the measurements your model will learn from. The iris.target array holds the labels — the correct species for each sample (0, 1, or 2). In machine learning terminology, features are also called X and labels are called y. That naming convention appears everywhere.

# Separate features and labels
X = iris.data    # shape: (150, 4)
y = iris.target  # shape: (150,)

print("Features shape:", X.shape)
print("Labels shape:", y.shape)
print("Unique classes:", set(y))  # {0, 1, 2}

Before training anything, it's worth checking for missing values. Real-world data is rarely this clean, but building the habit early pays off later.

# Check for missing values
print(df.isnull().sum())

# Basic statistics
print(df.describe())
Note

The Iris dataset has no missing values, but many real datasets do. When missing values are present, common strategies include dropping those rows, filling with the column mean, or using scikit-learn's SimpleImputer to handle it systematically.
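As a quick sketch of the SimpleImputer approach (the tiny feature matrix here is made up for illustration), missing values are replaced by the mean of their column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A tiny feature matrix with one missing value (np.nan)
X_messy = np.array([
    [5.1, 3.5],
    [4.9, np.nan],
    [6.3, 3.3],
])

# Replace each missing value with its column mean
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_messy)

print(X_filled[1, 1])  # mean of 3.5 and 3.3, i.e. 3.4
```

Because the imputer is fit separately from transform, you can fit it on training data only and apply the same learned means to the test set, avoiding leakage.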

Pause & Reflect

You just separated iris.data into X and y. What is the fundamental reason machine learning algorithms need these kept apart — and what would go wrong if you combined them into a single array during training?

To recap the split you just made:

  • X — the feature matrix: every measurement the model learns from (sepal length, petal width, etc.)
  • y — the label vector: the correct answer for each row (0 = setosa, 1 = versicolor, 2 = virginica)

Training Your First Model

Before training, you must split your data into a training set and a test set. You train the model on one portion, then check how well it generalizes to the other portion it has never seen. Without this split, you have no way to know whether the model has genuinely learned or has simply memorized the training data — a problem called overfitting.

from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% goes to the test set
    random_state=42,    # fixes the random split so results are reproducible
    stratify=y          # ensures each class is proportionally represented
)
 
print("Training samples:", X_train.shape[0])  # 120
print("Test samples:", X_test.shape[0])        # 30

The random_state=42 argument makes the split reproducible. Using stratify=y is a good habit — it ensures the class distribution in the training and test sets mirrors the original data, which matters when classes are imbalanced.

What is class imbalance?

A dataset is imbalanced when one class appears far more often than another — for example, 950 legitimate transactions and 50 fraudulent ones in a fraud detection dataset. Without stratify=y, a random split might send all 50 fraud cases into training with none in the test set, or vice versa, making evaluation meaningless. With stratify=y, scikit-learn ensures each split reflects the original class ratio. Iris is balanced (50 samples per species), so this doesn't affect the numbers here — but on real-world data it often matters a great deal.
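A quick sketch with a deliberately imbalanced toy dataset makes the guarantee visible (the 90/10 labels and dummy feature here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 samples of class 0, 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)  # dummy feature column

_, _, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

# Both splits keep the original 90/10 class ratio
print(np.bincount(y_tr))  # 72 of class 0, 8 of class 1
print(np.bincount(y_te))  # 18 of class 0, 2 of class 1
```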

Now train a K-Nearest Neighbors (KNN) classifier. KNN is a great starting point because it's conceptually simple: to classify a new sample, it looks at the K closest training samples and takes a majority vote.

from sklearn.neighbors import KNeighborsClassifier
 
knn = KNeighborsClassifier(n_neighbors=3)
 
knn.fit(X_train, y_train)
 
y_pred = knn.predict(X_test)
 
print("Predictions:", y_pred)
print("Actual:     ", y_test)

The entire scikit-learn API follows the same three-step pattern: instantiate the model, fit it on training data, then predict on new data. Once you learn this pattern with KNN, it transfers to every other algorithm in the library.

Pause & Reflect

KNN finds the K nearest neighbors and takes a majority vote. Before tuning K, ask yourself: what does "nearest" actually mean mathematically here — and what would happen to that calculation if one feature was measured in centimeters while another was measured in kilometers?

With n_neighbors=3 on this split, training accuracy lands around 97% and test accuracy around 93%. That small train/test gap indicates good generalization.

How do you choose the right value for K?

The examples here use n_neighbors=3, but that number is not special. K is a hyperparameter — a setting you choose before training, not something the algorithm learns from data. The value you pick has a real effect on how your model behaves.

A small K (like 1 or 2) makes the model very sensitive to individual data points, including noise. It tends to fit the training data tightly but generalize poorly to new samples — that's overfitting. A large K smooths out those local quirks but can push the decision boundary in the wrong direction if K grows too large relative to your dataset — that's underfitting. The goal is somewhere in between.

The practical way to find a good K is to try a range of values and compare cross-validation scores:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
 
k_range = range(1, 21)
k_scores = []
 
for k in k_range:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_k, X, y, cv=5, scoring='accuracy')
    k_scores.append(scores.mean())
 
best_k = k_range[k_scores.index(max(k_scores))]
print(f"Best K: {best_k}  |  CV accuracy: {max(k_scores):.2%}")

Odd values for K are often preferred in binary classification problems because they prevent tied votes, though with three or more classes ties can still occur regardless. As a rough starting point, the square root of your training set size is a commonly cited heuristic — for 120 training samples that's about K=11. Let cross-validation confirm it rather than trusting the heuristic blindly.

Systematic hyperparameter search with GridSearchCV

The loop above works, but writing it by hand every time you want to tune a hyperparameter gets old fast. GridSearchCV automates exactly that pattern: you hand it a dictionary of parameter values to try, it runs cross-validated evaluation for every combination, and it stores the best result for you. You get the same answer with far less code — and it scales to multiple hyperparameters and multiple algorithms without rewriting anything.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
 
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
 
param_grid = {
    'knn__n_neighbors': list(range(1, 21)),
    'knn__metric': ['euclidean', 'manhattan']
}
 
grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='accuracy',
    refit=True
)
 
grid.fit(X_train, y_train)
 
print(f"Best params:    {grid.best_params_}")
print(f"Best CV score:  {grid.best_score_:.2%}")
print(f"Test accuracy:  {grid.score(X_test, y_test):.2%}")

The output will show something like 'knn__metric': 'euclidean', 'knn__n_neighbors': 13 — the winning combination across all 40 candidates evaluated by cross-validation. grid.best_estimator_ is the fully trained pipeline using those settings, ready to call predict() on.

You can inspect the full results table if you want to understand how each combination performed, not just the winner:

import pandas as pd

results = pd.DataFrame(grid.cv_results_)

# Show just the columns that matter
print(results[['param_knn__n_neighbors',
               'param_knn__metric',
               'mean_test_score',
               'std_test_score',
               'rank_test_score']].sort_values('rank_test_score').head(10))
Do not run GridSearchCV on the full dataset before splitting

Call grid.fit(X_train, y_train), not grid.fit(X, y). GridSearchCV handles its own internal cross-validation splits, but it still uses whatever data you hand it. If you pass the full dataset, your test set leaks into the search — the same data leakage problem Pipelines are meant to prevent. Split first, search on the training set only, evaluate once on the test set at the end.

Pro Tip

On larger datasets with many hyperparameter combinations, RandomizedSearchCV is worth knowing. Instead of exhaustively trying every combination in param_grid, it samples a fixed number of random combinations from distributions you specify — much faster when the search space is large. The API is identical to GridSearchCV except you add an n_iter argument for how many combinations to sample. For the small-scale examples in this tutorial, GridSearchCV is the right tool.
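A minimal sketch of that swap, reusing the pipeline from the GridSearchCV example. The scipy randint distribution replaces the explicit list of K values; n_iter=10 samples 10 of the 40 possible combinations:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Sample n_neighbors from a distribution instead of listing every value
param_dist = {
    'knn__n_neighbors': randint(1, 21),  # draws integers 1..20
    'knn__metric': ['euclidean', 'manhattan'],
}

search = RandomizedSearchCV(
    pipeline,
    param_dist,
    n_iter=10,          # try 10 random combinations instead of all 40
    cv=5,
    scoring='accuracy',
    random_state=42,
)
search.fit(X_train, y_train)

print(f"Best params:   {search.best_params_}")
print(f"Best CV score: {search.best_score_:.2%}")
```

On a dataset this small the savings are trivial; the pattern pays off when the grid has thousands of combinations.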

Pause & Reflect

GridSearchCV ran 40 cross-validated experiments to find the best K and distance metric. If you then evaluate grid.score(X_test, y_test) and report that number as your model's accuracy, is that a fair estimate of real-world performance — and why does it matter that you only call grid.fit(X_train, y_train) and never grid.fit(X, y)?


Making predictions on new data

Once trained, your model can classify any new set of measurements. Pass in a 2D array — even for a single sample — because scikit-learn expects that shape:

import numpy as np
 
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
 
prediction = knn.predict(new_flower)
species_name = iris.target_names[prediction[0]]
 
print(f"Predicted species: {species_name}")  # setosa

Evaluating Model Performance

Accuracy — the percentage of correct predictions — is the obvious starting point, but it doesn't tell the whole story. scikit-learn's classification_report gives you precision, recall, and F1 score broken down by class, which reveals where your model struggles.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
 
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
 
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
 
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Before unpacking those metrics one by one, it helps to see an annotated sample. Here is what classification_report prints for a well-trained KNN on the Iris test set, with notes on what each row tells you:

Sample output — classification_report — KNN on Iris test set (30 samples):

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.91      0.91      0.91        11
   virginica       0.89      0.89      0.89         9

    accuracy                           0.93        30
   macro avg       0.93      0.93      0.93        30
weighted avg       0.93      0.93      0.93        30

What each row tells you:

  • setosa — perfect scores. Setosa is linearly separable from the other two species, so every model gets this class right.
  • versicolor — recall of 0.91 means 1 of the 11 actual versicolor samples was missed; the model predicted virginica for it instead.
  • virginica — precision of 0.89 means 1 of the model's 9 virginica predictions was wrong; that sample was actually a versicolor.
  • accuracy — the overall share of correct predictions across all 30 test samples. It ignores which classes are harder.
  • macro avg — a simple average across the three classes, treating each class equally regardless of how many samples it has.
  • weighted avg — the average weighted by support (the number of actual samples per class). On imbalanced data this can hide poor performance on minority classes.

As a rough rule of thumb, per-class scores of 0.90–1.00 are strong; 0.75–0.89 is acceptable but worth investigating.

The two errors this model makes are symmetric: it confuses versicolor and virginica with each other, which is expected — those two species overlap in feature space. The table above makes that visible without needing to read the confusion matrix first.

The confusion matrix is a grid where rows represent actual classes and columns represent predicted classes. Values on the diagonal are correct predictions; everything off-diagonal is an error. For a 3-class problem you get a 3x3 matrix.
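A toy example makes the row/column convention concrete (the labels here are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy 3-class labels: actual vs predicted
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Row i = actual class i, column j = predicted class j.
# Row 1 reads [0, 2, 1]: of the three actual class-1 samples,
# two were predicted correctly and one was predicted as class 2.

# Diagonal entries are correct predictions
print("correct:", np.trace(cm), "of", cm.sum())  # correct: 6 of 8
```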

Reading raw numbers from print(confusion_matrix(...)) works, but scikit-learn ships a display class that renders the same data as a labeled grid — making it much easier to see which classes your model confuses. ConfusionMatrixDisplay requires matplotlib, which is not installed automatically alongside scikit-learn; run pip install matplotlib if you do not already have it.

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
 
disp = ConfusionMatrixDisplay.from_estimator(
    knn,
    X_test,
    y_test,
    display_labels=iris.target_names,
    cmap="Blues"
)
 
disp.ax_.set_title("KNN Confusion Matrix — Iris Test Set")
plt.tight_layout()
plt.show()

Once you can see the matrix visually, the precision and recall numbers from classification_report click into place. A class with low recall will show a row where predictions scatter across columns instead of concentrating on the diagonal — the model is missing samples that actually belong to that class. A class with low precision will show a column with off-diagonal entries — the model is over-predicting that class, pulling in samples that belong elsewhere. The matrix and the report are telling you the same story from two angles; having both makes it harder to misread either one.

Pro Tip

If you already have predictions stored in y_pred, use ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=...) instead of from_estimator. from_predictions skips the internal predict() call — useful when predictions are expensive to recompute or when you're evaluating output from a model that doesn't follow the scikit-learn API.

What do precision, recall, and F1 actually mean?

Accuracy — the ratio of correct predictions to total predictions — is easy to understand, but it can mislead you. If 95% of your samples belong to one class, a model that always predicts that class will report 95% accuracy while being completely useless. The metrics in classification_report give you a clearer picture:

  • Precision — of all the times the model predicted a given class, how often was it right? High precision means few false positives. If your spam filter has low precision, it flags real emails as spam.
  • Recall — of all the actual samples belonging to a class, how many did the model catch? High recall means few false negatives. If your fraud detector has low recall, real fraud slips through undetected.
  • F1 score — the harmonic mean of precision and recall. It balances both in a single number and is particularly useful when the costs of false positives and false negatives both matter.

Which metric you prioritize depends on what your model is actually doing. For a cancer screening tool, recall matters more — missing a positive case is worse than a false alarm. For a content moderation filter, precision might matter more — falsely removing legitimate content has its own cost. On a balanced dataset like Iris all three metrics will look similar, but this distinction becomes critical on real-world data.
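To make the definitions concrete, here is a small sketch using scikit-learn's individual metric functions on toy spam-filter labels (the labels are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy binary labels: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # one missed spam, one false alarm

# Precision: of the 3 "spam" predictions, 2 were right -> 2/3
print(f"precision: {precision_score(y_true, y_pred):.2f}")
# Recall: of the 3 actual spam emails, 2 were caught -> 2/3
print(f"recall:    {recall_score(y_true, y_pred):.2f}")
# F1: harmonic mean of the two
print(f"f1:        {f1_score(y_true, y_pred):.2f}")
```

These are the same numbers classification_report computes per class; the standalone functions are handy when you want a single metric inside a loop or a custom scorer.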

Pro Tip

The classification_report also prints a macro average (treats each class equally) and a weighted average (weights by class frequency). On imbalanced data the two averages can differ significantly — the weighted average will be pulled toward the dominant class. Use macro average when you care equally about performance across all classes.

"Predicting the labels of a test set must not be used to choose the parameters of a model, as this would lead to an overfit of the parameters to the test set." — scikit-learn User Guide, Cross-validation: evaluating estimator performance

Cross-validation for a more reliable estimate

A single train/test split can be unlucky. Cross-validation splits the data K times, trains and tests on each fold, and averages the results. It costs more compute time but gives a much more reliable performance estimate.

from sklearn.model_selection import cross_val_score
 
knn_cv = KNeighborsClassifier(n_neighbors=3)
cv_scores = cross_val_score(knn_cv, X, y, cv=5, scoring='accuracy')
 
print(f"CV Scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.2%}")
print(f"Std deviation: {cv_scores.std():.2%}")
Pro Tip

A low standard deviation across CV folds is a good sign — it means your model's performance is stable and not heavily dependent on which particular samples ended up in which fold.

Pause & Reflect

Cross-validation trained and evaluated your model 5 separate times, each on a different slice of data. Why does averaging those 5 scores give you a more trustworthy number than a single train/test split — and when might even cross-validation still mislead you?


Common Algorithms Compared

KNN is a fine starting point, but scikit-learn gives you access to dozens of algorithms with the same API. This section runs all four classifiers on the same Iris data you've been using — same train/test split, same evaluation code — so you can see exactly how the API stays consistent while the internals change completely.

Start with a benchmark loop so you have a number to compare against as you read each section:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

models = {
    "K-Nearest Neighbors":    KNeighborsClassifier(n_neighbors=3),
    "Decision Tree":          DecisionTreeClassifier(random_state=42),
    "Random Forest":          RandomForestClassifier(n_estimators=100, random_state=42),
    "Support Vector Machine": SVC(kernel='rbf', random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:25s}  mean={scores.mean():.2%}  std={scores.std():.2%}")

All four will score well on Iris because it is a clean, balanced, three-class dataset. The value is not that one number beats another — it is building the habit of benchmarking before committing. The sections below go one algorithm at a time, showing you the constructor call, the key parameters you will actually touch, and how to read the output.

scikit-learn 1.8: GPU Support via Array API

scikit-learn 1.8 (December 2025) introduced native Array API support, which allows many estimators to work directly with PyTorch tensors and CuPy arrays rather than NumPy arrays. This means GPU-accelerated training is now possible for supported estimators without a framework change. For the beginner examples in this tutorial, you won't need it — but it's worth knowing that the same API you're learning here scales to GPU hardware as your datasets grow. See the Array API support documentation for details.

  • K-Nearest Neighbors — Strengths: simple, no training phase, works well on small datasets. Watch out for: slow at prediction time on large datasets; sensitive to feature scaling.
  • Decision Tree — Strengths: highly interpretable, handles mixed feature types. Watch out for: prone to overfitting without pruning or depth limits.
  • Random Forest — Strengths: strong out-of-the-box accuracy, resistant to overfitting. Watch out for: less interpretable; slower to train than a single tree.
  • Support Vector Machine — Strengths: effective in high-dimensional spaces; powerful with the right kernel. Watch out for: fit time scales at least quadratically with samples, impractical beyond tens of thousands of rows; requires feature scaling.

Decision Tree

A Decision Tree splits the training data into increasingly pure groups by asking a sequence of yes/no questions about feature values. Each split is chosen to maximize how cleanly the resulting groups separate the classes — the algorithm evaluates every possible threshold on every feature and picks the one that produces the purest split according to a criterion like Gini impurity. At prediction time it routes each new sample down the tree until it reaches a leaf node, which holds the majority class label from training samples that ended up there.

The main thing you control is max_depth. Left unconstrained, a Decision Tree will grow deep enough to perfectly memorize the training set — every leaf will contain exactly one sample, giving 100% training accuracy and poor generalization. Limiting depth is the primary regularization lever:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris
 
iris = load_iris()
X, y = iris.data, iris.target
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
 
dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)
 
y_pred = dt.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
 
importances = dt.feature_importances_
for name, score in zip(iris.feature_names, importances):
    print(f"  {name:28s}  {score:.3f}")

The feature importance output is one of the biggest reasons to try a Decision Tree early in a project. Even if you ultimately use a different algorithm, the importances tell you which features the data actually separates on — information that is useful everywhere else in your workflow.

To see the overfitting problem directly, remove max_depth and compare training accuracy against test accuracy:

# Unconstrained tree: classic overfitting demonstration
dt_overfit = DecisionTreeClassifier(random_state=42)  # no max_depth
dt_overfit.fit(X_train, y_train)
 
train_acc = dt_overfit.score(X_train, y_train)
test_acc  = dt_overfit.score(X_test, y_test)
print(f"Unconstrained — train: {train_acc:.2%}  test: {test_acc:.2%}")
 
# Constrained tree: max_depth closes the gap
dt_constrained = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_constrained.fit(X_train, y_train)
 
train_acc2 = dt_constrained.score(X_train, y_train)
test_acc2  = dt_constrained.score(X_test, y_test)
print(f"Constrained    — train: {train_acc2:.2%}  test: {test_acc2:.2%}")
Pro Tip

Besides max_depth, the min_samples_leaf parameter is a useful regularization handle. Setting it to something like 5 means the tree will refuse to create any leaf that represents fewer than 5 training samples — which prevents it from carving out tiny, noise-driven pockets in the feature space. Both parameters serve the same goal from different angles; max_depth limits overall complexity, min_samples_leaf prevents fragmented leaves.
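A quick sketch of that handle in use, reusing the same Iris split as above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Require every leaf to cover at least 5 training samples
dt_leaf = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)
dt_leaf.fit(X_train, y_train)

# With fragmented leaves forbidden, the train/test gap stays small
print(f"train: {dt_leaf.score(X_train, y_train):.2%}")
print(f"test:  {dt_leaf.score(X_test, y_test):.2%}")
```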

Random Forest

A Random Forest builds many Decision Trees and combines their predictions by majority vote. Each tree trains on a different random sample of the training data (drawn with replacement — this is called bootstrap sampling), and each split considers only a random subset of features rather than all of them. These two sources of randomness mean each tree makes different errors, and averaging over them cancels a lot of the noise that causes a single tree to overfit.

The result is a model that tends to be more accurate than any individual tree and far less sensitive to the depth limit that was necessary above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris
 
iris = load_iris()
X, y = iris.data, iris.target
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
 
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
 
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
 
importances = rf.feature_importances_
for name, score in zip(iris.feature_names, importances):
    print(f"  {name:28s}  {score:.3f}")

Notice that the API is identical to the Decision Tree above — fit(), predict(), feature_importances_. This consistency is by design. Switching algorithms in scikit-learn usually means changing one constructor call and nothing else.

You can also check cross-validated performance to confirm the forest generalizes, not just fits the test split you happened to get:

cv_scores = cross_val_score(rf, X, y, cv=5, scoring='accuracy')
print(f"CV mean: {cv_scores.mean():.2%}  std: {cv_scores.std():.2%}")
 
# Compare unconstrained Decision Tree vs Random Forest on the same split
from sklearn.tree import DecisionTreeClassifier
 
dt_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
rf_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)
 
print(f"Decision Tree  mean={dt_scores.mean():.2%}  std={dt_scores.std():.2%}")
print(f"Random Forest  mean={rf_scores.mean():.2%}  std={rf_scores.std():.2%}")
Pro Tip

On most tabular datasets, a Random Forest with n_estimators=100 and default settings gives you a very competitive baseline with almost no configuration. It is often the right choice for a first serious model because it rarely performs embarrassingly badly and gives you feature importances for free. When you need to squeeze out more accuracy, that is the time to look at gradient boosting methods like HistGradientBoostingClassifier — but start with the Random Forest.

Support Vector Machine

A Support Vector Machine finds the decision boundary that maximizes the margin between classes — the widest possible gap between the nearest training samples on each side. Those nearest samples are the support vectors; they are the only training points that actually determine where the boundary sits. All other training samples could be removed and the trained model would not change.

SVMs require feature scaling. The margin is measured in terms of distances in feature space, so a feature with large numeric values will dominate the boundary unless you normalize first. Always use a Pipeline here:

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris
 
iris = load_iris()
X, y = iris.data, iris.target
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
 
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm',    SVC(kernel='rbf', C=1.0, random_state=42))
])
 
svm_pipeline.fit(X_train, y_train)
y_pred = svm_pipeline.predict(X_test)
 
print(classification_report(y_test, y_pred, target_names=iris.target_names))

SVMs do not produce feature importances the way tree-based models do — the support vectors live in a transformed space, and attributing importance back to original features is not straightforward. If interpretability matters for your project, that is a real tradeoff to factor in when choosing between an SVM and a tree-based method.
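If you need importance estimates from an SVM anyway, one model-agnostic workaround is permutation importance: shuffle one feature at a time and measure how much the test score drops. A sketch on the Iris pipeline from above (rebuilt here so the snippet stands alone):

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('svm', SVC(kernel='rbf', C=1.0, random_state=42))])
pipe.fit(X_train, y_train)

# Shuffle each feature 10 times; the mean accuracy drop is its importance
result = permutation_importance(pipe, X_test, y_test, n_repeats=10, random_state=42)
for name, mean in zip(iris.feature_names, result.importances_mean):
    print(f"{name:20s} {mean:+.3f}")
```

Because it only needs predict and score, this works on any fitted estimator or Pipeline, not just SVMs.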

The two SVM parameters you will tune most often are C (regularization) and gamma (how far each training point's influence reaches). A practical way to explore both at once is GridSearchCV, covered in the next section. Here is the pattern:

from sklearn.model_selection import GridSearchCV
 
param_grid = {
    'svm__C':     [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.01, 0.001]
}
 
grid_search = GridSearchCV(svm_pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
 
print(f"Best params:    {grid_search.best_params_}")
print(f"Best CV score:  {grid_search.best_score_:.2%}")
print(f"Test accuracy:  {grid_search.score(X_test, y_test):.2%}")
SVM scaling limitation

SVC's fit time grows at least quadratically with the number of training samples — on 150 rows (Iris) it's instantaneous, but on 50,000 rows it can take minutes, and on 500,000 rows it becomes impractical. For large datasets, scikit-learn's LinearSVC (which only supports linear kernels but scales much better) or SGDClassifier with loss='hinge' gives you SVM-style margins without the quadratic scaling penalty.
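Both alternatives drop into the same Pipeline pattern. Here is a sketch on a synthetic dataset large enough that SVC would start to feel slow (the sizes and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# 20,000 rows — runs in moments here, but enough that SVC's quadratic cost bites
X_big, y_big = make_classification(n_samples=20_000, n_features=20, random_state=42)

linear_svm = Pipeline([('scaler', StandardScaler()),
                       ('svm', LinearSVC(C=1.0, random_state=42))])

# loss='hinge' makes SGDClassifier optimize the same objective as a linear SVM
sgd_svm = Pipeline([('scaler', StandardScaler()),
                    ('svm', SGDClassifier(loss='hinge', random_state=42))])

for name, model in [('LinearSVC', linear_svm), ('SGDClassifier', sgd_svm)]:
    model.fit(X_big, y_big)
    print(f"{name:13s} training accuracy: {model.score(X_big, y_big):.2%}")
```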

Pause & Reflect

You just ran all four algorithms using nearly identical code. Decision Tree and Random Forest required no scaling; KNN and SVM did. What does that tell you about using a Pipeline for every model — even the ones that don't technically need it — as your default habit?


Feature scaling

Some algorithms — KNN and SVMs especially — are sensitive to the scale of your features. If one column ranges from 0–1 and another from 0–10,000, the large-scale column dominates distance calculations. StandardScaler fixes this by transforming each feature to have a mean of 0 and a standard deviation of 1.
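The transformation is simple arithmetic — subtract each column's mean, then divide by its standard deviation. A quick sketch confirming that StandardScaler matches doing it by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature in the 0-1 range, one in the thousands
X_raw = np.array([[0.2, 1000.0],
                  [0.4, 3000.0],
                  [0.9, 5000.0]])

scaled = StandardScaler().fit_transform(X_raw)

# Same result by hand: (x - mean) / std, computed per column
manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print(np.allclose(scaled, manual))   # → True
print(scaled.mean(axis=0))           # each column's mean is now ~0
print(scaled.std(axis=0))            # each column's std is now 1
```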

Which algorithms actually need feature scaling?

Not all algorithms are affected equally, and knowing why helps you avoid unnecessary preprocessing steps.

Algorithms that require scaling are those that use distance metrics or gradient-based optimization: K-Nearest Neighbors, Support Vector Machines, and neural networks all fall into this category. For KNN, the entire prediction mechanism is based on which training samples are closest in feature space — if one feature has much larger numeric values, it will dwarf the others regardless of how informative it actually is. SVMs have the same issue because the margin they optimize is defined in terms of distances.

Algorithms that do not require scaling are tree-based methods: Decision Trees and Random Forests split features one threshold at a time. Whether sepal length is measured in centimeters or meters, the model finds the same split point — it only cares about relative ordering within a single feature, not comparisons across features. Scaling these algorithms does no harm, but it also does nothing useful.
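You can verify that scale invariance directly: rescale every feature by a different factor, retrain, and the tree's predictions do not move. (The factors below are powers of two so the floating-point arithmetic stays exact.)

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Rescale each feature by a different power of two, as if switching units
factors = np.array([4.0, 0.25, 8.0, 2.0])

tree_a = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
tree_b = DecisionTreeClassifier(random_state=42).fit(X_train * factors, y_train)

same = (tree_a.predict(X_test) == tree_b.predict(X_test * factors)).all()
print(f"Identical predictions after rescaling: {same}")
```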

Pro Tip

When in doubt, scale. Including a StandardScaler in your Pipeline costs almost nothing, and it means your code works correctly whether you swap in a distance-sensitive algorithm later. Leaving it out when you switch from Random Forest to KNN is a common source of confusing performance drops.

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
 
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=3))
])
 
pipeline.fit(X_train, y_train)
y_pred_scaled = pipeline.predict(X_test)
 
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with scaling: {accuracy_scaled:.2%}")

The Pipeline object is one of scikit-learn's best features. It chains preprocessing steps and the model together so you can call .fit() and .predict() on the whole chain. It also prevents a subtle bug called data leakage, where information from the test set contaminates the training process if you scale before splitting.

The Pipeline handles the correct order automatically. But the wrong pattern is easy to write and produces no error — it just silently inflates your accuracy. Here is what the bug looks like next to the fix:

The bug — data leakage
The fix — correct manual order
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

scaler = StandardScaler()
# scaler sees ALL rows, including future test rows
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(f"Test accuracy: {knn.score(X_test, y_test):.2%}")
Output — looks correct, but is not
Test accuracy: 97.33%
# No error. No warning. Just a number
# that is quietly too optimistic because
# the scaler learned from test-set rows
# before the split ever happened.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Split FIRST — scaler never sees test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_sc, y_train)
print(f"Test accuracy: {knn.score(X_test_sc, y_test):.2%}")
Output — honest estimate
Test accuracy: 93.33%
# The 4-point gap is the leakage penalty.
# This number is what you would actually
# see on data the model has never touched.
# A Pipeline enforces this order for you.
Warning

Never fit your scaler on the full dataset before splitting. Fit it only on the training set, then apply the same transformation to the test set. Pipelines enforce this automatically, which is the primary reason to use them.
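The same leak appears with cross-validation: scaling before cross_val_score lets the scaler see every fold's held-out rows. Wrapping the scaler in a Pipeline re-fits it inside each fold instead. A sketch of both versions side by side:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Leaky: the scaler has already seen every fold's held-out rows
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_leaky, y, cv=5)

# Leak-free: the Pipeline re-fits the scaler on the training folds of each split
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=5))])
honest = cross_val_score(pipe, X, y, cv=5)

print(f"Leaky  CV accuracy: {leaky.mean():.2%}")
print(f"Honest CV accuracy: {honest.mean():.2%}")
```

On a tiny, well-behaved dataset like Iris the two numbers may be close; the point is that only the second one is trustworthy by construction.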

Python Pop Quiz

A KNN model with n_neighbors=1 scores 100% on the training set and 71% on the test set. A second model with n_neighbors=19 scores 87% on training and 86% on test. Which model would you deploy, and why?

Python Pop Quiz

You are building a scikit-learn Pipeline that scales features and then runs a KNN classifier. Why is it important to put the StandardScaler inside the Pipeline rather than fitting it on the full dataset before splitting?

Working With Your Own Data

The Iris dataset is tidy by design — no missing values, clean column names, balanced classes, and a target column that's already numeric. Real-world CSV files rarely cooperate. This section walks through the three problems that trip up beginners most often: missing values, duplicate rows, and non-numeric columns. Each one is easy to handle once you know it's there.

Step 1: Always inspect before you do anything else

Before you touch X and y, spend thirty seconds looking at the raw data. These four lines reveal most issues immediately:

import pandas as pd

df = pd.read_csv('your_data.csv')

print(df.shape)          # rows, columns — does the count look right?
print(df.dtypes)         # are numeric columns actually numeric?
print(df.isnull().sum()) # how many nulls per column?
print(df.duplicated().sum())  # any identical rows?

The output from those four lines tells you exactly which of the three problems below you need to handle. Work through them in order.

Problem 1: Missing values (nulls)

A null is a cell with no value — shown as NaN in pandas. scikit-learn will raise a ValueError the moment it encounters one, so you must deal with nulls before you call fit(). You have two options: drop the rows that contain them, or fill them in with a reasonable substitute.

from sklearn.impute import SimpleImputer
import numpy as np

# Option A: drop any row that contains at least one null
df_clean = df.dropna()
print(f"Rows remaining: {len(df_clean)} of {len(df)}")

# Option B: fill nulls with the column median (safer than mean for skewed data)
# Use this inside a Pipeline so the fill values are learned only from training data
imputer = SimpleImputer(strategy='median')
# imputer.fit_transform(X_train) — shown in the full pipeline example below

# Quick rule of thumb:
# — fewer than ~5% of rows affected? dropna() is fine.
# — more than that? Dropping loses too much data; use SimpleImputer instead.
Impute inside the Pipeline, not before it

If you use SimpleImputer, add it as the first step in your Pipeline rather than calling it on the full dataset before splitting. Fitting the imputer on the full dataset leaks information from the test set into the training process — the same data leakage problem described in Section 6. Inside a Pipeline, the imputer learns fill values only from X_train, which is correct.

Problem 2: Duplicate rows

Duplicate rows are easy to miss and easy to fix. The risk is subtle: if the same row appears in both your training set and your test set after splitting, your test accuracy will be artificially inflated — the model has already seen that exact example during training.

# Check how many duplicate rows exist
print(f"Duplicate rows: {df.duplicated().sum()}")

# Remove them — keep='first' keeps one copy of each duplicated row
df = df.drop_duplicates(keep='first')
print(f"Rows after deduplication: {len(df)}")

# Always deduplicate before splitting, not after.
# If you split first and then deduplicate each half separately,
# you can still end up with the same underlying record in both sets.

Problem 3: Non-numeric columns

scikit-learn estimators expect all-numeric input. If your DataFrame still contains string columns when you call fit(), you will get a ValueError. There are two places this shows up: in your feature columns and in your label column.

from sklearn.preprocessing import LabelEncoder

# --- Handling text in FEATURE columns ---

# pd.get_dummies() converts each category to its own 0/1 column
# Example: a 'color' column with values 'red', 'blue', 'green'
# becomes three columns: color_red, color_blue, color_green
X = pd.get_dummies(X)

# Run this after get_dummies to confirm no object-type columns remain
print(X.dtypes)  # everything should show int64 or float64

# --- Handling text in the LABEL column ---

# LabelEncoder converts string class names to integers
# 'cat' → 0, 'dog' → 1  (alphabetical order by default)
le = LabelEncoder()
y = le.fit_transform(y)

# To see the mapping:
print(dict(zip(le.classes_, le.transform(le.classes_))))
get_dummies() vs OneHotEncoder

pd.get_dummies() is convenient for a quick start, but it runs before your Pipeline and can cause column mismatch errors when the test set is missing a category that appeared in training. scikit-learn's OneHotEncoder (used inside a ColumnTransformer inside a Pipeline) handles this correctly by learning the full set of categories from training data. For tutorials and prototyping, get_dummies() is fine; for production code, use ColumnTransformer.
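For reference, here is a minimal sketch of that production pattern — OneHotEncoder for the text columns and StandardScaler for the numeric ones, routed by a ColumnTransformer inside a Pipeline. The toy DataFrame and column names are purely illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: one categorical feature, two numeric (all values illustrative)
df = pd.DataFrame({
    'color':  ['red', 'blue', 'green', 'red', 'blue', 'green'],
    'width':  [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
    'height': [10.0, 20.0, 30.0, 12.0, 22.0, 32.0],
})
y = [0, 1, 1, 0, 1, 1]

preprocess = ColumnTransformer([
    # handle_unknown='ignore' zeroes out categories never seen in training
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['color']),
    ('num', StandardScaler(), ['width', 'height']),
])

model = Pipeline([('prep', preprocess),
                  ('knn', KNeighborsClassifier(n_neighbors=3))])
model.fit(df, y)
print(model.predict(df))
```

Because the encoder learns its category list from the training folds only, a test row with an unseen category gets all-zero dummy columns instead of crashing the prediction.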

Putting it all together

Here is a complete preprocessing template that handles all three problems and then feeds into the same Pipeline pattern from earlier sections. Replace 'target_column' with the name of your label column.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

# ── 1. Load and inspect ──────────────────────────────────────────────────────
df = pd.read_csv('your_data.csv')
print(f"Shape:      {df.shape}")
print(f"Nulls:\n{df.isnull().sum()}")
print(f"Duplicates: {df.duplicated().sum()}")
print(f"Types:\n{df.dtypes}")

# ── 2. Remove duplicates ─────────────────────────────────────────────────────
df = df.drop_duplicates(keep='first')

# ── 3. Separate features and label ───────────────────────────────────────────
X = df.drop(columns=['target_column'])
y = df['target_column']

# ── 4. Encode text label (skip if y is already numeric) ──────────────────────
le = LabelEncoder()
y = le.fit_transform(y)

# ── 5. Encode categorical feature columns ────────────────────────────────────
X = pd.get_dummies(X)

# ── 6. Split BEFORE any fitting ──────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ── 7. Pipeline: impute → scale → model ──────────────────────────────────────
# SimpleImputer fills nulls; it's fit only on X_train (no leakage)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
    ('knn',     KNeighborsClassifier(n_neighbors=5))
])

pipeline.fit(X_train, y_train)
print(f"\nTest accuracy: {pipeline.score(X_test, y_test):.2%}")
Watch for mixed-type columns before get_dummies()

A column that is mostly numeric but contains a few stray string values like "N/A" or "unknown" is stored as object dtype, and pd.get_dummies() will treat it as categorical — exploding it into one 0/1 column per distinct value instead of keeping it numeric. Run print(X.dtypes) before encoding. Any such column needs to be handled first: either coerce it to numeric with pd.to_numeric(col, errors='coerce') (which converts unparseable values to NaN for the imputer to handle), or drop it if it carries no predictive value.
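The coercion step looks like this — the stray strings become NaN, which the imputer in your Pipeline then fills (the 'age' column is a made-up example):

```python
import pandas as pd

# A mostly-numeric column polluted by a few strings (made-up example)
df = pd.DataFrame({'age': ['34', '51', 'N/A', '29', 'unknown']})
print(df['age'].dtype)  # → object

# Unparseable values become NaN instead of raising an error
df['age'] = pd.to_numeric(df['age'], errors='coerce')

print(df['age'].dtype)           # → float64
print(df['age'].isnull().sum())  # → 2 — ready for SimpleImputer
```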

Key Takeaways

  1. The ML workflow is consistent: Load data, split into train/test, instantiate a model, fit on training data, predict on test data, and evaluate. This same loop applies to every supervised learning problem.
  2. Always split your data before anything else: The test set must remain unseen during training and preprocessing. Use train_test_split with stratify=y as your default, especially on imbalanced datasets.
  3. Accuracy alone is not enough: Check precision, recall, and F1 score. Which metric to prioritize depends on whether false positives or false negatives are more costly in your specific use case.
  4. Tune hyperparameters with cross-validation: K in KNN is just one example. Loop over candidate values, compare CV scores, and let the data pick — don't guess.
  5. GridSearchCV automates the search: Rather than writing a manual loop, hand GridSearchCV a pipeline and a param_grid dictionary. It runs every combination, picks the best, and refits the winner — all in a few lines. Always call grid.fit(X_train, y_train), never on the full dataset.
  6. Cross-validation beats a single split: Use cross_val_score with cv=5 or cv=10 to get a reliable estimate of how your model will perform on new data.
  7. Pipelines prevent data leakage: Chain your preprocessing and model together using Pipeline. It's not just convenient — it's correct. Scale inside the Pipeline, not before splitting.
  8. Know which algorithms need scaling: Distance-based methods (KNN, SVM) require it. Tree-based methods (Decision Tree, Random Forest) do not. When in doubt, include a StandardScaler — it won't hurt.
  9. No single algorithm wins everywhere: Benchmark several options, check precision and recall (not just accuracy), and choose based on your specific data and requirements.

From here, natural next steps include regression problems with LinearRegression, handling more complex real-world data issues like missing values with SimpleImputer and categorical features with OrdinalEncoder or OneHotEncoder, and scaling up hyperparameter searches with RandomizedSearchCV when the parameter space becomes too large to evaluate exhaustively. The patterns you practiced here carry over directly — the scikit-learn API stays the same regardless of algorithm or task type.
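As a parting sketch, RandomizedSearchCV uses the same fit/score interface as GridSearchCV but samples a fixed number of parameter combinations instead of trying them all — the distributions below are illustrative:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# n_iter=10 samples ten combinations at random rather than the full grid
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        'n_estimators': randint(50, 300),   # any integer in [50, 300)
        'max_depth':    [None, 3, 5, 10],
    },
    n_iter=10, cv=5, random_state=42, n_jobs=-1,
)
search.fit(X, y)
print(f"Best params:   {search.best_params_}")
print(f"Best CV score: {search.best_score_:.2%}")
```

Because distributions can stand in for explicit lists, the same search budget covers a much wider parameter space than an exhaustive grid of the same cost.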