K-Nearest Neighbors (KNN) in Python: A Complete Guide with scikit-learn

K-Nearest Neighbors is one of the simplest and most intuitive machine learning algorithms you can learn. It works by looking at the data points closest to a new observation and using their labels to make a prediction. In this guide, you will build KNN models for both classification and regression using Python and scikit-learn.

Picture yourself in a neighborhood you have never visited before. You want to know whether a particular house is expensive or affordable. The easiest approach is to look at the houses closest to it. If the five nearest houses all sold for high prices, you can reasonably guess this house is expensive too. That is the core idea behind K-Nearest Neighbors. The algorithm stores the entire training dataset and, when asked to classify or predict a new data point, it finds the K closest neighbors and uses their known values to produce an answer.

What Is KNN and How Does It Work?

KNN is a supervised machine learning algorithm, which means it learns from labeled data. Unlike algorithms such as linear regression or decision trees, KNN does not build an internal model during training. Instead, it memorizes the training data and performs all computation at prediction time. This is why KNN is called a lazy learner or an instance-based learner.

When a new data point arrives, KNN follows a straightforward process. First, it calculates the distance between the new point and every point in the training set. Second, it identifies the K nearest points based on that distance. Third, for classification, it takes a majority vote among those K neighbors. For regression, it averages their values. The result becomes the prediction.
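The three steps above can be sketched in a few lines of NumPy. This is a toy illustration of the idea, not scikit-learn's implementation, and the sample points and labels are invented for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Two small made-up clusters, labeled 0 and 1
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([1.2, 1.4]), k=3))  # near the first cluster -> 0
print(knn_predict(X, y, np.array([8.7, 8.6]), k=3))  # near the second cluster -> 1
```

Swapping the majority vote for a mean of `y_train[nearest]` turns the same sketch into a regressor, which is exactly the relationship between KNeighborsClassifier and KNeighborsRegressor.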

Note

The "K" in KNN refers to the number of neighbors the algorithm considers. It is a hyperparameter you choose before training, not something the algorithm learns on its own. A small K (like 1 or 3) makes the model sensitive to noise, while a large K produces smoother boundaries but may miss local patterns.

Setting Up Your Environment

This tutorial works with any recent release of scikit-learn. You will also need NumPy, pandas, and matplotlib. If you do not already have these installed, you can set everything up with pip.

pip install scikit-learn numpy pandas matplotlib

Once the packages are installed, import the modules you will use throughout this guide.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    mean_squared_error,
    r2_score,
)
from sklearn.datasets import load_iris, make_regression
from sklearn.pipeline import Pipeline

KNN for Classification

The Iris dataset is a classic starting point for classification tasks. It contains 150 samples of iris flowers divided into three species, with four measured features per sample: sepal length, sepal width, petal length, and petal width. Here is how to load the data, split it into training and test sets, and train a KNN classifier.

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Create and train the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

The stratify=y parameter ensures that each class is proportionally represented in both the training and test sets. This is important when dealing with imbalanced datasets or small sample sizes. The n_neighbors=5 argument tells the classifier to look at the five closest data points when making a prediction.

After running this code, you should see accuracy above 95% on the Iris dataset. The classification report provides precision, recall, and F1-score for each species, giving you a more detailed picture of the model's performance.

Visualizing the Confusion Matrix

A confusion matrix shows exactly where the model gets predictions right and where it makes mistakes. Each row represents the actual class, and each column represents the predicted class.

# Build the confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(7, 5))
im = ax.imshow(cm, interpolation="nearest", cmap=plt.cm.Blues)
ax.set(
    xticks=np.arange(cm.shape[1]),
    yticks=np.arange(cm.shape[0]),
    xticklabels=iris.target_names,
    yticklabels=iris.target_names,
    ylabel="Actual",
    xlabel="Predicted",
    title="KNN Confusion Matrix",
)

# Annotate each cell with the count
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center",
                color="white" if cm[i, j] > cm.max() / 2 else "black")

plt.tight_layout()
plt.savefig("knn_confusion_matrix.png", dpi=150)
plt.show()

Why Feature Scaling Matters

KNN relies on distance calculations between data points. If one feature ranges from 0 to 1 and another ranges from 0 to 10,000, the larger feature will dominate the distance computation. The algorithm would effectively ignore the smaller feature, leading to biased and inaccurate predictions.

Consider a dataset with two features: a customer's account age in months (ranging from 1 to 72) and their total spending in dollars (ranging from 20 to 8,000). A one-unit change in spending is a single dollar, while a one-unit change in account age is an entire month. Without scaling, the spending feature would overpower the account age feature in every distance calculation.
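You can see this domination directly with a few hypothetical customers (the numbers here are invented for illustration). Before scaling, a customer with a nearly identical account age looks enormously far away just because of a spending gap; after scaling, both neighbors end up at comparable distances.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [account age in months, total spending in dollars]
customers = np.array([
    [12.0, 5000.0],   # reference customer
    [70.0, 5100.0],   # very different age, similar spending
    [13.0, 2000.0],   # almost the same age, very different spending
])

# Raw Euclidean distances from the first customer:
# the $3,000 spending gap dwarfs the 58-month age gap
raw = np.linalg.norm(customers - customers[0], axis=1)
print(raw)

# Distances after standardizing each column:
# both neighbors are now comparably far away
scaled = StandardScaler().fit_transform(customers)
scaled_d = np.linalg.norm(scaled - scaled[0], axis=1)
print(scaled_d)
```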

Warning

Skipping feature scaling is one of the leading causes of poor KNN performance. Always scale your features before training a KNN model. This applies to both classification and regression.

StandardScaler from scikit-learn standardizes each feature by subtracting the mean and dividing by the standard deviation. After scaling, every feature has a mean of zero and a standard deviation of one, which means they contribute equally to distance calculations.

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN on scaled data
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)

# Evaluate
y_pred_scaled = knn_scaled.predict(X_test_scaled)
print(f"Accuracy with scaling: {accuracy_score(y_test, y_pred_scaled):.4f}")

Pro Tip

Always call fit_transform() on the training data and transform() on the test data. Never fit the scaler on the test data. If you do, you introduce information leakage, and your model evaluation becomes unreliable.

Finding the Best K Value

Choosing the right value of K is critical. A K of 1 makes the model memorize every training point, leading to overfitting. A very large K smooths out all variation and can cause underfitting. The goal is to find a value that balances complexity and generalization.

The Elbow Method

One approach is to train models with a range of K values and plot the error rate for each. The point where the error stops decreasing significantly is called the "elbow," and the K value at that point is often a good choice.

error_rates = []
k_range = range(1, 31)

for k in k_range:
    knn_temp = KNeighborsClassifier(n_neighbors=k)
    knn_temp.fit(X_train_scaled, y_train)
    y_pred_temp = knn_temp.predict(X_test_scaled)
    error_rates.append(np.mean(y_pred_temp != y_test))

plt.figure(figsize=(10, 6))
plt.plot(k_range, error_rates, marker="o", linewidth=2, markersize=5)
plt.title("Error Rate vs. K Value")
plt.xlabel("K")
plt.ylabel("Error Rate")
plt.xticks(k_range)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("knn_elbow_plot.png", dpi=150)
plt.show()

Cross-Validation with GridSearchCV

The elbow method gives a quick visual, but a more rigorous approach uses cross-validation. GridSearchCV tests every K value you specify, evaluates each one using multiple train-test splits, and identifies the combination that performs best on average.

# Define the parameter grid
param_grid = {"n_neighbors": range(1, 31)}

# Set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)
grid_search.fit(X_train_scaled, y_train)

# Display the best parameters
print(f"Best K: {grid_search.best_params_['n_neighbors']}")
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

# Evaluate on the test set using the best model
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test_scaled)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")

GridSearchCV with cv=5 splits the training data into five folds. For each K value, it trains on four folds, evaluates on the held-out fifth, and rotates until every fold has served as the validation set once. The K with the highest average score across all folds is selected as the best choice.
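The fold rotation happens inside GridSearchCV, but you can make it visible with a minimal sketch using KFold on ten dummy samples. (For classifiers, GridSearchCV actually uses stratified folds, but the rotation idea is the same.)

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples, just to watch the folds rotate
X = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5)

folds = list(kf.split(X))
for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold}: train on {train_idx}, validate on {val_idx}")
```

Each sample appears in exactly one validation fold, so every data point contributes to both training and evaluation across the five rounds.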

KNN for Regression

KNN is not limited to classification. When used for regression, the algorithm finds the K nearest neighbors and returns the average (or weighted average) of their target values instead of taking a majority vote.

# Generate a synthetic regression dataset
X_reg, y_reg = make_regression(
    n_samples=500,
    n_features=5,
    noise=25,
    random_state=42,
)

# Split and scale
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)
scaler_r = StandardScaler()
X_train_r_scaled = scaler_r.fit_transform(X_train_r)
X_test_r_scaled = scaler_r.transform(X_test_r)

# Train the KNN regressor
knn_reg = KNeighborsRegressor(n_neighbors=7)
knn_reg.fit(X_train_r_scaled, y_train_r)

# Predict and evaluate
y_pred_r = knn_reg.predict(X_test_r_scaled)
print(f"Mean Squared Error: {mean_squared_error(y_test_r, y_pred_r):.2f}")
print(f"R-squared: {r2_score(y_test_r, y_pred_r):.4f}")

The R-squared value tells you how much of the variance in the target variable your model explains. A value of 1.0 means perfect prediction, while 0.0 means the model does no better than simply predicting the mean.
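The R-squared definition behind that interpretation is one minus the ratio of residual error to total variance. A tiny made-up example (the actual and predicted values below are arbitrary) shows the manual formula matching scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Small made-up example: actual vs. predicted target values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.5, 8.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y_true, y_pred))  # both 0.9125
```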

Distance Metrics Explained

By default, scikit-learn's KNN implementation uses Euclidean distance, which is the straight-line distance between two points. However, this is not the only option. The metric parameter lets you choose from several alternatives.

Euclidean distance is the standard choice and works well when features are continuous and scaled to similar ranges. It is the square root of the sum of squared differences between corresponding feature values.

Manhattan distance (also called city-block distance) sums the absolute differences between features. It can perform better than Euclidean distance when your data has many dimensions or when features represent different types of measurements.

Minkowski distance is a generalization that includes both Euclidean (when p=2) and Manhattan (when p=1) as special cases. You can tune the p parameter to find the right balance for your data.
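That special-case relationship is easy to verify with SciPy's distance functions (the two vectors below are arbitrary examples): Minkowski with p=2 reproduces Euclidean distance, and p=1 reproduces Manhattan distance.

```python
from scipy.spatial.distance import cityblock, euclidean, minkowski

u = [1.0, 2.0, 3.0]
v = [4.0, 0.0, 3.0]

# Minkowski with p=2 is Euclidean: sqrt(3^2 + 2^2 + 0^2)
print(euclidean(u, v), minkowski(u, v, p=2))

# Minkowski with p=1 is Manhattan (city-block): 3 + 2 + 0 = 5
print(cityblock(u, v), minkowski(u, v, p=1))
```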

# Compare different distance metrics
metrics = ["euclidean", "manhattan", "minkowski"]

for metric in metrics:
    knn_metric = KNeighborsClassifier(n_neighbors=5, metric=metric)
    scores = cross_val_score(knn_metric, X_train_scaled, y_train, cv=5)
    print(f"{metric:12s} | Mean Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

Weighted Voting with KNN

Standard KNN treats all K neighbors equally, regardless of how far they are from the query point. A neighbor that is barely within range gets the same vote as the closest neighbor. The weights parameter changes this behavior.

Setting weights="distance" gives closer neighbors a stronger vote. Specifically, each neighbor's contribution is weighted by the inverse of its distance. A neighbor at distance 0.5 has twice the influence of a neighbor at distance 1.0.
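The arithmetic of that inverse-distance vote can be sketched with three hypothetical neighbors; the distances and labels below are made up to show how the weighting can flip the outcome of the vote.

```python
import numpy as np

# Hypothetical query result: three neighbors with their distances and labels
distances = np.array([0.4, 1.0, 1.0])
labels = np.array([1, 0, 0])

# Uniform voting: every neighbor counts once, so class 0 wins 2 to 1
uniform_votes = np.bincount(labels)
print(uniform_votes)

# Inverse-distance weighting: the neighbor at distance 0.4 carries weight 2.5,
# so class 1 now wins 2.5 to 2.0
weights = 1.0 / distances
weighted_votes = np.bincount(labels, weights=weights)
print(weighted_votes)
```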

# Uniform vs. distance-weighted voting
knn_uniform = KNeighborsClassifier(n_neighbors=7, weights="uniform")
knn_distance = KNeighborsClassifier(n_neighbors=7, weights="distance")

scores_uniform = cross_val_score(knn_uniform, X_train_scaled, y_train, cv=5)
scores_distance = cross_val_score(knn_distance, X_train_scaled, y_train, cv=5)

print(f"Uniform weights:  {scores_uniform.mean():.4f} (+/- {scores_uniform.std():.4f})")
print(f"Distance weights: {scores_distance.mean():.4f} (+/- {scores_distance.std():.4f})")

Distance-weighted voting often improves accuracy when the data has overlapping class boundaries, because it reduces the influence of neighbors that happen to be nearby but belong to a different class.

Building a Complete Pipeline

In a production setting, you want to combine preprocessing and modeling into a single pipeline. This prevents data leakage and makes your workflow easier to reproduce. Scikit-learn's Pipeline class handles this cleanly.

# Build a pipeline: scaling + KNN
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Define the hyperparameter search space
param_grid = {
    "knn__n_neighbors": range(1, 21),
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

# Run GridSearchCV on the pipeline
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy:    {grid.best_estimator_.score(X_test, y_test):.4f}")

Pro Tip

When you use a pipeline with GridSearchCV, the scaler is fit only on the training folds during each cross-validation split. This is the correct way to prevent data leakage. If you scale the entire dataset before splitting, your test scores may be misleadingly high.

The n_jobs=-1 parameter tells GridSearchCV to use all available CPU cores, which can dramatically speed up the search when you have a large parameter grid.

When to Use KNN and When to Avoid It

KNN shines in certain situations and struggles in others. Understanding these trade-offs will help you decide whether it is the right algorithm for your problem.

KNN works well when:

  1. Your dataset is small to medium-sized (tens of thousands of samples or fewer).
  2. The relationship between features and the target is nonlinear.
  3. You need a quick baseline model with minimal tuning.
  4. The decision boundary between classes is irregular.
  5. You need both classification and regression capabilities from the same algorithm family.

KNN struggles when:

  1. The dataset is very large, because KNN must store all training data and compute distances at prediction time.
  2. The feature space is high-dimensional, a phenomenon known as the "curse of dimensionality" in which distances become less meaningful.
  3. Features have very different scales and are not properly normalized.
  4. The data contains many irrelevant features that add noise to distance calculations.
  5. You need fast real-time predictions on large datasets.
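The curse of dimensionality is easy to observe. For random points in a unit hypercube, the ratio between the nearest and farthest neighbor distances from a query point creeps toward 1 as the dimension grows, meaning "nearest" stops being meaningfully different from "farthest."

```python
import numpy as np

rng = np.random.default_rng(42)
ratios = {}

# Compare nearest vs. farthest distance from one query point
# among 1,000 random points, as dimensionality increases
for dim in (2, 10, 100, 1000):
    X = rng.random((1000, dim))
    query = rng.random(dim)
    d = np.linalg.norm(X - query, axis=1)
    ratios[dim] = d.min() / d.max()
    print(f"{dim:4d} dimensions: nearest/farthest distance ratio = {ratios[dim]:.3f}")
```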

Note

For large datasets, consider scikit-learn's tree-based neighbor searches instead of the default brute-force approach by setting algorithm="ball_tree" or algorithm="kd_tree". These data structures speed up neighbor lookups significantly. You can also let scikit-learn decide automatically with algorithm="auto", which is the default behavior.
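A quick sanity check on the Iris data confirms that the choice of search structure changes only how neighbors are found, not which distances come back.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X, y = load_iris(return_X_y=True)

# Find the 5 nearest neighbors of every sample with each search strategy
dists = {}
for algo in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    dists[algo], _ = nn.kneighbors(X)

# The neighbor distances agree across all three strategies
print(np.allclose(dists["brute"], dists["kd_tree"]))
print(np.allclose(dists["brute"], dists["ball_tree"]))
```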

Key Takeaways

  1. KNN is intuitive and versatile: It works for both classification (majority vote) and regression (averaging neighbor values) with the same underlying logic of finding the nearest data points.
  2. Feature scaling is non-negotiable: Because KNN relies entirely on distance calculations, unscaled features with larger ranges will dominate predictions and produce unreliable results.
  3. The choice of K defines model complexity: Small K values create complex, noise-sensitive boundaries, while large K values produce simpler, smoother boundaries. Use cross-validation to find the right balance.
  4. Distance metrics and weighting matter: Euclidean distance is the default, but Manhattan distance or distance-weighted voting can improve performance depending on your data's characteristics.
  5. Pipelines prevent data leakage: Combining scaling and modeling in a scikit-learn Pipeline ensures that preprocessing is applied correctly during cross-validation and avoids inflated test scores.
  6. KNN has scalability limits: It stores the entire training set and computes distances at prediction time. For very large datasets or high-dimensional feature spaces, tree-based methods or other algorithms may be more practical.

KNN is an excellent starting point for anyone learning machine learning. It requires no assumptions about the underlying data distribution, its behavior is easy to explain, and it gives you hands-on experience with essential concepts like feature scaling, cross-validation, and hyperparameter tuning. Once you are comfortable with the fundamentals covered in this guide, you can explore more advanced techniques like Neighborhood Components Analysis (NCA) for learning optimized distance metrics, or apply KNN as part of ensemble methods for improved accuracy.
