Python Classification Models: A Practical Guide with Scikit-Learn

Classification is one of the foundational tasks in machine learning, where the goal is to predict which category a data point belongs to based on its features. Whether you are filtering spam from legitimate email, diagnosing a medical condition from patient data, or detecting fraudulent transactions in a banking system, classification models are doing the heavy lifting behind the scenes. Python's scikit-learn library makes building these models remarkably accessible, and this guide walks through the major classification algorithms with code you can run immediately.

Scikit-learn (imported as sklearn) is an open-source machine learning library built on top of NumPy, SciPy, and Matplotlib. It provides a consistent API across all of its models—you call .fit() to train, .predict() to generate predictions, and .score() to evaluate—which means once you learn the pattern for one classifier, switching to another takes only a few lines of changed code. The current stable release is version 1.8, which shipped in December 2025 with native Array API support for GPU computation through PyTorch and CuPy arrays, temperature scaling for probability calibration, and efficiency improvements across linear models.

This article covers seven widely used classification algorithms, each demonstrated with working Python code. By the end, you will understand when to reach for each model and how to measure its performance using the right metrics.

What Is Classification?

Classification is a type of supervised learning. You provide the algorithm with a dataset where each sample has a set of input features and a known label (the target class). The model learns patterns in the features that correspond to each label, then uses those patterns to assign labels to new, unseen data.

There are two main flavors. Binary classification involves exactly two classes, such as "spam" or "not spam." Multiclass classification involves three or more classes, such as classifying an image as a cat, dog, or bird. Some problems extend further into multilabel classification, where a single sample can belong to multiple classes simultaneously—think of tagging a news article with "politics," "economy," and "international" all at once.
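To make the multilabel case concrete, scikit-learn's MultiLabelBinarizer converts per-sample tag lists into the binary indicator matrix that multilabel classifiers expect. A minimal sketch with hypothetical article tags:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag lists: each article can carry several labels at once
tags = [
    ["politics", "economy"],
    ["economy"],
    ["politics", "international"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # one column per tag, one row per article

print(mlb.classes_)  # ['economy' 'international' 'politics']
print(Y)
```

Each row of Y is a binary vector, so a single sample can switch on several columns at once, which is exactly what distinguishes multilabel from multiclass problems.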

Note

Classification differs from regression. In regression, you predict a continuous numeric value (like a house price). In classification, you predict a discrete category (like whether a house will sell above or below asking price). Choosing the wrong problem type leads to models that produce meaningless output.

Setting Up Your Data

Before training any classifier, you need to split your data into a training set and a test set. The training set teaches the model; the test set evaluates how well it generalizes to data it has never seen. Scikit-learn's train_test_split function handles this in a single call.

The following code loads the classic Iris dataset—150 samples of iris flowers classified into three species based on four measurements—and splits it 70/30 for training and testing:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X = iris.data      # Feature matrix (150 samples, 4 features)
y = iris.target    # Target labels (0, 1, or 2)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

Setting random_state=42 ensures the split is reproducible. Every time you run this code, you get the same training and test sets, which is essential for comparing different models fairly.

Pro Tip

For imbalanced datasets where one class vastly outnumbers others, use stratify=y in train_test_split. This preserves the original class distribution in both the training and test splits, preventing situations where your test set might be missing entire classes.
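As a quick illustration of the tip above, here is a stratified split of the Iris dataset. Because each species has exactly 50 samples, a 70/30 stratified split yields 35 training and 15 test samples per class:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=y keeps the class proportions identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

print(np.bincount(y_train))  # [35 35 35]
print(np.bincount(y_test))   # [15 15 15]
```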

Logistic Regression

Despite the word "regression" in its name, logistic regression is a classification algorithm. It models the probability that a given sample belongs to a particular class by applying a sigmoid function to a linear combination of the input features. When the estimated probability exceeds a threshold (typically 0.5), the model assigns the positive class; otherwise, it assigns the negative class.

For multiclass problems, scikit-learn extends logistic regression using either a one-vs-rest (OvR) scheme or a multinomial approach. The multinomial approach optimizes across all classes simultaneously and tends to perform better when classes are mutually exclusive.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create and train the model
log_reg = LogisticRegression(max_iter=200, random_state=42)
log_reg.fit(X_train, y_train)

# Make predictions
y_pred = log_reg.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy:.4f}")

Logistic regression works well when the relationship between features and class membership is approximately linear. It trains quickly, produces interpretable coefficients, and serves as an excellent baseline before trying more complex models. Its main limitation is that it struggles with highly nonlinear decision boundaries unless you engineer polynomial or interaction features beforehand.
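To inspect the probabilities behind those predictions, call predict_proba. A small sketch, fitting on the full Iris data for brevity:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=200).fit(iris.data, iris.target)

# One probability per class; predict() returns the class with the largest one
probs = clf.predict_proba(iris.data[:1])
print(probs.round(3))
print(clf.predict(iris.data[:1]))  # [0]
```

The probabilities in each row sum to 1, and the predicted label is simply the column with the highest value, which is how the threshold logic generalizes to the multiclass case.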

Decision Trees

A decision tree classifier splits the dataset into smaller subsets by asking a series of yes/no questions about the features. At each internal node, the algorithm selects the feature and threshold that best separates the classes, measured by criteria like Gini impurity or information gain (entropy). The process continues recursively until the tree reaches a stopping condition, such as a maximum depth or a minimum number of samples per leaf.

from sklearn.tree import DecisionTreeClassifier

# Create and train the model
tree_clf = DecisionTreeClassifier(
    max_depth=4,
    random_state=42
)
tree_clf.fit(X_train, y_train)

# Make predictions
y_pred_tree = tree_clf.predict(X_test)

# Evaluate
accuracy_tree = accuracy_score(y_test, y_pred_tree)
print(f"Decision Tree Accuracy: {accuracy_tree:.4f}")

# View feature importance
for name, importance in zip(iris.feature_names, tree_clf.feature_importances_):
    print(f"  {name}: {importance:.4f}")

Decision trees are easy to visualize and interpret. You can trace a prediction from the root to a leaf and understand exactly why the model made that choice. However, they are prone to overfitting—a deep tree memorizes the training data instead of learning generalizable patterns. Setting max_depth, min_samples_split, or min_samples_leaf helps control this tendency.
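You can also print the learned splits directly with export_text, which makes the root-to-leaf reasoning explicit. A short sketch using a deliberately shallow tree on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Dump the learned splits as readable if/else rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```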

Random Forest

A random forest is an ensemble of decision trees. During training, each tree is built on a random bootstrap sample of the data, and at each split, only a random subset of features is considered. These two sources of randomness decorrelate the trees, reducing overfitting and producing a model that is more robust than any individual tree. The final prediction is determined by majority vote across all trees.

from sklearn.ensemble import RandomForestClassifier

# Create and train the model
rf_clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42
)
rf_clf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_clf.predict(X_test)

# Evaluate
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")

# Feature importance
for name, importance in zip(iris.feature_names, rf_clf.feature_importances_):
    print(f"  {name}: {importance:.4f}")

Random forests are among the go-to algorithms in practice because they handle nonlinear relationships, tolerate noisy data, and rarely need extensive hyperparameter tuning to produce strong results. The trade-off is reduced interpretability compared to a single decision tree—you cannot easily trace how 100 trees collectively arrived at a decision.

Note

The n_estimators parameter controls how many trees the forest contains. More trees generally improve performance up to a point, after which gains flatten while training time continues to increase. For many problems, 100 to 500 trees strikes a good balance.

Support Vector Machines

Support vector machines (SVMs) work by finding the hyperplane that maximizes the margin between classes. The "support vectors" are the data points closest to this decision boundary—they are the critical samples that define where the boundary sits. For data that is not linearly separable, SVMs use a kernel trick to project the features into a higher-dimensional space where a linear separator exists.

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# SVMs perform better with scaled features
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42))
])

# Train the pipeline
svm_pipeline.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm_pipeline.predict(X_test)

# Evaluate
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f"SVM Accuracy: {accuracy_svm:.4f}")

Notice the use of a Pipeline here. SVMs are sensitive to feature scale—if one feature ranges from 0 to 1000 while another ranges from 0 to 1, the larger feature dominates the distance calculations. The pipeline ensures that StandardScaler is applied to the training data during fitting and to the test data during prediction, preventing data leakage.

Common kernels include 'linear' for linearly separable data, 'rbf' (radial basis function) for nonlinear boundaries, and 'poly' for polynomial decision boundaries. The C parameter controls the trade-off between a smooth decision boundary and correctly classifying training points.
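One quick way to see how kernel choice matters is to cross-validate each kernel on the same data. A sketch on Iris, keeping the scaler in a pipeline as before:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()

# Compare mean cross-validated accuracy for each kernel
results = {}
for kernel in ["linear", "rbf", "poly"]:
    pipe = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    results[kernel] = cross_val_score(pipe, iris.data, iris.target, cv=5).mean()
    print(f"{kernel:>6}: {results[kernel]:.3f}")
```

On a dataset this easy the kernels score similarly; on genuinely nonlinear data the gap between 'linear' and 'rbf' is usually much larger.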

K-Nearest Neighbors

K-nearest neighbors (KNN) is one of the simplest classification algorithms. It stores the entire training set and makes predictions by finding the k training samples closest to a new data point, then assigning the majority class among those neighbors. There is no explicit training phase—all computation happens at prediction time.

from sklearn.neighbors import KNeighborsClassifier

# Create and train the model
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

# Make predictions
y_pred_knn = knn_clf.predict(X_test)

# Evaluate
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {accuracy_knn:.4f}")

The value of k has a significant impact on the model. A small k (like 1) makes the model highly sensitive to noise—a single mislabeled neighbor can flip the prediction. A large k smooths the decision boundary but may blur the lines between genuinely distinct classes. Odd values of k are preferred in binary classification to avoid ties.
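A common way to pick k is to cross-validate several candidate values and compare mean accuracy. A sketch on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Evaluate a handful of candidate k values with 5-fold cross-validation
results = {}
for k in [1, 3, 5, 7, 9, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    results[k] = cross_val_score(knn, iris.data, iris.target, cv=5).mean()
    print(f"k={k:<2} mean accuracy: {results[k]:.3f}")
```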

Warning

KNN becomes extremely slow on large datasets because it must compute the distance from the new point to every training sample. If your dataset has hundreds of thousands of rows, consider approximate nearest neighbor algorithms or a different classifier entirely.

Naive Bayes

Naive Bayes classifiers apply Bayes' theorem with the simplifying assumption that all features are conditionally independent given the class label. Despite this "naive" assumption rarely holding true in practice, the algorithm performs surprisingly well in many real-world scenarios, particularly text classification tasks like spam filtering and sentiment analysis.

from sklearn.naive_bayes import GaussianNB

# Create and train the model
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)

# Make predictions
y_pred_nb = nb_clf.predict(X_test)

# Evaluate
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f"Naive Bayes Accuracy: {accuracy_nb:.4f}")

# View predicted probabilities for the first 3 samples
probabilities = nb_clf.predict_proba(X_test[:3])
for i, prob in enumerate(probabilities):
    print(f"  Sample {i}: {prob}")

Scikit-learn provides several naive Bayes variants. GaussianNB assumes features follow a normal distribution and works well with continuous data. MultinomialNB is designed for count-based data like word frequencies in text. BernoulliNB is suited for binary feature vectors. Choosing the right variant depends on the nature of your features.

The algorithm's main advantages are speed and simplicity. It trains in a single pass through the data, requires very little memory, and naturally produces probability estimates. Its weakness is that the independence assumption can lead to poorly calibrated probabilities when features are correlated.
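For a taste of the text-classification use case mentioned above, MultinomialNB pairs naturally with CountVectorizer. A toy sketch on a handful of invented spam and ham messages (the corpus here is purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for illustration only
texts = [
    "win cash prize now", "free prize claim now", "limited offer win big",
    "meeting at noon tomorrow", "lunch with the team", "project status update",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# CountVectorizer turns text into word counts; MultinomialNB models those counts
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["claim your free prize"]))  # ['spam']
print(clf.predict(["team meeting tomorrow"]))  # ['ham']
```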

Gradient Boosting

Gradient boosting builds an ensemble of weak learners (typically shallow decision trees) sequentially. Each new tree is trained to correct the errors made by the previous trees, focusing on the samples that were hardest to classify. The result is a powerful model that often achieves state-of-the-art accuracy on tabular data.

from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier

# Standard Gradient Boosting
gb_clf = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gb_clf.fit(X_train, y_train)

y_pred_gb = gb_clf.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: {accuracy_gb:.4f}")

# HistGradientBoosting (faster for larger datasets)
hgb_clf = HistGradientBoostingClassifier(
    max_iter=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
hgb_clf.fit(X_train, y_train)

y_pred_hgb = hgb_clf.predict(X_test)
accuracy_hgb = accuracy_score(y_test, y_pred_hgb)
print(f"HistGradientBoosting Accuracy: {accuracy_hgb:.4f}")

Scikit-learn offers two gradient boosting implementations. GradientBoostingClassifier is the traditional version, which evaluates candidate splits against the exact feature values. HistGradientBoostingClassifier first bins the features into histograms before splitting, which is dramatically faster on datasets with more than a few thousand samples. The histogram variant also handles missing values natively and supports categorical features directly, making it the better choice for many practical applications.

Pro Tip

The learning_rate parameter controls how much each tree contributes to the final model. Lower values (like 0.01 or 0.05) typically produce better results but require more trees (n_estimators) and longer training times. A common strategy is to use early stopping with HistGradientBoostingClassifier to automatically find the optimal number of iterations.

Evaluating Your Models

Accuracy alone is often not enough to judge a classification model, especially when classes are imbalanced. A model that predicts "not fraud" for every transaction might achieve 99.5% accuracy on a dataset where only 0.5% of transactions are fraudulent, yet it would be completely useless at its actual job.

Scikit-learn provides a comprehensive set of evaluation tools. The classification_report function generates precision, recall, and F1-score for each class in a single call:

from sklearn.metrics import classification_report, confusion_matrix

# Using the random forest predictions from earlier
print(classification_report(y_test, y_pred_rf, target_names=iris.target_names))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:")
print(cm)

Here is what each metric tells you:

  • Precision: Of all samples the model labeled as class X, what fraction actually belonged to class X? High precision means few false positives.
  • Recall: Of all samples that truly belonged to class X, what fraction did the model correctly identify? High recall means few false negatives.
  • F1-Score: The harmonic mean of precision and recall. It balances both metrics into a single number, which is useful when you need a compromise between the two.
  • Support: The number of actual samples in each class within the test set.
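These definitions are easy to verify by hand on a tiny binary example. With three true positives, one false positive, and one false negative, precision and recall both work out to 0.75:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Tiny binary example: 1 is the positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# TP = 3, FP = 1, FN = 1
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```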

For a more robust evaluation, use cross-validation instead of a single train/test split. Cross-validation divides the data into multiple folds, trains on all but one fold, tests on the held-out fold, and rotates through until every fold has been used for testing:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the random forest
cv_scores = cross_val_score(rf_clf, X, y, cv=5, scoring='accuracy')
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean Accuracy: {cv_scores.mean():.4f}")
print(f"Standard Deviation: {cv_scores.std():.4f}")

Scikit-learn 1.8 also introduced confusion_matrix_at_thresholds, which returns true negatives, false positives, false negatives, and true positives across multiple decision thresholds. This is particularly useful when you need to tune the trade-off between precision and recall for binary classifiers.

Choosing the Right Model

There is no single classifier that outperforms all others on every dataset. The best choice depends on your data characteristics, performance requirements, and interpretability needs. Here are some practical guidelines:

Start with logistic regression. It trains fast, provides interpretable coefficients, and establishes a baseline. If it performs well enough, you may not need anything more complex.

Use random forests or gradient boosting for tabular data. These ensemble methods handle mixed feature types, nonlinear relationships, and noisy data well. HistGradientBoostingClassifier is especially practical because it handles missing values and categorical features without manual preprocessing.

Use SVMs for small to medium datasets with clear margins. SVMs excel when the number of features is high relative to the number of samples, such as in text classification or genomics. They become impractical on very large datasets due to training time.

Use naive Bayes for text classification or when speed is critical. It is the fastest classifier to train and works well as a baseline for natural language processing tasks.

Use KNN for quick prototyping or small datasets. It requires no training phase, which makes it convenient for exploratory analysis, but it does not scale to large datasets.

"All models are wrong, but some are useful." — George E. P. Box

The practical approach is to try several models on your specific dataset, compare their cross-validated performance, and select the one that best balances accuracy, speed, and interpretability for your use case. Scikit-learn's consistent API makes this comparison straightforward.
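Thanks to that shared API, the comparison loop is only a few lines. A sketch comparing three of the classifiers from this guide with 5-fold cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Same fit/predict interface everywhere, so one loop covers every model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    results[name] = scores.mean()
    print(f"{name:<20} {scores.mean():.4f} (+/- {scores.std():.4f})")
```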

Key Takeaways

  1. Always split your data: Use train_test_split or cross-validation to evaluate models on data they have not seen during training. Evaluating on training data gives misleadingly optimistic results.
  2. Start simple, then add complexity: Begin with logistic regression as a baseline. Only move to more complex models like gradient boosting if the baseline falls short. Simple models are faster to train, easier to debug, and less likely to overfit.
  3. Scale your features when needed: Algorithms that rely on distance calculations (SVM, KNN) or gradient-based optimization (logistic regression, neural networks) benefit significantly from feature scaling. Tree-based methods (decision trees, random forests, gradient boosting) do not require it.
  4. Look beyond accuracy: Use precision, recall, F1-score, and confusion matrices to understand where your model succeeds and fails. This is especially critical for imbalanced datasets where accuracy alone can be misleading.
  5. Use pipelines: Scikit-learn's Pipeline class chains preprocessing steps with your classifier, keeping your code clean, preventing data leakage, and making deployment simpler.
  6. Leverage scikit-learn 1.8 features: Take advantage of Array API support for GPU acceleration with PyTorch arrays, temperature scaling for better probability calibration in multiclass problems, and HistGradientBoostingClassifier for fast, native handling of missing values and categorical data.

Classification models are among the tools you will reach for again and again as a Python developer working with data. The algorithms covered here—logistic regression, decision trees, random forests, SVMs, KNN, naive Bayes, and gradient boosting—cover the vast majority of classification tasks you will encounter. Master the scikit-learn workflow of loading data, splitting, training, predicting, and evaluating, and you will have a reliable process for tackling any classification problem that comes your way.
