Naive Bayes is one of the fastest classification algorithms in machine learning, and a surprisingly effective one. Built on Bayes' theorem and a simple independence assumption, it punches well above its weight in tasks like spam filtering, sentiment analysis, and document categorization. This guide walks through every Naive Bayes variant available in scikit-learn 1.8, with full code examples you can run immediately.
If you have ever trained a complex neural network only to discover a Naive Bayes model matched its accuracy in a fraction of the time, you are not alone. The algorithm's speed, minimal memory footprint, and ability to learn from small datasets make it a go-to starting point for classification tasks. It also supports incremental learning through a partial_fit method, which means it can handle datasets too large to fit in memory all at once.
What Is Naive Bayes?
Naive Bayes is a family of probabilistic classifiers that apply Bayes' theorem with one key simplification: every feature is assumed to be conditionally independent of every other feature, given the class label. This is the "naive" part of the name. In reality, features are rarely truly independent, but the assumption works remarkably well in practice.
Consider a medical diagnosis scenario. A Naive Bayes classifier would treat symptoms like fever, cough, and fatigue as independent indicators of a disease, even though they might be correlated. Despite this oversimplification, the classifier often produces correct predictions because it only needs to get the ranking of probabilities right, not the exact values.
Scikit-learn (version 1.8, the latest stable release as of early 2026) provides five Naive Bayes variants, each designed for different types of feature data: GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, and CategoricalNB.
The Math Behind It
Bayes' theorem gives us a way to compute the probability of a class given the observed features. Written out, it looks like this:
# Bayes' Theorem (conceptual)
# P(class | features) = P(features | class) * P(class) / P(features)
#
# Where:
# P(class | features) = posterior probability
# P(features | class) = likelihood
# P(class) = prior probability
# P(features) = evidence (constant across classes)
Since P(features) is the same for all classes, the classifier simplifies its job to finding the class that maximizes the numerator. This approach is called Maximum A Posteriori (MAP) estimation.
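To make MAP estimation concrete, here is a minimal sketch for a two-class spam problem; the likelihood and prior values are invented purely for illustration:

```python
# Hypothetical values: P(features | class) and P(class) for two classes.
likelihood = {"spam": 0.012, "ham": 0.0005}
prior = {"spam": 0.3, "ham": 0.7}

# MAP: pick the class that maximizes P(features | class) * P(class).
# The evidence P(features) is skipped because it is identical for both classes.
scores = {c: likelihood[c] * prior[c] for c in likelihood}
prediction = max(scores, key=scores.get)
print(prediction)  # spam, since 0.012 * 0.3 = 0.0036 beats 0.0005 * 0.7 = 0.00035
```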
The naive independence assumption lets us decompose the likelihood into a product of individual feature probabilities:
# With the naive independence assumption:
# P(x1, x2, ..., xn | class) = P(x1 | class) * P(x2 | class) * ... * P(xn | class)
#
# This means each feature contributes independently
# to the probability of a class.
This decomposition is what makes Naive Bayes so fast. Instead of estimating a single joint probability distribution over all features (which grows exponentially with the number of features), the algorithm estimates one distribution per feature per class. Each of these distributions is one-dimensional, which largely sidesteps the curse of dimensionality.
While Naive Bayes is a solid classifier, it tends to be a poor probability estimator. The raw probability values from predict_proba should not be taken at face value. Use them for ranking classes, not as calibrated confidence scores. If you need calibrated probabilities, consider wrapping the classifier with sklearn.calibration.CalibratedClassifierCV.
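Here is a sketch of that calibration step, using a synthetic dataset so the example stays self-contained:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data, used only to keep the sketch runnable
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Wrap GaussianNB with sigmoid (Platt) calibration and internal cross-validation
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# These probabilities are calibrated; each row still sums to 1
proba = calibrated.predict_proba(X_test)
print(proba[:3])
```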
GaussianNB: Continuous Features
GaussianNB assumes that continuous features follow a normal (Gaussian) distribution within each class. For every feature in every class, the algorithm estimates a mean and variance from the training data, then uses the Gaussian probability density function to compute likelihoods.
This variant is the natural choice when your features are real-valued measurements like height, weight, temperature, or sensor readings.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
# Load the Iris dataset
X, y = load_iris(return_X_y=True)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Create and train the classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Evaluate
y_pred = gnb.predict(X_test)
print(classification_report(y_test, y_pred, target_names=load_iris().target_names))
After fitting, you can inspect the learned parameters directly. The theta_ attribute holds the mean of each feature per class, and var_ holds the variance. These tell you exactly what the model has learned about how each feature is distributed within each class.
# Inspect learned parameters
print("Class means (theta):")
print(gnb.theta_)
print("\nClass variances (var):")
print(gnb.var_)
print("\nClass priors:")
print(gnb.class_prior_)
GaussianNB also accepts a var_smoothing parameter (defaulting to 1e-9). This adds a small portion of the largest variance across all features to every variance estimate, which prevents division-by-zero errors and improves numerical stability when a feature has very low variance in a given class.
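If you suspect var_smoothing matters for your data, it can be searched like any other hyperparameter. A small standalone sketch on the Iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Sweep var_smoothing across several orders of magnitude
grid = GridSearchCV(
    GaussianNB(),
    {"var_smoothing": np.logspace(-12, -3, 10)},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, f"CV accuracy: {grid.best_score_:.3f}")
```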
If your continuous features are not normally distributed, consider applying a PowerTransformer (Yeo-Johnson or Box-Cox) before feeding data to GaussianNB. This can significantly improve accuracy by making the Gaussian assumption more valid.
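A sketch of that preprocessing step, here on the Wine dataset (chosen only because its features are continuous and skewed):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

X, y = load_wine(return_X_y=True)

# Yeo-Johnson handles zero and negative values; Box-Cox requires strictly positive data
model = make_pipeline(PowerTransformer(method="yeo-johnson"), GaussianNB())
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```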
MultinomialNB: Count-Based Features
MultinomialNB is designed for features that represent counts or frequencies. It is the classic choice for text classification, where features are typically word counts or TF-IDF scores. The algorithm models each class as a multinomial distribution over the feature space.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
# Load a subset of the 20 Newsgroups dataset
categories = ['sci.space', 'rec.sport.hockey', 'comp.graphics', 'talk.politics.guns']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
# Build a pipeline: TF-IDF vectorizer + Multinomial Naive Bayes
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_features=10000),
    MultinomialNB(alpha=1.0)
)
# Train and evaluate
model.fit(train.data, train.target)
y_pred = model.predict(test.data)
print(f"Accuracy: {accuracy_score(test.target, y_pred):.4f}")
The alpha parameter controls Laplace smoothing. Setting alpha=1.0 (the default) adds one pseudo-count to every feature, which prevents zero probabilities from appearing when a word has not been seen in training data for a particular class. Smaller values like 0.1 or 0.01 apply less smoothing, which can sometimes improve performance on larger datasets where zero-frequency events are less of a concern.
Feature values passed to MultinomialNB must be non-negative. If you use TF-IDF scores, this is already the case. If you are working with raw data that may contain negative values, use GaussianNB instead or apply a MinMaxScaler to shift values into a non-negative range.
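A quick standalone sketch of the MinMaxScaler approach, on a made-up matrix with negative values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix containing negative values
X = np.array([[-1.5, 2.0], [0.5, -0.3], [3.0, 1.2]])

# Rescale each feature into [0, 1] so MultinomialNB will accept it
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```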
BernoulliNB: Binary Features
BernoulliNB works with binary feature vectors. Each feature is either present (1) or absent (0). Unlike MultinomialNB, which only accounts for features that occur, BernoulliNB explicitly penalizes the absence of features that are indicative of a class. This makes it particularly effective for short text classification, where the non-occurrence of a word carries meaning.
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
# Sample dataset: short text snippets
texts = [
    "free money win prize lottery",
    "claim your free gift now",
    "winner selected for cash prize",
    "meeting tomorrow at 3pm",
    "project deadline next friday",
    "can you review the report",
    "congratulations you won a trip",
    "urgent: update your account now",
    "lunch plans for today",
    "quarterly review slides attached",
]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0] # 1 = spam, 0 = not spam
# Convert text to binary features (word present or not)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)
# Train BernoulliNB
bnb = BernoulliNB(alpha=1.0, binarize=None)
bnb.fit(X, labels)
# Predict on new text
new_texts = ["free prize winner", "meeting agenda attached"]
X_new = vectorizer.transform(new_texts)
predictions = bnb.predict(X_new)
for text, pred in zip(new_texts, predictions):
    print(f"'{text}' -> {'Spam' if pred == 1 else 'Not Spam'}")
The binarize parameter offers a convenient shortcut. If you pass in non-binary data, setting binarize=0.0 will automatically convert any feature value greater than zero to 1 and everything else to 0. When set to None, the classifier expects data that is already binary.
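For example, raw word counts can be passed straight in once binarize=0.0 is set; a small sketch with a made-up count matrix:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical raw word counts (non-binary); binarize=0.0 maps any count > 0 to 1
X_counts = np.array([[3, 0, 1], [0, 2, 0], [1, 1, 4], [0, 0, 2]])
y = [1, 0, 1, 0]

bnb = BernoulliNB(alpha=1.0, binarize=0.0)
bnb.fit(X_counts, y)

# The query is binarized the same way before prediction
pred = bnb.predict(np.array([[2, 0, 0]]))
print(pred)
```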
ComplementNB: Handling Imbalanced Data
ComplementNB is a variation of MultinomialNB that was designed to address the limitations of the standard algorithm when working with imbalanced datasets. Instead of computing weights based on feature frequencies within each class, it uses statistics from the complement of each class -- that is, all the other classes combined.
This approach tends to produce more stable parameter estimates and often outperforms MultinomialNB on text classification tasks, even when the data is balanced.
from sklearn.naive_bayes import ComplementNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
# Load full 20 Newsgroups dataset (all 20 categories)
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')
# Compare MultinomialNB vs ComplementNB
from sklearn.naive_bayes import MultinomialNB
for name, clf in [("MultinomialNB", MultinomialNB()), ("ComplementNB", ComplementNB())]:
    model = make_pipeline(
        TfidfVectorizer(stop_words='english', max_features=15000),
        clf
    )
    model.fit(train.data, train.target)
    y_pred = model.predict(test.data)
    print(f"{name} Accuracy: {accuracy_score(test.target, y_pred):.4f}")
ComplementNB also supports a norm parameter (defaulting to False). When set to True, it normalizes the weight vectors, which can further improve performance by reducing the impact of features with unusually high frequencies.
CategoricalNB: Categorical Features
CategoricalNB is built specifically for datasets where features are discrete categories rather than continuous values or counts. Each feature is assumed to follow a categorical distribution within each class. This makes it ideal for datasets with features like color, shape, job type, or education level.
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
# Sample dataset: predicting whether someone will buy a product
# Features: [Age Group, Income Level, Student, Credit Rating]
# Encoded as integers (ordinal)
X = np.array([
    [0, 2, 0, 0],  # youth, high, no, fair
    [0, 2, 0, 1],  # youth, high, no, excellent
    [1, 2, 0, 0],  # middle_aged, high, no, fair
    [2, 1, 0, 0],  # senior, medium, no, fair
    [2, 0, 1, 0],  # senior, low, yes, fair
    [2, 0, 1, 1],  # senior, low, yes, excellent
    [1, 0, 1, 1],  # middle_aged, low, yes, excellent
    [0, 1, 0, 0],  # youth, medium, no, fair
    [0, 0, 1, 0],  # youth, low, yes, fair
    [2, 1, 1, 0],  # senior, medium, yes, fair
    [0, 1, 1, 1],  # youth, medium, yes, excellent
    [1, 1, 0, 1],  # middle_aged, medium, no, excellent
    [1, 2, 1, 0],  # middle_aged, high, yes, fair
    [2, 1, 0, 1],  # senior, medium, no, excellent
])
y = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
# Train CategoricalNB
cnb = CategoricalNB(alpha=1.0)
cnb.fit(X, y)
# Predict: middle_aged, medium income, yes student, fair credit
new_sample = np.array([[1, 1, 1, 0]])
prediction = cnb.predict(new_sample)
probabilities = cnb.predict_proba(new_sample)
print(f"Prediction: {'Buy' if prediction[0] == 1 else 'No Buy'}")
print(f"Probabilities: No Buy={probabilities[0][0]:.3f}, Buy={probabilities[0][1]:.3f}")
Features passed to CategoricalNB must be encoded as non-negative integers. Use OrdinalEncoder from scikit-learn to convert string categories to integers before fitting. Do not use one-hot encoding, as that would convert categorical features into binary features (which is BernoulliNB territory).
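A sketch of that encoding step, using made-up string categories:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical string-valued categorical features
X_raw = np.array([
    ["red", "small"],
    ["blue", "large"],
    ["red", "large"],
    ["green", "small"],
])
y = [0, 1, 1, 0]

# OrdinalEncoder maps each column's categories to non-negative integers
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_raw)

cnb = CategoricalNB()
cnb.fit(X_encoded, y)

# New samples must go through the same fitted encoder
pred = cnb.predict(encoder.transform([["red", "large"]]))
print(pred)
```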
Choosing the Right Variant
Selecting the correct Naive Bayes variant comes down to understanding the nature of your features:
- GaussianNB -- Use when features are continuous real numbers (sensor data, measurements, scores). Assumes a bell-curve distribution per feature per class.
- MultinomialNB -- Use when features are counts or frequencies (word counts, event tallies). The go-to for text classification with bag-of-words or TF-IDF.
- BernoulliNB -- Use when features are binary (present/absent). Especially useful for short documents where word absence matters.
- ComplementNB -- Use as a drop-in replacement for MultinomialNB when your classes are imbalanced or when you want potentially better text classification accuracy.
- CategoricalNB -- Use when features are discrete categories (job type, color, region). Requires integer-encoded inputs.
In many real-world projects, the dataset contains a mix of feature types. One common strategy is to train separate Naive Bayes classifiers on different feature subsets and combine their predictions. You can also use ColumnTransformer from scikit-learn to apply different preprocessing pipelines to different columns before feeding them into a single classifier.
Practical Example: Email Spam Classifier
Here is a more complete example that brings together several concepts: pipeline construction, hyperparameter tuning, evaluation metrics, and incremental learning.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
# Simulate a binary spam/not-spam problem using 2 newsgroup categories
categories = ['rec.autos', 'misc.forsale']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
# Build a pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', MultinomialNB()),
])
# Hyperparameter grid
param_grid = {
    'tfidf__max_features': [5000, 10000, 20000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [0.01, 0.1, 0.5, 1.0],
}
# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    verbose=0
)
grid_search.fit(train.data, train.target)
# Best parameters and score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV F1 score: {grid_search.best_score_:.4f}")
# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(test.data)
print("\nClassification Report:")
print(classification_report(test.target, y_pred, target_names=categories))
print("Confusion Matrix:")
print(confusion_matrix(test.target, y_pred))
Incremental Learning with partial_fit
When your dataset is too large to fit in memory, you can train any of scikit-learn's Naive Bayes variants in batches using the partial_fit method. The key requirement is that you must pass the full list of expected class labels via the classes argument on the first call.
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
import numpy as np
# HashingVectorizer does not need to be fitted on all data at once
vectorizer = HashingVectorizer(
    n_features=2**16,
    stop_words='english',
    alternate_sign=False  # ensures non-negative features for MultinomialNB
)
clf = MultinomialNB()
all_classes = np.array([0, 1]) # all possible class labels
# Simulate streaming data in batches
batches = [
    (["free money win big prize now", "claim your lottery winnings"], [1, 1]),
    (["meeting at 2pm tomorrow", "please review the budget"], [0, 0]),
    (["you won a free vacation", "limited time offer act now"], [1, 1]),
    (["project status update", "team standup notes from today"], [0, 0]),
]
for texts, labels in batches:
    X_batch = vectorizer.transform(texts)
    clf.partial_fit(X_batch, labels, classes=all_classes)
# Test the incrementally trained model
test_texts = ["win a free car today", "quarterly report summary"]
X_test = vectorizer.transform(test_texts)
predictions = clf.predict(X_test)
for text, pred in zip(test_texts, predictions):
    print(f"'{text}' -> {'Spam' if pred == 1 else 'Not Spam'}")
Using partial_fit introduces some computational overhead compared to a single call to fit. If your data fits in memory, fit will be faster. Reserve partial_fit for genuine streaming or out-of-core scenarios.
Key Takeaways
- Speed and simplicity: Naive Bayes classifiers train and predict extremely fast because they estimate each feature distribution independently, making them ideal for high-dimensional data and rapid prototyping.
- Choose the variant that matches your data: GaussianNB for continuous features, MultinomialNB for counts, BernoulliNB for binary data, ComplementNB for imbalanced text, and CategoricalNB for discrete categories.
- Smoothing matters: The alpha parameter (Laplace smoothing) prevents zero-probability problems and is worth tuning. Start with the default of 1.0 and try smaller values like 0.1 for larger datasets.
- Strong baseline, not always the ceiling: Naive Bayes makes an excellent first model to establish a performance baseline. If it already meets your accuracy requirements, there is no reason to reach for a more complex algorithm.
- Probability calibration: If you need reliable probability estimates (not just correct class predictions), wrap the classifier with CalibratedClassifierCV, since raw Naive Bayes probabilities tend to be pushed toward 0 and 1.
Naive Bayes remains a cornerstone of practical machine learning. Its mathematical elegance, computational efficiency, and solid real-world performance make it a classifier that belongs in every Python developer's toolkit. Start with the variant that matches your feature types, tune the smoothing parameter, and you will often find yourself with a model that is hard to beat for the effort involved.