Convolutional Neural Networks are the backbone of modern computer vision. From recognizing faces in photos to guiding self-driving vehicles, CNNs have transformed how machines interpret visual data. This article walks through the architecture of a CNN, explains each layer type, and demonstrates how to build, train, and evaluate a CNN from scratch using Python with TensorFlow and Keras.
If you have ever wondered how an image recognition system can distinguish a cat from a dog, or how a medical imaging tool can identify a tumor in a scan, the answer almost always involves a convolutional neural network. CNNs are a specialized type of neural network that processes data with a grid-like structure, such as pixels in an image. They learn to detect patterns automatically, starting with simple edges and textures and building up to complex objects and scenes.
This article uses TensorFlow 2.21 and Keras 3, which together represent the current standard for building CNNs in Python. We will also cover a PyTorch implementation for readers who prefer that framework.
What Is a Convolutional Neural Network?
A convolutional neural network is a class of deep neural network designed specifically for processing structured grid data, with images being the primary use case. Unlike a standard fully connected neural network where every neuron connects to every neuron in the next layer, a CNN uses a more targeted approach. Each neuron connects only to a small, localized region of the input called the receptive field. This makes CNNs far more efficient at handling image data, since they exploit the spatial structure of pixels rather than treating each pixel as an independent feature.
The core idea behind CNNs draws inspiration from the visual cortex in the human brain. Researchers discovered that certain neurons in the brain respond only to stimuli in specific regions of the visual field, and that these neurons are organized in layers that process increasingly complex patterns. CNNs mirror this architecture by stacking layers that detect progressively more abstract features.
A typical CNN pipeline processes an input image through three main types of layers: convolutional layers that detect features, pooling layers that reduce spatial dimensions, and fully connected layers that produce the final classification. The image flows through this pipeline, transforming from raw pixel values into a probability distribution over output classes.
CNNs are not limited to images. They are also used in natural language processing, audio analysis, and time-series forecasting, though image classification remains their most established application.
Understanding the Core Layers
Convolutional Layers
The convolutional layer is the foundation of a CNN. It applies a set of learnable filters (also called kernels) to the input. Each filter is a small matrix, typically 3x3 or 5x5, that slides across the input image, computing dot products at each position. The result is a feature map that highlights where a particular pattern was detected in the input.
For example, one filter might detect vertical edges, another horizontal edges, and another diagonal lines. As the network trains, these filters are learned automatically through backpropagation rather than being hand-designed. Early layers tend to learn low-level features like edges and corners, while deeper layers learn higher-level features like shapes and object parts.
Key parameters of a convolutional layer include the number of filters (which determines how many feature maps are produced), the kernel size (which controls the size of the receptive field), the stride (how many pixels the filter moves at each step), and padding (whether to add zeros around the border of the input to control the output dimensions).
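To make the sliding-filter idea concrete, here is a minimal NumPy sketch of the operation (technically cross-correlation, which is what deep learning frameworks implement under the name "convolution"). The conv2d helper, the toy image, and the edge filter are illustrative values, not part of any framework API.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid cross-correlation: slide the kernel over the image,
    computing a dot product at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 4x4 image with a vertical edge between columns 1 and 2
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# A 3x3 vertical-edge filter
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = conv2d(image, kernel)
print(feature_map.shape)  # (2, 2): (4 - 3) // 1 + 1 = 2 per dimension
print(feature_map)        # every position straddles the edge, so all values are 3
```

With no padding, a 3x3 kernel shrinks a 4x4 input to 2x2, which is why real networks often use padding="same" to preserve spatial dimensions.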
Activation Functions
After each convolution operation, an activation function introduces non-linearity into the network. Without activation functions, a CNN would only be able to learn linear relationships, no matter how many layers were stacked. The Rectified Linear Unit, or ReLU, is the standard choice for CNN hidden layers. It sets any negative values to zero and passes positive values through unchanged. This simple operation is computationally efficient and helps the network learn complex patterns.
# ReLU activation example
import numpy as np
def relu(x):
    return np.maximum(0, x)
# Input feature map values
feature_map = np.array([-2.5, 0.3, -1.0, 4.7, -0.1, 2.8])
activated = relu(feature_map)
print(activated) # [0. 0.3 0. 4.7 0. 2.8]
Other activation functions like GELU and Swish have gained traction in recent architectures, but ReLU remains the default for standard CNN implementations.
Pooling Layers
Pooling layers reduce the spatial dimensions of feature maps, which decreases computational cost and helps the network become more invariant to small translations in the input. The most common pooling operation is max pooling, which selects the maximum value from a small window (typically 2x2) and discards the rest. This reduces the height and width of the feature map by half while retaining the strongest activations.
Average pooling is another option that computes the mean instead of the maximum, but max pooling tends to perform better in practice because it preserves the most prominent features.
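A minimal NumPy sketch of 2x2 max pooling shows how each window collapses to its strongest activation (the max_pool2d helper and sample values are illustrative, not a framework API):

```python
import numpy as np

def max_pool2d(feature_map, pool=2):
    """Non-overlapping max pooling with a pool x pool window."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % pool, :w - w % pool]
    windows = trimmed.reshape(h // pool, pool, w // pool, pool)
    return windows.max(axis=(1, 3))

fm = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 6, 1, 4],
], dtype=float)

pooled = max_pool2d(fm)
print(pooled)
# [[4. 5.]
#  [6. 4.]]
```

Each 2x2 window keeps only its maximum, halving height and width while preserving the strongest responses.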
Fully Connected Layers
After the convolutional and pooling layers have extracted features from the input, the resulting feature maps are flattened into a one-dimensional vector and passed through one or more fully connected (dense) layers. These layers combine the extracted features to make the final prediction. The last fully connected layer typically uses a softmax activation function for multi-class classification, which converts raw output values into probabilities that sum to 1.
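The softmax conversion from raw scores to probabilities can be sketched in a few lines of NumPy (the example logits are made up for illustration):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Raw output values (logits) from the final dense layer
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs.round(3))    # [0.659 0.242 0.099]
print(probs.sum())       # sums to 1 (up to floating-point rounding)
print(np.argmax(probs))  # 0 -> the predicted class
```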
Setting Up Your Environment
Before building a CNN, install the required libraries. TensorFlow includes Keras as its high-level API, so a single install gives you both.
# Install TensorFlow (includes Keras)
pip install tensorflow
# For GPU support (NVIDIA GPUs)
pip install tensorflow[and-cuda]
# Verify the installation
import tensorflow as tf
print(tf.__version__) # 2.21.0
print(tf.config.list_physical_devices('GPU'))
Keras 3 is now a multi-backend framework supporting TensorFlow, PyTorch, and JAX. You can switch backends by setting the KERAS_BACKEND environment variable. For this article, we use the TensorFlow backend, which is the default.
Building a CNN with TensorFlow and Keras
Let's build a CNN that classifies images from the CIFAR-10 dataset. CIFAR-10 contains 60,000 color images (32x32 pixels) across 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. It is split into 50,000 training images and 10,000 test images.
Loading and Preprocessing the Data
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt
# Load the CIFAR-10 dataset
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()
# Normalize pixel values from [0, 255] to [0, 1]
train_images = train_images.astype("float32") / 255.0
test_images = test_images.astype("float32") / 255.0
# One-hot encode the labels
train_labels = to_categorical(train_labels, 10)
test_labels = to_categorical(test_labels, 10)
# Verify the shapes
print(f"Training data: {train_images.shape}") # (50000, 32, 32, 3)
print(f"Test data: {test_images.shape}") # (10000, 32, 32, 3)
print(f"Training labels: {train_labels.shape}") # (50000, 10)
Normalizing pixel values to the range [0, 1] helps the model converge faster during training. One-hot encoding converts integer labels (like 3 for "cat") into binary vectors where only the position corresponding to the correct class is set to 1.
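What to_categorical does can be sketched in plain NumPy by indexing the rows of an identity matrix (a sketch for illustration, not the Keras implementation):

```python
import numpy as np

# Integer labels for four samples (3 = "cat" in CIFAR-10)
labels = np.array([3, 0, 9, 3])
num_classes = 10

# np.eye(10) is the 10x10 identity; indexing its rows yields one-hot vectors
one_hot = np.eye(num_classes)[labels]
print(one_hot.shape)  # (4, 10)
print(one_hot[0])     # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```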
Defining the CNN Architecture
model = models.Sequential([
    # First convolutional block
    layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                  input_shape=(32, 32, 3)),
    layers.BatchNormalization(),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),

    # Second convolutional block
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),

    # Third convolutional block
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),

    # Classification head
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax")
])
model.summary()
This architecture follows a common CNN design pattern where the number of filters doubles as spatial dimensions are halved. The model starts with 32 filters, increases to 64, and then to 128. Each convolutional block uses two convolution layers followed by batch normalization, max pooling, and dropout.
Here is what each component does in this architecture:
- Conv2D(32, (3, 3)) -- Applies 32 filters of size 3x3 to extract features. The padding="same" argument adds zero-padding so the output retains the same spatial dimensions as the input.
- BatchNormalization() -- Normalizes the activations of each layer to stabilize and accelerate training. It reduces internal covariate shift, allowing the use of higher learning rates.
- MaxPooling2D((2, 2)) -- Reduces spatial dimensions by half, selecting the maximum value from each 2x2 window.
- Dropout(0.25) -- Randomly sets 25% of neurons to zero during training, which prevents overfitting by forcing the network to learn redundant representations.
- Flatten() -- Converts the 3D feature maps into a 1D vector for the dense layers.
- Dense(256) -- A fully connected layer with 256 neurons that combines extracted features.
- Dense(10, activation="softmax") -- The output layer with 10 neurons (one per class) using softmax to produce class probabilities.
The input_shape=(32, 32, 3) parameter tells the model to expect 32x32 pixel images with 3 color channels (RGB). You only need to specify the input shape on the first layer.
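It is worth tracing how the feature-map shape evolves through the three blocks. With padding="same", each Conv2D preserves height and width, so only the 2x2 pooling changes the spatial size. A quick calculation (a sketch based on the architecture above) shows where the Flatten layer's input size comes from:

```python
# Trace feature-map shapes through the three blocks: padding="same" keeps
# spatial size through each Conv2D; each MaxPooling2D((2, 2)) halves it
h = w = 32
channels = 3
for filters in (32, 64, 128):
    channels = filters     # the Conv2D layers set the channel count
    h, w = h // 2, w // 2  # max pooling halves height and width
    print(f"after block with {filters} filters: {h}x{w}x{channels}")

flattened = h * w * channels
print(f"Flatten produces a vector of length {flattened}")  # 4 * 4 * 128 = 2048
```

This same arithmetic explains the nn.Linear(128 * 4 * 4, 256) input size in the PyTorch version later in the article.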
Training the Model
Compiling the Model
Before training, compile the model by specifying an optimizer, a loss function, and evaluation metrics.
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)
The Adam optimizer adapts its learning rate for each parameter during training, making it a reliable default choice. Categorical crossentropy is the standard loss function for multi-class classification with one-hot encoded labels. If you use integer labels instead of one-hot encoding, use sparse_categorical_crossentropy.
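The relationship between the two loss variants is easy to verify by hand. Categorical crossentropy is -sum(y_true * log(y_pred)); with a one-hot target, only the true class's term survives, which is exactly what the sparse variant computes from an integer label. A NumPy sketch (example probabilities are made up for illustration):

```python
import numpy as np

def categorical_crossentropy(y_true_onehot, y_pred):
    """Cross-entropy for one sample: -sum(y_true * log(y_pred))."""
    return -np.sum(y_true_onehot * np.log(y_pred))

# Model's predicted probabilities for one image (10 classes)
y_pred = np.array([0.05, 0.05, 0.05, 0.7, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02])
y_true = np.eye(10)[3]  # one-hot vector for class 3 ("cat")

loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))  # -log(0.7) = 0.3567

# The sparse variant computes the same value from the integer label:
sparse_loss = -np.log(y_pred[3])
print(np.isclose(loss, sparse_loss))  # True
```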
Training with a Validation Split
history = model.fit(
    train_images, train_labels,
    epochs=50,
    batch_size=64,
    validation_split=0.2,
    verbose=1
)
Setting validation_split=0.2 reserves 20% of the training data for validation, letting you monitor the model's performance on unseen data during training. A batch_size of 64 processes 64 images at a time, balancing memory usage against training speed.
Visualizing Training Progress
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Plot training and validation accuracy
axes[0].plot(history.history["accuracy"], label="Training Accuracy")
axes[0].plot(history.history["val_accuracy"], label="Validation Accuracy")
axes[0].set_title("Model Accuracy")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Accuracy")
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Plot training and validation loss
axes[1].plot(history.history["loss"], label="Training Loss")
axes[1].plot(history.history["val_loss"], label="Validation Loss")
axes[1].set_title("Model Loss")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Loss")
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Watch for the gap between training and validation curves. A large gap indicates overfitting -- the model memorizes training data but fails to generalize. If this happens, consider increasing dropout rates, adding data augmentation, or reducing the model's complexity.
Evaluating and Using the Trained Model
Test Set Evaluation
# Evaluate on the test set
test_loss, test_accuracy = model.evaluate(test_images, test_labels, verbose=0)
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test Loss: {test_loss:.4f}")
Making Predictions
import numpy as np
# Class names for CIFAR-10
class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]
# Predict on a single image
sample_image = test_images[0:1] # Keep the batch dimension
predictions = model.predict(sample_image)
# Get the predicted class
predicted_class = np.argmax(predictions[0])
confidence = predictions[0][predicted_class]
print(f"Predicted: {class_names[predicted_class]}")
print(f"Confidence: {confidence:.2%}")
Saving and Loading the Model
# Save the entire model (architecture + weights + optimizer state)
model.save("cifar10_cnn.keras")
# Load the model later
loaded_model = tf.keras.models.load_model("cifar10_cnn.keras")
# Verify it works
test_loss, test_acc = loaded_model.evaluate(test_images, test_labels, verbose=0)
print(f"Loaded model accuracy: {test_acc:.4f}")
The .keras format is the recommended way to save models in Keras 3. It stores everything in a single file and is more portable than the older HDF5 (.h5) format.
Improving CNN Performance
Data Augmentation
Data augmentation artificially expands your training set by applying random transformations to existing images. This helps the model generalize better and reduces overfitting.
data_augmentation = models.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
    layers.RandomContrast(0.1),
])

# Build an augmented model
augmented_model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    # Augmentation layers (only active during training)
    data_augmentation,
    # CNN layers
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax")
])
Augmentation layers like RandomFlip and RandomRotation are only active during training; during inference, they pass images through unchanged. Because these preprocessing layers live inside the model, the augmentation pipeline is saved and loaded along with the model itself.
Learning Rate Scheduling
A learning rate schedule reduces the learning rate as training progresses, helping the optimizer make finer adjustments as it approaches a minimum.
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
callbacks = [
    # Reduce learning rate when validation loss plateaus
    ReduceLROnPlateau(
        monitor="val_loss",
        factor=0.5,
        patience=5,
        min_lr=1e-6,
        verbose=1
    ),
    # Stop training when validation loss stops improving
    EarlyStopping(
        monitor="val_loss",
        patience=10,
        restore_best_weights=True,
        verbose=1
    )
]

augmented_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)

history = augmented_model.fit(
    train_images, train_labels,
    epochs=100,
    batch_size=64,
    validation_split=0.2,
    callbacks=callbacks,
    verbose=1
)
The ReduceLROnPlateau callback halves the learning rate if validation loss does not improve for 5 consecutive epochs. The EarlyStopping callback stops training entirely after 10 epochs without improvement and restores the weights from the best epoch.
Building a CNN with PyTorch
PyTorch is the other dominant deep learning framework, and many researchers prefer it for its flexibility and Pythonic feel. Here is the same CIFAR-10 classifier implemented in PyTorch.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
# Define transforms with augmentation
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616))
])
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616))
])

# Load datasets
trainset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=transform_train
)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=64, shuffle=True, num_workers=2
)
testset = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True,
    transform=transform_test
)
testloader = torch.utils.data.DataLoader(
    testset, batch_size=64, shuffle=False, num_workers=2
)
Defining the PyTorch CNN
class CIFAR10CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25),
            # Block 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25),
            # Block 3
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25),
        )
        self.classifier = nn.Sequential(
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.BatchNorm1d(256),
            nn.Dropout(0.5),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.classifier(x)
        return x
Training the PyTorch Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CIFAR10CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)
# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    train_acc = 100.0 * correct / total
    avg_loss = running_loss / len(trainloader)

    # Validation
    model.eval()
    val_correct = 0
    val_total = 0
    val_loss = 0.0
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
            _, predicted = outputs.max(1)
            val_total += labels.size(0)
            val_correct += predicted.eq(labels).sum().item()
    val_acc = 100.0 * val_correct / val_total
    avg_val_loss = val_loss / len(testloader)
    scheduler.step(avg_val_loss)

    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}] "
              f"Train Loss: {avg_loss:.4f} | Train Acc: {train_acc:.2f}% | "
              f"Val Loss: {avg_val_loss:.4f} | Val Acc: {val_acc:.2f}%")
PyTorch requires writing the training loop explicitly, which offers more control compared to Keras. Notice that PyTorch uses nn.CrossEntropyLoss() which expects raw logits (no softmax on the output layer) and integer labels (not one-hot encoded). This is a key difference from the Keras implementation.
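To see what "expects raw logits" means in practice, here is a plain-NumPy sketch of what nn.CrossEntropyLoss computes for a single sample: log-softmax over the logits followed by the negative log-likelihood of the integer label. The helper name and example values are illustrative, not PyTorch internals.

```python
import numpy as np

def cross_entropy_from_logits(logits, label):
    """Cross-entropy loss for one sample from raw logits:
    log-softmax, then negative log-likelihood of the true class."""
    shifted = logits - logits.max()  # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

logits = np.array([1.5, -0.3, 2.1, 0.0])  # raw scores, no softmax applied
label = 2                                 # integer label, not one-hot
loss = cross_entropy_from_logits(logits, label)
print(round(loss, 4))
```

Because the loss applies log-softmax internally, adding a softmax to the model's output layer would be redundant and numerically less stable.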
Always call model.train() before the training loop and model.eval() before evaluation. These toggle behaviors like dropout and batch normalization that behave differently during training and inference.
Common CNN Architectures
Beyond building a CNN from scratch, you can leverage pre-trained architectures that have been trained on millions of images. This approach, called transfer learning, allows you to achieve excellent results even with small datasets.
Here are the architectures worth knowing:
- VGG (2014) -- Uses a simple, uniform architecture of stacked 3x3 convolutions. Easy to understand but parameter-heavy. VGG-16 has about 138 million parameters.
- ResNet (2015) -- Introduced skip connections that allow gradients to flow through the network more easily, enabling training of very deep networks (50, 101, or even 152 layers). Remains one of the most widely used architectures.
- EfficientNet (2019) -- Uses a compound scaling method to balance network depth, width, and resolution. Achieves strong accuracy with fewer parameters than earlier architectures.
- ConvNeXt (2022) -- Modernizes the classic ResNet design by incorporating ideas from Vision Transformers, such as larger kernel sizes and layer normalization. Demonstrates that pure convolutional architectures can match or exceed transformer performance on many vision tasks.
Transfer Learning Example
# Using a pre-trained ResNet50 with transfer learning
base_model = tf.keras.applications.ResNet50(
    weights="imagenet",
    include_top=False,
    input_shape=(224, 224, 3)
)

# Freeze the base model weights
base_model.trainable = False

# Build a new classification head
transfer_model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax")
])

transfer_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)

transfer_model.summary()
Transfer learning works by freezing the pre-trained convolutional layers (which already know how to extract useful features) and only training the new classification head on your specific dataset. This dramatically reduces training time and data requirements.
Key Takeaways
- CNNs are built from three core layer types: Convolutional layers extract spatial features, pooling layers reduce dimensions, and fully connected layers produce predictions. Understanding these building blocks is essential before using more advanced architectures.
- Regularization prevents overfitting: Techniques like dropout, batch normalization, and data augmentation are critical for building models that generalize to new data. Without them, a CNN will memorize training samples rather than learning useful patterns.
- Both Keras and PyTorch are strong choices: Keras offers a higher-level API that is faster to prototype with, while PyTorch provides more granular control over the training process. The underlying CNN concepts are identical across both frameworks.
- Transfer learning is often the best starting point: Pre-trained models like ResNet and EfficientNet have already learned powerful feature representations from millions of images. Fine-tuning these models on your specific task is faster, more data-efficient, and often more accurate than training from scratch.
- Monitor training with validation metrics: Always split your data into training and validation sets and watch for divergence between training and validation performance. Callbacks like EarlyStopping and ReduceLROnPlateau automate this monitoring process.
Convolutional neural networks remain a foundational tool in machine learning, even as newer architectures like Vision Transformers gain popularity. The principles covered here -- feature extraction through convolutions, spatial reduction through pooling, and classification through dense layers -- form the basis for understanding any modern computer vision system. Whether you are classifying handwritten digits or building a medical image analysis pipeline, these techniques provide the framework to get started and iterate toward a production-ready solution.