Deep learning is the engine behind self-driving cars, voice assistants, medical image analysis, and large language models. If you write Python, you already have everything you need to start building neural networks. This guide walks through the core concepts, introduces the frameworks that matter in 2026, and puts working code in your hands by the end.
Machine learning teaches computers to find patterns in data. Deep learning is a specialized branch of machine learning that uses neural networks with multiple layers—sometimes hundreds of them—to learn increasingly abstract representations of that data. The "deep" in deep learning refers to the depth of these layered architectures, not to the depth of understanding (that part is still up to you).
Python dominates the deep learning landscape because of its readable syntax, massive ecosystem of scientific computing libraries, and the fact that every major deep learning framework provides a Python-first interface. Whether you are a data analyst looking to expand your skill set, a software developer exploring AI, or a student just getting started, Python is the right entry point.
What Is Deep Learning?
At its core, a neural network is a mathematical function that takes input data, passes it through a series of transformations, and produces an output. Each transformation happens inside a layer, and each layer contains neurons (also called nodes) that apply a weighted sum followed by a nonlinear activation function.
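The weighted-sum-plus-activation idea can be sketched in a few lines of plain Python. The weights, inputs, and bias below are arbitrary illustrative values, not learned ones:

```python
# A single neuron: a weighted sum of its inputs plus a bias,
# passed through a ReLU activation.

def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias):
    # Weighted sum: w1*x1 + w2*x2 + ... + bias
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)

# Illustrative values: two inputs, two weights, one bias
print(neuron([2.0, 4.0], [0.5, 0.25], -1.0))  # 1.0
```

A real layer is just many of these neurons computed in parallel as a matrix multiplication, which is exactly what nn.Linear does in PyTorch.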
Here is the fundamental flow of a neural network:
- Input layer: Receives raw data such as pixel values from an image, words from a sentence, or numerical features from a spreadsheet.
- Hidden layers: Transform the data through learned weights and biases. Each successive layer extracts higher-level features from the previous layer's output.
- Output layer: Produces the final prediction, whether that is a class label, a probability, or a continuous value.
During training, the network compares its predictions to the correct answers using a loss function. It then uses an algorithm called backpropagation to calculate how much each weight contributed to the error, and an optimizer (such as stochastic gradient descent or Adam) adjusts those weights to reduce the error. This process repeats across many iterations, called epochs, until the model converges on accurate predictions.
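The loss-gradient-update cycle can be seen in miniature with a single weight and a single data point. This is a toy sketch, not a real training loop, but the arithmetic is the same one backpropagation and an optimizer perform at scale:

```python
# Toy gradient descent: fit y = w * x to one data point (x=2, y=6).
# Loss = (w*x - y)**2, so dLoss/dw = 2 * (w*x - y) * x.

x, y = 2.0, 6.0
w = 0.0    # start with an uninformed weight
lr = 0.1   # learning rate

for step in range(20):
    pred = w * x
    grad = 2 * (pred - y) * x   # the gradient, computed by hand
    w -= lr * grad              # the optimizer's update step

print(round(w, 4))  # converges to 3.0, since 3 * 2 = 6
```

Each pass over the loop is one iteration; a full pass over a real dataset is one epoch.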
Deep learning is not the right tool for every problem. For structured tabular data with limited rows, traditional machine learning methods like gradient boosting (XGBoost, LightGBM) often outperform neural networks and train much faster. Deep learning shines when working with unstructured data like images, audio, text, and video.
Choosing a Framework: PyTorch, TensorFlow, and Keras
Three frameworks dominate Python deep learning in 2026. Each serves a different audience and workflow, though there is significant overlap in what they can accomplish.
PyTorch
Developed by Meta's AI Research lab, PyTorch has become the default framework for deep learning research and is rapidly gaining ground in production environments. It uses dynamic computation graphs, which means you can change your network architecture on the fly and debug it with standard Python tools like pdb or your IDE's debugger. PyTorch 2.6, released in January 2025, added FP16 support for x86 CPUs, FlexAttention on CPU for large language model inference, and expanded Intel GPU compatibility.
PyTorch is the foundation for many well-known models and libraries, including Hugging Face Transformers, Meta's Llama models, and OpenAI's GPT family.
TensorFlow
Google's TensorFlow remains a strong choice for production deployment and enterprise applications. Its ecosystem includes TensorFlow Lite for mobile and edge devices, TensorFlow.js for browser-based inference, and TensorFlow Serving for scalable model deployment. TensorFlow's tight integration with Keras as its high-level API makes it accessible to beginners while still offering low-level control when needed.
Keras
Keras is a high-level API that runs on top of TensorFlow (its default backend), JAX, or PyTorch. It abstracts away the complexity of defining layers, compiling models, and running training loops into a clean, consistent interface. If you want to go from idea to working model as quickly as possible, Keras is hard to beat.
If you are just starting out, pick one framework and stick with it until you are comfortable. PyTorch is recommended for learners who want to understand what is happening under the hood. Keras is recommended for learners who want fast results with minimal boilerplate. You can always switch later—the concepts transfer directly.
Other Frameworks Worth Knowing
Google's JAX is gaining traction among researchers for its functional programming style, just-in-time compilation through XLA, and efficient support for TPU/GPU hardware acceleration. FastAI provides a high-level wrapper around PyTorch that makes transfer learning and model training remarkably concise. ONNX (Open Neural Network Exchange) is not a training framework but a standardized format for exporting models between frameworks, making it valuable for cross-platform deployment.
Your First Neural Network in PyTorch
The best way to learn deep learning is to build something. This example creates a simple feedforward neural network that classifies handwritten digits from the MNIST dataset. MNIST contains 70,000 grayscale images of digits 0 through 9, each 28 by 28 pixels.
First, install PyTorch if you have not already:
pip install torch torchvision
Now build and train the model:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Prepare the data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST(
    root="./data", train=True,
    download=True, transform=transform
)
test_data = datasets.MNIST(
    root="./data", train=False,
    download=True, transform=transform
)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000, shuffle=False)
# Define the neural network
class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(28 * 28, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.layers(x)
# Initialize model, loss function, and optimizer
model = DigitClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
for epoch in range(5):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    avg_loss = running_loss / len(train_loader)
    print(f"Epoch {epoch + 1}/5 - Loss: {avg_loss:.4f}")
# Evaluate on test set
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")
This model should reach roughly 97-98% accuracy after five epochs. That is a solid result for a simple feedforward network with no convolutional layers.
Let's break down what is happening in this code:
- Data loading: torchvision.datasets.MNIST handles downloading and loading the dataset. The transforms.Normalize call standardizes pixel values using the dataset's known mean and standard deviation.
- Model architecture: The network flattens each 28x28 image into a 784-element vector, passes it through two hidden layers with ReLU activations, and outputs 10 values (one per digit class). Dropout layers randomly zero out 20% of neurons during training to prevent overfitting.
- Training: For each batch, the model makes predictions, calculates the loss, computes gradients via loss.backward(), and updates weights via optimizer.step().
- Evaluation: The torch.no_grad() context manager disables gradient computation during testing, which saves memory and speeds up inference.
Building a CNN for Image Classification
Feedforward networks treat each pixel independently. Convolutional Neural Networks (CNNs) are designed specifically for spatial data. They use convolutional filters that slide across the image, detecting local patterns like edges, textures, and shapes. Deeper layers combine these patterns to recognize increasingly complex features.
Here is a CNN version of the digit classifier:
class ConvDigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc_layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        return self.fc_layers(x)
This CNN applies two convolutional layers, each followed by a ReLU activation and max pooling. The first convolutional layer uses 32 filters of size 3x3, and the second uses 64. Max pooling halves the spatial dimensions at each step, so the 28x28 input becomes 7x7 by the time it reaches the fully connected layers. This model typically achieves 99%+ accuracy on MNIST.
The key difference from the feedforward network is that convolutional layers preserve spatial relationships. A filter might learn to detect a horizontal edge, and the network learns where those edges appear in the image, not just that they exist somewhere in the flattened vector.
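The shrinking spatial dimensions can be traced with a short sketch. The conv_out helper below is a hypothetical illustration of the standard output-size formula, with the kernel size and padding used in the CNN above:

```python
# How spatial dimensions shrink through the CNN above.
# Conv output size = (input + 2*padding - kernel) // stride + 1
# A 2x2 max pool halves the size.

def conv_out(size, kernel=3, padding=1, stride=1):
    return (size + 2 * padding - kernel) // stride + 1

size = 28
size = conv_out(size) // 2   # conv1 keeps 28x28, pool -> 14x14
size = conv_out(size) // 2   # conv2 keeps 14x14, pool -> 7x7

# With 64 filters, the flattened feature vector has 64 * 7 * 7 elements,
# which is exactly the input size of the first Linear layer.
print(size, 64 * size * size)  # 7 3136
```

Tracing the shapes like this is a quick way to catch size mismatches before PyTorch raises an error at runtime.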
The training loop and evaluation code from the previous example work identically with this CNN. Just replace DigitClassifier() with ConvDigitClassifier() and the rest of the code stays the same. That is one of PyTorch's strengths—swapping architectures is seamless because every model follows the same nn.Module interface.
Understanding Recurrent Networks and Sequence Data
Not all data is spatial. Text, time series, audio, and stock prices are sequential—the order of elements matters. Recurrent Neural Networks (RNNs) are designed for this kind of data. They process sequences one element at a time, maintaining a hidden state that carries information from previous steps.
The basic RNN suffers from the vanishing gradient problem: as sequences get longer, gradients shrink during backpropagation, making it difficult for the network to learn long-range dependencies. Two architectures solve this problem:
- LSTM (Long Short-Term Memory): Uses gating mechanisms (input gate, forget gate, output gate) to control what information is stored, updated, or discarded in the cell state. This allows LSTMs to learn dependencies across hundreds of time steps.
- GRU (Gated Recurrent Unit): A simplified version of LSTM with only two gates (reset and update). GRUs are faster to train and often perform comparably to LSTMs on many tasks.
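The vanishing gradient problem can be illustrated numerically. During backpropagation through time, the gradient is multiplied by some factor at every step; the 0.5 below is an arbitrary illustrative value for that factor:

```python
# Toy illustration of the vanishing gradient problem: if the
# per-step gradient scale is below 1, the learning signal from
# early time steps shrinks exponentially with sequence length.

factor = 0.5     # illustrative per-step gradient scale
gradient = 1.0
for step in range(50):
    gradient *= factor

print(gradient)  # about 8.9e-16: effectively zero after 50 steps
```

The LSTM's gated cell state gives gradients a path through time that avoids this repeated shrinking, which is why it can learn much longer dependencies.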
Here is a simple LSTM for sequence classification:
class SequenceClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            batch_first=True, bidirectional=True
        )
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, (hidden, _) = self.lstm(embedded)
        # Concatenate forward and backward hidden states
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        return self.classifier(hidden)
This model embeds input tokens into dense vectors, processes them through a bidirectional LSTM (which reads the sequence both forward and backward), and feeds the final hidden state into a classification layer.
For text classification and natural language processing tasks in 2026, Transformer-based models (BERT, GPT, Llama) have largely replaced RNNs in production. However, understanding RNNs and LSTMs remains valuable because they are simpler to reason about, faster to train on small datasets, and still effective for time series forecasting and signal processing.
Training Tips That Actually Matter
Getting a model to run is the easy part. Getting it to generalize well is where the real work begins. Here are the techniques that make the biggest difference.
Start with a Learning Rate Finder
The learning rate is the single most important hyperparameter. Too high and the model diverges. Too low and training takes forever or gets stuck in poor local minima. A learning rate finder starts with a very small rate and gradually increases it while recording the loss. The optimal rate is typically just before the loss starts increasing sharply.
# Simple learning rate finder (uses a fresh copy of the model
# so the real model's weights are untouched)
lrs = []
losses = []
lr = 1e-7
model_copy = DigitClassifier()
optimizer = optim.Adam(model_copy.parameters(), lr=lr)

for images, labels in train_loader:
    optimizer.param_groups[0]["lr"] = lr
    optimizer.zero_grad()
    outputs = model_copy(images)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())
    lr *= 1.1  # Increase by 10% each step
    if lr > 1.0:
        break
Use Data Augmentation
For image tasks, augmenting your training data with random rotations, flips, crops, and color jitter artificially increases your dataset size and forces the model to learn more robust features. PyTorch's torchvision.transforms module makes this straightforward:
augment_transform = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.RandomAffine(0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
Monitor for Overfitting
If your training loss keeps decreasing but your validation loss starts increasing, the model is memorizing the training data instead of learning general patterns. Common remedies include adding dropout layers, using weight decay (L2 regularization) in your optimizer, reducing model complexity, and adding more training data.
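Weight decay is worth seeing in isolation. In PyTorch it is the weight_decay argument of the optimizer, for example optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4). The sketch below shows the underlying idea with a single weight and illustrative values: the update shrinks each weight toward zero in addition to following the loss gradient.

```python
# Weight decay (L2 regularization) sketch: the effective gradient
# is the loss gradient plus weight_decay * w, so large weights are
# continually pulled toward zero.

def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    return w - lr * (grad + weight_decay * w)

w = 5.0
# Even with a zero loss gradient, the weight slowly decays
for _ in range(3):
    w = sgd_step(w, grad=0.0)

print(round(w, 6))  # 4.985015: slightly smaller after three steps
```

By penalizing large weights, weight decay discourages the model from relying too heavily on any single feature, which tends to improve generalization.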
Use a Learning Rate Scheduler
Starting with a higher learning rate and reducing it as training progresses often yields better results than using a fixed rate:
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

# Call after each epoch, passing the validation loss
scheduler.step(validation_loss)
Never evaluate your model on the same data you used for training. Always split your data into training, validation, and test sets. The validation set guides hyperparameter tuning, and the test set gives you a final unbiased estimate of real-world performance. Evaluating on training data will give you misleadingly optimistic results.
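A three-way split can be sketched in plain Python. The split_dataset helper and the 70/15/15 ratios below are hypothetical; in practice torch.utils.data.random_split or scikit-learn's train_test_split do the same job:

```python
import random

# Hypothetical 70/15/15 train/validation/test split.
def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=0):
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting matters: if the data is ordered (for example, by class), an unshuffled split would give the model a skewed view of each subset.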
Where to Go from Here
Once you are comfortable building and training basic models, here are the paths forward depending on your interests:
- Computer vision: Explore transfer learning with pretrained models like ResNet, EfficientNet, and Vision Transformers (ViT). Libraries like Detectron2 (built on PyTorch) handle advanced tasks such as object detection and image segmentation.
- Natural language processing: The Hugging Face Transformers library provides access to thousands of pretrained language models. Fine-tuning a pretrained model on your specific task typically produces better results than training from scratch.
- Generative AI: Diffusion models power modern image generation (Stable Diffusion, DALL-E). Understanding the underlying mathematics of noise scheduling, denoising, and latent space representation opens up creative applications.
- Deployment: ONNX lets you export models for cross-platform inference. TensorFlow Lite targets mobile devices. TorchScript and TorchServe handle production serving of PyTorch models.
- Large-scale training: PyTorch's DistributedDataParallel and Fully Sharded Data Parallel (FSDP) enable multi-GPU and multi-node training for large models.
If you are interested in the research side, Google's JAX framework is worth exploring. Its functional programming approach, just-in-time compilation, and native support for automatic vectorization and parallelization make it popular in labs working on cutting-edge architectures.
Key Takeaways
- Deep learning uses layered neural networks to learn hierarchical representations of data. The "deep" refers to the number of layers, and backpropagation trains the network by adjusting weights to minimize prediction errors.
- PyTorch is the recommended starting framework for learning deep learning in 2026. Its dynamic computation graphs, Pythonic design, and strong community make it accessible for beginners and powerful enough for production. Keras is the best alternative if you want minimal boilerplate and fast prototyping.
- Match the architecture to the data type. Use feedforward networks for tabular data, CNNs for images and spatial data, RNNs/LSTMs for sequences, and Transformers for text and large-scale generative tasks.
- Training technique matters as much as architecture. Learning rate selection, data augmentation, regularization (dropout, weight decay), and proper train/validation/test splits are what separate a model that works from one that overfits or underperforms.
- Transfer learning accelerates everything. For real-world applications, fine-tuning a pretrained model on your specific data almost always outperforms training from scratch, saves compute time, and requires less training data.
Deep learning is a skill best learned by building. Start with the MNIST example above, swap in a CNN architecture, try it on a different dataset, and work your way toward more complex problems. The frameworks handle the heavy lifting of GPU acceleration, automatic differentiation, and distributed training. Your job is to understand the data, choose the right architecture, and tune the training process until the model generalizes well.