PyTorch is the framework of choice for researchers, AI labs, and production teams building everything from large language models to computer vision pipelines. It feels like Python, debugs like Python, and grows with you as your projects get more ambitious. This guide walks you through the fundamentals from the ground up — tensors, autograd, building models, and what changed in the latest releases.
PyTorch started inside Facebook's AI Research lab (now part of Meta) and was released as open source in 2016. Where other frameworks of that era required you to compile a static computation graph before running anything, PyTorch ran operations dynamically — line by line, just like regular Python. That design decision made it far easier to experiment, inspect, and debug. Today it is governed by the independent PyTorch Foundation under the Linux Foundation, with contributions from across the industry, and it sits at the center of a huge ecosystem that includes Hugging Face Transformers, Meta's Llama models, Tesla Autopilot, and more.
What Is PyTorch?
PyTorch is an open-source machine learning framework built around two core ideas: a flexible tensor library that can run on CPUs and GPUs, and a dynamic computation graph that builds itself as your code runs. Every operation you perform on a tensor creates a node in that graph automatically, and PyTorch uses that graph to compute gradients during training — the mathematical process that teaches neural networks to improve.
The framework is described by its maintainers as providing tensors and dynamic neural networks in Python with strong GPU acceleration. What that means in practice is that you write ordinary Python, and PyTorch figures out how to run it efficiently. There is no compilation step, no special domain language, and no waiting for a graph to be built before you can inspect a value.
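To make that concrete, here is a minimal sketch of eager execution: every line runs the moment Python reaches it, so intermediate values can be printed or inspected immediately.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = x * 2              # executes immediately; there is no compile step first
print(y)               # tensor([2., 4., 6.]), inspectable right away
z = y.sum()
print(z.item())        # 12.0, pulled out as a plain Python float
```

You can drop a breakpoint or a print anywhere in this flow, which is exactly what makes debugging feel like ordinary Python.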
"PyTorch is designed to be intuitive, linear in thought, and easy to use. When you execute a line of code, it gets executed." — PyTorch documentation
Working with Tensors
A tensor is a multi-dimensional array — the fundamental data structure in PyTorch. A single number is a rank-0 tensor. A list of numbers is a rank-1 tensor (a vector). A table of numbers is a rank-2 tensor (a matrix). A stack of matrices is a rank-3 tensor, and so on. If you have used NumPy, tensors will feel immediately familiar. The key difference is that PyTorch tensors can be moved to a GPU with a single method call, and they track operations for automatic differentiation.
import torch
# Create tensors from Python data
scalar = torch.tensor(3.14)
vector = torch.tensor([1.0, 2.0, 3.0])
matrix = torch.tensor([[1, 2, 3],
                       [4, 5, 6]])
# Check shape and dtype
print(matrix.shape) # torch.Size([2, 3])
print(matrix.dtype) # torch.int64
# Create tensors with built-in helpers
zeros = torch.zeros(3, 4) # 3x4 matrix of 0s
ones = torch.ones(3, 4) # 3x4 matrix of 1s
randoms = torch.randn(3, 4) # 3x4 random normal values
# Math works element-wise, just like NumPy
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b) # tensor([5., 7., 9.])
print(torch.dot(a, b)) # tensor(32.) (dot product)
# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
a = a.to(device)
# Convert back to NumPy (CPU only)
numpy_array = a.cpu().numpy()
Always match your tensor's device to your model's device before passing data in. A mismatch between CPU and GPU tensors is one of the most common errors beginners hit, and the error message PyTorch gives you is clear — it will tell you exactly which devices are involved.
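As a sketch of that pattern (using a small nn.Linear stand-in for a real model), move both the model and each batch to the same device before the forward pass:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(4, 2).to(device)  # model parameters now live on `device`
x = torch.randn(8, 4)               # new tensors start on the CPU

# On a GPU machine, calling model(x) here would raise a RuntimeError that
# names both devices. Moving the input first avoids the mismatch:
x = x.to(device)
out = model(x)
print(out.shape)  # torch.Size([8, 2])
```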
Automatic Differentiation with Autograd
Training a neural network requires computing gradients — specifically, how much each weight in the network contributed to the error on the current batch of data. Computing these by hand would be impossibly tedious for networks with millions of parameters. PyTorch's autograd system handles it automatically.
When you create a tensor with requires_grad=True, PyTorch watches every operation performed on it. When you call .backward() on a loss value, it walks that history in reverse and computes the gradient for every tracked tensor. Those gradients are then used by the optimizer to update the weights.
import torch
# A simple example of autograd
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x + 1 # y = x^2 + 2x + 1
# Compute gradient of y with respect to x
y.backward()
# dy/dx at x=3 should be 2x + 2 = 8
print(x.grad) # tensor(8.)
# In practice, autograd tracks your model weights automatically
w = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
x_input = torch.randn(10, 4)
output = x_input @ w + b # matrix multiply + bias
loss = output.sum()
loss.backward() # gradients now available in w.grad and b.grad
print(w.grad.shape) # torch.Size([4, 3])
PyTorch accumulates gradients by default — calling .backward() multiple times adds to the existing .grad values rather than replacing them. Always call optimizer.zero_grad() at the start of each training step to clear the previous batch's gradients before computing new ones.
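A minimal sketch of the accumulation behavior: two backward passes through the same computation double the stored gradient until it is explicitly cleared.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x ** 2).backward()   # dy/dx = 2x = 6
print(x.grad)         # tensor(6.)

(x ** 2).backward()   # a second backward pass ADDS to .grad
print(x.grad)         # tensor(12.), not 6

x.grad.zero_()        # what optimizer.zero_grad() does for model parameters
print(x.grad)         # tensor(0.)
```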
Installing PyTorch
The recommended way to install PyTorch is through the official install selector at pytorch.org, which generates the correct pip or conda command for your operating system, Python version, and hardware. The current stable release as of March 2026 is PyTorch 2.10.0, released January 21, 2026. It requires Python 3.10 or higher.
# CPU-only install (works on any machine, good for learning)
pip install torch torchvision torchaudio
# Verify the installation
import torch
print(torch.__version__) # 2.10.0
print(torch.cuda.is_available()) # True if you have a compatible NVIDIA GPU
# Apple Silicon (M-series Macs) — use MPS backend
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
Google Colab provides free GPU access with PyTorch pre-installed and is one of the best ways to start learning without any local setup. For Apple Silicon Mac users, PyTorch's Metal Performance Shaders (MPS) backend gives you GPU acceleration on M-series chips without needing NVIDIA hardware.
Building Models with torch.nn
PyTorch's neural network building blocks live in the torch.nn module. The central concept is nn.Module — a base class that every model and layer inherits from. You define your network by subclassing nn.Module, declaring your layers in __init__, and specifying how data flows through them in the forward method.
This design gives you full control over the network architecture. The forward method is just Python — you can use any control flow, call other functions, or branch based on input values. PyTorch records the operations through autograd regardless.
import torch
import torch.nn as nn
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Define layers as attributes
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.3)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Define the data flow — pure Python
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
# Instantiate and inspect
model = SimpleNet(input_size=784, hidden_size=128, output_size=10)
print(model)
# Count trainable parameters
params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {params:,}")
Using nn.Sequential for Simpler Architectures
When your model is a straightforward stack of layers with no branching, nn.Sequential lets you define it without writing a class. It is more concise but less flexible — a good choice for quick experiments.
import torch
import torch.nn as nn
# Equivalent to SimpleNet above, more compact
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(128, 10)
)
# Forward pass works the same way
x = torch.randn(32, 784) # batch of 32 samples
output = model(x)
print(output.shape) # torch.Size([32, 10])
Writing a Training Loop
Unlike frameworks that provide a fit() method, PyTorch puts you in charge of the training loop. This means more code up front, but it also means complete visibility into what is happening at every step — and the freedom to customize anything. Below is a complete, runnable example training a classifier on the MNIST handwritten digit dataset.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# --- Setup ---
device = "cuda" if torch.cuda.is_available() else "cpu"
# --- Data ---
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False)
# --- Model ---
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        return self.net(x)
model = MNISTNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# --- Training Loop ---
for epoch in range(5):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()              # 1. Clear previous gradients
        outputs = model(images)            # 2. Forward pass
        loss = criterion(outputs, labels)  # 3. Compute loss
        loss.backward()                    # 4. Backpropagate
        optimizer.step()                   # 5. Update weights
        running_loss += loss.item()
    avg_loss = running_loss / len(train_loader)

    # Evaluate on test set
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # Disable gradient tracking for evaluation
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print(f"Epoch {epoch+1}/5 | Loss: {avg_loss:.4f} | Test Acc: {correct/total:.4f}")
# Epoch 5/5 | Loss: 0.0812 | Test Acc: 0.9831
The five-step pattern inside the training loop — zero gradients, forward pass, compute loss, backward pass, optimizer step — is the core rhythm of PyTorch training. Learn that pattern and you can train almost any model. The model.eval() call before evaluation is important: it disables dropout and switches batch normalization layers to use running statistics rather than batch statistics, which is the correct behavior for inference.
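A small sketch of what model.eval() changes for a dropout layer: in training mode some activations are zeroed (and the survivors rescaled), while in eval mode the layer passes values through unchanged.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

drop.train()            # training mode: randomly zeroes roughly half the
out_train = drop(x)     # values, scaling survivors by 1/(1-p) = 2
print(out_train)        # a mix of 0s and 2s

drop.eval()             # eval mode: dropout becomes the identity
out_eval = drop(x)
print(out_eval)         # tensor([[1., 1., ..., 1.]])
```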
Save your trained model with torch.save(model.state_dict(), 'model.pth') and reload it with model.load_state_dict(torch.load('model.pth')). Saving only the state_dict (the weights) rather than the whole model object is the recommended approach — it is more portable and does not depend on your class definition being in a specific location at load time.
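The full round trip looks like this. A single nn.Linear layer stands in for any trained model here to keep the sketch short; the pattern is identical for MNISTNet or any other nn.Module subclass.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for any trained model

# Save only the weights
torch.save(model.state_dict(), "model.pth")

# To reload: re-create the architecture, then load the weights into it
restored = nn.Linear(10, 2)
restored.load_state_dict(torch.load("model.pth"))
restored.eval()  # switch to inference mode before making predictions
```

Recent PyTorch releases default torch.load to weights_only=True, which is safe and sufficient for plain state_dicts like this one.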
What Is New in PyTorch 2.x
The PyTorch 2.x series has been one of the more significant stretches of development in the framework's history. The current stable release is PyTorch 2.10.0, published January 21, 2026. Here are the changes that matter for learners and practitioners.
torch.compile — The Flagship Feature of the 2.x Series
Introduced in PyTorch 2.0, torch.compile() is a single-line addition that can make your models run substantially faster on modern GPUs without changing any other code. You wrap your model with it after creation, and PyTorch captures the computation graph, compiles it with TorchInductor, and generates optimized machine code — including GPU kernel fusion via OpenAI Triton. On newer NVIDIA GPUs (A100, H100, Blackwell), the speedups are measurable and often significant for training workloads.
import torch
model = MNISTNet().to(device)
# One line to potentially speed up training and inference
model = torch.compile(model)
# Everything else stays the same
# optimizer, training loop, evaluation — unchanged
torch.compile has a warm-up cost on the first few batches while it traces and compiles the graph. This is normal. The speedup kicks in once compilation finishes, so short training runs may not show much benefit — it pays off on longer jobs and repeated inference.
Python 3.14 Support and Deterministic Compilation (2.10)
PyTorch 2.10 added full Python 3.14 support for torch.compile(), with the free-threaded Python 3.14t build available experimentally. A long-requested debugging feature also landed: torch.compile now respects torch.use_deterministic_algorithms(True), which forces two compilation runs on the same input to produce identical results. This makes it far easier to track down subtle numerical bugs in compiled models. A new DebugMode tool was also added to help isolate numerical divergence between model versions by attaching deterministic hashes to tensors at runtime.
TorchScript Deprecated — Use torch.export Instead
TorchScript, the older mechanism for serializing and deploying PyTorch models outside of Python, was deprecated in PyTorch 2.10. Its replacement is torch.export, which uses the same graph capture technology underlying torch.compile and is better suited to modern deployment scenarios including mobile and edge targets.
If your deployment pipeline uses torch.jit.script or torch.jit.trace (TorchScript), plan your migration to torch.export. TorchScript is deprecated as of 2.10, meaning it will receive no new features and will eventually be removed.
Expanding Hardware Support
PyTorch 2.9 expanded wheel support to include AMD ROCm, Intel XPU, and NVIDIA CUDA 13 in a single release, making it easier to install the right build for your hardware without manually finding the correct index URL. PyTorch 2.10 continued this work by extending Intel GPU support to the latest Core Ultra Series 3 processors on both Windows and Linux, including FP8 support for faster low-precision inference.
Dynamic Computation Graph — Still the Foundation
All the new compiler work sits on top of PyTorch's original design principle: the computation graph is built dynamically as your code runs. This means your forward method can include Python conditionals, loops, and any control flow you need. torch.compile uses TorchDynamo to capture this dynamic graph safely — it traces Python bytecode at the CPython level and handles graph breaks (places where dynamic behavior prevents full compilation) by falling back to eager execution for those portions only.
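For instance, a forward method with data-dependent branching is perfectly legal eager PyTorch; under torch.compile, a branch like this is a potential graph break that gets handled by falling back to eager execution at that point. The module below is an illustrative sketch, not a recommended architecture.

```python
import torch
import torch.nn as nn

class BranchingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.low = nn.Linear(4, 2)
        self.high = nn.Linear(4, 2)

    def forward(self, x):
        # Ordinary Python control flow, decided per input at runtime
        if x.abs().mean() > 1.0:
            return self.high(x)
        return self.low(x)

model = BranchingNet()
print(model(torch.randn(3, 4)).shape)  # torch.Size([3, 2])
```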
Key Takeaways
- PyTorch runs eagerly by default. Code executes line by line, making it easy to inspect values, set breakpoints, and debug — just like regular Python. This is what made PyTorch the dominant framework for research.
- Autograd handles gradients automatically. Set requires_grad=True on a tensor, perform operations, call .backward() on the loss, and gradients appear in .grad. Always call optimizer.zero_grad() before each training step.
- Build models by subclassing nn.Module. Define layers in __init__, define the data flow in forward. The training loop — zero grads, forward, loss, backward, step — is a pattern worth ingraining from the start.
- torch.compile is one line and often free speed. Add it after creating any model. It works best on longer training runs with modern NVIDIA GPUs, and it is backward-compatible with all existing PyTorch code.
- The current stable release is 2.10.0. Released January 21, 2026, it requires Python 3.10 or higher. TorchScript is deprecated — use torch.export for model serialization going forward.
PyTorch rewards the time you spend learning it. Because the training loop is explicit rather than hidden behind a fit() call, you develop a clear mental model of what is actually happening during training — gradients, weight updates, the forward and backward passes. That understanding transfers directly to debugging, fine-tuning, and custom architecture work. Start with the MNIST example above, then explore the official PyTorch tutorials for image classification, natural language processing, and transfer learning with pretrained models.