Standard neural networks treat every input as independent, which makes them a poor fit for data where order matters. Recurrent Neural Networks (RNNs) solve this by maintaining a hidden state that carries information from one time step to the next. Long Short-Term Memory (LSTM) networks take this further by using gating mechanisms that let the network decide what to remember and what to forget across long sequences. This article walks through both architectures in Python with working code you can adapt to your own projects.
If you have ever tried to predict the next word in a sentence, forecast tomorrow's stock price, or generate music note by note, you have encountered a problem where the order of data points carries meaning. A feedforward neural network has no concept of sequence -- it processes each input in isolation. RNNs were designed specifically to handle this limitation by introducing a loop that feeds the output of one time step back into the network as input for the next. This article covers the theory behind RNNs and LSTMs, then moves into hands-on Python implementations using both TensorFlow/Keras and PyTorch.
What Are Recurrent Neural Networks?
A Recurrent Neural Network introduces the concept of memory into a neural network by maintaining a hidden state. At each time step, the network receives two inputs: the current data point and the hidden state from the previous time step. It then produces an output and an updated hidden state that gets passed forward. This recurrent connection is what gives the architecture its name.
Mathematically, a simple RNN cell computes the following at each time step t:
# RNN cell computation (pseudocode)
# h_t = tanh(W_hh * h_(t-1) + W_xh * x_t + b_h)
# y_t = W_hy * h_t + b_y
# Where:
# x_t = input at time step t
# h_(t-1) = hidden state from previous time step
# h_t = new hidden state
# y_t = output at time step t
# W_hh = weight matrix for hidden-to-hidden connections
# W_xh = weight matrix for input-to-hidden connections
# W_hy = weight matrix for hidden-to-output connections
# b_h, b_y = bias terms
The tanh activation function squashes values into the range of -1 to +1, which helps regulate the flow of information through the network. The key insight is that the same weight matrices (W_hh, W_xh, W_hy) are shared across all time steps. This parameter sharing is what allows the network to generalize across different sequence lengths.
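To make these equations concrete, here is a minimal NumPy sketch of the recurrence, unrolled over a short sequence. The dimensions, random weights, and the `rnn_step` helper name are illustrative assumptions, not part of any library:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

# The same weight matrices are reused at every time step
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((output_size, 1))

def rnn_step(x_t, h_prev):
    """One application of the recurrence from the equations above."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Unroll over a 5-step sequence, feeding each hidden state forward
h = np.zeros((hidden_size, 1))
for t in range(5):
    x_t = rng.standard_normal((input_size, 1))
    h, y = rnn_step(x_t, h)
    print(f"step {t}: hidden state norm = {np.linalg.norm(h):.4f}")
```

Notice that nothing in the loop depends on the sequence length -- the same `rnn_step` handles a sequence of 5 steps or 500.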
Common applications of RNNs include natural language processing tasks like machine translation, speech recognition, sentiment analysis, time series forecasting, and music generation. Any problem where the data has a sequential or temporal structure is a candidate for an RNN-based approach.
When people say "RNN," they are often referring to the entire family of recurrent architectures, including LSTMs and GRUs. When distinguishing between them, the basic version is sometimes called a "vanilla RNN" or "simple RNN."
The Vanishing Gradient Problem
The simple RNN has a critical flaw that limits its practical usefulness: the vanishing gradient problem. During training, neural networks learn by backpropagation -- computing how much each weight contributed to the error and adjusting accordingly. In an RNN, this process is called Backpropagation Through Time (BPTT), where gradients are computed across every time step in the sequence.
The problem arises because gradients are multiplied together at each time step during backpropagation. When those gradient values are less than 1 (which is common with the tanh activation), repeated multiplication causes them to shrink exponentially. After just 10 or 20 time steps, the gradient can become so small that the earliest parts of the sequence have essentially zero influence on the weight updates. The network effectively "forgets" what it saw at the beginning of the sequence.
The reverse can also happen. If gradient values are greater than 1, they can grow exponentially, causing what is known as the exploding gradient problem. Exploding gradients are easier to handle -- gradient clipping (capping the gradient at a maximum value) works well. Vanishing gradients require architectural changes, which is exactly what LSTM networks provide.
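Gradient clipping by norm is simple enough to sketch in plain NumPy, independent of any framework (the function name and threshold here are illustrative):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploded = np.array([30.0, 40.0])       # norm 50.0 -- far too large
clipped = clip_by_norm(exploded, max_norm=1.0)
print(np.linalg.norm(clipped))          # ~1.0; direction is preserved
```

Rescaling (rather than clamping each element) keeps the gradient's direction intact, which is why norm-based clipping is the common default in deep learning frameworks.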
# Demonstrating gradient decay over time steps
import numpy as np
# Simulate gradient flow through 20 time steps
# with a typical weight magnitude
weight = 0.7
gradient = 1.0
print("Time Step | Gradient Magnitude")
print("-" * 35)
for t in range(1, 21):
    gradient *= weight
    print(f"   {t:2d}     | {gradient:.10f}")
# After 20 steps with weight=0.7:
# gradient = 0.7^20 = 0.0007979227
# The signal has nearly vanished
This simple demonstration shows why a vanilla RNN struggles with sequences longer than about 10-20 steps. The gradient decays so rapidly that the network cannot learn dependencies that span more than a few time steps. This was the core problem that Sepp Hochreiter and Jürgen Schmidhuber set out to solve when they introduced the LSTM architecture in 1997.
LSTM Architecture and Gates
Long Short-Term Memory networks solve the vanishing gradient problem by introducing a cell state -- a dedicated pathway that carries information across time steps with minimal interference. The cell state acts like a conveyor belt: information can flow along it unchanged, and the network uses a system of gates to add or remove information at each step.
An LSTM cell has three gates, each implemented as a small neural network with a sigmoid activation that outputs values between 0 and 1:
The Forget Gate decides what information to discard from the cell state. It looks at the previous hidden state and the current input, then outputs a value between 0 (completely forget) and 1 (completely keep) for each element of the cell state. For example, when a language model encounters a new subject in a sentence, the forget gate can clear the old subject from memory.
The Input Gate decides what new information to store in the cell state. It has two parts: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of candidate values. These are multiplied together, and the result is added to the cell state.
The Output Gate decides what to output based on the cell state. The cell state is passed through tanh (scaling values to the -1 to 1 range), then multiplied by the output of a sigmoid layer that determines which parts of the cell state to expose as the hidden state.
# LSTM gate equations (runnable sketch)
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, weights):
    """
    Single LSTM cell forward pass.

    Parameters:
        x_t    : input at current time step
        h_prev : hidden state from previous time step
        c_prev : cell state from previous time step
        weights: dictionary of weight matrices and biases
    """
    # Concatenate input and previous hidden state
    combined = np.concatenate([h_prev, x_t])
    # Forget gate: what to discard from the cell state
    f_t = sigmoid(weights['Wf'] @ combined + weights['bf'])
    # Input gate: what new information to store
    i_t = sigmoid(weights['Wi'] @ combined + weights['bi'])
    # Candidate values to potentially add to the cell state
    c_candidate = np.tanh(weights['Wc'] @ combined + weights['bc'])
    # Update the cell state
    c_t = f_t * c_prev + i_t * c_candidate
    # Output gate: what to expose as the hidden state
    o_t = sigmoid(weights['Wo'] @ combined + weights['bo'])
    # New hidden state
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
The critical design decision in the LSTM is how the cell state is updated. Notice that the new cell state (c_t) is computed as a linear combination of the old cell state and the candidate values. There is no weight matrix multiplied against c_prev -- just an element-wise multiplication by the forget gate. This additive structure is what prevents gradients from vanishing, because the gradient can flow backward through the cell state with minimal decay.
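A deliberately simplified scalar sketch makes the contrast vivid: compare repeated multiplication (the shape of a vanilla RNN's hidden-state path) with a forget-gated additive path (the shape of an LSTM's cell state). The specific constants are illustrative:

```python
import numpy as np

steps = 50

# Vanilla-RNN-style path: repeated multiplication by a weight < 1
signal_rnn = 1.0
for _ in range(steps):
    signal_rnn *= 0.7              # shrinks exponentially

# LSTM-style path: forget gate near 1 preserves the signal
signal_lstm = 1.0
for _ in range(steps):
    signal_lstm = 0.99 * signal_lstm   # only mild element-wise decay

print(f"after {steps} steps: multiplicative path = {signal_rnn:.2e}")
print(f"after {steps} steps: gated additive path = {signal_lstm:.4f}")
```

The multiplicative path has all but vanished after 50 steps, while the gated path retains most of its magnitude -- and the network can learn to set the forget gate even closer to 1 when a memory must persist.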
You do not need to fully understand every matrix operation inside an LSTM to use one effectively. Frameworks like TensorFlow and PyTorch handle the internal math. Focus on understanding what the gates do conceptually, and spend your energy on data preparation, hyperparameter tuning, and architecture decisions like the number of layers and hidden units.
Building an RNN from Scratch in Python
Before reaching for a framework, building a simple RNN from scratch with NumPy helps solidify the core concepts. The following implementation creates a character-level RNN that learns to predict the next character in a sequence.
import numpy as np
class SimpleRNN:
    """
    A minimal character-level RNN built with NumPy.
    Demonstrates the forward pass, loss computation,
    and backpropagation through time (BPTT).
    """
    def __init__(self, input_size, hidden_size, output_size,
                 learning_rate=0.01):
        self.hidden_size = hidden_size
        self.lr = learning_rate
        # Weight initialization using Xavier/Glorot scaling
        scale_xh = np.sqrt(2.0 / (input_size + hidden_size))
        scale_hh = np.sqrt(2.0 / (hidden_size + hidden_size))
        scale_hy = np.sqrt(2.0 / (hidden_size + output_size))
        self.W_xh = np.random.randn(hidden_size, input_size) * scale_xh
        self.W_hh = np.random.randn(hidden_size, hidden_size) * scale_hh
        self.W_hy = np.random.randn(output_size, hidden_size) * scale_hy
        self.b_h = np.zeros((hidden_size, 1))
        self.b_y = np.zeros((output_size, 1))

    def forward(self, inputs, h_prev):
        """
        Forward pass through the sequence.

        Parameters:
            inputs : list of one-hot encoded input vectors
            h_prev : initial hidden state

        Returns:
            outputs       : predicted probabilities at each step
            hidden_states : hidden states at each step
            h_final       : final hidden state
        """
        hidden_states = {-1: h_prev.copy()}
        outputs = {}
        for t, x_t in enumerate(inputs):
            # Compute new hidden state
            hidden_states[t] = np.tanh(
                self.W_xh @ x_t +
                self.W_hh @ hidden_states[t - 1] +
                self.b_h
            )
            # Compute output logits
            logits = self.W_hy @ hidden_states[t] + self.b_y
            # Apply softmax for probabilities
            exp_logits = np.exp(logits - np.max(logits))
            outputs[t] = exp_logits / np.sum(exp_logits)
        h_final = hidden_states[len(inputs) - 1]
        return outputs, hidden_states, h_final

    def compute_loss(self, outputs, targets):
        """Cross-entropy loss across the sequence."""
        loss = 0.0
        for t in range(len(targets)):
            loss -= np.log(outputs[t][targets[t], 0] + 1e-8)
        return loss
# --- Usage Example ---
# Define a tiny vocabulary
text = "hello world"
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
vocab_size = len(chars)
# Create the RNN
rnn = SimpleRNN(
    input_size=vocab_size,
    hidden_size=64,
    output_size=vocab_size,
    learning_rate=0.01
)

# Prepare a training sample
def encode_sequence(text, char_to_idx, vocab_size):
    """Convert text to a list of one-hot vectors."""
    encoded = []
    for ch in text:
        vec = np.zeros((vocab_size, 1))
        vec[char_to_idx[ch]] = 1.0
        encoded.append(vec)
    return encoded
inputs = encode_sequence(text[:-1], char_to_idx, vocab_size)
targets = [char_to_idx[ch] for ch in text[1:]]
# Run forward pass
h_init = np.zeros((64, 1))
outputs, hidden, h_final = rnn.forward(inputs, h_init)
loss = rnn.compute_loss(outputs, targets)
print(f"Initial loss: {loss:.4f}")
This from-scratch implementation reveals the inner mechanics: the hidden state is updated at each time step using the previous hidden state and the current input, the same weight matrices are reused at every step, and the output probabilities are computed via softmax. In practice, you would add backpropagation through time and gradient clipping to complete the training loop, but the forward pass alone illustrates the core RNN concept.
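To complete the training loop the paragraph above mentions, the following standalone sketch adds backpropagation through time for the same tanh/softmax architecture. It uses free functions rather than methods on the class, and the toy sequence, learning rate, and clipping threshold are arbitrary illustrative choices:

```python
import numpy as np

def forward(params, inputs, h0):
    """Tanh RNN forward pass with softmax outputs (mirrors SimpleRNN)."""
    W_xh, W_hh, W_hy, b_h, b_y = params
    hidden = {-1: h0}
    outputs = {}
    for t, x_t in enumerate(inputs):
        hidden[t] = np.tanh(W_xh @ x_t + W_hh @ hidden[t - 1] + b_h)
        logits = W_hy @ hidden[t] + b_y
        e = np.exp(logits - logits.max())
        outputs[t] = e / e.sum()
    return outputs, hidden

def bptt(params, inputs, targets, hidden, outputs):
    """Gradients of the cross-entropy loss w.r.t. every parameter."""
    W_xh, W_hh, W_hy, b_h, b_y = params
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
    dh_next = np.zeros_like(b_h)
    for t in reversed(range(len(inputs))):
        dy = outputs[t].copy()
        dy[targets[t]] -= 1.0                  # softmax + cross-entropy grad
        dW_hy += dy @ hidden[t].T
        db_y += dy
        dh = W_hy.T @ dy + dh_next             # from output and from step t+1
        dh_raw = (1.0 - hidden[t] ** 2) * dh   # back through tanh
        db_h += dh_raw
        dW_xh += dh_raw @ inputs[t].T
        dW_hh += dh_raw @ hidden[t - 1].T
        dh_next = W_hh.T @ dh_raw
    return [dW_xh, dW_hh, dW_hy, db_h, db_y]

# Toy task: learn the fixed sequence 0 -> 1 -> 2 -> 3 -> 4
rng = np.random.default_rng(0)
V, H = 5, 8
params = [rng.standard_normal((H, V)) * 0.1,
          rng.standard_normal((H, H)) * 0.1,
          rng.standard_normal((V, H)) * 0.1,
          np.zeros((H, 1)), np.zeros((V, 1))]
seq = [0, 1, 2, 3, 4]
inputs = [np.eye(V)[:, [i]] for i in seq[:-1]]
targets = seq[1:]
h0 = np.zeros((H, 1))

def seq_loss(params):
    outs, _ = forward(params, inputs, h0)
    return -sum(np.log(outs[t][targets[t], 0]) for t in range(len(targets)))

loss_before = seq_loss(params)
for _ in range(100):
    outs, hidden = forward(params, inputs, h0)
    grads = bptt(params, inputs, targets, hidden, outs)
    for p, g in zip(params, grads):
        p -= 0.1 * np.clip(g, -5, 5)   # SGD step with element-wise clipping
loss_after = seq_loss(params)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The backward pass walks the time steps in reverse, accumulating gradients into the shared weight matrices and threading `dh_next` backward -- the mirror image of how the hidden state was threaded forward.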
Building an LSTM with Keras
TensorFlow's Keras API provides a high-level interface for building LSTM networks without worrying about the gate-level math. The following example builds an LSTM for time series prediction -- one of the most common use cases.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
# --- Generate synthetic time series data ---
np.random.seed(42)
t = np.linspace(0, 100, 2000)
# Sine wave with noise (simulating a periodic signal)
data = np.sin(0.5 * t) + 0.1 * np.random.randn(len(t))
data = data.reshape(-1, 1)
# Scale to [0, 1] range -- important for LSTM performance
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)
# --- Create sequences for training ---
def create_sequences(data, seq_length):
    """
    Convert a time series into input/output pairs.
    Each input is a window of seq_length steps,
    and the output is the next value.
    """
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i + seq_length])
        y.append(data[i + seq_length])
    return np.array(X), np.array(y)
SEQ_LENGTH = 50
X, y = create_sequences(data_scaled, SEQ_LENGTH)
# Split into train and test sets (80/20)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Reshape for LSTM: [samples, timesteps, features]
# X is already shaped correctly from create_sequences
print(f"Training shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
# --- Build the LSTM model ---
model = Sequential([
    LSTM(128, input_shape=(SEQ_LENGTH, 1),
         return_sequences=True),
    Dropout(0.2),
    LSTM(64, return_sequences=False),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(1)
])
model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)
model.summary()
# --- Train the model ---
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.1,
    verbose=1
)
# --- Evaluate and predict ---
test_loss, test_mae = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest Loss (MSE): {test_loss:.6f}")
print(f"Test MAE: {test_mae:.6f}")
# Make predictions
predictions = model.predict(X_test)
# Inverse transform to original scale
predictions_original = scaler.inverse_transform(predictions)
y_test_original = scaler.inverse_transform(y_test)
print(f"\nSample predictions vs actual:")
for i in range(5):
    print(f"  Predicted: {predictions_original[i][0]:.4f} | "
          f"Actual: {y_test_original[i][0]:.4f}")
There are several important details in this implementation. The data is scaled to the 0-1 range using MinMaxScaler; scaling matters because the sigmoid and tanh activations inside the LSTM saturate for large-magnitude inputs, which stalls learning. The input is shaped into three dimensions -- [samples, timesteps, features] -- which is the format that Keras LSTM layers expect. The first LSTM layer uses return_sequences=True so it outputs a full sequence (needed when stacking LSTM layers), while the second LSTM layer outputs only the final hidden state. Dropout layers between the LSTMs help prevent overfitting.
When stacking multiple LSTM layers, all layers except the last one must have return_sequences=True. This ensures that each intermediate layer passes its full output sequence to the next layer, rather than just the final time step.
Building an LSTM with PyTorch
PyTorch gives you more explicit control over the model architecture and training loop. This is useful when you need custom behavior that Keras' high-level API does not easily support. The following example implements the same time series prediction task using PyTorch.
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
class LSTMPredictor(nn.Module):
    """
    LSTM model for time series prediction in PyTorch.
    """
    def __init__(self, input_size=1, hidden_size=128,
                 num_layers=2, dropout=0.2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # LSTM layer
        # batch_first=True means input shape is
        # (batch, seq_len, features)
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout
        )
        # Fully connected output layers
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        """
        Forward pass.

        Parameters:
            x : tensor of shape (batch, seq_len, features)

        Returns:
            output : predictions of shape (batch, 1)
        """
        # Initialize hidden state and cell state with zeros
        h0 = torch.zeros(
            self.num_layers, x.size(0), self.hidden_size
        ).to(x.device)
        c0 = torch.zeros(
            self.num_layers, x.size(0), self.hidden_size
        ).to(x.device)
        # LSTM forward pass
        # lstm_out shape: (batch, seq_len, hidden_size)
        lstm_out, (h_n, c_n) = self.lstm(x, (h0, c0))
        # Use only the last time step's output
        last_output = lstm_out[:, -1, :]
        # Pass through the fully connected layers
        prediction = self.fc(last_output)
        return prediction
# --- Prepare data (reusing the synthetic data from above) ---
np.random.seed(42)
t = np.linspace(0, 100, 2000)
data = np.sin(0.5 * t) + 0.1 * np.random.randn(len(t))
# Normalize
data_mean = data.mean()
data_std = data.std()
data_norm = (data - data_mean) / data_std
# Create sequences
SEQ_LENGTH = 50
def create_torch_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i + seq_length])
        y.append(data[i + seq_length])
    return (
        torch.FloatTensor(np.array(X)).unsqueeze(-1),
        torch.FloatTensor(np.array(y)).unsqueeze(-1)
    )
X, y = create_torch_sequences(data_norm, SEQ_LENGTH)
# Train/test split
split = int(0.8 * len(X))
train_dataset = TensorDataset(X[:split], y[:split])
test_dataset = TensorDataset(X[split:], y[split:])
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True
)
test_loader = DataLoader(
    test_dataset, batch_size=32, shuffle=False
)
# --- Initialize model, loss, optimizer ---
device = torch.device('cuda' if torch.cuda.is_available()
                      else 'cpu')
model = LSTMPredictor(
    input_size=1,
    hidden_size=128,
    num_layers=2,
    dropout=0.2
).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# --- Training loop ---
EPOCHS = 50
for epoch in range(EPOCHS):
    model.train()
    train_loss = 0.0
    for batch_X, batch_y in train_loader:
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)
        # Forward pass
        predictions = model(batch_X)
        loss = criterion(predictions, batch_y)
        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        # Gradient clipping to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(
            model.parameters(), max_norm=1.0
        )
        optimizer.step()
        train_loss += loss.item()
    avg_loss = train_loss / len(train_loader)
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{EPOCHS}] "
              f"Loss: {avg_loss:.6f}")
# --- Evaluation ---
model.eval()
test_predictions = []
test_actuals = []
with torch.no_grad():
    for batch_X, batch_y in test_loader:
        batch_X = batch_X.to(device)
        preds = model(batch_X)
        test_predictions.append(preds.cpu().numpy())
        test_actuals.append(batch_y.numpy())
test_predictions = np.concatenate(test_predictions)
test_actuals = np.concatenate(test_actuals)
# Inverse normalize
test_predictions = test_predictions * data_std + data_mean
test_actuals = test_actuals * data_std + data_mean
mse = np.mean((test_predictions - test_actuals) ** 2)
print(f"\nTest MSE: {mse:.6f}")
The PyTorch version makes several things more explicit. The hidden state and cell state (h0 and c0) are initialized manually. The LSTM returns both its output across all time steps and the final hidden/cell states as a tuple. Note that LSTM returns two state tensors (hidden state and cell state), while a GRU returns only one (hidden state). The training loop handles forward pass, loss computation, backpropagation, gradient clipping, and weight updates step by step.
Always use batch_first=True in PyTorch LSTM layers. The default (batch_first=False) expects input shaped as (seq_len, batch, features), which is counterintuitive if you are coming from Keras or working with DataLoaders that naturally batch along the first dimension.
GRU -- The Lightweight Alternative
The Gated Recurrent Unit (GRU) is a streamlined variant of the LSTM that combines the forget and input gates into a single update gate and merges the cell state with the hidden state. This simpler design means fewer parameters, faster training, and comparable performance on many tasks.
A GRU has two gates: the update gate controls how much of the previous hidden state to carry forward, and the reset gate determines how much of the previous state to ignore when computing the candidate hidden state. Because there is no separate cell state, GRUs are more memory-efficient and train faster than LSTMs.
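Mirroring the lstm_cell sketch from earlier, a GRU cell fits in a few lines of NumPy. The weight names are illustrative assumptions, and note that gate conventions vary slightly between texts (some swap the roles of z and 1 - z in the final interpolation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, w):
    """Single GRU cell forward pass -- note there is no cell state."""
    combined = np.concatenate([h_prev, x_t])
    # Update gate: how much of the candidate vs. the previous state to use
    z_t = sigmoid(w['Wz'] @ combined + w['bz'])
    # Reset gate: how much of the previous state to ignore
    r_t = sigmoid(w['Wr'] @ combined + w['br'])
    # Candidate hidden state is computed from the *reset* previous state
    combined_reset = np.concatenate([r_t * h_prev, x_t])
    h_candidate = np.tanh(w['Wh'] @ combined_reset + w['bh'])
    # Interpolate between the old state and the candidate
    h_t = (1.0 - z_t) * h_prev + z_t * h_candidate
    return h_t
```

Compare this with lstm_cell: two gates instead of three, one returned state instead of two, and no separate cell-state pathway.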
# Swapping LSTM for GRU in Keras -- it is a one-line change
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Dropout
model_gru = Sequential([
    GRU(128, input_shape=(50, 1), return_sequences=True),
    Dropout(0.2),
    GRU(64, return_sequences=False),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(1)
])
model_gru.compile(optimizer='adam', loss='mse')
model_gru.summary()
# Swapping LSTM for GRU in PyTorch
class GRUPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=128,
                 num_layers=2, dropout=0.2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # Simply replace nn.LSTM with nn.GRU
        self.gru = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        # GRU returns (output, h_n) -- no cell state
        h0 = torch.zeros(
            self.num_layers, x.size(0), self.hidden_size
        ).to(x.device)
        gru_out, h_n = self.gru(x, h0)
        last_output = gru_out[:, -1, :]
        return self.fc(last_output)
The practical difference between LSTM and GRU often comes down to dataset size and sequence length. GRUs tend to perform comparably on shorter sequences and smaller datasets while training faster. LSTMs may hold an advantage on longer sequences where the separate cell state provides more nuanced memory control. When in doubt, try both and compare their validation performance on your specific task.
Practical Tips and Common Pitfalls
Scale your data. LSTM and GRU gates use sigmoid and tanh activations that are sensitive to input scale. Feeding raw, unscaled data will result in poor performance. Use MinMaxScaler (scales to 0-1) or StandardScaler (zero mean, unit variance) from scikit-learn. Always fit the scaler on training data only, then transform both training and test sets.
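The fit-on-train-only rule looks like this in practice (a sketch using scikit-learn's MinMaxScaler with a made-up series and split):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

series = np.arange(100, dtype=float).reshape(-1, 1)
train, test = series[:80], series[80:]

scaler = MinMaxScaler()
scaler.fit(train)            # learn min/max from TRAINING data only

train_scaled = scaler.transform(train)   # spans [0, 1]
test_scaled = scaler.transform(test)     # may exceed 1 -- that is expected

print(train_scaled.min(), train_scaled.max())
print(test_scaled.max())
```

Test values falling slightly outside [0, 1] is normal and honest; fitting the scaler on the full series instead would leak information about the test range into training.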
Choose the right sequence length. Too short, and the model cannot capture meaningful patterns. Too long, and training becomes slow with diminishing returns. Start with a sequence length that covers one or two full cycles of the pattern you expect to find, then experiment. For time series data, examine autocorrelation plots to identify natural lag structures.
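A quick way to inspect lag structure without extra libraries is a plain NumPy autocorrelation estimate (a rough sketch; dedicated tools such as statsmodels produce polished autocorrelation plots):

```python
import numpy as np

def autocorr(x, max_lag):
    """Normalized autocorrelation of a 1-D series up to max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.sum(x * x)
    return np.array([
        np.sum(x[:len(x) - k] * x[k:]) / var
        for k in range(max_lag + 1)
    ])

# A periodic signal shows peaks at multiples of its period
t = np.arange(400)
signal = np.sin(2 * np.pi * t / 50)   # period of 50 steps
acf = autocorr(signal, max_lag=120)
print(f"lag 0:  {acf[0]:.2f}")    # always 1.0 by construction
print(f"lag 25: {acf[25]:.2f}")   # trough at half the period
print(f"lag 50: {acf[50]:.2f}")   # peak near the full period
```

A strong peak at lag 50 suggests a sequence length of at least 50-100 steps for this signal, matching the one-to-two-cycles rule of thumb above.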
Use gradient clipping. Even with LSTMs, very long sequences can still produce large gradients. Clip gradients to a maximum norm (typically 1.0 to 5.0) to stabilize training. In PyTorch, use torch.nn.utils.clip_grad_norm_. In Keras, pass clipnorm=1.0 to the optimizer.
Apply dropout correctly. Dropout between LSTM layers (not within the recurrent connections themselves) is the standard approach. In Keras, use the Dropout layer between LSTM layers. In PyTorch, the dropout parameter in nn.LSTM applies dropout to the outputs of all but the last LSTM layer -- it has no effect if num_layers=1.
Bidirectional LSTMs. For tasks where future context is available (like text classification, where you have the entire sentence), use bidirectional LSTMs. They process the sequence both forward and backward, giving the model access to context in both directions. In Keras, wrap the LSTM layer with Bidirectional(LSTM(...)). In PyTorch, set bidirectional=True in the nn.LSTM constructor.
Do not use bidirectional LSTMs for tasks where you are predicting the future (such as time series forecasting or next-word prediction). The backward pass would give the model access to information it would not have at inference time, resulting in data leakage and unrealistically good training metrics that fail in production.
Stateful vs. stateless LSTMs. By default, LSTM layers reset their hidden state between batches. This means each batch is treated as an independent sequence. For very long sequences (like an entire book or continuous sensor data), you may want to set stateful=True in Keras so the hidden state carries over between batches. This requires fixed batch sizes and manual state resets between epochs.
When to consider Transformers instead. While LSTMs remain valuable, Transformer architectures have become dominant for many sequence tasks, particularly in natural language processing. If your sequences are very long (hundreds or thousands of steps), if you have access to large datasets, or if parallelism during training is important, Transformers are likely the better choice. LSTMs still shine for smaller datasets, embedded/resource-constrained environments, and tasks where sequential processing is a natural fit like streaming sensor data.
Key Takeaways
- RNNs introduce memory into neural networks by maintaining a hidden state that passes information from one time step to the next. This makes them suitable for sequential data like text, audio, and time series.
- Vanilla RNNs suffer from vanishing gradients, which limits their ability to learn dependencies across more than about 10-20 time steps. This is why LSTM and GRU variants were developed.
- LSTMs use three gates and a cell state to control information flow. The forget gate discards irrelevant information, the input gate adds new information, and the output gate determines what to expose. The additive cell state update prevents gradient decay.
- GRUs are a simpler alternative that combine the forget and input gates into an update gate and eliminate the separate cell state. They train faster and often perform comparably, especially on shorter sequences.
- Data preparation is critical. Always scale your input data, shape it correctly for the framework you are using (samples, timesteps, features), and choose an appropriate sequence length based on the patterns in your data.
- Both TensorFlow/Keras and PyTorch provide robust LSTM implementations. Keras offers a higher-level API that is faster to prototype with, while PyTorch gives more fine-grained control over the training process.
Recurrent architectures remain a foundational part of the deep learning toolkit. Even as Transformers have taken over many tasks, understanding RNNs and LSTMs gives you the vocabulary and intuition needed to work with sequential data of any kind. The gating mechanisms pioneered by LSTMs directly influenced the attention mechanisms that power modern Transformer models. Start with the Keras examples above to get results quickly, then move to PyTorch when you need more control over your training pipeline.