Deep Q-Networks (DQN) in Python: From Q-Tables to Neural Networks

Classical Q-learning stores every state-action pair in a lookup table. That works fine when your environment has a handful of discrete states, but the moment you face raw pixel inputs, continuous sensor readings, or anything resembling a real-world problem, the table explodes beyond what any machine can hold in memory. Deep Q-Networks solve this by replacing the table with a neural network that approximates Q-values, letting an agent generalize across states it has never seen before.

This article walks through a complete DQN implementation in Python using PyTorch and Gymnasium. Along the way, it covers the two core innovations that made DQN stable enough to learn Atari games from raw pixels—experience replay and target networks—and then extends the implementation with Double DQN and Dueling DQN, two widely adopted variants that improve accuracy and learning efficiency.

Why Q-Tables Break Down

In tabular Q-learning, the agent maintains a table where every row represents a state, every column represents an action, and each cell holds the estimated Q-value for that state-action pair. The update rule is straightforward: after taking action a in state s, observing reward r, and landing in state s', the agent adjusts the table entry using the Bellman equation:

# Tabular Q-learning update
Q[s, a] = Q[s, a] + alpha * (
    r + gamma * max(Q[s_next, :]) - Q[s, a]
)

This works when the state space is small and discrete. Consider the CartPole environment: it has four continuous state variables (cart position, cart velocity, pole angle, pole angular velocity). Even if you discretize each variable into just 20 bins, you end up with 20 x 20 x 20 x 20 = 160,000 states. Scale that to an Atari screen of 210 x 160 pixels with 128 possible color values, and the number of possible states becomes astronomically large. No table can hold that.
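The back-of-the-envelope arithmetic above is easy to verify in a couple of lines (a quick standalone check, not part of the agent code):

```python
import math

# Discretized CartPole: 4 state variables, 20 bins each
print(20 ** 4)  # 160000 table rows

# Atari frame: 210 x 160 pixels, 128 possible colors per pixel
digits = 210 * 160 * math.log10(128)
print(f"roughly 10^{digits:.0f} possible screens")
```

Even generously discretized, CartPole already demands a six-figure table; the Atari state count is a number with tens of thousands of digits.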

The solution is function approximation. Instead of storing a value for every state-action pair, you train a neural network to predict Q-values given a state as input. The network learns a compressed representation of the Q-function that generalizes to states the agent has never encountered during training.

Note

The original DQN paper by Mnih et al. (2015), published in Nature, demonstrated that a single neural network architecture could learn to play 49 different Atari games directly from pixel inputs, reaching human-level performance on many of them. This was the result that put deep reinforcement learning on the map.

The DQN Architecture

A DQN takes the current state as input and outputs a Q-value for every possible action. The agent selects the action with the highest predicted Q-value (or explores randomly via an epsilon-greedy policy). The network is a standard regression model—it outputs continuous float values, not class probabilities.

Two innovations make training stable. Without them, the neural network diverges or oscillates instead of converging on useful Q-values:

Experience Replay: Instead of training on transitions in the order they occur, the agent stores transitions in a buffer and samples random mini-batches for training. This breaks the temporal correlation between consecutive samples, which would otherwise destabilize gradient-based optimization. It also allows the agent to reuse rare or important experiences multiple times.

Target Network: The agent maintains two copies of the neural network. The policy network (also called the online network) is updated every training step. The target network is a frozen copy whose weights are only synchronized with the policy network every N steps. The target network provides stable Q-value targets during training, preventing the problem where the network is chasing a moving target—the targets shift with every weight update, creating a feedback loop that causes divergence.

Building a DQN Agent in Python

The implementation below uses PyTorch for the neural network and Gymnasium (the maintained fork of OpenAI Gym) for the environment. Start by installing the dependencies:

# Install required packages
# pip install torch gymnasium

Here is the complete DQN agent, broken into its core components. First, the neural network that approximates the Q-function:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
from collections import deque, namedtuple

# Define a named tuple for storing transitions
Transition = namedtuple(
    "Transition", ("state", "action", "reward", "next_state", "done")
)


class QNetwork(nn.Module):
    """Feed-forward network that maps states to Q-values."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, x):
        return self.net(x)

The network has three linear layers with ReLU activations. The input dimension matches the state space (4 for CartPole), and the output dimension matches the number of actions (2 for CartPole: push left or push right). Each output neuron produces a Q-value estimate for its corresponding action.

The Replay Buffer

Next, the experience replay buffer. This stores past transitions and provides random batches for training:

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(
            Transition(state, action, reward, next_state, done)
        )

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.FloatTensor(dones),
        )

    def __len__(self):
        return len(self.buffer)

The buffer is a deque with a fixed maximum length. When it fills up, the oldest transitions are automatically discarded. The sample method draws a random batch and converts the data into PyTorch tensors for efficient GPU-compatible computation.

The DQN Agent

Now the agent class that ties everything together:

class DQNAgent:
    """DQN agent with experience replay and target network."""

    def __init__(
        self,
        state_dim,
        action_dim,
        lr=1e-3,
        gamma=0.99,
        epsilon_start=1.0,
        epsilon_end=0.01,
        epsilon_decay=0.995,
        batch_size=64,
        target_update_freq=10,
        buffer_capacity=100_000,
    ):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq

        # Policy network (updated every step)
        self.policy_net = QNetwork(state_dim, action_dim)
        # Target network (frozen copy, updated periodically)
        self.target_net = QNetwork(state_dim, action_dim)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()

        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        self.memory = ReplayBuffer(buffer_capacity)
        self.steps_done = 0

    def select_action(self, state):
        """Epsilon-greedy action selection."""
        if random.random() < self.epsilon:
            return random.randrange(self.action_dim)
        with torch.no_grad():
            state_t = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.policy_net(state_t)
            return q_values.argmax(dim=1).item()

    def update_epsilon(self):
        """Decay epsilon after each episode."""
        self.epsilon = max(
            self.epsilon_end, self.epsilon * self.epsilon_decay
        )

    def train_step(self):
        """Sample a batch from replay memory and update the policy network."""
        if len(self.memory) < self.batch_size:
            return None

        states, actions, rewards, next_states, dones = self.memory.sample(
            self.batch_size
        )

        # Q-values for actions actually taken
        q_values = self.policy_net(states).gather(1, actions.unsqueeze(1))

        # Target Q-values from the frozen target network
        with torch.no_grad():
            next_q_values = self.target_net(next_states).max(1)[0]
            target_q = rewards + self.gamma * next_q_values * (1 - dones)

        # MSE loss between predicted and target Q-values
        loss = nn.MSELoss()(q_values.squeeze(), target_q)

        self.optimizer.zero_grad()
        loss.backward()
        # Gradient clipping prevents exploding gradients
        nn.utils.clip_grad_norm_(self.policy_net.parameters(), max_norm=1.0)
        self.optimizer.step()

        # Periodically sync the target network
        self.steps_done += 1
        if self.steps_done % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())

        return loss.item()

Pro Tip

Gradient clipping with clip_grad_norm_ is a small detail that makes a big difference in DQN training. Without it, large TD errors early in training can produce enormous gradients that destabilize the network weights. Clipping to a maximum norm of 1.0 keeps updates stable.
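To see the clip in action, here is a minimal sketch, independent of the agent code, that manufactures an oversized gradient and rescales it:

```python
import torch
import torch.nn as nn

# A toy parameter whose loss deliberately produces a huge gradient
w = nn.Parameter(torch.ones(10))
loss = (1000.0 * w).sum()
loss.backward()
print(w.grad.norm().item())  # about 3162 before clipping

# Rescale the gradient in place so its global norm is at most 1.0
nn.utils.clip_grad_norm_([w], max_norm=1.0)
print(w.grad.norm().item())  # 1.0 after clipping
```

Note that clipping by norm preserves the gradient's direction; only its magnitude is capped.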

Experience Replay and Target Networks

It is worth pausing to understand why these two mechanisms are so critical. Without experience replay, the network trains on consecutive transitions from the same trajectory. These transitions are highly correlated: the state at time t+1 is nearly identical to the state at time t. Training on correlated data causes neural networks to oscillate or diverge because the gradient updates push the weights in directions that are locally consistent but globally destructive.

By sampling random batches from a large buffer, you break this correlation. The batch might contain a transition from episode 1, another from episode 50, and another from episode 200. This diversity stabilizes the gradient and makes the loss landscape smoother.

The target network addresses a separate problem. During training, the Q-value target is calculated as:

# The TD target
target = reward + gamma * max(Q_target(next_state))

If you use the same network for both the prediction and the target, every weight update changes both sides of the equation simultaneously. The target moves with every gradient step, creating a feedback loop. The target network solves this by holding the target side constant for a fixed number of steps, giving the policy network a stable objective to learn against.
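As a concrete check of the target formula, here is the arithmetic for a single transition with made-up numbers (gamma = 0.99, matching the agent's default):

```python
gamma = 0.99
reward = 1.0
max_next_q = 10.0  # highest Q-value from the target network

# Non-terminal transition: bootstrap from the next state
target = reward + gamma * max_next_q
print(round(target, 2))  # 10.9

# Terminal transition: the (1 - done) mask zeroes the bootstrap term
done = 1.0
target_terminal = reward + gamma * max_next_q * (1 - done)
print(target_terminal)  # 1.0
```

The `(1 - done)` mask is the same trick used in `train_step` above: when an episode ends, there is no next state to bootstrap from, so the target collapses to the immediate reward.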

Training the Agent on CartPole

Here is the training loop that runs the agent in the CartPole-v1 environment:

import gymnasium as gym


def train_dqn(num_episodes=500, render=False):
    env = gym.make("CartPole-v1", render_mode="human" if render else None)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    agent = DQNAgent(state_dim, action_dim)
    reward_history = []

    for episode in range(num_episodes):
        state, _ = env.reset()
        total_reward = 0

        while True:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            agent.memory.push(state, action, reward, next_state, float(done))
            agent.train_step()

            state = next_state
            total_reward += reward

            if done:
                break

        agent.update_epsilon()
        reward_history.append(total_reward)

        # Print progress every 50 episodes
        if (episode + 1) % 50 == 0:
            avg = np.mean(reward_history[-50:])
            print(
                f"Episode {episode + 1} | "
                f"Avg Reward (last 50): {avg:.1f} | "
                f"Epsilon: {agent.epsilon:.3f}"
            )

    env.close()
    return agent, reward_history


# Run training
agent, rewards = train_dqn()

A well-tuned DQN typically solves CartPole-v1 (an average reward of at least 475 over 100 consecutive episodes, the environment's official threshold; each episode is capped at 500 steps) within 200-400 episodes. Early episodes show low rewards because epsilon is high and the agent acts mostly at random. As epsilon decays and the network improves, the agent learns to balance the pole for the full episode.
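One detail worth checking against your episode budget: with the default epsilon_decay of 0.995 applied once per episode, exploration falls off more slowly than you might expect. A quick sketch replaying the schedule from the agent's defaults:

```python
# Replay the epsilon schedule from DQNAgent's default hyperparameters
epsilon, epsilon_end, epsilon_decay = 1.0, 0.01, 0.995
for episode in range(500):
    epsilon = max(epsilon_end, epsilon * epsilon_decay)
print(round(epsilon, 3))  # about 0.082 after 500 episodes
```

So at the end of a 500-episode run the agent still takes a random action roughly 8% of the time; the floor of 0.01 is never reached. If you want greedier behavior sooner, a smaller decay factor (0.99, say) is one knob to try.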

Common Pitfall

Do not start training until the replay buffer has enough samples to fill a batch. The code above handles this inside train_step() by returning early if the buffer has fewer than batch_size transitions. Skipping this check leads to errors or degenerate early training.

Double DQN: Fixing Overestimation

Standard DQN has a well-documented flaw: it systematically overestimates Q-values. The problem stems from using the same network to both select the best next action and evaluate that action's value. The max operator introduces an upward bias because it always picks the highest Q-value, including cases where that value is high due to noise rather than genuine expected return.
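A small simulation makes the bias concrete: suppose every action's true Q-value is zero and the estimates are pure noise. Taking the max still returns a positive number on average:

```python
import numpy as np

rng = np.random.default_rng(0)
# True Q-values are all zero; the 4 per-state estimates are pure noise
noisy_estimates = rng.normal(loc=0.0, scale=1.0, size=(10_000, 4))

# The max over actions is biased upward even though no action is
# genuinely better than any other
print(noisy_estimates.max(axis=1).mean())  # roughly 1.0, not 0
```

The max operator systematically rewards estimation noise, and because the target feeds back into the next update, the error compounds over training.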

Double DQN (DDQN), proposed by van Hasselt, Guez, and Silver in 2015, decouples selection from evaluation. The policy network selects the best action, but the target network evaluates it. This small change significantly reduces overestimation:

def train_step_double_dqn(self):
    """Double DQN: decouple action selection from evaluation."""
    if len(self.memory) < self.batch_size:
        return None

    states, actions, rewards, next_states, dones = self.memory.sample(
        self.batch_size
    )

    q_values = self.policy_net(states).gather(1, actions.unsqueeze(1))

    with torch.no_grad():
        # Policy net SELECTS the best action
        best_actions = self.policy_net(next_states).argmax(1, keepdim=True)
        # Target net EVALUATES that action
        next_q_values = self.target_net(next_states).gather(
            1, best_actions
        ).squeeze()
        target_q = rewards + self.gamma * next_q_values * (1 - dones)

    loss = nn.MSELoss()(q_values.squeeze(), target_q)

    self.optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(self.policy_net.parameters(), max_norm=1.0)
    self.optimizer.step()

    self.steps_done += 1
    if self.steps_done % self.target_update_freq == 0:
        self.target_net.load_state_dict(self.policy_net.state_dict())

    return loss.item()

The critical difference is two lines. In standard DQN, the target network both selects and evaluates: target_net(next_states).max(1)[0]. In Double DQN, the policy network selects the action via argmax, and the target network evaluates the Q-value for that specific action. This breaks the feedback loop that causes overestimation, leading to more accurate value estimates and more stable policies.

Dueling DQN: Separating Value from Advantage

Dueling DQN, proposed by Wang et al. in 2016, changes the network architecture rather than the update rule. Instead of outputting Q-values directly, the network splits into two streams after the shared feature layers: one stream estimates the state value V(s) (how good is it to be in this state, regardless of action), and the other estimates the advantage A(s, a) (how much better is this action compared to the average action in this state).

The Q-value is then reconstructed as: Q(s, a) = V(s) + A(s, a) - mean(A(s, .)). Subtracting the mean advantage ensures identifiability—without it, the network could shift values arbitrarily between V and A without changing Q.

class DuelingQNetwork(nn.Module):
    """Dueling architecture: separate value and advantage streams."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()

        # Shared feature extraction layers
        self.feature = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
        )

        # Value stream: estimates V(s)
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

        # Advantage stream: estimates A(s, a) for each action
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, x):
        features = self.feature(x)
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        # Combine: Q = V + (A - mean(A))
        q_values = value + advantage - advantage.mean(dim=1, keepdim=True)
        return q_values

The advantage of this architecture is subtle but meaningful. In many environments, there are states where the choice of action barely matters. Think of a driving game where the car is on a straight road with no obstacles—turning left or right produces nearly identical outcomes. The dueling architecture lets the network learn that the state is good (high V) without needing to precisely estimate each action's Q-value. This leads to faster learning and better generalization, especially when many actions have similar values.
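The recombination step in forward can be verified with plain tensors (toy numbers, independent of the network above):

```python
import torch

# One state, three actions: V(s) = 2, advantages already centered
value = torch.tensor([[2.0]])
advantage = torch.tensor([[1.0, -1.0, 0.0]])

# Q = V + (A - mean(A)); broadcasting expands V across the actions
q = value + advantage - advantage.mean(dim=1, keepdim=True)
print(q)                # tensor([[3., 1., 2.]])
print(q.mean().item())  # 2.0 -- the mean Q recovers V(s)
```

Because the advantages are mean-centered, the average Q-value over actions always equals V(s), which is exactly the identifiability constraint described above.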

Pro Tip

Double DQN and Dueling DQN address different problems and can be combined. Simply use the Dueling architecture for both the policy and target networks, then apply the Double DQN update rule. This combination—sometimes called Dueling Double DQN or D3QN—often outperforms either improvement alone. Adding Prioritized Experience Replay on top of that yields even stronger results.

Swapping in the Dueling Network

To use the Dueling architecture with the existing DQNAgent, replace QNetwork with DuelingQNetwork in the agent's constructor:

# Replace these lines in DQNAgent.__init__:
self.policy_net = DuelingQNetwork(state_dim, action_dim)
self.target_net = DuelingQNetwork(state_dim, action_dim)
self.target_net.load_state_dict(self.policy_net.state_dict())
self.target_net.eval()

Everything else—the replay buffer, epsilon-greedy selection, training loop, and target network synchronization—stays exactly the same. The dueling architecture is a drop-in replacement because it still takes a state as input and produces Q-values as output. The internal split into value and advantage streams is invisible to the rest of the code.

Key Takeaways

  1. DQN replaces Q-tables with neural networks: A feed-forward network maps states to Q-values, enabling reinforcement learning in environments with large or continuous state spaces where tabular methods become computationally infeasible.
  2. Experience replay and target networks are essential: Random sampling from a replay buffer breaks temporal correlations in training data. A frozen target network provides stable Q-value targets. Without both mechanisms, DQN training is unstable and prone to divergence.
  3. Double DQN reduces overestimation bias: By using the policy network to select the best action and the target network to evaluate it, DDQN produces more accurate value estimates with a minimal code change.
  4. Dueling DQN separates state value from action advantage: The dual-stream architecture helps the network learn which states are inherently valuable, independent of the specific action taken. This is especially useful in environments where many actions produce similar outcomes.
  5. These improvements are composable: Double DQN, Dueling DQN, and Prioritized Experience Replay can all be stacked together. The Rainbow algorithm (Hessel et al., 2018) demonstrated that combining six such improvements yields substantial gains over any individual technique.

DQN and its variants remain foundational algorithms in deep reinforcement learning. While newer approaches like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) dominate continuous-action problems, DQN-family algorithms continue to be the go-to choice for discrete action spaces. Understanding them is a prerequisite for grasping the more advanced methods that followed.
