Q-Learning in Python: Build a Reinforcement Learning Agent from Scratch

Q-Learning is one of the foundational algorithms in reinforcement learning. It allows an agent to learn the best action to take in any given state purely through trial and error, without needing a model of the environment. In this article, we will build a Q-Learning agent in Python from scratch, walk through every piece of the algorithm, and watch it solve a grid world navigation problem.

Reinforcement learning sits in a unique spot within machine learning. Unlike supervised learning, there are no labeled examples to learn from. Unlike unsupervised learning, the goal is not to find hidden patterns. Instead, an agent interacts with an environment, takes actions, receives rewards (or penalties), and gradually figures out how to maximize its long-term payoff. Q-Learning is one of the simplest and most widely used algorithms for doing exactly that.

What Is Q-Learning?

Q-Learning is a model-free reinforcement learning algorithm. "Model-free" means the agent does not need to know the rules of the environment in advance. It does not need a map, a transition probability table, or any insider knowledge. It learns entirely from experience by trying actions, observing what happens, and adjusting its strategy accordingly.

The core data structure in Q-Learning is the Q-table. This is a lookup table where each row represents a state the agent can be in, and each column represents an action the agent can take. The value stored in each cell, called a Q-value, estimates how good it is to take that particular action from that particular state. "Good" here means the expected total future reward the agent can collect from that point forward.

At the start of training, the Q-table is initialized to zeros (or small random values). The agent knows nothing. Over thousands of episodes of trial and error, the Q-values converge toward their true optimal values, and the agent learns which action is best in every state.
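As a minimal sketch of that starting point, the zero-initialized table for the 5x5 grid world built later in this article (25 states, 4 actions) is just a NumPy array:

```python
import numpy as np

n_states, n_actions = 25, 4          # 5x5 grid, 4 moves per cell
q_table = np.zeros((n_states, n_actions))

print(q_table.shape)   # (25, 4)
print(q_table[0])      # [0. 0. 0. 0.] -- every action looks equally good
```

Because every entry starts at zero, the agent's very first greedy choices are arbitrary; it is the exploration described below that breaks these ties with real experience.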

Note

The "Q" in Q-Learning stands for "quality." A Q-value represents the quality of a specific action taken in a specific state, measured in terms of expected cumulative reward.

The Math Behind the Q-Update Rule

Every time the agent takes an action and observes the result, it updates the relevant Q-value using the temporal difference (TD) update rule. Here is the formula:

Q(s, a) = Q(s, a) + alpha * (reward + gamma * max(Q(s', all_actions)) - Q(s, a))

Let's unpack each variable:

  • s -- the current state the agent is in.
  • a -- the action the agent just took.
  • reward -- the immediate reward received after taking action a in state s.
  • s' -- the new state the agent lands in after the action.
  • alpha -- the learning rate (between 0 and 1). Controls how much weight is given to new information versus old Q-values. A value of 0.1 means the agent makes small, cautious updates.
  • gamma -- the discount factor (between 0 and 1). Controls how much the agent cares about future rewards versus immediate rewards. A value close to 1 means the agent thinks long-term.

The expression reward + gamma * max(Q(s', all_actions)) is called the TD target. It represents the agent's updated estimate of the total reward it can earn. The difference between this target and the old Q-value is called the TD error, and the learning rate controls how aggressively the Q-value moves toward the target.
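To make the update concrete, here is one hand-worked step with illustrative numbers (alpha = 0.1, gamma = 0.99; the Q-values are made up for the example):

```python
alpha, gamma = 0.1, 0.99

q_sa = 0.5          # current estimate Q(s, a)
reward = -0.1       # immediate reward for the step
max_q_next = 2.0    # best Q-value available from s'

td_target = reward + gamma * max_q_next   # -0.1 + 0.99 * 2.0 = 1.88
td_error = td_target - q_sa               # 1.88 - 0.5 = 1.38
q_sa += alpha * td_error                  # 0.5 + 0.1 * 1.38 = 0.638

print(round(q_sa, 3))   # 0.638
```

The estimate moves only a tenth of the way toward the target; over many visits to the same (state, action) pair, these small steps average out the noise in individual outcomes.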

Exploration vs. Exploitation

A critical challenge in Q-Learning is balancing exploration (trying new actions to discover better strategies) with exploitation (choosing the best known action to maximize reward). The standard approach is the epsilon-greedy policy:

  • With probability epsilon, the agent picks a random action (exploration).
  • With probability 1 - epsilon, the agent picks the action with the highest Q-value (exploitation).

Typically, epsilon starts high (e.g., 1.0) and decays over time. Early in training, the agent explores heavily. As it learns, it gradually shifts toward exploiting its knowledge.
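With the schedule used later in this article (start at 1.0, multiply by 0.995 after each episode, floor at 0.01), the decay can be simulated in a few lines:

```python
epsilon, decay, floor = 1.0, 0.995, 0.01

for episode in range(1, 1001):
    epsilon = max(floor, epsilon * decay)
    if episode % 200 == 0:
        print(f"episode {episode:4d}: epsilon = {epsilon:.4f}")
```

Epsilon falls below 0.37 by episode 200 and reaches the 0.01 floor shortly after episode 900, so late training is almost entirely exploitation.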

Building a Grid World Environment

Before writing the agent, we need an environment for it to operate in. We will build a simple 5x5 grid world where the agent starts in the top-left corner and must navigate to the bottom-right corner. There are walls (impassable cells) that the agent must learn to avoid.

import numpy as np
import random

class GridWorld:
    def __init__(self):
        self.rows = 5
        self.cols = 5
        self.start = (0, 0)
        self.goal = (4, 4)
        self.state = self.start

        # Define walls as impassable cells
        self.walls = {(1, 1), (2, 1), (3, 1), (1, 3), (2, 3)}

        # Actions: 0=up, 1=down, 2=left, 3=right
        self.actions = [0, 1, 2, 3]
        self.action_map = {
            0: (-1, 0),  # up
            1: (1, 0),   # down
            2: (0, -1),  # left
            3: (0, 1)    # right
        }

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.action_map[action]
        new_row = self.state[0] + dr
        new_col = self.state[1] + dc

        # Check boundaries and walls
        if (0 <= new_row < self.rows and
            0 <= new_col < self.cols and
            (new_row, new_col) not in self.walls):
            self.state = (new_row, new_col)

        # Determine reward
        if self.state == self.goal:
            return self.state, 10.0, True   # reached the goal
        else:
            return self.state, -0.1, False  # small penalty per step

    def render(self):
        for r in range(self.rows):
            row_str = ""
            for c in range(self.cols):
                if (r, c) == self.state:
                    row_str += " A "
                elif (r, c) == self.goal:
                    row_str += " G "
                elif (r, c) in self.walls:
                    row_str += " # "
                else:
                    row_str += " . "
            print(row_str)
        print()

The environment follows the standard reinforcement learning interface: a reset() method that returns the agent to the start, and a step(action) method that returns the next state, a reward, and a boolean indicating whether the episode is over. The agent earns +10 for reaching the goal and receives a small -0.1 penalty for each step, encouraging it to find the shortest path.
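Any environment exposing this reset()/step() pair can be driven by the same loop. Here is a sketch using a deliberately tiny stand-in environment (a 1-D line invented for this snippet, so it runs on its own) and a purely random policy:

```python
import random

class LineWorld:
    """Stand-in with the same interface: walk a 1-D line from cell 0 to cell 4."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # 0 = left, 1 = right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        if self.state == 4:
            return self.state, 10.0, True    # goal reward
        return self.state, -0.1, False       # step penalty

random.seed(0)  # reproducible episode
env = LineWorld()
state = env.reset()
done, total = False, 0.0
while not done:
    state, reward, done = env.step(random.choice([0, 1]))
    total += reward
print(f"random policy finished with total reward {total:.1f}")
```

A random walker eventually stumbles onto the goal but racks up step penalties along the way; the Q-Learning agent built next learns to do much better.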

Pro Tip

The step penalty of -0.1 is important. Without it, the agent has no incentive to reach the goal quickly. It might wander aimlessly and still collect the +10 reward eventually. The penalty encourages efficiency.

Implementing the Q-Learning Agent

Now for the agent itself. The Q-Learning agent maintains a Q-table, chooses actions using the epsilon-greedy policy, and updates its Q-values after every step.

class QLearningAgent:
    def __init__(self, n_states, n_actions, alpha=0.1,
                 gamma=0.99, epsilon=1.0, epsilon_decay=0.995,
                 epsilon_min=0.01):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.n_actions = n_actions

    def state_to_index(self, state, cols=5):
        """Convert a (row, col) tuple to a flat index."""
        return state[0] * cols + state[1]

    def choose_action(self, state_idx):
        """Epsilon-greedy action selection."""
        if random.random() < self.epsilon:
            return random.randint(0, self.n_actions - 1)
        else:
            return int(np.argmax(self.q_table[state_idx]))

    def update(self, state_idx, action, reward, next_state_idx, done):
        """Apply the Q-Learning update rule."""
        if done:
            td_target = reward
        else:
            td_target = reward + self.gamma * np.max(
                self.q_table[next_state_idx]
            )

        td_error = td_target - self.q_table[state_idx, action]
        self.q_table[state_idx, action] += self.alpha * td_error

    def decay_epsilon(self):
        """Reduce exploration rate over time."""
        self.epsilon = max(
            self.epsilon_min,
            self.epsilon * self.epsilon_decay
        )

A few things to note about this implementation. The Q-table is a 2D NumPy array with shape (n_states, n_actions). For our 5x5 grid, that means 25 states and 4 actions, giving us a 25x4 table. The state_to_index method flattens the (row, col) tuple into a single integer so it can be used as a row index.

The update method is where the core algorithm lives. When the episode is done (the agent reached the goal), there is no future reward to consider, so the TD target is simply the immediate reward. Otherwise, the target includes the discounted maximum Q-value of the next state.

Training and Watching the Agent Learn

With the environment and agent ready, we can run the training loop. The agent will play thousands of episodes, each time starting from (0, 0) and trying to reach (4, 4).

def train(episodes=1000, max_steps=100):
    env = GridWorld()
    agent = QLearningAgent(
        n_states=env.rows * env.cols,
        n_actions=len(env.actions)
    )

    rewards_per_episode = []

    for episode in range(episodes):
        state = env.reset()
        state_idx = agent.state_to_index(state)
        total_reward = 0

        for step in range(max_steps):
            action = agent.choose_action(state_idx)
            next_state, reward, done = env.step(action)
            next_state_idx = agent.state_to_index(next_state)

            agent.update(state_idx, action, reward,
                         next_state_idx, done)

            state_idx = next_state_idx
            total_reward += reward

            if done:
                break

        agent.decay_epsilon()
        rewards_per_episode.append(total_reward)

        # Print progress every 200 episodes
        if (episode + 1) % 200 == 0:
            avg = np.mean(rewards_per_episode[-200:])
            print(f"Episode {episode + 1:>5} | "
                  f"Avg Reward: {avg:.2f} | "
                  f"Epsilon: {agent.epsilon:.4f}")

    return agent, rewards_per_episode

agent, rewards = train(episodes=2000)

When you run this, you will see the average reward climb over time. In the early episodes, the agent stumbles around randomly and accumulates heavy step penalties. As epsilon decays and the Q-table fills with useful information, the agent starts finding the goal consistently and in fewer steps.

A typical training run might look like this:

Episode   200 | Avg Reward: 2.14 | Epsilon: 0.3670
Episode   400 | Avg Reward: 7.85 | Epsilon: 0.1347
Episode   600 | Avg Reward: 9.12 | Epsilon: 0.0494
Episode   800 | Avg Reward: 9.25 | Epsilon: 0.0181
Episode  1000 | Avg Reward: 9.30 | Epsilon: 0.0100
Episode  1200 | Avg Reward: 9.30 | Epsilon: 0.0100
Episode  1400 | Avg Reward: 9.30 | Epsilon: 0.0100
Episode  1600 | Avg Reward: 9.30 | Epsilon: 0.0100
Episode  1800 | Avg Reward: 9.30 | Epsilon: 0.0100
Episode  2000 | Avg Reward: 9.30 | Epsilon: 0.0100

Notice how the reward plateaus around 9.30. The shortest route from (0, 0) to (4, 4) around the walls is 8 moves. Seven of those moves incur the -0.1 step penalty, and the final move earns the +10 goal reward instead, so the best possible return is 10.0 - 0.7 = 9.30. A plateau at 9.30 means the agent has converged on an optimal path.
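The arithmetic behind that plateau follows directly from the environment's reward structure (only non-final moves are penalized, because the final move returns the goal reward):

```python
goal_reward = 10.0
step_penalty = 0.1
path_length = 8      # shortest route from (0, 0) to (4, 4) around the walls

# The final move earns the goal reward instead of the step penalty,
# so only 7 of the 8 moves are penalized.
best_return = goal_reward - (path_length - 1) * step_penalty
print(round(best_return, 2))   # 9.3
```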

Note

The epsilon_min value of 0.01 means the agent never stops exploring completely. Even after convergence, it takes a random action 1% of the time. This prevents the agent from getting permanently stuck in a suboptimal policy if the environment were to change.

Visualizing the Learned Policy

Once training is complete, we can extract the learned policy from the Q-table and display it as a grid of arrows showing the agent's preferred direction in each cell.

def show_policy(agent, env):
    """Display the learned policy as a grid of arrows."""
    arrows = {0: "^", 1: "v", 2: "<", 3: ">"}

    print("Learned Policy:")
    print("-" * 21)
    for r in range(env.rows):
        row_str = "|"
        for c in range(env.cols):
            if (r, c) == env.goal:
                row_str += " G |"
            elif (r, c) in env.walls:
                row_str += " # |"
            else:
                idx = agent.state_to_index((r, c))
                best_action = int(np.argmax(agent.q_table[idx]))
                row_str += f" {arrows[best_action]} |"
        print(row_str)
        print("-" * 21)

show_policy(agent, GridWorld())

The output will look something like this:

Learned Policy:
---------------------
| v | > | v | > | v |
---------------------
| v | # | v | # | v |
---------------------
| v | # | v | # | v |
---------------------
| v | # | > | > | v |
---------------------
| > | > | > | > | G |
---------------------

The arrows show the optimal action the agent has learned for each open cell. You can trace the path from the top-left corner down and around the walls to the goal in the bottom-right. The agent has discovered the shortest route without ever being told where the walls are or how the grid works.

Plotting the Reward Curve

To see how the agent improved over time, you can plot the reward history using matplotlib:

import matplotlib.pyplot as plt

def plot_rewards(rewards, window=50):
    """Plot a smoothed reward curve."""
    smoothed = [
        np.mean(rewards[max(0, i - window):i + 1])
        for i in range(len(rewards))
    ]

    plt.figure(figsize=(10, 5))
    plt.plot(smoothed, color="#4b8bbe", linewidth=1.5)
    plt.xlabel("Episode")
    plt.ylabel("Total Reward (smoothed)")
    plt.title("Q-Learning Training Progress")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("q_learning_rewards.png", dpi=150)
    plt.show()

plot_rewards(rewards)

The resulting curve will show a sharp improvement in the first few hundred episodes, followed by a plateau as the agent converges on the optimal policy.

Key Takeaways

  1. Q-Learning is model-free. The agent does not need any prior knowledge of the environment. It learns entirely through interaction, which makes the same algorithm applicable to a wide variety of problems.
  2. The Q-table stores learned knowledge. Each cell maps a (state, action) pair to an estimated future reward. After training, the best action in any state is simply the one with the highest Q-value in that row.
  3. Epsilon-greedy balances exploration and exploitation. Starting with high exploration and decaying over time lets the agent discover the environment early on and then settle into the best strategy it has found.
  4. Hyperparameters matter. The learning rate (alpha), discount factor (gamma), and epsilon decay schedule all affect how quickly and reliably the agent converges. Tuning these is often the difference between an agent that learns and one that does not.
  5. Q-Learning has limits. The Q-table approach works well when the number of states is small and discrete. For environments with large or continuous state spaces, such as video games or robotic control, Deep Q-Networks (DQN) replace the table with a neural network that can generalize across states.

Q-Learning is one of those algorithms that rewards hands-on experimentation. Try changing the grid size, adding more walls, adjusting the reward structure, or tweaking the hyperparameters. Each change teaches you something new about how the agent learns and adapts. Once you are comfortable with tabular Q-Learning, the jump to Deep Q-Networks and more advanced reinforcement learning techniques becomes much more intuitive.
