Reinforcement learning teaches machines to make decisions through trial and error, rewarding good behavior and penalizing mistakes until an agent learns an optimal strategy. Python has become the dominant language for RL development thanks to libraries like Stable Baselines3, TorchRL, and Gymnasium. This article walks through the core algorithms, the libraries that implement them, and working code you can run today.
Reinforcement learning (RL) sits apart from supervised and unsupervised learning because there is no pre-labeled dataset. Instead, an agent interacts with an environment, takes actions, receives rewards (or penalties), and gradually learns a policy—a mapping from states to actions that maximizes cumulative reward over time. This feedback loop is what makes RL uniquely suited for problems like game playing, robotics, autonomous navigation, and financial trading.
Python's RL ecosystem has matured rapidly. In 2026, the landscape is anchored by Gymnasium (the successor to OpenAI Gym) for environments, Stable Baselines3 for production-ready algorithm implementations, and TorchRL as PyTorch's native RL library. Let's explore each layer of this stack.
How Reinforcement Learning Works
At its core, every RL problem follows the same cycle. An agent observes the current state of its environment, selects an action based on its current policy, and transitions to a new state. The environment then returns a reward signal—a scalar value indicating how good or bad that action was. The agent's goal is to learn a policy that maximizes the total expected reward over an episode.
This cycle is formalized as a Markov Decision Process (MDP), defined by a set of states, a set of actions, transition probabilities between states, a reward function, and a discount factor (gamma) that determines how much the agent values future rewards versus immediate ones. A discount factor close to 1 makes the agent far-sighted, while a value closer to 0 makes it prioritize short-term gains.
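To make the discount factor concrete, here is a minimal sketch (the `discounted_return` function is ours, not from any library) that collapses a reward stream into the return G = r0 + gamma*r1 + gamma^2*r2 + ...:

```python
# Illustrative only: how gamma weights a stream of rewards into a return.
def discounted_return(rewards, gamma):
    g = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.99))  # far-sighted: ~3.94
print(discounted_return(rewards, 0.10))  # short-sighted: ~1.11
```

With gamma near 1, all four rewards count almost fully; with gamma at 0.1, rewards beyond the first step barely register.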
RL algorithms split into two broad families: value-based methods (like Q-learning and DQN) that learn the value of being in a state or taking an action, and policy-based methods (like PPO and REINFORCE) that directly optimize the policy itself. Actor-critic methods combine both approaches by maintaining separate networks for the policy (actor) and the value estimate (critic).
Setting Up Your RL Environment
Before writing any RL code, you need an environment to train in and a library to implement algorithms. Gymnasium (formerly OpenAI Gym) provides a standardized API for hundreds of simulation environments, from simple control tasks like CartPole to complex Atari games. It is maintained by the Farama Foundation and serves as the common interface that almost every Python RL library expects.
Start by installing the core dependencies. Using a virtual environment is strongly recommended to keep your packages isolated.
# Create and activate a virtual environment
python -m venv rl-env
source rl-env/bin/activate # Linux/Mac
# rl-env\Scripts\activate # Windows
# Install core RL libraries
pip install gymnasium
pip install stable-baselines3[extra]
pip install torch torchrl
pip install numpy matplotlib
Once installed, verify that Gymnasium is working by spinning up a quick environment.
import gymnasium as gym

# Create the CartPole environment
env = gym.make("CartPole-v1", render_mode="human")
observation, info = env.reset()

for step in range(200):
    action = env.action_space.sample()  # Random action
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()

env.close()
This script creates a CartPole environment, takes random actions, and resets whenever the episode ends. The observation returned at each step is a NumPy array describing the cart's position, velocity, pole angle, and angular velocity. The reward is +1 for every timestep the pole remains upright. The goal of your RL agent will be to replace that random action selection with a learned policy.
Gymnasium replaced OpenAI Gym as the actively maintained standard. If you encounter older tutorials using import gym, most code ports over by swapping in import gymnasium as gym, but watch for two signature changes: reset() now returns an (observation, info) tuple, and step() returns five values, with terminated and truncated replacing the single done flag.
Q-Learning from Scratch
Q-learning is the foundational value-based RL algorithm and a great starting point for understanding how agents learn. The "Q" refers to a quality function, Q(s, a), that estimates the expected cumulative reward of taking action a in state s and then following the optimal policy thereafter.
The algorithm maintains a Q-table—a lookup table with one entry for every state-action pair. After each action, the table is updated using the Bellman equation: the new Q-value for the current state-action pair is adjusted toward the observed reward plus the discounted maximum Q-value of the next state.
import gymnasium as gym
import numpy as np

env = gym.make("Taxi-v3")

# Initialize Q-table with zeros
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
learning_rate = 0.1
discount_factor = 0.99
epsilon = 1.0  # Exploration rate
epsilon_decay = 0.995
epsilon_min = 0.01
episodes = 10000

for episode in range(episodes):
    state, info = env.reset()
    done = False
    total_reward = 0

    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(q_table[state])  # Exploit

        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Q-value update (Bellman equation)
        best_next = np.max(q_table[next_state])
        q_table[state, action] += learning_rate * (
            reward + discount_factor * best_next - q_table[state, action]
        )

        state = next_state
        total_reward += reward

    # Decay exploration rate
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    if (episode + 1) % 1000 == 0:
        print(f"Episode {episode + 1} | Total Reward: {total_reward:.1f} | Epsilon: {epsilon:.4f}")
print("Training complete.")
The epsilon-greedy strategy balances exploration and exploitation. Early in training, the agent takes random actions frequently (high epsilon) to discover the environment. As epsilon decays, the agent increasingly relies on its learned Q-values. The Taxi-v3 environment has a discrete state space of 500 states and 6 possible actions, making it manageable for a tabular approach.
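Once training finishes, you can check what the agent actually learned by running the greedy policy with exploration switched off. A minimal sketch (the `evaluate` helper is ours, and it assumes the `env` and `q_table` from the script above):

```python
import numpy as np

# Illustrative helper: run the learned policy greedily (no epsilon, no updates)
# and report the average episode return.
def evaluate(env, q_table, episodes=10):
    returns = []
    for _ in range(episodes):
        state, info = env.reset()
        done, total = False, 0.0
        while not done:
            action = int(np.argmax(q_table[state]))  # purely greedy
            state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```

A well-trained Taxi-v3 agent should average a positive return here, whereas the random policy from earlier scores heavily negative.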
Q-learning works well for small, discrete environments, but it falls apart when the state space grows large or becomes continuous. That is where deep reinforcement learning comes in.
Deep Q-Networks with Stable Baselines3
Deep Q-Networks (DQN) replace the Q-table with a neural network that approximates Q-values for continuous or high-dimensional state spaces. The original DQN paper demonstrated agents learning to play Atari games directly from pixel input, a landmark result in AI research.
Stable Baselines3 (SB3) is one of the leading RL libraries in the Python ecosystem. It provides clean, well-tested implementations of major algorithms built on PyTorch, and follows a scikit-learn-like API that makes it approachable for developers who are new to RL. As of early 2026, SB3 is at version 2.8 and requires Python 3.10 or higher.
from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Create a vectorized environment (4 parallel instances)
env = make_vec_env("CartPole-v1", n_envs=4)

# Initialize the DQN model
model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    buffer_size=100000,
    learning_starts=1000,
    batch_size=64,
    gamma=0.99,
    target_update_interval=500,
    verbose=1,
    tensorboard_log="./dqn_cartpole_logs/",
)

# Train for 50,000 timesteps
model.learn(total_timesteps=50000)

# Evaluate the trained agent
eval_env = make_vec_env("CartPole-v1", n_envs=1)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=20)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

# Save and reload the model
model.save("dqn_cartpole")
loaded_model = DQN.load("dqn_cartpole")
Several features make SB3 especially practical. Vectorized environments (make_vec_env) run multiple environment instances in parallel, which accelerates data collection. The replay buffer stores past experiences and samples mini-batches for training, decorrelating sequential observations. The target network (updated every 500 steps in this example) stabilizes learning by providing a slowly-moving Q-value target. And built-in TensorBoard logging lets you visualize training metrics in real time.
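To see what the replay buffer is doing conceptually, here is a stripped-down sketch (illustrative only, not SB3's actual implementation, which also handles tensors, device placement, and optimized storage):

```python
import random
from collections import deque

# Minimal replay-buffer sketch: store transitions, sample uncorrelated
# mini-batches for training.
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks up the temporal correlation
        # between consecutive environment steps.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

SB3's buffer_size=100000 above corresponds to the capacity here: once full, new experience overwrites the oldest.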
DQN is designed for discrete action spaces only. If your problem involves continuous actions (like controlling a robot joint angle), you need algorithms like SAC, TD3, or DDPG instead. SB3 implements all of these.
Proximal Policy Optimization (PPO)
Proximal Policy Optimization is currently one of the go-to algorithms in reinforcement learning. It belongs to the policy gradient family and works by directly optimizing the policy network. PPO's key innovation is a clipped surrogate objective that prevents destructively large policy updates, making training more stable than vanilla policy gradients while being far simpler to implement than its trust-region predecessor, TRPO.
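The clipped objective itself is only a few lines. Here is an illustrative NumPy sketch (the `ppo_clip_loss` helper is ours; SB3's real implementation lives in PyTorch and adds value and entropy terms):

```python
import numpy as np

# Illustrative PPO clipped surrogate loss.
# ratio = pi_new(a|s) / pi_old(a|s), computed from stored log-probabilities.
def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clipping the ratio caps how much any one update can move the policy.
    clipped = np.clip(ratio, 1 - clip_range, 1 + clip_range) * advantages
    # Elementwise minimum is pessimistic; negate because we minimize the loss.
    return -np.minimum(unclipped, clipped).mean()
```

If the new policy doubles an action's probability, the ratio of 2.0 is clipped to 1.2 before multiplying the advantage, so the gradient signal for that sample is capped.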
PPO is an on-policy algorithm, meaning it collects a batch of experience using the current policy, uses that batch to compute an update, and then discards the data. This is less sample-efficient than off-policy methods like DQN or SAC, but PPO tends to be simpler to tune and more reliable across different problem types.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback

# Create training and evaluation environments
train_env = make_vec_env("LunarLander-v3", n_envs=8)
eval_env = make_vec_env("LunarLander-v3", n_envs=1)

# Set up an evaluation callback
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./best_model/",
    log_path="./eval_logs/",
    eval_freq=5000,
    n_eval_episodes=10,
    deterministic=True,
)

# Initialize the PPO model
model = PPO(
    "MlpPolicy",
    train_env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1,
    tensorboard_log="./ppo_lunarlander_logs/",
)

# Train with evaluation callback
model.learn(total_timesteps=500000, callback=eval_callback)
print("Training complete. Best model saved to ./best_model/")
There are several key hyperparameters to understand here. n_steps controls how many timesteps of experience are collected before each update. n_epochs determines how many passes through the collected batch are made during optimization. clip_range (set to 0.2) defines the clipping threshold for the surrogate objective—it restricts how far the new policy can deviate from the old one in a single update. And gae_lambda controls the bias-variance tradeoff in the Generalized Advantage Estimation used for computing advantages.
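The GAE computation that gae_lambda controls can be sketched in a few lines (illustrative only, ignoring episode boundaries; the `compute_gae` helper is ours, not SB3's):

```python
import numpy as np

# Illustrative Generalized Advantage Estimation over one rollout.
# rewards[t] and values[t] cover timesteps 0..T-1; last_value is V(s_T).
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    advantages = np.zeros(len(rewards))
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error at timestep t
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

At lam=0 each advantage collapses to the single-step TD error (low variance, high bias); at lam=1 it becomes the full Monte Carlo advantage (high variance, low bias). The 0.95 default sits near the low-bias end of that tradeoff.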
The EvalCallback periodically evaluates the agent and saves the best-performing model checkpoint, which is important since RL training can be noisy and later checkpoints are not always the best ones.
Soft Actor-Critic for Continuous Control
Soft Actor-Critic (SAC) is an off-policy actor-critic algorithm designed specifically for environments with continuous action spaces. Unlike PPO, SAC stores past experiences in a replay buffer and reuses them for training, making it significantly more sample-efficient. SAC also incorporates an entropy bonus into its objective function, which encourages exploration by rewarding the agent for maintaining a stochastic policy.
SAC uses twin critic networks to mitigate overestimation bias (similar to TD3) and supports automatic temperature tuning for the entropy coefficient, which reduces the amount of manual hyperparameter work.
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

# Pendulum-v1 has a continuous action space
env = make_vec_env("Pendulum-v1", n_envs=1)

# Initialize SAC with automatic entropy tuning
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=1000000,
    batch_size=256,
    tau=0.005,  # Soft target update coefficient
    gamma=0.99,
    learning_starts=1000,
    ent_coef="auto",  # Automatic entropy temperature tuning
    verbose=1,
    tensorboard_log="./sac_pendulum_logs/",
)

# Train the agent
model.learn(total_timesteps=100000)

# Watch the trained agent perform
eval_env = make_vec_env("Pendulum-v1", n_envs=1)
obs = eval_env.reset()
for _ in range(500):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = eval_env.step(action)

model.save("sac_pendulum")
The tau parameter controls how quickly the target networks track the main networks via Polyak averaging. A small value (like 0.005) means gradual updates, which keeps training stable. Setting ent_coef to "auto" lets SB3 automatically learn the entropy temperature during training, which is generally recommended over setting it manually.
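The Polyak update itself is a one-liner per parameter. An illustrative sketch with NumPy arrays standing in for network weights (the `soft_update` helper is ours, not SB3's):

```python
import numpy as np

# Illustrative soft (Polyak) target update: each step, target parameters
# move a fraction tau toward the online parameters, so the targets used
# for Q-learning change slowly and smoothly.
def soft_update(target_params, online_params, tau=0.005):
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

target = [np.zeros(3)]
online = [np.ones(3)]
for _ in range(100):
    target = soft_update(target, online, tau=0.005)
# After 100 updates the target has closed 1 - 0.995**100, roughly 39%,
# of the gap to the online network.
print(target[0][0])
```

Contrast this with DQN's hard update above, which copies the full weights every target_update_interval steps.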
For continuous control problems (robotics, motor control, physics simulations), SAC is often the strongest starting point. It typically requires less hyperparameter tuning than TD3 or DDPG and handles exploration naturally through its entropy maximization objective.
TorchRL and the PyTorch Ecosystem
TorchRL is PyTorch's official reinforcement learning library. It takes a modular, composable approach where environments, transforms, replay buffers, loss functions, and data collectors are all separate building blocks you can mix and match. TorchRL is built around TensorDict, a dictionary-like data structure that carries tensors through the RL pipeline and standardizes how data flows between components.
As of early 2026, TorchRL is at version 0.11 and has been expanding rapidly, with recent releases adding LLM training support (including GRPO for preference optimization), vLLM integration for large-scale inference, and a high-level PPOTrainer class. It requires Python 3.10+ and works best with PyTorch 2.7 or later.
import torch
from torchrl.envs import GymEnv, TransformedEnv, Compose
from torchrl.envs.transforms import (
    DoubleToFloat,
    ObservationNorm,
    StepCounter,
)
from torchrl.modules import MLP, ProbabilisticActor, ValueOperator
from torchrl.modules.distributions import TanhNormal
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor

# Create and wrap the environment with transforms
base_env = GymEnv("InvertedDoublePendulum-v4", device="cpu")
env = TransformedEnv(
    base_env,
    Compose(
        ObservationNorm(in_keys=["observation"]),
        DoubleToFloat(),
        StepCounter(),
    ),
)
# Initialize normalization stats from random rollouts
env.transform[0].init_stats(num_iter=1000, reduce_dim=0, cat_dim=0)

# Build actor network (policy); NormalParamExtractor splits the final
# layer's output into a mean ("loc") and a positive std ("scale") so the
# module emits the two tensors the out_keys below expect
actor_net = torch.nn.Sequential(
    MLP(
        in_features=env.observation_spec["observation"].shape[-1],
        out_features=2 * env.action_spec.shape[-1],  # Mean and std
        num_cells=[256, 256],
        activation_class=torch.nn.Tanh,
    ),
    NormalParamExtractor(),
)
# Wrap as a TensorDict module
policy_module = TensorDictModule(
    actor_net,
    in_keys=["observation"],
    out_keys=["loc", "scale"],
)
# Create probabilistic actor
actor = ProbabilisticActor(
    module=policy_module,
    in_keys=["loc", "scale"],
    distribution_class=TanhNormal,
    return_log_prob=True,
)

# Build value network (critic)
value_net = MLP(
    in_features=env.observation_spec["observation"].shape[-1],
    out_features=1,
    num_cells=[256, 256],
    activation_class=torch.nn.Tanh,
)
value_module = ValueOperator(
    module=value_net,
    in_keys=["observation"],
)

print(f"Actor parameters: {sum(p.numel() for p in actor.parameters()):,}")
print(f"Critic parameters: {sum(p.numel() for p in value_module.parameters()):,}")
TorchRL's transform system works differently from Gymnasium's wrappers. Instead of wrapping environments inside each other (which can get unwieldy), TorchRL stacks transforms in a flat list. You can insert, remove, or reorder transforms at any point without restructuring the pipeline. ObservationNorm normalizes observations to roughly match a unit Gaussian distribution, which is generally important for stable neural network training.
The ProbabilisticActor samples actions from a learned distribution (in this case, a TanhNormal distribution that squashes outputs to a bounded range), which naturally supports exploration during training. For evaluation, you can switch to deterministic action selection by using the distribution's mode instead of sampling.
TorchRL gives you finer-grained control than SB3, which makes it a strong choice when you need to customize training loops, implement novel algorithms, or integrate RL with other PyTorch workflows. The tradeoff is that it requires more code and deeper understanding to get started.
Choosing the Right Algorithm
Selecting an RL algorithm depends on the characteristics of your problem. Here is a practical guide for common scenarios.
Discrete actions + small state space (grid worlds, board games, simple routing): Start with tabular Q-learning. It is simple, requires no neural networks, and converges reliably on small problems.
Discrete actions + large or visual state space (Atari games, text-based environments): Use DQN or one of its improved variants (Double DQN, Dueling DQN, Rainbow). SB3 provides a solid DQN implementation out of the box. SB3 Contrib adds Quantile Regression DQN (QRDQN) for distributional RL.
Continuous actions (robotics, motor control, physics simulations): SAC is typically the best starting point due to its sample efficiency and automatic exploration via entropy regularization. TD3 is a good alternative if you want a deterministic policy.
General-purpose tasks where simplicity matters: PPO is widely considered the most reliable all-rounder. It works across discrete and continuous action spaces, is relatively easy to tune, and is the default choice for many applications including RLHF for large language models.
Multi-agent scenarios or production-scale distributed training: Ray RLlib is built for horizontal scaling across multiple CPUs and GPUs. It supports independent multi-agent learning, parameter sharing, adversarial self-play, and integrates with the broader Ray ecosystem for data processing and serving.
RL training is inherently noisy. The same algorithm with the same hyperparameters can produce wildly different results across random seeds. Always run multiple seeds (at least 3 to 5) and report mean performance with standard deviation when evaluating your results. A single successful run does not prove your approach works.
Key Takeaways
- Start with Gymnasium: It provides the standard environment interface that all major RL libraries depend on. Get comfortable with the reset(), step(), and observation/action space APIs before moving to complex algorithms.
- Use Stable Baselines3 for rapid prototyping: SB3 offers production-ready implementations of PPO, DQN, SAC, TD3, A2C, and DDPG with minimal boilerplate. Its scikit-learn-style API gets you training agents in under 10 lines of code.
- Consider TorchRL for custom research: When you need to modify training loops, implement new algorithms, or integrate RL into larger PyTorch pipelines, TorchRL's modular design gives you the flexibility SB3 abstracts away.
- Match the algorithm to the problem: DQN for discrete actions, SAC for continuous control, PPO as a reliable general-purpose option, and RLlib when you need to scale across machines.
- Track experiments carefully: RL training is stochastic and can take hours or days. Use TensorBoard, Weights & Biases, or similar tools to log rewards, losses, and episode lengths so you can diagnose problems and compare runs.
Reinforcement learning remains one of the more challenging branches of machine learning, but the Python ecosystem has made it dramatically more accessible. Whether you are building a game-playing agent, training a robot controller, or fine-tuning a language model with human feedback, the libraries covered here provide tested, reliable foundations to build on. Start simple, experiment systematically, and scale up once you have a working baseline.