Python Actor-Critic Methods: From TD Actor-Critic to PPO

Actor-Critic methods sit at the heart of modern reinforcement learning. They combine the strengths of two separate approaches -- policy gradient methods that directly learn which actions to take, and value-based methods that estimate how good a given state is -- into a single, unified architecture. The idea of pairing a decision-maker with a learned evaluator traces back to 1983, when Barto, Sutton, and Anderson first demonstrated that two cooperating adaptive elements could solve the cart-pole balancing problem (Barto et al., "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems," IEEE Transactions on Systems, Man, and Cybernetics, 1983). Four decades later, this same architectural principle powers everything from robotic controllers to the RLHF stage of large language model training. This article walks through the core ideas behind Actor-Critic algorithms and implements three increasingly powerful variants in Python using PyTorch: a basic TD Actor-Critic, Advantage Actor-Critic (A2C), and Proximal Policy Optimization (PPO).

Policy gradient methods like REINFORCE (Williams, 1992) can learn a policy directly, but they suffer from high variance because they rely on complete episode trajectories to compute returns. Value-based methods like Q-learning can learn from individual steps, but they struggle with continuous action spaces and require a separate mechanism to derive a policy. Actor-Critic methods eliminate this tradeoff by pairing two neural networks: an actor that selects actions according to a learned policy, and a critic that evaluates those actions by estimating state values. As Sutton et al. proved in their foundational 1999 policy gradient theorem, the gradient of expected reward can be estimated using an approximate value function without introducing bias, provided the approximator satisfies certain compatibility conditions -- a result that gave Actor-Critic methods their theoretical foundation (Sutton et al., "Policy Gradient Methods for Reinforcement Learning with Function Approximation," NeurIPS, 1999).

What Is the Actor-Critic Architecture?

The Actor-Critic framework uses two components that train simultaneously. The actor is a policy network that takes the current state as input and outputs a probability distribution over available actions. The critic is a value network that takes the same state and outputs a single scalar -- its estimate of the expected cumulative reward from that state forward.

At each timestep, the actor selects an action. The environment returns a reward and a new state. The critic then computes the temporal difference (TD) error, which measures how much better or worse the outcome was compared to what the critic predicted. This TD error serves two purposes: it trains the critic to make better predictions, and it tells the actor whether the action it chose was better or worse than expected.

The TD error is defined as:

# TD error (delta)
# delta = reward + gamma * V(next_state) - V(current_state)
#
# If delta > 0: the action led to a better outcome than expected
# If delta < 0: the action led to a worse outcome than expected
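
As a quick numeric illustration of the formula (the reward and value estimates below are made up):

```python
gamma = 0.99
v_current = 10.0   # critic's estimate of the current state's value
v_next = 10.5      # critic's estimate of the next state's value
reward = 1.0       # reward received for the transition

# delta = reward + gamma * V(next_state) - V(current_state)
delta = reward + gamma * v_next - v_current
print(round(delta, 3))  # positive: the outcome beat the critic's prediction
```

A positive delta nudges the actor toward the action it just took; a negative delta nudges it away.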
Note

Unlike REINFORCE, which waits until the end of an episode to update the policy, Actor-Critic methods can learn at every single timestep. This makes them significantly more sample-efficient -- for an episode of 150 steps, you get 150 learning signals instead of just one.

The tradeoff is bias. The critic's value estimates are imperfect, especially early in training. This means the gradient signals fed to the actor are biased -- the actor is learning from the critic's approximation of reality rather than reality itself. However, in practice this bias is far less damaging than the high variance of pure Monte Carlo methods. As the critic improves, the bias diminishes, and the overall system converges toward an optimal policy. This relationship between actor and critic creates a bootstrapping dynamic: each component helps the other improve, but neither is ever perfectly accurate at any given point in training.

The Origins: From Adaptive Critics to Deep RL

Understanding where Actor-Critic methods came from helps explain why they work the way they do. The idea of separating policy selection from policy evaluation was not born in the deep learning era. In 1983, Barto, Sutton, and Anderson published a landmark paper demonstrating that two cooperating components -- an Associative Search Element (ASE) and an Adaptive Critic Element (ACE) -- could learn to balance a pole on a cart using only a sparse failure signal. Their ACE constructed what the authors described as a learned evaluation more informative than raw reinforcement alone (Barto et al., IEEE Transactions on Systems, Man, and Cybernetics, 1983). This is precisely what modern critics do: transform a sparse or delayed reward into a dense, step-by-step learning signal.

The next critical milestone came in 1999, when Sutton, McAllester, Singh, and Mansour proved the Policy Gradient Theorem. This result showed that the gradient of expected cumulative reward could be written in a form that an approximate value function could estimate without introducing bias -- provided certain compatibility conditions were met. This gave Actor-Critic methods a rigorous theoretical foundation and opened the door to using neural networks as function approximators for both the policy and the value function (Sutton et al., NeurIPS, 1999).

The deep learning revolution brought this architecture to scale. In 2016, Mnih et al. at DeepMind introduced A3C (Asynchronous Advantage Actor-Critic), which showed that running parallel copies of the actor-critic system across CPU threads could stabilize training without the need for experience replay buffers. Their asynchronous approach achieved top performance on Atari games while training on a single multi-core CPU rather than expensive GPUs (Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning," ICML, 2016). A2C (the synchronous variant we implement below) emerged as a simpler, equally effective alternative that is easier to debug and reproduce.

Then in 2017, Schulman et al. introduced PPO -- a method designed to balance simplicity, sample efficiency, and wall-clock training time. PPO's clipped surrogate objective addressed a persistent problem in policy gradient methods: that a single overly aggressive gradient step could destroy a well-performing policy. The original paper described PPO as matching TRPO's reliability and data efficiency through first-order methods alone (Schulman et al., "Proximal Policy Optimization Algorithms," arXiv:1707.06347, 2017). PPO quickly became the default algorithm for applied reinforcement learning -- and its role expanded dramatically when OpenAI used it as the core optimization algorithm in the RLHF pipeline for InstructGPT and later ChatGPT (Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback," arXiv:2203.02155, 2022).

Building a TD Actor-Critic from Scratch

The following implementation uses PyTorch and Gymnasium (the maintained successor to OpenAI Gym). The environment is CartPole-v1, where an agent must balance a pole on a moving cart. The task is considered solved when the agent's average episode reward stays near the environment's maximum of 500 steps.

First, install the required packages:

pip install torch gymnasium

Define the actor and critic as separate neural networks. The actor outputs a softmax probability distribution over the two available actions (push left or push right). The critic outputs a single value estimate for the given state.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import gymnasium as gym
import numpy as np


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        action_probs = F.softmax(self.fc2(x), dim=-1)
        return action_probs


class Critic(nn.Module):
    def __init__(self, state_dim, hidden_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        value = self.fc2(x)
        return value

Now write the training loop. At each step, the actor selects an action, the environment responds, and both networks update using the TD error.

def train_td_actor_critic(env_name="CartPole-v1", episodes=1000,
                          gamma=0.99, lr_actor=1e-3, lr_critic=5e-3):
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    actor = Actor(state_dim, action_dim)
    critic = Critic(state_dim)
    actor_optimizer = optim.Adam(actor.parameters(), lr=lr_actor)
    critic_optimizer = optim.Adam(critic.parameters(), lr=lr_critic)

    reward_history = []

    for episode in range(episodes):
        state, _ = env.reset()
        state = torch.FloatTensor(state)
        episode_reward = 0

        done = False
        while not done:
            # Actor selects an action
            action_probs = actor(state)
            dist = torch.distributions.Categorical(action_probs)
            action = dist.sample()

            # Environment step
            next_state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            next_state = torch.FloatTensor(next_state)
            episode_reward += reward

            # Critic evaluates current and next states
            value = critic(state)
            next_value = critic(next_state) if not terminated else torch.tensor([0.0])

            # Compute TD error
            td_target = reward + gamma * next_value.detach()
            td_error = td_target - value

            # Update critic: minimize squared TD error
            critic_loss = td_error.pow(2)
            critic_optimizer.zero_grad()
            critic_loss.backward()
            critic_optimizer.step()

            # Update actor: increase probability of good actions
            actor_loss = -dist.log_prob(action) * td_error.detach()
            actor_optimizer.zero_grad()
            actor_loss.backward()
            actor_optimizer.step()

            state = next_state

        reward_history.append(episode_reward)

        if episode % 50 == 0:
            avg = np.mean(reward_history[-50:])
            print(f"Episode {episode:4d} | Avg Reward: {avg:.1f}")

    env.close()
    return actor, critic, reward_history
Pro Tip

Notice the use of .detach() in two critical places. When computing the actor loss, td_error.detach() prevents the actor's gradient from flowing back into the critic's weights. When computing the TD target, next_value.detach() treats the next-state value as a fixed target rather than a differentiable quantity. Forgetting either of these is one of the most common bugs in Actor-Critic implementations.
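
The effect of the first .detach() can be verified in isolation. In this minimal sketch (names are illustrative, with a scalar parameter standing in for the actor's output), the detached TD error carries no gradient path back into the critic:

```python
import torch
import torch.nn as nn

critic = nn.Linear(4, 1)
actor_param = torch.tensor([0.3], requires_grad=True)  # stand-in for an actor output

state = torch.randn(4)
value = critic(state)                 # lives on the critic's graph
td_error = torch.tensor([1.0]) - value

# Correct form: the actor loss treats the TD error as a constant
actor_loss = -(actor_param * td_error.detach()).sum()
actor_loss.backward()

print(critic.weight.grad is None)     # True: nothing leaked into the critic
print(actor_param.grad is not None)   # True: the actor still gets its gradient
```

Dropping the .detach() would populate critic.weight.grad here, silently corrupting the critic's next update.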

Advantage Actor-Critic (A2C)

The basic TD Actor-Critic works, but it can be noisy because the TD error from a single step is a rough estimate. Advantage Actor-Critic (A2C) improves on this by collecting a batch of experiences across multiple steps before performing a single, more stable update. A2C is the synchronous counterpart of A3C (Mnih et al., 2016), which originally used asynchronous parallel workers to decorrelate training data. In practice, the synchronous version produces comparable results with simpler code and more reproducible behavior. A2C also typically uses a shared network backbone for both actor and critic, which reduces the total number of parameters and allows the feature extractor to benefit from both learning signals.

Shared Network Architecture

In A2C, a single network extracts features from the state, then branches into two separate output heads: one for action probabilities and one for the state value. This shared representation often trains faster because gradient signals from both the policy and value losses shape the same feature layers.

class ActorCriticNetwork(nn.Module):
    """Shared-backbone network with separate actor and critic heads."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        # Shared feature extractor
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Actor head: outputs action probabilities
        self.actor_head = nn.Linear(hidden_dim, action_dim)
        # Critic head: outputs state value
        self.critic_head = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        features = self.shared(state)
        action_probs = F.softmax(self.actor_head(features), dim=-1)
        value = self.critic_head(features)
        return action_probs, value

N-Step Returns and Advantage Estimation

Instead of updating after every single step, A2C collects n_steps of experience and then computes the advantage for each step. The advantage tells the actor how much better (or worse) a particular action was compared to the average expected outcome from that state.

def compute_returns_and_advantages(rewards, values, dones,
                                     next_value, gamma=0.99):
    """Compute discounted returns and advantages for a batch of steps."""
    returns = []
    advantages = []
    R = next_value

    for t in reversed(range(len(rewards))):
        if dones[t]:
            R = 0.0
        R = rewards[t] + gamma * R
        advantage = R - values[t]
        returns.append(R)
        advantages.append(advantage)

    returns.reverse()
    advantages.reverse()
    return (
        torch.tensor(returns, dtype=torch.float32),
        torch.tensor(advantages, dtype=torch.float32),
    )
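
A quick hand-worked check of the recursion helps confirm the terminal-state handling. This standalone restatement of the loop above uses made-up numbers, with the episode terminating at the last step so the bootstrap value must be discarded:

```python
rewards = [1.0, 1.0, 1.0]
dones = [False, False, True]
next_value = 5.0   # ignored: the episode terminated, so it is never used
gamma = 0.9

R = next_value
returns = []
for t in reversed(range(len(rewards))):
    if dones[t]:
        R = 0.0    # terminal step: no value beyond the episode boundary
    R = rewards[t] + gamma * R
    returns.append(R)
returns.reverse()

print([round(r, 4) for r in returns])  # [2.71, 1.9, 1.0]
```

Working backward: the last return is 1.0, the middle is 1 + 0.9 * 1.0 = 1.9, and the first is 1 + 0.9 * 1.9 = 2.71 -- the bootstrap value of 5.0 never enters.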

A2C Training Loop

The A2C training loop collects a rollout of n_steps transitions, computes returns and advantages, and then performs a single combined update. The total loss includes three components: the actor (policy) loss, the critic (value) loss, and an entropy bonus that encourages exploration by penalizing overly confident action distributions.

def train_a2c(env_name="CartPole-v1", total_steps=100_000,
              n_steps=5, gamma=0.99, lr=7e-4,
              value_coef=0.5, entropy_coef=0.01):

    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    model = ActorCriticNetwork(state_dim, action_dim)
    optimizer = optim.Adam(model.parameters(), lr=lr)

    state, _ = env.reset()
    state = torch.FloatTensor(state)
    episode_reward = 0
    reward_history = []
    step_count = 0

    while step_count < total_steps:
        # Collect n_steps of experience
        log_probs, values, rewards, dones, entropies = [], [], [], [], []

        for _ in range(n_steps):
            action_probs, value = model(state)
            dist = torch.distributions.Categorical(action_probs)
            action = dist.sample()

            next_state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated

            log_probs.append(dist.log_prob(action))
            values.append(value.squeeze())
            rewards.append(reward)
            dones.append(done)
            entropies.append(dist.entropy())

            episode_reward += reward
            step_count += 1

            if done:
                reward_history.append(episode_reward)
                episode_reward = 0
                next_state, _ = env.reset()

            state = torch.FloatTensor(next_state)

        # Bootstrap value for the last state
        with torch.no_grad():
            _, next_value = model(state)
            next_value = next_value.squeeze().item()

        # Compute returns and advantages
        values_list = [v.item() for v in values]
        returns, advantages = compute_returns_and_advantages(
            rewards, values_list, dones, next_value, gamma
        )

        # Normalize advantages for training stability
        if len(advantages) > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Stack tensors
        log_probs = torch.stack(log_probs)
        values = torch.stack(values)
        entropies = torch.stack(entropies)

        # Compute losses
        actor_loss = -(log_probs * advantages.detach()).mean()
        critic_loss = F.mse_loss(values, returns.detach())
        entropy_bonus = entropies.mean()

        total_loss = actor_loss + value_coef * critic_loss - entropy_coef * entropy_bonus

        optimizer.zero_grad()
        total_loss.backward()
        # Gradient clipping prevents exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
        optimizer.step()

        if len(reward_history) % 50 == 0 and len(reward_history) > 0:
            avg = np.mean(reward_history[-50:])
            print(f"Step {step_count:6d} | Episodes: {len(reward_history)} "
                  f"| Avg Reward: {avg:.1f}")

    env.close()
    return model, reward_history
Note

The entropy_coef term is often overlooked but plays a critical role. Without it, the policy can collapse early to a suboptimal deterministic action -- especially in environments with sparse rewards. The entropy bonus keeps the probability distribution spread out, ensuring the agent continues to explore different actions during training.

Proximal Policy Optimization (PPO)

PPO is arguably the default algorithm in applied reinforcement learning today. Introduced by Schulman et al. in 2017, it builds on A2C by adding a mechanism that prevents the policy from changing too drastically in a single update. Large policy updates can destabilize training, causing performance to collapse -- a problem that plagued earlier methods like vanilla policy gradients and even TRPO, which addressed the issue through computationally expensive second-order optimization. PPO replaces that complexity with a simple clipping mechanism that keeps updates within a safe range, using only first-order gradients. The result is an algorithm that, as the original authors noted, strikes a favorable balance between simplicity and performance (Schulman et al., arXiv:1707.06347, 2017).

The Clipped Objective

PPO works by comparing the current policy to the previous policy. For each action in the collected batch, it computes a ratio: how much more or less likely the current policy is to take that action compared to the old policy. If this ratio deviates too far from 1.0, the objective is clipped, preventing the update from overshooting.

# The PPO clipped objective:
#
# ratio = pi_new(a|s) / pi_old(a|s)
#
# L_clip = min(
#     ratio * advantage,
#     clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
# )
#
# epsilon is typically 0.2
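
The clipping behavior can be seen directly with two hand-picked cases (illustrative numbers only):

```python
import torch

epsilon = 0.2
advantages = torch.tensor([2.0, -1.0])
ratios = torch.tensor([1.5, 0.5])   # new policy 1.5x / 0.5x as likely as the old

surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
objective = torch.min(surr1, surr2)

# The min always keeps the more pessimistic surrogate: the positive-advantage
# case is capped at ratio 1.2, the negative-advantage case at ratio 0.8.
print([round(x, 4) for x in objective.tolist()])  # [2.4, -0.8]
```

In both cases the update the optimizer sees is no stronger than a ratio-0.2 deviation would allow, which is exactly the "safe range" PPO enforces.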

PPO Implementation

def train_ppo(env_name="CartPole-v1", total_steps=100_000,
              n_steps=128, n_epochs=4, batch_size=64,
              gamma=0.99, gae_lambda=0.95,
              lr=3e-4, clip_epsilon=0.2,
              value_coef=0.5, entropy_coef=0.01):

    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    model = ActorCriticNetwork(state_dim, action_dim)
    optimizer = optim.Adam(model.parameters(), lr=lr)

    state, _ = env.reset()
    state = torch.FloatTensor(state)
    episode_reward = 0
    reward_history = []
    step_count = 0

    while step_count < total_steps:
        # --- Rollout Phase ---
        states, actions, log_probs_old = [], [], []
        rewards, dones, values = [], [], []

        with torch.no_grad():
            for _ in range(n_steps):
                action_probs, value = model(state)
                dist = torch.distributions.Categorical(action_probs)
                action = dist.sample()

                states.append(state)
                actions.append(action)
                log_probs_old.append(dist.log_prob(action))
                values.append(value.squeeze())

                next_state, reward, terminated, truncated, _ = env.step(action.item())
                done = terminated or truncated

                rewards.append(reward)
                dones.append(done)
                episode_reward += reward
                step_count += 1

                if done:
                    reward_history.append(episode_reward)
                    episode_reward = 0
                    next_state, _ = env.reset()

                state = torch.FloatTensor(next_state)

            # Bootstrap value
            _, next_val = model(state)
            next_val = next_val.squeeze()

        # --- GAE Advantage Estimation ---
        advantages = []
        gae = 0.0
        values_tensor = torch.stack(values)

        for t in reversed(range(n_steps)):
            if dones[t]:
                next_v = 0.0
            elif t == n_steps - 1:
                next_v = next_val.item()
            else:
                next_v = values_tensor[t + 1].item()

            delta = rewards[t] + gamma * next_v - values_tensor[t].item()
            gae = delta + gamma * gae_lambda * (0.0 if dones[t] else 1.0) * gae
            advantages.insert(0, gae)

        advantages = torch.tensor(advantages, dtype=torch.float32)
        returns = advantages + values_tensor.detach()

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Stack rollout data
        states = torch.stack(states)
        actions = torch.stack(actions)
        log_probs_old = torch.stack(log_probs_old)

        # --- Optimization Phase ---
        dataset_size = n_steps
        for epoch in range(n_epochs):
            indices = np.random.permutation(dataset_size)

            for start in range(0, dataset_size, batch_size):
                end = start + batch_size
                batch_idx = indices[start:end]

                batch_states = states[batch_idx]
                batch_actions = actions[batch_idx]
                batch_old_log_probs = log_probs_old[batch_idx]
                batch_advantages = advantages[batch_idx]
                batch_returns = returns[batch_idx]

                # Forward pass with current policy
                action_probs, values_pred = model(batch_states)
                dist = torch.distributions.Categorical(action_probs)
                new_log_probs = dist.log_prob(batch_actions)
                entropy = dist.entropy().mean()

                # Policy ratio
                ratio = torch.exp(new_log_probs - batch_old_log_probs)

                # Clipped surrogate objective
                surr1 = ratio * batch_advantages
                surr2 = torch.clamp(ratio, 1.0 - clip_epsilon,
                                    1.0 + clip_epsilon) * batch_advantages
                actor_loss = -torch.min(surr1, surr2).mean()

                # Value loss
                critic_loss = F.mse_loss(values_pred.squeeze(), batch_returns)

                # Total loss
                loss = (actor_loss
                        + value_coef * critic_loss
                        - entropy_coef * entropy)

                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
                optimizer.step()

        if len(reward_history) % 50 == 0 and len(reward_history) > 0:
            avg = np.mean(reward_history[-50:])
            print(f"Step {step_count:6d} | Episodes: {len(reward_history)} "
                  f"| Avg Reward: {avg:.1f}")

    env.close()
    return model, reward_history
Pro Tip

PPO uses Generalized Advantage Estimation (GAE), introduced by Schulman et al. in a separate 2016 paper, and controlled by the gae_lambda parameter. Setting gae_lambda=1.0 gives the same result as standard Monte Carlo advantage estimation (low bias, high variance). Setting gae_lambda=0.0 gives a pure one-step TD advantage (high bias, low variance). The default value of 0.95 provides a practical balance for many environments. As the original GAE paper established, this parameter allows practitioners to smoothly interpolate between these two extremes, giving precise control over the bias-variance tradeoff (Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation," ICLR, 2016).
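
The two extremes can be checked with a tiny standalone GAE computation (a simplified helper with made-up rewards and critic values, no terminal states):

```python
def gae(rewards, values, next_value, gamma, lam):
    """Standalone GAE over a segment with no terminal states (illustrative)."""
    advantages, g = [], 0.0
    vals = values + [next_value]
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * vals[t + 1] - vals[t]
        g = delta + gamma * lam * g
        advantages.insert(0, g)
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.2, 0.4, 0.6]   # hypothetical critic estimates
next_value = 0.8

# lam=0.0: each advantage is exactly the one-step TD error
adv_td = gae(rewards, values, next_value, gamma=0.99, lam=0.0)
# lam=1.0: each advantage equals the discounted return minus V(s_t)
adv_mc = gae(rewards, values, next_value, gamma=0.99, lam=1.0)

print([round(a, 4) for a in adv_td])  # [1.196, 1.194, 1.192]
print([round(a, 4) for a in adv_mc])
```

Intermediate lambda values blend these two columns, which is the interpolation the GAE paper describes.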

Comparing the Three Approaches

Each of the three implementations represents a different point on the complexity-performance spectrum. Run all three on CartPole-v1 using the following snippet:

if __name__ == "__main__":
    print("=" * 50)
    print("Training TD Actor-Critic")
    print("=" * 50)
    _, _, td_rewards = train_td_actor_critic(episodes=1000)

    print("\n" + "=" * 50)
    print("Training A2C")
    print("=" * 50)
    _, a2c_rewards = train_a2c(total_steps=100_000)

    print("\n" + "=" * 50)
    print("Training PPO")
    print("=" * 50)
    _, ppo_rewards = train_ppo(total_steps=100_000)

Here is a summary of how the three approaches differ:

TD Actor-Critic updates after every single step. This makes it simple to implement and fast per-update, but the gradient signal from a single timestep is noisy. It works well on simple environments but can struggle with more complex tasks where single-step TD estimates introduce too much bias.

A2C collects short rollouts (typically 5 to 20 steps) before updating, which reduces variance compared to single-step updates. The shared network backbone and entropy bonus add stability. A2C is a solid general-purpose algorithm that balances simplicity and performance.

PPO adds the clipped surrogate objective and multiple epochs of optimization over the same batch of data. This makes it remarkably stable -- the policy cannot change too much in any single update, which prevents the catastrophic performance collapses that can plague other policy gradient methods. PPO is the workhorse algorithm used across many production reinforcement learning systems, from game-playing agents to robotics controllers. It also serves as the core RL optimizer in the RLHF pipeline used by OpenAI to align InstructGPT and ChatGPT with human preferences, where a reward model trained on human rankings provides the signal that PPO optimizes against (Ouyang et al., arXiv:2203.02155, 2022). DeepMind's Sparrow dialogue agent used A2C in its RLHF pipeline (Glaese et al., "Improving Alignment of Dialogue Agents via Targeted Human Judgements," arXiv:2209.14375, 2022), and Anthropic has used RLHF-based approaches -- combining reinforcement learning with constitutional AI principles -- in training Claude (Bai et al., "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback," arXiv:2204.05862, 2022).

Common Pitfalls

Three bugs account for many failed Actor-Critic implementations. First, forgetting to .detach() the TD error when computing the actor loss, which lets the actor's gradient leak into the critic. Second, mishandling terminal states -- when an episode terminates, the next-state value must be zero, not the critic's estimate (for time-limit truncations, by contrast, bootstrapping from the critic's estimate is the more accurate choice, which is why Gymnasium reports terminated and truncated separately). Third, using the same learning rate for both actor and critic. The critic typically benefits from a higher learning rate (2x to 5x the actor's), since it needs to converge quickly enough to provide useful feedback to the actor.
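
The terminal-state bug in particular is easy to demonstrate in isolation (stand-in numbers; critic_next_estimate plays the role of whatever the critic happens to predict past the episode boundary):

```python
gamma = 0.99
reward = 1.0
terminated = True
critic_next_estimate = 42.0   # arbitrary: the critic's output is meaningless here

# Wrong: bootstrapping through a terminal state inflates the target
wrong_target = reward + gamma * critic_next_estimate

# Right: a terminal state has no future, so the bootstrap value is zero
right_target = reward + gamma * (0.0 if terminated else critic_next_estimate)

print(round(wrong_target, 2), right_target)  # 42.58 1.0
```

A target inflated this way teaches the critic that dying is valuable, and training typically diverges.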

The Bias-Variance Tradeoff: Why It Matters More Than You Think

The bias-variance tradeoff is the single most important concept underlying the progression from TD Actor-Critic to A2C to PPO. Every design decision in these algorithms -- from how many steps to collect before updating, to whether to use GAE, to how aggressively to clip the surrogate objective -- is fundamentally an answer to the question: how much bias are we willing to accept in exchange for lower variance?

To see why this matters, consider what happens at the extremes. A pure Monte Carlo method (like REINFORCE) uses the actual cumulative return from each episode as its learning signal. This is unbiased -- it reflects exactly what happened. But it is also extremely noisy, because the return from any single episode is influenced by every random action the agent took and every stochastic transition in the environment. Training on such noisy signals requires enormous amounts of data and patience.

At the other extreme, a one-step TD critic uses only the immediate reward plus its own estimate of the next state's value. This is low variance (it depends on only one step of randomness), but it is biased: the critic's estimate of the next state might be wrong, especially early in training. The agent is learning from its own imperfect predictions about the future rather than from the future itself.

GAE provides the bridge. By setting gae_lambda between 0 and 1, you control exactly how far into the future the advantage estimate looks before relying on the critic's prediction. This is not just a theoretical nicety -- the original GAE paper found empirically that the optimal value of lambda was substantially lower than the optimal discount factor gamma, suggesting that value function inaccuracy introduces more bias than temporal discounting (Schulman et al., ICLR, 2016). In practice, this means you should tune gae_lambda based on how confident you are in your critic: better critic, lower lambda; weaker critic, higher lambda.

PPO's clipping mechanism adds a second layer to this tradeoff. Even with good advantage estimates, the policy update itself can overshoot. Clipping the probability ratio ensures that no single update can push the policy too far from its previous version, at the cost of potentially slowing down learning when larger updates would have been safe. The interplay between GAE's bias-variance knob and PPO's update-size limiter is what makes PPO so robust across diverse environments without extensive hyperparameter tuning.

Beyond CartPole: When Simple Implementations Break

The implementations in this article work well on CartPole-v1, but real-world reinforcement learning tasks expose failure modes that simple code does not address. Knowing where these implementations will break -- and what to do about it -- is essential before applying Actor-Critic methods to anything beyond toy environments.

Sparse and delayed rewards. CartPole provides a reward of +1 at every timestep the pole stays upright. Many real tasks provide reward only at the end of a long sequence of actions (e.g., winning a game, completing a navigation task, or generating a helpful response in an LLM). When rewards are sparse, the critic has almost no signal to learn from, and the TD error becomes meaningless noise. The standard solution is reward shaping -- adding intermediate reward signals that guide the agent toward the goal. A more principled approach is to use curiosity-driven exploration, where an intrinsic reward module provides a bonus for visiting novel states, giving the critic something to work with even before external rewards arrive. Hindsight Experience Replay (HER) offers another avenue: by retroactively relabeling failed trajectories with the goals they accidentally achieved, HER generates useful training data from failures.

High-dimensional continuous action spaces. The implementations above use Categorical distributions for discrete actions. Continuous control tasks (robotic arms, autonomous vehicles, simulated locomotion) require outputting real-valued actions from a continuous distribution, typically a Gaussian parameterized by a learned mean and standard deviation. This introduces new failure modes: the standard deviation can collapse to near-zero too early (killing exploration) or remain too large (preventing convergence). Soft Actor-Critic (SAC), introduced by Haarnoja et al. at ICML 2018, addresses this directly by incorporating an entropy bonus into the objective function itself -- not as an auxiliary loss term, but as a formal part of the optimization target. SAC automatically adjusts its exploration-exploitation balance by tuning the entropy coefficient during training.
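
As a hedged sketch of what the continuous-action change looks like, the Categorical head can be swapped for a Gaussian head. The class below is illustrative (names, sizes, and the state-independent log-std parameterization are assumptions, not from the article's code), but the sampling interface mirrors the discrete case:

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Illustrative Gaussian policy head for continuous actions."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # A state-independent log-std is a common, simple parameterization
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.mean_head(self.body(state))
        return torch.distributions.Normal(mean, self.log_std.exp())

actor = GaussianActor(state_dim=3, action_dim=1)
dist = actor(torch.randn(3))
action = dist.sample()            # real-valued action
log_prob = dist.log_prob(action)  # differentiable, exactly as in the discrete case
print(action.shape, log_prob.shape)
```

Because log_std is a learned parameter, entropy regularization directly controls how fast it shrinks -- which is where the early-collapse failure mode described above originates.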

Multi-agent settings and non-stationarity. When multiple agents share an environment, the state transitions become non-stationary from each agent's perspective -- the "environment" includes other agents who are simultaneously learning and changing their behavior. Standard Actor-Critic methods assume a stationary MDP, and this assumption breaks down in multi-agent settings. Independent PPO (IPPO) -- where each agent runs its own PPO instance without sharing parameters -- has proven surprisingly effective in practice, even though it offers no convergence guarantees in the non-stationary setting. Centralized training with decentralized execution (CTDE), where a shared critic has access to all agents' observations during training but each agent acts independently during deployment, provides a more principled framework.
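
The CTDE split is easy to see in code. In this sketch (agent counts and layer sizes are arbitrary), each actor conditions only on its own observation, while the critic consumes the concatenation of every agent's observation -- information that is available at training time but not required at deployment:

```python
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim = 3, 4, 2

# Decentralized actors: one policy per agent, each seeing only its own
# observation. These are the only networks needed at deployment time.
actors = [nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(),
                        nn.Linear(32, act_dim)) for _ in range(n_agents)]

# Centralized critic: during training it sees all agents' observations,
# recovering a stationary learning target that no single agent observes.
critic = nn.Sequential(nn.Linear(n_agents * obs_dim, 64), nn.Tanh(),
                       nn.Linear(64, 1))

obs = torch.randn(n_agents, obs_dim)                        # one obs per agent
logits = [actor(obs[i]) for i, actor in enumerate(actors)]  # local info only
value = critic(obs.flatten())                               # global info, training only
```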

Reward hacking and misalignment. In RLHF applications, the critic's role is played by a learned reward model trained on human preferences. A known failure mode is reward hacking: the policy learns to exploit quirks in the reward model rather than achieving the intended behavior. This is analogous to a student who learns to game a test rather than understand the material -- the "critic" (the test) gives high marks, but the underlying goal is not met. Mitigation strategies include KL-divergence penalties that keep the RL-trained policy close to a supervised baseline, ensemble reward models that reduce the chance of systematic blind spots, and iterative reward model retraining as the policy evolves.
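
The KL-penalty mitigation can be sketched compactly. The per-token log-ratio form and the coefficient name `beta` below follow common RLHF practice but are assumptions about the exact formulation, not a specific library's API:

```python
import torch

def penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Sketch of a KL-penalized RLHF reward. The per-token log-ratio between
    the RL policy and the frozen supervised reference model is summed and
    subtracted from the reward model's score, so the policy pays a price
    for drifting far from the reference -- a brake on reward hacking."""
    kl = policy_logprobs - ref_logprobs        # per-token log pi(a) - log ref(a)
    return rm_score - beta * kl.sum(-1)

rm_score = torch.tensor([2.0])
# Sampled tokens are typically more likely under the policy than the reference,
# so the summed log-ratio is positive and the reward is pushed down.
pi = torch.log(torch.tensor([0.6, 0.5]))
ref = torch.log(torch.tensor([0.4, 0.3]))
reward = penalized_reward(rm_score, pi, ref)
```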

Key Takeaways

  1. Two networks, one goal: The actor learns a policy that maximizes expected reward. The critic learns a value function that evaluates states. Together, they reduce variance while maintaining the ability to handle complex action spaces. This pairing traces directly back to the ASE/ACE architecture of Barto, Sutton, and Anderson (1983).
  2. TD error is the backbone: The temporal difference error -- the gap between what the critic predicted and what actually happened -- drives learning for both networks. It tells the actor which actions were surprisingly good or bad, and it tells the critic how to improve its estimates. This signal is dense (available at every timestep) even when external rewards are sparse.
  3. A2C adds stability through batching: Collecting multi-step rollouts, using a shared network, and adding an entropy bonus all contribute to more stable, efficient training compared to single-step updates. The synchronous design (as opposed to A3C's asynchronous approach) simplifies debugging and produces more reproducible results.
  4. PPO adds safety through clipping: The clipped surrogate objective prevents destructively large policy updates. Combined with GAE for advantage estimation (Schulman et al., 2016) and multiple optimization epochs per rollout, PPO delivers reliable performance across a wide range of tasks. Its stability is the primary reason it was chosen as the RL optimizer in the InstructGPT/ChatGPT RLHF pipeline (Ouyang et al., 2022), while other Actor-Critic variants like A2C have been used in DeepMind's Sparrow dialogue agent (Glaese et al., 2022).
  5. The bias-variance tradeoff is the master knob: Every design choice in Actor-Critic methods -- rollout length, GAE lambda, clipping epsilon, learning rate ratios -- maps to a position on the bias-variance spectrum. Understanding this tradeoff is more valuable than memorizing any specific algorithm.
  6. Start with PPO for real projects: While understanding TD Actor-Critic and A2C is valuable for building intuition, PPO (or its library implementation in Stable Baselines3) is the practical choice for production use cases. It handles hyperparameter sensitivity far better than the simpler variants. For continuous action spaces, also consider SAC (Haarnoja et al., ICML, 2018).
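
Takeaway 5's "master knob" can be made concrete with a recap of GAE in a few lines of plain Python: setting `lam=0.0` recovers the one-step TD advantage (low variance, biased by the critic's errors), while `lam=1.0` recovers the discounted Monte Carlo return minus the value baseline (unbiased by the critic, high variance). The example values below are arbitrary:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (pure-Python sketch).
    `values` carries one extra entry: the bootstrap value of the final state."""
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae   # exponentially weighted sum of deltas
        advantages.append(gae)
    return advantages[::-1]

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.6, 0.7, 0.0]   # final entry bootstraps the last state

td_only = gae_advantages(rewards, values, lam=0.0)  # pure one-step TD errors
mc_like = gae_advantages(rewards, values, lam=1.0)  # Monte Carlo minus baseline
```

Every intermediate `lam` interpolates between these two extremes, which is exactly the bias-variance dial the takeaway describes.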

Actor-Critic methods form the foundation of nearly every modern reinforcement learning algorithm. Whether the end goal is training a game-playing agent, fine-tuning language models with human feedback, or controlling robotic systems, the core principle remains the same: pair a decision-maker with an evaluator, and let them learn from each other. The implementations in this article provide a working starting point for exploring these ideas further -- experiment with different environments, tweak hyperparameters, and observe how each variant responds to increased task complexity. For those looking to go deeper, the key papers to study next are the GAE paper (Schulman et al., ICLR, 2016) for a rigorous understanding of advantage estimation, the PPO paper (Schulman et al., 2017) for the clipping mechanism's theoretical motivation, the InstructGPT paper (Ouyang et al., 2022) for how these algorithms connect to the alignment of large language models, and SAC (Haarnoja et al., ICML, 2018) for maximum-entropy Actor-Critic methods in continuous control.

References

  1. Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5), 834-846.
  2. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems (NeurIPS), 12.
  3. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning (ICML), 48, 1928-1937.
  4. Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations (ICLR).
  5. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  6. Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
  7. Glaese, A., McAleese, N., Trębacz, M., et al. (2022). Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
  8. Bai, Y., Jones, A., Ndousse, K., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  9. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning (ICML), 80, 1861-1870.