Intro to RL
What is RL?
Important
- In RL, an agent learns to make decisions by interacting with an environment (through trial and error)
- The agent receives feedback in the form of rewards.
- The goal is to maximize the total reward over time.
How does the agent make decisions?
- The agent uses a policy ($\pi$) to decide what action to take in a given state.
- Given a state, the policy outputs an action (if deterministic) or a probability distribution over actions (if stochastic).
- The goal of RL is to find an optimal policy that maximizes the expected cumulative reward.
What is the Exploration/Exploitation trade-off?
- Exploration: trying out new actions to discover the best ones.
- Exploitation: choosing the best actions based on what we already know.
What are the two main types of RL algorithms?
- Value-based methods: estimate the value of being in a state (or taking an action in a state) and use this to make decisions.
- train a value function that estimates the expected cumulative reward of being in a state.
- Policy-based methods: train a policy function that outputs the action to take in a given state.
Value-based methods
What is the value of a state?
- The value of a state is the expected discounted return the agent can get if it starts at that state and then acts according to our policy.
But what does it mean to "act according to our policy?"
- For value-based methods, you don’t train the policy.
- Instead, the policy is just a simple pre-specified function that uses the values given by the value-function to select its actions.
- The greedy policy is an example of this: it selects the action that maximizes the value function.
- The epsilon-greedy policy is also commonly used, since it handles the exploration/exploitation trade-off mentioned earlier.
Tldr
- In policy-based methods, the optimal policy ($\pi^*$) is found by training the policy directly.
- In value-based methods, the optimal policy is found indirectly: we first find an optimal value function ($V^*$ or $Q^*$) and then derive the optimal policy from it.
What is the link between Value and Policy?
State-Value Function
- The state-value function ($V_\pi(s)$) estimates the expected cumulative reward of being in a state $s$ and following a policy $\pi$.
- For each state, the state-value function outputs the expected return if the agent starts at that state and then follows the policy forever.
The action-value function
- The action-value function ($Q_\pi(s, a)$) estimates the expected cumulative reward of taking an action $a$ in a state $s$ and then following a policy $\pi$.
Action-Value VS. State-Value function:
- State-value function: calculates the value of a state.
- Action-value function: calculates the value of taking an action in a state (state-action pair).
The Bellman Equation
What is the Bellman equation?
A way to simplify our state-value or state-action value calculation.
Remember: To calculate the value of a state, we need to calculate the return starting at that state and then following the policy forever.
However, calculating the return from scratch for every state repeats a lot of the same work.
Thus, to avoid this repeated computation, we use dynamic programming, specifically the Bellman Equation:
$$V_\pi(s) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s\right]$$
The value of our state is the expected reward we get at the next time step plus the (discounted) expected value of the next state.
In other words, the value of $S_t$ is equal to the immediate reward $R_{t+1}$ plus the discounted value of the next state ($\gamma V_\pi(S_{t+1})$).
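As a short illustration (a standard expansion of the discounted return, not spelled out in these notes), the recursion follows from the definition of the return $G_t$:

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \\
    &= R_{t+1} + \gamma\,(R_{t+2} + \gamma R_{t+3} + \dots) \\
    &= R_{t+1} + \gamma\, G_{t+1}
\end{aligned}
$$

Taking the expectation under $\pi$ on both sides gives $V_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s]$, so each state's value can reuse the values already computed for its successor states.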
Monte Carlo and Temporal Difference Learning
Monte Carlo: learning at the end of an episode
- In Monte Carlo learning, you wait until the end of the episode, then calculate the return $G_t$, and then update the value function $V(S_t)$.
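The corresponding update rules (standard forms, with $\alpha$ as a learning rate; the notes omit the formulas, so these are reconstructed) make the contrast with TD learning explicit:

$$
\begin{aligned}
\text{Monte Carlo:} \quad & V(S_t) \leftarrow V(S_t) + \alpha\,\big[G_t - V(S_t)\big] \\
\text{TD(0):} \quad & V(S_t) \leftarrow V(S_t) + \alpha\,\big[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big]
\end{aligned}
$$

Monte Carlo targets the full observed return $G_t$ (so it must wait for the episode to end), while TD bootstraps from the current estimate $V(S_{t+1})$ after a single step.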
Q-Learning
What is Q-Learning?
Q-Learning is an off-policy, value-based method that uses a TD approach to train its action-value function.
Remember
- Recall: a value-based method finds the optimal policy indirectly by training a value function (or action-value function) that tells us the value of each state (or of each state-action pair).
- Recall: TD approach: update the value function after each time step, instead of at the end of the episode.
What is the difference between value and reward?
- Reward: The immediate reward received after taking an action in a state.
- Value: The expected cumulative reward from taking an action in a state and following a policy thereafter.
Representing the Q-Function
- The Q-function is a table (Q-table) that maps state-action pairs to values.
- It tells us the value of taking a specific action in a given state.
- Initially, the Q-table is uninformative (all values are zero or random), but as the agent explores the environment, the Q-table is updated to approximate the optimal policy.
Why does Q-Learning allow us to learn the optimal policy?
- By training the Q-function (represented as a Q-table) toward the optimal $Q^*$, we can derive the optimal policy: in each state, simply pick the action with the highest Q-value.
The Q-Learning Algorithm
- Step 1: Initialize the Q-table:
- Set $Q(s, a) = 0$ for all $s$ and $a$.
- Ensure $Q(\text{terminal state}, \cdot) = 0$.
- Step 2: Choose an action using the epsilon-greedy strategy:
- Initialize $\epsilon$ (the exploration rate, typically starting near 1 and decayed over time).
- Exploration: With probability $\epsilon$, pick a random action.
- Exploitation: With probability $1 - \epsilon$, pick the best action from the Q-table.
- Step 3: Perform the action $A_t$:
- Observe reward $R_{t+1}$ and the next state $S_{t+1}$.
- Step 4: Update the Q-value using the Bellman equation (see the sketch below):
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\big]$
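A minimal sketch of tabular Q-learning with epsilon-greedy action selection, assuming a Gymnasium-style environment with discrete states and actions (the environment name and hyperparameters are illustrative, not taken from these notes):

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")  # illustrative discrete-state environment
n_states, n_actions = env.observation_space.n, env.action_space.n

Q = np.zeros((n_states, n_actions))     # Step 1: initialize the Q-table
alpha, gamma, epsilon = 0.1, 0.99, 1.0  # learning rate, discount, exploration rate

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Step 2: epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()   # explore
        else:
            action = int(np.argmax(Q[state]))    # exploit
        # Step 3: act, then observe reward and next state
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Step 4: TD update toward the Bellman target
        td_target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (td_target - Q[state, action])
        state = next_state
    epsilon = max(0.05, epsilon * 0.999)  # decay exploration over time
```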
DQN
Why do we use Deep Q-Learning?
- The traditional Q-Learning Q-table becomes impractical for large state spaces.
- Deep Q-Learning replaces the Q-table with a neural network that approximates the Q-function, enabling efficient handling of large state spaces.
Deep Q-Learning Architecture & Algorithm
- Input: A stack of 4 frames passed through a convolutional neural network (CNN).
Why 4 frames?
- To capture the motion of the agent.
- Output: A vector of Q-values, one for each possible action.
Loss Function: MSE
- Minimize the mean squared error between predicted and target Q-values.
- Use the Bellman equation to calculate the target Q-values.
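A minimal sketch of such a network in PyTorch. The layer sizes and the 84x84 input resolution are assumptions modeled on the classic Atari DQN setup; the notes only specify the 4-frame input and the per-action Q-value output.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 grayscale frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(            # input: (batch, 4, 84, 84)
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),        # output: a Q-value for each action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(frames / 255.0))  # scale raw pixel values
```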
Training Deep Q-Learning
- Step 1: Sampling
- Perform actions in the environment.
- Store observed experience tuples in a replay memory.
- Step 2: Training
- Randomly sample a small batch of experience tuples from replay memory.
- Use gradient descent to update the Q-network.
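A hedged sketch of one training step, assuming `q_net` is a Q-network like the one above, `replay_memory` is a list of (state, action, reward, next_state, done) tuples, and `target_net` is the separate target network described under Fixed Q-Targets below (all names and hyperparameters are illustrative):

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def train_step(q_net, target_net, replay_memory, optimizer, batch_size=32, gamma=0.99):
    # Step 2: sample a random mini-batch of experience tuples
    batch = random.sample(replay_memory, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Predicted Q-values for the actions that were actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman targets: r + gamma * max_a Q_target(s', a), zeroed at terminal states
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones)

    loss = F.mse_loss(q_pred, q_target)  # MSE between predicted and target Q-values
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```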
Improvements to Deep Q-Learning
- Experience Replay:
- Store experiences in a replay memory.
- Sample uniformly during training to reduce correlations in data.
- Fixed Q-Targets:
- Use two networks:
- The Q-network being trained, which produces the predicted Q-values.
- A target network (a periodically synced copy) used to compute the TD targets.
- Update the target network less frequently so the targets stay stable.
- Double Deep Q-Learning:
- Mitigate overestimation of Q-values by decoupling action selection from evaluation: the online Q-network selects the best next action, and the target network evaluates it (see the sketch below).
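A hedged sketch of how the double DQN target differs from the standard target, reusing the hypothetical `q_net` / `target_net` names from the training-step sketch above:

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Online network selects the best next action...
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ...and the target network evaluates that action.
        q_next = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * q_next * (1.0 - dones)
```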
Policy Gradient
Policy-based methods
- Directly learn to approximate $\pi^*$ without having to learn a value function
- So we parameterize the policy $\pi_\theta(a \mid s)$ using a neural network.
Goal
Maximize the performance $J(\theta)$ of the parameterized policy $\pi_\theta$ using gradient ascent
Pros and Cons of Policy Gradient Methods
Pros
- Directly estimate the policy without storing additional data (such as action values)
- Can learn a stochastic policy
- Don’t need to manually implement an exploration/exploitation trade-off (the stochastic policy explores on its own)
- Much more effective in high-dimensional (continuous) action spaces
- Better convergence properties
Cons
- They converge to a local max instead of a global max
- Can take longer to train
- Can have higher variance
Policy Gradient Algorithm
- Initialize the policy parameters randomly.
- Define the policy $\pi_\theta(a \mid s)$ as a parameterized probability distribution over actions.
- Generate an episode by following the current policy $\pi_\theta$:
- Observe states $s_t$, take actions $a_t$, and receive rewards $r_t$ for $t = 0, 1, \dots, T$.
- Calculate the cumulative (discounted) return for each step:
$G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k$, where $\gamma$ is the discount factor.
- Compute the policy gradient:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$
- Estimate this gradient by sampling episodes and using Monte Carlo methods (see the REINFORCE sketch after this list).
- Update the policy parameters using gradient ascent:
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate.
- Repeat steps 2–5 until convergence or for a predefined number of iterations.
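A minimal REINFORCE-style sketch in PyTorch, assuming a discrete-action Gymnasium environment and a small policy network (the environment, network shape, and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(1000):
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:  # generate one episode by following pi_theta
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    returns, G = [], 0.0
    for r in reversed(rewards):          # G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns)

    # Gradient ascent on J(theta): minimize the negative of sum_t log pi(a_t|s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```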
Why does the policy gradient work?
- The gradient pushes the policy to assign higher probabilities to actions that lead to higher rewards.
- This allows the policy to improve iteratively by maximizing the expected cumulative reward.
Variants of Policy Gradient
- REINFORCE: A simple implementation using Monte Carlo estimates of the return $G_t$.
- Actor-Critic: Combines a parameterized policy (actor) with a value function (critic) to reduce variance in gradient estimation.
- Proximal Policy Optimization (PPO): Regularizes policy updates to ensure stability and prevent large changes in the policy.
Policy Gradient Theorem
Policy Gradient Theorem
- Objective Function:
- The goal is to maximize the expected cumulative reward under a parameterized policy $\pi_\theta$:
$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$
- Here, $\gamma$ is the discount factor.
- Policy Gradient Theorem:
- The gradient of $J(\theta)$ is given by:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_{\pi_\theta}(s, a)\right]$
- This connects the expected reward to the gradient of the policy’s log-probability, weighted by the action-value function $Q_{\pi_\theta}(s, a)$.
- Simplified Gradient Estimate:
- Using the sampled return $G_t$ as an estimate of $Q_{\pi_\theta}(s_t, a_t)$:
$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$
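A short derivation of where the log-probability term comes from (the standard log-derivative trick, not spelled out in these notes), writing $\tau$ for a trajectory and $R(\tau)$ for its return:

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta \int P(\tau \mid \theta)\, R(\tau)\, d\tau
 = \int P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\, d\tau \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],
\end{aligned}
$$

since the environment's transition probabilities inside $P(\tau \mid \theta)$ do not depend on $\theta$ and drop out of the gradient.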
Key Advantages of Policy Gradient Theorem
- Directly optimizes the policy without needing a value function.
- Works well for stochastic or continuous action spaces.
- Facilitates modeling of complex, parameterized policies.
Variants and Improvements
- Baseline Subtraction:
- Reduce variance by subtracting a baseline $b(s)$ from $Q_{\pi_\theta}(s, a)$:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\,\big(Q_{\pi_\theta}(s, a) - b(s)\big)\right]$
- A common choice for $b(s)$ is the value function $V_{\pi_\theta}(s)$.
- Actor-Critic Methods:
- Use a critic to estimate $V_{\pi_\theta}(s)$ or $Q_{\pi_\theta}(s, a)$ to reduce variance and stabilize training.
Actor-Critic Methods
Variance Problem in REINFORCE
REINFORCE relies on Monte Carlo sampling to estimate the gradient:
$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$
The estimate of $G_t$ (the cumulative reward) can have high variance, especially when:
- Rewards are sparse or delayed.
- The episode length is long.
- The policy explores random or suboptimal actions frequently.
Why is high variance a problem?
- High variance causes instability in training.
- The policy parameters may oscillate or fail to converge.
- Learning slows down as the policy struggles to distinguish between good and bad updates.
Intuition for the Variance Problem
- Consider an environment where reward depends heavily on the final action in a long episode.
- If actions earlier in the episode are unrelated to the reward, their gradients still get updated based on noisy estimates of $G_t$, leading to inefficient learning.
Actor-Critic Methods
- Actor: Parameterizes the policy $\pi_\theta(a \mid s)$.
- Critic: Estimates the value function $V_w(s)$ or action-value function $Q_w(s, a)$.
- Training:
- The actor updates the policy using policy gradients.
- The critic updates the value estimates using temporal difference (TD) learning.
A2C (Advantage Actor-Critic)
- A2C is a reinforcement learning algorithm that combines actor-critic methods and uses the advantage function to improve training stability and efficiency.
- In A2C, the actor learns the policy $\pi_\theta(a \mid s)$, while the critic learns the value function $V_w(s)$.
- The advantage function is defined as:
$A(s, a) = Q(s, a) - V(s)$,
which represents how much better the action $a$ is compared to the average action at state $s$. In practice it is often estimated with the one-step TD error $A(s_t, a_t) \approx r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$.
- The actor is updated using the policy gradient:
$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]$
- The critic is updated using the temporal difference (TD) error:
$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$, e.g., by minimizing $\delta_t^2$ (a combined sketch follows after this list).
- Key Benefits of A2C:
- Reduces variance by using the advantage function.
- More stable than basic REINFORCE, as it incorporates a value function.
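A minimal, single-environment A2C-style update sketch in PyTorch. The network shapes, the one-step advantage estimate, and the loss weighting are illustrative assumptions, not a full A2C implementation:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # policy logits
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # state value V(s)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
gamma = 0.99

def a2c_update(state, action, reward, next_state, done):
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    value = critic(state).squeeze()
    with torch.no_grad():
        next_value = critic(next_state).squeeze() * (1.0 - float(done))

    # One-step TD error doubles as the advantage estimate A(s, a)
    advantage = reward + gamma * next_value - value

    dist = torch.distributions.Categorical(logits=actor(state))
    actor_loss = -dist.log_prob(torch.as_tensor(action)) * advantage.detach()
    critic_loss = advantage.pow(2)            # squared TD error for the critic

    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```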
A3C (Asynchronous Advantage Actor-Critic)
- A3C extends A2C by using multiple asynchronous agents that update the global model in parallel.
- These agents each interact with their own environment and compute gradients, updating the global parameters asynchronously.
- The global network aggregates updates from multiple workers to improve training speed and stability.
- Key Benefits of A3C:
- Asynchronous updates prevent correlated gradients and reduce the risk of local minima, leading to faster convergence.
- It can explore different parts of the environment simultaneously, leading to better generalization.
- Architecture of A3C:
- Each worker runs a separate instance of the environment and computes gradients based on its experiences.
- Gradients from each worker are asynchronously sent to a global network, which updates the shared parameters.
- The global network combines the benefits of multiple workers to converge faster than a single worker.
How A3C Improves Over A2C?
- Parallelism: A3C uses multiple agents (workers) running in parallel, allowing for asynchronous updates to the global model, which improves training efficiency and exploration.
- Reduced Overfitting: As different workers interact with different environments, A3C reduces the likelihood of overfitting to a single environment.
Proximal Policy Optimization
Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) is an on-policy reinforcement learning algorithm designed to improve the stability and performance of policy gradient methods. PPO addresses key issues like high variance and large policy updates, which can destabilize training. The core idea is to limit the size of policy updates to ensure smooth and safe exploration of the policy space.
Algorithm
- The objective function in PPO is designed to maximize the expected cumulative reward while ensuring that the new policy does not deviate too far from the old one. The clipped objective is:
$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$
- $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$: Ratio of new to old policy probabilities at timestep $t$. This ratio quantifies how much the policy has changed.
- $\hat{A}_t$: Advantage estimate, typically computed using Generalized Advantage Estimation (GAE) to reduce bias and variance.
- $\epsilon$: A small hyperparameter that controls the extent to which the policy is allowed to change. Typical values range from 0.1 to 0.3.
Why is clipping used in PPO?
- Clipping is a key component of PPO that ensures stable updates by limiting the likelihood ratio $r_t(\theta)$, preventing overly large updates that could destabilize training.
- By clipping $r_t(\theta)$ to the range $[1-\epsilon,\ 1+\epsilon]$, PPO avoids situations where the policy change is too large. This means that when the ratio is outside this range, the objective function will not continue to increase, thus maintaining stability.
- This clip ensures that updates only happen when the ratio between the old and new policies is within a small, safe range.
PPO Update Rule Explained
- The objective function in PPO has two terms:
- The original (unclipped) objective: $r_t(\theta)\,\hat{A}_t$, where we multiply the ratio by the advantage estimate. This encourages actions that lead to higher advantages (i.e., higher returns compared to the average).
- The clipped objective: $\operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t$, where the ratio is clipped to lie within the range $[1-\epsilon,\ 1+\epsilon]$. If the ratio exceeds this range, the objective function remains constant and does not increase further, preventing large policy updates.
- The $\min$ operator takes the smaller of the two terms as the final objective, effectively ensuring that the policy updates do not deviate drastically from the previous policy (see the sketch below).
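A hedged sketch of the clipped surrogate loss in PyTorch, assuming log-probabilities from the old and new policies and precomputed advantage estimates are already available (the function name and signature are illustrative):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective (to be minimized with gradient descent)."""
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space
    ratio = torch.exp(new_log_probs - old_log_probs)

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # min() keeps the more pessimistic of the two terms, limiting the update size
    return -torch.min(unclipped, clipped).mean()
```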
Why does PPO work well?
- Efficiency: PPO strikes a balance between the sample efficiency of methods like TRPO (Trust Region Policy Optimization) and the simplicity of vanilla policy gradient methods like REINFORCE.
- Stability: By clipping the probability ratios, PPO ensures that updates are stable, avoiding issues with overly large policy changes.
- Scalability: PPO can be applied to large-scale problems with high-dimensional state and action spaces, making it suitable for environments like robotics, games, and simulated environments.
PPO vs. Other Policy Optimization Methods
- TRPO (Trust Region Policy Optimization):
- TRPO uses a constraint on the KL divergence between the old and new policies to ensure safe policy updates. However, TRPO is computationally expensive because it involves solving a constrained optimization problem.
- PPO simplifies this by using clipping rather than a trust region, making it easier to implement and more efficient without sacrificing much performance.
- A3C:
- While A3C uses multiple parallel workers to asynchronously update a global network, PPO is a more sample-efficient method that does not require parallelism and can achieve similar performance on many tasks.
- REINFORCE:
- PPO outperforms REINFORCE because REINFORCE has higher variance and lacks the stability that comes with PPO’s clipping mechanism.
Key Components of PPO
- Advantage Estimation:
- Typically, Generalized Advantage Estimation (GAE) is used to compute the advantage function; it combines Monte Carlo returns and temporal difference estimates to trade off bias and variance (see the GAE sketch after this list).
- Policy Updates:
- The policy is updated by taking steps based on the gradient of the objective function. The key difference from basic policy gradients is the use of the clipped objective function, which provides stability during updates.
- Entropy Regularization (Optional):
- To encourage exploration and prevent premature convergence, PPO often adds an entropy term to the objective function. The entropy term penalizes overly deterministic policies, promoting exploration.
- The entropy bonus is typically added to the objective as $L(\theta) = L^{\text{CLIP}}(\theta) + c\,\mathbb{E}_t\!\left[H\!\big(\pi_\theta(\cdot \mid s_t)\big)\right]$, where $c$ is the entropy coefficient.
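A minimal sketch of GAE computation. The recursive form with parameters $\gamma$ and $\lambda$ is the standard formulation; it is not spelled out in these notes, so the exact signature here is an assumption:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: length T; values: length T + 1 (includes a bootstrap value).
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}, reset at episode boundaries
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    return advantages
```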
Hyperparameters for PPO
- $\epsilon$ (Clipping Parameter):
- Controls how much the new policy can deviate from the old policy. A typical range is between 0.1 and 0.3.
- Learning Rate:
- The step size used in the optimizer, which controls how much the policy is updated at each iteration.
- Batch Size:
- The number of experience samples used in each update. PPO typically performs several epochs of mini-batch optimization over the most recently collected rollouts.
- Entropy Coefficient:
- The weight of the entropy term in the objective function, which controls the exploration-exploitation tradeoff.
Multi-Agent Reinforcement Learning
Summary
- Extends RL to environments with multiple interacting agents.
- Challenges:
- Non-stationarity: Agents’ policies change over time.
- Scalability: Large state-action spaces.
- Approaches:
- Centralized Training, Decentralized Execution: Train a joint policy but allow decentralized decision-making.
- Multi-Agent Policy Gradients: Extend policy gradients to handle multiple agents.