A concise overview of the key concepts, formulas, and methods in Deep Reinforcement Learning.
Intro to RL
What is RL?
In RL, an agent learns to make decisions by interacting with an environment (through trial and error)
The agent receives feedback in the form of rewards.
The goal is to maximize the total reward over time.
How does the agent make decisions?
The agent uses a policy ($\pi$) to decide what action to take in a given state.
Given a state, the policy outputs an action (if deterministic) or a probability distribution over actions (if stochastic).
The goal of RL is to find an optimal policy $\pi ^ *$ that maximizes the expected cumulative reward.
What is the Exploration/Exploitation trade-off?
Exploration: trying out new actions to discover the best ones.
Exploitation: choosing the best actions based on what we already know.
What are the two main types of RL algorithms?
Value-based methods: estimate the value of being in a state (or taking an action in a state) and use this to make decisions.
train a value function that estimates the expected cumulative reward of being in a state.
Policy-based methods: train a policy function that outputs the action to take in a given state.
Value-based methods
What is the value of a state?
The value of a state is the expected discounted return the agent can get if it starts at that state and then acts according to our policy.
But what does it mean to “act according to our policy?”
For value-based methods, you don’t train the policy.
Instead, the policy is just a simple pre-specified function that uses the values given by the value-function to select its actions.
The greedy policy is an example of this: it selects the action that maximizes the value function.
The epsilon-greedy policy is also commonly used (it handles the exploration/exploitation trade-off mentioned earlier).
Thus:
in policy-based methods, the optimal policy ($\pi ^ *$) is found by training the policy directly.
in value-based methods, the optimal policy is found by finding an optimal value function ($V^*$ or $Q^*$) and then using it to derive the optimal policy.
What is the link between Value and Policy?
$ \boxed{\pi ^ * (s) = \arg \max _a Q^* (s, a)} $
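As a tiny illustration of this link (a minimal sketch assuming NumPy; the 5x3 table of values is a placeholder, not from the text):

```python
import numpy as np

# Hypothetical Q-table: 5 states x 3 actions, filled with placeholder values.
Q = np.array([[0.1, 0.5, 0.2],
              [0.0, 0.3, 0.9],
              [0.4, 0.4, 0.1],
              [0.7, 0.2, 0.2],
              [0.1, 0.1, 0.8]])

def greedy_policy(state):
    """pi*(s) = argmax_a Q*(s, a): pick the highest-value action in this state."""
    return int(np.argmax(Q[state]))

print(greedy_policy(1))  # -> 2, the action with the largest Q-value in state 1
```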
The state-value function
The state-value function ($V_{\pi}(s)$) estimates the expected cumulative reward of being in a state and following a policy $\pi$:
$V_{\pi}(s) = \mathbb{E}_{\pi} [R_t \mid S_t = s]$
For each state, the state-value function outputs the expected return if the agent starts at that state and then follows the policy $\pi$ forever.
The action-value function
The action-value function ($Q_{\pi}(s, a)$) estimates the expected cumulative reward of taking an action in a state and then following a policy $\pi$:
$Q_{\pi}(s, a) = \mathbb{E}_{\pi} [R_t \mid S_t = s, A_t = a]$
Action-Value VS. State-Value function:
State-value function: calculates the value of a state.
Action-value function: calculates the value of taking an action in a state (state-action pair).
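The two are related: under a policy $\pi$, the value of a state is the expectation of the action-values weighted by the policy's action probabilities (a standard identity):
$V_{\pi}(s) = \sum_a \pi(a \mid s) \, Q_{\pi}(s, a)$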
The Bellman Equation
What is the Bellman equation?
A way to simplify our state-value or state-action value calculation.
Remember: to calculate the value of a state, we need to calculate the return starting at that state and then following the policy forever.
Doing this from scratch for every state is wasteful: the return from $S_t$ already contains the return from $S_{t+1}$, which contains the return from $S_{t+2}$, and so on.
Thus, to avoid this repeated computation, we use Dynamic Programming, specifically the Bellman Equation:
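The Bellman Equation expresses the value of a state recursively, as the immediate reward plus the discounted value of the state that follows:
$V_{\pi}(s) = \mathbb{E}_{\pi} [R_{t+1} + \gamma V_{\pi}(S_{t+1}) \mid S_t = s]$
So instead of recomputing the full return for every state, each value is defined in terms of the value of its successor state.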
Q-Learning
What is Q-Learning?
Q-Learning is an off-policy, value-based method that uses a TD (Temporal Difference) approach to train its action-value function.
Recall: value-based method: find the optimal policy indirectly by training a value/action-value function that tells us the value of each state / value of each state-action pair
Recall: TD approach: update the value function after each time step, instead of at the end of the episode.
We use Q-Learning as the algorithm to train our Q-function (the action-value function that determines the value of taking an action in a state).
What is the difference between value and reward?
Reward: the immediate reward we get after taking an action in a state.
Value: the expected cumulative reward we get after taking an action in a state and then following our policy.
How do we represent the Q-function?
The Q-function is represented internally by a Q-table: a table that maps each state-action pair to a value.
So given a state and an action, the Q-function tells us the value of taking that action in that state.
Initially, the Q-table is useless (all values are set to zero / random). But as the agent explores the environment, we update the Q-table, giving us a better and better approximation of the optimal policy.
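For instance (a minimal sketch assuming NumPy; the 16-state, 4-action sizes are placeholders):

```python
import numpy as np

n_states, n_actions = 16, 4                  # placeholder environment sizes
q_table = np.zeros((n_states, n_actions))    # initially all values are zero

value = q_table[5, 2]                        # value of taking action 2 in state 5
```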
So, why does Q-learning allow us to learn the optimal policy?
We are training a Q-function, which is represented as a Q-table that contains all the state-action pair values.
Given a state and action, our Q-function then searches its Q-table to find the value of that state-action pair.
Once we are done training, we will have an Optimal Q-function (optimal Q-table).
If we have an optimal Q-function, we have an optimal policy, since we now know the best action to take in each state.
The Algorithm:
Initialize the Q-table with 0 values for each state-action pair: $Q(s, a) = 0$ for all $s \in S$, $a \in A(s)$, and $Q(\text{terminal state}, \cdot) = 0$.
Choose Action using epsilon-greedy strategy
What is the epsilon greedy strategy?
Initialize $\epsilon = 1.0$.
With probability $\epsilon$, we do exploration (trying a random action).
With probability $1 - \epsilon$, we do exploitation (taking the best action according to our Q-table).
At the beginning of training, the probability of exploration is high since $\epsilon$ is high; as training progresses, $\epsilon$ is gradually reduced so the agent exploits more and more.
Perform action $A_t$, get reward $R_{t+1}$, and observe the new state $S_{t+1}$.
Update $Q(S_t, A_t)$ using the TD target: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$
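Putting the steps together, here is a minimal sketch of the full loop. It assumes the `gymnasium` package and its `FrozenLake-v1` environment; the hyperparameter values are illustrative, not from the text:

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
q_table = np.zeros((env.observation_space.n, env.action_space.n))

alpha, gamma = 0.1, 0.99                       # learning rate and discount factor
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.0005

for episode in range(10_000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()        # explore
        else:
            action = int(np.argmax(q_table[state]))   # exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-Learning update: move Q(S_t, A_t) toward the TD target.
        td_target = reward + gamma * np.max(q_table[next_state]) * (not terminated)
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state

    epsilon = max(eps_min, epsilon - eps_decay)        # decay exploration over time
```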
Off-Policy vs. On-Policy
Off-Policy: the agent uses a different policy for acting (collecting experience) and for updating (computing the target).
On-Policy: the agent uses the same policy for acting and updating.
Q-Learning is off-policy: it acts with an epsilon-greedy policy, but its update target uses the greedy ($\max_a$) action.
Deep Q-Learning
Why do we want to use Deep Q-Learning?
Q-Learning is limited by the size of the Q-table.
For large state spaces, the Q-table becomes too large to store and update.
Deep Q-Learning uses a neural network to approximate the Q-function, allowing us to handle large state spaces.
Architecture & Algorithm:
Input: stack of 4 frames passed through a CNN.
Output: a vector of Q-values, one for each possible action.
Why 4 frames?
To capture motion, which a single static frame cannot convey.
In Deep Q-Learning, we use a loss function to update the Q-network.
The loss function is the mean squared error between the predicted Q-values and the target Q-values.
The target Q-values are calculated using the Bellman equation: $y = R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$.
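A rough sketch of this architecture and loss, assuming PyTorch; the layer sizes, the Atari-style 84x84 input, and the 6 actions are illustrative assumptions, not values from the text:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """CNN that maps a stack of 4 frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q-value per action
        )

    def forward(self, frames):           # frames: (batch, 4, 84, 84)
        return self.net(frames)

q_net = QNetwork(n_actions=6)
frames = torch.zeros(32, 4, 84, 84)           # dummy minibatch of stacked frames
actions = torch.zeros(32, dtype=torch.long)   # actions that were taken
rewards = torch.zeros(32)
next_frames = torch.zeros(32, 4, 84, 84)
dones = torch.zeros(32)                       # 1.0 where the episode ended
gamma = 0.99

# MSE loss between predicted Q-values and Bellman targets.
q_pred = q_net(frames).gather(1, actions.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    q_next = q_net(next_frames).max(dim=1).values
target = rewards + gamma * (1 - dones) * q_next
loss = nn.functional.mse_loss(q_pred, target)
loss.backward()
```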
The Deep Q-Learning training algorithm has two phases:
Sampling: we perform actions and store the observed experience tuples in a replay memory.
Training: Select a small batch of tuples randomly and learn from this batch using a gradient descent update step.
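A minimal sketch of these two phases, assuming a plain Python deque as the replay memory; names like `replay_memory` and `BATCH_SIZE` are placeholders:

```python
import random
from collections import deque

replay_memory = deque(maxlen=100_000)   # stores (state, action, reward, next_state, done)
BATCH_SIZE = 32

def store_experience(state, action, reward, next_state, done):
    """Sampling phase: act in the environment and store the experience tuple."""
    replay_memory.append((state, action, reward, next_state, done))

def sample_batch():
    """Training phase: draw a random minibatch to learn from with gradient descent."""
    batch = random.sample(replay_memory, BATCH_SIZE)
    states, actions, rewards, next_states, dones = zip(*batch)
    return states, actions, rewards, next_states, dones
```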
Methods to improve Deep Q-Learning:
Experience Replay: store the agent’s experiences in a replay memory and sample uniformly from this memory during training.
Fixed Q-Targets: use a separate target network to compute the target Q-values; it is a copy of the Q-network whose weights are updated less frequently, which stabilizes training.
Double Deep Q-Learning: handles the overestimation of Q-values by using the online Q-network to select the best next action and the target network to evaluate its value (see the sketch below).
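A sketch of how the fixed target network and the Double DQN target might look, assuming PyTorch and reusing the `QNetwork` class from the sketch above; `TARGET_UPDATE_EVERY` is an illustrative hyperparameter:

```python
import copy
import torch

q_net = QNetwork(n_actions=6)        # online network: selects actions and is trained
target_net = copy.deepcopy(q_net)    # fixed target network: only evaluates targets
TARGET_UPDATE_EVERY = 10_000         # sync the target network every N steps
gamma = 0.99

def double_dqn_target(rewards, next_frames, dones):
    """Double DQN: the online net selects the action, the target net evaluates it."""
    with torch.no_grad():
        best_actions = q_net(next_frames).argmax(dim=1, keepdim=True)
        q_eval = target_net(next_frames).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1 - dones) * q_eval

def maybe_sync_target(step):
    """Fixed Q-Targets: periodically copy the online weights into the target net."""
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```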
Policy Gradient Methods
In Q-Learning, we learned a value function and used it to derive the optimal policy $\pi^*$.
In Policy-based methods, we learn the policy directly.
\[\Large \boxed{\pi_\theta(s) = P[A|s;\theta]}\]
We maximize the performance of the parameterized policy using gradient ascent on an objective function $J(\theta)$ (the expected cumulative reward). The update rule and the policy gradient are:
\[\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta), \qquad \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a) \right]\]
where $\nabla_\theta J(\theta)$ is the gradient of the objective function $J(\theta)$ with respect to the policy parameters $\theta$, $\pi_\theta(s,a)$ is the probability of taking action $a$ in state $s$ under policy $\pi_\theta$, and $Q^{\pi_\theta}(s,a)$ is the action-value function.
By updating the policy parameters in the direction of the gradient, we can iteratively improve the policy to maximize the expected cumulative reward.
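As an illustration, here is a minimal REINFORCE-style update (one specific policy-gradient method), assuming PyTorch and a small MLP policy; the state size, action count, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(                 # maps a 4-dim state to action probabilities
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 2), nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """One gradient-ascent step on log pi(a|s) weighted by the discounted return."""
    # Compute the discounted return G_t for every step of the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    probs = policy(torch.stack(states))                        # (T, n_actions)
    chosen = torch.tensor(actions).unsqueeze(1)                 # (T, 1)
    log_probs = torch.log(probs.gather(1, chosen).squeeze(1))   # log pi(a_t | s_t)

    # Gradient ascent on J(theta) == gradient descent on -J(theta).
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```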
Policy-based vs policy-gradient methods:
Policy-based methods: search directly for the optimal policy, but optimize the parameters $\theta$ of the policy $\pi_\theta$ indirectly, by maximizing a local approximation of the objective function $J(\theta)$ with techniques like hill climbing, simulated annealing, or genetic algorithms.
Policy-gradient methods: also search directly for the optimal policy, but optimize the parameters $\theta$ directly, by performing gradient ascent on the expected cumulative reward.