An Overview of Deep Reinforcement Learning

A concise overview of the key concepts, formulas, and methods in Deep Reinforcement Learning.

Intro to RL

What is RL?

How does the agent make decisions?

What is the Exploration/Exploitation trade-off?

  • Exploration: trying out new actions to discover the best ones.
  • Exploitation: choosing the best actions based on what we already know.

What are the two main types of RL algorithms?

  • Value-based methods: estimate the value of being in a state (or taking an action in a state) and use this to make decisions.
    • train a value function that estimates the expected cumulative reward of being in a state.
  • Policy-based methods: train a policy function that outputs the action to take in a given state.

Value-based methods

What is the value of a state?

But what does it mean to “act according to our policy?”

Thus:

What is the link between Value and Policy?

$\Large \boxed{\pi^*(s) = \arg\max_a Q^*(s, a)}$
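
As a minimal illustration of this link, assuming we already have a (hypothetical) Q-table stored as a 2D array with one row per state and one column per action, the optimal policy simply picks the argmax action in each state:

```python
import numpy as np

# Hypothetical optimal Q-table: rows are states, columns are actions (illustrative values).
q_table = np.array([
    [0.1, 0.5, 0.2],   # state 0
    [0.7, 0.0, 0.3],   # state 1
    [0.2, 0.2, 0.9],   # state 2
])

def greedy_policy(q_table: np.ndarray, state: int) -> int:
    """pi*(s) = argmax_a Q*(s, a): pick the action with the highest Q-value."""
    return int(np.argmax(q_table[state]))

print([greedy_policy(q_table, s) for s in range(q_table.shape[0])])  # [1, 0, 2]
```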

The state-value function

The action-value function

Action-Value vs. State-Value function:

The Bellman Equation

What is the Bellman equation?

A way to simplify our state-value or state-action value calculation.

Remember: to calculate the value of a state, we need to calculate the return starting at that state and then following the policy until the end of the episode.

Doing this from scratch for every state means summing up long sequences of rewards over and over again.

Thus, to avoid this repeated computation, we use Dynamic Programming, specifically the Bellman Equation, which expresses $V(S_{t})$ recursively in terms of $V(S_{t+1})$:

\[\Large \boxed{V_\pi(s) = \mathbb{E}_\pi [R_{t+1} + \gamma V_\pi(S_{t+1}) | S_{t} = s]}\]

The value of our state is the expected reward we get at the next time step plus the (discounted) expected value of the next state.

The value of $V(S_{t+1})$ is equal to the immediate reward $R_{t+2}$ plus the discounted value of the next state ($\gamma \cdot V(S_{t+2})$).
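
To make this recursion concrete, here is a minimal sketch of iterative policy evaluation on a small, made-up MDP (the chain structure, rewards, discount factor, and policy below are illustrative assumptions, not from the text). Each sweep applies the Bellman equation as an update: the value of a state becomes the immediate reward plus the discounted value of the next state.

```python
# Tiny deterministic chain MDP (illustrative): from state s the policy moves to s+1; the last state is terminal.
n_states = 4
gamma = 0.9
reward = [1.0, 0.0, 2.0, 0.0]   # assumed reward received on leaving each state

V = [0.0] * n_states
for sweep in range(100):
    for s in range(n_states - 1):   # the last state is terminal, so V stays 0
        next_s = s + 1              # deterministic policy: always move right
        # Bellman update: V(s) = E[R_{t+1} + gamma * V(S_{t+1})]
        V[s] = reward[s] + gamma * V[next_s]

print([round(v, 3) for v in V])     # converges to [2.62, 1.8, 2.0, 0.0]
```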

Monte Carlo and Temporal Difference Learning

Monte Carlo: learning at the end of an episode. We wait until the episode is over, compute the actual return $G_t$, and use it as the target:

$\Large \boxed{V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))}$

Temporal Difference: learning at each step. We don't wait for the episode to end; instead, we bootstrap by using the TD target $R_{t+1} + \gamma V(S_{t+1})$, built from the current estimate of the next state's value:

$\Large \boxed{V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))}$
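
A minimal sketch contrasting the two update rules. The episode data, state names, and hyperparameters (`alpha`, `gamma`) are placeholders for illustration:

```python
alpha, gamma = 0.1, 0.99
V = {}  # state -> estimated value, defaults to 0.0

def mc_update(episode):
    """Monte Carlo: wait until the episode ends, then use the actual return G_t as the target.
    `episode` is a list of (state, reward received after leaving that state) pairs."""
    G = 0.0
    for state, reward in reversed(episode):      # accumulate the discounted return backwards
        G = reward + gamma * G
        V[state] = V.get(state, 0.0) + alpha * (G - V.get(state, 0.0))

def td_update(state, reward, next_state):
    """Temporal Difference: update after every step, bootstrapping from V(S_{t+1})."""
    td_target = reward + gamma * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * (td_target - V.get(state, 0.0))

# Usage with placeholder data: a 3-step episode of (state, reward) pairs.
mc_update([("s0", 0.0), ("s1", 0.0), ("s2", 1.0)])
td_update("s0", 0.0, "s1")
```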

Q-Learning

What is Q-Learning? Q-Learning is an off-policy, value-based method that uses a TD approach to train its action-value function.

  • Recall: value-based method: find the optimal policy indirectly by training a value/action-value function that tells us the value of each state / value of each state-action pair
  • Recall: TD approach: update the value function after each time step, instead of at the end of the episode.

What is the difference between value and reward?

  • Reward: the immediate reward we get after taking an action in a state.
  • Value: the expected cumulative reward we get after taking an action in a state and then following our policy.

How do we represent the Q-function?

  • The Q-function is a table that maps each state-action pair to a value.
  • So given a state and an action, the Q-function tells us the value of taking that action in that state.

    Initially, the Q-table is useless (all values are set to zero / random). But as the agent explores the environment, we update the Q-table, giving us a better and better approximation of the optimal policy.
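
In code, the Q-table can be as simple as a 2D array of zeros with one row per state and one column per action (the sizes below are placeholders):

```python
import numpy as np

n_states, n_actions = 16, 4             # placeholder sizes, e.g. a small gridworld
q_table = np.zeros((n_states, n_actions))

# Looking up the value of taking action 2 in state 5 (still 0.0 before any training):
value = q_table[5, 2]
```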

So, why does Q-learning allow us to learn the optimal policy?

  1. We are training a Q-function, which is represented as a Q-table that contains all the state-action pair values.
  2. Given a state and action, our Q-function then searches its Q-table to find the value of that state-action pair.
  3. Once we are done training, we will have an Optimal Q-function (optimal Q-table).
  4. If we have an optimal Q-function, we have an optimal policy, since we now know the best action to take in each state.

The Algorithm:

  1. Initialize the Q-table for each state-action pair with 0 values.
    • Initialize $Q$ arbitrarily (e.g., $Q(s, a) = 0$ for all $s \in S$, $a \in A(s)$), and $Q(\text{terminal state}, \cdot) = 0$
  2. Choose Action using epsilon-greedy strategy

    What is the epsilon greedy strategy?

    • Initialize $\epsilon = 1.0$
    • With probability $\epsilon$, we do exploration (trying a random action)
    • With probability $1 - \epsilon$, we do exploitation (taking the best action according to our Q-table)
    • At the beginning of training, the probability of exploration is high since $\epsilon$ is high; $\epsilon$ is then progressively reduced so that the agent exploits more and more.
  3. Perform Action $A_t$, get Reward $R_{t+1}$, and observe the new state $S_{t+1}$
  4. Update $Q(S_t, A_t)$ using the TD target $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$ (see the tabular sketch after this list):

    $\Large \boxed{Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]}$
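
Putting steps 1–4 together, here is a minimal tabular Q-learning sketch. The environment interface (`env.reset()` and `env.step(action)` returning `(next_state, reward, done)`) and the hyperparameter values are simplifying assumptions for illustration, following a Gym-style convention rather than any specific library:

```python
import random
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.99, eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    """Tabular Q-learning with an epsilon-greedy behaviour policy (off-policy TD control)."""
    Q = np.zeros((n_states, n_actions))        # step 1: initialize the Q-table with zeros
    epsilon = eps_start
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # step 2: epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.randrange(n_actions)            # explore
            else:
                action = int(np.argmax(Q[state]))               # exploit
            # step 3: perform the action, observe the reward and the next state
            next_state, reward, done = env.step(action)
            # step 4: update Q(S_t, A_t) towards the TD target R_{t+1} + gamma * max_a Q(S_{t+1}, a)
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
        epsilon = max(eps_end, epsilon * eps_decay)             # decay exploration over time
    return Q
```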

Off-Policy vs. On-Policy

Deep Q-Learning

Why do we want to use Deep Q-Learning?

Architecture & Algorithm:

The Deep Q-Learning training algorithm has two phases:

  1. Sampling: we perform actions and store the observed experience tuples in a replay memory.
  2. Training: Select a small batch of tuples randomly and learn from this batch using a gradient descent update step.
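
A minimal PyTorch-style sketch of the two phases, assuming a small fully connected Q-network, a replay memory of `(state, action, reward, next_state, done)` tuples, and a Gym-style environment interface; all names, dimensions, and hyperparameters here are illustrative assumptions, not from the text:

```python
import random
from collections import deque

import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99          # placeholder sizes
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_memory = deque(maxlen=100_000)

def sample_phase(env, epsilon):
    """Sampling: act epsilon-greedily and store each experience tuple in the replay memory."""
    state, done = env.reset(), False
    while not done:
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            with torch.no_grad():
                action = int(q_net(torch.tensor(state, dtype=torch.float32)).argmax())
        next_state, reward, done = env.step(action)
        replay_memory.append((state, action, reward, next_state, done))
        state = next_state

def train_phase(batch_size=32):
    """Training: sample a random batch of tuples and take one gradient descent step on the TD error."""
    batch = random.sample(replay_memory, batch_size)
    states, actions, rewards, next_states, dones = map(
        lambda x: torch.tensor(x, dtype=torch.float32), zip(*batch))
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        targets = rewards + gamma * q_net(next_states).max(dim=1).values * (1 - dones)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```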

Methods to improve Deep Q-Learning:

  1. Experience Replay: store the agent’s experiences in a replay memory and sample uniformly from this memory during training, which reuses experiences and breaks the correlation between consecutive samples.
  2. Fixed Q-Targets: use a separate target network, whose parameters are only periodically copied from the Q-network, to compute the TD target, so that the target stays stable between updates.
    • The target network is updated less frequently than the Q-network.
  3. Double Deep Q-Learning: handle the overestimation of Q-values by using the online Q-network to select the best next action and the target network to evaluate that action’s value (see the sketch after this list).
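
The difference between the target used with fixed Q-targets and the Double DQN target lies only in how the next action is selected and evaluated. A sketch, assuming `q_net` is the online network from the previous snippet, `target_net` is its periodically refreshed copy, and the `rewards` / `next_states` / `dones` tensors come from a sampled batch:

```python
import torch

def fixed_target(target_net, rewards, next_states, dones, gamma=0.99):
    """Fixed Q-targets: the (periodically copied) target network both selects and evaluates
    the next action via the max."""
    with torch.no_grad():
        return rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the online network selects the next action, the target network evaluates it,
    which reduces the overestimation of Q-values."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)            # selection
        next_values = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation
        return rewards + gamma * next_values * (1 - dones)

# The target network is a periodically refreshed copy of the Q-network,
# e.g. target_net = copy.deepcopy(q_net), re-copied every C training steps.
```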

Policy Gradient Methods

\[\Large \boxed{\pi_\theta(s) = P[A|s;\theta]}\]

\[\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(s,a) \, Q^{\pi_\theta}(s,a) \right]\]

where $\nabla_\theta J(\theta)$ is the gradient of the objective function $J(\theta)$ with respect to the policy parameters $\theta$, $\pi_\theta(s,a)$ is the probability of taking action $a$ in state $s$ under policy $\pi_\theta$, and $Q^{\pi_\theta}(s,a)$ is the action-value function.

By updating the policy parameters in the direction of the gradient, we can iteratively improve the policy to maximize the expected cumulative reward.
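
A minimal REINFORCE-style sketch of this update, using the sampled return $G_t$ as a Monte Carlo estimate of $Q^{\pi_\theta}(s,a)$; the policy network architecture, the Gym-style environment interface, and the hyperparameters are assumptions for illustration:

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma, lr = 4, 2, 0.99, 1e-3      # placeholder sizes
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, n_actions), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

def reinforce_episode(env):
    """Sample one episode with pi_theta, then ascend the policy gradient
    grad J(theta) ~= sum_t grad log pi_theta(a_t|s_t) * G_t."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        probs = policy(torch.tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(int(action))
        rewards.append(reward)

    # Compute the discounted return G_t for every time step (backwards accumulation).
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Gradient ascent on J(theta) == gradient descent on the negative objective.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```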

Policy-based vs policy-gradient methods: