Course Notes for CMU 16-831, Graduate Course on Robot Learning: Introduction to Robot Learning

Imitation Learning

MDP (Markov Decision process)

Definitions:

  • $\mathcal{S}$: State space, $s_t$: state at time $t$
  • $\mathcal{A}$: Action space, $a_t$: action at time $t$
  • $\mathcal{T}$: Transition probability, $p(s_{t+1} \mid s_t, a_t)$
  • $r$: Reward function, $r(s_t, a_t)$

Goal: Learn a policy $\pi(a_t \mid s_t)$.
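Spelled out, the usual objective (a standard formulation; the discount factor $\gamma \in [0, 1)$ is not defined elsewhere in these notes) is to find

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t)\right]$$

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory generated by running $\pi$ under the transition model.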

POMDP (Partially Observed MDP)

Additional Definitions:

  • $\mathcal{O}$: Observation space, $o_t$: observation at time $t$
  • $\Omega$: Observation model, $p(o_t \mid s_t)$

Goal: Learn a policy $\pi(a_t \mid o_t)$ (or, more generally, $\pi(a_t \mid o_{1:t})$ over the observation history).

Imitation Learning

Idea

  • Collect expert data (observation/state and action pairs)
  • Train a function to map observations/states to actions, i.e., behavior cloning (a minimal sketch follows below)
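A minimal behavior-cloning sketch in PyTorch, assuming continuous actions and an MSE loss; the dimensions and the random tensors standing in for an expert dataset are purely illustrative:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2            # illustrative dimensions
obs = torch.randn(1024, obs_dim)   # stand-in for expert observations
act = torch.randn(1024, act_dim)   # stand-in for expert actions

# Small MLP policy: observation -> action
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(100):
    loss = nn.functional.mse_loss(policy(obs), act)  # regress onto expert actions
    opt.zero_grad()
    loss.backward()
    opt.step()
```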

Dataset Aggregation (DAgger)

  1. Process:

    • Start with expert demonstrations.
    • Train policy $\pi_1$ via supervised learning.
    • Run $\pi_i$, query the expert to correct mistakes, and collect new data.
    • Aggregate new and old data, retrain to create $\pi_{i+1}$.
    • Repeat the process iteratively (see the sketch after this list).
  2. Advantages:

    • Reduces cascading errors.
    • Provides theoretical regret guarantees.
  3. Limitations:

    • Requires frequent expert queries.
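A sketch of the DAgger loop following the process above; `env`, `expert_action`, and `fit` are hypothetical placeholders for an environment, an expert query, and supervised training:

```python
def dagger(env, expert_action, fit, expert_demos, num_iters=10, horizon=200):
    """DAgger sketch: run the current policy, label visited states
    with expert actions, aggregate, and retrain."""
    dataset = list(expert_demos)           # start from expert demonstrations
    policy = fit(dataset)                  # pi_1 via supervised learning
    for i in range(num_iters):
        obs = env.reset()
        for t in range(horizon):
            a_policy = policy(obs)         # act with the current policy...
            a_expert = expert_action(obs)  # ...but query the expert for the label
            dataset.append((obs, a_expert))
            obs, done = env.step(a_policy)
            if done:
                break
        policy = fit(dataset)              # retrain pi_{i+1} on aggregated data
    return policy
```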

IL with Privileged Teachers

  • It can be hard to directly learn the policy $\pi(a_t \mid o_t)$, especially if the observation $o_t$ is high-dimensional

Obtain a "privileged" teacher $\pi_{\text{teacher}}$

  • $\pi_{\text{teacher}}$ has access to “ground truth” information that is not available to the “student” $\pi_{\text{student}}$
  • Then use $\pi_{\text{teacher}}$ to generate demonstrations for $\pi_{\text{student}}$

Example

  • Stage 1: learn a “privileged agent” from the expert
    • It knows the ground-truth state (traffic lights, other vehicles’ positions/velocities, etc.)
  • Stage 2: a sensorimotor student learns from this trained privileged agent

This is especially useful in simulation, because every variable’s value is known in sim. The privileged teacher learns from that full state, while the student learns only from quantities it can directly sense/measure.

  • The privileged teacher is usually trained with PPO (a sketch of the subsequent distillation step follows below)
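A minimal sketch of the teacher-to-student distillation step, assuming paired (privileged state, onboard observation) data collected in simulation; all networks, sizes, and tensors are illustrative stand-ins:

```python
import torch
import torch.nn as nn

state_dim, obs_dim, act_dim = 16, 64, 4     # illustrative sizes
teacher = nn.Linear(state_dim, act_dim)     # stand-in for a trained (e.g., PPO) teacher
student = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# Stand-ins for paired data collected in sim, where both the privileged
# state and the onboard observation are available for the same timestep.
priv_state = torch.randn(512, state_dim)
onboard_obs = torch.randn(512, obs_dim)

for epoch in range(100):
    with torch.no_grad():
        target = teacher(priv_state)        # the teacher's action is the label
    loss = nn.functional.mse_loss(student(onboard_obs), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```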

Variants

Deep Imitation Learning with Generative Modeling

What is the problem setup in generative modeling?

  • Learn: fit a distribution $p_\theta(x)$ that matches the data distribution $p_{\text{data}}(x)$
  • Sample: generate novel data $x \sim p_\theta(x)$ so that $x$ resembles draws from $p_{\text{data}}$

For robotics, we want the data distribution $p_{\text{data}}$ to come from experts. There are three leading approaches:

GAN + IL: Generative Adversarial Imitation Learning (GAIL)

  • Sample trajectories from the student policy
  • Update the discriminator, which aims to classify teacher (expert) data vs. student data
  • Train the student policy to minimize the discriminator’s accuracy, i.e., to fool it (see the sketch below)
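A schematic GAIL update, assuming the student's state-action pairs come from rollouts and the policy itself is updated by a separate RL step; the random tensors standing in for data are illustrative:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
                     nn.Linear(128, 1))                # D(s, a): expert vs. student
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

expert_sa = torch.randn(256, obs_dim + act_dim)        # stand-in expert (s, a) pairs
student_sa = torch.randn(256, obs_dim + act_dim)       # stand-in student rollouts

# Discriminator step: push expert pairs toward 1, student pairs toward 0.
d_loss = bce(disc(expert_sa), torch.ones(256, 1)) + \
         bce(disc(student_sa), torch.zeros(256, 1))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Policy step (schematic): use the discriminator as a learned reward and
# hand it to any policy-gradient algorithm (e.g., TRPO/PPO).
with torch.no_grad():
    reward = -torch.log(1 - torch.sigmoid(disc(student_sa)) + 1e-8)
```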

VAE + IL: Action Chunking with Transformers (ACT)

  • Based on a CVAE (conditional VAE)
  • Encoder: expert action sequence + observation → latent $z$
  • Decoder: latent $z$ + observation → predicted action sequence (a “chunk”)
  • Key ideas: action chunking + temporal ensembling (see the sketch below)
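A sketch of the temporal-ensembling step, assuming a chunk length K and an exponential weighting constant; `predict_chunk` is a stand-in for the ACT decoder and all numbers are illustrative:

```python
import numpy as np

K, act_dim, T = 4, 2, 20                      # chunk length, action dim, horizon
chunks = {}                                   # timestep -> list of predictions

def predict_chunk(t):                         # stand-in for the ACT decoder
    return np.random.randn(K, act_dim)        # actions for steps t .. t+K-1

for t in range(T):
    chunk = predict_chunk(t)
    for k in range(K):                        # file each action under its timestep
        chunks.setdefault(t + k, []).append(chunk[k])
    preds = np.stack(chunks[t])               # every chunk that covers step t
    w = np.exp(-0.01 * np.arange(len(preds))) # exponential weights, oldest first
    action = (w[:, None] * preds).sum(0) / w.sum()
    # execute `action` on the robot here
```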

Diffusion + IL: Diffusion Policy
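The notes leave this subsection empty. For context, Diffusion Policy trains a denoising diffusion model over action chunks conditioned on observations; below is a minimal DDPM-style sampling sketch, where `eps_model` and the noise schedule are illustrative stand-ins for a trained network:

```python
import torch

K, act_dim, obs_dim, N = 8, 2, 16, 50         # chunk length, dims, denoising steps
obs = torch.randn(obs_dim)                    # stand-in observation (conditioning)

def eps_model(a_noisy, t, obs):               # stand-in for a trained noise predictor
    return torch.zeros_like(a_noisy)

beta = torch.linspace(1e-4, 2e-2, N)          # illustrative DDPM noise schedule
alpha = 1 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

a = torch.randn(K, act_dim)                   # start the action chunk from pure noise
for t in reversed(range(N)):
    eps = eps_model(a, t, obs)                # predicted noise at step t
    a = (a - (1 - alpha[t]) / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha[t])
    if t > 0:
        a = a + torch.sqrt(beta[t]) * torch.randn_like(a)
# `a` is the sampled action chunk, conditioned on `obs`.
```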

Model-Free RL

See An Overview of Deep Reinforcement Learning.

Model-Based RL

Offline RL

Bandits and Exploration

Robot Simulation & Sim2Real

Train in simulation, deploy in the real world (with real-time adaptation).

Why simulators for robot learning?

  • Most RL-based algorithms are very sample-inefficient
  • Simulators are cheap, fast, and scalable

Problems of Sim2Real

  • Non-parametric mismatches (the simulator doesn’t model some effects at all)
    • complex aerodynamics, fluid dynamics, tire dynamics, etc.
  • Parametric mismatches (the simulator uses different parameter values than the real world)
    • robot mass, friction, etc.

Domain Randomization

  • Randomize the simulator parameters $\xi$ in sim
  • Train a single RL policy that works for many $\xi$ (see the sketch below)
    • An approximation of robust control
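A sketch of per-episode domain randomization; the parameter names/ranges and the `make_sim` / `run_episode` helpers are hypothetical placeholders:

```python
import random

PARAM_RANGES = {"mass": (0.8, 1.2), "friction": (0.5, 1.5)}  # illustrative

def make_sim(mass, friction):       # placeholder for building/resetting a simulator
    return {"mass": mass, "friction": friction}

def run_episode(policy, sim):       # placeholder for a rollout + RL update
    pass

policy = None                       # placeholder policy
for episode in range(1000):
    xi = {k: random.uniform(*r) for k, r in PARAM_RANGES.items()}  # resample xi
    sim = make_sim(**xi)            # this episode runs under the sampled dynamics
    run_episode(policy, sim)        # one policy is trained across all xi
```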

Learning to Adapt

  • Randomize the simulator parameters $\xi$ in sim
  • Train an adaptive RL policy $\pi(a \mid s, \xi)$ that works for many $\xi$
    • An approximation of adaptive control
  • Issue! $\xi$ is often unknown in the real world
    • Solution! Learn from a privileged teacher
      • Sim: first train a teacher policy $\pi_{\text{teacher}}$ with privileged information ($\xi$)
      • Sim: the student policy learns from $\pi_{\text{teacher}}$
      • Real: deploy the student policy
    • Basically an imitation learning problem (see the sketch below)
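A sketch of the sim-phase student update. The teacher conditions on the true $\xi$; here the student infers $\xi$ from a history of observations/actions via a small recurrent module (an RMA-style choice that goes slightly beyond what the notes state; all sizes and tensors are illustrative):

```python
import torch
import torch.nn as nn

xi_dim, obs_dim, act_dim, hist_len = 4, 16, 2, 10
teacher = nn.Linear(obs_dim + xi_dim, act_dim)                 # stand-in trained teacher
adapter = nn.GRU(obs_dim + act_dim, xi_dim, batch_first=True)  # history -> xi estimate
student = nn.Linear(obs_dim + xi_dim, act_dim)
opt = torch.optim.Adam(list(adapter.parameters()) + list(student.parameters()), lr=1e-3)

obs = torch.randn(256, obs_dim)                           # stand-in sim observations
xi = torch.randn(256, xi_dim)                             # true params (known in sim)
history = torch.randn(256, hist_len, obs_dim + act_dim)   # past obs/action pairs

for epoch in range(100):
    with torch.no_grad():
        target = teacher(torch.cat([obs, xi], dim=-1))    # privileged action label
    _, h = adapter(history)                               # h: (1, B, xi_dim)
    xi_hat = h.squeeze(0)                                 # estimated parameters
    action = student(torch.cat([obs, xi_hat], dim=-1))
    loss = nn.functional.mse_loss(action, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```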

Safe Robot Learning

Multi-task and Adaptive Robot Learning

Foundation Models for Robotics

A more comprehensive list: JeffreyYH/Awesome-Generalist-Robots-via-Foundation-Models (the paper list accompanying the survey “Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis”).

![[Pasted image 20241228225953.png]]

Future Directions

  • Improving simulation and Sim2Real
  • LLMs for reward design (e.g., Eureka)
  • Doing imitation learning
  • No simulator: collect data in the real world → learn a model → design a policy → deploy
  • Meta-learned dynamics model + online adaptive control