Course Notes for CMU 16-831, Graduate Course on Robot Learning: Introduction to Robot Learning

Imitation Learning

MDP (Markov Decision process)

Definitions:

  • $\mathcal{S}$: State space, $s_t$: state at time $t$
  • $\mathcal{A}$: Action space, $a_t$: action at time $t$
  • $\mathcal{T}$: Transition probability, $p(s_{t+1} \mid s_t, a_t)$
  • $r$: Reward function, $r(s_t, a_t)$

Goal: Learn a policy $\pi(a_t \mid s_t)$.
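Spelled out, the usual objective (a standard formulation; the discount factor $\gamma \in [0, 1)$ is not defined elsewhere in these notes) is to find

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t)\right]$$

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory generated by running $\pi$ under the transition model.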

POMDP (Partially Observed MDP)

Additional Definitions:

  • $\mathcal{O}$: Observation space, $o_t$: observation at time $t$
  • $\Omega$: Observation model, $p(o_t \mid s_t)$

Goal: Learn a policy $\pi(a_t \mid o_t)$ (or, more generally, $\pi(a_t \mid o_{1:t})$ over the observation history).

Imitation Learning

Idea

  • Collect expert data (observation/state and action pairs)
  • Train a function to map observations/states to actions, i.e., behavior cloning (a minimal sketch follows below)
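A minimal behavior-cloning sketch in PyTorch, assuming continuous actions and an MSE loss; the dimensions and the random tensors standing in for an expert dataset are purely illustrative:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2            # illustrative dimensions
obs = torch.randn(1024, obs_dim)   # stand-in for expert observations
act = torch.randn(1024, act_dim)   # stand-in for expert actions

# Small MLP policy: observation -> action
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(100):
    loss = nn.functional.mse_loss(policy(obs), act)  # regress onto expert actions
    opt.zero_grad()
    loss.backward()
    opt.step()
```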

Dataset Aggregation (DAgger)

  1. Process:

    • Start with expert demonstrations.
    • Train policy $\pi_1$ via supervised learning.
    • Run $\pi_i$, query the expert to correct mistakes, and collect new data.
    • Aggregate new and old data, retrain to create $\pi_{i+1}$.
    • Repeat the process iteratively (see the sketch after this list).
  2. Advantages:

    • Reduces cascading errors.
    • Provides theoretical regret guarantees.
  3. Limitations:

    • Requires frequent expert queries.
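A sketch of the DAgger loop following the process above; `env`, `expert_action`, and `fit` are hypothetical placeholders for an environment, an expert query, and supervised training:

```python
def dagger(env, expert_action, fit, expert_demos, num_iters=10, horizon=200):
    """DAgger sketch: run the current policy, label visited states
    with expert actions, aggregate, and retrain."""
    dataset = list(expert_demos)           # start from expert demonstrations
    policy = fit(dataset)                  # pi_1 via supervised learning
    for i in range(num_iters):
        obs = env.reset()
        for t in range(horizon):
            a_policy = policy(obs)         # act with the current policy...
            a_expert = expert_action(obs)  # ...but query the expert for the label
            dataset.append((obs, a_expert))
            obs, done = env.step(a_policy)
            if done:
                break
        policy = fit(dataset)              # retrain pi_{i+1} on aggregated data
    return policy
```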

IL with Privileged Teachers

  • It can be hard to directly learn the policy $\pi(a_t \mid o_t)$, especially if the observation $o_t$ is high-dimensional

Obtain a "privileged" teacher $\pi_{\text{teacher}}$

  • $\pi_{\text{teacher}}$ has access to “ground truth” information that is not available to the “student” $\pi_{\text{student}}$
  • Then use $\pi_{\text{teacher}}$ to generate demonstrations for $\pi_{\text{student}}$

Example

  • Stage 1: learn a “privileged agent” from the expert
    • It knows the ground-truth state (traffic lights, other vehicles’ positions/velocities, etc.)
  • Stage 2: a sensorimotor student learns from this trained privileged agent

This is especially useful in simulation, because every variable’s value is known in sim. The privileged teacher learns from that full state, while the student learns only from quantities it can directly sense/measure.

  • The privileged teacher is usually trained with PPO (a sketch of the subsequent distillation step follows below)
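A minimal sketch of the teacher-to-student distillation step, assuming paired (privileged state, onboard observation) data collected in simulation; all networks, sizes, and tensors are illustrative stand-ins:

```python
import torch
import torch.nn as nn

state_dim, obs_dim, act_dim = 16, 64, 4     # illustrative sizes
teacher = nn.Linear(state_dim, act_dim)     # stand-in for a trained (e.g., PPO) teacher
student = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# Stand-ins for paired data collected in sim, where both the privileged
# state and the onboard observation are available for the same timestep.
priv_state = torch.randn(512, state_dim)
onboard_obs = torch.randn(512, obs_dim)

for epoch in range(100):
    with torch.no_grad():
        target = teacher(priv_state)        # the teacher's action is the label
    loss = nn.functional.mse_loss(student(onboard_obs), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```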

Variants

Deep Imitation Learning with Generative Modeling

What is the problem setup in generative modeling?

  • Learn: fit a distribution $p_\theta(x)$ that matches the data distribution $p_{\text{data}}(x)$
  • Sample: generate novel data $x \sim p_\theta(x)$ so that $x$ resembles draws from $p_{\text{data}}$

For robotics, we want the data distribution $p_{\text{data}}$ to come from experts. There are three leading approaches:

GAN + IL: Generative Adversarial Imitation Learning (GAIL)

  • Sample trajectories from the student policy
  • Update the discriminator, which aims to classify teacher (expert) data vs. student data
  • Train the student policy to minimize the discriminator’s accuracy, i.e., to fool it (see the sketch below)
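A schematic GAIL update, assuming the student's state-action pairs come from rollouts and the policy itself is updated by a separate RL step; the random tensors standing in for data are illustrative:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
                     nn.Linear(128, 1))                # D(s, a): expert vs. student
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

expert_sa = torch.randn(256, obs_dim + act_dim)        # stand-in expert (s, a) pairs
student_sa = torch.randn(256, obs_dim + act_dim)       # stand-in student rollouts

# Discriminator step: push expert pairs toward 1, student pairs toward 0.
d_loss = bce(disc(expert_sa), torch.ones(256, 1)) + \
         bce(disc(student_sa), torch.zeros(256, 1))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Policy step (schematic): use the discriminator as a learned reward and
# hand it to any policy-gradient algorithm (e.g., TRPO/PPO).
with torch.no_grad():
    reward = -torch.log(1 - torch.sigmoid(disc(student_sa)) + 1e-8)
```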

VAE + IL: Action Chunking with Transformers (ACT)

  • Based on a CVAE (conditional VAE)
  • Encoder: expert action sequence + observation → latent $z$
  • Decoder: latent $z$ + observation → predicted action sequence (a “chunk”)
  • Key ideas: action chunking + temporal ensembling (see the sketch below)
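A sketch of the temporal-ensembling step, assuming a chunk length K and an exponential weighting constant; `predict_chunk` is a stand-in for the ACT decoder and all numbers are illustrative:

```python
import numpy as np

K, act_dim, T = 4, 2, 20                      # chunk length, action dim, horizon
chunks = {}                                   # timestep -> list of predictions

def predict_chunk(t):                         # stand-in for the ACT decoder
    return np.random.randn(K, act_dim)        # actions for steps t .. t+K-1

for t in range(T):
    chunk = predict_chunk(t)
    for k in range(K):                        # file each action under its timestep
        chunks.setdefault(t + k, []).append(chunk[k])
    preds = np.stack(chunks[t])               # every chunk that covers step t
    w = np.exp(-0.01 * np.arange(len(preds))) # exponential weights, oldest first
    action = (w[:, None] * preds).sum(0) / w.sum()
    # execute `action` on the robot here
```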

Diffusion + IL: Diffusion Policy
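The notes leave this subsection empty. For context, Diffusion Policy trains a denoising diffusion model over action chunks conditioned on observations; below is a minimal DDPM-style sampling sketch, where `eps_model` and the noise schedule are illustrative stand-ins for a trained network:

```python
import torch

K, act_dim, obs_dim, N = 8, 2, 16, 50         # chunk length, dims, denoising steps
obs = torch.randn(obs_dim)                    # stand-in observation (conditioning)

def eps_model(a_noisy, t, obs):               # stand-in for a trained noise predictor
    return torch.zeros_like(a_noisy)

beta = torch.linspace(1e-4, 2e-2, N)          # illustrative DDPM noise schedule
alpha = 1 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

a = torch.randn(K, act_dim)                   # start the action chunk from pure noise
for t in reversed(range(N)):
    eps = eps_model(a, t, obs)                # predicted noise at step t
    a = (a - (1 - alpha[t]) / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha[t])
    if t > 0:
        a = a + torch.sqrt(beta[t]) * torch.randn_like(a)
# `a` is the sampled action chunk, conditioned on `obs`.
```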

Model-Free RL

See An Overview of Deep Reinforcement Learning.

Model-Based RL

Offline RL

Bandits and Exploration

Robot Simulation & Sim2Real

Train in simulation, deploy in the real world (with real-time adaptation).

Why simulators for robot learning?

  • Most RL-based algorithms are very sample-inefficient
  • Simulators are cheap, fast, and scalable

Problems of Sim2Real

  • Non-parametric mismatches (the simulator doesn’t model some effects at all)
    • complex aerodynamics, fluid dynamics, tire dynamics, etc.
  • Parametric mismatches (the simulator uses different parameter values than the real world)
    • robot mass, friction, etc.

Domain Randomization

  • Randomize the simulator parameters $\xi$ in sim
  • Train a single RL policy that works for many $\xi$ (see the sketch below)
    • An approximation of robust control
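A sketch of per-episode domain randomization; the parameter names/ranges and the `make_sim` / `run_episode` helpers are hypothetical placeholders:

```python
import random

PARAM_RANGES = {"mass": (0.8, 1.2), "friction": (0.5, 1.5)}  # illustrative

def make_sim(mass, friction):       # placeholder for building/resetting a simulator
    return {"mass": mass, "friction": friction}

def run_episode(policy, sim):       # placeholder for a rollout + RL update
    pass

policy = None                       # placeholder policy
for episode in range(1000):
    xi = {k: random.uniform(*r) for k, r in PARAM_RANGES.items()}  # resample xi
    sim = make_sim(**xi)            # this episode runs under the sampled dynamics
    run_episode(policy, sim)        # one policy is trained across all xi
```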

Learning to Adapt

  • Randomize the simulator parameters $\xi$ in sim
  • Train an adaptive RL policy $\pi(a \mid s, \xi)$ that works for many $\xi$
    • An approximation of adaptive control
  • Issue! $\xi$ is often unknown in the real world
    • Solution! Learn from a privileged teacher
      • Sim: first train a teacher policy $\pi_{\text{teacher}}$ with privileged information ($\xi$)
      • Sim: the student policy learns from $\pi_{\text{teacher}}$
      • Real: deploy the student policy
    • Basically an imitation learning problem (see the sketch below)
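A sketch of the sim-phase student update. The teacher conditions on the true $\xi$; here the student infers $\xi$ from a history of observations/actions via a small recurrent module (an RMA-style choice that goes slightly beyond what the notes state; all sizes and tensors are illustrative):

```python
import torch
import torch.nn as nn

xi_dim, obs_dim, act_dim, hist_len = 4, 16, 2, 10
teacher = nn.Linear(obs_dim + xi_dim, act_dim)                 # stand-in trained teacher
adapter = nn.GRU(obs_dim + act_dim, xi_dim, batch_first=True)  # history -> xi estimate
student = nn.Linear(obs_dim + xi_dim, act_dim)
opt = torch.optim.Adam(list(adapter.parameters()) + list(student.parameters()), lr=1e-3)

obs = torch.randn(256, obs_dim)                           # stand-in sim observations
xi = torch.randn(256, xi_dim)                             # true params (known in sim)
history = torch.randn(256, hist_len, obs_dim + act_dim)   # past obs/action pairs

for epoch in range(100):
    with torch.no_grad():
        target = teacher(torch.cat([obs, xi], dim=-1))    # privileged action label
    _, h = adapter(history)                               # h: (1, B, xi_dim)
    xi_hat = h.squeeze(0)                                 # estimated parameters
    action = student(torch.cat([obs, xi_hat], dim=-1))
    loss = nn.functional.mse_loss(action, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```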

Safe Robot Learning

Multi-task and Adaptive Robot Learning

Foundation Models for Robotics

A more comprehensive list: JeffreyYH/Awesome-Generalist-Robots-via-Foundation-Models (the paper list accompanying the survey “Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis”).

![[Pasted image 20241228225953.png]]

Future Directions

  • Improving simulation and Sim2Real
  • LLMs for reward design (e.g., Eureka)
  • Doing imitation learning
  • No simulator: collect data in the real world → learn a model → design a policy → deploy
  • Meta-learned dynamics model + online adaptive control