IOC5259 Spring 2022 - Reinforcement Learning (強化學習原理)
Instructor: Ping-Chun Hsieh
Email: pinghsieh [AT] nctu [DOT] edu [DOT] tw
Lectures:
Office Hours: 4:30pm-5pm on Tuesdays or by appointment
References:
[SB] Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018
[AJK] Alekh Agarwal, Nan Jiang, and Sham M. Kakade, Reinforcement Learning: Theory and Algorithms, 2020 (https://rltheorybook.github.io/rl_monograph_AJK.pdf)
[BCN] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning (https://arxiv.org/abs/1606.04838)
[NW] Jorge Nocedal and Stephen Wright, Numerical Optimization, 2006
[LS] Tor Lattimore and Csaba Szepesvári, Bandit Algorithms, 2019 (https://tor-lattimore.com/downloads/book/book.pdf)
Grading
Assignments: 30%
Theory Project: 30% (Deliverable: a technical report)
Team Implementation Project: 40% (Report: 30%, Presentation: 10%)
Week | Lecture | Date | Topics | Lecture Slides |
--- | --- | --- | --- | --- |
1 | 1 | 2/15 | Logistics and Introduction to RL | Lec1 |
1 | 2 | 2/18 | Introduction to RL and MDP | Lec2 |
2 | 3 | 2/22 | Planning for MDPs | Lec3 |
2 | 4 | 2/25 | Planning and Distributional Perspective of MDPs | Lec4 |
3 | | 3/1 | Peace Memorial Day | |
3 | 5 | 3/4 | A Distributional Perspective of MDPs and Policy Optimization | Lec5 |
4 | 6 | 3/8 | Policy Optimization and Gradient Descent | Lec6 |
4 | 7 | 3/11 | Policy Gradient | Lec7 |
5 | 8 | 3/15 | Policy Gradient and Stochastic Gradient Descent | Lec8 |
5 | 9 | 3/18 | Variance Reduction for Stochastic PG | Lec9 |
6 | 10 | 3/22 | Variance Reduction for Model-Free Prediction | Lec10 |
6 | 11 | 3/25 | Model-Free Prediction | Lec11 |
7 | 12 | 3/29 | Model-Free Prediction | Lec12 |
7 | 13 | 4/1 | TD-Lambda and Global Convergence of PG | Lec13 |
8 | | 4/5 | Spring Break | |
8 | 14 | 4/8 | Value Function Approximation | Lec14 |
9 | 15 | 4/12 | Value Function Approximation (II) | Lec15 |
9 | 16 | 4/15 | Trust Region Policy Optimization (TRPO) | Lec16 |
10 | 17 | 4/19 | Trust Region Policy Optimization (TRPO) | Lec17 |
10 | 18 | 4/22 | Proximal Policy Optimization (PPO) and Deterministic Policy Gradient (DPG) | Lec18 |
11 | 19 | 4/26 | Deterministic Policy Gradient (DPG) | Lec19 |
11 | 20 | 4/29 | DPG, DDPG, and Off-Policy Learning | Lec20 |
12 | 21 | 5/3 | Off-Policy Stochastic PG | Lec21 |
12 | 22 | 5/6 | Off-Policy Stochastic PG and Value-Based Methods | Lec22 |
13 | 23 | 5/10 | Value-Based Methods - Expected Sarsa and Q-Learning | Lec23 |
13 | 24 | 5/13 | Value-Based Methods | Lec24 |
14 | 25 | 5/17 | Q-Learning and Double Q-Learning | Lec25 |
14 | | 5/20 | Rescheduled for Final Presentation | |
15 | 26 | 5/24 | Q-Learning With VFA, DQN and Double DQN | Lec26 |
15 | 27 | 5/27 | Distributional RL (C51, QR-DQN, and IQN) and Soft Actor-Critic | Lec27 |
16 | | 5/31 | Rescheduled for Final Presentation (Final Exam Week) | |
16 | | 6/3 | Dragon Boat Festival | |
17 | 28 | 6/10 | Inverse RL | Lec28 |
17 | | 6/14 | Final Presentation | |