IOC5262 Spring 2020 - Reinforcement Learning (Principles of Reinforcement Learning)
Instructor: Ping-Chun Hsieh
Email: pinghsieh [AT] nctu [DOT] edu [DOT] tw
Lectures: Tuesdays and Fridays (see the schedule below)
Office Hours: By appointment
References:
[SB] Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018
[AJK] Alekh Agarwal, Nan Jiang, and Sham M. Kakade, Reinforcement Learning: Theory and Algorithms, 2019 (https://rltheorybook.github.io/rl_monograph_AJK.pdf)
[BCN] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning (https://arxiv.org/abs/1606.04838)
[NW] Jorge Nocedal and Stephen Wright, Numerical Optimization, 2nd edition, 2006
[LS] Tor Lattimore and Csaba Szepesvári, Bandit Algorithms, 2019 (https://tor-lattimore.com/downloads/book/book.pdf)
Grading
Warm-Up Assignments: 20%
Theory Project: 35% (Deliverable: a technical report)
Team Implementation Project: 45% (Report: 35%, Presentation: 10%)
Week | Lecture | Date | Topics | Lecture Slides |
---- | ------- | ---- | ------ | -------------- |
1 | 1 | 3/3 | Logistics and Introduction to RL | Lec1 |
1 | 2 | 3/6 | Introduction to RL and MDP | Lec2 |
2 | 3 | 3/10 | Planning for MDPs | Lec3 |
2 | 4 | 3/13 | Distributional Perspective of MDPs and Overview of Policy Optimization | Lec4 |
3 | 5 | 3/17 | Stochastic Gradient Descent and Policy Gradient | Lec5 |
3 | 6 | 3/20 | Policy Gradient and Variance Reduction | Lec6 |
4 | 7 | 3/24 | Variance Reduction for Policy Gradient | Lec7 |
4 | 8 | 3/27 | Critic, Advantage Function, and Model-Free Prediction | Lec8 |
5 | 9 | 3/31 | Model-Free Prediction and Actor-Critic Algorithms | Lec9 |
5 | | 4/3 | Ching Ming Festival (No class) | |
6 | 10 | 4/7 | Actor-Critic Algorithms and Value Function Approximation | Lec10 |
6 | 11 | 4/10 | Value Function Approximation and Trust Region Policy Optimization | Lec11 |
7 | 12 | 4/14 | Trust Region Policy Optimization (TRPO) | Lec12 |
7 | 13 | 4/17 | TRPO and PPO | Lec13 |
8 | 14 | 4/21 | PPO and Constrained Policy Optimization | Lec14 |
8 | 15 | 4/24 | CPO and Deterministic Policy Gradient | Lec15 |
9 | 16 | 4/28 | DPG and DDPG | Lec16 |
9 | | 5/1 | Guest Lecture | |
10 | 17 | 5/5 | DDPG and Off-Policy Stochastic PG | Lec17 |
10 | 18 | 5/8 | Off-Policy Stochastic PG and Policy Evaluation | Lec18 |
11 | 19 | 5/12 | Off-Policy Learning via Bootstrapping | Lec19 |
11 | 20 | 5/15 | Off-Policy Learning via Bootstrapping (II) | Lec20 |
12 | 21 | 5/19 | Q-Learning With Value Function Approximation | Lec21 |
12 | 22 | 5/26 | Distributional RL | Lec22 |
13 | 23 | 5/29 | Distributional RL and Bandits | Lec23 |
13 | 24 | 6/5 | Bandits: Learning With No Regret | Lec24 |
14 | 25 | 6/9 | Bandits: Upper Confidence Bound & Thompson Sampling | Lec25 |
14 | 26 | 6/12 | Bandits and Beyond | Lec26 |
15 | 27 | 6/16 | Exploration for RL | Lec27 |
15 | | 6/19 | Final Presentation | |
16 | | 6/23 | Final Presentation | |
16 | | 6/26 | Final Presentation | |