IOC5259 Spring 2022 - Reinforcement Learning (強化學習原理)

  • Instructor: Ping-Chun Hsieh

  • Email: pinghsieh [AT] nctu [DOT] edu [DOT] tw

  • Lectures:

    • Tuesdays 3:30pm-4:20pm @ EDB27

    • Fridays 10:10am-12:00noon @ EDB27

  • Office Hours: 4:30pm-5pm on Tuesdays or by appointment

  • References:

    • [SB] Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018

    • [AJK] Alekh Agarwal, Nan Jiang, and Sham M. Kakade, Reinforcement Learning: Theory and Algorithms, 2020 (https://rltheorybook.github.io/rl_monograph_AJK.pdf)

    • [BCN] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning (https://arxiv.org/abs/1606.04838)

    • [NW] Jorge Nocedal and Stephen Wright, Numerical optimization, 2006

    • [LS] Tor Lattimore and Csaba Szepesvári, Bandit Algorithms, 2019 (https://tor-lattimore.com/downloads/book/book.pdf)

  • Grading

    • Assignments: 30%

    • Theory Project: 30% (Deliverable: a technical report)

    • Team Implementation Project: 40% (Report: 30%, Presentation: 10%)

  • Lecture Schedule:

Week | Lecture | Date | Topics | Lecture Slides
-----|---------|------|--------|---------------
1    | 1       | 2/15 | Logistics and Introduction to RL | Lec1
1    | 2       | 2/18 | Introduction to RL and MDP | Lec2
2    | 3       | 2/22 | Planning for MDPs | Lec3
2    | 4       | 2/25 | Planning and Distributional Perspective of MDPs | Lec4
3    | –       | 3/1  | Peace Memorial Day (no class) | –
3    | 5       | 3/4  | A Distributional Perspective of MDPs and Policy Optimization | Lec5
4    | 6       | 3/8  | Policy Optimization and Gradient Descent | Lec6
4    | 7       | 3/11 | Policy Gradient | Lec7
5    | 8       | 3/15 | Policy Gradient and Stochastic Gradient Descent | Lec8
5    | 9       | 3/18 | Variance Reduction for Stochastic PG | Lec9
6    | 10      | 3/22 | Variance Reduction for Model-Free Prediction | Lec10
6    | 11      | 3/25 | Model-Free Prediction | Lec11
7    | 12      | 3/29 | Model-Free Prediction | Lec12
7    | 13      | 4/1  | TD-Lambda and Global Convergence of PG | Lec13
8    | –       | 4/5  | Spring Break (no class) | –
8    | 14      | 4/8  | Value Function Approximation | Lec14
9    | 15      | 4/12 | Value Function Approximation (II) | Lec15
9    | 16      | 4/15 | Trust Region Policy Optimization (TRPO) | Lec16
10   | 17      | 4/19 | Trust Region Policy Optimization (TRPO) | Lec17
10   | 18      | 4/22 | Proximal Policy Optimization (PPO) and Deterministic Policy Gradient (DPG) | Lec18
11   | 19      | 4/26 | Deterministic Policy Gradient (DPG) | Lec19
11   | 20      | 4/29 | DPG, DDPG, and Off-Policy Learning | Lec20
12   | 21      | 5/3  | Off-Policy Stochastic PG | Lec21
12   | 22      | 5/6  | Off-Policy Stochastic PG and Value-Based Methods | Lec22
13   | 23      | 5/10 | Value-Based Methods: Expected Sarsa and Q-Learning | Lec23
13   | 24      | 5/13 | Value-Based Methods | Lec24
14   | 25      | 5/17 | Q-Learning and Double Q-Learning | Lec25
14   | –       | 5/20 | Rescheduled for Final Presentation | –
15   | 26      | 5/24 | Q-Learning with VFA, DQN, and Double DQN | Lec26
15   | 27      | 5/27 | Distributional RL (C51, QR-DQN, and IQN) and Soft Actor-Critic | Lec27
16   | –       | 5/31 | Rescheduled for Final Presentation (Final Exam Week) | –
16   | –       | 6/3  | Dragon Boat Festival (no class) | –
17   | 28      | 6/10 | Inverse RL | Lec28
17   | –       | 6/14 | Final Presentation | –