IOC5269 Spring 2021 - Reinforcement Learning (強化學習原理)

  • Instructor: Ping-Chun Hsieh

  • Email: pinghsieh [AT] nctu [DOT] edu [DOT] tw

  • Lectures:

    • Tuesdays 9:00am-9:50am @ EC115

    • Fridays 1:20pm-3:10pm @ EC115

  • Office Hours: By appointment

  • References:

    • [SB] Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018

    • [AJK] Alekh Agarwal, Nan Jiang, Sham M. Kakade, Reinforcement Learning: Theory and Algorithms, 2020 (https://rltheorybook.github.io/rl_monograph_AJK.pdf)

    • [BCN] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning (https://arxiv.org/abs/1606.04838)

    • [NW] Jorge Nocedal and Stephen Wright, Numerical optimization, 2006

    • [LS] Tor Lattimore and Csaba Szepesvári, Bandit Algorithms, 2019 (https://tor-lattimore.com/downloads/book/book.pdf)

  • Grading:

    • Assignments: 30%

    • Theory Project: 30% (Deliverable: a technical report)

    • Team Implementation Project: 40% (Report: 30%, Presentation: 10%)

  • Lecture Schedule:

| Week | Lecture | Date | Topics | Lecture Slides |
|------|---------|------|--------|----------------|
| 1 | 1 | 2/23 | Logistics and Introduction to RL | Lec1 |
| 1 | 2 | 2/26 | Introduction to RL and MDP | Lec2 |
| 2 | 3 | 3/2 | Planning for MDPs | Lec3 |
| 2 | 4 | 3/5 | Planning and Distributional Perspective of MDPs | Lec4 |
| 3 | 5 | 3/9 | A Distributional Perspective of MDPs and Policy Optimization | Lec5 |
| 3 | 6 | 3/12 | Policy Optimization and Gradient Descent | Lec6 |
| 4 | 7 | 3/16 | Policy Gradient | Lec7 |
| 4 | 8 | 3/19 | Variance Reduction and Model-Free Prediction | Lec8 |
| 5 | 9 | 3/23 | Model-Free Prediction and Actor-Critic Algorithms | Lec9 |
| 5 | 10 | 3/26 | Model-Free Prediction and Global Convergence of Policy Gradient | Lec10 |
| 6 | 11 | 3/30 | Global Convergence of Policy Gradient | Lec11, Lec11 (annotated) |
| 6 | | 4/2 | Spring Break | |
| 7 | | 4/6 | Spring Break | |
| 7 | 12 | 4/9 | Global Convergence of Policy Gradient and Value Function Approximation | Lec12 |
| 8 | 13 | 4/13 | Value Function Approximation | Lec13 |
| 8 | 14 | 4/16 | Trust Region Policy Optimization (TRPO) | Lec14 |
| 9 | 15 | 4/20 | Trust Region Policy Optimization (TRPO) | Lec15 |
| 9 | 16 | 4/23 | Proximal Policy Optimization (PPO) and Deterministic Policy Gradient (DPG) | Lec16 |
| 10 | 17 | 4/27 | Off-Policy Learning via Deterministic and Stochastic Policy Gradients | Lec17 |
| 10 | 18 | 4/30 | Off-Policy Learning via Deterministic and Stochastic Policy Gradients | Lec18 |
| 11 | 19 | 5/4 | Off-Policy Learning and Value-Based Methods | Lec19, Lec19 (annotated) |
| 11 | 20 | 5/7 | Value-Based Methods | Lec20 |
| 12 | 21 | 5/11 | Value-Based Methods - Expected Sarsa and Q-Learning | Lec21, Lec21 (annotated) |
| 12 | 22 | 5/14 | Value-Based Methods - Q-Learning, Double Q-Learning | Lec22 |
| 13 | | 5/18 | Rescheduled for Final Presentation | |
| 13 | 23 | 5/21 | Value-Based Methods - DQN and Double DQN | Lec23 |
| 14 | | 5/25 | Rescheduled for Final Presentation | |
| 14 | | 5/28 | Rescheduled to 6/18 | |
| 15 | 24 | 6/1 | Distributional RL - C51 | Lec24, Lec24 (annotated) |
| 15 | 25 | 6/4 | Distributional RL - QR-DQN | Lec25, Lec25 (annotated) |
| 16 | | 6/8 | Rescheduled for Final Presentation (Final Exam Week) | |
| 16 | | 6/11 | Rescheduled for Final Presentation (Final Exam Week) | |
| 17 | | 6/15 | No Class | |
| 17 | 26 | 6/18 | Implicit Quantile Networks and Soft Actor-Critic | Lec26, Lec26 (annotated) |
| 18 | | 6/23 | Final Presentation | |
| 18 | | 6/24 | Final Presentation | |