IOC5262 Spring 2020 - Reinforcement Learning (強化學習原理, Principles of Reinforcement Learning)

  • Instructor: Ping-Chun Hsieh

  • Email: pinghsieh [AT] nctu [DOT] edu [DOT] tw

  • Lectures:

    • Tuesdays 9:00am-9:50am @ ED102

    • Fridays 1:20pm-3:10pm @ ED102

  • Office Hours: By appointment

  • References:

    • [SB] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018

    • [AJK] Alekh Agarwal, Nan Jiang, and Sham M. Kakade, Reinforcement Learning: Theory and Algorithms, 2019 (https://rltheorybook.github.io/rl_monograph_AJK.pdf)

    • [BCN] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning (https://arxiv.org/abs/1606.04838)

    • [NW] Jorge Nocedal and Stephen Wright, Numerical Optimization, 2006

    • [LS] Tor Lattimore and Csaba Szepesvári, Bandit Algorithms, 2019 (https://tor-lattimore.com/downloads/book/book.pdf)

  • Grading:

    • Warm-Up Assignments: 20%

    • Theory Project: 35% (Deliverable: a technical report)

    • Team Implementation Project: 45% (Report: 35%, Presentation: 10%)

  • Lecture Schedule:

Week  Lecture  Date  Topics                                                                   Lecture Slides
1     1        3/3   Logistics and Introduction to RL                                         Lec1
1     2        3/6   Introduction to RL and MDPs                                              Lec2
2     3        3/10  Planning for MDPs                                                        Lec3
2     4        3/13  Distributional Perspective of MDPs and Overview of Policy Optimization   Lec4
3     5        3/17  Stochastic Gradient Descent and Policy Gradient                          Lec5
3     6        3/20  Policy Gradient and Variance Reduction                                   Lec6
4     7        3/24  Variance Reduction for Policy Gradient                                   Lec7
4     8        3/27  Critic, Advantage Function, and Model-Free Prediction                    Lec8
5     9        3/31  Model-Free Prediction and Actor-Critic Algorithms                        Lec9
5     -        4/3   Ching Ming Festival (No class)                                           -
6     10       4/7   Actor-Critic Algorithms and Value Function Approximation                 Lec10
6     11       4/10  Value Function Approximation and Trust Region Policy Optimization        Lec11
7     12       4/14  Trust Region Policy Optimization (TRPO)                                  Lec12
7     13       4/17  TRPO and PPO                                                             Lec13
8     14       4/21  PPO and Constrained Policy Optimization                                  Lec14
8     15       4/24  CPO and Deterministic Policy Gradient                                    Lec15
9     16       4/28  DPG and DDPG                                                             Lec16
9     -        5/1   Guest Lecture                                                            -
10    17       5/5   DDPG and Off-Policy Stochastic PG                                        Lec17
10    18       5/8   Off-Policy Stochastic PG and Policy Evaluation                           Lec18
11    19       5/12  Off-Policy Learning via Bootstrapping                                    Lec19
11    20       5/15  Off-Policy Learning via Bootstrapping (II)                               Lec20
12    21       5/19  Q-Learning With Value Function Approximation                             Lec21
12    22       5/26  Distributional RL                                                        Lec22
13    23       5/29  Distributional RL and Bandits                                            Lec23
13    24       6/5   Bandits: Learning With No Regret                                         Lec24
14    25       6/9   Bandits: Upper Confidence Bound & Thompson Sampling                      Lec25
14    26       6/12  Bandits and Beyond                                                       Lec26
15    27       6/16  Exploration for RL                                                       Lec27
15    -        6/19  Final Presentations                                                      -
16    -        6/23  Final Presentations                                                      -
16    -        6/26  Final Presentations                                                      -