IOC5269 Spring 2021 - Reinforcement Learning (強化學習原理)

Instructor: Ping-Chun Hsieh
Email: pinghsieh [AT] nctu [DOT] edu [DOT] tw
Lectures:
- Tuesdays 9:00am-9:50am @ EC115
- Fridays 1:20pm-3:10pm @ EC115
Office Hours: By appointment
References:
- [SB] Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, 2nd edition, 2019
- [AJK] Alekh Agarwal, Nan Jiang, and Sham M. Kakade, Reinforcement Learning: Theory and Algorithms, 2020 (https://rltheorybook.github.io/rlmonographAJK.pdf)
- [BCN] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning (https://arxiv.org/abs/1606.04838)
- [NW] Jorge Nocedal and Stephen Wright, Numerical Optimization, 2006
- [LS] Tor Lattimore and Csaba Szepesvari, Bandit Algorithms, 2019 (https://tor-lattimore.com/downloads/book/book.pdf)

Grading:
- Assignments: 30%
- Theory Project: 30% (Deliverable: a technical report)
- Team Implementation Project: 40% (Report: 30%, Presentation: 10%)

Lecture Schedule:

Week	Lecture	Date	Topics	Lecture Slides
1	1	2/23	Logistics and Introduction to RL	Lec1
1	2	2/26	Introduction to RL and MDP	Lec2
2	3	3/2	Planning for MDPs	Lec3
2	4	3/5	Planning and Distributional Perspective of MDPs	Lec4
3	5	3/9	A Distributional Perspective of MDPs and Policy Optimization	Lec5
3	6	3/12	Policy Optimization and Gradient Descent	Lec6
4	7	3/16	Policy Gradient	Lec7
4	8	3/19	Variance Reduction and Model-Free Prediction	Lec8
5	9	3/23	Model-Free Prediction and Actor-Critic Algorithms	Lec9
5	10	3/26	Model-Free Prediction and Global Convergence of Policy Gradient	Lec10
6	11	3/30	Global Convergence of Policy Gradient	Lec11, Lec11 (annotated)
6		4/2	Spring Break
7		4/6	Spring Break
7	12	4/9	Global Convergence of Policy Gradient and Value Function Approximation	Lec12
8	13	4/13	Value Function Approximation	Lec13
8	14	4/16	Trust Region Policy Optimization (TRPO)	Lec14
9	15	4/20	Trust Region Policy Optimization (TRPO)	Lec15
9	16	4/23	Proximal Policy Optimization (PPO) and Deterministic Policy Gradient (DPG)	Lec16
10	17	4/27	Off-Policy Learning via Deterministic and Stochastic Policy Gradients	Lec17
10	18	4/30	Off-Policy Learning via Deterministic and Stochastic Policy Gradients	Lec18
11	19	5/4	Off-Policy Learning and Value-Based Methods	Lec19, Lec19 (annotated)
11	20	5/7	Value-Based Methods	Lec20
12	21	5/11	Value-Based Methods - Expected Sarsa and Q-Learning	Lec21, Lec21 (annotated)
12	22	5/14	Value-Based Methods - Q-Learning, Double Q-Learning	Lec22
13		5/18	Rescheduled for Final Presentation
13	23	5/21	Value-Based Methods - DQN and Double DQN	Lec23
14		5/25	Rescheduled for Final Presentation
14		5/28	Rescheduled to 6/18
15	24	6/1	Distributional RL - C51	Lec24, Lec24 (annotated)
15	25	6/4	Distributional RL - QR-DQN	Lec25, Lec25 (annotated)
16		6/8	Rescheduled for Final Presentation (Final Exam Week)
16		6/11	Rescheduled for Final Presentation (Final Exam Week)
17		6/15	No Class
17	26	6/18	Implicit Quantile Networks and Soft Actor-Critic	Lec26, Lec26 (annotated)
18		6/23	Final Presentation
18		6/24	Final Presentation