IOC5259 Spring 2022 - Reinforcement Learning (強化學習原理)
Instructor: Ping-Chun Hsieh
Email: pinghsieh [AT] nctu [DOT] edu [DOT] tw
Lectures:
Office Hours: 4:30pm-5pm on Tuesdays or by appointment
References:
[SB] Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018
[AJK] Alekh Agarwal, Nan Jiang, and Sham M. Kakade, Reinforcement Learning: Theory and Algorithms, 2020 (https://rltheorybook.github.io/rl_monograph_AJK.pdf)
[BCN] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning (https://arxiv.org/abs/1606.04838)
[NW] Jorge Nocedal and Stephen Wright, Numerical Optimization, 2006
[LS] Tor Lattimore and Csaba Szepesvári, Bandit Algorithms, 2019 (https://tor-lattimore.com/downloads/book/book.pdf)
Grading
Assignments: 30%
Theory Project: 30% (Deliverable: a technical report)
Team Implementation Project: 40% (Report: 30%, Presentation: 10%)
Week | Lecture | Date | Topics | Lecture Slides |
--- | --- | --- | --- | --- |
1 | 1 | 2/15 | Logistics and Introduction to RL | Lec1 |
1 | 2 | 2/18 | Introduction to RL and MDP | Lec2 |
2 | 3 | 2/22 | Planning for MDPs | Lec3 |
2 | 4 | 2/25 | Planning and Distributional Perspective of MDPs | Lec4 |
3 | | 3/1 | Peace Memorial Day | |
3 | 5 | 3/4 | A Distributional Perspective of MDPs and Policy Optimization | Lec5 |
4 | 6 | 3/8 | Policy Optimization and Gradient Descent | Lec6 |
4 | 7 | 3/11 | Policy Gradient | Lec7 |
5 | 8 | 3/15 | Policy Gradient and Stochastic Gradient Descent | Lec8 |
5 | 9 | 3/18 | Variance Reduction for Stochastic PG | Lec9 |
6 | 10 | 3/22 | Variance Reduction for Model-Free Prediction | Lec10 |
6 | 11 | 3/25 | Model-Free Prediction | Lec11 |
7 | 12 | 3/29 | Model-Free Prediction | Lec12 |
7 | 13 | 4/1 | TD-Lambda and Global Convergence of PG | Lec13 |
8 | | 4/5 | Spring Break | |
8 | 14 | 4/8 | Value Function Approximation | Lec14 |
9 | 15 | 4/12 | Value Function Approximation (II) | Lec15 |
9 | 16 | 4/15 | Trust Region Policy Optimization (TRPO) | Lec16 |
10 | 17 | 4/19 | Trust Region Policy Optimization (TRPO) | Lec17 |
10 | 18 | 4/22 | Proximal Policy Optimization (PPO) and Deterministic Policy Gradient (DPG) | Lec18 |
11 | 19 | 4/26 | Deterministic Policy Gradient (DPG) | Lec19 |
11 | 20 | 4/29 | DPG, DDPG, and Off-Policy Learning | Lec20 |
12 | 21 | 5/3 | Off-Policy Stochastic PG | Lec21 |
12 | 22 | 5/6 | Off-Policy Stochastic PG and Value-Based Methods | Lec22 |
13 | 23 | 5/10 | Value-Based Methods - Expected Sarsa and Q-Learning | Lec23 |
13 | 24 | 5/13 | Value-Based Methods | Lec24 |
14 | 25 | 5/17 | Q-Learning and Double Q-Learning | Lec25 |
14 | | 5/20 | Rescheduled for Final Presentation | |
15 | 26 | 5/24 | Q-Learning With VFA, DQN and Double DQN | Lec26 |
15 | 27 | 5/27 | Distributional RL (C51, QR-DQN, and IQN) and Soft Actor-Critic | Lec27 |
16 | | 5/31 | Rescheduled for Final Presentation (Final Exam Week) | |
16 | | 6/3 | Dragon Boat Festival | |
17 | 28 | 6/10 | Inverse RL | Lec28 |
17 | | 6/14 | Final Presentation | |