535515 Spring 2023 - Reinforcement Learning (強化學習原理)

  • Instructor: Ping-Chun Hsieh

  • Email: pinghsieh [AT] nycu [DOT] edu [DOT] tw

  • Lectures:

    • Tuesdays 3:30pm-4:20pm @ EC115

    • Fridays 10:10am-12:00 noon @ EC115

    • Note: The first lecture on 2/14 (Tue.) will be delivered via Webex: Webex Link

  • Office Hours: 4:30pm-5pm on Tuesdays or by appointment

  • References:

    • [SB] Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018

    • [AJK] Alekh Agarwal, Nan Jiang, and Sham M. Kakade, Reinforcement Learning: Theory and Algorithms, 2020 (https://rltheorybook.github.io/rl_monograph_AJK.pdf)

    • [BCN] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning (https://arxiv.org/abs/1606.04838)

    • [NW] Jorge Nocedal and Stephen Wright, Numerical Optimization, 2006

    • [LS] Tor Lattimore and Csaba Szepesvári, Bandit Algorithms, 2020 (https://tor-lattimore.com/downloads/book/book.pdf)

  • Grading:

    • Assignments: 35%

    • Theory Project: 30%

    • Team Implementation Project: 35% (Report: 20%, Presentation: 15%)

  • Lecture Schedule:

Week  Lecture  Date  Topics
1     1        2/14  Logistics and Introduction to RL
1     2        2/17  Introduction to RL and MDPs
2     3        2/21  Planning for MDPs
2     4        2/24  Regularized and Distributional Perspectives on MDPs
3     -        2/28  Peace Memorial Day (no class)
3     5        3/3   Policy Optimization
4     6        3/7   Policy Optimization and First-Order Optimization Methods
4     7        3/10  Policy Gradient
5     8        3/14  Policy Gradient and Stochastic Gradient Descent
5     9        3/17  Variance Reduction for Stochastic PG
6     10       3/21  Variance Reduction for Model-Free Prediction
6     11       3/24  Model-Free Prediction
7     12       3/28  Global Convergence of PG
7     13       3/31  Natural PG
8     -        4/4   Spring Break (no class)
8     14       4/7   Value Function Approximation
9     15       4/11  Value Function Approximation
9     16       4/14  Trust Region Policy Optimization (TRPO)
10    17       4/18  Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO)
10    18       4/21  Deterministic Policy Gradient (DPG)
11    19       4/25  DPG, DDPG, and Off-Policy Learning
11    20       4/28  Off-Policy Stochastic PG
12    21       5/2   Value-Based Methods - Sarsa and Expected Sarsa
12    22       5/5   Value-Based Methods - Q-Learning and Double Q-Learning
13    23       5/9   Q-Learning with VFA, DQN, and Double DQN
13    24       5/12  Q-Learning for Continuous Control and Soft Actor-Critic
14    25       5/16  Distributional RL (C51, QR-DQN, and IQN)
14    26       5/19  Inverse RL
15    27       5/23  Inverse RL
15    28       5/26  Inverse RL
16    -        5/30  Rescheduled for Final Presentation (Final Exam Week)
16    -        6/2   Rescheduled for Final Presentation (Final Exam Week)
17    -        6/6   Final Presentation
17    -        6/9   Final Presentation