IOC5262 Spring 2020 - Reinforcement Learning (強化學習原理, Principles of Reinforcement Learning)

  • Instructor: Ping-Chun Hsieh

  • Email: pinghsieh [AT] nctu [DOT] edu [DOT] tw

  • Lectures:

    • Tuesdays 9:00am-9:50am @ ED102

    • Fridays 1:20pm-3:10pm @ ED102

  • Office Hours: By appointment

  • References:

    • [SB] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018

    • [AJK] Alekh Agarwal, Nan Jiang, and Sham M. Kakade, Reinforcement Learning: Theory and Algorithms, 2019 (https://rltheorybook.github.io/rl_monograph_AJK.pdf)

    • [BCN] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning (https://arxiv.org/abs/1606.04838)

    • [NW] Jorge Nocedal and Stephen Wright, Numerical Optimization, 2006

    • [LS] Tor Lattimore and Csaba Szepesvári, Bandit Algorithms, 2019 (https://tor-lattimore.com/downloads/book/book.pdf)

  • Grading:

    • Warm-Up Assignments: 20%

    • Theory Project: 35% (Deliverable: a technical report)

    • Team Implementation Project: 45% (Report: 35%, Presentation: 10%)

  • Lecture Schedule:

Week  Lecture  Date  Topics                                                                   Lecture Slides
1     1        3/3   Logistics and Introduction to RL                                         Lec1
1     2        3/6   Introduction to RL and MDPs                                              Lec2
2     3        3/10  Planning for MDPs                                                        Lec3
2     4        3/13  Distributional Perspective of MDPs and Overview of Policy Optimization   Lec4
3     5        3/17  Stochastic Gradient Descent and Policy Gradient                          Lec5
3     6        3/20  Policy Gradient and Variance Reduction                                   Lec6
4     7        3/24  Variance Reduction for Policy Gradient                                   Lec7
4     8        3/27  Critic, Advantage Function, and Model-Free Prediction                    Lec8
5     9        3/31  Model-Free Prediction and Actor-Critic Algorithms                        Lec9
5     -        4/3   Ching Ming Festival (No class)                                           -
6     10       4/7   Actor-Critic Algorithms and Value Function Approximation                 Lec10
6     11       4/10  Value Function Approximation and Trust Region Policy Optimization        Lec11
7     12       4/14  Trust Region Policy Optimization (TRPO)                                  Lec12
7     13       4/17  TRPO and PPO                                                             Lec13
8     14       4/21  PPO and Constrained Policy Optimization                                  Lec14
8     15       4/24  CPO and Deterministic Policy Gradient                                    Lec15
9     16       4/28  DPG and DDPG                                                             Lec16
9     -        5/1   Guest Lecture                                                            -
10    17       5/5   DDPG and Off-Policy Stochastic PG                                        Lec17
10    18       5/8   Off-Policy Stochastic PG and Policy Evaluation                           Lec18
11    19       5/12  Off-Policy Learning via Bootstrapping                                    Lec19
11    20       5/15  Off-Policy Learning via Bootstrapping (II)                               Lec20
12    21       5/19  Q-Learning With Value Function Approximation                             Lec21
12    22       5/26  Distributional RL                                                        Lec22
13    23       5/29  Distributional RL and Bandits                                            Lec23
13    24       6/5   Bandits: Learning With No Regret                                         Lec24
14    25       6/9   Bandits: Upper Confidence Bound & Thompson Sampling                      Lec25
14    26       6/12  Bandits and Beyond                                                       Lec26
15    27       6/16  Exploration for RL                                                       Lec27
15    -        6/19  Final Presentations                                                      -
16    -        6/23  Final Presentations                                                      -
16    -        6/26  Final Presentations                                                      -