
Tong Zhang
[introductory/intermediate] Reinforcement Learning for Large Language Models
Summary
This short course introduces reinforcement learning methods used in the posttraining of large language models.
The course begins with an overview of large language model posttraining. We describe the standard posttraining pipeline, including supervised instruction tuning and alignment objectives.
The second lecture covers reinforcement learning from human feedback. We formalize RLHF by viewing the language model as a policy over token sequences and human preferences as supervision for a learned reward function. We discuss reward modeling from pairwise comparisons, common failure modes such as reward hacking, and commonly used policy training methods.
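To make the reward-modeling step concrete, here is a minimal PyTorch sketch of the Bradley-Terry style pairwise loss commonly used to fit a reward model from comparison data; the reward_model interface and the tensor shapes are illustrative assumptions rather than material fixed by the lecture.

import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    # reward_model is assumed to map a batch of token-id sequences to one scalar
    # reward per sequence; chosen_ids / rejected_ids hold the preferred and
    # dispreferred responses to the same prompts, shape (batch, seq_len).
    r_chosen = reward_model(chosen_ids)      # shape (batch,)
    r_rejected = reward_model(rejected_ids)  # shape (batch,)
    # Bradley-Terry objective: maximize the probability that the chosen response
    # outranks the rejected one, i.e. minimize -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

The trained reward model then scores sampled responses during policy optimization, where a KL penalty toward the supervised model is typically added to help mitigate reward hacking.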
The final lecture focuses on reinforcement learning for reasoning models with verifiable rewards. We study settings where rewards come from automatic checks, such as math or code correctness, rather than human judgments. We introduce commonly used policy training methods and explain why they are effective in this setting.
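As an illustration of the verifiable-reward setting, the sketch below pairs a simple correctness check for math answers with the group-relative advantage used by GRPO-style methods; the answer-extraction rule, the function names, and the normalization details are assumptions made for this example, and real implementations differ.

import re
import statistics

def math_reward(response: str, reference_answer: str) -> float:
    # Verifiable reward: 1.0 if the last number in the response matches the
    # reference answer, 0.0 otherwise. The extraction rule is a simplification.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == reference_answer else 0.0

def group_relative_advantages(rewards):
    # GRPO-style advantages: standardize rewards across a group of responses
    # sampled for the same prompt, avoiding the need for a learned value model.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Each response's log-probability is then weighted by its advantage when
# updating the policy, so correct answers are reinforced relative to the group.
rewards = [math_reward(r, "42") for r in ["The answer is 42.", "Maybe 41?", "42"]]
print(group_relative_advantages(rewards))

Because the reward here is computed by a program rather than predicted by a learned model, it is harder to game than a preference-based reward model, which is one reason such methods are effective in this setting.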
Syllabus
Lecture 1: Introduction to foundation model posttraining
Lecture 2: Reinforcement learning from human feedback (RLHF)
Lecture 3: Reinforcement learning for reasoning with verifiable rewards
Pre-requisites
An upper-level undergraduate course in machine learning, including deep learning.
Short bio
Tong Zhang is a professor in the Computer Science department at the University of Illinois Urbana-Champaign. His research interests include machine learning theory, algorithms, and applications, and he has extensive industrial experience. He is a Fellow of the IEEE, the American Statistical Association, and the Institute of Mathematical Statistics.
