
Tong Zhang
[introductory/intermediate] Reinforcement Learning for Large Language Models
Summary
This short course introduces reinforcement learning methods used in the posttraining of large language models.
The course begins with an overview of large language model posttraining. We describe the standard posttraining pipeline, including supervised instruction tuning and alignment objectives.
The second lecture covers reinforcement learning from human feedback. We formalize RLHF by viewing the language model as a policy over token sequences and human preferences as supervision for a learned reward function. We discuss reward modeling from pairwise comparisons, common failure modes such as reward hacking, and commonly used policy training methods.
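To make the reward-modeling step concrete, here is a minimal PyTorch sketch of the Bradley-Terry style pairwise loss commonly used to fit a reward model from comparison data; the reward_model interface and the tensor shapes are illustrative assumptions rather than material fixed by the lecture.

import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    # reward_model is assumed to map a batch of token-id sequences to one scalar
    # reward per sequence; chosen_ids / rejected_ids hold the preferred and
    # dispreferred responses to the same prompts, shape (batch, seq_len).
    r_chosen = reward_model(chosen_ids)      # shape (batch,)
    r_rejected = reward_model(rejected_ids)  # shape (batch,)
    # Bradley-Terry objective: maximize the probability that the chosen response
    # outranks the rejected one, i.e. minimize -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

The trained reward model then scores sampled responses during policy optimization, where a KL penalty toward the supervised model is typically added to help mitigate reward hacking.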
The final lecture focuses on reinforcement learning for reasoning models with verifiable rewards. We study settings where rewards come from automatic checks, such as math or code correctness, rather than human judgments. We introduce commonly used policy training methods and explain why they are effective in this setting.
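As an illustration of the verifiable-reward setting, the sketch below pairs a simple correctness check for math answers with the group-relative advantage used by GRPO-style methods; the answer-extraction rule, the function names, and the normalization details are assumptions made for this example, and real implementations differ.

import re
import statistics

def math_reward(response: str, reference_answer: str) -> float:
    # Verifiable reward: 1.0 if the last number in the response matches the
    # reference answer, 0.0 otherwise. The extraction rule is a simplification.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == reference_answer else 0.0

def group_relative_advantages(rewards):
    # GRPO-style advantages: standardize rewards across a group of responses
    # sampled for the same prompt, avoiding the need for a learned value model.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Each response's log-probability is then weighted by its advantage when
# updating the policy, so correct answers are reinforced relative to the group.
rewards = [math_reward(r, "42") for r in ["The answer is 42.", "Maybe 41?", "42"]]
print(group_relative_advantages(rewards))

Because the reward here is computed by a program rather than predicted by a learned model, it is harder to game than a preference-based reward model, which is one reason such methods are effective in this setting.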
Syllabus
Lecture 1: Introduction to foundation model posttraining
Lecture 2: Reinforcement learning from human feedback (RLHF)
Lecture 3: Reinforcement learning for reasoning with verifiable rewards
Pre-requisites
An upper-level undergraduate course in machine learning, including deep learning.
Short bio
Tong Zhang is a professor in the Computer Science department at the University of Illinois Urbana-Champaign. His research interests include machine learning theory, algorithms, and applications, and he has extensive industrial experience. He is a Fellow of the IEEE, the American Statistical Association, and the Institute of Mathematical Statistics.
