
Jianfei Chen (Tsinghua University), [intermediate] Efficient Large Model Training and Inference
Summary
Large language models have become extraordinarily expensive to train and serve, putting frontier research seemingly out of reach for academic groups and small teams. Yet recent systems such as DeepSeek demonstrate that careful co-design of model architecture, learning algorithms, and GPU kernels — guided by an awareness of the underlying hardware — can deliver order-of-magnitude gains in efficiency without sacrificing capability. This tutorial uses DeepSeek as a running case study to unpack the principles and practice behind state-of-the-art efficient machine learning.
Participants will move from first principles — the GPU performance model and the arithmetic of a transformer forward pass — through the modern toolbox of efficient attention, mixture-of-experts, low-precision computation, and structured sparsity. Each topic is approached both as an idea (why it works, what it costs) and as an implementation (how to write, profile, and validate it in Triton or PyTorch). By the end of the three sessions, participants will be able to explain what makes DeepSeek-class models efficient and will have the conceptual and practical tools to apply the same recipes to their own research under limited academic compute budgets.
Syllabus
Lecture 1 — Foundations: Transformers, GPU Performance Models, and Triton
- The economics of large model training and the academic compute gap
- Transformer architecture revisited from a systems perspective: FLOPs, memory traffic, and activation footprint
- The GPU performance model: roofline analysis, arithmetic intensity, memory hierarchy, tensor cores
- Profiling a real model: where the time and memory actually go
- Custom kernels in Triton: a hands-on walkthrough from element-wise ops to fused matmul
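To preview the flavor of Lecture 1, here is a back-of-envelope roofline calculation in plain Python. The peak numbers are illustrative A100-like figures (312 TFLOP/s BF16 tensor-core throughput, 2.0 TB/s HBM bandwidth), not a claim about any specific GPU; the point is the arithmetic, which shows why a large training matmul is compute-bound while a batch-1 decode matvec is memory-bound.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n] in BF16."""
    flops = 2 * m * n * k                                  # one multiply-add per (i,j,l) triple
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C once
    return flops / bytes_moved

# Illustrative peak numbers (assumed, roughly A100-class).
PEAK_FLOPS = 312e12   # BF16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12      # HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW  # intensity above which compute dominates

big = matmul_arithmetic_intensity(4096, 4096, 4096)  # training-style matmul
decode = matmul_arithmetic_intensity(1, 4096, 4096)  # batch-1 decode matvec

print(f"ridge point:   {ridge:.0f} FLOP/byte")
print(f"4096^3 matmul: {big:.0f} FLOP/byte (compute-bound: {big > ridge})")
print(f"decode matvec: {decode:.2f} FLOP/byte (memory-bound: {decode < ridge})")
```

The same two-line calculation, applied layer by layer, is the starting point for the profiling exercise in the lecture.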
Lecture 2 — Efficient Attention and Mixture-of-Experts
- Why attention is the bottleneck: quadratic cost, KV-cache, and the long-context regime
- IO-aware attention: FlashAttention and its descendants
- Sparse and linear attention families; native sparse attention
- Multi-head Latent Attention (MLA) as used in DeepSeek-V2/V3
- Mixture-of-Experts: routing, load balancing, expert parallelism, and DeepSeek’s MoE design
- Putting it together: how attention + MoE choices reshape the training and serving budget
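The serving-budget argument in Lecture 2 can be made concrete with a short memory calculation. The model dimensions below are hypothetical (a 60-layer model with 128 heads of dimension 128 and an assumed latent width of 512), not the actual DeepSeek configuration; they illustrate why the KV-cache dominates long-context serving and how an MLA-style compressed latent shrinks it.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Standard attention caches one K and one V vector per token per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def latent_cache_bytes(n_layers, latent_dim, seq_len, bytes_per_elem=2):
    """MLA-style: cache a single compressed latent per token per layer."""
    return n_layers * latent_dim * seq_len * bytes_per_elem

SEQ = 128 * 1024  # 128K-token context
std = kv_cache_bytes(n_layers=60, n_kv_heads=128, head_dim=128, seq_len=SEQ)
mla = latent_cache_bytes(n_layers=60, latent_dim=512, seq_len=SEQ)

print(f"standard KV cache: {std / 2**30:.1f} GiB per sequence")
print(f"latent cache:      {mla / 2**30:.1f} GiB per sequence")
print(f"compression:       {std // mla}x")
```

With these (made-up) dimensions the latent cache is 64x smaller, which is the kind of headroom that turns long-context serving from memory-bound into feasible.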
Lecture 3 — Quantization and Sparsity
- Numerical formats for deep learning: FP16, BF16, FP8, INT8, INT4, microscaling formats
- Post-training quantization vs. quantization-aware training; outlier handling
- Low-precision training: FP8 training as deployed in DeepSeek-V3
- Activation, weight, and gradient sparsity; structured vs. unstructured patterns
- SageAttention and other low-precision attention kernels
- Wrapping up: a checklist for “what would I co-design if I were building my own DeepSeek?”
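As a taste of the outlier-handling discussion in Lecture 3, here is a minimal symmetric INT8 quantizer in plain Python (a real kernel would of course be vectorized, and the input values are made up). It shows the core failure mode that motivates techniques like SmoothQuant: a single large activation inflates the scale and crushes the resolution available to everything else.

```python
def quantize_int8(xs):
    """Symmetric per-tensor quantization: x ≈ q * scale, q in [-127, 127]."""
    scale = max(abs(x) for x in xs) / 127.0
    qs = [max(-127, min(127, round(x / scale))) for x in xs]
    return qs, scale

def dequantize(qs, scale):
    return [q * scale for q in qs]

normal = [0.5, -0.3, 0.8, -0.1]
with_outlier = normal + [50.0]  # one large activation outlier

q1, s1 = quantize_int8(normal)
q2, s2 = quantize_int8(with_outlier)

# Round-trip error on the four "normal" values in each case.
err1 = max(abs(a - b) for a, b in zip(normal, dequantize(q1, s1)))
err2 = max(abs(a - b) for a, b in zip(normal, dequantize(q2, s2)[:4]))
print(f"max error without outlier: {err1:.4f}")
print(f"max error with outlier:    {err2:.4f}")  # scale blown up by the outlier
```

Per-channel scales, outlier-aware rescaling, and quantization-aware training are all responses to exactly this effect.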
References
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).
Grattafiori, A., et al. (2024). The Llama 3 herd of models. arXiv:2407.21783.
DeepSeek-AI. (2024). DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv:2405.04434.
DeepSeek-AI. (2024). DeepSeek-V3 technical report. arXiv:2412.19437.
DeepSeek-AI. (2025). DeepSeek-V3.2-Exp: Boosting long-context efficiency with DeepSeek sparse attention. https://aarnphm.xyz/thoughts/papers/DeepSeek_V3_2.pdf.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS).
Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. (2024). FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. In Advances in Neural Information Processing Systems (NeurIPS).
Yuan, J., et al. (2025). Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv:2502.11089.
Lu, E., et al. (2025). MoBA: Mixture of block attention for long-context LLMs. arXiv:2502.13189.
Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2024). Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR).
Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. (2023). SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning (ICML).
Mishra, A., Latorre, J. A., Pool, J., Stosic, D., Stosic, D., Venkatesh, G., Yu, C., & Micikevicius, P. (2021). Accelerating sparse deep neural networks. arXiv:2104.08378.
Sun, M., Liu, Z., Bair, A., & Kolter, J. Z. (2024). A simple and effective pruning approach for large language models. In International Conference on Learning Representations (ICLR).
Zhang, J., Huang, H., Zhang, P., Xu, J., & Chen, J. (2024). SageAttention: Accurate 8-bit attention for plug-and-play inference acceleration. In Advances in Neural Information Processing Systems (NeurIPS).
Pre-requisites
A working knowledge of deep learning and the transformer architecture (the level of a graduate ML course, or having trained or fine-tuned an LLM at least once). Familiarity with PyTorch is assumed. Prior exposure to GPU programming, CUDA, or Triton is helpful but not required — Lecture 1 introduces the necessary systems background from scratch.
Short bio
Jianfei Chen is an Associate Professor in the Department of Computer Science at Tsinghua University. His research focuses on efficient machine learning, with contributions across efficient training and inference algorithms, low-precision computation, and accelerated sampling for generative models. He has open-sourced several widely adopted projects — including DPM-Solver, SageAttention, and TurboDiffusion — which together have accumulated 10K+ GitHub stars and are deployed in many large-scale commercial generative models. His work has been recognized at top machine learning venues (NeurIPS, ICML, ICLR) and underpins production systems used by millions of users.