
Jianfei Chen (Tsinghua University), [intermediate] Efficient Large Model Training and Inference
Summary
Large language models have become extraordinarily expensive to train and serve, putting frontier research seemingly out of reach for academic groups and small teams. Yet recent systems such as DeepSeek demonstrate that careful co-design of model architecture, learning algorithms, and GPU kernels — guided by an awareness of the underlying hardware — can deliver order-of-magnitude gains in efficiency without sacrificing capability. This tutorial uses DeepSeek as a running case study to unpack the principles and practice behind state-of-the-art efficient machine learning.
Participants will move from first principles — the GPU performance model and the arithmetic of a transformer forward pass — through the modern toolbox of efficient attention, mixture-of-experts, low-precision computation, and structured sparsity. Each topic is approached both as an idea (why it works, what it costs) and as an implementation (how to write, profile, and validate it in Triton or PyTorch). By the end of the three sessions, participants will be able to explain what makes DeepSeek-class models efficient and will have the conceptual and practical tools to apply the same recipes to their own research under limited academic compute budgets.
Syllabus
Lecture 1 — Foundations: Transformers, GPU Performance Models, and Triton
- The economics of large model training and the academic compute gap
- Transformer architecture revisited from a systems perspective: FLOPs, memory traffic, and activation footprint
- The GPU performance model: roofline analysis, arithmetic intensity, memory hierarchy, tensor cores
- Profiling a real model: where the time and memory actually go
- Custom kernels in Triton: a hands-on walkthrough from element-wise ops to fused matmul
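To preview the flavor of Lecture 1, here is a back-of-envelope roofline calculation in plain Python. The peak numbers are illustrative A100-like figures (312 TFLOP/s BF16 tensor-core throughput, 2.0 TB/s HBM bandwidth), not a claim about any specific GPU; the point is the arithmetic, which shows why a large training matmul is compute-bound while a batch-1 decode matvec is memory-bound.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n] in BF16."""
    flops = 2 * m * n * k                                  # one multiply-add per (i,j,l) triple
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C once
    return flops / bytes_moved

# Illustrative peak numbers (assumed, roughly A100-class).
PEAK_FLOPS = 312e12   # BF16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12      # HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW  # intensity above which compute dominates

big = matmul_arithmetic_intensity(4096, 4096, 4096)  # training-style matmul
decode = matmul_arithmetic_intensity(1, 4096, 4096)  # batch-1 decode matvec

print(f"ridge point:   {ridge:.0f} FLOP/byte")
print(f"4096^3 matmul: {big:.0f} FLOP/byte (compute-bound: {big > ridge})")
print(f"decode matvec: {decode:.2f} FLOP/byte (memory-bound: {decode < ridge})")
```

The same two-line calculation, applied layer by layer, is the starting point for the profiling exercise in the lecture.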
Lecture 2 — Efficient Attention and Mixture-of-Experts
- Why attention is the bottleneck: quadratic cost, KV-cache, and the long-context regime
- IO-aware attention: FlashAttention and its descendants
- Sparse and linear attention families; native sparse attention
- Multi-head Latent Attention (MLA) as used in DeepSeek-V2/V3
- Mixture-of-Experts: routing, load balancing, expert parallelism, and DeepSeek’s MoE design
- Putting it together: how attention + MoE choices reshape the training and serving budget
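The serving-budget argument in Lecture 2 can be made concrete with a short memory calculation. The model dimensions below are hypothetical (a 60-layer model with 128 heads of dimension 128 and an assumed latent width of 512), not the actual DeepSeek configuration; they illustrate why the KV-cache dominates long-context serving and how an MLA-style compressed latent shrinks it.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Standard attention caches one K and one V vector per token per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def latent_cache_bytes(n_layers, latent_dim, seq_len, bytes_per_elem=2):
    """MLA-style: cache a single compressed latent per token per layer."""
    return n_layers * latent_dim * seq_len * bytes_per_elem

SEQ = 128 * 1024  # 128K-token context
std = kv_cache_bytes(n_layers=60, n_kv_heads=128, head_dim=128, seq_len=SEQ)
mla = latent_cache_bytes(n_layers=60, latent_dim=512, seq_len=SEQ)

print(f"standard KV cache: {std / 2**30:.1f} GiB per sequence")
print(f"latent cache:      {mla / 2**30:.1f} GiB per sequence")
print(f"compression:       {std // mla}x")
```

With these (made-up) dimensions the latent cache is 64x smaller, which is the kind of headroom that turns long-context serving from memory-bound into feasible.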
Lecture 3 — Quantization and Sparsity
- Numerical formats for deep learning: FP16, BF16, FP8, INT8, INT4, microscaling formats
- Post-training quantization vs. quantization-aware training; outlier handling
- Low-precision training: FP8 training as deployed in DeepSeek-V3
- Activation, weight, and gradient sparsity; structured vs. unstructured patterns
- SageAttention and other low-precision attention kernels
- Wrapping up: a checklist for “what would I co-design if I were building my own DeepSeek?”
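As a taste of the outlier-handling discussion in Lecture 3, here is a minimal symmetric INT8 quantizer in plain Python (a real kernel would of course be vectorized, and the input values are made up). It shows the core failure mode that motivates techniques like SmoothQuant: a single large activation inflates the scale and crushes the resolution available to everything else.

```python
def quantize_int8(xs):
    """Symmetric per-tensor quantization: x ≈ q * scale, q in [-127, 127]."""
    scale = max(abs(x) for x in xs) / 127.0
    qs = [max(-127, min(127, round(x / scale))) for x in xs]
    return qs, scale

def dequantize(qs, scale):
    return [q * scale for q in qs]

normal = [0.5, -0.3, 0.8, -0.1]
with_outlier = normal + [50.0]  # one large activation outlier

q1, s1 = quantize_int8(normal)
q2, s2 = quantize_int8(with_outlier)

# Round-trip error on the four "normal" values in each case.
err1 = max(abs(a - b) for a, b in zip(normal, dequantize(q1, s1)))
err2 = max(abs(a - b) for a, b in zip(normal, dequantize(q2, s2)[:4]))
print(f"max error without outlier: {err1:.4f}")
print(f"max error with outlier:    {err2:.4f}")  # scale blown up by the outlier
```

Per-channel scales, outlier-aware rescaling, and quantization-aware training are all responses to exactly this effect.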
References
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).
Grattafiori, A., et al. (2024). The Llama 3 herd of models. arXiv:2407.21783.
DeepSeek-AI. (2024). DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv:2405.04434.
DeepSeek-AI. (2024). DeepSeek-V3 technical report. arXiv:2412.19437.
DeepSeek-AI. (2025). DeepSeek-V3.2-Exp: Boosting long-context efficiency with DeepSeek sparse attention. https://aarnphm.xyz/thoughts/papers/DeepSeek_V3_2.pdf.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS).
Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. (2024). FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. In Advances in Neural Information Processing Systems (NeurIPS).
Yuan, J., et al. (2025). Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv:2502.11089.
Lu, E., et al. (2025). MoBA: Mixture of block attention for long-context LLMs. arXiv:2502.13189.
Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2024). Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR).
Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. (2023). SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning (ICML).
Mishra, A., Latorre, J. A., Pool, J., Stosic, D., Stosic, D., Venkatesh, G., Yu, C., & Micikevicius, P. (2021). Accelerating sparse deep neural networks. arXiv:2104.08378.
Sun, M., Liu, Z., Bair, A., & Kolter, J. Z. (2024). A simple and effective pruning approach for large language models. In International Conference on Learning Representations (ICLR).
Zhang, J., Huang, H., Zhang, P., Xu, J., & Chen, J. (2024). SageAttention: Accurate 8-bit attention for plug-and-play inference acceleration. In Advances in Neural Information Processing Systems (NeurIPS).
Pre-requisites
A working knowledge of deep learning and the transformer architecture (the level of a graduate ML course, or having trained or fine-tuned an LLM at least once). Familiarity with PyTorch is assumed. Prior exposure to GPU programming, CUDA, or Triton is helpful but not required — Lecture 1 introduces the necessary systems background from scratch.
Short bio
Jianfei Chen is an Associate Professor in the Department of Computer Science at Tsinghua University. His research focuses on efficient machine learning, with contributions across efficient training and inference algorithms, low-precision computation, and accelerated sampling for generative models. He has open-sourced several widely adopted projects — including DPM-Solver, SageAttention, and TurboDiffusion — which together have accumulated 10K+ GitHub stars and are deployed in many large-scale commercial generative models. His work has been recognized at top machine learning venues (NeurIPS, ICML, ICLR) and underpins production systems used by millions of users.