Xia “Ben” Hu
[introductory/advanced] Efficient LLM Serving: Algorithms and Systems
Summary
Large Language Models (LLMs) have demonstrated strong performance across a wide range of areas. However, their generality and robustness are largely attributable to their vast scale, which makes deploying and serving them costly. To address the high computational demands and improve accessibility, various techniques, such as model compression, have been proposed to make LLMs more efficient. In this lecture, we will delve into the fundamental challenges and opportunities in large language model serving. In particular, we will examine techniques such as model quantization and weight pruning, and see how these “lossy” compression methods reduce computational demands with minimal performance degradation. We will also explore several families of KV cache compression techniques, discuss their characteristics and trade-offs, present benchmark results for some exemplar methods, and highlight interesting caveats in conducting proper long-context evaluations.
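To make the “lossy” compression idea concrete, here is a minimal sketch, assuming nothing beyond NumPy, of round-to-nearest per-channel weight quantization. It is illustrative only, not the method of any particular paper covered in the lecture, and the function and variable names are hypothetical.

```python
# Minimal sketch of per-channel round-to-nearest weight quantization,
# the basic idea behind lossy weight compression (and, applied to keys
# and values, behind low-bit KV cache quantization). Illustrative only.
import numpy as np

def quantize_per_channel(w: np.ndarray, n_bits: int = 4):
    """Quantize each output channel (row) of a weight matrix to n_bits signed integers."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for 4-bit signed
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    # Real systems pack two 4-bit values per byte; int8 storage keeps the sketch simple.
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from integers and per-channel scales."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4096, 4096)).astype(np.float32)   # a toy weight matrix
    q, s = quantize_per_channel(w, n_bits=4)
    w_hat = dequantize(q, s)
    # Storage shrinks roughly 8x (fp32 -> 4-bit), at the cost of a small reconstruction error.
    print("mean abs error:", np.abs(w - w_hat).mean())
```

Methods such as AWQ (Lin et al., 2024) refine this naive scheme, e.g. by choosing scales with awareness of activation statistics, to keep the accuracy drop minimal.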
Syllabus
- Introduction to large language models
- Model quantization
- Weight pruning
- KV cache compression (see the sketch after this syllabus)
- Extending to long-context scenarios
- Evaluation of compression methods
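As a complement to quantization, the sketch below illustrates a second family of KV cache compression: token eviction that keeps a few initial “attention sink” tokens plus a recent window, in the spirit of StreamingLLM (Xiao et al., 2023). Shapes, defaults, and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of sink-plus-window KV cache eviction: once the cache
# exceeds a budget, drop middle tokens and keep the first few "sink"
# positions plus the most recent window, so memory stops growing with
# sequence length. Illustrative only.
import numpy as np

def evict_kv_cache(keys: np.ndarray, values: np.ndarray,
                   n_sink: int = 4, window: int = 1024):
    """Keep the first n_sink and the last `window` token positions.

    keys, values: arrays of shape (seq_len, num_heads, head_dim).
    """
    seq_len = keys.shape[0]
    if seq_len <= n_sink + window:
        return keys, values                       # nothing to evict yet
    keep = np.r_[0:n_sink, seq_len - window:seq_len]
    return keys[keep], values[keep]

if __name__ == "__main__":
    seq_len, num_heads, head_dim = 5000, 32, 128
    k = np.zeros((seq_len, num_heads, head_dim), dtype=np.float16)
    v = np.zeros_like(k)
    k_small, v_small = evict_kv_cache(k, v, n_sink=4, window=1024)
    print(k_small.shape)   # (1028, 32, 128): cache size is now bounded
```

The lecture discusses the trade-offs of such eviction schemes against quantization-based approaches like KIVI (Liu et al., 2024), and how to evaluate them fairly on long-context tasks.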
References
Brown, Tom B., et al. “Language Models are Few-Shot Learners.” arXiv:2005.14165 (2020).
Sun, Mingjie, et al. “A Simple and Effective Pruning Approach for Large Language Models.” arXiv:2306.11695 (2023).
Lin, Ji, et al. “AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration.” Proceedings of Machine Learning and Systems 6 (2024): 87-100.
Liu, Zirui, et al. “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.” arXiv:2402.02750 (2024).
Yuan, Jiayi, et al. “KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches.” arXiv:2407.01527 (2024).
Jin, Hongye, et al. “LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning.” arXiv:2401.01325 (2024).
Xiao, Guangxuan, et al. “Efficient Streaming Language Models with Attention Sinks.” arXiv:2309.17453 (2023).
Pre-requisites
Machine learning and deep neural network fundamentals, linear algebra, transformers, and Large Language Model basics.
Short bio
Dr. Xia “Ben” Hu is an Associate Professor at Rice University in the Department of Computer Science. Dr. Hu has published over 200 papers in several major academic venues, including NeurIPS, ICLR, ICML, KDD, IJCAI, etc. An open-source package developed by his group, namely AutoKeras, has become the most used automated deep learning system on GitHub (with over 9,000 stars and 1,000 forks). Also, his works on LLM efficiency, deep collaborative filtering, anomaly detection, knowledge graphs, and fast interpretation have been included in the Hugging Face, TensorFlow, Apple, Bing and Meta production systems, respectively. His papers have received several Best Paper (Candidate) awards from venues such as ICML, WWW, WSDM, ICDM, AMIA and INFORMS. He is the recipient of an NSF CAREER Award and an ACM SIGKDD Rising Star Award. His work has been cited more than 30,000 times with an h-index of 70. He was the conference General Co-Chair for WSDM 2020 and ICHI 2023, and Program Co-Chair for AIHC 2024 and CHASE 2025.