Model Compression Techniques for LLMs

Overview:

Theoretical baselines: Optimal Brain Surgeon (1993, Babak Hassibi {Stanford}) and Optimal Brain Damage (1989, Yann LeCun {now Meta's Chief AI Scientist})

Model compression techniques such as knowledge distillation, quantization, and pruning are used to reduce model size without sacrificing accuracy. Smaller models allow for faster inference, lower computational cost, and reduced carbon emissions. But how do these techniques differ, and is accuracy really the only metric to optimize for during compression? Which weights are useful to quantize or prune, and which weights shouldn't we prune because of their importance for model performance? Do some compression techniques interact with or influence each other? Is a pruned model (without fine-tuning) better or worse than a smaller dense model with a similar parameter count?
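To make the pruning question concrete, the following is a minimal sketch of magnitude-based (L1) unstructured pruning using PyTorch's built-in pruning utilities, in the spirit of reference 2. The layer sizes and the 30% sparsity level are illustrative assumptions, not recommendations.

```python
# Minimal sketch: magnitude-based unstructured pruning with PyTorch.
# The toy model and the 30% sparsity level are illustrative choices.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% of weights with the smallest absolute value.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Fold the pruning mask permanently into the weight tensor.
        prune.remove(module, "weight")

# Report the resulting overall sparsity (biases are left unpruned).
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"Overall sparsity: {zeros / total:.1%}")
```

Whether such magnitude-based saliency is the right importance criterion for LLM weights is exactly one of the open questions above (see also reference 3).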

In this thesis, we will evaluate different compression techniques for LLMs and compare their impact on model performance, using Hugging Face and open-source implementations of the respective compression techniques.
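As one possible starting point, a model from the Hugging Face Hub can be loaded with 8-bit weight quantization via bitsandbytes (LLM.int8(), reference 4). This is a sketch under the assumption of a CUDA-capable GPU; the model name is a placeholder.

```python
# Minimal sketch: loading a causal LM with LLM.int8() quantization
# via Hugging Face Transformers + bitsandbytes (requires a CUDA GPU
# and the `accelerate` package for device_map="auto").
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # placeholder; any Hub causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

inputs = tokenizer("Model compression reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```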

Research Questions:

Potential research questions for a thesis lie at the intersection of distributed systems and ML:

  1. Investigate current methodologies and challenges in model compression.
  2. Explore the integration of compression mechanisms in LLMs.
  3. Develop and evaluate algorithms that address challenges in model compression and the interplay between different compression techniques.
  4. Analyze the performance and robustness of these algorithms.

Prerequisites:

Work in this research area requires knowledge of, or a strong interest in, machine learning. In addition, the information on how to conduct theses at our department must be read and considered.

Start: Immediately

Contact: Patrick Wilhelm (patrick.wilhelm ∂ tu-berlin.de)

References:

  1. Quantization vs Pruning vs Distillation: Optimizing NNs for Inference (https://www.youtube.com/watch?v=UcwDgsMgTu4&t=588s)

  2. Learning both Weights and Connections for Efficient Neural Networks. NeurIPS (2015).

  3. A Simple and Effective Pruning Approach for Large Language Models. ICLR (2024).

  4. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS (2022).

  5. Optimal Brain Damage. NeurIPS (1989).

  6. Compressing LLMs: The Truth is Rarely Pure and Never Simple. ICLR (2024).

  7. The Science of Deep Learning Model Compression (https://www.youtube.com/watch?v=zHWMfFS1HX0)