LLM acceleration
Summary
Methods
There are two main methods to accelerate LLMs:
- low-rank: reduce the dimension of the weight matrices
- block: compute matrix multiplications block by block
plus a few other, trickier methods (a tiny sketch of the two main ideas follows)
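A minimal numpy sketch of the two ideas; all shapes and names here are illustrative, not taken from any particular paper:

```python
import numpy as np

# Low-rank: replace a d x d weight by two thin factors (d x r and r x d, r << d),
# so the matmul costs O(d*r) per row instead of O(d*d).
d, r, n = 1024, 16, 8
W = np.random.randn(d, d)
A, B = np.random.randn(d, r), np.random.randn(r, d)   # low-rank factors
x = np.random.randn(n, d)
y_lowrank = (x @ A) @ B                                # never forms a d x d product

# Block: compute a large matmul tile by tile so only small blocks sit in fast memory.
bs = 256
y_block = np.zeros((n, d))
for j in range(0, d, bs):
    y_block[:, j:j+bs] = x @ W[:, j:j+bs]              # one output tile at a time
assert np.allclose(y_block, x @ W)
```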
papers already read: 9
Reference
- xformers: collection of optimized transformers
- fast attention collection
- unsloth
- awesome LLM system
Categories
Low-rank
LoRA
Low-rank adaptation of large weight matrices during fine-tuning (see the sketch after the references)
information
- Jun 2021
- 70%
- note
reference
- Measuring the Intrinsic Dimension of Objective Landscapes
- Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
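A minimal sketch of the LoRA idea: freeze the pretrained weight and train only a rank-r update ΔW = B·A scaled by alpha/r. The `LoRALinear` module below is my own simplified illustration, not the reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = base(x) + (alpha / r) * x A^T B^T"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init: update starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
x = torch.randn(4, 768)
print(layer(x).shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```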
Linformer
SVD-motivated low-rank projection of the large K/V matrices to reduce the required memory (sketch below)
- Jun 2020
- 30%
- code: hold
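A numpy sketch of the core Linformer trick as I read it: learned projections compress the sequence-length dimension of K and V from n down to k, so the attention map is n × k instead of n × n. Variable names are mine:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d, k = 4096, 64, 256                    # sequence length, head dim, projected length
Q, K, V = (np.random.randn(n, d) for _ in range(3))
E, F = np.random.randn(k, n), np.random.randn(k, n)   # learned length-wise projections

K_proj, V_proj = E @ K, F @ V              # (k, d): sequence dimension compressed
attn = softmax(Q @ K_proj.T / np.sqrt(d))  # (n, k) attention map instead of (n, n)
out = attn @ V_proj                        # (n, d)
```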
Performers
low-rank approximation of attention with a novel method named FAVOR+ (positive random features); sketch below
- Sep 2020
- 10%
- code: hold
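A rough numpy sketch of kernelized attention with positive random features (the FAVOR+ idea). It follows the softmax-kernel estimator from the paper but skips the orthogonal-feature construction and numerical-stability tricks, so treat it as intuition only:

```python
import numpy as np

def feature_map(X, W):
    # positive random features for the softmax kernel: exp(x.w - ||x||^2 / 2) / sqrt(m)
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

n, d, m = 2048, 64, 256                    # sequence length, head dim, number of random features
Q, K = (np.random.randn(n, d) / d ** 0.25 for _ in range(2))   # fold in the 1/sqrt(d) scaling
V = np.random.randn(n, d)
W = np.random.randn(m, d)                  # random projections (the paper uses orthogonal blocks)

Qf, Kf = feature_map(Q, W), feature_map(K, W)               # (n, m)
out = (Qf @ (Kf.T @ V)) / (Qf @ Kf.sum(axis=0))[:, None]    # linear in n, no n x n matrix
```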
Block
FlashAttention
Matrix multiplication computed block by block (tiling)
Self-attention Does Not Need O(n²) Memory
computes attention block by block with a running (online) softmax; see the sketch below
- Dec 2021
- 70%
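A readability-first numpy sketch of the shared idea behind FlashAttention and the O(n²)-memory paper: stream over key/value blocks, keep a running max and running softmax denominator per query, and never materialize the full n × n score matrix. The real implementations fuse this into GPU kernels and also tile over queries:

```python
import numpy as np

def blockwise_attention(Q, K, V, block=128):
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)        # running max of scores, per query
    l = np.zeros(n)                # running softmax denominator, per query
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s+block], V[s:s+block]
        S = Q @ Kb.T / np.sqrt(d)                 # (n, block): scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)                 # rescale what was accumulated so far
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]

n, d = 512, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
S = Q @ K.T / np.sqrt(d)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), ref)
```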
FlashDecoding++
FlashDecoding++: Faster Large Language Model Inference on GPUs; three main parts:
- Softmax with blocks and a unified maximum value: each block's softmax result can be used directly, so merging/rescaling is unnecessary. An optimization over FlashAttention. (See the sketch after the references.)
- Flat GEMM (small batch sizes during inference) optimization with double buffering. [didn't fully understand]
- Heuristic dataflow with hardware resource adaptation: choose different optimization methods for different M values (batch size and sequence length). [didn't fully understand]
- reference
- cuBLAS / CUTLASS
- flat GEMM: the method introduced in this paper
- fastGEMV
- No code released (as of Nov 2024)
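My reading of the "unified maximum value" trick, as a numpy sketch: instead of a per-block running max (which forces the rescaling step in the FlashAttention-style code above), a fixed bound φ on the scores is chosen up front, so each block's partial exp-sums can simply be added. The constant φ here is made up, and the paper's overflow-detection fallback is omitted:

```python
import numpy as np

def softmax_unified_max(scores_blocks, phi):
    """Partial softmax per block with a shared, precomputed max phi; the partial
    results add up directly, no per-block rescaling or merging needed."""
    num_parts, den_parts = [], []
    for S in scores_blocks:                      # each S: (n_queries, block)
        E = np.exp(S - phi)                      # safe as long as scores rarely exceed phi
        num_parts.append(E)
        den_parts.append(E.sum(axis=1))
    num = np.concatenate(num_parts, axis=1)
    den = np.sum(den_parts, axis=0)
    return num / den[:, None]

n, L, block = 4, 512, 128
S = np.random.randn(n, L) * 2.0
blocks = [S[:, j:j+block] for j in range(0, L, block)]
phi = 6.0                                        # assumed upper bound on scores (a statistic in the paper)
approx = softmax_unified_max(blocks, phi)
ref = np.exp(S - S.max(axis=1, keepdims=True)); ref /= ref.sum(axis=1, keepdims=True)
assert np.allclose(approx, ref)
```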
Basic
Parallelization
New Solutions on LLM Acceleration, Optimization, and Application
1) Medusa: outputs top-k predictions for the next several positions in parallel by adding extra LM heads, which reduces inference latency (toy sketch below)
2) SnapKV: compresses the KV cache for long-sequence tasks
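A toy sketch of the Medusa idea: several extra LM heads sit on the last hidden state and propose top-k candidates for the next few positions in a single forward pass (the tree-based verification of those candidates is omitted). All names below are mine:

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """k extra LM heads over the last hidden state; head i guesses the token
    at position t + 1 + i, so candidates for several positions come out at once."""
    def __init__(self, hidden: int, vocab: int, k: int = 4, topk: int = 5):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(k))
        self.topk = topk

    def forward(self, last_hidden):                  # (batch, hidden) from the base model
        return [h(last_hidden).topk(self.topk, dim=-1).indices for h in self.heads]

heads = MedusaHeads(hidden=768, vocab=32000, k=4)
candidates = heads(torch.randn(2, 768))
print([c.shape for c in candidates])                 # four tensors of shape (2, 5)
```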
Infrastructure
Triton
- an alternative language to CUDA, designed for deep neural networks
- paper published in 2019; now developed by OpenAI
- reasons why it's great:
  - designed for deep neural networks
  - open-source, active project on GitHub
  - real users such as unsloth; more can be found in the GitHub issues
  - friendly to use and implement: kernels are plugged directly into existing Python code, so it's a good place to start (see the kernel example below)
  - support for chips beyond NVIDIA GPUs
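Assuming this entry refers to OpenAI's Triton: the canonical vector-add kernel from the Triton tutorials shows the "add it into current Python code" point. Written from memory, so treat it as a sketch; it needs a CUDA GPU and the `triton` package:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```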
Hardware Acceleration of LLMs: A comprehensive survey and comparison
Briefly introduces and compares different hardware acceleration methods for LLMs in terms of efficiency and performance
- collects methods from 2020-2024
- compares them normalized to the same process technology
- different choices depending on whether efficiency or performance matters more
To Read
Basic
- Accelerating Relative Entropy Coding with Space Partitioning
parameter
- Inference with Reference: Lossless Acceleration of Large Language Models
RNN
- RWKV: RWKV is an RNN with transformer-level LLM performance
- An Attention Free Transformer
MoE
- SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
backprop
- DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation
Long sequence
- IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs
2:4
- Accelerating Transformer Pre-training with 2:4 Sparsity
Pruning
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
cache
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
trade-off
- AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
PE