LLM acceleration

Summary

Methods

There are two main approaches to accelerating LLMs:

  • low-rank: reduce the dimensionality of the matrices involved
  • block: compute the matrices block by block

plus a number of other, more ad-hoc tricks.
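As a rough sense of scale for the low-rank idea, a quick back-of-the-envelope comparison of a full d × d weight matrix against a rank-r factorization (the numbers here are purely illustrative):

```python
# Parameter count of a full d x d matrix vs. a rank-r factorization W ~= B @ A
d, r = 4096, 8
full_params = d * d               # 16,777,216
low_rank_params = 2 * d * r       # 65,536 (B is d x r, A is r x d)
print(full_params // low_rank_params)  # -> 256, i.e. ~256x fewer parameters at rank 8
```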

Papers read so far: 9

Reference

Categories

Low-rank

LoRA

Low-rank adaptation of the large weight matrices during fine-tuning

information

reference

  • Measuring the Intrinsic Dimension of Objective Landscapes
  • Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
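A minimal PyTorch sketch of the LoRA idea for a single linear layer (the class name, rank r, and scaling alpha below are illustrative, not taken from the official implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update B @ A, rank r << d."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen W
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable (r x d_in)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))        # trainable (d_out x r), starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # y = x W^T + scale * x A^T B^T; only A and B receive gradients
        return x @ self.weight.T + self.scale * ((x @ self.lora_A.T) @ self.lora_B.T)
```

Since B starts at zero, the layer initially behaves exactly like the pretrained one; at inference time B @ A can be merged back into W, so there is no extra latency.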

Linformer

SVD-motivated low-rank projection of the large key/value matrices in attention to reduce the required memory

  • Jun 2020
  • 30%
  • code: hold
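A rough single-head sketch of the trick, with an illustrative projected length k (the names are mine, not the paper's code):

```python
import torch

def linformer_attention(Q, K, V, E, F):
    """Q, K, V: (n, d). E, F: (k, n) learned projections along the sequence axis.
    The score matrix becomes (n, k) instead of (n, n), so memory drops from O(n^2) to O(n*k)."""
    d = Q.shape[-1]
    K_proj = E @ K                                  # (k, d): n keys compressed down to k
    V_proj = F @ V                                  # (k, d): n values compressed down to k
    scores = (Q @ K_proj.T) / d ** 0.5              # (n, k)
    return torch.softmax(scores, dim=-1) @ V_proj   # (n, d)
```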

Performers

Low-rank attention approximation via a novel mechanism named FAVOR+ (Fast Attention Via positive Orthogonal Random features)

  • Sep 2020
  • 10%
  • code: hold
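A rough, non-causal sketch of the positive-random-feature idea behind FAVOR+ (the feature count m is arbitrary; the real method also orthogonalizes the random projections and handles causal masking):

```python
import torch

def favor_attention(Q, K, V, m=256):
    """Approximate softmax attention with positive random features.
    Q, K, V: (n, d). Cost is O(n*m*d); the n x n score matrix is never formed."""
    d = Q.shape[-1]
    W = torch.randn(m, d)                              # random projections
    def phi(X):
        X = X / d ** 0.25                              # folds in the 1/sqrt(d) attention scaling
        return torch.exp(X @ W.T - (X ** 2).sum(-1, keepdim=True) / 2) / m ** 0.5
    Qp, Kp = phi(Q), phi(K)                            # (n, m) feature maps
    numerator = Qp @ (Kp.T @ V)                        # (n, d)
    denominator = Qp @ Kp.sum(dim=0, keepdim=True).T   # (n, 1) softmax normalizer
    return numerator / denominator
```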

Block

FlashAttention

Matrix multiplication carried out in blocks (tiling)

Self-attention Does Not Need O(n²) Memory

Attention computed block by block

  • Dec 2021
  • 70%
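A minimal single-head sketch of the idea shared by both papers: walk over key/value blocks with an online (running-max) softmax, so the full n × n score matrix never exists. Block size and names are illustrative; the real FlashAttention kernel also tiles over queries and fuses everything in on-chip SRAM:

```python
import torch

def blockwise_attention(Q, K, V, block=64):
    """softmax(Q K^T / sqrt(d)) V computed one key/value block at a time."""
    n, d = Q.shape
    scale = d ** -0.5
    out = torch.zeros_like(Q)
    running_max = torch.full((n, 1), float("-inf"))
    running_sum = torch.zeros(n, 1)
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = (Q @ Kb.T) * scale                           # (n, b): only this block's scores
        new_max = torch.maximum(running_max, scores.max(-1, keepdim=True).values)
        correction = torch.exp(running_max - new_max)         # rescale earlier partial results
        p = torch.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum(-1, keepdim=True)
        out = out * correction + p @ Vb
        running_max = new_max
    return out / running_sum
```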

+ FlashDecoding++

FlashDecoding++: Faster Large Language Model Inference on GPUs. Three main parts:

  • Asynchronized softmax with a unified maximum value: each block's softmax result can be used directly, so the rescaling/merging step becomes unnecessary. An optimization over FlashAttention (see the sketch after this list).

    scalability

  • Flat GEMM (GEMMs that become thin at the small batch sizes seen during inference) optimization with double buffering. [didn't fully understand]
  • Heuristic dataflow with hardware resource adaptation: choose a different optimization method depending on the M dimension (batch size and sequence length). [didn't fully understand]

    scalability

  • reference
    • cuBLAS / CUTLASS
    • flat GEMM: the method proposed in this paper
    • fastGEMV
  • No code released (as of 2024.11)
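A toy sketch of the first point (unified maximum value), assuming the attention scores arrive split into blocks; phi here is a placeholder constant, whereas the paper picks it from offline statistics of the model and falls back to a safe path on overflow:

```python
import torch

def softmax_with_unified_max(score_blocks, phi=10.0):
    """Each block computes exp(scores - phi) with a fixed, pre-chosen maximum phi,
    so partial results can be combined directly with no per-block rescaling or
    synchronization (unlike the running-max update in FlashAttention)."""
    partial = [torch.exp(b - phi) for b in score_blocks]     # blocks are fully independent
    numerator = torch.cat(partial, dim=-1)
    return numerator / numerator.sum(dim=-1, keepdim=True)   # phi cancels in the ratio
```

The result equals the ordinary softmax as long as exp(scores - phi) neither overflows nor underflows, which is why the choice of phi matters.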

Basic

Parallelization

New Solutions on LLM Acceleration, Optimization, and Application

1) Medusa: adds extra LM heads that output top-k predictions for the next several positions in parallel, which reduces inference latency (see the sketch after this list).

scalability

2) SnapKV: compresses the KV cache for long-sequence tasks
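An illustrative sketch of the Medusa-style heads from point 1 (the head architecture and shapes below are simplified guesses, not the paper's exact design):

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """K extra heads on top of the last hidden state: the base LM head still predicts
    position t+1, while head i proposes top-k candidates for position t+1+i, all in
    parallel; the candidates are then verified in a single forward pass."""
    def __init__(self, hidden_size, vocab_size, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.SiLU(),
                          nn.Linear(hidden_size, vocab_size))
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden, top_k=5):
        # last_hidden: (batch, hidden_size) at the current decoding position
        return [head(last_hidden).topk(top_k, dim=-1).indices for head in self.heads]
```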

Infrastructure

Triton

  • An alternative language to CUDA, designed for deep neural networks
  • first published in 2019, now developed and maintained by OpenAI
  • reasons why it’s great
    • designed for deep neural networks
    • open source, an active project on GitHub
    • real users such as Unsloth, plus others mentioned in GitHub issues
    • easy to use and implement: kernels drop straight into existing Python code, so it's a good place to start
    • support for other (non-NVIDIA) chips
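To give a feel for the programming model, here is essentially the vector-add kernel from the Triton tutorials (a minimal sketch; the block size is arbitrary):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard against out-of-bounds elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)    # launched like a normal Python call
    return out
```

The kernel is written in Python syntax but JIT-compiled to GPU code; block-level tiling, masking, and memory access are explicit, which is exactly what the attention kernels above rely on.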

Hardware Acceleration of LLMs: A comprehensive survey and comparison

Briefly introduces and compares different hardware acceleration methods in terms of efficiency and performance

  • collects methods published from 2020 to 2024
  • compares them on the same process technology
  • different choices depending on whether efficiency or performance is the priority

To Read

Basic

  • Accelerating Relative Entropy Coding with Space Partitioning

parameter

  • Inference with Reference: Lossless Acceleration of Large Language Models

RNN

MoE

  • SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

backprop

  • DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation

Long sequence

  • IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs

2:4

  • Accelerating Transformer Pre-training with 2:4 Sparsity

Pruning

  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

cache

  • Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

trade-off

  • AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration

PE

This post is licensed under CC BY 4.0 by the author.