
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

โ†Paper Review

ArXiv: https://arxiv.org/abs/2411.05007
Project Page: https://hanlab.mit.edu/projects/svdquant
GitHub Code: https://github.com/mit-han-lab/nunchaku
Demo: https://svdquant.mit.edu/
Affiliation: MIT, NVIDIA, CMU, Princeton, UC Berkeley, SJTU, Pika Labs
💡

Key Differentiator

"Outlier Absorption Using Singular Value Decomposition"

Blog Image

Who is Song Han?

Song Han is an associate professor at MIT EECS. He earned his PhD from Stanford, pioneering efficient AI computing techniques such as "Deep Compression" (pruning, quantization) and the "Efficient Inference Engine," which first introduced weight sparsity to modern AI chips, making it one of the top-5 most cited papers in the 50-year history of ISCA (1953–2023). His innovations, including TinyML and hardware-aware neural architecture search (Once-for-All Network), have advanced AI model deployment on resource-constrained devices.
Blog Image

1. Introduction

Blog Image

Compared with LLMs, the computational cost of diffusion models grows much faster with model size.

As Moore's law slows down, the field is shifting toward cheaper, low-precision inference.

→ 4-bit floating point (FP4) is emerging as the standard.

Blog Image

LLMs

Latency is determined mostly by how fast the weights can be loaded from memory.

So "weight-only quantization" alone already improves speed.

Diffusion models

Latency is bottlenecked by the amount of computation itself, not by weight loading.

This is because even if the weights are reduced to 4 bits, when the activations stay at 16 bits the computation is upcast back to 16 bits, so the amount of computation does not shrink.

Therefore, to cut computation, the activations must be quantized to 4 bits along with the weights.

Blog Image

📢
  • Input Channel → the input channels coming from the activation
  • Channel → each channel of the weight

1. Problems with existing 4-bit quantization

  • Reducing both the weights and the activations to 4 bits is very likely to degrade quality severely.
  • In particular, existing methods (e.g., smoothing) shift outliers between the weights and the activations, but in diffusion models outliers are severe on both sides (W and X), so this is not effective.
    • If outliers are removed from the activations (X), they move into the weights (W); do the reverse, and outliers remain in X.

2. SVDQuant's core idea

  • Rather than simply moving the outliers, it "absorbs" them.
  • A cheap "low-rank branch" is added to absorb the outliers from the weights (W).
  • To do this, SVD (Singular Value Decomposition) is used to split the weight into two components.

3. SVDQuant์˜ ๋‹จ๊ณ„๋ณ„ ๋™์ž‘ ๋ฐฉ์‹

1. Outlier ์ด๋™ (Smoothing)

  • ๋จผ์ € Outlier๋ฅผ ํ™œ์„ฑํ™”๊ฐ’(X)์—์„œ ๊ฐ€์ค‘์น˜(W)๋กœ ์ด๋™ํ•จ.
  • ์ด๋ฅผ ํ†ตํ•ด ํ™œ์„ฑํ™”๊ฐ’(X)์ด ๋” ๊ท ์ผํ•ด์ ธ์„œ 4๋น„ํŠธ ์–‘์žํ™”๊ฐ€ ๋” ์‰ฌ์›Œ์ง.

2. SVD(ํŠน์ด๊ฐ’ ๋ถ„ํ•ด)๋ฅผ ์ ์šฉํ•˜์—ฌ ๊ฐ€์ค‘์น˜(W)๋ฅผ ๋‘ ๊ฐœ์˜ ์„ฑ๋ถ„์œผ๋กœ ๋ถ„ํ•ด

  • W โ†’ L1L2(์ €์ˆœ์œ„ ์„ฑ๋ถ„) + ์ž”์—ฌ ์„ฑ๋ถ„(W - L1L2)๋กœ ๋ถ„๋ฆฌ
  • L1L2(์ €์ˆœ์œ„ ์„ฑ๋ถ„)์€ 16๋น„ํŠธ๋กœ ์œ ์ง€ํ•˜๊ณ , W - L1L2(์ž”์—ฌ ์„ฑ๋ถ„)๋งŒ 4๋น„ํŠธ๋กœ ์–‘์žํ™”
  • ์ฆ‰, ์ €์ˆœ์œ„ ์„ฑ๋ถ„(Low-Rank Component)์ด Outlier๋ฅผ ํก์ˆ˜ํ•˜๋ฉด์„œ 4๋น„ํŠธ ์–‘์žํ™”๊ฐ€ ๋” ์‰ฌ์›Œ์ง.

3. ์ €์ˆœ์œ„ ์„ฑ๋ถ„์„ ๋”ฐ๋กœ ๊ณ„์‚ฐํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์•ก์„ธ์Šค ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ฆ๊ฐ€ํ•˜๋Š” ๋ฌธ์ œ ๋ฐœ์ƒ

  • ์ฆ‰, L1L2๋ฅผ ๋ณ„๋„๋กœ ์ฒ˜๋ฆฌํ•˜๋ฉด ์—ฐ์‚ฐ ์†๋„๊ฐ€ ๋А๋ ค์ง€๋Š” ๋ฌธ์ œ๊ฐ€ ์ƒ๊น€.
  • ๊ธฐ๋ณธ์ ์œผ๋กœ 4๋น„ํŠธ ์—ฐ์‚ฐ์˜ ์†๋„๋ฅผ ๋†’์ด๋ ค๊ณ  ํ–ˆ๋Š”๋ฐ, ์ €์ˆœ์œ„ ์—ฐ์‚ฐ์ด ์ถ”๊ฐ€๋˜๋ฉด ์˜คํžˆ๋ ค ๋А๋ ค์งˆ ์ˆ˜ ์žˆ์Œ.

4. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ „์šฉ ์ถ”๋ก  ์—”์ง„(Nunchaku) ์„ค๊ณ„

  • Nunchaku ์—”์ง„์€ 4๋น„ํŠธ ์–‘์žํ™” ์—ฐ์‚ฐ๊ณผ ์ €์ˆœ์œ„ ์—ฐ์‚ฐ์„ ํ•จ๊ป˜ ์ตœ์ ํ™”ํ•˜์—ฌ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ค„์ž„.
  • ์ฆ‰, L1L2(์ €์ˆœ์œ„ ์—ฐ์‚ฐ)์™€ 4๋น„ํŠธ ์—ฐ์‚ฐ์„ ํ•จ๊ป˜ ์ฒ˜๋ฆฌํ•˜๋Š” ์ปค๋„(fusion kernel)๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ์„ฑ๋Šฅ์„ ์ตœ์ ํ™”.
  • ์ด๋ฅผ ํ†ตํ•ด ์ถ”๊ฐ€์ ์ธ ์—ฐ์‚ฐ๋Ÿ‰์ด ์ƒ๊ธฐ๋”๋ผ๋„ ์‹ค์ œ๋กœ๋Š” 4๋น„ํŠธ ์—ฐ์‚ฐ์˜ ์†๋„๋ฅผ ํ–ฅ์ƒํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค๊ณ„๋จ.
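The four steps above can be sketched in a few lines of NumPy. This is my own illustrative reconstruction, not the paper's kernels: the rank, the max-based smoothing scale, and the symmetric INT4-style `fake_quant` helper are all simplifying assumptions.

```python
import numpy as np

def fake_quant(t, qmax=7):
    # Symmetric INT4-style fake quantization: quantize, then dequantize.
    s = np.abs(t).max() / qmax + 1e-12
    return np.clip(np.round(t / s), -qmax - 1, qmax) * s

def svdquant_forward(X, W, rank=16):
    # Step 1 (smoothing): migrate activation outliers into the weight.
    lam = np.abs(X).max(axis=0) + 1e-8        # per-input-channel scale
    X_hat, W_hat = X / lam, W * lam[:, None]  # X_hat @ W_hat == X @ W

    # Step 2 (SVD): split the smoothed weight into a 16-bit low-rank
    # branch L1 @ L2 plus a residual R that is easier to quantize.
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    L1, L2 = U[:, :rank] * S[:rank], Vt[:rank]
    R = W_hat - L1 @ L2                       # only small singular values left

    # Steps 3-4 (kernel fusion) only affect speed; numerically the output
    # is the 16-bit low-rank path plus the 4-bit residual path.
    return X_hat @ L1 @ L2 + fake_quant(X_hat) @ fake_quant(R)

np.random.seed(0)
X, W = np.random.randn(8, 64), np.random.randn(64, 32)
err_svd = np.abs(svdquant_forward(X, W) - X @ W).max()
err_naive = np.abs(fake_quant(X) @ fake_quant(W) - X @ W).max()
print(err_svd, err_naive)  # the low-rank branch typically shrinks the error
```

Since the low-rank path stays at full precision and only the small residual is fake-quantized, the end-to-end error is typically well below that of quantizing X and W directly.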

How SVDQuant differs from existing methods (SmoothQuant, AWQ)

| Method | Scheme | Outlier handling | Target | Limitation |
|---|---|---|---|---|
| SmoothQuant (2023) | W8A8 | Migrates outliers from input channels (activation) to channels (weight) | LLMs | Outliers pile up in the weights |
| AWQ (2024) | W4A16 | Preserves salient weights during quantization | LLMs | Likely limited on diffusion models |
| SVDQuant (2024) | W4A4 | Absorbs outliers into a low-rank component | Optimized for diffusion models | The extra computation must be addressed |

1. SmoothQuant (2023) – migrating outliers from activations to weights

SmoothQuant's core idea is to migrate the outliers that arise in the activations into the weights.

  • Original problem
    • In Transformer-based models, heavy self-attention computation widens the activation range, and outliers occur frequently.
    • Quantizing these to 8-bit or 4-bit zeroes out all the small values, causing severe information loss.
  • Solution
    • Apply per-channel scaling to the activations, shifting the outliers toward the weights.
    • That is, if an activation channel has large values, scale it down, and compensate by scaling the corresponding weights up.
    • The activation values are then distributed more evenly and quantize with less loss.
  • Limitation
    • Cramming the outliers into the weights makes the weight values larger, so the weight quantization error can grow.
    • Quantizing the weights to 4-bit can therefore lose information.

2. AWQ (Activation-aware Weight Quantization, 2024) – migrating outliers from weights to activations

AWQ reduces weight outliers by spreading them into the activations.

  • Original problem
    • Migrating outliers into the weights, as SmoothQuant does, enlarges the weights, so quantizing them to 4-bit is more likely to lose information.
    • In particular, with many weight outliers, even scaling leaves a large quantization error and performance drops.
  • Solution
    • Instead, migrate some outliers from the weights into the activations, minimizing the information loss when the weights are quantized.
    • That is, protect the salient weight values, and move the unnecessarily large values toward the activations so the weights get a more uniform distribution.
  • Limitation
    • The activation distribution may widen again → problematic if the activations also have to be quantized to 4-bit.

3. SVDQuant (2024) – moving outliers into a low-rank component

SVDQuant tries to fix both problems: instead of merely moving the outliers around, it absorbs them into a low-rank component.

  • Core idea
    • Like SmoothQuant, it first migrates the activation outliers into the weights,
    • but instead of pushing them back into the activations as AWQ does, it splits them off into a low-rank component.
    • That is, the outliers are never quantized; they are kept in a 16-bit low-rank component, minimizing information loss.
  • Advantages
    • Unlike SmoothQuant or AWQ, it does not cram the outliers into one side; the low-rank branch absorbs them and prevents the loss.
    • Both the weights and the activations end up with even distributions, so the quantization error shrinks.
    • In experiments it clearly outperforms SmoothQuant and AWQ under 4-bit quantization.

Conclusion

  • SmoothQuant → moves activation outliers into the weights (possible weight information loss)
  • AWQ → moves weight outliers into the activations (possible activation information loss)
  • SVDQuant → moves outliers from both the weights and the activations into a low-rank component (removing the outliers themselves, minimizing information loss)

In short, SmoothQuant and AWQ each tried to keep only one side free of outliers, while SVDQuant pulls the outliers out into a low-rank branch, so it loses the least information.

3 QUANTIZATION PRELIMINARY

  • ์–‘์žํ™”(Quantization)์˜ ๊ธฐ๋ณธ ๊ฐœ๋…
    • ๋”ฅ๋Ÿฌ๋‹์—์„œ ์–‘์žํ™”๋Š” ์—ฐ์‚ฐ ์†๋„๋ฅผ ๋†’์ด๊ณ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์ค„์ด๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋Š” ๋ฐฉ๋ฒ•.
    • ํ…์„œ X๋ฅผ ์–‘์žํ™”ํ•˜๋Š” ๊ณผ์ •:
  • ์—ฌ๊ธฐ์„œ QX๋Š” ์–‘์žํ™”๋œ(low-bit) ๊ฐ’.
  • sX๋Š” ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ(Scaling Factor).
  • qmax๋Š” ์ตœ๋Œ€ ์–‘์žํ™” ๊ฐ’(๋น„ํŠธ ์ˆ˜์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง).
  • 4๋น„ํŠธ ๋ถ€๋™์†Œ์ˆ˜์  ์–‘์žํ™”(4-bit FP)์—์„œ๋Š” qmax=6์ž„.
  • ์–‘์žํ™”๋œ ํ–‰๋ ฌ ์—ฐ์‚ฐ
    • ์„ ํ˜• ๊ณ„์ธต(Linear Layer)์—์„œ ์ž…๋ ฅ X์™€ ๊ฐ€์ค‘์น˜ W๊ฐ€ ์žˆ์„ ๋•Œ, ์—ฐ์‚ฐ์„ ์–‘์žํ™”๋œ ๊ฐ’์œผ๋กœ ๊ทผ์‚ฌ:
    • ์ฆ‰, ์–‘์žํ™”๋œ ํ…์„œ๋ผ๋ฆฌ ์—ฐ์‚ฐํ•œ ํ›„, ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ sXsW๋ฅผ ๊ณฑํ•˜์—ฌ ๋‹ค์‹œ ์›๋ž˜ ๊ฐ’์— ๊ฐ€๊น๊ฒŒ ๋ณต์›ํ•จ.
  • GPU์—์„œ ๊ฐ™์€ ๋น„ํŠธํญ(bit width)์„ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š” ์ด์œ 
    • ์ตœ์‹  GPU์—์„œ๋Š” ์ž…๋ ฅ(QX)๊ณผ ๊ฐ€์ค‘์น˜(QW)์˜ ๋น„ํŠธ ์ˆ˜๊ฐ€ ๋™์ผํ•ด์•ผ ์—ฐ์‚ฐ ์†๋„๊ฐ€ ํ–ฅ์ƒ๋จ.
    • ๋งŒ์•ฝ QX์™€ QW์˜ ๋น„ํŠธ ์ˆ˜๊ฐ€ ๋‹ค๋ฅด๋ฉด, ๋” ๋†’์€ ๋น„ํŠธ ๊ฐ’์œผ๋กœ ๋ณ€ํ™˜(upcast)๋˜๋ฉด์„œ ์†๋„ ์ด์ ์ด ์‚ฌ๋ผ์ง.
    • ์˜ˆ:
      • ๊ฐ€์ค‘์น˜(W)๋ฅผ 4๋น„ํŠธ๋กœ ์–‘์žํ™”(W4)ํ–ˆ์ง€๋งŒ, ํ™œ์„ฑํ™”๊ฐ’(X)์ด 16๋น„ํŠธ(A16)๋ผ๋ฉด?
        โ†’ ์—ฐ์‚ฐ ์‹œ W4๊ฐ€ A16์œผ๋กœ ์—…์บ์ŠคํŠธ(Upcast)๋˜์–ด ์‹ค์ œ ์†๋„ ํ–ฅ์ƒ์ด ์—†์Œ.
        โ†’ ๋”ฐ๋ผ์„œ, W4A4(๊ฐ€์ค‘์น˜ 4๋น„ํŠธ, ํ™œ์„ฑํ™”๊ฐ’ 4๋น„ํŠธ) ์กฐํ•ฉ์ด ์ตœ์ ํ™”๋œ ๋ฐฉ์‹.
  • W4A4 ์–‘์žํ™”์—์„œ์˜ ๋ฌธ์ œ์ : Outlier(์ด์ƒ์น˜)
    • Diffusion ๋ชจ๋ธ์—์„œ๋Š” ๊ฐ€์ค‘์น˜(W)์™€ ํ™œ์„ฑํ™”๊ฐ’(X) ์–‘์ชฝ์—์„œ Outlier(๊ทน๋‹จ์ ์ธ ๊ฐ’)๊ฐ€ ๋งŽ์ด ๋ฐœ์ƒํ•จ.
    • Outlier๊ฐ€ ๋งŽ์œผ๋ฉด ์–‘์žํ™” ํ›„ ํ’ˆ์งˆ์ด ํฌ๊ฒŒ ์ €ํ•˜๋จ.
    • ๊ธฐ์กด ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•:
      1. Quantization-Aware Training (QAT)
        • ์–‘์žํ™”๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๋Š” ๋ฐฉ์‹.
        • ํ•˜์ง€๋งŒ, 100์–ต ๊ฐœ ์ด์ƒ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜(์˜ˆ: FLUX.1 ๋ชจ๋ธ)๋ฅผ ์กฐ์ •ํ•˜๋ ค๋ฉด ๊ณ„์‚ฐ ๋น„์šฉ์ด ๋งค์šฐ ํผ.
      1. Rotation ๊ธฐ๋ฒ• (Ashkboos et al., 2024; Liu et al., 2024c)
        • ๊ฐ€์ค‘์น˜์™€ ํ™œ์„ฑํ™”๊ฐ’์„ ํšŒ์ „(rotation)ํ•˜์—ฌ Outlier๋ฅผ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•.
        • ํ•˜์ง€๋งŒ, Diffusion ๋ชจ๋ธ์˜ "Adaptive Normalization Layer"์—์„œ๋Š” ์ ์šฉ์ด ์–ด๋ ค์›€.
        • ์ด์œ :
          • Adaptive Normalization์€ ์‹คํ–‰ ์‹œ๊ฐ„(runtime) ์ค‘์— ์ƒˆ๋กœ์šด ๊ฐ€์ค‘์น˜๋ฅผ ์ƒ์„ฑ.
          • ๋”ฐ๋ผ์„œ, ์‚ฌ์ „ ๊ณ„์‚ฐ๋œ ํšŒ์ „ ํ–‰๋ ฌ์„ ์ ์šฉํ•  ์ˆ˜ ์—†์Œ.
          • ์‹คํ–‰ ์‹œ๊ฐ„์— ํšŒ์ „์„ ์ ์šฉํ•˜๋ฉด ์—ฐ์‚ฐ๋Ÿ‰์ด ์ฆ๊ฐ€ํ•˜์—ฌ ์†๋„๊ฐ€ ๋А๋ ค์ง.
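The formulas above can be made concrete with a toy FP4 quantizer. This is a sketch under stated assumptions: the non-negative value set {0, 0.5, 1, 1.5, 2, 3, 4, 6} is the standard E2M1 grid (whose maximum is the q_max = 6 mentioned above), and the nearest-value rounding is a simplification.

```python
import numpy as np

QMAX = 6.0  # largest magnitude representable in FP4 (E2M1), as noted above
FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # non-negative E2M1 values

def quantize(t):
    # s_X maps the largest magnitude onto q_max; each scaled entry then
    # snaps to the nearest representable FP4 value.
    s = np.abs(t).max() / QMAX
    idx = np.abs(np.abs(t / s)[..., None] - FP4).argmin(axis=-1)
    return np.sign(t) * FP4[idx], s

np.random.seed(0)
X, W = np.random.randn(4, 8), np.random.randn(8, 3)
QX, sX = quantize(X)
QW, sW = quantize(W)
approx = sX * sW * (QX @ QW)   # XW is approximated as sX * sW * Q(X) Q(W)
print(np.abs(approx - X @ W).max())
```

Note that both operands are 4-bit here (W4A4), so the integer-like product QX @ QW is what a matching-bit-width GPU kernel would accelerate; the scales are applied once at the end.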

📢

Why do outliers degrade performance?

1. The scaling-factor problem

  • Quantization must cover the full range (min–max) of the data, so an outlier inflates the scaling factor abnormally.
  • Most values sit in a small range, but one or two large values (outliers) blow up the scale, so the small values all map to 0 or to the same quantized level.

    Example:

    • Original weight values: [-0.1, -0.05, 0.0, 0.05, 0.1, 5.0] (outlier: 5.0)
    • Without the outlier: the scale is set by 0.1, and the small values map onto distinct levels of the INT4 range [-8, 7]
    • With the outlier (5.0) included: the scale is set by 5.0, and the small values all collapse to 0, losing their information

2. Precision loss

  • When the whole range is stretched to accommodate an outlier, the remaining values, despite meaningful small differences, are very likely to map to the same quantized value.
  • That is, the model can no longer express small variations (gradients, etc.), and its representational power drops sharply.

3. Activation outliers increase computation

  • With outliers present, quantized values often have to be converted back to FP32 (32-bit) floating point, which breaks the compute optimization.
  • In Transformer-based models especially, self-attention is large, so activation outliers can translate into higher memory use and more computation.

4 Method

4.1 PROBLEM FORMULATION

์–‘์žํ™”์˜ ์˜ค๋ฅ˜๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋จ.

์›๋ž˜ ํ–‰๋ ฌ ๊ณฑ์…ˆ XW์™€ ์–‘์žํ™”๋œ ๊ฐ’์œผ๋กœ ์—ฐ์‚ฐํ•œ Q(X)Q(W)์˜ ์ฐจ์ด๋ฅผ ์ธก์ •ํ•˜๋Š” ๊ฐ’

์ข€ ๋” ์„ธ๋ถ„ํ™”

์ด ์‹์€ ์–‘์žํ™” ์˜ค๋ฅ˜๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๋„ค ๊ฐ€์ง€ ์š”์†Œ๋ฅผ ๋ณด์—ฌ์คŒ:

  1. ๊ฐ€์ค‘์น˜์˜ ํฌ๊ธฐ: ๏ปฟ
  1. ์ž…๋ ฅ์˜ ํฌ๊ธฐ: ๏ปฟ
  1. ๊ฐ€์ค‘์น˜์˜ ์–‘์žํ™” ์˜ค๋ฅ˜: ๏ปฟ
  1. ์ž…๋ ฅ์˜ ์–‘์žํ™” ์˜ค๋ฅ˜: ๏ปฟ

์ฆ‰, ์ „์ฒด์ ์ธ ์–‘์žํ™” ์˜ค๋ฅ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋ ค๋ฉด ์ด ๋„ค ๊ฐ€์ง€ ์š”์†Œ๋ฅผ ์กฐ์ ˆํ•˜๋Š” ๊ฒƒ์ด ํ•ต์‹ฌ์ž„.

4.2 SVDQUANT: ABSORBING OUTLIERS VIA LOW-RANK BRANCH


Migrate outliers from activation to weight

Blog Image

The "After Smoothing" panel is the existing technique, and it has a drawback:

  • ✅ It succeeds in removing the outliers from the activations (X),
  • ❌ but instead the outliers in the weights (W) grow.
  • As a result, the goal of reducing the overall quantization error is not really achieved.

Absorb magnified weight outliers with a low-rank branch.

Instead of quantizing W directly to 4-bit, the strategy splits off a low-rank component that absorbs the outliers:

  • L1 ∈ ℝ^(m×r) (matrix that projects the input down to rank r)
  • L2 ∈ ℝ^(r×n) (matrix that projects back up to the output dimension)
  • R = W − L1L2 (residual matrix, the part that gets quantized to 4-bit)

SVD (Singular Value Decomposition)

  • For an m×n matrix, a direct matrix-vector product costs O(mn).
  • Keeping only rank r via SVD cuts this to O(mr + rn).
  • W = U Σ Vᵀ, where
  • U: input-side transform matrix (U ∈ ℝ^(m×k))
  • Σ: diagonal matrix holding the singular values (σ₁ ≥ σ₂ ≥ …)
  • V: output-side transform matrix (V ∈ ℝ^(n×k))

The point is the diagonal entries (the singular values)

  • The large singular values carry the dominant structure — after smoothing, this is exactly where the magnified outliers concentrate.
  • Once those top components are split off, only small singular values remain, so what is left is low-magnitude residue that quantizes well.

Low-rank decomposition

L1L2 = Σᵢ₌₁ʳ σᵢ uᵢ vᵢᵀ (keep only the top-r singular components)

Only the top r singular values among the diagonal entries are kept, producing L1, L2 that contain the most important information.

→ computed separately in 16-bit

The remaining part (the small singular values) → split off into R

→ computed in 4-bit

→ Since only small singular values are left, quantizing R to 4-bit loses far less information

4.3 NUNCHAKU: Fusing Low-Rank and Low-Bit Branch Kernels

Blog Image

The performance problem caused by the low-rank branch

  • In operations like the QKV projection, running the low-rank branch separately exceeds the GPU's L2 cache, forcing data to be fetched from DRAM.
  • The resulting extra memory traffic is what slows the computation down.
  • As Figure 6(a) shows, the low-rank branch alone accounts for 50% of the 4-bit compute latency.

NUNCHAKU: the fix

  • The paper proposes fusing the low-rank branch and the low-bit branch into shared kernels to cut memory accesses.
  • As Figure 6(b) shows, two pairs of kernels are merged so they can share data:
    1. The down-projection is fused with the quantization kernel.
    2. The up-projection is fused with the 4-bit compute kernel.
  • The low-rank branch can then share activations with the low-bit branch, eliminating the extra memory accesses.
  • As a result, the number of kernel launches is halved, which improves speed.

5 Experiments

Benchmark models

| Model | Architecture | Parameters | Special Features |
|---|---|---|---|
| FLUX.1-dev | DiT | 12B | 50-step guidance-distilled |
| FLUX.1-schnell | DiT | 12B | 4-step timestep-distilled |
| PixArt-Σ | DiT | 600M | 20-step default |
| SANA | DiT | 1.6B | 32× compression autoencoder, Linear Attention |
| SDXL | UNet | 2.6B | 30-step |

Quantization baselines

| Method | Description | Usage in benchmarking |
|---|---|---|
| NF4 (4-bit NormalFloat) | Optimized 4-bit weight-only quantization assuming a normal distribution | Weight-only quantization baseline for FLUX.1 |
| ViDiT-Q | Per-token quantization + smoothing to reduce outliers | Achieves lossless 8-bit quantization on PixArt-Σ |
| MixDQ | Detects outliers in text embeddings and protects them with 16-bit pre-computation | Enables W4A8 quantization with minimal performance drop on SDXL-Turbo |
| TensorRT | Industry-standard PTQ toolkit for 8-bit quantization | Uses smoothing + percentile calibration over specific timesteps |
Blog Image
Blog Image
Blog Image
Blog Image
Blog Image

Limitation

This is tremendous work, and technically I think it has essentially no limits.

But if I had to name one: since Song Han also does research at NVIDIA, the code was written only for NVIDIA chips and is optimized for them.

It even requires CUDA 12.2 or later. I tried to run it on my university's server, but it wouldn't run because the GPU driver version was too old.

Blog Image

Of course this was done to squeeze the utmost optimization out of NVIDIA chips, but it cannot be used on other GPU hardware.

It would be great to see the same approach applied to other GPUs and mobile edge devices.

Didn't it just reduce memory and inference time? Why bring up accuracy?

Memory, Latency?

As the figure below shows, isn't the point that SVD-based quantization delivered the memory and latency gains?

Blog Image

๋น„๊ต๊ฐ€ ๊ธฐ์กด 16bit / W4A16 / W4A4(SVD) ์˜€๊ธฐ ๋•Œ๋ฌธ์— ํฐ ์ฐจ์ด๋ฅผ ๋ณด์—ฌ์ค€ ๊ฒƒ ๊ฐ™๋‹ค.

Inference time๊ณผ Memory ์ค„์ธ๊ฒŒ ํฌ์ธํŠธ์ธ์ค„ ์•Œ์•˜๋Š”๋ฐ ์™„์ „ ์ž˜๋ชป ์ƒ๊ฐํ•œ๊ฒƒ ๊ฐ™๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.

๋ฌผ๋ก  Outlier๋•Œ๋ฌธ์— 32bit๋กœ ์ฒ˜๋ฆฌํ–ˆ๋˜ ๋ถ€๋ถ„๋“ค์ด ์—†์–ด์ง€๊ณ  16bit๋กœ low rank๋กœ ๋”ฐ๋กœ ๋นผ๋‹ˆ๊นŒ ํ–ฅ์ƒ์€ ๋์„ ๊ฒƒ์ด์ง€๋งŒ, ์ œ ์ƒ๊ฐ์—๋Š” SVD ์—†๋Š” W4A4๋ž‘ ๋น„๊ตํ–ˆ๋‹ค๋ฉด, Memory์™€ ์ถ”๋ก ์‹œ๊ฐ„์ด 3๋ฐฐ ์ด์ƒ ์ฐจ์ด๋‚˜์ง€๋Š” ์•Š์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Blog Image

์ •ํ™•๋„?

๊ธฐ์กด ์–‘์žํ™”์—์„œ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๋Š” ๋ถ€๋ถ„์ด ์•„๋ž˜ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ ๋งŽ์ด ๋–จ์–ด์กŒ์—ˆ๋‹ค.

(๋‘๋ฒˆ์งธ๊ฐ€ ๊ธฐ์กด ์–‘์žํ™” ๊ธฐ๋ฒ•, ์‹ฌ์ง€์–ด W4A4๊ฐ€ ์•„๋‹Œ W4A16์ธ๋ฐ๋„ ๋” ๋–จ์–ด์ง€๋Š” ๋ชจ์Šต์„ ๋ณด์ž„)

Blog Image

Why accuracy could be preserved → the low-rank branch

  • Conventional 4-bit quantization converts the entire weight to 4-bit, so the information loss is large.
  • SVDQuant decomposes the weight into a low-rank component (L1L2) and a residual (R).
  • L1L2 is kept at 16-bit precision → the important information stays at high precision.
  • Only the residual R is quantized to 4-bit, minimizing information loss.
  • The low-rank decomposition is not done just once; R is optimized iteratively to minimize the quantization error.

Quantization error

In this formulation, decomposing into L1L2 + R minimizes the error introduced by quantization.

Quantization inevitably loses accuracy because of outliers; preserving accuracy to this degree is what makes this such remarkable work!

Could SVD be used for LLMs? Why publish on diffusion?

My initial thought: it could be applied, but since the LLM bottleneck is loading heavy weights, cutting the computation afterward might not pay off as much as it does for diffusion (where computation is the bottleneck).

Looking into it, that thought seems correct; but in practice it would also help with the accuracy issue discussed above, so it could actually be beneficial on the accuracy side.

Existing quantization

  • GPTQ → post-training quantization; converts only the weights to 4-bit.
  • AWQ → analyzes weight saliency and converts to 4-bit selectively.
  • SmoothQuant → migrates activation outliers into the weights to reduce quantization error.

What if SVDQuant were used?

Since the weight is split into low-rank (16-bit) + residual (4-bit), compressing it while keeping the important information,

it should help accuracy, just as it does for diffusion!

For an explanation from the author, Song Han

Blog Image

Youtube [Introduction to SVDQuant for 4-bit Diffusion Models]

Demo

https://hanlab.mit.edu/projects/svdquant

Blog Image