
Test-Time Learning for Large Language Models

โ†Paper Review

ArXiv: https://arxiv.org/abs/2505.20633
AuthorsJinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, Mingkui Tan
Affiliations: School of Software Engineering, South China University of Technology; Pazhou Laboratory; Zhejiang University; South China Agricultural University; Chongqing University of Posts and Telecommunications; Key Laboratory of Big Data and Intelligent Robot, Ministry of Education
💡

Key Differentiator

Perplexity Minimization

Existing TTA methods (e.g., Tent, EATA, COME) are all based on entropy minimization

→ they push the output distribution toward lower uncertainty

This paper instead takes the LLM's autoregressive structure into account

→ and proposes a completely different objective: minimizing input perplexity rather than output entropy

2. Related Work

Existing methods and their limitations for LLMs

Fine-tuning

: ๋ผ๋ฒจ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธ

โ†’ ๋ผ๋ฒจ๋ง ๋น„์šฉ์ด ํฌ๊ณ , ํ˜„์‹ค์—์„œ ๊ณ„์†ํ•ด์„œ ๋ผ๋ฒจ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌํ•˜๊ธฐ ํž˜๋“ฆ

RAG (Retrieval-Augmented Generation)

: retrieves relevant information from an external knowledge base and incorporates it into the response

→ depends on retrieval quality + incurs retrieval cost

TTT (Test-Time Training)

: retrieves similar data from the training set or a knowledge base and fine-tunes the model on it

→ requires access to training data + the retrieval step is slow

TTA (Test-Time Adaptation)

: adapts the model using unlabeled test data

→ mostly uses entropy minimization (sharpening the output probability distribution toward a single answer)

→ LLMs are autoregressive; ignoring that structure and only minimizing entropy is far less effective


Why existing TTA cannot be applied to LLMs

  • existing TTA mostly adapts by updating BatchNorm statistics (mean/var)
  • but LLMs have no BatchNorm; they use LayerNorm instead,

    and LayerNorm has nothing to update at test time → the existing recipe does not apply

So what test-time signal is available in LLMs?

→ use the input perplexity

→ to design a self-supervised objective that allows backprop


Why Doesn't Entropy Minimization Work Well for LLMs?

What is entropy?

entropy = uncertainty

[0.5, 0.5] → high entropy

[0.99, 0.01] → low entropy

What about autoregressive LLMs?

Predict tokens one by one

Each prediction depends on previous tokens

Errors accumulate over time

Problems

  • Ignores token dependencies
  • Optimizes locally, not globally
  • Early mistakes → later tokens collapse

[Figure]


4.1 Perplexity Minimization for Test-Time Learning

How perplexity fixes the problems of the entropy-based objective

perplexity

: A metric that measures how confidently a language model predicts a given sequence

์–ธ์–ด ๋ชจ๋ธ์ด ์ฃผ์–ด์ง„ ์‹œํ€€์Šค๋ฅผ ์–ผ๋งˆ๋‚˜ โ€œ์ž์‹  ์žˆ๊ฒŒโ€ ์˜ˆ์ธกํ–ˆ๋Š”๊ฐ€๋ฅผ ์ธก์ •ํ•˜๋Š” ์ง€ํ‘œ

  • the larger the log probability → the better the prediction → the lower the perplexity
  • the lower and more uncertain the predicted probabilities → the higher the perplexity
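This definition can be sketched in a few lines (a minimal illustration, not the paper's code; the token probabilities are made up):

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence: exp of the mean negative log
    probability p(x_t | x_<t) the model assigns to each token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# confident predictions -> low perplexity, uncertain ones -> high
print(perplexity([0.9, 0.8, 0.95]))  # ~1.13
print(perplexity([0.2, 0.1, 0.3]))   # ~5.50
```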

TTA์—์„œ์˜ ๋ฌธ์ œ์ ์„ ํ•ด๊ฒฐ

Entropy builds a separate probability distribution for each token [p₁, p₂, ..., p_T] and tries to sharpen a single answer at each position rather than model the relations between tokens. → token dependencies are ignored

→ Perplexity instead takes the log loss of the joint probability of the entire sequence (summed over all token positions), so inter-token dependencies are fully reflected

→ the loss is based on "how well was the whole sentence predicted?"

→ the model is updated from a global perspective
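To make the contrast concrete, here is a hypothetical illustration (mine, not the paper's): the entropy objective averages each position's distribution entropy in isolation, while the perplexity objective scores the joint probability of the realized sequence, so a model can look "certain" at every position while the sequence as a whole remains improbable.

```python
import math

def mean_token_entropy(dists):
    """Entropy-style objective: average the entropy of each position's
    output distribution, with every position treated independently."""
    ent = lambda p: -sum(q * math.log(q) for q in p if q > 0)
    return sum(ent(p) for p in dists) / len(dists)

def sequence_log_loss(chosen_probs):
    """Perplexity-style objective: negative log of the joint probability
    p(x_1) * p(x_2 | x_1) * ... of the whole sequence."""
    return -sum(math.log(p) for p in chosen_probs)

# Sharp (low-entropy) distributions at every position...
dists = [[0.99, 0.01], [0.99, 0.01]]
print(mean_token_entropy(dists))        # low: each position looks "certain"
# ...while the tokens that were actually realized stay very unlikely.
print(sequence_log_loss([0.01, 0.01]))  # high: the joint sequence is improbable
```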

Problem

At test time there is no ground truth → output perplexity cannot be used

x = input / y = output

  • To raise LLM performance, reducing the perplexity of P(y | x) is obviously the right move
  • But at test time y is unknown, so that expression cannot be used directly.

→ Finding

Reducing the perplexity of P(x) instead of P(y | x) is also effective

"The trend of LLM's perplexity to the input P(x; Θ) and perplexity to the output P(y|x; Θ) is the same."

[Figure]

Looking at the left graph,

when input and output perplexity are measured,

they show a strong correlation.

→ reducing the input perplexity improves the output as well

So the objective is changed from the form that uses y to one that does not.
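As a toy sketch of the resulting objective (my illustration with a unigram softmax model, not the paper's implementation): one gradient step on the negative log-likelihood of the unlabeled input lowers the input's perplexity without ever needing y.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def input_nll(logits, input_ids):
    """Mean negative log-likelihood of the input tokens (log-perplexity)."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in input_ids) / len(input_ids)

def ttl_step(logits, input_ids, lr=0.5):
    """One hypothetical test-time update: gradient descent on the NLL of
    the *input* tokens. Grad of mean NLL w.r.t. logits is p - onehot,
    averaged over the input tokens; no label y is involved."""
    probs = softmax(logits)
    n = len(input_ids)
    grad = [probs[v] - sum(1 for t in input_ids if t == v) / n
            for v in range(len(logits))]
    return [w - lr * g for w, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0, 0.0]
x = [1, 1, 2]                     # unlabeled test input
before = input_nll(logits, x)
after = input_nll(ttl_step(logits, x), x)
assert after < before             # input perplexity went down
```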


4.2 Sample Efficient Learning Strategy

TTL์—์„œ ๋ชจ๋“  ํ…Œ์ŠคํŠธ ์ƒ˜ํ”Œ์„ ๋‹ค ์‚ฌ์šฉํ•ด์„œ ์—…๋ฐ์ดํŠธํ•˜๋ฉด:

  • ๊ณ„์‚ฐ๋Ÿ‰ ๋‚ญ๋น„
  • ํšจ๊ณผ ์—†๋Š” ์ƒ˜ํ”Œ์— ๋ชจ๋ธ์ด ์˜คํžˆ๋ ค ํ”๋“ค๋ฆด ์ˆ˜ ์žˆ์Œ

[Figure]

→ Finding

  • training on high-perplexity samples yields higher ROUGE scores
  • using only low-perplexity samples actually lowers performance

Low-perplexity inputs are ones the model already predicts well, so they carry almost no new information.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

evaluates natural-language generation by how well it matches a reference sentence

ROUGE-L: scores the common subsequence via the Longest Common Subsequence (LCS)
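For reference, ROUGE-L's LCS-based F-measure can be sketched as follows (a minimal illustration; real evaluations use a library such as rouge-score):

```python
def rouge_l_f(candidate, reference):
    """ROUGE-L: F-measure over the longest common subsequence (LCS)
    of candidate and reference token lists."""
    a, b = candidate, reference
    # classic O(len(a) * len(b)) LCS dynamic program
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a):
        for j, tb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ta == tb else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(a), lcs / len(b)
    return 2 * prec * rec / (prec + rec)

print(rouge_l_f("the cat sat".split(), "the cat sat down".split()))  # ~0.857
```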

  • → so use only the informative samples
  • assign each sample a perplexity-based score S(x)
  • use only samples with high S(x) for backpropagation

Low-perplexity samples are excluded, and high-perplexity samples are given greater weight in training.

The symbol in the formula that looks like the numeral 2 denotes the indicator function,

i.e., a discontinuous function that is 1 when its condition is satisfied and 0 otherwise.
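The gating can be sketched like this (the threshold P₀ = e³ is from the paper's settings; the weight formula below is my hypothetical stand-in for the paper's S(x)):

```python
import math

P0 = math.exp(3)  # sample-selection threshold from the paper's settings

def sel_weight(ppl, p0=P0):
    """Indicator-gated sample weight: 0 below the threshold, otherwise a
    (hypothetical) weight that grows with perplexity."""
    if ppl <= p0:           # indicator 1[PPL(x) > P0] evaluates to 0
        return 0.0
    return 1.0 - p0 / ppl   # stand-in for the paper's scoring S(x)

print(sel_weight(math.exp(2)))  # low-perplexity sample: skipped entirely
print(sel_weight(math.exp(5)))  # informative sample: contributes to the update
```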

Effects

  • fewer unnecessary sample updates → less compute
  • updates come only from the more informative samples → better performance


4.3 Modulating Parameters for Test-Time Learning

[Figure]

LoRA (Low-Rank Adaptation) adds small low-rank auxiliary matrices A and B to selected linear layers and updates only those

→ updating only the LoRA parameters preserves the original capabilities far better than a full-parameter update

→ task performance is retained even during domain adaptation → forgetting is suppressed

Because LoRA is used, the update takes the form shown above.
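The shape of that update can be sketched as follows (a minimal pure-Python illustration, not the paper's code): the frozen weight W is left alone and only the rank-r path B·A is trained; with B initialized to zero the adapted layer starts out identical to the original.

```python
def lora_forward(x, W, A, B, alpha=16, r=2):
    """LoRA-adapted linear layer: y = W x + (alpha / r) * B (A x).
    W is frozen; only the low-rank factors A (r x d_in) and
    B (d_out x r) receive gradient updates at test time."""
    def matvec(M, v):
        return [sum(m * u for m, u in zip(row, v)) for row in M]
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # trainable rank-r path
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

# With B = 0 the adapted layer reproduces the frozen layer exactly.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5], [0.1, 0.2]]
B0 = [[0.0, 0.0], [0.0, 0.0]]
print(lora_forward([1.0, 2.0], W, A, B0))  # [1.0, 2.0]
```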


5.1. Experimental Settings

Benchmark: AdaptEval

  • a new evaluation benchmark constructed for this paper
  • covers diverse domains and task types to evaluate an LLM's adaptability from multiple angles

Bench | Purpose | Datasets
DomainBench | domain knowledge adaptation | Geography, Agriculture, Medicine, Finance
InstructionBench | instruction following | Alpaca-GPT4, Dolly, InstructionWild
ReasoningBench | logical reasoning | GSM8K, MetaMath, Logiqa

Evaluation Metrics

  • DomainBench, InstructionBench → ROUGE-Lsum (R-Lsum)
  • ReasoningBench → Exact Match (EM)

A representative metric suited to each task's characteristics is used.

LLM ๋ชจ๋ธ

  • Llama3.2-3B-Instruct
  • Llama3-8B-Instruct
  • Llama2-13B-Chat
  • Qwen2.5-7B-Instruct

Baselines

  • Tent (entropy-minimization-based TTA)
  • EATA (TTA based on low-entropy sample selection)
  • COME (a conservative entropy-minimization method)

→ all recent TTA methods that use only unlabeled test data

→ all re-implemented under the same offline setting for a fair comparison

Implementation details

  • Optimizer: Adam
  • Learning rate:
    • DomainBench: 5e-5
    • InstructionBench: 5e-5
    • ReasoningBench: 1e-6
  • Batch size: 1
  • Decoding: Greedy, temperature = 0
  • λ = 0.1, P₀ = e³ (sample-selection threshold)


5.2 Comparison Experiments

[Figure]

TLM์€ ๋ชจ๋“  task category์—์„œ baseline ๋Œ€๋น„ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑ

  • DomainBench: ์ „๋ฌธ ์šฉ์–ด, ๋„๋ฉ”์ธ-specific ํ‘œํ˜„๋“ค (์˜ˆ: ์˜๋ฃŒ, ๊ธˆ์œต ์šฉ์–ด)
  • InstructionBench: ์ง€์‹œ๋ฌธ์˜ ํ‘œํ˜„ ๋ฐฉ์‹, ๋งํˆฌ, ์š”์ฒญ ์Šคํƒ€์ผ ์ ์‘

โ†’ ์ด๋Ÿฐ ๊ณผ์ œ๋“ค์€ ๋ชจ๋ธ์ด ์ƒˆ๋กœ์šด ์šฉ์–ด, ๋ฌธ์žฅ ํŒจํ„ด์—๋งŒ ์ ์‘ํ•˜๋ฉด ์„ฑ๋Šฅ์ด ์˜ฌ๋ผ๊ฐ

โ†’ ๊ทธ๋ฆฌ๊ณ  TLM์€ ์ž…๋ ฅ perplexity ์ตœ์†Œํ™” โ†’ ๋ฌธ์žฅ ํ‘œํ˜„์— ๋Œ€ํ•œ ์ดํ•ด ๊ฐ•ํ™”

โ†’ ์ฆ‰, perplexity ๊ธฐ๋ฐ˜ self-supervised ์ ์‘์ด ์ง์ ‘์ ์œผ๋กœ ํšจ๊ณผ์ ์ž„

[Figure]

In this table the values are Exact Match (EM).

→ how often the answer is exactly right

ReasoningBench improves too, but logical structure is the core there, so chain-of-thought reasoning matters

→ improving the model at test time from the input alone has only a limited effect

→ in particular, reasoning ability has to be learned deeply during pretraining and fine-tuning


5.3 Ablation Studies

Version | Description
Original LLM | base model with no TTL applied
Ours (w/o SEL) | input-perplexity minimization without sample selection
Ours | full TLM = SEL + LoRA + perplexity minimization

[Figure]

Input Perplexity Minimization

→ the main driver of the performance gains

→ 30~80% improvement even without SEL

Sample Efficient Learning (SEL)

→ the additional gain is small,

→ but performance is maintained while compute is reduced

Threshold P0 (perplexity margin)

[Figure]

→ various thresholds P₀ ∈ {e², e³, ..., e⁶} were tested

→ P₀ = e³ is the most stable and gives the best performance

[Figure]


5.4 More Discussions

[Figure]

Online Test-Time Experiments

Does TLM still work when test samples arrive one at a time?

  • early on, many samples have high perplexity, so the model updates frequently
  • as the model adapts, low-perplexity samples become the majority → learning stops automatically

Experiments on Quantized LLM

TLM์€ quantized ๋ชจ๋ธ์—์„œ๋„ ์„ฑ๋Šฅ ํ–ฅ์ƒ ์œ ์ง€

โ†’ ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ ํ™˜๊ฒฝ์—์„œ๋„ ์‹ค์šฉ์ 


Limitation

The paper does not discuss limitations explicitly, but a few can be drawn out:

1. No Backprop-Free Variant (limits real inference environments)

TLM requires backprop → it cannot be applied in most real LLM deployment settings (APIs, closed weights)

Future Work: Backprop-free TTL, e.g. prompt-based or derivative-free adaptation


2. Limited Effect on Reasoning Tasks

On reasoning benchmarks such as GSM8K and MetaMath, the performance gains are small

→ perplexity minimization is strong at adapting to surface expression but weak at logical reasoning

Future Work: TTL for logic and chain-of-thought reasoning


3. Domain-Specific Overfitting / Forgetting Risk

Even with LoRA, repeated long-term adaptation to a specific domain may degrade the model's original abilities (logic, general knowledge)

Future Work: Continual TTL with forgetting mitigation


4. Hyperparameter Sensitivity (e.g., P₀ threshold)

Sensitive to the sample-selection threshold (P₀ = e³) and to the value of λ

Needs tuning per domain/model → can hinder practical deployment

Future Work: Auto-tuning or adaptive sampling strategies


5. Session-Aware / Multi-Turn TTL Not Supported

TTL currently operates on single inputs only; it does not apply to settings where context accumulates, as in conversational systems

Future Work: Session-level TTL for conversational agents


Q&A

Q. Is this the first test-time learning for LLMs?

๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋…ผ๋ฌธ๋“ค์€ ์žˆ์—ˆ์Œ.

https://arxiv.org/abs/2410.08020

LLM์„ ํ…Œ์ŠคํŠธ ์‹œ์ ์— promptโ€‘specific fineโ€‘tuning ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ–ˆ๊ณ ,

์‹คํ—˜์„ ํ†ตํ•ด testโ€‘time์—๋„ LLM์„ ์—…๋ฐ์ดํŠธํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์คŒ

๋‹ค๋งŒ, ์†๋„๊ฐ€ ๋А๋ฆฌ๊ณ  ๊ณ„์‚ฐ ๋น„์šฉ์ด ํฌ๋‹ค๋Š” ๋‹จ์ 

โ†’ ๊ฐ€๋Šฅ์€ ํ–ˆ์ง€๋งŒ ์‹ค์šฉ์—์„œ ๋ฉ€์—ˆ์Œ

Prompt tuning ๋นผ๋ฉด TLM์ด LLM์—์„œ Test-Time ํ•™์Šต์„ ์ตœ์ดˆ๋กœ ์‹ค์šฉํ™”ํ•œ ๋…ผ๋ฌธ

Table 5 above might make it look as if Tent and EATA had already achieved this, but they actually produce performance that destroys the base LLM.


Q. Does the model learn before producing its output?

We said training on P(y|x) is impossible; why, given that an output is produced anyway?

Why not use y?

The ŷ the model generates is not the answer → there is no ground truth

Reducing the perplexity of P(ŷ | x) can end up making the model more confident in wrong outputs

Example

  • suppose the correct answer is "Apples are red"
  • but the model outputs "Apples are bananas"
  • reducing the loss against that output only makes the model more confident in the wrong answer

Order of operations

  1. input x arrives
  2. the model generates output ŷ with its current parameters (θ + Δθ)
  3. the perplexity of x is computed
  4. if P(x) is above the threshold → the LoRA parameters are updated via backprop
  5. → this update takes effect from the next input onward
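The steps above can be sketched as a loop (every function here is a hypothetical stand-in for the real model calls, not the paper's code):

```python
import math

P0 = math.exp(3)  # sample-selection threshold from the paper's settings

def generate(x, params):
    """Stand-in for LLM decoding with the current parameters."""
    return f"answer({x})"

def input_perplexity(x, params):
    """Stand-in: in the real method this is PPL of x under the model."""
    return params.get("ppl", {}).get(x, 100.0)

def lora_update(params, x):
    """Stand-in for one backprop step on the input NLL (LoRA only)."""
    params.setdefault("updates", []).append(x)
    return params

def ttl_stream(inputs, params):
    """Answer each input immediately; update only when the input's
    perplexity exceeds the threshold, so the update helps later inputs."""
    outputs = []
    for x in inputs:
        outputs.append(generate(x, params))   # steps 1-2: respond first
        if input_perplexity(x, params) > P0:  # steps 3-4: gated update
            params = lora_update(params, x)   # step 5: affects next inputs
    return outputs, params

params = {"ppl": {"a": 50.0, "b": 5.0}}  # hypothetical per-input perplexities
outputs, params = ttl_stream(["a", "b"], params)
print(outputs, params["updates"])  # only "a" triggered an update
```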

Summary

Although the true target y is unavailable at test time, we show that minimizing P(x) leads to update directions that are often aligned with those from minimizing P(y|x).

At test time there is no label y, so the model's performance on that sample cannot be evaluated directly

→ instead, the parameters are updated based on that sample to improve predictions on future inputs

→ TLM is close to online self-supervised continual learning