
TTRL: Test-Time Reinforcement Learning

โ†Paper Review

ArXiv: https://arxiv.org/abs/2504.16084
GitHub Code: https://github.com/PRIME-RL/TTRL
Authors: Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou
Affiliation: Tsinghua University, Shanghai AI Lab
💡

Key Differentiator

(1) Generate multiple answers (Test-Time Scaling)

(2) Use majority voting to automatically produce the judgment "this answer is good, that one is bad"

(3) Convert that judgment into a reward and run RL

The effect: the model learns autonomously, iteratively, and label-free at test time, and produces better results

Give it only the problems with no answer sheet, and it gets smarter by repeatedly solving them on its own!

🤷

Why I chose this paper

  • I have read many test-time papers, but reinforcement learning at test time was new to me, so I was curious.
  • It is an arXiv-only, recently submitted paper, yet it already has 700 GitHub stars, which also made me curious.

Since it seemed best to first explain existing Test-Time Scaling and Reinforcement Learning,

I read Section 5 (Related Works) first.

5 Related Works

5.1 Test-Time Scaling

= A method where the LLM uses more compute at test (inference) time to raise performance

→ That is, the trained model itself stays fixed; only the inference procedure is scaled up at test time

① Parallel Generation

Generate multiple outputs for a single input and select the "good" one

  • Self-Consistency (Wang et al., 2022)
    • Generate multiple CoT answers and pick the most frequent one (majority voting)
  • Best-of-N (Stiennon et al., 2020; Nakano et al., 2021)
    • Pick the best answer using a reward or score function
  • Reward-guided Search (Deng & Raffel, 2023; Khanov et al., 2024)
    • Apply an external reward function to the sampled results

→ The common thread: generate several answers in parallel, then "select" one or "aggregate" them (a minimal sketch follows below)
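To make the parallel pattern concrete, here is a minimal self-consistency sketch; `sample_answer` is a hypothetical helper that returns one extracted final answer per stochastic decode of the LLM:

```python
from collections import Counter

def self_consistency(prompt, sample_answer, n=16):
    # Parallel TTS: sample n answers, then aggregate by majority vote
    answers = [sample_answer(prompt) for _ in range(n)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n   # chosen answer and its vote share
```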

② Sequential Generation

Extend or revise a single answer progressively while reasoning

  • Chain-of-Thought (CoT) prompting (Wei et al., 2022)
    • Explicitly elicits intermediate reasoning steps
  • Reflective reasoning (Madaan et al., 2023)
    • The model reviews and revises its own answer

→ Increases reasoning depth or induces self-correction

Limitation: most TTS is prompt-based, and the model parameters themselves are never updated

| Existing TTS | TTRL |
| --- | --- |
| Used at inference time only | Inference + parameter updates (test-time training) |
| Majority voting only | Majority voting converted into a reward for RL |
| Non-parametric | Includes parametric updates |

5.2 RL for Reasoning

Human-preference-based

  1. Humans or annotators rank their preferences over multiple answers
  2. A preference model is trained → used as the reward
  3. The policy (the LLM) is updated with PPO or similar
  • Strength: well suited to open-ended natural-language instructions
  • Limitation: requires human labels, and is only an option for domains that cannot be scored numerically

Rule-based rewards

In reasoning domains (e.g., math), the correct answer can be determined unambiguously

→ So a rule-based reward is possible: reward = 1 if correct, 0 if wrong (see the sketch below)
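In code, this rule is a one-liner; a minimal sketch (real graders first normalize answers, e.g., with a symbolic checker, before comparing):

```python
def rule_based_reward(prediction: str, ground_truth: str) -> int:
    # Verifiable domains (e.g., math): exact-match the final answers
    return 1 if prediction.strip() == ground_truth.strip() else 0
```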

GRPO (Group Relative Policy Optimization):

Used in DeepSeek-R1; encourages long CoT generation for math problems

PPO is also used, but the stability of the numeric reward and gradient variance become issues

| | RLHF | GRPO / Rule-based RL | TTRL |
| --- | --- | --- | --- |
| Supervision source | Human preference | Rule-based labels (ground truth exists) | Majority voting (pseudo-label) |
| Labels required? | Required | Required | Not required (label-free) |
| When training happens | Offline RL | Offline RL | Test time (online RL) |
| Task | Open-domain instruction | Math, logic, program | Math, logic, program |


2. Test-Time Reinforcement Learning (TTRL)

We study the problem of training a pre-trained model during test time using RL without ground-truth labels. We call this setting Test-Time Reinforcement Learning.

2.1 Methodology

Blog Image

The green background is the TTS part; training then happens at test time via reward calculation over the generated outputs

M: the number of answers generated per question (q)

N = batch_size (the number of questions used in one training step)

For the first question, produce M answers, vote, and compute rewards; the collected result is R(y1, y)

์ƒํƒœ(state)์™€ ํ–‰๋™(action)

  • ์ฃผ์–ด์ง„ ๋ฌธ์ œ(prompt) x๋ฅผ ์ƒํƒœ(state)๋กœ ๋ณด๊ณ ,
  • LLM์€ ๊ทธ์— ๋Œ€ํ•œ ๋‹ต๋ณ€ y๋ฅผ policy ๏ปฟ๋กœ๋ถ€ํ„ฐ ์ƒ์„ฑ (sampling)

โ†’ LLM์˜ ๋‹ต๋ณ€ ํ–‰์œ„ = RL์˜ action

Rollout: generate multiple answers

Blog Image

A reward signal must be produced without ground-truth labels, so multiple candidate outputs are generated

→ For x, sample answers {y1,...,yM}

Stochastic sampling generates M = 64 diverse answers (16 in the appendix)

→ Voting is only meaningful when diverse reasoning paths are collected, not a single answer

  • Are the multiple answers randomized automatically, or does a setting need to be changed?

    LLMs do already produce randomly varying outputs for the same input.

    Parameters are not deliberately perturbed; instead, randomized decoding is configured

    → controlled via temperature, top-p, top-k, number of samples, etc.

    Blog Image
    • temperature = 0.6
    • top-p = 0.95

    → encourages the sampled answers to differ

    Temperature: Setting the temperature to 1.0, as opposed to 0.6, increases the model's output entropy. This promotes more extensive exploration and allows the model to make better use of its prior knowledge for self-improvement, which is particularly important when addressing challenging benchmarks.

    Later experiments do include a comparison across these parameters,

    and reveal the limitation that they must be tuned to the dataset's difficulty.

    (Figure 11 : Inappropriate RL Hyperparameters)

    For hard tasks, temperature must be set to 1.0 for good results.

    → higher temperature → more diversity → more exploration → higher entropy → varied answers (a decoding sketch follows below)
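To make this concrete: sampling M diverse rollouts with the paper's decoding settings might look like the following, shown here with the Hugging Face transformers generate API purely as an illustration (the model id is an example; the paper's own rollout stack differs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative rollout sketch with the paper's decoding settings.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-1.5B")

inputs = tok("Solve: ...", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,          # randomized decoding; no parameters are touched
    temperature=0.6,         # raise to 1.0 on harder benchmarks (Sec. 4.3)
    top_p=0.95,
    num_return_sequences=8,  # M rollouts per prompt (M = 64 in the paper)
    max_new_tokens=512,
)
candidates = tok.batch_decode(out, skip_special_tokens=True)
```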

Answer extraction + majority voting (label estimation)

Blog Image
  • From each y_i, extract only the final answer → a number or a choice (a sketch of this extractor follows below)
  • Majority voting sets the most frequent answer as the pseudo-label y*
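The extraction step is not spelled out in the post; a plausible minimal sketch, assuming answers are prompted into \boxed{...} with a numeric fallback (real parsers are more careful):

```python
import re

def extract_answer(output: str):
    # Prefer the last \boxed{...} in the CoT response
    boxed = re.findall(r"\\boxed\{([^}]*)\}", output)
    if boxed:
        return boxed[-1].strip()
    # Fallback: the last number mentioned in the text
    nums = re.findall(r"-?\d+(?:\.\d+)?", output)
    return nums[-1] if nums else None
```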

Reward calculation

Blog Image

If a sampled answer y matches the majority answer → reward = 1

Otherwise → reward = 0

Blog Image

→ The true answer is unknown, but agreement with the voting result supplies the training signal

We sample 64 responses per prompt using the current model and randomly select 32 to use for training.

๋žœ๋คํ•˜๊ฒŒ 32๊ฐœ๋ฅผ ํŠธ๋ ˆ์ด๋‹์— ์‚ฌ์šฉ๊ทธ ์ค‘ 32๊ฐœ๋งŒ ๊ณจ๋ผ์„œ reward ๊ณ„์‚ฐ์— ์‚ฌ์šฉ

โ†’ ํˆฌํ‘œ๋Š” 64๊ฐœ๋กœ ํ•˜๋Š”๋ฐ, ๋‚˜์ค‘์— RL์€ ๋žœ๋ค์œผ๋กœ ๋ฐ˜๋งŒ ์‚ฌ์šฉํ•จ. (๋„ˆ๋ฌด ๊ณ„์‚ฐ ๊ณผ๋„ํ•˜๋‹ˆ๊นŒ)

💡

The process above is repeated over the batch of N questions; the model is not updated after every single question.

Each RL step samples a batch of questions and computes policy gradients using the pseudo-rewards from majority voting.

→ Each step uses multiple questions (a batch of questions)

→ This is exactly what "batch size" means here

It was set differently per dataset:

AIME = 80, AMC = 30, MATH-500 = 10

AIME is hard, so running more episodes and updates lowers the probability of repeatedly updating on wrong information.

Policy update (RL)

Blog Image

After iterating over the whole batch, the accumulated rewards are used.

Goal: maximize the expected reward → trust that the majority is right!

Move toward the answers that earned high reward (gradient ascent)

by updating θ (the model parameters)
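In standard policy-gradient form (a sketch consistent with the description above, with y* the majority-voted pseudo-label):

```latex
\max_{\theta}\; \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\!\left[ r(y, y^{*}) \right],
\qquad
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\!\left[ r(y, y^{*})\, \nabla_{\theta} \log \pi_{\theta}(y \mid x) \right]
```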

💡

The LLM generates multiple answers to a prompt, takes agreement with the majority-voted pseudo-label as the reward, and updates its policy with reinforcement learning.

2.2 Majority Voting Reward Function

Blog Image
```python
from collections import Counter

def majority_voting_reward_fn(outputs):
    # 1. Extract the final answer from each sampled output
    #    (extract_answer as sketched in Section 2.1)
    answers = [extract_answer(output) for output in outputs]
    # 2. Estimate the pseudo-label by majority vote
    counts = Counter(answers)
    majority_answer, _ = counts.most_common(1)[0]
    # 3. Compute rewards: 1 if an answer matches the majority, else 0
    rewards = [1 if ans == majority_answer else 0 for ans in answers]
    return rewards
```
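Putting Sections 2.1 and 2.2 together, one TTRL training step might look like the sketch below; `policy.sample` and `policy.update` are hypothetical stand-ins for the rollout engine and the GRPO gradient step:

```python
import random

def ttrl_step(policy, questions, M=64, train_k=32):
    # One TTRL step over a batch of N questions (sketch only)
    batch = []
    for q in questions:
        # Rollout: sample M candidate answers per question
        outputs = [policy.sample(q) for _ in range(M)]
        # Vote over all M outputs, convert agreement into 0/1 rewards (Sec. 2.2)
        rewards = majority_voting_reward_fn(outputs)
        # Keep a random half of the rollouts for the update (64 -> 32)
        picked = random.sample(list(zip(outputs, rewards)), train_k)
        batch.extend((q, o, r) for o, r in picked)
    # Single policy-gradient (e.g., GRPO) update on the pseudo-rewards
    policy.update(batch)
```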


3 Experiments

3.1 Experimental Setup

| Component | Setting | Rationale |
| --- | --- | --- |
| Models | Qwen, LLaMA, Mistral, DeepSeek, etc., at various scales | Both pretrained and post-trained models are used → verifies that TTRL also works after typical SFT |
| Tasks | AIME 2024, AMC, MATH-500, GPQA | Chosen mainly for clear, gradable answers |
| Sampling | 64 generated, 32 used for training | Reliable label estimation + compute efficiency |
| Decoding | temp = 0.6, top-p = 0.95 | |
| RL algorithm | GRPO, AdamW, cosine schedule, learning rate 5 × 10⁻⁷ | Empirically validated for stability and sample efficiency |
| Max length | 3072 (standard), 32768 (LRMs) | Designed to handle long, reasoning-heavy CoT answers |
| Episodes | AIME = 80, AMC = 30, MATH-500 = 10 | Adjusted to dataset difficulty and size |
  • Dataset descriptions

    AIME 2024 - American Invitational Mathematics Examination

    • A US math competition for top high-school students (answers are 3-digit integers)

    AMC - American Mathematics Competitions

    • Multiple-choice math competition problems, easier than AIME (five choices, A-E)

    MATH-500 - 500 problems drawn from an open-source math problem set

    • Expressions and integers → checked directly by a program (symbolic checker)

    GPQA - Graduate-Level Google-Proof Q&A

    • Only the Diamond-difficulty subset is used (multiple choice)

3.2 Main Results

Table 1 : Performs well on most tasks

Blog Image

A 1.5B model like Qwen2.5-Math-1.5B even reaches 73.0.
→ This breaks the belief that RL was too hard for small models

* Here, Qwen3-8B is in non-thinking mode; the thinking-mode results are in Figure 3

Table 2 : Performs well on most models

Blog Image

Tested on a variety of models, including LLaMA-Instruct, DeepSeek-R1, and Mistral

Figure 3 : TTRL performs well on LRMs

Blog Image

TTRL also performs well on Large Reasoning Models

Even models already trained specifically for reasoning improve further

Figure 4 : TTRL generalizes well beyond the target task

Blog Image

After training on a specific benchmark, performance on other tasks rises along with it.

Figure 5 : TTRL is compatible with different RL algorithms

Blog Image

Comparison of GRPO, PPO, and PRIME

GRPO (rule-based), PPO (value-based), PRIME (process-level reward)

→ TTRL is compatible with other RL algorithms.

  • PPO : Proximal Policy Optimization

    Improves the policy a little at a time without changing it too much

    Uses a value function V(s) to predict how good the current state is,

    and uses that estimate to decide how much to change the policy

  • PRIME : Process Reinforcement through Implicit Rewards

    Builds rewards from log-prob ratios computed at the token level

    That said, the reward source here was most likely still majority-voting-based

  • GRPO : Group Relative Policy Optimization (see the advantage formula after this list)

    Rewards each response by its relative correctness among multiple responses to the same question

    Applicable to N samples, multi-valued rewards, diversity sampling, online settings, and more

  • DPO : Direct Preference Optimization (not used)

    Directly learns a human "preference" between two responses

    The reward is limited to simple 0/1 pairwise comparisons

    It is preference-based and offline by construction

    → Not applicable in this paper.

Figure 6 : Achieves sustainable self-evolution through "online" and "RL"

Blog Image

  • What are pass@1, avg@16, and maj@16? (see the metrics sketch after this list)

    pass@1

    → The probability that a single sample from the current TTRL model is correct at inference time

    → Real user-facing performance

    avg@16

    • For each problem, the fraction of its 16 generated answers that are correct

    → e.g., if the correct answer appears 10 times, that problem scores 10/16

    • Then averaged over all problems

    → Since it requires comparison against ground truth, it is really a post-hoc evaluation metric

    maj@16

    • Scores 1 if the majority-voted pseudo-label is correct
    • Averaging this over all problems gives the maj@16 accuracy
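A minimal sketch of how these metrics are computed for a single problem; since they compare against the gold answer, they are evaluation-time quantities, not training signals:

```python
from collections import Counter

def pass_metrics(samples, truth):
    # samples: k extracted final answers for one problem; truth: gold answer
    k = len(samples)
    avg_at_k = sum(s == truth for s in samples) / k          # avg@k
    majority = Counter(samples).most_common(1)[0][0]
    maj_at_k = 1.0 if majority == truth else 0.0             # maj@k
    return avg_at_k, maj_at_k

# e.g., 10 of 16 correct -> avg@16 = 0.625; if the majority answer equals
# the gold answer, maj@16 = 1.0 for this problem
```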

TTRL does not merely converge to the initial pseudo-labels; the pseudo-labels themselves keep getting better

  • By construction, TTRL trains itself on rewards built from its own predictions (y_hat)
  • As performance improves, better predictions → better pseudo-labels → better training signals, forming a self-reinforcing loop

In this paper, RL rewards are computed against the maj@16-style majority label (with GRPO),

and post-training evaluation reports both avg@16 and maj@16.


4 Analysis and Discussions

4.1 Q1: How Well Can TTRL Perform?

Empirically tests how far TTRL can reach compared with the upper bounds of existing self-training approaches

Blog Image

Comparison of avg@64 and maj@64 before and after TTRL

→ On every benchmark, both avg@64 and maj@64 improve after applying TTRL

TTRL used maj@n as its training signal, yet its post-training results exceed that upper bound

Blog Image

RL (leakage): running RL directly with ground-truth labels available

In reality this RL is impossible at test time, since there are no labels → TTRL approaches its performance

  • Then how can TTRL's intermediate accuracy rise above the leakage run?

    Leakage RL gives a binary reward per individual sample

    • +1 if correct, 0 if wrong

    → a very sparse, very high-variance reward

    → so early in training, the policy struggles to make use of it

    TTRL, by contrast, uses a soft, average-based reward

    • For example, if 18 of the 32 samples agree, the reward signal is 18/32 = 0.5625
    • This means lower gradient variance and more stable learning (see the variance bound after this list)

    Over time, however, the ground-truth reward is more accurate, so leakage RL converges to a higher performance ceiling
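One way to see the variance argument, under the simplifying assumption of i.i.d. 0/1 rewards with success rate p: averaging over M rollouts cuts the variance of the reward signal by a factor of M relative to a single binary reward:

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{i=1}^{M} r_{i}\right)
  = \frac{p(1-p)}{M}
  \;\ll\; p(1-p) = \operatorname{Var}(r_{i})
```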

4.2 Q2: Why Does TTRL Work?

1. Label Estimation

Blog Image

Label accuracy and reward accuracy are not the same thing!

  • Label accuracy is low (the majority-voted pseudo-label is often wrong)
  • But reward accuracy stays high
💡

"Lucky hit" phenomenon: among the sampled answers, the correct one can be hit by chance, and this still yields the right reward

Even when the label is wrong, the rewards often happen to be correct, so the reward signal stays largely valid,

and because RL is inherently robust to such noise, training can proceed stably despite inaccurate labels.

2. Reward Calculations

Because the reward is comparison-based, it can be "luckily" correct

  • If ŷ matches the label → positive reward
  • If ŷ differs from the label → negative reward

But this label may not be the true answer

Even then, if ŷ also differs from the wrong label, the negative signal "this is wrong" is still correct

→ So even when the label is wrong, the reward is often right by chance

Blog Image
| Reward judged against | Predictions | Rewards |
| --- | --- | --- |
| True label (3) | 1 1 2 2 2 4 5 6 | 0 0 0 0 0 0 0 0 → all marked wrong |
| Estimated label (2) | 1 1 2 2 2 4 5 6 | 0 0 1 1 1 0 0 0 → 3 marked correct |

→ 5 of the 8 rewards are correct (verified in the sketch below)
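The table's numbers can be checked directly. This small sketch reproduces the 5-of-8 agreement (62.5%), the same "hit ratio" that reappears in the Q&A section below:

```python
predictions = [1, 1, 2, 2, 2, 4, 5, 6]          # rollouts from the table

def rewards_against(label, preds):
    # 0/1 reward per prediction, judged against the given label
    return [1 if p == label else 0 for p in preds]

r_true = rewards_against(3, predictions)        # true label -> all zeros
r_est  = rewards_against(2, predictions)        # pseudo-label -> 0 0 1 1 1 0 0 0

agree = sum(a == b for a, b in zip(r_true, r_est)) / len(predictions)
print(agree)                                    # 0.625 -> 5 of 8 rewards correct
```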

Rollout-based robustness

Because M answers are sampled for each question:

  • If even one output matches the label → a positive reward is assigned
  • Even if none match → the negative rewards are still computed correctly

๋ชจ๋ธ์ด ๋ชปํ• ์ˆ˜๋ก reward accuracy๋Š” ์˜คํžˆ๋ ค ์˜ฌ๋ผ๊ฐ„๋‹ค?

AIME 2024์—์„œ

  • label accuracy: 37%
  • reward accuracy: 92%

๋ชจ๋ธ์ด ๋‹ค์–‘ํ•œ ์˜ค๋‹ต์„ ๋‚ด๊ธฐ ๋•Œ๋ฌธ์— (e.g., ๊ฐ€์žฅ ๋งŽ์ด ๋‚˜์˜จ ๋‹ต์ด 16.6%์— ๋ถˆ๊ณผ)

๊ฐ๊ฐ์˜ output์ด ๋‹ค ๋‹ค๋ฅธ ํ‹€๋ฆฐ ๋‹ต์ด๋ฏ€๋กœ โ†’ label๊ณผ ์ผ์น˜ํ•˜์ง€ ์•Š์Œ

๊ทธ ์ž์ฒด๋กœ negative reward๊ฐ€ ์ œ๋Œ€๋กœ ์ „๋‹ฌ๋จ (๋น„๊ต ๊ฒฐ๊ณผ ๋‹ค๋ฅด๋‹ˆ๊นŒ)

3. Online Learning

์˜จ๋ผ์ธ RL ์ ‘๊ทผ ๋ฐฉ์‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์„ค๊ณ„๋˜๋‹ˆ๊นŒ ๋ชจ๋ธ์€ applicationํ•˜๋ฉด์„œ ๊ธฐ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ํˆฌํ‘œ๋ฅผ ํ†ตํ•ด ์ƒ์„ฑ ๋œ๋ณด๋‹ค ์ •ํ™•ํ•œ ๋ ˆ์ด๋ธ”๋กœ ์ด์–ด์ง

โ†’ supervision ์‹ ํ˜ธ์˜ ํ’ˆ์งˆ์ด ํ–ฅ์ƒ๋˜์–ด ์ง€์† ๊ฐ€๋Šฅํ•œ ์ž๊ธฐ ์ง„ํ™”๊ฐ€ ๊ฐ€๋Šฅ (Figure 6 ๋‚ด์šฉ)

4.3 Q3: When Might TTRL Fail?

Figure 11 : Inappropriate RL Hyperparameters

Blog Image

Because TTRL is unsupervised and its reward estimation is noisy,

it is far more sensitive to hyperparameters than standard RL.

  • In the failure cases especially, entropy never comes down (→ exploration fails)
  • Experimentally, two factors are key:

(1) Temperature

  • Raising T to 1.0 yields more entropy (more diverse answers)
  • Exploration increases and prior knowledge is used better
  • On challenging benchmarks (e.g., AIME), exploration matters a great deal

(2) Episodes

  • Small, hard datasets (e.g., AIME 2024) need a large number of episodes
  • Without sufficient exploration, training cannot converge

→ TTRL's hyperparameters must be tuned carefully to the problems' difficulty, distribution, and scale

Table 3: Lack of Prior Knowledge on Target Task

Blog Image

Because TTRL trains on nothing but the test set,

it can fail completely if the model has no prior knowledge of that domain

  • There is no mechanism like curriculum learning (starting from easy problems)
  • The model must adapt to hard problems directly, without relevant pretrained knowledge

  • As difficulty rises, the performance gains shrink
  • The rate at which response length decreases also drops

→ Evidence that on harder problems, learning struggles because the backbone lacks prior knowledge


7 Limitations and Future Works

Limitations

  1. TTRL is only an initial exploration, and
  2. quantitative analysis is still lacking, even though the following two factors strongly affect training:
    • the model's level of prior knowledge
    • the hyperparameter settings (temperature, number of episodes, etc.)

Future Works (as stated in the paper)

  1. Theoretical analysis
    • Analyze how closely TTRL can converge to the two upper bounds defined in Section 4.1 (maj@n / RL leakage)
    • Establish convergence theory and optimality conditions
  2. Online learning on streaming data
    • Current TTRL assumes a static test set
    • The plan is to extend it to adapt to data streams arriving in real time

      → an extension toward true Test-Time Adaptation (TTA)

  3. Large-scale self-supervised RL
    • Apply TTRL to large datasets + large LLMs
    • Develop it into a strong self-supervised RL system that needs no human labeling
  4. Agentic tasks and scientific reasoning
    • Extend TTRL beyond simple QA or math benchmarks to:
      • agentic tasks requiring long-horizon planning
      • scientific problem solving requiring multi-step logic
    • Explore TTRL's applicability to open-ended domains

Limitations & Future Works (my own thoughts)

  1. Hyperparameter sensitivity

    RL training is highly sensitive to hyperparameters.

    → Automatic hyperparameter tuning

  2. Too much resource

    The experiments require 8 × A100 80GB GPUs

    → Parameter-efficient training via LoRA

  3. Only simple QA

    Experiments are focused on math & multiple-choice tasks

    → Extend to complex, multi-step reasoning tasks


Q&A

Q&A from the paper presentation that I could not answer properly at the time

Q1) Is this the first paper to do RL at test time?

Strictly speaking, it is not the first paper with a test-time + RL structure.

But couldn't GRPO itself be used at test time, and couldn't rewards be built without labels?

  • GRPO is the RL algorithm (the optimizer)
  • TTRL is the full framework

https://arxiv.org/abs/2402.03300

DeepSeekMath, the paper that introduced GRPO

→ Using GRPO at test time without labels is indeed possible

What TTRL had to define to actually run GRPO at test time:

  • how the test-time input stream is handled
  • the sampling strategy (generate M → voting)
  • the pseudo-label → reward conversion function
  • reward scaling that makes GRPO work
  • batch-level updates → continual self-evolution

→ TTRL is the first to design the full working loop: environment + input + reward + iterative training

https://arxiv.org/abs/2505.18514

This recent paper by Prof. Taesik Gong, my lab's advisor, also fits the definition of "Test-Time RL"

→ So calling TTRL the first-ever Test-Time RL is incorrect.

The difference:

BiTTA: does not need the ground-truth class itself, but requires real-time binary feedback from a human,

i.e., binary feedback from an oracle who knows the answer.

TTRL: genuinely label-free and oracle-free

TTRL can be applied to any model and any dataset (as long as the answers are clear-cut),

but under the influence of reward noise, the model prior, dataset difficulty, and so on,

it is very sensitive to hyperparameters (batch size, temperature, number of episodes, etc.)

→ hyperparameter tuning is required

Q2) Even with lucky hits, isn't the model ultimately trained on wrong answers?

I had misunderstood this completely. The behavior I had in mind only holds when rewards are given as 1 and -1, as in BiTTA.

https://arxiv.org/abs/2505.18514

When a prediction is wrong, BiTTA gives an explicit negative reward of -1, so the gradient lowers the probability of the wrong direction

Blog Image

Example - when the model is unsure

[1, 1, 2, 2, 2, 4, 5, 6] → the majority is 2 (3 votes)

→ Only 3/8, so the signal is weak, but it is true that "2 is the answer" gets reinforced.

In TTRL, the pseudo-label picked by majority voting while the model is unsure = 2, so

0 0 1 1 1 0 0 0

is the signal used for reinforcement learning

If the true label (3) were given instead,

0 0 0 0 0 0 0 0

would be the signal used for reinforcement learning

So although this was done without labels on a question the model barely knows,

the reward signal agrees with the true-label signal 62.5% of the time (5 of 8 positions)! → the hit ratio

Because there is no true label, the authors do not give a -1 penalty in the reward as other papers do;

instead, the reward appears to be set to 1 on a match and 0 even on a mismatch.

Q3) What is the action in this RL?

state: the given question (prompt) x

action: the LLM's act of producing an answer

→ The LLM samples the answer y from its policy π_θ