GTA1: GUI Test-time Scaling Agent

ArXiv: https://arxiv.org/abs/2507.05791
OpenReview: https://openreview.net/forum?id=3VIPmz7iAi
Github Code: https://github.com/Yan98/GTA1
Authors: Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li
Affiliation: 1 Salesforce AI Research, 2 The Australian National University, 3 University of Hong Kong
💡

Key Differentiator

Strips away the unnecessarily complicated RL designs used for GUI agents,

and shows experimentally that test-time compute plus a well-aligned reward is enough to reach SOTA

🤷

Why I chose this paper?

  • Found it by searching for "GUI" in the list of papers accepted to ICLR 2026.

Related Work

Research trends in GUI grounding

Supervised fine-tuning (SFT) based approaches

  • Predict the center coordinates of the target UI element
  • Objective alignment problem
    • In the actual task, any point inside the element's region is correct
    • SFT penalizes any prediction that drifts from the center
  • Generalization degrades on high-resolution, complex GUIs

Reinforcement learning (RL, especially GRPO) based approaches

  • Common design pattern
    1. The model generates "thinking" (chain-of-thought)
    2. It then predicts coordinates
    3. A format reward is combined with a click reward
  • Extensions
    • Some works also add bounding-box prediction (IoU reward)
  • However, explicit "thinking" sometimes fails to improve GUI grounding, or even hurts accuracy
    • This is because GUI grounding is less a reasoning problem than a precise localization (perception-aligned) task

GUI agent architecture families

Two-stage GUI Agent

planner: a strong multimodal LLM

grounding model: handles coordinate prediction

→ The two are separate modules

  • Advantages
    • Modularity makes the system easier to interpret
    • Research can focus on improving grounding performance

Native (end-to-end) GUI Agent

: integrates perception, memory, planning, and action into a single end-to-end system

  • Maintaining long context and managing the history of past actions matters
    • Mitigated by managing the trajectory with a sliding window or text summaries.
  • Strong performance on dynamic, realistic benchmarks such as OSWorld

The question this paper raises

  • A two-stage approach is also competitive enough in dynamic environments
  • Offers a counterexample to the assumption that end-to-end is the only solution

Method

  • Problem: GUI environments are irreversible
    • It is hard to look ahead over the whole action sequence in advance
    • A single wrong action choice can snowball into accumulated errors

GTA1 keeps the two-stage GUI agent structure rather than going native,

and patches the weak point of each stage with test-time scaling and RL grounding

  1. Planner (action proposal generator)
    • Based on a multimodal language model
    • Takes the current UI state and the user instruction as input

      and generates candidate next actions (action proposals)

  2. Judge Model (action selector)
    • Selects one of the multiple action candidates generated by the planner
    • Evaluates them against the current UI state + the user's goal
  3. Grounding Model (coordinate predictor)
    • Converts the selected action into actual GUI coordinates
    • A click-based model trained with reinforcement learning

  • GTA1
    • Samples multiple action proposals at each step
    • Increases compute only at test time to make plan selection robust
  • Key points
    • Minimal changes at training time
    • Scaling happens only at inference (test time)

Blog Image

Test-time Scaling for Planning

Problem: committing to a single action at each step means an early error can lead to failure of the whole task

At every timestep

  • Generate several action proposals instead of one
  • Spend extra compute only at test time to stabilize the choice

  • Planner: samples N action proposals from the same input
  • Judge model: compares the candidates by relative preference, given the current UI state and the user's goal
  • Only the selected action is actually executed

  • The problem is framed as avoiding failure at the current step, not optimizing the whole sequence
  • Local robustness is obtained without lookahead

  • Fewer cascading failures
  • Compatible with different planners and model sizes
  • Stable execution in real GUI environments

→ In GUI planning, "stability of the choice at the current step" matters more than "the ability to predict the future accurately"!

GTA1 keeps the two-stage GUI agent structure, generates multiple action proposals at each step, and selects among them with a judge model at test time: a planning strategy that achieves stable plans without lookahead
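The per-step selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `select_action`, `make_toy_planner`, and `toy_judge` are hypothetical names, and the planner and judge are plain callables standing in for the multimodal LLM calls.

```python
from typing import Callable, List


def select_action(
    propose: Callable[[str, str], str],
    judge: Callable[[str, str, List[str]], int],
    ui_state: str,
    goal: str,
    n: int = 4,
) -> str:
    """Sample n proposals from the planner and return the one the judge prefers."""
    proposals = [propose(ui_state, goal) for _ in range(n)]
    # Drop exact duplicates (order preserved) so the judge compares distinct candidates.
    candidates = list(dict.fromkeys(proposals))
    best_index = judge(ui_state, goal, candidates)
    return candidates[best_index]


def make_toy_planner() -> Callable[[str, str], str]:
    """Hypothetical planner stand-in that cycles through fixed proposals."""
    options = ["scroll down", "click 'Cancel'", "click 'Save'"]
    state = {"i": 0}

    def propose(ui_state: str, goal: str) -> str:
        proposal = options[state["i"] % len(options)]
        state["i"] += 1
        return proposal

    return propose


def toy_judge(ui_state: str, goal: str, candidates: List[str]) -> int:
    """Hypothetical judge stand-in: prefers the candidate that mentions 'Save'."""
    scores = [1 if "Save" in c else 0 for c in candidates]
    return scores.index(max(scores))


if __name__ == "__main__":
    action = select_action(
        make_toy_planner(), toy_judge,
        ui_state="dialog with Save / Cancel", goal="save the file", n=3,
    )
    print(action)
```

Only the returned action would be executed in the environment; the other proposals are thrown away, which is where the extra test-time compute goes.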

Reinforcement Learning for GUI Grounding

Data Cleaning

On real screens, rendering delays and timing mismatches can leave an annotated bbox pointing at a visually wrong location.

If the annotated bbox does not overlap sufficiently with any of the UI elements detected by OmniParser, the sample is discarded (τ = 0.3 in the experiments)

Blog Image
b_ann: annotated bbox
b_i: a UI element detected by OmniParser
Blog Image
Figure 3: Examples from the Aria-UI dataset (Yang et al., 2024). The blue bounding box shows the annotation b_ann, while red bounding boxes are detected by OmniParser (Lu et al., 2024). The green arrow highlights misaligned annotations, which our cleaning strategy filters out.
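A rough sketch of this filter, under stated assumptions: the overlap measure is taken to be IoU and a sample is kept when the best overlap reaches τ. The exact formula sits in the image above, so both choices are my guesses. `iou` and `keep_sample` are hypothetical helper names, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keep_sample(b_ann: Box, detected: List[Box], tau: float = 0.3) -> bool:
    """Keep the sample only if some detected element overlaps b_ann by at least tau."""
    return any(iou(b_ann, b_i) >= tau for b_i in detected)


if __name__ == "__main__":
    annotation = (0.0, 0.0, 10.0, 10.0)
    print(keep_sample(annotation, [(5.0, 0.0, 15.0, 10.0)]))    # partially overlapping detection
    print(keep_sample(annotation, [(20.0, 20.0, 30.0, 30.0)]))  # no overlapping detection
```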

Training

  • Remove chain-of-thought
    • Previous RL grounding: generates reasoning / thinking / explanatory text before predicting coordinates, and a format reward also scores "how well the thinking was written"
    • Here: thinking tokens and the format reward are removed, and training uses only the coordinate outcome → reasoning is unnecessary noise!

  • Remove bounding-box prediction, center-point prediction, and distance-based rewards
    • Previous: a center-point regression loss, or bounding-box prediction + an IoU reward → forces an "answer structure" on the model
    • This paper: removes all of them → drops constraints that are not part of the task definition

  • Simplify to a single reward signal
    Blog Image
    • Success if the click coordinate falls inside the target UI element, failure if outside

  • GRPO (Group Relative Policy Optimization)
    • Previous: used to compare reasoning quality, centered on language generation
    • Here: only evaluates whether a click is better than the average of the K sampled coordinates
    Blog Image
    • Click success feeds directly into policy improvement
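To make the reward and the group comparison concrete, here is a minimal sketch. It assumes a binary click reward and the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation). The exact formulas are in the images above, so treat the normalization details, the `eps` term, and the function names as my assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def click_reward(point: Tuple[float, float], bbox: Sequence[float]) -> float:
    """1.0 if the predicted click lands inside the target bbox, else 0.0."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each sampled click relative to the mean of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    target = (100, 200, 180, 240)  # hypothetical target element bbox
    sampled_clicks = [(120, 220), (90, 220), (150, 210), (300, 50)]
    rewards = [click_reward(p, target) for p in sampled_clicks]
    print(rewards)
    print(group_relative_advantages(rewards))
```

When every sampled click in a group succeeds (or every one fails), all advantages come out as zero, so that group contributes no learning signal.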

GUI grounding is not a reasoning problem but a perception-aligned control problem!

Planning + Grounding synergy

  • Planning stage
    • Test-time scaling selects the action least likely to fail
  • Grounding stage
    • RL-based coordinate prediction maximizes the execution success rate of the selected action

  • Planning looks broadly and picks at test time,
  • while grounding is aligned simply at training time,
  • a design that enables stable agent behavior even in GUI environments where lookahead is not possible

Experiment

Category | Description
Agent structure | Two-stage GUI agent
Planner | Multimodal LLM that generates action proposals
Planning strategy | Test-time scaling; multiple action proposals sampled at every step
Judge model | Selects among the planner's action proposals by relative preference
Grounding model | RL-based coordinate prediction model; uses only click success
Dataset | Curated open-source GUI data, including Aria-UI

GTA1 backbones

  • GTA1-7B: initialized from UI-TARS-1.5-7B, then trained with GRPO
  • GTA1-32B: initialized from OpenCUA-32B, then trained with GRPO

GUI Grounding Performance

Validates the RL-based grounding design (compares pure grounding performance, with no influence from planning)

  • Accuracy is actually higher even though thinking was removed
  • The gap widens as UI resolution increases

→ In GUI grounding, what drives performance is not reasoning quality but a reward design aligned directly with the coordinate outcome!

Blog Image

End-to-End GUI Agent Performance

  • A large improvement over previous two-stage agents
  • A two-stage agent beats the current consensus that native agents are stronger
Blog Image

Ablation

  • Click reward: success reward when the predicted coordinate falls inside the target element's bbox
  • IoU reward: encourages the model to match the target element's bbox itself
  • Format reward: forces "thinking" before the prediction (a format constraint)

→ Using the click reward alone works well!

Thinking does help on ScreenSpot-V2.
This is interpreted as increased training instability rather than a systematic gain from reasoning.
However, on AndroidWorld, a dynamic environment where the trajectory/goal must be provided, the task success rate was observed to rise from 39% to 44% (reported in the text, with no table)
Blog Image

A figure showing how the success rate changes as the number of action proposals K grows under test-time scaling

  • There is a range where the success rate rises as K increases
  • Supports the claim that test-time compute buys robustness
Blog Image

Conclusion

Building intelligent GUI agents comes down to two core challenges

  1. Selecting an effective plan in a large action space
  2. Grounding accurately in complex interfaces

  • Strategy 1: test-time scaling for planning
    • Instead of committing to a single proposal at each step, sample several action proposals concurrently
    • A multimodal LLM judge picks the most suitable one
  • Strategy 2: RL-based grounding
    • Simple RL optimization that directly rewards a successful click on the target element
    • A design that bypasses the explicit "thinking" that earlier methods enforced

  • Achieves SOTA on standard GUI grounding benchmarks
  • Robust behavior is also confirmed in real GUI task execution when combined with a planner
  • The message: combining the two strategies raises "planning stability + grounding alignment" at the same time

Limitation & Future Work

The grounding model is still weak on tasks that are about "manipulation and editing" rather than visual selection

Personal opinion

This is a tradeoff: more compute in exchange for higher accuracy

The recently released MAI-UI beats it by a wide margin.

The demonstration that a two-stage GUI agent can beat native GUI agents was overturned almost immediately.

Blog Image

OpenReview summary

Strengths

  • Makes a persuasive case for "why it does not need to be complicated" for both planning and grounding
  • Clear design choices, such as removing thinking and using a click-only reward

The novelty lies in the design judgment and experimental evidence rather than in the technique itself

How the reviewers saw it

  • Test-time scaling is inherently a compute-performance tradeoff
  • Token cost and latency grow with K → a dynamic K is needed

In dynamic UIs (AndroidWorld, etc.), thinking tends to help consistently

  • The point is not that "thinking is unnecessary"
  • When, and under what conditions, it is needed is still an open question
  • The grounding model is still weak on tasks that are about "manipulation/editing" rather than visual selection

The paper is assessed less as proposing a new algorithm and more as a study that strips away the unnecessarily complicated designs in GUI agents and shows experimentally that SOTA is achievable with only test-time compute and an aligned reward


Q&A

Q&A that I could not answer properly during the paper presentation

Are images also included in the trajectory?

The paper does not state explicitly whether the GTA1 model uses images there.


Exploring the GitHub repository

The GTA1 repo is mainly about how to train the model, so it does not cover this.

Instead, I found it in the OSWorld repo, which contains the inference file.

OSWorld/mm_agents/gta1/gta1_agent.py

predict method (lines 1226-1432):

python
self.actions.append([plan_code])
self.observations.append(obs)  # obs includes the screenshot (bytes)
self.thoughts.append(thought)
self.observation_captions.append(observation_caption)

Passed to the planner (lines 1244-1298):

python
# Determine which observations to include images for (only most recent ones)
obs_start_idx = max(0, len(self.observations) - self.max_image_history_length)

# Add all thought and action history
for i in range(len(self.thoughts)):
    # For recent steps, include the actual screenshot
    if i >= obs_start_idx:
        messages.append({
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{encode_image(self.observations[i]['screenshot'])}",
                    "detail": "high"
                },
            }]
        })

  • Actual images are included only for the most recent max_image_history_length (default 5) trajectory steps
  • Images are base64-encoded and sent as image_url entries
  • Older history includes only text (the observation caption)

→ Not every image is included; by default, 5 images are.

The text history, however, goes back to the older steps as well.
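The windowing behavior can be reproduced in a small standalone sketch. This is not the OSWorld code: `build_history` is a hypothetical function, screenshots are assumed to be already base64-encoded strings, and the text format used for older observations is my own placeholder.

```python
from typing import Dict, List


def build_history(
    thoughts: List[str],
    captions: List[str],
    screenshots_b64: List[str],
    max_image_history_length: int = 5,
) -> List[Dict]:
    """Build chat messages: images for recent steps, captions only for older ones."""
    messages: List[Dict] = []
    obs_start_idx = max(0, len(screenshots_b64) - max_image_history_length)
    for i in range(len(thoughts)):
        if i >= obs_start_idx:
            # Recent step: attach the actual screenshot.
            content = [{
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{screenshots_b64[i]}",
                    "detail": "high",
                },
            }]
        else:
            # Older step: text caption only (placeholder format, an assumption).
            content = [{"type": "text", "text": f"Step {i+1} Observation:\n{captions[i]}"}]
        messages.append({"role": "user", "content": content})
        messages.append({
            "role": "assistant",
            "content": [{"type": "text", "text": f"Step {i+1} Thought:\n{thoughts[i]}"}],
        })
    return messages


if __name__ == "__main__":
    steps = 7
    history = build_history(
        [f"thought {i}" for i in range(steps)],
        [f"caption {i}" for i in range(steps)],
        [f"img{i}" for i in range(steps)],
    )
    n_images = sum(m["content"][0]["type"] == "image_url" for m in history)
    print(len(history), n_images)
```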

For reference, the later MAI-UI also appears to include at most 2 images besides the current one.

  • Only the most recent history_n - 1 images are selected
python
default_conf = {
    "history_n": 3,  # default: only the 3 most recent
    ...
}

Is chain-of-thought removed only in RL?

Yes, the removal of thinking tokens and the like is described in the RL method section.

Thinking tokens are removed from the output space of the grounding model trained with RL

Blog Image

What stays the same

  • The planner and the judge still use reasoning internally
  • Text reasoning is still used at inference time

python
thought_messages = f"Step {i+1} Thought:\n{self.thoughts[i]}"
messages.append({
    "role": "assistant",
    "content": [{
        "type": "text",
        "text": thought_messages + "\n" + action_messages
    }]
})

In GRPO (RL) training, the model simply outputs coordinates, like this

python
SYSTEM_PROMPT = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.

Output the coordinate pair exactly:
(x,y)
'''

format reward function (for the ablation; when enabled, the model optionally produces thinking as well)

  • The paper also notes that thinking can still be needed for planning in complex tasks.
python
def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<think>.*?</think>\s*<answer>\(\d+,\s*\d+\)</answer>"
    # ...
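For intuition, a self-contained version of such a format check might look like the sketch below. It is not the repository's code: the excerpt elides the function body, so the matching logic, the `<think>...</think>` and `<answer>...</answer>` tag names, and the assumption that completions arrive as plain strings are all mine. The name `format_reward_sketch` is hypothetical.

```python
import re
from typing import List

# Assumed format: a thinking span followed by an (x, y) answer.
FORMAT_PATTERN = re.compile(
    r"<think>.*?</think>\s*<answer>\(\d+,\s*\d+\)</answer>",
    re.DOTALL,  # let the thinking span cover multiple lines
)


def format_reward_sketch(completions: List[str], **kwargs) -> List[float]:
    """Return 1.0 for completions that follow the assumed format, else 0.0."""
    return [
        1.0 if FORMAT_PATTERN.fullmatch(text.strip()) else 0.0
        for text in completions
    ]


if __name__ == "__main__":
    samples = [
        "<think>The Save button is top right.</think>\n<answer>(812, 64)</answer>",
        "(812, 64)",
    ]
    print(format_reward_sketch(samples))
```

With the click-only setup the post describes, a bare `(x,y)` output is exactly what is wanted, so a reward like this would only be switched on for the thinking ablation.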

Takeaways

I thought it was a good idea that breaks an RL recipe that had been taken for granted until now.

Also, they built and released the models, and the work was SOTA at the time, which is probably why it was accepted to ICLR.

Of course, MAI-UI came out soon after with far stronger results, which is a pity, and I do not expect much further progress to build on this paper's models.

It was good to learn that simplifying a complicated method everyone took for granted can actually improve performance.

It seems that papers with real novelty come from changing what is taken for granted.