
CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning

โ†Paper Review

ArXiv: https://arxiv.org/abs/2508.07871
Authors: Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang
Affiliations: ¹Brown University ²University of Bristol ³MIT-IBM Watson AI Lab ⁴University of Southern California ⁵Rutgers University
💡

Key Differentiator

  • Observes that query-cross attention grows as the decoder layers progress, especially after the shallow layers (6–10)
  • Uses the layer-over-layer growth in the attention that the query's last token gives to image tokens as a relevance signal, selecting tokens that become newly important during query-guided reasoning
  • Combines query-based attention with representation similarity to re-evaluate the importance of image context tokens and reconstruct a sequence better suited to In-Context Learning
🤷

Why I chose this paper

  • Motivation for token-level optimization
    • GOLD is built on a simple coarse-to-fine algorithm, so I wanted token-level optimization.
  • Interest in sequential GUI settings
    • GOLD targets a single-task setting, so I wanted to extend it toward sequential GUI tasks.
    • I want ideas for efficient computation under multi-image inputs.

Abstract

In Large Vision-Language Models (LVLMs), image tokens are sparse, so tokens that contribute nothing to reasoning make up the majority. → substantial cost

Hence image token pruning is used.

  • However, existing work focuses on single-image tasks
  • and does not consider the multimodal in-context learning setting.
  • Applying existing pruning methods as-is causes accuracy drops.

This paper, CATP, is specialized for the multimodal ICL setting:

  • designed to reflect the contextual relations across image–text–image
  • uses a 2-stage progressive pruning structure

→ As a result, it removes 77.8% of image tokens while improving average performance by 0.6% and cutting inference latency by 10.78%.

Related Work

Large Vision-language Models (LVLMs)

  • As LLMs advanced, they were extended into LVLMs that process images and text simultaneously
  • Architecture: vision encoder, projector, and LLM decoder
  • Inefficiencies
    • Image tokens have markedly lower information density than text tokens
    • Severe image-token redundancy occurs

In-context Learning (ICL)

  • ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธ ์—†์ด ์˜ˆ์‹œ ๋ช‡ ๊ฐœ(ICDs)๋งŒ์œผ๋กœ ์ฆ‰๊ฐ์ ์ธ ํƒœ์Šคํฌ ์ ์‘
  • Multi-model๋กœ ํ™•์žฅ๋˜์–ด ์ด๋ฏธ์ง€ + ํ…์ŠคํŠธ ํฌํ•จํ•˜๋Š” ICL์ด ์ค‘์š”ํ•ด์ง
    • ๊ทผ๋ฐ, ์ด๋ฏธ์ง€ ํ† ํฐ๋“ค์€ ํ…์ŠคํŠธ 3ํ† ํฐ์— ๋น„ํ•ด sparseํ•จ. (์ค‘์š”ํ•œ ๋ถ€๋ถ„์˜ ๋ฐ€๋„๊ฐ€ ๋‚ฎ์Œ)
    • ๊ฐ ์˜ˆ์‹œ๋งˆ๋‹ค ์ด๋ฏธ์ง€ ์žˆ๊ณ , ์ฟผ๋ฆฌ์—๋„ ์ด๋ฏธ์ง€ ํฌํ•จ๋˜์–ด์„œ ์ž…๋ ฅ ๊ธธ์ด ๊ธธ์–ด์ง
    • ICL ์žฅ์ ์ด ๊ฐ€๋ณ๊ณ  ๋น ๋ฅด๋‹ค๋Š” ๊ฒƒ์ธ๋ฐ, ์ด๋ฏธ์ง€ ํ† ํฐ ์ค‘๋ณต๋•Œ๋ฌธ์— ์˜คํžˆ๋ ค ์žฅ์ ์ด ์•ฝํ™”๋œ๋‹ค!
    • LLaVA-Next ์—ฐ์‚ฐ๋Ÿ‰ ์ฆ๊ฐ€ ์˜ˆ์‹œ

      ์ด๋ฏธ์ง€ 1์žฅ โ†’ 576 tokens

      VizWiz ๋ฐ์ดํ„ฐ์…‹์—์„œ 2-shot ICL

      • single-image inference ๋Œ€๋น„ 3.2ร— ์—ฐ์‚ฐ๋Ÿ‰
      • text-only inference ๋Œ€๋น„ 14.3ร— ์—ฐ์‚ฐ๋Ÿ‰
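A back-of-the-envelope sketch of that blow-up (576 tokens per image, as in LLaVA-Next's base resolution; the ~60-token text budget per example is my illustrative assumption):

```python
# Rough sequence-length arithmetic for 2-shot multimodal ICL.
# 576 tokens/image follows LLaVA-Next's base grid; the 60-token
# text budget per example is an illustrative assumption.
IMG = 576   # image tokens per image
TXT = 60    # assumed text tokens per example (instruction + answer)

single_image = IMG + TXT          # 1 image + its text
two_shot_icl = 3 * (IMG + TXT)    # 2 ICDs + query sample, each image + text
text_only = 3 * TXT               # the same sequence without any images

print(two_shot_icl / single_image)   # 3.0x longer input
print(two_shot_icl / text_only)      # 10.6x longer input
```

Even before attention's quadratic cost, the input is 3× the single-image case and ~10.6× the text-only case — the same ballpark as the 3.2× and 14.3× compute figures above.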

Image Token Pruning

Training-free image token pruning methods emerged to tackle the image-token redundancy problem.

Blog Image

3-shot in-context sequence

The three ICDs are packed into one sequence up front, and the query comes last.

Attention-based Image Token Pruning (b)

  • Uses the attention weight that image tokens receive inside the LLM decoder as the importance measure
  • Leverages the point where image and text actually interact
  • Drawback: the attention-shift problem
    Blog Image

    Here the X1I (image) and X1T (text) tokens come in interleaved form.

    When the image tokens are flattened, the bottom tokens land positionally close to the corresponding text tokens.

    Transformer attention is biased toward nearby positions (positional bias).

Diversity-based Image Token Pruning (c)

  • After the vision encoder + projector, before input to the decoder (where interaction with text happens), removes redundant tokens based on feature similarity among an image's tokens
  • Processes each image independently → in Figure (c), 64 tokens are kept uniformly per image
  • Drawback: does not capture the cross-image, image–text, and context-level interactions that multimodal ICL needs

    → causing fine-grained pruning failures in multimodal ICL

CATP (d)

Views all ICDs as a single context.

Token counts differ per image (they are contribution-based).

Captures the complex cross-modal interactions within the sequence.

Blog Image

Existing methods are fine in the single-image setting, but at 4-shot they perform similar to, or even below, Random pruning.

→ Existing pruning methods do not fit multimodal ICL.

Related Work takeaway

: Multimodal ICL needs a new criterion that identifies "tokens that contribute in the context of the whole sequence," not "important tokens within each individual image"!!

Method

Preliminary and Motivation

Multimodal In-Context Sequence

Blog Image
  • query๋„ ICD(in-context demonstration)์ฒ˜๋Ÿผ image + text ์Œ์ด๋‹ค.

Blog Image
  • How an image is converted into tokens
  • f: vision encoder, g: projector
  • S: number of tokens (varies by model and input resolution)

→ image-token redundancy arises

Blog Image

Finally, the full token sequence is laid out with image and text interleaved like this.
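A minimal sketch of that interleaved layout (token names and per-modality counts are illustrative, not the paper's notation):

```python
# Build the interleaved multimodal ICL sequence:
# [img_1 tokens][text_1][img_2 tokens][text_2]...[query img][query text]
def build_sequence(n_shots, img_tokens=4, txt_tokens=2):
    seq = []
    for i in range(1, n_shots + 2):          # n ICDs + 1 query sample
        tag = "q" if i == n_shots + 1 else str(i)
        seq += [f"I{tag}_{t}" for t in range(img_tokens)]   # image tokens
        seq += [f"T{tag}_{t}" for t in range(txt_tokens)]   # text tokens
    return seq

seq = build_sequence(n_shots=2)
print(seq[:6])   # first image's tokens followed by its text
print(len(seq))  # (2 ICDs + query) * (4 + 2) = 18 tokens
```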

Every image token interacts not only with tokens of the same image,

but simultaneously with other images, other text, and the query!!

Existing diversity-based pruning happens at this point:

before the decoder (only image token features exist — no text or query information → decisions rely on intra-image information alone).

Blog Image

์œ„์˜ ์‹œํ€€์Šค๊ฐ€ ์ด N-layer Transformer decoder์— ์ž…๋ ฅ๋˜์–ด

โ†’ ์ฆ‰ ๋ชจ๋“  image token์€ ๋ชจ๋“  text token๊ณผ ์—ฐ๊ฒฐ๋จ

๊ธฐ์กด Attention-based pruning์€ ์ด ํƒ€์ด๋ฐ์— ์ผ์–ด๋‚จ.

decoder ๋‚ด๋ถ€ ํŠน์ • layer์— ์ ์šฉ

โ†’ attention์€ layer๋งˆ๋‹ค ์˜๋ฏธ๊ฐ€ ๋‹ฌ๋ผ์ง€๊ณ , interleaved ๊ตฌ์กฐ์—์„œ๋Š” attention shift ๋ฐœ์ƒ

โ†’ attention๊ฐ’์ด ์‹ค์ œ ๊ธฐ์—ฌ๋„๊ฐ€ ์•„๋‹ˆ๋‹ค.

Blog Image

Figure 3 (a)

diversity-based pruning

Judges token importance only within each image → sees no information from other images, the text, or the query.

Figure 3 (b), (c)

The results change completely depending on which attention you use and at which layer.

  • FastV (existing approach)
    • Sum of the attention an image token receives from all tokens
  • Intra-cross
    • Attention an image token receives from the text tokens of its own image–text pair
    • "Wouldn't this be a better fit in the early layers, where image–text alignment matters?"
  • Query-cross
    • Attention an image token receives from the query sample's tokens
    • "In ICL the query is what ultimately matters, so wouldn't the tokens the query attends to be the important ones?"

→ Static single-layer attention cannot stably reflect token importance in multimodal ICL.

  • Attention shift accumulates, so high attention ≠ important token.

→ Amid the tangled image–text–query interactions of multimodal ICL,

how can we identify the image tokens that actually contribute to reasoning over the whole sequence?

Contextually Adaptive Token Pruning

Overview

  • In multimodal ICL, multiple images and texts form a single reasoning context

→ an image token's importance depends on the context of the whole sequence, not on intra-image structure or a single attention value

  • Stage 1: Context-aware Coarse Pruning
  • Stage 2: Query-guided Fine-grained Pruning
  • With Stage 1 alone
    • only coarse pruning is possible
    • fine-grained pruning fails
  • With Stage 2 alone
    • the decoder's burden is excessive
    • attention noise is severe

"Context-aware filtering before the decoder + query-guided refinement inside the decoder."

Stage 1: Context-aware Coarse Pruning

  • Runs after the vision encoder + projector, right before the LLM decoder input (like diversity-based pruning)
  • Goal: remove, before they enter the decoder, image tokens that barely interact with the context of the multimodal ICL sequence
  • Existing diversity-based pruning
    • uses only feature similarity among image tokens
    • focuses solely on removing intra-image redundancy
    • ignores: the image's own text, the other ICD images, the query image, the whole ICL sequence
    • indiscriminately removes tokens that are important in the multimodal setting but context-dependent

Diversity term F_div(Y_i)

Uses the existing diversity-based pruning formulation.

Blog Image
  • Measures how well the selected image-token set Y_i represents (covers) the original image-token space X_i^I
  • Submodular: the more tokens already selected, the smaller the marginal gain from adding one more

Alignment function F_align(Y_i)

image–text alignment score

Blog Image
  • Measures how semantically close each image token is to the text summary v̄_i
    • v̄_i: average pooling of the hidden states of the text tokens attached to that image
    Blog Image
  • Modular: each element's score is entirely independent of the others

Final objective

: the set of image tokens maximizing the sum of the two terms

Blog Image
  • Greedy selection
    • F_div is submodular and F_align is modular, so the summed objective is still submodular → greedy selection works
    • Adding, at each step, the single token that raises F_div + F_align the most still yields a good result
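A toy sketch of the Stage-1 greedy selection, under stated assumptions: F_div is modeled as a facility-location coverage score (submodular) over cosine similarities, F_align as cosine similarity to the mean-pooled text summary (modular), and λ₁ = 0.7 weights the diversity term, following the hyperparameter section's reading. The paper's exact score definitions may differ.

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def greedy_select(img_tok, txt_tok, k, lam=0.7):
    """Greedily pick k image tokens maximizing lam*F_div + (1-lam)*F_align.
    F_div: facility-location coverage (submodular);
    F_align: cosine similarity to the mean-pooled text summary (modular)."""
    v_bar = txt_tok.mean(axis=0)                        # text summary vector
    align = np.array([cos(t, v_bar) for t in img_tok])  # modular term per token
    sim = np.array([[cos(a, b) for b in img_tok] for a in img_tok])
    chosen, cover = [], np.zeros(len(img_tok))
    for _ in range(k):
        best, best_gain = -1, -np.inf
        for j in range(len(img_tok)):
            if j in chosen:
                continue
            # marginal coverage gain shrinks as the set grows (submodularity)
            gain = lam * (np.maximum(cover, sim[:, j]).sum() - cover.sum()) \
                   + (1 - lam) * align[j]
            if gain > best_gain:
                best, best_gain = j, gain
        chosen.append(best)
        cover = np.maximum(cover, sim[:, best])
    return sorted(chosen)

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 8))   # 16 image tokens, dim 8 (toy scale)
txt = rng.normal(size=(4, 8))    # 4 text tokens of the paired text
print(greedy_select(img, txt, k=4))
```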

Stage 2: Query-guided Fine-grained Pruning

  • Only two shallow layers inside the LLM decoder
    • Layer K → prunes ICD image tokens (context pruning)
    • Layer K+1 → prunes query image tokens (query pruning)

Attention growth

Blog Image
  • A^k = the attention matrix of the k-th layer
  • Measures whether the query came to rely on a token more when moving from layer K−1 to K

Context token importance score

Blog Image
  • ΔA(c): has the query newly started attending to this token?
  • sim(h_c^K, v_q^K): does the token's meaning match the query's meaning?
  • Remove tokens in ascending order of S_context until the pruning ratio R is reached
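A sketch of the Stage-2 context scoring under my own shape assumptions (the paper's exact aggregation over query positions and heads is not reproduced): attention growth from layer K−1 to K, plus cosine similarity between each context token's hidden state and a query summary vector, with the lowest-scoring tokens pruned to ratio R.

```python
import numpy as np

def context_scores(A_prev, A_curr, h_ctx, v_q):
    """S_context for each context image token (toy version).
    A_prev, A_curr: [n_query, n_ctx] attention from query positions to
    context tokens at layers K-1 and K; h_ctx: [n_ctx, d] hidden states
    at layer K; v_q: [d] query summary vector at layer K."""
    growth = (A_curr - A_prev).sum(axis=0)              # attention growth per token
    h_n = h_ctx / np.linalg.norm(h_ctx, axis=1, keepdims=True)
    sim = h_n @ (v_q / np.linalg.norm(v_q))             # cosine to the query summary
    return growth + sim                                 # unweighted combination

rng = np.random.default_rng(1)
n_q, n_c, d = 3, 10, 8
A_prev, A_curr = rng.random((n_q, n_c)), rng.random((n_q, n_c))
S = context_scores(A_prev, A_curr, rng.normal(size=(n_c, d)), rng.normal(size=d))

R = 0.5                                    # pruning ratio: drop the lowest half
keep = np.argsort(S)[int(R * n_c):]        # indices of context tokens kept
print(len(keep))
```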

Query token importance score

Blog Image
  • Does the query token's meaning match that of the ICD (context) tokens kept above?
  • Remove tokens in ascending order of this score until the pruning ratio R is reached

Why look at the query in Stage 2?

Blog Image
  • Query-cross attention can be seen rising sharply as the layers progress

    → meaning interaction with the query grows partway through the decoder's layers.

Overview

Blog Image

Stage 1: first, coarsely keep only the roughly important image pieces

  • F_align: how related a token is to the text tokens, via alignment with the text summary
  • F_div: how well it represents (covers) the other tokens
  • Keep the token set maximizing the sum of the two

Entering the decoder (Layers 0 → K−1)

  • The model begins reasoning centered on the query

Stage 2: precisely keep only the pieces genuinely useful for the query

  • Attention difference (Layer K−1 → K)
    • "Whatever has large attention right now must be important" → that's the FastV approach, and this paper does NOT do that!!!!
    • Instead: which tokens suddenly became important while processing this query?
  • Layer K: clean up ICD (context) tokens
    • Existing pruning cuts each image at the same ratio; here the number of tokens cut differs across ICDs
    • Differentiated pruning across ICDs → the point that really matters for multimodal ICL
  • Layer K+1: clean up query image tokens
    • Conversely, query-image tokens that don't fit the context are pruned against the remaining ICD (context) tokens
  • Subsequent layers
    • compute with the heavily reduced token set

Experiments

Setup

  • Mainly LLaVA-Next-7B; LLaVA-1.5 and Qwen2.5-VL also used
  • Pruning is applied at inference time only
  • Benchmarks centered on multimodal question answering
    • VQAv2, GQA, VizWiz, TextVQA, OK-VQA, MMBench

Main Results

Blog Image
  • ๊ฐ Baseline ์„ค๋ช…
    • FastV โ†’ decoder attention ํฌ๊ธฐ๋ฅผ ๊ธฐ์ค€์œผ๋กœ image token์„ ์ œ๊ฑฐํ•˜๋Š” attention-based pruning
    • DivPrune โ†’ image token embedding์˜ ๋‹ค์–‘์„ฑ(coverage)์„ ๊ธฐ์ค€์œผ๋กœ ์ œ๊ฑฐํ•˜๋Š” diversity-based pruning
    • FitPrune โ†’ image token๊ณผ ํ…์ŠคํŠธ ๊ฐ„ feature ์œ ์‚ฌ๋„๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ œ๊ฑฐํ•˜๋Š” feature-alignment ๊ธฐ๋ฐ˜ pruning
    • VTW โ†’ image token ์ค‘์š”๋„๋ฅผ ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜๋กœ ์กฐ์ ˆํ•˜๋Š” token weighting ๊ธฐ๋ฐ˜ soft pruning
    • HiRED โ†’ ๊ณ„์ธต์  relevance ํŒ๋‹จ์œผ๋กœ token์„ ์„ ํƒํ•˜๋Š” hierarchical routing ๊ธฐ๋ฐ˜ pruning
    • SparseVLM โ†’ ๋ชจ๋ธ ๊ตฌ์กฐ ์ž์ฒด์— sparsity๋ฅผ ๋„์ž…ํ•˜๋Š” architecture-level sparse VLM
    • PLPHP โ†’ ํ•™์Šต๋œ ์ •์ฑ…(policy)์œผ๋กœ token์„ ์ œ๊ฑฐํ•˜๋Š” policy-learning ๊ธฐ๋ฐ˜ pruning
      • ์• ์ดˆ์— VLM / LVLM์„ ์ „์ œ๋กœ ์„ค๊ณ„๋œ pruning โ†’ ๋‹ค๋ฅธ ์• ๋“ค๋ณด๋‹ค ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ™˜๊ฒฝ์— ์กฐ๊ธˆ ๋” ์นœํ™”์ 
      • ํ•˜์ง€๋งŒ, query๊ฐ€ policy ์ž…๋ ฅ ์ค‘ ํ•˜๋‚˜์ผ ๋ฟ pruning ๊ธฐ์ค€์ ์ด ์•„๋‹˜.
    • CATP โ†’ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ICL์—์„œ query๋ฅผ ๊ธฐ์ค€์œผ๋กœ context ์ „์ฒด๋ฅผ ์ ์‘์ ์œผ๋กœ ์ค„์ด๋Š” context-aware, query-guided pruning

๊ธฐ์กด ๋ฐฉ๋ฒ•๋“ค์€ Randomํ•˜๊ฒŒ pruningํ•˜๋Š” ๊ฒฐ๊ณผ์™€ ๋น„์Šทํ•˜๊ฑฐ๋‚˜ ์˜คํžˆ๋ ค ๋” ๋‚ฎ์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ด๊ธฐ๋„ ํ•จ.

66.7%์™€ 77.8% ๋ชจ๋‘ ๋ฐ”๋‹๋ผ ๋ชจ๋ธ์— ๋น„ํ•ด ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ, 89.9%์—์„œ ์‚ด์ง ๋–จ์–ด์ง.

Efficiency Analysis

Blog Image
GPT์˜ ํ•ด์„

Wen et al. (2025a)์— ๋”ฐ๋ฅด๋ฉด FLOPs์™€ KV Cache๋Š”

token pruning์˜ ์‹ค์ œ ์‹คํ–‰ ๋น„์šฉ์„ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•˜๋ฉฐ,

pruning ์—ฐ์‚ฐยท๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผยท๋ณ‘๋ ฌํ™” ํšจ์œจ๊นŒ์ง€ ํฌํ•จํ•˜๋Š” latency๋งŒ์ด

token pruning ํšจ์œจ์˜ ๊ฐ€์žฅ ์‹ ๋ขฐ ๊ฐ€๋Šฅํ•œ ์ง€ํ‘œ์ด๊ณ ,

์ด ๊ธฐ์ค€์—์„œ CATP๋Š” ๊ธฐ์กด ๋ฐฉ๋ฒ•๋“ค๊ณผ ์งˆ์ ์œผ๋กœ ๋‹ค๋ฅธ ํšจ์œจ์„ฑ์„ ๋ณด์ธ๋‹ค

  • pruning์„ ์ •ํ•ด์ง„ layer(K, K+1) ์—์„œ๋งŒ ์ˆ˜ํ–‰
  • ํ† ํฐ์„ ์‹ค์ œ๋กœ ์™„์ „ํžˆ ์ œ๊ฑฐ

    โ†’ ์ดํ›„ layer๋“ค์€ ๋” ์งง์€ ์‹œํ€€์Šค๋ฅผ ๊ทธ๋Œ€๋กœ denseํ•˜๊ฒŒ ์ฒ˜๋ฆฌ

    โ†’ GPU ์นœํ™”์ ์ธ ์—ฐ์‚ฐ์œผ๋กœ FLOPs

  • ํ† ํฐ์„ earlyํ•˜๊ฒŒ ์ œ๊ฑฐ

    โ†’ ์ดํ›„ layer์—์„œ KV cache ํฌ๊ธฐ ์ž์ฒด๊ฐ€ ์ž‘์•„์ง

Impact of each stage

Blog Image

Only by first removing unnecessary image tokens before the decoder

can Stage 2's attention- and relevance-based judgments work without distortion.

This shows that Stage 1 is also essential.

Impact of hyperparameters

Blog Image
  • K: the layer where progressive adaptation starts (=6; 10 for heavier models)
    • marks when query–context interaction kicks in earnest
  • Stage 1's λ₁ (=0.7)
    • If λ₁ is too small, the pruning criterion is driven almost entirely by alignment:

      only tokens strongly tied to specific text survive, losing the image's spatial and visual diversity.

Conclusion

  • Training-free image token pruning specialized for multimodal ICL
  • The 2-stage design selects only the image tokens that matter to the ICL process, judged against the entire input in-context sequence
  • Improves both performance and efficiency
  • Multimodal ICL efficiency itself was an unexplored gap, and this work improves it

→ providing insight for LVLM development!

Limitation

  • No proof that the attention difference guarantees semantic necessity
    • What if important information simply has high attention from the start?
    • What if the attention change is caused by positional bias or layer normalization?
  • Assumes decoder-centric LVLMs; ill-suited to encoder–decoder separated, early-fusion, or cross-attention-centric architectures.

Future Work

Importance Persistence / Temporal Contribution Modeling

  • Rather than using a single attention difference at a single layer as the signal, weight or protect tokens such as:
    • tokens referenced continuously from start to finish
    • tokens repeatedly selected across multiple query steps
  • Extensions: multi-step reasoning, long-horizon ICL, agent-style inference

Extending to GUI Grounding

  • Instead of a difference across layers, let's extend this to an attention difference before and after an action!

  • The existing GUI grounding papers (both of them) are not vision-based; after an action (click, type, scroll) they build next-action candidates around changes in the HTML/DOM tree.

  • Papers that keep only large state diffs already abound (robotics / embodied VLA).

  • Adding a reasoning diff on top would additionally filter for what the query has made newly important.
  • An idea I thought of
    • At the same decoder layer K (found, as in this paper, as the layer where the query's influence grows),
    • take the attention with the pre-action input (state_t),
    • keep only the tokens whose attention difference against the post-action input (state_t+1) is large,
    • and proceed from layer K+1 with the pruned set.
  • GUIs in particular often show far larger state changes than other sequential settings, so adding this on top of state diffs should enable substantial token pruning.


Q&A

Q&A I could not answer properly during the paper presentation

Why token duplication (= redundancy) is a problem in multimodal ICL

As shown in Figure 1(a), every ICD and the subsequent query sample include an image, so the image token redundancy that is already a bottleneck in single-image tasks becomes even more acute.

The duplication problem already exists even for a single image:

image tokens form nearly identical embeddings in feature space, so adjacent regions (especially backgrounds) cluster together → near-duplicate image tokens.

In an ICL structure that interleaves multiple images in particular,

several structurally similar images, all passed through the same encoder and projector, are inserted repeatedly into one prompt, so the duplicated structure accumulates across images and gets worse.

pruning ์ดํ›„ position embedding์€ ์–ด๋–ป๊ฒŒ๋˜๋Š”๊ฐ€?

masking์ด ์•„๋‹Œ ์ง„์งœ ์ œ๊ฑฐํ•˜๋Š” pruning์ด๋ฉด transformer์—์„œ position embedding์„ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋Š”๊ฐ€?

๋…ผ๋ฌธ์—์„œ๋Š” pruning ์ดํ›„ position embedding ์ฒ˜๋ฆฌ์— ๋Œ€ํ•ด ๋ช…์‹œ์ ์œผ๋กœ ์„ค๋ช…์ด ์—†์Œ.

https://arxiv.org/abs/2210.09461

https://arxiv.org/abs/2106.02034

๊ธฐ์กด ๋…ผ๋ฌธ๋“ค๋„ position embedding์„ ๋‹ค์‹œ ๋ถ€์—ฌํ•˜๊ฑฐ๋‚˜ ์œ ์ง€ํ•˜๋Š” ํŠน๋ณ„ํ•œ ๋ณด์ •์€ ๋”ฑํžˆ ์•ˆํ•œ๋‹ค.

๋‚จ์€ token์„ ์—ฐ์† ์‹œํ€€์Šค๋กœ ์žฌ๋ฐฐ์—ดํ•˜๊ณ  position embedding์„ ๋‹ค์‹œ ๋ถ€์—ฌ

โ†’ ๊ธฐ์กด pruning ์—ฐ๊ตฌ๋“ค์€ ์ ˆ๋Œ€ ์œ„์น˜ ๋ณด์กด๋ณด๋‹ค๋Š”, pruning ์ดํ›„์—๋„ reasoning์— ํ•„์š”ํ•œ ์ƒ๋Œ€์  ๊ด€๊ณ„๊ฐ€ ์œ ์ง€๋˜๋Š”์ง€๋ฅผ ๋” ์ค‘์š”ํ•˜๊ฒŒ ๊ฐ€์ •
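A minimal sketch of that convention (my illustration, not behavior specified by CATP): gather the surviving tokens into a contiguous sequence and hand them fresh position ids — relative order, and hence relative relationships, are preserved.

```python
import numpy as np

def prune_and_repack(hidden, keep_idx):
    """hidden: [seq, d]; keep_idx: sorted indices of surviving tokens.
    Returns the packed hidden states and new contiguous position ids."""
    packed = hidden[keep_idx]              # physically drop pruned tokens
    new_pos = np.arange(len(packed))       # fresh contiguous position ids
    return packed, new_pos

h = np.random.randn(10, 4)                 # 10 tokens, hidden dim 4
kept, pos = prune_and_repack(h, np.array([0, 2, 3, 7, 9]))
print(kept.shape, pos.tolist())   # (5, 4) [0, 1, 2, 3, 4]
```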

⭐️⭐️⭐️ Why must the comparison with the query happen in Stage 2?

query-cross shows a sharp rise in the shallow layers, roughly layers 7 to 10, indicating that after perception, the LVLM shifts to query-guided reasoning

Blog Image
  • What to look at in this graph: Query-cross
    • a pruning signal based on the attention that the query sample's tokens give to image tokens
    • "In ICL the query is what ultimately matters, so wouldn't the tokens the query attends to be the important ones?"
  • Query-cross-based pruning tends to improve in relative performance in the shallow decoder layers (7–10)

    → the paper reads this as the signal of a transition to query-guided reasoning after perception:

    "Interaction with the query grows partway through the decoder's layers!"

Conclusion

  • Early decoder layers
    • image–text perception
    • centered on local alignment
  • After a certain shallow layer
    • the query selectively consults the context
    • transition to query-guided reasoning

→ There is a point at which the query actually starts to 'discriminate' context-token importance.

Before the decoder, you can only find visual tokens that look similar to the query;

it is hard to measure which tokens actually contribute to answering the query!

That is this paper's key idea.

What decisively separates it from existing pruning is not "it simply compares against the query,"

but that it pinpoints when query-guided reasoning occurs and places the pruning there.

What if the diversity function simply ran before the projector, as a "Stage 0"?

Pruning before the projector would also save the projection compute, but it carries risk.

DivPrune, the existing diversity-based pruning, itself already operates in this slot:

Vision Encoder → Projector → DivPrune → Decoder

DivPrune frames image token pruning after the projector as a Max-Min diversity problem, aiming to choose a subset of tokens that maximizes diversity among the selected tokens.

projector ์ดํ›„์˜ token๋“ค์ด ์ด๋ฏธ decoder๊ฐ€ ์‹ค์ œ๋กœ ์‚ฌ์šฉํ•˜๋Š” embedding space์— ๋†“์ด๋ฏ€๋กœ cosine similarity ๊ธฐ๋ฐ˜ diversity objective๋ฅผ ์ •์˜ํ•˜๋Š” ๊ฒƒ์ด ์ž์—ฐ์Šค๋Ÿฌ์›€!

โ†’ reasoning์— ๊ธฐ์—ฌํ•˜๋Š” token์„ ๋‚จ๊ธฐ๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ด๊ธฐ ๋•Œ๋ฌธ์—,

decoder embedding space ๊ธฐ์ค€์œผ๋กœ diversity๋ฅผ ์ธก์ •ํ•˜๋„๋ก ์„ค๊ณ„๋จ

์ˆ˜์‹์€ ๊ฐ™์ง€๋งŒ, projection์œผ๋กœ ๋น„์„ ํ˜• ๋ณ€ํ™˜์ด ๋œ๋‹ค๋ฉด projection ์ „ํ›„๋กœ ๊ณต๊ฐ„์—์„œ ์ตœ๊ทผ์ ‘ ๋Œ€ํ‘œ ํ† ํฐ์ด ๋‹ฌ๋ผ์ ธ์„œ ์„ ํƒ๋˜๋Š” token ์ง‘ํ•ฉ์ด ๋‹ฌ๋ผ์ง
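A tiny numeric illustration of that last point, with ReLU standing in for the projector's nonlinearity and hand-picked vectors: the cosine-nearest neighbor of a reference token flips across the nonlinearity, so pre- and post-projection diversity selection would keep different tokens.

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

relu = lambda v: np.maximum(v, 0.0)   # stand-in for the projector's nonlinearity

x = np.array([-2.0, 1.0, 1.0])        # reference token
y = np.array([-2.0, 1.0, 0.0])        # similar to x mainly via a negative coord
z = np.array([0.0, 1.2, 1.0])         # similar to x via positive coords

before = [cos(x, y), cos(x, z)]
after = [cos(relu(x), relu(y)), cos(relu(x), relu(z))]
print(int(np.argmax(before)), int(np.argmax(after)))   # nearest flips: y -> z
```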

Why pruning layer K = 6 (7B/8B) but K = 10 (13B)?

The paper does not state this, but presumably the different decoder depths led them to experiment around a proportionally similar boundary.

  • LLaVA-Next-7B
    • LLM backbone: Mistral-7B → 32 layers
  • LLaVA-Next-13B
    • LLM backbone: Vicuna-1.5-13B → 40 layers