
ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition

arXiv: http://arxiv.org/abs/2412.16491
Authors: Seungdong Yoa, Seungjun Lee, Hyeseung Cho, Bumsoo Kim, Woohyung Lim
Affiliation: LG AI Research, Chung-ang University
๐Ÿ’ก

Key Differentiator

Attention ์ ์ˆ˜๊ฐ€ ๋‚ฎ์€ Non-sementicํ•œ Token๋ผ๋ฆฌ๋งŒ Merging

โ†’ ์ตœ๋Œ€ํ•œ Token Merging์—์„œ ์˜๋ฏธ์žˆ๋Š” ๊ฒƒ๋“ค์ด Merge๋˜์ง€ ์•Š๋„๋ก Token Reduction

๐Ÿคท

Why I chose this paper

  • I wanted to read a paper published by a Korean company.
  • I want to extend my recently submitted Efficient GUI Grounding paper with a token-based follow-up study.

Abstract

Vision Transformer๋Š” ๋ชจ๋“  ํŒจ์น˜๋ฅผ ํ† ํฐ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ณ„์‚ฐ๋Ÿ‰์ด ํผ

โ†’ ๊ธฐ์กด ์—ฐ๊ตฌ๋“ค์€ ํ† ํฐ์„ ์ œ๊ฑฐ(pruning) ํ•˜๊ฑฐ๋‚˜ ํ•ฉ์น˜๊ธฐ(merging)

โ†’ ๊ฐ ํ† ํฐ์ด ์˜๋ฏธ๋ฅผ ์ถฉ๋ถ„ํžˆ ๋‹ด๊ณ  ์žˆ์ง€ ์•Š์•„์„œ ์˜๋ฏธ ์—†๋Š” ํ† ํฐ์„ ๋‹จ์ˆœํžˆ ์ œ๊ฑฐํ•˜๊ฑฐ๋‚˜ ์„ž์œผ๋ฉด ์˜คํžˆ๋ ค ์ •๋ณด ์†์‹ค์ด ํผ

์ด ๋…ผ๋ฌธ์€ โ€˜ImagePieceโ€™ ๋ผ๋Š” ์ƒˆ๋กœ์šด ์žฌํ† ํฌ๋‚˜์ด์ œ์ด์…˜(re-tokenization) ๋ฐฉ์‹์„ ์ œ์•ˆ

WordPiece tokenizer reference: https://wikidocs.net/166826

WordPiece tokenizer์ฒ˜๋Ÿผ ์ด๋ฏธ์ง€ ์•ˆ์˜ ์˜๋ฏธ ์—†๋Š” ์ž‘์€ ํŒจ์น˜๋“ค์„ ํ•ฉ์ณ์„œ ์˜๋ฏธ ์žˆ๋Š” ๋‹จ์œ„๊ฐ€ ๋  ๋•Œ๊นŒ์ง€ ๋ฌถ๋Š” ๋ฐฉ์‹

  • local coherence ๋ชจ๋“ˆ: ์ธ์ ‘ํ•œ ํŒจ์น˜๋“ค์˜ ์œ ์‚ฌ์„ฑ์„ ๋†’์—ฌ, ์„œ๋กœ ์˜๋ฏธ๋ฅผ ํ˜•์„ฑํ•˜๋„๋ก ๋„์›€
  • ์ด๋ ‡๊ฒŒ ๋งŒ๋“ค์–ด์ง„ ์ƒˆ๋กœ์šด โ€œ์˜๋ฏธ ์žˆ๋Š” ํ† ํฐโ€๋งŒ Transformer์— ๋‚จ๊ธฐ๊ณ , ๋๊นŒ์ง€ ์˜๋ฏธ๊ฐ€ ์—†๋Š” ํ† ํฐ์€ ๋ฒ„๋ฆฐ๋‹ค.

๊ฒฐ๊ณผ

  • DeiT-S ๋ชจ๋ธ ๊ธฐ์ค€ ์ถ”๋ก  ์†๋„ 54% ํ–ฅ์ƒ (์•ฝ 1.5๋ฐฐ ๋น ๋ฆ„)
  • ๋™์‹œ์— ImageNet ์ •ํ™•๋„ 0.39% ํ–ฅ์ƒ
  • ๊ทน๋‹จ์ ์ธ ์†๋„ ์กฐ๊ฑด(251% ๊ฐ€์†)์—์„œ๋„ ๊ธฐ์กด ๋ฐฉ์‹๋ณด๋‹ค ์ •ํ™•๋„๊ฐ€ 8% ์ด์ƒ ๋†’์Œ


Preliminary

Vision Transformer (ViT)

  • The Transformer was originally an NLP model,

    but applying it to images gave rise to the Vision Transformer (ViT) (Dosovitskiy et al., 2021)

  • The image is split into square patches (p×p), and each patch is converted into a single token fed into the Transformer
  • A 224×224 image with 16×16 patches → 196 tokens

    Adding the [CLS] token gives a total of 197 tokens as Transformer input

  • Token Importance
    • Inside the ViT, the [CLS] token serves to summarize global information about the entire image.
    • Each token's importance is measured by how much attention the [CLS] token pays to it: a = softmax(q_cls · Kᵀ / √d)
    • q_cls: query vector of the [CLS] token
    • K, V: key and value matrices over all tokens
    • a: attention scores indicating how important each token is to [CLS]

    → The higher this attention score, the more that token matters for composing the overall image semantics.

  • ๊ฐ ํ† ํฐ์€ ํŒจ์น˜ ์ž„๋ฒ ๋”ฉ(embedding) + ์œ„์น˜ ์ž„๋ฒ ๋”ฉ(positional embedding) ์„ ํฌํ•จํ•˜์—ฌ

    Self-Attention์œผ๋กœ ์ „์—ญ ์ •๋ณด๋ฅผ ํ•™์Šตํ•จ.

  • ์ฆ‰, NLP์—์„œ์˜ โ€œ๋‹จ์–ด ํ† ํฐโ€ โ†’ ViT์—์„œ๋Š” โ€œ์ด๋ฏธ์ง€ ํŒจ์น˜ ํ† ํฐโ€

๊ทธ๋Ÿฌ๋‚˜ ๋‘ ๋ถ„์•ผ๋Š” ํ† ํฐ์˜ ์˜๋ฏธ(semantic structure) ์ธก๋ฉด์—์„œ ํฐ ์ฐจ์ด๊ฐ€ ์žˆ์Œ

๊ตฌ๋ถ„NLP (WordPiece ๋“ฑ)ViT (Patch Token)
์ž…๋ ฅ ๋‹จ์œ„๋‹จ์–ด ๋˜๋Š” ์˜๋ฏธ ์žˆ๋Š” ์„œ๋ธŒ์›Œ๋“œ16ร—16 ํ”ฝ์…€ ํŒจ์น˜
ํ† ํฐ ์˜๋ฏธ๋Œ€๋ถ€๋ถ„ ์˜๋ฏธ ์žˆ์Œ๋งŽ์€ ํŒจ์น˜๋Š” ๋ฐฐ๊ฒฝ ๋“ฑ, ์˜๋ฏธ ์—†์Œ
๊ฒฐ๊ณผ์  ๋ฌธ์ œ์—†์Œ์ •๋ณด๊ฐ€ ํฌ๋ฐ•ํ•˜๊ณ  ์ค‘๋ณต ๋งŽ์Œ

โ†’ ViT์˜ ํšจ์œจ์„ฑ ๋ฌธ์ œ๋Š” โ€œ์˜๋ฏธ ์—†๋Š” ํ† ํฐ์ด ๋„ˆ๋ฌด ๋งŽ์Œโ€ ์—์„œ ๋น„๋กฏ๋จ.

ViT = O(Nยฒ)


Related Work (Efficient Transformer)

(1) Efficient Attention

Self-Attention์˜ ์—ฐ์‚ฐ๋Ÿ‰ ์ž์ฒด๋ฅผ ์ค„์ด๋Š” ์ ‘๊ทผ.

Attention์„ ๊ทผ์‚ฌํ•˜๊ฑฐ๋‚˜ ๋ณ‘๋ ฌ ์ตœ์ ํ™”๋กœ ์†๋„ ํ–ฅ์ƒ.

  • Linformer (Wang et al., 2020)
  • Performer (Choromanski et al., 2020)
  • FlashAttention (Dao et al., 2022)

โ†’ Attention ๋ ˆ๋ฒจ์˜ ์ตœ์ ํ™”๋กœ ๊ณ„์‚ฐ๋งŒ ์ค„์ด๊ณ , ํ† ํฐ ์ž์ฒด์˜ ์˜๋ฏธ ๋ฌธ์ œ๋Š” ํ•ด๊ฒฐํ•˜์ง€ ๋ชปํ•จ.

(2) Token Pruning

Removes unimportant tokens based on attention scores.

  • DynamicViT (Rao et al., 2021): progressively discards tokens using a learned prediction layer.
  • EViT (Liang et al., 2022): deletes the lowest-ranked tokens based on their attention to the class token.
  • SPViT (Kong et al., 2022): computes importance with a soft selector, then prunes.

→ Tokens whose meaning has not yet fully emerged (e.g., a fragment of a bus) are removed too early → information loss.

(3) Token Merging

Combines tokens with similar features to reduce their number.

  • ToMe (Bolya et al., 2023): merges the most similar token pairs via bipartite soft matching.
  • Token Pooling (Marin et al., 2021): K-means-based merging.
  • TokenLearner (Ryoo et al., 2021): generates a selective set of tokens with MLPs.

→ Tokens that look similar but differ in meaning, or that are individually important, get mixed in → resulting in semantic dilution.

์ด๋“ค์˜ ๊ณตํ†ต์ !

โ€œํ† ํฐ์˜ ์˜๋ฏธ(semanitcs)๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š๋Š”๋‹คโ€


ImagePiece

Blog Image

ViT์˜ ํšจ์œจํ™”๋Š” ๋‹จ์ˆœํžˆ ํ† ํฐ ์ˆ˜๋ฅผ ์ค„์ด๋Š” ๊ฒŒ ์•„๋‹ˆ๋ผ,

โ€œํ† ํฐ์ด ์ถฉ๋ถ„ํžˆ ์˜๋ฏธ๋ฅผ ๊ฐ€์งˆ ๋•Œ๊นŒ์ง€ ์žฌ๊ตฌ์„ฑ(re-tokenization)โ€ ํ•ด์•ผ ํ•œ๋‹ค.

Step I : Token Importance Evaluation

  • ๊ฐ ํ† ํฐ์ด ์ „์ฒด ์ด๋ฏธ์ง€ ์˜๋ฏธ์— ์–ผ๋งˆ๋‚˜ ๊ธฐ์—ฌํ•˜๋Š”์ง€ ํ‰๊ฐ€
  • [CLS] ํ† ํฐ๊ณผ์˜ attention ๊ฐ’ ๏ปฟ ์„ ์ด์šฉํ•ด ์ค‘์š”๋„ ์ˆœ์œ„๋ฅผ ๊ณ„์‚ฐ
  • ์ค‘์š”๋„๊ฐ€ ๋‚ฎ์€ bottom-k ํ† ํฐ์„ ํ›„๋ณด๋กœ ์ง€์ • โ†’ โ€œnon-semantic tokensโ€

Step II : Re-tokenization of Non-semantic Tokens

  • bottom-k ํ† ํฐ๋“ค์„ ๋‘ ๊ทธ๋ฃน(A, B)์œผ๋กœ ๋‚˜๋ˆˆ ๋’ค,

    ๊ฐ€์žฅ ์œ ์‚ฌํ•œ ์Œ๋ผ๋ฆฌ merge

  • ๋ณ‘ํ•ฉ์—๋Š” bipartite soft matching์„ ์‚ฌ์šฉ
  • Bipartite soft matching (Bolya et al., 2023)

    โ€œ๋‘ ๊ทธ๋ฃน ์‚ฌ์ด์˜ ์ตœ์  ์œ ์‚ฌ๋„ ๋งค์นญโ€ ์„ ๋ถ€๋“œ๋Ÿฝ๊ฒŒ(softly) ๊ณ„์‚ฐํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜

    1. Bipartite structure
      • ๋‘ ํ† ํฐ ์ง‘ํ•ฉ A ์™€ B ๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, A์˜ ๊ฐ ํ† ํฐ์ด B ๋‚ด์˜ ํ•œ ํ† ํฐ๊ณผ ์—ฐ๊ฒฐ๋  ํ™•๋ฅ  ๊ณ„์‚ฐ
    1. Soft assignment
      • ๊ฐ ์—ฐ๊ฒฐ์˜ ๊ฐ•๋„ : ๏ปฟ

      โ†’ ํ•˜๋‚˜์˜ A ํ† ํฐ์ด ์—ฌ๋Ÿฌ B ํ† ํฐ์˜ ์ •๋ณด๋ฅผ ๊ฐ€์ค‘ํ•ฉ ํ˜•ํƒœ๋กœ ๋ณ‘ํ•ฉ ๊ฐ€๋Šฅ

    1. Information preserving merge
      • ์ƒˆ ํ† ํฐ์€ ๏ปฟ ๋กœ ๊ณ„์‚ฐ๋˜์–ด,

        ํ•˜๋“œ ๋งค์นญ๋ณด๋‹ค ๋” ์—ฐ์†์ ์ด๊ณ  ์†์‹ค์ด ์ ์€ ๋ณ‘ํ•ฉ์„ ์ˆ˜ํ–‰

Step III : Re-evaluation and Discarding

  • merge๋œ ํ† ํฐ๋“ค๋งŒ attention ์„ ๋‹ค์‹œ ๊ณ„์‚ฐํžˆ๋ฉด์„œ step1, 2 ๋ฐ˜๋ณต
  • ์ตœ์ข…์ ์œผ๋กœ ์—ฌ์ „ํžˆ ์˜๋ฏธ๊ฐ€ ์—†์œผ๋ฉด โ†’ ์ตœ์ข…์ ์œผ๋กœ ์‚ญ์ œ (prune)

Local Coherence Bias (local coherence module)

  • Adds a bias so that, reflecting the spatial structure of images, adjacent patches are perceived as similar
  • Concretely, applies four 3×3 convs and one 1×1 conv to produce overlapping patch features
  • As a result, spatially close patches become more similar,

    and are naturally grouped into the same semantic unit during merging

Blog Image
WordPiece (NLP)                                         ImagePiece (Vision)
Splits sentences into meaningful word units.            Splits images into 16×16 patches.
"meaningful tokens" → each token already has meaning.   "patch tokens" → mostly meaningless (background, sky, etc.).
Tokenizes long sentences via MaxMatch (longest match).  Merges meaningless patches until meaning emerges.
WordPiece๋Š” ๋‹จ์–ด๋ฅผ ์ชผ๊ฐœ ์˜๋ฏธ ๋‹จ์œ„๋ฅผ ๋งŒ๋“ค๊ณ ,

ImagePiece๋Š” ๋ฐ˜๋Œ€๋กœ ์˜๋ฏธ ์—†๋Š” ์กฐ๊ฐ๋“ค์„ ํ•ฉ์ณ ์˜๋ฏธ ๋‹จ์œ„๋กœ ๋งŒ๋“ฆ

์˜ˆ๋ฅผ ๋“ค์–ด, ํŒŒ๋ž€์ƒ‰ ํŒจ์น˜ ํ•˜๋‚˜๋งŒ ๋ณด๋ฉด ์•„๋ฌด ์˜๋ฏธ ์—†์ง€๋งŒ

์ฃผ๋ณ€ ํŒจ์น˜๋“ค๊ณผ ํ•ฉ์น˜๋ฉด โ€˜๋ฒ„์Šคโ€™๋ผ๋Š” ์˜๋ฏธ๊ฐ€ ์ƒ๊น€ โ†’ ์ด๊ฒŒ re-tokenization ์˜ ํ•ต์‹ฌ

Compatibility with Other Methods

  • ์žฌํ† ํฐํ™”๊ฐ€ ํ† ํฐ ์ƒ์„ฑ ๋‹จ๊ณ„์—์„œ ์ด๋ค„์ง€๋ฏ€๋กœ, ๊ทธ ๋’ค์˜ pruning ๋˜๋Š” merging ๋ชจ๋“ˆ๊ณผ ์ถฉ๋Œํ•˜์ง€ ์•Š์Œ
  • ๊ธฐ์กด Token Pruning (EViT, DynamicViT) ์ด๋‚˜ Merging (ToMe) ๋ฐฉ์‹๊ณผ ๊ฒฐํ•ฉ ๊ฐ€๋Šฅ
  • ์˜คํžˆ๋ ค re-tokenization ๋•๋ถ„์— ์ดˆ๊ธฐ layer์—์„œ ์˜๋ฏธ ์—†๋Š” ํŒจ์น˜๊ฐ€ ๋นจ๋ฆฌ ์ •๋ฆฌ๋˜์–ด ์ „์ฒด ํšจ์œจ์ด ๋” ์ข‹์•„์ง


Experiment

์‹คํ—˜ ๊ฐœ์š”

  • ๋ฐ์ดํ„ฐ์…‹: ImageNet-1k (1.2M train, 50k test)
  • ๊ธฐ๋ฐ˜ ๋ชจ๋ธ: DeiT-Ti, DeiT-S (๋‘ ๊ฐ€์ง€ Vision Transformer ๋ฒ„์ „)
  • ์ž…๋ ฅ ํฌ๊ธฐ: 224ร—224
  • ํ›ˆ๋ จ: 300 epoch / finetuning, pretraining ์—†์Œ. (DeiT ๋…ผ๋ฌธ๊ณผ ๋™์ผํ•œ ์„ค์ •์œผ๋กœ )
  • NVIDIA RTX 3090

Table 1 - Token Pruning comparison

Blog Image

Measured at the same keep ratio (0.7)

  • DynamicViT / EViT make their token-removal decisions in later layers → the early layers still process many tokens
  • ImagePiece performs re-tokenization early → the token count drops sharply from the very first layers.

Table 2 - Token Merging comparison

Blog Image

ImagePiece merges only meaningless tokens, so meaningful information (semantic tokens) is kept intact

→ almost no accuracy loss

Figure 3 - Hyper-speed Inference Experiment

์ด ์‹คํ—˜์€ โ€œํ† ํฐ ์ˆ˜๋ฅผ ๊ทน๋‹จ์ ์œผ๋กœ ์ค„์—ฌ๋„ ์„ฑ๋Šฅ์ด ์œ ์ง€๋˜๋Š”๊ฐ€?โ€ ๋ฅผ ๋ณด๋Š” ํ…Œ์ŠคํŠธ

๊ฐ ๋ชจ๋ธ์˜ keep rate (๋‚จ๊ธฐ๋Š” ํ† ํฐ ๋น„์œจ)์„ 70%, 60%, 50%, ... ๋กœ ์ ์  ์ค„์—ฌ๊ฐ€๋ฉด์„œ ์ธก์ •

โ†’ ๊ทน๋‹จ์ ์œผ๋กœ ๋น ๋ฅธ ์ถ”๋ก  ์†๋„์ผ๋•Œ๋„ ์ •ํ™•๋„ ๋งŽ์ด ๋ณด์กด

Blog Image

๊ฐ™์€ย Acc ๊ธฐ์ค€์œผ๋กœ ๋น„๊ต

Blog Image
  • ImagePiece๋Š” ์ „์ฒด ํ† ํฐ์˜ 13%๋งŒ ๋‚จ๊ธฐ๊ณ ๋„ ์ •ํ™•๋„๋ฅผ ์œ ์ง€
  • ๋™์ผํ•œ ์„ฑ๋Šฅ ๊ธฐ์ค€์—์„œ 30% ์ด์ƒ ๋น ๋ฅธ ์ถ”๋ก  ์†๋„๋ฅผ ๋‹ฌ์„ฑ
  • โ€œ์˜๋ฏธ ์—†๋Š” ํ† ํฐ์„ ๋” ์ •ํ™•ํžˆ ์‹๋ณ„ํ•ด ๋ฒ„๋ฆฌ๊ธฐ ๋•Œ๋ฌธโ€

Table 4 - Random Masking Noise Robustness

Blog Image

๋…ธ์ด์ฆˆ๋‚˜ ๊ฐ€๋ ค์ง„ ์˜์—ญ์— ๋Œ€ํ•œ ๊ฒฌ๊ณ ์„ฑ(robustness) ๊ฒ€์ฆ

  • ํ…Œ์ŠคํŠธ ์ด๋ฏธ์ง€์— ๋ฌด์ž‘์œ„ 16ร—16 ๋งˆ์Šคํฌ 7~50๊ฐœ ์ถ”๊ฐ€
  • โ€œ์˜๋ฏธ ๋‹จ์œ„๋กœ ๋ฌถ์ธ ํ† ํฐ์ด ๋” ๊ฒฌ๊ณ ํ•œ global representationโ€

Table 5 - Change in Token Attentiveness

"Even a meaningless token becomes important again once merging gives it meaning"

Blog Image
  • The fraction of tokens judged inattentive (unimportant) in a previous layer that become attentive (important) in the next layer
  • Thanks to re-tokenization, meaningless tokens recover their semantic importance as they are merged into semantic units

Table 6 & 7 - Token Similarity

๋ณ‘ํ•ฉ๋œ ํ† ํฐ ์Œ(token pairs) ๋“ค์˜ feature cosine similarity

Blog Image

ToMe: layer๊ฐ€ ๊นŠ์–ด์งˆ์ˆ˜๋ก ์œ ์‚ฌ๋„๊ฐ€ ๋–จ์–ด์ ธ ์ •๋ณด ํฌ์„ ๋ฐœ์ƒ,

ImagePiece: ์ •๋ณด ์ผ๊ด€์„ฑ ๋ณด์กด

์ฒซ๋ฒˆ์งธ layer์—์„œ ๋ณ‘ํ•ฉ๋œ ํ† ํฐ ์ค‘ โ€œ์ค‘์š” ํ† ํฐโ€ ๋น„์œจ

Blog Image

โ†’ ๊ธฐ์กด merging ๋ฐฉ์‹์€ ์ค‘์š”ํ•œ ํ† ํฐ์„ ๋„ˆ๋ฌด ์ž์ฃผ ๋ณ‘ํ•ฉํ•จ.

๋ฐ˜๋ฉด ImagePiece๋Š” bottom-k๋งŒ ๋ณ‘ํ•ฉํ•˜๋ฏ€๋กœ semantic dilution ๋ฐฉ์ง€.

Table 8 - Local Coherence effect

Blog Image

→ spatially close patches are grouped together into semantic units

                                         Accuracy (%)
ImagePiece (no local bias)               79.81
Full ImagePiece (with local coherence)   80.22

→ the local coherence module design is effective.

Table 9 - Compatibility experiment

Blog Image

Simply adding ImagePiece to existing pruning/merging pipelines improves both accuracy and speed

→ a modular, drop-in design with high extensibility


Limitation & Future Work

๋…ผ๋ฌธ์—๋Š” ์—†์ง€๋งŒ ๋‚ด๊ฐ€ ์ƒ๊ฐํ•ด๋ณธ ์ ๋“ค

Patch ํฌ๊ธฐ์™€ ๊ตฌ์กฐ์— ๋ฏผ๊ฐ

๋ชจ๋ธ๋งˆ๋‹ค ์ตœ์ ์˜ Patch ํฌ๊ธฐ๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ๋Š”๋ฐ, ํŒจ์น˜ ํฌ๊ธฐ๊ฐ€ ์ž‘๊ฑฐ๋‚˜ ํฌ๋‹ค๋ฉด ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ž˜ ์ž‘๋™ํ•˜์ง€ ์•Š์„ ๊ฒƒ ๊ฐ™๋‹ค.

(๋…ผ๋ฌธ์—์„œ๋Š” 16x16 ์‚ฌ์šฉ)

โ†’ ์ตœ๊ทผ์— ๋‚ด๊ฐ€ ์ œ์ถœํ•œ ๋…ผ๋ฌธ์—์„œ๋„ ์ด์™€๊ฐ™์€ Limitation์ด ์žˆ์—ˆ๋‹ค.

Future work

  • Automatically adjust merging granularity according to feature-map resolution
  • Match the receptive field of the local coherence module to the patch size

Semantic ๊ธฐ์ค€์ด attention ๊ธฐ๋ฐ˜

์˜๋ฏธ ์ •์˜๊ฐ€ attention score์—๋งŒ ์˜์กด

โ†’ ๋ณต์žกํ•œ ์žฅ๋ฉด(๋‹ค์ค‘ ๊ฐ์ฒด ์ด๋ฏธ์ง€)์—์„œ๋Š” ํ† ํฐ ๋ณ‘ํ•ฉ์ด ๋ถ€์ •ํ™•ํ•  ๊ฐ€๋Šฅ์„ฑ ์กด์žฌ.

future work

  • attention ์™ธ์—๋„ spatial, contrastive, objectness ์ •๋ณด๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ํ† ํฐ ์˜๋ฏธ ํ‰๊ฐ€