๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
์นดํ…Œ๊ณ ๋ฆฌ ์—†์Œ

[๋…ผ๋ฌธ ์Šคํ„ฐ๋””] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)

by coderSohyun 2024. 2. 9.

๋…ผ๋ฌธ ์ƒ์„ฑ ๋ฐฐ๊ฒฝ

์ž์—ฐ์–ด์ฒ˜๋ฆฌ์—์„œ๋Š” ์ด์ œ RNN์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  transformer๊ฐ€ NLP์˜ ํ‘œ์ค€์ด๋ผ๊ณ  ํ•  ์ •๋„๋กœ ์ž๋ฆฌ๊ฐ€ ์žกํžŒ ์ค‘์š”ํ•œ ๋ชจ๋ธ์ด๋‹ค. 

 

์ด๋ฅผ ์ปดํ“จํ„ฐ ๋น„์ „์˜ Image Classification์— ์ ์šฉ์„ ํ•ด๋ณด๊ธฐ ์œ„ํ•ด ๋งŽ์€ ๋…ธ๋ ฅ๋“ค์ด ์žˆ์—ˆ์ง€๋งŒ,

์—ฌ์ „ํžˆ CNN ๋ชจ๋ธ์— ์˜์กด์ ์ธ ๋ชจ๋ธ๋“ค์ด ๋งŽ์ด ๋‚˜์™”๊ณ 

์™„๋ฒฝํ•˜๊ฒŒ transformer๋งŒ ์‚ฌ์šฉํ•œ ๋ชจ๋ธ๋“ค์€ ์ด๋ก ์ ์œผ๋กœ๋Š” ํšจ์œจ์ ์ด๊ฒ ์ง€๋งŒ, 

specialized attention pattern๋“ค์„ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์—

์ตœ์‹  ํ•˜๋“œ์›จ์–ด ๊ฐ€์†๊ธฐ์—์„œ๋Š” ์•„์ง ํšจ๊ณผ์ ์œผ๋กœ ํ™•์žฅ๋˜์ง€ ์•Š์•˜๋‹ค. 

 

๊ทธ๋ž˜์„œ ์ด ๋…ผ๋ฌธ์—์„œ๋Š” CNN ๊ตฌ์กฐ๋ฅผ ๋ฒ„๋ฆฐ, ์˜จ์ „ํžˆ transformer๋งŒ ์‚ฌ์šฉํ•˜์—ฌ 

Image Classificationํ•  ์ˆ˜ ์žˆ๋„๋ก ViT(Vision Transformer) ๋ชจ๋ธ์ด ๋‚˜์˜ด 

 

๋…ผ๋ฌธ ๋ชจ๋ธ ๊ตฌ์กฐ 

 

๋…ผ๋ฌธ ๋‚ด์šฉ ๊ตฌ์„ฑ 

Abstract

์ž์—ฐ์–ด์ฒ˜๋ฆฌ์—์„œ์˜ transformer๋Š” ์‚ฌ์‹ค์ƒ ํ‘œ์ค€์œผ๋กœ ์ž๋ฆฌ ์žก์•˜๋‹ค.

 

ํ•˜์ง€๋งŒ ์ปดํ“จํ„ฐ ๋น„์ „์—์„œ์˜ transformer๋Š” CNN๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉ๋˜๊ฑฐ๋‚˜, CNN์˜ ๊ตฌ์กฐ๋Š” ๊ฐ–๊ณ  ๊ฐ€๋˜ ํŠน์ • ๋ถ€๋ถ„์„ ๋Œ€์ฒดํ•˜์—ฌ ์‚ฌ์šฉ๋˜๋Š” ์ •๋„๋กœ ์ž๋ฆฌ์žก๊ณ  ์žˆ์—ˆ๋‹ค.

 

์ด ๋…ผ๋ฌธ์—์„œ๋Š” Image Classification์„ ์œ„ํ•ด CNN์— ์˜์กดํ•˜์ง€ ์•Š๊ณ  ์˜จ์ „ํžˆ transformer๋งŒ ์‚ฌ์šฉํ•˜์—ฌ ์ข‹์€ ์„ฑ๊ณผ๋ฅผ ๋‚ผ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค€๋‹ค.

 

Introduction

NLP์˜ transformer์ฒ˜๋Ÿผ ์ตœ์†Œํ•œ์˜ modification ์—†์ด ์ด๋ฏธ์ง€์— ์ง์ ‘์ ์œผ๋กœ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๋Š” transformer๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ

 

์ฃผ์–ด์ง„ ์ด๋ฏธ์ง€๋ฅผ patch๋“ค๋กœ ์ชผ๊ฐ  ํ›„ transformer์˜ ์ž…๋ ฅ์— ๋“ค์–ด๊ฐˆ linear embedding๋“ค์˜ ์ˆœ์„œ(sequence)๋ฅผ ์ œ๊ณตํ•œ๋‹ค. 

(์—ฌ๊ธฐ์„œ ์ด๋ฏธ์ง€ patch๋“ค์€ NLP์—์„œ์˜ ํ† ํฐ๊ณผ ๊ฐ™์€ ๊ฐœ๋…์ž„)

 

When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.

 

๊ฐ•๋ ฅํ•œ ์ •๊ทœํ™” ์—†์ด ImageNet๊ณผ ๊ฐ™์€ ์ค‘๊ฐ„ ํฌ๊ธฐ์˜ ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋Œ€ํ•ด ํ•™์Šตํ•œ ๊ฒฝ์šฐ, ์ด๋Ÿฌํ•œ ๋ชจ๋ธ์€ ๋น„์Šทํ•œ ํฌ๊ธฐ์˜ ResNet๋ณด๋‹ค ๋ช‡ ํผ์„ผํŠธ ํฌ์ธํŠธ ๋‚ฎ์€ ์ˆ˜์ค€์˜ ์ •ํ™•๋„๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์‹ค๋ง์Šค๋Ÿฌ์šด ๊ฒฐ๊ณผ๋Š” ์˜ˆ์ƒํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค: ํŠธ๋žœ์Šคํฌ๋จธ๋Š” ๋ฒˆ์—ญ ๋™๋“ฑ์„ฑ ๋ฐ ์ง€์—ญ์„ฑ๊ณผ ๊ฐ™์€ CNN ๊ณ ์œ ์˜ ๊ท€๋‚ฉ์  ํŽธํ–ฅ์ด ๋ถ€์กฑํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ถˆ์ถฉ๋ถ„ํ•œ ์–‘์˜ ๋ฐ์ดํ„ฐ๋กœ ํ›ˆ๋ จํ•  ๊ฒฝ์šฐ ์ผ๋ฐ˜ํ™”๊ฐ€ ์ž˜ ์ด๋ฃจ์–ด์ง€์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

 

ํ•˜์ง€๋งŒ 14M-300M์˜ ์ด๋ฏธ์ง€ ์ •๋„ ํฌ๊ธฐ์˜ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฒฝ์šฐ ๊ท€๋‚ฉ์  ํŽธํ–ฅ์˜ ๋ถ€์กฑ์„ ๋›ฐ์–ด๋„˜์–ด ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

Related Work

 

Method

1. Vision Transformer (ViT)
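In this section the paper forms the encoder input by prepending a learnable [class] token to the patch embeddings and adding position embeddings. A minimal numpy sketch of that assembly step, with random values standing in for the learned parameters:

```python
import numpy as np

def assemble_vit_input(patch_embeddings, rng=None):
    """Prepend the learnable [class] token and add position embeddings,
    producing the sequence the transformer encoder consumes
    (random values stand in for learned parameters)."""
    if rng is None:
        rng = np.random.default_rng(0)
    N, D = patch_embeddings.shape
    cls_token = rng.standard_normal((1, D))      # the [class] token
    seq = np.concatenate([cls_token, patch_embeddings], axis=0)  # (N+1, D)
    pos_embed = rng.standard_normal((N + 1, D))  # learned 1-D position embeddings
    return seq + pos_embed

# 196 patch tokens plus the class token gives a sequence of length 197.
z0 = assemble_vit_input(np.zeros((196, 768)))
print(z0.shape)  # (197, 768)
```

The classification head then reads only the encoder output at the [class] token position, mirroring BERT's use of its [CLS] token.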

 

2. Fine-Tuning and Higher Resolution

 

Experiments

1. Setup

2. Comparison to State of the Art

3. Pre-training Data Requirements

4. Scaling Study

5. Inspecting Vision Transformer

6. Self-Supervision

Conclusion