
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)

by coderSohyun 2024. 2. 7.

1. Natural Language Processing and Computer Vision Deep Learning with PyTorch Transformers (Korean textbook), pp. 601~623

2. Shusen Wang - Vision Transformer for Image Classification (YouTube)

3. Korea University DSBA - [Paper Review] ViT

 

 

The Transformer from natural language processing has also had a major influence on computer vision.

Earlier computer vision research mostly grafted the Transformer's self-attention module onto convolutional neural networks,

whereas ViT (Vision Transformer) is one of the first works to apply the Transformer architecture itself to computer vision.

 

Where the convolutional layers of a CNN extract local features for image classification,

ViT instead uses self-attention to process the entire image at once.

 

Both BERT and ViT are built on the Transformer architecture, but they construct their input data in different ways.

 

In ViT, the image is divided on a grid into small image patches, which are fed in sequentially.

The input image patches are assumed to form a sequence ordered from left to right and top to bottom.

 

Comparison of convolutional models and ViT

Convolutional neural networks and Transformers share the same goal: building embeddings that represent image features well.

 

A CNN embedding learns by looking at only a subset of the image patches at a time, and extracts the features of the whole image through this process.

 

In contrast, the ViT embedding splits the image into small patches and learns the correlations between the patches.

To do this it uses self-attention, which takes into account how every image patch influences every other patch

and thereby extracts the features of the whole image.

As a result, every image patch takes part in learning, and ViT provides a high-quality representation of the image.

 

A convolutional network with a narrow receptive field (RF)

needs a large number of layers to represent information from the whole image,

whereas the Transformer, by computing attention across long ranges (a large attention distance),

can easily represent global image information with only a single ViT layer.

 

Because ViT processes the image in patch units, unlike convolutional models that work pixel by pixel,

it has the advantage of reaching high performance even with a smaller model.

 

On the other hand, the ViT model expects a fixed input image size,

so handling images of different sizes requires a preprocessing step that resizes them,

 

and whereas a convolutional network takes the spatial layout of the image into account,

ViT only considers the relative positional information between patches, so it can be vulnerable to image transformations.

 

ViT's inductive bias

The concept of inductive bias is still unfamiliar to me.. (in the ViT paper it refers to the image-specific assumptions a CNN builds in, such as locality and translation equivariance, which ViT largely lacks)

The ViT model

consists of a patch embedding (Patch Embedding), which splits the input image into fixed-size patches to fit the Transformer architecture and converts each patch into a vector,

and encoder (Encoder) layers, which learn the relationships between the patches.

 

Through the patch embedding and the encoder layers,

the model extracts image features and converts them into outputs suited to tasks such as classification or regression.

 

Patch embedding

Patch embedding (Patch Embedding) refers to the process of splitting the input image into small patches.

Before the image can be split into patches, it has to be preprocessed to the expected size:

the image is resized to a square of the required dimensions.

 

Once the image has been resized to a fixed size,

the whole image is split into patch-sized pieces to form a sequence.

A convolutional layer is used for this step, as in the sketch below.
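As a rough illustration of that idea, the patch split and the projection to vectors can be done in one step with a convolution whose kernel size and stride both equal the patch size. This is only a minimal sketch; the image size, patch size, and embedding dimension below are assumed example values, not the book's exact code.

import torch
import torch.nn as nn

image_size, patch_size, channels, dim = 224, 16, 3, 768   # assumed example sizes

# A conv layer whose kernel size and stride equal the patch size cuts the image
# into non-overlapping patches and projects each patch to a dim-dimensional vector.
to_patch_embedding = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, channels, image_size, image_size)   # one resized square image
patches = to_patch_embedding(x)                         # (1, 768, 14, 14)
patches = patches.flatten(2).transpose(1, 2)            # (1, 196, 768): a sequence of patch vectors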

 

 

Encoder layers

 

Hands-on with the model

Using the Hugging Face library and the FashionMNIST dataset,

we fine-tune a ViT model.

The FashionMNIST dataset was created to pose a harder image classification problem than the original MNIST dataset.

It contains clothing images

across 10 classes, with 60,000 training images and 10,000 test images.

 

For a simple exercise, the training data is subsampled to 10,000 images and the test data to 1,000 images.

 

from itertools import chain

Imports the chain function from the itertools module. chain combines several iterables into a single iterable.

 

from collections import defaultdict

Imports defaultdict from the collections module. defaultdict is a dictionary class that supplies a default value for missing keys.

 

from torch.utils.data import Subset

Imports the Subset class from torch.utils.data. Subset wraps a PyTorch dataset and exposes only a chosen subset of its indices.
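A rough sketch of how these imports could work together to subsample FashionMNIST to 10,000 training and 1,000 test images. The torchvision loader and the per-class sampling scheme are my assumptions for illustration, not necessarily the book's exact code.

from collections import defaultdict
from itertools import chain

from torch.utils.data import Subset
from torchvision import datasets

def sample_dataset(dataset, num_per_class):
    # Collect sample indices per class label, keeping only the first
    # num_per_class indices of each of the 10 classes.
    indices = defaultdict(list)
    for idx, (_, label) in enumerate(dataset):
        if len(indices[label]) < num_per_class:
            indices[label].append(idx)
    # chain flattens the per-class lists into one index list,
    # and Subset exposes only those indices of the original dataset.
    return Subset(dataset, list(chain(*indices.values())))

train_dataset = datasets.FashionMNIST(root="datasets", train=True, download=True)
test_dataset = datasets.FashionMNIST(root="datasets", train=False, download=True)

subset_train = sample_dataset(train_dataset, num_per_class=1000)  # 10 x 1000 = 10,000 images
subset_test = sample_dataset(test_dataset, num_per_class=100)     # 10 x 100  = 1,000 images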

 

 

Notes from the lecture on the paper

 

Feed a photo of a puppy into the neural network and it outputs a vector p.

This vector p is the result of image classification (image classification).

Each element of p is associated with a class,

so if there are 8 classes, the vector p is 8-dimensional.

 

As in the graph below, each element takes a value between 0 and 1,

and the elements sum to 1.
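For example, with 8 classes the final softmax turns 8 raw scores into such a probability vector p (toy values, purely illustrative):

import torch

logits = torch.randn(8)            # raw scores for 8 classes (toy example)
p = torch.softmax(logits, dim=0)   # every element lies in (0, 1)
print(p, p.sum())                  # the elements sum to 1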

 

The larger the dataset, the better ViT performs compared with ResNet.

ViT is based on Transformer (for NLP)

 

To run data through the ViT model, the image first has to be partitioned into patches.

 

The image is partitioned with a sliding window that moves by a few pixels at each step;

here the stride is how many pixels the sliding window moves each time.

 

In the setup above, the user specifies a patch size of 16x16;

if the stride were set to 1x1, there would be far too many patches and the computation would be heavy. (In ViT the stride is set equal to the patch size, so the patches do not overlap.)

 

Suppose the image has been split into 9 patches.

Each patch is itself a small image with its own RGB channels;

 

a patch is an order 3 tensor 

 

Now that the image is split into 9 patches, the next step is to vectorize the patches.

Vectorization means reshaping a tensor into a vector, as in the sketch below.
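A rough sketch of the partition and vectorization steps with plain tensor operations. The 48x48 image size is an assumed toy value chosen so that 16x16 patches with stride 16 give exactly 9 patches.

import torch

image = torch.randn(3, 48, 48)    # toy RGB image (C, H, W)
patch, stride = 16, 16            # 16x16 window moving 16 pixels at a time (non-overlapping)

# Slide the window over height and width; with this stride we get a 3x3 grid of patches.
patches = image.unfold(1, patch, stride).unfold(2, patch, stride)       # (3, 3, 3, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, patch, patch)   # 9 patches, each an order-3 tensor

# Vectorization: reshape each order-3 patch tensor into a single vector x_i.
x = patches.reshape(patches.shape[0], -1)   # (9, 768): one 768-dimensional vector per patch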

 

I did not fully understand that patches part..

 

 

Suppose the image is divided into n patches and these patches have been turned into n vectors.

Applying a dense layer to each of these vectors

produces outputs like these.

Since no (nonlinear) activation function is applied here,

the dense layer is simply a linear function: z_i = W x_i + b. (linear functions)

 

W is a matrix and b is a vector; both are parameters learned from the training data.

The dense layer uses the same parameters W and b for every vector.
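In code, this shared dense layer is a single linear map applied to every patch vector. A minimal sketch continuing the toy shapes above (the 512-dimensional output size is an assumption):

import torch
import torch.nn as nn

x = torch.randn(9, 768)        # the n = 9 vectorized patches from the previous step
dense = nn.Linear(768, 512)    # one W and one b, shared by all patches, no activation
z = dense(x)                   # z_i = W x_i + b for every patch; shape (9, 512)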

We need to add positional encodings to the vectors z1 to zn.

the input image is split into n patches 

each patch has a position which is an integer between 1 and n

positional encoding maps that integer into a vector 

the shape of the vector is the same as z

Add the positional encoding vectors to the z vectors 

This way, a z vector captures both the content and the position of a patch 
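A minimal sketch of this step using learned position embeddings (the ViT paper's default choice); the shapes continue the toy example:

import torch
import torch.nn as nn

n, dim = 9, 512
z = torch.randn(n, dim)                              # patch representations z1..zn
pos_embedding = nn.Parameter(torch.randn(n, dim))    # one learned vector per position, same shape as z

z = z + pos_embedding    # each z_i now captures both the content and the position of patch i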

 

The ViT paper empirically demonstrated the benefit of using positional encoding 

Without position encoding the accuracy decreases by three percent 

 

The paper tried several positional encoding methods

 

Those methods lead to almost the same accuracy 

so it is okay to use any kind of positional encoding 

 

Why is positional encoding important?

 

The second image shows the same picture with its patches rearranged, so to us the two puppy pictures obviously look different,

but swapping the z vectors would not change the final output of the Transformer.

 

if the z vectors do not contain the positional encoding 

then, from the Transformer's point of view, the two images would look exactly the same.

 

That is clearly the wrong conclusion,

which is why we attach positional information to the patches by adding positional encodings to the z vectors.

 

With positional encoding, if the image patches get shuffled,

the positional encodings change,

and therefore the Transformer's output changes as well.

(Recap: the vectors z1 to zn here are the result of the linear transformation plus the positional encoding;

they are the representations of the n patches

they capture both the content and the positions of the patches)

 

We use the CLS token for classification.

The embedding layer takes the CLS token as input and outputs the vector z0,

which has the same shape as the other z vectors.

 

We use the CLS token because the Transformer's output at this position will be used for classification.

(Apparently the CLS token also shows up in the BERT model!)
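A sketch of prepending a learnable CLS token to the patch sequence (shapes continue the toy example; batch dimension omitted):

import torch
import torch.nn as nn

n, dim = 9, 512
z = torch.randn(n, dim)                        # z1..zn for the n patches
cls_token = nn.Parameter(torch.randn(1, dim))  # learned embedding for the CLS token, i.e. z0

tokens = torch.cat([cls_token, z], dim=0)      # (n + 1, dim): z0, z1, ..., zn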

The output of this multi-head self-attention layer is a sequence of n+1 vectors,

and the output of the dense layer is again a sequence of n+1 vectors.

You can stack as many self-attention layers and dense layers as you want.

 

Besides these layers,

the Transformer also uses skip connections and normalization.

 

These are standard tricks for improving performance.

 

The multi-head self-attention layers and dense layers make up the Transformer's encoder network.

Its output is a sequence of n+1 vectors.
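A minimal encoder-block sketch in this spirit: multi-head self-attention followed by a dense (MLP) block, each wrapped with layer normalization and a skip connection. The pre-norm layout and the sizes are assumptions for illustration.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=512, heads=8, mlp_dim=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):                 # x: (batch, n + 1, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]     # self-attention + skip connection
        x = x + self.mlp(self.norm2(x))   # dense layers + skip connection
        return x

tokens = torch.randn(1, 10, 512)          # a sequence of n + 1 = 10 token vectors
c = EncoderBlock()(tokens)                # output: still a sequence of n + 1 vectors (c0..cn)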

c0 to cn are the Transformer's outputs.

To perform the classification task we do not need the vectors c1 to cn, so they can be ignored;

what we need is the vector c0.

c0 is the feature vector extracted from the image,

and the classification is based on c0.

Feeding the vector c0 into a softmax classifier produces the vector p,

whose dimension equals the number of classes.

If the dataset has 8 classes,

p is 8-dimensional.

(The graph shows the values of p.)

During training, we compute the cross-entropy between the vector p and the ground truth,

then compute the gradient of the cross-entropy loss with respect to the model parameters,

and perform gradient descent to update the parameters.
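A sketch of the classification head and one training update; the 8-class setting and the shapes are the toy values from above, and the head is a plain linear layer followed by softmax.

import torch
import torch.nn as nn

num_classes, dim = 8, 512
classifier = nn.Linear(dim, num_classes)    # softmax classifier placed on top of c0
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)

c = torch.randn(1, 10, dim)                 # encoder output c0..cn for one image
c0 = c[:, 0]                                # only c0 is used for classification
logits = classifier(c0)
p = torch.softmax(logits, dim=-1)           # the 8-dimensional probability vector p

label = torch.tensor([3])                   # toy ground-truth class
loss = nn.functional.cross_entropy(logits, label)  # cross-entropy (softmax applied internally)
loss.backward()                             # gradient of the loss w.r.t. the parameters
optimizer.step()                            # one gradient-descent update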

 

Now that we have looked at ViT's architecture,

we need to train the model on a dataset.

(Next step is to train the model on image data)

First, the model is randomly initialized, and then we

train the model on data set A 

This dataset A has to be a large-scale dataset;

this stage is called pretraining.

Then, taking the pretrained model,

we train it again on dataset B,

which is usually smaller than dataset A;

this stage is called fine-tuning.

Dataset B can be, for example, ImageNet.

Dataset B is the target data set 

The test accuracy on dataset B is then used as the evaluation metric.
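A rough sketch of this pretrain-then-fine-tune recipe with the Hugging Face library, tying back to the FashionMNIST practice above. The checkpoint name and the 10-class head are assumptions for illustration, not the post's exact code.

import torch
from transformers import ViTForImageClassification

# Load a ViT pretrained on a large-scale dataset (dataset A, e.g. ImageNet-21k)
# and attach a fresh 10-class head for the target dataset (dataset B, e.g. FashionMNIST).
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",   # assumed checkpoint name
    num_labels=10,
)

pixel_values = torch.randn(1, 3, 224, 224)           # one preprocessed image
labels = torch.tensor([0])
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()                                       # fine-tune the weights on dataset B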

 

The larger the amount of data, the bigger ViT's advantage over ResNet.

transformer requires very large data for pretraining 

 

the bigger the pre-training data set 

the greater the advantage of transformer over resnet 

300M images or more seems to work better.

 

For ResNet, there is no difference between 100M and 300M:

the accuracy of resnet does not improve 

as the number of samples grows from 100m to 300m 

 

in sum ViT requires huge data for pre-training 

 

The Transformer has an advantage over CNNs only when the dataset used for pretraining is sufficiently large;

even 300M images is not enough.

 

https://www.youtube.com/watch?v=HZ4j_U3FC94

https://github.com/wangshusen/DeepLearning?tab=readme-ov-file