Reading the CoCa: Contrastive Captioner Paper

Computer Vision 2024. 3. 8. 13:25

https://arxiv.org/abs/2205.01917

CoCa: Contrastive Captioners are Image-Text Foundation Models


The model architectures most commonly used for vision-language problems are the following.

Single Encoder models
- Use a single encoder, pretrained with a cross-entropy (CE) loss on image classification datasets.
- The image encoder provides generic visual representations, but it relies heavily on labeled image annotations, and the downstream tasks it can be applied to are limited.
Dual Encoder models
- Train two parallel encoders on image-text pair data with a contrastive loss. CLIP is a representative example.
- Encode the visual and text embeddings into the same latent space, which provides cross-modal alignment.
- This enables tasks such as zero-shot image classification and image-text alignment.
- However, there is no component that learns a fused image-text representation, so these models cannot be applied directly to downstream tasks such as VQA.
Encoder-Decoder models
- Use an encoder-decoder structure and learn generic vision and multimodal representations through generative pretraining.
- During pretraining, the encoder takes the image as input, and a language modeling loss is applied to the decoder outputs.
- They do not produce a text-only representation aligned with the image embeddings, so they are not efficient for cross-modal alignment tasks.
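
The zero-shot classification that dual encoders make possible can be sketched roughly as below. This is only a minimal illustration: the embeddings are hand-written toy vectors standing in for real encoder outputs, and `zero_shot_classify` is a hypothetical helper name, not something from CLIP or CoCa.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_scores(image_emb, text_embs):
    """Cosine similarity between one image embedding and each text embedding."""
    img = l2_normalize(image_emb)
    return [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_embs]

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose text embedding is closest to the image embedding."""
    scores = cosine_scores(image_emb, class_text_embs)
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best], scores

# Toy embeddings (assumed values, standing in for text/image encoder outputs).
class_names = ["cat", "dog", "car"]
class_text_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.2, 0.9, 0.1]

label, scores = zero_shot_classify(image_emb, class_text_embs, class_names)
print(label, [round(s, 3) for s in scores])
```

No classifier head is trained here: the class names are embedded as text, and the shared latent space does the rest.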

This paper proposes a single method that unifies all three approaches: a model with an encoder-decoder structure that is trained with both a contrastive loss and a captioning (generative) loss.

1. Contrastive Captioners Pretraining

To train the model with both the contrastive loss and the captioning (generative) loss, the following loss function is used, where the λ values are weighting hyperparameters:

L_CoCa = λ_Con · L_Con + λ_Cap · L_Cap

The captioning approach optimizes the conditional likelihood of the text, whereas the contrastive approach uses an unconditional text representation.
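
To make the two terms concrete, here is a rough pure-Python sketch of a weighted sum of a symmetric contrastive loss and an autoregressive captioning loss. It is an illustration under assumptions, not the authors' implementation: the embeddings are assumed to be already L2-normalized, the temperature is a fixed assumed value, the default weights are placeholders, and all function names are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image cross-entropy over a batch.

    Pair i (image_embs[i], text_embs[i]) is the positive; all other pairs
    in the batch act as negatives.
    """
    n = len(image_embs)
    sim = [[sum(a * b for a, b in zip(x, y)) / temperature for y in text_embs]
           for x in image_embs]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for text-to-image
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n))
    t2i = -sum(log_softmax(sim_t[i])[i] for i in range(n))
    return (i2t + t2i) / n

def captioning_loss(token_logits, target_ids):
    """Sum over positions t of -log P(y_t | y_<t, image)."""
    return -sum(log_softmax(logits)[t]
                for logits, t in zip(token_logits, target_ids))

def coca_loss(image_embs, text_embs, token_logits, target_ids,
              lambda_con=1.0, lambda_cap=2.0):
    """Weighted sum of both losses; the default weights are placeholders."""
    l_con = contrastive_loss(image_embs, text_embs)
    l_cap = captioning_loss(token_logits, target_ids)
    return lambda_con * l_con + lambda_cap * l_cap

# Toy batch of two image-text pairs with unit-norm embeddings (assumed values).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
token_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # decoder logits per position
target_ids = [0, 1]                                # ground-truth caption tokens

l_con = contrastive_loss(image_embs, matched_text)
l_cap = captioning_loss(token_logits, target_ids)
total = coca_loss(image_embs, matched_text, token_logits, target_ids)
print(l_con, l_cap, total)
```

The contrastive term only needs one vector per image and per text, while the captioning term needs per-token logits conditioned on the image, which is the tension between the two objectives.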

To resolve this dilemma, the authors split the decoder into two kinds of layers.

2. Decoupled Text Decoder and CoCa architecture

As the architecture figure in the paper shows, CoCa consists of two kinds of decoder layers: a unimodal text decoder and a multimodal text decoder.

The unimodal layers use no cross-attention, so they learn only a text representation and encode the input text with causally masked self-attention. That the unimodal layers learn only a text representation seems to be the main idea of the paper. In addition, when text is fed to the unimodal layers, a [CLS] token is appended to the end of the text, and the output at the [CLS] position is used to compute the contrastive loss. This is also visible in the architecture figure.
The multimodal layers apply causally masked self-attention together with cross-attention to the output of the visual encoder.
Finally, the two decoders are said to use the same number of layers.
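
One way to see why the [CLS] token goes at the end: under a causal mask, only the last position can attend to every text token. The sketch below builds the mask and the equal layer split. The token ids, `CLS_ID`, and the helper names are hypothetical and chosen only for illustration.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def append_cls(token_ids, cls_id):
    """Append the [CLS] token after the text tokens."""
    return list(token_ids) + [cls_id]

def split_decoder(n_layers):
    """Split the decoder into equal unimodal and multimodal halves.

    Unimodal layers: causal self-attention only (no cross-attention).
    Multimodal layers: causal self-attention plus cross-attention to the image.
    """
    if n_layers % 2 != 0:
        raise ValueError("n_layers must be even for an equal split")
    n_unimodal = n_layers // 2
    return n_unimodal, n_layers - n_unimodal

CLS_ID = 0                                   # hypothetical [CLS] token id
tokens = append_cls([11, 12, 13], CLS_ID)    # hypothetical text token ids
mask = causal_mask(len(tokens))

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

n_unimodal, n_multimodal = split_decoder(12)  # 12 is an assumed layer count
```

The last row of the mask is all `x`, so the [CLS] output can summarize the whole caption even though every earlier position sees only its prefix.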

3. Attentional Poolers

The contrastive loss uses a single embedding per image, whereas the decoder in a captioner usually attends to a sequence of image output tokens.
To handle this mismatch, CoCa uses task-specific attentional pooling so that the model learns to pool the encoder output into embeddings of different lengths for the two losses. Accordingly, the generative loss uses n = 256 queries and the contrastive loss uses n = 1 query.
Here, a pooler means a single multi-head attention layer with n learnable queries that uses the encoder output as both keys and values.
Another advantage of attentional poolers is that, when applying CoCa to a downstream task, the encoder can be frozen and only a new pooler needs to be trained. According to the paper, this still lets the model reach strong performance.
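
The pooling idea can be sketched as follows. This is a deliberately simplified stand-in: it uses a single attention head with no learned query/key/value projections, whereas the pooler described above is a multi-head attention layer. The dimensions, the random initialization, and the function names are assumptions for illustration; only the query counts 256 and 1 come from the text.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attentional_pool(queries, encoder_out):
    """Single-head scaled dot-product attention.

    Each query attends over the encoder outputs, which serve as both keys
    and values, and yields one pooled vector. n queries give n vectors.
    """
    dim = len(encoder_out[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in encoder_out]
        weights = softmax(scores)
        pooled.append([sum(w * v[c] for w, v in zip(weights, encoder_out))
                       for c in range(dim)])
    return pooled

def init_queries(n_query, dim, rng):
    """Randomly initialized queries; in training these would be learned."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_query)]

rng = random.Random(0)
dim, n_patches = 8, 16  # assumed toy sizes
encoder_out = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(n_patches)]

caption_queries = init_queries(256, dim, rng)     # generative loss: n = 256
contrastive_queries = init_queries(1, dim, rng)   # contrastive loss: n = 1

caption_tokens = attentional_pool(caption_queries, encoder_out)
image_embedding = attentional_pool(contrastive_queries, encoder_out)
print(len(caption_tokens), len(image_embedding), len(image_embedding[0]))
```

Because the encoder output is only read as keys and values, swapping in a new set of queries for a downstream task leaves the encoder untouched, which fits the frozen-encoder usage described above.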

4. Results

The results figure in the paper shows CoCa achieving state-of-the-art (SOTA) performance across a range of areas, including visual recognition, crossmodal alignment, and image captioning.
