Computer Vision
-
An Introduction to Vision-Language Modeling (Paper Summary) | Computer Vision | 2024. 8. 28. 10:44
https://arxiv.org/pdf/2405.17247
Contents: 1. Introduction / 2. The Families of VLMs / 3. A Guide to VLM Training / 4. Extending VLMs to Videos
1. Introduction: What is a Vision-Language Model? "In simple terms, a VLM can understand images and text jointly and relate them together." Put simply, a VLM is a model that can jointly understand images and text and relate them to one another. Most recent VLMs are Transformer-based and generally consist of an image model, a text model, and a module that fuses the two modalities..
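The preview above describes the typical three-part layout of a VLM: an image model, a text model, and a fusion module. Below is a minimal sketch of that layout, not the paper's architecture; the stand-in projections, dimensions, and cross-attention fusion are illustrative assumptions.

```python
# Minimal sketch of a VLM layout: image encoder features and text encoder
# features are projected into a joint space and fused with cross-attention.
# All dimensions and the fusion choice are illustrative assumptions.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, img_dim=768, txt_dim=512, joint_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)   # image features -> joint space
        self.txt_proj = nn.Linear(txt_dim, joint_dim)   # text features -> joint space
        self.fusion = nn.MultiheadAttention(joint_dim, num_heads=8, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_patches, img_dim), txt_feats: (B, N_tokens, txt_dim)
        q = self.txt_proj(txt_feats)
        kv = self.img_proj(img_feats)
        fused, _ = self.fusion(q, kv, kv)   # text tokens attend to image patches
        return fused

vlm = ToyVLM()
out = vlm(torch.randn(2, 196, 768), torch.randn(2, 16, 512))
print(out.shape)  # (2, 16, 512)
```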
-
YOLO-World: Real-Time Open-Vocabulary Object Detection (Paper Summary) | Computer Vision | 2024. 4. 10. 15:45
https://arxiv.org/abs/2401.17270
Open-Vocabulary Object Detection: at inference time, not limited to ..
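The core open-vocabulary idea can be sketched as matching detector region features against text embeddings of user-supplied category prompts, so the class set is not fixed at training time. This is a generic similarity-based sketch, not YOLO-World's exact head; shapes and the temperature are assumptions.

```python
# Hedged sketch of open-vocabulary classification: region features are scored
# against text-encoder embeddings of arbitrary prompts by cosine similarity.
import torch
import torch.nn.functional as F

def open_vocab_scores(region_feats, prompt_embeds, temperature=0.01):
    # region_feats: (num_regions, D) visual features from the detector
    # prompt_embeds: (num_classes, D) text embeddings of the category prompts
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(prompt_embeds, dim=-1)
    return (r @ t.T) / temperature  # (num_regions, num_classes) logits

scores = open_vocab_scores(torch.randn(100, 512), torch.randn(3, 512))
print(scores.argmax(dim=-1).shape)  # predicted class index per region
```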
-
MLP-Mixer: An all-MLP Architecture for Vision (Paper Summary) | Computer Vision | 2024. 3. 29. 16:21
https://arxiv.org/abs/2105.01601
Idea: The models mainly used in computer vision so far have been CNN-based or a..
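A minimal sketch of one Mixer block: a token-mixing MLP applied across patches and a channel-mixing MLP applied across channels, each with LayerNorm and a residual connection. Hidden sizes here are illustrative, not the paper's exact configuration.

```python
# One MLP-Mixer block: token mixing (across patches) + channel mixing (across channels).
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                               # x: (B, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)               # (B, dim, num_patches)
        x = x + self.token_mlp(y).transpose(1, 2)       # mix information across patches
        x = x + self.channel_mlp(self.norm2(x))         # mix information across channels
        return x

block = MixerBlock()
print(block(torch.randn(2, 196, 512)).shape)  # (2, 196, 512)
```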
-
Parameter-Efficient Transfer Learning for NLP (Paper Summary) | Computer Vision | 2024. 3. 27. 16:11
https://arxiv.org/abs/1902.00751
Idea: 1. What is transfer learning? Using knowledge learned on one task for other down..
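A minimal sketch of the bottleneck adapter the paper proposes: a small down-projection/up-projection with a residual connection, initialized near the identity so the pretrained model can stay frozen and only the adapter parameters are trained. Dimensions are illustrative.

```python
# Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # near-identity at initialization
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # residual connection preserves the frozen pretrained representation
        return x + self.up(self.act(self.down(x)))

adapter = Adapter()
print(adapter(torch.randn(2, 10, 768)).shape)  # (2, 10, 768)
```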
-
CLIP-IQA: Exploring CLIP for Assessing the Look and Feel of Images (Paper Summary) | Computer Vision | 2024. 3. 19. 23:00
https://arxiv.org/abs/2207.12396
Idea: CLIP, which is trained on large-scale image-text pairs, ... human ..
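CLIP-IQA scores an image by comparing it against an antonym prompt pair (e.g. "Good photo." / "Bad photo.") in CLIP embedding space and taking the softmax over the two similarities. The sketch below shows that scoring rule with stubbed-in feature tensors; plug in a real CLIP model to use it, and note the temperature value is an assumption.

```python
# CLIP-IQA-style scoring: softmax over similarities to an antonym prompt pair.
import torch
import torch.nn.functional as F

def clip_iqa_score(image_feat, good_feat, bad_feat, temperature=0.01):
    # image_feat: (D,) CLIP image embedding
    # good_feat / bad_feat: (D,) CLIP text embeddings of the antonym prompts
    feats = F.normalize(torch.stack([good_feat, bad_feat]), dim=-1)
    img = F.normalize(image_feat, dim=-1)
    logits = (img @ feats.T) / temperature
    return F.softmax(logits, dim=-1)[0]  # probability mass on the "good" prompt

score = clip_iqa_score(torch.randn(512), torch.randn(512), torch.randn(512))
print(float(score))  # value in [0, 1]; higher means better perceived quality
```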
-
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining (Paper Reading) | Computer Vision | 2024. 3. 18. 18:30
https://arxiv.org/abs/2312.07533
Introduction: Existing IAA methods ... human-labeled ratin..
-
An Image Is Worth 16x16 Words (ViT) | Computer Vision | 2024. 3. 13. 16:13
Paper link: https://arxiv.org/abs/2010.11929
The tr.. that forms the basic structure of ViT
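The title refers to treating an image as a sequence of 16x16 patch tokens. The sketch below shows that input pipeline (patch embedding via a strided convolution, a prepended [CLS] token, and learned position embeddings) with illustrative sizes; it is a sketch of ViT's input stage, not the full model.

```python
# ViT input pipeline: patchify + embed, prepend [CLS], add position embeddings.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # a conv with kernel = stride = patch size splits and embeds patches in one step
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos  # (B, num_patches + 1, dim)

embed = PatchEmbed()
print(embed(torch.randn(2, 3, 224, 224)).shape)  # (2, 197, 768)
```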
-
CoCa: Contrastive Captioner (Paper Reading) | Computer Vision | 2024. 3. 8. 13:25
https://arxiv.org/abs/2205.01917
The models mainly used for vision-language problems ..
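CoCa trains with a combined objective: an image-text contrastive loss plus an autoregressive captioning (cross-entropy) loss. The sketch below shows that combination only; the loss weights, temperature, and tensor shapes are illustrative assumptions rather than the paper's settings.

```python
# Combined CoCa-style objective: contrastive loss + captioning loss.
import torch
import torch.nn.functional as F

def coca_loss(img_emb, txt_emb, caption_logits, caption_targets,
              temperature=0.07, w_con=1.0, w_cap=2.0):
    # img_emb, txt_emb: (B, D) pooled embeddings; caption_logits: (B, T, V)
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature
    labels = torch.arange(logits.size(0))
    # symmetric InfoNCE over the in-batch image-text pairs
    con = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    # standard next-token cross-entropy for the caption decoder
    cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
    return w_con * con + w_cap * cap

loss = coca_loss(torch.randn(4, 512), torch.randn(4, 512),
                 torch.randn(4, 12, 1000), torch.randint(0, 1000, (4, 12)))
print(loss.item())
```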