All Posts
-
An Introduction to Vision-Language Modeling (Paper Summary) · Computer Vision · 2024. 8. 28. 10:44
https://arxiv.org/pdf/2405.17247 Table of contents: 1. Introduction; 2. The Families of VLMs; 3. A Guide to VLM Training; 4. Extending VLMs to Videos. 1. Introduction. What is a Vision-Language Model? "In simple terms, a VLM can understand images and text jointly and relate them together." Put simply, a VLM is a model that understands images and text jointly and relates them to each other. Most recent VLMs are based on the Transformer and generally consist of an image model, a text model, and a module that fuses the two modalities…
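The three-part composition described above (image model, text model, fusion module) can be sketched minimally; every function here is an illustrative stand-in, not the paper's concrete architecture:

```python
import numpy as np

# Illustrative stubs: a real VLM would use e.g. a ViT-style image
# encoder and a Transformer language model instead of these.
def image_encoder(image: np.ndarray) -> np.ndarray:
    return np.zeros(512)  # pretend 512-d image embedding

def text_encoder(text: str) -> np.ndarray:
    return np.zeros(512)  # pretend 512-d text embedding

def fusion_module(img_feat: np.ndarray, txt_feat: np.ndarray) -> np.ndarray:
    # Simplest possible fusion: concatenate the two modalities.
    # Real models use e.g. cross-attention or learned projections.
    return np.concatenate([img_feat, txt_feat])

joint = fusion_module(image_encoder(np.zeros((224, 224, 3))),
                      text_encoder("a photo of a cat"))
print(joint.shape)  # (1024,)
```

The point of the sketch is only the data flow: two modality-specific encoders produce embeddings, and a third module combines them into one joint representation.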
-
[OS] System Call Handling · Computer Science · 2024. 4. 10. 16:39
A system call is an interface provided by the OS: a program requests OS services through the system-call interface, for example accessing the hard disk or creating a new process. An API (Application Program Interface) is a set of functions provided to the application programmer; the programmer communicates with the system through the API, and the details of the OS are hidden from the programmer. System call handling is one of the core roles of the operating system, and the handling process goes as follows. User program invocation: the user program invokes a system call. This…
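The API-versus-system-call distinction above can be illustrated with a small Python sketch (illustrative only): `print` goes through the language's buffered I/O layer, while `os.write` is a thin wrapper over the kernel's `write` system call.

```python
import os

# API layer: print() uses Python's buffered I/O stack, which
# eventually issues write system calls on our behalf.
print("via print (library API)")

# System-call layer: os.write() passes raw bytes straight to the
# kernel's write; fd 1 is standard output.
msg = b"via os.write (direct system call wrapper)\n"
written = os.write(1, msg)
```

Both lines end up at the same kernel service; the difference is that the API hides buffering, formatting, and the file-descriptor details from the programmer, exactly as described above.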
-
YOLO-World: Real-Time Open-Vocabulary Object Detection (Paper Summary) · Computer Vision · 2024. 4. 10. 15:45
https://arxiv.org/abs/2401.17270 Open-Vocabulary Object Detection: at inference time, not limited…
-
MLP-Mixer: An all-MLP Architecture for Vision (Paper Summary) · Computer Vision · 2024. 3. 29. 16:21
https://arxiv.org/abs/2105.01601 Idea: The models mainly used in computer vision so far have been CNN-based, or a…
-
Parameter-Efficient Transfer Learning for NLP (Paper Summary) · Computer Vision · 2024. 3. 27. 16:11
https://arxiv.org/abs/1902.00751 Idea: 1. What is transfer learning? Applying knowledge learned on one task to other down…
-
CLIP-IQA: Exploring CLIP for Assessing the Look and Feel of Images (Paper Summary) · Computer Vision · 2024. 3. 19. 23:00
https://arxiv.org/abs/2207.12396 Idea: CLIP, trained on large-scale image-text pairs, … human…
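One common way to turn a CLIP-style model into a quality score is to compare an image embedding against embeddings of contrasting text prompts; the vectors and prompt texts below are made-up placeholders, not real CLIP outputs or necessarily the paper's exact prompts.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings; a real pipeline would obtain these from a
# pretrained CLIP image encoder and text encoder.
img_emb  = np.array([0.9, 0.1, 0.2])
good_emb = np.array([1.0, 0.0, 0.1])  # e.g. embedding of "Good photo."
bad_emb  = np.array([0.0, 1.0, 0.3])  # e.g. embedding of "Bad photo."

s_good = cosine(img_emb, good_emb)
s_bad  = cosine(img_emb, bad_emb)

# Softmax over the two similarities gives a quality score in (0, 1):
# closer to 1 means the image sits nearer the "good" prompt.
score = np.exp(s_good) / (np.exp(s_good) + np.exp(s_bad))
```

Because no labeled ratings are used anywhere, this kind of prompt-contrast scoring only relies on what CLIP learned from its image-text pretraining.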
-
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining (Paper Notes) · Computer Vision · 2024. 3. 18. 18:30
https://arxiv.org/abs/2312.07533 Introduction: Existing IAA methods … human-labeled ratings…
-
An Image Is Worth 16x16 Words (ViT) · Computer Vision · 2024. 3. 13. 16:13
Paper link: https://arxiv.org/abs/2010.11929 The Transformer, which forms the basic structure of ViT, …
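The "16x16 words" in the title refers to splitting an image into 16x16 patches and treating each flattened patch as a token. A minimal NumPy sketch, with shapes assumed from the standard ViT-Base setup (224x224 input, patch size 16):

```python
import numpy as np

P = 16                                # patch size
image = np.zeros((224, 224, 3))       # H x W x C

# Cut the image into a 14x14 grid of 16x16x3 patches, then flatten
# each patch into one token vector of length 16*16*3 = 768.
h, w = image.shape[0] // P, image.shape[1] // P
patches = image.reshape(h, P, w, P, 3).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(h * w, P * P * 3)
print(tokens.shape)  # (196, 768): 196 patch tokens, the "words"
```

ViT then linearly projects each token, prepends a class token, adds position embeddings, and feeds the sequence to a standard Transformer encoder.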