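Upstage AI Lab Cohort 3 - Computer Vision & Generation

1. What Is Computer Vision?

2. Computer Vision Use Cases

3. Understanding Computer Vision Model Architecture

4. Understanding Backbones

5. Object Detection

6. Semantic Segmentation

7. Computer Vision Generation

1) The Evolution of Generative Models
2) Generative Models and Maximum Likelihood Estimation
3) Autoencoders and Variational Autoencoders
4) Generative Adversarial Networks
5) Diffusion Models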

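1. What Is Computer Vision?

: A field of computer science that deals with processing visual (vision) data on computers.

vision: a collection of visual information.

:=> A concept covering anything visible that has been converted into numbers and stored as data.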
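To make "visual information stored as numbers" concrete, here is a minimal sketch using NumPy. The pixel values are made up for illustration and do not come from a real image.

```python
import numpy as np

# A hypothetical 4x4 grayscale image: one brightness value (0-255) per pixel.
gray = np.array([
    [  0,  64, 128, 255],
    [ 64, 128, 255, 128],
    [128, 255, 128,  64],
    [255, 128,  64,   0],
], dtype=np.uint8)

# A color image adds a channel axis: height x width x 3 (R, G, B).
# The three channels here are arbitrary transformations of the gray image.
rgb = np.stack([gray, gray // 2, 255 - gray], axis=-1)

print(gray.shape)  # (4, 4)
print(rgb.shape)   # (4, 4, 3)
print(rgb[0, 3])   # the three channel values of the top-right pixel
```

Every computer vision model in the sections below takes arrays of this kind as input.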

2. Computer Vision Use Cases

Pose Estimation

Optical Character Recognition

Medical Image Analysis

Generative Models

Neural Radiance Fields

: A deep-learning-based method for reconstructing a 3D representation of a scene from 2D images

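3. Understanding Computer Vision Model Architecture

Visual Feature in Computer Vision

: Information that captures the image characteristics needed to solve computer vision tasks (classification, detection, segmentation, …)

Backbone

: Trained to extract the important features from an image

: Made up of multiple layers that extract features at different levels (low, mid, high)

Decoder

: Turns the compressed features into the output format of the target task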

4. Understanding Backbones

CNN

Convolutional Layer - slides a filter of a given size across the input image and applies the convolution operation to extract features

Activation Function - introduces non-linearity.

Pooling Layer - spatially aggregates the feature map
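The three building blocks above can be sketched in plain NumPy. This is an illustrative toy (single channel, no padding, stride 1, a hand-written edge filter), not how a deep learning framework implements them; in a real CNN the filter values are learned.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation function: keeps positive values and zeroes out the rest."""
    return np.maximum(x, 0.0)

def max_pool2d(x, size=2):
    """Spatial aggregation: keep the maximum of each size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    x = x[:h * size, :w * size]
    return x.reshape(h, size, w, size).max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = conv2d(image, kernel)  # shape (4, 4)
activated = relu(feature_map)
pooled = max_pool2d(activated)       # shape (2, 2)

print(feature_map[0])  # [0. 3. 3. 0.]: the filter fires only around the edge
print(pooled.shape)    # (2, 2)
```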

AlexNet

: An 8-layer CNN consisting of 5 convolutional layers and 3 fully connected layers

VGG

: Each block consists of 2D convolution layers and a max pooling layer (16-layer and 19-layer variants)

VGG16, VGG19

ResNet

EfficientNet

1) Increase model depth

2) Increase channel width

3) Increase input resolution
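EfficientNet scales these three dimensions together with a single coefficient (compound scaling). The sketch below only illustrates the formula. The constants are the ones reported in the EfficientNet paper for its B0 baseline and are taken here as an assumption; the function name and the base values are hypothetical, and the released models use hand-adjusted sizes rather than these exact numbers.

```python
# Per-dimension growth rates (assumed: values reported for the B0 baseline).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution

def compound_scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    """Scale depth, width and resolution together with one coefficient phi."""
    depth = base_depth * ALPHA ** phi
    width = base_width * BETA ** phi
    resolution = int(round(base_resolution * GAMMA ** phi))
    return depth, width, resolution

for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, resolution {resolution}")

# Compute cost grows roughly with depth * width^2 * resolution^2,
# so each step of phi costs about this factor more:
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_factor, 2))
```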

5. Object Detection

Predicts the bounding box (Bbox) location and category of each object

2-stage Detector

1-stage Detector

1-stage Detector vs 2-stage Detector

6. Semantic Segmentation

Classifying every pixel in an image into its corresponding class

Classification: predicting a single label for the input.
AlexNet, ResNet, Xception, etc.
Localization/Detection: predicting an object's label along with where the object is, given as a bounding box.
YOLO, R-CNN, etc.
Segmentation: predicting a label for every pixel.
FCN, SegNet, DeepLab, etc.
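The difference between classification and segmentation shows up in the shape of the output. A minimal sketch, using random numbers as stand-ins for network outputs (the class count and image size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, height, width = 3, 4, 5

# Hypothetical raw network outputs (random numbers, not a trained model).
class_scores = rng.normal(size=(num_classes,))                # classification: one score per class
pixel_scores = rng.normal(size=(num_classes, height, width))  # segmentation: one score per class per pixel

image_label = int(np.argmax(class_scores))   # a single label for the whole image
label_map = np.argmax(pixel_scores, axis=0)  # a label for every pixel

print(image_label)
print(label_map.shape)  # (4, 5): same height and width as the input
```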

https://www.jeremyjordan.me/semantic-segmentation/#dilated_convolutions

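Downsampling: the main goal is to reduce the spatial dimensions so that deep convolutions fit in less memory. This is usually done with a convolution whose stride is 2 or more, or with pooling. Some feature information is lost in the process.
Instead of ending with a fully connected layer, models usually use a Fully Convolutional Network. The FCN model introduced this approach, and most later models have adopted it.
Upsampling: increases the dimensions of the downsampled result so that it matches the input size again. A strided transpose convolution is typically used.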
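A small NumPy sketch of the round trip described above: downsample with stride-2 pooling, then upsample with a stride-2 transpose convolution. The all-ones kernel is an assumption made for readability; in a real network the kernel is learned.

```python
import numpy as np

def max_pool_stride2(x):
    """Downsample by taking the maximum of each 2x2 window."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:h * 2, :w * 2].reshape(h, 2, w, 2).max(axis=(1, 3))

def transpose_conv2d(x, kernel, stride=2):
    """Each input value stamps a scaled copy of the kernel onto the output."""
    kh, kw = kernel.shape
    oh = (x.shape[0] - 1) * stride + kh
    ow = (x.shape[1] - 1) * stride + kw
    out = np.zeros((oh, ow))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * kernel
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
down = max_pool_stride2(x)                    # (4, 4) -> (2, 2)
up = transpose_conv2d(down, np.ones((2, 2)))  # (2, 2) -> (4, 4)

print(down)
print(up.shape == x.shape)    # True: the spatial size is restored
print(np.array_equal(up, x))  # False: detail lost during downsampling does not come back
```

The last line is the motivation for skip connections: the decoder needs information from the encoder to recover fine detail.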

Skip connections combine the information from each encoder layer into the decoder so that it can produce accurate boundaries.

7. Computer Vision Generation

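1) The Evolution of Generative Models

2) Generative Models and Maximum Likelihood Estimation

A generative model can be trained by optimizing the maximum likelihood objective
The Kullback-Leibler divergence (KL divergence) serves as a criterion that can be used for this maximum likelihood optimization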
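The link between the two statements above can be written out as follows. The notation is an assumption introduced here: p_data is the true data distribution, p_θ is the model, and x_1, …, x_N are training samples.

```latex
\begin{aligned}
D_{\mathrm{KL}}\left(p_{\text{data}} \,\|\, p_\theta\right)
  &= \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_{\text{data}}(x)\right]
   - \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right] \\
\arg\min_\theta D_{\mathrm{KL}}\left(p_{\text{data}} \,\|\, p_\theta\right)
  &= \arg\max_\theta \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right]
   \approx \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)
\end{aligned}
```

The first expectation does not depend on θ, so minimizing the KL divergence from the data distribution to the model is the same as maximizing the expected log-likelihood, which is estimated by the average over the training samples.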

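3) Autoencoders and Variational Autoencoders

Autoencoder

: A type of neural network architecture designed to efficiently compress (encode) input data into its essential features and then reconstruct (decode) the original input from that compressed representation

Variational Autoencoder (VAE, 2014)

A generative model with the same structure as an autoencoder (Encoder + Decoder)
Assumes the latent variable (z) follows a standard normal distribution (prior distribution → p(z))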
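The standard normal prior enters VAE training through a KL term that pulls the encoder's output distribution toward p(z). Below is a minimal NumPy sketch of that term and of the reparameterization trick commonly used to sample z. It assumes a diagonal Gaussian encoder output (mean `mu`, log-variance `log_var`); the function names are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [1.0, -1.0]])  # two samples, 2-dimensional latent
log_var = np.zeros((2, 2))                # unit variance for both samples

z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)

print(z.shape)  # (2, 2)
print(kl)       # [0. 1.]: zero when the encoder output already matches the prior
```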

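4) Generative Adversarial Networks

: Consist of neural networks that are trained adversarially and are used as a generative model

Differences between VAE and GANs

How a VAE generates: it generates data while applying regularization as it approximates the input distribution
How GANs generate: the generative model improves through a process of discriminating generated data from real data and trying to fool the discriminator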
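The "discriminate and fool" process can be expressed as two losses. The sketch below uses made-up discriminator outputs and the commonly used non-saturating generator loss; it is an illustration under those assumptions, not the formulation from the lecture.

```python
import numpy as np

def bce(probs, targets, eps=1e-12):
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)))

# Hypothetical discriminator outputs: probability that each input is real.
d_real = np.array([0.9, 0.8, 0.95])  # on real images
d_fake = np.array([0.1, 0.3, 0.2])   # on generated images

# Discriminator: label real images as 1 and generated images as 0.
d_loss = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

# Generator: wants the discriminator to label generated images as 1.
g_loss = bce(d_fake, np.ones_like(d_fake))

print(round(d_loss, 3), round(g_loss, 3))
```

With these numbers the discriminator is winning, so its loss is small and the generator's loss is large; training alternates updates that push each loss down in turn.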

5) Diffusion Models

Forward diffusion process: data → noise

Reverse diffusion process: data ← noise

https://cvpr2022-tutorial-diffusion-models.github.io/

Denoising Diffusion Probabilistic Model (DDPM): uses an objective function that estimates the noise added at the current step
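A minimal NumPy sketch of the forward process and the noise-prediction objective. The linear beta schedule with 1000 steps is an assumption (it is a common DDPM setting), the toy 8x8 array stands in for an image, and no network is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)  # cumulative signal retention up to step t

def q_sample(x0, t, noise):
    """Forward diffusion: jump from x_0 to x_t in closed form."""
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

def ddpm_loss(predicted_noise, true_noise):
    """Simplified DDPM objective: mean squared error on the added noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))

x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy "image" in [-1, 1]
noise = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 10, noise)    # still close to the data
x_late = q_sample(x0, T - 1, noise)  # almost pure noise

print(np.abs(x_early - x0).max())
print(np.abs(x_late - noise).max())

# A model that predicts the noise exactly reaches zero loss; predicting zeros does not.
print(ddpm_loss(noise, noise), ddpm_loss(np.zeros_like(noise), noise))
```

The reverse process uses a trained noise predictor to undo these steps one at a time.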

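Latent Diffusion Model

Designed to repeat the noising operations in a latent space rather than in the high-dimensional space of real images

In the architecture diagram, the red-shaded region on the far left represents the autoencoder.
The green-shaded region in the middle represents the latent diffusion model.
The gray-shaded region on the right represents the conditioning input.
The conditioning input is used inside the diffusion model through a cross-attention operation.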
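A single-head cross-attention step can be sketched in NumPy as follows. All sizes and weights are random placeholders chosen for illustration; the point is only that queries come from the image latents while keys and values come from the condition (for example, text embeddings).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, condition_tokens, w_q, w_k, w_v):
    q = latent_tokens @ w_q     # queries from the image latents
    k = condition_tokens @ w_k  # keys from the condition
    v = condition_tokens @ w_v  # values from the condition
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
latent_tokens = rng.standard_normal((16, 8))     # 16 latent positions, 8 channels (illustrative)
condition_tokens = rng.standard_normal((5, 12))  # 5 text tokens, 12-dim embeddings (illustrative)
w_q = rng.standard_normal((8, 4))
w_k = rng.standard_normal((12, 4))
w_v = rng.standard_normal((12, 4))

out, weights = cross_attention(latent_tokens, condition_tokens, w_q, w_k, w_v)

print(out.shape)      # (16, 4): one conditioned vector per latent position
print(weights.shape)  # (16, 5): how much each latent position attends to each text token
```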

[ PythonCode ] Generate Images From Text - StableDiffusion

Install the necessary libraries

%pip install --quiet --upgrade diffusers transformers accelerate

%pip install torch

# Reinstall torchvision because the import failed

# Error message
# The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
from transformers.utils.hub import move_cache

move_cache()

# ModuleNotFoundError: No module named 'torch._custom_ops'
%pip install torch -U

%pip install torchvision

%pip install --quiet --upgrade diffusers

Using Dreamlike Photoreal

import the StableDiffusionPipeline and the PyTorch library

from diffusers import StableDiffusionPipeline
import torch

Use a Hugging Face Hub model - load the full pipeline

model_id = "dreamlike-art/dreamlike-photoreal-2.0"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

Prompts for the images to generate

prompts = ["Cute Rabbit, Ultra HD, realistic, futuristic, sharp, octane render, photoshopped, photorealistic, soft, pastel, Aesthetic, Magical background",
           "Anime style aesthetic landscape, 90's vintage style, digital art, ultra HD, 8k, photoshopped, sharp focus, surrealism, akira style, detailed line art",
           "Beautiful, abstract art of a human mind, 3D, highly detailed, 8K, aesthetic"]
images = []

Generate the images

for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]
    image.save(f'result_{i}.jpg')
    images.append(image)

images[0]

images[1]

images[2]

Manually working with the different components

import torch
from torch import autocast
import numpy as np

from transformers import CLIPTextModel, CLIPTokenizer

from diffusers import AutoencoderKL
from diffusers import LMSDiscreteScheduler
from diffusers import UNet2DConditionModel
from diffusers.schedulers.scheduling_ddim import DDIMScheduler

from tqdm import tqdm
from PIL import Image

Define the diffusion model class

class ImageDiffusionModel:

    def __init__(self, vae, tokenizer, text_encoder, unet,
                 scheduler_LMS, scheduler_DDIM):
        self.vae = vae
        self.tokenizer = tokenizer
        self.text_encoder = text_encoder
        self.unet = unet
        self.scheduler_LMS = scheduler_LMS
        self.scheduler_DDIM = scheduler_DDIM
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'


    def get_text_embeds(self, text):
        # tokenize the text
        text_input = self.tokenizer(text,
                                    padding='max_length',
                                    max_length=self.tokenizer.model_max_length,
                                    truncation=True,
                                    return_tensors='pt')
        # embed the text
        with torch.no_grad():
            text_embeds = self.text_encoder(text_input.input_ids.to(self.device))[0]

        return text_embeds

    def get_prompt_embeds(self, prompt):
        # get conditional prompt embeddings
        cond_embeds = self.get_text_embeds(prompt)
        # get unconditional prompt embeddings
        uncond_embeds = self.get_text_embeds([''] * len(prompt))
        # concatenate the above 2 embeds
        prompt_embeds = torch.cat([uncond_embeds, cond_embeds])
        return prompt_embeds

    def get_img_latents(self,
                        text_embeds,
                        height=512, width=512,
                        num_inference_steps=50,
                        guidance_scale=7.5,
                        img_latents=None):
        # if no image latent is passed, start reverse diffusion with random noise
        if img_latents is None:
            img_latents = torch.randn((text_embeds.shape[0] // 2, self.unet.config.in_channels,
                                       height // 8, width // 8)).to(self.device)
        # set the number of inference steps for the scheduler
        self.scheduler_LMS.set_timesteps(num_inference_steps)
        # scale the latent embeds
        img_latents = img_latents * self.scheduler_LMS.sigmas[0]
        # use autocast for automatic mixed precision (AMP) inference
        with autocast('cuda'):
            for i, t in tqdm(enumerate(self.scheduler_LMS.timesteps)):
                # do a single forward pass for both the conditional and unconditional latents
                latent_model_input = torch.cat([img_latents] * 2)
                sigma = self.scheduler_LMS.sigmas[i]
                latent_model_input = latent_model_input / ((sigma ** 2 + 1) ** 0.5)

                # predict noise residuals
                with torch.no_grad():
                    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeds)['sample']

                # separate predictions for unconditional and conditional outputs
                noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
                # perform guidance
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)

                # remove the noise from the current sample i.e. go from x_t to x_{t-1}
                img_latents = self.scheduler_LMS.step(noise_pred, t, img_latents)['prev_sample']

        return img_latents


    def decode_img_latents(self, img_latents):
        img_latents = img_latents / 0.18215
        with torch.no_grad():
            imgs = self.vae.decode(img_latents)["sample"]
        # load image in the CPU
        imgs = imgs.detach().cpu()
        return imgs



    def transform_imgs(self, imgs):
        # transform images from the range [-1, 1] to [0, 1]
        imgs = (imgs / 2 + 0.5).clamp(0, 1)
        # permute the channels and convert to numpy arrays
        imgs = imgs.permute(0, 2, 3, 1).numpy()
        # scale images to the range [0, 255] and convert to int
        imgs = (imgs * 255).round().astype('uint8')
        # convert to PIL Image objects
        imgs = [Image.fromarray(img) for img in imgs]
        return imgs



    def prompt_to_img(self,
                      prompts,
                      height=512, width=512,
                      num_inference_steps=50,
                      guidance_scale=7.5,
                      img_latents=None):

        # convert prompt to a list
        if isinstance(prompts, str):
            prompts = [prompts]

        # get prompt embeddings
        text_embeds = self.get_prompt_embeds(prompts)

        # get image embeddings
        img_latents = self.get_img_latents(text_embeds,
                                      height, width,
                                      num_inference_steps,
                                      guidance_scale,
                                      img_latents)
        # decode the image embeddings
        imgs = self.decode_img_latents(img_latents)
        # convert decoded image to suitable PIL Image format
        imgs = self.transform_imgs(imgs)

        return imgs



    def encode_img_latents(self, imgs):
        if not isinstance(imgs, list):
            imgs = [imgs]

        imgs = np.stack([np.array(img) for img in imgs], axis=0)
        # scale images to the range [-1, 1]
        imgs = 2 * ((imgs / 255.0) - 0.5)
        imgs = torch.from_numpy(imgs).float().permute(0, 3, 1, 2)

        # encode images
        img_latents_dist = self.vae.encode(imgs.to(self.device))
        # img_latents = img_latents_dist.sample()
        img_latents = img_latents_dist["latent_dist"].mean.clone()
        # scale images
        img_latents *= 0.18215

        return img_latents


    def get_img_latents_similar(self,
                                img_latents,
                                text_embeds,
                                height=512, width=512,
                                num_inference_steps=50,
                                guidance_scale=7.5,
                                start_step=10):

        # set the number of inference steps for the scheduler
        self.scheduler_DDIM.set_timesteps(num_inference_steps)

        if start_step > 0:
            start_timestep = self.scheduler_DDIM.timesteps[start_step]
            start_timesteps = start_timestep.repeat(img_latents.shape[0]).long()

            noise = torch.randn_like(img_latents)
            img_latents = self.scheduler_DDIM.add_noise(img_latents, noise, start_timesteps)

        # use autocast for automatic mixed precision (AMP) inference
        with autocast('cuda'):
            for i, t in tqdm(enumerate(self.scheduler_DDIM.timesteps[start_step:])):
                # do a single forward pass for both the conditional and unconditional latents
                latent_model_input = torch.cat([img_latents] * 2)

                # predict noise residuals
                with torch.no_grad():
                    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeds)['sample']

                # separate predictions for unconditional and conditional outputs
                noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
                # perform guidance
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)

                # remove the noise from the current sample i.e. go from x_t to x_{t-1}
                img_latents = self.scheduler_DDIM.step(noise_pred, t, img_latents)['prev_sample']

        return img_latents


    def similar_imgs(self,
                     img,
                     prompt,
                     height=512, width=512,
                     num_inference_steps=50,
                     guidance_scale=7.5,
                     start_step=10):

        # get image latents
        img_latents = self.encode_img_latents(img)

        if isinstance(prompt, str):
            prompt = [prompt]

        text_embeds = self.get_prompt_embeds(prompt)

        img_latents = self.get_img_latents_similar(img_latents=img_latents,
                                                   text_embeds=text_embeds,
                                                height=height, width=width,
                                                num_inference_steps=num_inference_steps,
                                                guidance_scale=guidance_scale,
                                                start_step=start_step)

        imgs = self.decode_img_latents(img_latents)
        imgs = self.transform_imgs(imgs)
        # Clear the CUDA cache
        torch.cuda.empty_cache()

        return imgs
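
The "perform guidance" step used in get_img_latents and get_img_latents_similar is classifier-free guidance. It can be examined in isolation with NumPy stand-ins for the two halves of the UNet output. The latent shape assumes a 512x512 image and 4 latent channels, as in Stable Diffusion v1; the random arrays are placeholders, not real predictions.

```python
import numpy as np

def classifier_free_guidance(noise_pred_uncond, noise_pred_cond, guidance_scale):
    """Push the prediction away from the unconditional one, toward the prompt-conditioned one."""
    return noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)

rng = np.random.default_rng(0)
latent_shape = (1, 4, 512 // 8, 512 // 8)  # batch, channels, height / 8, width / 8
noise_pred_uncond = rng.standard_normal(latent_shape)  # prediction for the empty prompt ''
noise_pred_cond = rng.standard_normal(latent_shape)    # prediction for the actual prompt

no_guidance = classifier_free_guidance(noise_pred_uncond, noise_pred_cond, 0.0)
plain_cond = classifier_free_guidance(noise_pred_uncond, noise_pred_cond, 1.0)
guided = classifier_free_guidance(noise_pred_uncond, noise_pred_cond, 7.5)

print(np.allclose(no_guidance, noise_pred_uncond))  # True: scale 0 ignores the prompt
print(np.allclose(plain_cond, noise_pred_cond))     # True: scale 1 is the conditional prediction
print(guided.shape)                                 # (1, 4, 64, 64)
```

Scales above 1, such as the default 7.5 in the class, extrapolate past the conditional prediction, which tends to make the image follow the prompt more closely.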

Load the pretrained components

device = 'cuda'

# model_name = "dreamlike-art/dreamlike-photoreal-2.0"
model_name = "CompVis/stable-diffusion-v1-4"
# Load autoencoder
vae = AutoencoderKL.from_pretrained(model_name,
                                    subfolder='vae').to(device)

# Load tokenizer and the text encoder
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder").to(device)

# Load UNet model
unet = UNet2DConditionModel.from_pretrained(model_name, subfolder='unet').to(device)

# Load scheduler
scheduler_LMS = LMSDiscreteScheduler(beta_start=0.00085,
                                 beta_end=0.012,
                                 beta_schedule='scaled_linear',
                                 num_train_timesteps=1000)

scheduler_DDIM = DDIMScheduler(beta_start=0.00085,
                               beta_end=0.012,
                               beta_schedule='scaled_linear',
                               num_train_timesteps=1000)

model = ImageDiffusionModel(vae, tokenizer, text_encoder, unet, scheduler_LMS, scheduler_DDIM)

prompts = ["A really giant cute pink barbie doll on the top of Burj Khalifa",
           "A green, scary aesthetic dragon breathing fire near a group of heroic firefighters"]


imgs = model.prompt_to_img(prompts)

imgs[0]

imgs[1]
