CLIPVisionTransformer / CLIPTextTransformer forward body
CLIPVisionTransformer takes pixel_values, applies the patch/position embeddings and the pre layer norm, passes the result through the encoder, and then keeps only the 0th (class) token of the last hidden state as the pooled output.

class CLIPVisionTransformer(nn.Module):
    def forward(
        self,
        pixel_values=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        hidden_states = self.embeddings(pixel_values)      # patch + position embeddings
        hidden_states = self.pre_layrnorm(hidden_states)   # sic: "pre_layrnorm" is the attribute name in the transformers source
        encoder_outputs = self.encoder(
            inputs_embeds=hidden_states,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        print('last hidden', encoder_outputs[0].shape)
        last_hidden_state = encoder_outputs[0]
        pooled_output = last_hidden_state[:, 0, :]         # keep only the 0th (class) token
        pooled_output = self.post_layernorm(pooled_output)
        print('pooled', pooled_output.shape)
        # (the return of the output object is omitted in this excerpt)
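To see these shapes in practice, the vision tower can be run on its own through CLIPVisionModel. This is a minimal sketch, not part of the original post; the openai/clip-vit-base-patch32 checkpoint and the random pixel_values tensor are illustrative assumptions standing in for a real preprocessed image.

import torch
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# random tensor with the expected input shape for this checkpoint (3 x 224 x 224)
pixel_values = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

print(outputs.last_hidden_state.shape)  # torch.Size([1, 50, 768]) -> 49 patches + 1 class token
print(outputs.pooler_output.shape)      # torch.Size([1, 768])     -> the 0th token after post_layernorm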
CLIPTextTransformer embeds input_ids, builds a causal attention mask (plus the expanded padding mask, if one is given), runs the encoder, applies the final layer norm, and then pools the hidden state at the EOT token position.

class CLIPTextTransformer(nn.Module):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        position_ids=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1])
        hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)

        bsz, seq_len = input_shape
        # CLIP's text model uses a causal mask; prepare it here.
        causal_attention_mask = self._build_causal_attention_mask(bsz, seq_len).to(hidden_states.device)
        # expand attention_mask: [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
        if attention_mask is not None:
            attention_mask = _expand_mask(attention_mask, hidden_states.dtype)

        encoder_outputs = self.encoder(
            inputs_embeds=hidden_states,
            attention_mask=attention_mask,
            causal_attention_mask=causal_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        print('encoder', encoder_outputs)

        last_hidden_state = encoder_outputs[0]
        last_hidden_state = self.final_layer_norm(last_hidden_state)
        print('last', last_hidden_state.shape)

        # text_embeds.shape = [batch_size, sequence_length, transformer.width]
        # take features from the eot embedding (the eot token has the highest token id in each sequence)
        pooled_output = last_hidden_state[torch.arange(last_hidden_state.shape[0]), input_ids.argmax(dim=-1)]
        # (the return of the output object is omitted in this excerpt)
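The same EOT pooling can be checked end to end with CLIPTextModel. This is a minimal sketch, not part of the original post; the checkpoint name and the example sentence are illustrative assumptions. Because the <|endoftext|> token has the largest id in CLIP's vocabulary, input_ids.argmax(dim=-1) lands on the EOT position, and the vector taken there should match the model's pooler_output.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer("a photo of a cat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs.last_hidden_state            # [1, seq_len, 512] for this checkpoint
eot_positions = inputs.input_ids.argmax(dim=-1)           # index of <|endoftext|> (largest token id)
pooled = last_hidden_state[torch.arange(last_hidden_state.shape[0]), eot_positions]

print(pooled.shape)                                       # torch.Size([1, 512])
print(torch.allclose(pooled, outputs.pooler_output))      # expected True: pooler_output is this EOT slice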