"A dog running on the beach" is a sentence of six words. A photo of a golden retriever on a sandy shore is millions of pixels. A video of the same scene is thousands of frames. A recording of barking and splashing waves is millions of audio samples.
Four fundamentally different forms of data. One and the same concept.
The human brain does this automatically: you read "a dog on the beach" and instantly picture the scene, perhaps hear the sounds, feel the texture of the sand. Different modalities, one mental representation.
Cross-modal embedding is AI learning to do the same thing: placing representations of different modalities in a single mathematical space, so that semantically similar concepts end up geometrically close regardless of the form in which they are expressed.
OpenAI's CLIP was the breakthrough for text and images. Meta's ImageBind extended the idea to six modalities. And this is only the beginning of the revolution in multimodal AI.
The idea: a shared representation space
The traditional approach (isolated spaces)
Traditional architecture:
Text Encoder:
"A dog on the beach" → [0.2, -0.5, 0.8, ...] (768-dim text space)
Image Encoder:
[dog photo] → [0.7, 0.1, -0.3, ...] (2048-dim image space)
The problem:
- Different spaces, different dimensionality
- No way to compare them directly
- A separate alignment layer is required
- No shared understanding
Cross-modal embedding (a single space)
Cross-Modal Architecture:
Text Encoder + Projection:
"A dog on the beach" → [0.42, -0.18, 0.73, ...] (512-dim SHARED space)
Image Encoder + Projection:
[dog photo] → [0.44, -0.16, 0.71, ...] (512-dim SHARED space)
Result:
cosine_similarity(text_embed, image_embed) = 0.97 (high!)
The magic:
- One space for all modalities
- Vectors can be compared directly
- Similar concepts = nearby vectors
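To make the geometry concrete, here is a minimal sketch of the shared-space idea, assuming two arbitrary pre-trained backbones; the encoder dimensions, the linear projections and the random inputs are placeholders, not any particular library's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

text_dim, image_dim, shared_dim = 768, 2048, 512

# Learned projections map both modalities into the same 512-dim shared space.
text_proj = nn.Linear(text_dim, shared_dim)
image_proj = nn.Linear(image_dim, shared_dim)

text_features = torch.randn(1, text_dim)    # stand-in for an encoded "A dog on the beach"
image_features = torch.randn(1, image_dim)  # stand-in for an encoded dog photo

# Without the projections, a 768-dim and a 2048-dim vector cannot be compared at all.
text_embed = F.normalize(text_proj(text_features), dim=-1)
image_embed = F.normalize(image_proj(image_features), dim=-1)

similarity = (text_embed @ image_embed.T).item()  # cosine similarity in the shared space
print(f"cosine similarity: {similarity:.2f}")
```

With random, untrained projections the similarity hovers around zero; it is the contrastive training described below that pulls matching pairs toward values like 0.97.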
CLIP: Contrastive Language-Image Pre-training
Architecture and training principle
"""
CLIP (Contrastive Language-Image Pre-training)
OpenAI, 2021: a revolution in cross-modal learning.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, Optional
import numpy as np
class CLIP(nn.Module):
"""
A simplified implementation of the CLIP architecture.
"""
def __init__(self,
embed_dim: int = 512,
# Vision
image_resolution: int = 224,
vision_layers: int = 12,
vision_width: int = 768,
vision_patch_size: int = 16,
# Text
context_length: int = 77,
vocab_size: int = 49408,
transformer_width: int = 512,
transformer_heads: int = 8,
transformer_layers: int = 12):
super().__init__()
self.embed_dim = embed_dim
self.context_length = context_length
# Vision Encoder (ViT)
self.visual = VisionTransformer(
input_resolution=image_resolution,
patch_size=vision_patch_size,
width=vision_width,
layers=vision_layers,
heads=vision_width // 64, # 64 dim per head
output_dim=embed_dim
)
# Text Encoder (Transformer)
self.transformer = TextTransformer(
width=transformer_width,
layers=transformer_layers,
heads=transformer_heads,
context_length=context_length,
vocab_size=vocab_size,
output_dim=embed_dim
)
# Learnable temperature parameter
self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
def encode_image(self, image: torch.Tensor) -> torch.Tensor:
"""
Encode image to embedding space.
image: [B, 3, H, W] → [B, embed_dim]
"""
return self.visual(image)
def encode_text(self, text: torch.Tensor) -> torch.Tensor:
"""
Encode text tokens to embedding space.
text: [B, context_length] → [B, embed_dim]
"""
return self.transformer(text)
def forward(self, image: torch.Tensor, text: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Returns normalized image and text features.
"""
image_features = self.encode_image(image)
text_features = self.encode_text(text)
# L2 normalize
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
return image_features, text_features
def compute_similarity(self, image_features: torch.Tensor,
text_features: torch.Tensor) -> torch.Tensor:
"""
Compute similarity matrix.
[B_img, D] × [B_txt, D].T = [B_img, B_txt]
"""
logit_scale = self.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
return logits_per_image
class VisionTransformer(nn.Module):
"""
Vision Transformer (ViT) encoder.
"""
def __init__(self, input_resolution: int, patch_size: int,
width: int, layers: int, heads: int, output_dim: int):
super().__init__()
self.input_resolution = input_resolution
self.output_dim = output_dim
# Patch embedding
self.conv1 = nn.Conv2d(
3, width,
kernel_size=patch_size,
stride=patch_size,
bias=False
)
scale = width ** -0.5
self.class_embedding = nn.Parameter(scale * torch.randn(width))
self.positional_embedding = nn.Parameter(
scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width)
)
self.ln_pre = nn.LayerNorm(width)
# Transformer blocks
self.transformer = nn.ModuleList([
TransformerBlock(width, heads)
for _ in range(layers)
])
self.ln_post = nn.LayerNorm(width)
self.proj = nn.Parameter(scale * torch.randn(width, output_dim))
def forward(self, x: torch.Tensor) -> torch.Tensor:
# Patch embedding: [B, 3, H, W] → [B, width, H/P, W/P]
x = self.conv1(x)
# Flatten: [B, width, H/P, W/P] → [B, width, num_patches]
x = x.reshape(x.shape[0], x.shape[1], -1)
# Transpose: [B, num_patches, width]
x = x.permute(0, 2, 1)
# Prepend CLS token
cls = self.class_embedding.view(1, 1, -1).expand(x.shape[0], -1, -1)
x = torch.cat([cls, x], dim=1)
# Add positional embedding
x = x + self.positional_embedding
x = self.ln_pre(x)
# Transformer
for block in self.transformer:
x = block(x)
x = self.ln_post(x[:, 0, :]) # Take CLS token
# Project to shared space
x = x @ self.proj
return x
class TextTransformer(nn.Module):
"""
Text Transformer encoder.
"""
def __init__(self, width: int, layers: int, heads: int,
context_length: int, vocab_size: int, output_dim: int):
super().__init__()
self.context_length = context_length
self.token_embedding = nn.Embedding(vocab_size, width)
self.positional_embedding = nn.Parameter(torch.zeros(context_length, width))
self.transformer = nn.ModuleList([
TransformerBlock(width, heads, causal=True)
for _ in range(layers)
])
self.ln_final = nn.LayerNorm(width)
self.text_projection = nn.Parameter(torch.randn(width, output_dim))
def forward(self, text: torch.Tensor) -> torch.Tensor:
# Token embedding
x = self.token_embedding(text) # [B, L, width]
# Add positional embedding
x = x + self.positional_embedding[:x.shape[1]]
# Transformer
for block in self.transformer:
x = block(x)
x = self.ln_final(x)
# Take features from the EOT (end-of-text) token.
# argmax works because EOT has the highest token id in CLIP's BPE vocabulary.
x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)]
# Project to shared space
x = x @ self.text_projection
return x
class TransformerBlock(nn.Module):
"""
Standard transformer block.
"""
def __init__(self, d_model: int, n_head: int, causal: bool = False):
super().__init__()
self.causal = causal
self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
self.ln_1 = nn.LayerNorm(d_model)
self.mlp = nn.Sequential(
nn.Linear(d_model, d_model * 4),
nn.GELU(),
nn.Linear(d_model * 4, d_model)
)
self.ln_2 = nn.LayerNorm(d_model)
def forward(self, x: torch.Tensor) -> torch.Tensor:
# Causal mask for text
if self.causal:
mask = torch.triu(
torch.ones(x.shape[1], x.shape[1], device=x.device),
diagonal=1
).bool()
else:
mask = None
# Self-attention
h = self.ln_1(x)
h, _ = self.attn(h, h, h, attn_mask=mask)
x = x + h
# MLP
x = x + self.mlp(self.ln_2(x))
return x
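Before moving to the loss, a quick smoke test of the classes above; the tensors are dummies, and the random token ids merely stand in for real tokenizer output:

```python
model = CLIP(embed_dim=512)

images = torch.randn(4, 3, 224, 224)      # dummy image batch
texts = torch.randint(0, 49408, (4, 77))  # dummy token ids, 77 = context_length

image_features, text_features = model(images, texts)              # both [4, 512], L2-normalized
logits = model.compute_similarity(image_features, text_features)  # [4, 4] similarity matrix
print(logits.shape)
```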
Contrastive Loss: the heart of training
"""
InfoNCE loss: the key to contrastive learning.
"""
class ContrastiveLoss(nn.Module):
"""
Symmetric contrastive loss for CLIP.
"""
def __init__(self, temperature: float = 0.07):
super().__init__()
self.temperature = temperature
def forward(self, image_features: torch.Tensor,
text_features: torch.Tensor) -> torch.Tensor:
"""
InfoNCE loss: maximize similarity for matching pairs,
minimize it for non-matching ones.
image_features: [B, D] normalized
text_features: [B, D] normalized
"""
batch_size = image_features.shape[0]
# Compute similarity matrix [B, B]
# logits[i, j] = similarity(image_i, text_j)
logits = (image_features @ text_features.T) / self.temperature
# Labels: diagonal (matching pairs)
labels = torch.arange(batch_size, device=logits.device)
# Image-to-text loss
loss_i2t = F.cross_entropy(logits, labels)
# Text-to-image loss (transpose)
loss_t2i = F.cross_entropy(logits.T, labels)
# Symmetric loss
loss = (loss_i2t + loss_t2i) / 2
return loss
class HardNegativeMining:
"""
Techniques for selecting informative negatives.
"""
@staticmethod
def in_batch_hard_negatives(similarity_matrix: torch.Tensor,
k: int = 5) -> torch.Tensor:
"""
Select top-k hardest negatives from batch.
Hard negatives = high similarity but wrong pair.
"""
batch_size = similarity_matrix.shape[0]
# Mask diagonal (positive pairs)
mask = ~torch.eye(batch_size, dtype=torch.bool,
device=similarity_matrix.device)
# Get top-k hard negatives for each sample
hard_neg_indices = []
for i in range(batch_size):
neg_similarities = similarity_matrix[i][mask[i]]
_, top_k = neg_similarities.topk(k)
hard_neg_indices.append(top_k)
return torch.stack(hard_neg_indices)
@staticmethod
def cross_batch_negatives(queue: torch.Tensor,
current_features: torch.Tensor,
queue_size: int = 65536) -> torch.Tensor:
"""
MoCo-style: use queue of past embeddings as negatives.
Much larger effective batch size.
"""
# queue: [queue_size, D]
# current_features: [B, D]
# Similarity with queue
similarity = current_features @ queue.T # [B, queue_size]
return similarity
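A toy training step tying the pieces together, reusing the dummy model and batch from the smoke test above:

```python
loss_fn = ContrastiveLoss(temperature=0.07)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

image_features, text_features = model(images, texts)  # normalized [B, D] features
loss = loss_fn(image_features, text_features)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"contrastive loss: {loss.item():.3f}")  # roughly ln(B) for an untrained model
```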
ImageBind: six modalities in one space
"""
ImageBind (Meta AI, 2023): extending the approach to six modalities.
Key insight: Image as the binding modality.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict, Optional
class ImageBind(nn.Module):
"""
ImageBind: One Embedding Space to Bind Them All.
Modalities:
- Image (anchor)
- Text
- Audio
- Video
- Depth
- Thermal/IMU
"""
def __init__(self, embed_dim: int = 1024):
super().__init__()
self.embed_dim = embed_dim
# Modality-specific encoders
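# ImageEncoder, TextEncoder, DepthEncoder and ThermalEncoder are assumed to be
# defined elsewhere, analogously to the AudioEncoder and VideoEncoder shown below.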
self.encoders = nn.ModuleDict({
'image': ImageEncoder(embed_dim),
'text': TextEncoder(embed_dim),
'audio': AudioEncoder(embed_dim),
'video': VideoEncoder(embed_dim),
'depth': DepthEncoder(embed_dim),
'thermal': ThermalEncoder(embed_dim),
})
# Modality heads (project to shared space)
self.heads = nn.ModuleDict({
mod: nn.Linear(embed_dim, embed_dim)
for mod in self.encoders.keys()
})
def encode(self, data: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
"""
Encode any subset of modalities.
"""
embeddings = {}
for modality, tensor in data.items():
if modality in self.encoders:
features = self.encoders[modality](tensor)
features = self.heads[modality](features)
features = F.normalize(features, dim=-1)
embeddings[modality] = features
return embeddings
def forward(self, data: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
return self.encode(data)
class ImageBindTraining:
"""
Training strategy: Image as binding modality.
Key insight:
- We have Image-Text pairs (CLIP data)
- We have Image-Audio pairs (video data)
- We have Image-Depth pairs (RGB-D data)
We DON'T need Text-Audio pairs!
Image "binds" all modalities together.
"""
def __init__(self, model: ImageBind):
self.model = model
self.loss_fn = ContrastiveLoss()
def training_step(self, batch: Dict[str, torch.Tensor]) -> torch.Tensor:
"""
Train with available pairs.
Image is always one of the modalities.
"""
embeddings = self.model(batch)
total_loss = 0
pairs = [
('image', 'text'),
('image', 'audio'),
('image', 'depth'),
('image', 'thermal'),
]
for mod1, mod2 in pairs:
if mod1 in embeddings and mod2 in embeddings:
loss = self.loss_fn(embeddings[mod1], embeddings[mod2])
total_loss += loss
return total_loss
class AudioEncoder(nn.Module):
"""
Audio encoder: spectrogram → embedding.
"""
def __init__(self, embed_dim: int):
super().__init__()
# Mel-spectrogram settings
self.n_mels = 128
self.n_fft = 1024
self.hop_length = 512
# Audio spectrogram transformer
self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=(16, 16), stride=(16, 16))
self.pos_embed = nn.Parameter(torch.zeros(1, 256, embed_dim))
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.blocks = nn.ModuleList([
TransformerBlock(embed_dim, 16)
for _ in range(12)
])
self.norm = nn.LayerNorm(embed_dim)
self.head = nn.Linear(embed_dim, embed_dim)
def forward(self, audio: torch.Tensor) -> torch.Tensor:
"""
audio: [B, samples] → [B, embed_dim]
"""
# Convert to spectrogram
spec = self._audio_to_spec(audio) # [B, 1, n_mels, time]
# Patch embed
x = self.patch_embed(spec) # [B, D, H', W']
x = x.flatten(2).transpose(1, 2) # [B, num_patches, D]
# Add CLS and position
cls = self.cls_token.expand(x.shape[0], -1, -1)
x = torch.cat([cls, x], dim=1)
x = x + self.pos_embed[:, :x.shape[1], :]
# Transformer
for block in self.blocks:
x = block(x)
x = self.norm(x[:, 0]) # CLS token
x = self.head(x)
return x
def _audio_to_spec(self, audio: torch.Tensor) -> torch.Tensor:
"""Convert waveform to mel-spectrogram."""
# Simplified — use torchaudio in practice
import torchaudio.transforms as T
mel_transform = T.MelSpectrogram(
sample_rate=16000,
n_mels=self.n_mels,
n_fft=self.n_fft,
hop_length=self.hop_length
)
return mel_transform(audio).unsqueeze(1)
class VideoEncoder(nn.Module):
"""
Video encoder: frames → embedding.
"""
def __init__(self, embed_dim: int, num_frames: int = 8):
super().__init__()
self.num_frames = num_frames
# Per-frame encoding (reuse image encoder)
self.image_encoder = ImageEncoder(embed_dim)
# Temporal aggregation
self.temporal_transformer = nn.TransformerEncoder(
nn.TransformerEncoderLayer(embed_dim, 8, batch_first=True),
num_layers=4
)
self.temporal_cls = nn.Parameter(torch.zeros(1, 1, embed_dim))
def forward(self, video: torch.Tensor) -> torch.Tensor:
"""
video: [B, T, C, H, W] → [B, embed_dim]
"""
B, T, C, H, W = video.shape
# Encode each frame
frames = video.view(B * T, C, H, W)
frame_features = self.image_encoder(frames) # [B*T, D]
frame_features = frame_features.view(B, T, -1) # [B, T, D]
# Temporal aggregation
cls = self.temporal_cls.expand(B, -1, -1)
x = torch.cat([cls, frame_features], dim=1)
x = self.temporal_transformer(x)
return x[:, 0] # CLS token
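With encoders for each modality in place, the key property of ImageBind is emergent alignment: modalities that were never paired during training can still be compared. A hedged sketch with dummy tensors; it assumes the modality encoders referenced in ImageBind (including the image and text encoders not shown here) are actually defined:

```python
model = ImageBind(embed_dim=1024)

batch = {
    "text": torch.randint(0, 49408, (2, 77)),  # dummy token ids
    "audio": torch.randn(2, 32000),            # two seconds of 16 kHz dummy waveform
}
embeddings = model(batch)

# Text and audio were never trained as a pair, yet their embeddings live in the
# same space, because both modalities were aligned to images during training.
text_audio_sim = embeddings["text"] @ embeddings["audio"].T  # [2, 2]
print(text_audio_sim)
```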
Zero-Shot Classification with CLIP
"""
Zero-shot classification: classifying without training on the target dataset.
"""
import torch
import torch.nn.functional as F
from typing import List
class ZeroShotClassifier:
"""
Use CLIP for zero-shot image classification.
"""
def __init__(self, model: CLIP, tokenizer):
self.model = model
self.tokenizer = tokenizer
self.model.eval()
def classify(self, image: torch.Tensor,
class_names: List[str],
prompt_template: str = "a photo of a {}") -> torch.Tensor:
"""
Classify image into one of class_names.
"""
with torch.no_grad():
# Encode image
image_features = self.model.encode_image(image)
image_features = F.normalize(image_features, dim=-1)
# Create text prompts for each class
prompts = [prompt_template.format(name) for name in class_names]
text_tokens = self.tokenizer(prompts)
# Encode text
text_features = self.model.encode_text(text_tokens)
text_features = F.normalize(text_features, dim=-1)
# Compute similarities
similarity = image_features @ text_features.T
# Temperature scaling
logits = similarity * self.model.logit_scale.exp()
# Probabilities
probs = F.softmax(logits, dim=-1)
return probs
def batch_classify(self, images: torch.Tensor,
class_names: List[str]) -> torch.Tensor:
"""
Classify batch of images.
"""
with torch.no_grad():
# Encode images
image_features = self.model.encode_image(images) # [B, D]
image_features = F.normalize(image_features, dim=-1)
# Encode classes (once)
prompts = [f"a photo of a {name}" for name in class_names]
text_tokens = self.tokenizer(prompts)
text_features = self.model.encode_text(text_tokens) # [C, D]
text_features = F.normalize(text_features, dim=-1)
# Similarity: [B, C]
logits = (image_features @ text_features.T) * self.model.logit_scale.exp()
return logits
def prompt_ensemble(self, image: torch.Tensor,
class_names: List[str],
templates: List[str]) -> torch.Tensor:
"""
Use multiple prompt templates and average.
Typically improves accuracy by a few points.
"""
all_probs = []
for template in templates:
probs = self.classify(image, class_names, template)
all_probs.append(probs)
# Average across templates
return torch.stack(all_probs).mean(dim=0)
# Prompt templates for better accuracy
IMAGENET_TEMPLATES = [
"a photo of a {}.",
"a bad photo of a {}.",
"a origami {}.",
"a photo of the large {}.",
"a {} in a video game.",
"art of a {}.",
"a photo of the small {}.",
"a blurry photo of a {}.",
"a rendering of a {}.",
"graffiti of a {}.",
"a tattoo of a {}.",
"a embroidered {}.",
"a dark photo of the {}.",
"a bright photo of a {}.",
]
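A usage sketch for the classifier above; clip_model, clip_tokenizer and the image tensor are placeholders, any CLIP checkpoint with a matching tokenizer would do:

```python
# clip_model: a trained CLIP instance; clip_tokenizer: maps List[str] -> token id tensor
classifier = ZeroShotClassifier(clip_model, clip_tokenizer)

image = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed image
class_names = ["dog", "cat", "horse"]

# Single template
probs = classifier.classify(image, class_names)

# Prompt ensembling over the templates above
probs_ens = classifier.prompt_ensemble(image, class_names, IMAGENET_TEMPLATES)
print(dict(zip(class_names, probs_ens.squeeze().tolist())))
```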
Cross-Modal Retrieval
"""
Cross-modal retrieval: find images from a text query, and vice versa.
"""
import numpy as np
import faiss
import torch
import torch.nn.functional as F
from typing import List, Tuple
class CrossModalRetrieval:
"""
Efficient cross-modal retrieval with FAISS.
"""
def __init__(self, model: CLIP, tokenizer, embed_dim: int = 512):
self.model = model
self.tokenizer = tokenizer
self.embed_dim = embed_dim
# FAISS index for fast similarity search
self.image_index = faiss.IndexFlatIP(embed_dim) # Inner product = cosine for normalized
self.image_metadata = []
# GPU acceleration (optional)
if faiss.get_num_gpus() > 0:
self.image_index = faiss.index_cpu_to_gpu(
faiss.StandardGpuResources(),
0,
self.image_index
)
def index_images(self, images: torch.Tensor, metadata: List[dict]):
"""
Add images to searchable index.
"""
with torch.no_grad():
features = self.model.encode_image(images)
features = F.normalize(features, dim=-1).cpu().numpy()
self.image_index.add(features)
self.image_metadata.extend(metadata)
def search_by_text(self, query: str, k: int = 10) -> List[Tuple[dict, float]]:
"""
Find top-k images matching text query.
"""
with torch.no_grad():
# Encode query
text_tokens = self.tokenizer([query])
query_features = self.model.encode_text(text_tokens)
query_features = F.normalize(query_features, dim=-1).cpu().numpy()
# Search
similarities, indices = self.image_index.search(query_features, k)
results = []
for i, (idx, sim) in enumerate(zip(indices[0], similarities[0])):
results.append((self.image_metadata[idx], float(sim)))
return results
def search_by_image(self, query_image: torch.Tensor,
k: int = 10) -> List[Tuple[dict, float]]:
"""
Find top-k similar images (image-to-image search).
"""
with torch.no_grad():
query_features = self.model.encode_image(query_image)
query_features = F.normalize(query_features, dim=-1).cpu().numpy()
similarities, indices = self.image_index.search(query_features, k)
results = []
for idx, sim in zip(indices[0], similarities[0]):
results.append((self.image_metadata[idx], float(sim)))
return results
def cross_modal_arithmetic(self,
positive_texts: List[str],
negative_texts: List[str],
k: int = 10) -> List[Tuple[dict, float]]:
"""
Embedding arithmetic: "king - man + woman = queen"
Example: "dog" + "snow" - "grass" = snowy dog images
"""
with torch.no_grad():
# Encode positive concepts
pos_tokens = self.tokenizer(positive_texts)
pos_features = self.model.encode_text(pos_tokens)
pos_features = F.normalize(pos_features, dim=-1)
pos_mean = pos_features.mean(dim=0)
# Encode negative concepts
neg_tokens = self.tokenizer(negative_texts)
neg_features = self.model.encode_text(neg_tokens)
neg_features = F.normalize(neg_features, dim=-1)
neg_mean = neg_features.mean(dim=0)
# Arithmetic
query = pos_mean - neg_mean
query = F.normalize(query.unsqueeze(0), dim=-1).cpu().numpy()
similarities, indices = self.image_index.search(query, k)
results = []
for idx, sim in zip(indices[0], similarities[0]):
results.append((self.image_metadata[idx], float(sim)))
return results
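A usage sketch for the retrieval class; the image batch, the metadata and the query strings are placeholders:

```python
retrieval = CrossModalRetrieval(clip_model, clip_tokenizer)

# Index a batch of preprocessed images together with arbitrary metadata
images = torch.randn(100, 3, 224, 224)
metadata = [{"id": i, "path": f"img_{i}.jpg"} for i in range(100)]
retrieval.index_images(images, metadata)

# Text-to-image search
for meta, score in retrieval.search_by_text("a dog playing on the beach", k=5):
    print(meta["path"], round(score, 3))

# Embedding arithmetic: dog + snow - grass
results = retrieval.cross_modal_arithmetic(
    positive_texts=["a dog", "snow"],
    negative_texts=["grass"],
    k=5,
)
```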
Fine-tuning CLIP for specific domains
"""
Fine-tuning CLIP on domain-specific data.
"""
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
class CLIPFineTuner:
"""
Fine-tune CLIP for a specific domain (medical, satellite imagery, etc.).
"""
def __init__(self, model: CLIP,
freeze_backbone: bool = True,
learning_rate: float = 1e-5):
self.model = model
self.freeze_backbone = freeze_backbone
if freeze_backbone:
# Freeze all except projection layers
for param in model.visual.parameters():
param.requires_grad = False
for param in model.transformer.parameters():
param.requires_grad = False
# Only train the projection layers and the temperature
trainable = [
model.visual.proj,
model.transformer.text_projection,
model.logit_scale
]
for param in trainable:
param.requires_grad = True  # re-enable: the loops above also froze the projections
else:
trainable = model.parameters()
self.optimizer = torch.optim.AdamW(
[p for p in trainable if p.requires_grad],
lr=learning_rate,
weight_decay=0.2
)
self.loss_fn = ContrastiveLoss()
def train_epoch(self, dataloader: DataLoader) -> float:
"""
One epoch of fine-tuning.
"""
self.model.train()
total_loss = 0
for images, texts in dataloader:
self.optimizer.zero_grad()
# Forward
image_features, text_features = self.model(images, texts)
# Loss
loss = self.loss_fn(image_features, text_features)
# Backward
loss.backward()
self.optimizer.step()
total_loss += loss.item()
return total_loss / len(dataloader)
def adapter_tuning(self, adapter_dim: int = 64):
"""
More efficient: add small adapter layers.
Much less trainable parameters.
"""
class Adapter(nn.Module):
def __init__(self, in_dim: int, bottleneck_dim: int):
super().__init__()
self.down = nn.Linear(in_dim, bottleneck_dim)
self.up = nn.Linear(bottleneck_dim, in_dim)
self.act = nn.GELU()
def forward(self, x):
return x + self.up(self.act(self.down(x)))
# Add adapters to the vision transformer blocks.
# Note: TransformerBlock.forward must also be patched to call block.adapter
# (for example after the MLP residual) for these adapters to take effect.
for block in self.model.visual.transformer:
block.adapter = Adapter(block.mlp[0].in_features, adapter_dim)
# Only train adapters
for param in self.model.parameters():
param.requires_grad = False
for block in self.model.visual.transformer:
for param in block.adapter.parameters():
param.requires_grad = True
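A sketch of the fine-tuning loop; train_loader is a placeholder for a DataLoader that yields (preprocessed image, tokenized caption) pairs from the target domain:

```python
finetuner = CLIPFineTuner(clip_model, freeze_backbone=True, learning_rate=1e-5)

for epoch in range(5):
    avg_loss = finetuner.train_epoch(train_loader)  # images [B, 3, 224, 224], tokens [B, 77]
    print(f"epoch {epoch}: loss {avg_loss:.3f}")
```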
Problems and limitations
1. Bias in cross-modal embeddings
"""
Bias detection and mitigation in CLIP.
"""
class BiasAnalysis:
"""
Analyze bias in cross-modal embeddings.
"""
@staticmethod
def occupation_gender_bias(model: CLIP, tokenizer) -> dict:
"""
Test: "CEO" ближче до "man" чи "woman"?
"""
occupations = ["CEO", "nurse", "engineer", "teacher", "doctor", "secretary"]
genders = ["man", "woman"]
with torch.no_grad():
# Encode genders
gender_tokens = tokenizer([f"a photo of a {g}" for g in genders])
gender_features = model.encode_text(gender_tokens)
gender_features = F.normalize(gender_features, dim=-1)
results = {}
for occupation in occupations:
occ_tokens = tokenizer([f"a photo of a {occupation}"])
occ_features = model.encode_text(occ_tokens)
occ_features = F.normalize(occ_features, dim=-1)
# Similarity to each gender
sims = (occ_features @ gender_features.T).squeeze()
results[occupation] = {
"man": sims[0].item(),
"woman": sims[1].item(),
"bias": (sims[0] - sims[1]).item()
}
return results
class BiasDebiasing:
"""
Techniques for reducing bias.
"""
@staticmethod
def null_space_projection(embeddings: torch.Tensor,
bias_direction: torch.Tensor) -> torch.Tensor:
"""
Project out bias direction.
"""
# Normalize bias direction
bias_direction = F.normalize(bias_direction, dim=-1)
# Project each embedding
projections = embeddings @ bias_direction.unsqueeze(-1)
debiased = embeddings - projections * bias_direction
return F.normalize(debiased, dim=-1)
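A sketch of how the two classes above might be combined; the bias direction here is simply the difference of the "man" and "woman" text embeddings, which is a crude proxy rather than a complete debiasing method (clip_model and clip_tokenizer are placeholders):

```python
with torch.no_grad():
    gender_tokens = clip_tokenizer(["a photo of a man", "a photo of a woman"])
    gender_features = F.normalize(clip_model.encode_text(gender_tokens), dim=-1)

# Crude bias direction: man minus woman
bias_direction = gender_features[0] - gender_features[1]

# Inspect occupation bias before debiasing
report = BiasAnalysis.occupation_gender_bias(clip_model, clip_tokenizer)
print(report["CEO"])

# Project the bias direction out of an embedding
occ_features = F.normalize(clip_model.encode_text(clip_tokenizer(["a photo of a CEO"])), dim=-1)
debiased = BiasDebiasing.null_space_projection(occ_features, bias_direction)
```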
Benchmarks
| Model | ImageNet Zero-Shot | Flickr30k R@1 | COCO R@1 | Training Data |
|-------|-------------------|---------------|----------|---------------|
| CLIP ViT-B/32 | 63.2% | 88.0% | 58.4% | 400M pairs |
| CLIP ViT-L/14 | 75.5% | 92.8% | 66.7% | 400M pairs |
| OpenCLIP ViT-G | 80.1% | 95.2% | 71.3% | 2B pairs |
| SigLIP | 83.2% | 96.1% | 73.5% | 4B pairs |
| ImageBind | 77.7%* | 89.4% | 65.2% | Multi-modal |
Research ideas
For a bachelor's thesis:
- Fine-tune CLIP on a domain-specific dataset
- A cross-modal retrieval system
- Embedding visualization with t-SNE or UMAP (see the sketch after this list)
- Zero-shot classification experiments
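A minimal sketch of the visualization idea from the list above; it assumes you already have matched image and text embeddings from the shared space as NumPy arrays, and uses scikit-learn's t-SNE with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# image_embeds, text_embeds: [N, D] arrays of normalized embeddings from the shared space
all_embeds = np.concatenate([image_embeds, text_embeds], axis=0)

# Project to 2D; perplexity must stay below the number of samples
coords = TSNE(n_components=2, perplexity=min(30, len(all_embeds) - 1)).fit_transform(all_embeds)

n = len(image_embeds)
plt.scatter(coords[:n, 0], coords[:n, 1], label="images")
plt.scatter(coords[n:, 0], coords[n:, 1], label="texts")
plt.legend()
plt.title("CLIP embeddings: t-SNE of the shared space")
plt.show()
```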
For a master's thesis:
- Adding a new modality (EEG, sensor data)
- Improving compositional understanding
- Bias analysis and mitigation
- Multilingual CLIP
For a PhD:
- Theoretical analysis of cross-modal alignment
- Scaling laws for modality combinations
- Novel contrastive objectives
- Cross-modal generation from embeddings
Conclusion: a foundation for multimodal AI
Cross-modal embeddings are not just a technique for image search. They are a fundamental way of representing knowledge in AI systems.
When a model "understands" that the word "dog", a picture of a dog and the sound of barking are all expressions of the same concept, it moves closer to a human-like understanding of the world. CLIP showed that this is possible. ImageBind extended it to more modalities. The next step is ever more modalities and ever deeper semantic understanding.
For research in this area, the SKP-Degree team at skp-degree.com.ua has experience working with CLIP and multimodal systems. Consultations are available on Telegram: @kursovi_diplomy.
Cross-modal embedding, CLIP, ImageBind, contrastive learning, multimodal AI, zero-shot classification, cross-modal retrieval, representation learning: key terms for a bachelor's or master's thesis on multimodal artificial intelligence and representation learning.