
Cross-Modal Embedding: When a Picture Equals a Thousand Words (a Technical Guide with Implementation)


"A dog running on the beach" is a five-word sentence. A photo of a golden retriever on a sandy shore is millions of pixels. A video recording of the same scene is thousands of frames. The audio of barking and splashing waves is millions of samples.

Four fundamentally different forms of data. One and the same concept.

The human brain does this automatically: you read "a dog on the beach" and instantly picture the image, perhaps hear the sounds, feel the texture of the sand. Different modalities, a single mental representation.

Cross-modal embedding is AI learning to do the same thing: place representations of different modalities in one mathematical space so that semantically similar concepts end up geometrically close, regardless of the form in which they are expressed.

CLIP from OpenAI was the breakthrough for text+image. ImageBind from Meta extended this to 6 modalities. And this is only the beginning of the revolution in multimodal AI.


The Idea: a Shared Representation Space

The traditional approach (isolated spaces)

A traditional architecture:

Text Encoder:
  "A dog on the beach" → [0.2, -0.5, 0.8, ...] (768-dim text space)

Image Encoder:
  [image] → [0.7, 0.1, -0.3, ...] (2048-dim image space)

Problem:
  - Different spaces, different dimensionality
  - Direct comparison is impossible
  - A separate alignment layer is needed
  - No shared understanding

Cross-modal embedding (a single space)

Cross-Modal Architecture:

Text Encoder + Projection:
  "A dog on the beach" → [0.42, -0.18, 0.73, ...] (512-dim SHARED space)

Image Encoder + Projection:
  [image] → [0.44, -0.16, 0.71, ...] (512-dim SHARED space)

Result:
  cosine_similarity(text_embed, image_embed) = 0.97 (high!)

Magic:
  - One space for all modalities
  - Direct comparison is possible
  - Close concepts = close vectors
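
As a sanity check of the numbers above, here is a toy computation on the (hypothetical) 3-dimensional slices shown in the diagram; the 0.97 refers to the full 512-dim vectors.

# Toy check: cosine similarity of the 3-dim slices from the diagram
import torch
import torch.nn.functional as F

text_embed = torch.tensor([0.42, -0.18, 0.73])
image_embed = torch.tensor([0.44, -0.16, 0.71])

sim = F.cosine_similarity(text_embed, image_embed, dim=0)
print(f"cosine similarity: {sim:.3f}")  # ~0.999 for these toy slices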

CLIP: Contrastive Language-Image Pre-training

Architecture and training principle

"""
CLIP (Contrastive Language-Image Pre-training)
OpenAI, 2021: a revolution in cross-modal learning.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, Optional
import numpy as np

class CLIP(nn.Module):
    """
    An implementation of the CLIP architecture.
    """

    def __init__(self,
                 embed_dim: int = 512,
                 # Vision
                 image_resolution: int = 224,
                 vision_layers: int = 12,
                 vision_width: int = 768,
                 vision_patch_size: int = 16,
                 # Text
                 context_length: int = 77,
                 vocab_size: int = 49408,
                 transformer_width: int = 512,
                 transformer_heads: int = 8,
                 transformer_layers: int = 12):
        super().__init__()

        self.embed_dim = embed_dim
        self.context_length = context_length

        # Vision Encoder (ViT)
        self.visual = VisionTransformer(
            input_resolution=image_resolution,
            patch_size=vision_patch_size,
            width=vision_width,
            layers=vision_layers,
            heads=vision_width // 64,  # 64 dim per head
            output_dim=embed_dim
        )

        # Text Encoder (Transformer)
        self.transformer = TextTransformer(
            width=transformer_width,
            layers=transformer_layers,
            heads=transformer_heads,
            context_length=context_length,
            vocab_size=vocab_size,
            output_dim=embed_dim
        )

        # Learnable temperature parameter
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

    def encode_image(self, image: torch.Tensor) -> torch.Tensor:
        """
        Encode image to embedding space.
        image: [B, 3, H, W] → [B, embed_dim]
        """
        return self.visual(image)

    def encode_text(self, text: torch.Tensor) -> torch.Tensor:
        """
        Encode text tokens to embedding space.
        text: [B, context_length] → [B, embed_dim]
        """
        return self.transformer(text)

    def forward(self, image: torch.Tensor, text: torch.Tensor
                ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Returns normalized image and text features.
        """
        image_features = self.encode_image(image)
        text_features = self.encode_text(text)

        # L2 normalize
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        return image_features, text_features

    def compute_similarity(self, image_features: torch.Tensor,
                          text_features: torch.Tensor) -> torch.Tensor:
        """
        Compute similarity matrix.
        [B_img, D] × [B_txt, D].T = [B_img, B_txt]
        """
        logit_scale = self.logit_scale.exp()
        logits_per_image = logit_scale * image_features @ text_features.t()
        return logits_per_image


class VisionTransformer(nn.Module):
    """
    Vision Transformer (ViT) encoder.
    """

    def __init__(self, input_resolution: int, patch_size: int,
                 width: int, layers: int, heads: int, output_dim: int):
        super().__init__()

        self.input_resolution = input_resolution
        self.output_dim = output_dim

        # Patch embedding
        self.conv1 = nn.Conv2d(
            3, width,
            kernel_size=patch_size,
            stride=patch_size,
            bias=False
        )

        scale = width ** -0.5
        self.class_embedding = nn.Parameter(scale * torch.randn(width))
        self.positional_embedding = nn.Parameter(
            scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width)
        )

        self.ln_pre = nn.LayerNorm(width)

        # Transformer blocks
        self.transformer = nn.ModuleList([
            TransformerBlock(width, heads)
            for _ in range(layers)
        ])

        self.ln_post = nn.LayerNorm(width)
        self.proj = nn.Parameter(scale * torch.randn(width, output_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Patch embedding: [B, 3, H, W] → [B, width, H/P, W/P]
        x = self.conv1(x)

        # Flatten: [B, width, H/P, W/P] → [B, width, num_patches]
        x = x.reshape(x.shape[0], x.shape[1], -1)

        # Transpose: [B, num_patches, width]
        x = x.permute(0, 2, 1)

        # Prepend CLS token
        cls = self.class_embedding.view(1, 1, -1).expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)

        # Add positional embedding
        x = x + self.positional_embedding

        x = self.ln_pre(x)

        # Transformer
        for block in self.transformer:
            x = block(x)

        x = self.ln_post(x[:, 0, :])  # Take CLS token

        # Project to shared space
        x = x @ self.proj

        return x


class TextTransformer(nn.Module):
    """
    Text Transformer encoder.
    """

    def __init__(self, width: int, layers: int, heads: int,
                 context_length: int, vocab_size: int, output_dim: int):
        super().__init__()

        self.context_length = context_length

        self.token_embedding = nn.Embedding(vocab_size, width)
        self.positional_embedding = nn.Parameter(torch.zeros(context_length, width))

        self.transformer = nn.ModuleList([
            TransformerBlock(width, heads, causal=True)
            for _ in range(layers)
        ])

        self.ln_final = nn.LayerNorm(width)
        self.text_projection = nn.Parameter(torch.randn(width, output_dim))

    def forward(self, text: torch.Tensor) -> torch.Tensor:
        # Token embedding
        x = self.token_embedding(text)  # [B, L, width]

        # Add positional embedding
        x = x + self.positional_embedding[:x.shape[1]]

        # Transformer
        for block in self.transformer:
            x = block(x)

        x = self.ln_final(x)

        # Take features from the EOT (end-of-text) token.
        # In CLIP's vocabulary the EOT token has the highest id,
        # so argmax over the token ids finds its position.
        x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)]

        # Project to shared space
        x = x @ self.text_projection

        return x


class TransformerBlock(nn.Module):
    """
    Standard transformer block.
    """

    def __init__(self, d_model: int, n_head: int, causal: bool = False):
        super().__init__()

        self.causal = causal
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln_1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.ln_2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask for text
        if self.causal:
            mask = torch.triu(
                torch.ones(x.shape[1], x.shape[1], device=x.device),
                diagonal=1
            ).bool()
        else:
            mask = None

        # Self-attention
        h = self.ln_1(x)
        h, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + h

        # MLP
        x = x + self.mlp(self.ln_2(x))

        return x
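
A quick smoke test of the pieces above with random tensors (untrained weights, so only the shapes are meaningful):

# Smoke test: random inputs through the full model
model = CLIP()
images = torch.randn(4, 3, 224, 224)        # batch of 4 RGB images
texts = torch.randint(1, 49408, (4, 77))    # batch of 4 "tokenized" captions
image_features, text_features = model(images, texts)
logits = model.compute_similarity(image_features, text_features)
print(logits.shape)  # torch.Size([4, 4]): pairwise image-text similarities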

Contrastive Loss: the heart of training

"""
InfoNCE loss: the key to contrastive learning.
"""

class ContrastiveLoss(nn.Module):
    """
    Symmetric contrastive loss for CLIP.
    """

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, image_features: torch.Tensor,
                text_features: torch.Tensor) -> torch.Tensor:
        """
        InfoNCE loss: maximize similarity for matching pairs,
        minimize it for non-matching ones.

        image_features: [B, D] normalized
        text_features: [B, D] normalized
        """
        batch_size = image_features.shape[0]

        # Compute similarity matrix [B, B]
        # logits[i, j] = similarity(image_i, text_j)
        logits = (image_features @ text_features.T) / self.temperature

        # Labels: diagonal (matching pairs)
        labels = torch.arange(batch_size, device=logits.device)

        # Image-to-text loss
        loss_i2t = F.cross_entropy(logits, labels)

        # Text-to-image loss (transpose)
        loss_t2i = F.cross_entropy(logits.T, labels)

        # Symmetric loss
        loss = (loss_i2t + loss_t2i) / 2

        return loss
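
A sanity check of the loss, assuming the class above: perfectly aligned features give a near-zero loss, while random pairs sit near ln(B):

# Sanity check: aligned vs. random features
feats = F.normalize(torch.randn(8, 512), dim=-1)
loss_fn = ContrastiveLoss(temperature=0.07)
print(loss_fn(feats, feats).item())  # near 0: the diagonal dominates
rand = F.normalize(torch.randn(8, 512), dim=-1)
print(loss_fn(feats, rand).item())   # roughly ln(8) ≈ 2.1 for unrelated pairs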


class HardNegativeMining:
    """
    Techniques for selecting informative negatives.
    """

    @staticmethod
    def in_batch_hard_negatives(similarity_matrix: torch.Tensor,
                                 k: int = 5) -> torch.Tensor:
        """
        Select top-k hardest negatives from batch.
        Hard negatives = high similarity but wrong pair.
        """
        batch_size = similarity_matrix.shape[0]

        # Mask diagonal (positive pairs)
        mask = ~torch.eye(batch_size, dtype=torch.bool,
                         device=similarity_matrix.device)

        # Get top-k hard negatives for each sample
        hard_neg_indices = []
        for i in range(batch_size):
            neg_similarities = similarity_matrix[i][mask[i]]
            _, top_k = neg_similarities.topk(k)
            hard_neg_indices.append(top_k)

        return torch.stack(hard_neg_indices)

    @staticmethod
    def cross_batch_negatives(queue: torch.Tensor,
                               current_features: torch.Tensor) -> torch.Tensor:
        """
        MoCo-style: use queue of past embeddings as negatives.
        Much larger effective batch size.
        """
        # queue: [queue_size, D]
        # current_features: [B, D]

        # Similarity with queue
        similarity = current_features @ queue.T  # [B, queue_size]

        return similarity
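
The queue itself is not defined above; here is a minimal sketch of a MoCo-style ring buffer (EmbeddingQueue is a hypothetical helper, not part of the MoCo codebase):

# Sketch: ring buffer of past embeddings used as extra negatives
class EmbeddingQueue:
    def __init__(self, queue_size: int = 65536, dim: int = 512):
        # Random normalized placeholders until the queue fills up
        self.queue = F.normalize(torch.randn(queue_size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, features: torch.Tensor):
        """Overwrite the oldest entries with the current batch."""
        b = features.shape[0]
        idx = torch.arange(self.ptr, self.ptr + b) % self.queue.shape[0]
        self.queue[idx] = features.detach()
        self.ptr = (self.ptr + b) % self.queue.shape[0]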

ImageBind: 6 modalities in one space

"""
ImageBind (Meta AI, 2023): extending the approach to 6 modalities.
Key insight: Image as the binding modality.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict, Optional

class ImageBind(nn.Module):
    """
    ImageBind: One Embedding Space to Bind Them All.

    Modalities:
    - Image (anchor)
    - Text
    - Audio
    - Video
    - Depth
    - Thermal/IMU
    """

    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.embed_dim = embed_dim

        # Modality-specific encoders. AudioEncoder and VideoEncoder are
        # defined below; the remaining encoders are assumed to follow the
        # same ViT-style pattern.
        self.encoders = nn.ModuleDict({
            'image': ImageEncoder(embed_dim),
            'text': TextEncoder(embed_dim),
            'audio': AudioEncoder(embed_dim),
            'video': VideoEncoder(embed_dim),
            'depth': DepthEncoder(embed_dim),
            'thermal': ThermalEncoder(embed_dim),
        })

        # Modality heads (project to shared space)
        self.heads = nn.ModuleDict({
            mod: nn.Linear(embed_dim, embed_dim)
            for mod in self.encoders.keys()
        })

    def encode(self, data: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """
        Encode any subset of modalities.
        """
        embeddings = {}
        for modality, tensor in data.items():
            if modality in self.encoders:
                features = self.encoders[modality](tensor)
                features = self.heads[modality](features)
                features = F.normalize(features, dim=-1)
                embeddings[modality] = features
        return embeddings

    def forward(self, data: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        return self.encode(data)


class ImageBindTraining:
    """
    Training strategy: Image as binding modality.

    Key insight:
    - We have Image-Text pairs (CLIP data)
    - We have Image-Audio pairs (video data)
    - We have Image-Depth pairs (RGB-D data)

    We DON'T need Text-Audio pairs!
    Image "binds" all modalities together.
    """

    def __init__(self, model: ImageBind):
        self.model = model
        self.loss_fn = ContrastiveLoss()

    def training_step(self, batch: Dict[str, torch.Tensor]) -> torch.Tensor:
        """
        Train with available pairs.
        Image is always one of the modalities.
        """
        embeddings = self.model(batch)

        total_loss = 0
        pairs = [
            ('image', 'text'),
            ('image', 'audio'),
            ('image', 'depth'),
            ('image', 'thermal'),
        ]

        for mod1, mod2 in pairs:
            if mod1 in embeddings and mod2 in embeddings:
                loss = self.loss_fn(embeddings[mod1], embeddings[mod2])
                total_loss += loss

        return total_loss
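
The payoff is emergent alignment: text and audio are never trained against each other, yet their embeddings can be compared directly. A hedged sketch, assuming the placeholder encoders are implemented and that text_tokens and audio_waveforms are prepared batches:

# Emergent cross-modal comparison: no text-audio pairs were ever trained on
model = ImageBind()
embeddings = model({'text': text_tokens, 'audio': audio_waveforms})
text_audio_sim = embeddings['text'] @ embeddings['audio'].T  # [B_text, B_audio]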


class AudioEncoder(nn.Module):
    """
    Audio encoder: spectrogram → embedding.
    """

    def __init__(self, embed_dim: int):
        super().__init__()

        # Mel-spectrogram settings
        self.n_mels = 128
        self.n_fft = 1024
        self.hop_length = 512

        # Audio spectrogram transformer
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=(16, 16), stride=(16, 16))
        self.pos_embed = nn.Parameter(torch.zeros(1, 256, embed_dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, 16)
            for _ in range(12)
        ])

        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, embed_dim)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        """
        audio: [B, samples] → [B, embed_dim]
        """
        # Convert to spectrogram
        spec = self._audio_to_spec(audio)  # [B, 1, n_mels, time]

        # Patch embed
        x = self.patch_embed(spec)  # [B, D, H', W']
        x = x.flatten(2).transpose(1, 2)  # [B, num_patches, D]

        # Add CLS and position
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)
        x = x + self.pos_embed[:, :x.shape[1], :]

        # Transformer
        for block in self.blocks:
            x = block(x)

        x = self.norm(x[:, 0])  # CLS token
        x = self.head(x)

        return x

    def _audio_to_spec(self, audio: torch.Tensor) -> torch.Tensor:
        """Convert waveform to mel-spectrogram."""
        # Simplified: in practice, build the transform once in __init__
        import torchaudio.transforms as T
        mel_transform = T.MelSpectrogram(
            sample_rate=16000,
            n_mels=self.n_mels,
            n_fft=self.n_fft,
            hop_length=self.hop_length
        )
        return mel_transform(audio).unsqueeze(1)
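
A shape walkthrough for the audio path, assuming 16 kHz mono input (torchaudio required):

# Shape check: 10 seconds of 16 kHz audio through the encoder
encoder = AudioEncoder(embed_dim=1024)
waveform = torch.randn(2, 160000)   # [B, samples]
# mel-spectrogram: [2, 1, 128, ~313] → 8 x 19 = 152 patches (+1 CLS token)
print(encoder(waveform).shape)      # torch.Size([2, 1024])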


class VideoEncoder(nn.Module):
    """
    Video encoder: frames → embedding.
    """

    def __init__(self, embed_dim: int, num_frames: int = 8):
        super().__init__()

        self.num_frames = num_frames

        # Per-frame encoding (reuse image encoder)
        self.image_encoder = ImageEncoder(embed_dim)

        # Temporal aggregation
        self.temporal_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, 8, batch_first=True),
            num_layers=4
        )

        self.temporal_cls = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        """
        video: [B, T, C, H, W] → [B, embed_dim]
        """
        B, T, C, H, W = video.shape

        # Encode each frame
        frames = video.view(B * T, C, H, W)
        frame_features = self.image_encoder(frames)  # [B*T, D]
        frame_features = frame_features.view(B, T, -1)  # [B, T, D]

        # Temporal aggregation
        cls = self.temporal_cls.expand(B, -1, -1)
        x = torch.cat([cls, frame_features], dim=1)
        x = self.temporal_transformer(x)

        return x[:, 0]  # CLS token

Zero-Shot Classification with CLIP

"""
Zero-shot classification: classifying without training on the target dataset.
"""

import torch
import torch.nn.functional as F
from typing import List

class ZeroShotClassifier:
    """
    Use CLIP for zero-shot image classification.
    """

    def __init__(self, model: CLIP, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.model.eval()

    def classify(self, image: torch.Tensor,
                 class_names: List[str],
                 prompt_template: str = "a photo of a {}") -> torch.Tensor:
        """
        Classify image into one of class_names.
        """
        with torch.no_grad():
            # Encode image
            image_features = self.model.encode_image(image)
            image_features = F.normalize(image_features, dim=-1)

            # Create text prompts for each class
            prompts = [prompt_template.format(name) for name in class_names]
            text_tokens = self.tokenizer(prompts)

            # Encode text
            text_features = self.model.encode_text(text_tokens)
            text_features = F.normalize(text_features, dim=-1)

            # Compute similarities
            similarity = image_features @ text_features.T

            # Temperature scaling
            logits = similarity * self.model.logit_scale.exp()

            # Probabilities
            probs = F.softmax(logits, dim=-1)

        return probs

    def batch_classify(self, images: torch.Tensor,
                       class_names: List[str]) -> torch.Tensor:
        """
        Classify batch of images.
        """
        with torch.no_grad():
            # Encode images
            image_features = self.model.encode_image(images)  # [B, D]
            image_features = F.normalize(image_features, dim=-1)

            # Encode classes (once)
            prompts = [f"a photo of a {name}" for name in class_names]
            text_tokens = self.tokenizer(prompts)
            text_features = self.model.encode_text(text_tokens)  # [C, D]
            text_features = F.normalize(text_features, dim=-1)

            # Similarity: [B, C]
            logits = (image_features @ text_features.T) * self.model.logit_scale.exp()

        return logits

    def prompt_ensemble(self, image: torch.Tensor,
                        class_names: List[str],
                        templates: List[str]) -> torch.Tensor:
        """
        Use multiple prompt templates and average.
        Improves accuracy significantly.
        """
        all_probs = []

        for template in templates:
            probs = self.classify(image, class_names, template)
            all_probs.append(probs)

        # Average across templates
        return torch.stack(all_probs).mean(dim=0)


# Prompt templates for better accuracy
IMAGENET_TEMPLATES = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a origami {}.",
    "a photo of the large {}.",
    "a {} in a video game.",
    "art of a {}.",
    "a photo of the small {}.",
    "a blurry photo of a {}.",
    "a rendering of a {}.",
    "graffiti of a {}.",
    "a tattoo of a {}.",
    "a embroidered {}.",
    "a dark photo of the {}.",
    "a bright photo of a {}.",
]
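
Putting it together; the tokenizer is assumed to be CLIP-compatible (e.g. from the clip or open_clip packages):

# Usage sketch: classify one image with prompt ensembling
classifier = ZeroShotClassifier(model, tokenizer)   # tokenizer: assumed CLIP-compatible
image = torch.randn(1, 3, 224, 224)                 # a preprocessed image batch
probs = classifier.prompt_ensemble(image, ["dog", "cat", "bird"], IMAGENET_TEMPLATES)
print(probs.argmax(dim=-1))                         # index of the predicted class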

Cross-Modal Retrieval

"""
Cross-modal retrieval: finding images by text and vice versa.
"""

import numpy as np
import faiss
import torch
import torch.nn.functional as F
from typing import List, Tuple

class CrossModalRetrieval:
    """
    Efficient cross-modal retrieval with FAISS.
    """

    def __init__(self, model: CLIP, tokenizer, embed_dim: int = 512):
        self.model = model
        self.tokenizer = tokenizer  # needed by search_by_text below
        self.embed_dim = embed_dim

        # FAISS index for fast search
        self.image_index = faiss.IndexFlatIP(embed_dim)  # Inner product = cosine for normalized
        self.image_metadata = []

        # GPU acceleration (optional)
        if faiss.get_num_gpus() > 0:
            self.image_index = faiss.index_cpu_to_gpu(
                faiss.StandardGpuResources(),
                0,
                self.image_index
            )

    def index_images(self, images: torch.Tensor, metadata: List[dict]):
        """
        Add images to searchable index.
        """
        with torch.no_grad():
            features = self.model.encode_image(images)
            features = F.normalize(features, dim=-1).cpu().numpy()

        self.image_index.add(features)
        self.image_metadata.extend(metadata)

    def search_by_text(self, query: str, k: int = 10) -> List[Tuple[dict, float]]:
        """
        Find top-k images matching text query.
        """
        with torch.no_grad():
            # Encode query
            text_tokens = self.tokenizer([query])
            query_features = self.model.encode_text(text_tokens)
            query_features = F.normalize(query_features, dim=-1).cpu().numpy()

        # Search
        similarities, indices = self.image_index.search(query_features, k)

        results = []
        for idx, sim in zip(indices[0], similarities[0]):
            results.append((self.image_metadata[idx], float(sim)))

        return results

    def search_by_image(self, query_image: torch.Tensor,
                        k: int = 10) -> List[Tuple[dict, float]]:
        """
        Find top-k similar images (image-to-image search).
        """
        with torch.no_grad():
            query_features = self.model.encode_image(query_image)
            query_features = F.normalize(query_features, dim=-1).cpu().numpy()

        similarities, indices = self.image_index.search(query_features, k)

        results = []
        for idx, sim in zip(indices[0], similarities[0]):
            results.append((self.image_metadata[idx], float(sim)))

        return results

    def cross_modal_arithmetic(self,
                               positive_texts: List[str],
                               negative_texts: List[str],
                               k: int = 10) -> List[Tuple[dict, float]]:
        """
        Embedding arithmetic: "king - man + woman = queen"

        Example: "dog" + "snow" - "grass" = snowy dog images
        """
        with torch.no_grad():
            # Encode positive concepts
            pos_tokens = self.tokenizer(positive_texts)
            pos_features = self.model.encode_text(pos_tokens)
            pos_features = F.normalize(pos_features, dim=-1)
            pos_mean = pos_features.mean(dim=0)

            # Encode negative concepts
            neg_tokens = self.tokenizer(negative_texts)
            neg_features = self.model.encode_text(neg_tokens)
            neg_features = F.normalize(neg_features, dim=-1)
            neg_mean = neg_features.mean(dim=0)

            # Arithmetic
            query = pos_mean - neg_mean
            query = F.normalize(query.unsqueeze(0), dim=-1).cpu().numpy()

        similarities, indices = self.image_index.search(query, k)

        results = []
        for idx, sim in zip(indices[0], similarities[0]):
            results.append((self.image_metadata[idx], float(sim)))

        return results
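
A usage sketch with hypothetical metadata (paths invented for illustration):

# Usage sketch: index a batch of images, then search by text
retrieval = CrossModalRetrieval(model, tokenizer)
metadata = [{"path": f"img_{i}.jpg"} for i in range(images.shape[0])]  # hypothetical
retrieval.index_images(images, metadata)

for meta, score in retrieval.search_by_text("a dog running on the beach", k=5):
    print(meta["path"], round(score, 3))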

Fine-tuning CLIP for specific domains

"""
Fine-tuning CLIP on domain-specific data.
"""

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class CLIPFineTuner:
    """
    Fine-tune CLIP for a specific domain (medical, satellite, etc.)
    """

    def __init__(self, model: CLIP,
                 freeze_backbone: bool = True,
                 learning_rate: float = 1e-5):
        self.model = model
        self.freeze_backbone = freeze_backbone

        if freeze_backbone:
            # Freeze all except projection layers
            for param in model.visual.parameters():
                param.requires_grad = False
            for param in model.transformer.parameters():
                param.requires_grad = False

            # Only train the projections and the temperature.
            # Re-enable grads: the two projections were frozen above.
            trainable = [
                model.visual.proj,
                model.transformer.text_projection,
                model.logit_scale
            ]
            for p in trainable:
                p.requires_grad = True
        else:
            trainable = model.parameters()

        self.optimizer = torch.optim.AdamW(
            [p for p in trainable if p.requires_grad],
            lr=learning_rate,
            weight_decay=0.2
        )

        self.loss_fn = ContrastiveLoss()

    def train_epoch(self, dataloader: DataLoader) -> float:
        """
        One epoch of fine-tuning.
        """
        self.model.train()
        total_loss = 0

        for images, texts in dataloader:
            self.optimizer.zero_grad()

            # Forward
            image_features, text_features = self.model(images, texts)

            # Loss
            loss = self.loss_fn(image_features, text_features)

            # Backward
            loss.backward()
            self.optimizer.step()

            total_loss += loss.item()

        return total_loss / len(dataloader)

    def adapter_tuning(self, adapter_dim: int = 64):
        """
        More efficient: add small adapter layers.
        Far fewer trainable parameters.
        """
        class Adapter(nn.Module):
            def __init__(self, in_dim: int, bottleneck_dim: int):
                super().__init__()
                self.down = nn.Linear(in_dim, bottleneck_dim)
                self.up = nn.Linear(bottleneck_dim, in_dim)
                self.act = nn.GELU()

            def forward(self, x):
                return x + self.up(self.act(self.down(x)))

        # Attach adapters to the vision transformer blocks.
        # Note: TransformerBlock.forward must also be patched to call
        # block.adapter (e.g. after the MLP) for the adapters to take effect.
        for block in self.model.visual.transformer:
            block.adapter = Adapter(block.mlp[0].in_features, adapter_dim)

        # Only train adapters
        for param in self.model.parameters():
            param.requires_grad = False

        for block in self.model.visual.transformer:
            for param in block.adapter.parameters():
                param.requires_grad = True
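
A quick way to see what adapter tuning buys, assuming the CLIP config from above:

# Compare trainable vs. total parameter counts after adapter tuning
tuner = CLIPFineTuner(model, freeze_backbone=False)
tuner.adapter_tuning(adapter_dim=64)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")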

Problems and limitations

1. Bias in cross-modal embeddings

"""
Bias detection and mitigation in CLIP.
"""

class BiasAnalysis:
    """
    Analyzing bias in cross-modal embeddings.
    """

    @staticmethod
    def occupation_gender_bias(model: CLIP, tokenizer) -> dict:
        """
        Test: "CEO" ближче до "man" чи "woman"?
        """
        occupations = ["CEO", "nurse", "engineer", "teacher", "doctor", "secretary"]
        genders = ["man", "woman"]

        with torch.no_grad():
            # Encode genders
            gender_tokens = tokenizer([f"a photo of a {g}" for g in genders])
            gender_features = model.encode_text(gender_tokens)
            gender_features = F.normalize(gender_features, dim=-1)

            results = {}
            for occupation in occupations:
                occ_tokens = tokenizer([f"a photo of a {occupation}"])
                occ_features = model.encode_text(occ_tokens)
                occ_features = F.normalize(occ_features, dim=-1)

                # Similarity to each gender
                sims = (occ_features @ gender_features.T).squeeze()
                results[occupation] = {
                    "man": sims[0].item(),
                    "woman": sims[1].item(),
                    "bias": (sims[0] - sims[1]).item()
                }

        return results


class BiasDebiasing:
    """
    Techniques for reducing bias.
    """

    @staticmethod
    def null_space_projection(embeddings: torch.Tensor,
                              bias_direction: torch.Tensor) -> torch.Tensor:
        """
        Project out bias direction.
        """
        # Normalize bias direction
        bias_direction = F.normalize(bias_direction, dim=-1)

        # Project each embedding
        projections = embeddings @ bias_direction.unsqueeze(-1)
        debiased = embeddings - projections * bias_direction

        return F.normalize(debiased, dim=-1)
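
null_space_projection needs a bias direction. A common sketch is the difference of mean concept embeddings; this assumes the CLIP model and tokenizer from above and some embeddings tensor of shape [N, D]:

# Sketch: estimate a gender direction and project it out of the embeddings
with torch.no_grad():
    man = F.normalize(model.encode_text(tokenizer(["a photo of a man"])), dim=-1)
    woman = F.normalize(model.encode_text(tokenizer(["a photo of a woman"])), dim=-1)
bias_direction = (man - woman).squeeze(0)

debiased = BiasDebiasing.null_space_projection(embeddings, bias_direction)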

Benchmarks

| Model | ImageNet Zero-Shot | Flickr30k R@1 | COCO R@1 | Training Data |
|-------|--------------------|---------------|----------|---------------|
| CLIP ViT-B/32 | 63.2% | 88.0% | 58.4% | 400M pairs |
| CLIP ViT-L/14 | 75.5% | 92.8% | 66.7% | 400M pairs |
| OpenCLIP ViT-G | 80.1% | 95.2% | 71.3% | 2B pairs |
| SigLIP | 83.2% | 96.1% | 73.5% | 4B pairs |
| ImageBind | 77.7%* | 89.4% | 65.2% | Multi-modal |


Research ideas

For a bachelor's thesis:

  • Fine-tune CLIP on a domain-specific dataset
  • A cross-modal retrieval system
  • Embedding visualization (t-SNE, UMAP)
  • Zero-shot classification experiments

For a master's thesis:

  • Adding a new modality (EEG, sensor data)
  • Compositional understanding improvements
  • Bias analysis and mitigation
  • Multi-lingual CLIP

For a PhD:

  • Theoretical analysis of cross-modal alignment
  • Scaling laws for modality combinations
  • Novel contrastive objectives
  • Cross-modal generation from embeddings

Conclusion: the foundation for multimodal AI

Cross-modal embeddings are not just a technique for image search. They are a fundamental way of representing knowledge in AI systems.

When a model "understands" that the text "dog", an image of a dog, and the sound of barking are all expressions of one and the same concept, it moves closer to a human-like understanding of the world. CLIP showed that this is possible. ImageBind extended it to more modalities. The next step: ever more modalities and ever deeper semantic understanding.

For research in this area, the SKP-Degree team at skp-degree.com.ua has hands-on experience with CLIP and multi-modal systems. Consultations are available on Telegram: @kursovi_diplomy.


Cross-modal embedding, CLIP, ImageBind, contrastive learning, multimodal AI, zero-shot classification, cross-modal retrieval, representation learning: the key terms for a bachelor's or master's thesis on multimodal artificial intelligence and representation learning.
