GPT-4 reads text. DALL-E generates images. Whisper transcribes audio. Three separate models, three separate pipelines, three separate APIs. Want an AI to look at a photo and write a description? Feed the image to one model, get a description, pass it to another.

GPT-4V looks at a picture and explains what is in it, all in a single request. Gemini analyzes a video and answers questions about a specific moment. Claude understands PDFs with charts, tables, and formulas.

Multimodal LLMs are not just "a model plus vision". They are a fundamentally different approach to AI: a single model that thinks in several modalities at once, understands the connections between what it sees and what it reads, and can reason across modalities.
## The Evolution of Multimodality

### Era 1: Separate Models (before 2020)

- Image → CNN → class label
- Text → Transformer → text output
- Audio → specialized model → transcription
- No interaction between modalities
- Each modality = a separate pipeline

### Era 2: Vision-Language Models (2020-2022)

- CLIP: a shared embedding space for images and text
- BLIP: image captioning with contrastive learning
- DALL-E: text → image generation
- One model, two modalities, limited integration

### Era 3: Multimodal LLMs (2023+)

- GPT-4V, Gemini, Claude 3
- Native multimodal understanding
- Interleaved image-text reasoning
- Video, audio, and document understanding
- General-purpose multimodal intelligence
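The jump from Era 1 to Era 2 hinges on one idea: CLIP maps images and text into a single embedding space where matching pairs score high cosine similarity, so retrieval and zero-shot classification reduce to a nearest-vector lookup. A toy numpy sketch of that rule (the vectors here are illustrative stand-ins, not real CLIP outputs):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for CLIP outputs: a matching image/text
# pair points in a similar direction, a mismatched pair does not.
img_cat = np.array([0.9, 0.1, 0.0])
txt_cat = np.array([0.8, 0.2, 0.1])
txt_car = np.array([0.0, 0.1, 0.9])

# The caption "a cat" is closer to the cat image than "a car" is
assert cosine_sim(img_cat, txt_cat) > cosine_sim(img_cat, txt_car)
```

In real CLIP the two encoders are trained jointly with a contrastive loss so that exactly this comparison works across millions of image-text pairs.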
## Multimodal LLM Architecture

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM


class MultimodalLLM(nn.Module):
    """Multimodal language model architecture."""

    def __init__(self, vision_encoder: str = "openai/clip-vit-large-patch14",
                 llm_model: str = "meta-llama/Llama-2-7b-hf"):
        super().__init__()

        # Vision encoder (frozen or fine-tuned)
        self.vision_encoder = CLIPVisionModel.from_pretrained(vision_encoder)
        self.vision_hidden_size = self.vision_encoder.config.hidden_size

        # LLM (frozen, LoRA, or full fine-tune)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_model)
        self.llm_hidden_size = self.llm.config.hidden_size

        # Projection layer: align vision features with the LLM embedding space
        self.vision_projection = nn.Sequential(
            nn.Linear(self.vision_hidden_size, self.llm_hidden_size),
            nn.GELU(),
            nn.Linear(self.llm_hidden_size, self.llm_hidden_size)
        )

        # Learnable special tokens marking image boundaries
        self.image_start_token = nn.Parameter(torch.randn(1, 1, self.llm_hidden_size))
        self.image_end_token = nn.Parameter(torch.randn(1, 1, self.llm_hidden_size))

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        """Encode an image into LLM-compatible embeddings."""
        vision_outputs = self.vision_encoder(pixel_values)
        # Use all patch tokens, not just CLS
        image_features = vision_outputs.last_hidden_state  # [B, num_patches, hidden]
        # Project into the LLM embedding space
        return self.vision_projection(image_features)

    def forward(self, input_ids: torch.Tensor,
                pixel_values: torch.Tensor = None,
                attention_mask: torch.Tensor = None,
                image_positions: torch.Tensor = None):
        """Forward pass with interleaved image-text input."""
        text_embeds = self.llm.get_input_embeddings()(input_ids)

        if pixel_values is not None:
            image_embeds = self.encode_image(pixel_values)
            # Insert image tokens at the specified positions.
            # Simplified: a real implementation handles variable positions
            # and multiple images per sequence.
            combined_embeds = self._interleave_embeddings(
                text_embeds, image_embeds, image_positions
            )
        else:
            combined_embeds = text_embeds

        return self.llm(
            inputs_embeds=combined_embeds,
            attention_mask=attention_mask,
            output_hidden_states=True
        )

    def _interleave_embeddings(self, text_embeds: torch.Tensor,
                               image_embeds: torch.Tensor,
                               positions: torch.Tensor) -> torch.Tensor:
        """Interleave image embeddings into the text sequence."""
        batch_size, text_len, hidden = text_embeds.shape
        num_image_tokens = image_embeds.shape[1]

        # +2 for the start/end boundary tokens
        total_len = text_len + num_image_tokens + 2
        combined = torch.zeros(batch_size, total_len, hidden,
                               device=text_embeds.device)

        for b in range(batch_size):
            pos = positions[b].item()
            # Text before the image
            combined[b, :pos] = text_embeds[b, :pos]
            # Image start token (squeeze the batch dim to match the slice shape)
            combined[b, pos:pos + 1] = self.image_start_token.squeeze(0)
            # Image embeddings
            combined[b, pos + 1:pos + 1 + num_image_tokens] = image_embeds[b]
            # Image end token
            combined[b, pos + 1 + num_image_tokens:pos + 2 + num_image_tokens] = \
                self.image_end_token.squeeze(0)
            # Text after the image
            combined[b, pos + 2 + num_image_tokens:] = text_embeds[b, pos:]

        return combined
```
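To see what `_interleave_embeddings` does to the sequence layout, here is the same splice in plain numpy with dummy vectors (illustrative shapes only): a text of length T with N image tokens inserted at `pos` yields a sequence of T + N + 2 positions, the extra two being the start/end boundary tokens.

```python
import numpy as np

T, N, H = 6, 4, 8          # text tokens, image tokens, hidden size
pos = 2                    # image inserted after the 2nd text token

text = np.ones((T, H))          # dummy text embeddings
image = 2 * np.ones((N, H))     # dummy image embeddings
start = 3 * np.ones((1, H))     # dummy <img_start> token
end = 4 * np.ones((1, H))       # dummy <img_end> token

# Layout: [text[:pos] | <img_start> | image | <img_end> | text[pos:]]
combined = np.concatenate([text[:pos], start, image, end, text[pos:]])

assert combined.shape == (T + N + 2, H)      # 12 positions total
assert (combined[pos] == 3).all()            # start token at index 2
assert (combined[pos + 1 + N] == 4).all()    # end token at index 7
```

The class above does the same thing per batch element with learnable tokens, which is why the attention mask passed to the LLM must be extended by N + 2 positions as well.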
## Visual Instruction Tuning

The LLM already knows how to reason. It needs to be taught to "see".
```python
import json

import torch
from PIL import Image
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments


class VisualInstructionDataset(Dataset):
    """Dataset for visual instruction tuning."""

    def __init__(self, data_path: str, processor, max_length: int = 2048):
        with open(data_path) as f:
            self.data = json.load(f)
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        image = Image.open(item['image_path']).convert('RGB')

        # Flatten the conversation into a single chat-formatted string
        text = ""
        for turn in item['conversations']:
            if turn['from'] == 'human':
                text += f"<|user|>\n{turn['value']}\n"
            else:
                text += f"<|assistant|>\n{turn['value']}\n"

        inputs = self.processor(
            text=text,
            images=image,
            return_tensors='pt',
            max_length=self.max_length,
            truncation=True
        )
        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'pixel_values': inputs['pixel_values'].squeeze(),
            # Note: using input_ids directly as labels also trains on the
            # user turns; production setups (e.g. LLaVA) mask them out.
            'labels': inputs['input_ids'].squeeze()
        }


class VisualInstructionTuner:
    """Training pipeline for visual instruction tuning."""

    def __init__(self, model: MultimodalLLM, processor):
        self.model = model
        self.processor = processor

    def create_instruction_data(self, image_path: str,
                                instruction: str,
                                response: str) -> dict:
        """Create a training example."""
        return {
            'image_path': image_path,
            'conversations': [
                {'from': 'human', 'value': f"<image>\n{instruction}"},
                {'from': 'assistant', 'value': response}
            ]
        }

    def train(self, train_dataset: Dataset, val_dataset: Dataset,
              output_dir: str, epochs: int = 3):
        """Fine-tune on visual instructions."""
        training_args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=epochs,
            per_device_train_batch_size=4,
            per_device_eval_batch_size=4,
            gradient_accumulation_steps=8,
            learning_rate=2e-5,
            warmup_ratio=0.03,
            lr_scheduler_type='cosine',
            logging_steps=10,
            save_strategy='epoch',
            evaluation_strategy='epoch',
            bf16=True,
            dataloader_num_workers=4
        )
        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            data_collator=self._collate_fn
        )
        trainer.train()

    def _collate_fn(self, examples):
        """Custom collate for multimodal batches."""
        return {
            'input_ids': torch.stack([e['input_ids'] for e in examples]),
            'attention_mask': torch.stack([e['attention_mask'] for e in examples]),
            'pixel_values': torch.stack([e['pixel_values'] for e in examples]),
            'labels': torch.stack([e['labels'] for e in examples])
        }
```
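One detail worth checking in the `TrainingArguments` above: with `per_device_train_batch_size=4` and `gradient_accumulation_steps=8`, each optimizer step sees an effective batch of 32 examples per device. A quick sanity check of the step arithmetic (dataset size and GPU count here are hypothetical):

```python
per_device_batch = 4
grad_accum = 8
num_gpus = 1            # assumption for this sketch
dataset_size = 10_000   # assumption for this sketch
epochs = 3

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = dataset_size // effective_batch
total_optimizer_steps = steps_per_epoch * epochs

print(effective_batch, total_optimizer_steps)  # prints: 32 936
```

Keeping the effective batch fixed while trading `per_device_train_batch_size` against `gradient_accumulation_steps` is the usual way to fit large multimodal batches into limited VRAM.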
## Practical Usage

### GPT-4V API
```python
import base64

from openai import OpenAI


def analyze_image_with_gpt4v(image_path: str, question: str) -> str:
    """Analyze an image with GPT-4V."""
    client = OpenAI()

    # Encode the image as base64
    with open(image_path, "rb") as f:
        base64_image = base64.b64encode(f.read()).decode()

    # Determine the MIME type from the file extension
    if image_path.endswith('.png'):
        mime_type = 'image/png'
    elif image_path.endswith('.gif'):
        mime_type = 'image/gif'
    else:
        mime_type = 'image/jpeg'

    response = client.chat.completions.create(
        # Newer vision-capable models (e.g. "gpt-4o") accept the same message format
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{base64_image}",
                            "detail": "high"  # or "low" for faster, cheaper processing
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content


def analyze_multiple_images(images: list, question: str) -> str:
    """Analyze several images in a single request."""
    client = OpenAI()

    content = [{"type": "text", "text": question}]
    for img_path in images:
        with open(img_path, "rb") as f:
            base64_image = base64.b64encode(f.read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
        })

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=2000
    )
    return response.choices[0].message.content
```
### Local Inference with LLaVA
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


class LLaVAInference:
    """Local inference with LLaVA."""

    def __init__(self, model_name: str = "llava-hf/llava-1.5-7b-hf"):
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = LlavaForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def generate(self, image_path: str, prompt: str,
                 max_tokens: int = 512) -> str:
        """Generate a response for an image + prompt."""
        image = Image.open(image_path).convert('RGB')

        # LLaVA-1.5 prompt template
        formatted_prompt = f"USER: <image>\n{prompt}\nASSISTANT:"

        inputs = self.processor(
            text=formatted_prompt,
            images=image,
            return_tensors='pt'
        ).to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9
            )

        # Decode and keep only the assistant's part of the transcript
        full_response = self.processor.decode(outputs[0], skip_special_tokens=True)
        return full_response.split("ASSISTANT:")[-1].strip()

    def batch_generate(self, image_paths: list, prompts: list) -> list:
        """Batch inference."""
        images = [Image.open(p).convert('RGB') for p in image_paths]
        formatted_prompts = [
            f"USER: <image>\n{p}\nASSISTANT:"
            for p in prompts
        ]
        inputs = self.processor(
            text=formatted_prompts,
            images=images,
            return_tensors='pt',
            padding=True
        ).to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=True
            )

        responses = []
        for output in outputs:
            text = self.processor.decode(output, skip_special_tokens=True)
            responses.append(text.split("ASSISTANT:")[-1].strip())
        return responses
```
## Document Understanding

```python
from pdf2image import convert_from_path


class DocumentAnalyzer:
    """Document analysis with a multimodal LLM."""

    def __init__(self, mm_model):
        self.model = mm_model

    def analyze_pdf(self, pdf_path: str, question: str) -> str:
        """Analyze a PDF document page by page."""
        # Render PDF pages to images
        images = convert_from_path(pdf_path, dpi=150)

        page_analyses = []
        for i, image in enumerate(images):
            page_prompt = f"This is page {i + 1} of a document.\n{question}"
            analysis = self.model.generate(image, page_prompt)
            page_analyses.append(f"Page {i + 1}: {analysis}")

        # Synthesize across pages with a text-only pass
        synthesis_prompt = (
            "Based on these page-by-page analyses, provide a comprehensive answer:\n"
            + "\n".join(page_analyses)
            + f"\n\nQuestion: {question}"
        )
        return self.model.generate_text(synthesis_prompt)

    def extract_tables(self, image_path: str) -> list:
        """Extract tables from an image."""
        prompt = """
        Extract all tables from this image.
        For each table:
        1. Describe its purpose
        2. List column headers
        3. Provide data in markdown table format
        """
        response = self.model.generate(image_path, prompt)
        # _parse_tables is a domain-specific parsing helper, omitted here
        return self._parse_tables(response)

    def analyze_chart(self, image_path: str) -> dict:
        """Analyze a chart or diagram."""
        prompt = """
        Analyze this chart/graph:
        1. What type of chart is this?
        2. What are the axes/labels?
        3. What is the main trend or finding?
        4. Provide key data points if visible
        5. What conclusions can be drawn?
        """
        response = self.model.generate(image_path, prompt)
        return {
            'raw_analysis': response,
            # _extract_chart_type / _extract_insights are parsing helpers, omitted here
            'chart_type': self._extract_chart_type(response),
            'key_insights': self._extract_insights(response)
        }
```
## Video Understanding

```python
import cv2


class VideoAnalyzer:
    """Video analysis with a multimodal LLM."""

    def __init__(self, mm_model, frames_per_second: float = 1.0):
        self.model = mm_model
        self.fps = frames_per_second

    def extract_frames(self, video_path: str) -> list:
        """Extract frames from a video at the target sampling rate."""
        cap = cv2.VideoCapture(video_path)
        video_fps = cap.get(cv2.CAP_PROP_FPS)
        # Guard against a zero interval when the target rate exceeds the source rate
        frame_interval = max(1, int(video_fps / self.fps))

        frames = []
        frame_idx = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx % frame_interval == 0:
                # OpenCV decodes to BGR; convert to RGB for the model
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                frames.append({
                    'frame': frame_rgb,
                    'timestamp': frame_idx / video_fps
                })
            frame_idx += 1

        cap.release()
        return frames

    def analyze_video(self, video_path: str, question: str) -> dict:
        """Full video analysis."""
        frames = self.extract_frames(video_path)

        # Analyze key frames (every 5th extracted frame)
        frame_analyses = []
        for frame_data in frames[::5]:
            timestamp = frame_data['timestamp']
            analysis = self.model.generate(
                frame_data['frame'],
                f"Describe what's happening at {timestamp:.1f}s"
            )
            frame_analyses.append({
                'timestamp': timestamp,
                'description': analysis
            })

        # Synthesize across frames (text-only helper, omitted here)
        synthesis = self._synthesize_video_analysis(frame_analyses, question)

        return {
            'frame_analyses': frame_analyses,
            'summary': synthesis,
            'duration': frames[-1]['timestamp'] if frames else 0
        }

    def find_moment(self, video_path: str, description: str) -> float:
        """Find the moment in a video that matches a description."""
        frames = self.extract_frames(video_path)

        best_match = None
        best_score = 0
        for frame_data in frames:
            prompt = f"""
            Does this frame show: "{description}"?
            Answer with a confidence score from 0 to 100.
            Just the number.
            """
            response = self.model.generate(frame_data['frame'], prompt)
            try:
                score = int(response.strip())
            except ValueError:
                # The model did not return a bare number; skip this frame
                continue
            if score > best_score:
                best_score = score
                best_match = frame_data['timestamp']

        return best_match
```
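The sampling logic in `extract_frames` reduces to simple interval arithmetic: at a source rate of `video_fps` and a target of `self.fps` frames per second, every `frame_interval`-th frame is kept. A standalone sketch of just that rule, including the `max(1, ...)` guard for target rates above the source rate:

```python
def sample_indices(video_fps: float, target_fps: float, num_frames: int) -> list:
    """Indices of the frames kept when downsampling to target_fps."""
    interval = max(1, int(video_fps / target_fps))
    return [i for i in range(num_frames) if i % interval == 0]

# 30 fps source, 1 fps target, 90 frames (3 seconds) → frames 0, 30, 60
assert sample_indices(30.0, 1.0, 90) == [0, 30, 60]
# Target above source rate: every frame is kept
assert sample_indices(30.0, 60.0, 4) == [0, 1, 2, 3]
```

At 1 frame per second a 10-minute video already produces 600 model calls, which is why `analyze_video` above subsamples again (`frames[::5]`) before invoking the model.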
## Multimodal RAG

```python
import faiss
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor


class MultimodalRAG:
    """RAG with image support."""

    def __init__(self, mm_model, text_encoder: str = "all-MiniLM-L6-v2"):
        self.mm_model = mm_model
        self.text_encoder = SentenceTransformer(text_encoder)
        self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.text_index = None
        self.image_index = None
        self.documents = []

    def index_document(self, doc: dict):
        """Index a document with text + images."""
        text_embedding = self.text_encoder.encode(doc['text'])

        # Embed any attached images with CLIP
        image_embeddings = []
        for img in doc.get('images', []):
            inputs = self.clip_processor(images=img, return_tensors="pt")
            with torch.no_grad():
                img_emb = self.clip_model.get_image_features(**inputs)
            image_embeddings.append(img_emb.numpy())

        self.documents.append({
            'text': doc['text'],
            'text_embedding': text_embedding,
            'images': doc.get('images', []),
            'image_embeddings': image_embeddings
        })

    def build_index(self):
        """Build the FAISS indexes."""
        # Text index (faiss expects float32)
        text_embeddings = np.array(
            [d['text_embedding'] for d in self.documents], dtype='float32'
        )
        self.text_index = faiss.IndexFlatIP(text_embeddings.shape[1])
        self.text_index.add(text_embeddings)

        # Image index, with a mapping back to the owning document
        all_image_embs = []
        self.image_to_doc = []
        for doc_idx, doc in enumerate(self.documents):
            for img_emb in doc['image_embeddings']:
                all_image_embs.append(img_emb.flatten())
                self.image_to_doc.append(doc_idx)
        if all_image_embs:
            image_embeddings = np.array(all_image_embs, dtype='float32')
            self.image_index = faiss.IndexFlatIP(image_embeddings.shape[1])
            self.image_index.add(image_embeddings)

    def retrieve(self, query: str, query_image=None, k: int = 5) -> list:
        """Retrieve relevant documents."""
        # Deduplicate by document index (the stored dicts contain numpy
        # arrays, so they cannot be compared for membership directly)
        result_indices = []

        # Text retrieval
        query_emb = self.text_encoder.encode([query]).astype('float32')
        _, text_indices = self.text_index.search(query_emb, k)
        for idx in text_indices[0]:
            if int(idx) not in result_indices:
                result_indices.append(int(idx))

        # Image retrieval, if a query image is provided
        if query_image is not None and self.image_index is not None:
            inputs = self.clip_processor(images=query_image, return_tensors="pt")
            with torch.no_grad():
                query_img_emb = self.clip_model.get_image_features(**inputs)
            _, img_indices = self.image_index.search(
                query_img_emb.numpy().astype('float32'), k
            )
            for idx in img_indices[0]:
                doc_idx = self.image_to_doc[idx]
                if doc_idx not in result_indices:
                    result_indices.append(doc_idx)

        return [self.documents[i] for i in result_indices[:k]]

    def answer(self, query: str, query_image=None) -> str:
        """RAG-based answering."""
        relevant_docs = self.retrieve(query, query_image)

        # Build the textual and visual context
        context_parts = []
        images_for_context = []
        for doc in relevant_docs:
            context_parts.append(doc['text'])
            images_for_context.extend(doc['images'][:2])  # max 2 images per doc

        context = "\n\n".join(context_parts)
        prompt = f"""
        Based on the following context and images, answer the question.

        Context:
        {context}

        Question: {query}

        Answer:
        """
        if images_for_context:
            return self.mm_model.generate(images_for_context[0], prompt)
        return self.mm_model.generate_text(prompt)
```
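One subtlety in the index above: `faiss.IndexFlatIP` ranks by raw inner product, so unless the embeddings are L2-normalized, vectors with large norms win regardless of direction. For cosine-similarity behavior, normalize both before `index.add()` and before `index.search()`. A numpy sketch of the difference (toy vectors, no faiss dependency):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query = np.array([1.0, 0.0])
docs = np.array([
    [10.0, 9.0],   # large norm, noticeably off the query direction
    [1.0, 0.1],    # small norm, almost parallel to the query
])

# Raw inner product favors the big vector...
assert np.argmax(docs @ query) == 0
# ...while cosine (normalized) picks the direction-aligned one
assert np.argmax(l2_normalize(docs) @ l2_normalize(query)) == 1
```

Some sentence-transformers models emit normalized embeddings already (or accept `normalize_embeddings=True` in `encode`); CLIP features are not normalized by default, so the image side of the index is where this bites first.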
## Benchmarks
| Model | VQAv2 | TextVQA | DocVQA | GQA |
|-------|-------|---------|--------|-----|
| GPT-4V | 77.2 | 78.0 | 88.4 | 65.8 |
| Gemini Pro | 71.2 | 74.6 | 88.1 | - |
| LLaVA-1.5-13B | 80.0 | 61.3 | - | 63.3 |
| Qwen-VL | 79.5 | 63.8 | 65.1 | 59.3 |
| Claude 3 Opus | - | - | 89.3 | - |
## Research Ideas

For bachelor's theses:
- Fine-tune LLaVA on a specific domain (medical images, documents)
- A multimodal RAG pipeline
- VQA evaluation on Ukrainian-language data

For master's theses:
- Video understanding with temporal reasoning
- A document AI pipeline (OCR + understanding)
- Multi-image reasoning

For PhD research:
- Novel architectures for multimodal understanding
- Hallucination reduction in vision-language models
- Efficient multimodal models for edge deployment
## Tools

Models:
- LLaVA: github.com/haotian-liu/LLaVA
- MiniGPT-4: github.com/Vision-CAIR/MiniGPT-4
- Qwen-VL: github.com/QwenLM/Qwen-VL
- OpenFlamingo: github.com/mlfoundations/open_flamingo

APIs:
- OpenAI GPT-4V
- Google Gemini
- Anthropic Claude
Multimodality is not about "bolting vision onto an LLM". It is about emergent reasoning across modalities. When a model sees a chart and reads a question, it is not simply doing OCR plus text reasoning; it is reasoning about the visual information in the context of the query.
If you are interested in building multimodal systems, from using APIs to fine-tuning your own models, contact the SKP-Degree team at skp-degree.com.ua or message us on Telegram: @kursovi_diplomy. We can help with architecture selection, data preparation, and implementation.
Keywords: multimodal LLM, GPT-4V, Gemini, LLaVA, vision-language, VQA, document AI, video understanding, thesis project, master's thesis, AI research, foundation models