GPT-4 reads text. DALL-E generates images. Whisper transcribes audio. Three separate models, three separate pipelines, three separate APIs. Want an AI to look at a photo and write a description? Feed the image to one model, get a description, pass it to another.

GPT-4V looks at a picture and explains what is in it, all in a single request. Gemini analyzes a video and answers questions about a specific moment. Claude understands PDFs with charts, tables, and formulas.

Multimodal LLMs are not just "a model plus vision". They are a fundamentally different approach to AI: a single model that thinks in several modalities at once, understands the connections between what it sees and what it reads, and can reason across modalities.
## The Evolution of Multimodality

### Era 1: Separate Models (before 2020)

- Image → CNN → class label
- Text → Transformer → text output
- Audio → specialized model → transcription
- No interaction between modalities
- Each modality = a separate pipeline

### Era 2: Vision-Language Models (2020-2022)

- CLIP: a shared embedding space for images and text
- BLIP: image captioning with contrastive learning
- DALL-E: text → image generation
- One model, two modalities, limited integration

### Era 3: Multimodal LLMs (2023+)

- GPT-4V, Gemini, Claude 3
- Native multimodal understanding
- Interleaved image-text reasoning
- Video, audio, and document understanding
- General-purpose multimodal intelligence
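The jump from Era 1 to Era 2 hinges on one idea: CLIP maps images and text into a single embedding space where matching pairs score high cosine similarity, so retrieval and zero-shot classification reduce to a nearest-vector lookup. A toy numpy sketch of that rule (the vectors here are illustrative stand-ins, not real CLIP outputs):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for CLIP outputs: a matching image/text
# pair points in a similar direction, a mismatched pair does not.
img_cat = np.array([0.9, 0.1, 0.0])
txt_cat = np.array([0.8, 0.2, 0.1])
txt_car = np.array([0.0, 0.1, 0.9])

# The caption "a cat" is closer to the cat image than "a car" is
assert cosine_sim(img_cat, txt_cat) > cosine_sim(img_cat, txt_car)
```

In real CLIP the two encoders are trained jointly with a contrastive loss so that exactly this comparison works across millions of image-text pairs.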
## Multimodal LLM Architecture

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM


class MultimodalLLM(nn.Module):
    """Multimodal language model architecture."""

    def __init__(self, vision_encoder: str = "openai/clip-vit-large-patch14",
                 llm_model: str = "meta-llama/Llama-2-7b-hf"):
        super().__init__()

        # Vision encoder (frozen or fine-tuned)
        self.vision_encoder = CLIPVisionModel.from_pretrained(vision_encoder)
        self.vision_hidden_size = self.vision_encoder.config.hidden_size

        # LLM (frozen, LoRA, or full fine-tune)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_model)
        self.llm_hidden_size = self.llm.config.hidden_size

        # Projection layer: align vision features with the LLM embedding space
        self.vision_projection = nn.Sequential(
            nn.Linear(self.vision_hidden_size, self.llm_hidden_size),
            nn.GELU(),
            nn.Linear(self.llm_hidden_size, self.llm_hidden_size)
        )

        # Learnable special tokens marking image boundaries
        self.image_start_token = nn.Parameter(torch.randn(1, 1, self.llm_hidden_size))
        self.image_end_token = nn.Parameter(torch.randn(1, 1, self.llm_hidden_size))

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        """Encode an image into LLM-compatible embeddings."""
        vision_outputs = self.vision_encoder(pixel_values)
        # Use all patch tokens, not just CLS
        image_features = vision_outputs.last_hidden_state  # [B, num_patches, hidden]
        # Project into the LLM embedding space
        return self.vision_projection(image_features)

    def forward(self, input_ids: torch.Tensor,
                pixel_values: torch.Tensor = None,
                attention_mask: torch.Tensor = None,
                image_positions: torch.Tensor = None):
        """Forward pass with interleaved image-text input."""
        text_embeds = self.llm.get_input_embeddings()(input_ids)

        if pixel_values is not None:
            image_embeds = self.encode_image(pixel_values)
            # Insert image tokens at the specified positions.
            # Simplified: a real implementation handles variable positions
            # and multiple images per sequence.
            combined_embeds = self._interleave_embeddings(
                text_embeds, image_embeds, image_positions
            )
        else:
            combined_embeds = text_embeds

        return self.llm(
            inputs_embeds=combined_embeds,
            attention_mask=attention_mask,
            output_hidden_states=True
        )

    def _interleave_embeddings(self, text_embeds: torch.Tensor,
                               image_embeds: torch.Tensor,
                               positions: torch.Tensor) -> torch.Tensor:
        """Interleave image embeddings into the text sequence."""
        batch_size, text_len, hidden = text_embeds.shape
        num_image_tokens = image_embeds.shape[1]

        # +2 for the start/end boundary tokens
        total_len = text_len + num_image_tokens + 2
        combined = torch.zeros(batch_size, total_len, hidden,
                               device=text_embeds.device)

        for b in range(batch_size):
            pos = positions[b].item()
            # Text before the image
            combined[b, :pos] = text_embeds[b, :pos]
            # Image start token (squeeze the batch dim to match the slice shape)
            combined[b, pos:pos + 1] = self.image_start_token.squeeze(0)
            # Image embeddings
            combined[b, pos + 1:pos + 1 + num_image_tokens] = image_embeds[b]
            # Image end token
            combined[b, pos + 1 + num_image_tokens:pos + 2 + num_image_tokens] = \
                self.image_end_token.squeeze(0)
            # Text after the image
            combined[b, pos + 2 + num_image_tokens:] = text_embeds[b, pos:]

        return combined
```
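To see what `_interleave_embeddings` does to the sequence layout, here is the same splice in plain numpy with dummy vectors (illustrative shapes only): a text of length T with N image tokens inserted at `pos` yields a sequence of T + N + 2 positions, the extra two being the start/end boundary tokens.

```python
import numpy as np

T, N, H = 6, 4, 8          # text tokens, image tokens, hidden size
pos = 2                    # image inserted after the 2nd text token

text = np.ones((T, H))          # dummy text embeddings
image = 2 * np.ones((N, H))     # dummy image embeddings
start = 3 * np.ones((1, H))     # dummy <img_start> token
end = 4 * np.ones((1, H))       # dummy <img_end> token

# Layout: [text[:pos] | <img_start> | image | <img_end> | text[pos:]]
combined = np.concatenate([text[:pos], start, image, end, text[pos:]])

assert combined.shape == (T + N + 2, H)      # 12 positions total
assert (combined[pos] == 3).all()            # start token at index 2
assert (combined[pos + 1 + N] == 4).all()    # end token at index 7
```

The class above does the same thing per batch element with learnable tokens, which is why the attention mask passed to the LLM must be extended by N + 2 positions as well.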
## Visual Instruction Tuning

The LLM already knows how to reason. It needs to be taught to "see".
```python
import json

import torch
from PIL import Image
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments


class VisualInstructionDataset(Dataset):
    """Dataset for visual instruction tuning."""

    def __init__(self, data_path: str, processor, max_length: int = 2048):
        with open(data_path) as f:
            self.data = json.load(f)
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        image = Image.open(item['image_path']).convert('RGB')

        # Flatten the conversation into a single chat-formatted string
        text = ""
        for turn in item['conversations']:
            if turn['from'] == 'human':
                text += f"<|user|>\n{turn['value']}\n"
            else:
                text += f"<|assistant|>\n{turn['value']}\n"

        inputs = self.processor(
            text=text,
            images=image,
            return_tensors='pt',
            max_length=self.max_length,
            truncation=True
        )
        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'pixel_values': inputs['pixel_values'].squeeze(),
            # Note: using input_ids directly as labels also trains on the
            # user turns; production setups (e.g. LLaVA) mask them out.
            'labels': inputs['input_ids'].squeeze()
        }


class VisualInstructionTuner:
    """Training pipeline for visual instruction tuning."""

    def __init__(self, model: MultimodalLLM, processor):
        self.model = model
        self.processor = processor

    def create_instruction_data(self, image_path: str,
                                instruction: str,
                                response: str) -> dict:
        """Create a training example."""
        return {
            'image_path': image_path,
            'conversations': [
                {'from': 'human', 'value': f"<image>\n{instruction}"},
                {'from': 'assistant', 'value': response}
            ]
        }

    def train(self, train_dataset: Dataset, val_dataset: Dataset,
              output_dir: str, epochs: int = 3):
        """Fine-tune on visual instructions."""
        training_args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=epochs,
            per_device_train_batch_size=4,
            per_device_eval_batch_size=4,
            gradient_accumulation_steps=8,
            learning_rate=2e-5,
            warmup_ratio=0.03,
            lr_scheduler_type='cosine',
            logging_steps=10,
            save_strategy='epoch',
            evaluation_strategy='epoch',
            bf16=True,
            dataloader_num_workers=4
        )
        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            data_collator=self._collate_fn
        )
        trainer.train()

    def _collate_fn(self, examples):
        """Custom collate for multimodal batches."""
        return {
            'input_ids': torch.stack([e['input_ids'] for e in examples]),
            'attention_mask': torch.stack([e['attention_mask'] for e in examples]),
            'pixel_values': torch.stack([e['pixel_values'] for e in examples]),
            'labels': torch.stack([e['labels'] for e in examples])
        }
```
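One detail worth checking in the `TrainingArguments` above: with `per_device_train_batch_size=4` and `gradient_accumulation_steps=8`, each optimizer step sees an effective batch of 32 examples per device. A quick sanity check of the step arithmetic (dataset size and GPU count here are hypothetical):

```python
per_device_batch = 4
grad_accum = 8
num_gpus = 1            # assumption for this sketch
dataset_size = 10_000   # assumption for this sketch
epochs = 3

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = dataset_size // effective_batch
total_optimizer_steps = steps_per_epoch * epochs

print(effective_batch, total_optimizer_steps)  # prints: 32 936
```

Keeping the effective batch fixed while trading `per_device_train_batch_size` against `gradient_accumulation_steps` is the usual way to fit large multimodal batches into limited VRAM.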
## Practical Usage

### GPT-4V API
```python
import base64

from openai import OpenAI


def analyze_image_with_gpt4v(image_path: str, question: str) -> str:
    """Analyze an image with GPT-4V."""
    client = OpenAI()

    # Encode the image as base64
    with open(image_path, "rb") as f:
        base64_image = base64.b64encode(f.read()).decode()

    # Determine the MIME type from the file extension
    if image_path.endswith('.png'):
        mime_type = 'image/png'
    elif image_path.endswith('.gif'):
        mime_type = 'image/gif'
    else:
        mime_type = 'image/jpeg'

    response = client.chat.completions.create(
        # Newer vision-capable models (e.g. "gpt-4o") accept the same message format
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{base64_image}",
                            "detail": "high"  # or "low" for faster, cheaper processing
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content


def analyze_multiple_images(images: list, question: str) -> str:
    """Analyze several images in a single request."""
    client = OpenAI()

    content = [{"type": "text", "text": question}]
    for img_path in images:
        with open(img_path, "rb") as f:
            base64_image = base64.b64encode(f.read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
        })

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=2000
    )
    return response.choices[0].message.content
```
### Local Inference with LLaVA
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


class LLaVAInference:
    """Local inference with LLaVA."""

    def __init__(self, model_name: str = "llava-hf/llava-1.5-7b-hf"):
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = LlavaForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def generate(self, image_path: str, prompt: str,
                 max_tokens: int = 512) -> str:
        """Generate a response for an image + prompt."""
        image = Image.open(image_path).convert('RGB')

        # LLaVA-1.5 prompt template
        formatted_prompt = f"USER: <image>\n{prompt}\nASSISTANT:"

        inputs = self.processor(
            text=formatted_prompt,
            images=image,
            return_tensors='pt'
        ).to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9
            )

        # Decode and keep only the assistant's part of the transcript
        full_response = self.processor.decode(outputs[0], skip_special_tokens=True)
        return full_response.split("ASSISTANT:")[-1].strip()

    def batch_generate(self, image_paths: list, prompts: list) -> list:
        """Batch inference."""
        images = [Image.open(p).convert('RGB') for p in image_paths]
        formatted_prompts = [
            f"USER: <image>\n{p}\nASSISTANT:"
            for p in prompts
        ]
        inputs = self.processor(
            text=formatted_prompts,
            images=images,
            return_tensors='pt',
            padding=True
        ).to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=True
            )

        responses = []
        for output in outputs:
            text = self.processor.decode(output, skip_special_tokens=True)
            responses.append(text.split("ASSISTANT:")[-1].strip())
        return responses
```
## Document Understanding

```python
from pdf2image import convert_from_path


class DocumentAnalyzer:
    """Document analysis with a multimodal LLM."""

    def __init__(self, mm_model):
        self.model = mm_model

    def analyze_pdf(self, pdf_path: str, question: str) -> str:
        """Analyze a PDF document page by page."""
        # Render PDF pages to images
        images = convert_from_path(pdf_path, dpi=150)

        page_analyses = []
        for i, image in enumerate(images):
            page_prompt = f"This is page {i + 1} of a document.\n{question}"
            analysis = self.model.generate(image, page_prompt)
            page_analyses.append(f"Page {i + 1}: {analysis}")

        # Synthesize across pages with a text-only pass
        synthesis_prompt = (
            "Based on these page-by-page analyses, provide a comprehensive answer:\n"
            + "\n".join(page_analyses)
            + f"\n\nQuestion: {question}"
        )
        return self.model.generate_text(synthesis_prompt)

    def extract_tables(self, image_path: str) -> list:
        """Extract tables from an image."""
        prompt = """
        Extract all tables from this image.
        For each table:
        1. Describe its purpose
        2. List column headers
        3. Provide data in markdown table format
        """
        response = self.model.generate(image_path, prompt)
        # _parse_tables is a domain-specific parsing helper, omitted here
        return self._parse_tables(response)

    def analyze_chart(self, image_path: str) -> dict:
        """Analyze a chart or diagram."""
        prompt = """
        Analyze this chart/graph:
        1. What type of chart is this?
        2. What are the axes/labels?
        3. What is the main trend or finding?
        4. Provide key data points if visible
        5. What conclusions can be drawn?
        """
        response = self.model.generate(image_path, prompt)
        return {
            'raw_analysis': response,
            # _extract_chart_type / _extract_insights are parsing helpers, omitted here
            'chart_type': self._extract_chart_type(response),
            'key_insights': self._extract_insights(response)
        }
```
## Video Understanding

```python
import cv2


class VideoAnalyzer:
    """Video analysis with a multimodal LLM."""

    def __init__(self, mm_model, frames_per_second: float = 1.0):
        self.model = mm_model
        self.fps = frames_per_second

    def extract_frames(self, video_path: str) -> list:
        """Extract frames from a video at the target sampling rate."""
        cap = cv2.VideoCapture(video_path)
        video_fps = cap.get(cv2.CAP_PROP_FPS)
        # Guard against a zero interval when the target rate exceeds the source rate
        frame_interval = max(1, int(video_fps / self.fps))

        frames = []
        frame_idx = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx % frame_interval == 0:
                # OpenCV decodes to BGR; convert to RGB for the model
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                frames.append({
                    'frame': frame_rgb,
                    'timestamp': frame_idx / video_fps
                })
            frame_idx += 1

        cap.release()
        return frames

    def analyze_video(self, video_path: str, question: str) -> dict:
        """Full video analysis."""
        frames = self.extract_frames(video_path)

        # Analyze key frames (every 5th extracted frame)
        frame_analyses = []
        for frame_data in frames[::5]:
            timestamp = frame_data['timestamp']
            analysis = self.model.generate(
                frame_data['frame'],
                f"Describe what's happening at {timestamp:.1f}s"
            )
            frame_analyses.append({
                'timestamp': timestamp,
                'description': analysis
            })

        # Synthesize across frames (text-only helper, omitted here)
        synthesis = self._synthesize_video_analysis(frame_analyses, question)

        return {
            'frame_analyses': frame_analyses,
            'summary': synthesis,
            'duration': frames[-1]['timestamp'] if frames else 0
        }

    def find_moment(self, video_path: str, description: str) -> float:
        """Find the moment in a video that matches a description."""
        frames = self.extract_frames(video_path)

        best_match = None
        best_score = 0
        for frame_data in frames:
            prompt = f"""
            Does this frame show: "{description}"?
            Answer with a confidence score from 0 to 100.
            Just the number.
            """
            response = self.model.generate(frame_data['frame'], prompt)
            try:
                score = int(response.strip())
            except ValueError:
                # The model did not return a bare number; skip this frame
                continue
            if score > best_score:
                best_score = score
                best_match = frame_data['timestamp']

        return best_match
```
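The sampling logic in `extract_frames` reduces to simple interval arithmetic: at a source rate of `video_fps` and a target of `self.fps` frames per second, every `frame_interval`-th frame is kept. A standalone sketch of just that rule, including the `max(1, ...)` guard for target rates above the source rate:

```python
def sample_indices(video_fps: float, target_fps: float, num_frames: int) -> list:
    """Indices of the frames kept when downsampling to target_fps."""
    interval = max(1, int(video_fps / target_fps))
    return [i for i in range(num_frames) if i % interval == 0]

# 30 fps source, 1 fps target, 90 frames (3 seconds) → frames 0, 30, 60
assert sample_indices(30.0, 1.0, 90) == [0, 30, 60]
# Target above source rate: every frame is kept
assert sample_indices(30.0, 60.0, 4) == [0, 1, 2, 3]
```

At 1 frame per second a 10-minute video already produces 600 model calls, which is why `analyze_video` above subsamples again (`frames[::5]`) before invoking the model.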
## Multimodal RAG

```python
import faiss
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor


class MultimodalRAG:
    """RAG with image support."""

    def __init__(self, mm_model, text_encoder: str = "all-MiniLM-L6-v2"):
        self.mm_model = mm_model
        self.text_encoder = SentenceTransformer(text_encoder)
        self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.text_index = None
        self.image_index = None
        self.documents = []

    def index_document(self, doc: dict):
        """Index a document with text + images."""
        text_embedding = self.text_encoder.encode(doc['text'])

        # Embed any attached images with CLIP
        image_embeddings = []
        for img in doc.get('images', []):
            inputs = self.clip_processor(images=img, return_tensors="pt")
            with torch.no_grad():
                img_emb = self.clip_model.get_image_features(**inputs)
            image_embeddings.append(img_emb.numpy())

        self.documents.append({
            'text': doc['text'],
            'text_embedding': text_embedding,
            'images': doc.get('images', []),
            'image_embeddings': image_embeddings
        })

    def build_index(self):
        """Build the FAISS indexes."""
        # Text index (faiss expects float32)
        text_embeddings = np.array(
            [d['text_embedding'] for d in self.documents], dtype='float32'
        )
        self.text_index = faiss.IndexFlatIP(text_embeddings.shape[1])
        self.text_index.add(text_embeddings)

        # Image index, with a mapping back to the owning document
        all_image_embs = []
        self.image_to_doc = []
        for doc_idx, doc in enumerate(self.documents):
            for img_emb in doc['image_embeddings']:
                all_image_embs.append(img_emb.flatten())
                self.image_to_doc.append(doc_idx)
        if all_image_embs:
            image_embeddings = np.array(all_image_embs, dtype='float32')
            self.image_index = faiss.IndexFlatIP(image_embeddings.shape[1])
            self.image_index.add(image_embeddings)

    def retrieve(self, query: str, query_image=None, k: int = 5) -> list:
        """Retrieve relevant documents."""
        # Deduplicate by document index (the stored dicts contain numpy
        # arrays, so they cannot be compared for membership directly)
        result_indices = []

        # Text retrieval
        query_emb = self.text_encoder.encode([query]).astype('float32')
        _, text_indices = self.text_index.search(query_emb, k)
        for idx in text_indices[0]:
            if int(idx) not in result_indices:
                result_indices.append(int(idx))

        # Image retrieval, if a query image is provided
        if query_image is not None and self.image_index is not None:
            inputs = self.clip_processor(images=query_image, return_tensors="pt")
            with torch.no_grad():
                query_img_emb = self.clip_model.get_image_features(**inputs)
            _, img_indices = self.image_index.search(
                query_img_emb.numpy().astype('float32'), k
            )
            for idx in img_indices[0]:
                doc_idx = self.image_to_doc[idx]
                if doc_idx not in result_indices:
                    result_indices.append(doc_idx)

        return [self.documents[i] for i in result_indices[:k]]

    def answer(self, query: str, query_image=None) -> str:
        """RAG-based answering."""
        relevant_docs = self.retrieve(query, query_image)

        # Build the textual and visual context
        context_parts = []
        images_for_context = []
        for doc in relevant_docs:
            context_parts.append(doc['text'])
            images_for_context.extend(doc['images'][:2])  # max 2 images per doc

        context = "\n\n".join(context_parts)
        prompt = f"""
        Based on the following context and images, answer the question.

        Context:
        {context}

        Question: {query}

        Answer:
        """
        if images_for_context:
            return self.mm_model.generate(images_for_context[0], prompt)
        return self.mm_model.generate_text(prompt)
```
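One subtlety in the index above: `faiss.IndexFlatIP` ranks by raw inner product, so unless the embeddings are L2-normalized, vectors with large norms win regardless of direction. For cosine-similarity behavior, normalize both before `index.add()` and before `index.search()`. A numpy sketch of the difference (toy vectors, no faiss dependency):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query = np.array([1.0, 0.0])
docs = np.array([
    [10.0, 9.0],   # large norm, noticeably off the query direction
    [1.0, 0.1],    # small norm, almost parallel to the query
])

# Raw inner product favors the big vector...
assert np.argmax(docs @ query) == 0
# ...while cosine (normalized) picks the direction-aligned one
assert np.argmax(l2_normalize(docs) @ l2_normalize(query)) == 1
```

Some sentence-transformers models emit normalized embeddings already (or accept `normalize_embeddings=True` in `encode`); CLIP features are not normalized by default, so the image side of the index is where this bites first.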
## Benchmarks
| Model | VQAv2 | TextVQA | DocVQA | GQA |
|-------|-------|---------|--------|-----|
| GPT-4V | 77.2 | 78.0 | 88.4 | 65.8 |
| Gemini Pro | 71.2 | 74.6 | 88.1 | - |
| LLaVA-1.5-13B | 80.0 | 61.3 | - | 63.3 |
| Qwen-VL | 79.5 | 63.8 | 65.1 | 59.3 |
| Claude 3 Opus | - | - | 89.3 | - |
## Research Ideas

For bachelor's theses:
- Fine-tune LLaVA on a specific domain (medical images, documents)
- A multimodal RAG pipeline
- VQA evaluation on Ukrainian-language data

For master's theses:
- Video understanding with temporal reasoning
- A document AI pipeline (OCR + understanding)
- Multi-image reasoning

For PhD research:
- Novel architectures for multimodal understanding
- Hallucination reduction in vision-language models
- Efficient multimodal models for edge deployment
## Tools

Models:
- LLaVA: github.com/haotian-liu/LLaVA
- MiniGPT-4: github.com/Vision-CAIR/MiniGPT-4
- Qwen-VL: github.com/QwenLM/Qwen-VL
- OpenFlamingo: github.com/mlfoundations/open_flamingo

APIs:
- OpenAI GPT-4V
- Google Gemini
- Anthropic Claude
Multimodality is not about "bolting vision onto an LLM". It is about emergent reasoning across modalities. When a model sees a chart and reads a question, it is not simply doing OCR plus text reasoning; it is reasoning about the visual information in the context of the query.
If you are interested in building multimodal systems, from using APIs to fine-tuning your own models, contact the SKP-Degree team at skp-degree.com.ua or message us on Telegram: @kursovi_diplomy. We can help with architecture selection, data preparation, and implementation.
Keywords: multimodal LLM, GPT-4V, Gemini, LLaVA, vision-language, VQA, document AI, video understanding, thesis project, master's thesis, AI research, foundation models