Mixture-of-Experts: How GPT-4 Got So Smart (and So Expensive)

Imagine a university where a single professor teaches everything: from quantum physics to medieval literature, from programming to psychology. Absurd, right? Nobody can be an expert in everything at once.

Now imagine that the university has 100 expert professors, plus a smart dispatcher who sends each student to the right one. A question about Shakespeare? Off to the literary scholar. Quantum entanglement? The physicist. PyTorch? The ML engineer.

That is Mixture-of-Experts (MoE), the architectural idea reportedly behind the power of GPT-4 and Gemini: a trillion-parameter model that runs almost like a 100-billion-parameter one. The secret? Not all experts are active at the same time.

The problem with dense models: Moore's law has stalled

Dense Transformer (classic GPT-3):

import math


class DenseTransformerProblem:
    """
    The problem with dense models: everything is active all the time
    """

    def forward_pass_cost(self, model_params: int, sequence_length: int) -> int:
        """
        Every token passes through ALL parameters

        GPT-3 175B:
        - 175 billion parameters
        - ~175 billion multiply-accumulates per token (roughly 2 FLOPs each)
        - 100 tokens = 17.5 trillion multiply-accumulates

        Inference requires:
        - Keeping all parameters in memory
        - Activating all of them for every token
        """
        return model_params * sequence_length

    def scaling_problem(self):
        """
        Scaling laws say: bigger model = better model
        But:
        - 10x parameters = 10x compute
        - 10x compute = 10x cost
        - But NOT 10x quality

        Diminishing returns: quality grows roughly logarithmically with scale
        GPT-4 (rumored ~1.7T params) → roughly 2x better than GPT-3 175B (a loose estimate)
        """
        pass

    def memory_wall(self, model_params_billions: float) -> dict:
        """
        Memory wall: GB needed just to hold the weights
        (1 billion params × N bytes per param = N GB)
        """
        bytes_per_param_fp16 = 2
        bytes_per_param_fp32 = 4
        fp16_memory_gb = model_params_billions * bytes_per_param_fp16

        return {
            "fp16_memory_gb": fp16_memory_gb,
            "fp32_memory_gb": model_params_billions * bytes_per_param_fp32,
            # Round up: a fraction of a GPU is still a whole GPU
            "a100_80gb_needed_fp16": math.ceil(fp16_memory_gb / 80),
            "h100_80gb_needed_fp16": math.ceil(fp16_memory_gb / 80)
        }

# GPT-3 175B needs:
# 350 GB in fp16 = at least 5 A100 80GB for the weights alone
# + activation memory, + KV cache
# In practice: 8+ A100s for inference
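
The arithmetic in the comments above can be checked with a few lines of Python. This is a minimal sketch: `gpus_for_weights` is a hypothetical helper, not part of any library, and it counts weight memory only (no activations, no KV cache).

```python
import math


def gpus_for_weights(params_billions: float,
                     bytes_per_param: int = 2,
                     gpu_memory_gb: int = 80) -> int:
    """Smallest number of GPUs whose combined memory holds the weights.

    Assumes 1 billion params × N bytes per param ≈ N GB and ignores
    activations, KV cache and framework overhead.
    """
    weights_gb = params_billions * bytes_per_param
    return math.ceil(weights_gb / gpu_memory_gb)


# GPT-3 175B in fp16: 350 GB of weights
print(gpus_for_weights(175))
# Same model in fp32: 700 GB of weights
print(gpus_for_weights(175, bytes_per_param=4))
```

Rounding up is the important detail: 350 GB divided by 80 GB is about 4.4, which in practice means five cards before any working memory is accounted for.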

MoE: sparsity to the rescue

Key idea: not every parameter is needed for every input.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExpertsLayer(nn.Module):
    """
    Mixture of Experts Layer

    Instead of one large FFN, several smaller "experts"
    The router picks the relevant experts for each token
    """

    def __init__(self,
                 d_model: int,
                 d_ff: int,
                 num_experts: int = 8,
                 top_k: int = 2,
                 capacity_factor: float = 1.25):
        super().__init__()

        self.num_experts = num_experts
        self.top_k = top_k
        self.capacity_factor = capacity_factor

        # Router (Gating Network)
        self.router = nn.Linear(d_model, num_experts, bias=False)

        # Experts: each one is a separate FFN
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Linear(d_ff, d_model)
            ) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, dict]:
        """
        x: (batch, seq_len, d_model)
        returns: (output, aux_losses)
        """
        batch_size, seq_len, d_model = x.shape

        # Flatten for routing (reshape also handles non-contiguous inputs)
        x_flat = x.reshape(-1, d_model)  # (batch * seq_len, d_model)

        # Compute routing scores
        router_logits = self.router(x_flat)  # (batch * seq_len, num_experts)
        router_probs = F.softmax(router_logits, dim=-1)

        # Top-K selection
        top_k_probs, top_k_indices = router_probs.topk(self.top_k, dim=-1)

        # Normalize top-k weights
        top_k_weights = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Compute expert outputs
        output = torch.zeros_like(x_flat)

        for expert_idx in range(self.num_experts):
            # Find tokens routed to this expert
            expert_mask = (top_k_indices == expert_idx).any(dim=-1)

            if expert_mask.any():
                # Get tokens for this expert
                expert_input = x_flat[expert_mask]

                # Compute expert output
                expert_output = self.experts[expert_idx](expert_input)

                # Get weights for this expert
                weight_mask = (top_k_indices[expert_mask] == expert_idx)
                weights = torch.where(
                    weight_mask,
                    top_k_weights[expert_mask],
                    torch.zeros_like(top_k_weights[expert_mask])
                ).sum(dim=-1, keepdim=True)

                # Weighted contribution
                output[expert_mask] += weights * expert_output

        # Reshape back
        output = output.view(batch_size, seq_len, d_model)

        # Auxiliary losses for load balancing
        aux_losses = self._compute_aux_losses(router_probs, top_k_indices)

        return output, aux_losses

    def _compute_aux_losses(self, router_probs, top_k_indices):
        """Load balancing losses"""

        # Fraction of tokens routed to each expert
        num_tokens = router_probs.shape[0]
        tokens_per_expert = torch.zeros(self.num_experts, device=router_probs.device)

        for k in range(self.top_k):
            expert_counts = torch.bincount(
                top_k_indices[:, k],
                minlength=self.num_experts
            ).float()
            tokens_per_expert += expert_counts

        fraction_tokens = tokens_per_expert / (num_tokens * self.top_k)

        # Average routing probability per expert
        avg_probs = router_probs.mean(dim=0)

        # Load balancing loss (Switch Transformer style): lowest when routing is uniform
        load_balance_loss = (fraction_tokens * avg_probs).sum() * self.num_experts

        return {
            "load_balance_loss": load_balance_loss,
            "tokens_per_expert": tokens_per_expert
        }
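
To see why this loss rewards balanced routing, it helps to evaluate it on two extreme cases. The sketch below re-implements the same formula in plain Python, with no PyTorch; `load_balance_loss`, the top-1 assignments and the toy probabilities are assumptions made for illustration.

```python
def load_balance_loss(assignments, router_probs, num_experts, alpha=0.01):
    """alpha * num_experts * sum_i(fraction_i * avg_prob_i), in plain Python.

    assignments: one expert index per token (top-1 routing for simplicity)
    router_probs: one list of per-expert probabilities per token
    """
    num_tokens = len(assignments)
    fraction = [assignments.count(e) / num_tokens for e in range(num_experts)]
    avg_prob = [sum(p[e] for p in router_probs) / num_tokens
                for e in range(num_experts)]
    return alpha * num_experts * sum(f * p for f, p in zip(fraction, avg_prob))


num_experts = 4

# Balanced: tokens are spread evenly and the router is indifferent
balanced = load_balance_loss(
    assignments=[0, 1, 2, 3],
    router_probs=[[0.25] * 4 for _ in range(4)],
    num_experts=num_experts,
)

# Collapsed: every token goes to expert 0 with full confidence
collapsed = load_balance_loss(
    assignments=[0, 0, 0, 0],
    router_probs=[[1.0, 0.0, 0.0, 0.0] for _ in range(4)],
    num_experts=num_experts,
)

print(balanced, collapsed)
```

The balanced case evaluates to alpha and the collapsed case to alpha × num_experts, so minimizing this term pushes the router toward an even load.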

Mixtral architecture: an open MoE

Mixtral 8x7B (Mistral AI, 2023):

class MixtralConfig:
    """Mixtral 8x7B configuration"""

    # Architecture
    num_experts = 8
    num_experts_per_token = 2  # Top-2 routing
    hidden_size = 4096
    intermediate_size = 14336
    num_hidden_layers = 32
    num_attention_heads = 32
    num_key_value_heads = 8  # GQA

    # Params
    total_params = 46.7e9      # 47B total
    active_params = 12.9e9    # 13B active per token

    # Comparison
    vs_llama2_70b = {
        "performance": "comparable",
        "active_compute": "6x less",
        "memory_for_weights": "0.7x",
        "inference_speed": "3x faster"
    }


class MixtralBlock(nn.Module):
    """One Mixtral block"""

    def __init__(self, config):
        super().__init__()

        # Attention (standard); GroupedQueryAttention and RMSNorm are assumed to be defined elsewhere
        self.attention = GroupedQueryAttention(config)

        # MoE FFN (instead of a regular FFN)
        self.moe = MixtureOfExpertsLayer(
            d_model=config.hidden_size,
            d_ff=config.intermediate_size,
            num_experts=config.num_experts,
            top_k=config.num_experts_per_token
        )

        self.attention_norm = RMSNorm(config.hidden_size)
        self.moe_norm = RMSNorm(config.hidden_size)

    def forward(self, x, attention_mask=None):
        # Attention
        residual = x
        x = self.attention_norm(x)
        x = self.attention(x, attention_mask)
        x = residual + x

        # MoE FFN
        residual = x
        x = self.moe_norm(x)
        x, aux_loss = self.moe(x)
        x = residual + x

        return x, aux_loss
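
The `total_params` and `active_params` figures in `MixtralConfig` can be reproduced from the layer sizes. The sketch below is a back-of-the-envelope count under assumptions that are not in the config above: a vocabulary of 32,000 tokens, untied input and output embeddings, three weight matrices per expert (SwiGLU-style gate, up and down projections) and no bias terms. `mixtral_param_counts` is a hypothetical helper.

```python
def mixtral_param_counts(hidden_size=4096, intermediate_size=14336,
                         num_layers=32, num_heads=32, num_kv_heads=8,
                         num_experts=8, experts_per_token=2,
                         vocab_size=32000):
    """Rough parameter count for a Mixtral-style model (no biases)."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim

    # Attention: Q and O are hidden×hidden, K and V are hidden×kv_dim (GQA)
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # One expert: gate, up and down projections (assumed SwiGLU layout)
    expert = 3 * hidden_size * intermediate_size
    router = hidden_size * num_experts
    norms = 2 * hidden_size

    shared_per_layer = attention + router + norms
    # Input embeddings, output head and final norm
    outside_layers = 2 * vocab_size * hidden_size + hidden_size

    total = num_layers * (shared_per_layer + num_experts * expert) + outside_layers
    active = num_layers * (shared_per_layer + experts_per_token * expert) + outside_layers
    return total, active


total, active = mixtral_param_counts()
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the count comes out at about 46.7B total and 12.9B active. It also shows why "8x7B" is not 56B: attention, embeddings and norms are shared, and only the expert FFNs are replicated eight times.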

GPT-4: the biggest MoE

Information from leaks (not officially confirmed):

class GPT4RumoredArchitecture:
    """
    GPT-4 architecture according to rumors
    Source: George Hotz, SemiAnalysis
    """

    RUMORED_SPECS = {
        # Model architecture
        "architecture": "MoE Transformer",
        "num_experts": 16,
        "active_experts_per_token": 2,
        "params_per_expert": "~111B",
        "total_params": "~1.76T",  # rumored figure; 16 × 111B ≈ 1.78T
        "active_params": "~222B",  # 2 × 111B

        # Context
        "context_8k_model": {
            "context_length": 8192,
            "num_layers": 120,
        },
        "context_32k_model": {
            "context_length": 32768,
            "num_layers": 120,  # Same but extended position embeddings
        },

        # Training
        "training_tokens": "~13T",
        "training_duration": "~100 days",
        "training_cluster": "~10,000 A100s",
        "training_cost": "~$100M",

        # Inference
        "inference_cluster": "~3,000+ servers",
        "cost_per_1k_tokens": 0.06  # USD; launch list price for 8k-context output tokens (input was 0.03)
    }

    @staticmethod
    def why_moe_for_gpt4():
        """Why OpenAI (reportedly) chose MoE"""
        return [
            "1.7T params are impractical to run as a dense model on available hardware",
            "MoE allows the 'largest' model at an 'acceptable' inference cost",
            "Each expert can specialize (code, math, language, etc)",
            "Scaling law advantage: more params = better quality"
        ]

Router design: the key to MoE success

Different types of routing:

class RouterVariants:
    """Different router designs for MoE"""

    class TokenChoiceRouter(nn.Module):
        """
        Each token picks its Top-K experts
        The standard approach (Mixtral; reportedly GPT-4)
        """

        def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
            super().__init__()
            self.router = nn.Linear(d_model, num_experts)
            self.top_k = top_k

        def forward(self, x: torch.Tensor):
            # x: (batch, seq, d_model)
            logits = self.router(x)  # (batch, seq, num_experts)
            probs = F.softmax(logits, dim=-1)
            weights, indices = probs.topk(self.top_k, dim=-1)
            # Normalize
            weights = weights / weights.sum(dim=-1, keepdim=True)
            return weights, indices


    class ExpertChoiceRouter(nn.Module):
        """
        Each expert picks its Top-K tokens
        Better load balancing, but needs a fixed-size buffer per expert
        (Expert Choice routing, Zhou et al., 2022)
        """

        def __init__(self, d_model: int, num_experts: int, capacity: int):
            super().__init__()
            self.router = nn.Linear(d_model, num_experts)
            self.capacity = capacity  # tokens picked by each expert

        def forward(self, x: torch.Tensor):
            batch_size, seq_len, d_model = x.shape
            x_flat = x.reshape(-1, d_model)

            logits = self.router(x_flat)  # (batch*seq, num_experts)

            # Token-to-expert affinity: softmax over experts for each token
            probs = F.softmax(logits, dim=-1)

            # Transpose: experts choose tokens
            probs_t = probs.T  # (num_experts, batch*seq)

            # Each expert picks its top-capacity tokens (capacity <= batch*seq)
            weights, indices = probs_t.topk(self.capacity, dim=-1)

            return weights, indices


    class HashRouter(nn.Module):
        """
        Deterministic routing based on a hash
        No trainable params in the router
        Simpler, but less adaptive
        """

        def __init__(self, num_experts: int):
            super().__init__()
            self.num_experts = num_experts

        def forward(self, x: torch.Tensor, token_ids: torch.Tensor):
            # Hash token_id to expert
            expert_indices = token_ids % self.num_experts
            weights = torch.ones_like(expert_indices).float()
            return weights.unsqueeze(-1), expert_indices.unsqueeze(-1)


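    class SoftMoERouter(nn.Module):
        """
        Soft MoE: every expert receives a weighted mix of all tokens
        Continuous routing instead of discrete routing
        """

        def __init__(self, d_model: int, num_experts: int, num_slots: int):
            super().__init__()
            self.slot_embeddings = nn.Parameter(
                torch.randn(num_experts, num_slots, d_model)
            )

        def forward(self, x: torch.Tensor):
            # x: (batch, seq, d_model)
            # Affinity between every token and every slot of every expert
            logits = torch.einsum("btd,esd->btes", x, self.slot_embeddings)
            b, t, e, s = logits.shape

            # Dispatch weights: each slot is a "soft" mix of tokens (softmax over tokens)
            dispatch = logits.softmax(dim=1)
            # Combine weights: each token is a mix of slot outputs (softmax over all slots)
            combine = logits.reshape(b, t, e * s).softmax(dim=-1).reshape(b, t, e, s)

            return dispatch, combine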
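
To make token-choice routing concrete for a single token, the softmax, top-k and renormalization steps can be traced in plain Python. This is a minimal sketch without PyTorch; `top_k_gate` and the example logits are hypothetical.

```python
import math


def top_k_gate(logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Numerically stable softmax over all experts
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k most probable experts, best first
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the selected weights sum to 1
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# One token, four experts: the router prefers expert 0, then expert 1
gates = top_k_gate([2.0, 1.0, 0.0, -1.0], k=2)
for expert_idx, weight in gates:
    print(f"expert {expert_idx}: weight {weight:.3f}")
```

Only the two selected experts run for this token; their outputs are blended with the printed weights, and the other two experts cost nothing.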

Load balancing: a critical problem

Why load balancing matters:

class LoadBalancingProblem:
    """
    Problem: without balancing, one expert ends up with all the tokens

    Why this is bad:
    1. One expert is overloaded → bottleneck
    2. The other experts never train → wasted params
    3. "Expert collapse": the model degrades
    """

    @staticmethod
    def auxiliary_loss(router_probs: torch.Tensor,
                       expert_assignments: torch.Tensor,
                       num_experts: int,
                       alpha: float = 0.01) -> torch.Tensor:
        """
        Auxiliary loss for balancing

        Minimize: sum(fraction_i × prob_i) × num_experts

        Idea: if fraction ≈ 1/num_experts for every expert,
        the loss is at its minimum
        """
        # Fraction of tokens per expert
        num_tokens = expert_assignments.numel()
        tokens_per_expert = torch.bincount(
            expert_assignments.flatten(),
            minlength=num_experts
        ).float()
        fraction = tokens_per_expert / num_tokens

        # Average probability per expert
        avg_prob = router_probs.mean(dim=0)

        # Load balance loss
        loss = alpha * num_experts * (fraction * avg_prob).sum()

        return loss


class LoadBalancingSolutions:
    """Solutions for load balancing"""

    @staticmethod
    def expert_capacity_limiting():
        """
        Cap the capacity of each expert
        capacity = (tokens / num_experts) × capacity_factor

        Overflow tokens → dropped or routed to the second-choice expert
        """
        pass

    @staticmethod
    def noisy_top_k_gating(logits: torch.Tensor,
                           noise_std: float = 1.0) -> torch.Tensor:
        """
        Add noise to the routing logits during training
        Encourages exploration of different experts
        """
        noise = torch.randn_like(logits) * noise_std
        return logits + noise

    @staticmethod
    def expert_parallelism():
        """
        Spread the experts across different GPUs
        All-to-all communication for routing
        Reduces memory per GPU
        """
        pass
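
The capacity formula from `expert_capacity_limiting` can be sketched in plain Python. The helper names and the first-come-first-served drop policy are assumptions for illustration; real implementations differ in how they treat overflow tokens.

```python
import math


def expert_capacity(num_tokens: int, num_experts: int,
                    capacity_factor: float = 1.25) -> int:
    """capacity = (tokens / num_experts) × capacity_factor, rounded up."""
    return math.ceil(num_tokens / num_experts * capacity_factor)


def apply_capacity(assignments, num_experts, capacity):
    """Keep tokens in arrival order until an expert is full; drop the rest."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            dropped.append(token_idx)
    return kept, dropped


# 8 tokens, 4 experts: each expert may take ceil(2 * 1.25) = 3 tokens
assignments = [0, 0, 0, 0, 1, 2, 3, 1]
capacity = expert_capacity(len(assignments), num_experts=4)
kept, dropped = apply_capacity(assignments, num_experts=4, capacity=capacity)
print(capacity, kept, dropped)
```

Here expert 0 is asked to handle four tokens but has room for three, so the fourth one (index 3) overflows. A dropped token typically just passes through the residual connection without an expert update.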

Efficient MoE Inference

Optimizations for inference:

class EfficientMoEInference:
    """Optimizations for MoE inference"""

    @staticmethod
    def expert_parallelism(model, num_gpus: int):
        """
        Spread the experts across GPUs

        8 experts, 4 GPUs → 2 experts per GPU
        Reduces memory per GPU
        """
        experts_per_gpu = model.num_experts // num_gpus
        # Sharding logic...
        pass

    @staticmethod
    def expert_offloading(model, cpu_experts: list):
        """
        Inactive experts → CPU memory
        Active experts → GPU
        Swap on demand

        Trade-off: memory vs latency
        """
        pass

    class BatchedExpertExecution:
        """
        Instead of masking the full batch once per expert:
        1. Group tokens by their assigned expert
        2. Run one batched forward pass per expert
        3. Scatter the results back into the original order
        """

        def forward(self, tokens, routing_indices, experts):
            # tokens: (num_tokens, d_model)
            # routing_indices: (num_tokens,) top-1 expert id per token
            # Sort tokens by expert
            sorted_indices = routing_indices.argsort()
            sorted_tokens = tokens[sorted_indices]

            # Tokens per expert give the segment boundaries,
            # including experts that received no tokens
            counts = torch.bincount(routing_indices, minlength=len(experts)).tolist()

            # Batched forward per expert
            results = []
            start = 0
            for expert_idx, count in enumerate(counts):
                if count == 0:
                    continue
                end = start + count
                results.append(experts[expert_idx](sorted_tokens[start:end]))
                start = end

            # Unsort
            outputs = torch.cat(results)
            original_order = sorted_indices.argsort()
            return outputs[original_order]
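
The `expert_offloading` idea (keep a few experts on the GPU and swap the rest in on demand) behaves like a cache. The sketch below models it as a least-recently-used cache in plain Python; `ExpertCache`, the slot count and the LRU eviction policy are assumptions for illustration, and no actual tensors are moved.

```python
from collections import OrderedDict


class ExpertCache:
    """Tracks which experts are resident on the GPU (LRU eviction)."""

    def __init__(self, gpu_slots: int):
        self.gpu_slots = gpu_slots
        self.resident = OrderedDict()  # expert_id -> True, least recent first
        self.loads = 0                 # number of CPU → GPU transfers

    def request(self, expert_id: int) -> bool:
        """Make an expert available; return True if it had to be loaded."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return False
        if len(self.resident) >= self.gpu_slots:
            self.resident.popitem(last=False)     # evict least recently used
        self.resident[expert_id] = True
        self.loads += 1
        return True


cache = ExpertCache(gpu_slots=2)
for expert_id in [0, 1, 0, 2, 1]:
    loaded = cache.request(expert_id)
    print(f"expert {expert_id}: {'load from CPU' if loaded else 'already on GPU'}")
print("transfers:", cache.loads)
```

Every transfer costs latency, which is the memory-versus-latency trade-off mentioned in the docstring: fewer GPU slots save memory but raise the miss rate.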

MoE for agentic systems

Why MoE suits AI agents:

class AgentMoEArchitecture:
    """
    MoE for multi-capability agents

    Different experts for different kinds of tasks:
    - Code generation expert
    - Web search expert
    - Mathematical reasoning expert
    - Tool use expert
    - Dialogue expert
    """

    def __init__(self):
        self.task_experts = {
            "code": CodeGenerationExpert(),
            "search": WebSearchExpert(),
            "math": MathReasoningExpert(),
            "tools": ToolUseExpert(),
            "dialogue": DialogueExpert(),
            "planning": PlanningExpert(),
            "memory": MemoryRetrievalExpert(),
            "synthesis": InformationSynthesisExpert()
        }

        self.task_router = TaskRouter()

    def process_agent_query(self, query: str, context: dict) -> str:
        """
        Agent query → route to relevant experts → combine results
        """
        # 1. Analyze query type
        task_types = self.task_router.analyze(query)

        # 2. Route to top-k experts
        active_experts = [self.task_experts[t] for t in task_types[:2]]

        # 3. Get expert outputs
        expert_outputs = [expert(query, context) for expert in active_experts]

        # 4. Combine (weighted by router confidence)
        combined = self.combine_expert_outputs(expert_outputs)

        return combined

    def advantages_for_agents(self) -> list:
        return [
            "Specialization: the code expert knows APIs better than a generic model",
            "Efficiency: no need to activate the math expert for chat",
            "Modularity: one expert can be updated without retraining everything",
            "Scalability: adding a new expert is easier than retraining",
            "Interpretability: we can see which expert answered"
        ]


class HierarchicalAgentMoE:
    """
    Hierarchical MoE for complex agents

    Level 1: Task type routing (code vs search vs chat)
    Level 2: Sub-task routing (Python vs JS, Google vs Wikipedia)
    Level 3: Expert execution
    """

    def __init__(self):
        # Level 1: High-level task routing
        self.task_router = nn.Linear(768, 4)  # 4 task types

        # Level 2: Sub-task routers
        self.code_router = nn.Linear(768, 3)    # Python, JS, SQL
        self.search_router = nn.Linear(768, 3)  # Google, Wiki, arxiv

        # Level 3: Actual experts
        self.python_expert = PythonExpert()
        self.js_expert = JSExpert()
        # ...
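
The two routing levels described in `HierarchicalAgentMoE` can be illustrated without any neural network. In the sketch below the scores are hard-coded stand-ins for router outputs; `route_hierarchical` and all task and sub-task names are hypothetical.

```python
def route_hierarchical(task_scores, subtask_scores):
    """Pick the best task type, then the best sub-task within it.

    task_scores: {task_name: score}
    subtask_scores: {task_name: {subtask_name: score}}
    Returns (task, subtask); subtask is None if the task has no sub-router.
    """
    # Level 1: high-level task type
    task = max(task_scores, key=task_scores.get)

    # Level 2: sub-task, only if this task has its own router
    options = subtask_scores.get(task, {})
    if not options:
        return task, None
    subtask = max(options, key=options.get)
    return task, subtask


subtask_scores = {
    "code": {"python": 0.6, "js": 0.3, "sql": 0.1},
    "search": {"google": 0.5, "wiki": 0.3, "arxiv": 0.2},
}

print(route_hierarchical({"code": 0.7, "search": 0.2, "chat": 0.1}, subtask_scores))
print(route_hierarchical({"code": 0.1, "search": 0.2, "chat": 0.7}, subtask_scores))
```

The second call shows the fallback: a task type without a sub-router (here "chat") goes straight to its expert.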

Benchmark results

MoE vs Dense comparison:

| Model | Total Params | Active Params | MMLU | HumanEval | Cost |
|-------|--------------|---------------|------|-----------|------|
| Llama 2 70B | 70B | 70B | 68.9 | 29.9 | High |
| Mixtral 8x7B | 47B | 13B | 70.6 | 40.2 | Medium |
| GPT-3.5 | ~175B | ~175B | 70.0 | 48.1 | Medium |
| GPT-4 (est) | ~1.7T | ~220B | 86.4 | 67.0 | Very High |

Key insight: Mixtral, with 13B active params, outperforms Llama 2 70B, which uses all 70B params for every token.
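
The compute gap behind that comparison can be estimated with a common rule of thumb: a forward pass costs roughly 2 FLOPs per active parameter per token. The rule and the helper name are assumptions for this sketch; attention over long contexts and memory bandwidth are ignored.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule of thumb: about 2 FLOPs (one multiply, one add) per active parameter."""
    return 2 * active_params


llama2_70b = forward_flops_per_token(70e9)    # dense: all params are active
mixtral = forward_flops_per_token(12.9e9)     # MoE: only the active params count

print(f"Llama 2 70B: {llama2_70b:.2e} FLOPs per token")
print(f"Mixtral 8x7B: {mixtral:.2e} FLOPs per token")
print(f"ratio: {llama2_70b / mixtral:.1f}x")
```

By this estimate Mixtral spends a bit over five times less compute per token, while still needing memory for all 47B parameters.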

Research ideas

For a bachelor's thesis:

Fine-tune Mixtral on a specific domain (legal, medical)
Analysis: which experts activate for which types of tokens
Inference cost comparison: MoE vs Dense at equal quality

For a master's thesis:

MoE for multi-agent coordination (specialized agents)
Dynamic expert allocation based on task complexity
Pruning/merging of inactive experts
Knowledge distillation from MoE into a dense model

For a PhD:

Hierarchical MoE (experts of experts)
Continual learning in MoE: new experts without forgetting the old ones
Theoretical analysis: optimal routing strategies
Sparse-to-Dense distillation
Novel routing mechanisms (learned vs fixed)

Why this matters for the future of AI

Scaling laws say that a bigger model is a better model. Chinchilla showed that parameters and data have to be balanced. Either way, the compute bill keeps growing exponentially.

MoE is a hack around scaling laws: a model that is "bigger" by parameter count without a proportionally higher compute cost per token at inference (all the weights still have to fit in memory). A trillion parameters, but only hundreds of billions are active for any given token. Specialization instead of universality.

This matters especially for agentic systems. An agent has to be versatile (broad knowledge and skills) and fast (low latency for real-time interaction). MoE offers both: breadth through the total parameter count, speed through sparse activation.



About the author

SKP-Degree team
