An LLM generates text one token at a time, and each token costs a full forward pass through billions of parameters. At roughly 50 tokens/second (a commonly cited figure for GPT-4), a 500-token answer takes 10 seconds of waiting; code generation running to several thousand tokens takes a minute or more.
This is not just inconvenient. It is expensive. Inference cost is proportional to time × GPU. If the same answer can be generated in 5 seconds instead of 10, that is a 50% saving on compute.
What if you could generate 4-5 tokens per pass? With no change to the model. With no loss of quality. Exactly the same answer, just 2-3 times faster.
Speculative decoding does exactly that. It is as close to a free lunch as LLM optimization gets.
The problem: why autoregressive generation is slow
The sequential dependency is a fundamental constraint:
import torch

class AutoregressiveProblem:
    """Why LLM generation is inherently sequential."""

    def generate_standard(self, model, prompt: str, max_tokens: int) -> str:
        """
        Standard autoregressive generation.
        Problem: each token depends on the previous one,
        so the loop cannot be parallelized.
        (tokenize/sample/detokenize/EOS are illustrative placeholders.)
        """
        tokens = tokenize(prompt)
        for _ in range(max_tokens):
            # A full forward pass for ONE token
            logits = model.forward(tokens)
            # Sample the next token
            next_token = sample(logits[:, -1, :])
            # Append it to the sequence
            tokens = torch.cat([tokens, next_token.unsqueeze(0)], dim=1)
            if next_token == EOS:
                break
        return detokenize(tokens)

    def analyze_bottleneck(self):
        """
        Memory-bound vs compute-bound.
        For single-token generation the LLM is memory-bound:
        - every step has to stream all weights from memory;
        - the arithmetic finishes far sooner than the reads;
        - GPU compute utilization ends up below 1%.
        """
        # Llama-70B on A100 80GB (FP16 weights span two cards)
        model_size_gb = 140           # FP16 weights
        num_params = 70e9
        memory_bandwidth_gb_s = 2000  # A100 HBM
        flops = 312e12                # A100 FP16 peak

        # Time to read all weights once
        read_time = model_size_gb / memory_bandwidth_gb_s
        # Theoretical compute time (~2 FLOPs per parameter per token)
        compute_time = num_params * 2 / flops

        return {
            "memory_read_time_ms": read_time * 1000,     # ~70 ms
            "compute_time_ms": compute_time * 1000,      # ~0.45 ms
            "bottleneck": "memory bandwidth",
            "gpu_utilization": compute_time / read_time  # <1%
        }
The key observation: generating 1 token and verifying 10 tokens take roughly the same amount of time, because the bottleneck is reading the weights, not the arithmetic.
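That observation follows directly from the bandwidth numbers: a memory-bound decoder cannot exceed bandwidth / model size tokens per second, no matter how fast the compute is. Using the A100 figures from above:

```python
# A memory-bound decoder must stream all weights once per token, so
# tokens/sec is capped by bandwidth / model size (A100 numbers from above).
model_size_gb = 140    # Llama-70B, FP16
bandwidth_gb_s = 2000  # HBM bandwidth

time_per_token_ms = model_size_gb / bandwidth_gb_s * 1000
max_tokens_per_s = 1000 / time_per_token_ms

print(time_per_token_ms, round(max_tokens_per_s, 1))  # 70.0 ms → ~14.3 tok/s
```

Verifying K tokens in one pass streams the weights once for all K positions, which is exactly the loophole speculative decoding exploits.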
Speculative Decoding: the core idea
The concept:
1. A draft model (small, fast) generates K tokens "speculatively".
2. The target model (large, accurate) verifies all K tokens in a single forward pass.
3. Correct tokens are accepted, incorrect ones are discarded.
4. Guarantee: the output is identical to what the target model would produce without speculation.
An analogy: instead of asking GPT-4 "what is the next token?" 10 times in a row, we:
- ask GPT-2 to guess 10 tokens (very fast);
- ask GPT-4 to check those 10 tokens in parallel (one forward pass);
- accept the ones where GPT-4 agrees with GPT-2.
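Before the full implementation, the accept/reject rule from step 3 can be shown on a toy pair of distributions (all numbers here are made up for illustration):

```python
import random

random.seed(0)

# Toy vocabulary of 3 tokens; the draft proposed token 0, the target
# partly disagrees. Both distributions are invented for this example.
p_draft = [0.7, 0.2, 0.1]   # draft model's distribution
p_target = [0.5, 0.4, 0.1]  # target model's distribution
proposed = 0                # token the draft sampled

# Accept with probability min(1, p_target / p_draft)
accept_prob = min(1.0, p_target[proposed] / p_draft[proposed])

if random.random() < accept_prob:
    token = proposed
else:
    # On rejection, resample from the residual distribution
    # max(0, p_target - p_draft); this preserves the target's distribution.
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    total = sum(residual)
    residual = [r / total for r in residual]
    # inverse-CDF sample from the residual
    u, acc, token = random.random(), 0.0, 0
    for i, r in enumerate(residual):
        acc += r
        if u <= acc:
            token = i
            break

print(accept_prob, token)  # 0.714..., then the sampled token
```

With this rule, the marginal distribution of the emitted token is exactly the target's, which is why the method is lossless.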
The Speculative Decoding Algorithm
import torch
from typing import Tuple

class SpeculativeDecoder:
    """
    Speculative decoding.

    Key components:
    1. Draft model — generates K speculative tokens.
    2. Target model — verifies all K tokens in parallel.
    3. Acceptance criterion — ensures correctness.
    """

    def __init__(self, target_model, draft_model, K: int = 5):
        self.target = target_model
        self.draft = draft_model
        self.K = K  # number of speculative tokens per step

    def generate(self, prompt: torch.Tensor, max_tokens: int) -> torch.Tensor:
        """Main speculative decoding loop. `prompt` has shape (1, seq_len)."""
        tokens = prompt.clone()
        generated = 0
        while generated < max_tokens:
            # Step 1: draft K tokens plus their full distributions
            draft_tokens, draft_dists = self.draft_speculate(tokens)
            # Step 2: verify with the target model
            accepted = self.verify_and_accept(tokens, draft_tokens, draft_dists)
            # Step 3: append the accepted tokens
            tokens = torch.cat([tokens, accepted.unsqueeze(0)], dim=1)
            generated += accepted.shape[0]
            # Step 4: stop at EOS
            if accepted[-1] == self.target.eos_token_id:
                break
        return tokens

    def draft_speculate(self, context: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Draft model autoregressively proposes K tokens.

        Returns the proposed tokens, shape (K,), and the draft's full
        probability distribution at each step, shape (K, vocab); the full
        distributions are needed later for the rejection resampling step.
        """
        draft_tokens, draft_dists = [], []
        current = context.clone()
        for _ in range(self.K):
            with torch.no_grad():
                logits = self.draft(current)
            probs = torch.softmax(logits[:, -1, :], dim=-1)  # (1, vocab)
            token = probs.argmax(dim=-1)                     # greedy (or sample)
            draft_tokens.append(token.squeeze(0))
            draft_dists.append(probs.squeeze(0))
            current = torch.cat([current, token.unsqueeze(0)], dim=1)
        return torch.stack(draft_tokens), torch.stack(draft_dists)

    def verify_and_accept(self,
                          context: torch.Tensor,
                          draft_tokens: torch.Tensor,
                          draft_dists: torch.Tensor) -> torch.Tensor:
        """
        Target model verifies all draft tokens in parallel.
        Key insight: a single forward pass scores all K positions.
        """
        # Concatenate the context with all draft tokens
        full_sequence = torch.cat([context, draft_tokens.unsqueeze(0)], dim=1)
        # Single forward pass through the target
        with torch.no_grad():
            target_logits = self.target(full_sequence)
        # Positions -K-1 .. -2 are the ones predicting the K draft tokens
        target_probs = torch.softmax(target_logits[:, -self.K - 1:-1, :], dim=-1)

        accepted = []
        for i in range(self.K):
            token = draft_tokens[i]
            p_draft = draft_dists[i, token]
            p_target = target_probs[0, i, token]
            # Accept with probability min(1, p_target / p_draft)
            if torch.rand(1) < torch.clamp(p_target / p_draft, max=1.0):
                accepted.append(token)
            else:
                # Reject this and all subsequent tokens; resample from the
                # residual distribution max(0, p_target - p_draft), which
                # keeps the output distribution exactly equal to the target's
                residual = torch.clamp(target_probs[0, i] - draft_dists[i], min=0)
                residual = residual / residual.sum()
                accepted.append(torch.multinomial(residual, 1).squeeze(0))
                break
        # (Full implementations also sample a "bonus" token from the target's
        # last position when all K drafts are accepted.)
        return torch.stack(accepted)
class GreedySpeculativeDecoder(SpeculativeDecoder):
    """
    Simplified version for greedy decoding (temperature = 0).
    Simpler acceptance criterion: accept while draft == target argmax.
    """

    def verify_and_accept_greedy(self,
                                 context: torch.Tensor,
                                 draft_tokens: torch.Tensor) -> torch.Tensor:
        """Greedy verification — accept the prefix on which draft matches target."""
        full_sequence = torch.cat([context, draft_tokens.unsqueeze(0)], dim=1)
        with torch.no_grad():
            target_logits = self.target(full_sequence)
        # The target's argmax predictions at the K draft positions
        target_predictions = target_logits[:, -self.K - 1:-1, :].argmax(dim=-1)

        accepted = []
        for i in range(self.K):
            if draft_tokens[i] == target_predictions[0, i]:
                accepted.append(draft_tokens[i])
            else:
                # First mismatch: take the target's own prediction and stop
                accepted.append(target_predictions[0, i])
                break
        return torch.stack(accepted)
The mathematical guarantee: lossless speculation
Theorem: speculative decoding with the proper acceptance criterion produces exactly the same output distribution as the target model without speculation.
from typing import List

class SpeculativeDecodingTheory:
    """Mathematical justification of correctness."""

    @staticmethod
    def acceptance_probability(p_target: float, p_draft: float) -> float:
        """
        Acceptance probability for a single token:
            P(accept) = min(1, p_target / p_draft)
        If p_target >= p_draft: always accept.
        If p_target < p_draft: accept with probability p_target / p_draft.
        """
        return min(1.0, p_target / p_draft)

    @staticmethod
    def expected_accepted_tokens(acceptance_rates: List[float]) -> float:
        """
        Expected number of accepted tokens:
            E[accepted] = Σ_i P(token i accepted | all previous accepted)
        For a constant acceptance rate α and unbounded K this tends to
        1 / (1 - α), the mean of a geometric distribution.
        """
        expected = 0.0
        prob_all_accepted = 1.0
        for rate in acceptance_rates:
            expected += prob_all_accepted * rate
            prob_all_accepted *= rate
        return expected

    @staticmethod
    def speedup_analysis(K: int,
                         acceptance_rate: float,
                         draft_cost_ratio: float) -> float:
        """
        Theoretical speedup from speculative decoding.

        Parameters:
        - K: number of speculative tokens
        - acceptance_rate: P(draft token accepted)
        - draft_cost_ratio: cost_draft / cost_target

        Speedup ≈ E[accepted] / (1 + K × draft_cost_ratio)
        """
        # Expected accepted tokens per step (finite geometric series,
        # including the bonus token from the target's own forward pass)
        expected_accepted = (1 - acceptance_rate ** (K + 1)) / (1 - acceptance_rate)
        total_cost = 1 + K * draft_cost_ratio
        return expected_accepted / total_cost

# Example:
# K=5, acceptance_rate=0.8, draft cost = 5% of target
# Expected accepted ≈ 3.7 tokens per step
# Speedup ≈ 3.69 / 1.25 ≈ 3.0x
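The formula is easy to evaluate directly; this short sketch recomputes the worked example (K=5, acceptance rate 0.8, draft at 5% of target cost) exactly:

```python
def speculative_speedup(K: int, alpha: float, draft_cost_ratio: float) -> float:
    """Expected speedup: E[tokens per target pass] / relative cost of one step."""
    # Expected accepted tokens, including the bonus token from the target pass:
    # (1 - alpha^(K+1)) / (1 - alpha)
    expected = (1 - alpha ** (K + 1)) / (1 - alpha)
    cost = 1 + K * draft_cost_ratio  # one target pass + K cheap draft passes
    return expected / cost

print(round(speculative_speedup(5, 0.8, 0.05), 2))  # 2.95
```

Sweeping K and alpha with this function is a quick way to see that past K ≈ 5-7 the extra draft cost outweighs the marginal acceptance gains for typical acceptance rates.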
Choosing a Draft Model
The main strategies:
import torch.nn as nn

class DraftModelStrategies:
    """
    Draft model options for speculative decoding.
    (`load_model` below is an illustrative placeholder, not a real API.)
    """

    class SmallerVersionDraft:
        """
        Strategy 1: a smaller model from the same family.
        Example: Llama-70B target + Llama-7B draft.

        Pros:
        - Best acceptance rate (same family)
        - Well-aligned distributions
        Cons:
        - Two models must be loaded
        - Memory overhead
        """
        def __init__(self, target_size: str = "70B", draft_size: str = "7B"):
            self.target = load_model(f"llama-{target_size}")
            self.draft = load_model(f"llama-{draft_size}")

        def memory_usage(self) -> dict:
            return {
                "target": "140 GB (FP16)",
                "draft": "14 GB (FP16)",
                "total": "154 GB",
                "overhead": "10%"
            }

    class QuantizedDraft:
        """
        Strategy 2: a quantized version of the target.
        Example: Llama-70B FP16 target + Llama-70B INT4 draft.

        Pros:
        - Same weights, so little extra memory
        - Very high acceptance rate
        Cons:
        - The draft is still large, so drafting is not that cheap
        - Quantization can shift the distribution
        """
        def __init__(self, model_name: str):
            self.target = load_model(model_name, dtype="fp16")
            self.draft = load_model(model_name, dtype="int4")

    class EarlyExitDraft:
        """
        Strategy 3: early exit — run only the first N layers.
        Example: layers 1-8 of an 80-layer model.

        Pros:
        - No additional model (self-speculation)
        - Minimal overhead
        Cons:
        - Lower acceptance rate
        - Needs an auxiliary output head
        """
        def __init__(self, model, early_exit_layer: int = 8):
            self.model = model
            self.exit_layer = early_exit_layer
            # Auxiliary head for the early exit
            self.exit_head = nn.Linear(
                model.config.hidden_size,
                model.config.vocab_size
            )

        def draft_forward(self, x):
            # Run only the first N layers
            hidden = x
            for i, layer in enumerate(self.model.layers):
                hidden = layer(hidden)
                if i == self.exit_layer:
                    break
            # Project to the vocabulary with the auxiliary head
            return self.exit_head(hidden)

    class MedusaHeads:
        """
        Strategy 4: Medusa — multiple prediction heads.
        The heads predict tokens at positions t+2, t+3, ..., t+K+1.

        Pros:
        - A single forward pass produces the whole draft
        - Minimal overhead (~1% extra parameters)
        - 2-3x speedup
        Cons:
        - The heads need training
        - The speedup depends on head accuracy
        """
        def __init__(self, base_model, num_heads: int = 5):
            self.base = base_model
            hidden = base_model.config.hidden_size
            vocab = base_model.config.vocab_size
            self.heads = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(hidden, hidden),
                    nn.SiLU(),
                    nn.Linear(hidden, vocab)
                )
                for _ in range(num_heads)
            ])

        def forward(self, hidden_states):
            # Base prediction plus one speculative prediction per head
            base_logits = self.base.lm_head(hidden_states)
            spec_logits = [head(hidden_states) for head in self.heads]
            return base_logits, spec_logits
Medusa: Detailed Implementation
import torch
import torch.nn as nn
from typing import List
from transformers import LlamaForCausalLM

class MedusaModel(nn.Module):
    """
    Medusa: Simple LLM Inference Acceleration Framework with
    Multiple Decoding Heads (Cai et al., 2024).

    Adds K prediction heads to a base LLM; head k predicts the token
    k + 1 positions ahead of the current one.
    """

    def __init__(self,
                 base_model: LlamaForCausalLM,
                 num_heads: int = 5,
                 hidden_size: int = None):
        super().__init__()
        self.base_model = base_model
        self.num_heads = num_heads
        if hidden_size is None:
            hidden_size = base_model.config.hidden_size
        vocab_size = base_model.config.vocab_size

        # Medusa heads
        self.medusa_heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size, bias=False),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size, bias=False)
            )
            for _ in range(num_heads)
        ])
        self._init_medusa_heads()

    def _init_medusa_heads(self):
        """Initialize the Medusa heads with small weights."""
        for head in self.medusa_heads:
            for module in head.modules():
                if isinstance(module, nn.Linear):
                    nn.init.normal_(module.weight, std=0.02)

    def forward(self,
                input_ids: torch.Tensor,
                attention_mask: torch.Tensor = None,
                **kwargs):
        """
        Forward pass with Medusa heads.

        Returns:
        - base_logits: standard LM predictions
        - medusa_logits: list of K speculative predictions
        """
        outputs = self.base_model(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            **kwargs
        )
        hidden_states = outputs.hidden_states[-1]  # last hidden state
        base_logits = outputs.logits
        medusa_logits = [head(hidden_states) for head in self.medusa_heads]
        return base_logits, medusa_logits

    def generate_with_medusa(self,
                             input_ids: torch.Tensor,
                             max_new_tokens: int = 100,
                             **kwargs) -> torch.Tensor:
        """Speculative generation with Medusa heads."""
        generated = 0
        while generated < max_new_tokens:
            # A single forward pass yields the base prediction and K drafts
            base_logits, medusa_logits = self.forward(input_ids)
            base_pred = base_logits[:, -1, :].argmax(dim=-1)  # t+1
            medusa_preds = [
                logits[:, -1, :].argmax(dim=-1)               # t+2, t+3, ...
                for logits in medusa_logits
            ]
            # Candidates: base prediction plus the Medusa drafts
            candidates = [base_pred.squeeze(0)] + [p.squeeze(0) for p in medusa_preds]
            accepted = self.verify_candidates(input_ids, candidates)
            input_ids = torch.cat([input_ids, accepted.unsqueeze(0)], dim=1)
            generated += accepted.shape[0]
            if self.base_model.config.eos_token_id in accepted:
                break
        return input_ids

    def verify_candidates(self,
                          context: torch.Tensor,
                          candidates: List[torch.Tensor]) -> torch.Tensor:
        """
        Verify the speculative candidates.

        Simplified sequential verification; the full Medusa implementation
        uses tree attention to score many candidate paths in one pass.
        """
        verified = []
        for i, candidate in enumerate(candidates):
            if i == 0:
                # The base prediction is the target's own output: always accept
                verified.append(candidate)
                continue
            test_seq = torch.cat(
                [context, torch.stack(verified).unsqueeze(0)], dim=1
            )
            with torch.no_grad():
                logits = self.base_model(test_seq).logits
            predicted = logits[0, -1, :].argmax(dim=-1)
            if predicted == candidate:
                verified.append(candidate)
            else:
                verified.append(predicted)  # fall back to the target's token
                break
        return torch.stack(verified)

class MedusaTrainer:
    """
    Training the Medusa heads.

    The heads are trained to predict future tokens from the current
    hidden states; the base model stays frozen.
    """

    def __init__(self, medusa_model: MedusaModel, lr: float = 1e-4):
        self.model = medusa_model
        # Only train the Medusa heads, freeze the base
        for param in self.model.base_model.parameters():
            param.requires_grad = False
        for param in self.model.medusa_heads.parameters():
            param.requires_grad = True
        self.optimizer = torch.optim.AdamW(
            self.model.medusa_heads.parameters(),
            lr=lr
        )

    def train_step(self, input_ids: torch.Tensor) -> dict:
        """
        One training step for the Medusa heads.

        Head k sees the hidden state at position i and must predict the
        token at position i + k + 1 (the base LM head covers i + 1).
        """
        self.optimizer.zero_grad()
        base_logits, medusa_logits = self.model(input_ids)

        losses = []
        seq_len = input_ids.shape[1]
        for k, head_logits in enumerate(medusa_logits, start=1):
            shift = k + 1
            if seq_len <= shift:
                continue
            predictions = head_logits[:, :-shift, :]  # positions 0 .. seq_len-shift-1
            targets = input_ids[:, shift:]            # tokens shift .. seq_len-1
            losses.append(nn.functional.cross_entropy(
                predictions.reshape(-1, predictions.shape[-1]),
                targets.reshape(-1)
            ))

        total_loss = sum(losses) / len(losses)
        total_loss.backward()
        self.optimizer.step()
        return {
            "loss": total_loss.item(),
            "individual_losses": [l.item() for l in losses]
        }
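The alignment between a head and its training target is easy to get wrong, so a quick shape check helps: in Medusa's setup, the hidden state at position i feeds head k, which is scored against the token at position i + k + 1 (the base LM head covers i + 1). A minimal sketch with a dummy sequence:

```python
import torch

seq = torch.arange(10).unsqueeze(0)  # dummy token ids 0..9, batch of 1
k = 2                                # head index: predicts position i + k + 1
shift = k + 1

positions = torch.arange(seq.shape[1] - shift)  # usable source positions i
targets = seq[:, shift:]                        # the token each position must hit

assert targets.shape[1] == seq.shape[1] - shift
print(positions.tolist())   # [0, 1, 2, 3, 4, 5, 6]
print(targets[0].tolist())  # [3, 4, 5, 6, 7, 8, 9]
```

An off-by-one here (shifting by k instead of k + 1) silently trains head 1 to mimic the base LM head, which wastes the head entirely.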
EAGLE: State-of-the-Art
import torch
import torch.nn as nn
from transformers import LlamaForCausalLM

class EAGLEModel(nn.Module):
    """
    EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
    (Li et al., 2024).

    Key idea: an autoregressive draft head instead of independent heads.
    The draft is a small LM that generates a sequence of speculative tokens,
    conditioned on feature vectors taken from the target model.
    """

    def __init__(self,
                 base_model: LlamaForCausalLM,
                 draft_hidden_size: int = 1024,
                 draft_layers: int = 1):
        super().__init__()
        self.base_model = base_model
        base_hidden = base_model.config.hidden_size
        vocab_size = base_model.config.vocab_size

        # Project target features into the draft's hidden space
        self.feature_proj = nn.Linear(base_hidden, draft_hidden_size)
        # Draft model: a small transformer
        self.draft_layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=draft_hidden_size,
                nhead=8,
                dim_feedforward=draft_hidden_size * 4,
                batch_first=True
            )
            for _ in range(draft_layers)
        ])
        # Output projection and token embedding for the draft
        self.draft_head = nn.Linear(draft_hidden_size, vocab_size)
        self.draft_embed = nn.Embedding(vocab_size, draft_hidden_size)

    def get_base_features(self, input_ids: torch.Tensor) -> torch.Tensor:
        """Extract last-layer features from the base model."""
        with torch.no_grad():
            outputs = self.base_model(
                input_ids,
                output_hidden_states=True
            )
        return outputs.hidden_states[-1]

    def draft_generate(self,
                       features: torch.Tensor,
                       num_tokens: int = 5) -> torch.Tensor:
        """
        Autoregressive generation with the draft model,
        seeded with the projected target features.
        """
        # Start from the projected feature of the last position
        inputs = self.feature_proj(features[:, -1:, :])
        draft_tokens = []
        for _ in range(num_tokens):
            # Re-run the small draft transformer over the growing sequence
            hidden = inputs
            for layer in self.draft_layers:
                hidden = layer(hidden)
            logits = self.draft_head(hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)
            draft_tokens.append(next_token)
            # Feed the new token's embedding back in for the next step
            token_emb = self.draft_embed(next_token).unsqueeze(1)
            inputs = torch.cat([inputs, token_emb], dim=1)
        return torch.stack(draft_tokens, dim=1)  # (batch, num_tokens)

    def generate(self,
                 input_ids: torch.Tensor,
                 max_new_tokens: int = 100) -> torch.Tensor:
        """Full EAGLE generation loop."""
        prompt_len = input_ids.shape[1]
        while input_ids.shape[1] - prompt_len < max_new_tokens:
            # 1. Features from the base model
            features = self.get_base_features(input_ids)
            # 2. Draft tokens
            draft_tokens = self.draft_generate(features, num_tokens=5)
            # 3. Verify and accept (same accept/reject rule as in
            #    SpeculativeDecoder.verify_and_accept above)
            accepted = self.verify_speculative(input_ids, draft_tokens)
            # 4. Append
            input_ids = torch.cat([input_ids, accepted], dim=1)
            if self.base_model.config.eos_token_id in accepted:
                break
        return input_ids
Benchmark Results
A comparison of the methods across models and tasks:
| Method | Model | Task | Tokens/sec | Speedup |
|--------|-------|------|------------|---------|
| Baseline | Llama-2-70B | Code | 25 | 1.0x |
| Spec (7B draft) | Llama-2-70B | Code | 58 | 2.3x |
| Medusa | Llama-2-70B | Code | 68 | 2.7x |
| EAGLE | Llama-2-70B | Code | 85 | 3.4x |
| Baseline | Llama-2-70B | Chat | 25 | 1.0x |
| Spec (7B draft) | Llama-2-70B | Chat | 45 | 1.8x |
| Medusa | Llama-2-70B | Chat | 52 | 2.1x |
| EAGLE | Llama-2-70B | Chat | 62 | 2.5x |
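As a sanity check, the speedup column for the code task is just each method's tokens/sec divided by the 25 tok/s baseline (the dictionary keys below are shorthand for the table rows):

```python
baseline_tps = 25  # tokens/sec for Llama-2-70B on code without speculation

measured_tps = {"spec_7b_draft": 58, "medusa": 68, "eagle": 85}
speedups = {name: round(tps / baseline_tps, 1) for name, tps in measured_tps.items()}

print(speedups)  # {'spec_7b_draft': 2.3, 'medusa': 2.7, 'eagle': 3.4}
```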
Observations:
- Speedups are higher for structured tasks (code, JSON)
- Creative writing has a lower acceptance rate
- EAGLE consistently outperforms the other methods
Production Deployment
vLLM Integration:
import time
from typing import List
from vllm import LLM, SamplingParams

class SpeculativeVLLM:
    """Production-ready speculative decoding with vLLM."""

    def __init__(self,
                 model_name: str,
                 draft_model_name: str = None,
                 num_speculative_tokens: int = 5):
        # Note: vLLM's speculative-decoding arguments have changed across
        # releases; check the docs for the version you are running.
        self.llm = LLM(
            model=model_name,
            speculative_model=draft_model_name,
            num_speculative_tokens=num_speculative_tokens,
            tensor_parallel_size=1,  # adjust for multi-GPU
            gpu_memory_utilization=0.9
        )
        self.sampling_params = SamplingParams(
            temperature=0,  # greedy gives the best acceptance rate
            max_tokens=1024
        )

    def generate(self, prompts: List[str]) -> List[str]:
        """Batch generation."""
        outputs = self.llm.generate(prompts, self.sampling_params)
        return [o.outputs[0].text for o in outputs]

    def benchmark(self, prompts: List[str]) -> dict:
        """Rough throughput benchmark (token counts approximated by words)."""
        start = time.time()
        outputs = self.generate(prompts)
        elapsed = time.time() - start
        total_tokens = sum(len(o.split()) for o in outputs)
        return {
            "total_time_s": elapsed,
            "total_tokens": total_tokens,
            "tokens_per_second": total_tokens / elapsed,
            "prompts_per_second": len(prompts) / elapsed
        }
Research Ideas
For a bachelor's thesis:
- Benchmark speculative decoding across tasks (code, chat, QA)
- Compare different draft model sizes (7B, 3B, 1B)
- Find the optimal K for different use cases
For a master's thesis:
- Train custom Medusa heads for a specific domain
- Combine speculative decoding with quantization
- Multi-draft speculation (an ensemble of draft models)
- Domain-specific draft models
For a PhD:
- Optimal speculation strategies under sampling
- Theoretical bounds on speedup
- Hardware-aware speculation (memory hierarchy)
- Novel acceptance criteria
- Speculation for MoE architectures
Why this is a game changer
Inference cost = tokens × time × $/GPU-hour.
Speculative decoding cuts the time by 2-3x:
- on the same hardware
- with no loss of quality (mathematically lossless)
- with no change to the model
- as a drop-in replacement
For production systems serving millions of requests, that is millions of dollars saved. Not an optimization heuristic, but a mathematical guarantee of correctness.
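To make the cost formula concrete, here is a back-of-the-envelope estimate with hypothetical traffic and price numbers (none of these figures are measurements):

```python
# Hypothetical monthly serving cost, before and after a 2.5x decode speedup.
# All inputs below are illustrative assumptions.
requests_per_month = 10_000_000
tokens_per_request = 500
baseline_tok_per_s = 25   # assumed per-stream decode throughput
gpu_cost_per_hour = 2.0   # assumed $/GPU-hour
speedup = 2.5

gpu_hours = requests_per_month * tokens_per_request / baseline_tok_per_s / 3600
cost_before = gpu_hours * gpu_cost_per_hour
cost_after = cost_before / speedup

print(round(cost_before), round(cost_after))  # ~$111k vs ~$44k per month
```

Even with modest assumptions, the saving scales linearly with traffic, which is why the technique pays for its engineering cost quickly at production volume.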
If you are preparing an academic project on accelerating LLM inference — from a term paper to a dissertation — the SKP-Degree team at skp-degree.com.ua can help with the research and experiments. Write to us on Telegram: @kursovi_diplomy. We have hands-on experience with vLLM, Medusa, EAGLE, and custom speculation strategies.
Keywords: speculative decoding, LLM inference, Medusa, EAGLE, draft model, acceleration, vLLM, latency optimization, throughput, research project, bachelor's thesis, master's thesis, term paper.