An LLM generates text one token at a time, and each token costs a full forward pass through billions of parameters. At roughly 50 tokens/second (a commonly cited figure for GPT-4), a 500-token answer takes 10 seconds of waiting; code generation running to several thousand tokens takes a minute or more.
This is not just inconvenient. It is expensive. Inference cost is proportional to time × GPU. If the same answer can be generated in 5 seconds instead of 10, that is a 50% saving on compute.
What if you could generate 4-5 tokens per pass? With no change to the model. With no loss of quality. Exactly the same answer, just 2-3 times faster.
Speculative decoding does exactly that. It is as close to a free lunch as LLM optimization gets.
The problem: why autoregressive generation is slow
The sequential dependency is a fundamental constraint:
import torch

class AutoregressiveProblem:
    """Why LLM generation is inherently sequential."""

    def generate_standard(self, model, prompt: str, max_tokens: int) -> str:
        """
        Standard autoregressive generation.
        Problem: each token depends on the previous one,
        so the loop cannot be parallelized.
        (tokenize/sample/detokenize/EOS are illustrative placeholders.)
        """
        tokens = tokenize(prompt)
        for _ in range(max_tokens):
            # A full forward pass for ONE token
            logits = model.forward(tokens)
            # Sample the next token
            next_token = sample(logits[:, -1, :])
            # Append it to the sequence
            tokens = torch.cat([tokens, next_token.unsqueeze(0)], dim=1)
            if next_token == EOS:
                break
        return detokenize(tokens)

    def analyze_bottleneck(self):
        """
        Memory-bound vs compute-bound.
        For single-token generation the LLM is memory-bound:
        - every step has to stream all weights from memory;
        - the arithmetic finishes far sooner than the reads;
        - GPU compute utilization ends up below 1%.
        """
        # Llama-70B on A100 80GB (FP16 weights span two cards)
        model_size_gb = 140           # FP16 weights
        num_params = 70e9
        memory_bandwidth_gb_s = 2000  # A100 HBM
        flops = 312e12                # A100 FP16 peak

        # Time to read all weights once
        read_time = model_size_gb / memory_bandwidth_gb_s
        # Theoretical compute time (~2 FLOPs per parameter per token)
        compute_time = num_params * 2 / flops

        return {
            "memory_read_time_ms": read_time * 1000,     # ~70 ms
            "compute_time_ms": compute_time * 1000,      # ~0.45 ms
            "bottleneck": "memory bandwidth",
            "gpu_utilization": compute_time / read_time  # <1%
        }
The key observation: generating 1 token and verifying 10 tokens take roughly the same amount of time, because the bottleneck is reading the weights, not the arithmetic.
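That observation follows directly from the bandwidth numbers: a memory-bound decoder cannot exceed bandwidth / model size tokens per second, no matter how fast the compute is. Using the A100 figures from above:

```python
# A memory-bound decoder must stream all weights once per token, so
# tokens/sec is capped by bandwidth / model size (A100 numbers from above).
model_size_gb = 140    # Llama-70B, FP16
bandwidth_gb_s = 2000  # HBM bandwidth

time_per_token_ms = model_size_gb / bandwidth_gb_s * 1000
max_tokens_per_s = 1000 / time_per_token_ms

print(time_per_token_ms, round(max_tokens_per_s, 1))  # 70.0 ms → ~14.3 tok/s
```

Verifying K tokens in one pass streams the weights once for all K positions, which is exactly the loophole speculative decoding exploits.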
Speculative Decoding: the core idea
The concept:
1. A draft model (small, fast) generates K tokens "speculatively".
2. The target model (large, accurate) verifies all K tokens in a single forward pass.
3. Correct tokens are accepted, incorrect ones are discarded.
4. Guarantee: the output is identical to what the target model would produce without speculation.
An analogy: instead of asking GPT-4 "what is the next token?" 10 times in a row, we:
- ask GPT-2 to guess 10 tokens (very fast);
- ask GPT-4 to check those 10 tokens in parallel (one forward pass);
- accept the ones where GPT-4 agrees with GPT-2.
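Before the full implementation, the accept/reject rule from step 3 can be shown on a toy pair of distributions (all numbers here are made up for illustration):

```python
import random

random.seed(0)

# Toy vocabulary of 3 tokens; the draft proposed token 0, the target
# partly disagrees. Both distributions are invented for this example.
p_draft = [0.7, 0.2, 0.1]   # draft model's distribution
p_target = [0.5, 0.4, 0.1]  # target model's distribution
proposed = 0                # token the draft sampled

# Accept with probability min(1, p_target / p_draft)
accept_prob = min(1.0, p_target[proposed] / p_draft[proposed])

if random.random() < accept_prob:
    token = proposed
else:
    # On rejection, resample from the residual distribution
    # max(0, p_target - p_draft); this preserves the target's distribution.
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    total = sum(residual)
    residual = [r / total for r in residual]
    # inverse-CDF sample from the residual
    u, acc, token = random.random(), 0.0, 0
    for i, r in enumerate(residual):
        acc += r
        if u <= acc:
            token = i
            break

print(accept_prob, token)  # 0.714..., then the sampled token
```

With this rule, the marginal distribution of the emitted token is exactly the target's, which is why the method is lossless.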
The Speculative Decoding Algorithm
import torch
from typing import Tuple

class SpeculativeDecoder:
    """
    Speculative decoding.

    Key components:
    1. Draft model — generates K speculative tokens.
    2. Target model — verifies all K tokens in parallel.
    3. Acceptance criterion — ensures correctness.
    """

    def __init__(self, target_model, draft_model, K: int = 5):
        self.target = target_model
        self.draft = draft_model
        self.K = K  # number of speculative tokens per step

    def generate(self, prompt: torch.Tensor, max_tokens: int) -> torch.Tensor:
        """Main speculative decoding loop. `prompt` has shape (1, seq_len)."""
        tokens = prompt.clone()
        generated = 0
        while generated < max_tokens:
            # Step 1: draft K tokens plus their full distributions
            draft_tokens, draft_dists = self.draft_speculate(tokens)
            # Step 2: verify with the target model
            accepted = self.verify_and_accept(tokens, draft_tokens, draft_dists)
            # Step 3: append the accepted tokens
            tokens = torch.cat([tokens, accepted.unsqueeze(0)], dim=1)
            generated += accepted.shape[0]
            # Step 4: stop at EOS
            if accepted[-1] == self.target.eos_token_id:
                break
        return tokens

    def draft_speculate(self, context: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Draft model autoregressively proposes K tokens.

        Returns the proposed tokens, shape (K,), and the draft's full
        probability distribution at each step, shape (K, vocab); the full
        distributions are needed later for the rejection resampling step.
        """
        draft_tokens, draft_dists = [], []
        current = context.clone()
        for _ in range(self.K):
            with torch.no_grad():
                logits = self.draft(current)
            probs = torch.softmax(logits[:, -1, :], dim=-1)  # (1, vocab)
            token = probs.argmax(dim=-1)                     # greedy (or sample)
            draft_tokens.append(token.squeeze(0))
            draft_dists.append(probs.squeeze(0))
            current = torch.cat([current, token.unsqueeze(0)], dim=1)
        return torch.stack(draft_tokens), torch.stack(draft_dists)

    def verify_and_accept(self,
                          context: torch.Tensor,
                          draft_tokens: torch.Tensor,
                          draft_dists: torch.Tensor) -> torch.Tensor:
        """
        Target model verifies all draft tokens in parallel.
        Key insight: a single forward pass scores all K positions.
        """
        # Concatenate the context with all draft tokens
        full_sequence = torch.cat([context, draft_tokens.unsqueeze(0)], dim=1)
        # Single forward pass through the target
        with torch.no_grad():
            target_logits = self.target(full_sequence)
        # Positions -K-1 .. -2 are the ones predicting the K draft tokens
        target_probs = torch.softmax(target_logits[:, -self.K - 1:-1, :], dim=-1)

        accepted = []
        for i in range(self.K):
            token = draft_tokens[i]
            p_draft = draft_dists[i, token]
            p_target = target_probs[0, i, token]
            # Accept with probability min(1, p_target / p_draft)
            if torch.rand(1) < torch.clamp(p_target / p_draft, max=1.0):
                accepted.append(token)
            else:
                # Reject this and all subsequent tokens; resample from the
                # residual distribution max(0, p_target - p_draft), which
                # keeps the output distribution exactly equal to the target's
                residual = torch.clamp(target_probs[0, i] - draft_dists[i], min=0)
                residual = residual / residual.sum()
                accepted.append(torch.multinomial(residual, 1).squeeze(0))
                break
        # (Full implementations also sample a "bonus" token from the target's
        # last position when all K drafts are accepted.)
        return torch.stack(accepted)
class GreedySpeculativeDecoder(SpeculativeDecoder):
    """
    Simplified version for greedy decoding (temperature = 0).
    Simpler acceptance criterion: accept while draft == target argmax.
    """

    def verify_and_accept_greedy(self,
                                 context: torch.Tensor,
                                 draft_tokens: torch.Tensor) -> torch.Tensor:
        """Greedy verification — accept the prefix on which draft matches target."""
        full_sequence = torch.cat([context, draft_tokens.unsqueeze(0)], dim=1)
        with torch.no_grad():
            target_logits = self.target(full_sequence)
        # The target's argmax predictions at the K draft positions
        target_predictions = target_logits[:, -self.K - 1:-1, :].argmax(dim=-1)

        accepted = []
        for i in range(self.K):
            if draft_tokens[i] == target_predictions[0, i]:
                accepted.append(draft_tokens[i])
            else:
                # First mismatch: take the target's own prediction and stop
                accepted.append(target_predictions[0, i])
                break
        return torch.stack(accepted)
The mathematical guarantee: lossless speculation
Theorem: speculative decoding with the proper acceptance criterion produces exactly the same output distribution as the target model without speculation.
from typing import List

class SpeculativeDecodingTheory:
    """Mathematical justification of correctness."""

    @staticmethod
    def acceptance_probability(p_target: float, p_draft: float) -> float:
        """
        Acceptance probability for a single token:
            P(accept) = min(1, p_target / p_draft)
        If p_target >= p_draft: always accept.
        If p_target < p_draft: accept with probability p_target / p_draft.
        """
        return min(1.0, p_target / p_draft)

    @staticmethod
    def expected_accepted_tokens(acceptance_rates: List[float]) -> float:
        """
        Expected number of accepted tokens:
            E[accepted] = Σ_i P(token i accepted | all previous accepted)
        For a constant acceptance rate α and unbounded K this tends to
        1 / (1 - α), the mean of a geometric distribution.
        """
        expected = 0.0
        prob_all_accepted = 1.0
        for rate in acceptance_rates:
            expected += prob_all_accepted * rate
            prob_all_accepted *= rate
        return expected

    @staticmethod
    def speedup_analysis(K: int,
                         acceptance_rate: float,
                         draft_cost_ratio: float) -> float:
        """
        Theoretical speedup from speculative decoding.

        Parameters:
        - K: number of speculative tokens
        - acceptance_rate: P(draft token accepted)
        - draft_cost_ratio: cost_draft / cost_target

        Speedup ≈ E[accepted] / (1 + K × draft_cost_ratio)
        """
        # Expected accepted tokens per step (finite geometric series,
        # including the bonus token from the target's own forward pass)
        expected_accepted = (1 - acceptance_rate ** (K + 1)) / (1 - acceptance_rate)
        total_cost = 1 + K * draft_cost_ratio
        return expected_accepted / total_cost

# Example:
# K=5, acceptance_rate=0.8, draft cost = 5% of target
# Expected accepted ≈ 3.7 tokens per step
# Speedup ≈ 3.69 / 1.25 ≈ 3.0x
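The formula is easy to evaluate directly; this short sketch recomputes the worked example (K=5, acceptance rate 0.8, draft at 5% of target cost) exactly:

```python
def speculative_speedup(K: int, alpha: float, draft_cost_ratio: float) -> float:
    """Expected speedup: E[tokens per target pass] / relative cost of one step."""
    # Expected accepted tokens, including the bonus token from the target pass:
    # (1 - alpha^(K+1)) / (1 - alpha)
    expected = (1 - alpha ** (K + 1)) / (1 - alpha)
    cost = 1 + K * draft_cost_ratio  # one target pass + K cheap draft passes
    return expected / cost

print(round(speculative_speedup(5, 0.8, 0.05), 2))  # 2.95
```

Sweeping K and alpha with this function is a quick way to see that past K ≈ 5-7 the extra draft cost outweighs the marginal acceptance gains for typical acceptance rates.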
Choosing a Draft Model
The main strategies:
import torch.nn as nn

class DraftModelStrategies:
    """
    Draft model options for speculative decoding.
    (`load_model` below is an illustrative placeholder, not a real API.)
    """

    class SmallerVersionDraft:
        """
        Strategy 1: a smaller model from the same family.
        Example: Llama-70B target + Llama-7B draft.

        Pros:
        - Best acceptance rate (same family)
        - Well-aligned distributions
        Cons:
        - Two models must be loaded
        - Memory overhead
        """
        def __init__(self, target_size: str = "70B", draft_size: str = "7B"):
            self.target = load_model(f"llama-{target_size}")
            self.draft = load_model(f"llama-{draft_size}")

        def memory_usage(self) -> dict:
            return {
                "target": "140 GB (FP16)",
                "draft": "14 GB (FP16)",
                "total": "154 GB",
                "overhead": "10%"
            }

    class QuantizedDraft:
        """
        Strategy 2: a quantized version of the target.
        Example: Llama-70B FP16 target + Llama-70B INT4 draft.

        Pros:
        - Same weights, so little extra memory
        - Very high acceptance rate
        Cons:
        - The draft is still large, so drafting is not that cheap
        - Quantization can shift the distribution
        """
        def __init__(self, model_name: str):
            self.target = load_model(model_name, dtype="fp16")
            self.draft = load_model(model_name, dtype="int4")

    class EarlyExitDraft:
        """
        Strategy 3: early exit — run only the first N layers.
        Example: layers 1-8 of an 80-layer model.

        Pros:
        - No additional model (self-speculation)
        - Minimal overhead
        Cons:
        - Lower acceptance rate
        - Needs an auxiliary output head
        """
        def __init__(self, model, early_exit_layer: int = 8):
            self.model = model
            self.exit_layer = early_exit_layer
            # Auxiliary head for the early exit
            self.exit_head = nn.Linear(
                model.config.hidden_size,
                model.config.vocab_size
            )

        def draft_forward(self, x):
            # Run only the first N layers
            hidden = x
            for i, layer in enumerate(self.model.layers):
                hidden = layer(hidden)
                if i == self.exit_layer:
                    break
            # Project to the vocabulary with the auxiliary head
            return self.exit_head(hidden)

    class MedusaHeads:
        """
        Strategy 4: Medusa — multiple prediction heads.
        The heads predict tokens at positions t+2, t+3, ..., t+K+1.

        Pros:
        - A single forward pass produces the whole draft
        - Minimal overhead (~1% extra parameters)
        - 2-3x speedup
        Cons:
        - The heads need training
        - The speedup depends on head accuracy
        """
        def __init__(self, base_model, num_heads: int = 5):
            self.base = base_model
            hidden = base_model.config.hidden_size
            vocab = base_model.config.vocab_size
            self.heads = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(hidden, hidden),
                    nn.SiLU(),
                    nn.Linear(hidden, vocab)
                )
                for _ in range(num_heads)
            ])

        def forward(self, hidden_states):
            # Base prediction plus one speculative prediction per head
            base_logits = self.base.lm_head(hidden_states)
            spec_logits = [head(hidden_states) for head in self.heads]
            return base_logits, spec_logits
Medusa: Detailed Implementation
import torch
import torch.nn as nn
from typing import List
from transformers import LlamaForCausalLM

class MedusaModel(nn.Module):
    """
    Medusa: Simple LLM Inference Acceleration Framework with
    Multiple Decoding Heads (Cai et al., 2024).

    Adds K prediction heads to a base LLM; head k predicts the token
    k + 1 positions ahead of the current one.
    """

    def __init__(self,
                 base_model: LlamaForCausalLM,
                 num_heads: int = 5,
                 hidden_size: int = None):
        super().__init__()
        self.base_model = base_model
        self.num_heads = num_heads
        if hidden_size is None:
            hidden_size = base_model.config.hidden_size
        vocab_size = base_model.config.vocab_size

        # Medusa heads
        self.medusa_heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size, bias=False),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size, bias=False)
            )
            for _ in range(num_heads)
        ])
        self._init_medusa_heads()

    def _init_medusa_heads(self):
        """Initialize the Medusa heads with small weights."""
        for head in self.medusa_heads:
            for module in head.modules():
                if isinstance(module, nn.Linear):
                    nn.init.normal_(module.weight, std=0.02)

    def forward(self,
                input_ids: torch.Tensor,
                attention_mask: torch.Tensor = None,
                **kwargs):
        """
        Forward pass with Medusa heads.

        Returns:
        - base_logits: standard LM predictions
        - medusa_logits: list of K speculative predictions
        """
        outputs = self.base_model(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            **kwargs
        )
        hidden_states = outputs.hidden_states[-1]  # last hidden state
        base_logits = outputs.logits
        medusa_logits = [head(hidden_states) for head in self.medusa_heads]
        return base_logits, medusa_logits

    def generate_with_medusa(self,
                             input_ids: torch.Tensor,
                             max_new_tokens: int = 100,
                             **kwargs) -> torch.Tensor:
        """Speculative generation with Medusa heads."""
        generated = 0
        while generated < max_new_tokens:
            # A single forward pass yields the base prediction and K drafts
            base_logits, medusa_logits = self.forward(input_ids)
            base_pred = base_logits[:, -1, :].argmax(dim=-1)  # t+1
            medusa_preds = [
                logits[:, -1, :].argmax(dim=-1)               # t+2, t+3, ...
                for logits in medusa_logits
            ]
            # Candidates: base prediction plus the Medusa drafts
            candidates = [base_pred.squeeze(0)] + [p.squeeze(0) for p in medusa_preds]
            accepted = self.verify_candidates(input_ids, candidates)
            input_ids = torch.cat([input_ids, accepted.unsqueeze(0)], dim=1)
            generated += accepted.shape[0]
            if self.base_model.config.eos_token_id in accepted:
                break
        return input_ids

    def verify_candidates(self,
                          context: torch.Tensor,
                          candidates: List[torch.Tensor]) -> torch.Tensor:
        """
        Verify the speculative candidates.

        Simplified sequential verification; the full Medusa implementation
        uses tree attention to score many candidate paths in one pass.
        """
        verified = []
        for i, candidate in enumerate(candidates):
            if i == 0:
                # The base prediction is the target's own output: always accept
                verified.append(candidate)
                continue
            test_seq = torch.cat(
                [context, torch.stack(verified).unsqueeze(0)], dim=1
            )
            with torch.no_grad():
                logits = self.base_model(test_seq).logits
            predicted = logits[0, -1, :].argmax(dim=-1)
            if predicted == candidate:
                verified.append(candidate)
            else:
                verified.append(predicted)  # fall back to the target's token
                break
        return torch.stack(verified)

class MedusaTrainer:
    """
    Training the Medusa heads.

    The heads are trained to predict future tokens from the current
    hidden states; the base model stays frozen.
    """

    def __init__(self, medusa_model: MedusaModel, lr: float = 1e-4):
        self.model = medusa_model
        # Only train the Medusa heads, freeze the base
        for param in self.model.base_model.parameters():
            param.requires_grad = False
        for param in self.model.medusa_heads.parameters():
            param.requires_grad = True
        self.optimizer = torch.optim.AdamW(
            self.model.medusa_heads.parameters(),
            lr=lr
        )

    def train_step(self, input_ids: torch.Tensor) -> dict:
        """
        One training step for the Medusa heads.

        Head k sees the hidden state at position i and must predict the
        token at position i + k + 1 (the base LM head covers i + 1).
        """
        self.optimizer.zero_grad()
        base_logits, medusa_logits = self.model(input_ids)

        losses = []
        seq_len = input_ids.shape[1]
        for k, head_logits in enumerate(medusa_logits, start=1):
            shift = k + 1
            if seq_len <= shift:
                continue
            predictions = head_logits[:, :-shift, :]  # positions 0 .. seq_len-shift-1
            targets = input_ids[:, shift:]            # tokens shift .. seq_len-1
            losses.append(nn.functional.cross_entropy(
                predictions.reshape(-1, predictions.shape[-1]),
                targets.reshape(-1)
            ))

        total_loss = sum(losses) / len(losses)
        total_loss.backward()
        self.optimizer.step()
        return {
            "loss": total_loss.item(),
            "individual_losses": [l.item() for l in losses]
        }
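The alignment between a head and its training target is easy to get wrong, so a quick shape check helps: in Medusa's setup, the hidden state at position i feeds head k, which is scored against the token at position i + k + 1 (the base LM head covers i + 1). A minimal sketch with a dummy sequence:

```python
import torch

seq = torch.arange(10).unsqueeze(0)  # dummy token ids 0..9, batch of 1
k = 2                                # head index: predicts position i + k + 1
shift = k + 1

positions = torch.arange(seq.shape[1] - shift)  # usable source positions i
targets = seq[:, shift:]                        # the token each position must hit

assert targets.shape[1] == seq.shape[1] - shift
print(positions.tolist())   # [0, 1, 2, 3, 4, 5, 6]
print(targets[0].tolist())  # [3, 4, 5, 6, 7, 8, 9]
```

An off-by-one here (shifting by k instead of k + 1) silently trains head 1 to mimic the base LM head, which wastes the head entirely.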
EAGLE: State-of-the-Art
import torch
import torch.nn as nn
from transformers import LlamaForCausalLM

class EAGLEModel(nn.Module):
    """
    EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
    (Li et al., 2024).

    Key idea: an autoregressive draft head instead of independent heads.
    The draft is a small LM that generates a sequence of speculative tokens,
    conditioned on feature vectors taken from the target model.
    """

    def __init__(self,
                 base_model: LlamaForCausalLM,
                 draft_hidden_size: int = 1024,
                 draft_layers: int = 1):
        super().__init__()
        self.base_model = base_model
        base_hidden = base_model.config.hidden_size
        vocab_size = base_model.config.vocab_size

        # Project target features into the draft's hidden space
        self.feature_proj = nn.Linear(base_hidden, draft_hidden_size)
        # Draft model: a small transformer
        self.draft_layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=draft_hidden_size,
                nhead=8,
                dim_feedforward=draft_hidden_size * 4,
                batch_first=True
            )
            for _ in range(draft_layers)
        ])
        # Output projection and token embedding for the draft
        self.draft_head = nn.Linear(draft_hidden_size, vocab_size)
        self.draft_embed = nn.Embedding(vocab_size, draft_hidden_size)

    def get_base_features(self, input_ids: torch.Tensor) -> torch.Tensor:
        """Extract last-layer features from the base model."""
        with torch.no_grad():
            outputs = self.base_model(
                input_ids,
                output_hidden_states=True
            )
        return outputs.hidden_states[-1]

    def draft_generate(self,
                       features: torch.Tensor,
                       num_tokens: int = 5) -> torch.Tensor:
        """
        Autoregressive generation with the draft model,
        seeded with the projected target features.
        """
        # Start from the projected feature of the last position
        inputs = self.feature_proj(features[:, -1:, :])
        draft_tokens = []
        for _ in range(num_tokens):
            # Re-run the small draft transformer over the growing sequence
            hidden = inputs
            for layer in self.draft_layers:
                hidden = layer(hidden)
            logits = self.draft_head(hidden[:, -1, :])
            next_token = logits.argmax(dim=-1)
            draft_tokens.append(next_token)
            # Feed the new token's embedding back in for the next step
            token_emb = self.draft_embed(next_token).unsqueeze(1)
            inputs = torch.cat([inputs, token_emb], dim=1)
        return torch.stack(draft_tokens, dim=1)  # (batch, num_tokens)

    def generate(self,
                 input_ids: torch.Tensor,
                 max_new_tokens: int = 100) -> torch.Tensor:
        """Full EAGLE generation loop."""
        prompt_len = input_ids.shape[1]
        while input_ids.shape[1] - prompt_len < max_new_tokens:
            # 1. Features from the base model
            features = self.get_base_features(input_ids)
            # 2. Draft tokens
            draft_tokens = self.draft_generate(features, num_tokens=5)
            # 3. Verify and accept (same accept/reject rule as in
            #    SpeculativeDecoder.verify_and_accept above)
            accepted = self.verify_speculative(input_ids, draft_tokens)
            # 4. Append
            input_ids = torch.cat([input_ids, accepted], dim=1)
            if self.base_model.config.eos_token_id in accepted:
                break
        return input_ids
Benchmark Results
A comparison of the methods across models and tasks:
| Method | Model | Task | Tokens/sec | Speedup |
|--------|-------|------|------------|---------|
| Baseline | Llama-2-70B | Code | 25 | 1.0x |
| Spec (7B draft) | Llama-2-70B | Code | 58 | 2.3x |
| Medusa | Llama-2-70B | Code | 68 | 2.7x |
| EAGLE | Llama-2-70B | Code | 85 | 3.4x |
| Baseline | Llama-2-70B | Chat | 25 | 1.0x |
| Spec (7B draft) | Llama-2-70B | Chat | 45 | 1.8x |
| Medusa | Llama-2-70B | Chat | 52 | 2.1x |
| EAGLE | Llama-2-70B | Chat | 62 | 2.5x |
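As a sanity check, the speedup column for the code task is just each method's tokens/sec divided by the 25 tok/s baseline (the dictionary keys below are shorthand for the table rows):

```python
baseline_tps = 25  # tokens/sec for Llama-2-70B on code without speculation

measured_tps = {"spec_7b_draft": 58, "medusa": 68, "eagle": 85}
speedups = {name: round(tps / baseline_tps, 1) for name, tps in measured_tps.items()}

print(speedups)  # {'spec_7b_draft': 2.3, 'medusa': 2.7, 'eagle': 3.4}
```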
Observations:
- Speedups are higher for structured tasks (code, JSON)
- Creative writing has a lower acceptance rate
- EAGLE consistently outperforms the other methods
Production Deployment
vLLM Integration:
import time
from typing import List
from vllm import LLM, SamplingParams

class SpeculativeVLLM:
    """Production-ready speculative decoding with vLLM."""

    def __init__(self,
                 model_name: str,
                 draft_model_name: str = None,
                 num_speculative_tokens: int = 5):
        # Note: vLLM's speculative-decoding arguments have changed across
        # releases; check the docs for the version you are running.
        self.llm = LLM(
            model=model_name,
            speculative_model=draft_model_name,
            num_speculative_tokens=num_speculative_tokens,
            tensor_parallel_size=1,  # adjust for multi-GPU
            gpu_memory_utilization=0.9
        )
        self.sampling_params = SamplingParams(
            temperature=0,  # greedy gives the best acceptance rate
            max_tokens=1024
        )

    def generate(self, prompts: List[str]) -> List[str]:
        """Batch generation."""
        outputs = self.llm.generate(prompts, self.sampling_params)
        return [o.outputs[0].text for o in outputs]

    def benchmark(self, prompts: List[str]) -> dict:
        """Rough throughput benchmark (token counts approximated by words)."""
        start = time.time()
        outputs = self.generate(prompts)
        elapsed = time.time() - start
        total_tokens = sum(len(o.split()) for o in outputs)
        return {
            "total_time_s": elapsed,
            "total_tokens": total_tokens,
            "tokens_per_second": total_tokens / elapsed,
            "prompts_per_second": len(prompts) / elapsed
        }
Research Ideas
For a bachelor's thesis:
- Benchmark speculative decoding across tasks (code, chat, QA)
- Compare different draft model sizes (7B, 3B, 1B)
- Find the optimal K for different use cases
For a master's thesis:
- Train custom Medusa heads for a specific domain
- Combine speculative decoding with quantization
- Multi-draft speculation (an ensemble of draft models)
- Domain-specific draft models
For a PhD:
- Optimal speculation strategies under sampling
- Theoretical bounds on speedup
- Hardware-aware speculation (memory hierarchy)
- Novel acceptance criteria
- Speculation for MoE architectures
Why this is a game changer
Inference cost = tokens × time × $/GPU-hour.
Speculative decoding cuts the time by 2-3x:
- on the same hardware
- with no loss of quality (mathematically lossless)
- with no change to the model
- as a drop-in replacement
For production systems serving millions of requests, that is millions of dollars saved. Not an optimization heuristic, but a mathematical guarantee of correctness.
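To make the cost formula concrete, here is a back-of-the-envelope estimate with hypothetical traffic and price numbers (none of these figures are measurements):

```python
# Hypothetical monthly serving cost, before and after a 2.5x decode speedup.
# All inputs below are illustrative assumptions.
requests_per_month = 10_000_000
tokens_per_request = 500
baseline_tok_per_s = 25   # assumed per-stream decode throughput
gpu_cost_per_hour = 2.0   # assumed $/GPU-hour
speedup = 2.5

gpu_hours = requests_per_month * tokens_per_request / baseline_tok_per_s / 3600
cost_before = gpu_hours * gpu_cost_per_hour
cost_after = cost_before / speedup

print(round(cost_before), round(cost_after))  # ~$111k vs ~$44k per month
```

Even with modest assumptions, the saving scales linearly with traffic, which is why the technique pays for its engineering cost quickly at production volume.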
If you are preparing an academic project on accelerating LLM inference — from a term paper to a dissertation — the SKP-Degree team at skp-degree.com.ua can help with the research and experiments. Write to us on Telegram: @kursovi_diplomy. We have hands-on experience with vLLM, Medusa, EAGLE, and custom speculation strategies.
Keywords: speculative decoding, LLM inference, Medusa, EAGLE, draft model, acceleration, vLLM, latency optimization, throughput, research project, bachelor's thesis, master's thesis, term paper.