LLM Quantization: How to Shrink a Giant Without Losing Its Mind

Llama-70B takes 140 GB in FP16. That means two top-end 80 GB datacenter GPUs just to hold the weights. Cost? $3000 and up. And what if you need several models for A/B testing? Or an inference cluster for production? The bill quickly climbs into the tens of thousands.

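The same Llama-70B with 4-bit quantization: 35 GB of weights. That fits on two RTX 4090s (24 GB each, about $1600 apiece), or on a single one with part of the layers offloaded to system RAM. Performance drop? Typically 1-2%. For most tasks it goes unnoticed.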

Quantization is neither magic nor a hack. It is the mathematics of storing and processing numbers efficiently, and that mathematics lets you run the most powerful LLMs on ordinary consumer hardware. Democratization of AI is not marketing but an engineering reality.

The fundamental problem: the memory wall

Why memory is the bottleneck:

class MemoryAnalysis:
    """Memory usage analysis for LLMs"""

    def compute_model_size(self, num_params_billions: float,
                           precision: str) -> dict:
        """
        Computes the size of the model in memory (GB)

        precision: "fp32", "fp16", "bf16", "int8", "int4"
        """
        bytes_per_param = {
            "fp32": 4,
            "fp16": 2,
            "bf16": 2,
            "int8": 1,
            "int4": 0.5
        }

        size_gb = num_params_billions * bytes_per_param[precision]

        return {
            "model_size_gb": size_gb,
            "with_overhead_gb": size_gb * 1.2,  # KV cache, activations
            "min_vram_inference": size_gb * 1.3,
            "min_vram_training": size_gb * 4  # gradients, optimizer states
        }

    def analyze_llama_family(self):
        """Analysis of the Llama models"""
        models = {
            "Llama-3-8B": 8,
            "Llama-3-70B": 70,
            "Llama-3-405B": 405
        }

        results = {}
        for name, params in models.items():
            results[name] = {
                "fp16": self.compute_model_size(params, "fp16"),
                "int8": self.compute_model_size(params, "int8"),
                "int4": self.compute_model_size(params, "int4")
            }

        return results

# Results:
# Llama-3-70B:
#   FP16: 140 GB → 2× A100 80GB or 6× RTX 4090 24GB
#   INT8:  70 GB → 1× A100 80GB or 3× RTX 4090 24GB
#   INT4:  35 GB → 1× A100 80GB or 2× RTX 4090 24GB (leaves margin for KV cache)
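The sizing arithmetic above can be reproduced in a few lines. This is a minimal sketch that counts weights only (no KV cache, no activations); the helper names `weights_gb` and `gpus_needed` are illustrative and do not come from any library.

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_billions: float, precision: str) -> float:
    """Weight size in GB (1e9 bytes): billions of params x bytes per param."""
    return params_billions * BYTES_PER_PARAM[precision]

def gpus_needed(size_gb: float, vram_gb: float) -> int:
    """Smallest number of GPUs whose combined VRAM holds the weights alone."""
    return math.ceil(size_gb / vram_gb)

for precision in ("fp16", "int8", "int4"):
    size = weights_gb(70, precision)
    print(f"{precision}: {size:.0f} GB -> "
          f"{gpus_needed(size, 24)} x 24 GB or {gpus_needed(size, 80)} x 80 GB")
```

In practice you want headroom on top of these counts, because the KV cache grows with context length and batch size.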

Memory bandwidth as the bottleneck:

class InferenceBandwidthAnalysis:
    """
    For LLM inference the main bottleneck is memory bandwidth,
    not compute.

    Reason: generating each token requires reading ALL the weights.
    """

    def tokens_per_second(self,
                          model_size_gb: float,
                          memory_bandwidth_gb_s: float,
                          batch_size: int = 1) -> float:
        """
        Theoretical max tokens/second based on bandwidth

        A100 40GB: 1.5 TB/s bandwidth
        RTX 4090:  1.0 TB/s bandwidth
        RTX 3090:  0.9 TB/s bandwidth
        """
        # Each token reads all the weights once
        bytes_per_token = model_size_gb * 1e9

        # Bandwidth / bytes = tokens/second.
        # The batch_size**0.5 factor is a rough heuristic for how batching
        # amortizes weight reads; it is not a measured law.
        effective_bandwidth = memory_bandwidth_gb_s * 1e9
        tps = effective_bandwidth / bytes_per_token * batch_size**0.5

        return tps

    def quantization_speedup(self):
        """
        Why quantization speeds up inference

        Fewer bytes per weight = less to read from memory
        = higher throughput
        """
        # RTX 4090, Llama-70B
        bandwidth = 1000  # GB/s

        return {
            "fp16_tps": self.tokens_per_second(140, bandwidth),
            "int8_tps": self.tokens_per_second(70, bandwidth),
            "int4_tps": self.tokens_per_second(35, bandwidth),
            "speedup_int4_vs_fp16": 140 / 35  # ~4x theoretical
        }
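As a worked instance of the bandwidth argument, the sketch below computes the batch-size-1 upper bound for the three precisions. It is pure arithmetic under the assumption that each token reads every weight exactly once; it ignores whether the model physically fits on the card, as well as KV cache reads and kernel overhead.

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound for batch size 1: every generated token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

BANDWIDTH_GB_S = 1000  # the RTX 4090 figure used in the text

for name, size_gb in [("fp16", 140), ("int8", 70), ("int4", 35)]:
    bound = max_tokens_per_second(size_gb, BANDWIDTH_GB_S)
    print(f"{name}: at most {bound:.1f} tokens/s")
```

The bound scales inversely with model size, which is where the 4x theoretical speedup of INT4 over FP16 comes from.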

The mathematics of quantization

Linear (Uniform) Quantization:

import numpy as np

class LinearQuantization:
    """
    Basic linear quantization

    Mapping: float range → integer range
    """

    def __init__(self, bits: int = 8):
        self.bits = bits
        self.qmin = 0
        self.qmax = 2**bits - 1

    def quantize(self, x: np.ndarray) -> tuple[np.ndarray, float, float]:
        """
        Quantizes a float array to integers

        Returns: (quantized, scale, zero_point)
        """
        # Find the value range
        x_min, x_max = float(x.min()), float(x.max())

        # Compute scale and zero point (fall back to 1.0 for a constant input)
        scale = (x_max - x_min) / (self.qmax - self.qmin) or 1.0
        zero_point = float(np.round(self.qmin - x_min / scale))

        # Quantize; uint8 holds any width up to 8 bits
        x_quant = np.clip(
            np.round(x / scale + zero_point),
            self.qmin,
            self.qmax
        ).astype(np.uint8 if self.bits <= 8 else np.uint16)

        return x_quant, scale, zero_point

    def dequantize(self, x_quant: np.ndarray,
                   scale: float, zero_point: float) -> np.ndarray:
        """Recovers floats from quantized values"""
        return (x_quant.astype(np.float32) - zero_point) * scale


class SymmetricQuantization:
    """
    Symmetric quantization around zero

    Simpler math, often better for weights
    """

    def __init__(self, bits: int = 8):
        self.bits = bits
        self.qmax = 2**(bits - 1) - 1  # 127 for int8

    def quantize(self, x: np.ndarray) -> tuple[np.ndarray, float]:
        """
        Symmetric quantization: scale only, zero_point = 0
        """
        # Find the maximum absolute value
        abs_max = float(np.abs(x).max())

        # Scale (fall back to 1.0 for an all-zero input)
        scale = abs_max / self.qmax if abs_max > 0 else 1.0

        # Quantize; int8 holds any width up to 8 bits, wider needs int16
        x_quant = np.clip(
            np.round(x / scale),
            -self.qmax - 1,
            self.qmax
        ).astype(np.int8 if self.bits <= 8 else np.int16)

        return x_quant, scale

    def dequantize(self, x_quant: np.ndarray, scale: float) -> np.ndarray:
        return x_quant.astype(np.float32) * scale
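To see the two schemes side by side, here is a small dependency-free round-trip sketch. The function names and the toy input are assumptions made for illustration; the input is deliberately non-negative, which is the case where the asymmetric scheme tends to have the advantage.

```python
def affine_roundtrip(xs, bits=8):
    """Asymmetric (scale + zero point) quantize then dequantize a list of floats."""
    qmax = 2**bits - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / qmax or 1.0          # guard against a constant input
    zero_point = round(-lo / scale)
    out = []
    for x in xs:
        q = min(max(round(x / scale) + zero_point, 0), qmax)
        out.append((q - zero_point) * scale)
    return out

def symmetric_roundtrip(xs, bits=8):
    """Symmetric (scale only) quantize then dequantize a list of floats."""
    qmax = 2**(bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax or 1.0
    return [min(max(round(x / scale), -qmax - 1), qmax) * scale for x in xs]

def max_abs_error(xs, ys):
    return max(abs(x - y) for x, y in zip(xs, ys))

skewed = [0.0, 0.1, 0.25, 0.5, 0.9, 1.0]     # all non-negative toy values
err_affine = max_abs_error(skewed, affine_roundtrip(skewed, bits=4))
err_symmetric = max_abs_error(skewed, symmetric_roundtrip(skewed, bits=4))
print(f"affine: {err_affine:.4f}  symmetric: {err_symmetric:.4f}")
```

On this input the symmetric grid spends half of its 16 levels on negative values that never occur, so its step is roughly twice as coarse. For weights centred on zero the difference largely disappears, which is why the simpler symmetric form is common there.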

Per-channel vs per-tensor quantization:

class PerChannelQuantization:
    """
    Per-channel quantization: a separate scale for each output channel

    Noticeably better quality for weights, especially in Linear layers
    """

    def quantize_weight(self, weight: np.ndarray, bits: int = 8) -> dict:
        """
        weight: (out_features, in_features)
        Each row (output channel) is quantized separately
        """
        out_features, in_features = weight.shape
        qmax = 2**(bits - 1) - 1

        scales = np.zeros(out_features)
        weight_quant = np.zeros_like(weight, dtype=np.int8)

        for i in range(out_features):
            row = weight[i]
            abs_max = np.abs(row).max()
            scales[i] = abs_max / qmax if abs_max > 0 else 1.0
            weight_quant[i] = np.round(row / scales[i]).clip(-qmax-1, qmax)

        return {
            "weight_quant": weight_quant,
            "scales": scales,
            "bits": bits
        }

    def dequantize_weight(self, quant_data: dict) -> np.ndarray:
        """Dequantize with per-channel scales"""
        weight = quant_data["weight_quant"].astype(np.float32)
        scales = quant_data["scales"]

        # Broadcast: scales (out,) × weight (out, in)
        return weight * scales[:, np.newaxis]
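The benefit of per-channel scales shows up as soon as rows differ in magnitude. The following sketch uses plain lists and a hypothetical 2x4 weight matrix; the helper names are illustrative only.

```python
def quantize_row(row, scale, qmax):
    """Symmetric round-trip of one row with a given scale."""
    return [min(max(round(w / scale), -qmax - 1), qmax) * scale for w in row]

def mean_abs_error(a, b):
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(abs(x - y) for x, y in pairs) / len(pairs)

def per_tensor(weight, bits=4):
    """One scale shared by the whole matrix."""
    qmax = 2**(bits - 1) - 1
    scale = max(abs(w) for row in weight for w in row) / qmax
    return [quantize_row(row, scale, qmax) for row in weight]

def per_channel(weight, bits=4):
    """One scale per row (output channel)."""
    qmax = 2**(bits - 1) - 1
    return [quantize_row(row, max(abs(w) for w in row) / qmax, qmax)
            for row in weight]

weight = [
    [0.02, -0.05, 0.03, 0.04],   # small-magnitude output channel
    [4.00, -7.00, 2.50, 6.00],   # large-magnitude output channel
]
err_tensor = mean_abs_error(weight, per_tensor(weight))
err_channel = mean_abs_error(weight, per_channel(weight))
print(f"per-tensor: {err_tensor:.4f}  per-channel: {err_channel:.4f}")
```

With a single shared scale the large row dictates the step size, and every value in the small row rounds to zero. Per-channel scales keep that row's structure at the cost of storing one extra number per row.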

GPTQ: Post-Training Quantization for LLMs

GPTQ Algorithm (Frantar et al., 2022):

import torch

class GPTQ:
    """
    GPTQ: Accurate Post-Training Quantization for GPT

    Key idea: optimal layer-by-layer quantization
    that accounts for the reconstruction error
    """

    def __init__(self, layer: torch.nn.Linear, bits: int = 4):
        self.layer = layer
        self.bits = bits
        self.columns = layer.weight.shape[1]

        # Hessian accumulator
        self.H = torch.zeros(self.columns, self.columns)
        self.nsamples = 0

    def add_batch(self, inp: torch.Tensor):
        """
        Accumulate Hessian statistics from calibration data

        H = X^T X (second-order information)
        """
        if len(inp.shape) == 2:
            inp = inp.unsqueeze(0)
        batch_size = inp.shape[0]

        inp = inp.reshape(-1, inp.shape[-1])
        self.H += inp.T @ inp
        self.nsamples += batch_size

    def quantize(self) -> dict:
        """
        GPTQ quantization, one column at a time, with error compensation

        Minimizes the layer output error ||XW^T - XQ^T||^2, where H = X^T X
        """
        W = self.layer.weight.data.clone().float()
        H = self.H / self.nsamples

        # Add dampening for numerical stability
        H += torch.eye(self.columns) * 1e-4 * torch.diag(H).mean()

        # Invert H via Cholesky, then take the upper Cholesky factor of H^-1.
        # Its rows give (up to a scaling that cancels below) the inverse-Hessian
        # rows for the columns that are still unquantized.
        H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))
        H_inv = torch.linalg.cholesky(H_inv, upper=True)

        # One symmetric scale per output row, fixed before the loop
        qmax = 2 ** (self.bits - 1) - 1
        scales = (W.abs().amax(dim=1) / qmax).clamp(min=1e-8)

        Q = torch.zeros_like(W)

        for i in range(self.columns):
            w = W[:, i]
            h_ii = H_inv[i, i]

            # Quantize column
            q = self.quantize_column(w, scales)
            Q[:, i] = q

            # Compute error
            err = (w - q) / h_ii

            # Propagate error to remaining columns
            W[:, i:] -= err.unsqueeze(1) @ H_inv[i, i:].unsqueeze(0)

        # Q holds dequantized values; the integer codes are Q / scales.
        # Bit-packing of the codes is left out of this sketch.
        return {
            "weight_quant": torch.round(Q / scales.unsqueeze(1)).to(torch.int8),
            "scales": scales
        }

    def quantize_column(self, w: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
        """Round one column to the signed grid and return dequantized values"""
        qmax = 2 ** (self.bits - 1) - 1
        return torch.round(w / scales).clamp(-qmax - 1, qmax) * scales
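The objective and the per-column update that the loop above implements can be written out explicitly. The notation follows the code and is otherwise an assumption of this sketch: $X$ holds the calibration inputs (one row per token), $W$ is the weight matrix (out x in), and $\operatorname{quant}(\cdot)$ rounds to the quantization grid.

```latex
\begin{aligned}
\hat{W} &= \arg\min_{Q}\; \lVert X W^{\top} - X Q^{\top} \rVert_F^2,
\qquad H = X^{\top} X \\[4pt]
\text{for column } i:\quad
q_i &= \operatorname{quant}(w_i), \qquad
e_i = \frac{w_i - q_i}{[H^{-1}]_{ii}} \\[4pt]
W_{:,\,j} &\leftarrow W_{:,\,j} - e_i\,[H^{-1}]_{ij}, \qquad j > i
\end{aligned}
```

Here $H^{-1}$ stands for the inverse Hessian restricted to the columns not yet quantized. GPTQ obtains these rows from a Cholesky factor of $H^{-1}$ rather than recomputing the inverse after every column, which is what makes the method practical for large layers.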

AWQ: Activation-aware Weight Quantization

AWQ (Lin et al., 2023):

class AWQ:
    """
    AWQ: Activation-aware Weight Quantization

    Key idea: not all weights are equally important
    Salient weights (those that matter for activations) are kept more precise
    """

    def __init__(self, model, calibration_data):
        self.model = model
        self.calibration_data = calibration_data

    def compute_activation_scales(self, layer: torch.nn.Linear,
                                   inputs: torch.Tensor) -> torch.Tensor:
        """
        Compute an importance score for each input channel
        from the magnitude of the activations
        """
        # Compute average activation magnitude per channel
        act_scales = inputs.abs().mean(dim=0)
        return act_scales

    def quantize(self, weight: torch.Tensor, bits: int = 4) -> torch.Tensor:
        """Simple per-row symmetric round-to-nearest; returns dequantized values"""
        qmax = 2 ** (bits - 1) - 1
        step = (weight.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
        return torch.round(weight / step).clamp(-qmax - 1, qmax) * step

    def search_optimal_scale(self, weight: torch.Tensor,
                              act_scales: torch.Tensor,
                              grid_size: int = 20) -> torch.Tensor:
        """
        Search for the best scaling factor s for each input channel

        Idea: W' = W * s, X' = X / s
        The output is unchanged, but salient channels are scaled up before
        rounding, so their relative quantization error shrinks
        """
        best_scales = torch.ones(weight.shape[1])
        best_error = float('inf')

        # Grid search over the exponent
        for ratio in torch.linspace(0.1, 0.9, grid_size):
            # Larger scale for channels with larger activations
            scales = act_scales.clamp(min=1e-5).pow(ratio)
            scales = scales / scales.mean()  # normalize

            # Scale up, quantize, scale back
            quant_weight = self.quantize(weight * scales.unsqueeze(0))
            recon = quant_weight / scales.unsqueeze(0)

            # Weight the error by activation magnitude as a cheap proxy
            # for the layer output error
            error = ((weight - recon) * act_scales.unsqueeze(0)).pow(2).mean()

            if error < best_error:
                best_error = error
                best_scales = scales

        return best_scales

    def quantize_layer(self, layer: torch.nn.Linear,
                       calibration_inputs: torch.Tensor) -> dict:
        """Full AWQ quantization for one layer"""

        weight = layer.weight.data

        # 1. Compute activation scales
        act_scales = self.compute_activation_scales(layer, calibration_inputs)

        # 2. Search optimal weight scaling
        weight_scales = self.search_optimal_scale(weight, act_scales)

        # 3. Apply scaling and quantize; at inference time the layer inputs
        #    are divided by weight_scales to compensate
        scaled_weight = weight * weight_scales.unsqueeze(0)
        quant_weight = self.quantize(scaled_weight)

        return {
            "weight_quant": quant_weight,
            "weight_scales": weight_scales,
            "act_scales": act_scales
        }
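The scaling trick is easier to trust after checking it on numbers. The sketch below uses a single output neuron with made-up weights, activations and scale vector (all hypothetical): it confirms that scaling is exact before rounding, and shows one case where boosting the channel with the large activation reduces the output error after rounding.

```python
def linear(weights, x):
    """y = sum_j w_j * x_j for a single output neuron."""
    return sum(w * xi for w, xi in zip(weights, x))

def quantize(weights, bits=4):
    """Symmetric round-to-nearest with one scale for the whole group."""
    qmax = 2**(bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]

weights = [0.35, 1.00, -0.80, 0.45]
x = [8.0, 0.5, 0.4, 0.3]          # channel 0 carries the large activation
s = [2.0, 1.0, 1.0, 1.0]          # hypothetical scale that boosts channel 0

# Before quantization the rewrite is exact: (w * s) . (x / s) == w . x
scaled_w = [w * si for w, si in zip(weights, s)]
scaled_x = [xi / si for xi, si in zip(x, s)]

y_ref = linear(weights, x)
err_plain = abs(linear(quantize(weights), x) - y_ref)
err_scaled = abs(linear(quantize(scaled_w), scaled_x) - y_ref)
print(f"plain: {err_plain:.4f}  scaled: {err_scaled:.4f}")
```

The step size is unchanged here because the group maximum is still 1.0, while the rounding error on channel 0 is divided by its scale when mapped back. A single example proves nothing in general, and a scale that is too large would raise the group maximum and hurt the other channels, which is why AWQ searches for the scale instead of fixing it.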

GGUF/GGML: CPU-optimized quantization

GGUF format (llama.cpp):

class GGUFQuantization:
    """
    GGUF: the file format used by llama.cpp

    Features:
    - Mixed precision (different bit widths for different layers)
    - CPU-optimized (SIMD instructions)
    - Streaming inference (low memory)
    """

    QUANT_TYPES = {
        "Q2_K": {"bits": 2, "block_size": 256, "description": "Extreme compression"},
        "Q3_K_S": {"bits": 3, "block_size": 256, "description": "Small 3-bit"},
        "Q3_K_M": {"bits": 3, "block_size": 256, "description": "Medium 3-bit"},
        "Q4_0": {"bits": 4, "block_size": 32, "description": "Basic 4-bit"},
        "Q4_K_S": {"bits": 4, "block_size": 256, "description": "Small 4-bit K-quant"},
        "Q4_K_M": {"bits": 4, "block_size": 256, "description": "Medium 4-bit K-quant"},
        "Q5_0": {"bits": 5, "block_size": 32, "description": "Basic 5-bit"},
        "Q5_K_S": {"bits": 5, "block_size": 256, "description": "Small 5-bit K-quant"},
        "Q5_K_M": {"bits": 5, "block_size": 256, "description": "Medium 5-bit K-quant"},
        "Q6_K": {"bits": 6, "block_size": 256, "description": "6-bit K-quant"},
        "Q8_0": {"bits": 8, "block_size": 32, "description": "8-bit basic"}
    }

    @staticmethod
    def recommend_quant_type(model_size_b: float,
                             available_ram_gb: float,
                             quality_priority: bool = True) -> str:
        """
        Recommends a quantization type

        quality_priority: True = better quality, False = less memory
        """
        # Estimate sizes
        base_size = model_size_b * 2  # FP16 baseline GB

        recommendations = []

        for qtype, info in GGUFQuantization.QUANT_TYPES.items():
            estimated_size = base_size * info["bits"] / 16
            if estimated_size < available_ram_gb * 0.8:  # 80% margin
                recommendations.append((qtype, estimated_size, info["bits"]))

        if not recommendations:
            return "Model too large for available RAM"

        if quality_priority:
            # Pick the type with the most bits
            return max(recommendations, key=lambda x: x[2])[0]
        else:
            # Pick the type with the smallest size
            return min(recommendations, key=lambda x: x[1])[0]
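The nominal bit widths in `QUANT_TYPES` slightly undercount real file sizes, because every block also stores its own scale. A minimal sketch of that overhead for the two simple block formats, assuming one 16-bit scale per 32-weight block (the K-quant super-block layouts are more involved and are not modelled here):

```python
def bits_per_weight(block_size: int, weight_bits: int, scale_bits: int = 16) -> float:
    """Effective storage cost when each block carries one extra scale value."""
    return (block_size * weight_bits + scale_bits) / block_size

def file_size_gb(params_billions: float, bpw: float) -> float:
    """Approximate size in GB for a given effective bits-per-weight."""
    return params_billions * bpw / 8

q4_0 = bits_per_weight(block_size=32, weight_bits=4)
q8_0 = bits_per_weight(block_size=32, weight_bits=8)
print(f"Q4_0: {q4_0} bpw, 70B model ~{file_size_gb(70, q4_0):.1f} GB")
print(f"Q8_0: {q8_0} bpw, 70B model ~{file_size_gb(70, q8_0):.1f} GB")
```

Using effective rather than nominal bits in a size estimate gives a more conservative answer to the question of whether a model fits in RAM.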

QLoRA: fine-tuning quantized models

QLoRA Pipeline:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

class QLoRATrainer:
    """
    QLoRA: Quantized Low-Rank Adaptation

    Key innovations:
    1. 4-bit NormalFloat (NF4) quantization
    2. Double quantization (quantize quantization constants)
    3. Paged optimizers (handle memory spikes)
    """

    def __init__(self, model_name: str):
        self.model_name = model_name

        # 4-bit quantization config
        self.bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",  # NormalFloat4
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True  # Quantize scales too
        )

    def load_quantized_model(self):
        """Load the model in 4-bit"""
        model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            quantization_config=self.bnb_config,
            device_map="auto",
            trust_remote_code=True
        )

        # Prepare for training
        model.gradient_checkpointing_enable()
        model = prepare_model_for_kbit_training(model)

        return model

    def add_lora_adapters(self, model, target_modules=None):
        """Add LoRA adapters"""

        if target_modules is None:
            target_modules = [
                "q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"
            ]

        lora_config = LoraConfig(
            r=64,  # LoRA rank
            lora_alpha=16,  # Scaling factor
            target_modules=target_modules,
            lora_dropout=0.1,
            bias="none",
            task_type="CAUSAL_LM"
        )

        model = get_peft_model(model, lora_config)

        # Print trainable params
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        total_params = sum(p.numel() for p in model.parameters())
        print(f"Trainable: {trainable_params:,} / {total_params:,} "
              f"({100 * trainable_params / total_params:.2f}%)")

        return model

    def training_config(self):
        """Training configuration for QLoRA"""
        return {
            "per_device_train_batch_size": 4,
            "gradient_accumulation_steps": 4,
            "warmup_steps": 100,
            "max_steps": 1000,
            "learning_rate": 2e-4,
            "bf16": True,  # Mixed precision; matches bnb_4bit_compute_dtype above
            "logging_steps": 10,
            "optim": "paged_adamw_32bit",  # Memory-efficient optimizer
            "save_strategy": "steps",
            "save_steps": 100
        }
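The reason the trainable fraction printed by `add_lora_adapters` is so small follows from simple counting. The sketch below assumes a square 8192x8192 projection purely for illustration; `lora_params` is a hypothetical helper, not part of `peft`.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A is (r x d_in), B is (d_out x r): r * (d_in + d_out) trainable values."""
    return r * (d_in + d_out)

d_in = d_out = 8192      # assumed square projection with hidden size 8192
r = 64                   # LoRA rank used in the config above

full = d_in * d_out
adapter = lora_params(d_in, d_out, r)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
```

Only the adapter values need gradients and optimizer state, while the frozen base weights stay in 4-bit, which is what keeps the memory footprint of fine-tuning low.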


class NF4Quantization:
    """
    NormalFloat4 (NF4): a dedicated 4-bit format for LLM weights

    Idea: LLM weights are usually approximately normally distributed,
    and NF4 is optimized for that distribution
    """

    # NF4 quantization levels (optimized for N(0,1))
    NF4_BINS = [
        -1.0, -0.6961928009986877, -0.5250730514526367,
        -0.39491748809814453, -0.28444138169288635,
        -0.18477343022823334, -0.09105003625154495, 0.0,
        0.07958029955625534, 0.16093020141124725,
        0.24611230194568634, 0.33791524171829224,
        0.44070982933044434, 0.5626170039176941,
        0.7229568362236023, 1.0
    ]

    @staticmethod
    def quantize_nf4(x: torch.Tensor) -> tuple[torch.Tensor, float]:
        """Quantize to NF4"""
        # Normalize
        absmax = x.abs().max()
        x_norm = x / absmax

        # Find nearest bin
        bins = torch.tensor(NF4Quantization.NF4_BINS)
        indices = (x_norm.unsqueeze(-1) - bins).abs().argmin(dim=-1)

        return indices.byte(), absmax.item()
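A full round trip makes the format easier to follow. The sketch below is dependency-free, reuses the sixteen levels listed above, and quantizes in blocks with one absmax per block; the block size of 64 and the function names are assumptions of this example.

```python
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367,
    -0.39491748809814453, -0.28444138169288635,
    -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725,
    0.24611230194568634, 0.33791524171829224,
    0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def nf4_quantize_block(block):
    """Return 4-bit codes (0..15) and the absmax used for normalization."""
    absmax = max(abs(v) for v in block) or 1.0   # avoid dividing by zero
    codes = [min(range(16), key=lambda k: abs(v / absmax - NF4_LEVELS[k]))
             for v in block]
    return codes, absmax

def nf4_dequantize_block(codes, absmax):
    return [NF4_LEVELS[c] * absmax for c in codes]

def nf4_roundtrip(values, block_size=64):
    """Quantize and dequantize block by block, one absmax per block."""
    out = []
    for start in range(0, len(values), block_size):
        codes, absmax = nf4_quantize_block(values[start:start + block_size])
        out.extend(nf4_dequantize_block(codes, absmax))
    return out

values = [0.5, -2.0, 0.0, 1.0]
restored = nf4_roundtrip(values)
print(restored)
```

The levels are denser near zero, so small values are reproduced more accurately than mid-range ones. Per-block absmax limits the damage a single outlier can do to its neighbours.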

Benchmark: Quantization Quality

Comparison of methods on Llama-3-70B:

| Method | Bits | Size (GB) | Perplexity | MMLU | HumanEval |
|--------|------|-----------|------------|------|-----------|
| FP16 | 16 | 140 | 3.12 | 79.5% | 72.0% |
| INT8 GPTQ | 8 | 70 | 3.14 | 79.3% | 71.5% |
| INT4 GPTQ | 4 | 35 | 3.21 | 78.8% | 70.2% |
| INT4 AWQ | 4 | 35 | 3.18 | 79.1% | 70.8% |
| Q4_K_M | 4 | 38 | 3.19 | 78.9% | 70.5% |
| Q3_K_M | 3 | 30 | 3.35 | 77.5% | 67.3% |
| Q2_K | 2 | 20 | 4.12 | 72.1% | 58.4% |

Conclusions:

4-bit: under 1% relative loss on MMLU, about 2% on HumanEval
3-bit: about 2.5% loss on MMLU, 6.5% on HumanEval
2-bit: significant degradation (about 9% on MMLU, 19% on HumanEval)
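The relative losses can be recomputed directly from the table. In this sketch the numbers are copied from the rows above; treating the last two columns as MMLU and HumanEval accuracy is an assumption of the example, and `relative_change` is an illustrative helper.

```python
# (perplexity, accuracy column 1 in %, accuracy column 2 in %)
RESULTS = {
    "FP16":      (3.12, 79.5, 72.0),
    "INT4 GPTQ": (3.21, 78.8, 70.2),
    "INT4 AWQ":  (3.18, 79.1, 70.8),
    "Q3_K_M":    (3.35, 77.5, 67.3),
    "Q2_K":      (4.12, 72.1, 58.4),
}

def relative_change(value: float, baseline: float) -> float:
    """Change relative to the FP16 baseline, in percent."""
    return (value - baseline) / baseline * 100

base = RESULTS["FP16"]
for name, row in RESULTS.items():
    ppl, acc1, acc2 = (relative_change(v, b) for v, b in zip(row, base))
    print(f"{name:10s} ppl {ppl:+5.1f}%  acc1 {acc1:+5.1f}%  acc2 {acc2:+5.1f}%")
```

Looking at relative rather than absolute change makes it clearer that the second accuracy column degrades faster than the first as the bit width drops.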

Practical deployment

Choosing hardware + quantization:

class DeploymentGuide:
    """A practical deployment guide"""

    CONFIGURATIONS = {
        "rtx_4090_24gb": {
            "models": {
                "Llama-3-8B": "Q8_0 or FP16",
                "Llama-3-70B": "Q4_K_M with partial CPU offload (~38 GB of weights)",
                "Mixtral-8x7B": "Q3_K_M, or Q4_K_M with partial CPU offload",
                "Llama-3-405B": "Does not fit"
            },
            "recommendations": [
                "Q4_K_M: the best balance",
                "Q5_K_M if quality is critical",
                "Q3_K_M if you need room for a context buffer"
            ]
        },
        "rtx_3090_24gb": {
            "models": {
                "Llama-3-8B": "Q8_0",
                "Llama-3-70B": "Q4_K_S with partial CPU offload",
                "Mixtral-8x7B": "Q3_K_M, or Q4_K_S with partial CPU offload"
            }
        },
        "mac_m2_max_96gb": {
            "models": {
                "Llama-3-8B": "FP16",
                "Llama-3-70B": "Q4_K_M (Q8_0 is tight)",
                "Llama-3-405B": "Does not fit (4-bit weights alone exceed 200 GB)"
            },
            "notes": "CPU inference slower but unified memory helps"
        },
        "cloud_a100_80gb": {
            "models": {
                "Llama-3-70B": "INT8 on one GPU (tight), FP16 on two",
                "Llama-3-405B": "INT4 multi-GPU"
            }
        }
    }

    @staticmethod
    def recommend(model: str, hardware: str, use_case: str) -> dict:
        """Quantization recommendation"""

        # Use case priorities
        if use_case == "code_generation":
            quality_threshold = 0.99  # High accuracy required
        elif use_case == "chat":
            quality_threshold = 0.95  # Some quality can be traded away
        elif use_case == "summarization":
            quality_threshold = 0.93  # More tolerant
        else:
            quality_threshold = 0.95

        # ... recommendation logic
        return {
            "quant_type": "Q4_K_M",
            "expected_quality": 0.98,
            "memory_usage": "~38GB",
            "tokens_per_second": "~30-40"
        }
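The "context buffer" mentioned in the recommendations is mostly KV cache, and its size is easy to estimate. The sketch below assumes the Llama-3-70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache; `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Keys and values (hence the factor 2) for every layer and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed Llama-3-70B shape: 80 layers, 8 KV heads, head_dim 128
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
ctx_8k = kv_cache_bytes(80, 8, 128, seq_len=8192)
print(f"{per_token:,} bytes per token, {ctx_8k / 1e9:.2f} GB for 8192 tokens")
```

This is per sequence, so serving several requests in parallel multiplies the figure. Adding it to the weight size gives a more realistic picture of whether a configuration fits than the weights alone.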

Research ideas

For a bachelor's thesis:

Benchmark GPTQ vs AWQ vs GGUF on specific tasks (code, math, chat)
Find the optimal quantization level for different model sizes
The effect of quantization on reasoning tasks

For a master's thesis:

Quantization-aware training for domain-specific models
Mixed-precision strategies (different bit widths for different layers)
Adaptive quantization based on layer importance
Quantization combined with speculative decoding

For a PhD:

Theoretical bounds: the minimum number of bits without loss of capability
Novel quantization schemes beyond uniform
Hardware-software co-design for quantized inference
Quantization for emerging architectures (MoE, SSM)

Why this matters

Democratization of AI is not a buzzword. Before quantization, large LLMs were available only to companies with budgets for datacenter hardware. GPT-3 175B would need ~350 GB of memory in FP16. That is about five 80 GB A100s for the weights alone.

Quantization changed the game:

Llama-70B on consumer GPUs is a reality
Fine-tuning 70B on a single GPU with QLoRA
Local, privacy-preserving AI becomes possible
Inference costs drop by a factor of 3-4
For those preparing academic work on LLM quantization, from a term paper to a dissertation, the SKP-Degree team at skp-degree.com.ua is ready to help with experiments and analysis. Write to us on Telegram: @kursovi_diplomy. We have experience with GPTQ, AWQ, QLoRA, and deployment on a variety of hardware.



About the author

SKP-Degree Team
