ResNet. Transformer. EfficientNet. All of these architectures are the product of years of research, expert intuition, countless experiments, and happy accidents. Yann LeCun was experimenting with convolutions back in 1989. Vaswani et al. built the Transformer around attention in 2017. The human brain has been designing architectures for artificial intelligence.
But what if this could be automated? What if an AI could find the optimal architecture on its own: not by picking from ready-made templates, but by designing it from scratch? For a specific task. For specific hardware. Under latency, memory, and energy constraints.
Neural Architecture Search (NAS) does exactly that. Google uses NAS for mobile models, Meta for ranking systems, NVIDIA for edge inference. And automatically discovered architectures often outperform the best human-designed ones.
## Levels of AutoML
AutoML is not a single technology but a spectrum of automation levels:
Level 0: Manual ML
- A human selects the model
- A human tunes the hyperparameters
- A human does the feature engineering
- A human interprets the results
Level 1: Hyperparameter Optimization (HPO)
- Automated search over learning rate, batch size, etc.
- Grid search, random search, Bayesian optimization
- Model selection stays with the human
Level 2: Algorithm Selection
- Which algorithm fits this task?
- XGBoost vs Random Forest vs Neural Network
- Meta-learning to guide the choice
Level 3: Neural Architecture Search
- Which neural network architecture?
- How many layers, of what type, with which connections
- The hardest level
Level 4: Full AutoML
- Data preprocessing
- Feature engineering
- Model selection
- Architecture design
- Hyperparameter tuning
- All automated end-to-end
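As a minimal illustration of Level 1, random search over a log-uniform learning-rate range fits in a few lines. The quadratic `toy_objective` below is an illustrative stand-in for a real validation loss, not part of any library:

```python
import math
import random

def toy_objective(lr: float) -> float:
    """Stand-in for validation loss: minimum near lr = 1e-2."""
    return (math.log10(lr) + 2.0) ** 2

def random_search(n_trials: int = 50, seed: int = 0) -> dict:
    """Level-1 AutoML: sample hyperparameters, keep the best."""
    rng = random.Random(seed)
    best = {'lr': None, 'loss': float('inf')}
    for _ in range(n_trials):
        # Sample lr log-uniformly from [1e-5, 1e-1]
        lr = 10 ** rng.uniform(-5, -1)
        loss = toy_objective(lr)
        if loss < best['loss']:
            best = {'lr': lr, 'loss': loss}
    return best

best = random_search()
print(best)  # best lr should land near 1e-2
```

Grid search and Bayesian optimization differ only in how the next candidate is proposed; the sample-evaluate-keep-best loop is the same.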
## Neural Architecture Search: Three Components
NAS consists of three key components:
### 1. Search Space
```python
import random
from dataclasses import dataclass
from enum import Enum
from math import comb
from typing import List, Optional

class OperationType(Enum):
    CONV_3x3 = "conv3x3"
    CONV_5x5 = "conv5x5"
    CONV_1x1 = "conv1x1"
    SEP_CONV_3x3 = "sep_conv3x3"
    DIL_CONV_3x3 = "dil_conv3x3"
    MAX_POOL_3x3 = "max_pool3x3"
    AVG_POOL_3x3 = "avg_pool3x3"
    SKIP = "skip"
    NONE = "none"

@dataclass
class CellSearchSpace:
    """Search space for a single cell."""
    n_nodes: int = 4  # number of intermediate nodes
    operations: Optional[List[OperationType]] = None

    def __post_init__(self):
        if self.operations is None:
            self.operations = list(OperationType)

    def sample_random_cell(self) -> dict:
        """Sample a random cell."""
        cell = {'nodes': [], 'edges': []}
        for i in range(self.n_nodes):
            # Each node connects to two predecessors (+2 for the cell inputs)
            inputs = random.sample(range(i + 2), 2)
            ops = [random.choice(self.operations) for _ in range(2)]
            cell['nodes'].append({'inputs': inputs, 'operations': ops})
        return cell

    def get_search_space_size(self) -> int:
        """Size of the search space."""
        n_ops = len(self.operations)
        # Per node: choose 2 inputs from the predecessors plus 2 operations,
        # i.e. C(n, 2) * n_ops^2
        total = 1
        for i in range(self.n_nodes):
            n_possible_inputs = i + 2
            total *= comb(n_possible_inputs, 2) * (n_ops ** 2)
        return total
```
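To get a feel for the combinatorics, the same size formula can be evaluated standalone; the numbers below follow the default configuration above (4 intermediate nodes, 9 candidate operations):

```python
from math import comb

def cell_search_space_size(n_nodes: int = 4, n_ops: int = 9) -> int:
    """Number of distinct cells: per node, choose an unordered pair of
    inputs from the available predecessors and one op per chosen edge."""
    total = 1
    for i in range(n_nodes):
        n_possible_inputs = i + 2  # i previous nodes + 2 cell inputs
        total *= comb(n_possible_inputs, 2) * n_ops ** 2
    return total

print(cell_search_space_size())      # 7748409780 for the defaults
print(cell_search_space_size(1, 9))  # 81: C(2,2) input pairs times 9^2 ops
```

Roughly 7.7 billion cells from just 4 nodes and 9 operations: exhaustive enumeration is hopeless, which is why a search strategy is needed.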
### 2. Search Strategy
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional

class RLNASController(nn.Module):
    """Reinforcement learning controller for NAS."""

    def __init__(self, n_ops: int, n_nodes: int, hidden_size: int = 64):
        super().__init__()
        self.n_ops = n_ops
        self.n_nodes = n_nodes
        # LSTM that generates the architecture decision by decision
        self.lstm = nn.LSTMCell(hidden_size, hidden_size)
        # Embeddings for operations (could be fed back as the LSTM input)
        self.op_embedding = nn.Embedding(n_ops, hidden_size)
        # Output heads
        self.op_classifier = nn.Linear(hidden_size, n_ops)
        self.input_classifier = nn.Linear(hidden_size, n_nodes + 2)
        # Learnable initial state
        self.h0 = nn.Parameter(torch.zeros(1, hidden_size))
        self.c0 = nn.Parameter(torch.zeros(1, hidden_size))

    def forward(self, batch_size: int = 1):
        """Generate a batch of architectures."""
        h = self.h0.expand(batch_size, -1)
        c = self.c0.expand(batch_size, -1)
        architectures, log_probs, entropies = [], [], []
        for node_idx in range(self.n_nodes):
            # Each node needs 2 inputs and 2 operations
            node_decisions = []
            for j in range(2):
                # Input selection
                n_possible = node_idx + 2
                h, c = self.lstm(h, (h, c))
                input_logits = self.input_classifier(h)[:, :n_possible]
                input_dist = torch.distributions.Categorical(
                    F.softmax(input_logits, dim=-1))
                input_idx = input_dist.sample()
                log_probs.append(input_dist.log_prob(input_idx))
                entropies.append(input_dist.entropy())
                node_decisions.append(('input', input_idx))
                # Operation selection
                h, c = self.lstm(h, (h, c))
                op_dist = torch.distributions.Categorical(
                    F.softmax(self.op_classifier(h), dim=-1))
                op_idx = op_dist.sample()
                log_probs.append(op_dist.log_prob(op_idx))
                entropies.append(op_dist.entropy())
                node_decisions.append(('op', op_idx))
            architectures.append(node_decisions)
        log_prob = torch.stack(log_probs).sum(dim=0)
        entropy = torch.stack(entropies).mean(dim=0)
        return architectures, log_prob, entropy

    def update_with_reward(self, log_probs: torch.Tensor,
                           rewards: torch.Tensor,
                           baseline: float,
                           entropy: Optional[torch.Tensor] = None,
                           entropy_weight: float = 0.01) -> torch.Tensor:
        """REINFORCE loss: policy gradient with a baseline and an
        optional entropy bonus that encourages exploration."""
        advantage = rewards - baseline
        policy_loss = -(log_probs * advantage).mean()
        if entropy is not None:
            policy_loss = policy_loss - entropy_weight * entropy.mean()
        return policy_loss
```
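The REINFORCE loop itself does not require a large controller. A toy version with a two-operation "architecture" and a moving-average baseline shows the mechanics without any deep learning framework (all names and reward values here are illustrative):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_toy(steps: int = 200, lr: float = 0.5, seed: int = 0):
    """Choose between two 'operations'; op 0 yields reward 1.0, op 1 yields 0.0.
    REINFORCE with a moving-average baseline should learn to prefer op 0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # controller logits
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(theta)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if action == 0 else 0.0
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
        # grad of log pi(action) w.r.t. theta_k is (1[action == k] - probs[k])
        for k in range(2):
            grad = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * advantage * grad
    return softmax(theta)

probs = reinforce_toy()
print(probs)  # probability of op 0 should end up close to 1
```

The real controller does the same thing, just with an LSTM policy over thousands of decisions and validation accuracy as the reward.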
### 3. Performance Estimation
```python
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

class PerformanceEstimator:
    """Estimate architecture quality; build_model, train_epoch and
    evaluate are assumed helpers defined elsewhere."""

    def __init__(self, train_data, val_data, epochs_per_arch: int = 10):
        self.train_data = train_data
        self.val_data = val_data
        self.epochs = epochs_per_arch

    def evaluate_full_training(self, architecture: dict) -> float:
        """Full training: expensive but accurate."""
        model = self.build_model(architecture)
        for epoch in range(self.epochs):
            train_epoch(model, self.train_data)
        return evaluate(model, self.val_data)

    def evaluate_weight_sharing(self, architecture: dict,
                                supernet: nn.Module) -> float:
        """Weight sharing: fast, less accurate."""
        # Activate only the selected paths in the supernet
        subnet = supernet.get_subnet(architecture)
        return evaluate(subnet, self.val_data)

    def evaluate_zero_cost(self, architecture: dict) -> float:
        """Zero-cost proxies: nearly instant, least accurate."""
        model = self.build_model(architecture)
        model.train()
        # Proxy metrics computed without any training
        scores = []
        # 1. Gradient norm at initialization
        x, y = next(iter(self.train_data))
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        grad_norm = sum(p.grad.norm().item()
                        for p in model.parameters() if p.grad is not None)
        scores.append(grad_norm)
        # 2. Number of linear regions (expressivity)
        scores.append(self.count_linear_regions(model))
        # 3. Trainability score
        scores.append(self.compute_trainability(model))
        return float(np.mean(scores))

    def count_linear_regions(self, model: nn.Module) -> float:
        """Expressivity proxy via linear regions (simplified: count ReLUs)."""
        return sum(1 for m in model.modules() if isinstance(m, nn.ReLU))
```
## DARTS: Differentiable Architecture Search
DARTS revolutionized NAS by making the search differentiable:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOperation(nn.Module):
    """Weighted sum of all candidate operations; SepConv, DilConv and
    Zero are assumed to follow the DARTS reference implementation."""

    def __init__(self, channels: int, stride: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, stride, 1, bias=False),
            nn.Conv2d(channels, channels, 5, stride, 2, bias=False),
            SepConv(channels, channels, 3, stride, 1),
            DilConv(channels, channels, 3, stride, 2, 2),
            nn.MaxPool2d(3, stride, 1),
            nn.AvgPool2d(3, stride, 1),
            nn.Identity() if stride == 1
                else nn.Conv2d(channels, channels, 1, stride, 0, bias=False),
            Zero(stride)
        ])

    def forward(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        """Weighted combination of operations."""
        return sum(w * op(x) for w, op in zip(weights, self.ops))

class DARTSCell(nn.Module):
    """DARTS cell with learnable architecture parameters."""

    def __init__(self, channels: int, n_nodes: int = 4):
        super().__init__()
        self.n_nodes = n_nodes
        # One MixedOperation per edge
        self.edges = nn.ModuleDict()
        for i in range(n_nodes):
            for j in range(i + 2):  # connect to all previous nodes + 2 cell inputs
                self.edges[f'{j}_{i+2}'] = MixedOperation(channels, stride=1)
        # Architecture parameters (alpha): 8 operations per edge
        self._arch_params = nn.ParameterDict()
        for i in range(n_nodes):
            n_inputs = i + 2
            self._arch_params[f'node_{i}'] = nn.Parameter(
                torch.randn(n_inputs, 8) * 0.001
            )

    def forward(self, s0: torch.Tensor, s1: torch.Tensor) -> torch.Tensor:
        """Forward pass with the current architecture weights."""
        states = [s0, s1]
        for i in range(self.n_nodes):
            # Softmax over operations for each edge
            weights = F.softmax(self._arch_params[f'node_{i}'], dim=-1)
            # Sum over all incoming edges
            node_output = 0
            for j, state in enumerate(states):
                node_output += self.edges[f'{j}_{i+2}'](state, weights[j])
            states.append(node_output)
        # Concatenate the intermediate nodes
        return torch.cat(states[2:], dim=1)

    def get_discrete_architecture(self) -> dict:
        """Extract a discrete architecture from the continuous parameters."""
        arch = {}
        for i in range(self.n_nodes):
            weights = self._arch_params[f'node_{i}']
            # Best operation on each edge, then keep the top-2 edges
            best_ops = weights.argmax(dim=-1)
            edge_weights = weights.max(dim=-1).values
            top2 = edge_weights.topk(2).indices
            arch[f'node_{i}'] = {
                'inputs': top2.tolist(),
                'operations': [best_ops[idx].item() for idx in top2]
            }
        return arch

class DARTSTrainer:
    """Bilevel optimization for DARTS."""

    def __init__(self, model: nn.Module, train_loader, val_loader):
        self.model = model
        self.train_loader = train_loader
        self.val_loader = val_loader
        # Separate optimizers for weights (w) and architecture params (alpha)
        self.weight_optim = torch.optim.SGD(
            [p for n, p in model.named_parameters() if 'arch_params' not in n],
            lr=0.025, momentum=0.9, weight_decay=3e-4
        )
        self.arch_optim = torch.optim.Adam(
            [p for n, p in model.named_parameters() if 'arch_params' in n],
            lr=3e-4, betas=(0.5, 0.999), weight_decay=1e-3
        )

    def train_epoch(self):
        """One epoch of bilevel optimization."""
        train_iter = iter(self.train_loader)
        val_iter = iter(self.val_loader)
        for step in range(len(self.train_loader)):
            train_x, train_y = next(train_iter)
            try:
                val_x, val_y = next(val_iter)
            except StopIteration:  # val_loader may be shorter than train_loader
                val_iter = iter(self.val_loader)
                val_x, val_y = next(val_iter)
            # Step 1: update architecture params on the validation loss
            self.arch_optim.zero_grad()
            val_loss = F.cross_entropy(self.model(val_x), val_y)
            val_loss.backward()
            self.arch_optim.step()
            # Step 2: update weights on the training loss
            self.weight_optim.zero_grad()
            train_loss = F.cross_entropy(self.model(train_x), train_y)
            train_loss.backward()
            self.weight_optim.step()
        return val_loss.item()
```
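The core trick, mixing operations with softmax weights, can be seen without any neural network. In this sketch the "operations" are simple scalar functions and `alpha` plays the role of the architecture parameters (all names and values are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Three candidate "operations" on a scalar input: skip, square, zero
ops = [lambda x: x, lambda x: x * x, lambda x: 0.0]

def mixed_op(x, alpha):
    """DARTS-style continuous relaxation: softmax-weighted sum of all ops."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, ops))

# With equal alphas every op contributes 1/3: (3 + 9 + 0) / 3 = 4.0
print(mixed_op(3.0, [0.0, 0.0, 0.0]))

# A large alpha on one op approximates discrete selection (argmax at deploy time)
print(round(mixed_op(3.0, [10.0, 0.0, 0.0]), 3))  # the skip op dominates
```

Because `mixed_op` is differentiable in `alpha`, gradient descent can move the architecture parameters directly, which is exactly what the `arch_optim` step above does at scale.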
## Hardware-Aware NAS
Accuracy is not the only metric. In practice, latency, memory footprint, and energy matter too:
```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class HardwareConstraints:
    max_latency_ms: float = 10.0
    max_memory_mb: float = 100.0
    max_params_m: float = 5.0
    max_flops_g: float = 1.0

class HardwareAwareNAS:
    """NAS under hardware constraints; the latency predictors and the
    train/evaluate helpers are assumed to be defined elsewhere."""

    def __init__(self, search_space, hardware: str = 'gpu'):
        self.search_space = search_space
        self.hardware = hardware
        self.latency_predictor = self._build_latency_predictor()

    def _build_latency_predictor(self) -> Callable:
        """Train or load a latency prediction model
        (a lookup table or a neural predictor)."""
        if self.hardware == 'gpu':
            return GPULatencyPredictor()
        elif self.hardware == 'mobile':
            return MobileLatencyPredictor()
        else:
            return CPULatencyPredictor()

    def evaluate_architecture(self, arch: dict,
                              constraints: HardwareConstraints) -> Dict:
        """Multi-objective evaluation."""
        model = self.search_space.build_model(arch)
        # Measure or predict each metric
        accuracy = self.train_and_evaluate(model)
        latency = self.latency_predictor.predict(arch)
        params = sum(p.numel() for p in model.parameters()) / 1e6
        flops = self.compute_flops(model)
        # Check the constraints (memory would need a separate measurement)
        feasible = (
            latency <= constraints.max_latency_ms and
            params <= constraints.max_params_m and
            flops <= constraints.max_flops_g
        )
        return {
            'accuracy': accuracy,
            'latency_ms': latency,
            'params_m': params,
            'flops_g': flops,
            'feasible': feasible
        }

    def multi_objective_search(self, n_iterations: int,
                               constraints: HardwareConstraints):
        """Pareto-optimal architecture search."""
        pareto_front = []
        for _ in range(n_iterations):
            arch = self.search_space.sample_random_cell()
            metrics = self.evaluate_architecture(arch, constraints)
            if not metrics['feasible']:
                continue
            # Update the Pareto front
            is_dominated = False
            to_remove = []
            for j, (existing_arch, existing_metrics) in enumerate(pareto_front):
                if self._dominates(existing_metrics, metrics):
                    is_dominated = True
                    break
                if self._dominates(metrics, existing_metrics):
                    to_remove.append(j)
            if not is_dominated:
                pareto_front = [x for k, x in enumerate(pareto_front)
                                if k not in to_remove]
                pareto_front.append((arch, metrics))
        return pareto_front

    def _dominates(self, a: Dict, b: Dict) -> bool:
        """a dominates b: at least as good in every objective
        and strictly better in at least one."""
        better_in_all = (
            a['accuracy'] >= b['accuracy'] and
            a['latency_ms'] <= b['latency_ms']
        )
        strictly_better = (
            a['accuracy'] > b['accuracy'] or
            a['latency_ms'] < b['latency_ms']
        )
        return better_in_all and strictly_better
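Stripped of the model-building machinery, the dominance check and front update reduce to a few lines over (accuracy, latency) pairs; the candidate values below are made up for illustration:

```python
def dominates(a, b):
    """a dominates b: at least as good in both objectives, strictly better in one.
    Objectives: maximize accuracy (index 0), minimize latency in ms (index 1)."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(points):
    """Keep the points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

candidates = [(0.76, 10.0), (0.74, 5.0), (0.73, 12.0), (0.78, 20.0)]
print(pareto_front(candidates))
# (0.73, 12.0) is dominated by (0.76, 10.0); the other three trade off
```

No single point on the front is "best": the deployment constraints decide whether to pay 20 ms for 0.78 accuracy or take the 5 ms model.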
## Once-for-All: Train Once, Specialize Anywhere
```python
import random
import torch
import torch.nn as nn

class OnceForAllNetwork(nn.Module):
    """Supernet containing subnetworks for different constraints;
    ElasticBlock and SubNetwork are assumed helper modules."""

    def __init__(self, depths: list = [2, 3, 4],
                 widths: list = [0.5, 0.75, 1.0],
                 kernel_sizes: list = [3, 5, 7]):
        super().__init__()
        self.depths = depths
        self.widths = widths
        self.kernel_sizes = kernel_sizes
        # Build the supernet at its maximum configuration
        self.stages = nn.ModuleList()
        for stage_idx in range(4):
            stage = nn.ModuleList()
            for block_idx in range(max(depths)):
                stage.append(ElasticBlock(
                    channels=64 * (2 ** stage_idx),
                    kernel_sizes=kernel_sizes
                ))
            self.stages.append(stage)

    def get_max_config(self) -> dict:
        """Largest configuration: full depth, width and kernel size."""
        return {
            'depths': [max(self.depths)] * 4,
            'widths': [max(self.widths)] * 4,
            'kernels': [max(self.kernel_sizes)] * 4,
        }

    def forward(self, x: torch.Tensor, config: dict = None) -> torch.Tensor:
        """Forward pass with a specific configuration."""
        if config is None:
            config = self.get_max_config()
        for stage_idx, stage in enumerate(self.stages):
            depth = config['depths'][stage_idx]
            width = config['widths'][stage_idx]
            kernel = config['kernels'][stage_idx]
            for block_idx in range(depth):
                x = stage[block_idx](x, width=width, kernel_size=kernel)
        return x

    def get_subnet(self, config: dict) -> nn.Module:
        """Extract a standalone subnet."""
        subnet = SubNetwork()
        for stage_idx, stage in enumerate(self.stages):
            depth = config['depths'][stage_idx]
            width = config['widths'][stage_idx]
            kernel = config['kernels'][stage_idx]
            for block_idx in range(depth):
                subnet.add_block(stage[block_idx].get_active_block(width, kernel))
        return subnet

    def sample_random_subnet(self) -> dict:
        """Sample a random valid configuration."""
        return {
            'depths': [random.choice(self.depths) for _ in range(4)],
            'widths': [random.choice(self.widths) for _ in range(4)],
            'kernels': [random.choice(self.kernel_sizes) for _ in range(4)]
        }

class OFATrainer:
    """Progressive shrinking training."""

    def __init__(self, supernet: OnceForAllNetwork):
        self.supernet = supernet

    def train(self, train_loader, epochs: int = 180):
        """Train with progressive shrinking."""
        # Phase 1: train the largest network
        for epoch in range(epochs // 3):
            self.train_epoch(train_loader,
                             sample_fn=self.supernet.get_max_config)
        # Phase 2: elastic kernel size
        for epoch in range(epochs // 3):
            self.train_epoch(train_loader, sample_fn=self.sample_kernel_subnet)
        # Phase 3: elastic depth + width
        for epoch in range(epochs // 3):
            self.train_epoch(train_loader,
                             sample_fn=self.supernet.sample_random_subnet)

    def deploy_for_target(self, target_latency: float,
                          hardware: str) -> nn.Module:
        """Find the best subnet for a latency target."""
        predictor = LatencyPredictor(hardware)
        best_config = None
        best_accuracy = 0
        for _ in range(1000):
            config = self.supernet.sample_random_subnet()
            if predictor.predict(config) <= target_latency:
                subnet = self.supernet.get_subnet(config)
                accuracy = self.evaluate(subnet)
                if accuracy > best_accuracy:
                    best_accuracy = accuracy
                    best_config = config
        if best_config is None:
            raise ValueError("No sampled subnet meets the latency target")
        return self.supernet.get_subnet(best_config)
```
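One supernet covers a surprisingly large family. With 3 depth choices, 3 width multipliers, and 3 kernel sizes sampled independently per stage, the four stages above yield 3^12 distinct subnets from a single training run:

```python
def n_subnets(n_stages: int = 4,
              n_depths: int = 3, n_widths: int = 3, n_kernels: int = 3) -> int:
    """Each stage independently picks one depth, one width, and one kernel size."""
    return (n_depths * n_widths * n_kernels) ** n_stages

print(n_subnets())  # 531441 configurations from one trained supernet
```

That is why the 40 GPU-days in the benchmark table below are a one-time cost: every new deployment target only needs a cheap search over already-trained subnets.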
## Hyperparameter Optimization with Optuna
```python
import optuna
import torch
from optuna.trial import Trial

def create_model_objective(train_loader, val_loader):
    """Optuna objective for hyperparameter search; build_model,
    train_epoch and evaluate are assumed helpers."""
    def objective(trial: Trial) -> float:
        # Hyperparameters to tune
        lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
        batch_size = trial.suggest_categorical('batch_size', [16, 32, 64, 128])
        optimizer_name = trial.suggest_categorical('optimizer', ['Adam', 'SGD', 'AdamW'])
        n_layers = trial.suggest_int('n_layers', 2, 8)
        hidden_dim = trial.suggest_int('hidden_dim', 64, 512, step=64)
        dropout = trial.suggest_float('dropout', 0.1, 0.5)
        activation = trial.suggest_categorical('activation', ['relu', 'gelu', 'silu'])

        # Build the model
        model = build_model(
            n_layers=n_layers,
            hidden_dim=hidden_dim,
            dropout=dropout,
            activation=activation
        )

        # Create the optimizer
        if optimizer_name == 'Adam':
            optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        elif optimizer_name == 'SGD':
            optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        else:
            optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

        # Train, reporting intermediate results so Optuna can prune
        # unpromising trials early
        for epoch in range(10):
            train_loss = train_epoch(model, train_loader, optimizer)
            val_accuracy = evaluate(model, val_loader)
            trial.report(val_accuracy, epoch)
            if trial.should_prune():
                raise optuna.TrialPruned()
        return val_accuracy
    return objective

def run_hpo(train_loader, val_loader):
    """Run hyperparameter optimization."""
    study = optuna.create_study(
        direction='maximize',
        sampler=optuna.samplers.TPESampler(),
        pruner=optuna.pruners.MedianPruner()
    )
    objective = create_model_objective(train_loader, val_loader)
    study.optimize(objective, n_trials=100, timeout=3600)

    print(f"Best trial: {study.best_trial.value}")
    print(f"Best params: {study.best_params}")

    # Visualization (these return plotly figures)
    optuna.visualization.plot_optimization_history(study)
    optuna.visualization.plot_param_importances(study)
    return study.best_params
```
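The compute saved by pruning is easy to see in a small simulation that mimics the median rule without Optuna itself; the learning curves below are synthetic and the helper name is illustrative:

```python
import statistics

def median_prune(curves, warmup: int = 1):
    """Stop a trial at epoch t if its reported value falls below the
    median of what other trials reported at the same epoch."""
    history = {}  # epoch -> values reported by trials that reached it
    survivors = []
    for trial, curve in enumerate(curves):
        pruned_at = None
        for epoch, value in enumerate(curve):
            seen = history.setdefault(epoch, [])
            if epoch >= warmup and seen and value < statistics.median(seen):
                pruned_at = epoch
                break
            seen.append(value)
        if pruned_at is None:
            survivors.append(trial)
    return survivors

# Synthetic validation-accuracy curves for 4 trials over 5 epochs
curves = [
    [0.50, 0.60, 0.70, 0.75, 0.78],  # strong trial
    [0.40, 0.45, 0.48, 0.50, 0.51],  # weak: pruned early
    [0.52, 0.62, 0.71, 0.76, 0.80],  # strong trial
    [0.30, 0.35, 0.40, 0.42, 0.43],  # weak: pruned early
]
print(median_prune(curves))  # the weak trials never reach the final epoch
```

Here the two weak trials are cut after two epochs each, so roughly a third of the total epochs are never run; with 100 trials and expensive epochs, this is where most of the budget is recovered.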
## Benchmarks and Comparison
| Method | ImageNet Top-1 | Search Cost (GPU days) | Model Size |
|--------|----------------|------------------------|------------|
| ResNet-50 (manual) | 76.0% | N/A | 25.6M |
| NASNet-A (RL) | 82.7% | 1800 | 88.9M |
| DARTS (gradient) | 73.3% | 1 | 3.3M |
| EfficientNet-B0 (NAS) | 77.1% | ~3000 | 5.3M |
| Once-for-All | 76.0-80.0% | 40 (one-time) | 2-8M |
## Research Ideas
For a bachelor's thesis:
- Comparing HPO methods on a standard benchmark
- AutoKeras vs manual tuning on a specific task
- Visualizing the search process
For a master's thesis:
- Hardware-aware NAS for a specific edge device
- Transfer NAS: search on a small dataset, deploy on a large one
- Multi-task architecture search
For a PhD:
- Novel search spaces for new architecture types
- Zero-shot NAS predictors
- Theoretical foundations of architecture search
## Tools
NAS frameworks:
- NNI (Microsoft): github.com/microsoft/nni
- AutoPyTorch: github.com/automl/Auto-PyTorch
- NASLib: github.com/automl/NASLib
HPO:
- Optuna: optuna.org
- Ray Tune: docs.ray.io/tune
- Hyperopt: hyperopt.github.io
Full AutoML:
- AutoKeras: autokeras.com
- Auto-sklearn: automl.github.io/auto-sklearn
- FLAML (Microsoft): github.com/microsoft/FLAML
Architectures found by NAS often outperform human-designed ones. EfficientNet, MobileNetV3, NASNet: all are products of automated search. This is not just optimization; it is AI designing AI.
If you are interested in AutoML or NAS, from simple hyperparameter tuning to full architecture search, reach out to the SKP-Degree team at skp-degree.com.ua or message us on Telegram: @kursovi_diplomy. We can help you choose an approach, set up the search space, and interpret the results.
Keywords: AutoML, NAS, Neural Architecture Search, DARTS, hyperparameter optimization, Optuna, EfficientNet, Once-for-All, hardware-aware, thesis, master's thesis, AI research, optimization