One agent learns to play chess. Hard, but solved. Two agents learn by playing against each other: more interesting. Ten agents coordinate a drone swarm in real time: a genuine challenge. Thousands of agents simulate an economy: chaos.
Multi-Agent Reinforcement Learning (MARL) is the setting where several agents learn simultaneously in a shared environment while influencing one another. It is not simply "RL times N". It is a qualitatively different problem with its own challenges: non-stationarity, credit assignment, and emergent behavior.
OpenAI Five plays Dota at a professional level. DeepMind's AlphaStar wins at StarCraft. Autonomous traffic systems coordinate hundreds of vehicles. MARL is the frontier of reinforcement learning.
Why MARL Is a Qualitatively Different Problem
Single-Agent RL:
- The environment is stationary (it does not change)
- The agent learns; the environment stays the same
- An optimal policy exists and is stable
- The Markov property holds

Multi-Agent RL:
- The environment includes other learning agents
- The other agents keep changing their policies
- The environment is NON-STATIONARY
- The optimal policy depends on what the others do
- Game theory meets machine learning
The non-stationarity problem:
import numpy as np

class NonStationarityDemonstration:
    """Illustrates the non-stationarity problem (schematic: the helpers below are placeholders)."""

    def __init__(self, n_agents: int = 2):
        self.n_agents = n_agents
        # RandomPolicy is a placeholder for any initial policy representation
        self.policies = [RandomPolicy() for _ in range(n_agents)]

    def simulate_learning(self, episodes: int = 1000):
        """Simulate two agents learning against each other."""
        reward_history = {i: [] for i in range(self.n_agents)}
        for episode in range(episodes):
            # Agent 0 optimizes against Agent 1's current policy
            optimal_response_0 = self.compute_best_response(0, self.policies[1])
            self.policies[0] = optimal_response_0
            # But Agent 1 is changing too!
            optimal_response_1 = self.compute_best_response(1, self.policies[0])
            self.policies[1] = optimal_response_1
            # The environment as seen by Agent 0 has now changed, because Agent 1 changed.
            # What was optimal for Agent 0 is no longer optimal: the dependency is circular.
            for i in range(self.n_agents):
                r = self.evaluate_policy(i)  # placeholder: expected return of agent i's current policy
                reward_history[i].append(r)
        return reward_history

    def compute_best_response(self, agent_id: int, opponent_policy) -> 'Policy':
        """Compute the best response to the opponent's policy."""
        # Simplified: in real MARL this is expensive
        best_policy = None
        best_value = float('-inf')
        for candidate in self.generate_policy_candidates():  # placeholder: enumerate candidate policies
            value = self.evaluate_against(agent_id, candidate, opponent_policy)
            if value > best_value:
                best_value = value
                best_policy = candidate
        return best_policy
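To make the cycle concrete, here is a self-contained toy example (rock-paper-scissors, deliberately unrelated to the class above): when both players repeatedly best-respond to each other's latest pure strategy, the joint play cycles forever, and only the mixed 1/3-1/3-1/3 policy is a fixed point.

# Toy illustration of best-response cycling in rock-paper-scissors.
BEATS = {'rock': 'paper', 'paper': 'scissors', 'scissors': 'rock'}  # value is the best response to key

a, b = 'rock', 'rock'
for step in range(6):
    # Both players update simultaneously, each against the other's previous policy
    a, b = BEATS[b], BEATS[a]
    print(step, a, b)
# The pair cycles rock -> paper -> scissors -> rock ... and never settles,
# which is the non-stationarity problem in miniature.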
Types of Multi-Agent Settings
1. Cooperative
import numpy as np

class CooperativeEnvironment:
    """Environment for cooperative tasks."""

    def __init__(self, n_agents: int):
        self.n_agents = n_agents
        self.shared_goal = None
        self.state = None  # set by reset() in a complete implementation

    def step(self, actions: list) -> tuple:
        """All agents receive the same shared reward."""
        # Actions from all agents form a single joint action
        joint_action = self.combine_actions(actions)
        # One reward for the whole team
        next_state = self.transition(self.state, joint_action)
        team_reward = self.compute_team_reward(next_state)
        # Every agent receives the team reward
        rewards = [team_reward] * self.n_agents
        return next_state, rewards, self.is_terminal(next_state)

    def compute_team_reward(self, state) -> float:
        """Reward for progress toward the shared goal."""
        # Example: a robot team moving a piece of furniture
        distance_to_goal = np.linalg.norm(
            state['furniture_position'] - state['goal_position']
        )
        return -distance_to_goal  # Closer = better
2. Competitive (Zero-sum)
class ZeroSumGame:
    """A zero-sum game (chess, Go)."""

    def step(self, actions: list) -> tuple:
        """One player's win is the other player's loss."""
        assert len(actions) == 2
        next_state = self.transition(self.state, actions)
        # Zero-sum: the rewards sum to 0
        if self.player_0_wins(next_state):
            rewards = [1.0, -1.0]
        elif self.player_1_wins(next_state):
            rewards = [-1.0, 1.0]
        else:
            rewards = [0.0, 0.0]
        return next_state, rewards, self.is_terminal(next_state)
3. Mixed (General-sum)
class TrafficEnvironment:
    """Traffic: cooperation and competition at the same time."""

    def step(self, actions: list) -> tuple:
        """Everyone wants to move faster, but a collision is bad for all."""
        next_state = self.move_vehicles(actions)
        rewards = []
        for vehicle in self.vehicles:
            # Individual reward: progress toward the destination
            progress = self.compute_progress(vehicle)
            # Shared penalty: collision
            collision_penalty = -100 if self.has_collision(vehicle) else 0
            # Individual + shared components
            rewards.append(progress + collision_penalty)
        return next_state, rewards, self.all_arrived()
MARL Architectures
1. Independent Learners
class IndependentQLearning:
    """Each agent learns completely independently."""

    def __init__(self, n_agents: int, obs_dim: int, action_dim: int):
        self.agents = [
            DQNAgent(obs_dim, action_dim)  # any standard single-agent DQN implementation
            for _ in range(n_agents)
        ]

    def act(self, observations: list) -> list:
        """Each agent picks an action based on its own observation."""
        actions = []
        for i, obs in enumerate(observations):
            action = self.agents[i].select_action(obs)
            actions.append(action)
        return actions

    def update(self, transitions: list):
        """Independent update for every agent."""
        for i, (obs, action, reward, next_obs, done) in enumerate(transitions):
            # Each agent treats the others as part of the environment
            # and knows nothing about their policies
            self.agents[i].update(obs, action, reward, next_obs, done)
        # Problem: non-stationarity.
        # Agent i learns, but the environment (including the other agents) keeps changing,
        # so the Q-values may never converge.
2. Centralized Training, Decentralized Execution (CTDE)
import torch
import torch.nn as nn

class CTDEArchitecture(nn.Module):
    """Centralized critic, decentralized actors"""

    def __init__(self, n_agents: int, obs_dim: int,
                 action_dim: int, state_dim: int):
        super().__init__()
        self.n_agents = n_agents

        # Decentralized actors: local observation → action
        self.actors = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim, 256),
                nn.ReLU(),
                nn.Linear(256, 128),
                nn.ReLU(),
                nn.Linear(128, action_dim)
            ) for _ in range(n_agents)
        ])

        # Centralized critic: global state + all actions → Q-value
        self.critic = nn.Sequential(
            nn.Linear(state_dim + n_agents * action_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def get_actions(self, observations: list) -> list:
        """Decentralized: each agent uses only local obs"""
        actions = []
        for i, obs in enumerate(observations):
            logits = self.actors[i](obs)
            action = torch.distributions.Categorical(logits=logits).sample()
            actions.append(action)
        return actions

    def get_q_value(self, global_state: torch.Tensor,
                    actions: list) -> torch.Tensor:
        """Centralized: critic sees everything"""
        actions_concat = torch.cat(actions, dim=-1)
        critic_input = torch.cat([global_state, actions_concat], dim=-1)
        return self.critic(critic_input)
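A quick shape check of the module above with dummy tensors; the dimensions are arbitrary, and one-hot encoding the sampled discrete actions before feeding them to the critic is one reasonable choice among several, not a fixed part of CTDE:

import torch
import torch.nn.functional as F

n_agents, obs_dim, action_dim, state_dim = 3, 10, 5, 30  # hypothetical sizes
model = CTDEArchitecture(n_agents, obs_dim, action_dim, state_dim)

observations = [torch.randn(1, obs_dim) for _ in range(n_agents)]  # each agent's local view
global_state = torch.randn(1, state_dim)                           # full state, available only during training

actions = model.get_actions(observations)                          # decentralized execution path
one_hot_actions = [F.one_hot(a, action_dim).float() for a in actions]
q = model.get_q_value(global_state, one_hot_actions)               # centralized training path
print(q.shape)  # torch.Size([1, 1])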
3. Value Decomposition (QMIX)
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMIX(nn.Module):
    """Value decomposition for cooperative MARL"""

    def __init__(self, n_agents: int, obs_dim: int,
                 action_dim: int, state_dim: int,
                 mixing_embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents

        # Individual Q-networks
        self.agent_qs = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim, 64),
                nn.ReLU(),
                nn.Linear(64, action_dim)
            ) for _ in range(n_agents)
        ])

        # Hypernetworks that produce the mixing weights from the global state
        self.hyper_w1 = nn.Sequential(
            nn.Linear(state_dim, mixing_embed_dim),
            nn.ReLU(),
            nn.Linear(mixing_embed_dim, n_agents * mixing_embed_dim)
        )
        self.hyper_w2 = nn.Sequential(
            nn.Linear(state_dim, mixing_embed_dim),
            nn.ReLU(),
            nn.Linear(mixing_embed_dim, mixing_embed_dim)
        )

        # Biases
        self.hyper_b1 = nn.Linear(state_dim, mixing_embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, mixing_embed_dim),
            nn.ReLU(),
            nn.Linear(mixing_embed_dim, 1)
        )
        self.mixing_embed_dim = mixing_embed_dim

    def forward(self, observations: list, actions: list,
                state: torch.Tensor) -> torch.Tensor:
        """Compute Q_tot from individual Qs"""
        batch_size = state.shape[0]

        # Get individual Q-values
        agent_qs = []
        for i in range(self.n_agents):
            q_values = self.agent_qs[i](observations[i])
            # Select Q for the taken action
            q_value = q_values.gather(1, actions[i].unsqueeze(1))
            agent_qs.append(q_value)
        agent_qs = torch.cat(agent_qs, dim=1)  # [batch, n_agents]

        # Mixing network with monotonicity constraint
        # (torch.abs keeps the mixing weights non-negative)
        # W1: [batch, n_agents, embed_dim]
        w1 = torch.abs(self.hyper_w1(state)).view(
            batch_size, self.n_agents, self.mixing_embed_dim
        )
        b1 = self.hyper_b1(state).view(batch_size, 1, self.mixing_embed_dim)

        # First layer
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)

        # W2: [batch, embed_dim, 1]
        w2 = torch.abs(self.hyper_w2(state)).view(
            batch_size, self.mixing_embed_dim, 1
        )
        b2 = self.hyper_b2(state).view(batch_size, 1, 1)

        # Second layer → Q_tot
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.squeeze(-1).squeeze(-1)

    def get_individual_actions(self, observations: list) -> list:
        """Decentralized action selection"""
        actions = []
        for i, obs in enumerate(observations):
            q_values = self.agent_qs[i](obs)
            action = q_values.argmax(dim=-1)
            actions.append(action)
        return actions
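The torch.abs on the hypernetwork outputs is what gives QMIX its defining property: every mixing weight is non-negative, so Q_tot is monotone in each individual Q_a (∂Q_tot / ∂Q_a ≥ 0). Because of that, each agent can greedily take the argmax over its own Q-values at execution time, and the resulting joint action still maximizes the team's Q_tot.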
MAPPO: Multi-Agent PPO
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class MAPPOAgent(nn.Module):
    """PPO-based multi-agent learner"""

    def __init__(self, obs_dim: int, action_dim: int,
                 shared_obs_dim: int, hidden_dim: int = 256):
        super().__init__()

        # Actor: local observation → action (decentralized)
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

        # Critic: global state → value (centralized)
        self.critic = nn.Sequential(
            nn.Linear(shared_obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def get_action(self, obs: torch.Tensor) -> tuple:
        """Sample action from policy"""
        logits = self.actor(obs)
        dist = Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob

    def get_value(self, shared_obs: torch.Tensor) -> torch.Tensor:
        """Get state value from centralized critic"""
        return self.critic(shared_obs)

    def evaluate_actions(self, obs: torch.Tensor,
                         actions: torch.Tensor) -> tuple:
        """Evaluate actions for PPO update"""
        logits = self.actor(obs)
        dist = Categorical(logits=logits)
        log_probs = dist.log_prob(actions)
        entropy = dist.entropy()
        return log_probs, entropy
class MAPPOTrainer:
    """Training loop for MAPPO"""

    def __init__(self, agents: list, gamma: float = 0.99,
                 clip_eps: float = 0.2, epochs: int = 10):
        self.agents = agents
        self.gamma = gamma
        self.clip_eps = clip_eps
        self.epochs = epochs
        self.optimizers = [
            torch.optim.Adam(agent.parameters(), lr=3e-4)
            for agent in agents
        ]

    def compute_gae(self, rewards: list, values: list,
                    dones: list, last_value: float) -> tuple:
        """Generalized Advantage Estimation (lambda = 0.95 hardcoded below)"""
        gae = 0
        advantages = []
        returns = []
        values = values + [last_value]
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + self.gamma * values[t+1] * (1 - dones[t]) - values[t]
            gae = delta + self.gamma * 0.95 * (1 - dones[t]) * gae
            advantages.insert(0, gae)
            returns.insert(0, gae + values[t])
        return advantages, returns

    def update(self, rollout_buffer: dict):
        """PPO update for all agents"""
        for agent_idx, agent in enumerate(self.agents):
            obs = rollout_buffer['observations'][agent_idx]
            shared_obs = rollout_buffer['shared_observations']
            actions = rollout_buffer['actions'][agent_idx]
            old_log_probs = rollout_buffer['log_probs'][agent_idx]
            advantages = rollout_buffer['advantages'][agent_idx]
            returns = rollout_buffer['returns'][agent_idx]

            # Normalize advantages
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

            for _ in range(self.epochs):
                # Evaluate current policy
                new_log_probs, entropy = agent.evaluate_actions(obs, actions)
                values = agent.get_value(shared_obs)

                # Policy loss (PPO clip)
                ratio = torch.exp(new_log_probs - old_log_probs)
                surr1 = ratio * advantages
                surr2 = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * advantages
                policy_loss = -torch.min(surr1, surr2).mean()

                # Value loss
                value_loss = F.mse_loss(values.squeeze(), returns)

                # Entropy bonus
                entropy_loss = -entropy.mean()

                # Total loss
                loss = policy_loss + 0.5 * value_loss + 0.01 * entropy_loss

                # Update
                self.optimizers[agent_idx].zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(agent.parameters(), 0.5)
                self.optimizers[agent_idx].step()
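For reference, a minimal sketch of the rollout buffer layout that update() above indexes into; the dictionary keys come directly from the code, while the shapes, the placeholder zero tensors, and the dimension names (obs_dim, shared_obs_dim, action_dim) are assumptions for illustration only:

import torch

n_agents, T = 3, 128                              # hypothetical: 3 agents, a 128-step rollout
obs_dim, shared_obs_dim, action_dim = 10, 30, 5   # hypothetical dimensions

agents = [MAPPOAgent(obs_dim, action_dim, shared_obs_dim) for _ in range(n_agents)]
trainer = MAPPOTrainer(agents)

rollout_buffer = {
    # per-agent data, indexed by agent id (zeros stand in for collected rollouts)
    'observations': [torch.zeros(T, obs_dim) for _ in range(n_agents)],
    'actions':      [torch.zeros(T, dtype=torch.long) for _ in range(n_agents)],
    'log_probs':    [torch.zeros(T) for _ in range(n_agents)],
    'advantages':   [torch.zeros(T) for _ in range(n_agents)],
    'returns':      [torch.zeros(T) for _ in range(n_agents)],
    # a single global observation stream shared by every centralized critic
    'shared_observations': torch.zeros(T, shared_obs_dim),
}
trainer.update(rollout_buffer)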
Self-Play for Competitive Games
import copy
import random

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfPlayTrainer:
    """Training by playing against snapshots of yourself"""

    def __init__(self, agent: nn.Module, pool_size: int = 10):
        self.current_agent = agent
        self.opponent_pool = []
        self.pool_size = pool_size
        self.elo_ratings = {}
        self.steps = 0

    def train_step(self, env):
        """One training step"""
        # Select opponent
        if len(self.opponent_pool) == 0 or random.random() < 0.2:
            opponent = copy.deepcopy(self.current_agent)
            opponent_id = 'self'
        else:
            # Prioritize by ELO (play against stronger opponents)
            opponent_idx = self.sample_by_elo()
            opponent = self.opponent_pool[opponent_idx]
            opponent_id = opponent_idx

        # Play episode
        episode_data = self.play_episode(env, self.current_agent, opponent)

        # Update current agent
        self.current_agent.update(episode_data)

        # Update ELO ratings
        if opponent_id != 'self':
            self.update_elo(episode_data['winner'], opponent_id)

        # Add to pool periodically
        if self.should_snapshot():
            self.add_to_pool(copy.deepcopy(self.current_agent))

    def should_snapshot(self) -> bool:
        """Simple heuristic: snapshot the current agent every 100 training steps."""
        self.steps += 1
        return self.steps % 100 == 0

    def add_to_pool(self, agent):
        """Add a frozen copy to the opponent pool, evicting the oldest if full."""
        self.opponent_pool.append(agent)
        if len(self.opponent_pool) > self.pool_size:
            self.opponent_pool.pop(0)

    def play_episode(self, env, agent1, agent2) -> dict:
        """Play one episode between two agents"""
        obs = env.reset()
        done = False
        trajectory = {'states': [], 'actions': [], 'rewards': []}
        while not done:
            # Alternate turns or simultaneous
            if env.current_player == 0:
                action = agent1.select_action(obs)
            else:
                action = agent2.select_action(obs)
            next_obs, reward, done, info = env.step(action)
            trajectory['states'].append(obs)
            trajectory['actions'].append(action)
            trajectory['rewards'].append(reward)
            obs = next_obs
        trajectory['winner'] = info.get('winner')
        return trajectory

    def sample_by_elo(self) -> int:
        """Sample an opponent, favoring higher ELO ratings"""
        if not self.elo_ratings:
            return random.randint(0, len(self.opponent_pool) - 1)
        # Higher rated opponents are sampled more often
        ratings = [self.elo_ratings.get(i, 1500) for i in range(len(self.opponent_pool))]
        probs = F.softmax(torch.tensor(ratings, dtype=torch.float32) / 100, dim=0).numpy()
        return np.random.choice(len(self.opponent_pool), p=probs)

    def update_elo(self, winner: int, opponent_idx: int):
        """Update ELO ratings based on the game result"""
        K = 32  # ELO K-factor
        current_elo = self.elo_ratings.get('current', 1500)
        opponent_elo = self.elo_ratings.get(opponent_idx, 1500)

        expected_current = 1 / (1 + 10 ** ((opponent_elo - current_elo) / 400))
        expected_opponent = 1 - expected_current

        if winner == 0:    # Current agent won
            score_current, score_opponent = 1, 0
        elif winner == 1:  # Opponent won
            score_current, score_opponent = 0, 1
        else:              # Draw
            score_current, score_opponent = 0.5, 0.5

        self.elo_ratings['current'] = current_elo + K * (score_current - expected_current)
        self.elo_ratings[opponent_idx] = opponent_elo + K * (score_opponent - expected_opponent)
Communication Learning
import torch
import torch.nn as nn

class CommunicatingAgent(nn.Module):
    """An agent that learns to communicate"""

    def __init__(self, obs_dim: int, action_dim: int,
                 message_dim: int = 16, n_agents: int = 3):
        super().__init__()

        # Message encoder
        self.message_encoder = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, message_dim)
        )

        # Message aggregator
        self.message_aggregator = nn.GRU(
            message_dim, 64, batch_first=True
        )

        # Policy network (obs + received messages → action)
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + 64, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def encode_message(self, obs: torch.Tensor) -> torch.Tensor:
        """Generate the message to broadcast"""
        return self.message_encoder(obs)

    def process_messages(self, received: torch.Tensor) -> torch.Tensor:
        """Aggregate messages from other agents"""
        # received: [batch, n_agents-1, message_dim]
        _, hidden = self.message_aggregator(received)
        return hidden.squeeze(0)  # [batch, 64]

    def forward(self, obs: torch.Tensor,
                received_messages: torch.Tensor) -> torch.Tensor:
        """Select action given obs and communications"""
        message_context = self.process_messages(received_messages)
        combined = torch.cat([obs, message_context], dim=-1)
        return self.policy(combined)
class CommMARLEnvironment:
    """An environment wrapper with a communication channel"""

    def __init__(self, base_env, n_agents: int, message_dim: int):
        self.env = base_env
        self.n_agents = n_agents
        self.message_dim = message_dim

    def step(self, actions: list, messages: list) -> tuple:
        """Step with actions plus communication"""
        # Execute actions
        next_obs, rewards, done, info = self.env.step(actions)

        # Broadcast messages (all-to-all)
        received = []
        for i in range(self.n_agents):
            # Agent i receives the messages from all other agents
            agent_received = [messages[j] for j in range(self.n_agents) if j != i]
            received.append(torch.stack(agent_received))

        return next_obs, rewards, done, info, received
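A dummy forward pass through CommunicatingAgent shows the expected tensor shapes (all sizes below are arbitrary assumptions):

import torch

n_agents, obs_dim, action_dim, message_dim = 3, 12, 4, 16  # hypothetical sizes
agent = CommunicatingAgent(obs_dim, action_dim, message_dim, n_agents)

obs = torch.randn(1, obs_dim)                          # this agent's local observation
outgoing = agent.encode_message(obs)                   # message it would broadcast
received = torch.randn(1, n_agents - 1, message_dim)   # messages from the other two agents
logits = agent(obs, received)                          # action logits conditioned on obs + messages
print(outgoing.shape, logits.shape)  # torch.Size([1, 16]) torch.Size([1, 4])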
Benchmarks and Environments
| Environment | Type | Agents | Complexity |
|-------------|------|--------|------------|
| Simple Spread (MPE) | Cooperative | 3 | Low |
| Simple Tag (MPE) | Competitive | 4 | Low |
| SMAC (StarCraft) | Cooperative | 2-27 | Medium-High |
| Google Football | Mixed | 11v11 | High |
| Hanabi | Cooperative | 2-5 | Medium |
| Overcooked | Cooperative | 2 | Medium |
Research Ideas
For a bachelor's thesis:
- QMIX on a simple cooperative environment
- A comparison of independent vs. centralized learning
- Visualization of emergent behaviors
For a master's thesis:
- MAPPO for drone coordination
- Communication emergence in cooperative tasks
- Transfer between environments
For a PhD:
- Scalable MARL for thousands of agents
- Theoretical analysis of convergence
- Human-AI teaming
Tools
Environments:
- PettingZoo: pettingzoo.farama.org (see the minimal loop sketch after these lists)
- SMAC: github.com/oxwhirl/smac
- Google Football: github.com/google-research/football
- Multi-Agent Particle: github.com/openai/multiagent-particle-envs
Frameworks:
- EPyMARL: github.com/uoe-agents/epymarl
- MARLlib: github.com/Replicable-MARL/MARLlib
- PyMARL: github.com/oxwhirl/pymarl
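As referenced in the PettingZoo item above, here is a minimal interaction loop for one of the benchmark environments from the table, written against PettingZoo's parallel API. It is a sketch under the assumption of a recent PettingZoo release (where reset returns both observations and infos), with random actions standing in for trained policies:

from pettingzoo.mpe import simple_spread_v3  # Simple Spread from the benchmarks table

env = simple_spread_v3.parallel_env(N=3, max_cycles=25)
observations, infos = env.reset(seed=42)

while env.agents:
    # Random actions as placeholders for trained policies (e.g. QMIX or MAPPO actors)
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()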
The real world is a multi-agent world: traffic, markets, robotics, social systems. Whoever solves scalable MARL solves coordination at scale. It is one of the hardest open problems in reinforcement learning.
If MARL interests you, from basic algorithms to research on emergent behavior, reach out to the SKP-Degree team at skp-degree.com.ua or write on Telegram: @kursovi_diplomy. We can help you choose a problem, set up the environment, and build the implementation.
Keywords: MARL, multi-agent, reinforcement learning, QMIX, MAPPO, self-play, game AI, coordination, emergent behavior, bachelor's thesis, master's thesis, PhD, AI research