ChatGPT lives in text. DALL-E lives in pixels. Whisper lives in audio. None of them knows what it is to pick up a cup, how much it hurts to bang into a table, or what weight, friction, and balance feel like. Their knowledge of the world comes from books, not from experience.
Embodied AI is AI with a body: a robot that learns through physical interaction with the world. Not from a dataset, but from its own experience. Not supervised learning on text, but reinforcement learning on real attempts.
Tesla Bot, Figure 01, 1X NEO, Sanctuary AI: billions of dollars are flowing into humanoid robots, because embodiment is not just "hands for GPT". It is a fundamentally different kind of intelligence, one that emerges through a body.
Why Embodiment Matters
Disembodied AI (ChatGPT):
- Learns about the world from text
- "A chair is furniture for sitting"
- Does not know how to sit on a chair
- Does not know weight, texture, or balance
- Knowledge = word statistics

Embodied AI:
- Learns through interaction
- Knows that a chair can be tipped over
- Understands friction, gravity, affordances
- Grounded understanding
- Knowledge = the experience of interaction
Affordances: what an object "allows" you to do. A cup affords holding and drinking; a door affords opening; stairs afford climbing. Embodied AI discovers affordances through action.
```python
class AffordanceDiscovery:
    """Discover affordances through interaction."""

    def __init__(self, robot, world):
        self.robot = robot
        self.world = world
        self.discovered_affordances = {}

    def explore_object(self, object_id: str) -> list:
        """Explore an object by trying every available action on it."""
        obj = self.world.get_object(object_id)
        affordances = []
        for action in self.robot.available_actions:
            initial_state = self.world.get_state()
            # Execute action, then observe what changed
            success = self.robot.execute(action, target=obj)
            new_state = self.world.get_state()
            effect = self.compute_effect(initial_state, new_state)
            if success and effect is not None:
                affordances.append({
                    'action': action,
                    'effect': effect,
                    'success_rate': self.estimate_success_rate(action, obj)
                })
        self.discovered_affordances[object_id] = affordances
        return affordances

    def estimate_success_rate(self, action, obj, n_trials: int = 5) -> float:
        """Naive stand-in estimator (added for completeness):
        repeat the action a few times and report the empirical success rate."""
        successes = sum(bool(self.robot.execute(action, target=obj))
                        for _ in range(n_trials))
        return successes / n_trials

    def compute_effect(self, before: dict, after: dict) -> dict | None:
        """Diff two world states to describe what the action changed."""
        effects = {}
        for key in before:
            if before[key] != after[key]:
                effects[key] = {'before': before[key], 'after': after[key]}
        return effects if effects else None
```
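A toy usage example; the dummy world and robot classes below are purely illustrative stand-ins for a real simulator:

```python
class DummyWorld:
    """Toy world with a single cup whose state flips when pushed."""
    def __init__(self):
        self.state = {'cup_upright': True}
    def get_object(self, object_id):
        return object_id
    def get_state(self):
        return dict(self.state)

class DummyRobot:
    """Toy robot with one action that tips the cup over."""
    available_actions = ['push']
    def __init__(self, world):
        self.world = world
    def execute(self, action, target):
        self.world.state['cup_upright'] = not self.world.state['cup_upright']
        return True

world = DummyWorld()
discovery = AffordanceDiscovery(robot=DummyRobot(world), world=world)
print(discovery.explore_object("cup_01"))
# -> [{'action': 'push', 'effect': {'cup_upright': {...}}, 'success_rate': 1.0}]
```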
Embodied Agent Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                         PERCEPTION                           │
│   Camera │ LiDAR │ Depth │ Touch │ Proprioception │ Audio   │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                      STATE ESTIMATION                        │
│   Object Detection │ Pose Estimation │ SLAM │ Scene Graph   │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                        WORLD MODEL                           │
│   Physics Prediction │ Object Properties │ Semantic Memory  │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                      POLICY / PLANNING                       │
│      Task Planning │ Motion Planning │ Skill Selection      │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                          CONTROL                             │
│   Inverse Kinematics │ Force Control │ Impedance Control    │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                           ACTION                             │
│   Motor Commands │ Locomotion │ Manipulation │ Navigation   │
└─────────────────────────────────────────────────────────────┘
```
```python
import torch
import torch.nn as nn
from typing import Dict, Tuple

class EmbodiedAgent(nn.Module):
    """Full embodied agent architecture."""

    def __init__(self, config: dict):
        super().__init__()
        # Perception modules (minimal definitions in the sketch below)
        self.vision_encoder = VisionEncoder(
            image_size=config['image_size'],
            hidden_dim=config['hidden_dim']
        )
        self.proprioception_encoder = ProprioceptionEncoder(
            joint_dim=config['n_joints'],
            hidden_dim=config['hidden_dim']
        )
        self.tactile_encoder = TactileEncoder(
            sensor_dim=config['tactile_dim'],
            hidden_dim=config['hidden_dim']
        )
        # State fusion (hidden_dim must be divisible by nhead)
        self.state_fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=config['hidden_dim'],
                nhead=8,
                batch_first=True
            ),
            num_layers=4
        )
        # World model
        self.world_model = WorldModel(
            state_dim=config['hidden_dim'],
            action_dim=config['action_dim']
        )
        # Policy
        self.policy = PolicyNetwork(
            state_dim=config['hidden_dim'],
            action_dim=config['action_dim']
        )

    def forward(self, observation: Dict[str, torch.Tensor]) -> Tuple[torch.Tensor, dict]:
        """Forward pass: observation -> action."""
        # Encode each modality
        vision_features = self.vision_encoder(observation['image'])
        proprio_features = self.proprioception_encoder(observation['joint_positions'])
        tactile_features = self.tactile_encoder(observation['tactile'])
        # Fuse modalities: one token per modality
        all_features = torch.stack([vision_features, proprio_features, tactile_features], dim=1)
        fused_state = self.state_fusion(all_features)
        state = fused_state.mean(dim=1)  # Pool over modalities
        # Get action from policy
        action, action_info = self.policy(state)
        # Predict next state (for planning)
        predicted_next_state = self.world_model(state, action)
        return action, {
            'state': state,
            'predicted_next': predicted_next_state,
            **action_info
        }
```
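The encoder and policy modules referenced above are not defined in the snippet. One minimal way to fill them in (these specific architectures are an assumption, not a fixed design):

```python
class VisionEncoder(nn.Module):
    """Tiny CNN: RGB image -> hidden_dim feature vector."""
    def __init__(self, image_size: int, hidden_dim: int):
        super().__init__()
        # image_size kept for interface compatibility; adaptive pooling
        # makes this net input-size-agnostic
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten()
        )
        self.proj = nn.Linear(64, hidden_dim)

    def forward(self, x):
        return self.proj(self.conv(x))

class ProprioceptionEncoder(nn.Module):
    """MLP over joint positions."""
    def __init__(self, joint_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))

    def forward(self, x):
        return self.net(x)

class TactileEncoder(ProprioceptionEncoder):
    """Same MLP shape, applied to tactile sensor readings."""
    def __init__(self, sensor_dim: int, hidden_dim: int):
        super().__init__(joint_dim=sensor_dim, hidden_dim=hidden_dim)

class PolicyNetwork(nn.Module):
    """Gaussian policy head: state -> (sampled action, info dict)."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                  nn.Linear(256, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.mean(state)
        std = self.log_std.exp().expand_as(mean)
        action = mean + std * torch.randn_like(mean)  # reparameterized sample
        return action, {'mean': mean, 'std': std}
```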
```python
class WorldModel(nn.Module):
    """Predicts future states given current state and action."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, state_dim)
        )
        self.reward_predictor = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """Predict next state."""
        sa = torch.cat([state, action], dim=-1)
        delta = self.dynamics(sa)
        return state + delta  # Residual prediction: predict the change, not the state

    def predict_reward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """Predict reward for a state-action pair."""
        sa = torch.cat([state, action], dim=-1)
        return self.reward_predictor(sa)
```
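With a learned dynamics and reward model, even naive model-predictive control becomes possible: sample random action sequences, roll them out in imagination, and execute the first action of the best one. A random-shooting sketch (horizon and sample count are arbitrary choices):

```python
import torch

def plan_with_world_model(world_model, state, action_dim,
                          horizon: int = 10, n_candidates: int = 256):
    """Random-shooting MPC over the learned WorldModel above."""
    batch = state.shape[0]
    device = state.device
    # Candidate action sequences: (n_candidates, batch, horizon, action_dim)
    candidates = torch.randn(n_candidates, batch, horizon, action_dim, device=device)
    returns = torch.zeros(n_candidates, batch, device=device)
    with torch.no_grad():
        for i in range(n_candidates):
            s = state
            for t in range(horizon):
                a = candidates[i, :, t]
                returns[i] += world_model.predict_reward(s, a).squeeze(-1)
                s = world_model(s, a)  # imagined next state
    best = returns.sum(dim=1).argmax()  # best sequence across the batch
    return candidates[best, :, 0]       # execute only its first action
```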
Sim-to-Real Transfer
Training on a real robot is expensive, slow, and dangerous. The solution: train in simulation, deploy in the real world. The catch is the reality gap: simulated friction, latency, and sensor noise never exactly match the hardware, so the standard remedy is domain randomization, i.e. varying those parameters during training until the real world looks like just one more variation.
```python
# NOTE: isaacgym must be imported before torch
import isaacgym
import isaacgymenvs
import torch
import numpy as np

class SimToRealPipeline:
    """Pipeline for sim-to-real transfer."""

    def __init__(self, task_name: str, num_envs: int = 4096):
        self.num_envs = num_envs
        # Create a massively parallel GPU environment (IsaacGymEnvs)
        self.env = isaacgymenvs.make(
            seed=0,
            task=task_name,
            num_envs=num_envs,
            sim_device='cuda:0',
            rl_device='cuda:0'
        )
        # Domain randomization config
        self.randomization_config = {
            'friction': {'range': [0.5, 2.0], 'operation': 'scale'},
            'mass': {'range': [0.8, 1.2], 'operation': 'scale'},
            'motor_strength': {'range': [0.9, 1.1], 'operation': 'scale'},
            'observation_noise': {'range': [0.0, 0.01], 'operation': 'add'},
            'action_delay': {'range': [0, 2], 'operation': 'delay_frames'},
            'camera_position': {'range': [-0.05, 0.05], 'operation': 'add'},
            'lighting': {'range': [0.5, 1.5], 'operation': 'scale'}
        }

    def randomize_domain(self):
        """Apply domain randomization.

        NOTE: scale_parameter/add_to_parameter/set_action_delay are an assumed
        wrapper interface, not the actual isaacgym API; in IsaacGymEnvs,
        randomization is normally configured in the task's YAML config.
        """
        for param, config in self.randomization_config.items():
            value = np.random.uniform(config['range'][0], config['range'][1])
            if config['operation'] == 'scale':
                self.env.scale_parameter(param, value)
            elif config['operation'] == 'add':
                self.env.add_to_parameter(param, value)
            elif config['operation'] == 'delay_frames':
                self.env.set_action_delay(int(value))

    def train_with_dr(self, policy, epochs: int = 1000):
        """Train with domain randomization."""
        optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
        for epoch in range(epochs):
            # Resample physics parameters at the start of each epoch
            self.randomize_domain()
            # Collect rollouts
            observations = self.env.reset()
            for step in range(self.env.max_episode_length):
                with torch.no_grad():
                    actions = policy(observations)
                next_obs, rewards, dones, infos = self.env.step(actions)
                # store_transition / update_policy are assumed replay-buffer and
                # RL-update helpers (e.g. PPO), not shown here
                self.store_transition(observations, actions, rewards, next_obs, dones)
                observations = next_obs
            loss = self.update_policy(policy, optimizer)
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.4f}")
        return policy

    def evaluate_on_real(self, policy, real_robot, n_episodes: int = 10):
        """Evaluate the trained policy on a real robot."""
        successes = []
        for ep in range(n_episodes):
            obs = real_robot.reset()
            done = False
            while not done:
                with torch.no_grad():
                    action = policy(torch.tensor(obs).float().unsqueeze(0))
                action = action.squeeze().numpy()
                # Execute on the real robot
                obs, reward, done, info = real_robot.step(action)
            successes.append(info.get('success', False))
        success_rate = np.mean(successes)
        print(f"Real robot success rate: {success_rate:.2%}")
        return success_rate
```
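A sketch of how this might be driven end to end, assuming IsaacGymEnvs is installed; the task name and network dimensions are placeholders that depend on the chosen task:

```python
import torch.nn as nn

# Dimensions below are placeholders: observation/action sizes are task-specific
pipeline = SimToRealPipeline(task_name='Ant', num_envs=4096)
policy = nn.Sequential(
    nn.Linear(60, 256), nn.ReLU(),
    nn.Linear(256, 8), nn.Tanh()  # actions squashed to [-1, 1]
)
trained = pipeline.train_with_dr(policy, epochs=1000)
# Later, on hardware:
# pipeline.evaluate_on_real(trained, real_robot)
```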
Imitation Learning
RL requires reward design, which is hard to get right for complex tasks. The alternative: learn from demonstrations.
```python
import h5py
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

class DemonstrationDataset(Dataset):
    """Dataset of demonstrations."""

    def __init__(self, demo_paths: list):
        self.demonstrations = []
        for path in demo_paths:
            self.demonstrations.extend(self.load_demonstration(path))

    def load_demonstration(self, path: str) -> list:
        """Load and preprocess a demonstration (assumes an HDF5 layout
        with 'observations' and 'actions' datasets)."""
        with h5py.File(path, 'r') as f:
            observations = f['observations'][:]
            actions = f['actions'][:]
        return [
            {'observation': observations[i], 'action': actions[i]}
            for i in range(len(actions))
        ]

    def __len__(self):
        return len(self.demonstrations)

    def __getitem__(self, idx):
        item = self.demonstrations[idx]
        return {
            'observation': torch.tensor(item['observation'], dtype=torch.float32),
            'action': torch.tensor(item['action'], dtype=torch.float32)
        }

class BehaviorCloning:
    """Standard behavior cloning: supervised regression onto expert actions."""

    def __init__(self, policy: nn.Module, lr: float = 1e-4):
        self.policy = policy
        self.optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

    def train(self, dataset: Dataset, epochs: int = 100, batch_size: int = 64):
        """Train policy via behavior cloning."""
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        for epoch in range(epochs):
            total_loss = 0.0
            for batch in dataloader:
                obs = batch['observation']
                target_action = batch['action']
                # Forward
                predicted_action = self.policy(obs)
                # MSE loss against the demonstrated action
                loss = F.mse_loss(predicted_action, target_action)
                # Update
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
                total_loss += loss.item()
            avg_loss = total_loss / len(dataloader)
            if epoch % 10 == 0:
                print(f"Epoch {epoch}, Loss: {avg_loss:.4f}")
```
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionPolicy(nn.Module):
    """Diffusion-based policy (state of the art for imitation learning)."""

    def __init__(self, obs_dim: int, action_dim: int,
                 horizon: int = 16, n_diffusion_steps: int = 100):
        super().__init__()
        self.horizon = horizon
        self.action_dim = action_dim
        self.n_steps = n_diffusion_steps
        # Noise schedule, registered as buffers so they move with .to(device)
        betas = self._cosine_beta_schedule(n_diffusion_steps)
        self.register_buffer('betas', betas)
        self.register_buffer('alphas', 1 - betas)
        self.register_buffer('alpha_bars', torch.cumprod(1 - betas, dim=0))
        # Denoising network (minimal definition in the sketch below)
        self.denoiser = UNetDenoiser(
            input_dim=action_dim * horizon,
            cond_dim=obs_dim,
            hidden_dim=256
        )

    def _cosine_beta_schedule(self, timesteps: int) -> torch.Tensor:
        """Cosine noise schedule."""
        steps = timesteps + 1
        x = torch.linspace(0, timesteps, steps)
        alphas_cumprod = torch.cos(((x / timesteps) + 0.008) / 1.008 * math.pi * 0.5) ** 2
        alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
        betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
        return torch.clip(betas, 0, 0.999)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        """Generate an action sequence via reverse diffusion."""
        batch_size = obs.shape[0]
        device = obs.device
        # Start from pure noise
        x = torch.randn(batch_size, self.horizon * self.action_dim, device=device)
        # Reverse diffusion: iteratively denoise
        for t in reversed(range(self.n_steps)):
            t_tensor = torch.full((batch_size,), t, device=device)
            # Predict the noise component
            predicted_noise = self.denoiser(x, t_tensor, obs)
            # One DDPM denoising step
            alpha = self.alphas[t]
            alpha_bar = self.alpha_bars[t]
            noise = torch.randn_like(x) if t > 0 else 0
            x = (1 / torch.sqrt(alpha)) * (
                x - (self.betas[t] / torch.sqrt(1 - alpha_bar)) * predicted_noise
            ) + torch.sqrt(self.betas[t]) * noise
        # Reshape to an action sequence and return the first action
        actions = x.view(batch_size, self.horizon, self.action_dim)
        return actions[:, 0, :]

    def compute_loss(self, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        """Training loss: predict the noise added at a random timestep."""
        batch_size = obs.shape[0]
        device = obs.device
        # Flatten the action sequence
        action_flat = actions.view(batch_size, -1)
        # Sample random timesteps and noise
        t = torch.randint(0, self.n_steps, (batch_size,), device=device)
        noise = torch.randn_like(action_flat)
        # Forward-diffuse the actions
        alpha_bar = self.alpha_bars[t].view(-1, 1)
        noisy_actions = torch.sqrt(alpha_bar) * action_flat + torch.sqrt(1 - alpha_bar) * noise
        # Predict the noise and regress onto it
        predicted_noise = self.denoiser(noisy_actions, t, obs)
        return F.mse_loss(predicted_noise, noise)
```
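`UNetDenoiser` is referenced but never defined above. A minimal conditional denoiser that matches the call signature (the actual Diffusion Policy paper uses a 1D temporal U-Net; this flat MLP is a deliberate simplification):

```python
import torch
import torch.nn as nn

class UNetDenoiser(nn.Module):
    """Minimal denoiser: predicts noise from (noisy actions, timestep, observation).

    A flat MLP stand-in for the temporal U-Net used in the Diffusion Policy paper.
    """
    def __init__(self, input_dim: int, cond_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.time_embed = nn.Embedding(1000, hidden_dim)  # supports up to 1000 steps
        self.net = nn.Sequential(
            nn.Linear(input_dim + cond_dim + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim)
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        t_emb = self.time_embed(t)  # (batch, hidden_dim)
        return self.net(torch.cat([x, cond, t_emb], dim=-1))
```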
Language-Conditioned Policies
"Pick up the apple" → robot picks up the apple
```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

class LanguageConditionedPolicy(nn.Module):
    """Policy conditioned on natural language instructions."""

    def __init__(self, vision_dim: int, action_dim: int):
        super().__init__()
        # CLIP for language understanding
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        # Freeze CLIP
        for param in self.clip.parameters():
            param.requires_grad = False
        self.language_dim = self.clip.text_projection.out_features
        # Vision encoder (separate from CLIP); assumes 84x84 RGB input,
        # which yields 64 * 7 * 7 = 3136 features after the conv stack
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, 4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, 1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512)
        )
        # Policy network
        self.policy = nn.Sequential(
            nn.Linear(512 + self.language_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim)
        )

    def encode_instruction(self, instruction: str) -> torch.Tensor:
        """Encode a natural language instruction with frozen CLIP."""
        inputs = self.clip_processor(
            text=[instruction],
            return_tensors="pt",
            padding=True
        )
        with torch.no_grad():
            text_features = self.clip.get_text_features(**inputs)
        return text_features

    def forward(self, image: torch.Tensor, instruction: str) -> torch.Tensor:
        """Get action given image and instruction."""
        # Encode vision
        vision_features = self.vision_encoder(image)
        # Encode language, then broadcast the single instruction across the batch
        language_features = self.encode_instruction(instruction)
        language_features = language_features.expand(vision_features.size(0), -1)
        # Combine and map to an action
        combined = torch.cat([vision_features, language_features], dim=-1)
        return self.policy(combined)
```
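Illustrative usage (a random tensor stands in for a camera frame; the first run downloads CLIP weights):

```python
policy = LanguageConditionedPolicy(vision_dim=512, action_dim=7)
image = torch.randn(1, 3, 84, 84)  # stand-in for an 84x84 RGB camera frame
action = policy(image, "pick up the apple")
print(action.shape)  # torch.Size([1, 7])
```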
```python
import numpy as np

class SayCan:
    """SayCan-style controller: LLM proposes, a grounding model scores, robot executes.

    (In the original SayCan paper, feasibility scores come from learned value
    functions; using a VLM as the scorer is a simplification.)
    """

    def __init__(self, llm, vlm, skills: dict):
        self.llm = llm        # Language model for proposing next steps
        self.vlm = vlm        # Vision-language model for grounding
        self.skills = skills  # Available robot skills

    def execute_task(self, instruction: str, observation: np.ndarray,
                     max_steps: int = 20) -> list:
        """Execute a multi-step task."""
        executed_skills = []
        current_obs = observation
        for _ in range(max_steps):  # bounded loop instead of `while True`
            # LLM proposes candidate next skills
            proposed_skills = self.llm_propose(instruction, executed_skills)
            # Grounding model scores the feasibility of each in the current scene
            skill_scores = [
                (skill, self.vlm_score(skill, current_obs))
                for skill in proposed_skills
            ]
            # Select the best-scoring candidate
            best_skill = max(skill_scores, key=lambda x: x[1])[0]
            if best_skill == 'done':
                break
            # Execute (execute_skill and parse_skills are assumed helpers, not shown)
            success, new_obs = self.execute_skill(best_skill, current_obs)
            if success:
                executed_skills.append(best_skill)
                current_obs = new_obs
            else:
                continue  # on failure, loop again to retry or replan
        return executed_skills

    def llm_propose(self, instruction: str, history: list) -> list:
        """LLM proposes candidate next skills."""
        prompt = f"""
        Task: {instruction}
        Already done: {history}
        Available skills: {list(self.skills.keys())}
        What should be the next step? List top 3 candidates.
        """
        response = self.llm.generate(prompt)
        return self.parse_skills(response)

    def vlm_score(self, skill: str, observation: np.ndarray) -> float:
        """Score whether the skill is feasible in the current scene."""
        prompt = f"Can the robot perform '{skill}' in this scene? Score 0-1."
        return self.vlm.score(observation, prompt)
```
Benchmarks
| Benchmark | Type | Tasks | Complexity |
|-----------|------|-------|------------|
| Meta-World | Manipulation | 50 | Medium |
| RLBench | Manipulation | 100+ | High |
| CALVIN | Language-conditioned | 34 | High |
| Habitat | Navigation | Various | Medium |
| BEHAVIOR | Mobile manipulation | 100 | Very High |
| ManiSkill2 | Manipulation | 20 | High |
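For concreteness, a minimal Meta-World rollout (based on the documented `metaworld` API; recent releases follow the gymnasium 5-tuple `step`, older ones the 4-tuple shown here):

```python
import random
import metaworld

ml1 = metaworld.ML1('pick-place-v2')          # benchmark: one task, many goals
env = ml1.train_classes['pick-place-v2']()    # instantiate the environment
env.set_task(random.choice(ml1.train_tasks))  # sample a goal variation
obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()        # random policy as a placeholder
    obs, reward, done, info = env.step(action)
```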
Research Ideas
For a bachelor's thesis:
- Sim environment setup (Isaac Gym, MuJoCo)
- Basic policy learning on simple tasks
- Behavior cloning from recorded demonstrations
For a master's thesis:
- Sim-to-real transfer with domain randomization
- Language-conditioned policies
- Tactile-guided manipulation
For a PhD:
- Foundation models for robotics
- World models for long-horizon planning
- Safe learning in the real world
- Human-robot collaboration
Tools
Simulators:
- Isaac Gym/Sim: developer.nvidia.com/isaac-gym
- MuJoCo: mujoco.org (quick-start example after this list)
- PyBullet: pybullet.org
- Robosuite: robosuite.ai
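A minimal MuJoCo quick start using the official Python bindings (the scene XML is a toy example):

```python
import mujoco

# Minimal MuJoCo scene: a single free-falling box
xml = """
<mujoco>
  <worldbody>
    <body name="box" pos="0 0 1">
      <freejoint/>
      <geom type="box" size="0.1 0.1 0.1" mass="1"/>
    </body>
  </worldbody>
</mujoco>
"""
model = mujoco.MjModel.from_xml_string(xml)
data = mujoco.MjData(model)
for _ in range(1000):
    mujoco.mj_step(model, data)  # advance physics by one timestep
print(data.body('box').xpos)     # box position after ~2 s of simulated time
```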
Hardware:
- Franka Emika Panda
- Universal Robots UR5/UR10
- xArm
- Unitree robots
- Tesla Bot, Figure 01
LLM reasoning + vision understanding + physical embodiment = general intelligence? AGI debates aside, the practical applications are enormous: home robots (a $100B+ market), warehouse automation, healthcare assistance, space exploration.
The companies investing billions (Tesla, Google, OpenAI, Figure) are betting that embodiment is necessary for genuine intelligence. Knowledge comes not only from books; it comes from the experience of interacting with the world.
If you are interested in embodied AI, from simulation to real robot deployment, reach out to the SKP-Degree team at skp-degree.com.ua or message us on Telegram: @kursovi_diplomy. We can help with choosing a platform, setting up simulation, and implementing the algorithms.
Keywords: embodied AI, robotics, sim-to-real, imitation learning, diffusion policy, RT-2, manipulation, humanoid, bachelor's thesis, master's thesis, PhD, AI research