Learning Object Manipulation from Scratch via Contrastive Interaction

Overview

Motivation figure: smooth reachability in locomotion vs. piecewise mode changes in manipulation

TL;DR

Observation
The contrastive RL (CRL) representation is smooth in locomotion but piecewise nonlinear in manipulation.
Analysis
With a Factorized MDP (FMDP) and Gaussian interpolation:
  • In locomotion, the representation dynamics can be approximated by a linear transformation.
  • In manipulation, they can only be approximated locally by an affine transformation.
  • In manipulation, the energy approximation error propagates after the first interaction.
Method
Interaction-Weighted Resampling (IWR) adds sample coverage near interactions and controls error propagation.
Result
IWR improves manipulation performance by 19.8% on average across Box2D, Meta-World, and Air Hockey (sim-to-real).

Locomotion vs. Manipulation

Goal-Conditioned Reinforcement Learning (GCRL)

Sampled Future Occupancy
$$\rho^\pi(g \mid s,a) = (1-\gamma)\sum_{k=1}^{\infty}\gamma^{k-1}p^\pi(s_{t+k}=g \mid s_t=s, a_t=a)$$
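
In practice, a goal distributed according to this discounted occupancy can be drawn from a stored trajectory by sampling an offset $k \ge 1$ with probability $(1-\gamma)\gamma^{k-1}$ and taking $s_{t+k}$. The sketch below is a minimal illustration, not the authors' code: the function names are hypothetical, and clipping to the last state of a finite trajectory is an assumption.

```python
import math
import random


def sample_future_offset(gamma, rng):
    """Draw k >= 1 with P(k) = (1 - gamma) * gamma**(k - 1) via inverse CDF."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    return 1 + int(math.log(u) / math.log(gamma))


def sample_goal(trajectory, t, gamma, rng):
    """Return a future state of `trajectory` to use as the goal for step t."""
    k = sample_future_offset(gamma, rng)
    # Assumption: offsets past the end of the trajectory are clipped to the last state.
    idx = min(t + k, len(trajectory) - 1)
    return trajectory[idx]
```

With `gamma = 0.9` the mean offset is `1 / (1 - gamma) = 10` steps, so most sampled goals lie within a few tens of steps of the current state.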

Contrastive Reinforcement Learning (CRL)

Energy Function
$$E^*(s,a,g) = \log \rho^\pi(g \mid s,a) - \log \bar\rho_B(g) = \phi(s,a)^\top \psi(g)$$
InfoNCE Contrastive Objective
$$\max_{\phi,\psi}\ \mathbb{E}\left[\log \frac{\exp(\phi(s,a)^\top \psi(g^+))}{\sum_{j=1}^{N}\exp(\phi(s,a)^\top \psi(g_j^-))}\right]$$
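
For a single $(s,a)$ pair, the term inside the expectation can be evaluated as below. This is a minimal sketch that follows the formula as written, with the denominator summing over the negatives only; many implementations also include the positive in the denominator. The embeddings are plain Python lists and all function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product phi(s,a)^T psi(g) of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def infonce_objective(phi_sa, psi_pos, psi_negs):
    """Log-ratio of the positive score to the summed negative scores."""
    pos = dot(phi_sa, psi_pos)
    negs = [dot(phi_sa, psi) for psi in psi_negs]
    return pos - logsumexp(negs)
```

Maximizing this quantity raises the energy $\phi(s,a)^\top \psi(g^+)$ of goals that were actually reached and lowers it for goals sampled from other trajectories.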

t-SNE visualization of $\phi(s,a)$

Locomotion
Smooth representation landscape

Manipulation
Nonlinear representation landscape
Limited control accuracy

Analysis: Piecewise Nonlinearity

Lemma 1. When an interaction occurs, the next representation point $\psi_{t+1}$ can only be locally approximated by an affine transformation $A_1\psi_t + b_t$, which leads to piecewise nonlinearity.

Proposition 1. In manipulation, actions affect the object only during an interaction; otherwise, object movement is passive. This leads to error propagation in the energy function.

$$\sup |\widehat{E}_k - E_k| \;\propto\; \|A_0^k\|\,\|e\| \;+\; \tfrac{1}{2}\|A_0^k\|^2\|e\|^2$$
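
To get a feel for how the right-hand side scales with the horizon $k$, it can be evaluated numerically up to the unstated proportionality constant. The sketch below assumes, purely for illustration, a diagonal $A_0$ (so the spectral norm of $A_0^k$ is the largest absolute diagonal entry raised to the $k$-th power); the function name is hypothetical.

```python
def bound_rhs(diag_a0, err_norm, k):
    """Evaluate ||A_0^k|| ||e|| + 0.5 ||A_0^k||^2 ||e||^2 for diagonal A_0."""
    # Assumption: A_0 = diag(diag_a0), so ||A_0^k||_2 = max_i |d_i|**k.
    op_norm = max(abs(d) for d in diag_a0) ** k
    return op_norm * err_norm + 0.5 * op_norm**2 * err_norm**2
```

Under this assumption the term grows geometrically in $k$ whenever the largest entry exceeds 1 in magnitude and shrinks when it is below 1, which illustrates why an error introduced at an interaction can matter more the longer the object subsequently moves passively.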
Theory illustration of piecewise affine representation dynamics around interactions

Method

Overview of Interaction-Weighted Resampling (IWR): contrastive RL with uniform future sampling vs. interaction-aware resampling
IWR training pipeline: interaction-weighted resampling, CRL critic update, and actor update with replay buffer
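
The contrast between uniform future sampling and interaction-aware resampling can be illustrated with a small sketch. This is not the authors' algorithm: the weighting rule (a discounted weight multiplied by `1 + boost` near interaction steps), the per-step `interaction_flags`, and the parameters `boost` and `window` are all hypothetical choices made only to show the idea of adding sample coverage near interactions.

```python
import random


def interaction_weights(interaction_flags, t, gamma, boost, window):
    """Unnormalized sampling weights over future indices t+1 .. T-1."""
    T = len(interaction_flags)
    weights = []
    for j in range(t + 1, T):
        # Hypothetical rule: a step counts as "near an interaction" if any
        # flag within `window` steps of j is set.
        near = any(interaction_flags[max(0, j - window): j + window + 1])
        w = gamma ** (j - t - 1)  # standard discounted future weight
        if near:
            w *= 1.0 + boost  # upweight goals near interactions
        weights.append(w)
    return weights


def sample_goal_index(interaction_flags, t, gamma, boost, window, rng):
    """Sample a future index for step t using the interaction-aware weights."""
    weights = interaction_weights(interaction_flags, t, gamma, boost, window)
    if not weights:
        raise ValueError("no future steps available after t")
    candidates = range(t + 1, len(interaction_flags))
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Setting `boost = 0` recovers plain discounted future sampling, so the two regimes differ only in how much probability mass is moved toward interaction steps.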

Experiments

Overall Results

| Task | PPO | SAC | SAC+HER | SAC+HINT | CRL | CRTR | IWR (Ours) |
|---|---|---|---|---|---|---|---|
| Air Hockey (Simulation) | 0.617 | 0.145 | 0.398 | 0.422 | 0.695 | 0.727 | 0.742 (+2.1%) |
| Air Hockey (real-transfer) | 0.160 | 0.215 | 0.129 | 0.125 | 0.477 | 0.465 | 0.500 (+4.8%) |
| Air Hockey (Real Robot) | 0/20 | 0/20 | 0/20 | 0/20 | 5/20 | 2/20 | 12/20 (+140.0%) |
| Box2D (center) | 0.086 | 0.058 | 0.088 | 0.088 | 0.278 | 0.274 | 0.288 (+3.6%) |
| Box2D (goal) | 0.089 | 0.046 | 0.086 | 0.064 | 0.450 | 0.558 | 0.709 (+27.1%) |
| Box2D (hard) | 0.060 | 0.042 | 0.064 | 0.076 | 0.317 | 0.365 | 0.565 (+54.8%) |
| Box2D (hard velocity) | 0.148 | 0.149 | 0.152 | 0.139 | 0.387 | 0.377 | 0.436 (+12.7%) |
| Box2D (maze) | 0.033 | 0.012 | 0.031 | 0.035 | 0.217 | 0.206 | 0.223 (+2.8%) |
| Meta-World (peg insert) | 0.000 | 0.000 | 0.000 | 0.000 | 0.430 | 0.367 | 0.438 (+1.9%) |
| Meta-World (pick place) | 0.000 | 0.000 | 0.004 | 0.000 | 0.266 | 0.305 | 0.570 (+86.9%) |
| Meta-World (push) | 0.000 | 0.000 | 0.004 | 0.000 | 0.699 | 0.750 | 0.730 |
| Meta-World (sweep into) | 0.000 | 0.004 | 0.020 | 0.004 | 0.805 | 0.910 | 0.926 (+1.8%) |

Average IWR improvement +19.8%
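
The per-task percentages are numerically consistent with the relative gain of IWR over the strongest baseline in each row. The sketch below recomputes them from the table values; the function and variable names are hypothetical. The source does not state how the average was formed, but averaging the ten simulated rows that list a gain (that is, excluding the real-robot row and Meta-World push) gives roughly 19.8%.

```python
def relative_gain_percent(ours, best_baseline):
    """Relative improvement of `ours` over `best_baseline`, in percent."""
    return 100.0 * (ours - best_baseline) / best_baseline


# (IWR score, best baseline score) for the simulated rows that list a gain.
simulated_rows = {
    "Air Hockey (Simulation)": (0.742, 0.727),
    "Air Hockey (real-transfer)": (0.500, 0.477),
    "Box2D (center)": (0.288, 0.278),
    "Box2D (goal)": (0.709, 0.558),
    "Box2D (hard)": (0.565, 0.365),
    "Box2D (hard velocity)": (0.436, 0.387),
    "Box2D (maze)": (0.223, 0.217),
    "Meta-World (peg insert)": (0.438, 0.430),
    "Meta-World (pick place)": (0.570, 0.305),
    "Meta-World (sweep into)": (0.926, 0.910),
}

gains = {task: relative_gain_percent(*scores) for task, scores in simulated_rows.items()}
average_gain = sum(gains.values()) / len(gains)
```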

Box2D (Hard)

CRTR (34 ticks)
IWR (90 ticks)

Meta-World (Pick-Place)

CRTR (6/20)
IWR (13/20)

Air Hockey: Sim-to-Real Transfer

Air Hockey real-transfer training curves: IWR achieves higher success rate than CRL, CRTR, and sparse-reward baselines
IWR (Ours): 12/20
SGCRL: 5/20
CRTR: 2/20
PPO: 0/20
SAC: 0/20
SAC+HER: 0/20
SAC+HINT: 0/20