Stage-Aware Reward Modeling for Robot Manipulation
Applied Stage-Aware Reward Models (SARM) and Reward-Aligned Behaviour Cloning (RA-BC) to the ALOHA bimanual cube transfer task using ACT and DiffusionPolicy. With just 50 expert demos and no environment reward, RA-BC gave DiffusionPolicy a 3× boost (8% → 24%) while ACT reached 68% success on its own.
AlohaTransferCube-v0 — a simulated bimanual robot (2 × 6-DOF arms, 14-DOF total) picks up a cube with its right arm and hands it to the left arm. Trained on 50 human expert episodes with RGB cameras + joint states. Success = cube held in left gripper at episode end.
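The action dimensionality above can be reconciled as follows: each 6-DOF arm also carries one gripper joint, giving 7 actuated DOF per arm and 14 total. A small sketch of the task interface, with illustrative names (not the actual gym-aloha keys):

```python
# Hypothetical summary of the task interface, inferred from the description
# above. Field names are illustrative, not the real environment's keys.
TASK_SPEC = {
    "env_id": "AlohaTransferCube-v0",
    "n_arms": 2,
    "action_dim": 14,  # (6 arm joints + 1 gripper) per arm, times 2 arms
    "observations": ["rgb_cameras", "joint_states"],
    "n_demos": 50,
    "success": "cube held in left gripper at episode end",
}

# Sanity check: 14 DOF = 2 arms x (6 joints + 1 gripper).
assert TASK_SPEC["action_dim"] == TASK_SPEC["n_arms"] * (6 + 1)
```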
Standard behaviour cloning treats every frame equally. But not all frames are equally useful — some show real progress (grasping, lifting) while others are idle or transitioning.
SARM learns to score each frame by task progress (0 → 1), trained on GPT-4V annotations of 5 subtask stages: approach, grasp, lift, transfer, and release. RA-BC then uses these scores to weight training samples: frames where the robot makes measurable progress get full weight, while stagnant or idle frames are downweighted.
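The weighting idea can be sketched in a few lines. This is a minimal illustrative reconstruction, not the paper's implementation: the stage-to-progress mapping, the weight floor of 0.1, and all function names are assumptions.

```python
import numpy as np

def stage_progress(stage_idx: int, frac_in_stage: float, n_stages: int = 5) -> float:
    """Map a stage label (0..n_stages-1) plus a within-stage fraction to a
    0-1 progress target. Stage k covers [k/n_stages, (k+1)/n_stages].
    Illustrative scheme, not necessarily SARM's exact labeling rule."""
    return (stage_idx + frac_in_stage) / n_stages

def rabc_weights(progress: np.ndarray, floor: float = 0.1) -> np.ndarray:
    """Per-frame sample weights from predicted progress scores.

    Frames whose progress increases over the previous frame get full weight;
    stagnant or regressing frames drop to `floor` (hypothetical value)."""
    delta = np.diff(progress, prepend=progress[0])  # frame-to-frame change
    return np.where(delta > 0, 1.0, floor)

def weighted_bc_loss(pred_actions: np.ndarray,
                     expert_actions: np.ndarray,
                     weights: np.ndarray) -> float:
    """Behaviour-cloning MSE where each frame's imitation error is scaled
    by its RA-BC weight before averaging."""
    per_frame = np.mean((pred_actions - expert_actions) ** 2, axis=-1)
    return float(np.average(per_frame, weights=weights))
```

For example, on an episode whose predicted progress rises, plateaus, then rises again, `rabc_weights(np.array([0.0, 0.2, 0.2, 0.5, 0.5]))` keeps full weight only on the two advancing frames and downweights the rest.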
| Config | Training steps | Success rate |
|---|---|---|
| Vanilla DiffusionPolicy | 10K | 8% |
| RA-BC DiffusionPolicy | 10K | 24% |
| Vanilla ACT | 80K | 68% |
| RA-BC ACT | 80K | 62% |
DiffusionPolicy struggles at an 8% baseline with only 50 demos, but RA-BC's frame weighting tripled its success rate to 24%. ACT reaches 68% on its own, and RA-BC actually dipped it slightly to 62% — its 100-step action chunks already capture long coherent segments, leaving little room for reward-based reweighting to help.
*Figure: SARM reward predictions and the RA-BC training pipeline.*