✨ TL;DR
This paper proposes Occupancy Reward Shaping (ORS), a method that extracts temporal and geometric information from learned world models to build reward functions that improve credit assignment in offline goal-conditioned reinforcement learning. The approach applies optimal transport to occupancy measures and reports a 2.2x average performance improvement across 13 long-horizon locomotion and manipulation tasks, along with successful application to real-world Tokamak control.
Credit assignment in reinforcement learning is fundamentally challenging because of the temporal lag between actions and their long-term consequences. The problem is particularly acute in offline goal-conditioned RL with sparse rewards, where agents must learn from fixed datasets without environment interaction. Generative world models capture the distribution of future states an agent may visit, which suggests they encode temporal information, but it remains unclear how to extract and use that information for better credit assignment.
The paper formalizes how temporal information in world models encodes the underlying geometry of the environment. Using optimal transport theory, the authors extract geometric structure from a learned occupancy measure (the distribution of states visited under a policy) and convert it into a reward function that captures goal-reaching information. The shaping is theoretically grounded in that it does not alter the optimal policy, while providing practical credit assignment benefits in sparse-reward settings.
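The claim that shaping need not alter the optimal policy can be illustrated with classic potential-based reward shaping, where the shaped reward is r + γΦ(s') − Φ(s). The sketch below is not the paper's ORS algorithm; it is a minimal toy under stated assumptions. The 6-state chain, the uniform random behavior policy, the exact occupancy measure computed by matrix inversion (standing in for a learned world model), and all function names are hypothetical. The potential is taken as the negative 1-Wasserstein distance between a state's discounted occupancy measure and a point mass at the goal, which for a point-mass target reduces to the expected distance to the goal.

```python
import numpy as np

GAMMA = 0.9
N = 6                 # states 0..5 on a line (hypothetical toy chain)
GOAL = N - 1
ACTIONS = (-1, +1)    # index 0 = left, index 1 = right

def step(s, a):
    """Deterministic dynamics; the goal is absorbing, walls clip movement."""
    if s == GOAL:
        return GOAL
    return min(max(s + a, 0), N - 1)

def sparse_reward(s, s_next):
    """Reward 1 only on the transition that first enters the goal."""
    return 1.0 if (s != GOAL and s_next == GOAL) else 0.0

def occupancy_under_random_policy():
    """Row s is the discounted occupancy measure started at s:
    mu_s(x) = (1 - gamma) * sum_t gamma^t * P(s_t = x | s_0 = s)."""
    P = np.zeros((N, N))
    for s in range(N):
        for a in ACTIONS:
            P[s, step(s, a)] += 1.0 / len(ACTIONS)
    return (1.0 - GAMMA) * np.linalg.inv(np.eye(N) - GAMMA * P)

def potential_from_occupancy(occ):
    """Phi(s) = -W1(mu_s, delta_goal). Against a point mass, the 1-Wasserstein
    distance equals the expected ground distance |x - goal| under mu_s."""
    dist_to_goal = np.abs(np.arange(N) - GOAL).astype(float)
    return -(occ @ dist_to_goal)

def q_values(phi, iters=500):
    """Synchronous value iteration on the reward r + gamma*phi(s') - phi(s)."""
    Q = np.zeros((N, len(ACTIONS)))
    for _ in range(iters):
        V = Q.max(axis=1)  # values from the previous sweep
        for s in range(N):
            for i, a in enumerate(ACTIONS):
                s2 = step(s, a)
                r = sparse_reward(s, s2) + GAMMA * phi[s2] - phi[s]
                Q[s, i] = r + GAMMA * V[s2]
    return Q

occ = occupancy_under_random_policy()
phi = potential_from_occupancy(occ)

Q_sparse = q_values(np.zeros(N))   # zero potential = original sparse reward
Q_shaped = q_values(phi)           # occupancy-derived potential

pi_sparse = Q_sparse.argmax(axis=1)
pi_shaped = Q_shaped.argmax(axis=1)

print("potential:", np.round(phi, 3))
print("greedy actions (sparse):", pi_sparse)
print("greedy actions (shaped):", pi_shaped)
```

In this toy setting the shaped Q-values differ from the sparse ones only by the state-dependent offset −Φ(s), so the greedy action in every state is unchanged, while the shaped reward gives a nonzero learning signal on transitions far from the goal.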
What the paper shows.
ORS improves performance by 2.2x on average across 13 diverse long-horizon locomotion and manipulation tasks in offline goal-conditioned RL settings. The method is also applied successfully to three real-world Tokamak control tasks in nuclear fusion, which supports its practical utility beyond simulation.
The paper does not explicitly discuss the computational overhead of the optimal transport calculations or scalability to very high-dimensional state spaces. The evaluation covers locomotion, manipulation, and Tokamak control tasks, so generalization to other domains remains unclear. Neither the dependence on the quality of the learned world model nor the sensitivity to hyperparameters in occupancy measure estimation is thoroughly analyzed.
✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.