Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning

Aravind Venugopal; Jiayu Chen; Xudong Wu; Chongyi Zheng; Benjamin Eysenbach; Jeff Schneider

✨ TL;DR

This paper proposes Occupancy Reward Shaping (ORS), a method that extracts temporal and geometric information from learned world models to build reward functions that improve credit assignment in offline goal-conditioned reinforcement learning. The approach applies optimal transport to occupancy measures and reports a 2.2x average performance improvement across 13 long-horizon locomotion and manipulation tasks. It is also applied to 3 real-world tokamak control tasks.

01 · Problem

Credit assignment in reinforcement learning is hard because of the temporal lag between actions and their long-term consequences. The problem is especially acute in offline goal-conditioned RL with sparse rewards, where agents must learn from fixed datasets without environment interaction. Generative world models capture the distribution of future states an agent may visit, so they encode temporal information, but it remains unclear how to extract and use that information to improve credit assignment.

02 · Approach

The paper formalizes how temporal information in world models encodes the underlying geometry of the environment. Using optimal transport theory, the authors extract geometric structure from a learned occupancy measure (the distribution of states visited under a policy) and convert it into a reward function that captures goal-reaching information. The authors show theoretically that this reward shaping does not alter the optimal policy, while in practice it improves credit assignment in sparse-reward settings.
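To illustrate how a shaped reward can leave the optimal policy unchanged, here is a minimal sketch of potential-based shaping, r'(s, a, s') = r(s, a, s') + γΦ(s') − Φ(s), a standard construction with that property. This summary does not state that ORS takes exactly this form, so treat that as an assumption. The 6-state chain, the sparse reward, and the potential `PHI` (negative distance to the goal, standing in for a geometry-derived signal) are all hypothetical.

```python
import numpy as np

GAMMA = 0.9
N_STATES = 6                 # hypothetical chain of states 0..5
GOAL = N_STATES - 1          # goal is the right-most state
ACTIONS = (-1, +1)           # move left / move right

# Hypothetical potential: negative distance to the goal. In ORS the shaping
# signal would come from occupancy geometry; this is only a stand-in.
PHI = -np.abs(GOAL - np.arange(N_STATES)).astype(float)


def step(s, a):
    """Deterministic chain dynamics, clipped at both ends."""
    return min(max(s + a, 0), N_STATES - 1)


def sparse_reward(s, a, s_next):
    """Sparse goal-reaching reward: 1 whenever the next state is the goal."""
    return 1.0 if s_next == GOAL else 0.0


def shaped_reward(s, a, s_next):
    """Sparse reward plus the potential-based shaping term."""
    return sparse_reward(s, a, s_next) + GAMMA * PHI[s_next] - PHI[s]


def q_values(reward_fn, n_iters=500):
    """Tabular value iteration for the infinite-horizon discounted chain."""
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(n_iters):
        v = q.max(axis=1)    # state values from the previous sweep
        for s in range(N_STATES):
            for i, a in enumerate(ACTIONS):
                s_next = step(s, a)
                q[s, i] = reward_fn(s, a, s_next) + GAMMA * v[s_next]
    return q


q_sparse = q_values(sparse_reward)
q_shaped = q_values(shaped_reward)

# Greedy action index per state under each reward (1 means "move right").
print("greedy (sparse):", q_sparse.argmax(axis=1))
print("greedy (shaped):", q_shaped.argmax(axis=1))
```

In this toy case the shaped Q-values equal the sparse ones minus Φ(s), so the greedy action in every state is the same, while the shaped reward is nonzero on transitions far from the goal where the sparse reward is zero.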

03 · Key insights

What the paper shows.

01 Generative world models implicitly encode geometric information about the environment through their learned occupancy measures

02 Optimal transport provides a principled framework to extract this geometric structure and convert it into informative reward signals

03 Reward shaping based on occupancy geometry can provably preserve optimal policies while improving learning efficiency

04 The method is broadly applicable across diverse domains from locomotion and manipulation to real-world control tasks like nuclear fusion
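As a rough illustration of the first two insights, the sketch below compares discounted occupancy measures with a Wasserstein-1 distance in a toy 1D setting. It is not the paper's formulation, which this summary does not spell out. The Gaussian random-walk dynamics, the goal position, and the helper names `sample_occupancy` and `wasserstein1_1d` are assumptions made for the example.

```python
import numpy as np


def sample_occupancy(start, n_samples, gamma, rng):
    """Sample states from the discounted occupancy of a 1D Gaussian random walk.

    A time offset t is drawn with P(t) = (1 - gamma) * gamma**t, and the state
    after t unit-variance steps is distributed as start + sqrt(t) * N(0, 1).
    """
    horizons = rng.geometric(1.0 - gamma, size=n_samples) - 1  # t = 0, 1, 2, ...
    return start + np.sqrt(horizons) * rng.standard_normal(n_samples)


def wasserstein1_1d(x, y):
    """W1 between two equal-size 1D empirical distributions.

    In one dimension the optimal coupling matches sorted samples, so the
    distance is the mean absolute difference of the order statistics.
    """
    x, y = np.sort(x), np.sort(y)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample counts")
    return float(np.mean(np.abs(x - y)))


rng = np.random.default_rng(0)
GOAL = 10.0        # hypothetical goal position
GAMMA = 0.9
N = 20_000

goal_occ = sample_occupancy(GOAL, N, GAMMA, rng)
distances = {}
for start in (0.0, 5.0, 9.0):
    occ = sample_occupancy(start, N, GAMMA, rng)
    distances[start] = wasserstein1_1d(occ, goal_occ)
    print(f"start={start:4.1f}  W1 to goal occupancy = {distances[start]:.2f}")
```

Because the toy dynamics are translation-invariant, the distance between two occupancies recovers roughly the spatial distance between their start states. One hypothetical way to turn such a quantity into a reward signal is to use its negative as a goal-reaching score.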

04 · Results

ORS improves performance by 2.2x on average across 13 diverse long-horizon locomotion and manipulation tasks in offline goal-conditioned RL settings. The method is also applied successfully to 3 real-world tokamak control tasks for nuclear fusion, which supports its practical utility beyond simulation.

05 · Limitations

The paper does not explicitly discuss computational overhead of optimal transport calculations or scalability to very high-dimensional state spaces. The evaluation focuses on locomotion, manipulation, and Tokamak control tasks, leaving generalization to other domains unclear. The dependence on the quality of the learned world model is not thoroughly analyzed, nor is the sensitivity to hyperparameters in the occupancy measure estimation.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.

