📚 Model-based methods are a good fit for offline RL: a model is trained on the available data and then used either to learn a good policy or to plan directly.
❓ In model-based RL, the trained model is used to answer "what if" questions about different states and actions.
⚙️ Dyna-style methods are adapted to the offline setting by simulating model rollouts that start from states and actions in the collected dataset.
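A minimal sketch of branching model rollouts from dataset states, assuming a learned one-step model and a policy are already available. The names `generate_model_rollouts`, `toy_model_step`, and `toy_policy` are hypothetical, and the short horizon is an assumption made here to limit compounding model error.

```python
import random

def generate_model_rollouts(dataset_states, model_step, policy,
                            horizon=5, n_starts=4, seed=0):
    """Branch short model rollouts from states sampled out of the offline dataset."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_starts):
        state = rng.choice(dataset_states)  # start from a real, collected state
        for _ in range(horizon):            # roll the learned model forward
            action = policy(state)
            next_state, reward = model_step(state, action)
            synthetic.append((state, action, reward, next_state))
            state = next_state
    return synthetic

# Toy stand-ins (hypothetical): a 1-D state and a linear "learned" model.
def toy_model_step(state, action):
    next_state = 0.9 * state + action
    return next_state, -abs(next_state)

def toy_policy(state):
    return -0.1 * state

dataset_states = [1.0, -2.0, 0.5]
rollouts = generate_model_rollouts(dataset_states, toy_model_step, toy_policy)
```

The synthetic transitions would then be mixed with the real dataset when training the policy and critic.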
⛔ One challenge in offline RL is that the policy can learn to exploit the model, steering it into out-of-distribution states where the model wrongly predicts high reward.
🔧 Modifying model-based methods to penalize the policy when it drives the model into implausible states gives the policy an incentive to stay closer to the data.
🔑 MOPO (Model-based Offline Policy Optimization) modifies the reward function to impose a penalty for exploiting the model.
💡 The uncertainty penalty should quantify how wrong the model is at a given state and action, and it must be large enough that exploiting the model is not worthwhile.
⚙️ Model uncertainty techniques, such as training an ensemble of models and measuring how much they disagree, provide an estimate of the model's error.
Ensemble disagreement is a common choice for obtaining error metrics in offline reinforcement learning.
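A minimal sketch of a reward penalized by ensemble disagreement. The specific disagreement measure (largest distance from the ensemble mean) and the names `ensemble_disagreement`, `penalized_reward`, and `lam` are assumptions chosen for illustration, not the exact estimator of any particular method.

```python
import math

def ensemble_disagreement(predictions):
    """Largest Euclidean distance between a member's prediction and the ensemble mean."""
    n = len(predictions)
    dim = len(predictions[0])
    mean = [sum(p[d] for p in predictions) / n for d in range(dim)]
    return max(math.dist(p, mean) for p in predictions)

def penalized_reward(reward, predictions, lam=1.0):
    """Subtract lam times the disagreement from the reward."""
    return reward - lam * ensemble_disagreement(predictions)

# Each inner list is one ensemble member's predicted next state.
agree = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
disagree = [[0.0, 0.0], [3.0, 4.0]]

r_agree = penalized_reward(1.0, agree)                # no penalty when members agree
r_disagree = penalized_reward(1.0, disagree, lam=2.0) # large penalty when they diverge
```

Where the ensemble members agree, the reward is unchanged; where they diverge, the reward drops in proportion to `lam`.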
The analysis relies on two assumptions: the value function can be represented accurately, and the uncertainty estimate upper-bounds the true model error.
Under these assumptions, the learned policy's true return is guaranteed to be at least as high as that of the best policy under a return-minus-model-error objective.
The best policy under that objective earns high return while avoiding states where the model may be incorrect.
One consequence is that the learned policy is at least as good as the behavior policy, up to a model-error term that is small on the data the behavior policy collected.
If the model is accurate on the states and actions visited by the optimal policy, the learned policy can be close to optimal.
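The guarantee can be written compactly. The notation is an assumption introduced here and does not appear in these notes: r is the true reward, u(s,a) the uncertainty estimate, λ the penalty weight, η_M(π) the true return of policy π, π̂ the learned policy, and ε_u(π) the expected (discounted) uncertainty along rollouts of π in the learned model T̂.

```latex
\tilde{r}(s,a) = r(s,a) - \lambda\, u(s,a)
\qquad
\eta_M(\hat{\pi}) \;\ge\; \sup_{\pi} \bigl\{ \eta_M(\pi) - 2\lambda\, \epsilon_u(\pi) \bigr\}
\qquad
\epsilon_u(\pi) = \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{T}}}\bigl[ u(s,a) \bigr]
```

Taking π to be the behavior policy, for which ε_u is small because the model was trained on its data, gives the comparison with the behavior policy. Taking π to be the optimal policy gives near-optimality whenever the model is accurate on the states and actions that policy visits.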
🔍 When the critic is also trained on data generated by the model, its loss function can be designed to push Q-values down on state-action pairs from the model and up on pairs from the dataset.
🎲 COMBO, a Dyna-style algorithm that builds on CQL, aims to improve offline reinforcement learning by making the model-based states and actions look worse than the data-based ones.
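A simplified sketch of a critic loss that makes model-generated state-action pairs look worse than dataset pairs. The function names, the weight `beta`, and the plain mean-squared Bellman term are assumptions for illustration and omit details a full algorithm would need.

```python
def conservative_penalty(q_model, q_data, beta=1.0):
    """beta * (mean Q on model-generated pairs - mean Q on dataset pairs).

    Minimizing this term lowers Q-values on model samples and raises
    them on dataset samples.
    """
    mean_model = sum(q_model) / len(q_model)
    mean_data = sum(q_data) / len(q_data)
    return beta * (mean_model - mean_data)

def critic_loss(bellman_errors, q_model, q_data, beta=1.0):
    """Mean squared Bellman error plus the conservative penalty."""
    bellman = sum(e * e for e in bellman_errors) / len(bellman_errors)
    return bellman + conservative_penalty(q_model, q_data, beta)

# Toy numbers (hypothetical): Q-values the critic currently assigns.
q_model = [5.0, 7.0]   # on state-action pairs produced by the model
q_data = [2.0, 4.0]    # on state-action pairs from the dataset
loss = critic_loss([1.0, -1.0], q_model, q_data, beta=1.0)
```

With `beta = 0` this reduces to an ordinary Bellman-error loss; larger values of `beta` make the critic more pessimistic about model-generated data.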
📊 The trajectory transformer method in offline reinforcement learning trains a model over entire trajectories to estimate the distribution of state-action sequences, then plans by searching for high-reward actions that remain probable under that distribution.
🔑 Using a large and expressive model class, like a transformer, is convenient for offline reinforcement learning.
🔄 To model multi-modal distributions, each dimension of every state and action in the trajectory is discretized into its own token.
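A minimal sketch of per-dimension discretization, assuming uniform bins over a shared range. The bin layout, the shared `low` and `high` bounds, and the function names are assumptions for illustration.

```python
def discretize(value, low, high, n_bins):
    """Map a scalar to an integer bin index in [0, n_bins - 1] using uniform bins."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # value == high falls into the last bin

def tokenize_trajectory(trajectory, low, high, n_bins):
    """Flatten [(state, action), ...] into one token per state and action dimension."""
    tokens = []
    for state, action in trajectory:
        for x in list(state) + list(action):
            tokens.append(discretize(x, low, high, n_bins))
    return tokens

# Two time steps with a 2-D state and a 1-D action (toy values).
trajectory = [([0.0, 0.5], [-1.0]), ([1.0, -0.5], [0.25])]
tokens = tokenize_trajectory(trajectory, low=-1.0, high=1.0, n_bins=4)
```

Because each token is a categorical variable, a sequence model can place probability mass on several separate bins for the same dimension, which is what allows multi-modal predictions.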
⏲️ Modeling the joint probability of states and actions allows accurate predictions over longer horizons.
The trajectory transformer can be used to predict a humanoid's motion many steps into the future.
Planning uses beam search to maximize reward.
Restricting the search to high-probability trajectories avoids out-of-distribution states and actions.
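A toy sketch of reward-maximizing beam search that only expands tokens the sequence model considers likely. The callables `next_token_probs` and `reward_fn`, the `top_k` filter, and the three-token vocabulary are hypothetical stand-ins for a trained trajectory model.

```python
def beam_search_plan(next_token_probs, reward_fn, horizon, beam_width=2, top_k=2):
    """Keep the beam_width highest-reward sequences, expanding only likely tokens."""
    beams = [((), 0.0)]  # (token sequence, cumulative reward)
    for _ in range(horizon):
        candidates = []
        for seq, total in beams:
            probs = next_token_probs(seq)  # dict: token -> probability
            likely = sorted(probs, key=probs.get, reverse=True)[:top_k]
            for tok in likely:  # low-probability tokens are never expanded
                candidates.append((seq + (tok,), total + reward_fn(seq, tok)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy model: token "c" promises a huge reward but is very unlikely under the data.
def toy_probs(seq):
    return {"a": 0.6, "b": 0.39, "c": 0.01}

def toy_reward(seq, tok):
    return {"a": 1.0, "b": 2.0, "c": 100.0}[tok]

plan, total = beam_search_plan(toy_probs, toy_reward, horizon=3)
```

In this toy case the search returns the sequence of `"b"` tokens and never selects `"c"`, even though `"c"` has the largest nominal reward, because it falls outside the high-probability set.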