📚 Understanding linear fitted value functions is valuable for analyzing and developing deep offline reinforcement learning methods.
🔎 Classical offline value function estimation extends existing ideas from approximate dynamic programming and Q-learning, using simple function approximators.
🔬 Current research focuses on deriving approximate solutions with neural nets as function approximators, with the primary challenge being distributional shift.
🔑 Offline model-based reinforcement learning can be performed in the feature space.
💡 Linear function approximation is used to model both the reward and the transitions as linear functions of the state features.
🧩 The least-squares solution gives the reward weight vector, so that the features multiplied by those weights approximate the true reward.
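As a minimal sketch (assuming a tabular MDP with a feature matrix `Phi` of shape |S| × K and a known reward vector `r`; all names here are illustrative), the reward weights are an ordinary least-squares fit:

```python
import numpy as np

def fit_reward_weights(Phi: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Least-squares fit of w_r such that Phi @ w_r approximates r.

    Phi: (num_states, num_features) feature matrix.
    r:   (num_states,) reward vector.
    """
    # Solves min_w ||Phi w - r||^2, i.e. w_r = (Phi^T Phi)^{-1} Phi^T r.
    w_r, *_ = np.linalg.lstsq(Phi, r, rcond=None)
    return w_r
```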
Offline Reinforcement Learning in a sample-based setting.
Transition models in feature space describe how the features of the current state map to the expected features of the next state.
Policy-specific transition matrices are used in policy evaluation and policy improvement.
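Under the same assumptions (a known policy-specific transition matrix `P_pi` over states, which the sample-based setting later replaces with dataset samples), a sketch of the feature-space transition model is another least-squares fit, this time matching expected next-state features:

```python
import numpy as np

def fit_feature_transition(Phi: np.ndarray, P_pi: np.ndarray) -> np.ndarray:
    """Least-squares fit of P_phi such that Phi @ P_phi approximates P_pi @ Phi.

    Phi:  (num_states, num_features) feature matrix.
    P_pi: (num_states, num_states) policy-specific transition matrix.
    """
    # P_pi @ Phi holds the expected next-state features for each state;
    # P_phi maps current features to those expected next features.
    P_phi, *_ = np.linalg.lstsq(Phi, P_pi @ Phi, rcond=None)
    return P_phi  # (num_features, num_features)
```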
📝 The value function can be represented as the feature matrix multiplied by a vector of weights, i.e. V ≈ Φ w_V.
🔢 The vector-valued version of the Bellman equation can be written as a linear equation, where the value function is the solution.
💡 The value function can be recovered as a solution to a system of linear equations, even in feature space.
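In matrix-vector form (standard notation; γ is the discount factor, P^π the policy-specific transition matrix, P_φ and w_r the feature-space models above), the Bellman equation and its feature-space counterpart read roughly:

```latex
V^{\pi} = r + \gamma P^{\pi} V^{\pi}
  \;\Longrightarrow\;
V^{\pi} = (I - \gamma P^{\pi})^{-1} r,
\qquad
\Phi w_V \approx \Phi w_r + \gamma \Phi P_{\phi} w_V
  \;\Longrightarrow\;
w_V = (I - \gamma P_{\phi})^{-1} w_r .
```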
🔑 Least-Squares Temporal Difference (LSTD) is a classic reinforcement learning result that expresses the value-function weights in closed form in terms of the feature matrix, the transition matrix, and the reward vector.
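Substituting the least-squares reward and transition models into the feature-space Bellman solution gives the usual closed form (as typically written; a small regularizer is often added in practice):

```latex
w_V \;=\; \bigl( \Phi^{\top}\Phi \;-\; \gamma\, \Phi^{\top} P^{\pi} \Phi \bigr)^{-1} \Phi^{\top} r
```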
🔍 Rather than requiring complete knowledge of the transition matrix and reward vector, LSTD can be estimated from samples in an offline dataset.
⚙️ The empirical MDP induced by those samples satisfies the same equation, so replacing the feature matrix, next-state features, and rewards with their sampled counterparts yields a sample-based estimate of the weights.
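A minimal sample-based sketch, assuming `Phi` stacks the features of sampled states, `Phi_next` the features of the corresponding next states (under the policy being evaluated), and `rewards` the sampled rewards (names illustrative):

```python
import numpy as np

def lstd_weights(Phi: np.ndarray, Phi_next: np.ndarray,
                 rewards: np.ndarray, gamma: float,
                 reg: float = 1e-5) -> np.ndarray:
    """Sample-based LSTD: solve (Phi^T Phi - gamma Phi^T Phi') w = Phi^T r.

    Phi:      (N, K) features of sampled states.
    Phi_next: (N, K) features of the corresponding next states.
    rewards:  (N,)   sampled rewards.
    reg:      small ridge term for numerical stability (an added assumption,
              not part of the basic derivation).
    """
    K = Phi.shape[1]
    A = Phi.T @ (Phi - gamma * Phi_next) + reg * np.eye(K)
    b = Phi.T @ rewards
    return np.linalg.solve(A, b)
```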
🔑 Instead of estimating the reward and transition explicitly, we directly estimate the value function using samples and improve the policy.
🔄 We then alternate between recovering the greedy policy under the current value estimate and re-estimating the value function under that policy, which yields least-squares policy iteration (LSPI).
🎯 For offline reinforcement learning we estimate the Q function rather than the value function, using state-action features, so that the greedy policy can be extracted directly without a transition model.
📚 Featurizing actions: states and actions are featurized jointly, giving state-action features φ(s, a).
💡 Φ′, the matrix of next state-action features, depends on the policy π, since the next action is taken as a′ = π(s′); it must be recomputed whenever the policy changes.
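Putting the pieces together, here is a rough sketch of the resulting least-squares policy iteration loop with state-action features (the names and the discrete-action greedy step are illustrative assumptions):

```python
import numpy as np

def lspi(phi_sa, dataset, actions, gamma, num_iters=20, reg=1e-5):
    """Sketch of least-squares policy iteration with state-action features.

    phi_sa:  callable (s, a) -> 1-D feature vector of length K.
    dataset: iterable of (s, a, r, s_next) transitions from the offline data.
    actions: finite set of candidate actions for the greedy step.
    """
    transitions = list(dataset)
    Phi = np.array([phi_sa(s, a) for s, a, _, _ in transitions])   # (N, K)
    r = np.array([rew for _, _, rew, _ in transitions])            # (N,)
    K = Phi.shape[1]
    w = np.zeros(K)

    for _ in range(num_iters):
        # Greedy policy under the current Q estimate: pi(s) = argmax_a phi(s, a) . w
        def greedy(s):
            return max(actions, key=lambda a: phi_sa(s, a) @ w)

        # Phi' depends on the current policy: next features use a' = pi(s').
        Phi_next = np.array([phi_sa(s_next, greedy(s_next))
                             for _, _, _, s_next in transitions])  # (N, K)

        # Sample-based LSTDQ solve for the Q-function weights.
        A = Phi.T @ (Phi - gamma * Phi_next) + reg * np.eye(K)
        w = np.linalg.solve(A, Phi.T @ r)

    return w
```

Each iteration re-estimates Q under the current greedy policy; only the construction of Φ′ changes between iterations.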
⚙️ These methods still suffer from the distributional shift problem in offline RL: the improved policy can select actions that are poorly covered by the dataset, so their features are evaluated out of distribution.