Understanding linear fitted value functions is valuable for analyzing and developing deep offline reinforcement learning methods.
Classical offline value function estimation extends existing ideas from approximate dynamic programming and Q-learning, using simple function approximators.
Current research focuses on deriving approximate solutions with neural networks as function approximators, with the primary challenge being distributional shift.
Offline model-based reinforcement learning can be performed in feature space.
Linear function approximation is used to estimate the rewards and transitions from the features.
The least-squares solution gives the reward weight vector, so that the feature matrix multiplied by these weights approximates the true reward vector.
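As a concrete illustration, here is a minimal numpy sketch of fitting reward weights by least squares in feature space. The names (Phi for the feature matrix, r for sampled rewards, w_r for the reward weights) are illustrative assumptions, not notation taken from the original.

```python
import numpy as np

# Phi: (N, d) feature matrix, one row phi(s_i) per sampled state
# r:   (N,)  observed rewards for those states
rng = np.random.default_rng(0)
N, d = 1000, 8
Phi = rng.normal(size=(N, d))
r = Phi @ rng.normal(size=d) + 0.01 * rng.normal(size=N)  # synthetic rewards

# Least-squares reward weights: w_r = (Phi^T Phi)^{-1} Phi^T r,
# so that Phi @ w_r approximates the true reward vector.
w_r, *_ = np.linalg.lstsq(Phi, r, rcond=None)
print("approximate reward for first state:", Phi[0] @ w_r)
```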
Offline Reinforcement Learning in a sample-based setting.
Transition models in feature space describe how the features of the current state map to the expected features of the next state.
Policy-specific transition matrices are used in policy evaluation and policy improvement.
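The following is a sketch, under my own assumptions, of estimating a policy-specific transition model in feature space by least squares: the matrix P_phi is chosen so that Phi @ P_phi approximates the next-step features observed under the policy. The names Phi, Phi_next, and P_phi are illustrative.

```python
import numpy as np

# Phi:      (N, d) features phi(s_i) of sampled states
# Phi_next: (N, d) features phi(s'_i) of the successor states reached
#           under the policy being evaluated
rng = np.random.default_rng(1)
N, d = 1000, 8
Phi = rng.normal(size=(N, d))
Phi_next = Phi @ rng.normal(size=(d, d)) * 0.5  # synthetic successors

# Least-squares feature-space transition matrix P_phi (d x d):
# Phi @ P_phi ~= Phi_next, i.e. it predicts future features from present ones.
P_phi, *_ = np.linalg.lstsq(Phi, Phi_next, rcond=None)
print(P_phi.shape)  # (d, d)
```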
The value function can be represented as the feature matrix multiplied by a vector of weights.
The vector-valued version of the Bellman equation can be written as a system of linear equations whose solution is the value function.
The value function can be recovered as the solution to a system of linear equations, even in feature space.
Least-squares temporal difference (LSTD) is a classic reinforcement learning method that gives a closed-form expression relating the transition matrix and reward vector to the weights of the value function.
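A minimal sketch of the model-based LSTD solution, assuming the standard form w_V = (Phi^T (Phi - gamma * P_pi @ Phi))^{-1} Phi^T r; the function and variable names here are my own.

```python
import numpy as np

def lstd_weights(Phi, P_pi, r, gamma=0.99):
    """Model-based LSTD: solve for value-function weights w_V with
    V ~= Phi @ w_V, given a known transition matrix and reward vector.

    Phi:  (S, d) state feature matrix
    P_pi: (S, S) policy-specific transition matrix
    r:    (S,)   reward vector
    """
    A = Phi.T @ (Phi - gamma * P_pi @ Phi)   # (d, d)
    b = Phi.T @ r                            # (d,)
    return np.linalg.solve(A, b)
```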
Rather than requiring complete knowledge of the transition matrix and reward vector, LSTD can be solved using samples from an offline dataset.
The empirical MDP induced by the sampled transitions allows the same equation to be solved with sample-based estimates.
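A sample-based version of the same equation, as a sketch: the term involving the transition matrix is replaced by the features of the successor states actually observed in the dataset. Variable names are illustrative assumptions.

```python
import numpy as np

def lstd_weights_from_samples(Phi, Phi_next, rewards, gamma=0.99):
    """Sample-based LSTD: same equation as the model-based form, but
    P_pi @ Phi is replaced by Phi_next, the features of the successor
    states observed in the offline dataset.

    Phi:      (N, d) features phi(s_i) of sampled states
    Phi_next: (N, d) features phi(s'_i) of sampled successor states
    rewards:  (N,)   sampled rewards
    """
    A = Phi.T @ (Phi - gamma * Phi_next)
    b = Phi.T @ rewards
    return np.linalg.solve(A, b)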
Instead of estimating the reward and transition models explicitly, we directly estimate the value function from samples and improve the policy.
We then alternate: estimate the value function, recover the greedy policy under that estimate, and repeat.
For offline reinforcement learning, we estimate the Q function instead of the value function, using state-action features, so the greedy policy can be recovered directly from the data.
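A sketch of the Q-function variant (often called LSTDQ), assuming state-action features phi(s, a) and next features chosen by the policy being evaluated; names are my own.

```python
import numpy as np

def lstdq_weights(Phi_sa, Phi_sa_next, rewards, gamma=0.99):
    """LSTDQ: the LSTD equation applied to state-action features, so that
    Q(s, a) ~= phi(s, a) @ w_Q.

    Phi_sa:      (N, d) features phi(s_i, a_i) for dataset transitions
    Phi_sa_next: (N, d) features phi(s'_i, pi(s'_i)), with the next action
                 chosen by the policy being evaluated
    rewards:     (N,)   sampled rewards
    """
    A = Phi_sa.T @ (Phi_sa - gamma * Phi_sa_next)
    b = Phi_sa.T @ rewards
    return np.linalg.solve(A, b)
```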
Featurizing actions in reinforcement learning.
Phi prime (the matrix of next state-action features) depends on the policy pi, so it must be recomputed whenever the policy changes.
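To make the dependence on the policy concrete, here is a least-squares policy iteration sketch under my own assumptions (discrete actions, a hypothetical feature function phi): Phi prime is rebuilt from phi(s', pi(s')) each time the greedy policy changes.

```python
import numpy as np

def lspi(phi, states_next, actions_all, Phi_sa, rewards,
         gamma=0.99, n_iters=20):
    """Sketch of least-squares policy iteration: because Phi' depends on
    the policy, phi(s', pi(s')) is rebuilt every time the policy changes.

    phi:         callable (s, a) -> (d,) feature vector (hypothetical)
    states_next: list of N successor states from the offline dataset
    actions_all: list of candidate actions (assumed discrete)
    Phi_sa:      (N, d) features phi(s_i, a_i) of dataset transitions
    rewards:     (N,)   sampled rewards
    """
    d = Phi_sa.shape[1]
    w_Q = np.zeros(d)
    for _ in range(n_iters):
        # Greedy policy under the current Q estimate.
        def pi(s):
            return max(actions_all, key=lambda a: phi(s, a) @ w_Q)
        # Rebuild Phi' using the current policy's action at each s'.
        Phi_sa_next = np.stack([phi(s, pi(s)) for s in states_next])
        # LSTDQ update for the new weights.
        A = Phi_sa.T @ (Phi_sa - gamma * Phi_sa_next)
        w_Q = np.linalg.solve(A, Phi_sa.T @ rewards)
    return w_Q
```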
The distributional shift problem in offline RL.