📚 Classic offline reinforcement learning methods predate deep RL techniques and provide a historical perspective on batch RL, another name for offline RL.
🔎 The lecture discusses importance sampling based methods for offline reinforcement learning.
📊 The problem with importance sampling is that the weights become degenerate as the number of probability ratios multiplied together grows, that is, as the horizon gets longer.
🔑 The variance of the importance sampling estimator can grow exponentially with the horizon, so exponentially many samples are needed for an accurate estimate.
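A minimal sketch of this weight degeneracy, under a toy setup that is not from the lecture: binary actions, a behavior policy that picks action 1 with probability 0.5, and a target policy that picks it with probability 0.7, both independent of state. The function names and the use of the Kish effective sample size as a diagnostic are illustrative assumptions.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def trajectory_weights(horizon, n_traj=10_000, p_behavior=0.5, p_target=0.7, seed=0):
    """Trajectory importance weights: the product of per-step probability ratios.

    Assumed toy setup: actions are sampled from the behavior policy,
    independently at every step, with no state dependence.
    """
    rng = np.random.default_rng(seed)
    took_action_1 = rng.random((n_traj, horizon)) < p_behavior
    ratios = np.where(
        took_action_1,
        p_target / p_behavior,              # ratio when action 1 was taken
        (1 - p_target) / (1 - p_behavior),  # ratio when action 0 was taken
    )
    return ratios.prod(axis=1)

# Effective sample size out of 10,000 trajectories, for growing horizons.
ess_by_horizon = {T: effective_sample_size(trajectory_weights(T)) for T in (1, 10, 50)}
for T, ess in ess_by_horizon.items():
    print(f"horizon={T:3d}  effective sample size ~ {ess:.1f}")
```

As the horizon grows, a handful of trajectories carry almost all of the weight, so the effective sample size collapses even though the number of trajectories is unchanged.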
💡 In practical offline reinforcement learning methods, importance sampling can be used by dropping the probability ratios for earlier time steps, provided the policy that generated the data and the policy being improved are similar.
🤔 The importance weight can be separated into two parts: the product over earlier time steps, which accounts for the difference in the probability of reaching a given state, and the product over later time steps, which accounts for the difference in the rewards that follow. Disregarding the first part can be a reasonable approximation if the policies are close enough.
📝 The importance weights are applied to the sum of rewards that follows each time step.
💡 Actions in the future don't affect rewards in the past.
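One way to see what this causality argument buys is to compare a trajectory-wise estimator, which multiplies the whole return by the product of all ratios, with a per-decision estimator, which weights each reward only by the ratios up to its own time step. The sketch below is an illustration with hypothetical function names; `ratios[t]` is assumed to hold the target-to-behavior action probability ratio at step `t`.

```python
import numpy as np

def trajectory_is_return(ratios, rewards, gamma=1.0):
    """Weight the whole discounted return by the product of all ratios."""
    ratios = np.asarray(ratios, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return ratios.prod() * (discounts * rewards).sum()

def per_decision_is_return(ratios, rewards, gamma=1.0):
    """Weight reward r_t only by ratios from steps 0..t.

    Later ratios are dropped because future actions cannot
    change a reward that has already been received.
    """
    ratios = np.asarray(ratios, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    cumulative_ratios = np.cumprod(ratios)
    return (discounts * cumulative_ratios * rewards).sum()

# Two-step example: the first reward is no longer scaled by the second ratio.
print(trajectory_is_return([2.0, 0.5], [1.0, 1.0]))    # 2.0
print(per_decision_is_return([2.0, 0.5], [1.0, 1.0]))  # 3.0
```

Both estimators still contain products of ratios that grow with the time index, so this reduces variance without removing the exponential blow-up.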
⭐ To avoid exponentially exploding importance weights, value function estimation is necessary.
📝 The video discusses the calculation of the product of rhos (importance ratios) and gammas (discount factors) in offline reinforcement learning.
🔁 A recursive equation, $\bar{V}^{T+1-t} = \rho_t \left(r_t + \gamma \bar{V}^{T-t}\right)$, is introduced as an importance sampling estimator of $V^{\pi_\theta}(s_0)$.
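A minimal sketch of how such a recursion can be evaluated on one logged trajectory, assuming it has the form "new estimate = rho_t * (r_t + gamma * previous estimate)" with an estimate of zero when no steps remain. The function name and the list-based inputs are illustrative assumptions.

```python
def recursive_is_estimate(ratios, rewards, gamma=1.0):
    """Run the recursion backward from the last step of one trajectory.

    ratios[t] is the target-to-behavior action probability ratio at step t,
    rewards[t] is the logged reward at step t.
    """
    v_bar = 0.0  # estimate with zero steps remaining
    for rho_t, r_t in zip(reversed(ratios), reversed(rewards)):
        v_bar = rho_t * (r_t + gamma * v_bar)
    return v_bar

# Unrolling the recursion gives the sum over t of
# gamma^t * (rho_0 * ... * rho_t) * r_t.
print(recursive_is_estimate([2.0, 0.5], [1.0, 1.0]))  # 3.0
```

Averaging this quantity over many logged trajectories gives the estimate of the value at the initial state.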
🔄 Doubly robust estimation is easier to derive in the case of a bandit problem and provides the intuition for the multi-step case.
📚 In the bandit case, importance sampling estimates the value by multiplying each observed reward by its importance weight.
💡 Using function approximation, a neural network can be trained to estimate the value function and improve the accuracy of the guess.
🔄 The doubly robust estimator combines importance weighted rewards with a learned value function approximation and its expected value under the policy to reduce variance.
📚 The video covers doubly robust off-policy value evaluation, a method for estimating policy values in reinforcement learning.
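A minimal sketch of the bandit-case doubly robust estimator, written here as V_hat + rho * (r - Q_hat(a)), where V_hat is the expectation of Q_hat under the target policy. The three-armed bandit, its reward means, the two policies, and the deliberately imperfect `q_hat` are all made-up values for illustration.

```python
import numpy as np

def is_estimate(rewards, ratios):
    """Plain importance sampling: the average of rho * r."""
    return np.mean(ratios * rewards)

def doubly_robust_estimate(actions, rewards, ratios, q_hat, target_probs):
    """Model-based guess plus an importance weighted correction of its error."""
    v_hat = np.dot(target_probs, q_hat)  # expected q_hat under the target policy
    return np.mean(v_hat + ratios * (rewards - q_hat[actions]))

rng = np.random.default_rng(0)
behavior_probs = np.array([0.5, 0.3, 0.2])  # policy that logged the data (assumed)
target_probs = np.array([0.1, 0.2, 0.7])    # policy being evaluated (assumed)
true_means = np.array([1.0, 2.0, 3.0])      # hypothetical mean reward per arm

actions = rng.choice(3, size=5000, p=behavior_probs)
rewards = true_means[actions] + rng.normal(0.0, 0.5, size=5000)
ratios = target_probs[actions] / behavior_probs[actions]

q_hat = true_means + np.array([0.2, -0.1, 0.3])  # deliberately imperfect model
true_value = target_probs @ true_means

v_is = is_estimate(rewards, ratios)
v_dr = doubly_robust_estimate(actions, rewards, ratios, q_hat, target_probs)
print(f"true={true_value:.3f}  importance sampling={v_is:.3f}  doubly robust={v_dr:.3f}")
```

If `q_hat` is all zeros the estimator reduces to plain importance sampling, and if `q_hat` is exact the correction term only has to account for reward noise, which is where the variance reduction comes from.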
🔍 Marginalized importance sampling applies importance weights to state or state-action probabilities instead of products of action probabilities, and can be used for off-policy evaluation.
💡 Determining the state or state action importance weights is the main challenge in this method.
📚 The state-action marginal weights are fit by solving an optimization problem using samples from the dataset.
🔄 The probability of seeing a state-action pair under the policy is determined by the probability of starting in that state-action pair and the probability of transitioning into it from another state.
⚙️ Solving for the state-action marginals involves a fixed point problem and can be expressed as an expected value under a certain distribution.
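A minimal tabular sketch of the fixed point behind the state-action marginals, assuming the dynamics are known. That is not the offline setting, where the weights must be fit from samples, but it makes the consistency condition concrete. The random MDP, both policies, and all names are made up for illustration.

```python
import numpy as np

def state_action_marginal(P, policy, p0, gamma, n_iters=500):
    """Iterate the consistency condition until it reaches its fixed point:

    d(s', a') = (1 - gamma) * p0(s') * pi(a'|s')
                + gamma * pi(a'|s') * sum_{s,a} P(s'|s,a) * d(s,a)
    """
    n_states, n_actions = policy.shape
    d = np.full((n_states, n_actions), 1.0 / (n_states * n_actions))
    for _ in range(n_iters):
        inflow = np.einsum("sap,sa->p", P, d)  # mass flowing into each s'
        d = policy * ((1.0 - gamma) * p0 + gamma * inflow)[:, None]
    return d

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
rewards = rng.random((n_states, n_actions))
p0 = np.full(n_states, 1.0 / n_states)
pi_behavior = np.full((n_states, n_actions), 0.5)  # assumed data-collecting policy
pi_target = np.tile([0.8, 0.2], (n_states, 1))     # assumed policy to evaluate

d_behavior = state_action_marginal(P, pi_behavior, p0, gamma)
d_target = state_action_marginal(P, pi_target, p0, gamma)
weights = d_target / d_behavior  # state-action importance weights

# Off-policy estimate from (s, a) pairs drawn from the behavior marginal.
flat_p = d_behavior.ravel() / d_behavior.sum()
idx = rng.choice(n_states * n_actions, size=50_000, p=flat_p)
value_mis = np.mean(weights.ravel()[idx] * rewards.ravel()[idx]) / (1.0 - gamma)

# Reference value from exact policy evaluation with the known model.
P_pi = np.einsum("sa,sap->sp", pi_target, P)
r_pi = (pi_target * rewards).sum(axis=1)
value_exact = p0 @ np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(f"marginalized IS estimate={value_mis:.3f}  exact={value_exact:.3f}")
```

Each sample is weighted by a single ratio rather than a product over time steps, which is why this approach avoids the exponential growth of trajectory-level weights.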