📚 Classic offline reinforcement learning methods predate deep RL techniques and provide a historical perspective on batch RL, also known as offline RL.
🔎 The lecture discusses importance sampling based methods for offline reinforcement learning.
📊 The problem with importance sampling is that the weights become degenerate as the number of probability ratios multiplied together grows with the horizon.
🔑 The variance of the importance sampling estimator can be exponentially large in the horizon, requiring exponentially many samples for an accurate estimate.
💡 In practical offline reinforcement learning methods, importance sampling can be used by dropping the probability terms for earlier time steps, provided the policy that generated the data and the policy being improved are similar.
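The exponential growth can be illustrated with a small calculation. The sketch below is not from the lecture; it assumes a simplified setting in which the per-step ratio rho = pi(a) / beta(a) is drawn independently at every step from the same two-action behavior policy, so the trajectory weight is a product of i.i.d. ratios with mean 1. The policies `pi` and `beta` and the function names are hypothetical.

```python
def per_step_second_moment(pi, beta):
    """E_beta[rho^2] for one step, where rho = pi(a) / beta(a)."""
    return sum(p * p / b for p, b in zip(pi, beta))


def trajectory_weight_variance(pi, beta, horizon):
    """Variance of the product of `horizon` i.i.d. per-step ratios.

    Each ratio has mean 1 under beta, so Var = (E[rho^2]) ** horizon - 1.
    Assumes beta gives every action nonzero probability.
    """
    return per_step_second_moment(pi, beta) ** horizon - 1.0


pi = [0.7, 0.3]    # hypothetical policy being evaluated
beta = [0.5, 0.5]  # hypothetical behavior policy that generated the data

for horizon in (1, 10, 50, 100):
    print(horizon, trajectory_weight_variance(pi, beta, horizon))
```

Under these assumptions the variance is zero when the two policies coincide and grows geometrically with the horizon otherwise, which is one way to see why the weights degenerate.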
🤔 The importance weight can be separated into two parts: one accounting for the difference in probability of reaching a given state and the other accounting for the difference in rewards. Disregarding the first part can be a reasonable approximation if the policies are close enough.
📝 In the estimator, the importance weight multiplies the sum of rewards.
💡 Actions in the future don't affect rewards in the past.
⭐ To avoid exponentially exploding importance weights, value function estimation is necessary.
📝 The video discusses the calculation of the product of rhos (importance ratios) and gammas (discount factors) in offline reinforcement learning.
🔁 A recursive equation, $\bar{V}^{T+1-t} = \rho_t \left(r_t + \gamma \bar{V}^{T-t}\right)$, is introduced; unrolling it to $\bar{V}^{T}$ gives an importance sampling estimator of $V^{\pi_\theta}(s_0)$.
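A minimal sketch of the recursion for a single trajectory, assuming it takes the form V_bar^{T+1-t} = rho_t * (r_t + gamma * V_bar^{T-t}) with V_bar^0 = 0 and rho_t = pi_theta(a_t | s_t) / pi_beta(a_t | s_t). The function names are hypothetical. The second function writes out the same quantity as a sum of discounted rewards, each weighted by the product of ratios up to its own time step, which shows where the product of rhos and gammas comes from.

```python
def recursive_is_estimate(rewards, rhos, gamma):
    """Run the recursion backward from the last time step to t = 0."""
    v_bar = 0.0  # V_bar^0: nothing left to collect after the final step
    for r, rho in zip(reversed(rewards), reversed(rhos)):
        v_bar = rho * (r + gamma * v_bar)
    return v_bar


def unrolled_is_estimate(rewards, rhos, gamma):
    """Same estimate as an explicit sum over time steps."""
    total, weight = 0.0, 1.0
    for t, (r, rho) in enumerate(zip(rewards, rhos)):
        weight *= rho  # product of ratios for steps 0..t only
        total += (gamma ** t) * weight * r
    return total


rewards = [1.0, 2.0]  # hypothetical rewards along one logged trajectory
rhos = [0.5, 2.0]     # hypothetical per-step importance ratios
print(recursive_is_estimate(rewards, rhos, 0.9))
print(unrolled_is_estimate(rewards, rhos, 0.9))
```

Both functions return the same number for any single trajectory; averaging over many logged trajectories would give the off-policy value estimate.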
🔄 Doubly robust estimation is easier to derive in the case of a bandit problem and provides the intuition for the multi-step case.
📚 In the bandit case, the importance sampling estimate of the value is obtained by multiplying each reward by its importance weight.
💡 Using function approximation, a neural network can be trained to estimate the value function and improve the accuracy of the guess.
🔄 The doubly robust estimator combines the importance-weighted reward with a learned value function estimate and its expected value under the evaluated policy in order to reduce variance.
📚 The video covers doubly robust off-policy value evaluation, a method for estimating policy values in reinforcement learning.
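For intuition, here is a sketch of the bandit case under a commonly used form of the doubly robust estimate, V_DR = V_hat + rho * (r - Q_hat(a)), where V_hat is the expectation of Q_hat under the evaluated policy. This form is an assumption, since the summary does not spell out the formula, and the names below are hypothetical. The example uses a context-free bandit with lists indexed by action.

```python
def doubly_robust_bandit(samples, pi, beta, q_hat):
    """Average doubly robust estimate over logged (action, reward) pairs.

    pi, beta: action probabilities of the evaluated and behavior policies.
    q_hat:    estimated reward per action (e.g. from a fitted model).
    """
    # V_hat: expected estimated reward under the evaluated policy
    v_hat = sum(p * q for p, q in zip(pi, q_hat))
    total = 0.0
    for a, r in samples:
        rho = pi[a] / beta[a]
        # The importance weight only corrects the model's residual error
        total += v_hat + rho * (r - q_hat[a])
    return total / len(samples)


pi = [0.7, 0.3]
beta = [0.5, 0.5]
samples = [(0, 1.0), (1, 3.0), (0, 1.0)]  # hypothetical logged data

print(doubly_robust_bandit(samples, pi, beta, q_hat=[1.0, 3.0]))  # accurate model
print(doubly_robust_bandit(samples, pi, beta, q_hat=[0.0, 0.0]))  # no model: plain IS
```

With an accurate `q_hat` the residual is zero for every sample, so the estimate no longer depends on the importance weights; with `q_hat` set to zero the expression reduces to ordinary importance sampling.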
🔍 Marginalized importance sampling allows for importance sampling with state probabilities, which can be used for off-policy evaluation.
💡 Determining the state or state action importance weights is the main challenge in this method.
📚 Offline Reinforcement Learning involves optimizing state-action marginals using samples from the data set.
🔄 The probability of seeing a state-action pair under the policy is determined by the probability of starting in that state-action pair and the probability of transitioning into it from another state.
⚙️ Solving for the state-action marginals involves a fixed point problem and can be expressed as an expected value under a certain distribution.
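The fixed point can be made concrete in a small tabular setting. The sketch below assumes the discounted state-action marginal d(s', a') = pi(a' | s') * ((1 - gamma) * p0(s') + gamma * sum over (s, a) of P(s' | s, a) * d(s, a)) and a known transition model, which is a simplification: in the offline setting the condition would be enforced with samples from the data set instead. All names and the toy MDP are hypothetical.

```python
def state_action_marginal(p0, P, pi, gamma, iters=2000):
    """Fixed-point iteration for the discounted state-action marginal.

    p0[s]: initial state distribution, P[s][a][s2]: transition probabilities,
    pi[s][a]: policy. Returns d[s][a], which sums to 1.
    """
    n_s, n_a = len(p0), len(pi[0])
    d = [[1.0 / (n_s * n_a)] * n_a for _ in range(n_s)]
    for _ in range(iters):
        # Probability mass flowing into each next state from every (s, a)
        inflow = [
            sum(P[s][a][s2] * d[s][a] for s in range(n_s) for a in range(n_a))
            for s2 in range(n_s)
        ]
        d = [
            [pi[s2][a2] * ((1 - gamma) * p0[s2] + gamma * inflow[s2])
             for a2 in range(n_a)]
            for s2 in range(n_s)
        ]
    return d


# Hypothetical two-state, two-action MDP
p0 = [0.5, 0.5]
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
pi = [[0.7, 0.3], [0.4, 0.6]]    # policy being evaluated
beta = [[0.5, 0.5], [0.5, 0.5]]  # behavior policy

d_pi = state_action_marginal(p0, P, pi, gamma=0.9)
d_beta = state_action_marginal(p0, P, beta, gamma=0.9)
# State-action importance weights w(s, a) = d_pi(s, a) / d_beta(s, a)
w = [[d_pi[s][a] / d_beta[s][a] for a in range(2)] for s in range(2)]
print(w)
```

The weights `w` take the place of the per-trajectory product of ratios: there is one weight per state-action pair, so nothing is multiplied along the horizon.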