Week 1: An Introduction to Sequential Decision-Making

This video gives a quick overview of the problem definition: https://www.youtube.com/watch?v=9pZv3-6EUq8

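Each of these 5 slot machines has its own reward distribution, but we do not know what those distributions are.

If we did know them, we could simply choose machine 5 every time.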

Note that if the reward distributions change over time, the problem is called nonstationary. This is the common case in real-world problems.

Some definitions:

$A_t$: Action at time $t$
$R_t$: Reward at time $t$
$q_\star(a)$: Value of action $a$, defined as the expected reward given that $a$ is selected: $q_\star(a) \doteq \mathbb{E}[R_t \mid A_t=a]$
$Q_t(a)$: Estimated value of action $a$ at time $t$.
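
To make the estimate $Q_t(a)$ concrete, one natural choice is the sample average: the mean of the rewards received on the steps before $t$ where $a$ was chosen. The notation below is a sketch under that assumption. The indicator $\mathbb{1}_{A_i=a}$ is introduced here for illustration, and when the denominator is zero $Q_t(a)$ is set to some default such as $0$.

$$
Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \, \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}
$$

For a single action, writing $R_n$ for the reward received the $n$-th time it is selected and $Q_n$ for the estimate before that selection, the same average can be computed incrementally without storing past rewards:

$$
Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)
$$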

Pseudocode for sample averages and $\varepsilon$-greedy action selection
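
The sample-average and $\varepsilon$-greedy procedure named above can be sketched in Python roughly as follows. This is a minimal illustration, not the course's own pseudocode: the function name `epsilon_greedy_bandit`, the Gaussian reward model, the `reward_std` parameter, and the example means are all assumptions made for the sketch.

```python
import random


def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1,
                          reward_std=1.0, seed=0):
    """Epsilon-greedy action selection with sample-average value estimates.

    true_means plays the role of q_star(a). It is hidden from the agent and
    only used to draw rewards (assumed Gaussian here, for illustration).
    """
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k  # Q_t(a): current estimate for each action
    N = [0] * k    # number of times each action has been selected

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any action uniformly at random.
            a = rng.randrange(k)
        else:
            # Exploit: pick a greedy action, breaking ties at random.
            best = max(Q)
            a = rng.choice([i for i, q in enumerate(Q) if q == best])

        # Sample R_t from the (unknown to the agent) reward distribution.
        r = rng.gauss(true_means[a], reward_std)

        # Incremental sample-average update: Q <- Q + (R - Q) / N.
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]

    return Q, N


if __name__ == "__main__":
    # Hypothetical means for 5 machines; machine 5 (index 4) is the best.
    Q, N = epsilon_greedy_bandit([0.1, 0.5, 1.0, 1.5, 2.0], steps=5000)
    print("estimates:", [round(q, 2) for q in Q])
    print("pull counts:", N)
```

With `epsilon=0` the agent never explores, so it can lock onto whichever machine it happens to try first. A small positive `epsilon` keeps every estimate improving, at the cost of occasionally pulling a machine known to be worse.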