Post on 11-Apr-2017
Multi Armed Bandit Algorithms
By Shrinivas Vasala
Overview
- K Slot Machines
- Multi-Armed Bandit Problem
- A/B Testing
- MAB Algorithms
- Summary
K Slot Machines
- Choose a machine and receive a reward
- T turns (chances)
- What will be your goal? Maximize the cumulative reward
- How do you choose the machines (arms)?
Multi Armed Bandit Problem (MAB)
- Goal: two-fold
- Try different arms (Exploration)
- Play the seemingly most rewarding arm (Exploitation)
- Explore-Exploit trade-off: the core of Multi-Armed Bandit algorithms
- Reward distributions (unknown)
- Mean rewards: <µ1, . . . , µK>
- Standard deviations: <σ1, . . . , σK>
- Regret (minimize): maximizing cumulative reward = minimizing regret
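The regret expression on this slide did not survive the export; a standard definition, stated using the mean rewards µ1, . . . , µK above, is:

```latex
% Regret after T turns: shortfall versus always playing the best arm
\[
  R_T = T\,\mu^{*} \;-\; \sum_{t=1}^{T} \mathbb{E}\!\left[\mu_{a_t}\right],
  \qquad \mu^{*} = \max_{1 \le i \le K} \mu_i ,
\]
% where $a_t$ is the arm played at turn $t$; maximizing cumulative
% reward is then equivalent to minimizing $R_T$.
```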
A/B Testing
- Advertisement selection for a request from a pool of advertisements
- Rewards: CTR/AR or CPM
- Recommendation of news articles to users
- Product pricing and promotional offers
- MAB is used to measure the performance of A/B testing experiments
MAB Algorithms
- Epsilon-greedy
- Softmax
- Pursuit
- Upper Confidence Bound (UCB1)
- UCB1-Tuned
Epsilon-greedy Algorithm
- Choose epsilon (ε): the exploration factor
- Play the best arm with probability (1 − ε): Exploitation
- Play a random arm with probability ε: Exploration
Note: a typical value is ε = 0.10 (10%)
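The steps above fit in a few lines of Python. The state layout (parallel `counts`/`values` lists holding pull counts and running mean rewards) is an assumption for illustration, not from the slides:

```python
import random

def epsilon_greedy(counts, values, epsilon=0.10):
    """Pick an arm: exploit the best-looking arm with probability 1 - epsilon,
    otherwise explore a uniformly random arm."""
    if random.random() < epsilon:
        return random.randrange(len(values))             # exploration
    return max(range(len(values)), key=lambda i: values[i])  # exploitation

def update(counts, values, arm, reward):
    """Incrementally update the running mean reward of the chosen arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```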
Softmax Algorithm
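This slide's formula is missing from the export. The usual softmax (Boltzmann) rule picks arm i with probability proportional to exp(µ̂i / τ); a minimal sketch, where the temperature parameter `temperature` and its default are assumptions:

```python
import math
import random

def softmax_select(values, temperature=0.1):
    """Sample an arm with probability proportional to exp(mean_reward / tau).
    Lower temperature -> greedier; higher temperature -> closer to uniform."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    r, acc = random.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(values) - 1  # guard against floating-point rounding
```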
Pursuit Algorithm
(Figure: probability-update rule, with Exploitation and Exploration terms)
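The update rule itself is not in the export. A common pursuit scheme keeps explicit selection probabilities and nudges them toward the greedy arm by a learning rate β each round; the name `beta` and its default are assumptions:

```python
import random

def pursuit_update(probs, values, beta=0.05):
    """Pursuit: pull the best arm's selection probability toward 1
    and every other arm's toward 0, at rate beta."""
    best = max(range(len(values)), key=lambda i: values[i])
    for i in range(len(probs)):
        target = 1.0 if i == best else 0.0
        probs[i] += beta * (target - probs[i])

def pursuit_select(probs):
    """Sample an arm according to the pursuit probabilities."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```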
Upper Confidence Bound 1 (UCB1)
- At each iteration, choose the arm with the maximum UCB1 score: its empirical mean reward (Exploitation) plus a confidence term (Exploration).
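Using the standard UCB1 score, mean reward plus sqrt(2 ln n / n_i), which matches the Exploitation/Exploration split on the slide, the selection step can be sketched as:

```python
import math

def ucb1_select(counts, values):
    """UCB1: play each arm once, then pick the arm maximising
    mean_reward + sqrt(2 * ln(n) / n_i)."""
    n = sum(counts)
    for i, c in enumerate(counts):
        if c == 0:          # every arm must be tried once first
            return i
    return max(range(len(counts)),
               key=lambda i: values[i] + math.sqrt(2.0 * math.log(n) / counts[i]))
```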
UCB1- Tuned
(Figure: score = Exploitation term + Exploration term, with the exploration width scaled by the variance of the reward)
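A sketch of the UCB1-Tuned score, where the exploration width is scaled by a per-arm variance bound V_i capped at 1/4 (the maximum variance of a [0, 1] reward). Tracking the running mean of squared rewards in `sq_values` is an assumed state layout:

```python
import math

def ucb1_tuned_select(counts, values, sq_values):
    """UCB1-Tuned: like UCB1, but scale the exploration term by an
    upper bound on each arm's reward variance, capped at 1/4."""
    n = sum(counts)
    for i, c in enumerate(counts):
        if c == 0:          # every arm must be tried once first
            return i
    def score(i):
        mean = values[i]
        var = sq_values[i] - mean * mean                 # empirical variance
        v = var + math.sqrt(2.0 * math.log(n) / counts[i])
        return mean + math.sqrt((math.log(n) / counts[i]) * min(0.25, v))
    return max(range(len(counts)), key=score)
```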
Advanced Bandits
- Adversarial Bandits
- Contextual Bandits
- Infinite-Armed Bandits
- Thompson Sampling
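These variants are only named on the slide. As a taste of the last one, Beta-Bernoulli Thompson sampling (binary 0/1 rewards assumed) fits in a few lines:

```python
import random

def thompson_select(successes, failures):
    """Thompson sampling for Bernoulli rewards: draw one sample from each
    arm's Beta(successes + 1, failures + 1) posterior, play the highest."""
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])
```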
Summary
- Each algorithm has an upper bound on regret
- The bound is a function of the arms' reward distributions
- Each algorithm has a tuning parameter
- Parameter tuning depends on the reward function
- Choose the right MAB algorithm based on simulations/historical data
- All these algorithms keep learning automatically over their lifetime
Thank You