Odds & Ends


Page 1: Odds & Ends

Odds & Ends

Page 2: Odds & Ends

Administrivia

•Reminder: Q3 Nov 10

•CS outreach:

•UNM SOE holding open house for HS seniors

•Want CS dept participation

•We want to show off the coolest things in CS

•Come demo your P1 and P2 code!

•Contact me or Lynne Jacobson

Page 3: Odds & Ends

The bird of time...

•Last time:

•Eligibility traces

•The SARSA(λ) algorithm

•Design exercise

•This time:

•Tip o’ the day

•Notes on exploration

•Design exercise, cont’d.

Page 4: Odds & Ends

Tip o’ the day

•Micro-experiments

•Often, often, often when hacking:

•“How the heck does that function work?”

•“The docs don’t say what happens when you hand null to the constructor...”

•“Uhhh... Will this work if I do it this way?”

•“WTF does that mean?”

•Could spend a bunch of time in the docs

•Or...

•Could just go and try it

Page 5: Odds & Ends

Tip o’ the day

•Answer: micro-experiments

•Write a very small (<50-line) test program to make sure you understand what the thing does (example sketch below)

•Think: homework assignment from CS152

•Quick to write

•Answers question better than docs can

•Builds your intuition about what the machine is doing

•Using the debugger to watch is also good
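As a concrete (and purely hypothetical) instance of the pattern, the little program below answers two doc-ish questions in one compile-and-run: whether Integer.parseInt tolerates surrounding whitespace, and what HashMap.get returns for a key that was never inserted. The class name MicroExperiment is just a scratch name.

import java.util.HashMap;
import java.util.Map;

public class MicroExperiment {
    public static void main(String[] args) {
        // Question 1: does parseInt trim whitespace for me?
        try {
            System.out.println(Integer.parseInt(" 42 "));
        } catch (NumberFormatException e) {
            System.out.println("Nope: " + e);   // it does not trim
        }

        // Question 2: what does get() return for a key that was never put()?
        Map<String, Double> q = new HashMap<>();
        System.out.println(q.get("never-inserted"));   // prints null
    }
}

Two minutes to write, and the answer sticks better than anything skimmed in the Javadoc.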

Page 6: Odds & Ends

Action selection in RL

Page 7: Odds & Ends

Q learning in code...

public class MyAgent implements Agent {

    public void updateModel(SARSTuple s) {
        State2d start = s.getInitState();
        State2d end = s.getNextState();
        Action act = s.getAction();
        double r = s.getReward();

        // Greedy one-step lookahead: best known action from the next state
        Action nextAct = _policy.argmaxAct(end);

        double Qnow = _policy.get(start, act);
        double Qnext = _policy.get(end, nextAct);

        // Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);

        _policy.set(start, act, Qrevised);
    }
}
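Both this agent and the SARSA(λ) agent on the next slide lean on a _policy object with get, set, and argmaxAct, which the slides never show. Below is a minimal tabular sketch of what such a class could look like; the name QTable, the nested-map layout, and the assumption that State2d and Action have sensible equals/hashCode are my own choices, not the course's.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical tabular Q-function: one way the _policy object above could be backed.
public class QTable {
    // Q(s,a), stored as state -> (action -> value); missing entries default to 0.0
    private final Map<State2d, Map<Action, Double>> _q = new HashMap<>();
    private final Set<Action> _allActions;

    public QTable(Set<Action> allActions) { _allActions = allActions; }

    public double get(State2d s, Action a) {
        Map<Action, Double> row = _q.get(s);
        if (row == null) { return 0.0; }
        Double v = row.get(a);
        return (v == null) ? 0.0 : v;
    }

    public void set(State2d s, Action a, double val) {
        _q.computeIfAbsent(s, k -> new HashMap<>()).put(a, val);
    }

    // Greedy action: the a maximizing Q(s,a); ties broken by iteration order
    public Action argmaxAct(State2d s) {
        Action best = null;
        double bestVal = Double.NEGATIVE_INFINITY;
        for (Action a : _allActions) {
            double v = get(s, a);
            if (best == null || v > bestVal) {
                best = a;
                bestVal = v;
            }
        }
        return best;
    }
}

Defaulting missing entries to 0.0 is a common, simple initialization for tabular Q values.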

Page 8: Odds & Ends

The SARSA(λ) code

public class SARSAlAgent implements Agent {

    public void updateModel(SARSTuple s) {
        State2d start = s.getInitState();
        State2d end = s.getNextState();
        Action act = s.getAction();
        double r = s.getReward();

        // On-policy lookahead: the action the agent would actually take next
        Action nextAct = pickAction(end);

        double Qnow = _policy.get(start, act);
        double Qnext = _policy.get(end, nextAct);
        double delta = r + _gamma * Qnext - Qnow;

        // Bump the eligibility of the pair we just visited
        setElig(start, act, getElig(start, act) + 1.0);

        // Update every eligible pair in proportion to its trace, then decay its trace
        for (SAPair p : getEligiblePairs()) {
            double currQ = _policy.get(p.getS(), p.getA());
            _policy.set(p.getS(), p.getA(),
                        currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
            setElig(p.getS(), p.getA(),
                    getElig(p.getS(), p.getA()) * _gamma * _lambda);
        }
    }
}

Page 9: Odds & Ends

Q & SARSA(λ): Key diffs

•Use of eligibility traces

•Q-learning updates only a single step of history

•SARSA(λ) keeps record of visited state/action pairs: e(s,a)

•Updates Q(s,a) value in proportion to e(s,a)

•Decays e(s,a) by γλ each step (as in the SARSA(λ) code; a trace-table sketch follows)
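The SARSA(λ) code earlier calls getElig, setElig, and getEligiblePairs but never defines them. One plausible backing for that bookkeeping is sketched below, assuming an SAPair(s, a) constructor and sensible equals/hashCode on SAPair; the class name EligTraces and the pruning threshold are my own choices, not from the slides.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical eligibility-trace bookkeeping for the SARSA(lambda) agent.
public class EligTraces {
    // e(s,a); pairs absent from the map have trace 0.0
    private final Map<SAPair, Double> _elig = new HashMap<>();

    public double getElig(State2d s, Action a) {
        Double e = _elig.get(new SAPair(s, a));
        return (e == null) ? 0.0 : e;
    }

    public void setElig(State2d s, Action a, double val) {
        SAPair key = new SAPair(s, a);
        if (val < 1e-6) {
            _elig.remove(key);   // prune near-zero traces so the update loop stays short
        } else {
            _elig.put(key, val);
        }
    }

    public Set<SAPair> getEligiblePairs() {
        // return a copy so callers can call setElig() while iterating
        return new HashSet<>(_elig.keySet());
    }
}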

Page 10: Odds & Ends

Q & SARSA(λ): Key diffs

•How “next state” action is picked

•Q: nextAct=_policy.argmaxAct(end)

•Picks the “best” (highest-Q) action at the next state

•SARSA: nextAct=RLAgent.pickAction(end)

•Picks the action the agent would actually take at the next state

•Huh? What’s the difference?

Page 11: Odds & Ends

Exploration vs. exploitation

•Sometimes, the agent wants to do something other than the “best currently known action”

•Why?

•If agent never tries anything new, it may never discover that there’s a better answer out there...

•Called the “exploration vs. exploitation” tradeoff

•Is it better to “explore” to find new stuff, or to “exploit” what you already know?

Page 12: Odds & Ends

ε-Greedy exploration

•Answer:

•“Most of the time” do the best known thing

•act = argmax_a Q(s,a)

•“Rarely” try something random

•act=pickAtRandom(allActionSet)

•ε-greedy exploration policies:

•“rarely”==prob ε

•“most of the time”==prob 1-ε

Page 13: Odds & Ends

ε-Greedy in code

public class eGreedyAgent implements RLAgent {

    // implements the ε-greedy exploration policy
    public Action pickAction(State2d s) {
        final double rVal = _rand.nextDouble();
        if (rVal < _epsilon) {
            // with probability epsilon: explore, pick a uniformly random action
            return randPick(_ASet);
        }
        // with probability 1 - epsilon: exploit, pick the greedy action
        return _policy.argmaxAct(s);
    }

    private final Set<Action> _ASet;
    private final double _epsilon;
}

Page 14: Odds & Ends

Design Exercise: Experimental Rig

Page 15: Odds & Ends

Design exercise

•For M4/Rollout, need to be able to:

•Train agent for many trials/steps per trial

•Generate learning curves for agent’s learning

•Run some trials w/ learning turned on

•Freeze learning

•Run some trials w/ learning turned off

•Average steps-to-goal over those trials

•Save average as one point in curve

•Design: objects/methods to support this learning framework (a rough sketch follows)

•Support: diff learning algs, diff environments, diff params, variable # of trials/steps, etc.
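One possible shape for such a rig is sketched below. Every name here (ExperimentRig, Environment, LearningAgent, setLearning, runTrial) is a placeholder of mine; the slides only pose the design question, so treat this as one candidate answer rather than the intended one.

import java.util.ArrayList;
import java.util.List;

// Hypothetical experiment rig: trains for a while, freezes learning, measures
// average steps-to-goal over evaluation trials, and records one curve point.
public class ExperimentRig {

    // Assumed interfaces -- placeholder names, not from the slides.
    public interface Environment {
        /** Run one trial with the given agent; return steps taken to reach the goal. */
        int runTrial(LearningAgent agent, int maxStepsPerTrial);
    }

    public interface LearningAgent {
        /** Turn learning updates on or off (freeze/unfreeze). */
        void setLearning(boolean on);
    }

    public List<Double> learningCurve(LearningAgent agent, Environment env,
                                      int numPoints, int trainTrialsPerPoint,
                                      int evalTrialsPerPoint, int maxStepsPerTrial) {
        List<Double> curve = new ArrayList<>();
        for (int point = 0; point < numPoints; point++) {
            // 1. Train with learning turned on
            agent.setLearning(true);
            for (int t = 0; t < trainTrialsPerPoint; t++) {
                env.runTrial(agent, maxStepsPerTrial);
            }
            // 2. Freeze learning and evaluate
            agent.setLearning(false);
            double totalSteps = 0.0;
            for (int t = 0; t < evalTrialsPerPoint; t++) {
                totalSteps += env.runTrial(agent, maxStepsPerTrial);
            }
            // 3. Average steps-to-goal becomes one point on the learning curve
            curve.add(totalSteps / evalTrialsPerPoint);
        }
        return curve;
    }
}

Keeping the rig parameterized on an Environment and a LearningAgent interface is what lets the same loop drive different learning algorithms, different environments, and different parameter settings.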