<html>

<BODY>
<index name="outlier detection">
<h1>orngReinforcement: reinforcement learning</h1>

<p>Reinforcement learning is a machine learning method in which an agent (both
learner and actor) acts with the goal of maximising reward.</p>
<hr>

<H2>RLSarsa</H2>

<index name="classes/RLSarsa">

<p>Class RLSarsa is an implementation of linear, gradient-descent
Sarsa(lambda) with tile coding. The implementation closely follows the boxed
algorithm in Figure 8.8 on page 212 of Sutton and Barto, 1998. It is derived
from the mountain car example in the book.</p>

<p>This implementation tries to maximise reward, so the user has to devise a
reward scheme that rewards the desired behaviour. Actions are integers starting with zero.</p>

<p>The user also has to encode the state properly. A state is a list of
real numbers, the state variables. The technique used for
discretization and generalization is tile coding.</p>

<p>We can visualise each 2D tiling as a mesh of cells. Within each tiling the
state is transformed to the index of a cell in the mesh: this achieves
discretization. Because the meshes (tilings) are offset from one another, the
same state can map to a different cell index in each tiling. This is the basis
for generalization over states. The greater the number of tilings, the greater
the precision and the power of generalization.</p>

<p>When cell indexes in the mesh are calculated, the width of a cell is always
1. The number of cells in any direction of the mesh is not specified explicitly anywhere in the state.
The resolution is therefore set by how the user scales each state variable.</p>
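<p>The idea of several offset meshes with cells of width 1 can be sketched as
follows. This is a minimal illustration, not the code used by RLSarsa: the
function name <CODE>tile_indices</CODE> and the choice of shifting tiling
<CODE>t</CODE> by <CODE>t / num_tilings</CODE> along every axis are assumptions
made for the example, and the hashing into a fixed-size memory is left out.</p>

```python
import math

def tile_indices(state, num_tilings):
    """Return one cell per tiling for the given state.

    Assumption: tiling t is shifted by t / num_tilings along every axis
    and every cell has width 1.
    """
    cells = []
    for t in range(num_tilings):
        offset = float(t) / num_tilings
        # Flooring the shifted coordinate gives the cell index on each axis.
        cell = tuple(int(math.floor(x + offset)) for x in state)
        cells.append((t,) + cell)  # tag the cell with its tiling number
    return cells

near = tile_indices([0.30, 1.20], 4)
close = tile_indices([0.60, 1.20], 4)
far = tile_indices([3.70, 1.20], 4)

# Nearby states share most of their cells, distant states share none.
shared_close = len(set(near) & set(close))
shared_far = len(set(near) & set(far))
```

<p>Each state activates exactly one cell per tiling. In this sketch the two
nearby states share three of their four cells and the distant state shares
none, which is what lets learning about one state carry over to its
neighbours.</p>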
<p>A simple example shows how to transform a given property into a state
variable. Say we want to divide a state variable describing the speed of a
trolley into 10 subintervals, and we are only interested in speeds between -0.5 and 0.5. Before
conversion we have to limit the property to that interval. Then we
divide the limited value by the width of a subinterval. In our case the width is
(0.5-(-0.5))/10 = 0.1. Thus, we describe speed with a state variable between -5 and 5.</p>
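<p>The two steps of the trolley example, limiting and then dividing by the
subinterval width, might look like this in code. The helper name
<CODE>to_state_variable</CODE> is hypothetical and is not part of
orngReinforcement.</p>

```python
def to_state_variable(value, low, high, subintervals):
    """Limit value to [low, high], then express it in subinterval widths."""
    width = (high - low) / float(subintervals)
    limited = min(max(value, low), high)
    return limited / width

# A speed outside the interval of interest is limited to 0.5 first.
speed_fast = to_state_variable(0.8, -0.5, 0.5, 10)
# A speed inside the interval is only rescaled.
speed_slow = to_state_variable(-0.23, -0.5, 0.5, 10)
```

<p>The results stay between -5 and 5, so with cells of width 1 the speed falls
into one of 10 subintervals.</p>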
<p>Using more subintervals for a variable greatly improves precision. Keep in mind
that doubling the number of subintervals for one variable also
doubles memory requirements and increases learning time, because it makes
generalization harder. Doubling the number of subintervals for
four variables means 16 times greater memory consumption.</p>
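<p>The growth can be checked with a little arithmetic. The sketch below counts
the cells of a single mesh as a rough proxy for memory use. The function
<CODE>num_cells</CODE> is illustrative only and says nothing about how RLSarsa
actually stores its weights.</p>

```python
def num_cells(subintervals_per_variable):
    """Number of cells in one mesh: the product of subintervals per axis."""
    total = 1
    for n in subintervals_per_variable:
        total *= n
    return total

base = num_cells([10, 10, 10, 10])
one_doubled = num_cells([20, 10, 10, 10])
all_doubled = num_cells([20, 20, 20, 20])

ratio_one = one_doubled // base   # doubling one variable
ratio_all = all_doubled // base   # doubling all four variables
```

<p>Doubling one variable doubles the cell count, and doubling all four
multiplies it by 2*2*2*2 = 16.</p>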
<P class=section>Methods</P>
<DL class=attributes>
<DT>__init__(numActions, numTilings, memorySize=1000, maxtraces=100,
mintracevalue=0.01, rand=random.Random(0))</DT>
<DD>Constructor. Before using <CODE>decide</CODE>, call <CODE>init</CODE>.</DD>
<DT>init(state)</DT>
<DD>Initializes a new episode. Returns the first action.</DD>
<DT>decide(reward, state)</DT>
<DD>Returns a state-dependent action. Also updates the learned "knowledge" according
to the given reward.</DD>
</DL>

<p class=section>Attributes</P>
<DL class=attributes>
<DT>epsilon</DT>
<DD>Probability of taking a random action (exploring). Default 0.05.</DD>
<DT>alpha</DT>
<DD>Step-size parameter for learning. Default 0.5.</DD>
<DT>lambda1</DT>
<DD>Trace decay factor. Default 0.9.</DD>
<DT>gamma</DT>
<DD>Discount factor for expected future rewards, reflecting that immediate rewards are worth
more. Default 0.97.</DD>
</DL>
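<p>To show where the four attributes enter the algorithm, here is a minimal
sketch of one step of linear Sarsa(lambda) with binary features and replacing
traces, in the spirit of the boxed algorithm in the book. It is not RLSarsa's
source code: the function names, the use of dictionaries for weights and
traces, and the feature labels are all assumptions made for the example, and
details such as trace pruning or any scaling of the step size are omitted.</p>

```python
import random

def q_value(weights, features):
    """Action value: sum of the weights of the active binary features."""
    return sum(weights.get(f, 0.0) for f in features)

def choose_action(weights, features_for, num_actions, epsilon, rand):
    """With probability epsilon explore, otherwise act greedily."""
    if rand.random() < epsilon:
        return rand.randrange(num_actions)
    values = [q_value(weights, features_for(a)) for a in range(num_actions)]
    return values.index(max(values))

def sarsa_lambda_step(weights, traces, features, reward, next_features,
                      alpha, gamma, lambda1):
    """Update weights after taking an action and observing a reward."""
    for f in features:
        traces[f] = 1.0  # replacing traces for the active features
    # Temporal-difference error; gamma discounts the next action value.
    delta = (reward + gamma * q_value(weights, next_features)
             - q_value(weights, features))
    for f in list(traces):
        weights[f] = weights.get(f, 0.0) + alpha * delta * traces[f]
        traces[f] *= gamma * lambda1  # traces fade by gamma * lambda1

# Hypothetical feature labels: one feature per (state, action) pair.
features_for = lambda a: [("s0", a)]
weights, traces = {}, {}
sarsa_lambda_step(weights, traces, features_for(0), 1.0, [],
                  alpha=0.5, gamma=0.97, lambda1=0.9)
greedy = choose_action(weights, features_for, 2, 0.0, random.Random(0))
```

<p>After a single reward of 1 for action 0, the weight of its feature moves
from 0 to 0.5 (the step size times the error), and with
<CODE>epsilon</CODE> set to 0 the greedy choice in that state is action 0.</p>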
<H2>Examples</H2>

<p>In the following example an agent tries to learn that taking
action 0 in state (0,0) and action 1 in state (1,1) maximises
reward.</p>
<XMP class=code>import orngReinforcement

r = orngReinforcement.RLSarsa(2,1)

ans = r.init([0,0]) # initialize episode

# if state is (0,0), act 0
for i in range(10):
  if ans == 0: reward = 1
  else: reward = 0
  ans = r.decide(reward,[0,0])

# if state is (1,1), act 1
for i in range(10):
  if ans == 1: reward = 1
  else: reward = 0
  ans = r.decide(reward,[1,1])

r.epsilon = 0.0 # no random action
r.alpha = 0.0 # stop learning

print "in (0,0) do", r.decide(0,[0,0]) # should output 0
print "in (1,1) do", r.decide(0,[1,1]) # should output 1
</XMP>

</BODY>
</html>