## Machine-learning on-the-fly: online learning for Orange

### Machine-learning on-the-fly: online learning for Orange

Hi all,

Based on the discussion here (http://orange.biolab.si/forum/viewtopic.php?f=14&t=1506), I have prepared a sample GSoC proposal with some details on how I plan to go about implementing incremental learners in Orange. I invite you all to evaluate and comment on this proposal.

**********************************************************************************************************************

Abstract:

Implement learning algorithms designed for: 1) continuously processing a stream of data, and 2) adaptive sampling of the search space, balancing exploration of the search space against exploitation of the learnt model.

Applications: Where would a user use these?

Continuous data processing is useful for tasks such as real-time sentiment analysis of Twitter / news feeds, learning the parameters of a robot control system, or link-state based adaptive routing protocols. Adaptive sampling of the search space is a closely related task that often surfaces when real-time learning is involved. Essentially, the situation is this: the learning agent has to choose between various available "actions". The result of each action is fed back to the agent in the form of a reward. Some actions are better than others, and the agent learns this over time. Trying out an action incurs some cost and may result in a reward, depending on the action. The more actions the agent tries, the better its knowledge, which can later be exploited.

An example scenario where this occurs is in a link-state based adaptive routing algorithm. The routes are maintained based on knowledge of which links work well. However over time, new nodes may join the network, creating new links with shorter routes, or existing links may die. Discovering new links entails an exploration of the search space, which needs to be balanced against the need to recommend routes that are known to work well.

Another example arises in resource allocation: suppose you have N projects to fund and K dollars available for funding. For each quantum of money invested, a project generates some amount of returns, based on an underlying stochastic process. Given a budget and the past history of returns, in what order should you fund the projects to maximize the expected returns?

Algorithms designed for the above situations:

There are two algorithms that I propose to implement, both specifically designed for the above situations:

a) Computation of the Gittins index, which gives an optimal solution to the multi-armed bandit problem: http://en.wikipedia.org/wiki/Multi-armed_bandit

b) Value iteration/policy iteration: http://en.wikipedia.org/wiki/Markov_dec ... _iteration

Quoting from Wikipedia, "In statistics, particularly in the design of sequential experiments, a multi-armed bandit takes its name from a traditional slot machine (one-armed bandit). Multiple levers are considered in the motivating applications in statistics. When pulled, each lever provides a reward drawn from a distribution associated with that specific lever. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls."

The Gittins index is computed for each arm of the multi-armed bandit, based on the past history of pulls and returns. Under certain assumptions, simply pulling the arm with the highest index value is a provably optimal (in a well-defined statistical sense) strategy [1].

Value iteration and policy iteration, on the other hand, are dynamic programming solutions to a Markov decision process. Both are iterative algorithms which, given a transition model, output a value function / policy that determines which action to take in a given state.

Value iteration involves the computation of a "value function", which outputs a value for a given state. The optimal policy consists of choosing the action that maximises the sum of the immediate reward and the value of the state you land in after taking that action.

Policy iteration is closely related to value iteration: here, the value function computations are implicit, and the output of policy iteration is simply the recommended action for every given state.
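As an illustration of the procedure described above, here is a minimal standalone sketch of value iteration with greedy policy extraction, written in NumPy. The function name, argument shapes, and the toy MDP are my own illustrative choices, not part of the proposed Orange API:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Compute the optimal value function and a greedy policy.

    P: array of shape (A, S, S); P[a, s, s'] = transition probability.
    R: array of shape (S,); reward received in each state.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[a, s] = immediate reward + discounted expected value of next state
        Q = R[None, :] + gamma * P @ V          # shape (A, S)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=0)                   # greedy action per state
    return V, policy

# Two-state, two-action toy MDP: action 1 reliably steers toward the
# rewarding state 1, while action 0 tends to keep you in state 0.
P = np.array([[[0.9, 0.1], [0.9, 0.1]],
              [[0.1, 0.9], [0.1, 0.9]]])
R = np.array([0.0, 1.0])
V, policy = value_iteration(P, R)
print(policy)  # action 1 is optimal in both states
```

The inner loop is a direct transcription of the Bellman optimality update; convergence is guaranteed for gamma < 1 because the update is a contraction.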

Existing MDP-solver implementation:

Kevin Murphy has a MATLAB toolbox of MDP solvers. The page also provides a good overview of the MDP literature: http://www.cs.ubc.ca/~murphyk/Software/MDP/mdp.html

API design and integration with Orange:

These algorithms will require a few additional APIs. Firstly, we store the data stream as a list (with custom-designed elements). As data flows in, we simply append to the list. The learner maintains a pointer to the last-read list location.

APIs for Gittins index-based learner:

gi = orange.Gittins(data) ---> train the Gittins index based learner on existing data. We can call this function repeatedly; the learner identifies the newest data and trains on it.

gi.recommend() ---> recommends which arm to pull, based on current index values

gi.index(i) ----> outputs the index of arm i
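To make the proposed interface concrete, here is a minimal standalone sketch of such a learner. Note that for brevity it uses the UCB1 index as a simple, closed-form stand-in for the true Gittins index (which requires a dynamic programming computation over posterior states); only the interface mirrors the proposal, and all names here are illustrative:

```python
import math

class IndexLearner:
    """Sketch of the proposed Gittins-style learner interface.

    Uses the UCB1 index (empirical mean plus an exploration bonus)
    as a stand-in for the true Gittins index.
    """
    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # pulls per arm
        self.sums = [0.0] * n_arms    # total reward per arm

    def train(self, data):
        # data: iterable of (arm, reward) pairs; only new data need be fed in
        for arm, reward in data:
            self.counts[arm] += 1
            self.sums[arm] += reward

    def index(self, i):
        # Untried arms get an infinite index so they are explored first.
        if self.counts[i] == 0:
            return float('inf')
        total = sum(self.counts)
        mean = self.sums[i] / self.counts[i]
        return mean + math.sqrt(2 * math.log(total) / self.counts[i])

    def recommend(self):
        # Pull the arm with the highest current index value.
        return max(range(len(self.counts)), key=self.index)

gi = IndexLearner(3)
gi.train([(0, 1.0), (0, 0.0), (1, 1.0)])
print(gi.recommend())  # arm 2 is untried, so it has an infinite index
```

The incremental-training behaviour described above falls out naturally: `train` only updates counts and sums, so feeding it each new batch of (arm, reward) observations keeps the indices current.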

APIs for value / policy iteration:

We specify the state transition probabilities in a matrix P, and the rewards corresponding to each state in a list R.

value_fun = orange.vi(P, R) ---> outputs a value function object for given P and R by doing value iteration.

value_fun.value(s) ---> outputs the value of state s.

policy = orange.pi(P, R) ---> outputs a policy for given P and R by doing policy iteration.

policy.recommend(s) ---> recommends action for a given state s.
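As a standalone sketch of what such a policy-iteration call would compute, using the same (P, R) convention as above; the code below is illustrative and is not itself the proposed Orange API:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration for the same (P, R) inputs as value iteration.

    P: (A, S, S) transition probabilities; R: (S,) state rewards.
    Returns the optimal policy as an array of actions, one per state.
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R exactly.
        P_pi = P[policy, np.arange(S), :]       # (S, S) rows under current policy
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R)
        # Policy improvement: act greedily with respect to V.
        Q = R[None, :] + gamma * P @ V          # shape (A, S)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy
        policy = new_policy

# Same toy MDP as before: action 1 steers toward the rewarding state 1.
P = np.array([[[0.9, 0.1], [0.9, 0.1]],
              [[0.1, 0.9], [0.1, 0.9]]])
R = np.array([0.0, 1.0])
print(policy_iteration(P, R))  # [1 1]
```

Unlike value iteration, each step here evaluates the current policy exactly (a linear solve) and then improves it, which typically converges in very few iterations.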

For integration with Orange Canvas, the data function will need to be modified for the case of streaming data. Again, we hold the last-read location in memory and simply read the newer data each time we hit refresh.

Timeline:

April 23: GSoC results are announced. Familiarize myself with the Orange source code and start coding as soon as possible.

May 21: Official end of "community bonding period". Commit skeleton code for Gittins index based learner, and begin implementing logic.

June 21: Commit skeleton code for the value iteration algorithm, begin implementing logic.

July 9: Submit working version of the Gittins-based learner for mid-term evaluations by Google.

July 15: Complete work on value iteration algorithm. Commit skeleton code for policy iteration.

August 7: Complete work on policy iteration.

August 8 - 20: Buffer period to scrub code, write documentation / tutorials, etc.

Future development:

Both value iteration and policy iteration require a complete transition model. It is possible to learn these transition models by Monte Carlo methods. Computing the value function can be prohibitively expensive for a large state space; it is possible to approximate the value function using some existing learners in Orange (SVMs, for example). Q-learning and TD-learning are algorithms that do these kinds of things. In general, we can build an "approximate value function learner", depending on how much we know about the transition model. After the GSoC period is over, I hope to continue development along these lines.
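As a pointer to this future direction, here is a minimal tabular Q-learning sketch, which learns action values directly from sampled transitions, with no transition model required. The toy environment and all names are my own illustrative choices:

```python
import random

def q_learning(step, n_states, n_actions, steps=2000,
               alpha=0.1, gamma=0.9, epsilon=0.2):
    """Tabular Q-learning: learns action values from sampled transitions.

    `step(s, a)` is the environment: it returns (reward, next_state).
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        # Epsilon-greedy: mostly exploit current estimates, occasionally explore.
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda a: Q[s][a])
        r, s2 = step(s, a)
        # Move Q[s][a] toward the bootstrapped target r + gamma * max_a' Q[s2][a'].
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q

# Toy chain: action 1 moves right toward state 2, which pays reward 1;
# action 0 resets to state 0 and pays nothing.
def step(s, a):
    s2 = min(s + 1, 2) if a == 1 else 0
    return (1.0 if s2 == 2 else 0.0), s2

random.seed(0)
Q = q_learning(step, n_states=3, n_actions=2)
greedy = max(range(2), key=lambda a: Q[0][a])
print("greedy action in state 0:", greedy)
```

The same update, with the table Q replaced by a function approximator (an SVM regressor, say), is the bridge to the "approximate value function learner" mentioned above.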

Why am I qualified for this project?

I am a graduate student at the National University of Singapore. My research consists of developing machine learning models for processing acoustic signals that are used for wireless communication in an underwater channel. Essentially, I am using data-driven methods to make an underwater modem adapt itself to harsh channel conditions. I also have an academic publication (with more expected to follow) resulting from my research [2].

I am now implementing my models and algorithms on a real underwater modem, which entails working with a large existing code base. I am familiar with version control.

Besides machine learning, I also have experience with signal processing, digital communications, and networks. Over the years, I have programmed in Python, Java, C, and MATLAB/Octave. You can find more background about me on my homepage: http://shankar.bigbig.com/

References:

[1] Gittins, J. and Jones, D., "A dynamic allocation index for the sequential allocation of experiments," 1974.

[2] S. Shankar, M. Chitre, and M. Jayasuriya, "Data driven algorithms to tune physical layer parameters of an underwater communication link," in IEEE OCEANS'10, Sydney, Australia, May 2010.

*****************************************************************************************************************************

regards,

shankar


### Re: Machine-learning on-the-fly: online learning for Orange

Just a gentle reminder: do let me know your thoughts about this. Does it sound good and convincing, or do you think it's infeasible? Your feedback will help me decide whether to pitch the idea elsewhere, or perhaps drop it altogether and think of something else.

Thank you

regards,

shankar.

