REINFORCEMENT LEARNING - Part 1


Robotics and AI - A Perfect Blend


Robotics and Artificial Intelligence have each made huge advances in their respective fields, and the blend of the two has produced some amazing things as well. A future in which robots use Artificial Intelligence to teach themselves what humans can do does not seem far off. This article is the first of a series of three that gives a peek into that future of robot learning and an introduction to Reinforcement Learning. By the end of part three you will have an overview of how to build a crawling robot that teaches itself to move in a particular direction.

 


What is Reinforcement Learning?

The world is full of Reinforcement Learning (RL) examples if we observe closely. A dog fetches a ball faster when you reward it with affection or a bone. Children are happy to do their chores when rewarded with sweets or toys, and may be punished when they misbehave. A similar concept works in the field of machine learning, where it is known as Reinforcement Learning. Reinforcement Learning involves an agent interacting with an environment, receiving positive or negative rewards (negative rewards are also known as punishments) for its actions, and learning to maximise its total reward.
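The interaction loop can be sketched in a few lines of Python. This is only an illustration of the idea, not part of this series' robot project; the Environment class, its two states and its reward values are all invented here.

import random

class Environment:
    """A toy two-state world, invented purely to illustrate the loop."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Hypothetical dynamics: action 1 moves the agent to state 1.
        self.state = 1 if action == 1 else 0
        # Being in state 1 is rewarded; anything else earns a small punishment.
        reward = 1.0 if self.state == 1 else -0.1
        return self.state, reward

env = Environment()
state = env.reset()
total_reward = 0.0
for _ in range(10):
    action = random.choice([0, 1])   # a real agent would learn to choose better actions
    state, reward = env.step(action)
    total_reward += reward           # the agent's goal is to maximise this total
print(total_reward)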


Reinforcement Learning Concept

Every agent in the environment has a number of attributes that contribute towards solving reinforcement learning problems. Some of them are as follows:

States:

The state of an agent describes its position in the environment at a particular time. The state at the current time is known as the current state, and states at future times are known as future states. The current state is denoted by "s" and the next state by "s'".

Actions:

When the agent is in a particular state and wants to move to another state, it does so by taking one of the available actions. Depending on the action taken, it lands in the corresponding state. An action is denoted by "a".

Transition Probabilities:

When moving from state "s" to "s'" by taking an action "a", there is only a probability of actually arriving in the intended state. In real-world problems there is always a chance that, even after taking the action to move forward, the agent does not end up moving forward. This probability is denoted by Ps,a(s').

Rewards:

It is a scalar value that the agent gets for being in a state. A reward can be represented in three ways: the reward for being in a state is R(s); the reward for being in a state and taking an action is R(s,a); and the reward for being in a state, taking an action and moving to a particular next state is R(s,a,s'). Negative rewards are also known as punishments.

Discount Factor:

As the agent moves through different states it also takes into account the rewards it might get in the future when it moves to the next states. These future rewards are discounted compared to the present reward: the further a state lies in the future, the more heavily its reward is discounted. The discount factor is a number between 0 and 1, denoted by gamma (γ). The total discounted reward collected over the sequence of states and actions is called the pay off.

Pay off   = R(s0) + γR(s1) + γ²R(s2) + ...

R(s0)     = immediate reward
γ         = discount factor
R(s1), R(s2), ... = future rewards
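As a small worked example (not from the article; the reward numbers are made up), the pay off can be computed directly from a list of rewards:

def payoff(rewards, gamma=0.9):
    # Pay off = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward of 1 in each of four successive states, with gamma = 0.9:
# 1 + 0.9 + 0.81 + 0.729 = 3.439, so later rewards count for less.
print(payoff([1, 1, 1, 1]))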


Policy:

The decision of which action to take in a given state depends on the policy of the agent. It is denoted by π (Pi).

π : s ---> a

Markov Decision Process

In Reinforcement Learning the agent (or the robot) constantly needs to make decisions, such as which motor to start, i.e. what action to take. Hence we need a process by which such decisions can be made, and this is where the Markov Decision Process comes in. The Markov property, named after the Russian mathematician Andrey Markov, states that you do not have to consider any earlier states but only the most recent one; in other words, only the present matters. In the case of RL the Markov property says that the transition probabilities depend only on the current state "s". Another aspect of the Markov property is that the transition probabilities do not change over time. The last ingredient of a Markov Decision Process (MDP) is the rewards. Hence states, actions, transition probabilities and rewards define what MDPs are.


Bellman's Equation:

It is the value function for a given policy. It is a fundamental recursive equation for finding the true value of being in a state, and it is the key equation for solving MDPs and Reinforcement Learning problems.



Vπ(s) = R(s) + γ Σs' Ps,π(s)(s') · Vπ(s')


Where,

R(s)             =   reward of the present state
Ps,π(s)(s')      =   probability of moving from s to s' when following policy π
Vπ(s')           =   value function of the future state s'

Optimal Value Function
    V*(s)   =    maxπ Vπ(s)
    V*(s)   =    R(s) + maxa γ Σs' Ps,a(s') · V*(s')
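A minimal sketch of a single Bellman backup for one state, assuming a hypothetical three-state MDP; the states, rewards and transition probabilities below are invented purely for illustration.

# Current value estimates for the other states.
V = {"s1": 2.0, "s2": 5.0}
R_s0 = 1.0        # reward for being in state s0
gamma = 0.9

# P[a][s'] = probability of landing in s' when taking action a from s0.
P = {
    "left":  {"s1": 0.8, "s2": 0.2},
    "right": {"s1": 0.1, "s2": 0.9},
}

# V*(s0) = R(s0) + max_a gamma * sum_s' P_{s0,a}(s') * V(s')
V_star_s0 = R_s0 + max(
    gamma * sum(p * V[s_next] for s_next, p in P[a].items())
    for a in P
)
print(V_star_s0)   # "right" wins here: 1 + 0.9 * (0.1*2 + 0.9*5) = 5.23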


Optimal Policy:

The goal of a Reinforcement Learning problem is to maximise the expected value of the total reward. Hence the optimal policy π* is the policy which maximises our long-term expected reward. It is defined through an argmax, which means it returns an action (a direction to move in) rather than a scalar value. The long-term reward is also known as the 'utility'.

        
   π*(s) = argmaxa Σs' Ps,a(s') · V*(s')

Where,

π*          =    Optimal policy
Ps,a(s')    =    Transition probability from state s to s' under action a
V*(s')      =    Optimal value of the next state
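Continuing the same invented three-state example from the Bellman sketch above, extracting the optimal action for a single state is just an argmax over the expected value of the next state:

# Same invented numbers as the Bellman sketch above.
V = {"s1": 2.0, "s2": 5.0}
P = {
    "left":  {"s1": 0.8, "s2": 0.2},
    "right": {"s1": 0.1, "s2": 0.9},
}

# pi*(s0) = argmax_a sum_s' P_{s0,a}(s') * V*(s')
pi_star_s0 = max(P, key=lambda a: sum(p * V[s2] for s2, p in P[a].items()))
print(pi_star_s0)   # "right", since 0.1*2 + 0.9*5 > 0.8*2 + 0.2*5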


Algorithms to calculate π*

Value Iteration:


In the value iteration process we find out the total reward we could get by moving to different states. The algorithm looks as follows:



        Initialise  V(s) = 0 (for all states)

        Repeat
     {
               For every s update

               V(s) := R(s) + maxa γ Σs' Ps,a(s') · V(s')
      }



Repeating the update drives V(s) ---> V*(s). Each update can be written compactly as

 V   :=   B(V)
 B    =   the Bellman backup operator


Basically, let's say the agent has four ways to move from its present state: up, down, right or left. Using the value iteration method the agent first initialises the value of every state to 0. Then it updates each state's value as its current reward plus the discounted sum of expected future rewards, taking the maximum over the available actions. Repeating this process makes the value of each state converge, and this converged value is called the optimal value (V*).
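The loop above can be turned into a short, runnable Python sketch. The three-state MDP below (its states, rewards and transition probabilities) is made up for illustration and is not the crawling robot of this series.

# A made-up MDP: reach state C (reward 1) from A or B.
states = ["A", "B", "C"]
R = {"A": 0.0, "B": 0.0, "C": 1.0}
P = {
    "A": {"right": {"B": 0.9, "A": 0.1}, "stay": {"A": 1.0}},
    "B": {"right": {"C": 0.9, "B": 0.1}, "stay": {"B": 1.0}},
    "C": {"stay": {"C": 1.0}},
}
gamma = 0.9

V = {s: 0.0 for s in states}   # initialise V(s) = 0 for all states
for _ in range(100):           # repeat the Bellman backup until the values settle
    V = {
        s: R[s] + max(
            gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            for a in P[s]
        )
        for s in states
    }
print(V)   # the values approach V*(s) for each state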


Policy Iteration:


Policy iteration consists of policy evaluation and policy improvement. It starts with a random policy and evaluates it by solving Bellman's equation, which, after a number of iterations, converges to a definite value for each state. Policy improvement then updates the policy by choosing in each state the action that looks best according to these values, and the two steps are repeated.



Algorithm for Policy Iteration:

      Initialise π randomly

      Repeat

     {
            Let V       :=    Vπ   (solve Bellman's equation for the current π)

            Let π(s)    :=    argmaxa Σs' Ps,a(s') · V(s')
     }

      Repeating these two steps makes V ---> V*
                                  and π ---> π*
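A compact Python sketch of this loop, on the same kind of invented toy MDP as before. The policy-evaluation step here solves Bellman's equation approximately by repeated sweeps rather than by solving the linear system exactly, which is a common simplification.

states = ["A", "B", "C"]
R = {"A": 0.0, "B": 0.0, "C": 1.0}
P = {
    "A": {"right": {"B": 0.9, "A": 0.1}, "stay": {"A": 1.0}},
    "B": {"right": {"C": 0.9, "B": 0.1}, "stay": {"B": 1.0}},
    "C": {"stay": {"C": 1.0}},
}
gamma = 0.9

def evaluate(pi, sweeps=100):
    # Approximately solve Bellman's equation for the fixed policy pi.
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s][pi[s]].items())
             for s in states}
    return V

pi = {s: "stay" for s in states}   # a (deliberately poor) initial policy
for _ in range(10):
    V = evaluate(pi)               # policy evaluation: V := V_pi
    pi = {s: max(P[s], key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
          for s in states}         # policy improvement: pi(s) := argmax_a ...
print(pi)   # converges to "right" in A and B, "stay" in C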

To estimate the transition probabilities from experience:
Ps,a(s') = (number of times taking action a in state s led to s') / (number of times action a was taken in state s)

     
     Repeat

     {
            Take actions using π to gain experience in the MDP, and update the estimates of Ps,a.

            Solve Bellman's equation using value iteration to get V*.

            Update π(s) = argmaxa Σs' Ps,a(s') · V*(s')

      }
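The counting step can be sketched as follows; the state and action names and the recorded experience are invented for illustration.

from collections import Counter, defaultdict

# counts[(s, a)][s'] = how many times taking action a in state s led to s'.
counts = defaultdict(Counter)

# Hypothetical experience gathered while following the current policy pi.
experience = [("A", "right", "B"), ("A", "right", "B"), ("A", "right", "A"),
              ("B", "right", "C"), ("B", "right", "C")]
for s, a, s2 in experience:
    counts[(s, a)][s2] += 1

# P_{s,a}(s') = count of (s, a -> s') / number of times a was taken in s.
P_hat = {
    (s, a): {s2: n / sum(c.values()) for s2, n in c.items()}
    for (s, a), c in counts.items()
}
print(P_hat[("A", "right")])   # {'B': 0.666..., 'A': 0.333...}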



Reinforcement Learning Types:

There are two types of Reinforcement Learning:
1. Model Based Reinforcement Learning
2. Reinforcement Learning Based Planner

Model Based Reinforcement Learning

In model based reinforcement learning the agent is given particular transitions: by observation it knows which state it is in, which action it takes and which state it lands in. Up to this point it does not know what the whole model, or environment, looks like. These observed transitions are fed to a modeller, which figures out the structure of the environment and builds a model from it. Using this model, the planner makes decisions via the MDP and produces a policy.

TRANSITIONS    --->    MODELLER    --->    MODEL    --->    PLANNER    --->   POLICY



Reinforcement Learning Based Planner

In the reinforcement learning based planner, the model, i.e. the structure of the environment, is already given to the agent. Using a simulator it tries out different transitions, and based on these transitions the learner learns the best policy.

MODEL   --->   SIMULATOR   --->   TRANSITIONS   --->   LEARNER   --->   POLICY



Approaches to RL
  1. Given a state ---> Policy ---> Action
  2. Given a state ---> Utility ---> Value function
  3. Given a state & action ---> Transitions, Rewards ---> Future state & future reward

All three of these approaches are interconnected, from the third back to the first. When we have the transitions and rewards we can solve the Bellman equation to find the value function, and once we have the value function we can find the policy.

Summary

Reinforcement Learning is a part of machine learning. It is a family of algorithms that includes value iteration, policy iteration, Q-learning and others. In value iteration the agent repeatedly updates the values of states using the Bellman equation and finds the optimal value. In policy iteration, the agent evaluates a sequence of policies and ends up with the policy of maximum value, which is known as the optimal policy. There are two types of Reinforcement Learning that yield an optimal policy: one takes transitions as its input and the other takes a model as its input. The next part will explain Q-learning.


By Sanjuksha Nirgude,
Research Project Fellow,
Cere Labs Pvt. Ltd
