Skip to main content
REINFORCEMENT LEARNING - Part 1


Robotics and AI- A Perfect Blend


Robotics and Artificial Intelligence has made a huge advancement in their respective fields. But the blend of Robotics and Artificial Intelligence have created some amazing stuff too. It doesn't seem far in future when robots can learn themselves what humans can, using Artificial Intelligence. This article is the first part of the series of three articles that would give us a peek into the future of robot learning and an introduction to Reinforcement Learning. By the end of part three an overview of creating a crawling robot that learns itself how to move in a particular direction will be given.

 


What is Reinforcement Learning?

The world seems to be full of Reinforcement Learning (RL) examples if we observe. A dog seems to fetch a ball faster when you reward it with affection or a bone. Children are happy to do their chores if rewarded with sweets or toys. Also, when behaving irrationally they might be punished. Similar concept works in the field of machine learning which is known as Reinforcement Learning. Reinforcement Learning involves an agent interacting with the environment, getting positive or negative rewards(also known as punishments) for its actions and learning to maximise its rewards.


Reinforcement Learning Concept

Every agent in the environment has number of attributes which contributes towards solving the reinforcement learning problems. Some of them are as follows:

States:

A state of an agent determines the position of the agent at a particular time. States in current time are known as current state, and in future time are known as future states. Current states is denoted by "s" and next state is denoted by "s' ".

Actions:

When the agent is in a particular state and tends to move to other states it can do so by taking various actions. Depending on actions taken it lands into respective states. Action is denoted by "a".

Transition Probabilities:

When moving from state "s" to " s' " , by taking an action "a", there is a probability of actually arriving to the intended state. As sometimes in the real world problems there is a possibility that even after taking the particular action to move forward, the agent might not end moving forward.

Rewards:

It is a scalar value that you get for being in a state. Reward can be represented in three ways - reward for being in a state is R(s), reward for being in a state and taking an action is R(s,a), and reward for being in a state taking an action and moving to a particular next state is R(s,a,s'). Negative rewards are also known as punishments.

Discount Function:

As the agent moves in different states it calculates the rewards that it might get in the future when it moves to the next state. These future rewards are discounted compared to the present rewards. As the states are further in future the discount function is larger. The total amount of rewards depending on each state and actions are calculated by the term pay off. It is denoted by gamma (γ).

Pay Off   = R(S0) + γR(S1) + γ2 R(S2) + ........................
R(S0)     = immediate reward
γ         = discount function
R(S)      = future rewards


Policy:

The decision of what action to be taken when at a given state depends on the policy of the agent. It is denoted by л(Pi).

л ---> s ---> a

Markov Decision Process

In Reinforcement learning most of the time the agent(or the robot) needs to make decisions like which motor to start i.e. what action to take. Hence, we need a process by which we can make decisions. Here comes the Markov decision process. The Markovian property originates from Russia. It states that you don't have to consider any other state but only the most recent one. Hence, it says that only the present matters. In case of RL Markovian property says, the transition probabilities only depend on current state "s". Also, another aspect of markovian property is the transition probabilities do not change over time. The last thing that is included in the Markov Decision Process(MDP) are the rewards. Hence, states, transition probabilities and rewards defines what MDP's are.


Bellman's Equation:

It is the value function for given policy. It is a fundamental recursive equation to find out the true value for being in a state. Also, it is a key equation for solving MDP's and Reinforcement Learning problems.



Vπ(S) = R(S0) + γ Σs' P(s') .Vπ(s')


Where,

R(S0)     =   Present state reward
P(s)        =   Probability in each state
Vπ(s')     =   Value function for future states

Optimal Value Function
                                                 V*(s)   =    maxπVπ(S)
                        V*(s)   =    maxa γ Σs' P(s') .Vπ(s')


Optimal Policy:

The goal of Reinforcement learning problems is to maximise expected value of total rewards. Hence the optimal policy л* is the policy which maximises our long term expected reward. It contains an argmax function which states that it is a direction and not a scalar value. Long term rewards are also known as 'utility'.

        
   л*(s) = argmaxa' Σs' Ps,a(s') .V*(s')

Where,

л*          =    Optimal Policy
Ps,a(s')   =    Transition probability from state s to s'.
V(s')      =    Optimal value of next state


Algorithm to calculate л*

Value Iteration:


In value iteration process we find out what is the total reward we could get by moving to different states. The algorithm looks like follows:



        Initialise  V(s) = 0 (for all states)

     {
               For every S update

               V(s) = R (S) + maxa γ Σs' Ps,a(s') .V(s')
      }



The max will give V(s) ---> V*(s)
 V     : =     B(v)
 B       =     Bellman backup operator


Basically, lets say the agent has four ways to move from its present state and that can be up, down, right or left. Using value iteration method the agent first initialise or assumes that values for all states is 0. Then it calculates each state value depending on its current reward and discounted value of sum of future rewards. Repeating this process would give it a maximum value among the values of different states. This value is called optimal value( V*).


Policy Iteration:


Policy iteration consists of policy search and policy improvement based on the values obtained by value iteration. When the value iterations converges to definite values for each state after a number of iterations. Policy search initially starts with a randomly policy and finds the best policy from each state depending on these values. Finally, policy improvement evaluates the sequence of these policies and updates the policies of each state depending on the policies of future states.



Algorithm for Policy Iteration:

      Repeat

     {
            Initialise 'л' randomly

            Let V       :=    Vπ (Solve Bellman's Equation)

            Let л(s)    :=    argmaxa' Σs' Ps,a(s') .V(s')
            
           This will make V ---> V*
                                      л ---> л*
     }

To find probabilities:
Ps,a(s') = number of times action a in state is got in s'/ no. of samples

     
     Repeat

     {
            Take action using л to get experience in MDP, update estimates of Ps,a.

             Solve Bellman's equation using value iteration to get V*

             Update л(s) = argmax a' Σs' Ps,a(s') .V(s')

      }



Reinforcement Learning Types:

There are two types of Reinforcement Learning:
1.Model Based Reinforcement Learning
2.Reinforcement Learning Based Planner

Model Based Reinforcement Learning

In the model based reinforcement learning we are given particular transitions, hence the agent by observation knows which state it is in, what action to take and which state it would land in. But till this step it doesn't know what the whole model or the environment looks like. After using these observations and transitions the agent figures out the structure of the environment and provides it to a modeller. The modeller uses these structures to create a model. Using this model the Planner takes decisions using MDP and creates a policy.

TRANSITIONS    --->    MODELLER    --->    MODEL    --->    PLANNER    --->   POLICY



Reinforcement Learning Based Planner

In the reinforcement learning based planner the model or the environment structure is already given to the agent. Using a simulator it tries on different transitions. Based on these transitions the learner learns the best policy.

MODEL   --->   SIMULATOR   --->   TRANSITIONS   --->   LEARNER   --->   POLICY



Approaches to RL
  1. Given a State ---> Policy ---> Action
  2. Given a state ---> Utility ---> value function
  3. We have, state & action ----> Transition, Rewards---> future state & future reward

All of these 3 approaches are interconnected from third to first. When we have the transitions and rewards we can solve the Bellman equation to find the value function. When we would have the value function we can then find the policy.

Summary

Reinforcement learning is a part of machine learning. It is family of algorithms including value iteration, policy iteration, q learning etc. In value iteration the agent proceeds calculating the values of the future states using Bellman equation and finds the optimal value. In policy iteration, the agent calculates values for number of policies and obtains a policy with maximum value which is known as optimal policy. There are two types of Reinforcement Learning which give an optimal policy. One of which takes transitions as an input and another takes model as an input. The next part will explain about q learning.

References:



By Sanjuksha Nirgude,
Research Project Fellow,
Cere Labs Pvt. Ltd

Comments

Popular posts from this blog

Understanding Generative Adversarial Networks - Part II

In "Understanding Generative Adversarial Networks - Part I" you gained a conceptual understanding of how GAN works. In this post let us get a mathematical understanding of GANs.
The loss functions can be designed most easily using the idea of zero-sum games. 
The sum of the costs of all players is 0. This is the Minimax algorithm for GANs
Let’s break it down.
Some terminology: V(D, G) : The value function for a minimax game E(X) : Expectation of a random variable X, also equal to its average value D(x) : The discriminator output for an input x from real data, represents probability G(z): The generator's output when its given z from the noise distribution D(G(z)): Combining the above, this represents the output of the discriminator when 
given a generated image G(z) as input
Now, as explained above, the discriminator is the maximizer and hence it tries to 
maximize

Understanding Generative Adverserial Networks - Part 1

This is a two part series on understanding Generative Adversarial Networks (GANs). This part deals with the conceptual understanding of GANs. In the second part we will try to understand the mathematics behind GANs.

Generative networks have been in use for quite a while now. And so have discriminative networks. But only in 2014 did someone get the brilliant idea of using them together. These are the generative adversarial networks. This kind of deep learning model was invented by Ian Goodfellow. When we work with data already labelled, it’s called supervised learning. It’s much easier compared to unsupervised learning, which has no predefined labels, making the task more vague. 

"Generative Adversarial Networks is the most interesting idea in the last ten years in Machine Learning." - Yann LeCun

In this post, we’ll discuss what GANs are and how they work, at a higher , more abstract level. Since 2014, many variations of the traditional GAN have come out, but the underlying conc…