REINFORCEMENT LEARNING - Part 1

Robotics and AI - A Perfect Blend

Robotics and Artificial Intelligence have each made huge advancements in their respective fields, but the blend of the two has created some amazing things as well. The day does not seem far when robots, using Artificial Intelligence, can learn by themselves what humans can. This article is the first part of a series of three that gives a peek into the future of robot learning and an introduction to Reinforcement Learning. By the end of part three, we will have an overview of building a crawling robot that learns by itself how to move in a particular direction.

What is Reinforcement Learning?

The world seems full of Reinforcement Learning (RL) examples if we observe closely. A dog fetches a ball faster when you reward it with affection or a bone. Children are happy to do their chores if rewarded with sweets or toys; when they misbehave, they might be punished. A similar concept works in the field of machine learning, where it is known as Reinforcement Learning. Reinforcement Learning involves an agent interacting with an environment, receiving positive or negative rewards (negative rewards are also known as punishments) for its actions, and learning to maximise its rewards.

Reinforcement Learning Concept

Every agent in the environment has a number of attributes which contribute towards solving reinforcement learning problems. Some of them are as follows:

States:

A state of an agent determines the position of the agent at a particular time. The state at the current time is known as the current state, and states at future times are known as future states. The current state is denoted by "s" and the next state is denoted by "s'".

Actions:

When the agent is in a particular state and wants to move to another state, it can do so by taking various actions. Depending on the action taken, it lands in the respective state. An action is denoted by "a".

Transition Probabilities:

When moving from state "s" to "s'" by taking an action "a", there is only a probability of actually arriving at the intended state. In real-world problems there is a possibility that, even after taking the action to move forward, the agent might not end up moving forward.

Rewards:

It is a scalar value that the agent gets for being in a state. A reward can be represented in three ways: the reward for being in a state is R(s); the reward for being in a state and taking an action is R(s,a); and the reward for being in a state, taking an action and moving to a particular next state is R(s,a,s'). Negative rewards are also known as punishments.

Discount Factor:

As the agent moves through different states, it estimates the rewards it might get in the future when it moves to the next states. These future rewards are discounted compared to present rewards: the further a state lies in the future, the more heavily its reward is discounted. The discount factor is denoted by gamma (γ), with 0 ≤ γ < 1. The total discounted reward over the states visited is called the payoff.

Payoff = R(s0) + γR(s1) + γ²R(s2) + ...

R(s0)  = immediate reward
γ      = discount factor
R(st)  = reward of the state reached at future time step t
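As a quick illustration, here is a minimal Python sketch that computes the payoff; the reward sequence is made up for the example:

```python
# Discounted payoff: R(s0) + γ·R(s1) + γ²·R(s2) + ...
def payoff(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical reward sequence: the agent collects 1, then 0, then 5.
print(payoff([1, 0, 5], gamma=0.9))  # 1 + 0.9*0 + 0.81*5 = 5.05
```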

Policy:

The decision of which action to take in a given state depends on the policy of the agent. It is denoted by π (pi).

π : s ---> a

Markov Decision Process

In Reinforcement Learning, most of the time the agent (or the robot) needs to make decisions such as which motor to start, i.e. what action to take. Hence we need a process by which we can make decisions. Here comes the Markov Decision Process. The Markov property, named after the Russian mathematician Andrey Markov, states that you don't have to consider any state but the most recent one; only the present matters. In the case of RL, the Markov property says the transition probabilities depend only on the current state "s". Another aspect of the Markov property used here is that the transition probabilities do not change over time. The last thing included in the Markov Decision Process (MDP) is the rewards. Hence states, actions, transition probabilities and rewards define what MDPs are.

Bellman's Equation:

It is the value function for a given policy. It is a fundamental recursive equation for finding the true value of being in a state, and a key equation for solving MDPs and Reinforcement Learning problems.

Vπ(s) = R(s) + γ Σs' Ps,π(s)(s') Vπ(s')

Where,

R(s)         =   present state reward
Ps,π(s)(s')  =   probability of moving from state s to s' under the action π(s)
Vπ(s')       =   value function of the future state
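Because the equation is linear in Vπ, it can be solved directly for a fixed policy. Here is a minimal NumPy sketch, where the two-state MDP, its rewards and its transition matrix are all invented for illustration:

```python
import numpy as np

gamma = 0.9
R = np.array([1.0, 0.0])          # reward for being in each of two states
# P[s, s'] = probability of moving from s to s' under the fixed policy π
P = np.array([[0.8, 0.2],
              [0.1, 0.9]])

# Bellman's equation Vπ = R + γ·P·Vπ rearranges to (I - γP)·Vπ = R
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V)   # the true value Vπ(s) of each state under this policy
```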

Optimal Value Function
V*(s)   =    maxπ Vπ(s)
V*(s)   =    R(s) + maxa γ Σs' Ps,a(s') V*(s')

Optimal Policy:

The goal of Reinforcement Learning problems is to maximise the expected value of the total reward. Hence the optimal policy π* is the policy which maximises our long-term expected reward. It is defined with an argmax function, which means it returns an action (a direction to move) rather than a scalar value. The long-term reward is also known as 'utility'.

π*(s) = argmaxa Σs' Ps,a(s') V*(s')

Where,

π*        =    optimal policy
Ps,a(s')  =    transition probability from state s to s' under action a
V*(s')    =    optimal value of the next state

Algorithms to calculate π*

Value Iteration:

In the value iteration process we find out the total reward we could get by moving through different states. The algorithm looks as follows:

Initialise V(s) := 0 (for all states)

Repeat
{
For every state s update

V(s) := R(s) + maxa γ Σs' Ps,a(s') V(s')
}

Repeating the update drives V(s) ---> V*(s)
V  :=  B(V)
B   =  Bellman backup operator

Basically, let us say the agent has four ways to move from its present state: up, down, right or left. Using the value iteration method, the agent first initialises the values of all states to 0. It then updates each state's value using its current reward plus the discounted value of the best future state. Repeating this process converges to the maximum achievable value for each state, called the optimal value (V*).
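As an illustration, here is a minimal Python sketch of value iteration on an invented one-dimensional world; the states, rewards and deterministic moves are all assumptions made for this example:

```python
import numpy as np

# A toy 1-D world: 5 states in a row, reward 10 for being in the rightmost state.
n_states, gamma = 5, 0.9
R = np.array([0.0, 0.0, 0.0, 0.0, 10.0])
actions = [-1, +1]                       # move left or move right

def step(s, a):
    return min(max(s + a, 0), n_states - 1)   # deterministic toy transition

V = np.zeros(n_states)                   # initialise V(s) := 0 for all states
for _ in range(100):                     # repeat the Bellman backup
    V = np.array([R[s] + gamma * max(V[step(s, a)] for a in actions)
                  for s in range(n_states)])

# Extract the greedy policy: pick the action leading to the best-valued next state.
policy = [max(actions, key=lambda a: V[step(s, a)]) for s in range(n_states)]
print(V)       # V has converged towards V*
print(policy)  # every state chooses +1: move right towards the reward
```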

Policy Iteration:

Policy iteration consists of policy evaluation and policy improvement. It starts with a random policy. Policy evaluation solves Bellman's equation for the current policy to obtain a value for each state; policy improvement then updates the policy of each state, choosing the action that looks best according to these values. The two steps repeat until the policy stops changing.

Algorithm for Policy Iteration:

Initialise π randomly

Repeat

{
Let V       :=    Vπ (solve Bellman's equation)

Let π(s)    :=    argmaxa Σs' Ps,a(s') V(s')
}

This will make V ---> V*
and π ---> π*
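And a minimal sketch of policy iteration on the same kind of invented toy world, again with all states, rewards and transitions assumed for illustration:

```python
import numpy as np

n_states, gamma = 5, 0.9
R = np.array([0.0, 0.0, 0.0, 0.0, 10.0])   # reward 10 in the rightmost state
actions = [-1, +1]                          # move left or move right

def step(s, a):
    return min(max(s + a, 0), n_states - 1)  # deterministic toy transition

pi = np.random.choice(actions, size=n_states)   # initialise π randomly

while True:
    # Policy evaluation: solve Bellman's equation V = R + γ·P·V for the current π.
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        P[s, step(s, pi[s])] = 1.0              # deterministic transitions
    V = np.linalg.solve(np.eye(n_states) - gamma * P, R)

    # Policy improvement: π(s) := argmaxa Σs' Ps,a(s') V(s')
    new_pi = np.array([max(actions, key=lambda a: V[step(s, a)])
                       for s in range(n_states)])
    if np.array_equal(new_pi, pi):              # policy stopped changing: π ---> π*
        break
    pi = new_pi

print(V)    # ---> V*
print(pi)   # ---> π*: +1 (move right) in every state
```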

In practice the transition probabilities are unknown and must be estimated from experience:
Ps,a(s') = (number of times taking action a in state s led to s') / (number of times action a was taken in state s)

The estimation is then interleaved with planning:

Repeat

{
Take actions using π to gain experience in the MDP, and update the estimates of Ps,a.

Solve Bellman's equation using value iteration to get V*.

Update π(s) := argmaxa Σs' Ps,a(s') V*(s')

}
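A minimal sketch of how Ps,a(s') might be estimated from experience counts; the state and action names here are invented for the example:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = visits

def record(s, a, s_next):
    counts[(s, a)][s_next] += 1                  # one observed transition

def P(s, a, s_next):
    # Ps,a(s') = times (s, a) led to s' / times a was taken in s
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0

# Hypothetical experience: "forward" from s0 usually reaches s1, sometimes slips.
for s_next in ["s1", "s1", "s1", "s0"]:
    record("s0", "forward", s_next)
print(P("s0", "forward", "s1"))   # 0.75
```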

Reinforcement Learning Types:

There are two types of Reinforcement Learning:
1. Model Based Reinforcement Learning
2. Reinforcement Learning Based Planner

Model Based Reinforcement Learning

In model based reinforcement learning the agent is given particular transitions, so by observation it knows which state it is in, what action it takes and which state it lands in. Up to this step it does not know what the whole model or environment looks like. The agent provides these observed transitions to a modeller, which uses them to create a model. Using this model, the planner makes decisions via an MDP and creates a policy.

TRANSITIONS    --->    MODELLER    --->    MODEL    --->    PLANNER    --->   POLICY

Reinforcement Learning Based Planner

In the reinforcement learning based planner, the model, i.e. the structure of the environment, is already given to the agent. Using a simulator, the agent tries out different transitions, and based on these transitions the learner learns the best policy.

MODEL   --->   SIMULATOR   --->   TRANSITIONS   --->   LEARNER   --->   POLICY

Approaches to RL
1. Given a state ---> Policy ---> Action
2. Given a state ---> Utility ---> Value function
3. Given a state & action ---> Transitions, Rewards ---> Future state & future reward

All three approaches are interconnected, from the third back to the first: when we have the transitions and rewards, we can solve the Bellman equation to find the value function, and once we have the value function we can find the policy.

Summary

Reinforcement learning is a part of machine learning. It is a family of algorithms including value iteration, policy iteration, Q-learning, etc. In value iteration, the agent calculates the values of future states using the Bellman equation and finds the optimal value. In policy iteration, the agent evaluates the values of a sequence of policies and obtains the policy with maximum value, known as the optimal policy. There are two types of Reinforcement Learning which yield an optimal policy: one takes transitions as input and the other takes a model as input. The next part will explain Q-learning.


By Sanjuksha Nirgude,
Research Project Fellow,
Cere Labs Pvt. Ltd

### Implement XOR in TensorFlow

XOR is considered the 'Hello World' of neural networks. It seems like the best problem to try as your first TensorFlow program.

TensorFlow makes it easy to build a neural network with a few tweaks. All you have to do is build a graph and you have a neural network that learns the XOR function.

Why XOR? Well, XOR is the reason why backpropagation was invented in the first place. A single-layer perceptron, although quite successful in learning the AND and OR functions, can't learn XOR (Table 1), as it is just a linear classifier and XOR is a linearly inseparable pattern (Figure 1). Thus the single-layer perceptron goes into a panic mode while learning XOR: it simply cannot do it.

The backpropagation algorithm comes to the rescue. It learns XOR by adding two lines, L1 and L2 (Figure 2). This post assumes you know how the backpropagation algorithm works.

Following are the steps to implement the neural network in Figure 3 for XOR in TensorFlow:
1. Import necessary libraries
impo…
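As a rough guide, here is a minimal end-to-end sketch written against the current tf.keras API rather than the original graph-style code; the layer sizes, optimiser and training settings are assumptions made for this example:

```python
import numpy as np
import tensorflow as tf

# The XOR truth table (Table 1).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# One hidden layer is enough: it lets the network carve the two separating
# lines (L1 and L2 in Figure 2) that a single-layer perceptron cannot.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.1),
              loss="binary_crossentropy")
model.fit(X, y, epochs=500, verbose=0)

print(model.predict(X).round())   # expected: [[0.], [1.], [1.], [0.]]
```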

### From Cats to Convolutional Neural Networks

Widely used in image recognition, Convolutional Neural Networks (CNNs) consist of multiple layers of neuron collections which look at small windows of the input image, called receptive fields.
The history of Convolutional Neural Networks begins with a famous experiment, "Receptive Fields of Single Neurons in the Cat's Striate Cortex", conducted by Hubel and Wiesel. The experiment confirmed the long-held belief of neurobiologists and psychologists that neurons in the brain act as feature detectors.
The first neural network model that drew inspiration from Hubel and Wiesel's hierarchy model of the visual nervous system was the Neocognitron, invented by Kunihiko Fukushima; it had the ability to perform unsupervised learning. Fukushima's approach was commendable, as it was the first neural network model with pattern-recognition capability similar to the human brain. The model gave a lot of insight and helped future understanding of the brain.