Unravelling the mysteries of Artificial Intelligence

REINFORCEMENT LEARNING - Part 1

Robotics and AI- A Perfect Blend

Robotics and Artificial Intelligence has made a huge advancement in their respective fields. But the blend of Robotics and Artificial Intelligence have created some amazing stuff too. It doesn't seem far in future when robots can learn themselves what humans can, using Artificial Intelligence. This article is the first part of the series of three articles that would give us a peek into the future of robot learning and an introduction to Reinforcement Learning. By the end of part three an overview of creating a crawling robot that learns itself how to move in a particular direction will be given.

What is Reinforcement Learning?

The world seems to be full of Reinforcement Learning (RL) examples if we observe. A dog seems to fetch a ball faster when you reward it with affection or a bone. Children are happy to do their chores if rewarded with sweets or toys. Also, when behaving irrationally they might be punished. Similar concept works in the field of machine learning which is known as Reinforcement Learning. Reinforcement Learning involves an agent interacting with the environment, getting positive or negative rewards(also known as punishments) for its actions and learning to maximise its rewards.

Reinforcement Learning Concept

Every agent in the environment has number of attributes which contributes towards solving the reinforcement learning problems. Some of them are as follows:

States:

A state of an agent determines the position of the agent at a particular time. States in current time are known as current state, and in future time are known as future states. Current states is denoted by "s" and next state is denoted by "s' ".

Actions:

When the agent is in a particular state and tends to move to other states it can do so by taking various actions. Depending on actions taken it lands into respective states. Action is denoted by "a".

Transition Probabilities:

When moving from state "s" to " s' " , by taking an action "a", there is a probability of actually arriving to the intended state. As sometimes in the real world problems there is a possibility that even after taking the particular action to move forward, the agent might not end moving forward.

Rewards:

It is a scalar value that you get for being in a state. Reward can be represented in three ways - reward for being in a state is R(s), reward for being in a state and taking an action is R(s,a), and reward for being in a state taking an action and moving to a particular next state is R(s,a,s'). Negative rewards are also known as punishments.

Discount Function:

As the agent moves in different states it calculates the rewards that it might get in the future when it moves to the next state. These future rewards are discounted compared to the present rewards. As the states are further in future the discount function is larger. The total amount of rewards depending on each state and actions are calculated by the term pay off. It is denoted by gamma (γ).

Pay Off = R(S₀) + γR(S₁) + γ²R(S₂) + ........................

R(S₀) = immediate reward

γ = discount function

R(S) = future rewards

Policy:

The decision of what action to be taken when at a given state depends on the policy of the agent. It is denoted by л(Pi).

л ---> s ---> a

Markov Decision Process

In Reinforcement learning most of the time the agent(or the robot) needs to make decisions like which motor to start i.e. what action to take. Hence, we need a process by which we can make decisions. Here comes the Markov decision process. The Markovian property originates from Russia. It states that you don't have to consider any other state but only the most recent one. Hence, it says that only the present matters. In case of RL Markovian property says, the transition probabilities only depend on current state "s". Also, another aspect of markovian property is the transition probabilities do not change over time. The last thing that is included in the Markov Decision Process(MDP) are the rewards. Hence, states, transition probabilities and rewards defines what MDP's are.

Bellman's Equation:

It is the value function for given policy. It is a fundamental recursive equation to find out the true value for being in a state. Also, it is a key equation for solving MDP's and Reinforcement Learning problems.

V^π(S) = R(S₀) + γ Σ_s'P(s') .V^π(s')

Where,

R(S₀) = Present state reward

P(s) = Probability in each state

V^π(s') = Value function for future states

Optimal Value Function

V*(s) = max_πV^π(S)

V*(s) = max_aγ Σ_s'P(s') .V^π(s')

Optimal Policy:

The goal of Reinforcement learning problems is to maximise expected value of total rewards. Hence the optimal policy л* is the policy which maximises our long term expected reward. It contains an argmax function which states that it is a direction and not a scalar value. Long term rewards are also known as 'utility'.

л*(s) = argmax_a'Σ_s'P_s,a(s') .V*(s')

Where,

л* = Optimal Policy

P_s,a(s') = Transition probability from state s to s'.

V(s') = Optimal value of next state

Algorithm to calculate л*

Value Iteration:

In value iteration process we find out what is the total reward we could get by moving to different states. The algorithm looks like follows:

Initialise V(s) = 0 (for all states)

{

For every S update

V(s) = R (S) + max_aγ Σ_s'P_s,a(s') .V(s')

}

The max will give V(s) ---> V*(s)

V : = B(v)

B = Bellman backup operator

Basically, lets say the agent has four ways to move from its present state and that can be up, down, right or left. Using value iteration method the agent first initialise or assumes that values for all states is 0. Then it calculates each state value depending on its current reward and discounted value of sum of future rewards. Repeating this process would give it a maximum value among the values of different states. This value is called optimal value( V*).

Policy Iteration:

Policy iteration consists of policy search and policy improvement based on the values obtained by value iteration. When the value iterations converges to definite values for each state after a number of iterations. Policy search initially starts with a randomly policy and finds the best policy from each state depending on these values. Finally, policy improvement evaluates the sequence of these policies and updates the policies of each state depending on the policies of future states.

Algorithm for Policy Iteration:

Repeat

{

Initialise 'л' randomly

Let V := V^π(Solve Bellman's Equation)

Let л(s) := argmax_a'Σ_s'P_s,a(s') .V(s')

This will make V ---> V*

л ---> л*

}

To find probabilities:

P_s,a(s') = number of times action a in state is got in s'/ no. of samples

Repeat

{

Take action using л to get experience in MDP, update estimates of P_s,a.

Solve Bellman's equation using value iteration to get V*

Update л(s) = argmax _a'Σ_s'P_s,a(s') .V(s')

}

Reinforcement Learning Types:

There are two types of Reinforcement Learning:

1.Model Based Reinforcement Learning

2.Reinforcement Learning Based Planner

Model Based Reinforcement Learning

In the model based reinforcement learning we are given particular transitions, hence the agent by observation knows which state it is in, what action to take and which state it would land in. But till this step it doesn't know what the whole model or the environment looks like. After using these observations and transitions the agent figures out the structure of the environment and provides it to a modeller. The modeller uses these structures to create a model. Using this model the Planner takes decisions using MDP and creates a policy.

TRANSITIONS ---> MODELLER ---> MODEL ---> PLANNER ---> POLICY

Reinforcement Learning Based Planner

In the reinforcement learning based planner the model or the environment structure is already given to the agent. Using a simulator it tries on different transitions. Based on these transitions the learner learns the best policy.

MODEL ---> SIMULATOR ---> TRANSITIONS ---> LEARNER ---> POLICY

Approaches to RL

Given a State ---> Policy ---> Action
Given a state ---> Utility ---> value function
We have, state & action ----> Transition, Rewards---> future state & future reward

All of these 3 approaches are interconnected from third to first. When we have the transitions and rewards we can solve the Bellman equation to find the value function. When we would have the value function we can then find the policy.

Summary

Reinforcement learning is a part of machine learning. It is family of algorithms including value iteration, policy iteration, q learning etc. In value iteration the agent proceeds calculating the values of the future states using Bellman equation and finds the optimal value. In policy iteration, the agent calculates values for number of policies and obtains a policy with maximum value which is known as optimal policy. There are two types of Reinforcement Learning which give an optimal policy. One of which takes transitions as an input and another takes model as an input. The next part will explain about q learning.

References:

http://www.ias.tudarmstadt.de/uploads/Publications/Kober_IJRR_2013.pdf
To Learn Reinforcement Learning follow this link:

https://classroom.udacity.com/courses/ud600/lessons/4100878601/concepts/41234786450923

By Sanjuksha Nirgude,
Research Project Fellow,
Cere Labs Pvt. Ltd

Anomaly Detection based on Prediction - A Step Closer to General Artificial Intelligence

Anomaly detection refers to the problem of finding patterns that do not conform to expected behavior [1]. In the last article "Understanding Neocortex to Create Intelligence" , we explored how applications based on the workings of neocortex create intelligence. Pattern recognition along with prediction makes human brains the ultimate intelligent machines. Prediction help humans to detect anomalies in the environment. Before every action is taken, neocortex predicts the outcome. If there is a deviation from the expected outcome, neocortex detects anomalies, and will take necessary steps to handle them. A system which claims to be intelligent, should have anomaly detection in place. Recent findings using research on neocortex have made it possible to create applications that does anomaly detection. Numenta’s NuPIC using Hierarchical Temporal Memory (HTM) framework is able to do inference and prediction, and hence anomaly detection. HTM accurately predicts anomalies in real

Unravelling the mysteries of Artificial Intelligence

Search This Blog

Labels

Comments

Post a Comment

Popular posts from this blog

GPU - The brain of Artificial Intelligence

Understanding Projection Pursuit Regression

Anomaly Detection based on Prediction - A Step Closer to General Artificial Intelligence