REINFORCEMENT LEARNING - Part 1
Robotics and AI - A Perfect Blend
Robotics and Artificial Intelligence have each made huge advances in their respective fields, and the blend of the two has produced some amazing results as well. It doesn't seem far in the future when robots will be able to learn on their own, much as humans do, using Artificial Intelligence. This article is the first in a series of three that gives a peek into the future of robot learning and an introduction to Reinforcement Learning. By the end of part three, we will have an overview of building a crawling robot that learns on its own how to move in a particular direction.
What is Reinforcement Learning?
The world is full of Reinforcement Learning (RL) examples if we observe closely. A dog fetches a ball faster when you reward it with affection or a bone. Children are happy to do their chores if rewarded with sweets or toys, and may be punished when they misbehave. A similar concept works in the field of machine learning, where it is known as Reinforcement Learning. Reinforcement Learning involves an agent interacting with an environment, receiving positive or negative rewards (also known as punishments) for its actions, and learning to maximise its total reward.
Reinforcement Learning Concepts
Every agent in the environment has a number of attributes that contribute towards solving reinforcement learning problems. Some of them are as follows:
States:
The state of an agent describes the situation the agent is in at a particular time. The state at the current time is known as the current state and is denoted by "s"; the state the agent moves to next is known as the next state and is denoted by "s'".
Actions:
When the agent is in a particular state and wants to move to another state, it does so by taking an action. Depending on the action taken, it lands in a different state. An action is denoted by "a".
Transition Probabilities:
When moving from state "s" to state "s'" by taking an action "a", there is only a probability of actually arriving in the intended state. In real-world problems, even after taking the action to move forward, the agent might not end up moving forward.
Rewards:
A reward is a scalar value that the agent gets for being in a state. It can be represented in three ways: the reward for being in a state is R(s), the reward for being in a state and taking an action is R(s,a), and the reward for being in a state, taking an action and moving to a particular next state is R(s,a,s'). Negative rewards are also known as punishments.
Discount Function:
As the agent moves through different states, it estimates the rewards it might receive in the future when it moves to the next states. These future rewards are discounted compared to present rewards: the further a state lies in the future, the more heavily its reward is discounted. The total discounted reward, depending on the states visited and the actions taken, is called the pay off. The discount factor is denoted by gamma (γ).
Pay Off = R(s0) + γR(s1) + γ²R(s2) + ...
Where,
R(s0) = immediate reward
γ = discount factor
R(s1), R(s2), ... = future rewards
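To make the discounting concrete, here is a small Python sketch that computes the pay off for a sequence of rewards; the reward values and γ = 0.9 used here are only illustrative.

def discounted_payoff(rewards, gamma=0.9):
    """Pay Off = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Rewards collected along one trajectory s0, s1, s2, s3 (made-up numbers)
rewards = [0.0, 1.0, 0.0, 10.0]
print(discounted_payoff(rewards))  # 0 + 0.9*1 + 0.81*0 + 0.729*10 = 8.19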
Policy:
The decision of which action to take in a given state depends on the policy of the agent. It is denoted by π (pi). A policy maps states to actions:
π : s ---> a
Markov Decision Process
In Reinforcement Learning the agent (or the robot) constantly needs to make decisions, such as which motor to start, i.e. what action to take. Hence, we need a process by which we can make such decisions, and this is where the Markov Decision Process comes in. The Markov property, named after the Russian mathematician Andrey Markov, states that you don't have to consider any history other than the most recent state; only the present matters. In the case of RL, the Markov property says that the transition probabilities depend only on the current state "s". Another aspect of the Markov property is that the transition probabilities do not change over time. The last thing included in a Markov Decision Process (MDP) is the rewards. Hence, states, transition probabilities and rewards define what MDPs are.
Bellman's Equation:
It is the value function for a given policy. It is a fundamental recursive equation for finding the true value of being in a state, and a key equation for solving MDPs and Reinforcement Learning problems.
Vπ(s) = R(s) + γ Σs' Ps,π(s)(s') . Vπ(s')
Where,
R(s) = present state reward
Ps,π(s)(s') = probability of moving from state s to s' under policy π
Vπ(s') = value function of the next state
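As a rough illustration, the following Python sketch evaluates Vπ by repeatedly applying Bellman's equation on a tiny two-state MDP; the states, rewards, transition probabilities and policy below are made-up assumptions for the example.

def evaluate_policy(states, R, P, policy, gamma=0.9, sweeps=100):
    """Iteratively apply Vpi(s) = R(s) + gamma * sum_s' P[s][pi(s)][s'] * Vpi(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: R[s] + gamma * sum(p * V[s2]
                                   for s2, p in P[s][policy[s]].items())
             for s in states}
    return V

states = ["A", "B"]
R = {"A": 0.0, "B": 1.0}                      # reward for being in each state
P = {"A": {"go": {"A": 0.2, "B": 0.8}},       # P[state][action] -> {next state: probability}
     "B": {"go": {"A": 0.6, "B": 0.4}}}
policy = {"A": "go", "B": "go"}               # the single available action in this toy example
print(evaluate_policy(states, R, P, policy))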
Optimal Value Function
V*(s) = maxπ Vπ(s)
V*(s) = R(s) + maxa γ Σs' Ps,a(s') . V*(s')
Optimal Policy:
The goal of a Reinforcement Learning problem is to maximise the expected value of the total reward. Hence, the optimal policy π* is the policy that maximises our long-term expected reward. Its definition contains an argmax function, which means it returns an action (a direction to move in), not a scalar value. The long-term reward is also known as 'utility'.
π*(s) = argmaxa Σs' Ps,a(s') . V*(s')
Where,
π* = optimal policy
Ps,a(s') = transition probability from state s to s' under action a
V*(s') = optimal value of the next state
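The argmax rule above can be sketched in a few lines of Python; the MDP and the value function V* used here are made-up numbers, only meant to show how the action with the highest expected next-state value is picked in each state.

def greedy_policy(states, actions, P, V_star):
    """pi*(s) = argmax_a sum_s' P_sa(s') * V*(s')."""
    return {s: max(actions,
                   key=lambda a: sum(p * V_star[s2] for s2, p in P[s][a].items()))
            for s in states}

states = ["A", "B"]
actions = ["left", "right"]
P = {"A": {"left": {"A": 1.0}, "right": {"B": 1.0}},
     "B": {"left": {"A": 1.0}, "right": {"B": 1.0}}}
V_star = {"A": 2.0, "B": 5.0}                 # assumed optimal values
print(greedy_policy(states, actions, P, V_star))  # both states choose "right"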
Algorithm to calculate π*
Value Iteration:
In the value iteration process we find out the total reward we could get by moving to different states. The algorithm looks as follows:
Initialise V(s) = 0 (for all states)
Repeat
{
For every s update
V(s) = R(s) + maxa γ Σs' Ps,a(s') . V(s')
}
Repeating this update makes V(s) converge to V*(s). The update can be written as
V := B(V)
where B is the Bellman backup operator.
Basically, let's say the agent has four ways to move from its present state: up, down, right or left. Using the value iteration method, the agent first initialises the values of all states to 0. Then it updates each state value using its current reward and the discounted sum of future rewards. Repeating this process makes the value of each state converge to the maximum value achievable from it. This value is called the optimal value (V*).
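The loop described above can be sketched in Python as follows; the two-state MDP (states, actions, rewards and transition probabilities) is a made-up example, not a real robot environment.

def value_iteration(states, actions, R, P, gamma=0.9, sweeps=100):
    """Start from V(s) = 0 and repeatedly apply the Bellman backup
    V(s) = R(s) + max_a gamma * sum_s' P_sa(s') * V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: R[s] + gamma * max(sum(p * V[s2]
                                       for s2, p in P[s][a].items())
                                   for a in actions)
             for s in states}
    return V

states = ["A", "B"]
actions = ["stay", "move"]
R = {"A": 0.0, "B": 1.0}
P = {"A": {"stay": {"A": 1.0}, "move": {"A": 0.1, "B": 0.9}},
     "B": {"stay": {"B": 1.0}, "move": {"A": 0.9, "B": 0.1}}}
print(value_iteration(states, actions, R, P))  # values converge towards V*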
Policy Iteration:
Policy iteration consists of policy search and policy improvement based on the values obtained by evaluating the current policy (solving Bellman's equation). Policy search initially starts with a random policy. Once the value estimates converge to definite values for each state after a number of iterations, policy improvement updates the policy of each state by choosing, based on these values, the best action from that state. Repeating evaluation and improvement converges to the optimal policy.
Algorithm for Policy Iteration:
Initialise π randomly
Repeat
{
Let V := Vπ (solve Bellman's equation)
Let π(s) := argmaxa Σs' Ps,a(s') . V(s')
}
This will make V ---> V* and π ---> π*
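A minimal Python sketch of this loop, assuming the transition probabilities Ps,a(s') are already known, might look as follows; the two-state MDP is again a made-up example.

def policy_iteration(states, actions, R, P, gamma=0.9):
    policy = {s: actions[0] for s in states}      # arbitrary starting policy
    while True:
        # Policy evaluation: V := Vpi (here by simple repeated sweeps)
        V = {s: 0.0 for s in states}
        for _ in range(100):
            V = {s: R[s] + gamma * sum(p * V[s2]
                                       for s2, p in P[s][policy[s]].items())
                 for s in states}
        # Policy improvement: pi(s) := argmax_a sum_s' P_sa(s') * V(s')
        new_policy = {s: max(actions,
                             key=lambda a: sum(p * V[s2]
                                               for s2, p in P[s][a].items()))
                      for s in states}
        if new_policy == policy:                  # no change: V ---> V*, pi ---> pi*
            return policy, V
        policy = new_policy

states = ["A", "B"]
actions = ["stay", "move"]
R = {"A": 0.0, "B": 1.0}
P = {"A": {"stay": {"A": 1.0}, "move": {"B": 1.0}},
     "B": {"stay": {"B": 1.0}, "move": {"A": 1.0}}}
print(policy_iteration(states, actions, R, P))    # the agent learns to move towards B and stay there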
To find the transition probabilities:
Ps,a(s') = (number of times taking action a in state s led to s') / (number of times action a was taken in state s)
Repeat
{
Take actions using π to gain experience in the MDP and update the estimates of Ps,a
Solve Bellman's equation using value iteration to get V*
Update π(s) := argmaxa Σs' Ps,a(s') . V*(s')
}
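The counting rule for Ps,a(s') can be sketched in Python as follows; the recorded (s, a, s') transitions are made-up sample data standing in for real experience.

from collections import Counter, defaultdict

def estimate_transition_probs(transitions):
    """Ps,a(s') = count of (s, a) leading to s' divided by count of (s, a)."""
    counts = defaultdict(Counter)                 # (s, a) -> Counter of next states
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    return {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
            for sa, c in counts.items()}

# Experience gathered while acting with the current policy
transitions = [("A", "move", "B"), ("A", "move", "B"), ("A", "move", "A"),
               ("B", "move", "A"), ("B", "stay", "B")]
print(estimate_transition_probs(transitions))
# e.g. {('A', 'move'): {'B': 0.666..., 'A': 0.333...}, ('B', 'move'): {'A': 1.0}, ('B', 'stay'): {'B': 1.0}}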
Reinforcement Learning Types:
There are two types of Reinforcement Learning:
1. Model Based Reinforcement Learning
2. Reinforcement Learning Based Planner
Model Based Reinforcement Learning
In model based reinforcement learning the agent is given particular transitions, so by observation it knows which state it is in, what action it takes and which state it lands in. At this point it still doesn't know what the whole model or environment looks like. Using these observed transitions, the agent figures out the structure of the environment and provides it to a modeller. The modeller uses this structure to create a model, and with this model the planner makes decisions via the MDP and produces a policy.
TRANSITIONS ---> MODELLER ---> MODEL ---> PLANNER ---> POLICY
Reinforcement Learning Based Planner
In the reinforcement learning based planner, the model or environment structure is already given to the agent. Using a simulator, it tries out different transitions, and based on these transitions the learner learns the best policy.
MODEL ---> SIMULATOR ---> TRANSITIONS ---> LEARNER ---> POLICY
Approaches to RL
- Given a state ---> Policy ---> Action
- Given a state ---> Utility ---> Value function
- Given a state & action ---> Transitions, Rewards ---> Future state & future reward
All three of these approaches are interconnected, from the third back to the first. When we have the transitions and rewards, we can solve the Bellman equation to find the value function. Once we have the value function, we can then find the policy.
Summary
Reinforcement Learning is a part of machine learning. It is a family of algorithms including value iteration, policy iteration, Q-learning, etc. In value iteration the agent calculates the values of states from the discounted values of future states using the Bellman equation and finds the optimal value. In policy iteration, the agent evaluates a sequence of policies and obtains the policy with the maximum value, which is known as the optimal policy. There are two types of Reinforcement Learning that produce an optimal policy: one takes transitions as input and the other takes a model as input. The next part will explain Q-learning.
Research Project Fellow,
Cere Labs Pvt. Ltd