For an introduction to Reinforcement Learning, its basic terminologies, concepts and types read Reinforcement Learning - Part 1 by following this link: http://blog.cerelabs.com/2017/04/reinforcement-learning-part-1.html

**Q-Learning**

Q
learning is an algorithm in reinforcement learning. It originates
from the model based reinforcement learning. It can be referred to as
a different kind of value function. The values are called Q values
and are denoted by Q(s,a). It signifies the Q value when in a state
's' and taking an action 'a'.

Mathematically,

**Q(s,a) = R(s) + Î³ Î£**

_{s' }P(s,a,s') max_{a' }Q(s',a')
It
can be defined as the value for arriving in a state which is obtained
by learning via action 'a' and proceeding optimally thereafter.

Also,

**V(s) = max**

_{a }Q(s,a)

**Ð»(s) = argmax**

_{a }Q(s,a)
V(s)
is a value, i.e. it returns a number , a scalar value in particular,
whereas Ð»(s) returns an action. Hence, it is written as argmax.

V(s) =
The reward and behaving optimally thereafter.

Q(s) =
The reward and then taking a specific action and behaving optimally
thereafter.

We
can convert Q into V if we always pick the best action.

Therefore,

**V(s) = max**

_{a }Q(s,a)
Ð»(s)
is the policy we want to follow that maximizes your value going
towards the wall, except the difference is, it is returning an action
not an value, so it should be argmax.

**Ð»(s) = argmax**

_{a }Q(s,a)**To find Q in Q-learning**

To
solve the Q(s,a) equation we don't have the P(s,a,s') values. But in
the learning scenario we don't have the models we have the
transitions.

Estimating
Q from transitions,

**Q(s,a) = R(s) + Î³ Î£**

_{s' }P(s,a,s') max_{a' }Q(s',a')
A
transition means: We observe that we are in a particular state s of
the MDP, then action 'a' was chosen somehow, then transition happens
and we land in a state and we find out what state we are in.

Now,
imagine we got an estimate of the Q function Q^(s,a), and we are
going to update it using learning rate alpha.

**Q^(s,a) <-- Î± --- { R(s) + Î³ max**

_{a'}Q^(s',a')} is utility of a state
Where,

Î± = learning rate

R(s)
= Immediate reward

Î³
max

_{a'}Q^(s',a') = discounted estimated value of the next state (utility of next state)
With
each iteration Q^(s,a) approaches towards the sum of the immediate
reward and discounted estimated value of the next state. This is also
called as utility of state.

eg.

V
<== Î± == X means moving alpha percent of value X to V with each
iteration.

**V = (1- Î± V) + Î±X**

As
we know s , a, s' ,r,

**Q^(s,a) <--Î±**

_{t}--- R(s) + Î³ max_{a'}Q^(s',a')
Where,

Î±

_{t }= learning rate increasing over time.**Q^(s,a) = E { R(s) + Î³ max**

_{a'}Q^(s',a')}**= R(s) + Î³ E**

_{s'}[ max_{a'}Q^(s',a') ]**= R(s) + Î³ Î£**

_{s' }P(s,a,s') Q^(s',a')
The
value of Q^(s,a) actually approaches to the value of the 'utility of
state' term. But, the term value is changing over time, hence it is a
moving target. There is a theorem by which this equation can solve
the Markov Decision Process, it is called the Q- learning convergence
theorem.

**Q-Learning Convergence**

In
this theorem, Q^(s,a) starts at a random value. Then with each
iteration, the value of Q^(s,a) converges to Q(s,a). The speed of
convergence depends on learning rate alpha.

**Q^(s,a) <--Î±**

_{t}--- R(s) + Î³ max_{a'}Q^(s',a')**Q^(s,a) ---converges to -------> Q(s,a)**

This
is the actual solution to Bellman's equation. But, it is only true if
we visit (s,a) infinitely often.

__Choosing Actions__

Q-learning
is a family of algorithms. In order to solve the above problem we
need to find three things:

- How to initialize Q^?
- How to decay Î± ?
- How to choose actions?

There
are two options of choosing actions:

- Always Choose a
_{0} - Choose 'a' randomly, then take all a's and find best Q(s,a)

In
the second option of choosing randomly we might learn the best action
to take to get best Q(s,a) but we don't use it. So, the next step is
to use it.

Now
we initialize,

**Q^(s , a**

_{0 }) = awesome; for all states**and**

**Q^(s , a ) = terrible ; for all actions except a**

_{0}
So,
when it goes to a

_{0 }, it gets awesome results. But, whenever it goes to any other 'a' then it gets really really bad results. Hence, it continues to use a_{0}action forever i.e. infinitely. This is known as**greedy action selection strategy.**
So,
we have to be careful where we start and then take random restarts.
But this process could be very slow. So, instead of taking random
restarts we can just take random action once in a while. Once in a
state, find the estimated value for each action and take that with
probability ( 1 - Îµ).

Where,

Îµ
= randomness factor

**Ð»^(s) = argmax**

_{a }Q^(s,a) ; for P =( 1 - Îµ )**and**

**Random Action for P = Îµ**

Assuming
'Îµ ' is small.

Hence,
we visit (s,a) an infinite number of times as long as (Îµ > 0 ).

**Îµ- Greedy Exploration:**

If
action selection is GLIE ( Greedy in limit with infinite exploration
). This means we start off 'Îµ' randomly then get less and less
random and more and more greedy. We get two facts from this

**Q^ -----> Q**<====

**Learn**

**Ð»^ -----> Ð»**<====

**Use**

This
is an example of Exploration- Exploitation dilemma.

**Summary**

This part, therefore gives a brief idea of Q-learning algorithm, q-values and their mathematical representation. Until now, the reinforcement learning robot has observed its present state and found out the q-values for all possible states in future. Using Q-value convergence theorem we find the appropriate q-values and using Îµ- greedy exploration take proper action. This part therefore covers up how the robot works conceptually. The next and the final part will show how the robot interacts with the real world.

**References:**

- To Learn Reinforcement Learning follow this link: https://classroom.udacity.com/courses/ud600/lessons/4100878601/concepts/41234786450923

- To read the first part follow this link:

http://blog.cerelabs.com/2017/04/reinforcement-learning-part-1.html

By Sanjuksha Nirgude,

Research Project Fellow,

Cere Labs Pvt. Ltd

Cere Labs Pvt. Ltd

## Comments

## Post a Comment