
Reinforcement Learning Part - 3

For an introduction to Reinforcement Learning, its basic terminology, concepts and types, read Reinforcement Learning - Part 1. For an introduction to the Q-learning algorithm, Q-values and their mathematical representation, read Reinforcement Learning - Part 2.

Robotics and AI - A Perfect Blend

Robotics and artificial intelligence have each made huge advances in their respective fields, and the blend of the two has produced some amazing results too. A future in which robots use artificial intelligence to teach themselves what humans can do does not seem far off. This article offers a peek into that future of robot learning and an introduction to applications of reinforcement learning.


A variety of problems can be solved using reinforcement learning, because agents can learn without expert supervision. In this project we created a crawler robot. The robot uses two arms to crawl and push its body towards a wall. After reaching the wall, it stops and declares that it has reached the wall. In the process, the robot learns by itself the actions its arm has to take in order to move towards the wall. This is where reinforcement learning comes into the picture: it enables the robot to move its arms into different positions, or states, and to decide on the sequence of actions that maximizes the rewards.

In the case of the crawler robot, the states are the positions of its arm and hand. The current state is denoted by "s" and the next state by "s'". The available actions are moving the arm up or down and moving the hand left or right; an action is denoted by "a". The rewards are given as follows: if the robot moves nearer the wall, i.e. the distance from the wall decreases by 4 cm or more, it receives a reward of +1. When the robot moves away from the wall by 4 cm or more, it receives a reward of -2. Also, if the robot stays in a particular state without moving, it receives a reward of -2.
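The reward rule can be sketched as a small function. This is only an illustration: the names `compute_reward` and `THRESHOLD_CM` are hypothetical, and treating any change smaller than 4 cm as "not moving" is our assumption, since the text does not say how such small movements were scored.

```python
# Hypothetical names; the threshold and reward values come from the text.
THRESHOLD_CM = 4.0
REWARD_CLOSER = 1
REWARD_AWAY = -2
REWARD_STANDING = -2


def compute_reward(previous_distance_cm, current_distance_cm):
    """Return the reward for one action from the change in wall distance."""
    moved_closer_by = previous_distance_cm - current_distance_cm
    if moved_closer_by >= THRESHOLD_CM:
        return REWARD_CLOSER
    if moved_closer_by <= -THRESHOLD_CM:
        return REWARD_AWAY
    # Assumption: a change smaller than the threshold counts as not moving.
    return REWARD_STANDING
```

For example, going from 50 cm to 45 cm would earn +1, while going from 50 cm to 55 cm, or staying at roughly 50 cm, would earn -2.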

Reinforcement learning for the Robot:

Here we use model-based reinforcement learning to train our crawler robot to move towards the wall. With the model-based approach we already know the transitions, so by observation the robot knows what state it is in. The state can consist of the current distance from the wall, the current position of its arm and the current position of its hand. The next thing the robot knows is what actions it can take. If the arm is in the up position it can be moved to the down position, and similarly for the hand: if it is in the left position it can be moved to the right position. The robot also knows its distance from the wall; if it is less than 10 cm, the robot has to stop its motors.
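What the robot "knows" here can be written down as two small helpers. This is a sketch under our own naming (`available_actions`, `should_stop` and the string labels are hypothetical); only the 10 cm stop distance and the up/down, left/right moves come from the text.

```python
from typing import List

STOP_DISTANCE_CM = 10.0  # from the text: stop when closer than 10 cm


def available_actions(arm: str, hand: str) -> List[str]:
    """List the moves that would change the current arm/hand position."""
    arm_move = "arm_down" if arm == "up" else "arm_up"
    hand_move = "hand_right" if hand == "left" else "hand_left"
    return [arm_move, hand_move]


def should_stop(distance_cm: float) -> bool:
    """True once the robot is close enough to the wall to stop its motors."""
    return distance_cm < STOP_DISTANCE_CM
```

With the arm up and the hand left, the position-changing moves would be `arm_down` and `hand_right`.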

Based on the above data, we know the reward the robot gets for staying in the same position, which is -2. Once it has the transitions, the robot starts building a model of the environment for itself. It now knows that only moving towards the wall earns positive rewards. Hence, it knows its possible future states and rewards.
After observing the current and future states and rewards, it can use value iteration to find the value function for each state. Using these values and policy iteration, it can find the best policy to get maximum rewards.

To sum up, the robot observes the current positions of its hand and arm and its distance from the wall, calculates the value of each state it could move to and the consequences it would face, and finally arrives at a best policy. In the experiment performed on the crawler robot, when it learned to move towards the wall, it figured out that it was getting rewards when its arm was down and its hand moved first to the left and then to the right. This wasn't the perfect way to move towards the wall; the perfect sequence of actions was: move arm up, move hand left, move arm down, move hand right. But because the actions it had learned to perform gave results, it kept performing the same actions repeatedly until it reached the wall.

Now, let us see how the robot learned this procedure step by step.

Hardware of the crawler robot:

The robot consists of a 125 x 105 mm chassis with two 65 mm diameter dummy wheels and a single castor wheel. The hand is 80 mm long including the gripper. The gripper is a Lego rubber wheel that helps the robot push itself towards the wall while the gripper is in contact with the ground. The arm is 100 mm long. The arm is connected to the chassis by a 3.17 kg-cm Futaba S3003 servo motor, and to the hand by another Futaba S3003 motor. The robot also has an ultrasonic sensor on the side facing the wall. The motors and the ultrasonic sensor are connected to a Raspberry Pi 3 board. Connections of the Raspberry Pi to the ultrasonic sensor and the servos are given in the figures below.

In the following figures the servo is connected to the Raspberry Pi 3 and a 5 V power source. You can also power the servo using the Raspberry Pi 5V and GND pins. The motor has three wires, commonly coloured red, brown and yellow. Connect the yellow wire to a GPIO pin, red to 5V and brown to GND on the Raspberry Pi.

Q Learning in the Crawler Robot:
The crawler robot has 4 states, namely arm up - hand left, arm up - hand right, arm down - hand left and arm down - hand right. It has 4 actions, namely arm up, arm down, hand left and hand right.
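One way to index these states and actions so they can address rows and columns of a Q table is sketched below. The index order and the helper names (`state_index`, `next_state`) are our assumptions, and the transition function assumes ideal servos that always reach the commanded position.

```python
# Assumed index order; only the four states and four actions come from the text.
STATES = [
    ("up", "left"),     # state 0
    ("up", "right"),    # state 1
    ("down", "left"),   # state 2
    ("down", "right"),  # state 3
]
ACTIONS = ["arm_up", "arm_down", "hand_left", "hand_right"]


def state_index(arm, hand):
    """Map an (arm, hand) position to its state number."""
    return STATES.index((arm, hand))


def next_state(state, action):
    """Return the state reached by applying an action index to a state index."""
    arm, hand = STATES[state]
    part, position = ACTIONS[action].split("_")
    if part == "arm":
        arm = position
    else:
        hand = position
    return state_index(arm, hand)
```

Under this numbering, applying `arm_down` (action 1) in state 0 leads to state 2, and applying `arm_up` in state 0 leaves the robot in state 0, which is the "staying in the same state" case.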

Observing Transitions:
Initially all pins are set up, and the motors are started and positioned at arm up - hand left, which is the zero state. The distance from the wall is measured. Then learning is started.

Learning Mode:
In this mode the robot initially models the environment around it and starts planning. It first observes the current position of its hand and arm. The robot then creates a 4 x 4 (4 states and 4 actions) Q table with random float values in the range -1 to 1. Then it starts an episode. There are 10 episodes in total. In each episode it follows these steps:
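A minimal sketch of the Q table initialisation, using only the Python standard library. The function name `make_q_table` and the optional `seed` argument are hypothetical; the 4 x 4 shape, the range of -1 to 1 and the 10 episodes come from the text.

```python
import random

N_STATES = 4
N_ACTIONS = 4
N_EPISODES = 10


def make_q_table(n_states=N_STATES, n_actions=N_ACTIONS, seed=None):
    """Create a table of random floats in [-1, 1], one row per state."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(n_actions)]
        for _ in range(n_states)
    ]


q_table = make_q_table(seed=0)
```

Here `q_table[s][a]` holds the current estimate for taking action `a` in state `s`.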

Step 1: Getting the current position of the hand and arm and deciding which state the robot is in at present.

Step 2: Deciding the action to be taken depending on the maximum value in the Q table.

Step 3: Calculating (predicting) the reward for that action.

Step 4: Comparing the decided action with the last action taken; if it is the same, taking another action.

Step 5: After taking the action, calculating the actual reward received for that action. The reward depends upon the difference between the current distance of the robot from the wall and the previous distance from the wall.

Step 6: Again observing the current state and considering it as the next state, or state prime (s'), in reference to the previous state.

Step 7: Choose whether the next action should be a random action or a definite action. This depends upon the randomness 'ε', which is compared to a random value between 0 and 1. If the random value is less than ε, choose a random action; otherwise choose the action with the maximum Q value from the Q table.

Step 8: Decay the randomness with random action decay rate.

Step 9: Find Q(s,a) = (1 - α) × Q(s,a) + α × (reward + γ × Q(s',a')). Update the Q value of state 's' and action 'a' in the Q table to the newly found value.

Step 10: Update state to state prime and action to action prime.

Step 11: Save the newly found Q table in a .csv file.

Step 12: Take the chosen action.

Step 13: Check if distance from wall is less than 10 cm. If not repeat all steps. If yes, stop the motors and declare "Reached the wall".
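Steps 4, 7, 8 and 9 can be sketched in a few lines of Python. This is not the code that ran on the robot: the function names are hypothetical, the multiplicative decay in `decay_epsilon` is an assumption (the text only says the randomness is decayed), and the selection uses the usual epsilon-greedy convention of exploring when the random draw falls below ε.

```python
import random


def choose_action(q_row, epsilon, last_action=None, rng=random):
    """Epsilon-greedy choice that never repeats the last action (Step 4)."""
    candidates = [a for a in range(len(q_row)) if a != last_action]
    if rng.random() < epsilon:
        return rng.choice(candidates)  # explore: random action
    return max(candidates, key=lambda a: q_row[a])  # exploit: best Q value


def decay_epsilon(epsilon, decay_rate):
    """Step 8: shrink the randomness (multiplicative decay is an assumption)."""
    return epsilon * decay_rate


def update_q(q_table, s, a, reward, s_next, a_next, alpha, gamma):
    """Step 9: Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (reward + gamma * Q(s',a'))."""
    target = reward + gamma * q_table[s_next][a_next]
    q_table[s][a] = (1 - alpha) * q_table[s][a] + alpha * target
    return q_table[s][a]
```

As a worked example, with Q(s,a) = 0, Q(s',a') = 1, reward = 1, α = 0.5 and γ = 0.9, the new value is 0.5 × 0 + 0.5 × (1 + 0.9 × 1) = 0.95.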

Problems and Observations During Q learning of Crawler Robot:

  • Servo motor control:
Every servo motor has a different pulse width at which it sits at zero. Initially the frequency was 50 Hz and my duty cycle was 1, 0.5, etc., but the motor never stopped at zero. When the program was stopped using KeyboardInterrupt, it used to stop at a position near zero but not exactly at zero. When I increased the frequency to 500 Hz, the motor stopped exactly at zero for a duty cycle of 0. Hence, the conclusion is that the frequency affects the final position of the motor. But at 500 Hz the motor did not rotate through its full range, so using a 50 Hz frequency is optimal.
For the Futaba S3003:

Motor angle | Pulse width (duty cycle on-time)
0 degree | 1 ms
90 degree | 1.5 ms
180 degree | 2 ms

  • Same action problem
In the initial phase Step 4 was not included. Step 4 checks whether the last action taken is the same as the current action. Without Step 4, the robot used to perform the same action again and again. This wasted a lot of time in the learning phase, so the robot took a long time to learn completely.

  • Delay problem
Initially we had delays between actions, which weren't necessary and only slowed down the learning process.

  • Ultrasonic Sensor Efficiency
The ultrasonic sensor used in this robot is the HC-SR04. It gives an error of approximately 0.01 m, so the difference between the old distance and the current distance required for a reward was kept at 0.04 m or more. This sensor has some limitations. Once started, the sensor needs a settling time of 1 s; otherwise it behaves erratically. The sensor also has limited range: if the wall is more than 1 m away, it behaves erratically and gives random readings.

  • Learning Behaviour
The reward threshold on the difference between the current distance and the old distance is 4 cm. This means that unless an action moves the robot by 4 cm, it doesn't get a positive or negative reward for that action. Due to this, while learning the robot would move approximately 2 cm away from the wall, then around 8 cm towards it. This affected the learning efficiency.

  • Standing Reward
Initially the robot wasn't given a standing reward: if it was not moving, it got no reward. When a negative reward was introduced for not moving, the learning process improved and the learning time was reduced.
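Coming back to the servo timings listed under "Servo motor control", the pulse widths can be converted into PWM duty-cycle percentages with a little arithmetic. The sketch below assumes the pulse width varies linearly between 1 ms at 0 degrees and 2 ms at 180 degrees; the function names are hypothetical, and a real servo may need slightly different end points.

```python
PWM_FREQUENCY_HZ = 50.0
MIN_PULSE_MS = 1.0  # pulse width at 0 degrees
MAX_PULSE_MS = 2.0  # pulse width at 180 degrees


def pulse_width_ms(angle_deg):
    """Linearly interpolate the pulse width for an angle (an assumption)."""
    if not 0.0 <= angle_deg <= 180.0:
        raise ValueError("angle must be between 0 and 180 degrees")
    return MIN_PULSE_MS + (MAX_PULSE_MS - MIN_PULSE_MS) * angle_deg / 180.0


def duty_cycle_percent(angle_deg, frequency_hz=PWM_FREQUENCY_HZ):
    """Duty cycle = pulse width divided by the PWM period, as a percentage."""
    period_ms = 1000.0 / frequency_hz
    return 100.0 * pulse_width_ms(angle_deg) / period_ms
```

At 50 Hz the period is 20 ms, so 1 ms, 1.5 ms and 2 ms pulses correspond to duty cycles of 5%, 7.5% and 10%. At 500 Hz the period is only 2 ms, so the same pulses would need duty cycles of 50% to 100%, which may be related to the reduced range of motion we saw at that frequency. A percentage computed this way could be passed to a PWM call such as `ChangeDutyCycle` in RPi.GPIO.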


