
### Reinforcement Learning Part - 3

For an introduction to Reinforcement Learning, its basic terminology, concepts and types, read Reinforcement Learning - Part 1 at http://blog.cerelabs.com/2017/04/reinforcement-learning-part-1.html. For an introduction to the Q-learning algorithm, Q-values and their mathematical representation, read Reinforcement Learning - Part 2 by following this link:

CRAWLER ROBOT

Robotics and AI- A Perfect Blend

Robotics and artificial intelligence have each made huge advances in their respective fields, and the blend of the two has produced some amazing results as well. The day does not seem far off when robots, using artificial intelligence, can teach themselves what humans can learn. This article offers a peek into the future of robot learning and an introduction to applications of reinforcement learning.

Introduction

A variety of problems can be solved using reinforcement learning, because agents can learn without expert supervision. In this project we created a crawler robot. The robot crawls by using its arm and hand to push its body towards a wall. After reaching the wall it stops and declares that it has reached the wall. On the way to the wall, the robot learns by itself the actions that its arm and hand have to take in order to move towards the wall. This is where reinforcement learning comes into the picture. Reinforcement learning enables the robot to move its arm and hand into different positions, or states, and to decide the sequence of actions that maximizes the rewards.

Terminology:
In the case of the crawler robot, the states are the positions of its arm and hand. The current state is denoted by "s" and the next state by "s'". The available actions are moving the arm up or down and moving the hand left or right. An action is denoted by "a". The rewards are given as follows: if the robot moves nearer to the wall, i.e. the distance from the wall decreases by 4 cm or more, it receives a reward of +1. When the robot moves away from the wall by 4 cm or more, it receives a reward of -2. Also, if the robot stays in a particular state without moving, it receives a reward of -2.
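The reward rule above can be sketched as a small function. This is only an illustration, not the code that ran on the robot: the name `reward_for_move` is hypothetical, and we assume that any change smaller than 4 cm counts as "staying without moving".

```python
THRESHOLD_CM = 4.0       # minimum change in distance that counts as a move
REWARD_CLOSER = 1        # moved 4 cm or more towards the wall
REWARD_AWAY = -2         # moved 4 cm or more away from the wall
REWARD_STANDING = -2     # assumption: sub-threshold changes count as standing


def reward_for_move(previous_cm, current_cm, threshold_cm=THRESHOLD_CM):
    """Return the reward for one action from the change in wall distance."""
    moved_closer_by = previous_cm - current_cm  # positive when the robot got closer
    if moved_closer_by >= threshold_cm:
        return REWARD_CLOSER
    if moved_closer_by <= -threshold_cm:
        return REWARD_AWAY
    return REWARD_STANDING
```

For example, going from 50 cm to 45 cm would earn +1, while going from 50 cm to 49 cm would be treated as standing still and earn -2.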

## Reinforcement learning for the Robot:

Here we use model-based reinforcement learning to train our crawler robot to move towards the wall. With model-based reinforcement learning the transitions are already known, so by observation the robot knows what state it is in. The state can consist of the current distance from the wall, the current position of its arm and the current position of its hand. The robot also knows what actions it can take: if the arm is in the up position it can be moved to the down position, and similarly, if the hand is in the left position it can be moved to the right position. It also knows its distance from the wall; if this is less than 10 cm, the robot has to stop its motors.

Based on the data above, we know the reward the robot gets for staying in the same position, which is -2. Once it has the transitions, the robot starts building a model of the environment for itself. It now knows that only moving towards the wall earns positive rewards. Hence, it knows its possible future states and rewards.
After observing current and future states and rewards, it can use value iteration to find the value function for each state. Using these values and policy iteration, it can find the best policy to obtain maximum rewards.
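To make the idea of value iteration concrete, here is a minimal sketch on a toy model of the crawler. Everything in the model is an assumption made for illustration: transitions are deterministic, only moving the hand right while the arm is down and the hand is left earns +1, every other action earns -2, and the discount factor `GAMMA` is 0.9. The real robot's rewards come from the distance sensor, not from a table like this.

```python
STATES = [(arm, hand) for arm in ("up", "down") for hand in ("left", "right")]
ACTIONS = ("arm_up", "arm_down", "hand_left", "hand_right")
GAMMA = 0.9  # assumed discount factor


def step(state, action):
    """Hypothetical deterministic model: return (next_state, reward)."""
    arm, hand = state
    if action.startswith("arm_"):
        next_state = (action[len("arm_"):], hand)
    else:
        next_state = (arm, action[len("hand_"):])
    pushed = state == ("down", "left") and action == "hand_right"
    return next_state, (1.0 if pushed else -2.0)


def value_iteration(tol=1e-9):
    """Repeat the Bellman optimality backup until the values stop changing."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {
            s: max(r + GAMMA * values[s2]
                   for s2, r in (step(s, a) for a in ACTIONS))
            for s in STATES
        }
        if max(abs(new_values[s] - values[s]) for s in STATES) < tol:
            return new_values
        values = new_values


def greedy_policy(values):
    """Pick, in each state, the action with the best one-step lookahead."""
    policy = {}
    for s in STATES:
        policy[s] = max(
            ACTIONS,
            key=lambda a: step(s, a)[1] + GAMMA * values[step(s, a)[0]],
        )
    return policy
```

In this toy model the greedy policy chooses `hand_right` whenever the arm is down and the hand is left, since that is the only rewarded move.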

To sum up, after observing the current positions of its hand and arm and its distance from the wall, the robot calculates the value of each state it could move to and the consequences it would face, and finally arrives at a best policy. In the experiment performed on the crawler robot, when it learned to move towards the wall, it figured out that it received rewards when its arm was down and its hand moved first to the left and then to the right. This was not the perfect way to move towards the wall. The perfect sequence of actions was: move arm up, move hand left, move arm down, move hand right. But because the actions it had learned to perform gave results, it kept performing the same actions repeatedly until it reached the wall.

Now, let us see how the robot learned this procedure step by step.

## Hardware of the crawler robot:

The robot consists of a 125 x 105 mm chassis, with two 65 mm diameter dummy wheels and a single castor wheel. The length of the hand is 80 mm including the gripper. The gripper here is a Lego rubber wheel, which helps the robot push itself towards the wall while the gripper is in contact with the ground. The length of the arm is 100 mm. The arm is connected to the chassis by a 3.17 kg-cm Futaba S3003 servo motor. The arm is connected to the hand by another Futaba S3003 motor. The robot also has an ultrasonic sensor on its side facing the wall. The motors and the ultrasonic sensor are connected to a Raspberry Pi 3 board. Connections of the Raspberry Pi to the ultrasonic sensor and the servos are given in the figures below.

In the following figures the servo is connected to the Raspberry Pi 3 and a 5V power source. You can also power the servo using the Raspberry Pi 5V and GND pins. The motor has three wires, commonly coloured red, brown and yellow. Connect the yellow wire to a GPIO pin, the red wire to 5V and the brown wire to GND on the Raspberry Pi.

Q Learning in the Crawler Robot:
The Crawler Robot has 4 states: arm up - hand left, arm up - hand right, arm down - hand left, and arm down - hand right. It has 4 actions: arm up, arm down, hand left and hand right.
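One possible way to encode these states and actions in code is sketched below. The names `Action`, `STATES`, `state_index` and `apply_action` are hypothetical. State 0 is arm up - hand left, and the order of the remaining indices simply follows the list above, which is our assumption.

```python
from enum import IntEnum


class Action(IntEnum):
    ARM_UP = 0
    ARM_DOWN = 1
    HAND_LEFT = 2
    HAND_RIGHT = 3


# (arm position, hand position); the list index is the state number.
STATES = [("up", "left"), ("up", "right"), ("down", "left"), ("down", "right")]


def state_index(arm, hand):
    """Map an (arm, hand) position pair to its row in a 4 x 4 Q table."""
    return STATES.index((arm, hand))


def apply_action(arm, hand, action):
    """Return the (arm, hand) positions expected after an action."""
    if action == Action.ARM_UP:
        return "up", hand
    if action == Action.ARM_DOWN:
        return "down", hand
    if action == Action.HAND_LEFT:
        return arm, "left"
    return arm, "right"
```

With integer states and actions, a Q value can be looked up as `q_table[state_index(arm, hand)][action]`.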

Observing Transitions:
Initially all pins are set up, and the motors are started and positioned at arm up - hand left, which is the zero state. The distance from the wall is measured. Then learning starts.

Learning Mode:
In this mode the robot first models the environment around it and starts planning. It observes the current position of its hand and arm. The robot then creates a 4 x 4 (4 states and 4 actions) Q table filled with random float values between -1 and 1. Then it starts an episode. There are 10 episodes in total. In each episode it follows these steps:

Step 1: Get the current position of the hand and arm and decide which state the robot is in at present.

Step 2: Decide the action to be taken based on the maximum value in the Q table.

Step 3: Calculate (predict) the reward for that action.

Step 4: Compare the decided action with the last action taken; if they are the same, take another action.

Step 5: After taking the action, calculate the actual reward received for it. The reward depends on the difference between the robot's current distance from the wall and its previous distance from the wall.

Step 6: Observe the current state again and treat it as the next state, or state prime (s'), with reference to the previous state.

Step 7: Choose whether the next action should be a random action or a definite action. This depends on the randomness 'ε', which is compared to a random value between 0 and 1. If the random value is less than ε, choose a random action; otherwise choose the action with the maximum Q value from the Q table.

Step 8: Decay the randomness ε using the random action decay rate.

Step 9: Find Q(s,a) = (1 - α) x Q(s,a) + α x (reward + γ x Q(s',a')). Update the Q value of state 's' and action 'a' in the Q table to the newly found value.

Step 10: Update state to state prime and action to action prime.

Step 11: Save the newly found Q table in a .csv file.

Step 12: Take the chosen action.

Step 13: Check if distance from wall is less than 10 cm. If not repeat all steps. If yes, stop the motors and declare "Reached the wall".
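The core of the steps above (creating the Q table, choosing between a random and a greedy action, updating Q(s,a) and saving the table) can be sketched in plain Python as follows. This is a minimal sketch, not the program that ran on the robot: all function names are hypothetical, states and actions are assumed to be integers 0 to 3, and exploration follows the usual convention of acting randomly when a random draw falls below ε. The update mirrors Step 9 and uses the Q value of the chosen next action a', which textbooks usually call a SARSA-style update.

```python
import csv
import random

N_STATES, N_ACTIONS = 4, 4


def make_q_table(rng):
    """4 x 4 table of random floats between -1 and 1."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(N_ACTIONS)]
            for _ in range(N_STATES)]


def choose_action(q_table, state, epsilon, rng):
    """Step 7: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    row = q_table[state]
    return row.index(max(row))


def q_update(q_table, s, a, reward, s_next, a_next, alpha, gamma):
    """Step 9: Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (reward + gamma * Q(s',a'))."""
    target = reward + gamma * q_table[s_next][a_next]
    q_table[s][a] = (1 - alpha) * q_table[s][a] + alpha * target
    return q_table[s][a]


def decay(epsilon, decay_rate):
    """Step 8: shrink the randomness after every step."""
    return epsilon * decay_rate


def save_q_table(q_table, path):
    """Step 11: write the table to a .csv file, one state per row."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(q_table)
```

On the robot, one pass of the loop would call `choose_action`, move the servos, measure the new distance to compute the reward, and then call `q_update`, `decay` and `save_q_table` before checking the 10 cm stop condition. The values of α, γ, ε and the decay rate are not given in this article, so they are left as parameters.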

## Problems and Observations During Q learning of Crawler Robot:

• Servo motor control:
Every servo motor has a slightly different pulse width at which it sits at its zero position. Initially the frequency was 50 Hz and the duty cycle was set to values such as 1 and 0.5, but the motor never stopped at zero. When the program was stopped using KeyboardInterrupt, the motor would stop near zero but not exactly at zero. When the frequency was increased to 500 Hz, the motor stopped exactly at zero for a duty cycle of 0. Hence, the frequency affects the final position of the motor. However, at 500 Hz the motor did not rotate through its full range, so a frequency of 50 Hz is optimal.
For FUTABA S3003

| Motor angle | Pulse width (ms) | Duty cycle (%) |
|---|---|---|
| 0 degrees | 1 | 5 |
| Neutral | 1.5 | 7.5 |
| 180 degrees | 2 | 10 |
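The duty cycle values in the table follow from the pulse width and the 50 Hz PWM period of 20 ms. The helper below is a small sketch of that arithmetic. The function names are hypothetical, and the linear mapping from angle to pulse width between the table's 1 ms and 2 ms end points is an assumption; as noted above, a real servo may need small corrections.

```python
def duty_cycle_percent(pulse_ms, frequency_hz=50.0):
    """Duty cycle is the pulse width as a percentage of the PWM period."""
    period_ms = 1000.0 / frequency_hz  # 20 ms at 50 Hz
    return 100.0 * pulse_ms / period_ms


def angle_to_duty_cycle(angle_deg, frequency_hz=50.0):
    """Assumed linear mapping: 0 degrees -> 1 ms, 180 degrees -> 2 ms."""
    pulse_ms = 1.0 + angle_deg / 180.0
    return duty_cycle_percent(pulse_ms, frequency_hz)
```

On the Raspberry Pi, a value like this would typically be passed to a PWM object, for example `ChangeDutyCycle` in the RPi.GPIO library. Which library the robot used is not stated in this article.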

• Same action problem
In the initial phase, step 4 was not included. Step 4 checks whether the last action taken is the same as the current action. Without step 4, the robot would perform the same action again and again. This wasted a lot of time in the learning phase, so the robot took a long time to learn completely.

• Delay problem
Initially we had delays between the actions taken. These were not necessary and only slowed down the learning process.

• Ultrasonic Sensor Efficiency
The ultrasonic sensor used in this robot is the HC-SR04. It has an error of approximately 0.01 m, so the minimum difference between the old distance and the current distance was set to 0.04 m. This sensor has some limitations. Once started, the sensor needs a settling time of 1 second, otherwise it behaves erratically. The sensor also has a limited range: if the wall is more than 1 m away from the sensor, it behaves erratically and gives random readings.

• Learning Behaviour
The required difference between the current distance and the old distance is 4 cm. This means that unless the robot has moved 4 cm as a result of the action, it does not get a positive or negative reward for that action. Because of this, while learning, the robot would move approximately 2 cm away from the wall and then around 8 cm towards it. This affected the learning efficiency.

• Standing Reward
Initially the robot was not given a standing reward: if it was not moving, it received no reward. When a negative reward was introduced for not moving, the learning process improved and the learning time was reduced.
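The sensor limits described above can be captured in a few helper functions. This is a hedged sketch: the names are hypothetical, the speed of sound (343 m/s, roughly room temperature) is our assumption, and the 1 m range limit and 0.04 m threshold are the values reported in this article rather than datasheet figures.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def echo_to_distance_m(echo_seconds):
    """The echo pulse covers the round trip, so halve the travelled distance."""
    return echo_seconds * SPEED_OF_SOUND_M_S / 2.0


def is_usable(distance_m, max_range_m=1.0):
    """Reject readings beyond the range where the sensor behaved reliably."""
    return 0.0 < distance_m <= max_range_m


def significant_change(old_m, new_m, threshold_m=0.04):
    """Only treat a change as movement if it exceeds the sensor noise margin."""
    return abs(old_m - new_m) >= threshold_m
```

A reading would be used for the reward only if `is_usable` returns true, and a move would count only if `significant_change` returns true. The 1 second settling time after start-up is not modelled here.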
