
Helping the Blind See


We take the sense of vision for granted in our day-to-day lives, but only a visually impaired person can understand its true value and necessity. Soon, however, AI-based computer vision systems may help blind and visually impaired people navigate.

Tech giants such as Google, Baidu, Facebook, and Microsoft are working on a range of products that apply deep learning for the visually impaired. One of them is image captioning, in which a system describes the content of an image. To accelerate further research and broaden the possible applications of this technology, Google released the latest version of its image captioning system as an open-source model in TensorFlow, called "Show and Tell: A Neural Image Caption Generator". The project can be found at https://github.com/tensorflow/models/tree/master/im2txt and the full paper at https://arxiv.org/abs/1609.06647.

The Show and Tell model is an example of an encoder-decoder neural network. It works by first "encoding" an image into a fixed-length vector representation, and then "decoding" the representation into a natural language description.

The image encoder is a deep convolutional neural network. This type of network is widely used for image tasks and is currently state-of-the-art for object recognition and detection. The Inception v3 image recognition model pretrained on the ILSVRC-2012-CLS image classification dataset is used as the encoder.
The decoder is a long short-term memory (LSTM) network. This type of network is commonly used for sequence modeling tasks such as language modeling and machine translation. In the Show and Tell model, the LSTM network is trained as a language model conditioned on the image encoding.
Words in the captions are represented with an embedding model. Each word in the vocabulary is associated with a fixed-length vector representation that is learned during training.
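To make the encode-then-decode flow concrete, here is a toy sketch of it. This is not the real model: NumPy stand-ins with random weights replace Inception v3 and the trained LSTM, the recurrent step is a plain tanh cell rather than an LSTM, and the five-word vocabulary is purely illustrative. It only shows how the image encoding seeds the decoder state and how word embeddings feed each greedy decoding step.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8  # toy word-embedding / state size
VOCAB = ["<start>", "a", "street", "light", "<end>"]  # illustrative vocabulary

# "Encoder": in the real model this is Inception v3 producing a fixed-length
# vector; here a random vector stands in for the image encoding.
image_encoding = rng.normal(size=EMBED_DIM)

# Word embeddings (learned during training in the real model; random here).
embeddings = rng.normal(size=(len(VOCAB), EMBED_DIM))

# "Decoder" weights: a toy recurrent cell standing in for the trained LSTM.
W_h = rng.normal(size=(EMBED_DIM, EMBED_DIM))
W_x = rng.normal(size=(EMBED_DIM, EMBED_DIM))
W_out = rng.normal(size=(EMBED_DIM, len(VOCAB)))

def decode(encoding, max_len=10):
    """Greedy decoding: initialise the state from the image encoding,
    then emit the most likely word at each step until <end>."""
    h = encoding                      # state conditioned on the image
    word = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        # One recurrent step: mix previous state with the last word's embedding.
        h = np.tanh(h @ W_h + embeddings[word] @ W_x)
        logits = h @ W_out            # score every vocabulary word
        word = int(np.argmax(logits))
        if VOCAB[word] == "<end>":
            break
        caption.append(VOCAB[word])
    return caption

print(decode(image_encoding))
```

With random weights the emitted words are meaningless, of course; the point is the conditioning structure, which is what training against reference captions shapes in the real model.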
Example captions generated by the model for our test images (images not shown here):

- a street light with a building in the background.
- a group of motorcycles parked in front of a building.
- a group of people walking down a street.
- a group of motorcycles parked next to each other.
- a city street filled with lots of traffic.
- a bus driving down a street next to tall building.
- a group of cars parked on the side of a street.

We at Cere Labs, an artificial intelligence startup based in Mumbai, have built an application that extends this technique to video, continuously describing a video's content. First, we trained the Show and Tell model on the MSCOCO image captioning dataset to obtain our own custom model. We then used OpenCV to extract frames from a video and fed them to the Show and Tell inference algorithm, which captions each frame. To speed up inference, we tuned the rate at which frames are sent for processing, so that video playback and caption generation stay smooth and in sync. The results were impressive; the generated captions contain some errors, but these can be reduced with more data and training. We further extended the application to caption a live camera feed, so that the description is generated in real time and can someday help blind and visually impaired people. The possibilities are enormous, with applications even in robotics.
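As a rough sketch, the frame-rate tuning described above can be implemented by captioning only every Nth frame. The code below is a minimal illustration, not our production pipeline: `caption_fn` is a hypothetical stand-in for the Show and Tell inference call, the function names are our own, and the OpenCV usage assumes its standard Python bindings (`cv2`).

```python
def stride_for(video_fps, captions_per_second):
    """How many frames to skip between captioned frames, so that
    inference keeps pace with playback (the frame-rate tuning step)."""
    return max(1, round(video_fps / captions_per_second))

def caption_video(path, caption_fn, captions_per_second=2):
    """Read a video with OpenCV and run caption_fn on sampled frames.

    caption_fn is a hypothetical stand-in for the Show and Tell
    inference step: it takes a frame (a NumPy array) and returns a
    caption string.
    """
    import cv2  # imported here so stride_for works without OpenCV installed

    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if fps is unreported
    stride = stride_for(fps, captions_per_second)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video (or read error)
            break
        if idx % stride == 0:
            print(caption_fn(frame))
        idx += 1
    cap.release()
```

Lowering `captions_per_second` widens the stride, trading caption freshness for throughput; for a live camera feed the same loop applies with `cv2.VideoCapture(0)`.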

We further plan to experiment and come up with more innovative applications of this promising technology.


By Amol Bhivarkar,
Researcher / Senior Software Developer,
Cere Labs

