
In the World of Document Similarity

How does a human infer whether two documents are similar? This question has long intrigued cognitive scientists, and it remains an active area of research. As of now, no product matches or surpasses human ability at judging the similarity of documents. But things are improving in this domain, and companies such as IBM and Microsoft are investing heavily in it.

We at Cere Labs, an Artificial Intelligence startup based in Mumbai, are also working in this area. We have applied the LDA and Word2Vec techniques, both of which have given us promising results:

Latent Dirichlet Allocation (LDA): LDA is a technique used mainly for topic modeling. It represents each document as a mixture of topics, and you can use these topic mixtures to measure the similarity between documents. The assumption is that the more topics two documents share, the more likely they are to be semantically similar.
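The topic-overlap idea can be illustrated with a minimal sketch. It assumes each document has already been reduced to a topic distribution by a trained LDA model. The three distributions below are hypothetical values made up for illustration, and the Hellinger distance is just one common way to compare such distributions, not necessarily the measure we use.

```python
import math


def hellinger_similarity(p, q):
    """Return 1 minus the Hellinger distance between two topic distributions.

    Both inputs are lists of topic probabilities that each sum to 1.
    The result is 1.0 for identical distributions and 0.0 when the
    documents share no topics at all.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    # Bhattacharyya coefficient: the amount of overlap between p and q.
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp at zero to guard against tiny negative values from rounding.
    distance = math.sqrt(max(0.0, 1.0 - overlap))
    return 1.0 - distance


# Hypothetical topic mixtures over three topics (not real model output).
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.60, 0.30, 0.10]
doc_c = [0.05, 0.05, 0.90]

sim_ab = hellinger_similarity(doc_a, doc_b)
sim_ac = hellinger_similarity(doc_a, doc_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Documents a and b put most of their weight on the same topic, so they score higher than a and c, whose dominant topics differ.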

You can study LDA in the following paper:

You can implement LDA using Gensim:

Word2Vec:

Word2Vec maps words into a vector space in which words with similar meanings are embedded near each other. So when plotted in this high-dimensional space, similar words tend to cluster together. The best part of Word2Vec is that these vectors capture semantic similarity, not just surface-level word overlap.
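One simple way to turn word vectors into a document similarity score is to average the vectors of the words in each document and compare the averages with cosine similarity. The sketch below assumes exactly that. The three-dimensional `toy_vectors` are hand-written, hypothetical values standing in for learned Word2Vec embeddings, which typically have hundreds of dimensions, and averaging is an illustrative choice rather than a description of our pipeline.

```python
import math

# Hypothetical embeddings for illustration only; real Word2Vec vectors
# are learned from a corpus and are much higher-dimensional.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "dog": [0.7, 0.3, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def document_vector(tokens, vectors):
    """Average the vectors of all tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token in the document has an embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


pets_1 = document_vector(["cat", "kitten"], toy_vectors)
pets_2 = document_vector(["dog", "kitten"], toy_vectors)
finance = document_vector(["stock", "market"], toy_vectors)

print(f"pets_1 vs pets_2:  {cosine(pets_1, pets_2):.3f}")
print(f"pets_1 vs finance: {cosine(pets_1, finance):.3f}")
```

The two pet documents score far higher against each other than either does against the finance document, even though `pets_1` and `pets_2` share only one word.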

You can read the original Word2Vec paper here:

You can also check the implementation in tensorflow at:

LDA and Word2Vec can also be combined to achieve interesting results. Keep following this space, as we will report our findings in future blog posts.

When we look at the results achieved by such techniques, it can feel as though the AI is thinking.

For a detailed understanding of word embeddings, please refer to the following article: An Introduction to Word Embeddings

