
Information Extraction - The key to Question Answering Systems

The day AI can read a document, answer every question asked about it, and reason over its contents will be the day we call it true intelligence. Welcome to the world of Information Extraction, where algorithms turn unstructured documents into structured information that an AI can then query to answer questions. A task that seems easy for humans turns out to be hard for AI.

The difficulty lies in recognizing named entities, identifying context, extracting relationships, understanding tables and diagrams, and more. Research in Information Extraction has progressed considerably since the problem was first identified, and today we have many open source tools at our disposal.

Any toolkit for Information Extraction is expected to contain the following modules:
  • Tokenizer - Converts a sequence of characters into a sequence of tokens
  • Gazetteers - Entity dictionaries used as lookup tables
  • Sentence splitter - Determines where a sentence begins and ends
  • Part of Speech (POS) tagger - Identifies and tags the part of speech of each word in a text
  • Named Entity Recognition - Identifies named entities in a text
  • Coreference Resolution - Identifies multiple expressions that refer to the same entity
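
To make the first few modules concrete, here is a minimal sketch in plain Python of a sentence splitter, a tokenizer, and a gazetteer lookup. It is a toy illustration only: the regular expressions, the `GAZETTEER` entries, and their labels are assumptions made up for this example, and real toolkits rely on far more robust rules and trained models.

```python
import re

# Toy gazetteer: a hand-made entity dictionary used as a lookup table.
# The entries and labels are illustrative assumptions, not from any toolkit.
GAZETTEER = {
    "ibm": "ORGANIZATION",
    "watson": "SYSTEM",
    "jeopardy": "EVENT",
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def lookup_entities(tokens):
    """Gazetteer lookup: return (token, label) pairs for tokens in the dictionary."""
    return [(t, GAZETTEER[t.lower()]) for t in tokens if t.lower() in GAZETTEER]


text = "IBM built Watson. Watson won the Jeopardy challenge!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
entities = [lookup_entities(t) for t in tokens]

print(sentences)
print(tokens[0])
print(entities)
```

A dictionary lookup like this misses any entity that is not listed and cannot resolve ambiguous names, which is why toolkits pair gazetteers with statistical Named Entity Recognition.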

The following popular open source Information Extraction toolkits contain most of the above modules:

  • General Architecture for Text Engineering (GATE):
    GATE is a suite of Java tools for Natural Language Processing tasks. It includes ANNIE (A Nearly-New Information Extraction System), which contains all the basic modules for information extraction.
      Check out GATE at

  • Unstructured Information Management Architecture (UIMA):
    Unstructured Information Management applications are software systems that analyze large volumes of unstructured information to discover relevant knowledge. Watson, the IBM AI system that won the Jeopardy! challenge, uses UIMA.
    Learn more about UIMA at

  • OpenNLP:
    The Apache OpenNLP library is a toolkit for processing natural language text, and it supports the common modules required for Information Extraction. OpenNLP also offers maximum entropy and perceptron-based machine learning.

  • Natural Language Toolkit (NLTK):
    NLTK offers Python libraries for NLP tasks, including Information Extraction. It integrates closely with WordNet (a lexical database for the English language).
   Explore NLTK at

With such a vast array of open source Information Extraction toolkits, you can build a first version of your own Information Extraction software in a few days. By a commonly cited estimate, almost 80% of business information is unstructured. Solutions that structure that information by building on the toolkits above have huge value. The future lies in question-answer style querying of unstructured documents, not keyword-based search.
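
The difference between keyword search and question answering over extracted structure can be sketched in a few lines of Python. This is a toy under stated assumptions: the sample sentences are made up for illustration, `WON_PATTERN` is a single hand-written pattern that only handles "X won Y" sentences, and the helpers `extract_facts`, `who_won`, and `keyword_search` are hypothetical names, not part of any toolkit listed above.

```python
import re

# Hypothetical pattern covering only sentences of the form "<Subject> won [the] <Object>".
WON_PATTERN = re.compile(
    r"^(?P<subject>[A-Z]\w*) won (?:the )?(?P<object>.+?)[.!]?$"
)


def extract_facts(sentences):
    """Turn matching sentences into structured (subject, relation, object) records."""
    facts = []
    for s in sentences:
        m = WON_PATTERN.match(s)
        if m:
            facts.append(
                {"subject": m.group("subject"), "relation": "won", "object": m.group("object")}
            )
    return facts


def who_won(facts, thing):
    """Answer 'Who won <thing>?' from the structured facts."""
    return [
        f["subject"]
        for f in facts
        if f["relation"] == "won" and thing.lower() in f["object"].lower()
    ]


def keyword_search(sentences, keyword):
    """Keyword search: return every sentence that mentions the keyword."""
    return [s for s in sentences if keyword.lower() in s.lower()]


# Made-up sample text for illustration.
sentences = [
    "Watson won the Jeopardy challenge.",
    "Many viewers watched the Jeopardy challenge.",
]

facts = extract_facts(sentences)
print(who_won(facts, "Jeopardy"))            # a direct answer
print(keyword_search(sentences, "Jeopardy"))  # every sentence that mentions the word
```

Keyword search returns both sentences and leaves the reader to find the answer, while the structured facts return the answer itself. A real system would replace the single pattern with the tagging, entity recognition, and coreference modules described earlier.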

