
Information Extraction - The key to Question Answering Systems

The day an AI can read a document, answer every question asked about it, and reason over its contents will be the day we call it truly intelligent. Welcome to the world of Information Extraction, where algorithms turn unstructured documents into structured information that an AI can then query to answer questions. A task that seems easy for humans turns out to be hard for AI.

The difficulty lies in recognizing named entities, identifying context, extracting relationships, understanding tables and diagrams, and much more. Research in Information Extraction has progressed rapidly since the problem was identified, and today we have many open source tools at our disposal.

Any toolkit for Information Extraction is expected to contain the following modules:
  • Tokenizer - Converts a sequence of characters into a sequence of tokens
  • Gazetteers - Entity dictionaries used as a lookup table
  • Sentence splitter - Understands where a sentence begins and ends
  • Part of Speech (POS) tagger - Identifies and tags the parts of speech in a text
  • Named Entity Recognition - Identifies named entities in a text
  • Coreference Resolution - Identifies multiple expressions that refer to the same entity
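
To make the first few modules concrete, here is a minimal Python sketch of a sentence splitter, a tokenizer, and a gazetteer lookup chained together. It uses only the standard library and is a toy illustration, not how any of the toolkits in this post implement these steps. The function names, the gazetteer entries, and the entity type labels are all hypothetical, chosen just for this example.

```python
import re

# Hypothetical toy gazetteer: a lookup table from a lowercased
# surface form to an entity type (labels invented for this sketch).
GAZETTEER = {
    "ibm": "ORGANIZATION",
    "watson": "SYSTEM",
    "jeopardy": "EVENT",
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def lookup_entities(tokens):
    """Gazetteer lookup: return (token, type) pairs for tokens found in the table."""
    return [(t, GAZETTEER[t.lower()]) for t in tokens if t.lower() in GAZETTEER]


text = "IBM built Watson. Watson won Jeopardy!"
for sentence in split_sentences(text):
    tokens = tokenize(sentence)
    print(tokens, lookup_entities(tokens))
```

A splitter this simple breaks on abbreviations such as "Dr." and a plain dictionary lookup cannot resolve ambiguous names, which is part of why real toolkits ship trained models for these modules.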

The following list shows some popular open source Information Extraction toolkits that contain most of the above modules:

  • General Architecture for Text Engineering (GATE):
    GATE is a suite of tools developed in Java for Natural Language Processing tasks. It includes ANNIE (A Nearly-New Information Extraction System), which contains all the basic modules for information extraction.
    Check out GATE at https://gate.ac.uk/

  • Unstructured Information Management Architecture (UIMA):
    Unstructured Information Management applications are software systems that analyze large volumes of unstructured information to discover relevant knowledge. IBM’s Watson, the AI system that won the Jeopardy! challenge, uses UIMA.
    Learn more about UIMA at https://uima.apache.org/

  • OpenNLP:
    The Apache OpenNLP library is a toolkit for processing natural language text and supports all the modules required for Information Extraction. OpenNLP also offers maximum entropy and perceptron-based machine learning.

  • Natural Language Toolkit (NLTK):
    NLTK offers Python libraries for NLP that can be used to perform Information Extraction. It provides strong integration with WordNet (a lexical database for the English language).
    Explore NLTK at http://www.nltk.org/

With such a vast array of open source Information Extraction toolkits, you can build custom Information Extraction software in a few days. By a commonly cited estimate, almost 80% of business information is unstructured, so solutions that structure that information by building on the toolkits above have huge value. The future lies in question-answer style querying of unstructured documents, not keyword-based search.
