
Information Extraction - The key to Question Answering Systems

The day an AI can read a document, answer every question asked about it, and reason over its contents will be the day we call it truly intelligent. Welcome to the world of Information Extraction, where algorithms convert unstructured documents into structured information that an AI can then query to answer questions. A task that seems easy for humans turns out to be hard for AI.

The difficulty lies in recognizing named entities, identifying context, extracting relationships, understanding tables and diagrams, and more. Research in Information Extraction has progressed rapidly since the problem was identified, and today we have many open source tools at our disposal.

Any toolkit for Information Extraction is expected to contain the following modules:
  • Tokenizer - Converts a sequence of characters into a sequence of tokens
  • Gazetteers - Entity dictionaries used as a lookup table
  • Sentence splitter - Understands where a sentence begins and ends
  • Part of Speech (POS) tagger - Identifies and tags parts of speech in a text
  • Named Entity Recognition - Identifies named entities in a text
  • Coreference Resolution - Identifies multiple expressions that refer to the same entity
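To make the first few modules concrete, here is a toy sketch in Python of a tokenizer, a sentence splitter, and a gazetteer-based entity tagger. The regexes and gazetteer entries are illustrative assumptions, not what any of the toolkits below actually use:

```python
import re

# Toy gazetteer: entity dictionaries used as lookup tables (illustrative entries)
GAZETTEER = {
    "london": "LOCATION",
    "paris": "LOCATION",
    "ibm": "ORGANIZATION",
    "alan turing": "PERSON",
}

def split_sentences(text):
    """Naive sentence splitter: break on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Naive tokenizer: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag_entities(tokens):
    """Gazetteer lookup over bigrams then unigrams; longer match wins."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        bigram = " ".join(tokens[i:i + 2]).lower()
        if i + 1 < len(tokens) and bigram in GAZETTEER:
            tags[i] = tags[i + 1] = GAZETTEER[bigram]
            i += 2
        elif tokens[i].lower() in GAZETTEER:
            tags[i] = GAZETTEER[tokens[i].lower()]
            i += 1
        else:
            i += 1
    return list(zip(tokens, tags))

text = "Alan Turing worked in London. IBM built Watson."
for sent in split_sentences(text):
    print(tag_entities(tokenize(sent)))
```

Real toolkits replace each of these naive pieces with trained models and much larger dictionaries, but the pipeline shape - split, tokenize, tag - is the same.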

The following list shows some popular open source Information Extraction toolkits that contain most of the above modules:

  • General Architecture for Text Engineering (GATE):
    GATE is a suite of Java tools for Natural Language Processing tasks; it includes ANNIE (A Nearly-New Information Extraction System), which contains all the basic modules for information extraction.
      Check out GATE at https://gate.ac.uk/

  • Unstructured Information Management Architecture (UIMA):
    Unstructured Information Management applications are software systems that analyze large volumes of unstructured information to discover relevant knowledge. IBM's Watson, the specialized Artificial Intelligence system that won the Jeopardy! challenge, uses UIMA.
    Learn more about UIMA at https://uima.apache.org/

  • OpenNLP:
    The Apache OpenNLP library is a toolkit for processing natural language text and supports all the modules required for Information Extraction. OpenNLP also offers maximum entropy and perceptron based machine learning.
    Find OpenNLP at https://opennlp.apache.org/
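To illustrate the perceptron-based learning such toolkits offer, here is a minimal sketch of the classic perceptron update rule applied to token classification (entity vs. other). The surface features and training tokens are invented for illustration; this is not OpenNLP's actual implementation:

```python
def features(token):
    """Simple surface features for a token (illustrative choices)."""
    return [
        1.0,                                  # bias term
        1.0 if token[:1].isupper() else 0.0,  # starts with a capital letter
        1.0 if token.isupper() else 0.0,      # all caps (e.g., acronyms)
        1.0 if any(c.isdigit() for c in token) else 0.0,  # contains a digit
    ]

def train_perceptron(examples, epochs=10):
    """Classic perceptron: on each mistake, add label * features to weights."""
    w = [0.0] * 4
    for _ in range(epochs):
        for token, label in examples:        # label: +1 entity, -1 other
            x = features(token)
            score = sum(wi * xi for wi, xi in zip(w, x))
            if label * score <= 0:           # misclassified -> update weights
                w = [wi + label * xi for wi, xi in zip(w, x)]
    return w

def predict(w, token):
    """Return +1 (entity) if the weighted feature score is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, features(token))) > 0 else -1

train = [("Paris", 1), ("IBM", 1), ("Watson", 1),
         ("walked", -1), ("the", -1), ("quickly", -1)]
w = train_perceptron(train)
```

After training on these six tokens, the weights learn that capitalization signals an entity, so `predict(w, "London")` returns +1 while `predict(w, "ran")` returns -1. Production taggers use far richer features (context windows, prefixes, gazetteer hits) and averaged weights, but the update rule is the same.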

  • Natural Language Toolkit (NLTK):
    NLTK offers NLP libraries in Python for performing Information Extraction. It provides strong integration with WordNet (a lexical database for the English language).
    Explore NLTK at http://www.nltk.org/

With such a vast array of open source Information Extraction toolkits, you can create custom Information Extraction software in a few days. An often-cited estimate is that around 80% of business information is unstructured, so solutions that structure that information by building on the above toolkits have huge value. The future of querying unstructured documents lies in question-answer style interaction, not keyword based search.
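Question-answer style querying becomes simple once extraction has done its job. As a toy sketch, suppose an IE pipeline has emitted (subject, relation, object) triples; answering a question is then a lookup. The triples and relation names here are invented for illustration:

```python
# Structured facts an IE pipeline might extract (illustrative triples)
FACTS = [
    ("GATE", "written_in", "Java"),
    ("Watson", "uses", "UIMA"),
    ("NLTK", "integrates_with", "WordNet"),
]

def answer(subject, relation):
    """Answer 'What does <subject> <relation>?' by scanning the triple store."""
    for s, r, o in FACTS:
        if s.lower() == subject.lower() and r == relation:
            return o
    return None

print(answer("Watson", "uses"))  # -> UIMA
```

A real system would also need to map a natural-language question onto a subject and relation, which is itself an extraction problem; but the payoff of structured facts is that the final answering step is this simple.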
