Madhawa's Note: November 2014

People often need to categorize documents according to their context. This is especially useful when working with a very large number of documents.

It is quite easy to build an NLP categorizer for this purpose. Several algorithms can be used to categorize documents. Most of them are supervised learning algorithms, such as maximum entropy, naive Bayes and maximum entropy Markov models. Today I'm going to describe how to categorize documents using the Apache OpenNLP toolkit, whose document categorizer supports the maximum entropy algorithm.
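To give a rough idea of what a maximum entropy classifier does at prediction time, here is a minimal sketch in plain Java. It is not OpenNLP's implementation: the class name MaxEntSketch, the two categories and all the weights are made up for illustration, whereas a real model learns its weights from the training data. Each word contributes a weight for each category, the weights are summed, and the sums are turned into probabilities by exponentiating and normalising.

```java
import java.util.HashMap;
import java.util.Map;

class MaxEntSketch {

    // Hypothetical weights: category -> (word -> weight). Made up for illustration.
    static Map<String, Map<String, Double>> weights() {
        Map<String, Double> feed = new HashMap<>();
        feed.put("traffic", 2.0);
        feed.put("good", -1.0);
        Map<String, Double> other = new HashMap<>();
        other.put("traffic", -1.0);
        other.put("good", 1.5);
        Map<String, Map<String, Double>> w = new HashMap<>();
        w.put("Feed", feed);
        w.put("Other", other);
        return w;
    }

    // Returns p(category | words): exp(sum of weights), normalised over all categories.
    static Map<String, Double> probabilities(String[] words, Map<String, Map<String, Double>> w) {
        Map<String, Double> scores = new HashMap<>();
        double total = 0.0;
        for (Map.Entry<String, Map<String, Double>> category : w.entrySet()) {
            double sum = 0.0;
            for (String word : words) {
                // words without a weight contribute nothing
                sum += category.getValue().getOrDefault(word, 0.0);
            }
            double score = Math.exp(sum);
            scores.put(category.getKey(), score);
            total += score;
        }
        // divide by the total so the probabilities add up to 1
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return scores;
    }

    public static void main(String[] args) {
        String[] words = "high traffic in wattala".split("\\s+");
        System.out.println(probabilities(words, weights()));
    }
}
```

With these made-up weights, the word "traffic" pushes the example sentence strongly towards the Feed category.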

First, we have to create a training dataset. Each line of the training data holds one sample: the category, followed by whitespace, followed by the content. Normally you need more than 5,000 training samples to get a good model.

Other good morning
Other good evening
Other have you any update on negombo road till wattala
Other perhaps the madness was always there but only the schools bring it out?
Other sorry didn't notice geotag
Feed high traffic in wattala
Feed low traffic in negombo road
Feed moving traffic in wattala
Feed nawala bridge area clear
Feed no traffic at all at ja-ela
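
Before training, it can be worth checking that every line follows the "category, whitespace, text" layout and seeing how many samples each category has. The following is a minimal sketch that uses only the Java standard library. The class and method names are my own and it is not part of OpenNLP.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TrainingDataCheck {

    // Counts samples per category; each line is "category<whitespace>text".
    // Throws IllegalArgumentException for a line that has no text after the category.
    static Map<String, Integer> countByCategory(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines
            }
            // split into at most two parts: the category and the rest of the line
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length < 2) {
                throw new IllegalArgumentException("No text after category: " + line);
            }
            counts.merge(parts[0], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Other good morning",
                "Feed high traffic in wattala",
                "Feed nawala bridge area clear");
        System.out.println(countByCategory(lines)); // prints {Feed=2, Other=1}
    }
}
```

For a real file, the lines could be read with Files.readAllLines(Paths.get("data.txt"), StandardCharsets.UTF_8) and passed to countByCategory.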

Then we need to train an NLP categorizer model on this dataset. The OpenNLP documentation explains how to train your own model.
The following code can be used to train a categorizer model and test it. Here I have used a cutoff of 2 (features that occur fewer than 2 times in the training data are ignored) and 300 iterations as training parameters.

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import org.apache.log4j.Logger;

import java.io.*;

public class FeedClassifierTrainer {

 private static final Logger log = Logger.getLogger(FeedClassifierTrainer.class);

 private static DoccatModel model = null;

 public static void main(String[] args) {
  log.debug("Model training started");
  // train the model and save it to disk
  new FeedClassifierTrainer().train();
  // sample text for a quick manual test
  String content = "due to train strike heavy traffic in maradhana ";
  try {
   //test the model
   new FeedClassifierTrainer().test(content);
  } catch (IOException e) {
   e.printStackTrace();
  }
 }

 /**
  * Training the models
  */
 public void train() {
  // output file for the trained model; you can choose your own name
  String onlpModelPath = "en-doccat.bin";
  // training data set
  String trainingDataFilePath = "data.txt";

  InputStream dataInputStream = null;
  try {
   // Read training data file
   dataInputStream = new FileInputStream(trainingDataFilePath);
   // Read the training data line by line (one sample per line)
   ObjectStream<String> lineStream = new PlainTextByLineStream(dataInputStream, "UTF-8");
   // Turn each line into a DocumentSample (category plus tokens)
   ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
   // Train the model: "en" is the language code, sampleStream is the training data,
   // 2 is the cutoff and 300 is the number of iterations
   model = DocumentCategorizerME.train("en", sampleStream, 2, 300);
  } catch (IOException e) {
   log.error(e.getMessage());
  } finally {
   if (dataInputStream != null) {
    try {
     dataInputStream.close();
    } catch (IOException e) {
     log.error(e.getMessage());
    }
   }
  }


  // Write the trained model to a file so that the classifier
  // can be loaded and used later, for example in production
  if (model != null) {
   // try-with-resources closes the output stream even if serialize() fails
   try (OutputStream modelOut = new FileOutputStream(onlpModelPath)) {
    model.serialize(modelOut);
   } catch (IOException e) {
    log.error(e.getMessage());
   }
  }
 }

 /*
  * Load the saved model and test it:
  * give it a new text document and print the predicted category
  */
 public void test(String text) throws IOException {
  String classificationModelFilePath = "en-doccat.bin";
  // load the saved model; try-with-resources closes the input stream
  DoccatModel loadedModel;
  try (InputStream modelIn = new FileInputStream(classificationModelFilePath)) {
   loadedModel = new DoccatModel(modelIn);
  }
  DocumentCategorizerME classificationME = new DocumentCategorizerME(loadedModel);
  // one probability per category
  double[] classDistribution = classificationME.categorize(text);
  // get the predicted category
  String predictedCategory = classificationME.getBestCategory(classDistribution);
  System.out.println("Model prediction : " + predictedCategory);

 }
}
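
The array returned by categorize holds one probability per category, and getBestCategory picks the category with the highest one. As a rough illustration of that step, and of how a confidence threshold could be layered on top, here is a sketch in plain Java. The class name, the category order, the probabilities, the 0.7 threshold and the "Unknown" fallback are all assumptions made for the example. They are not values from OpenNLP or from a trained model.

```java
class BestCategorySketch {

    // Returns the index of the largest probability (the first one wins on ties).
    static int argMax(double[] distribution) {
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    // Returns the most probable category, or the fallback when the
    // highest probability is below the threshold.
    static String bestOrFallback(String[] categories, double[] distribution,
                                 double threshold, String fallback) {
        int best = argMax(distribution);
        return distribution[best] >= threshold ? categories[best] : fallback;
    }

    public static void main(String[] args) {
        String[] categories = {"Feed", "Other"}; // hypothetical category order
        double[] confident = {0.92, 0.08};       // made-up probabilities
        double[] unsure = {0.55, 0.45};
        System.out.println(bestOrFallback(categories, confident, 0.7, "Unknown")); // prints Feed
        System.out.println(bestOrFallback(categories, unsure, 0.7, "Unknown"));    // prints Unknown
    }
}
```

With the real classifier, the name of the category at index i is available through classificationME.getCategory(i), so the same threshold idea could be applied to classDistribution directly.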

Thursday, November 20, 2014

NLP Categorizer