Thursday, November 20, 2014

NLP Categorizer

Most of the time, people need to categorize documents according to their content. This is especially useful when working with a very large number of documents.
It is therefore quite straightforward to build an NLP categorizer for this purpose. Several algorithms are used to categorize documents; most of them are supervised algorithms, such as maximum entropy, naive Bayes, and maximum entropy Markov models. Today I'm going to describe how to categorize documents using the Apache OpenNLP toolkit, which supports the maximum entropy algorithm.
First we have to create a training dataset. Each line of the training data should contain the category label followed by the document content. Normally there should be more than 5000 training samples to get a fine model. A few example lines:

Other good morning /
Other good evening /
Other have you any update on negombo road till wattala /
Other perhaps the madness was always there but only the schools bring it out? /
Other sorry didn't notice geotag /
Feed high traffic in wattala /
Feed low traffic in negombo road /
Feed moving traffic in wattala /
Feed nawala bridge area clear /
Feed no traffic at all at ja-ela /

Next we need to train an NLP categorizer model on this dataset. You can easily go through the OpenNLP documentation and train your own model.
The following code can be used to train the categorizer model and to test it. Here I have used a cutoff of 2 and 300 iterations as the training parameters.

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import org.apache.log4j.Logger;

import java.io.*;

public class FeedClassifierTrainer {

 private static final Logger log = Logger.getLogger(FeedClassifierTrainer.class);

 private static DoccatModel model = null;

 public static void main(String[] args) {
  log.debug("Model training started");
  //to train the model
  new FeedClassifierTrainer().train();
  //testing purpose
  String content = "due to train strike heavy traffic in maradhana ";
  try {
   //test the model
   new FeedClassifierTrainer().test(content);
  } catch (IOException e) {
   e.printStackTrace();
  }
 }

 /**
  * Training the models
  */
 public void train() {
  // model name you define your own name for the model
  String onlpModelPath = "en-doccat.bin";
  // training data set
  String trainingDataFilePath = "data.txt";

  InputStream dataInputStream = null;
  try {
   // Read training data file
   dataInputStream = new FileInputStream(trainingDataFilePath);
   // Read each training instance line by line
   ObjectStream<String> lineStream = new PlainTextByLineStream(dataInputStream, "UTF-8");
   // Wrap each line into a document sample (category label followed by the text)
   ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
   // Train the model: "en" = English, sampleStream = training data, cutoff = 2, iterations = 300
   model = DocumentCategorizerME.train("en", sampleStream, 2, 300);
  } catch (IOException e) {
   log.error(e.getMessage());
  } finally {
   if (dataInputStream != null) {
    try {
     dataInputStream.close();
    } catch (IOException e) {
     log.error(e.getMessage());
    }
   }
  }


  // Now we write the trained model to a file so that the
  // classifier can be reused in production
  try {
   if (model != null) {
    // saving the model file
    model.serialize(new FileOutputStream(onlpModelPath));
   }
  } catch (IOException e) {
   log.error(e.getMessage());
  }
 }

 /**
  * Load the saved model and test it:
  * give it a new text document and print the predicted category
  */
 public void test(String text) throws IOException {
  String classificationModelFilePath = "en-doccat.bin";
  DocumentCategorizerME classificationME =
    new DocumentCategorizerME(
      new DoccatModel(
        new FileInputStream(
          classificationModelFilePath)));
  String documentContent = text;
  double[] classDistribution = classificationME.categorize(documentContent);
   // get the predicted category
  String predictedCategory = classificationME.getBestCategory(classDistribution);
  System.out.println("Model prediction : " + predictedCategory);

 }
}

Saturday, October 25, 2014

Solving Java.lang.NoClassDefFoundError in Maven

Most developers run into this error when they run executables in the terminal; it usually doesn't occur inside an Integrated Development Environment (IDE). NoClassDefFoundError in Java occurs when the Java Virtual Machine is not able to find at runtime a particular class that was available at compile time.
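To make that concrete, here is a minimal sketch (the class name and the dependency are chosen just for illustration): the class below compiles against the log4j jar that Maven provides at build time, but running it with only your own classes on the classpath fails at runtime.

import org.apache.log4j.Logger; // resolved at compile time from the Maven dependency

public class Main {

    private static final Logger log = Logger.getLogger(Main.class);

    public static void main(String[] args) {
        // This compiles fine, but running it without the log4j jar on the
        // runtime classpath throws
        // java.lang.NoClassDefFoundError: org/apache/log4j/Logger
        log.info("application started");
    }
}

A common fix is to put every dependency jar on the runtime classpath, or to build a self-contained jar with a plugin such as maven-assembly or maven-shade.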

Thursday, October 16, 2014

Converting HTK Binary MFCC values in to ASCII MFCC

HTK is an open-source toolkit for hidden Markov models developed by Cambridge University. It is mostly used for speech recognition and speech synthesis purposes.


Wednesday, October 15, 2014

Testing an OpenNLP Named Entity Recognizer model using Java

First you have to train an NER model to test, or you can directly download existing models from this link.
Please refer to my previous blog post on training a model.
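For context, a minimal test sketch with the 1.5.x API looks like the following; the model file name, the entity type, and the sample tokens are assumptions for illustration.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NameFinderTest {

    public static void main(String[] args) throws IOException {
        // load a previously trained (or downloaded) model, e.g. en-ner-location.bin
        InputStream modelIn = new FileInputStream("en-ner-location.bin");
        TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
        modelIn.close();

        NameFinderME nameFinder = new NameFinderME(model);

        // the name finder expects an already tokenized sentence
        String[] tokens = {"heavy", "traffic", "near", "wattala", "junction"};
        Span[] spans = nameFinder.find(tokens);

        // print the surface strings of the detected entities
        for (String entity : Span.spansToStrings(spans, tokens)) {
            System.out.println("Found entity: " + entity);
        }
    }
}

In practice you would run a tokenizer over the input sentence first, since the name finder works on tokens rather than raw text.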

Tuesday, October 14, 2014

Training a Named Entity Recognizer model using OpenNLP

A Named Entity Recognizer automatically extracts named entities, such as Person, Location, or Traffic Level tags, from sentences.
First you have to download OpenNLP.
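As a rough sketch of that training step (this assumes the deprecated OpenNLP 1.5.x training overload, and the file names, entity type, and parameter values are placeholders), the training file marks each entity inline, for example "<START:location> wattala <END> junction is blocked .", and the model is trained and saved like this:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class NameFinderTrainer {

    public static void main(String[] args) throws IOException {
        // read the annotated training sentences line by line
        ObjectStream<String> lineStream =
                new PlainTextByLineStream(new FileInputStream("ner-train.txt"), "UTF-8");
        // parse the inline <START:...> ... <END> annotations into name samples
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        // train a "location" finder: 100 iterations, cutoff 5 (placeholder values)
        TokenNameFinderModel model = NameFinderME.train(
                "en", "location", sampleStream,
                Collections.<String, Object>emptyMap(), 100, 5);
        sampleStream.close();

        // save the model so it can be loaded for testing later
        model.serialize(new FileOutputStream("en-ner-location.bin"));
    }
}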


Tuesday, September 30, 2014

Image Processing - Matlab Tutorial 2

Today I'm going to present some spatial filtering methods. These methods are also known as mask processing methods. These filters are used for different purposes, such as the ones below; a small sketch of the idea follows the list.


  • Image Enhancements
  • Edge Detection
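Before moving to the Matlab versions, here is a minimal sketch of the idea in Java: a 3x3 averaging mask slid over a grayscale image. The kernel choice and the simple border handling are illustrative assumptions.

public class MeanFilter {

    // Apply a 3x3 averaging (mean) mask to a grayscale image stored as int[rows][cols].
    public static int[][] apply(int[][] image) {
        int rows = image.length;
        int cols = image[0].length;
        int[][] result = new int[rows][cols];

        // Start from a copy so the untouched border keeps its original values
        for (int r = 0; r < rows; r++) {
            result[r] = image[r].clone();
        }

        // Slide the mask over every interior pixel
        for (int r = 1; r < rows - 1; r++) {
            for (int c = 1; c < cols - 1; c++) {
                int sum = 0;
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        sum += image[r + dr][c + dc];
                    }
                }
                result[r][c] = sum / 9; // average of the 3x3 neighbourhood
            }
        }
        return result;
    }
}

Swapping the averaging mask for, say, a Sobel mask turns the same loops into an edge detector, which is why the two purposes above share one framework.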

Saturday, September 27, 2014

Image Processing - Matlab Tutorial 1

Digital Image Processing 

Digital image processing is the use of computer algorithms to perform image processing on digital images. It is a subcategory or field of digital signal processing.

Difference between Image Processing and Computer Vision


Wednesday, September 17, 2014

SVM

Support Vector Machines

A support vector machine (SVM) is a statistical supervised learning technique from the field of machine learning, applicable to both classification and regression. The original SVM algorithm was invented by Vladimir N. Vapnik, and the current standard incarnation (soft margin) was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995. [1] SVMs follow the spirit of the structural risk minimization principle.

A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. SVMs can handle both linearly and nonlinearly separable problems.
Whereas the original problem may be stated in a finite dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function K(x,y) selected to suit the problem.
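For reference, the soft-margin formulation introduced by Cortes and Vapnik can be written as

\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad\text{subject to}\quad y_{i}\big(w\cdot\phi(x_{i})+b\big)\ \ge\ 1-\xi_{i},\qquad \xi_{i}\ \ge\ 0,

where \phi is the feature mapping and the kernel K(x,y) = \phi(x)\cdot\phi(y) allows the required dot products to be computed without ever forming \phi(x) explicitly.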

Monday, September 15, 2014

Radial Basis Function

Today I'm going to give a short introduction to a simple machine learning technique called the radial basis function (RBF) network. I learnt this method in my machine learning class, and you can get a good understanding by referring to my assignment.
 

Neural Networks offer a powerful framework for representing nonlinear mappings from several inputs to one or more outputs. An important application of neural networks is regression. Instead of mapping the inputs into a discrete class label, the neural network maps the input variables into continuous values.

A major class of neural networks is the radial basis function (RBF) neural network. We will look at the architecture of RBF neural networks, followed by their applications in both regression and classification.
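For reference, a standard RBF network with Gaussian basis functions (stated here in its common textbook form) computes

f(x) = w_{0} + \sum_{j=1}^{M} w_{j}\,\exp\!\left(-\frac{\lVert x - c_{j}\rVert^{2}}{2\sigma_{j}^{2}}\right),

where the centres c_j are typically found by clustering the inputs (for example with K-means, discussed next) and the widths \sigma_j control how local each basis function is; the output weights w_j can then be fitted by linear least squares.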

Clustering - K Means

Cluster analysis has been an emerging research topic in data mining due to its variety of applications. It is broadly used in a wide variety of fields, including statistics, image processing, computational biology, mobile communication, medicine, and economics. Clustering is a process that partitions a given data set into homogeneous groups based on given features, such that similar objects are kept in one group whereas dissimilar objects belong to different groups.

Clustering is one of the most important unsupervised learning problems; it deals with finding structure in a collection of unlabeled data.
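In its usual form, K-means makes this concrete by minimising the within-cluster sum of squared distances

\operatorname*{arg\,min}_{C_{1},\dots,C_{K}}\ \sum_{k=1}^{K}\ \sum_{x\in C_{k}} \lVert x-\mu_{k}\rVert^{2},

where \mu_k is the mean of cluster C_k; the algorithm alternates between assigning each point to its nearest mean and recomputing the means, stopping when the assignments no longer change.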

Saturday, August 2, 2014

Support Vector Machine - Matlab Code

Full SVM implementation

Today I'm going to show you how to implement a simple support vector machine for the 'XOR' gate from scratch. Matlab actually has a built-in command for SVMs, but you can't get a clear understanding of how SVMs work just by using it.