Most of the time, people need to categorize documents according to their content. This is especially useful when working with a very large number of documents.
It is quite easy to build an NLP categorizer for this purpose. There are several algorithms used to categorize documents; most of them are supervised learning algorithms, such as maximum entropy, naive Bayes, and maximum entropy Markov models. Today I'm going to describe how to categorize documents using the Apache OpenNLP toolkit. Apache OpenNLP supports the maximum entropy algorithm.
First we have to create a training dataset. Each training sample should include a category and the document content. Normally there should be more than 5000 training samples to get a good model.
Other good morning
Other good evening
Other have you any update on negombo road till wattala
Other perhaps the madness was always there but only the schools bring it out?
Other sorry didn't notice geotag
Feed high traffic in wattala
Feed low traffic in negombo road
Feed moving traffic in wattala
Feed nawala bridge area clear
Feed no traffic at all at ja-ela
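Each line holds one sample: the first whitespace-delimited token is the category label and the rest of the line is the document text. As a rough illustration, here is a minimal sketch of how one line of data.txt ends up as a DocumentSample (the simple whitespace splitting below is just for illustration; in the real pipeline DocumentSampleStream does this parsing for you):

import opennlp.tools.doccat.DocumentSample;

public class SampleFormatDemo {
    public static void main(String[] args) {
        // one line from data.txt: category first, document text after
        String line = "Feed high traffic in wattala";
        String[] parts = line.split("\\s+");
        String category = parts[0];
        String[] text = new String[parts.length - 1];
        System.arraycopy(parts, 1, text, 0, text.length);
        DocumentSample sample = new DocumentSample(category, text);
        System.out.println(sample.getCategory()); // prints: Feed
    }
}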
Then we need to train an NLP categorizer model on this dataset. You can easily go through the OpenNLP documentation and train your model.
The following code can be used to train the categorizer model and test it. Here I have used a cutoff of 2 and 300 iterations as the training parameters.
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import org.apache.log4j.Logger;

import java.io.*;

public class FeedClassifierTrainer {

    private static final Logger log = Logger.getLogger(FeedClassifierTrainer.class);
    private static DoccatModel model = null;

    public static void main(String[] args) {
        log.debug("Model training started");
        // train the model
        new FeedClassifierTrainer().train();
        // testing purposes
        String content = "due to train strike heavy traffic in maradhana";
        try {
            // test the model
            new FeedClassifierTrainer().test(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Train the model and serialize it to disk.
     */
    public void train() {
        // model name: you can define your own name for the model
        String onlpModelPath = "en-doccat.bin";
        // training data set
        String trainingDataFilePath = "data.txt";

        InputStream dataInputStream = null;
        try {
            // read the training data file
            dataInputStream = new FileInputStream(trainingDataFilePath);
            // read each training instance line by line
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataInputStream, "UTF-8");
            // turn each line into a DocumentSample for training
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
            // train the model: "en" means English, sampleStream is the training data,
            // 2 is the cutoff, 300 is the number of iterations
            model = DocumentCategorizerME.train("en", sampleStream, 2, 300);
        } catch (IOException e) {
            log.error(e.getMessage());
        } finally {
            if (dataInputStream != null) {
                try {
                    dataInputStream.close();
                } catch (IOException e) {
                    log.error(e.getMessage());
                }
            }
        }

        // write the trained model to a file so the classifier can be used in production
        try {
            if (model != null) {
                // save the model
                model.serialize(new FileOutputStream(onlpModelPath));
            }
        } catch (IOException e) {
            log.error(e.getMessage());
        }
    }

    /*
     * Load the saved model and test it:
     * give it a new text document and print the predicted category.
     */
    public void test(String text) throws IOException {
        String classificationModelFilePath = "en-doccat.bin";
        DocumentCategorizerME classificationME = new DocumentCategorizerME(
                new DoccatModel(new FileInputStream(classificationModelFilePath)));
        String documentContent = text;
        double[] classDistribution = classificationME.categorize(documentContent);
        // pick the category with the highest probability
        String predictedCategory = classificationME.getBestCategory(classDistribution);
        System.out.println("Model prediction : " + predictedCategory);
    }
}
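If training succeeds, running the class prints the winning category for the test sentence; with the sample data above the output should be something like "Model prediction : Feed". To see how confident the model actually is, you can also print the whole class distribution. This is a small sketch that reuses the classificationME and classDistribution variables from the test method above:

// print the probability the model assigned to each category
for (int i = 0; i < classificationME.getNumberOfCategories(); i++) {
    System.out.println(classificationME.getCategory(i) + " : " + classDistribution[i]);
}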
model = DocumentCategorizerME.train("en", sampleStream, 2, 300); always returns null — do you know why? I couldn't manage to get the error code. In some places it says that I should have more data to train on, but my dataset is around 2000 rows. Can you help me?
Did you preprocess the data?
How can I get "en-doccat.bin"?