The NLP categorizer – N-gram

Let's examine the N-gram NLP categorizer. It is built in and follows the N-gram algorithm, so no programming is necessary; you just use it. As mentioned on other pages, it can be used as a client of the Content Server or started within the Java subsystem of the Content Server. In both cases, the documents must be transferred from the Content Server and the predicted categories must be transferred back to the Content Server.

Stay tuned for the command-line executables (no programming required) in a later article.

Contents

The NLP categorizer – N-gram example

Document Classification using NGram Features in OpenNLP

Training

Test

Other NLP AI Contentserver Articles

Starting Page

Example Linguistic Features of NLP

The built-in NLP categorizer

Detecting the language

Content Server NLP application examples

Content Server Autocategorizer

The NLP categorizer – N-gram example

An n-gram is a collection of n successive items in a text document that may include words, numbers, symbols, and punctuation. N-gram models are useful in many text analytics applications where sequences of words are relevant, such as in sentiment analysis, text classification, and text generation. N-gram modeling is one of the many techniques used to convert text from an unstructured format to a structured format. An alternative to n-gram is word embedding techniques, such as word2vec.
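
To make the definition concrete, here is a minimal plain-Java sketch that lists the word n-grams of a short token sequence. The class and method names (`NGramSketch`, `ngrams`) are hypothetical and used only for illustration; they are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative helper only; not part of OpenNLP.
class NGramSketch {

    // Returns every run of n consecutive tokens, joined with a single space.
    static List<String> ngrams(String[] tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "she truly wants to be with Leopold".split(" ");
        System.out.println(ngrams(tokens, 1)); // unigrams: single words
        System.out.println(ngrams(tokens, 2)); // bigrams: pairs of neighbouring words
    }
}
```

A sequence of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.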

See the spaCy article (for Python) if you want to use word vectors.

Document Classification using NGram Features in OpenNLP

This topic is a continuation of document classification using the Maxent model or the Naive Bayes model, where training a model for document classification (categorization) with the default features incorporated in DoccatFactory was explained in detail.

The following snippet of Java code defines and initializes N-gram feature generators that can be used for the Document Categorizer.

FeatureGenerator[] featureGenerators = { new NGramFeatureGenerator(1,1),
                    new NGramFeatureGenerator(2,3) };
DoccatFactory factory = new DoccatFactory(featureGenerators);

featureGenerators is an array in which a list of feature generators (classes that implement the FeatureGenerator interface) can be provided. You may build your own feature generator by implementing FeatureGenerator and use it for the document categorizer simply by adding it to the array. See the Javadoc of NGramFeatureGenerator.
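
The two constructor arguments of NGramFeatureGenerator are the minimum and maximum n-gram length. The following plain-Java sketch shows which word sequences the combination (1,1) plus (2,3) covers. It is a stand-in written for illustration and does not call OpenNLP; the class and method names (`NGramRangeSketch`, `ngramRange`) are hypothetical, and the feature strings OpenNLP generates internally may be formatted differently.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for illustration only; it does not call OpenNLP.
class NGramRangeSketch {

    // Collects every n-gram whose length lies between minGram and maxGram (inclusive).
    static List<String> ngramRange(String[] tokens, int minGram, int maxGram) {
        List<String> features = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder gram = new StringBuilder();
            for (int len = 1; len <= maxGram && start + len <= tokens.length; len++) {
                if (len > 1) {
                    gram.append(' ');
                }
                gram.append(tokens[start + len - 1]);
                if (len >= minGram) {
                    features.add(gram.toString());
                }
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"Kate", "accepts", "her", "promotion"};
        List<String> features = new ArrayList<>();
        features.addAll(ngramRange(tokens, 1, 1)); // roughly what a (1,1) generator covers
        features.addAll(ngramRange(tokens, 2, 3)); // roughly what a (2,3) generator covers
        features.forEach(System.out::println);
    }
}
```

Splitting the work into a (1,1) and a (2,3) generator covers the same lengths as a single range from 1 to 3; the array form simply makes it easy to add or remove one group of features.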

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
 
import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizer;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.doccat.FeatureGenerator;
import opennlp.tools.doccat.NGramFeatureGenerator;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
 

public class DocClassificationNGramFeaturesDemo {
 
    public static void main(String[] args) {
 
        try {
         System.out.println("Doc classification on ngram  (66 romantic and thriller movies)\n");
            // read the training data
            InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("train"+File.separator+"en-movie-category.train"));
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
 
            // define the training parameters
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ITERATIONS_PARAM, 10+"");
            params.put(TrainingParameters.CUTOFF_PARAM, 0+"");
             
            // feature generators - N-gram feature generators
            FeatureGenerator[] featureGenerators = { new NGramFeatureGenerator(1,1),
                    new NGramFeatureGenerator(2,3) };
            DoccatFactory factory = new DoccatFactory(featureGenerators);
            System.out.println("Train the model with the movies database\n");
            // create a model from training data
            DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, factory);
            System.out.println("\nModel is successfully trained.");
 
            // save the model to local
            // try-with-resources closes the stream even if serialization fails
            try (BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream("custom_models"+File.separator+"en-movie-classifier-ngram.bin"))) {
                model.serialize(modelOut);
            }
            System.out.println("\nTrained Model is saved locally at : "+"custom_models"+File.separator+"en-movie-classifier-ngram.bin");
 
            // test the model file by subjecting it to prediction
            DocumentCategorizer doccat = new DocumentCategorizerME(model);
            String movie = "Afterwards Stuart and Charlie notice Kate in the photos Stuart took at Leopolds ball and realise that her destiny must be to go back and be with Leopold That night while Kate is accepting her promotion at a company banquet he and Charlie race to meet her and show her the pictures Kate initially rejects their overtures and goes on to give her acceptance speech but it is there that she sees Stuarts picture and realises that she truly wants to be with Leopold";
            System.out.println("\nThe movie test description(new to the system)\n"+movie);
            System.out.println("Is it a Thriller or Romantic?\n");
            String[] docWords = movie.replaceAll("[^A-Za-z]", " ").split(" ");
            double[] aProbs = doccat.categorize(docWords);
 
            // print the probabilities of the categories
            System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
            for(int i=0;i<doccat.getNumberOfCategories();i++){
                System.out.println(doccat.getCategory(i)+" : "+aProbs[i]);
            }
            System.out.println("---------------------------------");
 
            System.out.println("\n"+doccat.getBestCategory(aProbs)+" : is the predicted category for the given sentence.");
        }
        catch (IOException e) {
            System.out.println("An exception in reading the training file. Please check.");
            e.printStackTrace();
        }
    }
}

If you run this example, the output is:

Doc classification on ngram  (66 romantic and thriller movies)
Train the model with the movies database
Model is successfully trained.
Trained Model is saved locally at : custom_models\en-movie-classifier-ngram.bin
The movie test description(new to the system)
Afterwards Stuart and Charlie notice Kate in the photos Stuart took at Leopolds ball and realise that her destiny must be to go back and be with Leopold That night while Kate is accepting her promotion at a company banquet he and Charlie race to meet her and show her the pictures Kate initially rejects their overtures and goes on to give her acceptance speech but it is there that she sees Stuarts picture and realises that she truly wants to be with Leopold
Is it a Thriller or Romantic?
---------------------------------
Category : Probability
---------------------------------
Thriller : 0.4912161738321056
Romantic : 0.5087838261678945
---------------------------------
Romantic : is the predicted category for the given sentence.

First, the model is trained with the movie database. Then we give it the description of a movie unknown to the system and let it predict the category.

The highest probability is 0.5087838261678945, which means the movie is classified as Romantic.
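
Selecting the predicted category amounts to finding the index of the largest probability. The sketch below does this in plain Java with the two values from the output above. The class and method names (`BestCategorySketch`, `bestIndex`) are hypothetical; in the demo, this step is done by doccat.getBestCategory(aProbs).

```java
// Illustrative helper only; the demo itself relies on doccat.getBestCategory(aProbs).
class BestCategorySketch {

    // Returns the index of the largest value (the first one if there is a tie).
    static int bestIndex(double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] categories = {"Thriller", "Romantic"};
        double[] probs = {0.4912161738321056, 0.5087838261678945}; // values from the run above
        int best = bestIndex(probs);
        System.out.println(categories[best] + " : " + probs[best]);
    }
}
```

With both values this close to 0.5, the margin between the two categories is only about 0.018, so the prediction is a narrow one.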

The NLP categorizer – N-gram Training

Before a prediction can be made, the model must be trained. A plain text file is used as training data. Here, 2 of the 66 entries are shown, one from the Thriller category and one from the Romantic category. The category label is the first word of each entry, followed by the movie description.

Thriller An American Muslim man and former Delta Force operator Yusuf Sheen makes a videotape When FBI Special Agent Helen Brody Moss and her team see news bulletins looking for Yusuf they launch an investigation which is curtailed when they are summoned to a high school which has been converted into a black site under military command They are shown Yusuf s complete tape where he threatens to detonate three nuclear bombs in separate U S cities if his demands are not met A special interrogator H Samuel L Jackson is brought in to force Yusuf to reveal the locations of the nuclear bombs H quickly shows his capability and cruelty by chopping off one of Yusuf s fingers with a small hatchet Horrified Special Agent Brody attempts to put a stop to the measures Her superiors make it clear that the potentially disastrous consequences necessitate these extreme measures As the plot unfolds H escalates his methods with Brody as the good cop Brody realizes that Yusuf anticipated that he would be tortured Yusuf then makes his demands: he would like the President of the United States to announce a cessation of support for puppet governments and dictatorships in Muslim countries and a withdrawal of American troops from all Muslim countries The group immediately dismisses the possibility of his demands being met citing the United States declared policy of not negotiating with terrorists When Brody accuses Yusuf of faking the bomb threat in order to make a point about the moral character of the United States government he breaks down and agrees that it was all a ruse He gives her an address to prove it They find a room that matches the scene in the video tape and find evidence on the roof A soldier removes a picture from an electrical switch which triggers a tremendous C explosion at a nearby shopping mall visible from the roof The explosion kills people Angry at the senseless deaths Brody returns to Yusuf and cuts his chest with a scalpel Yusuf is unafraid and demands she cut him 
He justifies the deaths in the shopping mall stating that the Americans kill that many people every day Yusuf says he allowed himself to be caught so he could face his oppressors H questions whether Yusuf will reveal the bombs location unless Yusuf s wife is found When she is detained H brings her in front of her husband and threatens to mutilate her in front of him Brody and the others begin to take her away from the room in disgust Out of desperation H slashes her throat and she bleeds to death in front of Yusuf Still without cooperation H tells the soldiers to bring in Yusuf s two children a young boy and a girl Outside of Yusuf s hearing he assures everyone that he will not harm the children Yusuf s children are brought in and H makes it clear that he will torture them if the locations of the bombs are not divulged Yusuf breaks and gives three addresses in New York Los Angeles and Dallas but H does not stop forcing the others to intervene Citing the amount of missing nuclear material Yusuf potentially had at his disposal some – lbs were reported missing with about ½ lbs needed per device H insists that Yusuf has not admitted anything about a hence unreferenced fourth bomb H points out that everything Yusuf has done so far has been planned meticulously He knew the torture would most likely break him and he would have been certain to plant a fourth bomb just in case The purpose of the preceding torture was not to break Yusuf but rather to make it clear what would happen to his children if he did not cooperate The official in charge of the operation demands that H bring Yusuf s children back in for further interrogation H demands that Brody bring the children back in because her decency will give him the moral approval that he needs to do the unthinkable When Brody refuses to retrieve the children for H he unstraps Yusuf sarcastically setting him free The official draws his pistol and aims it at H to coerce him into further interrogation Yusuf grabs the official s 
gun He asks Brody to take care of his children and kills himself Brody walks out of the building with Yusuf s children 
Romantic Krish Malhotra Arjun Kapoor a fresh Engineer from IIT Delhi now a student pursuing his MBA at the IIM Ahmedabad Gujarat comes from a troubled rich family of Punjabi heritage He meets his classmate Ananya Swaminathan Alia Bhatt a BBA holder in his college who comes from a conservative Tamil Brahmin family Krish and Ananya initially quarrel but soon become friends and start studying together Soon they begin dating and stay together for their next months on the IIM campus Krish confides in Ananya that his real passion is writing which he wants to pursue a career in They both have become so close to each other and also have developed sexual relationship many times during the stay in IIM Krish gets selected in the campus drives for Yes Bank He immediately rushes to the next room and proposes to Ananya in the middle of her interview She accepts and then gets selected for Sunsilk When they complete their graduation Krish and Ananya decide to get married They introduce their parents to each other at the convocation ceremony To their dismay Krish s loud mother Kavita Amrita Singh does not get along with Ananya s reserved Tamilian parents Radha Revathy and Swaminathan Shiv Kumar Subramaniam After graduation Ananya begins her marketing job in her hometown Chennai and Krish goes back to his own hometown Delhi with the choice of work place in his own hands Krish s brash family urges him to stay in Delhi and try to discourage him from his interest in writing They also criticize his relationship with Ananya and tell him to get into an arranged marriage with a Punjabi girl It is also evident that there is tension between Krish and his rich alcoholic father Vikram Ronit Roy Krish leaves his dysfunctional family and starts his banking job in Chennai During this time he tries very hard to win over Ananya s family He tutors her younger brother for IIT entrance exams gets her mother an opportunity to sing at an event for his workplace and helps her father create his first 
PowerPoint presentation After all his effort Ananya s family agrees to the marriage with Krish Krish and Ananya then travel to Delhi to win over Krish s family Initially Kavita and her family are hostile towards Ananya but come to like her after she saves Krish s cousin s wedding from being cancelled due to a dispute over dowry she is also accepted Krish and Ananya decide to take a vacation with their families before the wedding The vacation does not go as planned when Kavita makes continuous snide remarks about Tamilian culture Furthermore Ananya and her parents overhear Krish falsely assuring his mother that she can treat Ananya however she wants after they are married Having had enough of the insults Ananya calls off the wedding and both return to their respective hometown Krish and Ananya find it hard to live without each other Sometime later he gets a call from Ananya who reveals that Krish s father had come down to Chennai to speak to her parents apologizing for his wife s shallow behavior This allows for Krish and Ananya to finally get happily married they become parents to twin boys and Krish resigns from banking and publishes his book States based on his and Ananya s life 
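
The layout of the entries above can be illustrated with a small plain-Java sketch that separates the category label from the description. It assumes the label is the first whitespace-separated word of an entry, as in the two samples shown; the class and method names (`TrainingLineSketch`, `categoryOf`, `textTokensOf`) are hypothetical. In the demo this parsing is done by DocumentSampleStream.

```java
// Illustrative helper only; in the demo, DocumentSampleStream does the parsing.
class TrainingLineSketch {

    // The category label is assumed to be the first whitespace-separated word.
    static String categoryOf(String line) {
        return line.trim().split("\\s+", 2)[0];
    }

    // Everything after the label is the description, split into word tokens.
    static String[] textTokensOf(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        return parts.length < 2 ? new String[0] : parts[1].split("\\s+");
    }

    public static void main(String[] args) {
        String line = "Romantic Krish and Ananya decide to get married";
        System.out.println("Category: " + categoryOf(line));
        System.out.println("Tokens  : " + String.join(" | ", textTokensOf(line)));
    }
}
```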

The model is saved after training. It can be retrained.

The NLP categorizer – N-gram Test

After the training, a test movie (unknown to the system) is used to get a prediction:

Afterwards Stuart and Charlie notice Kate in the photos Stuart took at Leopolds ball and realise that her destiny must be to go back and be with Leopold That night while Kate is accepting her promotion at a company banquet he and Charlie race to meet her and show her the pictures Kate initially rejects their overtures and goes on to give her acceptance speech but it is there that she sees Stuarts picture and realises that she truly wants to be with Leopold
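
Before categorization, the demo turns the description into tokens with movie.replaceAll("[^A-Za-z]", " ").split(" "). The test text above contains no punctuation, so this works cleanly. For text that does contain punctuation, splitting on a single space leaves empty tokens behind. The sketch below compares the demo's approach with a whitespace-run split, which is a suggested variation and not part of the original demo; the class and method names are hypothetical.

```java
// Illustrative comparison only; compactTokens is an assumed variation, not the demo's code.
class TokenizeSketch {

    // Same expression as in the demo: non-letters become spaces, then split on one space.
    static String[] demoTokens(String text) {
        return text.replaceAll("[^A-Za-z]", " ").split(" ");
    }

    // Variation: trim and split on runs of whitespace so no empty tokens remain.
    static String[] compactTokens(String text) {
        return text.replaceAll("[^A-Za-z]", " ").trim().split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Kate, at Leopold's ball.";
        System.out.println(demoTokens(text).length + " tokens with split(\" \")");
        System.out.println(compactTokens(text).length + " tokens with split(\"\\\\s+\")");
    }
}
```

Whether empty tokens affect the prediction depends on the model and its features, so this is a tidiness measure and not a guaranteed accuracy improvement.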

When the system has to predict the category, the highest probability is 0.5087838261678945, which means the movie is classified as Romantic.