a name finder example | The Joy of Content Server

This is a name finder example for NER (Named Entity Recognition) with Apache OpenNLP. NER is a convenient way to find names and other entities in text based on a trained model. Some basic training might be required.

Content

Name Finder

Training

Name finder

Let's start with the example:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class NameFinderSentences {
   public static void main(String[] args) throws Exception {

      // Loading the tokenizer model; try-with-resources closes the stream
      TokenizerModel tokenModel;
      try (InputStream inputStreamTokenizer =
            new FileInputStream("models" + File.separator + "en-token.bin")) {
         tokenModel = new TokenizerModel(inputStreamTokenizer);
      }

      // Instantiating the TokenizerME class
      TokenizerME tokenizer = new TokenizerME(tokenModel);

      // Tokenizing the sentence into a string array
      String sentence = "Mike is senior sales manager and John is his deputy";
      String[] tokens = tokenizer.tokenize(sentence);
      System.out.println("Namefinder example" + "\n");
      System.out.println("Example sentence" + "\n" + sentence);
      System.out.println("found" + "\n");

      // Loading the NER person model
      TokenNameFinderModel model;
      try (InputStream inputStreamNameFinder =
            new FileInputStream("models" + File.separator + "en-ner-person.bin")) {
         model = new TokenNameFinderModel(inputStreamNameFinder);
      }

      // Instantiating the NameFinderME class
      NameFinderME nameFinder = new NameFinderME(model);

      // Finding the names in the sentence
      Span[] nameSpans = nameFinder.find(tokens);

      // Resolving each span to the tokens it covers, so that names
      // consisting of several tokens are printed completely
      String[] names = Span.spansToStrings(nameSpans, tokens);

      // Printing the spans and the names they cover
      System.out.println("Position        Name" + "\n");
      for (int i = 0; i < nameSpans.length; i++) {
         System.out.println(nameSpans[i] + "  " + names[i]);
      }
   }
}

The example sentence is “Mike is senior sales manager and John is his deputy”.

Running the example prints:

Namefinder example

Example sentence
Mike is senior sales manager and John is his deputy
found

Position Name

[0..1) person Mike
[6..7) person John

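The names “Mike” (span [0..1), type person) and “John” (span [6..7), type person) are recognized. A span is a half-open range of token indices, so [0..1) covers only token 0. In this example, the model is “en-ner-person.bin”, which means the language is English and the model is trained on person names.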
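Because a span is a range of token indices, a name made of several tokens (a first and a last name, for example) covers more than one index, and the covered tokens have to be joined to get the full name. The following is a minimal sketch in plain Java, without OpenNLP on the classpath. The class SpanTextDemo and the method coveredText are hypothetical names chosen for illustration, and splitting on spaces stands in for the tokenizer.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: joins the tokens covered by a
// half-open span [start..end), the same index convention the output uses.
class SpanTextDemo {

    static String coveredText(String[] tokens, int start, int end) {
        // copyOfRange treats "end" as exclusive, matching the [start..end) notation
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        // Splitting on spaces stands in for the tokenizer in this sketch
        String[] tokens = "Mike is senior sales manager and John is his deputy".split(" ");

        System.out.println(coveredText(tokens, 0, 1)); // span [0..1)
        System.out.println(coveredText(tokens, 6, 7)); // span [6..7)
        System.out.println(coveredText(tokens, 2, 5)); // a three-token span [2..5)
    }
}
```

With OpenNLP itself, the start and end values would come from getStart() and getEnd() of each Span.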

Training

A training example looks like this:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import opennlp.tools.namefind.BioCodec;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinder;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
import opennlp.tools.util.TrainingParameters;

public class NERTrainingExample {

    // Any I/O error stops the program instead of continuing with a null model
    public static void main(String[] args) throws IOException {
        System.out.println("NER person training example\n");

        // reading training data (one annotated sentence per line)
        InputStreamFactory in = new MarkableFileInputStreamFactory(
            new File("train" + File.separator + "AnnotatedSentences.txt"));

        // setting the parameters for training
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 70);
        params.put(TrainingParameters.CUTOFF_PARAM, 1);
        System.out.println("Training parameters\n" + "TrainingParameters.ITERATIONS_PARAM 70");
        System.out.println("TrainingParameters.CUTOFF_PARAM 1");

        // training the model with NameFinderME.train; the sample stream is
        // closed automatically once training has finished
        TokenNameFinderModel nameFinderModel;
        try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8))) {
            nameFinderModel = NameFinderME.train("en", null, sampleStream, params,
                TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
        }

        // saving the model to "ner-custom-model.bin" file
        File output = new File("custom_models" + File.separator + "ner-custom-model.bin");
        output.getParentFile().mkdirs(); // create the target directory if it is missing
        try (OutputStream outputStream = new FileOutputStream(output)) {
            nameFinderModel.serialize(outputStream);
        }

        // testing the model and printing the types it found in the input sentence
        TokenNameFinder nameFinder = new NameFinderME(nameFinderModel);
        String sentence = "Stefan Miller is a highly qualified specialist at Clouds working at a Content Server partner";
        String[] testSentence = sentence.split(" ");

        System.out.println("test sentence\n" + sentence);
        System.out.println("Finding types in the test sentence..");
        Span[] names = nameFinder.find(testSentence);
        System.out.println(names.length + " names found");
        for (Span name : names) {
            // joining all tokens covered by the span [start..end)
            StringBuilder personName = new StringBuilder();
            for (int i = name.getStart(); i < name.getEnd(); i++) {
                personName.append(testSentence[i]).append(" ");
            }
            System.out.println(name.getType() + " : " + personName + "\t [probability=" + name.getProb() + "]");
        }
    }

}

When the example is run, the output is

NER person training example

Training parameters
TrainingParameters.ITERATIONS_PARAM 70
TrainingParameters.CUTOFF_PARAM 1
test sentence
Stefan Miller is a highly qualified specialist at Clouds working at a Content Server partner
Finding types in the test sentence..
1 names found
default : Stefan Miller 	 [probability=0.586696461666961]

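At the end, the example runs the trained model on a test sentence to check whether the training worked. Here it finds “Stefan Miller” with a probability of about 0.59. The type is reported as “default” because the <START> tags in the training data carry no type.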

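The training data consists of 130 annotated sentences in plain text, one sentence per line. Each name the system is supposed to learn is marked up like this:

My name is <START> Michael Hinterhofer <END> .

The named entity is delimited by <START> and <END>. OpenNLP expects the sentence to be whitespace-tokenized, so the tags and the punctuation are separated from their neighbours by spaces.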
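To make the annotation format more concrete, here is a minimal sketch in plain Java that reads one annotated line and recovers the marked name together with its token span. It is a simplified stand-in for illustration, not OpenNLP's own parser: the class AnnotatedLineDemo and the method names are hypothetical, typed tags such as <START:person> are not handled, and the sketch splits on whitespace, so the markers have to stand alone as tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified reader for the annotation format; not OpenNLP's parser.
class AnnotatedLineDemo {

    // Returns each annotated name as "[start..end) text". The indices count
    // only real tokens, not the <START>/<END> markers.
    static List<String> names(String line) {
        List<String> tokens = new ArrayList<>();
        List<String> found = new ArrayList<>();
        int start = -1; // index of the first token of the open name, -1 if none

        for (String part : line.trim().split("\\s+")) {
            if (part.equals("<START>")) {
                if (start >= 0) throw new IllegalArgumentException("nested <START>");
                start = tokens.size();
            } else if (part.equals("<END>")) {
                if (start < 0) throw new IllegalArgumentException("<END> without <START>");
                String text = String.join(" ", tokens.subList(start, tokens.size()));
                found.add("[" + start + ".." + tokens.size() + ") " + text);
                start = -1;
            } else {
                tokens.add(part);
            }
        }
        if (start >= 0) throw new IllegalArgumentException("<START> without <END>");
        return found;
    }

    public static void main(String[] args) {
        for (String name : names("My name is <START> Michael Hinterhofer <END> .")) {
            System.out.println(name);
        }
    }
}
```

A check like this can also be run over a training file before training, since an unbalanced marker is easy to overlook in a long file.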

See annotation examples on GitHub: https://github.com/mccraigmccraig/opennlp/blob/master/src/test/resources/opennlp/tools/namefind/AnnotatedSentencesWithTypes.txt

As always, a lot helps a lot. The more sentences you have, the better. For production, your training data should contain at least 15,000 sentences.
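To see how far a training file is from that size, the sentences and annotated names can simply be counted. The following is a small sketch in plain Java under a few assumptions: the class TrainingDataStatsDemo and its methods are hypothetical, every non-blank line is treated as one sentence, and the sample lines in main are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for a quick sanity check of the training data size.
class TrainingDataStatsDemo {

    // Counts non-blank lines; each of them is treated as one sentence
    static long sentenceCount(List<String> lines) {
        return lines.stream().filter(l -> !l.trim().isEmpty()).count();
    }

    // Counts annotated names by counting <START> or <START:type> markers
    static long nameCount(List<String> lines) {
        return lines.stream()
            .flatMap(l -> Arrays.stream(l.trim().split("\\s+")))
            .filter(t -> t.matches("<START(:[^>\\s]+)?>"))
            .count();
    }

    public static void main(String[] args) {
        // Made-up sample; a real check would read the lines of the training
        // file instead, for example with java.nio.file.Files.readAllLines
        List<String> lines = Arrays.asList(
            "My name is <START> Michael Hinterhofer <END> .",
            "",
            "<START> Mike <END> reports to <START> John <END> .");

        System.out.println("sentences: " + sentenceCount(lines));
        System.out.println("names: " + nameCount(lines));
    }
}
```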