This article covers two NLP components:
the POSTagger (sentence analysis) and
the Lemmatizer (reduces words to their base forms).
Many other linguistic features, such as sentence detection, are not covered in this article.
A POSTagger (sentence analysis)
The tagger breaks an input sentence into its components (tokens) and assigns a part-of-speech tag to each one.
An example program in Java, using Apache OpenNLP:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class PosTaggerExample {
    public static void main(String[] args) throws Exception {
        // Loading the parts-of-speech maxent model
        InputStream inputStream = new FileInputStream("models" + File.separator + "en-pos-maxent.bin");
        POSModel model = new POSModel(inputStream);
        // Instantiating the POSTaggerME class
        POSTaggerME tagger = new POSTaggerME(model);
        String sentence = "Hi welcome to our POS example";
        // Tokenizing the sentence using the WhitespaceTokenizer class
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        String[] tokens = whitespaceTokenizer.tokenize(sentence);
        // Generating one tag per token
        String[] tags = tagger.tag(tokens);
        // Instantiating the POSSample class, which pairs tokens with tags
        POSSample sample = new POSSample(tokens, tags);
        System.out.println(sample.toString());
    }
}
POSTagger Example
Running this Java program on the input sentence "Hi welcome to our POS example" produces the following output:
Hi_NNP welcome_JJ to_TO our_PRP$ POS_NNS example_NN
Each POS tag is appended to its token with an underscore character.
For example, Hi_NNP means that "Hi" was tagged as a proper noun, singular.
The tags come from the Penn Treebank tag set. The system recognizes, for example:
NNP | Proper Noun, Singular |
VBZ | Verb, 3rd Person Singular Present |
CD | Cardinal Number |
NNS | Noun, Plural |
JJ | Adjective |
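The tagged output is easy to process further. The following is a minimal sketch in plain Java (no OpenNLP required) that splits a string in the token_TAG format shown above into token/tag pairs and looks up a description for each tag. The class name TaggedOutputParser, its methods, and the small description map are assumptions made for this illustration; the map covers only a handful of Penn Treebank tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TaggedOutputParser {

    // Illustrative subset of Penn Treebank tags; not a complete list.
    static final Map<String, String> TAG_NAMES = Map.of(
            "NNP", "proper noun, singular",
            "NNS", "noun, plural",
            "NN", "noun, singular or mass",
            "JJ", "adjective",
            "TO", "to",
            "PRP$", "possessive pronoun",
            "VBZ", "verb, 3rd person singular present",
            "CD", "cardinal number");

    // Splits "token_TAG token_TAG ..." into a list of {token, tag} pairs.
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            int idx = item.lastIndexOf('_'); // the tag follows the last underscore
            if (idx <= 0 || idx == item.length() - 1) {
                continue; // skip items without a token or without a tag
            }
            pairs.add(new String[]{item.substring(0, idx), item.substring(idx + 1)});
        }
        return pairs;
    }

    // Returns a readable description, or the raw tag if it is not in the map.
    static String describe(String tag) {
        return TAG_NAMES.getOrDefault(tag, tag);
    }

    public static void main(String[] args) {
        String output = "Hi_NNP welcome_JJ to_TO our_PRP$ POS_NNS example_NN";
        for (String[] pair : parse(output)) {
            System.out.println(pair[0] + " -> " + describe(pair[1]));
        }
    }
}
```

Splitting on the last underscore rather than the first keeps tokens that themselves contain an underscore intact.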
A lemmatizer
A lemmatizer reduces each word to its base form (lemma).
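The idea behind a dictionary-based lemmatizer can be illustrated without any NLP library. The following is a simplified sketch, not OpenNLP's implementation: a lookup table keyed by word and POS tag returns the lemma, and a placeholder "O" is returned when no entry exists. The class ToyLemmatizer, its methods, and the lowercasing of words are assumptions made for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ToyLemmatizer {

    // Hypothetical in-memory dictionary: key is "word<TAB>tag", value is the lemma.
    private final Map<String, String> dictionary = new HashMap<>();

    // Registers one dictionary entry; words are stored in lowercase (a simplification).
    void add(String word, String tag, String lemma) {
        dictionary.put(word.toLowerCase() + "\t" + tag, lemma);
    }

    // Returns one lemma per token, or "O" when the dictionary has no matching entry.
    String[] lemmatize(String[] tokens, String[] tags) {
        String[] lemmas = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            String key = tokens[i].toLowerCase() + "\t" + tags[i];
            lemmas[i] = dictionary.getOrDefault(key, "O");
        }
        return lemmas;
    }

    public static void main(String[] args) {
        ToyLemmatizer lemmatizer = new ToyLemmatizer();
        lemmatizer.add("cities", "NNS", "city");
        lemmatizer.add("had", "VBD", "have");

        String[] lemmas = lemmatizer.lemmatize(
                new String[]{"cities", "had", "US"},
                new String[]{"NNS", "VBD", "NNP"});
        System.out.println(String.join(" ", lemmas)); // prints "city have O"
    }
}
```

The POS tag is part of the key because the same word can have different lemmas depending on its role in the sentence, which is why the tagger has to run before the lemmatizer.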
An example program in Java that uses OpenNLP's DictionaryLemmatizer:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.lemmatizer.DictionaryLemmatizer;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// The class name is chosen for this listing; the original shows only the main method.
public class LemmatizerExample {
    public static void main(String[] args) {
        try {
            // test sentence, already tokenized
            String[] tokens = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
                    "morning", "and", "afternoon", "newspapers", "."};
            // Parts-of-speech tagging
            // reading the parts-of-speech model into a stream
            InputStream posModelIn = new FileInputStream("models" + File.separator + "en-pos-maxent.bin");
            // loading the parts-of-speech model from the stream
            POSModel posModel = new POSModel(posModelIn);
            // initializing the parts-of-speech tagger with the model
            POSTaggerME posTagger = new POSTaggerME(posModel);
            // tagging the tokens
            String[] tags = posTagger.tag(tokens);
            // opening the lemmatizer dictionary as an input stream
            InputStream dictLemmatizer = new FileInputStream("train" + File.separator + "en-lemmatizer.dict.txt");
            // initializing the lemmatizer with the dictionary
            DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);
            // finding the lemmas (one per token, based on token and tag)
            String[] lemmas = lemmatizer.lemmatize(tokens, tags);
            // printing the results
            System.out.println("\nPrinting lemmas for the given sentence…");
            System.out.println("WORD -POSTAG : LEMMA");
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + " -" + tags[i] + " : " + lemmas[i]);
            }
        } catch (FileNotFoundException e) {
            // model or dictionary file is missing
            e.printStackTrace();
        } catch (IOException e) {
            // model or dictionary file could not be read
            e.printStackTrace();
        }
    }
}
Lemmatizer Example
This means that the example sentence “Most large cities in the US had morning and afternoon newspapers” can be broken down into its components:
Printing lemmas for the given sentence…
WORD -POSTAG : LEMMA
Most -JJS : much
large -JJ : large
cities -NNS : city
in -IN : in
the -DT : the
US -NNP : O
had -VBD : have
morning -NN : O
and -CC : and
afternoon -NN : O
newspapers -NNS : newspaper
Result of the Lemmatizer
The POSTAG column has the same meaning as in the POSTagger section. The result of the lemmatization is in the LEMMA column.
For example, the lemma of "Most" is "much", and the lemma of "had" is "have". A lemma of O means that the dictionary contains no entry for that word and tag.
This is a handy way to generate search values from the text.
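How such search values could be derived is sketched below in plain Java. This is one possible approach under stated assumptions, not a fixed recipe: only nouns (tags starting with NN) and adjectives (tags starting with JJ) are kept, and when the lemma is O the original token is used instead. The class SearchValues and the method fromLemmas are hypothetical names; the input arrays are copied from the lemmatizer result above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SearchValues {

    // Builds an ordered, duplicate-free set of search terms from parallel arrays.
    static Set<String> fromLemmas(String[] tokens, String[] tags, String[] lemmas) {
        Set<String> terms = new LinkedHashSet<>();
        for (int i = 0; i < tokens.length; i++) {
            // Assumption: only nouns (NN*) and adjectives (JJ*) make useful search terms.
            if (!tags[i].startsWith("NN") && !tags[i].startsWith("JJ")) {
                continue;
            }
            // "O" marks a missing dictionary entry; fall back to the token itself.
            terms.add("O".equals(lemmas[i]) ? tokens[i] : lemmas[i]);
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] tokens = {"Most", "large", "cities", "in", "the", "US", "had",
                "morning", "and", "afternoon", "newspapers"};
        String[] tags = {"JJS", "JJ", "NNS", "IN", "DT", "NNP", "VBD",
                "NN", "CC", "NN", "NNS"};
        String[] lemmas = {"much", "large", "city", "in", "the", "O", "have",
                "O", "and", "O", "newspaper"};

        // prints [much, large, city, US, morning, afternoon, newspaper]
        System.out.println(fromLemmas(tokens, tags, lemmas));
    }
}
```

Using lemmas instead of surface forms means that a search for "city" also matches text containing "cities". Which tags to keep is a design choice that depends on the application.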