Linguistic Features of NLP-I

This article covers two linguistic features of NLP:

the POSTagger (sentence analysis) and

the Lemmatizer (reduces words to their base forms).

There are many other linguistic features, such as sentence recognition, that are not covered in this article.

A POSTagger (sentence analysis)


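A POSTagger (part-of-speech tagger) breaks an input sentence down into its tokens and assigns a part-of-speech tag to each one.
An example program in Java, using the Apache OpenNLP classes: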

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class PosTaggerExample {

   public static void main(String[] args) throws Exception {

      // Load the parts-of-speech maxent model; try-with-resources closes the stream
      try (InputStream inputStream =
            new FileInputStream("models" + File.separator + "en-pos-maxent.bin")) {

         POSModel model = new POSModel(inputStream);

         // Instantiate the POSTaggerME class
         POSTaggerME tagger = new POSTaggerME(model);

         String sentence = "Hi welcome to our POS example";

         // Tokenize the sentence using the WhitespaceTokenizer class
         WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
         String[] tokens = whitespaceTokenizer.tokenize(sentence);

         // Generate one tag per token
         String[] tags = tagger.tag(tokens);

         // Combine tokens and tags in a POSSample and print it as word_TAG pairs
         POSSample sample = new POSSample(tokens, tags);
         System.out.println(sample.toString());
      }
   }
}

POSTagger Example

Running this Java program with the input sentence “Hi welcome to our POS example” produces the following output:

Hi_NNP welcome_JJ to_TO our_PRP$ POS_NNS example_NN

The POS tags are appended to each token with an underscore character.
For example, Hi_NNP means that “Hi” was tagged as NNP, a singular proper noun.
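
To process this output further, the word_TAG pairs can be split apart again. The following is a minimal sketch in plain Java; the class TaggedOutputParser and its methods are hypothetical helpers for illustration and not part of OpenNLP. It assumes that pairs are separated by whitespace and that the tag follows the last underscore.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): splits "word_TAG" pairs
// as printed by the example program above.
class TaggedOutputParser {

    // Returns {word, tag}. Splits at the LAST underscore so that a word
    // which itself contains an underscore stays intact.
    static String[] splitPair(String pair) {
        int idx = pair.lastIndexOf('_');
        if (idx < 0) {
            throw new IllegalArgumentException("No tag found in: " + pair);
        }
        return new String[] { pair.substring(0, idx), pair.substring(idx + 1) };
    }

    // Splits a whole output line into a list of {word, tag} arrays.
    static List<String[]> parse(String line) {
        List<String[]> result = new ArrayList<>();
        for (String pair : line.trim().split("\\s+")) {
            result.add(splitPair(pair));
        }
        return result;
    }

    public static void main(String[] args) {
        String output = "Hi_NNP welcome_JJ to_TO our_PRP$ POS_NNS example_NN";
        for (String[] wordAndTag : parse(output)) {
            System.out.println(wordAndTag[0] + " -> " + wordAndTag[1]);
        }
    }
}
```

If the tokens and tags are still available as arrays in the program, it is simpler to use those directly; parsing is only needed when just the printed text is at hand.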

The tagger uses the Penn Treebank tag set and recognizes, for example:

NNP    Proper noun, singular
VBZ    Verb, 3rd person singular present
CD     Cardinal number
NNS    Noun, plural
JJ     Adjective
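
To make tagger output easier to read, the tags can be mapped to their descriptions. The following sketch is an illustration in plain Java; the class PosTagDescriptions is a hypothetical name, and the map contains only the five tags listed above, not the full tag set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of OpenNLP): maps a few POS tags
// to human-readable descriptions.
class PosTagDescriptions {

    // Only the tags listed in this article; the real tag set is larger.
    static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();
    static {
        DESCRIPTIONS.put("NNP", "Proper noun, singular");
        DESCRIPTIONS.put("VBZ", "Verb, 3rd person singular present");
        DESCRIPTIONS.put("CD", "Cardinal number");
        DESCRIPTIONS.put("NNS", "Noun, plural");
        DESCRIPTIONS.put("JJ", "Adjective");
    }

    // Returns the description, or a placeholder for tags missing from the map.
    static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "(not in this table)");
    }

    public static void main(String[] args) {
        String[] tags = { "NNP", "JJ", "TO", "NNS" };
        for (String tag : tags) {
            System.out.println(tag + ": " + describe(tag));
        }
    }
}
```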


A lemmatizer

A lemmatizer reduces each word to its base form (the lemma).

An example program in Java:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.lemmatizer.DictionaryLemmatizer;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class LemmatizerExample {

    public static void main(String[] args) {

        // test sentence, already split into tokens
        String[] tokens = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
                "morning", "and", "afternoon", "newspapers", "."};

        // open the parts-of-speech model and the lemmatizer dictionary;
        // try-with-resources closes both streams afterwards
        try (InputStream posModelIn =
                     new FileInputStream("models" + File.separator + "en-pos-maxent.bin");
             InputStream dictLemmatizer =
                     new FileInputStream("train" + File.separator + "en-lemmatizer.dict.txt")) {

            // loading the parts-of-speech model from the stream
            POSModel posModel = new POSModel(posModelIn);

            // initializing the parts-of-speech tagger with the model
            POSTaggerME posTagger = new POSTaggerME(posModel);

            // tagging the tokens
            String[] tags = posTagger.tag(tokens);

            // loading the lemmatizer with the dictionary
            DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);

            // finding the lemmas; the lemmatizer needs both tokens and tags
            String[] lemmas = lemmatizer.lemmatize(tokens, tags);

            // printing the results
            System.out.println("\nPrinting lemmas for the given sentence…");
            System.out.println("WORD -POSTAG : LEMMA");
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + " -" + tags[i] + " : " + lemmas[i]);
            }

        } catch (FileNotFoundException e) {
            // model or dictionary file is missing
            e.printStackTrace();
        } catch (IOException e) {
            // model or dictionary file could not be read
            e.printStackTrace();
        }
    }
}

Lemmatizer Example

This means that the example sentence “Most large cities in the US had morning and afternoon newspapers” can be broken down into its components:

Printing lemmas for the given sentence…
WORD -POSTAG : LEMMA
Most -JJS : much
large -JJ : large
cities -NNS : city
in -IN : in
the -DT : the
US -NNP : O
had -VBD : have
morning -NN : O
and -CC : and
afternoon -NN : O
newspapers -NNS : newspaper

Result of the Lemmatizer

The POSTAG column has the same meaning as in the POSTagger part. The result of the lemmatizer is in the LEMMA column.

The lemma is the base form of the word. For example, the lemma of “Most” is “much”, and the lemma of “had” is “have”. The value O means that the dictionary contains no lemma for this word and tag.
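
The idea behind a dictionary-based lemmatizer can be sketched in a few lines of plain Java. This is an illustration only, not OpenNLP's implementation: the class ToyDictionaryLemmatizer is hypothetical, the entries are taken from the output above, and lowercasing the word before the lookup is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy lemmatizer (not OpenNLP's DictionaryLemmatizer):
// looks up (word, POS tag) in a map and returns "O" when nothing is found.
class ToyDictionaryLemmatizer {

    private final Map<String, String> dictionary = new HashMap<>();

    // Assumption of this sketch: words are lowercased before the lookup.
    private static String key(String word, String tag) {
        return word.toLowerCase(Locale.ROOT) + "\t" + tag;
    }

    void addEntry(String word, String tag, String lemma) {
        dictionary.put(key(word, tag), lemma);
    }

    // Returns one lemma per token; "O" marks tokens without a dictionary entry.
    String[] lemmatize(String[] tokens, String[] tags) {
        String[] lemmas = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            lemmas[i] = dictionary.getOrDefault(key(tokens[i], tags[i]), "O");
        }
        return lemmas;
    }

    public static void main(String[] args) {
        ToyDictionaryLemmatizer lemmatizer = new ToyDictionaryLemmatizer();
        lemmatizer.addEntry("most", "JJS", "much");
        lemmatizer.addEntry("cities", "NNS", "city");
        lemmatizer.addEntry("had", "VBD", "have");

        String[] tokens = { "Most", "cities", "had", "US" };
        String[] tags = { "JJS", "NNS", "VBD", "NNP" };
        String[] lemmas = lemmatizer.lemmatize(tokens, tags);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -" + tags[i] + " : " + lemmas[i]);
        }
    }
}
```

The lookup key combines the word and its tag, which is why the lemmatizer needs the POS tags as input: the same word form can have different lemmas depending on its part of speech.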

This is quite a handy way to create search values from the text.
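
As a rough sketch of how such search values could be derived, the following plain Java example takes the tokens, tags and lemmas from the output above. The class SearchValueBuilder is hypothetical, and two choices are assumptions of this sketch rather than anything prescribed by the lemmatizer: only nouns, adjectives and verbs are kept, and the original token is used when the lemma is O.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: turns lemmatizer output into a list of search values.
class SearchValueBuilder {

    static List<String> build(String[] tokens, String[] tags, String[] lemmas) {
        // LinkedHashSet removes duplicates and keeps the order of first occurrence
        Set<String> values = new LinkedHashSet<>();
        for (int i = 0; i < tokens.length; i++) {
            // Assumption: only nouns (NN*), adjectives (JJ*) and verbs (VB*) are useful
            boolean contentWord = tags[i].startsWith("NN")
                    || tags[i].startsWith("JJ")
                    || tags[i].startsWith("VB");
            if (!contentWord) {
                continue;
            }
            // Fall back to the token itself when no lemma was found ("O")
            String value = "O".equals(lemmas[i]) ? tokens[i] : lemmas[i];
            values.add(value.toLowerCase(Locale.ROOT));
        }
        return new ArrayList<>(values);
    }

    public static void main(String[] args) {
        String[] tokens = { "Most", "large", "cities", "in", "the", "US", "had",
                "morning", "and", "afternoon", "newspapers" };
        String[] tags = { "JJS", "JJ", "NNS", "IN", "DT", "NNP", "VBD",
                "NN", "CC", "NN", "NNS" };
        String[] lemmas = { "much", "large", "city", "in", "the", "O", "have",
                "O", "and", "O", "newspaper" };
        System.out.println(build(tokens, tags, lemmas));
    }
}
```

Which tags to keep depends on the search use case; function words such as “in”, “the” and “and” are dropped here because they rarely help to find a document.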