NLP and Content Server : Artificial intelligence and natural language processing (NLP) for OpenText content servers

An introduction to the possibilities

This is the start of a series on AI with natural language processing (NLP).

Contains

Other NLP AI Contentserver Articles

Starting Page

Example Linguistic Features of NLP- Part1

Example Linguistic Features of NLP- Part2

The built-in NLP categorizer

Detecting the language

A name finder example

A name and location finder example

Content Server NLP application examples

Content Server Autocategorizer

Hardware

These applications do not require exorbitantly expensive NVIDIA processors (for example, an H100 cost around $45,000-$100,000 in spring 2024), nor do they have to run in a cloud.
All language services can run either on an OpenText content server (in the Java subsystem) or on a client (communicating via REST; a fast PC with lots of memory is required).

If you have a low-end NVIDIA CUDA-capable graphics card (like the A1000 for 1000€, for example), it's a good idea to use this card as your NLP processor, but this is not a must. Its 16 GB of video memory (VRAM) makes NLP operations very fast if you are using Apache OpenNLP or spaCy as the NLP framework.

Language processing (Natural Language Processing)

Natural language processing (NLP) is a subfield of artificial intelligence. It is intended to enable computers to understand, interpret and manipulate human language. We use Apache OpenNLP as the free software to work with the Content Server.

Here are a few examples:

NLP and Content Server : Spam detection

When you think of spam detection, you might not think of an NLP solution, but the best technologies use NLP’s text classification capabilities to scan emails for language elements that often indicate spam or phishing. These indicators include, among others, the overuse of financial terms, characteristically poor grammar, threats, undue urgency, and misspelled company names. Spam detection is one of the few NLP problems that experts consider “largely solved.”
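To make the idea of scanning for indicators concrete, here is a minimal sketch in Python. It is not how a production filter or the Content Server works: real systems use a trained text classifier, while this toy only counts hits against hand-written word lists. The indicator groups, the words in them, and the threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical indicator lists, chosen only for illustration.
INDICATORS = {
    "financial": {"money", "bank", "payment", "prize", "transfer", "invoice"},
    "urgency": {"urgent", "immediately", "now", "expires", "final"},
    "threat": {"suspended", "blocked", "legal", "penalty"},
}


def spam_indicators(text):
    """Count how many words of each indicator group occur in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        group: sum(1 for word in words if word in terms)
        for group, terms in INDICATORS.items()
    }


def looks_like_spam(text, threshold=3):
    """Flag the text when the total number of indicator hits reaches the threshold."""
    return sum(spam_indicators(text).values()) >= threshold
```

A trained classifier learns such indicators from labeled examples instead of relying on fixed lists, which is why it copes better with new wording.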

NLP and Content Server : Machine translation

Google Translate or CASSIA are examples of widely used NLP technology. Truly useful machine translation involves more than replacing words in one language with words in another. A good translation must accurately capture the meaning and tone of the source language and translate it into a text with the same meaning and desired effect in the target language. Machine translation programs are making significant progress in terms of accuracy. A good way to test a machine translation program is to translate a text into another target language and then back to the source language.

Example of NLP translation: English – Russian

An often jokingly quoted example for the English-Russian language pair:

When “The spirit is willing but the flesh is weak”

was translated from English into Russian and back again, the following wording resulted:

“The vodka is good but the meat is rotten”.

Today the result is “The spirit desires, but the flesh is weak”.

While this is still not perfect, it suggests a much improved translation from English to Russian.
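The round-trip test described above can be scored automatically. The following Python sketch compares the original text with the text that comes back, word by word, using the standard library's difflib. The function names are made up for this example, and the two translator functions are placeholders: no real translation service is called, the "back" translation is simply the result quoted above.

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def round_trip_score(text, forward, backward):
    """Translate with `forward`, translate back with `backward`, and
    return a word-level similarity between 0.0 and 1.0."""
    restored = backward(forward(text))
    return SequenceMatcher(None, words(text), words(restored)).ratio()


# Stand-ins for a real translation service (assumption: any two
# callables that take and return a string will do).
def to_russian(text):
    return "<Russian text>"


def to_english(text):
    return "The spirit desires, but the flesh is weak"


score = round_trip_score(
    "The spirit is willing but the flesh is weak", to_russian, to_english
)
```

A score of 1.0 means every word survived the round trip. Note that a high score only shows the wording is stable, it does not prove that the intermediate translation reads well.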

NLP and Content Server : Sentiment analysis

NLP has become an essential business tool for unlocking hidden data from social media channels. Sentiment analysis can analyze the language used in social media posts, replies, reviews, etc. to determine attitudes and emotions in response to products, promotions, and events. Companies can use this information for product designs, advertising campaigns, and more.

Example: Restaurants from Yelp

Here, binary classification is used to sort food reviews into good and bad.

The resulting positive highlights are also listed here.
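As an illustration of binary classification on reviews, here is a small Naive Bayes classifier in plain Python. This is a sketch under assumptions, not the model behind the Yelp example: the four training reviews are invented, and a real system would train on thousands of labeled reviews with a library such as Apache OpenNLP.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()

    def train(self, samples):
        for text, label in samples:
            self.doc_counts[label] += 1
            counter = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counter[word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counter in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counter.values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((counter[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training reviews, for illustration only.
model = NaiveBayes()
model.train([
    ("The food was delicious and the service was friendly", "good"),
    ("Great pizza, great staff, we will come back", "good"),
    ("The soup was cold and the waiter was rude", "bad"),
    ("Terrible burger, slow service, never again", "bad"),
])
```

Calling model.predict() on a new review returns the label whose training vocabulary fits the review best. The words that contribute most to the winning label are what a tool would show as highlights.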

NLP and Content Server : Text summary

This uses natural language processing techniques to process large amounts of digital text and create summaries and synopses for indexes, research databases, or busy readers who don’t have time to read the entire text. The best text summarization applications use semantic inference and natural language generation (NLG) to add useful context and inferences to summaries. Example (sorry for the thy’s, the text is not from yesterday):

Original:

Our Father who art in heaven, hallowed be thy name. Thy kingdom come. Thy will be done, on earth as it is in heaven. Give us our daily bread this day; and forgive us our trespasses, as we forgive those who trespass against us; and lead us not into temptation, but deliver us from evil.

Summary, automatically generated:

Our Father who art in heaven, hallowed be thy name. Thy will be done, on earth as it is in heaven. Thy kingdom come.
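The summary above keeps whole sentences from the original, which is called extractive summarization. A very simple version of that idea can be sketched in Python: score each sentence by how frequent its words are in the whole text and keep the best ones. This is an assumption about one possible technique, not the tool that produced the summary above, and it does none of the semantic inference or language generation that the best applications use.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average frequency of the sentence's words in the whole text.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

A practical version would at least ignore very common words ("the", "and", "us"), because otherwise they dominate the scores.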

NLP and Content Server : Automatic editing and cleaning of documents

The result is a document that looks like this, for example, with all sensitive information automatically removed.

These automatic editing (redaction) systems can be used for censorship and also to remove sensitive information (data protection).
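For sensitive data with a fixed shape, cleaning can be sketched with regular expressions alone. The Python example below replaces email addresses and phone-like numbers with a placeholder. The two patterns and the placeholder format are assumptions for illustration, and they are deliberately loose. They will not catch every format.

```python
import re

# Hypothetical patterns, kept simple on purpose.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s/-]{6,}\d"),
}


def redact(text):
    """Replace every match of each pattern with its label in brackets."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Names of people or companies have no fixed shape, so patterns like these cannot find them. That part of the cleaning needs a trained named entity model.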

NLP and Content Server : Named Entities

For example, a “named entity” looks like this (original Italian text)

For example, “Named Entities” are

Sensitivity labels
Insider risk management
Data Lifecycle Management
Records management

The following example sentence stands in for such named entities, which can be trained and identified independently of the language:

When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously

Results:

PERSON = name of a person, ORG = organization, DATE = date

This allows the content of a document to be automatically examined for named entities. These terms can then be used as categories in the content server.
Trained named entities are very suitable for auto-classification purposes.
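How recognized entities turn into category values can be sketched in a few lines of Python. Note the assumptions: this toy does not use a trained model at all. It looks names up in a small hand-made list (a so-called gazetteer) and finds years with a regular expression, and the function names are made up for this example. A trained model, for example from Apache OpenNLP or spaCy, would replace find_entities().

```python
import re
from collections import defaultdict

# Hand-made lookup list, standing in for a trained model.
GAZETTEER = {"Sebastian Thrun": "PERSON", "Google": "ORG"}
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def find_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for match in YEAR.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    return [(entity, label) for _, entity, label in sorted(found)]


def as_categories(entities):
    """Group entity texts by label, without duplicates."""
    categories = defaultdict(list)
    for entity, label in entities:
        if entity not in categories[label]:
            categories[label].append(entity)
    return dict(categories)
```

The dictionary returned by as_categories() has one entry per label, which is the shape needed to fill category attributes on a document.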

NLP and Content Server : Word vectors

The similarity between two texts is determined using word vectors. This can also be used to determine whether two texts are the same or different. It is done by comparing the word vectors.
As a simplified example, word vectors generally look like this:

In this table, each dimension has a clearly defined meaning. For example, if the first dimension represents the meaning of the word “animal,” then each numerical value in that dimension represents how closely the word in that row relates to “animal.”

Similar words are drawn close together in vector space. What is interesting is how close “cat” and “dog” are to the term “pet”, how close together “elephant”, “lion” and “tiger” are, and how descriptive words (“wild”, “zoo”, “domesticated”) appear in groups.
Pre-trained word vectors are available for most major languages.

Funny: You can add two word vectors. By adding “German + airlines” you get “Lufthansa”.
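Comparing and adding word vectors can be shown with a few lines of Python. The usual measure for comparison is cosine similarity, which is close to 1.0 when two vectors point in the same direction. The three-dimensional vectors below are invented for illustration (assumed dimensions: animal, domesticated, wild). Real word vectors are learned from large text collections and have hundreds of dimensions.

```python
import math

# Invented toy vectors: (animal, domesticated, wild).
VECTORS = {
    "cat": (0.9, 0.8, 0.1),
    "dog": (0.9, 0.9, 0.1),
    "lion": (0.9, 0.1, 0.9),
    "pet": (0.7, 1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def add(a, b):
    """Add two vectors component by component."""
    return tuple(x + y for x, y in zip(a, b))


def nearest(vector, exclude=()):
    """Return the known word whose vector is most similar to `vector`."""
    candidates = [word for word in VECTORS if word not in exclude]
    return max(candidates, key=lambda word: cosine(vector, VECTORS[word]))
```

With these toy values, “cat” is far closer to “pet” than “lion” is, and the sum of “cat” and “dog” lands nearest to “pet” when the two input words are excluded. The “German + airlines” example works the same way, only with real pre-trained vectors.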

Stay tuned for the additional parts on NLP and its usage in the OpenText Content Server.

The possibilities are endless.