Using TIKA in the command line

Apache Tika

You might have noticed that you need a tool to open a document and extract its content as plain text for further natural language processing downstream (e.g. the built-in categorizer, linguistic examination, language detection, predicting something from your documents, or finding names and locations in your documents).

The answer is Apache Tika, a free tool from Apache.

But just as you can program Tika in Java (inside the Content Server or as a client), you can also use Tika on the command line without any programming.

BTW: See the starting page of the openNLP series of articles here

The Tika application jar (tika-app-*.jar) can be used as a command line utility for extracting text content and metadata from all sorts of files. This runnable jar contains all the dependencies it needs, so you don’t need to worry about classpath settings to run it. So, no need for coding.

This is the usage-help documentation

usage: java -jar tika-app.jar [option...] [file|port...]

Options:
    -?  or --help          Print this usage message
    -v  or --verbose       Print debug level messages
    -V  or --version       Print the Apache Tika version number

    -g  or --gui           Start the Apache Tika GUI
    -s  or --server        Start the Apache Tika server
    -f  or --fork          Use Fork Mode for out-of-process extraction

    --config=<tika-config.xml>
        TikaConfig file. Must be specified before -g, -s, -f or the dump-x-config !
    --dump-minimal-config  Print minimal TikaConfig
    --dump-current-config  Print current TikaConfig
    --dump-static-config   Print static config
    --dump-static-full-config  Print static explicit config

    -x  or --xml           Output XHTML content (default)
    -h  or --html          Output HTML content
    -t  or --text          Output plain text content
    -T  or --text-main     Output plain text content (main content only)
    -m  or --metadata      Output only metadata
    -j  or --json          Output metadata in JSON
    -y  or --xmp           Output metadata in XMP
    -J  or --jsonRecursive Output metadata and content from all
                           embedded files (choose content type
                           with -x, -h, -t or -m; default is -x)
    -l  or --language      Output only language
    -d  or --detect        Detect document type
           --digest=X      Include digest X (md2, md5, sha1,
                               sha256, sha384, sha512
    -eX or --encoding=X    Use output encoding X
    -pX or --password=X    Use document password X
    -z  or --extract       Extract all attachements into current directory
    --extract-dir=<dir>    Specify target directory for -z
    -r  or --pretty-print  For JSON, XML and XHTML outputs, adds newlines and
                           whitespace, for better readability

    --list-parsers
         List the available document parsers
    --list-parser-details
         List the available document parsers and their supported mime types
    --list-parser-details-apt
         List the available document parsers and their supported mime types in apt format.
    --list-detectors
         List the available document detectors
    --list-met-models
         List the available metadata models, and their supported keys
    --list-supported-types
         List all known media types and related information


    --compare-file-magic=<dir>
         Compares Tika's known media types to the File(1) tool's magic directory

Description:
    Apache Tika will parse the file(s) specified on the
    command line and output the extracted text content
    or metadata to standard output.

    Instead of a file name you can also specify the URL
    of a document to be parsed.

    If no file name or URL is specified (or the special
    name "-" is used), then the standard input stream
    is parsed. If no arguments were given and no input
    data is available, the GUI is started instead.

- GUI mode

    Use the "--gui" (or "-g") option to start the
    Apache Tika GUI. You can drag and drop files from
    a normal file explorer to the GUI window to extract
    text content and metadata from the files.

- Batch mode

    Simplest method.
    Specify two directories as args with no other args:
         java -jar tika-app.jar <inputDirectory> <outputDirectory>


Batch Options:
    -i  or --inputDir          Input directory
    -o  or --outputDir         Output directory
    -numConsumers              Number of processing threads
    -bc                        Batch config file
    -maxRestarts               Maximum number of times the
                               watchdog process will restart the child process.
    -timeoutThresholdMillis    Number of milliseconds allowed to a parse
                               before the process is killed and restarted
    -fileList                  List of files to process, with
                               paths relative to the input directory
    -includeFilePat            Regular expression to determine which
                               files to process, e.g. "(?i)\.pdf"
    -excludeFilePat            Regular expression to determine which
                               files to avoid processing, e.g. "(?i)\.pdf"
    -maxFileSizeBytes          Skip files longer than this value

    Control the type of output with -x, -h, -t and/or -J.

    To modify child process jvm args, prepend "J" as in:
    -JXmx4g or -JDlog4j.configuration=file:log4j.xml.

You can also use the jar as a component in a Unix pipeline or as an external tool in many other scripting languages.

# Check if an Internet resource contains a specific keyword
curl http://.../document.doc \
  | java -jar tika-app.jar --text \
  | grep -q keyword

A nice thing about Tika is the existence of ports to other languages such as Python or Julia.

So, to open documents from the Content Server, you need to use a Java REST client (log in, select the document, transfer the document).

Then use TIKA to extract the text of the document.
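
Here is a minimal sketch of that extraction step using Tika's facade class; the file name is only a placeholder for whatever the REST client has stored locally:

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ExtractPlainText {

   public static void main(final String[] args) throws IOException, TikaException {

      // Placeholder for a document previously downloaded from the Content Server via REST
      File file = new File("downloadedDocument.docx");

      // The Tika facade auto-detects the format and returns the plain text
      Tika tika = new Tika();
      String text = tika.parseToString(file);

      // This text can now be handed to openNLP for the NLP tasks downstream
      System.out.println(text);
   }
}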

Then use openNLP to do all of the AI natural language processing (NLP).

That's it.

It’s really easy to use Apache Tika and Apache openNLP in the Content Server environment.

The built-in algorithms for the openNLP categorizer

Maximum Entropy

The built-in algorithms for the openNLP categorizer, which can be used to create and train the categorizer right out of the box, are (see the training sketch after this list):

maxent – maximum entropy

n-gram

naive Bayes (nb)
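
Here is a minimal training sketch, assuming openNLP 1.9+ and a hypothetical training file train.txt in openNLP's one-document-per-line format (category, whitespace, then the document text). The algorithm is selected via TrainingParameters; swap "NAIVEBAYES" for "MAXENT" to train a maximum entropy model (n-gram features are typically added through the DoccatFactory's feature generators instead):

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainCategorizer {

   public static void main(final String[] args) throws Exception {

      // Hypothetical training file: one sample per line, "category document text ..."
      InputStreamFactory isf = new MarkableFileInputStreamFactory(new File("train.txt"));
      ObjectStream<String> lines = new PlainTextByLineStream(isf, StandardCharsets.UTF_8);
      ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

      // Select the built-in training algorithm: "MAXENT" or "NAIVEBAYES"
      TrainingParameters params = new TrainingParameters();
      params.put(TrainingParameters.ALGORITHM_PARAM, "NAIVEBAYES");

      DoccatModel model = DocumentCategorizerME.train("en", samples, params, new DoccatFactory());

      // Categorize a tokenized document with the freshly trained model
      DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
      double[] outcomes = categorizer.categorize(new String[] {"please", "cancel", "my", "subscription"});
      System.out.println("Best category: " + categorizer.getBestCategory(outcomes));
   }
}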

Starting page of the articles on Apache openNLP for the Content Server

maxent – maximum entropy

maximum entropy (AI-generated image)

The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data (such as a proposition that expresses testable information).

In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.
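
For reference, the principle can be stated compactly (in LaTeX notation; p is the model distribution and the f_i are feature functions whose expected values are constrained to match the training data):

\max_{p} \; H(p) = -\sum_{x} p(x)\,\log p(x)
\quad \text{subject to} \quad
\mathbb{E}_{p}[f_i] = \mathbb{E}_{\text{data}}[f_i] \;\; (i = 1,\dots,k), \qquad \sum_{x} p(x) = 1

The solution of this constrained optimization is the log-linear family p(x) \propto \exp\big(\sum_i \lambda_i f_i(x)\big), which is exactly the model form the maxent categorizer fits to the training data.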

See here an example of the maxent categorizer

n-gram

n-gram (AI-generated image)

An n-gram is a sequence of n successive items in a text document; the items may be words, numbers, symbols, or punctuation. N-gram models are useful in many text analytics applications where sequences of words are relevant, such as sentiment analysis, text classification, and text generation. N-gram modeling is one of many techniques used to convert text from an unstructured format into a structured format. An alternative to n-grams are word embedding techniques, such as word2vec. (See the spaCy article for word vectors.)
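
As a small illustration of the idea, here is what n-gram extraction over a token list boils down to; this is a plain-Java sketch, not openNLP's own n-gram API:

import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

   // Collect all n-grams of the given size from a list of tokens
   static List<String> ngrams(List<String> tokens, int n) {
      List<String> result = new ArrayList<>();
      for (int i = 0; i + n <= tokens.size(); i++) {
         result.add(String.join(" ", tokens.subList(i, i + n)));
      }
      return result;
   }

   public static void main(final String[] args) {
      List<String> tokens = List.of("the", "quick", "brown", "fox");
      System.out.println(ngrams(tokens, 2));   // [the quick, quick brown, brown fox]
      System.out.println(ngrams(tokens, 3));   // [the quick brown, quick brown fox]
   }
}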

See here an example of the n-gram categorizer

Naive Bayes

naive Bayes (AI-generated image)

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
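
To make that scoring rule concrete, here is a tiny numeric sketch in Java; the probabilities are made up for the fruit example above, and this is not openNLP's implementation:

import java.util.Map;

public class NaiveBayesSketch {

   public static void main(final String[] args) {
      // Made-up priors and per-feature likelihoods for the fruit example
      double priorApple = 0.5, priorOther = 0.5;
      Map<String, Double> pFeatureGivenApple = Map.of("red", 0.7, "round", 0.9, "ca10cm", 0.8);
      Map<String, Double> pFeatureGivenOther = Map.of("red", 0.3, "round", 0.4, "ca10cm", 0.3);

      String[] observed = {"red", "round", "ca10cm"};
      double scoreApple = priorApple, scoreOther = priorOther;
      for (String f : observed) {
         // The "naive" independence assumption: just multiply the per-feature likelihoods
         scoreApple *= pFeatureGivenApple.get(f);
         scoreOther *= pFeatureGivenOther.get(f);
      }

      // Normalize the two scores into posterior probabilities
      double total = scoreApple + scoreOther;
      System.out.printf("P(apple | red, round, ~10 cm) = %.3f%n", scoreApple / total);
      System.out.printf("P(other | red, round, ~10 cm) = %.3f%n", scoreOther / total);
   }
}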

See here an example of the naive Bayes categorizer

A TIKA based language detector

Apache Tika

Part of the free Apache Tika package (used to extract text from various file formats for NLP tasks) is also a language detector. This Tika-based language detector can be used independently of the openNLP language detector.

Content

Using Apache Tika as a command line utility

Apache Tika Example: Extracting MS Office

A TIKA based language detector – supported languages

Example of a TIKA based language detector

Go to the start of the openNLP series of articles: Starting Page

A TIKA based language detector – supported languages

These 18 languages are supported by TIKA

da – Danish      de – German      et – Estonian     el – Greek
en – English     es – Spanish     fi – Finnish      fr – French
hu – Hungarian   is – Icelandic   it – Italian      nl – Dutch
no – Norwegian   pl – Polish      pt – Portuguese   ru – Russian
sv – Swedish     th – Thai

Example of a TIKA based language detector

This is the Java source code:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.language.*;

import org.xml.sax.SAXException;

public class LanguageDetection {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      //Instantiating a file object
      File file = new File("myExample.txt");

      //Parser method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      FileInputStream content = new FileInputStream(file);

      //Parsing the given document
      parser.parse(content, handler, metadata, new ParseContext());

      LanguageIdentifier identifier = new LanguageIdentifier(handler.toString());
      System.out.println("Language name :" + identifier.getLanguage());
   }
}

If you run this program with this as myExample.txt

Þetta er íslenskur frumkóði (Icelandic for "This is Icelandic source code")

you’ll get this as output

Language name :is

So you can also use the language detector from Apache Tika.

openNLP vs Spacy for Contentserver Part 2

openNLP vs Spacy for Contentserver is the second part of a comparison between these two AI NLP packages in the Content Server environment.

openNLP vs Spacy for Contentserver Part 1

Go to the start of the openNLP series of articles: Starting Page

Spacy

openNLP vs Spacy for Contentserver Part 2

Feature: Named Entities (NER) detection (ISO language codes)
    openNLP: fr, de, en, it, nl, da, es, pt, se (other languages require training)
    spaCy: ca, zh, hr, da, nl, en, fi, fr, de, el, it, ja, ko, lt, mk, nb, pl, pt, ro, ru, sl, es, sv, uk, af, sq, am, grc, ar, eu, bn, bg, cs, et, fo, gu, he, hi, hu, is, id, ga, hn, ki, la, lv, lij, dsb, lg, ms, ml, mr, ne, nn, fa, sa, sr, tn, si, sk, tl, ta, tt, te, hh, ti, tr, hsb, ur, vi, yo (other languages require training)

Feature: Word Vectors
    openNLP: experimental
    spaCy: included in the larger models

Feature: Visualizers
    openNLP: none
    spaCy: Part of Speech, Named Entities and Span visualizers; visualizer in Jupyter Notebooks; web based

Feature: Connect to the Content Server
    openNLP: 1. inside the Content Server in the JVM; 2. from a Java client using REST
    spaCy: 1. with a Java REST client that invokes the spaCy processor for each entry to process; 2. using jspybridge (JavaScript/Python bridge) and connecting the JS part to the Content Server via REST. Remark: using REST directly from Python won't work due to the architecture of Content Server REST

Feature: File Type Opener
    openNLP: Apache Tika
    spaCy: Apache Tika

Feature: Application Architecture
    openNLP: separate client / can run in the Content Server
    spaCy: separate client

Feature: LLM (Large Language Model) Interface
    openNLP: none; standard NLP tasks such as Named Entity Recognition and Text Classification are to be implemented locally based on openNLP
    spaCy: Hugging Face, OpenAI API, including GPT-4 and various GPT-3 models (usage examples for standard NLP tasks such as Named Entity Recognition and Text Classification)

Feature: Programming Language
    openNLP: Java
    spaCy: Python

openNLP vs Spacy for Contentserver

openNLP vs Spacy for Contentserver is the first part of a comparison between these two AI NLP packages in the Content Server environment.

openNLP vs Spacy for Contentserver Part 2

Go to the start of the openNLP series of articles: Starting Page

Spacy

openNLP vs Spacy for Contentserver Part 1

Feature: Programming Language
    openNLP: Java, at least JDK 17
    spaCy: Python, at least 3.1; a Python environment must be installed (like Anaconda)

Feature: Connect to the Content Server
    openNLP: JVM inside of the Content Server or REST API
    spaCy: REST API, restricted use with Content Server

Feature: Open Documents
    openNLP: Apache Tika as frontend processor
    spaCy: Apache Tika as frontend processor, REST or batch connection

Feature: Supported Languages (ISO language codes)
    openNLP: fr, de, en, it, nl, da, es, pt, se as pretrained models
    spaCy: ca, zh, hr, da, nl, en, fi, fr, de, el, it, ja, ko, lt, mk, nb, pl, pt, ro, ru, sl, es, sv, uk, af, sq, am, grc, ar, eu, bn, bg, cs, et, fo, gu, he, hi, hu, is, id, ga, hn, ki, la, lv, lij, dsb, lg, ms, ml, mr, ne, nn, fa, sa, sr, tn, si, sk, tl, ta, tt, te, hh, ti, tr, hsb, ur, vi, yo as pretrained models

Feature: Trainable Languages
    openNLP: yes
    spaCy: yes

Feature: Word Vectors
    openNLP: experimental
    spaCy: supported in the large models

Installing Spacy

Spacy

We had an overview of Spacy, which you’ll find here. Today, we shall discuss the installation of Spacy.

So without further ado, let's dive into the installation, shall we?

Content

Installation

Step 1: Select the configuration

Step 2: Download

Visual Studio Code extension

The main installation page is https://spacy.io/usage

Installation

Install a Python environment as a prerequisite. Use a recent Python version.

Remark: I use Anaconda as a free environment.

Step 1: Select the configuration

Select operating system, platform and the pre-trained models.

For example, if you choose to download a Windows version for x86 processors and you have a modern graphics card (like the NVIDIA A4000), then you can select the GPU option with CUDA 11.2-11.x. In this picture, the languages English, French, German and Romanian are selected.

Step 2: Download

Your configuration will result in a couple of command line entries. Copy and paste these into your environment.

Remark: I use Anaconda as a free environment.

After you have executed the pasted lines, they download the selected components.

Visual Studio Code extension

There is also a Visual Studio Code extension for spaCy, which can be used as an IDE.

It can be found at https://marketplace.visualstudio.com/items?itemName=Explosion.spacy-extension

Spacy

Spacy

spaCy is a popular Python-based open-source AI NLP (natural language processing) package for 75+ languages, including support for word vectors. spaCy also supports modern graphics processors.

Content

How to find it?

Package naming conventions

Capabilities

POS Tagging

Morphology

Lemmatization

Dependency Parsing

Named Entity Recognition

Word vectors and semantic similarity

Go to the start of the openNLP series of articles: Starting Page

Spacy vs openNLP Part 1 Pros and Cons

Spacy

How to find it?

The website is spaCy · Industrial-strength Natural Language Processing in Python.

It also has several downloadable pretrained models in these languages:

Package naming conventions

In general, spaCy expects all pipeline packages to follow the naming convention [lang]_[name]. For spaCy's own pipelines, the name part is additionally divided into three components:

  1.  Type: Capabilities (e.g. core for general-purpose pipeline with tagging, parsing, lemmatization and named entity recognition, or dep for only tagging, parsing and lemmatization).
  2.  Genre: Type of text the pipeline is trained on, e.g. web or news.
  3.  Size: Package size indicator: sm, md, lg or trf
    sm and trf pipelines have no static word vectors. For pipelines with default vectors, md has a reduced word vector table with 20k unique vectors for ~500k words and lg has a large word vector table with ~500k entries. For pipelines with floret vectors, md vector tables have 50k entries and lg vector tables have 200k entries.

Example en_core_web_md

For example, en_core_web_md is a medium English model trained on written text that includes vocabulary, syntax and entities.

The larger models have the word vectors built in.

SIZE: md, 40 MB
COMPONENTS: tok2vec, tagger, parser, senter, attribute_ruler, lemmatizer, ner
PIPELINE: tok2vec, tagger, parser, attribute_ruler, lemmatizer, ner
VECTORS: 514k keys, 20k unique vectors (300 dimensions)
DOWNLOAD LINK: en_core_web_md-3.7.1-py3-none-any.whl
SOURCES: OntoNotes 5 (Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, Ann Houston);
ClearNLP Constituent-to-Dependency Conversion (Emory University);
WordNet 3.0 (Princeton University);
Explosion Vectors (OSCAR 2109 + Wikipedia + OpenSubtitles + WMT News Crawl) (Explosion)

A short summary of the linguistic capabilities

POS Tagging

After tokenization, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("IBM is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

If you run that Python code, you'll get
IBM IBM PROPN NNP nsubj XXX True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP dobj X.X. False False
startup startup NOUN NN dep xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False

this means

TEXT	LEMMA	POS	TAG	DEP	SHAPE	ALPHA	STOP
IBM	ibm	PROPN	NNP	nsubj	Xxxxx	True	False
is	be	AUX	VBZ	aux	xx	True	True
looking	look	VERB	VBG	ROOT	xxxx	True	False
at	at	ADP	IN	prep	xx	True	True
buying	buy	VERB	VBG	pcomp	xxxx	True	False
U.K.	u.k.	PROPN	NNP	compound	X.X.	False	False
startup	startup	NOUN	NN	dobj	xxxx	True	False
for	for	ADP	IN	prep	xxx	True	True
$	$	SYM	$	quantmod	$	False	False
1	1	NUM	CD	compound	d	False	False
billion	billion	NUM	CD	pobj	xxxx	True	False

If you use one of the spacy visualizers, you’ll get this image

Nice, isn’t it?

Goto TOP

Morphology

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:

CONTEXT	SURFACE	LEMMA	POS	MORPHOLOGICAL FEATURES
I was reading the paper	reading	read	VERB	VerbForm=Ger
I don’t watch the news, I read the paper	read	read	VERB	VerbForm=Fin|Mood=Ind|Tense=Pres
I read the paper yesterday	read	read	VERB	VerbForm=Fin|Mood=Ind|Tense=Past

import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I am going to dinner")
token = doc[0]  # 'I'
print(token.morph)  # 'Case=Nom|Number=Sing|Person=1|PronType=Prs'
print(token.morph.get("PronType"))  # ['Prs']

If you run this code, you’ll get

Case=Nom|Number=Sing|Person=1|PronType=Prs

['Prs']


Goto TOP

Lemmatization

As always, a lemmatizer reduces a word to its base form.

import spacy

# English pipelines include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode) # 'rule'

doc = nlp("I was taking the paper.")
print([token.lemma_ for token in doc])

If you run this code, you’ll get

rule
['I', 'be', 'take', 'the', 'paper', '.']

Goto TOP

Dependency Parsing

spaCy features a syntactic dependency parser, and has an API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or “chunks”.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

If you run this code, you’ll get

C:\Users\merz\AppData\Local\anaconda3\python.exe K:\cspython\src\REST\test.py 
Credit nmod 0 2 ['account', 'holders', 'submit']
and cc 0 0 ['Credit', 'account', 'holders', 'submit']
mortgage conj 0 0 ['Credit', 'account', 'holders', 'submit']
account compound 1 0 ['holders', 'submit']
holders nsubj 1 0 ['submit']
['credit', 'and', 'mortgage', 'account', 'holder', 'should', 'submit', 'their', 'request']


TEXT	DEP	N_LEFTS	N_RIGHTS	ANCESTORS
Credit	nmod	0	2	holders, submit
and	cc	0	0	holders, submit
mortgage	compound	0	0	account, Credit, holders, submit
account	conj	1	0	Credit, holders, submit
holders	nsubj	1	0	submit

Finally, the .left_edge and .right_edge attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase. Note that .right_edge gives a token within the subtree – so if you use it as the end-point of a range, don’t forget to +1!

Goto TOP

Named Entity Recognition

spaCy has a fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system and update the model with new examples.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning delivery robots")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # ['San', 'B', 'GPE']
print(ent_francisco)  # ['Francisco', 'I', 'GPE']

If you run that code, you’ll get

C:\Users\merz\AppData\Local\anaconda3\python.exe K:\cspython\src\REST\test.py 
[('San Francisco', 0, 13, 'GPE')]
['San', 'B', 'GPE']
['Francisco', 'I', 'GPE']

If you use the built-in visualizer of spaCy to visualize the NER, you’ll get this:

Goto TOP

Word vectors and semantic similarity

import spacy

nlp = spacy.load("de_core_news_lg")
tokens = nlp("dog cat banane afskfsd")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

If you run that code, you’ll get this

C:\Users\merz\AppData\Local\anaconda3\python.exe K:\cspython\src\REST\test.py 
hund True 45.556004 False
katze True 40.768963 False
banane True 22.76727 False
afskfsd False 0.0 True

The words “hund” (German for “dog”), “katze” (German for “cat”) and “banane” (German for “banana”) are all pretty common in German, so they’re part of the pipeline’s vocabulary and come with a vector. The word “afskfsd” on the other hand is a lot less common and out-of-vocabulary, so its vector representation consists of 300 dimensions of 0, which means it’s practically nonexistent.

spaCy can compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that’s similar to what they’re currently looking at, or label a support ticket as a duplicate if it’s very similar to an already existing one.

Goto TOP

Apache Tika

Apache Tika

Apache Tika – open and extract text from virtually all formats. It’s called a “content analysis toolkit”.

It’s free like openNLP and is used to extract text from virtually every file format possible.

Content

Using Apache Tika as a command line utility

Apache Tika Example: Extracting MS Office

Go to the start of the openNLP series of articles: Starting Page

The complete configuration of our AI Content Server extension is like this:

Apache Tika and openNLP

You can get Tika at https://tika.apache.org/ . How to install it is explained at that URL.

A nice overview of TIKA is here https://www.tutorialspoint.com/tika/tika_overview.htm

A very nice book on TIKA at Manning https://www.manning.com/books/tika-in-action

Apache Tika

Using Apache Tika as a command line utility

The basic usage (without programming) is as a command line utility. It’s used like this:

usage: java -jar tika-app.jar [option...] [file|port...]

(The full option list, batch options and description are identical to the usage help already shown above in "Using TIKA in the command line".)

This standalone jar is the easiest way to use Tika.

Apache Tika Example: Extracting MS Office

This is an example of using TIKA with Excel (metadata)

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class MSxcelParse {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {
      
      //detecting the file type
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      FileInputStream inputstream = new FileInputStream(new File("example_msExcel.xlsx"));
      ParseContext pcontext = new ParseContext();
      
      //OOXml parser
      OOXMLParser  msofficeparser = new OOXMLParser (); 
      msofficeparser.parse(inputstream, handler, metadata,pcontext);
      System.out.println("Contents of the document:" + handler.toString());
      System.out.println("Metadata of the document:");
      String[] metadataNames = metadata.names();
      
      for(String name : metadataNames) {
         System.out.println(name + ": " + metadata.get(name));
      }
   }
}

When you run this program, you’ll see this output

Contents of the document:

Sheet1
Name	Age	Designation		Salary
Ramu	50	Manager			50,000
Raheem	40	Assistant manager	40,000
Robert	30	Superviser		30,000
sita	25	Clerk			25,000
sameer	25	Section in-charge	20,000

Metadata of the document:

meta:creation-date:    2006-09-16T00:00:00Z
dcterms:modified:    2014-09-28T15:18:41Z
meta:save-date:    2014-09-28T15:18:41Z
Application-Name:    Microsoft Excel
extended-properties:Company:    
dcterms:created:    2006-09-16T00:00:00Z
Last-Modified:    2014-09-28T15:18:41Z
Application-Version:    15.0300
date:    2014-09-28T15:18:41Z
publisher:    
modified:    2014-09-28T15:18:41Z
Creation-Date:    2006-09-16T00:00:00Z
extended-properties:AppVersion:    15.0300
protected:    false
dc:publisher:    
extended-properties:Application:    Microsoft Excel
Content-Type:    application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Last-Save-Date:    2014-09-28T15:18:41Z

Please refer to the book or to the pages from Apache for more.

The new Install Modules Page

The new Install Modules page

In the versions prior to 21.4 the module version numbers tended to be confusing. To fix this, there is a changed “Install Modules” page.

The new Install Modules page

This page has been redesigned to avoid confusion.

The standard modules appear in a separate section without version numbers. And the somewhat long names (like “Content Server Comments”) are simplified to “Comments”.

The optional modules are listed in the “Installed Modules” section. They now feature “Build” instead of “Version” to clarify the meaning.

Have much fun with the new “Install Modules” page.

Rethinking smartUI Part 4-B

Forms Rest command

Some weeks ago I published a new video on Rethinking smartUI on YouTube. Now we have Rethinking smartUI Part 4-B, discussing the main part of gathering and displaying the document’s data.

If you haven’t seen it yet, here is the short video. In this post, we’ll go through the technical aspects of this new kind of approach to smartUI. This demo is a Document Pad (or DocPad for short), displaying all document properties in a sci-fi GUI arrangement.

A warning at the beginning: using this code with IE11 is a perfect way to test all JS error messages. Use this code only with the newest browsers. It’s tested with Chrome (V98), Chrome Canary (V98), FF Developer (95.0b4), Edge (95.0) and Firefox (93.0).

The other parts of this post were

Part 4-A The Javascript Part 1

Part 3 The Infrastructure and the CSS

Part 2 The HTML

Part 1 Overview

In Part 4-A, we discussed all the JS responsible for the perimeter of the widget. Now let’s discuss the main part, which is responsible for gathering and displaying this data:

docdisplay

A CSS Grid

As you can see, there are 6 panels arranged in a CSS grid.

For information on the CSS, please review Part 3 of this series.

So let’s start with the panel at top left.

The documents metadata

This is more or less the data which is related directly to the document. The document’s node number was the output from the node picker. The node picker was closed by the done() callback.

Nodepicker

Here we are in the done() function of the node picker. We extract the node from the callback’s arguments and get the id (topmost arrow). We extract the name of the node and put this name inside the element with the id #document.

The loadDescriptions function does the work.

loadDescriptions

The prelude is simply to select the first face “.face.one”

Prelude
Begin

If this is not undefined (remember, smartUI always makes two runs, so it’s always a good idea to test whether something is defined), the create and modify dates are extracted and translated into a standard JS date. For non-US readers there will always be a difference between, e.g., 04-05-20 and 4. Mai 2020 (the US and German dates for the Star Wars day, May the fourth); that’s why we translate the dates.

We also need the users for the creation and the modification. But these are just numbers (user IDs), so we want to translate them into names.

Next, extract the server from the connection and construct the members REST command to get these names.

First view: The fetch command

fetch is new in JS6. In those older, archaic times you would have used some Ajax variant like XMLHttpRequest, or similar methods which we will use in other calls for comparison.

Fetch command

Technically, we have to issue two REST calls to /member/ to get the names of the createuser and the modifyuser. We use the fetch command.

Remark: the famous async/await would be much more handy for that, but we wanted to limit the language scope to js6 for these posts.

Once we get the responses, we simply put those names as innerHTML on the panel.

Technically, you can use any other available method to put text on the panel, from template strings to creating and filling a text node in the DOM. You can even invite Handlebars to do this for you.

loadDocumentThumbnail

In the top middle panel we added the document thumbnail, which is created automatically during indexing on the server.

Thumbnail

We must enter the nodeid in the REST command /thumbnails/medium/content to get the medium resolution version of the thumbnail.

To show the difference to the fetch command, the old, archaic XMLHttpRequest was used.

The received image is put into a div with the id “Thumbnail”.

Image Correction

In case the user selects another document, the old thumbnail would remain, so we remove the old image element.

Almost done; we need to put our otcsticket in the request header and send the request to the server.

loadNodeData

In this function, we use exactly one REST call to get all data at once. This is done by the function /forms/update?id=xx, which delivers all data for that nodeid at once. Especially the categories take a while, so a CSS fog was used to cloak the image of the approaching grid until the data was received (revisit the video). Then the CSS fog is cleared and the categories are displayed.

The call is also done with the old XMLHttpRequest to show the differences to the modern fetch command.

Local functions were used instead of those in “this” to keep the scope clean.

forms/update

The categories and the attributes

Categories

Categories are returned in an object with their category name as the title of the entry. To get the attributes, we have to do a little bit more.

Attributes

We split the result into several arrays to extract the values. If we have “date” in the type field, we also have to apply our date translation to display the dates correctly.

Security Clearances

All security-related data is on the fourth face, the one on the lower left.

Security Clearances

Here, all security levels and markings are displayed inside a span.

Records Management Data

Here we extract and fill the data on the lower middle panel.

Records Management

The Versions Data

Here, the REST command has a problem. Versions are not included in the answer of the REST command, at least in Content Server versions 21.3 and 21.4. So let’s inform the user of this fact and display a localized string stating it.

Tip: Maybe there will be a patch to fix this in the near future.

Versions Data

With this, all parts have been discussed.

We offer a one-day remote training to understand the JavaScript code. If you are already a sophisticated JavaScript developer, you can also get the free sources from https://github.com/ReinerMerz/reinerdemo (a public repository on GitHub).

Warning: This is only the source tree of the project, so you have to insert it into your own project file.

The data returned from the forms/update?id=nn REST command

The whole data structure is sent back in response to a forms/update?id=nnn REST call. Some of these entries take quite a while, so try using some CSS to cloak this (like the fog effect above).

Forms Rest command

Have fun discovering the endless possibilities of dashboards and other Content Server smartUI extensions using JavaScript 6 and CSS3.

The sky is the limit.