Using TIKA in the command line

Apache Tika

You might have noticed, that you need a tool to open a document and extract the content as pure text for further natural language processing (p.ex. the buildin Categorizer, linguistic examination, language detection or predicting something from your documents or finding names and locations from your documents ) downstream.

The answer is Apache Tika, a free tool from Apache.

But you as you can programm TIKA in JAVA (inside the Content Server or as a client), you can also use TIKA in the command line without any programming.

BTW: See the starting page of the openNLP series of articles here

The Tika application jar (tika-app-*.jar) can be used as a command line utility for extracting text content and metadata from all sorts of files. This runnable jar contains all the dependencies it needs, so you don’t need to worry about classpath settings to run it. So, no need for coding.

This is the usage-help documentation

usage: java -jar tika-app.jar [option...] [file|port...]

Options:
    -?  or --help          Print this usage message
    -v  or --verbose       Print debug level messages
    -V  or --version       Print the Apache Tika version number

    -g  or --gui           Start the Apache Tika GUI
    -s  or --server        Start the Apache Tika server
    -f  or --fork          Use Fork Mode for out-of-process extraction

    --config=<tika-config.xml>
        TikaConfig file. Must be specified before -g, -s, -f or the dump-x-config !
    --dump-minimal-config  Print minimal TikaConfig
    --dump-current-config  Print current TikaConfig
    --dump-static-config   Print static config
    --dump-static-full-config  Print static explicit config

    -x  or --xml           Output XHTML content (default)
    -h  or --html          Output HTML content
    -t  or --text          Output plain text content
    -T  or --text-main     Output plain text content (main content only)
    -m  or --metadata      Output only metadata
    -j  or --json          Output metadata in JSON
    -y  or --xmp           Output metadata in XMP
    -J  or --jsonRecursive Output metadata and content from all
                           embedded files (choose content type
                           with -x, -h, -t or -m; default is -x)
    -l  or --language      Output only language
    -d  or --detect        Detect document type
           --digest=X      Include digest X (md2, md5, sha1,
                               sha256, sha384, sha512
    -eX or --encoding=X    Use output encoding X
    -pX or --password=X    Use document password X
    -z  or --extract       Extract all attachements into current directory
    --extract-dir=<dir>    Specify target directory for -z
    -r  or --pretty-print  For JSON, XML and XHTML outputs, adds newlines and
                           whitespace, for better readability

    --list-parsers
         List the available document parsers
    --list-parser-details
         List the available document parsers and their supported mime types
    --list-parser-details-apt
         List the available document parsers and their supported mime types in apt format.
    --list-detectors
         List the available document detectors
    --list-met-models
         List the available metadata models, and their supported keys
    --list-supported-types
         List all known media types and related information


    --compare-file-magic=<dir>
         Compares Tika's known media types to the File(1) tool's magic directory

Description:
    Apache Tika will parse the file(s) specified on the
    command line and output the extracted text content
    or metadata to standard output.

    Instead of a file name you can also specify the URL
    of a document to be parsed.

    If no file name or URL is specified (or the special
    name "-" is used), then the standard input stream
    is parsed. If no arguments were given and no input
    data is available, the GUI is started instead.

- GUI mode

    Use the "--gui" (or "-g") option to start the
    Apache Tika GUI. You can drag and drop files from
    a normal file explorer to the GUI window to extract
    text content and metadata from the files.

- Batch mode

    Simplest method.
    Specify two directories as args with no other args:
         java -jar tika-app.jar <inputDirectory> <outputDirectory>


Batch Options:
    -i  or --inputDir          Input directory
    -o  or --outputDir         Output directory
    -numConsumers              Number of processing threads
    -bc                        Batch config file
    -maxRestarts               Maximum number of times the
                               watchdog process will restart the child process.
    -timeoutThresholdMillis    Number of milliseconds allowed to a parse
                               before the process is killed and restarted
    -fileList                  List of files to process, with
                               paths relative to the input directory
    -includeFilePat            Regular expression to determine which
                               files to process, e.g. "(?i)\.pdf"
    -excludeFilePat            Regular expression to determine which
                               files to avoid processing, e.g. "(?i)\.pdf"
    -maxFileSizeBytes          Skip files longer than this value

    Control the type of output with -x, -h, -t and/or -J.

    To modify child process jvm args, prepend "J" as in:
    -JXmx4g or -JDlog4j.configuration=file:log4j.xml.

You can also use the jar as a component in a Unix pipeline or as an external tool in many other scripting languages.

# Check if an Internet resource contains a specific keyword
curl http://.../document.doc \
  | java -jar tika-app.jar --text \
  | grep -q keyword

A nice thing about TIKA is the existence of other language ports like Python or Julia

So, to open documents from the Content Server, you need to use a JAVA REST client (Login, Select the document, transfer the document).

Then use TIKA to extract the text of the document.

Then use openNLP to do all of the AI natural language processing (NLP).

Thats it.

It’s really easy to use Apache TIKA and Apache openNLP in the Content Server environment

openNLP vs Spacy for Contentserver Part 2

openNLP vs Spacy for Contentserver is the second part of a comparism between this two AI NLP packages in the Content Server environment.

openNLP vs Spacy for Contentserver Part 1

Goto the start opennlp series of acticles: Starting Page

Spacy

openNLP vs Spacy for Contentserver Part 2

FeatureopenNLPSpacy
Named Entities (NER) detection (ISO Language Codes)fr, de, en, it, nl,da, es, pt,se Other Languages require trainingca,zh, hr, da, nl, en, fi, fr, de, el, it, ja, ko, lt, mk, nb, pl, pt, ro, ru, sl, es, sv, uk, af, sq, am, grc, ar, eu, bn, bg, cs, et, fo, gu, he, hi, hu, is, id, ga, hn, ki, la, lv, lij, dsb, lg, ms, ml, mr, ne, nn, fa, sa, sr , tn, si, sk, tl, ta, tt, te, hh, ti, tr, hsb, ur, vi, yo
Other Languages require training
Word Vectorsexperimentalincluded in the larger models
Visualizersnone Part of Speech
Named Entities
Span
Visualizer in Jupyter Notebooks
Web Based
Connect to the Content Server1. Inside the Content Server in the JVM
2.From a JAVA Client using REST
1. With a JAVA Rest client. This client invokes trhe Spacy processor for each entry to process
2.Using jspybridge (javascript/python bridge) and connect the js part to the Content Server via REST
Remark: Using REST directly from Python won’t work due to the architecture of Content Server REST
File Type OpenerApache TICAApache TICA
Application ArchitectureSeparate Client/can run in the Content ServerSeparate Client
LLM (Large Language Model) Interfacenone as LLM, standard NLP tasks such as Named Entity Recognition and Text Classification are to be implemented locally based n openNLP
Hugging Face, OpenAI API, including GPT-4 and various GPT-3 models (Usage examples for standard NLP tasks such as Named Entity Recognition and Text Classification)
Programming LanguageJAVAPython

Calling OScript from Java Code (2)

In December 2016, we discussed how to call Java code from OScript . Now, we’ll discuss the other way round, how to call OScript from Java. This can be quite useful, if you want to use your business logic implemented in java against the content server.

The Java code must be put in the ojlib directory (see the previous post on this topic).

As always, if you want to deploy your java code within a module, do this

  • Build your code into jar files.
  • Add the jar files in OTHOME/ojlib/ or OTHOME/module/yourmodule_yourversions/ojlib/ directory (and their subdirectories) to be recognized by Content Serverk JVM’s application classloader.

The base thing is, you call OScript built-in functions through the OScriptObject.runScript method from a java coding.

For example, the java code

OScriptObject.runScript("echo","System.ThreadID()");

will display the unique integer for the current thread ID at the console (or in the logs, if you do not use CSIDE)

You can call all OScript functions and scripts. This example will list all nodes in the enterprise workspace

     // List nodes in the Enterprise workspace.
            result = (Map<String,Object>) OScriptObject.runScript( "$LLIAPI.NodeUtil.ListNodes",
				prgCtx,
				"(ParentID=:A1)",
				args );

prgCtx is the standard rogram Context, args is the ArrayList containing the arguments and A1 points to the first entry in the args array to be used as ParentID.

The next example can be used to get the current user from java coding, extract its userID and then derives the user name from ths user id. A standard logger is used to log the output, replace this with the logger of your preference.

    public static String getUserName( OScriptObject prgCtx )
    throws Exception
    {
        String retval = " user not found";
        
        try
        {        
            // get the user session object
            OScriptObject uSession = (OScriptObject) prgCtx.invokeScript( "USession" );
            // get the userID from the User Session
            Integer userID = (Integer) uSession.getFeature( "fUserId" );
            // display it
            logger.log( Level.INFO, "UserID is " + userID );
            
            retval = "The current login User: " + userID;
            
            // Get the Users name from the User ID
            Map<String,Object> status = (Map<String,Object>)
			OScriptObject.runScript( "$LLIAPI.UsersPkg.NameGetByID", uSession, 1000 );
			
            logger.log( Level.INFO, String.valueOf( status ) );
            logger.log( Level.INFO, (String) status.get( "Name" ) );
        }
        catch( Exception e )
        {
            logger.log( Level.SEVERE, "Caught Exception", e );
            throw e;
        }
        
        return retval;
    }

 

In the next posting on this topic, we’ll discuss the Mappings from JAVA to OScript and vice versa.

 

Calling Java Code from OScript (1)

Sometimes it would be nice to use existing Java coding from a module instead of recoding this in OScript.

There is a facility in the content server which does exactly this bridging from OScript to Java, the so called JavaObject class. You’ll find the exact documentation in the “OScript API/Build-In Package Index”

In this first post of the series we’ll discuss the  basic calls from OScript.

First, you need some Java Code, compiled and in the form of a jar. Put this jar either in OTHOME/ojlib or (much better) in a ojlib directory in your module structure. After installing the module, the jar(s) are copied automatically to the OTHOME/ojlib. Then, the jvm classloader will find your jar(s).

From OScript it is possible to access static classes and instances.

Static
Dynamic InvokeStaticMethod(String classname, String methodName, List parameters)

Dynamic GetStaticField( String classname, String fieldName)

Dynamic SetStaticField( String className, String  fieldName, Dynamic value)

The return values are either Error or Dynamic if the call is successful.

Dynamic
JavaObject New (String className, List parameters)
Instance
Dynamic GetField(String fieldname)

Dynamic SetField(String fieldname, Dynamic value)

Dynamic InvokeMethod(String methodName, List parameter)

The return values are either Error or Dynamic if the call is successful.

An example
function void javaTest()

   JavaObject myObject

   myObject = JavaObject.new("my.own.package.class")
 
   Dynamic res = myObject.InvokeMethod("myMethod",{"aa","bb})

   if (isError(res))

     echo ("Init failed")

     return

   end

   Dynamic res1 = myObject.GetField("myResult")

   if (IsError(res)) 

     echo("Calculation failed")

     return

   end

   echo("The result is "+res1)

end

In the next post we’ll discuss how to get the JNI exceptions and the error stack from the jvm.

 

Authentication (3/3) Authenticate with Webservices and Java against Content Server

To authenticate with a JAVA client against a Content Server, you should first create all client proxys. This example can be used against a Servlet Container, like Tomcat. Here, it is assumed, that it runs on port 8080.

The creation of the proxys must be done manually by typing

wsimport -keep http://yourserver:8080/cws/services/Authentication?wsdl
 [add all services you want to use]

jar cvfM webservices.jar com/opentext/*

Add the webservices.jar file in the build path of your Java application.

Then, use something like this to authenticate

String username = "[yourUser]";
  String passsword = "[yourpassword]";
  Authentication_Service authsrv = new Authentication_Service();
  Authentication authclient = authsrv.getbasicHttpBindingAuthentication();
  // the Token
  String authToken=null;
  // authenticate
  try
  {
      authToken = authclient.authenticateUser(username,password);
  } catch (SOAPFaultException e)
  {
      System.out.println("Failed! "+e.getFault().e.getFaultCode()+" : "+e.getMessage());
  }

This will, if succesful, store the authenticaten token in the string authToken.

Use this token in this way:

DocumentManagement_Service docmanSrv = new DocumentManagement_Service();

DocumentManagement docman = docmanSrv.getbasicHttpDocumentManagement();

OTAuthentication otAuth = new OTAuthentication();

otAuth.setAuthenticationToken(authToken);

Now, we have to set the token manually in the SOAP header

try

{

    final String API_NAMESPACE = "urn:api.ecm.opentext.com";
     SOAPHeader header = MessageFactory.newInstance().createMessage().getSOAPPart().getEnvelope().getHeader();
     SOAPHeaderElement authElement = header.addHeaderElement (new QName(API_NAMESPACE,"OPAuthentication"));
     SOAPElement authTokenElement = authElement.addChildElement (new QName(API_NAMESPACE, "AuthenticationToken"));
     authTokenElement.addTextNode (otAuth.getAuthenticationToken());
     ((WSBindingProvider) docMan).setOutboundHeaders (Headers.create(otAuthElement));
  } catch (SOAPExeption e)

{

[Print Error and return]

}

// now, we are ready to perform some functionality

Node node = docman.getNode(nodeId);
  1. Login and get the auth token
  2. set the auth token in the SOAP header
  3. tell your services about the auth token in the SOAP header
  4. perform operations on the server

Do not forget to refresh the token before it gets invalid. We discuss the refesh in a later posting.