A TIKA based language detector | The Joy of Content Server

The free Apache Tika package, which is used to extract text from various file formats for NLP tasks, also includes a language detector. This TIKA based language detector can be used independently of the OpenNLP language detector.

Content

Using Apache Tika as a command line utility

Apache Tika Example: Extracting MS Office

A TIKA based language detector – supported languages

Example of a TIKA based language detector

Go to the start of the OpenNLP series of articles: Starting Page

A TIKA based language detector – supported languages

TIKA supports these 18 languages, listed here with their two-letter ISO 639-1 codes:

da—Danish	de—German	et—Estonian	el—Greek
en—English	es—Spanish	fi—Finnish	fr—French
hu—Hungarian	is—Icelandic	it—Italian	nl—Dutch
no—Norwegian	pl—Polish	pt—Portuguese	ru—Russian
sv—Swedish	th—Thai
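The detector reports a language as one of the two-letter codes from the table, so a small lookup can turn that code back into a readable name. The following is only a minimal sketch: the class `TikaLanguageCodes` and its method `nameOf` are hypothetical helpers written for this article and are not part of Apache Tika. It uses only the Java standard library (Java 9 or later for `Map.ofEntries`).

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, NOT part of Apache Tika: maps the two-letter codes
// from the table above to readable language names.
final class TikaLanguageCodes {

    // The 18 code/name pairs, copied from the table of supported languages.
    private static final Map<String, String> NAMES = Map.ofEntries(
        Map.entry("da", "Danish"),    Map.entry("de", "German"),
        Map.entry("et", "Estonian"),  Map.entry("el", "Greek"),
        Map.entry("en", "English"),   Map.entry("es", "Spanish"),
        Map.entry("fi", "Finnish"),   Map.entry("fr", "French"),
        Map.entry("hu", "Hungarian"), Map.entry("is", "Icelandic"),
        Map.entry("it", "Italian"),   Map.entry("nl", "Dutch"),
        Map.entry("no", "Norwegian"), Map.entry("pl", "Polish"),
        Map.entry("pt", "Portuguese"), Map.entry("ru", "Russian"),
        Map.entry("sv", "Swedish"),   Map.entry("th", "Thai"));

    private TikaLanguageCodes() {
    }

    // Returns the language name for a code from the table, or an empty
    // Optional if the code is null or not one of the 18 listed codes.
    static Optional<String> nameOf(String code) {
        if (code == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(NAMES.get(code.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(nameOf("is").orElse("not in the table")); // prints "Icelandic"
        System.out.println(nameOf("ja").orElse("not in the table")); // prints "not in the table"
    }
}
```

Returning an `Optional` instead of `null` makes the caller decide what to do with a code that is not in the table.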

Example of a TIKA based language detector

This is the Java source code:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.language.LanguageIdentifier;

import org.xml.sax.SAXException;

public class LanguageDetection {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      // The file whose language should be detected
      File file = new File("myExample.txt");

      // Parser, content handler, and metadata for the text extraction step
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();

      // Parse the document; try-with-resources closes the stream afterwards
      try (FileInputStream content = new FileInputStream(file)) {
         parser.parse(content, handler, metadata, new ParseContext());
      }

      // Detect the language of the extracted text and print its code
      LanguageIdentifier identifier = new LanguageIdentifier(handler.toString());
      System.out.println("Language name :" + identifier.getLanguage());
   }
}

If you run this program with the following text as the content of myExample.txt

Þetta er íslenskur frumkóði (Icelandic for "This is Icelandic source code")

you’ll get this as output

Language name :is

So the language detector from Apache TIKA can be used as an alternative to the OpenNLP language detector.