A TIKA based language detector

The free Apache Tika package, which is used to extract text from various file formats for NLP tasks, also includes a language detector. This Tika-based language detector can be used independently of the OpenNLP language detector.

Content

Using Apache Tika as a command line utility

Apache Tika Example: Extracting MS Office

A TIKA based language detector – supported languages

Example of a TIKA based language detector

Go to the start of the OpenNLP series of articles: Starting Page

A TIKA based language detector – supported languages

These 18 languages are supported by TIKA

da - Danish       de - German       et - Estonian     el - Greek
en - English      es - Spanish      fi - Finnish      fr - French
hu - Hungarian    is - Icelandic    it - Italian      nl - Dutch
no - Norwegian    pl - Polish       pt - Portuguese   ru - Russian
sv - Swedish      th - Thai
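The detector returns only the two-letter code (for example "is"), so a small lookup table can be handy for printing a readable name. The following is a minimal sketch using only the Java standard library. The class name LanguageNames, the methods nameOf and supportedCount, and the "unknown" fallback are hypothetical helpers made up for this illustration and are not part of Tika; the codes and names are copied from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Tika): maps the language codes
// listed above to their English names.
public class LanguageNames {

    // Codes and names copied from the list of supported languages
    private static final Map<String, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put("da", "Danish");
        NAMES.put("de", "German");
        NAMES.put("et", "Estonian");
        NAMES.put("el", "Greek");
        NAMES.put("en", "English");
        NAMES.put("es", "Spanish");
        NAMES.put("fi", "Finnish");
        NAMES.put("fr", "French");
        NAMES.put("hu", "Hungarian");
        NAMES.put("is", "Icelandic");
        NAMES.put("it", "Italian");
        NAMES.put("nl", "Dutch");
        NAMES.put("no", "Norwegian");
        NAMES.put("pl", "Polish");
        NAMES.put("pt", "Portuguese");
        NAMES.put("ru", "Russian");
        NAMES.put("sv", "Swedish");
        NAMES.put("th", "Thai");
    }

    // Returns the English name for a code, or "unknown" (an assumed
    // fallback) if the code is not in the list.
    public static String nameOf(String code) {
        return NAMES.getOrDefault(code, "unknown");
    }

    // Number of codes in the table
    public static int supportedCount() {
        return NAMES.size();
    }

    public static void main(String[] args) {
        System.out.println("is -> " + nameOf("is")); // prints "is -> Icelandic"
        System.out.println("xx -> " + nameOf("xx")); // prints "xx -> unknown"
    }
}
```

In a program that uses the detector, the string returned by getLanguage() could be passed to nameOf to print "Icelandic" instead of "is".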

Example of a TIKA based language detector

This is the Java source code:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.language.*;

import org.xml.sax.SAXException;

public class LanguageDetection {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      // The file whose language should be detected
      File file = new File("myExample.txt");

      // Parser method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();

      // Parse the given document; try-with-resources closes the stream afterwards
      try (FileInputStream content = new FileInputStream(file)) {
         parser.parse(content, handler, metadata, new ParseContext());
      }

      // Detect the language of the extracted text
      // (LanguageIdentifier comes from the org.apache.tika.language package of Tika 1.x)
      LanguageIdentifier identifier = new LanguageIdentifier(handler.toString());
      System.out.println("Language name :" + identifier.getLanguage());
   }
}

If you run this program with the following text as myExample.txt:

Þetta er íslenskur frumkóði (Icelandic for "This is Icelandic source code")

you'll get this output:

Language name :is

So the language detector from Apache Tika can be used as an alternative to the OpenNLP one.