Part of the free Apache Tika package (used to extract text from various file formats for NLP tasks) is a language detector. This TIKA based language detector can be used independently of the OpenNLP language detector.
Content
Using Apache Tika as a command line utility
Apache Tika Example: Extracting MS Office
A TIKA based language detector – supported languages
Example of a TIKA based language detector
Go to the start of the OpenNLP series of articles: Starting Page
A TIKA based language detector – supported languages
These 18 languages are supported by TIKA:
da—Danish | de—German | et—Estonian | el—Greek |
en—English | es—Spanish | fi—Finnish | fr—French |
hu—Hungarian | is—Icelandic | it—Italian | nl—Dutch |
no—Norwegian | pl—Polish | pt—Portuguese | ru—Russian |
sv—Swedish | th—Thai |
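The detector reports a language as its two-letter code (for example "is"), so a small lookup table is handy when you want to print a readable name instead. The following is only a sketch: the class TikaLanguageNames and the method nameFor are made-up names for this illustration and are not part of Tika, and the table simply copies the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Tika): maps the language codes
// listed above to their English names.
class TikaLanguageNames {
    private static final Map<String, String> NAMES = new LinkedHashMap<>();

    static {
        NAMES.put("da", "Danish");
        NAMES.put("de", "German");
        NAMES.put("et", "Estonian");
        NAMES.put("el", "Greek");
        NAMES.put("en", "English");
        NAMES.put("es", "Spanish");
        NAMES.put("fi", "Finnish");
        NAMES.put("fr", "French");
        NAMES.put("hu", "Hungarian");
        NAMES.put("is", "Icelandic");
        NAMES.put("it", "Italian");
        NAMES.put("nl", "Dutch");
        NAMES.put("no", "Norwegian");
        NAMES.put("pl", "Polish");
        NAMES.put("pt", "Portuguese");
        NAMES.put("ru", "Russian");
        NAMES.put("sv", "Swedish");
        NAMES.put("th", "Thai");
    }

    // Returns the English name for a code, or "unknown" if the code
    // is not in the table.
    static String nameFor(String code) {
        return NAMES.getOrDefault(code, "unknown");
    }

    public static void main(String[] args) {
        System.out.println("is = " + nameFor("is"));
        System.out.println("xx = " + nameFor("xx"));
    }
}
```

Falling back to "unknown" instead of throwing an exception keeps the helper safe to call with whatever code the detector returns.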
Example of a TIKA based language detector
This is the Java source code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class LanguageDetection {
    public static void main(final String[] args) throws IOException, SAXException, TikaException {
        // The file whose language should be detected
        File file = new File("myExample.txt");
        // Parser method parameters
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // Parse the document; try-with-resources closes the stream afterwards
        try (FileInputStream content = new FileInputStream(file)) {
            parser.parse(content, handler, metadata, new ParseContext());
        }
        // Identify the language of the extracted text
        LanguageIdentifier identifier = new LanguageIdentifier(handler.toString());
        System.out.println("Language name :" + identifier.getLanguage());
    }
}
If you run this program with this as myExample.txt:
Þetta er íslenskur frumkóði (Icelandic for "This is Icelandic source code")
you’ll get this as output:
Language name :is
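To reproduce this run, myExample.txt has to exist in the working directory, and because the sentence contains non-ASCII letters it is safest to save it as UTF-8. A minimal sketch for creating the file from Java follows; the class name CreateExampleFile and the method writeSample are invented for this illustration, and only the Icelandic sentence itself (without the English explanation in parentheses) is written.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: creates the input file used in the example.
class CreateExampleFile {
    // The Icelandic sample sentence, spelled with Unicode escapes so the
    // source compiles the same way whatever the compiler's default encoding is.
    static final String SAMPLE = "\u00DEetta er \u00EDslenskur frumk\u00F3\u00F0i";

    // Writes the sample sentence to the given path, encoded as UTF-8.
    static void writeSample(Path target) throws IOException {
        Files.write(target, SAMPLE.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        writeSample(Paths.get("myExample.txt"));
    }
}
```

Run it once before starting LanguageDetection so that the file is in place.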
So you can also use the language detector from Apache Tika.