E-LIS, Eprints in Library and Information Science

Evaluation of a Language Identification System for Mono- and Multi-lingual Text Documents

Artemenko, Olga and Mandl, Thomas and Shramko, Margaryta and Womser-Hacker, Christa (2006) Evaluation of a Language Identification System for Mono- and Multi-lingual Text Documents. Delivered at 2006 ACM SAC Symposium on Applied Computing (SAC). Document Engineering Track (DE), Dijon, France. Presentation.

Abstract

Language identification is an important task for web information retrieval. This paper presents the implementation of a tool for language identification in mono- and multilingual documents. The tool implements four algorithms for language identification. Furthermore, we present an n-gram approach for identifying the languages in multilingual documents. An evaluation on monolingual texts of varied length is presented, with results for eight languages, including Ukrainian and Russian. The results show that n-gram-based approaches outperform word-based algorithms for short texts; for longer texts, the performance is comparable. The evaluation for multilingual documents is based on both short synthetic documents and real-world web documents. Our tool is able to recognize the languages present as well as the location of the language change with reasonable accuracy.
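
To make the idea of n-gram-based identification concrete, the following is a minimal sketch in Python. It is not the tool described in the paper and does not reproduce any of its four algorithms. It assumes a deliberately simple scheme: each language profile is the set of character trigrams seen in a small sample, and a text is assigned to the language whose profile covers the most of its trigrams. The names trigrams, build_profile, identify and window_labels, the toy training sentences, and the window size of five words are all hypothetical choices made for illustration.

```python
def trigrams(text):
    """Return the character trigrams of the lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_profile(sample):
    """Assumption: a language profile is just the set of trigrams seen in a sample."""
    return set(trigrams(sample))


def identify(text, profiles):
    """Pick the language whose profile covers the most trigram occurrences of text."""
    grams = trigrams(text)
    scores = {lang: sum(g in prof for g in grams) for lang, prof in profiles.items()}
    return max(scores, key=scores.get)


def window_labels(text, profiles, size=5):
    """Label every sliding window of `size` words.

    A change of label between neighbouring windows hints at the position
    of a language switch in a multilingual text.
    """
    words = text.split()
    last_start = max(len(words) - size, 0)
    return [identify(" ".join(words[i:i + size]), profiles)
            for i in range(last_start + 1)]


# Toy training samples (hypothetical; real profiles need far more text).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: build_profile(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    print(identify("the dog and the cat", PROFILES))
    print(window_labels("the dog and the cat der hund und die katze", PROFILES))
```

Because trigrams are shared across many words of a language, even a short input yields several matches, which is one intuition for why n-gram methods can cope with short texts better than lookups of whole words. The sliding-window function only approximates where a language change occurs; its resolution depends on the window size.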

Keywords: language identification
Subjects: L. Information technology and library technology. > LL. Automated language processing.
L. Information technology and library technology. > LM. Automatic text retrieval.
L. Information technology and library technology. > LS. Search engines.
ID Code: 7081
Deposited By: Mandl, Thomas
Deposited On: 28 August 2006