Artemenko, Olga and Mandl, Thomas and Shramko, Margaryta and Womser-Hacker, Christa (2006) Evaluation of a Language Identification System for Mono- and Multi-lingual Text Documents. In 2006 ACM SAC Symposium on Applied Computing (SAC). Document Engineering Track (DE), Dijon, France, April 23-27, 2006. [Presentation] (Unpublished)
Full text available as:
| PDF, 389Kb. Language: English |
Abstract(s)
Language identification is an important task for web information retrieval. This paper presents the implementation of a tool for language identification in mono- and multilingual documents. The tool implements four algorithms for language identification. Furthermore, we present an n-gram approach for the identification of languages in multilingual documents. An evaluation for monolingual texts of varied length is presented. Results for eight languages, including Ukrainian and Russian, are shown. The results show that n-gram-based approaches outperform word-based algorithms for short texts. For longer texts, the performance is comparable. The evaluation for multilingual documents is based on both short synthetic documents and real-world web documents. Our tool is able to recognize the languages present as well as the location of the language change with reasonable accuracy.
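The abstract does not name the four algorithms the tool implements, so the following is only a minimal sketch of what an n-gram-based identifier can look like, not the authors' method. It assumes character trigrams compared by cosine similarity; the names `ngram_profile`, `cosine`, `identify`, and the two toy training sentences in `SAMPLES` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of the lowercased text.

    Whitespace is normalized and one space is added at each end so that
    word boundaries contribute n-grams too.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Assumption: tiny illustrative training texts. A real system would build
# its profiles from much larger corpora per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def identify(text: str) -> str:
    """Return the language whose profile is most similar to the text."""
    query = ngram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

With these toy profiles, `identify("the dog is in the house")` selects `"en"` and `identify("der hund ist im haus")` selects `"de"`. Because character n-grams are counted across every position of the input, even a short text yields many features, which is one plausible reason n-gram approaches cope better with short texts than methods that rely on whole-word matches.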
| Item Type: | Presentation |
|---|---|
| Keywords: | language identification |
| Subjects: | L. Information technology and library technology. > LL. Automated language processing. L. Information technology and library technology. > LM. Automatic text retrieval. L. Information technology and library technology. > LS. Search engines. |
| ID Code: | 7081 |
| Deposited By: | Mandl, Thomas |
| Deposited On: | 28 Aug 2006 |
| Last Modified: | 19 Nov 2008 10:23 |

