Truncation of Content Terms for Turkish

Sever, Hayri and Tonta, Yaşar. Truncation of Content Terms for Turkish. 2006 (Unpublished) [Report]

English abstract

Stemming, truncation, suffix stripping and decompounding algorithms are used in information retrieval (IR) to reduce content terms to their conflated forms. They are well known for improving retrieval performance as well as space and processing efficiency. In this paper we investigate the statistical characteristics of truncated terms for Turkish on a text corpus of more than 50 million words and attempt to measure the vocabulary growth rates for both whole and truncated words. Findings indicate that truncated words in Turkish exhibit Zipfian behavior and that whole words can be truncated to the average word length (6.2 characters) without compromising effectiveness. The vocabulary growth rate for truncated words is about one third of that for whole words. The implications of our study are twofold. First, it makes a case for truncating content terms in Turkish, a language for which there is no publicly available stemming code equipped with morphological analysis capability. Second, using a truncation algorithm to index Turkish text may yield effectiveness comparable to that of a stemming algorithm; hence the need for stemming may become obsolete, given that morphological analyzers for Turkish are highly complex.
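
As a rough illustration of the measurements the abstract describes, the following Python sketch truncates terms to a fixed prefix, tracks vocabulary growth, and lists term frequencies by rank. It is not the authors' code: the function names, the toy token list, and the cut-off of 6 characters (the reported 6.2-character average rounded down) are assumptions made for this example.

```python
from collections import Counter

def truncate(term, length=6):
    """Keep only the first `length` characters (fixed-prefix truncation)."""
    return term[:length]

def vocabulary_growth(tokens, step):
    """Return (tokens_seen, distinct_terms) pairs sampled every `step` tokens."""
    seen = set()
    growth = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            growth.append((i, len(seen)))
    return growth

def rank_frequencies(tokens):
    """Return term frequencies in descending order (rank 1 first)."""
    return [count for _, count in Counter(tokens).most_common()]

# Hypothetical toy token stream, not taken from the paper's corpus.
tokens = ["kitap", "kitaplar", "kitaplarda", "kitaplardan",
          "evler", "evlerde", "evlerden", "kitaplar"]
truncated = [truncate(t) for t in tokens]

whole_growth = vocabulary_growth(tokens, step=4)
truncated_growth = vocabulary_growth(truncated, step=4)
truncated_freqs = rank_frequencies(truncated)

print(whole_growth)      # vocabulary size for whole words
print(truncated_growth)  # vocabulary size for truncated words
print(truncated_freqs)   # rank-ordered frequencies of truncated terms
```

On a real corpus, plotting the logarithm of frequency against the logarithm of rank would show whether the truncated terms follow a roughly straight Zipfian line, and comparing the two growth curves would show how much more slowly the truncated vocabulary grows.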

Item type: Report
Keywords: Stemming algorithms, truncation, Turkish language, information retrieval, indexing
Subjects: L. Information technology and library technology > LM. Automatic text retrieval.
I. Information treatment for information services > ID. Knowledge representation.
Depositing user: prof. yasar tonta
Date deposited: 05 May 2007
Last modified: 02 Oct 2014 12:07
URI: http://hdl.handle.net/10760/9494
