G. Figuerola, Carlos, Gómez Díaz, Raquel and López de San Roman, Eva. Stemming and n-grams in Spanish: An evaluation of their impact on information retrieval. Journal of Information Science, 2000, vol. 26, n. 6, pp. 461-467. [Journal article (Paginated)]
English abstract
At some stage, most of the models and techniques implemented in IR use frequency counts of the terms appearing in documents and in queries. However, many words, since they are derived from the same stem, have very close semantic contents. This makes a grouping of such variants under a single term advisable. Otherwise, dispersal occurs in the calculation of frequency of these terms, and it also becomes difficult to compare queries and documents. On the other hand, there are notable differences between different languages in the way of forming derivatives and inflected forms, so that the application of specific techniques can produce unequal results according to the language of the documents and queries. A description is given of the tests carried out for documents in Spanish, which involved some stemming techniques widely used in English, as well as the application of n-grams, and the results are compared.
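As a rough illustration of the two ideas named in the abstract, the sketch below shows how frequency counts spread across word variants until the variants are grouped under one stem, and how character n-grams can relate variants without any language-specific rules. It is not the method evaluated in the article: `toy_stem` and its suffix list are hypothetical and cover only a few plural and gender endings, and the choice of trigrams with a Dice coefficient is an assumption made for the example.

```python
from collections import Counter


def char_ngrams(word: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased word."""
    w = word.lower()
    if len(w) < n:
        return {w}
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def dice(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient between the n-gram sets of two words."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def toy_stem(word: str) -> str:
    """Hypothetical stemmer: strip one of a few Spanish endings.

    This is an illustrative assumption, not a real Spanish stemmer.
    """
    w = word.lower()
    for suffix in ("es", "as", "os", "a", "o", "s"):
        # Keep at least three characters so short words are not destroyed.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


tokens = ["biblioteca", "bibliotecas", "libro", "libros", "libro"]

# Without conflation, each surface form gets its own count.
raw_counts = Counter(tokens)

# With conflation, variants share a single term and a single count.
stem_counts = Counter(toy_stem(t) for t in tokens)

print(raw_counts)
print(stem_counts)
print(dice("libro", "libros"), dice("libro", "biblioteca"))
```

In this toy run the three occurrences of "libro" and "libros" end up under one term, while the trigram comparison scores the singular and plural forms as close and the unrelated pair as disjoint.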
Item type: | Journal article (Paginated) |
---|---|
Keywords: | Stemming; N-grams; Information retrieval; Recuperación de la Información |
Subjects: | L. Information technology and library technology > LM. Automatic text retrieval. |
Depositing user: | R. Gómez-Díaz |
Date deposited: | 17 Nov 2009 |
Last modified: | 02 Oct 2014 11:56 |
URI: | http://hdl.handle.net/10760/3815 |