La recuperación de información en español y la normalización de términos

G.-Figuerola, Carlos, Zazo, Ángel F., Rodríguez-Vázquez-de-Aldana, Emilio and Alonso-Berrocal, José-Luis La recuperación de información en español y la normalización de términos. Inteligencia Artificial. Revista Iberoamericana de Inteligencia Artificial, 2004, vol. 8, n. 22, pp. 135-145. [Journal article (Paginated)]


English abstract

Most Information Retrieval systems use counts of the frequencies of the words that occur in documents. Such counts entail the need to normalize these terms. A simple normalization of characters (upper/lowercase, accents and other diacritics) seems insufficient, since many words, through morphological inflection or derivation, could be grouped under a single form because their meanings are very close. Several normalization algorithms are analyzed and tested experimentally to evaluate their effectiveness.
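As an illustration of the character-level normalization mentioned in the abstract, the following minimal sketch lowercases a word and strips its diacritics, then splits it into character n-grams. It is not the authors' implementation: the function names `normalize_chars` and `char_ngrams`, the trigram size, and the sample words are assumptions chosen for the example.

```python
import unicodedata


def normalize_chars(word: str) -> str:
    """Lowercase a word and strip its diacritics (hypothetical helper)."""
    # NFD decomposition splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Drop the combining marks, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a word (hypothetical helper)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Sample words are assumptions, not data from the article.
for form in ["Canción", "CANCIONES", "cancionero"]:
    normalized = normalize_chars(form)
    print(form, "->", normalized, char_ngrams(normalized))
```

After this step the singular and plural forms are still distinct index terms, which is the limitation the abstract points to and the reason stemming or n-grams are considered. Note that this sketch also maps ñ to n, a choice that a real Spanish system might want to avoid.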

Spanish abstract

La mayor parte de los Sistemas de Recuperación de Información utilizan, de una forma u otra, recuentos de frecuencias de las palabras que aparecen en los documentos. Tales recuentos conllevan la necesidad de normalizar dichos términos. Una simple normalización de caracteres (mayúsculas/minúsculas, acentos y otros diacríticos) parece insuficiente, ya que muchas palabras, por flexión morfológica o derivación, podrían ser agrupadas bajo una única forma, al tener contenidos semánticos muy cercanos. Se analizan diversos algoritmos de normalización y se muestran los experimentos llevados a cabo para evaluar su eficacia.

Item type: Journal article (Paginated)
Keywords: Information Retrieval, stemming, n-grams, inflectional stemming, derivational stemming, recuperación de la información, Español. Lenguaje natural, Normalización
Subjects: L. Information technology and library technology > LM. Automatic text retrieval.
I. Information treatment for information services > II. Filtering.
Depositing user: R. Gómez-Díaz
Date deposited: 07 Dec 2009
Last modified: 02 Oct 2014 12:16
URI: http://hdl.handle.net/10760/13961
