Aplicación de transductores de estado-finito a los procesos de unificación de términos (Application of transducers of finite state to unification processes of term variants)

Galvez, Carmen Aplicación de transductores de estado-finito a los procesos de unificación de términos (Application of transducers of finite state to unification processes of term variants). Ciência da Informação, 2006, vol. 35, n. 3, pp. 67-74. [Journal article (Paginated)]


English abstract

Application of finite-state transducers to the unification of term variants. This work presents an approach based on finite-state techniques for the unification of terms in Spanish. Conflation algorithms are computational procedures used in some Information Retrieval (IR) systems to reduce semantically equivalent term variants to a normalized form. The programs that usually carry out this process are called stemmers and lemmatizers. The objective of this work is to evaluate the degree of deficiencies and errors of lemmatizers in the conflation of terms. The lemmatizer was built by implementing a linguistic tool that allows electronic dictionaries to be constructed and represented internally as Finite-State Transducers (FST). The lexical resources developed were applied to a verification corpus to evaluate the performance of this type of lexical analyzer. The evaluation metric was an adaptation of the coverage and precision measures. The results show that the main limitation of unifying term variants with finite-state technology is under-analysis, that is, word forms for which no analysis is produced.
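The evaluation idea in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: a plain Python dictionary stands in for the finite-state transducer, the lexicon entries and test tokens are invented for illustration, and the definitions of coverage (share of tokens that receive any analysis) and precision (share of analyzed tokens whose lemma is correct) are assumptions, since the abstract only says the measures were adapted.

```python
# Hypothetical toy lexicon: inflected form -> lemma.
# A real system would compile such a dictionary into a finite-state
# transducer; a plain dict is used here as a stand-in.
LEXICON = {
    "bibliotecas": "biblioteca",
    "biblioteca": "biblioteca",
    "recuperaron": "recuperar",
    "recupera": "recuperar",
    "términos": "término",
}


def lemmatize(token):
    """Return the lemma for a token, or None when the lexicon has no entry."""
    return LEXICON.get(token.lower())


def evaluate(gold):
    """Score the lemmatizer against (token, expected_lemma) pairs.

    Returns (coverage, precision, unanalyzed) under the assumed definitions:
    coverage  = analyzed tokens / all tokens
    precision = correctly lemmatized tokens / analyzed tokens
    """
    analyzed = 0
    correct = 0
    unanalyzed = []  # tokens with no analysis (under-analysis)
    for token, expected in gold:
        lemma = lemmatize(token)
        if lemma is None:
            unanalyzed.append(token)
            continue
        analyzed += 1
        if lemma == expected:
            correct += 1
    coverage = analyzed / len(gold) if gold else 0.0
    precision = correct / analyzed if analyzed else 0.0
    return coverage, precision, unanalyzed


# Invented verification sample; "indizados" is deliberately missing
# from the lexicon to show an under-analyzed form.
GOLD = [
    ("Bibliotecas", "biblioteca"),
    ("recuperaron", "recuperar"),
    ("términos", "término"),
    ("indizados", "indizar"),
]

coverage, precision, unanalyzed = evaluate(GOLD)
print(coverage, precision, unanalyzed)
```

In this toy sample every form the lexicon recognizes is lemmatized correctly, so the only loss comes from the form the lexicon does not contain. That is the kind of failure the abstract calls under-analysis.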

Spanish abstract

Se presenta una aplicación basada en técnicas de estado-finito a los procesos de unificación de términos en español. Los algoritmos de unificación, o conflación, de términos son procedimientos computacionales utilizados en algunos sistemas de Recuperación de Información (RI) para la reducción de variantes de términos, semánticamente equivalentes, a una forma normalizada. Los programas que realizan habitualmente este proceso se denominan: stemmers y lematizadores. El objetivo de este trabajo es evaluar el grado de deficiencias y errores de los lematizadores en el proceso de agrupación de los términos a su correspondiente radical. El método utilizado para la construcción del lematizador se ha basado en la implementación de una herramienta lingüística que permite construir diccionarios electrónicos representados internamente en Transductores de Estado-Finito. Los recursos léxicos desarrollados se han aplicado a un corpus de verificación para evaluar el funcionamiento de este tipo de analizadores léxicos. La métrica de evaluación utilizada ha sido una adaptación de las medidas de cobertura y precisión. Los resultados muestran que la principal limitación del proceso de unificación de variantes de término por medio de tecnología de estado-finito es el infra-análisis.

Item type: Journal article (Paginated)
Keywords: Finite-State Transducers; Normalization; Dictionary; Term conflation; Lemmatization; Unificación de términos; Lematización; Transductores de estado finito.
Subjects: A. Theoretical and general aspects of libraries and information. > AA. Library and information science as a field.
Depositing user: Carmen Galvez
Date deposited: 06 Aug 2007
Last modified: 02 Oct 2014 12:08
URI: http://hdl.handle.net/10760/10015


