An evaluation of conflation accuracy using finite-state transducers

Galvez, Carmen and De-Moya-Anegón, Félix. An evaluation of conflation accuracy using finite-state transducers. Journal of Documentation, 2006, vol. 62, n. 3. [Journal article (Unpaginated)]

English abstract

Purpose – To evaluate the accuracy of conflation methods based on Finite-State Transducers (FSTs).

Design/methodology/approach – Incorrectly lemmatized and stemmed forms may lead to the retrieval of inappropriate documents. Experimental studies to date have focused on retrieval performance; very few have examined conflation performance. The normalization process we used involved a linguistic toolbox that allowed us to construct, through graphic interfaces, electronic dictionaries represented internally by FSTs. The lexical resources developed were applied to a Spanish test corpus to merge term variants into canonical lemmatized forms. Conflation performance was evaluated with an adaptation of the recall and precision measures, based on accuracy and coverage rather than on actual retrieval. The results were compared with those obtained using a Spanish version of the Porter algorithm.

Findings – We conclude that the main strength of lemmatization using finite-state technology is its accuracy, whereas its main limitation is the underanalysis of variant forms.

Originality/value – The report outlines the potential of transducers in their application to normalization processes.
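The abstract does not give the exact formulas used, so the following is only a minimal sketch of how accuracy-based and coverage-based measures for conflation might be computed against a hand-built gold standard. The function name evaluate_conflation, the two definitions (precision as the share of analysed forms that received the right lemma, recall as the share of all gold-standard forms that received the right lemma) and the toy Spanish data are assumptions made for illustration, not the authors' method or results.

```python
def evaluate_conflation(predicted, gold):
    """Score a conflation run against a gold standard.

    predicted: dict mapping word form -> lemma, or None when the
               conflator produced no analysis (hypothetical format).
    gold:      dict mapping word form -> correct lemma.

    Returns (precision, recall) under the ASSUMED definitions:
      precision = correct lemmas / forms that received any analysis
      recall    = correct lemmas / all forms in the gold standard
    """
    # Keep only forms that were analysed and that the gold standard covers.
    analysed = {
        word: lemma
        for word, lemma in predicted.items()
        if lemma is not None and word in gold
    }
    correct = sum(1 for word, lemma in analysed.items() if lemma == gold[word])

    precision = correct / len(analysed) if analysed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
gold = {
    "casas": "casa",
    "cantaba": "cantar",
    "libros": "libro",
    "fueron": "ser",
}
# A dictionary-based lemmatizer that is right whenever it answers,
# but has no entry for one irregular form (underanalysis).
predicted = {
    "casas": "casa",
    "cantaba": "cantar",
    "libros": "libro",
    "fueron": None,
}

precision, recall = evaluate_conflation(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy run every analysed form is correct but one form goes unanalysed, so precision is 1.00 while recall is 0.75. That is the general pattern the findings describe (high accuracy, limited by underanalysis), shown here on invented data.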

Item type: Journal article (Unpaginated)
Keywords: Natural Language Processing; Finite-State Transducers; Information Retrieval; Conflation; Linguistics; Semantics; Programming and algorithm theory; Accuracy
Subjects: L. Information technology and library technology > LM. Automatic text retrieval.
Depositing user: Carmen Galvez
Date deposited: 08 Aug 2007
Last modified: 02 Oct 2014 12:09
URI: http://hdl.handle.net/10760/10184
