Term conflation methods in information retrieval: non-linguistic and linguistic approaches

Galvez, Carmen, de Moya-Anegon, Felix and Herrero-Solana, Victor. Term conflation methods in information retrieval: non-linguistic and linguistic approaches. Journal of Documentation, 2005, vol. 61, n. 4, pp. 520-547. [Journal article (Paginated)]

English abstract

Purpose – To propose a categorization of the different conflation procedures under the two basic approaches, non-linguistic and linguistic techniques, and to justify the application of normalization methods within the framework of linguistic techniques. Design/methodology/approach – Presents a range of term conflation methods that can be used in information retrieval. Uniterm and multiterm variants can be considered equivalent units for the purposes of automatic indexing. Stemming algorithms, segmentation rules, association measures and clustering techniques are well-evaluated non-linguistic methods, and experiments with these techniques show a wide variety of results. Alternatively, lemmatisation and syntactic pattern-matching, through equivalence relations represented in finite-state transducers (FSTs), are emerging methods for the recognition and standardization of terms. Findings – The survey attempts to point out the positive and negative effects of the linguistic approach and its potential as a term conflation method. Originality/value – Outlines the importance of FSTs for the normalization of term variants.

Item type: Journal article (Paginated)
Keywords: Finite-State Transducers
Subjects: I. Information treatment for information services > IC. Index languages, processes and schemes.
Depositing user: Carmen Galvez
Date deposited: 06 Aug 2007
Last modified: 02 Oct 2014 12:06
URI: http://hdl.handle.net/10760/8818

