Approximate Personal Name-Matching Through Finite-State Graphs

Galvez, Carmen and De-Moya-Anegón, Félix. Approximate Personal Name-Matching Through Finite-State Graphs. Journal of the American Society for Information Science and Technology, 2007, vol. 58, n. 13. [Journal article (Unpaginated)]

English abstract

This article shows how finite-state methods can be employed in a new and different task: the conflation of personal name variants into standard forms. In bibliographic databases and citation index systems, variant forms create problems of inaccuracy that affect information retrieval, the quality of information from databases, and the citation statistics used for the evaluation of scientists' work. A number of approximate string matching techniques have been developed to validate variant forms, based on similarity and equivalence relations. We classify the personal name variants as nonvalid and valid forms. In establishing an equivalence relation between valid variants and the standard form of their equivalence class, we defend the application of finite-state transducers. The process of variant identification requires the elaboration of: (a) binary matrices and (b) finite-state graphs. This procedure was tested on samples of author names from bibliographic records, selected from the Library and Information Science Abstracts (LISA) and Science Citation Index Expanded (SCI-E) databases. The evaluation involved calculating the measures of recall and precision, which reflect completeness and accuracy respectively. The results demonstrate the usefulness of this approach, although it should be complemented with methods based on similarity relations for the recognition of spelling variants and misspellings.
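
The idea of mapping valid variants to one standard form with a finite-state transducer can be illustrated with a minimal sketch. It is not the authors' implementation: the author name, the standard form "Smith, John", the tokenizer, and the hand-written transition table are all hypothetical, chosen only to show how each arc reads one input token and may emit output. A misspelled variant finds no path through the graph, which is the case the abstract says needs similarity-based methods.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Transition table: state -> {input token: (emitted output, next state)}.
# Hypothetical toy graph for a single author; the standard form
# "Smith, John" is an assumption made for illustration only.
TRANSITIONS: Dict[int, Dict[str, Tuple[str, int]]] = {
    0: {"Smith": ("", 1), "J.": ("", 3), "John": ("", 3)},
    1: {",": ("", 2)},
    2: {"J.": ("Smith, John", 4), "John": ("Smith, John", 4)},
    3: {"Smith": ("Smith, John", 4)},
}
FINAL_STATES: Set[int] = {4}


def tokenize(name: str) -> List[str]:
    """Split a name into words (keeping a trailing period) and commas."""
    return re.findall(r"[A-Za-z]+\.?|,", name)


def conflate(name: str) -> Optional[str]:
    """Run the transducer; return the standard form, or None if rejected."""
    state = 0
    output: List[str] = []
    for token in tokenize(name):
        arc = TRANSITIONS.get(state, {}).get(token)
        if arc is None:
            return None  # no arc for this token: not a valid variant
        emitted, state = arc
        output.append(emitted)
    # Accept only if the whole input was consumed in a final state.
    return "".join(output) if state in FINAL_STATES else None


if __name__ == "__main__":
    for variant in ["Smith, J.", "J. Smith", "John Smith", "Smyth, J."]:
        print(variant, "->", conflate(variant))
```

In this sketch the first three variants reach the final state and yield the same standard form, while "Smyth, J." is rejected at the first token. A real system would build such graphs from data rather than by hand.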

Item type: Journal article (Unpaginated)
Keywords: Finite-State Transducers ; Natural Language Processing (NLP) ; Personal Names ; Normalization
Subjects: L. Information technology and library technology > LL. Automated language processing.
Depositing user: Carmen Galvez
Date deposited: 12 Oct 2007
Last modified: 02 Oct 2014 12:09
URI: http://hdl.handle.net/10760/10529
