Standardizing formats of corporate source data

Galvez, Carmen and Moya-Anegón, Félix. Standardizing formats of corporate source data. Scientometrics, 2007, vol. 70, n. 1, pp. 3-26. [Journal article (Paginated)]

English abstract

This paper describes an approach for improving the data quality of corporate sources when databases are used for bibliometric purposes. Research management relies on bibliographic databases and citation index systems as analytical tools, yet the raw resources for bibliometric studies are plagued by inconsistent field formatting of institution data. The present contribution puts forth a Natural Language Processing (NLP)-oriented method for identifying the structures that guide corporate data and mapping them into a standardized format. The proposed unification process is based on the definition of address patterns and the subsequent application of Enhanced Finite-State Transducers (E-FST). The procedure was tested on address formats downloaded from the INSPEC, MEDLINE and CAB Abstracts databases. The results demonstrate the helpfulness of the method, provided that errors in the formats to be unified are closely controlled. The computational efficacy of the model is noteworthy because it is firmly guided by the definition of data in the application domain.
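
The general idea of matching address patterns with a finite-state transducer can be illustrated with a minimal sketch. This is not the authors' E-FST implementation: the keyword lexicon, the state names, the output field names and the sample addresses are all hypothetical, chosen only to show how two differently ordered address strings could be mapped onto one field layout.

```python
import re

# Hypothetical keyword lexicon (assumption); a real system would need a far larger one.
LEXICON = {
    "DEPT": re.compile(r"\b(dept|department|fac|faculty)\b", re.I),
    "INST": re.compile(r"\b(univ|university|inst|institute)\b", re.I),
}

# Transducer table: (state, input label) -> (next state, output field).
# Each path through the table corresponds to one accepted address pattern.
TRANSITIONS = {
    ("START", "DEPT"): ("DEPT", "department"),
    ("START", "INST"): ("INST", "institution"),
    ("DEPT", "INST"): ("INST", "institution"),
    ("INST", "DEPT"): ("DEPT_AFTER", "department"),
    ("INST", "OTHER"): ("CITY", "city"),
    ("DEPT_AFTER", "OTHER"): ("CITY", "city"),
    ("CITY", "OTHER"): ("COUNTRY", "country"),
}
FINAL_STATES = {"CITY", "COUNTRY"}


def label(segment):
    """Classify one comma-separated segment using the lexicon."""
    for name, pattern in LEXICON.items():
        if pattern.search(segment):
            return name
    return "OTHER"


def standardize(address):
    """Map an address string to a field record, or None if no pattern matches."""
    state, record = "START", {}
    for segment in (s.strip() for s in address.split(",")):
        if not segment:
            continue
        key = (state, label(segment))
        if key not in TRANSITIONS:
            return None  # segment order is not covered by any defined pattern
        state, field = TRANSITIONS[key]
        record[field] = segment
    return record if state in FINAL_STATES else None


# Made-up sample addresses in two different orderings.
a = standardize("Dept. Phys., Univ. Example, Springfield, USA")
b = standardize("Example University, Faculty of Physics, Springfield, USA")
```

Both sample strings yield records with the same four fields (department, institution, city, country), while a string that fits no defined pattern, such as "Springfield, USA", is rejected rather than guessed at. Rejecting unmatched input is one simple way to keep errors visible, in the spirit of the close error control the abstract calls for.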

Item type: Journal article (Paginated)
Keywords: Finite-State Transducers; Term normalization
Subjects: L. Information technology and library technology > LL. Automated language processing.
B. Information use and sociology of information > BB. Bibliometric methods
Depositing user: Carmen Galvez
Date deposited: 06 Aug 2007
Last modified: 02 Oct 2014 12:08
URI: http://hdl.handle.net/10760/10020
