Automatic indexing of scientific articles on Library and Information Science with SISA, KEA and MAUI

Gil-Leiva, Isidoro, Díaz-Ortuño, Pedro Daniel and Corrêa, Renato Fernandes. Automatic indexing of scientific articles on Library and Information Science with SISA, KEA and MAUI. Revista Española de Documentación Científica, 2022, vol. 45, n. 4, pp. 1-18. [Journal article (Paginated)]


English abstract

This article evaluates the automatic indexing systems SISA (Automatic Indexing System), KEA (Keyphrase Extraction Algorithm) and MAUI (Multi-Purpose Automatic Topic Indexing) to determine how they perform relative to human indexing. The SISA algorithm is based on rules about the position of terms in the different structural components of the document, while the KEA and MAUI algorithms are based on machine learning and the statistical features of terms. For the evaluation, a collection of 230 scientific articles from the Revista Española de Documentación Científica, published by the Consejo Superior de Investigaciones Científicas (CSIC), was used; 30 of these articles were used for training and were not part of the evaluation test set. The articles were written in Spanish and indexed by human indexers using a controlled vocabulary in the InDICES database, which also belongs to the CSIC. The human indexing of these documents constitutes the baseline, or gold standard, against which the output of the automatic indexing systems is evaluated by comparing term sets using the metrics of precision, recall, F-measure and consistency. The results show that the SISA system performs best, followed by KEA and MAUI.
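The set-based comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `evaluate_indexing`, the lowercase exact-match normalisation, and the sample terms are assumptions made for illustration. The abstract does not say which consistency formula was used, so the sketch assumes Hooper's measure (terms in common divided by the total number of distinct terms assigned by either indexer).

```python
def evaluate_indexing(automatic_terms, human_terms):
    """Compare one document's automatic index terms with the human (gold) terms."""
    # Assumption: terms match only if identical after trimming and lowercasing.
    auto = {t.strip().lower() for t in automatic_terms}
    gold = {t.strip().lower() for t in human_terms}
    common = auto & gold

    # Precision: share of automatic terms that the human indexer also assigned.
    precision = len(common) / len(auto) if auto else 0.0
    # Recall: share of human terms that the automatic system recovered.
    recall = len(common) / len(gold) if gold else 0.0
    # F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    # Consistency (assumed Hooper's measure): common / (distinct terms overall).
    union = auto | gold
    consistency = len(common) / len(union) if union else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "consistency": consistency,
    }


# Hypothetical terms for a single article (not taken from the study's data).
scores = evaluate_indexing(
    ["indización automática", "tesauros", "bases de datos", "bibliometría"],
    ["indización automática", "bases de datos", "evaluación"],
)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In a collection-level evaluation such as the one described, per-document scores like these would typically be averaged over the test set; whether the study used that averaging scheme is not stated in the abstract.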

Spanish abstract

Este artículo evalúa los sistemas de indización automática SISA (Automatic Indexing System), KEA (Keyphrase Extraction Algorithm) y MAUI (Multi-Purpose Automatic Topic Indexing) para averiguar cómo funcionan en relación con la indización realizada por especialistas. El algoritmo de SISA se basa en reglas sobre la posición de los términos en los diferentes componentes estructurales del documento, mientras que los algoritmos de KEA y MAUI se basan en el aprendizaje automático y la frecuencia estadística de los términos. Para la evaluación se utilizó una colección documental de 230 artículos científicos de la Revista Española de Documentación Científica, publicada por el Consejo Superior de Investigaciones Científicas (CSIC), de los cuales 30 se utilizaron para tareas formativas y no formaban parte del conjunto de pruebas de evaluación. Los artículos fueron escritos en español e indizados por indizadores humanos utilizando un vocabulario controlado en la base de datos InDICES, también perteneciente al CSIC. La indización humana de estos documentos constituye la referencia contra la cual se evalúa el resultado de los sistemas de indización automática, comparando conjuntos de términos mediante las métricas de evaluación de precisión, recuperación, medida F y consistencia. Los resultados muestran que el sistema SISA funciona mejor, seguido de KEA y MAUI.

Item type: Journal article (Paginated)
Additional information: Cited by 0
Keywords: automatic indexing; automatic indexing systems; SISA; KEA; MAUI; indexing assessment; indización automática; sistemas de indización automática; evaluación de indización
Subjects: I. Information treatment for information services > IB. Content analysis (A and I, class.)
I. Information treatment for information services > IC. Index languages, processes and schemes.
Depositing user: Isidoro Gil Leiva
Date deposited: 22 Mar 2023 15:38
Last modified: 22 Mar 2023 15:38
URI: http://hdl.handle.net/10760/44190


