A semiautomatic indexing system based on embedded information in HTML documents

Vallez, Mari and Pedraza-Jiménez, Rafael and Codina, Lluís and Blanco, Saúl and Rovira, Cristòfol. A semiautomatic indexing system based on embedded information in HTML documents. Library Hi Tech, 2015, vol. 33, n. 2, pp. 195-210. [Journal article (Paginated)]

Preprint-Semi-automaticIndexingSystemBasedEmbeddedInformationHTMLDocuments.pdf - Draft version
Available under License Creative Commons Attribution.

English abstract

Purpose. This paper describes and evaluates DigiDoc MetaEdit, a tool for the semi-automatic indexing of HTML documents. The tool identifies and suggests keywords from a thesaurus according to the information embedded in the HTML document. Keyword assignment can be parameterized by how frequently the terms appear in the document, by the relevance of their position, or by a combination of both.

Design/methodology/approach. To evaluate the efficiency of the indexing tool, the descriptors/keywords it suggests are compared with the keywords assigned manually by human experts. For this comparison, a corpus of HTML documents is randomly selected from a journal devoted to Library and Information Science.

Findings. The evaluation shows that (1) there is close to a 50% match, or overlap, between the two indexing systems, rising to as much as 73% when related terms and narrower terms are also taken into account; and (2) the first terms identified by the tool are the most relevant.

Originality/value. The tool identifies the most important keywords in an HTML document based on the information embedded in it. Representing the contents of documents with keywords is now an essential practice in areas such as information retrieval and e-commerce.
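The parameterization described in the abstract (frequency, position, or both) and the overlap measure used in the evaluation can be illustrated with a minimal Python sketch. This is not the DigiDoc MetaEdit implementation: the position weights, the function names `suggest_keywords` and `overlap`, and the naive substring matching are all assumptions made for illustration, and the sketch ignores related and narrower terms.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical position weights (assumed for illustration, not from the paper).
POSITION_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5}
DEFAULT_WEIGHT = 1.0


class WeightedText(HTMLParser):
    """Collect (weight, text) pairs; weight is the highest among open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weights = [POSITION_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack]
        self.chunks.append((max(weights, default=DEFAULT_WEIGHT), data.lower()))


def suggest_keywords(html, thesaurus, mode="combined", top_n=5):
    """Rank thesaurus terms found in the HTML by frequency, position, or both."""
    parser = WeightedText()
    parser.feed(html)
    freq = defaultdict(int)    # total occurrences of each term
    pos = defaultdict(float)   # best position weight seen for each term
    for weight, text in parser.chunks:
        for term in thesaurus:
            n = text.count(term.lower())  # naive substring match (assumption)
            if n:
                freq[term] += n
                pos[term] = max(pos[term], weight)
    if mode == "frequency":
        scores = dict(freq)
    elif mode == "position":
        scores = dict(pos)
    elif mode == "combined":
        scores = {t: freq[t] * pos[t] for t in freq}
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


def overlap(suggested, manual):
    """Share of manually assigned keywords that also appear among suggestions."""
    manual_set = {m.lower() for m in manual}
    if not manual_set:
        return 0.0
    hits = {s.lower() for s in suggested} & manual_set
    return len(hits) / len(manual_set)
```

Calling `suggest_keywords(page_html, thesaurus_terms, mode="position")` would rank a term that appears once in the `<title>` above a term that appears several times only in the body, whereas `mode="frequency"` would reverse that order. Exact-match overlap of this kind corresponds to the stricter of the two figures reported in the findings.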

Item type: Journal article (Paginated)
Keywords: Semi-automatic indexing; Keywords assignment; Metadata editor; Controlled language; Semantic web technologies; Information retrieval.
Subjects: I. Information treatment for information services > IC. Index languages, processes and schemes.
I. Information treatment for information services > IG. Information presentation: hypertext, hypermedia.
Depositing user: PhD Mari Letrado
Date deposited: 16 Jul 2015 20:35
Last modified: 16 Jul 2015 20:35
URI: http://hdl.handle.net/10760/25382
