Ontology-based text summarization. The case of Texminer

Hípola, Pedro and A. Senso, José and Leiva-Mederos, Amed and Domínguez-Velasco, Sandor Ontology-based text summarization. The case of Texminer. Library Hi Tech, 2014, vol. 32, n. 2, pp. 229-248. [Journal article (Paginated)]

English abstract

Purpose – The purpose of this paper is to look into the latest advances in ontology-based text summarization systems, with emphasis on the methodologies of a socio-cognitive approach, the structural discourse models and the ontology-based text summarization systems.

Design/methodology/approach – The paper analyzes the main literature in this field and presents the structure and features of Texminer, a software that facilitates summarization of texts on Port and Coastal Engineering. Texminer entails a combination of several techniques, including: socio-cognitive user models, Natural Language Processing, disambiguation and ontologies. After processing a corpus, the system was evaluated using as a reference various clustering evaluation experiments conducted by Arco (2008) and Hennig et al. (2008). The results were checked with a support vector machine, Rouge metrics, the F-measure and calculation of precision and recall.

Findings – The experiment illustrates the superiority of abstracts obtained through the assistance of ontology-based techniques.

Originality/value – The authors were able to corroborate that the summaries obtained using Texminer are more efficient than those derived through other systems whose summarization models do not use ontologies to summarize texts. Thanks to ontologies, main sentences can be selected with a broad rhetorical structure, especially for a specific knowledge domain.

Item type: Journal article (Paginated)
Keywords: Information retrieval, Software evaluation, Ontologies, Indexing, Programming, Automatic summarization systems, Texminer
Subjects: L. Information technology and library technology > LL. Automated language processing.
Depositing user: Pedro Hipola
Date deposited: 10 Aug 2015 05:30
Last modified: 10 Aug 2015 05:30
URI: http://hdl.handle.net/10760/25540

References

Alonso, L. and Fuentes, M. (2003), “Integrating cohesion and coherence for text summarization”, Proceedings of the EACL’03 Student Session, Budapest, pp. 1-8.

Alpcan, T., Bauckhage, C. and Agarwal, S. (2007), “An efficient ontology-based expert peering system”, Proceedings of the IAPR Workshop on Graph-Based Representations, pp. 273-282.

Anaya, H., Pons, A. and Berlanga, R. (2006), “Una panorámica de la construcción de extractos de un texto”, Revista Cubana de Ciencias Informáticas, Vol. 1 No. 1, pp. 55-65.

Andreasen, T. and Bulskov, H. (2009), “Conceptual querying through ontologies”, Fuzzy Sets and Systems, Vol. 160 No. 5, pp. 2159-2172.

Arco, L. (2008), “Agrupamiento basado en intermediación diferencial”, PhD thesis, Universidad Central “Marta Abreu” de las Villas, Santa Clara.

Aretoulaki, M. (1997), “COSY-MATS: ‘an intelligent and scalable summarisation shell’”, Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, Madrid, pp. 74-81.

Austin, J.L. (1962), How To Do Things With Words, Clarendon Press, Oxford.

Barzilay, R. and Elhadad, M. (1997), “Using lexical chains for text summarization”, Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, Madrid, pp. 10-17.

Berry, M. (2004), Survey of Text Mining: Clustering, Classification, and Retrieval, Springer, New York, NY.

Chen, P. and Verma, R. (2006), “A query-based medical information summarization system using ontology knowledge”, Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems (CBMS’06), pp. 37-42.

D'Cunha, I. (2006), “Hacia un modelo lingüístico de resumen automático de artículos médicos en español”, PhD thesis, Universitat Pompeu Fabra, Barcelona.

Domínguez-Velasco, S. (2010), Protex Beta Software, Departamento de automática, Universidad Central de las Villas, Santa Clara.

Domínguez-Velasco, S. (2013), Metric Beta Software, Universidad Central “Marta Abreu” de las Villas, CDICT, Santa Clara.

Endres-Niggemeyer, B., Maier, E. and Sigel, A. (1995), “How to implement a naturalistic model of abstracting: four core working steps of an expert abstractor”, Information Processing & Management, Vol. 31 No. 5, pp. 631-674.

Frakes, W.B. and Baeza-Yates, R. (Eds) (1992), Information Retrieval: Data Structures & Algorithms, Prentice Hall, New York, NY.

Gaizauskas, R., Herring, P., Oakes, M., Beaulieu, M., Willett, P., Fowkes, H. and Jonsson, A. (2001), “Intelligent access to text: integrating information extraction technology into text browsers”, Proceedings of the Human Language Technology Conference, San Diego, CA, pp. 189-193.

Gil-García, R. and Pons-Porrata, A. (2008), “Hierarchical star clustering algorithm for dynamic document collections”, CIARP 2008, pp. 187-194.

Goldstein, J., Kantrowitz, M., Mittal, V. and Carbonell, J. (1999), “Summarizing text documents: sentence selection and evaluation metrics”, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’99), ACM, New York, NY, pp. 121-128.

Goldstein, J., Mittal, V., Carbonell, J. and Callan, J. (2000), “Creating and evaluating multi-document sentence extract summaries”, Proceedings of the Ninth International Conference on Information and Knowledge Management (CIKM ’00), ACM, New York, NY, pp. 165-172.

Grimes, U.E. (1975), The Thread of Discourse, Mouton, The Hague.

Halliday, M.A.K. and Hasan, R. (1976), Cohesion in English, Longman, Essex.

Havens, T.C., Keller, J.M., Popescu, M. and Bezdek, J.C. (2008), “Ontological self-organizing maps for cluster visualization and functional summarization of gene products using gene ontology similarity measures”, IEEE International Conference on Fuzzy Systems (FUZZ 2008), Hong Kong, June 1-6.

Hennig, L., Umbrath, W. and Wetzker, R. (2008), “An ontology-based approach to text summarization”, 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 291-294.

Hjørland, B. (2002), “Epistemology and the socio-cognitive perspective in information science”, Journal of American Society of Information Science, Vol. 53 No. 4, pp. 257-270.

Hu, P., He, T., Ji, D. and Wang, M. (2004), “A study of Chinese text summarization using adaptive clustering of paragraphs”, Proceedings of the Fourth International Conference on Computer and Information Technology (CIT’04), IEEE, ACM, Wuhan, September 14-16.

Huang, H.H. and Kuo, Y.H. (2007), “Towards auto-construction of domain ontology: an auto-constructed domain conceptual lexicon and its application to extractive summarization”, Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, IEEE, Hong Kong, pp. 2947-2952.

Hung, J. (2008), “RETRACTION: a new WSD approach using word ontology and concept distribution”, Journal of Information Science, Vol. 34 No. 22, pp. 231-253.

Lanquillon, C. (2002), “Enhancing text classification to improve information filtering”, PhD thesis, Otto-von-Guericke-Universität Magdeburg, Magdeburg.

Larsen, B. and Aone, C. (1999), “Fast and effective text mining using linear-time document clustering”, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, pp. 16-22.

Lee, C.S., Chen, Y.J. and Jian, Z.W. (2003), “Ontology-based fuzzy event extraction agent for Chinese e-news summarization”, Expert Systems with Applications, Vol. 25 No. 3, pp. 431-447.

Leiva, A., Senso, J.A., Domínguez, S. and Hípola, P. (2009), “An automat for the semantic processing of structured information”, ISDA 9th International Conference of Design of Software and Application, IEEE, Pisa, November 30-December 3.

Leiva-Mederos, A. (2012), “Texminer: un modelo para la extracción y desambiguación de textos científicos en el dominio de Ingeniería de Puertos y Costas”, PhD thesis, Universidad de Granada, Granada.

Leiva-Mederos, A., Domínguez-Velasco, S. and Senso, J.A. (2012), “PuertoTex: un software de minería textual para la creación de resúmenes automáticos en el dominio de ingeniería de puertos y costas basado en ontologías”, TransInformação, Vol. 24 No. 2, pp. 103-115.

Lesk, M. (1986), “Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone”, Proceedings of SIGDOC.

Leskovec, J., Milic-Frayling, N. and Grobelnik, M. (2005), “Impact of linguistic analysis on the semantic graph coverage and learning of document extracts”, in Veloso, M. and Kambhampati, S. (Eds), Proceedings of the 20th National Conference on Artificial Intelligence, ACM Press, Pittsburgh, Vol. 3, pp. 1069-1074.

Lin, C. (2004a), “Looking for a few good metrics: automatic summarization evaluation – how many samples are enough?”, Proceedings of the NTCIR Workshop 4, Tokyo.

Lin, C. (2004b), “Rouge: a package for automatic evaluation of summaries”, Proceedings of the Workshop on Text Summarization Branches Out (WAS’04), Barcelona, pp. 25-26.

Lin, C. and Hovy, E. (1997), “Identifying topics by position”, Proceedings of the ACL Applied Natural Language Processing Conference, Washington DC, pp. 283-290.

Lin, F.R. and Liang, C.H. (2008), “Storyline-based summarization for news topic retrospection”, Decision Support Systems, Vol. 45 No. 3, pp. 473-490.

Luhn, H. (1958), “The automatic creation of literature abstracts”, Journal of Research and Development, Vol. 2 No. 2, pp. 159-165.

Mani, I. and Bloedorn, E. (1999), “Summarizing similarities and differences among related documents”, Information Retrieval, Vol. 1 Nos 1-2, pp. 35-67.

Mann, W.C. and Thompson, S.A. (1988), “Rhetorical structure theory: toward a functional theory of text organization”, Text, Vol. 8 No. 3, pp. 243-281.

Marcu, D. (1998), “The rhetorical parsing, summarization, and generation of natural language texts”, PhD thesis, University of Toronto, Toronto.

Marcu, D. (2000), The Theory and Practice of Discourse Parsing and Summarization, Massachusetts Institute of Technology, Cambridge, MA.

Mateo, P., González, J.C., Villena, J. and Martínez, J.L. (2003), “Un sistema para resumen automático de textos en castellano”, Procesamiento del Lenguaje Natural, Vol. 31, pp. 29-36.

Montalvo, S., Navarro, A., Martínez, R., Casillas, A. and Fresno, V. (2006), “Evaluación de la selección, traducción y pesado de los rasgos para la mejora del clustering multilingüe”, Campus Multidisciplinar en Percepción e Inteligencia (CMPI 2006) – 50 Años de Inteligencia Artificial, Vol. 2, pp. 769-778.

Nomoto, T. and Matsumoto, Y. (2001), “A new approach to unsupervised text summarization”, Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’01), ACM, New York, NY, pp. 26-34.

Ono, K., Sumita, K. and Miike, S. (1994), “Abstract generation based on rhetorical structure extraction”, Proceedings of the International Conference on Computational Linguistics, Kyoto, pp. 344-348.

Pinto, M. (2001), El resumen documental: principios y métodos, Fundación Germán Sánchez Ruipérez, Madrid.

Popescu, M., Keller, J.M., Mitchell, J.A. and Bezdek, J.C. (2004), “Functional summarization of gene product clusters using gene ontology similarity measures”, Proceedings of the 2004 Intelligent Sensors, Sensor Networks and Information Processing Conference (ISSNIP), IEEE, New York.

Rosell, M., Kann, V. and Litton, J. (2004), “Comparing comparisons: document clustering evaluation using two manual classifications”, Proceedings of ICON 2004, 3rd International Conference on Natural Language Processing, Hyderabad, December 19-22.

Salton, G. and Buckley, C. (1988), “Term-weighting approaches in automatic text retrieval”, Information Processing & Management, Vol. 24 No. 5, pp. 513-523.

Searle, J. (1969), Speech Acts. An Essay in the Philosophy of Language, Cambridge University Press, Cambridge.

Steinbach, M., Karypis, G. and Kumar, V. (2000), “A comparison of document clustering techniques”, KDD Workshop on Text Mining, Vol. 400 No. 1, pp. 525-526.

Teufel, S. and Moens, M. (2002), “Summarizing scientific articles: experiments with relevance and rhetorical status”, Computational Linguistics, Vol. 28 No. 4, pp. 409-445.

Wu, K., Li, L., Li, J. and Li, T. (2013), “Ontology-enriched multi-document summarization in disaster management using submodular function”, Information Sciences, Vol. 224, pp. 118-129.

Yoo, I., Hu, X. and Song, I. (2006), “Integration of semantic-based bipartite graph representation and mutual refinement strategy for biomedical literature clustering”, Proceedings of ACM SIGKDD, ACM, pp. 791-796.

Yuan, S.T. and Sun, J. (2004), “Ontology-based structured cosine similarity in speech document summarization”, Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI’04), IEEE, ACM, Beijing, September 20-24.

Zhang, Z., Huang, Z. and Zhang, X. (2010), “Knowledge summarization for scalable semantic data processing”, Journal of Computational Information Systems, Vol. 6 No. 12, pp. 3893-3902.

