Extracción y normalización de entidades genómicas en textos biomédicos: una propuesta basada en transductores gráficos

Galvez, Carmen and De-Moya-Anegón, Félix. Extracción y normalización de entidades genómicas en textos biomédicos: una propuesta basada en transductores gráficos, 2006. In 1st Iberian Conference on Information Sciences and Technologies - CISTI 2006, Esposende, Portugal: Escola Superior de Tecnologia (EST), Instituto Politécnico do Cávado e do Ave (IPCA), 21-23 June 2006. [Conference paper]

English abstract

The lack of standardized systems for naming genes is a problem for identifying information in the biomedical literature, and it greatly hinders an essential process in molecular biology: finding and discovering biological relations among genes in documents that deal with the same genomic entity but use different symbols. We propose a procedure adopted from natural language processing (NLP), based on the application of finite-state transducers, that recognizes the various names of a gene and relates them to a unified form. The normalization process takes a list of synonyms as input and produces a unique identifier for that gene as output. The genomic database FlyBase provided the resources needed to illustrate our proposal.
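
The normalization step described in the abstract (a list of synonyms in, a unique identifier out) can be illustrated with a minimal sketch. It is not the authors' implementation, which builds on graphical transducers; here a character-level trie stands in for the finite-state transducer. The class name `SynonymTransducer`, the gene names, and the identifiers `ID:0001` and `ID:0002` are all invented for illustration and are not taken from FlyBase.

```python
from typing import Dict, List, Optional, Tuple


class SynonymTransducer:
    """Character-level trie acting as a simple finite-state transducer.

    Hypothetical illustration: states are integers, transitions are
    per-state dictionaries, and accepting states emit an identifier.
    """

    def __init__(self) -> None:
        self.transitions: List[Dict[str, int]] = [{}]  # state 0 is the start state
        self.output: List[Optional[str]] = [None]      # identifier emitted per state

    def add(self, synonym: str, identifier: str) -> None:
        """Add one synonym (case-insensitive) that maps to `identifier`."""
        state = 0
        for ch in synonym.lower():
            nxt = self.transitions[state].get(ch)
            if nxt is None:
                nxt = len(self.transitions)
                self.transitions.append({})
                self.output.append(None)
                self.transitions[state][ch] = nxt
            state = nxt
        self.output[state] = identifier

    def longest_match(self, text: str, start: int) -> Optional[Tuple[int, str]]:
        """Return (end, identifier) for the longest synonym starting at `start`
        that ends at a word boundary, or None if there is no such synonym."""
        state = 0
        best: Optional[Tuple[int, str]] = None
        for pos in range(start, len(text)):
            nxt = self.transitions[state].get(text[pos].lower())
            if nxt is None:
                break
            state = nxt
            end = pos + 1
            ident = self.output[state]
            at_word_end = end == len(text) or not text[end].isalnum()
            if ident is not None and at_word_end:
                best = (end, ident)
        return best

    def normalize(self, text: str) -> str:
        """Replace every recognized synonym in `text` with its identifier."""
        out: List[str] = []
        pos = 0
        while pos < len(text):
            at_word_start = pos == 0 or not text[pos - 1].isalnum()
            match = self.longest_match(text, pos) if at_word_start else None
            if match is not None:
                end, ident = match
                out.append(ident)
                pos = end
            else:
                out.append(text[pos])
                pos += 1
        return "".join(out)


# Invented synonym lists and identifiers, for illustration only.
fst = SynonymTransducer()
for name in ("gene alpha", "ga", "g-alpha"):
    fst.add(name, "ID:0001")
fst.add("gene beta", "ID:0002")

result = fst.normalize("Gene Alpha interacts with g-alpha and gene beta.")
print(result)
```

In this sketch every variant of a name collapses to the same identifier, so two documents that use different symbols for one gene become comparable after normalization. The word-boundary check keeps a short symbol such as the invented `ga` from matching inside a longer word.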

Spanish abstract

La falta de sistemas homologados para denominar a los genes es un problema para la identificación de información en la literatura biomédica y hace muy difícil un proceso esencial en el campo de la biología molecular: encontrar y descubrir relaciones biológicas entre genes, en aquellos documentos que tratan la misma entidad genómica pero que usan símbolos distintos. Nosotros proponemos un procedimiento adoptado del procesamiento de lenguaje natural (PLN) basado en la aplicación de transductores de estado-finito que permite el reconocimiento de los diversos nombres de un gen y los relaciona con una forma unificada. El proceso de normalización requiere como input una lista de sinónimos, y como output un identificador único para ese gen. La base de datos genómica FlyBase nos ha aportado los recursos necesarios para exponer nuestra propuesta.

Item type: Conference paper
Keywords: Finite-state transducers; Normalization of gene terms; Information extraction; Phonetic coding; Personal name matching; Name-matching algorithms.
Subjects: L. Information technology and library technology > LL. Automated language processing.
Depositing user: Carmen Galvez
Date deposited: 06 Aug 2007
Last modified: 02 Oct 2014 12:08
URI: http://hdl.handle.net/10760/10016

