The aim of this proposal is to implement a novel way of processing and accessing the vast, detailed knowledge contained in collections of scientific publications on the regulation of transcription initiation in bacterial models. In principle, this model for processing and reading information and new knowledge is applicable to other biological domains, potentially benefiting any area of biomedical knowledge. It is critical to generate new strategies to cope with the ever-increasing amount of knowledge generated in genomics and in biomedical research at large. Improving the efficiency of traditional, high-quality manual curation of scientific publications will also enable us to expand the types of biological knowledge we capture, beyond mechanisms and their elements in the genome, to include their connections with larger regulated processes and, eventually, with physiological properties of the cell.
We will first implement the technology needed to improve our curation: a computational system with text mining capabilities that preprocesses papers before a human expert curator identifies which sentences contain the information to be added to the database. Premarked options that curators select will accelerate their decisions. The cumulative, precise mapping between sentences and curated knowledge will provide training sets that improve automatic extraction by text mining technologies. As curation becomes more efficient, we will be able to curate selected high-impact published reviews, placing mechanisms in the rich context of their physiological processes and general biology.
Another relevant component of our proposal is the improved modeling of regulated processes by means of new concepts in biology that capture larger collections of coregulated genes and their concatenated reactions. Starting from all interactions of a local regulator, coregulated regulators and their domains of action will be incorporated to construct the biobricks of complex decisions, as they are encoded in the genome. These are conceptual containers that capture the organization of knowledge to describe the genetic programming of cellular capabilities. These concepts will be formalized and proposed within an international consortium focused on enriching standard models or ontologies of gene regulation for use by the scientific community.
Finally, we will implement a portal to navigate across all the sentences of a large corpus (more than 5,000) of related papers. The different avenues of navigation will essentially use two technologies: one that automatically generates simpler sentences from the original sentences, and another that classifies papers by theme or ontology. Their combination will enable a novel navigation and reading system.
If we achieve our aims, this project will deliver a proof-of-principle prototype that integrates large amounts of knowledge at clearly innovative, higher levels. Future directions may adapt these concepts and methods to the biology of higher organisms, including humans.
Grant ID: 5R01GM110597-03
Funding Agency: NIH
Title: "High-Throughput Literature Curation of Genetic Regulation in Bacterial Models"
Funding: $406,247 for the first year
Duration: 4 years (1. Jan 2015 - 31. Dec 2018)
PI: Dr. Julio Collado-Vides
Collaborators: Dr. Michael Savageau, UC Davis; Dr. Stephen Busby, Univ. of Birmingham; Dr. Fabio Rinaldi, Univ. of Zurich.
One of the goals of this NIH-funded project is to integrate advanced text mining techniques in the curation process of a life science database (RegulonDB). The project will make use of ODIN (OntoGene Document Inspector), a user-friendly interface designed by the OntoGene group for curation tasks, which integrates with the OntoGene text mining pipeline.
Selected Publications
- Fabio Rinaldi, Simon Clematide, Yael Garten, Michelle Whirl-Carrillo, Li Gong, Joan M. Hebert, Katrin Sangkuhl, Caroline F. Thorn, Teri E. Klein, Russ B. Altman; Using ODIN for a PharmGKB revalidation experiment. Database (Oxford) 2012; 2012: bas021. doi: 10.1093/database/bas021
- Socorro Gama-Castro, Fabio Rinaldi, Alejandra López-Fuentes, Yalbi Itzel Balderas-Martínez, Simon Clematide, Tilia Renate Ellendorff, Alberto Santos-Zavaleta, Hernani Marques-Madeira, Julio Collado-Vides; Assisted curation of regulatory interactions and growth conditions of OxyR in E. coli K-12. Database (Oxford) 2014; 2014: bau049. doi: 10.1093/database/bau049
- Fabio Rinaldi, Oscar Lithgow, Socorro Gama-Castro, Hilda Solano, Alejandra Lopez, Luis José Muñiz Rascado, Cecilia Ishida-Gutiérrez, Carlos-Francisco Méndez-Cruz, Julio Collado-Vides; Strategies towards digital and semi-automated curation in RegulonDB. Database (Oxford) 2017; 2017 (1): bax012. doi: 10.1093/database/bax012
- Balderas-Martínez YI, Rinaldi F, Contreras G, Solano-Lira H, Sánchez-Pérez M, Collado-Vides J, Selman M, Pardo A; Improving biocuration of microRNAs in diseases: a case study in idiopathic pulmonary fibrosis. Database (Oxford) 2017; 2017 (1): bax030. doi: 10.1093/database/bax030
Screenshot of ODIN interface, customized for RegulonDB curation.