Informatics research to enable clinically relevant, personalized genomic medicine
- Lucila Ohno-Machado, Editor-in-Chief
This is a particularly exciting issue of JAMIA. Not only do we display exceptional work spanning informatics research that integrates data from different biological levels (from molecules to tissues to individuals), but we also show how this research is greatly enhanced by clever integration of knowledge from publicly shared resources (from nucleotide sequences to gene and protein networks to data from the biomedical literature). The articles in this issue cover a broad range of approaches developed in different institutions spread over five countries and 12 US states, and are prime examples of the importance of a quantitative approach to health sciences that requires computational analysis of massive amounts of data that are now being generated at an accelerated pace.
Upon recognizing the importance of providing biomedical and behavioral researchers with algorithms, tools, and computational facilities that accelerate scientific discoveries, the NIH sponsored the creation of several National Centers for Biomedical Computing (NCBCs) 8 years ago. An editorial by Berg (see page 151) discusses the impact of these centers, which are described in eight brief communications (see pages 166–206). These NCBCs embody the very nature of biomedical informatics: a field dedicated to the improvement of human health through the development of new algorithms and software tools for data capture, analysis, and knowledge dissemination, resulting from the combined efforts of researchers from fields seemingly as diverse as computer science, engineering, physics, statistics, library sciences, and biomedical and behavioral sciences, to name a few. While each of the centers focuses on different informatics aspects, they all share the goal of enabling collaborative ‘big science’ through the dissemination of innovative algorithms, tools, and services to the scientific community. An article by Cantor (see page 153) expresses the same goals from the perspective of the pharmaceutical industry: the creation of public–private precompetitive consortia to share data, the adoption of standards so data can be reused, the utilization of electronic health records (EHRs) for data capture and clinical trial recruitment, and the development of incentives for those who develop and share data and tools that facilitate scientific research and advancement. Lussier (see page 156) provides an academic perspective outlining the same issues, and describes huge opportunities and challenges for biomedical informatics research related to personalizing cancer treatment using microRNA signatures. 
Among the challenges is the ‘full integration with clinical phenotyping data, an exercise that exceeds the storage and computing requirements of standard computing approaches and may require computation in grid or cloud space.’ Schweitzer (see page 161) provides a didactic explanation of cloud computing and some requirements for handling sensitive information such as data from EHRs, and discusses issues related to privacy, security, liability, and affordability.
In addition to the NCBCs, other NIH initiatives such as the Electronic Medical Records and Genomics (eMERGE) network have helped promote the integration of the clinical informatics community and a vibrant new breed of translational bioinformatics researchers. In this issue, Kho (see page 212), Wei (see page 219), and Peissig (see page 225) describe algorithms that were used for phenotype identification in EHRs to help select patients for genome-wide association studies (GWAS). Outside the informatics community, few colleagues fully appreciate the difficulty in extracting accurate phenotypes from data collected for patient care. These articles explain how it is often necessary to combine several techniques, including natural language processing (NLP) approaches, to obtain accurate phenotypes from EHRs. Although the eMERGE network has been primarily focused on five clinical conditions, we expect that lessons learned from these phenotype extraction studies will inform related research in many other clinical domains, given appropriate contextual information. For example, an article by Stevenson (see page 235) describes a context-informed NLP approach that enhances current systems for biomedical word sense disambiguation based on the Unified Medical Language System (UMLS).
A large portion of this issue of JAMIA is focused on gene variants and their association with disease, in preparation for a future in which personalized genomic medicine is realized at the point of care. Crockett (see page 207) describes how a new gene-specific machine learning approach that starts with information on sequences leading to particular amino acid substitutions can ultimately help predict disease associated with certain variants. Also related to gene-disease prediction, Peterson (see page 275) proposes a significance score to predict pathogenicity for variants impacting specific protein domain positions and hotspots related to cancer. Morgan (see page 284) describes how a meta-analysis-based, data-driven approach outperforms a knowledge-driven approach to identify genes with variants associated with immune dysfunction. Liu et al (see page 241) describe an innovative approach to identify genes associated with disease that utilizes pairwise analyses of gene and protein expression instead of individual measurements. The authors hypothesize that alterations represented by pairs with dynamic interactions serve as better markers for disease than conventional differential expression of individual genes or proteins, since the pairs can capture disruptions in complex molecular networks. They tested this hypothesis in gastric cancer and obtained very promising results.
The articles in this issue serve as notable examples of the critical contribution of shared resources to the acceleration of biomedical research. With the increasing availability of personal genomes, several studies of gene variants rely on reference catalogs such as those offered by HapMap and the 1000 Genomes Project, so it is critically important that researchers understand that different assumptions were used to construct these references. Buchanan (see page 289) shows that not all variants cataloged in HapMap are present in the 1000 Genomes Project resource, and explains the importance of understanding these differences when validating single nucleotide polymorphisms (SNPs), researching candidate genes, or performing statistical imputation of genotype data. Sarkar (see page 249) proposes an information retrieval vector space-based approach using nucleotide sequence similarity (using BLAST, GenBank, and OMIM) and literature from PubMed (Medline) to infer relationships between diseases and potential common causative genes, offering an opportunity to explore drug repositioning for diseases found to be related (ie, to facilitate the discovery of new disease targets for existing medications). Also motivated by the potential benefits of drug repositioning, Li (see page 295) describes an approach to study clinical traits related to common genes using protein interaction networks derived from the analysis of SNPs and Gene Ontology annotations. Regan (see page 306) extends a similar network approach, showing how unique personal variants can be genetically connected to knowledge derived from disease-associated SNPs of Mendelian or complex inheritance.
We anticipate that in the near future the identification of personal variants and their relation to disease will be used, together with laboratory and clinical findings, to build more accurate personalized predictive models that assess susceptibility to disease or that predict outcomes given specific treatments much better than is possible today. In addition to genomic and proteomic data described in other articles, Cooper (see page 317) uses data from The Cancer Genome Atlas (TCGA) combined with image analysis techniques to provide an excellent example of how morphologic variations can enhance subclassification of disease, clustering tumor tissues into groups of prognostic significance and providing meaningful categories to analyze differential gene expression. Related to phenotypic predictors, Lasserre (see page 255) describes several statistical and machine learning models to predict outcomes of renal transplantation that utilize recipient and donor features, human leucocyte antigen mismatches, and allograft handling variables. Although encouraging results are obtained with phenotypic variables alone, we expect that the inclusion of additional gene variant information will lead to even better predictive performance in the future. Given the critical role of predictive models in organ transplant allocation, accurate estimation of outcomes is very important. Predictive models are increasingly being used in many types of clinical decisions. The inclusion of genetic information, while offering potential for improvement, also presents additional challenges in making reliable individualized, personalized predictions. Jiang (see page 263) addresses the problem of calibrating predictive models so that they can be used for personalized assessments.
We live in exceptional times: biomedical informatics is advancing and extending its reach to enable remarkable scientific advancement through integration of ‘big data’ from all biological levels, using innovative computational tools and methods for data sharing and analysis. Editing this issue of JAMIA was a true pleasure and honor. Documenting how informatics has ‘repositioned’ itself into the core of this new era of computationally intensive team science is a most rewarding experience. I thank our authors, reviewers, editorial team members, AMIA, and most importantly, our readers, for affording me this unique opportunity.