Lucila Ohno-Machado

Duke University, Durham, North Carolina, United States

Publications (246) · 419.55 Total impact

  • ABSTRACT: To propose a new approach to privacy-preserving data selection, which helps data users access human genomic datasets efficiently without undermining patients' privacy.
    Journal of the American Medical Informatics Association : JAMIA. 10/2014;
  • Source
    ABSTRACT: MicroRNAs (miRNAs) are a class of small (22 nucleotides) non-coding RNAs that post-transcriptionally regulate gene expression by interacting with target mRNAs. A majority of miRNAs is located within intronic or exonic regions of protein-coding genes (host genes), and increasing evidence suggests a functional relationship between these miRNAs and their host genes. Here, we introduce miRIAD, a web-service to facilitate the analysis of genomic and structural features of intragenic miRNAs and their host genes for five species (human, rhesus monkey, mouse, chicken and opossum). miRIAD contains the genomic classification of all miRNAs (inter- and intragenic), as well as classification of all protein-coding genes into host or non-host genes (depending on whether they contain an intragenic miRNA or not). We collected and processed public data from several sources to provide a clear visualization of relevant knowledge related to intragenic miRNAs, such as host gene function, genomic context, names of and references to intragenic miRNAs, miRNA binding sites, clusters of intragenic miRNAs, miRNA and host gene expression across different tissues and expression correlation for intragenic miRNAs and their host genes. Protein–protein interaction data are also presented for functional network analysis of host genes. In summary, miRIAD was designed to help the research community to explore, in a user-friendly environment, intragenic miRNAs, their host genes and functional annotations with minimal effort, facilitating hypothesis generation and in-silico validations. © The Author(s) 2014. Published by Oxford University Press.
    Database The Journal of Biological Databases and Curation 10/2014; · 4.20 Impact Factor
  • Source
    ABSTRACT: MAGI is a web service for fast MicroRNA-Seq data analysis in a GPU infrastructure. Using just a browser, users have access to results as web-reports in just a few hours - over 600% end-to-end performance improvement over state-of-the-art. MAGI's salient features are: (i) transfer of large input files in native FASTQ format through drag-and-drop operations, (ii) rapid prediction of microRNA target genes leveraging parallel computing with GPU devices, (iii) all-in-one analytics with novel feature extraction, statistical test for differential expression, and diagnostic plot generation for quality control, and (iv) interactive visualization and exploration of results in web-reports that are readily available for publication. Availability and implementation: MAGI relies on the Node.js JavaScript framework, along with NVIDIA CUDA C, PHP, Perl, R. It is freely available at
    Bioinformatics (Oxford, England). 06/2014;
  • Source
    ABSTRACT: Non-coding sequences such as microRNAs have important roles in disease processes. Computational microRNA target identification (CMTI) is becoming increasingly important since traditional experimental methods for target identification pose many difficulties. These methods are time-consuming, costly, and often need guidance from computational methods to narrow down candidate genes anyway. However, most CMTI methods are computationally demanding, since they need to handle not only several million query microRNA and reference RNA pairs, but also several million nucleotide comparisons within each given pair. Thus, the need to perform microRNA identification at such large scale has increased the demand for parallel computing.
    BMC Medical Genomics 05/2014; 7(Suppl 1):S9. · 3.91 Impact Factor
  • Source
    ABSTRACT: Privacy protecting is an important issue in medical informatics and differential privacy is a state-of-the-art framework for data privacy research. Differential privacy offers provable privacy against attackers who have auxiliary information, and can be applied to data mining models (for example, logistic regression). However, differentially private methods sometimes introduce too much noise and make outputs less useful. Given available public data in medical research (e.g. from patients who sign open-consent agreements), we can design algorithms that use both public and private data sets to decrease the amount of noise that is introduced.
    BMC Medical Genomics 05/2014; 7(Suppl 1):S14. · 3.91 Impact Factor
  • Source
    ABSTRACT: This article describes the patient-centered Scalable National Network for Effectiveness Research (pSCANNER), which is part of the recently formed PCORnet, a national network composed of learning healthcare systems and patient-powered research networks funded by the Patient Centered Outcomes Research Institute (PCORI). It is designed to be a stakeholder-governed federated network that uses a distributed architecture to integrate data from three existing networks covering over 21 million patients in all 50 states: (1) VA Informatics and Computing Infrastructure (VINCI), with data from Veteran Health Administration's 151 inpatient and 909 ambulatory care and community-based outpatient clinics; (2) the University of California Research exchange (UC-ReX) network, with data from UC Davis, Irvine, Los Angeles, San Francisco, and San Diego; and (3) SCANNER, a consortium of UCSD, Tennessee VA, and three federally qualified health systems in the Los Angeles area supplemented with claims and health information exchange data, led by the University of Southern California. Initial use cases will focus on three conditions: (1) congestive heart failure; (2) Kawasaki disease; (3) obesity. Stakeholders, such as patients, clinicians, and health service researchers, will be engaged to prioritize research questions to be answered through the network. We will use a privacy-preserving distributed computation model with synchronous and asynchronous modes. The distributed system will be based on a common data model that allows the construction and evaluation of distributed multivariate models for a variety of statistical analyses.
    Journal of the American Medical Informatics Association 04/2014; · 3.57 Impact Factor
  • ABSTRACT: Many healthcare facilities enforce security on their electronic health records (EHRs) through a corrective mechanism: some staff nominally have almost unrestricted access to the records, but there is a strict ex post facto audit process for inappropriate accesses, i.e., accesses that violate the facility’s security and privacy policies. This process is inefficient, as each suspicious access has to be reviewed by a security expert, and is purely retrospective, as it occurs after damage may have been incurred. This motivates automated approaches based on machine learning using historical data. Previous attempts at such a system have successfully applied supervised learning models to this end, such as SVMs and logistic regression. While providing benefits over manual auditing, these approaches ignore the identity of the users and patients involved in a record access. Therefore, they cannot exploit the fact that a patient whose record was previously involved in a violation has an increased risk of being involved in a future violation. Motivated by this, in this paper, we propose a collaborative filtering inspired approach to predicting inappropriate accesses. Our solution integrates both explicit and latent features for staff and patients, the latter acting as a personalized “fingerprint” based on historical access patterns. The proposed method, when applied to real EHR access data from two tertiary hospitals and a file-access dataset from Amazon, shows not only significantly improved performance compared to existing methods, but also provides insights as to what indicates an inappropriate access.
    Machine Learning 04/2014; · 1.47 Impact Factor
  • Source
    Son Doan, Mike Conway, Tu Minh Phuong, Lucila Ohno-Machado
    ABSTRACT: In modern electronic medical records (EMR) much of the clinically important data - signs and symptoms, symptom severity, disease status, etc. - are not provided in structured data fields, but rather are encoded in clinician generated narrative text. Natural language processing (NLP) provides a means of "unlocking" this important data source for applications in clinical decision support, quality assurance, and public health. This chapter provides an overview of representative NLP systems in biomedicine based on a unified architectural view. A general architecture in an NLP system consists of two main components: background knowledge that includes biomedical knowledge resources and a framework that integrates NLP tools to process text. Systems differ in both components, which we will review briefly. Additionally, challenges facing current research efforts in biomedical NLP include the paucity of large, publicly available annotated corpora, although initiatives that facilitate data sharing, system evaluation, and collaborative work between researchers in clinical NLP are starting to emerge.
    Methods in molecular biology (Clifton, N.J.). 01/2014; 1168.
  • Source
    ABSTRACT: MicroRNAs (miRNAs) are a class of short noncoding RNAs that regulate gene expression through base pairing with messenger RNAs. Due to the interest in studying miRNA dysregulation in disease and limits of validated miRNA references, identification of novel miRNAs is a critical task. The performance of different models to predict novel miRNAs varies with the features chosen as predictors. However, no study has systematically compared published feature sets. We constructed a comprehensive feature set using the minimum free energy of the secondary structure of precursor miRNAs, a set of nucleotide-structure triplets, and additional extracted sequence and structure characteristics. We then compared the predictive value of our comprehensive feature set to those from three previously published studies, using logistic regression and random forest classifiers. We found that classifiers containing as few as seven highly predictive features are able to predict novel precursor miRNAs as well as classifiers that use larger feature sets. In a real data set, our method correctly identified the holdout miRNAs relevant to renal cancer.
    Cancer informatics 01/2014; 13(Suppl 1):95-102.
  • Source
    ABSTRACT: Short-read sequencing is becoming the standard of practice for the study of structural variants associated with disease. However, with the growth of sequence data largely surpassing reasonable storage capability, the biomedical community is challenged with the management, transfer, archiving, and storage of sequence data. We developed Hierarchical mUlti-reference Genome cOmpression (HUGO), a novel compression algorithm for aligned reads in the sorted Sequence Alignment/Map (SAM) format. We first aligned short reads against a reference genome and stored exactly mapped reads for compression. For the inexact mapped or unmapped reads, we realigned them against different reference genomes using an adaptive scheme by gradually shortening the read length. Regarding the base quality value, we offer lossy and lossless compression mechanisms. The lossy compression mechanism for the base quality values uses k-means clustering, where a user can adjust the balance between decompression quality and compression rate. The lossless compression can be produced by setting k (the number of clusters) to the number of different quality values. The proposed method produced a compression ratio in the range 0.5-0.65, which corresponds to 35-50% storage savings based on experimental datasets. The proposed approach achieved 15% more storage savings over CRAM and comparable compression ratio with Samcomp (CRAM and Samcomp are two of the state-of-the-art genome compression algorithms). The software is freely available at a General Public License (GPL) license. Our method requires having different reference genomes and prolongs the execution time for additional alignments. The proposed multi-reference-based compression algorithm for aligned reads outperforms existing single-reference based algorithms.
    Journal of the American Medical Informatics Association 12/2013; · 3.57 Impact Factor
  • ABSTRACT: There is currently limited information on best practices for the development of governance requirements for distributed research networks (DRNs), an emerging model that promotes clinical data reuse and improves timeliness of comparative effectiveness research. Much of the existing information is based on a single type of stakeholder such as researchers or administrators. This paper reports on a triangulated approach to developing DRN data governance requirements based on a combination of policy analysis with experts, interviews with institutional leaders, and patient focus groups. This approach is illustrated with an example from the Scalable National Network for Effectiveness Research, which resulted in 91 requirements. These requirements were analyzed against the Fair Information Practice Principles (FIPPs) and Health Insurance Portability and Accountability Act (HIPAA) protected versus non-protected health information. The requirements addressed all FIPPs, showing how a DRN's technical infrastructure is able to fulfill HIPAA regulations, protect privacy, and provide a trustworthy platform for research.
    Journal of the American Medical Informatics Association 12/2013; · 3.57 Impact Factor
  • Source
    ABSTRACT: In a growing interdisciplinary field like biomedical informatics, information dissemination and citation trends are changing rapidly due to many factors. To understand these factors better, we analyzed the evolution of the number of articles per major biomedical informatics topic, download/online view frequencies, and citation patterns (using Web of Science) for articles published from 2009 to 2012 in JAMIA. The number of articles published in JAMIA increased significantly from 2009 to 2012, and there were some topic differences in the last 4 years. Medical Record Systems, Algorithms, and Methods are topic categories that are growing fast in several publications. We observed a significant correlation between download frequencies and the number of citations per month since publication for a given article. Earlier free availability of articles to non-subscribers was associated with a higher number of downloads and showed a trend towards a higher number of citations. This trend will need to be verified as more data accumulate in coming years.
    Journal of the American Medical Informatics Association 11/2013; · 3.57 Impact Factor
  • ABSTRACT: WebGLORE is a free web service that enables privacy-preserving construction of a global logistic regression model from distributed datasets that are sensitive. It only transfers aggregated local statistics (from participants) through Hypertext Transfer Protocol Secure (HTTPS) to a trusted server, where the global model is synthesized. WebGLORE seamlessly integrates AJAX, Java Applet/Servlet and PHP technologies to provide an easy-to-use web service for biomedical researchers to break down policy barriers during information exchange. WebGLORE can be used under the terms of the GNU General Public License as published by the Free Software Foundation.
    Bioinformatics 09/2013; · 5.47 Impact Factor
  • Source
    ABSTRACT: The database of genotypes and phenotypes (dbGaP) developed by the National Center for Biotechnology Information (NCBI) is a resource that contains information on various genome-wide association studies (GWAS) and is currently available via NCBI's dbGaP Entrez interface. The database is an important resource, providing GWAS data that can be used for new exploratory research or cross-study validation by authorized users. However, finding studies relevant to a particular phenotype of interest is challenging, as phenotype information is presented in a non-standardized way. To address this issue, we developed PhenDisco (phenotype discoverer), a new information retrieval system for dbGaP. PhenDisco consists of two main components: (1) text processing tools that standardize phenotype variables and study metadata, and (2) information retrieval tools that support queries from users and return ranked results. In a preliminary comparison involving 18 search scenarios, PhenDisco showed promising performance for both unranked and ranked search comparisons with dbGaP's search engine Entrez. The system can be accessed at
    Journal of the American Medical Informatics Association 08/2013; · 3.57 Impact Factor
  • Claudiu Farcas, Natasha Balac, Lucila Ohno-Machado
    ABSTRACT: Biomedical research traverses a new era of advancements through the adoption of massive computing and big-data solutions to major scientific problems. However, the road ahead is far from "a walk in a park" -- many obstacles exist in the standardization, adoption, and evolution of methods, practices, algorithms, tools, and ultimately knowledge, that would mature along this road. In this article, we discuss such challenges that we encountered in this field and possible solutions from the iDASH program that closely engages this community.
    Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery; 07/2013
  • Atul J Butte, Lucila Ohno-Machado
    Journal of the American Medical Informatics Association 07/2013; 20(4):595-6. · 3.57 Impact Factor
  • ABSTRACT: Advances in DNA information extraction techniques have led to huge sequenced genomes from organisms spanning the tree of life. This increasing amount of genomic information requires tools for comparison of the nucleotide sequences. In this paper, we propose a novel nucleotide sequence alignment method based on sparse coding and belief propagation to compare the similarity of the nucleotide sequences. We used the neighbors of each nucleotide as features, and then we employed sparse coding to find a set of candidate nucleotides. To select optimum matches, belief propagation was subsequently applied to these candidate nucleotides. Experimental results show that the proposed approach is able to robustly align nucleotide sequences and is competitive to SOAPaligner [1] and BWA [2].
    Conference proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference 07/2013; 2013:588-591.
  • Katherine K Kim, Deven McGraw, Laura Mamo, Lucila Ohno-Machado
    ABSTRACT: Comparative effectiveness research (CER) conducted in distributed research networks (DRNs) is subject to different state laws and regulations as well as institution-specific policies intended to protect privacy and security of health information. The goal of the Scalable National Network for Effectiveness Research (SCANNER) project is to develop and demonstrate a scalable, flexible technical infrastructure for DRNs that enables near real-time CER consistent with privacy and security laws and best practices. This investigation began with an analysis of privacy and security laws and state health information exchange (HIE) guidelines applicable to SCANNER participants from California, Illinois, Massachusetts, and the Federal Veteran's Administration. A 7-member expert panel of policy and technical experts reviewed the analysis and gave input into the framework during 5 meetings held in 2011-2012. The state/federal guidelines were applied to 3 CER use cases: safety of new oral hematologic medications; medication therapy management for patients with diabetes and hypertension; and informational interventions for providers in the treatment of acute respiratory infections. The policy framework provides flexibility, beginning with a use-case approach rather than a one-size-fits-all approach. The policies may vary depending on the type of patient data shared (aggregate counts, deidentified, limited, and fully identified datasets) and the flow of data. The types of agreements necessary for a DRN may include a network-level and data use agreements. The need for flexibility in the development and implementation of policies must be balanced with responsibilities of data stewardship.
    Medical care 06/2013; · 3.24 Impact Factor
  • Xiaoqian Jiang, Anand D Sarwate, Lucila Ohno-Machado
    ABSTRACT: OBJECTIVE: Effective data sharing is critical for comparative effectiveness research (CER), but there are significant concerns about inappropriate disclosure of patient data. These concerns have spurred the development of new technologies for privacy-preserving data sharing and data mining. Our goal is to review existing and emerging techniques that may be appropriate for data sharing related to CER. MATERIALS AND METHODS: We adapted a systematic review methodology to comprehensively search the research literature. We searched 7 databases and applied 3 stages of filtering based on titles, abstracts, and full text to identify those works most relevant to CER. RESULTS: On the basis of agreement and using the arbitrage of a third party expert, we selected 97 articles for meta-analysis. Our findings are organized along major types of data sharing in CER applications (i.e., institution-to-institution, institution hosted, and public release). We made recommendations based on specific scenarios. LIMITATION: We limited the scope of our study to methods that demonstrated practical impact, eliminating many theoretical studies of privacy that have been surveyed elsewhere. We further limited our study to data sharing for data tables, rather than complex genomic, set valued, time series, text, image, or network data. CONCLUSION: State-of-the-art privacy-preserving technologies can guide the development of practical tools that will scale up the CER studies of the future. However, many challenges remain in this fast moving field in terms of practical evaluations and applications to a wider range of data types.
    Medical care 06/2013; · 3.24 Impact Factor
  • ABSTRACT: We developed an EXpectation Propagation LOgistic REgRession (EXPLORER) model for distributed privacy-preserving online learning. The proposed framework provides a high level guarantee for protecting sensitive information, since the information exchanged between the server and the client is the encrypted posterior distribution of coefficients. Through experimental results, EXPLORER shows the same performance (e.g., discrimination, calibration, feature selection etc.) as the traditional frequentist Logistic Regression model, but provides more flexibility in model updating. That is, EXPLORER can be updated one point at a time rather than having to retrain the entire data set when new observations are recorded. The proposed EXPLORER supports asynchronized communication, which relieves the participants from coordinating with one another, and prevents service breakdown from the absence of participants or interrupted communications.
    Journal of Biomedical Informatics 04/2013; · 2.13 Impact Factor

Publication Stats

4k Citations
419.55 Total Impact Points


  • 2014
    • Duke University
      Durham, North Carolina, United States
  • 2010–2014
    • University of California, San Diego
      • Department of Medicine
      San Diego, California, United States
    • Ludwig-Maximilian-University of Munich
      • Department of Anesthesiology
      München, Bavaria, Germany
    • Private Universität für Gesundheitswissenschaften, Medizinische Informatik und Technik
      • Institute for Electrical and Biomedical Engineering
      Hall in Tirol, Tyrol, Austria
  • 2013
    • Rutgers, The State University of New Jersey
      New Brunswick, New Jersey, United States
    • Shanghai Jiao Tong University
      Shanghai, Shanghai Shi, China
    • San Francisco State University
      San Francisco, California, United States
    • Yale University
      New Haven, Connecticut, United States
    • Toyota Technological Institute at Chicago
      Chicago, Illinois, United States
  • 1993–2013
    • Stanford Medicine
      • Department of Pediatrics
      Stanford, CA, United States
  • 2012
    • Concordia University Montreal
      • Department of Computer Science and Software Engineering
      Montréal, Quebec, Canada
    • University of Pittsburgh
      Pittsburgh, Pennsylvania, United States
  • 1996–2012
    • Harvard Medical School
      • Department of Radiology
      • Department of Genetics
      Boston, Massachusetts, United States
  • 2011
    • University of Vermont
      • Center for Clinical and Translational Science
      Burlington, Vermont, United States
    • Johns Hopkins University
      • Division of Health Sciences Informatics
      Baltimore, MD, United States
  • 2002–2010
    • Fachhochschule Oberösterreich
      Wels, Upper Austria, Austria
    • Seoul National University Hospital
      Seoul, South Korea
    • Federal University of Rio de Janeiro
      Rio de Janeiro, Rio de Janeiro, Brazil
  • 1996–2010
    • Brigham and Women's Hospital
      • Center for Brain Mind Medicine
      • Department of Medicine
      • Decision Systems Group
      • Division of Cardiovascular Medicine
      Boston, MA, United States
  • 2009
    • University of Washington Seattle
      Seattle, Washington, United States
  • 2001–2008
    • Harvard University
      Cambridge, Massachusetts, United States
  • 2007
    • University of Oslo
      Oslo, Oslo County, Norway
  • 2000–2007
    • Massachusetts Institute of Technology
      • Computer Science and Artificial Intelligence Laboratory
      • Division of Health Sciences and Technology
      Cambridge, MA, United States
    • Consorcio Hospital General Universitario de Valencia
      • Departamento de Cardiología
      Valencia, Valencia, Spain
  • 2004–2006
    • Universidade Federal de São Paulo
      São Paulo, São Paulo, Brazil
    • University of Massachusetts Boston
      • Department of Computer Science
      Boston, MA, United States
  • 2003–2006
    • Boston Children's Hospital
      • Department of Radiology
      Boston, MA, United States
  • 2004–2005
    • Partners HealthCare
      • Department of Radiology
      Boston, MA, United States
  • 2003–2005
    • Teikyo University Hospital
      Tokyo, Japan
  • 2002–2003
    • University of Hertfordshire
      • School of Computer Science
      Hatfield, ENG, United Kingdom
  • 1997–2002
    • Stanford University
      Palo Alto, California, United States
    • University of Southern California
      • Department of Chemical Engineering and Materials Science
      Los Angeles, CA, United States
  • 1999
    • Norwegian University of Science and Technology
      • Department of Computer and Information Science
      Trondheim, Sør-Trøndelag, Norway