Journal of Biomedical Informatics

Published by Elsevier
Online ISSN: 1532-0480
Article
In order to support empirical medical research concerning reuse and improvement of the expressiveness of study data, and hence to promote syntactic as well as semantic interoperability, services are required for the maintenance of data element collections. As part of the project for the implementation of a German metadata repository for empirical research, we assessed the ability of ISO/IEC 11179 "Information technology - Metadata Registries (MDR)" part 3 edition 3 Final Committee Draft "Registry Metamodel and basic attributes" to represent healthcare standards. The first step of the evaluation was a reformulation of ISO's metamodel with the terms and structures of the different healthcare standards. In a second step, we imported instances of the healthcare standards into a prototypical database implementation representing ISO's metamodel. Whereas the flat structure of disease registries as well as some controlled vocabularies could easily be mapped to ISO's metamodel, complex structures as used in reference models of electronic health records or classifications could not be exhaustively represented. A logical reconstruction of an application will be needed in order to represent them adequately. Moreover, the correct linkage between elements from ISO/IEC 11179 edition 3 and concepts of classifications remains unclear. We also observed some restrictions of ISO/IEC 11179 edition 3 concerning the representation of items of the Operational Data Model from the Clinical Data Interchange Standards Consortium, which might be outside the scope of an MDR. Thus, despite the obvious strengths of ISO/IEC 11179 edition 3 for metadata registries, some issues should be considered in its further development.
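To make the data element notion concrete, here is a minimal sketch of the core ISO/IEC 11179 idea that a data element pairs a concept (its meaning) with a value domain (its representation); the class and attribute names are simplified illustrations, not the normative edition 3 metamodel.

```python
# Minimal sketch of the core ISO/IEC 11179 notions: a Data Element pairs
# a Data Element Concept (meaning) with a Value Domain (representation).
# Names are illustrative, not the normative edition 3 metamodel.
from dataclasses import dataclass, field

@dataclass
class ValueDomain:
    name: str
    datatype: str
    permissible_values: list = field(default_factory=list)  # empty = non-enumerated

@dataclass
class DataElementConcept:
    object_class: str   # e.g. "Patient"
    property: str       # e.g. "birth date"

@dataclass
class DataElement:
    concept: DataElementConcept
    domain: ValueDomain

# A flat registry entry, of the kind disease-registry items map to easily:
birth_date = DataElement(
    DataElementConcept("Patient", "birth date"),
    ValueDomain("ISO 8601 date", "date"),
)
print(birth_date)
```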
 
Article
The threat of bioterrorism has stimulated interest in enhancing public health surveillance to detect disease outbreaks more rapidly than is currently possible. To advance research on improving the timeliness of outbreak detection, the Defense Advanced Research Projects Agency sponsored the Bio-event Advanced Leading Indicator Recognition Technology (BioALIRT) project beginning in 2001. The purpose of this paper is to provide a synthesis of research on outbreak detection algorithms conducted by academic and industrial partners in the BioALIRT project. We first suggest a practical classification for outbreak detection algorithms that considers the types of information encountered in surveillance analysis. We then present a synthesis of our research according to this classification. The research conducted for this project has examined how to use spatial and other covariate information from disparate sources to improve the timeliness of outbreak detection. Our results suggest that use of spatial and other covariate information can improve outbreak detection performance. We also identified, however, methodological challenges that limited our ability to determine the benefit of using outbreak detection algorithms that operate on large volumes of data. Future research must address challenges such as forecasting expected values in high-dimensional data and generating spatial and multivariate test data sets.
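As a concrete illustration of the simplest class of algorithms such a classification covers, the sketch below runs a CUSUM chart over daily syndromic counts; the counts, baseline, slack and threshold are invented for the example and are not from the BioALIRT project.

```python
# Minimal temporal detector of the kind surveyed above: a CUSUM chart
# on daily syndromic counts flags when observations drift above the
# expected value. All values below are toy data.
import numpy as np

counts = np.array([10, 12, 9, 11, 10, 13, 11, 18, 22, 25, 24])  # toy daily counts
baseline, slack, threshold = 11.0, 1.0, 8.0

cusum, alarms = 0.0, []
for day, c in enumerate(counts):
    cusum = max(0.0, cusum + (c - baseline - slack))  # accumulate upward drift only
    if cusum > threshold:
        alarms.append(day)
print("alarm days:", alarms)   # flags the late surge in the toy series
```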
 
Article
The communication between health information systems of hospitals and primary care organizations is currently an important challenge for improving the quality of clinical practice and patient safety. However, clinical information is usually distributed among several independent systems that may be syntactically or semantically incompatible. This fact prevents healthcare professionals from accessing clinical information of patients in an understandable and normalized way. In this work, we address the semantic interoperability of two EHR standards: OpenEHR and ISO EN 13606. Both standards follow the dual-model approach, which distinguishes information from knowledge, the latter being represented through archetypes. The solution presented here is capable of transforming OpenEHR archetypes into ISO EN 13606 and vice versa by combining Semantic Web and Model-driven Engineering technologies. The resulting software implementation has been tested using publicly available collections of archetypes for both standards.
 
Article
Operating room teams consist of team members with diverse training backgrounds. In addition to differences in training, each team member has unique and complex decision making paths. As such, team members may function in the same environment largely ...
 
Article
This paper proposes an encoding system for 1D biomedical signals that allows embedding metadata and provides security and privacy. The design is based on an analysis of requirements for secure and efficient storage, transmission and access to medical tests in an e-health environment. The approach uses the 1D SPIHT algorithm to compress 1D biomedical signals with clinical quality, metadata embedding in the compressed domain to avoid extra distortion, a digital signature to implement security, and attribute-level encryption to support Role-Based Access Control. The implementation has been extensively tested using standard electrocardiogram and electroencephalogram databases (MIT-BIH Arrhythmia, MIT-BIH Compression and SCCN-EEG), demonstrating high embedding capacity (e.g. 3 KB in resting ECGs, 200 KB in stress tests, 30 MB in ambulatory ECGs), short delays (2-3.3 s in real-time transmission) and compression of the signal (by ≃3 in real-time transmission, by ≃5 in offline operation) despite the embedding of security elements and metadata to enable e-health services.
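For orientation, the sketch below mimics the shape of such a pipeline with generic stand-ins: zlib in place of the 1D SPIHT coder, a JSON header for the embedded metadata, and an HMAC in place of the digital signature and attribute-level encryption. It illustrates the encode/verify/decode flow only, not the paper's codec.

```python
# Illustrative pipeline only: zlib stands in for the paper's 1D SPIHT
# coder, JSON for the embedded metadata, and an HMAC for the digital
# signature; attribute-level encryption is omitted.
import hashlib, hmac, json, struct, zlib

SECRET_KEY = b"demo-key"  # hypothetical shared key

def encode(samples, metadata):
    payload = zlib.compress(struct.pack(f"{len(samples)}h", *samples))
    meta = json.dumps(metadata).encode()
    blob = struct.pack("I", len(meta)) + meta + payload   # embed metadata losslessly
    return blob + hmac.new(SECRET_KEY, blob, hashlib.sha256).digest()

def decode(packet):
    blob, tag = packet[:-32], packet[-32:]
    if not hmac.compare_digest(tag, hmac.new(SECRET_KEY, blob, hashlib.sha256).digest()):
        raise ValueError("integrity check failed")
    meta_len = struct.unpack("I", blob[:4])[0]
    metadata = json.loads(blob[4:4 + meta_len])
    raw = zlib.decompress(blob[4 + meta_len:])
    return list(struct.unpack(f"{len(raw)//2}h", raw)), metadata

ecg = [0, 12, 55, 130, 40, -20] * 100            # toy int16 ECG samples
packet = encode(ecg, {"lead": "II", "fs_hz": 360})
samples, meta = decode(packet)
assert samples[:6] == ecg[:6] and meta["lead"] == "II"
```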
 
Article
We describe the potential of current Web 2.0 technologies to achieve data mashup in the health care and life sciences (HCLS) domains, and compare that potential to the nascent trend of performing semantic mashup. After providing an overview of Web 2.0, we demonstrate two scenarios of data mashup, facilitated by the following Web 2.0 tools and sites: Yahoo! Pipes, Dapper, Google Maps and GeoCommons. In the first scenario, we exploited Dapper and Yahoo! Pipes to implement a challenging data integration task in the context of DNA microarray research. In the second scenario, we exploited Yahoo! Pipes, Google Maps, and GeoCommons to create a geographic information system (GIS) interface that allows visualization and integration of diverse categories of public health data, including cancer incidence and pollution prevalence data. Based on these two scenarios, we discuss the strengths and weaknesses of these Web 2.0 mashup technologies. We then describe the Semantic Web, the mainstream Web 3.0 technology that enables more powerful data integration over the Web. We discuss the areas of intersection of Web 2.0 and the Semantic Web, and describe the potential benefits that can be brought to HCLS research by combining these two sets of technologies.
 
Article
Clinical decision support is a powerful tool for improving healthcare quality and patient safety. However, developing a comprehensive package of decision support interventions is costly and difficult. If used well, Web 2.0 methods may make it easier and less costly to develop decision support. Web 2.0 is characterized by online communities, open sharing, interactivity and collaboration. Although most previous attempts at sharing clinical decision support content have worked outside of the Web 2.0 framework, several initiatives are beginning to use Web 2.0 to share and collaborate on decision support content. We present case studies of three efforts: the Clinfowiki, a world-accessible wiki for developing decision support content; Partners Healthcare eRooms, web-based tools for developing decision support within a single organization; and Epic Systems Corporation's Community Library, a repository for sharing decision support content for customers of a single clinical system vendor. We evaluate the potential of Web 2.0 technologies to enable collaborative development and sharing of clinical decision support systems through the lens of three case studies; analyzing technical, legal and organizational issues for developers, consumers and organizers of clinical decision support content in Web 2.0. We believe the case for Web 2.0 as a tool for collaborating on clinical decision support content appears strong, particularly for collaborative content development within an organization.
 
Article
Autism spectrum disorders (ASD) represent a group of developmental disabilities with a strong genetic basis. The laboratory mouse is increasingly used as a model organism for ASD, and MGI, the Mouse Genome Informatics resource, is the primary model organism ...
 
Article
The American College of Medical Informatics (ACMI) sponsors periodic debates during the American Medical Informatics Association Fall Symposium to highlight important informatics issues of broad interest. In 2012, a panel debated the following topic: "Resolved: Health Information Exchange Organizations Should Shift Their Principal Focus to Consumer-Mediated Exchange in Order to Facilitate the Rapid Development of Effective, Scalable, and Sustainable Health Information Infrastructure." Those supporting the proposition emphasized the need for consumer-controlled community repositories of electronic health records (health record banks) to address privacy, stakeholder cooperation, scalability, and sustainability. Those opposing the proposition emphasized that the current healthcare environment is so complex that development of consumer control will take time and that even then, consumers may not be able to mediate their information effectively. While privately, each discussant recognizes that there are many sides to this complex issue, each followed the debater's tradition of taking an extreme position in order to emphasize some of the polarizing aspects in the short time allotted them. In preparing this summary, we sought to convey the substance and spirit of the debate in printed form. Transcripts of the actual debate were edited for clarity, and appropriate supporting citations were added for the further edification of the reader.
 
Article
The Stanford Biomedical Informatics training program began with a focus on clinical informatics and has now evolved into a general program of biomedical informatics training, including clinical informatics, bioinformatics and imaging informatics. The program offers PhD, MS, distance MS and certificate programs, and is now affiliated with an undergraduate major in biomedical computation. Current dynamics include (1) increased activity in informatics within other training programs in biology and the information sciences, (2) increased desire among informatics students to gain laboratory experience, (3) increased demand for computational collaboration among biomedical researchers, and (4) interaction with the newly formed Department of Bioengineering at Stanford University. The core focus on research training, the development and application of novel informatics methods for biomedical research, keeps the program centered in the midst of this period of growth and diversification.
 
Article
Multi-dimensional Bayesian network classifiers (MBCs) are probabilistic graphical models recently proposed to deal with multi-dimensional classification problems, where each instance in the data set has to be assigned to more than one class variable. In this paper, we propose a Markov blanket-based approach for learning MBCs from data. Basically, it consists of determining the Markov blanket around each class variable using the HITON algorithm, then specifying the directionality over the MBC subgraphs. Our approach is applied to the problem of predicting the European Quality of Life-5 Dimensions (EQ-5D) from the 39-item Parkinson's Disease Questionnaire (PDQ-39) in order to estimate the health-related quality of life of Parkinson's patients. Fivefold cross-validation experiments were carried out on randomly generated synthetic data sets, on the Yeast data set, and on a real-world Parkinson's disease data set containing 488 patients. The experimental study, including comparison with additional Bayesian network-based approaches, back propagation for multi-label learning, multi-label k-nearest neighbor, multinomial logistic regression, ordinary least squares, and censored least absolute deviations, shows encouraging results in terms of predictive accuracy as well as the identification of dependence relationships among class and feature variables.
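A faithful HITON implementation is beyond a short example, so the sketch below substitutes a cruder stand-in: per class variable, features are screened by mutual information as a rough Markov blanket estimate, and a separate classifier is fitted. Data, blanket size and classifier choice are all illustrative.

```python
# Simplified stand-in for the paper's approach: for each class variable,
# approximate its Markov blanket by mutual-information screening (the
# paper uses the HITON algorithm) and fit one classifier per class.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                       # 20 feature variables
Y = np.column_stack([                                # 2 class variables
    (X[:, 0] + X[:, 1] > 0).astype(int),
    (X[:, 2] - X[:, 3] > 0).astype(int),
])

models = []
for d in range(Y.shape[1]):
    mi = mutual_info_classif(X, Y[:, d], random_state=0)
    blanket = np.argsort(mi)[-5:]                    # crude blanket estimate
    models.append((blanket, GaussianNB().fit(X[:, blanket], Y[:, d])))

x_new = rng.normal(size=(1, 20))
prediction = [int(m.predict(x_new[:, b])[0]) for b, m in models]
print(prediction)                                    # one label per class variable
```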
 
Article
In this paper, we describe a first step towards a collaborative extension of the well-known 3D-Slicer; this platform is currently used as a standalone tool for both surgical planning and medical intervention. We show how this tool can be easily modified to make it collaborative, so that it may constitute an integrated environment for expertise exchange as well as a useful tool for academic purposes.
 
Article
In this paper, we describe and evaluate a new distributed architecture for clinical decision support called SANDS (Service-oriented Architecture for NHIN Decision Support), which leverages current health information exchange efforts and is based on the principles of a service-oriented architecture. The architecture allows disparate clinical information systems and clinical decision support systems to be seamlessly integrated over a network according to a set of interfaces and protocols described in this paper. The architecture described is fully defined and developed, and six use cases have been developed and tested using a prototype electronic health record which links to one of the existing prototype National Health Information Networks (NHIN): drug interaction checking, syndromic surveillance, diagnostic decision support, inappropriate prescribing in older adults, information at the point of care and a simple personal health record. Some of these use cases utilize existing decision support systems, which are either commercially or freely available at present, and developed outside of the SANDS project, while other use cases are based on decision support systems developed specifically for the project. Open source code for many of these components is available, and an open source reference parser is also available for comparison and testing of other clinical information systems and clinical decision support systems that wish to implement the SANDS architecture. The SANDS architecture for decision support has several significant advantages over other architectures for clinical decision support. The most salient of these are:
 
Article
The goal of this research is to provide a framework to enable the authoring and verification of clinical guidelines. The framework is part of a larger research project aimed at improving the representation, quality and application of clinical guidelines in daily clinical practice. The verification process of a guideline is based on (1) model checking techniques to verify guidelines against semantic errors and inconsistencies in their definition, (2) combined with Model Driven Development (MDD) techniques, which enable us to automatically process manually created guideline specifications and the temporal-logic statements to be checked against these specifications, making the verification process faster and more cost-effective. In particular, we use UML statecharts to represent the dynamics of guidelines and, based on these manually defined guideline specifications, we use an MDD-based tool chain to process them automatically and generate the input model of a model checker. The model checker takes the resulting model together with the specific guideline requirements and verifies whether the guideline fulfils such properties. The overall framework has been implemented as an Eclipse plug-in named GBDSSGenerator which, starting from the UML statechart representing a guideline, allows the verification of the guideline against specific requirements. Additionally, we have established a pattern-based approach for defining commonly occurring types of requirements in guidelines. We have successfully validated our overall approach by verifying properties in different clinical guidelines, resulting in the detection of some inconsistencies in their definition. The proposed framework allows (1) the authoring and (2) the verification of clinical guidelines against specific requirements defined based on a set of property specification patterns, enabling non-experts to easily write formal specifications and thus easing the verification process.
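As a toy counterpart of the verification step, the sketch below encodes a guideline as a plain state machine and checks one safety-style property (no reachable dead end) by graph search. Real model checking over temporal-logic properties, as in the framework, is done by dedicated tools; the states and transitions here are invented.

```python
# Tiny stand-in for guideline verification: a guideline as a state
# machine, plus a reachability check that no reachable state is a dead
# end. States and transitions are toy values.
from collections import deque

transitions = {
    "assess": {"mild": "oral_therapy", "severe": "iv_therapy"},
    "oral_therapy": {"improved": "discharge", "worse": "iv_therapy"},
    "iv_therapy": {"improved": "discharge"},
    "discharge": {},
}

def reachable(start):
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in transitions[queue.popleft()].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Property: every reachable state can still reach "discharge".
for state in reachable("assess"):
    assert "discharge" in reachable(state), f"dead end at {state}"
print("guideline model satisfies the no-dead-end property")
```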
 
Article
The ambiguity of biomedical abbreviations is one of the challenges in biomedical text mining systems. In particular, the handling of term variants and abbreviations without nearby definitions is a critical issue. In this study, we adopt the concepts of document topic and word link to disambiguate biomedical abbreviations. We propose the link topic model, inspired by the latent Dirichlet allocation model, in which each document is perceived as a random mixture of topics, where each topic is characterized by a distribution over words. Thus, the most probable expansions with respect to abbreviations of a given abstract are determined by word-topic, document-topic, and word-link distributions estimated from a document collection through the link topic model. The model allows two distinct modes of word generation to incorporate semantic dependencies among words, particularly long form words of abbreviations and their sentential co-occurring words; a word can be generated either dependently on the long form of the abbreviation or independently. The semantic dependency between two words is defined as a link, and a new random parameter for the link is assigned to each word as well as a topic parameter. Because the link status indicates whether the word constitutes a link with a given specific long form, it has the effect of determining whether a word forms a unigram or a skipping/consecutive bigram with respect to the long form. Furthermore, we place a constraint on the model so that a word has the same topic as a specific long form if it is generated in reference to the long form. Consequently, documents are generated from the two hidden parameters, i.e., topic and link, and the most probable expansion of a specific abbreviation is estimated from the parameters. Our model relaxes the bag-of-words assumption of the standard topic model in which the word order is neglected, and it captures a richer structure of text than does the standard topic model by considering unigrams and semantically associated bigrams simultaneously. The addition of semantic links improves the disambiguation accuracy without removing irrelevant contextual words and reduces the parameter space of massive skipping or consecutive bigrams. The link topic model achieves 98.42% disambiguation accuracy on 73,505 MEDLINE abstracts with respect to 21 three-letter abbreviations and their 139 distinct long forms.
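The link topic model itself is bespoke, so the sketch below shows only the standard-LDA baseline it extends: infer the topic mixture of a new abstract and pick the candidate long form whose training documents have the closest topic profile. The corpus is a toy, and outputs on such tiny data are not deterministic guarantees.

```python
# Baseline sketch only: standard LDA (not the paper's link topic model).
# Each candidate long form is scored by how close its training-document
# topic profile is to the topic profile of the new abstract.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

train_docs = {  # toy corpus: long form -> example abstracts
    "computed tomography": ["ct scan imaging of the chest", "ct imaging protocol"],
    "copper toxicity": ["copper toxicity in liver disease", "copper levels toxicity"],
}
texts = [t for docs in train_docs.values() for t in docs]
vec = CountVectorizer()
X = vec.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

def expand(abbrev_context):
    theta = lda.transform(vec.transform([abbrev_context]))[0]
    best, score = None, -1.0
    for long_form, docs in train_docs.items():
        centroid = lda.transform(vec.transform(docs)).mean(axis=0)
        sim = float(theta @ centroid)   # inner-product similarity of topic mixtures
        if sim > score:
            best, score = long_form, sim
    return best

# Likely "computed tomography" on this toy corpus:
print(expand("patient underwent ct of the chest"))
```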
 
Article
Abbreviations are widely used in clinical documents and they are often ambiguous. Building a list of possible senses (also called a sense inventory) for each ambiguous abbreviation is the first step to automatically identify correct meanings of abbreviations in given contexts. Clustering based methods have been used to detect senses of abbreviations from a clinical corpus [1]. However, rare senses remain challenging and existing algorithms are not good enough to detect them. In this study, we developed a new two-phase clustering algorithm called Tight Clustering for Rare Senses (TCRS) and applied it to sense generation of abbreviations in clinical text. Using manually annotated sense inventories from a set of 13 ambiguous clinical abbreviations, we evaluated and compared TCRS with the existing Expectation Maximization (EM) clustering algorithm for sense generation, at two different levels of annotation cost (10 vs. 20 instances for each abbreviation). Our results showed that the TCRS-based method could detect 85% of senses on average, while the EM-based method found only 75% of senses when similar annotation effort (about 20 instances) was used. Further analysis demonstrated that the improvement by the TCRS method was mainly from additionally detected rare senses, thus indicating its usefulness for building more complete sense inventories of clinical abbreviations.
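The abstract does not spell out TCRS internals, so the following is a generic two-phase sketch in its spirit: cluster all context vectors once, then re-cluster the instances that fit their cluster poorly, so that small, tight groups (candidate rare senses) can emerge. Data and thresholds are invented.

```python
# Generic two-phase sketch in the spirit of TCRS (the actual algorithm
# is defined in the paper): phase 1 clusters all context vectors; phase
# 2 re-clusters poorly fitting instances so candidate rare senses emerge.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
common = rng.normal(0.0, 0.4, size=(95, 5))   # dominant sense contexts
rare = rng.normal(4.0, 0.2, size=(5, 5))      # rare sense contexts
X = np.vstack([common, rare])

phase1 = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - phase1.cluster_centers_[phase1.labels_], axis=1)
outliers = X[dist > np.percentile(dist, 90)]  # poorly fitting instances

phase2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(outliers)
for c in range(2):
    members = outliers[phase2.labels_ == c]
    print(f"candidate sense cluster {c}: {len(members)} instances")
```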
 
Article
Biomedical abbreviations and acronyms are widely used in biomedical literature. Since many of them represent important content in biomedical literature, information retrieval and extraction benefit from identifying the meanings of those terms. On the other hand, many abbreviations and acronyms are ambiguous, so it is important to map them to their full forms, which ultimately represent the meanings of the abbreviations. In this study, we present a semi-supervised method that applies MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. We first automatically generated from the MEDLINE abstracts a dictionary of abbreviation-full form pairs using a rule-based system that maps abbreviations to full forms when the full forms are defined in the abstracts. We then trained on the MEDLINE abstracts and predicted the full forms of abbreviations in full-text journal articles by applying supervised machine-learning algorithms in a semi-supervised fashion. We report up to 92% prediction precision and up to 91% coverage.
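A minimal sketch of the two stages described above, with a much simpler definition pattern than the paper's rule-based mapper and a toy corpus standing in for MEDLINE:

```python
# Two-stage sketch: (1) harvest "long form (ABBR)" definition pairs from
# abstracts with a simple regex, (2) train a context classifier per
# abbreviation on the harvested instances. Corpus and pattern are toys.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

DEF_PATTERN = re.compile(r"((?:\w+[- ]){1,4}\w+)\s*\(([A-Z]{2,5})\)")

abstracts = [
    "Magnetic resonance (MR) imaging of the knee showed edema.",
    "Mitral regurgitation (MR) was graded as severe on echo.",
    "Magnetic resonance (MR) sequences were T1 weighted.",
    "Severe mitral regurgitation (MR) required valve repair.",
]

contexts, labels = [], []
for text in abstracts:
    for long_form, abbrev in DEF_PATTERN.findall(text):
        if abbrev == "MR":
            contexts.append(text)
            labels.append(long_form.lower())

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(contexts, labels)
print(clf.predict(["The MR jet area suggested regurgitant valve disease."]))
```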
 
Article
Single nucleotide polymorphisms (SNPs) serve as frequent genetic markers along the chromosome. They can, however, have important consequences for individual susceptibility to disease and reactions to medical treatment. Knowing the functions of these SNPs can also help in understanding the genetics of human phenotype variation. Currently, a vast literature exists reporting possible associations between SNPs and diseases, yet identifying the functional SNPs in a disease-related gene remains a major challenge. In this work, we have analyzed, through computational methods, the genetic variation in the ABL1 gene that can alter expression and function in chronic myeloid leukemia (CML). Out of the total 827 SNPs, 18 were found to be non-synonymous (nsSNPs). Among the 30 SNPs in the untranslated regions, 3 were found in the 5' UTR and 27 in the 3' UTR. 16.7% of the nsSNPs were predicted to be damaging by both the SIFT and PolyPhen servers. The UTR resource tool suggested that 6 of the 27 SNPs in the 3' UTR were functionally significant. The two major mutations that occurred in the native protein (1OPL) coded by the ABL1 gene were at positions 159 (L-->P) and 178 (G-->S). Val (6), Ala (7) and Trp (344) were found to be stabilizing residues in the native protein. Even though all three residues were found in the mutant protein 178 (G-->S), only two of them, Val (6) and Ala (7), acted as stabilizing residues in the other mutant, 159 (L-->P). From the overall results obtained in this work, we propose that both mutations, 159 (L-->P) and 178 (G-->S), should be considered important in chronic myeloid leukemia caused by the ABL1 gene. The results of this computational study should find good application among cancer biologists working on experimental protocols.
 
Article
In clinical cancer research, high throughput genomic technologies are increasingly used to identify copy number aberrations. However, the admixture of tumor and stromal cells and the inherent karyotypic heterogeneity of most of the solid tumor samples make this task highly challenging. Here, we propose a robust two-step strategy to detect copy number aberrations in such a context. A spatial mixture model is first used to fit the preprocessed data. Then, a calling algorithm is applied to classify the genomic segments in three biologically meaningful states (copy loss, copy gain and modal copy). The results of a simulation study show the good properties of the proposed procedure with complex patterns of genomic aberrations. The interest of the proposed procedure in clinical cancer research is then illustrated by the analysis of real lung adenocarcinoma samples.
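The spatial component of the model is the paper's contribution and is omitted here; the sketch below shows only the non-spatial analogue of the two-step strategy: fit a three-component Gaussian mixture to segment log ratios and call each segment loss, modal or gain by the ordering of the component means (simulated data).

```python
# Non-spatial analogue of the two-step strategy above: fit a 3-component
# Gaussian mixture to segment log2 ratios, then call each segment by the
# ordering of the component means (loss < modal copy < gain).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
log_ratios = np.concatenate([
    rng.normal(-0.6, 0.1, 40),   # copy loss segments
    rng.normal(0.0, 0.1, 200),   # modal copy
    rng.normal(0.5, 0.1, 30),    # copy gain
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(log_ratios)
order = np.argsort(gmm.means_.ravel())               # loss, modal, gain
state_name = {order[0]: "loss", order[1]: "modal", order[2]: "gain"}
calls = [state_name[k] for k in gmm.predict(log_ratios)]
print(calls[:3], calls[100:103], calls[-3:])
```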
 
Article
The National Cancer Institute has developed the NCI Thesaurus, a biomedical vocabulary for cancer research covering terminology across a wide range of cancer research domains. A major design goal of the NCI Thesaurus is to facilitate translational research. We describe: the features of Ontylog, a description logic used to build the NCI Thesaurus; our methodology for enhancing the terminology through collaboration between ontologists and domain experts, and for addressing certain real-world challenges arising in modeling the Thesaurus; and finally, the conversion of the NCI Thesaurus from Ontylog into Web Ontology Language Lite. Ontylog has proven well suited for constructing large biomedical vocabularies. We have capitalized on the Ontylog constructs Kind and Role in the collaboration process described in this paper to facilitate communication between ontologists and domain experts. The artifacts and processes developed by NCI for collaboration may be useful in other biomedical terminology development efforts.
 
Article
Inter-case similarity metrics can potentially help find similar cases from a case base for evidence-based practice. While several methods to measure similarity between cases have been proposed, developing an effective means for measuring patient case similarity remains a challenging problem. We were interested in examining how abstraction could assist in computing case similarity. In this study, abstracted patient-specific features from medical records were used to improve an existing information-theoretic measurement. The developed metric, using a combination of abstracted disease, finding, procedure and medication features, achieved correlations between 0.6012 and 0.6940 with expert judgments.
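One plausible reading of an information-theoretic, abstraction-based similarity is sketched below on toy cases: shared abstracted features are weighted by their information content, so rare features count more. This is an illustration, not the paper's exact metric.

```python
# Hedged sketch of an information-theoretic case similarity: features
# shared by two cases are weighted by information content (rarer
# features count more). Cases and features are invented.
import math
from collections import Counter

case_base = [
    {"diabetes", "metformin", "hba1c"},
    {"diabetes", "insulin", "retinopathy"},
    {"pneumonia", "amoxicillin", "cough"},
]
freq = Counter(f for case in case_base for f in case)
N = len(case_base)

def ic(feature):
    return -math.log(freq[feature] / N)  # information content

def similarity(a, b):
    shared = sum(ic(f) for f in a & b)
    total = sum(ic(f) for f in a | b)
    return shared / total if total else 0.0

print(similarity(case_base[0], case_base[1]))  # share 'diabetes' only
print(similarity(case_base[0], case_base[2]))  # nothing shared -> 0.0
```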
 
Article
When authors of empirical science articles write abstracts, they employ a wide variety of distinct linguistic operations which interact to condense and rephrase a subset of sentences from the source text. An on-going comparison of biological and biomedical journal articles with their author-written abstracts is providing a basis for a more linguistically detailed model of abstract derivation using syntactic representations of selected source sentences. The description makes use of rich dictionary information to formulate paraphrasing rules of differing degrees of generality, including some which are sublanguage-specific, and others which appear valid in several languages when formulated using "lexical functions" to express important semantic relationships between lexical items. Some paraphrase operations may use both lexical functions and rhetorical relations between sentences to reformulate larger chunks of text in a concise abstract sentence. The descriptive framework is computable and utilizes existing linguistic resources.
 
Article
The Unified Medical Language System (UMLS) joins together a group of established medical terminologies in a unified knowledge representation framework. Two major resources of the UMLS are its Metathesaurus, containing a large number of concepts, and the Semantic Network (SN), containing semantic types and forming an abstraction of the Metathesaurus. However, the SN itself is large and complex and may still be difficult to view and comprehend. Our structural partitioning technique partitions the SN into structurally uniform sets of semantic types based on the distribution of the relationships within the SN. An enhancement of the structural partition results in cohesive, singly rooted sets of semantic types. Each such set is named after its root, which represents the common nature of the group. These sets of semantic types are represented by higher-level components called meta-semantic types. A network, called a metaschema, which consists of the meta-semantic types connected by hierarchical and semantic relationships, is obtained and provides an abstract view supporting orientation to the SN. The metaschema is utilized to audit the UMLS classifications. We present a set of graphical views of the SN based on the metaschema to help in user orientation to the SN. A study compares the cohesive metaschema to metaschemas derived semantically by UMLS experts.
 
Article
An algorithmically-derived abstraction network, called the partial-area taxonomy, for a SNOMED hierarchy has led to the identification of concepts considered complex. The designation "complex" is arrived at automatically on the basis of structural analyses of overlap among the constituent concept groups of the partial-area taxonomy. Such complex concepts, called overlapping concepts, constitute a tangled portion of a hierarchy and can be obstacles to users trying to gain an understanding of the hierarchy's content. A new methodology for partitioning the entire collection of overlapping concepts into singly-rooted groups that are more manageable to work with and comprehend is presented. Different kinds of overlapping concepts with varying degrees of complexity are identified. This leads to an abstract model of the overlapping concepts called the disjoint partial-area taxonomy, which serves as a vehicle for enhanced, high-level display. The methodology is demonstrated with an application to SNOMED's Specimen hierarchy. Overall, the resulting disjoint partial-area taxonomy offers a refined view of the hierarchy's structural organization and conceptual content that can aid users, such as maintenance personnel, working with SNOMED. The utility of the disjoint partial-area taxonomy as the basis for a SNOMED auditing regimen is presented in a companion paper.
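The core partitioning idea can be shown compactly: concepts that appear in more than one partial-area are overlapping, and grouping them by the exact set of partial-areas they belong to yields disjoint groups. The toy hierarchy below is invented; the full taxonomy construction is in the paper.

```python
# Sketch of the partitioning idea: concepts in more than one partial-area
# are "overlapping"; grouping them by the exact set of partial-areas they
# belong to yields disjoint groups, mirroring the spirit (not the full
# algorithm) of the disjoint partial-area taxonomy.
from collections import defaultdict

partial_areas = {                      # toy Specimen-like hierarchy
    "Fluid specimen": {"CSF", "Serum", "BloodSpot"},
    "Blood specimen": {"Serum", "BloodSpot", "WholeBlood"},
    "Dried specimen": {"BloodSpot", "TissueBlock"},
}

membership = defaultdict(set)
for area, concepts in partial_areas.items():
    for c in concepts:
        membership[c].add(area)

groups = defaultdict(set)
for concept, areas in membership.items():
    if len(areas) > 1:                 # overlapping concept
        groups[frozenset(areas)].add(concept)

for areas, concepts in groups.items():
    print(sorted(areas), "->", sorted(concepts))
```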
 
Article
Auditors of a large terminology, such as SNOMED CT, face a daunting challenge. To aid them in their efforts, it is essential to devise techniques that can automatically identify concepts warranting special attention. "Complex" concepts, which by their very nature are more difficult to model, fall neatly into this category. A special kind of grouping, called a partial-area, is utilized in the characterization of complex concepts. In particular, the complex concepts that are the focus of this work are those appearing in intersections of multiple partial-areas and are thus referred to as overlapping concepts. In a companion paper, an automatic methodology for identifying and partitioning the entire collection of overlapping concepts into disjoint, singly-rooted groups that are more manageable to work with and comprehend has been presented. The partitioning methodology formed the foundation for the development of an abstraction network for the overlapping concepts called a disjoint partial-area taxonomy. This new disjoint partial-area taxonomy offers a collection of semantically uniform partial-areas and is exploited herein as the basis for a novel auditing methodology. The review of the overlapping concepts is done in a top-down order within semantically uniform groups. These groups are themselves reviewed in a top-down order, which proceeds from the less complex to the more complex overlapping concepts. The results of applying the methodology to SNOMED's Specimen hierarchy are presented. Hypotheses regarding error ratios for overlapping concepts and between different kinds of overlapping concepts are formulated. Two phases of auditing the Specimen hierarchy for two releases of SNOMED are reported on. With the use of the double bootstrap and Fisher's exact test (two-tailed), the auditing of concepts and especially roots of overlapping partial-areas is shown to yield a statistically significant higher proportion of errors.
 
Article
The aims of this work were: to define an abstract notation for interactive decision trees; to formally analyse exploration errors in such trees through automated translation to Lotos (language of temporal ordering specification); to generate tree implementations through automated translation for an existing tree viewer, and to demonstrate the approach on healthcare examples created by the CGT (clinical guidance tree) project. An abstract and machine-readable notation was developed for describing clinical guidance trees: Ad/it (abstract decision/interactive trees). A methodology has been designed for creating trees using Ad/it. In particular, tree structure is separated from tree content. Tree structure and flow are designed and evaluated before committing to detailed content of the tree. Software tools have been created to translate Ad/it tree descriptions into Lotos and into CGT Viewer format. These representations support formal analysis and interactive exploration of decision trees. Through automated conversion of existing CGT trees, realistic healthcare applications have been used to validate the approach. All key objectives of the work have been achieved. An abstract notation has been created for decision trees, and is supported by automated translation and analysis. Although healthcare applications have been the main focus to date, the approach is generic and of value in almost any domain where decision trees are useful.
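For flavor, here is a machine-readable decision tree that keeps structure apart from content, in the spirit of Ad/it; the concrete Ad/it syntax, the Lotos translation and the CGT Viewer format are defined in the paper, and everything named here is illustrative.

```python
# Illustrative machine-readable decision tree separating structure from
# content, in the spirit of the Ad/it notation (the concrete syntax and
# the Lotos translation are defined in the paper).
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str                                  # structure: identity and branching
    children: dict = field(default_factory=dict)  # answer label -> Node

content = {                                       # content kept apart from structure
    "q1": "Is the patient over 50?",
    "q2": "Any family history of the condition?",
    "screen": "Recommend screening.",
    "routine": "Routine follow-up.",
}

tree = Node("q1", {
    "yes": Node("screen"),
    "no": Node("q2", {"yes": Node("screen"), "no": Node("routine")}),
})

def explore(node, answers):
    """Follow a fixed list of answers; a missing branch is an exploration error."""
    while node.children:
        print(content[node.node_id])
        answer = answers.pop(0)
        if answer not in node.children:
            raise ValueError(f"no branch '{answer}' at {node.node_id}")
        node = node.children[answer]
    print(content[node.node_id])

explore(tree, ["no", "yes"])   # q1 -> q2 -> screening recommendation
```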
 
Article
Automatic summarization has been proposed to help manage the results of biomedical information retrieval systems. Semantic MEDLINE, for example, summarizes semantic predications representing assertions in MEDLINE citations. Results are presented as a graph which maintains links to the original citations. Graphs summarizing more than 500 citations are hard to read and navigate, however. We exploit graph theory for focusing these large graphs. The method is based on degree centrality, which measures connectedness in a graph. Four categories of clinical concepts related to treatment of disease were identified and presented as a summary of input text. A baseline was created using term frequency of occurrence. The system was evaluated on summaries for treatment of five diseases compared to a reference standard produced manually by two physicians. The results showed that recall for system results was 72%, precision was 73%, and F-score was 0.72. The system F-score was considerably higher than that for the baseline (0.47).
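The degree-centrality step is easy to make concrete: build a graph from semantic predications and keep the best-connected concepts. The predications and cut-off below are toy values.

```python
# Minimal version of the graph-focusing step: build a graph of semantic
# predications and keep the most-connected concepts by degree centrality.
import networkx as nx

predications = [  # (subject, predicate, object) triples, toy data
    ("metformin", "TREATS", "type 2 diabetes"),
    ("insulin", "TREATS", "type 2 diabetes"),
    ("metformin", "CAUSES", "lactic acidosis"),
    ("type 2 diabetes", "COEXISTS_WITH", "obesity"),
    ("exercise", "PREVENTS", "obesity"),
]

g = nx.Graph()
for subj, pred, obj in predications:
    g.add_edge(subj, obj, predicate=pred)

centrality = nx.degree_centrality(g)
summary = sorted(centrality, key=centrality.get, reverse=True)[:3]
print(summary)   # the best-connected concepts anchor the summary graph
```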
 
Article
Massive increases in electronically available text have spurred a variety of natural language processing methods to automatically identify relationships from text; however, existing annotated collections comprise only bioinformatics (gene-protein) or clinical informatics (treatment-disease) relationships. This paper introduces the Claim Framework, which reflects how authors across the biomedical spectrum communicate findings in empirical studies. The Framework captures different levels of evidence by differentiating between explicit and implicit claims, and by capturing under-specified claims such as correlations, comparisons, and observations. The results from 29 full-text articles show that authors report fewer than 7.84% of scientific claims in an abstract, revealing the urgent need for text mining systems to consider the full text of an article rather than just the abstract. The results also show that authors typically report explicit claims (77.12%) rather than observations (9.23%), correlations (5.39%), comparisons (5.11%) or implicit claims (2.7%). Informed by the initial manual annotations, we introduce an automated approach that uses syntax and semantics to identify explicit claims automatically, and we measure the degree to which each feature contributes to the overall precision and recall. Results show that a combination of semantics and syntax is required to achieve the best system performance.
 
Article
Gene/protein interactions provide critical information for a thorough understanding of cellular processes. Recently, considerable interest and effort have been focused on the construction and analysis of genome-wide gene networks. The large body of biomedical literature is an important source of gene/protein interaction information. Recent advances in text mining tools have made it possible to automatically extract such documented interactions from free-text literature. In this paper, we propose a comprehensive framework for constructing and analyzing large-scale gene functional networks based on the gene/protein interactions extracted from biomedical literature repositories using text mining tools. Our proposed framework consists of analyses of the network topology, network topology-gene function relationship, and temporal network evolution to distill valuable information embedded in the gene functional interactions in the literature. We demonstrate the application of the proposed framework using a testbed of P53-related PubMed abstracts, which shows that the literature-based P53 networks exhibit small-world and scale-free properties. We also found that high degree genes in the literature-based networks have a high probability of appearing in the manually curated database, and that genes in the same pathway tend to form local clusters in our literature-based networks. Temporal analysis showed that genes interacting with many other genes tend to be involved in a large number of newly discovered interactions.
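The topology checks mentioned above, small-world and scale-free indicators, can be computed as in the sketch below; a preferential-attachment random graph stands in for a literature-derived network.

```python
# Sketch of the topology analyses on a toy network: clustering
# coefficient and average shortest path (small-world indicators) plus
# the degree distribution (scale-free indicator).
import networkx as nx
from collections import Counter

# Stand-in for a literature-derived gene network:
g = nx.barabasi_albert_graph(n=200, m=2, seed=0)

print("avg clustering:", nx.average_clustering(g))
print("avg shortest path:", nx.average_shortest_path_length(g))

degree_counts = Counter(d for _, d in g.degree())
print("degree distribution (degree -> count):",
      sorted(degree_counts.items())[:8])   # heavy tail suggests scale-free
```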
 
Article
Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, the identification of terms corresponding to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the markup process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database; therefore, our methods may also be used for automatic lexicon generation. GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts. A random sample of gene/protein symbols and full names and a sample set of marked-up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/. Contact: hy52@columbia.edu.
 
Article
Concurrent with progress in the biomedical sciences, an overwhelming amount of textual knowledge is accumulating in the biomedical literature. PubMed is the most comprehensive database collecting and managing biomedical literature. To help researchers easily understand collections of PubMed abstracts, numerous clustering methods have been proposed to group similar abstracts based on their shared features. However, most of these methods do not explore the semantic relationships among groupings of documents, which could help better illuminate the groupings of PubMed abstracts. To address this issue, we propose an ontological clustering method called GOClonto for conceptualizing PubMed abstracts. GOClonto uses latent semantic analysis (LSA) and the gene ontology (GO) to identify key gene-related concepts and their relationships, and to allocate PubMed abstracts based on these key gene-related concepts. On two PubMed abstract collections, the experimental results show that GOClonto is able to identify key gene-related concepts and outperforms the STC (suffix tree clustering) algorithm, the Lingo algorithm, the Fuzzy Ants algorithm, and the tolerance rough set (TRS) based clustering algorithm. Moreover, the two ontologies generated by GOClonto show significant informative conceptual structures.
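The LSA step of such conceptual clustering can be sketched as TF-IDF followed by truncated SVD and k-means over the latent space; GOClonto's gene ontology mapping on top of this is omitted, and the abstracts are toy inputs.

```python
# Hedged sketch of the LSA step in GOClonto-style clustering: TF-IDF,
# truncated SVD (latent semantic analysis), then k-means in latent
# space. The gene ontology mapping GOClonto adds on top is omitted.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

abstracts = [
    "BRCA1 mutation and breast cancer risk",
    "BRCA2 carriers show elevated breast cancer incidence",
    "TP53 regulates apoptosis in tumor cells",
    "apoptosis signalling through TP53 pathways",
]

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
Z = lsa.fit_transform(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
for text, label in zip(abstracts, labels):
    print(label, text)
```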
 
Article
Objectives: The role of social media in biomedical knowledge mining, including clinical, medical and healthcare informatics, prescription drug abuse epidemiology and drug pharmacology, has become increasingly significant in recent years. Social media offers opportunities for people to share opinions and experiences freely in online communities, which may contribute information beyond the knowledge of domain professionals. This paper describes the development of a novel semantic web platform called PREDOSE (PREscription Drug abuse Online Surveillance and Epidemiology), which is designed to facilitate the epidemiologic study of prescription (and related) drug abuse practices using social media. PREDOSE uses web forum posts and domain knowledge, modeled in a manually created Drug Abuse Ontology (DAO, pronounced "dow"), to facilitate the extraction of semantic information from User Generated Content (UGC) through a combination of lexical, pattern-based and semantics-based techniques. In a previous study, PREDOSE was used to obtain the datasets from which new knowledge in drug abuse research was derived. Here, we report on various platform enhancements, including an updated DAO, new components for relationship and triple extraction, and tools for content analysis, trend detection and exploration of emerging patterns, which enhance the capabilities of the PREDOSE platform. Given these enhancements, PREDOSE is now better equipped to impact drug abuse research by alleviating traditional labor-intensive content analysis tasks.
Methods: Using custom web crawlers that scrape UGC from publicly available web forums, PREDOSE first automates the collection of web-based social media content for subsequent semantic annotation. The annotation scheme is modeled in the DAO and includes domain-specific knowledge such as prescription (and related) drugs, methods of preparation, side effects, and routes of administration. The DAO is also used to help recognize three types of data, namely: (1) entities, (2) relationships and (3) triples. PREDOSE then uses a combination of lexical and semantic-based techniques to extract entities and relationships from the scraped content, and a top-down approach for triple extraction that uses patterns expressed in the DAO. In addition, PREDOSE uses publicly available lexicons to identify initial sentiment expressions in text, and then a probabilistic optimization algorithm (from related research) to extract the final sentiment expressions. Together, these techniques enable the capture of fine-grained semantic information, which facilitates search, trend analysis and overall content analysis of social media on prescription drug abuse. Moreover, extracted data are also made available to domain experts for the creation of training and test sets for use in the evaluation and refinement of information extraction techniques.
Results: A recent evaluation of the information extraction techniques applied in the PREDOSE platform indicates 85% precision and 72% recall in entity identification on a manually created gold standard dataset. In another study, PREDOSE achieved 36% precision in relationship identification and 33% precision in triple extraction, through manual evaluation by domain experts. Given the complexity of the relationship and triple extraction tasks and the abstruse nature of social media texts, we interpret these as favorable initial results. Extracted semantic information is currently in use in an online discovery support system by prescription drug abuse researchers at the Center for Interventions, Treatment and Addictions Research (CITAR) at Wright State University.
Conclusion: A comprehensive platform for entity, relationship, triple and sentiment extraction from such abstruse texts has never before been developed for drug abuse research. PREDOSE has already demonstrated the importance of mining social media by providing data from which new findings in drug abuse research were uncovered. Given the recent platform enhancements, including the refined DAO, components for relationship and triple extraction, and tools for content, trend and emerging pattern analysis, it is expected that PREDOSE will play a significant role in advancing drug abuse epidemiology in the future.
 
Article
In the healthcare domain, human collaboration processes (HCPs), which consist of interactions between healthcare workers from different (para)medical disciplines and departments, are of growing importance as healthcare delivery becomes increasingly integrated. Existing workflow-based process modelling tools for healthcare process management, which are the most commonly applied, are not suited for healthcare HCPs mainly due to their focus on the definition of task sequences instead of the graphical description of human interactions. This paper uses a case study of a healthcare HCP at a Dutch academic hospital to evaluate a novel interaction-centric process modelling method. The HCP under study is the care pathway performed by the head and neck oncology team. The evaluation results show that the method brings innovative, effective, and useful features. First, it collects and formalizes the tacit domain knowledge of the interviewed healthcare workers in individual interaction diagrams. Second, the method automatically integrates these local diagrams into a single global interaction diagram that reflects the consolidated domain knowledge. Third, the case study illustrates how the method utilizes a graphical modelling language for effective tree-based description of interactions, their composition and routing relations, and their roles. A process analysis of the global interaction diagram is shown to identify HCP improvement opportunities. The proposed interaction-centric method has wider applicability since interactions are the core of most multidisciplinary patient-care processes. A discussion argues that, although (multidisciplinary) collaboration is in many cases not optimal in the healthcare domain, it is increasingly considered a necessity to improve integration, continuity, and quality of care. The proposed method is helpful to describe, analyze, and improve the functioning of healthcare collaboration.
 
Article
Distance-based clustering algorithms can group genes that show similar expression values under multiple experimental conditions, but they are unable to identify groups of genes that have a similar pattern of variation in their expression values. We previously developed the divisive correlation clustering algorithm (DCCA), based on the concept of correlation clustering, to tackle this situation; however, that algorithm may also fail in certain cases. To overcome these situations, we propose a new clustering algorithm, called the average correlation clustering algorithm (ACCA), which is able to produce better clustering solutions than several existing methods. ACCA is able to find groups of genes having more common transcription factors and similar patterns of variation in their expression values. Moreover, ACCA is more efficient than DCCA with respect to execution time. Like DCCA, ACCA uses the concept of correlation clustering introduced by Bansal et al. ACCA uses the correlation matrix in such a way that all genes in a cluster have the highest average correlation values with the genes in that cluster. We have applied ACCA and some well-known conventional methods, including DCCA, to two artificial and nine gene expression datasets, and compared the performance of the algorithms. The clustering results of ACCA are found to be more significantly relevant to the biological annotations than those of the other methods. Analysis of the results shows the superiority of ACCA over the others in determining groups of genes having more common transcription factors and similar patterns of variation in their expression profiles.
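A compact reading of the ACCA assignment rule, with invented data: each gene joins the cluster with which it has the highest average correlation, iterating to stability. Initialisation and convergence details differ in the published algorithm.

```python
# Compact reading of the ACCA assignment rule: each gene joins the
# cluster with which it has the highest average Pearson correlation,
# iterating until assignments stabilise. Data are simulated.
import numpy as np

def acca(expr, k, iters=20, seed=0):
    n = expr.shape[0]
    corr = np.corrcoef(expr)                     # gene-gene correlation matrix
    labels = np.random.default_rng(seed).integers(k, size=n)
    for _ in range(iters):
        means = np.stack([
            corr[:, labels == c].mean(axis=1) if (labels == c).any()
            else np.full(n, -np.inf)             # guard against empty clusters
            for c in range(k)
        ])
        new_labels = means.argmax(axis=0)        # highest average correlation wins
        if (new_labels == labels).all():
            break
        labels = new_labels
    return labels

rng = np.random.default_rng(3)
base = rng.normal(size=(2, 10))                  # two expression patterns
genes = np.vstack([base[0] + rng.normal(0, .2, (5, 10)),
                   base[1] + rng.normal(0, .2, (5, 10))])
print(acca(genes, k=2))                          # first five vs last five genes
```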
 
Article
Several countries are in the process of implementing an Electronic Health Record (EHR), but limited physician acceptance of this technology presents a serious threat to its successful implementation. The aim of this study was to identify the main determinants of physician acceptance of the EHR in a sample of general practitioners and specialists of the Province of Quebec (Canada). We sent an electronic questionnaire to physician members of the Quebec Medical Association. We tested four theoretical models (the Technology Acceptance Model (TAM), the Extended TAM, a Psychosocial Model, and an Integrated Model) using path analysis and multiple linear regression analysis in order to identify the main determinants of physicians' intention to use the EHR. We evaluated the modifying effect of sociodemographic characteristics using multi-group analysis of structural weights invariance. A total of 157 questionnaires were returned. The four models performed well and explained between 44% and 55% of the variance in physicians' intention to use the EHR. The Integrated Model performed best and showed that perceived ease of use, professional norm, social norm, and demonstrability of the results are the strongest predictors of physicians' intention to use the EHR. Age, gender, previous experience and specialty modified the association between those determinants and intention. The proposed integrated theoretical model is useful in identifying which factors could motivate physicians from different backgrounds to use the EHR. Physicians who perceive the EHR to be easy to use, coherent with their professional norms, supported by their peers and patients, and able to demonstrate tangible results are more likely to accept this technology. Age, gender, specialty and experience should also be taken into account when developing EHR implementation strategies targeting physicians.
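As a toy illustration of the modelling step, the sketch below regresses intention to use the EHR on the four strongest predictors named above using simulated questionnaire scores; the study itself also used path analysis and multi-group invariance testing.

```python
# Toy illustration of the regression step: intention to use the EHR
# modelled on four predictors. Data are simulated, not the study's
# questionnaire responses.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 157                                   # matches the returned questionnaires
X = rng.normal(size=(n, 4))               # ease of use, professional norm,
                                          # social norm, demonstrability
beta = np.array([0.4, 0.3, 0.2, 0.25])    # invented effect sizes
intention = X @ beta + rng.normal(0, 0.5, n)

model = sm.OLS(intention, sm.add_constant(X)).fit()
print(model.rsquared)                     # proportion of variance explained
print(model.params)                       # intercept + four coefficients
```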
 
Figure: Representation of the TAM model and the proposed extended TAM model. Bold lines represent the classical TAM model; normal lines represent paths tested in previous studies; dotted lines represent constructs and paths tested for the first time in technology acceptance models. "Physician Specialty" represents profession-specific differences that moderate the relationships between TAM variables.
Figure: Structural model (standardized results). Circles represent latent factors; boxes represent indicators. Causal effects are shown by arrows connecting circles. Bold numbers over circles represent variance explained. Disturbance and measurement error effects are omitted for clarity.
Article
Recent empirical research has utilized the Technology Acceptance Model (TAM) to advance the understanding of doctors' and nurses' technology acceptance in the workplace. However, the majority of the reported studies are either qualitative in nature or use small convenience samples of medical staff. Additionally, in very few studies are moderators either used or assessed, despite their importance in TAM-based research. The present study focuses on the application of TAM in order to explain the intention to use clinical information systems in a random sample of 604 medical staff (534 physicians) working in 14 hospitals in Greece. We introduce physicians' specialty as a moderator in TAM and test medical staff's information and communication technology (ICT) knowledge and ICT feature demands as external variables. The results show that TAM predicts a substantial proportion of the intention to use clinical information systems. The findings make a contribution to the literature by replicating, explaining and advancing the TAM, whereas theory is benefited by the addition of external variables and medical specialty as a moderator. Recommendations for further research are discussed.
 
Article
Increasing interest in end users' reactions to health information technology (IT) has elevated the importance of theories that predict and explain health IT acceptance and use. This paper reviews the application of one such theory, the Technology Acceptance Model (TAM), to health care. We reviewed 16 data sets analyzed in over 20 studies of clinicians using health IT for patient care. Studies differed greatly in samples and settings, health ITs studied, research models, relationships tested, and construct operationalization. Certain TAM relationships were consistently found to be significant, whereas others were inconsistent. Several key relationships were infrequently assessed. Findings show that TAM predicts a substantial portion of the use or acceptance of health IT, but that the theory may benefit from several additions and modifications. Aside from improved study quality, standardization, and theoretically motivated additions to the model, an important future direction for TAM is to adapt the model specifically to the health care context, using beliefs elicitation methods.
 
Article
Although information access control models have been developed and applied to various applications, few previous works have addressed the issue of managing information access in the combined context of team collaboration and workflow. To meet this requirement, we have enhanced the Role-Based Access Control (RBAC) model by formulating universal constraints, defining bridging entities and contributing attributes, extending access permissions to include workflow contexts, synthesizing a role-based access delegation model that targets specific objects, and developing domain ontologies as instantiations of the general model for particular applications. We have successfully applied this model to the New York State HIV Clinical Education Initiative (CEI) project to address the specific needs of information management in collaborative processes. An initial evaluation showed that this model achieved a high level of agreement with an existing system when applied to 4576 cases (kappa=0.801). Compared to a reference standard, the sensitivity and specificity of the enhanced RBAC model were at the level of 97-100%. These results indicate that the enhanced RBAC model can be effectively used for information access management in the context of team collaboration and workflow to coordinate clinical education programs. Future research is required to incrementally develop additional types of universal constraints, to further investigate how the workflow context and access delegation can be enriched to support the various needs of information access management in collaborative processes, and to examine the generalizability of the enhanced RBAC model for other applications in clinical education, biomedical research, and patient care.
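A minimal sketch of role-based access extended with a workflow context, echoing the enhancement described above; the entities and the single constraint shown are invented for illustration.

```python
# Minimal sketch of RBAC extended with a workflow context. Role names,
# resources and the one constraint shown are illustrative, not the CEI
# system's actual model.
from dataclasses import dataclass

@dataclass(frozen=True)
class Permission:
    action: str
    resource: str
    workflow_step: str          # access is valid only within this step

ROLE_PERMISSIONS = {
    "coordinator": {Permission("assign", "training_request", "triage"),
                    Permission("read", "training_request", "triage")},
    "educator":    {Permission("read", "training_request", "delivery"),
                    Permission("write", "session_notes", "delivery")},
}

def allowed(role, action, resource, current_step):
    # Universal constraint (illustrative): the request's workflow step
    # must match the step attached to the permission.
    return Permission(action, resource, current_step) in ROLE_PERMISSIONS.get(role, set())

print(allowed("coordinator", "assign", "training_request", "triage"))   # True
print(allowed("coordinator", "assign", "training_request", "delivery")) # False
```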
 
Article
Access control is a central problem in privacy management. A common practice in controlling access to sensitive data, such as electronic health records (EHRs), is Role-Based Access Control (RBAC). RBAC is limited in that it does not account for the circumstances under which access to sensitive data is requested. Following a qualitative study that elicited access scenarios, we used Object-Process Methodology to structure the scenarios and conceive a Situation-Based Access Control (SitBAC) model. SitBAC is a conceptual model which defines scenarios where access to a patient's data is permitted or denied. The main concept underlying this model is the Situation Schema, a pattern consisting of the entities Data-Requestor, Patient, EHR, Access Task, Legal-Authorization, and Response, along with their properties and relations. The various data access scenarios are expressed via Situation Instances. While we focus on the medical domain, the model is generic and can be adapted to other domains.
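A hedged sketch of the Situation Schema idea follows: access requests are matched against patterns over a few of the entities the model names, yielding permit or deny. Field names and values are illustrative, not the model's full schema.

```python
# Hedged sketch of Situation-Based Access Control: a Situation Schema as
# a pattern over a subset of the model's entities (Data-Requestor,
# Access Task, Legal-Authorization); matching yields permit or deny.
from dataclasses import dataclass

@dataclass
class Situation:                     # one concrete access scenario
    requestor_role: str
    task: str
    legal_authorization: str

SCHEMAS = [
    # (pattern, response); None in a pattern field matches any value
    (Situation("treating_physician", "read_ehr", None), "permit"),
    (Situation("researcher", "read_ehr", "irb_approval"), "permit"),
]

def decide(instance):
    for pattern, response in SCHEMAS:
        if all(getattr(pattern, f) in (None, getattr(instance, f))
               for f in ("requestor_role", "task", "legal_authorization")):
            return response
    return "deny"                    # default response

print(decide(Situation("treating_physician", "read_ehr", "none")))   # permit
print(decide(Situation("researcher", "read_ehr", "none")))           # deny
```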
 
Article
The integration of medical data coming from multiple sources is important in clinical research. Among other benefits, it enables the discovery of appropriate subjects in patient-oriented research and the identification of innovative results in epidemiological studies. At the same time, the integration of medical data faces significant ethical and legal challenges that impose access constraints. Some of these issues can be addressed by making available aggregated instead of raw record-level data. In many cases, however, there is still a need to control access even to the resulting aggregated data, e.g., due to data providers' policies. In this paper we present the Linked Medical Data Access Control (LiMDAC) framework, which capitalizes on Linked Data technologies to enable controlling access to medical data across distributed sources with diverse access constraints. The LiMDAC framework consists of three Linked Data models, namely the LiMDAC metadata model, the LiMDAC user profile model, and the LiMDAC access policy model. It also includes an architecture that exploits these models. Based on the framework, a proof-of-concept platform is developed and its performance and functionality are evaluated by employing two usage scenarios.
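In the spirit of the LiMDAC access policy model, here is a tiny Linked Data policy check with rdflib; the vocabulary is invented, and the framework's three models are far richer.

```python
# Sketch of access-policy checking over Linked Data, loosely in the
# spirit of LiMDAC. The vocabulary below is invented for illustration.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/limdac#")
g = Graph()
dataset = EX.CancerRegistryAggregates
g.add((dataset, EX.requiresAffiliation, Literal("university")))

def can_access(user_affiliation, target):
    required = {str(o) for o in g.objects(target, EX.requiresAffiliation)}
    return not required or user_affiliation in required

print(can_access("university", dataset))  # True
print(can_access("industry", dataset))    # False
```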
 
Article
The increasing volume and diversity of information in biomedical research demand new approaches to data integration in this domain. Semantic Web technologies and applications can leverage the potential of biomedical information integration and discovery by addressing the semantic heterogeneity of biomedical information sources. In such an environment, agent technology can assist users in discovering and invoking the services available on the Internet. In this paper we present SEMMAS, an ontology-based, domain-independent framework for seamlessly integrating Intelligent Agents and Semantic Web Services. Our approach is backed by a proof-of-concept implementation in which the feasibility and efficiency of integrating disparate biomedical information sources have been tested.
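
The matchmaking at the heart of such a framework can be suggested, in a deliberately simplified form, by the following sketch; all service names and signatures are invented, and SEMMAS itself works at the ontology level rather than on keyword sets:

```python
# Simplified matchmaking between an agent's goal and semantic service
# descriptions. All names are invented illustrations.
SERVICES = {
    "uniprot_lookup": {"inputs": {"ProteinName"}, "outputs": {"ProteinSequence"}},
    "pubmed_search":  {"inputs": {"Gene"},        "outputs": {"Article"}},
}

def discover(goal_inputs, goal_output):
    """Return services whose semantic signature satisfies the agent's goal:
    required inputs are available and the desired output is produced."""
    return [name for name, sig in SERVICES.items()
            if sig["inputs"] <= goal_inputs and goal_output in sig["outputs"]]

print(discover({"ProteinName", "Organism"}, "ProteinSequence"))  # ['uniprot_lookup']
```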
 
SAIL architecture. This diagram shows the SAIL databank system and the controls in place for data acquisition and utilisation, with an indication of the roles carried out by each party. Beginning at the base of the diagram, SAIL has formal agreements with data providers to provide their data to the databank in accordance with Information Governance. The commonly-recognised identifiers are anonymised at NWIS, who provide a trusted third party service to SAIL. Further processes of masking and encryption are carried out at SAIL, and the SAIL databank is constructed. From the top of the diagram, requests to use the data are reviewed by SAIL and an independent Information Governance Review Panel (IGRP) to assess compliance with Information Governance before access can be allowed. Once this is agreed, a data view is created by SAIL staff, and access to this view can be made available via the SAIL Gateway. For this to happen, further data transformations are carried out to control the risk of disclosure, and the data user signs an access agreement for responsible data utilisation, in accordance with the specifications of the IGRP, to comply with Information Governance.
SAIL Info Central screenshot. This screenshot displays the home page for a data user on the SAIL Info Central site (external to the SAIL Gateway). The top section displays menus linking to more information about the datasets and support options. The top left-hand section displays information about the user, which the user can edit. The bottom left-hand section displays the services available to the user and indicates service status. The centre section displays all the projects that the user is authorised to access, with hyperlinks directing them to more information about each project. The bottom right-hand section displays a timeline of news feeds from projects and dataset updates.
The SAIL Gateway data user journey. This flowchart illustrates the SAIL data user journey from initial contact with SAIL to dissemination of outputs. Work conducted within the SAIL Gateway is highlighted. 
Article
With the current expansion of data linkage research, the challenge is to find the balance between preserving the privacy of person-level data whilst making these data accessible for use to their full potential. We describe a privacy-protecting safe haven and secure remote access system, referred to as the Secure Anonymised Information Linkage (SAIL) Gateway. The Gateway provides data users with a familiar Windows interface and their usual toolsets to access approved anonymously-linked datasets for research and evaluation. We outline the principles and operating model of the Gateway, the features provided to users within the secure environment, and how we are approaching the challenges of making data safely accessible to increasing numbers of research users. The Gateway represents a powerful analytical environment and has been designed to be scalable and adaptable to meet the needs of the rapidly growing data linkage community.
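
The anonymisation step carried out by the trusted third party can be illustrated with a keyed hash that turns a commonly-recognised identifier into a stable pseudonym; the key, record layout, and field names below are invented, and SAIL's actual masking and encryption processes are more involved:

```python
# Illustration of producing an anonymised linkage field from a
# commonly-recognised identifier, as done by a trusted third party.
# The HMAC key and record layout are invented for this sketch.
import hmac, hashlib

TTP_SECRET = b"held-only-by-the-trusted-third-party"

def anonymise(nhs_number: str) -> str:
    """Replace the identifier with a stable pseudonym so records can be
    linked across datasets without revealing who they belong to."""
    return hmac.new(TTP_SECRET, nhs_number.encode(), hashlib.sha256).hexdigest()

record = {"nhs_number": "9434765919", "admission": "2011-03-02", "icd10": "J45"}
record["linkage_field"] = anonymise(record.pop("nhs_number"))
print(record)  # clinical content survives; the identifier does not
```

Because the same identifier always yields the same pseudonym under the same key, records from different providers remain linkable after anonymisation, which is the property the databank relies on.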
 
Article
Document search is generally based on individual terms in the document. However, for collections within limited domains it is possible to provide more powerful access tools. This paper describes a system designed for collections of reports of infectious disease outbreaks. The system, Proteus-BIO, automatically creates a table of outbreaks, with each table entry linked to the document describing that outbreak; this makes it possible to use database operations such as selection and sorting to find relevant documents. Proteus-BIO consists of a Web crawler which gathers relevant documents; an information extraction engine which converts the individual outbreak events to a tabular database; and a database browser which provides access to the events and, through them, to the documents. The information extraction engine uses sets of patterns and word classes to extract the information about each event. Preparing these patterns and word classes has been a time-consuming manual operation in the past, but automated discovery tools now make this task significantly easier. A small study comparing the effectiveness of the tabular index with conventional Web search tools demonstrated that users can find substantially more documents in a given time period with Proteus-BIO.
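
A toy version of the extraction step, in which one regular expression and a tiny word class stand in for Proteus-BIO's much larger pattern sets, might look as follows (the report text is invented):

```python
# Toy pattern-based extraction of outbreak events into tabular form.
# The single regex and word class below stand in for Proteus-BIO's
# much larger sets of patterns; the report text is invented.
import re

DISEASES = r"(?P<disease>cholera|dengue|measles)"   # a tiny "word class"
PATTERN = re.compile(
    DISEASES + r" outbreak in (?P<location>[A-Z][a-z]+)"
    r"(?:, (?P<cases>\d+) cases)?")

def extract(text):
    """Return one table row (dict) per outbreak event mentioned."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

report = "A cholera outbreak in Goma, 214 cases, followed a measles outbreak in Kinshasa."
for row in extract(report):
    print(row)
# {'disease': 'cholera', 'location': 'Goma', 'cases': '214'}
# {'disease': 'measles', 'location': 'Kinshasa', 'cases': None}
```

Rows like these are what make database-style selection and sorting over a document collection possible.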
 
Article
Online personal health records (PHRs) enable patients to access, manage, and share portions of their own health information electronically. This capability creates the need for precise access-control mechanisms that restrict the sharing of data to that intended by the patient. The authors describe the design and implementation of an access-control mechanism for PHR repositories that is modeled on the eXtensible Access Control Markup Language (XACML) standard but intended to reduce the cognitive and computational complexity of XACML. The authors implemented the mechanism entirely in a relational database system using ANSI-standard SQL statements. Based on a set of access-control rules encoded as relational table rows, the mechanism determines via a single SQL query whether a user who accesses patient data from a specific application is authorized to perform a requested operation on a specified data object. Testing of this query on a moderately large database has demonstrated execution times consistently below 100 ms. The authors include the details of the implementation, including algorithms, examples, and a test database, as Supplementary materials.
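
A simplified stand-in for this rule-table approach, using an invented schema rather than the paper's actual tables (which appear in its Supplementary materials), could look like this in SQLite:

```python
# Simplified stand-in for the rule-table approach: access rules live in
# a relational table and one SQL query decides each request. The schema
# here is invented; the paper's actual tables and query are richer.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE access_rule (
    grantee TEXT, application TEXT, operation TEXT, data_object TEXT)""")
db.executemany("INSERT INTO access_rule VALUES (?,?,?,?)", [
    ("dr_smith", "portal", "read",  "allergies"),
    ("dr_smith", "portal", "write", "notes"),
])

def authorized(user, app, op, obj):
    """One query answers the access decision; deny when no rule matches."""
    row = db.execute(
        """SELECT 1 FROM access_rule
           WHERE grantee=? AND application=? AND operation=? AND data_object=?
           LIMIT 1""", (user, app, op, obj)).fetchone()
    return row is not None

print(authorized("dr_smith", "portal", "read", "allergies"))   # True
print(authorized("dr_smith", "portal", "read", "medications")) # False
```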
 
Article
Modern healthcare organizations (HCOs) are composed of complex dynamic teams to ensure clinical operations are executed in a quick and competent manner. At the same time, the fluid nature of such environments hinders administrators' efforts to define access control policies that appropriately balance patient privacy and healthcare functions. Manual efforts to define these policies are labor-intensive and error-prone, often resulting in systems that endow certain care providers with overly broad access to patients' medical records while restricting other providers from legitimate and timely use. In this work, we propose an alternative method to generate these policies by automatically mining usage patterns from electronic health record (EHR) systems. EHR systems are increasingly being integrated into clinical environments and our approach is designed to be generalizable across HCOs, thus assisting in the design and evaluation of local access control policies. Our technique, which is grounded in data mining and social network analysis theory, extracts a statistical model of the organization from the access logs of its EHRs. In doing so, our approach enables the review of predefined policies, as well as the discovery of unknown behaviors. We evaluate our approach with 5 months of access logs from the Vanderbilt University Medical Center and confirm the existence of stable social structures and intuitive business operations. Additionally, we demonstrate that there is significant turnover in the interactions between users in the HCO and that policies learned at the department level afford greater stability over time.
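
One way to derive such a network from access logs, sketched with invented field names and a simple co-access rule (two users are linked when they touch the same patient's record), is:

```python
# Sketch of deriving a user-interaction network from EHR access logs.
# Field names and the co-access rule are illustrative, not the paper's
# actual statistical model.
from collections import defaultdict
from itertools import combinations

access_log = [  # (user, department, patient_id)
    ("u1", "oncology",  "p1"), ("u2", "oncology", "p1"),
    ("u3", "radiology", "p1"), ("u1", "oncology", "p2"),
    ("u2", "oncology",  "p2"),
]

patients = defaultdict(set)          # patient -> users who accessed them
for user, dept, patient in access_log:
    patients[patient].add(user)

edge_weight = defaultdict(int)       # co-access counts between user pairs
for users in patients.values():
    for a, b in combinations(sorted(users), 2):
        edge_weight[(a, b)] += 1

for (a, b), w in sorted(edge_weight.items(), key=lambda x: -x[1]):
    print(a, b, w)  # u1-u2 has weight 2: a candidate stable working pair
```

Heavily weighted, persistent edges suggest stable team structures worth encoding in policy; edges that appear once and vanish reflect the turnover the paper reports.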
 
Article
In many healthcare organizations, comparative effectiveness research and quality improvement (QI) investigations are hampered by a lack of access to data created as a byproduct of patient care. Data collection often hinges upon either manual chart review or ad hoc requests to technical experts who support legacy clinical systems. In order to provide this needed capacity for data exploration at our institution (Duke University Health System), we have designed and deployed a robust Web application for cohort identification and data extraction: the Duke Enterprise Data Unified Content Explorer (DEDUCE). DEDUCE is envisioned as a simple, web-based environment that allows investigators access to administrative, financial, and clinical information generated during patient care. By using business intelligence tools to create a view into Duke Medicine's enterprise data warehouse, DEDUCE provides a Guided Query function with a wizard-like interface that lets users filter through millions of clinical records, explore aggregate reports, and export extracts. Researchers and QI specialists can obtain detailed patient- and observation-level extracts without needing to understand structured query language or the underlying database model. Developers designing such tools must provide sufficient training and application safeguards to ensure that patient-centered clinical researchers understand when observation-level extracts should be used; this may mitigate the risk of data being misunderstood and consequently used improperly.
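
The kind of filter-to-SQL translation such a guided-query wizard performs can be sketched as follows; the table, columns, and allow-list are invented, not DEDUCE's warehouse model:

```python
# Sketch of a guided-query builder: UI filter choices are translated to
# parameterized SQL so users never write queries by hand. Table and
# column names are invented.
ALLOWED = {"age": "int", "diagnosis_code": "text", "admit_year": "int"}

def build_query(filters):
    """filters: list of (column, operator, value) tuples chosen in a UI."""
    clauses, params = [], []
    for col, op, val in filters:
        if col not in ALLOWED or op not in ("=", "<", ">", "LIKE"):
            raise ValueError(f"unsupported filter: {col} {op}")
        clauses.append(f"{col} {op} ?")   # parameterized: no SQL injection
        params.append(val)
    sql = "SELECT patient_id FROM encounters WHERE " + " AND ".join(clauses)
    return sql, params

print(build_query([("age", ">", 65), ("diagnosis_code", "LIKE", "I21%")]))
```

Restricting columns and operators to an allow-list is one simple safeguard of the sort the abstract argues such tools need.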
 
Article
The development of large semantic networks, such as the UMLS, which are intended to support a variety of applications, requires a flexible and efficient query interface for the extraction of information. Using one of the source vocabularies of UMLS as a test bed, we have developed such a prototype query interface. We first identify common classes of queries needed by applications that access these semantic networks. Next, we survey StruQL, an existing query language that we adopted, which supports all of these classes of queries. We then describe the OQAFMA Querying Agent for the Foundational Model of Anatomy (OQAFMA), which provides an efficient implementation of a subset of StruQL by pre-computing a variety of indices. We describe how OQAFMA leverages database optimization by converting StruQL queries to SQL. We evaluate the flexibility and efficiency of our implementation using English queries written by anatomists. This evaluation verifies that OQAFMA provides flexible, efficient access to one such large semantic network, the Foundational Model of Anatomy, and suggests that OQAFMA could be an efficient query interface to other large biomedical knowledge bases, such as the Unified Medical Language System.
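
One class of query such an interface supports, transitive traversal of a structural relationship, can be compiled to SQL over an edge table, as in this toy sketch with an invented schema (the FMA and StruQL are far more expressive):

```python
# Toy version of one query class OQAFMA supports: transitive traversal
# of a structural relationship, compiled to SQL over an edge table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE part_of (part TEXT, whole TEXT)")
db.executemany("INSERT INTO part_of VALUES (?,?)", [
    ("left ventricle", "heart"),
    ("mitral valve",   "left ventricle"),
    ("heart",          "thorax"),
])

# "Find everything that is (transitively) part of the heart"
rows = db.execute("""
    WITH RECURSIVE parts(name) AS (
        SELECT part FROM part_of WHERE whole = 'heart'
        UNION
        SELECT p.part FROM part_of p JOIN parts ON p.whole = parts.name
    )
    SELECT name FROM parts""").fetchall()
print([r[0] for r in rows])   # ['left ventricle', 'mitral valve']
```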
 
Article
The large and rapidly growing number of information sources relevant to health care, and the increasing amounts of new evidence produced by researchers, are improving professionals' and students' access to valuable information. However, seeking and filtering useful, valid information can still be very difficult. An online information system that conducts searches based on individual patient data can have a beneficial influence on that patient's outcome and educate the healthcare worker. In this paper, we describe the underlying model for a system that aims to facilitate the search for evidence based on clinicians' needs. The paper reviews studies of the information needs of clinicians, describes principles of information retrieval, and examines the role that standardized terminologies can play in the integration between a clinical system and literature resources, as well as in the information retrieval process. It also describes a model for a digital library system that supports the integration of clinical systems with online information sources, making use of information available in the electronic medical record to enhance searches and information retrieval. The model builds on several different, previously developed techniques to identify information themes that are relevant to specific clinical data. Using a framework of evidence-based practice, the system generates well-structured questions with the intent of enhancing information retrieval. We believe that by helping clinicians pose well-structured clinical queries that include relevant information from individual patients' medical records, we can enhance information retrieval and thus improve patient care.
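
The question-generation step can be suggested by filling a PICO-style template from record fields; the record layout, template, and outcome below are invented, and a real system would first map codes through standardized terminologies:

```python
# Sketch of filling a PICO-style question template from fields of an
# electronic record. The record layout, template, and outcome are
# invented for illustration.
record = {"problem": "type 2 diabetes", "age_group": "elderly",
          "current_therapy": "metformin", "candidate": "SGLT2 inhibitors"}

PICO = ("In {age_group} patients with {problem} (P), is {candidate} (I) "
        "more effective than {current_therapy} (C) at reducing "
        "cardiovascular events (O)?")

query = PICO.format(**record)
print(query)
# The structured question can then seed a literature search, with the
# P/I/C/O slots supplying the query terms.
```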
 
Schematic representation of cis-splicing leading to chimeric gene fusions. Intergenic splicing combines exons from the upstream and downstream genes to form a chimeric gene fusion. Occasionally, the intergenic region is preserved in the mRNA transcript, too.  
Mutual interaction network for the putative intergenic splicing transcripts included in the pioneering discoveries by Akiva et al. [1] and Parra et al. [2]. If the observation reveals a novel pathological mechanism, these putative protein complexes are the most likely to be affected. In particular, the housekeeping and homeotic HOX transfactors centered around BMI-1 warrant further attention in developmental biology. Only direct interactions are displayed; blue arrows indicate activation and red arrows inhibition. Node symbols follow the GeneGo Inc. MetaCore/MetaDrug systems medicine platform legend: enzyme, kinase, protease, protein, generic binding protein, transfactor, GTPase, G-protein adaptor, receptor, receptor ligand, transporter, GPCR, and channel.
Article
Over half of the DNA of mammalian genomes is transcribed, and one of the emerging enigmas in the field of RNA research is intergenic splicing, or transcription-induced chimerism. We argue that fused low-copy-number transcripts constitute a neglected pathological mechanism akin to copy number variation, owing to the loss of stoichiometric subunit ratios in protein complexes. An obstacle for transcriptomics meta-analysis of published microarrays is the traditional nomenclature, which merges transcript neighbors under the same accession codes. Tandem transcripts cover 4-20% of genomes but overlap only loosely across individuals in a population. In the GeneGo Inc. MetaCore-MetaDrug(TM) knowledgebase, evaluated here with external randomizations, they were most enriched in systems medicine annotations concerning neurology, thalassemia, and genital disorders. This is encouraging for clinical transcriptomics, since newly recognized disease etiologies offer new remedies. We identified homeotic HOX transfactors centered around BMI-1, the Grb2 adaptor network, the kallikrein system, and thalassemia RNA surveillance as vulnerable hotspot chimeras. As a cure, RNA interference would require verification of chimerism in symptomatic tissue versus healthy control tissue from the same patient.
 
Top-cited authors
Paul A Harris
  • Vanderbilt University
Jose G. Conde
  • University of Puerto Rico, Medical Sciences Campus
Richard J Holden
  • Vanderbilt University
Harsh Dweep
  • Wistar Institute
Priyanka Pandey
  • National Institute of Biomedical Genomics