Journal of Biomedical Informatics (J Biomed Informat)

Publisher: Elsevier

Journal description

The Journal of Biomedical Informatics (formerly Computers and Biomedical Research) has been redesigned to reflect a commitment to high-quality original research papers and reviews in the area of biomedical informatics. Although published articles are motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, imaging, and bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices, and formal evaluations of completed systems, including clinical trials of information technologies, would generally be more suitable for publication in other venues. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report.

Current impact factor: 2.48

Impact Factor Rankings

2015 Impact Factor Available summer 2015
2013 / 2014 Impact Factor 2.482
2012 Impact Factor 2.131
2011 Impact Factor 1.792
2010 Impact Factor 1.719
2009 Impact Factor 2.432
2008 Impact Factor 1.924
2007 Impact Factor 2
2006 Impact Factor 2.346
2005 Impact Factor 2.388
2004 Impact Factor 1.013
2003 Impact Factor 0.855
2002 Impact Factor 0.862

Additional details

5-year impact 2.43
Cited half-life 4.40
Immediacy index 0.55
Eigenfactor 0.01
Article influence 0.84
Website Journal of Biomedical Informatics website
Other titles Journal of biomedical informatics (Online)
ISSN 1532-0480
OCLC 45147742
Material type Document, Periodical, Internet resource
Document type Internet Resource, Computer File, Journal / Magazine / Newspaper

Publisher details

Elsevier

  • Pre-print
    • Author can archive a pre-print version
  • Post-print
    • Author can archive a post-print version
  • Conditions
    • Author's pre-print on any website, including arXiv and RePEc
    • Author's post-print on author's personal website immediately
    • Author's post-print on open access repository after an embargo period of between 12 months and 48 months
    • Deposit permitted due to funding body, institutional, or governmental policy or mandate; may be required to comply with embargo periods of 12 months to 48 months
    • Author's post-print may be used to update arXiv and RePEc
    • Publisher's version/PDF cannot be used
    • Must link to publisher version with DOI
    • Author's post-print must be released with a Creative Commons Attribution Non-Commercial No Derivatives License
    • Publisher last reviewed on 03/06/2015
  • Classification
    • green

Publications in this journal

  • ABSTRACT: Multiple instance learning algorithms have been increasingly used in the field of computer-aided detection and diagnosis. In this study, we propose a novel multiple instance learning method for identifying the tumor invasion depth of gastric cancer from dual-energy CT imaging. In the proposed scheme, two levels of features, bag-level and instance-level, are extracted for subsequent processing and classification. For instance-level features, there is some ambiguity in assigning labels to selected patches, and an improved Citation-KNN method is presented to solve this problem. Compared with state-of-the-art multiple instance learning algorithms on the same clinical dataset, the proposed algorithm achieves improved results. The experimental evaluation is performed using leave-one-out cross-validation, with a total accuracy of 0.7692. The proposed multiple instance learning algorithm serves as an alternative method for computer-aided diagnosis and identification of the tumor invasion depth of gastric cancer with dual-energy CT imaging techniques. Copyright © 2015 Elsevier Inc. All rights reserved.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.017
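    A minimal sketch of a plain Citation-KNN bag classifier is given below for orientation; it uses the minimal Hausdorff distance between bags and majority voting over references and citers. It is not the improved variant proposed in the paper, and all data, function names, and parameters are illustrative.

      # Plain Citation-KNN sketch for multiple instance learning (illustrative only).
      import numpy as np

      def min_hausdorff(bag_a, bag_b):
          """Minimal Hausdorff distance: smallest pairwise instance distance."""
          d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=2)
          return d.min()

      def citation_knn_predict(query, bags, labels, n_refs=3, n_citers=2):
          dists = np.array([min_hausdorff(query, b) for b in bags])
          refs = np.argsort(dists)[:n_refs]                  # R nearest references
          citers = []
          for i, bag in enumerate(bags):
              d_i = [min_hausdorff(bag, other) for other in bags]
              d_i[i] = np.inf
              # bag i "cites" the query if the query is at least as close as
              # bag i's n_citers-th nearest training bag
              if min_hausdorff(bag, query) <= np.sort(d_i)[min(n_citers, len(d_i)) - 1]:
                  citers.append(i)
          votes = [labels[i] for i in refs] + [labels[i] for i in citers]
          return max(set(votes), key=votes.count)            # majority vote

      # toy usage: bags of 2-D instances drawn around two centroids
      bags = [np.random.randn(4, 2) + c for c in (0, 0, 3, 3)]
      labels = [0, 0, 1, 1]
      print(citation_knn_predict(np.random.randn(4, 2) + 3, bags, labels))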
  • ABSTRACT: The 2014 i2b2/UTHealth natural language processing shared task featured a track focused on the de-identification of longitudinal medical records. For this track, we de-identified a set of 1,304 longitudinal medical records describing 296 patients. This corpus was de-identified under a broad interpretation of the HIPAA guidelines using double annotation followed by arbitration, rounds of sanity checking, and proofreading. The average token-based F1 measure for the annotators compared to the gold standard was 0.927. The resulting annotations were used both to de-identify the data and to set the gold standard for the de-identification track of the 2014 i2b2/UTHealth shared task. All annotated private health information was replaced with realistic surrogates automatically and then read over and corrected manually. The resulting corpus is the first of its kind made available for de-identification research. This corpus was first used for the 2014 i2b2/UTHealth shared task, during which the systems achieved a mean F-measure of 0.872 and a maximum F-measure of 0.964 using entity-based micro-averaged evaluations. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.07.020
  • ABSTRACT: De-identification is a shared task of the 2014 i2b2/UTHealth challenge. The purpose of this task is to remove protected health information (PHI) from medical records. In this paper, we propose a novel de-identifier, WI-deId, based on conditional random fields (CRFs). A preprocessing module, which tokenizes the medical records using regular expressions and an off-the-shelf tokenizer, is introduced, and three groups of features are extracted to train the de-identifier model. The experiment shows that our system is effective in the de-identification of medical records, achieving a micro-F1 of 0.9232 at the i2b2 strict entity evaluation level. Copyright © 2015 Elsevier Inc. All rights reserved.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.012
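    For readers unfamiliar with this setup, the sketch below shows a generic CRF-based PHI tagger over BIO labels, assuming the sklearn-crfsuite package; the tokenizer, feature set, and example record are illustrative and do not reproduce the WI-deId feature groups.

      # Generic CRF-based PHI tagging sketch (BIO labels); assumes sklearn-crfsuite.
      import re
      import sklearn_crfsuite

      TOKEN_RE = re.compile(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]")   # simple regex tokenizer

      def token_features(tokens, i):
          tok = tokens[i]
          return {
              "lower": tok.lower(),
              "is_digit": tok.isdigit(),
              "is_title": tok.istitle(),
              "suffix3": tok[-3:],
              "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
              "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
          }

      def featurize(text):
          tokens = TOKEN_RE.findall(text)
          return [token_features(tokens, i) for i in range(len(tokens))]

      # one toy training sequence; real training would use the annotated corpus
      X = [featurize("Admitted on 03/12/2010 to Mercy Hospital.")]
      y = [["O", "O", "B-DATE", "I-DATE", "I-DATE", "I-DATE", "I-DATE",
            "O", "B-HOSPITAL", "I-HOSPITAL", "O"]]

      crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
      crf.fit(X, y)
      print(crf.predict(X))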
  • ABSTRACT: We present the design, and analyze the performance, of a multi-stage natural language processing system employing named entity recognition, Bayesian statistics, and rule logic to identify and characterize heart disease risk factor events in diabetic patients over time. The system was originally developed for the 2014 i2b2 Challenges in Natural Language in Clinical Data. The system's strengths included a high level of accuracy in identifying named entities associated with heart disease risk factor events. Its primary weakness was inaccuracy in characterizing the attributes of some events: for example, determining the relative time of an event with respect to the record date, deciding whether an event is attributable to the patient's history or the patient's family history, and differentiating between current and prior smoking status. We believe these inaccuracies were due in large part to the lack of an effective approach for integrating context into our event detection model. To address them, we explore the addition of a distributional semantic model for characterizing contextual evidence of heart disease risk factor events. Using this semantic model, we raise our initial 2014 i2b2 challenge F1 score of 0.838 to 0.890 and increase precision by 10.3% without the use of any lexicons that might bias our results. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.009
  • ABSTRACT: SNOMED CT is the international lingua franca of terminologies for human health. Based on Description Logic (DL), the terminology enables data queries that incorporate inferred relationships between data elements as well as those that are explicitly stated. However, the ontologic and polyhierarchical nature of the SNOMED CT concept model makes it difficult to implement in its entirety within electronic health record systems that largely employ object-oriented or relational database architectures. The result is a reduction of data richness, limited query capability, and increased system overhead. The hypothesis of this research was that a graph database (graph DB) architecture using SNOMED CT as the basis for the data model, and subsequently modeling patient data upon the semantic core of SNOMED CT, could exploit the full value of the terminology to enrich and support advanced querying of patient data sets. The hypothesis was tested by instantiating a graph DB with the fully classified SNOMED CT concept model. The graph DB instance was tested for integrity by calculating the transitive closure table for the SNOMED CT hierarchy and comparing the results with transitive closure tables created using current, validated methods. The graph DB was then populated with 461,171 anonymized patient record fragments and over 2.1 million associated SNOMED CT clinical findings. Queries, including concept negation and disjunction, were then run against the graph database and against an enterprise Oracle relational database (RDBMS) holding the same patient data sets. The graph DB was then populated with laboratory data encoded using LOINC as well as medication data encoded with RxNorm, and complex queries were performed using LOINC, RxNorm, and SNOMED CT to identify uniquely described patient populations. A graph database instance was successfully created for two international releases of SNOMED CT and two US SNOMED CT editions. Transitive closure tables and descriptive statistics generated using the graph database were identical to those produced using validated methods. Patient queries produced patient counts identical to the Oracle RDBMS with comparable times. Database queries involving defining attributes of SNOMED CT concepts were possible with the graph DB; the same queries could not be performed directly against the Oracle RDBMS representation of the patient data and required the creation and use of external terminology services. Further, queries of undefined depth successfully identified unknown relationships between patient cohorts. The results of this study supported the hypothesis that a patient database built upon and around the semantic model of SNOMED CT is possible. The model supported queries that leveraged all aspects of the SNOMED CT logical model to produce clinically relevant results. Logical disjunction and negation queries were possible using the data model, as were queries that extended beyond the structural IS_A hierarchy of SNOMED CT to employ defining attribute-values of SNOMED CT concepts as search parameters. As medical terminologies such as SNOMED CT continue to expand, they will become more complex and model consistency will be more difficult to assure. Simultaneously, consumers of data will increasingly demand improvements to query functionality to accommodate additional granularity of clinical concepts without sacrificing speed. This new line of research provides an alternative approach to instantiating and querying patient data represented using advanced computable clinical terminologies. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.016
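    To make the transitive closure step concrete, the sketch below computes the (descendant, ancestor) closure of a toy IS_A edge list in Python and notes, in a comment, how the same traversal becomes a variable-length path query in a graph database; the concept identifiers and edges are illustrative, not a real release.

      # Transitive closure of a SNOMED CT-style IS_A hierarchy (toy edge list).
      from collections import defaultdict

      isa_edges = [                 # (child concept id, parent concept id), illustrative
          ("22298006", "57809008"),
          ("57809008", "404684003"),
      ]

      parents = defaultdict(set)
      for child, parent in isa_edges:
          parents[child].add(parent)

      _cache = {}
      def ancestors(concept):
          """All ancestors reachable via IS_A; assumes an acyclic hierarchy."""
          if concept not in _cache:
              acc = set()
              for p in parents.get(concept, ()):
                  acc.add(p)
                  acc |= ancestors(p)
              _cache[concept] = acc
          return _cache[concept]

      closure = {(c, a) for c in parents for a in ancestors(c)}
      print(sorted(closure))

      # In a graph DB the same question is a variable-length path query, e.g. in Cypher:
      #   MATCH (c:Concept {sctid: '22298006'})-[:ISA*]->(a:Concept) RETURN DISTINCT a.sctid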
  • ABSTRACT: For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the voluntary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics - i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words - and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.013
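    As one hedged illustration of combining distributional models as features, the sketch below trains two small word2vec models (assuming gensim 4 or later) and concatenates their vectors into a single token representation that a downstream NER or relation classifier could consume; the corpora, hyper-parameters, and vocabulary are toy stand-ins for the paper's ensemble.

      # Combining two distributional-semantics models into token features (gensim >= 4).
      import numpy as np
      from gensim.models import Word2Vec

      notes = [["patienten", "fick", "utslag", "av", "penicillin"],
               ["utslag", "efter", "behandling", "med", "penicillin"]]
      literature = [["rash", "is", "a", "known", "reaction", "to", "penicillin"]]

      m1 = Word2Vec(notes, vector_size=25, window=2, min_count=1, sg=1, epochs=50)
      m2 = Word2Vec(literature, vector_size=25, window=2, min_count=1, sg=1, epochs=50)

      def token_vector(token):
          """Concatenate vectors from both models; zeros when out of vocabulary."""
          parts = [m.wv[token] if token in m.wv else np.zeros(m.vector_size)
                   for m in (m1, m2)]
          return np.concatenate(parts)

      print(token_vector("penicillin").shape)   # (50,), usable as classifier features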
  • ABSTRACT: In the United States, about 600,000 people die of heart disease every year. The annual cost of care services, medications, and lost productivity reportedly exceeds 108.9 billion dollars. Effective disease risk assessment is critical to prevention, care, and treatment planning. Recent advancements in text analytics have opened up new possibilities of using the rich information in electronic medical records (EMRs) to identify relevant risk factors. The 2014 i2b2/UTHealth Challenge brought together researchers and practitioners of clinical natural language processing (NLP) to tackle the identification of heart disease risk factors reported in EMRs. We participated in this track and developed an NLP system by leveraging existing tools and resources, both public and proprietary. Our system was a hybrid of several machine-learning and rule-based components. The system achieved an overall F1 score of 0.9185, with a recall of 0.9409 and a precision of 0.8972. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.011
  • ABSTRACT: Electronic Health Records (EHRs) present the opportunity to observe serial measurements on patients. While potentially informative, analyzing these data can be challenging. In this work we present a means to classify individuals based on a series of measurements collected by an EHR. Using patients undergoing hemodialysis, we categorized people based on their intradialytic blood pressure. Our primary criteria were that the classifications were time dependent and independent of other subjects. We fit a curve of intradialytic blood pressure using regression splines and then calculated first and second derivatives to derive four mutually exclusive classifications at different time points. We show that these classifications relate to near-term risk of cardiac events and are moderately stable over a succeeding two-week period. This work has general application for analyzing dense EHR data. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.010
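    A minimal sketch of this idea is given below, assuming SciPy: a smoothing spline is fitted to one session's systolic readings and the signs of its first and second derivatives pick one of four trajectory labels at each time point; the readings, smoothing factor, and class names are illustrative, not the paper's definitions.

      # Spline-plus-derivatives classification of an intradialytic BP series (SciPy).
      import numpy as np
      from scipy.interpolate import UnivariateSpline

      minutes = np.array([0, 30, 60, 90, 120, 150, 180, 210, 240], dtype=float)
      systolic = np.array([142, 138, 130, 124, 121, 119, 121, 126, 133], dtype=float)

      spline = UnivariateSpline(minutes, systolic, k=3, s=25)   # cubic smoothing spline
      d1, d2 = spline.derivative(1), spline.derivative(2)

      def classify(t):
          rising, accelerating = d1(t) >= 0, d2(t) >= 0
          if rising and accelerating:          return "rising, accelerating"
          if rising and not accelerating:      return "rising, decelerating"
          if not rising and accelerating:      return "falling, decelerating"
          return "falling, accelerating"

      for t in (30, 120, 210):
          print(t, round(float(spline(t)), 1), classify(t))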
  • ABSTRACT: While there are many state-of-the-art approaches to introducing telemedical services in the area of medical imaging, it is hard to point to studies that address all relevant aspects in a complete and comprehensive manner. In this paper we describe our approach to the design and implementation of a universal platform for imaging medicine, based on our longstanding experience in this area. We claim it is holistic because, contrary to most of the available studies, it addresses all aspects related to the creation and utilization of a medical teleconsultation workspace. We present an extensive analysis of requirements, including possible usage scenarios, user needs, organizational and security issues, and infrastructure components. We enumerate and analyze multiple usage scenarios related to medical imaging data in treatment, research, and educational applications, with typical teleconsultations treated as just one of many possible options. Certain phases common to all these scenarios have been identified, with the resulting classification distinguishing several modes of operation (local vs. remote, collaborative vs. non-interactive, etc.). On this basis we propose a system architecture that addresses all of the identified requirements by applying two key concepts: Service Oriented Architecture (SOA) and Virtual Organizations (VO). The SOA paradigm allows us to decompose the functionality of the system into several distinct building blocks, ensuring flexibility and reliability. The VO paradigm defines the cooperation model for all participating healthcare institutions. Our approach is validated by an ICT platform called TeleDICOM II, which implements the proposed architecture. All of its main elements are described in detail and cross-checked against the listed requirements, and a case study presents the role and usage of the platform in a specific scenario. Finally, our platform is compared with similar systems described in studies to date and available on the market. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.007
  • ABSTRACT: Because the enormous number of scientific publications cannot be handled manually, there is rising interest in text-mining techniques for automated information extraction, especially in the biomedical field. Such techniques provide effective means of information search, knowledge discovery, and hypothesis generation. Most previous studies have focused primarily on the design and performance improvement of either named entity recognition or relation extraction. In this paper, we present PKDE4J, a comprehensive text-mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework. Starting with the Stanford CoreNLP, we developed the system to cope with multiple types of entities and relations. The system also offers fairly good accuracy and the ability to configure its text-processing components. We demonstrate its competitive performance by evaluating it on several corpora, finding that it surpasses existing systems with average F-measures of 85% for entity extraction and 81% for relation extraction. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.008
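    The sketch below illustrates the general pattern of dictionary-based entity extraction followed by a rule-based relation check over a single sentence; it is a deliberately simple stand-in written in Python (PKDE4J itself builds on Stanford CoreNLP in Java), and the dictionaries, rule, and example sentence are invented.

      # Dictionary-based entity extraction plus a simple co-occurrence relation rule.
      import re

      dictionaries = {
          "GENE":    {"tp53", "brca1"},
          "DISEASE": {"breast cancer", "li-fraumeni syndrome"},
      }

      def extract_entities(sentence):
          found = []
          low = sentence.lower()
          for etype, terms in dictionaries.items():
              for term in terms:
                  for m in re.finditer(re.escape(term), low):
                      found.append((etype, term, m.start(), m.end()))
          return sorted(found, key=lambda e: e[2])

      def extract_relations(sentence, entities):
          """Rule: GENE ... 'associated with' ... DISEASE in one sentence."""
          rels = []
          for g in (e for e in entities if e[0] == "GENE"):
              for d in (e for e in entities if e[0] == "DISEASE"):
                  between = sentence.lower()[g[3]:d[2]]
                  if "associated with" in between:
                      rels.append((g[1], "associated_with", d[1]))
          return rels

      s = "Mutations in TP53 are associated with Li-Fraumeni syndrome."
      ents = extract_entities(s)
      print(ents)
      print(extract_relations(s, ents))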
  • ABSTRACT: Exceptional growth in the availability of large-scale clinical imaging datasets has led to the development of computational infrastructures that offer scientists access to image repositories and associated clinical variables. The EU FP7 neuGRID project and its follow-on, neuGRID4You (N4U), provide a leading e-Infrastructure where neuroscientists can find core services and resources for brain image analysis. The core component of this e-Infrastructure is the N4U Virtual Laboratory, which offers neuroscientists easy access to a wide range of datasets and algorithms, pipelines, computational resources, services, and associated support services. The foundation of this virtual laboratory is a massive data store plus a set of information services collectively called the 'Data Atlas'. This data atlas stores datasets, clinical study data, data dictionaries, and algorithm/pipeline definitions, and provides interfaces for parameterised querying so that neuroscientists can perform analyses on required datasets. This paper presents the overall design and development of the Data Atlas and its associated dataset indexing and retrieval services, which originated from the development of the N4U Virtual Laboratory in the EU FP7 N4U project in light of detailed user requirements. Copyright © 2015 Elsevier Inc. All rights reserved.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.004
  • ABSTRACT: Research objectives: Nationally sponsored cancer care quality improvement efforts have been deployed in community health centers to increase breast, cervical, and colorectal cancer screening rates among vulnerable populations. Despite some immediate and short-term gains, screening rates remain below national benchmark objectives, and overall improvement has been difficult to sustain over time in some organizational settings and/or to diffuse to others as repeatable best practices. One reason is that facility-level changes typically occur in dynamic organizational environments that are complex, adaptive, and unpredictable. This study seeks to better understand the factors that help shape community health center facility-level cancer screening performance over time, applying a computational modeling approach that combines principles of health services research, health informatics, network theory, and systems science. Methods: To investigate the role of knowledge acquisition, retention, and sharing within community health centers, and the effect of this role on the relationship between clinical decision support capabilities and improvement in cancer screening rates, we used ConstructTM to create simulated community health centers from previously collected point-in-time survey data. ConstructTM is a multi-agent model of network evolution in which social, knowledge, and belief networks co-evolve. Groups and organizations are treated as complex systems, capturing the variability in human and organizational factors; individuals and groups interact, communicate, learn, and make decisions in a continuous cycle. Data from the survey were used to create simulated community health centers classified as high-performing or low-performing, based on the extent of computer decision support use and on cancer screening rates. Results: Our virtual experiment revealed that patterns of overall network symmetry, agent cohesion, and connectedness varied by community health center performance level. Visual assessment of both the agent-to-agent knowledge sharing network and the agent-to-resource knowledge use network diagrams showed that community health centers labeled as high performers typically exhibited higher levels of collaboration and cohesiveness among agent classes, faster knowledge absorption rates, and fewer agents disconnected from key knowledge resources. Conclusions and research implications: Using point-in-time survey data outlining community health center cancer screening practices, our computational model successfully distinguished between high and low performers. High-performance environments displayed distinctive network characteristics in patterns of interaction among agents, as well as in the access and utilization of key knowledge resources. The study demonstrates how non-network-specific data obtained from a point-in-time survey can be used to forecast community health center performance over time and thereby enhance the sustainability of long-term strategic improvement efforts. Our results revealed a strategic profile for community health center cancer screening improvement over a projected 10-year simulated period. The use of computational modeling and simulation allows additional inferential knowledge to be drawn from existing data when examining organizational performance in increasingly complex environments. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.005
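    Purely as a toy illustration of the kind of agent-based knowledge diffusion such simulations capture, the sketch below lets agents exchange facts over a fixed interaction network and reports overall knowledge coverage; it is not ConstructTM, and the network, probabilities, and facts are invented.

      # Toy agent-based knowledge-diffusion sketch (not ConstructTM).
      import random

      random.seed(0)
      N_AGENTS, N_FACTS, STEPS = 20, 10, 50
      knowledge = [set(random.sample(range(N_FACTS), 2)) for _ in range(N_AGENTS)]
      # a "high-performing" setting might use denser ties or a higher sharing probability
      neighbours = {i: random.sample([j for j in range(N_AGENTS) if j != i], 4)
                    for i in range(N_AGENTS)}
      P_SHARE = 0.3

      for _ in range(STEPS):
          for i in range(N_AGENTS):
              j = random.choice(neighbours[i])          # pick an interaction partner
              new_facts = knowledge[j] - knowledge[i]
              if new_facts and random.random() < P_SHARE:
                  knowledge[i].add(random.choice(sorted(new_facts)))

      coverage = sum(len(k) for k in knowledge) / (N_AGENTS * N_FACTS)
      print(f"mean knowledge coverage after {STEPS} steps: {coverage:.2f}")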
  • ABSTRACT: Our objective was to identify and examine studies of collaboration in relation to the use of health information technologies (HIT) in the biomedical informatics field. We conducted a systematic literature review of articles through PubMed searches as well as by reviewing a variety of individual journals and proceedings. Our search period was from 1990 to 2015. We identified 98 articles that met our inclusion criteria, excluding articles that were not published in English, did not deal with technology, or did not focus primarily on individuals collaborating. We categorized the studies by technology type, user groups, study location, methodology, processes related to collaboration, and desired outcomes. We identified three major processes (workflow, communication, and information exchange) and two outcomes (maintaining awareness and establishing common ground). Researchers most frequently studied collaboration within hospitals using qualitative methods. Based on our findings, we present the "collaboration space model", a model to help researchers study collaboration and technology in healthcare, and we discuss issues related to collaboration and future research directions. While collaboration is increasingly recognized in the biomedical informatics community as essential to healthcare delivery, it is often discussed only implicitly or intertwined with other similar concepts. In order to evaluate how HIT affects collaboration and how we can build HIT to effectively support collaboration, we need more studies that explicitly focus on collaborative issues. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.006
  • ABSTRACT: The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is an extensive reference terminology with an attendant amount of complexity. It has been updated continuously and revisions have been released semi-annually to meet users' needs and to reflect the results of quality assurance (QA) activities. Two measures based on structural features are proposed to track the effects of both natural terminology growth and QA activities based on aspects of the complexity of SNOMED CT. These two measures, called the structural density measure and accumulated structural measure, are derived based on two abstraction networks, the area taxonomy and the partial-area taxonomy. The measures derive from attribute relationship distributions and various concept groupings that are associated with the abstraction networks. They are used to track the trends in the complexity of structures as SNOMED CT changes over time. The measures were calculated for consecutive releases of five SNOMED CT hierarchies, including the Specimen hierarchy. The structural density measure shows that natural growth tends to move a hierarchy's structure toward a more complex state, whereas the accumulated structural measure shows that QA processes tend to move a hierarchy's structure toward a less complex state. It is also observed that both the structural density and accumulated structural measures are useful tools to track the evolution of an entire SNOMED CT hierarchy and reveal internal concept migration within it. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.08.001
  • ABSTRACT: Automated phenotype identification plays a critical role in cohort selection and bioinformatics data mining. Natural Language Processing (NLP)-informed classification techniques can robustly identify phenotypes in unstructured medical notes. In this paper, we systematically assess the effect of naive, lexically normalized, and semantic feature spaces on classifier performance for obesity, atherosclerotic cardiovascular disease (CAD), hyperlipidemia, hypertension, and diabetes. We train support vector machines (SVMs) using individual feature spaces as well as combinations of these feature spaces on two small training corpora (730 and 790 documents) and a combined (1,520 documents) training corpus. We assess the importance of feature spaces and training data size on SVM model performance. We show that inclusion of semantically informed features does not statistically improve performance for these models. The addition of training data has weak effects of mixed statistical significance across disease classes, suggesting that larger corpora are not necessary to achieve relatively high performance with these models. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.07.016
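    The sketch below shows, on toy data, how individual and combined text feature spaces can be fed into a linear SVM with scikit-learn; the word and character n-gram vectorizers are stand-ins for the naive, normalized, and semantic spaces assessed in the paper, and the example notes and labels are invented.

      # Linear SVM over a union of text feature spaces (scikit-learn).
      from sklearn.pipeline import Pipeline, FeatureUnion
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC

      notes = ["BMI 42, started metformin for type 2 diabetes",
               "no evidence of diabetes; lipids within normal limits"]
      labels = [1, 0]                               # phenotype present / absent

      feature_spaces = FeatureUnion([
          ("words", TfidfVectorizer(ngram_range=(1, 2))),
          ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))),
      ])

      clf = Pipeline([("features", feature_spaces), ("svm", LinearSVC())])
      clf.fit(notes, labels)
      print(clf.predict(["hba1c elevated, diabetes poorly controlled"]))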
  • ABSTRACT: In a number of biological studies, the raw gene expression data are not published, for reasons such as data privacy and patent rights. Instead, most studies provide lists of significant genes with fold change values. However, due to variations in data sources and profiling conditions, only a small number of common significant genes can be found among similar studies. Moreover, traditional gene set based analyses of these genes have not taken the fold change values into account, even though they may be important for distinguishing between different levels of gene significance. Human embryonic stem cell derived cardiomyocytes (hESC-CMs) are a good representative of this category. hESC-CMs, as a potentially unlimited source of human heart cells for regenerative medicine, have attracted the attention of biological and medical researchers. Because of the difficulty and expense of acquiring data, there are only a few related hESC-CM studies, and little hESC-CM gene expression data is available. In view of these challenges, we propose a new Gene Set Enrichment Ensemble (GSEE) approach to perform gene set based analysis on individual studies based on significant up-regulated gene lists with fold change data only. Our approach provides both explicit and implicit ways to utilize the fold change data, in order to make full use of scarce data. We validate our approach with hESC-CM data and fetal heart data, respectively. Experimental results on significant gene lists from different studies illustrate the effectiveness of the proposed approach. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.07.019
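    As a generic illustration (not the GSEE method itself), the sketch below scores the overlap between one study's significant up-regulated gene list and a reference gene set, weighting each overlapping gene by its fold change; the gene names, fold changes, and gene set are invented.

      # Fold-change-weighted overlap between a significant gene list and a gene set.
      significant_genes = {          # gene -> fold change from one hypothetical study
          "MYH6": 8.2, "TNNT2": 6.5, "NKX2-5": 3.1, "GATA4": 2.4, "POU5F1": 2.0,
      }
      cardiac_gene_set = {"MYH6", "TNNT2", "MYL7", "NKX2-5", "TBX5"}

      overlap = set(significant_genes) & cardiac_gene_set
      weighted = sum(significant_genes[g] for g in overlap)
      total = sum(significant_genes.values())

      print(f"overlap: {sorted(overlap)}")
      print(f"fold-change-weighted enrichment fraction: {weighted / total:.2f}")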
  • ABSTRACT: Although full-text articles are provided by publishers in electronic formats, it remains a challenge to find related work beyond the title and abstract. Identifying related articles based on their abstracts is a good starting point; this process is straightforward and does not consume as many resources as full-text-based similarity would require. However, further analyses may require an in-depth understanding of the full content, and two articles with highly related abstracts can differ substantially in their full text. How similarity differs when considering title-and-abstract versus full text, and which semantic similarity metric provides better results when dealing with full-text articles, are the main issues addressed in this manuscript. We benchmarked three similarity metrics (BM25, PMRA, and cosine) in order to determine which one performs best when using concept-based annotations on full-text documents. We also evaluated variations in similarity values based on title-and-abstract against those relying on full text. Our test dataset comprises the Genomics track article collection from the 2005 Text Retrieval Conference. Initially, we used entity recognition software to semantically annotate titles and abstracts, as well as full text, with concepts defined in the Unified Medical Language System (UMLS®). For each article, we created a document profile, i.e., a set of identified concepts with term frequency and inverse document frequency; we then applied the similarity metrics to those document profiles. We considered correlation, precision, recall, and F1 in order to determine which similarity metric performs best with concept-based annotations. For those full-text articles available in PubMed Central Open Access (PMC-OA), we also performed dispersion analyses in order to understand how similarity varies when considering full-text articles. We found that the PubMed Related Articles (PMRA) similarity metric is the most suitable for full-text articles annotated with UMLS concepts. For similarity values above 0.8, all metrics exhibited an F1 around 0.2 and a recall around 0.1; BM25 showed the highest precision, close to 1; in all cases the concept-based metrics performed better than the word-stem-based one. Our experiments show that similarity values vary when considering only title-and-abstract versus full-text similarity. Therefore, analyses based on full text become useful when a given research question requires going beyond title and abstract, particularly regarding connectivity across articles. Visualization available at ljgarcia.github.io/semsim.benchmark/; data available at http://dx.doi.org/10.5281/zenodo.13323. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 08/2015; DOI:10.1016/j.jbi.2015.07.015
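    To illustrate how such document profiles are compared, the sketch below computes cosine similarity between two concept-weight profiles (concept identifier mapped to a tf-idf-style weight); the UMLS concept identifiers and weights are chosen for illustration only.

      # Cosine similarity between concept-based document profiles.
      import math

      def cosine(profile_a, profile_b):
          shared = set(profile_a) & set(profile_b)
          dot = sum(profile_a[c] * profile_b[c] for c in shared)
          norm = math.sqrt(sum(w * w for w in profile_a.values())) * \
                 math.sqrt(sum(w * w for w in profile_b.values()))
          return dot / norm if norm else 0.0

      doc1 = {"C0027051": 0.8, "C0018787": 0.3, "C0011849": 0.1}   # profile from abstract
      doc2 = {"C0027051": 0.5, "C0018787": 0.6, "C0020538": 0.4}   # profile from full text

      print(round(cosine(doc1, doc2), 3))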