Figure 2 - uploaded by Frank Feyerabend
Distribution of participant geolocation. There are 1570 registered participants from 84 unique countries. The majority of contributors are from Europe, North America, China and Australia.
Source publication
Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translate...
Contexts in source publication
Context 1
... were received from a total of 1643 unique scientists around the world, of whom 1570 are registered participants (affiliated consortium members) and 73 contributed anonymously. Figure 2 shows that the contributors to this project originated from diverse geographic locations, including 84 unique countries with the largest clusters located in Europe (586), North America (356), China (265) and Oceania (161). Altogether, the participants annotated 3017 seed articles or 181,020 (3017×60) labelled document pairs. ...
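The totals quoted in this context can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the value 60 is taken from the "3017×60" factor in the text, assumed here to be the number of labelled candidate documents per seed article.

```python
# Figures as quoted in the context above.
registered = 1570
anonymous = 73
seed_articles = 3017
candidates_per_seed = 60  # assumption: the "60" in "3017×60"

total_contributors = registered + anonymous
labelled_pairs = seed_articles * candidates_per_seed

print(total_contributors)  # 1643
print(labelled_pairs)      # 181020
```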
Citations
... Usually, collections are crafted by using some initial pooling strategy of candidate documents for each topic and then showing them to judges who rate whether the document is relevant. RELISH was constructed by using BM25, tf-idf and the PubMed recommender for initial pooling [10]. That explains why the PubMed recommender shows good performance on RELISH. ...
... While some paper recommendation methods just divide relevant and irrelevant documents given by some test collection, as done in [10,21,42,54], our system works on a comprehensive document collection. When evaluating the test collections, we observed two central issues: First, our graph-based approach retrieved many documents that have not been judged in the collection data. ...
Digital libraries provide different access paths, allowing users to explore their collections. For instance, paper recommendation suggests literature similar to some selected paper. Their implementation is often cost-intensive, especially if neural methods are applied. Additionally, it is hard for users to understand or guess why a recommendation should be relevant for them. That is why we tackled the problem from a different perspective. We propose XGPRec, a graph-based and thus explainable method which we integrate into our existing graph-based biomedical discovery system. Moreover, we show that XGPRec (1) can, in terms of computational costs, manage a real digital library collection with 37M documents from the biomedical domain, (2) performs well on established test collections and concept-centric information needs, and (3) generates explanations that proved to be beneficial in a preliminary user study. We share our code so that user libraries can build upon XGPRec.
... This approach relies on machine-learning-based methods (Shojaei and Saneifar, 2021), which can be effective from a machine learning perspective, since more data can mean improved document similarity metrics on average over large datasets (Kusner et al., 2015). Focusing on individuals' annotations of documents has been explored in the context of domain-specific knowledge such as biomedical research papers (Brown and Zhou, 2019), or for specific contexts like document summarization (Zhang et al., 2003). However, to date, little attention has been given to the notion of individualized metrics of similarity that account for biases and constraints specifically, which are highly relevant for educational contexts (Chew and Cerbin, 2021). ...
Cosine similarity between two documents can be computed using token embeddings formed by Large Language Models (LLMs) such as GPT-4, and used to categorize those documents across a range of uses. However, these similarities are ultimately dependent on the corpora used to train these LLMs, and may not reflect subjective similarity of individuals or how their biases and constraints impact similarity metrics. This lack of cognitively-aware personalization of similarity metrics can be particularly problematic in educational and recommendation settings where there is a limited number of individual judgements of category or preference, and biases can be particularly relevant. To address this, we rely on an integration of an Instance-Based Learning (IBL) cognitive model with LLM embeddings to develop the Instance-Based Individualized Similarity (IBIS) metric. This similarity metric is beneficial in that it takes into account individual biases and constraints in a manner that is grounded in the cognitive mechanisms of decision making. To evaluate the IBIS metric, we also introduce a dataset of human categorizations of emails as being either dangerous (phishing) or safe (ham). This dataset is used to demonstrate the benefits of leveraging a cognitive model to measure the subjective similarity of human participants in an educational setting.
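As a rough illustration of the cosine similarity mentioned in this abstract, the sketch below computes it for plain Python lists. The three-dimensional vectors are made up for illustration and are not real LLM embeddings; the function name `cosine_similarity` is likewise an assumption, not part of the IBIS method or any cited code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical document embeddings (illustrative values only).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a
doc_c = [0.0, 0.0, 3.0]  # orthogonal to doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0, which is why the measure reflects direction in embedding space rather than magnitude.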
... The risk and prevalence of dementia are higher among racial and ethnic minoritized groups, including African American/Black and Hispanic/Latino (3, 4, 5). Nevertheless, a systematic review of randomized controlled trials (RCTs) reported that racial and ethnic minoritized communities remain underrepresented in AD/ADRD clinical trials (6). In a series of 6 AD/ADRD cooperative trials, only 5% Hispanic and 6% African American participants were enrolled (7). ...
Background
Despite higher dementia prevalence in racial and ethnic minoritized communities, they are underrepresented in Alzheimer’s disease clinical trials. Community-based recruitment strategies are believed to yield positive outcomes in various fields, such as cancer and cardiovascular clinical trials, but their outcomes in Alzheimer’s Disease and Related Dementias (AD/ADRD) require further study. In this systematic rapid review, we synthesized the available evidence on community-engaged recruitment strategies for enhancing participation in AD/ADRD clinical trials and observational studies.
Methods
We searched seven databases (PubMed, OVID MEDLINE, Cochrane Central Register of Controlled Trials, CINAHL, PsycINFO, Web of Science, and EMBASE) and identified studies describing a community-based recruitment approach for racial and ethnic minoritized communities.
Results
Out of 1915 screened studies, 49 met the inclusion criteria. Most studies employed multiple community-based recruitment approaches, including educational presentations, collaborations with community-based faith organizations, community advisory boards, and engagement with local clinics or health professionals. 52% of studies targeted more than one racial and ethnic minoritized population, primarily African Americans and then Hispanic/Latino. Gaps in knowledge about AD/ADRD, its increased risk among minoritized populations, distrust, and stigma were noted as barriers to research participation. Approximately 50% of the studies specified whether they evaluated their recruitment approaches, and in studies where approaches were evaluated, there was substantial heterogeneity in methods utilized.
Conclusion
The quality of available evidence on the use of community-based recruitment approaches to include racial and ethnic minoritized populations in AD/ADRD research, particularly in clinical trials, is limited. Systematic assessment of recruitment strategies is urgently needed to increase the evidence base around community-engaged recruitment approaches.
Electronic Supplementary Material
Supplementary material is available in the online version of this article at 10.14283/jpad.2024.149.
... Pranata, R. (Pelita Harapan University, Tangerang, Indonesia) is the author with the most literature review publications in Indonesia (106 publications and 3,861 citations). The article "Large expert-curated database for benchmarking document similarity detection in biomedical literature search", in the database of 2019, is the first literature review publication he has published (Brown & Zhou, 2019). The publication with the most citations from Pranata, R. (575 citations) is "Diabetes mellitus is associated with increased mortality and severity of disease in COVID-19 pneumonia - A systematic review, meta-analysis, and meta-regression: Diabetes and COVID-19", Diabetes and Metabolic Syndrome: Clinical Research and Reviews, Vol 14, Issue 4 (Huang et al., 2020). ...
Text data mining ('big data methods') is one of the most widely used approaches during the COVID-19 pandemic, in particular on the Scopus and Web of Science (WoS) databases. It is widely used to collect literature for later bibliometric analysis, which in the end becomes a literature review article. Therefore, in this article, we reveal the trend of publication of literature reviews in Scopus journals from Indonesia, Japan, South Korea, Vietnam, Singapore, and Malaysia. This article has two essential parts: 1) a comparison of international publication trends and subject areas of literature review publications, and 2) a comparison of the Top 5 for Authors, Affiliation, Source Title, and Collaboration Country.
... PubMed similar articles allows a search based on seed articles, i.e., it provides a pre-compiled list of articles that are similar to the given seed article (Lin and Wilbur, 2007). It is a valuable resource with many applications, e.g., for building clusters of articles (Boyack et al., 2020), entity networks (Lee et al., 2016), or similarity-based datasets (Brown et al., 2019; Butzke et al., 2020). ...
... We are only aware of three previous similar evaluations of the PMRA algorithm: (i) the RELISH database, in which the authors compared PMRA to BM25 and TF-IDF for a large collection of more than 180k articles and more than 3k seed articles (Brown et al., 2019); (ii) the SMAFIRA-c dataset, in which an evaluation was carried out for three seed articles (Butzke et al., 2020); and (iii) an evaluation for seven seed articles with the focus on the abstracts' sections (Neves et al., 2019). A couple of previous projects also carried out an evaluation on some of the datasets that we used (Medić and Šnajder, 2022; Mysore et al., 2022). ...
... It is a large database in which more than 180k PubMed abstracts were validated in terms of similarity to a seed article (Brown et al., 2019). We utilized the dataset used in the development of the Aspire tool (Mysore et al., 2022), which is available for download. ...
... More recently, but also noteworthily, Brown et al. 7 described the much larger effort of the RELISH (RElevant LIterature SearcH) consortium to curate over 180,000 articles with respect to relevance (similarity) to a seed article and made the results available as a resource for testing and improving biomedical literature recommender systems. This addresses a more general problem of finding the most relevant literature, which may also be helpful for finding relevant data for comparison with a given dataset. ...
A major obstacle for reusing and integrating existing data is finding the data that is most relevant in a given context. The primary metadata resource is the scientific literature describing the experiments that produced the data. To stimulate the development of natural language processing methods for extracting this information from articles, we have manually annotated 100 recent open access publications in Analytical Chemistry as semantic graphs. We focused on articles mentioning mass spectrometry in their experimental sections, as we are particularly interested in the topic, which is also within the domain of several ontologies and controlled vocabularies. The resulting gold standard dataset is publicly available and directly applicable to validating automated methods for retrieving this metadata from the literature. In the process, we also made a number of observations on the structure and description of experiments and open access publication in this journal.
... The first study presented an evaluation framework (CITREC) [21], which evaluated 35 similarity measures on a PubMed dataset based on a MeSH-based bibliometric indicator. However, the drawback of CITREC is that the MeSH-based indicator is not always reliable for judging article similarity because, for example, a recent gold-standard dataset [22] shows that some articles considered highly similar do not have any overlapping MeSH terms. Moreover, our statistics on the PubMed literature database show that 14% of articles do not have the MeSH metadata, and the number of MeSH terms assigned to biomedical articles varies over a large range. ...
... Our research differs from the existing studies in two aspects. Firstly, we evaluate all the methods with the same experimental settings on the same datasets: the Relevant Literature Search Consortium (RELISH) dataset [22] and the Text REtrieval Conference (TREC) 2005 Genomics dataset [24], which we will introduce in Section 4. Secondly, we also evaluate different text representation techniques in addition to these existing methods and systems. The text representation techniques we evaluate include word-level representation models (e.g., fastText [25], BioWordVec [26]), sentence-level representation models (e.g., InferSent [27], Sent2Vec [28]), document-level representation models (e.g., LDA [29], Doc2Vec [30]), and the BERT-based models (e.g., AllenAI's SPECTER [31], BioBERT [32]). ...
... Regarding the quality, the dataset has been rigorously evaluated. For example, the authors show that there is no systematic bias observed among annotators with different levels of background, and the scores judged by different annotators are quite stable [22]. ...
Background
Biomedical sciences, with their focus on human health and disease, have attracted unprecedented attention in the 21st century. The proliferation of biomedical sciences has also led to a large number of scientific articles being produced, which makes it difficult for biomedical researchers to find relevant articles and hinders the dissemination of valuable discoveries. To bridge this gap, the research community has initiated the article recommendation task, with the aim of recommending articles to biomedical researchers automatically based on their research interests. Over the past two decades, many recommendation methods have been developed. However, an algorithm-level comparison and rigorous evaluation of the most important methods on a shared dataset is still lacking.
Method
In this study, we first investigate 15 methods for automated article recommendation in the biomedical domain. We then conduct an empirical evaluation of the 15 methods, including six term-based methods, two word embedding methods, three sentence embedding methods, two document embedding methods, and two BERT-based methods. These methods are evaluated in two scenarios: article-oriented recommenders and user-oriented recommenders, with two publicly available datasets: TREC 2005 Genomics and RELISH, respectively.
Results
Our experimental results show that the text representation models BERT and BioSenVec outperform many existing recommendation methods (e.g., BM25, PMRA, XPRC) and web-based recommendation systems (e.g., MScanner, MedlineRanker, BioReader) on both datasets regarding most of the evaluation metrics, and fine-tuning can improve the performance of the BERT-based methods.
Conclusions
Our comparison study is useful for researchers and practitioners in selecting the best modeling strategies for building article recommendation systems in the biomedical domain. The code and datasets are publicly available.
... Digital Healthcare and Clinical Health Records ML can learn from almost any data type, even unstructured medical text, such as patient records, medical notes, prescriptions, audio interview transcripts, or pathology and radiology reports. Future day-to-day applications will embrace ML methods to organize a growing volume of scientific literature, facilitating access and extraction of meaningful knowledge content from it [24]. In the clinic, ML can harness the potential of electronic health records to accurately predict medical events [25]. ...
Purpose of Review
We critically evaluate the future potential of machine learning (ML), deep learning (DL), and artificial intelligence (AI) in precision medicine. The goal of this work is to show progress in ML in digital health, to exemplify future needs and trends, and to identify any essential prerequisites of AI and ML for precision health.
Recent Findings
High-throughput technologies are delivering growing volumes of biomedical data, such as large-scale genome-wide sequencing assays; libraries of medical images; or drug perturbation screens of healthy, developing, and diseased tissue. Multi-omics data in biomedicine is deep and complex, offering an opportunity for data-driven insights and automated disease classification. Learning from these data will open our understanding and definition of healthy baselines and disease signatures. State-of-the-art applications of deep neural networks include digital image recognition, single-cell clustering, and virtual drug screens, demonstrating the breadth and power of ML in biomedicine.
Summary
Significantly, AI and systems biology have embraced big data challenges and may enable novel biotechnology-derived therapies to facilitate the implementation of precision medicine approaches.