The t-SNE visualization of the text vectors of the training data. (A) TF-IDF text vector visualization; (B) S2V text vector visualization.

Source publication
Article
Full-text available
Drug-induced liver injury describes the adverse effects of drugs that damage the liver. Life-threatening outcomes have been reported in severe cases. Liver toxicity is therefore an important assessment for new drug candidates. These reports are documented in research papers that contain preliminary in vitro and in vivo experiments. Conventional...

Contexts in source publication

Context 1
... non-linear dimensionality reduction method: t-distributed stochastic neighbour embedding (t-SNE), which has been shown effective in visualizing high-dimensional data [20], [28]. The results show that the positive samples and the negative samples cluster separately, indicating the potential feasibility of classifying the DILI-positive samples (Fig. 1). It should be noted that the t-SNE visualization is completely unsupervised and the clusters shown are labelled with the ground truth labels from the ...
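For orientation, below is a minimal sketch of how such an unsupervised t-SNE projection can be produced with scikit-learn. The embedding matrix, labels, and perplexity value are illustrative placeholders, not the paper's actual data or settings.

```python
# Minimal sketch of an unsupervised t-SNE projection of text vectors,
# coloured afterwards by the ground-truth labels (as the context describes).
# `X` stands in for an (n_samples, n_features) embedding matrix (e.g. TF-IDF
# or S2V vectors); `y` stands in for the DILI-positive/negative labels.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # placeholder for real text vectors
y = rng.integers(0, 2, size=200)    # placeholder ground-truth labels

# t-SNE itself never sees the labels -- the projection is fully unsupervised.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="coolwarm", s=8)
plt.title("t-SNE of text vectors (coloured by ground-truth label)")
plt.show()
```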
Context 2
... hyperparameter tuning in the five-fold cross validation on the training data, the best strengths of the L2 penalty were 10 for the BOW model, 0.1 for the TF-IDF, W2V1 and W2V2 models, and 1 for the S2V model. Word stemming was used for the BOW and TF-IDF models but not for the W2V1, W2V2 and S2V models. The performance on the validation data is shown in Fig. 1. The results show that, apart from the ensemble learning model, TF-IDF outperformed the other models with the highest AUROC (0.990), accuracy (0.957), AUPRC (0.990), and F1-score (0.958) (Fig. 2). The RF models did not outperform the LR models and were therefore neither shown nor used in our ensemble ...
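A minimal sketch of such a five-fold cross-validated penalty search with scikit-learn follows. Note that scikit-learn's `C` parameter is the *inverse* regularisation strength, so a penalty strength of 10 would correspond to C = 0.1 under that assumed mapping; the grid and scoring metric are illustrative, not the paper's exact configuration.

```python
# Sketch of a five-fold cross-validated search over the L2 penalty strength.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}  # inverse penalty strengths
search = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid,
    cv=5,               # five-fold cross validation, as in the context above
    scoring="roc_auc",  # AUROC, one of the reported metrics
)
# search.fit(X_train, y_train)   # X_train/y_train: training vectors and labels
# best_strength = 1.0 / search.best_params_["C"]
```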

Citations

... In biomedical image analysis, for instance, training an effective supervised learning model often requires experienced radiologists annotating thousands of radiological images (e.g., X-rays, CT scans) to ensure accuracy [2,6]. Similarly, to train an effective model for biomedical natural language processing, experts need to read through thousands of free-text notes to generate text classifications [7]. This intensive manual annotation is prone to human error, introducing labeling noise as a common issue. ...
... To investigate a broad range of medical data modalities and different scales of feature dimensionality, in this study, we tested the ICP-based training data cleaning method on three classification tasks: 1) a natural language processing task: filter drug-induced liver injury (DILI) literature based on word2vec (W2V) and sent2vec (S2V) embeddings [7]; 2) an imaging and electronic health record task: predict whether a COVID-19 patient in the general ward will be admitted to the intensive care unit (ICU) [2]; 3) an RNA-sequencing (RNA-seq) task: classify breast cancer subtypes based on The Cancer Genome Atlas Program (TCGA) RNA-seq dataset [30]. The details of these datasets are introduced in Table 1 and the following paragraphs. ...
... The DILI dataset was released by the Annual International Conference on Critical Assessment of Massive Data Analysis (CAMDA 2021), with the DILI-positive samples (7,177) and DILI-negative samples (7,026) curated by FDA experts. The data include both the title and abstract of each publication, and the task is to predict whether a publication contains DILI information. ...
Article
Full-text available
Accurately labeling large datasets is important for biomedical machine learning yet challenging, and modern data augmentation methods may introduce noise into the training data, which can deteriorate machine learning model performance. Existing approaches addressing noisy training data typically rely on strict modeling assumptions, specific classification models, and well-curated datasets. To address these issues, we propose a novel reliability-based training-data-cleaning method employing inductive conformal prediction (ICP). This method uses a small set of well-curated training data and leverages ICP-calculated reliability metrics to selectively correct mislabeled data and outliers within vast quantities of noisy training data. Its efficacy is validated across three classification tasks with distinct modalities: filtering drug-induced-liver-injury (DILI) literature with free-text title and abstract, predicting ICU admission of COVID-19 patients through CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise were introduced into the training labels via label permutation. Our training-data-cleaning method significantly enhanced the downstream classification performance (paired t-tests, p ≤ 0.05 among 30 random train/test partitions): significant accuracy enhancement in 86 out of 96 DILI experiments (up to 11.4% increase from 0.812 to 0.905), significant AUROC and AUPRC enhancements in all 48 COVID-19 experiments (up to 23.8% increase from 0.597 to 0.739 for AUROC, and 69.8% increase from 0.183 to 0.311 for AUPRC), and significant accuracy and macro-average F1-score improvements in 47 out of 48 RNA-sequencing experiments (up to 74.6% increase from 0.351 to 0.613 for accuracy, and 89.0% increase from 0.267 to 0.505 for F1-score). The improvement can be both statistically and clinically significant for information retrieval, disease diagnosis and prognosis. The method offers the potential to substantially boost classification performance in biomedical machine learning tasks without necessitating an excessive volume of well-curated training data or the strong data distribution and modeling assumptions of existing semi-supervised learning methods.
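The abstract describes the method only at a high level. As a rough illustration of the underlying inductive-conformal-prediction machinery, the sketch below computes ICP p-values against a small well-curated calibration set and flags suspect labels in the noisy set. The nonconformity score (1 − predicted probability), the thresholds, and all variable names are assumptions for illustration, not the authors' exact procedure.

```python
# Rough sketch of ICP-based label-reliability scoring (assumed details).
import numpy as np
from sklearn.linear_model import LogisticRegression

def icp_p_values(clf, X_cal, y_cal, X_noisy):
    """ICP p-value of every candidate class for every noisy sample.
    y_cal must be integer class indices."""
    n_cal = len(y_cal)
    # Nonconformity of each calibration sample w.r.t. its true label.
    cal_scores = 1.0 - clf.predict_proba(X_cal)[np.arange(n_cal), y_cal]
    probs = clf.predict_proba(X_noisy)
    p = np.empty_like(probs)
    for k in range(probs.shape[1]):
        scores_k = 1.0 - probs[:, k]  # nonconformity if the label were k
        p[:, k] = ((cal_scores[None, :] >= scores_k[:, None]).sum(axis=1) + 1) / (n_cal + 1)
    return p

# Illustrative usage (X_clean/y_clean, X_cal/y_cal, X_noisy/y_noisy are placeholders):
# clf = LogisticRegression(max_iter=1000).fit(X_clean, y_clean)
# p = icp_p_values(clf, X_cal, y_cal, X_noisy)
# Flag a sample when its given label looks unreliable but another class is credible:
# suspect = (p[np.arange(len(y_noisy)), y_noisy] < 0.05) & (p.max(axis=1) > 0.5)
```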
... Some studies have pointed out that a few keywords can express the general content of a text, so it is possible to understand the key content of the text through keywords. The Term Frequency Inverse Document Frequency (TF-IDF) method in the semantic feature determination framework is applied to partition the weights of keywords [25]. The calculation expression is shown in equation (2). ...
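The referenced equation (2) is not reproduced in the snippet; for orientation, the conventional TF-IDF weighting that such frameworks typically use has the form:

```latex
% Conventional TF-IDF weighting (the snippet's equation (2) is not shown;
% this standard form is given for orientation only).
\mathrm{tfidf}(t, d, D) \;=\; \mathrm{tf}(t, d)\,\times\,
  \log\frac{N}{\lvert\{\,d' \in D : t \in d'\,\}\rvert}
```

where tf(t, d) is the frequency of term t in document d, N is the total number of documents in the corpus D, and the denominator counts the documents containing t.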
Article
Full-text available
To address the lack of scientific rigor and the poor reusability of experience in traditional engineering scheme decision-making, which increase the time and cost of pre-decision-making, this study first uses case-based reasoning (CBR) and ontology to construct a solution library for an engineering scheme decision-making system, standardizing the cases using methods such as eigenfrequency. In addition, a retrieval mechanism based on residual similarity is designed to achieve effective retrieval of similar cases. The experimental outcomes showed that the resource utilization rate of the traditional scheme was 75% before implementation but decreased to 72% after implementation, a decrease of 3 percentage points, whereas the resource utilization rate of the decision-making system scheme rose from 75% before implementation to 80% after implementation, an increase of 5 percentage points. The results indicated that the decision system scheme designed in this research performed better in terms of resource utilization, could use resources more efficiently, and reduced waste. The average decision accuracy of the integrated CBR and BIM system was 92%, significantly higher than the 84% of traditional decision systems. The CBR technology improved the scientific rigor and reliability of decision-making through continuous updating and optimization of the case library.
... Queries such as "persistent depressive disorder" or "SSRIs" were matched against the indexed terms, returning relevant documents. Following this, the document ranking process occurred, where documents were ranked based on their relevance to the search terms using criteria like term frequency-inverse document frequency (TF-IDF) [22], term proximity, and term density. Once the documents were ranked, the system moved to candidate extraction, where depression-related entities, including symptoms, treatments, and diagnoses, were identified. ...
Preprint
Full-text available
Depression is a multifaceted mental health disorder that necessitates accurate identification of symptoms, treatments, and comorbidities for effective diagnosis and treatment planning. This paper introduces a hybrid approach to Biomedical Entity Linking (BEL) for depression detection by combining full-text search and advanced natural language processing (NLP) techniques using vector embedding models. We leveraged models like BioBERT, BioWordVec, BlueBERT, FastText, MetaMap, and Llama to improve the linking of depression-related entities in unstructured clinical texts to structured knowledge bases such as the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) and Unified Medical Language System (UMLS). Additionally, we propose a novel Depression Entity Relevance Ranker (DERR), which combines Token Set Ratio, Jaro-Winkler Similarity, and cosine similarity of embeddings to ensure accurate ranking of entities by contextual relevance. This hybrid approach addresses ambiguities and variations in depression-related terminology, significantly enhancing the accuracy of entity linking. The system achieved an overall accuracy of 84%, with a Mean Reciprocal Rank (MRR) of 0.92 and Hits@5 of 95%, demonstrating its practical value for clinical decision support systems and mental health research.
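The abstract names the three signals that DERR combines; a rough sketch of such a blended relevance score appears below. The rapidfuzz library, the equal weighting, and the function names are assumptions for illustration only — the paper's exact combination rule is not given in the snippet.

```python
# Sketch of a combined relevance score in the spirit of the DERR ranker:
# Token Set Ratio + Jaro-Winkler similarity + embedding cosine similarity.
# Equal weights are a hypothetical choice, not the paper's.
import numpy as np
from rapidfuzz import fuzz
from rapidfuzz.distance import JaroWinkler

def relevance(mention: str, candidate: str,
              mention_vec: np.ndarray, candidate_vec: np.ndarray) -> float:
    token_set = fuzz.token_set_ratio(mention, candidate) / 100.0  # scale to 0..1
    jaro = JaroWinkler.similarity(mention, candidate)             # already 0..1
    cosine = float(mention_vec @ candidate_vec /
                   (np.linalg.norm(mention_vec) * np.linalg.norm(candidate_vec)))
    return (token_set + jaro + cosine) / 3.0  # hypothetical equal weighting
```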
... In natural language processing, it is common to employ pre-processing before the actual text processing, such as classification. Examples of these techniques include stopword removal [53]-[55], lemmatization [56], punctuation removal [53], [55], [57] and lowercasing [53]-[55], [57]. These methods help overcome minor variations between identical words or sentences, reducing noise in the data. ...
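A minimal sketch of the pre-processing steps listed above (lowercasing, punctuation removal, stopword removal) follows. The stopword list is a tiny illustrative sample; in practice one would use a full list and a lemmatizer such as NLTK's WordNetLemmatizer.

```python
# Minimal text pre-processing sketch: lowercase, strip punctuation,
# drop stopwords. The stopword set here is illustrative only.
import string

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # illustrative sample

def preprocess(text: str) -> list[str]:
    text = text.lower()                                               # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    return [tok for tok in text.split() if tok not in STOPWORDS]      # stopword removal

print(preprocess("Drug-Induced Liver Injury (DILI) is an adverse reaction."))
# ['druginduced', 'liver', 'injury', 'dili', 'adverse', 'reaction']
```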
Preprint
Full-text available
Companies support their customers using live chats and chatbots to gain their loyalty. AFAS is a Dutch company aiming to leverage the opportunity large language models (LLMs) offer to answer customer queries with minimal to no input from its customer support team. Adding to the complexity, it is unclear what makes a response correct, particularly in Dutch. Further, with minimal data available for training, the challenge is to identify whether an answer generated by a large language model is correct, and to do so on the fly. This study is the first to define the correctness of a response based on how the support team at AFAS makes decisions. It leverages literature on natural language generation and automated answer grading systems to automate the decision-making of the customer support team. We investigated questions requiring a binary response (e.g., Would it be possible to adjust tax rates manually?) or instructions (e.g., How would I adjust tax rates manually?) to test how closely our automated approach matches the support team's ratings. Our approach can identify wrong messages in 55% of the cases. This work shows the viability of automatically assessing when our chatbot tells lies.
... To reduce overfitting, Lasso regularization is adopted during logistic regression, which encourages sparse coefficient values, making feature selection more explicit [16]. The logistic regression model with L1 regularization is represented as follows with the defined objective function: ...
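The objective function itself is truncated in the snippet; the standard L1-regularised (Lasso) logistic regression objective it presumably refers to is:

```latex
% Standard L1-regularised logistic regression objective; the snippet's own
% equation is truncated, so this conventional form is shown instead.
\min_{\beta_0,\,\beta}\; -\sum_{i=1}^{n}\Big[\,y_i\log\sigma(\beta_0 + x_i^{\top}\beta)
  + (1 - y_i)\log\big(1 - \sigma(\beta_0 + x_i^{\top}\beta)\big)\Big]
  \;+\; \lambda\lVert\beta\rVert_1,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

where λ controls the strength of the sparsity-inducing penalty.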
Article
Full-text available
Drug-induced liver injury (DILI) poses a significant challenge for the pharmaceutical industry and regulatory bodies. Despite extensive toxicological research aimed at mitigating DILI risk, the effectiveness of these techniques in predicting DILI in humans remains limited. Consequently, researchers have explored novel approaches and procedures to enhance the accuracy of DILI risk prediction for drug candidates under development. In this study, we leveraged a large human dataset to develop machine learning models for assessing DILI risk. The performance of these prediction models was rigorously evaluated using a 10-fold cross-validation approach and an external test set. Notably, the random forest (RF) and multilayer perceptron (MLP) models emerged as the most effective in predicting DILI. During cross-validation, RF achieved an average prediction accuracy of 0.631, while MLP achieved the highest Matthews Correlation Coefficient (MCC) of 0.245. To validate the models externally, we applied them to a set of drug candidates that had failed in clinical development due to hepatotoxicity. Both RF and MLP accurately predicted the toxic drug candidates in this external validation. Our findings suggest that in silico machine learning approaches hold promise for identifying DILI liabilities associated with drug candidates during development.
... In our current DILI study, binary classification has been considered, and logistic regression is exemplified for DILI classification [18,19]. Once again, let x be the input feature vector of dimension d, and let y be the binary class variable with labels 0 and 1. ...
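A minimal sketch of this binary setup, assuming TF-IDF features and scikit-learn (neither is mandated by the snippet); the toy corpus and labels are placeholders:

```python
# Binary DILI classification sketch: TF-IDF features x of dimension d,
# binary label y in {0, 1}, logistic regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles_abstracts = ["...paper title and abstract text...", "..."]  # placeholder corpus
labels = [1, 0]                                  # 1 = DILI-positive, 0 = DILI-negative

clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),       # x: TF-IDF features of dimension d
    LogisticRegression(max_iter=1000),           # models P(y = 1 | x) = sigma(w.x + b)
)
# clf.fit(titles_abstracts, labels)
# clf.predict_proba(["new title and abstract"])[:, 1]   # predicted DILI probability
```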
Preprint
Full-text available
Drug-induced liver injury (DILI) remains a significant challenge for the pharmaceutical industry and regulatory organizations. Despite a plethora of toxicological research aimed at estimating the risk of DILI, the efficacy of these techniques in predicting DILI in humans has remained limited. This has prompted the exploration of new approaches and procedures to improve the prediction accuracy of DILI risk for drug candidates in development. This study aimed to address this gap by leveraging a large human dataset to develop machine learning models for assessing DILI risk. The performance of the developed prediction models was extensively evaluated using a 10-fold cross-validation approach and two external test sets. Our study revealed that the Random Forest (RF) and MultiLayer Perceptron (MLP) models emerged as among the most effective in predicting DILI. RF outperformed other machine learning strategies, reaching an average prediction accuracy of 63.10% during the cross-validation, while the MLP achieved the highest Matthews Correlation Coefficient (MCC) of 0.245. These two models were further validated externally by a set of drug candidates that failed in clinical development due to DILI. Both models accurately predicted 90.9% of the toxic drug candidates in the external validation. Our study suggests that in silico machine learning approaches have the potential to significantly enhance the identification of DILI liabilities associated with drug candidates in development.
... This diversity in FEHMs and assessment approaches complicates the process of cross-comparison among different MLHMs. Unlike fields such as computer vision and natural language processing, where public datasets, shared hold-out test sets, and universally accepted metrics facilitate benchmarking, our field lacks these standardized tools [50]. However, within the constraints [24] ...
Article
Machine learning head models (MLHMs) are developed to estimate brain deformation from sensor-based kinematics for early detection of traumatic brain injury (TBI). However, overfitting to simulated impacts and the decreasing accuracy caused by distributional shift across different head impact datasets hinder the broad clinical application of current MLHMs. We propose a new MLHM configuration that integrates unsupervised domain adaptation with a deep neural network to predict whole-brain maximum principal strain (MPS) and MPS rate (MPSR). With 12,780 simulated head impacts, we performed unsupervised domain adaptation on target head impacts from 302 college football (CF) impacts and 457 mixed martial arts (MMA) impacts using domain regularized component analysis (DRCA) and cycle-GAN-based methods. The new model improved the MPS/MPSR estimation accuracy, with the DRCA method outperforming the other domain adaptation methods in prediction accuracy: MPS mean absolute error (MAE): 0.017 (CF) and 0.020 (MMA); MPSR MAE: 4.09 s⁻¹ (CF) and 6.61 s⁻¹ (MMA). On another two hold-out test sets with 195 college football impacts and 260 boxing impacts, the DRCA model outperformed the baseline model without domain adaptation in MPS and MPSR estimation MAE. The DRCA domain adaptation approach reduces the MPS/MPSR estimation error to well below previously reported TBI thresholds, enabling accurate brain deformation estimation to detect TBI in future clinical applications.
... In the domain of DILI studies, NLP models have proven to be valuable tools for extracting insights from textual sources. Zhan et al. (2022) developed NLP techniques specifically for biomedical texts, allowing the automated processing of 28,000 titles and abstracts retrieved from the PubMed database. By comparing five different text embedding techniques, they found that the model using term frequency-inverse document frequency and logistic regression performed best, with an accuracy of 0.957 on the validation set. ...
Article
Full-text available
Drug-induced liver injury (DILI) is a severe adverse reaction caused by drugs and may result in acute liver failure and even death. Many efforts have centered on mitigating risks associated with potential DILI in humans. Among these, quantitative structure-activity relationship (QSAR) was proven to be a valuable tool for early-stage hepatotoxicity screening. Its advantages include no requirement for physical substances and rapid delivery of results. Deep learning (DL) made rapid advancements recently and has been used for developing QSAR models. This review discusses the use of DL in predicting DILI, focusing on the development of QSAR models employing extensive chemical structure datasets alongside their corresponding DILI outcomes. We undertake a comprehensive evaluation of various DL methods, comparing them with traditional machine learning (ML) approaches, and explore the strengths and limitations of DL techniques regarding their interpretability, scalability, and generalization. Overall, our review underscores the potential of DL methodologies to enhance DILI prediction and provides insights into future avenues for developing predictive models to mitigate DILI risk in humans.
Keywords: drug-induced liver injury (DILI), machine learning, deep learning, drug safety, predictive model
... The results showed that TF-IDF dominates other approaches with respect to the AUROC and AUPRC values, suggesting that simple and interpretable approaches can, in fact, outperform more sophisticated solutions. In similar research dedicated specifically to DILI (Zhan et al., 2022), the authors leverage the same text processing techniques, such as BOW, W2V, and TF-IDF, but use Sentence2Vec instead of D2V. In addition, the paper employs Random Forests alongside logistic regression for classification. ...
Article
Full-text available
Introduction: Scientific articles serve as vital sources of biomedical information, but with the yearly growth in publication volume, processing such vast amounts of information has become increasingly challenging. This difficulty is particularly pronounced when it requires the expertise of highly qualified professionals. Our research focused on domain-specific article classification to determine whether articles contain information about drug-induced liver injury (DILI). DILI is a clinically significant condition and one of the reasons for drug registration failures. The rapid and accurate identification of drugs that may cause such conditions can prevent side effects in millions of patients. Methods: A text classification method can help regulators, such as the FDA, identify evidence of potential DILI for specific drugs much faster and at massive scale. In our study, we compared several text classification methodologies, including transformers, LSTMs, information theory, and statistics-based methods. We devised a simple and interpretable text classification method that is as fast as Naïve Bayes while delivering superior performance for topic-oriented text categorisation. Moreover, we revisited techniques and methodologies to handle data imbalance. Results: Transformers achieve the best results when the class distribution and semantics of the test data match those of the training set. In cases of imbalanced data, however, simple statistics- and information-theory-based models can surpass complex transformers, yielding the more interpretable results that are so important in the biomedical domain. As our results show, neural networks can achieve better results if they are pre-trained on domain-specific data and the loss function is designed to reflect the class distribution. Discussion: Overall, transformers are a powerful architecture; however, in certain cases, such as topic classification, their use can be redundant, and simple statistical approaches can achieve comparable results while being much faster and more explainable. Nevertheless, we see potential in combining results from both worlds. The development of new neural network architectures, loss functions and training procedures that bring stability to imbalanced data is a promising direction.
... The Critical Assessment of Massive Data Analysis (CAMDA) 2022 conference, in collaboration with the Intelligent Systems for Molecular Biology (ISMB) conference, hosted the Literature AI for Drug Induced Liver Injury (DILI) challenge (Zhan et al., 2022b). A curated dataset, consisting of 277,016 DILI-annotated papers, was downloaded from the CAMDA website. ...
Article
Full-text available
Drug-induced liver injury (DILI) is an adverse hepatic drug reaction that can potentially lead to life-threatening liver failure. Previously published work in the scientific literature on DILI has provided valuable insights for the understanding of hepatotoxicity as well as drug development. However, the manual search of scientific literature in PubMed is laborious and time-consuming. Natural language processing (NLP) techniques along with artificial intelligence/machine learning approaches may allow for automatic processing in identifying DILI-related literature, but useful methods are yet to be demonstrated. To address this issue, we have developed an integrated NLP/machine learning classification model to identify DILI-related literature using only paper titles and abstracts. For prediction modeling, we used 14,203 publications provided by the Critical Assessment of Massive Data Analysis (CAMDA) challenge, employing word vectorization techniques in NLP in conjunction with machine learning methods. Classification modeling was performed using 2/3 of the data for training and the remainder for testing in internal validation. The best performance was achieved using a linear support vector machine (SVM) model on the combined vectors derived from term frequency-inverse document frequency (TF-IDF) and Word2Vec, resulting in an accuracy of 95.0% and an F1-score of 95.0%. The final SVM model constructed from all 14,203 publications was tested on independent datasets, resulting in accuracies of 92.5%, 96.3%, and 98.3%, and F1-scores of 93.5%, 86.1%, and 75.6% for three test sets (T1-T3). Furthermore, the SVM model was tested on four external validation sets (V1-V4), resulting in accuracies of 92.0%, 96.2%, 98.3%, and 93.1%, and F1-scores of 92.4%, 82.9%, 75.0%, and 93.3%.
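As a rough illustration of the winning configuration the abstract describes (a linear SVM on TF-IDF vectors concatenated with Word2Vec-derived document vectors), here is a minimal sketch. The toy corpus, vector dimensions, and the averaging of word vectors into document vectors are assumptions, not the authors' exact pipeline.

```python
# Sketch: linear SVM on combined TF-IDF + (averaged) Word2Vec document vectors.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from gensim.models import Word2Vec

docs = ["drug induced liver injury case report", "unrelated oncology study"]  # toy corpus
tokens = [d.split() for d in docs]

tfidf = TfidfVectorizer().fit_transform(docs)                  # sparse TF-IDF matrix
w2v = Word2Vec(sentences=tokens, vector_size=100, min_count=1)
# One document vector per paper: the mean of its word vectors (assumed pooling).
doc_vecs = np.array([np.mean([w2v.wv[t] for t in toks], axis=0) for toks in tokens])

X = hstack([tfidf, csr_matrix(doc_vecs)])                      # combined feature vectors
y = [1, 0]                                                     # toy DILI labels
svm = LinearSVC().fit(X, y)
```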