
An interpretable diagnostic approach for lung cancer: Combining maximal clique and improved BERT


Abstract

Lung cancer incidence and mortality in China have long been high, and owing to limited professional expertise, misdiagnoses and missed diagnoses of lung cancer occur frequently. To improve diagnostic accuracy, this paper proposes an interpretable diagnostic method for lung cancer based on Chinese electronic medical records (EMRs). First, to overcome the difficulty of word segmentation in the clinical texts of Chinese EMRs, a dictionary construction method based on the idea of maximal cliques is proposed, and 730 professional medical terms related to lung diseases are identified. Then, the ProbSparse self-attention mechanism and self-attention distilling operation from Informer are used to improve Bidirectional Encoder Representations from Transformers (BERT), enabling the representation of long clinical texts with lower time complexity and memory consumption. Finally, a convolutional neural network with an attention mechanism processes the representation results to produce interpretable lung cancer predictions. The method is applied to the lung cancer diagnosis of inpatients at a tertiary hospital in Hunan Province, achieving about 0.9 for both the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). In addition, comparative analyses against existing dictionaries, word embedding methods and diagnostic methods further confirm the superiority of the proposed method: it improves precision by at least 6%, recall by at least 2.6%, F1 score by at least 5.2%, AUROC by at least 7.3% and AUPRC by at least 7.7% over these state-of-the-art methods.
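The dictionary construction step builds on maximal cliques. As a rough illustration of that graph-theoretic primitive only (the paper's actual co-occurrence graph and term-scoring details are not reproduced here, and the toy graph below is invented), a minimal Bron–Kerbosch enumeration:

```python
# Illustrative Bron–Kerbosch enumeration of maximal cliques in an undirected
# graph, the primitive behind maximal-clique-based dictionary construction.
# The co-occurrence graph over term fragments below is hypothetical.

def maximal_cliques(adj):
    """Yield every maximal clique of a graph given as {node: set(neighbours)}."""
    def expand(r, p, x):
        # A clique r is maximal when no candidate (p) or excluded (x) node extends it.
        if not p and not x:
            yield r
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    yield from expand(set(), set(adj), set())

# Toy graph: three mutually co-occurring fragments and one isolated term.
adj = {
    "肺": {"癌", "结节"},
    "癌": {"肺", "结节"},
    "结节": {"肺", "癌"},
    "咳嗽": set(),
}
cliques = sorted(sorted(c) for c in maximal_cliques(adj))
# One triangle clique (肺/癌/结节) and one singleton clique (咳嗽).
```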
ORIGINAL ARTICLE

Zi-yu Chen¹ | Fei Xiao¹ | Xiao-kang Wang¹,² | Wen-hui Hou¹ | Rui-lu Huang¹ | Jian-qiang Wang¹

¹ School of Business, Central South University, Changsha, People's Republic of China
² College of Management, Shenzhen University, Shenzhen, People's Republic of China

Correspondence: Jian-qiang Wang and Rui-lu Huang, School of Business, Central South University, Changsha 410083, People's Republic of China. Email: jqwang@csu.edu.cn; rlhuang@csu.edu.cn

Funding information: Natural Science Foundation of Hunan Province
KEYWORDS
Chinese electronic medical records, Informer, interpretable diagnostic method, lung cancer, maximal clique, term discovery
1 | INTRODUCTION
Lung cancer, one of the top 10 cancers in the world, accounts for 11.7% of cancer cases (Cao & Chen, 2021). Moreover, its mortality is extremely high, with 1.8 million deaths due to lung cancer worldwide in 2020 (Cao & Chen, 2021). In China, there were 820 thousand new lung cancer cases and 710 thousand lung cancer deaths in 2020 (Liu, Li, et al., 2021), figures that far exceed the global average. The economic losses caused by lung cancer are also huge: in 2017, the total economic burden of lung cancer in China was estimated at $25.069 billion (Liu, Shi, et al., 2021).
Received: 26 November 2022 | Revised: 26 February 2023 | Accepted: 4 April 2023
DOI: 10.1111/exsy.13310
Expert Systems. 2023;40:e13310. wileyonlinelibrary.com/journal/exsy © 2023 John Wiley & Sons Ltd.
... The continuous increase in the number of motor vehicles has brought many problems to society, such as traffic congestion, wasted resources, economic losses, excessive commuting times, and frequent traffic accidents. In addition, the pollution caused by the large number of cars may threaten human health [1]. Since traffic flow reflects the number of vehicles that pass a point in a certain period of time [2], accurate traffic flow forecasting is of great significance to management departments and individuals, as it can optimize the design and operation of transportation systems. ... proposed a method to predict the spatio-temporal characteristics of short-term traffic flow by combining the k-nearest neighbor algorithm and a bidirectional long short-term memory network model. ...
Article
Full-text available
Since traffic congestion during peak hours has become the norm in daily life, research on short-term traffic flow forecasting, which can help alleviate urban traffic congestion, has attracted widespread attention. However, existing research ignores the uncertainty of short-term traffic flow forecasting, which affects the accuracy and robustness of traffic flow forecasting models. Therefore, this paper proposes a short-term traffic flow forecasting algorithm that combines the cloud model and a fuzzy inference system in an uncertain environment, using the idea of the cloud model to process traffic flow data while describing both its randomness and its fuzziness. First, the fuzzy c-means algorithm is used to cluster the original traffic flow data, yielding the number and parameter values of the system's initial membership functions. Then, based on the cloud reasoning algorithm and the cloud rule generator, an improved fuzzy reasoning system is proposed for short-term traffic flow prediction. The reasoning system can not only capture the uncertainty of traffic flow data but also describe temporal dependencies well. Finally, experimental results indicate that the proposed model has better prediction accuracy and stability, reducing RMSE by 0.6106, MAE by 0.281 and MRE by 0.0022 relative to the next-best comparison method.
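The RMSE, MAE and MRE figures quoted above follow the standard forecast-error definitions; a minimal sketch with made-up observed/forecast values:

```python
import math

def forecast_errors(y_true, y_pred):
    """Return (RMSE, MAE, MRE) for paired observed and forecast values."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    # Mean relative error; assumes strictly positive observations.
    mre = sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / n
    return rmse, mae, mre

# Hypothetical traffic counts (vehicles per interval) and forecasts.
obs = [120.0, 100.0, 80.0]
pred = [110.0, 105.0, 84.0]
rmse, mae, mre = forecast_errors(obs, pred)
```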
Article
Full-text available
The diagnosis of lung cancer in ambulatory settings is often challenging due to non-specific clinical presentation, but there are currently no clinical quality measures (CQMs) in the United States used to identify areas for practice improvement in diagnosis. We describe the pre-diagnostic time intervals among a retrospective cohort of 711 patients identified with primary lung cancer from 2012–2019 from ambulatory care clinics in Seattle, Washington, USA. Electronic health record data were extracted for two years prior to diagnosis, and Natural Language Processing (NLP) was applied to identify symptoms/signs from free-text clinical fields. Time points were defined for initial symptomatic presentation, chest imaging, specialist consultation, diagnostic confirmation, and treatment initiation. Medians and interquartile ranges (IQR) were calculated for intervals spanning these time points. The mean age of the cohort was 67.3 years, 54.1% had Stage III or IV disease, and the majority were diagnosed after clinical presentation (94.5%) rather than screening (5.5%). The median interval from first recorded symptoms/signs to diagnosis was 570 days (IQR 273–691); from chest CT or chest X-ray imaging to diagnosis, 43 days (IQR 11–240); from specialist consultation to diagnosis, 72 days (IQR 13–456); and from diagnosis to treatment initiation, 7 days (IQR 0–36). Symptoms/signs associated with lung cancer can be identified over a year prior to diagnosis using NLP, highlighting the need for CQMs to improve timeliness of diagnosis.
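Median/IQR interval statistics of this kind can be computed directly with the standard library; the per-patient dates below are invented, not the study's data:

```python
from datetime import date
from statistics import median, quantiles

# Hypothetical per-patient timelines: first recorded symptom vs. diagnosis.
symptom_dates = [date(2017, 1, 5), date(2017, 6, 1),
                 date(2018, 2, 10), date(2018, 9, 3)]
diagnosis_dates = [date(2018, 8, 1), date(2018, 1, 15),
                   date(2019, 1, 2), date(2019, 3, 30)]

# Interval in days for each patient, then median and quartile cut points.
intervals = [(d - s).days for s, d in zip(symptom_dates, diagnosis_dates)]
med = median(intervals)
q1, _, q3 = quantiles(intervals, n=4)  # (Q1, Q2, Q3); IQR is [q1, q3]
```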
Article
Full-text available
Importance: Overdose is one of the leading causes of death in the US; however, surveillance data lag considerably from medical examiner determination of the death to reporting in national surveillance reports. Objective: To automate the classification of deaths related to substances in medical examiner data using natural language processing (NLP) and machine learning (ML). Design, setting, and participants: Diagnostic study comparing different natural language processing and machine learning algorithms to identify substances related to overdose in 10 health jurisdictions in the US from January 1, 2020, to December 31, 2020. Unstructured text from 35 433 medical examiner and coroners' death records was examined. Exposures: Text from each case was manually classified to a substance that was related to the death. Three feature representation methods were used and compared: term frequency-inverse document frequency (TF-IDF), global vectors for word representations (GloVe), and concept unique identifier (CUI) embeddings. Several ML algorithms were trained and best models were selected based on F-scores. The best models were tested on a hold-out test set and results were reported with 95% CIs. Main outcomes and measures: Text data from death certificates were classified as any opioid, fentanyl, alcohol, cocaine, methamphetamine, heroin, prescription opioid, and an aggregate of other substances. Diagnostic metrics and 95% CIs were calculated for each combination of feature extraction method and machine learning classifier. Results: Of 35 433 death records analyzed (decedent median age, 58 years [IQR, 41-72 years]; 24 449 [69%] were male), the most common substances related to deaths included any opioid (5739 [16%]), fentanyl (4758 [13%]), alcohol (2866 [8%]), cocaine (2247 [6%]), methamphetamine (1876 [5%]), heroin (1613 [5%]), prescription opioids (1197 [3%]), and any benzodiazepine (1076 [3%]).
The CUI embeddings had similar or better diagnostic metrics compared with word embeddings and TF-IDF for all substances except alcohol. ML classifiers had perfect or near perfect performance in classifying deaths related to any opioids, heroin, fentanyl, prescription opioids, methamphetamine, cocaine, and alcohol. Classification of benzodiazepines was suboptimal using all 3 feature extraction methods. Conclusions and relevance: In this diagnostic study, NLP/ML algorithms demonstrated excellent diagnostic performance at classifying substances related to overdoses. These algorithms should be integrated into workflows to decrease the lag time in reporting overdose surveillance data.
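Of the three feature representations compared, TF-IDF is the simplest. A self-contained sketch of smoothed TF-IDF weighting (the certificate snippets are invented, and a real pipeline would use a library implementation such as scikit-learn's):

```python
import math
from collections import Counter

def tfidf(docs):
    """Map each tokenised document to a {term: tf-idf weight} dict.
    Uses smoothed idf: log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({t: (tf[t] / len(d)) * idf[t] for t in tf})
    return out

# Hypothetical death-certificate snippets, pre-tokenised.
docs = [["acute", "fentanyl", "intoxication"],
        ["acute", "alcohol", "intoxication"],
        ["blunt", "force", "trauma"]]
vecs = tfidf(docs)
# Rarer terms ("fentanyl") outweigh common ones ("acute") within a document.
```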
Article
Full-text available
In the medical field, text classification based on natural language processing (NLP) has shown good results and has great practical application prospects, such as clinical medical value, but most existing research focuses on English electronic medical record data; there is less research on natural language processing tasks for Chinese electronic medical records. Most current Chinese electronic medical records are non-standardized texts, which generally have low utilization rates and inconsistent terminology, often mingling patients' symptoms, medications, diagnoses, and other essential information. In this paper, we propose a Capsule network model for electronic medical record classification, which combines LSTM and GRU models and relies on a unique routing structure to extract complex Chinese medical text features. The experimental results show that this model outperforms several baseline models, achieving an F1 value of 73.51% on the Chinese electronic medical record dataset, at least 4.1% better than the other baseline models.
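The F1 value quoted above is the harmonic mean of precision and recall; a minimal binary version for reference (the labels in the example are made up):

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```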
Article
Full-text available
Background: Electronic medical records (EMR) contain detailed information about patient health, and developing an effective representation model is of great significance for downstream applications of EMR. However, processing the data directly is difficult because EMR data are incomplete, unstructured and redundant; therefore, preprocessing the original data is the key step in EMR data mining. Classic distributed word representations ignore the geometric features of the word vectors when representing EMR data, which often underestimates the similarities between similar words and overestimates the similarities between distant words. As a result, word similarities obtained from embedding models are inconsistent with human judgment, and much valuable medical information is lost. Results: In this study, we propose a biomedical word embedding framework based on a manifold subspace. Our proposed model first obtains the word vector representations of the EMR data and then re-embeds the word vectors in the manifold subspace. We develop an efficient optimization algorithm with neighborhood-preserving embedding based on manifold optimization. To validate the algorithm, we perform experiments on intrinsic evaluation and external classification tasks, and the experimental results demonstrate its advantages over other baseline methods. Conclusions: Manifold subspace embedding can enhance distributed word representations of electronic medical record texts and reduces the difficulty of processing unstructured EMR text data, which has clear biomedical research value.
Article
Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which achieves O(L log L) in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.
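The ProbSparse mechanism ranks queries by a sparsity measure, the gap between a query's maximum attention score and its mean score, and computes full attention only for the top-u queries. A toy version of that measure on plain lists (Informer additionally approximates it by sampling keys, which is omitted here):

```python
import math

def sparsity_scores(Q, K):
    """Informer-style query sparsity measure: max score minus mean score.
    Q, K are lists of equal-length vectors; a higher value means the query's
    attention distribution is more peaked (more 'active')."""
    d = len(K[0])
    out = []
    for q in Q:
        s = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        out.append(max(s) - sum(s) / len(s))
    return out

def top_u_queries(Q, K, u):
    """Indices of the u queries that would get full attention."""
    m = sparsity_scores(Q, K)
    return sorted(range(len(Q)), key=lambda i: m[i], reverse=True)[:u]

K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Q = [[5.0, 0.0],   # peaked scores -> large max-minus-mean gap
     [0.1, 0.1]]   # nearly uniform scores -> small gap
selected = top_u_queries(Q, K, u=1)
```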
Article
The existing multivariate time series prediction schemes are inefficient in extracting intermediate features. This paper proposes an artificial neural network called Feature Path Efficient Multivariate Time Series Prediction (FPEMTSP) to predict the next element of the main time series in the presence of several secondary time series. We propose to generate all the possible combinations of the secondary time series and extract multivariate features by doing the Cartesian product of the main and the secondary time series features. Our calculations prove that the FPEMTSP’s complexity and network size are acceptable. We have considered a few internal parameters in FPEMTSP that can be configured to improve the prediction accuracy and adjust the network size. We trained and evaluated FPEMTSP using two public datasets. Our evaluation revealed the optimal values for the internal parameters and showed that FPEMTSP surpasses the existing schemes in terms of prediction accuracy and the number of correctly predicted steps.
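The combination-and-Cartesian-product step described above can be sketched with itertools; the series and feature names below are hypothetical, not FPEMTSP's actual features:

```python
from itertools import combinations, product

# Hypothetical features extracted from the main and secondary time series.
main_feats = ["trend", "season"]
secondary = {"humidity": ["h_mean"], "wind": ["w_max"]}

# 1) All non-empty combinations of the secondary series.
subsets = [c for r in range(1, len(secondary) + 1)
           for c in combinations(secondary, r)]

# 2) Cartesian product of main features with each subset's pooled features.
paths = []
for sub in subsets:
    sec_feats = [f for name in sub for f in secondary[name]]
    paths.extend(product(main_feats, sec_feats))
```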
Article
A promoter is a sequence of DNA that initiates the process of transcription and regulates whenever and wherever genes are expressed in the organism. Because of their importance in molecular biology, identifying DNA promoters, which can provide useful information related to their functions and related diseases, remains challenging. Several computational models have been developed over the past decade to predict promoters early from high-throughput sequencing. Although some useful predictors have been proposed, shortfalls remain in those models, and there is an urgent need to enhance their predictive performance to meet practical requirements. In this study, we propose a novel architecture that incorporates transformer-based natural language processing (NLP) and explainable machine learning to address this problem. More specifically, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is employed to encode DNA sequences, and SHapley Additive exPlanations (SHAP) analysis serves as a feature selection step to identify the top-ranked BERT encodings. In the last stage, different machine learning classifiers learn the top features and produce the prediction outcomes. This study predicts not only DNA promoters but also their activities (strong or weak promoters). Overall, the experiments showed accuracies of 85.5% and 76.9% for these two levels, respectively. Our model outperformed previously published predictors on the same dataset in most measurement metrics. We named our predictor BERT-Promoter; it is freely available at https://github.com/khanhlee/bert-promoter.
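The "rank encoded features, keep the top k, then classify" shape of that pipeline can be illustrated without the actual SHAP library. Here absolute covariance with the label stands in for the SHAP-based importance score, purely as a hypothetical illustration:

```python
def top_k_features(X, y, k):
    """Rank feature columns by |covariance with the label| and keep the top k,
    a crude stand-in for an importance-based (e.g. SHAP) feature ranking."""
    n = len(X)
    ybar = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        xbar = sum(col) / n
        cov = sum((a - xbar) * (b - ybar) for a, b in zip(col, y)) / n
        scores.append(abs(cov))
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

# Toy 'encodings': feature 0 tracks the label, feature 1 is constant.
X = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]]
y = [0, 0, 1, 1]
selected = top_k_features(X, y, k=1)
```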
Article
The stochastic configuration network (SCN), a type of randomized learning algorithm, can solve the infeasibility problem of the random vector functional link (RVFL) network by establishing a supervisory mechanism. The advantages of fast learning, guaranteed convergence and not easily falling into local optima make SCN popular. However, the prediction performance of SCN is affected by its parameter settings and by the non-stationarity of the input data. In this paper, a hybrid model based on variational mode decomposition (VMD), an improved whale optimization algorithm (IWOA) and SCN is proposed. The SCN predicts the relatively stable components obtained after decomposition by VMD, and the parameters of SCN are optimized by IWOA. The IWOA diversifies the initial population by employing a logistic chaotic map based on bit reversal and improves the search ability by using Lévy flight. The exploration and exploitation of IWOA are superior to those of other optimization algorithms on multiple benchmark functions and CEC2020. Moreover, the proposed model is applied to the prediction of non-stationary wind speeds in four seasons, evaluated with four indicators. The results show that the R2 of the proposed model exceeds 0.999 in all four seasons, and the root mean square error, mean absolute error and symmetric mean absolute percentage error are less than 0.3, 0.17 and 13%, respectively, roughly 1/10, 1/10 and 1/4 those of the plain SCN.
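Two of the IWOA ingredients named above can be sketched directly: a logistic-map chaotic initialization (the paper's bit-reversal variant is not reproduced here) and a Lévy-flight step length via Mantegna's algorithm:

```python
import math
import random

def logistic_chaos_population(n, dim, mu=4.0, seed=0.7):
    """Initial candidate solutions in [0, 1] diversified by iterating the
    logistic map x <- mu * x * (1 - x) across all positions."""
    pop, x = [], seed
    for _ in range(n):
        row = []
        for _ in range(dim):
            x = mu * x * (1 - x)
            row.append(x)
        pop.append(row)
    return pop

def levy_step(beta=1.5, rng=None):
    """One Levy-flight step length via Mantegna's algorithm."""
    rng = rng or random.Random()
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u, v = rng.gauss(0, sigma), rng.gauss(0, 1)
    return u / abs(v) ** (1 / beta)

pop = logistic_chaos_population(n=5, dim=3)
step = levy_step(rng=random.Random(42))
```

Heavy-tailed Lévy steps give the whale search occasional long jumps (exploration) between many short moves (exploitation).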
Article
Cardiovascular disease (CVD) is the leading cause of death worldwide and a significant public health concern; therefore, mortality prediction is crucial to many existing treatment guidelines. Medical claims data can be used to accurately foresee the health outcomes of patients with a variety of diseases. Many machine learning algorithms, especially deep learning artificial neural networks, can predict mortality among patients with CVD using clinical data. Calibration of probabilistic predictions is essential for precise medical interventions, as it indicates how well a model's output matches the probability of the event. However, deep learning neural networks are often poorly calibrated. Through experiments, we observe that feature representation is an important factor influencing calibration. This paper proposes a novel feature-based deep learning neural network framework to predict the mortality rate among patients with CVD. Our focus is a comprehensive study of calibration in mortality prediction for CVD, leveraging deep learning architectures and feature representations. Our study demonstrates that the proposed feature-based neural network framework, integrated with Principal Component Analysis or autoencoders, significantly reduces training time and improves calibration, making model updates in clinical contexts more flexible and decision-making in medical prevention more reliable.
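Calibration claims of this kind are usually checked with a reliability curve: bin predictions by predicted probability and compare each bin's mean prediction with its observed event rate. A minimal sketch with invented risk scores and outcomes:

```python
def reliability_bins(probs, labels, n_bins=5):
    """Equal-width probability bins; returns (mean predicted prob,
    observed event rate) per non-empty bin. A well-calibrated model
    has the two values close in every bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[i].append((p, y))
    out = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            obs = sum(y for _, y in b) / len(b)
            out.append((mean_p, obs))
    return out

# Hypothetical mortality-risk scores and binary outcomes.
probs = [0.05, 0.1, 0.45, 0.55, 0.9, 0.95]
labels = [0, 0, 0, 1, 1, 1]
curve = reliability_bins(probs, labels)
```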