ORIGINAL ARTICLE
An interpretable diagnostic approach for lung cancer: Combining maximal clique and improved BERT
Zi-yu Chen¹ | Fei Xiao¹ | Xiao-kang Wang¹,² | Wen-hui Hou¹ | Rui-lu Huang¹ | Jian-qiang Wang¹
¹ School of Business, Central South University, Changsha, People's Republic of China
² College of Management, Shenzhen University, Shenzhen, People's Republic of China
Correspondence
Jian-qiang Wang and Rui-lu Huang, School of Business, Central South University, Changsha 410083, People's Republic of China.
Email: jqwang@csu.edu.cn; rlhuang@csu.edu.cn
Funding information
Natural Science Foundation of Hunan Province
Abstract
Lung cancer incidence and mortality in China have long been high. Moreover, owing to limited professional and technical capacity, misdiagnosis and missed diagnosis of lung cancer often occur. To improve diagnostic accuracy, this paper proposes an interpretable diagnostic method for lung cancer based on Chinese electronic medical records (EMRs). First, to overcome the difficulty of word segmentation in the clinical texts of Chinese EMRs, a dictionary construction method based on the idea of maximal cliques is proposed, and 730 professional medical terms related to lung diseases are identified. Then, the ProbSparse self-attention mechanism and the self-attention distilling operation from Informer are used to improve Bidirectional Encoder Representations from Transformers (BERT), so that long clinical texts can be represented with lower time complexity and memory consumption. Finally, a convolutional neural network with an attention mechanism processes the resulting representations to produce an interpretable prediction of lung cancer. The method is applied to lung cancer diagnosis for inpatients at a tertiary hospital in Hunan Province, achieving about 0.9 for both the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). In addition, comparative analyses against existing dictionaries, word embedding methods and diagnostic methods further confirm the superiority of the proposed method. Specifically, compared with all of these state-of-the-art methods, it improves precision by at least 6%, recall by at least 2.6%, the F1 score by at least 5.2%, AUROC by at least 7.3% and AUPRC by at least 7.7%.
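The abstract does not say how the maximal-clique idea is turned into candidate terms, so the following is only a rough illustration of the underlying graph notion, not the authors' procedure. It assumes a hypothetical graph whose nodes are text fragments and whose edges link fragments that co-occur strongly, and it enumerates maximal cliques with the standard Bron-Kerbosch algorithm with pivoting. The function name `maximal_cliques`, the edge list, and the English stand-in fragments are all assumptions made for the sketch.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def maximal_cliques(adj: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Enumerate all maximal cliques using Bron-Kerbosch with pivoting."""
    cliques: List[FrozenSet[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        # r: current clique, p: candidates to extend it, x: already explored.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return cliques


# Hypothetical co-occurrence edges between fragments (illustrative only).
edges: List[Tuple[str, str]] = [
    ("small", "cell"), ("small", "lung"), ("small", "cancer"),
    ("cell", "lung"), ("cell", "cancer"), ("lung", "cancer"),
    ("chronic", "cough"),
]

# Build a symmetric adjacency map from the edge list.
adj: Dict[str, Set[str]] = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

cliques = maximal_cliques(adj)
```

In this toy graph each maximal clique groups fragments that are all mutually linked, which is the sense in which a clique could be read as one candidate multi-fragment term. How the real graph is built from EMR text is not stated in the abstract.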
KEYWORDS
Chinese electronic medical records, informer, interpretable diagnostic method, lung cancer,
maximal clique, term discovery
1 | INTRODUCTION
Lung cancer is one of the 10 most common cancers worldwide and accounts for 11.7% of cancer patients (Cao & Chen, 2021). Its mortality rate is also extremely high: 1.8 million people worldwide died of lung cancer in 2020 (Cao & Chen, 2021). In China, there were 820 thousand lung cancer cases and 710 thousand lung cancer deaths in 2020 (Liu, Li, et al., 2021), figures that far exceed the global average. The economic losses caused by lung cancer are also huge: in 2017, the total economic burden of lung cancer in China was estimated at $25.069 billion (Liu, Shi, et al., 2021).
Received: 26 November 2022 Revised: 26 February 2023 Accepted: 4 April 2023
DOI: 10.1111/exsy.13310
Expert Systems. 2023;40:e13310. wileyonlinelibrary.com/journal/exsy © 2023 John Wiley & Sons Ltd. 1 of 22
https://doi.org/10.1111/exsy.13310