
Machine learning-derived peripheral blood transcriptomic biomarkers for early lung cancer diagnosis: Unveiling tumor-immune interaction mechanisms

Abstract

Lung cancer continues to be the leading cause of cancer‐related mortality worldwide. Early detection and a comprehensive understanding of tumor‐immune interactions are crucial for improving patient outcomes. This study aimed to develop a novel biomarker panel utilizing peripheral blood transcriptomics and machine learning algorithms for early lung cancer diagnosis, while simultaneously providing insights into tumor‐immune crosstalk mechanisms. Leveraging a training cohort (GSE135304), we employed multiple machine learning algorithms to formulate a Lung Cancer Diagnostic Score (LCDS) based on peripheral blood transcriptomic features. The LCDS model's performance was evaluated using the area under the receiver operating characteristic (ROC) curve (AUC) in multiple validation cohorts (GSE42834, GSE157086, and an in‐house dataset). Peripheral blood samples were obtained from 20 lung cancer patients and 10 healthy control subjects, representing an in‐house cohort recruited at the Sixth People's Hospital of Chengdu. We employed advanced bioinformatics techniques to explore tumor‐immune interactions through comprehensive immune infiltration and pathway enrichment analyses. Initial screening identified 844 differentially expressed genes, which were subsequently refined to 87 genes using the Boruta feature selection algorithm. The random forest (RF) algorithm demonstrated the highest accuracy in constructing the LCDS model, yielding a mean AUC of 0.938. Lower LCDS values were significantly associated with elevated immune scores and increased CD4+ and CD8+ T‐cell infiltration, indicative of enhanced antitumor‐immune responses. Higher LCDS scores correlated with activation of hypoxia, peroxisome proliferator‐activated receptor (PPAR), and Toll‐like receptor (TLR) signaling pathways, as well as reduced DNA damage repair pathway scores. 
Our study presents a novel, machine learning‐derived peripheral blood transcriptomic biomarker panel with potential applications in early lung cancer diagnosis. The LCDS model not only demonstrates high accuracy in distinguishing lung cancer patients from healthy individuals but also offers valuable insights into tumor‐immune interactions and underlying cancer biology. This approach may facilitate early lung cancer detection and contribute to a deeper understanding of the molecular and cellular mechanisms underlying tumor‐immune crosstalk. Furthermore, our findings on the relationship between LCDS and immune infiltration patterns may have implications for future research on therapeutic strategies targeting the immune system in lung cancer.
RESEARCH ARTICLE
Xiaohua Li 1 | Xuebing Li 2 | Jiangyue Qin 3 | Lei Lei 4 | Hua Guo 1 | Xi Zheng 5 | Xuefeng Zeng 1
1 Department of Respiratory and Critical Care Medicine, Sixth People's Hospital of Chengdu, Chengdu, Sichuan, China
2 Department of Respiratory and Critical Care Medicine, People's Hospital of Yaan, Yaan, Sichuan, China
3 Department of General Practice, West China Hospital, Sichuan University, Chengdu, Sichuan, China
4 Department of Oncology, Sixth People's Hospital of Chengdu, Chengdu, Sichuan, China
5 Department of Thoracic Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan, China
Correspondence
Xiaohua Li, Department of Respiratory and Critical Care Medicine, Sixth People's Hospital of Chengdu, No. 16 Jianshe South St, Chenghua District, Chengdu, Sichuan, China.
Email: lxh_hjcd@163.com
Xuefeng Zeng, Department of Respiratory and Critical Care Medicine, Sixth People's Hospital of Chengdu, No. 16 Jianshe South St, Chenghua District, Chengdu, Sichuan, China.
Email: zxfeng1616@163.com
Funding information
National Natural Science Foundation of China, Grant/Award Number: 81830001; Natural Science Foundation of Sichuan Province, Grant/Award Number: 2023NSFSC1890; Medical Science and Technology Project of Sichuan Provincial Health Commission, Grant/Award Number: 21PJ153
Xiaohua Li, Xuebing Li, and Jiangyue Qin contributed equally to this work.
Received: 7 August 2024 Accepted: 30 September 2024
DOI: 10.1002/biof.2129
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any
medium, provided the original work is properly cited and is not used for commercial purposes.
© 2024 The Author(s). BioFactors published by Wiley Periodicals LLC on behalf of International Union of Biochemistry and Molecular Biology.
BioFactors. 2025;51:e2129. wileyonlinelibrary.com/journal/biof
https://doi.org/10.1002/biof.2129
KEYWORDS
diagnosis, lung cancer, lung cancer diagnostic score, machine learning, tumor-immune interaction
1 | INTRODUCTION
Lung cancer has the highest mortality rate among malignant tumors [1], which underscores the importance of screening, early diagnosis, and timely treatment as effective measures to reduce mortality [2]. Although great progress has been made in the treatment of lung cancer, the prognosis of patients with advanced lung cancer remains poor [3,4]. Studies have shown that the median overall survival (mOS) of patients with stage I/II lung cancer is approximately 57 months, whereas that of patients with stage III/IV lung cancer is only 7 months [3,4]. Consequently, early diagnosis is a pivotal strategy for enhancing survival rates and improving prognosis among lung cancer patients [5–7].
Currently, the gold standard for lung cancer diagnosis is histopathological examination, but this approach is invasive, has poor patient compliance, and cannot facilitate early diagnosis of lung cancer [8]. Advances in cancer screening technology, especially the improved resolution of low-dose computed tomography (LDCT), have led to the identification of numerous lung nodules each year [9,10]. However, the increased false-positive rate and risk of overdiagnosis have resulted in many LDCT-screened patients with lung nodules undergoing unnecessary surgical procedures, thereby increasing their physical and psychological burdens [11]. Given these limitations, LDCT is not recommended for widespread early lung cancer screening but should be reserved for high-risk individuals [12]. Hence, there is an urgent need for more precise and noninvasive prediction models for early-stage lung cancer to advance precision medicine in lung cancer management.
As an emerging predictive tool, noninvasive and convenient prediction models continue to play a valuable role in lung cancer, including early warning and auxiliary diagnosis, real-time monitoring of therapeutic efficacy, guidance of medication, exploration of drug resistance mechanisms, and assessment of prognosis, among other clinical applications [9,13]. With the rapid development of high-throughput sequencing technology, an increasing number of novel diagnostic predictors (drawn from genomics, microbiomics, immunology, and imaging) have been introduced as model variables, improving the accuracy and sensitivity of early disease diagnosis prediction models [14–16]. Recently, several studies have demonstrated the role of machine learning in the development of early lung cancer prediction models [17–19]. Duan et al. [20] focused on four biomarkers, namely the promoter methylation levels of the p16, RASSF1A, and FHIT genes and the relative telomere length, and applied Fisher discriminant analysis and a back-propagation (BP) neural network for adjuvant lung cancer diagnosis. Yu et al. [21] combined CT imaging data with machine learning methods to diagnose lung cancer and determine pathological stages.
Peripheral blood tumor biomarkers serve as important clinical tools for cancer screening, offering the advantages of noninvasiveness and the ability to monitor dynamic changes [22,23]. However, existing serum tumor markers are mostly protein markers, including carcinoembryonic antigen (CEA), neuron-specific enolase (NSE), and cytokeratin 19 fragment (CYFRA21-1) [24,25]; they have limited sensitivity and are primarily suited for adjunctive diagnostics. For example [26–28], CEA is a non-organ-specific tumor-associated antigen with an AUC of approximately 0.67–0.70 in lung cancer diagnosis [26,27]. NSE is elevated only in 10%–20% of small-cell lung cancer (SCLC) patients and a small proportion of patients with benign lung diseases, and its diagnostic sensitivity for SCLC is approximately 50%–80% [28]. Recent advances have been made in combining machine learning with other peripheral blood biomarkers to predict the presence or absence of early-stage cancer [29,30]. However, machine learning prediction models incorporating transcriptomic markers from peripheral blood remain unexplored in early lung cancer diagnosis.
In this study, we developed the Lung Cancer Diagnostic Score (LCDS), a predictive model based on machine learning algorithms, to enable precise diagnosis of early-stage lung cancer from transcriptomic features in peripheral blood. Additionally, we combined multiomics data to explore, in conjunction with LCDS, the potential molecular mechanisms underlying lung cancer development. The findings not only facilitate early lung cancer prediction but also promote precision treatment for patients with lung cancer.
2 | METHODS
2.1 | Data sources
We downloaded expression and clinical data of peripheral blood samples from healthy individuals and lung cancer (LC) patients in GSE135304 [31] (LC: N = 303, normal: N = 284), GSE42834 [32] (LC: N = 16, normal: N = 126), and GSE157086 (LC: N = 2, normal: N = 3) from the Gene Expression Omnibus (GEO) database. The probe information for the expression data in GSE135304 and GSE42834 was obtained from GPL10558. Additionally, we collected peripheral blood samples from 10 healthy individuals and 20 lung cancer patients at the Sixth People's Hospital of Chengdu (in-house cohort). This study was approved by the Ethics Committee of the Sixth People's Hospital of Chengdu (No. 2022-Research projects-001), and written informed consent was obtained from all patients. The clinical baseline characteristics of the patients in the GSE135304, GSE42834, and in-house cohorts are detailed in Tables S1–S3.
2.2 | Collection of specimens
We recruited 10 healthy individuals and 20 lung cancer patients and collected fresh peripheral blood from each enrolled individual before clinical treatment. The study was reviewed and approved by the Ethics Committee of the Sixth People's Hospital of Chengdu and carried out in accordance with the World Medical Association Declaration of Helsinki Ethical Principles for Medical Research. All subjects provided written informed consent.
2.3 | Generation and normalization of RNA-sequencing data
We extracted total RNA from peripheral blood. Libraries were prepared from qualified samples with a NEBNext® Ultra RNA Library Prep Kit (New England Biolabs, Ipswich, MA, USA) and sequenced on the Illumina HiSeq 4000 platform. Paired-end reads (150 bp) were mapped to the human reference genome (GRCh38). Transcript abundances were summarized at the gene level with tximport and normalized as transcripts per million (TPM).
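As a rough illustration of the TPM normalization named above, the following sketch converts per-gene read counts to TPM in plain Python. It is not the tximport pipeline used in the study; the function name `counts_to_tpm` and the example counts and gene lengths are hypothetical.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert per-gene read counts to transcripts per million (TPM).

    counts and lengths_bp are parallel lists (hypothetical inputs);
    lengths are effective gene lengths in base pairs.
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Step 1: reads per kilobase (RPK) corrects for gene length.
    rpk = [c / (l / 1000.0) for c, l in zip(counts, lengths_bp)]
    total = sum(rpk)
    if total == 0:
        raise ValueError("all counts are zero")
    # Step 2: rescale so that the values within one sample sum to one million.
    return [r / total * 1e6 for r in rpk]


# Illustrative values only: three genes in one sample.
example_tpm = counts_to_tpm([100, 300, 600], [1000, 3000, 2000])
```

Because every sample is rescaled to the same total, TPM values can be compared across samples more directly than raw counts, which is one plausible reason for choosing this unit before cross-cohort validation.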
2.4 | LCDS model construction
GSE135304 was used as the training cohort, and GSE42834, GSE157086, and the in-house data were used as the external validation cohorts for this study. In the training set GSE135304, we jointly used a univariable logistic regression model (false discovery rate [FDR] < 0.05) and receiver operating characteristic (ROC) analysis (AUC > 0.6) to screen 844 lung cancer-related genes. To identify the most important contributing features as model predictors and to avoid overfitting, Boruta feature selection was applied to reduce the dimensionality of the 844 genes [33]. Finally, based on the 87 lung cancer-related genes retained after Boruta feature selection (Table S4), 97 machine learning algorithm combinations were used for LCDS model construction (Table S5).
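The two-filter screening step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the per-gene p values from the univariable logistic regression are already available, uses the Benjamini-Hochberg procedure as the FDR method (the text does not name one), and assumes higher expression indicates cancer when computing the AUC. All function names, gene names, and values are hypothetical.

```python
def auc_from_scores(case_values, control_values):
    """AUC as the probability that a random case value exceeds a random
    control value, counting ties as one half (Mann-Whitney formulation)."""
    wins = 0.0
    for c in case_values:
        for h in control_values:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def screen_genes(pvalues, aucs, fdr_cutoff=0.05, auc_cutoff=0.6):
    """Keep genes passing both filters; pvalues and aucs are dicts keyed by gene."""
    genes = list(pvalues)
    fdr = dict(zip(genes, benjamini_hochberg([pvalues[g] for g in genes])))
    return [g for g in genes if fdr[g] < fdr_cutoff and aucs[g] > auc_cutoff]


# Hypothetical toy inputs (gene names and values are made up).
toy_p = {"geneA": 0.001, "geneB": 0.20, "geneC": 0.01}
toy_auc = {"geneA": 0.82, "geneB": 0.75, "geneC": 0.55}
selected = screen_genes(toy_p, toy_auc)
```

In this toy input only `geneA` passes both filters: `geneB` fails the FDR cutoff and `geneC` fails the AUC cutoff. Requiring both criteria removes genes that are statistically associated with disease status but discriminate poorly on their own.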
2.5 | Relationship between LCDS and lung cancer development
Expression data of all samples in the GSE42834 and in-house cohorts were scored with the ssGSEA method in the GSVA R package, using GO-BP, GO-CC, GO-MF, KEGG, Reactome, and Hallmark gene sets from the Molecular Signatures Database (MSigDB) to obtain signaling pathway scores [34,35]. Additionally, the xCell algorithm was used to assess the immune cell infiltration score for each peripheral blood sample [36]. Expression data and survival data for patients with lung cancer were used to validate the prognostic value of the LCDS model (Table S6) [37–41].
2.6 | Statistical analysis
We used univariable logistic regression to screen for genes associated with the development of lung cancer, with a screening criterion of FDR < 0.05 [42]. ROC curves were used to assess the sensitivity and specificity of these genes (AUC > 0.6), mainly using the pROC R package [43]. Additionally, the Boruta algorithm and random forest (RF) algorithm were used to further screen the characteristic genes [33,44]. Spearman's correlation analysis was employed to quantify the correlation between two continuous variables, and the correlation coefficients were expressed as R values. The Kaplan–Meier (KM) method and the R package survminer were used to plot survival curves, and groups were compared with the log-rank test [45]. X-tile was applied to determine the optimal cutoff values for grouping in the KM analysis [46]. Decision curve analysis (DCA) was used to evaluate the decision-making benefit of the predictive models [47]. We used the ggplot2 R package to visualize the figures in this study [48]. All statistical analyses and graph visualization were performed in R software (version 4.1). Two-sided p values < 0.05 were considered statistically significant.
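For readers unfamiliar with decision curve analysis, the quantity plotted on its vertical axis is the net benefit at a given threshold probability, NB = TP/n − (FP/n) × pt/(1 − pt). The sketch below is a minimal Python illustration of that standard formula; it is not the software used in the study, and the function names and toy labels and probabilities are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on predictions at a given threshold probability.

    y_true: 1 for lung cancer, 0 for healthy (hypothetical coding).
    y_prob: predicted probabilities from any model.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    weight = threshold / (1.0 - threshold)
    return true_pos / n - (false_pos / n) * weight


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: classify every individual as positive."""
    return net_benefit(y_true, [1.0] * len(y_true), threshold)


# Toy data for illustration only; the "treat none" strategy always has net benefit 0.
toy_labels = [1, 1, 0, 0]
toy_probs = [0.9, 0.8, 0.3, 0.1]
nb_model = net_benefit(toy_labels, toy_probs, 0.5)
nb_all = net_benefit_treat_all(toy_labels, 0.5)
```

A model is considered useful at a threshold when its net benefit exceeds both reference strategies (treat all and treat none), which is what the "All", "None", and model curves in a DCA plot compare.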
3 | RESULTS
3.1 | Construction of the LCDS model
Based on the training set GSE135304, we first used univariable logistic regression to screen for genes associated with the development of lung cancer (cutoff: FDR < 0.05). Next, ROC curve analysis was used to assess the sensitivity and accuracy of these genes for distinguishing lung cancer patients from healthy individuals, and genes with AUC values > 0.6 were retained. The analysis workflow of this study is shown in Figure 1. A total of 844 genes associated with lung cancer were identified. We then used Boruta feature selection to further screen these 844 genes, yielding 87 lung cancer-related genes, which were incorporated into the lung cancer prediction model to obtain the LCDS (Figure 2A; Table S4). The correlations between the expression levels of these 87 genes in the training set GSE135304 are shown in Figure 2B. Expression levels of the 87 genes in the peripheral blood of lung cancer patients and healthy individuals differed to some extent (Figure 2C).
3.2 | Evaluation of the LCDS model
To validate the predictive ability of LCDS for lung cancer, we calculated the AUC values of the 97 machine learning algorithm combinations based on the 87 lung cancer-related genes in three external validation cohorts (GSE42834, GSE157086, and in-house) (Figure 3A, Table S6). The mean AUC across the three external validation cohorts was taken as the measure of accuracy for each algorithm combination. By this measure, the prediction model built on the 87 lung cancer-related genes with the RF algorithm was the most accurate, with a mean AUC of 0.938 (Figure 3A). The AUC values of the RF model were 1 and 0.985 in the training set GSE135304 and the in-house validation set, respectively (Figure 3B,C). Next, we evaluated the performance of LCDS in early-stage lung cancer patients. In both GSE135304 (training dataset) and the in-house cohort (validation dataset), the LCDS demonstrated high sensitivity and specificity for distinguishing stage I/II lung cancer patients from normal controls (AUC = 1.0, sensitivity = 1.0, specificity = 1.0; Figure 3D,E). DCA demonstrated the net benefit of the LCDS model for predicting lung cancer (Figure 3F: GSE135304; Figure 3G: in-house). In both the training set GSE135304 and the in-house validation set, LCDS showed a more statistically significant difference between lung cancer patients and healthy individuals than any single gene among the 87 lung cancer-related genes (Figure 3H: GSE135304; Figure 3I: in-house).
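The selection rule used here, ranking each algorithm by its mean AUC over the validation cohorts, can be sketched as follows. The per-cohort values are those read from the extracted Figure 3A heat map for the three top-ranked algorithms; the assignment of values to cohorts is inferred from that extraction and should be treated as illustrative. The function and variable names are hypothetical.

```python
# Per-cohort validation AUCs for the three top-ranked algorithms
# (illustrative; cohort assignment inferred from the Figure 3A heat map).
validation_auc = {
    "RF": {"in_house": 0.985, "GSE157086": 1.0, "GSE42834": 0.829},
    "NaiveBayes": {"in_house": 0.96, "GSE157086": 1.0, "GSE42834": 0.823},
    "plsRglm": {"in_house": 0.95, "GSE157086": 1.0, "GSE42834": 0.829},
}


def rank_by_mean_auc(auc_table):
    """Return (algorithm, mean AUC) pairs sorted from best to worst."""
    means = {
        name: sum(per_cohort.values()) / len(per_cohort)
        for name, per_cohort in auc_table.items()
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_mean_auc(validation_auc)
best_model, best_mean_auc = ranking[0]
```

An unweighted mean gives the five-sample GSE157086 cohort the same influence as the much larger GSE42834 cohort; a sample-size-weighted mean would be one alternative design, although the text describes the simple mean.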
3.3 | Relationship between LCDS and lung carcinogenesis
To further explore the possible relationship between the LCDS and lung carcinogenesis, we analyzed the relationship of the LCDS with immune cells and signaling pathways. Lower LCDSs were associated with higher immune scores and greater infiltration of activated immune cells, including CD4+ Tcm cells, CD8+ Tcm cells, CD4+ T cells, CD8+ T cells, CD4+ memory T cells, and CD4+ naïve T cells (Figure 4A: in-house cohort; Figure 4B: GSE42834; p < 0.05, R < 0). Higher LCDSs were significantly associated with macrophage infiltration (Figure 4A,B; p < 0.05, R > 0). These results suggest that individuals in the low LCDS group have a more favorable antitumor immune microenvironment than individuals in the high LCDS group. Moreover, individuals in the high LCDS group showed stronger activity of pathways that drive cancer development and progression, including the hypoxia pathway, peroxisome proliferator-activated receptor (PPAR) pathway, and Toll-like receptor (TLR) pathway (Figure 5A: in-house; Figure 5B: GSE42834; p < 0.05, R > 0). In addition, individuals in the high LCDS group exhibited a weaker cellular DNA damage repair response (Figure 5A: in-house; Figure 5B: GSE42834; p < 0.05, R < 0).
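The R values reported above are Spearman correlation coefficients, that is, Pearson correlations computed on ranks. The following minimal Python sketch shows the calculation with average ranks for ties; it is an illustration rather than the R code used in the study, and the toy LCDS and CD8+ T-cell scores are made up.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores: infiltration falls as LCDS rises, so R is negative.
toy_lcds = [0.1, 0.4, 0.5, 0.8, 0.9]
toy_cd8 = [0.9, 0.7, 0.6, 0.3, 0.2]
toy_r = spearman_r(toy_lcds, toy_cd8)
```

A negative R, as in this toy example, corresponds to the pattern described for T-cell infiltration, while a positive R corresponds to the pattern described for macrophages and the hypoxia, PPAR, and TLR pathways.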
We further extended the application of LCDS to assess clinical prognosis in lung cancer patients. By evaluating the effect of LCDS on survival in 142 lung cancer patients treated with immune checkpoint inhibitors (ICIs) (Ravi et al. [37]), we found that patients in the low LCDS group had significantly longer overall survival (OS) and progression-free survival (PFS) than those in the high LCDS group (Figure 5C; OS: hazard ratio [HR] = 0.56, 95% confidence interval [CI]: 0.33–0.96, log-rank p = 0.033; PFS: HR = 0.62, 95% CI: 0.40–0.97, log-rank p = 0.034). Additionally, lung cancer patients in the GSE41271, GSE47115, GSE73403, and GSE101929 cohorts had significantly better prognosis in the low LCDS group than in the high LCDS group (Figure 5C; log-rank p < 0.05, HR < 1).
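The survival curves behind these comparisons are Kaplan–Meier estimates. As a minimal sketch of the estimator (not the survminer workflow used in the study), the function below multiplies, at each event time, the running survival probability by the fraction of at-risk patients who did not have the event. The function name and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times: follow-up times; events: 1 = event observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without changing the curve.
        n_at_risk -= n_leaving
    return curve


# Toy follow-up data in months; the patient at month 2 is censored.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Computing one such curve per LCDS group and comparing them with the log-rank test reproduces, in outline, the kind of analysis summarized in Figure 5C.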
4 | DISCUSSION
Lung cancer is a multifaceted disease characterized by intricate regulatory mechanisms involving malignant cells, immune cells, stromal cells, and aberrant signaling pathways within a complex ecosystem [49,50]. It is therefore challenging to describe the molecular environment of the disease accurately with a single marker. Machine learning offers a potent means of processing massive amounts of high-dimensional data, enabling precise diagnosis and prediction of diseases based on correlations among data features [51]. In this study, we demonstrated the feasibility of a machine learning predictive model (LCDS) based on gene expression data from peripheral blood for early lung cancer diagnosis. The LCDS model exhibited excellent performance, with a mean AUC of 0.938 across three validation sets (GSE42834, GSE157086, and the in-house cohort). The LCDS was also related to lung cancer mechanisms: the lower LCDS group showed a more favorable antitumor immune microenvironment (IME) and reduced activity of signaling pathways that contribute to tumorigenesis (Figure 6). Furthermore, we explored the relationship between LCDS and clinical prognosis, which revealed significantly improved outcomes among lung cancer patients in the lower LCDS group.
[Figure 1 flowchart. Data collection: in-house cohort (RNA-seq; tumor = 20, healthy individuals = 10); public cohorts for model development (training: GSE135304, T = 303, H = 284; validation: GSE42834, T = 16, H = 126; GSE157086, T = 2, H = 3) and for survival analysis (Ravi, n = 142; GSE41271, n = 274; GSE47115, n = 44; GSE73403, n = 69; GSE101929, n = 66). Model development: logistic regression and AUC > 0.6; Boruta feature selection; 87 genes for training; integrated machine learning (GBM, Lasso, Enet, Ridge, LDA, random forest, SVM, plsRglm, xgboost, glmboost, Stepglm); selection of the model with the highest average AUC in the validation sets to give the Lung Cancer Diagnostic Score (LCDS). Further evaluation: survival analysis (low vs. high risk score), immune analysis, and pathway enrichment analysis.]
FIGURE 1 Flowchart of construction of the lung cancer diagnostic score (LCDS). The flowchart shows the overall study design and methods used to develop and validate LCDS, including obtaining peripheral blood gene expression data from lung cancer patients and healthy individuals, feature selection using the Boruta algorithm to obtain 87 lung cancer-related genes, selecting the optimal random forest model, and validating the model on independent datasets.
FIGURE 2 Screening and selection of 87 lung cancer-related genes. (A) Variable importance ranking by the Boruta algorithm showing the 87 selected genes. (B) Correlation heat map of the 87 gene expression values in the GSE135304 dataset. (C) Heat map comparing differential expression of the 87 genes in GSE135304. Red indicates upregulated genes, and blue indicates downregulated genes.
[Figure 3, panels (A)–(I). Panel (A): heat map of AUC values for the machine learning algorithm combinations (rows ordered by mean AUC; RF ranks first at 0.938) in the GSE157086, GSE42834, and in-house cohorts, with a mean AUC color scale. ROC curve panels: training and in-house cohorts, for all patients and for stage I/II patients vs. healthy individuals. Decision curve panels: x-axis, threshold probability; y-axis, net benefit; curves for All, None, and LCDS. Per-gene panels: AUC (axis 0.5 to 1.0) of individual genes compared with LCDS in the training and in-house cohorts, with significance marks.]
FIGURE 4 Correlations between lung cancer diagnostic score (LCDS) and immune cell infiltration scores. (A) The correlation between LCDS and scores of immune cells estimated by the xCell method using GSE135304. (B) The correlation between LCDS and scores of immune cells estimated by the xCell method using the in-house cohort.
FIGURE 3 A lung cancer diagnostic score (LCDS) was developed and validated via the machine learning-based integrative procedure. (A) Area under the receiver operating characteristic (ROC) curve (AUC) of 97 machine learning models validated on the GSE157086, GSE42834, and in-house cohorts (all validation datasets). (B) ROC curve of the LCDS model on the training cohort GSE135304 (lung cancer vs. healthy individuals). (C) ROC curve of the LCDS model on the independent in-house validation cohort (lung cancer vs. healthy individuals). (D) ROC curve of the LCDS model on the training cohort GSE135304 (stage I/II lung cancer vs. healthy individuals). (E) ROC curve of the LCDS model on the independent in-house validation cohort (stage I/II lung cancer vs. healthy individuals). (F) Decision curve analysis (DCA) plots showing the clinical utility of LCDS on GSE135304. (G) DCA plots showing the clinical utility of LCDS on the in-house cohort. (H) Top 87 nonzero feature coefficients from the final random forest (RF) model on GSE135304. (I) Top 87 nonzero feature coefficients from the final RF model on the in-house cohort.
FIGURE 5 Correlations between Lung Cancer Diagnostic Score (LCDS) and pathway activity scores. Spearman correlation coefficients between LCDS and pathway activity scores by the single-sample gene set enrichment analysis (ssGSEA) method using (A) GSE135304 and (B) in-house cohorts. (C) Kaplan–Meier plots of high vs. low LCDS groups showing association with overall survival (OS) and progression-free survival (PFS) in lung cancer patients from the Ravi et al. [37], GSE41271, GSE47115, GSE73403, and GSE101929 cohorts (log-rank p < 0.05, hazard ratio [HR] < 1).
An enhanced antitumor immune microenvironment likely underlies the reduced risk of lung cancer in the low LCDS group. Immune cells exert selective pressure on tumor development through interactions with tumor cells. Tumor immunogenicity is modified by the immune system in three successive stages: elimination, equilibrium, and escape [52–54]. The immune system, especially cellular immunity featuring CD8+ T cells with recognized antitumor effects and CD4+ T cells involved in the clearance of cancerous tissue [55], serves as a pivotal antitumor mechanism. Studies have shown that the gradual transition from immune activation to immunosuppression plays a crucial role in lung cancer progression, manifested by decreased T-cell clones (CD4+ T cells and CD8+ T cells) and increased regulatory T cells (Tregs) [56]. Additionally, proinflammatory monocyte-derived macrophages infiltrate significantly in patients with early-stage lung cancer [57]. Sinjab et al. found that CD8+ T cells and the inflammatory signature were significantly reduced in early-stage lung cancer tissues and adjacent normal tissues [58]. In this study, we found that individuals in the higher LCDS group had significantly fewer CD4+ T cells and CD8+ T cells and significantly more macrophages than those in the lower LCDS group.
Elevated pathway activity related to cancer progression may characterize individuals in the high LCDS group at the molecular level. Hypoxia, as a potent selective stress, plays a moderate role in tumor cell invasion, metastasis and angiogenesis.59,60 Hypoxia impacts intracellular mitochondrial gene expression61,62 and induces metabolic adaptations in cancer cells.63 It also has important effects on drug delivery, DNA repair, regulation of drug resistance-related genes, the cell cycle and cell death-related pathways, ultimately promoting malignant tumorigenesis.64–66 PPAR-γ is highly expressed in non-small cell lung cancer (NSCLC) and correlates significantly with tumor histological type, pathological differentiation and clinical stage.67–69 In tumor cells, TLRs contribute to cancer development by promoting inflammation, cell proliferation, cell survival and immunosuppression in various ways.70,71 In addition, aberrantly activated TLRs can upregulate nuclear factor κB (NF-κB) activity, inhibit JNK-mediated proapoptotic signaling, and ultimately create a tumor-friendly microenvironment.71,72 Choi et al. found that high TLR expression was associated with poor prognosis in cancer patients.73 Abnormalities in the DNA damage repair system may also contribute to development of tumors,74 such as colorectal cancer (CRC), endometrial cancer, ovarian cancer, and gastric cancer.75 In this study, we found that individuals with higher LCDS had significantly increased activation scores in hypoxia, PPAR, and TLR pathways, along with a significantly lower DNA damage repair response.
FIGURE 6 Proposed molecular mechanisms linking Lung Cancer Diagnostic Score (LCDS) to lung cancer development. This schematic summarizes the potential mechanisms supported by the observed correlations between LCDS, immune infiltration, and pathway activities. LCDS correlates positively with immunosuppressive tumor-associated macrophage (TAM) infiltration, hypoxia, peroxisome proliferator-activated receptor (PPAR), and Toll-like receptor (TLR) pathway activity. LCDS correlates negatively with CD8+ T-cell and CD4+ T-cell infiltration.
However, our study has several limitations. First, due to the lack of datasets with peripheral blood samples from both lung cancer patients and healthy individuals, we could incorporate only three publicly available datasets (GSE135304, GSE42834, and GSE157086) in our analysis. Second, our LCDS model primarily served as a screening tool for lung cancer patients in general and did not differentiate between histological subtypes of lung cancer (such as LUAD and LUSC). Finally, we did not explore the mechanism underlying the capacity of LCDS for early-stage lung cancer diagnosis.
5 | CONCLUSIONS
We used machine learning to construct a diagnostic model (LCDS) based on transcriptomic features of peripheral blood that can be used to predict the occurrence of lung cancer. We also analyzed the association between LCDS and the molecular mechanisms of lung cancer development. LCDS has potential biomarker value in clinical prognosis of lung cancer patients.
AUTHOR CONTRIBUTIONS
Xiaohua Li: Formal analysis; Investigation; Software; Validation; Visualization; Writing – original draft; Funding acquisition; Project administration; Supervision; Writing – review & editing. Xuebing Li: Data curation; Methodology; Writing – original draft. Jiangyue Qin: Data curation; Methodology; Conceptualization; Formal analysis. Lei Lei: Methodology. Hua Guo: Software. Xuefeng Zeng: Funding acquisition; Methodology; Resources; Validation; Visualization; Writing – review & editing. Xi Zheng: Writing – review & editing.
ACKNOWLEDGMENTS
We thank Dr. Jun Chen for helpful discussion.
FUNDING INFORMATION
This work was supported by the Medical Science and Technology Project of Sichuan Provincial Health Commission (21PJ153); the National Natural Science Foundation of China (81830001); and the Natural Science Foundation of Sichuan Province (2023NSFSC1890).
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
DATA AVAILABILITY STATEMENT
The data used to support the findings of this study are
available in the Supporting Information.
CONSENT FOR PUBLICATION
All authors have read and approved the submitted manuscript. No other consent for publication was required.
ORCID
Xiaohua Li https://orcid.org/0009-0003-0929-4666
REFERENCES
1. Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer statistics, 2022. CA Cancer J Clin. 2022;72(1):7–33. https://doi.org/10.3322/caac.21708
2. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209–49. https://doi.org/10.3322/caac.21660
3. Pan J, Fang S, Tian H, Zhou C, Zhao X, Tian H, et al. lncRNA JPX/miR-33a-5p/Twist1 axis regulates tumorigenesis and metastasis of lung cancer by activating Wnt/β-catenin signaling. Mol Cancer. 2020;19(1):9. https://doi.org/10.1186/s12943-020-1133-9
4. Flores R, Patel P, Alpert N, Pyenson B, Taioli E. Association of stage shift and population mortality among patients with non-small cell lung cancer. JAMA Netw Open. 2021;4(12):e2137508. https://doi.org/10.1001/jamanetworkopen.2021.37508
5. The National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med. 2011;365(5):395–409. https://doi.org/10.1056/NEJMoa1102873
6. De Koning HJ, Van Der Aalst CM, De Jong PA, Scholten ET, Nackaerts K, Heuvelmans MA, et al. Reduced lung-cancer mortality with volume CT screening in a randomized trial. N Engl J Med. 2020;382(6):503–13. https://doi.org/10.1056/NEJMoa1911793
7. Kay FU, Kandathil A, Batra K, Saboo SS, Abbara S, Rajiah P. Revisions to the tumor, node, metastasis staging of lung cancer (8th edition): rationale, radiologic findings and clinical implications. World J Radiol. 2017;9(6):269–79. https://doi.org/10.4329/wjr.v9.i6.269
8. Nooreldeen R, Bach H. Current and future development in lung cancer diagnosis. Int J Mol Sci. 2021;22(16):8661. https://doi.org/10.3390/ijms22168661
9. Oudkerk M, Liu S, Heuvelmans MA, Walter JE, Field JK. Lung cancer LDCT screening and mortality reduction - evidence, pitfalls and future perspectives. Nat Rev Clin Oncol. 2021;18(3):135–51. https://doi.org/10.1038/s41571-020-00432-6
10. Chan M, Huang W, Wang J, Liu RS, Hsiao M. Next-generation cancer-specific hybrid Theranostic nanomaterials: MAGE-A3 NIR persistent luminescence nanoparticles conjugated to Afatinib for in situ suppression of lung adenocarcinoma growth and metastasis. Adv Sci. 2020;7(9):1903741. https://doi.org/10.1002/advs.201903741
11. Fehlmann T, Kahraman M, Ludwig N, Backes C, Galata V, Keller V, et al. Evaluating the use of circulating MicroRNA profiles for lung cancer detection in symptomatic patients. JAMA Oncol. 2020;6(5):714–23. https://doi.org/10.1001/jamaoncol.2020.0001
12. Dickson JL, Horst C, Nair A, Tisi S, Prendecki R, Janes SM. Hesitancy around low-dose CT screening for lung cancer. Ann Oncol. 2022;33(1):34–41. https://doi.org/10.1016/j.annonc.2021.09.008
13. MacMahon H, Li F, Jiang Y, Armato SG III. Accuracy of the Vancouver lung cancer risk prediction model compared with that of radiologists. Chest. 2019;156(1):112–9. https://doi.org/10.1016/j.chest.2019.04.002
14. Qiu YL, Zheng H, Devos A, Selby H, Gevaert O. A meta-learning approach for genomic survival analysis. Nat Commun. 2020;11(1):6350. https://doi.org/10.1038/s41467-020-20167-3
15. Xing W, Sun H, Yan C, Zhao C, Wang D, Li M, et al. A prediction model based on DNA methylation biomarkers and radiological characteristics for identifying malignant from benign pulmonary nodules. BMC Cancer. 2021;21(1):263. https://doi.org/10.1186/s12885-021-08002-4
16. Hu F, Huang H, Jiang Y, Feng M, Wang H, Tang M, et al. Discriminating invasive adenocarcinoma among lung pure ground-glass nodules: a multi-parameter prediction model. J Thorac Dis. 2021;13(9):5383–94. https://doi.org/10.21037/jtd-21-786
17. Hosny A, Parmar C, Coroller TP, Grossmann P, Zeleznik R, Kumar A, et al. Deep learning for lung cancer prognostication: A retrospective multi-cohort radiomics study. PLOS Med. 2018;15(11):e1002711. https://doi.org/10.1371/journal.pmed.1002711
18. Chen K, Sun J, Zhao H, Jiang R, Zheng J, Li Z, et al. Non-invasive lung cancer diagnosis and prognosis based on multi-analyte liquid biopsy. Mol Cancer. 2021;20(1):23. https://doi.org/10.1186/s12943-021-01323-9
19. Takahashi S, Asada K, Takasawa K, Shimoyama R, Sakai A, Bolatkan A, et al. Predicting deep learning based multi-omics parallel integration survival subtypes in lung cancer using reverse phase protein Array data. Biomolecules. 2020;10(10):1460. https://doi.org/10.3390/biom10101460
20. Duan X, Yang Y, Tan S, Wang S, Feng X, Cui L, et al. Application of artificial neural network model combined with four biomarkers in auxiliary diagnosis of lung cancer. Med Biol Eng Comput. 2017;55(8):1239–48. https://doi.org/10.1007/s11517-016-1585-7
21. Yu L, Tao G, Zhu L, Wang G, Li Z, Ye J, et al. Prediction of pathologic stage in non-small cell lung cancer using machine learning algorithm based on CT image feature analysis. BMC Cancer. 2019;19(1):464. https://doi.org/10.1186/s12885-019-5646-9
22. Sethi S, Ali S, Philip P, Sarkar F. Clinical advances in molecular biomarkers for cancer diagnosis and therapy. Int J Mol Sci. 2013;14(7):14771–84. https://doi.org/10.3390/ijms140714771
23. Liu C, Xiang X, Han S, Lim HY, Li L, Zhang X, et al. Blood-based liquid biopsy: insights into early detection and clinical management of lung cancer. Cancer Lett. 2022;524:91–102. https://doi.org/10.1016/j.canlet.2021.10.013
24. Li X, Asmitananda T, Gao L, Gai D, Song Z, Zhang Y, et al. Biomarkers in the lung cancer diagnosis: a clinical perspective. Neoplasma. 2012;59(5):500–7. https://doi.org/10.4149/neo_2012_064
25. Wang B, He YJ, Tian YX, Yang RN, Zhu YR, Qiu H. Clinical utility of Haptoglobin in combination with CEA, NSE and CYFRA21-1 for diagnosis of lung cancer. Asian Pac J Cancer Prev. 2014;15(22):9611–4. https://doi.org/10.7314/APJCP.2014.15.22.9611
26. Wu LX, Li XF, Chen HF, Zhu YC, Wang WX, Xu CW, et al. Combined detection of CEA and CA125 for the diagnosis for lung cancer: a meta-analysis. Cell Mol Biol (Noisy-le-grand). 2018;64(15):67–70.
27. Zhou J, Diao X, Wang S, Yao Y. Diagnosis value of combined detection of serum SF, CEA and CRP in non-small cell lung cancer. Cancer Manag Res. 2020;12:8813–9. https://doi.org/10.2147/CMAR.S268565
28. Liu L, Teng J, Zhang L, Cong P, Yao Y, Sun G, et al. The combination of the tumor markers suggests the histological diagnosis of lung cancer. Biomed Res Int. 2017;2017:1–9. https://doi.org/10.1155/2017/2013989
29. Cosma G, McArdle SE, Foulds GA, Hood SP, Reeder S, Johnson C, et al. Prostate cancer: early detection and assessing clinical risk using deep machine learning of high dimensional peripheral blood flow Cytometric phenotyping data. Front Immunol. 2021;12:786828. https://doi.org/10.3389/fimmu.2021.786828
30. Hood SP, Cosma G, Foulds GA, Johnson C, Reeder S, McArdle SE, et al. Identifying prostate cancer and its clinical risk in asymptomatic men using machine learning of high dimensional peripheral blood flow cytometric natural killer cell subset phenotyping data. Elife. 2020;9:e50936. https://doi.org/10.7554/eLife.50936
31. Kossenkov AV, Qureshi R, Dawany NB, Wickramasinghe J, Liu Q, Majumdar RS, et al. A gene expression classifier from whole blood distinguishes benign from malignant lung nodules detected by low-dose CT. Cancer Res. 2019;79(1):263–73. https://doi.org/10.1158/0008-5472.CAN-18-2032
32. Bloom CI, Graham CM, Berry MPR, Rozakeas F, Redford PS, Wang Y, et al. Transcriptional blood signatures distinguish pulmonary tuberculosis, pulmonary sarcoidosis, pneumonias and lung cancers. PLoS ONE. 2013;8:e70630. https://doi.org/10.1371/journal.pone.0070630
33. Kursa MB, Rudnicki WR. Feature selection with the Boruta package. J Stat Softw. 2010;36(11). https://doi.org/10.18637/jss.v036.i11
34. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics. 2013;14(1):7. https://doi.org/10.1186/1471-2105-14-7
35. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. Molecular Signatures Database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–40. https://doi.org/10.1093/bioinformatics/btr260
36. Aran D, Hu Z, Butte AJ. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 2017;18(1):220. https://doi.org/10.1186/s13059-017-1349-1
37. Ravi A, Hellmann MD, Arniella MB, Holton M, Freeman SS, Naranbhai V, et al. Genomic and transcriptomic analysis of checkpoint blockade response in advanced non-small cell lung cancer. Nat Genet. 2023;55(5):807–19. https://doi.org/10.1038/s41588-023-01355-5
38. Riquelme E, Suraokar M, Behrens C, Lin HY, Girard L, Nilsson MB, et al. VEGF/VEGFR-2 upregulates EZH2 expression in lung adenocarcinoma cells and EZH2 depletion enhances the response to platinum-based and VEGFR-2 targeted therapy. Clin Cancer Res. 2014;20(14):3849–61. https://doi.org/10.1158/1078-0432.CCR-13-1916
39. Yu G, Herazo-Maya JD, Nukui T, Romkes M, Parwani A, Juan-Guardela BM, et al. Matrix metalloproteinase-19 promotes metastatic behavior in vitro and is associated with increased mortality in non-small cell lung cancer. Am J Respir Crit Care Med. 2014;190(7):780–90. https://doi.org/10.1164/rccm.201310-1903OC
40. Feng L, Wang J, Cao B, Zhang Y, Wu B, Di X, et al. Gene expression profiling in human lung development: an abundant resource for lung adenocarcinoma prognosis. PLoS ONE. 2014;9(8):e105639. https://doi.org/10.1371/journal.pone.0105639
41. Mitchell KA, Zingone A, Toulabi L, Boeckelman J, Ryan BM. Comparative transcriptome profiling reveals coding and non-coding RNA differences in NSCLC from African Americans and European Americans. Clin Cancer Res. 2017;23(23):7412–25. https://doi.org/10.1158/1078-0432.CCR-17-0527
42. Zabor EC, Reddy CA, Tendulkar RD, Patil S. Logistic regression in clinical studies. Int J Radiat Oncol. 2022;112(2):271–7. https://doi.org/10.1016/j.ijrobp.2021.08.007
43. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, et al. pROC: display and analyze ROC curves. Expasy Org. 2015. Available from: https://cran.r-project.org/web/packages/pROC/index.html
44. Naghibi SA, Ahmadi K, Daneshi A. Application of support vector machine, random forest, and genetic algorithm optimized random Forest models in groundwater potential mapping. Water Resour Manag. 2017;31(9):2761–75. https://doi.org/10.1007/s11269-017-1660-3
45. Lin A, Qi C, Wei T, Li M, Cheng Q, Liu Z, et al. CAMOIP: a web server for comprehensive analysis on multi-omics of immunotherapy in pan-cancer. Brief Bioinform. 2022;23(3):bbac129. https://doi.org/10.1093/bib/bbac129
46. Camp RL, Dolled-Filhart M, Rimm DL. X-Tile. Clin Cancer Res. 2004;10(21):7252–9. https://doi.org/10.1158/1078-0432.CCR-04-0713
47. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565–74. https://doi.org/10.1177/0272989X06295361
48. Wickham H, Chang W, Henry L, Takahashi K, Wilke C, Woo K, et al. ggplot2: create elegant data visualisations using the grammar of graphics. 2016. Available from: https://cran.r-project.org/web/packages/ggplot2/index.html
49. Frankell AM, Dietzen M, Al Bakir M, Lim EL, Karasaki T, Ward S, et al. The evolution of lung cancer and impact of subclonal selection in TRACERx. Nature. 2023;616(7957):525–33. https://doi.org/10.1038/s41586-023-05783-5
50. Chen WW, Liu W, Li Y, Wang J, Ren Y, Wang G, et al. Deciphering the immune-tumor interplay during early-stage lung cancer development via single-cell technology. Front Oncol. 2022;11:716042. https://doi.org/10.3389/fonc.2021.716042
51. Huang S, Yang J, Shen N, Xu Q, Zhao Q. Artificial intelligence in lung cancer diagnosis and prognosis: current application and future perspective. Semin Cancer Biol. 2023;89:30–7. https://doi.org/10.1016/j.semcancer.2023.01.006
52. McGranahan N, Swanton C. Cancer evolution constrained by the immune microenvironment. Cell. 2017;170(5):825–7. https://doi.org/10.1016/j.cell.2017.08.012
53. Schreiber RD, Old LJ, Smyth MJ. Cancer Immunoediting: integrating Immunity's roles in cancer suppression and promotion. Science. 2011;331(6024):1565–70. https://doi.org/10.1126/science.1203486
54. Mittal D, Gubin MM, Schreiber RD, Smyth MJ. New insights into cancer immunoediting and its three component phases - elimination, equilibrium and escape. Curr Opin Immunol. 2014;27:16–25. https://doi.org/10.1016/j.coi.2014.01.004
55. Riemann D, Cwikowski M, Turzer S, Giese T, Grallert M, Schütte W, et al. Blood immune cell biomarkers in lung cancer. Clin Exp Immunol. 2019;195(2):179–89. https://doi.org/10.1111/cei.13219
56. Saab S, Zalzale H, Rahal Z, Khalifeh Y, Sinjab A, Kadara H. Insights into lung cancer immune-based biology, prevention, and treatment. Front Immunol. 2020;11:159. https://doi.org/10.3389/fimmu.2020.00159
57. Bischoff P, Trinks A, Obermayer B, Pett JP, Wiederspahn J, Uhlitz F, et al. Single-cell RNA sequencing reveals distinct tumor microenvironmental patterns in lung adenocarcinoma. Oncogene. 2021;40(50):6748–58. https://doi.org/10.1038/s41388-021-02054-3
58. Sinjab A, Han G, Treekitkarnmongkol W, Hara K, Brennan PM, Dang M, et al. Resolving the spatial and cellular architecture of lung adenocarcinoma by multiregion single-cell sequencing. Cancer Discov. 2021;11(10):2506–23. https://doi.org/10.1158/2159-8290.CD-20-1285
59. Zhang C, Tang B, Hu J, Fang X, Bian H, Han J, et al. Neutrophils correlate with hypoxia microenvironment and promote progression of non-small-cell lung cancer. Bioengineered. 2021;12(1):8872–84. https://doi.org/10.1080/21655979.2021.1987820
60. DeBerardinis RJ. Tumor microenvironment, metabolism, and immunotherapy. N Engl J Med. 2020;382(9):869–71. https://doi.org/10.1056/NEJMcibr1914890
61. Tello D, Balsa E, Acosta-Iborra B, Fuertes-Yebra E, Elorza A, Ordóñez Á, et al. Induction of the mitochondrial NDUFA4L2 protein by HIF-1α decreases oxygen consumption by inhibiting complex I activity. Cell Metab. 2011;14(6):768–79. https://doi.org/10.1016/j.cmet.2011.10.008
62. Shiratsuki S, Hara T, Munakata Y, Shirasuna K, Kuwayama T, Iwata H. Low oxygen level increases proliferation and metabolic changes in bovine granulosa cells. Mol Cell Endocrinol. 2016;437:75–85. https://doi.org/10.1016/j.mce.2016.08.010
63. Tirpe AA, Gulei D, Ciortea SM, Crivii C, Berindan-Neagoe I. Hypoxia: overview on hypoxia-mediated mechanisms with a focus on the role of HIF genes. Int J Mol Sci. 2019;20(24):6140. https://doi.org/10.3390/ijms20246140
64. Masoud GN, Li W. HIF-1α pathway: role, regulation and intervention for cancer therapy. Acta Pharm Sin B. 2015;5(5):378–89. https://doi.org/10.1016/j.apsb.2015.05.007
65. Scanlon SE, Glazer PM. Multifaceted control of DNA repair pathways by the hypoxic tumor microenvironment. DNA Repair. 2015;32:180–9. https://doi.org/10.1016/j.dnarep.2015.04.030
66. Gao X, Wang G, Zhao W, Han J, Diao CY, Wang XH, et al. Blocking OLFM4/HIF-1α axis alleviates hypoxia-induced invasion, epithelial–mesenchymal transition, and chemotherapy resistance in non-small-cell lung cancer. J Cell Physiol. 2019;234(9):15035–43. https://doi.org/10.1002/jcp.28144
67. Reka AK, Goswami MT, Krishnapuram R, Standiford TJ, Keshamouni VG. Molecular cross-regulation between PPAR-γ and other signaling pathways: implications for lung cancer therapy. Lung Cancer. 2011;72(2):154–9. https://doi.org/10.1016/j.lungcan.2011.01.019
68. Giaginis C, Politi E, Alexandrou P, Sfiniadakis J, Kouraklis G, Theocharis S. Expression of peroxisome proliferator activated receptor-gamma (PPAR-γ) in human non-small cell lung carcinoma: correlation with Clinicopathological parameters, proliferation and apoptosis related molecules and Patients' survival. Pathol Oncol Res. 2012;18(4):875–83. https://doi.org/10.1007/s12253-012-9517-9
69. Xu R, Luo X, Ye X, Li H, Liu H, Du Q, et al. SIRT1/PGC-1α/PPAR-γ correlate with hypoxia-induced chemoresistance in non-small cell lung cancer. Front Oncol. 2021;11:682762. https://doi.org/10.3389/fonc.2021.682762
70. Pradere JP, Dapito DH, Schwabe RF. The yin and yang of toll-like receptors in cancer. Oncogene. 2014;33(27):3485–95. https://doi.org/10.1038/onc.2013.302
71. Martín-Medina A, Cerón-Pisa N, Martinez-Font E, Shafiek H, Obrador-Hevia A, Sauleda J, et al. TLR/WNT: a novel relationship in immunomodulation of lung cancer. Int J Mol Sci. 2022;23(12):6539. https://doi.org/10.3390/ijms23126539
72. Dutta J, Fan Y, Gupta N, Fan G, Gélinas C. Current insights into the regulation of programmed cell death by NF-κB. Oncogene. 2006;25(51):6800–16. https://doi.org/10.1038/sj.onc.1209938
73. Choi CH, Kang TH, Song JS, Kim YS, Chung EJ, Ylaya K, et al. Elevated expression of pancreatic adenocarcinoma upregulated factor (PAUF) is associated with poor prognosis and chemoresistance in epithelial ovarian cancer. Sci Rep. 2018;8(1):12161. https://doi.org/10.1038/s41598-018-30582-8
74. Majidinia M, Yousefi B. DNA repair and damage pathways in breast cancer development and therapy. DNA Repair. 2017;54:22–9. https://doi.org/10.1016/j.dnarep.2017.03.009
75. Kristeleit RS, Miller RE, Kohn EC. Gynecologic cancers: emerging novel strategies for targeting DNA repair deficiency. Am Soc Clin Oncol Educ Book. 2016;36:e259–68. https://doi.org/10.1200/EDBK_159086
SUPPORTING INFORMATION
Additional supporting information can be found online
in the Supporting Information section at the end of this
article.
How to cite this article: Li X, Li X, Qin J, Lei L, Guo H, Zheng X, et al. Machine learning-derived peripheral blood transcriptomic biomarkers for early lung cancer diagnosis: Unveiling tumor-immune interaction mechanisms. BioFactors. 2025;51(1):e2129. https://doi.org/10.1002/biof.2129
14 of 14 LI ET AL.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Lung cancer is the leading cause of cancer-associated mortality worldwide1. Here we analysed 1,644 tumour regions sampled at surgery or during follow-up from the first 421 patients with non-small cell lung cancer prospectively enrolled into the TRACERx study. This project aims to decipher lung cancer evolution and address the primary study endpoint: determining the relationship between intratumour heterogeneity and clinical outcome. In lung adenocarcinoma, mutations in 22 out of 40 common cancer genes were under significant subclonal selection, including classical tumour initiators such as TP53 and KRAS. We defined evolutionary dependencies between drivers, mutational processes and whole genome doubling (WGD) events. Despite patients having a history of smoking, 8% of lung adenocarcinomas lacked evidence of tobacco-induced mutagenesis. These tumours also had similar detection rates for EGFR mutations and for RET, ROS1, ALK and MET oncogenic isoforms compared with tumours in never-smokers, which suggests that they have a similar aetiology and pathogenesis. Large subclonal expansions were associated with positive subclonal selection. Patients with tumours harbouring recent subclonal expansions, on the terminus of a phylogenetic branch, had significantly shorter disease-free survival. Subclonal WGD was detected in 19% of tumours, and 10% of tumours harboured multiple subclonal WGDs in parallel. Subclonal, but not truncal, WGD was associated with shorter disease-free survival. Copy number heterogeneity was associated with extrathoracic relapse within 1 year after surgery. These data demonstrate the importance of clonal expansion, WGD and copy number instability in determining the timing and patterns of relapse in non-small cell lung cancer and provide a comprehensive clinical cancer evolutionary data resource.
Article
Full-text available
Anti-PD-1/PD-L1 agents have transformed the treatment landscape of advanced non-small cell lung cancer (NSCLC). To expand our understanding of the molecular features underlying response to checkpoint inhibitors in NSCLC, we describe here the first joint analysis of the Stand Up To Cancer-Mark Foundation cohort, a resource of whole exome and/or RNA sequencing from 393 patients with NSCLC treated with anti-PD-(L)1 therapy, along with matched clinical response annotation. We identify a number of associations between molecular features and outcome, including (1) favorable (for example, ATM altered) and unfavorable (for example, TERT amplified) genomic subgroups, (2) a prominent association between expression of inducible components of the immunoproteasome and response and (3) a dedifferentiated tumor-intrinsic subtype with enhanced response to checkpoint blockade. Taken together, results from this cohort demonstrate the complexity of biological determinants underlying immunotherapy outcomes and reinforce the discovery potential of integrative analysis within large, well-curated, cancer-specific cohorts.
Article
Full-text available
Each year, the American Cancer Society estimates the numbers of new cancer cases and deaths in the United States and compiles the most recent data on population‐based cancer occurrence and outcomes. Incidence data (through 2018) were collected by the Surveillance, Epidemiology, and End Results program; the National Program of Cancer Registries; and the North American Association of Central Cancer Registries. Mortality data (through 2019) were collected by the National Center for Health Statistics. In 2022, 1,918,030 new cancer cases and 609,360 cancer deaths are projected to occur in the United States, including approximately 350 deaths per day from lung cancer, the leading cause of cancer death. Incidence during 2014 through 2018 continued a slow increase for female breast cancer (by 0.5% annually) and remained stable for prostate cancer, despite a 4% to 6% annual increase for advanced disease since 2011. Consequently, the proportion of prostate cancer diagnosed at a distant stage increased from 3.9% to 8.2% over the past decade. In contrast, lung cancer incidence continued to decline steeply for advanced disease while rates for localized‐stage increased suddenly by 4.5% annually, contributing to gains both in the proportion of localized‐stage diagnoses (from 17% in 2004 to 28% in 2018) and 3‐year relative survival (from 21% to 31%). Mortality patterns reflect incidence trends, with declines accelerating for lung cancer, slowing for breast cancer, and stabilizing for prostate cancer. In summary, progress has stagnated for breast and prostate cancers but strengthened for lung cancer, coinciding with changes in medical practice related to cancer screening and/or treatment. More targeted cancer control interventions and investment in improved early detection and treatment would facilitate reductions in cancer mortality.
Article
Full-text available
Lung cancer is the leading cause of cancer-related death worldwide. Cancer immunotherapy has shown great success in treating advanced-stage lung cancer but has yet been used to treat early-stage lung cancer, mostly due to lack of understanding of the tumor immune microenvironment in early-stage lung cancer. The immune system could both constrain and promote tumorigenesis in a process termed immune editing that can be divided into three phases, namely, elimination, equilibrium, and escape. Current understanding of the immune response toward tumor is mainly on the “escape” phase when the tumor is clinically detectable. The detailed mechanism by which tumor progenitor lesions was modulated by the immune system during early stage of lung cancer development remains elusive. The advent of single-cell sequencing technology enables tumor immunologists to address those fundamental questions. In this perspective, we will summarize our current understanding and big gaps about the immune response during early lung tumorigenesis. We will then present the state of the art of single-cell technology and then envision how single-cell technology could be used to address those questions. Advances in the understanding of the immune response and its dynamics during malignant transformation of pre-malignant lesion will shed light on how malignant cells interact with the immune system and evolve under immune selection. Such knowledge could then contribute to the development of precision and early intervention strategies toward lung malignancy.
Article
Full-text available
Importance Early detection by computed tomography and a more attention-oriented approach to incidentally identified pulmonary nodules in the last decade has led to population stage shift for non–small cell lung cancer (NSCLC). This stage shift could substantially confound the evaluation of newer therapeutics and mortality outcomes. Objective To investigate the association of stage shift with population mortality among patients with NSCLC. Design, Setting, and Participants This retrospective cohort study was performed from October 2020 to June 2021 and used data from the Surveillance, Epidemiology, and End Results (SEER) registries to assess all patients from 2006 to 2016 with NSCLC. Main Outcomes and Measures Incidence-based mortality was evaluated by year-of-death. To assess shifts in diagnostic characteristics, clinical stage and histology distributions were examined by year using χ² tests. Trends were assessed using the average annual percentage change (AAPC), calculated with JoinPoint software. Kaplan-Meier survival analysis assessed overall survival according to stage and compared those missing any stage with those with a reported stage. Results The final sample contained 312 382 patients; 166 657 (53.4%) were male, 38 201 (12.2%) were Black, and 249 062 (79.7%) were White; the median (IQR) age was 68 (60-76) years; 163 086 (52.2%) had adenocarcinoma histology. Incidence-based mortality within 5 years of diagnosis decreased from 2006 to 2016 (AAPC, −3.7; 95% CI, −4.1 to −3.4). When assessing stage shift, there was significant association between year-of-diagnosis and clinical stage, with stage I/II diagnosis increasing from 26.5% to 31.2% (AAPC, 1.5; 95% CI, 0.5 to 2.5); and stage III/IV diagnosis decreasing significantly from 70.8% to 66.1% (AAPC, −0.6; 95% CI, −1.0 to −0.2). Missing staging information was not associated with year-of-diagnosis (AAPC, −1.6; 95% CI, −7.4 to 4.5). 
Year-of-diagnosis was significantly associated with tumor histology (χ² = 8990.0; P < .001). There was a significant increase in adenocarcinomas: 42.9% in 2006 to 59.0% in 2016 (AAPC, 3.4; 95% CI, 2.9 to 3.9). Median (IQR) survival for stage I/II was 57 months (18 months to not reached); stage III/IV was 7 (2-19) months; and missing stage was 10 (2-28) months. When compared with those with known stage, those without stage information had significantly worse survival than those with stage I/II, with survival between those with stage III and stage IV (log-rank χ² = 87 125.0; P < .001). Conclusions and Relevance This cohort study found an association between decreased mortality and a corresponding diagnostic shift from later to earlier stage. These findings suggest that studies investigating the effect of treatment on lung cancer must take into account stage shift and the confounding association with survival and mortality outcome.
Article
Detecting the presence of prostate cancer (PCa) and distinguishing low- or intermediate-risk disease from high-risk disease early, and without the need for potentially unnecessary invasive biopsies, remains a significant clinical challenge. The aim of this study is to determine whether the T and B cell phenotypic features which we have previously identified as being able to distinguish between benign prostate disease and PCa in asymptomatic men having Prostate-Specific Antigen (PSA) levels < 20 ng/ml can also be used to detect the presence and clinical risk of PCa in a larger cohort of patients whose PSA levels ranged between 3 and 2617 ng/ml. The peripheral blood of 130 asymptomatic men having elevated PSA levels was immune profiled using multiparametric whole blood flow cytometry. Of these men, 42 were subsequently diagnosed as having benign prostate disease and 88 as having PCa on biopsy-based evidence. We built a bidirectional Long Short-Term Memory Deep Neural Network (biLSTM) model for detecting the presence of PCa in men which combined the previously-identified phenotypic features (CD8⁺CD45RA⁻CD27⁻CD28⁻ (CD8⁺ Effector Memory cells), CD4⁺CD45RA⁻CD27⁻CD28⁻ (CD4⁺ Effector Memory cells), CD4⁺CD45RA⁺CD27⁻CD28⁻ (CD4⁺ Terminally Differentiated Effector Memory Cells re-expressing CD45RA), CD3⁻CD19⁺ (B cells), and CD3⁺CD56⁺CD8⁺CD4⁺ (NKT cells)) with Age. The performance of the PCa presence ‘detection’ model was: Acc: 86.79% (± 0.10), Sensitivity: 82.78% (± 0.15); Specificity: 95.83% (± 0.11) on the test set (which was not used during training and validation); AUC: 89.31% (± 0.07), ORP-FPR: 7.50% (± 0.20), ORP-TPR: 84.44% (± 0.14). 
A second biLSTM ‘risk’ model, which combined the immunophenotypic features with PSA to predict whether a patient with PCa has high-risk disease (defined by the D’Amico Risk Classification), achieved the following: Acc: 94.90% (± 6.29), Sensitivity: 92% (± 21.39); Specificity: 96.11% (± 0.00); AUC: 94.06% (± 10.69), ORP-FPR: 3.89% (± 0.00), ORP-TPR: 92% (± 21.39). The ORP-FPR for predicting the presence of PCa when combining FC+PSA was lower than that of PSA alone. This study demonstrates that AI approaches based on peripheral blood phenotyping profiles can distinguish between benign prostate disease and PCa and predict clinical risk in asymptomatic men having elevated PSA levels.
Article
Recent developments in immuno-oncology demonstrate that not only cancer cells, but also the tumor microenvironment can guide precision medicine. A comprehensive and in-depth characterization of the tumor microenvironment is challenging since its cell populations are diverse and can be important even if scarce. To identify clinically relevant microenvironmental and cancer features, we applied single-cell RNA sequencing to ten human lung adenocarcinomas and ten normal control tissues. Our analyses revealed heterogeneous carcinoma cell transcriptomes reflecting histological grade and oncogenic pathway activities, and two distinct microenvironmental patterns. The immune-activated CP²E microenvironment was composed of cancer-associated myofibroblasts, proinflammatory monocyte-derived macrophages, plasmacytoid dendritic cells and exhausted CD8+ T cells, and was prognostically unfavorable. In contrast, the inert N³MC microenvironment was characterized by normal-like myofibroblasts, non-inflammatory monocyte-derived macrophages, NK cells, myeloid dendritic cells and conventional T cells, and was associated with a favorable prognosis. Microenvironmental marker genes and signatures identified in single-cell profiles had prognostic value in bulk tumor profiles. In summary, single-cell RNA profiling of lung adenocarcinoma provides additional prognostic information based on the microenvironment, and may help to predict therapy response and to reveal possible target cell populations for future therapeutic approaches.
Article
Lung cancer is one of the malignant tumors with the highest incidence and mortality in the world. The overall five-year survival rate of lung cancer is lower than that of many other leading cancers. Early diagnosis and prognosis of lung cancer are essential to improve patient survival. With artificial intelligence (AI) approaches widely applied in lung cancer, early diagnosis and prediction have achieved excellent performance in recent years. This review summarizes various types of AI algorithm applications in lung cancer, including natural language processing (NLP), machine learning and deep learning, and reinforcement learning. In addition, we provide evidence regarding the application of AI in lung cancer diagnosis and clinical prognosis. This review aims to elucidate the value of AI in lung cancer diagnosis and prognosis as a novel aid to screening decisions and to the precise treatment of lung cancer patients.
Article
Immune checkpoint inhibitors (ICIs) have completely changed the approach to tumor diagnostics and treatment. Immunotherapy has also provided much-needed data about mutation, expression and prognosis, affording an unprecedented opportunity for discovering candidate drug targets and screening for immunotherapy-relevant biomarkers. Although existing web tools enable biologists to analyze the expression, mutation and prognostic data of tumors, they are currently unable to facilitate data mining and mechanism analyses specifically related to immunotherapy. Thus, we developed our own web-based tool, called Comprehensive Analysis on Multi-Omics of Immunotherapy in Pan-cancer (CAMOIP), with which users can screen various prognostic markers and analyze the mechanisms involved in biomarker expression and function, as well as immunotherapy. The analyses include survival analysis, expression analysis, mutational landscape analysis, immune infiltration analysis, immunogenicity analysis and pathway enrichment analysis. This comprehensive analysis of biomarkers for immunotherapy can be carried out with a single click in CAMOIP, and the software should greatly encourage the further development of immunotherapy. CAMOIP bridges cancer genomics data and immunotherapy data, providing comprehensive information and making current ICI-treated data available to all users. CAMOIP is available at https://www.camoip.net.
Article
Currently, early detection of lung cancer relies on the characterisation of images generated from computed tomography (CT). However, lung tissue biopsy, a highly invasive surgical procedure, is required to confirm CT-derived diagnostic results with very high false-positive rates. Hence, a non-invasive or minimally invasive approach is essential to complement the existing low-dose CT (LDCT) for early detection, improving responses to a certain treatment, predicting cancer recurrence, and evaluating prognosis. In the past decade, liquid biopsies (e.g., blood) have been demonstrated to be highly effective for lung cancer biomarker discovery. In this review, the roles of emerging liquid biopsy-derived biomarkers such as circulating nucleic acids, circulating tumour cells (CTCs), long non-coding RNA (lncRNA), and microRNA (miRNA), as well as exosomes, have been highlighted. The advantages and limitations of these blood-based minimally invasive biomarkers have been discussed. Furthermore, the current progress of the identified biomarkers for clinical management of lung cancer has been summarised. Finally, a potential strategy for the early detection of lung cancer, using a combination of LDCT scans and well-validated biomarkers, has been discussed.