All content in this area was uploaded by Aparna S. Varde on Jun 21, 2024
Audiovisual Multimodal Cough Data Analysis for
Tuberculosis Detection
Jyoti Yadav
School of Computing
Montclair State University
Montclair, NJ, USA
yadavj2@montclair.edu
ORCID: 0000-0002-0655-5969
George Antoniou
School of Computing
Montclair State University
Montclair, NJ, USA
antonioug@montclair.edu
ORCID: 0000-0003-0644-4510
Aparna S. Varde
School of Computing, CESAC
Montclair State University
Montclair, NJ, USA
vardea@montclair.edu
ORCID: 0000-0002-3170-2510
Lei Xie
Computer Science
CUNY Hunter & Weill Cornell Medicine
New York, NY, USA
lxi0003@hunter.cuny.edu
ORCID: 0000-0001-9051-2111
Hao Liu
School of Computing
Montclair State University
Montclair, NJ, USA
liuha@montclair.edu
ORCID: 0000-0002-1975-1272
Abstract—Early detection of tuberculosis (TB) remains a
critical challenge. This research presents a novel multimodal
approach utilizing audiovisual information from cough record-
ings to detect TB. We move beyond traditional image-based
methods (such as sputum smear microscopy and chest X-rays)
and examine the feasibility of leveraging cough recordings to
differentiate TB cases. Two main audio processing techniques,
namely Mel-Spectrograms and Mel-Frequency Cepstral Coefficients
(MFCCs), are employed to encode the audio recordings as
input features for deep learning models that classify TB. Our
proposed methods leverage a large challenge dataset compris-
ing clinical data from over 1,105 participants and more than
502,252 cough recordings. Notably, a simple 1D Convolutional
Neural Network (CNN) trained on MFCC features achieves an
accuracy of 91%, surpassing the World Health Organization’s
(WHO) requirements for TB screening tests. Our findings
underscore the potential of MFCC features and 1D CNNs
for accurate TB detection utilizing cough sound data. This
approach adheres to the Occam’s Razor principle, which favors
simpler models (such as 1D CNNs) when they yield comparable
results. This research paves the way for further study in
diverse populations and facilitates the development of accessible
TB screening solutions, especially in resource-limited settings
where only cough recordings are feasible, thereby emphasizing
its notable real-world impacts.
Index Terms—AI in health, audiovisual data, CNN models,
holistic methods, Mel-Spectrogram, MFCC, sustainable AI, TB
I. INTRODUCTION
Tuberculosis (TB), an infectious disease caused by My-
cobacterium tuberculosis, primarily affects the lungs. The
pathogen is transmitted through the air when infected in-
dividuals cough, sneeze, or expectorate. TB is ranked as
the second leading infectious cause of mortality globally,
exceeding HIV and AIDS, with an estimated 10.6 million
individuals contracting the disease in 2022 alone. This
global health crisis affects individuals of all genders and
ages worldwide. In 2022, TB resulted in approximately 1.3
million fatalities. While this disease remains a significant
public health threat, there is a beacon of hope: TB is both
curable and preventable.
Fig. 1: Individuals with Tuberculosis (TB) Record Coughs
Using Mobile Microphones.
To effectively combat this disease,
experts estimate an annual investment of US$13 billion
is needed, encompassing prevention, diagnosis, treatment,
and care initiatives [1], [2]. Early detection is essential for
mitigating transmission and improving patient outcomes,
but traditional methods like sputum tests frequently en-
counter limitations in sensitivity, speed, and infrastructure
requirements [3], [4]. These limitations result in a significant
number of undiagnosed cases, hindering effective control
efforts [5]. Current research is constrained by small and
homogeneous populations, limiting the applicability of AI
tools. There is a need for more diverse studies to develop
accurate AI algorithms capable of differentiating TB coughs
from non-TB coughs across different demographics. This
approach has the potential to enhance the efficacy of AI in
addressing TB [6], [7]. The efforts to combat tuberculosis
(TB) are significantly advanced by the CODA TB DREAM
Challenge. This innovative initiative addresses TB diagnosis
by harnessing the power of AI and cough analysis. Participants
from seven countries who have had a persistent cough
for at least two weeks are recruited. Subjects use the Hyfe
Research App to record their coughs, as shown in Fig. 1,
and undergo comprehensive TB evaluations, including
lab tests, clinical examinations, and collection of background details [8]. This
comprehensive dataset is subsequently made available to the
public. AI experts globally are invited to develop algorithms
that analyze cough sounds and other information to detect
TB. This global collaboration has the potential to signifi-
cantly enhance TB detection, facilitating faster diagnosis and
better patient outcomes. This study examines the extensive
CODA TB dataset [8] (over 700,000 coughs from 1,100
participants) to analyze cough sounds and investigate the
potential of audio information for TB detection. First, the
data, including medical details and cough recordings, is
meticulously organized. Preliminary analysis has revealed
differences in reported symptoms between people with and
without TB. For example, symptoms such as weight loss,
fever, night sweats, and hemoptysis are observed to be more
prevalent in the TB-positive cohort. These findings indicate
that integrating cough sound analysis with symptomatic data
may enhance the accuracy of TB detection.
This research further investigates the potential of ad-
vanced machine learning algorithms for cough sound anal-
ysis. We employ two feature extraction techniques (Mel-
Spectrograms and MFCCs) to extract meaningful features
from cough recordings. These features are subsequently
used to develop robust models capable of accurately dif-
ferentiating TB vs. non-TB, i.e. TB+ / TB- cases. We
propose to adapt four deep learning models (1D CNN, 2D
CNN, VGG16, and ResNet50) to identify the most effective
approach for TB classification. Furthermore, we elucidate
factors driving TB prediction and enhance the interpretability
of our findings, hence promoting trust and transparency in
the diagnostic process.
II. RELATED WORK
TB remains a significant global health concern,
with millions of cases reported annually. Existing
research [9] investigates the potential of cough analysis
using machine learning and deep learning algorithms for
automated TB detection. Historically, chest X-ray imaging
has been the primary method for TB diagnosis. However, this
approach has limitations. Interpreting X-rays requires trained
radiologists and subtle abnormalities may be overlooked,
as documented in the literature [9]. Furthermore, X-ray
imaging exposes patients to ionizing radiation, raising con-
cerns regarding patient safety, especially for repeated testing.
Recent research has explored alternative approaches for TB
detection that address the limitations of X-ray imaging.
Cough analysis emerges as a promising alternative that
offers several advantages. Firstly, it is non-invasive and
avoids radiation exposure. Secondly, it facilitates remote
monitoring: cough recordings can be collected without
direct patient contact, enabling telemedicine applications.
Thirdly, it is cost-effective: recording and analyzing
cough sounds requires minimal equipment compared
to X-ray imaging.
Machine learning and deep learning algorithms have
demonstrated promising results in cough analysis for various
respiratory diseases, including pneumonia and COVID [10],
[11], [12]. These algorithms can automatically extract fea-
tures from cough recordings that are potentially indicative of
specific diseases [13], [14]. Many studies have investigated
the application of machine learning and data mining tech-
niques for challenging real-world problems, including earlier
work by our own research groups [15], [16], [17], [18]. Tsai
et al. (2018) employed Mel-frequency Cepstral coefficients
(MFCCs) features extracted from cough recordings and
achieved an accuracy of 82.2% using a Support Vector
Machine (SVM) classifier for TB detection [19]. Cho et
al. (2017) utilized convolutional neural networks (CNNs)
trained on Mel-Spectrogram representations of cough sounds
and reported an accuracy of 87.1% for TB classification
[20]. Iwendi et al. (2020) compared various machine learning
algorithms, including Random Forests and K-Nearest Neigh-
bors, using MFCC features and achieved an accuracy of
86.3% for TB detection [21]. These studies demonstrate the
potential of machine learning and deep learning techniques
for automated TB detection using cough analysis. However,
the literature still lacks a comprehensive comparison of deep
learning models with various input encoding techniques
aimed at improving TB detection accuracy and generalizability
across diverse populations and cough characteristics.
Our Contribution: This paper investigates two feature
extraction techniques (Mel-Spectrograms and MFCCs) and
evaluates the performance of four deep learning models
(1D CNN, 2D CNN, VGG16, and ResNet50) for non-invasive
TB classification using cough recordings. The study
underscores the importance of validation and generalizability
by assessing the models on external datasets.
III. DATA DESCRIPTION AND PREPROCESSING
The CODA TB dataset is collected from health centers
across seven countries in Africa and Asia (India,
Philippines, South Africa, Uganda, Vietnam, Tanzania, and
Madagascar). This international effort recruits participants
over 18 years old seeking help at outpatient clinics for
a persistent cough lasting at least 2 weeks – a hallmark
symptom of TB. Table I presents a statistical overview of the
dataset, which comprises cough recordings used for Tuberculosis
(TB) detection. The table summarizes the data for two
groups: people with TB (TB+), and those without (TB-). The
CODA TB dataset also incorporates comprehensive clinical
data including TB test results, demographics (age, gender,
ethnicity), medical history (smoking status, HIV status, prior
TB occurrence), and reported symptoms (cough duration,
fever, night sweats) (Table II).
TABLE I: Statistical overview of CODA TB dataset
TABLE II: Demographic features in cough+metadata experiment
This comprehensive collection
of clinical data enables the investigation of the complex
relationships among TB presentation, disease severity, and
potential cough variations [1], [22]. AI techniques can effec-
tively capture the learning strategies of scientists in a given
domain [23], particularly in health-related contexts [24].
Moreover, the potential for identifying novel biomarkers for
TB diagnosis through integrated data analysis underscores
the significance of this dataset. However, this valuable
resource presents certain challenges. These challenges in-
clude missing data points, inconsistencies in reporting, and
variations in diagnostic protocols across different healthcare
settings. To ensure the quality and reliability of our results,
we comprehensively clean and standardize the data, ensuring
fitness for robust analysis.
Data Balance - Mitigating Participant Representation
Disparities: Although the dataset exhibits a class imbalance
between TB+ and TB- cases, a more significant challenge is
the disproportionate distribution of cough recordings across
participants. Some individuals, irrespective of TB status,
contribute an anomalously high number of cough recordings
(e.g., 71,000 recordings). This biases the training process,
as machine learning models tend to prioritize frequently
observed patterns. To mitigate this issue, we focus on
addressing the imbalance in recordings per participant, rather
than the overall class imbalance. This approach balances
the dataset for training while preserving valuable participant
information [25]. For participants with exceptionally high
recording counts, the excess recordings are omitted from the
analysis: the number of recordings per participant is capped at
990. To maintain a good representation in the training data,
up to 990 randomly selected recordings are included from
each participant. This approach mitigates bias
while retaining valuable individual data. We deliberately
avoid excluding participants, a consideration particularly
crucial for the underrepresented TB+ class. By setting a
maximum threshold of 990 recordings per participant, a
balanced dataset suitable for machine learning algorithms
is achieved, as illustrated in Table III.
TABLE III: Data distribution after removing outliers
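The per-participant cap described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the (participant_id, recording_id) pair layout, and the fixed random seed are assumptions made for the example. Only the cap of 990 comes from the text.

```python
import random
from collections import defaultdict

MAX_PER_PARTICIPANT = 990  # cap stated in the text


def cap_recordings(recordings, max_per_participant=MAX_PER_PARTICIPANT, seed=0):
    """Return a capped list of (participant_id, recording_id) pairs.

    Every participant is kept. Only the surplus recordings of heavy
    contributors are dropped, chosen uniformly at random.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    by_participant = defaultdict(list)
    for participant_id, recording_id in recordings:
        by_participant[participant_id].append(recording_id)

    capped = []
    for participant_id, items in by_participant.items():
        if len(items) > max_per_participant:
            # Random subset without replacement for over-represented participants
            items = rng.sample(items, max_per_participant)
        capped.extend((participant_id, rec) for rec in items)
    return capped


if __name__ == "__main__":
    # Toy data with hypothetical IDs: one heavy and one light contributor
    toy = [("P1", f"P1_{i}") for i in range(5000)]
    toy += [("P2", f"P2_{i}") for i in range(12)]
    counts = defaultdict(int)
    for pid, _ in cap_recordings(toy):
        counts[pid] += 1
    print(dict(counts))
```

Sampling within each participant, instead of dropping the participant, keeps every individual in the dataset while limiting the influence of any single one.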
Preserving Participant Insights: While outliers are re-
moved, participants are still retained for two major reasons.
(1) Individual Variations: Cough patterns vary between
individuals. Retaining recordings ensures the model is ex-
posed to these variations, potentially aiding in capturing
subtle cough characteristics relevant to TB classification.
(2) Participant-Level Context: The quantity of recordings
may inherently contain valuable information. For example, a
participant with a substantially higher number of recordings
may indicate a more severe cough condition. Retaining a
subset of recordings enables the model to potentially learn
from this context. This approach balances the dataset for
training while preserving valuable information related to
individual participants and potential cough-related insights.
IV. METHODS
To extract discriminative information from cough recordings,
we employ two feature engineering approaches:
Mel-Spectrograms and Mel-Frequency Cepstral Coefficients
(MFCCs), as illustrated in Fig. 2. These techniques transform
the raw audio data into time-frequency representations that emphasize
the cough’s frequency content and characteristics.
Mel-Spectrogram: The first approach employs Mel-
Spectrograms, a visual representation of the cough sound’s
frequency content over time. We explore various deep learn-
ing models, including 1D and 2D CNNs, VGG16, and
ResNet50, to analyze these Spectrograms. These models
demonstrate proficiency in identifying patterns within im-
ages, making them well-suited for extracting informative
features from the visual cough representations. This process
is formulated in Algorithm 1.
MFCC: The second approach utilizes MFCCs, which
characterize the spectral envelope of the cough sound.
Fig. 2: Framework for Automated Tuberculosis Detection
via Cough Analysis
Algorithm 1: Computing Mel-Spectrogram
Data: Audio signal x, Pre-emphasis coefficient α,
Number of Mel filters M, Window size W,
Hop length H, Sampling rate Fs
Result: Mel-Spectrogram MS
1: Load audio signal x from the dataset
2: Apply pre-emphasis to the audio signal (optional):
y[n] = x[n] − α·x[n−1]
3: Divide the pre-emphasized signal into frames of
window size W and hop length H
4: for each frame do
5: Apply a window function (e.g., Hann) to the
frame: windowed_frame = window · frame
end
6: Compute the Mel-Spectrogram using
librosa.feature.melspectrogram, which performs the
framing and windowing of steps 3 to 5 internally:
MS = librosa.feature.melspectrogram(y=y, sr=Fs,
n_mels=M, win_length=W, window="hann", hop_length=H)
7: Return the Mel-Spectrogram MS
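To make the individual steps of Algorithm 1 concrete, the following NumPy-only sketch carries out pre-emphasis, framing, Hann windowing, the power spectrum, and Mel filtering by hand. It is an illustration under stated assumptions, not the implementation used in this study, which relies on Librosa. The default values (40 Mel bands, 400-sample window, 160-sample hop, α = 0.97), the helper names, and the HTK-style Mel formula are all choices made for this example, so the numbers will not match Librosa output exactly.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style Mel scale (assumption; Librosa defaults to the Slaney variant)
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lower, center, upper = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - lower) / (center - lower)
        falling = (upper - fft_freqs) / (upper - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def mel_spectrogram(x, sr, n_mels=40, win=400, hop=160, alpha=0.97):
    """Return a Mel power spectrogram of shape (n_mels, n_frames)."""
    # Step 2: pre-emphasis, y[n] = x[n] - alpha * x[n-1]
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Step 3: split into overlapping frames (signal must hold at least one frame)
    n_frames = 1 + (len(y) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx]
    # Steps 4-5: apply a Hann window to every frame
    frames = frames * np.hanning(win)
    # Power spectrum of each frame, shape (n_frames, win // 2 + 1)
    power = np.abs(np.fft.rfft(frames, n=win, axis=1)) ** 2
    # Step 6: project the power spectrum onto the Mel filters
    return mel_filterbank(n_mels, win, sr) @ power.T


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 1000.0 * t)  # synthetic 1 kHz tone, not a cough
    print(mel_spectrogram(tone, sr).shape)
```

A logarithm is typically applied to the result (for example with librosa.power_to_db) before the array is passed to a CNN as an image-like input.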
MFCCs are a well-established technique in audio analysis
[26], [27]. Various deep learning models, particularly one-
dimensional and two-dimensional CNNs, are adapted to
analyze these features. These CNNs excel at processing
sequential data like MFCCs, enabling them to discern in-
formative patterns from the cough’s spectral characteristics.
This process is formulated in Algorithm 2.
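The final stage of MFCC extraction, going from Mel-band energies to cepstral coefficients, can be sketched in a few lines of NumPy. The sketch assumes a Mel power spectrogram of shape (number of Mel bands, number of frames) is already available. It then applies log compression and an orthonormal type-II discrete cosine transform, keeping the first N coefficients. The function names and the default of 13 coefficients are assumptions for illustration. Librosa's own mfcc function follows the same idea but adds details, such as dynamic-range clipping, that are omitted here.

```python
import numpy as np


def dct_ii_ortho(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II matrix for inputs of length n_in."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis *= np.sqrt(2.0 / n_in)
    basis[0] *= np.sqrt(0.5)  # extra scaling of the k = 0 row for orthonormality
    return basis


def mfcc_from_mel(mel_power, n_mfcc=13, floor=1e-10):
    """Convert a Mel power spectrogram (n_mels, n_frames) to MFCCs (n_mfcc, n_frames)."""
    # Log compression in decibels; the floor avoids log(0)
    log_mel = 10.0 * np.log10(np.maximum(mel_power, floor))
    # The DCT decorrelates the Mel bands; keeping the first rows keeps
    # the smooth spectral envelope and discards fine detail
    return dct_ii_ortho(n_mfcc, mel_power.shape[0]) @ log_mel


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_mel = rng.uniform(0.1, 1.0, size=(40, 98))  # synthetic stand-in data
    print(mfcc_from_mel(fake_mel).shape)
```

Because only the first N rows of the transform are kept, each frame is summarized by N numbers instead of one value per Mel band, which is the source of the compactness of MFCC features.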
Feature Engineering: This process involves transforming
the raw audio signal into a format suitable for analysis
by deep learning models. This is accomplished by extract-
ing Low-Level Descriptors (LLDs) from each audio frame
and then applying statistical operations to compress these
features into a more manageable format. This condensed
representation allows the deep learning models to focus on
the most informative aspects of the cough sound.
Algorithm 2: Computing MFCCs
Data: Audio signal x, Pre-emphasis coefficient α,
Number of Mel filters M, Frame length L, Hop length H,
Sampling rate Fs, Desired number of MFCC coefficients N
Result: MFCC features MFCC
1: Load audio signal x from the dataset
2: Apply pre-emphasis to the audio signal:
y[n] = x[n] − α·x[n−1]
3: Divide the pre-emphasized signal into frames of
length L with hop length H
4: Create Mel filter banks using
librosa.filters.mel (refer to librosa
documentation for arguments)
5: for each frame do
6: Compute the power spectrum using the
Fourier Transform: X[k] = |FFT(frame)|²
7: Apply Mel filters to the power
spectrum:
Mel_filtered_spectrum = Mel_filters · X[k]
8: Convert the Mel-filtered spectrum to a log scale:
log_Mel_spectrum = librosa.power_to_db(Mel_filtered_spectrum)
9: Compute MFCC features using
librosa.feature.mfcc, which applies a discrete cosine
transform and keeps the first N coefficients:
MFCC_coeffs = librosa.feature.mfcc(S=log_Mel_spectrum, n_mfcc=N)
end
10: Store the MFCC features MFCC for further
analysis or classification
To enhance the generalizability and robustness of the findings, the
preprocessed dataset is systematically partitioned into training
(75%), validation (5%), and testing (20%) sets. Critically, we
maintain class balance within each set, ensuring the model
is trained on a representative distribution of TB+ and TB-
cough recordings, as shown in Table IV.
TABLE IV: Data splitting (unit: number of recordings)
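The 75% / 5% / 20% partition with preserved class proportions can be sketched as a stratified split. The example below is a hedged illustration using only the Python standard library. It assumes that recordings are assigned to the sets individually, consistent with the unit of Table IV, and that "class balance" means each set keeps the same TB+ to TB- ratio as the full dataset. The function name and the seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.75, 0.05, 0.20), seed=0):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize order within each class
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    # Toy labels (hypothetical class sizes, not the counts of Table IV)
    toy_labels = ["TB+"] * 200 + ["TB-"] * 800
    train, val, test = stratified_split(toy_labels)
    print(len(train), len(val), len(test))
```

Splitting each class separately guarantees that even the smaller class appears in all three sets in the intended proportion.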
A. Optimizing Feature Representation: Mel-Spectrogram
Conversion Approach
We leverage image classification techniques to differenti-
ate TB+ from TB- cases. However, feeding raw audio signals
directly into the model can be computationally expensive.
To address this, we propose an approach as shown in Fig.
3 that converts cough recordings into Mel-Spectrograms
represented as NumPy arrays. Mel-Spectrograms provide a
visually informative representation of the frequency content
over time within a cough recording. These spectral represen-
tations, analogous to “cough fingerprints”, are particularly
useful for image classification tasks [28]. By utilizing the
Librosa library, we efficiently convert audio signals into Mel-
Spectrogram arrays. Librosa achieves this by transforming
short overlapping frames of the audio into the frequency
domain and applying Mel filters
to capture the cough’s energy distribution across different
frequency bands.
Fig. 3: Proposed Method Approach 1 - Comprehensive
Pipeline for TB Detection with Mel-Spectrogram conversion.
This approach offers two main advantages. (1) Reduced
Computational Cost: NumPy arrays are optimized for nu-
merical computations, leading to faster model training com-
pared to raw audio data. (2) Efficient Memory Management:
NumPy arrays provide efficient memory handling, which is
crucial for large datasets.
It is noteworthy that a diverse array of methods has been
employed in the literature for large-scale data processing,
involving data mining over biological data [29] and image
processing over complex data types [30]. Likewise, in this
study, we investigate the performance of four commonly
used deep learning models to systematically assess the
efficacy of Mel-Spectrogram features for TB+ vs. TB-
classification (Approach 1). We now introduce each model’s
architecture, training process, and key findings.
One-Dimensional Convolutional Neural Network (1D
CNN) with Mel-Spectrograms: This model (Fig. 4) serves
as a baseline, utilizing a sequential architecture with two
hidden convolutional layers. It leverages ReLU activation for
efficient learning in hidden layers and Softmax activation in
the output layer to produce probabilities for the two classes (TB+ and TB-). Max pooling
is implemented to facilitate downsampling after each con-
volutional layer, thereby reducing feature dimensionality.
Fig. 4: 1D CNN with Mel-Spectrograms [32]
Two-Dimensional Convolutional Neural Network (2D
CNN) with Mel-Spectrograms: Building upon the 1D CNN,
we implement a more complex 2D CNN architecture with
three hidden convolutional layers. This model (Fig. 5) also
utilizes ReLU activation in hidden layers, but employs a
Sigmoid activation function in the output layer, optimized for
binary classification tasks. Average pooling is used for down-
sampling after each convolutional layer. The training process
exhibits a smooth convergence pattern, with a significant
initial drop in loss and a corresponding increase in accuracy,
reaching a stable state around the 20th epoch. Early stopping
is implemented to prevent overfitting, ensuring the model
generalizes well to unseen data.
Fig. 5: 2D CNN with Mel-Spectrograms [31].
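The early stopping rule mentioned above can be illustrated with a small, framework-independent sketch. It assumes the common patience-based criterion: training halts once the validation loss has failed to improve for a given number of consecutive epochs. The patience value, the min_delta parameter, and the function name are assumptions for illustration, since the text does not state the exact criterion used.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops when the validation loss has not improved by more
    than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss  # new best validation loss, reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None  # the loss kept improving often enough, no early stop


if __name__ == "__main__":
    # Hypothetical validation losses that plateau after a few epochs
    losses = [1.0, 0.8, 0.6, 0.5, 0.5, 0.51, 0.52]
    print(early_stopping_epoch(losses, patience=3))
```

In practice the weights from the best epoch are usually restored after stopping, so the final model corresponds to the lowest validation loss observed.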
Transfer Learning with VGG16: To harness the power
of pre-trained models, we leverage transfer learning with
VGG16, a deep convolutional neural network pre-trained on
the massive ImageNet dataset. VGG16 excels at extracting
informative features from images. In this approach (Fig. 6),
we freeze the layers of the pre-trained VGG16, preserving
its learned features. Subsequently, a series of fully connected
dense layers is added on top of the pre-trained network.
These new layers are trained using our Mel-Spectrogram
data to fine-tune VGG16 for TB classification. This ap-
proach leverages VGG16’s feature extraction capabilities
while adapting it to the specific task of TB detection.
Fig. 6: VGG16 architecture with Mel-Spectrograms [32].
Transfer Learning with ResNet50: We further investigate
transfer learning via ResNet50 (Fig. 7), a deeper convolutional
neural network pre-trained on ImageNet. Similar
to VGG16, features extracted from ResNet50 are passed
through a dense layer to generate the final TB classifica-
tion prediction. ResNet50’s architecture potentially provides
richer feature representations compared to VGG16, which
could lead to improved TB detection accuracy. The perfor-
mance of VGG16 and ResNet50 for TB classification with
Mel-Spectrograms is systematically evaluated and compared.
Fig. 7: ResNet50 architecture with Mel-Spectrograms [33].
B. Optimizing Feature Representation: Mel-Frequency Cep-
stral Coefficients (MFCC) Extraction
As an alternative approach, we propose to investigate
Mel-Frequency Cepstral Coefficients (MFCCs), as shown
in Fig. 8. Unlike Mel-Spectrograms, MFCCs directly cap-
ture the perceptually relevant spectral shape of the audio
signal, focusing on frequencies crucial to human hearing.
This compressed representation offers two key advantages.
(1) Enhanced Computational Efficiency: Because MFCCs
keep only a few coefficients per frame, models that consume
them process far less data than models fed full
Mel-Spectrograms, rendering the approach appropriate for
real-time or resource-constrained environments. (2) Compact
Feature Representation: MFCCs provide a more concise
feature set compared to Spectrograms, potentially leading
to improved model training efficiency.
Fig. 8: Proposed Method Approach 2 - Comprehensive
Pipeline for TB Detection using Mel-frequency Cepstral
coefficients (MFCCs).
1D CNN with MFCC: We propose to employ a model
with a sequential architecture of two hidden convolutional
layers specifically designed to process 1D feature vectors
representing the MFCCs. It is depicted in Fig. 9. The key
components of this model include the following. (1) Se-
quential Architecture: Layers are stacked sequentially, with
the output of one layer serving as input to the subsequent
layer. (2) Downsampling: Max pooling, implemented after
each convolutional layer, is a widely adopted technique
for reducing feature map dimensionality while preserving
salient features. (3) Output Layer: The final dense layer
with a Softmax activation function performs the two-class
classification (TB+ vs. TB-). The Softmax function generates
class probabilities, indicating the likelihood of each class for
a given cough sample.
Fig. 9: 1D CNN with MFCC [33].
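The data flow through the 1D CNN described above (two convolutional layers with ReLU, max pooling after each, and a dense Softmax output) can be traced with a NumPy forward pass. This is a sketch of the layer mechanics only, with randomly initialized, untrained weights. The input length of 40 MFCC values, the filter counts (16 and 32), the kernel size of 3, and all function names are assumptions for illustration and are not taken from the model in Fig. 9.

```python
import numpy as np


def conv1d(x, w, b):
    """Valid 1D convolution. x: (T, C_in), w: (K, C_in, C_out), b: (C_out,)."""
    k = w.shape[0]
    out = np.empty((x.shape[0] - k + 1, w.shape[2]))
    for t in range(out.shape[0]):
        out[t] = np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1])) + b
    return out


def relu(x):
    return np.maximum(x, 0.0)


def max_pool1d(x, size=2):
    """Non-overlapping max pooling along the first axis."""
    t = (x.shape[0] // size) * size
    return x[:t].reshape(-1, size, x.shape[1]).max(axis=1)


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def init_params(n_steps, rng, filters=(16, 32), kernel=3, n_classes=2):
    """Random (untrained) weights sized for an input of shape (n_steps, 1)."""
    len1 = (n_steps - kernel + 1) // 2   # length after conv1 + pool
    len2 = (len1 - kernel + 1) // 2      # length after conv2 + pool
    return {
        "conv1": (rng.normal(0, 0.1, (kernel, 1, filters[0])), np.zeros(filters[0])),
        "conv2": (rng.normal(0, 0.1, (kernel, filters[0], filters[1])), np.zeros(filters[1])),
        "dense": (rng.normal(0, 0.1, (len2 * filters[1], n_classes)), np.zeros(n_classes)),
    }


def forward(x, params):
    """Return class probabilities [P(class 0), P(class 1)] for one MFCC vector."""
    h = max_pool1d(relu(conv1d(x, *params["conv1"])))
    h = max_pool1d(relu(conv1d(h, *params["conv2"])))
    w, b = params["dense"]
    return softmax(h.reshape(-1) @ w + b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(40, rng)
    mfcc_vector = rng.normal(size=(40, 1))  # synthetic stand-in for MFCC features
    print(forward(mfcc_vector, params))
```

With a 40-value input, the feature map shrinks from 40 to 38 to 19 positions in the first block and from 19 to 17 to 8 in the second, so the dense layer sees 8 × 32 = 256 values.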
2D CNN with MFCC: We further propose to explore a
2D CNN architecture with MFCC features, as shown in Fig.
10. This model utilizes Librosa to convert audio recordings
into MFCCs, yielding a 2D matrix (coefficients × time frames) to which a channel
axis is added for the 2D convolutional layers. The key components of this model
are detailed as follows. (1) Architecture: Sequential CNN
with three hidden convolutional layers. (2) Hidden Layers:
Similar to the 1D CNN, each hidden layer utilizes the ReLU
activation function. (3) Downsampling: Average pooling
is used after each convolutional layer for dimensionality
reduction. (4) Output Layer: The final dense layer employs
the Sigmoid activation function, suitable for binary classifi-
cation, i.e. TB vs. Non-TB outputs.
Fig. 10: 2D CNN with MFCC [31].
Enhancing Model Generalizability through Cross-
Validation: To optimize the models’ generalizability and
mitigate overfitting, we employ k-fold cross-validation (k =
5). The dataset is partitioned into k folds. For each fold, the
model is trained on the other (k-1) folds and evaluated on the
remaining held-out fold, utilizing metrics such as accuracy,
precision, recall, and F1-score. This training-evaluation cycle
is repeated for all k folds. The performance metrics from
each round are averaged to yield a more robust estimate
of model generalizability on unseen data.
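The cross-validation procedure can be sketched with the Python standard library alone. The example below generates k folds, computes accuracy, precision, recall, and F1-score on each held-out fold, and averages them. The fit_predict callback stands in for training and applying any of the models. It, the function names, the label encoding (1 for TB+, 0 for TB-), and the seed are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with label 1 as the positive (TB+) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def cross_validate(features, labels, fit_predict, k=5):
    """Average the per-fold metrics; fit_predict(X_train, y_train, X_val) returns predictions."""
    scores = []
    for train, val in k_fold_indices(len(labels), k):
        preds = fit_predict([features[i] for i in train],
                            [labels[i] for i in train],
                            [features[i] for i in val])
        scores.append(binary_metrics([labels[i] for i in val], preds))
    return {name: sum(s[name] for s in scores) / k for name in scores[0]}


if __name__ == "__main__":
    # Toy one-dimensional data and a fixed threshold rule as a placeholder model
    xs = [i / 100 for i in range(100)]
    ys = [int(x > 0.5) for x in xs]
    print(cross_validate(xs, ys, lambda xtr, ytr, xva: [int(x > 0.5) for x in xva]))
```

Averaging over folds reduces the dependence of the reported metrics on any single train/validation partition.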
V. RESULTS & DISCUSSION
A. Performance of Models
This section presents the performance analysis of the models.
Table V summarizes the performance comparison of
various deep learning models for TB classification
using Mel-Spectrograms.
Notably, the 1D CNN with MFCCs demonstrates superior
overall performance, outperforming the models that employ
Mel-Spectrograms. This suggests that the compact MFCC
representation might be a more
effective input for TB classification in this dataset. In
fact, the 2D CNN with MFCCs underperforms the 1D
CNN, possibly because the added complexity of the
2D architecture is sub-optimal for capturing
relevant patterns from these features. Table VI summarizes
the performance comparison of various deep learning models
for TB classification using MFCCs.
TABLE V: Performance comparison of Mel-Spectrogram
models for TB classification
TABLE VI: Performance comparison of MFCC models for
TB classification
B. Discussion on Performance
Key Observations and Alignment with Occam’s Razor:
Analysis of all models reveals that a simple 1D CNN
designed for processing 1D features yielded the best results
for TB classification using both Mel-Spectrogram and MFCC
features. This observation aligns with the principle of
Occam’s Razor, favoring simpler models when they achieve
comparable or better performance [34]. Such solutions in
favor of simplicity have frequently been adopted in the
literature, including in previous studies [35], [36], [37].
Transfer learning approaches might require further fine-
tuning or utilizing pre-trained models specifically designed
for audio tasks to improve their effectiveness in TB classi-
fication. Our findings demonstrate the potential of multiple
deep learning models trained on cough sound features with
adequate feature extraction to accurately classify TB cases.
Open Issues Emerging from the Study: While this
research demonstrates promise, several open issues and
opportunities for further studies remain. Examination of
the impact of longer cough recordings on classification
accuracy is crucial. External validation on broader datasets
encompassing diverse populations and TB prevalence rates
is essential for generalizability. Incorporating explainable
artificial intelligence (XAI) techniques may augment model
interpretability, potentially facilitating the discovery of novel
audio biomarkers for TB detection. Investigation of one-
dimensional audio-specific architectures may potentially en-
hance performance and mitigate complexity relative to more
intricate models or Mel-Spectrogram-based approaches. Fur-
ther research and development are necessary to refine the
models, optimize performance, and ensure generalizability
across diverse populations and cough characteristics. Ad-
dressing various challenges, e.g. background noise, cough
variations, and developing user-friendly recording devices,
is necessary for clinical implementation.
AI-Based Medical Analysis: By addressing these open
issues and pursuing further studies, we can refine the models
for TB classification using cough sounds. This research
trajectory possesses the potential to yield a robust, in-
terpretable, and generalizable approach for TB detection,
ultimately contributing to advancements in public health
[38], [39]. This study represents a step towards
utilizing AI-powered cough analysis as a practical tool in the
global fight against tuberculosis.
VI. CONCLUSIONS AND ROADMAP
This study examines the potential of deep learning-based
classification using cough sound analysis for Tuberculo-
sis (TB) detection. Two feature extraction approaches are
investigated: Mel-Spectrograms and MFCC. We evaluate
the effectiveness of four neural network models for TB
classification using these features.
Our findings reveal the potential of integrating audio
data (cough sounds) to improve TB detection accuracy.
This research trajectory facilitates the exploration of non-
invasive and potentially cost-effective screening tools. We
demonstrate the effectiveness of a simple 1D CNN model
for TB classification using MFCC features [40]. This finding
suggests that learning from compact MFCC features
with a simple model might be more efficient than using complex
architectures or Mel-Spectrogram representations for this specific
task. We observe the limited benefits of utilizing transfer
learning approaches with pre-trained models such as VGG16
and ResNet50 for TB detection.
Our findings pave the way for many promising avenues.
Primarily, there is a need to examine the impact of longer
recordings and validate the model on diverse populations.
Such an approach could significantly enhance the model’s
predictive capabilities. Furthermore, the incorporation of
uncertainty quantification is crucial to enhance confidence in
the model’s predictions, thereby facilitating evidence-based
clinical decisions. Finally, the set of algorithms
should be extended to additional classifiers,
accompanied by comprehensive evaluations
to ensure robustness and reliability.
By addressing these future directions, we can refine deep
learning models for TB classification using cough sounds.
This holds promise for developing a robust, interpretable,
and generalizable AI-infused approach to TB detection,
ultimately improving public health.
ACKNOWLEDGMENT
The datasets used for the analyses described were contributed by
Dr. Adithya Cattamanchi at UCSF and Dr. Simon Grandjean Lapierre
at the University of Montreal and were generated in collaboration with
researchers at Stellenbosch University (PI Grant Theron), Walimu (PIs
William Worodria and Alfred Andama); De La Salle Medical and Health
Sciences Institute (PI Charles Yu), Vietnam National Tuberculosis Program
(PI Nguyen Viet Nhung), Christian Medical College (PI DJ Christopher),
Centre Infectiologie Charles Mérieux Madagascar (PIs Mihaja Raberahona
& Rivonirina Rakotoarivelo), and Ifakara Health Institute (PIs Issa Lyimo
& Omar Lweno) with funding from the U.S. National Institutes of Health
(U01 AI152087), The Patrick J. McGovern Foundation and Global Health
Labs. They were obtained as part of the COugh Diagnostic Algorithm for
Tuberculosis (CODA TB) DREAM Challenge through
Synapse [syn31472953].
Jyoti Yadav acknowledges her Graduate Assistantship from Montclair
State University. Dr. Aparna Varde acknowledges NSF grant 2018575. She
is an Associate Director of the Clean Energy and Sustainability Analytics
Center (CESAC) at Montclair State University. Dr. Hao Liu acknowledges
his Startup Funds from the College of Science and Mathematics at Montclair
State University. Dr. Lei Xie heads the Precision Drug Discovery Lab at
CUNY, Hunter, NY, as a Full Professor. He is also an Adjunct Professor at
Weill Cornell Medical College, Cornell University, NY.
REFERENCES
[1] S. Bagcchi (2023). WHO’s global tuberculosis report 2022. The
Lancet Microbe, 4(1), e20.
[2] A. Matteelli, A. Rendon, S. Tiberi, S. Al-Abri, C. Voniatis, A.
Carvalho, & G. B. Migliori (2018). Tuberculosis elimination: where
are we now?. European Respiratory Review, 27(148).
[3] R. G. Loudon, & S. K. Spohn, (1969). Cough frequency and infec-
tivity in patients with pulmonary tuberculosis. American Review of
Respiratory Disease, 99(1), 109-111.
[4] G. P. Kafentzis, S. Tetsing, J. Brew, L. Jover, M. Galvosas, C.
Chaccour, & P. Small (2023). Predicting Tuberculosis from
Real-World Cough Audio Recordings and Metadata. arXiv preprint
arXiv:2307.04842.
[5] M. Pahar, M. Klopper, B. Reeve, R. Warren, G. Theron, & T. Niesler
(2021). Automatic cough classification for tuberculosis screening in a
real-world environment. Physiological Measurement, 42(10), 105014.
[6] Liu, H., Perl, Y., & Geller, J. (2020). Concept placement using
BERT trained by transforming and summarizing biomedical ontology
structure. Journal of Biomedical Informatics, 112, 103607.
[7] Liu, H., Carini, S., Chen, Z., Hey, S. P., Sim, I., & Weng, C. (2022).
Ontology-based categorization of clinical studies by their conditions.
Journal of Biomedical Informatics, 135, 104235.
[8] Yadav, J., Varde, A. S., & Xie, L. (2023). Comprehensive cough data
analysis on CODA TB. In 2023 IEEE International Conference on
Big Data (BigData) (pp. 6311-6313). IEEE.
[9] Lee, J. E., et al. (2019). Diagnostic accuracy of chest radiography for
pulmonary tuberculosis in HIV-infected patients: a systematic review
and meta-analysis. Intl. J. of tuberculosis and lung disease, 23(1),
74-83.
[10] A. S. Varde, D. Karthikeyan, & W. Wang (2023). Facilitating COVID
recognition from X-rays with computer vision models and trans-
fer learning. Multimedia Tools and Applications (MTAP) Journal,
Springer, 83(1):807-838, https://doi.org/10.1007/s11042-023-15744-9
[11] Mak, M. D., & Wai, C. H. (2017). Machine learning for medical
diagnosis using cough sounds: A review. International Journal of Data
Mining and Bioinformatics, 11(2), 282-299.
[12] Karthikeyan, D., Varde, A. S., & Wang, W. (2020). Transfer learning
for decision support in Covid-19 detection from a few images in big
data. In 2020 IEEE International Conference on Big Data (Big Data)
(pp. 4873-4881). IEEE.
[13] Zheng, L., Perl, Y., He, Y., Ochs, C., Geller, J., Liu, H., & Keloth, V.
K. (2021). Visual comprehension and orientation into the COVID-19
CIDO ontology. Journal of Biomedical Informatics, 120, 103861.
[14] Liu, H., Chi, Y., Butler, A., Sun, Y., & Weng, C. (2021). A knowl-
edge base of clinical trial eligibility criteria. Journal of biomedical
informatics, 117, 103771.
[15] Garg, A., Tandon, N., & Varde, A. S. (2020). I am guessing you can’t
recognize this: Generating adversarial images for object detection
using spatial commonsense. AAAI Conf. 34(10):13789-13790.
[16] Varde, A., Rundensteiner, E., Javidi, G., Sheybani, E., & Liang, J.
(2007). Learning the relative importance of features in image data. In
2007 IEEE ICDE workshops (pp. 237-244).
[17] M. Puri, Z. Dau, & A.S. Varde (2021). COVID and social media:
Analysis of COVID-19 and social media trends for smart living and
healthcare. ACM SIGWEB, (2021 Autumn), Article 5, pp. 1-20.
[18] Varde, A. S. (2009). Challenging research issues in data mining,
databases and information retrieval. ACM SIGKDD Explorations
Newsletter, 11(1), 49-52.
[19] Tsai, C. F., et al. (2018). Automated cough analysis using machine
learning for the diagnosis of pulmonary tuberculosis. Journal of
medical and biological engineering, 38(3), 140-149.
[20] Cho, J., et al. (2017). An automatic cough sound classification system
for tuberculosis detection using a deep learning approach with Mel-
Spectrogram and gammatone filter bank features. Sensors, 17(12),
2906.
[21] Iwendi, C., et al. (2020). Machine learning methods for cough analysis
for computer-aided diagnosis of tuberculosis. International journal of
tuberculosis and lung disease, 24(1), 74-83.
[22] Xie, L., & Xie, L. (2023). Elucidation of genome-wide understudied
proteins targeted by PROTAC-induced degradation using interpretable
machine learning. PLOS Computational Biology, 19(8), e1010974.
[23] Varde, A. S. (2022). Computational estimation by scientific data
mining with classical methods to automate learning strategies of
scientists. ACM Transactions on Knowledge Discovery from Data
(TKDD), 16(5), 1-52.
[24] X. Du, O. Emebo, A. Varde, N. Tandon, S. N. Chowdhury &
G. Weikum, (2016). Air quality assessment from social media and
structured data: Pollutants and health impacts in urban planning.
IEEE 32nd International Conference on Data Engineering (ICDE),
Workshops, pp. 54-59, doi: 10.1109/ICDEW.2016.7495616.
[25] L. Xie, E. Draizen, & P. Bourne, (2017). Harnessing big data for sys-
tems pharmacology. Annual Review of Pharmacology & Toxicology,
57, 245-262.
[26] Zheng, F., Zhang, G., & Song, Z. (2001). Comparison of different
implementations of MFCC. Journal of Computer science and Tech-
nology, 16, 582-589.
[27] Shen, Jonathan, Ruoming Pang, Ron J. Weiss, Mike Schuster,
Navdeep Jaitly, Zongheng Yang, Zhifeng Chen et al. “Natural TTS
synthesis by conditioning WaveNet on mel spectrogram predictions.”
In 2018 IEEE international conference on acoustics, speech and signal
processing (ICASSP), pp. 4779-4783. IEEE, 2018.
[28] Xie, L., & Xie, L. (2023). Elucidation of genome-wide understudied
proteins targeted by PROTAC-induced degradation using interpretable
machine learning. PLOS Computational Biology, 19(8), e1010974.
[29] R. Hidalgo, A. DeVito, N. Salah, A.S. Varde, & R.W. Meredith
(2022). Inferring Phylogenetic Relationships using the Smith-Waterman
Algorithm and Hierarchical Clustering. IEEE International
Conference on Big Data, pp. 5910-5914.
[30] Antoniou, G. E., & Coutras, C. A. (2023, July). 5D IIR and All-Pole
Lattice Digital Filters. In 2023 International Symposium on Signals,
Circuits and Systems (ISSCS) (pp. 1-4). IEEE.
[31] Barrera, J. S., Echavarría, A., Madrigal, C., & Herrera-Ramirez, J.
(2020, May). Classification of hyperspectral images of the interior
of fruits and vegetables using a 2D convolutional neuronal network.
Journal of Physics: Conference Series (Vol. 1547, No. 1, p. 012014).
[32] VGG16, https://neurohive.io/en/popular-networks/vgg16/
[33] CNN Architectures: VGG, ResNet, Inception + TL
[34] Soegaard, M. (2020, July 23). Occam’s Razor: The simplest solution
is always the best. Interaction Design Foundation - IxDF.
https://www.interaction-design.org/literature/article/occam-s-razor-the-simplest-solution-is-always-the-best
[35] Conti, C. J., Varde, A. S., & Wang, W. (2020, September). Robot
action planning by commonsense knowledge in human-robot col-
laborative tasks. In 2020 IEEE International IOT, Electronics and
Mechatronics Conference (IEMTRONICS) (pp. 1-7). IEEE.
[36] Kaluarachchi, A., Roychoudhury, D., Varde, A. S., & Weikum, G.
(2011, March). SITAC: discovering semantically identical temporally
altering concepts in text archives. In Proceedings of the 14th Interna-
tional Conference on Extending Database Technology (pp. 566-569).
[37] Liu, H., Chi, Y., Butler, A., Sun, Y., & Weng, C. (2021). A knowl-
edge base of clinical trial eligibility criteria. Journal of biomedical
informatics, 117, 103771.
[38] Huddart, S., Yadav, V., Sieberts, S., Omberg, L., Raberahona, M.,
Rakotoarivelo, R. A., ... & Grandjean Lapierre, S. (2024). Solicited
Cough Sound Analysis for Tuberculosis Triage Testing: The CODA
TB DREAM Challenge Dataset. medRxiv, 2024-03.
[39] Jaganath, D., Sieberts, S. K., Raberahona, M., Huddart, S., Omberg,
L., Rakotoarivelo, R. A., ... & CODA TB DREAM Challenge Con-
sortium. (2024). Accelerating cough-based algorithms for pulmonary
tuberculosis screening: Results from the CODA TB DREAM Chal-
lenge. medRxiv, 2024-05.
[40] Jyoti Yadav and Aparna Varde (2024 May), AI in TB Detection
on Medical Big Data with Health and Educational Impacts (BEST
POSTER AWARD), New Jersey Big Data Alliance (NJBDA) Sym-
posium 2024, Rutgers University, New Brunswick, NJ.