
Audiovisual Multimodal Cough Data Analysis for Tuberculosis Detection

Jyoti Yadav
School of Computing
Montclair State University
Montclair, NJ, USA
yadavj2@montclair.edu
ORCID: 0000-0002-0655-5969
George Antoniou
School of Computing
Montclair State University
Montclair, NJ, USA
antonioug@montclair.edu
ORCID: 0000-0003-0644-4510
Aparna S. Varde
School of Computing, CESAC
Montclair State University
Montclair, NJ, USA
vardea@montclair.edu
ORCID: 0000-0002-3170-2510
Lei Xie
Computer Science
CUNY Hunter & Weill Cornell Medicine
New York, NY, USA
lxi0003@hunter.cuny.edu
ORCID: 0000-0001-9051-2111
Hao Liu
School of Computing
Montclair State University
Montclair, NJ, USA
liuha@montclair.edu
ORCID: 0000-0002-1975-1272
Abstract—Early detection of tuberculosis (TB) remains a
critical challenge. This research presents a novel multimodal
approach utilizing audiovisual information from cough record-
ings to detect TB. We move beyond traditional image-based
methods (such as sputum smear microscopy and chest X-rays)
and examine the feasibility of leveraging cough recordings to
differentiate TB cases. Two main audio processing techniques,
namely Mel-Spectrograms and Mel-Frequency Cepstral Coef-
ficients (MFCCs), are employed for feature encoding of audio
recording in deep learning models for TB classification. Our
proposed methods leverage a large challenge dataset compris-
ing clinical data from over 1,105 participants and more than
502,252 cough recordings. Notably, a simple 1D Convolutional
Neural Network (CNN) trained on MFCC features achieves an
accuracy of 91%, surpassing the World Health Organization’s
(WHO) requirements for TB screening tests. Our findings
underscore the potential of MFCC features and 1D CNNs
for accurate TB detection utilizing cough sound data. This
approach adheres to the Occam’s Razor principle, which favors
simpler models (such as 1D CNNs) when they yield comparable
results. This research paves the way for further study in
diverse populations and facilitates the development of accessible
TB screening solutions, especially in resource-limited settings
where only cough recordings are feasible, thereby emphasizing
its notable real-world impacts.
Index Terms—AI in health, audiovisual data, CNN models,
holistic methods, Mel-Spectrogram, MFCC, sustainable AI, TB
I. INTRODUCTION
Tuberculosis (TB), an infectious disease caused by My-
cobacterium tuberculosis, primarily affects the lungs. The
pathogen is transmitted through the air when infected in-
dividuals cough, sneeze, or expectorate. TB is ranked as
the second leading infectious cause of mortality globally,
exceeding HIV and AIDS, with an estimated 10.6 million
individuals contracting the disease in 2022 alone. This
global health crisis affects individuals of all genders and ages worldwide.
Fig. 1: Individuals with Tuberculosis (TB) Record Coughs Using Mobile Microphones.
In 2022, TB resulted in approximately 1.3
million fatalities. While this disease remains a significant
public health threat, there’s a beacon of hope: TB is both
curable and preventable. To effectively combat this disease,
experts estimate an annual investment of US$13 billion
is needed, encompassing prevention, diagnosis, treatment,
and care initiatives [1], [2]. Early detection is essential for
mitigating transmission and improving patient outcomes,
but traditional methods like sputum tests frequently en-
counter limitations in sensitivity, speed, and infrastructure
requirements [3], [4]. These limitations result in a significant
number of undiagnosed cases, hindering effective control
efforts [5]. Current research is constrained by small and
homogeneous populations, limiting the applicability of AI
tools. There is a need for more diverse studies to develop
accurate AI algorithms capable of differentiating TB coughs
from non-TB coughs across different demographics. This
approach has the potential to enhance the efficacy of AI in
addressing TB [6], [7]. The efforts to combat tuberculosis
(TB) are significantly advanced by the CODA TB DREAM
Challenge. This innovative initiative addresses TB diagnosis
by harnessing the power of AI and cough analysis. Partici-
pants from seven countries experiencing a persistent cough
for two weeks are recruited. Subjects utilize a special app
(Hyfe Research App) to record their coughs, as shown in Fig.
1, and undergo comprehensive TB evaluations, including
lab tests, doctor checkups, and background details [8]. This
comprehensive dataset is subsequently made available to the
public. AI experts globally are invited to develop algorithms
that analyze cough sounds and other information to detect
TB. This global collaboration has the potential to signifi-
cantly enhance TB detection, facilitating faster diagnosis and
better patient outcomes. This study examines the extensive
CODA TB dataset [8] (over 700,000 coughs from 1,100
participants) to analyze cough sounds and investigate the
potential of audio information for TB detection. First, the
data, including medical details and cough recordings, is
meticulously organized. Preliminary analysis has revealed
differences in reported symptoms between people with and
without TB. For example, symptoms such as weight loss,
fever, night sweats, and hemoptysis are observed to be more
prevalent in the TB-positive cohort. These findings indicate
that integrating cough sound analysis with symptomatic data
may enhance the accuracy of TB detection.
This research further investigates the potential of ad-
vanced machine learning algorithms for cough sound anal-
ysis. We employ two feature extraction techniques (Mel-
Spectrograms and MFCCs) to extract meaningful features
from cough recordings. These features are subsequently
used to develop robust models capable of accurately dif-
ferentiating TB vs. non-TB, i.e. TB+ / TB- cases. We
propose to adapt four deep learning models (1D CNN, 2D
CNN, VGG16, and ResNet50) to identify the most effective
approach for TB classification. Furthermore, we elucidate
factors driving TB prediction and enhance the interpretability
of our findings, hence promoting trust and transparency in
the diagnostic process.
II. RELATED WORK
The TB disease remains a significant global health con-
cern, with millions of cases reported annually. Existing
research [9] investigates the potential of cough analysis
using machine learning and deep learning algorithms for
automated TB detection. Historically, chest X-ray imaging
has been the primary method for TB diagnosis. However, this
approach has limitations. Interpreting X-rays requires trained
radiologists and subtle abnormalities may be overlooked,
as documented in the literature [9]. Furthermore, X-ray
imaging exposes patients to ionizing radiation, raising con-
cerns regarding patient safety, especially for repeated testing.
Recent research has explored alternative approaches for TB
detection that address the limitations of X-ray imaging.
Cough analysis emerges as a promising alternative that
offers several advantages. First, it is non-invasive and
avoids radiation exposure. Second, it facilitates remote
monitoring: cough recordings can be collected without
direct patient contact, enabling telemedicine applications.
Third, it is cost-effective: recording and analyzing cough
sounds requires minimal equipment compared
to X-ray imaging.
Machine learning and deep learning algorithms have
demonstrated promising results in cough analysis for various
respiratory diseases, including pneumonia and COVID [10],
[11], [12]. These algorithms can automatically extract fea-
tures from cough recordings that are potentially indicative of
specific diseases [13], [14]. Many studies have investigated
the application of machine learning and data mining tech-
niques for challenging real-world problems, including earlier
work by our own research groups [15], [16], [17], [18]. Tsai
et al. (2018) employed Mel-frequency Cepstral coefficients
(MFCCs) features extracted from cough recordings and
achieved an accuracy of 82.2% using a Support Vector
Machine (SVM) classifier for TB detection [19]. Cho et
al. (2017) utilized convolutional neural networks (CNNs)
trained on Mel-Spectrogram representations of cough sounds
and reported an accuracy of 87.1% for TB classification
[20]. Iwendi et al. (2020) compared various machine learning
algorithms, including Random Forests and K-Nearest Neigh-
bors, using MFCC features and achieved an accuracy of
86.3% for TB detection [21]. These studies demonstrate the
potential of machine learning and deep learning techniques
for automated TB detection using cough analysis. However,
a comprehensive comparison of deep learning models with
various input encoding techniques is still lacking, particularly
one aimed at improving TB detection accuracy and generalizability
across diverse populations and cough characteristics.
Our Contribution: This paper is novel in terms of propos-
ing to investigate two feature extraction techniques (Mel-
Spectrograms and MFCCs) and evaluating the performance
of four deep learning models (1D CNN, 2D CNN, VGG16,
and ResNet50) for TB classification using cough recordings
in a non-invasive manner. The study underscores the im-
portance of validation and generalizability by assessing the
models on external datasets.
III. DATA DESCRIPTION AND PREPROCESSING
The CODA TB dataset is collected from health centers
across seven countries spanning multiple continents (India,
Philippines, South Africa, Uganda, Vietnam, Tanzania, and
Madagascar). This international effort recruits participants
over 18 years old seeking help at outpatient clinics for
a persistent cough lasting at least 2 weeks, a hallmark
symptom of TB. Table I presents a statistical overview of the
dataset, comprising cough recordings used for Tuberculosis
(TB) detection. The table summarizes the data for two
groups: people with TB (TB+), and those without (TB-). The
CODA TB dataset also incorporates comprehensive clinical
data including TB test results, demographics (age, gender,
ethnicity), medical history (smoking status, HIV status, prior
TB occurrence), and reported symptoms (cough duration,
fever, night sweats) (Table II).
TABLE I: Statistical overview of CODA TB dataset
TABLE II: Demographic features in cough+metadata experiment
This comprehensive collection of clinical data enables
the investigation of the complex
relationships among TB presentation, disease severity, and
potential cough variations [1], [22]. AI techniques can effec-
tively capture the learning strategies of scientists in a given
domain [23], particularly in health-related contexts [24].
Moreover, the potential for identifying novel biomarkers for
TB diagnosis through integrated data analysis underscores
the significance of this dataset. However, this valuable
resource presents certain challenges. These challenges in-
clude missing data points, inconsistencies in reporting, and
variations in diagnostic protocols across different healthcare
settings. To ensure the quality and reliability of our results,
we comprehensively clean and standardize the data, ensuring
fitness for robust analysis.
Data Balance - Mitigating Participant Representation
Disparities: Although the dataset exhibits a class imbalance
favoring TB-negative cases, a more significant challenge is
the disproportionate distribution of cough recordings across
participants. Some individuals, irrespective of TB status,
contribute an anomalously high number of cough recordings
(e.g., 71,000 recordings). This biases the training process,
as machine learning models tend to prioritize frequently
observed patterns. To mitigate this issue, we focus on
addressing the imbalance in recordings per participant, rather
than the overall class imbalance. This approach balances
the dataset for training while preserving valuable participant
information [25]. Instead of omitting participants with
exceptionally high recording counts, we cap the number of
recordings per participant at 990: for any participant who
exceeds the cap, 990 randomly selected recordings are
retained. This mitigates bias while retaining valuable
individual data. We deliberately avoid excluding
participants, a consideration particularly crucial for the
underrepresented TB+ class. With this maximum of 990
recordings per participant, a balanced dataset suitable for
machine learning algorithms
is achieved, as illustrated in Table III.
TABLE III: Data distribution after removing outliers
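The per-participant cap described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual preprocessing code: the `(participant_id, recording_id)` layout, the function name `cap_recordings`, and the toy counts are assumptions made for the example; only the cap of 990 comes from the text.

```python
import random
from collections import defaultdict

MAX_PER_PARTICIPANT = 990  # cap stated in the text


def cap_recordings(recordings, cap=MAX_PER_PARTICIPANT, seed=0):
    """Keep at most `cap` randomly chosen recordings per participant.

    `recordings` is a list of (participant_id, recording_id) pairs; this
    layout is an assumption made for illustration.
    """
    rng = random.Random(seed)
    by_participant = defaultdict(list)
    for participant_id, recording_id in recordings:
        by_participant[participant_id].append(recording_id)

    capped = []
    for participant_id, items in by_participant.items():
        # Participants at or under the cap keep everything; the rest are subsampled.
        kept = items if len(items) <= cap else rng.sample(items, cap)
        capped.extend((participant_id, r) for r in kept)
    return capped


# Toy data: one heavy contributor and one typical participant.
toy = [("p1", i) for i in range(5000)] + [("p2", i) for i in range(120)]
balanced = cap_recordings(toy)
counts = {p: sum(1 for q, _ in balanced if q == p) for p in ("p1", "p2")}
print(counts)  # {'p1': 990, 'p2': 120}
```

Every participant stays in the dataset; only the surplus recordings of heavy contributors are dropped.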
Preserving Participant Insights: While outliers are re-
moved, participants are still retained for two major reasons.
(1) Individual Variations: Cough patterns vary between
individuals. Retaining recordings ensures the model is ex-
posed to these variations, potentially aiding in capturing
subtle cough characteristics relevant to TB classification.
(2) Participant-Level Context: The quantity of recordings
may inherently contain valuable information. For example, a
participant with a substantially higher number of recordings
may indicate a more severe cough condition. Retaining a
subset of recordings enables the model to potentially learn
from this context. This approach balances the dataset for
training while preserving valuable information related to
individual participants and potential cough-related insights.
IV. METHODS
To extract discriminative information from cough recordings,
we employ two feature engineering approaches:
Mel-Spectrograms and Mel-Frequency Cepstral Coefficients
(MFCCs), as illustrated in Fig. 2. These techniques transform
the raw audio data into time-frequency representations that emphasize
the cough’s frequency content and characteristics.
Mel-Spectrogram: The first approach employs Mel-
Spectrograms, a visual representation of the cough sound’s
frequency content over time. We explore various deep learn-
ing models, including 1D and 2D CNNs, VGG16, and
ResNet50, to analyze these Spectrograms. These models
demonstrate proficiency in identifying patterns within im-
ages, making them well-suited for extracting informative
features from the visual cough representations. This process
is formulated in Algorithm 1.
MFCC: The second approach utilizes MFCCs, which
characterize the spectral envelope of the cough sound.
Fig. 2: Framework for Automated Tuberculosis Detection
via Cough Analysis
Algorithm 1: Computing Mel-Spectrogram
Data: Audio signal x, Number of Mel filters M,
Window size W, Hop length H, Sampling
rate Fs, Pre-emphasis coefficient α (optional)
Result: Mel-Spectrogram MS
1: Load audio signal x from the dataset
2: Apply pre-emphasis to the audio signal (optional):
$y[n] = x[n] - \alpha \cdot x[n-1]$
3: Divide the pre-emphasized signal into frames of
window size W and hop length H
4: for each frame do
5: Apply a window function (e.g., Hann) to the
frame: windowed_frame = window · frame
end
6: Compute the Mel-Spectrogram using
librosa.feature.melspectrogram (which performs the
framing and windowing of steps 3-5 internally):
MS = librosa.feature.melspectrogram(y=y, sr=Fs,
n_mels=M, window=window, win_length=W,
hop_length=H), with y = x when pre-emphasis is skipped
7: Return the Mel-Spectrogram MS
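Steps 2 to 5 of Algorithm 1 can also be written out directly in NumPy, which makes explicit what a library call does internally. The sketch below is illustrative only: the function name `frame_power_spectrum`, the 400-sample window, the 160-sample hop, the coefficient 0.97, and the synthetic 1 kHz tone are all assumptions, and the Mel filtering of step 6 is deliberately left out.

```python
import numpy as np


def frame_power_spectrum(x, win_length=400, hop_length=160, alpha=0.97):
    """Pre-emphasize, frame, window, and return the per-frame power spectrum."""
    x = np.asarray(x, dtype=float)
    # Step 2: pre-emphasis y[n] = x[n] - alpha * x[n-1] (first sample kept as is).
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Step 3: split into overlapping frames (a trailing partial frame is dropped).
    n_frames = 1 + (len(y) - win_length) // hop_length
    idx = np.arange(win_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = y[idx]
    # Steps 4-5: apply a Hann window to every frame.
    windowed = frames * np.hanning(win_length)
    # Power spectrum of each windowed frame (one-sided FFT).
    return np.abs(np.fft.rfft(windowed, axis=1)) ** 2


sr = 16000                               # assumed sampling rate in Hz
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)      # 1 s synthetic 1 kHz tone, not a cough
spec = frame_power_spectrum(tone)
print(spec.shape)                        # (frames, frequency bins)
```

Multiplying each row of `spec` by a Mel filter bank would then give the Mel-Spectrogram of step 6.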
MFCCs are a well-established technique in audio analysis
[26], [27]. Various deep learning models, particularly one-
dimensional and two-dimensional CNNs, are adapted to
analyze these features. These CNNs excel at processing
sequential data like MFCCs, enabling them to discern in-
formative patterns from the cough’s spectral characteristics.
This process is formulated in Algorithm 2.
Feature Engineering: This process involves transforming
the raw audio signal into a format suitable for analysis
by deep learning models. This is accomplished by extract-
ing Low-Level Descriptors (LLDs) from each audio frame
and then applying statistical operations to compress these
features into a more manageable format. This condensed
representation allows the deep learning models to focus on
the most informative aspects of the cough sound. To enhance
the generalizability and robustness of the findings, the
pre-processed dataset is systematically partitioned into training
(75%), validation (5%), and testing (20%) sets. Critically, we
maintain class balance within each set, ensuring the model
is trained on a representative distribution of TB+ and TB-
cough recordings, as shown in Table IV.
Algorithm 2: Computing MFCCs
Data: Audio signal x, Pre-emphasis coefficient α,
Number of Mel filters M, Frame length L, Hop
length H, Desired number of MFCC coefficients N
Result: MFCC features MFCC
1: Load audio signal x from the dataset
2: Apply pre-emphasis to the audio signal:
$y[n] = x[n] - \alpha \cdot x[n-1]$
3: Divide the pre-emphasized signal into frames of
length L with hop length H
4: for each frame do
5: Compute the magnitude spectrum using the
Fourier Transform: $X[k] = |\mathrm{FFT}(y[n])|$
6: Create Mel filter banks using
librosa.filters.mel (refer to librosa
documentation for arguments)
7: Apply Mel filters to the magnitude
spectrum:
Mel_filtered_spectrum = Mel_filters · X[k]
8: Compute MFCC features using
librosa.feature.mfcc on the log-scaled Mel spectrum
(passed through the S argument; the y argument is
reserved for the time-domain signal):
MFCC_coeffs = librosa.feature.mfcc(S =
log-scaled Mel_filtered_spectrum)
9: Keep the first N coefficients of
MFCC_coeffs as MFCC features
end
10: Store the MFCC features MFCC for further
analysis or classification
TABLE IV: Data splitting (unit: number of recordings)
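The class-balanced 75/5/20 partition can be sketched as a stratified split: each class is shuffled and divided separately, so every subset keeps the original TB+/TB- proportions. This is an illustrative sketch only; the function name `stratified_split` and the toy label counts are assumptions and do not reflect the CODA TB figures.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.75, 0.05, 0.20), seed=0):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder, roughly 20%
    return train, val, test


labels = ["TB+"] * 400 + ["TB-"] * 600       # toy labels, not the CODA TB counts
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))       # 750 50 200
```

The sketch splits individual recordings, matching the unit given for Table IV; grouping recordings by participant before splitting would be a stricter variant that keeps each person's coughs in a single subset.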
A. Optimizing Feature Representation: Mel-Spectrogram
Conversion Approach
We leverage image classification techniques to differenti-
ate TB+ from TB- cases. However, feeding raw audio signals
directly into the model can be computationally expensive.
To address this, we propose an approach as shown in Fig.
3 that converts cough recordings into Mel-Spectrograms
represented as NumPy arrays. Mel-Spectrograms provide a
visually informative representation of the frequency content
over time within a cough recording. These spectral represen-
tations, analogous to “cough fingerprints”, are particularly
useful for image classification tasks [28]. By utilizing the
Librosa library, we efficiently convert audio signals into
Mel-Spectrogram arrays. Librosa achieves this by transforming
short, overlapping frames of the audio into the frequency
domain and applying Mel filters
to capture the cough’s energy distribution across different
frequency bands.
Fig. 3: Proposed Method Approach 1 - Comprehensive
Pipeline for TB Detection with Mel-Spectrogram conversion.
This approach offers two main advantages. (1) Reduced
Computational Cost: NumPy arrays are optimized for nu-
merical computations, leading to faster model training com-
pared to raw audio data. (2) Efficient Memory Management:
NumPy arrays provide efficient memory handling, which is
crucial for large datasets.
It is noteworthy that a diverse array of methods has been
employed in the literature for large-scale data processing,
involving data mining over biological data [29] and image
processing over complex data types [30]. Likewise, in this
study, we investigate the performance of four commonly
used deep learning models to systematically assess the
efficacy of Mel-Spectrogram features for TB+ vs. TB-
classification (Approach 1). We now introduce each model’s
architectures, training processes, and key findings.
One-Dimensional Convolutional Neural Network (1D
CNN) with Mel-Spectrograms: This model (Fig. 4) serves
as a baseline, utilizing a sequential architecture with two
hidden convolutional layers. It uses ReLU activation in the
hidden layers and Softmax activation in the output layer,
which produces one probability per class (TB+ and TB-). Max pooling
is implemented to facilitate downsampling after each con-
volutional layer, thereby reducing feature dimensionality.
Fig. 4: 1D CNN with Mel-Spectrograms [32]
Two-Dimensional Convolutional Neural Network (2D
CNN) with Mel-Spectrograms: Building upon the 1D CNN,
we implement a more complex 2D CNN architecture with
three hidden convolutional layers. This model (Fig. 5) also
utilizes ReLU activation in hidden layers, but employs a
Sigmoid activation function in the output layer, optimized for
binary classification tasks. Average pooling is used for down-
sampling after each convolutional layer. The training process
exhibits a smooth convergence pattern, with a significant
initial drop in loss and a corresponding increase in accuracy,
reaching a stable state around the 20th epoch. Early stopping
is implemented to prevent overfitting, ensuring the model
generalizes well to unseen data.
Fig. 5: 2D CNN with Mel-Spectrograms [31].
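The Softmax output of the 1D CNN and the Sigmoid output of the 2D CNN are closely related in the two-class case. As a short worked identity (a standard result, not specific to this study), let $z_{+}$ and $z_{-}$ denote the two output logits, symbols introduced here only for illustration:

```latex
\[
\mathrm{softmax}(z)_{+}
  = \frac{e^{z_{+}}}{e^{z_{+}} + e^{z_{-}}}
  = \frac{1}{1 + e^{-(z_{+} - z_{-})}}
  = \sigma(z_{+} - z_{-})
\]
```

A two-unit Softmax layer therefore yields the same TB+ probability as a single Sigmoid unit applied to the difference of the two logits, so the two output layers differ in parameterization rather than in what they can represent.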
Transfer Learning with VGG16: To harness the power
of pre-trained models, we leverage transfer learning with
VGG16, a deep convolutional neural network pre-trained on
the massive ImageNet dataset. VGG16 excels at extracting
informative features from images. In this approach (Fig. 6),
we freeze the top layer of the pre-trained VGG16, preserving
its learned features. Subsequently, a series of fully connected
dense layers is added on top of the pre-trained network.
These new layers are trained using our Mel-Spectrogram
data to fine-tune VGG16 for TB classification. This ap-
proach leverages VGG16’s feature extraction capabilities
while adapting it to the specific task of TB detection.
Fig. 6: VGG16 architecture with Mel-Spectrograms [32].
Transfer Learning with ResNet50: We further investigate
transfer learning via ResNet50 (Fig. 7), a deeper convolutional
neural network pre-trained on ImageNet. Similar
to VGG16, features extracted from ResNet50 are passed
through a dense layer to generate the final TB classifica-
tion prediction. ResNet50’s architecture potentially provides
richer feature representations compared to VGG16, which
could lead to improved TB detection accuracy. The perfor-
mance of VGG16 and ResNet50 for TB classification with
Mel-Spectrograms is systematically evaluated and compared.
Fig. 7: ResNet50 architecture with Mel-Spectrograms [33].
B. Optimizing Feature Representation: Mel-Frequency Cep-
stral Coefficients (MFCC) Extraction
As an alternative approach, we propose to investigate
Mel-Frequency Cepstral Coefficients (MFCCs), as shown
in Fig. 8. Unlike Mel-Spectrograms, MFCCs directly cap-
ture the perceptually relevant spectral shape of the audio
signal, focusing on frequencies crucial to human hearing.
This compressed representation offers two key advantages.
(1) Enhanced Computational Efficiency: The extraction of
MFCCs is computationally more efficient compared to the
generation of Mel-Spectrograms, rendering it appropriate for
real-time or resource-constrained environments. (2) Compact
Feature Representation: MFCCs provide a more concise
feature set compared to Spectrograms, potentially leading
to improved model training efficiency.
Fig. 8: Proposed Method Approach 2 - Comprehensive
Pipeline for TB Detection using Mel-frequency Cepstral
coefficients (MFCCs).
1D CNN with MFCC: We propose to employ a model
with a sequential architecture of two hidden convolutional
layers specifically designed to process 1D feature vectors
representing the MFCCs. It is depicted in Fig. 9. The key
components of this model include the following. (1) Se-
quential Architecture: Layers are stacked sequentially, with
the output of one layer serving as input to the subsequent
layer. (2) Downsampling: Max pooling, implemented after
each convolutional layer, is a widely adopted technique
for reducing feature map dimensionality while preserving
salient features. (3) Output Layer: The final dense layer
uses a Softmax activation function over the two classes
(TB+ vs. TB-). The Softmax function generates
class probabilities, indicating the likelihood of each class for
a given cough sample.
Fig. 9: 1D CNN with MFCC [33].
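To make the data flow through such a network concrete, the forward pass of a two-convolution 1D CNN can be written in plain NumPy. This is a sketch under stated assumptions, not the trained model from this study: the input length of 40, the kernel width of 3, the 16 and 32 filters, and the random untrained weights are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1d(x, w, b):
    """Valid 1D convolution. x: (length, in_ch), w: (kernel, in_ch, out_ch)."""
    k = w.shape[0]
    out_len = x.shape[0] - k + 1
    out = np.empty((out_len, w.shape[2]))
    for i in range(out_len):
        # Contract the kernel window with the filters over (kernel, in_ch).
        out[i] = np.tensordot(x[i:i + k], w, axes=([0, 1], [0, 1])) + b
    return out


def relu(x):
    return np.maximum(x, 0.0)


def max_pool1d(x, size=2):
    """Non-overlapping max pooling along the length axis."""
    n = x.shape[0] // size
    return x[:n * size].reshape(n, size, x.shape[1]).max(axis=1)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def forward(x, params):
    h = max_pool1d(relu(conv1d(x, *params["conv1"])))   # 40 -> 38 -> 19
    h = max_pool1d(relu(conv1d(h, *params["conv2"])))   # 19 -> 17 -> 8
    w_out, b_out = params["dense"]
    return softmax(h.reshape(-1) @ w_out + b_out)


# Hypothetical sizes: 40 MFCC values, 3-wide kernels, 16 then 32 filters.
params = {
    "conv1": (rng.normal(0, 0.1, (3, 1, 16)), np.zeros(16)),
    "conv2": (rng.normal(0, 0.1, (3, 16, 32)), np.zeros(32)),
    "dense": (rng.normal(0, 0.1, (8 * 32, 2)), np.zeros(2)),
}
x = rng.normal(size=(40, 1))          # stand-in for one MFCC feature vector
probs = forward(x, params)            # two class probabilities from untrained weights
print(probs.shape, float(probs.sum()))
```

In practice a deep learning framework would supply these layers and the training loop; the sketch only shows how convolution, ReLU, max pooling, and the Softmax output fit together.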
2D CNN with MFCC: We further propose to explore a
2D CNN architecture with MFCC features, as shown in Fig.
10. This model utilizes Librosa to convert audio recordings
into MFCCs, resulting in a 3D tensor representation (time,
frequency, coefficients). The key components of this model
are detailed as follows. (1) Architecture: Sequential CNN
with three hidden convolutional layers. (2) Hidden Layers:
Similar to the 1D CNN, each hidden layer utilizes the ReLU
activation function. (3) Downsampling: Average pooling
is used after each convolutional layer for dimensionality
reduction. (4) Output Layer: The final dense layer employs
the Sigmoid activation function, suitable for binary classifi-
cation, i.e. TB vs. Non-TB outputs.
Fig. 10: 2D CNN with MFCC [31].
Enhancing Model Generalizability through Cross-Validation:
To assess the models’ generalizability and detect
overfitting, we employ k-fold cross-validation (k =
5). The dataset is partitioned into k folds. For each fold, the
model is trained on the other (k-1) folds and evaluated on the
remaining held-out validation fold, using metrics such as accuracy,
precision, recall, and F1-score. This training-evaluation cycle
is repeated for all k folds, and the performance metrics from
each round are averaged to yield a more robust estimate
of model generalizability on unseen data.
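The cross-validation loop and the four metrics can be sketched in pure Python. The helper names (`k_fold_indices`, `binary_metrics`, `cross_validate`), the label encoding (1 = TB+, 0 = TB-), and the placeholder predictor are assumptions for illustration; a real run would train a CNN inside `fit_predict`.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with 1 = TB+ and 0 = TB-."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}


def cross_validate(labels, fit_predict, k=5):
    """Average the per-fold metrics; `fit_predict` is a user-supplied callable."""
    totals = {}
    for train_idx, val_idx in k_fold_indices(len(labels), k):
        preds = fit_predict(train_idx, val_idx)
        scores = binary_metrics([labels[i] for i in val_idx], preds)
        for name, value in scores.items():
            totals[name] = totals.get(name, 0.0) + value / k
    return totals


# Placeholder "model" that always predicts TB+; it stands in for a trained CNN.
labels = [1] * 40 + [0] * 60
always_positive = lambda train_idx, val_idx: [1] * len(val_idx)
result = cross_validate(labels, always_positive)
print(result)
```

Reporting precision, recall, and F1 alongside accuracy matters here because a trivial predictor such as the placeholder can reach a misleading accuracy on imbalanced data.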
V. RESULTS & DISCUSSION
A. Performance of Models
This section presents the performance analysis of models
utilizing Mel-Spectrogram features. Table V summarizes the
performance comparison of various deep learning models
for TB classification
using Mel-Spectrograms.
Notably, the 1D CNN model with MFCCs demonstrates superior
overall performance, outperforming models employing
Mel-Spectrograms. This suggests that the compact MFCC
representation might be a more effective input than
Mel-Spectrograms for TB classification in this dataset. The
2D CNN with MFCCs underperforms the 1D CNN, possibly
because the added complexity of the 2D representation is
sub-optimal for capturing the relevant patterns in the MFCC
features. Table VI summarizes
the performance comparison of various deep learning models
for TB classification using MFCCs.
TABLE V: Performance comparison of Mel-Spectrogram
models for TB classification
TABLE VI: Performance comparison of MFCC models for
TB classification
B. Discussion on Performance
Key Observations and Alignment with Occam’s Razor:
Analysis of all models reveals that a simple 1D CNN
designed for processing 1D features yields the best results
for TB classification with both Mel-Spectrogram and MFCC
inputs. This observation aligns with the principle of
Occam’s Razor, favoring simpler models when they achieve
comparable or better performance [34]. Such solutions in
favor of simplicity have frequently been adopted in the
literature, including in previous studies [35], [36], [37].
Transfer learning approaches might require further fine-
tuning or utilizing pre-trained models specifically designed
for audio tasks to improve their effectiveness in TB classi-
fication. Our findings demonstrate the potential of multiple
deep learning models trained on cough sound features with
adequate feature extraction to accurately classify TB cases.
Open Issues Emerging from the Study: While this
research demonstrates promise, several open issues and
opportunities for further studies remain. Examination of
the impact of longer cough recordings on classification
accuracy is crucial. External validation on broader datasets
encompassing diverse populations and TB prevalence rates
is essential for generalizability. Incorporating explainable
artificial intelligence (XAI) techniques may augment model
interpretability, potentially facilitating the discovery of novel
audio biomarkers for TB detection. Investigation of one-
dimensional audio-specific architectures may potentially en-
hance performance and mitigate complexity relative to more
intricate models or Mel-Spectrogram-based approaches. Fur-
ther research and development are necessary to refine the
models, optimize performance, and ensure generalizability
across diverse populations and cough characteristics. Ad-
dressing various challenges, e.g. background noise, cough
variations, and developing user-friendly recording devices,
is necessary for clinical implementation.
AI-Based Medical Analysis: By addressing these open
issues and pursuing further studies, we can refine the models
for TB classification using cough sounds. This research
trajectory possesses the potential to yield a robust, in-
terpretable, and generalizable approach for TB detection,
ultimately contributing to advancements in public health
[38], [39]. This study presents a significant step towards
utilizing AI-powered cough analysis as a valuable tool in the
global fight against tuberculosis, thereby making significant
impacts on healthcare.
VI. CONCLUSIONS AND ROADMAP
This study examines the potential of deep learning-based
classification using cough sound analysis for Tuberculo-
sis (TB) detection. Two feature extraction approaches are
investigated: Mel-Spectrograms and MFCC. We evaluate
the effectiveness of four neural network models for TB
classification using these features.
Our findings reveal the potential of integrating audio
data (cough sounds) to improve TB detection accuracy.
This research trajectory facilitates the exploration of non-
invasive and potentially cost-effective screening tools. We
demonstrate the effectiveness of a simple 1D CNN model
for TB classification using MFCC features [40]. This finding
suggests that learning from compact MFCC features
might be more efficient than relying on complex architectures
or Mel-Spectrogram representations for this specific
task. We observe the limited benefits of utilizing transfer
learning approaches with pre-trained models such as VGG16
and ResNet50 for TB detection.
Our findings pave the way for many promising avenues.
Primarily, there is a need to examine the impact of longer
recordings and validate the model on diverse populations.
Such an approach could significantly enhance the model’s
predictive capabilities. Furthermore, the incorporation of
uncertainty quantification is crucial to enhance confidence in
the model’s predictions, thereby facilitating evidence-based
clinical decisions. To broaden the scope of this research, the
algorithmic scope should be extended, encompassing addi-
tional classifiers and conducting comprehensive evaluations
to ensure robustness and reliability.
By addressing these future directions, we can refine deep
learning models for TB classification using cough sounds.
This holds promise for developing a robust, interpretable,
and generalizable AI-infused approach to TB detection,
ultimately improving public health.
ACKNOWLEDGMENT
The datasets used for the analyses described were contributed by
Dr. Adithya Cattamanchi at UCSF and Dr. Simon Grandjean Lapierre
at the University of Montreal and were generated in collaboration with
researchers at Stellenbosch University (PI Grant Theron), Walimu (PIs
William Worodria and Alfred Andama), De La Salle Medical and Health
Sciences Institute (PI Charles Yu), Vietnam National Tuberculosis Program
(PI Nguyen Viet Nhung), Christian Medical College (PI DJ Christopher),
Centre Infectiologie Charles Mérieux Madagascar (PIs Mihaja Raberahona
& Rivonirina Rakotoarivelo), and Ifakara Health Institute (PIs Issa Lyimo
& Omar Lweno) with funding from the U.S. National Institutes of Health
(U01 AI152087), The Patrick J. McGovern Foundation and Global Health
Labs. They were obtained as part of the COugh Diagnostic Algorithm for
Tuberculosis (CODA TB) DREAM Challenge through
Synapse [syn31472953].
Jyoti Yadav acknowledges her Graduate Assistantship from Montclair
State University. Dr. Aparna Varde acknowledges NSF grant 2018575. She
is an Associate Director of the Clean Energy and Sustainability Analytics
Center (CESAC) at Montclair State University. Dr. Hao Liu acknowledges
his Startup Funds from the College of Science and Mathematics at Montclair
State University. Dr. Lei Xie heads the Precision Drug Discovery Lab at
CUNY, Hunter, NY, as a Full Professor. He is also an Adjunct Professor at
Weill Cornell Medical College, Cornell University, NY.
REFERENCES
[1] S. Bagcchi (2023). WHO’s global tuberculosis report 2022. The
Lancet Microbe, 4(1), e20.
[2] A. Matteelli, A. Rendon, S. Tiberi, S. Al-Abri, C. Voniatis, A.
Carvalho, & G. B. Migliori (2018). Tuberculosis elimination: where
are we now? European Respiratory Review, 27(148).
[3] R. G. Loudon, & S. K. Spohn, (1969). Cough frequency and infec-
tivity in patients with pulmonary tuberculosis. American Review of
Respiratory Disease, 99(1), 109-111.
[4] G. P. Kafentzis, S. Tetsing, J. Brew, L. Jover, M. Galvosas, C.
Chaccour, & P. Small (2023). Predicting Tuberculosis from
Real-World Cough Audio Recordings and Metadata. arXiv preprint
arXiv:2307.04842.
[5] M. Pahar, M. Klopper, B. Reeve, R. Warren, G. Theron, & T. Niesler
(2021). Automatic cough classification for tuberculosis screening in a
real-world environment. Physiological Measurement, 42(10), 105014.
[6] Liu, H., Perl, Y., & Geller, J. (2020). Concept placement using
BERT trained by transforming and summarizing biomedical ontology
structure. Journal of Biomedical Informatics, 112, 103607.
[7] Liu, H., Carini, S., Chen, Z., Hey, S. P., Sim, I., & Weng, C. (2022).
Ontology-based categorization of clinical studies by their conditions.
Journal of Biomedical Informatics, 135, 104235.
[8] Yadav, J., Varde, A. S., & Xie, L. (2023). Comprehensive cough data
analysis on CODA TB. In 2023 IEEE International Conference on
Big Data (BigData) (pp. 6311-6313). IEEE.
[9] Lee, J. E., et al. (2019). Diagnostic accuracy of chest radiography for
pulmonary tuberculosis in HIV-infected patients: a systematic review
and meta-analysis. Intl. J. of tuberculosis and lung disease, 23(1),
74-83.
[10] A. S. Varde, D. Karthikeyan, & W. Wang (2023). Facilitating COVID
recognition from X-rays with computer vision models and trans-
fer learning. Multimedia Tools and Applications (MTAP) Journal,
Springer, 83(1):807-838, https://doi.org/10.1007/s11042-023-15744-9
[11] Mak, M. D., & Wai, C. H. (2017). Machine learning for medical
diagnosis using cough sounds: A review. International Journal of Data
Mining and Bioinformatics, 11(2), 282-299.
[12] Karthikeyan, D., Varde, A. S., & Wang, W. (2020). Transfer learning
for decision support in Covid-19 detection from a few images in big
data. In 2020 IEEE International Conference on Big Data (Big Data)
(pp. 4873-4881). IEEE.
[13] Zheng, L., Perl, Y., He, Y., Ochs, C., Geller, J., Liu, H., & Keloth, V.
K. (2021). Visual comprehension and orientation into the COVID-19
CIDO ontology. Journal of Biomedical Informatics, 120, 103861.
[14] Liu, H., Chi, Y., Butler, A., Sun, Y., & Weng, C. (2021). A knowl-
edge base of clinical trial eligibility criteria. Journal of biomedical
informatics, 117, 103771.
[15] Garg, A., Tandon, N., & Varde, A. S. (2020). I am guessing you can’t
recognize this: Generating adversarial images for object detection
using spatial commonsense. AAAI Conf. 34(10):13789-13790.
[16] Varde, A., Rundensteiner, E., Javidi, G., Sheybani, E., & Liang, J.
(2007). Learning the relative importance of features in image data. In
2007 IEEE ICDE workshops (pp. 237-244).
[17] M. Puri, Z. Dau, & A.S. Varde (2021). COVID and social media:
Analysis of COVID-19 and social media trends for smart living and
healthcare. ACM SIGWEB, (2021 Autumn), Article 5, pp. 1-20.
[18] Varde, A. S. (2009). Challenging research issues in data mining,
databases and information retrieval. ACM SIGKDD Explorations
Newsletter, 11(1), 49-52.
[19] Tsai, C. F., et al. (2018). Automated cough analysis using machine
learning for the diagnosis of pulmonary tuberculosis. Journal of
medical and biological engineering, 38(3), 140-149.
[20] Cho, J., et al. (2017). An automatic cough sound classification system
for tuberculosis detection using a deep learning approach with Mel-
Spectrogram and gammatone filter bank features. Sensors, 17(12),
2906.
[21] Iwendi, C., et al. (2020). Machine learning methods for cough analysis
for computer-aided diagnosis of tuberculosis. International journal of
tuberculosis and lung disease, 24(1), 74-83.
[22] Xie, L., & Xie, L. (2023). Elucidation of genome-wide understudied
proteins targeted by PROTAC-induced degradation using interpretable
machine learning. PLOS Computational Biology, 19(8), e1010974.
[23] Varde, A. S. (2022). Computational estimation by scientific data
mining with classical methods to automate learning strategies of
scientists. ACM Transactions on Knowledge Discovery from Data
(TKDD), 16(5), 1-52.
[24] X. Du, O. Emebo, A. Varde, N. Tandon, S. N. Chowdhury &
G. Weikum, (2016). Air quality assessment from social media and
structured data: Pollutants and health impacts in urban planning.
IEEE 32nd International Conference on Data Engineering (ICDE),
Workshops, pp. 54-59, doi: 10.1109/ICDEW.2016.7495616.
[25] L. Xie, E. Draizen, & P. Bourne, (2017). Harnessing big data for sys-
tems pharmacology. Annual Review of Pharmacology & Toxicology,
57, 245-262.
[26] Zheng, F., Zhang, G., & Song, Z. (2001). Comparison of different
implementations of MFCC. Journal of Computer science and Tech-
nology, 16, 582-589.
[27] Shen, Jonathan, Ruoming Pang, Ron J. Weiss, Mike Schuster,
Navdeep Jaitly, Zongheng Yang, Zhifeng Chen et al. "Natural TTS
synthesis by conditioning WaveNet on mel spectrogram predictions."
In 2018 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 4779-4783. IEEE, 2018.
[28] Xie, L., & Xie, L. (2023). Elucidation of genome-wide understudied
proteins targeted by PROTAC-induced degradation using interpretable
machine learning. PLOS Computational Biology, 19(8), e1010974.
[29] R. Hidalgo, A. DeVito, N. Salah, A. S. Varde, & R. W. Meredith
(2022). Inferring Phylogenetic Relationships using the Smith-
Waterman Algorithm and Hierarchical Clustering. IEEE International
Conference on Big Data, pp. 5910-5914.
[30] Antoniou, G. E., & Coutras, C. A. (2023, July). 5D IIR and All-Pole
Lattice Digital Filters. In 2023 International Symposium on Signals,
Circuits and Systems (ISSCS) (pp. 1-4). IEEE.
[31] Barrera, J. S., Echavarría, A., Madrigal, C., & Herrera-Ramirez, J.
(2020, May). Classification of hyperspectral images of the interior
of fruits and vegetables using a 2D convolutional neuronal network.
Journal of Physics: (Vol. 1547, No. 1, p. 012014).
[32] VGG16, https://neurohive.io/en/popular-networks/vgg16/
[33] CNN Architectures: VGG, ResNet, Inception + TL
[34] Soegaard, M. (2020, July 23). Occam’s Razor: The simplest
solution is always the best. Interaction Design Foundation - IxDF.
https://www.interaction-design.org/literature/article/occam-s-razor-the-simplest-solution-is-always-the-best
[35] Conti, C. J., Varde, A. S., & Wang, W. (2020, September). Robot
action planning by commonsense knowledge in human-robot col-
laborative tasks. In 2020 IEEE International IOT, Electronics and
Mechatronics Conference (IEMTRONICS) (pp. 1-7). IEEE.
[36] Kaluarachchi, A., Roychoudhury, D., Varde, A. S., & Weikum, G.
(2011, March). SITAC: discovering semantically identical temporally
altering concepts in text archives. In Proceedings of the 14th Interna-
tional Conference on Extending Database Technology (pp. 566-569).
[37] Liu, H., Chi, Y., Butler, A., Sun, Y., & Weng, C. (2021). A knowl-
edge base of clinical trial eligibility criteria. Journal of biomedical
informatics, 117, 103771.
[38] Huddart, S., Yadiv, V., Sieberts, S., Omberg, L., Raberahona, M.,
Rakotoarivelo, R. A., ... & Grandjean Lapierre, S. (2024). Solicited
Cough Sound Analysis for Tuberculosis Triage Testing: The CODA
TB DREAM Challenge Dataset. medRxiv, 2024-03.
[39] Jaganath, D., Sieberts, S. K., Raberahona, M., Huddart, S., Omberg,
L., Rakotoarivelo, R. A., ... & CODA TB DREAM Challenge Con-
sortium. (2024). Accelerating cough-based algorithms for pulmonary
tuberculosis screening: Results from the CODA TB DREAM Chal-
lenge. medRxiv, 2024-05.
[40] Jyoti Yadav and Aparna Varde (2024 May), AI in TB Detection
on Medical Big Data with Health and Educational Impacts (BEST
POSTER AWARD), New Jersey Big Data Alliance (NJBDA) Sym-
posium 2024, Rutgers University, New Brunswick, NJ.