Rajeev K. Azad’s research while affiliated with University of North Texas and other places


Publications (187)


Figure 2. Dendrogram of betacoronavirus spike protein RBD sequences, arterivirus GP2 sequences and torovirus sequences constructed using MWHP PCDTW with Euclidean distance and UPGMA hierarchical clustering. ACE2-binding betacoronavirus sequences are highlighted in red, and non-ACE2-binding betacoronavirus sequences are highlighted in blue. Three arterivirus GP2 sequences that are unique within this study, in that they cluster very closely with ACE2-binding betacoronavirus sequences, are labeled with blue text. This combined dendrogram underscores the similarity of the three arterivirus GP2 sequences to the ACE2-binding betacoronavirus sequences.
Figure 3. (A) Pairwise global alignment identity matrix showing identities for all pairwise comparisons of 51 betacoronavirus spike protein RBD sequences and three arterivirus GP2 sequences. (B) Pairwise distance matrix showing MWHP PCDTW distances, scaled to 100, for all pairwise comparisons of 51 betacoronavirus spike protein RBD sequences and three arterivirus GP2 (ArtVGP2) sequences. The Pearson correlation coefficient for the values in A and B is 0.44, with a p-value of 4.98 × 10⁻¹⁴⁰.
An Analysis of Combined Molecular Weight and Hydrophobicity Similarity between the Amino Acid Sequences of Spike Protein Receptor Binding Domains of Betacoronaviruses and Functionally Similar Sequences from Other Virus Families
  • Article

October 2024

·

6 Reads

Microorganisms

·

Lavanya Vumma

·

Rajeev K. Azad

Recently, we proposed a new method, based on protein profiles derived from physicochemical dynamic time warping (PCDTW), to functionally/structurally classify coronavirus spike protein receptor binding domains (RBD). Our method, as used herein, uses waveforms derived from two physicochemical properties of amino acids (molecular weight and hydrophobicity (MWHP)) and is designed to reach into the twilight zone of homology, and therefore, has the potential to reveal structural/functional relationships and potentially homologous relationships over greater evolutionary time spans than standard primary sequence alignment-based techniques. One potential application of our method is inferring deep evolutionary relationships such as those between the RBD of the spike protein of betacoronaviruses and functionally similar proteins found in other families of viruses, a task that is extremely difficult, if not impossible, using standard multiple alignment-based techniques. Here, we applied PCDTW to compare members of four divergent families of viruses to betacoronaviruses in terms of MWHP physicochemical similarity of their RBDs. We hypothesized that some members of the families Arteriviridae, Astroviridae, Reoviridae (both from the genera rotavirus and orthoreovirus considered separately), and Toroviridae would show greater physicochemical similarity to betacoronaviruses in protein regions similar to the RBD of the betacoronavirus spike protein than they do to other members of their respective taxonomic groups. This was confirmed to varying degrees in each of our analyses. Three arteriviruses (the glycoprotein-2 sequences) clustered more closely with ACE2-binding betacoronaviruses than to other arteriviruses, and a clade of 33 toroviruses was found embedded within a clade of non-ACE2-binding betacoronaviruses, indicating potentially shared structure/function of RBDs between betacoronaviruses and members of other virus clades.


Physicochemical Evaluation of Remote Homology in the Twilight Zone

September 2024

·

20 Reads

·

1 Citation

Proteins Structure Function and Bioinformatics

A fundamental problem in the field of protein evolutionary biology is determining the degree and nature of evolutionary relatedness among homologous proteins that have diverged to a point where they share less than 30% amino acid identity yet retain similar structures and/or functions. Such proteins are said to lie within the “Twilight Zone” of amino acid identity. Many researchers have leveraged experimentally determined structures in the quest to classify proteins in the Twilight Zone. Such endeavors can be highly time consuming and prohibitively expensive for large‐scale analyses. Motivated by this problem, here we use molecular weight–hydrophobicity physicochemical dynamic time warping (MWHP DTW) to quantify similarity of simulated and real‐world homologous protein domains. MWHP DTW is a physicochemical method requiring only the amino acid sequence to quantify similarity of related proteins and is particularly useful in determining similarity within the Twilight Zone due to its resilience to primary sequence substitution saturation. This is a step forward in determination of the relatedness among Twilight Zone proteins and most notably allows for the discrimination of random similarity and true homology in the 0%–20% identity range. This method was previously presented expeditiously just after the outbreak of COVID‐19 because it was able to functionally cluster ACE2‐binding betacoronavirus receptor binding domains (RBDs), a task that has been elusive using standard techniques. Here we show that one reason MWHP DTW is an effective technique for comparisons within the Twilight Zone is that it can uncover hidden homology by exploiting physicochemical conservation, a problem that protein sequence alignment algorithms are inherently incapable of addressing within the Twilight Zone. Further, we present an extended definition of the Twilight Zone that incorporates the dynamic relationship between structural, physicochemical, and sequence‐based metrics.
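
As a rough illustration of the idea behind MWHP DTW, the Python sketch below turns each sequence into a two-channel waveform (molecular weight and hydrophobicity per residue) and compares waveforms with textbook dynamic time warping. The Kyte-Doolittle hydropathy scale, the free amino acid molecular weights, the min-max scaling, and the function names `profile` and `dtw_distance` are assumptions chosen for illustration; the abstract does not state which scales or normalization the published method uses.

```python
import math

# Kyte-Doolittle hydropathy values (assumed scale, not confirmed by the abstract).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

# Molecular weights of the free amino acids in daltons (assumed choice).
MW = {"A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.16,
      "Q": 146.15, "E": 147.13, "G": 75.07, "H": 155.16, "I": 131.17,
      "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
      "S": 105.09, "T": 119.12, "W": 204.23, "Y": 181.19, "V": 117.15}


def _minmax(table):
    """Rescale a property table to [0, 1] so both channels weigh equally."""
    lo, hi = min(table.values()), max(table.values())
    return {aa: (v - lo) / (hi - lo) for aa, v in table.items()}


MW_N, KD_N = _minmax(MW), _minmax(KD)


def profile(seq):
    """Map a protein sequence to a list of (weight, hydrophobicity) points."""
    return [(MW_N[aa], KD_N[aa]) for aa in seq.upper()]


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with Euclidean local cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Under these assumptions, a conservative substitution such as isoleucine to leucine barely changes the waveform, whereas isoleucine to aspartate changes it substantially, which is the kind of physicochemical conservation the abstract says sequence identity alone cannot capture.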


An RNA-Seq Data Analysis Pipeline

July 2024

·

29 Reads

·

4 Citations

Methods in molecular biology (Clifton, N.J.)

In this chapter, we present an established pipeline for analyzing RNA-Seq data, which involves a step-by-step flow starting from raw data obtained from a sequencer and culminating in the identification of differentially expressed genes with their functional characterization. The pipeline is divided into three sections, each addressing crucial stages of the analysis process. The first section covers the initial steps of the pipeline, including downloading of the data of interest and performing quality control assessment. Assessment ensures that the data used for analysis is reliable and suitable for downstream analyses. In the second section, gene-level quantification is performed, which entails quantification of expression levels of genes in the samples. The third and final section is focused on differential expression analysis, which involves comparing gene expression levels between two or more conditions. This step helps identify genes that show significant differences in expression levels under different experimental conditions. To facilitate accessibility and reproducibility, we have provided an online repository containing all scripts and files. Additionally, custom scripts are available, enabling users to modify the pipeline’s output for various downstream analyses. By following this pipeline, researchers can effectively analyze RNA-Seq data and gain valuable insights into gene expression patterns and, furthermore, the understanding of biological processes.


RNA Sequencing Experimental Analysis Workflow Using Caenorhabditis elegans

July 2024

·

10 Reads

Methods in molecular biology (Clifton, N.J.)

Jose Robledo

·

Saifun Ripa Nahar

·

Manuel A. Ruiz

·

[...]

·

Pamela A. Padilla

RNA sequencing is an approach to transcriptomic profiling that enables the detection of differentially expressed genes in response to genetic mutation or experimental treatment, among other uses. Here we describe a method for the use of a customizable, user-friendly bioinformatic pipeline to identify differentially expressed genes in RNA sequencing data obtained from C. elegans, with attention to the improvement in reproducibility and accuracy of results.


RNA-Seq Analysis of Mammalian Prion Disease

July 2024

·

15 Reads

Methods in molecular biology (Clifton, N.J.)

A protein, which can attain a prion state, differs from standard proteins in terms of structural conformation and aggregation propensity. High-throughput sequencing technology provides an opportunity to gain insight into the prion disease condition when coupled with single-cell RNA-Seq analysis to reveal transcriptional changes during prion-based pathogenicity. In this chapter, we present a protocol for RNA-Seq analysis of mammalian prion disease using a single-cell RNA sequencing dataset procured from the NCBI GEO database. This protocol is a tool that can assist researchers in characterizing mammalian prion disease in a reproducible and reusable manner. Further, the resulting output has the potential to provide transcript biomarkers for mammalian prion diseases, which can be employed for diagnostic and prognostic purposes.


Figure 6. A boxplot showing the abundance of genera in the COVID-19 and control samples based on POSMM's predictions.
Deciphering Microbial Shifts in the Gut and Lung Microbiomes of COVID-19 Patients

May 2024

·

31 Reads

Microorganisms

COVID-19, caused by SARS-CoV-2, results in respiratory and cardiopulmonary infections. There is an urgent need to understand not just the pathogenic mechanisms of this disease but also its impact on the physiology of different organs and microbiomes. Multiple studies have reported the effects of COVID-19 on the gastrointestinal microbiota, such as promoting dysbiosis (imbalances in the microbiome) following the disease’s progression. Deconstructing the dynamic changes in microbiome composition that are specifically correlated with COVID-19 patients remains a challenge. Motivated by this problem, we implemented a biomarker discovery pipeline to identify candidate microbes specific to COVID-19. This involved a meta-analysis of large-scale COVID-19 metagenomic data to decipher the impact of COVID-19 on the human gut and respiratory microbiomes. Metagenomic studies of the gut and respiratory microbiomes of COVID-19 patients and of microbiomes from other respiratory diseases with symptoms similar to or overlapping with COVID-19 revealed 1169 and 131 differentially abundant microbes in the human gut and respiratory microbiomes, respectively, that uniquely associate with COVID-19. Furthermore, by utilizing machine learning models (LASSO and XGBoost), we demonstrated the power of microbial features in separating COVID-19 samples from metagenomic samples representing other respiratory diseases and controls (healthy individuals), achieving an overall accuracy of over 80%. Overall, our study provides insights into the microbiome shifts occurring in COVID-19 patients, shining a new light on the compositional changes.



DiseaseNet architecture and transfer learning scheme. Left: Fully trained CancerNet. The transfer learning process starts with the fully trained CancerNet. Center: Transformation to DiseaseNet from CancerNet. The weights are frozen for the encoder and decoder. The classification layers (top) are replaced with layers with randomly initialized weights. The output (softmax) layer has four nodes representing the four diseases being classified. In the first round of training, only the classifier is trained. Right: Finetuning of DiseaseNet. Weights of the entire model are unfrozen and allowed to train in the last round until convergence.
Comparison of models for different training schemes. For each disease class, the values of F-measure for the binary models, the 4-class model without transfer learning, and the fully trained DiseaseNet (with transfer learning) are shown by the blue, orange and grey bars, respectively. Overall, DiseaseNet outperformed models representing different architectures and/or training options.
Distribution of samples in the DiseaseNet latent space. (A) Schizophrenia (SCZ), asthma, arthritis, and normal (Norm) samples are color-coded blue, yellow, green, and red, respectively, in DiseaseNet’s latent space generated using t-SNE. They are well separated in the space, with only a few places of overlap by class. (B) Samples are color-coded by their tissue source (sources shown in the legend). The latent space shows a clear separation by source; this indicates that tissue sources play an important role in the classification of the diseases by this model.
DiseaseNet: a transfer learning approach to noncommunicable disease classification

March 2024

·

48 Reads

BMC Bioinformatics

As noncommunicable diseases (NCDs) pose a significant global health burden, identifying effective diagnostic and predictive markers for these diseases is of paramount importance. Epigenetic modifications, such as DNA methylation, have emerged as potential indicators for NCDs. These have previously been exploited in other contexts within the framework of neural network models that capture complex relationships within the data. Applications of neural networks have led to significant breakthroughs in various biological or biomedical fields but these have not yet been effectively applied to NCD modeling. This is, in part, due to limited datasets that are not amenable to building of robust neural network models. In this work, we leveraged a neural network trained on one class of NCDs, cancer, as the basis for a transfer learning approach to non-cancer NCD modeling. Our results demonstrate promising performance of the model in predicting three NCDs, namely, arthritis, asthma, and schizophrenia, for the respective blood samples, with an overall accuracy (f-measure) of 94.5%. Furthermore, a concept based explanation method called Testing with Concept Activation Vectors (TCAV) was used to investigate the importance of the sample sources and understand how future training datasets for multiple NCD models may be improved. Our findings highlight the effectiveness of transfer learning in developing accurate diagnostic and predictive models for NCDs.


Figure 1. The general workflow for benchmarking alignment software tools, which includes four fundamental steps: genome collection and indexing, RNA-Seq simulation, aligners setup, and accuracy testing and analyses.
Figure 2. Comparative analysis of read base-level accuracy (F1-score), produced by RNA-Seq aligners at different parametric settings.
Benchmarking RNA-Seq Aligners at Base-Level and Junction Base-Level Resolution Using the Arabidopsis thaliana Genome

February 2024

·

13 Reads

Plants

The ultimate goal of selecting RNA-Seq alignment software is to perform accurate alignments with a robust algorithm that is capable of handling the various intricacies underlying read-mapping procedures and beyond. Most alignment software tools are typically pre-tuned with human or prokaryotic data, and therefore may not be suitable for applications to other organisms, such as plants. The rapidly growing plant RNA-Seq databases call for the assessment of the alignment tools on curated plant data, which will aid the calibration of these tools for applications to plant transcriptomic data. We therefore focused here on benchmarking RNA-Seq read alignment tools, using simulated data derived from the model organism Arabidopsis thaliana. We assessed the performance of five popular RNA-Seq alignment tools that are currently available, based on their usage (citation count). By introducing annotated single nucleotide polymorphisms (SNPs) from The Arabidopsis Information Resource (TAIR), we recorded alignment accuracy at both base-level and junction base-level resolutions for each alignment tool. In addition to assessing the performance of the alignment tools at their default settings, accuracies were also recorded by varying the values of numerous parameters, including the confidence threshold and the level of SNP introduction. The performances of the aligners were found to be consistent under various testing conditions at the base-level accuracy; however, the junction base-level assessment produced varying results depending upon the applied algorithm. At the read base-level assessment, the overall performance of the aligner STAR was superior to other aligners, with the overall accuracy reaching over 90% under different test conditions. On the other hand, at the junction base-level assessment, SubRead emerged as the most promising aligner, with an overall accuracy over 80% under most test conditions.
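
To make "base-level accuracy" concrete, here is a minimal Python sketch of one way such an F1-score could be computed from simulated reads. It assumes that, for every read, the true genomic positions of its bases are known from the simulation and that the aligner reports the positions it assigned. The function name `base_level_f1` and the dictionary layout are hypothetical, and the paper's exact scoring procedure may differ.

```python
def base_level_f1(truth, predicted):
    """F1-score over individual bases.

    truth, predicted: dict mapping read id -> set of genomic positions.
    Reads missing from `predicted` count as entirely unaligned.
    Reads present only in `predicted` are ignored in this sketch.
    """
    tp = fp = fn = 0
    for read_id, true_pos in truth.items():
        pred_pos = predicted.get(read_id, set())
        tp += len(true_pos & pred_pos)   # bases placed correctly
        fp += len(pred_pos - true_pos)   # bases placed at a wrong position
        fn += len(true_pos - pred_pos)   # true positions that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A junction base-level variant could restrict both sets to positions near annotated splice junctions before scoring; that restriction is likewise an assumption rather than the published definition.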


Figure 1. Schematic workflow diagram for applying machine learning to predict genes underlying differentiation of multipartite and unipartite traits in bacteria. The dataset comprising complete multipartite and unipartite genomes downloaded from NCBI was processed for two ML experiments, namely (1) all-gene-level and (2) differentially present gene-level experiments. In the first, all genes present in both groups, multipartite and unipartite genomes, were used as features, whereas genes unique to each group (thus eliminating the common genes) were used in the second. For each of these, a binary matrix was created with each row representing a sample (bacterial genome) and each column representing a gene, with the presence of a gene in a genome marked by '1' and its absence by '0'. The last column of the matrix is for the sample label, where multipartite is coded by '1' and unipartite is coded by '0'. Each matrix was then used to derive three different sets, namely, 'All Set', 'Intersection Set', and 'Random Set', which were used for the assessment of machine learning (ML) algorithms: (i) All Set: entire gene dataset, (ii) Intersection Set: genes deemed important for discrimination by ML that appeared in all 6 rounds of the ML 6-fold cross-validation, and (iii) Random Set: randomly sampled genes (as many as in the Intersection Set) from All Set. The performance of the ML algorithms was assessed and compared by using various accuracy metrics, including F1 score, classification accuracy (for 10-fold cross-validation), area under the ROC curve, and area under the precision-recall curve.
Figure 3. Assessment of the performance of the machine learning algorithms in classifying multipartite and unipartite genomes based on differentially present gene-level analysis under 6-fold cross-validation setting. The performance metrics used were (a) training precision, (b) training recall, (c) training F1 score, (d) test precision, (e) test recall, (f) test F1 score, (g) 10f CV (ten-fold cross-validation), (h) AU ROC (area under ROC curve), and (i) AU PR (area under precision-recall curve). 'All Set' denotes all genes for training (as in the cross-validation partitioning), 'Intersection Set' refers to set of genes that consistently ranked high across all 6 rounds of cross-validation, and 'Random Set' refers to randomly sampled genes.
List of top 15 genes deemed important or discriminative by machine learning algorithms in classifying multipartite and unipartite genomes based on differentially-present-gene-level analysis.
Using Machine Learning to Predict Genes Underlying Differentiation of Multipartite and Unipartite Traits in Bacteria

November 2023

·

29 Reads

Microorganisms

Since the discovery of the second chromosome in Rhodobacter sphaeroides 2.4.1 by Suwanto and Kaplan in 1989 and the revelation of gene sequences, multipartite genomes have been reported in over three hundred bacterial species under nine different phyla. This phenomenon shattered the dogma of a unipartite genome (a single circular chromosome) in bacteria. Recently, Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) have emerged as powerful tools for investigating big data in a plethora of disciplines and deciphering complex patterns in these data, including the large-scale analysis and interpretation of genomic data. An important inquiry in bacteriology pertains to the genetic factors that underlie the structural evolution of multipartite and unipartite bacterial species. Towards this goal, here we have attempted to leverage machine learning as a means to identify the genetic factors that underlie the differentiation of, in general, bacteria with multipartite genomes and bacteria with unipartite genomes. In this study, deploying ML algorithms yielded two gene lists of interest: one that contains 46 discriminatory genes obtained following an assessment on all gene sets, and another that contains 35 discriminatory genes obtained based on an investigation of genes that are differentially present (or absent) in the genomes of the multipartite bacteria and their respective close relatives. Our study revealed a small pool of genes that discriminate between bacteria with multipartite genomes and their close relatives with single-chromosome genomes. Machine learning thus aided in uncovering the genetic factors that underlie the differentiation of bacterial multipartite and unipartite traits.
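
The data preparation described in the Figure 1 caption of this article (a binary gene presence/absence matrix with a trailing label column, plus the 'Intersection Set' and 'Random Set' derived from it) can be sketched in a few lines of Python. All function names and the input layout below are hypothetical, and the step that ranks genes as important within each cross-validation round is left to whatever ML algorithm is used.

```python
import random


def presence_absence_matrix(genomes, labels):
    """Build the binary matrix: one row per genome, one column per gene.

    genomes: dict mapping genome name -> set of gene names it carries.
    labels:  dict mapping genome name -> 1 (multipartite) or 0 (unipartite).
    Returns the sorted gene list and the rows, each ending with the label.
    """
    genes = sorted(set().union(*genomes.values()))
    rows = []
    for name in sorted(genomes):
        row = [1 if gene in genomes[name] else 0 for gene in genes]
        row.append(labels[name])  # last column holds the class label
        rows.append(row)
    return genes, rows


def intersection_set(important_per_fold):
    """Genes flagged as important in every cross-validation round."""
    return set.intersection(*map(set, important_per_fold))


def random_set(all_genes, size, seed=0):
    """Randomly sampled control set of the same size as the Intersection Set."""
    return set(random.Random(seed).sample(sorted(all_genes), size))
```

Comparing a classifier trained on the Intersection Set against one trained on an equally sized Random Set is what lets the reduced gene list be judged against chance.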


Citations (55)


... The method presented by Dixson and Azad [15], based on shared MWHP physicochemical profiles as evaluated using dynamic time warping (hereafter referred to as MWHP PCDTW), can reach beyond saturation (i.e., into the twilight zone of sequence identity) by exposing conservation of physicochemical properties rather than amino acid identity. Further, the method has also been shown to correlate well with structural comparisons of both simulated and experimentally determined structures, thus demonstrating its usefulness in the inference of structural/functional similarity in lieu of homology/de novo modeling or solving of experimental structures [18]. This approach can thus uncover distant homology based strictly on conserved MWHP physicochemical profiles while preserving the contextual basis encoded in the primary sequence; however, in some instances, the effects of convergent evolution cannot be ruled out. ...

Reference:

An Analysis of Combined Molecular Weight and Hydrophobicity Similarity between the Amino Acid Sequences of Spike Protein Receptor Binding Domains of Betacoronaviruses and Functionally Similar Sequences from Other Virus Families
Physicochemical Evaluation of Remote Homology in the Twilight Zone
  • Citing Article
  • September 2024

Proteins Structure Function and Bioinformatics

... Colony-stimulating factors are also associated with inflammation, and there is evidence that these factors may be part of a mutually dependent proinflammatory cytokine network that includes IL-1 and tumor necrosis factor (TNF) TNF: Tumor necrosis factor (TNF) is perhaps the best known and most intensely studied of the proinflammatory cytokines, and it plays a prominent role in the cytokine storm literature. The name "tumor necrosis factor" was first used in 1975 for a cytotoxic serum factor capable of inducing tumor regression in mice (23), which soon thereafter was reported to play a role in the pathogenesis of malaria and sepsis (14,30,31 We believe that a rethinking of the COVID-19 treatments in this context can have a significant advantage. As regards to IL-6 signalling Flynn et al have noted specifically: ...

An RNA-Seq Data Analysis Pipeline
  • Citing Chapter
  • July 2024

Methods in molecular biology (Clifton, N.J.)

... The problem of antimicrobial resistance has been addressed from various perspectives, including the development of new antibiotics [9], the implementation of biochemical strategies to mitigate resistance, and the enhancement of the efficacy of existing antibiotics [10]. In recent years, machine learning (ML) has emerged as a crucial tool in addressing this public health crisis [11,12]. For instance, graph neural networks have been used to identify new chemical compounds with antibiotic potential against methicillin-resistant Staphylococcus aureus (MRSA) [13]. ...

Silicon versus Superbug: Assessing Machine Learning’s Role in the Fight against Antimicrobial Resistance

Antibiotics

... Here, we propose an approach where Kraken2 is used to identify and remove contaminating sequences from ancient DNA datasets of single organisms to accelerate the mapping process and improve mapping accuracy. We opted for a k-mer-based metagenomic classifier over alignment-based methods due to its faster processing speed [17], and we chose Kraken2 because it presented the best balance between speed, database size, and classification accuracy compared to other metagenomic classifiers [18,19], especially in ancient DNA contexts [20]. Using both simulated (human and dog) and empirical shotgun aDNA datasets, we show that this workflow presents a simple and efficient method that enables the removal of contaminating sequences from aDNA datasets with limited loss of endogenous DNA sequences while simultaneously reducing the overall computational resources needed during the mapping process as well as mitigating any potential errors introduced by spuriously mapping contaminant reads. ...

Benchmarking Metagenomic Classifiers on Simulated Ancient and Modern Metagenomic Data

Microorganisms

... Plants have developed a complex array of strategies to respond to cold stress. When subjected to cold stress, the integrity of cell membranes is compromised, which results in an increase in the relative electrolyte leakage (REL) in the inter- and extracellular fluids (Ji et al., 2024). The measurement of REL serves as a crucial indicator to evaluate the extent of the damage to the plant cell membranes induced by cold stress. ...

The OsTIL1 lipocalin protects cell membranes from reactive oxygen species damage and maintains the 18:3‐containing glycerolipid biosynthesis under cold stress in rice
  • Citing Article
  • September 2023

The Plant Journal

... The relative abundance was calculated as the percentage of each taxon to the overall microbial community in the sample. Next, the reads in each sample that remained unclassified by Kraken 2 were classified using the alignment-free metagenomic profiling tool, POSMM [17]. This profiler uses a probabilistic scoring-based approach to classify the reads and provides a confidence score suitable for thresholding. ...

POSMM: an efficient alignment-free metagenomic profiler that complements alignment-based profiling

Environmental Microbiome

... The advancement of sequencing technology has facilitated the extensive study of genomics in various organisms. Comparative genomics has enabled people to draw evolutionary relationships and compare between organisms (Sengupta & Azad, 2023). Besides, various bioinformatic tools provide a platform allowing researchers to predict and identify virulence factors in a genome. ...

Leveraging comparative genomics to uncover alien genes in bacterial genomes

Microbial Genomics

... Furthermore, in order to better understand the significance of these lncRNAs in COVID-19 progression, we designed the ceRNA regulatory network of these lncRNAs using online databases. Also, bioinformatics construction of the lncRNA/miRNA/mRNA network suggests that the RNF24, F2RL3, ACVR2B, and hsa-miR-23b-3p, hsa-miR-629-5p, hsa-miR-30d-3p, hsa-miR-1307-3p, hsa-miR-342-5p, and hsa-miR-221-5p had the most interaction among the mRNAs and miRNA, respectively [43][44][45][46][47][48][49][50]. Furthermore, a few of these genes have significant involvement in the development and prognosis of COVID-19, and their interactions with lncRNAs highlight their significance. ...

Molecular signatures in the progression of COVID-19 severity

... While exploring the specific evolutionary role of each chromosome was outside the scope of this study, most of the accessory genes that emerged as specific to the LatAm-VpST3 population, or associated with environmental variables, were located on the first chromosome, with the exception of a FAD-dependent oxidoreductase gene. This chromosome is the largest of the two, and impacts of mutation and recombination have previously been found to be higher in this chromosome for V. parahaemolyticus ST36 [4]; however, recent genomic insights suggest the second, smaller chromosome may be involved in adaptation to different ecological niches [29,30]. While this was not identified here, specific analysis for each of the two chromosomes (with specific parameters set for each chromosome) would provide further insights into its role in facilitating the expansion of VpST3 in distinct marine environments. ...

Analysis of multipartite bacterial genomes using alignment free and alignment-based pipelines

Archives of Microbiology

... At present, the anammox bacteria show a fairly independent phylogeny with scarce HGT influence (so far unreported) from other bacteria, unlike most other bacterial phyla (Cuecas et al., 2017; Sheinman et al., 2021; Sengupta and Azad, 2022). This singularity within the complexity of bacterial phylogeny at present time represents an additional piece of evidence in support of a monophyletic origin of the anammox-performing bacteria previously shown on the basis of 16S rRNA (Kuenen, 2008; Lodha et al., 2021; Liao et al., 2022) and in this study through the HDH-encoding gene sequences and phylogenomics (Figure 1). ...

Reconstructing horizontal gene flow network to understand prokaryotic evolution