Evolutionary Bioinformatics

Published by SAGE Publications
Online ISSN: 1176-9343
Recent publications
A diagram for illustrating the sampling sites of the human DT microbiome and analysis design.
Fitting the Sloan et al 18,19 near-neutral model to the DT microbiome; the top graph is for the BM-HP pair (buccal mucosa vs hard palate), and the bottom graph is for the BM-Stool pair (buccal mucosa vs stool). (i) The 3 curves delimit the 95% confidence interval predicted by the Sloan model; accordingly, the pink circles represent species whose occurrence frequency is higher than that of neutral species (ie, the positively selected species), the green circles represent the neutral species, and the cyan circles represent species whose occurrence frequency is lower than that of neutral species (ie, the negatively selected species). (ii) The 2 graphs show contrasting distribution patterns for the 3 categories of species: in the gut (stool) microbiome (bottom graph), selection effects (such as diet effects from fully digested food) may be responsible for the significantly large proportion (62%, pink circles) of positively selected species, whereas the proportion of neutral species (46%, top graph, green circles) was much higher for the oral sites (Table 1). Note that the fit of the Sloan model to the stool microbiome data failed as judged by R², and the failure is also evident in this figure, where the positively selected OTUs are scattered widely above the simulated neutral curve. We argue that even though the model fit for the stool microbiome failed, the proportions of the 3 categories of OTUs classified by the Sloan model remain a valuable reference; that is, there are nearly twice as many positively selected OTUs, which is also consistent with the NSR findings of Ning et al 20.
The average NSR (normalized stochasticity ratio) for each of the 10 DT sites, averaged from pairing with other 9 DT sites.
It is postulated that the human digestive tract (DT) from mouth to intestine is differentiated into diverse niches. For example, Segata et al. discovered that the microbiomes of diverse habitats along the DT could be distinguished as 4 types (niches): (i) stool; (ii) sub-gingival plaques (SubP) and supra-gingival plaques (SupP); (iii) tongue dorsum (TD), throat (TH), palatine tonsils (PT), and saliva (Sal); and (iv) hard palate (HP), buccal mucosa (BM), and keratinized gingiva (KG). These niches differ not only in composition but also in metabolic potential. In a previous study, we applied Harris et al’s multi-site neutral model and Tang and Zhou’s niche-neutral hybrid model to characterize the DT niches discovered by Segata et al. Here, we complement the previous study by applying Sloan’s near-neutral model and Ning et al’s stochasticity analysis framework to quantify the niche-neutral continuum of the DT microbiome distribution and to shed light on the possible ecological/evolutionary mechanisms that shape the continuum. Overall, excluding the stool site, the proportion of neutral OTUs (46%) is slightly higher than that of positively selected OTUs (38%) and significantly higher than that of negatively selected OTUs (15%). The gut (stool) exhibited 3 to 12 times lower neutrality than the other DT sites. The analysis also cross-verified our previous hypothesis that the KG (keratinized gingiva) has distinct assembly dynamics in the DT microbiome and should be treated as a fifth niche. Our findings offer new insight into the long-standing debate over whether a minimum KG width of 2 mm is necessary for marginal periodontal health.
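The three-way split into positively selected, neutral, and negatively selected OTUs reported above can be illustrated with a minimal sketch. It assumes that a neutral model has already been fitted and that each OTU comes with an observed occurrence frequency and the lower and upper bounds of the predicted 95% confidence interval; the function names and the toy numbers are illustrative assumptions, not the authors' code or data.

```python
def classify_otu(observed_freq, lower_bound, upper_bound):
    """Label one OTU relative to a neutral-model confidence band.

    The bounds are assumed to come from an already fitted neutral model;
    this sketch does not fit the model itself.
    """
    if observed_freq > upper_bound:
        return "positive"   # occurs more often than neutrality predicts
    if observed_freq < lower_bound:
        return "negative"   # occurs less often than neutrality predicts
    return "neutral"        # falls inside the confidence band


def category_proportions(otus):
    """Return the share of each category.

    otus: iterable of (observed_freq, lower_bound, upper_bound) tuples.
    """
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    for observed, lower, upper in otus:
        counts[classify_otu(observed, lower, upper)] += 1
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}


# Toy input (hypothetical values, for illustration only).
toy_otus = [(0.9, 0.2, 0.6), (0.4, 0.2, 0.6), (0.1, 0.2, 0.6), (0.5, 0.3, 0.7)]
proportions = category_proportions(toy_otus)
```

Comparing such proportions between site pairs is one simple way to express where a community sits on the niche-neutral continuum, although the quality of the underlying model fit still needs to be checked separately.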
 
Genomic size and gene counts of Vibrio vulnificus. Red and blue indicate clinical and environmental isolates, respectively. Circles and triangles represent absence and presence of the plasmid, respectively.
The maximum-likelihood phylogenetic tree based on single-copy orthologous protein sequences. Red and blue indicate clinical and environmental isolates, respectively. Escherichia coli ATCC 11775T was used as an outgroup.
Ka/Ks ratios of each orthologous ARG in V. vulnificus.
Detailed information of antibiotic-resistant genes annotated in the V. vulnificus genomes.
Vibrio vulnificus is an emergent marine pathogen and the cause of a deadly septicemia. However, the evolutionary mechanism of its antibiotic-resistant genes (ARGs) is still unclear. Twenty-two high-quality complete genomes of V. vulnificus were obtained and grouped into 16 clinical isolates and 6 environmental isolates. Genomic annotation found 23 ARG orthologous genes, among which 14 ARGs were shared by V. vulnificus and other Vibrio members. Furthermore, those ARGs were located on the chromosomes rather than on the plasmids. Phylogenomic reconstruction based on single-copy orthologous protein sequences and ARG protein sequences revealed that clinical and environmental V. vulnificus isolates were scattered across the tree. The calculation of non-synonymous and synonymous substitutions indicated that most ARGs evolved under purifying selection, with Ka/Ks ratios lower than 1, while h-ns, rsmA, and soxR in several clinical isolates evolved under positive selection, with Ka/Ks ratios >1. Our results indicate that the antibiotic-resistant armory of V. vulnificus is not confined to clinical isolates but is also present in environmental ones, and that clinical isolates are inclined to accumulate beneficial non-synonymous substitutions that could be retained to improve competitiveness.
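The interpretation of Ka/Ks ratios used above (below 1 for purifying selection, above 1 for positive selection) can be written as a short sketch. It assumes that Ka and Ks have already been estimated for a gene pair by some external method; the function name, the tolerance, and the example values are illustrative assumptions and do not come from the study.

```python
def selection_regime(ka, ks, tol=1e-9):
    """Return (ratio, label) for a pair of substitution rates.

    ka: non-synonymous substitution rate, ks: synonymous substitution rate.
    Both are assumed to be precomputed; this sketch only interprets them.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form a Ka/Ks ratio")
    ratio = ka / ks
    if abs(ratio - 1.0) <= tol:
        label = "neutral"      # ratio of about 1
    elif ratio < 1.0:
        label = "purifying"    # non-synonymous changes are selected against
    else:
        label = "positive"     # non-synonymous changes are favored
    return ratio, label


# Hypothetical rates, for illustration only.
ratio, label = selection_regime(0.02, 0.10)
```

A ratio computed from very small Ks values is unstable, so in practice such pairs are often filtered before interpretation; that filtering step is outside this sketch.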
 
Background: Zika virus, which is widespread and infects humans through the bites of female Aedes albopictus and Aedes aegypti mosquitoes, represents a serious global health issue. Objective: The objective of the present study is to computationally characterize Zika virus polyproteins (UniProt Name: PRO_0000443018 [residues 1-3423], PRO_0000445659 [residues 1-3423] and PRO_0000435828 [residues 1-3419]) and their envelope proteins using their physico-chemical properties. Methods: To achieve this, the Polarity Index Method (PIM) profile and the Protein Intrinsic Disorder Predisposition (PIDP) profile of 3 main groups of proteins were evaluated: structural proteins extracted from specific databases, Zika virus polyproteins, and their envelope proteins (E) extracted from the UniProt database. Once the PIM profile of the Zika virus envelope proteins (E) was obtained, and since the Zika virus polyproteins were also identified with this profile, the proteins marked as "reviewed" in the UniProt database were searched for similar PIM profiles. Finally, the difference between the PIM profiles of the Zika virus polyproteins and their envelope proteins (E) was tested using 2 non-parametric statistical tests. Results: The PIM profile was found, and confirmed by testing, to be an efficient discriminant that yields a "computational fingerprint" of each Zika virus polyprotein from its envelope protein (E). Conclusion: The PIM profile is a computational tool that can be used to effectively identify Zika virus polyproteins in databases from their envelope protein (E) sequences.
 
Osteosarcoma clustering subtype analysis. (A) Heat map of clustering of prognostic methylation sites. (B) Consensus cumulative distribution function (CDF) plot. (C) Relative change in area under the CDF curve for k = 2-6. (D) t-SNE plot showing the distribution of the 3 clustered subtypes. (E) Survival curves of each DNA methylation subtype.
Overall copy number variation landscape in patients with osteosarcoma. (A) Copy number amplification events in osteosarcoma gene expression levels. (B) Copy number deletion events in osteosarcoma gene expression levels. (C) The GISTIC scores of overall copy number on the chromosomes. (D) Percentage of copy number variants in the corresponding patient cohort (Top 20).
Analysis of immunotherapy for osteosarcoma subtypes. (A) Response rate of immunotherapy among 3 subtypes of osteosarcoma. (B) Differences in TIDE scores among 3 subtypes of osteosarcoma. (C) Differences in dysfunction scores among 3 subtypes of osteosarcoma. (D) Differences in exclusion scores among 3 subtypes of osteosarcoma. ns, not significant; *P < .05. **P < .01. ***P < .001.
Independent prognostic analysis of model risk scores and immune infiltration analysis. (A) Independent prognostic analysis of risk scores. (B) Infiltration abundance of 10 immune cell types in groups with high and low risk scores. (C) Comparison of immune scores in high and low risk score groups. (D) Comparison of stromal scores in high and low risk score groups. (E) Comparison of tumor purity in high and low risk score groups. ns, not significant; *P < .05. **P < .01. ***P < .001.
Background: Osteosarcoma (OS) is the most common malignant bone tumor in clinical practice, and the current ability to predict prognosis at the diagnosis of OS is limited. There is an urgent need to find new diagnostic methods and treatment strategies for OS. Material and methods: We downloaded the multi-omics data for OS from the TARGET database. Prognosis-associated methylation sites were used to identify clustered subtypes of OS, and OS was classified into 3 subtypes (C1, C2, C3). Survival analysis showed significant differences between the C3 subtype and the other subtypes. Subsequently, differentially expressed genes (DEGs) across subtypes were screened and subjected to pathway enrichment analysis. Results: A total of 249 DEGs were identified between the C3 subtype and the other subtypes. Metabolic pathway enrichment analysis showed that the DEGs were significantly enriched in the hypoxia pathway. Based on univariate and multivariate Cox regression analysis, 12 genes from the hypoxia pathway were further screened and used to construct a hypoxia-related prognostic model (HRPM). External validation of the HRPM was performed on the GSE21257 dataset. Finally, differences in survival and immune infiltration between the high and low risk score groups were compared. Conclusion: In summary, we proposed a hypoxia-associated risk model based on a 12-gene expression signature, which is potentially valuable for the prognostic assessment of patients with OS.
 
Comparison of trajectories identified by Slingshot showing data-dependent performance of each workflow. Combinations of icons for columns/rows represent distinct workflows. Entropy (upper triangle) is used to aggregate over multiple trajectories identified by Slingshot and over data subsets corresponding to cell-level and gene-level filtering thresholds. Variation (lower triangle) over different data subsets is given to show the confidence for aggregating the Entropy measure (see Supplementary Material for details). Results suggest data dependence: the use of imputation in the β cells dataset significantly reduces the overlap of PTEs, whereas the imputation step overall preserves the identified PTEs in α cells: (a) pancreatic differentiation α cells and (b) pancreatic differentiation β cells.
Rank correlation of geodesic distances on DDRTree trajectories, median-aggregated over subsets, showing both data-specific performance and an overall increase based on the number of cells. (a-c) The TKI treatment dataset shows improved overlap of cell orderings. Although the TKI dataset is relatively more heterogeneous, the increased number of cells allows DDRTree to capture robust cell-cell similarities. (d-h) The remaining datasets show variable results, with the pancreatic maturation β dataset performing comparably to the TKI dataset and the Neurodegeneration dataset performing the poorest.
Sample dimension reductions for Alectinib-treated NSCLC lines showing nonlinearity in the temporal dynamics of gene expression. Since dimension reduction utilizes the transcriptional similarity of individual cells, low-dimensional representations might not necessarily correlate linearly with sampling time. In datasets where sampling time is not linear and/or the underlying transcriptional dynamics are highly heterogeneous, supervised approaches might be more suitable, because the change in transcriptional activity is ordered by the temporal process by default: (a) DDRTree and (b) PAGA-UMAP.
Background Statistical methods developed to address various questions in single-cell datasets show increased variability across parameter regimes. To further delineate the robustness of commonly used methods for single-cell RNA-Seq, we aimed to comprehensively review scRNA-Seq analysis workflows in the setting of dimension reduction, clustering, and trajectory inference. Methods We used datasets with temporal single-cell transcriptomics profiles from public repositories. Combining multiple methods at each level of the workflow, we performed over 6k analyses and evaluated the results of clustering and pseudotime estimation using the adjusted Rand index and rank correlation metrics. We further integrated neural network methods to assess whether models with increased complexity can show an increased bias/variance trade-off. Results Combinatorial workflows showed that non-linear dimension reduction techniques such as t-SNE and UMAP are sensitive to initial preprocessing steps; hence, clustering results on the dimension-reduced space of single-cell datasets should be used carefully. Similarly, pseudotime estimation methods that depend on previous non-linear dimension reduction steps can result in highly variable trajectories. In contrast, methods that avoid non-linearity, such as WOT, can result in repeatable inferences of temporal gene expression dynamics. Furthermore, imputation methods do not substantially improve clustering or trajectory inference results in terms of repeatability. In contrast, the choice of normalization method has a larger effect on downstream analysis, with ScTransform reducing variability overall.
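The adjusted Rand index mentioned above compares two clusterings of the same cells while correcting for chance agreement. The following is a minimal pure-Python sketch of the standard pair-counting formula, shown only to make the metric concrete; it is not the evaluation code used in the study, and the label vectors are toy values.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than 2 items: the labelings trivially agree

    # Pairs of items that share a cluster in both labelings, in A, and in B.
    joint = Counter(zip(labels_a, labels_b))
    same_both = sum(comb(c, 2) for c in joint.values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = same_a * same_b / total_pairs   # agreement expected by chance
    maximum = (same_a + same_b) / 2            # best achievable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (same_both - expected) / (maximum - expected)


# Toy labelings: the same partition with the cluster names swapped.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

Because the index depends only on which items are grouped together, relabeling clusters does not change it, which is what makes it suitable for comparing outputs of different workflows.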
 
Neuroblastoma (NB) is the most common solid malignancy in children. MYCN gene amplification is the most relevant genetic alteration in patients with NB and is associated with poor prognosis. Autophagy plays specific roles in the occurrence, development, and progression of NB. Here, we aimed to identify and assess the prognostic effects of autophagy-related genes (ARGs) in patients with NB and MYCN gene amplification. Differentially expressed ARGs were identified in patients with NB with and without MYCN gene amplification, and the ARG expression patterns and related clinical data from the Therapeutically Applicable Research to Generate Effective Treatments database were used as the training cohort. Least absolute shrinkage and selection operator analyses were used to identify prognostic ARGs associated with event-free survival (EFS), and a prognostic risk score model was developed. Model performance was assessed using the Kaplan–Meier method and receiver operating characteristic (ROC) curves. The prognostic ARG model was verified using the validation cohort dataset, GSE49710. Finally, a nomogram was constructed by combining the ARG-based risk score with clinicopathological factors. Three ARGs (GABARAPL1, NBR1, and PINK1) were selected to build a prognostic risk score model. The EFS in the low-risk group was significantly better than that in the high-risk group in both the training and validation cohorts. A nomogram incorporating the prognostic risk score, age, and International Neuroblastoma Staging System stage showed a favorable predictive ability for EFS rates according to the area under the ROC curve at 3 years (AUC = 0.787) and 5 years (AUC = 0.787). The nomogram demonstrated good discrimination and calibration. Our risk score model for the 3 ARGs can be used as an independent prognostic factor in patients with NB and MYCN gene amplification. The model can accurately predict the 3- and 5-year survival rates.
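A gene-based prognostic risk score of the kind described above is commonly computed as a weighted sum of expression values, after which patients are split into high- and low-risk groups. The sketch below illustrates that general idea only. The coefficient values are hypothetical placeholders (the abstract does not report them), and the median cutoff is an assumption, not a detail confirmed by the source.

```python
from statistics import median

# HYPOTHETICAL coefficients, for illustration only; not taken from the study.
EXAMPLE_COEFFICIENTS = {"GABARAPL1": 0.5, "NBR1": 0.3, "PINK1": 0.2}


def risk_score(expression, coefficients):
    """Weighted sum of expression values over the genes in the signature."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def stratify_by_median(scores):
    """Assign 'high' to scores above the cohort median, 'low' otherwise.

    The median cutoff is an assumed convention, not the authors' stated rule.
    """
    cutoff = median(scores.values())
    return {pid: ("high" if score > cutoff else "low")
            for pid, score in scores.items()}


# Toy cohort with made-up expression values.
cohort = {
    "patient_1": {"GABARAPL1": 1.0, "NBR1": 1.0, "PINK1": 1.0},
    "patient_2": {"GABARAPL1": 2.0, "NBR1": 2.0, "PINK1": 2.0},
    "patient_3": {"GABARAPL1": 0.0, "NBR1": 0.0, "PINK1": 0.0},
}
scores = {pid: risk_score(expr, EXAMPLE_COEFFICIENTS) for pid, expr in cohort.items()}
groups = stratify_by_median(scores)
```

In a real analysis the coefficients would come from the fitted penalized regression, and the resulting groups would then be compared with survival curves and ROC analysis as the abstract describes.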
 
Polar fishes have evolved antifreeze proteins (AFPs) that allow them to survive in subzero temperatures. We performed deep transcriptomic sequencing on a postlarval/juvenile variegated snailfish, Liparis gibbus (Actinopterygii: Scorpaeniformes: Cottoidei: Liparidae), living in an iceberg habitat (−2°C) in Eastern Greenland and report detection of highly expressed transcripts that code for putative AFPs from 2 gene families, Type I and LS-12-like proteins (putative Type IV AFPs). The transcripts encoding both proteins have expression levels among the top <1% of expressed genes in the fish. The Type I AFP sequence is different from a reported Type I AFP from the same species, possibly expressed from a different genetic locus. While prior findings from related adult sculpins suggest that LS-12-like/Type IV AFPs may not have a role in antifreeze protection, our finding of very high relative gene expression of the LS-12-like gene suggests that highly active transcription of the gene is important to the fish in the iceberg habitat and raises the possibility that weak or combinatorial antifreeze activity could be beneficial. These findings highlight the physiological importance of antifreeze proteins to the survival of fishes living in polar habitats.
 
(a) Illustration of metagenomic sequencing and analysis. Reads are first classified into human and non-human sequences. The non-human sequences are mapped to known microbial genomes, revealing the presence of EBV; (b) These wound cells are positive for EBER expression by ISH (original magnification ×100); (c) Atypical lymphocyte infiltration around vessels with destruction of the vascular wall (original magnification ×200).
(a) Distribution of 193 EBV reads on the EBV genome; (b) Illustration of 6 reads mapped to the proteins of type one-half EBNA-3A. The alignments of 2 reads are enlarged on the top; (c) Distances measured by the average nucleotide identity of Type 1 EBV strains from different geographical regions.
The numbers and percentages of human and microbial reads in the metagenomic sequencing.
Accurate diagnosis of chronic, non-healing wounds is challenging and time-consuming because they can be caused by a variety of etiologies. This brief report presents an unusual case of a chronic wound lasting for 10 months that was investigated by deep metagenomic sequencing. Epstein-Barr virus (EBV) was identified in the wound and subsequently validated by in situ hybridization. Histopathologic examination eventually revealed that the non-healing wound was due to an EBV-associated NK/T cell lymphoma. By identifying mutations across the viral genome, the virus was classified as Type 1 EBV and clustered with other strains of geographic proximity. Our results suggest that metagenomic shotgun sequencing can not only rapidly and accurately identify the presence of underlying pathogens but also provide strain-level resolution for the surveillance of viral epidemiology.
 
Schematic flowchart to depict the methodologies used in this study.
Heatmap depicting the predicted impact of substitution mutations for each residue of the ORF10 amino-acid sequence, which is shown in the top label of the figure. As seen from the scoring bar below the heatmap, a dark red square (score > 50) indicates a high score, meaning a strong effect of the substitution mutation; a white square indicates a weak signal (−50 < score < 50), meaning that there may be an effect; and a blue square indicates a low score (score < −50), meaning that the substitution mutation is neutral or has no effect. Black squares indicate the corresponding wild-type residues.
Number of promiscuous immunogenic HLA-I binding epitopes across SARS-CoV-2 proteins studied clustered with HLA-II binding epitopes. Labeling of protein names in the respective bars starts from the first column of names continuing to the next column.
Introduction In an effort to combat SARS-CoV-2 through multi-subunit vaccine design, during studies using the whole genome and immunome, ORF10, located at the 3′ end of the genome, displayed unique features. It showed no homology to any known protein in other organisms, including SARS-CoV. It was observed that its nucleotide sequence is 100% identical in the SARS-CoV-2 genomes sourced worldwide, even in the most recent VoCs and VoIs of the B.1.1.529 (Omicron), B.1.617 (Delta), B.1.1.7 (Alpha), B.1.351 (Beta), and P.1 (Gamma) lineages, implying that it has remained constant throughout the evolution of deadly variants. Aim The structure and function of SARS-CoV-2 ORF10 and the role it may play in viral evolution are yet to be clearly understood. The aim of this study is to predict its structure and function and to understand its evolutionary dynamics on the basis of mutations and likely heightened immune responses in the immunopathogenesis of this deadly virus. Methods Sequence analysis, ab initio structure modeling, and an assessment of the impact of likely substitutions in key regions of the protein were carried out. Analyses of viral T cell epitopes and primary anchor residue mutations were done to understand the role it may play in evolution as a molecule with a likely enhanced immune response and consequent immunopathogenesis. Results Few amino acid substitution mutations are observed, most probably due to ribosomal frameshifting, and these mutations may not be detrimental to its functioning. As ORF10 is observed to be an expressed protein, ab initio structure modeling shows that it comprises mainly an α-helical region and may be an ER-targeted membrane mini-protein. Analyzing the whole proteome, it is observed that ORF10 presents among the highest numbers of likely promiscuous and immunogenic CTL epitopes, specifically 11 of the 30 promiscuous epitopes, of which 9 are immunogenic CTL epitopes. Reactive T cells to these epitopes have been uncovered in independent studies.
The majority of these epitopes are located in the α-helix region of its structure, and the substitution mutations of primary anchor residues in these epitopes do not affect immunogenicity. Its conserved nucleotide sequence throughout the evolution and diversification of the virus into several variants is a puzzle yet to be solved. Conclusions On the basis of its sequence, structure, and epitope mapping, it is concluded that it may function like the mini-proteins used to boost immune responses in medical applications. Because its nucleotide sequence remains completely conserved even a few years after the SARS-CoV-2 genome was first sequenced, it poses a unique puzzle to be solved in view of the evolutionary dynamics of the variants emerging in populations worldwide.
 
The idea of computational processes, which take place in nature, for example, DNA computation, is discussed in the literature. DNA computation that is going on in the immunoglobulin locus of vertebrates shows how the computations in the biological possibility space could operate during evolution. We suggest that the origin of evolutionarily novel genes and genome evolution constitute the original intrinsic computation of the information about new structures in the space of unrealized biological possibilities. Due to DNA computation, the information about future structures is generated and stored in DNA as genetic information. In evolving ontogenies, search algorithms are necessary, which can search for information about evolutionary innovations and morphological novelties. We believe that such algorithms include stochastic gene expression, gene competition, and compatibility search at different levels of structural organization. We formulate the increase in complexity principle in terms of biological computation and hypothesize the possibility of in silico computing of future functions of evolutionarily novel genes.
 
Background Intrauterine growth retardation (IUGR) affects approximately 10% to 15% of all pregnancies worldwide. IUGR is associated not only with stillbirth and newborn death but also with delayed cognition in childhood and the promotion of metabolic and vascular disorders in adulthood. Elucidating the mechanism of IUGR is therefore meaningful and valuable. Methods Datasets related to IUGR were searched on the Gene Expression Omnibus website. Principal component analysis (PCA) was used for normalization. Differentially expressed genes (DEGs) were screened out using the ggplot2 tool. The DEGs were used to conduct Gene Ontology (GO) term and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses, and protein-protein interaction (PPI) analysis. IUGR-related genes were searched on the OMIM website to look for the intersection with the DEGs. The DEGs were analyzed for tissue-specific expression using the online resource BioGPS. The results were displayed through volcano maps, Venn maps, box plots, heat maps, and GSEA enrichment plots drawn with R language packages. Results Eleven DEGs were screened out of 2 datasets. One hundred ninety-five genes related to IUGR were retrieved from OMIM. EGR2 was the only intersection gene found in both groups. Genes associated with placental tissue expression include COL17A1, HSD11B1, and LGALS14. The molecular functions of the DEGs are related to oxidoreductase activity. The following 4 signaling pathways were enriched by GSEA: Reactome signaling by interleukins, Reactome collagen degradation, Naba secreted factors, and PID NFAT tfpathway. Two critical modules comprising 5 up-regulated genes (LEP, PRL, TAC3, MMP14, and ADAMTS4) and 4 down-regulated genes (TIMP4, FOS, CCK, and KISS1) were identified by PPI analysis. Finally, we identified 6 genes (PRL, LGALS14, EGR2, TAC3, LEP, and KISS1) that are potentially relevant to the pathophysiology of IUGR.
Conclusion The candidate down-regulated genes LGALS14 and KISS1, as well as the up-regulated genes PRL, EGR2, TAC3, and LEP, were found to be closely related to IUGR by bioinformatics analysis. These hub genes are related to hypoxia and oxidoreductase activities in placental development. We provide useful and novel information for exploring the potential mechanism of IUGR and for supporting efforts toward its prevention.
 
(A) Heatmap of differential mRNA expression, (B) heatmap of differential circRNA expression, and (C) heatmap of differential lncRNA expression.
ceRNA network of differentially expressed genes.
KEGG enrichment analysis of differential mRNA.
p53 signaling pathway mechanism diagram.
Objective To construct a competitive endogenous RNA (ceRNA) regulatory network derived from exosomes of human breast cancer (BC) by using the exoRbase database, to explore the possible pathogenesis of BC, and to develop new targets for future diagnosis and treatment. Methods The exosomal gene sequencing data of BC patients and normal controls were downloaded from the exoRbase database, and the expression profiles of exosomal mRNA, long non-coding RNA (lncRNA), and circular RNA (circRNA) were analyzed using the R language. The Targetscan and miRanda databases were used jointly to predict the miRNAs (microRNAs) that bind the differentially expressed mRNAs. The miRcode database was used to predict the miRNAs that bind the differentially expressed lncRNAs, and the starBase database was used to predict the miRNAs that bind the differentially expressed circRNAs. The related mRNA, circRNA, and lncRNA data, together with their corresponding miRNA predictions, were imported into Cytoscape software to visualize the ceRNA network. KEGG enrichment analysis and visualization were carried out using KOBAS. The hub gene was determined with the Cytohubba plug-in. Results Forty-two differentially expressed mRNAs, 43 differentially expressed circRNAs, and 26 differentially expressed lncRNAs were screened out. The ceRNA network constructed with Cytoscape software included 19 mRNA nodes, 2 lncRNA nodes, 8 circRNA nodes, and 41 miRNA nodes. KEGG enrichment analysis showed that the differentially expressed mRNAs in the regulatory network were mainly enriched in the p53 signaling pathway. PTEN was identified as the key hub gene. Conclusion A ceRNA regulatory network in blood exosomes of BC patients was successfully constructed in this study, which provides a candidate target for the diagnosis and treatment of BC.
 
The number of aligned nodes that have: (A) AFS > 0.50 and (B) AFS > 0.75 are presented. BioAlign aligns a much larger number of nodes in both cases (AFS > 0.50 and AFS > 0.75). The margin between the results of BioAlign and existing aligners is notably high.
The results in terms of AFS for different percentages of aligned nodes (10, 20, ..., 100) are presented. BioAlign outperforms all existing techniques in all cases (blue color). The PROPER and BEAMS lines (green and pink colors) show incompleteness in 4 out of 5 cases. Twadn and MONACO show incompleteness in the Mouse-yeast case. The remaining algorithms fail to produce high AFS.
Data statistics that include the number of nodes, number of edges, and percentage of proteins with the 3D resolved structure are presented.
The average execution time of the aligners on 5 datasets.
Motivation The advancement of high-throughput PPI profiling techniques results in generating a large amount of PPI data. The alignment of the PPI networks uncovers the relationship between the species that can help understand the biological systems. The comparative study reveals the conserved biological interactions of the proteins across the species. It can also help study the biological pathways and signal networks of the cells. Although several network alignment algorithms are developed to study and compare the PPI data, the development of the aligner that aligns the PPI networks with high biological similarity and coverage is still challenging. Results This paper presents a novel global network alignment algorithm, BioAlign, that incorporates a significant amount of biological information. Existing studies use global sequence and/or 3D-structure similarity to align the PPI networks. In contrast, BioAlign uses the local sequence similarity, predicted secondary structure motifs, and remote homology in addition to global sequence and 3D-structure similarity. The extra sources of biological information help BioAlign to align the proteins with high biological similarity. BioAlign produces significantly better results in terms of AFS and Coverage (6-32 and 7-34 with respect to MF and BP, respectively) than the existing algorithms. BioAlign aligns a much larger number of proteins that have high biological similarities as compared to the existing aligners. BioAlign helps in studying the functionally similar protein pairs across the species.
 
The phylogenetic tree of 60 selected vertebrate species.
(A) The numbers of genes and (B) GO terms from housekeeping (HK) genes, tissue-specific (TS) genes, and the top 10% upstream A-repeat enriched (top-A) genes in humans.
Background Coding and non-coding short tandem repeats (STRs) facilitate a great diversity of phenotypic traits. An imbalance of mononucleotide A-repeats around transcription start sites (TSSs) was previously found in 3 mammals: H. sapiens, M. musculus, and R. norvegicus. Principal Findings We found that the imbalance pattern originated in some vertebrates. A similar pattern was observed in mammals and birds, but not in amphibians and reptiles. We propose that the enrichment of A-repeats upstream of TSSs is a novel hallmark of endotherms, or warm-blooded animals. Gene ontology analysis indicates that the primary functions of upstream A-repeats involve metabolism, cellular transportation, and sensory perception (smell and chemical stimulus) through housekeeping genes. Conclusions Upstream A-repeats may play a regulatory role in the metabolic processes of endothermic animals.
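The imbalance of A-repeats around a TSS described above can be illustrated by counting runs of consecutive A bases on each side of the start site. This is a minimal sketch under stated assumptions: the sequence is given on the sense strand of the gene, and the minimum repeat length (6) and window size (1000) are arbitrary illustrative defaults, not the thresholds used in the study.

```python
import re


def count_a_repeats(sequence, min_len=6):
    """Count non-overlapping runs of at least min_len consecutive A bases."""
    pattern = re.compile("A{%d,}" % min_len)
    return len(pattern.findall(sequence.upper()))


def tss_a_repeat_counts(sequence, tss_index, window=1000, min_len=6):
    """Return (upstream_count, downstream_count) of A-repeats around a TSS.

    sequence:  DNA string on the gene's sense strand (assumed).
    tss_index: 0-based position of the transcription start site.
    """
    upstream = sequence[max(0, tss_index - window):tss_index]
    downstream = sequence[tss_index:tss_index + window]
    return count_a_repeats(upstream, min_len), count_a_repeats(downstream, min_len)


# Toy sequence: two A-repeats before the TSS, only a short A-run after it.
toy_sequence = "AAAAAAA" + "CGT" + "AAAAAA" + "GGCC" + "AAAA" + "TTGCA"
upstream_count, downstream_count = tss_a_repeat_counts(toy_sequence, tss_index=20)
```

Aggregating such per-gene counts over a genome, and comparing upstream with downstream totals, gives one simple way to express the imbalance for a species.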
 
Exon-intron structure for 66 B12D genes from 14 Viridiplantae species visualized by Gene Structure Display Server. The CDS and genomic sequences for each gene were retrieved from the Phytozome database.
(A) Phylogenetic tree of 66 B12D proteins from 14 Viridiplantae species constructed by minimum evolution method with interior branch tests of 1000 replicates using MEGA X software. The tree was built based on B12D domain sequences and (B) conserved motifs in the 66 B12D proteins predicted by MEME server.
Sequence logo for the conserved motifs in the 66 B12D proteins from 14 Viridiplantae species predicted by MEME server.
Digital expression analysis of 6 B12D genes of G. max conducted by Genevestigator v3 software. Gene accession numbers are illustrated in the diagrams: (A) expression patterns in various plant tissues and (B) expression patterns in the developmental stages; cotyledon, trifoliate, flowering, pod fill, seeding, maturation.
Digital expression analysis of 9 B12D genes in H. vulgare conducted by Genevestigator v3 software. Gene accession numbers are illustrated in the diagrams: (A) expression patterns in various plant tissues and (B) expression patterns in the developmental stages; germination, tillering, booting, heading, flowering, spikelet, and ripening.
B12D family proteins are transmembrane proteins that contain the B12D domain involved in membrane trafficking. Plants comprise several members of the B12D family, but the number of these members and their specific functions have not been determined. This study aims to identify and characterize the members of the B12D protein family in plants. B12D proteins from 14 species were retrieved from the Phytozome database. In total, 66 B12D proteins were analyzed in silico for gene structure, motifs, gene expression, duplication events, and phylogenetics. In general, B12D proteins are between 86 and 98 aa in length, have 2 or 3 exons, and comprise a single transmembrane helix. Motif prediction and multiple sequence alignment show strong conservation among B12D proteins of 11 flowering plant species. Despite that, the phylogenetic tree revealed a distinct cluster of 16 B12D proteins that have high conservation across flowering plants. Motif prediction revealed a 41-aa motif, conserved in 58 of the analyzed B12D proteins, that is similar to the bZIP motif, supporting the predicted biological process and molecular function of B12D proteins as DNA-binding proteins. Screening of cis-regulatory elements in putative B12D promoters found various responsive elements for light, abscisic acid, methyl jasmonate, cytokinin, drought, and heat. In addition, there are specific elements for cold stress, cell cycle, circadian rhythm, auxin, salicylic acid, and gibberellic acid in the promoters of a few B12D genes, indicating functional diversification among B12D family members. The digital expression shows that B12D genes of Glycine max have similar expression patterns consistent with their clustering in the phylogenetic tree. However, the expression of B12D genes of Hordeum vulgare appears inconsistent with their clustering in the tree.
Despite the strong conservation of the B12D proteins of Viridiplantae, gene association analysis, promoter analysis, and digital expression indicate different roles for the members of the B12D family during plant developmental stages.
 
CRISPR-Cas systems are an adaptive immunity that protects prokaryotes against foreign genetic elements. Genetic templates acquired during past infection events enable DNA-interacting enzymes to recognize foreign DNA for destruction. Due to the programmability and specificity of these genetic templates, CRISPR-Cas systems are potential alternative antibiotics that can be engineered to self-target antimicrobial resistance genes on the chromosome or plasmid. However, several fundamental questions remain to repurpose these tools against drug-resistant bacteria. For endogenous CRISPR-Cas self-targeting, antimicrobial resistance genes and functional CRISPR-Cas systems have to co-occur in the target cell. Furthermore, these tools have to outplay DNA repair pathways that respond to the nuclease activities of Cas proteins, even for exogenous CRISPR-Cas delivery. Here, we conduct a comprehensive survey of CRISPR-Cas genomes. First, we address the co-occurrence of CRISPR-Cas systems and antimicrobial resistance genes in the CRISPR-Cas genomes. We show that the average number of these genes varies greatly by the CRISPR-Cas type, and some CRISPR-Cas types (IE and IIIA) have over 20 genes per genome. Next, we investigate the DNA repair pathways of these CRISPR-Cas genomes, revealing that the diversity and frequency of these pathways differ by the CRISPR-Cas type. The interplay between CRISPR-Cas systems and DNA repair pathways is essential for the acquisition of new spacers in CRISPR arrays. We conduct simulation studies to demonstrate that the efficiency of these DNA repair pathways may be inferred from the time-series patterns in the RNA structure of CRISPR repeats. This bioinformatic survey of CRISPR-Cas genomes elucidates the necessity to consider multifaceted interactions between different genes and systems, to design effective CRISPR-based antimicrobials that can specifically target drug-resistant bacteria in natural microbial communities.
 
Fitting the alpha-DAR models with 100 times of random permutations of the samples in each group (ie, healthy or diseased treatment).
Gout is a prevalent chronic inflammatory disease that affects the lives of tens of millions of people worldwide, and it typically presents as gout arthritis, gout stone, or even kidney damage. Research has revealed its connection with the gut microbiome, although the exact mechanism is still unclear. Studies have shown a decline of microbiome diversity in gout patients and changes in microbiome composition between gout patients and healthy controls. Nevertheless, to the best of our knowledge, how diversity changes across host individuals at the cohort (population) level has not been investigated. Here we apply the diversity-area relationship (DAR), which is an extension of the classic SAR (species-area relationship) and establishes the power-function model between microbiome diversity and the number of individuals within a cohort, to comparatively investigate diversity scaling (changes) of the gut microbiome in gout patients and healthy controls. The DAR modeling with a study involving 83 subjects (41 gout patients) revealed that the potential number of microbial species in gout patients is only 70% of that in the healthy controls (2790 vs 3900), although the difference may not be statistically significant. The other DAR parameters, including diversity scaling and similarity parameters, did not show statistically significant differences. We postulate that the high resilience of the gut microbiome may explain the lack of significant gout-disease effects on gut microbial diversity at the population level. The lack of a statistically significant difference between the gout patients and healthy controls at the host population (cohort) level differs from previous findings at the individual level in the existing literature.
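The power-function form that this abstract refers to can be sketched as D = c * A^z, where A is the number of individuals accumulated and D is the diversity observed across them. The symbols c and z, the toy data, and the log-log least-squares fit below are assumptions for illustration, not the authors' exact procedure. The abstract does not give the formula behind the "potential number of species", so that quantity is not reproduced here.

```python
import math

def fit_power_law(areas, diversities):
    """Fit D = c * A**z by ordinary least squares on log-transformed data.

    Returns (c, z). All inputs must be positive.
    """
    xs = [math.log(a) for a in areas]
    ys = [math.log(d) for d in diversities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    z = sxy / sxx                       # slope = scaling exponent
    c = math.exp(mean_y - z * mean_x)   # intercept back-transformed
    return c, z

# Hypothetical noise-free accumulation curve: 1..10 individuals.
areas = list(range(1, 11))
diversities = [120.0 * a ** 0.35 for a in areas]

c_hat, z_hat = fit_power_law(areas, diversities)
print(f"c = {c_hat:.2f}, z = {z_hat:.3f}")
```

Because the accumulation order of individuals is arbitrary, a fit like this would normally be repeated over many random permutations of the samples and the parameters averaged, as the figure caption above describes with 100 permutations.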
 
Wenxiang diagrams illustrate protein helices as spirals on a plane and thus have the advantage over helical wheels of being planar graphs. Wenxiang 3.0 extends the original version by adding 3 major features: (1) individual amino acid residues can be colored according to their evolutionary conservation in comparative multiple sequence alignments using CONSURF encoding; (2) α, π, and 3/10 helices can be illustrated by overlaying arcs representative of the pitches of these helices; and (3) the physico-chemical properties of amino acid residues in the protein sequence can be represented by colored geometric shapes.
 
The volcano plot and Heat map of differential phosphorylated proteins. (a) The volcano plot of differential phosphorylated proteins; (b) The Heat map of differential phosphorylated proteins.
Western blotting of PARVA phosphorylation in NPC cells. (a) The expression of PARVA phosphorylation in 5-8F-SDF2L1-OV cells and 5-8F-SDF2L1-NC cells; (b) The expression of PARVA phosphorylation in nasopharyngeal carcinoma cells (HK1, HONE1, and 5-8F) and normal nasopharyngeal epithelial cells (NP69).
The 30 top GO terms of differential phosphorylated proteins. (a) The 10 top biological process of differential phosphorylated proteins; (b) The 10 top cellular components of differential phosphorylated proteins; (c) The 10 top molecular function of differential phosphorylated proteins.
Differential expression of phosphorylated proteins in tumor associated KEGG pathway.
SDF2L1 is a new type of endoplasmic reticulum stress-inducible protein that is related to poor prognosis in various cancers. We initially studied the low expression level of SDF2L1 in nasopharyngeal carcinoma (NPC), but the molecular mechanism of SDF2L1 in NPC needs further elucidation. To identify phosphorylated proteins regulated by SDF2L1 in NPC, label-free quantitative (LFQ) proteomics and 2D-LC-MS/MS analysis were performed on highly metastatic NPC 5-8F cells with overexpression of SDF2L1 and empty segment. Western blotting was applied to validate the differentially expressed phosphorylated proteins (DEPPs). As a result, 331 DEPPs were identified by proteomics, and PARVA phosphorylation (Ser8) was validated. The present results suggest that PARVA phosphorylation may be a promising new biomarker for predicting NPC and may play a key role in the occurrence and development of NPC.
 
Images during grain development of the yellow (Ye; sample 1673) and deep purple (DP; sample 1674) parental seeds: (a) frontal and posterior sides at 10, 20, 30, and 40 days after pollination (DAP) in Ye and DP seeds and (b) cross-sections during grain development of Ye and DP parental seeds at 10 and 20 DAP.
Flavonoid-related genes during yellow (Ye) and deep purple (DP) seed development (10, 20, 30, and 40 days after pollination [DAP]). Genes are divided into early and late biosynthetic genes: (a) CHS: Chalcone synthase, (b) CHI: Chalcone isomerase, (c) F3H: Flavanone 3-hydroxylase, (d) F3′5′H: Flavonoid 3′5′ hydroxylase, (e) DFR: Dihydroflavonol 4-reductase, (f) ANS: Anthocyanidin synthase, (g) ANR: Anthocyanidin reductase, and (h) UFGT: UDP-glucose flavonoid-3-O-glucosyltransferase. Data are means ± SD of 3 biological replicates. Asterisks indicate significant differences when comparing Ye DAP in each developmental stage (*P ⩽ .05).
Schematic representation and expression analysis of flavonoid biosynthetic genes in yellow (Ye) and deep purple (DP) seeds during seed color deposition. The DEG names are listed next to the heat map. Log2 fold changes (DP-RIL/Ye-RIL) of DEGs at 10, 20, and 30 days after pollination (DAP) are illustrated with blue or red boxes, which indicate down- or upregulation, respectively, compared to the genes in Ye-RIL. CHS, chalcone synthase; CHI, chalcone isomerase; F3H, flavanone 3-hydroxylase; F3′H, flavonoid 3′-hydroxylase; FLS, flavonol synthase; OMT1, O-methyltransferase 1; UGT, UDP-glycosyltransferase; DFR, dihydroflavonol reductase; LDOX, leucoanthocyanidin dioxygenase; ANR, anthocyanidin reductase; OMT, O-methyltransferase; GT, glycosyltransferase. Arabidopsis_unipro annotations were used. Numbers represent the fold changes. Colors are scaled using row minimum and maximum values. The mean values were obtained from 3 biological replicates.
Phylogenetic tree and sequence alignment: (a) Phylogenetic tree of the putative MYB transcription factor cloned with other transcription factors related to color in other species. The accession numbers of these proteins (or translated products) are shown in Supplemental Table S3; (b) multiple sequence alignment of candidate anthocyanin-specific R2R3-MYB proteins with known anthocyanin regulators. The R3 repeats of the MYB domain related to the SANT motif are marked according to previously characterized anthocyanin-specific R3-MYBs.
Interaction network analysis and gene expression of related interacting genes. (a) Interaction network analysis of bHLH and related genes of Arabidopsis thaliana. Line thickness is related to the combined score (TaTCL2 > 0.7). The homologous gene in wheat is in red. (b) qRT-PCR of genes showing direct interactions in interaction network analysis. Yellow (Ye) and deep purple (DP) samples. Days after pollination (DAP). Ye in each stage is used as an internal control. Data are means ± SD of 3 biological replicates. Asterisks indicate significant differences when comparing Ye DAP in each developmental stage (*P ⩽ .05. **P ⩽ .01. ***P ⩽ .001).
Plants accumulate key metabolites in response to biotic and abiotic stress conditions. Anthocyanins, carotenoids, and chlorophylls can be found in seed coats, where they act as important antioxidants that affect germination. In wheat, anthocyanins, which have been recognized as health-promoting nutrients, can impart the seed coat color. Transcription factors act as master regulators of cellular processes. Transcription complexes such as MYB-bHLH-WD40 (MBW) regulate the expression of multiple target genes in various plant species. In this study, the spatiotemporal accumulation of seed coat pigments in different developmental stages (10, 20, 30, and 40 days after pollination) was analyzed using cryo-cuts. Moreover, the accumulation of phenolic, anthocyanin, and chlorophyll contents was quantified, and the expression of flavonoid biosynthetic genes was evaluated. Finally, transcriptome analysis was performed to analyze putative MYB genes related to seed coat color, followed by further characterization of putative genes. TaTCL2, an MYB gene, was cloned and sequenced. It was determined that TaTCL2 contains a SANT domain, which is often present in proteins participating in the response to anthocyanin accumulation. Moreover, TaTCL2 transcript levels were shown to be influenced by anthocyanin accumulation during grain development. Interaction network analysis showed interactions with GL2 (HD-ZIP IV), EGL3 (bHLH), and TTG1 (WD40). The findings of this study elucidate the mechanisms underlying color formation in Triticum aestivum L. seed coats.
 
Frequencies of PGx variants of drug metabolizing transporters (DMT).
Background Pharmacogenomics (PGx), forming the basis of precision medicine, has revolutionized traditional medical practice. Currently, drug responses such as drug efficacy, drug dosage, and drug adverse reactions can be anticipated based on the genetic makeup of the patients. The pharmacogenomic data of Pakistani populations are limited. This study investigates the frequencies of pharmacogenetic variants and their clinical relevance among ethnic groups in Pakistan. Methods The Pharmacogenomics Knowledge Base (PharmGKB) database was used to extract pharmacogenetic variants that are involved in medical conditions with high (1A + 1B) to moderate (2A + 2B) clinical evidence. Subsequently, the allele frequencies of these variants were searched among multiethnic groups of Pakistan (Balochi, Brahui, Burusho, Hazara, Kalash, Pashtun, Punjabi, and Sindhi) using the 1000 Genomes Project (1KGP) and ALlele FREquency Database (ALFRED). Furthermore, the published Pharmacogenomics literature on the Pakistani population was reviewed in PubMed and Google Scholar. Results Our search retrieved (n = 29) pharmacogenetic genes and their (n = 44) variants with high to moderate evidence of clinical association. These pharmacogenetic variants correspond to drug-metabolizing enzymes (n = 22), drug-metabolizing transporters (n = 8), and PGx gene regulators, etc. (n = 14). We found 5 pharmacogenetic variants present at >50% among 8 ethnic groups of Pakistan. These pharmacogenetic variants include CYP2B6 (rs2279345, C; 70%-86%), CYP3A5 (rs776746, C; 64%-88%), FLT3 (rs1933437, T; 54%-74%), CETP (rs1532624, A; 50%-70%), and DPP6 (rs6977820, C; 61%-86%) genes that are involved in drug response for acquired immune deficiency syndrome, transplantation, cancer, heart disease, and mental health therapy, respectively. Conclusions This study highlights the frequency of important clinical pharmacogenetic variants (1A, 1B, 2A, and 2B) among multi-ethnic Pakistani populations. 
The high prevalence (>50%) of single nucleotide pharmacogenetic variants may contribute to the drug response/diseases outcome. These PGx data could be used as pharmacogenetic markers in the selection of appropriate therapeutic regimens for specific ethnic groups of Pakistan.
 
Protein-protein interaction network provided by STRING Viruses Consortium and made into a visual graph by Cytoscape between the viral proteins found (in red) and the host's (in blue). According to STRING description, most of these viral proteins carry functions related to cellular proliferation, viral replication, suppression of host's immune response and apoptosis blockade. (A) Interactions among Enterobacteria phage lambda proteins and its host proteins (Escherichia coli). (B) Interactions among Varicella-zoster virus proteins and its host proteins (Homo sapiens). (C) Interactions among Equine Herpesvirus proteins and its host proteins (Equus caballus). (D) Interactions among Escherichia phage T7 proteins and its host proteins (Escherichia coli). (E) Interactions among Human Herpesvirus type 8 proteins and its host proteins (Homo sapiens). (F) Interactions among Vaccinia Virus proteins and its host proteins (Bos taurus).
Bovine papillomavirus (BPV) is associated with bovine papillomatosis, a disease that forms benign warts in epithelial tissues, as well as malignant lesions. Previous studies have detected co-infection between BPV and other viruses, making it likely that these co-infections could influence disease progression. Therefore, this study aimed to identify and annotate viral genes in cutaneous papillomatous lesions of cattle. Sequences were obtained from the GEO database, and an RNA-seq computational pipeline was used to analyze 3 libraries from bovine papillomatous lesions. In total, 25 viral families were identified, including Poxviridae, Retroviridae, and Herpesviridae. All libraries shared similarities in the viruses and genes found. The viral genes shared similarities with BPV genes, especially for functions such as the virion entry pathway, malignant progression by apoptosis suppression, and immune system control. Therefore, this study presents relevant data extending the current knowledge regarding the viral microbiome in BPV lesions and how other viruses could affect this disease.
 
Laryngeal squamous cell carcinoma (LSCC) is one of the most common types of head and neck squamous cell carcinomas (HNSCC) and is the second most prevalent malignancy occurring in the head and neck or respiratory tract, with a high incidence and mortality rate. Survival is limited for patients with LSCC. To identify more biomarkers associated with the prognosis of patients with LSCC, using bioinformatics analysis, this study used The Cancer Genome Atlas (TCGA) LSCC dataset and gene expression profiles of GSE59102 from the Gene Expression Omnibus (GEO). Eighty-one differentially co-expressed genes were identified by weighted gene co-expression network analysis (WGCNA). Next, 10 hub genes (PPL, KRT78, CRNN, PTK7, SCEL, AGRN, SPINK5, AIF1L, EMP1, and PPP1R3C) were screened from a protein-protein interaction (PPI) network. Based on survival analysis, SPINK5 was significantly correlated with survival time in LSCC patients. After verification in the TCGA and HPA databases, SPINK5 was selected as a prognostic biomarker. Finally, the GSEA results showed that downregulation of SPINK5 gene expression may promote tumorigenesis and the development of cancers by the “BASAL CELL CARCINOMA” pathway, and it has been implicated in disrupting DNA damage and repair pathways. Collectively, SPINK5 may serve as a potential prognostic biomarker in LSCC.
 
Differentially expressed genes and their enriched pathways. (A) Volcano plot: blue and red dots indicate significantly down-regulated and up-regulated DEGs, respectively; the red horizontal line indicates P-value <.05, and the 2 vertical lines indicate |log2(FC)|>0.263. (B) Heat map of differentially expressed genes: the red and green horizontal bars at the top indicate high hip BMD and low hip BMD.
TOP10 KEGG pathways enriched for the interacting proteins in protein-protein interaction network.
Background Osteoporosis is a bone disease that increases the patient’s risk of fracture. We aimed to identify robust marker genes related to osteoporosis based on different bioinformatic methods and multiple datasets. Methods Three datasets from Gene Expression Omnibus (GEO) were analyzed separately. Significantly differentially expressed genes (DEGs) between the high hip and low hip bone mineral density (BMD) groups in the first dataset were identified and subjected to Gene Ontology (GO), gene set enrichment analysis (GSEA), and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses to investigate the biological processes that are differently enriched between the high hip and low hip groups. Least absolute shrinkage and selection operator (LASSO) regression, an SVM model, and a protein-protein interaction (PPI) regulatory network were used to generate robust marker genes for downstream TF-target and miRNA-target prediction. Results Several DEGs between the high hip BMD group and the low hip BMD group were obtained. Metabolism-related pathways such as metabolic pathways, carbon metabolism, and glyoxylate and dicarboxylate metabolism showed enrichment in these DEGs. With LASSO regression analysis, 8 differentially expressed genes (SH3BP1, NARF, ANKRD34B, RNF40, ZNF473, AKT1, SHMT1, and VASH1) in GSE62402 were identified as the optimal differential gene combination. Moreover, the SVM validation analysis in the GSE56814 and GSE56815 datasets showed that the characteristic gene combination had high diagnostic performance, with AUC values of 0.899 for GSE56814 and 0.921 for GSE56815. Furthermore, the subcellular localization analysis of the 8 genes revealed that 4 proteins were located in the cytoplasm, 3 proteins were located in the nucleus, and 1 protein was located in the mitochondria. Additionally, the related TFs and miRNAs for 5 genes from the PPI network (AKT1, SHMT1, ZNF473, RNF40, and VASH1) were investigated by TF-target and miRNA-target prediction.
Conclusion The optimal differential gene combination (SH3BP1, NARF, ANKRD34B, RNF40, ZNF473, AKT1, SHMT1, and VASH1) showed high diagnostic performance for osteoporosis risk.
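The volcano-plot cutoffs quoted in the figure caption for this study (P-value below .05 and |log2(FC)| above 0.263, which corresponds to roughly a 1.2-fold change) can be applied as a simple classification rule. The sketch below assumes that log2 fold changes and P-values are already available. The gene labels and values are hypothetical placeholders, not results from the study.

```python
P_CUTOFF = 0.05         # significance threshold from the figure caption
LOG2FC_CUTOFF = 0.263   # |log2(FC)| threshold from the figure caption

def classify_gene(log2_fc: float, p_value: float) -> str:
    """Label a gene as 'up', 'down', or 'ns' (not significant)."""
    if p_value < P_CUTOFF and log2_fc > LOG2FC_CUTOFF:
        return "up"
    if p_value < P_CUTOFF and log2_fc < -LOG2FC_CUTOFF:
        return "down"
    return "ns"

# Hypothetical example rows: (label, log2 fold change, P-value).
toy_results = [
    ("gene_1", 0.40, 0.010),    # passes both cutoffs, positive change
    ("gene_2", -0.30, 0.020),   # passes both cutoffs, negative change
    ("gene_3", 0.50, 0.200),    # large change but not significant
    ("gene_4", 0.10, 0.001),    # significant but change too small
]

labels = {name: classify_gene(fc, p) for name, fc, p in toy_results}
for name, label in labels.items():
    print(name, label)
```

In practice the P-values would usually be adjusted for multiple testing before applying the cutoff; the caption does not say whether that was done here.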
 
Freshwater ecosystems contain a large diversity of microeukaryotes that play important roles in maintaining their structure. Microeukaryote communities vary in composition and abundance on the basis of temporal and environmental variables and may serve as useful bioindicators of environmental changes. In the present study, 18S rRNA metabarcoding was employed to investigate the seasonal diversity of microeukaryote communities during four seasons in the Han River, Korea. In total, 882 unique operational taxonomic units (OTUs) were detected, including various diatoms, metazoans (e.g., arthropods and rotifers), chlorophytes, and fungi. Although alpha diversity revealed insignificant differences based on seasons, beta diversity exhibited a prominent variation in the community composition as per seasons. The analysis revealed that the diversity of microeukaryotes was primarily driven by seasonal changes in the prevailing conditions of environmental water temperature and dissolved oxygen. Moreover, potential indicator OTUs belonging to diatoms and chlorophytes were associated with seasonal and environmental factors. This analysis was a preliminary study that established a continuous monitoring system using metabarcoding. This approach could be an effective tool to manage the Han River along with other freshwater ecosystems in Korea.
 
Genome features of Oceanobacillus jordanicus strain GSFE11.
Protein features of O. jordanicus strain GSFE11.
The bacterium Oceanobacillus jordanicus strain GSFE11 is a halotolerant endophyte isolated from sterilized roots of durum wheat (Triticum turgidum ssp. durum) growing in hot and arid environments of the Ghor Safi area in the Jordan Valley. The draft genome sequence and annotation of this plant growth-promoting endophytic bacterium are reported in this study. The draft genome sequence of Oceanobacillus jordanicus strain GSFE11 has 3 839 208 bp with a G + C content of 39.09%. A total of 3893 protein-coding genes and 68 RNA-coding genes were predicted. Several putative genes that are involved in secretion and delivery systems, transport, adhesion, motility, membrane proteins, plant cell wall modification, and detoxification were identified. Some are characteristic of an endophytic lifestyle, including genes involved in carbohydrate metabolism; genes for xylose, fructose, and chitin utilization; genes for quinone cofactor biosynthesis; genes associated with nitrogen, sulfur, phosphate, and iron acquisition; and genes involved in the biosynthesis of the plant hormone auxin. This study highlights the importance of using genome analysis and phylogenomic analysis to resolve the differences between closely related species. Such analysis showed Oceanobacillus jordanicus strain GSFE11 to be a new species closely related to Oceanobacillus picturae (genome size 3.67 Mb). Oceanobacillus jordanicus has a higher number of predicted genes than Oceanobacillus picturae (3961 genes vs 3823 genes).
 
(A) Distribution of sequences with indels in SARS-CoV-2 spike glycoprotein (n = 1 311 545). Over 50% of sequences (inset pie chart) have at least 1 indel, with sequences containing 3 deleted residues being very frequent. The pattern is similar (open bars) even if only unique sequences (n = 49 118) are considered. (B) Proteins/sequences with indels clearly seem to have a selective advantage as their proportion has risen sharply over time and currently (May 2021) represents 89.3% of all sequences. (C) Month-wise proportions of variants of concern/interest coming from sequences with indels compared to that of without indels.
(A) Map of indels (insertion in green and deletion in red) in SARS-CoV-2 spike glycoprotein. Incidence of indels along the sequence. The first panel shows the frequency (scale at right indicates the number of unique sequence variants) and the second panel shows the occurrence of indels. As many as 420 indel positions (142 insertion and 358 deletion positions) are present. Three-residue deletion of 69, 70 and 144 is the most common combination, but there are as many as 447 unique combinations of indels. (B) Multiple sequence alignment (using Clustal Omega, https://www.ebi.ac.uk/Tools/msa/clustalo/) shows 17 combinations of deletions present in Delta (B.1.617.2) and Kappa (B.1.617.1) variants (representative sequences based on the earliest date of sampling).
(A) Alteration of N-glycosylation sites due to indels. The panel shows potential N-glycosylation sites (NXS sequons in orange and NXT in blue) along the sequence and the effect of indels (gain of site in green, loss of site in red, and altered site in black). Altered residues (compared to reference sequence) are shown in lower case letters with insertions underscored, and deletions struck through. Multiple instances of the same type of change are numbered after hyphen. The gains of sites were more scattered, while the losses of sites were mostly at the N-terminal part of the sequence. (B) Cartoon and (C) space-fill structures of SARS-CoV-2 spike glycoprotein showing the positions of N-glycosylation sites. Some key sites are indicated by arrows and numbers. Of the 6 gains of sites, 3 sites (at 290, 437 and 871) are completely buried in the 3D structure.
SARS-CoV-2, responsible for the current COVID-19 pandemic that claimed over 5.0 million lives, belongs to a class of enveloped viruses that undergo quick evolutionary adjustments under selection pressure. Numerous variants have emerged in SARS-CoV-2, posing a serious challenge to the global vaccination effort and COVID-19 management. The evolutionary dynamics of this virus are only beginning to be explored. In this work, we have analysed 1.79 million spike glycoprotein sequences of SARS-CoV-2 and found that the virus is fine-tuning the spike with numerous amino acid insertions and deletions (indels). Indels seem to have a selective advantage as the proportions of sequences with indels steadily increased over time, currently at over 89%, with similar trends across countries/variants. There were as many as 420 unique indel positions and 447 unique combinations of indels. Despite their high frequency, indels resulted in only minimal alteration of N-glycosylation sites, including both gain and loss. As indels and point mutations are positively correlated and sequences with indels have significantly more point mutations, they have implications in the evolutionary dynamics of the SARS-CoV-2 spike glycoprotein.
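The effect of a deletion on potential N-glycosylation sites, as discussed in this abstract and the accompanying figure caption, can be sketched with a sequon scan. The code assumes the commonly used sequon convention N-X-S/T with X being any residue except proline. The peptide is a made-up toy sequence, not part of the spike protein, and the helper names are hypothetical.

```python
import re

# Lookahead so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")

def find_sequons(protein: str):
    """Return (1-based position of N, sequon) for each N-X-S/T site, X != P."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(protein.upper())]

def delete_residues(protein: str, positions):
    """Return the sequence with the given 1-based positions removed."""
    drop = set(positions)
    return "".join(aa for i, aa in enumerate(protein, start=1) if i not in drop)

# Hypothetical toy peptide: NGS and NAT are sequons, NPT is excluded by the proline.
reference = "MKNGSAANPTLLNATQ"
variant = delete_residues(reference, [4])   # delete the G at position 4

ref_sites = find_sequons(reference)
var_sites = find_sequons(variant)
print("reference:", ref_sites)
print("variant:  ", var_sites)
```

Here the single-residue deletion destroys the first sequon, because the residue two positions after the asparagine is no longer S or T, while the second sequon survives and shifts by one position. Comparing sites between a reference and a variant therefore needs an alignment-aware position mapping, which this sketch leaves out.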
 
Term-centric performance (ROCAUC) per GO term of a LR classifier trained using SeqVec protein-level embeddings on the SwissProt dataset with at most 30% pairwise sequence identity in relation to (a) depth of the GO term and (b) the number of annotated proteins in the training set for the GO term. The Spearman correlations and corresponding P-values are shown in the respective plots. The box-whiskers plots show the interquartile range (IQR) with a box and the median as a bar across the box. Whiskers denote the range equal to 1.5 times the IQR.
Protein-centric performance (F1) per protein of the LR classifier trained using baseline SeqVec protein-level embeddings on the SwissProt dataset in relation to (a) protein sequence length and (b) the number of protein annotations. The LR was trained to predict GO terms. The Spearman correlations and corresponding P-values between protein-centric performance and protein length or number of protein annotations are shown. The box-whiskers plots show the interquartile range (IQR) with a box and the median as a bar across the box. Whiskers denote the range equal to 1.5 times the IQR. (c) Term-centric performance (ROCAUC) of the LR classifier trained using baseline SeqVec protein-level embeddings on the SwissProt dataset. The LR was trained to predict protein length encoded by one-hot encoding in the same intervals as in (a). Errorbars denote 95% confidence intervals estimated using 100 bootstraps.
(a) Phylogenetic tree showing evolutionary relation and divergence time between the training species Mouse and the other test species. Tree produced via the PhyloT tool for phylogenetic tree visualization and divergence times retrieved using the TimeTree tool. 37,38 (b) Average term-centric ROCAUC over all the GO terms and (c) average protein-centric F1 over all the proteins per species for the MLP classifier (brown). The MLP was trained to predict GO terms. Performance is compared to baseline Frequency PSI-BLAST (orange) and to DeepGoPlus (pink). In (c) the coverage C is shown inside the bars. (d) Average protein-centric performance (F1) over all the proteins per species of the MLP in relation to the average protein sequence identity to the Mouse training set. Sequence identity was retrieved using the PSI-BLAST top hit of every protein to the Mouse training set. Errorbars denote 95% confidence intervals estimated using 100 bootstraps.
Median term-centric performance (ROCAUC) per GO category per species of the MLP classifier trained using SeqVec protein-level embeddings on the Mouse training dataset. A missing number indicates that a certain GO category was not present among the evaluated proteins.
The average protein-centric F1 performance over all the proteins per species for the MLP classifier (brown) for (a) biological process GO terms and (b) cellular component GO terms. Performance is compared to baseline Frequency PSI-BLAST (orange). The coverage C is shown inside the bars. Errorbars denote 95% confidence intervals estimated using 100 bootstraps.
Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
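The protein-centric evaluation and the bootstrap error bars mentioned in the figure captions for this study can be sketched as follows. This is a simplified illustration, not the authors' evaluation code. It computes one F1 value per protein from a fixed set of predicted terms, whereas benchmark evaluations often scan over score thresholds. The term labels (T1, T2, ...) and protein names are placeholders, and the percentile interval is one of several possible bootstrap intervals.

```python
import random

def f1_score(true_terms: set, predicted_terms: set) -> float:
    """F1 between the annotated and predicted term sets of one protein."""
    tp = len(true_terms & predicted_terms)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_terms)
    recall = tp / len(true_terms)
    return 2 * precision * recall / (precision + recall)

def bootstrap_ci(values, n_boot: int = 100, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Hypothetical annotations and predictions for four proteins.
truth = {
    "prot_a": {"T1", "T2"},
    "prot_b": {"T3"},
    "prot_c": {"T1", "T4", "T5"},
    "prot_d": {"T2"},
}
predicted = {
    "prot_a": {"T2", "T3"},
    "prot_b": {"T3"},
    "prot_c": {"T1", "T4"},
    "prot_d": {"T5"},
}

per_protein_f1 = [f1_score(truth[p], predicted[p]) for p in truth]
mean_f1 = sum(per_protein_f1) / len(per_protein_f1)
ci_low, ci_high = bootstrap_ci(per_protein_f1, n_boot=100)
print(f"mean F1 = {mean_f1:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

A term-centric evaluation would instead fix one GO term at a time and score how well the predictor ranks proteins for that term, for example with ROCAUC.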
 
Background Sepsis is a dysregulated host response to pathogens. Delay in sepsis diagnosis has become a primary cause of patient death. This study determines some factors to prevent septic shock in its early stage, contributing to the early treatment of sepsis. Methods The sequencing data (RNA- and miRNA-sequencing) of patients with septic shock were obtained from the NCBI GEO database. After re-annotation, we obtained lncRNAs, miRNA, and mRNA information. Then, we evaluated the immune characteristics of the sample based on the ssGSEA algorithm. We used the WGCNA algorithm to obtain genes significantly related to immunity and screen for important related factors by constructing a ceRNA regulatory network. Result After re-annotation, we obtained 1708 lncRNAs, 129 miRNAs, and 17 326 mRNAs. Also, through the ssGSEA algorithm, we obtained 5 important immune cells. Finally, we constructed a ceRNA regulation network associated with SS pathways. Conclusion We identified 5 immune cells with significant changes in the early stage of septic shock. We also constructed a ceRNA network, which will help us explore the pathogenesis of septic shock.
 
The Aurora kinases form a family of 3 genes encoding serine/threonine kinases and are involved in the regulation of cell division during mitosis. This study was designed to investigate the prognostic role of Aurora kinases in hepatocellular carcinoma (HCC). We analyzed the expression, overall survival (OS) data, promoter methylation level, and relationship with immunoinhibitors of Aurora kinases in patients with HCC from the GEPIA2, UALCAN, OncoLnc, and TISIDB databases. Protein-protein interaction (PPI) network, gene ontology, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Reactome pathway analyses were performed using the STRING database and Cytoscape software. We found that the mRNA expression, stages of HCC, and OS of AURKA and AURKB in HCC tissues were significantly different from control tissues, but there were significant inconsistencies in promoter methylation level and relationship with immunoinhibitors for AURKA and AURKB. None of the above items were significantly different for AURKC. Furthermore, a hub module including AURKA, AURKB, and AURKC was identified within the PPI network constructed with the Molecular Complex Detection (MCODE) plug-in in Cytoscape software. Our results show that AURKB could be a potential biomarker for HCC prognosis.
 
Boxplots of alpha diversity indices for the benign (n = 19) and malignant (n = 83) groups. Comparison based on the (a) Chao, (b) ACE, and (c) Shannon indices. P-values from the Wilcoxon test are shown.
Comparison of the gut microbiota taxonomic profiles from patients with benign (n = 19) and malignant (n = 83) tumors. (a) Barplots of the gut microbiota taxonomic profiles of the benign and malignant cases at phylum level. (b) Heatmap shows the OTU presence and absence of the top 50 fecal samples at genus level. (c) Principal component analysis (PCA) based on OTU between groups. (d) Principal co-ordinates analysis (PCoA) based on OTU between groups (P-values from ANOSIM analysis are shown).
Comparison of gut microbiota using LDA and KEGG. (a) Differential taxa between benign and malignant gut microbiota based on a Linear Discriminant Analysis (LDA). Taxa with LDA score >2 at the family and genus level are defined as statistically different. (b) Differential taxa between the gut microbiota of the 2 groups based on a permutation test. (c and d) Histogram of Kyoto Encyclopedia of Genes and Genomes (KEGG) metabolic pathways of gut microbiota.
LEfSe identified the most differential gut microbiota in 83 breast cancer patients. Analysis performed based on clinicopathologic groups in (a) PR+ (n = 47) and PR− (n = 36), (b) ER+ (n = 51) and ER− (n = 32), (c) Her2+ (n = 37) and Her2− (n = 45), (d) pre-menopause (n = 53), post-menopause (n = 30), (e) Ki67
The microbiome plays diverse roles in many diseases and can potentially contribute to cancer development. Breast cancer is the most commonly diagnosed cancer in women worldwide. Thus, we investigated whether the gut microbiota differs between patients with breast carcinoma and those with benign tumors. The DNA of the fecal microbiota community was detected by Illumina sequencing and the taxonomy of 16S rRNA genes. The α-diversity and β-diversity analyses were used to determine richness and evenness of the gut microbiota. Gene function prediction of the microbiota in patients with benign and malignant carcinoma was performed using PICRUSt. There was no significant difference in the α-diversity between patients with benign and malignant tumors (P = 0.315 for the Chao index and P = 0.31 for the ACE index). The microbiota composition was different between the 2 groups, although no statistical difference was observed in β-diversity. Of the 31 different genera compared between the 2 groups, only the level of Citrobacter was significantly higher in the malignant tumor group than in the benign tumor group. The metabolic pathways of the gut microbiome in the malignant tumor group were significantly different from those in the benign tumor group. Furthermore, the study establishes the distinct richness of the gut microbiome in patients with breast cancer with different clinicopathological factors, including ER, PR, Ki-67 level, Her2 status, and tumor grade. These findings suggest that the gut microbiome may be useful for the diagnosis and treatment of malignant breast carcinoma.
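As a rough illustration of the richness and evenness indices named in this abstract, the sketch below computes a Shannon index and a bias-corrected Chao1 estimate from a vector of OTU counts. The function names and the toy counts are hypothetical and are not taken from the study, which does not state which implementation it used.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over OTUs with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of OTUs seen exactly once and exactly twice.
    """
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical OTU counts for one fecal sample (illustration only).
otu_counts = [10, 10, 1, 1, 2, 0]
print(shannon_index(otu_counts), chao1(otu_counts))
```

Per-sample values like these would then be compared between groups with a rank-based test such as the Wilcoxon test mentioned in the figure caption.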
 
Workflow of bioinformatics analysis.
We aimed to discover prognostic factors of muscle-invasive bladder cancer (MIBC) and investigate their relationship with immune therapies. Online data of MIBC were obtained from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) database. Weighted gene co-expression network analysis (WGCNA) and univariate Cox analysis were applied to classify genes into different groups. A Venn diagram was used to find the intersection of genes, and prognostic efficacy was assessed by Kaplan-Meier analysis. A heatmap was utilized for differential analysis. A risk score (RS) was calculated according to multivariate Cox analysis and evaluated by receiver operating characteristic (ROC) curve. MIBC samples from TCGA and GEO were analyzed by WGCNA and univariate Cox analysis and intersected at 4 genes, CLK4, DEDD2, ENO1, and SYTL1. Higher SYTL1 and DEDD2 expressions were significantly correlated with high tumor grades. The RS based on these genes showed great prognostic efficiency in predicting overall survival (OS), disease-specific survival (DSS), and progression-free interval (PFI) in the TCGA dataset (P < .001). The area under the ROC curve (AUC) of RS reached 0.671 in predicting 1-year survival and 0.653 in 3-year survival. KEGG pathway enrichment filtered 5 enriched pathways. xCell analysis showed increased T cell CD4+ Th2 cell, macrophage, macrophage M1, and macrophage M2 infiltration in high RS samples (P < .001). In immune checkpoints analysis, PD-L1 expression was significantly higher in patients with high RS. We have, therefore, constructed RS as a convincing prognostic index for MIBC patients and found potential targeted pathways.
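A minimal sketch of how a Cox-derived risk score and its AUC are commonly computed is given below. It assumes the score is a linear combination of expression values weighted by multivariate Cox coefficients, and it simplifies the time-dependent ROC to a plain binary AUC that ignores censoring. The coefficients and expression values shown are placeholders for illustration, not results from the study.

```python
def risk_score(expression, coefficients):
    """Linear risk score: sum of coefficient * expression over the signature genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def roc_auc(scores, events):
    """AUC as the probability that a sample with the event scores higher than one without.

    Ties count as 0.5. Censoring is ignored, which is a simplification.
    """
    pos = [s for s, e in zip(scores, events) if e]
    neg = [s for s, e in zip(scores, events) if not e]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Placeholder coefficients for the 4 signature genes (hypothetical values).
coefficients = {"CLK4": -0.2, "DEDD2": 0.3, "ENO1": 0.4, "SYTL1": 0.1}
# Placeholder expression profile for one patient (hypothetical values).
patient = {"CLK4": 1.0, "DEDD2": 2.0, "ENO1": 3.0, "SYTL1": 4.0}
print(risk_score(patient, coefficients))
```

Patients would typically be split into high and low RS groups at the median score before the Kaplan-Meier comparison.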
 
The etiology of osteosarcoma (OS) is complex and not yet fully understood. This study aimed to identify the miRNAs, circRNAs, and genes (mRNAs) that are differentially expressed in OS cell lines to investigate the mechanism of circRNA-associated competing endogenous RNAs (ceRNAs) in OS. Microarray datasets reporting mRNA (GSE70414), miRNA (GSE70367), and circRNA changes (GSE96964) in human OS cell lines were downloaded, differentially expressed (DE) RNAs were identified, and DEmRNAs were used for the annotation of Gene Ontology (GO) biological processes (BP) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The mechanisms of DEcircRNA-mediated ceRNAs were identified in a step-by-step process. A total of 326 DEmRNAs, 45 DEmiRNAs, and 110 DEcircRNAs were identified from 3 datasets. The DEmRNAs were associated with GO BP terms, including cholesterol biosynthetic process, angiogenesis, and extracellular matrix organization, and with KEGG pathways, including p53 signaling pathway and biosynthesis of antibiotics. The final ceRNA network consisted of 8 DEcircRNAs, including 5 pappalysin (PAPPA) 1-derived DEcircRNAs (hsa_circ_0005456, hsa_circ_0088209, hsa_circ_0002052, hsa_circ_0088214 and hsa_circ_0008792, all downregulated), 3 DEmiRNAs (hsa-miR-760, hsa-miR-4665-5p and hsa-miR-4539, all upregulated), and downregulated genes (including MMP13 and HMOX1). The ceRNA regulation network of OS was built, which played important roles in the pathogenesis of OS and might be of great importance in therapy.
 
Pearson correlation coefficient (PCC) of RSCU between SARS-CoV-2 and human tissues. Each dot represents a human individual. Thirty human tissues are displayed separately and ranked by the median value. (A) The RSCU calculated by weighting each gene by expression level. Boxplots are shown for each tissue. (B) The RSCU calculated by weighting each gene by expression level. Histogram is shown for all observations with median = −0.32. The vertical lines also show the 95% quantile [−0.36, −0.19]. (C) The RSCU calculated by tissue-specific HEG. Boxplots are shown for each tissue. (D) The RSCU calculated by tissue-specific HEG. Histogram is shown for all observations with median = −0.06. The vertical lines also show the 95% quantile [−0.26, 0.15].
Comparison between SARS-CoV-2 and RaTG13. Their correlation with human tissues is compared. (A) We define delta PCC = PCC(SARS-CoV-2) − PCC(RaTG13), where PCC(SARS-CoV-2) is the PCC of RSCU between SARS-CoV-2 and human samples, and PCC(RaTG13) is the PCC of RSCU between RaTG13 and human samples. RSCU is calculated with tissue-specific HEG. (B) RSCU between SARS-CoV-2 and human sample Lung_GTEX-QDT8-0926-SM-32PL2. The RSCU of all genes is used. (C) RSCU between RaTG13 and human sample Lung_GTEX-QDT8-0926-SM-32PL2. The RSCU of all genes is used. (D) RSCU between SARS-CoV-2 and human sample Lung_GTEX-QDT8-0926-SM-32PL2. The RSCU of tissue-specific HEG is used. (E) RSCU between RaTG13 and human sample Lung_GTEX-QDT8-0926-SM-32PL2. The RSCU of tissue-specific HEG is used.
SARS-CoV-2 needs to efficiently make use of the resources from hosts in order to survive and propagate. Among the multiple layers of regulatory network, mRNA translation is the rate-limiting step in gene expression. Synonymous codon usage usually conforms with tRNA concentration to allow fast decoding during translation. It is acknowledged that SARS-CoV-2 has adapted to the codon usage of human lungs so that the virus could rapidly proliferate in the lung environment. While this notion seems to nicely explain the adaptation of SARS-CoV-2 to lungs, it is unable to tell why other viruses do not have this advantage. In this study, we retrieve the GTEx RNA-seq data for 30 tissues (belonging to over 17 000 individuals). We calculate the RSCU (relative synonymous codon usage) weighted by gene expression in each human sample, and investigate the correlation of RSCU between the human tissues and SARS-CoV-2 or RaTG13 (the closest coronavirus to SARS-CoV-2). Lung has the highest correlation of RSCU to SARS-CoV-2 among all tissues, suggesting that the lung environment is generally suitable for SARS-CoV-2. Interestingly, for most tissues, SARS-CoV-2 has higher correlations with the human samples compared with the RaTG13-human correlation. This difference is most significant for lungs. In conclusion, the codon usage of SARS-CoV-2 has adapted to human lungs to allow fast decoding and translation. This adaptation probably took place after SARS-CoV-2 split from RaTG13 because RaTG13 is less perfectly correlated with human. This finding depicts the trajectory of adaptive evolution from ancestral sequence to SARS-CoV-2, and also well explains why SARS-CoV-2 rather than other viruses could perfectly adapt to human lung environment.
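The expression-weighted RSCU and its correlation between two codon usage profiles can be sketched as follows. This is a minimal illustration under stated assumptions: RSCU is taken as the observed codon count divided by the mean count across the synonymous codons of the same amino acid, each gene's codons are weighted by a single expression value, and stop codons and single-codon amino acids (Met, Trp) are excluded. All function names and inputs are hypothetical, and the study's exact pipeline may differ.

```python
import math
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def weighted_codon_counts(genes):
    """genes: iterable of (coding_sequence, expression_weight) pairs."""
    counts = defaultdict(float)
    for seq, weight in genes:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += weight
    return counts


def rscu(counts):
    """RSCU = codon count / mean count over its synonymous codon family."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts.get(c, 0.0) for c in codons)
        if len(codons) < 2 or total == 0:
            continue  # no synonymous choice, or amino acid not observed
        for c in codons:
            values[c] = counts.get(c, 0.0) * len(codons) / total
    return values


def pearson(x, y):
    """Pearson correlation coefficient; undefined if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rscu_correlation(rscu_a, rscu_b):
    """PCC of RSCU over the codons present in both profiles."""
    shared = sorted(set(rscu_a) & set(rscu_b))
    return pearson([rscu_a[c] for c in shared], [rscu_b[c] for c in shared])
```

With these pieces, the delta PCC of the figure caption would be `rscu_correlation(virus_a, human) - rscu_correlation(virus_b, human)` for one human sample.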
 
Gene expression in 3 datasets. Red represents high expression, blue represents low expression, each column represents a sample, and each row represents a gene. Unclustered heat map of gene expression in GSE27034 (A) unclustered heat map of gene expression in GSE90074 (B) unclustered heat map of gene expression in GSE12288 (C).
Clustered heat map of gene expression in GSE90074 (A) scatter diagram and regression line of XIST and EIF1AY, regression equation: y = −1.32619x − 0.61110, residual standard error: 1.075 on 141 degrees of freedom, n = 143, multiple R-squared: 0.9411, adjusted R-squared: 0.9407, F-statistic: 2253 on 1 and 141 DF, P-value: <2.2e−16 (B) scatter diagram and regression line of XIST and RPS4Y1, regression equation: y = −1.19600x − 2.91454, residual standard error: 1.009 on 141 degrees of freedom, n = 143, multiple R-squared: 0.9482, adjusted R-squared: 0.9478, F-statistic: 2579 on 1 and 141 DF, P-value: <2.2e−16 (C).
Clustered heat map of gene expression in GSE12288 (A) scatter diagram and regression line of XIST and RPS4Y1, regression equation: y = −0.67326x + 12.23007, residual standard error: 0.9911 on 220 degrees of freedom, n = 222, multiple R-squared: 0.7398, adjusted R-squared: 0.7386, F-statistic: 625.5 on 1 and 220 DF, P-value: <2.2e−16 (B) scatter diagram and regression line of XIST and EIF1AY, regression equation: y = −1.01055x + 12.24370, residual standard error: 1.239 on 220 degrees of freedom, n = 222, multiple R-squared: 0.5933, adjusted R-squared: 0.5914, F-statistic: 320.9 on 1 and 220 DF, P-value: <2.2e−16 (C).
Clustered heat map of gene expression in GSE27037 (A) scatter diagram and regression line of XIST and EIF1AY, regression equation: y = −0.61712x + 0.14227, residual standard error: 0.5456 on 35 degrees of freedom, n = 37, multiple R-squared: 0.9219, adjusted R-squared: 0.9197, F-statistic: 413.2 on 1 and 35 DF, P-value: <2.2e−16 (B) scatter diagram and regression line of XIST and RPS4Y1, regression equation: y = −0.51754x − 0.02932, residual standard error: 0.4404 on 35 degrees of freedom, n = 37, multiple R-squared: 0.9491, adjusted R-squared: 0.9477, F-statistic: 652.8 on 1 and 35 DF, P-value: <2.2e−16 (C).
Gene expression in GSE9820. Unclustered heat map of gene expression in GSE9820 (A) clustered heat map of gene expression in GSE9820 (B).
Atherosclerosis is a multifaceted disease characterized by the formation and accumulation of plaques that attach to arteries and cause cardiovascular disease and vascular embolism. A range of diagnostic techniques, including selective coronary angiography, stress tests, computerized tomography, and nuclear scans, assess cardiovascular disease risk and treatment targets. However, there is currently no simple blood biochemical index or biological target for the diagnosis of atherosclerosis. Therefore, it is of interest to find a biochemical blood marker for atherosclerosis. Three datasets from the Gene Expression Omnibus (GEO) database were analyzed to obtain differentially expressed genes (DEGs) and the results were integrated using the RobustRankAggreg algorithm. The genes ranked as more critical by the RobustRankAggreg algorithm were then verified in their own datasets and in a dataset with cell classification information. Twenty-one possible genes were screened out. Interestingly, we found a good correlation between RPS4Y1, EIF1AY, and XIST. In addition, we characterized the general expression of these genes in different cell types and whole blood cells. In this study, we identified BTNL8 and BLNK as having good clinical significance. These results will contribute to the analysis of the underlying genes involved in the progression of atherosclerosis and provide insights for the discovery of new diagnostic and evaluation methods.
 
The CCAAT/enhancer binding protein (C/EBP) transcription factors (TFs) regulate many important biological processes, such as energy metabolism, inflammation, cell proliferation etc. A genome-wide gene identification revealed the presence of a total of 99 C/EBP genes in pig and 19 eukaryote genomes. Phylogenetic analysis showed that all C/EBP TFs were classified into 6 subgroups named C/EBPα, C/EBPβ, C/EBPδ, C/EBPε, C/EBPγ, and C/EBPζ. Gene expression analysis showed that the C/EBPα, C/EBPβ, C/EBPδ, C/EBPγ, and C/EBPζ genes were expressed ubiquitously with inconsistent expression patterns in various pig tissues. Moreover, a pig C/EBP regulatory network was constructed, including C/EBP genes, TFs and miRNAs. A total of 27 feed-forward loop (FFL) motifs were detected in the pig C/EBP regulatory network. Based on the RNA-seq data, gene expression patterns related to FFL sub-network were analyzed in 27 adult pig tissues. Certain FFL motifs may be tissue specific. Functional enrichment analysis indicated that C/EBP and its target genes are involved in many important biological pathways. These results provide valuable information that clarifies the evolutionary relationships of the C/EBP family and contributes to the understanding of the biological function of C/EBP genes.
 
Phylogeny sequence data of B. cereus T4S.
Subsystem category distribution of key protein-coding genes (PCGs) of B. cereus strain T4S annotated in the RAST SEED viewer annotation online server. The green/blue bar represents the subsystem coverage in percentage. The blue bar correlates with the percentage (%) of proteins present.
Plant growth-promoting features of B. cereus T4S.
Comparison of genomic features of B. cereus strain T4S with other B. cereus.
Information on the siderophore similar known gene cluster.
In recent times, diverse agriculturally important endophytic bacteria colonizing plant endosphere have been identified. Harnessing the potential of Bacillus species from sunflower could reveal their biotechnological and agricultural importance. Here, we present genomic insights into B. cereus T4S isolated from sunflower sourced from Lichtenburg, South Africa. Genome analysis revealed a sequence read count of 7 255 762, a genome size of 5 945 881 bp, and G + C content of 34.8%. The genome contains various protein-coding genes involved in various metabolic pathways. The detection of genes involved in the metabolism of organic substrates and chemotaxis could enhance plant-microbe interactions in the synthesis of biological products with biotechnological and agricultural importance.
 
Mean value and standard error of the Z G4Hscore of 13 sliding windows in 5′ UTR and 3′ UTR sequences in 5 eukaryotic species. A window in which rG4 structures were located at the center was used, moving upstream and downstream with a sliding window, the step size of which was 30.
The RNA G-quadruplex (rG4) is a kind of non-canonical high-order secondary structure with important biological functions and is enriched in untranslated regions (UTRs) of protein-coding genes. However, how rG4 structures evolve is largely unknown. Here, we systematically investigated the evolution of RNA sequences around UTR rG4 structures in 5 eukaryotic organisms. We found universal selection on UTR sequences, which facilitated rG4 formation in all the organisms that we analyzed. While G-rich sequences were preferred in the rG4 structural region, C-rich sequences were selectively not preferred. The selective pressure acting on rG4 structures in the UTRs of genes with higher G content was significantly smaller. Furthermore, we found that rG4 structures experienced smaller evolutionary selection near the translation initiation region in the 5' UTR, near the polyadenylation signals in the 3' UTR, and in regions flanking the miRNA targets in the 3' UTR. These results suggest universal selection for rG4 formation in the UTRs of eukaryotic genomes and the selection may be related to the biological functions of rG4s.
 
Comparison of gene relocation between TAS2R and other genes in humans. Gene relocation is inversely related to the conservation of collinearity by comparison with multiple outgroup species: (a) comparison of the number of collinear genes and (b) comparison of the number of outgroup species with collinear genes.
Comparison of Ka/Ks between TAS2R and other genes in humans.
Functional groups of taste genes.
In humans, taste genes are responsible for perceiving at least 5 different taste qualities. The evolutionary mechanisms of human taste genes need to be explored. We compiled a list of 69 human taste-related genes and divided them into 7 functional groups. We carried out comparative genomic and evolutionary analyses for these taste genes based on 8 vertebrate species. We found that relative to other groups of human taste genes, human TAS2R genes have a higher proportion of tandem duplicates, suggesting that tandem duplications have contributed significantly to the expansion of the human TAS2R gene family. Human TAS2R genes tend to have fewer collinear genes in outgroup species and evolve faster, suggesting that human TAS2R genes have experienced more gene relocations. Moreover, human TAS2R genes tend to be under more relaxed purifying selection than other genes. Our study offers new insights into diverse and contrasting evolutionary patterns among human taste genes.
 
Many dinoflagellate species make toxins in a myriad of different molecular configurations but the underlying chemistry in all cases is presumably via modular synthases, primarily polyketide synthases. In many organisms modular synthases occur as discrete synthetic genes or domains within a gene that act in coordination thus forming a module that produces a particular fragment of a natural product. The modules usually occur in tandem as gene clusters with a syntenic arrangement that is often predictive of the resultant structure. Dinoflagellate genomes however are notoriously complex with individual genes present in many tandem repeats and very few synthetic modules occurring as gene clusters, unlike what has been seen in bacteria and fungi. However, modular synthesis in all organisms requires a free thiol group that acts as a carrier for sequential synthesis called a thiolation domain. We scanned 47 dinoflagellate transcriptomes for 23 modular synthase domain models and compared their abundance among 10 orders of dinoflagellates as well as their co-occurrence with thiolation domains. The total count of domain types was quite large with over thirty-thousand identified, 29 000 of which were in the core dinoflagellates. Although there were no specific trends in domain abundance associated with types of toxins, there were readily observable lineage specific differences. The Gymnodiniales, makers of long polyketide toxins such as brevetoxin and karlotoxin had a high relative abundance of thiolation domains as well as multiple thiolation domains within a single transcript. Orders such as the Gonyaulacales, makers of small polyketides such as spirolides, had fewer thiolation domains but a relative increase in the number of acyl transferases. 
Unique to the core dinoflagellates, however, were thiolation domains occurring alongside tetratricopeptide repeats that facilitate protein-protein interactions, especially hexa and hepta-repeats, that may explain the scaffolding required for synthetic complexes capable of making large toxins. Clustering analysis for each type of domain was also used to discern possible origins of duplication for the multitude of single domain transcripts. Single domain transcripts frequently clustered with synonymous domains from multi-domain transcripts such as the BurA and ZmaK like genes as well as the multi-ketosynthase genes, sometimes with a large degree of apparent gene duplication, while fatty acid synthesis genes formed distinct clusters. Surprisingly the acyl-transferases and ketoreductases involved in fatty acid synthesis (FabD and FabG, respectively) were found in very large clusters indicating an unprecedented degree of gene duplication for these genes. These results demonstrate a complex evolutionary history of core dinoflagellate modular synthases with domain specific duplications throughout the lineage as well as clues to how large protein complexes can be assembled to synthesize the largest natural products known.
 
Soil contamination by hydrocarbons due to oil spills has become a global concern, with greater implications in oil-producing regions. Biostimulation is considered one of the promising remediation techniques that can be adopted to enhance the rate of degradation of crude oil. The soil microbial consortia play a critical role in governing the biodegradation of total petroleum hydrocarbons (TPHs), in particular polycyclic aromatic hydrocarbons (PAHs). In this study, the degradation pattern of TPHs and PAHs of Kuwait soil biopiles was measured at three-month intervals. Then, the microbial consortium associated with oil degradation at each interval was revealed through 16S rRNA-based next generation sequencing. Rapid degradation of TPHs and most of the PAHs was observed in the first 3 months of biostimulation, with a degradation rate of pyrene significantly higher than that of other PAHs. The taxonomic profiling of individual stages of remediation revealed that biostimulation of the investigated soil favored the growth of Proteobacteria, Alphaproteobacteria, Chloroflexi, Chlorobi, and Acidobacteria groups. These findings provide a key step towards the restoration of oil-contaminated lands in the arid environment.
 
Gefitinib resistance is a serious threat in the treatment of patients with non-small cell lung cancer (NSCLC). Elucidating the underlying mechanisms and developing effective therapies to overcome gefitinib resistance is urgently needed. The differentially expressed genes (DEGs) were screened from the gene expression profile GSE122005 between gefitinib-sensitive and resistant samples. GO and KEGG analyses were performed with DAVID. The protein-protein interaction (PPI) network was established to visualize DEGs and screen hub genes. The functional roles of CCL20 in lung adenocarcinoma (LUAD) were examined using gene set enrichment analysis (GSEA). Functional analysis revealed that the DEGs were mainly concentrated in inflammatory, cell chemotaxis, and PI3K signal regulation. Ten hub genes were identified based on the PPI network. The survival analysis of the hub genes showed that CCL20 had a significant effect on the prognosis of LUAD patients. GSEA analysis showed that CCL20 high expression group was mainly enriched in cytokine-related signaling pathways. In conclusion, our analysis suggests that changes in inflammation and cytokine-related signaling pathways are closely related to gefitinib resistance in patients with lung cancer. The CCL20 gene may promote the formation of gefitinib resistance, which may serve as a new biomarker for predicting gefitinib resistance in patients with lung cancer.
 
Antimicrobial activity of essential oils.
Antibiotic resistance is a major global health issue that has seen alarming rates of increase in all parts of the world over the past two decades. The surge in antibiotic resistance has resulted in longer hospital stays, higher medical costs, and elevated mortality rates. Constant attempts have been made to discover newer and more effective antimicrobials to reduce the severity of antibiotic resistance. Plant secondary metabolites, such as essential oils, have been the major focus due to their complexity and bioactive nature. However, the underlying mechanism of their antimicrobial effect remains largely unknown. Understanding the antimicrobial mode of action of essential oils is crucial in developing potential strategies for the use of essential oils in a clinical setting. Recent advances in genomics and proteomics have enhanced our understanding of the antimicrobial mode of action of essential oils. We might well be close to solving the mystery of how essential oils carry out their antimicrobial activities. Therefore, an overview of essential oils with regard to their antimicrobial activities and mode of action is discussed in this review. Recent approaches used in identifying the antimicrobial mode of action of essential oils, specifically from the perspective of genomics and proteomics, are also synthesized. Based on the information gathered from this review, we offer recommendations for future strategies and prospects for the study of essential oils and their function as antimicrobials.
 
SARS-CoV-2 sequence occupancy and entropy: (A) fractional occupancy (left axis, dashed lines) and positional entropy (right axis, solid lines) of the NCBI set calculated by WebLogo3 and displayed as a function of SARS-CoV-2 genome coordinates. This analysis focused only on well characterized betacoronavirus structured elements. The relative positional relationships of each region are marked. Hash marks denote areas of the entire genome that were not considered in this study, (B) the same representation as in (A), but calculated using the GISAID EpiCoV database, and (C) Venn diagram of emerging variations in the GISAID set, NCBI set, or both.
Emerging variations in ORF1A stems and the frameshifting pseudoknot: (A) the predicted secondary structure of SL5 is shown. Emerging variants are denoted by an arrow, and the identity of the variation is given next to the arrow. The position of the ORF1a start codon is labeled, (B) the predicted secondary structure of SL6-SL10 is shown. Variations are labeled as in panel A, (C) the secondary structure of the frameshifting pseudoknot is shown. The position and identity of emerging variants are denoted by an arrow and a letter, (D) the molecular model of the frameshifting pseudoknot calculated by RNAcomposer is shown. Stems 1 to 3 are labeled in colors corresponding to those shown on the secondary structure in panel C, and (E) comparison of the base triple observed in the reference model (top) and in the U13536 variation model (bottom). Hydrogen bonds are denoted by dashed lines. The U13536 variant is colored in red.
Emerging variations in the bi-stable molecular switch in the 3ʹUTR: The secondary structure of the 3ʹUTR bi-stable molecular switch in both predicted conformers is shown. The position and identity of emerging variants is denoted by an arrow and a letter.
Molecular dynamics simulations of S2M variations: overlay of the structures from a single representative 180 ns trajectory of the SARS-CoV S2M loop (A and B), G29734C (C and D) and G29742U (E and F) variants. Front and back orientations show the following residues as sticks: G/U29742 and C29754 in (A), (C) and (E); G/C29734 and A29756 in (B), (D) and (F). Structures are colored as a function of time (blue = 0 ns, red = 180 ns). Hydrogen bond frequency between the base pairs of the S2M loop is shown in (G) for SARS-CoV (blue), G29734C (red) and G29742U (green) variants. Hydrogen bond frequency for the interacting nucleotides in the quartet (highlighted in bold font) and base pairs around G/U29742 is shown for SARS-CoV (blue), G29734C (red) and G29742U (green) variants in (H). The hydrogen bond frequency is calculated over the four 180 ns trajectories in (G) and (H). Histograms of the largest dimension of the S2M loop measured for the four 180 ns trajectories of SARS-CoV (I), G29734C (J) and G29742U (K) variants. Histograms of the base distance measured between G/C29734 (G/C11) and A29756 (A33) for the four 180 ns trajectories of SARS-CoV (L), G29734C (M) and G29742U (N) variants. In all panels the nucleotides are numbered as in the x-ray structure of SARS-CoV S2M RNA (PDB code 1XJR); the corresponding number in the reference genome can be obtained by adding 29 723.
The S2M stem loop and both emerging variations are marginally stable at infection temperature: (A) normalized thermal UV denaturation curves with the S2M region and variations thereof are plotted as a function of temperature. The data are represented and analyzed as in Figure 2, with lines describing the upper and lower baselines of S2M WT RNA shown to facilitate visualization of the unfolding transition, and (B) fitted parameters presented in the table below are the average and standard deviation of 3 experiments. Values presented as n.d. could not be determined from fits to the 2 state denaturation model described in the methods, presumably because the S2M UUUC and UUUA variants are not folded throughout the temperature range used in the experiment.
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has motivated a widespread effort to understand its epidemiology and pathogenic mechanisms. Modern high-throughput sequencing technology has led to the deposition of vast numbers of SARS-CoV-2 genome sequences in curated repositories, which have been useful in mapping the spread of the virus around the globe. They also provide a unique opportunity to observe virus evolution in real time. Here, we evaluate two sets of SARS-CoV-2 genomic sequences to identify emerging variants within structured cis-regulatory elements of the SARS-CoV-2 genome. Overall, 20 variants are present at a minor allele frequency of at least 0.5%. Several enhance the stability of Stem Loop 1 in the 5ʹ untranslated region (UTR), including a group of co-occurring variants that extend its length. One appears to modulate the stability of the frameshifting pseudoknot between ORF1a and ORF1b, and another perturbs a bi-stable molecular switch in the 3ʹUTR. Finally, 5 variants destabilize structured elements within the 3ʹUTR hypervariable region, including the S2M (stem loop 2 m) selfish genetic element, raising questions as to the functional relevance of these structures in viral replication. Two of the most abundant variants appear to be caused by RNA editing, suggesting host-viral defense contributes to SARS-CoV-2 genome heterogeneity. Our analysis has implications for the development of therapeutics that target viral cis-regulatory RNA structures or sequences.
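The 0.5% frequency threshold mentioned in this abstract can be illustrated with a minimal sketch. It assumes the genomes are already aligned to a reference of equal length and uses the non-reference allele frequency at each position as a stand-in for minor allele frequency; the function name `variants_above_threshold` and its parameters are hypothetical and are not taken from the study.

```python
from collections import Counter


def variants_above_threshold(sequences, reference, min_freq=0.005):
    """List (position, ref_base, alt_base, frequency) for alternate alleles.

    Assumption: every sequence is pre-aligned and has the same length as
    `reference`. Ambiguous or gap characters are ignored at each position.
    Positions are reported 1-based.
    """
    hits = []
    for pos, ref_base in enumerate(reference):
        # Count only unambiguous nucleotides observed at this position.
        counts = Counter(
            seq[pos] for seq in sequences if seq[pos] in "ACGTU"
        )
        total = sum(counts.values())
        if total == 0:
            continue
        for base, count in sorted(counts.items()):
            freq = count / total
            if base != ref_base and freq >= min_freq:
                hits.append((pos + 1, ref_base, base, freq))
    return hits
```

With 200 aligned genomes, an alternate base seen in 2 of them (1%) would be reported, while one seen in a single genome among 1000 (0.1%) would not.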
 
Lung adenocarcinoma (LUAD) is a tumor with high incidence. This study aimed to identify the central genes of LUAD. LUAD samples were analyzed by weighted gene co-expression network analysis (WGCNA), and differentially expressed genes (DEGs) were identified. Samples were obtained from The Cancer Genome Atlas (TCGA) and Genotype Tissue Expression (GTEx) databases and included 515 LUAD samples and 347 normal samples. The WGCNA algorithm generated a total of 10 modules. The 2 modules (MEturquoise and MEblue) with the highest correlation to LUAD were selected. Ten hub genes (IL6, CDH1, PECAM1, SPP1, THBS1, HGF, SNCA, CDH5, CAV1, and DLC1) were screened from the genes shared by the DEGs and the WGCNA modules (MEturquoise and MEblue). Only SPP1 was correlated with poor survival in LUAD, indicating that SPP1 may be a key hub gene for LUAD. A competing endogenous RNA (ceRNA) network was constructed to analyze the regulatory relationships of the hub genes; SPP1 may be directly regulated by 4 microRNAs (miRNAs) and indirectly regulated by 49 long noncoding RNAs (lncRNAs).
 
Figure 1. The ARF protein family regulates the transcription of auxin-inducible genes by forming dimers with auxin response elements (AREs) in the promoters of these genes. In the absence of auxin, the Aux/IAA transcriptional repressor recruits TOPLESS family (TPL) co-repressors by interacting with ARFs, which in turn recruit chromatin-modifying enzymes that inhibit transcription of downstream auxin-inducible genes. The steps of the auxin response pathway are indicated by numbered arrows. (1) In the presence of auxin, the Aux/IAA and TIR1/AFB family F-box proteins bind together. (2) The F-box proteins are part of the SCF-type E3 ubiquitin protein ligase complex that transfers activated ubiquitin (Ub) from the E1/E2 enzyme system. (3) Polyubiquitylation of Aux/IAA leads to its degradation. (4) The dimer formed by ARF and AREs is released to activate transcription of auxin-inducible genes.
Neighbor-joining (NJ) tree and representative conserved motif patterns of Aux/IAA proteins of A. rubrum, A. yangbiense, citrus, and Arabidopsis: (A) a phylogenetic tree was constructed for 95 full-length Aux/IAA proteins from 5 plant species, including A. rubrum (Ar), A. yangbiense (Ay), citrus (Cit), Arabidopsis (At), (B) distribution of 10 motifs in the Aux/IAA proteins of 4 species, and (C) 5 motifs representing the 4 domains I, II, III, and IV were mapped on all Aux/IAA proteins in different colors.
Gene Ontology (GO) analysis of Aux/IAA genes in A. rubrum (A) and A. yangbiense (B). MF: molecular function (blue); CC: cellular component (green); BP: biological process (red).
Expression profiles of ArAux/IAA in new leaves (YL), mature leaves (ML), and phloem (P). Clear water treatment (CK) is shown in blue and IAA treatment (IAA) in yellow; genes without significant differences are not shown.
The phytohormone auxin is important in all aspects of plant growth and development. The Auxin/Indole-3-Acetic Acid (Aux/IAA) gene family responds to auxin induction as an early auxin response gene family. Despite the physiological importance of Aux/IAA genes, a systematic analysis of the Aux/IAA genes in Acer rubrum has not been reported. This paper describes the characterization of Acer rubrum Aux/IAA genes at the transcriptomic level and Acer yangbiense Aux/IAA genes at the genomic level, with 17 Acer rubrum Aux/IAA genes (ArAux/IAA) and 23 Acer yangbiense Aux/IAA (AyAux/IAA) genes identified. Phylogenetic analysis shows that AyAux/IAA and ArAux/IAA family genes can be subdivided into 4 groups and show strong evolutionary conservation. Quantitative real-time polymerase chain reaction (qRT-PCR) was used to test the expression profile of ArAux/IAA genes in different tissues under indole-3-acetic acid (IAA) treatment. Most ArAux/IAA genes are responsive to exogenous auxin and have tissue-specific expression. Overall, these results will provide molecular-level insights into auxin metabolism, transport, and signaling in Acer species.
 
Distribution of multiplicity of infection in the Ghana samples of P. falciparum analyzed (n = 617): (A) is a histogram showing, on the vertical axis, the number of samples whose within-sample F statistic (Fws) falls within the range specified on the horizontal axis and (B) is a scatter plot that shows the distribution of the samples by Fws. The red line is at Fws = 0.95, the cutoff point for MOI.
P. falciparum population structure analysis for Ghana samples.
Summary of SNP characteristics: (A) frequency distribution of the non-reference allele for each of the biallelic SNPs in the sample of P. falciparum clinical isolates from Ghana (N = 274) and (B) distribution of numbers of protein-coding genes (N = 2256) with each given number of SNPs in the Ghana population sample of P. falciparum clinical isolates.
Distribution of SNP effects across all genomic positions analyzed.
Summary of the number of SNPs and their effects at loci of interest.
Sub-Saharan Africa is courting the risk of artemisinin resistance (ARTr) emerging in Plasmodium falciparum malaria parasites. Current molecular surveillance efforts for ARTr have been built on the utility of validated P. falciparum kelch13 (pfk13) molecular markers. However, whether these molecular markers will serve the purpose of early detection of artemisinin-resistant parasites in Ghana hinges on pfk13-dependent evolution. Here, we tested the hypothesis that the background pfk13 genome may be present before the pfk13 ARTr-conferring variant(s) is selected and that signatures of balancing selection on these genomic loci may serve as an early warning signal of ARTr. We analyzed 12 198 single nucleotide polymorphisms (SNPs) in Ghanaian clinical isolates in the Pf3K MalariaGEN dataset that passed a stringent filtering regimen. We identified signatures of balancing selection in 2 genes (phosphatidylinositol 4-kinase and chloroquine resistance transporter) previously reported as background loci for ARTr. These genes showed statistically significant and high positive values for Tajima's D, Fu and Li's F, and Fu and Li's D. This indicates that the biodiversity required to establish a pfk13 background genome may have been primed in clinical isolates of P. falciparum from Ghana as of 2010. Despite the absence of ARTr in Ghana to date, our finding supports the current use of pfk13 for molecular surveillance of ARTr in Ghana and highlights the potential utility of monitoring malaria parasite populations for balancing selection in ARTr precursor background genes as early warning molecular signatures for the emergence of ARTr.
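Because this abstract reads high positive Tajima's D as a signature of balancing selection, a minimal sketch of the standard Tajima's D statistic may help. It assumes the input is a list of equal-length haplotype strings for one locus with no missing data; the function name `tajimas_d` is illustrative and this is not the pipeline used in the study.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for equal-length haplotype strings.

    Returns None when there are no segregating sites. Assumes no missing
    data; at least 4 haplotypes are required.
    """
    n = len(haplotypes)
    if n < 4:
        raise ValueError("need at least 4 haplotypes")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must have equal length")

    # S: number of segregating (polymorphic) sites.
    seg_sites = sum(
        1 for i in range(length) if len({h[i] for h in haplotypes}) > 1
    )
    if seg_sites == 0:
        return None

    # pi: mean number of pairwise differences between haplotypes.
    n_pairs = n * (n - 1) / 2
    pi = sum(
        sum(a != b for a, b in zip(h1, h2))
        for h1, h2 in combinations(haplotypes, 2)
    ) / n_pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # D compares pi with Watterson's estimator S / a1.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)
```

An excess of intermediate-frequency alleles raises the pairwise diversity relative to Watterson's estimator and gives D > 0, the direction the abstract associates with balancing selection; an excess of rare alleles gives D < 0.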
 
Binding position of all modeled PIs (here represented by PI.05 cluster 4) (purple) with respect to regions on HA structure (represented by 1RD8): receptor-binding site (green), HA1 fusion region (yellow), esterase (tan), and HA2 fusion region (cyan). The conserved/similar residues are visualized (A) with respect to HA regions and (B) separately and are marked red if they are located in the esterase region and orange in the receptor-binding site.
The binding site of PI.01 (purple) overlaps with the invariant residues (yellow) and synthetic lethal residues (orange) on the B loop that is proposed to drastically alter conformation during fusion. Notice the variation in the orientation of the side chains of invariant and synthetic lethal residues across (A) H1, (B) H2, and (C) H3 HA and the change in the orientation of PI.01 to adapt to different subtypes.
The docking result of clusters of PIs with the lowest docking score for each HA subtype. The PIs were found to form contact with conserved/similar residues on the structures of all 3 HA subtypes. The conserved/similar residues of HA structure in contact with PIs and the docking score are listed in the fourth and fifth columns.
A high level of mutation enables the influenza A virus to resist antiviral drugs that were previously effective against it. A portion of the structure of hemagglutinin (HA) is assumed to be well conserved to maintain its role in cellular fusion, and structure tends to be more conserved than sequence. We designed peptide inhibitors to target the conserved residues on the HA surface, which were identified based on structural alignment. Most of the conserved and strongly similar residues are located in the receptor-binding and esterase regions on the HA1 domain. In a later step, fragments of anti-HA antibodies were gathered and screened for their ability to bind the identified conserved residues. As a result, methionine had the best docking score within the −2.8 Å radius of Van der Waals when interacting with tyrosine, arginine, and glutamic acid. Then, the binding affinity and spectrum of the fragments were enhanced by grafting hotspot amino acids into the fragments to form peptide inhibitors. Our peptide inhibitor was able to form in silico contact with a structurally conserved region across H1, H2, and H3 HA, with the binding site at the boundary between the HA1 and HA2 domains, spreading across different monomers, suggesting a new target for designing broad-spectrum antibodies and vaccines. This research presents an affordable method to design broad-spectrum peptide inhibitors using fragments of an antibody as a scaffold.
 
Hepatocellular carcinoma (HCC) is a common cancer with high incidence and mortality. The human replication factor C (RFC) family contains 5 subunits that play an important role in DNA replication and DNA damage repair. RFCs are abnormally expressed in a variety of cancers; some of them are differentially expressed in HCC tissues and related to tumor growth. However, the expression, prognostic value, and effect targets of the whole RFC family in HCC are still unclear. To address these issues, we performed a multidimensional analysis of RFCs in HCC patients using Oncomine, UALCAN, GEPIA, Human Protein Atlas, Kaplan-Meier plotter, cBioPortal, GeneMANIA, STRING, and LinkedOmics. mRNA expression of RFCs was significantly increased in HCC tissues. There was a significant correlation between the expression of RFC2/3/4/5 and tumor stage of HCC patients. In addition, high mRNA expression of RFC2/4 was associated with worse overall survival (OS). Moreover, genetic alterations of RFCs were associated with worse OS in HCC patients. We found that genes co-expressed with RFC2/4 were mainly involved in biological processes such as chromosome segregation, mitotic cell cycle phase transition, and telomere organization, and that they activated the cell cycle and spliceosome pathways. The gene set is mainly enriched in the cancer-related kinases AURKA, ATR, CDK1, PLK1, and CHEK1. E2F family members were the key transcription factors for RFCs. Our results suggest that differentially expressed RFC2 and RFC4 are potential prognostic biomarkers in HCC and may act on E2F transcription factors and some kinase targets to dysregulate the cell cycle pathway. These findings may provide new research directions for prognostic biomarkers and therapeutic targets in HCC.
 
Beta diversity analysis results: (a) unweighted pair-group method with arithmetic mean (UPGMA) tree of unweighted UniFrac distances, (b) analysis of similarity (ANOSIM) between groups, (c) Bray-Curtis distance-based non-metric multidimensional scaling (NMDS) analysis, and (d) principal coordinate analysis (PCoA) of an unweighted UniFrac distance matrix. WS represents the wild group; LS represents the captive group.
Numbers of different microbiological taxonomic units in this study.
Alpha diversity comparison between both groups based on multiple indexes.
Relative abundance of predicted function for specific KEGG modules (level 1).
Wild-caught animals must cope with drastic lifestyle and dietary changes after being introduced to captivity. How the gut microbiome structure of these animals changes in response is receiving increasing attention. The plateau zokor (Eospalax baileyi), a typical subterranean rodent endemic to the Qinghai-Tibet plateau, spends almost its whole life underground and is well adapted to the environmental pressures of both the plateau and the underground habitat. However, how the gut microbiome of the plateau zokor changes in response to captivity has not been reported to date. This study compared the microbial community structure and functions of 22 plateau zokors before (the WS group) and after being kept in captivity for 15 days (the LS group, fed on carrots) using high-throughput sequencing of the 16S rRNA gene. The results showed that the LS group retained 973 of the 977 operational taxonomic units (OTUs) in the WS group, and no new OTUs were found in the LS group. The dominant bacterial phyla were Bacteroides and Firmicutes in both groups. In alpha diversity analysis, the Shannon, Sobs, and ACE indexes of the LS group were significantly lower than those of the WS group. A remarkable difference ( P
 
Top-cited authors
Laurent Excoffier
  • Universität Bern
Stefan Schneider
  • University of Geneva
Guillaume Laval
  • Institut Pasteur International Network
David Roessli
  • University of Geneva
Rolf Henrik Nilsson
  • University of Gothenburg