Questions related to Bioinformatics Analysis
I am conducting research on the bacterial composition of fecal samples from both healthy and diseased individuals using 16S sequencing. I am seeking expert guidance on the appropriate bioinformatic analysis methods for my dataset.
My goal is to analyze the bacterial communities in fecal samples from a diseased cohort and a control group of healthy individuals, using 16S rRNA gene sequencing.
I have employed a Nanopore sequencer to acquire full-length 16S sequences.
For the alignment process, I have used the kraken2 tool.
The standard database provided by kraken2 has been utilized for the alignment.
I have generated 12 sets of output files, ranging from kraken2-report01 to kraken2-report12 and kraken-output01.txt to kraken-output12.txt.
I am contemplating two approaches for downstream analysis:
- Converting the output data into biom format using kraken-biom and then analyzing it on the QIIME2 platform.
- Converting the output data into either OTU or ASV format for analysis using MicrobiomeAnalyst.
My questions are:
- Is there a specific method for converting the kraken2 output into biom format? If so, could you provide the steps for this conversion?
- If the conversion-based approach is not advisable, what are the recommended methods for diversity analysis and identification of variable species post-kraken2 analysis?
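As a rough illustration of what any conversion step has to do first, the per-sample kraken2 reports can be merged into a single taxon-by-sample count table. The sketch below assumes the default six-column, tab-separated kraken2 report layout (percentage, clade reads, direct reads, rank code, taxid, indented name); the function names and the demo strings are hypothetical, not part of kraken2 or kraken-biom.

```python
import io

def read_kraken2_report(handle, rank="S"):
    """Return {taxon name: reads in clade} for one report, at one rank code."""
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip blank lines or reports with extra minimizer columns
        if fields[3] == rank:  # rank code, e.g. "S" for species, "G" for genus
            counts[fields[5].strip()] = int(fields[1])
    return counts

def merge_reports(per_sample):
    """per_sample: {sample: {taxon: count}} -> (sample list, {taxon: [counts]})."""
    samples = list(per_sample)
    taxa = sorted({t for counts in per_sample.values() for t in counts})
    table = {t: [per_sample[s].get(t, 0) for s in samples] for t in taxa}
    return samples, table

# Tiny in-memory stand-ins for two report files (made-up numbers).
demo_01 = (
    "100.00\t100\t0\tG\t816\t  Bacteroides\n"
    " 70.00\t70\t70\tS\t817\t    Bacteroides fragilis\n"
    " 30.00\t30\t30\tS\t818\t    Bacteroides thetaiotaomicron\n"
)
demo_02 = "100.00\t50\t50\tS\t817\t    Bacteroides fragilis\n"

per_sample = {
    "sample01": read_kraken2_report(io.StringIO(demo_01)),
    "sample02": read_kraken2_report(io.StringIO(demo_02)),
}
samples, table = merge_reports(per_sample)
for taxon, counts in table.items():
    print(taxon, dict(zip(samples, counts)))
```

With real data, each `io.StringIO(...)` would be replaced by an opened report file such as `open("kraken2-report01")`. The resulting table is the kind of count matrix that diversity tools expect, whichever platform is used afterwards.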
Could someone explain why the p-value in the right-hand column of the forest plot differs from the p-value in the test for effect in the subgroup?
I thought these two p-values should be the same.
I read some papers mentioning that they used the HMP reference genome for protein homology search and I've also read about the HUMAnN database elsewhere. I'm wondering what's the difference.
Hello everyone, I am new to R programming. I want to calculate the Firmicutes to Bacteroidetes ratio from my OTU table. I couldn't find a command for this and don't know how to do it. Please guide me.
I have attached an example of my OTU table.
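The calculation itself is just a per-sample sum of counts for each phylum followed by a division. A minimal sketch in Python, assuming each OTU row carries a taxonomy string containing the phylum name and one count per sample (the row layout, sample names, and counts below are invented for illustration; the same logic carries over to R):

```python
def fb_ratio(otu_rows, samples):
    """Per-sample Firmicutes/Bacteroidetes ratio from OTU count rows."""
    ratios = {}
    for s in samples:
        firmicutes = sum(r[s] for r in otu_rows if "Firmicutes" in r["taxonomy"])
        bacteroidetes = sum(r[s] for r in otu_rows if "Bacteroidetes" in r["taxonomy"])
        # Avoid division by zero when a sample has no Bacteroidetes reads.
        ratios[s] = firmicutes / bacteroidetes if bacteroidetes else float("nan")
    return ratios

# Hypothetical OTU table: one dict per OTU.
otu_rows = [
    {"taxonomy": "k__Bacteria; p__Firmicutes; c__Clostridia", "S1": 30, "S2": 10},
    {"taxonomy": "k__Bacteria; p__Firmicutes; c__Bacilli", "S1": 10, "S2": 10},
    {"taxonomy": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia", "S1": 20, "S2": 40},
    {"taxonomy": "k__Bacteria; p__Proteobacteria", "S1": 5, "S2": 5},
]
ratios = fb_ratio(otu_rows, ["S1", "S2"])
print(ratios)
```

If the taxonomy uses the newer phylum names (Bacillota, Bacteroidota), the two search strings would need to be changed accordingly.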
I know many websites have simple tools like transcription and translation available, but are there any analysis tools that researchers need that either do not exist or are not publicly available? It could be anything from algorithms to visuals. Thanks!
I have a sequence (genome.fasta) and I want to check which gene is located at positions 400-500 nt.
What bash script should I use (I have WSL on Windows), or are there any conda packages for this?
Thank you in advance.
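Pulling out the bases at positions 400-500 needs only a few lines. The sketch below is plain Python with no dependencies; the helper names are hypothetical, and it assumes 1-based, inclusive coordinates. Finding out which gene sits in that window additionally needs an annotation (for example a GFF file) or a BLAST search with the extracted piece.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {record name: sequence}."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the ID up to the first space
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def subsequence(seq, start, end):
    """Return positions start..end (1-based, inclusive)."""
    return seq[start - 1:end]

# In-memory demo; with a real file use: read_fasta(open("genome.fasta"))
demo = [">contig1 demo record", "ACGTACGTAC", "GGGGCCCCTT"]
genome = read_fasta(demo)
region = subsequence(genome["contig1"], 9, 12)
print(region)
```

For the case in the question the call would be `subsequence(genome[name], 400, 500)`, which returns 101 bases.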
Is there any server or tool (bioconda, Java, etc.) that exclusively annotates membrane proteins from a bacterial genome (similar to what dbCAN does for polysaccharides)?
Thank you in advance!
Hi - I'm currently working with two RNA-Seq studies; one has RNA extracted from whole blood, the other PBMCs. Eventually we want to combine these data and perform some cell-specific deconvolution to look at DEGs.
Are there any recommended methods for batch correcting these data from different sources?
From https://gtexportal.org/home/datasets, under V7, I'm trying to run R/Python analyses on the Gene TPM and Transcript TPM files. To open them I had to use Universal Viewer, since the files are too large for an app like Notepad. In these files I see a series of sample IDs (e.g. GTEX-1117F-0226-SM-5GZZ7), followed by transcript IDs like ENSG00000223972.4, and then a large block of numbers like 0.02865, which take up about 99% of the file. Can someone help me decipher what the numbers mean? Is each number supposed to be assigned to a specific sample ID? (The number of entries far exceeds the number of samples, by the way.) I tried opening these files as tables in R, but I don't think R is parsing the contents correctly.
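One possible reading of that layout is a GCT-style matrix: two preamble lines, then a header row of sample IDs, then one row per gene with one TPM value per sample. The sketch below assumes that layout (a version line, a dimensions line, then `Name`, `Description`, and the sample columns); whether a particular download follows it should be checked against the first few lines of the file. The second sample ID and all values except the ones quoted in the question are made up.

```python
import io

def read_gct(handle):
    """Read a GCT-style matrix into {gene ID: {sample ID: value}}."""
    handle.readline()  # version line, e.g. "#1.2"
    handle.readline()  # dimensions line: number of rows and columns
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]  # first two columns are Name and Description
    tpm = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        tpm[fields[0]] = dict(zip(samples, map(float, fields[2:])))
    return samples, tpm

# Small in-memory stand-in for the real (very large) file.
demo = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tGTEX-1117F-0226-SM-5GZZ7\tGTEX-DEMO-0001\n"
    "ENSG00000223972.4\tDDX11L1\t0.02865\t0.0\n"
    "ENSG00000227232.4\tWASH7P\t8.5\t12.25\n"
)
samples, tpm = read_gct(io.StringIO(demo))
print(tpm["ENSG00000223972.4"]["GTEX-1117F-0226-SM-5GZZ7"])
```

Under this reading, each number belongs to exactly one (gene, sample) pair, which is why the numbers vastly outnumber the sample IDs. For the full file it is usually better to stream rows or select a subset of genes rather than hold everything in memory.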
I was using the fragbuilder module in Python to generate peptides of sizes 4, 6, and 10. However, some of the bond angles it produces deviate from the standard values. For instance, the standard value of the C_alpha--C--N bond angle is 121 degrees, but fragbuilder assigns 111 degrees. This deviation changes the distance between nearest-neighbor C_alpha---C_alpha atoms to 3.721 angstrom, whereas the typical standard value is 3.8 angstrom. Another bond angle deviates from the standard value by 6 degrees: C_alpha---C---N, whose value is 111.4 degrees while the typical standard value is 117 degrees. My question is: how much deviation is acceptable for MD simulations of peptides (or proteins) when the bond lengths and bond angles are held fixed?
I want to purchase a MacBook mainly for bioinformatics analysis, i.e. transcriptomics, small RNA, methylation, lncRNA, and others. Could anyone suggest the best affordable model?
I have two sequences: a predicted mRNA sequence (exons only, without introns) and a gDNA sequence (with introns). I align the sequences to confirm the positions of the exons in the DNA sequence. After that, I pick primers from the exon regions and check their specificity with Primer-BLAST. However, I have also designed primers from the predicted mRNA alone, without considering the exon regions in the DNA sequence. Which approach is more appropriate for amplifying full-length genes from a DNA template?
When aligning virulence factors against the human proteome to exclude proteins with > 35% homology, which output should we use for the next step of predicting transmembrane helices and molecular weight for the chosen proteins?
Please bear with me, because I am a complete beginner with regard to any form of bioinformatics and I am trying to understand the best approach to my experiment.
I am currently trying to isolate cells and sequence them for further bioinformatic analysis, more precisely RNA-Sequencing.
We have, however, had issues with purity and while some samples we looked at reached a purity of >90% after isolation (we usually validate it by use of flow cytometry), some samples of different animal genotypes did not.
This leads me to my first question:
How important is cell purity for Bulk RNA-Seq?
What purity should be reached for an adequate, reliable analysis?
If anyone has any recommendations for papers to look into regarding that subject, I would be most grateful, because I have no idea where to start and what to consider.
Further along in the story we surmised that maybe Single Cell RNA Sequencing might be the better option in cases of lower purity.
But again, the same question arose: how relevant is cell purity for the following analysis and is there a cut-off value not to be crossed?
How advantageous would using both methods be?
Sure, Bulk gives a better general overview and Single Cell is more precise, but do they complement each other or is it essentially redundant information gained by doing both experiments?
And are there any disadvantages to using only SC, or do both methods complement each other when low purity levels are in question?
Thank you a lot in advance!!
I have data from our experimental model, in which we analyze the immune response following BCG vaccination, and then the responses and clinical outcome following Mtb infection of our vaccinated models. Because we cannot experimentally follow the very same animals for the post-infection studies after evaluating their post-vaccination response, these data come from different batches. Is it possible to correlate post-vaccination responses of 5 replicates in one batch (for different vaccine candidates) with 4-5 replicates of vaccination plus infection from another batch? I ask because we are not following the same replicates for post-vaccination and post-infection measurements (it is not experimentally feasible). If correlation is not the best method, are there other ways to analyze the patterns, such as the strength of association between T cell response in BCG-vaccinated models and increased survival of BCG-vaccinated models (both measurements being from different batches)? We have several groups like that, with a variety of parameters measured per group in different sets of experiments.
Thanks for your responses and help.
I created this R package to allow easy VCF files visual analysis, investigate mutation rates per chromosome, gene, and much more: https://github.com/cccnrc/plot-VCF
The package is divided into 3 main sections, based on analysis target:
- variant Manhattan-style plots: visualize all/specific variants in your VCF file. You can plot subgroups based on position, sample, gene and/or exon
- chromosome summary plots: visualize plot of variants distribution across (selectable) chromosomes in your VCF file
- gene summary plots: visualize plot of variants distribution across (selectable) genes in your VCF file
Take a look at how many different things you can achieve in just one line of code!
It is extremely easy to install and use, well documented on the GitHub page: https://github.com/cccnrc/plot-VCF
I'd love to have your opinion, bugs you might find etc.
I am analyzing 16S rRNA amplicon sequencing data. I have tested the effectiveness of two classifiers on a mock community, and the BLAST classifier gave the best result. However, I found out that BLAST uses local sequence alignment, so I do not know whether it is appropriate to use this classifier to assign a "mystery" sequence to a bacterial taxon. Could this approach produce false positives? Is it better to use the VSEARCH classifier, which showed worse results but uses global sequence alignment?
And a bonus question. Should I use rarefied representative sequences to perform a taxonomy classification or not? I use rarefied data for alpha diversity testing (and for beta diversity testing I do not).
Thank you all for answers!
I tried using Phaster.ca and PhiSpy for phage detection in the bacterial genome
They showed a completely different result for regions and the virus identified.
Do you have the same experiences and could you share your suggestions, please?
Thanks in advance!
I am having an issue with 16S PICRUSt data. After every PICRUSt run there is a warning that more than half of the sequences have been removed from further analysis. The reason might be that the ASV FASTA file contains a mix of DNA sequences, i.e. both positive and negative strands. PICRUSt can only deal with positive-strand sequences, so the output is based on approximately 50% of the sequences in the FASTA file. I am looking for a programmatic way to identify the negative-strand sequences in the FASTA file (based on the NCBI BLASTn portal) and reverse complement them, because doing this manually for 6000 sequences would be difficult. I have limited knowledge of coding. Any help would be greatly appreciated.
I am running the PICRUSt pipeline as described here: https://github.com/picrust/picrust2/wiki/Full-pipeline-script. The ASV file was generated from raw FASTQ files in QIIME2.
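One way to automate the orientation fix is to BLAST the ASVs against a reference in tabular format and flip every query whose best hit is on the minus strand. The sketch below assumes standard 12-column BLAST tabular output (`-outfmt 6`), in which a minus-strand hit has its subject start larger than its subject end, and that the first line per query is the best hit. The function names and demo records are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]

def minus_strand_queries(blast_tab_lines):
    """Query IDs whose first listed hit lies on the minus strand."""
    seen, minus = set(), set()
    for line in blast_tab_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12 or fields[0] in seen:
            continue  # malformed line, or not the first hit for this query
        seen.add(fields[0])
        if int(fields[8]) > int(fields[9]):  # sstart > send means minus strand
            minus.add(fields[0])
    return minus

def fix_orientation(records, minus_ids):
    """Reverse complement only the records flagged as minus strand."""
    return {name: reverse_complement(seq) if name in minus_ids else seq
            for name, seq in records.items()}

# Demo: ASV2 hits the reference in reverse orientation.
blast_lines = [
    "ASV1\tref1\t99.0\t10\t0\t0\t1\t10\t100\t109\t1e-5\t50\n",
    "ASV2\tref1\t99.0\t10\t0\t0\t1\t10\t209\t200\t1e-5\t50\n",
]
asvs = {"ASV1": "AACGTTGCAA", "ASV2": "GGGTTTACCA"}
fixed = fix_orientation(asvs, minus_strand_queries(blast_lines))
print(fixed)
```

The corrected records can then be written back out as FASTA and passed to the pipeline in place of the original file.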
Hello, I have some raw files with the extension .d (acquired on a Bruker instrument). Which platforms would you recommend for the bioinformatic analysis (ideally free to download)? I have experience using MaxQuant, but it does not recognize the .d files. Any recommendations? Thank you in advance.
Good day! The question is really complex, since CRISPR arrays do not have an exact sequence. So the question is: what is the probability that a random sequence contains 2 repeat units, each 23-55 bp long with a short palindromic sequence inside and a maximum mismatch of 20%, interspersed with a spacer sequence that is 0.6-2.5 times the repeat size and does not match the left or right flank of the whole sequence?
I designed two sets of primers to target a gene of interest in gene expression studies (cDNA). The first primer pair (T1) gave an amplicon of 200 bp in in silico PCR, and amplification of my gene of interest by PCR was successful (a band on the gel and successful sequencing). For the second primer set (T2), in silico PCR gave an expected amplicon size of 1200 bp, which is the one of interest to me (I'm carrying out bioinformatics analysis of the protein sequence), but PCR on the cDNA was not successful. I have tried troubleshooting by varying different parameters, with no success. Could the long amplicon length be a hindrance in PCR, and how can I overcome this problem?
I have the whole genome of a bacterium. Do you know which program can detect a viral genome within the bacterial genome?
Should I annotate the genome (for example with Prokka) and then look manually for viral genes/proteins? Or is it enough to check the assembly and BLAST the shorter contigs?
Thank you in advance.
- RNA-seq and bioinformatics were carried out by professionals.
- Gene in question shows ~700 fold differential regulation by qPCR in multiple independent cohort of experiments - not in RNA seq.
I have some lists of gene IDs from multiple species, and I want compiled FASTA files for each species. It looks tedious to copy each accession and collect the FASTA sequences one by one.
Batch Entrez is giving me an error, maybe because the identifiers belong to another database.
I'm in the initial stages of planning a miRNA-seq experiment using human cultured cells and have decided on TRIzol extraction and the TruSeq small RNA prep kit, sequenced on an Illumina HiSeq 2500. The Illumina webinar suggests 10-20 million reads for discovery, the Q&A support page suggests 2-5M, and tech support, when I wrote to ask, suggested up to 100M reads for rare transcripts. The Exiqon guide to miRNA discovery says there is not really any benefit in going over 5M reads. I was hoping to save money by pooling more samples in a lane, so I would appreciate it if someone with experience could suggest a suitable number of reads.
Hi, I was hoping someone could recommend papers that discuss the impact of using averaged data in random forest analyses or in making regression models with large data sets for ecology.
For example, if I had 4,000 samples each from 40 sites and did a random forest analysis (looking at predictors of SOC, for example) using environmental metadata, how would that compare with doing a random forest of the averaged sample values from the 40 sites (so 40 rows of averaged data vs. 4,000 raw data points)?
I ask this because a lot of the 4,000 samples have missing sample-specific environmental data in the first place, but there are other samples within the same site that do have that data available.
I'm just a little confused on 1.) the appropriateness of interpolating average values based on missingness (best practices/warnings), 2.) the drawbacks of using smaller, averaged sample sizes to deal with missingness vs. using incomplete data sets vs. using significantly smaller sample sizes from only "complete" data, and 3.) the geospatial rules for linking environmental data with samples? (if 50% of plots in a site have soil texture data, and 50% of plots don't, yet they're all within the same site/area, what would be the best route for analysis?) (it could depend on variable, but I have ~50 soil chemical/physical variables?)
Thank you for any advice or paper or tutorial recommendations.
I am computing van der Waals interactions in Python for a 10-residue peptide across various conformations. The total number of conformations (i.e. PDB files) is 300,000. Since bonded and 1-3 atom distances are irrelevant for van der Waals interactions, is there a Python module that lets me compute only the 1-4 atom distances for this purpose?
After finishing the simulation of the cyclic peptide, I tried to find the most populated structure using the density peak clustering algorithm. In the literature, the representative structure was chosen as the structure with maximal ρ_sum (the sum of the local densities of all residues in one structure, $\rho_{\mathrm{sum}} = \sum_{i=1}^{n_{\mathrm{res}}} \rho_i$). How can I extract the structure that has the highest summed density over all residues?
ref: Clustering by Fast Search and Find of Density Peaks. Science 2014, 344, 1492–1496
Hello! I'm new to bioinformatics and cancer databases. I was exploring cBioPortal and analyzing co-expression of different genes through scatter plots. I noticed that the axes are labeled "RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)" (I attached an example so you can see). I know that RSEM is a transcript quantification software, but what does "Batch normalized" mean? Does it refer to upper quartile normalization, FPKM, or something else?
thanks in advance!
Apple's M1 Mac has been on the market since 2020, but its use for and compatibility with bioinformatics analysis tools is scarcely discussed. For example, is it possible to index a human genome on an M1 MacBook Air (16 GB), and if so, how long does it take? Is it possible to map reads to the reference genome, and if so, how long does it take? Any heads-up about the Conda experience?
Please share your thoughts and experiences... it can be of a great help...
GO and KEGG functional analysis for a gene set was performed using the DAVID database (https://david.ncifcrf.gov/). However, the adjusted p-values (Bonferroni and Benjamini) of the enriched GO terms and KEGG pathways were more than 0.5. Meanwhile, a PPI network was constructed using the STRING database (https://string-db.org), with a confidence score of 0.4 as the cutoff criterion and no more than ten interactors in the first shell. This step added a few more genes to the gene list, and genes with no interactions were removed. When the updated gene list was used for GO and KEGG functional analysis, the enriched GO terms and KEGG pathways were now significant (p-value < 0.05). Is the attempted workflow valid?
Dear all, I am trying to use CD-HIT to remove duplicates from the output file of Trinity (RNA-seq assembly).
I used the following parameters:
cd-hit-est -i in.fasta -o out_cdhit90.fasta -c 0.90 -n 9 -d 0 -M 0 -T 0
But the output file still contains lots of small or fragmented sequences in addition to the best one. How can I remove those small or fragmented duplicates by changing the parameters?
Hi, I want to predict post-translational modification sites for phosphorylation. I found lots of websites such as Phosida and PhosphoSitePlus. I am curious whether there is any Python code for phosphorylation prediction. If you have some, could you share the GitHub link?
I performed a qPCR analysis of one microRNA, and it was upregulated in tumour tissues compared to normal ones. Then I applied a bioinformatic analysis to detect its target genes, and the most important predicted targets of the microRNA were oncogenes (based on other studies).
I didn't do any further study on the target genes, and I need to keep to the bioinformatic analysis only. How can I discuss these results, knowing that this part will be an in silico study only?
I want to analyze differential gene expression in mice before and after treatment.
I have 6 mice and paired-end sequences, so how could I merge all my "before treatment" data to compare them with all "after treatment" data via DESeq2?
Should I map, count, and run DESeq2 on them separately?
Is there any way to combine (normalize) all replicates first and then perform the analysis, as is generally done in statistics?
Thank you in advance.
I have two VCF files corresponding to the results for healthy tissue and tumor tissue. I want to compare these VCF files and remove what they have in common. More specifically, I want to remove the variants found in the healthy tissue from the tumor file. Do you have any suggestions on which tool I should use, or any other way to do this analysis?
Thanks in advance.
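Conceptually this is a set difference on variant identity. A minimal sketch, assuming both files are plain-text VCFs whose records can be compared on chromosome, position, reference allele, and alternate allele, and that a tumor record is dropped only when all of its alternate alleles are also present in the healthy sample (both are simplifying assumptions, and the helper names are hypothetical). Dedicated tools such as `bcftools isec` additionally handle compressed, indexed, and normalized input.

```python
def variant_keys(vcf_lines):
    """Set of (chrom, pos, ref, alt) tuples from VCF text lines."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites list several ALTs
            keys.add((chrom, pos, ref, allele))
    return keys

def tumor_only(tumor_lines, normal_lines):
    """Keep header lines plus tumor records with an allele absent from normal."""
    normal = variant_keys(normal_lines)
    kept = []
    for line in tumor_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if any((chrom, pos, ref, a) not in normal for a in alt.split(",")):
            kept.append(line)
    return kept

# Demo records (made up).
header = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\tID\tREF\tALT\n"]
tumor = header + ["chr1\t100\t.\tA\tG\n", "chr1\t200\t.\tC\tT\n"]
normal = header + ["chr1\t100\t.\tA\tG\n"]
somatic = tumor_only(tumor, normal)
print("".join(somatic))
```

With real files, `tumor` and `normal` would be the lines of each opened VCF, and `somatic` could be written to a new file.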
EDIT: Please see below for the edited version of this question first (02.04.22)
I am searching for a reliable normalization method. I have two ChIP-seq datasets to compare with a t-test, but the rpkm values are biased, so I need to fix this before the t-test. For instance, a high value is not necessarily high in reality; another factor can make it appear high, and the true value should be closer to the mean. Likewise, if a value is low and the factor is strong, the factor may be the reason for the low value, and we should have seen a value much closer to the mean. In brief, I want to eliminate the effect of this factor.
To that end, I have another dataset showing how strong this factor is for each value in the ChIP-seq datasets (again as RPKM values). Should I simply divide my rpkm values by the corresponding factor RPKM to get unbiased data? Or is it better to divide the rpkm values by the ratio RPKM / mean(RPKMs)?
Do you have any other suggestions? How should I eliminate the factor?
What exactly is the role of HSP-90 in extracellular environment of the cell? I am wondering whether hsp90 is involved in the translocation of the client protein from outside to inside of the cell. If somebody is having some references please share with me. I am very curious about this molecule.
I have two ChIP-seq datasets for different proteins, and I have aligned them to a set of DNA fragments. Some of these fragments get a zero read count in one or both datasets. To be able to say that these fragments carry much more of protein X than of protein Y, I use Student's t-test.
I wonder whether it would be better to remove the zero values from both datasets, which hold rpkm values for each fragment. The zeros also pose a problem when I want to use a log scale for data visualization.
What would you suggest?
I do not have much experience in bioinformatics, and I need to find the common genes in several gene expression datasets; in other words, I need to find genes that appear in all (or some) of my datasets. I am looking for a tool that produces Venn diagrams of the overlapping genes. Any suggestions (free software, please) would be much appreciated.
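The overlaps behind a Venn diagram are plain set operations, so they can be computed before choosing a plotting tool. A minimal sketch in Python, with made-up dataset names and gene symbols:

```python
from collections import Counter
from itertools import combinations

def shared_genes(datasets):
    """Genes present in every dataset, and the overlap of each dataset pair."""
    in_all = set.intersection(*datasets.values())
    pairwise = {(a, b): datasets[a] & datasets[b]
                for a, b in combinations(datasets, 2)}
    return in_all, pairwise

def genes_in_at_least(datasets, k):
    """Genes that appear in k or more datasets."""
    counts = Counter(g for genes in datasets.values() for g in genes)
    return {g for g, n in counts.items() if n >= k}

# Hypothetical gene lists, one set per dataset.
datasets = {
    "study_A": {"TP53", "BRCA1", "MYC", "EGFR"},
    "study_B": {"TP53", "MYC", "KRAS"},
    "study_C": {"TP53", "EGFR", "KRAS"},
}
in_all, pairwise = shared_genes(datasets)
print(sorted(in_all))
print(sorted(genes_in_at_least(datasets, 2)))
```

The sizes of these sets are exactly the numbers a Venn diagram would display, so the result can be cross-checked against whichever drawing tool is used.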
"The development and validation of a medium density SNP genotyping assay in Shrimp" is a research proposal I'm currently working on. Given the restricted budget allotted (9,600 USD) to the project, I'd like to know ahead of time how much it might probably cost me.
I have a 1489 spike protein sequence file. I want to extract codon sequences, of 6 amino acids from this with their respective header. I don't know any sort of programming, so can anyone help me with this?
A big thank you in advance.
Hello. I am trying to run a haplotype analysis in PopArt. It's going well until I realized I can not load a previous work in PopArt. I can only export the graphical output as .svg, .png, or .pdf but not as a "network" file which I can reload or edit if I want to in the future. I noticed that it can be saved as a .nex file and the new file actually had additional lines (the portion of the code started with: "Begin NETWORK"). I think this is supposed to be read by PopArt but it fails to do so. I encounter parsing errors when I try to run the new file. I am not sure if there is a way around this as I am new to the software. Any help would be appreciated. Stay safe, anon!
I am new in this field. I am doing metagenome analysis with shotgun reads. All reads are single ended. DNA was obtained from airways of human. I just want to find taxon abundances in the samples. Then I will predict the diversities and core microbes.
My mapping results are terrible. How can I handle bad mappings? Or should I change the tools I used for the analysis? Which tools are more accurate or sensitive for microbiome analysis? I would appreciate any suggestions!
I followed this pipeline:
- Assembly was done using Megahit
- Short contigs (<200 bps) were removed using prinseq
- Read mapping against contigs was performed using BWA
- Similarity searches against GenBank, KEGG, and eggNOG were done using Diamond
- Binning was done using MaxBin2
You can find my mapping results in the attachment.
In the R programming language, I'm going to install the MetaDE package. Nonetheless, I get a warning that package 'MetaDE' is not available for this version of R, A version of this package for your version of R might be available elsewhere. How can I overcome this issue while I'm using R version 4.1.0?
Can anybody help in analyzing a density profile graph generated by a simulation run on GROMACS? I have attached the file for your reference.
Need an elaborate explanation as I am new to this. Kindly also suggest me any research articles related to this topic.
Thank you so much in advance!!
The project includes these steps: 1) raw data format transformation for five companies; 2) updating positions for all SNPs to the hg37 version; 3) quality control within companies; 4) pre-phasing (SHAPEIT2) and imputation (IMPUTE2) for all SNPs of each company; 5) performing GWAS using two logistic models for 27 phenotypes; 6) statistical and downstream bioinformatic analysis; 7) estimation of genetic parameters (rg and hg); 8) PRS analysis. However, my dataset consists of only slightly more than 1000 people. With no background knowledge, how long would this take a bioinformatics master's student?
I am writing to ask for bioinformatics ideas for a humanized antibody. I have been wondering what kind of bioinformatic analysis I can do on it. Although I can use docking and molecular dynamics to evaluate this antibody, I am looking for other ways to analyze it in structural bioinformatics. Please suggest how I can conduct a bioinformatics analysis of this antibody. Any relevant article to refer to would be greatly appreciated.
I have taken references from various sources to write a code but I am not getting the proper dataset read in the R studio.
Applications of bioinformatics in medicine are a key factor in technological advancement in the field of modern medical technologies.
In which areas of medical technology are the technological achievements of bioinformatics used?
What are the applications of bioinformatics in medicine?
I invite you to the discussion
Thank you very much
I am looking to obtain global RNA-Seq data for either E. coli or P. putida. I assume RNA-seq data is publicly available for many microbes, but I am unsure where I can access this information. Does anyone have insight as to what website or database I can find this data?
Hi, I am a beginner in the field of cancer genomics. I am reading gene expression profiling papers in which researchers classify cancer samples into two groups based on the expression of a group of genes, for example a "High group" and a "Low group", and do survival analysis. They then associate these groups with other molecular and clinical parameters, for example serum B2M levels, serum creatinine levels, 17p deletion, or trisomy 3. Some researchers classify the cancer samples into 10 groups. If I am proposing a cancer classification scheme and presenting a survival model based on 2 groups or 10 groups, how should I assess the predictive power of my proposed classification model, and how do I compare its predictive power with that of other survival models? Thank you in advance.
I would like to know if you have any information related to this issue, more precisely about companies that could provide services for
1. genome sequencing and assembly ;
2. whole methylation sequencing for 20 samples including bioinformatics analysis
What is the script for quantile normalization of a microarray dataset (GSE70970) using limma? Do I need to create a model matrix first before normalizing? I'm very new to R.
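To make clear what quantile normalization does to the data, independent of any package, here is a small sketch of the procedure in Python: every array's values are replaced by the mean of the values that share the same rank across arrays. It takes only the expression values as input. This is an illustration of the idea, not limma's implementation, and it ignores the special handling of ties and missing values that real implementations provide.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length lists, one list per array/sample."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the values sharing each rank across all arrays.
    rank_means = [sum(sc[i] for sc in sorted_cols) / len(columns)
                  for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Two hypothetical arrays measured on three probes.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
print(normalized)
```

After normalization both arrays contain the same set of values, only arranged according to each array's own ranking of its probes.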
I am wondering whether the low levels of total RNA in these samples are enough for RNA-seq. Has anyone already done this, or does anyone have suggestions for obtaining reliable data for bioinformatic analysis?
I have whole-genome sequences of a fosmid DNA. I will do the bioinformatics analysis, and my main aim is to identify the sequences of my insert.
Could you recommend a cloud-based/desktop-based (preferably Windows OS) tool for whole-genome sequences analysis of fosmid DNA?
I have some files in bed and bedgraph format to analyze with IGV. My team and I tried to load them into IGV following the tutorials on the IGV site, but it hasn't worked. The bedgraph files are large (5157), and we converted them to the binary .tdf format using the IGVTools "Count" command, but that hasn't worked either. For some files we only see a single flat line on the IGV screen, without any information. With FilexT we can see that the bed and bedgraph files are not damaged.
We think that the problem is in the step where we select the "Load from File" option in IGV. What can we do?
We use the IGV_2.10.3
DNA barcoding is used to obtain taxonomic information about unidentified organisms. Apart from that what other types of Bioinformatics analysis might be performed with the DNA barcode data? What are the Bioinformatics Resources for DNA barcoding data analysis?
I have been asked to check the gene expression patterns of the cells for a RNA seq data after performing principal component analysis plot using MATLAB. I have a CSV file that has the principal component values stored, but I am not sure how to perform differential expression analysis using the PC values. Any MATLAB function available? Kindly help me. Thanks in advance.
I work with spruce, which means that we don't have high numbers of clonal replicates. In an RNA-seq experiment we had one clone with six individuals and five clones with two individuals each. For the first clone there are three control and three treated individuals. For the other clones there is only one replicate of each. I am trying to find a way to analyze these data. Is it possible to use the clone that has replicates as the reference and compare the other clones to it? Is there a test that can be used to see whether the transcript counts of a clone with no replicates fall within the 95% CI of the clone with replicates? I know there are some publications about single-subject transcriptomics in medicine, where methods are being developed for personalized medicine when only one individual is sequenced.
I am using DAVID (https://david.ncifcrf.gov/home.jsp) to cluster some genes I found upregulated in my RNA-seq data. I am just using the official gene symbols without any quantitative data. However, the KEGG pathway results are giving me p-values which are extremely high. It does not make any sense to me. How can a p-value be calculated without any quantitative data? Can the p-value be significant?
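Enrichment p-values of this kind come from counting, not from expression values: the test asks how surprising the overlap between the gene list and a pathway is, given the sizes of the list, the pathway, and the background. A minimal sketch of the standard hypergeometric (one-sided Fisher) calculation follows; the numbers in the example are invented, and DAVID itself reports a more conservative variant of this test, so its values will not match exactly.

```python
from math import comb

def enrichment_pvalue(background, in_pathway, list_size, overlap):
    """P(at least `overlap` pathway genes) when drawing `list_size` genes
    at random from `background` genes, `in_pathway` of which are in the pathway."""
    upper = min(in_pathway, list_size)
    favourable = sum(comb(in_pathway, i) * comb(background - in_pathway, list_size - i)
                     for i in range(overlap, upper + 1))
    return favourable / comb(background, list_size)

# Hypothetical example: 20000 background genes, 100 in the pathway,
# 200 genes in the submitted list, 5 of them in the pathway.
p = enrichment_pvalue(20000, 100, 200, 5)
print(p)
```

A large p-value here simply means the observed overlap is about what random gene lists of the same size would give, so a list with few pathway hits can yield high, non-significant values even though no expression data was supplied.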
I am trying to run pamlX for CODEML, but I am unable to start the run. After loading all three files (.ctl, .phy, and .tree), the pamlX program stands still and the RUN option does not work. Please advise how I can start the run and complete the analysis.
My interest is to identify lineages with accelerated evolution and test diverse branch models on CODEML, considering one to several ω ratios. If at all this analysis is possible in any other program kindly please suggest that too.
I have provided the test files in the attachment.