Bioinformatics Analysis - Science topic
Questions related to Bioinformatics Analysis
I am trying to analyze the MT-ATP6 region from the 1000 Genomes Phase 3 mitochondrial chromosome VCF but I see that the VCF does not include every single position. Why is that? I assume it is due to not having any/enough variants to account for the position based on the low-coverage sequencing technology used. However, are there any other reasons for this? How can I account for these missing positional variants? I am pretty new to VCFs so any insight would be greatly appreciated, thanks!
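A VCF normally lists only sites where a variant call was emitted, so a position with no record is usually either identical to the reference in every sample or was not callable. As a rough sketch of how one might check which positions in a region have records, the snippet below scans VCF lines and reports the covered and uncovered positions. The function names are made up for illustration, and the MT-ATP6 interval 8527-9207 is an assumption that should be checked against the reference used for the call set.

```python
def positions_with_records(vcf_lines, chrom, start, end):
    """Return sorted positions in [start, end] on `chrom` that have a VCF record."""
    seen = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.rstrip("\n").split("\t")
        if fields[0] != chrom:
            continue
        pos = int(fields[1])
        if start <= pos <= end:
            seen.add(pos)
    return sorted(seen)


def positions_without_records(vcf_lines, chrom, start, end):
    """Return positions in [start, end] that have no record at all."""
    present = set(positions_with_records(vcf_lines, chrom, start, end))
    return [p for p in range(start, end + 1) if p not in present]


# Tiny in-memory example; the records are invented for illustration.
demo = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MT\t8530\t.\tA\tG\t100\tPASS\t.",
    "MT\t8533\t.\tG\tA\t100\tPASS\t.",
    "MT\t9300\t.\tC\tT\t100\tPASS\t.",
]

# Assumed MT-ATP6 interval; verify against your reference sequence.
covered = positions_with_records(demo, "MT", 8527, 9207)
print(covered)
```

Positions returned by `positions_without_records` cannot be told apart as "reference" or "no data" from the VCF alone; that distinction needs coverage information from the underlying alignments.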
I have a few sample VCFs of rather poor quality. They are 23andMe files from openSNP in the following format:
rsID, chromosome no, position, genotype
I have tried remapping them using Galaxy, but I get an error that I suspect is due to the file format. The files contain only SNPs.
Any ideas, please? How can I make it work?
These files are mapped to NCBI36/hg18 and need to be remapped to hg38.
I have a specific list of SNPs (with hg38 coordinates) in CSV format, which I need to extract from each of these files after remapping.
Please suggest any alternate workflows if there are any to help me make this work.
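Since the files are in the four-column 23andMe text layout rather than true VCF, one possible first step is to parse them directly and select the SNPs of interest by rsID, which does not depend on the genome build. The sketch below assumes tab-separated columns in the order given above and comment lines starting with `#`; the function names and the example records are illustrative only. Coordinates would still need a separate liftover step before they can be compared with hg38 positions.

```python
import io


def read_23andme(handle):
    """Yield (rsid, chromosome, position, genotype) from 23andMe-style text."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header lines and blank lines
        rsid, chrom, pos, genotype = line.rstrip("\r\n").split("\t")[:4]
        yield rsid, chrom, int(pos), genotype


def filter_by_rsid(records, wanted):
    """Keep only records whose rsID is in the `wanted` set."""
    return [rec for rec in records if rec[0] in wanted]


# In-memory stand-in for one raw file (values are illustrative).
demo = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs4477212\t1\t72017\tAA\n"
    "rs3094315\t1\t742429\tAG\n"
    "rs3131972\t1\t742584\tGG\n"
)

# In practice this set would be read from the CSV list of target SNPs.
wanted = {"rs3094315", "rs3131972"}
selected = filter_by_rsid(read_23andme(demo), wanted)
print(selected)
```

Filtering by rsID first keeps the list small, so only the selected SNPs need their coordinates converted afterwards.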
I encountered an unusual observation while constructing a nomogram using the rms package with the Cox proportional hazards model. Specifically, when Karnofsky Performance Status (KPS) is used as the sole predictor, the nomogram points for KPS decrease from high to low. However, when KPS is combined with other variables in a multivariable model, the points for KPS increase from low to high. Additionally, I've noticed that the total points run from low to high for all variables, while the 1-year survival probability runs from high to low.
Could anyone help clarify why this directional shift in points occurs? Are there known factors, such as interactions, scaling differences, or confounding effects, that might explain this pattern?
Good day! The question is really complex, since CRISPR arrays do not have an exact consensus sequence. So the question is: what is the probability that a random sequence contains two repeat units, each 23-55 bp long, containing a short palindromic sequence and differing from each other by at most 20% mismatches, separated by a spacer whose length is 0.6-2.5 times the repeat size and which does not match the left or right flank of the whole sequence?
Hi everyone,
It's been over 3 months that I have been trying to develop a script for variant calling and RNA-seq analysis for my project. I have attended quite a few workshops, but it feels like a scam. I have nobody who can guide me and I really want to learn the analysis. Can anybody tell me if there are currently any short-term courses on this?
Hello, I have some raw files with the .d extension (acquired on a Bruker instrument). Which platforms would you recommend for the bioinformatic analysis (preferably free to download)? I have experience using MaxQuant, but it does not recognize the .d files. Any recommendations? Thank you in advance.
Hello,
In the literature, there are some MS/MS results that include hypothetical proteins, which can be shorter than 40 amino acids. I can also find these when I search for an organism in the protein section of NCBI. My question is, would it be absurd if I synthetically synthesize these peptides called hypothetical proteins and test them as drug candidates in certain disease models? Or are studies like the one I mentioned feasible and being conducted? If so, what procedure should I follow? For example, when I find a hypothetical protein, should I first perform a blast and then synthesize and use it if it meets certain conditions?
Is there any chance you could share some references with me that have been done in this manner?
I hope I have been able to convey what I want to ask.
Thank you for your answers.
Example link: https://www.ncbi.nlm.nih.gov/protein?term=txid562%5Borganism%3Aexp%5D+AND+((%2210%22%5BSLEN%5D+%3A+%2220%22%5BSLEN%5D)&cmd=DetailsSearch
Hey everyone,
my question is maybe strange at first glance, but simple: is the rapid 16S kit's only real advantage the significantly larger 16S data amount generation? Shouldn't I be perfectly able to collect necessary strain-level diversity 16S data on the data analysis level from a total nanopore metagenome, without the PCR bias, given enough sample input? If the above thinking is correct, would you consider triple-digit ng input (below 1ug) sufficient, at least for key players of a mixed microbial community?
Just trying to understand if I really need the 16S barcoding kit since I have the native one (which I will use for total metagenome anyway)
Cheers
A
I am trying to predict the stability of a protein carrying different SNPs. I tried DUET, PredictSNP and DynaMut. DUET gives results quickly, but it cannot handle double mutations, while PredictSNP and DynaMut take a long time to generate results in my case.
Please suggest other stability-prediction tools that are both accurate and convenient.
I am new to Desmond simulations and would like to know how to find the estimated time remaining for a running simulation. My second question is how to perform B-factor analysis after running a simulation in Desmond. Any help in this regard will be highly appreciated.
Thanks
Genome editing (also called gene editing) is a group of technologies that give scientists the ability to change an organism's DNA. These technologies allow genetic material to be added, removed, or altered at particular locations in the genome. Several approaches to genome editing have been developed. A recent one is known as CRISPR-Cas9, which is short for clustered regularly interspaced short palindromic repeats and CRISPR-associated protein 9. The CRISPR-Cas9 system has generated a lot of excitement in the scientific community because it is faster, cheaper, more accurate, and more efficient than other existing genome editing methods.
I am using QUAST in Kbase to assess the quality of my genome assemblies of bacterial isolates.
The report from QUAST provided parameters such as N50 and Mismatches. I have found their meaning in https://quast.sourceforge.net/docs/manual.html#sec3.1. And I have learned that an ideal genome is contiguous, complete, and correct.
Most studies suggest the lower the mismatches or other values are, the better the quality will be.
However, are there any absolute values/thresholds that could be used to test whether this assembly is good quality?
(Some studies showed that the threshold depends on the size of the genome and the goal of the study. Then is there any way to calculate this threshold?)
Thank you very much!
Hello, I've recently started exploring molecular docking applications, and I'm still in the early stages. I'd like to ask which proteins should be considered when examining the antimicrobial effects of certain molecules.
Is there a list of these proteins (that I should use as docking targets), or are there general rules for proteins that should definitely be examined?
Also, can I perform docking not with a molecule but directly with an organism? If so, what should I look for to predict antimicrobial effects?
Could you please guide me on this?
Thank you.
Introduction:
I am conducting research on the bacterial composition of fecal samples from both healthy and diseased individuals using 16S sequencing. I am seeking expert guidance on the appropriate bioinformatic analysis methods for my dataset.
Objective:
My goal is to analyze the bacterial communities in fecal samples from a diseased cohort and a control group of healthy individuals, using 16S rRNA gene sequencing.
Sequencing Method:
I have employed a Nanopore sequencer to acquire full-length 16S sequences.
Alignment Method:
For the alignment process, I have used the kraken2 tool.
Database:
The standard database provided by kraken2 has been utilized for the alignment.
Output Files:
I have generated 12 sets of output files, ranging from kraken2-report01 to kraken2-report12 and kraken-output01.txt to kraken-output12.txt.
Downstream Analysis:
I am contemplating two approaches for downstream analysis:
- Converting the output data into biom format using kraken-biom and then analyzing it on the QIIME2 platform.
- Converting the output data into either OTU or ASV format for analysis using MicrobiomeAnalyst.
Questions:
- Is there a specific method for converting the kraken2 output into biom format? If so, could you provide the steps for this conversion?
- If the conversion-based approach is not advisable, what are the recommended methods for diversity analysis and identification of variable species post-kraken2 analysis?
Could someone explain to me why the p-value in the right column of the forest plot is different than the p-value in the test for effect in the subgroup?
I thought that these two p.values should be the same.

I read some papers mentioning that they used the HMP reference genome for protein homology search and I've also read about the HUMAnN database elsewhere. I'm wondering what's the difference.
Hello everyone; I am new to R programming. I want to calculate the Firmicutes-to-Bacteroidetes ratio from my OTU table, but I couldn't find a command for it and don't know how to do it. Please guide me on this.
I put an example of my OTU table.
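The ratio itself is just the summed counts of all Firmicutes OTUs divided by the summed counts of all Bacteroidetes OTUs, computed per sample. Below is a minimal sketch of that arithmetic in Python (the same logic carries over to R). It assumes the OTU table has one row per OTU and one column per sample, plus a phylum label per OTU; the variable names and counts are invented for illustration.

```python
def fb_ratio(otu_counts, otu_phylum, samples):
    """Per-sample Firmicutes/Bacteroidetes ratio from raw OTU counts."""
    totals = {s: {"Firmicutes": 0, "Bacteroidetes": 0} for s in samples}
    for otu, counts in otu_counts.items():
        phylum = otu_phylum.get(otu)
        if phylum not in ("Firmicutes", "Bacteroidetes"):
            continue  # other phyla do not enter the ratio
        for s in samples:
            totals[s][phylum] += counts[s]
    ratios = {}
    for s in samples:
        denom = totals[s]["Bacteroidetes"]
        # Avoid division by zero when a sample has no Bacteroidetes reads.
        ratios[s] = totals[s]["Firmicutes"] / denom if denom else float("nan")
    return ratios


# Hypothetical OTU table: counts per OTU per sample.
otu_counts = {
    "OTU1": {"S1": 30, "S2": 10},
    "OTU2": {"S1": 10, "S2": 40},
    "OTU3": {"S1": 5, "S2": 5},
    "OTU4": {"S1": 10, "S2": 0},
}
# Hypothetical phylum assignment per OTU.
otu_phylum = {
    "OTU1": "Firmicutes",
    "OTU2": "Bacteroidetes",
    "OTU3": "Proteobacteria",
    "OTU4": "Firmicutes",
}

ratios = fb_ratio(otu_counts, otu_phylum, ["S1", "S2"])
print(ratios)
```

Because both sums come from the same sample, the ratio is the same whether raw counts or relative abundances are used.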
I know many websites have simple tools like transcription and translation available, but are there any analysis tools that researchers need that either do not exist or are not publicly available? It could be anything from algorithms to visuals. Thanks!
Hi - I'm currently working with two RNA-Seq studies; one has RNA extracted from whole blood, the other PBMCs. Eventually we want to combine these data and perform some cell-specific deconvolution to look at DEGs.
Are there any recommended methods for batch correcting these data from different sources?
Mari
From the link https://gtexportal.org/home/datasets, under V7, I'm trying to do R/Python analyses on the Gene TPM and Transcript TPM files. To open them I had to use Universal Viewer, since the files are too large to view with an app like Notepad. In these files I'm seeing a bunch of sample IDs (e.g. GTEX-1117F-0226-SM-5GZZ7), followed by IDs like ENSG00000223972.4, and then a bunch of numbers like 0.02865, which take up about 99% of the file. Can someone help me decipher what the numbers mean, please? And is each number supposed to be assigned to a specific sample ID? (The number of values far exceeds the number of samples, by the way.) I tried opening these files as tables in R, but I do not think R is parsing the contents of the file correctly.
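The layout described matches a GCT-style matrix: a version line, a line with the row and column counts, then a header row whose first two fields are `Name` and `Description` followed by the sample IDs, and then one row per gene with one TPM value per sample. As a sketch under that assumption (it may not hold for the transcript-level file), the snippet below reads such a table so that each value can be paired with its gene and sample. The sample IDs in the demo are placeholders.

```python
import csv
import io


def read_gct(handle):
    """Parse a GCT-style table into (sample_ids, {feature_id: [values]})."""
    handle.readline()                              # version line, e.g. "#1.2"
    n_rows, n_cols = map(int, handle.readline().split()[:2])
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)                          # Name, Description, samples...
    sample_ids = header[2:]
    table = {}
    for row in reader:
        if not row:
            continue
        # row[0] is the gene ID, row[1] the gene symbol, the rest are TPMs
        table[row[0]] = [float(v) for v in row[2:]]
    if len(sample_ids) != n_cols or len(table) != n_rows:
        raise ValueError("dimensions do not match the GCT header")
    return sample_ids, table


# Small in-memory stand-in for the real file (sample IDs are placeholders).
demo = io.StringIO(
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tGTEX-AAA\tGTEX-BBB\n"
    "ENSG00000223972.4\tDDX11L1\t0.02865\t0.0\n"
    "ENSG00000227232.4\tWASH7P\t10.5\t12.25\n"
)

samples, tpm = read_gct(demo)
# The value for one gene in one sample:
print(tpm["ENSG00000223972.4"][samples.index("GTEX-AAA")])
```

Read this way, every number is the TPM of the gene in its row for the sample in its column, which is why there are far more numbers than sample IDs.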
I was using the fragbuilder module in Python to generate peptides of 4, 6, and 10 residues. However, some of the bond angles it produces deviate from the standard values. For instance, the standard value of the C_alpha--C--N bond angle is 121 degrees, but fragbuilder assigns 111 degrees. This deviation changes the distance between neighbouring C_alpha atoms: the value I get is 3.721 angstrom, while the typical value is 3.8 angstrom. Another bond angle, C_alpha--C--N, deviates from the standard value by about 6 degrees: its value is 111.4 degrees, while the typical value is 117 degrees. My question is: how much deviation is acceptable for MD simulations of peptides (or proteins) when the bond lengths and bond angles are kept fixed?
I want to purchase a MacBook mainly for bioinformatics analysis purposes, i.e. transcriptomics, small RNA, methylation, lncRNA and others. Would anyone please suggest the best affordable model?
I have two sequences from the predicted mRNA sequence (only exons, without intron) and gDNA sequence (with intron). Then, I align the sequences to confirm the position of exon in the DNA sequence. after that, I pick the primers from the exon region and check the specificity on Primer Blast. However, I also design primers only from predicted mRNA without considering the exon region on DNA sequence. Which is more appropriate to use in amplifying full-length genes in the DNA template?
When aligning virulence factors against the human proteome to exclude proteins with > 35% homology, which output should be used for the next step of predicting transmembrane helices and molecular weight for the chosen proteins?
Hi everyone,
please bear with me, because I am a complete beginner with regard to any form of bioinformatics and I am trying to understand the best approach to my experiment.
I am currently trying to isolate cells and sequence them for further bioinformatic analysis, more precisely RNA-Sequencing.
We have, however, had issues with purity and while some samples we looked at reached a purity of >90% after isolation (we usually validate it by use of flow cytometry), some samples of different animal genotypes did not.
This leads me to my first question:
How important is cell purity for Bulk RNA-Seq?
Which purity should be reached for an adequate, reliable analysis?
If anyone has any recommendations for papers to look into regarding that subject, I would be most grateful, because I have no idea where to start and what to consider.
Further along in the story we surmised that maybe Single Cell RNA Sequencing might be the better option in cases of lower purity.
But again, the same question arose: how relevant is cell purity for the following analysis and is there a cut-off value not to be crossed?
Finally:
How advantageous would using both methods be?
Sure, Bulk gives a better general overview and Single Cell is more precise, but do they complement each other or is it essentially redundant information gained by doing both experiments?
And are there any disadvantages to using only SC, or do both methods complement each other when low purity levels are in question?
Thank you a lot in advance!!
I have data from our experimental model, where we analyze the immune response following BCG vaccination, and then the responses and clinical outcome following Mtb infection of our vaccinated models. Because we cannot experimentally follow the very same animal after evaluating the post-vaccination response into the post-vaccination plus post-infection studies, these data come from different batches. Is it possible to correlate the post-vaccination responses of 5 replicates in one batch (for different vaccine candidates) with 4-5 replicates of vaccination and infection from another batch? I ask this because we are not following up the same replicates for post-vaccination and post-infection measurements (as it is not experimentally feasible). If correlation is not the best method, are there other ways to analyze the patterns, such as the strength of association between the T cell response in BCG-vaccinated models and the increased survival of BCG-vaccinated models (both measurements are from different batches)? We have several groups like that, with a variety of parameters measured per group in different sets of experiments.
Thanks for your responses and help.
I created this R package to allow easy VCF files visual analysis, investigate mutation rates per chromosome, gene, and much more: https://github.com/cccnrc/plot-VCF
The package is divided into 3 main sections, based on analysis target:
- variant Manhattan-style plots: visualize all/specific variants in your VCF file. You can plot subgroups based on position, sample, gene and/or exon
- chromosome summary plots: visualize plot of variants distribution across (selectable) chromosomes in your VCF file
- gene summary plots: visualize plot of variants distribution across (selectable) genes in your VCF file
Take a look at how many different things you can achieve in just one line of code!
It is extremely easy to install and use, well documented on the GitHub page: https://github.com/cccnrc/plot-VCF
I'd love to have your opinion, bugs you might find etc.

Dear all,
I am analysing 16S rRNA amplicon sequencing data. I tested the performance of two classifiers on a mock community, and the BLAST classifier gave the best result. However, I found out that BLAST uses local sequence alignment, so I do not know whether it is appropriate to use this classifier to assign a "mystery" sequence to a bacterial taxon. Could this approach produce false positives? Would it be better to use the VSEARCH classifier, which performed worse but uses global sequence alignment?
And a bonus question. Should I use rarefied representative sequences to perform a taxonomy classification or not? I use rarefied data for alpha diversity testing (and for beta diversity testing I do not).
Thank you all for answers!
Martin
Hello all,
I am having an issue with 16S PICRUSt data. After every PICRUSt run there is a warning that more than half of the sequences have been removed from further analysis. The reason might be that the ASV FASTA file contains sequences in mixed orientation, i.e. both plus and minus strands. PICRUSt can only deal with plus-strand sequences, hence the output is based on approximately 50% of the sequences in the FASTA file. I am looking for suggestions (scripts) for identifying the minus-strand sequences in the FASTA file, for example based on NCBI BLASTn results, and reverse complementing them, because doing this manually for about 6000 sequences would be difficult. I have limited knowledge of coding. Any help would be greatly appreciated.
I am running the PICRUSt2 pipeline as described here: https://github.com/picrust/picrust2/wiki/Full-pipeline-script. The ASV file was generated from raw FASTQ files in QIIME 2.
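Once the minus-strand ASVs are known, the reverse-complement step is easy to script. The sketch below assumes you already have a set of sequence IDs that hit the reference on the minus strand (in BLASTn tabular output these are typically the hits whose subject start is greater than the subject end); the helper names and the tiny FASTA are illustrative only.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def fix_orientation(records, minus_ids):
    """Reverse-complement only the records whose ID is in `minus_ids`."""
    return [
        (name, reverse_complement(seq) if name.split()[0] in minus_ids else seq)
        for name, seq in records
    ]


# Toy example: asv2 is assumed to be on the minus strand.
fasta_text = ">asv1\nACGTT\n>asv2\nAACCG\n"
fixed = fix_orientation(parse_fasta(fasta_text), minus_ids={"asv2"})
for name, seq in fixed:
    print(f">{name}\n{seq}")
```

Writing the printed records to a new FASTA file gives an input in which all ASVs share one orientation.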
Hi,
Could anyone suggest free software or websites used to generate unique DNA barcode sequences [5-10nt] to label XXX genes for library screening?
Thank you in advance
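If no ready-made tool fits, a small script can generate such barcodes. A common approach is to draw random candidates and keep only those that differ from every accepted barcode by a minimum Hamming distance, so that a single sequencing error cannot turn one barcode into another. The sketch below is one way to do that; the GC bounds, the homopolymer limit and the default distance are assumptions to adjust, not established requirements.

```python
import random


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_homopolymer(seq, run=3):
    """True if any base occurs `run` or more times in a row."""
    return any(base * run in seq for base in "ACGT")


def generate_barcodes(n, length=8, min_dist=3, gc_bounds=(0.35, 0.65),
                      seed=0, max_tries=100000):
    """Greedily collect up to `n` barcodes that satisfy all filters."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == n:
            break
        cand = "".join(rng.choice("ACGT") for _ in range(length))
        if has_homopolymer(cand):
            continue
        if not gc_bounds[0] <= gc_fraction(cand) <= gc_bounds[1]:
            continue
        # Accept only if far enough from every barcode kept so far.
        if all(hamming(cand, kept) >= min_dist for kept in chosen):
            chosen.append(cand)
    return chosen


barcodes = generate_barcodes(10)
print(barcodes)
```

The greedy search may return fewer than `n` barcodes when the constraints are tight (for example, very short barcodes with a large minimum distance), so the length of the result should be checked.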
I designed two sets of primers to target a gene of interest in gene expression studies (cDNA). The first primer pair (T1) gave an expected amplicon of 200 bp in in silico PCR, and targeting my gene of interest by PCR was successful (a band on the gel and successful sequencing). For the second primer pair (T2), in silico PCR gave an expected amplicon size of 1200 bp, which is the one of interest to me (I'm carrying out bioinformatics analysis of the protein sequence), but targeting the gene in the cDNA by PCR was not successful. I have tried troubleshooting by varying different parameters, without success. Could the long amplicon be a hindrance in PCR, and how can I overcome this problem?
hello
Please recommend companies that provide biotechnology services such as designing different types of primers, NGS, RNA-seq, etc.
-RNA seq and bioinformatics were carried out by professionals.
- Gene in question shows ~700 fold differential regulation by qPCR in multiple independent cohort of experiments - not in RNA seq.
Please advise....
I have some lists of gene IDs from multiple species, and I want to compile their sequences into one FASTA file per species. It is tedious to copy each accession and collect the FASTA sequences one by one.
Batch Entrez is giving me an error, maybe because the identifiers belong to another database.
I'm in the initial stages of planning a miRNA seq experiment using human cultured cells and decided on TRIzol extraction, Truseq small RNA prep kit, using an illumina HiSeq2500. The illumina webinar suggests 10-20 Million reads for discovery, the QandA support page suggests 2-5M, and I wrote the tech support to ask, who suggested I do up to 100M reads for rare transcripts. Exiqon guide to miRNA discovery manual says there is not really any benefit on going over 5M reads. I was hoping to save money by pooling more samples in a lane, so I was hoping someone with experience might be able to suggest a suitable number of reads.
Hi, I was hoping someone could recommend papers that discuss the impact of using averaged data in random forest analyses or in making regression models with large data sets for ecology.
For example, if I had 4,000 samples each from 40 sites and did a random forest analysis (looking at predictors of SOC, for example) using environmental metadata, how would that compare with doing a random forest of the averaged sample values from the 40 sites (so 40 rows of averaged data vs. 4,000 raw data points)?
I ask this because a lot of the 4,000 samples have missing sample-specific environmental data in the first place, but there are other samples within the same site that do have that data available.
I'm just a little confused on 1.) the appropriateness of interpolating average values based on missingness (best practices/warnings), 2.) the drawbacks of using smaller, averaged sample sizes to deal with missingness vs. using incomplete data sets vs. using significantly smaller sample sizes from only "complete" data, and 3.) the geospatial rules for linking environmental data with samples? (if 50% of plots in a site have soil texture data, and 50% of plots don't, yet they're all within the same site/area, what would be the best route for analysis?) (it could depend on variable, but I have ~50 soil chemical/physical variables?)
Thank you for any advice or paper or tutorial recommendations.
I am computing van der Waals interactions in Python for a 10-residue peptide in various conformations. The total number of conformations (i.e. PDB files) is 300,000. Since bonded (1-2) and 1-3 atom pairs are irrelevant for van der Waals interactions, is there a Python module that lets me skip those pairs and compute distances only for atom pairs separated by three or more bonds (1-4 and beyond)?
Hi, is there a way to extract the structures of alternatively spliced protein isoforms from the PDB? Also, can the structures be mapped to UniProt sequences, so that we know which structure belongs to which isoform sequence?
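Because the bonding pattern is the same in all conformations, the pair lists only need to be built once from the bond graph and can then be reused for every PDB file. As a sketch, the function below finds all atom pairs separated by exactly a given number of bonds using a breadth-first search; with `separation=3` it returns the 1-4 pairs. The bond list here is a toy chain, not a real peptide topology. Note that in common force fields 1-2 and 1-3 pairs are excluded and 1-4 pairs are often scaled, while all more distant pairs still contribute, so the 1-4 list is usually handled in addition to, not instead of, the remaining pairs.

```python
from collections import deque


def pairs_by_bond_separation(n_atoms, bonds, separation=3):
    """Return sorted (i, j) pairs whose shortest bond path has `separation` bonds."""
    adjacency = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    pairs = set()
    for start in range(n_atoms):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            if dist[atom] == separation:
                continue  # no need to search beyond the requested depth
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        for atom, d in dist.items():
            if d == separation and atom > start:
                pairs.add((start, atom))
    return sorted(pairs)


# Toy linear chain of five atoms: 0-1-2-3-4
bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]
one_four = pairs_by_bond_separation(5, bonds, separation=3)
print(one_four)
```

The same function with `separation=1` or `separation=2` gives the 1-2 and 1-3 pairs, which can be used as an exclusion list when looping over all atom pairs.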
After finishing the simulation of the cyclic peptide, I tried to find the most populated structure using the density peak clustering algorithm. In the literature, the representative structure was chosen as the structure with maximal $\rho_{\mathrm{sum}}$, the sum of the local densities of all residues in one structure: $\rho_{\mathrm{sum}} = \sum_{i=1}^{n_{\mathrm{res}}} \rho_i$. How can I extract the structure that has the highest summed density over all residues?
ref: Clustering by Fast Search and Find of Density Peaks. Science 2014, 344, 1492–1496
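Assuming the per-residue local densities have already been computed for every frame, picking the representative structure reduces to summing each frame's densities and taking the frame with the largest sum. A minimal sketch, with an invented density matrix (one row per structure, one column per residue):

```python
def most_populated_structure(per_residue_density):
    """Return (index, rho_sum) of the structure with the largest summed density."""
    sums = [sum(row) for row in per_residue_density]
    best = max(range(len(sums)), key=sums.__getitem__)
    return best, sums[best]


# Hypothetical densities: 3 structures x 3 residues.
densities = [
    [1.0, 2.0, 3.0],
    [4.0, 4.0, 1.0],
    [0.5, 0.5, 0.5],
]

best_index, best_sum = most_populated_structure(densities)
print(best_index, best_sum)
```

The returned index can then be used to write out the corresponding frame from the trajectory.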
Hello! I'm new to bioinformatics and cancer databases. I was exploring cBioPortal and analyzing the coexpression of different genes through scatter plots. I noticed that the axes are labeled "RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)" (I attached an example so you can see). I know that RSEM is a transcript quantification tool, but what does "Batch normalized" mean? Does it refer to upper-quartile normalization, FPKM, or something else?
thanks in advance!

Suggestions of online databases/tools I can use to verify candidate genes
Apple's M1 Mac has been on the market since 2020, but its compatibility with bioinformatics analysis tools is scarcely discussed. For example, is it possible to index a human genome on an M1 MacBook Air (16 GB)? If yes, how much time does it take? Is it possible to map reads to the reference genome? If yes, how much time does it take? Any heads-up about the Conda experience?
Please share your thoughts and experiences... it can be of a great help...
Hi,
GO and KEGG functional analysis of a gene set was performed using the DAVID database (https://david.ncifcrf.gov/). However, the adjusted p-values (Bonferroni and Benjamini) of the enriched GO terms and KEGG pathways were greater than 0.5. Meanwhile, a PPI network was constructed using the STRING database (https://string-db.org), with a confidence score of 0.4 as the cutoff and no more than ten interactors in the first shell. This step added a few more genes to the gene list, and genes with no interactions were removed. When the updated gene list was used for GO and KEGG functional analysis, the enriched GO terms and KEGG pathways were now significant (p-value < 0.05). Is this workflow valid?
Dear all, I am trying to use CD-hit to remove the duplicates from the file that is the output from trinity (RNA seq assembly).
I used the following parameters:
cd-hit-est -i in.fasta -o out_cdhit90.fasta -c 0.90 -n 9 -d 0 -M 0 -T 0
But the output file still contains lots of small or fragmented sequence plus the best one. How can I remove those small or fragmented duplicates by changing the parameters?
thanks
ZQ
Hi, I want to predict post-translational modifications, specifically phosphorylation. I found lots of websites such as PHOSIDA and PhosphoSitePlus. I am just curious whether there is any Python code for phosphorylation prediction. If you know of any, could you share the GitHub link?
Dear All,
I did a qPCR analysis of one microRNA and found it upregulated in tumour tissues compared to normal ones. Then I ran a bioinformatic analysis to detect its target genes, and the most important predicted targets of the microRNA were oncogenes (based on other studies).
I didn't do any further study on the target genes and I need to keep to the bioinformatic analysis only. How can I discuss the results? Is there any way I can discuss these results, given that it will be only an in silico study?
Many thanks
Hi there,
I want to analyze differential gene expression in mice before and after treatment.
I have 6 mice and paired-end reads, so how can I merge all my "before treatment" data to compare them with all my "after treatment" data in DESeq2?
Should I map/count/DESeq2 them separately?
Is there any way to combine (normalize) all replicates at first and then perform analysis like what we do generally in statistics?!
Thank you in advance.
I have two vcf files corresponding to the results of healthy tissue and tumor tissue. I want to compare these vcf files and remove their similarities. More specific I want to remove the information of the healthy tissue from the tumor one. Have you any suggestions on which tool I should use or any way that I can do my analysis?
Thanks in advance.
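Dedicated tools exist for this kind of set operation on VCFs (bcftools has an `isec` command for intersections and complements), and somatic variant callers compare tumor and normal directly. To show the underlying idea, here is a minimal sketch that keeps only the tumor records with at least one allele not seen in the normal sample, matching on chromosome, position, REF and ALT. It assumes both files were called against the same reference and normalised in the same way; the function names and records are illustrative.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys, one per ALT allele."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def tumor_only(tumor_lines, normal_lines):
    """Keep header lines and tumor records with an allele absent from normal."""
    normal = variant_keys(normal_lines)
    kept = []
    for line in tumor_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        if not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        if any((chrom, int(pos), ref, a) not in normal for a in alt.split(",")):
            kept.append(line)
    return kept


header = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
tumor = header + [
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.",
    "chr2\t300\t.\tG\tA,C\t50\tPASS\t.",
]
normal = header + [
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr2\t300\t.\tG\tA\t50\tPASS\t.",
]

result = tumor_only(tumor, normal)
print("\n".join(result))
```

A plain subtraction like this ignores coverage in the normal sample, so a variant can look tumor-specific simply because the normal had too few reads at that site.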
EDIT: Please see below for the edited version of this question first (02.04.22)
Hi,
I am searching for a reliable normalization method. I have two ChIP-seq datasets to be compared with a t-test, but the RPKM values are biased, so I need to fix this before the t-test. For instance, when a value is high, it does not mean it is high in reality: another factor may be causing the high value, and in reality I should see a value closer to the mean. Likewise, if a value is low and the factor is strong, the factor may be the reason for the low value, and the true value should be much closer to the mean. In brief, I want to eliminate the effect of this factor.
For this purpose, I have another dataset showing how strong this factor is for each value in the ChIP-seq datasets (again as RPKM values). Should I simply divide my RPKM values by the corresponding factor RPKM to get unbiased data? Or is it better to divide my RPKM values by the ratio RPKM / mean(RPKMs) of the factor?
Do you have any other suggestions? How should I eliminate the factor?
What exactly is the role of HSP-90 in extracellular environment of the cell? I am wondering whether hsp90 is involved in the translocation of the client protein from outside to inside of the cell. If somebody is having some references please share with me. I am very curious about this molecule.
I have ChIP-seq data for two different proteins and have aligned the reads to a set of DNA fragments. Some of these fragments get a zero read count in one or both datasets. To be able to say that these fragments carry much more of protein X than of protein Y, I use Student's t-test.
I wonder whether it would be better to remove the zero values from both datasets of per-fragment RPKM values. Moreover, the zeros pose a problem when I want to use a log scale for data visualization.
What would you suggest?
I do not have much experience in bioinformatics, and I need to find the common genes in several gene expression datasets; in other words, I need to find genes that appear in all (or some) of my datasets. I am looking for a tool that gives me Venn diagrams of the shared genes. Any suggestion (free software, please) will be much appreciated.
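Apart from Venn diagram tools, the overlap itself is a simple set operation. As a small sketch, the function below counts in how many datasets each gene occurs and returns those found in at least a chosen number of them; the dataset names and gene lists are made up for illustration.

```python
from collections import Counter


def genes_in_at_least(datasets, min_count):
    """Return sorted genes present in at least `min_count` of the datasets."""
    counts = Counter(
        gene for genes in datasets.values() for gene in set(genes)
    )
    return sorted(gene for gene, c in counts.items() if c >= min_count)


# Hypothetical gene lists from three datasets.
datasets = {
    "study_A": ["TP53", "BRCA1", "MYC"],
    "study_B": ["TP53", "MYC", "EGFR"],
    "study_C": ["TP53", "KRAS", "MYC", "EGFR"],
}

in_all = genes_in_at_least(datasets, min_count=len(datasets))
in_two_or_more = genes_in_at_least(datasets, min_count=2)
print(in_all)
print(in_two_or_more)
```

Gene identifiers need to be harmonised first (same ID type and case), otherwise identical genes will not match across datasets.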
"The development and validation of a medium density SNP genotyping assay in Shrimp" is a research proposal I'm currently working on. Given the restricted budget allotted (9,600 USD) to the project, I'd like to know ahead of time how much it might probably cost me.
Hello. I am trying to run a haplotype analysis in PopArt. It's going well until I realized I can not load a previous work in PopArt. I can only export the graphical output as .svg, .png, or .pdf but not as a "network" file which I can reload or edit if I want to in the future. I noticed that it can be saved as a .nex file and the new file actually had additional lines (the portion of the code started with: "Begin NETWORK"). I think this is supposed to be read by PopArt but it fails to do so. I encounter parsing errors when I try to run the new file. I am not sure if there is a way around this as I am new to the software. Any help would be appreciated. Stay safe, anon!
Hello,
I am new in this field. I am doing metagenome analysis with shotgun reads. All reads are single ended. DNA was obtained from airways of human. I just want to find taxon abundances in the samples. Then I will predict the diversities and core microbes.
My mapping results are terrible. How can I handle bad mappings? Or should I change the tools I used for the analysis? Which tools are more accurate or sensitive for microbiome analysis? I would welcome any suggestions, please!
I followed this pipeline:
- Assembly was done using Megahit
- Short contigs (<200 bps) were removed using prinseq
- Read mapping against contigs was performed using BWA
- Similarity searches against GenBank, KEGG, and eggNOG were done using Diamond
- Binning was done using MaxBin2
You can find my mapping results in the attachment.

I am trying to install the MetaDE package in R. However, I get the warning "package 'MetaDE' is not available for this version of R. A version of this package for your version of R might be available elsewhere". How can I overcome this issue? I am using R version 4.1.0.
I used WebMGA to cluster my NGS data (COG). I have a problem analyzing the data provided in output.zip, since the file format is unknown to me. Do I need specific software to open each of those files?
Hi everyone,
Can anybody help in analyzing a density profile graph generated by a simulation run on GROMACS? I have attached the file for your reference.
Need an elaborate explanation as I am new to this. Kindly also suggest me any research articles related to this topic.
Thank you so much in advance!!
Good day
Regards
Renu

The workflow includes these steps: 1) raw data format transformation for five companies; 2) updating the positions of all SNPs to GRCh37 (hg19); 3) quality control within companies; 4) pre-phasing (SHAPEIT2) and imputation (IMPUTE2) for all SNPs of each company; 5) GWAS using two logistic models for 27 phenotypes; 6) statistical and downstream bioinformatic analysis; 7) estimation of genetic parameters (rg and hg); 8) PRS analysis. My dataset consists of only slightly more than 1000 people. With no background knowledge, how long would this take a bioinformatics master's student?
I am writing to ask for bioinformatics ideas for a humanized antibody. I have been wondering what kind of bioinformatic analysis I can do on it. Although I can use docking and molecular dynamics to evaluate this antibody, I am looking for other ways to analyze it in structural bioinformatics. Please suggest how I can conduct a bioinformatics analysis of this antibody. Any relevant article to refer to would be greatly appreciated.
I have used references from various sources to write my code, but the dataset is not being read properly in RStudio.
Applications of bioinformatics in medicine are a key factor in the advancement of modern medical technologies.
In which areas of medical technology are the technological achievements of bioinformatics used?
What are the applications of bioinformatics in medicine?
Please reply
I invite you to the discussion
Thank you very much
Best wishes

Hello,
I am looking to obtain global RNA-Seq data for either E. coli or P. putida. I assume RNA-seq data is publicly available for many microbes, but I am unsure where I can access this information. Does anyone have insight as to what website or database I can find this data?
Many thanks,
Shawn
Hi, I am a beginner in the field of cancer genomics. I am reading gene expression profiling papers in which researchers classify cancer samples into two groups based on the expression of a group of genes (for example, a "High group" and a "Low group") and do survival analysis. They then associate these groups with other molecular and clinical parameters, for example serum B2M levels, serum creatinine levels, 17p del, or trisomy of 3. Some researchers classify the cancer samples into 10 groups. If I am proposing a cancer classification scheme and presenting a survival model based on 2 groups or 10 groups, how should I assess the predictive power of my proposed classification model, and how do I compare its predictive power with that of other survival models? Thank you in advance.
Hi all,
I would like to know if you have any information on the following, more precisely on companies that could provide services for
1. genome sequencing and assembly;
2. whole-genome methylation sequencing for 20 samples, including bioinformatics analysis
Thanks
What is the script for quantile normalization of a microarray dataset (GSE70970) using limma? Do I need to create a model matrix before normalizing? I'm very new to R.
I am wondering if the low levels of total RNA in these samples are enough for RNA-seq. Has anyone already done this, or does anyone have suggestions for getting reliable data for bioinformatic analysis?
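In limma itself, quantile normalization is done with normalizeBetweenArrays(x, method = "quantile"), and that function does not take a design matrix; the model matrix is only needed later for lmFit. What the procedure does can be illustrated with a minimal Python sketch. It is not limma's implementation: it assumes no missing values and does not average tied values, and the function name is made up for illustration.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of samples (each a list of expression values).

    Illustrative sketch only: assumes equal-length columns, no missing
    values, and ignores special handling of ties.
    """
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all samples must have the same number of probes")

    # Mean of the k-th smallest value across all samples.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n_rows)
    ]

    normalized = []
    for col in columns:
        # Indices of this sample's values, from smallest to largest.
        order = sorted(range(n_rows), key=lambda i: col[i])
        new_col = [0.0] * n_rows
        for rank, idx in enumerate(order):
            # Replace each value by the cross-sample mean for its rank.
            new_col[idx] = rank_means[rank]
        normalized.append(new_col)
    return normalized


if __name__ == "__main__":
    samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
    print(quantile_normalize(samples))
```

After normalization every sample has the same set of values, only assigned to different probes according to the original ranking within that sample.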
I've tried to dock an enzyme (523 residues) with its amino acid substrate, but no docking server can recognize a single amino acid as a ligand. What can I do for docking those molecules?
I have whole-genome sequences of a fosmid DNA. I will do the bioinformatics analysis, and my main aim is to identify the sequences of my insert.
Could you recommend a cloud-based or desktop-based (preferably Windows) tool for whole-genome sequence analysis of fosmid DNA?
I have some files in bed and bedgraph format to analyze with IGV. My team and I tried to load them in IGV following the tutorials on the IGV site, but it hasn't worked. The bedgraph files are large (5157), so we converted them to the binary .tdf format using the IGVTools "Count" command, but that hasn't worked either. For some files we only see a single flat line on the IGV screen, without any information. With FilexT we can see that the bed and bedgraph files are not damaged.
We think the problem is at the step where we select the "Load from File" option in IGV. What can we do?
We use IGV 2.10.3.
DNA barcoding is used to obtain taxonomic information about unidentified organisms. Apart from that what other types of Bioinformatics analysis might be performed with the DNA barcode data? What are the Bioinformatics Resources for DNA barcoding data analysis?
I have been asked to check the gene expression patterns of the cells in an RNA-seq dataset after producing a principal component analysis plot in MATLAB. I have a CSV file in which the principal component values are stored, but I am not sure how to perform differential expression analysis using the PC values. Is there a MATLAB function available? Kindly help me. Thanks in advance.
I work with spruce, which means that we don't have high numbers of clonal replicates. In an RNA-seq experiment we had one clone with six individuals and five clones with two individuals each. For the one clone there are three control and three treated individuals. For the other clones there is only one replicate of each. I am trying to find a way to analyze these data. Is it possible to use the clone that has replicates as the reference and compare the other clones to it? Is there a test that can be used to see if the transcript counts of a clone with no replicates fall within the 95% CI of the clone with replicates? I know there are some publications on single-subject transcriptomics in medicine, where methods are being developed for personalized medicine when only one individual is sequenced.
I am using DAVID (https://david.ncifcrf.gov/home.jsp) to cluster some genes I found upregulated in my RNA-seq data. I am using only the official gene symbols, without any quantitative data. However, the KEGG pathway results give me p-values that are extremely high, which does not make sense to me. How can a p-value be calculated without any numbers? Can such a p-value be significant?
I am trying to run CODEML through pamlX, but I am not able to start the run. After loading all three files (.ctl, .phy, and .tree), the pamlX program stands still and the RUN option does not work. Please advise how I can start the run and complete the analysis.
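On how a p-value can arise from gene symbols alone: enrichment tools of this kind test counts, not expression values. They ask how surprising it is to see k pathway genes in a list of n genes, given how many pathway genes exist in the background. As far as I know, DAVID reports an EASE score, a more conservative modification of the Fisher exact test, so the sketch below shows the plain hypergeometric version of the idea rather than DAVID's exact calculation. The function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, background_hits, background_size):
    """One-sided hypergeometric p-value: P(X >= list_hits).

    list_hits       -- genes in the submitted list annotated to the pathway
    list_size       -- total genes in the submitted list
    background_hits -- genes in the background annotated to the pathway
    background_size -- total genes in the background
    """
    max_hits = min(list_size, background_hits)
    total = comb(background_size, list_size)
    # Sum the probabilities of observing list_hits or more pathway genes.
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 8 of 200 submitted genes fall in a pathway
    # that contains 150 of 20000 background genes.
    print(enrichment_p_value(8, 200, 150, 20000))
```

A high p-value in this setting simply means the overlap between the gene list and the pathway is about what chance alone would produce.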
My interest is to identify lineages with accelerated evolution and test diverse branch models on CODEML, considering one to several ω ratios. If at all this analysis is possible in any other program kindly please suggest that too.
I have provided the test files in the attachment.
I have a dataset (shown in the attached picture) with RNA-seq data from various samples in which the same gene appears twice.
Suppose that for sample 1 I want to measure this gene (which is haplotypic in nature): how should I treat its RNA-seq values for that sample? Do I take the average or the median, or should I treat the two versions of the gene as separate genes? I guess biologists would be able to give a better explanation.

The project's budget is $12,000, which does not include buying any equipment except, for example, a genotyping analysis kit. I previously did a project analyzing genetic diversity and selection signatures in four endangered cattle breeds using the Illumina BovineHD kit, but it was not satisfying. Any suggestions? This is very important and crucial for my career.
Thanks in advance,
Hi,
I want to know if I can detect a mutation in a DNA sequence (Sanger sequencing) using Biopython.
Is there a program I could write to detect the position and the type of mutation in the generated sequence compared with a wild-type sequence?
Best regards.
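A minimal sketch of the comparison step, in plain Python so it runs without extra packages. It assumes the wild-type and sample sequences have already been aligned to the same length with "-" marking gaps (Biopython's Bio.Align.PairwiseAligner is one way to produce such an alignment). The function name and the example sequences are made up for illustration.

```python
def find_mutations(wild_type, sample):
    """List differences between two pre-aligned sequences.

    Returns tuples of (position, type, wild-type base, sample base), where
    position is 1-based in the ungapped wild-type sequence. An insertion is
    reported at the wild-type position it follows.
    """
    if len(wild_type) != len(sample):
        raise ValueError("sequences must be aligned to the same length")

    mutations = []
    ref_pos = 0  # position in the ungapped wild-type sequence
    for ref_base, obs_base in zip(wild_type.upper(), sample.upper()):
        if ref_base != "-":
            ref_pos += 1
        if ref_base == obs_base:
            continue
        if ref_base == "-":
            mutations.append((ref_pos, "insertion", "-", obs_base))
        elif obs_base == "-":
            mutations.append((ref_pos, "deletion", ref_base, "-"))
        else:
            mutations.append((ref_pos, "substitution", ref_base, obs_base))
    return mutations


if __name__ == "__main__":
    for mutation in find_mutations("ATGC-ATTG", "ATTCGAT-G"):
        print(mutation)
```

For Sanger data it is worth trimming low-quality ends of the read before alignment, since miscalled bases there would otherwise be reported as mutations.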
Hello,
I really need to know what is the best way to do bioinformatic analysis of posttranslational modifications of human proteins?
Also, which tools and software do you recommend for this purpose? What about networKIN tool (http://networkin.info/)?
I would highly appreciate if you could help me in this regard.
Many thanks.
Best wishes,
Farah
I am trying to do bioinformatics analysis for LGG and GBM cohort from TCGA. I encountered a difficulty because I am not sure if I counted correctly whether the LGG cohort is heterogeneous or homogeneous in terms of IDH1 status (WT or Mutant). Can somebody help me do that or did it already? :)
Thank you in advance for your help :)
Hello friends, today I am raising a question: what are real palindromic DNA sequences? Of course you will say restriction enzyme sites, but in a video available at http://bit.ly/palindromicDNA I argue that, in the true sense, mirror repeats are palindromic as defined by standard English dictionaries. There are many unique properties of mirror-repeat DNA, which I will share later. Hopefully the biological scientific community will accept mirror repeats as true English palindromes. So please check out http://bit.ly/palindromicDNA
I will use the equation T(A,B) = (A·B) / (|A|² + |B|² − A·B) to calculate drug-likeness. Here, A is the vector of molecular descriptors for the compound and B is the vector of average molecular properties of all compounds in the DrugBank database.
I have used PaDEL to calculate all the descriptors. Which descriptors should I use?
Rather than using sequence alignment data, I want to build a phylogenetic tree from a distance matrix, with bootstrapping as part of the statistical analysis. Can anyone tell me how to carry out this analysis?
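The equation above is the Tanimoto coefficient for real-valued vectors. It can be sketched as follows, assuming A and B are descriptor vectors of equal length in the same descriptor order. The function name and the numbers in the usage example are made up, and the choice of descriptors remains an open question.

```python
def tanimoto(a, b):
    """Tanimoto coefficient T(A,B) = A.B / (|A|^2 + |B|^2 - A.B)."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a_sq = sum(x * x for x in a)
    norm_b_sq = sum(y * y for y in b)
    denominator = norm_a_sq + norm_b_sq - dot
    if denominator == 0:
        raise ValueError("both vectors are all zeros")
    return dot / denominator


if __name__ == "__main__":
    compound = [0.8, 1.2, 0.5]           # hypothetical scaled descriptors
    database_average = [1.0, 1.0, 0.4]   # hypothetical database averages
    print(tanimoto(compound, database_average))
```

Because descriptors such as molecular weight and logP live on very different scales, the vectors would normally be scaled before the coefficient is computed; otherwise the largest descriptor dominates the result.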
Hello all,
I have a question regarding gene prediction for long metagenomic reads (MinION nanopore).
I was trying to understand the process of gene prediction. In my attempt, I classified my metagenomic sequences against a reference database using the following methods:
1. I ran Prodigal to predict ORFs using the -p meta option and then ran the DIAMOND aligner with an e-value of 0.002
Result: 689 queries aligned
2. I used DIAMOND directly to align the metagenomic reads to the reference database with the same e-value (I did not set an identity parameter)
Result: 7292 queries aligned
3. I translated my DNA FASTA file into protein sequences using GOTRANSEQ and ran the same analysis with the same parameters.
Result: 169 queries aligned.
There is a huge difference between methods 2 and 3, which confuses me.
Which approach is better for predicting protein-coding gene sequences from long reads?
Is the e-value a sufficient parameter for a DIAMOND blastp analysis? Do I need to set an identity % threshold for the first approach?
In addition, I would like to confirm whether I can use the translated file from the Prodigal analysis (-a output) directly as input for DIAMOND.
Please help
I am looking for a command that will merge the 3 chains in the original PDB file into a single chain and then renumber all of the residues. I have tried using the alter command, but when I export the PDB I get only one chain (of the initial trimer) and not the merged chain.
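Outside the viewer, the same result can be sketched directly on the PDB text, assuming standard fixed-column ATOM/HETATM records (chain ID in column 22, residue number in columns 23-26, insertion code in column 27). This is an illustration rather than a tested replacement for the alter command. The function name is hypothetical, atom serial numbers are left unchanged, and TER records are dropped so the output reads as one continuous chain.

```python
def merge_chains(pdb_lines, new_chain="A"):
    """Put all residues into one chain and renumber them consecutively."""
    merged = []
    new_number = 0
    previous_key = None
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # A residue is identified by chain ID plus residue number
            # and insertion code (columns 22 and 23-27).
            key = (line[21], line[22:27])
            if key != previous_key:
                new_number += 1
                previous_key = key
            # Rewrite chain ID and residue number, blank the insertion code.
            line = f"{line[:21]}{new_chain}{new_number:>4} {line[27:]}"
        elif line.startswith("TER"):
            # Chain terminators between the old chains are removed.
            continue
        merged.append(line)
    return merged


if __name__ == "__main__":
    # "trimer.pdb" and "merged.pdb" are placeholder file names.
    with open("trimer.pdb") as handle:
        result = merge_chains(handle.readlines())
    with open("merged.pdb", "w") as handle:
        handle.writelines(result)
```

Note that merging chains in the file does not create peptide bonds between them; programs that infer connectivity from distances will still see three separate polypeptides.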
I am trying to measure telomere length in an ant species using the TRF method, which is a Southern blot technique. Currently, I am struggling with analyzing the images and would appreciate any tips and suggestions on how to statistically analyze my data.
- Which software do you recommend to analyze the length?
- Any tips on how to image my membrane to get accurate results and is there anything to avoid while imaging?
I have attached an image for reference!
Thank you in advance and much appreciated
The PSSM (position-specific scoring matrix) is one of the key features used for conformational B-cell epitope prediction, but I am confused about how to use it as a feature.
I am researching immune checkpoint genes in oral cancer. I wish to know how to use bioinformatics analysis to predict the synergistic drug combination of a certain immune checkpoint gene inhibitor and a common chemotherapeutic drug. If possible, this combination will provide the research direction for my future experimental design.
Could you please point me to papers, general methods, and databases that I can use to achieve this goal? Thank you very much.
I am looking for a recent diagnosis for chikungunya virus through computational biology techniques.
I've read several papers that used a panel of genes to narrow the range of candidates in WES analysis. Some studies select hundreds of so-called "phenotype-related" candidate genes, and some use more. My question is how to design a panel that would be beneficial for the subsequent analysis.
Besides obtaining knowledge from reviews, in which databases could I find all genes that may have effects on the targeted phenotype? (OMIM? MouseMine?)
I was trying to find a plasmid origin of replication, but Ori-Finder did not find it. I also tried to BLAST against their database and to align to other similar bacteria, but with no success.
Can someone recommend a fairly simple program for someone who is not a bioinformatician?
I was also looking into GC skew analysis; if someone can recommend a program for that as well, I would appreciate it.
Thanks.
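On the GC skew part, the calculation itself is small enough to sketch in Python. GC skew per window is (G − C) / (G + C), and the minimum of the cumulative skew often lies near the origin in bacterial chromosomes. That is an assumption to check for plasmids, where the signal can be weak or absent depending on the replication mechanism. The function name and window size are arbitrary choices.

```python
def cumulative_gc_skew(sequence, window=1000):
    """Return (per-window GC skew, cumulative GC skew) along a sequence."""
    sequence = sequence.upper()
    skews = []
    cumulative = []
    running_total = 0.0
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without any G or C contribute a skew of zero.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        running_total += skew
        skews.append(skew)
        cumulative.append(running_total)
    return skews, cumulative


if __name__ == "__main__":
    skews, cumulative = cumulative_gc_skew("GGGGCCCCGGCC", window=4)
    # Window index with the lowest cumulative skew (candidate origin region).
    print(min(range(len(cumulative)), key=cumulative.__getitem__))
```

Plotting the cumulative values against window position gives the usual V-shaped or inverted-V curve when a skew signal is present.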
One of the steps in preprocessing metatranscriptomic data is to remove any host reads (host contamination) by comparing the reads to a host database. But what if there is no host reference, or the closest reference is a draft genome from the same family but a different genus?