Science topic

Basic Bioinformatics - Science topic

Explore the latest questions and answers in Basic Bioinformatics, and find Basic Bioinformatics experts.
Questions related to Basic Bioinformatics
  • asked a question related to Basic Bioinformatics
Question
1 answer
I have downloaded the NIST peptide spectral library from the NIST database, which is in .msp format. I want to convert it to an .mzXML file in order to read the file in MATLAB.
Is there any free tool for the conversion? I tried to use ProteoWizard but it doesn't recognize the .msp file.
Relevant answer
Answer
Hi Sangeetha,
I also have the same issue with MSP files. Were you able to solve it? I would appreciate any suggestion you can offer.
Best,
  • asked a question related to Basic Bioinformatics
Question
2 answers
Hey everyone,
my question may seem strange at first glance, but it is simple: is the rapid 16S kit's only real advantage that it generates significantly more 16S data? Shouldn't I be perfectly able to recover the necessary strain-level 16S diversity at the data-analysis level from a total nanopore metagenome, without the PCR bias, given enough sample input? If the above reasoning is correct, would you consider triple-digit ng input (below 1 ug) sufficient, at least for the key players of a mixed microbial community?
Just trying to understand if I really need the 16S barcoding kit since I have the native one (which I will use for total metagenome anyway)
Cheers
A
Relevant answer
Answer
Abhijeet Singh both kits offer the same multiplexing capacity, if I understand the question you're asking - both the 16S kit and the native kit that we have are 24-barcode kits.
I am rather curious about the necessity of 16S in terms of sequencing success - I can see low-complexity microbial samples getting sequenced just as successfully with a native kit as with 16S, but without the PCR amplification bias, which in fact affects relative quantification negatively rather than being a prerequisite for it, as you seem to state (because amplification efficiency drops steeply above 60% GC content of the amplicon). PCR amplification probably makes a positive difference when trying to detect low-abundance species, but I am not interested in those in this project.
  • asked a question related to Basic Bioinformatics
Question
4 answers
I need to extract the x,y coordinates of a PCA plot (generated in R) to plot in Excel (my boss prefers Excel).
The code to generate the PCA:
pca <- prcomp(data, scale=T, center=T)
autoplot(pca, label=T)
If we take a look at pca$x, the first two PC scores for an example point are as follows:
29. 3.969599e+01 6.311406e+01
So for sample 29, the PC scores are 39.69599 and 63.11406.
However, if you look at the output plot in R, the coordinates are not 39.69599 and 63.11406 but ~0.09 and ~0.2.
Obviously some simple algebra can estimate how the PC scores are converted into the plotted coordinates but I can't do this for ~80 samples.
Can someone please shed some light on how R gets these coordinates and maybe a location to a mystery coordinate file or a simple command to generate a plotted data matrix?
NOTE: pca$x does not give me what I want
Update:
Redoing prcomp() without scale and center gives me this for PC1 and PC2 for the first 5 samples
1 -8.9825883 0.0113775
2 -16.3018548 9.1766104
3 -21.0626458 3.0629666
4 5.5305875 4.0334291
5 0.2349433 12.4872609
However the plot ranges from -0.15 to 0.4 for PC1 and -0.35 to 0.15 for PC2
(Plot attached)
Relevant answer
Answer
I'm conscious my comment is hardly timely, but I believe the issue might be that some visualisation functions in R "scale" the PCA scores in the background. By this I mean the issue is probably autoplot(), or other visualisation facilities - not the PCA you performed per se.
Take for example the function biplot(), which is readily available in base R to visualise objects generated by prcomp(). If you look at how the function is coded (stats:::biplot.prcomp) you'll see that it divides the first two columns of PCA scores by their standard deviations, i.e.:
scores <- pca$x
lam <- pca$sdev[1:2]   # pca$sdev is a vector, so it is indexed without a comma
pca_plot_coord <- t(t(scores[,1:2])/lam)
(biplot() additionally multiplies lam by sqrt(n), with n the number of observations, before dividing.)
(Notice that pca$sdev is the square root of the relevant eigenvalues of the covariance matrix of your centred data, i.e. the standard deviations of the corresponding scores, if the computational structure of PCA is followed correctly. Fun fact: this equivalence won't necessarily hold if you use some canned R routines for PCA that rely on the singular value decomposition of the data matrix instead of manipulating its covariance matrix.)
So in a nutshell, as previous comments have already pointed out, what you're actually interested in are the PCA "scores" (pca$x); however some visualisation facilities in R, such as biplot(), may do some scaling in the background.
If you want to make your life simpler, you could just use PCA() from the FactoMineR package and then plot() the resulting object: as far as I can tell, the plotted result is not manipulated further and what you get in the plot is based on the PCA scores, as you'd expect.
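To hand the coordinates to Excel, a minimal sketch (assuming the pca object from prcomp() above; the sqrt(n) factor mirrors what stats:::biplot.prcomp does) would be:
scores <- pca$x[, 1:2]                                   # raw PC1/PC2 scores
lam <- pca$sdev[1:2] * sqrt(nrow(pca$x))                 # biplot-style scaling factor
scaled <- t(t(scores) / lam)                             # coordinates comparable to the biplot display
out <- data.frame(sample = rownames(pca$x), PC1 = scores[, 1], PC2 = scores[, 2],
                  scaled_PC1 = scaled[, 1], scaled_PC2 = scaled[, 2])
write.csv(out, "pca_coordinates.csv", row.names = FALSE)   # open this file in Excel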
  • asked a question related to Basic Bioinformatics
Question
2 answers
I'm using AutoDock Vina in Python to dock multiple proteins and ligands, but I'm having trouble setting the docking parameters for each protein. How can I do this in Python? (I have attached my Python code; in it I have assumed the same parameters for all proteins.)
Relevant answer
Answer
With the above code, the grid box size will be 20x20x20 irrespective of protein size. At the end of the Vina run, most of the complexes will therefore show a binding affinity of "0" or a very low value, because the active site falls outside the grid box range. Better to increase the grid box size (SIZE_X, SIZE_Y, SIZE_Z) to 60 or even 120 each, depending on the size of the largest protein (chains in the PDB file) in each complex, and run Vina again. Then you should get binding energy values for most of the protein-ligand complexes (sometimes for all).
However, this will not necessarily mimic the experimental structure correctly, since you are docking many protein-ligand complexes in bulk with a single common configuration file.
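If you are on the AutoDock Vina Python bindings (the vina package), one way to give each protein its own box is to keep the parameters in a dictionary; a minimal sketch (all file names, centers and box sizes below are placeholders, not values from the original post):
from vina import Vina

jobs = {
    "protein1": {"receptor": "protein1.pdbqt", "center": [10.0, 22.5, -5.0], "box": [60, 60, 60]},
    "protein2": {"receptor": "protein2.pdbqt", "center": [-3.0, 8.0, 15.0], "box": [80, 80, 80]},
}

for name, p in jobs.items():
    v = Vina(sf_name="vina")
    v.set_receptor(p["receptor"])
    v.set_ligand_from_file("ligand.pdbqt")
    v.compute_vina_maps(center=p["center"], box_size=p["box"])   # per-protein grid box
    v.dock(exhaustiveness=8, n_poses=5)
    v.write_poses(name + "_out.pdbqt", n_poses=5, overwrite=True)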
  • asked a question related to Basic Bioinformatics
Question
1 answer
I have an unusual question: I am working on an Erasmus internship project with Drosophila mutants at 2 different timepoints and with WT, KO and KI conditions. A company analyzed the data using DESeq2 and I have only received loads of PDFs and the results_apeglm.xlsx file.
This contains: Transcripts per million for each gene, replicate and timepoint with the comparison for looking at DEGs - so I have a padj and log2FC value. A snippet is attached as an example.
I now want to construct a graph and clustering where genes that are going in changing directions between WT and KO over time become visible out of the hundreds of candidate DEGs. With this I want to narrow down the long list to make it verifiable with qPCR and serve as a marker for transformation from presymptomatic to symptomatic.
I am setting up my analysis in R and want to use the degPatterns() function from DEGReport, as it gives a nice visual output and clusters the genes for me.
How can I now transform my Excel sheets, to a matrix format that I can use with degPatterns()? The example with the Summarized Experiment given in the vignette is not really helpful to me, sadly.
Thank you all for reading, pondering and helping with my question! I would be very happy if there's a way to solve my data wrangling issue.
All the best,
Paul
Relevant answer
Answer
Hi, what exactly do you need? A matrix from an Excel sheet? Then simply read the Excel file using
as.matrix(readxl::read_excel(your_file_location))
You may need to remove a few columns and then match the remaining columns to your metadata.
If degPatterns() is then not working properly, you may need to clean and re-transform your data.
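As a minimal sketch of the wrangling itself (the column names such as "gene" and the "TPM_WT_T1_rep1" pattern are assumptions; adapt them to the layout of your results_apeglm.xlsx):
library(readxl)
library(DEGreport)

tab <- read_excel("results_apeglm.xlsx")
expr <- as.matrix(tab[, grep("^TPM_", colnames(tab))])   # keep only the expression columns
rownames(expr) <- tab$gene

# one row of metadata per expression column: condition (WT/KO/KI) and timepoint
meta <- data.frame(
  row.names = colnames(expr),
  condition = sub("^TPM_([A-Z]+)_.*$", "\\1", colnames(expr)),
  timepoint = sub("^TPM_[A-Z]+_(T[0-9]+)_.*$", "\\1", colnames(expr))
)

candidates <- rownames(expr)[1:500]   # e.g. your padj/log2FC-filtered gene list
clusters <- degPatterns(expr[candidates, ], metadata = meta,
                        time = "timepoint", col = "condition")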
  • asked a question related to Basic Bioinformatics
Question
3 answers
I have tried to separate a direct coculture of MSCs (mesenchymal stromal cells) and macrophages to do bulk RNA-seq on the macrophages, as I want to find out how MSCs change gene expression in macrophages. I have tried different methods to separate the coculture as much as possible, but I can only manage to retrieve a cell population with 95% macrophages and 5% MSCs still present.
Therefore, I want to know if anyone has experience with analyzing data when the population is not completely pure for one cell type, and how I should handle such data.
Is it wise to proceed with bulk RNA-seq when 5% of my cells are still MSCs, being well aware that some of the expressed genes observed could come from the 5% MSCs?
Relevant answer
Answer
Dear Kian,
have you tried improving your purity by FACS? It's fairly easy to choose markers to distinguish MSCs and macrophages and to sort highly pure populations.
  • asked a question related to Basic Bioinformatics
Question
3 answers
Can someone please list the analyses that need to be performed to understand and compare the mechanism of a protein in its native and inhibitor-bound states? (NOTE: Not the interaction between protein and ligand)
Relevant answer
Answer
  • asked a question related to Basic Bioinformatics
Question
3 answers
Hello,
I am looking for a tool (software, script), a tutorial or some help to generate 10000 random non-redundant peptide sequences with a fixed length of 9 amino acids, in which some positions are fixed rather than random.
An example to illustrate:
When starting with : _ _ A _ M _ S S _
I want to generate a pool of 10000 non-redundant peptides with random AAs at positions 1, 2, 4, 6 and 9, while keeping positions 3 (A), 5 (M), 7 (S) and 8 (S) the same.
I am sure this can be done quite easily with a simple Python script, but I have almost zero knowledge of writing such a script. If someone can help me, or is aware of an online tool that does this (so far the closest thing I found is https://peptidenexus.com/article/sequence-scrambler but it cannot generate thousands of peptides nor fix positions), it would be much appreciated.
Thanks
Relevant answer
Answer
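A minimal Python sketch for this task (assuming the 20 standard amino acids at the free positions; adjust the template and output file name as needed):
import random

AA = "ACDEFGHIKLMNPQRSTVWY"
template = list("__A_M_SS_")          # "_" marks a random position

peptides = set()                      # a set keeps the pool non-redundant
while len(peptides) < 10000:
    pep = "".join(random.choice(AA) if c == "_" else c for c in template)
    peptides.add(pep)

with open("random_peptides.txt", "w") as fh:
    fh.write("\n".join(sorted(peptides)))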
  • asked a question related to Basic Bioinformatics
Question
3 answers
Certain software packages and sites can calculate a DNA hairpin Tm depending on the size of the loop and the stem sequence, for example Gene Runner. Yet the calculation method or citation is not provided. Is there a formula that could help?
Relevant answer
Answer
DOI: 10.1039/b804675c
This paper explains very well how unfolding and melting of DNA hairpins work. Kindly have a look.
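As a rough sketch of the usual nearest-neighbour approach (the ΔH°/ΔS° values for the stem plus a loop penalty would have to come from published nearest-neighbour tables; the numbers below are only placeholders): because hairpin folding is unimolecular, there is no strand-concentration term and Tm = ΔH°/ΔS° in kelvin.
def hairpin_tm(delta_h_kcal, delta_s_cal):
    # Tm in degrees C for a unimolecular hairpin: delta-H in kcal/mol, delta-S in cal/(mol*K)
    tm_kelvin = (delta_h_kcal * 1000.0) / delta_s_cal
    return tm_kelvin - 273.15

print(hairpin_tm(-35.0, -100.0))   # placeholder values; gives ~76.9 degrees C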
  • asked a question related to Basic Bioinformatics
Question
5 answers
Does anyone have experience with the table2asn tool of NCBI? It is used for submitting annotations of genomes using GFF files.
I am having a problem with it.
1) The windows tool I get running, but it gives me an error. The command used is the following: table2asn -M n -J -c w -t template.sbt -a r10k -l paired-ends -i FinalContigs.fsa -f 2700988623.gff -o output_file.sqn -Z output_file.dr -locus-tag-prefix xxxxx
template.sbt is a file generated by NCBI
FinalContigs.fsa is a file containing all the different contigs of the draft genome. The first contig is named Contig001, the second contig is Contig002, and so on.
2700988623.gff is the file containing the annotation. Column 1 gives the contig where each CDS and RNA is found, so Contig001, Contig002, etc.
After running the command (as told by NCBI), I get the following error:
Cannot resolve lcl|Contig002: unknown
Line: 0
So if anyone could assist me with getting the tool running in Windows, that would be very much appreciated as it is the final step of submitting my genome.
Thanks in advance!
Relevant answer
Answer
Muhammad Arslan This is like 4 years late but maybe it gets found by someone in the future and can serve as a record. I tried the tool on all three OSs and the same problem persisted. However, it is not that chmod +x does not work; it is that after it is run the terminal still has a hard time seeing the file and running it. In all three cases, this is overcome by adding the path to the file in front of its name as you run it. So, for example, instead of running
table2asn -M n -J -c w -euk -t...
You would add in the path and run
/Path/to/thefile/table2asn -M n -J -c w -euk -t....
The easiest way to know what the exact path is, is to just drag and drop the file into the terminal window and the terminal will show it.
Furthermore, to be more explicit, the exact steps to get this tool to run are:
1) Download it and unzip it. On Linux (or Ubuntu on PC or equivalent) you run "unzip NameOfFile"; you may have to install unzip first via "sudo apt-get install unzip".
2) Run chmod +x on the unzipped file. I did confirm that at this stage you can rename the file to whatever you want; for example, instead of "linux64.table2asn" you can just name it table2asn (I think the full name will be table2asn.table2asn but you can call it up with just table2asn).
3) Run it with the path, or if you know what you are doing, you can add the file to the PATH and then it will run with just "table2asn" instead of needing to include the path. However, and this was so confusing to me, even if your terminal is in the same directory that the table2asn file is in, it will still not see it and call it up, so I am not 100% sure if the "adding to PATH" route would fix the issue; I didn't try it yet. (See the consolidated sketch after this list.)
Anyway, for anyone searching the web for how to get this to work, there are the steps - a more comprehensive README, if you will. As for all the options and tools, I haven't actually run this successfully yet (I apparently need to specify locus tags), so I cannot comment about the options yet.
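A minimal consolidated sketch of those steps (archive and file names are examples; the table2asn options themselves are unchanged from the thread above and elided as "..."):
# on Linux / Ubuntu
sudo apt-get install unzip             # only if unzip is missing
unzip linux64.table2asn.zip
chmod +x linux64.table2asn
mv linux64.table2asn table2asn         # optional rename
/path/to/table2asn -M n -J -c w -t template.sbt ...   # run with the full path in front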
  • asked a question related to Basic Bioinformatics
Question
4 answers
I have two different ChIP-seq datasets for different proteins, which I have aligned to a set of DNA fragments. Some of these fragments get a zero read count in one dataset or in both. To be able to say that a fragment has much more of protein X than of protein Y, I use Student's t-test.
I wonder whether it would be better to remove the zero values from both datasets of RPKM values for each fragment. Moreover, the zeros pose a problem when I want to use a log scale during the data visualization part.
What would you suggest?
Relevant answer
Answer
Thank you so much for both your answer and suggestion David Eugene Booth
  • asked a question related to Basic Bioinformatics
Question
5 answers
I am a final-year M.Phil student and I am reading many articles to find a topic for my research. My major is microbiology (molecular) and I am interested in integrating basic bioinformatics. My supervisor is helpful in this matter, but I would like to come up with 2 or 3 topics of my own to propose, and I am having a severe mental block. A recommendation would be helpful. Thank you in advance.
Relevant answer
Answer
Hi Ashmal,
Our team has retrieved WGS data from patients in Papua, Indonesia, and found indications of mutations that lead to drug resistance. Please check this paper for more information:
We also deployed bioinformatics methods to perform docking on the mutated proteins.
  • asked a question related to Basic Bioinformatics
Question
3 answers
I have some files in bed and bedGraph format to analyze with IGV. My team and I tried to load them into IGV following the IGV site's tutorials, but it hasn't worked. The bedGraph files are large (5157) and we converted them to the binary .tdf format using the IGVTools "Count" command, but that hasn't worked either. Only with some files can we see a single flat line on the IGV screen, without any information. With FilexT we can see that the bed and bedGraph files are not damaged.
We think that the problem is the step where we select the option "Load from File" in IGV. What can we do?
We use IGV 2.10.3.
Relevant answer
Answer
Have a look at the link; it may be useful.
Regards,
Shafagat
  • asked a question related to Basic Bioinformatics
Question
4 answers
Suppose I have a peptide sequence of 400 amino acids that contains a particular domain (say, a DNA-binding domain). I am looking for software that will take the peptide sequence as input and give an option to specifically download the sequence of that domain within the peptide sequence.
Relevant answer
Answer
Md Fahmid Hossain Bhuiyan If you can't find one, I can help you build one if you need :D
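In the meantime, a minimal sketch of the extraction step (it assumes you already have the domain boundaries, e.g. from an InterProScan or NCBI CDD search, and that Biopython is installed; file names and coordinates are placeholders):
from Bio import SeqIO

domain_start, domain_end = 120, 185          # 1-based coordinates of the domain
record = next(SeqIO.parse("protein.fasta", "fasta"))
domain_seq = record.seq[domain_start - 1:domain_end]

with open("domain.fasta", "w") as fh:
    fh.write(">" + record.id + "_domain_" + str(domain_start) + "-" + str(domain_end) + "\n" + str(domain_seq) + "\n")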
  • asked a question related to Basic Bioinformatics
Question
24 answers
Hello friends, today I am raising a concern: what are real palindromic DNA sequences? Of course you will say restriction enzyme sites, but through a video available at the link http://bit.ly/palindromicDNA I am raising the issue that, in the true sense, mirror repeats are palindromic in nature as defined by standard English dictionaries. There are many unique properties of mirror-repeat DNA which I will share later. Hopefully the biological scientific community will accept mirror repeats as true English palindromes. So please check out http://bit.ly/palindromicDNA
Relevant answer
Answer
If you define palindromes as being the same whether you read them forwards or backwards, mirror repeats are not palindromes, because DNA-recognising proteins recognise the DNA double strand. The sequence of the reverse strand is implied by base complementarity:
5'-GGATCC-3' implies the sequence 5'-GGATCC-3' on the reverse strand, and therefore is palindromic when looking at the double strand, while
5'-GTGGACCAGGTG-3' would imply 5'-CACCTGGTCCAC-3' on the reverse strand, and therefore is not.
  • asked a question related to Basic Bioinformatics
Question
4 answers
Hello all,
I have a question regarding gene prediction for long metagenomic reads (MinION nanopore).
I was trying to understand the process of gene prediction. In my attempt, I classified my metagenomic sequences against a reference database by the following methods:
1. I ran Prodigal to predict ORFs using the -p meta option and then ran the DIAMOND aligner using e-value 0.002.
Result: 689 queries aligned
2. I directly used DIAMOND to align the metagenomic reads to the reference database using the same e-value (I did not give an identity parameter).
Result: 7292 queries aligned
3. I converted my DNA FASTA file to protein sequences using GOTRANSEQ and did the same analysis with the same parameters.
Result: 169 queries aligned.
There is a huge difference between methods 2 and 3. Confused...!!
Which approach is better for predicting protein-coding gene sequences from long reads?
Is the e-value a sufficient parameter for DIAMOND blastp analysis? Do I need to set an identity % in the case of the first approach?
In addition, I would also like to confirm whether I can directly use the translated file from the Prodigal analysis (the -a output) for DIAMOND.
Please help
Relevant answer
Answer
  1. In my opinion, it would be wiser to assemble the reads before doing any kind of analysis.
  2. No wonder different methods give different results. The tools you are using are fundamentally different, hence the differing results. You can go through the manual of each to see what they are basically designed for. Prodigal, on the one hand, is designed for gene prediction, so it searches for (full-length) protein-coding genes. DIAMOND, on the other hand, is a faster version of BLAST which tries to locally align two sequences irrespective of their functional characteristics or length.
Beyond this, I don't think the specific details need further discussion.
  • asked a question related to Basic Bioinformatics
Question
1 answer
I am looking for a command that will merge the 3 chains present in the original PDB into a single chain and then renumber all of the residues. I have tried using the alter command, but when I export the PDB I get only one chain (of the initial trimer) and not the merged chain.
Relevant answer
Answer
You can first renumber the chains using the alter command (https://pymolwiki.org/index.php?title=Alter&redirect=no) in such a way that each residue has a unique residue number,
e.g.
alter (chain B),resi=str(int(resi)+100)
alter (chain C),resi=str(int(resi)+200)
to give chains B and C an offset of 100 and 200, respectively.
then again use the alter command to change the chain label
e.g.
alter (all), chain='A'
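Putting it together, a minimal sketch of the full command sequence (the output file name is an example; PyMOL's sort rebuilds the internal atom ordering after the edits):
alter (chain B), resi=str(int(resi)+100)
alter (chain C), resi=str(int(resi)+200)
alter (all), chain='A'
sort
save merged_single_chain.pdb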
  • asked a question related to Basic Bioinformatics
Question
6 answers
I am getting negative branches with the neighbor-joining method, which I set to zero. However, I've read that I should transfer negative distances elsewhere and I do not know how. Does anyone have a script/method to transfer negative distances to the corresponding branches? Thanks in advance for any help.
Relevant answer
Answer
In addition to Michael B Black's comments, here is the fix-negative-edge-length function in case anyone uses the ape package:
library(data.table)
library(magrittr)   # for the %>% pipe used below

fix_negative_edge_length <- function(nj.tree) {
edge_infos <- cbind(nj.tree$edge, nj.tree$edge.length) %>% as.data.table
colnames(edge_infos) <- c('from', 'to', 'length')
nega_froms <- edge_infos[length < 0, sort(unique(from))]
for (nega_from in nega_froms) {
# set the negative branch to zero by shifting its length onto the adjacent branches
minus_length <- edge_infos[from == nega_from, ][order(length)][1, length]
edge_infos[from == nega_from, length := length - minus_length]
edge_infos[to == nega_from, length := length + minus_length]
}
nj.tree$edge.length <- edge_infos$length
nj.tree
}
  • asked a question related to Basic Bioinformatics
Question
4 answers
The GAD_Disease segment in DAVID 6.8 acquires its data from the Genetic Association Disease Database (https://geneticassociationdb.nih.gov/). The GAD database was shut down in 2014. It has been 6 years since then, yet DAVID has not abandoned the GAD database. My concern is: what is the relevance of the data now, in 2020? Is it still scientifically relevant and usable in research, or should it be ignored?
I am looking forward to your informed opinion.
Thanks and regards,
Biswajit
Relevant answer
Answer
Tommaso Mazza
Thank you so much!
  • asked a question related to Basic Bioinformatics
Question
1 answer
Dear readers,
Relevant answer
Answer
There are already hundreds of tutorials available. Follow them and learn; no one here can teach you step by step.
There are many forums and Google groups for asking questions about specific issues. However, everyone suggests searching and trying to solve the problem on your own first, instead of asking directly without trying.
  • asked a question related to Basic Bioinformatics
Question
1 answer
Hello everyone,
Could someone please provide a worked example (or point me to a good resource) of how 'lambda local' (along with lambda BG, 1k, 5k and 10k) is calculated for peak calling in MACS?
λlocal = max(λBG, [λregion, λ1k], λ5k, λ10k)
Does 'max' here mean the upper limit, i.e. whichever of λBG, λregion, λ1k, λ5k and λ10k is the highest, with the p-value then calculated using this 'max' value for λ? Or is λlocal a product/average/sum of all the other λs?
I have been unsuccessful in trying to understand it by referring to the script and the tutorial given in the link below:
Also, is the 'mfold' (10-30 fold enrichment) parameter estimated w.r.t lambda BG?
Thanks!
Relevant answer
Answer
Simple: assume you want to randomly pick a number between 1 and 100. If you do it many times (say 1000), how many times will you have picked a number between, say, 1 and 5?
(5/100)*1000 = 50
Here 5 is the interval length [1-5], 100 is the total range of numbers and 1000 is the number of trials.
By analogy with MACS, the interval length is the estimated fragment length d, 100 is the estimated effective genome size, 1000 is the total number of reads from the input sample, and 50 is λBG.
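To make the 'max' part concrete, a minimal sketch (all counts, window sizes and the fragment length d below are made-up numbers, not MACS defaults): λlocal is simply the largest of the candidate λ values, and the peak p-value is the Poisson upper tail of the observed read count under that λlocal.
from scipy.stats import poisson

d = 200                       # estimated fragment length
genome_size = 2.7e9           # effective genome size
total_control_reads = 2.0e7   # reads in the input/control sample

def window_lambda(reads_in_window, window_size):
    # expected read count in a d-sized window, scaled down from a larger window
    return reads_in_window * d / window_size

lam_bg = total_control_reads * d / genome_size
lam_1k = window_lambda(30, 1000)      # control reads observed in 1 kb around the candidate peak
lam_5k = window_lambda(120, 5000)
lam_10k = window_lambda(200, 10000)

lam_local = max(lam_bg, lam_1k, lam_5k, lam_10k)   # 'max' = take the largest candidate
observed = 45                                      # ChIP reads in the candidate peak region
p_value = poisson.sf(observed - 1, lam_local)      # P(X >= observed | lam_local)
print(lam_local, p_value)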
  • asked a question related to Basic Bioinformatics
Question
9 answers
In the FASTA output of Prokka listing the names of genes, some genes do not have any name ("gene: NA"). My question is whether these genes are hypothetical proteins or simply have no assigned gene name.
If the former is the case, how does Prokka determine them?
Relevant answer
Answer
If you mean the gene feature, then you can use the --addgenes option of Prokka. Sebawe Syaj
  • asked a question related to Basic Bioinformatics
Question
1 answer
I'm a molecular biologist, and I have a few projects coming up in transcriptome and small-RNA analysis. Can I get by without knowing any programming, using user-friendly software such as Geneious Prime or another program you can suggest, or is programming absolutely a must?
Relevant answer
Answer
Hi,
To be an efficient bioinformatician, you need to learn at least one programming language. You need not be a high-end developer, but you should at least know how to do your bit. You can certainly get comfortable with various user-friendly GUIs, but that will cost more time and resources, whereas with coding you can customize the analysis according to your needs.
  • asked a question related to Basic Bioinformatics
Question
12 answers
Hi to all,
I'm approaching to the haddock web-tool for the first time. I got the username and password for the easy interface.
I'd like to know whether I'm on the right track.
Once I've uploaded the pdb files to be docked, I have to specify both the active and the passive residues.
In order to determine the active residues I have performed an NMR titration of the unlabelled protein with the labelled ligand and vice versa. Then I've calculated the chemical shift perturbation.
Now I have to determine which among them are the active residues in the protein-ligand interaction.
So, should I submit the PDB to a SASA (solvent accessible surface area) calculation program and choose the chemical-shift-perturbed residues that the SASA program identifies as solvent accessible?
Is that correct?
Do you advise any software/webtool? (I know NACCESS, but there is a very tedious procedure that I have to follow in order to get the codes to decrypt the rar files.)
Thank you.
What should I do for the passive residues? Is the HADDOCK option that determines them automatically reliable?
Bye
Relevant answer
Answer
HADDOCK is a very good protein-protein docking server, and the new upgraded version, HADDOCK 2.4, is much more advanced. Besides, the results from the server come refined and energy-minimized by default. For determining the active and passive interacting residues, you don't even need other software, since CPORT, a server from the same developers, can give you the list of active and passive residues in the PDB files you have uploaded, and these residues can then be used for the docking analysis in HADDOCK 2.4.
Here are the links for the both these servers:
Link for HADDOCK 2.4 server:
Link for CPORT server:
Hope it helps.
  • asked a question related to Basic Bioinformatics
Question
4 answers
I actually have two queries. Biochemical studies suggest the presence of an enzyme in an organism, but the gene encoding the enzyme is not known. I would like to find gene candidates based on homology search/sequence similarity, using sequences of similar enzymes present in other organisms. My questions are:
1. What points should I consider when selecting an already known gene/protein for use in a homology search?
2. To find the orthologue of the known gene/protein by bioinformatics, which database/software should I use?
Please suggest papers/websites/software for beginners.
Thanks!
Relevant answer
Answer
1. It is recommended that you use genes/proteins whose function has been experimentally verified, drawing on a database of closely related species for which information is available.
2. You can use the NCBI database to obtain the complete sequences of the genes, or use the InterPro database to obtain the domains (recommended). Then align your sequences against one of the two databases using BLAST.
  • asked a question related to Basic Bioinformatics
Question
3 answers
Hi All, we have a bioinformatics challenge and we would love any help this community can offer. We have data from a target-enrichment experiment that was supposed to capture certain microsatellite motifs. The three enriched libraries were sequenced in a rapid run on an Illumina HiSeq 2500 (paired-end mode) and our data is in the standard Illumina fastq output. Our three libraries come from three different sources: the first library was developed from fresh fish tissue; the second one is mammal tissue; and the third one is the same mammal species but from fecal samples. For the fecal samples, we need to somehow filter out sequences belonging to the mammal only (i.e. not prey or microbiome). We have a reference genome for the mammal, but not for the fish. The data has been demultiplexed already (so for the fish we have 40 individual fish, each with its own .fq file containing all the read data). Now we are facing the challenge of how to deal with this data. Although we are familiar with most basic bioinformatic tools and analyses, we do not have advanced programming skills. We need to find a way not only to find and identify the length of our microsats within the reads, but also (for the fecal library) to somehow identify unique flanking sequences that correspond to our mammal, in such a way that the reads of other species in the fecal libraries can be excluded. Would anyone have a suggestion on what approach(es) we could use? We have already (unsuccessfully) attempted to tackle this with SSR_pipeline. Thank you in advance for any help you can offer - it is very much appreciated! Daniel & Vania
Relevant answer
Answer
Vania Carolina Fonseca da Silva You may want to consider a pangenomic approach whereby you use K-mer count sets with calculable probabilities of a faecal K-mer set originating from a fish species or group of species.
Because you have sampled enrichment you only have a probability of sampling any given SSRs in each species, and it is difficult to determine that probability for each fish species if the genome is unknown.
My suggestion is to take each set of fish reads and do a K-mer analysis, maximising the K-mer size until you obtain a distribution such that - guesstimate - 10% of K-mers have a coverage of between 10 and 100 copies.
The guestimate of 10 is because you are expecting some coverage at the sampled SSRs and less than 10 could be indicative of sequencing errors. My guestimate of 100 is because you are looking to discard the SSR repeat polymerics and are looking for K-mers likely to be unique both intra and inter fish species. Hopefully for each fish species you will end up with many thousands of identified K-mers accepted within the copy number constraints and of maximal size up around 50% of your read lengths. You can’t use K-mer sizes up around read lengths simply because most K-mers will be discoverable through partial read overlaps, not full length overlaps.
Repeat iteratively over all fish species, adjusting the K-mer size until some minimum number of K-mers per species is reached, and build a matrix containing columns: fish species, K-mer sequence, count of that K-mer present in faecal sample reads – initialise count to 0.
With your faecal sample reads use a sliding window of the K-mer size you built the final accepted matrix with, and slide this window along each read.
For each K-mer from the sliding window check if that K-mer is present in the mammal genome – if so then discard and try next window.
If the K-mer is not in the mammal genome, then search the matrix for a matching K-mer sequence, and for every fish that matches, increment that fish's faecal K-mer match count.
When all K-mers in the faecal sample reads have been processed, you can process the counts in the matrix and assign probabilities for the mammal's consumption of individual fish species (or groups of fish).
I suggest that you discard counts of less than some threshold – say 10 - as these could represent sequencing errors.
The foregoing approach will need much optimisation to become practical, but it could be applied even when the prey consumer's genome has yet to be assembled, provided you have reads from the consumer species: K-mers from the consumer reads are added to the matrix, and if their count exceeds some threshold, that K-mer is discarded from further analysis.
I am interested in your research and may be able to provide some bioinformatics expertise ( https://github.com/kit4b ), do you have further details of your experiment you can share?
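As a minimal sketch of the matching step described above (K, the thresholds and the file names are illustrative only; for real data an efficient K-mer counter such as KMC or Jellyfish would be preferable):
from collections import defaultdict

K = 31

def kmers(seq, k=K):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def read_fasta_seqs(path):
    seqs, current = [], []
    for line in open(path):
        if line.startswith(">"):
            if current:
                seqs.append("".join(current))
                current = []
        else:
            current.append(line.strip().upper())
    if current:
        seqs.append("".join(current))
    return seqs

# 1) candidate diagnostic K-mer sets per fish (copy-number filtering not shown)
fish_kmers = {"fish_A": set(), "fish_B": set()}
for name in fish_kmers:
    for read in read_fasta_seqs(name + ".fasta"):
        fish_kmers[name].update(kmers(read))

# 2) K-mers present in the mammal reference are excluded from matching
mammal = set()
for contig in read_fasta_seqs("mammal_genome.fasta"):
    mammal.update(kmers(contig))

# 3) slide a K-sized window along the faecal reads and count matches per fish
counts = defaultdict(int)
for read in read_fasta_seqs("faecal_reads.fasta"):
    for km in kmers(read):
        if km in mammal:
            continue
        for name, kset in fish_kmers.items():
            if km in kset:
                counts[name] += 1
print(dict(counts))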
  • asked a question related to Basic Bioinformatics
Question
15 answers
Metagenome sequencing data analysis, Bioinformatics
Relevant answer
Answer
1. 16S rRNA gene amplicon sequencing is primarily aimed at taxonomic profiling of the metagenomic community.
2. Function cannot be directly predicted from 16S profiling. However, there are software tools that can indirectly give indications of the functional profile. But this kind of functional prediction is just a vague overview that cannot be used to conclude anything with confidence.
  • asked a question related to Basic Bioinformatics
Question
2 answers
Hi!
I've been using the refine.bio website to download normalized transcriptome data; each downloaded dataset consists of a compressed directory with an expression matrix in .tsv format, its metadata also in .tsv format, and an aggregated metadata file in .json format.
I'm trying to associate the expression matrix with its metadata using the R programming language, but I don't know how to do it, and I can't find the way in the site's documentation. I only know that I need to read these files with these commands:
> library(rjson)
>
> expression_df <- read.delim('SRP068114/SRP068114.tsv', header = TRUE,
> row.names = 1, stringsAsFactors = FALSE)
> metadata_list <- fromJSON(file = 'aggregated_metadata.json')
but I have no idea how to merge them for generating a full-informative matrix.
Can someone help me, please?
Thank you so much.
Relevant answer
Answer
First you would need to flatten your JSON file:
library(jsonlite)
metadata <- fromJSON("aggregated_metadata.json", flatten = TRUE)
View(metadata)
# After that, read your expression table (refine.bio provides it as tab-separated, not csv):
expression <- read.delim("SRP068114/SRP068114.tsv", header = TRUE, row.names = 1)
# It is usually easiest to merge against the per-sample metadata .tsv in the same
# directory (file and column names below are examples; use the ones in your download,
# matching the sample-accession column to the column names of the expression matrix):
sample_meta <- read.delim("SRP068114/metadata_SRP068114.tsv", header = TRUE)
Mergedataset <- merge(data.frame(sample = colnames(expression)), sample_meta,
                      by.x = "sample", by.y = "ColumnName")
head(Mergedataset)
  • asked a question related to Basic Bioinformatics
Question
4 answers
Does anyone have advice on how to mine data from the TCGA database?
I am interested in RNA-seq data. I have a gene of interest to analyze, but I cannot seem to locate where to indicate the gene I am interested in. Please, some help?
Relevant answer
Answer
Hi
You can go to the FIREHOSE website, on which TCGA data are released. The FIREHOSE website is clearer and easier to navigate.
With regards
  • asked a question related to Basic Bioinformatics
Question
3 answers
I am looking for databases that contain microRNA-drug interactions. Any suggestions or recommendations?
Relevant answer
Answer
Dear Ali Akbar Jamali ,
Have a look at the link; it may be useful.
Regards,
Shafagat
  • asked a question related to Basic Bioinformatics
Question
9 answers
Where can we get the best web course, starting from basic Python programming and the use of kernels, through to analyzing data with machine-learning algorithms?
I think most biologists don't need to know much about how to write programs or design kernels, but they should be able to write the essential code to analyze data with already available programs.
I found some online courses. One is pythonforbiologists.com, which appears instantly when searching with the relevant keywords, but I was not able to find detailed reviews.
Another is Python for Genomic Data Science on Coursera; the course content looks good, but the reviews say that the later material (Weeks 3 and 4) lacks uses/applications, which makes it virtually impossible to finish the course.
Suggestions for an efficient web course on Python for biologists are appreciated...
Relevant answer
Answer
  • asked a question related to Basic Bioinformatics
Question
6 answers
I just need to explain my data about a filarial nematode and need proper tools for analyzing, plotting and tabulating my data. There are so many software packages and tools on the internet, but so many problems when working with them. Any suggestions for tools for analyzing, plotting and tabulating genomic data of eukaryotic genomes?
Relevant answer
Abhijeet Singh I'm using Windows and also Linux computers.
  • asked a question related to Basic Bioinformatics
Question
5 answers
I'm trying to extract prophages from bacterial genomes. The principal issue I have been having is that I have multiple hits distributed along the bacterial genomes and I don't know how to extract them together in order to get the complete prophage genome. I know that there is a BLAST outfmt that can give me subject-start and subject-end, but I want to know if there is an alternative that can give me only the initial and final positions.
In the example, only two strains are shown; they contain several hits for one bacteriophage. I want to retrieve the initial and final positions of those hits to extract the prophage from the databases.
Thanks.
Relevant answer
Answer
You don't need to blast, there are several software available for this task
  • Sousa et al, 2018;PhageWeb – Web Interface for Rapid Identification and Characterization of Prophages in Bacterial Genomes;Frontiers in Genetics;DOI=10.3389/fgene.2018.00644
  • Song et al, 2019; Prophage Hunter: an integrative hunting tool for active prophages, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W74–W80, https://doi.org/10.1093/nar/gkz380
  • Starikova et al, 2019; Phigaro: high throughput prophage sequence annotation; bioRxiv preprint ; doi: http://dx.doi.org/10.1101/598243.
  • Akhter et al, 2012; PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies; Nucleic Acids Res. 2012 Sep; 40(16): e126.
  • Reis-Cunha et al, 2017; ProphET, Prophage Estimation Tool: a standalone prophage sequence prediction tool with self-updating reference database. bioRxiv preprint; doi: http://dx.doi.org/10.1101/176750.
Also, check
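If you do want to stay with your BLAST hits, here is a minimal sketch (it assumes tabular output, -outfmt 6, and the file name is a placeholder) that collapses all hits against each subject sequence into a single overall start-end span:
from collections import defaultdict

spans = defaultdict(lambda: [float("inf"), 0])
with open("phage_vs_genomes.blast.tsv") as fh:
    for line in fh:
        f = line.rstrip("\n").split("\t")
        subject = f[1]
        s_start, s_end = sorted((int(f[8]), int(f[9])))   # minus-strand hits have start > end
        spans[subject][0] = min(spans[subject][0], s_start)
        spans[subject][1] = max(spans[subject][1], s_end)

for subject, (start, end) in spans.items():
    print(subject + "\t" + str(start) + "\t" + str(end))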
  • asked a question related to Basic Bioinformatics
Question
2 answers
Dear friends,
I need the exact explanation of these terms:
-cut-off value
-lineage
-syntax
-proxy
-k-mer
Also, please tell me the differences between aligned and unaligned sequences in databases. Which should be used in molecular analysis?
Best regards,
Relevant answer
Answer
Better to ask your supervisor or a colleague. Those terms are all easily definable with Google or a dictionary; however, the EXACT context will depend entirely upon the systems and sciences you are describing and working with.
  • asked a question related to Basic Bioinformatics
Question
15 answers
Hi everyone,
I am currently trying to align several thousand (5-10 thousand) protein sequences to a related reference protein sequence to determine the percent identity of this large set of sequences to the reference.
Does anyone know of a program that I can use to align large sets of protein sequences to a reference sequence? I know that Geneious is capable of performing reference alignments, but it is almost prohibitively expensive, and I am capable of using command-line tools.
If anyone knows of any open-source tools that could help me perform these tasks I would greatly appreciate the suggestions. 
Thank you in advance!
Relevant answer
Answer
This is old but not yet correctly answered. I am surprised how one can answer without even reading the question.
He asked about REFERENCE ALIGNMENT. Multiple sequence alignment is exactly not reference-based.
AVID is capable of doing reference alignment.
Bray, N., Dubchak, I. and Pachter, L. (2003) AVID: A Global Alignment Program. Genome Res., 13:97.
Please note that the VISTA suite implements AVID as an optional alignment tool.
Best.
  • asked a question related to Basic Bioinformatics
Question
9 answers
Basically for bioinformatics analysis
Perl scripting
Relevant answer
Answer
If you are looking into analysis of RNA-seq data. I recommend this one: https://www.amazon.com/RNA-seq-Data-Analysis-Mathematical-Computational/dp/1466595000
  • asked a question related to Basic Bioinformatics
Question
6 answers
Hi everyone. Let's suppose that we have an index which summarizes an event in a real-life situation. Let's suppose then that we want to compare the same index in two patients, and one index is way bigger than the other (crude ratio around 200). If we take the ratio of the inverse log10 of the index (1/log10(index)) of the two patients, we obtain a 10-fold ratio, which might be easier to "handle" and to understand in a publication. Would it be better to run the statistical analysis with the crude ratio, the ratio of inverse log10 values, or both? Which other transformation would you suggest for presenting and working with such data? Thank you in advance for your time and suggestions.
Relevant answer
Answer
It's hard to say anything concrete without knowing the subject matter, the scientific background and the aim of the statistical analysis.
In my experience, many variables from biological systems have an approximately log-normal distribution. Notably, ratios of log-normal variables again have a log-normal distribution. Thus, it often makes sense to analyze changes or differences in geometric means, or "log fold-changes". This is done easily with the standard models (mean values, t-test, ANOVA, linear regression etc.), which are all based on the assumption of normally distributed data, when the logarithms of the raw data are used in the analysis.
What usually makes no sense is to use ratios of logarithms. This would mean that you analyze some rational root of the raw values, and it is hard to explain what that represents. Differences of logarithms, in contrast, are log fold-changes - something that is well understandable.
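A tiny R illustration of that last point (the two values are arbitrary):
a <- 2000; b <- 10
log10(a) - log10(b)   # 2.301 = log10(a/b): a is ~200-fold larger, a well-defined log fold-change
log10(a) / log10(b)   # 3.301: the exponent of a rational root of the raw values, hard to interpret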
  • asked a question related to Basic Bioinformatics
Question
16 answers
A newbie question:
I have a mammalian protein for which I would like to know "its history". Is there any tool/platform I could use to check for the presence of this protein's orthologs in different taxa/kingdoms (e.g., plants and animals)? Most platforms don't say "There is no such protein in the sequenced genomes belonging to this kingdom" and instead offer some BLAST alternative. What would be the best way to conduct such a search?
Any suggestion is welcome! Thank you!
Relevant answer
Answer
This is also very good; it also shows the species in which protein homologs are absent.
And regarding my question on the specific protein, the analysis has already been done previously.
  • asked a question related to Basic Bioinformatics
Question
3 answers
Hello everyone
I'm working on a haplotype case-control association study. For the data analysis I employed two online analysis tools.
The first is SNPStats, which gave me extremely large odds ratio values, as shown in the attached snapshot.
The second is http://analysis.bio-x.cn/myAnalysis.php, which gave me empty values (---), as shown in the attached snapshot.
Please, how can I explain these odds ratio values?
Thanks for any help.
Best regards
Relevant answer
Answer
It looks like the second method is more reliable. When the haplotype frequency equals zero in cases and/or controls, an empty value appears (and the odds ratio estimate becomes extreme or undefined).
  • asked a question related to Basic Bioinformatics
Question
9 answers
I don't know anything about bioinformatics, but I'm going to do RNA-seq and ChIP-seq analysis in R later this year and I would like to know if a MacBook Pro (2018) with an i5 processor, 256 GB SSD and 8 GB RAM would be enough for basic bioinformatics. My model organism has a 12 Mbp genome and I am unsure whether I need the 16 GB RAM and 512 GB SSD model, which is way more expensive.
Relevant answer
Answer
8 GB of RAM is too low if you are going for RNA-seq analysis. Even if the genome or transcriptome of your sample is small, the files generated during quality checking and alignment require a lot of computational power and disk space. So a minimum of 32 GB of RAM would be more suitable, so as to get the analyses done in a feasible amount of time.
  • asked a question related to Basic Bioinformatics
Question
5 answers
The fly protein is 671 aa while the human protein is 620 aa. What software can help me predict the domain conservation between the orthologs? Multiple alignment or pairwise alignment?
Relevant answer
Answer
Good question. Please share the best answer you trust...
Regards
  • asked a question related to Basic Bioinformatics
Question
3 answers
I want to build a fusion/chimera protein structure which includes 3 different protein domains belonging to three individual proteins, and then evaluate the quality of the fusion protein's 3D structure. The protein domains are linked to each other with 15-aa linkers. I want to use UCSF Chimera, Modeller or any other suitable software for this purpose. How can I do that?
Relevant answer
Answer
You can use Chimera to make a peptide bond between two polypeptides.
First select the atoms involved in the peptide bond:
Favorites > Command Line
#0:165.A@C#1:200.A@N
(here I select the C of residue 165 in the first protein and the N of residue 200 of the second protein)
Tools > Structure Editing > Build Structure
After that you need to use a force field (e.g. the LINK patch in VMD) to introduce this peptide bond; otherwise, after generating the PSF from the PDB, the bond will disappear.
  • asked a question related to Basic Bioinformatics
Question
3 answers
I am an undergraduate neuroscience and bioinformatics research assistant and my personal project has been to explore the gut microbiota of an EAE mouse model. There are three treatment groups: untreated control, Complete Freund's Adjuvant only, and MOG+CFA. Samples were taken from 6 control, 6 CFA, and 5 CFA+MOG mice over 5 time points (1 before and 4 after starting the EAE experiment).
I have since been focusing on analyzing a network created from the abundance data. Counts were normalized with DESeq2 and split into each treatment group. These count matrices were used to calculate a Spearman's rank correlation coefficient matrix for each group. This technique was applied to permutation testing with 100 randomized matrices generated. From the original matrix, all coefficients that fell below a 5% significance level in their corresponding distribution were set to 0 and correlations above were set to 1. In addition, any coefficients below 0.5 in magnitude were set to 0. This binary matrix was used to create unweighted network for each group.
I have since been focused on using over-representation testing on various features in the networks. Of key interest is how the networks are divided into communities/modules. I am using the fast greedy algorithm from igraph due to time/computing constraints but that could change based on suggestions.
Currently, I have been testing whether a certain taxon-taxon interaction is over-represented in one module versus the rest of the network. I am using Fisher's exact test, where the 2x2 table is split into within the module vs. outside the module and the given taxon-taxon interaction vs. every other interaction. The counts correspond to the edges in the network that fulfil the criteria of each cell.
The data I get back is a matrix with one taxon-taxon interaction per row and one module per column. The values are p-values from the hypothesis tests. There are 3 such matrices, one for each network/treatment group. I also have a matrix of the counts for each interaction/edge in each module.
My question is how can I better use the results of this data to derive biological insight? I have looked into dividing up bacteria into functional classes and potentially machine learning applications, but there are no standout programs that I know of that could readily take this data. The goals of this project are to better understand the structural changes in the gut microbiota during EAE and possibly to discover specific features like keystone taxa or co-occurrence groups that are gained or lost in the MOG+CFA group. Of particular interest are any OTUs related to the Lactobacillus/Bacilli lineage.
Relevant answer
Answer
Hi, in my experience, the Louvain approach performs slightly better than 'fast greedy' for identifying clusters in large networks. Re the taxon-taxon interactions you might check how the 'edge betweenness centrality' can help (also implemented in igraph).
Cheers, Steffen
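For reference, a minimal igraph sketch of both suggestions (it assumes g is the unweighted, undirected network you built from the binary matrix):
library(igraph)

communities <- cluster_louvain(g)            # Louvain modules instead of cluster_fast_greedy(g)
membership(communities)                      # module assignment per taxon

eb <- edge_betweenness(g)                    # centrality of each taxon-taxon edge
E(g)[order(eb, decreasing = TRUE)[1:10]]     # the ten most "bridging" interactions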
  • asked a question related to Basic Bioinformatics
Question
3 answers
Relevant answer
Answer
The time-limiting step in creating a meaningful synthetic gene is not the gene synthesis; it is the process of learning to understand and design the sequences, including the appropriate control elements needed for properly regulated expression. Of course, faster and cheaper methods of synthesizing precise DNA sequences will facilitate the process. However, at 98 to 99% accuracy the error rate is still far too large: these genes will all have to be clonally amplified, sequenced, and any sequence errors corrected to produce a gene that is several hundred to a thousand nucleotides in length. Being able to use polymerases for template-free DNA synthesis is a technological advance which may eventually make this particular step faster and cheaper, but it is not a game-changer.
  • asked a question related to Basic Bioinformatics
Question
3 answers
Hi, this is quite a basic question I feel, but I'm not exactly sure of what the best practice would be. I am currently using IGV for an interactive view, but I would rather have this automated on the command line with any bespoke elements put in place by myself. What tools (command line, not interactive) would I need for the following:
I have indexed BAM data for specific reads along with their corresponding VCF file. I would like to go all the way back to my reference genome, so I can take each read with an identified SNP and pull out the sequence from the reference genome plus an extra N bp (e.g., 50 bp) on either side of where the read aligns to the reference. In short, I want to know what exists to the left and to the right of my aligned read. Could this be done using the #CHROM and POS data in the VCF?
Relevant answer
Answer
Once you are acquainted with the columns and headers of VCF files, you can manipulate such files in MS Excel as well.
If you are a non-programmer, then I would suggest opening your VCF files in MS Excel or a similar programme. Identify the SNPs for which you need the flanking region, create start and end coordinates for them by subtracting and adding 50, and save this data as a separate BED file. Feed this BED file to the getfasta module of bedtools (http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html) and you will get the flanking nucleotide sequences.
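A minimal sketch of that step done programmatically (file names are examples; BED coordinates are 0-based, half-open):
flank = 50
with open("variants.vcf") as vcf, open("flanks.bed", "w") as bed:
    for line in vcf:
        if line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        start = max(0, int(pos) - 1 - flank)
        end = int(pos) + flank
        bed.write(chrom + "\t" + str(start) + "\t" + str(end) + "\n")

# then, on the command line:
#   bedtools getfasta -fi reference.fa -bed flanks.bed -fo flanks.fa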
  • asked a question related to Basic Bioinformatics
Question
5 answers
I need to identify targets of a number of miRNAs, but I need suggestions on the best target prediction tool, or a combination of tools, mainly for humans. Thanks
Relevant answer
Answer
You may like to use mirDIP (http://ophid.utoronto.ca/mirDIP/), which is the tool developed in our lab. In its recent version mirDIP integrates miRNA-target computational predictions obtained across 30 original resources. It stores nearly 152 million predictions, while covering 2,586 human mature miRNAs and 34,010 human genes. If you have any question regarding its use, please feel free to contact me.
  • asked a question related to Basic Bioinformatics
Question
5 answers
Hello,
I am interested in obtaining the set of all human proteins with DNA-binding ability in FASTA format in one step. Downloading them one by one is very tedious for me :-D
Do you know any easy (user-friendly) way to do it?
Thank you in advance
Relevant answer
Answer
You could use [BioMart](http://www.ensembl.org/biomart/martview/cc95f207ac5db842cfb81975536d9f3a). [Help video here.](https://www.youtube.com/watch?v=QvGT2G0-hYA) Filter by a [GO term associated with DNA binding](http://amigo.geneontology.org/amigo/search/ontology?q=dna%20binding), then get the sequences as attributes.
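A minimal biomaRt sketch of the same idea in R (GO:0003677 is the "DNA binding" term; if the "go" filter name differs in your Ensembl release, check listFilters(mart)):
library(biomaRt)

mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
ids <- getBM(attributes = "ensembl_gene_id",
             filters = "go", values = "GO:0003677", mart = mart)

seqs <- getSequence(id = ids$ensembl_gene_id, type = "ensembl_gene_id",
                    seqType = "peptide", mart = mart)
exportFASTA(seqs, file = "human_dna_binding_proteins.fasta")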
  • asked a question related to Basic Bioinformatics
Question
13 answers
I'm currently trying to get the 3D structures of a set of peptides (ranging from 12 to 20 amino acids). Subsequently we want to perform docking analysis against an enzyme.
Which software do you use for that? How do you refine the structure?
Thanks!
Relevant answer
Answer
Such short peptides usually do not assume a single defined structure in solution. The structure of the peptide in the complex is not necessarily the dominant conformation in the ensemble of structures in solution, but induced by the interaction with the binding partner. As a consequence, rigid docking of peptides usually is not possible.
  • asked a question related to Basic Bioinformatics
Question
6 answers
I am trying to submit sequences of the COX-1 gene (a mitochondrial gene) to GenBank. My sequences are almost 100% identical to a published one. GenBank returned a message saying that they contain stop codons and that this prevents their publication. Now, I checked similar sequences from GenBank and they also contain stop codons. If this is a problem, how were those published?
Could anyone help explain why this is and how I can deal with this problem?
Thanks
Relevant answer
Answer
In addition to Dr Kandasamy's reply, I want to add that the observed stop codon might be due to incorrect selection of the reading frame. You are required to provide the correct reading frame in the gene annotation. The other possibility is that your sequence has a stop codon in the correct frame, which the GenBank staff want you to confirm. You need to recheck your sequence electropherograms and confirm. Please go through the GenBank email carefully and do the needful.
best wishes,
SD
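A minimal Biopython sketch for checking the forward reading frames for internal stop codons (the file name is a placeholder; set table to the genetic code appropriate for your organism, e.g. 5 for invertebrate mitochondrial):
from Bio import SeqIO

record = next(SeqIO.parse("cox1.fasta", "fasta"))
for frame in range(3):
    sub = record.seq[frame:]
    sub = sub[:len(sub) - len(sub) % 3]          # trim to a whole number of codons
    protein = sub.translate(table=5)
    print("frame +" + str(frame + 1) + ":", protein.count("*"), "stop codon(s)")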
  • asked a question related to Basic Bioinformatics
Question
4 answers
Hi,
I understand that from TCGA data I can easily plot a survival curve related to a gene (for example, gene ABCD) for a certain type of cancer.
However, I would like to go a little deeper, as in to see if gene ABCD affects the survival outcome of a group of colorectal cancer patients that were treated with a certain drug, say oxaliplatin.
I want to see if overexpression of ABCD led to reduced survival in oxaliplatin-treated colorectal cancer patients, or if the converse is true.
Are there any data sets like that out there?
Thank you all.
Relevant answer
Answer
Hello Stephen,
You could try dbGAP (https://www.ncbi.nlm.nih.gov/gap). dbGAP is a repository of genomic and clinical data available for researchers. Some datasets require a data request which is a one-paragraph summary of your proposed research. Depending on the dataset and available variables you could try running survival analyses. Hopefully this helps!
Best,
Andric
  • asked a question related to Basic Bioinformatics
Question
3 answers
Hi everyone! I would like to perform an energy minimization analysis to study the effect of phosphorylation on protein stability in terms of Gibbs free energy. Additionally, I would like to compare this stability with mutant versions of my protein lacking this phosphorylation site.
I have seen this test in a recently published paper using the Discovery Studio software, but I am not capable of reproducing it. I attach the graph of this test as well as the reference to the article.
Relevant answer
Answer
Hi Alejandro, you are right, there are proteins that remain phosphorylated.
I had also forgotten about casein which is highly phosphorylated.
  • asked a question related to Basic Bioinformatics
Question
21 answers
What are some good introductory courses available online.
Paid is fine as it would be good to have a certificate issued at the end of it.
Relevant answer
Answer
Hi,
I would like to recommend the Coursera app on mobile: just install it and search for "Bioinformatics" and you will find many courses from notable universities around the world.
  • asked a question related to Basic Bioinformatics
Question
16 answers
- I mapped my whole genome against the reference strain and found 100 bp gaps between all the contigs. Is this something to be worried about? How can I overcome this shortcoming in my sequence?
Relevant answer
Answer
Hello,
my impression is that these gaps of 100 bp (100 "N") were introduced artificially and do not say anything about the actual size of the gap. Some people insert 50 "N", others 100 "N". If it is always the same number, it comes for sure from the assembly. The consequence is that the sequence before these "N" and the one after them correspond to individual contigs and are not necessarily next to each other in the genome. That said, PCR won't help, unless you want to go for all combinations. Moreover, since you do not know the size and complexity of the gaps, you may not even be able to PCR-amplify across a gap (e.g. if the gap is 10000 bp in reality).
Well, if I were you, I would submit the genome sequence as it is, but I would first remove all these stretches of "N" and generate separate contigs.
BTW, could you please tell us how many of these N100 stretches you have?
Best wishes,
Ralf
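A minimal sketch of that splitting step (file names are examples; Biopython assumed; here every run of 10 or more N's is treated as an artificial gap):
import re
from Bio import SeqIO

with open("contigs_split.fasta", "w") as out:
    for rec in SeqIO.parse("scaffolds.fasta", "fasta"):
        pieces = [p for p in re.split(r"N{10,}", str(rec.seq).upper()) if p]
        for i, piece in enumerate(pieces, start=1):
            out.write(">" + rec.id + "_part" + str(i) + " length=" + str(len(piece)) + "\n" + piece + "\n")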
  • asked a question related to Basic Bioinformatics
Question
6 answers
Until now, I was relying on KOBAS ( KEGG Orthology Based Annotation System), which was quite user-friendly and provided p-values for distinct pathways. However, it seems to be no longer available. Does anyone know a comparable online tool for analysis of both microarray and RNA-seq data?
Relevant answer
Answer
Dear Tatyana Pismenyuk,
You can use various tools to perform functional pathway enrichment on both microarray and RNA-seq data; for example:
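One scripted option, if you work in Python, is the gseapy wrapper around the Enrichr web service; the short sketch below is only an illustration (gseapy must be installed, and the gene list and library name are placeholders):
# Enrichment of a differentially expressed gene list against a KEGG library
import gseapy as gp

genes = ["TP53", "EGFR", "MYC", "BRCA1"]          # replace with your own gene list
res = gp.enrichr(gene_list=genes,
                 gene_sets=["KEGG_2021_Human"],
                 outdir=None)                      # outdir=None keeps results in memory
print(res.results[["Term", "Adjusted P-value"]].head())
The results table includes an adjusted p-value per pathway, comparable to what KOBAS reported.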
Best regards,
Leite
  • asked a question related to Basic Bioinformatics
Question
10 answers
Colleagues, I need help with Venn diagrams and transcriptomics. I have three lists of IDs (example: c58516_g4_i4), only the IDs, not the sequences. I need to make a Venn diagram to know which IDs are shared among all three lists, which are shared between only two of them, and which are present only in their original list. I could do it manually, but it is a huge number of IDs. Can you suggest some software for Windows or a script for Linux? Thanks!
Relevant answer
Answer
You can try the tool called "Venny". Cheers!
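If you prefer a script, plain Python sets are enough to get both the counts and the actual shared IDs (a minimal sketch; the file names are placeholders, one ID per line):
# Overlaps among three ID lists using set operations
def read_ids(path):
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}

a, b, c = read_ids("list1.txt"), read_ids("list2.txt"), read_ids("list3.txt")

print("shared by all three:", len(a & b & c))
print("only A and B:", len((a & b) - c))
print("only A:", len(a - b - c))

# Optional drawing, if the matplotlib-venn package is installed:
# from matplotlib_venn import venn3
# venn3([a, b, c], set_labels=("A", "B", "C"))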
  • asked a question related to Basic Bioinformatics
Question
3 answers
This might be a basic question, but I am wondering how one approaches this task. My aim is to compare specific genes from the published genomes of the closest sequenced species/genera to my organism.
I was surprised at first and asked whether I should build a database similar to BLAST's, but I was told that by "database" they simply expect me to download FASTA files from GenBank. Is that really all there is to it?
Thanks!
Relevant answer
Answer
I recommend the KEGG database (http://www.genome.jp/kegg/). There you will find the complete genomes available to date, listed in phylogenetic order: http://www.genome.jp/kegg/catalog/org_list.html
If you do not have an active subscription it can be a little complicated to download the genomes, but nothing is impossible with a script (although paying for the subscription is highly recommended).
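If you end up going the GenBank route instead, a minimal sketch of scripted downloading plus local BLAST database building could look like this (Biopython and the BLAST+ command-line tools are assumed to be installed; the accession list is a placeholder):
# Fetch FASTA records from GenBank and build a local nucleotide BLAST database
import subprocess
from Bio import Entrez

Entrez.email = "you@example.org"            # NCBI asks for a contact address
accessions = ["NC_000913.3"]                # replace with your neighbouring genomes/genes

with open("neighbours.fasta", "w") as out:
    handle = Entrez.efetch(db="nucleotide", id=",".join(accessions),
                           rettype="fasta", retmode="text")
    out.write(handle.read())
    handle.close()

subprocess.run(["makeblastdb", "-in", "neighbours.fasta",
                "-dbtype", "nucl", "-out", "neighbours_db"], check=True)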
  • asked a question related to Basic Bioinformatics
Question
8 answers
Hi,
I am currently working on making a variants database.
I generate separate VCF files for each sample and merge them with VCFtools (using the VCF merge command) to make a multi-sample VCF file. Then I calculate the minor allele frequency for each variant in the multi-sample VCF file using VCFtools.
My question is: what do you recommend as the best way to proceed? Do you recommend any other pipeline? Please also correct me if I am wrong in choosing the pipeline mentioned above.
Thank you.
Relevant answer
Answer
The issue is non-trivial, but you can start with whatever database works for you. Still, some types of queries may not be possible to run due to performance limits.
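As one possible cross-check of the VCFtools output, here is a minimal sketch that computes a per-site minor allele frequency directly from the merged VCF (assuming pysam is installed, biallelic sites, and an illustrative file name):
# Per-site minor allele frequency from a multi-sample VCF
from pysam import VariantFile

vcf = VariantFile("merged.vcf.gz")
for rec in vcf:
    alleles = [a for s in rec.samples.values()
                 for a in s["GT"] if a is not None]   # flatten called genotypes
    if not alleles:
        continue
    alt_freq = sum(1 for a in alleles if a != 0) / len(alleles)
    maf = min(alt_freq, 1 - alt_freq)
    print(rec.chrom, rec.pos, round(maf, 4))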
  • asked a question related to Basic Bioinformatics
Question
2 answers
I have a short nucleotide sequence which has 6 ambiguity characters within it. Is there a way of taking this sequence with the ambiguity characters and producing a list of all the possible combinations of the sequence?
Relevant answer
Answer
If using Python is an option for you, follow the advice here: https://stackoverflow.com/questions/27551921/how-to-extend-ambiguous-dna-sequence
Alternatively, write your own script. For example in R, one can use the amb() function from the seqinr package or IUPAC_CODE_MAP from the Biostrings package.
There is a possibility to do this partly or fully online using webtools:
1) Generate a regular expression from your IUPAC string here: https://www.ecseq.com/support/ngs/convert-iupac-nucleotide-sequence-to-regex-search
2) Submit this regular expression to a program that unfolds it, such as regldg (there is an online version https://regldg.com/tryit.php you must set max length to be more than the length of your string, e.g. 999)
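For completeness, a minimal Python sketch of the expansion itself, assuming Biopython is installed (the input sequence is just an example):
# Expand an IUPAC-ambiguous DNA sequence into all concrete sequences
from itertools import product
from Bio.Data.IUPACData import ambiguous_dna_values

seq = "ATGRYSATN"                            # example sequence with ambiguity codes
expanded = ["".join(p) for p in
            product(*(ambiguous_dna_values[base] for base in seq.upper()))]
print(len(expanded))                         # number of possible combinations
print(expanded[:5])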
  • asked a question related to Basic Bioinformatics
Question
11 answers
In a cancer research project I have a lot of data. When I train a model with all features (including redundant ones) and later use test data on this model, the accuracy is pretty good. I know that relevant features are important. Can anyone explain why feature selection is needed when we already get pretty good results with all features?
Relevant answer
Answer
Hi Shawon,
Having all features (especially the redundant ones) would not help much when building the predictive model, as no special knowledge can be extracted out of the data with such features. A high level of accuracy does not mean that the model is excellent or that it will generalize well to future data. Performing feature selection before invoking a classifier on the selected feature subset is a useful approach, especially with high-dimensional data (feature selection and classification have to be done together, under the evaluation of, e.g., k-fold cross-validation).
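To make the cross-validation point concrete, here is a minimal scikit-learn sketch in which the selection step sits inside the pipeline and is therefore refit on every training fold (the data shapes and parameter values are illustrative only):
# Feature selection nested inside cross-validation via a pipeline
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 5000)            # e.g. 100 samples x 5000 features
y = np.random.randint(0, 2, 100)         # binary class labels

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),   # keep the 50 most informative features
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
scores = cross_val_score(pipe, X, y, cv=5)      # selection is repeated per fold
print(scores.mean())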
HTH.
Samer, PhD.
  • asked a question related to Basic Bioinformatics
Question
1 answer
I have MiSeq sequences from ice samples and I would like to run the analysis on them. Nevertheless, the two pipelines I'm familiar with (Mothur and Parallel-META) are not the best for ITS1 analysis. I have looked on Google for a tutorial (Mothur is absolutely fantastic for this), but I can't find anything similar for Skata.
Help anyone?
Thanks,
Mario
Relevant answer
Answer
Hi Mario,
I recommend reading this SOP from the LangilleLab:
This should get you started to analyse your ITS data.
Cheers,
  • asked a question related to Basic Bioinformatics
Question
8 answers
I have a DNA sequence in Notepad (txt) format. I want to create a chromatogram for it. Which software should I use? I have tried Chroma, BioEdit and MEGA7, but with no luck. Please guide me.
Relevant answer
Answer
You will be able to save a .txt file as FASTA (or several other formats), but I am not aware of any software that will create ABI files from a text file. I also don't see any reason for doing that, since for analysis you need the four letter bases, not the colored peaks. ABI files are binary files generated by the ABI sequencer (software) and are generally referred to as chromatograms, electropherograms or DNA trace files; they can be viewed graphically with an ABI viewer. They are pieces of raw data, shown as colored peaks with the associated bases (A, C, G, T or the ambiguity code "N"), and are the source of the DNA sequence. The bases are editable, but the colored lines (the peaks of the chromatogram) are not.
  • asked a question related to Basic Bioinformatics
Question
6 answers
What software is available to show how a SNP can change the shape or conformation of DNA or RNA?
Kindly suggest some easy-to-use software that helps to simulate the effect of genetic variants in coding or non-coding regions.
Relevant answer
Answer
Hi,
I would suggest searching online for mutation effect prediction tools (such as MutationTaster) or an official database (e.g. the Exome Variant Server).
I hope this could be helpful.
Cheers,
Giulia
  • asked a question related to Basic Bioinformatics
Question
5 answers
Could anyone with experience in metabolomics data analysis advise me on which tests I should perform on my data?
Relevant answer
Answer
It could also be worthwhile to look at supervised learning, e.g. random forests or support vector machines, to discern the features that best discriminate the categories. Another option would be network analysis.
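As a rough illustration of the random-forest option, a minimal sketch follows (the CSV layout with one row per sample and a "group" column is an assumption, and pandas/scikit-learn must be installed):
# Rank metabolite features by random-forest importance
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("metabolites.csv")        # samples x metabolites plus a "group" column
X = df.drop(columns=["group"])
y = df["group"]

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(10))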
  • asked a question related to Basic Bioinformatics
Question
1 answer
I am trying to identify the point of insertion in the genome that may have caused the size difference in fruits between wild-type and mutant of a plant species (also mentioned here : https://www.researchgate.net/post/Identifying_insertion_of_transposable_element_in_a_large_plant_genome_any_recommendations_on_my_situation).
Generally, the procedures I used to narrow down the mutated region include short-read whole-genome sequencing -> k-mer frequency analysis -> contig assembly -> mapping of reads, aligning contigs against databases, etc. In short, my intermediate goal is to extract only the reads containing the mutated region, assemble them, and assess the result by read-mapping before going to the wet lab (e.g. PCR).
Actually, I am frustrated with the long procedures I have applied to the sequencing data without having any control to ensure I did not overdo something (e.g. setting the selection criteria too stringently). Besides, I am overwhelmed by the available algorithms (e.g. assemblers), whose outputs differ significantly and are hard to compare without any control or gold standard. It is painful to screen candidates based on theoretical calculations (there are always exceptions) or sometimes even intuition (e.g. how much room I should leave in order not to miss any exceptions). On the other hand, if I screen very conservatively, tons of candidate sequences remain.
Regarding the situation, I have the following questions :
1) What are the principles of controlling bioinformatics procedures?
2) What are the principles of balancing between the efficiency of screening and the risk of losing a real candidate?
3) Any other suggestions to effectively screen for the true mutation (including wet-lab)?
I am considering whether I should simply accept the situation given the quality and quantity of data on hand; in any case, I want to see how far I can get by optimizing the analysis method after receiving advice on these questions.
Relevant answer
Answer
I doubt anyone can give you more insight than what you have already thought of, so I refer you to the most often used equation in biology:
n = x/y
where n = number of samples you can do, x = money you have, y = price per sample
  • asked a question related to Basic Bioinformatics
Question
17 answers
Why do internal stop codons appear during the submission of COX-I or cyt b partial gene sequences to NCBI, and how can we get rid of them? This happens even when the nucleotide sequences are the same as those of closely related species already submitted to GenBank. How can we submit sequences if such error conditions persist?
Relevant answer
Answer
If you have partial sequences for protein coding genes, it is likely that they do not start at the first codon position. If that is the case, and you count the triplets from the very first nucleotide, it will lead to a frameshift in the codons, and stop codons are likely to occur. One way of obtaining the right frame is to compare your partial sequence to a fully sequenced gene in a closely related species and see if you can determine the position of the first nucleotide in your partial sequence. Once that is determined, you can then obtain the right coding frame in your sequence and you should then be able to avoid getting any premature (internal) stop codons.
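A minimal sketch of how you could check the three forward frames yourself, assuming Biopython is installed (the fragment and the translation table are placeholders; pick the table that matches your organism):
# Translate a partial coding sequence in all three forward frames
from Bio.Seq import Seq

partial = Seq("TGGCAGGTATTGGCAACAGTATT")       # illustrative partial fragment
for frame in range(3):
    sub = partial[frame:]
    sub = sub[: len(sub) - len(sub) % 3]       # trim to a whole number of codons
    protein = sub.translate(table=5)           # table 5 = invertebrate mitochondrial
    status = "internal stop" if "*" in str(protein) else "OK"
    print(f"frame +{frame + 1}: {protein} ({status})")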
  • asked a question related to Basic Bioinformatics
Question
8 answers
What are the important factors to consider while performing homology modeling?
Please suggest the best tools.
Relevant answer
Answer
As stated above, if you have a template with high similarity it might work. The higher the similarity, the better the chance of creating/designing an accurate model. If the similarity is low you can try I-TASSER, but you will have a VERY hard time convincing anyone that your model is reliable (particularly manuscript reviewers).
In both cases there is still a need for validation, which in reality can only be done by experimental methods.
  • asked a question related to Basic Bioinformatics
Question
3 answers
Hello all,
I am a hardcore experimental biologist. In particular, I am well-versed in molecular and cell biology techniques. However, the areas of Bioinformatics and Computational Biology have revolutionised the field of Biology.
Can anyone suggest basic hands-on trainings/courses of an international standard where beginners can get good exposure to the following?
1. Bioinformatics
2. Systems Biology
3. Synthetic Biology
Relevant answer
Answer
Hi,
Please check the following links. There are wonderful free courses, ranging from absolute beginner to advanced level.
I would suggest getting started with learning to program in Python, this is fast becoming the go-to scripting language for bioinformaticians and systems biologists worldwide.
Good luck,
Ratnadeep
  • asked a question related to Basic Bioinformatics
Question
3 answers
Hi all
I have BLAST results in tabular format comprising about 0.5 million entries with the columns "qseqid; sseqid; qlen; slen; qstart; qend; sstart; send; evalue; length; pident; qcovs; bitscore; staxid; sscinames; scomnames; sskingdoms; stitle".
I want to analyse these data at the family, order, class, genus and species level in Excel. How can I import the full taxonomic lineage for each taxon ID?
Is there any Perl or Python script, or another tool, that can do this analysis or fetch the data from the NCBI taxonomy database or another database?
Also, if my data contain more than 1 million reads, which tool should I use for simple calculations such as abundance (Excel only supports 1,048,576 rows in a single sheet)?
Relevant answer
Answer
Hi,
You can download the MEGAN tool; it can directly take your BLAST results and plot various figures.
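If you also want a scripted route into Excel-friendly tables, a minimal sketch using pandas and ete3 could look like the following (both packages are assumptions on my part; ete3 downloads the NCBI taxonomy on first use, and the file name is a placeholder):
# Attach family/order/class/genus/species to each BLAST hit via its taxid
import pandas as pd
from ete3 import NCBITaxa

cols = ["qseqid", "sseqid", "qlen", "slen", "qstart", "qend", "sstart", "send",
        "evalue", "length", "pident", "qcovs", "bitscore", "staxid",
        "sscinames", "scomnames", "sskingdoms", "stitle"]
hits = pd.read_csv("blast_results.tsv", sep="\t", names=cols)

ncbi = NCBITaxa()

def lineage_ranks(taxid, wanted=("family", "order", "class", "genus", "species")):
    taxid = int(str(taxid).split(";")[0])          # take the first taxid if several are listed
    lineage = ncbi.get_lineage(taxid)
    ranks = ncbi.get_rank(lineage)
    names = ncbi.get_taxid_translator(lineage)
    return {rank: names[t] for t, rank in ranks.items() if rank in wanted}

hits["taxonomy"] = hits["staxid"].map(lineage_ranks)
hits.to_csv("blast_with_taxonomy.csv", index=False)   # openable in Excel if small enough
For tables that exceed Excel's row limit, abundance counts can be computed with a pandas groupby before export.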
Cheers 
Shab
  • asked a question related to Basic Bioinformatics
Question
2 answers
Dear All,
I have written a few Tcl scripts for VMD, which can also be used as VMD commands in the Tk console. I hope these will be helpful for VMD users.
Please have a look at them and let me know your valuable suggestions.
Thanks in advance.
K .anji babu
Relevant answer
Answer
Hi Mohammed Iftekhar,
Thanks for your invitation. Could you please give details of biocuratiossl.com?
Thanks in advance.
Anji babu
  • asked a question related to Basic Bioinformatics
Question
4 answers
I want to analyze several genes for a set of bacteria simultaneously. Is it possible to analyze that set of genes all at once and draw their phylogenetic tree, I mean by putting all the sequences of each species in a row? Which software has this ability?
Relevant answer
  • asked a question related to Basic Bioinformatics
Question
4 answers
Hello
I am working on a new project to study the function of a transcription activator.
I was wondering how I could test the DNA binding site via RStudio or any other software you would suggest. I would like to align yeast DNA from the negatively numbered positions up to the TATA box and then see where the transcription activator might bind; if there is no binding site, I would go ahead and work on a possible mutation in the protein so that it binds the yeast DNA and starts transcription.
I need suggestions about what type of software I could use to do the DNA alignment and to see how the binding could occur.
I have worked in RStudio a little bit, but I am not sure if it is the right road to follow.
Also, I don't know how to align a DNA sequence that is numbered from negative to positive positions in RStudio. I would like to hear suggestions for designing my code and project.
I am looking forward to hearing from you soon.
Thanks for your time.
Relevant answer
Answer
Why don't you perform an unbiased analysis by predicting which transcription factors are likely to bind to your sequence of interest without aligning the sequence to known TATA boxes? Here are a few suggestions:
1. Perform a de novo transcription factor binding site (TFBS) analysis on your sequence of interest using "Meme". http://meme-suite.org/. There are various tools available from the aforementioned URL, which will help you identify which transcription factors are likely to bind to your sequence.
2. You could then perform another TFBS analysis using the "Matrix Scan" tool from RSAT: http://www.rsat.eu/. For this analysis, you will need to provide position frequency matrices (PFMs) of TFs of interest, which you can obtain from the JASPAR database: http://jaspar.genereg.net/
3. You could also use software from one of my papers to analyse non-coding regions in terms of evolutionary conservation and TFBSs. You just need to plug in your genomic sequence of interest and then click a button to get the results. http://dreive.cryst.bbk.ac.uk/
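As a small illustration of the matrix-scanning idea in point 2, a minimal Biopython sketch (the JASPAR matrix file and the promoter sequence are placeholders):
# Scan a sequence with a JASPAR position matrix
from Bio import motifs
from Bio.Seq import Seq

with open("MA0001.1.jaspar") as handle:        # a PFM downloaded from JASPAR
    motif = motifs.read(handle, "jaspar")

promoter = Seq("TTGACTATAAATAGGCATTGACT")      # your upstream region of interest
pssm = motif.pssm                              # log-odds matrix derived from the counts
for position, score in pssm.search(promoter, threshold=3.0):
    strand = "+" if position >= 0 else "-"
    print(f"hit at {position} ({strand} strand), score {score:.2f}")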
Hope this helps!
Mohsin
  • asked a question related to Basic Bioinformatics
Question
1 answer
I have a VCF file with a bunch of SNPs, and I want to transform this file into a distance-based matrix (where HomoRef = 0, Heterozygous = 1, HomoAlternate = 2, for example). How can I do this?
Relevant answer
Answer
This can easily be done with Python.
Read the VCF file in line by line, exchange the values, and store them in a dataframe. If needed, the dataframe can be converted to a NumPy array, which is more matrix-like.
Find attached some code (non-optimized).
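The attachment is not reproduced here, but a minimal sketch along those lines could look like the following (the file name is a placeholder; multi-allelic sites are not handled and missing genotypes become empty cells):
# Recode VCF genotypes into a samples x variants 0/1/2 matrix
import pandas as pd

rows, variant_ids, samples = [], [], None
with open("variants.vcf") as vcf:
    for line in vcf:
        if line.startswith("##"):
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]                       # sample names from the header line
            continue
        variant_ids.append(f"{fields[0]}:{fields[1]}")
        genotypes = []
        for call in fields[9:]:
            gt = call.split(":")[0].replace("|", "/")
            genotypes.append({"0/0": 0, "0/1": 1, "1/0": 1, "1/1": 2}.get(gt))
        rows.append(genotypes)

matrix = pd.DataFrame(rows, index=variant_ids, columns=samples).T   # samples x variants
matrix.to_csv("genotype_matrix.csv")
print(matrix.iloc[:5, :5])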
Best,
David