Computational Genomics - Science topic

Questions related to Computational Genomics
  • asked a question related to Computational Genomics
Question
3 answers
I have the whole genome of a bacterium. Do you know which program can detect a viral genome within the bacterial genome?
Is it enough to annotate the genome (example: prokka) and then look manually for viral genes/proteins? Or to check the assembly (example: prokka) and BLAST the shorter contigs?
Thank you in advance
Relevant answer
Answer
Use PHASTER, an online tool for detecting phage sequences in your genome. It will give you files with the specific regions; you can then BLAST those phage sequences at NCBI and do further analysis.
  • asked a question related to Computational Genomics
Question
5 answers
Does anyone have experience with the table2asn tool of NCBI? It is used for submitting annotations of genomes using GFF files.
I am having a problem with it.
1) I get the Windows tool running, but it gives me an error. The command used is the following: table2asn -M n -J -c w -t template.sbt -a r10k -l paired-ends -i FinalContigs.fsa -f 2700988623.gff -o output_file.sqn -Z output_file.dr -locus-tag-prefix xxxxx
template.sbt is a file generated by NCBI
FinalContigs.fsa is a file containing all the different contigs of the draft genome. The first contig is named Contig001, the second contig is Contig002, and so on.
2700988623.gff is the file containing the annotation. Column 1 gives the contigs in which the CDS and RNA features are found, so Contig001, Contig002, etc.
After running the command (as told by NCBI), I get the following error:
Cannot resolve lcl|Contig002: unknown
Line: 0
So if anyone could assist me with getting the tool running in Windows, that would be very much appreciated as it is the final step of submitting my genome.
Thanks in advance!
Relevant answer
Answer
Muhammad Arslan This is like 4 years late but maybe it gets found by someone in the future and can serve as a record. I tried the tool on all three OSs and the same problem persisted. However, it is not that chmod +x does not work, it is that after it is run it is really hard for the terminal to see the file and run it. In all three cases, this is overcome by adding the path to the file in front of its name as you run it. So, for example, instead of running
table2asn -M n -J -c w -euk -t...
You would add in the path and run
/Path/to/thefile/table2asn -M n -J -c w -euk -t....
The easiest way to find the exact path is to drag and drop the file into the terminal window; the terminal will then show it.
Furthermore, to be more explicit, the exact steps to get this tool to run are:
1) Download it and unzip it. On Linux (or Ubuntu on a PC, or equivalent) you run "unzip NameOfFile"; you may have to install unzip first via "sudo apt-get install unzip"
2) Run chmod +x on the unzipped file. I did confirm that at this stage you can rename the file to whatever you want; for example, instead of "linux64.table2asn" you can just name it table2asn (I think the full name will be table2asn.table2asn, but you can call it up with just table2asn)
3) Run it with the path or, if you know what you are doing, add the file to the path so that it runs with just "table2asn" instead of needing the path. However, and this was so confusing to me, even if your terminal is in the same directory as the table2asn file, it will still not see it and call it up, so I am not 100% sure whether the "adding to path" route would fix the issue; I haven't tried it yet
Anyways, for anyone searching the web for how to get this to work, there's the steps. A more comprehensive README if you will. As for all the options and tools, I haven't actually run this successfully yet (need to specify locus-tags apparently) so I cannot comment about the options yet
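For the "Cannot resolve lcl|Contig002" error in the question, one possible cause worth ruling out (an assumption, not a confirmed diagnosis) is a mismatch between the sequence IDs in column 1 of the GFF and the IDs in the FASTA headers. A minimal Python sketch of that check follows; the function names are made up for illustration and the check is independent of table2asn itself.

```python
def fasta_ids(fasta_text):
    """Return the set of sequence IDs (first word after '>') in FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff_seqids(gff_text):
    """Return the set of values found in column 1 of GFF text."""
    ids = set()
    for line in gff_text.splitlines():
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def unmatched_seqids(fasta_text, gff_text):
    """GFF sequence IDs that have no FASTA record with exactly the same ID."""
    return sorted(gff_seqids(gff_text) - fasta_ids(fasta_text))
```

To use it, read FinalContigs.fsa and the .gff file into strings and pass them to unmatched_seqids; the comparison is case-sensitive, so "contig002" and "Contig002" count as different IDs.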
  • asked a question related to Computational Genomics
Question
7 answers
I would like to draw an haplotype network in Geneious directly, without using any other software. Is there a way to do that? Any plug-in that could help?
Thank you!
Relevant answer
Answer
As far as I know, you can find uncorrected p-distances in Geneious once you have your alignment.
  • asked a question related to Computational Genomics
Question
2 answers
I have to analyze PacBio transcriptome data with four bin sizes: 1-2 Kb, 2-3 Kb, 3-6 Kb and 5-10 Kb. I understand that the HSP length cut-off and HSP-Hit coverage settings must be customized for each bin when running BLAST, but what should the criteria be? Any good literature or suggestions on this topic will be of great help.
Relevant answer
Answer
1.5 or 2
  • asked a question related to Computational Genomics
Question
7 answers
I used WebMGA to cluster my NGS data (COG). I have a problem analyzing the data provided in output.zip since the file format is unknown. Do I need specific software to open each of those files?
Relevant answer
Hi!
I wonder if someone knows how reliable WebMGA is. I would like to know your opinion
  • asked a question related to Computational Genomics
Question
3 answers
Genomic data privacy is an essential thing while sharing the genomic data to the public. How can the privacy of genomic data be protected? Which anonymization models are useless for preserving the privacy of genomic data? Which model is suggested for preserving the privacy?
Relevant answer
Answer
I suppose a case could be argued both for and against anonymisation, not only in its value to the individual, but also in its actual efficiency in protecting individual privacy. At the end of the day, our DNA information itself is a FAR more precise identifier than any other data that is classed as an "identifier". Hence, the need for genomic privacy is more acute than ever...
  • asked a question related to Computational Genomics
Question
14 answers
Hello,
I have this sample that was sequenced as two different libraries.
I do not have access to the BAM files, only PLINK files.
Each PLINK file has that specific sample as its single individual.
Each PLINK file has a different number of SNPs, therefore some might be common to both files and others not.
Can I merge the two PLINK files in a way that I get all SNPs, choosing at random from which file to keep the duplicated ones?
E.g.
SampleA1 - 400000 SNPs
SampleA2 - 150000 SNPs
As some SNPs will exist in only one file and not in the other, I wish to merge the files and keep all of them, discarding, for example, the duplicated ones from the second sample.
Thanks!
EDIT: As mentioned above, I really needed to keep single alleles in all variants, hence the need to discard one of two, which would in cases lead to some biallelic variants. This is because my data was/is pseudo-haploid. If this is irrelevant for you, you can use PLINK's (b)merge instead.
Relevant answer
Answer
May you share your script with me, Daniel M Fernandes?
Thank you very much!
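The script mentioned in this thread is not shown, so the following is only a minimal Python sketch of the merge described in the question. It assumes the pseudo-haploid calls from each PLINK file have already been loaded into a dictionary mapping SNP ID to a single allele; reading and writing PLINK files is not covered, and the function name is hypothetical.

```python
import random


def merge_pseudohaploid(calls_a, calls_b, seed=None):
    """Merge two {snp_id: allele} dicts for the same individual.

    SNPs present in only one input are kept as they are. For SNPs present
    in both, one of the two calls is picked at random, so every merged
    variant still carries a single allele.
    """
    rng = random.Random(seed)
    merged = {}
    for snp in sorted(set(calls_a) | set(calls_b)):
        if snp in calls_a and snp in calls_b:
            merged[snp] = rng.choice([calls_a[snp], calls_b[snp]])
        elif snp in calls_a:
            merged[snp] = calls_a[snp]
        else:
            merged[snp] = calls_b[snp]
    return merged
```

Passing a seed makes the random choice reproducible, which may help when the merge has to be repeated later.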
  • asked a question related to Computational Genomics
Question
6 answers
Hi, 
I am trying to create a genome browser software (online, free to use) and I have come across the following difficulty: BRCA1 for example has 30 transcripts, each containing different exons. The question now is which of these transcripts is the relevant one for Breast cancer? It would be wrong to detect a mutation and communicate it as disease causing in one exon that does not actually show up in the final relevant protein. Any idea how I can figure out which one is the right transcript to focus on?
I need a general rule, not one specific to BRCA1, as I need to apply it to all 25,000 human genes. Any help would be welcome.
My best approach so far: Find amino acid sequence of relevant protein and then reverse calculate which transcript it came from. Is there a better way?
Relevant answer
Answer
I know this is an old post, but for new people with the same question you may find the below information from Ensembl useful:
If you just want the most biologically relevant transcript
If a human gene has a MANE Select transcript, this is the one you should use. This is an agreement between NCBI and Ensembl as to the clinically most relevant transcript, and the transcript structure matches perfectly between the two databases.
If a gene does not have a MANE Select transcript, APPRIS provides a similar evaluation of the most biologically relevant transcript, which they call the Principal Isoform. The most likely candidate will be the P1, but they can score down to P5.
If those are not available, you may wish to consider the quality of annotation. A CCDS identifier indicates that the coding region of the transcript is matched between NCBI and Ensembl, while a gold transcript indicates matching annotation by the two different methods of annotation in Ensembl. The TSL gives a score for the amount of data that supports the existence of a transcript.
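The priority order above (MANE Select, then APPRIS principal isoform, then CCDS, then TSL) can be sketched as a ranking function. This is only an illustration: the dictionary keys are hypothetical and do not correspond to a real Ensembl API, and it assumes that a lower TSL number means better support.

```python
def pick_transcript(transcripts):
    """Return the ID of the preferred transcript from a list of dicts.

    Hypothetical keys: 'id', 'mane_select' (bool), 'appris' ('P1'..'P5'
    or None), 'ccds' (bool), 'tsl' (1-5 or None).
    """
    def rank(t):
        appris = t.get("appris") or ""
        # 'P1' ranks before 'P2', and so on; a missing label ranks last.
        if appris.startswith("P") and appris[1:].isdigit():
            appris_rank = int(appris[1:])
        else:
            appris_rank = 9
        return (
            0 if t.get("mane_select") else 1,
            appris_rank,
            0 if t.get("ccds") else 1,
            t.get("tsl") or 9,  # assumption: lower TSL = better support
        )

    return min(transcripts, key=rank)["id"]
```

Because the rank is a tuple, a later criterion is only consulted when all earlier ones tie, which mirrors the "if not available, then..." wording of the answer.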
  • asked a question related to Computational Genomics
Question
4 answers
Hi everyone, I would like to ask how to analyse raw dru sequences of Staphylococcus species for dru typing. I have read several journal articles; they suggest using the TRST sequence analysis software from BioNumerics and say that manual dru repeat analysis can be a challenge. So does dru typing really require the BioNumerics software, and can it not be done manually?
Thank you
Relevant answer
Answer
Does anyone have any idea what the BioNumerics software costs? I saw on one website they mentioned something like $7,000 at least?!
  • asked a question related to Computational Genomics
Question
4 answers
Hello,
I am working with mouse RNA-seq data aligned to GENCODE's M8 and am trying to create a custom trackhub to visualize my data. I had done this before on RNA-seq of human cell lines without any problems, but in the process of trying to use the bedGraphToBigWig program and obtaining the mm10 chromosome sizes, I keep getting the error:
GL456210.1 is not found in chromosome sizes file
My bedgraphs are from sorted bam files, and I used the following command:
bedGraphToBigWig filename.bedGraph mm10.chrom.sizes filename.bw
Browsing the UCSC forums, I also found that interrogating the unscaffolded reads through:
mysql -hgenome-mysql.cse.ucsc.edu -ugenomep -ppassword -A \
> -e "select chrom,frag from gold;" mm10
led to my not seeing GL456210.1 correspond to anything.
Any thoughts on how to solve this issue? Your suggestions on how to get around this would be much appreciated! Thanks!
Relevant answer
Answer
The chrom.sizes file format does not match the bedGraph file.
For in-house analysed data:
samtools faidx genome_reference_hg38.fa   # human genome reference used to map reads
cut -f1,2 genome_reference_hg38.fa.fai > hg38Chrom.sizes
If the bed files were downloaded from publicly available databases:
1.) Check the chrom.sizes file format (i.e. UCSC, Ensembl or GENCODE).
2.) Then "cat" your bed or bedGraph file to confirm how the sequence names are written, i.e. chr1 or 1, chr_GL456210.1 or just GL456210.1.
3.) Adapt the chrom.sizes file according to the bed file, e.g. the UCSC chrom.sizes file contains chr<number>_GL456210v1_random in place of GL456210.1.
4.) Adapt it to match your bed files, e.g. remove the chr prefix and chromosome number.
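Before editing either file, it can help to list exactly which sequence names in the bedGraph have no entry in the chrom.sizes file. A minimal Python sketch, with hypothetical function names and both files passed in as text:

```python
def chrom_sizes_names(chrom_sizes_text):
    """Return the set of sequence names (column 1) in chrom.sizes text."""
    names = set()
    for line in chrom_sizes_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def missing_chroms(bedgraph_text, chrom_sizes_text):
    """Sequence names used in the bedGraph but absent from chrom.sizes."""
    known = chrom_sizes_names(chrom_sizes_text)
    missing = set()
    for line in bedgraph_text.splitlines():
        # Skip blank lines and track/browser/comment header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom = line.split()[0]
        if chrom not in known:
            missing.add(chrom)
    return sorted(missing)
```

The returned list shows whether the mismatch is limited to unplaced scaffolds such as GL456210.1 or also affects the main chromosomes (for example "1" versus "chr1").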
  • asked a question related to Computational Genomics
Question
6 answers
The search for shortest unique substrings (shustrings) of the genome is an important problem, for in some sense it is the mix of these shustrings that defines the phenotype of an organism. A related question concerns the set of shustrings that *is not* found within any genome, for the presence of these may correspond to specific detrimental consequence in organisms. So, my question addresses both the issue of any prior review (there has been) and any discussion of justification for the observance of non-representation of these shustrings within organisms. The shorter the length of such sequences, the more important would be any corresponding justification, I should think.
Relevant answer
Answer
Dr. Konopka:
Ultimately, nullomers are short sequences of nucleotides that are not represented within a DNA molecule. For a long sequence drawn on a small alphabet, such as DNA, there are many, many short sub-sequences that are not represented; I will forward to you in a subsequent message an analysis of the shustrings and nullomers of the RAND million random digits.
I am especially interested in particularly short nullomers: why are they not conserved in the genetic record? Please consider this graph, derived from shustring analysis of the Vibrio cholerae genome.
wrb
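As a concrete illustration of the nullomer definition above (short sequences not represented within a DNA molecule), here is a minimal Python sketch that lists every k-mer over A, C, G, T absent from a sequence. It scans only the strand given and is meant for small k, since the number of candidates grows as 4^k; the function name is an assumption for illustration.

```python
from itertools import product


def nullomers(sequence, k):
    """Return all length-k strings over ACGT that do not occur in sequence."""
    seq = sequence.upper()
    # Collect every k-mer that is present, using a sliding window.
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    candidates = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in candidates if kmer not in present]
```

For a whole genome, the reverse-complement strand would normally be scanned as well before calling a k-mer absent.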
  • asked a question related to Computational Genomics
Question
4 answers
To correct the phenomenon of mistagging in Illumina, what is the best threshold to filter the low abundance OTUs of each sample?
Relevant answer
Answer
Anny Camelo I did not quite get what you wanted to ask. If you wanted to do that on data with already assigned OTUs, unfortunately you can't. You can remove the low-abundance OTUs, but I can't say whether they would be there because of biology or because of index mis-assignment. It can, however, be addressed at an earlier stage, during library preparation, by using dual indexes.
  • asked a question related to Computational Genomics
Question
5 answers
I have 100s of bacterial genome sequences that I have to use to construct an ortholog table by bidirectional all-vs-all BLAST+ search among the protein products. I have already performed a blastall search, but do not know how to pull-out the best hit sequences and arrange them as orthologs (which will also include paralogs). Please help.
Relevant answer
Answer
You can also refer to OrthoDB 9.1 (https://www.orthodb.org/) for unique ortholog genes.
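For the question's own approach (pulling best hits out of a bidirectional BLAST search), a common technique is reciprocal best hits. The Python sketch below assumes the standard 12-column tabular BLAST output with the bit score in the last column, and one result file per direction for a pair of genomes; the function names are hypothetical.

```python
def best_hits(blast_tab_text):
    """Map each query ID to its highest-bit-score subject ID."""
    best = {}
    for line in blast_tab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b_text, b_vs_a_text):
    """Return sorted (a, b) pairs where a's best hit is b and b's is a."""
    a_to_b = best_hits(a_vs_b_text)
    b_to_a = best_hits(b_vs_a_text)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)
```

Reciprocal best hits give at most one partner per protein, so paralogs are not captured; including them, as the question asks, would need a clustering step on top of this.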
  • asked a question related to Computational Genomics
Question
5 answers
Hi,
I work on reptile genome and I have a problem with variant calling using samtools. I'm interested in vcf file with SNPs and indels in my bam file across genomes of multiple species. 
I run the command from manual and it gives  "[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000" 
Does anyone know the solution for this?
Many Thanks.
Relevant answer
Answer
Hello, I actually had this same issue. The samtools manual states:
"At a position, read maximally INT reads per input file. Note that samtools has a minimum value of 8000/n where n is the number of input files given to mpileup. This means the default is highly likely to be increased. Once above the cross-sample minimum of 8000 the -d parameter will have an effect. [250]"
However, I am still unsure and am wondering if I should omit -d altogether for mpileup. Were you able to get this sorted?
  • asked a question related to Computational Genomics
Question
3 answers
I have 9 VCF files for individuals. I need some publications or tools to learn more about:
- Disease risk calculations.
- Disease identification.
- Drug response.
- Antibiotic resistance.
- Dealing with novel SNPs
- Fitness and Nutrition genes
Also, if a sample contains a gene related to a specific disease, does this mean that the person suffers from this disease, or that the person may become a patient in the future? (How can I know whether a gene is active or not?)
Waiting for your replies with publications, tools and recommendations
Thanks
elsayed
Relevant answer
Answer
Hi,
From your question, I understand that you might not be very comfortable with Linux and command-line tools. I suggest you visit wANNOVAR (http://wannovar.wglab.org/). Make sure you provide the same genome build and gene definition as used for the NGS analysis. Another good option is the Variant Effect Predictor (https://asia.ensembl.org/info/docs/tools/vep/index.html). Both are very comprehensive variant annotation tools.
I hope this suggestion helps you with your analysis.
  • asked a question related to Computational Genomics
Question
5 answers
During a literature survey I found the DarkHorse HGT Candidate Resource, which can predict horizontal gene transfer (HGT) in bacterial and archaeal genomes. I want to know whether my candidate proteins have been transferred via HGT from either prokaryotes or eukaryotes. Therefore, I am looking for similar online tools to predict horizontal gene transfer in plants.
Relevant answer
Answer
You can use Alienness:
Publication available here:
It works for HGT or any origin to any destination.
If you need help, with this tool, do not hesitate to ask.
  • asked a question related to Computational Genomics
Question
4 answers
Hello, could someone suggest a tool to get information (gene localization, DNA methylation, CG composition and others) from GEO data? Thanks
Relevant answer
Answer
Hi annalisa
bed files are just tabulated files, you can also open them with excel if you want....
if I can help, just tell me more.
fred
  • asked a question related to Computational Genomics
Question
2 answers
I ran the PopSizeABC tool and got a problem with the "abc" function of the abc R package. The error came from "nnet.default": x and y must match. I compared my input data with the example data, and I think they are in a similar format.
Hope anyone could help me, please.
Thanks in advance.
Relevant answer
Answer
Thanks Scott... I'll try
  • asked a question related to Computational Genomics
Question
8 answers
Hi everyone,
I've sequenced (Illumina; Nextera; 2x250) different strains of Salmonella (genome size around 5 Mb) and I want to compare the genomes to check whether they are related or not: some came from animals, the others from patients.
I cleaned my sequences with Trimmomatic before doing the assembly (I used MEGAHIT, although now I know it's not the best one). After that I tried different analyses:
-REALPHY (contigs and reference genome as input files) to obtain the phylogenetic tree.
-bwa to map my contigs to the reference, but I didn't know how to get a single sequence. I checked it with QUAST.
-With Mauve I got my contigs ordered according to my reference, but not a consensus sequence for each sample, and I could not interpret the results very well.
I am quite new to bioinformatics, so I'd appreciate some advice.
1/If I use two different assemblers (Velvet and SPAdes), how can I merge both?
2/Is it better in this case to do de novo assembly or just mapping to a reference (I have high coverage)?
3/Once I have the contigs, which software should I use to obtain scaffolds? And to obtain a single sequence I can use in further analysis?
4/Which software would you recommend for phylogeny? And for clustering? Do you think this would be the best way to answer my problem?
I'd really appreciate any suggestion.
Thank you very much!
Andrea
Relevant answer
Answer
Thank you very much for all the advice, Mat. It has been really useful!!!
Best regards,
Andrea
  • asked a question related to Computational Genomics
Question
2 answers
Hello,
I have sequenced an organism using NGS technology. After assembly, I found it has only 6-22% homology with other related GenBank entries. I performed an MSA with all of the related organisms' genomes, but it is time-consuming and the resulting phylogenetic tree is hard to interpret. I would like to know the right process to follow. The organism is a phage, and its family has been identified by electron microscopy.
Regards
Relevant answer
Answer
Just to complement Brian's answer a little, I would say that, even for amino acid sequences, 6-20% identity is an extremely low value for homologues, so please specify which gene or protein sequence you were referring to with that range. If you were referring to a whole-genome alignment, then the sequences were definitely not well aligned. I think you might want to look for a phylogenetic marker gene for your virus family instead. Also, please remember that when one deals with sequences one can only speak of identities or similarities. Homology (or lack of it) can then be inferred from there.
  • asked a question related to Computational Genomics
Question
3 answers
Hi, could you recommend any prediction software for the DNA-binding sites of my protein?
I have a protein with a structure predicted using PSIPRED (PDB file) and a FASTA file with the genome. I would like to predict the binding sites of my protein in this genome.
Thank you.
Relevant answer
Answer
don't know if this fits your search, but it takes in consideration both spatial and sequence features
  • asked a question related to Computational Genomics
Question
10 answers
I have a large set of data for which I already have the DEGs, and I also have them clustered. I need to get the functions enriched in each cluster of genes. I was able to use the DAVID tool for just a few of the clusters, but DAVID complains that the other clusters contain more than 3000 genes and returns no result. Which other tool(s) can help? Note that I am dealing with human data.
Relevant answer
Answer
Hi, there is a large set of tools for functional enrichment analysis.
EnrichR: web tool with many databases (GO, KEGG, etc.)
GeneCodis: web tool
GOrilla
You can also perform an enrichment analysis using programs such as GSEA (in Java, with a graphical interface). Easy to use and flexible, this program does not apply the conventional statistical methods for enrichment (such as Fisher's or the hypergeometric test), and you can get interesting results.
You can also use the R programming language, with packages such as topGO and GOstats, among others.
I hope this helps.
best regards
  • asked a question related to Computational Genomics
Question
5 answers
I'm doing a project in which my group needs to know what a DNA sequence represents. The problem is that we are new to bioinformatics and do not know which database we should use for BLAST. We tried non-redundant protein and SwissProt, and they give different answers. Non-redundant shows a protein with 0 E-value and 100% identity, while SwissProt shows a protein with 4e-29 E-value and 23% identity. Which one should I use?
Relevant answer
Answer
Why not? as long as it shows a high identity percentage solely with such a record.
  • asked a question related to Computational Genomics
Question
5 answers
I am running BLAST search of a target gene in Staph. aureus to find interspecies gene similarity. The BLAST results (top 100) show the target gene in different strains of Staph. aureus. I downloaded the FASTA with all search results (>1000 sequence matches) but manually editing the FASTA is a cumbersome job. The phylogeny tree with this FASTA is a mess. Is there any software available which can cluster the sequences from a single organism into just one branch (no further branching for each strain as they have >99% sequence similarity) of the phylogeny tree so I can compare the similarity of target gene between Staph. aureus and other organisms.
Relevant answer
Answer
True, Joan. I did not know CD-HIT can be used to cluster nucleotide sequences too, which is why I did not recommend that tool (I have used it to cluster protein sequences only). However, I've checked the associated website and it seems that the project isn't active anymore. Also, some problems in the CD-HIT algorithm have been reported (actually in the very same USEARCH article). But the problem with USEARCH is that only the 32-bit version is free, and that limits the number of sequences you can cluster. For Raman's question there is no problem, since he is going to cluster at most hundreds of sequences, but when you have millions, the 32-bit version of USEARCH cannot be used. There is, however, a project called VSEARCH (https://github.com/torognes/vsearch) which aims to be a free and unrestricted version of USEARCH.
Just saying this in case you might want to check it, Joan.. 
Best regards.
  • asked a question related to Computational Genomics
Question
3 answers
Hi Friends,
While running my RNA-Seq data (TopHat/HISAT2 aligned output, name-sorted with samtools) through HTSeq2 for counting DEGs, I get warning messages like “Mate records missing for 5069 records; Paired end read :HWI-D00486:7:1116:11181:7055” and “Mate pairing was ambiguous for 61656 records;” after the SAM alignment record pairs are processed. Could anyone kindly provide a solution for these warnings?
Relevant answer
Answer
Easiest solution is to not use paired-end reads. This is not necessary for pure counts-based RNAseq (as opposed to differential splicing etc.)
  • asked a question related to Computational Genomics
Question
3 answers
I have sequenced partial mitochondrial genome of Philautus sp. and have generated a consensus representing the reconstructed genome from the reads. There are many single nucleotide variants present when compared to a reference genome. How can I visualize these SNVs without a reference genome as genome browsers like IGV require reference genomes? 
Relevant answer
Answer
Hi Debjoyti, 
If you want quick and easy access to a genome browser, especially for smaller genomes, I suggest you try Artemis (http://www.sanger.ac.uk/science/tools/artemis).
You can start right from a FASTA file for your genome and a BAM file for your mapping. It has good visualisation and it's lightweight.
Anna 
  • asked a question related to Computational Genomics
Question
6 answers
Can anyone recommend programs to perform pan-genomic analyses on bacterial genomes (approximately 10 or more)? Either web-based or command line-based suggestions are welcome although from my experience command line tools provide more options and flexibility. I've discovered a few programs so far (get_homologues, PanWeb, BPGA) but I'd like to get some opinions from those who have performed pan-genome work before and have specific recommendations. It looks like a lot of new tools are coming out recently and it's either hard to keep track of them or even find them in the first place without reading literature on what others have done, which of course I'm also doing.  
I'm mostly interested in programs that can define the core, accessory and unique genes but of course programs that also integrate functions for graphing/summarizing results would be helpful. 
Thank you!
Relevant answer
Answer
Hi Jordan,
Hope that's helpful
  • asked a question related to Computational Genomics
Question
3 answers
Planning to send samples off, have ~50 individuals, but will pay for the whole plate, so it has been suggested we fill the rest of the wells with duplicate individuals. However, we won't be able to duplicate all individuals (only have 45 wells to fill). Moreover, we don't necessarily have enough DNA to fill the rest of the plate. Trying to decide if our dataset would be compromised by having only a subset of duplicated runs.
Relevant answer
Answer
Dear Mollie,
RAD-seq novice here ;) I'll try to help :
As far as I know it is always good to get replicates, even if you cannot duplicate everything. I would suggest selecting the individuals you consider the most "sensitive" (according to your knowledge of your sampling design, quality of DNA extractions, theoretical expectations, etc.) and also a few "good quality" individuals (e.g. with an amount of DNA that is still above your platform's technical recommendations).
Comparing "good" duplicates might allow you to estimate (even roughly) the amount of allele/locus dropout you might have in your results. Other duplicates could just serve as supplementary information to corroborate your main data.
If you have not read it yet, I recommend this great paper by Alicia Mastretta-Yanes et al. They made very good use of RAD-seq duplicates.
Cheers !
Chrys
  • asked a question related to Computational Genomics
Question
6 answers
Hi all, 
The question is pretty straightforward: I want to find a close relative of Juncus effusus whose genome sequence is fully annotated (protein evidence in FASTA format).
Thanks
Relevant answer
Answer
Finding a closest relative is one thing. If you want to use it as a reference for your own annotations, you should also take into account the quality of the annotations.
Many less-well studied organisms only have indirect annotations based on model species. If those are your closest relatives, you could as well go back to a slightly more distantly related species with more trustworthy annotations.
  • asked a question related to Computational Genomics
Question
2 answers
Very nice project. Best of luck, but why just transcriptomic data? You can use the ddRAD protocols too, to get SNPs from the junk DNA as well.
Relevant answer
Answer
You may get an idea from this review:
Also check the 1KP project. Currently there are many papers coming out using the data from this project. Some of the results can't be obtained using RADseq.
I think one advantage of RNA-seq data is that it lets researchers investigate the evolution of genes in a family. Here is one example:
Using RADseq data, you couldn't get this information.
  • asked a question related to Computational Genomics
Question
2 answers
Hi all,
I want to do rarefaction analysis for the plant "Juncus effusus" transcriptome assembly. At the moment I have Fasta file with final assembly. I wonder how can I use this dataset? Is there any tool available for this purpose? If there is no direct tool available, how can I get each gene abundance? an example file with a few reads is attached!
Relevant answer
Answer
First you need to annotate your sequences to find out which genes/functions each sequence belongs to. Then you can rarefy the gene count table directly; it is much more efficient, fast and just as robust to do it like this. Then you need to run a proper statistical analysis to mine your data. Good luck :-)
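Rarefying a gene count table, as suggested above, amounts to drawing a fixed number of reads without replacement from the observed counts. A minimal Python sketch, assuming the counts for one sample are already in a dictionary of gene to integer count (the function name and input layout are illustrative only):

```python
import random


def rarefy(counts, depth, seed=None):
    """Subsample a {gene: count} table down to `depth` reads in total.

    Reads are drawn without replacement, so no gene can end up with more
    reads than it had in the original table.
    """
    # Expand the table into one entry per read, then sample from that pool.
    pool = [gene for gene, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the total number of reads")
    rng = random.Random(seed)
    rarefied = {}
    for gene in rng.sample(pool, depth):
        rarefied[gene] = rarefied.get(gene, 0) + 1
    return rarefied
```

Expanding the table into a per-read pool keeps the code short but uses memory proportional to the total read count, so a very deep sample would call for a more economical sampling scheme.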
  • asked a question related to Computational Genomics
Question
4 answers
I have sequenced the genome of a Klebsiella pneumoniae strain and I am trying to find other Klebsiella pneumoniae genomes belonging to the same ST to use them for comparison.
I have used BLAST to find similar sequences but best matches resulted in different sequence types.
Relevant answer
Answer
Hi,
Thanks for your answer, it gave me a good idea.
What I did was to download all K. pneumoniae genomes and then use a program called Kleborate (https://github.com/katholt/Kleborate) and automatically type all the genomes. Now I have a database with all the uploaded genomes and their MLST.
  • asked a question related to Computational Genomics
Question
6 answers
I want to know whether there is any suitable tool for extracting the start and stop codons from a FASTA file of multiple protein-coding genes (PCGs). I have downloaded multi-FASTA PCGs (nucleotides) from NCBI; now I want to get the start and stop codons of all the FASTA sequences in one go. Thanks in advance.
Relevant answer
Answer
Given the information you provided, you have the FASTA sequence file, so you can translate it with any online tool such as ExPASy. You will get the translated sequences; take the longest amino acid sequence without any stop codon, note the position of the methionine, and from that you can get the locations of the ATG (start codon) and the stop codon.
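If each record in the multi-FASTA file is a complete coding sequence in frame (an assumption; the question does not say so explicitly), the start and stop codons are simply the first and last three bases of each record. A minimal Python sketch with hypothetical function names:

```python
def read_fasta(text):
    """Parse FASTA text into {record_id: uppercase sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts).upper() for n, parts in records.items()}


def terminal_codons(fasta_text):
    """Return {record_id: (first codon, last codon)} for each record."""
    codons = {}
    for name, seq in read_fasta(fasta_text).items():
        if len(seq) >= 6:  # need room for two separate codons
            codons[name] = (seq[:3], seq[-3:])
    return codons
```

Records that are partial or not in frame would still be reported, just with whatever their first and last three bases happen to be, so the output is worth checking against the expected ATG and stop codons.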
hope you get this..
  • asked a question related to Computational Genomics
Question
4 answers
I am looking for an open bioinformatics program or script for quickly checking the presence or absence of nonsynonymous SNPs in NGS results from amplicons of known proteins in particular human individuals. The program may of course include mapping to reference genomes, but in principle it should detect known SNPs by analyzing the nucleotide sequence and finding a substitution in a codon at a known SNP position that changes the amino acid. Known substitutions are the main target; detection of novel ones with mapping would be optional. I would also appreciate pointers to a few programs that I could tie together with a script. And it's not about dbSNP on NCBI or any other database.
Relevant answer
Answer
Thank you for the answer. I actually know the programs you mentioned; however, they do not completely meet my expectations. I have just written a script in Python which does what I want.
I intend to develop it and it is possible that I will pipe it with SAM or BAM tools if needed.
It works fast enough. It processes 1MB nucleotide sequence per 1 s on my old dual core 2.1 GHz laptop.
  • asked a question related to Computational Genomics
Question
5 answers
Dear all,
I have 100,000 transcript sequences in a .fasta file after RNA-Seq analysis.
I need to BLAST a single sequence against that .fasta file to find similar transcripts. How can I do it? Are there any online tools or programs I can use? (NCBI blastn cannot handle my file because it is so big.)
Regards,
Relevant answer
Answer
You can turn your 100,000 transcript sequences (in FASTA format) into a local database and run a blastn search against it from the Linux command line, after loading the BLAST module.
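A minimal sketch of the local-database approach, assuming the BLAST+ programs makeblastdb and blastn are installed and on the PATH. The file names and the database name are placeholders, and the E-value cutoff of 1e-5 is only an illustrative default:

```python
import subprocess


def makeblastdb_cmd(fasta, db_name):
    """Command that builds a nucleotide BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(query, db_name, out_path, evalue=1e-5):
    """Command that searches one query file against the local database.

    -outfmt 6 requests the tab-separated tabular report.
    """
    return ["blastn", "-query", query, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out_path]


def run_local_blast(fasta, query, out_path, db_name="transcripts_db"):
    """Build the database once, then run the search; raises if either step fails."""
    subprocess.run(makeblastdb_cmd(fasta, db_name), check=True)
    subprocess.run(blastn_cmd(query, db_name, out_path), check=True)
```

For example, run_local_blast("transcripts.fasta", "my_gene.fasta", "hits.tsv") would write the hits to hits.tsv. The database only needs to be built once and can be reused for later queries.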
  • asked a question related to Computational Genomics
Question
4 answers
I have the cadR gene sequence of Pseudomonas putida. There is a cadA gene downstream of cadR. The ORF of cadR is in the 3' to 5' direction (frame 1), and that of cadA is 5' to 3'. So how do I find the promoter and operator regions of the cadR gene?
Relevant answer
Answer
Thank you. I'll try them.
  • asked a question related to Computational Genomics
Question
3 answers
I have a bunch of gene mutations for which I know the exact location, for example chr1:979748. In all, there are over 1000 of these, and for each I need to know the gene and whether the location is within an intron, exon or UTR.
Is there any way to do this as a batch search?
We currently do it manually through IGV, but this is not practical. I've tried UCSC, and although it does batch searching, it doesn't report whether the position is within an intron or UTR. I also can't get it to report what I originally put into the file; the output doesn't match the input file, making it impossible to reconstruct.
Relevant answer
Answer
The easiest way would be to parse the GFF/GTF file of your genome and look up which features overlap your SNPs.
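A minimal sketch of that lookup, assuming a GFF3 file whose feature types are named gene, exon, five_prime_UTR and three_prime_UTR (GTF files and some GFF3 sources use other names, so the sets below would need adjusting). A position inside a gene but outside every exon is called an intron here, which is a simplification. The function names are hypothetical:

```python
UTR_TYPES = {"five_prime_UTR", "three_prime_UTR"}


def parse_gff(lines):
    """Collect (seqid, start, end, type, attributes) for gene, exon and UTR rows."""
    keep = {"gene", "exon"} | UTR_TYPES
    features = []
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] in keep:
            features.append((cols[0], int(cols[3]), int(cols[4]), cols[2], cols[8]))
    return features


def classify(position, features):
    """Classify a 'chrom:pos' string; the input string is echoed back in the result."""
    chrom, pos = position.split(":")
    pos = int(pos)
    types, genes = set(), set()
    for seqid, start, end, ftype, attrs in features:
        if seqid == chrom and start <= pos <= end:  # GFF coordinates are 1-based, inclusive
            types.add(ftype)
            if ftype == "gene":
                genes.add(attrs)
    if types & UTR_TYPES:
        region = "UTR"
    elif "exon" in types:
        region = "exon"
    elif "gene" in types:
        region = "intron"
    else:
        region = "intergenic"
    return position, region, sorted(genes)
```

Because each result carries the original input string, the output can be matched back to the input list. The linear scan is fine for about a thousand positions against a modest annotation; for a full human annotation an interval index would be much faster.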
  • asked a question related to Computational Genomics
Question
3 answers
Hi,
I have some BAC sequences, each assembled as a unique scaffold, and corresponding to a region of interest.
I also have scaffolds from an inaccurate genome reference that have been retrieved after BLAST of the BAC sequences.
I'd like to assemble all these sequences to get a better view of the region.
I do not have any external info such as distance (only sequences).
Is there any software that would be useful for this task?
Relevant answer
Answer
Do you think merging the scaffolds (BACs and genomic scaffolds) would be helpful? There are various ways of doing that.
  • asked a question related to Computational Genomics
Question
3 answers
Hi there,
I have recently assembled NGS HiSeq data with ABySS 2.02. With the default setting (k-mer = 64) I got a 183 nt stretch of Ns (a gap). When I changed the k-mer size to 40, the gap shrank to 165 nt. I also ran k-mer = 20 and observed no gap containing Ns. What exactly can I conclude from this result? More importantly, is my last result, which has no N sequence, reliable for further analysis?
Regards
Alireza
Relevant answer
Answer
The short answer is "no". The choice of k-mer size influences whether contigs are assembled correctly, and the assembly can be verified afterwards.
Here is a recent paper that includes references to programs designed to find the optimal k-mer size for assembly and proposes an alternative empirical method for determining the k-mer size to use.
Optimizing k-mer size using a variant grid search to enhance de novo genome assembly. Soyeon Cha and David McK Bird. Bioinformation. 2016; 12(2): 36–40.
Published online 2016 Apr 10. doi: 10.6026/97320630012036
  • asked a question related to Computational Genomics
Question
12 answers
I have been trying to assemble Illumina MiSeq data from a bacterial whole genome for the first time. I first used the SPAdes genome assembler to assemble the sequence and then used Mauve multiple genome alignment to order the contigs, using a very closely related strain as the reference.
Then I tried to submit the sequence to GenBank. They informed me that a foreign contamination screen of these sequences had shown that they contain adaptors, which must be removed. Besides, a preliminary annotation of the genome finds 31 fragmented rRNAs, indicating that the assembly is incorrect.
I have since been trimming the original fastq.gz file using cutadapt/skewer. However, after trimming, the FastQC results are still not good.
I am hoping that experts in this field can give me some guidance on what I should do, or point me to references or tutorials to help me learn this better.
Relevant answer
Answer
Hi Qinhong
Some points:
- As a first step, recheck the read quality and average coverage to make sure you are using high-quality reads.
- It is difficult to choose the best assembler for bacterial genomes, so you can try others besides SPAdes and compare the results with assembly evaluation tools such as REAPR or FRCbam, although I also prefer SPAdes. This link can help you: www.cbcb.umd.edu/software/imetamos.
- To remove the adaptors, I suggest re-trimming the reads.
- You can also have a look at the enclosed papers.
I hope this is useful.
  • asked a question related to Computational Genomics
Question
3 answers
I am working on genome-based metabolic network reconstruction, but I have not yet had any success finding software that is handy to work with. If anyone has used such software in their work, please mention it when responding.
  • asked a question related to Computational Genomics
Question
1 answer
I am using the MEME Suite for motif search in my gene sequences. I now have the results but am unable to download them.
Relevant answer
Answer
I think you can write out the signature sequence of the motif by visual inspection.
The signature sequence can be reported without providing the search result as supplementary information.
  • asked a question related to Computational Genomics
Question
13 answers
I want to perform the most accurate variant calling possible by implementing a perfect pipeline. I want to predict both somatic and germline mutations.
Please share your suggestions.
Relevant answer
Answer
There is no "perfect" pipeline for variant calling. Most, if not all, variant callers have systemic errors due to unpredictable behavior of BWA aligners when repeated sequences are encountered. For this reason, it is advisable to mask repeats PRIOR to alignment. K-mer based approaches do not handle variant dense regions very well.
  • asked a question related to Computational Genomics
Question
3 answers
I am trying to do an in silico digestion of the cotton reference genome using the SimRAD package in R, but I keep getting this error message:
Error in strsplit(DNAseq, split = recognition_code, fixed = FALSE, perl = FALSE) :
non-character argument
My code is as follows:
simseq <- Biostrings::readDNAStringSet("C:/Users/...", format="fasta")
library(SimRAD)
#PstI
cs_5p1 <- "CTGCA"
cs_3p1 <- "G"
simseq.dig <- insilico.digest(simseq, cs_5p1, cs_3p1, verbose=TRUE)
Does anyone have ideas on how to fix this problem?
Thank you!
Relevant answer
Answer
Did you check this question linked below?
You also have the option with the Biostrings package (link below too)
Also, try to install EMBOSS locally and check on the REBASE options
  • asked a question related to Computational Genomics
Question
3 answers
Hi everybody
A simple question has been confusing me these days: why are ESTs used for in silico identification of miRNAs, and why might other data types, for instance transcriptome sequencing data, not be suitable?
Answers would be highly appreciated.
Relevant answer
Answer
Use Infernal with the latest Rfam library. It is generic for any ncRNA
  • asked a question related to Computational Genomics
Question
3 answers
The genomes are not annotated, so tblastx or tblastn has to be used to generate the alignment for matching orthologous genes. The only software I know is ACT (Artemis Comparison Tool), which unfortunately does not allow the comparison of more than 4 'clusters'.
Thanks
Relevant answer
Answer
Hey there,
This might help. You can use CoGe to create synteny maps to aid in identifying collinearity between genomes. You can create an account and upload your genomes privately. Since you are working with bacteria, this should be pretty painless.
Adam
  • asked a question related to Computational Genomics
Question
2 answers
I was looking at the gene coordinates of LDLR in the hg19 assembly and the NG RefSeq record in order to convert between them. While I understand that the coordinates of the gene may differ between the two, I fail to understand why the length of the same gene should differ.
I would appreciate it if someone could clarify this for me.
Relevant answer
Answer
Thanks Volkan for the informative explanation!
I am actually going to convert coordinates from RefSeqGene NG to the hg19 genome assembly and vice versa by developing an app, but there is a problem with the gene lengths differing between hg19 and NG. Do you know of any tool able to handle this?
  • asked a question related to Computational Genomics
Question
4 answers
I need to run codeml on ortholog clusters from different related genera. Do I need to generate one species tree that I can use for all the clusters, or do I need to generate a gene tree for each cluster individually?
Relevant answer
Answer
You can build phylogenetic trees in whichever way you prefer. From what I remember, branch lengths are not used but recalculated by codeml. It is mandatory to align the proteins first and then align the coding sequences following the protein alignment, as with RevTrans (http://www.cbs.dtu.dk/services/RevTrans/) or alternatives such as MACSE (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0022594). Only in this way can you be sure that the alignment algorithm does not introduce frameshifts that would create artifacts in the ensuing selection analysis. Also, as codeml basically counts and models mutations, it is important that there are not too many differences between the sequences; otherwise, many of the sites might have had multiple mutations that you can't observe, and the selection level calculation will be affected. You can control for this to some extent by checking the branch lengths of the input tree, usually given in substitutions per site: you should have much less than one for every sequence in the multiple alignment. Clearly, the tree has to be built on nucleotide sequences. If there are highly variable regions, you can remove them from the alignment, but you need to be very careful in doing this, because you have to reason in coding sequence terms: your edit unit must be three nucleotides, following the frame.
It is also important to contrast the likelihood of the model with selection to the one of the model with no selection. If they do not differ much then, even if you observed a few sites under positive selection, you can't reject the hypothesis of no selection.
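To illustrate the idea of aligning the coding sequences by following the protein alignment, here is a minimal sketch that threads the codons of an unaligned CDS onto its already aligned protein sequence. It is not RevTrans or MACSE, only the core step; it assumes the CDS is in frame and that the protein is the translation of that CDS (this is not checked), and the function name is hypothetical:

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Build a codon alignment row from an aligned protein and its unaligned CDS.

    Each residue consumes one codon; each protein gap becomes three nucleotide
    gaps, so gaps always fall on codon boundaries and the frame is preserved.
    A trailing stop codon that has no residue in the protein is dropped.
    """
    if len(cds) % 3:
        raise ValueError("CDS length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if residues > len(codons):
        raise ValueError("protein has more residues than the CDS has codons")
    remaining = iter(codons)
    return "".join(gap * 3 if aa == gap else next(remaining)
                   for aa in aligned_protein)
```

Applying this to every sequence in a cluster gives a nucleotide alignment whose columns can be removed three at a time without breaking the reading frame.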
  • asked a question related to Computational Genomics
Question
3 answers
There are many thousands of protein structures in the PDB, whose sequences we can download in FASTA format. There are also millions of SNPs in NCBI dbSNP. My question is whether the proteins in the PDB incorporate these SNPs in their structures. If not, is there a tool to visualize the protein structure after a SNP changes the protein? I know that tools like SIFT exist, but they only say whether or not a SNP is harmful. They don't comment on the structure of the protein in any way.
Relevant answer
Answer
Try I-Mutant (http://gpcr2.biocomp.unibo.it/cgi/predictors/I-Mutant3.0/I-Mutant3.0.cgi): you will get the free energy change value and whether the mutation/SNP is stabilizing or not. To visualize the structure, why don't you submit the SNP-containing sequence to online modelling software (http://raptorx.uchicago.edu/) or try SWISS-MODEL?
  • asked a question related to Computational Genomics
Question
2 answers
Hi everyone,
I want to retrieve the sequences of all protein-coding genes that are present on a particular chromosome of a species, say chromosome 23 of human. Is there a gene browser available for this? Or has a computer program been written for this purpose?
If anyone has done this before or has an idea about how to do it, kindly guide me.
Thank you in advance!
Relevant answer
Answer
You may want to use ftp://ftp.ncbi.nlm.nih.gov/
Files with the extension ".faa" contain the protein sequences. Have fun!
  • asked a question related to Computational Genomics
Question
3 answers
Hi, everyone!
I have been using Cuffdiff to determine differential gene expression in Serratia marcescens. I noticed that multiple genes share the same transcript, giving the same FPKM value instead of one per gene in the final testing. I therefore compared the genome and the annotation file (GFF3 format, downloaded from NCBI) in Artemis, and found that all the overlapping genes seem to be merged into one gene. Is this a known issue with Cuffdiff, and is there a way to get the FPKM for each individual gene even though they overlap?
Thanks for your help! :)
Relevant answer
Answer
It happens because the mapping coverage between these two genes is continuous (it can happen with two or more genes) and Cuffdiff can't differentiate them. If you have high transcriptome coverage and genes very close to each other, you'll see this a lot! Another reason can be wrong gene prediction during structural annotation. Some gene predictors split real genes in two, or sometimes more, especially if you used a small number of predictors.
To avoid this intergenic mapping, use the option -T on the TopHat command line.
Tophat manual:
"The following options in this section are only used when the transcriptome search was activated with -G/--GTF and/or --transcriptome-index.
-T/--transcriptome-only Only align the reads to the transcriptome and report only those mappings as genomic mappings."
  • asked a question related to Computational Genomics
Question
5 answers
The UCSC genome browser provides visualization of ENCODE HiC data. The data points are annotated as "Genome Compartments". I was wondering if anyone experienced with HiC data sets could shed some light on what that annotation actually means? Just off the bat, I assume it means a region where interaction occurs. However, some regions are as long as > 10 Mb which makes me question whether I have correctly understood what the annotation means?  
Relevant answer
Answer
From post processing done on the data, it looks that way.  But some of the regions are 40 kb, which doesn't make much sense to me.
  • asked a question related to Computational Genomics
Question
7 answers
Dear all with expertise in bioinformatics: I am a beginner doing genome analysis. I have results, but unfortunately there are some misannotations from blastp caused by errors in the existing database, where two genes were predicted as a single gene. I need your guidance on how to curate these misannotations; checking them with blastp one by one is not feasible. Thank you for your kindness. Best regards, SA
Relevant answer
Answer
Thank you so much Mr. Pucker, i will try it.
  • asked a question related to Computational Genomics
Question
2 answers
I have a kinase substrate consensus sequence and I would like to BLAST it against a mitochondrial proteome to see if any targets pop up, but I'm not sure how to do this. I know there are mitochondrial proteome databases like MitoCarta2.0 and tools like MitoMiner 4.0, but I'm not sure how to use them to search for a short sequence rather than a whole protein. I am completely out of my depth here and would really appreciate any advice, or at least guidance on the direction I should go in.
Relevant answer
Answer
Hi Anna,
You can also look for homologs of the kinase substrate consensus sequence using the SwissModel database to predict real homologous targets.
OrthoDB (http://www.orthodb.org/v9.1/?query=Mitochondrial%20sequence): paste your consensus to find similar sequences.
Hope this is useful.
Regards,
Rajesh
  • asked a question related to Computational Genomics
Question
2 answers
I have a list of UniProt IDs for an entire genome and want to perform COG analysis. Please help.
Relevant answer
Answer
Thank you, Lexa Sir. To be more specific: for COG analysis on the Windows platform, do I need to install COG software available on the FTP site, or is COG a database from which I need to mine my genes?
  • asked a question related to Computational Genomics
Question
4 answers
I am looking for tools to get basic statistics of contigs files from Illumina sequencing.
Relevant answer
Answer
Hi,
1) If you are interested in the quality of the reads, you could use FastQC.
2) There are many different tools (https://omictools.com/assembly-evaluation-category) to evaluate the quality of an assembly, e.g. QUAST (http://bioinf.spbau.ru/quast). QUAST will provide some basic statistics about your assembly and run a complete evaluation of it. If you are just interested in the very basic statistics of your contigs, I could send you a simple Python script to calculate them.
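For readers who only want the very basic numbers, the following is an illustrative sketch (not the script offered above) that computes the contig count, total length, longest contig and N50 from a contigs FASTA file. N50 is taken here as the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total. The function names are hypothetical:

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in an iterable of FASTA lines."""
    lengths, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def contig_stats(lengths):
    """Basic assembly statistics from a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running, n50 = 0, 0
    for length in ordered:
        running += length
        if running * 2 >= total:  # reached half of the assembly
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total": total,
        "longest": ordered[0] if ordered else 0,
        "n50": n50,
    }
```

Usage would be along the lines of contig_stats(fasta_lengths(open("contigs.fasta"))), where the file name is a placeholder.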
  • asked a question related to Computational Genomics
Question
4 answers
I would like to know how to open the .soft file downloaded from NCBI GEO.
  • asked a question related to Computational Genomics
Question
3 answers
I am having difficulty running BLAST in that software: it takes too much time and terminates automatically after a while. Does anyone know how to resolve this, or know of alternative software for genome-scale reconstruction?
Relevant answer
Answer
I know that. The problem arose earlier, when someone selected a software package that had not been properly evaluated against your needs and resources (otherwise you wouldn't be asking this question). Anyway, I would be more than happy to help with alternatives like those I was pointing out.
Best of luck
  • asked a question related to Computational Genomics
Question
3 answers
A GTF/GFF file is required to run Cufflinks and Cuffdiff on the Galaxy server, but only a few species have this file, either in a species-specific genome database or in a public genome database. So, is it possible to find DEGs for a species whose GTF/GFF file has not been published yet?
Relevant answer
Answer
You can use Cufflinks in de novo mode, which can infer transcripts from the mapping data alone. Other tools that quantify expression from transcriptome mappings include RSEM (RNA-Seq by Expectation Maximization), eXpress, Sailfish and kallisto. TransDecoder can be another option: http://transdecoder.github.io/
  • asked a question related to Computational Genomics
Question
5 answers
Can anyone tell me how to look for minisatellites in the genome of an organism?
Relevant answer
Answer
Try the MISA software.
  • asked a question related to Computational Genomics
Question
3 answers
Hi,
I am trying to map my Illumina PE reads to two Rhizoctonia solani genomes, AG1-IA and AG1-IB. I downloaded the FASTA files for the AG1-IA genome from RSIADB (http://genedenovoweb.ticp.net:81/rsia/index.php?m=download&f=index) and the AG1-IB scaffolds from the JGI portal (http://genome.jgi.doe.gov/pages/dynamicOrganismDownload.jsf?organism=Rhiso1). I am using TopHat2 and built the index files using the bowtie2-build command as given in the manual. The AG1-IA genome works fine, and I get more than 75% aligned reads, as expected. But with the AG1-IB genome the reads do not map (alignment rate <1%), which shouldn't be the case as both are members of the same species. What could be the problem?
Relevant answer
Answer
Thank you both for your responses.
@Michaela I will try other aligners.
@Moinul Both genomes are assembled in scaffolds. AG1-IA is working fine. I'll try your suggestion if I don't succeed with other aligners.
  • asked a question related to Computational Genomics
Question
5 answers
I'm BLASTing different nitrogen-cycle-related genes against various genomes and got many hits with different E-values (ranging from 0 to 9). What should the E-value threshold be? In the literature, researchers have used different thresholds, e.g. 1e-5, 1e-10, 1e-100. I understand that bit score, identity and other statistics are also important. However, NCBI BLAST can use only the E-value as a constraint when searching queries against the subject.
What should be a decent E-value?
Relevant answer
Answer
The E-value tells you the number of hits with a given score or higher that you would expect to see by chance.
At low E-values (between 0 and 1e-3), you typically use the numbers only to rank hits, e.g. to select the best one among many, because they are all significant (in the sense that they are not random noise). Hits with an E-value > 1 are usually ignored, unless you know more specifically what you're looking for and that it may have very low similarity. In that case you must filter the hits that are satisfactory according to some other method (other than sequence similarity), or just take your chances and accept many false positives. Hits with intermediate values (between 1e-3 and 1) are borderline significant; how to treat those may differ from case to case.
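The three ranges above can be applied to a tabular BLAST report after the search. This is a minimal sketch, assuming the standard 12-column tabular output (-outfmt 6), where the E-value is column 11 and the bit score is column 12. The cutoffs are the ones from this answer, not universal constants, and the function names are hypothetical:

```python
def classify_hit(evalue, significant=1e-3, ignore_above=1.0):
    """Label a hit by E-value using the ranges described in the answer."""
    if evalue <= significant:
        return "significant"
    if evalue <= ignore_above:
        return "borderline"
    return "ignored"


def bin_hits(lines):
    """Group tabular BLAST lines into the three categories.

    Each stored hit is (query, subject, evalue, bitscore).
    """
    bins = {"significant": [], "borderline": [], "ignored": []}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip comments or malformed rows
        evalue, bitscore = float(cols[10]), float(cols[11])
        bins[classify_hit(evalue)].append((cols[0], cols[1], evalue, bitscore))
    return bins
```

Within the "significant" bin, sorting by bit score or E-value then serves only to rank the hits, as described above.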
  • asked a question related to Computational Genomics
Question
6 answers
DE analysis was carried out using cufflinks, TAIR10 database, Arabidopsis
Relevant answer
Answer
I personally prefer using edgeR to perform DE gene analysis. For GO enrichment analysis on the DE gene list you obtained from the first step, you can either give goseq (http://bioconductor.org/packages/release/bioc/html/goseq.html) a try, or use publicly available GO enrichment analysis tools such as PantherDB (http://www.pantherdb.org).
  • asked a question related to Computational Genomics
Question
3 answers
I have a whole-genome chickpea dataset from another source, and another dataset from diseased chickpea tissue. Is it mathematically possible that (pathogen + tissue) minus healthy tissue = pathogen genome? Could I submit the result to NCBI, and would it give a good number of annotations or the real pathogen genome size? For example, if one person is working on chickpea wilt, would that transcriptome dataset minus the ICRISAT chickpea dataset equal the wilt pathogen genome? Should I use such ideas in a dry lab and submit the resulting data to NCBI?
Relevant answer
Answer
I am not able to understand your question properly, but the main thing I would like to bring to your notice is that you cannot submit data to NCBI if it already exists as part of another project. Even though you are modifying existing data, it already exists in the NCBI database, and people who are interested in studying the wilt transcriptome and the chickpea transcriptome can separate them by mapping the reads to the wilt genome and the chickpea genome, respectively.
However, if you have generated your own original transcriptome data for wilt-infected chickpea, you can map the total reads to the existing wilt genome, use the mapped reads to recreate the wilt genome by de novo assembly, and then submit it to NCBI as wilt data.
Hope it helps.
  • asked a question related to Computational Genomics
Question
8 answers
We are running demo projects on integrative systematics of bees and wasps in the lab. To gather enough data, we would like to try the following next-generation sequencing technologies:
RADSeq, Restriction-site associated DNA sequencing;
AHE, Anchored Hybrid Enrichment;
GBS, Genotyping-by-Sequencing;
UCE, Ultraconserved Elements.
Any comments or suggestions? 
Relevant answer
Answer
Dear all,
We are doing RADseq on single Trichogramma individuals; these "beasts" are ca. 0.3 mm long! To do so we use Qiagen individual extraction followed by WGA, and it works very well. We tested the reliability of WGA and did not detect any problems. To do so we let a male (M1) and a female (F1) Trichogramma mate. We obtained multiple female descendants (F2s) that we isolated (no mating) and let produce males (M3s). We pooled all these males (n = ca. 2000, a kind of natural genome amplification) and compared the results with the WGA of M1 and F1. However, for some reason we had poor results using WGA for UCE capture: on average we lost one third of the loci captured on WGA individuals.
Best
JY
  • asked a question related to Computational Genomics
Question
1 answer
I am from the digital signal processing domain.
Relevant answer
Answer
My colleagues (Karel Sedlar, Helena Skutkova) and I, from the Department of Biomedical Engineering at Brno University of Technology, work in the field of genomic signal processing. You can check our contributions if you're interested.
  • asked a question related to Computational Genomics
Question
10 answers
I am looking for genes that have altered expression levels. When I use the GEO database I find multiple results for the same gene within one dataset (for example MCM4 in dataset "GSE9750"). In this example MCM4-results appear 4 times: 2x MCM4 is significantly elevated and 2x MCM4 shows no significant elevation. I'm wondering why one dataset contains multiple results for the same gene that also appear to contradict each other.
I appreciate your help!
Relevant answer
Answer
As expected, this is microarray data. Usually one probe is used to detect one gene (transcript) in an array but it's also common that some genes have multiple probes designed to hybridize to different regions of the transcript, perhaps for the purpose of variant/isoform studies. If you are interested in the overall expression of the gene, I would usually take an average of all probes for that gene. Otherwise you could look up the information of the probes (their locations, etc.) from the array company. In your case, http://www.affymetrix.com/estore/catalog/131537/AFFY/Human+Genome+U133A+2.0+Array#1_1.
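Averaging all probes for a gene, as suggested above, can be sketched as follows. The input layout (gene symbol, probe ID, expression value) and the probe IDs in the usage example are assumptions for illustration, not the actual GEO export format:

```python
from collections import defaultdict
from statistics import mean


def average_probes(rows):
    """Average expression over all probes of each gene.

    rows: iterable of (gene_symbol, probe_id, value) tuples for one sample.
    Returns {gene_symbol: mean value across its probes}.
    """
    by_gene = defaultdict(list)
    for gene, _probe_id, value in rows:
        by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}
```

For example, average_probes([("MCM4", "probe_a", 8.0), ("MCM4", "probe_b", 10.0)]) gives a single MCM4 value per sample. Averaging hides disagreement between probes, so it is still worth checking where the discordant probes hybridize.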
  • asked a question related to Computational Genomics
Question
4 answers
We have mapped short-read data to a reference genome using CLC Genomics Workbench. The mapping looks good but we are unsure how to extract coverage per sites or coordinates for individual mapped reads. We would like to use this data in constructing violin plots of coverage...any ideas on how to extract and construct these figures?
Relevant answer
Answer
In which format did you export them?
You'll need the sorted BAM format.
  • asked a question related to Computational Genomics
Question
3 answers
In my shotgun sequencing data I have a low coverage for some contigs. How can I explain that?
Relevant answer
Answer
"Standard library preparation methods include a PCR enrichment step prior to cluster generation. Biases inherent in PCR amplification result in uneven read coverage and increase the numbers of duplicate fragments present in the library."
  • asked a question related to Computational Genomics
Question
6 answers
BioEdit is an outdated biological sequence alignment editor: http://www.mbio.ncsu.edu/bioedit/bioedit.html
Do you know of alternatives to it? Thanks in advance.
Relevant answer
Answer
Colleagues have recommended AliView. I haven't had the time to road test it, but it might be worth a look.
You can find information about it, and links to download it, here:
  • asked a question related to Computational Genomics
Question
5 answers
::Edit:: This question has been answered, thank you!
I have been running *BEAST to estimate species divergence times, with one nuclear locus and one mitochondrial locus, using multiple individuals for each species. Posteriors look ok as far as I can tell, but a colleague has suggested that species trees for divergence times should be created using a single individual for each species. However, I cannot find any information stating whether a single individual/species or multiple individuals/species is preferable. Can someone shed some light on this issue for me?
Relevant answer
Answer
Finally found the answer.
From Heled and Drummond 2009 (the paper where starBEAST is described)
"Multiple samples per species are necessary for a complete estimation. Even two samples per species are sufficient, given enough loci. A single sample means no coalescent events for that extant species and so no information to estimate population size. This may in turn have a detrimental effect on inferring speciation times and perhaps even species topology."
  • asked a question related to Computational Genomics
Question
4 answers
bismark [options] <genome_folder> {-1 <mates1> -2 <mates2> | <singles>}
When I type this in Bismark, I get an error message:
could not read file {-1
I don't know what the problem is. I have finished genome_preparation but don't know what to do at this step.
Relevant answer
Answer
  • asked a question related to Computational Genomics
Question
4 answers
When an isolate enters your lab, it is almost instinctive to want to classify it at the "species" level based on phylogeny or similarity. However, bacteria evolve rapidly, sometimes exchange huge pieces of DNA, and recombine, so phylogeny may not be what you are looking for anyway, as it may be irrelevant to what an isolate is now. Concurrent whole-genome comparison methods such as CCT viewer have shown how woefully inadequate previous classification methods are, and why:
1) rRNA sequencing is a good method but has 2 critical failures. One is that there simply isn't enough variation for it to be useful. The second is that single mutation events have a huge impact on the outcome when really they don't mean much.
2) Cytochrome oxidase: In my experience these types of genes are much more likely to have just the right amount of variation to be meaningful, diagnostic, and statistically valid. However, considering gene mobility, this method can still be wildly inaccurate.
3) MLST: It is basically the luck of the draw whether the minuscule number of genes a researcher picks reflects what an isolate is. Pick another 10 and get another answer. Still, it is perhaps one of the best guesses with a good gene set.
4) Whole genome alignments that generate numbers. The researcher plugs genomes into an alignment program and it spits out numbers. What do they mean? Who knows? These numbers can be terribly flawed, perhaps worse than all the others.
To understand why, let's look at the classification of a number of isolates of the "species" Flavobacterium psychrophilum. Using a multi-isolate concurrent genome alignment tool like CCT viewer (Stothard Research Group), the relationships between the genomes can be seen holistically. In the case of FP, some isolates have an almost 100 kbp insertion. Numeric methods will always place these isolates into a group, even though they may not really be that close in core genome. CCT viewer also visualizes the patterns of similarity and difference in the group as a whole. Is variation concentrated or distributed? What has moved around or been gained or lost? What is the GC content in these regions? What are the genes in these regions? How do those genes relate to what an isolate actually is?
With genomics and concurrent whole genome alignment methodology there is some chance of figuring this out on a case-by-case basis. The interpretation really can't be done by a computer yet. Once concurrent methods have been applied and the patterns of variation quantified, other, simpler numeric methods may work well.
Misinformation is worse than no information.
Relevant answer
Answer
OK, let me suggest a rather more optimistic assessment. :)
(1) 16S rRNA gene sequencing is fast and ridiculously inexpensive. I would say it has limitations rather than "failures," since its success rate is nearly 100%. The limitation is what you say: the sequence is so highly conserved that it sometimes lacks enough phylogenetic information to pin the isolate down to the species level. Genus, certainly, but not always species. This isn't a matter of accuracy, but precision. For many applications, it is precise enough. Single Nucleotide Polymorphisms (SNPs) certainly exist, but they are a "feature," not a "bug." One SNP out of 1400+ bp of data won't skew the results. But a set of common SNPs does indeed indicate a close phylogenetic relationship.
(2) Single gene comparisons are risky for the reason you mentioned: horizontal gene transfer (HGT) between bacterial species. However, in practice what we observe is that highly conserved housekeeping genes are less frequently exchanged by HGT than are mobile elements and other components of the pangenome. Careful comparisons with large datasets reveal that there is a high level of correlation between housekeeping gene sequences, 16S rRNA gene sequences, and whole genome sequences (WGS). Single gene sequences can often, then, be useful surrogates for WGS in determining the taxonomic identity of an isolate. And because they are less conserved than 16S rRNA gene sequences, they can usually identify isolates down to the species level.
(3) MLST is indeed a highly sensitive, highly accurate method to classify bacteria. Again, only housekeeping genes are chosen. It isn't "luck of the draw," really. Several thousand bp of data are a good sample size, and the incorporation of multiple genes safeguards against the risks of single gene comparisons noted above. I maintain a Bacillus subtilis group MLST database on the MLST website created by Keith Jolley at Oxford. I've done extensive (unpublished) statistical analyses comparing phylogenetic distances computed by MLST and WGS comparisons, and the correlation is extremely high (R² = 0.986 for my data). That gives me a high degree of confidence that MLST gives very meaningful answers for bacterial classification.
(4) The reason that MLST is being superseded by whole genome comparisons isn't that it fails; it is that WGS technology is developing so rapidly, and the costs are dropping at the same rate. Why not get the whole genome sequence if you can afford it? I completely disagree with your characterization of WGS comparison as providing meaningless output. Instead of throwing up your hands in despair, I encourage you to read the papers and learn what the numbers mean! They actually convey an amazing amount of information. And current best practice algorithms, like the Genome to Genome Distance Calculator of the DSMZ, aren't at all perturbed by the type of situation that you describe, where one subset of genomes has a mobile element that another lacks. They don't even require that all genomes are complete. The resulting phylogenies are highly accurate and exquisitely precise.
So there you have it. :) In my view, we are in a golden age of bacterial species identification, thanks to the revolutionary developments in sequence-based technologies and bioinformatics applications. If you only have a few dollars at your disposal, do a 16S sequence. It will give you the proper genus and will narrow down the species possibilities to a small group. If you have a couple of hundred dollars, you can do a very thorough MLST and know exactly what species you have, and even determine which isolates within the species are the closest relatives. And if you have more like $1000-$1500, go ahead and do the WGS. Not only will you be able to identify the species with total confidence, you will acquire deep insights into its physiology and ecological role as you examine the sequence. And these price estimates will only come down over the next few years.
So cheer up! And happy sequencing. :)
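For readers who want to run the kind of MLST-versus-WGS comparison mentioned in point (3) on their own isolates, here is a minimal sketch of the R² calculation (squared Pearson correlation) between two lists of pairwise distances. The function name and the toy numbers are illustrative assumptions only; they are not the data behind the 0.986 figure above.

```python
from math import sqrt


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of squares and cross-products around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the lists has no variation")
    r = sxy / sqrt(sxx * syy)
    return r * r


if __name__ == "__main__":
    # Hypothetical pairwise distances for the same isolate pairs (made-up values).
    mlst_dist = [0.01, 0.02, 0.05, 0.08, 0.12]
    wgs_dist = [0.012, 0.019, 0.055, 0.079, 0.125]
    print(f"R^2 = {r_squared(mlst_dist, wgs_dist):.3f}")
```

Each position in the two lists must refer to the same pair of isolates, otherwise the correlation is meaningless.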
  • asked a question related to Computational Genomics
Question
4 answers
What tools can you recommend to visualize synteny/comparative genomics of 2-4 species/cultivars, specifically structural variation among species?
Something similar to Gale et al. 1998 (see image).
Thanks
Relevant answer
Answer
Hi,
Try any of the following:
3. circos - http://circos.ca/
Best regards,
---
Diran
  • asked a question related to Computational Genomics
Question
1 answer
I have a list of rat genomic regions. I have lifted them over to human orthologous regions. For the rat regions that can be successfully converted to human orthologs, I want to find out whether the matched rat-human regions are conserved. I can use the phastCons or phyloP score to evaluate the level of conservation, but these values are based on a multiple-species alignment (46 species). What I am really interested in is the conservation between human and rat specifically. How can I obtain that value? Using blastn, FASTA, or something else?
Relevant answer
Answer
Hi Jia Zhou, I am doing something similar, so if you ever came up with an answer, could you let me know the methodology you followed? Thanks
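One way to get a human-rat-only measure is percent identity over a pairwise alignment of each lifted-over pair. The sketch below assumes you already have the two aligned strings of equal length with gaps written as `-`, for example exported from a pairwise aligner such as blastn. The function name and the choice to ignore gapped columns are illustrative assumptions, not a standard definition of conservation.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns (an assumption; some definitions count them)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Hypothetical aligned fragments, for illustration only.
    rat = "ACGTT-GACC"
    human = "ACGATAGA-C"
    print(f"{percent_identity(rat, human):.1f}% identity")
```

Whether gaps should count as mismatches depends on the question being asked, so it is worth stating the definition used alongside any reported value.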
  • asked a question related to Computational Genomics
Question
4 answers
I trimmed and assembled my forward and reverse sequences using CLC Main Workbench. The trimming was intended to remove the poor-quality 3' and 5' ends.
I am going to perform multiple alignment, haplotype diversity analysis, and phylogenetic tree construction using other software (DnaSP, Arlequin, and MEGA).
These programs need sequences of uniform size (same length).
I have about 120 COI sequences of different sizes (640-750 bp). The expected size was 710 bp.
I want to cut them all to the same length (for example, all sequences at 550 bp) so that they are equal in size.
How can I do it in MEGA? I have all the consensus sequences in a FASTA file. Is it normal to cut the right end before alignment when the file is imported into MEGA?
Thank you
Relevant answer
Answer
Before cutting the sequences, BLAST them against the nucleotide database to find the reference sequences, and compare against those to choose an optimal size range with good chromatogram quality.
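If the goal is simply to cut every sequence in a FASTA file to one fixed length outside MEGA, a minimal Python sketch is shown below. The function names and the 550 bp cutoff are illustrative assumptions. Note that cutting raw sequences from the left only gives comparable positions if all reads start at the same site; a common alternative is to align first and then trim the alignment to the block of columns shared by all sequences.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def trim_to_length(records, length):
    """Keep the first `length` bases of each record; report records that are too short."""
    kept = []
    too_short = []
    for header, seq in records:
        if len(seq) >= length:
            kept.append((header, seq[:length]))
        else:
            too_short.append(header)
    return kept, too_short


if __name__ == "__main__":
    # Tiny made-up example; replace with open("consensus.fasta").read() (hypothetical file name).
    example = ">seq1\nACGTACGTAC\n>seq2\nACGTAC\n"
    kept, too_short = trim_to_length(parse_fasta(example), 8)
    for header, seq in kept:
        print(f">{header}\n{seq}")
    print("shorter than cutoff:", too_short)
```

Sequences shorter than the cutoff are reported rather than padded, so you can decide whether to drop them or lower the cutoff.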
  • asked a question related to Computational Genomics
Question
4 answers
I have a protein sequence and checked the PDB for a template for comparative modeling, but I couldn't find any similar protein with a known 3D structure. I still want to model my protein. Besides I-TASSER, which software should I use?
Relevant answer