Computational Genomics - Science topic
Questions related to Computational Genomics
I have the whole genome of a bacterium. Do you know which program can detect a virus genome within the bacterial genome?
Is it enough to use annotation (example: prokka) and then look manually for viral genes/proteins? Or would checking the assembly (example: prokka) and BLASTing the shorter contigs be enough?
Thank you in advance
Does anyone have experience with the table2asn tool of NCBI? It is used for submitting annotations of genomes using GFF files.
It can be found at this location: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/table2asn_GFF/
I am having a problem with it.
1) I can get the Windows tool running, but it gives me an error. The command used is the following: table2asn -M n -J -c w -t template.sbt -a r10k -l paired-ends -i FinalContigs.fsa -f 2700988623.gff -o output_file.sqn -Z output_file.dr -locus-tag-prefix xxxxx
template.sbt is a file generated by NCBI
FinalContigs.fsa is a file containing all the different contigs of the draft genome. The first contig is named Contig001, the second contig is Contig002, and so on.
2700988623.gff is the file containing the annotation. Column 1 gives the contigs in which the CDS and RNA features are found, so Contig001, Contig002, etc.
After running the command (as told by NCBI), I get the following error:
Cannot resolve lcl|Contig002: unknown
Line: 0
So if anyone could assist me with getting the tool running in Windows, that would be very much appreciated as it is the final step of submitting my genome.
Thanks in advance!
I would like to draw a haplotype network in Geneious directly, without using any other software. Is there a way to do that? Any plug-in that could help?
Thank you!
I have to analyze PacBio transcriptome data with four bin sizes: 1-2 Kb, 2-3 Kb, 3-6 Kb and 5-10 Kb. I understand that the HSP length cut-off and HSP-hit coverage settings must be customized for each bin when running BLAST, but what criteria should be used to choose them? Any good literature or suggestions on this topic would be of great help.
I used WebMGA to cluster my NGS data (COG). I am having trouble analyzing the data provided in output.zip because the file format is unknown to me. Do I need specific software to open each of those files?
Genomic data privacy is essential when sharing genomic data with the public. How can the privacy of genomic data be protected? Which anonymization models are useless for preserving the privacy of genomic data? Which model is suggested for preserving privacy?
Hello,
I have this sample that was sequenced as two different libraries.
I do not have access to the BAM files, only PLINK files.
Each PLINK file has that specific sample as its single individual.
Each PLINK file has a different number of SNPs, therefore some might be common to both files and others not.
Can I merge the two PLINK files in a way that I get all SNPs, choosing at random from which file to keep the duplicated ones?
E.g.
SampleA1 - 400000 SNPs
SampleA2 - 150000 SNPs
As some SNPs will exist in only one file and not in the other, I wish to merge the files and keep all of them, discarding, for example, the duplicated ones from the second sample.
Thanks!
EDIT: As mentioned above, I really needed to keep single alleles in all variants, hence the need to discard one of the two calls, since keeping both would in some cases lead to biallelic variants. This is because my data was/is pseudo-haploid. If this is irrelevant for you, you can use PLINK's (b)merge instead.
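The merge rule described in this question (keep every SNP, and for SNPs present in both files keep only one of the two calls, picked at random) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genotypes are assumed to have already been read out of the PLINK files into plain dictionaries keyed by SNP ID, and the function name `merge_single_sample` is made up for this example. It does not read or write PLINK files itself.

```python
import random


def merge_single_sample(calls_a, calls_b, seed=None):
    """Merge two {snp_id: allele} dicts that describe the same individual.

    SNPs found in only one input are kept unchanged. For SNPs found in
    both inputs, one of the two calls is picked at random, so every SNP
    ends up with a single (pseudo-haploid) call.
    """
    rng = random.Random(seed)  # a fixed seed makes the choice reproducible
    merged = {}
    for snp_id in sorted(set(calls_a) | set(calls_b)):
        if snp_id in calls_a and snp_id in calls_b:
            merged[snp_id] = rng.choice([calls_a[snp_id], calls_b[snp_id]])
        elif snp_id in calls_a:
            merged[snp_id] = calls_a[snp_id]
        else:
            merged[snp_id] = calls_b[snp_id]
    return merged
```

Passing a seed keeps the random choice repeatable between runs, which may matter if the merged file has to be regenerated later.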
Hi,
I am trying to create a genome browser (online, free to use) and I have come across the following difficulty: BRCA1, for example, has 30 transcripts, each containing different exons. The question now is which of these transcripts is the relevant one for breast cancer. It would be wrong to detect a mutation and communicate it as disease-causing if it lies in an exon that does not actually show up in the final relevant protein. Any idea how I can figure out which one is the right transcript to focus on?
I need a general rule, not one just for BRCA1, as I need to apply this rule to all 25 000 human genes. Any help would be welcome.
My best approach so far: Find amino acid sequence of relevant protein and then reverse calculate which transcript it came from. Is there a better way?
Hi everyone. I just want to ask how to analyse raw dru sequences of Staphylococcus species for dru typing. I have already read several papers; they suggest using the TRST sequence analysis software from BioNumerics and say that manual dru repeat analysis can be a challenge. Does that really mean that dru typing requires the BioNumerics software and cannot be done manually?
Thank you
Hello,
I am working with mouse RNA-seq data aligned to GENCODE's M8 and am trying to create a custom trackhub to visualize my data. I had done this before on RNA-seq of human cell lines without any problems, but in the process of trying to use the bedGraphToBigWig program and obtaining the mm10 chromosome sizes, I keep getting the error:
GL456210.1 is not found in chromosome sizes file
My bedgraphs are from sorted bam files, and I used the following command:
bedGraphToBigWig filename.bedGraph mm10.chrom.sizes filename.bw
Browsing the UCSC forums, I also found that interrogating the unscaffolded reads through:
mysql -hgenome-mysql.cse.ucsc.edu -ugenomep -ppassword -A \
> -e "select chrom,frag from gold;" mm10
led to my not seeing GL456210.1 correspond to anything.
Any thoughts on how to solve this issue? Your suggestions on how to get around this would be much appreciated! Thanks!
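For the error described in this question, one workaround is to drop (or rename) bedGraph rows whose sequence name does not appear in the chrom.sizes file before running bedGraphToBigWig. Below is a minimal Python sketch of that idea. The function names and the optional `rename` mapping are hypothetical, and whether discarding unplaced scaffolds is acceptable depends on the analysis; the sketch also assumes the mismatch comes only from sequence naming.

```python
def read_chrom_sizes(lines):
    """Return the set of sequence names listed in a chrom.sizes file."""
    names = set()
    for line in lines:
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def filter_bedgraph(bedgraph_lines, known_names, rename=None):
    """Keep bedGraph rows whose sequence name is in known_names.

    rename is an optional {old_name: new_name} mapping applied first.
    Returns (kept_rows, dropped_names) so the discarded names can be checked.
    """
    rename = rename or {}
    kept, dropped = [], set()
    for line in bedgraph_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        fields[0] = rename.get(fields[0], fields[0])
        if fields[0] in known_names:
            kept.append("\t".join(fields))
        else:
            dropped.add(fields[0])
    return kept, dropped
```

Returning the dropped names makes it easy to see which sequences were lost, and a small rename table can then be supplied for those that do have an equivalent name in the sizes file.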
The search for shortest unique substrings (shustrings) of the genome is an important problem, for in some sense it is the mix of these shustrings that defines the phenotype of an organism. A related question concerns the set of shustrings that *is not* found within any genome, for the presence of these may correspond to specific detrimental consequence in organisms. So, my question addresses both the issue of any prior review (there has been) and any discussion of justification for the observance of non-representation of these shustrings within organisms. The shorter the length of such sequences, the more important would be any corresponding justification, I should think.
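To make the two notions in this question concrete, here is a small, naive Python sketch. It assumes that "shortest unique substring" means a substring of minimal length that occurs exactly once in the sequence, and that the "not found" set means the shortest words over the alphabet that never occur. Both function names are made up for this example, and the brute-force counting is only suitable for short strings; genome-scale work would normally rely on suffix trees or suffix arrays.

```python
from collections import Counter
from itertools import product


def shortest_unique_substrings(seq):
    """Return (k, substrings) for the smallest k at which some
    substring of length k occurs exactly once in seq."""
    for k in range(1, len(seq) + 1):
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        unique = sorted(s for s, c in counts.items() if c == 1)
        if unique:
            return k, unique
    return 0, []  # empty input: nothing to report


def shortest_absent_words(seq, alphabet="ACGT"):
    """Return (k, words) for the smallest k at which some word of
    length k over the alphabet never occurs in seq."""
    k = 1
    while True:
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        absent = sorted("".join(p) for p in product(alphabet, repeat=k)
                        if "".join(p) not in present)
        if absent:
            return k, absent
        k += 1
```

For the toy sequence "ACGTAC", every base occurs, G and T occur only once, and twelve of the sixteen possible dinucleotides are missing.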
To correct the phenomenon of mistagging in Illumina, what is the best threshold to filter the low abundance OTUs of each sample?
I have 100s of bacterial genome sequences that I have to use to construct an ortholog table by bidirectional all-vs-all BLAST+ search among the protein products. I have already performed a blastall search, but do not know how to pull out the best-hit sequences and arrange them as orthologs (which will also include paralogs). Please help.
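A common first step for this kind of table is the reciprocal best hit (RBH) heuristic: two proteins are paired when each is the other's top-scoring hit. The Python sketch below shows the idea for one pair of genomes. It assumes the BLAST results are in the 12-column tabular format with the bit score in the last column; the function names are hypothetical. RBH on its own does not group paralogs, so it is only a starting point for the table described in the question.

```python
def best_hits(blast_lines):
    """Map each query to its highest-bit-score subject.

    Expects BLAST tabular rows: 12 tab-separated columns with the
    query ID first, the subject ID second and the bit score last.
    """
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        if query == subject:
            continue  # ignore self-hits from all-vs-all searches
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def reciprocal_best_hits(a_vs_b_lines, b_vs_a_lines):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b_lines)
    backward = best_hits(b_vs_a_lines)
    pairs = []
    for a, (b, _score) in forward.items():
        if backward.get(b, (None, 0.0))[0] == a:
            pairs.append((a, b))
    return sorted(pairs)
```

For hundreds of genomes the same function would be applied to every genome pair, and the resulting pairs merged into rows of the ortholog table.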
Hi,
I work on reptile genomes and I have a problem with variant calling using samtools. I'm interested in a VCF file with the SNPs and indels in my BAM file across the genomes of multiple species.
I ran the command from the manual and it gives "[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000"
Does anyone know the solution for this?
Many Thanks.
I have 9 VCF files for individuals, and I need some publications or tools to learn more about:
- Disease risk calculations.
- Disease identification.
- Drug response.
- Antibiotic resistance.
- Dealing with novel SNPs
- Fitness and Nutrition genes
Also, if a sample contains a gene related to a specific disease, does this mean that the person suffers from this disease, or that the person is likely to become a patient in the future? (How can I know whether a gene is active or not?)
I look forward to your replies with publications, tools and recommendations.
Thanks
elsayed
During a literature survey I found DarkHorse, which can predict horizontal gene transfer (HGT) candidates from bacterial and archaeal genomes. I want to know whether my candidate proteins have been transferred via HGT from either prokaryotes or eukaryotes. Therefore, I am looking for similar online tools to predict horizontal gene transfer in plants.
Hello, could someone suggest a tool to get information (gene localization, DNA methylation, CG composition and so on) from GEO data? Thanks
I ran the popsizeabc tool and ran into a problem with the "abc" function of the abc R package. The error comes from "nnet.default" and says that x and y must match. I compared my input data with the example data, and I think they have a similar format.
I hope someone can help me, please.
Thanks in advance.
Hi everyone,
I've sequenced (Illumina; Nextera; 2x250) different strains of Salmonella (genome size around 5 Mb) and I want to compare the genomes to check whether they are related or not: some came from animals, the others from patients.
I cleaned my sequences with trimmomatic before doing the assembly (I used megahit, although now I know it's not the best one). After that I tried different analyses:
-REALPHY (contigs and reference genome as input files) to obtain the phylogenetic tree.
-bwa to do the mapping between my contigs and the reference, but I didn't know how to get a single sequence. I checked it with QUAST.
-With mauve I got my contigs ordered according to my reference but not a consensus sequence for each sample and I could not interpret the results very well.
I am quite new in Bioinformatics, so I'd appreciate some advice.
1/If I use two different assemblers (Velvet and Spades), how can I merge both?
2/Is it better in this case to do de novo assembly or just mapping to a reference (I have high coverage)?
3/Once I have the contigs, which software should I use to obtain scaffolds? And to obtain a unique sequence I can use in further analysis?
4/Which software would you recommend for phylogeny? And for clustering? Do you think this would be the best way to answer my problem?
I'd really appreciate any suggestion.
Thank you very much!
Andrea
Hello,
I have sequenced an organism using NGS technology. After assembly, I found that it has only 6-22% homology with other related GenBank entries. I performed an MSA with all of the related organisms' genomes, but it is time-consuming and the resulting phylogenetic tree is hard to interpret. I want to know what the proper process is. The organism is a phage, and its family has been identified by electron microscopy.
Regards
Hi, could you recommend any prediction software for the DNA-binding sites of my protein?
I have a protein with a structure predicted using Psipred (pdb file) and a FASTA file with the genome. I would like to predict the binding sites of my protein in this genome.
Thank you.
I have a large set of data for which I already have the DEGs, and I also have them clustered. I need to find the functions enriched in each gene cluster. I was able to use the DAVID tool for just a few of the clusters, but DAVID complains that the other clusters contain more than 3000 genes, and I could not get any result. Which other tool(s) can help? Note that I am dealing with human data.
I'm doing a project, and my group needs to know what the DNA sequence represents. The problem is that we are new to bioinformatics and do not know which database we should use for BLAST. We tried Non-redundant protein and SwissProt, and they give different answers. Non-redundant shows a protein with 0 E-value and 100% identity, while SwissProt shows a protein with 4e-29 E-value and 23% identity. Which one should I use?
I am running BLAST search of a target gene in Staph. aureus to find interspecies gene similarity. The BLAST results (top 100) show the target gene in different strains of Staph. aureus. I downloaded the FASTA with all search results (>1000 sequence matches) but manually editing the FASTA is a cumbersome job. The phylogeny tree with this FASTA is a mess. Is there any software available which can cluster the sequences from a single organism into just one branch (no further branching for each strain as they have >99% sequence similarity) of the phylogeny tree so I can compare the similarity of target gene between Staph. aureus and other organisms.
Hi Friends,
While running my RNA-Seq data (Tophat/HISAT2 aligned output, name-sorted with samtools) through HTSeq2 for counting DEG, I get warning messages such as “Mate records missing for 5069 records; Paired end read :HWI-D00486:7:1116:11181:7055” and “Mate pairing was ambiguous for 61656 records;” after the SAM alignment record pairs are processed. Could anyone kindly suggest how to resolve these warnings?
I have sequenced partial mitochondrial genome of Philautus sp. and have generated a consensus representing the reconstructed genome from the reads. There are many single nucleotide variants present when compared to a reference genome. How can I visualize these SNVs without a reference genome as genome browsers like IGV require reference genomes?
Can anyone recommend programs to perform pan-genomic analyses on bacterial genomes (approximately 10 or more)? Either web-based or command line-based suggestions are welcome although from my experience command line tools provide more options and flexibility. I've discovered a few programs so far (get_homologues, PanWeb, BPGA) but I'd like to get some opinions from those who have performed pan-genome work before and have specific recommendations. It looks like a lot of new tools are coming out recently and it's either hard to keep track of them or even find them in the first place without reading literature on what others have done, which of course I'm also doing.
I'm mostly interested in programs that can define the core, accessory and unique genes but of course programs that also integrate functions for graphing/summarizing results would be helpful.
Thank you!
Planning to send samples off, have ~50 individuals, but will pay for the whole plate, so it has been suggested we fill the rest of the wells with duplicate individuals. However, we won't be able to duplicate all individuals (only have 45 wells to fill). Moreover, we don't necessarily have enough DNA to fill the rest of the plate. Trying to decide if our dataset would be compromised by having only a subset of duplicated runs.
Hi all,
The question is pretty straightforward: I want to find a close relative of Juncus effusus whose genome sequence is fully annotated (protein evidence in FASTA format).
Thanks
Very nice project. Best of luck, but why just transcriptomic data? You can use the ddRAD protocols too, to get SNPs from the junk DNA as well.
Hi all,
I want to do rarefaction analysis for the transcriptome assembly of the plant "Juncus effusus". At the moment I have a FASTA file with the final assembly. I wonder how I can use this dataset. Is there any tool available for this purpose? If there is no direct tool available, how can I get each gene's abundance? An example file with a few reads is attached!
I have sequenced the genome of a Klebsiella pneumoniae and I am trying to find other Klebsiella pneumoniae genomes belonging to the same ST to use for comparison.
I have used BLAST to find similar sequences but best matches resulted in different sequence types.
I want to know whether there is any suitable tool for extracting the start and stop codons from a FASTA file of multiple protein-coding genes (PCGs). I have downloaded multi-FASTA PCGs (nucleotides) from NCBI; now I want to get the start and stop codons of all the FASTA sequences in one go. Thanks in advance.
I am looking for an open bioinformatics program or script that can quickly analyze the presence or absence of nonsynonymous SNPs in NGS results from amplicons of known proteins in particular human individuals. The program may of course include mapping to the reference genome, but in principle it should detect known SNPs by analyzing the nucleotide sequence and finding a substitution in a codon at a known SNP position that changes the amino acid. Known substitutions are the main interest; detection of novel ones via mapping would be optional. I would also appreciate pointers to a few programs that I could chain together with a script. And it's not about dbSNP on NCBI or any other database.
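If no ready-made tool fits, the first and last three bases of every record can be pulled out with a short script. The Python sketch below is one way to do it, with hypothetical function names. It assumes every record is a complete coding sequence given in frame on the coding strand, so the last three bases are simply reported as they are; genes with incomplete stop codons would need to be checked by hand.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def start_stop_codons(lines):
    """Return (record_id, first_three_bases, last_three_bases) per record."""
    table = []
    for header, seq in parse_fasta(lines):
        seq = seq.upper()
        record_id = header.split()[0] if header else ""
        table.append((record_id, seq[:3], seq[-3:]))
    return table
```

With a file on disk, the function can be fed an open file handle directly, since it only iterates over lines.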
Dear all,
I have 100,000 transcript sequences in a .fasta file after RNA-Seq analysis.
So, I need to BLAST just one sequence that might be similar to some sequence in that .fasta file. How can I do it? Are there any online tools or programs I can use? (NCBI's blastn cannot handle my file because it is so big.)
regards,
I have the cadR gene sequence of Pseudomonas putida. There is a cadA gene downstream of cadR. The ORF of cadR is 3' to 5' frame 1, and that of cadA is 5' to 3'. So how do I find the promoter and operator regions of the cadR gene?
I have a bunch of gene mutations for which I know the exact locations, for example chr1:979748. In all, there are over 1000 of these, and for each I need to know the gene and whether the location is within an intron, exon or UTR.
Is there any way to do this as a batch search?
We currently do it manually through IGV but this is not practical. I've tried UCSC, and although it does batch searching, if the gene is within an intron or UTR it doesn't report it. I also can't get it to report what I originally put into the file; the input doesn't match the output, making it impossible to reconstruct.
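The batch lookup described here boils down to an interval query that echoes the original input string alongside each result. The Python sketch below illustrates that logic only. It assumes the gene model has already been flattened into (chromosome, start, end, gene, feature type) tuples with 1-based inclusive coordinates; that table, the feature labels and the function names are all hypothetical, and in practice the table would have to be built from an annotation file.

```python
def parse_position(text):
    """Split a string such as 'chr1:979748' into ('chr1', 979748)."""
    chrom, pos = text.strip().split(":")
    return chrom, int(pos.replace(",", ""))


def annotate_positions(positions, features):
    """Report every feature overlapping each position.

    features: iterable of (chrom, start, end, gene, kind) tuples with
    1-based inclusive coordinates. Each output row starts with the
    original input string so results can be matched back to the input.
    """
    features = list(features)
    rows = []
    for text in positions:
        chrom, pos = parse_position(text)
        hits = sorted({(gene, kind) for c, start, end, gene, kind in features
                       if c == chrom and start <= pos <= end})
        if not hits:
            hits = [("", "intergenic")]  # nothing overlaps this position
        for gene, kind in hits:
            rows.append((text, gene, kind))
    return rows
```

The linear scan is fine for about a thousand positions; a much larger list would probably call for an interval index instead.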
Hi,
I have some BAC sequences, each assembled as a unique scaffold, and corresponding to a region of interest.
I also have scaffolds from an inaccurate genome reference that have been retrieved after BLAST of the BAC sequences.
I'd like to assemble all these sequences to get a better view of the region.
I do not have any external info such as distance (only sequences).
Is there any software that can be useful for this task ?
Hi there,
I have recently assembled NGS HiSeq data with ABySS 2.02. With the default setting (k-mer=64) I got a 183 nt stretch of Ns (a gap). When I changed the k-mer to 40, the gap shrank to 165 nt. I also ran k-mer=20 and observed no gap containing N sequences. I want to ask what exactly I can conclude from this result. More importantly, I want to know whether my last result, which has no N sequence, is reliable for further analysis.
Regards
Alireza
I have been trying to assemble data from an Illumina MiSeq system, analysing a bacterial whole-genome sequence for the first time. I first used the SPAdes genome assembler to assemble the sequence and then used Mauve multiple genome alignment to order the contigs, using a very closely related strain as the reference.
Then I tried to submit the sequence to GenBank. They informed me that a foreign contamination screen of these sequences showed that the sequence contains adaptors, which must be removed. In addition, a preliminary annotation of the genome found 31 fragmented rRNAs, indicating that the assembly is incorrect.
Since then I have been working on trimming the original fasta.gc file using cutadapt/skewer. However, after the trimming the FastQC results are still not good.
I'm hoping experts/researchers in this field can give me some guidance on what I should do, or maybe provide some references/tutorials to help me learn this better.
Actually I am working on genome-based metabolic network reconstruction, but I haven't yet had any success finding software that is handy to work with. If someone has used such software before in their work, please mention it when responding.
I am using the MEME Suite for motif search in my gene sequences. I now have the results but am unable to download them.
I want to perform the most accurate variant calling by implementing a perfect pipeline. I want to predict both somatic and germline mutations.
Please tell me your suggestions.
I am trying to do in silico digestion of the cotton reference genome using the SimRAD package in R, but I keep getting this error message:
Error in strsplit(DNAseq, split = recognition_code, fixed = FALSE, perl = FALSE) :
non-character argument
My code is as follows:
simseq <- Biostrings::readDNAStringSet("C:/Users/...", format="fasta")
library(SimRAD)
#PstI
cs_5p1 <- "CTGCA"
cs_3p1 <- "G"
simseq.dig <- insilico.digest(simseq, cs_5p1, cs_3p1, verbose=TRUE)
Does anyone have ideas on how to fix this problem?
Thank you!
Hi everybody
A simple issue that has confused me lately: why are ESTs used for in silico identification of miRNAs, and why might other data types, for instance transcriptome sequencing data, not be suitable?
Answers would be highly appreciated.
The genomes are not annotated, so tblastx or tblastn has to be used to generate the alignment for matching orthologous genes. The only software I know is ACT Artemis, which unfortunately does not allow the comparison of more than 4 'clusters'.
Thanks
I was looking at the gene coordinates of LDLR for the hg19 assembly and the NG RefSeq in order to convert them to each other. While I do understand that the length of the gene may be different in the two assemblies, I fail to understand why there should be a difference in the length of the same gene in the two assemblies.
I do appreciate if someone can clarify it to me.
I need to run codeml for ortholog clusters from different related genera. Do I need to generate one species tree that I can use for all the clusters, or do I need to generate a gene tree for each cluster individually?
There are millions of proteins given in PDB, the sequences of which we can download in FASTA format. There are also hundreds of SNPs given in NCBI dbSNP. My question is whether the proteins in PDB incorporate the SNPs into their structure. If not, is there a way to visualize the protein structure using any tool after a SNP in the protein? I know that tools like SIFT exist, but they only say whether or not a SNP is harmful. They don't comment on the structure of the protein in any way.
Hi everyone,
I want to retrieve the sequences of all protein-coding genes that are present on a particular chromosome of a species. Suppose, chromosome 23 of human. Is there any gene browser available for this? Or has any program been written for this purpose?
If anyone has done this before or has an idea about how to do it, kindly guide me.
Thank you in advance!
Hi, everyone!
I have been using Cuffdiff to determine differential gene expression in Serratia marcescens. I noticed that multiple genes shared the same transcript, giving the same FPKM value rather than one for each individual gene in the final testing. Therefore, I compared the genome and the annotation file (gff3 format, downloaded from NCBI) in Artemis. I found that all the overlapping genes seem to be merged into one gene. Is this a known issue with Cuffdiff, and is there a way to solve it in order to get the FPKM for each individual gene even though they overlap?
Thanks for your help! :)
The UCSC genome browser provides visualization of ENCODE HiC data. The data points are annotated as "Genome Compartments". I was wondering if anyone experienced with HiC data sets could shed some light on what that annotation actually means? Just off the bat, I assume it means a region where interaction occurs. However, some regions are as long as > 10 Mb which makes me question whether I have correctly understood what the annotation means?
Dear all with expertise in bioinformatics: I am a beginner and would like to ask for help. I am doing a genome analysis and have results, but unfortunately blastp produced some misannotations caused by errors in an earlier database, in which two genes were predicted as a single gene. I need guidance on how to curate these misannotations; doing blastp one by one would be impossible. Thank you for your kindness. Best regards, SA
I have a kinase substrate consensus sequence and I would like to blast it through a mitochondrial proteome to see if any targets pop up, but I'm not sure how to do this. I know there are known mitochondrial proteome databases like MitoCarta2.0 and tools like MitoMiner 4.0, but I'm not sure how to use them to identify a sequence versus a whole protein. Completely out of my depth here, I would really appreciate any advice or at least guidance on the direction I should go in.
I have a list of UniProt IDs for an entire genome and want to perform COG analysis. Please help.
I am looking for tools to get basic statistics of contigs files from Illumina sequencing.
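For a quick look without installing anything, the usual summary numbers (contig count, total length, longest and shortest contig, N50) can be computed directly from the FASTA file. The Python sketch below is a minimal version with hypothetical function names. N50 is taken here as the length of the contig at which the running total, going from longest to shortest, first reaches half of the assembly size.

```python
def contig_lengths(lines):
    """Return the length of every record in FASTA-formatted lines."""
    lengths, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_stats(lengths):
    """Summarise contig lengths: count, total, longest, shortest, N50."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running, n50 = 0, 0
    for length in ordered:
        running += length
        if running * 2 >= total:  # reached half of the assembly size
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bp": total,
        "longest": ordered[0] if ordered else 0,
        "shortest": ordered[-1] if ordered else 0,
        "n50": n50,
    }
```

Calling `assembly_stats(contig_lengths(open_file_handle))` on a contigs file gives the whole summary in one dictionary.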
I would like to know how to open the .soft file downloaded from NCBI GEO.
I am having difficulty running BLAST in that software: it takes too much time and terminates automatically after a while. Does anyone know how to resolve this, or know of any alternative software for genome-scale reconstruction?
A GTF/GFF file is required to run Cufflinks and Cuffdiff on the Galaxy server. But only a few species have this file available, either in a species-specific genome database or in a public genome database. So, is it possible to find DEGs for species whose GTF/GFF file has not been reported yet?
Can anyone tell me how to look for minisatellites in the genome of an organism?
Hi,
I am trying to map my Illumina PE reads to two Rhizoctonia solani genomes, AG1-IA and AG1-IB. I have downloaded the FASTA files for the AG1-IA genome from RSIADB (http://genedenovoweb.ticp.net:81/rsia/index.php?m=download&f=index) and the AG1-IB scaffolds from the JGI portal (http://genome.jgi.doe.gov/pages/dynamicOrganismDownload.jsf?organism=Rhiso1). I am using tophat2 and built the index files using the bowtie2-build command as given in the manual. The AG1-IA genome works fine and I get more than 75% aligned reads, as expected. But when I use the AG1-IB genome, the reads do not map (alignment of <1%), which shouldn't be the case as both are members of the same species. What could be the problem?
I'm trying to BLAST different nitrogen-cycle-related genes against various genomes and got many hits with different E-values (ranging from 0 to 9). What should the threshold E-value be? In the literature, researchers have used different E-value thresholds, e.g. 1e-5, 1e-10, 1e-100, etc. I understand that bit score, identity and other statistics are also important. However, NCBI BLAST can use only the E-value as a constraint when searching queries against the subject.
What should be a decent E-value?
DE analysis was carried out using cufflinks, TAIR10 database, Arabidopsis
I have whole-genome data for chickpea from other sites. I also have data from diseased chickpea tissue. Is it possible to work this out mathematically, i.e. (pathogen + tissue) minus healthy tissue = pathogen genome? Could I submit the result to NCBI? Or is it possible to get a good number of annotations or the real pathogen genome size this way? For example, someone working on chickpea wilt would take their transcriptome data minus the ICRISAT chickpea data, and the difference would equal the wilt genome. Should I use such ideas in a dry lab and submit the data to NCBI?
We are running demo projects on the integrative systematics of bees and wasps in the lab. To gather enough data, we would like to try the following next-generation sequencing technologies -
RADSeq, Restriction-site associated DNA sequencing;
AHE, Anchored Hybrid Enrichment;
GBS, Genotyping-by-Sequencing;
UCE, Ultraconserved Elements.
Any comments or suggestions?
I am from the digital signal processing domain.
I am looking for genes that have altered expression levels. When I use the GEO database I find multiple results for the same gene within one dataset (for example MCM4 in dataset "GSE9750"). In this example MCM4-results appear 4 times: 2x MCM4 is significantly elevated and 2x MCM4 shows no significant elevation. I'm wondering why one dataset contains multiple results for the same gene that also appear to contradict each other.
I appreciate your help!
We have mapped short-read data to a reference genome using CLC Genomics Workbench. The mapping looks good but we are unsure how to extract coverage per sites or coordinates for individual mapped reads. We would like to use this data in constructing violin plots of coverage...any ideas on how to extract and construct these figures?
In my shotgun sequencing data I have a low coverage for some contigs. How can I explain that?
BioEdit is an outdated biological sequence alignment editor: http://www.mbio.ncsu.edu/bioedit/bioedit.html
Do you know alternatives for it ? Thanks in advance
::Edit:: This question has been answered, thank you!
I have been running *BEAST to estimate species divergence times, with one nuclear locus and one mitochondrial locus, using multiple individuals for each species. Posteriors look ok as far as I can tell, but a colleague has suggested that species trees for divergence times should be created using a single individual for each species. However, I cannot find any information stating whether a single individual/species or multiple individuals/species is preferable. Can someone shed some light on this issue for me?
bismark [options] <genome_folder> {-1 <mates1> -2 <mates2> | <singles>}
In Bismark, when I write this I get an error message saying:
could not read file {-1
I don't know what the problem is. I have completed the genome_preparation step, but I don't know what to do with this step.
When an isolate enters your lab it is almost instinctive to want to classify it at the "species" level based on phylogeny or similarity. Also, bacteria evolve rapidly, sometimes exchange huge pieces of DNA, and recombine, so phylogeny may not be what you are looking for anyway, as it may be irrelevant to what an isolate is now. Concurrent whole-genome comparison methods such as CCT viewer have shown how woefully inadequate previous classification methods are, and why:
1) rRNA sequencing is a good method but has 2 critical failures. One is that there simply isn't enough variation for it to be useful. Second, single mutation events have a huge impact on the outcome when really they don't mean much.
2) Cytochrome oxidase: In my experience these types of genes are much more likely to have just the right amount of variation to be meaningful, diagnostic, and statistically valid. However, considering gene mobility this method can still be wildly inaccurate.
3) MLST: It is basically the luck of the draw whether the minuscule number of genes a researcher picks reflects what an isolate is. Pick another 10 and get another answer. Still, it is perhaps one of the best guesses with a good gene set.
4) Whole genome alignments that generate numbers. The researcher plugs genomes into an alignment program and it spits out numbers. What do they mean? Who knows? These numbers can be terribly flawed, perhaps worse than all the others....
To understand why, let's look at the classification of a number of isolates of the "species" Flavobacterium psychrophilum. Using a multi-isolate concurrent genome alignment tool like CCT viewer (Stothard Research Group), the relationships between the genomes can be seen holistically. In the case of FP, some isolates have an almost 100 kbp insertion. Numeric methods will always place these isolates into a group, even though these isolates may not really be that close in core genome. CCT viewer also visualizes the patterns of similarity and differences in the group as a whole. Is variation concentrated or distributed? What has moved around, or been gained/lost? What's the CG content in these regions? What are the genes in these regions? How do those genes relate to what an isolate actually is?
With genomics and concurrent whole genome alignment methodology there is some chance of figuring this out on a case-by-case basis. The interpretation really can't be done by a computer yet. Once concurrent methods are done and the patterns of variation quantified, other simpler numeric methods may work well.
Mis-information is worse than no information.
What tools can you recommend to visualize the synteny/comparative genomics of 2-4 species/cultivars, specifically structural variation among species?
Similar to Gale et al 1998 check image.
Thanks
I have a list of rat genomic regions. I have liftovered them to human orthologous regions. For the rat regions that can be successfully converted to human orthologs, I want to find out if the matched rat-human regions are conserved. I can use the phastCons score or phyloP score to evaluate the level of conservation, but these values are based on multiple-species alignment (46 species). What I am really interested in is the conservation between human and rat. I wonder how I can obtain that value? using blastn or fasta or something else?
I trimmed and assembled (Reverse and Forward) my sequences using CLC Main Workbench software. The trimming was meant to remove the poor-quality 3’ and 5’ ends.
I’m going to perform multiple alignment, haplotype diversity analysis and phylogenetic tree construction using other software (DnaSP, Arlequin and MEGA).
These programs need sequences of uniform size (same length).
I have about 120 COI sequences of different sizes (640-750bp). The expected size was 710 bp.
I want to cut them to the same length (e.g. all sequences at 550 bp) so that all of them are of equal size.
How can I do it in MEGA? I have all the consensus sequences in a FASTA file. Is it normal to cut the right end before alignment when the file is imported into MEGA?
Thank you
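Outside MEGA, cutting every sequence to a fixed length is a one-line slice per record. The Python sketch below shows it with a hypothetical helper, `trim_records`, working on (name, sequence) pairs. It assumes all sequences already start at the same position, which in general only holds after they have been aligned, so cutting by position before alignment may remove different regions from different sequences.

```python
def trim_records(records, length):
    """Cut every (name, sequence) record to its first `length` bases.

    Records shorter than `length` cannot be brought to the common size
    and are returned by name in a separate list instead.
    """
    trimmed, too_short = [], []
    for name, seq in records:
        if len(seq) >= length:
            trimmed.append((name, seq[:length]))
        else:
            too_short.append(name)
    return trimmed, too_short
```

Reporting the too-short records separately makes it easy to decide whether to lower the cut-off or leave those sequences out.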
I have a protein sequence. I checked PDB for the purpose of comparative modeling, but I couldn't find any similar protein with a known 3D structure, yet I still want to model my protein. Besides I-TASSER, which software should I use to model my protein?