Next Generation Sequencing - Science method

Explore the latest questions and answers in Next Generation Sequencing, and find Next Generation Sequencing experts.
Questions related to Next Generation Sequencing
Question
18 answers
Relevant answer
Answer
The most important 'skill' to have is PATIENCE, just enough to avoid an early death due to bashing your head against the wall when trying to get convoluted R, C, C++, Python, Perl, etc packages to run. Half your life will be spent resolving dependencies only to find that when it all finally compiles it's useless anyway. Welcome to NGS! ;-)
But seriously, *nix sysadmin skills are *essential*. Usually some flavor of Linux, but you may have to mess around with Solaris if you're running Sun equipment. If you're moving serious quantities of data then you're going to need decent servers (we use 64-bit, 32 CPU, 130GB RAM, 4TB onboard systems with 32TB Sun Thumpers as NAS). Even having competent dedicated sysadmins on deck will not release you from having to have solid command line skills. Get a VPN set up or get familiar with SSH and SSH tunneling (for secure VNC for example), both can work well together and allow you to work happily remotely. Consider DynDNS if you have a dynamic IP at a remote site.
A 'scripting' language is essential. It doesn't matter which, but the bioinformatics community seems to prefer being stuck in the early 1990s with Perl while the rest of the world is moving on. As painful as Perl is compared to something like Python, it's hard to argue against the depth of modules available.
Interpreted languages are getting better and better but you'll still need a compiled language at some point. I prefer C++ (with Boost) because it's slightly less painful than C and almost as fast when compiled with -O3.
Database skills are also essential. MySQL is a good compromise between efficiency and ease of use and nobody can argue with the price. Be sure to configure it to make the most of your systems (use block rather than file storage and tweak my-huge.cnf).
For statistics, R is popular with the bioinformatics community despite the fact that it's one of the most awful hacked-together, inconsistent, illegible, slow lumps of crap in existence. Leveraging other people's R packages can sometimes be a time saver, but if you're really serious then pay the money and use Mathematica.
Finally, there's the important aspect of visualization that often gets forgotten until it's too late. A decent charting package is worth the money. A genome browser that slots in happily with your choice of server-side setup is also advantageous.
If you're in a team of bioinformaticists and can delegate specific roles then great. Otherwise, you're expected to be a Swiss army knife.
Next Generation Sequencing: Is there a source where I can learn it?
Question
3 answers
Do we have any literature sources for learning NGS?
Relevant answer
Answer
Which platform do you want to use/learn?
Question
22 answers
Also, which is better (or most widely used) for metagenomics NGS, FLX or Illumina?
Relevant answer
Answer
It all depends on your budget. As much RAM as you can afford and two sets of disks: one set of expensive disks as fast as you can afford (for calculations) and another set of cheap disks as large as possible for storage.
Question
17 answers
In my opinion, when the analysis is done by NGS, the similarity of the 16S rDNA among environmental bacteria inflates the apparent diversity beyond the real one.
Relevant answer
Answer
Yes, this is at the moment becoming the standard approach to assess diversity in microbial communities. Many labs are now switching to 454 or Illumina reads, using barcoded primers. There is (currently) no way of knowing the true diversity of an unknown sample, and no matter what method one chooses, there will be a biased view of the true diversity. The choice of extraction method, primers, PCR mastermix ingredients, and PCR cycling conditions could all introduce some bias. But as long as one compares samples using the same amplification method, one can still say something about beta diversity between samples.
I hope this helps!
Question
3 answers
I am wondering if people are still resequencing regions flanking an associated SNP, in order to identify hitherto undiscovered common variants which may be the causative SNP? Or is this a bit of an out-of-date strategy after the release of 1000G data?
Relevant answer
Answer
The answer is: it depends. If the goal is to identify hitherto undiscovered *common* variants and your study participants come from one of the population groups represented in the 1000 Genomes Project, resequencing the region flanking an associated SNP is unlikely to find novel variants. However, if the population is not well represented in the 1000G, then resequencing is indicated.
Question
Using small RNA libraries and Illumina sequencing, I would like to detect differentially expressed small RNAs in plant samples and understand their biological meaning and function. I'm using CLC Genomics Workbench 5.5.
Could someone outline the different analysis steps? A scheme would be helpful for a beginner in Next Generation Sequencing data analysis.
Question
1 answer
Given two or more read libraries sampled from their respective tissues (e.g. cancer or non-cancer), we have a method that can assign/classify the reads to their respective 'source genome', but without mapping to a reference genome.
This is inspired by the fact that no two genomes are identical; mapping them to a reference leads to a loss of information.
This method also considers error rates in the reads and coverage.
One possible application is to use it for finding rearrangement breakpoint reads.
What are the other potential applications of such methods?
Relevant answer
Answer
I would be interested to try this with some viral sequence reads, where a population may be a mixture of different clonal types with various mutations present. I am very interested in identifying breakpoints too.
Question
3 answers
I have used the Nextera library kit for the first time and had issues with the reads coming out of the HiSeq 100 being GC-biased, but only for the first few base pairs of the reads. The majority of the reads are good. Just wondering if this is a common issue with Nextera kits.
Relevant answer
Answer
It is common. The transposase that the Nextera kit uses has a GC-insertion bias. I found a good demonstration of this here:
Question
1 answer
I am looking for some DNA to use as an internal standard when we do sequencing. I need to find a species that has never been found in the human gut, but has a medium GC content. If you know where I can order the DNA from, that is even better. I have been looking for something like Nitrosomonas or Rhodopseudomonas, but I am having trouble finding a good source. Any suggestions?
Relevant answer
Answer
Maybe you can go for cyanobacteria.
Question
47 answers
A new wave of scientists from the systems biology field suggest that drug discovery can be improved by the use of biomolecular networks. Can they make it? Should they try? Is it the future of drug discovery?
See what one of them wrote in the NYAS website:
"The most pressing unmet medical needs correspond to complex diseases caused by a combination of genetic and environmental factors. Traditional drug discovery strategies ignore the complexity of biological systems, screening compounds on individual targets rather than focusing on biomolecular networks. Despite growing evidence that the conditions we aim to treat are complex and require the development of treatments that exhibit polypharmacological properties, current drug discovery programs still rely on simplistic approaches during compound selection. Complexity is then considered during the development phase, where the costs and risks are much higher than in the discovery phase. This symposium aims to challenge the "one-target, one-disease" tradition and to discuss design and implementation of biological assays featuring multiple target strategies during the primary discovery steps." (source http://www.nyas.org/Events/Detail.aspx?cid=073a364a-af58-49e0-896e-499e51427b66)
Relevant answer
Answer
Maybe, but not today. Tomorrow morning is also quite unlikely. A single-shot approach has its increasingly apparent limitations, but systems biology is not ready to deliver on the level we have come to expect. Hyper-enthusiasm and over-investment in systems biology didn't work well for Pharma and Biotech, and investor disappointment only casts an unnecessary and undeserved shadow over systems biology research. In fact, there is no need to pitch the systems approach against traditional "mechanistic", "hypothesis-driven", "candidate gene" and so on down the list of approaches. The order of the day for strategists is finding the balance and appropriate packaging for both approaches in R&D. Please don't take this comment as discouraging. Just don't expect the revolution. The riot is over. Prepare to evolve.
Question
11 answers
Let's say I ran forty cancer samples, twenty responders and twenty non-responders to some treatment. I performed alignment and annotation of SNVs and indels in each sample. Now I want to know if there are any differentially affected genes in the two groups that may justify the different response I observed in those tumors. The problem is, simple chi-squares won't do, because a single gene can be affected by multiple potentially deleterious SNVs and indels, and I may have to bin multiple alterations to have enough power. I may therefore tag each gene as affected/unaffected by a potentially deleterious variation, and proceed this way with my analyses. But how do I know if a "potentially" deleterious variation really is so? It's impossible to validate all of them biologically. Or I may restrict my analysis to variations that recur more than once in two different samples, but then again, some genes can be altered in many places, each position just once across samples, and I would overlook them.
Finally, I may consider not genes but pathways, and see whether some pathways are affected by deleterious alterations in responder tumors and not in non-responders, or vice versa; but pathways are not as well defined as one may want to think, and there is a risk of including deleterious variations that have nothing to do with the underlying biology of my samples. It's more philosophy than statistics here. What would you do? What do you recommend?
Relevant answer
Answer
You can get a lot of information about your mutations from the CRAVAT web server (www.cravat.us). The server hosts some machine learning-based tools for the analysis of missense mutations detected in tumor sequencing. This includes functional analysis (is the protein affected?) as well as a method that is cancer-specific (does the mutation look more like a driver or a passenger?). These methods return a continuous score between 0 and 1 that can be used to prioritize mutations, as well as p-value and FDR estimates to help you select a cutoff for deciding which missense mutations to keep. CRAVAT annotates all of your variants with dbSNP IDs, allele frequencies for the 1000 Genomes and ESP6500 populations, and overlap with the COSMIC database; it also provides gene functional annotations from the GeneCards database and the results of a PubMed search for each mutated gene.
If you want to look for more fundamental differences between the groups, you might want to read "Mutational processes molding the genomes of 21 breast cancers" from Mike Stratton's group (Cell 2012). They look at differences in the processes underlying mutation among 21 individuals with Breast Cancer.
Question
13 answers
Specifically, microarray data related to plants infected by pathogens.
Relevant answer
Answer
Esther, please follow me and join the Journal Club initiative. We need scientists like yourself. I have followed your work closely over the years. Very best.
Question
11 answers
I have sequenced a bacterial 16S rRNA coding region using capillary sequencing and then obtained a de novo assembly (DNA sequence) of the same bacterium (same pure culture). After a BLAST search I wasn't able to find any hit between these two sequences. Can anyone suggest a reason for such a disconnect?
Relevant answer
Answer
Which program did you use for the assembly? Some programs treat 16S rRNA as a "repeat" and remove it from the final assembly.
Question
5 answers
Next Generation Sequencing (NGS) technologies have driven down sequencing costs and the time needed to obtain data. It is now almost easy to obtain the complete genome of an organism of interest (faster if re-sequencing). Comparative genomics could then highlight features making our strain of special interest. How affordable is it to apply this vision at large scale through public health microbiology units? Do public health systems working in such a way exist?
Relevant answer
Answer
I guess for (general) public health it is still a bit too expensive and complicated, but this might change during the next decade. For epidemics it is feasible, as was already done during the last EHEC epidemic, where the responsible strain was sequenced and compared to existing strains. See http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0022751
Question
4 answers
Does anybody know how to check if a new SNP is on ANY chip? I am looking for experiments in which a specific SNP has been evaluated.
Relevant answer
Answer
I knew that was the answer 10 years ago. I am surprised to learn that it has not been resolved yet.
Yes, it's a dbSNP SNP.
Question
With the current Next Generation Sequencing, how far does the necessity of chromosome walking go?
Question
8 answers
What is the benefit of using NGS vs. microarray techniques for DNA methylation? It seems that current methylation arrays, e.g. the Infinium HumanMethylation450 used by TCGA, already provide a very high density of coverage: not down to each single nucleotide, but good enough to discover methylation patterns along the genome. So is there any reason we have to use NGS rather than microarrays for DNA methylation studies?
Relevant answer
Answer
It's a balance between practicality and comprehensiveness. NGS will certainly give you a more complete picture of the methylome (at least for the 5mC portion), but it is at least an order of magnitude more expensive and will require substantially more analysis (which is to be expected of any discovery-based project). The array was built with the intention of covering as many methylation sites that have at least some evidence of being interesting/relevant. The sites are heavily biased to coding and promoter regions (although there are some throughout the genome). The 450k array remains quite popular (in a time of decreasing array popularity) because it fits what many methylation researchers need. NGS methods are also popular, but remain out of reach for many due to cost and analysis issues. Unless you're in pure discovery mode, most don't seem to want a lot of data they can't interpret. As the methylome becomes better understood over time, the shift from arrays to NGS will likely accelerate.
Question
6 answers
M220 is great for shearing DNA, but I have heard that it may not be as good with chromatin, and that it is a safer bet to use S220 for that. I would like to have an instrument that has both capabilities, but S220 is so expensive. Any thoughts/suggestions?
Relevant answer
Answer
Not as good as the Covaris - usually we observed a much wider size distribution. I think this is due to the fact that the ultrasound waves are not focused on the tubes as they are in the Covaris. A good thing about the Bioruptor (apart from the price) is that you can process up to 12 samples in one run.
Question
18 answers
We have very small samples from snap-frozen biopsies and need to perform several analyses with them. I have been against preamplification, since I'm afraid it would introduce more bias in an already tricky specimen.
What are your personal opinions on the subject of total RNA preamplification for microarray and other transcriptomics (e.g. transcriptome sequencing) applications?
Relevant answer
Answer
We have used the TargetAmp 2-round aRNA amplification kit 2.0 (Epicentre Biotechnologies) for two rounds of linear amplification of RNA from microdissected fungal samples. It is based on reverse transcription and in vitro transcription from the cDNA (modified Eberwine protocol) and should give a linear amplification (Teichert et al. 2012, BMC Genomics).
Question
3 answers
I have lots of sequencing data for a particular cancer I am studying, and I am a bit confused about the detection of gene fusions. Can anyone suggest what methods and tools are available, and which is best for identifying gene fusions in an incompletely annotated genome?
Relevant answer
Answer
There are many free online tools available (FusionFinder, deFuse, etc.). Each one will give a set of candidate gene fusions, and you may or may not get the important one. Are you working on cancer samples?
Question
3 answers
I'm looking for tools working under a Windows environment with Java that could be used as alternatives to vcftools or even GATK.
Relevant answer
Answer
I haven't seen anything similar. I have a perl script that I usually use to filter and convert vcf files into tables or fasta alignments. I think there is a command line vcftools package that does similar things, but I haven't seen anything in java.
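This is not the Perl script mentioned above, but the same idea can be sketched in a few lines of Python: skip the header lines and keep the fixed VCF columns as table rows (the record shown is hypothetical):

```python
# Minimal VCF-to-table sketch: skip "#" header lines and keep the first
# fixed columns (CHROM, POS, REF, ALT) of each record.
def vcf_to_table(vcf_lines):
    rows = []
    for line in vcf_lines:
        if line.startswith("#"):  # file-format and column-name lines
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        rows.append((chrom, pos, ref, alt))
    return rows

# Hypothetical VCF record.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.",
]
print(vcf_to_table(vcf))  # -> [('chr1', '12345', 'A', 'G')]
```

Real-world filtering (quality thresholds, multi-allelic sites) is what vcftools handles; this only shows the table-conversion step.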
Question
2 answers
Differences between old and new versions of samtools
Relevant answer
Answer
Hi Shalini, you have probably already changed variant-calling tools by now, but from my own experience I can say that you may miss variants with Samtools, especially indels. That said, I think it is also the case with GATK.
Question
5 answers
I can't find anything on the Affymetrix website.
Relevant answer
Answer
It is best you write to them and ask. People are quite prompt in replying. When you do, make sure that you explain the context of your experiments so that they can give you the best possible advice. I don't think we are allowed to post costs that are confidential on an open forum. I will inbox you with the costs on Geochips from our quote.
Remember that you can cut the costs down by reducing replicates and doing qPCR for statistical purposes on specific function or taxa.
Question
16 answers
Num. reads assembled: 1669184
Avg. total coverage: 16.60
Number of contigs: 13991
Largest contig: 112360
N50 contig size: 1171
That is the summary I got from a de novo assembly (Ion Torrent data, MIRA3 assembler).
I would like a suggestion as to how you would rate this sequence.
It is a bacterial genome, expected to contain a few plasmids.
How would you rate the quality of this assembly?
I have also tried Newbler, which gives:
numAlignedReads = 1647777
N50ContigSize = 955
numberOfContigs = 26930
largestContigSize = 129390
What do I do with these now? Suggestions welcome.
Relevant answer
Answer
MIRA has been said to be a good assembler for bacteria (with small genomes). However, its assemblies improve considerably after an extra assembly pass with CAP3; the MIRA + CAP3 combination is among the best described for bacterial genomes.
In the case you describe, MIRA has clearly performed a better assembly than Newbler: a higher N50 and fewer contigs, which may indicate less redundancy. To check for redundancy, I would recommend running a BLASTn of your contigs against themselves; I would expect fewer hits for the MIRA assembly than for the Newbler one.
If I were you, I would choose the MIRA assembly, but a BLAST analysis will confirm it.
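When weighing two assemblies like this, it can also help to recompute summary statistics such as N50 yourself. A minimal Python sketch (the contig lengths below are illustrative, not the poster's data):

```python
# N50: the contig length L such that contigs of length >= L together
# cover at least half of the total assembly.
def n50(contig_lengths):
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

print(n50([100, 200, 300, 400, 500]))  # -> 400
```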
Question
14 answers
I have been working on data analyzed with Samtools or GATK, and in both cases indels were badly detected. In particular, duplications or small insertions were missed by both tools. I had already heard about this problem, but I had thought that I would rather see false positives than false negatives. Do you know of any algorithm that performs better on this precise point?
Question
2 answers
To my knowledge, 12 types of cancer have been surveyed by whole-exome sequencing: leukemia, ovarian cancer, head and neck squamous cell carcinoma, renal carcinoma, melanoma, pancreatic cancer, gastric cancer, colon cancer, prostate cancer, lung cancer, brain cancer, and breast cancer. I do not count whole-genome sequencing or RNA-Seq. Do you know of any tumor types I missed?
Relevant answer
Answer
Thank you. I missed lymphoma and myeloma as well. So many cancer types already have been sequenced.
Question
17 answers
Is there any algorithmic approach to find adapter traces within the reads efficiently?
Relevant answer
Answer
I can also recommend cutadapt (http://code.google.com/p/cutadapt/). You can specify where to look for adaptor sequences, and even wildcards can be set within the adaptor.
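For intuition about what such tools do, the core of 3' adapter removal is just locating the adapter and cutting the read there. A naive exact-match Python sketch (cutadapt additionally handles mismatches, partial adapter matches, and quality trimming; the adapter shown is the common Illumina sequence):

```python
# Naive 3' adapter trimming by exact match.
ADAPTER = "AGATCGGAAGAGC"  # common Illumina adapter prefix

def trim_adapter(read, adapter=ADAPTER):
    idx = read.find(adapter)
    return read[:idx] if idx != -1 else read

print(trim_adapter("ACGTACGT" + ADAPTER + "TTTT"))  # -> ACGTACGT
print(trim_adapter("ACGTACGT"))  # unchanged: no adapter found
```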
Question
9 answers
I want to see differential expression, novel transcripts, novel splice variants, and some information about non-coding RNAs too. Should I enrich my sample with poly(T) selection, or just go for sequencing without any mRNA enrichment? Let me know the pros and cons of both strategies.
Relevant answer
Answer
If you polyA+ select your RNA, you're going to deplete the ncRNA species. That doesn't mean you won't see them (I've seen miRNA and lncRNA in pA+ selected RNA), but I would have concerns about the accuracy of their quantification.
If you're using a model organism, a RiboZero-like approach might be better suited, preserving ncRNA in the sample. Be aware that you will have to sequence more raw reads to get the same sensitivity for alt-splicing etc. than if you had pA+ selected the RNA in the first place.
Question
16 answers
We do not have access to liquid nitrogen or dry ice, so I need to determine another method of storing fresh tumor biopsy material for up to 48 hours prior to DNA extraction for whole exome/genome sequence. Can the samples be stored at room temperature? Does anyone have any suggestions?
Relevant answer
Answer
RNAlater. We get optimal results storing small biopsies in it for RNA and DNA extraction for microarray analysis, and NGS has similar RNA/DNA quality requirements. In RNAlater, nucleic acids are stable for several days at room temperature and for extended times at -20°C. RNAlater should not be stored at -80°C.
Question
32 answers
I have a GFF file containing positions of annotations (Transposons, mRNA, etc.), and the FASTA file of the reference genome (i.e. Vitis vinifera).
I would like to transfer the GFF into a FASTA file containing the sequences corresponding to the information included in the GFF.
Relevant answer
Answer
Hi Giovanni, if you don't want to install any software you can do it online with the Galaxy platform. It is a really nice platform that you can use for a lot of other analyses.
For your particular problem, it's really easy:
1) Click on Get Data on the left panel.
2) Select gff as the file format.
3) Add your GFF file by clicking Choose File, or paste it directly into the input box.
4) Select your reference genome and click Execute.
5) After the file loads, click Fetch Sequences on the left panel, then Extract Genomic DNA using coordinates.
6) Click Execute after selecting your file, and it will create a new FASTA file with your sequences (see the right panel) that you can visualize in the browser or download.
Good luck!
Luca
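If you later want a scriptable alternative to Galaxy, the only subtle part is the coordinate convention: GFF is 1-based with inclusive ends. A minimal Python sketch using a toy genome (not Vitis vinifera):

```python
# Extract a subsequence given GFF-style coordinates (1-based, inclusive).
def extract(genome, chrom, start, end, strand="+"):
    seq = genome[chrom][start - 1:end]   # convert to 0-based slicing
    if strand == "-":
        comp = str.maketrans("ACGTacgt", "TGCAtgca")
        seq = seq.translate(comp)[::-1]  # reverse complement
    return seq

genome = {"chr1": "AACCGGTTAACC"}        # toy sequence
print(extract(genome, "chr1", 3, 6))        # -> CCGG
print(extract(genome, "chr1", 5, 8, "-"))   # -> AACC
```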
Question
1 answer
I intend to perform Illumina sequencing of my RNA samples, but due to degradation problems I would like to add RNase Out to the sample to improve its quality. Is it a problem for Illumina platforms?
Relevant answer
Answer
I do not know for sure, but the very idea of adding a protein (and RNase OUT is a protein!) to your RNA sample does not sound good. If you have problems with the stability of your RNA, try DEPC-treated water to dissolve it instead.
Question
1 answer
There are many ChIP-seq peak callers available, such as MACS, ERANGE, SISSRS, CisGenome et al. Which one do you favour and why?
Relevant answer
Answer
I usually go for Homer; try it once. I checked the results against MACS and QuEST: Homer calls all of the MACS peaks and 95% of the validated peaks. It also has annotation and motif-prediction modules, it is faster, and the concept is easy to understand.
Question
1 answer
We have samples from a typical pedigree (n=10, with 4 patients). How can we make a reasonable genetic analysis to search for candidate genes, given limited funding? With STR markers? A SNP array? A SNP + CNV array? Exome sequencing?
Relevant answer
Answer
Use genetic analysis software (preferably PLINK) to perform the SNP association analysis, plus stratified and haplotype analyses. Use statistical methods to identify the associations. You can also opt for enrichment analysis of those genes to detect disease associations. Just a thought.
Question
3 answers
Is there any identifier within the SAM files which can give me the number of reads aligned to a particular chromosomal region?
Relevant answer
Answer
Hi,
BEDTools (http://code.google.com/p/bedtools/) can do that.
First of all, create a BED file (region.bed) of your chromosomal region; you can put more than one region into the file, for example:
chr1 10 50
chr1 500 1000
Then run the intersectBed tool from the BEDTools suite, for example:
> intersectBed -abam reads.bam -b region.bed -bed
The output is the list of reads mapping to that region, with additional information about the strand etc.
If you just want the number of mapping reads, you can pipe the output into "wc -l":
> intersectBed -abam reads.bam -b region.bed -bed | wc -l
Hope you find it useful.
Best regards
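If you only have plain SAM text and no BEDTools at hand, an approximate count can be scripted directly. A Python sketch with hypothetical records; it uses the read length as the alignment span and ignores the CIGAR string, so intersectBed stays the accurate option:

```python
# Count reads overlapping [start, end] on a chromosome in plain SAM text.
# Approximation: alignment span = POS .. POS + len(SEQ) - 1 (CIGAR ignored).
def count_reads_in_region(sam_lines, chrom, start, end):
    count = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        rname, pos, seq = fields[2], int(fields[3]), fields[9]
        if rname == chrom and pos <= end and pos + len(seq) - 1 >= start:
            count += 1
    return count

# Two toy alignment records (11 mandatory SAM fields).
sam = [
    "read1\t0\tchr1\t15\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_reads_in_region(sam, "chr1", 10, 50))  # -> 1
```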
Question
2 answers
I would like to use a DB that is exclusively human and a perhaps separate one that has been experimentally established to aim at bacteria/viruses. I would like to sieve through some NGS data and compare RNA interference profiles between healthy humans and disease carriers.
Relevant answer
Answer
For non-commercial purposes, a siRNA database is available at http://siRNA.cgb.ki.se, which has data from the literature as well as computational inference. For experimentally verified siRNAs, search using AOsearch (http://aosearch.cgb.ki.se). miRBase is a database for miRNAs and is available at http://microrna.sanger.ac.uk/
Question
8 answers
Due to the specifics of the BLAST algorithm, short sequences cannot be used as queries. Thus, I wonder if there are other methods to search for homology of short sequences against ESTs or mRNAs. For example, can BLAST be performed after the sequences are assembled into contigs?
Relevant answer
Answer
Your statement that short sequences cannot be used is not correct. For some time now, the BLAST algorithm has included specific modifications just for short sequence searches (since around 2007 or 2008). It has the ability to "Search for short, nearly exact sequences" which was designed to work with sequences of between 7 and 20 bp in length (e.g. motif finding).
The equivalent protein search is designed around queries of 5-15 residues.
There are details of all of the various program choice options on the BLAST help pages under the Program Selection Guide. And if you want detailed technical help with optimizing your search, I would email your question to the NCBI staff at blast-help@ncbi.nlm.nih.gov, as they are the foremost experts on just what BLAST is capable of.
Question
8 answers
I have a dataset of reads collected from the Illumina platform. From these raw data I want to generate a high-quality read dataset. Can anyone suggest any freely available software?
Relevant answer
Answer
You can also try the Galaxy server. That is a cloud-based "suite" of programs. Most of the freely available aligners are available on Galaxy, as are quality-assurance and filtering programs. Moreover, they have a good set of tutorials, so you should be able to start off nicely.
Seqanswers (as mentioned earlier by Fabio) is a good resource. You may also want to check out http://www.biostars.org/, which is also a question-and-answer forum and probably much more active than here.
Good luck.
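As a complement to the Galaxy filtering tools, mean-quality filtering is simple enough to sketch directly. A minimal Python example, assuming the Sanger/Illumina 1.8+ Phred encoding with ASCII offset 33 (older Illumina pipelines used offset 64):

```python
# Mean Phred score of a FASTQ quality string.
def mean_quality(qual, offset=33):
    return sum(ord(c) - offset for c in qual) / len(qual)

# Keep a read only if its mean quality clears a threshold.
print(mean_quality("IIII"))        # -> 40.0  ('I' encodes Q40 at offset 33)
print(mean_quality("IIII") >= 20)  # -> True
```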
Question
2 answers
Does anyone have feedback on Ion Torrent technology for detection of CNVs (copy number variation) in small genomes (like yeast)? Is this technique really cheaper than Illumina?
Relevant answer
Answer
Don't hesitate on the Ion Torrent. We have just bought one and are very happy with it.
There are numerous advantages to the Ion Torrent (cost, time, ease of use), but the biggest is the size of the raw data files. With Illumina the files are REALLY big, as it takes a hi-res photo every few minutes for a week; the Ion Torrent just records pH or conductivity.
The cost per base is much lower for the Ion Torrent when data storage is included, but more expensive when storage is excluded.
We have not done CNV analyses yet, but this is not a machine problem, it is a data-analysis problem.
However, the different read lengths may be an advantage for the Ion Torrent in CNV analyses.
Jon
PS: the Ion Torrent will produce up to 12 million reads (typically 9 million) of up to 400 bp (we are still running the 200 bp kits) in about 4 hours. It also has a low sequencing error rate, which we have calculated to be less than 1 in 100,000 bp sequenced.
Question
6 answers
While processing Illumina paired-end reads, my dataset gave rise to both paired-end and single-end reads due to read filtering based on Phred score. Is there any way to map both datasets to the reference genome and generate a single SAM file using the bwa aligner?
Relevant answer
Answer
Hi,
I am not sure if you need to map everything at the same time, but if the goal is to obtain a single sam/bam file you can:
1) Map independently both datasets with bwa or other mapping tool.
2) Convert the sam files to bam files with samtools view -Sb -o file.bam file.sam
3) Merge bam files with samtools merge out.bam bam1 bam2
Good luck,
Question
Generally in Illumina paired-end sequencing we miss out on het indels, and it looks like this is a common problem. Can anyone suggest something to overcome this and help in calling these het indels?
Question
1 answer
I mean that, due to their algorithms, some aligners tend to align methylated sequences to the genome more efficiently, or vice versa. Can anyone explain with some examples, please?
Relevant answer
Answer
We wrote a paper last year comparing the aligners and commented on some of these aspects. You might find it useful: http://nar.oxfordjournals.org/content/40/10/e79.short
The Next NGS Challenge Conference - Data Processing and Integration
Question
The Next NGS Challenge Conference (http://www.thenextngschallenge.org/) is a joint event of the EU COST Action SeqAhead, EMBnet, the International Society for Computational Biology (ISCB) and the FP7 project STATegra, and aims to become a dedicated meeting on cutting-edge Next Generation Sequencing applications, presenting the most innovative bioinformatics approaches in NGS data analysis.
The Conference will be preceded by the EMBnet Workshop on NGS data analysis on 14th May.
Morning session: "RNA-seq and ChIP-seq data analysis" by Endre Barta (U. of Debrecen, Hungary) and Eija Korpelainen (CSC Institute, Finland).
Afternoon session: "NGS and structural biology" by Goran Neshich (EMBRAPA, Brazil), Jose R. Valverde (CNB-CSIC, Spain) and Gert Vriend (CMBI, The Netherlands).
Deadline for registration is the 15th of April. Abstract submission deadline is the 15th of February.
Question
Loven et al. 2012 (Revisiting global gene expression analysis. Cell 151: 476-482) addresses a significant issue with current expression profiling methodologies - namely, the untested (and apparently unwarranted) assumption that cells being compared produce equivalent amounts of RNA/cell. We have previously addressed this issue in a different context (plant polyploids - Coate and Doyle 2010, see attached reprint), and demonstrated that an allopolyploid has 1.4x more RNA per cell than its diploid progenitors. The fact that various other cell types (including tumor cells, embryonic stem cells, lymphocytes) are also likely to vary dramatically in RNA/cell (transcriptome size) suggests that transcriptome size variation is a significant issue affecting many expression profiling experiments.
Loven et al. propose an RNA spike-in approach to address this problem, which they demonstrate to be effective with cell cultures. In Coate & Doyle (2010) we described a different approach that also works with solid tissues and doesn't require cell counting. Any feedback on our approach would be welcomed!
Finally, a question: with the typical normalization methods used today (e.g. RPKM for RNA-Seq), expression is reported as a fraction of the total RNA input. As explained in Loven et al. (2012) and Coate & Doyle (2010), this estimate (which we call expression per transcriptome) can differ significantly from expression per cell. My question is which of these measures is biologically most relevant? Absolute expression per cell (or per gene) might seem the obvious choice, but given that gene products interact with other gene products in a cellular context, expression relative to the expression of other genes (as estimated by RPKM and standard microarray normalization methods) could be what determines the biological response. Any thoughts?
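As a toy illustration of the distinction discussed above, here is a minimal Python sketch contrasting per-transcriptome RPKM with a per-cell rescaling. The counts, gene length, and read totals are illustrative placeholders, not real data; only the 1.4x transcriptome-size factor comes from the Coate & Doyle (2010) example.

```python
# Sketch: RPKM (expression per transcriptome) vs. a per-cell rescaling.
# All numeric inputs below are made-up illustrative values.

def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)

def per_cell(rpkm_value, transcriptome_size_factor):
    """Rescale a per-transcriptome value by relative RNA content per cell."""
    return rpkm_value * transcriptome_size_factor

# Identical RPKM in a diploid and a polyploid sample...
r = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=20_000_000)
# ...but if the polyploid makes 1.4x more RNA per cell, per-cell
# expression differs even though the per-transcriptome value is the same.
diploid_cell = per_cell(r, 1.0)
polyploid_cell = per_cell(r, 1.4)
print(r, diploid_cell, polyploid_cell)  # 12.5 12.5 17.5
```

This is exactly the divergence between "expression per transcriptome" and "expression per cell" that the question raises: the normalization step silently divides out transcriptome size.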
Question
14 answers
Dear all: I'm working on ChIP-seq data. I've downloaded data in SRA format. To analyse it, I have to convert it to FASTQ format and then to a SAM file. I have made a FASTQ file, but I can't convert it to a SAM file. Does anyone know how I can do this?
Relevant answer
Answer
A FASTQ file stores raw sequences and quality scores, whereas a SAM file stores tags mapped to a genome. You cannot "convert" a FASTQ file into a SAM file; rather, you have to map the sequences in the FASTQ file to the reference genome to obtain SAM output. I normally use bowtie to do this:
bowtie --best --strata -m1 --sam -l 36 -n 3 h_sapiens_ncbi36 -q SAMPLE.fastq > SAMPLE.sam
Hope this helps.
Best,
r
Question
1 answer
Developing a good CLIP-Seq methodology
Relevant answer
Answer
Hi,
Personally, I would choose UV irradiation for the first attempt because it is simpler and you don't have to clean any chemical residue out of your samples. In some cases, though, UV crosslinking is quite tricky, because some proteins crosslink to RNA better than others (it is protein dependent). In those cases you should try chemical crosslinking.
Good luck!
Question
I could imagine that a SNP on X in males would be called ref/ref or alt/alt, but not ref/alt...
Question
17 answers
I have to isolate total RNA from needle biopsy tissue samples for microRNA profiling by next-generation sequencing. I have purchased Ambion mirVana kits, but after repeated attempts (I have used 6 different tissue samples so far), I am unable to obtain a good yield and OD values. The maximum yield I am getting from 4 mg of tissue is no more than 70-120 ng/ul when dissolved in 50 ul of nuclease-free water, and the 260/230 value is always between 0.75 and 1.8. I have tried different steps to optimize the yield from the mirVana kit, but with no success. If anyone has experience with this kit, please share your protocol so I can improve total RNA yield from my samples without wasting any more samples. The samples have been stored in RNAlater and there is no degradation as measured by Bioanalyzer (RIN > 8).
Relevant answer
Answer
It depends on the protocol. If the first step is to size-fractionate the small RNAs by denaturing PAGE and subsequently elute the small RNAs from the gel, I would think that it causes no problems. The gel will purify the RNA from this contamination source and replace it with another... depending on how pure the following solutions are :)
I was looking in my lab book at my NGS run (already 4 years ago), and I actually did the sequencing with RNA that had 260/280 = 2.09 and 260/230 = 0.7 (so this really surprised me just now ^^). The Bioanalyzer results were good (I can't find the RIN number at the moment), but the 28S/18S ratio was 1.6, which was the minimum requirement. So not the very best RNA, but OK, and I was happy with the results. RNA integrity is the most critical part, I think ;)
Question
15 answers
SNP array? SNP array + low coverage genome sequencing? CNVs?
Relevant answer
Answer
As August mentioned, GWAS is very much dependent on the phenotype you are looking at. Of course, when you are looking at a phenotype like breast cancer or the common cold, you cannot go for linkage studies, and the load of common variants will be much greater than that of rare variants. Nowadays labs like Eichler's and people from NHGRI/NIH concentrate on CNVs, but a complementary approach is always good: case-control analysis of variant allele burden. If you have good funding, then exome sequence small families or sib pairs and confirm by Sanger. GWAS with CNVs through CGH arrays is helpful. Low-coverage genome sequencing will give you false positives, so you will simply be fishing things out of a huge pool of variants, and the confirmation will take a long time.
SNP arrays and CGH arrays should be feasible on Affymetrix. For power, your sample size matters, and you must be aware that no one knows how big "Big" really is!
Missing rare variants in GWAS can happen, as can missing alleles of moderate effect; only common variants can be picked up with good confidence. Your sample access and phenotype should decide the approach. University of Washington and 1000 Genomes data show allele frequencies all over the world; if you find your disorder represented as common or rare in terms of the variants reported in the genes through their SNPs, then decide accordingly.
Question
2 answers
We want to know if it is an available method.
Relevant answer
Answer
An Ion AmpliSeq panel has been available for a couple years now.
example: PMID: 25556971
An updated panel is due out by end of year.
Question
7 answers
I failed to make it work on either Illumina PE HiSeq2000 data (stuck at mery) or Roche 454 FLX data (stuck at creating overlaps).
Relevant answer
Answer
Sorry, I forgot to give more detail.
I use a spec file, which will automatically convert SFF to FRG via gatekeeper.
For Illumina, yes, I convert it to FRG and use a spec file to run it.
However, I still fail >_<
Question
14 answers
I'm trying to compare gene expression profiles using RNA-seq (100 base, Single Read). I want to multiplex my samples but I'm not able to determine the coverage.
I'm working on human cell lines.
Relevant answer
Answer
The bigger question (that many people seem to just 'forget' when it comes to NGS) is how many biological replicates do you need to answer your research question? Unless the point of your project is to dig as deeply as possible into the transcriptome of a particular sample, you'll be comparing multiple samples against each other. How many biological replicates do you need for each condition? (I would say the bare minimum would be three, but it depends on how much variation is present in your samples - lots for humans, not as much for mice from the same strain.) How many conditions are you testing? (These two together will tell you how many libraries you need to create.)
Once you know the total number of samples, then you need to decide how deeply each needs to be sequenced. As Michal said, 10-20M reads is ok for basic 'array level' type data (comparison of medium to highly expressed genes, maybe some splice junctions). Getting into the 50-100M range is what you'll need for measuring the rare transcripts, looking at SNPs and discovering novel splice junctions, fused genes, etc.
Then, multiply the number of samples times the number of reads to determine your total sequencing needs. If that fits within your budget, you're golden! If not, you'll need to make some compromises. You'll either have to:
1) secure a bigger budget
2) reduce the scope of the experiment so that it fits your budget, but be aware that you'll have to adjust your goals accordingly. Something has to give - you either have to reduce the number of tested conditions (but NOT biological replicates) or the depth at which each sample is sequenced (with the understanding that this will limit what you'll be able to discover about your samples).
Good luck!
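The budgeting arithmetic in the answer above can be sketched in a few lines of Python. All of the numbers here (conditions, replicates, reads per sample, lane yield) are illustrative assumptions, not recommendations.

```python
# Back-of-envelope sequencing budget: samples x depth -> lanes needed.
# Every numeric input is a made-up example value.

def total_reads_needed(conditions, replicates_per_condition, reads_per_sample):
    """Total reads = number of libraries x desired depth per library."""
    return conditions * replicates_per_condition * reads_per_sample

def lanes_needed(total_reads, reads_per_lane):
    """Round up: you cannot buy a fraction of a lane."""
    return -(-total_reads // reads_per_lane)

# 4 conditions x 3 biological replicates at 20M reads each,
# on a hypothetical lane yielding 150M single reads:
total = total_reads_needed(conditions=4, replicates_per_condition=3,
                           reads_per_sample=20_000_000)
lanes = lanes_needed(total, reads_per_lane=150_000_000)
print(total, lanes)  # 240000000 2
```

If the lane count exceeds the budget, the trade-offs are exactly those listed above: fewer conditions or shallower depth, but not fewer replicates.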
Question
5 answers
In my lab, a target gene was knocked down (RNAi). RNAi and Control samples were separately prepared to construct libraries of small RNAs that were deep sequenced on the Illumina platform (36 cycles). The sequencing results (performed in triplicate) showed a clear differential pattern in the length distribution of reads when Control and RNAi samples were compared. Most reads from RNAi samples range from 10 to 17 nt and many of them are rRNA, whereas reads generated from Control samples are longer than 20 nt. What do these differential distribution patterns mean? How can the high abundance of ribosomal RNA in the RNAi samples and its low abundance in Controls be explained? Do you have similar results and/or advice on how to interpret these data? Do you know any articles discussing this? Please share your expertise and beliefs.
Relevant answer
Answer
Hello Francis,
Have the knock-down and control samples been prepared only once (followed by three sequencing runs on the same pair of samples)? Then it could be an accident, but if you had three pairs of independent transfection reactions followed by three pairs of RNA preps, then I am at a loss. A few things I would think about:
How do you isolate/prepare RNA for library construction? Have you checked general RNA quality before library construction, or did you start out with only short RNA?
One more question: what was your control? I would suggest using a si/shRNA with scrambled or unrelated sequence.
Good luck,
Vera
Question
7 answers
I have hybrid data from 454 + Illumina + SOLiD, which I have assembled into 157,439 contigs covering around 1.3 Gb of the genome. Now I want to create scaffolds using 20 kb 454 paired-end reads (3 libraries) and BAC sequences.
Which tool can I use? I want to assemble a genome of 850 Mb.
Relevant answer
Answer
The CLC bio Genomics Workbench allows hybrid assemblies from these different sequencing technologies. I have personally done a hybrid assembly using Illumina and 454 sequence data, and I know they support SOLiD data as well. I believe they have a 30-day trial license that you can download.
Question
2 answers
List of the algorithms in a compact form.
Relevant answer
Answer
I assume you mean the algorithms used to analyse NGS data. It depends what you want to do. If you are doing de novo sequencing, then assembling the NGS reads is the first step before further analysis. If you are doing resequencing, then you need to map to a reference genome, and so on!
Question
6 answers
What refined analysis is possible and are they reliable?
Relevant answer
Answer
It depends on the purpose. For me, it is useful for detecting the background noise of antisense transcription in a strand-specific library and for estimating the detection limit for transcript amounts in my experiment. But it may also affect the transcriptome sequencing results somehow.
Question
5 answers
I heard that there is a place in China that performs about a quarter of all Illumina sequences in the world, do you know the place? Also of course if there are other worthy places I would be happy to hear about them....
Thanks a lot...
Relevant answer
Question
9 answers
I am trying to assemble hybrid reads using PE and BAC reads. I have shredded the sequences into 1 kb pieces with 800 bp overlap and I am assembling with Newbler. I am giving 1 GB of input but getting 500 MB of output. Can anyone suggest an alternative method?
Relevant answer
Answer
This paper is a little old but might show some hints, or maybe it has been referred to in follow-up publications: http://122.205.95.40/files/publication/62.file.Rice%202009,%20De%20novo%20next%20generation%20seq%20of%20plant%20genomes.pdf
The other thing you could try is to ask in seqanswers.com as someone there may have tried to address the problem you are working with. Good luck!
Question
8 answers
I'm looking for a software tool to call copy number variations (CNV's) from NGS exome sequence data. There seem to be many possibilities (CNVnator, ExomeCNV, cn.MOPS, control-FREEC, BreakDancer.... etc.) Has anyone compared or evaluated any of these tools? Any recommendations for something that's easy to use, speedy, and accurate?
Relevant answer
Question
12 answers
I would like to know what the general consensus is regarding cut-off values for gene expression fold changes (is it mainly >2 up and down-regulated?). Also, is this cut-off applied together with the cut-off for p-value which is p<0.05?
Relevant answer
Answer
No such "consensus" value :)
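Although, as the answer notes, there is no true consensus, a commonly seen rule of thumb is |fold change| >= 2 together with p < 0.05. A minimal Python sketch of that filter, on made-up gene values:

```python
# Common (but not universal) DE filter: |fold change| >= 2 AND p < 0.05.
# The gene tuples below are made-up illustrative values.
import math

genes = [
    ("geneA", 2.5, 0.01),   # (name, fold change, adjusted p-value)
    ("geneB", 1.3, 0.001),  # significant p, but fold change too small
    ("geneC", 0.4, 0.04),   # 0.4 = 2.5-fold DOWN-regulated
    ("geneD", 3.0, 0.20),   # large fold change, but not significant
]

def passes(fold_change, p, fc_cutoff=2.0, p_cutoff=0.05):
    # Working in log2 space treats up- and down-regulation symmetrically:
    # |log2 FC| >= log2(2) = 1 catches both 2x up and 2x down.
    return abs(math.log2(fold_change)) >= math.log2(fc_cutoff) and p < p_cutoff

hits = [name for name, fc, p in genes if passes(fc, p)]
print(hits)  # ['geneA', 'geneC']
```

Working on log2 fold changes is the usual choice precisely because a 2x decrease (FC = 0.5) would otherwise look "smaller" than a 2x increase (FC = 2).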
Question
4 answers
I'm looking for alternative software for gene and functional prediction, metabolic pathway construction, etc. Is anyone willing to share their experience in yeast gene and functional prediction?
Relevant answer
Answer
I guess you have already mapped to your reference genome to get the consensus sequence. Once you have the consensus sequence, upload it into Artemis; it will look for all ORFs in that sequence. After that, you can do all the necessary annotation steps (e.g. running BLASTs from Artemis for function prediction). Otherwise, you have to go step by step: first with Glimmer or GeneWise for gene prediction on your consensus sequence, and so on.
Question
10 answers
I have analysed my Illumina NGS data (RRBS), and I have fragment information (chromosome, genomic position, etc.). I would like to search against databases to see whether any common SNPs are present in my fragments. I presume I would use dbSNP (build 131 or 135, maybe), but if anyone can detail the process or give me advice that will enable me to do this search quickly, it would be much appreciated.
Relevant answer
Answer
Have you tried using Galaxy to compare your fragments to dbSNP? What exactly are you trying to do?
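If the goal is simply to ask which known SNP positions fall inside each fragment, the core operation is an interval query. Here is a minimal Python sketch with made-up coordinates; in practice you would intersect your fragments with a dbSNP VCF/BED using a tool such as bedtools or Galaxy rather than code it by hand.

```python
# Toy interval lookup: which known SNP positions fall inside a fragment?
# SNP positions and fragment coordinates below are made-up examples.
from bisect import bisect_left, bisect_right

# Known SNP positions per chromosome, kept sorted (e.g. parsed from dbSNP)
snps = {"chr1": [1005, 1500, 2200], "chr2": [300, 900]}

def snps_in_fragment(chrom, start, end):
    """Return known SNP positions falling inside [start, end] (inclusive)."""
    positions = snps.get(chrom, [])
    lo = bisect_left(positions, start)   # first position >= start
    hi = bisect_right(positions, end)    # first position > end
    return positions[lo:hi]

print(snps_in_fragment("chr1", 1000, 1600))  # [1005, 1500]
print(snps_in_fragment("chr2", 1000, 2000))  # []
```

Because the positions are sorted, each fragment query is O(log n) instead of a linear scan, which matters when checking millions of RRBS fragments.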
Question
8 answers
Where can I get complete information regarding next-generation sequencing/deep sequencing data analysis?
Relevant answer
Answer
R/Bioconductor software may be useful for your problem. It has many packages for analysing NGS data (alignment, differential expression, annotation, etc.).
Question
6 answers
Is there anyone interested in transcriptomic analysis of Sporothrix schenckii?
Unfortunately my research group is not able to use NGS and I would like to sequence the cDNAs of this very important pathogenic fungus. Currently there are no data about transcriptomics of S. schenckii.
Anyone interested please contact me.
Orazio
Relevant answer
Answer
Hi, we can do the analysis if the sequencing is done. Let me know if you have the sequencing done and we can help with the analysis.
Question
3 answers
I am planning to do Illumina HiSeq sequencing of a relatively small number of exons for 96 individuals (snakes). I am thinking about using Illumina's Nextera 96-indexing DNA library prep kit for barcoding and prepping the sequencing libraries, in combination with the MySelect kit for custom sequence capture. These 96 barcoded sequence-capture libraries will then be pooled in a single lane on the HiSeq2000. Has anyone tried the Nextera kit and the MySelect kit in combination? If so, was it successful? I am most worried that the small starting amount of DNA for the Nextera kit will cause targeted exons to be missed. Thanks for any advice you can provide.
Relevant answer
Answer
Please note that MYselect has been renamed MYbaits. Same exact product, just a new name that better reflects the kit content.
Question
I cannot make the newbler GUI run after installation and restart; can anyone help me solve this issue?
I'm using CentOS 6.2 and Redhat 6. Both have the same problem >_<
Question
I'm still a newbie in NGS and my knowledge is limited to a few software packages. I'm trying to explore more software and methods to broaden my knowledge.
Question
2 answers
After quality filtering, assembly, and validation, is there anything else, even something small, that a user could be missing? I'm wondering what the exact definition of advanced assembly analysis is.
Relevant answer
Answer
Annotation. That is something that is very tricky, and there is no standard definition of, or answer to, annotation.
Question
10 answers
Looking for user experience feedback
Relevant answer
Answer
Meraculous is the latest assembler and it is working well; you can control the amount of RAM used and the number of threads, and so far I am finding it very good. Another is SGA, which came 3rd in the Assemblathon behind ALLPATHS and SOAP. You can try that. Here is one useful link: http://gage.cbcb.umd.edu/
Question
7 answers
I realize it's a big topic, but I'd be curious to hear people's opinions about the best ways to rank genetic variants obtained from an NGS procedure by their probability of being disease causing. I'm aware of SIFT, PolyPhen, MutationTaster, etc., but these all seem to give conflicting results much of the time. What tools do you use? What techniques or algorithms have you developed? Is it even possible?
Relevant answer
Answer
The original VAAST paper (http://www.ncbi.nlm.nih.gov/pubmed/21700766) had a number of comparisons along those lines. A new paper which will be submitted in the next few weeks has a much more extensive comparison of VAAST as both a variant prioritization tool, and a disease gene prioritization tool. You could check back on the VAAST website mentioned above or follow the VAAST mailing list to get updates on publication of the new paper.
Question
1 answer
We have developed our own tool for compression, which saves ~70% while creating "regular" BAM files. But we are seeing all these new space-efficient formats, and I want to hear from people who are actually using them whether they have hidden caveats and what performance I should expect.
Relevant answer
Answer
I developed my own (rejected from "bioinformatics"). Simply remove extra or computable fields, cap the maximal quality you care about (we used 23) and use the '=' format, and you will save ~80% of the disk space.
The result is still BAM, and many programs will work with it without "decompression".
I can share the tool if you want.
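The quality-capping idea in the answer above can be sketched on FASTQ-style quality strings (Phred+33 encoding). This is a toy stand-in only: the answer's tool works at the BAM level, which would require a library such as pysam, and the cap of 23 is simply the value the answer mentions.

```python
# Sketch: clamp per-base Phred qualities to a ceiling so the quality
# string compresses far better. Operates on Phred+33 (Sanger/Illumina
# 1.8+) encoded strings; the cap of 23 follows the answer above.

CAP = 23     # maximal quality we care about
OFFSET = 33  # Phred+33 ASCII offset

def cap_qualities(qual_string, cap=CAP):
    """Return the quality string with every value clamped to `cap`."""
    return "".join(
        chr(min(ord(c) - OFFSET, cap) + OFFSET) for c in qual_string
    )

q = "IIIIFFF###"          # mixed high (Q40/Q37) and low (Q2) qualities
capped = cap_qualities(q)
# 'I' (Q40) and 'F' (Q37) both clamp to Q23 ('8'); '#' (Q2) is unchanged
print(capped)  # 8888888###
```

Capping collapses the long tail of high quality values into a single symbol, which is where most of the compression gain in such schemes comes from; the `=` trick in the answer does the analogous thing for the sequence field.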
Question
16 answers
Do you know of any software that allows prediction and analysis of mis-assemblies and QC of assembly results, other than AMOS and the CLC software?
Relevant answer
Answer
The best open-source (free) and CPU-efficient program I know is Tablet, a next-generation sequence assembly visualization tool; see http://bioinformatics.oxfordjournals.org/content/26/3/401
Question
During the conversion, the message indicates that the conversion failed: lock and lib not found. This also happened when I converted yeast Velvet AMOS-compatible files to bank format.
Question
5 answers
Does anyone know how to convert ChIP-seq CEL files to WIG format?
Relevant answer
Answer
Question
Does anyone have any recommendations for good papers on this topic? Like "must reads". They can cover any sequencing method (ion semiconductor, nanopore, etc.).
Many thanks in advance!
Jon
Question
19 answers
I've seen numerous articles showing good correlation between their qPCR and transcriptome seq data, however, I have not had such great correlation. Is this more common than most articles report, has anyone else had this problem?
Relevant answer
Answer
Verify the exon-level expression of your gene of interest, and design RT-PCR primers accordingly. It is possible that your primers hit splice variants, causing such a discrepancy. Moreover, check the number of sequence reads supporting the gene of interest: it could be truly biological that the expression level is low, or technical, due to the depth of sequencing and the quality of the sequencing library. Transcriptome libraries made with good-quality RNA (RIN > 7) provide uniform coverage of all exons. It is possible that 5' end coverage is low due to RNA degradation.
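When assessing agreement between the two platforms, a quick sanity check is the Pearson correlation of log-transformed values across the shared genes. A minimal Python sketch with made-up paired measurements:

```python
# Toy qPCR vs. RNA-seq agreement check: Pearson correlation of
# log2-transformed values. The paired values are illustrative only.
import math

qpcr   = [1.0, 2.1, 4.0, 8.3, 15.8]   # e.g. relative quantities by qPCR
rnaseq = [0.9, 2.0, 4.5, 7.9, 16.5]   # e.g. RPKM for the same genes

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Expression values span orders of magnitude, so correlate in log space
# to keep a few highly expressed genes from dominating the statistic.
r = pearson([math.log2(v) for v in qpcr], [math.log2(v) for v in rnaseq])
print(round(r, 3))
```

A high r on such a plot only says the platforms rank genes similarly; per-gene discrepancies of the kind discussed above (splice variants hit by primers, low-coverage genes) can hide inside an overall good correlation.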
Question
7 answers
I am trying to align NGS data to repetitive regions. Straight-forward use of tools such as BWA fails. Any suggestions?
Relevant answer
Answer
Hi,
If you are working with ChIP-seq data, this article describes how to do it
----------------------------------------------------------------------------------------------------------
Short reads are aligned with BWA such that non-unique reads are reported in your SAM file. After that, non-unique reads are re-allocated to the most probable location based on which of the candidate loci has the higher number of unique reads aligned to it. You can ask the author for the code.
Alternatively, you can try GNUMAP (http://dna.cs.byu.edu/gnumap/)
Good luck for your work.
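The reallocation heuristic summarized above can be sketched in a few lines; the loci and counts below are made up, and real implementations work on full alignment records rather than bare names.

```python
# Toy version of the multiread reallocation heuristic: assign a
# multi-mapping read to whichever candidate locus already has the most
# uniquely mapped reads. Loci names and counts are illustrative.

unique_counts = {"locusA": 120, "locusB": 15, "locusC": 120}

def reallocate(candidate_loci, counts):
    """Pick the candidate locus with the highest unique-read support.
    Ties are broken by locus name so the choice is deterministic."""
    return max(candidate_loci, key=lambda locus: (counts.get(locus, 0), locus))

print(reallocate(["locusA", "locusB"], unique_counts))  # locusA
print(reallocate(["locusB", "locusC"], unique_counts))  # locusC
```

The intuition is that a repeat copy with strong unique-read evidence is the more probable true origin of an ambiguous read; more sophisticated tools (e.g. probabilistic allocators) weight this by mapping quality rather than taking a hard argmax.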
Question
Last time I checked, the Roche LightCycler 480 made this possible, but it was near impossible to pipette the 1536-well plates.... The question is: is there any way to address individual positions (a one-er pipette tip on a gantry robot, for instance) that can actually save money in the lab AND provide high individual PCR throughput? Yes, I know about Fluidigm. Not exactly the same thing. Here is a recent review and I don't see the kind of robotics I am thinking of: http://www.labtimes.org/labtimes/issues/lt2010/lt01/lt_2010_01_54_57.pdf
Is there anything new on this front?
Question
2 answers
I'm looking for a cost-effective means of validating potential associations in a GWAS as an alternative to individual genotyping, at least at first.
Relevant answer
Answer
Thank you very much, Fabio. That's exactly what I was looking for. I will have a close look at these articles.
Question
10 answers
I was thinking of a paired-end system that gives me reads of 250 bp, making contigs of 500 bp.
Relevant answer
Answer
Hello Daniel,
Although the points raised are valid, you can probably obtain enough data to do good work from a relatively "small" dataset.
On projects where we are comparing several samples (e.g. time series or cross-sectional studies), we routinely sequence up to 12 samples per single HiSeq 2000 lane, and this gives us plenty of data.
You may also consider the MiSeq platform: it has only one lane, but it may give you enough reads (and the low cost is really attractive).
Best,
Chris
Question
6 answers
For cancer I can still understand the need (to capture the variation due to multiple clones), but why is it needed for other diseases?
Relevant answer
Answer
Ultra-deep sequencing does not make that much sense when we talk about finding genetic variation in humans. If you have a sequencing depth of about 20 (given that the quality is fine), you can be very certain that you call the correct genotype. The only problem is that with a mean depth of 20 you will have very many sites with lower depths, where you cannot call genotypes and cannot find rare variation. If only the sequencing manufacturers could promise a coverage instead of a mean depth. Example: it would be nice if the manufacturer could promise a minimum coverage of 20X for 90% of the target region. Then I would be very certain that for 90% of my target I would find all mutations. A mean depth, per se, does not tell you much. The only reason for increasing depth is to increase the overall coverage.
(This answer is only viewed from the perspectives of finding genetic human variation).
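The mean-depth vs. coverage point above can be made concrete with the simplest (Poisson) model of read depth, in the spirit of Lander-Waterman. Real data are overdispersed, so this is an optimistic sketch, not a guarantee from any vendor.

```python
# Under a Poisson model, a site's depth X ~ Poisson(mean_depth), so the
# fraction of sites with depth >= k is 1 - P(X < k). Real sequencing
# depth is overdispersed, so treat these numbers as optimistic.
import math

def fraction_at_least(mean_depth, k):
    """P(X >= k) for X ~ Poisson(mean_depth)."""
    p_below = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - p_below

# At mean depth 20, only about half the sites reach 20x even in this
# idealized model, which is why a promised *mean* depth says little
# about the coverage you actually get at a threshold.
print(round(fraction_at_least(20, 20), 3))
# Pushing mean depth to 30 gets the large majority of sites above 20x.
print(round(fraction_at_least(30, 20), 3))
```

This is the quantitative version of the answer's complaint: to get "20x over 90% of the target" you must buy a mean depth well above 20.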
Question
29 answers
I had 2 runs from the same library at some NGS platform (intentionally not mentioning the name of the platform) and I have ended up with some strange findings. In my second run I found some extra variable regions which were not there in the first.
FYI: I have used the same parameters in the analysis pipeline for both sets of data. Quality scores are above Q30 for all the alternative alleles in the second run. The locus has not been reported as a variant site before.
What could be the reason behind such a result?
Relevant answer
Answer
When we are dealing with cancer FFPE tissues, even for the same tissue sample, the reproducibility across two different library preparation runs is not exactly the same. This boils down to the sample prep for the Ion Torrent, and it is one of the major reasons why we are shifting over to the MiSeq. Plus, we have had some known off-target capture in the 314 and 316 cancer AmpliSeq panels, depending on how you do the library prep (using NEB fast). I would recommend not doing the sample preps with a time gap. There is not a lot I can really suggest without looking at the data. One way forward would be to check the coverage of the variable regions. Try to overlap the data sets from both runs and see what you get. Maybe it is a misalignment, or a contamination.
Question
16 answers
We are dealing with Capsicum and soon also with sunflower transcriptomes. I am not up to date on the novelties among assemblers. We will try Trinity and Newbler (if I can make it run on my Mac!). Does anyone have suggestions?
Relevant answer
Answer
An important consideration for denovo assembly is pre-filtering of the reads before attempting an assembly.
Reads containing errors can be identified as they tend to be singletons. Duplicate reads can also be eliminated as they don't add to the information for the assembler.
The khmer tool and digital normalization approach developed by C. Titus Brown perform these tasks very well.
Links for paper and code here:
Applying these pre-filters helps to limit memory usage as well as reducing the redundancy in the final transcriptome assembly.
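The two pre-filters mentioned above (removing duplicate reads, discarding likely-erroneous singletons) can be sketched on toy reads. This is a crude, illustrative stand-in for khmer/digital normalization, not a replacement for it, and the reads and k-mer size are made up.

```python
# Toy pre-filtering before assembly: (1) drop exact duplicate reads,
# (2) drop reads containing k-mers seen only once across the dataset,
# since isolated k-mers are likely sequencing errors. Reads are made up.
from collections import Counter

reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",   # exact duplicate -> removed by dedup
    "CGTACGTACG",   # overlapping read, shares k-mers with the first
    "ACGTAGGTAC",   # one-base error -> contains unsupported k-mers
]
K = 4

def kmers(read, k=K):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

# Pass 1: deduplicate (order-preserving) and count k-mers.
unique_reads = list(dict.fromkeys(reads))
counts = Counter(km for r in unique_reads for km in kmers(r))

# Pass 2: keep only reads whose every k-mer is seen more than once,
# i.e. supported elsewhere in the dataset.
kept = [r for r in unique_reads if all(counts[km] > 1 for km in kmers(r))]
print(kept)  # ['ACGTACGTAC', 'CGTACGTACG']
```

Even this toy version shows the memory argument made above: the assembler now sees fewer, cleaner reads, and the error read never inflates the k-mer graph.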
Question
10 answers
Prediction softwares especially with respect to structure and function
Relevant answer
Answer
Nallasivam Palanisamy: Sir, regarding the ORF I am looking into, I am aware of its protein-coding regions. I have used ORF Finder, but it just gives a representation of the frames within the region. It would be helpful if I could use bioinformatics tools to predict the protein structure, function, or the signaling pathways/interactome with any reliable software.
Question
32 answers
Microarrays are still cheaper but cover fewer genes.
NGS is a little more expensive but covers all genes.
Relevant answer
Answer
Depends... I don't really agree that prices are the same, if you have both devices (a good scanner and a good sequencer) in-house. Real in-depth transcriptome seq is still expensive, whereas Affy or Agilent chips are really cheap nowadays. Moreover, we've had fifteen years to make sense of microarray normalization and interpretation, and a wealth of good tools exist to analyze them, while the same cannot be said yet for mRNA-seq. It really depends whether you have money, good bioinfo people, and you trust who makes your experiments. Most important, it depends on your biological hypothesis: if you want to assess rare transcripts, splice variants or fusion transcripts, there is no comparison. If you want to just do a gene expression analysis, GSEA or classification/prediction analysis, I would go for microarrays. Even in this case, if you have enough cDNA I would store an aliquot for further comparisons between the two methods!
Question
11 answers
Hi, are there any suggestions or references to run RNAseq for measuring expression level of only selected genes out of very small amount of material?
I know there are efforts to sequence even single cells, but it seems they are more focused on identifying mutations rather than quantifying expression levels and the technique itself is still under development. Any suggestions? Or maybe we should go back to run qPCR? Thanks.
Relevant answer
Answer
Thank you Shawn for the good suggestions, I'm also worrying that the target enrichment will introduce additional bias and variation to the measurement of RNA level. I think I will try to see if there are public data in the database for relative expression level of my target genes then to decide if RNAseq has any chance of success. But thank you everybody for the very helpful discussion.
Question
17 answers
(In soil samples) Which reagents?
Relevant answer
Answer
I'm not a bench scientist (have not done lab work in well over a decade), but one thing I caution you to consider in costs is that of analyzing the data. Depending on just what your project is and how big it is, expecting to complete data analysis on a simple off-the-shelf desktop or even a basic server may be unrealistic. Many of us use, and need, large memory machines, multi-core servers or clusters (at least a small cluster) to handle next-gen sequence analyses.
Luckily, almost all of the software you may need is likely available free as open source tools. The catch to that is that often it is only for UNIX/Linux and is also often command line only, so if need be, you may have to learn some basic command line UNIX.
I just mention all this because all too often I have seen people sink much time, money and effort into generating next-gen data, only to be brought to a slamming halt when they realize they need new computing hardware, or money for analytical services, to do anything with it.
Question
2 answers
I would like to know whether the software miRDeep2, for miRNA prediction from next-generation sequencing data, when checking for a correct Drosha/Dicer cut, takes into consideration the presence of reads mapping to pre-miRNAs containing putative "offset miRNAs" (moRNAs) (to see what moRNAs are, please see the paper by Bortoluzzi et al. in Blood. 2012 Mar 29;119), and whether the hairpin sequence is therefore retained or discarded for further analysis.
I fear that the software does not have a check for these particular miRNAs forms and can arbitrarily exclude "real" precursors encoding both moRNAs and miRNAs.
Have you ever encountered this problem?
Relevant answer
Answer
Hi,
I have used this tool a couple of times, but never checked that. What I can tell you is that all data are stored and then scored, so there is no filtering out, at least in the first version.
I think you could contact the authors directly; they will be very pleased to reply to you, and quickly. And maybe improve the tool.
Cheers
Question
18 answers
Troubleshooting ChIP-Seq
Relevant answer
Answer
I think that can happen; TFs may differ in binding preferences (i.e., promoter regions vs. enhancer regions).
Question
2 answers
What's the general consensus as to whether amplicon pyrosequencing fusion primers should be PAGE purified? It seems that regular desalting is often thought to be sufficient. However, the primers are as long as 58-mers, so there is a concern about errors during primer synthesis. If the oligos miss a portion at the 5' end, they will not have the pyrotag but will still amplify target in the 16S PCR. However, in emulsion PCR those targets wouldn't amplify.
Relevant answer
Answer
I work at a sequencing CRO, and we have gotten the most consistent results from HPLC purification of all fusion primers (454 and Ion Torrent PGM platforms usually).
Question
1 answer
As far as I know at the moment, only Sanger sequencing is reimbursed by health insurances of many countries. If the commissions in your region on reimbursement already worked on specific reimbursement for next-generation sequencing, what do they have as a definite concept on what kinds of rules and regulations apply to next-generation sequencing for the diagnostic of diseases?
Relevant answer
Answer
First, please (if applicable) differentiate between a) private health insurance providers with a reimbursement system and b) social health insurance (e.g. the German social health insurance system) where costs are borne directly by the insurance providers.
In German private health insurance, sequencing costs may in principle be borne by insurers on the basis of § 1 I 1 MBKK 2009 if medically necessary for treatment in the individual case.
However, regarding diagnostic genetic tests, § 18 of the German Gene Diagnostic Act (Gendiagnostikgesetz - GenDG), which is applicable only for private insurers, stipulates that insurers are prohibited from receiving genetic information of their insured persons. This also includes the information that a genetic test has been carried out for whatever purpose.
In this light, it so far remains unclear, whether private health insurers are able to reimburse costs of medically necessary sequencing in the treatment of disease as insurers, in order to be enabled to reimburse costs, need to receive the information about the used genetic sequencing technique.
In contrast, the US Genetic Information Nondiscrimination Act (GINA), for example, contains a provision covering information transferred to insurers for reimbursement of the costs of genetic tests.
Question
22 answers
Today, functional enrichment tools are used to interpret biological data generated by deep sequencing and other high-throughput technologies. Examples are Gene Ontology or pathway analysis software such as Ingenuity or MetaCore.
How may these tools evolve and what new tools should appear in the future, as the sequencing technologies develop and become more popular, even outside of basic research?
Relevant answer
Answer
I'd like to mention a free alternative to Ingenuity IPA and MetaCore: ConsensusPathDB, which integrates interactions and human-curated pathways from around 30 open-access databases. It can help you analyze a list of genes or metabolites in several ways based on pathways, GO categories, interactions, etc. Give it a try: http://ConsensusPathDB.org .
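Under the hood, most over-representation tools of this kind (ConsensusPathDB included) rely on a hypergeometric test. A minimal, self-contained sketch — the gene counts in the example are invented:

```python
from math import comb

def enrichment_pvalue(hits, list_size, category_size, background_size):
    """One-sided hypergeometric test: probability of drawing at least
    `hits` category members in a random gene list of `list_size`,
    given `category_size` category genes among `background_size` total."""
    total = comb(background_size, list_size)
    p = 0.0
    for k in range(hits, min(list_size, category_size) + 1):
        p += (comb(category_size, k)
              * comb(background_size - category_size, list_size - k)) / total
    return p

# e.g. 8 of 50 submitted genes fall in a 100-gene pathway (background: 20000 genes)
p = enrichment_pvalue(8, 50, 100, 20000)
```

Real tools additionally correct for multiple testing (e.g. Benjamini-Hochberg) across all pathways tested, which this sketch does not.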
Question
7 answers
There are many efforts to use NGS rather than microarrays for pooled shRNA screening. My questions: Is there any unique value that NGS can provide in this type of study that is not available with microarrays? Does it perhaps provide better data quality, but at the price of longer turnaround time and much more complicated data analysis?
Relevant answer
Answer
Depending on your assay, the benefits can include data quality and flexibility (the ability to use different or additional shRNAs without redesigning a chip, or to use focused sets without wasting space on a genome chip). Also, some of the complexity of data analysis is being addressed as people create software specifically for pooled screens.
There was a really nice publication in Genome Biology last year titled "High-throughput RNA interference screening using pooled shRNA libraries and next generation sequencing." It discusses the benefits of NGS, including sensitivity, dynamic range, and representation of the library. There is a detailed walkthrough of their screening process, and they compare their data to what may be expected from an array.
They also developed an open source program called shALIGN. I have not used it, but they describe some of the data analysis problems it solves because it was specifically designed to align sequences to shRNA libraries rather than aligning to the whole genome.
You also gain flexibility in the pool content. You can add new shRNAs to your screen, change libraries used to create the pools, or do more focused screens on gene families and pathways without having to print new arrays or deal with wasted space on the array. Smaller, more focused screens may also help with the data analysis.
Both methods are clearly valid, and you may benefit more or less depending on what array you are comparing to. Another big factor in any pooled screen is the quality of the shRNA library. If only a fraction of the shRNAs are functional, you may benefit more from a more sensitive readout, since the nonfunctional hairpins decrease your effective representation and add to the noise.
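The core readout step in a sequencing-based pooled screen is simply counting reads per hairpin. A toy sketch — hairpin names and sequences are invented, and a real pipeline such as shALIGN also handles read trimming and mismatches:

```python
from collections import Counter

def count_hairpins(reads, library):
    """Tally reads by exact match against known shRNA guide sequences.
    `library` maps guide sequence -> hairpin name; non-matching reads
    are counted as 'unmapped'."""
    counts = Counter()
    for read in reads:
        counts[library.get(read, "unmapped")] += 1
    return counts

# Invented two-hairpin library and a handful of reads:
library = {"ACGTACGTACGTACGTACGT": "shTP53_1",
           "TTTTCCCCGGGGAAAATTTT": "shKRAS_2"}
reads = ["ACGTACGTACGTACGTACGT",
         "ACGTACGTACGTACGTACGT",
         "GGGGGGGGGGGGGGGGGGGG"]
counts = count_hairpins(reads, library)
```

Comparing such counts between treated and control populations is what identifies depleted or enriched hairpins in the screen.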
Question
8 answers
Let us imagine that some laboratory developed a method for creating restriction endonucleases that can specifically bind to a short DNA sequence and specifically cut it or another determined DNA motif. In what applications would such enzymes be required? Could they be applied to Next Generation Sequencing?
If you use restriction endonucleases, describe their usage in a few words, or just say "I use them".
I am aware of Zinc-Finger Nucleases and TALENs, but these enzymes only bind target DNA specifically and have non-specific DNA-cutting activity.
Relevant answer
Answer
Google "Zinc-Finger Nuclease" or "Transcription Activator-Like Effector Nuclease" (TALENs). You can buy these commercially (or design them yourself) to cut any DNA sequence you like.
Reality is more amazing than imagination any day.
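For context, even ordinary restriction enzymes are easy to reason about computationally. A small sketch scanning a sequence for EcoRI sites (G^AATTC) and reporting top-strand cut positions — the defaults are illustrative, swap in any enzyme's site and offset:

```python
def find_cut_sites(seq, site="GAATTC", cut_offset=1):
    """Return 0-based top-strand cut positions for every occurrence of
    a recognition site. Defaults model EcoRI, which recognises GAATTC
    and cuts after the first base (G^AATTC)."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    return cuts

print(find_cut_sites("AAGAATTCTTGAATTCAA"))  # [3, 11]
```

This kind of in-silico digest is routinely used when designing cloning strategies or restriction-based library preparations.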
Question
9 answers
We're planning to next-gen sequence flow-sorted microbes out of sputum samples. Flow sorting will be done using a fluorescent antibody. Our concern is that the staining involves a formalin-fixation step, and that this might crosslink the microbial DNA, making it useless for DNA sequencing. Does anyone have experience sequencing antibody-fixed/sorted microbial DNA? We'd like to know if there are any pitfalls to this approach before we invest the money. Thanks in advance.
Relevant answer
Answer
Hi there,
if you are dealing with cell-walled microbes (bacteria, most fungi), it shouldn't be a problem to flow-sort them without prior fixation (although you could try to improve the staining by using 50% EtOH). A simple stain with DAPI, SYBR Gold, or a Syto dye should do. Don't sort them into distilled water, or your cells might be lost.
Question
2 answers
I am trying to compare array SNP calls to NGS SNP calls by using the position of the genotype. However, the builds used differ, so I would like to lift over the array data to match the NGS data. UCSC offers a liftover from e.g. build 18 to 19; however, I haven't been able to find a site where I can lift over from e.g. build 19.p1 to 19.p9. Is there anyone out there who can help me? Thanks.
Relevant answer
Answer
Although this is not what you asked for, my solution would be (and actually has been) either to i) BLAST/BLAT the probe sequences against the genome that you aligned your NGS data to, or ii) realign the NGS data to a genome you can use liftOver on.
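Once both call sets are on the same build, the comparison itself is straightforward if you key calls by coordinate. A sketch, assuming genotypes are already normalised to the same strand and allele order (the example calls are invented):

```python
def compare_calls(array_calls, ngs_calls):
    """Compare genotype calls keyed by (chromosome, position).
    Both inputs map (chrom, pos) -> genotype string; array positions
    must already be lifted over to the NGS build."""
    shared = array_calls.keys() & ngs_calls.keys()
    concordant = sum(1 for key in shared if array_calls[key] == ngs_calls[key])
    return {"shared": len(shared),
            "concordant": concordant,
            "array_only": len(array_calls.keys() - ngs_calls.keys()),
            "ngs_only": len(ngs_calls.keys() - array_calls.keys())}

array_calls = {("chr1", 100): "AA", ("chr1", 200): "AG", ("chr2", 50): "GG"}
ngs_calls = {("chr1", 100): "AA", ("chr1", 200): "GG"}
summary = compare_calls(array_calls, ngs_calls)
```

Keying by (chrom, pos) rather than by rs identifier sidesteps the problem of SNP IDs changing between dbSNP releases.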
Question
My basic understanding of 'paired-end' sequencing is that it adds nucleotides to both ends of the fragment to be sequenced. My questions are: for MeDIP-seq, how many nucleotides are usually added on each end? How are these nucleotides added, since there is no template? And are the two 'paired ends' added to the same strand or to two different strands?
Thank you,
Jackie
Question
17 answers
I am planning to study the miRNomes in different diseases. Does anyone have experience with the analysis of miRNomes using NGS, especially when starting with a low amount (2-10 ng) of enriched miRNA? Any kind of information, especially in the generation of sequencing libraries (with or without amplification) would be helpful for me. Thanks for your efforts! Best, CK
Relevant answer
Answer
I am interested in doing the same thing. I have read that the Epicentre kit can work with low input amounts of enriched miRNAs. You can also try the TruSeq kit: ligate the adapters, then run a qPCR to see at which cycle number you are within the amplification stage, and use that number of cycles for the PCR when creating your library. Best of luck.
Question
4 answers
I have a hybrid assembly of 454 + Illumina + SOLiD contigs, around 850 Mb. Now I want to create scaffolds using SSPACE with 20 kb 454 paired-end reads (3 libraries) that have been converted into paired-end FASTQ reads (linkers removed).
Can anyone help me out with this or suggest some better approach for scaffolding?
Relevant answer
Answer
I want to do scaffolding with SSPACE using paired-end (20 kb) reads from Roche 454.
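For completeness, the linker-splitting step that produces SSPACE-ready paired reads from 454 mate-pair data can be sketched as below. The actual linker sequence and mate orientation depend on your library chemistry — check the Roche documentation; this uses an invented toy linker purely for illustration:

```python
def split_at_linker(seq, linker, min_mate_len=20):
    """Split one 454 mate-pair read into two mates at the linker.
    Returns (left, right) or None if the linker is absent or either
    mate would be shorter than min_mate_len."""
    idx = seq.find(linker)
    if idx == -1:
        return None
    left, right = seq[:idx], seq[idx + len(linker):]
    if len(left) < min_mate_len or len(right) < min_mate_len:
        return None
    return left, right

# Toy example with an invented 4-base linker:
mates = split_at_linker("ACGTACGT" + "NNNN" + "TTGGCCAA", "NNNN", min_mate_len=4)
```

In the SSPACE library file, such pairs would then be listed with their ~20 kb insert size, an error tolerance, and the expected orientation; consult the SSPACE manual for the exact format.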
Question
2 answers
If I wish to do a whole-methylome study in humans, what should be my ChIP-seq platform of choice, and why?
Relevant answer
Answer
ChIP-Seq isn't the best platform. Bisulfite sequencing is better.
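The reason bisulfite sequencing works for methylome studies is that bisulfite treatment converts unmethylated cytosines to uracil (read out as T), while methylated cytosines are protected. A toy simulation of the top-strand readout:

```python
def bisulfite_convert(seq, methylated_positions=()):
    """Simulate bisulfite conversion of the top strand: unmethylated
    C reads out as T; methylated Cs (0-based positions) stay C."""
    protected = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq)
    )

# Methylated cytosine at position 2 survives; the other Cs convert:
print(bisulfite_convert("ACCGTC", methylated_positions=[2]))  # ATCGTT
```

Comparing converted reads against the reference then reveals which cytosines were methylated — the basis of whole-genome bisulfite sequencing.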