Single Nucleotide Polymorphism - Science method

A single nucleotide variation in a genetic sequence that occurs at appreciable frequency in the population.
Questions related to Single Nucleotide Polymorphism
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
All the data conversions (SNP file and metadata file) have been done correctly. The impute command also runs without error. I tried both the "frequency" and "neighbour" methods, but all the missing data is simply replaced by "NA" instead of being assigned an imputed value. Can someone tell me what is going wrong?
Note: My species is an orphan tree species with no population sequencing data to use as a reference panel, although the genome of a single tree has been sequenced. I want to perform de novo imputation using the existing data.
Relevant answer
Answer
Hello,
I was using the "dartR" package, but I had to switch to "dartRverse".
I could solve the above issue as follows:
# Install and load dartRverse
install.packages("dartRverse")
library(dartRverse)

# Read the DArT SNP file together with the individual metadata
t1 <- gl.read.dart(
  filename = "Filename.csv",
  ind.metafile = "Filename_Metadata.csv",
  recalc = TRUE,
  mono.rm = FALSE,
  nas = "-",
  topskip = NULL,
  # lastmetric = RepAvg,
  # covfilename = "Filename_Metadata.csv",
  probar = FALSE,
  verbose = NULL
)

# Impute missing genotypes with the "frequency" method
t2 <- gl.impute(
  x = t1,
  method = "frequency",
  fill.residual = TRUE,
  parallel = FALSE,
  verbose = NULL
)

# Write the imputed genotypes to a CSV file
gl.write.csv(t2, outfile = "Outfile.csv", outpath = "C:/../../../", verbose = NULL)
Query kindly solved by Luis, the main dartR developer from Diversity Array Technologies.
  • asked a question related to Single Nucleotide Polymorphism
Question
8 answers
I'm researching one of the lignin biosynthesis genes (Si4CL1) in 9 accessions of foxtail millet. I amplified the gene using a standard PCR protocol and sequenced it by the Sanger method. I have aligned the sequences with both the genomic DNA sequence and the CDS of the reference. The alignment shows several SNPs in both exons and introns, and deletions of about 300 bp in the intronic region. However, the 9 foxtail millet accessions do not differ from one another; they differ only from the reference sequence (I don't know whether I can still call these SNPs). I'm curious whether there is alternative splicing, but I don't have the budget for mRNA analysis, so I plan to rely on prediction only, and I have no idea how to do that. Is it possible to do such a prediction analysis?
The Si4CL1 gene is about 4500 bp long. I divided the reference sequence into 7 fragments of about 700 bp each to design primers for amplification. The deletion is in the 4th and 5th fragments: when I amplified those parts, I got a clear single band of about 500 bp, and repeating the amplification several times gave the same result. So I don't think the result was caused by a technical error.
Here's the link to the reference I used in my research:
PS:
I'm still a master's student, and my previous degree wasn't related to molecular biology, so this is all quite new to me. Please ask for clarification if my question is unclear.
Relevant answer
Answer
M. Reza Pahlevi, Loubna Youssar, Amy Klocko: You generally cannot reliably predict alternative splicing events solely from Sanger-generated DNA sequences, as you need transcript-level evidence. However, you can attempt in silico predictions using known transcriptomic data and computational tools for splice site recognition.
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
I have a few sample VCFs of rather poor quality. They are 23andMe files from openSNP in the following format:
rsID, chromosome number, position, genotype
I have tried remapping them using Galaxy, but I get an error, which I guess is due to the format. The files contain only SNPs.
Any ideas, please? How can I make it work?
These files are mapped to NCBI36/hg18 and need to be remapped to hg38.
I have a specific list of SNPs (in hg38 coordinates) in CSV format, which I need to extract from each of these files after remapping.
Please suggest any alternative workflows that could help me make this work.
Relevant answer
Answer
The file you are trying to lift over is not a VCF.
You need to reorder the columns to match the Variant Call Format, that is, chromosome, position, rsID, reference allele, alternate allele, etc. (the VCF columns #CHROM, POS, ID, REF, ALT, and so on).
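The reordering step can be sketched in Python. This is only a minimal illustration, assuming the input uses the tab-separated 23andMe raw-data layout described in the question (rsID, chromosome, position, genotype, with `#` comment lines). The function names are hypothetical. Because a genotype alone does not tell you which allele is REF and which is ALT, the sketch writes BED intervals, which coordinate-conversion tools such as liftOver accept; building a full VCF would additionally need the reference genome.

```python
def parse_23andme(lines):
    """Yield (chrom, pos, rsid, genotype) tuples from 23andMe-style raw-data lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, chrom, pos, genotype = line.split("\t")[:4]
        yield chrom, int(pos), rsid, genotype


def to_bed(record):
    """Format one record as a BED line (0-based start, 1-based end)."""
    chrom, pos, rsid, genotype = record
    # The genotype is carried along in the name field so it survives the lift-over.
    return f"chr{chrom}\t{pos - 1}\t{pos}\t{rsid}|{genotype}"


# Made-up example records, not real variants.
example = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t1000\tAG",
    "rs0000002\tX\t2500\tCC",
]
records = list(parse_23andme(example))
bed_lines = [to_bed(r) for r in records]
print("\n".join(bed_lines))
```

After the lift-over, the rsID in the name field can be matched against the hg38 SNP list from the CSV file.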
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I am currently conducting a Mendelian randomization study, and I was attempting to use PhenoScanner to look for potential confounders associated with the selected SNPs (any SNPs significantly associated with risk factors for my outcome). I tried using the PhenoScanner website directly and accessing the database via R, but to no avail.
Relevant answer
Answer
Hi Radwan,
You're right to be concerned about the current accessibility issues with PhenoScanner V2, as it is a useful tool for conducting Mendelian Randomization (MR) studies and identifying potential confounders associated with specific SNPs. I’ve faced similar difficulties due to the ongoing unavailability of PhenoScanner V2.
PhenoScanner V2, as described in its methodology, aggregates data from several large genetic association datasets, including the NHGRI-EBI GWAS Catalog and other resources, to provide genotype–phenotype associations (Kamat et al., 2019). The tool offers a Python-R interface that connects to a MySQL database, enabling the identification of relevant SNPs across various traits and conditions, with a comprehensive catalogue that includes billions of associations for diseases, traits, gene expression, protein levels, metabolite levels, and epigenetic markers. However, its data sources, such as the GWAS Catalog, remain accessible independently, even when PhenoScanner is down.
In your case, I would recommend using the GWAS Catalog directly as an alternative. The GWAS Catalog (https://www.ebi.ac.uk/gwas/home) contains a wealth of information on SNP-trait associations and would allow you to query SNPs related to your traits of interest. You can search for associations using SNP identifiers, genes, or traits, much like you would with PhenoScanner, albeit without the streamlined Python-R interface of PhenoScanner V2.
This approach should serve as a viable alternative and can provide similar insights for your study. I hope this helps, and please feel free to ask if you encounter further issues.
Reference:
  • Mihir A Kamat, James A Blackshaw, Robin Young, Praveen Surendran, Stephen Burgess, John Danesh, Adam S Butterworth, James R Staley, PhenoScanner V2: an expanded tool for searching human genotype–phenotype associations, Bioinformatics, Volume 35, Issue 22, November 2019, Pages 4851–4853, https://doi.org/10.1093/bioinformatics/btz469
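As a rough illustration of the GWAS Catalog route, the Python sketch below filters a downloaded associations table for a list of instrument SNPs. It assumes a tab-separated file with columns named `SNPS`, `DISEASE/TRAIT` and `P-VALUE`; check these names against the header of the file you actually download. The function name and the example rows are made up.

```python
import csv
import io


def associations_for_snps(tsv_text, rsids):
    """Return (rsID, trait, p-value) rows whose SNPS field mentions a requested rsID."""
    wanted = set(rsids)
    hits = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Assumption: the SNPS field may list several rsIDs separated by ';', ',' or ' x '.
        listed = row["SNPS"].replace(" x ", ";").replace(",", ";").split(";")
        for rsid in (s.strip() for s in listed):
            if rsid in wanted:
                hits.append((rsid, row["DISEASE/TRAIT"], row["P-VALUE"]))
    return hits


# Tiny made-up table standing in for a downloaded associations file.
sample = (
    "SNPS\tDISEASE/TRAIT\tP-VALUE\n"
    "rs111\tBody mass index\t3E-9\n"
    "rs222; rs333\tSmoking status\t1E-8\n"
    "rs444\tHeight\t2E-12\n"
)
hits = associations_for_snps(sample, ["rs111", "rs333"])
for hit in hits:
    print(hit)
```

Traits returned for an instrument SNP can then be screened by hand for plausible confounders of the exposure-outcome relationship.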
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
Hi, I am designing primers to discriminate SNP alleles: two forward primers, each specific for one allele. To improve the discrimination of the SNP, what is an acceptable delta G value for the primers?
Relevant answer
Answer
It's probably going to be cheaper and faster to PCR amplify and Sanger sequence to find SNPs.
Your assay is based on amplification from one primer (SNP) vs no amplification (other allele), plus a second reaction where you get amplification from the other allele and none from the SNP. Relying on negative results (e.g. not seeing amplification with the second primer set in a sample homozygous for the SNP) is a risky design, because PCR can fail for lots of reasons. If the SNP is rare, you can easily confuse a failed amplification with presence of the SNP.
And it is common for the polymerase to be able to "read across" a mismatch. Yes, I know it's not supposed to, but certain mismatches are well-tolerated and easy to ignore.
RFLPs can work well, but can be really time consuming to optimize. And sometimes they just don't work reliably enough or the enzyme is expensive.
My vote is Sanger, back up the primers so they are at least 150 bp away from the SNP, make the product less than 1000 bp, and sequence from both directions.
  • asked a question related to Single Nucleotide Polymorphism
Question
5 answers
Hi, I am doing research on quantification of a SNP in a miRNA. I designed a universal reverse primer and two forward primers whose 3' ends are each specific for one allele. However, when I ran an experiment using a sample carrying allele A and added the forward primer specific for allele C instead of the one specific for allele A, the allele C primer still bound to the allele A sequence and was extended.
How can I make the two primers specific when the target sequences differ by only one nucleotide?
Relevant answer
Answer
Try introducing an additional mismatch at position -1 from the 3' end of both primers.
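The idea can be sketched in Python as follows. This is an illustration only: the function name, the example template and the substitution table used to pick the deliberate mismatch are hypothetical, and in practice the mismatch base is chosen empirically.

```python
# Hypothetical substitution rule for the deliberate mismatch.
MISMATCH = {"A": "C", "C": "A", "G": "T", "T": "G"}


def allele_specific_primer(template, snp_index, allele, length=20):
    """Build a forward primer whose 3' base is `allele` and whose penultimate base is mismatched.

    `template` is the sense strand and `snp_index` the 0-based SNP position in it.
    """
    start = snp_index - length + 1
    if start < 0:
        raise ValueError("not enough upstream sequence for the requested primer length")
    upstream = template[start:snp_index]      # bases immediately before the SNP
    penultimate = MISMATCH[upstream[-1]]      # deliberate mismatch next to the 3' end
    return upstream[:-1] + penultimate + allele


# Made-up template with the SNP (C or A) at index 10; short primers for readability.
template = "ATGCCGTAAGCTTGA"
primer_c = allele_specific_primer(template, snp_index=10, allele="C", length=8)
primer_a = allele_specific_primer(template, snp_index=10, allele="A", length=8)
print(primer_c)
print(primer_a)
```

With the extra mismatch, the non-matching primer has two mismatches at its 3' end instead of one, which is what makes extension from the wrong allele less likely.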
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
Can there be a brief description
Relevant answer
Answer
CRISPR-Cas9 corrects an SNP by creating a double-strand break (DSB) near the SNP site, which is then repaired using a donor DNA template carrying the correct nucleotide sequence. The process relies on the cell's natural homology-directed repair (HDR) mechanism to incorporate the correct sequence into the genome.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I understand that VCFtools can do the trick, as can tools like VCF-kit. I have also read some high-profile papers that report Tajima's D based on VCF files. I am always wary of calculating Tajima's D from a VCF file. In my experience, Tajima's D is very sensitive to low-frequency polymorphisms. However, in a VCF file some of the low-frequency variants (say, below 1%) are caused by sequencing or SNP-calling errors and may not be real. So in most VCF-based analyses we filter SNPs and discard those whose allele frequency (AF) is below 5%. If we discard low-frequency SNPs, Tajima's D will definitely be affected. So I find myself in a dilemma. Any help is appreciated.
Relevant answer
Answer
A VCF file is similar to a multiple sequence alignment. Your question is about filtering variants. You can always set the filtering criteria yourself. Filtering variants with AF < 5% is not in itself a criterion; filtering out false variants is. Once the false variants are filtered out, you can use any tool to calculate Tajima's D.
VCF2Tajima can be helpful to run Tajima's D on VCF files.
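To make the concern about frequency filtering concrete, here is a small Python sketch of the standard Tajima's D formula computed from per-site alternate-allele counts. The counts are made up, and the function is an illustration rather than the implementation used by any of the tools named above.

```python
import math


def tajimas_d(alt_counts, n):
    """Tajima's D from per-site alternate-allele counts among n sampled chromosomes."""
    counts = [k for k in alt_counts if 0 < k < n]   # keep segregating sites only
    s = len(counts)
    if s == 0 or n < 4:
        return float("nan")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Mean number of pairwise differences, summed over sites.
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in counts)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


n = 10
all_sites = [1, 1, 1, 1, 5, 5]   # made-up counts: four singletons, two common variants
common_only = [k for k in all_sites if min(k, n - k) / n >= 0.2]   # crude frequency filter
d_all = tajimas_d(all_sites, n)
d_filtered = tajimas_d(common_only, n)
print(d_all, d_filtered)
```

In this toy case, dropping the singletons flips D from negative to positive, which is exactly the bias the question worries about: a minor-allele-frequency cutoff removes the rare variants that pull D downward.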
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
I am having trouble identifying the exact SNP position.
Relevant answer
Answer
And you'll be able to get more of the details from the figure caption & methods section.
The convention is that the boxes are exons & the thick solid line is the introns. In general, folks don't bother to look for SNPs in introns unless it's a splice-site acceptor.
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
Seeking efficient tools for detecting disease-causing SNPs in large genomic datasets. Any recommendations for advanced computational tools or strategies? I have already used SIFT, CADD, MSC, AlphaMissense, etc., and I only have data from 7 patients and their parents. Thank you for your help!
Relevant answer
Answer
To the best of my knowledge, I would suggest the commonly used MutationTaster, which predicts the effect of sequence changes (for example on the protein or on splicing) from the data provided. I have not used MutationTaster myself.
Exomiser is one of the platforms that was covered in the Gene Manipulation and Genomics module of my engineering course. It is one of the most trusted tools for identifying de novo and recessive variants.
I have read about, but not used, a tool called PolyPhen-2. I came across it in the same module and, to the best of my knowledge, it is meant specifically for nonsynonymous SNPs.
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
I am getting a significant marker trait association for traits. But understanding the allelic association is difficult; NN is the missing allele. Can anyone help me?
Relevant answer
Answer
Hi
From the figure file shared, it is clear that this is a false association. You need to clean the data by removing markers with rare minor alleles and excessive missing data, both genotype-wise and marker-wise, using suitable thresholds. Then impute the remaining missing data points. I think you should get true associations with the cleaned data.
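A minimal Python sketch of marker-wise cleaning is shown below. It assumes two-letter diploid genotype calls with "NN" for missing data, as in the question; the thresholds and marker names are purely illustrative, and suitable cutoffs depend on the dataset.

```python
def marker_stats(genotypes, missing="NN"):
    """Return (call_rate, minor_allele_frequency) for one marker's diploid genotype calls."""
    called = [g for g in genotypes if g != missing]
    call_rate = len(called) / len(genotypes)
    alleles = "".join(called)
    if not alleles:
        return call_rate, 0.0
    freqs = [alleles.count(a) / len(alleles) for a in set(alleles)]
    maf = min(freqs) if len(freqs) > 1 else 0.0   # monomorphic markers get MAF 0
    return call_rate, maf


def keep_marker(genotypes, min_call_rate=0.8, min_maf=0.05):
    """Apply illustrative call-rate and MAF thresholds to one marker."""
    call_rate, maf = marker_stats(genotypes)
    return call_rate >= min_call_rate and maf >= min_maf


# Made-up genotype calls for three markers across five individuals.
markers = {
    "snp_a": ["AA", "AG", "GG", "AG", "AA"],   # fully called, both alleles common
    "snp_b": ["CC", "NN", "NN", "CT", "CC"],   # 40% missing
    "snp_c": ["TT", "TT", "TT", "TT", "TT"],   # monomorphic
}
kept = [name for name, calls in markers.items() if keep_marker(calls)]
print(kept)
```

The same call-rate check can be applied per individual (genotype-wise) before imputing whatever missing calls remain.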
Best regards
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
We have done in-silico analysis on SNPs (pathogenicity analysis, stability analysis, etc), which has given us the deleterious SNPs that were affecting protein structure. Now that we want 2 to 4 SNPs for experimental validation in a particular population using ARMS-PCR, my question is: how do we select the SNPs?
What if there is no literature on the SNPs but they are marked as deleterious?
How do we deal with such SNPs in the experiment?
Relevant answer
Answer
@Frederic Lepretre, it's a good answer.
  • asked a question related to Single Nucleotide Polymorphism
Question
8 answers
Hi, I have a question regarding my PCR products. I did some primer optimization for primer 1 using a temperature gradient of 55-60 °C, and I chose 56.6 °C as my annealing temperature because it showed a clear band. However, when I did the amplification using primer 1 at 65.5 °C for all 60 of my samples, they showed double bands, which is puzzling because these did not show up during optimization. Can someone help me identify the problem? I followed the same steps with the same primer and the same machine. I have attached my result. For your information, 1e is the band for 56.6 °C.
Relevant answer
Answer
It is still useful to know whether the NTC is amplifying, but if we treat the 2 bands as real amplimers, then I suggest running a dimethyl sulphoxide (DMSO) gradient. Set up identical samples, one for each final DMSO concentration: 0%, 1%, 2%, 4%, 5%, 6%, 7% and 8%. At some DMSO concentration the lower band should disappear.
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
Collaboration and joint research call: Recently we conducted studies on BRCA1, BRCA2, EGFR, and ESR1 mutations in cancer. Our computational approach identified potentially pathogenic variants in these genes that may increase the risk of cancer development. However, with limited funds, instead of using sequencing, we intend to use ARMS-PCR for experimental validation. Notably, the predicted SNPs have not been reported in the literature as cancer risk variants. How do we select and validate these SNPs for further investigation? Let's discuss strategies for bridging the gap between bioinformatics predictions and experimental validation in cancer research. #CancerResearch #BRCA1 #BRCA2 #ESR1 #EGFR #Bioinformatics #WetLabE
Relevant answer
Answer
UCSC is a powerful friend that gives you lots of tools for genomics and genetics. Spending some time learning to use it is not a waste, it's an investment.
You're welcome.
fred
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
These SNPs do not create or remove a restriction enzyme site. Is it still possible to do RFLP?
Relevant answer
Answer
Almost certainly yes.
The base change itself may not create or delete an enzyme cut site for RFLP, but with dCAPS (derived cleaved amplified polymorphic sequences) a second, deliberately introduced base change in the amplifying primer will almost always create a restriction site specific to the base found in the template sequence. The dCAPS software designs primers to detect either or both of the bases at the SNP position. You just have to be a bit careful that the cut site you are creating does not also occur naturally within the amplimer, which would make the analysis more difficult.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Which database should I use to extract phenotype data related to protein and yield traits and genotype data consisting of SNP markers from the selected sources?
Relevant answer
Answer
Hi,
The Legume Information System holds information across many legume species. Try looking under the Taxa tab of its home page.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Hello,
I received some GWAS results. There is a column for SNP, position, REF/ALT allele, p value, SE, which is pretty straightforward. However, I am confused about the column titled 'Effect', they are values ranging from -1.4 to 1.3. Any help would be greatly appreciated, thank you in advance.
Relevant answer
Answer
That is the effect size, the "beta" of the regression.
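As a quick sanity check on such a table, the effect and its standard error can be combined into a test statistic. The Python sketch below assumes a Wald-type test with a normal approximation; many GWAS tools use a t-test or likelihood-ratio test instead, so the reported p-values need not match exactly. The standard error in the example is made up.

```python
import math


def wald_test(beta, se):
    """Return the z statistic and two-sided normal p-value for an effect estimate."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


z, p = wald_test(beta=-1.4, se=0.35)   # effect from the question's range, hypothetical SE
print(f"z = {z:.2f}, p = {p:.2e}")
```

The sign of the effect tells you the direction: a negative value means each copy of the effect allele (check whether that is REF or ALT in your file) is associated with a lower trait value.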
  • asked a question related to Single Nucleotide Polymorphism
Question
7 answers
Can anyone please tell me the database names or websites from where I can download human SNP datasets along with the quantitative traits (phenotypes) for genome-wide association studies (GWAS)?
Relevant answer
Answer
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I'm using the Novaseq 6000 and HiSeq 4000. Assembled alone, the Hiseq data has little missing data and many reads per sample, Novaseq has more missing data but still is useable. When I assemble them together, Hiseq individuals have few or no SNPs. I've checked trimming for both datasets and that does not appear to be the issue.
Assembling using ipyrad, I've assembled de novo and mapped to a reference.
Relevant answer
Answer
Hi. I am having similar issues. Were you able to find a solution for this?
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I am doing research on the pharmacogenetics of type 2 diabetes treatment. The SNPs I have selected are in close proximity on the same chromosome. I want to carry out linkage disequilibrium and haplotype analysis in our research. Can anyone please suggest the easiest way or program to do this, and explain how the results are interpreted?
Relevant answer
Answer
Hi,
Which algorithms are used in your study?
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
Hi,
Does my TaqMan probe recognize the gene only if the sequence has 100% coverage with the probe or will I also get signal if there is a SNP or even a few bp difference? And if a few bp are ok are there any consequences, e.g. weaker binding?
Thanks
Aleks
Relevant answer
Answer
It will obviously depend on the melting temperature, sequence and length of your probe, but TaqMan probes can tolerate a few bp of difference (e.g., a 1-base mismatch will likely be barely noticeable, but 3-4 mismatches will result in a decrease in amplification efficiency).
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
How can we identify a specific variant (specific SNPs) after receiving the sequences from Sanger sequencing? In other words, how can we reveal specific variants in the sequence?
Thanks :)
Relevant answer
Answer
I think it can be done in several ways.
One simple way can be -
First, copy the reference sequence of your sequenced gene from NCBI. Then align all the sequences using MAFFT or another multiple sequence alignment tool. Finally, open the aligned FASTA file in Jalview to find the mismatches. Jalview provides an interactive platform for this.
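Once the sequences are aligned, the mismatches can also be listed programmatically. The Python sketch below compares one aligned sample against the aligned reference; the function name and the toy sequences are made up, and it assumes gaps are written as "-" and ambiguous bases as "N".

```python
def find_variants(reference, sample):
    """List (position, ref_base, sample_base) for differences between two aligned sequences.

    Positions are 1-based and counted on the ungapped reference; an insertion in the
    sample is reported at the position of the preceding reference base.
    """
    if len(reference) != len(sample):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0
    for ref_base, sample_base in zip(reference.upper(), sample.upper()):
        if ref_base != "-":
            ref_pos += 1
        # Skip positions where either base is an ambiguous call.
        if ref_base != sample_base and "N" not in (ref_base, sample_base):
            variants.append((ref_pos, ref_base, sample_base))
    return variants


# Toy alignment: one insertion, one substitution and one deletion in the sample.
ref_aln = "ATGC-GTACCA"
sample_aln = "ATGCAGTTC-A"
variants = find_variants(ref_aln, sample_aln)
print(variants)
```

For Sanger data it is still worth checking each reported position against the chromatogram, since low-quality base calls near the read ends often look like variants.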
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
Hi,
I have provided my LD plot by Haploview and I used Linkage Format (.ped & .info) to this aim. I am wondering how could I have the gene structure on the top of my LD plot ?
I mean I need something like this :
Relevant answer
Answer
Dear Mona,
As far as I know, Haploview is not capable of producing such a detailed plot. I believe that what you are looking for, the attached plot, consists of two separate graphs. The bottom one is the output of Haploview, and the upper one was added later manually using photo editing software. My guess is that the upper one was derived from Ensembl or another genome browser such as UCSC or the NIH genome viewer. You can simply type in your region of interest to obtain detailed chromosomal information. Later, you can overlay it on top of your Haploview output.
I hope this helps!
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
I am trying to predict the stability of a protein carrying different SNPs. I tried DUET, PredictSNP and DynaMut. The problem with DUET is that it cannot handle double mutations, although it gives fast results. PredictSNP and DynaMut, on the other hand, take a long time to generate results in my case.
Please suggest other stability prediction tools that are both accurate and convenient.
Relevant answer
Answer
u're welcome
fred
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
I want to assess multiple SNPs in multiple sample treatment groups and I am not sure which lab technique is the best for that?
Relevant answer
Answer
Warm greetings,
Use multiplex real-time PCR.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Hi, recently we used the dbSNP database at NCBI for two variants, rs2304365 and rs17315309, in the ST18 gene. According to dbSNP, for rs2304365 the wild-type allele is C and the variants are C or T, but according to the publications the risk allele is A, not C or T (Etesami, I., Seirafi, H., Ghandi, N., Salmani, H., Arabpour, M., Nasrollahzadeh, A., ... & Keramatipour, M. (2018). The association between ST 18 gene polymorphism and severe pemphigus disease among Iranian population. Experimental Dermatology, 27(12), 1395-1398.). Similarly, for rs17315309 the wild type is A and the variants are G or T, but in the article cited in the database (Vodo, D., Sarig, O., Geller, S., Ben-Asher, E., Olender, T., Bochner, R., ... & Sprecher, E. (2016). Identification of a functional risk variant for pemphigus vulgaris in the ST18 gene. PLoS genetics, 12(5), e1006008.) the risk allele is C.
I don't understand why most articles, including the specific articles cited in dbSNP, report different alleles from those listed in the dbSNP database.
I really want to understand this,
Thanks:)
Relevant answer
Answer
Hi Yasmin
If you look at NCBI or, even better, at UCSC, you'll see that the human ST18 gene is on the minus strand of DNA.
That means that the sequence of the variant rs2304365 is given on the + strand, while its impact is described on the coding strand, the - strand.
In consequence, the variant is a T on the genomic strand and an A when you look at the impact on the RNA and protein. The reverse complement must therefore be applied for genes encoded on the minus strand.
you can zoom in, but look at the arrow:
all the best
fred
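The strand conversion described in this answer can be sketched in a few lines of Python. The function names are hypothetical; the example simply takes the plus-strand alleles quoted in the question and expresses them on the minus strand.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def alleles_on_gene_strand(alleles, gene_strand):
    """Convert plus-strand alleles to the strand on which the gene is transcribed."""
    if gene_strand == "+":
        return list(alleles)
    return [reverse_complement(a) for a in alleles]


plus_strand_alleles = ["C", "T"]   # alleles as quoted from dbSNP in the question
print(alleles_on_gene_strand(plus_strand_alleles, "-"))
```

A plus-strand C/T variant therefore reads as G/A on the transcript of a minus-strand gene, which is why a paper written from the gene's point of view can name A as the risk allele.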
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
I need to calculate the number of different and effective alleles for a diversity study using a large SNP dataset. Normally I use GenAlEx, but I can't load the data because of its limit on the number of columns. Please advise.
Relevant answer
Answer
Thank you Danial Hariz Zainal Abidin , will give it a try.
  • asked a question related to Single Nucleotide Polymorphism
Question
6 answers
I keep getting the same answer when googling this question (which is the percentage in a population).
is there any difference between non-synonymous SNP and mutation?
and between synonymous SNP and silent mutation?
What's the difference between their effect on protein function?
I'd appreciate a reference on the topic.
thanks.
Relevant answer
Answer
A SNP is a single base change that produces no obvious phenotype, so it is not subject to natural selection and the minor allele can become very common. A mutation produces a new phenotype, and the minor allele is subject to natural selection, so it frequently remains rare (a possible exception is a heavily inbred small population where the change is beneficial). A useful dividing line between SNP and mutation is that if the minor allele frequency is more than 1%, the change is a SNP.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Dear All
I got some significant markers from different articles and I want to convert the genetic positions (cM)of these markers to their physical positions in base pairs (bp).
Relevant answer
Answer
Hi Asmaa
Since genetic positions differ from species to species and depend on the structure of the samples you're working on (from the genetics point of view, of course), you can't do that.
The best approach is to take the list of your markers and look up their physical positions by hand (on UCSC, for example), or from the command line if you have enough information and it is all consistently structured.
In fact, as always in bioinformatics, it all depends on what you have at the start.
all the best
fred
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Regarding interpretation of sequence variants using ACMG rules.
Are ACMG rules PS3 and PP3 exclusionary?
In other words, if both the PS3 and PP3 rules are fulfilled, can PP3 still be applied?
PS3 - Well-established in vitro or in vivo functional studies supportive of a damaging effect on the gene or gene product.
PP3 - Multiple lines of computational evidence support a deleterious effect on the gene or gene product -conservation, evolutionary, splicing impact, etc.
P.S - For me, they are not exclusionary.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
When comparing SNP densities for intronic and exonic PAS, how did you account for the fact that exonic PAS lie within coding regions and, thus, are under amino-acid sequence-preserving selection? Did you consider only SNPs that do not alter aa sequence?
Relevant answer
Answer
ChIP analysis
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I want to know which approach is recommended for specific SNPs (single nucleotide polymorphisms).
For those with experience: which is more suitable, GWAS or a candidate gene study, for specific SNPs in specific genes related to a specific rare disease?
And for small, genetically homogeneous samples, what are the best methods and what are the differences?
Relevant answer
Answer
Hi Yasmin
it depends on what you have in your hands and lab, and what you'll do with this work, publishing or not...
go to this website and see what will be possible, power is the only way to get reliable results and to be published:
all the best
fred
  • asked a question related to Single Nucleotide Polymorphism
Question
8 answers
Hi Everyone, 
I am working with data of SNPs, I want to do logistic regression analysis. 
In multinomial logistic regression, is it compulsory to choose the most common genotype as the reference, or can I choose any genotype?
For one SNP (genotypes: II, ID, DD), when I choose the most common genotype, II, as the reference, the odds ratio comes out as 0.57, but when I choose the ID genotype, the odds ratio changes to 1.67 with p < 0.05. Is it fine to choose the heterozygous genotype as the reference?
Thanks,  
Relevant answer
Answer
Malik Olatunde Oduoye , it requires some level of abstraction to wrap your head around.
Consider this simple case: you have two groups in which you measured the value of some variable. Say you measured the weights of a sample of male and female bugs. You are interested in analyzing the expected (mean) difference between the weights of males and females.
A statistical model tries to express the expected (mean) weight as a function of sex.
Now "sex" is a binary (categorical, nominal) variable. How to use this in a mathematical formula that operates on numbers? This is where we need to encode the different possible values of the variable "sex" by different values of a numerical variable that we can use in a formula.
There are no restrictions in how to encode names/groups/categories by numbers. However, some ways will give you results that are meaningful for your research question and that you can readily interpret. We will now first take a small detour to demonstrate this, before coming to the "usual solution".
You have two sexes, and you may think of using an indicator function for each of the sexes that take the values 1 or 0, depending on whether the sex is male or female. These indicator variables can be used to multiply the expected weights of the respective sex, and these products can be added to give the expected weight of the desired sex.
This is the mapping from sex to the values 0 or 1 based on two separate indicator functions:
Imale(sex) = {if sex = "male" then 1, otherwise 0}
Ifemale(sex) = {if sex = "female" then 1, otherwise 0}
The function of the expected value is then
µ(sex)= bmale*Imale(sex) + bfemale*Ifemale(sex)
Hence, µ(male) gives bmale*1 + bfemale*0 = bmale and µ(female) gives bmale*0 + bfemale*1 = bfemale
Although this function works, it is somewhat unwieldy. It is about two coefficients, bmale and bfemale, and a statistical model based on this function can estimate their values along with standard errors, confidence intervals, and p-values, if you wish. The problem with this formula is that it does not give you direct access to what you actually wanted to investigate: the difference between males and females.
There is, of course, a more convenient function to address this directly. This function needs only one indicator variable for one of the sexes and includes an intercept term that represents the expected value of the other sex. You are free to choose which sex should be represented by the intercept term. This is the "reference". Let's say you decide to use females as the reference; then the intercept is bfemale and the indicator function is Imale, which is multiplied by a coefficient d that then represents the mean difference between males and females:
µ(sex)= bfemale + d*Imale(sex)
Here, µ(female) gives bfemale+d*0 = bfemale and µ(male) gives bfemale + d*1 = bfemale+d, so
µ(male) - µ(female) = d
As is obvious here, the coefficient d represents the difference in the mean weight between males and females, and a statistical model will provide an estimate for this difference with corresponding standard error, confidence interval, and p-value.
Of course you can choose to let the intercept represent the mean weight of males as well. The indicator function must then be Ifemale and the coefficient d then represents the difference in the mean weight between females and males. You may convince yourself by actually calculating the results of the formula for the two possible sexes:
µ(female) gives bmale+d*1 = bmale + d and µ(male) gives bmale + d*0 = bmale, so µ(female) - µ(male) = d
This is it. Practically, statistical software will automatically encode categorical variables into so-called dummy variable by applying the respective indicator functions and use these dummy variables in the statistical models. But these are technicalities.
This interpretation of the coefficient d depends on what sex you have chosen as the reference. If you have chosen males as reference and d > 0, then female bugs are expected to be heavier than male bugs. If you have chosen females as reference and d > 0, then female bugs are expected to be lighter than male bugs.
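The same logic carries over to genotypes and odds ratios. The Python sketch below computes crude (unadjusted) odds ratios from a genotype-by-status table for two different reference genotypes. The counts are made up and the function name is hypothetical; a fitted logistic regression with covariates would give adjusted values, but the effect of switching the reference is the same.

```python
def odds_ratios(cases, controls, reference):
    """Odds ratio of each genotype relative to `reference`, from case and control counts."""
    ref_odds = cases[reference] / controls[reference]
    return {
        g: (cases[g] / controls[g]) / ref_odds
        for g in cases
        if g != reference
    }


# Made-up counts for an insertion/deletion polymorphism.
cases = {"II": 60, "ID": 50, "DD": 15}
controls = {"II": 40, "ID": 55, "DD": 20}

vs_ii = odds_ratios(cases, controls, reference="II")
vs_id = odds_ratios(cases, controls, reference="ID")
print(vs_ii)
print(vs_id)
```

The odds ratio for ID against II is the reciprocal of the odds ratio for II against ID, so changing the reference changes how the result is expressed, not the underlying association. Any genotype can serve as the reference as long as the interpretation follows the choice.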
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
I performed an analysis for a specific gene/region of interest using GWAS summary statistics for a binary phenotype (+/- disease). Using the corrected p values in the summary statistics and mapping in FUMA, I found a significant missense/coding SNP, with p=0.025. Other papers have shown this SNP leads to a less functional protein. This SNP was not in LD with the other sig SNPs (intronic/non-coding). Could it be an independent SNP? Beta is 0.002 and MAF is 0.005. Is it possible this SNP is a false positive due to its rarity? Population size is n=400k and cases n=1100. Any help would be appreciated!
Relevant answer
Answer
Analysis of the information you provided:
1. Significance: The corrected p-value of 0.025 suggests that the missense/coding SNP is statistically significant. However, it's important to consider the multiple testing correction applied, as it can affect the threshold for significance.
2. Functional impact: The fact that other papers have shown this SNP to lead to a less functional protein adds further support to its potential role in the phenotype of interest. It suggests that the SNP might have a functional impact on the gene's protein product.
3. LD with other significant SNPs: As you mentioned, the missense/coding SNP is not in linkage disequilibrium (LD) with the other significant SNPs in the region. This suggests that it represents an independent genetic variant influencing the phenotype, rather than being in LD with another causal variant.
4. Rarity and false positive: The SNP's low minor allele frequency (MAF) of 0.005 indicates that it is quite rare in the population. Rare variants can pose challenges in terms of statistical power and increased probability of false positives. Given the population size of 400k and the number of cases (n=1100), it is possible that the rarity of the SNP could contribute to increased uncertainty in the association results.
To further validate the association and assess the potential false positive rate, it would be beneficial to consider additional evidence, such as functional studies or replication in independent cohorts. Additionally, performing appropriate statistical analyses, such as replication analysis, permutation testing, or assessing genomic control inflation factors, can help evaluate the robustness of the association and minimize potential false positives.
It is advisable to consult with experts in the field or bioinformatics/statistical geneticists who can provide more specialized guidance based on your specific dataset and research question.
Hope it helps. Credit: AI
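One way to gauge how much information such a rare variant carries is to count the minor alleles expected among the cases. A rough sketch, assuming a diploid autosomal locus and that the reported MAF of 0.005 applies equally to cases and the whole cohort (both are assumptions; the function name is illustrative):

```python
def expected_minor_alleles(n_individuals, maf):
    """Expected number of minor alleles among n diploid individuals."""
    return 2 * n_individuals * maf


maf = 0.005          # minor allele frequency from the question
n_cases = 1100       # cases from the question
n_total = 400_000    # total sample size from the question

minor_in_cases = expected_minor_alleles(n_cases, maf)
minor_in_total = expected_minor_alleles(n_total, maf)

print(f"expected minor alleles among cases: {minor_in_cases:.1f}")
print(f"expected minor alleles in the whole cohort: {minor_in_total:.0f}")
```

With roughly 11 minor alleles expected among the cases, a few extra carriers or genotyping errors can move the association estimate noticeably, which is why replication matters for variants this rare.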
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I am planning to process the Illumina Omni5 SNP data. However, I am currently using Macbook Pro that is provided by my new institute. Does anyone have any suggestions on selecting any Mac-compatible software to replace the functions of Illumina's GenomeStudio? I need to analyze the raw data from the Omni5 SNP platform. Thanks a lot.
Relevant answer
Answer
Hi, did you get the Mac-compatible software for Illumina's GenomeStudio? I met the same problem and need some suggestions. Thanks!
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
I need the CEL1 enzyme for the laboratory work of my PhD thesis. Unfortunately, I have not found it anywhere. I would appreciate the names of any companies that supply it, or a contact who could sell it to me.
Thank you.
Relevant answer
Answer
Unfortunately I couldn't find any company selling it online, which is a bit odd. As you probably know, CEL I endonuclease occurs naturally in celery juice and is normally used to detect heteroduplexes and SNPs, particularly when confirming genome editing by CRISPR-Cas. In this link you will find a protocol describing a method for extracting CEL1 from celery juice, which you could carry out yourself.
Also, depending on what you need such an endonuclease for, you may be able to substitute T7E1, which is more common than CEL1 and can be purchased from NEB.
Good luck
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
I am asking about the meaning of the letter "c" in c.G421T?
Relevant answer
Answer
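It most likely refers to the coding sequence: in HGVS-style variant nomenclature, the "c." prefix marks a position numbered along the coding DNA reference sequence, so c.G421T denotes a G-to-T change at coding position 421.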
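A small Python sketch of how such a label can be split into its parts. It accepts the compact form used in the question (c.G421T) and the arrow form (c.421G>T); the regular expressions and the function name are illustrative assumptions and cover simple substitutions only, not the full nomenclature.

```python
import re

# Compact form as written in the question, e.g. "c.G421T".
COMPACT_FORM = re.compile(r"^c\.([ACGT])(\d+)([ACGT])$")
# Arrow form, e.g. "c.421G>T".
ARROW_FORM = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def parse_coding_substitution(label):
    """Return (position, reference base, alternative base) for a simple
    coding-sequence substitution; raise ValueError for anything else."""
    match = COMPACT_FORM.match(label)
    if match:
        ref, pos, alt = match.groups()
        return int(pos), ref, alt
    match = ARROW_FORM.match(label)
    if match:
        pos, ref, alt = match.groups()
        return int(pos), ref, alt
    raise ValueError(f"not a simple coding substitution: {label!r}")


print(parse_coding_substitution("c.G421T"))
print(parse_coding_substitution("c.421G>T"))
```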
  • asked a question related to Single Nucleotide Polymorphism
Question
6 answers
As above, I'm wondering what is the justification for removing monomorphic SNP loci for genetic diversity analysis?
Using a genome-wide association study, I am analysing SNP data for a wide-ranging animal species from multiple regions and want to be able to compare diversity between regions.
Screening for monomorphic SNPs results in the loss of up to 20% of SNPs for some regions and <5% for others. Is it reasonable to compare these data with monomorphic loci excluded?
Relevant answer
Answer
People usually suggest excluding monomorphic loci from STRUCTURE analysis because there is no differentiation or any variation in these loci and they bring no information on ancestry. This means that using a dataset with some monomorphic loci versus excluding them would lead to the same results (or with negligible differences?). So, it just saves time.
However, if we are interested in the real state of the populations/localities, it seems reasonable not to exclude them because we have always only some random(?) proportion of a genome. I think it is the same as with sampling individuals from a population - different sets lead to different estimates of an average, for example, and the point is to maximise the sample sizes to improve the accuracy of estimates. Still, not sure...
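The effect on a between-region comparison can be sketched with expected heterozygosity, 2p(1-p) per biallelic locus, averaged over loci. The allele frequencies below are invented so that one region has 20% monomorphic loci and the other 5%, mirroring the question; the function names are illustrative assumptions.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity of a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def mean_expected_heterozygosity(freqs, drop_monomorphic):
    """Average expected heterozygosity, optionally ignoring fixed loci."""
    kept = [p for p in freqs if not (drop_monomorphic and p in (0.0, 1.0))]
    if not kept:
        raise ValueError("no loci left after filtering")
    return sum(expected_heterozygosity(p) for p in kept) / len(kept)


# Hypothetical allele frequencies, for illustration only.
region_a = [0.5] * 8 + [0.0] * 2    # 20% of loci monomorphic
region_b = [0.5] * 19 + [1.0] * 1   # 5% of loci monomorphic

for name, freqs in (("A", region_a), ("B", region_b)):
    with_mono = mean_expected_heterozygosity(freqs, drop_monomorphic=False)
    without_mono = mean_expected_heterozygosity(freqs, drop_monomorphic=True)
    print(name, round(with_mono, 3), round(without_mono, 3))
```

In this toy case the two regions differ when all loci are kept but look identical once monomorphic loci are dropped per region, so filtering each region separately can hide a real difference.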
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I have been using a green PCR master mix from Promega (green Taq), but for my recent work on a particular gene that master mix is not working. I borrowed Phusion DNA polymerase (high fidelity, Cat. no. F-530XL) from a neighbouring lab for two samples, and amplification did take place. My aim is to amplify the gene promoter region (PCR product size: 269 bp) and study a single nucleotide polymorphism in the region using the RFLP technique. For this purpose I have introduced a restriction site (AluI) near the 3' end of my reverse primer (single base-pair mismatch). While checking the details of Phusion DNA polymerase, I found that it has 3'-5' exonuclease activity and is known for its accuracy. I therefore anticipate that the polymerase might excise the mismatched base as part of its proofreading. If this happens, I will get false results in my SNP study. I have never used a high-fidelity polymerase before; I used Taq polymerase, which does not have 3'-5' exonuclease activity. I am confused about which polymerase I should buy. In the Thermo catalogue I have seen standard Taq polymerases, DreamTaq, and many other options, but I am unaware of their properties and do not know which one to select. I do not have access to more Phusion DNA polymerase or any other Taq polymerase at the moment. Amplification of this particular gene has been complicated from the beginning. I have 300+ patient DNA samples to screen for SNPs, so I am also looking for an economical option. Kindly guide me on what my options are and which polymerase I should opt for.
Relevant answer
Answer
As for polymerase selection, you may find this chart useful:
As for the rest, if you only have the Promega mix available, try to optimize the temperature, timing, and salt concentration.
  • asked a question related to Single Nucleotide Polymorphism
Question
6 answers
We have a myostatin SNP, g.6215942T>C, but cannot find any references about it. We need some help with this, thanks.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Dear Folks,
To do a functional analysis of a specific gene in the SIFT software, we need the FASTA sequence of the gene and the amino acid changes at specific positions. In our case, we only have the dbSNP IDs for the gene. How can we retrieve the amino acid changes using a dbSNP ID?
Relevant answer
Answer
If it's a human SNP simply search for it in dbSNP and there you will find links to RefSeq protein and nucleotide sequences.
If it's a pathogenic SNP you may also search in https://www.ncbi.nlm.nih.gov/clinvar/
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I have only the rsID of the SNP of interest, but need to determine how much of my population of interest is affected.
Relevant answer
Answer
Hi,
Just type the rs ID into the UCSC genome browser (https://genome.ucsc.edu), go to the location, and click on the rs entry.
If population frequency information is available, the next page will show it.
Otherwise, you can also dig into dbSNP (https://www.ncbi.nlm.nih.gov/snp/) at NCBI and see what it offers.
fred
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
Dear experts,
We have analyzed the FOXP3 gene mutation of 10 healthy volunteers and 13 diseased samples. Out of these, 3 healthy volunteers (30%) and 8 diseased patients (61.54%) were found to have mutations at specific SNPs. Now, we would like to perform a statistical analysis of these results. Could you kindly guide us on how to conduct the statistical analysis? If possible, please suggest the software package that should be used for this purpose.
Thank you for your assistance
Relevant answer
Answer
Although the number of individuals is small to conduct a statistical test and draw strong conclusions, you can still use the Chi-square test with null and alternative hypotheses. For more details, you may search any statistics homepage and apply the test in R or Python.
Hope it helps.
Best
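For counts this small, the expected number of mutation carriers among the healthy volunteers (10 × 11 / 23, about 4.8) falls below the usual rule-of-thumb minimum of 5 for the chi-square approximation, so Fisher's exact test is a common alternative. A minimal standard-library sketch for the 2×2 table from the question (3 of 10 healthy and 8 of 13 diseased individuals carrying a mutation); the function name is an illustrative assumption, and in practice a statistics package would normally be used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Integer weights proportional to each table's probability
    # (integers avoid floating-point ties when comparing tables).
    weights = {
        k: comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(lowest, highest + 1)
    }
    observed = weights[a]
    as_or_more_extreme = sum(w for w in weights.values() if w <= observed)
    return as_or_more_extreme / comb(n, row1)


# Rows: healthy, diseased. Columns: mutation present, mutation absent.
p_value = fisher_exact_two_sided(3, 7, 8, 5)
print(f"two-sided p = {p_value:.3f}")
```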
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
SNPs are biallelic markers. What is the reason?
Relevant answer
Answer
In humans there are two copies of each chromosome...
fred
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Dear Colleagues,
I am excited to reach out to experts in the field of bioinformatics to invite you to collaborate on an ongoing research project focused on validating and confirming the significant association of various gene polymorphisms with breast cancer risk within the Pashtun ethnicity. Our recent studies have successfully confirmed the association of the following gene polymorphisms with breast cancer risk: BRCA1 (rs1799950), BRCA2 (rs144848), TP53 (rs1042522), OPG (rs2073618, rs3102735), RANKL (rs9533156), ESR1 (rs2234693 and rs2046210), HER1 (rs11543848), and HER2 (rs1136201). These findings have been published in reputed journals, underscoring their significance.
To further strengthen the validity of our results and expand our knowledge on ethnic-specific polymorphisms, we seek collaboration with experts in bioinformatics who can contribute their expertise to our research. The proposed collaboration will involve the utilization of various bioinformatics tools and databases to enhance our understanding of the identified gene polymorphisms. Specifically, we plan to utilize the following tools:
  1. ENSEMBL, dbSNP, or NCBI databases: These databases will serve as invaluable resources to retrieve detailed information about the identified SNPs. We will explore their genomic locations, functional annotations, potential disease associations, and other relevant information to deepen our understanding of their implications in breast cancer risk.
  2. Population-specific variation databases: By leveraging population-specific variation databases, we aim to assess the frequency and distribution of the identified SNPs within different populations, including Pashtun and other ethnic groups. This analysis will enable us to evaluate the potential presence of ethnic-specific polymorphisms associated with breast cancer risk.
  3. Gene expression datasets, pathway databases, and functional annotation tools: Integrating gene expression datasets, pathway databases, and functional annotation tools will allow us to uncover the functional implications of the identified SNPs. By examining their potential involvement in breast cancer development and related pathways, we can gain insights into the underlying mechanisms and further refine our understanding.
We believe that collaborating with experts like you will significantly enhance the effectiveness and robustness of our studies. We welcome your expertise and recommendations for additional bioinformatics tools that can further enrich our research and facilitate the exploration of ethnic-specific polymorphisms in breast cancer risk.
By joining forces on this project, we can collectively advance our understanding of the genetic factors contributing to breast cancer risk within the Pashtun ethnicity and potentially identify other ethnic-specific polymorphisms. Moreover, our research outcomes will have broader implications for personalized medicine, risk assessment, and tailored interventions in breast cancer management.
If you are interested in collaborating on this research endeavor or have suggestions for other bioinformatics tools that could strengthen our studies, please do not hesitate to reach out to us. Together, we can make significant strides in breast cancer research and contribute to the development of more effective strategies for risk assessment, early detection, and management.
We eagerly anticipate the opportunity to collaborate with you and drive forward our collective understanding of breast cancer genetics.
Sincerely,
Relevant answer
Answer
Please send me a message.
  • asked a question related to Single Nucleotide Polymorphism
Question
5 answers
I have a list of SNPs for a rice gene. And I have two questions related to this:
1. I want to make a figure showing the respective position of the haplotype in the genomic region (UTR, exon, intron). How can I do it?
2. If not manually, how can I replace the original sequence nucleotides with the haplotype's SNPs?
Relevant answer
In genetics, a haplotype is a combination of alleles from different loci on a chromosome that are transmitted together. A haplotype can be one locus, several loci, or an entire chromosome, depending on the number of recombination events that have occurred between a given set of loci. If your haplotype is a single locus (the place a gene occupies) and you know its location or band because you amplified it, then you should clean up and sequence that band; this will give you your haplotype and your nucleotide sequence. Bioinformaticians then have dedicated programs that report the position.
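For the second question (substituting the haplotype's SNPs into the original sequence without editing by hand), a short Python sketch is shown here. It assumes the SNPs are available as 1-based positions on the same sequence, each with its reference and alternative base; the toy sequence, the SNP list, and the function name are invented for illustration.

```python
def apply_snps(reference, snps):
    """Return a copy of `reference` with each SNP substituted in.

    `snps` is an iterable of (position, ref_base, alt_base) tuples, with
    1-based positions. The reference base is checked first so that
    coordinate or strand mix-ups are caught instead of silently applied.
    """
    sequence = list(reference)
    for position, ref_base, alt_base in snps:
        index = position - 1
        if not 0 <= index < len(sequence):
            raise ValueError(f"position {position} is outside the sequence")
        if sequence[index] != ref_base:
            raise ValueError(
                f"expected {ref_base} at position {position}, "
                f"found {sequence[index]}"
            )
        sequence[index] = alt_base
    return "".join(sequence)


# Toy example: an 8-base reference and a two-SNP haplotype.
reference_sequence = "ATGCCGTA"
haplotype_snps = [(3, "G", "A"), (8, "A", "T")]
haplotype_sequence = apply_snps(reference_sequence, haplotype_snps)
print(haplotype_sequence)
```

The same position list, combined with the exon, intron, and UTR coordinates from the gene annotation, can then be used to place each SNP on a gene-structure figure.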
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
40 samples show one heterozygous (CA)(175bp, 202bp) and one homozygous allele (CC)(202bp) only and not (AA)(175bp) allele. Attached is just a snippet of some samples showing CA and CC allele mutations. Outer primers cover band size of 326bp.
2.5% agarose gel electrophoresis. Longitudinal. Long run.
Relevant answer
Answer
You'll have to provide some more background.
What genotypes are you expecting? In what proportion? What is the sample population (mutagenized? inbred? wild? etc.)
You can't involve probability unless you have expected numbers to start with.
Also, do you have controls to show that AA genotype is detectable with your assay?
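If the samples come from a randomly mating population, Hardy-Weinberg proportions are one way to obtain the expected numbers mentioned above. The sketch below uses a purely hypothetical split of the 40 samples into 30 CC and 10 CA (the question does not give the actual counts), and the function name is illustrative.

```python
def hardy_weinberg_expected(n_cc, n_ca, n_aa):
    """Expected genotype counts under Hardy-Weinberg proportions,
    using allele frequencies estimated from the observed counts."""
    n = n_cc + n_ca + n_aa
    p = (2 * n_cc + n_ca) / (2 * n)  # frequency of allele C
    q = 1.0 - p                      # frequency of allele A
    return {"CC": n * p * p, "CA": 2 * n * p * q, "AA": n * q * q}


# Hypothetical counts: 30 CC, 10 CA, 0 AA (not taken from the question).
expected = hardy_weinberg_expected(30, 10, 0)
for genotype, count in expected.items():
    print(genotype, round(count, 3))
```

With these assumed counts, fewer than one AA individual is expected among 40 samples, so the absence of AA would not be surprising on its own. A positive control for AA is still needed to show that the assay can detect it.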
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
Please guide me on how to find the SNPs in a specific gene region, for example a specific region of the CYP2D6 gene. How can I find them, and which approach works well? I tried the NCBI SNP database but could not find them.
Thank you,
Relevant answer
Answer
Hi Hokmabadi,
May I ask whether you want to find SNPs in a known region? If so, I have had a similar experience. I use NCBI, where all SNPs in a specific gene can be displayed.
1. Open the NCBI home page. In the drop-down list, choose "Gene", enter the gene name (for example CYP2D6) in the search field, and search.
2. Choose the appropriate entry for the human gene among the first results.
3. Scroll down to the "GenBank" button and click it. This opens a new page; enter the position of the target region in the right-hand column and click "Update View".
4. Finally, click the "Graphics" button at the top of the page. You will then see all SNPs in the target region, shown with their rs numbers and red lines.
I hope this is the guidance you were looking for.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I'm looking for Durango Diversity (Phaseolus vulgaris) sequencing SNP data. Can anyone please explain briefly?
Relevant answer
Abstract
The decreasing cost along with rapid progress in next-generation sequencing and related bioinformatics computing resources has facilitated large-scale discovery of SNPs in various model and nonmodel plant species. Large numbers and genome-wide availability of SNPs make them the marker of choice in partially or completely sequenced genomes. Although excellent reviews have been published on next-generation sequencing, its associated bioinformatics challenges, and the applications of SNPs in genetic studies, a comprehensive review connecting these three intertwined research areas is needed. This paper touches upon various aspects of SNP discovery, highlighting key points in availability and selection of appropriate sequencing platforms, bioinformatics pipelines, SNP filtering criteria, and applications of SNPs in genetic analyses. The use of next-generation sequencing methodologies in many non-model crops leading to discovery and implementation of SNPs in various genetic studies is discussed. Development and improvement of bioinformatics software that are open source and freely available have accelerated the SNP discovery while reducing the associated cost. Key considerations for SNP filtering and associated pipelines are discussed in specific topics. A list of commonly used software and their sources is compiled for easy access and reference.
1. Introduction
Molecular markers are widely used in plant genetic research and breeding. Single Nucleotide Polymorphisms (SNPs) are currently the marker of choice due to their large numbers in virtually all populations of individuals. The applications of SNP markers have clearly been demonstrated in human genomics where complete sequencing of the human genome led to the discovery of several million SNPs [1] and technologies to analyze large sets of SNPs (up to 1 million) have been developed. SNPs have been applied in areas as diverse as human forensics [2] and diagnostics [3], aquaculture [4], marker assisted-breeding of dairy cattle [5], crop improvement [6], conservation [7], and resource management in fisheries [8]. Functional genomic studies have capitalized upon SNPs located within regulatory genes, transcripts, and Expressed Sequence Tags (ESTs) [9, 10]. Until recently large scale SNP discovery in plants was limited to maize, Arabidopsis, and rice [11–15]. Genetic applications such as linkage mapping, population structure, association studies, map-based cloning, marker-assisted plant breeding, and functional genomics continue to be enabled by access to large collections of SNPs. Arabidopsis thaliana was the first plant genome sequenced [16] followed soon after by rice [17, 18]. In the year 2011 alone, the number of plant genomes sequenced doubled as compared to the number sequenced in the previous decade, resulting in currently, 31 and counting, publicly released sequenced plant genomes (http://www.phytozome.net/). With the ever increasing throughput of next-generation sequencing (NGS), de novo and reference-based SNP discovery and application are now feasible for numerous plant species.
Sequencing refers to the identification of the nucleotides in a polymer of nucleic acids, whether DNA or RNA. Since its inception in 1977, sequencing has brought about the field of genomics and increased our understanding of the organization and composition of plant genomes. Tremendous improvements in sequencing have led to the generation of large amounts of DNA information in a very short period of time [19]. The analyses of large volumes of data generated through various NGS platforms require powerful computers and complex algorithms and have led to a recent expansion of the bioinformatics field of research. This book chapter focuses on the a priori discovery of SNPs through NGS, bioinformatics tools and resources, and the various downstream applications of SNPs.
2. History and Evolution of Sequencing Technologies
2.1. Invention of Sequencing
In 1977, two sequencing methods were developed and published. The Sanger method is a sequencing-by-synthesis (SBS) method that relies on a combination of deoxy- and dideoxy-labeled chain terminator nucleotides [20]. The first complete genome sequencing, that of bacteriophage phi X174, was achieved that same year using this pioneering method [21]. The chemical modification followed by cleavage at specific sites method also published in 1977 [22] quickly became the less favored of the two methods because of its technical complexities, use of hazardous chemicals, and inherent difficulty in scale-up. In contrast, the Sanger method, for which Frederick Sanger was awarded his second Nobel Prize in chemistry in 1980, was quickly adopted by the biotechnology industry which implemented it using a broad array of chemistries and detection methods [19].
2.2. Sequencing Technologies
In the last decade, new sequencing technologies have outperformed Sanger-based sequencing in throughput and overall cost, if not quite in sequence length and error rate [23]. This section will focus on the three main NGS platforms as well as the two main third-generation sequencing (TGS) platforms, their throughput and relative cost. We made every effort to ensure the accuracy of the data at the time of submission. However, the cost and throughput of these sequencing platforms change rapidly and, as such, our analysis only represents a snapshot in time. The flux of innovation in this field imposes a need for constant assessment of the technologies' potentials and realignment of research goals.
2.2.1. Roche (454) Sequencing
Pyrosequencing was the first of the new highly parallel sequencing technologies to reach the market [24]. It is commonly referred to as 454 sequencing after the name of the company that first commercialized it. It is an SBS method where single fragments of DNA are hybridized to a capture bead array and the beads are emulsified with the reagents necessary to PCR-amplify the individually bound template. Each bead in the emulsion acts as an independent PCR where millions of copies of the original template are produced and bound to the capture beads, which then serve as the templates for the subsequent sequencing reaction. The individual beads are deposited into a picotiter plate along with DNA polymerase, primers, and the enzymes necessary to generate light through the consumption of the inorganic pyrophosphate produced during sequencing. The instrument washes the picotiter plate with each of the DNA bases in turn. As template-specific incorporation of a base by DNA polymerase occurs, a pyrophosphate (PPi) is produced. This pyrophosphate is detected by an enzymatic luminometric inorganic pyrophosphate detection assay (ELIDA) through the generation of a light signal following the conversion of PPi into ATP [25]. Thus, the wells in which the current nucleotides are being incorporated by the sequencing reaction occurring on the bead emit a light signal proportional to the number of nucleotides incorporated, whereas wells in which the nucleotides are not being incorporated do not. The instrument repeats the sequential nucleotide wash cycle hundreds of times to lengthen the sequences. The 454 GS FLX Titanium XL+ platform currently generates up to 700 MB of raw 750 bp reads in a 23-hour run. The technology has difficulty quantifying homopolymers, resulting in insertions/deletions, and has an overall error rate of approximately 1%. Reagent costs are approximately $6,200 per run [26].
2.2.2. Illumina Sequencing
Illumina technology, acquired by Illumina from Solexa, followed the release of 454 sequencing. With this sequencing approach, fragments of DNA are hybridized to a solid substrate called a flow cell. In a process called bridge amplification, the bound DNA template fragments are amplified in an isothermal reaction where copies of the template are created in close proximity to the original. This results in clusters of DNA fragments on the flow cell creating a “lawn” of bound single strand DNA molecules. The molecules are sequenced by flooding the flow cell with a new class of cleavable fluorescent nucleotides and the reagents necessary for DNA polymerization [27]. A complementary strand of each template is synthesized one base at a time using fluorescently labeled nucleotides. The fluorescent molecule is excited by a laser and emits light, the colour of which is different for each of the four bases. The fluorescent label is then cleaved off and a new round of polymerization occurs. Unlike 454 sequencing, all four bases are present for the polymerization step and only a single molecule is incorporated per cycle. The flagship HiSeq2500 sequencing instrument from Illumina can generate up to 600 GB per run with a read length of 100 nt and 0.1% error rate. The Illumina technique can generate sequence from opposite ends of a DNA fragment, so called paired-end (PE) reads. Reagent costs are approximately $23,500 per run [26].
2.2.3. Applied Biosystems (SOLiD) Sequencing
The SOLiD system was jointly developed by the Harvard Medical School and the Howard Hughes Medical Institute [28]. The library preparation in SOLiD is very similar to Roche/454 in which clonal bead populations are prepared in microreactors containing DNA template, beads, primers, and PCR components. Beads that contain PCR products amplified by emulsion PCR are enriched by a proprietary process. The DNA templates on the beads are modified at their 3′ end to allow attachment to glass slides. A primer is annealed to an adapter on the DNA template and a mixture of fluorescently tagged oligonucleotides is pumped into the flow cell. When the oligonucleotide matches the template sequence, it is ligated onto the primer and the unincorporated nucleotides are washed away. A charge-coupled device (CCD) camera captures the different colours attached to the primer. Each fluorescence wavelength corresponds to a particular dinucleotide combination. After image capture, the fluorescent tag is removed and a new set of oligonucleotides is injected into the flow cell to begin the next round of DNA ligation [19]. This sequencing-by-ligation method in the SOLiD 5500xl platform generates up to 1,410 million PE reads of 75 + 35 nt each with an error rate of 0.01% and a reagent cost of approximately $10,500 per run [26].
Although widely accepted and used, the NGS platforms suffer from amplification biases introduced by PCR and dephasing due to varying extension of templates. The TGS technologies use single molecule sequencing, which eliminates the need for prior amplification of DNA, thus overcoming the limitations imposed by NGS. The advantages offered by TGS technology are (i) lower cost, (ii) high throughput, (iii) faster turnaround, and (iv) longer reads [19, 29]. The TGS can broadly be classified into three different categories: (i) SBS where individual nucleotides are observed as they incorporate (Pacific Biosciences single molecule real time (SMRT), Heliscope true single molecule sequencing (tSMS), and Life Technologies/Starlight and Ion Torrent), (ii) nanopore sequencing where single nucleotides are detected as they pass through a nanopore (Oxford/Nanopore), and (iii) direct imaging of individual molecules (IBM).
2.2.4. Helicos Biosciences Corporation (Heliscope) Sequencing
Heliscope sequencing involves DNA library preparation and DNA shearing followed by addition of a poly-A tail to the sheared DNA fragments. These poly-A tailed DNA fragments are attached to flow cells through poly-T anchors. The sequencing proceeds by DNA extension with one out of 4 fluorescent tagged nucleotides incorporated followed by detection by the Heliscope sequencer. The fluorescent tag on the incorporated nucleotide is then chemically cleaved to allow subsequent elongation of DNA [30]. Heliscope sequencers can generate up to 28 GB of sequence data per run (50 channels) with maximum read length of 55 bp at ~99% accuracy [31]. The cost per run per channel is approximately $360.
2.2.5. Pacific Biosciences SMRT Sequencing
The Pacific Biosciences sequencer uses glass-anchored DNA polymerases which are housed at the bottom of a zero-mode waveguide (ZMW). DNA fragments are added into the ZMW chamber with the anchored DNA polymerase, and nucleotides, each labeled with a different colour fluorophore, are diffused from above the ZMW. As the nucleotides circulate through the ZMW, only the incorporated nucleotides remain at the bottom of the ZMW while unincorporated nucleotides diffuse back above the ZMW. A laser placed below the ZMW excites only the fluorophores of the incorporated nucleotides, as the ZMW entraps the light and does not allow it to reach the unincorporated nucleotides above [32]. The Pacific Biosciences sequencers can generate up to 140 MB of sequences per run (per SMRT cell) with reads of 2.5 Kbp at ~85% accuracy. The cost per run per SMRT cell is approximately $600.
Among the TGS technologies, Pacific Biosciences SMRT and Heliscope tSMS have been used in characterizing bacterial genomes and in human-disease-related studies [31]; however, TGS has yet to be capitalized upon in plant genomes. The Heliscope generates short reads (55 bp) which may cause ambiguous read mapping due to the presence of paralogous sequences and repetitive elements in plant genomes. The Pacific Biosciences reads have high error rates which limit their direct use in SNP discovery. However, their long reads offer a definite advantage to fill gaps in genomic sequences and, at least in bacterial genomes, NGS reads have proven capable of “correcting” the base call errors of this TGS technology [33–36]. Hybrid assemblies incorporating short (Illumina, SOLiD), medium (454/Roche), and long reads (Pac-Bio) have the potential to yield better quality reference genomes and, as such, would provide an improved tool for SNP discovery.
The choice of a sequencing strategy must take into account the research goals, ability to store and analyze data, the ongoing changes in performance parameters, and the cost of NGS/TGS platforms. Some key considerations include cost per raw base, cost per consensus base, raw and consensus accuracy of bases, read length, cost per read, and availability of PE or single end reads. The pre- and postprocessing protocols such as library construction [37] and pipeline development and implementation for data analysis [38] are also important.
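As an illustration of the "cost per raw base" consideration, the per-run figures quoted in Sections 2.2.1 to 2.2.5 can be turned into an approximate reagent cost per gigabase. This is only a sketch under stated assumptions: "MB" and "GB" are read as megabases and gigabases, the SOLiD yield is derived from 1,410 million reads of 75 + 35 nt, and the Heliscope run cost is taken as 50 channels at $360 each. The numbers are the snapshot given in the text, not current prices.

```python
# (yield per run in gigabases, reagent cost per run in US dollars),
# taken from the figures quoted in the text above.
platforms = {
    "Roche 454 GS FLX Titanium XL+": (0.7, 6200.0),
    "Illumina HiSeq2500": (600.0, 23500.0),
    # 1,410 million paired-end reads of 75 + 35 nt each.
    "SOLiD 5500xl": (1410e6 * (75 + 35) / 1e9, 10500.0),
    # Assumption: a full run uses 50 channels at $360 per channel.
    "Helicos Heliscope": (28.0, 50 * 360.0),
    "Pacific Biosciences": (0.14, 600.0),
}

cost_per_gigabase = {
    name: cost / gigabases for name, (gigabases, cost) in platforms.items()
}

# Platforms ordered from cheapest to most expensive per raw gigabase.
ranked = sorted(cost_per_gigabase, key=cost_per_gigabase.get)
for name in ranked:
    print(f"{name}: about ${cost_per_gigabase[name]:,.0f} per gigabase")
```

Raw cost per base is only one of the criteria listed above; read length and accuracy differ widely between these platforms and are not captured by this ranking.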
2.3. RNA and ChIP Sequencing
Genome-wide analyses of RNA sequences and their qualitative and quantitative measurements provide insights into the complex nature of regulatory networks. RNA sequencing has been performed on a number of plant species including Arabidopsis [39], soybean [40], rice [41], and maize [42] for transcript profiling and detection of splice variants. RNA sequencing has been used in de novo assemblies followed by SNP discovery performed in nonmodel plants such as Eucalyptus grandis [43], Brassica napus [44], and Medicago sativa [45].
RNA deep-sequencing technologies such as digital gene expression [46] and Illumina RNASeq [47] are both qualitative and quantitative in nature and permit the identification of rare transcripts and splice variants [48]. RNA sequencing may be performed following its conversion into cDNA that can then be sequenced as such. This method is, however, prone to error due to (i) the inefficient nature of reverse transcriptases (RTs) [49], (ii) DNA-dependent DNA polymerase activity of RT causing spurious second strand DNA [50], and (iii) artifactual cDNA synthesis due to template switching [51]. Direct RNA sequencing (DRS) developed by Helicos Biosciences Corporation is a high throughput and cost-effective method which eliminates the need for cDNA synthesis and ligation/amplification leading to improved accuracy [52].
Chromatin immunoprecipitation (ChIP) is a specialized sequencing method that was specifically designed to identify DNA sequences involved in in vivo protein DNA interaction [53]. ChIP-sequencing (ChIP-Seq) is used to map the binding sites of transcription factors and other DNA binding sites for proteins such as histones. As such, ChIP-Seq does not aid SNP discovery, but the availability of SNP data along with ChIP-Seq allows the study of allele-specific states of chromatin organization. Deep sequence coverage leading to dense SNP maps permits the identification of transcription factor binding sites and histone-mediated epigenetic modifications [54]. ChIP-Seq can be performed on serial analysis of gene expression (SAGE) tags or PE using Sanger, 454, and Illumina platforms [55, 56].
The DNA, RNA, and ChIP-Seq data is analysed using a reference sequence if available or, in the absence of such reference, it requires de novo assembly, all of which is performed using specialized software, algorithms, pipelines, and hardware.
3. Computing Resources for Sequence Assembly
The next-generation platforms generate a considerable amount of data and the impact of this with respect to data storage and processing time can be overlooked when designing an experiment. Bioinformatics research is constantly developing new software and algorithms, data storage approaches, and even new computer architectures to better meet the computation requirements for projects incorporating NGS. This chapter describes the state-of-the-art with respect to software for NGS alignment and analysis at the time of writing.
3.1. Software for Sequence Analysis
Both commercial and noncommercial sequence analysis software are available for Windows, Macintosh, and Linux operating systems. NGS companies offer proprietary software such as Consensus Assessment of Sequence and Variation (CASAVA) for Illumina data and Newbler for 454 data. Such software tends to be optimized for its own platform and has limited applicability to the others. Web-based portals such as Galaxy [57] are tailored to a multitude of analyses, but the requirement to transfer multigigabyte sequence files across the internet can limit their usability to smaller datasets. Commercially available software such as CLC-Bio (http://www.clcbio.com/) and SeqMan NGen (http://www.dnastar.com/t-sub-products-genomics-seqman-ngen.aspx) provides a friendly user interface, is compatible with different operating systems, requires minimal computing knowledge, and can perform multiple downstream analyses. However, such packages tend to be relatively expensive, offer limited customizability, and require locally available high computing power. A recent review by Wang et al. [58] recommends Linux-based programs because they are often free, are not specific to any sequencing platform, and demand less computing power and, as a consequence, tend to run faster. Flexibility in the choice of parameters for read assembly is another major advantage. However, most biologists are unfamiliar with Linux operating systems, their structure, and their command lines, which imposes a steep learning curve for adoption. Linux-based software such as Bowtie [59], BWA [60], and SOAP2/3 [61] has been used widely for the analysis of NGS data. Other software may not have gained broad acceptance but may have unique features worth noting. For reviews on NGS software, see Li and Homer [62], Wang et al. [58], and Treangen and Salzberg [63]. Characteristics of the most common NGS software and their attributes are listed in Table 1, and their download information can be found in Table 4.
  • asked a question related to Single Nucleotide Polymorphism
Question
5 answers
I have developed a multiplex PCR panel for next-generation sequencing library preparation. This panel is used for the diagnosis of a particular bacterial infection, as well as for some SNPs I'm interested in.
The panel works well with sputum samples, but it failed to detect some expected SNPs when we tested it with FFPE samples. The copy number of this bacterium might be lower (120) than my detection limit (250). We still managed to get at least 5000x coverage at most of the SNP locations. But only about half the SNPs were called. Why not the others?
Relevant answer
Answer
Amplicon panels need samples that are in good shape and amplicons that are not too long. Did you test the quality of your samples (DIN or RIN), and how long are your amplicons?
fred
  • asked a question related to Single Nucleotide Polymorphism
Question
5 answers
Hello everyone,
I calculated some pairwise FST estimates with hierfstat (pairwise.fstWC function) in R on my SNP dataset and then calculated confidence intervals (not p-values) on those estimates with the same package (function boot.ppfst). Is there a need to correct my CI for multiple testing?
As far as I know, I am estimating parameters and not testing hypotheses (p-values), so I wouldn't need to correct for false positives. Am I correct? Can anyone explain this to me?
Thanks,
Giulia
Relevant answer
Answer
You are correct in your understanding. When you calculate pairwise FST estimates and derive confidence intervals (CIs) using the bootstrapping technique, you estimate parameters rather than conduct hypothesis tests. Therefore, there is no need to correct the CIs for multiple testing.
Multiple testing correction is typically applied when conducting hypothesis tests that involve multiple comparisons (e.g., testing for significant differences between groups). The purpose of correction methods, such as the Bonferroni correction or false discovery rate (FDR) control, is to control the overall false positive rate when testing multiple hypotheses simultaneously.
In your case, you are estimating parameters (pairwise FST values) and constructing confidence intervals around these estimates to quantify the uncertainty. The CIs provide a range of plausible values for the parameters, and they reflect the precision of your estimates based on the bootstrapping procedure. These intervals indicate how reliable your estimates are and allow you to draw appropriate conclusions from your data.
In summary, since you are not conducting hypothesis tests and focusing on parameter estimation instead, there is no need to correct the confidence intervals for multiple testing.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I am working on cattle genomics and am looking for a validated set of cattle SNPs to use with the VQSR approach in GATK. Is there any validated cattle SNP set? If not, am I supposed to use separate VCF files or a combined file (i.e. including both SNPs and INDELs) for hard filtering of SNPs and INDELs?
Relevant answer
Answer
The 1000 Bull Genomes Project aimed to sequence the genomes of a large number of cattle from various breeds and populations. The project provides a database of genetic variants, including validated SNPs, which you can access to find relevant variants for your study.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I would like to know if you could help me with a theoretical problem. I know the proportion of Europeans, Africans, and Amerindians in the population of Rio de Janeiro. I am studying the frequency of certain SNPs in a sample of leukemia patients. Ideally, I should have the frequency of the wild-type and mutated alleles in the population of Rio de Janeiro. However, this information is not available. I have the frequency information for these SNPs in the European, African, and Amerindian populations from the 1000 Genomes project. Would it be sufficient to calculate a weighted average? Do you have any books or articles that discuss this problem? I found many on the opposite scenario, determining the original populations based on a mixed population. Could you provide some guidance?
Relevant answer
Answer
Check dbSNP; you may find the proportion for each population group in Rio.
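On the weighted-average idea raised in the question: a minimal sketch, assuming the expected allele frequency in the admixed population is simply the ancestry-weighted mean of the source-population frequencies (no selection, drift, or sampling error). The function name and all numbers below are hypothetical and are not taken from 1000 Genomes or from any Rio de Janeiro survey.

```python
def admixed_allele_frequency(proportions, frequencies):
    """Ancestry-weighted mean of source-population allele frequencies."""
    if set(proportions) != set(frequencies):
        raise ValueError("proportions and frequencies must name the same populations")
    if abs(sum(proportions.values()) - 1.0) > 1e-6:
        raise ValueError("ancestry proportions must sum to 1")
    return sum(proportions[pop] * frequencies[pop] for pop in proportions)


# Hypothetical ancestry proportions and allele frequencies, for illustration only.
ancestry = {"european": 0.55, "african": 0.35, "amerindian": 0.10}
allele_freq = {"european": 0.20, "african": 0.05, "amerindian": 0.40}

expected = admixed_allele_frequency(ancestry, allele_freq)
print(f"Expected frequency in the admixed population: {expected:.4f}")
```

The result is only as good as the ancestry proportions and reference panels behind it, so it is best treated as a rough expectation rather than a replacement for matched local controls.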
  • asked a question related to Single Nucleotide Polymorphism
Question
4 answers
Dear All
I need a reference (a published study) that reports the minimum number of markers for GWAS.
Recently, I have read many GWAS papers published in high-profile international journals. The number of markers used in them varies widely, ranging from 200 up to 1,000,000 SNPs.
Relevant answer
Answer
Dear Gregor,
With a bigger genome (in humans, animals, or plants), you need more and more positions to detect significantly associated variants. The number depends on the spacing you specify in advance for selecting SNPs along the genome. The important thing is that you, or the platform that generated the data, pick positions evenly across the genome. Even distribution matters more than the number of SNPs.
So, if the number of SNPs in your study is low, you should expect to find fewer significant signals, and fewer novel ones. In other words, your chances of success decrease.
Hopefully helpful.
Best,
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
I have conducted a GWAS with GAPIT using almost 400k SNPs and nearly 500 accessions, but I cannot locate the R2 values in the output. Can anyone help me resolve this, since GAPIT keeps changing? The last time I used it was October 2022, and it was not like this. Thank you in advance.
Relevant answer
Answer
Last time I ran GAPIT using BLINK, I got a file, 'GAPIT.Sex.PVE_by_Association_Markers', and I had a student run it a few weeks ago and still got the file, are you not getting that?
Which R version, GAPIT version and algorithm are you running? Feel free to reach out to me
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
Hi,
I see some SNPs are represented with rs+number and some are rsl+number! I do not know what is the difference between the two IDs?
Relevant answer
Answer
Thank you.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Hi,
Other than randomForest, how do you go about running a GWAS on SNP genotyping data with categorical phenotypes (say, host species for a pathogen)?
Any pointers would be great!
-Marcin
Relevant answer
Answer
I use TASSEL to analyse SNP panels. It gives results similar to those from PLINK.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I have simple aligned Sanger sequencing data with SNPs for about 80 specimens. How can I reconstruct the most likely haplotypes from them?
Relevant answer
Answer
One limitation is that you can't "phase" the SNPs using Sanger because they all show as a double-peak.
If you have C/T at one site and G/C at another, there is no way to know if the correct pairings are:
Site 1 C with Site 2 G
or
Site 1 C with Site 2 C
Long-read, single molecule sequencing would answer that question. Or the "old-school" method of cloning and sequencing.
Good luck!
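To make the phasing ambiguity above concrete, here is a minimal sketch that enumerates every haplotype pair consistent with one individual's unphased genotypes. The function name is hypothetical, and the sketch only lists the possibilities; it does not estimate which one is most likely, which is what statistical phasing tools do using the whole sample.

```python
from itertools import product


def possible_phasings(genotypes):
    """List all unordered haplotype pairs consistent with unphased genotypes.

    genotypes: one (allele1, allele2) tuple per site, in site order.
    """
    het_sites = [i for i, (a, b) in enumerate(genotypes) if a != b]
    phasings = set()
    # Try both orientations at every heterozygous site.
    for flips in product((False, True), repeat=len(het_sites)):
        flip_at = dict(zip(het_sites, flips))
        hap1, hap2 = [], []
        for i, (a, b) in enumerate(genotypes):
            if flip_at.get(i, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        # Sorting the pair removes mirror-image duplicates (hap1/hap2 swapped).
        phasings.add(tuple(sorted(("".join(hap1), "".join(hap2)))))
    return sorted(phasings)


# The example from the answer: C/T at site 1 and G/C at site 2.
example = possible_phasings([("C", "T"), ("G", "C")])
for pair in example:
    print(" + ".join(pair))
```

An individual with n heterozygous sites has 2^(n-1) possible phasings, so the ambiguity grows quickly with the number of double peaks.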
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
How can you download SNPs from NCBI after the latest update to the site?
Relevant answer
Answer
  1. Go to the NCBI website (https://www.ncbi.nlm.nih.gov/) and click on the "All Databases" dropdown menu on the top left corner of the page.
  2. Select "dbSNP" from the list of databases.
  3. In the search bar, enter your search criteria, such as the SNP ID or gene name, and click "Search."
  4. Review the search results and select the SNP(s) you wish to download.
  5. Click on the "Send to" button on the top right corner of the page and select "File" as the destination.
  6. Choose the desired file format, such as text, Excel, or XML, and click "Create File."
  7. The file will download to your computer, and you can open it with the appropriate software to view the SNP data.
You can also use other tools provided by NCBI, such as the Variation Viewer, to download SNPs.
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
Hello
I am reading papers about SNPs and I find these sentences "a person with A / G and G / G type mutations in the skin of a healthy person"
I think it means when A turns into G. So what is the meaning of G/G? I can not understand it!!!!
Relevant answer
Answer
Hello,
Yes this is the best description and I already assumed that, but still not sure! or I am just complicating the whole subject!
Thanks a lot
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
I'm a field biologist working on social evolution. I want to switch from microsatellites to SNPs to determine kinship and sex of individuals (I'm working on a corvid bird). Does anybody know which commercial companies in Europe or North America can do DArTseq for the SNP discovery step and the subsequent SNP typing (e.g. with ddRAD methods or similar)? And what would the first step and the second step cost for 3000 individuals?
Thanks a lot!
Relevant answer
Answer
As someone that gets given data for similar studies (fingerprinting clonal crops) from lots of different providers I think it's best to just reach out to the companies for quotes.
DArT's new technology DArTag seems to be very popular right now rather than DArTseq.
LGC is offering FlexSeq, CaptureSeq and SNPseq (disappearing soon, but the technology is also offered as Allegro by Tecan if you look up SPET). FlexSeq may actually be the best for you since you will be able to construct haplotypes that then could be more powerful than microsats. LGC also offer KASP assays if you just want SNPs (if you bring this in house and use PACE it can be very cheap). LGC also have European and US centers whereas I'm not sure if DArT have centers exclusively in Australia (I doubt it considering the amount of studies using that platform).
Good luck! Happy to give more details if needed
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
Hello,
I have human TNF-A DNA sequences aligned in FASTA format, and I want to import them into Haploview in order to represent the linkage disequilibrium between SNPs.
Can someone explain how to convert my FASTA alignments into a format accepted by Haploview?
Thank you in advance.
Relevant answer
Thank you for your answer,
But I've read the manual several times and it presents only six formats that Haploview can accept. The manual doesn't talk about what to do if you have DNA sequences aligned in fasta format.
When I import the sequences I have saved in phased format with the DnaSP software it gives me an error.
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
I am trying to use ANGSD (Korneliussen et al. 2014) to calculate population allele frequencies from PL values (Phred-scaled genotype likelihoods). However, ANGSD expects a beagle file containing genotype probabilities and not genotype likelihoods, so I was not able to get the allele frequencies. I am wondering if there is a way to get genotype probability values from PL values that I have in a vcf file or from genotype likelihoods that I have in a beagle file.
Thank you in advance!
Relevant answer
Answer
It is possible to convert PL values to genotype probabilities. PL values are Phred-scaled genotype likelihoods, so first convert each one back to a likelihood with L = 10^(-PL/10). Then weight each likelihood by a prior genotype probability, for example from the Hardy-Weinberg equilibrium model, which describes the expected distribution of genotypes in a population:
P(AA) = p^2, P(Aa) = 2pq, P(aa) = q^2
where p is the frequency of the A allele and q = 1 - p is the frequency of the non-A allele. Multiplying each likelihood by the matching prior and dividing by the sum over the three genotypes gives the genotype probabilities P(AA), P(Aa), and P(aa).
One way to do this is to use the package "vcfR", which provides functions to read and manipulate VCF files in R.
Another way is to use the package "GenotypeIO" which provides utility functions for reading and writing genotype data in R.
You can also use the package "SNPRelate" which has a function "gtl2gprob()" that converts genotype likelihoods to genotype probabilities.
Be aware that each package expects its files to be organised in its own way, so you may need to adjust yours accordingly.
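For anyone who wants to see the arithmetic, here is a minimal sketch of the conversion for one biallelic site. It assumes the PL values are in VCF order (0/0, 0/1, 1/1), that a Hardy-Weinberg prior is acceptable, and that the ALT allele frequency is already known or estimated. The function names are hypothetical and do not come from ANGSD, BEAGLE, or any R package.

```python
def hwe_prior(alt_freq):
    """Hardy-Weinberg genotype priors for (0/0, 0/1, 1/1)."""
    q = alt_freq
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]


def genotype_posteriors(pl, alt_freq):
    """Convert Phred-scaled likelihoods (PL) to posterior genotype probabilities."""
    # Undo the Phred scaling: PL = -10 * log10(likelihood), up to a constant.
    likelihoods = [10.0 ** (-value / 10.0) for value in pl]
    prior = hwe_prior(alt_freq)
    unnormalised = [lk * pr for lk, pr in zip(likelihoods, prior)]
    total = sum(unnormalised)
    return [value / total for value in unnormalised]


# Hypothetical PL field and allele frequency, for illustration only.
posteriors = genotype_posteriors([0, 30, 60], alt_freq=0.5)
print([round(value, 4) for value in posteriors])
```

If every PL value is equal, the data carry no information and the posteriors fall back to the prior, which is a quick sanity check on the implementation.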
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Hi!
I'm looking for a database where I could find different SNPs for a set of genes in the context of NSCLC. I've tried the Haploview software, but I suspect its servers have been down for about 10 years: the software doesn't work because it can't connect to the HapMap database.
I would be very grateful if anyone could help me in my thesis project.
Relevant answer
Answer
Hi Jose
When you go to the UCSC website (http://genome.ucsc.edu) and search for your genes of interest in its genome browser, there is an option for viewing the SNPs from the database you want.
You can also get this information as xls files by digging in the Table Browser.
all the best
fred
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
Hi there,
I have a PCR sequence for the SNP of a particular gene,
and I want to submit it to NCBI, so I need to know how to submit the sequence to NCBI to get an accession number online.
Is there anyone who can help? Regards
Relevant answer
Answer
Go to dbSNP, enter the name (e.g. rs17102287) or the sequence, and search. Then select the genomic position and you will get a lot of sequence on either side of your SNP. If you know the gene and want its accession number, search for "accession number for (gene name)", select GeneCards, and scroll down to the identifiers.
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
Dear community,
I am planning on conducting a GWAS analysis with two groups of patients differing in binary characteristics. As this cohort is naturally very rare, our sample size is limited to a total of approximately 1500 participants (a low number for GWAS). Therefore, we were thinking of studying associations with pre-selected genes that might be phenotypically relevant to our outcome. As no prior data or arrays exist that studied similar outcomes in a different patient cohort, we need to identify regions of interest bioinformatically.
1) Do you know any tools that might help me harvest genetic information for known pathways involved in relevant cell-functions and allow me to downscale my number of SNPs whilst still preserving the exploratory character of the study design? e.g. overall thrombocyte function, endothelial cell function, immune function etc.
2) Alternatively: are there bioinformatic ways (AI etc.) that circumvent the problem of multiple testing in GWAS studies and would allow me to robustly explore my dataset for associations even at lower sample sizes (n < 1500 participants)?
Thank you very much in advance!
Kind regards,
Michael Eigenschink
Relevant answer
Answer
For the second part of your problem, you can try the vcf2gwas pipeline, which is very easy to run as a Docker image.
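On the multiple-testing side of the question, a minimal sketch of why pre-selecting SNPs helps. It assumes a plain Bonferroni correction and treats every tested SNP as an independent test, which is conservative when SNPs are in linkage disequilibrium. The SNP counts are hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical numbers of tested SNPs, for illustration only.
genome_wide = bonferroni_threshold(1_000_000)
candidate_panel = bonferroni_threshold(2_000)

print(f"Genome-wide threshold:     {genome_wide:.1e}")
print(f"Candidate-panel threshold: {candidate_panel:.1e}")
```

Reducing the number of tests relaxes the threshold, but the candidate list has to be fixed before looking at the association results, otherwise the correction no longer holds.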
  • asked a question related to Single Nucleotide Polymorphism
Question
4 answers
I analyzed two SNPs for linkage disequilibrium using SNPStats. It gave D'=0.9995 and r=0.99, but when I inspect my SNP data manually, these two SNPs are always either present or absent together in all of my study subjects, which hints that they are completely linked. Can I say these two SNPs are in 'perfect linkage disequilibrium' despite the D' and r values being very slightly less than 1? Do we ever get D' and r values equal to 1 in practice? Do these values get closer to 1 with increasing sample size for SNPs in perfect linkage disequilibrium?
Relevant answer
Thanks Puneetpal for clearing my doubts
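For readers following this thread, here is a minimal sketch of how D' and r^2 are computed from phased haplotype counts. It assumes the four two-locus haplotype counts are known; tools that start from unphased genotypes must estimate haplotype frequencies first, which is one reason reported values can sit just below 1. The function name and the counts are hypothetical.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_AB - p_A * p_B
    # D' rescales D by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return d, abs(d) / d_max, d * d / denom


# Only two haplotypes observed: both D' and r^2 equal 1.
_, d_prime_perfect, r2_perfect = ld_stats(30, 0, 0, 70)
# A third haplotype appears once: D' stays at 1 but r^2 drops below 1.
_, d_prime_three, r2_three = ld_stats(30, 1, 0, 69)

print(round(d_prime_perfect, 4), round(r2_perfect, 4))
print(round(d_prime_three, 4), round(r2_three, 4))
```

In this sketch r^2 reaches 1 only when just two of the four haplotypes are present, while D' reaches 1 whenever at least one haplotype is missing.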
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
What is the meaning of the string (*) in CYP3A4*22, knowing that this SNP has the ID rs35599367?
Relevant answer
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
Can someone please explain to me the logic behind identifying genes located within 50 kb, 100 kb, and 500 kb (on both sides) of a SNP locus? How does the SNP affect the function of the genes within these windows?
Relevant answer
Answer
Hi
you can see this one:
all the best
fred
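As a concrete illustration of the window approach asked about above, here is a minimal sketch that lists genes overlapping a symmetric window around a SNP. The gene names and coordinates are hypothetical; in practice they would come from a real annotation file (GFF/GTF or BED), and physical proximity alone does not show that the SNP affects a gene.

```python
def genes_near_snp(snp_chrom, snp_pos, genes, window):
    """Return (gene, distance) pairs for genes overlapping snp_pos +/- window.

    genes: iterable of (name, chrom, start, end) with start <= end.
    Distance is 0 when the SNP lies inside the gene.
    """
    hits = []
    for name, chrom, start, end in genes:
        if chrom != snp_chrom:
            continue
        # Interval overlap test between the gene and the window.
        if start <= snp_pos + window and end >= snp_pos - window:
            if start <= snp_pos <= end:
                distance = 0
            else:
                distance = min(abs(start - snp_pos), abs(end - snp_pos))
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical annotation, for illustration only.
annotation = [
    ("geneA", "chr1", 100_000, 120_000),
    ("geneB", "chr1", 190_000, 200_000),
    ("geneC", "chr1", 600_000, 620_000),
    ("geneD", "chr2", 100_000, 110_000),
]

for window in (50_000, 100_000, 500_000):
    print(window, genes_near_snp("chr1", 150_000, annotation, window))
```

Wider windows pick up more candidate genes, so the window size is a trade-off between missing distal regulatory targets and adding unrelated neighbours.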
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Hi,
Is there anyone with experience analyzing SNP sequencing data for specific genes?
regards
Relevant answer
Answer
You may parse SNPs from the NGS sequencing reads using scikit-allel.
  • asked a question related to Single Nucleotide Polymorphism
Question
1 answer
Hello!
I am trying to find SNPs which are in linkage disequilibrium with my target SNP on LDproxy (https://ldlink.nci.nih.gov/?tab=ldproxy), however I realize I get very different results using different genome builds (hg37 and hg38). In particular, one interesting LD SNP result I got for hg37 completely disappeared for hg38. The SNP can still be found on hg38 ENSEMBL.
I do not think the query region has been removed or changed from hg37 to hg38, and I would really appreciate any explanation regarding the vast differences in results for hg38 and hg37.
Thank you and have a great day! :)
Relevant answer
Answer
Hi Kay
Some changes were made as the human genome assemblies evolved. In the latest assembly all sequences are now published, and therefore some positions have changed.
Just go to the UCSC Genome Browser (http://genome.ucsc.edu), search for your genes or region of interest, choose the CHM13 version of the genome, and open the optional track named "cactus alignment" to get an idea of what happened.
Maybe it's just a problem of positions; you should rely on the rsIDs rather than on their positions.
all the best
fred
  • asked a question related to Single Nucleotide Polymorphism
Question
3 answers
Hi all,
I have a quite large set of SNPs from several geographically distinct populations. I want to estimate average allele frequencies for each SNP among, let's say, the two or three westernmost and the two or three easternmost populations.
As I understand it, this is a rather trivial task for plenty of pipelines, but I have no idea how to do it correctly.
Can I do it manually by taking the AF value of each allele in each population (from the VCF) and then calculating the arithmetic mean across the selected populations, or should this task be solved with a utility that uses more complex mathematics?
Relevant answer
Answer
Hi Danila
you can dig in the dbSNP tool from NCBI (https://www.ncbi.nlm.nih.gov/snp/rs28360521) and pick some data (the link shows you an example of one snp).
all the best
fred
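On the averaging question itself, a minimal sketch contrasting two options: a plain arithmetic mean of per-population frequencies, and a pooled frequency that weights each population by its number of genotyped alleles. It assumes the alternate allele count (AC) and total allele number (AN) are available per population, as in the VCF INFO fields of the same names. The counts are hypothetical.

```python
def mean_of_frequencies(populations):
    """Unweighted mean of per-population frequencies: every population counts equally."""
    freqs = [ac / an for ac, an in populations]
    return sum(freqs) / len(freqs)


def pooled_frequency(populations):
    """Pooled frequency: every sampled allele counts equally."""
    total_ac = sum(ac for ac, _ in populations)
    total_an = sum(an for _, an in populations)
    return total_ac / total_an


# Hypothetical (AC, AN) pairs for two western populations of unequal sample size.
western = [(10, 20), (30, 200)]

unweighted = mean_of_frequencies(western)
pooled = pooled_frequency(western)
print(f"Mean of frequencies: {unweighted:.4f}")
print(f"Pooled frequency:    {pooled:.4f}")
```

The two values agree only when sample sizes are equal. Which one is appropriate depends on whether the group is meant to represent its populations equally or its sampled individuals equally.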
  • asked a question related to Single Nucleotide Polymorphism
Question
2 answers
I've been asked to help with a project determining genotypes of the gene coding for a specific enzyme in patient samples, but the most recent literature seems to only mention the use of qPCR with specific probes for the detection of the specific SNPs of interest. Due to the characteristics of the project, we can't do qPCR at the moment, just PCR, but we have access to sequencing as well, so I wanted to know if that's a feasible option, at least while we are able to perform qPCR
More specifically, not knowing much about the way sequencing is currently done beyond undergrad-level lessons, I wanted to know if heterozygosity at a specific SNP could be picked up by commercial sequencing, i.e. getting two smaller, roughly equally sized peaks at a specific position instead of a single peak, from a PCR-amplified fragment encompassing the SNP location. And if that's the case, how certain could I be that it is actually a SNP and not just an artifact (assuming it's in the expected location)?
Thanks in advance !!