Questions related to GWAS
I am planning to conduct a GWAS with two groups of patients that differ in a binary characteristic. Because this cohort is naturally very rare, our sample size is limited to approximately 1500 participants in total (a low number for GWAS). We were therefore thinking of studying associations with pre-selected genes that might be phenotypically relevant to our outcome. As no prior data/arrays exist that studied similar outcomes in a different patient cohort, we need to identify regions of interest bioinformatically.
1) Do you know any tools that might help me harvest genetic information for known pathways involved in relevant cell-functions and allow me to downscale my number of SNPs whilst still preserving the exploratory character of the study design? e.g. overall thrombocyte function, endothelial cell function, immune function etc.
2) Alternatively: are there bioinformatic ways (AI etc.) that circumvent the problem of multiple testing in GWAS studies and would allow me to robustly explore my dataset for associations even at lower sample sizes (n < 1500 participants)?
Thank you very much in advance!
Would a GWAS of genetic predisposition to LVH with 500 samples be considered a low-powered study? If so, how can a small sample size be justified? Has GWAS research been conducted with small sample sizes?
I was wondering if there are any bioinformatic tools to search for species-specific genes across species' genomes. For example, in the Venn diagram attached below (Kim et al., 2011), there are overlapping gene families between these species. How would you go about finding these species-specific genes without simply checking every gene? Ensembl seems to be a good start, but I'm having trouble narrowing down the right tools. A big goal is to find genes with only a few existing orthologs (i.e. maybe genes only appear in certain species).
Can someone please explain to me the logic behind identifying genes located within 50 kb, 100 kb and 500 kb (on both sides) of a SNP locus? How does the SNP affect the function of the genes within these windows?
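To make the window idea concrete, here is a minimal Python sketch that lists the genes overlapping a symmetric window around a SNP. The gene names, coordinates and the `genes_in_window` helper are all made up for illustration; a real analysis would read gene coordinates from an annotation file.

```python
from typing import List, Tuple

# (name, chromosome, start, end); a hypothetical layout for this sketch
Gene = Tuple[str, str, int, int]

def genes_in_window(chrom: str, pos: int, genes: List[Gene], flank: int) -> List[str]:
    """Return names of genes overlapping [pos - flank, pos + flank] on chrom."""
    lo, hi = pos - flank, pos + flank
    return [name for name, g_chrom, start, end in genes
            if g_chrom == chrom and start <= hi and end >= lo]

# Made-up gene coordinates, not real annotations
genes = [
    ("geneA", "1", 1_000_000, 1_020_000),
    ("geneB", "1", 1_110_000, 1_140_000),
    ("geneC", "1", 1_400_000, 1_450_000),
    ("geneD", "2", 1_000_000, 1_010_000),
]

snp_chrom, snp_pos = "1", 1_050_000
for flank_kb in (50, 100, 500):
    hits = genes_in_window(snp_chrom, snp_pos, genes, flank_kb * 1000)
    print(f"+/-{flank_kb} kb: {hits}")
```

Wider windows pick up more candidate genes, so the choice of window trades sensitivity against the number of candidates to follow up.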
I am performing a GWAS. While comparing my pre-imputation and post-imputation data, I observed that the most significant genetic variant (p < 1 × 10^-16) in the pre-imputation data is no longer significant after imputation. Imputation was performed on the Michigan Imputation Server using the 1000 Genomes Phase 3 v5 reference data (SAS population). These variants were left out when the target data were matched to the reference data. How do I overcome this? What should be reported in the manuscript (pre- or post-imputation data)? How do we justify such findings?
We have a genotyped GWAS population and we want to select the minimum (optimal) set of genotypes to phenotype in order to detect GWAS signals. Is any software or method available?
Screening for herbicide resistance and identifying genomic regions associated with it in maize
How would one associate a single phenotype factor (binary presence / absence of trait) to SNPs in Tassel (v5.2.58) GUI?
I generated VCF file using GATK and am able to import the VCF and phenotype data into Tassel. The phenotype data is just two columns the 'Taxa' and the 'Factor' (as Y and N; where Y = has the phenotype and N = does not).
The desired end result is a Manhattan plot with any SNPs associated with the trait, and ideally, an output file which contains the SNP locations.
I couldn't find many papers in which plant breeders use biochemical markers such as proline content or malondialdehyde (MDA), or dyes such as NBT or DAB (for ROS detection), to screen stress-tolerant accessions on a large scale (100-200, which I suppose is feasible). Aren't these methods better than phenotyping grain yield, biomass, plant height, NDVI, LAI, etc.?
I genotyped SNPs in two different plant species with the TASSEL5-GBS pipeline. In one species I am getting a very high number of missing genotypes (N) (attached Figure 1), but in the other species I am not (Figure 2).
What could be the reason for the high number of missing genotypes (N), and how should they be filtered before downstream analyses such as GWAS, LD and so on?
We obtained our SNP genotyping data from an external lab. We found that some positions contain the letters "D" and "I". Do you know what these mean? They also appear in the reference fields.
1 13901895 chr1_13901894 D I
1 13903334 chr1_13903333 I D
1 13903422 chr1_13903421 I .
Thank you very much for your help.
I am doing my research on Cancer biology.
I need your suggestions on how to perform a GWAS of autophagy-related genes in Homo sapiens.
Could you please suggest some tools for this?
Thank you all in advance.
We have genotypic data from 150 SSR markers for 190 rice landraces and want to prepare a research article. What kinds of analysis can be performed with these data? Kindly give your opinions and suggestions.
Why is the term covariate used for the independent variable in GWAS? It seems there is a sense of ambiguity about the term "covariate" in GWAS.
In statistics, in analysis of covariance (ANCOVA), the auxiliary variable is called a covariate, so "covariate" has a meaning distinct from the independent variable. In genomic analysis, however, all variables are called covariates, whether they are really auxiliary variables or independent variables.
I want to do a meta analysis from GWAS and publications. For the GWAS studies, I only have the beta value, p value and total number of samples while for the publications I have sufficient data that could be used to calculate effect size.
Is there a way to compute effect size from only the beta, pval and N?
If not, is there any way to calculate beta using the mean, SEM and N of cases and controls from publications?
Thanks in advance!
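One common way to recover a standard error from a reported beta and p-value is sketched below in Python. It assumes the p-value is two-sided and comes from a Wald (z) test, which is usually reasonable at GWAS sample sizes; the function name `se_from_beta_p` and the example numbers are illustrative only.

```python
from statistics import NormalDist

def se_from_beta_p(beta: float, p: float) -> float:
    """Back-calculate the standard error of beta from a two-sided Wald p-value."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # |z| for a two-sided p-value; using the lower tail stays accurate for tiny p
    z = -NormalDist().inv_cdf(p / 2.0)
    return abs(beta) / z

beta, p = 0.10, 0.05            # illustrative values, not from a real study
se = se_from_beta_p(beta, p)
ci_low, ci_high = beta - 1.96 * se, beta + 1.96 * se
print(f"SE = {se:.4f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

With beta and SE in hand, the GWAS result can enter an inverse-variance weighted meta-analysis alongside the published studies, provided the effects are on the same scale.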
Hi, I am trying to understand quality control in GWAS for individuals and for SNPs. I don't understand what is missing rate (--missing) for individuals and MAF (--freq) for SNPs. Can anyone explain it in plain language?
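In plain terms, the missing rate of an individual is the fraction of SNPs with no genotype call for that person, and the MAF of a SNP is the frequency of its less common allele in the sample. The Python sketch below computes both on a tiny made-up genotype matrix, assuming genotypes are coded as 0/1/2 copies of one allele with `None` for a missing call; it illustrates the definitions and is not a replacement for PLINK.

```python
from typing import List, Optional

# Rows are individuals, columns are SNPs; None marks a missing genotype call
Genotypes = List[List[Optional[int]]]

def individual_missing_rate(geno: Genotypes) -> List[float]:
    """Fraction of SNPs without a call, per individual."""
    return [sum(g is None for g in row) / len(row) for row in geno]

def minor_allele_freq(geno: Genotypes) -> List[float]:
    """Frequency of the less common allele, per SNP, ignoring missing calls."""
    mafs = []
    for j in range(len(geno[0])):
        calls = [row[j] for row in geno if row[j] is not None]
        if not calls:
            mafs.append(float("nan"))  # no information for this SNP
            continue
        freq = sum(calls) / (2 * len(calls))  # each person carries two alleles
        mafs.append(min(freq, 1.0 - freq))
    return mafs

geno = [
    [0, 1, None, 2],
    [0, 0, 1, 2],
    [1, None, None, 2],
    [0, 1, 0, 1],
]
print(individual_missing_rate(geno))
print(minor_allele_freq(geno))
```

In quality control, individuals with a high missing rate and SNPs with a very low MAF are typically removed, with the cut-offs chosen per study.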
I have a big list of significant SNPs (>30K) from a GWAS/meta-analysis. Can you please suggest what are some best ways to find the respective gene names and further classify them as already reported and novel ones?
Thanks in advance
Hi, I am currently using Illumina BeadChips for GWAS analysis.
I have two chip versions: bovinesnp50_v1 and bovinesnp50_v2.
However, the chromosome and marker positions in their SNP_map files differ.
How can I use the two versions together?
Can I impute from bovinesnp50_v1 to bovinesnp50_v2?
If so, how do I handle the different marker positions after imputation?
I would appreciate any answers. Have a nice day :)
Including these steps: 1) raw data format transformation for five companies; 2) updating the positions of all SNPs to build 37 (GRCh37/hg19); 3) quality control within companies; 4) pre-phasing (SHAPEIT2) and imputation (IMPUTE2) for all SNPs of each company; 5) GWAS using two logistic models for 27 phenotypes; 6) statistical and downstream bioinformatic analysis; 7) estimation of genetic parameters (rg and hg); 8) PRS analysis. My dataset consists of only a little more than 1000 people. With no background knowledge, how long would this take a bioinformatics master's student?
There are very few publications in which genes identified through GWAS have been functionally characterized (cloning, overexpression, silencing, etc.). Most publications on functional characterization deal with genes identified through transcriptome studies. Why is this? I wonder whether GWAS is of any use for crop improvement. If it is, could you give me some examples of successful publications?
I recently received many *.CEL files from a recent UK Biobank genotyping run. According to the SNPolisher guide, I have to compute certain metrics on these SNPs and keep only the ones that meet essential criteria. It is not clear (at least to me, as a first-time user) how these CEL files are used as SNPolisher inputs. The first input is: Ps_Metrics(posteriorFile, callFile, output.metricsFile, pidFile).
Is there a previous step in which the CEL files are converted into these posterior or call files? Both should be in *.txt format, but all I have are *.CEL files.
Hope for some guidance from any expert!
Hello fellow researchers,
I am currently working with very large SNP data sets (more than 2 million SNPs) to investigate whether GWAS-significant SNPs are located within certain genomic regions more frequently than non-significant SNPs. I have a 2x2 table with the absolute number of SNPs in the significant vs. non-significant group that are located within the specific region or not. I need to check my results for statistical significance, which I initially did with the chi-square test. But because the numbers are so large, every comparison comes out as (putatively) statistically significant. I know that some publications report Cramér's V as an additional indicator, but I would rather use an alternative if one exists. Do any of you know good alternative tests or methods for such large numbers that avoid this large-sample-size bias? How do you normally deal with such huge sample sizes?
I would be grateful for any tip or advice.
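One option for very large 2x2 tables is to report an effect size with a confidence interval next to the p-value. The Python sketch below computes the chi-square statistic, its p-value, Cramér's V, and the odds ratio with a 95% Wald interval. The counts are invented, and the cell layout (a, b, c, d) is an assumption of this sketch.

```python
import math
from typing import Dict

def two_by_two_summary(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """a: significant & in region, b: significant & outside,
    c: non-significant & in region, d: non-significant & outside."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square upper tail, 1 df
    cramers_v = math.sqrt(chi2 / n)             # effect size for a 2x2 table
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return {
        "chi2": chi2,
        "p_value": p_value,
        "cramers_v": cramers_v,
        "odds_ratio": odds_ratio,
        "or_ci_low": math.exp(log_or - 1.96 * se_log_or),
        "or_ci_high": math.exp(log_or + 1.96 * se_log_or),
    }

# Invented counts for roughly 2 million SNPs
summary = two_by_two_summary(a=1_200, b=8_800, c=200_000, d=1_790_000)
for name, value in summary.items():
    print(f"{name}: {value:.4g}")
```

With counts like these the p-value is tiny while Cramér's V stays close to zero, so the odds ratio and its interval say more about whether the enrichment is large enough to matter.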
Hi, I am working on GWAS for identification of markers and candidate genes associated with leaf rust and powdery mildew resistance. I was reading an old paper and in that paper they identified one QTL associated with leaf rust resistance on long arm of chromosome 2H. They mentioned one SSR marker close to that QTL. Now I was interested to compare that QTL position with my own results because I also identified one QTL on long arm of chromosome 2H.
Is there any way to identify the physical position of that QTL from the SSR marker information?
I will appreciate your response.
Can anyone please tell me the database names or websites from where I can download human SNP datasets along with the quantitative traits (phenotypes) for genome-wide association studies (GWAS)?
I conducted a Mendelian randomisation study to assess the effect of X (exposure) on Y (outcome, i.e. case status), but I am not sure whether our outcome definition (the cases) is valid. My question is: how can I check the validity of my outcome using genetic data? For example, is there a reasonable method for checking the genetic correlation between our outcome and a previously published GWAS with gold-standard case ascertainment?
The relationship between gene expression and disease is widely known. However, can the ratio between two genes (Gene A expression value / Gene B expression value) cause a disease or phenotype in any organism?
I want to perform GWAS analysis in a crop in which markers are not assigned to chromosomes. We have developed SSRs from ddRad seq information, these markers are not assigned to specific chromosomes yet. Please share information to address this issue.
I am currently working in bioinformatics. I have received raw NGS data listing SNPs, CHR, position, etc. As a first step of GWAS, I would like to create Manhattan plots using the qqman package in R. However, I do not know how to calculate the p-value of each SNP in my data.
Can anyone please help me?
Thanks so much!
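A p-value per SNP needs a phenotype as well as genotypes, since it comes from a test of association between the two. As a minimal Python sketch, assuming a binary case/control phenotype and genotypes coded as 0/1/2 copies of the alternate allele, the basic allelic chi-square test looks like this. The data are invented, and dedicated tools such as PLINK additionally handle covariates and population structure.

```python
import math
from typing import List, Tuple

def allelic_test(genotypes: List[int], is_case: List[bool]) -> Tuple[float, float]:
    """Allelic chi-square test (1 df) for one SNP; returns (chi2, p_value)."""
    case_alt = sum(g for g, c in zip(genotypes, is_case) if c)
    ctrl_alt = sum(g for g, c in zip(genotypes, is_case) if not c)
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    case_ref = 2 * n_case - case_alt   # two alleles per individual
    ctrl_ref = 2 * n_ctrl - ctrl_alt
    n = 2 * len(genotypes)
    denom = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
             * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    if denom == 0:
        return 0.0, 1.0                # monomorphic SNP or only one group
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / denom
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Invented data: 8 cases followed by 8 controls
genotypes = [2, 2, 1, 2, 1, 2, 2, 1, 0, 1, 0, 0, 1, 0, 1, 0]
is_case = [True] * 8 + [False] * 8
chi2, p = allelic_test(genotypes, is_case)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

Running the test for every SNP gives the P column that, together with SNP, CHR and BP, a Manhattan plot needs.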
I am starting to work in genome wide association studies (GWAS) in Arabidopsis and I would like to know what is the difference between GWAPP Web application and performing the analysis on R. Can I get the same results with either of these tools?
1. If our population size is below 100 (say, between 50 and 70), can we still do QTL mapping?
2. Can we use different generations (F3, F4, F5) obtained from crosses between different parents (A x B, A x C, A x D) for QTL mapping or GWAS?
I have two cohort datasets of 500 samples each, with genotype files for 200 SNPs (example1.gen and example2.gen) as input. From the .gen files I am trying to generate other formats: BGEN (example1.bgen and example2.bgen), sample files with sample IDs and phenotypes for each cohort (example1.sample and example2.sample), and VCF (example1.vcf and example2.vcf). I am currently using SNPTEST (latest version) for the GWAS analysis. Is there a way to do this?
Thanks in advance
Type                    Hap_type  Indica  Japonica
C--C--C---C---GGGGAAAA  Hap1      10      128
C--C--C---C---GGGGCCAA  Hap2      224     53
CCACCACCAGCCAGGGGGCCGG  Hap3      0       73
Can you see the deletions and insertions? What are the implications of these changes for the three haplotypes? What description or conclusion can be drawn from this about the haplotypes of this gene and the subspecies?
It's urgent. Thanks in advance for your responses.
I want to calculate the number of independent SNPs in order to set a significance threshold for GWAS, as the basic Bonferroni correction is too conservative. Because of the correlation between SNPs, we cannot assume they are independent. FDR adjustment doesn't seem suitable either. My genotype data has 1.2 million SNPs. I applied the basic Bonferroni and FDR corrections, but the results are not satisfactory. Many papers I have read set the threshold as follows.
1- Calculate the number of independent SNPs.
2- Apply a basic Bonferroni correction.
Suppose the number of independent markers is 11500; then at an error rate of 0.05 the threshold would be 0.05/11500 ≈ 4.3E-6.
A program called the Genetic Type I Error Calculator (GEC) can calculate it. The problem is that the GEC website is unstable. I tried to download it multiple times but couldn't get through. I would highly appreciate it if anyone who has this software could share it with me.
Another question: can we obtain independent SNPs by LD pruning? I tried it in PLINK, but I am not sure whether LD pruning really yields independent SNPs.
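The threshold arithmetic in this question is simple enough to check in a few lines of Python. The sketch below divides the error rate by the number of tests, once with the 11500 independent markers from the example and once with all 1.2 million SNPs; the function name `bonferroni_threshold` is just for illustration.

```python
import math

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold: alpha divided by the number of tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests

for label, n_tests in (("independent markers", 11_500), ("all SNPs", 1_200_000)):
    thr = bonferroni_threshold(0.05, n_tests)
    print(f"{label}: threshold = {thr:.2e}, -log10 = {-math.log10(thr):.2f}")
# independent markers: threshold = 4.35e-06, -log10 = 5.36
# all SNPs: threshold = 4.17e-08, -log10 = 7.38
```

The -log10 value is the height at which the threshold line would be drawn on a Manhattan plot.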
This is just a hypothetical question. Suppose you genotyped 500 BC3F2 rice plants, phenotyped them and conducted a GWAS. Then, two generations later, you phenotyped the BC3F4 plants and got interesting phenotypic results. Would it be possible to run a GWAS using the phenotypic data from the BC3F4 plants with the genotyping data from the BC3F2 plants?
I have conducted a GWAS and estimated the genetic variance and h2 explained by all SNPs together, but I need these estimates for each SNP individually.
Could someone advise me or point me to software or a method that calculates exactly these values? I would appreciate any help.
I am trying to perform a genome-wide association on mitogenome variants (obtained from GATK-Mutect2) vs. phenotypes by using GAPIT. The number of individuals is seven and GAPIT function does not work due to the low number of indvs. We have increased the number of the same genotype (vcfs) and phenotype files by basically copying-pasting them under different names and run the function. Would this method affect the results in any way? Is there any other way to increase the number of individuals for GWAS studies?
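On whether copying samples affects the results: duplicated individuals are not independent observations, so test statistics grow with the number of copies even though no new information is added. The Python sketch below shows this with a plain 2x2 chi-square test on invented counts for seven individuals. It only demonstrates the inflation; with counts this small a chi-square test would not be appropriate in the first place.

```python
import math
from typing import Tuple

def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def duplicated(table: Tuple[int, int, int, int], copies: int) -> Tuple[int, int, int, int]:
    """The table obtained by copying every individual `copies` times."""
    return tuple(x * copies for x in table)

# Invented counts: (variant & trait, variant & no trait, no variant & trait, neither)
base = (3, 1, 1, 2)   # seven individuals in total
for copies in (1, 5, 10):
    chi2 = chi2_2x2(*duplicated(base, copies))
    print(f"{copies:>2} copies: chi2 = {chi2:.2f}, p = {p_value_1df(chi2):.4f}")
```

The statistic scales linearly with the number of copies, so the p-value can be pushed as low as one likes purely by duplication.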
I have conducted GWAS analysis through TASSEL software using SNPs generated from genotyping by sequencing data(GBS). But I do not have the reference genome. How do I calculate LD decay values and locate QTLs on LD map? Kindly let me know.
I'm looking for a software to perform GWAS of binary traits, using imputed data, and that can take into account the relatedness between the individuals.
I have recently started studying genomic data privacy and it seems the field is relatively new. I am looking for the existing problems. Implementing Homomorphic encryption or differential privacy has a lot going on. Can anyone suggest any other existing challenges?
I have a list of 2000 chicken SNPs and need to find the genes closest to all of them.
For small sets of SNPs I used to do this one by one using org.Gg.eg.db, BiocManager, Entrezquery and the chicken database in R, but that approach is very time-consuming.
I would appreciate it if anyone could suggest a reasonable and practical way to do this.
Frankly, I have a hitch here and I need your constructive comments or suggestions.
Thanks in advance
I have compared the common SNPs between two different populations and need to do an imputation after QC. I would like to know whether I have an adequate number of SNPs for imputation. I have around 100,000 (1 lakh) common SNPs. Thanks in advance.
so I have SNPs (RSIDs) from imputation done in 2011 on http://csg.sph.umich.edu/abecasis/MACH/tour/ (call it 2011 data)
and I did imputation on the same genotype files on Michigan Imputation Server, Genotype Imputation (Minimac4) 1.2.4 (call it 2020 data)
Using the same QC steps, I performed GWAS with PLINK.
In 2011 I had ~2.5 million SNPs and in 2020 I have ~2.7 million SNPs. The issue is that only ~900000 SNPs match between the two data sets. Can someone please explain why? Did rs names change in the meantime? I put both genotype files on Build 37. Here I am presenting the number of SNPs per chromosome for the old (2011) data and the new (2020) data.
I am also comparing snps_that_can_be_found_in_old_but_not_in_new and snps_that_can_be_found_in_new_but_not_in_old.
Can someone please explain what the issue might be and why only ~900000 SNPs match?
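One way to check whether the mismatch comes from naming rather than from truly different variants is to compare the two sets by identifier and by position plus alleles. The Python sketch below does this on invented records, assuming both sets are on the same build; the dictionary layout and function names are hypothetical. Sorting the allele pair ignores which allele is listed first but does not handle strand flips.

```python
from typing import Dict, Tuple

Variant = Tuple[str, int, str, str]            # (chrom, pos, allele1, allele2)
PosKey = Tuple[str, int, Tuple[str, ...]]

def position_key(variant: Variant) -> PosKey:
    """Identifier-free key: chromosome, position and the sorted allele pair."""
    chrom, pos, a1, a2 = variant
    return chrom, pos, tuple(sorted((a1, a2)))

def overlap_counts(old: Dict[str, Variant], new: Dict[str, Variant]) -> Tuple[int, int]:
    """Number of variants shared by ID, and shared by position plus alleles."""
    by_id = set(old) & set(new)
    by_pos = ({position_key(v) for v in old.values()}
              & {position_key(v) for v in new.values()})
    return len(by_id), len(by_pos)

# Invented records keyed by the ID used in each data set
old = {
    "rs111": ("1", 1000, "A", "G"),
    "rs222": ("1", 2000, "C", "T"),
    "rs333": ("2", 500, "G", "T"),
}
new = {
    "rs111": ("1", 1000, "A", "G"),
    "1:2000:C:T": ("1", 2000, "T", "C"),   # same site, positional ID
    "rs999": ("2", 500, "G", "T"),         # same site, different rs number
    "rs444": ("3", 50, "A", "C"),
}
shared_by_id, shared_by_pos = overlap_counts(old, new)
print(f"shared by ID: {shared_by_id}, shared by position: {shared_by_pos}")
```

If the overlap by position is much larger than the overlap by ID, differences in variant naming would be a plausible explanation.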
Hi, I have two data sets from the illumina omniexpress snp array platform. The first data set was mapped using the GRCh37 build and the second one was more recently read using the GRCh38 build. Not surprisingly when I've tried to merge the files in PLINK for a larger analysis it comes up with the warning snp rs... is in a different genetic position. Is there any way to update the build of the first data set? Or suggestions for how best to proceed, I haven't done much genetic analysis before so any help would be welcome :)
I am learning GWAS analysis using the tutorial at http://www.stat-gen.org/tut/tut_preproc.html but am getting these errors:
Error in read.plink(gwas.fn$bed, gwas.fn$bim, gwas.fn$fam, na.strings = ("-9")) :
Couln't open input file: ~/Desktop/gwas//GWAStutorial.bed
I have trouble understanding GWAS research papers on phenotype-genotype relationships in crop plants, as they are full of statistics. Can someone suggest basic material to start reading, with specific reference to plants?
I am conducting GWAS using GAPIT R package with FarmCPU model. However, unlike GLM and MLM, GAPIT does not produce R2 when FarmCPU model is used. I tried to work around this by using the linear model function in R (lm) like this: fit<-lm(trait~SNP, data=mydata) but I am not sure if this is correct because I got R2=0 for some SNPs with very significant GWAS signals! I have seen hierGWAS package which gives R2 for cluster of SNPs but not single SNP.
I appreciate your suggestions on this and thank you for your time.
Hi, I am doing GWAS analysis on both quantitative and qualitative traits with GAPIT 3.0.
The code runs well for quantitative traits but gives the following error for the qualitative trait:
Error in plot.window(...) : need finite 'ylim' values
Is it because my trait does not follow a normal distribution, or is it something else? Please suggest a proper method for qualitative traits.
What is the minimum acceptable broad-sense heritability value of a quantitative trait to decide if the trait is suitable for GWAS analysis or not?
I am investigating on SNP-trait association in a species with a narrow genetic base. I have phenotyped and genotyped 80 accessions. I would like to use GWAS for the association analysis.
From literature, a suitable GWAS analysis requires a large sample size. My question is: what is the minimum sample size required for GWAS? Or How large should the sample size be?
I'm looking at running some two-sample MR analyses, and my outcome GWAS was run in UK Biobank, so I'm looking for a sex-specific GWAS for testosterone from a different cohort. I've been browsing online, but most of what I've found is either UKB or men only.
If anyone could point me in the right direction that would be great!
I need to perform a meta-analysis of 2 GWAS studies and I need assistance with the input study files (plink --assoc.logistic) as well as the output meta-analysis file (.TBL).
- Firstly, I am not sure about the METAL script because of the format of the plink assoc.logistic files. (Script is attached)
- And secondly there are missing columns/information in the meta-analysed output file, such as CHR, POS, Allele1, Allele2. Do these columns need to be manually added?
There are some regular GWAS data (control and cases with some kind of tumors) in my group. Do you have any novel idea about doing some interesting research using these data?
I have a biparental cacao population of about 188 lines. I was initially thinking of doing QTL mapping; however, when I got the GBS data for my samples, the sequencing company told me that there is a lot of genetic distance between the markers and that I should go with the GWAS approach. Can anyone please give me some advice on this?
The genetic distance between the two parents used to construct the mapping population is 0.5635
Interpreting the biological terms/pathways related to a particular phenotype can be very difficult and confusing. I have read many papers that discuss only a small selection of pathways related to their study objectives and simply ignore the remaining pathways and terms (buried in supplementary materials!). This issue needs an urgent solution. In animal science, we work with traits such as milk production or birth weight, which are complex traits with multiple genes involved in the whole process. Through GWAS we know most of the genes involved in a complex trait. To my knowledge, the "gap" is that there is no direct biological pathway (for example in Cytoscape or DAVID) related to milk production.
I know that "mammary gland development", "cell-cell junctions" and other known terms are together related to milk production, but they need to be combined into one term. This is only one example among many other important traits.
I would like to hear other people's opinions.
Can anyone suggest a reason why the association between my SNPs (90 thousand) and two quantitative traits (BMI and age) has a FDR-adjusted P-value over 0.9 for all my SNPs? I think it is very unlikely that not a single SNP is associated... They come from ichip genotyping and I did an imputation with the Michigan Imputation Server.
I had my files in PLINK vcf format and I transformed them to HapMap using TASSEL, and my phenotypes are in a tab separated file just as in the GAPIT manual.
My code was:
myG_tAPIT <- GAPIT(