Question
Computational Challenges in Population Genetics
Dear all,
What are, in your opinion, the major computational bottlenecks in population genetics?
I am a bioinformatician searching for methods that are algorithmically challenging to optimize or parallelize. To put it differently: for which of the tools/programs you use would it be advantageous if larger datasets could be processed on a multi-core cluster?
Cheers,
Andre
All Answers (4)
-
Dear Andre,
I think the biggest computational challenge in population genetics is the analysis of the incoming genomic datasets. Current tools (MCMC coalescent samplers or ABC approaches) are, in general, designed for "few" unlinked loci. With a genomic dataset, it is unlikely that so many loci could be analysed with the same algorithms (even if they were parallelized), and the linkage among markers will not be possible to ignore. I think the challenge will be to find new algorithms rather than to optimize old ones.
-
Thanks, Miguel, that's an interesting topic to start reading about.
-
I somewhat disagree with Miguel's assessment, because his statement is the current mantra of the genomics folks. The new and 'exciting' methods can analyze genomic data with techniques borrowed from PCA and the like, but they rely on trivial population genetic models (e.g. no population structure or, where there is structure, equal immigration rates everywhere). Or else they revive old methods: if you look at the use of FST, you will see that almost all population genomics researchers calculate FST scores for their genomic data, falling back to dark ages of population genetics. Population genetics applies also to population genomics, but with many more loci (and that surely is a problem).
Given the advances in multicore machines and GPUs in today's computers, there is no reason to believe that MCMC/ABC methods cannot one day analyze such data. I agree with Miguel that today it is a pain and that current MCMC programs are not really up to the task. [Historical note: in the early eighties people told Joe Felsenstein that using maximum likelihood was pointless for phylogenetic problems because it was way too slow. Fortunately for us he ignored their 'concern', and today most phylogenies are scored with ML or Bayesian inference.]
-
I agree with Peter that we should not sacrifice the advances in population genetic inference under complex models. In many cases, I think it will be more interesting to analyse a subset of loci with current methods (MCMC/ABC) than to "calculate FST scores for their genomic data".
My point was that it would be a good research topic for those with a mathematical mind (not me) to explore alternatives to MCMC and ABC for the estimation or approximation of the likelihood. These alternatives could be more computationally efficient or better at taking the linkage among markers into account. Some alternatives have already been proposed based on PAC-likelihood and Hidden Markov Models (NB: I do not fully understand those methods, but what I have read seems promising). Some advances might also be made by improving MCMC and ABC approaches in a more fundamental way than just doing the same thing faster. In any case, any new method that arises is likely to be implemented in a simple model at the beginning; I do not think that would be "falling back to dark ages of population genetics" if the method has the potential to be developed further into more complex models (which takes time, as Peter knows better than I do).
From Peter's comment it seems that, even if nobody finds a faster way to do MCMC/ABC or faster alternatives, we could always wait for faster computers, because population genomics is just population genetics "but with many more loci". I am not sure about this. It is not "just many more loci": linkage among markers will be significant, whereas in population genetic studies the markers are usually independent. I have not followed the progress on incorporating recombination into MCMC analyses, but I think it is still not very advanced (any feedback on this? it would be great to be wrong about this :) ). As for ABC, recombination increases simulation time enormously (well, I guess this really can be solved with faster computers...).
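To make the cost argument for ABC concrete, here is a minimal sketch of ABC rejection sampling in Python. It is only an illustration under strong assumptions, not the method of any particular program: a single non-recombining locus, the standard neutral coalescent with infinite-sites mutation, the number of segregating sites as the only summary statistic, and a uniform prior on theta = 4*N*mu. The function names and parameter values are made up for the example.

```python
import random


def simulate_segregating_sites(n, theta, rng):
    """Simulate the number of segregating sites for a sample of n sequences.

    Assumptions: standard neutral coalescent, infinite-sites mutation,
    no recombination. Time is in units of 2N generations, theta = 4*N*mu.
    """
    total_length = 0.0
    for k in range(n, 1, -1):
        # While k lineages remain, the waiting time to the next
        # coalescence is exponential with rate k*(k-1)/2.
        t_k = rng.expovariate(k * (k - 1) / 2.0)
        total_length += k * t_k
    # Mutations fall on the tree as a Poisson process with rate theta/2
    # per unit of branch length. Draw the Poisson count by counting
    # unit-rate arrivals that occur before `mean`.
    mean = theta / 2.0 * total_length
    count = 0
    arrival = rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def abc_rejection(s_obs, n, n_draws, theta_max, tolerance, seed=0):
    """Return the prior draws of theta whose simulated data match s_obs."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, theta_max)  # draw from a uniform prior
        s_sim = simulate_segregating_sites(n, theta, rng)
        if abs(s_sim - s_obs) <= tolerance:  # keep draws close to the data
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    posterior = abc_rejection(s_obs=15, n=10, n_draws=5000,
                              theta_max=20.0, tolerance=1)
    print(len(posterior), "accepted draws; posterior mean of theta:",
          sum(posterior) / len(posterior))
```

Each pass through the loop in `abc_rejection` is independent of the others, so the draws can be spread over as many cores as are available. What a cluster cannot change is the cost of one call to the simulator, and that is the part that grows when recombination and many linked loci have to be simulated.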
Last, I would like to point out that model-based statistics might not always be the most appropriate or effective way to analyse our data (either "genetic" or "genomic"). It depends on the purpose of the analysis and on our study organism. For instance, partially clonal diploids (with a high clonality rate) will have a coalescent completely different from the one implemented in most (all?) MCMC-based software and coalescent simulators. Another case is when the purpose of our analysis is purely descriptive: Fst scores and other summary statistics might be useful to describe the genetic diversity of the studied species. Also, clustering algorithms based on PCA and related methods seem to give results very similar to those of model-based clustering methods (e.g. STRUCTURE) in much less time, so they might be preferable for genomic data.
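As an illustration of why per-locus summary statistics such as Fst scale so easily to genomic data, here is a minimal Python sketch. It assumes biallelic loci, known allele frequencies, and equally weighted populations, and it uses the textbook definition Fst = (H_T - H_S) / H_T without any sample-size correction, so it is not the estimator of any specific package. The function names are hypothetical.

```python
import math


def fst_biallelic(freqs):
    """Fst at one biallelic locus from per-population allele frequencies.

    Assumes equally weighted populations and no sample-size correction.
    Returns NaN when the locus is monomorphic across all populations.
    """
    if not freqs:
        raise ValueError("need at least one population frequency")
    # H_S: mean expected heterozygosity within populations.
    h_s = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    # H_T: expected heterozygosity of the pooled population.
    p_bar = sum(freqs) / len(freqs)
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        return math.nan
    return (h_t - h_s) / h_t


def fst_scan(loci):
    """Apply fst_biallelic to every locus; each locus is handled independently."""
    return [fst_biallelic(freqs) for freqs in loci]


if __name__ == "__main__":
    # Three loci, two populations each (made-up frequencies).
    print(fst_scan([[0.5, 0.5], [0.2, 0.8], [0.0, 1.0]]))
```

Every locus is processed on its own and in constant time, which makes such a scan cheap and trivially parallel. A model-based analysis, in contrast, has to consider the loci (and their linkage) jointly.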