Luca Ferretti

Collège de France, Lutetia Parisorum, Île-de-France, France

Publications (34) · 82.91 Total impact

  • ABSTRACT: Principal component analysis (PCA) is one of the most widely used tools to explore variability of high-dimensional data. PCA is used for population and quantitative genetics. Its popularity has recently increased due to the huge amount of molecular markers available in datasets worldwide. In genetics, a common issue due to external constraints is uneven sampling of populations, limiting the usefulness of PCA because of well-known sample size sensitivity and two-dimensional projection bias. Here we evaluated the use of weighted PCA (wPCA) in genetic data in order to correct uneven sampling bias. Simulations suggest that wPCA improves the two-dimensional projections of PCA data and, in some cases, recovers population relationship patterns, even when sample size is as low as n=1. We used this correction in pig data from populations with uneven sampling, recovering a more realistic structure than inferred with only PCA. Keywords: SNP; population structure; phylogeography
    10th World Congress on Genetics Applied to Livestock Production; 08/2014
  • ABSTRACT: Growing network models with both heterogeneity of the nodes and topological constraints can give rise to a rich phase structure. We present a simple model based on preferential attachment with rewiring of the links. Rewiring probabilities are modulated by the negative fitness of the nodes and by the constraint for the network to be a simple graph. At low temperatures and high rewiring rates, this constraint induces a Bose-Einstein condensation of paths of length 2, i.e., a new phase transition with an extended condensate of links. The phase space of the model includes further transitions in the size of the connected component and the degeneracy of the network.
    Physical Review E 04/2014; 89(4-1):042810. · 2.31 Impact Factor
  • ABSTRACT: There is a complex relation between the mechanism of preferential attachment, scale-free degree distributions and hyperbolicity in complex networks. In fact, both preferential attachment and hidden hyperbolic spaces often generate scale-free networks. We show that there is actually a duality between a class of growing spatial networks based on preferential attachment on the sphere and a class of static random networks on the hyperbolic plane. Both classes of networks have the same scale-free degree distribution as the Barabási-Albert model. As a limit of this correspondence, the Barabási-Albert model is equivalent to a static random network on a hyperbolic space with infinite curvature.
    EPL (Europhysics Letters) 10/2013; 105(3). · 2.26 Impact Factor
  • ABSTRACT: Deciphering the evolutionary processes driving nucleotide variation in multi-allelic genes is limited by the number of genetic systems in which such genes occur. The complementary sex determiner (csd) gene in the honey bee Apis mellifera is an informative example for studying allelic diversity and the underlying evolutionary forces in a well-described model of balancing selection. Acting as the primary signal of sex determination, diploid individuals heterozygous for csd develop into females, whereas csd homozygotes are diploid males, which have zero fitness. Examining 77 of the functional heterozygous csd allele pairs, we established combinatorial criteria which provide insights into the minimum number of amino acid differences among those pairs. Given a data set of 244 csd sequences, we show that the total number of csd alleles found in A. mellifera ranges from 53 (locally) to 87 (worldwide), which is much higher than previously reported (20). Using a coupon-collector model, we extrapolate the presence of a total of 116-145 csd alleles worldwide. The hypervariable region (HVR) is of particular importance in determining csd allele specificity, and we provide for this region evidence of a high evolutionary rate for length differences, exceeding that of microsatellites. The proportion of amino acids driven by positive selection and the rate of nonsynonymous substitutions in the HVR-flanking regions reach values close to 1 but differ with respect to the HVR length. Using a model of csd coalescence, we identified the high originating rate of csd specificities as a major evolutionary force, leading to the origin of a novel csd allele every 400,000 years. The csd polymorphism frequencies in natural populations indicate an excess of new mutations, whereas signs of ancestral trans-species polymorphism can still be detected. This study provides a comprehensive view of the enormous diversity and the evolutionary forces shaping a multi-allelic gene.
    Molecular Biology and Evolution 10/2013; · 14.31 Impact Factor
  • ABSTRACT: Several variations of the Watterson estimator of variability for Next Generation Sequencing (NGS) data have been proposed in the literature. We present a unified framework for generalized Watterson estimators based on Maximum Composite Likelihood, which encompasses most of the existing estimators. We propose this class of unbiased estimators as generalized Watterson estimators for a large class of NGS data, including pools and trios. We also discuss the relation with the estimators that have been proposed in the literature and show that they admit two equivalent but seemingly different forms, deriving a set of combinatorial identities as a byproduct. Finally, we give a detailed treatment of Watterson estimators for single or multiple autopolyploid individuals.
    09/2013;
  • ABSTRACT: Next generation sequencing of pooled samples is an effective approach for studies of variability and differentiation in populations. In this paper we provide a comprehensive set of estimators of the most common statistics in population genetics based on the frequency spectrum, namely the Watterson estimator θW, nucleotide pairwise diversity Π, Tajima's D, Fu and Li's D and F, Fay and Wu's H, McDonald-Kreitman and HKA tests and Fst, corrected for sequencing errors and ascertainment bias. In a simulation study, we show that pool and individual θ estimates are highly correlated and discuss how the performance of the statistics varies with read depth and sample size in different evolutionary scenarios. As an application, we reanalyze sequences from Drosophila mauritiana and from an evolution experiment in Drosophila melanogaster. These methods are useful for population genetic projects with limited budget, study of communities of individuals that are hard to isolate, or autopolyploid species.
    Molecular Ecology 09/2013; · 6.28 Impact Factor
  • ABSTRACT: BACKGROUND: In contrast to international pig breeds, the Iberian breed has not been admixed with Asian germplasm. This makes it an important model to study both domestication and relevance of Asian genes in the pig. Besides, Iberian pigs exhibit high meat quality as well as appetite and propensity to obesity. Here we provide a genome-wide analysis of nucleotide and structural diversity in a reduced representation library from a pool (n=9 sows) and shotgun genomic sequence from a single sow of the highly inbred Guadyerbas strain. In the pool, we applied newly developed tools to account for the peculiarities of these data. RESULTS: A total of 254,106 SNPs in the pool (79.6 Mb covered) and 643,783 in the Guadyerbas sow (1.47 Gb covered) were called. The nucleotide diversity (1.31×10⁻³ per bp in autosomes) is very similar to that reported in wild boar. A much lower than expected diversity in the X chromosome was confirmed (1.79×10⁻⁴ per bp in the individual and 5.83×10⁻⁴ per bp in the pool). A strong (0.70) correlation between recombination and variability was observed, but not with gene density or GC content. Multicopy regions affected about 4% of annotated pig genes in their entirety, and 2% of the genes partially. Genes within the lowest variability windows comprised interferon genes and, in chromosome X, genes involved in behavior like HTR2C or MECP2. A modified Hudson-Kreitman-Aguadé test for pools also indicated an accelerated evolution in genes involved in behavior, as well as in spermatogenesis and in lipid metabolism. CONCLUSIONS: This work illustrates the strength of current sequencing technologies to picture a comprehensive landscape of variability in livestock species, and to pinpoint regions containing genes potentially under selection. Among those genes, we report genes involved in behavior, including feeding behavior, and lipid metabolism. The pig X chromosome is an outlier in terms of nucleotide diversity, which suggests selective constraints. Our data further confirm the importance of structural variation in the species, including Iberian pigs, and allowed us to identify new paralogs for known gene families.
    BMC Genomics 03/2013; 14(1):148. · 4.40 Impact Factor
  • Andrea Gentili, Luca Ferretti
    ABSTRACT: This paper models the dynamics of migration with a particular focus on the cumulative process that causes a variation in the distribution of income in sending communities and therefore a variation in the distribution of skills across different cohorts. The model provides a theoretical framework for the Cumulative Causation theory of migration and specifically a theoretical rationale behind the use of the migration prevalence ratio to study migration flows. Moreover, the model shows how brain drain (in sending communities) and a negative cohort effect in terms of education (in receiving communities) are the result of a positive selection of migrants in terms of skills if there is an intergenerational transmission of education.
    02/2013;
  • ABSTRACT: Recombination allows faithful chromosomal segregation during meiosis and contributes to the production of new heritable allelic variants that are essential for the maintenance of genetic diversity. Therefore, an appreciation of how this variation is created and maintained is of critical importance to our understanding of biodiversity and evolutionary change. Here, we analysed the recombination features from species representing the major eutherian taxonomic groups Afrotheria, Rodentia, Primates and Carnivora to better understand the dynamics of mammalian recombination. Our results suggest a phylogenetic component in recombination rates (RRs), which appears to be directional, strongly punctuated and subject to selection. Species that diversified earlier in the evolutionary tree have lower RRs than those from more derived phylogenetic branches. Furthermore, chromosome-specific recombination maps in distantly related taxa show that crossover interference is especially weak in the species with highest RRs detected thus far, the tiger. This is the first example of a mammalian species exhibiting such low levels of crossover interference, highlighting the uniqueness of this species and its relevance for the study of the mechanisms controlling crossover formation, distribution and resolution.
    Proceedings of the Royal Society B: Biological Sciences 01/2013; 280(1771):20131945. · 5.68 Impact Factor
  • Luca Ferretti, Filippo Disanto, Thomas Wiehe
    ABSTRACT: The coalescent with recombination is a fundamental model to describe the genealogical history of DNA sequence samples from recombining organisms. Considering recombination as a process which acts along genomes and which creates sequence segments with shared ancestry, we study the influence of single recombination events upon tree characteristics of the coalescent. We focus on properties such as tree height and tree balance and quantify analytically the changes in these quantities incurred by recombination in terms of probability distributions. We find that changes in tree topology are often relatively mild under conditions of neutral evolution, while changes in tree height are on average quite large. Our results add to a quantitative understanding of the spatial coalescent and provide the neutral reference to which the impact by other evolutionary scenarios, for instance tree distortion by selective sweeps, can be compared.
    PLoS ONE 01/2013; 8(4):e60123. · 3.53 Impact Factor
  • ABSTRACT: Performing high throughput sequencing on samples pooled from different individuals is a strategy to characterize genetic variability at a small fraction of the cost required for individual sequencing. In certain circumstances some variability estimators have even lower variance than those obtained with individual sequencing. SNP calling and estimating the frequency of the minor allele from pooled samples, though, is a subtle exercise for at least three reasons. First, sequencing errors may have a much larger relevance than in individual SNP calling: while their impact in individual sequencing can be reduced by setting a restriction on a minimum number of reads per allele, this would have a strong and undesired effect in pools because it is unlikely that alleles at low frequency in the pool will be read many times. Second, the prior allele frequency for heterozygous sites in individuals is usually 0.5 (assuming one is not analyzing sequences coming from, e.g., cancer tissues), but this is not true in pools: in fact, under the standard neutral model, singletons (i.e. alleles of minimum frequency) are the most common class of variants because P(f) ∝ 1/f and they occur more often as the sample size increases. Third, an allele appearing only once in the reads from a pool does not necessarily correspond to a singleton in the set of individuals making up the pool, and vice versa, there can be more than one read - or, more likely, none - from a true singleton. To improve upon existing theory and software packages, we have developed a Bayesian approach for minor allele frequency (MAF) computation and SNP calling in pools (and implemented it in a program called snape): the approach takes into account sequencing errors and allows users to choose different priors. We also set up a pipeline which can simulate the coalescence process giving rise to the SNPs, the pooling procedure and the sequencing. We used it to compare the performance of snape to that of other packages. We present software that helps in calling SNPs in pooled samples: it has good power while retaining a low false discovery rate (FDR). The method also provides the posterior probability that a SNP is segregating and the full posterior distribution of f for every SNP. In order to test the behaviour of our software, we generated (through simulated coalescence) artificial genomes and computed the effect of a pooled sequencing protocol, followed by SNP calling. In this setting, snape has better power and FDR than the comparable packages samtools, PoPoolation and Varscan: for N = 50 chromosomes, snape has power ≈ 35% and FDR ≈ 2.5%. snape is available at http://code.google.com/p/snape-pooled/ (source code and precompiled binaries).
    BMC Bioinformatics 09/2012; 13:239. · 3.02 Impact Factor
  • ABSTRACT: Missing data are common in DNA sequences obtained through high-throughput sequencing. Furthermore, samples of low quality or problems in the experimental protocol often cause a loss of data even with traditional sequencing technologies. Here we propose modified estimators of variability and neutrality tests that can be naturally applied to sequences with missing data, without the need to remove bases or individuals from the analysis. Modified statistics include the Watterson estimator θW, Tajima's D, Fay and Wu's H, and HKA. We develop a general framework to take missing data into account in frequency spectrum-based neutrality tests and we derive the exact expression for the variance of these statistics under the neutral model. The neutrality tests proposed here can also be used as summary statistics to describe the information contained in other classes of data like DNA microarrays.
    Genetics 06/2012; 191(4):1397-401. · 4.39 Impact Factor
  • ABSTRACT: Critical phenomena can show unusual phase diagrams when defined in complex network topologies. The case of classical phase transitions such as the classical Ising model and the percolation transition has been studied extensively in the last decade. Here we show that the phase diagram of the Bose-Hubbard model, an exclusively quantum mechanical phase transition, also changes significantly when defined on random scale-free networks. We present a mean-field calculation of the model in annealed networks and we show that when the second moment of the average degree diverges the Mott-insulator phase disappears in the thermodynamic limit. Moreover, we study the model on quenched networks and we show that the Mott-insulator phase disappears in the thermodynamic limit as long as the maximal eigenvalue of the adjacency matrix diverges. Finally, we study the phase diagram of the model on Apollonian scale-free networks that can be embedded in two dimensions, showing that the results extend to this case as well.
    EPL (Europhysics Letters) 03/2012; 99(1). · 2.26 Impact Factor
  • ABSTRACT: Many complex networks, from the World Wide Web to biological networks, grow taking into account the heterogeneous features of the nodes. The feature of a node might be a discrete quantity such as a classification of a URL document (personal page, thematic website, news, blog, search engine, social network, etc.) or the classification of a gene in a functional module. Moreover, the feature of a node can be a continuous variable such as the position of a node in the embedding space. In order to account for these properties, in this paper we provide a generalization of growing network models with preferential attachment that includes the effect of heterogeneous features of the nodes. The main effect of heterogeneity is the emergence of an “effective fitness” for each class of nodes, determining the rate at which nodes acquire new links. The degree distribution exhibits a multiscaling behavior analogous to the fitness model. This property is robust with respect to variations in the model, as long as links are assigned through effective preferential attachment. Beyond the degree distribution, in this paper we give a full characterization of the other relevant properties of the model. We evaluate the clustering coefficient and show that it disappears for large network size, a property shared with the Barabási-Albert model. Negative degree correlations are also present in this class of models, along with nontrivial mixing patterns among features. We therefore conclude that both small clustering coefficients and disassortative mixing are outcomes of the preferential attachment mechanism in general growing networks.
    Physical review. E, Statistical physics, plasmas, fluids, and related interdisciplinary topics 11/2011; 85(6).
  • ABSTRACT: Despite dramatic reduction in sequencing costs with the advent of next generation sequencing technologies, obtaining a complete mammalian genome sequence at sufficient depth is still costly. An alternative is partial sequencing. Here, we have sequenced a reduced representation library of an Iberian sow from the Guadyerbas strain, a highly inbred strain that has been used in numerous QTL studies because of its extreme phenotypic characteristics. Using the Illumina Genome Analyzer II (San Diego, CA, USA), we resequenced ∼1% of the genome with average 4× depth, identifying 68,778 polymorphisms. Of these, 55,457 were putative fixed differences with respect to the assembly, based on the genome of a Duroc pig, and 13,321 were heterozygous positions within Guadyerbas. Despite being highly inbred, the estimate of heterozygosity within Guadyerbas was ∼0.78 kb⁻¹ in autosomes, after correcting for low depth. Nucleotide variability was consistently higher at the telomeric regions than on the rest of the chromosome, likely a result of increased recombination rates. Further, variability was 50% lower in the X chromosome than in autosomes, which may be explained by a recent bottleneck or by selection. We divided the whole genome into 500 kb windows and analyzed overrepresented gene ontology terms in regions of low and high variability. The multi-organism process, pigmentation and cell killing terms were overrepresented in high variability regions, and the metabolic process term in low variability regions. Further, a genome-wide Hudson-Kreitman-Aguadé test was carried out per window; overall, variability was in agreement with neutral expectations.
    Heredity 03/2011; 107(3):256-64. · 4.11 Impact Factor
  • ABSTRACT: Artificial selection has caused rapid evolution in domesticated species. The identification of selection footprints across domesticated genomes can contribute to uncovering the genetic basis of phenotypic diversity. Genome-wide footprints of pig domestication and selection were identified using massive parallel sequencing of pooled reduced representation libraries (RRL) representing ∼2% of the genome from wild boar and four domestic pig breeds (Large White, Landrace, Duroc and Pietrain) which have been under strong selection for muscle development, growth, behavior and coat color. Using specifically developed statistical methods that account for DNA pooling, low mean sequencing depth, and sequencing errors, we provide genome-wide estimates of nucleotide diversity and genetic differentiation in pig. Widespread signals suggestive of positive and balancing selection were found and the strongest signals were observed in Pietrain, one of the breeds most intensively selected for muscle development. Most signals were population-specific but affected genomic regions which harbored genes for common biological categories including coat color, brain development, muscle development, growth, metabolism, olfaction and immunity. Genetic differentiation in regions harboring genes related to muscle development and growth was higher between breeds than between a given breed and the wild boar. These results suggest that although domesticated breeds have experienced similar selective pressures, selection has acted upon different genes. This might reflect the multiple domestication events of European breeds or could be the result of subsequent introgression of Asian alleles. Overall, it was estimated that approximately 7% of the porcine genome has been affected by selection events. This study illustrates that the massive parallel sequencing of genomic pools is a cost-effective approach to identify footprints of selection.
    PLoS ONE 01/2011; 6(4):e14782. · 3.53 Impact Factor
  • M Pérez-Enciso, L Ferretti
    ABSTRACT: Next generation sequencing (NGS) has revolutionized genomics research, making it difficult to overstate its impact on studies of biology. NGS will immediately allow researchers working in non-mainstream species to obtain complete genomes together with a comprehensive catalogue of variants. In addition, RNA-seq will be a decisive way to annotate genes that cannot be predicted purely by computational or comparative approaches. Future applications include whole genome sequence association studies, as opposed to classical SNP-based association, and implementing this new source of information into breeding programmes. For these purposes, one of the main advantages of sequencing vs. genotyping is the possibility of identifying copy number variants. Currently, experimental design is a topic of utmost interest, and here we discuss some of the options available, including pools and reduced representation libraries. Although bioinformatics is still an important bottleneck, this limitation is only transient and should not deter animal geneticists from embracing these technologies.
    Animal Genetics 12/2010; 41(6):561-9. · 2.58 Impact Factor
  • Luca Ferretti, Michele Cortelezzi
    ABSTRACT: We obtain the degree distribution for a class of growing network models on flat and curved spaces. These models evolve by preferential attachment weighted by a function of the distance between nodes. The degree distribution of these models is similar to that of the fitness model of Bianconi and Barabási, with a fitness distribution dependent on the metric and the density of nodes. We show that curvature singularities in these spaces can give rise to asymptotic Bose-Einstein condensation, but transient condensation can also be observed in smooth hyperbolic spaces with strong curvature. We provide numerical results for spaces of constant curvature (sphere, flat and hyperbolic space) and we discuss the conditions for the breakdown of this approach and the critical points of the transition to distance-dominated attachment. Finally, we discuss the distribution of link lengths.
    Physical review. E, Statistical physics, plasmas, fluids, and related interdisciplinary topics 11/2010; 84.
  • ABSTRACT: One of the main necessities for population geneticists is the availability of statistical tools that make it possible to accept or reject the neutral Wright-Fisher model with high power. A number of statistical tests have been developed to detect specific deviations from the null frequency spectrum in different directions (e.g., Tajima's D, Fu and Li's F and D tests, Fay and Wu's H). Recently, a general framework was proposed to generate all neutrality tests that are linear functions of the frequency spectrum. In this framework, a family of optimal tests was developed to have almost maximum power against a specific alternative evolutionary scenario. Following these developments, in this paper we provide a thorough discussion of linear and nonlinear neutrality tests. First, we present the general framework for linear tests and emphasize the importance of the property of scalability with the sample size (that is, the results of the tests should not depend on the sample size), which, if missing, can lead to errors in data interpretation. The motivation and structure of linear optimal tests are discussed. In a further generalization, we develop a general framework for nonlinear neutrality tests and we derive nonlinear optimal tests for polynomials of any degree in the frequency spectrum.
    11/2010;
  • ABSTRACT: The ascertainment of the demographic and selective history of populations has been a major research goal in genetics for decades. To that end, numerous statistical tests have been developed to detect deviations between expected and observed frequency spectra, e.g., Tajima's D, Fu and Li's F and D tests, and Fay and Wu's H. Recently, Achaz developed a general framework to generate tests that detect deviations in the frequency spectrum. In a further development, we argue that the results of these tests should be as independent of the sample size as possible and propose a scale-free form for them. Furthermore, using the same framework as that of Achaz, we develop a new family of neutrality tests based on the frequency spectrum that are optimal against a chosen alternative evolutionary scenario. These tests maximize the power to reject the standard neutral model and are scalable with the sample size. Optimal tests are derived for several alternative evolutionary scenarios, including demographic processes (population bottleneck, expansion, contraction) and selective sweeps. Within the same framework, we also derive an optimal general test given a generic evolutionary scenario as a null model. All formulas are relatively simple and can be computed very fast, making it feasible to apply them to genome-wide sequence data. A simulation study showed that, generally, the tests proposed are more consistently powerful than standard tests like Tajima's D. We further illustrate the method with real data from a QTL candidate region in pigs.
    Genetics 09/2010; 186(1):353-65. · 4.39 Impact Factor

Publication Stats

263 Citations
82.91 Total Impact Points

Institutions

  • 2013–2014
    • Collège de France
      Lutetia Parisorum, Île-de-France, France
    • Social Science Research Council
      New York City, New York, United States
    • University of Cologne
      • Institute for Genetics
      Köln, North Rhine-Westphalia, Germany
  • 2012–2013
    • CRAG Centre for Research in Agricultural Genomics
      Barcino, Catalonia, Spain
    • Centro Nacional de Análisis Genómico de Barcelona
      Barcino, Catalonia, Spain
  • 2009–2013
    • Autonomous University of Barcelona
      • Department of Cellular Biology, Immunology and Physiology
      • Facultat de Veterinària
      Cerdanyola del Vallès, Catalonia, Spain
  • 2007
    • Università di Pisa
      • Department of Physics "E. Fermi"
      Pisa, Tuscany, Italy