Robert Lanfear's research while affiliated with Australian National University and other places

Publications (159)

Article
Accurate and timely detection of recombinant lineages is crucial for interpreting genetic variation, reconstructing epidemic spread, identifying selection and variants of interest, and accurately performing phylogenetic analyses [1–4]. During the SARS-CoV-2 pandemic, genomic data generation has exceeded the capacities of existing analysis platforms,...
Article
Motivation: Phylogenetic tree optimization is necessary for precise analysis of evolutionary and transmission dynamics, but existing tools are inadequate for handling the scale and pace of data produced during the COVID-19 pandemic. One transformative approach, online phylogenetics, aims to incrementally add samples to an ever-growing phylogeny, b...
Article
Full-text available
Sequence simulators play an important role in phylogenetics. Simulated data has many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programs exist, but the most feature-rich...
Preprint
Neighbour-Joining is one of the most widely used distance-based phylogenetic inference methods. However, current implementations do not scale well for datasets with more than 10,000 sequences. Given the increasing pace of generating new sequence data, particularly in outbreaks of emerging diseases, and the already enormous existing databases of seq...
Article
Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All commonly-used amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference o...
Preprint
Full-text available
Phylogenetic tree optimization is necessary for precise analysis of evolutionary and transmission dynamics, but existing tools are inadequate for handling the scale and pace of data produced during the COVID-19 pandemic. One transformative approach, online phylogenetics, aims to incrementally add samples to an ever-growing phylogeny, but there are...
Preprint
Full-text available
Sequence simulators play an important role in phylogenetics. Simulated data has many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programs exist, but the most feature-rich...
Preprint
Full-text available
Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any a...
Preprint
Full-text available
Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees...
Article
Full-text available
We introduce the AusTraits database - a compilation of values of plant traits for taxa in the Australian flora (hereafter AusTraits). AusTraits synthesises data on 448 traits across 28,640 taxa from field campaigns, published literature, taxonomic monographs, and individual taxon descriptions. Traits vary in scope from physiological measures of per...
Article
Full-text available
Using time-reversible Markov models is a very common practice in phylogenetic analysis, because although we expect many of their assumptions to be violated by empirical data, they provide high computational efficiency. However, these models lack the ability to infer the root placement of the estimated phylogeny. In order to compensate for the inabi...
Article
Full-text available
As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering in a new era of ‘genomic contact tracing’—that is, using viral genomes to trace local transmission dynamics. However, because the viral phylogeny is already so large—and will undoubtedly grow many fold—placing new sequence...
Preprint
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in late 2019 and spread globally to cause the COVID-19 pandemic. Despite the constant accumulation of genetic variation in the SARS-CoV-2 population, there was little evidence for the emergence of significantly more transmissible lineages in the first half of 2020. Starting around...
Article
Full-text available
The COVID-19 pandemic has seen an unprecedented response from the sequencing community. Leveraging the sequence data from more than 140,000 SARS-CoV-2 genomes, we study mutation rates and selective pressures affecting the virus. Understanding the processes and effects of mutation and selection has profound implications for the study of viral evolut...
Article
Full-text available
Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset co...
Preprint
Full-text available
Most phylogenetic analyses assume that the evolutionary history of an alignment (either that of a single locus, or of multiple concatenated loci) can be described by a single bifurcating tree, the so-called treelikeness assumption. Treelikeness can be violated by biological events such as recombination, introgression, or incomplete lineage sort...
Article
Full-text available
sangeranalyseR is a feature-rich, free, and open-source R package for processing Sanger sequencing data. It allows users to go from loading reads to saving aligned contigs in a few lines of R code by using sensible defaults for most actions. It also provides complete flexibility for determining how individual reads and contigs are processed, both at...
Preprint
Full-text available
The COVID-19 pandemic has seen an unprecedented response from the sequencing community. Leveraging the sequence data from more than 140,000 SARS-CoV-2 genomes, we study mutation rates and selective pressures affecting the virus. Understanding the processes and effects of mutation and selection has profound implications for the study of vi...
Preprint
Full-text available
Phylogenetic methods are emerging as a useful tool to understand cancer evolutionary dynamics, including tumor structure, heterogeneity, and progression. Most currently used approaches utilize either bulk whole genome sequencing (WGS) or single-cell DNA sequencing (scDNA-seq) and are based on calling copy number alterations and single nu...
Preprint
Full-text available
We introduce the AusTraits database - a compilation of measurements of plant traits for taxa in the Australian flora (hereafter AusTraits). AusTraits synthesises data on 375 traits across 29230 taxa from field campaigns, published literature, taxonomic monographs, and individual taxa descriptions. Traits vary in scope from physiological measures of...
Article
Full-text available
Our understanding of the evolutionary history of primates is undergoing continual revision due to ongoing genome sequencing efforts. Bolstered by growing fossil evidence, these data have led to increased acceptance of once controversial hypotheses regarding phylogenetic relationships, hybridization and introgression, and the biogeographical history...
Article
Full-text available
The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features...
Preprint
Full-text available
As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering in a new era of “genomic contact tracing” – that is, using viral genome sequences to trace local transmission dynamics. However, because the viral phylogeny is already so large – and will undoubtedly grow many fold – placing...
Chapter
Full-text available
Unravelling the phylogenetic relationships among the major groups of living birds has been described as the greatest outstanding problem in dinosaur systematics. Recent work has identified portions of the avian tree of life that are particularly challenging to reconstruct, perhaps as a result of rapid cladogenesis early in crown bird evolutionary h...
Preprint
Full-text available
Using time-reversible Markov models is a very common practice in phylogenetic analysis, because although we expect many of their assumptions to be violated by empirical data, they provide high computational efficiency. However, these models lack the ability to infer the root placement of the estimated phylogeny. In order to compensate for the inabi...
Preprint
Full-text available
sangeranalyseR is an interactive R/Bioconductor package and two associated Shiny applications designed for analysing Sanger sequencing data from the ABIF file format in R. It allows users to go from loading reads to saving aligned contigs in a few lines of R code. sangeranalyseR provides a wide range of options for a number of commonly-perform...
Article
Full-text available
We implement two measures for quantifying genealogical concordance in phylogenomic datasets: the gene concordance factor (gCF) and the novel site concordance factor (sCF). For every branch of a reference tree, gCF is defined as the percentage of "decisive" gene trees containing that branch. This measure is already in wide usage, but here we introdu...
Article
Full-text available
Somatic mutations can have important effects on the life history, ecology, and evolution of plants, but the rate at which they accumulate is poorly understood and difficult to measure directly. Here, we develop a method to measure somatic mutations in individual plants and use it to estimate the somatic mutation rate in a large, long-lived, phenoty...
Preprint
Full-text available
Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset co...
Article
Full-text available
IQ-TREE (http://www.iqtree.org, last accessed February 6, 2020) is a user-friendly and widely used software package for phylogenetic inference using maximum likelihood. Since the release of version 1 in 2014, we have continuously expanded IQ-TREE to integrate a plethora of new models of sequence evolution and efficient computational approaches of p...
Article
Full-text available
Background: Eucalyptus pauciflora (the snow gum) is a long-lived tree with high economic and ecological importance. Currently, little genomic information for E. pauciflora is available. Here, we sequentially assemble the genome of Eucalyptus pauciflora with different methods, and combine multiple existing and novel approaches to help to select the b...
Article
Evolution leaves heterogeneous patterns of nucleotide variation across the genome, with different loci subject to varying degrees of mutation, selection, and drift. In phylogenetics, the potential impacts of partitioning sequence data for the assignment of substitution models are well appreciated. In contrast, the treatment of branch lengths has re...
Preprint
Full-text available
IQ-TREE (http://www.iqtree.org) is a user-friendly and widely used software package for phylogenetic inference using maximum likelihood. Since the release of version 1 in 2014, we have continuously expanded IQ-TREE to integrate a plethora of new models of sequence evolution and efficient computational approaches of phylogenetic inference to deal wi...
Article
Full-text available
The chloroplast genome usually has a quadripartite structure consisting of a large single copy region and a small single copy region separated by two long inverted repeats. It has been known for some time that a single cell may contain at least two structural haplotypes of this structure, which differ in the relative orientation of the single copy...
Article
Full-text available
In phylogenetic inference we commonly use models of substitution which assume that sequence evolution is stationary, reversible and homogeneous (SRH). Although the use of such models is often criticized, the extent of SRH violations and their effects on phylogenetic inference of tree topologies and edge lengths are not well understood. Here, we int...
Preprint
Full-text available
The chloroplast genome usually has a quadripartite structure consisting of a large single copy region and a small single copy region separated by two long inverted repeats. It has been known for some time that a single cell may contain at least two structural haplotypes of this structure, which differ in the relative orientation of the single copy...
Preprint
Full-text available
Background: Selecting the best genome assembly from a collection of draft assemblies for the same species remains a difficult task. Here, we combine new and existing approaches to help to address this, using the non-model plant Eucalyptus pauciflora (snow gum) as a test case. Eucalyptus pauciflora is a long-lived tree with high economic and ecologic...
Preprint
Full-text available
Unravelling the phylogenetic relationships among the major groups of living birds has been described as the greatest outstanding problem in dinosaur systematics. Recent work has identified portions of the avian tree of life that are particularly challenging to reconstruct, perhaps as a result of rapid cladogenesis early in crown bird evolutionary h...
Article
Full-text available
Background: Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in the chloroplast genome is widely used in agriculture and studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10–30 kb)....
Preprint
Full-text available
We introduce and implement two measures for quantifying genealogical concordance in phylogenomic datasets: the gene concordance factor (gCF) and the site concordance factor (sCF). For every branch of a reference tree, gCF is defined as the percentage of decisive gene trees containing that branch. This measure is already in wide usage, but here we i...
Preprint
Full-text available
Evolution leaves heterogeneous patterns of nucleotide variation across the genome, with different loci subject to varying degrees of mutation, selection, and drift. Appropriately modelling this heterogeneity is important for reliable phylogenetic inference. One modelling approach in statistical phylogenetics is to apply independent models of molecu...
Preprint
Full-text available
In phylogenetic inference, we commonly use models which assume that sequence evolution is stationary, reversible and homogeneous (SRH). Although such assumptions are often criticized, the extent of SRH violations and their effects on phylogenetic inference are not well understood. Here, we extend the matched-pairs test of symmetry to assess the sca...
Article
Full-text available
Long‐read sequencing technologies are transforming our ability to assemble highly complex genomes. Realising their full potential is critically reliant on extracting high quality, high molecular weight (HMW) DNA from the organisms of interest. This is especially the case for the portable MinION sequencer which enables all laboratories to undertake...
Article
Full-text available
For the last 100 years, it has been uncontroversial to state that the plant germline is set aside late in development, but there is surprisingly little evidence to support this view. In contrast, much evolutionary theory and several recent empirical studies seem to suggest the opposite—that the germlines of some and perhaps most plants may be set a...
Preprint
Full-text available
Background: Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. Chloroplast genomes code for around 130 genes, and the information they contain is widely used in agriculture and studies of evolution and ecology. Correctly assembling complete chloroplast genomes can be challenging because the chloroplast genome contains...
Article
Full-text available
UltraConserved Elements (UCEs) are popular markers for phylogenomic studies. They are relatively simple to collect from distantly-related organisms, and contain sufficient information to infer relationships at almost all taxonomic levels. Most studies of UCEs use partitioning to account for variation in rates and patterns of molecular evolution amo...
Article
Full-text available
Most flowering plants are hermaphroditic, yet the proportion of seeds fertilized by self and outcross pollen varies widely among species, ranging from predominant self-fertilization to exclusive outcrossing. A population's rate of outcrossing has important evolutionary outcomes as it influences genetic structure, effective population size, and offs...