Article

Missing inaction: the dangers of ignoring missing data


Abstract

The most common approach to dealing with missing data is to delete cases containing missing observations. However, this approach reduces statistical power and increases estimation bias. A recent study shows how estimates of heritability and selection can be biased when the 'invisible fraction' (missing data due to mortality) is ignored, thus demonstrating the dangers of neglecting missing data in ecology and evolution. We highlight recent advances in procedures for handling missing data, and their relevance and applicability.


... For both model types, we included all data from all surveys in all years, but treated missing data for repetitions four and five in years 2014 to mid-2016 differently. We categorized missing values as NA in occupancy models, but used imputation methods to infer missing values in GLMM analyses (see below; Nakagawa & Freckleton, 2008). ...
... For GLMM analysis, as our surveys varied from three to five replicates, we replaced missing values (8.57% of the total dataset) at all sites with fewer than five replicates using a multiple imputation procedure (Nakagawa & Freckleton, 2008). Specifically, we used multiple imputation to fill in 18 (of 210; 8.6%) missing values for each of crocodile encounter rate, moon phase, wind speed, amount of aquatic vegetation, fishing net encounter rate, and precipitation the day prior to the survey. ...
... We preferred multiple imputation (MI) over single imputation (e.g., substituting missing values with global means) because of the small sample size and risk of underestimating the errors (Nakagawa & Freckleton, 2008). Further, Nakagawa and Freckleton (2011) found that, with mixed linear models, replacing missing values through MI results in better estimates of Akaike weights and standard errors compared to leaving NAs in the dataset. ...
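The logic behind the multiple-imputation procedure described above can be sketched in a few lines. The sketch below is illustrative only: the data, the function names, and the normal imputation model are invented for the example. Each missing value is drawn from a distribution fit to the observed data, the imputation is repeated m times, and the results are combined with Rubin's rules, so that the between-imputation spread inflates the standard error rather than underestimating it as single imputation would.

```python
import random
import statistics

def impute_once(values, rng):
    """Stochastic single imputation: draw each missing value (None)
    from a normal distribution fit to the observed values."""
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)
    return [v if v is not None else rng.gauss(mu, sd) for v in values]

def pool_mean(values, m=20, seed=1):
    """Multiple imputation of the mean, pooled with Rubin's rules:
    pooled estimate = average of the m per-imputation means;
    total variance = within-imputation + between-imputation variance."""
    rng = random.Random(seed)
    estimates, within = [], []
    for _ in range(m):
        completed = impute_once(values, rng)
        n = len(completed)
        estimates.append(statistics.mean(completed))
        within.append(statistics.variance(completed) / n)  # var of the mean
    qbar = statistics.mean(estimates)                      # pooled estimate
    ubar = statistics.mean(within)                         # within-imputation
    b = statistics.variance(estimates)                     # between-imputation
    total_var = ubar + (1 + 1 / m) * b                     # Rubin's total variance
    return qbar, total_var

data = [4.1, 3.8, None, 5.0, 4.4, None, 4.7, 3.9, 4.2, None]
est, var = pool_mean(data)
```

In practice one would use an established implementation such as the mice package cited in the excerpts below rather than hand-rolling the draws; the point here is only that the between-imputation variance term is what distinguishes MI from mean substitution.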
Article
Full-text available
West African crocodylians are among the most threatened and least studied crocodylian species globally. Assessing population status and establishing a basis for population monitoring is the highest priority action for this region. Monitoring of crocodiles is influenced by many factors that affect detectability, including environmental variables and individual- or population-level wariness. We investigated how these factors affect detectability and counts of the critically endangered Mecistops cataphractus and the newly recognized Crocodylus suchus. We implemented 195 repetitive surveys at 38 sites across Côte d’Ivoire between 2014 and 2019. We used an occupancy-based approach and a count-based GLMM analysis to determine the effect of environmental and anthropogenic variables on detection and modeled crocodile wariness over repetitive surveys. Despite their rarity and level of threat, detection probability of both species was relatively high (0.75 for M. cataphractus and 0.81 for C. suchus), but a minimum of two surveys were required to infer absence of either species with 90% confidence. We found that detection of M. cataphractus was significantly negatively influenced by fishing net encounter rate, while high temperature for the previous 48 h of the day of the survey increased C. suchus detection. Precipitation and aquatic vegetation had significant negative and positive influence, respectively, on M. cataphractus counts and showed the opposite effect for C. suchus counts. We also found that fishing encounter rate had a significant negative effect on C. suchus counts. Interestingly, survey repetition did not generally affect wariness for either species, though there was some indication that at least M. cataphractus was more wary by the fourth replicate. These results are informative for designing future survey and monitoring protocols for these threatened crocodylians in West Africa and for other endangered crocodylians globally.
... Missing data are a widespread problem in ecological and evolutionary research (Ellington et al., 2015; Nakagawa, 2017; Nakagawa & Freckleton, 2010), often resulting in the exclusion of a substantial amount of available (but incomplete) data (e.g., through 'complete case' or 'pairwise deletion'). This contributes to a reduction in statistical power and, if the nature of 'missingness' is not considered carefully, can lead to biased parameter estimates (Enders, 2001b; Graham, 2009; Nakagawa & Freckleton, 2010). ...
... Theoretical frameworks for dealing with incomplete data have received substantial attention, and missing data theory is now a well-developed field of research grounded in solid statistical theory (Enders, 2001a; Graham, 2003, 2009; Graham et al., 1996; Little & Rubin, 2002; van Buuren, 2012). ...
... While social scientists commonly use missing data methods, these techniques remain relatively unknown, and seldom applied, in ecological and evolutionary circles (Nakagawa & Freckleton, 2010). ...
Article
Full-text available
Ecological and evolutionary research questions are increasingly requiring the integration of research fields along with larger datasets to address fundamental local and global scale problems. Unfortunately, these agendas are often in conflict with limited funding and a need to balance animal welfare concerns. Planned missing data design (PMDD), where data are randomly and deliberately missed during data collection, combined with missing data procedures, can be useful tools when working under greater research constraints. Here, we review how PMDD can be incorporated into existing experimental designs by discussing alternative design approaches and demonstrate with simulated datasets how missing data procedures work with incomplete data. PMDDs can provide researchers with a unique toolkit that can be applied during the experimental design stage. Planning and thinking about missing data early can: 1) reduce research costs by allowing for the collection of less expensive measurement variables; 2) provide opportunities to distinguish predictions from alternative hypotheses by allowing more measurement variables to be collected; and 3) minimise distress caused by experimentation by reducing the reliance on invasive procedures or allowing data to be collected on fewer subjects (or less often on a given subject). PMDDs and missing data methods can even provide statistical benefits under certain situations by improving statistical power relative to a complete case design. The impacts of unplanned missing data, which can cause biases in parameter estimates and their uncertainty, can also be ameliorated using missing data procedures. PMDDs are still in their infancy. We discuss some of the difficulties in their implementation and provide tentative solutions. 
While PMDDs may not always be the best option, missing data procedures are becoming more sophisticated and more easily implemented and it is likely that PMDDs will be effective tools for a wide range of experimental designs, data types and problems in ecology and evolution.
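A common PMDD variant is the three-form design, in which every subject receives a cheap common block of measurements plus a rotating subset of the expensive blocks, so that one expensive block per subject is deliberately missed. The sketch below is a toy illustration (block names and the per-subject random assignment are invented; classic three-form designs balance the assignment across forms rather than randomizing it):

```python
import random

def three_form_design(subjects, blocks=("A", "B", "C"), seed=0):
    """Planned missing data: every subject gets a common block "X"
    plus two of the three expensive blocks; one block is deliberately
    skipped per subject, cutting expensive measurements by one third."""
    rng = random.Random(seed)
    plan = {}
    for s in subjects:
        skipped = rng.choice(blocks)  # block deliberately not measured
        plan[s] = ["X"] + [b for b in blocks if b != skipped]
    return plan

plan = three_form_design(range(9))
```

Because the skipped values are missing by design (MCAR by construction), the resulting gaps can be handled safely with the multiple-imputation or maximum-likelihood procedures the review describes.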
... Likewise, reproductive success is not exempt from biases, as uncertainties can be caused by unsampled competitors (Walling et al. 2010) and offspring (also known as the 'invisible fraction'; Grafen 1988; Hadfield 2008; Nakagawa and Freckleton 2008). Furthermore, reproductive success depends on both sexual (pre- and post-copulatory, see section 1.3.3) ...
... Moreover, if males have specific home ranges and limited mating opportunities (chapter 3), the estimated reproductive success of some males is overly affected if sampling is spatially biased. Biased offspring sampling can arise from biological reasons (Nakagawa and Freckleton 2008). One specific case is the pre-sampling death of some offspring, referred to as the invisible fraction (Grafen 1988; Hadfield 2008; Nakagawa and Freckleton 2008). ...
Thesis
Full-text available
Sexual selection is responsible for some of the most extravagant traits found in the animal kingdom, as they confer a fitness advantage in terms of access to mates. Large body size in males of many polygynous species should provide a mating advantage through its association with dominance rank and, ultimately, access to females. Recent evidence from multiple species, however, questions this generalization, suggesting instead a weak relationship between male size and reproductive success. That empirical evidence suggests a need to develop new theory to explain patterns of male reproductive success. This thesis meets this challenge by examining if and how sexual selection acts on a polygynous sexually dimorphic species. I used a combination of behavioral, morphological, spatial, and reproductive data on marked individuals from a long-term longitudinal study of eastern grey kangaroos (Macropus giganteus) to explore which ecological and demographic factors influence the association between body size and dominance, siring chances and reproductive success, and relate the effect of paternal body size to offspring sex and maternal allocation. First, I needed to understand if body size is positively associated with dominance status, as individual rank is assumed to be a decisive factor granting priority access to receptive females. By analyzing the outcome of about 2,300 male-male agonistic interactions across six years (2010-2011 and 2015-2018), chapter 2 shows that kangaroo males formed yearly linear, steep, and stable dominance hierarchies based on body size. Dominance status was, however, only moderately correlated with yearly reproductive success. This finding strongly suggests that body size is not the only factor influencing reproductive success in this species, possibly indicating weak sexual selection on body size. 
To determine what factors other than body size may influence siring success, I examined if males had an upper reproductive threshold set by their mating opportunity. If a male does not encounter a female, he cannot father her offspring. Most studies of sexual selection, however, assume that all males in a population have equal access to all females. Chapter 3 thus uses spatial data collected over 9 years (2010-2018) to show that ecological variables such as mating opportunity, quantified by the spatial overlap between each male-female pair, residency on the breeding site, and an accurate estimation of the competitive environment influence individual siring chances. Next, I quantified sexual selection on body size, its fluctuation across years and according to mating opportunity and residency, and the strength of reproductive inequality. Chapter 4 shows that sexual selection was overall stabilizing and there was no evidence of temporal fluctuation or fluctuation caused by mating opportunity, and only limited variation caused by differences in residency. Despite weak reproductive inequality, sexual selection acted strongly on body size. Finally, I redirected my attention to recent findings showing that fathers can influence offspring sex and maternal allocation. Chapter 5 thus examined if paternal body size, which is strongly sexually selected, affected offspring sex ratio or maternal differential allocation. The results indicated that maternal and paternal influences modulated each other, as light mothers conceived sons when the father was heavy and, conversely, heavy mothers conceived sons when the father was light. Maternal sex-specific allocation was independent of paternal size. I found that despite a stable linear social hierarchy, the most dominant males did not monopolize paternities, possibly because they did not have the opportunity to do so. 
It is important to emphasize that strong sexual selection does not necessarily lead to contemporary high variance in reproduction, as the phenotypes selected will determine the strength of selection. By reporting a rare occurrence of non-linear sexual selection on body size experienced by a sexually dimorphic species, this thesis underlines the necessity of simultaneous study of pre- and post-copulatory sexual selection. Moreover, it provides a solid contribution to our understanding of sexual selection by highlighting the importance of underrated ecological and demographic factors such as true mating opportunity, generated by spatial overlap, and the effective number of competitors faced by each male.
... We identified gaps in data coverage for IUCN threatened category species, as well as when considering trait coverage for bats experiencing specific high-impact threats. Given the underlying low trait coverage and biases, it is critical for researchers using bat wing morphology to consider the appropriate mechanism for dealing with missing data within their framework (Nakagawa & Freckleton 2008; Baraldi & Enders 2010). For instance, missing data patterns with respect to conservation status and/or for specific threats could confound results from trait-based vulnerability assessments and bias conservation-based conclusions. ...
... Within available trait data, missing data patterns (e.g. whether data are missing at random or not) affect how ecologists should handle missing data cases (Nakagawa & Freckleton 2008). Missing data rarely occur completely at random, which means that simple approaches such as case deletion can introduce strong biases (Nakagawa & Freckleton 2008; González-Suárez et al. 2012; Pakeman 2014). ...
... As missing data are ubiquitous across fields, multiple techniques have been developed to impute non-random missing data (Bruggeman et al. 2009; Pantanowitz & Marwala 2009; van Buuren and Groothuis-Oudshoorn 2011). ...
Article
• Species’ life-history traits have a wide variety of applications in ecological and conservation research, particularly when assessing threats. The development and growth of global species’ trait databases are critical for improving trait-based analyses; however, it is vital to understand the gaps and biases in the available data. • We reviewed bat wing morphology data, specifically mass, wingspan, wing area, wing loading, and aspect ratio, to identify issues with data reporting and ambiguity. Additionally, we aimed to assess taxonomic and geographic biases in trait data coverage. • We found that most studies used similar field methodology, but that data reporting and quality were inconsistent/poor. Additionally, we noted several issues regarding semantic ambiguity in trait definitions, specifically around what constitutes wing area. • Globally, we found that bat wing morphology trait coverage was low. Only six bat families had ≥40% trait coverage, and, of those, none consisted of more than 11 species in total. We found similar biases in trait coverage across International Union for Conservation of Nature Red List categories, with threatened species having lower coverage. • Geographically, North America, Europe, and the Indomalayan regions exhibited higher overall trait coverage, while both the Afrotropical and Neotropical ecoregions showed poor trait coverage. • The underlying biases and gaps in bat wing morphology data have implications for researchers conducting global trait-based assessments. Implementing imputation techniques may address missing data, but only for smaller regional subsets with substantial trait coverage. • Due to generally low overall trait coverage, increasing species’ representation in the database should be prioritised. We suggest adopting an Ecological Trait Standard Vocabulary to reduce semantic ambiguity in bat wing morphology traits, to improve data compilation and clarity. 
Additionally, we advocate that researchers adopt an Open Science approach to facilitate the growth of a bat wing morphology trait database.
... Several widely used approaches for handling missing data can introduce bias or fail to effectively leverage existing data (Schafer and Graham, 2002; Nakagawa and Freckleton, 2008). Such approaches include: (1) ignoring data gaps through listwise or pairwise deletion (i.e., complete-case and available-case analysis); (2) replacing missing data with an average value (i.e., mean substitution); and (3) interpolating missing data through loess regression or smoothing splines (McKnight et al., 2007; Enders, 2010; Table 1 and Figure 1). ...
... These approaches can compromise data integrity by exacerbating the effects of small sample size on statistical power (listwise deletion), artificially reducing sample variance (mean substitution, smoothing), biasing parameter estimates (listwise and pairwise deletion, smoothing), or synthesizing missing data based on spurious assumptions about data structure or distribution (smoothing) (Schafer and Graham, 2002; Nakagawa and Freckleton, 2008; Little and Rubin, 2019). Although such approaches may still have occasional utility (e.g., see Hughes et al., 2019), our goal is to motivate the uptake of more robust alternatives, where appropriate. ...
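The variance-shrinking effect of mean substitution mentioned in the excerpt above is easy to verify numerically. A minimal sketch with made-up values:

```python
import statistics

full = [2.0, 3.5, 5.1, 6.4, 8.0, 9.2, 10.5, 12.1]
# Suppose the last three values went missing and were mean-substituted.
observed = full[:5]
mean_sub = observed + [statistics.mean(observed)] * 3

var_full = statistics.variance(full)
var_sub = statistics.variance(mean_sub)
# The substituted constants add no spread, so the sample variance shrinks,
# which in turn deflates standard errors and inflates false-positive rates.
```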
... The extent to which inadequate treatment of missing data leads to faulty inference, and how these gaps may be corrected, depends on the specific mechanism of missingness that has resulted in the missing data. For example, listwise deletion or complete-case analysis are only valid when data are "missing completely at random" (MCAR), meaning that there are no detectable patterns of missingness across any variables in the dataset (McKnight et al., 2007;Nakagawa and Freckleton, 2008). Assuming that data are MCAR means that researchers assume that data missingness is independent of both observed and unobserved data. ...
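The MCAR distinction above can be demonstrated with a small simulation: deleting cases is harmless when missingness is unrelated to everything (MCAR), but biased the moment missingness tracks an observed variable (MAR). The data below are synthetic and purely illustrative:

```python
import random
import statistics

rng = random.Random(42)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]  # y depends on x

# MCAR: delete y completely at random; MAR: delete y whenever x is large.
y_mcar = [yi for yi in y if rng.random() > 0.5]
y_mar = [yi for xi, yi in zip(x, y) if xi <= 0.5]

true_mean = statistics.mean(y)
mcar_mean = statistics.mean(y_mcar)
mar_mean = statistics.mean(y_mar)
# mcar_mean stays close to true_mean, only with fewer cases (lost power);
# mar_mean is biased downward, because the dropped cases with large x
# also tended to have large y.
```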
Article
Full-text available
The COVID-19 pandemic profoundly affected research in ecology and evolution, with lockdowns resulting in the suspension of most research programs and creating gaps in many ecological datasets. Likewise, monitoring efforts directed either at tracking trends in natural systems or documenting the environmental impacts of anthropogenic activities were largely curtailed. In addition, lockdowns have affected human activity in natural environments in ways that impact the systems under investigation, rendering many widely used approaches for handling missing data (e.g., available case analysis, mean substitution) inadequate. Failure to properly address missing data will lead to bias and weak inference. Researchers and environmental monitors must ensure that lost data are handled robustly by diagnosing patterns and mechanisms of missingness and applying appropriate tools like multiple imputation, full-information maximum likelihood, or Bayesian approaches. The pandemic has altered many aspects of society and it is timely that we critically reassess how we treat missing data in ecological research and environmental monitoring, and plan future data collection to ensure robust inference when faced with missing data. These efforts will help ensure the integrity of inference derived from datasets spanning the COVID-19 lockdown and beyond.
... These mechanisms for missing data can be described in the context of three general frameworks. Sparsity resulting from mechanisms (1), (2), and (3) is referred to as Missing Not At Random (MNAR), Missing At Random (MAR), and Missing Completely At Random (MCAR) [3], respectively. MCAR values are independent of both the observed and missing values and arise randomly. ...
... In practice, metabolomics data are known to contain a mixture of MAR, MCAR and MNAR missing data [4] which are typically omitted from the data set for further analyses, or otherwise, they are imputed. However, omitting missing data that are not MCAR may reduce statistical power for downstream analyses [3]. On the other hand, if missing values are imputed poorly, we risk introducing bias into our results [3]. ...
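Mechanism (1) above, a signal falling below the instrument's limit of detection, is the canonical MNAR case: whether a value is missing depends on the value itself. A toy simulation (the detection limit and the lognormal abundance model are invented for illustration):

```python
import random
import statistics

rng = random.Random(7)
LOD = 1.0  # hypothetical instrument limit of detection

true_abundance = [rng.lognormvariate(0, 1) for _ in range(5000)]
# MNAR censoring: signals below the LOD are never recorded, so
# missingness depends on the (unobserved) value itself.
recorded = [a for a in true_abundance if a >= LOD]

# Ignoring the censored fraction inflates the apparent mean abundance.
naive_mean = statistics.mean(recorded)
true_mean = statistics.mean(true_abundance)
```

This is why MNAR-aware imputers for metabolomics typically fill such gaps with small values (e.g. a fraction of the minimum observed) rather than with the observed mean.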
Article
Full-text available
When analyzing large datasets from high-throughput technologies, researchers often encounter missing quantitative measurements, which are particularly frequent in metabolomics datasets. Metabolomic data, comprehensive profiles of metabolite abundances, are typically measured using mass spectrometry technologies that often introduce missingness via multiple mechanisms: (1) the metabolite signal may be smaller than the instrument limit of detection; (2) the conditions under which the data are collected and processed may lead to missing values; (3) missing values can be introduced randomly. Missingness resulting from mechanism (1) would be classified as Missing Not At Random (MNAR), that from mechanism (2) would be Missing At Random (MAR), and that from mechanism (3) would be classified as Missing Completely At Random (MCAR). Two common approaches for handling missing data are the following: (1) omit missing data from the analysis; (2) impute the missing values. Both approaches may introduce bias and reduce statistical power in downstream analyses such as testing metabolite associations with clinical variables. Further, standard imputation methods in metabolomics often ignore the mechanisms causing missingness and inaccurately estimate missing values within a data set. We propose a mechanism-aware imputation algorithm that leverages a two-step approach in imputing missing values. First, we use a random forest classifier to classify the missing mechanism for each missing value in the data set. Second, we impute each missing value using imputation algorithms that are specific to the predicted missingness mechanism (i.e., MAR/MCAR or MNAR). Using complete data, we conducted simulations, where we imposed different missingness patterns within the data and tested the performance of combinations of imputation algorithms. Our proposed algorithm provided imputations closer to the original data than those using only one imputation algorithm for all the missing values. 
Consequently, our two-step approach was able to reduce bias for improved downstream analyses.
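The two-step idea, classify the mechanism first and then impute accordingly, can be caricatured as follows. The paper's actual first step is a random forest classifier; in this sketch a crude detection-limit heuristic stands in for it, and the per-mechanism imputers (half-minimum for MNAR, mean fill for MAR/MCAR) are deliberately simplistic. All names and values are invented:

```python
import statistics

def classify_mechanism(observed, lod=1.0):
    """Toy stand-in for a learned mechanism classifier: if a metabolite's
    observed values hug the detection limit, treat its missing entries
    as MNAR (likely censored); otherwise as MAR/MCAR."""
    return "MNAR" if min(observed) < 1.5 * lod else "MAR/MCAR"

def impute_column(values, lod=1.0):
    """Step 1: classify the mechanism; step 2: apply a mechanism-specific
    imputer to the missing entries (None)."""
    observed = [v for v in values if v is not None]
    mech = classify_mechanism(observed, lod)
    if mech == "MNAR":
        fill = min(observed) / 2          # half-minimum: value fell below LOD
    else:
        fill = statistics.mean(observed)  # simplistic fill for MAR/MCAR
    return [fill if v is None else v for v in values], mech

col_mnar = [1.1, None, 2.0, 1.2, None, 3.1]   # values near the LOD -> MNAR
col_mcar = [8.0, None, 9.5, 10.2, 8.8, None]  # well above the LOD -> MAR/MCAR

imputed1, mech1 = impute_column(col_mnar)
imputed2, mech2 = impute_column(col_mcar)
```

The design point mirrors the abstract: a single imputer applied to all missing values cannot be right for both censored (MNAR) and random (MAR/MCAR) gaps, so routing each gap to a mechanism-appropriate imputer reduces overall bias.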
... Information on species is often incomplete because they are rare, cryptic, or extinct (Nakagawa and Freckleton, 2008). When dealing with missing life history traits of species it is often not possible to remove the variables, or species containing the missing information, as this can lead to a decrease in statistical power or skewed results (Nakagawa and Freckleton, 2008;Penone et al., 2014). ...
... When studying extinct species, data are considered Missing At Random (MAR). However, the missing data are not completely random, as extinction events likely have some factors in common (multiple species may have similar traits that resulted in their extinction) that are shared between the extinct species (Fisher et al., 2003; Nakagawa and Freckleton, 2008). ...
Thesis
Understanding the variations in structure and abundance of animals and what leads to their distribution within the landscape has captured the attention of ecologists for centuries. Importantly, knowledge of current behaviour of large mammals can be used to inform historic population dynamics and is essential to understanding how early humans used large mammals as a foraging resource. Central to this thesis and improving our understanding of large herbivores is the Palaeo-Agulhas Plain (PAP), where large mammalian herbivores formed a key food resource for early humans. The PAP, now submerged off the southern Cape of South Africa, formed a novel ecosystem during lower sea levels. Characterised by large expanses of nutrient-rich grasslands and large grazing herbivores, the PAP stands in stark contrast to the nutrient-poor fynbos ecosystems of the southern Cape today. In this thesis I focus on the Last Glacial Maximum (LGM; ~20 ka), when the PAP was last fully exposed, to answer questions relating to the habitat use and range distribution of large herbivores. Importantly, through the Paleoscape Project, modelled climate, soil and vegetation have made these reconstructions of large mammal communities possible. Using modelled climate and vegetation, this thesis aims to model the large herbivore communities and understand the influence of early humans on the PAP during the LGM for successful integration into the PaleoscapeABM (the PAP agent-based model). To improve our understanding of large mammals on the PAP I identified five large herbivores that became extinct on the PAP since the LGM and modelled their behavioural and physical traits using k-Nearest Neighbour imputation. I predicted the biomass of large herbivores across the PAP using actual biomass of large herbivores from 39 protected areas across South Africa (spanning five functional groups to include the extinct species) across a rainfall gradient and different biomes. 
The distribution of large herbivores would likely have been driven by similar top-down and bottom-up drivers we see in large herbivore ecology today. Knowing this, I created a predictive model for large mammals by predicting the probability of occurrence of functional groups of large herbivores in relation to environmental drivers and humans. Results showed that all species (except Antidorcas australis) were adapted to the grassy environment of the PAP and these specialisations likely contributed to their extinction along with changing climates and intensified hunting from humans. When predicting herbivore biomass, biome was the most important factor influencing the relationship between herbivores and rainfall. In general, large herbivore biomass increased with rainfall across biomes, except for grassland. Finally, I showed that the probability of occurrence of large herbivores was influenced by early humans, water availability and a landscape of fear on the PAP. Through this thesis I have successfully provided detailed accounts of the biomass and probability of occurrence of large herbivores on the PAP. Importantly, this information can be seamlessly integrated into the PaleoscapeABM. Finally, I highlight the importance of this knowledge in understanding early humans, the potential shortcomings of this study and resulting areas where research needs to be focused.
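The k-Nearest Neighbour imputation used in the thesis for the traits of extinct species can be sketched as follows. The species names, traits, squared-Euclidean distance, and k are all invented for the example; a real application would standardize the traits, tune k, and likely weight neighbours by distance:

```python
import statistics

def knn_impute(species, target, k=2):
    """Impute a missing trait (None) as the mean of that trait across
    the k species nearest in the remaining, fully observed traits."""
    known = {s: t for s, t in species.items() if t[target] is not None}
    out = {}
    for s, traits in species.items():
        if traits[target] is not None:
            out[s] = traits[target]
            continue
        def dist(other):
            # Squared Euclidean distance over all traits except the target.
            return sum((traits[f] - known[other][f]) ** 2
                       for f in traits if f != target)
        neighbours = sorted(known, key=dist)[:k]
        out[s] = statistics.mean(known[n][target] for n in neighbours)
    return out

# Hypothetical herbivore traits: body mass (log kg) and % grass in diet.
species = {
    "sp_a": {"mass": 2.0, "graze": 90.0},
    "sp_b": {"mass": 2.1, "graze": 85.0},
    "sp_c": {"mass": 0.5, "graze": 20.0},
    "sp_extinct": {"mass": 2.05, "graze": None},  # extinct, diet unknown
}
filled = knn_impute(species, "graze")
```

Here the extinct species borrows its diet estimate from the two extant species closest in body mass, which is the intuition behind trait imputation for rare, cryptic, or extinct taxa.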
... To answer these questions, ecologists and evolutionary biologists rely upon databases that include information on a large number of species and their traits (e.g., Jones et al., 2009; Kattge et al., 2011; Reichman et al., 2011; Wilman et al., 2014). However, missing data are a common feature of biodiversity datasets (Etard et al., 2020; Nakagawa & Freckleton, 2008) and the application of statistical analyses to these data can lead to biased estimates and conclusions about the phenomena of interest (Rubin, 1976). This lack of knowledge about species' traits and their ecological functions has been defined as the Raunkiaeran shortfall (Hortal et al., 2015) or Eltonian shortfall (Rosado, 2015). ...
... These mechanisms were classified into three categories: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). Under MCAR, missing values are equally probable across a dataset; under MAR, the probability of missing data is correlated with other variables (e.g., trait values are missing mainly in a phylogenetically related group of species) rather than with the variable with missing data (the target variable); under MNAR, the probability of missing data is itself correlated with the target variable after controlling for other variables (Enders, 2010; Nakagawa & Freckleton, 2008; Rubin, 1976; van Buuren, 2012) (Fig. 1). When dealing with missing data, the above mechanisms need to be taken into account before analysis (Rubin, 1976) to minimize biases in parameter estimates (Enders, 2010; Rubin, 1976; van Buuren, 2012). ...
... These missing data properties should be a concern even for strongly correlated traits, as we showed for the brain size and body size relationship in primates. This has been commonly acknowledged in the statistical literature and should begin to be so in ecological and evolutionary research (Nakagawa, 2015; Nakagawa & Freckleton, 2008), as highlighted by recent studies (Johnson et al., 2020). The main obstacle to using large biodiversity datasets is the proportion of species lacking data. ...
Article
Full-text available
Given the prevalence of missing data on species' traits (the Raunkiaeran shortfall), several methods have been proposed to fill sparse databases. However, analyses based on these imputed databases can introduce several biases. Here, we evaluated potential estimation biases caused by the use of imputed databases. In the evaluation, we considered the estimation of descriptive statistics, regression coefficients, and phylogenetic signal for different missing-data and imputation scenarios. We found that the percentage of missing data, the missingness mechanism and the imputation method were important in determining estimation errors. Imputation errors are not linearly related to estimation errors. Adding phylogenetic information provides better estimates of the evaluated statistics, but this information should be combined with other variables, such as traits correlated with the missing-data variable. Using an empirical dataset, we found that even traits that are strongly correlated with each other, such as brain and body size in primates, can produce biases when estimating phylogenetic signal from datasets with missing data. We advise researchers to share both their raw and imputed data, and to consider the pattern of missing data when evaluating which methods perform better for their goals. In addition, the performance of imputation methods should be judged mainly on statistical estimates instead of only on imputation error.
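The claim that imputation error and estimation error are not interchangeable can be demonstrated directly: mean substitution minimizes per-value error yet distorts a downstream estimate (here, the variance), while stochastic imputation loses on per-value error yet recovers the estimate. The data below are synthetic and the comparison is deliberately minimal:

```python
import random
import statistics

rng = random.Random(3)
true_vals = [rng.gauss(10, 2) for _ in range(1000)]
missing_idx = set(rng.sample(range(1000), 300))  # 30% MCAR

observed = [v for i, v in enumerate(true_vals) if i not in missing_idx]
mu, sd = statistics.mean(observed), statistics.stdev(observed)

def rmse(imputed):
    """Per-value imputation error over the missing entries."""
    errs = [(imputed[i] - true_vals[i]) ** 2 for i in missing_idx]
    return statistics.mean(errs) ** 0.5

# Mean substitution: the best single per-value guess, but it flattens
# the distribution. Stochastic imputation: noisier per value, but it
# preserves the spread.
mean_fill = {i: mu for i in missing_idx}
draw_fill = {i: rng.gauss(mu, sd) for i in missing_idx}

def completed(fill):
    return [fill.get(i, v) for i, v in enumerate(true_vals)]

var_true = statistics.variance(true_vals)
var_mean = statistics.variance(completed(mean_fill))
var_draw = statistics.variance(completed(draw_fill))
# mean_fill wins on per-value RMSE yet biases the variance low;
# draw_fill loses on RMSE yet recovers the variance far better.
```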
... Missing-data-imputation techniques are also used in ecology, but much less than in the social or medical sciences (Van Buuren 2018), even though ecological data, by their nature, often contain gaps, outliers or even wrong data (e.g. Underhill and Prys-Jones 1994; Nakagawa and Freckleton 2008; Jóhannesson et al. 2019; Hallmann et al. 2020). In ecological studies, most data gaps are unfortunately ignored or removed, or sometimes filled with single-imputation methods (Nakagawa 2015; Onkelinx et al. 2017a). ...
... In the other cells, percentage differences between the "most correct" model and the values estimated by the single methods or other aggregation models are shown. ... research (Underhill and Prys-Jones 1994; Nakagawa and Freckleton 2008; Jóhannesson et al. 2019). As Luengo et al. (2010) state, rates of missing data below 1% are generally considered trivial, 1–5% are manageable, 5–15% require sophisticated methods to handle, and more than 15% may severely impact interpretation and decrease the accuracy of model results. ...
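The Luengo et al. (2010) bands quoted above are easy to encode as a quick triage check before choosing a missing-data strategy. A minimal sketch in Python (the function name and category labels are illustrative, not from the cited paper):

```python
def missingness_severity(pct_missing):
    """Classify the practical impact of a dataset's missing-data rate,
    following the bands attributed to Luengo et al. (2010) in the text:
    <1% trivial, 1-5% manageable, 5-15% needs sophisticated methods,
    >15% may severely affect interpretation."""
    if pct_missing < 1:
        return "trivial"
    elif pct_missing <= 5:
        return "manageable"
    elif pct_missing <= 15:
        return "requires sophisticated methods"
    else:
        return "potentially severe impact"

# e.g. the 8.57% missingness mentioned in the GLMM example above falls
# into the band that calls for methods such as multiple imputation
for pct in (0.5, 3, 8.57, 20):
    print(pct, "->", missingness_severity(pct))
```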
Article
Ecological datasets often contain gaps, outliers or even incorrect data. Ignoring the problem of missing data can lead to reduction in the statistical power of the models used, estimation of biased parameters and incorrect conclusions about the phenomenon studied. In this study, using simulated and real ecological data (seven-year monitoring of offspring production in 239 white stork (Ciconia ciconia) nests), we show how results differ when ignoring missing data, filling the gaps using single methods (including fuzzy methods) and using multiple-imputation and aggregation techniques. Based on simulation data, we showed that data gaps can be filled with high accuracy if the appropriate method (model) is used (96.5% of perfectly matched cases). Based on empirical data, we showed how results can differ when accepting or filling the missing data. These differences concerned both general indicators of breeding success (e.g. total number of offspring, mean annual productivity per nest), differences in trends (e.g. increase or decrease in productivity between years) and more detailed analyses, such as ranks of the most productive nests. The observed differences in the results could lead to formulation of incorrect conclusions about the state of the stork population, the condition of its habitat or conservation guidelines. We highlight that well-developed set of data-imputation methods dedicated to monitoring the white stork could increase the accuracy of modern estimates and re-analyse rich historical data. Similar data pre-processing solutions to fill data gaps should also find wider application in other ecological research.
... To use as many species for the analyses as possible, we followed previous phylogenetic analyses and imputed some of the life history trait values that were not available in the literature 41 . The following data were missing: 31% of female body mass, 20% of male body mass, 48% of testis mass, 36% of sperm length, 31% of female gamete mass, 27% of clutch size, 14% of opportunity for selection, 7% of opportunity for sexual selection and 23% of Bateman gradient (Fig. 2). ...
... The following data were missing: 31% of female body mass, 20% of male body mass, 48% of testis mass, 36% of sperm length, 31% of female gamete mass, 27% of clutch size, 14% of opportunity for selection, 7% of opportunity for sexual selection and 23% of Bateman gradient (Fig. 2). In multi-predictor models, casewise deletion of species with missing data is expected to reduce the power of analysis and potentially bias the results 41 . To pre-empt these potential caveats, missing data were estimated using multiple imputation (see below) and the PGLSs were carried out using the imputed datasets as well. ...
Article
Full-text available
Males and females often display different behaviours and, in the context of reproduction, these behaviours are labelled sex roles. The Darwin–Bateman paradigm argues that the root of these differences is anisogamy (i.e., differences in size and/or function of gametes between the sexes) that leads to biased sexual selection, and sex differences in parental care and body size. This evolutionary cascade, however, is contentious since some of the underpinning assumptions have been questioned. Here we investigate the relationships between anisogamy, sexual size dimorphism, sex difference in parental care and intensity of sexual selection using phylogenetic comparative analyses of 64 species from a wide range of animal taxa. The results question the first step of the Darwin–Bateman paradigm, as the extent of anisogamy does not appear to predict the intensity of sexual selection. The only significant predictor of sexual selection is the relative inputs of males and females into the care of offspring. We propose that ecological factors, life-history and demography have more substantial impacts on contemporary sex roles than the differences of gametic investments between the sexes.
... The debate about which of the three hypotheses best explains brain size evolution coincides with controversy over what specific variables select for the evolution of larger brains. This situation is exacerbated by poor data availability for many important variables, particularly behavioural data, such that only small subsets of species have a complete collection of variables and therefore confidence in the analyses is low [15][16][17][18]. ...
... In the current study, we expand existing marsupial brain size data by a third and use several novel analytical approaches providing the most comprehensive test of the main hypotheses of brain evolution. This involves the first use of phylogenetically informed multiple imputations (MI) through chained equations of missing data in brain size studies [18,44,45], followed by phylogenetically corrected Bayesian generalized linear mixed-effects modelling-MCMCglmm [46]. ...
Article
Considerable controversy exists about which hypotheses and variables best explain mammalian brain size variation. We use a new, high-coverage dataset of marsupial brain and body sizes, and the first phylogenetically imputed full datasets of 16 predictor variables, to model the prevalent hypotheses explaining brain size evolution using phylogenetically corrected Bayesian generalized linear mixed-effects modelling. Despite this comprehensive analysis, litter size emerges as the only significant predictor. Marsupials differ from the more frequently studied placentals in displaying a much lower diversity of reproductive traits, which are known to interact extensively with many behavioural and ecological predictors of brain size. Our results therefore suggest that studies of relative brain size evolution in placental mammals may require targeted co-analysis or adjustment of reproductive parameters like litter size, weaning age or gestation length. This supports suggestions that significant associations between behavioural or ecological variables with relative brain size may be due to a confounding influence of the extensive reproductive diversity of placental mammals.
... Independent of the approach used to collect data, missing data are fairly common, not the exception, in trait datasets. The effects of missing data on analyses are not easy to predict (Nakagawa and Freckleton, 2008) and can depend on a series of factors, such as the proportion of missing entries and the distribution of these missing values (Goberna and Verdú, 2016; Májeková et al., 2016; Pakeman, 2014). ...
... Missing values were introduced into the datasets completely at random (Nakagawa and Freckleton, 2008) using the prodNA function of the missForest package (Stekhoven and Bühlmann, 2012). We simulated missing data as a proportion of the cells in the trait dataset, ranging from 5% to 50% (proportions of missing data of 0.05, 0.10, 0.20, 0.30, 0.40 and 0.50). ...
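The MCAR masking that prodNA performs in R can be mimicked in a few lines of Python, which is handy for checking an imputation pipeline without the missForest package. A hedged sketch (prodNA itself operates on data frames; this analogue works on a plain 2-D list, and the fixed seed is only for reproducibility):

```python
import random

def introduce_mcar(data, prop, rng=None):
    """Return a copy of a 2-D list with `prop` of its cells set to None,
    chosen uniformly at random -- a pure-Python analogue of the MCAR
    (missing completely at random) behaviour described in the text."""
    rng = rng or random.Random(42)
    out = [row[:] for row in data]  # copy so the original stays intact
    cells = [(i, j) for i in range(len(out)) for j in range(len(out[0]))]
    for i, j in rng.sample(cells, round(prop * len(cells))):
        out[i][j] = None
    return out

masked = introduce_mcar([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], 0.25)
print(sum(v is None for row in masked for v in row))  # 3 of the 12 cells
```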
Article
Trait-based approaches offer complementary views to the classic taxonomic approach, which is a crucial step forward to unveil mechanisms of community assembly, species interactions, ecosystem functioning, and tackling important conservation issues. These approaches require an enormous sampling effort to provide complete trait datasets; consequently, missing data are very common. We evaluated the performance of the missForest algorithm, which uses the Random Forest method to impute species trait values using phylogenetic information. We simulated datasets with different sizes and proportions of missing data, different levels of trait conservatism, and trait correlation. We tested trait imputation using the missForest algorithm without phylogenetic information and adding the phylogenetic relationships among species, using phylogenetic eigenvectors. Our results show that the level of phylogenetic signal in traits and the correlation among them are the main parameters that influence the measures of imputation error. The measures of error are smaller when traits have higher levels of correlation and when traits are conserved in the phylogenetic tree. In general, the inclusion of phylogenetic eigenvectors as predictors in the missForest algorithm improves the estimation of missing values. However, the importance of phylogenetic information to the imputation process depends on the proportion of missing entries, phylogenetic conservatism of traits, and the correlation among traits. The missForest algorithm seems to be a robust method for trait imputation, and it can be used to estimate missing traits without the exclusion of species. Thus, we hope to have contributed a new step to guide methodological choices when imputing entire trait databases, with the goal of decreasing uncertainties and bias in the interpretation of ecological patterns and processes at different levels of ecological organization.
... The dataset presented here was considerably patchy in both space and time despite prior filtering (e.g., no South Pacific observations in 2003, 2012 or 2013). Patchy data are a pertinent issue in statistics [49] and can have a considerable effect on the estimated model parameters [50], and model selection criteria (e.g., deviance information criterion-DIC) [51]. Thus, to address patchiness beyond basic filtering, we performed a simulation test ( Figure S3 and Figure S4). ...
Article
Full-text available
Increasingly intense marine heatwaves threaten the persistence of many marine ecosystems. Heat stress-mediated episodes of mass coral bleaching have led to catastrophic coral mortality globally. Remotely monitoring and forecasting such biotic responses to heat stress is key for effective marine ecosystem management. The Degree Heating Week (DHW) metric, designed to monitor coral bleaching risk, reflects the duration and intensity of heat stress events and is computed by accumulating SST anomalies (HotSpot) relative to a stress threshold over a 12-week moving window. Despite significant improvements in the underlying SST datasets, corresponding revisions of the HotSpot threshold and accumulation window are still lacking. Here, we fine-tune the operational DHW algorithm to optimise coral bleaching predictions using the 5 km satellite-based SSTs (CoralTemp v3.1) and a global coral bleaching dataset (37,871 observations, National Oceanic and Atmospheric Administration). After developing 234 test DHW algorithms with different combinations of the HotSpot threshold and accumulation window, we compared their bleaching prediction ability using spatiotemporal Bayesian hierarchical models and sensitivity–specificity analyses. Peak DHW performance was reached using HotSpot thresholds less than or equal to the maximum of monthly means SST climatology (MMM) and accumulation windows of 4–8 weeks. This new configuration correctly predicted up to an additional 310 bleaching observations globally compared to the operational DHW algorithm, an improved hit rate of 7.9%. Given the detrimental impacts of marine heatwaves across ecosystems, heat stress algorithms could also be fine-tuned for other biological systems, improving scientific accuracy, and enabling ecosystem governance.
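The accumulation step at the heart of the DHW metric — summing HotSpot anomalies above a threshold over a moving window and expressing the result in °C-weeks — can be sketched as follows. This is an illustrative simplification assuming daily SSTs, the conventional 1 °C HotSpot cutoff and an 84-day (12-week) window; these two parameters are exactly the knobs the cited study tunes:

```python
def degree_heating_weeks(daily_sst, mmm, window_days=84, hotspot_cutoff=1.0):
    """Sketch of the Degree Heating Week accumulation described in the text:
    daily HotSpots (SST anomalies above the MMM climatology) that reach the
    cutoff are summed over a moving window and expressed in degC-weeks.
    Simplified for illustration; not the operational NOAA implementation."""
    recent = daily_sst[-window_days:]           # most recent window
    hotspots = [sst - mmm for sst in recent]    # anomalies vs. climatology
    return sum(h for h in hotspots if h >= hotspot_cutoff) / 7.0

# 28 days at 1.5 degC above a 28.0 degC MMM -> 28 * 1.5 / 7 = 6 degC-weeks
print(degree_heating_weeks([29.5] * 28, mmm=28.0))  # 6.0
```

Lowering `hotspot_cutoff` or shortening `window_days` reproduces, in miniature, the kind of reconfiguration the study evaluates across its 234 test algorithms.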
... Where species had missing trait data, we selected congeneric species to fill in these gaps, because deleting taxa with missing data can reduce statistical power and lead to biased results (Nakagawa & Freckleton, 2008). The only gaps in data that needed filling in this way were life span for Sylvia undata (surrogate species: Sylvia melanocephala) and Regulus ignicapillus (surrogate species: Regulus regulus), and gape width for Actitis hypoleucos (surrogate species: Actitis macularius). ...
Article
Full-text available
Functional diversity metrics based on species traits are widely used to investigate ecosystem functioning. In theory, such metrics have different implications depending on whether they are calculated from traits mediating responses to environmental change (response traits) or those regulating function (effect traits), yet trait choice in diversity metrics is rarely scrutinized. Here, we compile effect and response traits for British bird species supplying two key ecological services – seed dispersal and insect predation – to assess the relationship between functional diversity and both mean and stability of community abundance over time. As predicted, functional diversity correlates with stability in community abundance of seed dispersers when calculated using response traits. However, we found a negative relationship between functional diversity and mean community abundance of seed dispersers when calculated using effect traits. Subsequently, when combining all traits together, we found inconsistent results with functional diversity correlating with reduced stability in insectivores, but greater stability in seed dispersers. Our findings suggest that trait choice should be considered more carefully when applying such metrics in ecosystem management.
... For meta-analysis of variance across drug treatment, we included data from studies with sufficient group-level information on the drug group, rat strain, and sex of experimental/control groups (see full model parameters in S3 Table). For all analyses, we dealt with missing data via multiple imputation [65,66] using the package mice [67] as follows: We first generated multiple, simulated datasets (m = 20) by replacing missing values with possible values under the assumption that data are missing at random (MAR) [68,69]. After imputation, meta-analyses were performed on each imputed dataset (as described in the Statistical analysis section), and model estimates were then pooled across analyses into a single set of estimates and errors. ...
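After fitting a model to each of the m imputed datasets, mice-style workflows pool the per-imputation results with Rubin's rules: the pooled estimate is the mean of the estimates, and the total variance combines within- and between-imputation variance. A minimal sketch for a single parameter (the function name is illustrative; the cited study used m = 20):

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets with Rubin's rules,
    the standard way multiple-imputation analyses combine per-imputation
    results into a single set of estimates and errors."""
    m = len(estimates)
    qbar = mean(estimates)           # pooled point estimate
    ubar = mean(variances)           # within-imputation variance
    b = variance(estimates)          # between-imputation variance
    total = ubar + (1 + 1 / m) * b   # total variance of the pooled estimate
    return qbar, total

est, var = pool_rubin([1.0, 1.2, 0.9, 1.1], [0.04, 0.05, 0.04, 0.05])
print(round(est, 3), round(var, 4))
```

Note that the total variance exceeds the average within-imputation variance whenever the imputations disagree, which is precisely how multiple imputation avoids the underestimated errors that single imputation produces.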
Article
Full-text available
The replicability of research results has been a cause of increasing concern to the scientific community. The long-held belief that experimental standardization begets replicability has also been recently challenged, with the observation that the reduction of variability within studies can lead to idiosyncratic, lab-specific results that cannot be replicated. An alternative approach is to, instead, deliberately introduce heterogeneity, known as “heterogenization” of experimental design. Here, we explore a novel perspective in the heterogenization program in a meta-analysis of variability in observed phenotypic outcomes in both control and experimental animal models of ischemic stroke. First, by quantifying interindividual variability across control groups, we illustrate that the amount of heterogeneity in disease state (infarct volume) differs according to methodological approach, for example, in disease induction methods and disease models. We argue that such methods may improve replicability by creating diverse and representative distribution of baseline disease state in the reference group, against which treatment efficacy is assessed. Second, we illustrate how meta-analysis can be used to simultaneously assess efficacy and stability (i.e., mean effect and among-individual variability). We identify treatments that have efficacy and are generalizable to the population level (i.e., low interindividual variability), as well as those where there is high interindividual variability in response; for these, latter treatments translation to a clinical setting may require nuance. We argue that by embracing rather than seeking to minimize variability in phenotypic outcomes, we can motivate the shift toward heterogenization and improve both the replicability and generalizability of preclinical research.
... Temporal variation in selection will thus fundamentally affect 43 population responses to varying and changing environmental conditions. In particular, both 44 fluctuating selection and sexually antagonistic selection, respectively defined as episodes of 45 selection acting in opposite directions within short ecologically-relevant periods or between 46 the sexes, can help maintain genetic variation and alter timeframes for adaptation [3,5]. Yet, 47 temporal dynamics of sex-specific selection on key traits in wild populations have still rarely 48 been quantified [2]. ...
Article
Full-text available
Quantifying temporal variation in sex-specific selection on key ecologically relevant traits, and quantifying how such variation arises through synergistic or opposing components of survival and reproductive selection, is central to understanding eco-evolutionary dynamics, but rarely achieved. Seasonal migration versus residence is one key trait that directly shapes spatio-seasonal population dynamics in spatially and temporally varying environments, but temporal dynamics of sex-specific selection have not been fully quantified. We fitted multi-event capture-recapture models to year-round ring resightings and breeding success data from partially migratory European shags (Phalacrocorax aristotelis) to quantify temporal variation in annual sex-specific selection on seasonal migration versus residence arising through adult survival, reproduction and the combination of both (i.e. annual fitness). We demonstrate episodes of strong and strongly fluctuating selection through annual fitness that were broadly synchronized across females and males. These overall fluctuations arose because strong reproductive selection against migration in several years contrasted with strong survival selection against residence in years with extreme climatic events. These results indicate how substantial phenotypic and genetic variation in migration versus residence could be maintained, and highlight that biologically important fluctuations in selection may not be detected unless both survival selection and reproductive selection are appropriately quantified and combined.
... The second approach deletes all records containing missing values [6]. This approach is simple but has a serious disadvantage: with small datasets, the correctness of statistical analysis is reduced [7]. The third approach to handling missing data is imputation, which fills in the missing values using estimation methods [4,5]. ...
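The contrast between the deletion and imputation approaches described above can be made concrete in a few lines. A hedged sketch using None for missing cells and column means as the simplest (single) imputation — the very strategy the surrounding literature cautions can underestimate errors:

```python
from statistics import mean

def complete_case(rows):
    """Casewise deletion: drop any record with a missing (None) value --
    the simple approach whose power loss the text warns about."""
    return [r for r in rows if None not in r]

def mean_impute(rows):
    """Single imputation: replace each missing value with its column mean.
    Shown only as the simplest baseline, not a recommended method."""
    cols = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in cols]
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

data = [[1.0, 2.0], [None, 4.0], [3.0, None], [5.0, 6.0]]
print(len(complete_case(data)))   # only 2 of 4 records survive deletion
print(mean_impute(data)[1][0])    # 3.0, the mean of the observed column values
```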
Article
Full-text available
The presence of missing data is a common and pivotal issue that generally leads to a serious decrease in data quality, making effective handling of missing data a necessity. In this paper, we propose a missing-value imputation approach driven by Fuzzy C-Means clustering to improve classification accuracy by referring only to the known feature values of selected instances. In particular, the missing values for each instance are imputed by selecting a shorter interval based on the cluster membership value within a certain threshold limit for each feature; using a short interval improves imputation effectiveness and yields more accurate estimates than using a long interval. Our method is evaluated by comparison with state-of-the-art imputation methods on UCI datasets. The experimental results demonstrate that the proposed approach performs comparably to or better than those state-of-the-art imputation methods.
... When combining data from different sources into one body of metadata, some features of the studied objects were found to be absent (e.g., T, pH, conductivity, or sampling month). We decided not to exclude samples with missing data from the multidimensional statistical analysis because this approach reduces statistical power and increases the systematic bias of the estimate [91][92][93]. We took the mean value for waters with unknown quantitative temperature, conductivity and/or pH values, and used random values from the list of known values for missing qualitative data (sampling month) according to the authors' guidelines [94,95]. ...
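The gap-filling rule the authors describe — column mean for unknown quantitative values, a random draw from the known values for unknown qualitative ones — can be sketched as a single helper (the function name and fixed seed are illustrative, not from the cited study):

```python
import random

def fill_mixed(values, quantitative, rng=None):
    """Fill gaps the way the cited study describes: unknown quantitative
    values get the mean of the known ones (e.g. temperature, pH); unknown
    qualitative values get a random draw from the known values
    (e.g. sampling month)."""
    rng = rng or random.Random(0)
    known = [v for v in values if v is not None]
    if quantitative:
        fill = lambda: sum(known) / len(known)
    else:
        fill = lambda: rng.choice(known)
    return [fill() if v is None else v for v in values]

print(fill_mixed([7.0, None, 8.0, 9.0], quantitative=True))  # gap -> mean, 8.0
filled = fill_mixed(["May", None, "June"], quantitative=False)
print(filled[1] in {"May", "June"})  # drawn from the known months
```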
Article
Full-text available
Silica-scaled chrysophytes have an ancient origin; nowadays they inhabit many northern water bodies. As the territories above the 60th parallel north were under the influence of glaciers during the Late Pleistocene, the local water bodies and their microalgal populations formed mainly during the Early Holocene. Now, the arctic, sub-arctic and temperate zones are located here and the water bodies in these regions have varying environmental characteristics. We analyzed the dispersal of silica-scaled chrysophytes in 193 water bodies in 21 northern regions, and for 135 of them determined the role of diverse environmental factors in their species composition and richness using statistical methods. Although the species composition and richness certainly depend on water body location, water temperature and conductivity, regions and individual water bodies with similar species composition can be significantly distant in latitudinal direction. Eighteen species and one variety from 165 taxa occurring here have clear affinities to fossil congeners; they have been encountered in all regions studied and amount to 6–54% of the total number of silica-scaled chrysophytes. We also compared the distribution of the species with a reconstruction of glacier-dammed lakes in the Northern Hemisphere in the Late Pleistocene–Early Holocene. The dispersal of silica-scaled chrysophytes in the northern water bodies could take place in the Late Pleistocene–Early Holocene over the circumpolar freshwater network of glacier-dammed lakes, the final Protista composition being subject to the environmental parameters of each individual water body and the region where the water body is located. This species dispersal scenario can also be valid for other microscopic aquatic organisms as well as for southerly water bodies of the Northern Hemisphere.
... Just 81 (14%) species had no missing data, although most (82%) had a growth or developmental time estimate for one life stage. To make full use of the data and limit potential biases, we imputed missing host and parasite traits (Nakagawa and Freckleton 2008). The imputation procedure was described previously (Benesh et al. 2021b), so we only present it briefly. ...
Article
Full-text available
Parasitic worms (helminths) with complex life cycles divide growth and development between successive hosts. Using data from 597 species of acanthocephalans, cestodes, and nematodes with two‐host life cycles, we found that helminths with larger intermediate hosts were more likely to infect larger, endothermic definitive hosts, although some evolutionarily shifts in definitive host mass occurred without changes in intermediate host mass. Life‐history theory predicts parasites to shift growth to hosts in which they can grow rapidly and/or safely. Accordingly, helminth species grew relatively less as larvae and more as adults if they infected smaller intermediate hosts and/or larger, endothermic definitive hosts. Growing larger than expected in one host, relative to host mass/endothermy, was not associated with growing less in the other host, implying a lack of cross‐host tradeoffs. Rather, some helminth orders had both large larvae and large adults. Within these taxa, though, size at maturity in the definitive host was unaffected by changes to larval growth, as predicted by optimality models. Parasite life‐history strategies were mostly (though not entirely) consistent with theoretical expectations, suggesting that helminths adaptively divide growth and development between the multiple hosts in their complex life cycles. This article is protected by copyright. All rights reserved
... Statistical programs often default to 'complete case', deleting rows that contain missing data (empty cells) prior to analysis, but our assessment of reporting practices found it was uncommon for authors to state that complete case analysis was conducted (despite their data showing missing values for metaregression moderator variables). Understandably, authors might not recognise the passive method of complete case analysis as a method of dealing with missing data, but it is important to be explicit about this step, both for the sample size implications (Item 20) and because of the potential to introduce bias when data are not 'missing completely at random' (Nakagawa & Freckleton, 2008;Little & Rubin, 2020). As an alternative to complete case analysis, authors can impute missing data based on the values of available correlated variables (e.g. ...
Article
Full-text available
Since the early 1990s, ecologists and evolutionary biologists have aggregated primary research using meta-analytic methods to understand ecological and evolutionary phenomena. Meta-analyses can resolve long-standing disputes, dispel spurious claims, and generate new research questions. At their worst, however, meta-analysis publications are wolves in sheep's clothing: subjective with biased conclusions, hidden under coats of objective authority. Conclusions can be rendered unreliable by inappropriate statistical methods, problems with the methods used to select primary research, or problems within the primary research itself. Because of these risks, meta-analyses are increasingly conducted as part of systematic reviews, which use structured, transparent, and reproducible methods to collate and summarise evidence. For readers to determine whether the conclusions from a systematic review or meta-analysis should be trusted - and to be able to build upon the review - authors need to report what they did, why they did it, and what they found. Complete, transparent, and reproducible reporting is measured by 'reporting quality'. To assess perceptions and standards of reporting quality of systematic reviews and meta-analyses published in ecology and evolutionary biology, we surveyed 208 researchers with relevant experience (as authors, reviewers, or editors), and conducted detailed evaluations of 102 systematic review and meta-analysis papers published between 2010 and 2019. Reporting quality was far below optimal and approximately normally distributed. Measured reporting quality was lower than what the community perceived, particularly for the systematic review methods required to measure trustworthiness. The minority of assessed papers that referenced a guideline (~16%) showed substantially higher reporting quality than average, and surveyed researchers showed interest in using a reporting guideline to improve reporting quality. 
The leading guideline for improving reporting quality of systematic reviews is the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement. Here we unveil an extension of PRISMA to serve the meta-analysis community in ecology and evolutionary biology: PRISMA-EcoEvo (version 1.0). PRISMA-EcoEvo is a checklist of 27 main items that, when applicable, should be reported in systematic review and meta-analysis publications summarising primary research in ecology and evolutionary biology. In this explanation and elaboration document, we provide guidance for authors, reviewers, and editors, with explanations for each item on the checklist, including supplementary examples from published papers. Authors can consult this PRISMA-EcoEvo guideline both in the planning and writing stages of a systematic review and meta-analysis, to increase reporting quality of submitted manuscripts. Reviewers and editors can use the checklist to assess reporting quality in the manuscripts they review. Overall, PRISMA-EcoEvo is a resource for the ecology and evolutionary biology community to facilitate transparent and comprehensively reported systematic reviews and meta-analyses.
... Imputing missing data is considered a better alternative to removing incomplete plots, which can potentially result in systematic biases (Nakagawa & Freckleton, 2008). We used the mice function of the mice package in R-3.6.1, which relies on between-trait correlations using multi-variate imputation with chained equations to fill gaps. ...
Article
Full-text available
Aim European grassland communities are highly diverse, but patterns and drivers of continental‐scale diversities remain elusive. This study analyses taxonomic and functional richness in European grasslands along continental‐scale temperature and precipitation gradients. Location Europe. Methods We quantified functional and taxonomic richness of 55,748 vegetation plots. Six plant traits, related to resource acquisition and conservation, were analysed to describe plant community functional composition. Using a null‐model approach we derived functional richness effect sizes that indicate higher or lower diversity than expected given the taxonomic richness. We assessed the variation in absolute functional and taxonomic richness and in functional richness effect sizes along gradients of minimum temperature, temperature range, annual precipitation, and precipitation seasonality using a multiple general additive modelling approach. Results Functional and taxonomic richness was high at intermediate minimum temperatures and wide temperature ranges. Functional and taxonomic richness was low in correspondence with low minimum temperatures or narrow temperature ranges. Functional richness increased and taxonomic richness decreased at higher minimum temperatures and wide annual temperature ranges. Both functional and taxonomic richness decreased with increasing precipitation seasonality and showed a small increase at intermediate annual precipitation. Overall, effect sizes of functional richness were small. However, effect sizes indicated trait divergence at extremely low minimum temperatures and at low annual precipitation with extreme precipitation seasonality. Conclusions Functional and taxonomic richness of European grassland communities vary considerably over temperature and precipitation gradients. 
Overall, they follow similar patterns over the climate gradients, except at high minimum temperatures and wide temperature ranges, where functional richness increases and taxonomic richness decreases. This contrasting pattern may trigger new ideas for studies that target specific hypotheses focused on community assembly processes. And though effect sizes were small, they indicate that it may be important to consider climate seasonality in plant diversity studies.
... However, it causes bias in parameter estimation. Data imputation and data augmentation are broadly similar methods; in data imputation, however, missing observations are substituted with imputed values, whereas in data augmentation, parameter estimation is augmented by the information gained from assuming certain probability models for the observed data [52]. Data augmentation procedures include maximum likelihood (ML) and Bayesian inference. ...
Thesis
Full-text available
Breast cancer is a leading cause of premature mortality among women in the United States. Breast cancer screening tests can help with detecting breast cancer in early stages and thereby reducing the breast cancer mortality risk. However, due to the imperfect nature of screening tests, there are always some associated risks of overdiagnosis, false positives, and false negatives. Therefore, to improve breast cancer preventive care, we defined the focus of this dissertation on modeling breast cancer screening decisions. Breast cancer overdiagnosis is the first issue that is addressed in this dissertation. Although overdiagnosis is known to be the major risk inherent in mammography screening, there is currently no way to distinguish between overdiagnosed cancers and the ones that would cause problems over a patient’s lifetime. Overdiagnosis risk significantly depends on a patient’s compliance with screening recommendations. In Chapter 2, we use a stochastic framework to perform a harm-benefit analysis to compare the overdiagnosis risk with the benefits that breast cancer screening provides. In addition, we estimate the lifetime mortality risk of breast cancer while considering the overdiagnosis risk and the uncertainty in a patient’s adherence behavior. Our results show that, although the overdiagnosis rate is relatively high in breast cancer screening, the benefits of breast cancer mammography screening outweigh the overdiagnosis risk. The second issue addressed in this dissertation is false negative results caused by density of breast tissue. Breast density is known to increase breast cancer risk and decrease mammography screening sensitivity. Breast density notification laws require physicians to inform women with high breast density of these potential risks. The laws usually require healthcare providers to notify patients of the possibility of using more sensitive supplemental screening tests (e.g., ultrasound).
Since the enactment of the laws, there have been controversial debates over i) their implementation, due to potential radiologist bias in breast density classification of mammogram images, and ii) the necessity of supplemental screenings for all patients with high breast density. Breast density is a dynamic risk factor. Therefore, in the third chapter, we apply a hidden Markov model (HMM) to sparse, unbalanced longitudinal data to quantify the yearly progression of breast density based on Breast Imaging Reporting and Data System (BI-RADS) classifications. In Chapter 4, we use the results from the previous chapter to investigate the effectiveness of supplemental screening and the impact of radiologists’ bias on patients’ outcomes under the breast density notification law. We consider the conditional probability of eventually detecting breast cancer in early stages given that the patient develops breast cancer in her lifetime, and the expected number of supplemental tests, as patient outcomes. Our results indicate that referring patients to a supplemental test solely based on their breast density may not necessarily improve their health outcomes and other risk factors need to be considered when making such referrals. Additionally, average-skilled radiologists’ performances are shown to be comparable with the performance of a perfect radiologist.
... Indeed, statistical guarantees for commonly used, state-of-the-art methods for large biodiversity datasets usually assume an asymptotic regime, where the number of observations is large compared to the number of parameters. Yet, one acute issue in biodiversity monitoring schemes is the occurrence of a substantial amount of missing data (Harel & Zhou, 2006;Nakagawa & Freckleton, 2008;Wauchope et al., 2019), up to the point where the asymptotic assumption becomes obsolete. This is especially the case in areas where data collection is costly or logistically difficult to undertake, but where biodiversity is no less in need of monitoring (Stephenson, Bowles-Newark, et al., 2017;Tibshirani, 1996). ...
Article
1. In biodiversity monitoring, large datasets are becoming more and more widely available and are increasingly used globally to estimate species trends and conservation status. These large‐scale datasets challenge existing statistical analysis methods, many of which are not adapted to their size, incompleteness and heterogeneity. The development of scalable methods to impute missing data in incomplete large‐scale monitoring datasets is crucial to balance sampling in time or space and thus better inform conservation policies. 2. We developed a new method based on penalized Poisson models to impute and analyse incomplete monitoring data in a large‐scale framework. The method allows parameterization of (a) space and time factors, (b) the main effects of predictor covariates, as well as (c) space–time interactions. It also benefits from robust statistical and computational capability in large‐scale settings. 3. The method was tested extensively on both simulated and real‐life waterbird data, with the findings revealing that it outperforms 6 existing methods in terms of missing‐data imputation errors. Applying the method to 16 waterbird species, we estimated their long‐term trends for the first time at the entire North African scale, a region where monitoring data suffers from many gaps in space‐ and time‐series. 4. This new approach opens promising perspectives to increase the accuracy of species‐abundance trend estimations. We made it freely available in the R package ‘lori’ (https://CRAN.R‐project.org/package=lori) and recommend its use for large‐scale count data, particularly in citizen‐science monitoring programmes.
... The alternative is to use more advanced methods, such as multiple imputation [10] and likelihood estimation [11]. However, each method makes assumptions about the distribution of missing values [12], known as the missing data mechanism, and understanding this distribution is a key task before applying those methods [13]. ...
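The "missing data mechanism" mentioned in this excerpt is conventionally split into MCAR, MAR, and MNAR. A toy simulation (an illustrative sketch; all names are hypothetical) shows why the distinction matters for complete-case estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)      # fully observed covariate
y = x + rng.normal(size=n)  # variable that will contain gaps

# MCAR: missingness is unrelated to any variable.
m_mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the observed covariate x.
m_mar = rng.random(n) < np.where(x > 0, 0.5, 0.1)
# MNAR: missingness depends on the unobserved value of y itself.
m_mnar = rng.random(n) < np.where(y > 0, 0.5, 0.1)

mean_mcar = y[~m_mcar].mean()  # ~unbiased for E[y] = 0
mean_mar = y[~m_mar].mean()    # biased, but fixable by modelling x
mean_mnar = y[~m_mnar].mean()  # biased, unfixable from observed data alone
```

Under MCAR the complete-case mean is fine; under MAR it is biased, but methods that condition on `x` (multiple imputation, likelihood estimation) can recover it; under MNAR no observed-data method can, which is why diagnosing the mechanism comes first.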
... The first five eigenvectors represented on average c. 95% of the variance. Data imputation can contribute to reducing sampling bias in available data (Nakagawa & Freckleton, 2008). However, imputed trait values may vary in their uncertainty depending on the proportion of missing data and on the correlation strength between the considered traits (Madley-Dowd et al., 2019). ...
Article
Full-text available
Aim Our understanding of the biological strategies employed by species to cope with challenges posed by aridity is still limited. Despite being sensitive to water loss, bats successfully inhabit a wide range of arid lands. We here investigated how functional traits of bat assemblages vary along the global aridity gradient to identify traits that favour their persistence in arid environments. Location Global. Time period Contemporary. Major taxa studied Bats. Methods We mapped the assemblage‐level averages of four key bat traits describing wing morphology, echolocation and body size, based on a grid of 100‐km resolution and a pool of 915 bat species, and modelled them against aridity values. To support our results, we conducted analyses also at the species level to control for phylogenetic autocorrelation. Results At the assemblage level, we detected a rise in values of aspect ratio, wing loading and forearm length, and a decrease in echolocation frequency with increasing aridity. These patterns were consistent with trends detected at the species level for all traits. Main conclusions Our findings show that trait variation in bats is associated with the aridity gradient and suggest that greater mobility and larger body size are advantageous features in arid environments. Greater mobility favours bats’ ability to track patchy and temporary resources, while the reduced surface‐to‐volume ratio associated with a larger body size is likely to reduce water stress by limiting cutaneous evaporation. These findings highlight the importance of extending attention from species‐specific adaptations to broad scale and multispecies variation in traits when investigating the ability of species to withstand arid conditions.
... This may increase the rate of Type I errors, that is, detecting a difference in survival rates among individuals with different ages of first reproduction that does not exist. Instead of omitting individuals with incomplete life histories (i.e., those individuals with missing breeding attempts; e.g., Nakagawa and Freckleton 2008, Bouwhuis et al. 2010) or using the first observed breeder encounter of an individual as a proxy of the age of first reproduction, we treat the age of first reproduction as a latent individual-specific state, without state transition. Consequently, the multievent model correctly accounts for the uncertainty associated with the age of first reproduction as a result of imperfect detection, and propagates this uncertainty in the model to estimate parameter variances appropriately. ...
Article
Correlations between early‐ and late‐life performance are a major prediction of life‐history theory. Negative early‐late correlations can emerge because biological processes are optimized for early but not late life (e.g., rapid development may accelerate the onset of senescence; “developmental theory of ageing”) or because allocation to early life performance comes at a cost in terms of late‐life performance (as in the disposable soma theory). But variation in genetic and environmental challenges that each individual has to cope with during early life may also lead to positive early‐late life‐history trait correlations (the “fixed heterogeneity” or “individual quality” hypothesis). We analyzed individual life‐history trajectories of 7,420 known‐age female southern elephant seals (Mirounga leonina) monitored over 36 years to determine how actuarial senescence (a proxy for late‐life performance) correlates with age at first reproduction (a proxy for early‐life performance). As some breeding events may not be detected in this field study, we used a custom “multievent” hierarchical model to estimate the age at first reproduction and correlate it to other life‐history traits. The probability of first reproduction was 0.34 at age 3, with most females breeding for the first time at age 4, and comparatively few at older ages. Females with an early age of first reproduction outperformed delayed breeders in all aspects we considered (survival, rate of senescence, net reproductive output) but one: early breeders appeared to have an onset of actuarial senescence one year earlier compared to late breeders. Genetics and environmental conditions during early life likely explain the positive correlation between early‐ and late‐life performance. Our results provide the first evidence of actuarial senescence in female southern elephant seals.
... Deviance Information Criterion (DIC) (Nakagawa and Freckleton 2008). Thus, to address patchiness beyond basic filtering, we performed a simulation test (Fig. S3 & Fig. S4). ...
Preprint
Full-text available
Increasingly severe marine heatwaves under climate change threaten the persistence of many marine ecosystems. Mass coral bleaching events, caused by periods of anomalously warm sea surface temperatures (SST), have led to catastrophic levels of coral mortality globally. Remotely monitoring and forecasting such biotic responses to heat stress is key for effective marine ecosystem management. The Degree Heating Week (DHW) metric, designed to monitor coral bleaching risk, reflects the duration and intensity of heat stress events, and is computed by accumulating SST anomalies (HotSpot) relative to a stress threshold over a 12-week moving window. Despite significant improvements in the underlying SST datasets, corresponding revisions of the HotSpot threshold and accumulation window are still lacking. Here, we fine-tune the operational DHW algorithm to optimise coral bleaching predictions using the 5km satellite-based SSTs (CoralTemp v3.1) and a global coral bleaching dataset (37,871 observations, National Oceanic and Atmospheric Administration). After developing 234 test DHW algorithms with different combinations of HotSpot threshold and accumulation window, we compared their bleaching-prediction ability using spatiotemporal Bayesian hierarchical models and sensitivity-specificity analyses. Peak DHW performance was reached using HotSpot thresholds less than or equal to Maximum Monthly Mean SST and accumulation windows of 4 - 8 weeks. This new configuration correctly predicted up to an additional 310 bleaching observations compared to the operational DHW algorithm, an improved hit rate of 7.9 %. Given the detrimental impacts of marine heatwaves across ecosystems, heat stress algorithms could also be fine-tuned for other biological systems, improving scientific accuracy, and enabling ecosystem governance.
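The accumulation rule at the heart of the DHW metric described above (HotSpot anomalies at or above a threshold, summed over a moving window) can be sketched in a simplified weekly form. This is a toy illustration under stated assumptions, not the operational NOAA algorithm, which works from daily 5km SSTs; the function and data below are invented for illustration:

```python
import numpy as np

def degree_heating_weeks(sst, mmm, window=12, hotspot_threshold=1.0):
    """Sum weekly SST anomalies (HotSpots) above the climatological
    Maximum Monthly Mean (MMM) over a moving window, in degC-weeks.
    Only HotSpots >= hotspot_threshold contribute, mirroring the rule
    that small anomalies are not accumulated."""
    hotspot = np.maximum(np.asarray(sst, dtype=float) - mmm, 0.0)
    contrib = np.where(hotspot >= hotspot_threshold, hotspot, 0.0)
    # Moving-window sum: dhw[t] accumulates the last `window` weeks.
    return np.convolve(contrib, np.ones(window), mode="full")[: len(contrib)]

# 20 weeks of toy SST around an MMM of 28 degC,
# with a 6-week heat event of +2 degC.
sst = [28.0] * 8 + [30.0] * 6 + [28.0] * 6
dhw = degree_heating_weeks(sst, mmm=28.0)
peak = dhw.max()  # 6 weeks x 2 degC = 12 degC-weeks
```

Shrinking `window` or lowering `hotspot_threshold`, as in the abstract's 234 test configurations, changes which heat events cross a given bleaching-alert level.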
... Which of the potential future mothers will actually become mothers and be used to estimate C is directly determined by viability and fertility selections (Hadfield 2008). The influence of selection on heritability has already been recognized in quantitative genetics (Hadfield 2008, Nakagawa and Freckleton 2008, Steinsland et al. 2014), but here we managed to quantify how much it will influence the phenotypic correlation between maternal and offspring phenotypic traits. The large negative influence of selection in roe deer and Soay sheep on C may suggest that our measures of heritability are often biased. ...
Article
Full-text available
Phenotypic traits partly determine expected survival and reproduction and so have been used as the basis for demographic models of population dynamics. Within a population, the distribution of phenotypic traits depends upon their transmission from parents to offspring, yet we still have a limited understanding of the factors shaping phenotypic transmission in wild populations. Phenotypic transmission can be measured using the phenotypic parent‐offspring correlation (C), defined as the slope of the regression of offspring phenotypic trait on parental phenotypic trait, both traits measured at the same age, often at birth. This correlation reflects phenotypic variation due to both additive genetic effects and parental effects. Researchers seldom account for the possible influence of selection on estimates of the phenotypic parent‐offspring correlation. However, because individuals must grow, survive and reproduce before giving birth to offspring, these demographic processes might influence the phenotypic parent‐offspring correlation in addition to the inheritance process, the latter being the direct relationship between parental and offspring phenotypic traits when the parental trait is measured at age of reproduction while the offspring trait is measured at birth. Here we used a female‐based population model to study the relative effects of fertility and viability selections, trait ontogeny and inheritance on C. The relative influence of each demographic process is estimated by deriving the exact formulas for the proportional changes in C to changes in the parameters of integral projection models structured by age and phenotypic traits. We illustrate our method for two long lived species. We find that C can be strongly affected by both viability and fertility selections, mediated by growth and inheritance. 
Generally, demographic processes that result in mothers reproducing at similar phenotypic traits regardless of their birth traits, such as high fertility selection or converging developmental trajectories, lead to a decreased C. More generally, our models show how the age and phenotypic dependence of fertility and viability selections can influence phenotypic mother‐offspring correlation to a much larger extent than ontogeny and inheritance. Our results suggest that accounting for such dependence is needed to reliably model the distribution of offspring phenotypic traits and the eco‐evolutionary dynamics of phenotypic traits.
... Overall, just 20% of the parasite stages were complete for all host and parasite traits. To make full use of the data and limit potential biases, we imputed missing host and parasite traits (Nakagawa and Freckleton 2008). ...
Article
Full-text available
Parasitic worms (i.e. helminths) commonly infect multiple hosts in succession. With every transmission step, they risk not infecting the next host and thus dying before reproducing. Given this risk, what are the benefits of complex life cycles? Using a dataset for 973 species of trophically transmitted acanthocephalans, cestodes, and nematodes, we tested whether hosts at the start of a life cycle increase transmission and whether hosts at the end of a life cycle enable growth to larger, more fecund sizes. Helminths with longer life cycles, i.e. more successive hosts, infected conspicuously smaller first hosts, slightly larger final hosts, and exploited trophic links with lower predator-prey mass ratios. Smaller first hosts likely facilitate transmission because of their higher abundance and because parasite propagules were the size of their normal food. Bigger definitive hosts likely increase fecundity because parasites grew larger in big hosts, particularly endotherms. Helminths with long life cycles attained larger adult sizes through later maturation, not faster growth. Our results indicate that complex helminth life cycles are ubiquitous because growth and reproduction are highest in large, endothermic hosts that are typically only accessible via small intermediate hosts, i.e. the best hosts for growth and transmission are not the same.
... Importantly, missing mechanisms describe a specific dataset being used in a model or analysis, and are not characteristics of a complete dataset itself (Baraldi and Enders, 2010). Therefore, within a larger dataset and depending on which variables are included in the model, there may be independent analyses that meet assumptions for MCAR, MAR, and MNAR (Nakagawa and Freckleton, 2008). As we describe in sections below, while both MNAR and MAR violate assumptions of casewise deletion, LME in contrast remains unbiased when data are MAR. ...
Article
Full-text available
Event-related potentials (ERPs) are advantageous for investigating cognitive development. However, their application in infants/children is challenging given children’s difficulty in sitting through the multiple trials required in an ERP task. Thus, a large problem in developmental ERP research is high subject exclusion due to too few analyzable trials. Common analytic approaches (that involve averaging trials within subjects and excluding subjects with too few trials, as in ANOVA and linear regression) work around this problem, but do not mitigate it. Moreover, these practices can lead to inaccuracies in measuring neural signals. The greater the subject exclusion, the more problematic inaccuracies can be. We review recent developmental ERP studies to illustrate the prevalence of these issues. Critically, we demonstrate an alternative approach to ERP analysis—linear mixed effects (LME) modeling—which offers unique utility in developmental ERP research. We demonstrate with simulated and real ERP data from preschool children that commonly employed ANOVAs yield biased results that become more biased as subject exclusion increases. In contrast, LME models yield accurate, unbiased results even when subjects have low trial-counts, and are better able to detect real condition differences. We include tutorials and example code to facilitate LME analyses in future ERP research.
... Of the 69 eggs analysed, data on the concentration of lutein and zeaxanthin were missing for one and two eggs, respectively. We assigned the population average value of each antioxidant to those eggs with missing values [73]. ...
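Mean substitution like that in this excerpt is a single-imputation strategy and is simple to express; a minimal sketch with hypothetical yolk data (column names and values invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical antioxidant concentrations: one egg missing lutein,
# two missing zeaxanthin, as in the excerpt's situation.
eggs = pd.DataFrame({
    "lutein":     [10.2, 11.5, np.nan, 9.8, 10.9],
    "zeaxanthin": [3.1, np.nan, 2.8, np.nan, 3.4],
})

# Single (mean) imputation: replace each gap with the population average.
# Note this shrinks the apparent variance and can understate errors,
# which is why the source article generally favours multiple imputation.
imputed = eggs.fillna(eggs.mean())
```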
Article
Full-text available
Background In egg-laying animals, mothers can influence the developmental environment and thus the phenotype of their offspring by secreting various substances into the egg yolk. In birds, recent studies have demonstrated that different yolk substances can interactively affect offspring phenotype, but the implications of such effects for offspring fitness and phenotype in natural populations have remained unclear. We measured natural variation in the content of 31 yolk components known to shape offspring phenotypes including steroid hormones, antioxidants and fatty acids in eggs of free-living great tits (Parus major) during two breeding seasons. We tested for relationships between yolk component groupings and offspring fitness and phenotypes. Results Variation in hatchling and fledgling numbers was primarily explained by yolk fatty acids (including saturated, mono- and polyunsaturated fatty acids) - but not by androgen hormones and carotenoids, components previously considered to be major determinants of offspring phenotype. Fatty acids were also better predictors of variation in nestling oxidative status and size than androgens and carotenoids. Conclusions Our results suggest that fatty acids are important yolk substances that contribute to shaping offspring fitness and phenotype in free-living populations. Since polyunsaturated fatty acids cannot be produced de novo by the mother, but have to be obtained from the diet, these findings highlight potential mechanisms (e.g., weather, habitat quality, foraging ability) through which environmental variation may shape maternal effects and consequences for offspring. Our study represents an important first step towards unraveling interactive effects of multiple yolk substances on offspring fitness and phenotypes in free-living populations. It provides the basis for future experiments that will establish the pathways by which yolk components, singly and/or interactively, mediate maternal effects in natural populations.
... In the literature, there are various possible options to manage the problem of missing values. For example, Cortinovis, Vella, and Ndiku (1993) have suggested dropping observations with at least one missing value; however, this way of elimination might lower the sample size and reduce the validity of the results, particularly if the original sample size is small (Nakagawa & Freckleton, 2008). Variables were standardized to a zero mean and a standard deviation of one before the PCA estimation was carried out. ...
Preprint
Full-text available
Most studies measuring food security have used one or two of the dimensions of food security, with snapshot data at a particular point in time. Policies derived from such measurement might be misleading because of the dynamic nature of food security or insecurity in vulnerable populations. This paper presents a composite food security measure that captures the four dimensions of food security, i.e., availability, accessibility, utilization, and stability over time. Principal Component Analysis (PCA) is used to reduce the four dimensions into a single index. Data from three rounds of household-level panel data, collected by the Central Statistical Agency (CSA) of Ethiopia in collaboration with the World Bank, are used to demonstrate this measurement. The aggregate food security index results revealed that 44, 57, and 45 percent of households were food secure in 2011, 2013, and 2015 respectively. On the other hand, only 20 percent of households were food secure all the time, while 67 percent of households were termed transitory food insecure since they remained food insecure in at least one of the survey periods. The remaining 13 percent of households were termed chronically food insecure since they fell short of food throughout the study. The finding confirmed a high prevalence of multidimensionally food-insecure households in rural Ethiopia. Therefore, various food security intervention programs that enhance the four dimensions should be introduced.
... One of the challenges in a long-term study is missing counts, which create uncertainty in ecological models (Atkinson et al., 2006; Franklin, 1989; Lindenmayer & Likens, 2009; Russell et al., 2002). Many studies have suggested different statistical models, such as kNN and MICE, to overcome missing data problems (Nakagawa and Freckleton, 2008; Penone et al., 2014). Some of these models were also developed on bird populations (Penone et al., 2014). ...
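The chained-equations idea behind MICE can be reduced to a deterministic single-chain sketch (real MICE adds stochastic draws and produces several completed datasets; the function and data here are purely illustrative):

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Minimal single-chain sketch of imputation by chained equations:
    initialise gaps with column means, then cycle through columns,
    regressing each on the others and refreshing its missing entries."""
    X = X.astype(float)
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not mask[:, j].any():
                continue  # no gaps in this column
            # Regress column j on all other (currently filled) columns.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            obs = ~mask[:, j]
            beta, *_ = np.linalg.lstsq(A[obs], filled[obs, j], rcond=None)
            # Refresh only the missing entries with the predictions.
            filled[mask[:, j], j] = A[mask[:, j]] @ beta
    return filled

# Toy data with one gap; the second column is exactly twice the first.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])
out = chained_imputation(X)
```

Because the columns are perfectly related in this toy matrix, the chain settles on 6.0 for the gap after a single sweep; on real survey counts, convergence takes several iterations and multiple imputations are drawn.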
Article
Full-text available
Breeding populations of birds are dynamic and are affected by multiple factors including climate and local environmental conditions. However, understanding such relations often requires long-term data modelling. Such long-term population data are either lacking or have gaps. This study demonstrates the use of Multiple Imputation by Chained Equations (MICE) to overcome the problem of missing data in a population census. This is also the first comprehensive study modelling the 36-year (1980-2015) long-term breeding population data of a near-threatened bird, the Painted Stork, from Keoladeo National Park, India. It tests the effect of local water availability, i.e., water released to the park, and regional rainfall, i.e., climatic conditions, on the breeding population using a Generalised Additive Model (GAM). Both imputation- and observed-data-series-based GAM models identified local water availability as the most important factor influencing the breeding population of the Painted Stork. More than 80% population decline was observed, despite a slight increase in rainfall at the regional scale, suggesting that local hydrological conditions, not climate, limit the breeding population. According to visual assessment of the GAM partial plots, a minimum of 200-300 million cubic feet of water is needed each nesting season to sustain the breeding population. Post-1989, the breeding population was unable to match the long-term mean (~726) except in 1992, 1995, and 1996. The maximum decline was observed between 2000 and 2009, a decade of frequent droughts. The breeding population was stable until the end of this study, but it was far below the long-term mean.
... Some courses that end up high on a recommended list for a given learner, but that this learner did not see or did not have time for, would not be relevant. Dealing with this well-known problem of missing-not-at-random interactions is an open problem in the recommendation landscape (Nakagawa & Freckleton, 2008). ...
Article
Full-text available
Online education platforms play an increasingly important role in mediating the success of individuals’ careers. Therefore, while building overlying content recommendation services, it becomes essential to guarantee that learners are provided with equal recommended learning opportunities, according to the platform principles, context, and pedagogy. Though the importance of ensuring equality of learning opportunities has been well investigated in traditional institutions, how this equality can be operationalized in online learning ecosystems through recommender systems is still under-explored. In this paper, we shape a blueprint of the decisions and processes to be considered in the context of equality of recommended learning opportunities, based on principles that need to be empirically-validated (no evaluation with live learners has been performed). To this end, we first provide a formalization of educational principles that model recommendations’ learning properties, and a novel fairness metric that combines them to monitor the equality of recommended learning opportunities among learners. Then, we envision a scenario wherein an educational platform should be arranged in such a way that the generated recommendations meet each principle to a certain degree for all learners, constrained to their individual preferences. Under this view, we explore the learning opportunities provided by recommender systems in a course platform, uncovering systematic inequalities. To reduce this effect, we propose a novel post-processing approach that balances personalization and equality of recommended opportunities. Experiments show that our approach leads to higher equality, with a negligible loss in personalization. This paper provides a theoretical foundation for future studies of learners’ preferences and limits concerning the equality of recommended learning opportunities.
... This method is simple to operate, but it easily produces estimation bias, and simple deletion can lose a great deal of information and affect the accuracy of prediction results (Hooke et al. 2021; Nakagawa et al., 2008; Qiu et al. 2008). Therefore, researchers usually do not use such a method when preprocessing missing data, except in cases where the missing proportion is very low. ...
Article
Full-text available
Data loss has become an inevitable phenomenon in corporate credit risk (CCR) prediction. To ensure the integrity of data information for subsequent analysis and prediction, it is essential to repair the missing data as accurately as possible. To solve the problem of missing data in credit classification, this study proposes a multi-layer perceptron ensemble (MLP-ESM) model that can perform data interpolation and prediction simultaneously to predict CCR. The model makes full use of non-missing information and interpolates more missing columns with fewer missing values. In this way, not only the data features needed for missing data interpolation are extracted, but also the structural relationship features between the predicted target and the existing data are extracted, which can achieve the effect of simultaneous interpolation and prediction. The results show that the MLP-ESM model can effectively interpolate missing values in the CCR dataset and predict CCR. The prediction accuracy is 83.11%, which is better than traditional machine learning models. This shows that the interpolated dataset can achieve better prediction performance.
... To fill the missing trait value gaps, we used a phylogenetically informed trait imputation (missForest R package, Stekhoven & Buhlmann, 2012). Overall, imputation is preferred over omitting species with missing data (Nakagawa & Freckleton, 2008). However, to test the imputation method's potential influence on our results, we repeated all tests using only the subset of data with full trait information, and the results were consistent (Appendix S3: Table S10). ...
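missForest is an R package built on random-forest regression; a commonly used Python analogue (an assumption of this sketch, not what the authors ran, and without the phylogenetic information they added) is scikit-learn's IterativeImputer with a forest estimator. The trait matrix below is invented for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# Toy trait matrix: rows are species, columns are traits; one gap.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.0, 1.0],
    [3.0, np.nan, 1.5],
    [4.0, 8.0, 2.0],
    [5.0, 10.0, 2.5],
])

# Iterative imputation with a random-forest learner, in the spirit of
# missForest: each variable with gaps is predicted from the others.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = imputer.fit_transform(X)
```

As the excerpt recommends, a robustness check would rerun the downstream tests on the complete-case subset and compare.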
Article
Full-text available
Aim Non‐native species threaten ecosystems worldwide, but we poorly know why some species invade more. Functional traits, residence time and native range size have been often used as invasion predictors. Here, we advance in the field by linking invasion success to native range parameters derived from dark diversity – a set of species present in the surrounding region that are absent in a study location even if ecological conditions are suitable. We tested whether those parameters improve the description of species success outside their native range. Location North America; Europe and Mediterranean Basin. Methods For 170 tree species native in one and non‐native in another region, we defined their invasion success as the number of locations occupied at the non‐native range. The probabilistic dark diversity was estimated based on the species co‐occurrences in their native ranges. It specifies how suitable is a species for a location, even if the species is absent. We calculated two parameters: sum of native location suitabilities (niche breadth proxy) and dark diversity probability (how often a species is absent from suitable locations, indicating niche realization limitations). We explored whether models including the dark diversity parameters performed better than one with a common species range measure, the number of locations occupied. We accomplished our models by adding functional traits, residence time and invasion direction. Results Invasion success increased with the sum of native location suitabilities and decreased with dark diversity probability. This model with dark diversity parameters outperformed an alternative using the number of native locations occupied. Our best model included invasion direction, functional traits (including mycorrhizal status) and residence time, but dark diversity parameters remained important predictors. 
Main conclusions The dark diversity parameters can contribute to invasion ecology by linking the species performance in the non‐native range to its niches parameters, derived from the native range.
... Although there exist probabilistic models able to handle missing observations, these are scarce, strongly tailored for specific analyses, and consequently their use is limited and not a standard procedure [18]. Instead, the usual way researchers proceed in such cases is to reduce the sample size by removing individuals missing certain data variables, resulting in a decrease of statistical power for any further analyses [19]. This problem becomes most notable when performing multivariate analyses involving multiple variables [20], [21], for example classification or clustering, since the number of individuals available in any such analyses will be limited by the simultaneous availability of several clinical measures, reducing the sample size even further. ...
Preprint
Full-text available
An increasing number of large-scale multi-modal research initiatives have been conducted in the typically developing population, as well as in psychiatric cohorts. Missing data is a common problem in such datasets due to the difficulty of assessing multiple measures on a large number of participants. The consequences of missing data accumulate when researchers aim to explore relationships between multiple measures. Here we aim to evaluate different imputation strategies to fill in missing values in clinical data from a large (total N=764) and deeply characterised (i.e. a range of clinical and cognitive instruments administered) sample of N=453 autistic individuals and N=311 control individuals recruited as part of the EU-AIMS Longitudinal European Autism Project (LEAP) consortium. In particular, we consider a total of 160 clinical measures divided into 15 overlapping subsets of participants. We use two simple but common univariate strategies, mean and median imputation, as well as a Round Robin regression approach involving four independent multivariate regression models: a linear model, Bayesian Ridge regression, and several non-linear models (Decision Trees, Extra Trees, and K-Neighbours regression). We evaluate the models using the traditional mean squared error with respect to removed available data, and consider in addition the KL divergence between the observed and the imputed distributions. We show that all of the multivariate approaches tested provide a substantial improvement compared to typical univariate approaches. Further, our analyses reveal that across all 15 data-subsets tested, an Extra Trees regression approach provided the best global results. This allows the selection of a unique model to impute missing data for the LEAP project and deliver a fixed set of imputed clinical data to be used by researchers working with the LEAP dataset in the future.
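The evaluation protocol in this abstract (remove known values, impute, score against the held-out truth) can be sketched for the simplest univariate-versus-multivariate comparison; the data and names below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Two correlated "clinical measures"; we hide some values of the
# second and score imputations against the held-out truth (MSE).
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=0.5, size=n)

hidden = rng.random(n) < 0.2
truth = y[hidden]

# Univariate strategy: mean imputation ignores x entirely.
mean_imp = np.full(hidden.sum(), y[~hidden].mean())

# Multivariate strategy: regress y on x using observed rows,
# then predict the hidden entries.
A = np.column_stack([np.ones((~hidden).sum()), x[~hidden]])
beta, *_ = np.linalg.lstsq(A, y[~hidden], rcond=None)
reg_imp = beta[0] + beta[1] * x[hidden]

mse_mean = np.mean((mean_imp - truth) ** 2)
mse_reg = np.mean((reg_imp - truth) ** 2)
```

Scoring imputations this way makes the abstract's headline result concrete: any strategy that exploits a correlated measure beats the univariate mean by a wide margin.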
Thesis
Full-text available
Conservation in the European Alps is a central topic, as this region is one of the largest natural areas in Europe, and a center of plant diversity and high endemism. The European Alps host 4'485 vascular plant species – more than a third of the flora recorded in Western Europe – with around 400 endemic species. Alpine and mountain ecosystems are generally known to be particularly vulnerable to climate changes, with many species expected to migrate upward and a risk of extinction for cold-adapted alpine plants unable to disperse further. Consequently, research and conservation studies on plant taxonomic, phylogenetic and functional diversity of the European Alps have increased lately. However, a comprehensive study assessing areas of conservation priority for the multifaceted diversity and uniqueness of this remarkable region under climate and land use change is still missing. Conservation prioritization and planning generally needs to implement species distribution maps as inputs, and species distribution model (SDM) outputs are increasingly recommended due to their finer distribution and description of the species' ecological niche. SDMs are invaluable tools for conservation purposes and predict the potential suitable areas of species in space by relating their observations to environmental and/or anthropogenic predictors. These models bear decisive information for improved species/biodiversity conservation and management, provided that good modelling practices are followed. These practices may severely impact SDM outputs if not followed, and include, among others, the choice of relevant predictor variables and the comprehensive sampling of the observational dataset; two practices strongly driven by the increasing availability of presence-only records and environmental predictors.
Their growing number opens novel questions about how to correct for sampling bias more efficiently in large observational datasets, and how novel categories of environmental predictors influence plant species distributions in mountain environments. Exploring these effects beforehand is needed to ensure sound model outputs and current/future conservation prioritization of the European Alps. This thesis aimed at providing the first comprehensive conservation assessment of the plant multifaceted diversity/uniqueness of the European Alps under global change. For this, this thesis developed novel methods to correct for the observer bias of large observational datasets in presence-only SDMs (Chapter 1), assessed the influence of climate, soil, and land cover on the plant distribution of the European Alps along different elevation gradients (Chapter 2), evaluated the spatial consequences of predictor resolutions on modelling the plant multifaceted diversity of the region (Chapter 3), and, based on previous findings, predicted and proposed key areas of conservation for the plant multifaceted diversity and uniqueness of the European Alps (Chapter 4).
Article
Model-based approaches that attempt to delimit species are hampered by computational limitations as well as the unfortunate tendency by users to disregard algorithmic assumptions. Alternatives are clearly needed, and machine-learning (M-L) is attractive in this regard as it functions without the need to explicitly define a species concept. Unfortunately, its performance will vary according to which (of several) bioinformatic parameters are invoked. Herein, we gauge the effectiveness of M-L-based species-delimitation algorithms by parsing 64 variably-filtered versions of a ddRAD-derived SNP dataset collected from North American box turtles (Terrapene spp.). Our filtering strategies included: (A) minor allele frequencies (MAF) of 5%, 3%, 1%, and 0% (=none), and (B) maximum missing data per-individual/per-population at 25%, 50%, 75%, and 100% (=no filtering). We found that species-delimitation via unsupervised M-L impacted the signal-to-noise ratio in our data, as well as the discordance among resolved clades. The latter may also reflect biogeographic history, gene flow, incomplete lineage sorting, or combinations thereof (as corroborated from previously observed patterns of differential introgression). Our results substantiate M-L as a viable species-delimitation method, but also demonstrate how commonly observed patterns of phylogenetic discordance can seriously impact M-L classification.
Article
The Campanian–Maastrichtian Konzentrat-Lagerstätte at Lo Hueco (Spain) has yielded a large sample of appendicular elements referable to sauropod titanosaurs that exhibit considerable morphological variability. The taxonomic assessment of these elements is difficult: they display various degrees of preservation (including fragmentary preservation and taphonomic deformation, each of which was addressed in the methods applied in this study), and many of the appendicular elements are found as isolated remains. Accurate taxonomic determinations are generally difficult to achieve from the morphological information available in appendicular remains of sauropods, particularly in groups such as titanosauriforms. In these cases, the geometric morphometrics (GMM) toolkit is a suitable methodology for exploring morphological variability for taxonomic assessment. In this study, several femora, tibiae and fibulae from Lo Hueco were analyzed. For this purpose, the remains were digitized, and the resulting mesh representations were used to define 3D landmarks and curves that were analyzed through GMM methods. The quantification of shape variables allowed the use of clustering and K-means techniques as well as a proposed statistical workflow. Several hypotheses of a priori anatomically-defined morphotypes and the results from our machine learning algorithms revealed the presence of two main morphotypes as the most plausible explanation for the morphological variability in the sample of titanosaurian hind limb elements from Lo Hueco.
Article
Mercury is a pervasive environmental contaminant that can negatively impact seabirds. Here, we measure total mercury (THg) concentrations in red blood cells (RBCs) from breeding brown skuas (Stercorarius antarcticus) (n = 49) at Esperanza/Hope Bay, Antarctic Peninsula. The aims of this study were to: (i) analyse RBCs THg concentrations in relation to sex, year and stable isotope values of carbon (δ¹³C) and nitrogen (δ¹⁵N); and (ii) examine correlations between THg, body condition and breeding success. RBC THg concentrations were positively correlated with δ¹⁵N, which is a proxy of trophic position, and hence likely reflects the biomagnification process. Levels of Hg contamination differed between our study years, which is likely related to changes in diet and distribution. RBC THg concentrations were not related to body condition or breeding success, suggesting that Hg contamination is currently not a major conservation concern for this population.
Article
Full-text available
Strong trophic interactions link primary producers (phytoplankton) and consumers (zooplankton) in lakes. However, the influence of such interactions on the biogeographical distribution of the taxa and functional traits of planktonic organisms in lakes has never been explicitly tested. To better understand the spatial distribution of these two major aquatic groups, we related the composition of each group across boreal lakes (104 for zooplankton and 48 for phytoplankton) to a common suite of environmental and spatial factors. We then directly tested the degree of coupling in their taxonomic and functional distributions across the subset of common lakes. Although phytoplankton composition responded mainly to properties related to water quality while zooplankton composition responded more strongly to lake morphometry, we found significant coupling between their spatial distributions at taxonomic and functional levels based on a Procrustes test. This coupling was not significant after removing the effect of environmental drivers (water quality and morphometry) on the spatial distributions of the two groups. This result suggests that top-down and bottom-up effects (e.g. nutrient concentration and predation) drove trophic interactions at the landscape level. We also found a significant effect of dispersal limitation on the distribution of taxa, which could explain why coupling was stronger for taxa than for traits at the scale of this study, with a turnover of species observed between regions, but no trait turnover. Our results indicate that landscape pelagic food web responses to anthropogenic changes in ecosystem parameters should be driven by a combination of top-down and bottom-up factors for taxonomic composition, but with relative resilience in the functional trait composition of lake communities.
Article
Methods: A survey of 136 articles published in 2019 (sampled at random) was conducted to determine whether a statement about missing data was included. Results: The proportion of studies reporting on missing data was low, at 11.0% (95% confidence interval = 6.3% to 17.5%). Recommendations: We recommend that researchers describe the number and percentage of missing values, including when there are no missing values. Exploratory analysis should be conducted to explore missing values, and visualisations describing missingness overall should be provided in the paper, or at least in supplementary materials. Missing values should almost always be imputed, and imputation methods should be explored to ensure they are appropriately representative. Researchers should consider these recommendations and pay greater attention to missing data and its influence on research results.
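The reporting recommendation above (per-variable counts and percentages of missing values, reported even when zero) is straightforward to implement; a minimal pandas sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd

def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column,
    reported even for columns with no missing values."""
    return pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (100 * df.isna().mean()).round(2),
    })

# Hypothetical example dataset with one missing value.
df = pd.DataFrame({"age": [34, np.nan, 29, 41], "score": [7, 9, 8, 6]})
print(missingness_report(df))
```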
Preprint
Full-text available
The widespread use of species traits to infer community assembly mechanisms or to link species to ecosystem functions has led to an exponential increase in functional diversity analyses, with >10,000 papers published in 2010–2019, and >1,500 papers only in 2020. This interest is reflected in the development of a multitude of theoretical and methodological frameworks for calculating functional diversity, making it challenging to navigate the myriad options and to report sufficient detail to reproduce a trait-based analysis. Therefore, the study of functional diversity would benefit from the existence of a general guideline for standard reporting and good practices in this discipline. We provide such a guideline by streamlining available terminology, concepts, and methods, with the overarching goal of increasing reproducibility, transparency and comparability across studies. The protocol is based on the following key elements: identification of a research question, a sampling scheme and a study design, assemblage of community and trait data matrices, data exploration and preprocessing, functional diversity computation, model fitting, evaluation and interpretation, and data, metadata and code provision. Throughout the protocol, we provide information on how to best select research questions and study designs, and discuss ways to ensure reproducibility in reporting results. To facilitate the implementation of this protocol, we further developed an interactive web-based application (stepFD) in the form of a checklist workflow, detailing all the steps of the protocol and providing tabular and graphical outputs that can be merged to produce a final report. The protocol streamlined here is expected to promote the description of functional diversity analyses in sufficient detail to ensure full transparency and reproducibility.
A thorough reporting of functional diversity analyses ensures that ecologists can incorporate others’ findings into meta-analyses, the shared data can be integrated into larger databases for consensus analyses, and available code can be reused by other researchers. All these elements are key to push forward this vibrant and fast-growing field of research.
Chapter
In the next 30 years, Alzheimer’s disease cases are predicted to drastically increase. Consequently, there is a critical need for research that can counteract the increasing number of Alzheimer’s disease patients. However, current methods of Alzheimer’s disease research have significant limitations. For example, Alzheimer’s disease research is often restricted by resource, temporal, and recruitment barriers (e.g., participant dropout). Unlike standard research, big data analysis is excellent at investigating complex long-term phenomena such as Alzheimer’s disease. Big data methods can also overcome many of the limitations that restrict Alzheimer’s disease research. Accordingly, researchers are turning to big data methods to study Alzheimer’s disease. In this chapter, we outline the applications of big data to Alzheimer’s disease research as well as common methods used to collect and analyze big data. We also explore how big data research could be used to treat, diagnose, and understand Alzheimer’s disease. Accordingly, we aim to provide a general understanding of big data methods in Alzheimer’s disease research and highlight the advantages of big data analysis over standard dementia research.
Thesis
Full-text available
ABSTRACT Studies assessing the grammatical knowledge of speakers of Spanish as a heritage language have largely focused on the Spanish subjunctive mood and have concluded, almost unanimously, that heritage speakers’ knowledge of the Spanish subjunctive is non-native-like and subject to incomplete acquisition. However, there is also evidence that while different, heritage speakers’ linguistic knowledge is by no means deficient. The goal of the present dissertation is to achieve a holistic understanding of the nature of grammars of heritage speakers and to contribute to a theory of processing in heritage language contexts that has greater explanatory adequacy. To this end, this dissertation examines knowledge of the Spanish subjunctive in heritage speakers who live in a long-standing bilingual community in Albuquerque, New Mexico, in comparison to a group of Spanish-dominant bilinguals born in Mexico. The dissertation sets out to provide (1) an evidence-based characterization of heritage speakers by using a sociolinguistic questionnaire which, along with PCA, examined the language-experience related factors that best explain the variability in the processing of the subjunctive mood in this population; and (2) an examination of heritage speakers’ and Spanish-dominant bilinguals’ processing of the Spanish subjunctive during online comprehension and production by means of psycholinguistic experiments that integrated corpus data into their design. Results indicated that, both in comprehension and production, the current group of heritage speakers was sensitive to the lexical and structural conditioning of mood selection, and that the performance of heritage speakers and Spanish-dominant bilinguals converged on the same results and trends. 
All participants showed nuanced knowledge of the morphosyntactic factors that modulate the conditioning of mood selection, as suggested by the fact that linguistic factors such as frequency and proficiency also modulated their sensitivity. In addition, based on the PCA conducted, the role of three sociolinguistic variables was examined: use of the heritage language, language entropy, and identification with the heritage language. As predicted, results indicated that sensitivity to the lexical and structural conditioning of mood selection was greater for heritage speakers who: (1) used the heritage language more often on average, (2) used the heritage language in more diverse contexts, and (3) felt more identified with the heritage language. The findings highlight that factors such as the community examined and the ecological validity of the materials used are crucial. In addition, they underscore the importance of triangulating both comprehension and production experimental data, and employing multiple explanatory variables for a more comprehensive approach to complex and highly variable systems such as heritage grammars.
Article
Full-text available
Missing data are a recurring problem that can cause bias or lead to inefficient analyses. Development of statistical methods to address missingness has been actively pursued in recent years, including imputation, likelihood and weighting approaches. Each approach is more complicated when there are many patterns of missing values, or when both categorical and continuous random variables are involved. Implementations of routines to incorporate observations with incomplete variables in regression models are now widely available. We review these routines in the context of a motivating example from a large health services research dataset. While there are still limitations to the current implementations, and additional efforts are required of the analyst, it is feasible to incorporate partially observed values, and these methods should be utilized in practice.
Article
Full-text available
The understanding of the dynamics of animal populations and of related ecological and evolutionary issues frequently depends on a direct analysis of life history parameters. For instance, examination of trade-offs between reproduction and survival usually rely on individually marked animals, for which the exact time of death is most often unknown, because marked individuals cannot be followed closely through time. Thus, the quantitative analysis of survival studies and experiments must be based on capture-recapture (or resighting) models which consider, besides the parameters of primary interest, recapture or resighting rates that are nuisance parameters. Capture-recapture models oriented to estimation of survival rates are the result of a recent change in emphasis from earlier approaches in which population size was the most important parameter, survival rates having been first introduced as nuisance parameters. This emphasis on survival rates in capture-recapture models developed rapidly in the 1980s and used as a basic structure the Cormack-Jolly-Seber survival model applied to an homogeneous group of animals, with various kinds of constraints on the model parameters. These approaches are conditional on first captures; hence they do not attempt to model the initial capture of unmarked animals as functions of population abundance in addition to survival and capture probabilities. This paper synthesizes, using a common framework, these recent developments together with new ones, with an emphasis on flexibility in modeling, model selection, and the analysis of multiple data sets. The effects on survival and capture rates of time, age, and categorical variables characterizing the individuals (e.g., sex) can be considered, as well as interactions between such effects. This "analysis of variance" philosophy emphasizes the structure of the survival and capture process rather than the technical characteristics of any particular model.
The flexible array of models encompassed in this synthesis uses a common notation. As a result of the great level of flexibility and relevance achieved, the focus is changed from fitting a particular model to model building and model selection. The following procedure is recommended: (1) start from a global model compatible with the biology of the species studied and with the design of the study, and assess its fit; (2) select a more parsimonious model using Akaike's Information Criterion to limit the number of formal tests; (3) test for the most important biological questions by comparing this model with neighboring ones using likelihood ratio tests; and (4) obtain maximum likelihood estimates of model parameters with estimates of precision. Computer software is critical, as few of the models now available have parameter estimators that are in closed form. A comprehensive table of existing computer software is provided. We used RELEASE for data summary and goodness-of-fit tests and SURGE for iterative model fitting and the computation of likelihood ratio tests. Five increasingly complex examples are given to illustrate the theory. The first, using two data sets on the European Dipper (Cinclus cinclus), tests for sex-specific parameters, explores a model with time-dependent survival rates, and finally uses a priori information to model survival allowing for an environmental variable. The second uses data on two colonies of the Swift (Apus apus), and shows how interaction terms can be modeled and assessed and how survival and recapture rates sometimes partly counterbalance each other. The third shows complex variation in survival rates across sexes and age classes in the roe deer (Capreolus capreolus), with a test of density dependence in annual survival rates. The fourth is an example of experimental density manipulation using the common lizard (Lacerta vivipara). 
The last example attempts to examine a large and complex data set on the Greater Flamingo (Phoenicopterus ruber), where parameters are age specific, survival is a function of an environmental variable, and an age × year interaction term is important. Heterogeneity seems present in this example and cannot be adequately modeled with existing theory. The discussion presents a summary of the paradigm we recommend and details issues in model selection and design, and foreseeable future developments.
Article
Full-text available
Recent attempts to explain the susceptibility of vertebrates to declines worldwide have largely focused on intrinsic factors such as body size, reproductive potential, ecological specialization, geographical range and phylogenetic longevity. Here, we use a database of 145 Australian marsupial species to test the effects of both intrinsic and extrinsic factors in a multivariate comparative approach. We model five intrinsic (body size, habitat specialization, diet, reproductive rate and range size) and four extrinsic (climate and range overlap with introduced foxes, sheep and rabbits) factors. We use quantitative measures of geographical range contraction as indices of decline. We also develop a new modelling approach of phylogenetically independent contrasts combined with imputation of missing values to deal simultaneously with phylogenetic structuring and missing data. One extrinsic variable, geographical range overlap with sheep, was the only consistent predictor of declines. Habitat specialization was independently but less consistently associated with declines. This suggests that extrinsic factors largely determine interspecific variation in extinction risk among Australian marsupials, and that the intrinsic factors that are consistently associated with extinction risk in other vertebrates are less important in this group. We conclude that recent anthropogenic changes have been profound enough to affect species on a continent-wide scale, regardless of their intrinsic biology.
Article
Full-text available
Determining whether speciation and extinction rates depend on the state of a particular character has been of long-standing interest to evolutionary biologists. To assess the effect of a character on diversification rates using likelihood methods requires that we be able to calculate the probability that a group of extant species would have evolved as observed, given a particular model of the character's effect. Here we describe how to calculate this probability for a phylogenetic tree and a two-state (binary) character under a simple model of evolution (the "BiSSE" model, binary-state speciation and extinction). The model involves six parameters, specifying two speciation rates (rate when the lineage is in state 0; rate when in state 1), two extinction rates (when in state 0; when in state 1), and two rates of character state change (from 0 to 1, and from 1 to 0). Using these probability calculations, we can do maximum likelihood inference to estimate the model's parameters and perform hypothesis tests (e.g., is the rate of speciation elevated for one character state over the other?). We demonstrate the application of the method using simulated data with known parameter values.
Article
When making sampling distribution inferences about the parameter of the data, theta, it is appropriate to ignore the process that causes missing data if the missing data are 'missing at random' and the observed data are 'observed at random', but these inferences are generally conditional on the observed pattern of missing data. When making direct likelihood or Bayesian inferences about theta, it is appropriate to ignore the process that causes missing data if the missing data are missing at random and the parameter of the missing data process is 'distinct' from theta. These conditions are the weakest general conditions under which ignoring the process that causes missing data always leads to correct inferences.
Article
The idea of data augmentation arises naturally in missing value problems, as exemplified by the standard ways of filling in missing cells in balanced two-way tables. Thus data augmentation refers to a scheme of augmenting the observed data so as to make it more easy to analyze. This device is used to great advantage by the EM algorithm (Dempster, Laird, and Rubin 1977) in solving maximum likelihood problems. In situations when the likelihood cannot be approximated closely by the normal likelihood, maximum likelihood estimates and the associated standard errors cannot be relied upon to make valid inferential statements. From the Bayesian point of view, one must now calculate the posterior distribution of parameters of interest. If data augmentation can be used in the calculation of the maximum likelihood estimate, then in the same cases one ought to be able to use it in the computation of the posterior distribution. It is the purpose of this article to explain how this can be done.
Article
Statistical procedures for missing data have vastly improved, yet misconception and unsound practice still abound. The authors frame the missing-data problem, review methods, offer advice, and raise issues that remain unresolved. They clear up common misunderstandings regarding the missing at random (MAR) concept. They summarize the evidence against older procedures and, with few exceptions, discourage their use. They present, in both technical and practical language, 2 general approaches that come highly recommended: maximum likelihood (ML) and Bayesian multiple imputation (MI). Newer developments are discussed, including some for dealing with missing data that are not MAR. Although not yet in the mainstream, these procedures may eventually extend the ML and MI methods that currently represent the state of the art.
Article
Population sample data are complex; inference and prediction require proper accommodation of not only the nonlinear interactions that determine the expected future abundance, but also the stochasticity inherent in data and variable (often unobserved) environmental factors. Moreover, censuses may occur sporadically, and observation errors change with sample methods and effort. The state variable (usually density or abundance) may be hidden from view and known only through highly indirect observational schemes (such as public health records, hunting reports, or fossil/archeological surveys). We extend the basic state-space model for time-series analysis to accommodate these dominant sources of variability that influence population data. Using examples, we show how different types of process error and observation error, unequal sample intervals, and missing values can be accounted for within the flexible framework of Bayesian state-space models. We provide algorithms based on Gibbs sampling that can be used to obtain posterior estimates of population states and of model parameters. For models that can be linearized, results can be used for direct sampling of the posterior, including those with missing values and unequal sample intervals. For nonlinear models, we make use of Metropolis-Hastings within the Gibbs sampling framework. Examples derive from long-term census and population data. We illustrate the extension to discrete state variables with multiple stages using a Time-series Susceptible–Infected–Recovered (TSIR) model for mid 20th-century measles infection in London, where birth rates are assumed known, but susceptibles and infected individuals arise from imperfect reporting.
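The handling of missing census values inside a state-space model can be illustrated with the simplest case, a Gaussian local-level model, where a missing observation just skips the measurement update so the state is propagated through the gap with inflated uncertainty. This is a simplified deterministic sketch, not the Gibbs/Metropolis machinery described above, and the variances `q` and `r` are assumed known:

```python
import numpy as np

def kalman_filter_local_level(y, q, r, m0=0.0, p0=1e6):
    """Kalman filter for a local-level (random walk plus noise) model.
    A missing observation (np.nan) skips the measurement update, so the
    state estimate is carried through the gap with inflated variance."""
    m, p = m0, p0
    means, variances = [], []
    for obs in y:
        p = p + q                      # predict: random-walk state transition
        if not np.isnan(obs):
            k = p / (p + r)            # Kalman gain
            m = m + k * (obs - m)      # measurement update
            p = (1.0 - k) * p
        means.append(m)
        variances.append(p)
    return np.array(means), np.array(variances)

y = np.array([1.0, np.nan, 3.0])       # one missing census
means, variances = kalman_filter_local_level(y, q=0.1, r=1.0)
print(means, variances)
```

At the missing time step the filtered mean is unchanged while its variance grows by the process variance, which is exactly the behaviour the abstract describes for sporadic censuses.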
Article
A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
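A minimal worked instance of this algorithm is EM for a two-component Gaussian mixture, where the unobserved component labels play the role of the incomplete data; an illustrative sketch, not code from the paper:

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture; the unobserved
    component labels are the 'missing' part of the complete data."""
    mu = np.quantile(x, [0.25, 0.75])            # crude initialisation
    sigma = np.array([x.std(), x.std()])
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (
            sigma * np.sqrt(2 * np.pi))
        resp = weight * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted parameter updates
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        weight = nk / len(x)
    return weight, mu, sigma

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])
weight, mu, sigma = em_gmm_1d(x)
print(weight, mu, sigma)
```

Each iteration provably increases the observed-data likelihood, which is the monotone behaviour established in the paper.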
Article
Chapter contents: Introduction; General Conditions for the Randomization-Validity of Infinite-m Repeated-Imputation Inferences; Examples of Proper and Improper Imputation Methods in a Simple Case with Ignorable Nonresponse; Further Discussion of Proper Imputation Methods; The Asymptotic Distribution of (Q̄m, Ūm, Bm) for Proper Imputation Methods; Evaluations of Finite-m Inferences with Scalar Estimands; Evaluation of Significance Levels from the Moment-Based Statistics Dm and Δm with Multicomponent Estimands; Evaluation of Significance Levels Based on Repeated Significance Levels.
Article
Data are presented on adult body mass for 230 of 249 primate species, based on a review of the literature and previously unpublished data. The issues involved in collecting data on adult body mass are discussed, including the definition of adults, the effects of habitat and pregnancy, the strategy for pooling data on single species from multiple studies, and use of an appropriate number of significant figures. An analysis of variability in body mass indicates that the coefficient of variation for body mass increases with increasing species mean mass. Evaluation of several previous body mass reviews reveals a number of shortcomings with data that have been used often in comparative studies.
Article
In recent years, multiple imputation has emerged as a convenient and flexible paradigm for analysing data with missing values. Essential features of multiple imputation are reviewed, with answers to frequently asked questions about using the method in practice.
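The pooling step at the heart of multiple imputation (Rubin's rules) is compact enough to show directly: given m per-imputation estimates and their sampling variances, the combined estimate and total variance follow from the within- and between-imputation components. An illustrative sketch with made-up numbers:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m completed-data analyses with Rubin's rules:
    total variance = within-imputation + (1 + 1/m) * between-imputation."""
    q = np.asarray(estimates, dtype=float)   # per-imputation point estimates
    u = np.asarray(variances, dtype=float)   # per-imputation sampling variances
    m = len(q)
    q_bar = q.mean()                         # pooled point estimate
    w = u.mean()                             # within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b              # total variance
    return q_bar, t

# Three hypothetical imputed-data analyses of the same coefficient.
q_bar, t = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, t)
```

The (1 + 1/m) factor inflates the between-imputation variance to account for using a finite number of imputations.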
Article
Aim: To describe the geographical pattern of mean body size of the non-volant mammals of the Nearctic and Neotropics and evaluate the influence of five environmental variables that are likely to affect body size gradients. Location: The Western Hemisphere. Methods: We calculated mean body size (average log mass) values in 110 × 110 km cells covering the continental Nearctic and Neotropics. We also generated cell averages for mean annual temperature, range in elevation, their interaction, actual evapotranspiration, and the global vegetation index and its coefficient of variation. Associations between mean body size and environmental variables were tested with simple correlations and ordinary least squares multiple regression, complemented with spatial autocorrelation analyses and split-line regression. We evaluated the relative support for each multiple-regression model using AIC. Results: Mean body size increases to the north in the Nearctic and is negatively correlated with temperature. In contrast, across the Neotropics mammals are largest in the tropical and subtropical lowlands and smaller in the Andes, generating a positive correlation with temperature. Finally, body size and temperature are nonlinearly related in both regions, and split-line linear regression found temperature thresholds marking clear shifts in these relationships (Nearctic 10.9 °C; Neotropics 12.6 °C). The increase in body sizes with decreasing temperature is strongest in the northern Nearctic, whereas a decrease in body size in mountains dominates the body size gradients in the warmer parts of both regions. Main conclusions: We confirm previous work finding strong broad-scale Bergmann trends in cold macroclimates but not in warmer areas. For the latter regions (i.e. the southern Nearctic and the Neotropics), our analyses also suggest that both local and broad-scale patterns of mammal body size variation are influenced in part by the strong mesoscale climatic gradients existing in mountainous areas. A likely explanation is that reduced habitat sizes in mountains limit the presence of larger-sized mammals.
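The split-line regressions described above can be approximated by scanning candidate thresholds and fitting separate ordinary least squares lines on each side, keeping the threshold that minimises the combined residual sum of squares. A sketch on synthetic data (the threshold grid, noise level, and simulated relationship are assumptions for illustration, not the authors' settings):

```python
import numpy as np

def split_line_fit(x, y, candidates):
    """Return the candidate threshold minimising the combined residual
    sum of squares of two independent lines fitted below and above it."""
    def sse(xs, ys):
        X = np.column_stack([np.ones_like(xs), xs])
        _, res, *_ = np.linalg.lstsq(X, ys, rcond=None)
        return res[0] if res.size else 0.0
    return min(candidates,
               key=lambda t: sse(x[x < t], y[x < t]) + sse(x[x >= t], y[x >= t]))

# Synthetic Bergmann-style data: size falls with temperature below an
# assumed threshold of 11 degrees, then flattens out above it.
rng = np.random.default_rng(1)
temp = rng.uniform(-10, 30, size=300)
size = np.where(temp < 11, 20 - 0.5 * temp, 14.5) + rng.normal(scale=0.3, size=300)
threshold = split_line_fit(temp, size, candidates=np.arange(0, 25, 0.5))
```

With low noise the recovered threshold lands close to the true breakpoint; in practice one would also compare the split-line model against a single line, e.g. via AIC.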
Article
Recently, researchers in several areas of ecology and evolution have begun to change the way in which they analyze data and make biological inferences. Rather than the traditional null hypothesis testing approach, they have adopted an approach called model selection, in which several competing hypotheses are simultaneously confronted with data. Model selection can be used to identify a single best model, thus lending support to one particular hypothesis, or it can be used to make inferences based on weighted support from a complete set of competing models. Model selection is widely accepted and well developed in certain fields, most notably in molecular systematics and mark-recapture analysis. However, it is now gaining support in several other areas, from molecular evolution to landscape ecology. Here, we outline the steps of model selection and highlight several ways that it is now being implemented. By adopting this approach, researchers in ecology and evolution will find a valuable alternative to traditional null hypothesis testing, especially when more than one hypothesis is plausible.
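The "weighted support from a complete set of competing models" mentioned above is usually quantified with Akaike weights, computed from AIC differences Δi = AICi − min(AIC) as wi = exp(−Δi/2) / Σj exp(−Δj/2). A short sketch:

```python
import numpy as np

def akaike_weights(aic):
    """Convert AIC scores for a candidate model set into Akaike weights,
    the relative support for each model (the weights sum to 1)."""
    aic = np.asarray(aic, dtype=float)
    delta = aic - aic.min()          # difference from the best model
    w = np.exp(-0.5 * delta)
    return w / w.sum()

w = akaike_weights([100.0, 102.0, 110.0])
```

Here the best model receives most of the weight, the model two AIC units behind retains appreciable support, and the model ten units behind is effectively ruled out.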
Article
Together, graphical models and the Bayesian paradigm provide powerful new tools that promise to change the way that environmental science is done. The capacity to merge theory with mechanistic understanding and empirical evidence, to assimilate diverse sources of information and to accommodate complexity will transform the collection and interpretation of data. As we discuss here, we specifically expect a shift from a focus on simple experiments with inflexible design and selection among models that embrace parts of processes to a synthesis of integrated process models. With this potential come new challenges, including some that are specific and technical and others that are general and will involve reexamination of the role of inference and prediction.
Article
Most ecologists and evolutionary biologists continue to rely heavily on null hypothesis significance testing, rather than on recently advocated alternatives, for inference. Here, we briefly review null hypothesis significance testing and its major alternatives. We identify major objectives of statistical analysis and suggest which analytical approaches are appropriate for each. Any well designed study can improve our understanding of biological systems, regardless of the inferential approach used. Nevertheless, an awareness of available techniques and their pitfalls could guide better approaches to data collection and broaden the range of questions that can be addressed. Although we should reduce our reliance on significance testing, it retains an important role in statistical education and is likely to remain fundamental to the falsification of scientific hypotheses.
Article
Some individuals die before a trait is measured or expressed (the invisible fraction), and some relevant traits are not measured in any individual (missing traits). This paper discusses how these concepts can be cast in terms of missing data problems from statistics. Using missing data theory, I show formally the conditions under which a valid evolutionary inference is possible when the invisible fraction and/or missing traits are ignored. These conditions are restrictive and unlikely to be met in even the most comprehensive long-term studies. When these conditions are not met, many selection and quantitative genetic parameters cannot be estimated accurately unless the missing data process is explicitly modelled. Surprisingly, this does not seem to have been attempted in evolutionary biology. In the case of the invisible fraction, viability selection and the missing data process are often intimately linked. In such cases, models used in survival analysis can be extended to provide a flexible and justified model of the missing data mechanism. Although missing traits pose a more difficult problem, important biological parameters can still be estimated without bias when appropriate techniques are used. This is in contrast to current methods which have large biases and poor precision. Generally, the quantitative genetic approach is shown to be superior to phenotypic studies of selection when invisible fractions or missing traits exist because part of the missing information can be recovered from relatives.
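A toy simulation makes the invisible-fraction bias concrete: when a trait affects survival and only survivors can be measured, the observed trait distribution is shifted relative to the full cohort. All parameters here (cohort size, logistic survival model) are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# A trait under viability selection: larger values survive more often.
n = 100_000
trait = rng.normal(loc=0.0, scale=1.0, size=n)
p_survive = 1 / (1 + np.exp(-trait))    # logistic survival probability
alive = rng.random(n) < p_survive

# The 'invisible fraction': individuals that died before measurement.
observed_mean = trait[alive].mean()     # what a field study would see
true_mean = trait.mean()                # includes the dead
bias = observed_mean - true_mean
```

Under these assumed parameters the survivors' mean sits well above the cohort mean, so any analysis treating the measured individuals as a random sample misestimates the pre-selection trait distribution — exactly the situation where the missing data process must be modelled explicitly.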
Article
Missing data are a pervasive problem in many public health investigations. The standard approach is to restrict the analysis to subjects with complete data on the variables involved in the analysis. Estimates from such analysis can be biased, especially if the subjects who are included in the analysis are systematically different from those who were excluded in terms of one or more key variables. Severity of bias in the estimates is illustrated through a simulation study in a logistic regression setting. This article reviews three approaches for analyzing incomplete data. The first approach involves weighting subjects who are included in the analysis to compensate for those who were excluded because of missing values. The second approach is based on multiple imputation where missing values are replaced by two or more plausible values. The final approach is based on constructing the likelihood based on the incomplete observed data. The same logistic regression example is used to illustrate the basic concepts and methodology. Some software packages for analyzing incomplete data are described.
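The first of the three approaches (weighting) can be sketched for a simple mean: each complete case is weighted by the inverse of its probability of being complete. For clarity the response model is taken as known here; in practice it would be estimated, e.g. by logistic regression of the missingness indicator on observed covariates. The data and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# x is always observed; y goes missing more often when x is small,
# so the complete-case mean of y is biased upwards.
n = 50_000
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(scale=0.5, size=n)
p_obs = 1 / (1 + np.exp(-x))            # response probability (MAR given x)
observed = rng.random(n) < p_obs

cc_mean = y[observed].mean()            # complete-case estimate (biased)
w = 1 / p_obs[observed]                 # inverse-probability weights
ipw_mean = np.average(y[observed], weights=w)
```

The weighted estimate recovers the population mean (1.0 in this simulation) because cases that were unlikely to be observed stand in for similar cases that went missing.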
Grafen, A. (1988) On the uses of data on lifetime reproductive success. In Reproductive Success (Clutton-Brock, T.H., ed.), pp. 454–471, University of Chicago Press
Hadfield, J.D. (2008) Estimating evolutionary parameters when viability selection is operating. Proc. R. Soc. Lond. B Biol. Sci. 275, 723–734
Kunin, W.E. and Gaston, K.J. (1997) The Biology of Rarity: Causes and Consequences of Rare–Common Differences, Chapman & Hall
Maddison, W.P. et al. (2007) Estimating a binary character's effect on speciation and extinction. Syst. Biol. 56, 701–710
Horton, N.J. and Kleinman, K.P. (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am. Stat. 61, 79–90
Johnson, J.B. and Omland, K.S. (2004) Model selection in ecology and evolution. Trends Ecol. Evol. 19, 101–108