Recent years have seen a dramatic increase in the volumes of data that are produced, stored, and analyzed. This advent of big data has led to commercial success stories, for example in recommender systems in online shops. However, scientific research in various disciplines, including environmental and climate science, will likely also benefit from increasing volumes of data, new sources of data, and the increasing use of algorithmic approaches to analyze these large datasets. This thesis uses tools from philosophy of science to conceptually address epistemological questions that arise in the analysis of these increasing volumes of data in environmental science, with a special focus on data-driven modeling in climate research. Data-driven models are defined here as models of phenomena that are built with machine learning. While epistemological analyses of machine learning exist, these have mostly been conducted for fields in which hierarchies of theoretical background knowledge are largely absent. Such knowledge is often available in environmental science, and especially in physical climate science, and it is relevant for the construction, evaluation, and use of data-driven models. This thesis investigates predictions, uncertainty, and understanding from data-driven models in environmental and climate research and engages in in-depth discussions of case studies. Each of these three topics is treated in its own chapter.
The first chapter addresses the term “big data” as well as rationales and conditions for the use of big-data elements for predictions. Using a framework for classifying case studies from climate research, it shows that “big data” can refer to a range of different activities. Based on this classification, it shows that most case studies lie between classical domain science and pure big data. The chapter specifies necessary conditions for the use of big data and shows that in most scientific applications, background knowledge is essential to argue for the constancy of the identified relationships. This constancy assumption is relevant both for new forms of measurements and for data-driven models. Two rationales for the use of big-data elements are identified. First, big-data elements can help to overcome limitations in financial, computational, or time resources, which is referred to as the rationale of efficiency. Second, big-data elements can help to build models when system understanding does not allow for a more theory-guided modeling approach, which is referred to as the epistemic rationale.
The second chapter addresses the question of the predictive uncertainties of data-driven models. It highlights that existing frameworks for understanding and characterizing uncertainty focus on specific locations of uncertainty, which are not informative for the predictive uncertainty of data-driven models. Hence, new approaches are needed for this task. A framework is developed that focuses on the justification of the fitness-for-purpose of a model for the specific kind of prediction at hand. This framework uses argument-based tools and distinguishes between first-order and second-order epistemic uncertainty. First-order uncertainty emerges when it cannot be conclusively justified that the model is maximally fit-for-purpose. Second-order uncertainty emerges when it is unclear to what extent the fitness-for-purpose assumption and its underlying assumptions are justified. The application of the framework is illustrated by discussing a case study of data-driven projections of the impact of climate change on global soil selenium concentrations. The chapter also touches upon how the information emerging from the framework can be used in decision-making.
The third chapter addresses the question of scientific understanding. A framework is developed for assessing the fitness of a model for providing understanding of a phenomenon. For this, the framework draws from the philosophical literature on scientific understanding and focuses on the representational accuracy, the representational depth, and the graspability of a model. Based on this framework, the fitness of data-driven and process-based climate models for providing understanding of phenomena is then compared. It is concluded that data-driven models can, under some conditions, be fit to serve as vehicles for understanding to a satisfactory extent. This is specifically the case when sufficient background knowledge is available such that the coherence of the model with that knowledge, which can be assessed, for example, through sensitivity analyses, provides good reasons for the representational accuracy of the data-driven model. This point is illustrated by discussing a case study from atmospheric physics in which data-driven models are used to better understand the drivers of a specific cloud type.
The work of this thesis highlights that while big data is no panacea for scientific research, data-driven modeling offers scientists new tools that can be very useful for a variety of questions. All three studies emphasize the importance of background knowledge for the construction and evaluation of data-driven models, as such knowledge helps to obtain models that are representationally accurate. The importance of domain-specific background knowledge and the technical challenges of implementing data-driven models for complex phenomena underline the importance of interdisciplinary work. Previous philosophical work on machine learning has stressed that the problem framing makes models theory-laden. This thesis shows that in a field like climate research, model evaluation is strongly guided by theoretical background knowledge, which constitutes a further source of theory-ladenness in data-driven modeling. The results of the thesis are relevant for a range of methodological questions regarding data-driven modeling and for philosophical discussions of models that extend beyond data-driven models.