Article

Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics

Authors:
William S. Cleveland

Abstract

An action plan to enlarge the technical areas of statistics focuses on the data analyst. The plan sets out six technical areas of work for a university department and advocates a specific allocation of resources devoted to research in each area and to courses in each area. The value of technical work is judged by the extent to which it benefits the data analyst, either directly or indirectly. The plan is also applicable to government research labs and corporate research organizations.

1 Summary of the Plan

This document describes a plan to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called "data science." The focus of the plan is the practicing data analyst. A basic premise is that technical areas of data science should be judged by the extent to which they enable the analyst to learn from data. The benefit of an area can be direct or indirect. Tools that are used by...


... Without the impetus of research devoted to innovation, progress in data processing has been slower, and the creativity that exists inside universities has had practically no influence. (Cleveland, 2001) It is to overcome this professional crisis that several American statisticians consider it urgent to return to the proposals John Tukey formulated in the 1960s. William S. Cleveland, another research statistician at Bell Labs, plays a major role here by revitalizing the data analysis movement at the very beginning of the 2000s. ...
... William S. Cleveland, another research statistician at Bell Labs, plays a major role here by revitalizing the data analysis movement at the very beginning of the 2000s. He proposes the expression 'data science' ('science des données') to describe a domain of statistics that would be both broader and redefined (Cleveland, 2001). ...
... [...] This means that 'data science' departments should include researchers who devote their careers to advancing computing with data and who form partnerships with computer scientists. (Cleveland, 2001) Conceived in this way, 'data science' calls on statisticians to stop defining themselves professionally through mathematics alone and to take just as much interest in designing computational tools and in solving problems that concern a great many organizations outside statistics departments. Profoundly multidisciplinary in nature, this 'science' thus presents itself as an assembly of technologies and bodies of knowledge at the crossroads of statistics and computer science. ...
... Considering Data Science as a formal area of knowledge: for [15] Data Science is in its infancy, [2] likewise state that Data Science is in its initial phase, and [18] considers Data Science to be at an embryonic stage. For [5], as for [19], Data Science is the child of Statistics and Computer Science. [20] adds that, in its embryonic phase, Data Science must also consider aspects of the Business Sciences. ...
... Cleveland [19] presents a plan for technical work in the field of statistics and notes that, given the substantial change this implies, including multidisciplinary research, the result would be a new field called Data Science. [19] also presents a set of actions for implementing this plan, each with a percentage allocation within the plan to expand the technical areas of statistics into what can be called Data Science, emphasizing the creation and structure of a new formal area of knowledge named Data Science. ...
Preprint
Full-text available
Data and science have stood out in the generation of results, whether in projects in the scientific domain or the business domain. The CERN project, scientific institutes, and companies such as Walmart, Google, and Apple, among others, need data to present their results and make predictions in the competitive data world. Data and science are words that together culminated in a globally recognized term: Data Science. Data Science is in its initial phase, possibly forming part of the formal sciences while also being presented as part of the applied sciences, capable of generating value and supporting decision making. Data Science considers science and, consequently, the scientific method to promote decision making through data intelligence. In many cases, the method (or part of it) is applied in Data Science projects in the scientific domain (social sciences, bioinformatics, geospatial projects) or the business domain (finance, logistics, retail), among others. In this sense, this article addresses the perspectives of Data Science as a multidisciplinary area, considering science and the scientific method, and its formal structure, which integrates Statistics, Computer Science, and Business Science, also taking into account Artificial Intelligence, with emphasis on Machine Learning, among others. The article also deals with the perspective of applied Data Science, since Data Science is used to generate value through scientific and business projects. The Data Science persona is also discussed, concerning the education of Data Science professionals and their corresponding profiles, since its projection is changing the field of data in the world.
... Thus, as the emergence of data science has created a balance between theory and computation, the distinction between statisticians and non-statisticians has blurred (Cleveland, 2001). Historically, data analysis was associated with statisticians. ...
... It was not until many years later that data science was formed into a field, when authors such as Cleveland (2001) and Wu (1997) started referring to the practices of Tukey and others as data science (Donoho, 2017; Raban and Gordon, 2020). The Data Science Association defines data science as "the scientific study of the creation, validation and transformation of data to create meaning" and statistics as "the practice or science of collecting and analysing numerical data in large quantities." ...
... mathematical biology and biometry (Norton, 1978) versus statistics (De Veaux et al., 2017; Weihs and Ickstadt, 2018; Cao, 2017) and probability (Tayo, 2019). Main focus: theoretical sophistication (Carmichael and Marron, 2018; Olhede and Wolfe, 2018; van der Aalst, 2016) versus practical solutions to real problems (Cleveland, 2001; van der Aalst, 2016). Main approach: methodology/model development and confirmation (Wild et al., 2018), i.e., precise models with strict assumptions (Olhede and Wolfe, 2018), versus application of machine learning and data mining to avoid being restricted by models (Ribeiro et al., 2017; Gorunescu, 2011). ...
Article
The importance and relevance of the discipline of statistics with the merits of the evolving field of data science continues to be debated in academia and industry. Following a narrative literature review with over 100 scholarly and practitioner-oriented publications from statistics and data science, this article generates a pragmatic perspective on the relationships and differences between statistics and data science. Some data scientists argue that statistics is not necessary for data science as statistics delivers simple explanations and data science delivers results. Therefore, this article aims to stimulate debate and discourse among both academics and practitioners in these fields. The findings reveal the need for stakeholders to accept the inherent advantages and disadvantages within the science of statistics and data science. The science of statistics enables data science (aiding its reliability and validity), and data science expands the application of statistics to Big Data. Data scientists should accept the contribution and importance of statistics and statisticians must humbly acknowledge the novel capabilities made possible through data science and support this field of study with their theoretical and pragmatic expertise. Indeed, the emergence of data science does pose a threat to statisticians, but the opportunities for synergies are far greater.
... William Cleveland coined the term 'data science' in 2001 in his article titled 'Data science: An action plan for expanding the technical areas of the field of statistics'. Cleveland (2001) describes data science as involving a mixture of statistics and large-scale computing. Additionally, after the US economy recovered in 2011, data science expanded as more data science programmes were launched. ...
... By searching the data science literature, the researcher identified several points of view introduced by statisticians, mathematicians and computer scientists for dividing up the subject interests of data science as a new major. Three similar visions have been presented by Cleveland (2001), Donoho (2017) and De Veaux et al. (2017). Thus, the courses of the data science programmes were classified into 10 groups based on the researcher's vision (see Table 6). ...
Article
In response to the current trends in dealing with data in academia, various research institutions and commercial entities around the world are building new programmes to fill the gaps in workforce demand in specific disciplines, including data curation, big data, data management, data science and data analytics. Thus, the aim of the present study was to reveal the reality of data science education in the Middle East and to determine the opportunities and challenges for teaching data science in the region. Thirteen countries in the Middle East were offering 48 data science programmes at the time of the study. The results reveal that these data science programmes significantly use the words ‘data’ and ‘analytics’ in their names. With regard to the academic affiliations of the data science programmes, the study found that they are offered in a variety of schools, especially computer science, information technology and business. Moreover, the study found that computer science is the dominant trend in the programmes. Data science programmes have a significant overlap with other programmes, especially statistics and computer science, because of the interdisciplinary nature of this field. Data science schools in the Middle East differ in terms of their programme titles, programme descriptions, course catalogues, curriculum structures and course objectives. Broadly, this study may be useful for those who are seeking to establish a data science programme or to strengthen data science curricula at both the undergraduate and postgraduate levels.
... The following review of research on data science curricula encompasses highly heterogeneous perspectives from different actors and is structured in three parts. First, from a discipline-oriented perspective, normative requirements for such programmes are formulated, as already shown (Cleveland 2001; Donoho 2017; Song & Zhu 2017; De Veaux et al. 2017). In this vein, primarily higher-education policy actors attempt to identify and define data science competence profiles for different levels of education (BHEF 2016; ETH-Rat 2016a; NASEM 2018). ...
... In this vein, primarily higher-education policy actors attempt to identify and define data science competence profiles for different levels of education (BHEF 2016; ETH-Rat 2016a; NASEM 2018). In parallel, practitioners draw on their own research and teaching experience, as well as on existing training programmes, to sketch normative requirements for such curricula (Cleveland 2001; Donoho 2017; Gupta et al. 2015; Kane 2014; De Veaux et al. 2017). In the USA, various scientific institutions have engaged intensively with the challenges of the epistemological transformation of science and the role of the data sciences (Berman et al. 2016; Kloefkorn et al. 2020; NASEM 2017, 2018; NASEM & The Royal Society 2018). ...
Book
Full-text available
The data sciences are concerned with the analysis of large, complex volumes of data and, in the context of digitalization, attract a great deal of media and political attention. Philippe Saner examines the emergence of this transversal field of knowledge around big data using a field-theoretical approach. He shows that it is a cross-field network of expertise, structured by differing interests, strategies, and power relations. The data sciences thus open up a permeable space that offers lucrative opportunities to actors from established fields such as science, business, higher education, and politics.
... Data science, the extraction of knowledge and insight from raw data (Provost and Fawcett, 2013) culminating in a data product (Cao, 2017), is the interdisciplinary study of data (Cao, 2017), enabled by the growth of computational power over the last few decades. By combining computational skills and statistical theory (Cleveland, 2001; Donoho, 2017), data scientists capture, curate, and analyze data (Provost and Fawcett, 2013) that they could not process using previous generations of information technology. The primary feature distinguishing data science from statistics is "computing with data" (Cleveland, 2001). ...
... By combining computational skills and statistical theory (Cleveland, 2001;Donoho, 2017), data scientists capture, curate, and analyze data (Provost and Fawcett, 2013) that they could not process using previous generations of information technology. The primary feature distinguishing data science from statistics is "computing with data" (Cleveland, 2001). Increases in computational power enabled the analysis of nonstructured data such as text, photos, video, and sound (Donoho, 2017), new forms of data visualization (Bell et al., 2009), and data products powered by advances in predictive analytics (Cao, 2017). ...
Article
Full-text available
Mass adoption of advanced information technologies is fueling a need for public servants with the skills to manage data-driven public agencies. Public employees typically acquire data skills through graduate research methods courses, which focus primarily on research design and statistical analysis. What data skills are currently taught, and what content should Master of Public Administration (MPA) programs include in their research method courses? We categorized research method course content in 52 syllabi from 31 MPA programs to understand how data skills are taught in public administration. We find that most graduate programs rely on research methods more suited for academic and policy research while lacking the data skills needed to modernize public agencies. Informed by these results, this work presents the Data Science Literacy Framework as a guide for assessing and planning curriculum within MPA programs.
... Since the action plan for data science articulated by Cleveland (2001), the field has continued to blossom within academia. Academic data science can be described in aspirational terms using a pyramid, with doctoral degrees rare but important for leadership and research in the field. ...
... When such molecules are linked by physical interactions, they form networks of molecular interactions that are usually classified according to the nature of the compounds involved. Most commonly, the interactome refers to the protein-protein interaction network (PPI, or PIN) or variations thereof [2]. This is an extremely complex circuit, and constructing it would not be possible without computer simulation. ...
Article
Full-text available
In this article, we consider which IT technologies are most used in medicine, and in genomics methods in particular, and we look at the use of big data in this area. Additionally, we explain what a connectome is and analyze the 4M and 3V frameworks in genomics. Statistics in medicine is one of the tools for analyzing experimental data and clinical observations, as well as the language by means of which the obtained mathematical results are reported. However, this is not the only task of statistics in medicine. The mathematical apparatus is widely used for diagnostic purposes, for solving classification problems and searching for new patterns, and for formulating new scientific hypotheses. The use of statistical programs presupposes knowledge of the basic methods and stages of statistical analysis: their sequence, necessity, and sufficiency. In the proposed presentation, the main emphasis is not on a detailed exposition of the formulas that make up the statistical methods, but on their essence and rules of application. Finally, we discuss genome-wide association studies, methods of statistical processing of medical data, and their relevance. In this article, we analyzed the basic concepts of statistics, statistical methods in medicine and data science, and considered several areas in which large amounts of data requiring modern IT technologies are used, including genomics, genome-wide association studies, visualization, and connectome data collection.
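Where the abstract mentions genome-wide association studies, a toy example may help readers unfamiliar with the method; the sketch below is ours, not the authors', and applies the kind of per-variant independence test used in GWAS to an invented genotype-by-status table.

```python
# Toy association test of the kind run per-SNP in GWAS:
# chi-square test of independence on a genotype-by-phenotype table.
# The counts below are invented for illustration.
from scipy.stats import chi2_contingency

# Rows: cases, controls; columns: genotypes AA, Aa, aa
table = [[120, 90, 40],
         [100, 110, 60]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# A real GWAS repeats this test across roughly a million variants,
# so significance thresholds are corrected for multiple testing.
```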
... Introduction. In a context marked by an exponentially growing mass of data [30], a reproducibility crisis in science [4], and the progressive adoption of Open Science practices [5], statistics has broadened into a wider discipline called Data Science [13]. For the Data Science Association, 'Data Science' means 'the scientific study of the creation, validation and transformation of data to create meaning' (http://www.datascienceassn.org/code-of-conduct.html). ...
... Along with defining the meaning of the term data science, there is an ongoing debate on the legitimacy of establishing it as a fully autonomous, brand-new academic discipline rather than regarding it as a mere extension of statistical methods (Cleveland, 2001; Diggle, 2015). Some researchers also associate data science with terms such as business analytics, operations research, business intelligence, competitive intelligence, data analysis and modelling, and knowledge extraction from big data (Foreman, 2013; Kelleher & Tierney, 2018). ...
Article
Full-text available
Research data management (RDM) poses a significant challenge for academic organizations. The creation of library research data services (RDS) requires assessment of their maturity, i.e., the primary objective of this study. Its authors have set out to probe the nationwide level of library RDS maturity, based on the RDS maturity model, as proposed by Cox et al. (2019), while making use of natural language processing (NLP) tools, typical for big data analysis. The secondary objective consisted in determining the actual suitability of the above-referenced tools for this particular type of assessment. Web scraping, based on 72 keywords, and completed twice, allowed the authors to select from the list of 320 libraries that run RDS, i.e., 38 (2021) and 42 (2022), respectively. The content of the websites run by the academic libraries offering a scope of RDM services was then appraised in some depth. The findings allowed the authors to identify the geographical distribution of RDS (academic centers of various sizes), a scope of activities undertaken in the area of research data (divided into three clusters, i.e., compliance, stewardship, and transformation), and overall potential for their prospective enhancement. Although the present study was carried within a single country only (Poland), its protocol may easily be adapted for use in any other countries, with a view to making a viable comparison of pertinent findings.
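The keyword-driven web-scraping protocol the study describes can be illustrated with a minimal sketch; the URL, keyword list, and helper name below are hypothetical stand-ins, not the study's actual 72-keyword setup.

```python
# Minimal keyword scan of library websites, loosely mirroring the
# study's scraping protocol. URLs and keywords are placeholders.
import requests
from bs4 import BeautifulSoup

KEYWORDS = ["research data management", "data stewardship", "RDM"]
SITES = ["https://library.example-university.edu/services"]

def keyword_hits(url: str) -> dict[str, int]:
    """Count keyword occurrences in the visible text of one page."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    return {kw: text.count(kw.lower()) for kw in KEYWORDS}

for site in SITES:
    hits = keyword_hits(site)
    if any(hits.values()):
        print(site, hits)  # candidate library offering RDS
```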
... Each method will allow us to make dedicated predictions about a particular patient-specific aspect. Importantly, the methods for dedicated problems can come from different fields of data science, e.g., machine learning, artificial intelligence or statistics [31,32], focusing on different aspects of data and employing different methodologies. Hence, it would not be appropriate to call such an analysis system, e.g., an AI system, because it can also comprise non-AI methods. ...
Article
Full-text available
The idea of a digital twin has recently gained widespread attention. While, so far, it has been used predominantly for problems in engineering and manufacturing, it is believed that a digital twin also holds great promise for applications in medicine and health. However, a problem that severely hampers progress in these fields is the lack of a solid definition of the concept behind a digital twin that would be directly amenable to such big data-driven fields requiring a statistical data analysis. In this paper, we address this problem. We will see that the term 'digital twin', as used in the literature, is like a Matryoshka doll. For this reason, we unstack the concept via a data-centric machine learning perspective, allowing us to define its main components. As a consequence, we suggest using the term Digital Twin System instead of digital twin, because this highlights its complex interconnected substructure. In addition, we address ethical concerns that result from treatment suggestions for patients based on simulated data and from a possible lack of explainability of the underlying models.
... The main difference between DS and ML comes from the fact that data science involves data analysis, data visualization, and statistical analysis, which require a background in computer science, statistics, and mathematics [2][3][4], whereas machine learning involves learning from data [5][6][7][8][9] and the creation of learning models based on data. DS includes data analytics [10,11]. ...
Article
Full-text available
Data science and machine learning are subjects largely debated in practice and in mainstream research. Very often, they overlap due to their common purpose: prediction. Therefore, data science techniques mix with machine learning techniques in their mutual attempt to gain insights from data. Data contain multiple possible predictors, not necessarily structured, and it becomes difficult to extract insights. Identifying important or relevant features that can help improve predictive power or better characterize clusters of data is still debated in the scientific literature. This article uses diverse data science and machine learning techniques to identify the most relevant aspects that differentiate data science and machine learning. We used a publicly available dataset that describes multiple users who work in the field of data engineering. Among them, we selected data scientists and machine learning engineers and analyzed the resulting dataset. We designed the feature engineering process and identified the specific differences, in terms of the features that best describe data scientists and machine learning engineers, by using the SelectKBest algorithm, neural networks, a random forest classifier, a support vector classifier, cluster analysis, and self-organizing maps. We validated our model through different statistics. Better insights lead to better classification. Classifying between data scientists and machine learning engineers proved to be more accurate after feature engineering.
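Of the techniques this abstract names, the feature-ranking step is the easiest to make concrete. The sketch below shows typical SelectKBest usage in scikit-learn on synthetic data, since the paper's actual dataset and parameters are not given here.

```python
# Feature ranking with SelectKBest, as named in the abstract.
# Synthetic stand-in data; the paper's real features describe
# data scientists vs. machine learning engineers.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
top = selector.get_support(indices=True)
print("indices of the 5 highest-scoring features:", top)
print("F-scores:", selector.scores_[top].round(1))
```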
... William S. Cleveland (2001) introduced data science as an independent scientific discipline, based on his proposal of:
Article
Full-text available
We present a summary of the 1st International Symposium on the Science of Data Science, organized in Summer 2021 as a satellite event of the 8th Swiss Conference on Data Science held in Lucerne, Switzerland. We discuss what establishes the scientific core of the discipline of data science by introducing the corresponding research question and providing a concise overview of relevant prior work, followed by a summary of the individual workshop contributions. Finally, we expand on the common views that were formed during the extensive workshop discussions.
... Statistics has rapidly evolved during the past decades with the so-called data revolution characterised by the incorporation of technology that makes it possible to manage and analyse huge amounts of data. These significant developments led some authors to propose changing the name of the discipline to "data science" (Cleveland, 2001). The teaching of statistics at the university level is not immune to these changes, although it evolves at a very different rate. ...
Article
Full-text available
We present the organisation of a first course in statistics for Business Administration degree students, which includes a study and research path (SRP) as an inquiry-based teaching proposal. The paper aims to summarise the course's evolution, design, and reflections on its various components separately and together as a complete unit. The analysis considers three perspectives on the course: those of the students, the lecturer, and the researcher to provide a critical perspective. The discussion includes the joint evolution of the course and the SRP. Under the Anthropological Theory of the Didactic framework, we show that the design and management of the SRP cannot be detached from the course as a whole. We also see how the course components nourish the SRP and how this, in turn, drives the evolution of the course content and adapts it to the students' professional needs. This inquiry proposal requires a multidimensional approach in both its planning and the dissemination of its outcomes in the research and professional literature. Therefore, our study can contribute to didactics research on SRPs and serve as a starting point for newcomers to inquiry-based teaching, and as a reflection to foster collaborations between researchers in didactics and lecturers.
... Unlike statistics, with its rich history as a discipline, data science is a newer field still being defined (Cao, 2017; Donoho, 2017; NASEM, 2018), but it is often described as multidisciplinary. The field has grown from industry's need to utilize and make sense of vast amounts of data (Cao, 2017) and from a recognition in academia of the need for new theories and methods for data analysts (Cleveland, 2001; Donoho, 2017). Most agree that data science is "the science of learning from data" (Donoho, 2017, p. 748) or using "data to solve problems" (Carmichael & Marron, 2018, p. 1). ...
Article
With a call for schools to infuse data across the curriculum, many are creating curricula and examining students’ thinking in data-intensive problems. As the discipline of statistics education broadens to data science education, there is a need to examine how practices in data science can inform work in K-12. We synthesize literature about statistics investigation processes, data science as a field and practices of data scientists. Further, we provide results from an ethnographic and interview study of the work of data scientists. Together, these inform a new framework to support data investigation processes. We explicate the practices and dispositions needed and offer a glimpse of how the framework can be used to move the discipline of data science education forward.
... An early use of the term data science appears in "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics", in which Cleveland (2001) describes data science by laying out a university training program. His description is notable for including both a computing-with-data strand and a pedagogy strand. ...
Conference Paper
The need for people fluent in working with data is growing rapidly and enormously, but U.S. K–12 education does not provide meaningful learning experiences designed to develop understanding of data science concepts or fluency with data science skills. Data science is inherently interdisciplinary, so it makes sense to integrate it with existing content areas, but difficulties abound. Consideration of the work involved in doing data science, and of the habits of mind that lie behind it, leads to a way of thinking about integrating data science with mathematics and science. Examples drawn from current activity development in the Data Games project shed some light on what technology-based, data-driven learning might be like. The project's ongoing research on learners' conceptions of organizing data, and its relevance to data science education, is explained.
... The "algorithmic modeling" culture, he claimed, was better suited to the age of big data. Cleveland (2001) offered a definition of data science that arose from a similar criticism that the academic statistics community was not sufficiently concerned with data analysis. The purpose of data science is, according to Cleveland, to "enable the analyst to learn from data". ...
Conference Paper
Making sense of data is complex, and the knowledge and skills required to understand "Big Data" - and many open data sources - go beyond those taught in traditional introductory statistics courses. The Mobilize project has created and implemented a course for secondary students, Introduction to Data Science (IDS), that aims to develop computational and statistical thinking skills so that students can access and analyze data from a variety of traditional and non-traditional sources. Although the course does not directly address open source data, such data are used in the curriculum, and an outcome of the project is to develop skills and habits of mind that allow students to use open source data to understand their community. This paper introduces the course and describes some of the challenges in its implementation.
... Obviously, this is seen differently elsewhere. Historically, the term goes back to Cleveland (2001), who called for a turn away from the mathematization of statistics and for a stronger emphasis on the applied side. In Berlin, the programme was incorporated into a general master's degree programme in statistics. ...
Article
Full-text available
Statistics as a discipline must hold its own in a rapidly changing environment characterized by the rise of data science, the growing importance of artificial intelligence, and new data structures. How can statistics assert itself here or regain lost ground? Under the provocative motto "Make Statistics great again", developments, strategies, and positive examples were sketched from various angles to show how the discipline of statistics should position itself at universities, in the research system, and on the labour market. Willi Seidel looks at the disciplines' struggle for resources from the perspective of a university president. Christine Müller reports on the initiatives of the umbrella organization DAGStat to position the many subdisciplines of statistics effectively in the research system and in the public sphere. Florian Meinfelder documents the rise of the master's programme in survey statistics into one of the most sought-after degree programmes at the University of Bamberg. Jürgen Chlumsky and Markus Zwick shed light on the historical perception of official statistics with regard to mandatory surveys, the development of the research data centres, and modern forms of access to new data sources. Joachim Wagner describes the relationship between data producers and data users from the perspective of a dissatisfied data user. Finally, the position of statistics within data science is addressed: is "data science" just a newfangled word for statistics? A concept paper by the Gesellschaft für Informatik (GI) prompted position papers by the DStatG and the DAGStat, which are presented by Ulrich Rendtel. The colloquium took place on the occasion of Ulrich Rendtel's farewell lecture in June 2019 at the Department of Economics of the Freie Universität.
... The need for a broader approach to teaching statistics is evidenced by Harraway (2003), who surveyed employers about the statistical needs of their operations. A proposal for broadening the scope of statistics education is made by Cleveland (2001), whose perspective comes from extensive immersion in consulting. A similar view is presented by Kettenring (1997), based on his extensive experience in industry and with the American Statistical Association. ...
Conference Paper
The pioneers of statistics focused on parametric estimation and summary to communicate statistical findings. The tradition of basing inference on parametric fits is a central mode in statistics education, but in statistics applications, computer- based graphical summary is playing an increasingly important role. A parallel development has been the spread of statistics education to almost all disciplines, and thus the need to communicate statistical results to non-specialists has become more acute. These influences of more graphics and a wider distribution require adaptation in our statistics courses. This paper provides examples of, and arguments for, the use of simulation and graphical display, and the role of these techniques in enhancing the verbalization of analytical results. The immediate goal of the paper is to persuade those who design curricula for early statistics courses to provide a serious introduction to graphical data analysis, at the expense of some traditional parametric inference. The ultimate goal is to enable more students to communicate statistical findings effectively.
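A minimal classroom-style example of the simulation-plus-graphics pairing the paper argues for (our illustration, not the paper's) could look like this:

```python
# Simulating the sampling distribution of the mean and displaying it
# graphically -- the kind of simulation-plus-graphics exercise the
# paper advocates for introductory courses.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Skewed population: exponential with mean 1
sample_means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)

plt.hist(sample_means, bins=40, color="steelblue", edgecolor="white")
plt.axvline(1.0, color="black", linestyle="--", label="population mean")
plt.xlabel("mean of n = 30 exponential draws")
plt.title("Sampling distribution: approximately normal despite skew")
plt.legend()
plt.show()
```

A display like this lets students verbalize the central limit theorem from a picture before any formula is introduced, which is the pedagogical point the paper makes.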
... Data science (Cleveland, 2001;Donoho, 2017) combines multiple pre-existing disciplines (e.g., statistics, machine learning, computer science) with a redirected focus on creating, understanding and systematizing workflows that turn real-world data into actionable conclusions. The ubiquity of data in all economic sectors and scientific disciplines makes data science eminently relevant to cohorts of researchers for whom the discipline of statistics was previously closed-off and esoteric. ...
Preprint
Full-text available
Data science has arrived, and computational statistics is its engine. As the scale and complexity of scientific and industrial data grow, the discipline of computational statistics assumes an increasingly central role among the statistical sciences. An explosion in the range of real-world applications means the development of more and more specialized computational methods, but five Core Challenges remain. We provide a high-level introduction to computational statistics by focusing on its central challenges, present recent model-specific advances and preach the ever-increasing role of non-sequential computational paradigms such as multi-core, many-core and quantum computing. Data science is bringing major changes to computational statistics, and these changes will shape the trajectory of the discipline in the 21st century.
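As a toy illustration of the multi-core paradigms the preprint highlights (our example, not the authors'), an embarrassingly parallel Monte Carlo estimate might be sketched as follows:

```python
# Embarrassingly parallel Monte Carlo estimate of pi across cores,
# illustrating the non-sequential paradigms the preprint discusses.
import numpy as np
from multiprocessing import Pool

def hits_in_circle(args):
    """Count uniform points in the unit square that land in the quarter circle."""
    seed, n = args
    rng = np.random.default_rng(seed)
    x, y = rng.random(n), rng.random(n)
    return int(np.sum(x * x + y * y <= 1.0))

if __name__ == "__main__":
    n_workers, draws_each = 4, 1_000_000
    with Pool(n_workers) as pool:
        hits = pool.map(hits_in_circle,
                        [(seed, draws_each) for seed in range(n_workers)])
    print("pi ~", 4 * sum(hits) / (n_workers * draws_each))
```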
... Data science became formalized during the mid-20th century and is a comparatively new field. Recognizing the need for computer scientists not only to define and develop software and hardware platforms but also to analyze the data captured electronically therein, a cross-disciplinary approach was proposed that combined the rigor of various computational approaches with statistics (Cleveland, 2014). Yet it is not solely an applied discipline with a focus on algorithmic development, such as machine learning, or on statistics (Meng, 2019). ...
Article
Epidemiology, biostatistics, and data science are broad disciplines that incorporate a variety of substantive areas. Common among them is a focus on quantitative approaches for solving intricate problems. When the substantive area is health and health care, the overlap is further cemented. Researchers in these disciplines are fluent in statistics, data management and analysis, and health and medicine, to name but a few competencies. Yet there are important and perhaps mutually exclusive attributes of these fields that warrant a tighter integration. For example, epidemiologists receive substantial training in the science of study design, measurement, and the art of causal inference. Biostatisticians are well versed in the theory and application of methodological techniques, as well as the design and conduct of public health research. Data scientists receive equivalently rigorous training in computational and visualization approaches for high-dimensional data. Compared to data scientists, epidemiologists and biostatisticians may have less expertise in computer science and informatics, while data scientists may benefit from a working knowledge of study design and causal inference. Collaboration and cross-training offer the opportunity to share and learn of the constructs, frameworks, theories, and methods of these fields with the goal of offering fresh and innovate perspectives for tackling challenging problems in health and health care. In this article, we first describe the evolution of these fields focusing on their convergence in the era of electronic health data, notably electronic medical records (EMRs). Next we present how a collaborative team may design, analyze, and implement an EMR-based study. Finally, we review the curricula at leading epidemiology, biostatistics, and data science training programs, identifying gaps and offering suggestions for the fields moving forward.
... The origin of the concept of DS can be traced back to 1960. Peter Naur used the term as a substitute for computer science [2], whereas C. F. Jeff Wu gave lectures on "Statistics = Data Science?" [3,4], and Cleveland [5] framed it as a discipline in itself, which has since come true, as various undergraduate and postgraduate programs have been developed specifically for DS. Though many researchers and scientists have tried to define the term DS, there are still disagreements about it and about how loosely it has been used in the literature [6,7]. ...
Chapter
Data science deals with better utilization of data using scientific methods to serve a purpose, whereas ontologies deal with the representation of data so that it can be used efficiently. The combination of these two domains of study is an emerging field. Ontologies can clarify data, represent data in a machine-processable format, integrate data from disparate sources, improve data quality, and describe the effects of data science processes. In this work, we explore the roles that ontologies could play in the data science domain from various perspectives, such as semantic data modelling, semantic data integration, and semantic data mining, with the help of examples available in the literature. The work also summarizes the purpose and use of existing data science ontologies in the literature. This exploration revealed that there are very few data science ontologies, and that they have been used for purposes such as organizing the domain of data science, providing recommendations of educational resources for data science, creating a data scientist skills profile, and describing the program code of a data science program. Three data science ontologies were developed in OWL format using ontology editors such as Protégé and ONTOLIS, whereas one was developed using a new language called MONoidal Ontology and Computing Language (Monocl) and one was represented simply as a hierarchical tree. One of the ontologies, called Data Science Education Ontology, is available in BioPortal and has been mapped to 38 other ontologies there.
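To make the ontology discussion concrete, a tiny data-science ontology fragment can be assembled programmatically. The sketch below uses rdflib with an invented namespace, classes, and property; it is our illustration, not one of the surveyed ontologies.

```python
# A toy fragment of a data-science ontology expressed as RDF triples
# with rdflib. Namespace, classes, and the "integrates" property are
# invented for illustration.
from rdflib import Graph, Namespace, RDF, RDFS, Literal

DS = Namespace("http://example.org/ds-ontology#")
g = Graph()
g.bind("ds", DS)

# Minimal schema: a Discipline class and an integrates relation
g.add((DS.Discipline, RDF.type, RDFS.Class))
g.add((DS.integrates, RDF.type, RDF.Property))

# Instances echoing the integration the literature describes
for d in ("DataScience", "Statistics", "ComputerScience", "BusinessScience"):
    g.add((DS[d], RDF.type, DS.Discipline))
for part in ("Statistics", "ComputerScience", "BusinessScience"):
    g.add((DS.DataScience, DS.integrates, DS[part]))

g.add((DS.DataScience, RDFS.label, Literal("Data Science")))
print(g.serialize(format="turtle"))
```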
... At this stage, a minor mental issue can become a lifelong issue. Data science is a growing field of study in today's world [10]. A survey was used to gather information on parents' level of affection and children's level of intelligence. ...
Article
Full-text available
A nation's most valuable resource is its children. How a nation is run in the future will reflect how its children develop now. The majority of parents lack expertise in how to help their children develop a positive outlook. In our study, we analyzed the association between excessive parental affection and the development of children's intelligence. Information was gathered from 531 families through a questionnaire. Some 43 percent of parents show excessive affection toward their children, while 45 percent show appropriate affection. On the other hand, in our study, 48 percent of the children had an IQ score of less than 49. We identified the alterations in children's brains that result from their parents' blind affection and also identified remedies to the problem, so that the growth of children's intelligence is not hampered by their parents' excessive affection and so that parents and children enjoy a close relationship.
... Data mining uses database technologies, machine learning, and statistical models to uncover hidden patterns in (typically large) datasets [90]. It has also recently been referred to as the unified field of data science [29]. Methods and use cases include anomaly detection, e.g., detecting abnormal or fraudulent activity in banking institutions, and the design of association rules, e.g., discovering product correlations in online shops for marketing purposes. ...
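The bank-fraud use case named in this excerpt maps naturally onto off-the-shelf anomaly detection; a minimal sketch with scikit-learn's IsolationForest on synthetic "transactions" (our illustration) follows.

```python
# Anomaly detection on toy "transaction" data with IsolationForest,
# echoing the bank-fraud use case in the excerpt. Data are synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
# Columns: transaction amount, hours since previous transaction
normal = rng.normal(loc=[50, 1], scale=[15, 0.5], size=(500, 2))
fraud = rng.normal(loc=[900, 8], scale=[100, 1.0], size=(5, 2))
X = np.vstack([normal, fraud])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)          # -1 = anomaly, 1 = normal
print("flagged rows:", np.where(labels == -1)[0])
```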
Thesis
Full-text available
As part of our everyday life we consume breaking news and interpret it based on our own viewpoints and beliefs. We have easy access to online social networking platforms and news media websites, where we inform ourselves about current affairs and often post about our own views, such as in news comments or social media posts. The media ecosystem enables opinions and facts to travel from news sources to news readers, from news article commenters to other readers, from social network users to their followers, etc. The views of the world many of us have depend on the information we receive via online news and social media. Hence, it is essential to maintain accurate, reliable and objective online content to ensure democracy and verity on the Web. To this end, we contribute to a trustworthy media ecosystem by analyzing news and social media in the context of politics to ensure that media serves the public interest. In this thesis, we use text mining, natural language processing and machine learning techniques to reveal underlying patterns in political news articles and political discourse in social networks. Mainstream news sources typically cover a great amount of the same news stories every day, but they often place them in a different context or report them from different perspectives. In this thesis, we are interested in how distinct and predictable newspaper journalists are, in the way they report the news, as a means to understand and identify their different political beliefs. To this end, we propose two models that classify text from news articles to their respective original news source, i.e., reported speech and also news comments. Our goal is to capture systematic quoting and commenting patterns by journalists and news commenters respectively, which can lead us to the newspaper where the quotes and comments are originally published. Predicting news sources can help us understand the potential subjective nature behind news storytelling and the magnitude of this phenomenon. Revealing this hidden knowledge can restore our trust in media by advancing transparency and diversity in the news. Media bias can be expressed in various subtle ways in the text and it is often challenging to identify these bias manifestations correctly, even for humans. However, media experts, e.g., journalists, are a powerful resource that can help us overcome the vague definition of political media bias and they can also assist automatic learners to find the hidden bias in the text. Due to the enormous technological advances in artificial intelligence, we hypothesize that identifying political bias in the news could be achieved through the combination of sophisticated deep learning models and domain expertise. Therefore, our second contribution is a high-quality and reliable news dataset annotated by journalists for political bias and a state-of-the-art solution for this task based on curriculum learning. Our aim is to discover whether domain expertise is necessary for this task and to provide an automatic solution for this traditionally manually-solved problem. User generated content is fundamentally different from news articles, e.g., messages are shorter, they are often personal and opinionated, they refer to specific topics and persons, etc. Regarding political and socio-economic news, individuals in online communities make use of social networks to keep their peers up-to-date and to share their own views on ongoing affairs.
We believe that social media is as powerful an instrument for information flow as news sources are, and we use its unique characteristic of rapid news coverage for two applications. We analyze Twitter messages and debate transcripts during live political presidential debates to automatically predict the topics that Twitter users discuss. Our goal is to discover the favoured topics in online communities on the dates of political events as a way to understand the political subjects of public interest. With the up-to-dateness of microblogs, an additional opportunity emerges, namely to use social media posts and leverage the real-time verity about discussed individuals to find their locations. That is, given a person of interest who is mentioned in online discussions, we use the wisdom of the crowd to automatically track her physical locations over time. We evaluate our approach in the context of politics, i.e., we predict the locations of US politicians as a proof of concept for important use cases, such as tracking people who are national risks, e.g., warlords and wanted criminals.
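For the debate-topic task the thesis describes, a much simpler baseline can convey the idea. The sketch below is ours, with invented posts, and stands in for the thesis's actual models: it extracts topics from short messages with scikit-learn.

```python
# Baseline topic extraction over short debate-night posts, a greatly
# simplified stand-in for the thesis's models. Messages are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "taxes and the economy dominated tonight",
    "healthcare costs came up again and again",
    "strong words on immigration and border policy",
    "the economy, jobs, taxes, same talking points",
    "candidates clashed over healthcare reform",
]

vec = CountVectorizer(stop_words="english").fit(posts)
dtm = vec.transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-4:][::-1]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```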
... Despite some disquiet among those in the field who regard data science simply as a rebranding of statistics, aspects of data science emerged in contrast to statistics precisely because of increases in computing power and internet connectivity, which facilitated this new discipline. In a seminal paper on data science, William Cleveland argued that the computational methods open to data scientists were partly determined by the commercial companies that produce software, but praised the development of the S programming language for altering how scientists could manipulate, visualise and analyse data (Cleveland, 2001). R is the open-source successor to S, and many would say it has outshone its alphabetic neighbour! ...
... Finally, how might stakeholders encourage further hub and node development in service of national-and international-platform and infrastructure? In defining data science two decades ago, Cleveland (2001) emphasized "how valuable data science is for learning about the world" (p. 24). ...
Article
Full-text available
Addressing the data skills gap, namely the superabundance of data and the lack of human capital to exploit it, this paper argues that iSchools and Library and Information Science programs are ideal venues for data science education. It unpacks two case studies: the LIS Education and Data Science for the National Digital Platform (LEADS‐4‐NDP) project (2017–2019), and the LIS Education and Data Science‐Integrated Network Group (LEADING) project (2020–2023). These IMLS‐funded initiatives respond to four national digital platform challenges: LIS faculty prepared to teach data science and mentor the next generation of educators and practitioners, an underdeveloped pedagogical infrastructure, scattered and inconsistent data science education opportunities for students and current information professionals, and an immature data science network. LEADS and LEADING have made appreciable collaborative, interdisciplinary contributions to the data science education community; these projects comprise an essential part of the long‐awaited and much‐needed national digital platform.
... The term data science, likely coined in 2001 by W. S. Cleveland [2], has had a number of definitions. For the purposes of this paper, we consider data science to be a discipline that combines the fields of computer science, mathematics, statistics, and information technology, with a focus on the generation, organization, modeling, and use of data to make scientific and business decisions. ...
... Furthermore, data science is not a thing to be done but a shift in how people think (Guerra & Borne, 2016). Many data science methods used today were introduced decades ago by statisticians such as George Box (1976) and John Chambers (1993), computer scientists such as William Cleveland (1993, 1994, 2014), and mathematicians such as John Tukey (1962). Collectively, they set the stage for how the data science field would evolve. ...
Thesis
Many organizations are “data rich, but insight poor” and contend with challenges in developing data visualizations that facilitate insights. Studies show that numeracy, graph literacy, and cognitive processes influence how people perceive data visualizations. However, empirical studies on data visualization practices, numeracy, and graph literacy are rare. This study, which used a cognitive psychology framework, explains how data comprehension varies by data visualization practices and by measures of subjective numeracy (SNS) and subjective graph literacy (SGL). This paper presents findings from 212 participants (students and professionals) who viewed data visualizations from the U.S. Bureau of Labor Statistics. Participants were randomly assigned to one of four groups. Each group saw a table and a chart across two scenarios (the original table or chart and modified versions using data visualization best practices). There were four key findings: (a) Overall, participants who saw tables had higher data comprehension than participants who saw charts, (b) participants who saw the best practice chart had higher data comprehension than participants who saw the control stacked bar chart, (c) participants with high SNS or high SGL had higher data comprehension (except for participants with high SGL in one scenario), and (d) data comprehension correlated positively with SNS and SGL. Implications for theory and practice expand the field of data visualization. A central principle is that the seeing and the thinking of data can facilitate cognitive tasks. This study may help analysts develop a better understanding of how to communicate findings using data visualization best practices.
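One of the best practices the study tests, replacing a stacked bar chart with a sorted, directly labeled bar chart, can be sketched as follows; the category names and numbers are invented rather than taken from the BLS visualizations used in the study.

```python
# A common "best practice" rework: a sorted, directly labeled
# horizontal bar chart in place of a stacked bar chart.
# Categories and values are invented for illustration.
import matplotlib.pyplot as plt

categories = ["Retail", "Health care", "Construction", "Manufacturing"]
values = [4.1, 6.8, 2.9, 3.5]            # e.g., employment growth, percent

pairs = sorted(zip(values, categories))  # sort so the eye can rank bars
vals, cats = zip(*pairs)
y = range(len(cats))

fig, ax = plt.subplots()
ax.barh(y, vals, color="steelblue")
ax.set_yticks(y, labels=cats)
for yi, v in zip(y, vals):               # direct labels replace a legend
    ax.text(v + 0.05, yi, f"{v:.1f}%", va="center")
ax.spines[["top", "right"]].set_visible(False)   # reduce chartjunk
ax.set_xlabel("growth (%)")
plt.show()
```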
... Donoho [16] not only advocates 'greater statistics' but also 'greater data science', and initiates a vision for the latter that is far more than a 'mere scaling up to big data' and big technology, but an ongoing, 'more intellectually productive and lasting' science. The term data science has been around for a while (see, for example, [17]), including the idea of calling 'statistics' 'data science', as mooted by Wu in 1986 [18]. ...
Article
Full-text available
There has been increasing interest in recent years in training in official statistics with reference to the 2030 Agenda, big data, diversification of data types and sources, and data science. Backgrounds for work in official statistics are becoming more varied than ever. The official statistics community has also become progressively more aware of the importance of statistical literacy in education and of trust in official statistics. Hence foundation and introductory levels are of as much interest to official statistics as more specialised training. At the same time, greater access to data and vast technological capabilities have brought much emphasis on and discussion of the statistical and data sciences and education therein, including the development of educational resources in contexts such as civic data and statistics. Data science provides opportunities to renew the decades-long push for authentic learning that reflects the practice of 'greater statistics' and 'greater data science', and to examine progress to date in implementing and sustaining the extensive work and advocacy of many. This article discusses what is needed at the foundation and introductory levels to realize this advocacy, with commentary relevant to official statistics.
... An important step in creating effective action plans is setting the goal (Cramer, 2017). Also, in creating action plans, the existing requirements and limitations must be fully identified and considered (Cleveland, 2001; Pannell and Roberts, 2010). In this paper, eight applicable and effective action plans are determined for ecotourism centers (ECs) to improve their performance and boost their productivity during the Covid-19 pandemic. ...
Article
The Covid-19 pandemic has had a huge impact on most businesses and has caused serious and countless problems for them. Providing solutions that help affected businesses recover and improve their activities during pandemic times is therefore essential. Ecotourism centers are among the businesses that have faced significant dilemmas in their activities, and there is reportedly no related research focusing on recovery approaches for these kinds of businesses during the pandemic, which motivated the current research. In this paper, practical and useful action plans for ecotourism centers are first developed to help these businesses. To obtain the action plans, brainstorming sessions were held with tourism experts, university professors, managers, owners, and personnel of ecotourism centers. Four criteria are considered in order to prioritize the defined action plans: the weights of the criteria are first computed with Fuzzy DEMATEL, and the action plans are then ranked using Fuzzy VIKOR (see the sketch below). The findings of the current study reveal that AP2 “Standardization of the centers” and AP3 “Estimating demand number and increasing the capacity” have the highest priority to be implemented, while AP7 “Identifying other natural tourist attractions of the region” has the lowest.
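As a rough, hypothetical illustration of the ranking step mentioned above, the sketch below implements classical (crisp) VIKOR in Python rather than the paper's fuzzy variant; the scores, weights, and criteria are made up for demonstration and are not taken from the study.

```python
import numpy as np

def vikor(scores, weights, benefit, v=0.5):
    """Rank alternatives with classical (crisp) VIKOR.

    scores  : (m, n) array, rows = alternatives, cols = criteria
    weights : (n,) criterion weights summing to 1
    benefit : (n,) booleans, True where larger values are better
    v       : weight of the 'group utility' strategy (0.5 is customary)
    Returns Q values; lower Q = higher priority.
    """
    scores = np.asarray(scores, dtype=float)
    best = np.where(benefit, scores.max(axis=0), scores.min(axis=0))
    worst = np.where(benefit, scores.min(axis=0), scores.max(axis=0))
    # Weighted, normalized regret of each alternative on each criterion
    d = weights * (best - scores) / (best - worst)
    S = d.sum(axis=1)   # group utility
    R = d.max(axis=1)   # individual regret
    Q = (v * (S - S.min()) / (S.max() - S.min())
         + (1 - v) * (R - R.min()) / (R.max() - R.min()))
    return Q

# Hypothetical example: 3 action plans scored on 4 benefit criteria
Q = vikor(scores=[[7, 5, 8, 6], [9, 6, 7, 8], [5, 4, 6, 5]],
          weights=np.array([0.3, 0.2, 0.3, 0.2]),
          benefit=np.array([True, True, True, True]))
print(Q.argsort())  # alternative indices from highest to lowest priority
```

A fuzzy extension would replace the crisp scores and weights with fuzzy numbers before the same aggregation; the ranking logic is otherwise unchanged.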
... An early use of the term data science appears in "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics" in which Cleveland (2001) describes data science by laying out a university training program. His description is notable for including both a computing-with-data strand and a pedagogy strand. ...
Chapter
Full-text available
• Define mHealth and mobile technologies
• Discuss three mobile technology use cases in a clinical setting
• Discuss the shortcomings of medical apps for smartphones
• Enumerate the challenges of mHealth in low- and middle-income countries
• Identify the software development kits (SDKs) for the iPhone and Android OS
Preprint
Full-text available
A substantial fraction of students who complete their college education at a public university in the United States begin their journey at one of the 935 public two-year colleges. While the number of four-year colleges offering bachelor's degrees in data science continues to increase, data science instruction at many two-year colleges lags behind. A major impediment is the relative paucity of introductory data science courses that serve multiple student audiences and can easily transfer. In addition, the lack of pre-defined transfer pathways (or articulation agreements) for data science creates a growing disconnect that leaves students who want to study data science at a disadvantage. We describe opportunities and barriers to data science transfer pathways. Five points of curricular friction merit attention: 1) a first course in data science, 2) a second course in data science, 3) a course in scientific computing, data science workflow, and/or reproducible computing, 4) lab sciences, and 5) navigating communication, ethics, and application domain requirements in the context of general education and liberal arts course mappings. We catalog existing transfer pathways, efforts to align curricula across institutions, obstacles to overcome with minimally-disruptive solutions, and approaches to foster these pathways. Improvements in these areas are critically important to ensure that a broad and diverse set of students are able to engage and succeed in undergraduate data science programs.
Article
Full-text available
This paper reviews literature pertaining to the development of data science as a discipline, current issues with data bias and ethics, and the role that the discipline of information science may play in addressing these concerns. Information science research and researchers have much to offer for data science, owing to their background as transdisciplinary scholars who apply human-centered and social-behavioral perspectives to issues within natural science disciplines. Information science researchers have already contributed to a humanistic approach to data ethics within the literature and an emphasis on data science within information schools all but ensures that this literature will continue to grow in coming decades. This review article serves as a reference for the history, current progress, and potential future directions of data ethics research within the corpus of information science literature.
Article
With influences from different communities, data science has evolved to provide insights in many different data‐driven environments, including climate science. In this article, a brief review of data science and its connection to climate science will be presented. Additionally, two data science pipelines for quantifying risks from climate change are discussed. These pipelines focus on flooding due to tropical cyclone storm surge and changes in the distribution of temperature or precipitation or wind due to climate change via downscaling climate models. Finally, some key data science research areas in climate risk analytics are discussed.
Article
There is growing interest in data science and the challenges that scientists can solve through its application. The growing interest is in part due to the promise of “extracting value from data.” The pharmaceutical industry is no different in this regard, as reflected by the advancement and excitement surrounding data science. Data science brings new perspectives, new methods, new skill sets and the wider use of new data modalities. For example, there is a belief that extracting value from data integrated from multiple sources and modalities using advances in statistics, machine learning, informatics and computation can answer fundamental questions. These questions span a variety of themes including disease understanding, drug and target discovery, and trial design. By answering fundamental questions, we can not only increase knowledge and understanding but, more importantly, inform decision making: accelerating drug development through data-driven prioritization, increasingly precise and accurate measurements, optimized trial designs and operational excellence. However, with the promise of data science come obstacles to overcome, especially if data science is to live up to this promise and deliver a positive impact. These obstacles include consensus on the definition of data science, the relationship between data science and existing fields such as statistics and computing science, what should be involved in the day-to-day practice of data science, and what constitutes “good” practice. In this article, we cover these themes, highlighting issues with scientific practice from five perspectives and arguing that advances in data science will not be immune, especially in exploratory, investigative, and innovative activities. We propose a definition of data science as a coming together, but also a refocusing, of established disciplines, leading to a framework for good practice. In doing so, we aim to begin a dialogue on good data science practice in the context of drug development, where there is no industry view or consensus.
Article
Data Science (DS) has emerged from the shadows of its parents—statistics and computer science—into an independent field since its origin nearly six decades ago. Its evolution and education have taken many sharp turns. We present an impressionistic study of the evolution of DS anchored to Kuhn's four stages of paradigm shifts. First, we construct the landscape of DS based on curriculum analysis of the 32 iSchools across the world offering graduate‐level DS programs. Second, we paint the “field” as it emerges from the word frequency patterns, ranking, and clustering of course titles based on text mining. Third, we map the curriculum to the landscape of DS and project the same onto the Edison Data Science Framework (2017) and ACM Data Science Knowledge Areas (2021). Our study shows that the DS programs of iSchools align well with the field and correspond to the Knowledge Areas and skillsets. iSchools' DS curricula exhibit a bias toward “data visualization” along with machine learning, data mining, natural language processing, and artificial intelligence; go light on statistics; are slanted toward ontologies and health informatics; and give surprisingly minimal thrust to eScience/research data management, which we believe would add a distinctive iSchool flavor to DS.
Article
Full-text available
Conventional approaches to developing biomaterials and implants require intuitive tailoring of process variables, long development cycles, and high expenses. To meet biomedical and clinical demands, it is critical to accelerate the production of personalized implantable biomaterials and biomedical devices. Building on the Materials Genome Initiative, we define the concept ‘biomaterialomics’ as the integration of multi-omics data and high-dimensional analysis with artificial intelligence (AI) tools throughout the entire pipeline of biomaterials development. This data science-driven approach is envisioned to bring together on a single platform the computational tools, databases, experimental methods, machine learning, and advanced manufacturing (e.g., 3D printing) needed to develop fourth-generation biomaterials and implants, whose clinical performance will be predicted using ‘digital twins’. In analysing the key elements of the concept of ‘biomaterialomics’, significant emphasis is placed on effectively utilizing high-throughput biocompatibility data together with multiscale physics-based models, e-platforms/online databases of clinical studies, and data science approaches, including metadata management, AI/machine learning (ML) algorithms, and uncertainty predictions. Such an integrated formulation will allow one to adopt cross-disciplinary approaches to establish processing-structure-property (PSP) linkages. A few published studies from the lead author's research group serve as case studies to illustrate the formulation and relevance of the ‘biomaterialomics’ approach for three emerging research themes, i.e., patient-specific implants, additive manufacturing, and bioelectronic medicine. The increased adoption of AI/ML tools in biomaterials science, along with the training of a new generation of researchers in data science concepts, is strongly recommended. Statement of Significance: The currently practiced strategy for developing new biomaterials and implants requires intuitive tailoring of manufacturing protocols, biocompatibility assessment, and clinical studies. This leading-opinion review paper emphasizes the need to integrate the concepts and algorithms of data science with biomaterials science. It also emphasizes the need to establish a mathematically rigorous, cross-disciplinary framework that will allow systematic quantitative exploration and curation of the critical biomaterials knowledge needed to objectively drive innovation efforts within a suitable uncertainty quantification framework, as embodied in the ‘biomaterialomics’ concept, which integrates multi-omics data and high-dimensional analysis with artificial intelligence (AI) tools such as machine learning. The formulation of this approach has been demonstrated for patient-specific implants, additive manufacturing, and bioelectronic medicine.
Article
Data science is considered a young field by many. This column shares the growing trends of data science as one of the most sought-after career options and as an emerging discipline in almost every industry in the world.
Chapter
Clinical decisions are driven by all the available information in the medical and patient history. The issue currently faced is providing evidence-based medical information at the moment it is required, so that the best possible care can be delivered at that time. This can be done by making “evidence-based actionable content” available to the healthcare professional at the point of care so that they can choose the best possible decision. The actionable content gives clinical guidance in terms of recommendations that can be queried, links to relevant context, and actionable items. To discover actionable content in the healthcare domain, health analytics needs a sequence of three types of analytics: descriptive, predictive, and prescriptive. These analytics help retrieve coherent, meaningful information that can represent actionable content. Traditional machine learning techniques help transform descriptive analytics into predictive analytics. However, these techniques are data-hungry, and in the healthcare domain data comes from multiple heterogeneous sources. These heterogeneity issues are addressed using semantic technology, i.e., ontology. This chapter discusses the usability of ontologies and proposes an ontology-driven three-layer architecture to extract actionable content from multiple heterogeneous information sources in the healthcare domain.
Article
Full-text available
This AERA Open special topic concerns the large emerging research area of education data science (EDS). In a narrow sense, EDS applies statistics and computational techniques to educational phenomena and questions. In a broader sense, it is an umbrella for a fleet of new computational techniques being used to identify new forms of data, measures, descriptives, predictions, and experiments in education. Not only are old research questions being analyzed in new ways, but new questions are also emerging based on novel data and discoveries from EDS techniques. This overview defines the emerging field of education data science and discusses 12 articles that illustrate an AERA angle on EDS. Our overview describes the promise EDS holds for the field of education as well as the areas where EDS scholars could productively focus going forward.
Chapter
As a capstone to this research inquiry, the final phase frames and advocates design-derived gap-prescriptions. CSDS has thus far been systematically explored through triangulated diagnostic methods as an emerging practitioner discipline. In the preceding phases, practice-oriented diagnostic research has been undertaken, encompassing background analysis (Phase I), opinion research (Phase II), and gap analysis (Phase III). Per guidance from Doorewaard and Verschuren (2010), a design approach is a natural accompaniment to conclude diagnostic analysis in problem-solving research.
Chapter
Due to the nature of cybersecurity data science (CSDS) as a novel field emerging in the midst of rapid technological change, there is a gap in CSDS-focused organizational research. Challenges operationalizing CSDS solutions lead to a call for an increased theoretical focus on organizational problem-solving research. To address this gap, CSDS fits the profile of an organizational problem that is “relatively new or fairly complex,” necessitating an effort to “clarify the relevant background and the reasons for the problem” (Doorewaard and Verschuren 2010).
Chapter
The chapter discusses the basic concepts of computer security as well as the taxonomy and classification of the fundamental algorithms in the domains of artificial intelligence, machine learning, and data science in relation to their applications in computer security. It reviews the sources of security threats and the attacks, using the area of IoT and wireless devices as an example, as well as examines the possible protection mechanisms and tools. The module provides a general classification of intelligent approaches and their relationship to various computer security fields. It focuses on an introduction of the major intelligent techniques and technologies in computer security, such as expert systems, fuzzy logic, machine learning, artificial neural networks, and genetic algorithms. While presenting multiple techniques, the text emphasizes their advantage in comparison to each other as well as the obstacles in their further progress. Short algorithm descriptions and code examples are included.
Thesis
Full-text available
Territorial metabolism offers a paradigm for studying the flows of materials and energy across a territory. It aims to better qualify and quantify the resources mobilized and released into the environment. Nevertheless, studying this metabolism remains complex because of the quantity of data that must be gathered and processed. In this thesis, we address this problem directly. We first formalize the notions and approaches to be mobilized around data processing and metabolism. We then design an information system for analyzing the metabolism of territories (Système d'Information pour l'Analyse du Métabolisme des Territoires, SINAMET). Finally, we apply these tools to four case studies: the energy consumption of the building stock of the Eurométropole de Strasbourg, goods transported by inland waterway in the port of Strasbourg, the scale sensitivity of import and export indicators, and food material flows at the scale of the Eurométropole.
Article
Full-text available
This article presents a model for developing case studies, or labs, for use in undergraduate mathematical statistics courses. The model proposed here is to design labs that are more in-depth than most examples in statistical texts by providing rich background material, investigations and analyses in the context of a scientific problem, and detailed theoretical development within the lab. An important goal of this approach is to encourage and develop statistical thinking. It is also advocated that the labs be made the centerpiece of the theoretical course. As a result, the curriculum, lectures, and assignments are significantly restructured. For example, the course work includes written assignments based on open-ended data analyses, and the lectures include group work and discussions of the case-studies.
Article
Full-text available
Higher education faces an environment of financial constraints, changing customer demands, and loss of public confidence. Technological advances may at last bring widespread change to college teaching. The movement for education reform also urges widespread change. What will be the state of statistics teaching at the university level at the end of the century? This article attempts to imagine plausible futures as stimuli to discussion. It takes the form of provocations by the first author, with responses from the others on three themes: the impact of technology, the reform of teaching, and challenges to the internal culture of higher education.
Article
How does statistical thinking differ from mathematical thinking? What is the role of mathematics in statistics? If you purge statistics of its mathematical content, what intellectual substance remains? In what follows, we offer some answers to these questions and relate them to a sequence of examples that provide an overview of current statistical practice. Along the way, and especially toward the end, we point to some implications for the teaching of statistics.
Article
Significant advances in, and the resultant impact of, Information Technology (IT) during the last fifteen years have resulted in a much more data-based society, a trend that can be expected to continue into the foreseeable future. This phenomenon has had a real impact on the Statistics discipline and will continue to result in changes in both content and course delivery. Major research directions have also evolved during the last ten years directly as a result of advances in IT. The impact of these advances has started to flow into course content, at least for advanced courses. One question which arises is what impact this will have on the future training of statisticians, with respect to both course content and mode of delivery. At the tertiary level, the last 40 years have seen significant advances in theoretical aspects of the Statistics discipline. Universities have been outstanding at producing scholars with a strong theoretical background, but questions have been asked as to whether this has, to some degree, been at the expense of appropriate training of the users of statistics (the 'tradespersons'). Future directions in the teaching and learning of Statistics must take into account the impact of IT together with the competing need to produce scholars as well as competent users of statistics to meet the future needs of the market place. For Statistics to survive as a recognizable discipline, the ability to train statisticians who can communicate is also seen as an area of crucial importance. Satisfying the needs of society as well as meeting the needs of the profession are the basic determinants that will drive the future teaching and training of statisticians at the tertiary level, and they form the basis of this presentation.
Article
Aspects of scientific method are discussed: In particular, its representation as a motivated iteration in which, in succession, practice confronts theory, and theory, practice. Rapid progress requires sufficient flexibility to profit from such confrontations, and the ability to devise parsimonious but effective models, to worry selectively about model inadequacies and to employ mathematics skillfully but appropriately. The development of statistical methods at Rothamsted Experimental Station by Sir Ronald Fisher is used to illustrate these themes.
Article
The nature of data is rapidly changing. Data sets are becoming increasingly large and complex. Modern methodology for analyzing these new types of data is emerging from the fields of Database Management, Artificial Intelligence, Machine Learning, Pattern Recognition, and Data Visualization. So far Statistics as a field has played a minor role. This paper explores some of the reasons for this, and why statisticians should have an interest in participating in the development of new methods for large and complex data sets.
Article
The statistics community is showing increasing interest in consulting in industry. This interest has stimulated questions concerning recognition and job satisfaction, job opportunities, and educational and training needs. These questions are considered in this article. A central theme is that effective statistical consulting requires total involvement in the consulting situation and that good recognition flows naturally from such an approach. This concept is defined in operational terms.
Article
The profession of statistics has adopted too narrow a definition of itself. As a consequence, both statistics and statisticians play too narrow a role in policy formation and execution. Broadening that role will require statisticians to change the curriculum they use to train and develop their own professionals and what they teach nonstatisticians about statistics. Playing a proper role will require new research from statisticians that combines our skills in methods with other techniques of social scientists.
Article
Centre of Location. That abscissa of a frequency curve for which the sampling errors of optimum location are uncorrelated with those of optimum scaling.
Article
Data analysis is not a new subject. It has accompanied productive experimentation and observation for hundreds of years. At times, as in the work of Kepler, it has produced dramatic results.
Article
Enormous quantities of data go unused or underused today, simply because people can't visualize the quantities and relationships in it. Using a downloadable programming environment developed by the author, Visualizing Data demonstrates methods for representing data accurately on the Web and elsewhere, complete with user interaction, animation, and more. How do the 3.1 billion A, C, G and T letters of the human genome compare to those of a chimp or a mouse? What do the paths that millions of visitors take through a web site look like? With Visualizing Data, you learn how to answer complex questions like these with thoroughly interactive displays. We're not talking about cookie-cutter charts and graphs. This book teaches you how to design entire interfaces around large, complex data sets with the help of a powerful new design and prototyping tool called "Processing". Used by many researchers and companies to convey specific data in a clear and understandable manner, the Processing beta is available free. With this tool and Visualizing Data as a guide, you'll learn basic visualization principles, how to choose the right kind of display for your purposes, and how to provide interactive features that will bring users to your site over and over. This book teaches you: the seven stages of visualizing data (acquire, parse, filter, mine, represent, refine, and interact); how all data problems begin with a question and end with a narrative construct that provides a clear answer without extraneous details; several example projects with the code to make them work; and the positive and negative points of each representation discussed, with a focus on customization so that each one best suits what you want to convey about your data set. The book does not provide ready-made "visualizations" that can be plugged into any data set. Instead, with chapters divided by types of data rather than types of display, you'll learn how each visualization conveys the unique properties of the data it represents -- why the data was collected, what's interesting about it, and what stories it can tell. Visualizing Data teaches you how to answer questions, not simply display information.
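The seven-stage workflow the book describes is tool-agnostic. As a minimal sketch of the first five stages (in Python rather than the book's Processing environment; the URL and field names are placeholders, not examples from the book):

```python
import csv
import urllib.request
from collections import Counter

URL = "https://example.org/data.csv"  # placeholder data source

def acquire(url):                  # 1. acquire: fetch the raw text
    return urllib.request.urlopen(url).read().decode("utf-8")

def parse(text):                   # 2. parse: impose structure on the text
    return list(csv.DictReader(text.splitlines()))

def filter_rows(rows, key, value): # 3. filter: keep only rows of interest
    return [r for r in rows if r.get(key) == value]

def mine(rows, key):               # 4. mine: a basic summary statistic
    return Counter(r[key] for r in rows)

def represent(counts):             # 5. represent: a crude text bar chart
    for label, n in counts.most_common():
        print(f"{label:15s} {'#' * n}")

# Hypothetical usage (stages 6 refine and 7 interact would iterate on the
# display in response to what the viewer sees):
# rows = parse(acquire(URL))
# represent(mine(filter_rows(rows, "state", "NY"), "industry"))
```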
Article
An efficient method for the calculation of the interactions of a 2^m factorial experiment was introduced by Yates and is widely known by his name. The generalization to 3^m was given by Box et al. (1). Good (2) generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N × N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results in a procedure requiring a number of operations proportional to N log N rather than N². These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with N = 2^m and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients. Consider the problem of calculating the complex Fourier series (1): $X(j) = \sum_{k=0}^{N-1} A(k)\,W^{jk}$, $j = 0, 1, \ldots, N-1$.
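For concreteness, here is a minimal recursive radix-2 sketch of the N = 2^m case in Python. The paper derives an in-place, matrix-factorization form; this toy version only illustrates the N log N recursion and the sign convention of the series above (W = e^{2πi/N}).

```python
import cmath

def fft(a):
    """Recursive radix-2 Cooley-Tukey FFT; assumes len(a) is a power of two.

    Computes X(j) = sum_k A(k) * W^{jk} with W = e^{2*pi*i/N} in
    O(N log N) operations, matching the series in the abstract above.
    """
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2])   # transform of even-indexed coefficients
    odd = fft(a[1::2])    # transform of odd-indexed coefficients
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n)  # twiddle factor W^k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

print(fft([1, 1, 1, 1, 0, 0, 0, 0]))  # 8-point example
```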
Article
Statisticians regularly bemoan what they perceive to be the lack of impact and appreciation of their discipline on the part of others. And it does seem to be true, as we approach the magic 2000 number, that we are faced with the paradox of more and more information being available, more and more complex problems to be solved – but, apparently, less and less direct appreciation of the role and power of statistical thinking. The purpose of the conference talk will be to explore one possible pedagogic route to raising awareness of the central importance of statistical thinking to good citizenship – namely, a focus on public policy issues as a means of creating an awareness and appreciation of the need for and power of statistical thinking. We all know that our discipline is both intellectually exciting and stimulating in itself – and that it provides the crucial underpinning of any would-be coherent, quantitative approach to the world around us. Indeed, one might even go so far as to say that our subject is 'the science of doing science', providing theory and protocols to guide and discipline all forms of quantitative investigatory procedure. We have certainly expanded our horizons way beyond the rather modest ambitions of the founding fathers of the Royal Statistical Society, who set out to 'collect, arrange, digest and publish facts, illustrating the condition and prospects of society in its material, social and moral relations'.
Article
Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. JSTOR is an independent not-for-profit organization dedicated to and preserving a digital archive of scholarly journals. For more information regarding JSTOR, please contact support@jstor.org. How does statistical thinking differ from mathematical thinking? What is the role of mathematics in statistics? If you purge statistics of its mathematical content, what intellectual substance remains? In what follows, we offer some answers to these questions and relate them to a sequence of examples that provide an overview of current statistical practice. Along the way, and especially toward the end, we point to some implications for the teaching of statistics.
Article
There appears to be a paradox in the fact that as more and more quantitative information becomes routinely available in a world perceived to be ever more complex, there is less and less direct appreciation of the role and power of statistical thinking. It is suggested that the profession should exploit very real public concerns regarding risk aspects of public policy as a possible pedagogic route to raising statistical awareness.
Article
Stochastic substitution, the Gibbs sampler, and the sampling-importance-resampling algorithm can be viewed as three alternative sampling- (or Monte Carlo-) based approaches to the calculation of numerical estimates of marginal probability distributions. The three approaches will be reviewed, compared, and contrasted in relation to various joint probability structures frequently encountered in applications. In particular, the relevance of the approaches to calculating Bayesian posterior densities for a variety of structured models will be discussed and illustrated.
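As a toy illustration of the Gibbs sampler reviewed there (the paper treats general structured models; this sketch assumes a standard bivariate normal with known correlation, where both full conditionals are exact normals):

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter=5000, burn_in=500):
    """Minimal Gibbs sampler for a standard bivariate normal with
    correlation rho: alternately draw each coordinate from its full
    conditional, x | y ~ N(rho * y, 1 - rho**2) and symmetrically for y.
    Illustrative only, not the general scheme discussed in the paper."""
    x, y = 0.0, 0.0
    sd = math.sqrt(1 - rho ** 2)
    samples = []
    for i in range(n_iter):
        x = random.gauss(rho * y, sd)   # draw x given current y
        y = random.gauss(rho * x, sd)   # draw y given new x
        if i >= burn_in:                # discard warm-up draws
            samples.append((x, y))
    return samples

draws = gibbs_bivariate_normal(rho=0.8)
print(sum(x * y for x, y in draws) / len(draws))  # sample E[xy], approx. rho
```

The retained draws approximate the joint distribution, so marginal quantities (the paper's focus) can be estimated by simple averages over the samples.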
Article
This paper examines work in "computing with data"---in computing support for scientific and other activities to which statisticians can contribute. Relevant computing techniques, besides traditional statistical computing, include data management, visualization, interactive languages and user-interface design. The paper emphasizes the concepts underlying computing with data, with emphasis on how those concepts can help in practical work. We look at past, present, and future: some concepts as they arose in the past and as they have proved valuable in current software; applications in the present, with one example in particular, to illustrate the challenges these present; and new directions for future research, including one exciting joint project.
References

Nicholls, D. (2000). Future Directions for the Teaching and Learning of Statistics at the Tertiary Level. Technical report, Department of Statistics and Econometrics, Australian National University.

Nolan, D. and T. Speed (1999). Teaching Statistics Theory Through Applications. The American Statistician 53, 370–375.

Smith, A. F. M. (2000). Public Policy Issues as a Route to Statistical Awareness. Technical report, Department of Mathematics, Imperial College.

Tukey, J. W. and M. B. Wilk (1986). Data Analysis and Statistics: An Expository Overview. In L. V. Jones (Ed.), The Collected Works of John W. Tukey, pp. 549–578. New York: Chapman & Hall.

Wegman, E. J. (2000). Visions: The Evolution of Statistics. Technical report, Center for Computational Statistics, George Mason University.