Article

Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics

Author: William S. Cleveland

Abstract

An action plan to enlarge the technical areas of statistics focuses on the data analyst. The plan sets out six technical areas of work for a university department and advocates a specific allocation of resources devoted to research in each area and to courses in each area. The value of technical work is judged by the extent to which it benefits the data analyst, either directly or indirectly. The plan is also applicable to government research labs and corporate research organizations.

1 Summary of the Plan

This document describes a plan to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called "data science." The focus of the plan is the practicing data analyst. A basic premise is that technical areas of data science should be judged by the extent to which they enable the analyst to learn from data. The benefit of an area can be direct or indirect. Tools that are used by...


... We assume that data science, as a discipline, must involve data (at any scale) and science, defined as "knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through scientific method" (Merriam-Webster; see also Wu, 1997; Cleveland, 2001; Leek, 2013; Donoho, 2017). Two core assumptions of this project are: ...
... 1. Data science is an extension of applied statistics (Wu, 1997; Cleveland, 2001; Leek, 2013; Donoho, 2017); and 2. In order to merit the word "science", data science is statistics that is computationally intense and applied to solving problems in a scientific way, i.e., reproducibly and rigorously. ...
... Clearly, the definition and assumptions make many, if not most, statisticians into data scientists (Wu, 1997; Cleveland, 2001; Leek, 2013; Donoho, 2017). However, these assumptions may also challenge undergraduate programs that currently use the label "data science" to demonstrate (or integrate) the development of scientific reasoning and skills in their students. ...
Preprint
Full-text available
Consensus-based publications of both competencies and undergraduate curriculum guidance documents targeting data science instruction for higher education have recently been published. Recommendations for curriculum features from diverse sources may not result in consistent training across programs. A Mastery Rubric was developed that prioritizes the promotion and documentation of formal growth, as well as the development of independence, for the 13 requisite knowledge, skills, and abilities for professional practice in statistics and data science (SDS). The Mastery Rubric (MR)-driven curriculum can emphasize computation, statistics, or a third discipline in which either of the others would be deployed, or all three can be featured. The MR-SDS supports each of these program structures while promoting consistency with international, consensus-based curricular recommendations for statistics and data science, and allows 'statistics', 'data science', and 'statistics and data science' curricula to consistently educate students with a focus on increasing learners' independence. The Mastery Rubric construct integrates findings from the learning sciences and cognitive and educational psychology to support teachers and students through the learning enterprise. The MR-SDS will support higher education as well as the interests of business, government, and academic workforce development, bringing a consistent framework to address challenges that exist for a domain that is claimed to be both an independent discipline and part of other disciplines, including computer science, engineering, and statistics. The MR-SDS can be used for the development or revision of an evaluable curriculum that will reliably support the preparation of early (e.g., undergraduate degree programs), middle (e.g., upskilling and training programs), and late (e.g., doctoral-level training) practitioners.
... This paper provides candidate definitions for essential data science artifacts that are required to discuss such a definition. They are based on the classical research paradigm concept [15], consisting of a philosophy of data science, the data science problem-solving paradigm, and the six-component data science reference framework (axiology, ontology, epistemology, methodology, methods, and technology), a unifying framework that is frequently called for [1][4][7][10][11][16][19] with which to define, unify, and evolve data science. It presents challenges of defining data science and solution approaches, i.e., means for defining data science, with their requirements and benefits, the basis of a comprehensive solution [24]-[32]. ...
... The philosophy provides the philosophical underpinning for research with which to discover, reason about, understand, articulate, and validate the true nature of the ultimate questions about phenomena in a specific discipline as knowledge about those phenomena. The research paradigm reference framework consists of six components: axiology, ontology, epistemology, methodology, methods, and technology. Full definitions of the components can fill entire books. ...
... For modern science and data science, methods and technology are added to Comte's definition. A research paradigm, e.g., science, has one methodology; e.g., the scientific method governs the design and execution of experiments, while analyses in data science are governed by the data science method. ...
Preprint
Full-text available
Data science is not a science. It is a research paradigm. Its power, scope, and scale will surpass science, our most powerful research paradigm, to enable knowledge discovery and change our world. We have yet to understand and define it, which is vital to realizing its potential and managing its risks. Modern data science is in its infancy. Emerging slowly since 1962 and rapidly since 2000, it is a fundamentally new field of inquiry, one of the most active, powerful, and rapidly evolving 21st-century innovations. Due to its value, power, and applicability, it is emerging in 40+ disciplines, hundreds of research areas, and thousands of applications. Millions of data science publications contain myriad definitions of data science and of data science problem solving. Due to this infancy, many definitions are independent, application-specific, mutually incomplete, redundant, or inconsistent, and hence so is data science. This research addresses the multiple-definitions challenge by proposing the development of a coherent, unified definition, based on a data science reference framework, using a data science journal for the data science community to achieve such a definition. This paper provides candidate definitions for essential data science artifacts that are required to discuss such a definition. They are based on the classical research paradigm concept consisting of a philosophy of data science, the data science problem-solving paradigm, and the six-component data science reference framework (axiology, ontology, epistemology, methodology, methods, technology), a unifying framework that is frequently called for and with which to define, unify, and evolve data science. It presents challenges for defining data science and solution approaches, i.e., means for defining data science, together with their requirements and benefits, as the basis of a comprehensive solution.
... Herbert Alexander Simon (1916-2001), Allen Newell (1927-1992), and John Clifford Shaw had worked on programs for solving "ultracomplicated" problems, for instance from the domains of chess, Euclidean geometry, matchstick puzzles, and symbolic logic. At the RAND Corporation in Santa Monica, California, they had designed the "Logic Theorist", a system that carried out proofs of some theorems as demonstrated in the "Principia Mathematica". ...
... In addition, the algorithms of artificial neural networks and deep learning, as well as other machine-learning algorithms, became popular. The statistician William Swain Cleveland (b. 1943) now also advocated that a data science should emerge from statistics, with statisticians turning to the technical aspects of the discipline of computer science and collaborating with its representatives (Cleveland 2001). As recently as 1997, Friedman had, in a keynote lecture at the 29th ...
Chapter
Full-text available
What does ethical responsibility look like in the age of digitalization, datafication, and artificial intelligence? The contributors offer well-founded insights into AI-supported decision-making and judgment. From digital operationalization and the role of the human being at the center of technological progress to the design of trustworthy systems, the focus is on a discussion of opportunities and challenges that offers not only academics a wealth of stimuli for further engagement with the topic.
... Attwood et al. (2019) noted that it is often difficult to find suitably qualified candidates for DSE lecturing posts. The technical aspects of data science are often crucial and can be immensely beneficial when more resources are available (Cleveland, 2001). Therefore, collaboration among universities can accelerate the creation of an environment where data science exists as a cross-campus endeavor that involves faculties and students in different departments (Van Dusen et al., 2019). ...
... Y. Kim et al., 2018), and the ability to scale up data science (Donoghue et al., 2021). Topics on teaching pedagogies are not often initiated, yet many graduates proceed to take up teaching roles (Cleveland, 2001). ...
Article
Full-text available
Aim/Purpose: This study aimed to evaluate the extant research on data science education (DSE) to identify the existing gaps, opportunities, and challenges, and make recommendations for current and future DSE. Background: There has been an increase in the number of data science programs especially because of the increased appreciation of data as a multidisciplinary strategic resource. This has resulted in a greater need for skills in data science to extract meaningful insights from data. However, the data science programs are not enough to meet the demand for data science skills. While there is growth in data science programs, they appear more as a rebranding of existing engineering, computer science, mathematics, and statistics programs. Methodology: A scoping review was adopted for the period 2010–2021 using six scholarly multidisciplinary databases: Google Scholar, IEEE Xplore, ACM Digital Library, ScienceDirect, Scopus, and the AIS Basket of eight journals. The study was narrowed down to 91 research articles and adopted a classification coding framework and correlation analysis for analysis. Contribution: We theoretically contribute to the growing body of knowledge about the need to scale up data science through multidisciplinary pedagogies and disciplines as the demand grows. This paves the way for future research to understand which programs can provide current and future data scientists the skills and competencies relevant to societal needs. Findings: The key results revealed the limited emphasis on DSE, especially in non-STEM (Science, Technology, Engineering, and Mathematics) disciplines. In addition, the results identified the need to find a suitable pedagogy or a set of pedagogies because of the multidisciplinary nature of DSE. Further, there is currently no existing framework to guide the design and development of DSE at various education levels, leading to sometimes inadequate programs. The study also noted the importance of various stakeholders who can contribute towards DSE and thus create opportunities in the DSE ecosystem. Most of the research studies reviewed were case studies that presented more STEM programs as compared to non-STEM. Recommendations for Practitioners: We recommend CRoss Industry Standard Process for Data Mining (CRISP-DM) as a framework to adopt collaborative pedagogies to teach data science. This research implies that it is important for academia, policymakers, and data science content developers to work closely with organizations to understand their needs. Recommendation for Researchers: We recommend future research into programs that can provide current and future data scientists the skills and competencies relevant to societal needs and how interdisciplinarity within these programs can be integrated. Impact on Society: Data science expertise is essential for tackling societal issues and generating beneficial effects. The main problem is that data is diverse and always changing, necessitating ongoing (up)skilling. Academic institutions must therefore stay current with new advances, changing data, and organizational requirements. Industry experts might share views based on their practical knowledge. The DSE ecosystem can be shaped by collaborating with numerous stakeholders and being aware of each stakeholder’s function in order to advance data science internationally. 
Future Research: The study found that there are a number of research opportunities that can be explored to improve the implementation of DSE, for instance, how can CRISP-DM be integrated into collaborative pedagogies to provide a fully comprehensive data science curriculum?
... The first use of the term in its modern sense is dated to 2001 (Donoho, 2015: 13; Andrus, Cook, Sood, 2017: 1). It refers to the paper Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics (Cleveland, 2001). Its author, William S. Cleveland, a long-time collaborator of John Tukey at the Bell research laboratory and a professor of statistics, calls for expanding the technical areas of academic statistics so as to better support data analysts working for government or commercial entities (Donoho, 2015: 13). ...
... Discussions of the relationship between DS and statistics have been ongoing since the late 1990s. Proposals have included renaming statistics to DS (Cao, 2017a: 7) in order to orient the field more empirically (Wu, 1997), expanding statistics to cover computational problems and collaboration with computer scientists (Cleveland, 2001), adopting into statistics a different approach to modeling, the so-called algorithmic modeling culture (Breiman, 2001), and recognizing statistics and ML as playing a central role in DS (van Dyk et al., 2015). ...
Book
Full-text available
Is artificial intelligence taking away our jobs? Are algorithms taking over the world? Does big data mean we are under constant surveillance? Is the sheer volume of data replacing experts and scientists? Whatever we think about these questions, one thing is certain: there is a heterogeneous community of people working on so-called "artificial intelligence" and so-called "big data" from the technical and methodological side. Their field of activity is called data science, and they are data scientists. This book is devoted precisely to them, the Polish data science community. It is the first sociological monograph on data science and the first work in the social sciences in which data science is studied as a social world in the sense of Adele E. Clarke. This approach makes it possible to look at data science, dubbed a decade ago in the Harvard Business Review "the sexiest job of the 21st century", both from the perspective of its participants and from a bird's-eye view, in relation to academia, business, law, media, and politics.
... The need for alternative approaches to analysis has been felt in particular by data and information analysts, who very often cannot ensure an even sample distribution and at the same time must combine secondary and primary data from different data models. This additionally burdens inference with external factors that cannot be accounted for in the research model and can be supported only by intuition and inductive reasoning (Cleveland, 2001; Ziemba, 1961). In recent years, interest has also grown in Bayesian statistics in the context of controlled experiments in IT companies, where the probability of achieving better results or higher revenues is key to decision making (Kamalbasha & Eugster, 2021). ...
... Research models used in this way can also be employed by practitioners in the diagnostics of information management, e.g., business analysts, who develop their expert knowledge on the basis of facts and scientific evidence, on which they build their conjectures (Cleveland, 2001), and above all conduct their own institutional analyses to support decisions in various types of organizations (Zych, 2020). Bayesian inference based on the probability of confirming a research hypothesis, rather than on the frequentist negation of statistical error, may therefore be more useful for assessing, in statistical analysis, the possible effects of decisions concerning the shaping of information culture in the competency context (investment in courses broadening information competencies) and in the normative context (investment in shaping interpersonal relations and digital communication). ...
Chapter
Full-text available
Purpose/thesis: Bayesian inference in statistical analyses is part of a pragmatic approach to conducting quantitative and mixed research. The aim of this article is to present a Bayesian interpretation of the results of research on information culture in the academic environment. Concept/research methodology: A comparative analysis was carried out on data previously used in a study of information culture in an organizational environment. Results and conclusions: Replacing the test probability value p with the Bayes factor increases the potential for interpreting research results in the sciences of social communication and media, which largely rely on interval-type scales, whose interpretation under the assumptions of frequentist statistics is difficult. Research limitations: The presented comparison does not cover all possible tests and implementations of the Bayes factor, but only the testing of research hypotheses as the basic procedure for describing and verifying scientific evidence in an interdisciplinary context. In line with the author's intention and competencies, what is described is the potential usefulness of the Bayes factor, not a mathematical proof of the superiority of Bayesian over frequentist statistics. Originality/cognitive value: Bayesian statistics is a branch of statistical analysis with a pragmatic character. This type of analysis is becoming increasingly popular in interdisciplinary discourse, particularly in the context of the search for alternative and universal methods of supporting decisions and describing scientific evidence that would be comprehensible across many scientific disciplines at once.
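Since the chapter's central move is replacing the p-value with a Bayes factor, a minimal worked sketch may make the contrast concrete. The Python snippet below uses the BIC approximation to the Bayes factor for a one-sample location test on synthetic data; it is an illustration under assumed data and a deliberately simple approximation, not the chapter's own computation.

```python
# Contrast a p-value with a Bayes factor on the same (synthetic) data.
# BF10 is approximated from BICs (Wagenmakers-style approximation).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=0.3, scale=1.0, size=40)  # hypothetical centered survey scores
n = x.size

# Frequentist reading: one-sample t-test against a mean of 0.
t, p = stats.ttest_1samp(x, popmean=0.0)

def gaussian_loglik(data, mu):
    """Gaussian log-likelihood with sigma set to its MLE given mu."""
    sigma = np.sqrt(np.mean((data - mu) ** 2))
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# H0: mean fixed at 0 (one free parameter); H1: mean free (two parameters).
bic0 = -2 * gaussian_loglik(x, 0.0) + 1 * np.log(n)
bic1 = -2 * gaussian_loglik(x, x.mean()) + 2 * np.log(n)
bf10 = np.exp((bic0 - bic1) / 2)  # evidence for H1 relative to H0

print(f"t = {t:.2f}, p = {p:.4f}, BF10 ~ {bf10:.2f}")
```

Where a p-value only bounds the error of rejecting H0, BF10 reads directly as how many times more likely the data are under H1 than under H0, which is the interpretive gain the chapter argues for.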
... However, only a small amount of smart transport literature has directly and explicitly discussed complexity theory in the theoretical part (Ribeiro et al., 2021;Docherty et al., 2018;Kester, 2018;Field and Jon, 2021;Moscholidou and Pangbourne, 2019;Icasiano and Taeihagh, 2021). Although data analytics is widely seen in the methodology part of the reviewed studies on smart transport governance (Sudmant et al., 2021;Nasser et al., 2021), data science is more than analysis and the theoretical aspects are often neglected (Donoho, 2017;Kang et al., 2019;Cleveland, 2014). ...
... Data science can enhance the accuracy and validity of science through scientific data analysis, but it is beyond data analysis (Donoho, 2017). The six key domains in data science are: 1) data collection, wrangling, and exploration, 2) data representation and transformation, 3) computing with data in programming languages, 4) data modelling, 5) data visualisation and interpretation, and 6) science and theory about data science (Donoho, 2017;Kang et al., 2019;Cleveland, 2014). ...
Thesis
With the increasing popularity of the concept “smart city”, many cities have adopted smart governance to address complex socio-economic and spatial issues in urban areas. Smart transport governance is applying innovations in the process of collective decision making in response to the technological and other changes in smart transport development. Governing smart transport, as a key priority in smart cities, faces old and new challenges such as managing complex uncertainties, considering alternative futures, involving citizens and correct analysis of their needs, as well as changing roles of governance. Robust theoretical and practical understandings of smart transport governance are useful for planners and policymakers to address these challenges and transform the urban mobility system towards accessible, sustainable, and innovative futures. This PhD research explores the complexities in smart transport governance from theoretical, methodological, and practical aspects with a special focus on citizens’ needs. Four gaps in theory, methods, and practice are addressed in six chapters. In Chapter 2, a systematic literature review is performed to enhance the theoretical understanding of smart transport governance and its linkage with complexity theory in cities (CTC) and urban data science (UDS). A citizen-centric adaptive governance framework is proposed. Using the proposed framework to understand specific issues in smart transport governance, Chapters 3-5 conduct empirical studies. Chapter 3 first assesses the existing smart transport governance and development, using a new evaluation framework. Within English metropolitan areas, Greater London ranks first in smart transport development. Chapter 4 zooms into Greater London and applies novel methods to understand citizens’ activity-travel patterns with uncertainties. Typical activity-travel patterns before COVID-19 and the emerging self-organising changes when COVID-19 first hit London are identified. To supply quick insights into the pandemic’s impact on different sub-systems, Chapter 5 senses the public opinion towards different transport sub-systems through real-time social media big data. Dynamic behavioural changes and potential opportunities for smart transport transitions are found. The outcomes of this research support the idea that CTC and UDS can enhance existing smart transport governance in terms of adaptive planning, robust analysis, and citizen involvement. We have identified and discussed emerging technologies and abrupt crises that add complexity to the urban transport sector on its way to transforming into smart transport. Adaptive understanding with the help of citizen-centric data is crucial for planning uncertain futures. Despite some limitations, the studies can provide theoretical and practical implications for smart transport governance in an increasingly complex world. The study also shows significant potential for future development and further applications of the adaptive governance framework.
... Some studies (McAfee & Brynjolfsson, 2012) indicate that organizations with a data-driven approach improve their productivity and profitability. According to international consultancies (Acumen Research & Consulting, 2022) ... Data analysis can be defined as multidisciplinary work, integrating the efforts of professionals from different disciplines, that manages data with the purpose of generating information to improve the decision-making process (Cleveland, 2014). ...
Article
The professional link between university and industry is an objective of our academic work. This paper documents an articulation project based on the analysis of telecommunications service data. The performance of the engineering students who implement it is assessed using analytic rubrics developed for this work. To that end, a professional profile comprising specific technical skills and transversal generic competencies was considered. Digital transformation processes bring about profound changes, among others in the production of goods and services. In this context, obtaining relevant information from available data is a valuable contribution for organizations. The project is based on an exploratory study of real data, using the Python platform and its various libraries, to build a profile of customers with a high potential for abandoning telecommunications services. It aims to contribute to fostering activities that integrate academia and industry, as well as to reflect the value that the application of new assessment tools can bring to organizations, students, and teachers.
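The abstract mentions Python and its libraries for the exploratory churn study but prints no code, so here is a sketch of the general shape of such an analysis. The CSV file and column names are hypothetical stand-ins for the project's telecom data, and the shallow decision tree is one plausible way to obtain a readable profile of customers likely to churn.

```python
# Exploratory churn profiling: fit a small, readable decision tree.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("telecom_customers.csv")  # hypothetical service-data export

features = ["tenure_months", "monthly_charges", "support_calls", "contract_type"]
X = pd.get_dummies(df[features], columns=["contract_type"])  # encode categories
y = df["churned"]  # 1 = customer abandoned the service

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# max_depth=3 keeps the extracted rules short enough to read as a "profile".
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {tree.score(X_test, y_test):.2f}")
print(export_text(tree, feature_names=list(X.columns)))  # rules = churn profile
```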
... In 2001, William S. Cleveland laid out plans for training data scientists to meet the needs of the future. He presented an action plan titled "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics" (Cleveland, 2001). In 2011, job listings for data scientists increased by 15,000%. There was also an increase in seminars and conferences devoted specifically to data science and big data. ...
Thesis
Full-text available
We live in the Big Data era; as our daily interactions move from the physical world to the digital world, every action we take generates data. Information pours from our mobile devices and computers, from every file we save and every social media interaction we make; it is even generated when we do something as simple as asking Google for directions to the nearest gas station. Data science is the key to making this flow of information helpful. Simply put, it is the art of employing data to predict our future behavior, discover hidden patterns, and provide information or draw meaningful conclusions from these vast untapped data resources. These vague and misty definitions are shared with other modern fields such as data mining, machine learning, and artificial intelligence. So, what are the differences between these fields and data science? Furthermore, what is data science in practice? As far as I know, the subject of data science is not well known to most Libyan statisticians, and there is no Libyan research in this field. Therefore, this thesis provides additional knowledge to those interested in data science, especially Libyan scientists, and is considered a first step toward introducing data science to researchers in the Department of Statistics at the University of Tripoli. The primary purpose of this thesis is to clarify the vague definitions of data science and to show that statistics is the base behind all of its theories, while other fields merely supply advanced tools for applying statistical analysis to enormous amounts of data. We focus on statistics and its contributions to data science by analyzing and discussing, in some detail, the fundamental steps of the data science process, using the European Soccer Database (2008-2016) as a case study, with an SQLite database and the R programming language, version 4.1.1. To apply the data science process to this case study, many statistical techniques are used for different purposes, such as descriptive statistics, confidence intervals, and experimental designs, including factorial experiments with interaction and factorial experiments with blocks and interaction, in addition to some data mining techniques such as clustering with the K-means algorithm and classification with the decision tree algorithm.
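The thesis works in R against an SQLite database; purely as an illustration of two of the steps it lists (querying the database, then clustering with K-means), here is a Python analogue. The table and column names follow the public European Soccer Database, but treat the whole snippet as an assumed reconstruction rather than the thesis's own code.

```python
# Pull player attributes out of SQLite and group players with K-means.
import sqlite3
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

con = sqlite3.connect("database.sqlite")  # the soccer database file
players = pd.read_sql_query(
    "SELECT overall_rating, potential, stamina, strength "
    "FROM Player_Attributes WHERE overall_rating IS NOT NULL",
    con,
)
con.close()
players = players.dropna()  # K-means cannot handle missing attribute values

X = StandardScaler().fit_transform(players)  # put attributes on a common scale
players["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
print(players.groupby("cluster").mean().round(1))  # profile of each player group
```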
... He has made an effort to popularize the term "data science" and advocated that statistics be renamed data science [10,11]. Another renowned statistician, William S. Cleveland, also promoted expanding the focus of statistics from theory to data science and data analysis [12]. Moreover, data science encompasses both statistics and data-related computer science, highlighting the importance of statistics in processing and analyzing data effectively. ...
Article
Full-text available
The advent of the Big Data era has necessitated a transformational shift in statistical research, responding to the novel demands of data science. Despite extensive discourse within statistical communities on confronting these emerging challenges, we offer our unique perspectives, underscoring the extended responsibilities of statisticians in pre-analysis and post-analysis tasks. Moreover, we propose a new definition and classification of Big Data based on data sources: Type I Big Data, which is the result of aggregating a large number of small datasets via data sharing and curation, and Type II Big Data, which is the Real-World Data (RWD) amassed from business operations and practices. Each category necessitates distinct data preprocessing and preparation (DPP) methods, and the objectives of analysis as well as the interpretation of results can significantly diverge between these two types of Big Data. We further suggest that the statistical communities should consider adopting and rapidly incorporating new paradigms and cultures by learning from other disciplines. Particularly, beyond Breiman's (Stat Sci 16(3):199–231, 2001) two modeling cultures, statisticians may need to pay more attention to a newly emerging third culture: the integration of algorithmic modeling with multi-scale dynamic modeling based on fundamental physics laws or mechanisms that generate the data. We draw from our experience in numerous related research projects to elucidate these novel concepts and perspectives.
... Introduction. In recent years, the development of technology has enabled significant progress in the field of natural language processing (NLP). ...
Article
This article examines interdisciplinary issues of natural language research and the prospects of doctoral education in the context of the triad "Intellectualization – Virtualization – Big Data". Recent technological advances have significantly contributed to progress in natural language processing (NLP). Modern NLP relies heavily on machine and deep learning methods, allowing computers to learn from large datasets and perform tasks with high accuracy. The main aspects covered in the article include the use of neural networks, attention mechanisms, and transfer learning. The major challenges facing modern NLP are discussed, such as ambiguity, limited data, lack of context, and ethical problems. Important research directions include multimodality, explainable neural networks, and the development of NLP systems for low-resource languages. The article emphasizes the technological status of NLP and its significance in the context of intellectualization, virtualization, and big data analysis. Possible research directions are identified, such as improving the intellectualization of NLP systems, creating virtual environments for NLP applications, and using big data analytics to increase the accuracy and efficiency of NLP systems. In addition, the article considers doctoral training in the context of using NLP and big data analysis, focusing on developing competencies in research data management and the methodological capital of educational data science. The conclusions underline the importance of integrating ethical aspects into the use of NLP technologies and their significance for modern doctoral education in Ukraine. The study emphasizes the critical need for a comprehensive approach to the analysis of language, virtual environments, and big data for the development of NLP systems and the modernization of doctoral education.
... Data scientists have told origin stories that centered on Facebook and LinkedIn in their early startup days, struggling to get users to connect and navigate the then-new world of social media (Hammerbacher, 2009;Davenport and Patil, 2012), but the data science label first appeared in academic circles during the 1990s and early 2000s (e.g., Hayashi, 1998;Cleveland, 2001), and many underlying ideas are much older (Donoho, 2015;González-Bailón, 2017). Data scientists recognize their ties to established quantitative expertise and present their integration of it with computer sciences as a distinguishing feature (e.g., Schutt and O'Neil, 2013). ...
Article
Full-text available
Introduction “Data scientists” quickly became ubiquitous, often infamously so, but they have struggled with the ambiguity of their novel role. This article studies data science's collective definition on Twitter. Methods The analysis responds to the challenges of studying an emergent case with unclear boundaries and substance through a cultural perspective and complementary datasets ranging from 1,025 to 752,815 tweets. It brings together relations between accounts that tweeted about data science, the hashtags they used, indicating purposes, and the topics they discussed. Results The first results reproduce familiar commercial and technical motives. Additional results reveal concerns with new practical and ethical standards as a distinctive motive for constructing data science. Discussion The article provides a sensibility for local meaning in usually abstract datasets and a heuristic for navigating increasingly abundant datasets toward surprising insights. For data scientists, it offers a guide for positioning themselves vis-à-vis others to navigate their professional future.
... From the Venn diagram in Figure 4, it is quite evident that the three fields existed independently before the introduction of the data science nomenclature. The field emerged only after the efforts of several authors, such as John Tukey [23], John Chambers [24], and William Cleveland [25], among others, who called for the establishment of an interdisciplinary field based on the expansion of statistics and its integration with computer science and domain knowledge [26], [27]. ...
Article
Full-text available
Over the last decade, Data Science has emerged as one of the most important subjects that has had a major impact on industry. This is due to the continual development of scientific methods, algorithms, processes, and computational tools that help to extract knowledge from raw data efficiently and cost-effectively, compared with early-generation tools. Professional data scientists create code that processes, analyses and extracts actionable insights from high volumes of data. This process requires a deep understanding of mathematical principles, statistics, business knowledge, and computer science. But most importantly, the data science development chain requires knowledge of a high-level programming tool and its dependencies. This is a major problem in some aspects due to the steep learning curve. In this paper, we describe and present a modularized Data Science curriculum for undergraduate learners that relies on no-code software development tools as programming aids for non-computer science majors. No-code development tools have been added to the traditional teaching pedagogy to improve students’ motivation and conceptual understanding of coding despite their limited programming skills. The study aims to assess the impacts of visual programming languages on the performance of non-computer science majors on programming problems. The study’s sample consists of 50 fourth-year students from the Faculty of Science and Technology at the Midlands State University. A post-survey questionnaire and assessment items were administered to the control and experimental groups. Results show that the students drawn from the experimental group benefited from the use of a visual programming language. These results offer evidence-based recommendations for incorporating high-performance no-code software development tools in the formal curriculum to aid teaching and learning data science programming for students of diverse academic backgrounds.
... These skills and tools augment research on the human brain with elements of software engineering, automation, scalable computing, provenance tracking, and advanced statistical and data visualization methods, which are sometimes collectively known as "data science" (Donoho, 2017; Cleveland, 2001). While there is a growing appreciation for the importance of data science tools and skills in neuroscience research, there is still a dearth of opportunities for students and researchers in neuroscience to learn about them. ...
Article
Full-text available
NeuroHackademy (https://neurohackademy.org) is a two-week event designed to train early-career neuroscience researchers in data science methods and their application to neuroimaging. The event seeks to bridge the big data skills gap by introducing participants to data science methods and skills that are often ignored in traditional curricula. Such skills are needed for the analysis and interpretation of the kinds of large and complex datasets that have become increasingly important to neuroimaging research due to concerted data collection efforts. In 2020, the event rapidly pivoted from an in-person event to an online event that included hundreds of participants from all over the world. This experience and those of the participants substantially changed our valuation of large online-accessible events. In subsequent events held in 2022 and 2023, we have developed a “hybrid” format that includes both online and in-person participants. We discuss the technical and sociotechnical elements of hybrid events and discuss some of the lessons we have learned while organizing them. We emphasize in particular the role that these events can play in creating a global and inclusive community of practice in the intersection of neuroimaging and data science.
... The term "data science" appeared in 1974 but took almost 30 years to be identified as a discipline of its own [1], specifically as an enlargement of "the major areas of technical work of the field of statistics" [2]. However, the proliferation of data-driven applications (especially with the use of AI in recent years) has given rise to a variety of ethical, legal, and societal challenges that led these kinds of applications to be characterised as "Weapons of Math Destruction" [3]. ...
Article
Full-text available
The use of artificial intelligence (AI) applications in a growing number of domains in recent years has put into focus the ethical, legal, and societal aspects (ELSA) of these technologies and the relevant challenges they pose. In this paper, we propose an ELSA curriculum for data scientists aiming to raise awareness about ELSA challenges in their work, provide them with a common language with the relevant domain experts in order to cooperate to find appropriate solutions, and finally, incorporate ELSA in the data science workflow. ELSA should not be seen as an impediment or a superfluous artefact but rather as an integral part of the Data Science Project Lifecycle. The proposed curriculum uses the CRISP-DM (CRoss-Industry Standard Process for Data Mining) model as a backbone to define a vertical partition expressed in modules corresponding to the CRISP-DM phases. The horizontal partition includes knowledge units belonging to three strands that run through the phases, namely ethical and societal, legal and technical rendering knowledge units (KUs). In addition to the detailed description of the aforementioned KUs, we also discuss their implementation, issues such as duration, form, and evaluation of participants, as well as the variance of the knowledge level and needs of the target audience.
... Statisticians and statistics educators have sought to frame data science as an extension of statistics, manifested in acquiring "new" skills for investigating, analyzing, and modeling large data sets. This has led to calls to reform and expand the field of statistics to emphasize the preparation, analysis, and presentation of data (Cleveland, 2001). Others have argued that data science is more closely aligned with computer science and mathematics. ...
... The use of data science to improve the quality of military human resource development in Indonesia continues to grow. Perspectives on this use include data science technology as a decision-support tool, monitoring of the physical and mental health of military personnel, development of monitoring and control systems, and development of national security information systems (Cleveland, 2001). Previous research on the use of data science in the military domain includes Adiprasetyo's 2019 study on the development of an anomaly detection system for military assets using machine learning. ...
Article
Full-text available
In today's digital era, data has become a very valuable asset in decision making in the military field. Data science can be used for various purposes, such as risk prediction and analysis, policy development, human resource development, operational efficiency improvement, and predicting a person's length of service in the military. However, effective implementation of data science also requires adequate data infrastructure, strict data security, and human resources skilled and trained in data science. The purpose of this study is to examine the effect of using data science on the quality of human resource development; from this point of view, data science is considered able to make a significant contribution to increasing the effectiveness and efficiency of military human resource development in Indonesia. The research method is a survey aimed at obtaining data on the use and benefits of data science in developing military human resources, with a sample of 50 people; the data were processed with SPSS. Utilization of data science is the independent variable in this study, the quality of military human resource development in Indonesia is the dependent variable, and the education level and experience of military human resources are control variables. The results of multiple linear regression tests show that the utilization of data science, education level, and experience of military human resources simultaneously have a significant effect on the quality of military human resource development in Indonesia.
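The study reports its multiple linear regression from SPSS; a rough programmatic equivalent with Python's statsmodels is sketched below for readers who want to see the model's form. The data file and column names mirror the study's variables (data science utilization, education level, experience, HR development quality) but are hypothetical.

```python
# Multiple linear regression with a simultaneous (joint) significance test.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey_responses.csv")  # hypothetical n=50 survey export

# Dependent: HR development quality; independent: data science utilization;
# controls: education level and years of experience.
model = smf.ols(
    "hr_quality ~ ds_utilization + education_level + experience_years", data=df
).fit()
print(model.summary())   # coefficients, t-tests, R-squared
print(model.f_pvalue)    # p-value of the joint F-test ("simultaneous effect")
```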
... Statistics as a profession has undergone significant changes since the 1990s, when it started to be considered more than just a branch of mathematics, but rather "data science" (Cleveland, 2001). This change cannot be understood without the spread of technological developments that enabled the collection and effective analysis of massive amounts of data. ...
The fast evolution of statistics caused by technological developments is the basis of numerous challenges in undergraduate courses. Project-based learning (PBL) and similar methodologies have been proposed by various authors to develop students' "statistical sense". This paper presents a case study regarding the implementation of three study and research paths (SRPs) in an undergraduate statistics course. SRPs are inquiry-based teaching devices developed in the Anthropological Theory of the Didactic. This study aims to analyse the inquiry dynamics that SRP implementations promote, the influence of teachers' interventions, and the role of external instances. The way statistical knowledge evolves and links descriptive and inferential statistics is also examined. The results of this study suggest that SRPs can contribute to the self-sustainability of the inquiry process, which is not often the case in other approaches such as PBL. This study also highlights the use of question–answer maps to analyse the dynamics of the inquiry process in terms of the connectivity of the knowledge involved. Future research points to the study of the conditions needed to generate more adidacticity in PBL proposals.
... Many academic courses in statistics for non-STEM students include various types of classification methods, and there is broad agreement among statistics educators that students need to learn more classification methods (Cleveland 2001; De Veaux et al. 2017). Classification is one of the most useful implementations of data analysis, easily interpretable, applicable, and very important for a wide range of fields (Gordon 1999). ...
... Many statistical graphing tools introduced in his 5 books, such as the coplot, are still widely used by data scientists today. Following this line of research methodology, Cleveland (2001) published Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics, proposing a new discipline of data science that synthesizes modern computing methods and statistics. He suggested that statisticians should incorporate advanced computing into data analysis and learn how to engage with subject matter experts. ...
Conference Paper
Full-text available
Data Science and Machine Learning (DSML) are taking over the world by storm, but the roots of DSML and their relationship to traditional statistics remain understudied, leading to misconceptions and potential misapplications. For example, although linear regression, which is based on the least squares criterion discovered in 1805, is known to be unsuitable for big data analytics, it is still categorized as a DSML method by some data scientists. This presentation illustrates how DSML emerged as a reaction to the shortcomings of classical statistics by covering four major sources that contributed to the development of DSML: (1) Exploratory Data Analysis (EDA), advocated by John Tukey, (2) the notion of learning from the data, proposed by John Chambers, (3) data visualization and advanced computing methods, developed by William Cleveland, and (4) the two-cultures thesis, suggested by Leo Breiman. In addition, differences between DSML and traditional statistics in seven aspects are discussed: (1) dichotomous evidence and decision vs. pattern recognition and contextual decision, (2) model-driven and assumption-based vs. data-driven and assumption-free, (3) single-modeling vs. multiple-modeling, (4) whole sample vs. resampling and subsetting, (5) small data vs. big data, (6) inference and explanation vs. prediction, and (7) overall generalization vs. personalized recommendation. Nonetheless, although DSML methods are considered more versatile, powerful, and efficient than traditional statistics in many aspects, it does not necessarily follow that the former can completely supersede the latter. A number of examples are given to illustrate when DSML can outperform traditional statistics, and vice versa.
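One of the listed contrasts, whole-sample inference versus resampling-based prediction (aspects 4 and 6), can be made concrete in a few lines. The sketch below runs on synthetic data and is an illustration of the contrast only, not material from the presentation.

```python
# Classical habit: one model fit to the whole sample, read off coefficients.
# DSML habit: compare models by cross-validated out-of-sample performance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)  # nonlinear truth

ols = LinearRegression().fit(X, y)
print("OLS coefficients on the whole sample:", ols.coef_.round(2))

for name, model in [("linear", LinearRegression()),
                    ("forest", RandomForestRegressor(random_state=0))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean cross-validated R^2 = {r2.mean():.2f}")
```

On data with a nonlinear component like this, the resampled comparison typically favors the algorithmic model even though the linear fit still returns tidy coefficients, which is exactly the gap between the two cultures that the talk describes.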
... John Tukey predicted in 1962 that statistics and computers would merge to speed up data analysis, shortening work that would take days to complete manually to a matter of hours. In preparation for such future challenges, Cleveland (2001) proposed training data scientists to cope with the demands of data analytics. He prepared an action plan describing how to increase the technical expertise and range of data analysts, including the specification of six areas of work for university departments. ...
... The term data science was also used by William S. Cleveland in 2001, when he wrote "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics" (Cleveland 2001). In this paper, he described an action plan for analysing data technically and statistically. ...
... For the formulation of this particular item, we consulted "the conceptual model for the data scientist profile" (Costa and Santos, 2017, p. 731). In the second section, we included general leadership skills extracted from the existing leadership literature (Bruce and Bruce, 2017; Cleveland, 2001; Mumford et al., 2007). An example item is that of strategic skills based on Mumford et al. (2007), where we asked respondents to rate the importance of strategic skills. ...
Article
Full-text available
Purpose This study focuses on leadership in organizations where big data analytics (BDA) is an essential component of corporate strategy. While leadership researchers have conducted promising studies in the field of digital transformation, the impact of BDA on leadership is still unexplored. Design/methodology/approach This study is based on semi-structured interviews with 33 organizational leaders and subject-matter experts from various industries. Using a grounded theory approach, a framework is provided for the emergent field of BDA in leadership research. Findings The authors present a conceptual model comprising foundational competencies and higher order roles that are data analytical skills, data self-efficacy, problem spotter, influencer, knowledge facilitator, visionary and team leader. Research limitations/implications This study focuses on BDA competency research emerging as an intersection between leadership research and information systems research. The authors encourage a longitudinal study to validate the findings. Practical implications The authors provide a competency framework for organizational leaders. It serves as a guideline for leaders to best support the BDA initiatives of the organization. The competency framework can support recruiting, selection and leader promotion. Originality/value This study provides a novel BDA leadership competency framework with a unique combination of competencies and higher order roles.
... Data science has attracted the attention of various academic disciplines and domains. Numerous studies over two decades have defined the roots and the scope of data science, characterizing the main features of the emerging field [12]-[22], mainly Cao [12] and Cleveland [13]. Chen et al. [23] summarize the characteristics of data science into the following aspects: ...
Article
Various semantic technologies such as ontologies, machine learning, or artificial intelligence-based are being used today with data science for the purpose of explaining the meaning of data, and making this explanation exploitable by computer processing. Although some quick and brief reports do exist, to the best of our knowledge, the literature lacks a detailed study reporting why, when and how semantic technologies are used with data science. This paper is a theoretical review aiming at providing an insight into data science with semantic technologies. We characterize this research topic through a framework called DS2T helping to understand data science with semantic technologies and giving a comprehensive overview of the field through different, but complementary views. The proposed framework may be used to position research studies integrating semantic technologies with data science, compare them, understand new trends, and identify opportunities and open issues related to a given application domain. Software development processes are used as illustration domain.
... Data science has attracted the attention of various academic disciplines and domains. Numerous studies over two decades have defined the roots and the scope of data science, characterizing the main features of the emerging field, mainly Banafa 75 (2014); Buckingham Shum et al. (2013); Cao (2017); Cielen, Meysman, and Ali (2016); Cleveland (2001); Dhar (2013); Loukides (2011); Provost and Fawcett (2013); Ruehle (2020); I. Sarker et al. (2020); I. H. Sarker (2021a). Chen et al. in Chen, Ayala, Alsmadi, and Wang (2018) summarize the characteristics of data science into the following aspects: ...
Preprint
Various semantic technologies, such as ontologies, machine learning, or artificial intelligence-based ones, are being used today with data science for the purpose of explaining the meaning of data and making this explanation exploitable by computer processing. Although some quick and brief reports do exist, to the best of our knowledge, the literature lacks a detailed study reporting why, when, and how semantic technologies are used with data science. This paper is a theoretical review aiming to provide an insight into data science with semantic technologies. We characterize this research topic through a framework called DS2T, helping to understand data science with semantic technologies and giving a comprehensive overview of the field through different, but complementary, views. The proposed framework may be used to position research studies integrating semantic technologies with data science, compare them, understand new trends, and identify opportunities and open issues related to a given application domain. Software development processes are used as the illustration domain.
... This approach, executed in July 2021, led to the collection of 1,115 scientific papers. The search was constrained to the last 20 years since, even though DS has sub-fields such as AI with roots in the early 1960s (Minsky, 1961), the field itself emerged only in the early 2000s (Cleveland, 2001). ...
Article
Full-text available
The city and its territory present broad and complex intellectual challenges that belong to the study of urban planning. The intention is not to discredit existing work, techniques, and methodologies, but to show that a disciplinary dialogue helps urban planning researchers and students to view urban phenomena from another perspective and to adopt a vocabulary that allows studies of the territory to be validated. This work presents a statistical model of the relationship between economic units and the size of the territory in the metropolitan area of Xalapa, with the purpose of demonstrating that urban phenomena can be measured using validated statistical techniques in conjunction with urban planning research. Starting from the hypothesis that economic activity is not distributed homogeneously across the territory of the Xalapa Metropolitan Zone, a multivariate statistical analysis shows that there are three groups of municipalities with similar characteristics, establishing the relationship between the concentration of activities and territorial extension. It concludes that there is no single regional economy; rather, the region is unequal, and municipalities can be classified based on a mix of economic and territorial variables for studies with an urban-spatial approach.
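As an illustration of the kind of multivariate grouping the abstract reports (three groups of municipalities with similar characteristics), here is a hierarchical-clustering sketch in Python; the input file and indicator columns are hypothetical stand-ins for the study's economic and territorial variables.

```python
# Group municipalities by standardized economic and territorial indicators.
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore

df = pd.read_csv("municipalities.csv", index_col="municipality")  # assumed file
X = df[["economic_units", "area_km2", "population"]].apply(zscore)

Z = linkage(X, method="ward")  # Ward linkage on the standardized indicators
df["group"] = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 groups
print(df.groupby("group").mean().round(1))  # average profile of each group
```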
Article
This article traces the evolution of data analysis from traditional statistics to data science. It begins with Peter Huber's claim about the empirical character of data analysis, where the researcher stresses that this stage of development cannot be defined as a new scientific paradigm, but rather as a certain tendency united under the name "data science". The main emphasis is on the contribution of John Tukey, who was the first to articulate the ideas underlying data analysis. The work presents the concepts of "confirmatory" and "exploratory" data analysis, defines their goals and differences, and notes the importance of alternating these stages in the research process. Tukey's principles for modern data analysis, such as "maximum penetration into the data" and "visualization of patterns", are considered key approaches for discovering new knowledge. Tukey's works provoked significant debate among statisticians, and his views on data analysis shocked the academic community. The influence of Tukey's work on the development of data science over half a century is examined in detail, including commentary by the renowned statistician Peter Huber. Tukey's calls for the reform of statistics and his views on the importance of asking the right questions and obtaining approximate answers underline his importance in the context of data analysis and data science. Particular emphasis is placed on the influence of computing environments on the development of data analysis; it is noted that real progress in understanding the notion of "data analysis" was stimulated by code and computing environments. The role of various statistical packages and software environments, such as BMDP, SPSS, SAS, Minitab, S, STATA, and R, in the development of data analysis is discussed, and their influence is studied through an analysis of word frequency in the literature. Today, R is the dominant programming environment in academic statistics, with a large community of adherents. Thanks to scripting, computational steps can be precisely codified. These changes have altered the rules of the game, and the expression "a scientific approach to data analysis" has become more tangible, in line with John Tukey's claim about the possibility of studying data analysis as a science.
Article
This paper analyses the future prospects of statistics as a profession and how data science will change it. Indeed, according to Hadley Wickham, Chief Scientist at Rstudio, “a data scientist is a useful statistician”, establishing a strong connection between data science and applied statistics. In this direction, the aim is to look to the future by proposing a structural approach to future scenarios. Some possible definitions of data science are then discussed, considering the relationship with statistics as a scientific discipline. The focus then turns to an assessment of the skills required by the labor market for data scientists and the specific characteristics of this profession. Finally, the phases of a data science project are considered, outlining how these can be exploited by a statistician.
Article
Many Library and Information Science (LIS) training programs are gradually expanding their curricula to include computational data science courses such as supervised and unsupervised machine learning. These programs focus on developing both "classic" information science competencies and core data science competencies among their students. Since data science competencies are often associated with mathematical and computational thinking, departmental officials and prospective students often raise concerns regarding the background students should have in order to succeed in this newly introduced computational content of LIS training programs. To address these concerns, we report on an exploratory study of the 2020 and 2021 student classes of Bar-Ilan University's LIS graduate training, focusing on the computational data science courses (i.e., supervised and unsupervised machine learning). Our study shows that, contrary to many of the concerns raised, students from the humanities performed as well on data science competencies as those from the social sciences (and in some cases significantly better) and had greater success in the training program as a whole. Students' undergraduate GPA was an adequate indicator of their success both in the training program and in its data science component, and we found no evidence to support concerns regarding age or sex. Finally, our study suggests that the computational data science part of students' training is well aligned with the rest of their program.
Chapter
We are living in a data-rich era, where every field of science or industry sector generates data in a seemingly effortless manner. To emphasize the importance of this, data have been called the “oil of the twenty-first century.” To deal with this flood of data, a new field called data science has been established. This chapter provides a general overview of data science and what learning from data means.
Preprint
Full-text available
Owing to the growing volume of data, the demand for properly qualified data scientists has increased, and Brazilian higher education institutions (IES) have sought to meet it. In this context, the objective of this article is to characterize undergraduate programs in Data Science. We sought to answer questions such as: Are the programs offered mostly by public or by private universities? When did they begin to be offered? Are they usually distance-learning (EAD) or on-campus? Are they technological or bachelor's degrees? Which groups of subjects make up most of the curriculum? In which regions of the country are they concentrated? What does the supply of places look like, and what is the profile of entrants to bachelor's and technological programs? To this end, we combined the e-MEC database with the 2021 Higher Education Census and explored and visualized the data using the MCA (multiple correspondence analysis) technique. Among the results, we observe a rough balance between on-campus and distance-learning modalities; most programs are of the technological type and tend to be offered by private institutions. Regarding regions, there is a strong concentration of on-campus programs in the Southeast of Brazil.
Article
Full-text available
News discourse analysis is a branch of discourse analysis that deals with news texts. Because news formatting involves two hidden features, selection and prominence, in the communicative representation of news, the inverted pyramid is used to grade the importance of the discourse parts of a news item. Although it is desirable for a news item to follow the inverted-pyramid structure, that structure sometimes changes. In this article we analyze the discourse of Persian news websites with the help of statistical analysis. Data science serves this goal: this interdisciplinary field treats data analysis scientifically, uncovering implicit concepts and extracting knowledge from data. Within a data science framework, we examined a Persian news corpus and studied the semantic correlation between news titles and news content, based on the inverted-pyramid structure. Using web crawling, a relatively large news corpus of 14 billion words was collected from 24 news websites. After preprocessing and normalizing the corpus, and within the framework of distributional semantics, vectors for titles and content were created with the Word2Vec tool so that each news item has a vector representation. After segmenting the news content into three parts (lead, body, and further explanation of the lead) according to the inverted pyramid, the Pearson correlation coefficient was used to measure the correlation between the title and each part. Although the coefficient was positive for a large number of news items, it was zero, indicating no correlation, for others. On average, the correlation between the headline and the lead and body was higher than the correlation between the headline and the elaboration of the lead. This research can serve as a method for carefully selecting titles and content and for filtering news according to the inverted-pyramid structure.
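As a rough illustration of the pipeline this abstract describes, the sketch below trains a Word2Vec model on a toy corpus, represents a title and a news segment as averaged word vectors, and computes the Pearson correlation between the two vectors. Everything here is an assumption for illustration: the study's actual corpus, preprocessing, and vector parameters are not reproduced.

```python
# Hedged sketch of the title/content comparison described above:
# average Word2Vec word vectors per segment, then correlate them.
# The toy corpus and parameters are placeholders, not the study's own.
import numpy as np
from gensim.models import Word2Vec
from scipy.stats import pearsonr

corpus = [
    ["flood", "warning", "issued", "for", "northern", "provinces"],
    ["heavy", "rain", "causes", "flood", "in", "northern", "provinces"],
    ["officials", "urge", "residents", "to", "evacuate"],
]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=0)

def segment_vector(tokens, model):
    """Average the word vectors of the tokens present in the vocabulary."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

title = ["flood", "warning", "issued"]
lead = ["heavy", "rain", "causes", "flood"]

r, _ = pearsonr(segment_vector(title, model), segment_vector(lead, model))
print(f"title-lead correlation: {r:.3f}")
```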
Chapter
While the data and information services industry dates back to the fifteenth century AD, the information technology (also called software services) industry is only about 70 years old. Yet the pace at which both have grown is phenomenal, and it is tied to the evolution of computer technology. The networking of computers and people through the Internet has moved into high gear. The decision (or knowledge) sciences arena, enveloping operations research methodologies, came into prominence during the Second World War but grew in parallel with computing technologies until the 1970s. The last two decades have also witnessed a spurt in their development due to advances in expert systems, machine learning, and artificial intelligence (AI) algorithms, all of which are now embedded within the field of data science. These three sectors have given birth to multiple business segments in the process. Delivery of data and information, data and business process outsourcing, analytics, printing and display solutions, and information flow in the supply chain management of goods and services are the five segments identified within information services, while computer-aided software engineering, independent software testing and quality assurance, package and bespoke software implementation and maintenance, network and security management, and hosting and infrastructure management are covered within software services. The data science arena consists of big data, statistical and mathematical techniques, and business domains, but is covered as one distinct business segment. The automation path of each segment is reviewed in detail, and an impact analysis to identify the changing landscape is described. Finally, current trends are traced, followed by predictions for future developments. Keywords: Software services; Information technology services; Big data and analytics; Customer relationship management; Enterprise resource planning; Data science
Article
Full-text available
Considerable debate exists today on almost every facet of what data science entails. Almost all commentators agree, however, that data science must be characterized as having an interdisciplinary or metadisciplinary nature. There is interest from many stakeholders in formalizing the emerging discipline of data science by defining boundaries and core concepts for the field. This paper presents a comparison between the data science of today and the development and evolution of information science over the past century. Data science and information science present a number of similarities: diverse participants and institutions, contested disciplinary boundaries, and diffuse core concepts. This comparison is used to discuss three questions about data science going forward: (1) What will be the focal points around which data science and its stakeholders coalesce? (2) Can data science stakeholders use the lack of disciplinary clarity as a strength? (3) Can data science feed into an “empowering profession”? The historical comparison to information science suggests that the boundaries of data science will be a source of contestation and debate for the foreseeable future. Stakeholders face many questions as data science evolves with the inevitable societal and technological changes of the next few decades.
Chapter
Statistical and machine learning methods have many applications in the environmental sciences, including prediction and data analysis in meteorology, hydrology and oceanography; pattern recognition for satellite images from remote sensing; management of agriculture and forests; assessment of climate change; and much more. With rapid advances in machine learning in the last decade, this book provides an urgently needed, comprehensive guide to machine learning and statistics for students and researchers interested in environmental data science. It includes intuitive explanations covering the relevant background mathematics, with examples drawn from the environmental sciences. A broad range of topics is covered, including correlation, regression, classification, clustering, neural networks, random forests, boosting, kernel methods, evolutionary algorithms and deep learning, as well as the recent merging of machine learning and physics. End‑of‑chapter exercises allow readers to develop their problem-solving skills, and online datasets allow readers to practise analysis of real data.
Chapter
Although many attempts have been made to define data science, such a definition has not yet been reached. One reason for the difficulty to reach a single, consensus definition for data science is its multifaceted nature: it can be described as a science, as a research method, as a discipline, as a workflow, or as a profession. One single definition just cannot capture this diverse essence of data science. In this chapter, we first take an interdisciplinary perspective and review the background for the development of data science (Sect. 2.1). Then we present data science from several perspectives: data science as a science (Sect. 2.2), data science as a research method (Sect. 2.3), data science as a discipline (Sect. 2.4), data science as a workflow (Sect. 2.5), and data science as a profession (Sect. 2.6). We conclude by highlighting three main characteristics of data science: interdisciplinarity, learner diversity, and its research-oriented nature (Sect. 2.7).
Chapter
Full-text available
This chapter conveys the following learning objectives: to be able to explain the significance of computational methods for the social sciences and humanities; to describe the differences and similarities between conventional manual content analysis and computational text analysis; to explain the typical workflow of a computational study, with its particular features, using automated text analysis as an example; to know how to carry out a dictionary-based sentiment analysis of social media posts and to implement it with the software R on an example data set; to characterize three approaches to computational text analysis (dictionary-based approaches, and approaches based on supervised and unsupervised machine learning); to know which other computational analysis methods besides automated text analysis are used in the social sciences and humanities; and to explain the specific ethical aspects of computational research in the social sciences and humanities.
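The dictionary-based sentiment analysis named in these learning objectives can be sketched compactly. The chapter works in R; the snippet below is a language-neutral illustration of the same idea in Python, with an invented mini-dictionary and example posts, and is not the chapter's own implementation.

```python
# Minimal sketch of dictionary-based sentiment analysis:
# count positive and negative dictionary hits per post and score
# the difference. Dictionary and posts are invented for illustration.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment_score(text: str) -> int:
    """Positive score -> leaning positive; negative -> leaning negative."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

posts = [
    "I love this great new feature",
    "terrible update, I hate it",
    "the weather is okay today",
]
for post in posts:
    print(sentiment_score(post), post)
```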
Article
Full-text available
This article presents a model for developing case studies, or labs, for use in undergraduate mathematical statistics courses. The model proposed here is to design labs that are more in-depth than most examples in statistical texts by providing rich background material, investigations and analyses in the context of a scientific problem, and detailed theoretical development within the lab. An important goal of this approach is to encourage and develop statistical thinking. It is also advocated that the labs be made the centerpiece of the theoretical course. As a result, the curriculum, lectures, and assignments are significantly restructured. For example, the course work includes written assignments based on open-ended data analyses, and the lectures include group work and discussions of the case studies.
Article
Full-text available
Higher education faces an environment of financial constraints, changing customer demands, and loss of public confidence. Technological advances may at last bring widespread change to college teaching. The movement for education reform also urges widespread change. What will be the state of statistics teaching at the university level at the end of the century? This article attempts to imagine plausible futures as stimuli to discussion. It takes the form of provocations by the first author, with responses from the others on three themes: the impact of technology, the reform of teaching, and challenges to the internal culture of higher education.
Article
How does statistical thinking differ from mathematical thinking? What is the role of mathematics in statistics? If you purge statistics of its mathematical content, what intellectual substance remains? In what follows, we offer some answers to these questions and relate them to a sequence of examples that provide an overview of current statistical practice. Along the way, and especially toward the end, we point to some implications for the teaching of statistics.
Article
Significant advances in, and the resultant impact of, information technology (IT) during the last fifteen years have produced a much more data-based society, a trend that can be expected to continue for the foreseeable future. This phenomenon has had a real impact on the statistics discipline and will continue to drive changes in both course content and course delivery. Major research directions have also evolved during the last ten years as a direct result of advances in IT, and their impact has started to flow into course content, at least for advanced courses. One question that arises is what impact this will have on the future training of statisticians, with respect both to course content and to mode of delivery. At the tertiary level, the last 40 years have seen significant advances in theoretical aspects of the statistics discipline. Universities have been outstanding at producing scholars with a strong theoretical background, but questions have been asked as to whether this has, to some degree, come at the expense of appropriate training for the users of statistics (the 'tradespersons'). Future directions in the teaching and learning of statistics must take into account the impact of IT, together with the competing need to produce scholars as well as competent users of statistics to meet the future needs of the marketplace. For statistics to survive as a recognizable discipline, the ability to train statisticians who can communicate is also seen as crucially important. Satisfying the needs of society, as well as meeting the needs of the profession, are the basic determinants that will drive the future teaching and training of statisticians at the tertiary level, and they form the basis of this presentation.
Article
Aspects of scientific method are discussed: In particular, its representation as a motivated iteration in which, in succession, practice confronts theory, and theory, practice. Rapid progress requires sufficient flexibility to profit from such confrontations, and the ability to devise parsimonious but effective models, to worry selectively about model inadequacies and to employ mathematics skillfully but appropriately. The development of statistical methods at Rothamsted Experimental Station by Sir Ronald Fisher is used to illustrate these themes.
Article
The nature of data is rapidly changing. Data sets are becoming increasingly large and complex. Modern methodologies for analyzing these new types of data are emerging from the fields of database management, artificial intelligence, machine learning, pattern recognition, and data visualization. So far, statistics as a field has played only a minor role. This paper explores some of the reasons why, and why statisticians should have an interest in participating in the development of new methods for large and complex data sets.
Article
The statistics community is showing increasing interest in consulting in industry. This interest has stimulated questions concerning recognition and job satisfaction, job opportunities, and educational and training needs. These questions are considered in this article. A central theme is that effective statistical consulting requires total involvement in the consulting situation and that good recognition flows naturally from such an approach. This concept is defined in operational terms.
Article
The profession of statistics has adopted too narrow a definition of itself. As a consequence, both statistics and statisticians play too narrow a role in policy formation and execution. Broadening that role will require statisticians to change the curriculum they use to train and develop their own professionals and what they teach nonstatisticians about statistics. Playing a proper role will require new research from statisticians that combines our skills in methods with other techniques of social scientists.
Article
Centre of Location. That abscissa of a frequency curve for which the sampling errors of optimum location are uncorrelated with those of optimum scaling. (9.)
Article
Data analysis is not a new subject. It has accompanied productive experimentation and observation for hundreds of years. At times, as in the work of Kepler, it has produced dramatic results.
Article
Enormous quantities of data go unused or underused today, simply because people can't visualize the quantities and relationships within them. Using a downloadable programming environment developed by the author, Visualizing Data demonstrates methods for representing data accurately on the Web and elsewhere, complete with user interaction, animation, and more. How do the 3.1 billion A, C, G, and T letters of the human genome compare with those of a chimp or a mouse? What do the paths that millions of visitors take through a web site look like? With Visualizing Data, you learn how to answer complex questions like these with thoroughly interactive displays. We're not talking about cookie-cutter charts and graphs: this book teaches you how to design entire interfaces around large, complex data sets with the help of a powerful new design and prototyping tool called Processing. Used by many researchers and companies to convey specific data in a clear and understandable manner, the Processing beta is available free. With this tool and Visualizing Data as a guide, you'll learn basic visualization principles, how to choose the right kind of display for your purposes, and how to provide interactive features that will bring users to your site over and over. The book teaches: the seven stages of visualizing data (acquire, parse, filter, mine, represent, refine, and interact); how all data problems begin with a question and end with a narrative construct that provides a clear answer without extraneous detail; and several example projects, with the code to make them work, discussing the positive and negative points of each representation. The focus is on customization, so that each visualization best suits what you want to convey about your data set. The book does not provide ready-made visualizations that can be plugged into any data set; instead, with chapters divided by types of data rather than types of display, you'll learn how each visualization conveys the unique properties of the data it represents: why the data was collected, what's interesting about it, and what stories it can tell. Visualizing Data teaches you how to answer questions, not simply display information.
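The seven-stage pipeline the book names can be made concrete with a tiny sketch. The snippet below walks a toy data set through acquire, parse, filter, mine, and represent; it is a schematic illustration in Python, not the book's Processing code, and the data and threshold are invented.

```python
# Schematic walk through five of the seven stages named above
# (acquire, parse, filter, mine, represent); refine and interact
# are interactive concerns and are only noted in comments.
raw = "a,3\nb,7\nc,1\nd,9"                             # acquire: raw text input

rows = [line.split(",") for line in raw.split("\n")]   # parse: impose structure
data = {name: int(value) for name, value in rows}

filtered = {k: v for k, v in data.items() if v >= 3}   # filter: keep what matters

top = max(filtered, key=filtered.get)                  # mine: find a pattern

# represent: a minimal text bar chart (refine/interact would follow
# in a graphical environment such as Processing)
for name, value in sorted(filtered.items()):
    marker = " <- max" if name == top else ""
    print(f"{name} {'#' * value}{marker}")
```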
Article
An efficient method for the calculation of the interactions of a 2^n factorial experiment was introduced by Yates and is widely known by his name. The generalization to 3^n was given by Box et al. (1). Good (2) generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N x N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results in a procedure requiring a number of operations proportional to N log N rather than N^2. These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with N = 2^m and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients. Consider the problem of calculating the complex Fourier series

X(j) = \sum_{k=0}^{N-1} A(k) W^{jk}, \quad j = 0, 1, \ldots, N-1,

where W = e^{2\pi i / N}.
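To make the N log N idea concrete, here is a minimal recursive radix-2 Cooley-Tukey sketch for the case N = 2^m, using the same positive-exponent convention W = e^{2\pi i/N} as the series above. It follows the divide-and-conquer structure the abstract describes rather than the in-place formulation of the original paper, and the test signal is invented.

```python
# Minimal recursive radix-2 FFT (requires len(a) to be a power of two).
# This illustrates the N log N divide-and-conquer idea; the original
# paper develops an in-place formulation instead.
import cmath

def fft(a):
    n = len(a)
    if n == 1:
        return a
    even = fft(a[0::2])                  # transform of even-indexed terms
    odd = fft(a[1::2])                   # transform of odd-indexed terms
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n)    # twiddle factor W^k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

# Toy check against the defining sum (an O(N^2) direct evaluation):
signal = [1, 2, 3, 4, 0, 0, 0, 0]
direct = [sum(x * cmath.exp(2j * cmath.pi * j * k / 8)
              for k, x in enumerate(signal)) for j in range(8)]
assert all(abs(u - v) < 1e-9 for u, v in zip(fft(signal), direct))
```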
Article
Statisticians regularly bemoan what they perceive to be the lack of impact and appreciation of their discipline on the part of others. And it does seem to be true, as we approach the magic 2000 number, that we are faced with the paradox of more and more information being available, more and more complex problems to be solved – but, apparently, less and less direct appreciation of the role and power of statistical thinking. The purpose of the conference talk will be to explore one possible pedagogic route to raising awareness of the central importance of statistical thinking to good citizenship – namely, a focus on public policy issues as a means of creating an awareness and appreciation of the need for and power of statistical thinking. We all know that our discipline is both intellectually exciting and stimulating in itself – and that it provides the crucial underpinning of any would-be coherent, quantitative approach to the world around us. Indeed, one might even go so far as to say that our subject is 'the science of doing science', providing theory and protocols to guide and discipline all forms of quantitative investigatory procedure. We have certainly expanded our horizons way beyond the rather modest ambitions of the founding fathers of the Royal Statistical Society, who set out to 'collect, arrange, digest and publish facts, illustrating the condition and prospects of society in its material, social and moral relations'.
Article
There appears to be a paradox in the fact that as more and more quantitative information becomes routinely available in a world perceived to be ever more complex, there is less and less direct appreciation of the role and power of statistical thinking. It is suggested that the profession should exploit very real public concerns regarding risk aspects of public policy as a possible pedagogic route to raising statistical awareness.
Article
Stochastic substitution, the Gibbs sampler, and the sampling-importance-resampling algorithm can be viewed as three alternative sampling- (or Monte Carlo-) based approaches to the calculation of numerical estimates of marginal probability distributions. The three approaches will be reviewed, compared, and contrasted in relation to various joint probability structures frequently encountered in applications. In particular, the relevance of the approaches to calculating Bayesian posterior densities for a variety of structured models will be discussed and illustrated.
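As a toy illustration of the sampling-based approaches reviewed here, the sketch below runs a Gibbs sampler on a bivariate normal with correlation rho, alternately drawing each coordinate from its exact conditional distribution. The target and parameter values are invented for illustration and are far simpler than the structured models the article discusses.

```python
# Toy Gibbs sampler for a bivariate normal with correlation rho:
# each full conditional is univariate normal, so we alternate exact
# conditional draws and the chain converges to the joint distribution.
import math
import random

rho = 0.8                      # illustration only
random.seed(0)

x, y = 0.0, 0.0
samples = []
for i in range(20000):
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x
    x = random.gauss(rho * y, math.sqrt(1 - rho ** 2))
    y = random.gauss(rho * x, math.sqrt(1 - rho ** 2))
    if i >= 1000:              # discard burn-in
        samples.append((x, y))

# The empirical correlation of the draws should be close to rho.
n = len(samples)
mx = sum(s[0] for s in samples) / n
my = sum(s[1] for s in samples) / n
cov = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
vx = sum((s[0] - mx) ** 2 for s in samples) / n
vy = sum((s[1] - my) ** 2 for s in samples) / n
print(f"estimated correlation: {cov / math.sqrt(vx * vy):.3f}")
```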
Article
This paper examines work in "computing with data": computing support for scientific and other activities to which statisticians can contribute. Relevant computing techniques, besides traditional statistical computing, include data management, visualization, interactive languages, and user-interface design. The paper emphasizes the concepts underlying computing with data and how those concepts can help in practical work. We look at past, present, and future: some concepts as they arose in the past and as they have proved valuable in current software; applications in the present, with one example in particular to illustrate the challenges these present; and new directions for future research, including one exciting joint project.
Nicholls, D. (2000). Future Directions for the Teaching and Learning of Statistics at the Tertiary Level. Technical report, Department of Statistics and Econometrics, Australian National University.
Nolan, D. and T. Speed (1999). Teaching Statistics Theory Through Applications. The American Statistician 53, 370–375.
Smith, A. F. M. (2000). Public Policy Issues as a Route to Statistical Awareness. Technical report, Department of Mathematics, Imperial College.
Tukey, J. W. and M. B. Wilk (1986). Data Analysis and Statistics: An Expository Overview. In L. V. Jones (Ed.), The Collected Works of John W. Tukey, pp. 549–578. New York: Chapman & Hall.
Wegman, E. J. (2000). Visions: The Evolution of Statistics. Technical report, Center for Computational Statistics, George Mason University.