Article

Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics

Authors: William S. Cleveland

Abstract

An action plan to enlarge the technical areas of statistics focuses on the data analyst. The plan sets out six technical areas of work for a university department and advocates a specific allocation of resources devoted to research in each area and to courses in each area. The value of technical work is judged by the extent to which it benefits the data analyst, either directly or indirectly. The plan is also applicable to government research labs and corporate research organizations.

1 Summary of the Plan

This document describes a plan to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called "data science." The focus of the plan is the practicing data analyst. A basic premise is that technical areas of data science should be judged by the extent to which they enable the analyst to learn from data. The benefit of an area can be direct or indirect. Tools that are used by...

... In 1966, the International Council for Science: Committee on Data for Science and Technology (CODATA), which is also relevant to data science, was established [2,15]. In 1974, P. Naur, in his book "Concise Survey of Computer Methods", gave an overview of modern data processing methods and defined data science as "a discipline that studies the life cycle of digital data, from the moment of their creation to the changes made so they can be presented to other fields of knowledge" [16]. ...
... However, the term only came into wide use from the 1990s onward and became generally accepted from the early 2000s. In 2001, W. Cleveland, a professor at Purdue University and a well-known specialist in statistics, data visualization and machine learning, drew attention to the teaching of the discipline of "data science" in his study "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics" [15]. In that plan, he proposed data science as a separate academic discipline. ...
... Data science is a discipline based on uncovering hidden knowledge from large volumes of data through research. It encompasses mathematical and algorithmic methods for uncovering hidden knowledge, using the initial information to solve analytically complex business problems [15]. ...
... It is a young academic discipline, and so is the associated profession of data scientist. The term, with its current meaning, was essentially coined at the beginning of the 21st century [2]. About a year later, the International Council for Science: Committee on Data for Science and Technology [3] started publishing the CODATA Data Science Journal [4] in April 2002. ...
... 1 Converting analog physical measurements to computer-readable format, in which the information is organised into bits, the basic units of information used in computing and digital communications. 2 Ambient sensors, cameras, Internet logs, transaction logs, etc. ...
... 1. Good architectures and frameworks are top priority. 2. Support a variety of analytical methods. ...
Thesis
Full-text available
Understanding data is the main purpose of data science, and how to achieve it is one of the challenges of data science, especially when dealing with big data. The big data era brought us new data processing and data management challenges to face. Existing state-of-the-art analytics tools now come close to handling ongoing challenges and provide satisfactory results at reasonable cost. But the speed at which new data is generated and the need to manage changes in data, both in content and structure, lead to new rising challenges. This is especially true in the context of complex systems with strong dynamics, as, for instance, in large-scale ambient systems. One existing technology that has been shown to be particularly relevant for modeling, simulating and solving problems in complex systems is Multi-Agent Systems. The AMAS (Adaptive Multi-Agent Systems) theory proposes to solve complex problems for which there is no known algorithmic solution by self-organization. The cooperative behavior of the agents enables the system to self-adapt to a dynamic environment so as to maintain the system in a functionally adequate state. In this thesis, we apply this theory to Big Data Analytics. In order to find meaning and relevant information drowned in the data flood, while overcoming big data challenges, a novel analytic tool is needed, one able to continuously find relations between data, evaluate them and detect their changes and evolution over time. The aim of this thesis is to present the AMAS4BigData analytics framework, based on Adaptive Multi-Agent Systems technology, which uses a new data similarity metric, the Dynamics Correlation, for dynamic data relations discovery and dynamic display. This framework is currently being applied in the neOCampus operation, the ambient campus of the University Toulouse III - Paul Sabatier.
... In recent years, biomedical data science and health data science have gained considerable interest because the data flood in these fields poses unprecedented challenges [1][2][3]. An example of such big data is provided by electronic health records (eHRs). ...
... $s_{1610}\}$, $s_n = \{x_1, x_2, x_3, x_4, \dots, x_{5572}\}$, $x_n \in [1, 48848]$, and this is the structure of the input samples to the word-level network. ...
... Hence, the sentence-level network receives a preprocessed sentence-level input $z$, where each sample is represented by $z = \{s_1, s_2, s_3, \dots, s_{545}\}$, $s_i \in t^{150}$, $t_j \in [1, 48848]$. Also, an embedding layer is applied to the sentence-level input. In the embedding layer the input $z$ is transformed into a 3-dimensional matrix $E' \in \mathbb{R}^{545 \times 150 \times 50}$. ...
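To make these shapes concrete, here is a minimal numpy sketch of the lookup an embedding layer performs. The vocabulary size, sentence count, sentence length and embedding width are taken from the excerpt above; the random ids and table are illustrative stand-ins, not the authors' data or code.

```python
import numpy as np

vocab_size = 48_848   # token ids lie in [1, 48848] (from the excerpt)
n_sentences = 545     # sentences per document sample
sent_len = 150        # tokens per sentence (padded/truncated)
embed_dim = 50        # embedding width

rng = np.random.default_rng(0)

# One sentence-level input sample z: a (545, 150) matrix of integer token ids.
z = rng.integers(1, vocab_size + 1, size=(n_sentences, sent_len))

# An embedding layer is essentially a lookup table: row i holds the vector for token id i.
embedding_table = rng.normal(size=(vocab_size + 1, embed_dim))

# Indexing the table with z yields the 3-dimensional matrix E' described above.
E = embedding_table[z]
print(E.shape)  # (545, 150, 50)
```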
Article
Full-text available
Artificial intelligence provides the opportunity to reveal important information buried in large amounts of complex data. Electronic health records (eHRs) are a source of such big data that provide a multitude of health-related clinical information about patients. However, text data from eHRs, e.g., discharge summary notes, are challenging in their analysis because these notes are free-form texts and the writing formats and styles vary considerably between different records. For this reason, in this paper we study deep learning neural networks in combination with natural language processing to analyze text data from clinical discharge summaries. We provide a detailed analysis of patient phenotyping, i.e., the automatic prediction of ten patient disorders, by investigating the influence of network architectures, sample sizes and information content of tokens. Importantly, for patients suffering from Chronic Pain, the disorder that is the most difficult one to classify, we find the largest performance gain for a combined word- and sentence-level input convolutional neural network (ws-CNN). As a general result, we find that the combination of data quality and data quantity of the text data plays a crucial role for using more complex network architectures that improve significantly beyond a word-level input CNN model. From our investigations of learning curves and token selection mechanisms, we conclude that such a transition requires larger sample sizes because the amount of information per sample is quite small and only carried by few tokens and token categories. Interestingly, we found that the token frequency in the eHRs follows a Zipf law and we utilized this behavior to investigate the information content of tokens by defining a token selection mechanism. The latter also addresses issues of explainable AI.
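The Zipf-law observation in this abstract is easy to probe: under Zipf's law, token frequency is roughly proportional to 1/rank, so log-frequency against log-rank is approximately linear with a slope near -1. A minimal sketch, using a toy corpus as a stand-in for the eHR discharge notes:

```python
from collections import Counter
import numpy as np

# Toy corpus standing in for clinical discharge notes (illustrative only).
corpus = [
    "patient admitted with chest pain",
    "patient discharged home in stable condition",
    "chest x ray normal no acute findings",
    "pain controlled with oral medication",
]
counts = Counter(tok for doc in corpus for tok in doc.split())
freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# Fit log f = a + s * log r; Zipf-like data gives an exponent s close to -1.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"estimated Zipf exponent: {slope:.2f}")
```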
... This contributed to the later term of "data-driven discovery" in 1989 [KDD89 1989]. In 2001, an action plan was suggested by Cleveland [Cleveland 2001] that would expand the technical areas of statistics toward data science. ...
... The intention was to shift the focus of statistics from "data collection, modeling, analysis, problem understanding/resolving, decision making" to future directions on "large/complex data, empirical-physical approach, representation and exploitation of knowledge". -William S. Cleveland suggested in 2001 that it would be appropriate to alter the statistics field to data science and "to enlarge the major areas of technical work of the field of statistics" by looking to computing and partnering with computer scientists [Cleveland 2001]. -Leo Breiman suggested in 2001 that it was necessary to "move away from exclusive dependence on data models (in statistics) and adopt a more diverse set of tools" such as algorithmic modeling, which treats the data mechanism as unknown [Breiman 2001]. ...
... The original scientific agenda of data science has been driven by both government initiatives and academic recommendations. This was built on the strong promotion of converting statistics to data science, and blending statistics with computing science in the statistics community [Wu 1997;Cleveland 2001;Iwata 2008;Hardin et al. 2015;Hand 2015;Diggle 2015;Graham 2012;Finzer 2013]. Today, many regional and global initiatives have been taken in data science research, disciplinary development and education, as strategic matters and agenda in the digital era. ...
Preprint
The twenty-first century has ushered in the age of big data and data economy, in which data DNA, which carries important knowledge, insights and potential, has become an intrinsic constituent of all data-based organisms. An appropriate understanding of data DNA and its organisms relies on the new field of data science and its keystone, analytics. Although it is widely debated whether big data is only hype and buzz, and data science is still in a very early phase, significant challenges and opportunities are emerging or have been inspired by the research, innovation, business, profession, and education of data science. This paper provides a comprehensive survey and tutorial of the fundamental aspects of data science: the evolution from data analysis to data science, the data science concepts, a big picture of the era of data science, the major challenges and directions in data innovation, the nature of data analytics, new industrialization and service opportunities in the data economy, the profession and competency of data education, and the future of data science. This article is the first in the field to draw a comprehensive big picture, in addition to offering rich observations, lessons and thinking about data science and analytics.
... While being applied to many different fields, data science draws on three main fields regarding methods and models: statistics [55,56], machine learning [57,58] 8 and mathematical modelling [36]. Different study programs set different focal points within the field spanned by the three, causing at least in part the observed plurality in the educational content of different data science study programs. ...
... In fact, there is currently no widely accepted definition of data science [54,57,58,63] and the above-mentioned plurality in educational content is certainly (at least partially) rooted in this ambiguity. Several divergent definitions can be found, e.g., in [9,12,36,55,64]. Occasionally, authors define the term data scientist instead, as in [9,38]. A strong, widely accepted definition of the term data science would certainly form a foundation for coherent, well-grounded data science teaching, but lecturers need to make decisions on the teaching content in courses offered now. ...
Article
Full-text available
Data are increasingly important in central facets of modern life: academia, the professions, and society at large. Educating aspiring minds to meet the highest standards in these facets is the mandate of institutions of higher education. This, naturally, includes preparation for excelling in today's data-driven world. In recent years, an intensive academic discussion has resulted in the distinction between two different modes of data-related education: data science and data literacy education. As a large number of study programs and offerings are emerging around the world, data literacy in higher education is a particular focus of this paper. These programs, despite sharing the same name, differ substantially in their educational content, i.e., a high plurality can be observed. This paper explores this plurality, comments on the role it might play and suggests ways it can be dealt with by maintaining a high degree of adaptiveness and plurality while simultaneously establishing a consistent educational "essence". It identifies a skill set, data self-empowerment, as a potential part of this essence. Data science and literacy education are still experiencing changeability in their emergence as fields of study, while additionally being stirred up by rapid developments, bringing about a need for flexibility and dialectic.
... Currently, the data in those disciplines and applied fields that lacked solid theories, like the social sciences and related disciplines, could be utilized to generate powerful predictive models [53]. Cleveland (2001) [54] urges prioritizing extracting applicable predictive tools over explanatory theories from colossal amounts of data. For the future of data science, Donoho (2015) [7] projects an ever-growing environment for open science where data sets used for academic publications are accessible to all researchers. ...
Article
Full-text available
As a new area of science and technology (S&T), big data science and analytics embodies an unprecedentedly transformative power—which is manifested not only in the form of revolutionizing science and transforming knowledge, but also in advancing social practices, catalyzing major shifts, and fostering societal transitions. Of particular relevance, it is instigating a massive change in the way both smart cities and sustainable cities are understood, studied, planned, operated, and managed to improve and maintain sustainability in the face of expanding urbanization. This relates to what has been dubbed data-driven smart sustainable urbanism, an emerging approach that is based on a computational understanding of city systems that reduces urban life to logical and algorithmic rules and procedures, as well as employs a new scientific method based on data-intensive science, while also harnessing urban big data to provide a more holistic and integrated view and synoptic intelligence of the city. This paper examines the unprecedented paradigmatic and scholarly shifts that the sciences underlying smart sustainable urbanism are undergoing in light of big data science and analytics and the underlying enabling technologies, as well as discusses how these shifts intertwine with and affect one another in the context of sustainability. I argue that data-intensive science, as a new epistemological shift, is fundamentally changing the scientific and practical foundations of urban sustainability. In specific terms, the new urban science—as underpinned by sustainability science and urban sustainability—is increasingly making cities more sustainable, resilient, efficient, and livable by rendering them more measurable, knowable, and tractable in terms of their operational functioning, management, planning, design, and development.
... Within the practice and teaching of data science [1][2][3][4][5][6][7][8][9][10], a data scientist builds a data analysis [11][12][13][14][15][16][17] to extract knowledge and insights from examining data [18]. However, there is surprisingly little discussion on how to evaluate the quality of a given data analysis, which is different than the evaluation of the science or question underlying the data analysis. ...
... However, there is surprisingly little discussion on how to evaluate the quality of a given data analysis, which is different than the evaluation of the science or question underlying the data analysis. Three possible reasons for this include: (1) there is an insufficient vocabulary to describe how to characterize the variation between data analyses, (2) there is a lack of definitive and precise performance metrics to evaluate the quality of the analyses, and (3) there is a lack of specificity regarding by whom the data analysis is being evaluated. This leaves the educator or the practicing data scientist to focus the discussion of data analysis quality assessment on specific methods, technologies or programming languages used in a data analysis, with the vague hope that such discussion will lead to success. ...
Preprint
Full-text available
A fundamental problem in the practice and teaching of data science is how to evaluate the quality of a given data analysis, which is different than the evaluation of the science or question underlying the data analysis. Previously, we defined a set of principles for describing data analyses that can be used to create a data analysis and to characterize the variation between data analyses. Here, we introduce a metric of quality evaluation that we call the success of a data analysis, which is different than other potential metrics such as completeness, validity, or honesty. We define a successful data analysis as the matching of principles between the analyst and the audience on which the analysis is developed. In this paper, we propose a statistical model and general framework for evaluating the success of a data analysis. We argue that this framework can be used as a guide for practicing data scientists and students in data science courses for how to build a successful data analysis.
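The core idea, scoring an analysis by the match between analyst and audience, can be caricatured in a few lines. This is a purely illustrative sketch: the principle names and the distance-based score below are hypothetical stand-ins, not the statistical model proposed in the paper.

```python
import numpy as np

# Hypothetical analytic principles (stand-ins, not the paper's list).
principles = ["reproducibility", "skepticism", "exhaustiveness", "clarity"]

analyst  = np.array([0.9, 0.4, 0.7, 0.8])   # analyst's emphasis on each principle
audience = np.array([0.8, 0.6, 0.5, 0.9])   # audience's expectation for each principle

# One naive matching metric: 1 minus the mean absolute disagreement,
# so a perfect match scores 1 and maximal disagreement scores 0.
success_score = 1.0 - np.mean(np.abs(analyst - audience))
print(f"success score: {success_score:.2f}")
```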
... Currently, the data in those disciplines and applied fields that lacked solid theories, like the social sciences and related disciplines, could be utilized to generate powerful predictive models [53]. Cleveland (2001) [54] urges prioritizing extracting applicable predictive tools over explanatory theories from colossal amounts of data. For the future of data science, Donoho (2015) [7] projects an ever-growing environment for open science where data sets used for academic publications are accessible to all researchers. ...
Chapter
Full-text available
As a new area of science and technology (S&T), big data science and analytics embodies an unprecedentedly transformative power—manifested not only in the form of revolutionizing science and transforming knowledge, but also in advancing social practices, producing new discourses, catalyzing major shifts, and fostering societal transitions. Of particular relevance, it is instigating a massive change in the way both smart cities and sustainable cities are studied and understood, and in how they are planned, designed, operated, managed, and governed in the face of urbanization. This relates to what has been dubbed data-driven smart sustainable urbanism, an emerging approach which is based on a computational understanding of city systems that reduces urban life to logical and algorithmic rules and procedures and that employs new scientific methods and principles, while also harnessing urban big data to provide a more holistic and integrated view or synoptic intelligence of the city. This is underpinned by epistemological realism and instrumental rationality, which sustain and are shaped by urban science. However, all knowledge is socially constructed and historically situated, so too are research methods and applied research as related to S&T and as historically produced social formations and practices that circumscribe and produce culturally specific forms of knowledge and reality. This chapter examines the unprecedented paradigmatic, scientific, scholarly, epistemic, and discursive shifts the field of smart sustainable urbanism is undergoing in light of big data science and analytics and the underlying advanced technologies, as well as discusses how these shifts intertwine with and affect one another, and their sociocultural specificity and historical situatedness. I argue that data-intensive science as a new paradigmatic shift is fundamentally changing the scientific and practical foundations of urban sustainability. In specific terms, the new urban science—as underpinned by sustainability science—is increasingly making cities more sustainable, resilient, efficient, livable, and equitable by rendering them more measurable, knowable, and tractable in terms of their operational functioning, management, planning, design, and development.
... We are living in a data-rich era in which every field of science or industry generates data seemingly in an effortless manner [1,2]. To cope with these big data a new field has been established called data science [3,4]. Data science combines the skill sets and expert knowledge from many different fields including statistics, machine learning, artificial intelligence, and pattern recognition [5][6][7][8]. ...
... Step 2: Calculate the test statistics $t_{i,b}$ for $i \in \{1, \dots, m\}$. Step 3: Estimate $u_{i,b}$ for $i \in \{1, \dots, m\}$ by ...
Article
Full-text available
A statistical hypothesis test is one of the most eminent methods in statistics. Its pivotal role comes from the wide range of practical problems it can be applied to and its minimal data requirements. Being an unsupervised method makes it very flexible in adapting to real-world situations. The availability of high-dimensional data makes it necessary to apply such statistical hypothesis tests simultaneously to the test statistics of the underlying covariates. However, if applied without correction this leads to an inevitable increase in Type 1 errors. To counteract this effect, multiple testing procedures have been introduced to control various types of errors, most notably the Type 1 error. In this paper, we review modern multiple testing procedures for controlling either the family-wise error rate (FWER) or the false-discovery rate (FDR). We emphasize their principal approach, allowing categorization of them as (1) single-step vs. stepwise approaches, (2) adaptive vs. non-adaptive approaches, and (3) marginal vs. joint multiple testing procedures. We place a particular focus on procedures that can deal with data with a (strong) correlation structure because real-world data are rarely uncorrelated. Furthermore, we also provide background information making the often technically intricate methods accessible for interdisciplinary data scientists.
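Of the FDR-controlling procedures of the kind this paper reviews, the Benjamini-Hochberg step-up procedure is the classic example. A minimal sketch of textbook BH (not code from the paper):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses, controlling FDR at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices sorting p-values ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()          # largest i with p_(i) <= (i/m) * alpha
        rejected[order[:k + 1]] = True          # step-up: reject all up to that rank
    return rejected

# Example usage: ten p-values, of which the smallest two survive BH at alpha = 0.05.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.590]
print(benjamini_hochberg(pvals, alpha=0.05))
```

Textbook BH guarantees FDR control under independence or positive dependence; the correlation-aware, adaptive and joint procedures emphasized in this review refine exactly this step for realistically dependent test statistics.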
... William S. Cleveland introduced Data Science as an independent discipline in 2001. 9 He considered that six technical areas encompass the field of Data Science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory. Subsequently in 2002, the International Council for Science: Committee on Data for Science and Technology (CODATA) founded the first journal on Data Science (https://datascience.codata. ...
Article
Full-text available
The aim of this paper is to frame Data Science, a fashionable and emerging topic nowadays, in the context of business and industry. We open with a discussion of the origin of Data Science and its requirement for a challenging mix of capabilities in data analytics, information technology, and business know‐how. The mission of Data Science is to provide new or revised computational theory able to extract useful information from the massive volumes of data collected at an accelerating pace. In fact, besides traditional measurements, digital data obtained from images, text, audio, sensors, etc., now complement surveys. Then, we review the different and most popular methodologies among the practitioners of Data Science research and applications. In addition, because the emerging field requires personnel with new competences, we attempt to describe the Data Scientist profile, one of the sexiest jobs of the 21st Century according to Davenport and Patil. Most people are aware of the need to embrace Data Science, but they feel intimidated that they do not understand it and they worry that their jobs will disappear. We want to encourage them: Data Science is more likely to add value to jobs and enrich the lives of working people by helping them make better, more informed business decisions. We conclude this paper by presenting examples of Data Science in action in business and industry, to demonstrate the collection of specialist skills that must come together for this new science to be effective.
... 2. Data Science: Global Impact and dissemination Cleveland (2014) proposes an action plan for statistics, in which he elevates the role of the statistician to that of a researcher who should not limit himself or herself to providing only statistical calculations and p-values, but should also be involved in their interpretation. ...
Preprint
Full-text available
We highlight the role of Data Science in Biomedicine. Our manuscript goes from the general to the particular, presenting a global definition of Data Science and showing the trend for this discipline together with the terms cloud computing and big data. In addition, since Data Science is mostly related to areas like economics or business, we describe its importance in biomedicine. Biomedical Data Science (BDS) presents the challenge of dealing with data coming from a range of biological and medical research, focusing on methodologies to advance biomedical science discoveries in an interdisciplinary context.
... It employs techniques and theories drawn from many fields of science such as mathematics, statistics, information science, and computer science. Data science was described as a 'rebranded statistics discipline' by Cleveland (2001), and most now view data science as an interdisciplinary field that uses "scientific methods, processes, algorithms and systems to extract knowledge and insights" (Dhar 2013). As an educational programme it often includes managing various forms of structured and unstructured data. ...
Article
Full-text available
Although existing data science educational programmes develop talent and produce graduates, business-focused data science curricula, comprising essential skills oriented to business and managerial data with associated analysis, remain underserved. Current pedagogy has focused either on data science or on purely technical, analytic aspects. There is, therefore, an opportunity to rethink how institutions can develop innovative data-focussed education programmes addressing both modern industry and community demands. As both academia and industry strive to integrate applied learning, transferable and enterprise skills into business and the sciences, this paper proposes a design-based research (DBR) approach for designing such a new interdisciplinary data science teaching curriculum as a foundation for delivering business undergraduate degrees in Business Data Science. Adopting a design science method, our proposed DBR illustrates effective utilities for conceptualising and evaluating a fully functional new degree programme - Bachelor of Business Data Science. Ten senior business information systems academics and five analytics industry practitioners in Victoria, Australia were interviewed in three iterative prototyping phases, followed by a final focus group session with business information systems students that evaluated the proposed structure. The findings suggest that the proposed DBR ensures the design of an innovative data science degree that may meet growing industry and interdisciplinary demands. The paper concludes by discussing the overall feasibility of the proposal in the Australian higher education sector, particularly for the case context of an Australian university.
... The understanding of data science as the field concerned with all aspects of making sense of data goes back to discussions in the scientific community that started with Tukey (1962) and were summarized by Cleveland (2001) in calling for an independent scientific discipline extending the technical areas of the field of statistics. Notable mentions go to the foundation of the field of "knowledge discovery in databases" (Fayyad et al, 1996) after the first KDD workshop in 1989, the first mention of "data science" in the title of a scientific conference in 1996 (Hayashi et al., 1996), and Leo Breiman's (2001) famous call to unite statistical and computational approaches to modeling data. ...
Chapter
Full-text available
What is data science? Attempts to define it can be made in one (prolonged) sentence, while it may take a whole book to demonstrate the meaning of this definition. This book introduces data science in an applied setting, by first giving a coherent overview of the background in Part I, and then presenting the nuts and bolts of the discipline by means of diverse use cases in Part II; finally, specific and insightful lessons learned are distilled in Part III. This chapter introduces the book and provides an answer to the following questions: What is data science? Where does it come from? What are its connections to big data and other mega trends? We claim that multidisciplinary roots and a focus on creating value lead to a discipline in the making that is inherently an interdisciplinary, applied science.
... This is the first time the term "data science" was used in the statistical community. Cleveland (Cleveland 2001) outlined a plan for a "new" discipline, broader than statistics, that he called "data science", but did not reference Wu's use of the term. The International Council for Science: Committee on Data for Science and Technology began publication of the Data Science Journal in 2002, and Columbia University began publication of The Journal of Data Science in 2003. ...
Preprint
Full-text available
Data science is a discipline that provides principles, methodology and guidelines for the analysis of data for tools, values, or insights. Driven by a huge workforce demand, many academic institutions have started to offer degrees in data science, many at the graduate, and a few at the undergraduate, level. Curricula may differ at different institutions because of varying levels of faculty expertise and because different disciplines (such as math, computer science, and business) develop the curriculum. The University of Massachusetts Dartmouth started offering degree programs in data science in Fall 2015, at both the undergraduate and the graduate level. Quite a few articles have been published that deal with graduate data science courses, much less so with undergraduate ones. Our discussion will focus on undergraduate course structure and function and, specifically, a first course in data science. Our design of this course centers around a concept called the data science life cycle: that is, we view the tasks or steps in the practice of data science as forming a process, consisting of states that indicate how it comes into being and how different tasks in data science depend on or interact with others until the birth of a data product or the reaching of a conclusion. Naturally, different pieces of the data science life cycle then form individual parts of the course. The details of each piece are filled in with concepts, techniques, or skills that are popular in industry. Consequently, the design of our course is both "principled" and practical. A significant feature of our course philosophy is that, in line with activity theory, the course is based on the use of tools to transform real data in order to answer strongly motivated questions related to the data.
... An etymological study of the origin of the term data science certainly remains to be done. From the perspective of statistics, Cleveland's 2001 article [3] is primarily named as the source of the term, although the statistician Jeff Wu is said to have proposed as early as 1997, in a lecture, that statistics would better be called data science. The definition proposed by Cleveland and Wu builds on Tukey [13], who in 1962 called for a reform of statistics, away from mathematics and toward the applied science of data analysis. ...
Article
Full-text available
Abstract: Data science is the new buzzword; after big data and digitalization, now data science. Job boards are full of advertisements, data scientists are desperately sought, and some applicants now add data science to their profiles to improve their job prospects. But what actually is data science? The following article approaches this question from the perspective of a statistician, without attempting to give a final definition of data science.
... These changes have been amplified by advances in information and communication technologies, in particular computer networks. Among these advances, specific ... stand out, but they can be combined in an interdomain mode (CLEVELAND, 2001). The support drawn from the literature for discussing the main aspects of this relationship is justified by specialists in the topic, who argue that greater efforts are needed to understand the foundations of current research data curation practices, through a long-term understanding. And, just as important as understanding the nature and attributes of this modern configuration of scientific dynamics, here called data-oriented, is presenting an overview of what has been published under the aegis of this theme. ...
Article
Full-text available
Introduction: The current configuration of the dynamics of scientific production and communication reveals the protagonism of data-oriented science, in a broad conception represented mainly by terms such as "e-Science" and "Data Science". Objectives: To present the worldwide scientific production related to data-oriented science based on the terms "e-Science" and "Data Science" in Scopus and Web of Science between 2006 and 2016. Methodology: The research is structured in five stages: a) searching for information in the Scopus and Web of Science databases; b) retrieving the bibliometric records; c) complementing the keywords; d) correcting and cross-referencing the data; e) analytical representation of the data. Results: The most prominent terms in the scientific production analyzed were Distributed computer systems (2006), Grid computing (2007 to 2013) and Big data (2014 to 2016). In Library and Information Science, the emphasis falls on the themes Digital library and Open access, evidencing the field's centrality in discussions about mechanisms for providing access to scientific information in digital media. Conclusions: From a diachronic perspective, there is a visible shift of focus from themes concerned with data-sharing operations to the analytical perspective of searching for patterns in large volumes of data. Keywords: Data Science. E-Science. Data-oriented science. Scientific production. Link: http://www.uel.br/revistas/uel/index.php/informacao/article/view/26543/20114
... What is clear however is that several disciplines claim ownership, with early references within both Computer Science (Naur 1974) and Statistics (Wu 1997;Cleveland 2001;Provost and Fawcett 2013). Data Science is also promoted widely by industry as the solution to the problem of making sense of and monetizing the increasing volumes of "Big Data" produced by computer-mediated systems (Kitchin 2014b;Varian 2014). ...
Article
Full-text available
It is widely acknowledged that the emergence of “Big Data” is having a profound and often controversial impact on the production of knowledge. In this context, Data Science has developed as an interdisciplinary approach that turns such “Big Data” into information. This article argues for the positive role that Geography can have on Data Science when being applied to spatially explicit problems; and inversely, makes the case that there is much that Geography and Geographical Analysis could learn from Data Science. We propose a deeper integration through an ambitious research agenda, including systems engineering, new methodological development, and work toward addressing some acute challenges around epistemology. We argue that such issues must be resolved in order to realize a Geographic Data Science, and that such goal would be a desirable one.
... The advancements of tools and techniques for collecting, preprocessing and analyzing Big Data have resulted in a new interdisciplinary domain called Data Science, which utilizes the knowledge developed in areas such as statistics, informatics, computing, communication, management, and sociology [18,30,33]. Cleveland proposed Data Science as an independent field of study [34]. ...
Conference Paper
Full-text available
The past two decades are characterized by a tremendous growth in the amount of data generated and recorded in computer repositories. To learn from and benefit from accumulated data, people need to use information technologies to retrieve, process, analyze and explore huge amounts of data. Consequently, terms such as Big Data, data analytics, machine learning, deep learning, etc. appeared to mark the dependence of practically all aspects of human life on data and on instruments to explore data. The term Data Science represents in the best possible way the complexity and breadth of expertise needed nowadays. Developing competences and training professionals in this field represents a significant challenge to educational institutions. Professionals in the field of Data Science, known as Data Scientists, need to possess competences in various areas such as statistics, informatics, computing, communication, management, sociology, economics, etc. Data Science has thus emerged as an inter-, multi- and even transdisciplinary area of knowledge. Many authors investigate the range of competences, knowledge and skills a Data Scientist needs to master. Although in many cases the focus is on technical skills, working with data requires mastery of a huge variety of skills and abilities. In fact, a combination of analytical, statistical, algorithmic, engineering, and technical skills has to be possessed to mine relevant data by involving contextual domain information. In previous studies, we have shown that analytical competences represent the cross-point of all other hard (technical) and soft (non-technical, e.g. communication, collaboration, curiosity, etc.) skills, especially in the Big Data context. For building such competences, a certain level of maturity and experience is essential, and graduate level is the natural choice in building educational programs to train Data Scientists. Many factors may influence the success of a graduate program in such a complicated field. Among the rest, we consider assessment of students' entry background as essential. Analytical thinking expertise may serve as the key for students coming from different Bachelor degree programs to succeed in a Data Science Master program. The paper shares the development of a questionnaire to assess analytical thinking among current students in IT-related Bachelor degree programs. Motivated by the aims listed above, the research addresses the following questions: (1) Do prospective students have substantial analytical skills to study a Data Science Master program? (2) Can we reveal the potential success that students could achieve as analysts when they graduate? (3) Can we improve the Data Science Master program to achieve shifting educational patterns and analytical thinking development? We have examined principles of analytical thinking and many relevant tests, such as existing assessments of analytical thinking, critical thinking, problem-solving skills, etc. The main part of the questionnaire consists of logical problems in three different formats: math questions, text assignments and figure pattern recognition. It also includes two questions which measure how students themselves rate their analytical thinking skills and dispositions.
... The data revolution has led to an increased interest in the practice of data analysis and increased demand for training and education in this area [Cleveland, 2001, Nolan and Lang, 2010, Workgroup, 2014, Baumer, 2015, PricewaterhouseCoopers, 2019, Hardin et al., 2015, Kaplan, 2018, Hicks and Irizarry, 2018. This revolution has also led to the creation of the term data science, which is often defined only in relation to existing fields of study [Conway, 2010, Tierney, 2012, Matter, 2013, Harris, 2013, such as the intersection of computer science, statistics, and substantive expertise. ...
Preprint
Full-text available
The data revolution has led to an increased interest in the practice of data analysis. As a result, there has been a proliferation of "data science" training programs. Because data science has been previously defined as an intersection of already-established fields or a union of emerging technologies, the following problems arise: (1) there is little agreement about what data science is; (2) data science becomes secondary to established fields in a university setting; and (3) it is difficult to have discussions on what it means to learn about data science, to teach data science courses and to be a data scientist. To address these problems, we propose to define the field from first principles, based on the activities of people who analyze data, with a language and taxonomy for describing a data analysis in a manner spanning disciplines. Here, we describe the elements and principles of data analysis. This leads to two insights: it suggests a formal mechanism to evaluate data analyses based on objective characteristics, and it provides a framework to teach students how to build data analyses. We argue that the elements and principles of data analysis lay the foundational framework for a more general theory of data science.
... With regard to the first group, it is Cleveland (2001) who established six technical areas that would make up the field of Data Science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory. Adapting these six areas to the competencies of the first group within the Data Science professional family, three positions are needed: data scientist, Big Data specialist, and data analyst. ...
... Statisticians now carry out data analysis work, and some authors (Cleveland, 2001) suggest that this is the new direction at which statistics should aim, since within data science the process of cleaning or transforming data requires mathematical, and particularly statistical, handling. ...
Article
Full-text available
This article presents a review of work on the application of data science methods to calculating and managing the financial risks of engineering projects, focusing on simulation and the results obtained, and analyzing the historical data, countries, authors, topics and sectors in which the object of study has been developed, using the specialized tools VOSviewer and Scopus, finding that this is a new topic that has been gaining relevance in recent years.
... The emphasis on mathematical statistics with theorems and proofs was too narrow, and he introduced the term "Data Analysis" with a focus on techniques for analyzing and interpreting data and on the design of experiments for collecting data having high information content, which he felt should become the new areas of focus. In his 2001 paper entitled "Data Science: An Action Plan for Expanding the Technical Areas in the Field of Statistics", William S. Cleveland suggested a plan for how academic statistics departments should reframe their work [9]. The abstract reads: "An action plan to expand the technical areas of statistics focuses on the data analyst. ...
Article
Full-text available
With the increasing availability of large amounts of data, methods that fall under the term data science are becoming important assets for chemical engineers to use. Broadly speaking, methods are needed to carry out three tasks, namely data management, statistical and machine learning, and data visualization. While claims have been made that data science is essentially statistics, consideration of the three tasks previously mentioned makes it clear that it is really broader than statistics alone; furthermore, statistical methods from a data-poor era are likely insufficient. While there have been many successful applications of data science methodologies, there are still many challenges that must be addressed. For example, just because a dataset is large does not necessarily mean it is meaningful or information rich. From an organizational point of view, a lack of domain knowledge and a lack of a trained workforce, among other issues, are cited as barriers to the successful implementation of data science within an organization. Many of the methodologies employed in data science are familiar to chemical engineers; however, it is generally the case that not all the methods required to carry out data science projects are covered in an undergraduate chemical engineering program. One option to address this is to adjust the curriculum by modifying existing courses and introducing electives. Other examples include the introduction of a data science minor, a postgraduate certificate, or a Master's program in data science.
... The subject matter of this book is a sub-set of what in recent decades has become known as data science. Data science has been used as a term since the 1960s, albeit in a modern sense it has grown in popularity since the early 1990s, leading to the publication of William S. Cleveland's paper titled 'Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics' (Cleveland, 2001) and the launch of the Journal of Data Science in 2003 (www.jds-online.com). ...
... In this lecture, as well as in a lecture from 1998 [6], Wu uses the term Data Science as a modern name for statistics. According to Yan [30], the author of [29], Cleveland, "in his publication of 2001, outlines a plan for a 'new' discipline, wider than statistics, which he called Data Science, but he did not refer to the term coined by Wu". ...
Conference Paper
Full-text available
With the vast amounts of data available in the world, most companies today are focusing on data usage to identify strengths, weaknesses, and opportunities in business. An area called "data science" appeared, closely related to the data mining concept. "Data science" is a term that has entered public perception and imagination only since the first half of the decade, but today it is extremely popular in science, practice, and education. It includes tools, methods, and systems applied to big data to leverage knowledge for decision-making. Based on the literature review and initial findings, this research study found that there are differences in the understanding of the term "data science" between academics and practitioners. This article aims to outline, on the one hand, a comprehensive framework of data science competences and, on the other, to sum up the inter- and multidisciplinarity of the area. As a result, a review of the existing definitions and the fundamental concept of Data Science is presented. The available opportunities for competency development in the Data Science area are discussed.
... Jeff Wu, an engineering professor at Georgia Institute of Technology, is reported to have used the term in the 1970s to refer to statistical data analysis (Sundaresan, 2016). An environmental statistician used it in 2001 (Cleveland, 2001). The scientific community has been considering the benefits and risks of these developments for some time. ...
Chapter
Full-text available
This chapter considers the implications of the increasingly common juxtaposition of three keywords - surveillance, power and communication – specifically in the context of the role of algorithms in society.
... The concept of "data science" was originally proposed in the statistics and mathematics community [23,24], at which time it essentially concerned data analysis. Today, the art of data science [17] goes beyond specific areas like data mining and machine learning, and it is argued that data science is the next generation of statistics [8,10,12]. Data science is becoming a very rich concept which carries the vision and responsibilities of an independent scientific field that is systematic and interdisciplinary. ...
Preprint
While data science has emerged as a contentious new scientific field, enormous debate and discussion have been devoted to why we need data science and what makes it a science. In reviewing hundreds of pieces of literature which include data science in their titles, we find that the majority of the discussions essentially concern statistics, data mining, machine learning, big data, or broadly data analytics, and that only a limited number of new data-driven challenges and directions have been explored. In this paper, we explore the intrinsic challenges and directions inspired by comprehensively examining the complexities and intelligence embedded in data science problems. We focus on the research and innovation challenges inspired by the nature of data science problems as complex systems, and on the methodologies for handling such systems.
... With regard to the first group, it is Cleveland (2001) who established six technical areas that would make up the field of Data Science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory. Adapting these six areas to the competencies of the first group within the Data Science professional family, three positions are needed: data scientist, Big Data specialist, and data analyst. ...
Article
Full-text available
This paper explores, based on the recognition and definition of the new digital paradigm, the following questions: first, the need to catalog the competencies and skills for emerging professions in the economy, business and communication; second, the recognition of a historic opportunity for the necessary theoretical and methodological innovation in the social sciences and humanities; and third, the application of artificial intelligence (AI) to improve the quality of scientific publications. These three issues are central, in the authors' view, insofar as all three bear on the necessary renewal of the training of the people who will have to manage data of all kinds that affect the ways of life of all individuals. Therefore, after identifying the shortcomings in formal training systems, this paper sets out the opportunities that the new digital paradigm offers, both theoretically and in the field of scientific publication, to face the unavoidable challenges of the new situation.
Article
Like many other disciplines, the information systems (IS) community has embraced big data analytics and data science. However, in the rush to exploit the popularity of this latest trend, the areas of big data analytics and data science that are most relevant to the IS field are not made clear. While many consider data analytics as an evolution of decision support systems (DSS), that is, as a technology that needs to be managed or enhanced, this essay traces the complex origins and philosophy of analytics instead back to Luhn’s text analytics in the late 1950s, Naur’s Computing as a Human Activity and his datalogy, Tukey’s Future of Data Analysis of the 1960s, and Codd’s relational database schema in the 1970s, well before big data analytics and data science became industry buzzwords. Many of what is now considered mainstream thinking in big data analytics and data science can be traced back to these visionaries. This essay examines the implications of the complex origins of data analytics and data science for the IS field, specifically on how those different discourses impact future research and practice.
Experiment Findings
Full-text available
Abstract: We present a vision of Data Science in light of experience with the roles of Statistics and Computation. From the analysis developed emerges a proposed curriculum for an 8-semester degree program. Key Words: Data Sciences, Data Science, Statistics, Computation, curriculum
Article
The efficiency of the public sector results from public resources and is calculated with multidimensional indicators, which can evaluate effects through the consumption of resource units. Analytical dashboards visualize selected data for a given purpose and enable users to see what is happening and to act. Our paper identifies the current situation of these indicators for Hunedoara district in the West Region, Romania, in EU statistics. Then, we present the most widely used solutions for analytical dashboards. In the final part, our study presents a portion of the research results, obtained on local administration in a district of the West Region, as well as the best analytical dashboard for our public administration.
Chapter
The aim of this chapter is to describe how to implement a Big Data strategy to boost the Social and Solidarity Economy (SSE). Because of falling prices of ICT systems, computing technology has changed, and new techniques for distributed computing have become mainstream. With this evolution it is now possible to manage immense volumes of data that previously could have been handled only by supercomputers at great expense. Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances and improving the profitability of many enterprises. Large companies can thus invest more money in these tools and consequently have more opportunities to obtain good results. This new situation will widen the gap between large and small organizations, especially organizations of modest economic capacity such as those that belong to the SSE. Therefore, in this research we have carried out a complete development of software, techniques and tools for implementing Big Data in the SSE, helping these organizations to narrow the gap with large ones.
Article
Full-text available
Nowadays it is increasingly common to encounter alternative ways of measuring the development and well-being of nations. One approach that has recently attracted interest in specialized analysis and decision-making circles is the measurement of the happiness of peoples. At the same time, the significant increase in the use of technological tools such as digital social networks for obtaining information, social communication, exchanging opinions on public affairs, and generating public opinion in general makes these tools very useful for understanding multiple social phenomena. In this context, experimental analyses have been carried out in Mexico that attempt to gauge the mood of the population using digital social networks as a source, specifically Twitter, by exploiting the metadata of that network. This paper offers a theoretical approach to happiness studies and then reports on the specific case of measuring the happiness of Twitter users in Mexico. The conclusions highlight the viability and promising future of experimental measurements, in this case based on metadata, for the analysis of diverse aspects of social phenomenology.
Article
The headline asks: “How do you say ‘data science’ in Latin?” But perhaps the more pertinent question is, why should we care? Matthew A. Jay and Mario Cortina Borja look to the past to better understand a thoroughly modern phrase.
Thesis
The evolution of telecommunications has led today to a proliferation of connected devices and a massive growth of multimedia services. Faced with this increased demand for services, operators need to adapt the operation of their networks in order to keep guaranteeing a certain level of quality of experience to their users. To do so, operator networks are moving toward more cognitive, even autonomic, operation. The aim is to equip networks with the means to exploit all the information and data at their disposal, helping them make the best decisions about their services and their operation by themselves, or even to manage themselves. It is thus a matter of introducing artificial intelligence into networks. This requires putting in place the means to exploit the data and to perform machine learning on it, building generalizable models that supply the information needed to optimize decisions. Taken together, these means constitute a scientific discipline known today as data science. This thesis is part of a broader effort to show the value of introducing data science into various network operation processes. It comprises two algorithmic contributions corresponding to use cases of data science for operator networks, and two software contributions aimed at facilitating, on the one hand, the analysis, and on the other hand, the deployment of algorithms produced by data science. The conclusive results of this work demonstrate the value and feasibility of using data science for operating operator networks. These results have also been taken up by several related projects.
Article
Full-text available
Presentation
Full-text available
This presentation was given at the "Mente, Cerebro y Comportamiento" (Mind, Brain and Behavior) research center (CIMCYC) of the Universidad de Granada on November 10, 2018.
Chapter
Nowadays, a large pool of different machine learning components (i.e., algorithms and tools) exists that can successfully predict decisions in many problem domains. Unfortunately, a problem has emerged: without extensive experimental work, we cannot safely estimate which component will behave well on a particular dataset. Consequently, designers and developers must try as many methods as possible during experimental work to establish which is most appropriate for the specific problem. To address this challenge, researchers have proposed customized classification pipelines based on a framework of search algorithms, machine learning tools, and appropriate parameters for these algorithms, capable of working independently of user knowledge. Until recently, the majority of these pipelines were constructed using genetic programming. In this paper, a new method is proposed for evolving classification pipelines automatically, founded on stochastic nature-inspired population-based optimization algorithms. The algorithms act as a tool for modeling customized classification pipelines consisting of the following tasks: choosing the proper preprocessing method, selecting the appropriate classification tool, and optimizing the model hyperparameters. The evaluation of the customized classification pipelines also showed potential for using the proposed method in the real world.
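As a rough illustration of the pipeline-construction task the chapter describes, the sketch below runs a plain stochastic search over (preprocessing method, classifier, hyperparameters) combinations; random sampling stands in for the nature-inspired population-based optimizers the authors actually use, and the component choices are illustrative, not the chapter's.

# Minimal sketch: stochastic search over classification pipelines.
# Plain random sampling is a stand-in for the paper's nature-inspired
# population-based optimizers; the components chosen are illustrative.
import random
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def random_pipeline():
    # Task 1: choose a preprocessing method.
    scaler = random.choice([StandardScaler(), MinMaxScaler()])
    # Tasks 2 and 3: choose a classifier and its hyperparameters.
    if random.random() < 0.5:
        clf = KNeighborsClassifier(n_neighbors=random.randint(1, 15))
    else:
        clf = DecisionTreeClassifier(max_depth=random.randint(1, 10))
    return Pipeline([("scale", scaler), ("clf", clf)])

best_score, best_pipe = -1.0, None
for _ in range(30):  # evaluation budget of 30 candidate pipelines
    pipe = random_pipeline()
    score = cross_val_score(pipe, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_pipe = score, pipe

print(best_score, best_pipe)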
Preprint
When doing data science, it's important to know what you're building. This paper describes an idealized final product of a data science project, called a Continuously Updated Data-Analysis System (CUDAS). The CUDAS concept synthesizes ideas from a range of successful data science projects, such as Nate Silver's FiveThirtyEight. A CUDAS can be built for any context, such as the state of the economy, the state of the climate, and so on. To demonstrate, we build two CUDAS systems. The first provides continuously-updated ratings for soccer players, based on the newly developed Augmented Adjusted Plus-Minus statistic. The second creates a large dataset of synthetic ecosystems, which is used for agent-based modeling of infectious diseases.
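To make the CUDAS idea concrete, here is a minimal sketch, under assumptions of our own, of the kind of update loop such a system might run: poll for new records, fold them into a running estimate, and republish. The data feed and the incremental-mean model are invented stand-ins, not the preprint's soccer-rating or ecosystem systems.

# Minimal sketch of a continuously updated data-analysis loop.
# fetch_new_records() and publish() are hypothetical stand-ins.
import random
import time

def fetch_new_records():
    # Hypothetical data feed; a real CUDAS would query match results,
    # economic indicators, sensor readings, etc.
    return [random.gauss(0.0, 1.0) for _ in range(5)]

state = {"n": 0, "mean": 0.0}  # running estimate maintained across updates

def update(state, records):
    # Incremental (Welford-style) mean update, so no history is stored.
    for x in records:
        state["n"] += 1
        state["mean"] += (x - state["mean"]) / state["n"]

def publish(state):
    # Hypothetical output channel for the refreshed analysis.
    print(f"n={state['n']}, estimate={state['mean']:.3f}")

for _ in range(3):      # a real system would loop indefinitely
    update(state, fetch_new_records())
    publish(state)
    time.sleep(1)       # poll interval; hours or days in practice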
Conference Paper
Undergraduates and postgraduates in science subjects are increasingly expected to conduct their data analyses using R, SQL and Python. This requires instructors to develop resources that get students up and running quickly. This study presents and evaluates a learning design that (1) uses a pattern-oriented tutorial to teach language-independent key operations for implementing data analytic queries, and (2) uses cheat sheets to show how these operations map onto language-specific syntax. The evaluation study (N=21) concludes that, using this approach, two thirds of the data science novices sampled could implement simple to moderately complex queries in all the aforementioned languages within two hours. A permutation test moreover produced a significant main effect of language, with SQL ranking highest in accuracy. The results form part of a general discussion on the merits and language-dependent feasibility of pattern-oriented aids for accelerated data science instruction.
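As an illustration of the kind of language-independent pattern such a tutorial teaches, the sketch below implements one key operation sequence, filter then group then aggregate, in pandas, with the equivalent SQL shown in comments; the table and column names are invented for the example.

# One language-independent query pattern: filter -> group -> aggregate.
# Equivalent SQL:
#   SELECT species, AVG(body_mass) AS mean_mass
#   FROM animals
#   WHERE region = 'north'
#   GROUP BY species;
import pandas as pd

animals = pd.DataFrame({
    "species": ["fox", "fox", "hare", "hare"],
    "region": ["north", "south", "north", "north"],
    "body_mass": [5.1, 4.8, 3.2, 3.4],
})

mean_mass = (
    animals[animals["region"] == "north"]    # WHERE region = 'north'
    .groupby("species")["body_mass"]         # GROUP BY species
    .mean()                                  # AVG(body_mass)
    .rename("mean_mass")                     # AS mean_mass
)
print(mean_mass)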
Article
Purpose Data science is a relatively new field which has gained considerable attention in recent years. This new field requires a wide range of knowledge and skills from different disciplines including mathematics and statistics, computer science and information science. The purpose of this paper is to present the results of the study that explored the field of data science from the library and information science (LIS) perspective. Design/methodology/approach Analysis of research publications on data science was made on the basis of papers published in the Web of Science database. The following research questions were proposed: What are the main tendencies in publication years, document types, countries of origin, source titles, authors of publications, affiliations of the article authors and the most cited articles related to data science in the field of LIS? What are the main themes discussed in the publications from the LIS perspective? Findings The highest contribution to data science comes from the computer science research community. The contribution of information science and library science community is quite small. However, there has been continuous increase in articles from the year 2015. The main document types are journal articles, followed by conference proceedings and editorial material. The top three journals that publish data science papers from the LIS perspective are the Journal of the American Medical Informatics Association , the International Journal of Information Management and the Journal of the Association for Information Science and Technology . The top five countries publishing are USA, China, England, Australia and India. The most cited article has got 112 citations. The analysis revealed that the data science field is quite interdisciplinary by nature. In addition to the field of LIS the papers belonged to several other research areas. The reviewed articles belonged to the six broad categories: data science education and training; knowledge and skills of the data professional; the role of libraries and librarians in the data science movement; tools, techniques and applications of data science; data science from the knowledge management perspective; and data science from the perspective of health sciences. Research limitations/implications The limitations of this research are that this study only analyzed research papers in the Web of Science database and therefore only covers a certain amount of scientific papers published in the field of LIS. In addition, only publications with the term “data science” in the topic area of the Web of Science database were analyzed. Therefore, several relevant studies are not discussed in this paper that are not reflected in the Web of Science database or were related to other keywords such as “e-science,” “e-research,” “data service,” “data curation” or “research data management.” Originality/value The field of data science has not been explored using bibliographic analysis of publications from the perspective of the LIS. This paper helps to better understand the field of data science and the perspectives for information professionals.
Chapter
How can big data be leveraged to create value and what are the main barriers that prevent companies from benefiting from the full potential of data and analytics? This chapter describes the phenomenon of big data and how its use through data science is dramatically changing the basis of competition. The chapter also delves into the main organizational challenges faced by companies in extracting value from data, namely the promotion of a data-driven culture, the design of the internal and external structures, and the acquisition of the technical and behavioral skills required by big data professional roles. The aim and the structure of the book are illustrated. Shedding light on the human side of big data through the lens of emotional intelligence, the book aims to provide an in-depth understanding of the behavioral competencies that big data profiles require in order to achieve a higher performance.
Article
Full-text available
The aim of this study is to give information about data science, data scientists, and the importance of data science for businesses. Advances in computer and cloud technologies have led to a significant increase in the structure and amount of data produced and stored. This increase in the amount of information has created a new area of interest called "data science". Data science is a multidisciplinary field that draws on mathematics, statistics, computer science, and business strategy to generate added value from data. Data science not only uses data but also converts data into a format that can be analyzed quickly and efficiently. Data science is especially important for large companies that produce large amounts of data. Today, companies solve many complex business problems using data. Companies also increase their profit rates by making their marketing strategies customer-oriented.
Article
Students of reference service can benefit by learning about data science. This column introduces the topic, the needed skills, and sources that support acquiring the related experience.
Article
Full-text available
This article presents a model for developing case studies, or labs, for use in undergraduate mathematical statistics courses. The model proposed here is to design labs that are more in-depth than most examples in statistical texts by providing rich background material, investigations and analyses in the context of a scientific problem, and detailed theoretical development within the lab. An important goal of this approach is to encourage and develop statistical thinking. It is also advocated that the labs be made the centerpiece of the theoretical course. As a result, the curriculum, lectures, and assignments are significantly restructured. For example, the course work includes written assignments based on open-ended data analyses, and the lectures include group work and discussions of the case-studies.
Article
Full-text available
Higher education faces an environment of financial constraints, changing customer demands, and loss of public confidence. Technological advances may at last bring widespread change to college teaching. The movement for education reform also urges widespread change. What will be the state of statistics teaching at the university level at the end of the century? This article attempts to imagine plausible futures as stimuli to discussion. It takes the form of provocations by the first author, with responses from the others on three themes: the impact of technology, the reform of teaching, and challenges to the internal culture of higher education.
Article
How does statistical thinking differ from mathematical thinking? What is the role of mathematics in statistics? If you purge statistics of its mathematical content, what intellectual substance remains? In what follows, we offer some answers to these questions and relate them to a sequence of examples that provide an overview of current statistical practice. Along the way, and especially toward the end, we point to some implications for the teaching of statistics.
Article
Significant advances in, and the resultant impact of, Information Technology (IT) during the last fifteen years have resulted in a much more data-based society, a trend that can be expected to continue into the foreseeable future. This phenomenon has had a real impact on the Statistics discipline and will continue to result in changes in both content and course delivery. Major research directions have also evolved during the last ten years directly as a result of advances in IT. The impact of these advances has started to flow into course content, at least for advanced courses. One question which arises is what impact this will have on the future training of statisticians, both with respect to course content and mode of delivery. At the tertiary level the last 40 years have seen significant advances in theoretical aspects of the Statistics discipline. Universities have been outstanding at producing scholars with a strong theoretical background, but questions have been asked as to whether this has, to some degree, been at the expense of appropriate training of the users of statistics (the 'tradespersons'). Future directions in the teaching and learning of Statistics must take into account the impact of IT together with the competing need to produce scholars as well as competent users of statistics to meet the future needs of the market place. For Statistics to survive as a recognizable discipline, the ability to train statisticians who can communicate is also seen as an area of crucial importance. Satisfying the needs of society as well as meeting the needs of the profession are considered the basic determinants that will drive the future teaching and training of statisticians at the tertiary level, and they form the basis of this presentation.
Article
Aspects of scientific method are discussed: In particular, its representation as a motivated iteration in which, in succession, practice confronts theory, and theory, practice. Rapid progress requires sufficient flexibility to profit from such confrontations, and the ability to devise parsimonious but effective models, to worry selectively about model inadequacies and to employ mathematics skillfully but appropriately. The development of statistical methods at Rothamsted Experimental Station by Sir Ronald Fisher is used to illustrate these themes.
Article
The nature of data is rapidly changing. Data sets are becoming increasingly large and complex. Modern methodology for analyzing these new types of data is emerging from the fields of Database Management, Artificial Intelligence, Machine Learning, Pattern Recognition, and Data Visualization. So far Statistics as a field has played a minor role. This paper explores some of the reasons for this, and why statisticians should have an interest in participating in the development of new methods for large and complex data sets.
Article
The statistics community is showing increasing interest in consulting in industry. This interest has stimulated questions concerning recognition and job satisfaction, job opportunities, and educational and training needs. These questions are considered in this article. A central theme is that effective statistical consulting requires total involvement in the consulting situation and that good recognition flows naturally from such an approach. This concept is defined in operational terms.
Article
The profession of statistics has adopted too narrow a definition of itself. As a consequence, both statistics and statisticians play too narrow a role in policy formation and execution. Broadening that role will require statisticians to change the curriculum they use to train and develop their own professionals and what they teach nonstatisticians about statistics. Playing a proper role will require new research from statisticians that combines our skills in methods with other techniques of social scientists.
Article
Data analysis is not a new subject. It has accompanied productive experimentation and observation for hundreds of years. At times, as in the work of Kepler, it has produced dramatic results.
Article
Enormous quantities of data go unused or underused today, simply because people can't visualize the quantities and relationships in it. Using a downloadable programming environment developed by the author, Visualizing Data demonstrates methods for representing data accurately on the Web and elsewhere, complete with user interaction, animation, and more. How do the 3.1 billion A, C, G and T letters of the human genome compare to those of a chimp or a mouse? What do the paths that millions of visitors take through a web site look like? With Visualizing Data, you learn how to answer complex questions like these with thoroughly interactive displays. We're not talking about cookie-cutter charts and graphs. This book teaches you how to design entire interfaces around large, complex data sets with the help of a powerful new design and prototyping tool called "Processing". Used by many researchers and companies to convey specific data in a clear and understandable manner, the Processing beta is available free. With this tool and Visualizing Data as a guide, you'll learn basic visualization principles, how to choose the right kind of display for your purposes, and how to provide interactive features that will bring users to your site over and over. This book teaches you: the seven stages of visualizing data (acquire, parse, filter, mine, represent, refine, and interact); how all data problems begin with a question and end with a narrative construct that provides a clear answer without extraneous details; several example projects with the code to make them work; and the positive and negative points of each representation discussed. The focus is on customization so that each one best suits what you want to convey about your data set. The book does not provide ready-made "visualizations" that can be plugged into any data set. Instead, with chapters divided by types of data rather than types of display, you'll learn how each visualization conveys the unique properties of the data it represents -- why the data was collected, what's interesting about it, and what stories it can tell. Visualizing Data teaches you how to answer questions, not simply display information.
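As a rough sketch of the seven stages named above, applied to a toy dataset in Python rather than the book's Processing environment (the data and labels are invented for illustration):

# The seven stages on a toy dataset: acquire, parse, filter, mine,
# represent, refine, interact. Data are invented for illustration.
import matplotlib.pyplot as plt

raw = "2017,12\n2018,15\n2019,11\n2020,18"               # acquire: raw text
rows = [line.split(",") for line in raw.splitlines()]     # parse: give it structure
data = [(int(y), int(v)) for y, v in rows]
data = [(y, v) for y, v in data if y >= 2018]             # filter: keep what matters
peak = max(data, key=lambda p: p[1])                      # mine: basic statistics
years, values = zip(*data)
plt.bar(years, values)                                    # represent: choose a display
plt.annotate("peak", xy=peak)                             # refine: emphasize the story
plt.xticks(years)
plt.show()                                                # interact: pan/zoom in the viewer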
Article
An efficient method for the calculation of the interactions of a 2^n factorial experiment was introduced by Yates and is widely known by his name. The generalization to 3^n was given by Box et al. (1). Good (2) generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N × N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results in a procedure requiring a number of operations proportional to N log N rather than N^2. These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with N = 2^m and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients. Consider the problem of calculating the complex Fourier series
$$X(j) = \sum_{k=0}^{N-1} A(k)\, W^{jk}, \qquad j = 0, 1, \ldots, N-1,$$
where $W = e^{2\pi i/N}$.
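A compact radix-2 version of this idea, assuming N = 2^m, is sketched below: the series is split recursively into even- and odd-indexed halves, which yields the N log N operation count; a direct O(N^2) evaluation of the defining sum serves as a correctness check.

# Radix-2 Cooley-Tukey FFT sketch for N a power of two, using the
# abstract's convention X(j) = sum_k A(k) W^{jk} with W = e^{2*pi*i/N}.
import cmath

def fft(a):
    n = len(a)  # n must be a power of two
    if n == 1:
        return a[:]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for j in range(n // 2):
        w = cmath.exp(2j * cmath.pi * j / n)   # W^j
        out[j] = even[j] + w * odd[j]          # W^{j+n/2} = -W^j
        out[j + n // 2] = even[j] - w * odd[j]
    return out

def dft(a):
    # Direct O(N^2) evaluation of the defining sum, used as a check.
    n = len(a)
    return [sum(a[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

a = [1, 2, 3, 4, 0, 0, 0, 0]
assert all(abs(x - y) < 1e-9 for x, y in zip(fft(a), dft(a)))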
Article
Statisticians regularly bemoan what they perceive to be the lack of impact and appreciation of their discipline on the part of others. And it does seem to be true, as we approach the magic 2000 number, that we are faced with the paradox of more and more information being available, more and more complex problems to be solved – but, apparently, less and less direct appreciation of the role and power of statistical thinking. The purpose of the conference talk will be to explore one possible pedagogic route to raising awareness of the central importance of statistical thinking to good citizenship – namely, a focus on public policy issues as a means of creating an awareness and appreciation of the need for and power of statistical thinking. We all know that our discipline is both intellectually exciting and stimulating in itself – and that it provides the crucial underpinning of any would-be coherent, quantitative approach to the world around us. Indeed, one might even go so far as to say that our subject is 'the science of doing science', providing theory and protocols to guide and discipline all forms of quantitative investigatory procedure. We have certainly expanded our horizons way beyond the rather modest ambitions of the founding fathers of the Royal Statistical Society, who set out to 'collect, arrange, digest and publish facts, illustrating the condition and prospects of society in its material, social and moral relations'.
Article
There appears to be a paradox in the fact that as more and more quantitative information becomes routinely available in a world perceived to be ever more complex, there is less and less direct appreciation of the role and power of statistical thinking. It is suggested that the profession should exploit very real public concerns regarding risk aspects of public policy as a possible pedagogic route to raising statistical awareness.
Article
Stochastic substitution, the Gibbs sampler, and the sampling-importance-resampling algorithm can be viewed as three alternative sampling- (or Monte Carlo-) based approaches to the calculation of numerical estimates of marginal probability distributions. The three approaches will be reviewed, compared, and contrasted in relation to various joint probability structures frequently encountered in applications. In particular, the relevance of the approaches to calculating Bayesian posterior densities for a variety of structured models will be discussed and illustrated.
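As a minimal illustration of the Gibbs sampler the article reviews, the sketch below alternates draws from the full conditionals of a standard bivariate normal with correlation rho (a textbook example, not one of the article's structured models) and uses the draws to estimate the marginal distribution of X.

# Gibbs sampler sketch for a standard bivariate normal with correlation rho:
# X | Y=y ~ N(rho*y, 1 - rho^2), and symmetrically for Y | X=x.
import random

rho = 0.8
x, y = 0.0, 0.0
xs = []
for i in range(20000):
    x = random.gauss(rho * y, (1 - rho**2) ** 0.5)
    y = random.gauss(rho * x, (1 - rho**2) ** 0.5)
    if i >= 1000:          # discard burn-in draws
        xs.append(x)

mean = sum(xs) / len(xs)
var = sum((v - mean) ** 2 for v in xs) / len(xs)
print(mean, var)           # marginal of X should be close to N(0, 1)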
Article
This paper examines work in "computing with data"---in computing support for scientific and other activities to which statisticians can contribute. Relevant computing techniques, besides traditional statistical computing, include data management, visualization, interactive languages and user-interface design. The paper emphasizes the concepts underlying computing with data, focusing on how those concepts can help in practical work. We look at past, present, and future: some concepts as they arose in the past and as they have proved valuable in current software; applications in the present, with one example in particular, to illustrate the challenges these present; and new directions for future research, including one exciting joint project.