published: 09 February 2016
doi: 10.3389/fgene.2016.00012
Frontiers in Genetics | February 2016 | Volume 7 | Article 12
Edited by: Celia M. T. Greenwood, Lady Davis Institute, Canada
Reviewed by: Robert Nadon, McGill University, Canada
*Correspondence: Frank Emmert-Streib
Specialty section:
This article was submitted to
Statistical Genetics and Methodology,
a section of the journal
Frontiers in Genetics
Received: 25 October 2015
Accepted: 25 January 2016
Published: 09 February 2016
Emmert-Streib F, Moutari S and
Dehmer M (2016) The Process of
Analyzing Data is the Emergent
Feature of Data Science.
Front. Genet. 7:12.
doi: 10.3389/fgene.2016.00012
The Process of Analyzing Data is the
Emergent Feature of Data Science
Frank Emmert-Streib 1*, Salissou Moutari 2 and Matthias Dehmer 3
1 Computational Medicine and Statistical Learning Laboratory, Department of Signal Processing, Tampere University of Technology, Tampere, Finland; 2 Centre for Statistical Science and Operational Research, School of Mathematics and Physics, Queen’s University Belfast, Belfast, UK; 3 Department of Computer Science, Institute for Theoretical Informatics, Mathematics and Operations Research, Universität der Bundeswehr München, Neubiberg, Germany
Keywords: data science, computational biology, statistics, big data, high-throughput data
In recent years the term “data science” has gained considerable attention worldwide. In A Very Short History of Data Science, Press (2013) ascribes the first appearance of the term to Peter Naur in 1974 (Concise Survey of Computer Methods). Regardless of who used the term first and in what context, we think that data science is a good term to indicate that data are the focus of scientific research. This is analogous to computer science: the first department of computer science in the USA was established in 1962 at Purdue University, at a time when the first electronic computers were becoming available and it was not yet clear what computers could do; a new field was therefore created with the computer as the focus of study. In this paper, we want to address a couple of questions in order to demystify the meaning and the goals of data science in general.
The first question that comes to mind when hearing that there is a new field is: what makes such a field different from existing ones? For this discussion, the data science Venn diagram created by Drew Conway (Conway, 2003) is helpful. In Figure 1, we show a modified version of the original diagram as an Efron triangle (Efron, 2003), which includes metric information.
The important point to realize is that data science is not an entirely new field in the sense that it deals with problems outside any other field. Instead, the new contribution is its composition, consisting of at least three major fields, or dimensions, namely (1) domain knowledge, (2) statistics/mathematics, and (3) computer science. Here “domain knowledge” corresponds to a field that generates the data, e.g., biology, economics, finance, medicine, sociology, or psychology. The position of a particular field in Figure 1, i.e., its distances to the three corners of the triangle, (d1, d2, d3), provides information about the contribution of the three major fields, which can be seen as proportions or weights (see the examples in Figure 1). Overall, data science emerges at the intersection of these three fields, where the term “emerges” is important because there is more than just “adding” the three parts together.
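The distance-to-weight idea can be made concrete with a small numerical sketch. In the following Python snippet, the distances, the inverse-distance convention, and the function name are our own illustrative assumptions, not taken from Figure 1; it merely shows one plausible way to turn the three distances into normalized proportions:

```python
def weights_from_distances(d1, d2, d3):
    """Convert the distances (d1, d2, d3) to the triangle corners into
    proportions. A smaller distance to a corner means a larger weight
    for that field; inverse-distance weighting is one plausible
    convention, since the text does not fix a formula."""
    inv = [1.0 / d for d in (d1, d2, d3)]
    total = sum(inv)
    return [w / total for w in inv]

# A hypothetical field sitting closest to corner (1), domain knowledge:
w = weights_from_distances(0.2, 0.5, 0.8)
print([round(x, 2) for x in w])  # → [0.61, 0.24, 0.15]
```

Whatever convention is chosen, the weights sum to one, so they can be read directly as the proportions contributed by the three constituting fields.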
Before we come back to the emergent aspect of data science, let us give some specific examples of existing fields in the light of data science in particular application domains. Scientific fields with a long history of analyzing data by means of statistical methods are biostatistics and applied statistics. The computer science component in these fields is, however, not very noticeable, even though algorithmic methods are used, e.g., via software packages like SPSS (Statistical Package for the Social Sciences) or SAS (Statistical Analysis System). Such software packages are not fully flexible programming languages but provide only a rather limited set of statistical and visualization functions that can be applied to a data set, usually via a graphical user interface (GUI). Also, the preprocessing of the data sets themselves is difficult within the capabilities of such packages because of the lack of basic data manipulation functions.
FIGURE 1 | Schematic visualization of the constituting parts of data science and other disciplines in terms of the involvement of (1) domain knowledge, (2) statistics/mathematics, and (3) computer science.

A more recent field that evolved from elements of biostatistics and bioinformatics is computational biology. Depending on its definition, which differs somewhat between the US and Europe, computational biology is, in general, an example of a data science with a specific focus on biological and biomedical data. This is especially true since the completion
of the Human Genome Project has led to a series of new and affordable high-throughput technologies that allow the generation of a variety of different types of ’omics data (Quackenbush, 2011). Also, initiatives like The Cancer Genome Atlas (TCGA) (The Cancer Genome Atlas Research Network, 2008), which make the results of such large-scale experiments publicly available in the form of databases, contributed considerably to the establishment of computational biology as a data science, because without such data availability there would not be data science at all.
Maybe the best non-biological example of a problem that is purely based on data is the stock market. Here the prices of shares and stocks are continuously determined by the demand and supply levels of electronic transactions enabled by the different markets. For instance, the goal of day traders is to recognize stable patterns within the ordinary stochastic variations of price movements in order to forecast future prices reliably. Because a typical buying-and-selling cycle is completed within (a fraction of) a trading day, these prices usually go beyond real value differences, e.g., productivity changes of the companies whose shares are traded, and are more an expression of the expectations of the shareholders. For this reason, economic knowledge about balance sheets, income statements, and cash flow statements is not sufficient for making informed trading decisions, because the chart patterns themselves need to be analyzed by means of statistical algorithms.
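To illustrate, in its very simplest form, what analyzing chart patterns with statistical algorithms can mean, here is a toy Python sketch comparing two moving averages; the price series and window sizes are hypothetical, and this is not a trading method discussed in the text:

```python
def moving_average(prices, window):
    """Trailing moving average; one value per complete window."""
    return [sum(prices[i - window:i]) / window
            for i in range(window, len(prices) + 1)]

# Synthetic intraday prices (hypothetical numbers, purely illustrative):
prices = [100, 101, 103, 102, 104, 106, 105, 107]

fast = moving_average(prices, 2)  # reacts quickly to price movements
slow = moving_average(prices, 4)  # smooths out stochastic variation

# A fast average sitting above the slow one is a classic (and very
# crude) chart "pattern"; real day-trading methods are far more
# elaborate statistical procedures.
print(fast[-1] > slow[-1])  # → True
```

Even this crude comparison shows the general shape of the task: a statistical summary of the raw price stream, not fundamental economic data, drives the decision.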
From Figure 1, one might get the feeling that a data scientist is expected to have every skill of all three major fields. This is not true; it is merely an artifact of the two-dimensional projection of a multidimensional problem. For instance, a data scientist would not be expected to prove theorems about the convergence of a learning algorithm (Vapnik, 1995); this would be within the skill set of a (mathematical) statistician or a statistical learning theoretician. Likewise, conducting the wet lab experiments that generate the data in the first place is outside the skill set. That means a data scientist needs to have
interdisciplinary skills from a couple of different disciplines
but does not need to possess a complete skill set from all
of these fields. For this reason, these core disciplines do not
become redundant or obsolete, but will still make important
contributions beyond data science.
Importantly, there is an emergent component to data science
that cannot be explained by the linear summation of its
constituting elements described above. In our opinion this
emergent component is the dynamical aspect that makes every
data analysis a process. Others have named this data analysis
process the data analysis cycle (Hardin et al., 2015). This
includes the overall assessment of the problem, experimental
design, data acquisition, data cleaning, data transformation,
modeling and interpretation, prediction and the performance
of an exploratory as well as confirmatory analysis. Especially the exploratory data analysis part (Tukey, 1977) makes it clear that there is an interaction process between the data analyst and the data under investigation, which follows a systematic approach but is strongly influenced by the feedback from previous analysis steps, usually making a static description from the outset impossible. Metaphorically, the result of a data analysis process is like a cocktail with a taste that is beyond its constituting
ingredients. The process character of the analysis is also an expression of the fact that, typically, there is not just one method that answers a complex, domain-specific question; rather, it is the consecutive application of multiple methods that achieves this. From this perspective it also becomes clear why statistical inference is more than the mere understanding of the technicalities of one method: the output of one method forms the input of another, which requires a sequential understanding of decision processes.
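The point that the output of one method forms the input of the next can be sketched as a chain of functions. The following minimal Python illustration uses hypothetical steps standing in for real cleaning, transformation, and modeling procedures; it is a sketch of the process idea, not an implementation of any particular analysis:

```python
def clean(raw):
    """Data cleaning: drop missing observations."""
    return [x for x in raw if x is not None]

def transform(values):
    """Data transformation: cast everything to floats."""
    return [float(x) for x in values]

def model(values):
    """A trivial 'model': the sample mean."""
    return sum(values) / len(values)

def analysis_process(raw, steps):
    """Apply each analysis step to the previous step's output,
    making the analysis a sequential process rather than a single
    method call."""
    result = raw
    for step in steps:
        result = step(result)
    return result

estimate = analysis_process([1, None, 2, 3], [clean, transform, model])
print(estimate)  # → 2.0
```

In a real analysis each step would itself be revised in light of later results, which is exactly the feedback loop that makes the process cyclic rather than a fixed pipeline.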
From the above description, a layman may think that statistics
is data science, because the above elements can also be found
in statistics. However, this is not true. It is more what statistics
could be! Interestingly, we think it is fair to say that some (if
not all) of the founders of statistics, e.g., Fisher or Pearson, can be considered data scientists because they (1) analyzed, and showed genuine interest in, a large number of different data types and their underlying phenomena, (2) possessed the mathematical skills to develop new methods for their analysis, and (3) gathered large amounts of data and crunched numbers (by human labor). From this perspective one may wonder how statistics, which started out as a data science, could end up at a different point. Obviously, there had to be a driving force during the maturation of the field that prevented statistics from continuing along its original principles. We do not think this is due to a lack of creativity or insight on the part of statisticians who failed to recognize this deviation; instead, we think that the institutionalization of statistics, and of science in general, e.g., through the formation of departments of statistics and their bureaucratic management, together with an era of slow progress in the development of data generation technologies (compare the periods 1950–1970 and 2000–present), are major sources of this development. Naturally, the beginning of every scientific discipline is marked not by a well-defined description of the field and its goals but rather by a collection of people who share a common vision. It is therefore clear that formalizing a field inevitably restricts its scope in order to form clear boundaries with other disciplines. In addition, the increasing introduction of managerial structures in universities and departments is an accelerating factor that has led to a further reduction in flexibility and in the tolerance of variability in individual research interests. The latter is certainly true for every discipline, not just statistics.
More specific to statistics is the fact that technological progress between the 1930s and 1980s was rather slow compared to the developments of the last 30 years. These periods of stasis may have given the impression that developing fixed sets of methods is sufficient to deal with all possible problems. This may also explain the widespread usage of software packages like SPSS or SAS among statisticians, despite their limitation of not being fully flexible programming languages (Turing machines; Turing, 1936): they offer exactly these fixed sets of methods. In combination with the adaptation of the curriculum toward teaching students the usage of such software packages instead of programming languages, this causes problems nowadays, since the world has changed and every year new technologies appear that are accompanied by data types with new and challenging characteristics, not to mention the integration of such data sets. Overall, all of these developments made statistics more applied and also less flexible, opening in this way the door for a new field to fill the gap. This field is data science.
A contribution that should not be underestimated in making, e.g., computational biology a data science is the development of the statistical programming language R, pioneered by Robert Gentleman and Ross Ihaka (Altschul et al., 2013), and the establishment of the package repository Bioconductor (Gentleman et al., 2004). In our opinion the key to its success is the flexibility of R: it is particularly suited for statistical data analysis, yet has all the features of a multi-purpose language as well as interfaces to integrate programs written in other major languages like C++ or Fortran. Also, its license-free availability for all major operating systems, including Windows, Apple, and Linux, makes R an enabler for all kinds of data-related problems that is, in our opinion, currently without competition.
Interestingly, despite the fact that computational biology currently has all the attributes that make it a data science, the case of statistics teaches us that this does not have to remain so forever. A potential danger to the field is to spend too much effort on the development of complex and complicated algorithms when in fact a solution can be achieved by simple methods. Furthermore, the iterative refinement of methods that yields only marginal improvements of results consumes large amounts of resources and distracts from the original problem buried within given data sets. Last but not least, the increasing institutionalization of computational biology at universities and research centers may lead to a less flexible field, as discussed above. All such influences need to be battled because otherwise, in a couple of years from now, people may wonder how computational biology could leave the trajectory of being a data science.
Author Contributions: FS conceived the study. FS, SM, and MD wrote the paper and approved the final version.
Funding: MD thanks the Austrian Science Funds for supporting this work (project P26142). MD gratefully acknowledges financial support from the German Federal Ministry of Education and Research (BMBF) (project RiKoV, Grant No. 13N12304).
Acknowledgments: For professional proofreading of the manuscript we would like to thank Bárbara Macías Solís.
Altschul, S., Demchak, B., Durbin, R., Gentleman, R., Krzywinski, M., Li, H.,
et al. (2013). The anatomy of successful computational biology software. Nat.
Biotechnol. 31, 894–897. doi: 10.1038/nbt.2721
Conway, D. (2003). The data science venn diagram. Available online at: http://…science-venn-diagram
Efron, B. (2003). “The statistical century,” in Stochastic Musings: Perspectives
From the Pioneers of the Late 20th Century, ed J. Panaretos (New York, NY:
Psychology Press), 29–44.
Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit,
S., et al. (2004). Bioconductor: open software development for computational
biology and bioinformatics. Genome Biol. 5, R80. doi: 10.1186/gb-2004-5-
Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., et al.
(2015). Data science in statistics curricula: preparing students to ‘think with
data’. Am. Stat. 69, 343–353. doi: 10.1080/00031305.2015.1077729
Press, G. (2013). A very short history of data science. Forbes. Available online at: …short-history-of-
Quackenbush, J. (2011). The Human Genome: The Book of Essential Knowledge.
New York, NY: Imagine Publishing.
The Cancer Genome Atlas Research Network (2008). Comprehensive genomic
characterization defines human glioblastoma genes and core pathways. Nature
455, 1061–1068. doi: 10.1038/nature07385
Tukey, J. (1977). Exploratory Data Analysis. New York, NY: Addison-Wesley.
Turing, A. (1936). On computable numbers, with an application to the
entscheidungsproblem. Proc. Lond. Math. Soc. 2, 230–265.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. New York, NY: Springer.
Conflict of Interest Statement: The authors declare that the research was
conducted in the absence of any commercial or financial relationships that could
be construed as a potential conflict of interest.
Copyright © 2016 Emmert-Streib, Moutari and Dehmer. This is an open-access
article distributed under the terms of the Creative Commons Attribution License
(CC BY). The use, distribution or reproduction in other forums is permitted,
provided the original author(s) or licensor are credited and that the original
publication in this journal is cited, in accordance with accepted academic practice.
No use, distribution or reproduction is permitted which does not comply with these terms.
... Ideally, data scientists should be equipped with statistical and computational competencies, paired with domain knowledge, in order to analyse and interpret raw data and assist in decision-making processes (Alarcón-Soto et al., 2019). Nevertheless, what is particularly noteworthy within data science is that having specialised knowledge from all relevant fields is not required (Emmert-Streib et al., 2016). Since people who are capable of addressing social issues based on real-world data are scarce (Song & Zhu, 2016), the combined nature of data science further intensifies the need for interdisciplinary collaboration. ...
... Lastly, the overlap between computational skills and domain knowledge without sufficient understanding of mathematics and statistics can be viewed as the most serious problem in data science (Baldassarre, 2016). Although researchers might be well-versed in computing science, it is equally critical to understand the underlying mathematical meanings so as to transform statements into theorems (Emmert-Streib et al., 2016). A lack of rigorous methodological structure can be problematic when validation is required, which, in turn, can lead to incorrect analysis. ...
In parallel with the progression of technology, the tourism industry has been continuously confronted with a large amount of data that needs to be systematically analyzed in order to gain significant insights into the science and business sectors. Data science has emerged as an interdisciplinary field where specific competencies from different sub-disciplines come together. This poses far-reaching challenges for both researchers and practitioners alike. To unlock the pillars of data science research and provide a guideline for relevant stakeholders in tourism, this chapter aims to conceptualize the core competencies needed in the data science process. More specifically, it will start with a discussion regarding the interplay between computer science, mathematics and statistics, and domain knowledge. Next, the procedure of data science will be classified into seven distinct phases: (1) topic formulation and relevance for academia and industry, (2) data access and data collection, (3) data pre-processing, (4) feature engineering, (5) analysis, (6) model evaluation and model tuning, and (7) interpretation of results. This chapter will review each stage in depth and evaluate the corresponding level of knowledge and competencies required for each phase. Finally, current implications and potential future directions of data science in the tourism industry will be discussed.
... Similarly, also these phases require the cooperation between different groups. Overall, the standard CRISP-DM describes a cyclic process, which has recently been highlighted as the emergent feature of data science (Emmert-Streib et al., 2016). In contrast, ML or statistics focus traditionally on a single method for the analysis of data. ...
... We would like to note that the cyclic-nature of CRISP-DM is similar to general data science approaches (Emmert-Streib et al., 2016). Interestingly, this differs from ML or statistics that focus traditionally on a single method only for the analysis of data, establishing in this way one-step processes. ...
Full-text available
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely accepted framework in production and manufacturing. This data-driven knowledge discovery framework provides an orderly partition of the often complex data mining processes to ensure a practical implementation of data analytics and machine learning models. However, the practical application of robust industry-specific data-driven knowledge discovery models faces multiple data- and model development-related issues. These issues need to be carefully addressed by allowing a flexible, customized and industry-specific knowledge discovery framework. For this reason, extensions of CRISP-DM are needed. In this paper, we provide a detailed review of CRISP-DM and summarize extensions of this model into a novel framework we call Generalized Cross-Industry Standard Process for Data Science (GCRISP-DS). This framework is designed to allow dynamic interactions between different phases to adequately address data- and model-related issues for achieving robustness. Furthermore, it emphasizes also the need for a detailed business understanding and the interdependencies with the developed models and data quality for fulfilling higher business objectives. Overall, such a customizable GCRISP-DS framework provides an enhancement for model improvements and reusability by minimizing robustness-issues.
... Nowadays, "data" are at the center of our society, regardless of whether one looks at the science, industry or entertainment [1,2]. The availability of such data makes it necessary for them to be analyzed adequately, which explains the recent emergence of a new field called data science [3][4][5][6]. For instance, in biology, the biomedical sciences, and pharmacology, the introduction of novel sequencing technologies enabled the generation of high-throughput data from all molecular levels for the study of pathways, gene networks, and drug networks [7][8][9][10][11]. ...
... , p − 1 do 3 Fit all p − k models having k + 1 parameters. 4 Select the best of these p − k models and call itM k+1 . The evaluation of this is based on minimizing the MSE, or on maximizing R 2 , or cross-validation. ...
Full-text available
When performing a regression or classification analysis, one needs to specify a statistical model. This model should avoid the overfitting and underfitting of data, and achieve a low generalization error that characterizes its prediction performance. In order to identify such a model, one needs to decide which model to select from candidate model families based on performance evaluations. In this paper, we review the theoretical framework of model selection and model assessment, including error-complexity curves, the bias-variance tradeoff, and learning curves for evaluating statistical models. We discuss criterion-based, step-wise selection procedures and resampling methods for model selection, whereas cross-validation provides the most simple and generic means for computationally estimating all required entities. To make the theoretical concepts transparent, we present worked examples for linear regression models. However, our conceptual presentation is extensible to more general models, as well as classification problems.
... At the beginning of this article, we mentioned that the concept of a digital twin originated in manufacturing [9]. However, as we have shown in the preceding sections, it is beneficial to place a digital twin into the wider context of data science [60]. This allows the interpretation of a DTS as a general decision-making system whereas a DT is a key component thereof allowing to generate simulated data in an adaptive manner. ...
Full-text available
The concept of a digital twin (DT) has gained significant attention in academia and industry because of its perceived potential to address critical global challenges, such as climate change, healthcare, and economic crises. Originally introduced in manufacturing, many attempts have been made to present proper definitions of this concept. Unfortunately, there remains a great deal of confusion surrounding the underlying concept, with many scientists still uncertain about the distinction between a simulation, a mathematical model and a DT. The aim of this paper is to propose a formal definition of a digital twin. To achieve this goal, we utilize a data science framework that facilitates a functional representation of a DT and other components that can be combined together to form a larger entity we refer to as a digital twin system (DTS). In our framework, a DT is an open dynamical system with an updating mechanism, also referred to as complex adaptive system (CAS). Its primary function is to generate data via simulations, ideally, indistinguishable from its physical counterpart. On the other hand, a DTS provides techniques for analyzing data and decision-making based on the generated data. Interestingly, we find that a DTS shares similarities to the principles of general systems theory. This multi-faceted view of a DTS explains its versatility in adapting to a wide range of problems in various application domains such as engineering, manufacturing, urban planning, and personalized medicine.
... The main module of the Health Data Science Training Program should encompass three interdisciplinary areas: (a) Computer Science/Informatics, (b) Statistics/Mathematics, and (c) Domain knowledge experts (21). Figure 1a shows these three pillars along with some examples that combine skills from these focus areas, and Figure 1b displays diverse expertise and skills necessary for a successful Health Data Science training program. ...
Full-text available
Technological advances now make it possible to generate diverse, complex and varying sizes of data in a wide range of applications from business to engineering to medicine. In the health sciences, in particular, data are being produced at an unprecedented rate across the full spectrum of scientific inquiry spanning basic biology, clinical medicine, public health and health care systems. Leveraging these data can accelerate scientific advances, health discovery and innovations. However, data are just the raw material required to generate new knowledge, not knowledge on its own, as a pile of bricks would not be mistaken for a building. In order to solve complex scientific problems, appropriate methods, tools and technologies must be integrated with domain knowledge expertise to generate and analyze big data. This integrated interdisciplinary approach is what has become to be widely known as data science. Although the discipline of data science has been rapidly evolving over the past couple of decades in resource-rich countries, the situation is bleak in resource-limited settings such as most countries in Africa primarily due to lack of well-trained data scientists. In this paper, we highlight a roadmap for building capacity in health data science in Africa to help spur health discovery and innovation, and propose a sustainable potential solution consisting of three key activities: a graduate-level training, faculty development, and stakeholder engagement. We also outline potential challenges and mitigating strategies.
... The former finds application in digital health whereas the latter is used in digital business, e.g., for virtual viewings of properties. From an abstract point of view, also data science falls within the category of human-computer interaction because a complex data analysis process involves many individual steps which may not be automatically connectable but requires human intervention, e.g., via an explanatory analysis [98]. ...
Full-text available
Technological progress has led to powerful computers and communication technologies that penetrate nowadays all areas of science, industry and our private lives. As a consequence, all these areas are generating digital traces of data amounting to big data resources. This opens unprecedented opportunities but also challenges toward the analysis, management, interpretation and responsible usage of such data. In this paper, we discuss these developments and the fields that have been particularly effected by the digital revolution. Our discussion is AI-centered showing domain-specific prospects but also intricacies for the method development in artificial intelligence. For instance, we discuss recent breakthroughs in deep learning algorithms and artificial intelligence as well as advances in text mining and natural language processing, e.g., word-embedding methods that enable the processing of large amounts of text data from diverse sources such as governmental reports, blog entries in social media or clinical health records of patients. Furthermore, we discuss the necessity of further improving general artificial intelligence approaches and for utilizing advanced learning paradigms. This leads to arguments for the establishment of statistical artificial intelligence. Finally, we provide an outlook on important aspects of future challenges that are of crucial importance for the development of all fields, including ethical AI and the influence of bias on AI systems. As potential end-point of this development, we define digital society as the asymptotic limiting state of digital economy that emerges from fully connected information and communication technologies enabling the pervasiveness of AI. Overall, our discussion provides a perspective on the elaborate relatedness of digital data and AI systems.
Massively multiplayer online games (MMOGs) played on the Web provide a new form of social, computer-mediated interactions that allow the connection of millions of players worldwide. The rules governing team-based MMOGs are typically complex and non-deterministic giving rise to an intricate dynamical behavior. However, due to the novelty and complexity of MMOGs their behavior is understudied. In this paper, we investigate the MMOG World of Tanks (WOT) Blitz by using a combined approach based on data science and complex adaptive systems. We analyze data on the population level to get insight into organizational principles of the game and its game mechanics. For this reason, we study the scaling behavior and the predictability of system variables. As a result, we find a power-law behavior on the population level revealing long-range interactions between system variables. Furthermore, we identify and quantify the predictability of summary statistics of the game and its decomposition into explanatory variables. This reveals a heterogeneous progression through the tiers and identifies only a single system variable as key driver for the win rate.
In this study, non-Newtonian fluid flow and heat transfer over an exponentially stretching surface in the presence of a porous medium and a magnetic field are investigated, owing to their diverse applications in medical and engineering disciplines. The behavior of the non-Newtonian fluid is characterized by the Casson fluid model. The major parameters, thermal radiation and viscous dissipation, are incorporated in the energy equation. Taking the governing equations of the mathematical model as a starting point, the partial differential equations are reduced to ordinary differential equations using similarity transformations with proper boundary conditions. The BVP4C solver is used to compute the solutions, and results are presented in the form of tables and graphs for the momentum and energy equations. The simulations show that the heat transfer rate decreases for higher values of the magnetic field and radiation parameters, and that the temperature profile of the Casson fluid flow is directly related to the Casson fluid parameter. Finally, it is observed that parameters such as the Casson fluid parameter, Prandtl number, and Eckert number are stable at the point of convergence.
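The workflow described above reduces the governing PDEs to a boundary-value problem for ODEs and solves it with MATLAB's bvp4c collocation solver. As a hedged illustration of the same workflow, the sketch below solves a toy linear boundary-value problem (y'' + y = 0 with y(0) = 0 and y(pi/2) = 1, whose exact solution is sin x) using SciPy's analogous solve_bvp routine; the Casson-fluid equations themselves are far more involved and are not reproduced here.

```python
import numpy as np
from scipy.integrate import solve_bvp

# Rewrite y'' + y = 0 as a first-order system:
#   y0' = y1,  y1' = -y0
def fun(x, y):
    return np.vstack([y[1], -y[0]])

# Boundary conditions: y(0) = 0 and y(pi/2) = 1 (exact solution: sin x)
def bc(ya, yb):
    return np.array([ya[0], yb[0] - 1.0])

x = np.linspace(0.0, np.pi / 2, 11)   # initial mesh
y_init = np.zeros((2, x.size))        # trivial initial guess
sol = solve_bvp(fun, bc, x, y_init)

print(sol.sol(np.pi / 4)[0])  # approximately sin(pi/4) = 0.7071
```

Like bvp4c, solve_bvp is a collocation method with adaptive mesh refinement; for the nonlinear similarity equations of the fluid problem, a reasonable initial guess for y_init becomes important for convergence.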
Data science is a rapidly growing discipline and organizations increasingly depend on data science work. Yet the ambiguity around data science, what it is, and who data scientists are can make it difficult for visualization researchers to identify impactful research trajectories. We have conducted a retrospective analysis of data science work and workers as described within the data visualization, human-computer interaction, and data science literature. From this analysis we synthesise a comprehensive model that describes data science work and break down data scientists into nine distinct roles. We summarise and reflect on the role that visualization has throughout data science work and the varied needs of data scientists themselves for tooling support. Our findings are intended to arm visualization researchers with a more concrete framing of data science, with the hope that it will help them surface innovative opportunities for impacting data science work.
A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to utilize databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes. The goal of this paper is to motivate the importance of data science proficiency and to provide examples and resources for instructors to implement data science in their own statistics curricula. We provide case studies from seven institutions. These varied approaches to teaching data science demonstrate curricular innovations to address new needs. Also included here are examples of assignments designed for courses that foster engagement of undergraduates with data and data science.
Creators of software widely used in computational biology discuss the factors that contributed to their success.
Human cancer cells typically harbour multiple chromosomal aberrations, nucleotide substitutions and epigenetic modifications that drive malignant transformation. The Cancer Genome Atlas (TCGA) pilot project aims to assess the value of large-scale multi-dimensional analysis of these molecular characteristics in human cancer and to provide the data rapidly to the research community. Here we report the interim integrative analysis of DNA copy number, gene expression and DNA methylation aberrations in 206 glioblastomas (the most common type of adult brain cancer) and nucleotide sequence aberrations in 91 of the 206 glioblastomas. This analysis provides new insights into the roles of ERBB2, NF1 and TP53, uncovers frequent mutations of the phosphatidylinositol-3-OH kinase regulatory subunit gene PIK3R1, and provides a network view of the pathways altered in the development of glioblastoma. Furthermore, integration of mutation, DNA methylation and clinical treatment data reveals a link between MGMT promoter methylation and a hypermutator phenotype consequent to mismatch repair deficiency in treated glioblastomas, an observation with potential clinical implications. Together, these findings establish the feasibility and power of TCGA, demonstrating that it can rapidly expand knowledge of the molecular basis of cancer.
The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.
In the history of research on the learning problem, one can identify four periods, each characterized by a notable development: (i) constructing the first learning machines, (ii) constructing the fundamentals of the theory, (iii) constructing neural networks, and (iv) constructing alternatives to neural networks.
Topics covered: setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; and what is important in learning theory?