On Developing Data Science
Michael L. Brodie, Computer Science and Artificial Intelligence Laboratory, MIT
Abstract
Understanding phenomena based on the facts - on the data - is a touchstone of data science.
The power of evidence-based, inductive reasoning distinguishes data science from science. Hence, this
chapter argues that, in its initial stages, data science applications and the data science discipline itself be
developed inductively and deductively in a virtuous cycle.
The virtues of the 20th Century Virtuous Cycle (aka virtuous hardware-software cycle, Intel-Microsoft virtuous cycle) that built the personal computer industry (National Research Council, 2012) were being grounded in reality and being self-perpetuating: more powerful hardware enabled more powerful software that required more powerful hardware, enabling yet more powerful software, and so forth. Being grounded in reality - solving genuine problems at scale - was critical to its success, as it will be for data science. While it lasted, it was self-perpetuating, due to a constant flow of innovation, and to benefitting all participants - producers, consumers, the industry, the economy, and society. It is a wonderful success story for 20th Century applied science. Given the success of virtuous cycles in developing modern technology, virtuous cycles grounded in reality should be used to develop data science, driven by the wisdom of the 16th Century proverb, "Necessity is the mother of invention."
This chapter explores this hypothesis using the example of the evolution of database
management systems over the last 40 years. For the application of data science to be successful and
virtuous, it should be grounded in a cycle that encompasses industry (i.e., real problems), research,
development, and delivery. This chapter proposes applying the principles and lessons of the virtuous cycle
to the development of data science applications; to the development of the data science discipline itself,
e.g., a data science method; and to the development of data science education; all focusing on the critical
role of collaboration in data science research and management, thereby addressing the development
challenges faced by the more than 150 Data Science Research Institutes (DSRIs) worldwide. A companion chapter (Brodie, 2018a) addresses essential questions that DSRIs should answer in preparation for the developments proposed here: What is data science? and What is world-class data science research?
1 Introduction
Data Science is inherently data- or evidence-based analysis; hence, it is currently an applied science.
Data science emerged at the end of the 20th Century as a new paradigm of discovery in science and
engineering that used ad hoc analytical methods to find correlations in data at scale. While there was
science in each analysis, there was little science underlying data science per se. Data science is in its
infancy and will take a decade to mature as a discipline with underlying scientific principles, methods, and
infrastructure (Brodie, 2018a). This chapter describes a method by which data scientists and DSRIs might
develop data science as a science (e.g., fundamentals - principles, models, and methods) and as a
discipline or an applied science (e.g., practices in the development of data science products, as described
throughout this book (Braschler et al., 2018)). The method is based on the 21st Century Virtuous Cycle - a cycle of collaboration among industry, research, development, and delivery, e.g., to develop and use data science products.
The cycle and its virtues evolved from medieval roots to surface in industry, including in the research and development of large-scale computer systems and applications; it was extended to include product development as a research and development (R&D) cycle, and now to deployment in a research, development, and delivery (RD&D) cycle. The cycle is used extensively in academic and industrial
computer science research and development, by most technology startups, and is integral to the open
source ecosystem. It is used extensively in applied science and education, and increasingly in medical and
scientific research and practice. We look at the lessons learned in the development of large-scale
computer systems, specifically relational database systems based on a recent analysis (Brodie, 2018b),
tracing how the virtuous cycle was extended to a larger virtuous cycle of demand, research, product
development, deployment, practice, and back again.
Section 2 introduces the 20th Century Virtuous R&D Cycle made famous by Microsoft and Intel.
Section 3 extends the cycle to the 21st Century Virtuous RD&D Cycle, illustrated using the mutual
development of database management system (DBMS) research and products; and extends the cycle to
education. Section 4 builds upon this blueprint and applies it to three aspects of data science: concrete
data products, the discipline itself, and data science education; and concludes by looking forward. Section
5 illustrates previous themes with lessons learned in the development of data science and DSRIs, and exposes commonly reported data science "facts" as pure myths. Section 6 speculates on the impacts of data science, both benefits and threats; given the projected significance of data science, there may be
more profound impacts. Section 7 concludes optimistically with challenges that lie ahead.
2 20th Century virtuous cycles
The 20th Century Virtuous Cycle accelerated the growth of the personal computer industry with
more powerful hardware (speed, capacity, miniaturization) that enabled more powerful software
(functions, features, ease of use) that in turn required more powerful hardware (Figure 1). Hardware
vendors produced faster, cheaper, more powerful hardware (i.e., chips, memory) fueled by Moore’s Law.
This led software vendors to increase the features and functions of existing and new applications, in turn
requiring more speed and memory. Increasing hardware and software power made personal computers
more useful and applicable to more users, thus increasing demand and growing the market that in turn,
through economies of scale, lowered costs in ever-shortening cycles. But what made the cycle virtuous?
Figure 1: The Hardware-Software Cycle
The hardware-software cycle had two main virtues worth emulating. First, the cycle became self-perpetuating, driven by a continuous stream of innovation - good hardware ideas, e.g., next generation chips, and good software ideas, e.g., next great applications (Figure 2). It ended in 2010 (National Research Council, 2012) when dramatic hardware gains were exhausted, the market approached saturation, and its fuel - good ideas - was redirected to other technologies. Second, all participants
benefited: hardware and software vendors, customers, and more generally the economy and society
through the growth of the personal computer industry and the use of personal computers. The 20th
Century Virtuous Cycle was simply hardware innovation and software innovation in a cycle.
Figure 2: 20th Century Virtuous Hardware-Software Cycle
The virtuous hardware-software cycle produced hardware and software, each of which developed its own R&D cycle (Figure 3). Hardware vendors and universities used the hardware (R&D) cycle to address hardware opportunities and challenges by conducting fundamental research into next generation hardware. As long as there was hardware innovation - good ideas - the hardware R&D cycle was virtuous.
Similarly, software vendors used the software R&D cycle to address software challenges and
opportunities in their ever-shortening cycles. This also worked well for next generation applications.
However, fundamental research into next generation systems, specifically database management
systems, was conducted by vendors (e.g., IBM, Software AG, Honeywell Information Systems), not by universities.
Figure 3: Software and Hardware R&D Cycles
Addressing fundamental DBMS challenges and opportunities in a university requires access to
industrial-scale systems, industrial applications, and use cases (i.e., data). Until the early 1970s,
universities lacked industrial experience, case studies, and resources such as large-scale systems and
programming teams. At that time, Michael Stonebraker at the University of California, Berkeley, began to address this gap.[1] Stonebraker and Eugene Wong built Ingres (Stonebraker et al., 1976), a prototype
industrial scale relational DBMS (RDBMS) for industrial scale geographic applications. They made the
Ingres code line available as one of the first open source systems. The Ingres code line then enabled
universities to conduct fundamental systems research. Ingres was the first example in a university of
extending the 20th Century Virtuous Cycle to systems engineering, specifically to a DBMS. The cycle was subsequently extended to large systems research in universities and industry. Due to the importance of the systems developed in the process, it became known as the 20th Century Virtuous R&D Cycle, which, simply stated, is research innovation and engineering innovation, in a cycle.
3 21st Century virtuous research, development, and delivery cycles
3.1 The virtuous DBMS RD&D cycle
Using Ingres for industry scale geographic applications was a proof of concept of the feasibility of the
relational model and RDBMSs. But were they of any value? How real were these solutions? Were
relational systems applicable in other domains? These questions would be answered if there were a
market for Ingres, i.e., a demand. Stonebraker, Wong, and Larry Rowe formed Relational Technology, Inc.,
later named the Ingres Corporation, to develop and market Ingres. Many companies have used the open
source Ingres and Postgres (Stonebraker et al., 1991) code lines to produce commercial RDBMSs
(Naumann, 2018) that together with IBM’s DB2, Oracle, and Microsoft SQL Server now form a $55bn per
year market, thus demonstrating the value and impact of RDBMSs as a “good idea” (Stonebraker, 2018a,
2018b). This extended the 20th Century Virtuous R&D Cycle to DBMSs in which DBMS research innovation
led to DBMS engineering innovation that led to DBMS product innovation. DBMS vendors and universities
repeated the cycle resulting in expanding DBMS capabilities, power, and applicability that in turn
contributed to building the DBMS market. Just as the hardware-software cycle became virtuous, so did
the DBMS R&D cycle. First, research innovation - successive good ideas - led to engineering innovation that led to product innovation. This cycle continues to this day with the emergence of novel DBMS ideas,
especially with the new demands of Big Data. Second, all participants benefit: vendors, researchers, DBMS
users, and more generally the economy using data management products and the growth of the data
management industry. Big Data and data science follow directly in this line.
A wonderful example of necessity being the mother of invention is the use of abstract data types as
the primary means of extending the type system of a DBMS and providing an interface between the type
systems of a DBMS and its application systems; arguably Stonebraker’s most significant technical
contribution. To build an RDBMS based on Ted Codd’s famous paper (Codd, 1970), Stonebraker and Wong
obtained funding for a DBMS to support Geographic Information Systems. They soon discovered that it
required point, line, and polygon data types and operations that were not part of Codd’s model. Driven by
this necessity, Stonebraker chose the emerging idea of abstract data types to extend the built-in type
system of a DBMS. This successful innovation has been a core feature of DBMSs ever since. Abstract data types are only one of many innovations that fed the 40-year-old virtuous necessity-innovation-development-product cycle around Ingres and Postgres.
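To make the mechanism concrete, here is a minimal sketch of the abstract-data-type idea - representation and operations packaged behind a uniform interface that an engine can invoke without knowing the type's internals. The Python rendering and the names (Point, Polygon, TYPE_REGISTRY) are illustrative assumptions, not the actual Ingres/Postgres mechanism, which registers input, output, and operator functions in the DBMS catalog:

# A minimal sketch of the abstract-data-type idea behind extensible
# DBMS type systems; all names are illustrative, not Postgres's API.
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class Point:
    x: float
    y: float
    def distance(self, other: "Point") -> float:
        return math.hypot(self.x - other.x, self.y - other.y)

@dataclass(frozen=True)
class Polygon:
    vertices: tuple  # tuple of Point
    def contains(self, p: Point) -> bool:
        # Ray-casting point-in-polygon test: count edge crossings.
        inside = False
        v = self.vertices
        for i in range(len(v)):
            a, b = v[i], v[(i + 1) % len(v)]
            if (a.y > p.y) != (b.y > p.y):
                x_cross = a.x + (p.y - a.y) * (b.x - a.x) / (b.y - a.y)
                if p.x < x_cross:
                    inside = not inside
        return inside

# The engine needs only a uniform interface per type: parse,
# serialize, and named operators it can call from queries.
TYPE_REGISTRY = {
    "point": {
        "parse": lambda s: Point(*map(float, s.split(","))),
        "serialize": lambda p: f"{p.x},{p.y}",
        "operators": {"distance": Point.distance},
    },
}

if __name__ == "__main__":
    square = Polygon((Point(0, 0), Point(4, 0), Point(4, 4), Point(0, 4)))
    print(square.contains(Point(2, 2)))  # True
    print(square.contains(Point(5, 5)))  # False

The design point is that the engine stores, indexes, and queries such values through the registered interface alone, which is why new domains (GIS, time series, images) could later be added without changing the DBMS core.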
In all such cycles, there is a natural feedback loop. Problems (e.g., recovery and failover), challenges,
and opportunities that arose with relational DBMS products fed back to the vendors to improve and
enhance the products while more fundamental challenges (e.g., lack of points, lines, and polygons) and
opportunities went back to university and vendor research groups for the next cycle of innovation.
[1] Stonebraker's DBMS developments coincided with the emergence of the open source movement. Together they created a virtuous cycle that benefited many constituencies - research, DBMS technology, products, applications, users, and the open source movement - resulting in a multi-billion-dollar industry. Hence, this example warrants a detailed review as lessons for the development of data science.
Indeed, modern cycles use frequent iteration between research, engineering, and products to test or
validate ideas, such as the release of beta versions to find “bugs”.
Stonebraker, together with legions of open source contributors, extended the 20th Century Virtuous
R&D Cycle in several important dimensions to become the 21st Century Virtuous Research, Development,
and Delivery Cycle. First, in creating a commercial product he provided a compelling method of demonstrating the value and impact of what was claimed as a "good idea" in terms of demand in a commercial market. This added the now critical delivery step to become the research-development-delivery (RD&D)
cycle. Second, as an early proponent of open source software on commodity Unix platforms he created a
means by which DBMS researchers and entrepreneurs have access to industrial scale systems for RD&D.
Open source software is now a primary method for industry, universities, and entrepreneurs to research,
develop, and deliver DBMSs and other systems. Third, by using industry scale applications as use cases for
proofs of concept, he provided a method by which research prototypes could be developed and
demonstrated to address industrial scale applications. Now benchmarks are used for important industrial
scale problems as a means of evaluating and comparing systems in industrial scale contexts. Fourth, and
due to the above, his method provided a means by which software researchers could engage in fundamental systems research, a means not previously available that is now a critical requirement for large-scale systems research.
The RD&D cycle is used to develop good research ideas into software products with a proven
demand. Sometimes the good idea is a pure technical innovation, e.g., a column store DBMS (queries will be much faster if we read only the relevant columns!) that led to the Vertica DBMS (Stonebraker et al., 2005). More often it is a "pain in the ass" (PIA) problem, namely a genuine problem in a real industrial
context for which someone will pay for the development of a solution. Paying for a solution demonstrates
the need for a solution and helps fund its development. Here is a real example: A major information
service company creates services, e.g., news reports, by discovering, curating, de-duplicating, and
integrating hundreds of news wire reports from data items that are dirty, heterogeneous, and highly
redundant, e.g., over 500 reports of a US school shooting in 500 different formats. Due to the Internet, as
the number of news data sources soared from hundreds to hundreds of thousands, the largely manual
methods would not scale. This PIA problem led to Tamr,[2] a product for curating data at scale.
The RD&D cycle is the process underlying applied science. The RD&D cycle - an applied science method - becomes virtuous as long as there is a continuous flow of good ideas and PIA problems that perpetuate it (Figure 4).
Figure 4: Virtuous DBMS RD&D Cycle
Stonebraker received the 2014 A. M. Turing Award - "the Nobel Prize in computing" - "For fundamental contributions to the concepts and practices underlying modern database systems" (ACM, 2015).[3] Concepts mean good research ideas - DBMS innovations. Practice means taking DBMS
[2] Tamr.com provides tools and services to discover and prepare data at scale, e.g., 100,000 data sources, for Big Data projects and data science.
[3] The RDBMS RD&D cycle was chosen to illustrate the theme of this chapter, as it is one of the major achievements in computing.
innovations across the virtuous RD&D cycle to realize value and create impact. Following the cycle
produced the open source Ingres DBMS that resulted in the Ingres DBMS product, and the Ingres
Corporation with a strong market, i.e., users who valued the product. Stonebraker refined and applied his
method in eight subsequent academic projects and their commercial counterparts: Ingres (Ingres)
(Stonebraker et al., 1976), Postgres (Illustra) (Stonebraker et al., 1991), Mariposa (Cohera), Aurora
(StreamBase), C-Store (Vertica) (Stonebraker et al., 2005), Morpheus (Goby), H-Store (VoltDB), SciDB
(Paradigm4), and Data Tamer (Tamr), with BigDAWG Polystore and Data Civilizer currently in
development. The concepts and practice of this RD&D cycle are a formula for applied science of which Stonebraker's systems are superb examples[4] (Stonebraker, 2018a):
Repeat {
Find somebody who is in pain
Figure out how to solve their problem
Build a prototype
Commercialize it
}
The systems research community adopted open source methods and extended the cycle to all types
of systems resulting in a 21st Century Virtuous RD&D Cycle for systems that transformed academic systems
research to deliver greater value for and higher impact in research, industry, and practice.
The 21st Century Virtuous Research, Development, and Delivery Cycle is simply research innovation,
engineering innovation, and product innovation, in a cycle. As we will now see, its application and impacts
go well beyond systems RD&D.
3.2 The critical role of research-industry collaboration in technology innovation
Virtuous RD&D cycles require research-industry collaboration that mutually benefits research and industry. Industry often needs insight into challenges for which it may not have the research resources. More commonly, industry faces PIA problems for which there are no commercial solutions. As discussed in section 4.2, this is precisely the case for data science today. Most US enterprises have launched data science efforts, most of which fail, as few in industry understand data science or can hire data scientists. But let's return to understanding the cycles before applying them to data science.
It is common that industry may not be aware of PIA problems that lurk below the surface. For
example, all operational DBMSs, more than 5m in the USA alone, decay due to their continuous evolution
to meet changing business requirements. While database decay is a widely known pattern, it has not been accepted as a PIA problem since there is little insight into its causes, let alone technical or
commercial solutions. Recent research (Stonebraker et al., 2016a, 2016b, 2017) proposes both causes and
solutions that will be realized only with industrial scale systems and use cases with which to develop,
evaluate, and demonstrate that the proposed “good ideas” actually work! Insights into causes and
solutions came exclusively through a research-industry collaboration between MIT and B2W Digital, a
large Brazilian retailer.
Industry gains in RD&D cycles in several ways. First, industry gains insight into good ideas or
challenges being researched. Second, industry gets access to research prototypes to investigate the
problem in their environment. Third, if successful, the prototype may become open source,[5] available to industry to apply and develop, potentially becoming a commercial product. Fourth, industry can gain
industry to apply and develop, potentially becoming a commercial product. Fourth, industry can gain
ongoing benefits from collaborating with research such as facilitating technology transfer and indicating
to customers, management, and investors its pursuit of advanced technology to improve its products and
services. Finally, a PIA industry problem may be resolved or a hypothesized opportunity may be realized.
[4] Don't let the pragmatism of these examples hide the scientific merit. Computer science was significantly advanced by fundamental principles introduced in each of the systems mentioned.
[5] Open source is not required for research-industry collaborations; however, open source can significantly enhance development, e.g., Apache Spark's 42m contributions from 1,567 contributors; and impact, e.g., used by over 1m organizations, due in part to free downloads.
Industry collaboration is even more critical for research, especially for research involving industrial-scale use cases. Researchers need access to genuine, industrial scale opportunities or, more often, challenges that require research that is beyond the capability or means of industry to address, and to real
use cases with which to develop, evaluate, and demonstrate prototype solutions. Scale is important as
“the devil is in the details” that arise in industrial-scale challenges and seldom in toy use cases. Through
collaboration, research can understand and verify the existence and extent of a problem or the likelihood
and potential impact of a good idea by analyzing them in a genuine industrial context. Is the problem
real? Is a solution feasible? What might be the impact of the solution? This is precisely what is needed in
data science for both researchers and industry.
Ideally, collaboration occurs in a continuous RD&D cycle in which research and industry interact to
identify and understand problems, opportunities, and solutions. It is virtuous if all participants benefit and
as long as problems and opportunities arise. Such research-industry collaborations are better for technology transfer than conventional marketing and sales (Stonebraker, 2018a,b).
By the mid-2000s startups worldwide used a version of the 21st Century Virtuous RD&D Cycle (Figure
5) as their development method, as a natural extension of the open source ecosystem. An obvious example is the World Wide Web, which spawned an enormous number of apparently odd innovations. Who
knew that a weird application idea like Twitter, a 140-character message service, would become a thing
(weaponized by a US president)? Or Snapchat, an image service where images self-destruct? The virtuous
RD&D cycle was used on a much grander scale in the World Wide Web and in Steve Jobs' iPhone, both of which went from self-perpetuating to viral and in so doing changed our world. These projects were developed, and continue to be developed, with extensive industry collaboration driven by good - sometimes weird - ideas, novel applications, and PIA problems to be proven at scale. One might argue
that the 21st Century Virtuous RD&D Cycle is one of the most effective development methods.
Figure 5: 21st Century Virtuous RD&D Cycle
3.3 The role of innovation in RD&D Cycles
The virtues of the RD&D cycle apply to data science. First, data science should be grounded in
reality by using industrial-scale challenges, opportunities, and use cases to drive the cycle to develop and
validate solutions and products to prove value and impact. Second, it should be made self-perpetuating
by ensuring a constant flow of innovation, especially in its emerging state - good ideas, challenges, PIA problems, and opportunities - with the result that the methods and results improve, thus benefiting all
participants: producers, consumers, the industry, the economy, and society. Innovative ideas perpetuate the cycle; the best innovations accelerate it.
As illustrated in Figure 6, innovation is required in each stage for the cycle to be virtuous - to self-perpetuate. There is a two-way flow between cycle stages. Technology, e.g., a data science platform, transfers down (↓) the cycle in the form of research results, prototypes, and products, while requirements transfer up (↑) the cycle in the form of use cases, PIA problems, opportunities, challenges, and user requirements. Innovation - good ideas - can enter anywhere in the cycle, but must continuously enter for the cycle to self-perpetuate.
The cycle also applies to education - understanding How each stage works and educating
participants in its successful application. For data science education, understanding How stages work
leads to data science theories in research, to data science architectures and mechanisms in engineering,
to data science products in development, and to data science applications in practice. Education also
benefits from a two-way flow between theories in research, architectures in engineering, products in
development, and use cases in practice. Innovation - good ideas - can enter anywhere in the cycle.
Education in an established domain such as DBMSs involves understanding the principles and
techniques and How they work. Innovation for education across the cycle concerns innovation not only in
data science per se but also in education how data science is taught and understood. Research and
technology transfer across the cycle requires innovation in each stage. The cycle is more dynamic and
powerful in an emerging domain such as data science. Each stage in data science is in its infancy; hence
each stage in research could involve developing, generalizing, and integrating the current results in that stage - principles, platforms, products, and practice. Applying virtuous cycle principles to data science
means grounding the work in a real challenge, e.g., drug discovery in cancer research (Spangler et al.,
2014), with industrial-scale challenges and opportunities to drive the cycle, real use cases to develop and
validate solutions, and products to determine value and impact. In the cancer case just cited, innovation
occurred, i.e., Spangler et al. developed a domain-specific data science method that was subsequently
generalized to be more domain independent (Nagarajan et al., 2015), and the mechanisms used to further
verify the results are now more widely applied in data science.
Activity                                     | Research    | Engineering | Development | Delivery
Result                                       | Publication | Prototype   | Product     | Application / Use Case

Applied to Technology
20th C. hardware-software R&D cycle          | Innovation ←→ Innovation
20th C. Infrastructure / Systems RD&D cycle  | Innovation  | Innovation ←→ Innovation
21st C. RD&D cycle                           | Innovation  | Innovation ←→ Innovation ←→ Innovation

Applied to Research, Education, and Technology Transfer
Education                                    | How         | How ←→ How ←→ How
Research & Technology Transfer               | Innovation  | Innovation ←→ Innovation ←→ Innovation

Figure 6: The Flow of Good Ideas in Virtuous Cycles
3.4 Establishing Causality: A Critical Challenge
Due to the critical problems to which data science is being applied, e.g., IBM Watson is in the
business of recommending medical treatments, it is critical that accurate likelihoods of outcomes be
established. One of the greatest challenges of data science is doing just that - establishing accurate estimates of probabilistic outcomes and error bounds for those results - to which we now turn our attention.
The objective of the 21st Century Virtuous RD&D Cycle is to continuously produce technology and
applications that are grounded in reality, namely that produce products that create value, or even a
market of such products that have positive practical, economic, and social impacts. For example, there is a market for data science-based systems that automate aspects of online retailers' supply chains, e.g., automatically buying hundreds of thousands of products to meet future sales while not overstocking. In
2015 the cost of overstocking was approximately $470bn and of understocking $630bn worldwide
(Economist, 2018d). Normal economics and the marketplace are the mechanisms for demonstrating value
and measuring impact. Determining value and impact is far from simple. Most technology such as DBMSs
and products such as Microsoft Office have immense value and impact with continuously growing, multi-billion-dollar markets. Data science-based products have the potential for great contributions to
individuals, organizations, society, and the economy. Like most technology, data science holds equal
potential for positive and negative impacts. Disliking a Netflix data-science-driven movie recommendation
may waste half an hour of your time. Unfortunately, substantial negative consequences are being
discovered in data science applications, such as ethical problems in parole sentencing used extensively in
the USA (O'Neil, 2016). What might be the impact of data-driven personalized medicine treatment
recommendations currently being pursued by governments around the world?
Consider that question given that "Why Most Published Research Findings Are False" (Ioannidis, 2005) has been the most referenced paper in medical research since 2005. Data science currently lacks robust
methods of determining likelihood of and error bounds for predicted outcomes, let alone how to move
from such correlations to causality. While mathematical and statistical research may be used to address
probabilistic causality and error bounds, consider the research required to address ethical and societal issues such as sentencing.
The scientific principles that underlie most research also underlie data science. Empirical studies
report causal results while data science cannot. Data science can accelerate the discovery of correlations
(Brodie, 2018a). A significant challenge is to assign likelihoods and error bounds to these correlations.
While the current mechanisms of the 21st Century Virtuous RD&D Cycle to measure value and impact of
products worked well for simple technology products, they may not work as well for technology that is
increasingly applied to every human endeavor, thus directly influencing our lives. This is a significant issue
for the development and operation of data science in many domains. This is yet another class of issues
that illustrate the immaturity of data science and the need for multi-disciplinary collaboration. The
complex issue of causal reasoning in data science is addressed in greater detail in the companion chapter
(Brodie 2018a).
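As a small, concrete illustration of attaching error bounds to a discovered correlation, the sketch below uses a bootstrap confidence interval - a standard statistical device, not a method prescribed by this chapter - on synthetic data:

# A minimal sketch: attaching an error bound to an observed correlation
# via a bootstrap confidence interval. The data are synthetic; in a
# real study the sampling design matters as much as the interval.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(scale=1.0, size=n)  # true correlation ~ 0.37

observed_r = np.corrcoef(x, y)[0, 1]

# Resample (x, y) pairs with replacement and recompute r each time.
boot = []
for _ in range(5000):
    idx = rng.integers(0, n, size=n)
    boot.append(np.corrcoef(x[idx], y[idx])[0, 1])
low, high = np.percentile(boot, [2.5, 97.5])

print(f"r = {observed_r:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
# Note: such an interval bounds sampling error only; it says nothing
# about causality or about biases in how the data were collected.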
4 Applying 21st Century virtuous RD&D cycles to data science
A primary benefit of the 21st Century Virtuous RD&D Cycle is to connect research, engineering, and
products in a research-development-delivery cycle with the objective of being virtuous through a
continuous flow of innovative, good ideas and challenging problems. The cycle has many applications. It is
used extensively in computer science research in academia and industry, in startups that are building our
digital world, and increasingly in medicine and science. It has been and is being used to transform
education. I propose that it be used to guide and develop data science research, practice, and education.
4.1 A data science RD&D cycle example
In the mid-2000s legions of software startups applied the 21st Century Virtuous RD&D Cycle to
customer facing applications. As an example, Stonebraker applied the RD&D cycle to Goby - an
application that searches the web for leisure activities to provide users, e.g., tourists, with a list of distinct
local, leisure activities. The “good idea” was to find all activities on the web that might be of interest to
tourists. The PIA problem is that there are thousands of leisure activities with many listings that are highly
redundant (i.e., replicas), dirty, often inaccurate and contradictory, and in heterogeneous formats. As is
typically the case in data science analyses, more than 80% of the resources were required to discover,
deduplicate, and prepare the data, leaving less than 20% for analysis, in this case determining relevant
activities. This real, industrial-scale use case led to research, Morpheus (Dohzen et al., 2006), that
developed machine driven, user guided solutions to discover, clean, curate, deduplicate, integrate (a
better term is unify), and present data from potentially hundreds of thousands of data sources. The “good
idea” led to a PIA
6
problem that resulted in a prototype that led to a product with a commercial market
that demonstrated its value and impact. Meanwhile, unanticipated challenges cycled back to Goby for
product improvements and enhancements while more fundamental, research challenges went back to
Morpheus. The good idea - find events on the web - was generalized from events to the data discovery and preparation of any type of information, leading to further innovation that led to a new research project - Data Tamer - that in turn led to a new product - Tamr.com - and a burgeoning market in data
discovery and preparation for data science (Forrester, 2017b) (Gartner, 2017c). Tamr and similar products
are part of the budding infrastructures for data science, called data science platforms (Gartner, 2017a,
2017b).
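The flavor of such machine-driven, user-guided curation can be suggested with a toy sketch: listings whose similarity is clearly above a threshold are merged automatically, and only borderline pairs are routed to a human curator. The similarity measure and thresholds below are hypothetical simplifications, not the Morpheus, Goby, or Tamr algorithms:

# A toy sketch of machine-driven, user-guided deduplication: clear
# matches are merged automatically; ambiguous pairs go to a human.
from difflib import SequenceMatcher
from itertools import combinations

listings = [
    "Boston Harbor sunset cruise, 90 min",
    "Sunset cruise of Boston Harbor (90 minutes)",
    "Freedom Trail walking tour",
    "Walking tour: the Freedom Trail",
    "Whale watching trip from Long Wharf",
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

AUTO_MERGE, NEEDS_REVIEW = 0.85, 0.55  # hypothetical thresholds

for a, b in combinations(listings, 2):
    s = similarity(a, b)
    if s >= AUTO_MERGE:
        print(f"MERGE ({s:.2f}): {a!r} == {b!r}")
    elif s >= NEEDS_REVIEW:
        # In a real pipeline this pair would be queued for a human
        # curator, whose answer also becomes training data.
        print(f"ASK HUMAN ({s:.2f}): {a!r} ?= {b!r}")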
The 21st Century Virtuous RD&D Cycle is being used to design, develop, and deliver data science tools
and platforms. Data discovery and preparation, and data science platforms are concrete examples of this
cycle in practice. Over 30 data preparation tools and 60 data science platforms are emerging (Gartner,
2017a, 2017b, 2018a, 2018b). This cycle is virtuous as long as there are continuous innovation and broad
benefits. Currently, aspects of most human endeavors are being automated by means of digital tools
developed to study, manage, and automate those endeavors. Data preparation tools are being developed
by being applied to an increasing number of new domains, each presenting new challenges. The
continuous flow of practical applications, use cases, PIA problems and other challenges contribute to the
cycle being virtuous. The cycle becomes virtuous when all participants benefit. Data science tools and
platforms are beginning to flip the ratio of the data-preparation to analysis resources from 80:20 to 20:80,
so that data scientists can devote the majority of their time to analysis and not to plumbing. Data science
practiced in a virtuous cycle is applied science at its best producing broad value and contributing to
accelerating data science practice and the development of data science per se.
4.2 Developing data science in practice and as a discipline
Data science is an emerging phenomenon worldwide that will take a decade to mature as a robust
discipline (Brodie, 2015, 2018a). Its growth and diversity can be seen in the number (over 150) and nature
of DSRIs, most of which were established after 2015. The emerging state of data science can be seen in
the fact that each DSRI provides different answers to key data science questions that all DSRIs should
answer (Brodie, 2018a): What is data science? What is the practice of data science? What is world class
data science research?
The 21st Century Virtuous RD&D Cycle can guide the development and practice of data science. First,
the domain is just emerging, characterized by a constant flow of new ideas entering the cycle. Data science
is being attempted in every human endeavor for which there is adequate data (Brodie, 2018a). Second,
due to its immaturity (Brodie, 2015) data science must be grounded in reality, i.e., real data in real use
cases at the appropriate scale. The cycle can be used to guide the development and work of individual
data scientists and, at a greater scale, of DSRIs. Major features of the cycle are already present in most
DSRIs, specifically research-industry collaboration in their research and education. Most have industry
partners and collaborations for education, RD&D, for case studies, and for technology transfer. In most
cases, significant funding has come from industry partners. The charter of the Center of Excellence at
Goergen Institute for Data Science[7] includes collaborating with industry "to apply data science methods
and tools to solve some of the world’s greatest challenges in sectors including: Medicine and Health,
6
Good ideas hopefully arise in answer to a PIA challenge. In this example, the good idea, finding events
on the web, led to a PIA problem that was resolved with the now conventional machine driven (ML) and
human guided method. The trick is a combination of good ideas and PIA challenges, leading to valuable
results.
7
http://www.sas.r ochester.edu/ dsc/
Imaging and Optics, Energy and the Environment, Food and Agriculture, Defense and National Security,
and Economics and Finance.” The mission statement of the recently launched Harvard Data Science
Initiative[8] states "Applications are by no means limited to academia. Data scientists are currently key
contributors in seemingly every enterprise. They grow our economy, make our cities smarter, improve
healthcare, and promote civic engagement. All these activities and more are catalyzed by the
partnership between new methodologies in research and the expertise and vision to develop real-world
applications.”
Applying the 21st Century Virtuous RD&D Cycle to DSRIs must recognize three factors that
distinguish data science from conventional academic research that often lacks research-industry
engagement. First, while core or theoretical research is equally important in both cases, DSRI resources
must be allocated to applied research, technology transfer, and supporting research-industry
collaboration.[9] Unlike a computer science research institute, and in support of this objective, a DSRI might
have a Chief Scientific Officer to establish DSRI-wide data science objectives, such as contributing more
than the sum of its parts, and coordinating research across the many organizational units into the
components of data science, e.g., principles, models, and analytical methods; pipelines and infrastructure;
and a data science method, to support data science in all domains. Second, special skills, often not
present in research staff, are required for research-industry engagement, the research-development-delivery cycle, and technology transfer. For example, emerging data science platforms are increasingly
important for developing and conducting data science. A data science platform includes workflow
engines, extensive libraries of models and analytical methods, platforms for data curation and
management, large-scale computation, and visualization; that is, a technology infrastructure to support
end-to-end data science workflows or pipelines. Hence, research into the development of data science
platforms should be a DSRI research objective. Again, unlike a computer science research institute, a DSRI
might also establish a Chief Technology Officer responsible for those functions including the development
and maintenance of a shared data science technology infrastructure.
The third distinguishing factor is the relative immaturity of data science versus most academic
disciplines; excitement and hype cloud the real state of data science. A common claim is that data science
is successful, ready for technology transfer and application in most human endeavors. While there are
successful data science technologies and domain-specific results, in general this impression, often
espoused by vendors and enthusiasts,[10] is false. While there are major successes and expert data
scientists, data science is an immature, emerging domain that will take a decade to mature (Brodie, 2015,
2018a). Analysts report that most early (2010-2012) data science projects in US enterprises failed
(Forrester, 2015a, 2015b) (Demirkan & Dal, 2014) (Veeramachaneni, 2016) (Ramanathan, 2016). In late
2016, Gartner reported that while most (73%) enterprises declare data science as a core objective, only
15% have deployed Big Data projects in their organization (Gartner, 2016a) with well-known failures (Lohr
& Singer, 2016). This reflects confusion concerning data science and that technology analysts are not
reliable judges of scientific progress.
Slow progress makes perfect sense as data science is far more complex than vendors and enthusiasts
report. For example, data science platforms provide libraries of sophisticated algorithms (visualization
(Matplotlib, Matlab, Mathematica); data manipulation, aggregation, and visualization (Pandas); linear
algebra, optimization, integration, and statistics (SciPy); image processing and machine learning (SciKit-
Learn); Deep Learning (Keras, TensorFlow, Theano); Natural Language Processing (NLTK)), that business
users have significant difficulty fitting to business problems (Forrester, 2015b). There is a significant
learning curve - few people understand deep learning, let alone statistics at scale - and substantial differences with conventional data analytics. What do you mean, these aren't just spreadsheets?
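To suggest what fitting these libraries to a business problem involves, here is a minimal sketch composing two of the libraries named above (Pandas and SciKit-Learn) into a tiny end-to-end analysis; the churn-like data set and its column names are synthetic assumptions:

# A minimal sketch of composing data science platform libraries
# (pandas + scikit-learn) into an end-to-end analysis. The data are
# synthetic; real work is dominated by discovering and preparing data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "monthly_spend": rng.gamma(2.0, 50.0, n),
    "support_calls": rng.poisson(1.5, n),
})
# Synthetic label: heavy callers with low spend churn more often.
p = 1 / (1 + np.exp(-(0.8 * df.support_calls - 0.01 * df.monthly_spend)))
df["churned"] = rng.random(n) < p

X_train, X_test, y_train, y_test = train_test_split(
    df[["monthly_spend", "support_calls"]], df["churned"], random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))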
[8] http://datascience.harvard.edu/
[9] In its emerging state, data science lacks a scientific or theoretical base. Establishing data science as a science should be a fundamental objective of data science researchers and DSRIs (Brodie, 2018a).
[10] Michael Dell, Dell CEO, predicted at the 2015 Dublin Web Summit that big data analytics is the next trillion-dollar market. IDC predicts a 23.1% compound annual growth rate, reaching $48.6 billion in 2019. Forrester Research declared that "all companies are in the data business now." Gartner predicts "More than 40 percent of data science tasks will be automated by 2020" (Gartner, 2016b).
Over the next decade, research will establish data science principles, methods, practices, and
infrastructure, and will address these key questions. This research should be grounded in practical
problems, opportunities, and use cases. DSRIs should use the 21st Century Virtuous RD&D Cycle to direct
and conduct research, practice, education, and technology transfer. Initially, they might use the R&D cycle
to explore good ideas. Research-industry collaborations should be used to identify and evaluate novel
data science ideas. When collaborations can identify plausible use cases or PIA problems, the research-
development-delivery cycle should be used. That is, to identify research domains and directions, DSRIs
should identify industrial partners with whom to collaborate to establish virtuous cycles that equally
benefit researchers and industry partners. As with applied university research funding, a significant
portion of data science research funding should come from industry to increase industry-research
engagement and quickly identify valuable research with impact potential.
4.3 Developing data science education
Data science is one of the fastest growing subjects in education due to the demand for data
scientists. Data science courses, programs, degrees, and certificates are offered by most universities and
professional training institutes and are part of the mission of most DSRIs. Given the decade to maturity of
data science, how should data science education programs be developed?
Just as the 21st Century Virtuous RD&D Cycle is used to transform the research, development,
delivery, and use of computer systems and applications, it can also be used to transform education.
The intention of the recently launched 21st Century Applied PhD Program in Computer Science[11] at Texas State University is for PhD level research ideas, innovations, and challenges to be developed into prototype solutions and refined and tested on industrial scale problems of industrial partners. The cycle is to be
driven by industrial partners that investigate or face challenges collaboratively with the university. PhD
candidates work equally in research and in industry to identify and research challenges and opportunities
that are grounded in real industrial contexts; and to develop prototype solutions that are refined using
industrial use cases. This educational cycle requires technology transfer from research to advanced
prototypes to industry with opportunities and problems transferring, in the opposite direction, from
practice to advanced development and to research. It becomes virtuous with a constant stream of "good ideas" - challenges and opportunities - and of PhD candidates in one direction, and of industry PIA
problems, challenges, and opportunities in the other. The primary benefits of this program are that
research, teaching, and products are grounded in reality.
These ideas are not new. The Fachhochschule system (universities of applied sciences) has applied virtuous cycle principles in Germany since the late 1960s, and in Austria and Switzerland since the 1990s,
as a graduate extension of the vocational training and apprenticeship (Berufslehre und Ausbildung)
programs that have roots in mentorships and apprenticeships from the middle ages.
While the quality and intent of the European and US educational systems are the same, the systems
differ. Academic universities focus on theory and applied universities focus on the application of science
and engineering. Fachhochschulen usually do not grant PhDs. In addition, research in applied universities
is funded differently from research in academic universities. Usually, over 80% of applied research funding
comes from third parties to ensure research-industry engagement[12] and as a test of the PIA principle.
Unsuccessful research is quickly identified and terminated. Dedicated government agencies provide
partial funding and promote innovation and technology transfer through collaboration between industry
and the applied universities. Enrollments in Fachhochschulen are soaring, indicating the demand for
education grounded in reality - closely mirroring successful startup behavior. Due to the significance of, demand for, and perceived value of data science, education programs should be revisited, considering adding more applied aspects to conventional research and education for data science. A good example of
[11] https://cs.txstate.edu/academics/phd/
[12] A similar principle applied by the funding agency in the section 5.1 story was initially considered a death knell by the DSRI and by me. It took a year for me to see the value.
this vision is the 21st Century Applied PhD Program in Data Science at Texas State University, based on a
collaborative research-industry-development-delivery model.
5 Lessons Learned
5.1 Data science and DSRI stages of development
In 2013, I was invited to join the Scientific Advisory Committee (SAC) of Ireland’s Insight Center
for Data Analytics, at the time one of the first and largest DSRIs, composed of four partner institutes.
Since then I have actively participated on the SAC as well as on Insight’s Governance Committee. Over the
following years, I observed the development of Insight as a DSRI as well as the establishment of over 150
DSRIs at major institutions worldwide. Insight’s development as a DSRI was not without challenges. In
2017, Science Foundation Ireland (SFI) reviewed Insight for a potential five-year funding renewal. Insight
needed to tell SFI what data science was, what world class data science research was, and to measure its
progress accordingly. This led me to the observation, stated to the review board, that Insight’s
development as a DSRI reflected the development of data science as a discipline. The most thoughtful
contributors to data science fully understood that while the potential benefits for Ireland and the world
were enormous, data science as a discipline was in its infancy and faced considerable scientific and
organizational developmental challenges. Further, Insight, in operating for five years and in aspiring to world class data science contributions as a world class DSRI, had faced and overcome significant challenges that I witnessed first-hand at Insight and indirectly in eight other DSRIs.
Over five years, Insight had gone through the four stages of development that younger DSRIs are
just encountering. Insight is currently at stage five - a critical stage. Successful progress through the stages
revolved around three fundamental issues:
- Just as science and the scientific method are far more than experiments in a single domain, so too is data science more than data science activities in a single domain.
- Changing centuries of research behaviour to enable collaboration across disciplines in data science pipelines, as well as across academic and organizational boundaries.
- Producing, for Ireland and for data science, more than the sum of the parts, i.e., the results of individual member institutes.
The five stages are simple.
1. Act of creation: An organizational decision was made to form a DSRI from independent, one
might say competing, institutes with a new focus, the emerging discipline of data science. The
institutes' researchers and administrators alike - in a behavioral and legal tradition of individual progress and reward - were not happy campers. Awkwardness arose.
2. Initial participation: Participants continued business as usual, but expressed a willingness to
participate and cooperate followed by little actual collaboration and some ingrained
competitiveness. The DSRI administration soldiered on towards understanding the bigger picture
that had not been defined by anyone - funders, researchers, or advisors.
3. Data science objectives understood conceptually: After a few years of successful execution of
individual research efforts and attempts to understand data science, modest progress was made,
especially once it was clear that funding would depend on the DSRI being more than the sum of
the parts and would be measured on world-class data science, interpreted then as contributing
to data science, per se. But what is that exactly?
4. Data science objectives understood emotionally: Goals provide focus. Five years of funding of
the now seven institutes depended on the DSRI being “more than the sum of the parts”. This was
not an abstract concept but required providing benefits such as accelerating discovery in specific
parts of the Irish economy, educating data scientists, and economic growth in Ireland, involving
not just researchers but major industrial partners. Individual researchers rose to the challenge to
propose a collaborative DSRI. By the time of the review, they had become a band of data science
brothers and sisters, together with industrial partners.
5. Stand and deliver: While the DSRI will continue to produce specific data science results that are
world class in specific domains, e.g., physiology, it is defining and planning contributions to data
science, including data science principles, models, methods, and infrastructure (Brodie 2018a).
Many DSRIs around the world have been created, like Insight, by a higher-level organization,
typically a university, to coordinate the myriad data science activities in that organization. The critical
factor missing in many DSRIs, at least as viewed through their web sites, is an imperative to understand
and contribute to data science per se, to contribute more than the sum of the contributions of the partner
organizations. SFI’s funding of Insight depends on contributing to data science per se, worded as
"contributing more than the sum of the parts." This imperative is not present in many DSRIs.
5.2 Myths of applying data science in business
As often happens with new technology trends, their significance, impact, value, and adoption are
exaggerated by analysts and promoters as well as by optimists and doomsayers. Technology
analysts see their roles as reporting on new technology trends, e.g., Gartner’s Hype Cycles. If a technology
trend is seen as significant, investment analysts join the prediction party. Technology and investment
analysts are frequently wrong as they are now with data science. Many technology trends reported by
Gartner die before reaching adoption, e.g., 1980s service-oriented architectures. Some trends that are predicted as dying become widely adopted, e.g., the .com boom was reported as a failure, largely due to the .com stock market bubble, but the technology has been adopted globally and has led to transforming
many industries. Data science is one of the most visible technology trends of the 21st Century with data
scientists called “the sexiest job of the 21st Century” (Davenport, 2012) and “engineers of the future” (van
der Aalst, 2014). To illustrate the extent to which data science is blown out of proportion to reality, let's
consider several data science myths. A reasonable person might ask, given the scale, scope, and nature of
the change of data science as a new discovery paradigm, how could anyone predict with any accuracy
how valuable it will be and how it will be adopted, especially when few people, including some “experts”,
currently understand it (that, by the way, was myth #1).
Everyone is successfully applying data science: As reported above, most (80%) early (2010-2012) data science projects in US enterprises failed. By early 2017, while 73% of enterprises declare data science as a core objective, only 15% have deployed it. In reality, AI/data science is a hot area, with considerable perceived benefit. Hence many companies are exploring it. However, such projects are not easy and require ramping up rare skills, methods, and technologies. It is also difficult to know when and how to apply the technology and to appropriately interpret the results. Hence, most initial projects are highly unlikely to succeed but are critical to gaining the expertise. Applying AI/data science in business will have major successes (10%) and moderate successes (40%) (Gartner, 2016a). Most companies are exploring AI/data science, and should, but must be prepared for a significant learning curve. Not pursuing AI/data science will likely be an advantage to your competitors.
Reality: organizations perceiving advantages should explore data science and prepare for a learning curve.
Data science applications are massive: While scale is a definitive characteristic of Big Data and data
science, successful applications can be small and inexpensive. The pothole example (Brodie 2018a) was a
very successful launch of now flourishing startups in the emerging domain of autonomous vehicles. It was
based on building and placing small, inexpensive (~$100) motion detectors in seven taxis. It started with
the question shared by many businesses: What is this data science stuff? It was a pure exploration of data science and not a search for a solution to a PIA problem. As data science matures, we see that the critical characteristics of a data analysis are determined by the domain and the analytical methods applied. Volume is one characteristic that must meet statistical requirements, but even gigabytes or terabytes may be adequate and can be handled readily by laptops.
Reality: data science can be applied on modest data sets to solve interesting, small problems.
Data science is expensive: Famous, successful data analytics (Higgs Boson, Baylor-Watson cancer study,
LIGO, Google, Amazon, Facebook) often require budgets at scale (e.g., massive processing centers,
100,000 cores, 1,000s of analysts); however, data analytics even over relatively large data volumes can be
run on desktops using inexpensive or free open source tools and the cloud. Businesses can and should
conduct initial explorations like the pothole analyses at negligible cost.
Reality: small players with small budgets can profit from data science.
Data science predicts what will happen: Otto, a German retailer, orders 200,000 SKUs fully automatically. Above we cited predictions of trillions of dollars in related savings worldwide. However, the results of good data analytics are at best probabilistic with error bounds. This is somewhat similar to science (the scientific method) but is typically less precise, with lower probabilities and greater error bounds, due to the inability to apply the controls that are applied in science. Businesses should explore the predictive power of data science but with a full understanding of its probabilistic and error-prone nature. Otto and the supply chain industry constantly monitor and verify results and adjust as needed; otherwise, like H&M, you might end up with a $4.3bn overstock (New York Times, 2018).
Reality: predictions are probabilistic and come with error bounds.
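A toy numeric sketch of this reality: a probabilistic forecast supports a stocking decision through its quantiles rather than a single number. The demand distribution and costs below are hypothetical, and the quantile rule is the classic newsvendor result, not Otto's actual system:

# A toy sketch: acting on a probabilistic forecast. Order at the
# critical quantile balancing hypothetical under- and overstock costs
# (the classic newsvendor rule), not at the point forecast.
import numpy as np

rng = np.random.default_rng(2)
demand_samples = rng.normal(loc=10_000, scale=1_500, size=100_000)

under_cost, over_cost = 6.0, 2.0  # lost margin vs. holding cost
critical_q = under_cost / (under_cost + over_cost)  # 0.75

point_forecast = demand_samples.mean()
order_qty = np.quantile(demand_samples, critical_q)
lo, hi = np.quantile(demand_samples, [0.05, 0.95])

print(f"point forecast: {point_forecast:,.0f} units")
print(f"90% interval:  [{lo:,.0f}, {hi:,.0f}]")
print(f"order quantity at q={critical_q:.2f}: {order_qty:,.0f} units")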
Data science is running machine learning over data: Machine learning is highly visible in popular press
accounts of data science. In reality, one must select from thousands of AI and non-AI methods and
algorithms depending on the phenomenon being analyzed and the characteristics of the data. What’s
more, as reported above, while algorithm selection is critical, 80% of the resources (including time) for a data analysis are required just to find and prepare the data for analysis.
Reality: as there is no free lunch (Wolpert, 1997), there is no single methodology, algorithm, or tool to master to do successful data science, just as in science.
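To make the selection point concrete, the following sketch lets cross-validation, rather than fashion, choose among a handful of candidate methods; the data and the trio of scikit-learn models are arbitrary illustrations:

# A minimal sketch of algorithm selection: compare several candidate
# methods on the same task with cross-validation instead of assuming
# one method (e.g., deep learning) fits all. Data are synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} mean accuracy = {scores.mean():.3f}")
# No single winner exists across all problems ("no free lunch");
# the ranking can flip with different data characteristics.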
AI/data science is disrupting and transforming conventional industries and our lives: This widely reported myth (Marr, 2017) (Chipman, 2016) makes eye-catching press but is false. There is ample evidence that AI/data science is being applied in every human endeavor for which adequate data is available, as reported throughout this book. The list of impacted industries is long: mechanical engineering and production of industrial goods (shop floor planning, robotics, predictive maintenance); medicine (personalized health); commerce/trade (e-commerce, online business, recommenders); hospitality (demand planning and marketing via analytics, pricing based on customer analytics); transportation (ride-sharing/hailing); automotive (self-driving cars, e-mobility); services (new business models based on data); and many more. In reality, five industries have been massively disrupted by digital innovation: music, video rental, publishing (books, newspapers), taxicabs, and retailing (predominantly clothing). They are in the process of being transformed, e.g., the Spotify business model exemplifies transformation in music, Uber's in taxicabs, but the process takes years or decades. However, the vast majority of industries are currently unaffected. If an industry is being transformed, it is reflected in the stock market; e.g., a price-earnings ratio of less than 12 generally forecasts imminent collapse. According to that rule of thumb, Ford's and GM's price-earnings ratios of 7 suggest disruption and transformation, if not collapse, possibly due to electric vehicles (EVs) such as Tesla and ride-sharing/hailing. There are no such indications for the other "conventional industries" (Economist, 2017).
Reality: almost all conventional industries are impacted, but only a few are disrupted.
It's all about AI: The current popular and even scientific press suggests that AI is one of the hottest and potentially most significant technologies of the 21st century. AI is sometimes even referred to as an object, as in "an AI is used to …". Without doubt, AI, and specifically machine learning (ML) and deep learning (DL), has been applied to a wide range of problems with significant success and impact, as described above. It is very probable that ML will be applied much more extensively with even greater success and impact. However, like most "hot" technical trends, the press characterization is a wild exaggeration, a myth. First, AI is a very broad field of research and technology that pursues all forms of intelligence exhibited by a machine (Russell & Norvig, 2010); ML is one of perhaps 1,000 AI technologies. Second, until we understand ML, its application will be limited. The current, very successful ML technologies arose in the early 2000s from a previously unsuccessful technology, neural networks. Amazingly, ML, augmented by massive data sets and high-performance computing, has been applied to images, sentences, and data and appears to identify entities that are meaningful to humans, e.g., pizzas, cats, trains on tracks, and to cluster those entities based on similarities meaningful to humans. We have no idea why there is a correlation between the results of an ML analysis and meaning understood by humans. Considerable research is being invested in understanding such reasoning, but it is far from mature. As a result, the use of ML in the European Community is restricted by the GDPR. Finally, the success of ML is proportional to the data to which it is applied; typically, ML works most effectively on massive data sets. Massive data analysis requires high-performance computing, one of the critical components that moved neural networks from failure to success. Hence, most naïve uses of the term AI should be replaced with the specific AI technology, e.g., ML, plus data, plus high-performance computing. This sounds remarkably like data science.
Reality: AI is one key component, amongst many others, necessary to conduct data science; AI does not perform miracles; in many cases "AI" should be replaced by the technologies used to support analytical workflows.
6 Potential impacts of data science
The development of data science involves not just the science, technology, and applications; it also involves the opportunities, challenges, and risks posed by the applications of data science. Hence, I now briefly review some potential benefits and threats of applying data science, many of which have been reported in the popular press. However, popular press descriptions of hot technical topics and their impacts are usually to be taken with a grain of salt, especially concerning AI and data science, which even some experts do not understand well.
In the early 2010s, Big Data was the hot technology topic. In 2018, AI and its power were the hot topic, not unreasonably, as Sundar Pichai, Google CEO, said that AI will have a "more profound" impact than electricity or fire (Economist, March 2018a). Consider it a matter of terminology: Big Data on its own is of no value; similarly, without data, AI is useless. The hot technical topics that surface in the media are equally attributable to AI, massive data, and powerful computation. In what follows, as above, I refer to applications of that combination as AI/data science. Even those three terms are not adequate to achieve the hot results, since data science also depends on domain knowledge and more, but they will suffice for the following discussion.
According to Jeff Dean, director of Google's AI research unit, Google Brain, more than 10m organizations "have a problem that would be amenable to a machine-learning solution. They have the data but don't have the experts on staff" (Economist, March 2018b). That is, the potential impact of AI/data science is extremely broad.
As with any new, powerful technology, society, e.g., through legislation, is seldom able to keep up with its impacts. In the case of AI/data science, let's consider the impacts on our daily lives, both the benefits and the threats, of a multi-billion-dollar industry that currently has almost no regulations to restrain its application.
6.1 Benefits
Google's youthful founders, Sergey Brin and Larry Page, defined its original vision, "to provide access to the world's information in one click"; its mission statement, "to organize the world's information and make it universally accessible and useful"; and its famous motto, "Don't be evil". Indeed, they almost succeeded beyond their wildest dreams. The entire world benefits from instant access to much of the world's information. Yet we went from this utopian view of the potential of AI/data science to one in which Google, in 2015, dropped its famous motto from its code of conduct. I address some shortcomings, such as the use and protection of personal information, in the next section.
It is infeasible to list here the many benefits of AI/data science, so let's consider two examples, medicine and autonomous vehicles. A data science-based medical analysis can compare a patient's mammogram with 1m similar mammograms in seconds to find potential causes and the treatments that were most effective for the conditions present in the subject mammogram. Similar analyses and achievements are being made with our genetic code to identify the onset of a disease and effective treatment plans based on millions of similar cases, something no human doctor could possibly do on their own.
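How such a comparison can run over a million cases in seconds can be sketched as a nearest-neighbour search over image feature vectors. In the illustrative sketch below the 64-dimensional vectors are random stand-ins; a real system would use embeddings extracted from the images by a trained model.

    # Hedged sketch: retrieve the most similar prior cases by
    # nearest-neighbour search over (stand-in) image embeddings.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    archive = rng.normal(size=(1_000_000, 64)).astype("float32")  # 1m prior cases
    patient = rng.normal(size=(1, 64)).astype("float32")          # new mammogram

    index = NearestNeighbors(n_neighbors=10, algorithm="brute").fit(archive)
    distances, ids = index.kneighbors(patient)
    print(ids[0])  # the 10 most similar prior cases for a clinician to review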
Autonomous vehicles depend on AI/data science. It is commonly projected that autonomous vehicles will radically reduce the 1m annual traffic deaths worldwide, as well as pollution and traffic congestion, while shortening travel times, freeing us up for a better quality of life. The impacts could be far greater than those of the automobile. But how will autonomous vehicles change the world? One factor to consider is that the average car currently sits parked 95% of the time. What might be the impacts of autonomous vehicles on real estate, roads, automobile manufacturing, and employment?
Most benefits of technology harbor unanticipated threats. For example, autonomous vehicle results can be applied in many domains, e.g., autonomous weapons are used by 80 countries, including the USA, which has over 10,000¹³. Let's consider a few threats posed by AI/data science.
6.2 Threats
On May 6, 2010, the US stock market crashed. In the 2010 Flash Crash, over a trillion dollars in value was lost and the indexes (Dow Jones Industrial Average, S&P 500, Nasdaq Composite) collapsed (the Dow Jones fell ~1,000 points, 9% in value). Within 36 minutes the indexes and value largely, but not entirely, rebounded. This was due in part to algorithmic trading, which accounts for 60% of trading on US exchanges, and in part to the actions of Navinder Singh Sarao, a trader whom the U.S. Department of Justice convicted of fraud and market manipulation¹⁴. Algorithmic trading is a data science-based method of making trades using complex economic and market analyses of potentially all trades ever transacted. This was not a threat. It was a reality and a harbinger of similar threats.
How might AI/data science threaten employment? Consider the potential impact of autonomous vehicles on America's 4m professional drivers (as of 2016, US Bureau of Labor Statistics). Robots will impact a vastly larger number of jobs. The McKinsey Global Institute estimated that by 2030 up to 375m people could have their jobs "automated away" (Economist, March 2018c). These examples are the tip of the AI/data science unemployment iceberg. The Economist (Economist, 2018f) and the Organisation for Economic Co-operation and Development (OECD) (Nedelkoska & Quintini, 2018) estimate that over 50% of all jobs are vulnerable to automation.
An insidious threat is bias in decision making. Our lives are increasingly determined by algorithms. Increasingly, machine learning and other sophisticated algorithms are used to make decisions in our lives, in our companies, in our careers, in our education, and in our economy. These algorithms are developed with models that represent the significant features of the problem being addressed. No one but the developers sees the code, and fewer people still actually understand it. So, what is in the code? Are race, sex, or a history of past behaviour significant and acceptable features in parole sentencing? Are these algorithms biased against certain types of individuals? ProPublica demonstrated that parole sentencing is indeed biased against blacks (Angwin et al., 2016). The twelve vendors of the systems that the US government uses for sentencing refuse to release their code for inspection. Ironically, ProPublica proved their case using data science: they collected and analyzed sentencing data to show with high confidence that the systems were inherently biased. This has led to the algorithmic accountability movement in the legal community.
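The core computation behind such an audit is simple enough to sketch. The minimal example below uses a few synthetic records, not ProPublica's data or exact method, to compare false positive rates of a risk score across two groups, the kind of disparity measure at the heart of the Angwin et al. analysis.

    # Illustrative audit sketch with synthetic records: compare the rate at
    # which non-reoffenders were wrongly flagged high risk, per group.
    import pandas as pd

    df = pd.DataFrame({
        "group":               ["A"] * 4 + ["B"] * 4,
        "predicted_high_risk": [1, 1, 0, 0, 1, 0, 0, 0],
        "reoffended":          [0, 1, 0, 1, 0, 0, 1, 0],
    })

    # False positive rate: flagged high risk among those who did not reoffend.
    fpr = (df[df["reoffended"] == 0]
           .groupby("group")["predicted_high_risk"].mean())
    print(fpr)  # a large gap between groups is evidence of biased scoring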
13 Do you trust this computer? http://doyoutrustthiscomputer.org
14 https://www.justice.gov/opa/pr/futures-trader-pleads-guilty-illegally-manipulating-futures-market-connection-2010-flash
In many countries, tech companies, e.g., Apple, Alphabet (Google's parent), Microsoft, Amazon, Facebook, Alibaba, and Tencent, know more about us and can predict our behaviour better than we can. In some countries, the government takes this role (e.g., China's social credit system). Over the past decade, there has been increasing concern for personal information. Legislation to govern the use and privacy of personal information, the General Data Protection Regulation¹⁵ (GDPR), came into force in Europe only on May 25, 2018. US congressional hearings began only in early 2018, prompted by the alleged illegal acquisition of 87m Facebook profiles by Cambridge Analytica (CA), described below.
The power and growth of the seven companies mentioned above, the largest companies in the world by market capitalization, is directly attributable to AI/data science. Their average age is less than ten years, in contrast to the average age of 141 years of the legacy companies that they are supplanting from the top ten largest companies. These tech leaders vastly outspend the largest legacy companies in research and development; e.g., Apple's 2017 $22.6bn R&D investment was twice that of non-tech Johnson & Johnson, established in 1886. It is frequently argued (Lee, 2017) (Economist, March 2018e) that the power of AI/data science is such that the country that dominates the field will wield disproportionate economic and ultimately political power worldwide, i.e., will monopolize not just AI/data science but areas of the economy for which it is a critical success factor. Currently China and the USA are the leaders by far. However, the playing field is beginning to favor China. The power and development of AI solutions is heavily dependent on vast amounts of data. Increasing restrictions on data, such as the privacy legislation mentioned above, will significantly inhibit US AI companies, while there are few such limitations on the Chinese government, which itself collects massive data on its citizens.
It may seem dramatic, but data science has allegedly been used to threaten democracy. Alexander Nix, the now-suspended CEO of the now-insolvent CA, claimed to have applied a data science-based methodology, psychometrics, to influence political opinion. Nix reported that it was used to influence the outcomes of political elections around the world, including the 2016 British EU referendum, aka the Brexit referendum, in favor of leaving, and the 2016 US election in favor of Donald Trump. Psychometrics is based on a psychological profiling model from Cambridge and Stanford Universities. For the US election, CA illegally and legally acquired up to 5,000 data points on each of 230m Americans to develop a detailed profile of every American voter. Profiles were used to send "persuasion" messages (e.g., on gun rights) targeted to, and nuanced for, the weaknesses and preferences of individual voters. CA's activities were first reported in 2015 and resurfaced in January 2017 when Trump took office. It wasn't until April 2018 that CA's actions in the 2016 US election were considered for prosecution. Notwithstanding CA's illegal actions and potential violation of American democratic principles, CA's data science method appears to have been very effective and broadly applicable, e.g., being applied in targeted, 1-on-1 marketing. Such methods are allegedly being used by governments, e.g., in the Chinese social credit system and in Russian interference in the 2016 US election. This genie is out of the bottle.
6.3 More profound questions
A more profound question is: Will these advanced technologies enhance or replace man? In Homo Deus (Harari, 2016), the author, Yuval Noah Harari, hypothesizes that the human race, augmented by advanced technologies, specifically AI/data science, will transform homo sapiens into a new species. Just as homo sapiens surpassed and replaced Neanderthals, so will humans augmented with machines surpass homo sapiens without automation. Could you compete or survive without automation? This is well beyond considering the impacts of data science. Or is it? In 2018 there were multiple attacks on the very foundations of democracy (see above). At the TED 2018 conference, Jaron Lanier, the virtual reality pioneer, suggested that, using data, social networks had become behaviour modification networks. Harari speculated that just as corporations use data now, so too could dictatorships use data to control populations.
Technological progress is never solely positive; e.g., automation eliminates waste through optimized supply chains, but it also eliminates jobs. Progress is relative to our expectations; e.g., computers will eliminate most human drivers, thereby reducing road accidents by 95%. In this case, the cost of saving lives is a loss of jobs. The greatest impacts of technology are seldom foreseen, e.g., the redistribution of populations from cities to suburbs due to the mobility offered by automobiles. What might be the impact of a machine beating humans playing Jeopardy?
15 https://en.wikipedia.org/wiki/General_Data_Protection_Regulation
The Future of Life Institute¹⁶ was established "To catalyze and support research and initiatives for safeguarding life and developing optimistic visions of the future, including positive ways for humanity to steer its own course considering new technologies and challenges." Its motto is: "Technology is giving life the potential to flourish like never before... or to self-destruct. Let's make a difference."
7 Conclusions
Data Science is potentially one of the most significant new disciplines of the 21st Century, yet it is just emerging, poses substantial challenges, and will take a decade to mature. The potential benefits and risks warrant developing data science as a discipline and as a method for accelerated discovery in any domain for which adequate data is available. That development should be grounded in reality, following the proverb: Necessity is the mother of invention. This chapter proposes a long-standing, proven development model.
Innovation in computing technology has flourished through three successive versions of the virtuous cycle. The 20th Century Virtuous Cycle was hardware innovation and software innovation in a cycle. The 20th Century Virtuous R&D Cycle was research innovation and engineering innovation in a cycle. The emerging 21st Century Virtuous RD&D Cycle is research innovation, engineering innovation, and product innovation in a cycle. While innovation perpetuates the cycle, it is not the goal. Innovation is constantly and falsely heralded as the objective of modern research. Of far greater value are the solutions. Craig Venter, a leading innovator in genetics, said, "Good ideas are a dime a dozen. What makes the difference is the execution of the idea." The ultimate goal is successful, efficient solutions that fully address PIA problems or major challenges, or that realize significant, beneficial opportunities. Data science does not provide such results. Data science accelerates the discovery of probabilistic results within certain error bounds. It usually does not produce definitive results. Having rapidly reduced a vast search space to a smaller number of likely results, non-data-science methods, typically conventional methods in the domain of interest, are used to produce the definitive results. Once definitive results are achieved, the data science analysis can be converted to a product, e.g., a report, inventory replenishment, etc.; however, the results of such a product must be monitored, as conditions and data can change constantly. For more on this see (Meierhofer et al., 2018).
The principles and objectives of the 21st Century Virtuous RD&D Cycle are being applied in many domains beyond computer science, startups, education, and data science. In medicine it is called translational medicine (STM, 2018), in which healthcare innovation and challenges traverse the benchside/research-bedside-community¹⁷ cycle, delivering medical innovations to patients and communities more rapidly than conventional medical practice and taking experience and issues back for research and refinement. The US National Institutes of Health (NIH) established the National Center for Advancing Translational Sciences in 2012 for this purpose and is increasingly requiring its practice in NIH-funded research programs. In the broader scientific community, such activities are called translational science and translational research, e.g., (AJTR, 2018) (Fang & Casadevall, 2010). The RD&D cycle is now incorporated in all natural science and engineering research funded in Canada¹⁸.
Data science researchers and DSRI leaders might consider the 21st Century Virtuous RD&D Cycle
to develop and contribute to data science theory, practice, and education.
16 https://futureoflife.org
17 The US National Institutes of Health support of translational medicine, in which the research process includes testing research (benchside) results in practice (bedside) to speed conventional clinical trial methods.
18 Dr. Mario Pinto, President of the Natural Sciences and Engineering Research Council of Canada, announced in 2017 that a research-development-delivery method was to be used in all NSERC-funded projects.
Acknowledgement
Thanks to Dr. Thilo Stadelmann, Zurich University of Applied Sciences, Institute for Applied Information
Technology in the Swiss Fachhochschule system, for insights into these ideas; and to Dr. He H. (Anne) Ngu,
Texas State University, for insights into applying these principles and pragmatics to the development of
Texas State University’s 21st Century Applied PhD Program in Computer Science.
8 References
ACM (2015). Michael Stonebraker, 2014 Turing Award Citation¹⁹, Association for Computing Machinery, April 2015.
AJTR (2018). American Journal of Translational Research, e-Century Publishing Corporation²⁰.
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine Bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, May 23, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Braschler, M., Stadelmann, T. & Stockinger, K. (Eds.) (2018). “Applied Data Science - Lessons Learned for
the Data-Driven Business”, Berlin, Heidelberg: Springer, expected 2018.
Brodie, M.L. (2015). Understanding Data Science: An Emerging Discipline for Data-Intensive Discovery, in
Shannon Cutt (ed.), Getting Data Right: Tackling the Challenges of Big Data Volume and Variety, O’Reilly
Media, Sebastopol, CA, USA, June 2015.
Brodie, M.L. (2018a). What is Data Science? To appear in (Braschler et al., 2018).
Brodie, M.L. (Ed.) (2018b). Making Databases Work: The Practical Wisdom of Michael Stonebraker, A.M.
Turing Book Series, ACM Books, Forthcoming Summer 2018.
Chipman, I. (2016). How data analytics is going to transform all industries, Stanford Engineering Magazine, February 13, 2016.
Codd. E.F. (1970). A relational model of data for large shared data banks. Commun. ACM 13, 6 (June
1970), 377-387.
Davenport, T. H., & Patil, D. J. (2012). Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, 90(10), October 2012.
Demirkan, H. & Dal, B. (2014). The Data Economy: Why do so many analytics projects fail? Analytics
Magazine, July/August 2014.
Dohzen, T., Pamuk, M., Seong, S. W., Hammer, J., & Stonebraker, M. (2006). Data integration through transform reuse in the Morpheus project (pp. 736–738). ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006.
Economist (March 2018a). GrAIt expectations, Special Report AI in Business, The Economist, March 31,
2018.
Economist (March 2018b). External Providers: Leave it to the experts, Special Report AI in Business, The
Economist, March 31, 2018.
Economist (March 2018c). The future: Two-faced, Special Report AI in Business, The Economist, March 31,
2018.
Economist (March 2018d). Supply chains: In algorithms we trust, Special Report AI in Business, The
Economist, March 31, 2018.
19 http://amturing.acm.org/award_winners/stonebraker_1172121.cfm
20 http://www.ajtr.org
Economist (March 2018e). America v China: The battle for digital supremacy: America’s technological
hegemony is under threat from China, The Economist, March 15, 2018.
Economist (2018f). A study finds nearly half of jobs are vulnerable to automation, The Economist, April 24, 2018.
Economist (2017). Who's afraid of disruption? The business world is obsessed with digital disruption, but it has had little impact on profits, The Economist, September 30, 2017.
Fang, F. C. & Casadevall, A. (2010). Lost in Translation: Basic Science in the Era of Translational Research, Infection and Immunity, vol. 78, no. 2, pp. 563–566, Jan. 2010.
Forrester (2015a). Brief: Why Data-Driven Aspirations Fail, Forrester Research, Inc., October 7, 2015
Forrester (2015b). Predictions 2016: The Path from Data to Action for Marketers: How Marketers Will
Elevate Systems of Insight. Forrester Research, November 9, 2015
Forrester (2017b). The Forrester Wave™: Data Preparation Tools, Q1 2017, Forrester, March 13, 2017.
Gartner G00326555 (2018a). Magic Quadrant for Analytics and Business Intelligence Platforms, 26 February 2018.
Gartner G00326456 (2018b). Magic Quadrant for Data Science and Machine-Learning Platforms, 22 February 2018.
Gartner G00301536 (2017a). 2017 Magic Quadrant for Data Science Platforms, 14 February 2017.
Gartner G00326671 (2017b). Critical Capabilities for Data Science Platforms, Gartner, June 7, 2017.
Gartner G00315888 (2017c). Market Guide for Data Preparation, Gartner, 14 December 2017.
Gartner G00310700 (2016a). Survey Analysis: Big Data Investments Begin Tapering in 2016, Gartner,
September 19, 2016
Gartner G00316349 (2016b). Predicts 2017: Analytics Strategy and Technology, Gartner, Report
G00316349, November 30, 2016.
Harari, Y.N. (2016). Homo Deus: a brief history of tomorrow, Random House, 2016
Ioannidis, J.P.A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8), e124.
Lee, K.-F. (2017). The Real Threat of Artificial Intelligence, New York Times, June 24, 2017.
Lohr, S. & Singer, N. (2016) How Data Failed Us in Calling an Election, New York Times, November 10,
2016.
Marr, B. (2017). How Big Data Is Transforming Every Business, In Every Industry, Forbes.com, November 21, 2017.
Meierhofer, J., Stadelmann, T., & Cieliebak, M. (2018). Data Products. In: Braschler, M., Stadelmann, T., &
Stockinger, K. (Editors). Applied Data Science - Lessons Learned for the Data-Driven Business. Springer (to
appear).
Nagarajan, M. et al. (2015). Predicting Future Scientific Discoveries Based on a Networked Analysis of the Past Literature. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 2019-2028.
National Research Council (2012). The New Global Ecosystem in Advanced Computing: Implications for
U.S. Competitiveness and National Security. Washington, DC: The National Academies Press.
Naumann, F. (2018). Genealogy of Relational Database Management Systems²¹, Hasso-Plattner-Institut, Universität Potsdam.
Nedelkoska, L. & G. Quintini (2018), "Automation, skills use and training", OECD Social, Employment and
Migration Working Papers, No. 202, OECD Publishing, Paris, http://dx.doi.org/10.1787/2e2f4eea-en.
New York Times (2018). H&M, a Fashion Giant, Has a Problem: $4.3 Billion in Unsold Clothes, New York
Times, March 27, 2018.
O'Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens
Democracy. Crown Publishing Group, New York, NY, USA.
Olson, M. (2018). Stonebraker and open source, to appear in (Brodie, 2018b).
Palmer, A. (2018). How to create & run a Stonebraker Startup -- The Real Story, to appear in (Brodie, 2018b).
Piatetsky, G. (2016). Trump, Failure of Prediction, and Lessons for Data Scientists, KDnuggets, November
2016.
Ramanathan, A. (2016). The Data Science Delusion, Medium.com, November 18, 2016.
Russell, S. J., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
Spangler, S. et. al. (2014). Automated hypothesis generation based on mining scientific literature. In
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
(KDD '14). ACM, New York, NY, USA, 1877-1886.
STM (2018). Science Translational Medicine, a journal of the American Association for the Advancement
of Science.
Stonebraker, M. (2018a). How to start a company in 5 (not so) easy steps, to appear in (Brodie, 2018b).
Stonebraker, M. (2018b). Where Do Good Ideas Come from and How to Exploit Them? To appear in (Brodie, 2018b).
Stonebraker, M., & Kemnitz, G. (1991). The Postgres Next Generation Database Management System. Communications of the ACM, 34(10), 78–92.
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., et al. (2005). C-store: a
column-oriented DBMS, In Proceedings of the 31st international conference on Very large data bases,
2005.
Stonebraker, M., Castro Fernandez, R., Deng, D., & Brodie, M.L. (2016a). Database Decay and What to do
about it. Comm. ACM 60, 1 (December 2016), 10-11.
Stonebraker, M., Deng, D., & Brodie, M. L. (2016b). Database Decay and How to Avoid It (pp. 1–10). Proceedings of the IEEE International Conference on Big Data, Washington, DC.
Stonebraker, M., Deng, D., & Brodie, M. L. (2017). Application-Database Co-Evolution: A New Design and Development Paradigm. New England Database Day (pp. 1–3), January 2017.
Stonebraker, M., Wong, E., Kreps, P., & Held, G. (1976). The Design and Implementation of INGRES. ACM Transactions on Database Systems, 1(3), 189–222.
van der Aalst, W. M. P. (2014). Data Scientist: The Engineer of the Future. In K. Mertins, F. Bénaben, R. Poler, & J.-P. Bourrières (Eds.), Enterprise Interoperability VI (pp. 13–26). Cham: Springer International Publishing.
Veeramachaneni, K. (2016). Why You’re Not Getting Value from Your Data Science, Harvard Business
Review, December 7, 2016.
21 https://hpi.de/naumann/projects/rdbms-genealogy.html
Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE transactions on
evolutionary computation, 1(1), 67-82.