Chapter

The Data Deluge: An e‐Science Perspective

Authors: Tony Hey and Anne Trefethen

Abstract

This paper previews the imminent flood of scientific data expected from the next generation of experiments, simulations, sensors and satellites. In order to be exploited by search engines and data mining software tools, such experimental data needs to be annotated with relevant metadata giving information as to provenance, content, conditions and so on. The need to automate the process of going from raw data to information to knowledge is briefly discussed. The paper argues the case for creating new types of digital libraries for scientific data with the same sort of management services as conventional digital libraries in addition to other data-specific services. Some likely implications of both the Open Archives Initiative and e-Science data for the future role of university libraries are briefly mentioned. A substantial subset of this e-Science data needs to be archived and curated for long-term preservation. Some of the issues involved in the digital preservation of both scientific data and of the programs needed to interpret the data are reviewed. Finally, the implications of this wealth of e-Science data for the Grid middleware infrastructure are highlighted.
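As a rough illustration (not from the paper) of the kind of provenance/content/conditions annotation the abstract calls for, the following Python sketch attaches a machine-readable metadata record to a raw data file; every field name and value here is hypothetical:

```python
import json
from datetime import datetime, timezone

# Hypothetical metadata record illustrating the kind of annotation the
# abstract argues for: provenance, content and experimental conditions
# attached to a raw data file so that search engines and data-mining
# tools can exploit it. Field names are illustrative, not a standard.
record = {
    "data_file": "run_0421/spectrum.dat",
    "provenance": {
        "instrument": "synchrotron beamline B16",   # assumed example
        "operator": "jdoe",
        "created": datetime.now(timezone.utc).isoformat(),
        "processing": ["dark-frame subtraction", "normalisation"],
    },
    "content": {
        "quantity": "X-ray diffraction intensity",
        "units": "counts",
        "columns": ["two_theta_deg", "intensity"],
    },
    "conditions": {"temperature_K": 293, "pressure_Pa": 101325},
}

# Store the annotation alongside the raw data as machine-readable JSON.
print(json.dumps(record, indent=2))
```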

... Even though its growth is less exponential, the trend seems to be a constant growth factor. Web users, as well as researchers, easily get drowned within this data deluge [43], and searching for relevant information is nowadays a tedious task. Despite the numerous scientific search engines, the bibliographic phase is still a complex and tremendous task that researchers regularly face. ...
... Neither of these ideas has yet been added to our approach, and only the specific connected categories are kept, because several general ones (sometimes even off-topic) may be included in a synset's categories. A perfect example is the synset for Artificial intelligence 43. It contains correct categories such as Artificial intelligence or Computational neuroscience but also very generic ones such as Emerging technologies or Formal sciences. ...
... This safety should be defined. Currently, we have not found a rule defining the threshold at which categories are good enough based on domain connections. (Footnote 43: https://babelnet.org/synset?word=bn:00002150n) ...
Thesis
The abundance of data on the Internet is such that web users struggle to find data relevant to their initial problem and find themselves drowned in the data deluge. This observation also applies to researchers and other scientists during their bibliographic research phase. This thesis, realized in collaboration with MDPI (publisher of open access scientific articles - www.mdpi.com), proposes an original way of linking semantically related scientific articles. For this, the disambiguation of the keywords of the articles using a knowledge base is achieved, thanks to the first step of the approach, namely the categorization. A data augmentation is then performed by extracting semantic neighbors from these contextualized keywords. Finally, a metric taking into account all the possible intersections between disambiguated keywords and their semantic neighbors is proposed. Other similarity measures based on neural networks or other probabilistic models have been implemented and compared in this thesis. The results obtained are comparable to those obtained by our approach and further investigations are possible (e.g., combination of these methods). The evaluation of our approach highlights promising results (up to 92% accuracy) and opens interesting leads for future research.
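A hedged sketch (my illustration, not the thesis's actual metric) of a similarity score built from intersections between two articles' disambiguated keywords and their semantic neighbours:

```python
def neighbourhood(keywords, neighbours):
    """Union of the disambiguated keywords and their semantic neighbours."""
    expanded = set(keywords)
    for kw in keywords:
        expanded.update(neighbours.get(kw, ()))
    return expanded

def similarity(kw_a, kw_b, neighbours):
    """Illustrative Jaccard-style score over the intersections between the
    keywords of two articles and their expanded neighbourhoods. The real
    metric in the thesis weights these intersections differently."""
    a = neighbourhood(kw_a, neighbours)
    b = neighbourhood(kw_b, neighbours)
    direct = len(set(kw_a) & set(kw_b))   # keyword/keyword overlap
    expanded = len(a & b)                 # neighbourhood overlap
    return (direct + expanded) / (len(a | b) + 1e-9)

# Toy example with hypothetical BabelNet-style neighbours.
neighbours = {
    "artificial intelligence": {"machine learning", "computational neuroscience"},
    "machine learning": {"artificial intelligence", "statistics"},
}
print(similarity(["artificial intelligence"], ["machine learning"], neighbours))
```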
... For the same research aim, some natural sciences domains with many similarities to DCH research appear more advanced, or at least seem to converge more efficiently on stabilized workflows and methodologies [Huisman et al. 2021]. For a while, the DCH community has been claiming to move toward massive digitization [Santos et al. 2017, Hey and Trefethen 2003] but has had to admit the important gap between humanities and e-science top contributors (physics, astronomy, biology or earth and chemical sciences) [Hey and Trefethen 2003, Schrorer and Mudge 2017] in terms of advances in data integration and ingestion. Besides this scalability gap, Heritage Sciences share a sort of modus operandi with biology and earth sciences. ...
Article
Full-text available
In the field of Digital Heritage Studies, data provenance has always been an open and challenging issue. As Cultural Heritage objects are unique by definition, the methods, practices and strategies used to build digital documentation are not homogeneous, universal or standardized. Metadata is a minimalistic yet powerful form to source and describe a digital document. It is often required or mandatory at an advanced stage of a Digital Heritage project. Our approach is to integrate, from the data capture steps onward, meaningful information to document a Digital Heritage asset, as such assets are nowadays increasingly composed from multiple sources or multimodal imaging surveys. This article presents the methodological and technical aspects related to the ongoing development of MEMoS, standing for Metadata Enriched Multimodal documentation System. MEMoS aims to contribute to data provenance issues in current multimodal imaging surveys. It explores a way to document CH oriented capture data sets with a versatile descriptive metadata scheme inspired by the W7 ontological model. In addition, an experiment, illustrated by several case studies, explores the possibility of integrating those metadata, encoded into 2D barcodes, directly into the captured image set. The article lays the foundation of a three-part methodology, namely describe, encode and display, toward metadata enriched documentation of CH objects.
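A minimal sketch of the encode step, assuming the third-party qrcode package and a hypothetical W7-inspired record; the actual MEMoS scheme and encoding details may differ:

```python
import json
import qrcode  # third-party package; assumed available (pip install "qrcode[pil]")

# Illustrative W7-inspired descriptive record for one capture session.
# Keys follow the spirit of the W7 model (what, when, where, how, who,
# which, why); the exact MEMoS scheme may differ.
meta = {
    "what": "multispectral image set of a ceramic fragment",
    "when": "2022-05-10T09:30:00Z",
    "where": "on-site, trench 3",
    "how": "multispectral camera, 8 bands, tripod",
    "who": "survey team A",
    "which": "camera SN-1234",        # hypothetical identifier
    "why": "condition monitoring",
}

# Encode the record as a 2D barcode that can be photographed together
# with (or embedded into) the captured image set.
img = qrcode.make(json.dumps(meta))
img.save("capture_metadata_qr.png")
```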
... The widespread re-purposing [175] of data is a root cause of many data curation problems. In practice, there are three main categories of approaches to data curation, as data workers aim to prepare data for analytics, namely, ad hoc/manual [76,120], automated [68], and human-in-the-loop [89,113]. Manual approaches to data curation are still the predominant choice for data workers [15] and usually do not facilitate building well-defined processes for required data curation activities, and, hence, the impact of data curation (particularly transformations) on the quality of the data [120] remains unclear, as observed in the Illness Severity Prediction case study with the impact of inconsistent and missing data. ...
... Manual approaches to data curation are still the predominant choice for data workers [15] and usually do not facilitate building well-defined processes for required data curation activities, and, hence, the impact of data curation (particularly transformations) on the quality of the data [120] remains unclear, as observed in the Illness Severity Prediction case study with the impact of inconsistent and missing data. Thus, it can lead to several potential issues, including bias effects brought in by the data worker during their analysis process, the problem of data reusability brought in by lack of transparency and documentation of the analysis or generative processes, and scalability and generalisability across different datasets and use cases [61,76]. ...
Article
Full-text available
The appetite for effective use of information assets has been steadily rising in both public and private sector organisations. However, whether the information is used for social good or commercial gain, there is a growing recognition of the complex socio-technical challenges associated with balancing the diverse demands of regulatory compliance and data privacy, social expectations and ethical use, business process agility and value creation, and scarcity of data science talent. In this vision paper, we present a series of case studies that highlight these interconnected challenges, across a range of application areas. We use the insights from the case studies to introduce Information Resilience, as a scaffold within which the competing requirements of responsible and agile approaches to information use can be positioned. The aim of this paper is to develop and present a manifesto for Information Resilience that can serve as a reference for future research and development in relevant areas of responsible data management.
... DLM faces three main challenges. The first one is managing a massive volume of data for storage, which requires distributed storage supports and parallel processing [12]. The second challenge is linked to the highly dynamic character of the data, i.e., the fact that they may be incrementally produced, modified, or temporarily unavailable. ...
... The comparison shows that the main features of CockroachDB are that the database schema may change (Online, Active and Dynamic), data are geo-partitioned, and it uses multi-clouds. Moreover, CockroachDB is wire compatible with PostgreSQL, and it supports JSON. In this last case, the JSONB datatype stores JSON data as a binary representation of the JSONB value, which eliminates whitespace, duplicate keys, and key ordering. ...
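Because CockroachDB is wire-compatible with PostgreSQL, a standard PostgreSQL driver can exercise the JSONB behaviour described above. The following Python sketch (connection string, database, table and payload are all placeholders; the default CockroachDB port 26257 is assumed) stores and queries a JSONB document:

```python
import psycopg2
from psycopg2.extras import Json

# Placeholder connection string; CockroachDB speaks the PostgreSQL wire
# protocol, so psycopg2 can be used unchanged.
conn = psycopg2.connect("postgresql://user:pass@localhost:26257/smartbuilding")
cur = conn.cursor()

# JSONB stores a binary representation of the JSON value: whitespace,
# duplicate keys and key ordering are not preserved, only the content.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        id      SERIAL PRIMARY KEY,
        payload JSONB
    )
""")
cur.execute(
    "INSERT INTO sensor_readings (payload) VALUES (%s)",
    [Json({"sensor": "room_12/temp", "value": 21.4, "unit": "C"})],
)

# Query a field back out of the JSONB document.
cur.execute("SELECT payload->>'sensor' FROM sensor_readings")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```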
Conference Paper
In this paper, we share our vision to study a complex Information Technology (IT) system handling a massive amount of data in the context of 'smart buildings.' One technique for analyzing complex IT systems relies on emulation, where the final software system is fully deployed on real architectures and is evaluated by considering "small" instances of situations the system is supposed to solve. We propose a software architecture for studying the ecosystem of 'smart buildings.' This software architecture is built: 1) on top of ontologies for the description of smart buildings; 2) on a special tool for mastering the life cycle of data produced by sensors and actuators inside the buildings. We assume that it is equally important to model both the building's components and the flow of data produced inside the building. We use existing software components for both goals to make our concerns concrete. Following a translational methodology, we also discuss use cases illustrating the potential of our approach and the particular challenges associated with making the two main components of our emulation tool interoperate. Our main contribution is therefore a comprehensive, ambitious and realistic research plan to guide communities. The paper illustrates how computer scientists and smart-building domain scientists may communicate to address and solve specific research problems related to Big Data in emergent distributed environments. We also anticipate that experimental results demonstrating the practicality of the proposed combination of tools can be devised in the future, based on our broad vision. The paper is, first and foremost, a visionary paper.
... However, in the beginning of designing metadata for a certain purpose, it first has to be discussed how metadata is defined. Usually, metadata is defined as a structured form of knowledge representation, or simply, as many authors put it, "data about data" [2]. Edwards describes this as the holy grail of information science: ...
... First, the requirements and functions for data infrastructures are explained in detail in Sect. 2 ...
Chapter
Full-text available
This chapter introduces metadata models as a semantic technology for knowledge representation to describe selected aspects of a research asset. The process of building a hierarchical metadata model is reenacted in this chapter and highlighted by the example of EngMeta. Moreover, an overview of data infrastructures is given, the general architecture and functions are discussed, and multiple examples of data infrastructures in materials modelling are given.
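As a rough illustration of what a hierarchical metadata model can look like in code, here is a hypothetical sketch in the spirit of EngMeta (the real schema is richer and more formal; all class and field names below are invented):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Variable:
    """A physical quantity described by the dataset."""
    name: str
    unit: str

@dataclass
class ProcessingStep:
    """One step of the provenance chain that produced the data."""
    tool: str
    version: str
    parameters: dict = field(default_factory=dict)

@dataclass
class DatasetMetadata:
    """Top level of a hierarchical, EngMeta-like description (hypothetical)."""
    title: str
    creators: List[str]
    variables: List[Variable]
    provenance: List[ProcessingStep]

meta = DatasetMetadata(
    title="Channel flow simulation, Re_tau = 180",
    creators=["J. Doe"],
    variables=[Variable("velocity_u", "m/s"), Variable("pressure", "Pa")],
    provenance=[ProcessingStep("openfoam", "v2106", {"solver": "pimpleFoam"})],
)
print(meta.title, "with", len(meta.variables), "described variables")
```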
... This process needs to be automated because most genomes are too large for manual annotation, not to mention the need to annotate as many genomes as possible since sequencing speed is no longer an issue. The annotations are made possible by the fact that genes have recognizable start and end regions (promoters and terminators that often have similar or identical composition in different groups of organisms), although the exact sequence found in these regions may vary between genes [3]. ...
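A toy sketch of the principle that automated annotation can key on recognizable start and end regions; real annotation pipelines use statistical gene models rather than a single regular expression, and the motifs below are deliberately simplified:

```python
import re

# Scan a toy DNA sequence for a simple promoter-like motif (TATA box),
# a start codon, and a crude poly-T proxy standing in for a terminator.
sequence = "GGGTATAAAAGGCTAGCATGACCGATTAGCTAATTTTTTGCC"

motifs = [
    ("promoter (TATA-box-like)", re.compile(r"TATAAA")),
    ("start codon", re.compile(r"ATG")),
    ("terminator proxy", re.compile(r"TTTTTT")),
]

for name, pattern in motifs:
    match = pattern.search(sequence)
    print(name, "found at position", match.start() if match else None)
```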
Article
Full-text available
In this article, we consider which IT technologies are most used in medicine, and in genomics methods in particular, and we also take a look at the use of big data in this area. Additionally, we explain what a connectome is and analyze the 4M and 3V frameworks in genomics. Statistics in medicine is one of the tools for analyzing experimental data and clinical observations, as well as the language by means of which the obtained mathematical results are reported. However, this is not the only task of statistics in medicine. The mathematical apparatus is widely used for diagnostic purposes, for solving classification problems, for searching for new patterns, and for setting up new scientific hypotheses. The use of statistical programs presupposes knowledge of the basic methods and stages of statistical analysis: their sequence, necessity and sufficiency. In the proposed presentation, the main emphasis is not on a detailed presentation of the formulas that make up the statistical methods, but on their essence and application rules. Finally, we discuss genome-wide association studies, methods of statistical processing of medical data and their relevance. In this article, we analyzed the basic concepts of statistics, statistical methods in medicine and data science, and considered several areas in which large amounts of data are used that require modern IT technologies, including genomics, genome-wide association studies, visualization and connectome data collection.
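As a small worked example of the kind of statistical test used in genome-wide association studies (my illustration, not from the article; counts are invented and SciPy is assumed to be available):

```python
from scipy.stats import chi2_contingency

# Chi-square test of association between a single variant and disease
# status, using allele counts in cases vs. controls. A real GWAS repeats
# this across millions of variants and applies multiple-testing correction.
#                 allele A   allele a
table = [[1200,       800],    # cases
         [1000,      1000]]    # controls

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}, dof = {dof}")
```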
... This paper lies at the focal point of three orthogonal advances. First, the recent surge in GLAM 1 -led digitisation efforts (Terras, 2011), open citizen science (Haklay et al., 2021) and the expansive commodification of data (Hey and Trefethen, 2003), have enabled a new mode of historical inquiry that capitalises on the 'big data of the past' (Kaplan and Di Lenardo, 2017). Second, the 2017 breakthrough that was the transformer architecture (Vaswani et al., 2017) has led to the so-called ImageNet moment of Natural Language Processing (Ruder, 2018) and brought about unprecedented progress in transfer-learning (Raffel et al., 2020), few-shot learning (Schick and Schütze, 2021), zero-shot learning (Sanh et al., 2021), and prompt-based learning (Le Scao and Rush, 2021) for natural language. ...
Preprint
Full-text available
In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as test-bed, we use prompts to extract possible named entities. Our results show that a naive approach for prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but highlights the potential of such an approach for historical languages lacking labeled datasets. Moreover, we also find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.
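A minimal sketch of what prompt-based zero-shot NER with a T0-like model can look like, assuming the Hugging Face transformers library and the public bigscience/T0_3B checkpoint (which needs substantial memory); the paper's actual prompts and post-processing differ:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# One publicly available T0-style checkpoint; any seq2seq prompt-tuned
# model could be substituted here.
checkpoint = "bigscience/T0_3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

sentence = "Le maire de Lausanne a inauguré la gare en 1856."
prompt = (
    "Which named entities (persons, locations, organisations) are mentioned "
    f"in the following sentence? Sentence: {sentence}"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```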
... Datasets and reproducibility of research play a crucial role in modern data-driven research. Scientific data management has become increasingly complex and is gaining traction in the research community, mainly when spotlighting the sharing, reuse, and interoperation of data and metadata, particularly for machine-actionable processes [1]. Genomic databases are classic examples of this scenario. ...
Preprint
Full-text available
While the publication of datasets in scientific repositories has become broadly recognised, the repositories tend to have increasing semantic-related problems. For instance, they present various data reuse obstacles for machine-actionable processes, especially in biological repositories, hampering the reproducibility of scientific experiments. An example of these shortcomings is the GenBank database. To address these issues, we propose GAP, an innovative data model to enhance semantic data meaning. The model focuses on converging related approaches like data provenance, semantic interoperability, FAIR principles, and nanopublications. Our experiments include a prototype to scrape genomic data and trace them to nanopublications as a proof of concept. For this, (meta)data are stored in a three-level nanopub data model. The first level is related to a target organism, specifying data in terms of biological taxonomy. The second level focuses on the biological strains of the target, the central part of our contribution. The strains express information related to deciphered (meta)data of the genetic variations of the genomic material. The third level stores related scientific papers' (meta)data. We expect it will offer higher data storage flexibility and more extensive interoperability with other data sources by incorporating and adopting associated approaches to store genomic data in the proposed model.
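A hedged sketch of the three-level layout described above, expressed as rdflib named graphs; the namespace, terms and example values are placeholders rather than the actual GAP/nanopublication vocabulary:

```python
from rdflib import Dataset, Literal, Namespace, URIRef

EX = Namespace("https://example.org/gap/")   # placeholder namespace
ds = Dataset()

# Level 1: the target organism, described in terms of biological taxonomy.
organism = ds.graph(URIRef(EX["level1-organism"]))
organism.add((EX["org/EscherichiaColi"], EX.taxonRank, Literal("species")))

# Level 2: the biological strain and its deciphered genetic variations.
strain = ds.graph(URIRef(EX["level2-strain"]))
strain.add((EX["strain/K12"], EX.variantOf, EX["org/EscherichiaColi"]))
strain.add((EX["strain/K12"], EX.geneticVariation, Literal("lacZ deletion")))  # invented example

# Level 3: (meta)data of related scientific papers.
paper = ds.graph(URIRef(EX["level3-paper"]))
paper.add((EX["strain/K12"], EX.describedIn, URIRef("https://doi.org/10.1234/example")))  # placeholder DOI

# Serialise all three named graphs together.
print(ds.serialize(format="trig"))
```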
... "Data Deluge" [Hey 2003]: 6.5 zettabytes (10^21 bytes) of data in 2015; 44-60 zettabytes of data in 2020 [IDC]. Science is nowadays data-driven, with HPC/supercomputing as a driving force. Example: molecular simulation workflow: trajectories of molecular systems are computed from computer models on the compute nodes, output files are written to the parallel filesystem, knowledge is gained through analysis of the output data (multiple 100 TB per project), and results are published in a scientific paper (Figure 1). "Dark Data" [Heidorn 2008]: unusable/invisible data. 1. "'Dark data' is not carefully indexed and stored so it becomes nearly invisible to scientists and other potential users and therefore is more likely to remain underutilized and eventually lost [..]" [Heidorn 2008, 280]. 2. "[..] the type of data that exists only in the bottom left-hand desk drawer of scientists on some media that is quickly aging and soon will be unreadable by commonly available devices" [Heidorn 2008, 281]. ...
Presentation
Full-text available
... This problem of scale has been widely documented (e.g. Berman, 2008; Hey, 2003). And it is from this problem that others arise. ...
Thesis
The goal of this study is to understand if digital repositories that have a preservation mandate are engaging in disaster planning, particularly in relation to their pursuit of trusted digital repository status. For those that are engaging in disaster planning, the study examines the creation of formal disaster response and recovery plans, finding that in most cases the process of going through an audit for certification as a trusted repository provides the impetus for the creation of formalized disaster planning documentation. This paper also discusses obstacles that repositories encounter and finds that most repositories struggle with making their documentation available.
... The use of data for science and engineering is not new [10], and most key breakthroughs in the past decade have been fundamentally catalyzed by advances in data quality and quantity. However, with an unprecedented ability to collect and store data [11], we have entered a new era of data-intensive analysis, where hypotheses are now driven by data. This mode of data-driven discovery is often referred to as the fourth paradigm [12], which does not supplant, but instead complements the established modes of theoretical, experimental, and numerical inquiry. ...
Article
Full-text available
Data science, and machine learning in particular, is rapidly transforming the scientific and industrial landscapes. The aerospace industry is poised to capitalize on big data and machine learning, which excels at solving the types of multi-objective, constrained optimization problems that arise in aircraft design and manufacturing. Indeed, emerging methods in machine learning may be thought of as data-driven optimization techniques that are ideal for high-dimensional, nonconvex, and constrained, multi-objective optimization problems, and that improve with increasing volumes of data. This review will explore the opportunities and challenges of integrating data-driven science and engineering into the aerospace industry. Importantly, this paper will focus on the critical need for interpretable, generalizable, explainable, and certifiable machine learning techniques for safety-critical applications. This review will include a retrospective, an assessment of the current state-of-the-art, and a roadmap looking forward. Recent algorithmic and technological trends will be explored in the context of critical challenges in aerospace design, manufacturing, verification, validation, and services. In addition, this review will explore this landscape through several case studies in the aerospace industry. This document is the result of close collaboration between University of Washington and Boeing to summarize past efforts and outline future opportunities.
... Concomitant with the emergence of the Internet, the Web and social networks, the amount of information increases geometrically, thereby changing the whole "information ecosystem". In the past, the scarcity of information was usually the main problem, whereas it is now the opposite: the abundance of information and the so-called "data deluge" (Hey & Trefethen, 2003) need to be tackled first in order to carry out not only mundane tasks but also scholarly endeavors. The previous estimates of the growth of information seem very modest in the age of "big data" (e.g., Lyman & Varian, 2003; Gantz et al., 2008). ...
Article
Full-text available
The first university-level library schools were opened during the last quarter of the 19th century. The number of such schools has gradually increased during the first half of the 20th century, especially after the Second World War, both in the USA and elsewhere. As information has gained further importance in scientific endeavors and social life, librarianship became a more interdisciplinary field and library schools were renamed as schools of library and information science/ information studies/ information management/information to better reflect the range of education provided. In this paper, we review the major developments in education for library and information science (LIS) and the impact of these developments on the curricula of LIS schools. We then review the programs and courses introduced by some LIS schools to address the data science and data curation issues. We also discuss some of the factors such as "data deluge" and "big data" that might have forced LIS schools to add such courses to their programs. We conclude by observing that "data" has already triggered some curriculum changes in a number of LIS schools in the USA and elsewhere as "Data Science" is becoming an interdisciplinary research field just as "Information Science" has once been (and still is).
... The data deluge hit science in the mid-2000s, when processor technology was able to produce petabytes of data and communication networks enabled transferring this data [1]. New ways to cope with the flood of scientific data had to be found, and a new paradigm of data-intensive science was proclaimed in 2009 [2]. ...
Article
Full-text available
The deluge of dark data is about to happen. Lacking data management capabilities, especially in the field of supercomputing, and missing data documentation (i.e., missing metadata annotation) constitute a major source of dark data. The present work contributes to addressing this challenge by presenting ExtractIng, a generic automated metadata extraction toolkit. Existing metadata information of simulation output files, scattered through the file system, can be aggregated, parsed and converted to the EngMeta metadata model. Use cases from computational engineering are considered to demonstrate the viability of ExtractIng. The evaluation results show that the metadata extraction is simulation-code independent, in the sense that it can handle data outputs from various fields of science, is easy to integrate into simulation workflows and is compatible with a multitude of computational environments.
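A hedged sketch (not ExtractIng's actual implementation) of the general idea: walk a simulation output directory, harvest simple key-value metadata from log files, and aggregate it into one machine-readable record that could later be mapped onto a schema such as EngMeta. Directory layout, file pattern and key format are all assumptions:

```python
import json
import re
from pathlib import Path

# Naive "key = value" pattern; real extractors use per-code parsers.
KEY_VALUE = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$")

def harvest(output_dir: str) -> dict:
    """Collect file names and simple key/value pairs from *.log files."""
    record = {"files": [], "parameters": {}}
    root = Path(output_dir)
    if not root.is_dir():
        return record
    for path in root.rglob("*.log"):
        record["files"].append(str(path))
        for line in path.read_text(errors="ignore").splitlines():
            match = KEY_VALUE.match(line)
            if match:
                record["parameters"][match.group(1)] = match.group(2)
    return record

if __name__ == "__main__":
    # "simulation_output" is a hypothetical directory name.
    print(json.dumps(harvest("simulation_output"), indent=2))
```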
... The human body consists of 30,000-35,000 genes [88,296]. From the DNA structure of the human, it is estimated that there are 23 chromosomes comprising 3.2 billion base pairs [297,298]. These data increase dramatically to about 200 gigabytes. ...
Article
Full-text available
Clinical decisions are becoming more promising and evidence-based; hence, big data analytics to assist clinical decision-making has been explored for a variety of clinical fields. Due to the sheer size and availability of healthcare data, big data analytics has revolutionized this industry and promises us a world of opportunities. It promises us the power of early detection, prediction, and prevention, and helps us to improve the quality of life. Researchers and clinicians are working to enable big data to have a positive impact on health in the future. Different tools and techniques are being used to analyze, process, accumulate, assimilate, and manage large amounts of healthcare data in either structured or unstructured form. In this review, we address the need for big data analytics in healthcare: why and how can it help to improve life? We present the emerging landscape of big data and analytical techniques in the five sub-disciplines of healthcare, i.e., medical image analysis and imaging informatics, bioinformatics, clinical informatics, public health informatics and medical signal analytics. We present the different architectures, advantages and repositories of each discipline, drawing an integrated depiction of how distinct healthcare activities are accomplished in the pipeline to facilitate individual patients from multiple perspectives. Finally, the paper ends with the notable applications and challenges in the adoption of big data analytics in healthcare.
... As the advancement of genomic tools and other technologies facilitate rapid data generation, we will be confronted with the enormous task of dealing with this deluge of information (Hey and Trefethen, 2003;Bell et al., 2009). Thus, vital to advancements in next-generation biomonitoring will be concurrent advancements in computing and bioinformatics that will enable us to maximize the use of these rich genomic and biodiversity datasets. ...
Article
Full-text available
Global biodiversity loss is unprecedented, and threats to existing biodiversity are growing. Given pervasive global change, a major challenge facing resource managers is a lack of scalable tools to rapidly and consistently measure Earth's biodiversity. Environmental genomic tools provide some hope in the face of this crisis, and DNA metabarcoding, in particular, is a powerful approach for biodiversity assessment at large spatial scales. However, metabarcoding studies are variable in their taxonomic, temporal, or spatial scope, investigating individual species, specific taxonomic groups, or targeted communities at local or regional scales. With the advent of modern, ultra-high throughput sequencing platforms, conducting deep sequencing metabarcoding surveys with multiple DNA markers will enhance the breadth of biodiversity coverage, enabling comprehensive, rapid bioassessment of all the organisms in a sample. Here, we report on a systematic literature review of 1,563 articles published about DNA metabarcoding and summarize how this approach is rapidly revolutionizing global bioassessment efforts. Specifically, we quantify the stakeholders using DNA metabarcoding, the dominant applications of this technology, and the taxonomic groups assessed in these studies. We show that while DNA metabarcoding has reached global coverage, few studies deliver on its promise of near-comprehensive biodiversity assessment. We then outline how DNA metabarcoding can help us move toward real-time, global bioassessment, illustrating how different stakeholders could benefit from DNA metabarcoding. Next, we address barriers to widespread adoption of DNA metabarcoding, highlighting the need for standardized sampling protocols, experts and computational resources to handle the deluge of genomic data, and standardized, open-source bioinformatic pipelines. Finally, we explore how technological and scientific advances will realize the promise of total biodiversity assessment in a sample-from microbes to mammals-and unlock the rich information genomics exposes, opening new possibilities for merging whole-system DNA metabarcoding with (1) abundance and biomass quantification, (2) advanced modeling, such as species occupancy models, to improve species detection, (3) population genetics, (4) phylogenetics, and (5) food web and functional gene analysis. While many challenges need to be addressed to facilitate widespread adoption of environmental genomic approaches, concurrent scientific and technological advances will usher in methods to supplement existing bioassessment tools reliant on morphological and
... In an envisioned research culture of data sharing, how can we be sure that navigating through the data deluge (Hey and Trefethen 2003; Marcum and George 2010) will lead secondary researchers to the appropriate data? Almost 30 years ago, economist Martin David, an expert in public statistics and survey data analysis, expressed his concerns about data seeking with regard to successful data sharing: "Can shared data be as easily decoded as a shared library book?" (David 1991, 93). He suggested approaching the task of providing helpful data services by taking the users' perspective: "Affirmative answers […] require that we identify the expectations and needs of the secondary data user and that we provide support to meet those needs." ...
Thesis
Full text available at: https://doi.org/10.18452/22173 From information behaviour research we have a rich knowledge of how people are looking for, retrieving, and using information. We have scientific evidence for information behaviour patterns in a wide scope of contexts and situations, but we don't know enough about researchers' information needs and goals regarding the usage of research data. Having emerged from library user studies, information behaviour research especially provides insight into literature-related information behaviour. This thesis is based on the assumption that these insights cannot be easily transferred to data-related information behaviour. In order to explore this assumption, a study of secondary data users' information-seeking behaviour was conducted. The study was designed and evaluated in comparison to existing theories and models of information-seeking behaviour. The overall goal of the study was to create evidence of actual information practices of users of one particular retrieval system for social science data in order to inform the development of research data infrastructures that facilitate data sharing. The empirical design of this study follows a mixed methods approach. This includes a qualitative study in the form of expert interviews and – building on the results found therein – a quantitative web survey of secondary survey data users. The core result of this study is that community involvement plays a pivotal role in survey data seeking. The analyses show that survey data communities are an important determinant in survey data users' information seeking behaviour and that community involvement facilitates data seeking and has the capacity to reduce problems or barriers. Community involvement increases with growing experience, seniority, and data literacy. This study advances information behaviour research by modelling the specifics of data seeking behaviour. In practical terms, the study specifies data-user-oriented requirements for systems design.
... The collection, storage and exchange of enormous distributed and heterogeneous data resources have quickly become a central part of the vision for e-science. According to those running the UK e-Science Core Programme, "it is evident that e-science data generated from sensors, satellites, high-performance computer simulations, high-throughput devices, scientific images and so on will soon dwarf all of the scientific data collected in the whole history of scientific exploration" (Hey & Trefethen, 2003). Applications in the natural sciences such as astronomy, particle physics, bioinformatics, and weather prediction are thought to involve volumes of data expressed in "petabytes." ...
Chapter
Despite a substantial unfolding investment in Grid technologies (for the development of cyberinfrastructures or e-science), little is known about how, why and by whom these new technologies are being adopted or will be taken up. This chapter argues for the importance of addressing these questions from an STS (science and technology studies) perspective, which develops and maintains a working scepticism with respect to the claims and attributions of scientific and technical capacity. We identify three interconnected topics with particular salience for Grid technologies: data, networks, and accountability. The chapter provides an illustration of how these topics might be approached from an STS perspective, by revisiting the idea of “virtual witnessing”—a key idea in understanding the early emergence of criteria of adequacy in experiments and demonstrations at the birth of modern science—and by drawing upon preliminary interviews with prospective scientist users of Grid technologies. The chapter concludes that, against the temptation to represent the effects of new technologies on the growth of scientific knowledge as straightforward and determinate, e-scientists are immersed in structures of interlocking accountabilities which leave the effects uncertain.
... The use of data for science and engineering is not new [10], and most key breakthroughs in the past decade have been fundamentally catalyzed by advances in data quality and quantity. However, with an unprecedented ability to collect and store data [11], we have entered a new era of data-intensive analysis, where hypotheses are now driven by data. This mode of data-driven discovery is often referred to as the fourth paradigm [12], which does not supplant, but instead complements the established modes of theoretical, experimental, and numerical inquiry. ...
Preprint
Full-text available
Data science, and machine learning in particular, is rapidly transforming the scientific and industrial landscapes. The aerospace industry is poised to capitalize on big data and machine learning, which excels at solving the types of multi-objective, constrained optimization problems that arise in aircraft design and manufacturing. Indeed, emerging methods in machine learning may be thought of as data-driven optimization techniques that are ideal for high-dimensional, non-convex, and constrained multi-objective optimization problems, and that improve with increasing volumes of data. In this review, we will explore the opportunities and challenges of integrating data-driven science and engineering into the aerospace industry. Importantly, we will focus on the critical need for interpretable, generalizable, explainable, and certifiable machine learning techniques for safety-critical applications. This review will include a retrospective, an assessment of the current state-of-the-art, and a roadmap looking forward. Recent algorithmic and technological trends will be explored in the context of critical challenges in aerospace design, manufacturing, verification, validation, and services. In addition, we will explore this landscape through several case studies in the aerospace industry. This document is the result of close collaboration between UW and Boeing to summarize past efforts and outline future opportunities.
... For example, radiological imaging such as computed tomography scans, magnetic resonance images, or positron emission tomography scan images can reach several hundreds of megabytes per exam, while stereoscopic video related to high-definition surgery is based on two streams of 1.5 GB per second each. In the United States, the storage of mammograms required 2.5 PB in 2009, and 30% of all the images stored in the world in 2010 were medical images [4]. ...
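For scale, a quick back-of-the-envelope calculation (mine, not the paper's) of the stereoscopic-surgery data rate quoted above:

```python
# Two streams of high-definition stereoscopic surgery video at 1.5 GB/s each.
streams = 2
rate_gb_per_s = 1.5
seconds_per_hour = 3600

gb_per_hour = streams * rate_gb_per_s * seconds_per_hour
print(f"{gb_per_hour:,.0f} GB per hour of surgery")        # 10,800 GB ≈ 10.8 TB

# Assuming (hypothetically) one such hour per day for a year:
tb_per_year = gb_per_hour * 365 / 1000
print(f"≈ {tb_per_year:,.0f} TB per year for a single operating room")
```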
Article
Full-text available
The amount of medical video data that has to be securely stored has been growing exponentially. This rapid expansion is mainly caused by the introduction of higher video resolutions such as 4K and 8K to medical devices and the growing usage of telemedicine services, along with a general trend toward increasing transparency with respect to medical treatment, resulting in more and more medical procedures being recorded. Such video data, as medical data, must be maintained for many years, resulting in datasets at the exabyte scale that each hospital must be able to store in the future. Currently, hospitals do not have the required information and communications technology infrastructure to handle such large amounts of data in the long run. In this paper, we discuss the challenges and possible solutions to this problem. We propose a generic architecture for a holistic, end-to-end recording and storage platform for hospitals, define crucial components, and identify existing and future solutions to address all parts of the system. This paper focuses mostly on the recording part of the system by introducing the major challenges in the area of bioinformatics, with particular focus on three major areas: video encoding, video quality, and video metadata.
... We are living in the age of big data, advanced analytics, and data science. The trend of "big data growth" [Laney 2001; CSC 2012; Beyer and Laney 2012; McKinsey 2011; Vesset et al. 2012], or "data deluge" [Hey and Trefethen 2003], has not only triggered tremendous hype and buzz, but more importantly presents enormous challenges which in turn bring incredible innovation and economic opportunities. Big data has attracted intensive and growing attention, initially from giant private data-oriented enterprises and lately from major governmental organizations and academic institutions. ...
Preprint
The twenty-first century has ushered in the age of big data and data economy, in which data DNA, which carries important knowledge, insights and potential, has become an intrinsic constituent of all data-based organisms. An appropriate understanding of data DNA and its organisms relies on the new field of data science and its keystone, analytics. Although it is widely debated whether big data is only hype and buzz, and data science is still in a very early phase, significant challenges and opportunities are emerging or have been inspired by the research, innovation, business, profession, and education of data science. This paper provides a comprehensive survey and tutorial of the fundamental aspects of data science: the evolution from data analysis to data science, the data science concepts, a big picture of the era of data science, the major challenges and directions in data innovation, the nature of data analytics, new industrialization and service opportunities in the data economy, the profession and competency of data education, and the future of data science. This article is the first in the field to draw a comprehensive big picture, in addition to offering rich observations, lessons and thinking about data science and analytics.
... assessment of value perceived and collection of feedback) the data infrastructure can be continuously improved to respond to users' demands. As a consequence, the CSF "Increasing participation in content use and reuse, and take-up from users" can be achieved (Hey and Trefethen, 2003; Janssen et al., 2012). The CSF "Acceptable Quality of Service" is influenced by the quality of the services delivered, the security in services, data communication and distribution, and the discoverability of contents and the ease-of-use aspect of the data infrastructure CDI. ...
Preprint
Full-text available
The systems that operate the infrastructure of cities have evolved in a fragmented fashion across several generations of technology, causing city utilities and services to operate sub-optimally, limiting the creation of new value-added services and restricting opportunities for cost-saving. The integration of cross-domain city data offers a new wave of opportunities to mitigate some of these impacts and enables city systems to draw effectively on interoperable data that will be used to deliver smarter cities. Despite the considerable potential of city data, current smart city initiatives have mainly addressed the problem of data management from a technology perspective, have treated it as a single and disjoint ICT development project, and have disregarded stakeholders and data needs. As a consequence, such initiatives are susceptible to failure from inadequate stakeholder input, neglected requirements, and information fragmentation and overload. They are also likely to be limited in terms of both scalability and future-proofing against technological, commercial and legislative change. This paper proposes a systematic business-model-driven framework, named SMARTify, to guide the design of large and highly interconnected data infrastructures that are provided and supported by multiple stakeholders. The framework is used to model, elicit and reason about the requirements of the service, technology, organization, value, and governance aspects of smart cities. The requirements serve as an input to a closed-loop supply chain model, which is designed and managed to explicitly consider the activities and processes that enable the stakeholders of smart cities to efficiently leverage their collective knowledge at the R&D, Procurement, Roll-Out and Market stages. We demonstrate how our approach can be used to design data infrastructures by examining a series of exemplary scenarios and by demonstrating how our approach handles the holistic design of a data infrastructure and informs the decision-making process.
... The human body consists of 30,000 to 35,000 genes [72,48]. From the DNA structure of the human, it is estimated that there are 23 chromosomes comprising 3.2 billion base pairs [78,117]. This data increases dramatically to about 200 gigabytes. ...
Preprint
Clinicians' decisions are becoming more and more evidence-based, meaning that in no other field is big data analytics as promising as in healthcare. Due to the sheer size and availability of healthcare data, big data analytics has revolutionized this industry and promises us a world of opportunities. It promises us the power of early detection, prediction, and prevention, and helps us to improve the quality of life. Researchers and clinicians are working to enable big data to have a positive impact on health in the future. Different tools and techniques are being used to analyze, process, accumulate, assimilate and manage large amounts of healthcare data in either structured or unstructured form. In this paper, we would like to address the need for big data analytics in healthcare: why and how can it help to improve life? We present the emerging landscape of big data and analytical techniques in the five sub-disciplines of healthcare, i.e., medical image analysis and imaging informatics, bioinformatics, clinical informatics, public health informatics and medical signal analytics. We present the different architectures, advantages and repositories of each discipline, drawing an integrated depiction of how distinct healthcare activities are accomplished in the pipeline to facilitate individual patients from multiple perspectives. Finally, the paper ends with the notable applications and challenges in the adoption of big data analytics in healthcare.
Article
Due to the vast amounts of data generated during research activities and the mandates obtained from funding agencies, higher education institutions put a high priority on research data management (RDM). RDM is prioritized by global organizations, and India is no exception. This work aims to address the current state of RDM in the top-ranked higher education institutions (HEIs) of India and attempts to answer the research questions posed on various aspects of RDM, such as RDM services, policy implementation, responsible stakeholders, funding support for RDM strategy and development, and so on. Furthermore, unlike most of the existing literature, the development of RDM capacity is viewed in this study as an institutional concern rather than a library concern. The survey and website analysis approaches were used to conduct this study. The outcomes of the study show that RDM is still at an early stage among India's leading institutions.
Article
This study aims to analyze the needs of researchers in a regional comprehensive university for research data management services; discuss the options for developing a research data management program at the university; and then propose a phased three-year implementation plan for the university libraries. The method was to design a survey to collect information from researchers and assess and evaluate their needs for research data management services. The results show that researchers’ needs in a regional comprehensive university could be quite different from those of researchers in a research-intensive university. Also, the results verify the hypothesis that researchers in the regional comprehensive university would welcome the libraries offering managed data services for the research community. Therefore, this study suggests a phased three-year implementation plan. The significance of the study is that it can give some insights and helpful information for regional comprehensive universities that are planning to develop a research data management program.
Article
In the aftermath of revelations made by ex-NSA employee Edward Snowden about violation of privacy of individuals by states in the name of surveillance, right to privacy became one of the highly debated rights. There is no doubt that the state must secure privacy of its citizens, but it also has a responsibility towards safety of the citizens. There exist different views related to privacy and surveillance. One view is that the state has no right to look into the private affairs of an individual while the other view is that there is no harm in putting someone suspicious under the surveillance as it is the duty of the State to prevent any untoward act in the society. Considering the contrasting views about privacy and surveillance, this article explores the position existing in the United Kingdom and aims to answer several questions pertaining to the Privacy v. Surveillance debate.
Chapter
In this chapter, we will introduce several key points of this new discipline with a particular focus on human-inspired cognitive systems. We will provide several examples of well-known developed robots, to finally reach a detailed description of a special case study: F.A.C.E., Facial Automaton for Conveying Emotions, which is a highly expressive humanoid robot with a bio-inspired cognitive system. At the end of the chapter, we will briefly discuss the future perspective about this branch of science and its potential merging with the IoT, giving our vision of what could happen in a not-too-distant future.
Chapter
While the publication of datasets in scientific repositories has become broadly recognised, the repositories tend to have increasing semantic-related problems. For instance, they present various data reuse obstacles for machine-actionable processes, especially in biological repositories, hampering the reproducibility of scientific experiments. An example of these shortcomings is the GenBank database. To address these issues, we propose GAP, an innovative data model to enhance semantic data meaning. The model focuses on converging related approaches like data provenance, semantic interoperability, FAIR principles, and nanopublications. Our experiments include a prototype to scrape genomic data and trace them to nanopublications as a proof of concept. For this, (meta)data are stored in a three-level nanopub data model. The first level is related to a target organism, specifying data in terms of biological taxonomy. The second level focuses on the biological strains of the target, the central part of our contribution. The strains express information related to deciphered (meta)data of the genetic variations of the genomic material. The third level stores related scientific papers' (meta)data. We expect it will offer higher data storage flexibility and more extensive interoperability with other data sources by incorporating and adopting associated approaches to store genomic data in the proposed model. Keywords: Nanopublication, FAIR principles, Data provenance, Genomic data, Reusability, Interoperability
Article
This paper presents findings from an interview study of research data managers in academic data archives. Our study examined policies and professional autonomy with a focus on dilemmas encountered in everyday work by data managers. We found that dilemmas arose at every stage of the research data lifecycle, and legacy data presents particularly vexing challenges. The iFields' emphasis on knowledge organization and representation provides insight into how the data used by scientists are turned into knowledge. The iFields' disciplinary emphasis also encompasses the sociotechnical complexity of the dilemmas that we found arise in research data management. Therefore, we posit that iSchools are positioned to contribute to data science education by teaching about ethics and infrastructure used to collect, organize, and disseminate data through problem-based learning.
Article
Full-text available
The ever-increasing amount of data generated from experiments and simulations in engineering sciences is relying more and more on data science applications to generate new knowledge. Comprehensive metadata descriptions and a suitable research data infrastructure are essential prerequisites for these tasks. Experimental tribology, in particular, presents some unique challenges in this regard due to the interdisciplinary nature of the field and the lack of existing standards. In this work, we demonstrate the versatility of the open source research data infrastructure Kadi4Mat by managing and producing FAIR tribological data. As a showcase example, a tribological experiment is conducted by an experimental group with a focus on comprehensiveness. The result is a FAIR data package containing all produced data as well as machine- and user-readable metadata. The close collaboration between tribologists and software developers shows a practical bottom-up approach and how such infrastructures are an essential part of our FAIR digital future.
Article
Full-text available
ABSTRACT Objective: Similarly to the "information explosion," the Big Data phenomenon has increasingly become an object of study for Information Science/Knowledge Organization (IS/KO). How can we discover, access, process, and reuse the enormous and growing amount of data continuously made available on the Web by our society? In particular, how should we treat the so-called "unstructured data," textual documents, which have always been the object of IS/KO? Methodology: Broad-spectrum theories such as Ontology and Semiotics were used to analyze data as an essential element of Big Data, especially "unstructured data." Results: From the analysis of several definitions of data, a datum is identified as part of already known logical and semiotic schemes, namely propositions. A datum is found together with others, forming datasets. Datasets are in fact sets of propositions. These are present in what is known as structured data: tables of relational databases or spreadsheets. Textual documents also contain sets of propositions. Structured data are compared with "unstructured data." Conclusions: Although, at the limit, both contain propositions and may be equivalent as sets, structured data are expressed and perceived as a whole, whereas unstructured datasets are processual, expressed sequentially, which makes it harder to identify unstructured data in textual documents for machine processing.
Article
Full-text available
Objective. This study aims to analyze the scientific production on research data management indexed in the Dimensions database. Design/Methodology/Approach. Using the term “research data management” in the Dimensions database, 677 articles were retrieved and analyzed employing bibliometric and altmetric indicators. The Altmetrics.com system was used to collect data from alternative virtual sources to measure the online attention received by the retrieved articles. Bibliometric networks from journals bibliographic coupling and keywords co-occurrence were generated using the VOSviewer software. Results/Discussion. Growth in scientific production over the period 1970-2021 was observed. The countries/regions with the highest rates of publications were the USA, Germany, and the United Kingdom. Among the most productive authors were Andrew Martin Cox, Stephen Pinfield, Marta Teperek, Mary Anne Kennan, and Amanda L. Whitmire. The most productive journals were the International Journal of Digital Curation, Journal of eScience Librarianship, and Data Science Journal, while the most representative research areas were Information and Computing Sciences, Information Systems, and Library and Information Studies. Conclusions. The multidisciplinarity in research data management was demonstrated by publications occurring in different fields of research, such as Information and Computing Sciences, Information Systems, Library and Information Studies, Medical and Health Sciences, and History and Archeology. About 60% of the publications had at least one citation, with a total of 3,598 citations found, featuring a growing academic impact. Originality/Value. This bibliometric and altmetric study allowed the analysis of the literature on research data management. The theme was investigated in the Dimensions database and analyzed using productivity, impact, and online attention indicators.
Article
Full-text available
In the design and development of novel materials that have excellent mechanical properties, classification and regression methods have been diversely used across mechanical deformation simulations or experiments. The use of materials informatics methods on large data that originate in experiments or/and multiscale modeling simulations may accelerate materials’ discovery or develop new understanding of materials’ behavior. In this fast-growing field, we focus on reviewing advances at the intersection of data science with mechanical deformation simulations and experiments, with a particular focus on studies of metals and alloys. We discuss examples of applications, as well as identify challenges and prospects.
Preprint
Full-text available
In the design and development of novel materials that have excellent mechanical properties, classification and regression methods have been used in diverse ways across mechanical deformation simulations and experiments. The use of materials informatics methods on large data sets that originate in experiments and/or multiscale modeling simulations may accelerate materials discovery or develop new understanding of materials’ behavior. In this fast-growing field, we focus on reviewing advances at the intersection of data science with mechanical deformation simulations and experiments, with a particular focus on studies of metals and alloys. We discuss examples of applications, as well as identify challenges and prospects.
Chapter
Advancements in digital technology have eased the process of gathering, generating, and altering digital data at large scale. The sheer scale of the data necessitates the development and use of smaller secondary data structured as ‘indices,’ which are typically used to locate desired subsets of the original data, thereby speeding up data referencing and retrieval operations. Many variants of such indices exist in today’s database systems, and the subject of their design is well investigated by computer scientists. However, indices are examples of data derived from existing data; and the implications of such derived indices, as well as indices derived from other indices, pose problems that require careful ethical analysis. But before being able to thoroughly discuss the full nature of such problems, let alone analyze their ethical implications, an appropriate and complete vocabulary in the form of a robust taxonomy for defining and describing the myriad variations of derived indices and their nuances is needed. This paper therefore introduces a novel taxonomy of derived indices that can be used to identify, characterise, and differentiate derived indices.
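To make the notion of a derived index, and of an index derived from another index, concrete, here is a small sketch: an inverted index built from a toy document collection, followed by a second index derived from the first one's keys. The data and structure are illustrative only and are not drawn from the chapter's taxonomy.

```python
# A minimal sketch of derived indices: an inverted index built from documents,
# and a further index derived from that index (grouping its keys by first letter).
from collections import defaultdict

documents = {
    "d1": "grids tie together distributed storage",
    "d2": "storage resource broker for data grids",
}

# First-order derived data: term -> documents containing the term.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Second-order derived data: an index over the index's own keys.
prefix_index = defaultdict(list)
for term in inverted_index:
    prefix_index[term[0]].append(term)

print(sorted(inverted_index["storage"]))   # which documents mention 'storage'
print(sorted(prefix_index["s"]))           # which indexed terms start with 's'
```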
Article
Full-text available
This open access book discusses advances in semantic interoperability for materials modelling, aiming at integrating data obtained from different methods and sources into common frameworks, and facilitating the development of platforms where simulation services in computational molecular engineering can be provided as well as coupled and linked to each other in a standardized and reliable way. The Virtual Materials Marketplace (VIMMP), which is open to all service providers and clients, provides a framework for offering and accessing such services, assisting the uptake of novel modelling and simulation approaches by SMEs, consultants, and industrial R&D end users. Semantic assets presented include the EngMeta metadata schema for research data infrastructures in simulation-based engineering and the collection of ontologies from VIMMP, including the ontology for simulation, modelling, and optimization (OSMO) and the VIMMP software ontology (VISO).
Book
Full-text available
DOI: 10.1007/978-3-030-68597-3 This open access book discusses advances in semantic interoperability for materials modelling, aiming at integrating data obtained from different methods and sources into common frameworks, and facilitating the development of platforms where simulation services in computational molecular engineering can be provided as well as coupled and linked to each other in a standardized and reliable way. The Virtual Materials Marketplace (VIMMP), which is open to all service providers and clients, provides a framework for offering and accessing such services, assisting the uptake of novel modelling and simulation approaches by SMEs, consultants, and industrial R&D end users. Semantic assets presented include the EngMeta metadata schema for research data infrastructures in simulation-based engineering and the collection of ontologies from VIMMP, including the ontology for simulation, modelling, and optimization (OSMO) and the VIMMP software ontology (VISO).
Chapter
Full-text available
Several studies show the difficulty of reusing the ever-increasing amount of genomic data. Initiatives are being created to mitigate this concern; one of the best known is the FAIR Data Principles. Nonetheless, related works are too generic and do not properly describe both the human and the machine perspectives on the FAIRness of databases. To bridge this gap, our paper introduces an approach named the Bio FAIR Evaluator Framework, a semi-automated tool designed to analyze the FAIRness of genomic databases. Furthermore, we performed experiments that analyzed selected genomic databases from two orthogonal and complementary perspectives (human and machine). The approach uses standardized FAIR metrics and generates recommendation reports indicating to researchers how to enhance the FAIRness of their databases. Our findings, when compared with related works, show the feasibility of the approach and indicate that current genomic databases comply poorly with the FAIR Principles.
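The sketch below conveys the general shape of a checklist-style FAIRness evaluation; the individual checks and record fields are hypothetical simplifications, not the standardized metrics used by the Bio FAIR Evaluator Framework.

```python
# A minimal sketch of a checklist-style FAIRness check over a database record.
# The checks and field names are hypothetical simplifications, not the
# standardized FAIR metrics used by the framework described above.

def evaluate_fairness(record):
    checks = {
        "F: has persistent identifier": bool(record.get("doi")),
        "A: retrievable over a standard protocol": record.get("access_url", "").startswith("https://"),
        "I: uses a community vocabulary": bool(record.get("ontology_terms")),
        "R: carries a machine-readable licence": bool(record.get("license")),
    }
    score = sum(checks.values()) / len(checks)
    recommendations = [name for name, passed in checks.items() if not passed]
    return score, recommendations

record = {"doi": "10.1234/example", "access_url": "https://example.org/data", "license": ""}
score, todo = evaluate_fairness(record)
print(f"FAIRness score: {score:.2f}; recommendations: {todo}")
```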
Chapter
WiPo (Web in your pocket) is a prototypical mobile information-provisioning concept that offers potential benefits in situations where data sources are vast, dynamic and unvalidated and where continuous Internet connectivity cannot be assured. One such case is search and rescue (SAR), a distinctive form of emergency management characterized by the need for high-quality, accurate and time-sensitive information. This paper reports on empirical research exploring the potential of a real-world application of a mobile service such as WiPo, which is based on the delivery of highly curated, multi-source data made available offline. Adopting an interpretive, interview-based approach, the authors evaluate the potential usefulness of WiPo for search and rescue incidents in New Zealand. Upon learning of the core functionality of WiPo and its alignment with the typical search and rescue situation, study participants were unanimously positive about its potential for improving search and rescue management and outcomes.
Chapter
“Big Data” is an emerging term used across business, engineering, and other domains. Although Big Data is a popular term today, it is not a new concept; however, the means by which data can be collected are more readily available than ever, which makes Big Data more relevant than ever because it can be used to improve decisions and insights within the domains in which it is used. The term Big Data can be loosely defined as data that is too large for traditional analysis methods and techniques. In this article, a variety of prominent but loose definitions of Big Data are presented. In addition, a comprehensive overview of issues related to Big Data is given: the paper examines the forms and locations of Big Data, methods for analyzing and exploiting it, and current research on the topic. Big Data also raises a myriad of tangential issues, from privacy to analysis methods, which are also surveyed, along with best practices. Finally, the epistemology of Big Data and its history are examined, as well as the technical and societal problems that accompany it.
Chapter
With the continuous increase in the data requirements of scientific and commercial applications, access to remote and distributed data has become a major bottleneck for end-to-end application performance. Traditional distributed computing systems closely couple data access and computation, and generally, data access is considered a side effect of computation. The limitations of traditional distributed computing systems and CPU-oriented scheduling and workflow management tools in managing complex data handling have motivated a newly emerging era: data-aware distributed computing. In this chapter, the authors elaborate on how the most crucial distributed computing components, such as scheduling, workflow management, and end-to-end throughput optimization, can become “data-aware.” In this new computing paradigm, called data-aware distributed computing, data placement activities are represented as full-featured jobs in the end-to-end workflow, and they are queued, managed, scheduled, and optimized via a specialized data-aware scheduler. As part of this new paradigm, the authors present a set of tools for mitigating the data bottleneck in distributed computing systems, which consists of three main components: a data-aware scheduler, which provides capabilities such as planning, scheduling, resource reservation, job execution, and error recovery for data movement tasks; integration of these capabilities to the other layers in distributed computing, such as workflow planning; and further optimization of data movement tasks via dynamic tuning of underlying protocol transfer parameters.
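A toy sketch of the core idea, namely treating data placement as a first-class, schedulable job on which compute jobs depend, might look like the following; the job model and scheduling rule are invented for illustration and are much simpler than the data-aware scheduler described in the chapter.

```python
# A toy sketch of "data-aware" scheduling: data-placement tasks are jobs in
# their own right, and compute jobs only run once their inputs have been staged.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    kind: str                                   # "transfer" or "compute"
    needs: list = field(default_factory=list)   # datasets required before running

staged = set()                      # datasets already placed on the execution site
queue = [
    Job("stage-genome", "transfer", needs=[]),
    Job("align-reads", "compute", needs=["genome"]),
]
produces = {"stage-genome": "genome"}  # what each transfer job stages

while queue:
    runnable = [j for j in queue if j.kind == "transfer" or all(d in staged for d in j.needs)]
    if not runnable:
        raise RuntimeError("deadlock: compute jobs waiting on unscheduled transfers")
    job = runnable[0]               # a real scheduler would also plan, reserve, and retry
    queue.remove(job)
    if job.kind == "transfer":
        staged.add(produces[job.name])
    print(f"ran {job.kind} job: {job.name}")
```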
Chapter
Many-task computing aims to bridge the gap between two computing paradigms: high-throughput computing and high-performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e. the reliance on parallel file systems with static configurations) do not scale to today’s largest systems for data-intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a “data diffusion” approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real-world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, under both static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.
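The chapter's competitive online caching eviction policy is not reproduced here, but the general shape of cache eviction for locally staged ("diffused") data can be sketched with a simple least-recently-used policy; the object names and sizes below are hypothetical.

```python
# A minimal sketch of cache eviction for locally staged data objects, using a
# least-recently-used policy as a stand-in for the chapter's own online policy.
from collections import OrderedDict

class DataCache:
    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.used = 0.0
        self.objects = OrderedDict()   # name -> size_gb, least recently used first

    def access(self, name, size_gb):
        if name in self.objects:                 # cache hit: refresh recency
            self.objects.move_to_end(name)
            return "hit"
        while self.used + size_gb > self.capacity and self.objects:
            evicted, evicted_size = self.objects.popitem(last=False)
            self.used -= evicted_size
            print(f"evicted {evicted} ({evicted_size} GB)")
        self.objects[name] = size_gb             # cache miss: stage the object
        self.used += size_gb
        return "miss"

cache = DataCache(capacity_gb=10)
for obj, size in [("a", 4), ("b", 4), ("a", 4), ("c", 5)]:
    print(obj, cache.access(obj, size))
```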
Chapter
Data curation processes comprise a number of activities, such as transforming, filtering or de-duplicating data. These processes consume an excessive amount of time in data science projects, because datasets are often external, re-purposed and generally not ready for analytics. Overall, data curation processes are difficult to automate and require human input, which results in a lack of repeatability and potential errors propagating into analytical results. In this paper, we explore a crowd intelligence-based approach to building robust data curation processes. We study how data workers engage with data curation activities, specifically related to data quality detection, and how to build a robust and effective data curation process by learning from the wisdom of the crowd. With the help of a purpose-designed data curation platform based on IPython Notebook, we conducted a lab experiment with data workers and collected a multi-modal dataset that includes measures of task performance and behavioural data. Our findings identify avenues by which effective data curation processes can be built through crowd intelligence.
Article
Full-text available
During the past few years, deep learning (DL) architectures have been employed in many areas such as object detection, face recognition, natural language processing, medical image analysis and other related applications. In these applications, DL has achieved remarkable results, matching the performance of human experts. This paper presents a novel convolutional neural network (CNN)-based approach for the detection of breast cancer in invasive ductal carcinoma (IDC) tissue regions using whole-slide images (WSI). Breast cancer has been a leading cause of death among women, and finding malignant regions in WSIs remains a demanding task for pathologists. In this research, we implemented different CNN models, including VGG16, VGG19, Xception, Inception V3, MobileNetV2, ResNet50, and DenseNet. The experiments were performed on a standard WSI data set comprising 163 IDC patients; for performance evaluation, the same data set was divided into 113 training and 49 testing images. Testing was carried out separately for each model, and the results showed that our proposed CNN model achieved 83% accuracy, which is better than the other models.
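The paper's exact architectures and training setup are not reproduced here; as a rough sketch of the transfer-learning pattern it describes (a pretrained backbone such as VGG16 with a small binary-classification head), something like the following Keras code could be used, with the patch-extraction pipeline omitted.

```python
# A rough sketch of transfer learning with a pretrained VGG16 backbone for a
# binary (IDC vs. non-IDC) patch classifier. The input size, head layers and
# training settings are illustrative choices, not the paper's exact setup.
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                      # freeze the pretrained backbone

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # benign vs. malignant
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# Training would then use image patches extracted from the whole-slide images:
# model.fit(train_dataset, validation_data=test_dataset, epochs=...)
```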
Article
Purpose The purpose of this paper is to investigate how scientists’ prior data-reuse experience affects their data-sharing intention by updating diverse attitudinal, control and normative beliefs about data sharing. Design/methodology/approach This paper used a survey method and the research model was evaluated by applying structural equation modelling to 476 survey responses from biological scientists in the USA. Findings The results show that prior data-reuse experience significantly increases the perceived community and career benefits and subjective norms of data sharing and significantly decreases the perceived risk and effort involved in data sharing. The perceived community benefits and subjective norms of data sharing positively influence scientists’ data-sharing intention, whereas the perceived risk and effort negatively influence scientists’ data-sharing intention. Research limitations/implications Based on the theory of planned behaviour, the research model was developed by connecting scientists’ prior data-reuse experience and data-sharing intention mediated through diverse attitudinal, control and normative perceptions of data sharing. Practical implications This research suggests that to facilitate scientists’ data-sharing behaviours, data reuse needs to be encouraged. Data sharing and reuse are interconnected, so scientists’ data sharing can be better promoted by providing them with data-reuse experience. Originality/value This is one of the initial studies examining the relationship between data-reuse experience and data-sharing behaviour, and it considered the following mediating factors: perceived community benefit, career benefit, career risk, effort and subjective norm of data sharing. This research provides an advanced investigation of data-sharing behaviour in the relationship with data-reuse experience and suggests significant implications for fostering data-sharing behaviour.
Article
Full-text available
In December 1999, IBM announced the start of a five-year effort to build a massively parallel computer, to be applied to the study of biomolecular phenomena such as protein folding. The project has two main goals: to advance our understanding of the mechanisms behind protein folding via large-scale simulation, and to explore novel ideas in massively parallel machine architecture and software. This project should enable biomolecular simulations that are orders of magnitude larger than current technology permits. Major areas of investigation include: how to most effectively utilize this novel platform to meet our scientific goals, how to make such massively parallel machines more usable, and how to achieve performance targets, with reasonable cost, through novel machine architectures. This paper provides an overview of the Blue Gene project at IBM Research. It includes some of the plans that have been made, the intended goals, and the anticipated challenges regarding the scientific work, the software application, and the hardware design.
Conference Paper
Full-text available
The usefulness of the many on-line journals and scientific digital libraries that exist today is limited by the lack of a service that can federate them through a unified interface. The Open Archives Initiative (OAI) is one major effort to address technical interoperability among distributed archives. The objective of the OAI is to develop a framework to facilitate the discovery of content in distributed archives. In this paper, we describe our experience and lessons learned in building Arc, the first federated search service based on the OAI protocol. Arc harvests metadata from several OAI-compliant archives, normalizes them, and stores them in a search service based on a relational database (MySQL or Oracle). At present we have over 165K metadata records from 16 data providers in various domains.
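The OAI protocol that Arc builds on is a simple HTTP/XML interface; a minimal harvesting request against an OAI-PMH endpoint might look like the sketch below. The base URL is a placeholder rather than a real archive, and error handling and resumption tokens are omitted.

```python
# A minimal sketch of harvesting Dublin Core records over OAI-PMH, the protocol
# that Arc builds on. The endpoint URL below is a placeholder, not a real archive.
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"          # hypothetical OAI-PMH endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest_titles(base_url):
    """Issue a ListRecords request and yield the dc:title of each returned record."""
    url = f"{base_url}?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    for record in tree.iter(f"{OAI}record"):
        title = record.find(f".//{DC}title")
        if title is not None:
            yield title.text

if __name__ == "__main__":
    for title in harvest_titles(BASE_URL):
        print(title)
```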
Conference Paper
Full-text available
Data Grids are becoming increasingly important in scientific communities for sharing large data collections and for archiving and disseminating them in a digital library framework. The Storage Resource Broker (SRB) provides transparent virtualized middleware for sharing data across distributed, heterogeneous data resources separated by different administrative and security domains. The MySRB is a Web-based interface to the SRB that provides a user-friendly interface to distributed collections brokered by the SRB. In this paper we briefly describe the use of the SRB infrastructure as tools in the data grid architecture for building distributed data collections, digital libraries, and persistent archives. We also provide details about the MySRB and its functionalities.
Conference Paper
Grids tie together distributed storage systems and execution platforms into globally accessible resources. Data Grids provide collection management and global namespaces for organizing data objects that reside within a grid. Knowledge-based grids provide concept spaces for discovering relevant data objects. In a knowledge-based grid, data objects are discovered by mapping from scientific domain concepts, to the attributes used within a data grid collection, to the digital objects residing in an archival storage system. These concepts will be explored in the context of large-scale storage in the web, and illustrated based on infrastructure under development at the San Diego Supercomputer Center.
Article
If digital documents and their programs are to be saved, their migration must not modify their bit streams, because programs and their files can be corrupted by the slightest change. If such changes are unavoidable, they must be reversible without loss. Moreover, one must record enough detail about each transformation to allow reconstruction of the original encoding of the bit stream. Although bit streams can be designed to be immune to any expected change, future migration may introduce unexpected alterations. Similarly, encryption makes it impossible to recover an original bit stream without the decryption key.
Conference Paper
Implicit in the evolution of current device technology and high-end systems is the anticipated arrival of computers capable of a peak performance of 1 Petaflops by the year 2010. This is consistent both with the semiconductor industry’s roadmap of basic device technology development and with an extrapolation of the TOP-500 list of the world’s fastest computers according to the Linpack benchmark. But if contemporary experience with today’s largest systems holds true for their descendants at the end of the decade, then those machines will be very expensive (> $100M), consume too much power (> 3 MWatts), take up too much floor space (> 10,000 square feet), deliver very low efficiency (< 10%), and be too difficult to program. Also important is the likely degradation of reliability due to the multiplicative factors of component MTBF and system scale. Even if such systems do manage to drag the community to the edge of Petaflops, there is no basis for confidence that they will provide the foundation for systems that, across the following decade, must transition through the trans-Petaflops performance regime. It has become increasingly clear that an alternative model of system architecture may be required for future-generation high-end computers.
Conference Paper
The preservation of digital data for the long term presents a variety of challenges, from technical to social and organizational. The technical challenge is to ensure that the information generated today can survive long-term changes in storage media, devices and data formats. This paper presents a novel approach to the problem. It distinguishes between the archiving of data files and the archiving of programs (so that their behavior may be reenacted in the future). For the archiving of a data file, the proposal consists of specifying the processing that needs to be performed on the data (as physically stored) in order to return the information to a future client (according to a logical view of the data). The process specification and the logical view definition are archived with the data. For the archiving of a program behavior, the proposal consists of saving the original executable object code together with the specification of the processing that needs to be performed for each machine instruction of the original computer (emulation). In both cases, the processing specification is based on a Universal Virtual Computer that is general, yet basic enough to remain relevant in the future.
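To convey the flavour of the Universal Virtual Computer idea, namely archiving a program as instructions for a very small, fully specified machine that future systems can re-implement, here is a toy interpreter for an invented three-instruction machine; it illustrates the principle only and is not the UVC's actual instruction set.

```python
# A toy interpreter illustrating the Universal Virtual Computer principle:
# archive data together with a program for a tiny, fully specified machine,
# so a future emulator of that machine can reproduce the original behaviour.
# The three-instruction machine below is invented for illustration only.

def run(program, memory):
    """Execute (op, a, b) instructions against a dict-based memory."""
    pc = 0
    while pc < len(program):
        op, a, b = program[pc]
        if op == "LOAD":       # memory[a] = constant b
            memory[a] = b
        elif op == "ADD":      # memory[a] = memory[a] + memory[b]
            memory[a] += memory[b]
        elif op == "PRINT":    # output memory[a]
            print(memory[a])
        pc += 1
    return memory

# An "archived" program plus its data, replayable on any future implementation.
program = [("LOAD", "x", 40), ("LOAD", "y", 2), ("ADD", "x", "y"), ("PRINT", "x", None)]
run(program, memory={})
```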
Article
The Open Archives Initiative (OAI) develops and promotes interoperability solutions that aim to facilitate the efficient dissemination of content. The roots of the OAI lie in the E-Print community. Over the last year its focus has been extended to include all content providers. This paper describes the recent history of the OAI -- its origins in promoting E-Prints, the broadening of its focus, the details of its technical standard for metadata harvesting, the applications of this standard, and future plans.