Paolo Missier
Newcastle University | NCL · School of Computing Science

Ph.D.

About

209
Publications
37,243
Reads
3,704
Citations
Introduction
ResearchGate is very pushy and has vested itself with an authority in listing people's publications which it simply does not have. By doing so it hurts people's reputations. MY UPDATED LIST OF PUBLICATIONS IS ON MY PERSONAL PAGE: https://sites.google.com/site/paolomissier/. The list of publications on this site is vastly incomplete and I am not going to curate it.
Additional affiliations
January 2011 - May 2016
Newcastle University
Position
  • Reader (Associate Professor)
May 2016 - present
Newcastle University
Position
  • Professor (Full)

Publications

Publications (209)
Article
A data ecosystem (DE) offers a keystone-player or alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. However, despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data...
Conference Paper
COVID-19 related pneumonia requires different modalities of Intensive Care Unit (ICU) interventions at different times to facilitate breathing, depending on severity progression. The ability of clinical staff to predict, on a daily basis, whether patients admitted to hospital will require more or less ICU treatment is critical to ICU management. For rea...
Article
Full-text available
Primary care EHR data are often of clinical importance to cohort studies; however, they require careful handling. Challenges include determining the periods during which EHR data were collected. Participants are typically censored when they deregister from a medical practice; however, cohort studies wish to follow participants longitudinally includin...
Article
Full-text available
Substantial research is available on detecting influencers on social media platforms. In contrast, comparatively few studies exist on the role of online activists, defined informally as users who actively participate in socially-minded online campaigns. Automatically discovering activists who can potentially be approached by organisations that pr...
Preprint
Full-text available
A Data Ecosystem offers a keystone-player or alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. However, despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data-driv...
Conference Paper
Full-text available
Improving machine learning models’ fairness is an active research topic, with most approaches focusing on specific definitions of fairness. In contrast, we propose ParDS, a parametrised data sampling method by which we can optimise the fairness ratios observed on a test set, in a way that is agnostic to both the specific fairness definitions, and t...
Conference Paper
Full-text available
Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models' accuracy and performance, as well as on other properties, such as model fairness. It is therefore important to provide developers with the means to gain an in-depth understanding of how the pipel...
Article
Full-text available
Aims: The aim of this study was to estimate a 48 hour prediction of moderate to severe respiratory failure, requiring mechanical ventilation, in hospitalized patients with COVID-19 pneumonia. Methods: This was an observational prospective study that comprised consecutive patients with COVID-19 pneumonia admitted to hospital from 21 February to 6 Apr...
Chapter
As data science techniques are being applied to solve societal problems, understanding what is happening within the “pipeline” is essential for establishing trust and reproducibility of the results. Provenance captures information about what happened during design and execution in order to support reasoning for trust and reproducibility. However, h...
Article
Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models' accuracy and performance, as well as on other properties, such as model fairness. It is therefore important to provide developers with the means to gain an in-depth understanding of how the pip...
Preprint
BACKGROUND: Between 2013 and 2015, the UK Biobank (UKBB) collected accelerometer traces (AXT) using wrist-worn triaxial accelerometers for 103,712 volunteers aged between 40 and 69, for one week each. This dataset has been used in the past to verify that individuals with chronic diseases exhibit reduced activity levels compared to healthy population...
Article
Full-text available
Background: Between 2013 and 2015, the UK Biobank (UKBB) collected accelerometer traces (AXT) using wrist-worn triaxial accelerometers for 103,712 volunteers aged between 40 and 69, for one week each. This dataset has been used in the past to verify that individuals with chronic diseases exhibit reduced activity levels compared to healthy populati...
Preprint
Full-text available
Concerns about the societal impact of AI-based services and systems have encouraged governments and other organisations around the world to propose AI policy frameworks to address fairness, accountability, transparency and related topics. To achieve the objectives of these frameworks, the data and software engineers who build machine-learning system...
Preprint
Many systems have been developed in recent years to mine logical rules from large-scale Knowledge Graphs (KGs), on the grounds that representing regularities as rules enables both the interpretable inference of new facts, and the explanation of known facts. Among these systems, the walk-based methods that generate the instantiated rules containing...
Conference Paper
Full-text available
The Covid-19 crisis caught health care services around the world by surprise, putting unprecedented pressure on Intensive Care Units (ICU). To help clinical staff to manage the limited ICU capacity, we have developed a Machine Learning model to estimate the probability that a patient admitted to hospital with COVID-19 symptoms would develop severe...
Preprint
Full-text available
Aims: The aim of this study was to estimate a 48 hour prediction of moderate to severe respiratory failure, requiring mechanical ventilation, in hospitalized patients with COVID-19 pneumonia. Methods: This was an observational study that comprised consecutive patients with COVID-19 pneumonia admitted to hospital from 21 February to 6 April 2020. The...
Article
Full-text available
Data provenance is a structured form of metadata designed to record the activities and datasets involved in data production, as well as their dependency relationships. The PROV data model, released by the W3C in 2013, defines a schema and constraints that together provide a structural and semantic foundation for provenance. This enables the interop...
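To give a flavour of what a PROV record looks like in practice, here is a minimal sketch using the Python prov package, a reference implementation of the W3C model; the namespace and identifiers are illustrative, not taken from the paper.

```python
# Minimal PROV sketch (illustrative identifiers, not from the paper):
# a cleaning activity consumes a raw dataset and generates a clean one.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

raw = doc.entity('ex:raw-data')
clean = doc.entity('ex:clean-data')
run = doc.activity('ex:cleaning-run')
alice = doc.agent('ex:alice')

doc.used(run, raw)                 # the activity consumed the raw dataset
doc.wasGeneratedBy(clean, run)     # ...and generated the clean dataset
doc.wasDerivedFrom(clean, raw)     # explicit dependency between datasets
doc.wasAssociatedWith(run, alice)  # attribution to an agent

print(doc.get_provn())             # serialise in PROV-N notation
```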
Preprint
The logic-based methods that learn first-order rules from knowledge graphs (KGs) for the knowledge graph completion (KGC) task are desirable in that the learnt models are inductive, interpretable and transferable. The challenge in such rule learners is that the expressive rules are often buried in a vast rule space, and the procedure of identifying expre...
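As a toy illustration of the rule-mining idea behind such systems (a deliberately naive sketch, not any of the methods discussed in the paper): mine two-hop path rules of the form r1(x,z) & r2(z,y) => r(x,y) from a list of triples, and score each rule by its confidence on the graph.

```python
# Naive path-rule mining over a toy knowledge graph (illustrative only).
from collections import defaultdict

triples = [('anna', 'mother_of', 'bob'),
           ('bob', 'father_of', 'carl'),
           ('anna', 'grandparent_of', 'carl')]
target = 'grandparent_of'

by_head = defaultdict(list)           # x -> [(relation, y), ...]
for x, r, y in triples:
    by_head[x].append((r, y))
linked = {(x, y) for x, r, y in triples if r == target}

support, hits = defaultdict(int), defaultdict(int)
for x in list(by_head):
    for r1, z in by_head[x]:
        for r2, y in by_head.get(z, []):   # two-hop walk x -r1-> z -r2-> y
            support[(r1, r2)] += 1
            if (x, y) in linked:
                hits[(r1, r2)] += 1

for (r1, r2), s in support.items():
    if hits[(r1, r2)]:
        print(f'{r1}(x,z) & {r2}(z,y) => {target}(x,y)  '
              f'conf={hits[(r1, r2)] / s:.2f}')
```

On this toy graph the only rule found is mother_of(x,z) & father_of(z,y) => grandparent_of(x,y), with confidence 1.0; real systems search a vastly larger rule space with pruning.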
Conference Paper
Scientific workflows rely on provenance to be understandable, reproducible and trustworthy. Nowadays, there is a growing demand for interoperability between provenance data generated from heterogeneous workflow management systems. To address this issue, some provenance models have been proposed by extending PROV to support specific requirements of...
Conference Paper
Full-text available
Insights generated from Big Data through analytics processes are often unstable over time and thus lose their value, as the analysis typically depends on elements that change and evolve dynamically. However, the cost of having to periodically "redo" computationally expensive data analytics is not normally taken into account when assessing the bene...
Conference Paper
Full-text available
Improving machine learning models' fairness is an active research topic. Most approaches to it focus on a particular fairness definition. We propose a parametrised training-set-resampling method, which allows optimisation in a manner that is agnostic to both the fairness definition and the classification model. Given a binary protected attribute and a binary label, we...
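A minimal sketch of the general resampling idea, assuming a pandas DataFrame with binary protected-attribute and label columns (hypothetical names; this is not the ParDS method itself):

```python
# Sketch of fairness-oriented resampling: oversample positives in the
# group with the lower positive rate until its rate matches the other
# group's. NOT the paper's ParDS method; column names are hypothetical.
import pandas as pd

def resample_for_parity(df, protected, label, seed=0):
    rates = df.groupby(protected)[label].mean()
    low, high = rates.idxmin(), rates.idxmax()
    p_lo, p_hi = rates[low], rates[high]
    n_lo = int((df[protected] == low).sum())
    # Duplicating k positives in the low-rate group raises its positive
    # rate to (p_lo*n + k) / (n + k); solve for k so it matches p_hi.
    # Assumes p_hi < 1 and at least one positive in the low-rate group.
    k = int(round(n_lo * (p_hi - p_lo) / (1.0 - p_hi)))
    extra = df[(df[protected] == low) & (df[label] == 1)].sample(
        n=k, replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)
```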
Article
Full-text available
Phenotype‐based filtering and prioritization contribute to the interpretation of genetic variants detected in exome sequencing. However, it is currently unclear how extensive this phenotypic annotation should be. In this study, we compare methods for incorporating phenotype into the interpretation process and assess the extent to which phenotypic a...
Article
In the last decade, advances in computing have deeply transformed data processing. Increasingly, systems aim to process massive amounts of data efficiently, often with fast response times; such data are typically characterised by the 4V's, i.e., Volume, Variety, Velocity, and Veracity. While fast data processing is desirable, it is also often the case tha...
Chapter
On social media platforms, and Twitter in particular, specific classes of users such as influencers have been given satisfactory operational definitions in terms of network and content metrics. Others, for instance online activists, are no less important, but their characterisation still requires experimentation. We make the hypothesis that such inter...
Conference Paper
Full-text available
On social media platforms, and Twitter in particular, specific classes of users such as influencers have been given satisfactory operational definitions in terms of network and content metrics. Others, for instance online activists, are no less important, but their characterisation still requires experimentation. We make the hypothesis that such inter...
Preprint
Full-text available
On social media platforms, and Twitter in particular, specific classes of users such as influencers have been given satisfactory operational definitions in terms of network and content metrics. Others, for instance online activists, are no less important, but their characterisation still requires experimentation. We make the hypothesis that such inter...
Article
Full-text available
Data analytics processes such as scientific workflows tend to be executed repeatedly, with varying dependencies and input datasets. The case has been made in the past for tracking the provenance of the final information products through the workflow steps, to enable their reproducibility. In this work, we explore the hypothesis that provenance trac...
Article
Full-text available
Despite recent scientific advances, most rare genetic diseases - including most neuro-muscular diseases - do not currently have curative gene-based therapies available. However, in some cases, such as vitamin, cofactor or enzyme deficiencies, channelopathies and disorders of the neuromuscular junction, a confirmed genetic diagnosis provides guidanc...
Conference Paper
Successful e-science requires control over variations of an experiment, typically encoded as a script or workflow, as well as the ability to transfer existing experiments to other environments in a reproducible way. Variations may be introduced either deliberately or because of inaccurate porting, for instance when the target environment does not s...
Chapter
The PROV data model assumes that entities are immutable and all changes to an entity e are represented by the creation of a new entity e'. This is reasonable for many provenance applications but may produce verbose results once we move towards fine-grained provenance due to the possibility of multiple binds (i.e., variables, elements of dat...
Chapter
Many resource-intensive analytics processes evolve over time following new versions of the reference datasets and software dependencies they use. We focus on scenarios in which any version change has the potential to affect many outcomes, as is the case for instance in high throughput genomics where the same process is used to analyse large cohorts...
Preprint
The value of knowledge assets generated by analytics processes using Data Science techniques tends to decay over time, as a consequence of changes in the elements the process depends on: external data sources, libraries, and system dependencies. For large-scale problems, refreshing those outcomes through greedy re-computation is both expensive and...
Article
We illustrate how combining retrospective and prospective provenance can yield scientifically meaningful hybrid provenance representations of the computational histories of data produced during a script run. We use scripts from multiple disciplines (astrophysics, climate science, biodiversity data curation, and social network analysis), implemented i...
Conference Paper
Full-text available
Many resource-intensive analytics processes evolve over time following new versions of the reference datasets and software dependencies they use. We focus on scenarios in which any version change has the potential to affect many outcomes, as is the case for instance in high throughput genomics where the same process is used to analyse large cohorts...
Preprint
Full-text available
Internet of Things (IoT) data are increasingly viewed as a new form of massively distributed and large scale digital assets, which are continuously generated by millions of connected devices. The real value of such assets can only be realized by allowing IoT data trading to occur on a marketplace that rewards every single producer and consumer, at...
Article
Full-text available
The number of connected devices is quickly approaching 20 billion, thanks to the increasing availability of connectivity technologies and decreasing hardware costs, and driven by a growing ability to mine the generated data. This market could grow even larger if access to devices and their data were open, with a flourishing ecosystem of app...
Chapter
Full-text available
Zika and Dengue are viral diseases transmitted by infected mosquitoes (Aedes aegypti) found in warm, humid environments. Mining data from social networks helps to find locations with highest density of reported cases. Differently from approaches that process text from social networks, we present a new strategy that analyzes Instagram images. We use...
Conference Paper
Full-text available
Scalable and efficient processing of genome sequence data, e.g. for variant discovery, is key to the mainstream adoption of High Throughput technology for disease prevention and for clinical use. Achieving scalability, however, requires a significant effort to enable the parallel execution of the analysis tools that make up the pipelines. This is f...
Preprint
Full-text available
Scalable and efficient processing of genome sequence data, e.g. for variant discovery, is key to the mainstream adoption of High Throughput technology for disease prevention and for clinical use. Achieving scalability, however, requires a significant effort to enable the parallel execution of the analysis tools that make up the pipelines. This is f...
Article
Full-text available
As with general graph processing systems, partitioning data over a cluster of machines improves the scalability of graph database management systems. However, these systems will incur additional network cost during the execution of a query workload, due to inter-partition traversals. Workload-agnostic partitioning algorithms typically minimise the...
Conference Paper
Full-text available
Internet of Things (IoT) data are increasingly viewed as a new form of massively distributed and large scale digital assets, which are continuously generated by millions of connected devices. The real value of such assets can only be realized by allowing IoT data trading to occur on a marketplace that rewards every single producer and consumer, at...
Conference Paper
A pervasive problem in Data Science is that the knowledge generated by possibly expensive analytics processes is subject to decay over time as the data and algorithms used to compute it change, and the external knowledge embodied by reference datasets evolves. Deciding when such knowledge outcomes should be refreshed, following a sequence of data c...
Article
Full-text available
Graph partitioning has long been seen as a viable approach to addressing Graph DBMS scalability. A partitioning, however, may introduce extra query processing latency unless it is sensitive to a specific query workload, and optimised to minimise inter-partition traversals for that workload. Additionally, it should be possible to incrementally adj...
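For flavour, a generic streaming heuristic in the spirit of this line of work (the classic linear-deterministic-greedy rule, not the paper's workload-sensitive algorithm):

```python
# Linear-deterministic-greedy streaming partitioner (generic sketch, not
# the paper's algorithm): place each vertex where most of its already
# placed neighbours live, discounting nearly full partitions.
# Assumes k * capacity >= number of vertices.
def ldg_partition(vertices, neighbours, k, capacity):
    assignment, load = {}, [0] * k

    def score(p, v):
        placed = sum(1 for u in neighbours.get(v, ())
                     if assignment.get(u) == p)
        return placed * (1.0 - load[p] / capacity)

    for v in vertices:
        open_parts = [p for p in range(k) if load[p] < capacity]
        best = max(open_parts, key=lambda p: score(p, v))
        assignment[v] = best
        load[best] += 1
    return assignment

# Example: a 4-cycle split across 2 partitions of capacity 3
g = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
print(ldg_partition([1, 2, 3, 4], g, k=2, capacity=3))
```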
Conference Paper
Full-text available
Tropical diseases like Chikungunya and Zika have come to prominence in recent years as the cause of serious health problems. We explore the hypothesis that monitoring and analysis of social media content streams may effectively complement institutional disease prevention efforts. Specifically, we aim to identify selected members of the public who a...
Article
Full-text available
Tropical diseases like Chikungunya and Zika have come to prominence in recent years as the cause of serious, long-lasting, population-wide health problems. In large countries like Brasil, traditional disease prevention programs led by health authorities have not been particularly effective. We explore the hypothesis that monitorin...
Conference Paper
Full-text available
We illustrate how combining retrospective and prospective provenance can yield scientifically meaningful hybrid provenance representations of the computational histories of data produced during a script run. We use scripts from multiple disciplines (astrophysics, climate science, biodiversity data curation, and social network analysis), implemented...
Article
Full-text available
A pervasive problem in Data Science is that the knowledge generated by possibly expensive analytics processes is subject to decay over time, as the data used to compute it drifts, the algorithms used in the processes are improved, and the external knowledge embodied by reference datasets used in the computation evolves. Deciding when such knowledge...
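A minimal sketch of the underlying intuition, assuming derivations are tracked as a dependency graph (an illustration of the general idea, not the specific method in the paper): a change only forces re-computation of the outcomes reachable from it.

```python
# Selective re-computation sketch: find outcomes transitively derived
# from changed inputs; only those need refreshing. Illustrative names.
from collections import deque

def affected(derived_from, changed):
    """derived_from maps each artefact to the artefacts computed from it."""
    stale, frontier = set(), deque(changed)
    while frontier:
        node = frontier.popleft()
        for child in derived_from.get(node, ()):
            if child not in stale:
                stale.add(child)
                frontier.append(child)
    return stale

# Example: a new reference-database release invalidates two downstream outputs
deps = {'refdb_v2': ['variant_calls'], 'variant_calls': ['patient_report']}
print(affected(deps, ['refdb_v2']))   # {'variant_calls', 'patient_report'}
```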
Conference Paper
As digital objects become increasingly important in people's lives, people may need to examine the provenance, or lineage and history, of an important digital object to understand how it was produced. This is particularly important for objects created from large, multi-source collections of personal data. As the metadata describing provenance,...
Conference Paper
DataONE is a federated data network focusing on earth and environmental science data. We present the provenance and search features of DataONE by means of an example involving three earth scientists who interact through a DataONE Member Node. DataONE provenance systems enable reproducible research and facilitate proper attribution of scientific res...
Conference Paper
Full-text available
Provenance generated by different workflow systems is generally expressed using different formats. This is not an issue when scientists analyze provenance graphs in isolation, or when they use the same workflow system. However, analyzing heterogeneous provenance graphs from multiple systems poses a challenge. To address this problem we adopt P...
Conference Paper
Detecting and preventing outbreaks of mosquito-borne diseases such as Dengue and Zika in Brasil and other tropical regions has long been a priority for governments in affected areas. Streaming social media content, such as Twitter, is increasingly being used for health vigilance applications such as flu detection. However, previous work has not add...
Chapter
This chapter outlines some of the challenges and opportunities associated with adopting provenance principles (Cheney et al., Dagstuhl Reports 2(2):84–113, 2012) and standards (Moreau et al., Web Semant. Sci. Serv. Agents World Wide Web, 2015) in a variety of disciplines, including data publication and reuse, and information sciences.
Article
Full-text available
Detecting and preventing outbreaks of mosquito-borne diseases such as Dengue and Zika in Brasil and other tropical regions has long been a priority for governments in affected areas. Streaming social media content, such as Twitter, is increasingly being used for health vigilance applications such as flu detection. However, previous work has not add...
Article
Full-text available
The cost of deriving actionable knowledge from large datasets has been decreasing thanks to a convergence of positive factors: low cost data generation, inexpensively scalable storage and processing infrastructure (cloud), software frameworks and tools for massively distributed data processing, and parallelisable data analytics algorithms. One obse...
Conference Paper
Full-text available
Partitioning large graphs, in order to balance storage and processing costs across multiple physical machines, is becoming increasingly necessary as the typical scale of graph data continues to increase. A partitioning, however, may introduce query processing latency due to inter-partition communication overhead, especially if the query workload ex...
Article
Full-text available
The ability to measure the use and impact of published data sets is key to the success of the open data / open science paradigm. A direct measure of impact would require tracking data (re)use in the wild, which however is difficult to achieve. This is therefore commonly replaced by simpler metrics based on data download and citation counts. In this...
Article
Dataflow-style workflows offer a simple, high-level programming model for flexible prototyping of scientific applications as an attractive alternative to low-level scripting. At the same time, workflow management systems (WfMS) may support data parallelism over big datasets by providing scalable, distributed deployment and execution of the workflow...
Conference Paper
Full-text available
The cost of deriving actionable knowledge from large datasets has been decreasing thanks to a convergence of positive factors: low cost data generation, inexpensively scalable storage and processing infrastructure (cloud), software frameworks and tools for massively distributed data processing, and parallelisable data analytics algorithms. On...
Book
This book reports on the results of an interdisciplinary and multidisciplinary workshop on provenance that brought together researchers and practitioners from different areas such as archival science, law, information science, computing, forensics and visual analytics that work at the frontiers of new knowledge on provenance. Each of these fields u...