Article

Introduction to the special issue on Data Science for COVID-19

Abstract

This paper introduces the Special Issue on Data Science for COVID-19. It gives a general overview of methods and applications of nonparametric inference and other flexible data science methods for the COVID-19 pandemic. Specifically, some methods that existed before the COVID-19 outbreak are surveyed, followed by an account of survival analysis methods for COVID-related times. Then, several nonparametric tools for estimating certain COVID rates are reviewed, along with the forecasting of the most relevant count series and some other related problems. Within this framework, the papers published in this special issue are briefly discussed.

Article
Full-text available
The incubation period is a key characteristic of an infectious disease. In the outbreak of a novel infectious disease, accurate evaluation of the incubation period distribution is critical for designing effective prevention and control measures. Estimation of the incubation period distribution based on limited information from retrospective inspection of infected cases is highly challenging due to censoring and truncation. In this paper, we consider a semiparametric regression model for the incubation period and propose a sieve maximum likelihood approach for estimation based on the symptom onset time, travel history, and basic demographics of reported cases. The approach properly accounts for the pandemic growth and selection bias in data collection. We also develop an efficient computation method and establish the asymptotic properties of the proposed estimators. We demonstrate the feasibility and advantages of the proposed methods through extensive simulation studies and provide an application to a dataset on the outbreak of COVID-19.
Article
Full-text available
Artificial Intelligence (AI) encompasses various domains such as Machine Learning (ML), Deep Learning (DL), and other cognitive technologies that have been widely applied in the healthcare sector. There, AI models are used to investigate input data and make decisions based on prediction and classification. With this motivation, the current study presents the design of a Metaheuristic Optimization with Kernel Extreme Learning Machine for COVID-19 Prediction Model on Epidemiology Dataset, named the MOKELM-CPED technique. The primary aim of the MOKELM-CPED model is to achieve effective COVID-19 classification using an epidemiology dataset. In the proposed model, the data first undergo pre-processing to transform the medical data into a useful format. Next, the data are classified with a Kernel Extreme Learning Machine (KELM) model. Finally, the Symbiotic Organism Search (SOS) optimization algorithm is used to fine-tune the KELM parameters, which helps to achieve high detection efficiency. To investigate the classification performance of the MOKELM-CPED model, a comprehensive experimental analysis was conducted and the results were inspected under diverse aspects. The outcomes of the experiments indicate enhanced performance of the proposed method over recent approaches under distinct measures.
Article
Full-text available
We consider a retrospective modelling approach for estimating effective reproduction numbers based on death counts during the first year of the COVID-19 pandemic in Germany. The proposed Bayesian hierarchical model incorporates splines to estimate reproduction numbers flexibly over time while adjusting for varying effective infection fatality rates. The approach also provides estimates of dark figures regarding undetected infections. Results for Germany illustrate that our estimates based on death counts are often similar to classical estimates based on confirmed cases; however, considering death counts allows to disentangle effects of adapted testing policies from transmission dynamics. In particular, during the second wave of infections, classical estimates suggest a flattening infection curve following the “lockdown light” in November 2020, while our results indicate that infections continued to rise until the “second lockdown” in December 2020. This observation is associated with more stringent testing criteria introduced concurrently with the “lockdown light”, which is reflected in subsequently increasing dark figures of infections estimated by our model. In light of progressive vaccinations, shifting the focus from modelling confirmed cases to reported deaths with the possibility to incorporate effective infection fatality rates might be of increasing relevance for the future surveillance of the pandemic.
Article
Full-text available
With the outbreak of COVID-19 in spring 2020, small, medium, and large companies were forced to cope with the unexpected circumstances. Faced by this health emergency, it was necessary to ensure that staff remained motivated and that they could continue to carry out their duties despite the obstacles. The main goal of this exploratory research was to characterize employees who teleworked and who did not, and their motivation during the lockdown. A total of 11,779 workers from different-sized companies in various sectors answered an ad hoc questionnaire. By using non-parametric comparisons and Classification and Regression Trees (CRTs), the results show differences in both the assessment of strategies put into practice by the companies and the level of motivation of teleworkers and non-teleworkers, with the latter being more highly motivated. Nonetheless, teleworkers assessed their companies’ strategies and the role of their managers and colleagues more positively. This research helps to understand how different sectors have dealt with the crisis, according to the degree of teleworking implemented in each sector, and to what extent the motivation of the employees has been affected. The analysis of the large amount of data obtained confirms the importance of the role of managers in sustaining the motivation of their subordinates in times of crisis. In this sense, it is necessary to develop managers’ competencies in order to develop and maintain relations of trust and support with their coworkers. On the other hand, it is necessary to foster employees’ sense of meaningfulness and responsibility at work in order to keep them motivated.
Article
Full-text available
Background: SARS-CoV-2 is among the most problematic viruses of this century. It has caused extensive economic, social, and health damage worldwide. Nowadays, coronavirus disease 2019 (COVID-19) is the most dangerous threat to human survival. Therefore, this study aimed to investigate factors associated with the survival of Iranian patients with SARS-CoV-2. Methods: This retrospective hospital-based cohort study was conducted on 870 COVID-19 patients with blood oxygen levels of less than 93%. Cox regression and a mixture cure model were used and compared to analyze the patients' survival. It is worth noting that no similar study has previously been conducted using mixture cure regression to model the survival of Iranian patients with COVID-19. Results: The cure rate and median survival time were 81.5% and 20 days, respectively. Cox regression identified that respiratory distress, history of heart disease and hypertension, and older age increased the hazard. The incidence part of the mixture cure model revealed that respiratory distress, history of hypertension, diabetes and cardiovascular diseases (CVDs), cough, fever, and older age reduced the cure odds; the latency part revealed that respiratory distress, history of hypertension and CVDs, and older age increased the hazard. Conclusion: The findings of our study revealed that priority should be given to older patients with a history of diabetes, hypertension, and CVDs in receiving intensive care and immunization. Also, the lower cure odds for patients with respiratory distress, fever, and cough favor early hospitalization before the appearance of severe symptoms.
Article
Full-text available
Short-term forecasting of the dynamics of coronavirus disease 2019 (COVID-19) in the period up to its decline following mass vaccination was a task that received much attention but proved difficult to do with high accuracy. However, the availability of standardized forecasts and versioned datasets from this period allows for continued work in this area. Here, we introduce the Gaussian infection state space with time dependence (GISST) forecasting model. We evaluate its performance in one to four weeks ahead forecasts of COVID-19 cases, hospital admissions and deaths in the state of California made with official reports of COVID-19, Google’s mobility reports and vaccination data available each week. Evaluation of these forecasts with a weighted interval score shows them to consistently outperform a naive baseline forecast and often score closer to or better than a high-performing ensemble forecaster. The GISST model also provides parameter estimates for a compartmental model of COVID-19 dynamics, includes a regression submodel for the transmission rate and allows for parameters to vary over time according to a random walk. GISST provides a novel, balanced combination of computational efficiency, model interpretability and applicability to large multivariate datasets that may prove useful in improving the accuracy of infectious disease forecasts.
Article
Full-text available
A sentinel network, Obépine, has been designed to monitor SARS-CoV-2 viral load in wastewaters arriving at wastewater treatment plants (WWTPs) in France as an indirect macro-epidemiological parameter. The sources of uncertainty in such a monitoring system are numerous, and the concentration measurements it provides are left-censored and contain outliers, which biases the results of usual smoothing methods. Hence, the need for an adapted pre-processing in order to evaluate the real daily amount of viruses arriving at each WWTP. We propose a method based on an auto-regressive model adapted to censored data with outliers. Inference and prediction are produced via a discretized smoother which makes it a very flexible tool. This method is both validated on simulations and real data from Obépine. The resulting smoothed signal shows a good correlation with other epidemiological indicators and is currently used by Obépine to provide an estimate of virus circulation over the watersheds corresponding to about 200 WWTPs.
Article
Full-text available
During a pandemic, data are very “noisy” with enormous amounts of local variation in daily counts, compared with any rapid changes in trend. Accurately characterizing the trends and reliable predictions on future trajectories are important for planning and public situation awareness. We describe a semi-parametric statistical model that is used for short-term predictions of daily counts of cases and deaths due to COVID-19 in Canada, which are routinely disseminated to the public by Public Health Agency of Canada. The main focus of the paper is the presentation of the model. Performance indicators of our model are defined and then evaluated through extensive sensitivity analyses. We also compare our model with other commonly used models such as generalizations of logistic models for similar purposes. The proposed model is shown to describe the historical trend very well with excellent ability to predict the short-term trajectory.
Article
Full-text available
Background When dealing with recurrent events in observational studies it is common to include subjects who became at risk before follow-up. This phenomenon is known as left censoring, and simply ignoring these prior episodes can lead to biased and inefficient estimates. We aimed to propose a statistical method that performs well in this setting. Methods Our proposal was based on the use of models with specific baseline hazards. In this, the number of prior episodes were imputed when unknown and stratified according to whether the subject had been at risk of presenting the event before t = 0. A frailty term was also used. Two formulations were used for this “Specific Hazard Frailty Model Imputed” based on the “counting process” and “gap time.” Performance was then examined in different scenarios through a comprehensive simulation study. Results The proposed method performed well even when the percentage of subjects at risk before follow-up was very high. Biases were often below 10% and coverages were around 95%, being somewhat conservative. The gap time approach performed better with constant baseline hazards, whereas the counting process performed better with non-constant baseline hazards. Conclusions The use of common baseline methods is not advised when knowledge of prior episodes experienced by a participant is lacking. The approach in this study performed acceptably in most scenarios in which it was evaluated and should be considered an alternative in this context. It has been made freely available to interested researchers as R package miRecSurv.
Article
Full-text available
Detecting changes in COVID-19 disease transmission over time is a key indicator of epidemic growth. Near real-time monitoring of the pandemic growth is crucial for policy makers and public health officials who need to make informed decisions about whether to enforce lockdowns or allow certain activities. The effective reproduction number $R_t$ is the standard index used in many countries for this goal. However, it is known that due to the delays between infection and case registration, its use for decision making is somewhat limited. In this paper a near real-time COVINDEX is proposed for monitoring the evolution of the pandemic. The index is computed from predictions obtained from a GAM beta regression for modelling the test positive rate as a function of time. The proposal is illustrated using data on COVID-19 pandemic in Italy and compared with $R_t$. A simple chart is also proposed for monitoring local and national outbreaks by policy makers and public health officials.
Article
Full-text available
In epidemics many interesting quantities, like the reproduction number, depend on the incubation period (time from infection to symptom onset) and/or the generation time (time until a new person is infected from another infected person). Therefore, estimation of the distribution of these two quantities is of distinct interest. However, this is a challenging problem since it is normally not possible to obtain precise observations of these two variables. Instead, in the beginning of a pandemic, it is possible to observe for transmission pairs the time of symptom onset for both people as well as a window for infection of the first person (e.g. because of travel to a risk area). In this paper we suggest a simple semi-parametric sieve-estimation method based on Laguerre-Polynomials for estimation of these distributions. We provide detailed theory for consistency and illustrate the finite sample performance for small datasets via a simulation study.
Article
Full-text available
Time-to-event data are right-truncated if only individuals who have experienced the event by a certain time can be included in the sample. For example, we may be interested in estimating the distribution of time from onset of disease symptoms to death and only have data on individuals who have died. This may be the case, for example, at the beginning of an epidemic. Right truncation causes the distribution of times to event in the sample to be biased towards shorter times compared to the population distribution, and appropriate statistical methods should be used to account for this bias. This article is a review of such methods, particularly in the context of an infectious disease epidemic, like COVID-19. We consider methods for estimating the marginal time-to-event distribution, and compare their efficiencies. (Non-)identifiability of the distribution is an important issue with right-truncated data, particularly at the beginning of an epidemic, and this is discussed in detail. We also review methods for estimating the effects of covariates on the time to event. An illustration of the application of many of these methods is provided, using data on individuals who had died with coronavirus disease by 5 April 2020.
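The bias described in this abstract can be illustrated with a small simulation. This is only a sketch under assumed settings, not a method from the reviewed paper: the exponential onset-to-death distribution, its 18-day mean, the 30-day observation window and the function name `simulate_right_truncation` are all hypothetical choices made for illustration.

```python
import random


def simulate_right_truncation(n=20000, mean_delay=18.0, window=30.0, seed=0):
    """Return (population_mean, truncated_sample_mean) of onset-to-death times.

    All settings are illustrative assumptions, not estimates from real data.
    """
    rng = random.Random(seed)
    all_delays, observed_delays = [], []
    for _ in range(n):
        onset = rng.uniform(0.0, window)            # symptom onset day
        delay = rng.expovariate(1.0 / mean_delay)   # onset-to-death time
        all_delays.append(delay)
        # Right truncation: a case enters the sample only if the death
        # has already happened by the analysis date (day `window`).
        if onset + delay <= window:
            observed_delays.append(delay)
    population_mean = sum(all_delays) / len(all_delays)
    truncated_mean = sum(observed_delays) / len(observed_delays)
    return population_mean, truncated_mean


pop_mean, trunc_mean = simulate_right_truncation()
print(f"population mean: {pop_mean:.1f} days, truncated sample mean: {trunc_mean:.1f} days")
```

The mean in the truncated sample comes out clearly shorter than the population mean, which is the bias that the reviewed estimation methods are designed to correct.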
Article
Full-text available
In this paper, a fuzzy clustering model for multivariate time series based on the quantile cross-spectral density and principal component analysis is extended by including: (i) a weighting system which assigns a weight to each principal component in accordance with its importance concerning the underlying clustering structure and (ii) a penalization term allowing to take into account the spatial information. The iterative solutions of the new model, which employs the exponential distance in order to gain robustness against outlying series, are derived. A simulation study shows that the weighting system substantially enhances the effectiveness of the former approach. The behaviour of the extended model in terms of the spatial penalization term is also analysed. An application involving multivariate time series of mobility indicators concerning COVID-19 pandemic highlights the usefulness of the proposed technique.
Article
Full-text available
The quantification of the SARS-CoV-2 RNA load in wastewater has emerged as a useful tool to monitor COVID-19 outbreaks in the community. This approach was implemented in the metropolitan area of A Coruña (NW Spain), where wastewater from a treatment plant was analyzed to track the epidemic dynamics in a population of 369,098 inhabitants. The viral load detected in the wastewater and the epidemiological data from the A Coruña health system served as the main sources for developing statistical models. The regression models described here allowed us to estimate the number of infected people (R² = 0.9), including symptomatic and asymptomatic individuals. These models have helped to understand the real magnitude of the epidemic in a population at any given time and have been used as an effective early warning tool for predicting outbreaks in A Coruña municipality. The methodology of the present work could be used to develop a similar wastewater-based epidemiological model to track the evolution of the COVID-19 epidemic anywhere in the world where centralized water-based sanitation systems exist.
Article
Full-text available
Background The COVID-19 pandemic has initiated several initiatives to better understand its behavior, and some projects are monitoring its evolution across countries, which naturally leads to comparisons made by those using the data. However, most “at a glance” comparisons may be misleading because the curve that should explain the evolution of COVID-19 is different across countries, as a result of the underlying geopolitical or socio-economic characteristics. Therefore, this paper contributes to the scientific endeavour by creating a new evaluation framework to help stakeholders adequately monitor and assess the evolution of COVID-19 in countries, considering the occurrence of spikes, "secondary waves" and structural breaks in the time series. Methods Generalized Additive Models were used to model cumulative and daily curves for confirmed cases and deaths. The Root Relative Squared Error and the Percentage Deviance Explained measured how well the models fit the data. A local min-max function was used to identify all local maxima in the fitted values. The pure Markov-Switching and the family of Markov-Switching GARCH models were used to identify structural breaks in the COVID-19 time series. Finally, a quadrants system to identify countries that are more/less efficient in the short/long term in controlling the spread of the virus and the number of deaths was developed. Such methods were applied in the time series of 189 countries, collected from the Centre for Systems Science and Engineering at Johns Hopkins University. Results Our methodology proves more effective in explaining the evolution of COVID-19 than growth functions worldwide, in addition to standardizing the entire estimation process in a single type of function. Besides, it highlights several inflection points and regime-switching moments, as a consequence of people’s diminished commitment to fighting the pandemic. 
Although Europe is the most developed continent in the world, it is home to most countries with an upward trend and considered inefficient, for confirmed cases and deaths. Conclusions The new outcomes presented in this research will allow key stakeholders to check whether or not public policies and interventions in the fight against COVID-19 are having an effect, easily identifying examples of best practices and promote such policies more widely around the world.
Article
Full-text available
The COVID-19 pandemic has affected all countries in the world and brings a major disruption in our daily lives. Estimation of the prevalence and contagiousness of COVID-19 infections may be challenging due to under-reporting of infected cases. For a better understanding of such pandemic in its early stages, it is crucial to take into consideration unreported infections. In this study we propose a truncation model to estimate the under-reporting probabilities for infected cases. Hypothesis testing on the differences in truncation probabilities, that are related to the under-reporting rates, is implemented. Large sample results of the hypothesis test are presented theoretically and by means of simulation studies. We also apply the methodology to COVID-19 data in certain countries, where under-reporting probabilities are expected to be high.
Article
Full-text available
This paper deals with an important subject in classification problems addressed by machine learning techniques: the evaluation of the influence of each of the features on the classification of individuals. Specifically, a measure of that influence is introduced using the Shapley value of cooperative games. In addition, an axiomatic characterisation of the proposed measure is provided based on properties of efficiency and balanced contributions. Furthermore, some experiments have been designed in order to validate the appropriate performance of such measure. Finally, the methodology introduced is applied to a sample of COVID-19 patients to study the influence of certain demographic or risk factors on various events of interest related to the evolution of the disease.
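For readers unfamiliar with the Shapley value mentioned in this abstract, the following sketch computes the classical Shapley value of a small cooperative game by averaging marginal contributions over all player orderings. It is not the influence measure proposed in the paper; the feature names and the toy value function `toy_value` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical worth of a feature set (e.g. a gain in classifier accuracy)."""
    worth = 0.0
    if "age" in coalition:
        worth += 0.2
    if "sex" in coalition:
        worth += 0.1
    if {"age", "comorbidity"} <= coalition:
        worth += 0.3  # interaction that only pays off when both are present
    return worth


features = ["age", "sex", "comorbidity"]
phi = shapley_values(features, toy_value)
print(phi)
```

In this toy game the interaction term is shared equally between "age" and "comorbidity", and the values add up to the worth of the full feature set, which is the efficiency property named in the abstract. Enumerating all orderings is only feasible for a handful of features.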
Article
Full-text available
Background: COVID-19 variants are emerging across the globe, causing ongoing pandemics. It is important to estimate the case fatality ratio (CFR) during such an epidemic of a potentially fatal disease. Methods: First, we applied a non-parametric approach to obtain odds ratios with corresponding confidence intervals (CIs), and illustrated relative risks and cumulative mortality rates using COVID-19 data from Spain. We demonstrated a modified non-parametric approach based on the Kaplan-Meier (KM) technique using COVID-19 data from Italy. We also assessed the significance of patient characteristics for outcome by age for both genders. Furthermore, we applied a non-parametric cure model using Nadaraya-Watson weights to estimate the cure rate using data from Israel. Simulations were run in R. Results: The analytical illustrations of these approaches predict the effects of patients based on covariates in different scenarios. Sex differences increase from ages below 60 years to 60-69 years but decrease thereafter, with the smallest sex difference at age 80 years, for both the relative risk (RR) and the odds ratio (OR). The non-parametric approach gives cure rates ranging from approximately 5.3% to 9% for males and from 4% to 7% for females. The modified KM estimator handles such censored data and detects changes in CFR more rapidly by both gender and age. Conclusion: Older age, male sex, number of comorbidities and access to timely health care are identified as some of the risk factors associated with COVID-19 mortality in Spain. The non-parametric approach investigated the influence of covariates on the models and provides the effect by both gender and age. 
The health impact of public for inaccurate estimates, inconsistent intelligence, conflicting messages, or resulting in misinformation can increase awareness among people and also induce panic situations that accompany major outbreaks of COVID-19.
Article
Full-text available
With the spread of the novel coronavirus disease 2019 (COVID-19) around the world, the estimation of the incubation period of COVID-19 has become a hot issue. Based on the doubly interval-censored data model, we assume that the incubation period follows lognormal and Gamma distribution, and estimate the parameters of the incubation period of COVID-19 by adopting the maximum likelihood estimation, expectation maximization algorithm and a newly proposed algorithm (expectation mostly conditional maximization algorithm, referred as ECIMM). The main innovation of this paper lies in two aspects: Firstly, we regard the sample data of the incubation period as the doubly interval-censored data without unnecessary data simplification to improve the accuracy and credibility of the results; secondly, our new ECIMM algorithm enjoys better convergence and universality compared with others. With the framework of this paper, we conclude that 14-day quarantine period can largely interrupt the transmission of COVID-19, however, people who need specially monitoring should be isolated for about 20 days for the sake of safety. The results provide some suggestions for the prevention and control of COVID-19. The newly proposed ECIMM algorithm can also be used to deal with the doubly interval-censored data model appearing in various fields.
Article
Full-text available
A short introduction to survival analysis and censored data is included in this paper. A thorough literature review in the field of cure models has been done. An overview on the most important and recent approaches on parametric, semiparametric and nonparametric mixture cure models is also included. The main nonparametric and semiparametric approaches were applied to a real time dataset of COVID-19 patients from the first weeks of the epidemic in Galicia (NW Spain). The aim is to model the elapsed time from diagnosis to hospital admission. The main conclusions, as well as the limitations of both the cure models and the dataset, are presented, illustrating the usefulness of cure models in this kind of studies, where the influence of age and sex on the time to hospital admission is shown.
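As background for the cure models surveyed in this abstract, the sketch below computes a plain Kaplan-Meier curve for right-censored times. The level at which the curve flattens after the last observed event is commonly used as a simple nonparametric estimate of the cure probability (here, the fraction never admitted to hospital). The data are invented for illustration, and the covariate-adjusted estimators discussed in the paper are not reproduced.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    times  : follow-up time of each subject
    events : 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Hypothetical days from diagnosis to hospital admission (1) or censoring (0).
toy_times = [2, 3, 3, 5, 8, 10, 12, 12]
toy_events = [1, 1, 0, 1, 0, 1, 0, 0]
km_curve = kaplan_meier(toy_times, toy_events)
plateau = km_curve[-1][1]  # crude estimate of the never-admitted fraction
print(km_curve, plateau)
```

Reading the plateau as a cure probability is only sensible when follow-up is long enough for essentially all susceptible subjects to have experienced the event.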
Article
Full-text available
Understanding the SARS-CoV-2 dynamics has been the subject of intense research in recent months. In particular, accurate modeling of lockdown effects on human behaviour and epidemic evolution is a key issue, e.g. to inform health-care decisions on emergency management. In this regard, the compartmental and spatial models proposed so far use parametric descriptions of the contact rate, often assuming a time-invariant effect of the lockdown. In this paper we show that these assumptions may lead to erroneous evaluations of the ongoing pandemic. Thus, we develop a new class of nonparametric compartmental models able to describe how the impact of the lockdown varies in time. Our estimation strategy does not require significant Bayes prior information and exploits regularization theory. Hospitalized data are mapped into an infinite-dimensional space, hence obtaining a function which also takes into account how social distancing measures and people’s growing awareness of the risk of infection evolve as time progresses. This also makes it possible to reconstruct a continuous-time profile of the SARS-CoV-2 reproduction number with a resolution never reached before in the literature. When applied to data collected in Lombardy, the most affected Italian region, our model illustrates how people’s behaviour changed during the restrictions and its importance in containing the epidemic. Results also indicate that, at the end of the lockdown, around 12% of people in Lombardy and 5% in Italy were affected by SARS-CoV-2, with the fatality rate being 1.14%. Then, we discuss how the situation evolved after the end of the lockdown, showing that the reproduction number dangerously increased in the summer, due to holiday relaxation, reaching values larger than one on August 1, 2020. Finally, we also document how Italy faced the second wave of infection in the last part of 2020. 
Since several countries still observe a growing epidemic and others could be subject to other waves, the proposed reproduction number tracking methodology can be of great help to health care authorities to prevent SARS-CoV-2 diffusion or to assess the impact of lockdown restrictions on human behaviour to contain the spread.
Article
Full-text available
Estimating the lengths-of-stay (LoS) of hospitalised COVID-19 patients is key for predicting the hospital beds' demand and planning mitigation strategies, as overwhelming the healthcare systems has critical consequences for disease mortality. However, accurately mapping the time-to-event of hospital outcomes, such as the LoS in the intensive care unit (ICU), requires understanding patient trajectories while adjusting for covariates and observation bias, such as incomplete data. Standard methods, such as the Kaplan-Meier estimator, require prior assumptions that are untenable given current knowledge. Using real-time surveillance data from the first weeks of the COVID-19 epidemic in Galicia (Spain), we aimed to model the time-to-event and event probabilities of patients' hospitalised, without parametric priors and adjusting for individual covariates. We applied a non-parametric mixture cure model and compared its performance in estimating hospital ward (HW)/ICU LoS to the performances of commonly used methods to estimate survival. We showed that the proposed model outperformed standard approaches, providing more accurate ICU and HW LoS estimates. Finally, we applied our model estimates to simulate COVID-19 hospital demand using a Monte Carlo algorithm. We provided evidence that adjusting for sex, generally overlooked in prediction models, together with age is key for accurately forecasting HW and ICU occupancy, as well as discharge or death outcomes.
Article
Full-text available
Containment strategies to combat epidemics such as SARS-CoV-2/COVID-19 require the availability of epidemiological parameters, e.g., the effective reproduction number. Parametric models such as the commonly used susceptible-infected-removed (SIR) compartment models fitted to observed incidence time series have limitations due to the time-dependency of the parameters. Furthermore, fatalities are delayed with respect to the counts of new cases, and the reproduction cycle leads to periodic patterns in incidence time series. Therefore, based on comprehensible nonparametric methods including time-delay correlation analyses, estimates of crucial parameters that characterise the COVID-19 pandemic with a focus on the German epidemic are presented using publicly available time-series data on prevalence and fatalities. The estimates for Germany are compared with the results for seven other countries (France, Italy, the United States of America, the United Kingdom, Spain, Switzerland, and Brazil). The duration from diagnosis to death resulting from delay-time correlations turns out to be 13 days with high accuracy for Germany and Switzerland. For the other countries, the time-to-death durations have wider confidence intervals. With respect to the German data, the two time series of new cases and fatalities exhibit a strong coherence. Based on the time lag between diagnoses and deaths, properly delayed asymptotic as well as instantaneous fatality–case ratios are calculated. The temporal median of the instantaneous fatality–case ratio with time lag of 13 days between cases and deaths for Germany turns out to be 0.02. Time courses of asymptotic fatality–case ratios are presented for other countries, which substantially differ during the first half of the pandemic but converge to a narrow range with standard deviation 0.0057 and mean 0.024. 
Similar results are obtained from comparing time courses of instantaneous fatality–case ratios with optimal delay for the 8 exemplarily chosen countries. The basic reproduction number, R0, for Germany is estimated to be between 2.4 and 3.4 depending on the generation time, which is estimated based on a delay autocorrelation analysis. Resonances at about 4 days and 7 days are observed, partially attributable to weekly periodicity of sampling. The instantaneous (time-dependent) reproduction number is estimated from the incident (counts of new) cases, thus allowing us to infer the temporal behaviour of the reproduction number during the epidemic course. The time course of the reproduction number turns out to be consistent with the time-dependent per capita growth.
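The time-delay correlation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of the Pearson correlation as the dependence measure, and the synthetic series (deaths built as 2% of cases shifted by 13 days, chosen only to echo the figures quoted above) are all assumptions made for illustration.

```python
import math
from statistics import mean, median

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def best_lag(cases, deaths, max_lag):
    """Lag (in days) at which deaths correlate most strongly with earlier cases."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair cases on day t with deaths on day t + lag.
        scores[lag] = pearson(cases[: len(cases) - lag], deaths[lag:])
    return max(scores, key=scores.get), scores

def lagged_ratio(cases, deaths, lag):
    """Temporal median of deaths[t + lag] / cases[t] (a lagged fatality-case ratio)."""
    return median(d / c for c, d in zip(cases, deaths[lag:]) if c > 0)

# Synthetic data (assumption): one epidemic wave, deaths = 2% of cases 13 days earlier.
cases = [100 * math.exp(-((t - 30) / 8) ** 2) for t in range(90)]
deaths = [0.0] * 13 + [0.02 * c for c in cases[:-13]]

lag, _ = best_lag(cases, deaths, max_lag=30)
ratio = lagged_ratio(cases, deaths, lag)
```

On real surveillance data the correlation peak is broader and is affected by weekly reporting cycles, so the selected lag should be read together with its uncertainty rather than as a point value.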
Article
Full-text available
To date, official data on the number of people infected with SARS-CoV-2, the virus responsible for Covid-19, have been released by the Italian Government solely on the basis of a non-representative sample of the population that tested positive on a swab. However, a reliable estimate of the number of infected people, including asymptomatic ones, is crucial for preparing operational schemes and for estimating the number of people who will require, to different extents, medical attention. To overcome the current data shortcoming, this article proposes a bootstrap-driven estimation procedure for the number of people infected with SARS-CoV-2. The method is designed to be robust, automatic and suitable for generating estimates at the regional level. The results show that, while official data as of 12 March report 12,839 cases in Italy, the number of people infected with SARS-CoV-2 could be as high as 105,789.
Article
Full-text available
In the spreading of infectious diseases, an important number to determine is how many other people will be infected on average by anyone who has become infected themselves. This is known as the reproduction number. This paper describes a non-parametric inverse method for extracting the full transfer function of infection, of which the reproduction number is the integral. The method is demonstrated by applying it to the timeline of hospitalisation admissions for covid-19 in the Netherlands up to May
Article
Full-text available
Objectives: The COVID-19 pandemic (caused by SARS-CoV-2) has introduced significant challenges for accurate prediction of population morbidity and mortality by traditional variable-based methods of estimation. Challenges to modelling include inadequate viral physiology comprehension and fluctuating definitions of positivity between national-to-international data. This paper proposes that accurate forecasting of COVID-19 caseload may be best performed non-parametrically, by vector autoregression (VAR) of verifiable data regionally. Methods: A non-linear VAR model across 7 major demographically representative New York City (NYC) metropolitan region counties was constructed using verifiable daily COVID-19 caseload data from March 12 to July 23, 2020. Through association of observed case trends with a series of (county-specific) data-driven dynamic interdependencies (lagged values), a systematically non-assumptive approximation of VAR representation for COVID-19 patterns to-date and prospective upcoming trends was produced. Results: Modified VAR regression of NYC area COVID-19 caseload trends shows highly significant capacity to model observed patterns in longitudinal disease incidence (county R² range: 0.9221-0.9751, all p < 0.001). Predictively, VAR regression of daily caseload results at a county-wide level demonstrates considerable short-term forecasting fidelity (p < 0.001 at one-step ahead) with concurrent capacity for longer-term (tested 11-week period) inferences of consistent, reasonable upcoming patterns from latest (model data update) disease epidemiology. Conclusions: In contrast to macroscopic variable-assumption projections, regionally-founded VAR modelling may substantially improve projection of short-term community disease burden, reduce potential for biostatistical error, as well as better model epidemiological effects resultant from intervention. 
Predictive VAR extrapolation of existing public health data at an interdependent regional scale may improve accuracy of current pandemic burden prognoses.
Article
Full-text available
It is important to forecast the risk of COVID-19 symptom onset and thereby evaluate how effectively the city lockdown measure could reduce this risk. This study is the first comprehensive, high-resolution investigation of spatiotemporal heterogeneities in the effect of the Wuhan lockdown on the risk of COVID-19 symptom onset in all 347 Chinese cities. An extended Weight Kernel Density Estimation model was developed to predict the COVID-19 onset risk under two scenarios (i.e., with and without the Wuhan lockdown). Compared with the scenario without lockdown implementation, the Wuhan lockdown in general delayed the arrival of the COVID-19 onset risk peak by 1–2 days and lowered risk peak values in all cities. The decrease in onset risk attributed to the lockdown was more than 8% in over 40% of Chinese cities, and up to 21.3% in some cities. Lockdown was most effective in areas with medium risk before lockdown.
Article
Full-text available
This study aims to locate the knots in the cumulative coronavirus disease 2019 (COVID-19) case numbers during the first-level response to the public health emergency in the provinces of China except Hubei. The provinces were grouped into three regions, namely eastern, central and western provinces, and the trends between adjacent knots were compared among the three regions. COVID-19 case numbers, migration scale index, Baidu index, and demographic, economic and public health resource data were collected from 22 Chinese provinces from 19 January 2020 to 12 March 2020. Spline regression was applied to the data of all included, eastern, central and western provinces. The research period was divided into three stages by two knots. The first stage (from 19 January to around 25 January) was similar among the three regions. However, in the second stage, growth of the COVID-19 case number was flatter and lasted longer in western provinces (from 25 January to 18 February) than in eastern and central provinces (from 26 January to around 11 February). In the third stage, the growth of the COVID-19 case number slowed down in all three regions. The included covariates differed among the three regions. Overall, spline regression with covariates showed different change patterns in eastern, central and western provinces, which provided better insight into the regional characteristics of the COVID-19 pandemic.
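The idea of dividing a case curve into three stages with two knots can be sketched as follows. This is a simplified illustration rather than the study's model: it assumes a linear spline in a truncated-power basis, no covariates, ordinary least squares, and a brute-force grid search over candidate knot pairs; all function names and the synthetic series are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def basis(t, knots):
    """Linear truncated-power basis: 1, t, (t - k1)+, (t - k2)+."""
    return [1.0, t] + [max(t - k, 0.0) for k in knots]

def fit_sse(ts, ys, knots):
    """Residual sum of squares of the least-squares linear spline with given knots."""
    rows = [basis(t, knots) for t in ts]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2 for r, y in zip(rows, ys))

def locate_knots(ts, ys):
    """Pick the pair of knots (taken from the observed time points) with the smallest SSE."""
    candidates = [(k1, k2) for k1 in ts[2:-4] for k2 in ts[4:-2] if k2 >= k1 + 2]
    return min(candidates, key=lambda ks: fit_sse(ts, ys, ks))

# Synthetic series (assumption): slow start, fast growth from day 6, near-plateau from day 20.
ts = [float(t) for t in range(40)]
ys = [5 + 2 * t + 10 * max(t - 6, 0.0) - 11 * max(t - 20, 0.0) for t in ts]
knots = locate_knots(ts, ys)
```

With noisy data the SSE surface is flatter, so nearby knot pairs can fit almost equally well; a penalised criterion or covariate adjustment, as in the study, would then matter more than in this noise-free example.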
Article
Full-text available
Background The prognosis of patients with Covid-19 infection is uncertain. We derived and validated a new risk model for predicting progression to disease severity, hospitalization, admission to intensive care unit (ICU) and mortality in patients with Covid-19 infection (Gal-Covid-19 scores). Methods This is a retrospective cohort study of patients with Covid-19 infection confirmed by reverse transcription polymerase chain reaction (RT-PCR) in Galicia, Spain. Data were extracted from electronic health records of patients, including age, sex and comorbidities according to International Classification of Primary Care codes (ICPC-2). Logistic regression models were used to estimate the probability of disease severity. Calibration and discrimination were evaluated to assess model performance. Results The incidence of infection was 0.39% (10 454 patients). A total of 2492 patients (23.8%) required hospitalization, 284 (2.7%) were admitted to the ICU and 544 (5.2%) died. The variables included in the models to predict severity included age, gender and chronic comorbidities such as cardiovascular disease, diabetes, obesity, hypertension, chronic obstructive pulmonary disease, asthma, liver disease, chronic kidney disease and haematological cancer. The models demonstrated a fair–good fit for predicting hospitalization {AUC [area under the receiver operating characteristics (ROC) curve] 0.77 [95% confidence interval (CI) 0.76, 0.78]}, admission to ICU [AUC 0.83 (95%CI 0.81, 0.85)] and death [AUC 0.89 (95%CI 0.88, 0.90)]. Conclusions The Gal-Covid-19 scores provide risk estimates for predicting severity in Covid-19 patients. The ability to predict disease severity may help clinicians prioritize high-risk patients and facilitate the decision making of health authorities.
Article
Full-text available
As multifactorial and chronic diseases, cancers are among these pathologies for which the exposome concept is essential to gain more insight into the associated etiology and, ultimately, lead to better primary prevention strategies for public health. Indeed, cancers result from the combined influence of many genetic, environmental and behavioral stressors that may occur simultaneously and interact. It is thus important to properly account for multifactorial exposure patterns when estimating specific cancer risks at individual or population level. Nevertheless, the risk factors, especially environmental, are still too often considered in isolation in epidemiological studies. Moreover, major statistical difficulties occur when exposures to several factors are highly correlated due, for instance, to common sources shared by several pollutants. Suitable statistical methods must then be used to deal with these multicollinearity issues. In this work, we focused on the specific problem of estimating a disease risk from highly correlated environmental exposure covariates and a censored survival outcome. We extended Bayesian profile regression mixture (PRM) models to this context by assuming an instantaneous excess hazard ratio disease sub-model. The proposed hierarchical model incorporates an underlying truncated Dirichlet process mixture as an attribution sub-model. A specific adaptive Metropolis-Within-Gibbs algorithm—including label switching moves—was implemented to infer the model. This allows simultaneously clustering individuals with similar risks and similar exposure characteristics and estimating the associated risk for each group. Our Bayesian PRM model was applied to the estimation of the risk of death by lung cancer in a cohort of French uranium miners who were chronically and occupationally exposed to multiple and correlated sources of ionizing radiation. 
Several groups of uranium miners with high risk and low risk of death by lung cancer were identified and characterized by specific exposure profiles. Interestingly, our case study illustrates a limit of MCMC algorithms to fit full Bayesian PRM models even if the updating schemes for the cluster labels incorporate label-switching moves. Then, although this paper shows that Bayesian PRM models are promising tools for exposome research, it also opens new avenues for methodological research in this class of probabilistic models.
Article
Competing risk analyses have been widely used for the analysis of in-hospital mortality in which hospital discharge is considered as a competing event. The competing risk model assumes that more than one cause of failure is possible, but there is only one outcome of interest and all others serve as competing events. However, hospital discharge and in-hospital death are two outcomes resulting from the same disease process and patients whose disease conditions were stabilized so that inpatient care was no longer needed were discharged. We therefore propose to use cure models, in which hospital discharge is treated as an observed “cure” of the disease. We consider both the mixture cure model and the promotion time cure model and extend the models to allow cure status to be known for those who were discharged from the hospital. An EM algorithm is developed for the mixture cure model. We also show that the competing risk model, which treats hospital discharge as a competing event, is equivalent to a promotion time cure model. Both cure models were examined in simulation studies and were applied to a recent cohort of COVID-19 in-hospital patients with diabetes. The promotion time model shows that statin use improved the overall survival; the mixture cure model shows that while statin use reduced the in-hospital mortality rate among the susceptible, it improved the cure probability only for older but not younger patients. Both cure models show that treatment was more beneficial among older patients.
Article
Doubly censored data are very common in epidemiological studies. Ignoring censoring in the analysis may lead to biased parameter estimates. In this paper, we highlight that publicly available COVID-19 data may involve a high percentage of double censoring and point out the importance of dealing with such missing information in order to achieve better forecasting results. Existing statistical methods for doubly censored data may suffer from the convergence problems of EM algorithms or may not perform well enough for small sample sizes. This paper develops a new empirical likelihood method to analyse the recovery rate of COVID-19 based on a doubly censored dataset. The efficient influence function of the parameter of interest is used to define the empirical likelihood (EL) ratio. We prove that −2 log(EL ratio) asymptotically follows a standard χ² distribution. This new method does not require any scale parameter adjustment for the log-likelihood ratio and thus does not suffer from the convergence problems involved in traditional EM-type algorithms. Finite-sample simulation results show that this method provides much less biased estimates than existing methods when the censoring percentage is large. The application to COVID-19 data will help researchers in other fields to achieve better estimates and forecasting results.
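The χ² calibration is what turns the EL ratio into a confidence region without a separate variance estimate. As a sketch in notation that is assumed here rather than taken from the paper (θ for the parameter of interest, θ₀ for its true value, R(θ) for the EL ratio, and q for the degrees of freedom of the limiting distribution, which the abstract does not state):

```latex
-2\log R(\theta_0) \xrightarrow{d} \chi^2_q
\quad\Longrightarrow\quad
C_{1-\alpha} = \bigl\{\theta : -2\log R(\theta) \le \chi^2_{q,\,1-\alpha}\bigr\},
\qquad
\Pr\bigl(\theta_0 \in C_{1-\alpha}\bigr) \to 1-\alpha .
```

Here χ²_{q,1−α} denotes the (1−α) quantile of the χ² distribution with q degrees of freedom.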
Article
In the presence of unmeasured spatial confounding, spatial models may actually increase (rather than decrease) bias, leading to uncertainty as to how they should be applied in practice. We evaluated spatial modelling approaches through simulation and application to a big data electronic health record study. Whereas the risk of bias was high for purely spatial exposures (e.g. built environment), we found very limited potential for increased bias for individual‐level exposures that cluster spatially (e.g. smoking status). We also proposed a novel exposure‐penalized spline approach that selects the degree of spatial smoothing to explain spatial variability in the exposure. This approach appeared promising for efficiently reducing spatial confounding bias.
Article
Quantitative assessment of the infection rate of a virus is key to monitoring the evolution of an epidemic. However, such a variable is not accessible to direct measurement, and its estimation requires the solution of a difficult inverse problem. In particular, being the result not only of biological but also of social factors, the transmission dynamics can vary significantly in time. This makes questionable the use of parametric models, which may be unable to capture their full complexity. In this paper we exploit compartmental models which include important COVID-19 peculiarities (like the presence of asymptomatic individuals) and allow the infection rate to assume any continuous-time profile. We show that these models are universal, i.e. capable of reproducing exactly any epidemic evolution, and extract from them closed-form expressions of the infection rate time-course. Building upon such expressions, we then design a regularized estimator able to reconstruct COVID-19 transmission dynamics in continuous time. Using real data collected in Italy, our technique proves to be a useful tool to monitor COVID-19 transmission dynamics and to predict and assess the effect of lockdown restrictions.
Article
Epidemic modelling is an essential tool to understand the spread of the novel coronavirus and ultimately assist in disease prevention, policymaking, and resource allocation. In this article, we establish a state-of-the-art interface between classic mathematical and statistical models and propose a novel space-time epidemic modelling framework to study the spatial-temporal pattern in the spread of infectious diseases. We propose a quasi-likelihood approach via the penalised spline approximation and alternatively reweighted least-squares technique to estimate the model. The proposed estimators are consistent, and the asymptotic normality is established for the constant coefficients. Utilizing spatiotemporal analysis, our proposed model enhances the dynamics of the epidemiological mechanism and dissects the spatiotemporal structure of the spreading disease. We evaluate the numerical performance of the proposed method through a simulation example. Finally, we apply the proposed method in the study of the devastating COVID-19 pandemic.
Article
Highest density regions are level sets containing points of relatively high density. Estimating them from a random sample generated from the underlying density makes it possible to determine the clusters of the corresponding distribution. This task can be approached from different nonparametric perspectives. From a practical point of view, reconstructing highest density regions can be interpreted as a way of determining hot-spots, a crucial task for understanding COVID-19 space-time evolution. In this work, we compare the behaviour of classical plug-in methods and a recently proposed hybrid algorithm for highest density region estimation through an extensive simulation study. Both methodologies are applied to analyse a real data set about COVID-19 cases in the United States.
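A plug-in highest density region can be sketched in a few lines. The version below is a one-dimensional illustration under stated assumptions, not the procedure compared in the paper: it uses a Gaussian kernel density estimate with a hand-picked bandwidth, takes the threshold as an empirical quantile of the estimated density at the sample points, and does not implement the hybrid algorithm. The function names, bandwidth and synthetic bimodal sample are hypothetical.

```python
import math
import random

def kde(sample, h):
    """Return a Gaussian kernel density estimate with bandwidth h."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2 * math.pi))
    def f(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return f

def hdr_threshold(sample, f, alpha):
    """Density level whose upper level set holds about 1 - alpha of the sample."""
    values = sorted(f(x) for x in sample)
    return values[int(alpha * len(values))]

# Synthetic sample (assumption): two well-separated clusters, i.e. two "hot-spots".
rng = random.Random(0)
sample = ([rng.gauss(-3, 0.5) for _ in range(100)]
          + [rng.gauss(3, 0.5) for _ in range(100)])

f_hat = kde(sample, h=0.4)                          # bandwidth chosen by hand
level = hdr_threshold(sample, f_hat, alpha=0.2)

def in_hdr(x):
    """True if x lies in the estimated 80% highest density region."""
    return f_hat(x) >= level

coverage = sum(in_hdr(x) for x in sample) / len(sample)
```

The estimated region is the set of points where `f_hat` is at least `level`; in this example it consists of two disjoint intervals around the cluster centres, and the low-density gap between them is excluded.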
Conference Paper
The coronavirus disease (COVID-19) has rapidly spread throughout the world and while pregnant women present the same adverse outcome rates, they are underrepresented in clinical research. We collected clinical data of 155 test-positive COVID-19 pregnant women at Stony Brook University Hospital. Many of these collected data are of multivariate categorical type, where the number of possible outcomes grows exponentially as the dimension of data increases. We modeled the data within the unsupervised Bayesian framework and mapped them into a lower-dimensional space using latent Gaussian processes. The latent features in the lower dimensional space were further used for predicting if a pregnant woman would be admitted to a hospital due to COVID-19 or would remain with mild symptoms. We compared the prediction accuracy with the dummy/one-hot encoding of categorical data and found that the latent Gaussian process had better accuracy.
Article
COVID-19 has been declared a pandemic by the World Health Organization and has caused more than 2 million deaths worldwide. To aid diagnosis, computer-aided automatic diagnosis systems are built on medical images. In this study, an image processing and machine learning-based method is proposed that enables segmentation of CT images taken from COVID-19 patients and automatic detection of the virus from the segmented images. The main purpose of the study is to automatically diagnose the COVID-19 virus. The study consists of three basic steps: preprocessing, segmentation and classification. The preprocessing phase includes image resizing, image sharpening, noise removal and contrast stretching, and the segmentation phase segments the images with an Expectation-Maximization-based Gaussian Mixture Model. In the classification stage, COVID-19 is classified as positive or negative by using kNN, decision tree, and two different ensemble methods together with the kernel support vector machines method. In the study, two different CT datasets that are open to the public and a mixed dataset created by combining these datasets were used. The best accuracy values for Dataset-1, Dataset-2 and Mixed Dataset are 98.5%, 86.3% and 94.5%, respectively. The achieved results prove that the proposed approach advances state-of-the-art performance. Within the scope of the study, a GUI that can automatically detect COVID-19 has been created.
Article
We consider incomplete observations of stochastic processes governing the spread of infectious diseases through finite populations by way of contact. We propose a flexible semiparametric modelling framework with at least three advantages. First, it enables researchers to study the structure of a population contact network and its impact on the spread of infectious diseases. Second, it can accommodate short- and long-tailed degree distributions and detect potential superspreaders, who represent an important public health concern. Third, it addresses the important issue of incomplete data. Starting from first principles, we show when the incomplete-data generating process is ignorable for the purpose of Bayesian inference for the parameters of the population model. We demonstrate the semiparametric modelling framework by simulations and an application to the partially observed MERS epidemic in South Korea in 2015. We conclude with an extended discussion of open questions and directions for future research.
Article
A vast majority of countries are facing economic and health crises due to the current epidemic of coronavirus disease 2019 (COVID-19). The present study analyzes COVID-19 using time series methods, an essential tool for understanding the growth of infection and its changing behavior, especially the trend. We consider an autoregressive model with a non-linear time trend component that is approximated by a piecewise linear trend using a spline function. The spline function splits the COVID-19 series into piecewise segments between successive knots, corresponding to different growth stages, and fits a linear time trend to each. First, we obtain the number of knots and their locations in the COVID-19 series to identify the transmission stages of COVID-19 infection. Then, the model parameters are estimated under a Bayesian setup for the best-fitted model. The results suggest that the proposed model appropriately determines the location of knots based on different transmission stages and identifies the current transmission situation of the COVID-19 pandemic in a country.
Article
Unprecedented travel restrictions due to the COVID-19 pandemic caused remarkable reductions in anthropogenic emissions; however, the Beijing area still experienced extreme haze pollution even under the strict COVID-19 controls. Generalized Additive Models (GAM) were developed with respect to inter-annual variations, seasonal cycles, holiday effects, diurnal profile, and the non-linear influences of meteorological factors to quantitatively differentiate the lockdown effects and meteorology impacts on concentrations of nitrogen dioxide (NO2) and fine particulate matter (PM2.5) at 34 sites in the Beijing area. The results revealed that lockdown measures caused large reductions while meteorology offset a large fraction of the decrease in surface concentrations. GAM estimates showed that in February, the control measures led to average NO2 reductions of 19 μg/m3 and average PM2.5 reductions of 12 μg/m3. At the same time, meteorology was estimated to contribute an increase of about 12 μg/m3 in NO2, thereby offsetting most of the reductions, as well as an increase of 30 μg/m3 in PM2.5, thereby resulting in concentrations higher than the average PM2.5 concentrations during the lockdown. At the beginning of the lockdown period, the boundary layer height was the dominant factor contributing to a 17% increase in NO2, while humid conditions were the dominant factor for PM2.5 concentrations, leading to an increase of 65% relative to the baseline level. Estimated NO2 emissions declined by 42% at the start of the lockdown, after which the emissions gradually increased with the increase of traffic volumes. The diurnal patterns from the models showed that the peak of vehicular traffic occurred from about 12pm to 5pm daily during the strictest control periods. 
This study provides insights for quantifying the changes in air quality due to the lockdowns by accounting for meteorological variability and providing a reference in evaluating the effectiveness of control measures, thereby contributing to air quality mitigation policies.
Article
The sign test is one of the most popular nonparametric tests for location problems and allows testing for any quantile of a population. However, the common sign test has serious drawbacks such as loss of information by considering solely signs of observations but not their magnitudes, various problems related to handling of ties in the data, and the lack of embedding uncertainty regarding the fraction of underlying quantile. To address these issues, we present an extended sign test based on fuzzy categories and fuzzy formulated hypotheses that improves the generality, versatility, and practicability of the common sign test. This generalized test procedure is neat in theory and practice and avoids disadvantages that are often associated with fuzzy tests (e.g., a considerably higher complexity of the underlying model, a fuzzy test decision, and a possibilistic instead of a probabilistic interpretation of test results). In addition, we perform a comprehensive case study on COVID‐19 in HIV‐infected individuals with a focus on human body temperature and related measurement problems. The results of the study clearly indicate that fuzzy categories and fuzzy hypotheses improve the performance of the sign test.
Article
Objectives Coronavirus disease is a fatal epidemic that has originated in Wuhan, China in December 2019. This disease is diagnosed using radiological images taken with the help of basic scanning methods besides the test kits for Reverse Transcription Polymerase Chain Reaction (RT-PCR). Automatic analysis of chest Computed Tomography (CT) images that are based on image processing technology plays an important role in combating this infectious disease. Material and methods In this paper, a new Multiple Kernels-ELM-based Deep Neural Network (MKs-ELM-DNN) method is proposed for the detection of novel coronavirus disease - also known as COVID-19, through chest CT scanning images. In the model proposed, deep features are extracted from CT scan images using a Convolutional Neural Network (CNN). For this purpose, pre-trained CNN-based DenseNet201 architecture, which is based on the transfer learning approach is used. Extreme Learning Machine (ELM) classifier based on different activation methods is used to calculate the architecture's performance. Lastly, the final class label is determined using the majority voting method for prediction of the results obtained from each architecture based on ReLU-ELM, PReLU-ELM, and TanhReLU-ELM. Results In experimental works, a public dataset containing COVID-19 and Non-COVID-19 classes was used to verify the validity of the MKs-ELM-DNN model proposed. According to the results obtained, the accuracy score was obtained as 98.36% using the MKs-ELM-DNN model. The results have demonstrated that, when compared, the MKs-ELM-DNN model proposed is proven to be more successful than the state-of-the-art algorithms and previous studies. Conclusion This study shows that the proposed Multiple Kernels-ELM-based Deep Neural Network model can effectively contribute to the identification of COVID-19 disease.
Article
The outbreak of Coronavirus Disease 2019 (COVID-19) is an ongoing pandemic affecting over 200 countries and regions. Inference about the transmission dynamics of COVID-19 can provide important insights into the speed of disease spread and the effects of mitigation policies. We develop a novel Bayesian approach to such inference based on a probabilistic compartmental model using data of daily confirmed COVID-19 cases. In particular, we consider a probabilistic extension of the classical susceptible-infectious-recovered model, which takes into account undocumented infections and allows the epidemiological parameters to vary over time. We estimate the disease transmission rate via a Gaussian process prior, which captures nonlinear changes over time without the need of specific parametric assumptions. We utilize a parallel-tempering Markov chain Monte Carlo algorithm to efficiently sample from the highly correlated posterior space. Predictions for future observations are done by sampling from their posterior predictive distributions. Performance of the proposed approach is assessed using simulated datasets. Finally, our approach is applied to COVID-19 data from six states of the United States: Washington, New York, California, Florida, Texas, and Illinois. An R package BaySIR is made available at https://github.com/tianjianzhou/BaySIR for the public to conduct independent analysis or reproduce the results in this paper.
Article
Two-stage meta-analysis has been popularly used in epidemiological studies to investigate an association between environmental exposure and health response by analyzing time-series data collected from multiple locations. The first stage estimates the location-specific association, while the second stage pools the associations across locations. The second stage often incorporates location-specific predictors (i.e., meta-predictors) to explain the between-location heterogeneity and is called meta-regression. The existing second-stage meta-regression relies on parametric assumptions and does not accommodate functional meta-predictors and spatial dependency. Motivated by these limitations, our research proposes a nonparametric Bayesian meta-regression which relaxes parametric assumptions and incorporates functional meta-predictors and spatial dependency. The proposed meta-regression is formulated by jointly modeling the association parameters and the functional meta-predictors using Dirichlet process (DP) or local DP mixtures. In doing so, the functional meta-predictors are represented parsimoniously by the coefficients of the orthonormal basis. The proposed models were applied to (1) a temperature–mortality association study and (2) suicide seasonality study, and validated through a simulation study. Supplementary materials accompanying this paper appear online.
Article
Multi-compartment models have played a central role in modelling infectious disease dynamics since the early 20th century. They are a class of mathematical models widely used for describing the mechanism of an evolving epidemic. Integrated with certain sampling schemes, such mechanistic models can be applied to analyse public health surveillance data, for example to assess the effectiveness of preventive measures (e.g. social distancing and quarantine) and to forecast disease spread patterns. This review begins with a nationwide macromechanistic model and related statistical analyses, including model specification, estimation, inference and prediction. Then, it presents a community-level micromodel that enables high-resolution analyses of regional surveillance data to provide current and future risk information useful for local government and residents when making decisions on reopening local businesses and on personal travel. R software and scripts are provided whenever appropriate to illustrate the numerical detail of algorithms and calculations. The coronavirus disease 2019 pandemic surveillance data from the state of Michigan are used for illustration throughout this paper.
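To make the compartmental mechanism concrete, the following is a minimal sketch of a deterministic susceptible-infectious-removed (SIR) model with a transmission rate that may change over time, as it would under social distancing. It is not the review's R code: the forward Euler scheme, the function name and every parameter value are assumptions chosen purely for illustration.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Forward-Euler simulation of an SIR model, returning one (S, I, R) tuple per day.

    beta is the transmission rate: either a constant or a function of time in days.
    gamma is the removal (recovery or death) rate.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    steps_per_day = round(1 / dt)
    path = [(s, i, r)]
    for step in range(days * steps_per_day):
        t = step * dt
        b = beta(t) if callable(beta) else beta
        new_inf = b * s * i / n * dt      # S -> I flow during this time step
        new_rec = gamma * i * dt          # I -> R flow during this time step
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        if (step + 1) % steps_per_day == 0:
            path.append((s, i, r))
    return path

# Hypothetical scenario: transmission is halved from day 40 onwards.
baseline = simulate_sir(0.3, 0.1, 999_990, 10, 0, days=200)
distancing = simulate_sir(lambda t: 0.3 if t < 40 else 0.15, 0.1, 999_990, 10, 0, days=200)

peak_baseline = max(i for _, i, _ in baseline)
peak_distancing = max(i for _, i, _ in distancing)
```

Comparing `peak_baseline` with `peak_distancing` shows the kind of question such models are used for: how much an intervention lowers the peak number of concurrently infectious individuals. Fitting the rates to surveillance data, which is the focus of the review, requires an additional sampling model that this sketch omits.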