Article

Time Series Analysis with R


Abstract

A brief overview of the R statistical computing and programming environment is given that explains why many time series researchers, in both applied and theoretical research, may find R useful. The core features of R for basic time series analysis are outlined. Some intermediate-level and advanced topics in time series analysis that are supported in R are discussed, including state-space models, structural change, generalized linear models, threshold models, neural nets, co-integration, GARCH, wavelets, and stochastic differential equations. Numerous examples of beautiful graphs constructed using R for time series are shown. R code for reproducing all the graphs and tables is given on my homepage.
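As a quick illustration of the kind of core workflow the abstract refers to, the following minimal sketch uses only base R; the choice of the built-in AirPassengers data set and of the classical "airline" ARIMA specification are assumptions of this example, not content from the paper.

```r
# Minimal base-R time series workflow (illustrative sketch)
data("AirPassengers")                      # built-in monthly airline passenger counts
y <- log(AirPassengers)                    # log transform to stabilise the variance

plot(y, main = "Log airline passengers")   # time plot
acf(y, main = "Sample ACF")                # sample autocorrelation function

# Seasonal ARIMA(0,1,1)(0,1,1)[12], the classical "airline model"
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fit
predict(fit, n.ahead = 12)$pred            # 12-month-ahead forecasts
```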


... Its forecasting superiority is subsequently assessed through formal SPA testing. Further details on implementing traditional statistical methods in the R environment are provided next, while more information can be found in [54,55]. ...
... Here, as in [65], STS models are automatically implemented in R by maximum likelihood through the function "StructTS" in the "stats" package. Further information on automatic estimation and forecasting with STS models in R can be found in [55]. ...
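A minimal sketch of this automatic approach with stats::StructTS is shown below; the use of the built-in Nile series, the local-level specification, and the 10-step horizon are assumptions of the illustration.

```r
# Structural time series (STS) model fitted by maximum likelihood with stats::StructTS
data("Nile")                                  # built-in annual Nile flow series
fit <- StructTS(Nile, type = "level")         # local-level model; type = "BSM" adds trend + seasonality
fit$coef                                      # estimated variance components

fc <- predict(fit, n.ahead = 10)              # 10-step-ahead forecasts with standard errors
ts.plot(Nile, fc$pred, gpars = list(lty = c(1, 2)))
```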
Article
Full-text available
The European Union (EU) has positioned itself as a frontrunner in the worldwide battle against climate change and has set increasingly ambitious pollution mitigation targets for its members. The burden is heavier for the more vulnerable economies in Central and Eastern Europe (CEE), who must juggle meeting strict greenhouse gas emission (GHG) reduction goals, significant fossil-fuel reliance, and pressure to respond to current pandemic concerns that require an increasing share of limited public resources, while facing severe repercussions for non-compliance. Thus, the main goals of this research are: (i) to generate reliable aggregate GHG projections for CEE countries; (ii) to assess whether these economies are on track to meet their binding pollution reduction targets; (iii) to pin-point countries where more in-depth analysis using spatial inventories of GHGs at a finer resolution is further needed to uncover specific areas that should be targeted by additional measures; and (iv) to perform geo-spatial analysis for the most at-risk country, Poland. Seven statistical and machine-learning models are fitted through automated forecasting algorithms to predict the aggregate GHGs in nine CEE countries for the 2019–2050 horizon. Estimations show that CEE countries (except Romania and Bulgaria) will not meet the set pollution reduction targets for 2030 and will unanimously miss the 2050 carbon neutrality target without resorting to carbon credits or offsets. Austria and Slovenia are the least likely to meet the 2030 emissions reduction targets, whereas Poland (in absolute terms) and Slovenia (in relative terms) are the farthest from meeting the EU’s 2050 net-zero policy targets. The findings thus stress the need for additional measures that go beyond the status quo, particularly in Poland, Austria, and Slovenia. Geospatial analysis for Poland uncovers that Krakow is the city where pollution is the most concentrated with several air pollutants surpassing EU standards. Short-term projections of PM2.5 levels indicate that the air quality in Krakow will remain below EU and WHO standards, highlighting the urgency of policy interventions. Further geospatial data analysis can provide valuable insights into other geo-locations that require the most additional efforts, thereby, assisting in the achievement of EU climate goals with targeted measures and minimum socio-economic costs. The study concludes that statistical and geo-spatial data, and consequently research based on these data, complement and enhance each other. An integrated framework would consequently support sustainable development through bettering policy and decision-making processes.
... If the data are just nonlinear, they consist only of nonlinear structures; then q can be 0, since the Box-Jenkins method is a linear model that cannot capture nonlinear interaction. Suboptimal methods may be used in a hybrid model, but suboptimality does not change the functional characteristics of the hybrid approach [17,46,47,48]. The interpretation of time series requires quantification of the vector dynamic response with time shifts. ...
Article
Full-text available
Design: At the heart of time series forecasting, if nonlinear and nonstationary data are analyzed using traditional time series models, the results will be biased. At the same time, if machine learning is used without any input from traditional time series analysis, not much information can be obtained from the results because the machine learning model is a black box. Purpose: In order to better study time series forecasting, we extend the combination of traditional time series models and machine learning and propose a hybrid cascade neural network with a metaheuristic genetic algorithm for space–time forecasting. Finding: To further show the utility of the cascade neural network genetic algorithm, we use various scenarios for training and testing while also extending the simulations by considering the activation functions SoftMax, radbas, logsig, and tribas in space–time forecasting of pollution data. During the simulation, we perform numerical metric evaluations using the root-mean-square error (RMSE), mean absolute error (MAE), and symmetric mean absolute percentage error (sMAPE) to demonstrate that our models provide high accuracy and speed up time-lapse computing.
... Finally, the measurement of the patients' health status at the three-month follow-up showed a marked increase in both the PCS score (which measures physical limitation, disability, general well-being, and perceived health status) and the MCS score (which measures the patient's psychological attitude and limitations in social and personal activities). PCS values nevertheless remained below the normative age-band reference reported by Gandek et al [19] (49.7±7.9 vs a mean of 44.05 in the present study), while they were higher than the values reported by Cole et al [16] at 1 year (mean 40) in a study specifically on outcomes after arthroscopic rotator cuff repair. MCS values, by contrast, came closer to the normative reference value (47.5±10.3 ...
Article
Full-text available
Introduction. In Italy, research by the health professions is underdeveloped and there is a lack of knowledge of the methodology for conducting it. In some of the hospitals participating in the present study, a Research Centre was established to support and train health professionals in conducting clinical research. The study aims to assess whether its establishment led to an increase in scientific output in terms of approved protocols and published articles, and to measure nurses' research knowledge, attitudes, and skills. Study design. Multiple Interrupted Time Series and cross-sectional study. Methods: Data were collected from 2002 to 2012 in 7 hospitals, 4 in which the health-professions Research Centre was established and 3 in which it was not. In addition, a survey was carried out with the modified Nursing Research Questionnaire (NRQ-IT). Results. For two hospitals with the research centre, a statistically significant increase in the primary outcome (number of protocols approved by the Ethics Committee with a nurse as principal investigator) is confirmed at about 2 years after implementation of the intervention. For the other two experimental hospitals, definitive conclusions cannot be reached because of the limited (post-intervention) data available. In the control hospitals, the observations remained stationary, with no scientific output. For the questionnaire, the differences were statistically significant for "Research skills", "Participation in research activities" and "Use of research in practice". Conclusions. The study supports the view that establishing a research support unit for the care professions within hospitals facilitates the production of care research and strengthens the research culture.
... Time series forecasting (TSF) is the use of a model to predict future values based on previously observed values. This modeling approach is used primarily when little knowledge is available on the underlying data generating process or when there is no adequate informative model that relates the prediction variable to other explanatory variables [3,4]. ...
Article
Full-text available
The air quality index (AQI) is a number used by government agencies to communicate to the public how polluted the air currently is. It is based on several factors such as SO2, NO2, O3, RSPM/PM10, and PM2.5. Several methods have been developed in the past by various researchers and environmental agencies for determining the AQI, yet there is no universally accepted method appropriate for all situations. We have developed a prediction model that is confined to standard classification or regression models. These prediction models have ignored the correlation between sub-models in different time slots. The paper focuses on a refined model for inferring air pollutants based on historical and current meteorological datasets. The model is also designed to forecast the AQI for the coming months, quarters or years, with an emphasis on improving its accuracy and performance. The algorithms are applied to the Air Pollution Geocodes Dataset (2016-2018), and results are calculated for 196 cities of India with various classifiers. An accuracy of 94%-96% is achieved with linear robust regression, which increases to 97.92% with KNN, 97.91% with SVM, and 97.47% after the 5th epoch of ANN. The decision tree classifier gives the best accuracy of 99.7%, which increases by 0.02% on application of the random forest classifier. Forecasting is achieved by moving average smoothing using R-ARIMA, which provides daily values for the coming 45 days or monthly AQI data for the next year.
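The study's own pipeline is not reproduced here, but a generic ARIMA-based monthly forecast of an AQI-like series can be sketched with the forecast package; the simulated series below is a stand-in for the (unavailable) AQI data and is purely an assumption of this example.

```r
# Generic ARIMA-based forecast of a monthly index (illustrative stand-in data)
library(forecast)

set.seed(1)
aqi <- ts(100 + 10 * sin(2 * pi * (1:60) / 12) + rnorm(60, 0, 5),
          start = c(2016, 1), frequency = 12)   # simulated monthly AQI-like series

fit <- auto.arima(aqi)        # automatic ARIMA order selection
fc  <- forecast(fit, h = 12)  # forecast the next 12 months
plot(fc)
```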
... • The European call option price is assumed to be a stochastic process with stationary and independent increments, and the risk-free interest rate is constant, in accordance with (Gerber and Landry, 1997). • According to McLeod et al. (2012), many researchers use the R language to perform statistical calculations; the simulation program for a European call option on an incomplete market is therefore created using R. • This study compares the results of the European call option calculation using a linear approximation, a shifted Gamma process and a shifted inverse Gaussian process against the more commonly used Black-Scholes model, to show that the linear approximation is close to Black-Scholes when circumstances are made to resemble the Black-Scholes model. • This study also compares the results of the calculations with actual option price data from the market. ...
Article
An option is a derivative instrument that has investment benefits and provides returns for the writer and the holder. Option price determination is affected by risk factors. In the Black-Scholes model, however, the option price is determined without the effect of arbitrage risk, so it is impossible to earn a return. In this study, an option price formula is constructed to better represent the conditions of the financial market using the incomplete-market concept, in which the traded financial asset is affected by arbitrage risk, so that it is possible for market participants to earn a return. The European call option is defined by the Esscher transform method, and the option price formula is determined by changing its form to a linear approximation. The result of this study is that the option price formula with linear approximation has several advantages: it is easy to apply in computation, it is more representative in capturing risk indications in the financial market, and it can predict option prices more accurately. The linear approximation formula is implemented in a program that can be used by the option writer or holder and is equipped with a data-export feature that makes further research development possible.
... The application of neural networks in time series prediction models [18] is expected to provide more accurate results that are robust to data fluctuations [19]. One aspect of the flexibility of the neural network model as a nonparametric model is that there is no need to test model assumptions [20], so the main consideration is building a model that yields the smallest possible error [21]. ...
Article
Full-text available
Air pollution is the entry of living organisms, energy substances, and other components into the air. It is the presence of one or several contaminants in the outside atmospheric air, such as dust, foam, gas, fog, smoke or steam, in large quantities and with various properties and residence times, resulting in disturbances to the lives of humans, plants or animals. One of the parameters measured in determining air quality is PM2.5, which has a higher probability of entering the lower respiratory tract because particles of small diameter can pass through it. In this paper we obtain two different insights: first, the probability of status change using a Markov chain, and second, forecasts using VAR-NN-PSO (Vector Autoregressive, Neural Network, Particle Swarm Optimization). We classify PM2.5 into three categories, no risk (1-30), medium risk (30-48), and moderate (>49), in Pingtung and Chaozhou from January 2014 to May 2019, so that the data can be modeled with a Markov chain. At the same time, we apply the hybrid VAR-NN-PSO to forecast PM2.5 in Pingtung and Chaozhou; in this optimization, the search for solutions is carried out by a population consisting of several particles generated randomly between the smallest and largest values. Based on the results of the discussion, the transition probabilities of monthly status changes form a continuous-time stochastic process with a stationary probability distribution. Regarding the VAR-NN-PSO, we obtained a mean absolute percentage error (MAPE) of 3.57% for PM2.5 data in Pingtung and 4.87% for PM2.5 data in Chaozhou, respectively, and the model can be used to forecast 180 days ahead.
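The Markov-chain part of such an analysis is easy to sketch in base R: given a series of categorical risk states, a transition probability matrix can be estimated from the observed one-step transitions. The three-state coding below mirrors the categories named in the abstract, but the data are simulated, not the study's.

```r
# Estimating a Markov transition matrix from a categorical time series (sketch)
set.seed(42)
states <- c("no_risk", "medium", "moderate")
x <- sample(states, 65, replace = TRUE)               # stand-in for monthly PM2.5 risk categories

trans_counts <- table(from = head(x, -1), to = tail(x, -1))
trans_probs  <- prop.table(trans_counts, margin = 1)  # each row sums to 1
round(trans_probs, 2)
```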
... Our R package (McLeod et al., 2016a) for ARTFIMA, ARFIMA and ARIMA model fitting, forecasting and simulation is freely available. R, also freely available, provides many state-of-the-art and advanced functions for time series analysis (McLeod et al., 2012), and it also provides an advanced quantitative programming environment for data science. The most important R functions in our package, artfima, are described in Table 1. ...
Article
Estimation and diagnostic checking with the tempered ARFIMA (ARTFIMA) model is discussed. The ARTFIMA model replaces the fractional difference term with a tempered fractional difference that depends on a parameter λ. When λ = 0 the model reduces to the usual ARFIMA model. This model was derived from a continuous-time model developed to describe geophysical turbulence. More generally, the ARTFIMA provides a model that includes the ARMA as well as the ARFIMA. Basic properties are discussed, including exact maximum likelihood estimation (MLE) and Whittle MLE and their associated asymptotic distributions. Unlike the ARMA and ARFIMA cases, it is shown that the Whittle estimator is not always first-order efficient, and this finding is confirmed by a simulation experiment. The distribution of the residual autocorrelations is derived and its application to the portmanteau diagnostic check is discussed. An R software package, artfima, available on CRAN, implements model fitting for the ARTFIMA family of models, including the ARIMA/ARFIMA special cases. It is demonstrated, with examples, that the ARTFIMA family of models may provide a more parsimonious or sparse model, and applications to forecasting, simulation and spectral analysis are discussed.
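A minimal fitting call with the CRAN artfima package might look like the sketch below; the argument names (glp, arimaOrder) are assumed from the package documentation and should be checked against the help pages, and the simulated input series is likewise an assumption of this example.

```r
# Fitting an ARTFIMA model with the CRAN 'artfima' package (sketch; argument
# names such as glp and arimaOrder are assumed, see ?artfima)
library(artfima)

set.seed(7)
z <- arima.sim(n = 500, list(ar = 0.6))            # stand-in series

fit <- artfima(z, glp = "ARTFIMA", arimaOrder = c(1, 0, 0))
fit                                                # estimates of d, lambda and ARMA parameters
```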
... We model the data trend with an ARIMA process. The Hurst exponents and fBm models are estimated with an R package [26][27][28]. ...
Article
Full-text available
Scrub typhus, an infectious disease caused by a bacterium transmitted by “chigger” mites, constitutes a public health problem in Thailand. Predicting epidemic peaks would allow implementing preventive measures locally. This study analyses the predictability of the time series of incidence of scrub typhus aggregated at the provincial level. After stationarizing the time series, the evaluation of the Hurst exponents indicates the provinces where the epidemiological dynamics present a long memory and are predictable. The predictive performances of ARIMA (autoregressive integrated moving average model), ARFIMA (autoregressive fractionally integrated moving average) and fractional Brownian motion models are evaluated. The results constitute the reference level for the predictability of the incidence data of this zoonosis.
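The abstract does not name the R packages used, but one common way to quantify long memory of this kind is to estimate the fractional differencing parameter d (related to the Hurst exponent by H = d + 1/2 for a stationary series) with the fracdiff package; the simulated data below are an assumption of this sketch.

```r
# Long memory: estimating the fractional differencing parameter d and implied Hurst exponent
library(fracdiff)

set.seed(3)
x <- fracdiff.sim(1000, d = 0.3)$series   # simulate an ARFIMA(0, d, 0) series

fit <- fracdiff(x)                        # maximum-likelihood estimate of d
fit$d                                     # estimated d
fit$d + 0.5                               # implied Hurst exponent H = d + 1/2
```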
... Regression models in which a time-lagged dependent variable is used as an additional predictor variable are often referred to as dynamic linear regression models, also known as autoregressive (AR) models. AR models are commonly used for the analysis and forecasting of time series data in economics (McLeod et al., 2012) and hydrology and climate (Anderson, 2011). The lagged dependent variable introduces a temporal component into the model, so that the EVI at a given time step is also a function of the EVI of the previous time step in the time series: y_t = φ y_{t−1} + β X_t + ε_t, t = 1, 2, 3, … ...
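A dynamic regression of this form can be fitted directly in base R by regressing the series on its own first lag and a covariate; the variable names below (evi, rain) and the simulated data are illustrative assumptions, not values from the study.

```r
# Dynamic linear regression with a lagged dependent variable (illustrative sketch)
set.seed(11)
n    <- 120
rain <- rnorm(n)                                   # stand-in covariate X_t
evi  <- numeric(n); evi[1] <- 0
for (t in 2:n) evi[t] <- 0.7 * evi[t - 1] + 0.3 * rain[t] + rnorm(1, 0, 0.1)

d   <- data.frame(y = evi[-1], y_lag = evi[-n], x = rain[-1])
fit <- lm(y ~ y_lag + x, data = d)                 # y_t = phi * y_{t-1} + beta * x_t + e_t
summary(fit)$coefficients
```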
Article
Full-text available
Floodplain wetlands are valuable ecosystems for maintaining biodiversity, but are vulnerable to hydrological modification and climatic extremes. The floodplain wetlands in the middle Yangtze region are biodiversity hotspots, particularly important for wintering migratory waterbirds. In recent years, extremely low winter water level events have frequently occurred in the middle Yangtze River. The hydrological droughts greatly impacted the development and distribution of the wet meadows, one of the most important ecological components in the floodplains, which is vital for the survival of many migratory waterbirds wintering in the Yangtze region. To effectively manage the wet meadows, it is critical to pinpoint the drivers of their deterioration. In this study, we assessed the effects of hydrological connectivity on the ecological stability of wet meadows in Poyang Lake for the period 2000 to 2016. We used the time series of MODIS EVI (Enhanced Vegetation Index) as a proxy for productivity to infer the ecological stability of wet meadows in terms of resistance and resilience. Our results showed that (1) the wet meadows developed in freely connected lakes had significantly higher resilience; (2) wet meadows colonizing controlled lakes had higher resistance to water level anomalies; (3) there was no difference in the resistance to rainfall anomalies between the two types of lakes; (4) the wet meadows in freely connected lakes might be approaching a tipping point, and a regime shift might be imminent. Our findings suggest that adaptive management at the regional scale (i.e., operation of the Three Gorges Dam) and the site scale (e.g., regulating sand mining) is needed to safeguard the long-term ecological stability of the system, which in turn has strong implications for local, regional and global biodiversity conservation.
Chapter
Effectively managing uncertain health, safety, and environmental risks requires quantitative methods for quantifying uncertain risks, answering the following questions about them, and characterizing uncertainties about the answers: Event detection: What has changed recently in disease patterns or other adverse outcomes, by how much, when? Consequence prediction: What are the implications for what will probably happen next if different actions (or no new actions) are taken? Risk attribution: What is causing current undesirable outcomes? Does a specific exposure harm human health, and, if so, who is at greatest risk and under what conditions? Response modeling: What combinations of factors affect health outcomes, and how strongly? How would risks change if one or more of these factors were changed? Decision making: What actions or interventions will most effectively reduce uncertain health risks? Retrospective evaluation and accountability: How much difference have exposure reductions actually made in reducing adverse health outcomes? These are all causal questions. They are about the uncertain causal relations between causes, such as exposures, and consequences, such as adverse health outcomes. This chapter reviews advances in quantitative methods for answering them. It recommends integrated application of these advances, which might collectively be called causal analytics, to better assess and manage uncertain risks. It discusses uncertainty quantification and reduction techniques for causal modeling that can help to predict the probable consequences of different policy choices and how to optimize decisions. Methods of causal analytics, including change-point analysis, quasi-experimental studies, causal graph modeling, Bayesian Networks and influence diagrams, Granger causality and transfer entropy methods for time series, and adaptive learning algorithms provide a rich toolkit for using data to assess and improve the performance of risk management efforts by actively discovering what works well and what does not.
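Of the time series tools listed in this chapter, Granger causality is straightforward to illustrate in R with lmtest::grangertest; the two simulated series below are assumptions of this sketch, not data from the chapter.

```r
# Granger-causality test between two time series (illustrative sketch)
library(lmtest)

set.seed(5)
n <- 300
x <- as.numeric(arima.sim(n = n, list(ar = 0.5)))
y <- c(0, 0.8 * x[-n]) + rnorm(n, 0, 0.5)   # y depends on the first lag of x

grangertest(y ~ x, order = 2)               # does x Granger-cause y?
grangertest(x ~ y, order = 2)               # reverse direction, for comparison
```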
Article
Graphics is an important component of data analysis workflows, especially in interactive systems such as R. This chapter gives an overview of R's static graphics facilities. In addition to commonly used functions, we describe the underlying plotting model and how it can be exploited to customize default output. We also give a brief overview of the relatively recent Grid graphics, and the lattice and ggplot2 packages which use it to implement general purpose high-level systems.
Article
It is shown that the limiting distribution of the augmented Dickey-Fuller (ADF) test under the null hypothesis of a unit root is valid under a very general set of assumptions that goes far beyond the linear AR(∞) process assumption typically imposed. In essence, all that is required is that the error process driving the random walk possesses a continuous spectral density that is strictly positive. Furthermore, under the same weak assumptions, the limiting distribution of the ADF test is derived under the alternative of stationarity, and a theoretical explanation is given for the well-known empirical fact that the test's power is a decreasing function of the chosen autoregressive order p. The intuitive reason for the reduced power of the ADF test is that, as p tends to infinity, the p regressors become asymptotically collinear.
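In R, an ADF test of this kind is available, for example, as tseries::adf.test (or urca::ur.df), where the lag order k plays the role of the autoregressive order p whose choice affects power as described; the simulated series below are assumptions of this sketch.

```r
# Augmented Dickey-Fuller test for a unit root (illustrative sketch)
library(tseries)

set.seed(9)
rw   <- cumsum(rnorm(250))                  # random walk: unit root present
stat <- arima.sim(n = 250, list(ar = 0.5))  # stationary AR(1)

adf.test(rw,   k = 4)   # should fail to reject the unit-root null
adf.test(stat, k = 4)   # should reject the null; larger k tends to reduce power
```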
Article
Predicting road accident probability by exploiting high resolution traffic data has been a continuously researched topic in the last years. However, there is no specific focus on Powered-Two-Wheelers. Furthermore, urban arterials have not received adequate attention so far, because the majority of relevant studies considers freeways. This study aims to contribute to the current knowledge by utilizing support vector machine (SVM) models for predicting Powered-Two-Wheeler (PTW) accident risk and PTW accident type propensity on urban arterials. The proposed methodology is applied both on original and transformed time series of real-time traffic data collected from urban arterials in Athens, Greece for the years 2006–2011. Findings suggest that PTW accident risk and PTW accident type propensity can be adequately defined by the prevailing traffic conditions. When predicting PTW accident risk, the original traffic time series performed better than the transformed time series. On the other hand, when PTW accident type is investigated, neither of the two approaches clearly outperformed the other, but the transformed time series perform slightly better. The results of the study indicate that the combination of SVM models and time-series data can be used for road safety purposes especially by utilizing real-time traffic data.
Article
Ageing-sensitive parameters such as the dissipation factor (tanδ), oil and paper conductivity, and paper moisture can be estimated from an insulation model parameterized using the polarization current in order to assess the condition of transformer insulation. However, polarization current measurement is a time-consuming offline technique. During measurement, variation of environmental conditions (especially temperature) affects the monotonically decreasing nature of the recorded data, and analysis of such affected data leads to incorrect conclusions regarding the insulation condition. As the power transformer is a crucial piece of equipment, utilities prefer to reduce its shutdown time to a minimum. In this paper a technique is discussed through which the polarization current measurement time can be reduced significantly. Several transformer data sets are used for verification of the developed method. The presented results show that measurement data corresponding to only 10 minutes are sufficient to estimate the remaining data through the application of the discussed method.
Article
Full-text available
The number of people registering in an online community depends on two main factors: interest in, and awareness of, the project. Registering for a project does not, however, imply contributing to it, as lacking the knowledge and skills can be a barrier to participation. In order to identify the nature of events that might have facilitated or hindered enrollments in the OpenStreetMap (OSM) project over time, we analyzed the correlations between the number of new participants and the events that dotted its history. Four different metrics were defined to characterize participants' behaviors: the daily number of registrations, the daily number of participants that made a first contribution, the delays between contributors' registration and their first edits, and a daily contribution ratio built from the number of new contributors and the number of new registered members. Time series analyses were used to identify trends and outstanding variations in the number of participants. An inventory of events that took place along the OSM project's history was created, and appreciable variations of the metrics were linked to events that seemed meaningful. Although a correlation does not imply causality, many of the explanations these correlations suggest are supported, directly or indirectly, by the results of other studies, for instance regarding the time participants spend as "lurkers" and the nature of the contributions of early participants. In other cases, they suggest new explanations for the origin of the spam accounts that affect registration statistics, or for the decline in the proportion of registered members who actually become contributors.
Article
Full-text available
Imputation of missing data in datasets with high seasonality plays an important role in data analysis and prediction. Failure to appropriately account for missing data may lead to erroneous findings, false conclusions, and inaccurate predictions. The essence of a good imputation method is its missingness-recovery-ability, i.e., the ability to deal with large periods of missing data in the dataset and the ability to extract the right characteristics (e.g., seasonality pattern) buried under the dataset to be analyzed. Univariate imputation is usually incapable of providing a reasonable imputation for a variable when periods of missing values are large. On the other hand, the default multivariate imputation approach cannot provide an accurate imputation for a variable when missing values of other correlated variables used for imputation occur at exactly the same time intervals. To deal with these drawbacks and to provide feasible imputations in such scenarios, we propose a novel method that converts a single variable into a multivariate form by exploiting the high seasonality and random missingness of this variable. After this conversion, multivariate imputation can then be applied. We then test the proposed method on an LTE spectrum dataset for imputing a single variable, such as the average cell throughput. We compare the performance of our proposed method with Kalman filtering and default method for multivariate imputation. The performance evaluation results clearly show that the proposed method significantly outperforms Kalman filtering and default method in terms of imputation and prediction accuracy.
Article
Full-text available
Due to complex natural water flux processes and the ambiguous explanation of Bouchet's complementary theory, site-level investigations of evapotranspiration (ET) and related climate variables assist in understanding the regional hydrological response to climate change. In this study, site-specific empirical parameters were incorporated into Bouchet's complementary relationship (CR), and potential and actual ET were estimated by the CR method and subsequently validated by 6 years of ground-based vapor flux observations. Time series analysis, correlation analysis and principal regression analysis were conducted to reveal the characteristics of climate change and the controlling factor(s) of the variations of potential and actual ET. The results show that this region has exhibited a combined warming and drying trend over the past decades, with two change points that occurred in 1993 and in 2000. Potential ET was predominantly influenced by temperature and vapor pressure deficit, while actual ET was mostly influenced by vegetation activity. Potential ET was found to be increasing concurrently with declining actual ET, constituting a nearly symmetric complementary relationship over the past decades. This study helps to enhance our understanding of the regional hydrological response to climate change. Further studies are needed to partition the actual ET into transpiration and other components and to reveal the role of vegetation activity in determining regional ET as well as the water balance.
Article
Full-text available
We revisit and update the estimation of variances, fundamental quantities in the time series forecasting approach called kriging, in the time series models known as FDSLRMs, whose observations can be described by a linear mixed model (LMM). As a result of applying convex optimization, we resolved two open problems in FDSLRM research: (1) the theoretical existence of, and equivalence between, two standard estimation methods, the least squares estimators (non-negative (M)DOOLSE) and the maximum likelihood estimators ((RE)MLE); and (2) the practical lack of a freely available computational implementation for FDSLRM. As for computing (RE)MLE in the case of n observed time series values, we also discovered a new algorithm of order O(n), which at the default precision is 10^7 times more accurate and n^2 times faster than the best current Python (or R)-based computational packages, namely CVXPY, CVXR, nlme, sommer and mixed. The LMM framework led us to propose a two-stage estimation method of variance components based on the empirical (plug-in) best linear unbiased predictions of the unobservable random components in FDSLRM. The method, providing non-negative invariant estimators with a simple explicit analytic form and performance comparable with (RE)MLE in the Gaussian case, can be used for any absolutely continuous probability distribution of time series data. We illustrate our results via applications and simulations on three real data sets (electricity consumption, tourism and cyber security), which are easily available, reproducible, sharable and modifiable in the form of interactive Jupyter notebooks.
Article
Full-text available
Long-term (1982–2013) datasets of climate variables and Normalized Difference Vegetation Index (NDVI) were collected from Climate Research Union (CRU) and GIMMS NDVI3g. By setting the NDVI values below the threshold of 0.2 as 0, NDVI_0.2 was created to eliminate the noise caused by changes of surface albedo during non-growing period. TimeSat was employed to estimate the growing season length (GSL) from the seasonal variation of NDVI. Statistical analyses were conducted to reveal the mechanisms of climate-vegetation interactions in the cold and semi-arid Upper Amur River Basin of Northeast Asia. The results showed that the regional climate change can be summarized as warming and drying. Annual mean air temperature (T) increased at a rate of 0.13 °C per decade. Annual precipitation (P) declined at a rate of 18.22 mm per decade. NDVI had an insignificantly negative trend, whereas, NDVI_0.2 displayed a significantly positive trend (MK test, p < 0.05) over the past three decades. GSL had a significantly positive rate of approximately 2.9 days per decade. Correlation analysis revealed that, NDVI was significantly correlated with amount of P, whereas, GSL was highly correlated with warmth index (WMI), accumulation of monthly T above the threshold of 5°C. Principal regression analysis revealed that the inter-annual variations of NDVI, NDVI_0.2 and GSL were mostly contributed by WMI. Spatially, NDVI in grassland was more sensitive to P, whereas, T was more important in areas of high elevation. GSL in most of the areas displayed high sensitivity to T. This study examined the different roles of climate variables in controlling the vegetation activities. Further studies are needed to reveal the impact of extended GSL on the regional water balance and the water level of regional lakes, providing the habitats for the migratory birds and endangered species.
Article
Full-text available
Introduction: the infant mortality rate is an indicator sensitive to the social transformations of a region; it signals socioeconomic development and living conditions and is therefore of interest to professionals in health administration and in preventive and social medicine. Objectives: to establish the association between social determinants and the infant mortality rate in the municipalities of Piaui in 2010. Methodology: ecological study of correlation and spatial autocorrelation between the infant mortality rate and the indicators municipal human development index (HDI) and its components (longevity, income and employment, and education), the Gini coefficient, and the proportion of poor in 2010. Data were obtained from the website of the Atlas of Human Development in Brazil. Descriptive statistics, Spearman's correlation test, and univariate and bivariate spatial dependence analysis with Moran's index were performed. A p-value below 0.05, and a pseudo-p of at most 0.05 with an absolute z-value of at least 1.96, were considered statistically significant. The software MINITAB v.17, GeoDa and R Studio were used. This study was approved by the Research Ethics Committee of the Federal University of Piaui. Results: The infant mortality rate correlated in a complex and not always homogeneous way with the HDI, with the components of the HDI and with the proportion of poor. There was no correlation with the Gini coefficient. The income and employment component of the HDI was the only social determinant that did not show spatial dependence. Only the longevity component of the HDI was spatially correlated with the infant mortality rate, with outliers predominating where a higher longevity component was associated with lower infant mortality. Conclusions: There were correlation and spatial autocorrelation between the infant mortality rate and social determinants in Piaui in 2010. The areas of greatest risk, especially those with the worst social indicators, are the target of strategic planning actions in health administration and in preventive and social medicine. Keywords: Infant Mortality. Social Determinants of Health. Correlation of Data. Spatial Analysis.
Conference Paper
As information technology (IT) progresses rapidly day by day, a massive amount of data is emerging at a fast rate in different sectors. Data dredging provides techniques to obtain the data relevant to a task from a large amount of data. This paper introduces an algorithm for fuzzy data dredging through which fuzzy association rules can be generated for time series data. Time series data can be stock market data, observed climatic data or any sequence data that has some trend or pattern in it. In the past many approaches based on mathematical models were suggested for dredging association rules, but they were quite complex for users. This paper emphasizes the reduction of the large number of irrelevant association rules obtained, providing a better platform for future prediction using fuzzy membership functions and fuzzy rules for time series data. Secondly, this paper also measures the data dispersion in time series data, mainly in stock market data, and shows the deviation of stock prices from the mean of several stock price data points taken over a period of time, which helps investors decide whether to buy or sell. Investment risk can be predicted by understanding the curve obtained in the experiment. Experiments are also carried out to show the results of the proposed algorithm.
Working Paper
Full-text available
State space modelling is an efficient and flexible method for statistical inference of a broad class of time series and other data. This paper describes an R package KFAS for state space modelling with the observations from an exponential family, namely Gaussian, Poisson, binomial, negative binomial and gamma distributions. After introducing the basic theory behind Gaussian and non-Gaussian state space models, an illustrative example of Poisson time series forecasting is provided. Finally, a comparison to alternative R packages suitable for non-Gaussian time series modelling is presented.
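A minimal KFAS sketch of the kind of Poisson forecasting example the paper describes might look as follows; the local-level specification, the simulated counts, and the initial value passed to the optimizer are assumptions of this illustration, not the paper's example.

```r
# Poisson state-space model with KFAS (minimal sketch; specification is an assumption)
library(KFAS)

set.seed(2)
y <- rpois(100, lambda = exp(1 + cumsum(rnorm(100, 0, 0.05))))   # simulated counts

model <- SSModel(y ~ SSMtrend(degree = 1, Q = list(matrix(NA))),
                 distribution = "poisson")
fit   <- fitSSM(model, inits = log(0.1), method = "BFGS")         # estimate the state variance
out   <- KFS(fit$model)                                           # filtering and smoothing
head(coef(out))                                                   # smoothed level estimates
```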
Article
Full-text available
The most general form of a nonlinear strictly stationary process is that referred to as a Volterra expansion; this is to a linear process what a polynomial is to a linear function. Because of this similarity, an analogue of Tukey's one degree of freedom for nonadditivity test is constructed as a test for linearity versus a second-order Volterra expansion.
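Tests of linearity against a second-order (Volterra-type) alternative are available in R, for example as Keenan.test and Tsay.test in the TSA package; whether either corresponds exactly to the test constructed in this paper is not claimed here, and the use of the lynx series is an assumption of the sketch.

```r
# Testing linearity against a quadratic (Volterra-type) alternative (sketch)
library(TSA)

data(lynx)
y <- log10(lynx)

Keenan.test(y)   # Keenan's one-degree-of-freedom test for nonlinearity
Tsay.test(y)     # Tsay's F test against a quadratic alternative
```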
Article
Full-text available
The increasing popularity of R is leading to an increase in its use in undergraduate courses at universities (R Development Core Team 2008). One of the strengths of R is the flexible graphics provided in its base package. However, students often run up against its limitations, or they find the amount of effort to create an interesting plot may be excessive. The grid package (Murrell 2005) has a wealth of graphical tools which are more accessible to such R users than many people may realize. The purpose of this paper is to highlight the main features of this package and to provide some examples to illustrate how students can have fun with this different form of plotting and to see that it can be used directly in the visualization of data.
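A small taste of grid in the spirit of the paper is sketched below; the particular shapes and coordinates drawn are arbitrary choices for illustration.

```r
# A few basic grid primitives (illustrative)
library(grid)

grid.newpage()
pushViewport(viewport(width = 0.8, height = 0.8))   # a centred viewport
grid.rect(gp = gpar(col = "grey40"))                # frame around the viewport
grid.lines(x = c(0.1, 0.9), y = c(0.2, 0.8),
           gp = gpar(col = "steelblue", lwd = 2))   # a line in viewport coordinates
grid.text("hello, grid", x = 0.5, y = 0.9)
popViewport()
```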
Article
Full-text available
Our ltsa package implements the Durbin-Levinson and Trench algorithms and provides a general approach to the problems of fitting, forecasting and simulating linear time series models, as well as fitting regression models with linear time series errors. For computational efficiency both algorithms are implemented in C and interfaced to R. Examples are given which illustrate the efficiency and accuracy of the algorithms. We provide a second package, FGN, which illustrates the use of the ltsa package with fractional Gaussian noise (FGN). It is hoped that ltsa will provide a base for further time series software.
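As an illustration of the kind of functionality described, the sketch below uses tacvfARMA and DLSimulate from ltsa to simulate a series directly from its autocovariance function; the exact argument names and order of these functions are assumed from memory and should be checked against the package help pages.

```r
# Simulation from an autocovariance function via the Durbin-Levinson algorithm (sketch;
# argument names/order for tacvfARMA and DLSimulate are assumptions, see ?DLSimulate)
library(ltsa)

r <- tacvfARMA(phi = 0.8, theta = 0.4, maxLag = 199)  # ARMA(1,1) autocovariances, lags 0..199
set.seed(4)
z <- DLSimulate(n = 200, r = r)                       # simulate 200 values
plot.ts(z)
```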
Book
Statistical Methods for Long Term Memory Processes covers the diverse statistical methods and applications for data with long-range dependence. Presenting material that previously appeared only in journals, the author provides a concise and effective overview of probabilistic foundations, statistical methods, and applications. The material emphasizes basic principles and practical applications and provides an integrated perspective of both theory and practice. This book explores data sets from a wide range of disciplines, such as hydrology, climatology, telecommunications engineering, and high-precision physical measurement. The data sets are conveniently compiled in the index, and this allows readers to view statistical approaches in a practical context. Statistical Methods for Long Term Memory Processes also supplies S-PLUS programs for the major methods discussed. This feature allows the practitioner to apply long memory processes in daily data analysis. For newcomers to the area, the first three chapters provide the basic knowledge necessary for understanding the remainder of the material. To promote selective reading, the author presents the chapters independently. Combining essential methodologies with real-life applications, this outstanding volume is an indispensable reference for statisticians and scientists who analyze data with long-range dependence.
Book
The fourth edition of this popular graduate textbook, like its predecessors, presents a balanced and comprehensive treatment of both time and frequency domain methods with accompanying theory. Numerous examples using nontrivial data illustrate solutions to problems such as discovering natural and anthropogenic climate change, evaluating pain perception experiments using functional magnetic resonance imaging, and monitoring a nuclear test ban treaty. The book is designed as a textbook for graduate level students in the physical, biological, and social sciences and as a graduate level text in statistics. Some parts may also serve as an undergraduate introductory course. Theory and methodology are separated to allow presentations on different levels. In addition to coverage of classical methods of time series regression, ARIMA models, spectral analysis and state-space models, the text includes modern developments including categorical time series analysis, multivariate spectral methods, long memory series, nonlinear models, resampling techniques, GARCH models, ARMAX models, stochastic volatility, wavelets, and Markov chain Monte Carlo integration methods. This edition includes R code for each numerical example in addition to Appendix R, which provides a reference for the data sets and R scripts used in the text in addition to a tutorial on basic R commands and R time series. An additional file is available on the book's website for download, making all the data sets and scripts easy to load into R.
• Student-tested and improved
• Accessible and complete treatment of modern time series analysis
• Promotes understanding of theoretical concepts by bringing them into a more practical context
• Comprehensive appendices covering the necessities of understanding the mathematics of time series analysis
• Instructor's Manual available for adopters
New to this edition:
• Introductions to each chapter replaced with one-page abstracts
• All graphics and plots redone and made uniform in style
• Bayesian section completely rewritten, covering linear Gaussian state space models only
• R code for each example provided directly in the text for ease of data analysis replication
• Expanded appendices with tutorials containing basic R and R time series commands
• Data sets and additional R scripts available for download on Springer.com
• Internal online links to every reference (equations, examples, chapters, etc.)
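The R scripts the blurb refers to are distributed, to the best of my knowledge, through the book's companion astsa package; a minimal sketch, assuming the quarterly Johnson & Johnson earnings series (jj) that ships with that package:

```r
# Fitting and checking a seasonal ARIMA with the astsa companion package (sketch)
library(astsa)

y <- log(jj)                       # quarterly Johnson & Johnson earnings, log scale
sarima(y, p = 1, d = 1, q = 1,     # ARIMA(1,1,1)(1,1,0)[4] with diagnostic plots
       P = 1, D = 1, Q = 0, S = 4)
sarima.for(y, n.ahead = 8,         # 8-quarter-ahead forecasts
           p = 1, d = 1, q = 1, P = 1, D = 1, Q = 0, S = 4)
```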
Book
Reveals how HMMs can be used as general-purpose time series models and implements all methods in R. Hidden Markov Models for Time Series: An Introduction Using R applies hidden Markov models (HMMs) to a wide range of time series types, from continuous-valued, circular, and multivariate series to binary data, bounded and unbounded counts, and categorical observations. It also discusses how to employ the freely available computing environment R to carry out computations for parameter estimation, model selection and checking, decoding, and forecasting. After presenting the simple Poisson HMM, the book covers estimation, forecasting, decoding, prediction, model selection, and Bayesian inference. Through examples and applications, the authors describe how to extend and generalize the basic model so it can be applied in a rich variety of situations. They also provide R code for some of the examples, enabling the use of the code in similar applications. This book illustrates the wonderful flexibility of HMMs as general-purpose models for time series data and provides a broad understanding of the models and their uses.
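The book supplies its own R code; an alternative way to fit a simple two-state Poisson HMM in R, using the depmixS4 package rather than the book's code, is sketched below with simulated counts.

```r
# Two-state Poisson hidden Markov model with depmixS4 (illustration, not the book's code)
library(depmixS4)

set.seed(8)
states <- rep(c(1, 2), times = c(60, 60))
counts <- rpois(120, lambda = c(2, 9)[states])        # simulated counts from two regimes
d <- data.frame(counts = counts)

mod <- depmix(counts ~ 1, data = d, nstates = 2, family = poisson())
fm  <- fit(mod)                                       # EM estimation
summary(fm)                                           # transition matrix and state means
```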
Book
The first edition of this book has established itself as one of the leading references on generalized additive models (GAMs), and the only book on the topic to be introductory in nature with a wealth of practical examples and software implementation. It is self-contained, providing the necessary background in linear models, linear mixed models, and generalized linear models (GLMs), before presenting a balanced treatment of the theory and applications of GAMs and related models. The author bases his approach on a framework of penalized regression splines, and while firmly focused on the practical aspects of GAMs, discussions include fairly full explanations of the theory underlying the methods. Use of R software helps explain the theory and illustrates the practical application of the methodology. Each chapter contains an extensive set of exercises, with solutions in an appendix or in the book’s R data package gamair, to enable use as a course text or for self-study.
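The penalized-regression-spline framework the blurb describes is implemented in the author's mgcv package; a minimal fit on simulated data (the data and smooth chosen here are assumptions of the sketch) looks like:

```r
# A simple GAM with penalized regression splines via mgcv (minimal sketch)
library(mgcv)

set.seed(6)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, 0, 0.3)

fit <- gam(y ~ s(x))     # smooth of x, smoothness chosen automatically
summary(fit)
plot(fit, shade = TRUE)  # estimated smooth with confidence band
```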
Chapter
Once we have identified any trend and seasonal effects, we can deseasonalise the time series and remove the trend. If we use the additive decomposition method of §1.5, we first calculate the seasonally adjusted time series and then remove the trend by subtraction. This leaves the random component, but the random component is not necessarily well modelled by independent random variables. In many cases, consecutive variables will be correlated. If we identify such correlations, we can improve our forecasts, quite dramatically if the correlations are high. We also need to estimate correlations if we are to generate realistic time series for simulations. The correlation structure of a time series model is defined by the correlation function, and we estimate this from the observed time series.
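In R this corresponds to removing the trend and seasonal effects (for example with decompose) and then examining the sample autocorrelation function of the random component; the use of the built-in co2 series is an assumption of this sketch.

```r
# Correlogram of the random component after removing trend and seasonality (sketch)
data("co2")                              # built-in monthly CO2 concentrations
dec <- decompose(co2)                    # additive decomposition: trend + seasonal + random

res <- na.omit(dec$random)               # random component (NAs at the ends removed)
acf(res, main = "ACF of random component")
```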
Book
Thanks to its data handling and modeling capabilities and its flexibility, R is becoming the most widely used software in bioinformatics. R Programming for Bioinformatics builds the programming skills needed to use R for solving bioinformatics and computational biology problems. Drawing on the author's experiences as an R expert, the book begins with coverage on the general properties of the R language, several unique programming aspects of R, and object-oriented programming in R. It presents methods for data input and output as well as database interactions. The author also examines different facets of string handling and manipulations, discusses the interfacing of R with other languages, and describes how to write software packages. He concludes with a discussion on the debugging and profiling of R code.
Book
This is the first book on applied econometrics using the R system for statistical computing and graphics. It presents hands-on examples for a wide range of econometric models, from classical linear regression models for cross-section, time series or panel data and the common non-linear models of microeconometrics such as logit, probit and tobit models, to recent semiparametric extensions. In addition, it provides a chapter on programming, including simulations, optimization, and an introduction to R tools enabling reproducible econometric research. An R package accompanying this book, AER, is available from the Comprehensive R Archive Network (CRAN). It contains some 100 data sets taken from a wide variety of sources, the full source code for all examples used in the text plus further worked examples, e.g., from popular textbooks. The data sets are suitable for illustrating, among other things, the fitting of wage equations, growth regressions, hedonic regressions, dynamic regressions and time series models as well as models of labor force participation or the demand for health care. The goal of this book is to provide a guide to R for users with a background in economics or the social sciences. Readers are assumed to have a background in basic statistics and econometrics at the undergraduate level. A large number of examples should make the book of interest to graduate students, researchers and practitioners alike.
Article
Time series econometrics is used for predicting future developments of variables of interest, such as economic growth, stock market volatility or interest rates. A model has to be constructed, accordingly, to describe the data generation process and to estimate its parameters. Modern tools to accomplish these tasks are provided in this volume, which also demonstrates by example how the tools can be applied. © Cambridge University Press 2004 and Cambridge University Press, 2009.
Article
1. Introduction to wavelets
2. Review of Fourier theory and filters
3. Orthonormal transforms of time series
4. The discrete wavelet transform
5. The maximal overlap discrete wavelet transform
6. The discrete wavelet packet transform
7. Random variables and stochastic processes
8. The wavelet variance
9. Analysis and synthesis of long memory processes
10. Wavelet-based signal estimation
11. Wavelet analysis of finite energy signals
Appendix: Answers to embedded exercises. References. Author index. Subject index.
Book
The analysis of integrated and co-integrated time series can be considered the main methodology employed in applied econometrics. This book not only introduces the reader to this topic but also enables the reader to conduct the various unit root tests and co-integration methods using the free statistical programming environment R. The book encompasses seasonal unit roots, fractional integration, coping with structural breaks, and multivariate time series models. It is enriched by numerous programming examples with artificial and real data, so that it is ideally suited as an accompanying textbook for computer lab classes. The second edition adds a discussion of vector autoregressive, structural vector autoregressive, and structural vector error-correction models. To analyze the interactions between the investigated variables, impulse response functions and forecast error variance decompositions are introduced, as well as forecasting. The author explains how these model types relate to each other.
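The tests covered by the book map onto the urca package in R; the sketch below runs an ADF unit root test and a Johansen cointegration test on simulated data, where the data and the chosen lag orders are assumptions of this example.

```r
# Unit root and cointegration tests with urca (illustrative sketch)
library(urca)

set.seed(10)
n <- 250
x <- cumsum(rnorm(n))                  # I(1) series
y <- 0.5 * x + rnorm(n)                # cointegrated with x

summary(ur.df(y, type = "drift", selectlags = "AIC"))   # ADF test on y
summary(ca.jo(cbind(y, x), type = "trace", K = 2))      # Johansen trace test
```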
Article
Wavelet methods have recently undergone a rapid period of development with important implications for a number of disciplines including statistics. This book has three main objectives: (i) providing an introduction to wavelets and their uses in statistics; (ii) acting as a quick and broad reference to many developments in the area; (iii) interspersing R code that enables the reader to learn the methods, to carry out their own analyses, and further develop their own ideas. The book code is designed to work with the freeware R package WaveThresh4, but the book can be read independently of R. The book introduces the wavelet transform by starting with the simple Haar wavelet transform, and then builds to consider more general wavelets, complex-valued wavelets, non-decimated transforms, multidimensional wavelets, multiple wavelets, wavelet packets, boundary handling, and initialization. Later chapters consider a variety of wavelet-based nonparametric regression methods for different noise models and designs including density estimation, hazard rate estimation, and inverse problems; the use of wavelets for stationary and non-stationary time series analysis; and how wavelets might be used for variance estimation and intensity estimation for non-Gaussian sequences. The book is aimed both at Masters/Ph.D. students in a numerate discipline (such as statistics, mathematics, economics, engineering, computer science, and physics) and postdoctoral researchers/users interested in statistical wavelet methods.
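A minimal wavelet-shrinkage pass with the book's WaveThresh software, in the spirit of the text, is sketched below; the simulated signal and the default thresholding choices are assumptions of this example.

```r
# Wavelet shrinkage denoising with wavethresh (minimal sketch)
library(wavethresh)

set.seed(12)
n <- 256                                        # length must be a power of two
x <- sin(2 * pi * (1:n) / 64)                   # smooth test signal
y <- x + rnorm(n, 0, 0.3)                       # noisy observations

wdy <- wd(y, filter.number = 4, family = "DaubExPhase")  # discrete wavelet transform
est <- wr(threshold(wdy))                                # threshold and invert
plot.ts(cbind(noisy = y, denoised = est), main = "Wavelet denoising")
```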
Article
With time series data, there is often the issue of finding accurate approximations for the variance of such quantities as the sample autocovariance function or spectral estimate. Smith and Field (J. Time. Ser. Anal 14: 381–395, 1993) proposed a variance estimate motivated by resampling in the frequency domain. In this paper we present some results on the cumulants of this and other frequency domain estimates obtained via symbolic computation. The statistics of interest are linear combinations of products of discrete Fourier transforms. We describe an operator which calculates the joint cumulants of such statistics, and use the operator to deepen our understanding of the behaviour of the resampling based variance estimate. The operator acts as a filter for a general purpose operator described in Andrews and Stafford (J.R. Statist. Soc. B55, 613–627).
Conference Paper
Suppose that a forecasting model is available for the process Xt but that interest centres on the instantaneous transformation Yt = T(Xt). On the assumption that Xt is Gaussian and stationary, or can be reduced to stationarity by differencing, this paper examines the autocovariance structure of and methods for forecasting the transformed series. The development employs the Hermite polynomial expansion, thus allowing results to be derived for a very general class of instantaneous transformations.
Book
Ripley brings together two crucial ideas in pattern recognition: statistical methods and machine learning via neural networks. He brings unifying principles to the fore, and reviews the state of the subject. Ripley also includes many examples to illustrate real problems in pattern recognition and how to overcome them.
Article
The shape parameter of a two-variable graph is the ratio of the horizontal and vertical distances spanned by the data. For at least 70 years this parameter has received much attention in writings on data display, because it is a critical factor on two-variable graphs that show how one variable depends on the other. But despite the attention, there has been little systematic study. In this article the shape parameter and its effect on the visual decoding of slope information are studied through historical, empirical, theoretical, and experimental investigations. These investigations lead to a method for choosing the shape that maximizes the accuracy of slope judgments.
Article
New fractal parameters, the dimension of minimum covers and the related index of fractality, have been introduced. The limiting value of the dimension coincides with the usual fractal dimension D. Numerical calculations carried out for stock price series have shown that the application of minimal covers leads to rapid convergence to a power-law asymptotic behavior of the function with respect to δ. This makes it possible to treat the index as a local characteristic and to introduce a function that is an indicator of local stability of the time series (the greater m, the more stable the series). It has been shown, using a very rich empirical data array, that the index of fractality, in essence, defines a natural way of integrating over all possible price trajectories (starting from the shortest). It turns out that trajectories corresponding to a random walk have the greatest weight.