Question

What is a reliable method of dealing with missing data in time series records?

Specifically this issue relates to calculating an annual river salt load (tonnes) based on TDS x flow (ML/day) data, where sites may have 5 - 25 % of TDS data missing over the year but have the relevant flow data.

All Answers (11)

  • Miguel Pais · Centro de Oceanografia
    Maybe try plotting the evolution of TDS over the year and fitting some model to the available points. Then use the model in place of the real points wherever TDS data is missing. Something like that would work, I think... it all depends on the information that can be retrieved from the available data. I've never done this, but it is just an idea that came to mind when I read your question.
  • Thomas Petzoldt · Technische Universität Dresden
    General answer: One question is whether your data is uni- or multivariate. If it is multivariate, then use a statistical model as suggested by Miguel. A common technique for this is called "imputation" and a very nice technique is "MICE" (multiple imputation by chained equations).

    Specific answer: Regarding flow/concentration data, it is very common to use a nonlinear relationship between concentration and discharge. Another (and sometimes more robust) method is to classify discharge into classes and then assign class-average concentrations.

    Here are a few papers:

    Coats R, Liu FJ, Goldman CR (2002). “A Monte Carlo test of load calculation methods, Lake Tahoe basin, California-Nevada.” Journal of the American Water Resources Association, 38(3), 719–730.

    Dann MS, Lynch JA, Corbett ES (1986). “Comparison of Methods for Estimating Sulfate Export from a Forested Watershed.” J. Environ. Qual., 15(2), 140–145.

    Harned DA, Daniell CCI, Crawford J (1981). “Methods of discharge compensation as an aid to the evaluation of water quality trends.” Water Resources Research, 17(5), 1389–1400.

    Schleppi P, Waldner PA, Fritschi B (2006). “Accuracy and precision of different sampling strategies and flux integration methods for runoff water: comparisons based on measurements of the electrical conductivity.” Hydrological Processes, 20, 395–410.

    Swistock BR, Edwards PJ, Wood F, Dewalle DR (1997). “Comparison of methods for calculating annual solute exports from six forested Appalachian watersheds.” Hydrological Processes, 11, 655–669.
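    A minimal sketch of the concentration-discharge approach mentioned above, not a definitive implementation. It assumes a power-law rating curve fitted in log space, TDS in mg/L (the question does not state the unit) and flow in ML/day, so that TDS x flow gives kg/day. The function names and the data are made up for illustration. Note that back-transforming a log-space fit tends to underestimate the mean, so a bias correction may be needed on real data.

```python
import math

def fit_rating_curve(flows, tds):
    """Fit log(TDS) = a + b*log(Q) by least squares on days with observed TDS."""
    pairs = [(math.log(q), math.log(c))
             for q, c in zip(flows, tds)
             if c is not None and q > 0 and c > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def fill_missing_tds(flows, tds, a, b):
    """Replace None with the rating-curve estimate; observed values are kept."""
    filled = []
    for q, c in zip(flows, tds):
        if c is None:
            if q <= 0:
                raise ValueError("rating curve needs a positive flow")
            c = math.exp(a + b * math.log(q))
        filled.append(c)
    return filled

def annual_load_tonnes(flows, tds):
    """Sum daily loads. Assumes mg/L * ML/day = kg/day, then kg -> tonnes."""
    return sum(q * c for q, c in zip(flows, tds)) / 1000.0

# Made-up data: flow in ML/day, TDS in mg/L (assumed unit), None = missing.
flows = [4.0, 16.0, 64.0, 100.0, 256.0]
tds = [500.0, None, 125.0, 100.0, None]

a, b = fit_rating_curve(flows, tds)
filled = fill_missing_tds(flows, tds, a, b)
load = annual_load_tonnes(flows, filled)
```

    With a full year of daily records the same three calls apply unchanged; only days with missing TDS are estimated, and measured values are never overwritten.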
  • Nasser Saleh · Universiteit Twente
    Since it is a time series, a moving average is commonly used.
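    One way to read the moving-average suggestion, as a rough sketch only: fill each missing value with the mean of the observed values inside a centred window. The function name, the window size and the data are hypothetical. This ignores flow altogether, so it is probably only reasonable for short gaps during steady flow.

```python
def fill_with_neighbour_mean(series, half_window=1):
    """Fill None entries with the mean of observed values within +/- half_window."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        lo = max(0, i - half_window)
        hi = min(len(series), i + half_window + 1)
        # Only original observations are averaged, never previously filled values.
        neighbours = [v for v in series[lo:hi] if v is not None]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled

# Made-up daily TDS values; None marks a missing day.
tds = [10.0, None, 14.0, 16.0, None, None, 22.0]
filled = fill_with_neighbour_mean(tds, half_window=1)
```

    A gap wider than the window is left as None rather than guessed, which makes long gaps easy to spot afterwards.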
  • Mary-Anne Jones · Queensland Government
    My data are bivariate. Basically, the TDS data are flow dependent and separate into two main groups: low flow and high flow. I was thinking of using a statistical model as suggested by Miguel, but I was not sure of the best approach, e.g. whether to separate the data by flow first. Thank you for your suggestions and the references. I will follow these up.
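    Separating by flow first could look something like the sketch below, which is the flow-class averaging idea with just two classes. The threshold of 50 ML/day, the function name and the data are all made up; a real threshold would have to come from looking at the TDS versus flow plot.

```python
def fill_by_flow_class(flows, tds, threshold):
    """Fill missing TDS with the mean observed TDS of the same flow class."""
    def flow_class(q):
        return "high" if q >= threshold else "low"

    # Collect observed TDS per class.
    observed = {"low": [], "high": []}
    for q, c in zip(flows, tds):
        if c is not None:
            observed[flow_class(q)].append(c)

    # Class means; a class with no observations cannot be filled.
    means = {k: sum(v) / len(v) for k, v in observed.items() if v}

    return [c if c is not None else means.get(flow_class(q))
            for q, c in zip(flows, tds)]

# Made-up data: flow in ML/day, TDS with None for missing days.
flows = [5.0, 8.0, 120.0, 150.0, 6.0, 130.0]
tds = [900.0, None, 300.0, 340.0, 1100.0, None]
filled = fill_by_flow_class(flows, tds, threshold=50.0)
```

    If a class has no measured TDS at all (for example, no samples during high flow), the missing values in that class stay as None, which is a useful warning in itself.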
  • Barry Croke · Australian National University
    The best method will depend on the amount of data missing and the nature of the signal. Generally, the best method would be to use a model, but you have the problem of determining the model structure first. An artificial neural network (ANN) may help, as this approach will automatically tune the structure to the data. The problem is that an ANN is a black box: you gain no understanding of the processes from such an approach, although it is ideal for filling in missing data. How well it works depends on how much data is missing and which parts of the data are missing (e.g. if the flow peaks are missing, then you have no information about what is happening under those conditions, so you will get very poor estimates).

    The best idea is to try several different approaches and look at the differences in the predictions. This will give you an idea of the confidence in the predictions. You should also look at how the different predictions affect the result you are looking for (e.g. the estimated annual loads).
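    As a rough sketch of that comparison: fill the same record with two or more methods, compute the annual load for each, and look at the spread. The two fill methods here (overall mean, and carrying the last observation forward) are deliberately crude stand-ins chosen for illustration, the data are made up, and TDS is assumed to be in mg/L with flow in ML/day.

```python
def fill_overall_mean(tds):
    """Fill None with the mean of all observed values."""
    obs = [c for c in tds if c is not None]
    mean = sum(obs) / len(obs)
    return [mean if c is None else c for c in tds]

def fill_carry_forward(tds):
    """Fill None with the last observed value (leading gaps use the first one)."""
    last = next(c for c in tds if c is not None)
    out = []
    for c in tds:
        if c is not None:
            last = c
        out.append(last)
    return out

def annual_load_tonnes(flows, tds):
    """Assumes mg/L * ML/day = kg/day, then kg -> tonnes."""
    return sum(q * c for q, c in zip(flows, tds)) / 1000.0

# Made-up data: two low-flow days, two high-flow days, half the TDS missing.
flows = [10.0, 10.0, 100.0, 100.0]
tds = [1000.0, None, 200.0, None]

methods = {"overall mean": fill_overall_mean,
           "carry forward": fill_carry_forward}
loads = {name: annual_load_tonnes(flows, fill(tds))
         for name, fill in methods.items()}

# Relative spread of the load estimates around their average.
values = list(loads.values())
spread = (max(values) - min(values)) / (sum(values) / len(values))
```

    In this toy case the missing high-flow day dominates the difference between the two loads, which illustrates the point about missing flow peaks.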
  • John Gallagher · University of Tasmania
    The best method for dealing with missing blocks of data in a time series no smaller than 30 or so points is singular spectrum analysis (SSA), for both univariate and multivariate time series. The method is model independent and requires no conversion to a stationary series. Basically, it is principal components analysis in the time domain. It iterates to a best solution based on all the data within the series, not just proximate values or trends on either side of the missing blocks.

    There are packages in R, but easier applications, with two slightly different algorithms, are the kSpectra software (there is a good toolbox with examples to try, but it is Mac only) or the Russian Caterpillar software (the name refers to the form of the algorithm, LOL). Both are initially freeware, so you can try them out with your data set before buying. References in the literature to their use can be found in the freeware packages.

    For myself, I have used kSpectra for imputation (wiggly lines across the missing data blocks) of coastal nitrate data at different scales, from months to inter- and intra-decadal nonlinear trends. I also hindcast the data back 20 years and confirmed the correctness of the result from other lines of evidence.

    The other possibility is to use a combination of sine wave modelling with a floating phase, and the Akaike information criterion to stop overfitting, in combination with a Lomb spectral analysis (designed for missing data) to constrain or confirm the statistically significant dominant sine wave periodicities in your time series fit. This can be found in the free stats package PAST. PAST is in spreadsheet form and very (too) easy to use, whereas kSpectra needs the data in a CSV-type file with even time spacing and missing data markers (e.g. NaN) in the even time series, so choose the scale you want. In other words, if you have missing monthly data that can bias the annual average, and you are interested in the annual or interannual variation, then you need to impute the monthly data, of course.

    What else? Oh yes, whatever you do, maybe use all of the above for confirmation through convergence. Also, depending on the scale you are interested in, the smoothing function in PAST, which fits various weighting factors for the lowest root mean square error, may be useful in combination with the above.

    If you have any questions, don't hesitate to ask.

    Barry
  • John Gallagher · University of Tasmania
    Oh! One other thought, Mary-Anne: in relation to missing TDS data with tides, singular spectrum analysis was first developed for San Francisco Bay. I don't have the reference handy at the moment, but a Google search will do, I think. The algorithm is more for stationary series, I think, but it nevertheless illustrates what can be done with missing TDS time series.
  • Mary-Anne Jones · Queensland Government
    Excellent! Thank you John and Barry.
  • John Gallagher · University of Tasmania
    Oh, one other thing, Mary-Anne: I also agree with Thomas (see above) about river discharge classes as another possibility. I didn't know that it was a common procedure; nevertheless, I used it for my research (PhD, submitting to UTAS ePrints in 2 weeks, hurrah). I assigned DIN from discharge data as part of a palaeoreconstruction model for recreating the mix from coastal and river inputs over time, using palaeosalinity reconstruction for the proportions in the estuary. The coastal DIN used the imputation I mentioned, and the former used a set of mean and linear relationships with discharge, as this appeared to be better than some complex empirical curve with no underlying mechanism.

    The secret in this analysis is to determine the changes in the TDS relationship with discharge. This was done by CUSUM analysis (Taylor software; Google it and it should come up OK). In this way I found high, invariant mean nutrient concentrations at low flows (stagnation phase); minimum, invariant mean nutrient concentrations at a higher set of flows (dilution phase); then a linear response of increasing nutrients with river flow (response phase); and finally a fall in the invariant mean nutrient response (flood, high-particulate phase).
    Note that the phase nomenclature is mine, for the Little Swanport River, and may not be general. Once you have the TDS and river flow response curves, it may also be possible to apply them as a general relationship. Having said that, the criticism is that TDS and river flow are probably autoregressive (each value depends on the previous values to some unknown degree), and it may be best to produce a univariate ARMA model once the missing data are imputed. That is, a relationship of future TDS with past TDS on the reconstructed time series. If you are not familiar with this process (I am not fully familiar with it myself), it is more or less standard; try the free PAST software, or the readers of ResearchGate may help with references and software.
    Oh, it's John Barry Gallagher; it's just that I've been called Barry for as long as I can remember, hahaha. If you wish to communicate or collaborate further, you can contact me at UTAS in Tassie: johng6@utas.edu.au.
    Barry
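    For anyone unfamiliar with CUSUM, a bare-bones sketch of the idea is below. It is not the Taylor software and not necessarily the exact procedure described above: it simply orders the observations by flow, accumulates the deviations of concentration from the overall mean, and takes the largest excursion as a candidate change point in the concentration-discharge relationship. The function names and the data are made up.

```python
def cusum(values):
    """Cumulative sum of deviations from the overall mean."""
    mean = sum(values) / len(values)
    total, out = 0.0, []
    for v in values:
        total += v - mean
        out.append(total)
    return out

def candidate_change_point(flows, conc):
    """Order by flow, then return (index, flow) where |CUSUM| is largest."""
    pairs = sorted(zip(flows, conc))          # ascending flow
    sums = cusum([c for _, c in pairs])
    idx = max(range(len(sums)), key=lambda i: abs(sums[i]))
    return idx, pairs[idx][0]

# Made-up paired observations (flow in ML/day, concentration), in no particular order.
flows = [120.0, 5.0, 90.0, 12.0, 150.0, 8.0]
conc = [320.0, 900.0, 300.0, 1000.0, 310.0, 950.0]

idx, flow_at_change = candidate_change_point(flows, conc)
```

    Here the CUSUM peaks at the third observation in flow order (12 ML/day), which suggests the relationship changes somewhere between 12 and 90 ML/day. A proper change-point analysis would also test whether the change is statistically significant and then repeat the search within each segment.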
  • Mary-Anne Jones · Queensland Government
    Okay, thanks very much John Barry :)
  • Deborah Hilton · Deborah Hilton Statistics Online
    I think it is Armitage and Berry who have a good section on case-wise versus list-wise deletion of data. I could be wrong there; it is ages since I've read this.
