Question
Asked 2 February 2023

How to handle missing values in time series data?

Suppose one has a time series dataset with the columns item number, Date, and qty_item_sold. The frequency of the dataset is 'MS' (month start), and some months have missing values ('0.0') because there were no purchase orders for those items. How does one handle this type of dataset and prepare it for forecasting? Do we drop the rows containing the null values, or do we apply time series missing-data methods to fill them in?
I tried dropping the rows and applying statsforecast with models such as AutoARIMA, AutoETS, and Naive, but I don't think the models are forecasting the dataset properly.
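A minimal sketch of the alternative to dropping rows: keep every month on a gap-free 'MS' grid per item and record months without orders as 0.0. The column names follow the question; the toy values and the helper name `complete_monthly` are made up for illustration, and pandas is assumed.

```python
import pandas as pd

# Hypothetical toy data in long format (values invented for illustration).
df = pd.DataFrame({
    "item_number": [5, 5, 5],
    "Date": pd.to_datetime(["2015-01-01", "2015-03-01", "2015-06-01"]),
    "qty_item_sold": [10.0, 4.0, 7.0],
})

def complete_monthly(group: pd.DataFrame) -> pd.DataFrame:
    """Reindex one item's rows onto a gap-free month-start ('MS') grid."""
    full_index = pd.date_range(group["Date"].min(), group["Date"].max(), freq="MS")
    out = (
        group.set_index("Date")["qty_item_sold"]
        .reindex(full_index, fill_value=0.0)  # months without orders become 0.0
        .rename_axis("Date")
        .reset_index()
    )
    out["item_number"] = group["item_number"].iloc[0]
    return out

# One regular monthly series per item, stacked back into long format.
full = pd.concat(
    [complete_monthly(g) for _, g in df.groupby("item_number")],
    ignore_index=True,
)
print(full)
```

Dropping the rows instead leaves an irregular time index, which models that expect a fixed frequency may misread as consecutive months.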


All Answers (14)

Ette Etuk
Rivers State University
A possible solution is linear interpolation, especially when the gaps are not too wide.
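As a rough sketch of this suggestion, assuming pandas and a toy monthly series with invented values, one can check how wide the gaps are before interpolating linearly:

```python
import numpy as np
import pandas as pd

# Toy monthly series with genuinely missing observations (NaN), values invented.
idx = pd.date_range("2015-01-01", periods=6, freq="MS")
s = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0], index=idx)

# Length of the longest run of consecutive missing months.
is_na = s.isna()
run_id = (~is_na).cumsum()          # constant within each run of NaNs
longest_gap = int(is_na.groupby(run_id).sum().max())

# Linear interpolation between the neighbouring observed values.
filled = s.interpolate(method="linear")

print(longest_gap)
print(filled)
```

What counts as "too wide" is a judgment call; the gap length computed here only gives a number to base that judgment on.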
Precious Nwabueze
University of Kent
The gaps are too wide.
Anton Rainer
bundesministerium für finanzen, austria, vienna
What is the specification of your model? Without knowing what you want to forecast and how, it is impossible to give advice. It seems that you want to find trends for different goods (maybe with a seasonal component). In that case, one could perhaps change from monthly data to quarterly, semiannual, or annual aggregates.
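A small sketch of such an aggregation, assuming pandas and an invented monthly series for one item:

```python
import pandas as pd

# Toy monthly sales for one item (values invented); zeros are months without orders.
idx = pd.date_range("2015-01-01", periods=12, freq="MS")
monthly = pd.Series(
    [0, 53, 0, 0, 0, 20, 0, 0, 0, 0, 35, 0], index=idx, dtype=float
)

# Quarterly totals, labelled by the first day of each quarter.
quarterly = monthly.resample("QS").sum()

# Annual totals, grouped by calendar year.
annual = monthly.groupby(monthly.index.year).sum()

print(quarterly)
print(annual)
```

Coarser periods contain fewer zeros, at the cost of fewer observations to fit a model on.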
Precious Nwabueze
University of Kent
The data is non-seasonal.
Anton Rainer
bundesministerium für finanzen, austria, vienna
This makes things easier. Evidently, the data are sales to only one or a few buyers, who order when their stocks are running low. I recommend plotting the data to discover any regularities. For a trend calculation, I would take moving annual averages. By the way, so far I can only guess what you really want to analyse.
Anton Rainer
bundesministerium für finanzen, austria, vienna
"Moving annual averages" should mean "12-month moving averages", such that you have the average (or the sum) of the past 12 months for every month as a time series.
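A minimal sketch of a 12-month moving average and sum, assuming pandas and an invented two-year monthly series:

```python
import pandas as pd

# Two years of toy monthly sales: mostly zeros, with a few orders (values invented).
idx = pd.date_range("2015-01-01", periods=24, freq="MS")
monthly = pd.Series(0.0, index=idx)
monthly.iloc[[1, 7, 13, 19]] = [12.0, 24.0, 12.0, 24.0]

# For every month: sum and mean over that month and the 11 months before it.
# The first 11 months have no complete window and come out as NaN.
rolling_sum = monthly.rolling(window=12).sum()
rolling_mean = monthly.rolling(window=12).mean()

print(rolling_mean.dropna())
```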
Mayur Wanjari
Datta Meghe Institute of Higher Education & Research
Handling missing values in time series data can be challenging as missing values can impact the validity and reliability of the results. There are several methods to handle missing values in time series data, including:
  1. Interpolation: This method replaces missing values with estimates based on the values of surrounding time points. Common interpolation methods include linear interpolation and spline interpolation.
  2. Extrapolation: This method extends the time series data beyond the existing data points by using mathematical models to estimate missing values.
  3. Forward-fill and backward-fill: In forward-fill, missing values are filled in with the previous available value. In backward-fill, missing values are filled in with the next available value.
  4. Last Observation Carried Forward (LOCF): This method replaces missing values with the last observed value; it is the same operation as forward-fill.
  5. Multiple imputation: This method uses statistical methods to generate multiple estimates for missing values and then combines the results to produce a single estimate.
The choice of method will depend on the specific characteristics of the time series data, such as the frequency of missing values, the type of data, and the research question. In general, multiple imputation is considered the best method for handling missing values in time series data as it provides more robust results compared to other methods.
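The two simplest fill methods from the list can be sketched as follows, assuming pandas and a toy series in which NaN marks a genuinely unobserved month. Multiple imputation is not shown here because it needs a dedicated imputation model.

```python
import numpy as np
import pandas as pd

# Toy monthly series with missing observations (values invented).
idx = pd.date_range("2015-01-01", periods=5, freq="MS")
s = pd.Series([5.0, np.nan, np.nan, 8.0, np.nan], index=idx)

# Forward-fill (LOCF): carry the last observed value forward in time.
forward = s.ffill()

# Backward-fill: pull the next observed value backward in time.
# A trailing NaN stays missing because no later value exists.
backward = s.bfill()

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward}))
```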
Anton Rainer
bundesministerium für finanzen, austria, vienna
It seems that this is not really a missing value problem, but the problem of analysing time series with many (correct!) zeros: sales in only some months and no sales in the others (i.e. a stochastic process where not only the values are random, but also the time index). This is not easy to analyse with the usual econometric methods. If one replaced the zeros with filled-in values, one would get a highly misleading picture.
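To illustrate the distortion on invented numbers (a sketch assuming pandas, not the actual dataset), compare the true yearly total with the total after the zeros are treated as missing and interpolated:

```python
import numpy as np
import pandas as pd

# Toy monthly sales for one item; the zeros are real months without orders.
idx = pd.date_range("2015-01-01", periods=12, freq="MS")
monthly = pd.Series(
    [0, 53, 0, 0, 0, 20, 0, 0, 0, 0, 35, 0], index=idx, dtype=float
)

share_of_zero_months = (monthly == 0).mean()

# Wrongly treat the zeros as missing and fill them by linear interpolation.
as_missing = monthly.replace(0.0, np.nan)
filled = as_missing.interpolate(method="linear", limit_direction="both")

print("share of zero months:", share_of_zero_months)
print("true total:          ", monthly.sum())
print("total after fill-in: ", filled.sum())
```

The filled series reports sales in months where nothing was sold, so totals and trends computed from it overstate demand.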
Precious Nwabueze
University of Kent
What do you suggest, Anton Rainer?
Anton Rainer
bundesministerium für finanzen, austria, vienna
So far, I can only guess about the nature of your data and what you want to do with them. It seems to be the sales of one firm (measured in weight, length, volume, amount in ...?) to another firm, which itself sells from its stocks and reorders when its stocks have run down to a critical limit. Without more information, it is hardly possible to help you with the specification of a testable model.
Precious Nwabueze
University of Kent
This is the raw data I was practicing with. What is the best way to handle this dataset, Anton Rainer?
Anton Rainer
bundesministerium für finanzen, austria, vienna
First, you should know what the items are and in what unit the sales are measured (pieces, tons, barrels, ...?). You should know to how many buyers the goods are sold. You should also change the format of the data table to one row per month and one column per item number:
Year  Month  5  13864  13867  etc.
2015  1      0  0      0
2015  2      0  53     0      <== sales
.
.
2022  2
With these columns, one could make diagrams that might show some regularities, and one could calculate the average time span between sales and the average size of a sale.
Maybe there are connections between the items (substitutes, complements).
I fear that without knowing the nature of the items and of the client(s), one cannot analyse the data sensibly.
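The reshaping and the two averages mentioned above can be sketched as follows, assuming pandas. The item numbers are taken from the table; the dates, quantities, and the helper name `sale_summary` are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per recorded sale.
long = pd.DataFrame({
    "item_number": [5, 13864, 13864, 13864, 13867, 13867],
    "Date": pd.to_datetime([
        "2015-03-01", "2015-02-01", "2015-06-01",
        "2015-12-01", "2015-04-01", "2015-10-01",
    ]),
    "qty_item_sold": [7.0, 53.0, 20.0, 35.0, 9.0, 11.0],
})

# Wide table: one row per month, one column per item, 0.0 where nothing was sold.
all_months = pd.date_range("2015-01-01", "2015-12-01", freq="MS")
wide = (
    long.pivot_table(index="Date", columns="item_number",
                     values="qty_item_sold", aggfunc="sum")
    .reindex(all_months)
    .fillna(0.0)
)

def sale_summary(col: pd.Series) -> pd.Series:
    """Number of sales, mean months between sales, mean quantity per sale."""
    sold = (col > 0).to_numpy()
    sale_dates = col.index[sold]
    month_number = pd.Series(sale_dates.year * 12 + sale_dates.month)
    gaps = month_number.diff().dropna()   # empty if the item sold only once
    return pd.Series({
        "n_sales": float(sold.sum()),
        "mean_months_between_sales": gaps.mean(),
        "mean_sale_qty": col[sold].mean(),
    })

summary = wide.apply(sale_summary).T
print(summary)
```

An item with a single sale gets NaN for the mean interval, which itself signals that there is too little history for that item.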
Precious Nwabueze
University of Kent
Thanks so much, Anton Rainer. How do I reach you? Maybe through LinkedIn?
