
Ghost Imputation: Accurately Reconstructing

Missing Data of the Off Period

Reza Rawassizadeh, Hamidreza Keshavarz, and Michael Pazzani

Abstract—Noise and missing data are intrinsic characteristics of real-world data, leading to uncertainty that negatively affects the quality of knowledge extracted from the data. The burden imposed by missing data is often severe in sensors that collect data from the physical world, where large gaps of missing data may occur when the system is temporarily off or disconnected. How can we reconstruct missing data for these periods? We introduce an accurate and efficient algorithm for missing data reconstruction (imputation) that is specifically designed to recover off-period segments of missing data. This algorithm, Ghost, searches the sequential dataset to find data segments whose prior and posterior segments match those of the missing data. If there is a similar segment that also satisfies the constraint – such as location or time of day – then it is substituted for the missing data. A baseline approach results in quadratic computational complexity; therefore we introduce a caching approach that reduces the search space and improves the computational complexity to linear in the common case. Experimental evaluations on five real-world datasets show that our algorithm significantly outperforms four state-of-the-art algorithms with an average of 18 percent higher F-score.

Index Terms—Imputation, multivariate, pattern mining


1 INTRODUCTION

With recent technological advances and increases in computing capabilities, data-intensive scientific discovery is being widely used. This has led to the introduction of methods for analyzing data collected from multiple sources of information, i.e., "multivariate data".¹ One of the inevitable challenges of real-world data analysis is uncertainty arising from noise and missing data [16]. This uncertainty negatively affects the quality of knowledge extracted from the data.

Indeed, the burden imposed by missing data is often severe in

applications collecting data from the physical world, e.g.,

mobile sensing [33] or genome sequencing [6].

For example, consider battery powered devices, such as

smartwatches, equipped with inexpensive sensors such as

ambient light and accelerometer. Due to sensor quality, bat-

tery limits, and user preferences, context-sensing applications cannot continuously and effectively collect data [33], and there are often segments of missing data, e.g., when the device is turned off. These missing segments affect the quality of knowledge-extraction methods [31]. Although missing data reconstruction is an important requirement of these systems, it has not received much attention.

There are longstanding efforts in statistics [10], [40], [43]

to reconstruct missing data. These imputation methods

assume the missing data points occur at random, i.e., miss-

ing at random (MAR) or missing completely at random

(MCAR). If the data is missing not at random (MNAR) [40],

the imputation process is more challenging.

In this paper we propose a novel algorithm for imputation

of multivariate sensor data. This algorithm only uses (i) a con-

straint, such as time of the day or location, (ii) the data values

immediately prior to the missing event, and (iii) the data val-

ues immediately following the missing event. Since our

method does not rely on statistical methods, it might be able

to handle some MNAR data, but only if a similar segment

exists in the dataset.

In particular, our algorithm operates on multivariate and

sequential data streams. It reads two adjacent data segments – one before and one after the missing data (the missing segment) – and searches the dataset to find two segments similar to the adjacent segments of the missing segment. If the segment between these two similar segments is of the same length as the missing segment, it is a candidate recovery segment. Next, if the constraint values of the segment of interest match the constraint values of the missing segment, the algorithm substitutes the missing segment with the content of this candidate recovery segment. A naive approach

imposes a quadratic computational complexity, so we add a

pre-processing step that reads all data segments and their

indexes into a cache, achieving a linear computational com-

plexity in the common case.

The characteristics and contributions of our algorithm

are as follows.

1. Multivariate is slightly different than multimodal. Modality refers to the way an event happens or is experienced [3]. In other words, it refers to primary channels of sensation and communication, e.g., touch and vision.

R. Rawassizadeh is with the Department of Computer Science, Metropoli-

tan College, Boston University, Boston, MA 02215.

E-mail: rrawassizadeh@acm.org.

H. Keshavarz is with the Department of Electrical and Computer Engineering,

Tarbiat Modares University, Tehran, Iran. E-mail: keshavarz.h@modares.ac.ir.

M. Pazzani is with the Department of Computer Science, University of

California at Riverside, Riverside, CA 92521. E-mail: michael.pazzani@ucr.edu.

Manuscript received 15 Oct. 2018; revised 25 Apr. 2019; accepted 30 Apr.

2019. Date of publication 3 May 2019; date of current version 6 Oct. 2020.

(Corresponding author: Reza Rawassizadeh.)

Recommended for acceptance by S. Cohen.

Digital Object Identiﬁer no. 10.1109/TKDE.2019.2914653


Heterogeneous Multivariate Real-World Data. Statistical

imputation approaches are optimized to handle

numerical data [40], [43]. Real-world systems, how-

ever, produce data in numerical, categorical, or binary

forms. Our algorithm relies on a categorical abstrac-

tion of the original data by converting data values to

categorical symbols, e.g., by bucketing numeric data

into categories. Therefore, in contrast to statistical-

based imputation, any data type, regardless of its

distribution, can be fed into this algorithm, i.e., non-

parametric imputation. All datasets we employed in

this work are real-world datasets and most of them

(wearable, mobile, IoT, and news media datasets)

have not previously been used for imputation studies.

We recommend using this algorithm mainly for multivariate sensor data from consumer IoT and mobile devices, but to demonstrate its versatility, we also experiment with it on two other real-world datasets (clinical data and real estate data).

Instance-based Learning. Inspired by algorithms that learn from a single labeled instance [15], [23], our algorithm tries to estimate the missing data from the

ﬁrst similar instance that can be found in a sequential

search of the dataset. Clearly, relying on a single label

(similar instance) is prone to false positive errors.

Instead, we rely on a constraint, as a controlling

variable that signiﬁcantly reduces false positives. Our

deﬁnition of constraint is inspired by binary con-

straint in constraint satisfaction problems (CSP) [42],

but, unlike traditional CSP it is not used to reduce the

search space.

Search Space. Continuously collecting and storing

data can be expensive in terms of resource usage,

especially in battery-powered wireless devices. Data

is typically not stored long-term on these devices,

and most data processing is conducted in cloud serv-

ers [34]. Our algorithm can reconstruct the missing

data merely by ﬁnding the ﬁrst match for the missing

segment without a need to search the entire dataset.

For instance, we have used only three days for a

smartphone dataset and only seven days for a smart-

watch dataset. These datasets are fairly small, but in

both of these examples, our algorithm outperforms

state-of-the-art algorithms and reconstructs the miss-

ing data with higher accuracy.

Note that all versions of our algorithm, i.e., the baseline and cache-based ones, have only one effective parameter, the window size. In the evaluation section we identify an optimal window size value for each dataset. Therefore, based on the target dataset (or application), this parameter could be assigned automatically and there is no need for a user with domain knowledge to decide its optimal value. There is another parameter, ε, used for tolerating slight dissimilarity; we will demonstrate why users should not tolerate dissimilarity.

2 PROBLEM STATEMENT AND DEFINITIONS

A system produces a dataset over time. For example, a smartwatch system produces sequential data about its wearer's activity, physiology, and environment. The system

generates data intermittently, when it is on, but may be off

for extended periods.

Data value is a data unit in numeric, categorical or in

binary form. For our purposes we convert all such data

types into categorical form, so that all data values are repre-

sented by symbols from a ﬁnite alphabet. This has an inevi-

table impact on the accuracy of the data. However, all

imputation methods which can handle multivariate data

have such a limitation.

Dataset is a temporal sequence of data records, produced

by a single system over time. Because typical systems produce

multiple types of data (multivariate data), we envision each

record comprises one value from each type of data produced

by that system. We discretize time such that each data record $r_i$ represents a brief period of time beginning at time $t_i$, where $i$ represents the index of the record within the dataset ($1 \le i \le n$). We represent $r_i$ as a tuple comprising one data symbol for each of the $\ell$ streams produced by the system during time interval $t_i$, plus one special stream 0 called the constraint stream. Thus, we can write $r_i = (s_{i0}, s_{i1}, \ldots, s_{i\ell})$.

If a stream produces no data during time interval $t_i$, record $r_i$ includes the null symbol $\emptyset$ for that stream. For periods where the system is off, the records during that period will include the null symbol for every data stream.

Constraint stream is the portion of every data record $i$ presenting the value of the constraint, i.e., $s_{i0}$; for notational convenience, we refer to this stream as $S_0 = [s_{1,0}, s_{2,0}, \ldots, s_{n,0}]$. The constraint stream, however, is not produced by

dataset is assembled for post-processing; for example, the

constraint may be “time of day”. The choice of a suitable con-

straint depends on the kind of data being processed. Other

constraint examples are described in Section 4.1.

Segment is a subsequence of the dataset; thus, for dataset $d = [r_1, r_2, \ldots, r_i, \ldots, r_n]$ we can refer to a segment $d_{ij} \equiv [r_i, \ldots, r_j]$, for $1 \le i \le j \le n$.

Missing Segment. A segment in which all its data values (other than constraint values) are missing (null symbols); that is, $d_{ij}$ is a missing segment iff $\forall\, i \le k \le j,\ \forall\, 1 \le l \le \ell:\ s_{kl} = \emptyset$. A missing segment occurs when the target system is not collecting data at all, because it is off or unavailable, during time intervals $t_i$ through $t_j$.
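As an illustration only, a minimal R sketch of such a check is given below, assuming each record is a row of a data frame whose first column holds the constraint stream and whose remaining columns hold the data streams, with NA standing in for the null symbol (this layout and the name isMissing are our assumptions, matching the predicate used later in Algorithm 1):

```r
# Sketch only: assumed layout, column 1 = constraint stream,
# columns 2..(l+1) = data streams, NA plays the role of the null symbol.
isMissing <- function(d, i, j) {
  segment <- d[i:j, -1, drop = FALSE]   # data values of records i..j, constraint excluded
  all(is.na(segment))                   # TRUE iff every data value in the span is null
}
```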

Window size $w$ is a parameter of our algorithm; a small number $w$ such that a segment of $w$ records is a "window" into the dataset; window $d_i \equiv d_{i,\,i+w-1}$.

Prior window $\overleftarrow{d_{ij}}$ is the window of data immediately preceding a missing segment. If $d_{ij}$ is the missing segment, then the prior window is $\overleftarrow{d_{ij}} \equiv d_{i-w,\,i-1}$.

Posterior window $\overrightarrow{d_{ij}}$ is the window of data immediately following a missing segment. If $d_{ij}$ is the missing segment, then the posterior window is $\overrightarrow{d_{ij}} \equiv d_{j+1,\,j+w}$.

Segment Equality. Two segments are equal if they have the same size and identical values for all of their data values. For example, $d_{ij} \simeq d_{kl}$ iff $\big((l - k = j - i) \wedge (\forall\, 0 \le p \le j - i,\ \forall\, 0 \le q \le \ell:\ s_{i+p,q} = s_{k+p,q})\big)$.

Segment Similarity. Two segments are similar if they have the same size, the similarity check function returns true when comparing their prior and posterior segments, and they have identical values in their constraint stream. That is, segments $d_{ij} \approx d_{kl}$ iff $(l - k = j - i) \wedge \mathrm{sim}(\overleftarrow{d_{ij}}, \overleftarrow{d_{kl}}, \epsilon) \wedge \mathrm{sim}(\overrightarrow{d_{ij}}, \overrightarrow{d_{kl}}, \epsilon) \wedge (\forall\, 0 \le p \le j - i:\ s_{i+p,0} = s_{k+p,0})$.


Recovery segment $d_{rec}$ is a segment of data that can be substituted into the dataset in place of a missing segment. Our approach finds segments elsewhere in the dataset to use as recovery segments. For example, if $d_{ij}$ is the missing segment, and another segment $d_{kl}$ is similar ($d_{ij} \approx d_{kl}$), then $d_{kl}$ is a candidate recovery segment.

Cache $C$ is a table with one entry for each 'unique' window, listing the index (or indices) where that window occurs in the dataset. In the simplest case, one entry in the table looks like $(d_i, \{i\})$. If, however, there is another window $d_j$ in the dataset such that $d_j = d_i$, then the entry in the table would instead be $(d_i, \{i, j\})$. When complete, the cache is therefore the smallest set of entries $C \equiv \{(d_e, \{i \mid d_i = d_e\})\}$ such that the cache represents the whole dataset, $\forall d_i \in d,\ \exists (d_e, I) \in C$ with $d_e = d_i \wedge i \in I$, and such that the cache has as few entries as possible, i.e., no two distinct entries $(d_e, I), (d_f, J) \in C$ have $d_e = d_f$. If there were two windows $d_e = d_f$ in the cache $C$, their entries would simply be combined into one entry representing the union of the indices where that window occurs.

To better understand the cache, consider Fig. 1. There is a missing segment at index 6, that is, $d_{miss} = d_{6,8}$. Given $w = 2$, for the prior window $\overleftarrow{d_{miss}} = [(1,B,2,X), (1,E,1,X)]$, the cache holds two beginning indices: $\{4, 12\}$. This means that this window value has been repeated twice, at $d_4$ and at $d_{12}$. The same is applicable for the posterior window $\overrightarrow{d_{miss}}$, which appears at $d_9$ and $d_{17}$.

Problem. Given a missing segment $d_{miss}$, its prior $\overleftarrow{d_{miss}}$, its posterior $\overrightarrow{d_{miss}}$, and the constraint stream $S_0$, the objective of our algorithm is to accurately find a recovery segment $d_{rec}$ such that $d_{rec} \approx d_{miss}$, and replace $d_{miss}$ with $d_{rec}$.

Function 1. Segment Similarity. $\mathrm{sim}$ is a function that receives two segments $d_a$, $d_b$ and a confidence threshold $\epsilon$. This function uses the Jaccard index to compare both input segments. It is a binary comparison between the data members of each segment. If their similarity is greater than or equal to $\epsilon$, it returns true; otherwise it returns false. Since we are dealing with a binary comparison between the similarities of two sets of discrete data, Jaccard is the most suitable similarity measure for our application:

$$\mathrm{sim}(d_a, d_b, \epsilon) = \begin{cases} 1 & \text{if } \dfrac{|d_a \cap d_b|}{|d_a \cup d_b|} \ge \epsilon \\ 0 & \text{otherwise.} \end{cases}$$
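A minimal R sketch of this similarity check, under the assumption that each segment is passed as a flat vector of its discrete data members (so the Jaccard index is computed over the sets of symbols); the function name and signature below are illustrative rather than the paper's implementation:

```r
# Sketch only: segments as vectors of discrete symbols; epsilon is the similarity threshold.
sim <- function(da, db, epsilon) {
  jaccard <- length(intersect(da, db)) / length(union(da, db))
  jaccard >= epsilon                    # TRUE when the segments are similar enough
}

# Example: identical contents give a Jaccard index of 1
sim(c("1", "B", "2", "X"), c("1", "B", "2", "X"), epsilon = 1)   # TRUE
```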

Function 2. Segment Match. $\mathrm{cmatch}(\overleftarrow{d}, \overrightarrow{d}, d_{miss})$ is a function that, by using the hash table (cache), compares the offset distances of $\overleftarrow{d}$ and $\overrightarrow{d}$ with each other. If the distance between their offsets is exactly equal to the size of $d_{miss}$, then it returns "true"; otherwise it returns "false". More about this will be described when we present the caching algorithm.

3 ALGORITHM

Our algorithm has three main characteristics. First, it uses a

constraint value, which we assume is known for all records.

Consider a smartwatch app that continuously collects ambient

light; when the user puts her hand inside her pocket while

seated, the ambient light is zero. While she is sleeping, the

ambient light is also zero. Sleeping and sitting with hands

inside the pockets are two different behaviors. Therefore, an

algorithm that tries to estimate missing values of ambient light

should consider the time of day (a constraint) while performing imputation. Multiple-imputation methods often have better accuracy than a single-imputation approach, but are often less efficient. Our algorithm mitigates the accuracy issue by relying on the constraint.

The algorithm’s second characteristic is the discretization of

data. Discretization enables our algorithm to handle any data

type, including binary, numerical or categorical values. Dis-

cretizing numeric data may have some impact on accuracy

depending on the granularity [24]. To avoid decreasing the

accuracy further, our algorithm compares data objects based

on equality and does not use any other similarity metrics. We also discretize the timestamp into buckets using a temporal granularity [32] (e.g., if the granularity is five minutes, then 11:32 → 11:30), and to discretize time series data we use the SAX algorithm [24].
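As an illustration of these two discretization steps, the sketch below rounds timestamps down to a five-minute granularity and maps a numeric stream to a small categorical alphabet by equal-width bucketing; the real pipeline uses the SAX algorithm [24] for time series, so bucketCut is only a simplified stand-in and both function names are our own:

```r
# Sketch only: temporal bucketing (e.g., 11:32 -> "11:30") and a simplified
# equal-width stand-in for SAX-style symbol discretization.
bucketTime <- function(ts, minutes = 5) {
  secs <- minutes * 60
  format(as.POSIXct(floor(as.numeric(ts) / secs) * secs, origin = "1970-01-01"), "%H:%M")
}

bucketCut <- function(x, alphabet = c("a", "b", "c", "d")) {
  as.character(cut(x, breaks = length(alphabet), labels = alphabet))
}

bucketTime(as.POSIXct("2019-05-01 11:32:00"))   # e.g., "11:30"
```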

The third characteristic is its statistical distribution inde-

pendence, unlike several other state-of-the-art imputation

algorithms [8], [20], [41], which require the data to be nor-

mally distributed. In traditional imputation methods, MCAR

and MAR cause losses of statistical signiﬁcance, and MNAR

can cause bias in the result [11]. Recently, methods have been designed specifically to deal with MNAR [19], [28]. Since our

algorithm relies on the ﬁrst similar segment to reconstruct the

missing data and it does not use any statistical methods, it

might be able to reconstruct MNAR data, but only if a similar

segment exists in the dataset.

The quality of data reconstruction strongly depends on the choice of constraint. Therefore, it is important for the

user of the algorithm to be familiar with the dataset and have

some domain knowledge.

3.1 Baseline Algorithm

Fig. 1. (top) An abstract representation of a system with three data streams that goes offline from $t_n$ to $t_{n+2}$. (bottom) Based on the exactly similar posterior segment, prior segment and constraint, the missing segment has been identified from $t_m$ to $t_{m+2}$ and recovered.

To better understand the algorithm, we first describe its baseline version. Consider the dataset in Fig. 1 (top), where three streams produce data, i.e., $S_1$, $S_2$ and $S_3$. The constraint stream is $S_0$. From index 6 to 8 there is no data available, thus $d_{miss} = d_{6,8}$. Given a window size of two, the algorithm reads the prior window $\overleftarrow{d_{6,8}} = [(1,B,2,X), (1,E,1,X)]$, the posterior window $\overrightarrow{d_{6,8}} = [(2,D,3,Y), (2,C,4,\emptyset)]$, and the constraint content $[1,2,2]$. Then, the algorithm scans the dataset until it encounters a window similar to $\overleftarrow{d_{6,8}}$, which it finds at $d_{12}$; it next checks whether the corresponding window $d_{17}$ is similar to the posterior window of the missing segment (i.e., whether $d_{17} \approx \overrightarrow{d_{6,8}}$). If so, this newly found segment $d_{14,16}$ is a candidate recovery segment. If the constraint values in $d_{14,16}$, i.e., $[1,2,2]$, are equal to those in $d_{6,8}$, i.e., $[1,2,2]$, then this candidate is a recovery segment and it will be substituted for the missing data segment, as shown in Fig. 1 (bottom). The search process

stops immediately after a recovery segment has been found; if

there is more than one substitution possible, only the ﬁrst sim-

ilar segment will be chosen. Thus, this algorithm does not per-

form “multiple imputation” [38]. Also note that to construct

the missing data, our algorithm relies on the comparison to all

available data sources (multivariate) inside posterior and

prior segments. This means that each information source

inside prior and posterior will be a condition for comparison.

The algorithm can operate on a single source of information, but it will be more accurate when used with multivariate datasets.

The baseline algorithm is simple and thus we do not

describe its pseudo code in detail.
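For concreteness only, the following R sketch shows the baseline search for a single missing segment under the layout assumed in the earlier sketches (column 1 of the data frame d is the constraint stream, NA is the null symbol, i..j is the missing segment and w the window size); the function name and details are illustrative, not the paper's code:

```r
# Sketch only: scan the dataset for the first candidate whose prior window, posterior
# window and constraint values match those of the missing segment d[i:j, ].
baselineRecover <- function(d, i, j, w) {
  n <- nrow(d); len <- j - i + 1
  prior <- as.matrix(d[(i - w):(i - 1), ])          # prior window of the missing segment
  post  <- as.matrix(d[(j + 1):(j + w), ])          # posterior window of the missing segment
  for (k in (w + 1):(n - w - len + 1)) {            # candidate beginning offsets
    if (k == i) next                                # skip the missing segment itself
    cand <- d[k:(k + len - 1), ]
    if (any(is.na(cand))) next                      # candidate must itself be complete
    ok <- all(as.matrix(d[(k - w):(k - 1), ]) == prior) &&
          all(as.matrix(d[(k + len):(k + len + w - 1), ]) == post) &&
          all(cand[[1]] == d[i:j, 1])               # constraints must match exactly
    if (isTRUE(ok)) return(cand)                    # first similar segment is used
  }
  NULL                                              # no recovery segment exists
}
```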

3.2 Baseline Computational Complexity

The computational complexity of the algorithm is propor-

tional to the number of missing segments, and to the amount

of data searched to ﬁnd a recovery segment for each missing

segment. Assuming there are $n$ tuples in the dataset, the algorithm can find missing segments in a linear scan of the dataset, in $O(n)$ time; in a worst-case scenario it may need to scan the whole dataset to find a recovery segment for each missing segment, also in $O(n)$ time. This means that the overall computational complexity is quadratic, i.e., $O(n^2)$. Since this is

not optimal, especially to run on battery powered devices,

we should improve it. By using a cache we can mitigate this

performance overhead.

3.3 Caching Algorithm

Caching is widely used to improve the performance of algorithms [9]. Here we adopt a caching mechanism that reduces the search space and thus the time complexity. To implement the cache we use a dictionary data structure that hosts all unique window values and their beginning

indices. Note that missing records are skipped when build-

ing the cache of window values.

By using a hash table to implement the dictionary, return-

ing a list of indices where this window value occurs in the data-

set, the cache allows the algorithm to consider only a limited

set of possible candidate recovery segments. In other words,

using the cache reduces the search space from two different

perspectives: (i) instead of searching the entire dataset that

includes repetitive data it searches a table of window values

that are all unique with no repeated data. (ii) Instead of

searching the dataset to ﬁnd the similar posterior and prior

segments, it first searches the list of indices for matches to the prior window; then, if a match is found, it checks the size match between the missing and candidate recovery segments; then, if both previous conditions are true, it checks the posterior segments. If all three conditions are true, it con-

siders the candidate recovery segment as a ﬁnal recovery

segment for the missing segment.

Given the window size of two, Fig. 2 is an abstract repre-

sentation that shows how the algorithm builds the cache. In

particular, the cache creation process scans the dataset,

sequentially examining each window, updating the dictionary

to add an entry for new window values or add a new index to

an existing dictionary entry.

The table at the bottom of Fig. 2 presents the dictionary as it

is being created. For the sake of simplicity, at the beginning we

show all possible four data segments, then for the rest we only

highlight repeated data segments. For the purpose of explana-

tion, we have assigned a name to each window, a1,b2,... and

because windows are overlapping we have used different colors and names (a, b) to distinguish them from each other.
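A minimal R sketch of this cache-construction step, under the same illustrative layout as before (records as rows of a data frame, NA as the null symbol); the dictionary is kept as a named list whose keys are serialized window values and whose entries are vectors of beginning indices, which is one possible realization of the hash table described here:

```r
# Sketch only: build the dictionary of unique window values and their beginning offsets.
buildCache <- function(d, w) {
  cache <- list()
  n <- nrow(d)
  for (i in 1:(n - w + 1)) {
    win <- d[i:(i + w - 1), ]
    if (any(is.na(win))) next                       # missing records are skipped
    key <- paste(as.matrix(win), collapse = "|")    # serialized window value
    cache[[key]] <- c(cache[[key]], i)              # append this beginning index
  }
  cache
}
```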

Algorithm 1. Cache-Based Implementation of the Algorithm

Data: dataset $d$ with $n$ records $r_i$ (where $1 \le i \le n$), window size $w$, cache $C$, similarity threshold $\epsilon$
Result: updated $d$

// outer loop: scan $d$ to find missing segments
1: $i := w + 1$ // we skip over the first window
2: while ($i \le n - w$) do
     // scan to find contiguous missing segment starting at $i$
3:   $j := i$
4:   while (isMissing($d_{i,j+1}$) $\wedge$ ($j < n - w$)) do
5:     $j := j + 1$
6:   if (isMissing($d_{ij}$)) then
7:     $l := j - i + 1$ // length of missing segment
8:     $\overleftarrow{d_{ij}} := d_{i-w,\,i-1}$ // prior window
9:     $\overrightarrow{d_{ij}} := d_{j+1,\,j+w}$ // posterior window
       // use cache to iterate over the set of indices where $\overleftarrow{d_{ij}}$ is found in $d$, if any
10:  forall $k \in$ csearch($C$, $\overleftarrow{d_{ij}}$, $\epsilon$) do
       // exclude candidates close to the end of the dataset
11:    if ($k + w + l - 1 \le n - w$) then
         // candidate recovery segment
12:      $d_{rec} := d_{k+w,\,k+w+l-1}$
         // check whether posterior windows match based on the epsilon threshold
13:      if (sim($\overrightarrow{d_{rec}}$, $\overrightarrow{d_{ij}}$, $\epsilon$)) then
           // now check whether constraints match; recall $s_{il}$ are the fields of record $r_i$
14:        if ($\forall\, 0 \le p \le j - i:\ s_{i+p,0} = s_{k+w+p,0}$) then
15:          $d_{ij} := d_{rec}$ // substitute into the missing segment
16:          break
17: $i := j + 1$ // continue scan after this segment

Fig. 2. The cache creation process that loads the dataset into the hash table. It uses an overlapping sliding window, equal to the given window size ws, to read all possible data segments and their beginning offsets.

The current implementation of the cache construction adds windows into the dictionary with a simple check. If the window exists in the dictionary, the cache updates the existing entry by adding the new index into the existing list of indices. Therefore, there will be no collisions, unlike when using hash functions.

In the common case, there will be a small number of indices in each dictionary entry. Therefore, if the dictionary is implemented as a hash table, the insert and lookup time will be $O(1)$, but we have the risk of collisions. In our algorithm, the process of reading the dataset and caching its windows is a one-time pre-processing step: it occurs once at the beginning and is not repeated. Nevertheless, we consider this overhead in our evaluation. In the worst case, its computational complexity is $O(n^2)$, but in common cases it is close to $O(n)$. In particular, unusual or degenerate datasets could lead to $O(n^2)$ behavior if, for instance, all data windows have the same value and the dictionary has only one entry with a list of $O(n)$ indices. After creating the dictionary (cache), we can now run a more efficient algorithm to find and replace missing segments, as shown in Algorithm 1.

Note that most of the overhead comes from scanning the content of the dictionary and comparing the cached values, which a hash function cannot improve. Besides, our algorithm assumes that the given dataset has no predefined structure, such as ordered data streams. Considering the space and time complexity, ordering the data stream is not cost-effective. Moreover, we only need to store unique data segments (not all of them) in the cache. Therefore, we do not recommend using a prefix tree or any other type of tree, due to the space and time inefficiency required for ordering data or relying on the structure of the data.

In the first step, the algorithm iterates through the dataset and identifies missing segments. It checks the placement of the missing segment to locate its posterior and prior in line 4. Then, it reads the associated prior and posterior segments in lines 8 and 9. From line 10 the algorithm starts to search the cache to find offsets of the prior segment or a similar offset (it tolerates dissimilarity based on $\epsilon$). In line 11 it excludes prior segments which are close to the end of the dataset, because there would be no associated posterior for them. Next, based on the prior segment and the window size, it loads the candidate recovery segment into $d_{rec}$ in line 12. However, the algorithm is still not sure whether this is a correct recovery segment, and it should check the posterior as well. Line 13 compares the posterior segment of $d_{rec}$ with the posterior of the missing segment $d_{ij}$. If they match (based on the given epsilon), the next step is to check their constraints, which should be identical. Note that constraint matching checks for exact equality and there is no tolerance (no $\epsilon$ is used). If they match as well, this means that all three conditions are true, and in line 15, $d_{rec}$ is substituted for the identified missing segment.
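Putting the pieces together, the sketch below is a compact R rendering of these steps under the illustrative layout and helper sketches introduced earlier (buildCache, isMissing, sim); names such as ghostImpute and windowKey are ours, and the sketch follows Algorithm 1 line by line rather than reproducing the authors' implementation:

```r
# Sketch only: cache-based scan-and-substitute loop of Algorithm 1.
windowKey <- function(d, i, w) paste(as.matrix(d[i:(i + w - 1), ]), collapse = "|")

ghostImpute <- function(d, w, epsilon = 1) {
  cache <- buildCache(d, w)
  n <- nrow(d)
  i <- w + 1                                          # line 1: skip over the first window
  while (i <= n - w) {                                # line 2
    j <- i                                            # line 3
    while (j < n - w && isMissing(d, i, j + 1)) {     # line 4: grow the contiguous missing run
      j <- j + 1                                      # line 5
    }
    if (isMissing(d, i, j)) {                         # line 6
      len  <- j - i + 1                               # line 7
      post <- d[(j + 1):(j + w), ]                    # line 9: posterior window
      for (k in cache[[windowKey(d, i - w, w)]]) {    # line 10: offsets where the prior recurs
        if (k == i - w) next                          # skip the missing segment's own prior
        if (k + w + len - 1 <= n - w) {               # line 11: candidate fits inside the dataset
          rec     <- d[(k + w):(k + w + len - 1), ]   # line 12: candidate recovery segment
          recPost <- d[(k + w + len):(k + w + len + w - 1), ]
          postOk  <- sim(unlist(recPost), unlist(post), epsilon)   # line 13
          consOk  <- all(rec[[1]] == d[i:j, 1])       # line 14: constraints match exactly
          if (isTRUE(postOk) && isTRUE(consOk)) {
            d[i:j, -1] <- rec[, -1]                   # line 15: substitute the recovery segment
            break                                     # line 16
          }
        }
      }
    }
    i <- j + 1                                        # line 17
  }
  d
}
```

With epsilon set to 1, the sim sketch demands full Jaccard overlap of the posterior windows, which approximates the exact-similarity setting recommended by the paper.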

Fig. 3 provides an example to understand the cmatch function. Assume $\epsilon = 0$, which means data segments should be identical (partial similarity is not tolerated), and $ws = 2$ is the window size. There we have a dataset and we show two snapshots of it, each with three segments. The first one includes the missing data segment and the second one includes its recovery data segment. The beginning offset (index) of the prior segment $x$ is 25. The missing event occurs at 27 and continues until offset 30.

According to the cache, data segment $x$ (prior) existed at offsets 16, 25 and 52, and data segment $y$ (posterior) existed at offsets 20, 24, 30, 60 and 67. The cache algorithm first subtracts the first prior plus $ws$ (16 + $ws$) from the first posterior, 20. Since the result is not equal to the missing size, i.e., 3, the cursor moves to the next $x$. Now it subtracts the next prior plus $ws$ (25 + $ws$) from the posterior 20; since the result is negative and smaller than the missing size, the cursor shifts to the next posterior, i.e., 30. The subtraction of 25 + $ws$ from 30 does fit, but this prior belongs to the missing segment itself, thus the algorithm uses the next prior, i.e., 52. This procedure continues until it reaches 60 from $y$ and 52 from $x$, and the difference is similar to the missing data size. Then the values from index 57 to 59 are read from the dataset to construct the recovery data segment.
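To make the offset arithmetic concrete, the sketch below implements the matching rule suggested by this walkthrough: a posterior occurrence y and a prior occurrence x delimit a candidate recovery segment when y − (x + ws) equals the missing length. The offsets in the example call are hypothetical, not the exact values of Fig. 3:

```r
# Sketch only: offset-based matching of cached prior (xs) and posterior (ys) occurrences.
# The missing segment's own prior offset should be excluded from xs before calling.
cmatch <- function(xs, ys, ws, missLen) {
  for (x in xs) {
    for (y in ys) {
      if (y - (x + ws) == missLen) {
        return(c(from = x + ws, to = y - 1))   # offsets of the candidate recovery segment
      }
    }
  }
  NULL                                          # no prior/posterior pair fits the missing length
}

cmatch(xs = c(16, 52), ys = c(20, 57), ws = 2, missLen = 3)   # from = 54, to = 56
```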

3.4 Computational Complexity with Cache

The computational complexity of the baseline algorithm is super-linear, and in the worst case it is quadratic. As described, assuming that there are $m$ missing segments in the dataset and each recovery segment is located at the $k$th element of the dataset, we have $O(k \cdot m)$. By using the cache-based optimization, the algorithm instead searches the hash table. The size of the hash table is the size of the dataset divided by the size of the sliding window, i.e., $n/ws$. Since the sliding windows are overlapping, instead of $n/ws$ we should multiply it by $ws - 1$; therefore the computational complexity is $O((n/ws) \cdot (ws - 1))$. Besides, there is a very small overhead to search the offset lists for each data segment; its impact is near zero and we do not count it as an overhead of the algorithm. As a result, we can summarize that, together with the cache construction phase, the overall algorithm runs in near-linear computational complexity, and in the worst case it is super-linear. The experimental evaluations analyze this in more detail.

In the Experimental Evaluation section, we provide a detailed analysis of the differences between using and not using the cache, their impact with respect to dataset size, and so forth.

4 EXPERIMENTAL EVALUATION

Before describing our experimental evaluation, ﬁrst we

describe the datasets that we use. Then, we introduce the state-

of-the-art algorithms that will be used for comparison with

our algorithms. Next, the accuracy analysis will be described

in detail, followed by a detailed efﬁciency analysis.

All experiments reported in this section were conducted

on a MacBook Pro laptop, with 2.8 GHz Intel Core i7 CPU,

16 GB memory and an SSD hard disk. We implemented the algorithm, which we call Ghost, in R version 3.5.

Fig. 3. A toy example that shows data segments and their offsets (indices) read and cached into the dictionary.

4.1 Datasets

Here we choose to evaluate our method on ﬁve real-world mul-

tivariate datasets. These datasets include mobile lifelogging, wearable smartwatch, Internet of Things (IoT) smart home, clinical data and online news data about the European real-estate market. We designed this algorithm with the intention of reconstructing sensor data for inexpensive consumer-electronic devices, such as smartphones or IoT devices. However, to demonstrate the versatility of our algorithm, we experiment with it on clinical data and online news data as well.

The mobile, wearable and IoT datasets have a timestamp

in each record. We converted each timestamp to a time of the

day with a temporal granularity of ﬁve minutes [32]. This

approach was inspired by the Allen temporal logic [1], which

treats temporal data as discrete time intervals, and it is differ-

ent than time series data. We converted time series data

(such as ambient light or accelerometer numerical data) to

characters with the SAX algorithm [24].

Wearable Dataset (WD). Because wearable devices are

small, they have limited battery and limited sensor capabili-

ties [35]. To preserve battery, their operating system shuts

down background services frequently, which can result in

signiﬁcant data loss during the data collection process. We

use eight days of data for a sample user of the "Insight for Wear" smartwatch app [36].² The constraint is time of the day.

Online News Data (ND). Financial applications use news to

predict market ﬂuctuations. Recent political issues in Europe

such as Brexit and European policy regarding refugees have

led to signiﬁcant ﬂuctuations in European real-estate markets.

These ﬂuctuations and their patterns are not continuously

reported in the news media. Therefore, imputation can be

used to construct missing market data from online news

media. We use 3,000 real-estate records, extracted from 500,000 news articles from five years (2012 to 2017) of several German online news media sources. We acquired this data from a market prediction start-up, eMentalist,³ and ordered the records based on their date of publication. The constraint in this dataset is region (country name) and the records are ordered chronologically.

Mobile Sensing Dataset (MD). Due to the proximity of

mobile phones to their users, it is not possible to continuously

collect contextual data at all times from the user [33], and the

imputation can be used to reconstruct missing mobile sensing

data. We use only three days of data from one user,⁴ collected using an open source lifelogging tool, UbiqLog [37]. The constraint is time of the day.

Smarthome Dataset (SD). One well-known use of IoT devices

is in a household setting, i.e., smart-homes. Similar to mobile

and wearable devices, inexpensive sensors that will be used

for the smarthome conﬁguration are prone to malfunction

and disconnection. We use the UMass Smart Home dataset [4], which includes information from different sensors inside and outside two homes. The constraint is the time of the day.

Clinical Data (CD). A traditional application of imputa-

tion is clinical data [2]. To demonstrate the generalizability

of our algorithms we evaluate our algorithm on a dataset of

visits of diabetes patients to 130 US hospitals from 1998-

2000 [49]. The constraint is a combination of age group, race and gender. Since our algorithm operates on sequential data, we order this dataset based on the constraint and the encounter sequence of patients, then run the imputation algorithm. Note that for this dataset we define an additional constraint that the sequence of data constituting the prior, missing and posterior segments should belong to one single patient only; a second patient will not be included in the same segment.

As described, we chose a small subset of data from the wearable and mobile datasets that does not follow a normal statistical distribution, i.e., a multivariate Shapiro-Wilk [45] test rejected the null hypothesis of normality. We made such a selection on purpose, to demonstrate that we can still reconstruct the missing data with superior precision despite not having a Gaussian distribution. Table 1 summarizes our experiment datasets. The variety of constraints demonstrates the versatility of our algorithm in different settings and its independence from an explicit notion of time.

To quantify the characteristics of our datasets, we conducted a Shapiro-Wilk test [45] to identify whether each attribute is normally distributed. Besides, we used Shannon entropy [44] to identify the level of uncertainty; Shannon entropy can be interpreted as the predictability of the dataset. For the sake of space, we do not report these results in detail, but none of the datasets has all of its attributes normally distributed. Therefore, we cannot conclude whether the selected datasets are highly predictable or unpredictable.

4.2 State of the Art Algorithms

We compared the efﬁciency and accuracy of our algorithms

with four well-known state-of-the-art imputation algo-

rithms. The criteria for our selection is to support categorical

data and not just numerical data, and to support multivari-

ate imputation, as is needed for real-world applications. We

choose two well-known algorithm that use machine learn-

ing, i.e., mi [50] and missForest [47], and two that use statis-

tical inferences, Amelia II [20] and MICE [8].

TABLE 1
Experiment Datasets and Their Attributes

Dataset           | Constraint  | Data Streams                                                                              | #Rec.
Wearable          | Time of Day | Battery Utilization, Ambient Light, Average Number of Steps                              | 1913
Online News Media | Country     | Topic, Influence Level of the Topic, Sentiment of the Content, Year and Month, Sub-Topic | 3000
Mobile            | Time of Day | Accelerometer, Ambient Light, Battery Use, Screen Interactions                           | 2366
Smart-Home        | Time of Day | Inside Temperature, Outside Temperature, Raining Status, Door Status (open or closed)    | 25905
Clinical          | Patient ID  | A1C Result, Insulin Level, Diabetes Medication Use                                       | 101768

2. http://insight4wear.com

3. http://ementalist.net

4. http://archive.ics.uci.edu/ml/datasets/UbiqLog+%28smartphone+lifelogging%29


"Amelia" [20] operates by bootstrapping expectation max-

imization and imputation posterior methods. Its imputation

posterior method draws random simulation on normally dis-

tributed data. It assumes that the variables of the multivari-

ate dataset are jointly following a normal distribution. It is a

useful and widely used method to reconstruct missing data

in data streams. “MICE” [8] performs multivariate imputa-

tion by using chained equations. The chained equations use joint modeling and fully conditional specifications

[43]. Joint modeling speciﬁes a multivariate distribution and

performs imputation from their conditional distributions by

Markov chain Monte Carlo simulation. “mi” [50] creates an

Approximate Bayesian model from the data, then it recon-

structs data from the conditional distribution for each data

object given the observed and reconstructed data of the other

variables in the dataset. “missForest” [47] treats data as

a multi-dimensional matrix and performs the imputation

by predicting the missing value using the random-forest

algorithm trained on observed parts of the dataset.

To have a fair comparison we experiment with different

parameters several times and choose the best settings for

each state-of-the-art algorithm. Moreover, we have used

only one run for algorithms that support multiple runs of the imputation.

4.3 Accuracy

To measure the accuracy of the imputation process in our

algorithm, we have randomly removed records from each

dataset. In particular, we removed 10 to 100 data records from each dataset (10, 20, 30, ..., 100) and compared the imputed dataset with the original dataset. The accuracy estimation is reported as precision, recall and F-score. True positives are missing data that successfully get recovered. False positives are missing data that have been identified and recovered but whose original value is different from the recovered one. False negatives are missing data that have not been recovered. True negatives are data that are not missing and also not recovered, which are obvious in our scenario; all algorithms can easily identify true negatives. Therefore, in most analyses we report the F-score, which is independent of true negatives.
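As a small illustration of this scoring, the sketch below counts true positives, false positives and false negatives over the removed data values and derives precision, recall and F-score; it counts per data value rather than per record, and all names are illustrative:

```r
# Sketch only: score an imputed dataset against the original over the removed rows.
scoreImputation <- function(original, imputed, removedIdx) {
  recovered <- !is.na(imputed[removedIdx, -1])                     # a value was filled in
  correct   <- imputed[removedIdx, -1] == original[removedIdx, -1] # ... with the original value
  tp <- sum(recovered & correct,  na.rm = TRUE)
  fp <- sum(recovered & !correct, na.rm = TRUE)
  fn <- sum(!recovered)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(precision = precision, recall = recall,
    fscore = 2 * precision * recall / (precision + recall))
}
```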

We provide five analyses of the accuracy, four of which are focused on the characteristics of the algorithm, including the constraint impact, the sliding window parameter sensitivity, the dissimilarity ($\epsilon$) impact on precision and recall, and the impact of the missing length (or duration) on accuracy. The fifth analysis is focused on a comparison of our algorithm with the aforementioned state-of-the-art algorithms. It is notable that there is no difference between the accuracy of the baseline and cache-based versions of the algorithm; using a cache does not have any impact on the accuracy of the algorithm.

4.3.1 Tolerating Dissimilarity Impact on Accuracy

The data objects are multivariate and heterogeneous. There-

fore, the algorithm should treat them as discrete data objects.

In case the user intends to tolerate a dissimilarity, the algorithm uses the Jaccard index to compare the contents of two data segments. The parameter $\epsilon$ is used to tolerate the dissimilarity of prior and posterior segments. Increasing the dissimilarity tolerance increases the recall and enables the algorithm to identify more missing data segments. It, however, decreases the precision of the result significantly.

Fig. 4 presents precision and recall of our experiment

datasets with window size of two, while tolerating one (n-1)

or two (n-2) data object dissimilarity. Later we will describe

why we set window size to two. We choose to remove only

two data objects at most, because by removing either one or two data objects we already get near-ideal recall for all datasets (more than 0.8). This figure presents an average of precision and recall, i.e., we calculate the precision and recall for all ten record-removal settings (10 record removals, ..., 100 record removals) and report the mean of the results.

Based on the very low precision we obtained by slight toleration of dissimilarity, it is clear that exact similarity is favored by this algorithm, and tolerating dissimilarity, even a small dissimilarity, is not recommended. Furthermore, the randomness of the results between (n-1) and (n-2) questions the credibility of tolerating dissimilarity as well. Therefore, we highly recommend avoiding any dissimilarity tolerance in this algorithm.

4.3.2 Constraint Impact on Accuracy

We introduced constraints as a means to reduce false-positive

errors. To demonstrate the impact of constraints, we mea-

sured the accuracy of our algorithm with and without use of the constraint. Fig. 6 demonstrates the superior accuracy of using the constraint over not using it on the five datasets we chose for testing. Although not using a constraint increases the recall

in all datasets except news media, it also increases the false

positive errors.

The superior recall of not using a constraint over using the constraint in the news media dataset is due to the tight connection of real-estate changes in the German regions of Europe. Most of the real-estate news was about either Germany or the UK. Therefore, the constraint plays an auxiliary role and increases recall; unlike in the other datasets, it does not act as a filter for the data.

4.3.3 Sliding Window Parameter Sensitivity

Fig. 4. A comparison between precision and recall of using exact similarity (n) versus tolerating one dissimilarity (n-1) and two (n-2) dissimilarities.

5. $\epsilon$ is also a parameter, but we demonstrate that it is not useful to tolerate dissimilarity.

The only effective parameter that this algorithm uses is the window size.⁵ We analyzed precision, recall and F-score

for each dataset with different window sizes and different numbers of missing records, i.e., 10, 20, ..., 100. Then we take an average of the precision, recall and F-score for each dataset and each window size and report it in Fig. 5. We evaluate the algorithm for four different window sizes, i.e., 1, 2, 3 and 4. Based on the results in Fig. 5 we can recommend that the optimal window size for these datasets is two or three. A window size of two or three was optimal for the sensor-based datasets (mobile, wearable and smarthome). For the clinical dataset, a window size of two had the highest accuracy. For the news media dataset, a window size of three had the highest accuracy. Nevertheless, we recommend that users experiment with the algorithm using different window sizes for their target dataset. The relation between window size and precision is monotonic, not linear. Our initial assumption was that recall would be improved by a smaller window size, but due to the real-world nature of the data such an assumption is not valid, and the largest window size, i.e., 4, decreases the accuracy.

4.3.4 Length of the Missing Segment and Accuracy

One question that might arise is the flexibility of the algorithm with respect to the size of the missing data segment. In other words, while the target system is off, how much does the accuracy of our imputation algorithm decrease based on the duration of the off period? To answer this question, we experimented with removing sequential runs of data. Using 100 missing records, we experimented with missing segment sizes of 2, 3, 5 and 10 records. For instance, a missing size of 2 indicates 50 missing sequences, each with a length of two records; a missing size of 3 indicates 33 missing sequences, each with a length of three records. The result is presented in Fig. 8, which shows precision and recall based on the missing length in a sequence. As can be seen from Fig. 8, there is a slight decrease in precision among the datasets across the smaller missing lengths, and then a significant decrease with 10 missing records, where the precision of the mobile and smarthome datasets drops to zero. The recall is decreasing in some datasets, but in others it is not predictable and we cannot generalize it. In summary, we can argue that the longer the off time, the less accurate the imputation process will be. This reveals that the imputation process has some sensitivity to the length of the missing data.

4.3.5 Comparison with Other Algorithms

To compare the accuracy of our algorithm with state-of-the-art

algorithms, we chose the optimal window size from the pre-

ceding evaluation and the optimal parameter settings for

state-of-the-art algorithms. Fig. 7 summarizes the F-score for

imputation by each of the algorithms. As shown in this figure, our algorithm significantly outperforms the other algorithms in terms of F-score, except on the clinical dataset. The recall of all other algorithms is ideal, because (i) it is easy to identify all missing data and (ii) they provide a substitution for all missing segments (although often incorrectly). We can

easily change our algorithm and substitute exact similarity with another similarity metric and achieve perfect recall. However, this approach introduces a high false-positive rate. False-positive errors are severe errors for imputation algorithms and can bias the imputed data [19].

Fig. 5. Window size parameter sensitivity analysis.

Fig. 6. A comparison between using or not using the constraint.

Fig. 7. Comparison of accuracy between our algorithm with constraint and state-of-the-art methods.

Notably, the missForest algorithm returns zero precision

for the mobile dataset, due to the screen interaction data.

This data is rarely available and most of its time slots are

filled with zero values. Such a sparse stream of data affects the random forest algorithm, which cannot handle too many

null variables for a dataset. Although the missForest

algorithm and our algorithm had a comparable F-score for

the news media dataset, our algorithm had 25 percent higher

precision. On average, even considering the low score of our algorithm for clinical data, it had about 18 percent higher F-score (across all five datasets) than the other algorithms.

Not performing better than missForest and mi for clinical data could be due to the non-sequential nature of that dataset. Our algorithm is designed for datasets that include sequential data.

Note that fluctuations in Fig. 7 are due to the randomness of the missing data, and it is not possible to get a smooth line by repeating the experiment. Our algorithm performs significantly better than the other algorithms on the datasets that have time as a constraint.

4.4 Efﬁciency

To demonstrate the efﬁciency of our algorithm, we analyze

the execution time and memory use of the algorithm in dif-

ferent settings.

First we analyze the response time and memory use of

the algorithm and the impact of dataset size on its perfor-

mance. Next, we compare the memory use and response

time of our algorithm with state-of-the-art algorithms. To

understand the impact of the number of missing data on the

memory use and response time, we report the experiment

with 10 to 100 missing records for each dataset. This helps

to quantify the sensitivity of imputation algorithms based

on the number of missing data.

4.4.1 Cache versus Baseline: Execution Time

Our cached optimization reduces the search space, but

increases memory use. We report the memory use and

response time of both versions of the algorithm for different

data sizes varying from 500 to 10,000 records, i.e., 500, 1,000,

2,000,... and 10,000. To prevent any bias originating from the

dataset structure, we created a synthetic dataset with four

data streams (data values varied from 1 to 10) and one stream

as a constraint. We keep the number of random missing

records ﬁxed at 100 records in all datasets. Nevertheless, indi-

vidual missing records were randomly distributed among all

dataset records. Fig. 9 shows response times (in seconds) for

different window sizes between the baseline and the cache.

As shown in Fig. 9, increasing the dataset size increased the response time for both the baseline and the cache. The increase in response time for the cache version, however, was significantly lower than that of the baseline.

Moreover, increasing the window size lengthened the response time, because a comparison is done for each object (due to the use of exact similarity), and as the window size gets larger, there are more objects to compare.

4.4.2 Cache versus Baseline: Memory Use


Fig. 8. Missing data length impact on accuracy.

Fig. 9. Response time impact of using cache rather than baseline method, with different window sizes.


Fig. 10 shows that the baseline algorithm uses only 112 KB of memory, independent of the dataset size. The cache version starts from 508 KB and grows slowly based on the size of the dataset. At 10,000 records the cache version occupies about 856 KB of memory. Developers who are planning to use this algorithm can consider compromising on the amount of memory in order to have a shorter execution time, or vice versa.

4.4.3 Memory Use Comparison with Other Algorithms

Since all algorithms read the dataset into memory, we ignore

that memory use and focus on additional memory use by the

algorithm. Our experiments show that there were no significant differences in memory use when using different window sizes. Therefore, we report the memory overhead independent of the window size; see Fig. 10. The memory allocation

policy in our implementation was based on the R compiler.

The memory overhead of the algorithm alone was insigniﬁ-

cant. In particular, the baseline version was memory efﬁcient,

and only used 112 Kilo Bytes (KB) of memory, independent

from the dataset size, because the dataset was not in memory.

The cache version used more memory, of course. At 10,000

records ($w = 2$) the cache version occupies about 856 KB of memory. Developers who are planning to use this algorithm can consider compromising on the amount of memory in order to

have a shorter execution time or vice versa.

Note that for this experiment we selected only 3,000 records from the clinical and smarthome datasets, because the "mi" algorithm's memory utilization grows exponentially as the size of the dataset grows and it cannot operate on large datasets. Besides, based on the finding of the previous section that window sizes of two and three provide the highest accuracy, we set the window size to two.

Although the baseline algorithm uses less memory, we refrain from comparing it with the other algorithms due to its slow response times. A large portion of the memory overhead came from the cache. Therefore, when reporting the Ghost memory use in Table 2, we report "algorithm's used memory" + "memory used for the cache". Otherwise, missForest is the most efficient algorithm, followed by Amelia.

We do not report memory utilization while tolerating dissimilarity because there is no logical reason for a difference in memory utilization.

4.4.4 Execution Time Comparison with Other Algorithms

Table 3 compares our algorithm execution time against state-

of-the-art algorithms. In this study we report the execution time averaged across runs with varying numbers of missing records, i.e., 10 to 100 records. The results in this table

show that our algorithm was slower than both missForest and

Amelia and faster than MICE and mi. Although our algorithm

was not the fastest imputation algorithm, our implementation

can be slightly improved by using a more subtle hash function

instead of a list.

Despite the significant decrease in precision, tolerating dissimilarity decreases execution time significantly, because the matching segment is identified much faster and the search space is smaller. Table 4 reports the execution time for different dissimilarity tolerances.

From these evaluations we conclude that our algorithm is appropriate for systems that need to reconstruct missing data offline (not in real time) and systems that can tolerate latency but require high accuracy. Our algorithm was the most accurate algorithm for all sequential datasets. Furthermore, due to its low memory use it could be implemented on small devices such as wearables. Our report focused on fully missing data, but our algorithm could easily be extended to handle partially missing data as well.

5 LIMITATIONS

Although not using exact equality and tolerating dissimilarity increase the recall, we recommend using exact similarity.

Fig. 10. Memory use comparison between the cache and the baseline.

TABLE 2

Memory Use (in KB) of Different Algorithms

Dataset Ghost missForest Amelia MICE mi

SmartHome 0.64 + 230 49.23 132.58 288.41 1708.56

Mobile 1.06 + 234 60.59 142.23 74.40 1948.12

Wearable 0.86 + 241 49.28 133.06 291.66 1711.24

Clinical 0.86 + 225 75.30 156.85 236.41 2017.19

NewsMedia 0.85 + 261 99.86 154.24 753.65 2928.63

TABLE 3

Execution Time (in Seconds) of Different Algorithms

Dataset Ghost missForest Amelia MICE mi

SmartHome 10.39 1.24 0.28 10.57 35.89

Mobile 5.38 0.61 0.32 1.77 10.68

Wearable 13.64 0.69 0.82 2.91 10.06

Clinical 7.57 0.63 0.58 2.33 7.42

NewsMedia 6.91 0.68 0.57 10.34 32.21

TABLE 4

Execution Time (in Seconds) of Different Level

of Dissimilarity Tolerance in Ghost

Dataset Ghost (n-1) Ghost (n-2) Ghost

SmartHome 7.26 6.79 10.39

Mobile 5.38 3.91 5.38

Wearable 8.64 9.01 13.64

Clinical 5.13 6.00 7.57

NewsMedia 4.36 3.78 6.91

The last column (Ghost), which is the same as in Table 3, presents the algorithm when not tolerating any dissimilarity and looking for exact equality.


This is due to the fact that, similar to other algorithms, tolerating dissimilarity imposes a need for smoothing the data, and we want to keep the data close to its original values. In our experimental evaluation, we analyzed the trade-off between recall and precision when not using exact equality (see Section 4.3.1).

Our approach cannot reconstruct a missing segment at the beginning or end of the dataset. This limitation is unlikely to be significant in large real-world datasets, because the missing segment would be small relative to the dataset.

We assume this algorithm will be used to reconstruct missing segments on battery-powered devices, not on machines that collect hundreds of different data types. Therefore, in our experiments we chose datasets that do not have too many features; the maximum number of attributes we experimented with was six (in the news media dataset). However, this number is not exact and depends on the correlation and repeatability of the data. Nevertheless, by reducing the precision requirement and relying on a smoother similarity measure, datasets with more attributes could be handled as well. For the same reason, we did not test on big data in our experiments, and we assume the user of this algorithm is looking to accurately reconstruct the missing data from the smallest possible sample of data.

As reported in the evaluation, our algorithm cannot reconstruct the data in real time. It can be useful for data streams and datasets that change frequently, but since it requires at least one scan over the entire dataset, it cannot operate in real time.

6 RELATED WORK

Imputation has a longstanding history in statistical analysis [40]. On the other hand, due to the inherent noise in real-world data, several promising application-specific imputation approaches have been proposed to reconstruct missing data for a single application.

We group related work into two categories. The first category consists of holistic methods, which are usually based on statistical analysis and do not consider the underlying application. The second category consists of application-specific methods, which are designed around the requirements of a target application. We have evaluated our approach on five different datasets, and thus conclude that our approach is not application specific and belongs to the first category.

Holistic Algorithms. Basic approaches to imputation use mathematical methods to calculate the missing data, such as mean imputation, regression analysis, and missing-indicator methods [11]. Practical algorithms usually use more complex approaches. In particular, two categories of imputation algorithms are widely in use. The first category features maximum likelihood (ML) algorithms [13], [18], such as EM (E for conditional expectation and M for maximum likelihood) algorithms [10] and their successors, such as the work proposed by Enders [13]. EM algorithms usually work well with numerical data, because they use a statistical model to find the missing data.
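As a concrete reference point for the basic approaches mentioned above (our own illustration, not taken from the paper), column-wise mean imputation for numerical data can be sketched as follows:

# Replace missing values (NaN) in each column with that column's mean of the
# observed values. This is the simplest of the basic methods cited above.
import numpy as np

def mean_impute(matrix):
    filled = matrix.copy()
    for col in range(filled.shape[1]):
        column = filled[:, col]              # view into `filled`
        observed = column[~np.isnan(column)]
        column[np.isnan(column)] = observed.mean()
    return filled

data = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
print(mean_impute(data))  # NaNs become 2.0 (column 1) and 3.0 (column 2)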

The second category comprises multiple imputation (MI) algorithms [41], such as the work proposed by Buuren and Groothuis-Oudshoorn [8] or Honaker et al. [20]. MI algorithms combine different imputation methods into a single procedure (mostly expectation maximization) and account for uncertainty by creating several plausible imputed datasets. Repeating multiple experiments makes MI algorithms approximately unbiased. Nevertheless, repeating and combining different processes increases the time and memory complexity of such approaches, and thus they are not computationally efficient despite their superior accuracy. Moreover, most MI algorithms assume that the data follow a normal distribution [48], which is not necessarily true for all real-world datasets. However, MI algorithms can benefit from ML approaches, and thus there is no distinct border between them. For example, Amelia [20], which has been described previously, is a well-known algorithm that combines classical expectation maximization with the bootstrap approach. Rubin [25] and Schafer [43] provide a detailed and comprehensive description of statistical approaches to imputation. These algorithms are still advancing. For instance, Yuan [53] uses the "propensity score" [39], which is applicable when there is a vector of observed covariates. This algorithm generates a score for variables with missing values; observations are then grouped based on these scores, and an approximate Bayesian bootstrap imputation is applied to each group. Song et al. [46] use approximate and exact neighboring information to identify the missing information in a dataset. Both of these recent approaches are useful when other streams of data are available, but they cannot operate optimally when the system is completely off. Mohan et al. [28] propose a method to perform imputation using a directed acyclic graph; this approach operates based on the causal relations of nodes in a graph.
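To make the MI idea above more tangible, the following is a minimal sketch (our own illustration) of the pooling step: the dataset is imputed m times, each copy is analyzed separately, and the resulting estimates are combined with Rubin's rules, averaging the point estimates and combining the within- and between-imputation variance:

# Pool estimates from m plausible imputed datasets with Rubin's rules.
import numpy as np

def pool_estimates(estimates, variances):
    m = len(estimates)
    q_bar = np.mean(estimates)            # pooled point estimate
    u_bar = np.mean(variances)            # average within-imputation variance
    b = np.var(estimates, ddof=1)         # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b   # total variance of the pooled estimate
    return q_bar, total_var

# e.g., the mean of a variable estimated from three plausible imputed datasets
print(pool_estimates([5.1, 4.9, 5.3], [0.20, 0.22, 0.19]))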

Our work is inspired by sequence mining algorithms [17], [30], [54], which focus on identifying ordered patterns of events in a sequence. Nevertheless, the objective of our algorithm is significantly different from that of sequence mining algorithms.

Application Specific Methods. Application-specific efforts, such as model-based imputation, try to resolve the missing data based on the assumptions that the model provides. Examples of application-specific imputation include sensor networks [21], [29], clinical data [5], and genome-wide association [27]. Besides, due to the nature of real-world data streams, these approaches handle multivariate and heterogeneous data types [12], [22], [43] that vary from categorical to binary and numerical data. Usually, the imputation process is done in batch mode, and most of the existing approaches in this category are computationally complex. For instance, to reconstruct the missing data, the data are converted into a contingency table and inserted into a large matrix [25], [43]. Another example is the use of compressed sensing [12] for sensor-network data imputation [22], which has a high computational complexity (in both time and space). Another work, proposed by Papadimitriou et al. [29], applies principal component analysis (PCA) to estimate a missing time series based on its correlation with another time series in time-stamped sensor data. The space cost of their approach is efficient, but because of its reliance on PCA, the approach operates on a two-dimensional space of numerical data. Moreover, PCA has polynomial time complexity. Kong et al. [22] use a customized spatio-temporal compressed sensing approach [12] for imputing environmental sensor network data. Due to the use of compressed sensing and nested matrix iterations, this approach is polynomial and computationally complex as well. Jeon et al. [21] propose a noise reduction model for removing audio noise, based on multi-band spectral subtraction. Marchini and Howie use a reference panel for estimating the missing data in a genome-wide association study [27]. Fryett et al. [14] provide a detailed survey comparing transcriptome imputation methods. Wang et al. [51] employ Petri nets to recover missing events based on the time constraints in a set of business process sequences. Their simple-case recovery has linear computational complexity, but the approach they propose for general cases, based on branching and indexing, does not have linear complexity. Some of these efforts can be generalized to different applications as well. For instance, the work proposed by Batista and Monard [5] uses the k-th nearest neighbor for reconstructing the missing data. They implement their approach on clinical data, and their approach has a higher accuracy rate than basic imputation methods, i.e., mean and linear regression. A more recent example of an imputation algorithm is proposed by Wellenzohn et al. [52], which focuses on imputation for time series. It introduces the concept of an anchor point, which is similar to our prior-segment approach. Their approach also benefits from data prior to the missing event and is therefore independent of linear correlation. Nevertheless, since it uses only the prior window as a constraint, its recall is higher, but our precision is higher.

Several application-specific imputation methods rely on the periodicity of the data. There are promising approaches to quantify periodic changes in a dataset [7], [26] and thus improve application efficiency. For instance, Boegl et al. [7] rely on the periodicity of the data to perform the imputation.

In the experimental evaluation section, we described well-known holistic state-of-the-art algorithms [8], [20], [47], [50] and explained why we selected them.

7 CONCLUSION & FUTURE WORK

In this paper we have introduced an accurate imputation algorithm, Ghost, that can operate on multivariate datasets. It uses a constraint and the first similar segments adjacent to the missing data segment to perform the imputation process; a simplified sketch of this idea follows below. To improve the efficiency of the algorithm we use a cache-based optimization. Our algorithm's accuracy outperformed state-of-the-art algorithms by 18 percent in F-score and 25 percent in precision. Our proposed algorithm is appropriate for systems that produce data streams and cannot hold data for the long term. Moreover, it is useful for systems that prioritize accuracy over response time. As future work we will try to develop a distance function that can identify prior and posterior segments that are in the proximity (not adjacent) of the missing segment. Finding prior and posterior patterns and their distance to the missing segments could increase the number of recovered segments, and thus the accuracy of the algorithm.
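For readers who want a compact mental model, the following is a minimal sketch of the core idea as summarized above (illustrative only, not the authors' implementation; the function name, the simplified constraint handling, and the window handling are our own assumptions):

# Find a position whose neighbourhood matches the records immediately before
# and after the gap under the same constraint value, and copy the records
# enclosed by that neighbourhood into the gap.
def impute_gap(records, constraints, gap_start, gap_len, window=2):
    prior = records[gap_start - window:gap_start]
    posterior = records[gap_start + gap_len:gap_start + gap_len + window]
    gap_constraint = constraints[gap_start]

    for i in range(window, len(records) - gap_len - window):
        if i == gap_start:
            continue  # do not match the gap against itself
        cand_prior = records[i - window:i]
        cand_fill = records[i:i + gap_len]
        cand_post = records[i + gap_len:i + gap_len + window]
        if (cand_prior == prior and cand_post == posterior
                and constraints[i] == gap_constraint
                and None not in cand_fill):
            return cand_fill  # first matching segment is substituted
    return None  # no match found; the gap stays missing

recs = ["a", "b", "x", "y", "c", "d", "a", "b", None, None, "c", "d"]
cons = [9] * len(recs)  # e.g., the same hour of day everywhere, for simplicity
print(impute_gap(recs, cons, gap_start=8, gap_len=2))  # ['x', 'y']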

ACKNOWLEDGMENTS

The authors acknowledge Thomas H. Cormen for his hint on the design of our algorithm and David Kotz for formalizing the problem and contributing to the process of writing the paper.

REFERENCES

[1] J. Allen, “Maintaining knowledge about temporal intervals,” Com-

mun. ACM, vol. 26, no. 11, pp. 832–843, 1983.

[2] C. Anagnostopoulos and P. Triantaﬁllou, “Scaling out big data

missing value imputations: Pythia vs. Godzilla,” in Proc. 20th

ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014,

pp. 651–660.

[3] T. Baltrušaitis, C. Ahuja, and L. P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019.

[4] S. Barker, A. Mishra, D. Irwin, E. Cecchet, P. Shenoy,and J. Albrecht,

“Smart*: An open data set and tools for enabling research in sustain-

able homes,” in Proc. KDD Workshop DataMining Appl. Sustainability,

2012, Art. no. 112.

[5] G. Batista and M. Monard, “An analysis of four missing data treat-

ment methods for supervised learning,” Appl. Artif. Intell., vol. 17,

no. 5/6, pp. 519–533, 2003.

[6] B. Berger, N. M. Daniels, and Y. W. Yu, “Computational biology in

the 21st century: Scaling with compressive algorithms,” Commun.

ACM, vol. 59, no. 8, pp. 72–80, 2016.

[7] M. Bögl, P. Filzmoser, T. Gschwandtner, S. Miksch, W. Aigner,

A. Rind, and T. Lammarsch, “Visually and statistically guided

imputation of missing values in univariate seasonal time

series,” in Proc. IEEE Conf. Visual Analytics Sci. Technol., 2015,

pp. 189–190.

[8] S. Buuren and K. Groothuis-Oudshoorn, “mice: Multivariate

imputation by chained equations in R,” J. Statistical Softw., vol. 45,

no. 3, pp. 1–68, 2011.

[9] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to

Algorithms. Cambridge, MA, USA: MIT Press, 2009.

[10] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from

incomplete data via the EM algorithm,” J. Roy. Statistical Soc..

Series B (Methodological), vol. 39, pp. 1–38, 1977.

[11] A. Donders, G. van der Heijden, T. Stijnen, and K. Moons,

“Review: A gentle introduction to imputation of missing values,”

J. Clinical Epidemiology, vol. 59, no. 10, pp. 1087–1091, 2006.

[12] D. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52,

no. 4, pp. 1289–1306, Apr. 2006.

[13] C. Enders, “A primer on maximum likelihood algorithms available

for use with missing data,” Structural Equation Model., vol. 8, no. 1,

pp. 128–141, 2001.

[14] J. Fryett, J. Inshaw, A. Morris, and H. Cordell, “Comparison of

methods for transcriptome imputation through application to

two common complex diseases,” Eur. J. Human Genetics, vol. 26,

pp. 1658–1667, 2018.

[15] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, CA, USA: O’Reilly Media, 2017.

[16] Z. Ghahramani, “Probabilistic machine learning and artiﬁcial

intelligence,” Nature, vol. 521, no. 7553, pp. 452–459, 2015.

[17] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu,

“FreeSpan: Frequent pattern-projected sequential pattern mining,”

in Proc. 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,

2000, pp. 355–359.

[18] H. Hartley and R. Hocking, “The analysis of incomplete data,”

Biometrics, vol. 27, no. 4, pp. 783–823, 1971.

[19] J. Hernández-Lobato, N. Houlsby, and Z. Ghahramani, “Probabilistic matrix factorization with non-random missing data,” in Proc. Int. Conf. Mach. Learn., 2014, pp. 1512–1520.

[20] J. Honaker, G. King, and M. Blackwell, “Amelia II: A program for

missing data,” J. Statistical Softw., vol. 45, no. 7, pp. 1–47, 2011.

[21] K. Jeon, N. Park, D. Lee, and H. Kim, “Audio restoration based on

multi-band spectral subtraction and missing data imputation,” in

Proc. IEEE Int. Conf. Consum. Electron., 2014, pp. 522–523.

[22] L. Kong, M. Xia, X. Liu, M. Wu, and X. Liu, “Data loss and re-

construction in sensor networks,” in Proc. IEEE INFOCOM, 2013,

pp. 1654–1662.

[23] B. Lake,R. Salakhutdinov, and J. Tenenbaum, “Human-levelconcept

learning through probabilistic program induction,” Sci., vol. 350,

no. 6266, pp. 1332–1338, 2015.

[24] J. Lin, E.Keogh, S. Lonardi, and B. Chiu, “A symbolic representation

of time series, with implications for streaming algorithms,” in Proc.

8th ACM SIGMOD Workshop Res. Issues Data Mining Knowl. Discovery,

2003, pp. 2–11.

[25] R. Little and D. Rubin, Statistical Analysis with Missing Data. Hoboken, NJ, USA: Wiley, 2014.


[26] C. Loglisci and D. Malerba, “Mining periodic changes in complex

dynamic data through relational pattern discovery,” in Proc. Int.

Workshop New Frontiers Mining Complex Patterns, 2015, pp. 76–90.

[27] J. Marchini and B. Howie, “Genotype imputation for genome-

wide association studies,” Nature Rev. Genetics, vol. 11, no. 7,

pp. 499–511, 2010.

[28] K. Mohan, J. Pearl, and J. Tian, “Graphical models for inference

with missing data,” in Proc. Int. Conf. Neural Inf. Process. Syst.,

2013, pp. 1277–1285.

[29] S. Papadimitriou, J. Sun, and C. Faloutsos, “Streaming pattern

discovery in multiple time-series,” in Proc. 31st Int. Conf. Very

Large Data Bases, 2005, pp. 697–708.

[30] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and

M. Hsu, “PreﬁxSpan: Mining sequential patterns efﬁciently by

preﬁx-projected pattern growth,” in Proc. Int. Conf. Comput. Com-

mun. Netw., 2001, pp. 215–224.

[31] R. Rawassizadeh, C. Dobbins, M. Akbari,and M. Pazzani, “Indexing

multivariate mobile data through spatio-temporal event detection

and clustering,” Sensors, vol. 19,no. 3, 2019, Art. no. 448.

[32] R. Rawassizadeh, E. Momeni, C. Dobbins, J. Gharibshah, and

M. Pazzani, “Scalable daily human behavioral pattern mining

from multivariate temporal data,” IEEE Trans. Knowl. Data Eng.,

vol. 28, no. 11, pp. 3098–3112, Nov. 2016.

[33] R. Rawassizadeh, E. Momeni, C. Dobbins, P. Mirza-Babaei, and

R. Rahnamoun, “Lesson learned from collecting quantiﬁed self

information via mobile and wearable devices,” J. Sensor Actuator

Netw., vol. 4, no. 4, 2015, Art. no. 315.

[34] R. Rawassizadeh, T. Pierson, R. Peterson, and D. Kotz, “NoCloud:

Exploring network disconnection through on-device data analysis,”

IEEE Pervasive Comput., vol. 17, no. 1, pp. 64–74, Jan.–Mar. 2018.

[35] R. Rawassizadeh, B. Price, and M. Petre, “Wearables: Has the age

of smartwatches ﬁnally arrived?,” Commun. ACM, vol. 58, no. 1,

pp. 45–47, 2015.

[36] R. Rawassizadeh, M. Tomitsch, M. Nourizadeh, E. Momeni,

A. Peery, L. Ulanova, and M. Pazzani, “Energy-efﬁcient integration

of continuous context sensing and prediction into smartwatches,”

Sensors, vol. 15, no. 9, pp. 22616–22645, 2015.

[37] R. Rawassizadeh, M. Tomitsch, K. Wac, and A. Tjoa, “UbiqLog: A

generic mobile phone-based life-log framework,” Pers. Ubiquitous

Comput., vol. 17, no. 4, pp. 621–637, 2013.

[38] M. Resche-Rigon and I. R. White, “Multiple imputation by chained

equations for systematically and sporadically missing multilevel

data,” Statistical Methods Med. Res., vol. 27, no. 6, pp. 1634–1649, 2018.

[39] P. R. Rosenbaum and D. B.Rubin, “The central role of the propensity

score in observational studies for causal effects,” Biometrika,vol.70,

no. 1, pp. 41–55, 1983.

[40] D. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3,

pp. 581–592, 1976.

[41] D. Rubin, “Multiple imputations in sample surveys- A phenome-

nological Bayesian approach to nonresponse,” Proc. Survey Res.

Methods Section Amer. Statistical Assoc., vol. 1, pp. 20–34, 1978.

[42] S. Russell and P. Norvig, Artiﬁcial Intelligence: A Modern Approach,

3rd ed. London, U.K.: Pearson Education Limited, 2011.

[43] J. L. Schafer, Analysis of Incomplete Multivariate Data. Boca Raton,

FL, USA: CRC Press, 1997.

[44] C. E. Shannon, “A mathematical theory of communication,” Bell

Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948.

[45] S. Shaphiro and M. Wilk, “An analysis of variance test for normal-

ity,” Biometrika, vol. 52, no. 3, pp. 591–611, 1965.

[46] S. Song, A. Zhang, L. Chen, and J. Wang, “Enriching data imputa-

tion with extensive similarity neighbors,” Proc. VLDB Endowment,

vol. 8, no. 11, pp. 1286–1297, 2015.

[47] D. Stekhoven, “MissForest: Nonparametric missing value imputa-

tion using random forest,” Astrophysics Source Code Library, 2015.

[48] J. Sterne, I. White, J. Carlin, M. Spratt, P. Royston, M. Kenward,

A. Wood, and J. Carpenter, “Multiple imputation for missing data

in epidemiological and clinical research: Potential and pitfalls,”

Brit. Med. J., vol. 338, 2009, Art. no. b2393.

[49] B. Strack, J. DeShazo, C. Gennings, J. Olmo, S. Ventura, K. Cios, and

J. Clore, “Impact of HbA1c measurement on hospital readmission

rates: Analysis of 70,000 clinical database patient records,” BioMed

Res. Int.,vol. 2014, 2014, Art. no. 781670.

[50] Y. Su, A. Gelman, J. Hill, and M. Yajima, “Multiple imputation

with diagnostics (mi) in R: Opening windows into the black box,”

J. Statistical Softw., vol. 45, no. 2, pp. 1–31, 2011.

[51] J. Wang, S. Song, X. Zhu, and X. Lin, “Efﬁcient recovery of missing

events,” Proc. VLDB Endowment, vol. 6, no. 10, pp. 841–852, 2013.

[52] K. Wellenzohn, M. H. Böhlen, A. Dignös, J. Gamper, and H. Mitterer, “Continuous imputation of missing values in streams of pattern-determining time series,” in Proc. Int. Conf. Extending Database Technol., 2017, pp. 330–341.

[53] Y. Yuan, “Multiple imputation for missing data: Concepts and

new development,” in Proc. Twenty-Fifth Annu. SAS Users Group

Int. Conf., vol. 267, 2010.

[54] M. Zaki, “SPADE: An efﬁcient algorithm for mining frequent

sequences,” Mach. Learn., vol. 42, no. 1/2, pp. 31–60, 2001.

Reza Rawassizadeh received the BSc degree in

software engineering, the master’s degree in com-

puter science, and the PhD degree in computer

science from the University of Vienna, Austria, in

2012. He is an assistant professor with the Depart-

ment of Computer Science, Metropolitan College,

Boston University. His research interests include

data mining, ubiquitous computing, and applied

machine learning.

Hamidreza Keshavarz received the PhD degree

from Tarbiat Modares University, Tehran, Iran, in

2018. His research is focused on developing algo-

rithms and techniques for sentiment analysis and

data mining. His interests include metaheuristic

algorithms, information retrieval, computational intel-

ligence, pattern recognition, and machine learning.

Michael Pazzani received the PhD degree in com-

puter science from the University of California, Los

Angeles (UCLA). He is vice chancellor for research

and economic development and a professor of

computer science with the University of California,

Riverside. He was a professor with the University

of California, Irvine, where he also served as chair

of information and computer science. His research

interests include machine learning, personalization,

and cognitive science.

"

