Content uploaded by Reza Rawassizadeh
Author content
All content in this area was uploaded by Reza Rawassizadeh on Oct 08, 2020
Content may be subject to copyright.
Content uploaded by Reza Rawassizadeh
Author content
All content in this area was uploaded by Reza Rawassizadeh on May 01, 2019
Content may be subject to copyright.
Ghost Imputation: Accurately Reconstructing
Missing Data of the Off Period
Reza Rawassizadeh , Hamidreza Keshavarz , and Michael Pazzani
Abstract—Noise and missing data are intrinsic characteristics ofreal-world data, leading to uncertainty that negatively affectsthe quality of
knowledge extractedfrom the data. The burdenimposed by missing datais often severe in sensors that collect data fromthe physical world,
where large gaps of missing data may occur when the system is temporarily off or disconnected. How can we reconstruct missing data for
these periods? We introduce an accurate and efficient algorithm for missing data reconstruction (imputation),that is specifically designed to
recover off-period segments of missing data. This algorithm, Ghost, searches the sequential dataset to find data segments that have a prior
and posterior segment that matches those of the missing data. If there isa similar segment that also satisfies the constraint –suchas
location ortime of day – then it is substituted for the missing data. A baseline approach results inquadratic computational complexity,
therefore we introduce a caching approach that reduces the search space and improves the computational complexity to linear inthe
common case. Experimental evaluations onfive real-world datasets show that our algorithm significantly outperforms four state-of-the-art
algorithms with an average of 18 percent higherF-score.
Index Terms—Imputation, multivariate, pattern mining
Ç
1INTRODUCTION
WITH recent technological advances and increases in
computing capabilities, data intensive scientific discov-
ery is being widely used. This has led to the introduction of
methods for analyzing data collected from multiple sources
of information, i.e., “multivariate data”.
1
One of the inevitable
challenges of real-world data analysis is uncertainty rising
from noise and missing data [16]. This uncertainty negatively
affects the quality of knowledge extracted from the data.
Indeed, the burden imposed by missing data is often severe in
applications collecting data from the physical world, e.g.,
mobile sensing [33] or genome sequencing [6].
For example, consider battery powered devices, such as
smartwatches, equipped with inexpensive sensors such as
ambient light and accelerometer. Due to sensor quality, bat-
tery limits,and user preferences, context-sensing applications
can not continuously and effectively collect data [33] and
there are often segments of missing data, e.g., the device is
turned off. These missing segments affect the quality of
knowledge-extraction methods [31]. Although, missing data
reconstruction is an important requirement of these system, it
has not received much attention.
There are longstanding efforts in statistics [10], [40], [43]
to reconstruct missing data. These imputation methods
assume the missing data points occur at random, i.e., miss-
ing at random (MAR) or missing completely at random
(MCAR). If the data is missing not at random (MNAR) [40],
the imputation process is more challenging.
In this paper we propose a novel algorithm for imputation
of multivariate sensor data. This algorithm only uses (i) a con-
straint such as time of the day or location (ii) the data values
immediately prior to the missing event, and (iii) the data val-
ues immediately following the missing event. Since our
method does not rely on statistical methods, it might be able
to handle some MNAR data, but only if a similar segment
exists in the dataset.
In particular, our algorithm operates on multivariate and
sequential data streams. It reads two adjacent data segments –
one before and one after the missing data (missing seg-
ment) –and searches the dataset to find two segments simi-
lar to the adjacent segments of the missing segment. If
segment between these two similar segments is of the same
length as missing segment, it is a candidate recovery seg-
ment. Next, if the constraint values of the segment of inter-
est matches the constraint values of the missing segment,
the algorithm substitutes the missing segment with the con-
tent of this candidate recovery segment. A naive approach
imposes a quadratic computational complexity, so we add a
pre-processing step that reads all data segments and their
indexes into a cache, achieving a linear computational com-
plexity in the common case.
The characteristics and contributions of our algorithm
are as follows.
1. Multivariate is slightly different than multimodal. Modality refers
to the way when an event happens or experienced [3]. In other words,
it refers to primary channels of sensation and communication, e.g.,
touch and vision.
R. Rawassizadeh is with the Department of Computer Science, Metropoli-
tan College, Boston University, Boston, MA 02215.
E-mail: rrawassizadeh@acm.org.
H. Keshavarz is with the Department of Electrical and Computer Engineering,
Tarbiat Modares University, Tehran, Iran. E-mail: keshavarz.h@modares.ac.ir.
M. Pazzani is with the Department of Computer Science, University of
California at Riverside, Riverside, CA 92521. E-mail: michael.pazzani@ucr.edu.
Manuscript received 15 Oct. 2018; revised 25 Apr. 2019; accepted 30 Apr.
2019. Date of publication 3 May 2019; date of current version 6 Oct. 2020.
(Corresponding author: Reza Rawassizadeh.)
Recommended for acceptance by S. Cohen.
Digital Object Identifier no. 10.1109/TKDE.2019.2914653
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 11, NOVEMBER 2020 2185
1041-4347 ß2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
Heterogeneous Multivariate Real-World Data. Statistical
imputation approaches are optimized to handle
numerical data [40], [43]. Real-world systems, how-
ever, produce data in numerical, categorical, or binary
forms. Our algorithm relies on a categorical abstrac-
tion of the original data by converting data values to
categorical symbols, e.g., by bucketing numeric data
into categories. Therefore, in contrast to statistical-
based imputation, any data type, regardless of its
distribution, can be fed into this algorithm, i.e., non-
parametric imputation. All datasets we employed in
this work are real-world datasets and most of them
(wearable, mobile, IoT, and news media datasets)
have not previously been used for imputation studies.
We recommend to use this algorithm mainly for multi-
variate sensor data used in consumer IoT and mobile
devices, but to demonstrate its versatility, we experi-
ment it with two other real-world datasets as well
(clinical data and real estate data).
Instance based Learning. Inspired from algorithms that
learn from a single labeled instance [15], [23], our
algorithm tries to estimate the missing data from the
first similar instance that can be found in a sequential
search of the dataset. Clearly, relying on a single label
(similar instance) is prone to false positive errors.
Instead, we rely on a constraint, as a controlling
variable that significantly reduces false positives. Our
definition of constraint is inspired by binary con-
straint in constraint satisfaction problems (CSP) [42],
but, unlike traditional CSP it is not used to reduce the
search space.
Search Space. Continuously collecting and storing
data can be expensive in terms of resource usage,
especially in battery-powered wireless devices. Data
is typically not stored long-term on these devices,
and most data processing is conducted in cloud serv-
ers [34]. Our algorithm can reconstruct the missing
data merely by finding the first match for the missing
segment without a need to search the entire dataset.
For instance, we have used only three days for a
smartphone dataset and only seven days for a smart-
watch dataset. Their datasets are fairly small, but in
both of these examples, our algorithm outperforms
state-of-the-art algorithms and reconstructs the miss-
ing data with higher accuracy.
Note that all versions of our algorithm, i.e., baseline and
cache based ones, have only one efficient parameter and it
is the window size. In the evaluation section we identify an
optimal window size value for each dataset. Therefore,
based on the target dataset (or application), this parameter
could be assigned automatically and there is no need for the
user, who has a domain knowledge, to decide about its opti-
mal value. There is another parameter used for tolerating
slight dissimilarity, we will demonstrate why users should
not tolerate dissimilarity.
2PROBLEM STATEMENT AND DEFINITIONS
Asystem produces a dataset, over time. For example, a
smartwatch system is producing sequential data about its
wearer’s activity, physiology, and environment. The system
generates data intermittently, when it is on, but may be off
for extended periods.
Data value is a data unit in numeric, categorical or in
binary form. For our purposes we convert all such data
types into categorical form, so that all data values are repre-
sented by symbols from a finite alphabet. This has an inevi-
table impact on the accuracy of the data. However, all
imputation methods which can handle multivariate data
have such a limitation.
Dataset is a temporal sequence of data records, produced
by a single system over time. Because typical systems produce
multiple types of data (multivariate data), we envision each
record comprises one value from each type of data produced
by that system. We discretize time such that each data record
rirepresents a brief period of time beginning at time ti, where
irepresents the index of the record within the dataset
ð1inÞ.Werepresentrias a tuple comprising one data
symbol for each of the ‘streams produced by the system
during time interval ti, plus one special stream 0 called the
constraint stream. Thus, we can write ri¼ðsi0;s
i1;...;s
i‘Þ.
If a stream produces no data during time interval ti,
record riincludes the null symbol ;for that stream. For
periods where the system is off, the records during that
period will include the null symbol for every data stream.
Constraint stream is the portion of every data record ipre-
senting the value of the constraint, i.e., si0; for notational con-
venience, we refer to the this stream as S0,S0¼½s1;0;s
2;0;...;
sn;0. The constraint stream, however, is not produced by
the system and we assume it can be generated where the
dataset is assembled for post-processing; for example, the
constraint may be “time of day”. The choice of a suitable con-
straint depends on the kind of data being processed. Other
constraint examples are described in Section 4.1.
Segment is a subsequence of the dataset; thus for dataset
d½r1;r
2; ::; ri;...;r
nwe can refer to a segment dij
½ri;...;r
j, for 1ijn.
Missing Segment. A segment in which all its data values
(other than constraint values) are missing (null symbols);
that is dij is a missing segment iff 8ikj;81l‘;s
kl ¼;.A
missing segment occurs when the target system is not col-
lecting data at all, because it is off or unavailable, during
time intervals tithrough tj.
Window size wis a parameter of our algorithm; a small
number wsuch that a segment of wrecords is a “window”
into the dataset; window didi;iþw1.
Prior window 3dij is the window of data immediately
preceding a missing segment. If dij is the missing segment,
then the prior window is 3dij diw;i1.
Posterior window
"
dij is the window of data immediately
following a missing segment. If dij is the missing segment,
then the posterior window is
"
dij djþ1;jþw.
Segment Equality. Two segments are equal if they have
the same size and identical values for all of their data values.
For example, dij ’dkl iff ððlk¼jiÞ^ð8
0pji80q‘:
siþp;q ¼skþp;qÞÞ.
Segment Similarity. Two segments are similar if they have
the same size, the similarity check function returns true
while comparing their priorand posterior segments, and have
identical values in their constraint stream. That is, segment
dij ’dkl iff ðlk¼jiÞ^ð3dij ¼3dklÞ^ð
"
dij ¼
"
dklÞ^
ð80pji:simðsiþp;0;s
kþp;0Þ¼TrueÞ:
2186 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 11, NOVEMBER 2020
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
Recovery segment drec is a segment of data that can be
substituted into the dataset in place of a missing segment.
Our approach finds segments elsewhere in the dataset to
use as recovery segments. For example, if dij is the missing
segment, and another segment dkl is similar ðdij dklÞ, then
dkl is a candidate recovery segment.
Cache Cis a table with one entry for each ‘unique’ window,
listing the index (or indices) where that window occurs in the
dataset. In the simplest case, one entry in the table looks like
ðdi;figÞ. If, however, there is another window in the dataset
djsuch that dj¼di, then the entry in the table would instead
be ðdi;fi; jgÞ. When complete, the cache is therefore the small-
est set of entries Cfðde;fijdi¼degÞg, such that the cache
represents the whole dataset, 8di2d; 9ðde;IÞ 2 C j ðde¼
di^i2IÞ, and such that the cache has as few entries as possi-
ble, 8ðde;IÞ 2 C;@ðdf;JÞ2Cjde¼df. If there were two
windows de¼dfin the cache C, their entries would simply be
combined intoone entry representing the union of the indices
where that window occurs.
To better understand the cache consider Fig. 1. There is a
missing segment at index 6, that is, dmiss ¼d6;8. Given w¼2,
for the prior window 3dmiss ½ð1;B;2;XÞ;ð1;E;1;XÞ], the
cache holds two beginning indices: f4;12g. This means that
this window value has been repeated twice at d4and at d12.
The same is applicable for the posterior window
"
dmiss,
which appears at d9and d17.
Problem. Given a missing segment dmiss, its prior 3dmiss , its
posterior
"
dmiss and the constraint stream S0, the objective of our
algorithm is to accurately find a recovery segment drec such that
drec ’dmiss, and replace dmiss with drec .
Function 1. Segment Similarity function,sim is a function that
receive two segments da,dband confidence threshold .This
function uses Jaccard distance to compare both input seg-
ments. It is a binary comparison between the data members of
each segment. If their similarity tolerance is greater or equal
than the it returns true. Otherwise it returns false. Since we
are dealing with binary comparison between similarities of
two set of discrete data Jaccard is the most optimal similarity
measure that could be used for our application
simðda;d
b;Þ¼ 1if da\da
da[da
0else
(:
Function 2. Segment match function,cmatchð3d;
"
d; dmissÞ,
is a function that by using the hash table (cache) compares
3dand
"
doffset distances with each other. If the distance
between their offsets is exactly similar to the size of dmiss
then it returns “true”, otherwise it returns “false”. More
about this will be described when we present the caching
algorithm.
3ALGORITHM
Our algorithm has three main characteristics. First, it uses a
constraint value, which we assume is known for all records.
Consider a smartwatch app that continuously collects ambient
light; when the user puts her hand inside her pocket while
seated, the ambient light is zero. While she is sleeping, the
ambient light is also zero. Sleeping and sitting with hands
inside the pockets are two different behaviors. Therefore, an
algorithm that tries to estimate missing values of ambient light
should consider the time of the day (a constraint) while per-
formingimputation.Multiple-imputation methods often have
better accuracy value over a single-imputation approach, but
are often less efficient. Our algorithm mitigates the accuracy
issue by relying on the constraint.
The algorithm’s second characteristic is the discretization of
data. Discretization enables our algorithm to handle any data
type, including binary, numerical or categorical values. Dis-
cretizing numeric data may have some impact on accuracy
depending on the granularity [24]. To avoid decreasing the
accuracy further, our algorithm compares data objects based
on equality and does not use any other similarity metrics. We
also discretize the timestamp into buckets using the temporal
granularity [32], (e.g., if the granularity is five minutes then
11 : 32 !11 : 30), and to discretize time series data we use
the SAX algorithm [24].
The third characteristic is its statistical distribution inde-
pendence, unlike several other state-of-the-art imputation
algorithms [8], [20], [41], which require the data to be nor-
mally distributed. In traditional imputation methods, MCAR
and MAR cause losses of statistical significance, and MNAR
can cause bias in the result [11]. Recently there are methods
designed specifically to deal with MNAR [19], [28]. Since our
algorithm relies on the first similar segment to reconstruct the
missing data and it does not use any statistical methods, it
might be able to reconstruct MNAR data, but only if a similar
segment exists in the dataset.
The quality of data reconstruction is strongly depend on
the choose of constraint. Therefore, it is important for the
user of the algorithm to be familiar with the dataset and have
some domain knowledge.
3.1 Baseline Algorithm
To better understand the algorithm, we first describe its base-
line version. Consider the dataset in Fig. 1 (top), where, three
streams producing data, i.e., S1,S2and S3. The constraint
stream is S0. From index 6 to 8 there is no data available, thus
dmiss ¼d6;8. Given the window size of two, the algorithm
reads the prior window 3d6;8=[ð1;B;2;XÞ;ð1;E;1;XÞ], the
posterior window
"
d6;8=[ð2;D;3;YÞ;ð2;C;4;;Þ], and = the
constraint content [1;2;2]. Then, the algorithm scans the data-
set until it encounters a window similar to 3d6;8, which it
finds at d12; it next checks whether the corresponding window
d17 is similar to the posterior window, for the missing segment
(i.e., whether d17
"
d6;8). If so, this new found segment d14;16
Fig. 1. (top) An abstract representation of a system with three data
streams that gets offline from tnto tnþ2. (bottom) Based on the exactly
similar posterior segment, prior segment and constraint the missing
segment has been identified from tmto tmþ2and recovered.
RAWASSIZADEH ET AL.: GHOST IMPUTATION: ACCURATELY RECONSTRUCTING MISSING DATA OF THE OFF PERIOD 2187
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
is a candidate recovery segment. If the constraint in d14;16,i.e.,
1;2;2, are equal to those in d6;8,i.e.,1;2;2,thenthiscandidate
is a recovery segment and it will substituted at the missing
data segment, as shown in Fig. 1 (bottom). The search process
stops immediately after a recovery segment has been found; if
there is more than one substitution possible, only the first sim-
ilar segment will be chosen. Thus, this algorithm does not per-
form “multiple imputation” [38]. Also note that to construct
the missing data, our algorithm relies on the comparison to all
available data sources (multivariate) inside posterior and
prior segments. This means that each information source
inside prior and posterior will be a condition for comparison.
The algorithm can operate on single source of information,
but it will be more accurate when we use it with multivariate
datasets.
The baseline algorithm is simple and thus we do not
describe its pseudo code in detail.
3.2 Baseline Computational Complexity
The computational complexity of the algorithm is propor-
tional to the number of missing segments, and to the amount
of data searched to find a recovery segment for each missing
segment. Assuming there are nnumbers of tuple in the data-
set, the algorithm can find missing segments in a linear scan
of the dataset, in OðnÞtime; in a worst-case scenario it may
need to scan the whole dataset to find a recovery segment for
each missing segment, also in OðnÞtime. This means that
computational complexity is quadratic, i.e., Oðn2Þ. Since it is
not optimal, especially to run on battery powered devices,
we should improve it. By using a cache we can mitigate this
performance overhead.
3.3 Caching Algorithm
Caching is being used widely to improve the performance
of algorithms [9]. Here we adapt a caching mechanism that
reduces the search space to reduce time complexity. To
implement the cache we use a data dictionary data structure
that hosts all unique window values and their beginning
indices. Note that missing records are skipped when build-
ing the cache of window values.
By using a hash table to implement the dictionary, return-
ing a list of indices where this window value occurs in the data-
set, the cache allows the algorithm to consider only a limited
set of possible candidate recovery segments. In other words,
using the cache reduces the search space from two different
perspectives: (i) instead of searching the entire dataset that
includes repetitive data it searches a table of window values
that are all unique with no repeated data. (ii) Instead of
searching the dataset to find the similar posterior and prior
segments, first, it searches the list of indices for matches to the
prior window, then if the match is founded it checks the size
match between missing and candidate recovery segments,
then if both previous conditions are true it checks for the
posterior segments. If all three conditions are true then it con-
siders the candidate recovery segment as a final recovery
segment for the missing segment.
Given the window size of two, Fig. 2 is an abstract repre-
sentation that shows how the algorithm builds the cache. In
particular, the cache creation process scans the dataset,
sequentially examining each window, updating the dictionary
to add an entry for new window values or add a new index to
an existing dictionary entry.
The table at the bottom of Fig. 2 presents the dictionary as it
is being created. For the sake of simplicity, at the beginning we
show all possible four data segments, then for the rest we only
highlight repeated data segments. For the purpose of explana-
tion, we have assigned a name to each window, a1,b2,... and
because windows are overlapping we have useddifferent col-
ors and names (a,b) to distinguish them from each other.
Algorithm 1. Cache-Based Implementation of the
Algorithm
Data: dataset dwith nrecords ri(where 1in), window
size w, cache C, similarity threshold
Result: updated d
// outer loop: scan dto find missing segments
1i:¼wþ1// we skip over the first window
2while ðinwÞdo
// scan to find contiguous missing segment starting at i
3j:¼i
4while ðisMissingðdi;jþ1Þ^ðj<nwÞÞ do
5j:¼jþ1
6if ðisMissingðdijÞÞ then
7l¼jiþ1// length of missing segment
83dij ¼diw;i1// prior window
9
"
dij ¼djþ1;jþw// posterior window
// use cache to iterate over set of indices where
3dij is found in d, if any
10 forall k2csearchðC;3dij;Þdo
// exclude candidates close to the end of dataset
11 if ðkþwþl1nwÞthen
// candidate recovery segment
12 drec ¼dkþw;kþwþl1
// check whether posterior windows match based
on epsilon threshold
13 if ðsimð
"
drec;
"
dij;ÞÞ then
// now check whether constraints match; recall
sil are the fields of record ri
14 if ð80pji:siþp;0¼skþwþp;0Þthen
15 dij :¼drec // substitute into the missing segment
16 break
17 i:¼jþ1// continue scan after this segment
The current implementation of the cache constructions
adds windows into the dictionary with a simple check. If
Fig. 2. The cache creation process that loads the dataset into the hash
table. It uses overlapping sliding window, equal to the given ws size, to
read all possible data segments and their beginning offsets.
2188 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 11, NOVEMBER 2020
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
the window exists in the dictionary the cache updates the
existing entry by adding the new index into the existing list
of indices. Therefore, there will be no collisions unlike using
hash functions.
In the common case, there will be a small number of indi-
ces in each dictionary entry. Therefore, if the dictionary is
implemented as a hash table, the insert and lookup time
will be Oð1Þ, but we have the risk collision. In our algorithm,
the process of reading the dataset and caching its windows
is a one-time pre-processing step. In other words, this cache
creation step, however, occurs once at the beginning, and is
not repeated. Nevertheless, we consider this overhead in
our evaluation. In the worst case, its computational com-
plexity is Oðn2Þ, but in common cases it is close to OðnÞ.In
particular, unusual or degenerate datasets could lead to
Oðn2Þbehavior if, for instance, all data windows have the
same value and the dictionary has only one entry with a list
of OðnÞindices. After creating the dictionary (cache), we
can now run a more efficient algorithm to find and replace
missing segments, as shown in Algorithm 1.
Note, most of the overhead is on scanning the content of
dictionary and comparing the values of cached, which hash
function can not improve the scan as well. Besides, our algo-
rithm assumes that the given dataset will have no predefined
structure, such as data streams. Considering the space and
time complexity, ordering the data stream is not cost-effective.
Besides, we need to store all unique (not all of them) data seg-
ments into the cache. Therefore, we do not recommend using
prefix-tree or any other types of trees, due to their space and
time in-efficiency required for ordering data or relying on the
structure of data.
In the first step, the algorithm iterates through dataset
and identify missing segments. It checks the placement of
the missing segment to locate its posterior and prior in line
4. Then, it reads the associated posterior and prior segments
in lines 8 and 9. From line 10 the algorithm starts to search
the cache to find offsets of the prior segment or similar offset
(it tolerates dissimilarity based on ). In line 11 it excludes
prior segments which are close to the end of the dataset,
because there will be no associated posterior or prior existed
for them. Next, it checks the window size and based on the
prior segment and window size it loads the candidate
recovery segment into drec, line 12. However, still the algo-
rithm is not sure whether it is a correct recovery segment,
and it should check the posterior as well. Line 13 will do the
comparison between posterior segment of the drec and pos-
terior of the missing segment dij. If they match (based on
given epsilon) the next step is to check their constraint
which should be identical. Note that constraint matching is
checking for exact equality and there will be no tolerance
(no will be used). If they match as well, this means that all
three conditions are true, and in line 15, drec will be returned
as a recovery segment for the identified missing segment.
Fig. 3 provides an example to understand the cmatch
function. Assume the ¼0which means data segment
should be identical (partial similarity is not tolerated) and
ws ¼2is the window size. There we have a dataset and we
show two snapshots of the dataset, each with three segments.
The first one includes the missing data segment and the sec-
ond one includes its recovery data segment. The beginning
offset (index) of prior segment xis 25. The missing event
occurs at 27 and continues until offset 30.
According to the cache, data segment x(prior) existed at
offset 16, 25 and 52, and data segment y(posterior) exited at
offset 20, 24, 30, 60 and 67. The cache algorithm first sub-
tracts the first posterior 20, from the first prior 16+ws. The
result shows 6, since it is not equal to the missing size, i.e.,
3, the cursor moves to the next x. Now it subtracts 20+ws
from 25 that results in 3. Since the result is negative and
smaller than the missing size, the cursor shifts to the next
posterior, i.e., 30. If the subtraction of 30 from 25+ws is cor-
rect, but the prior is the same, thus the algorithm uses the
next prior, i.e., 52. This procedure continues until it reaches
60 from yand 52 from x, and the differences are are similar
to the missing data size. Then the value from index 57 to 59
will be read from the dataset to the construct the recovery
data segment.
3.4 Computational Complexity with Cache
The computational complexity of the baseline algorithm is
super-linear and in worst case it is quadratic. As it has been
described, assuming that there are mmissing segments in the
dataset and each recovery segment is located at kth element of
the dataset we have Oðk:mÞ. By using the cache based optimi-
zation, the algorithm needs to search the hash table. The size
of the hash table is the size of the dataset divided by size of
the sliding windows, i.e., ðn=wsÞ. Since the sliding windows
are overlapping instead of n=ws we should have it multiplied
to ws 1. Therefore the computational complexity is
ððn=wsÞ:ðws 1ÞÞ. Besides, there is a very small overhead to
search the offset lists for each data segment. Its impact is near
zero and we do not consider it as an overhead in the algo-
rithm. As a result, we can summarize, together with the cache
construction phase, the overall algorithm will run in near
linear computational complexity and in worst case it will be
super linear. The experimental evaluations analyze this in
more detail.
In the Experimental Evaluation section, we provide a
detailed analysis for the differences of using the cache and
not using the cache, their impact on dataset size and so forth.
4EXPERIMENTAL EVALUATION
Before describing our experimental evaluation, first we
describe datasets that we use. Then, we introduce state-
of-the-art algorithms that will be used for comparison with
our algorithms. Next, the accuracy analysis will be described
in detail, followed by a detailed efficiency analysis.
All experiments reported in this section were conducted
on a MacBook Pro laptop, with 2.8 GHz Intel Core i7 CPU,
Fig. 3. A toy example that shows data segments and their offsets (indices)
read and cached into the dictionary.
RAWASSIZADEH ET AL.: GHOST IMPUTATION: ACCURATELY RECONSTRUCTING MISSING DATA OF THE OFF PERIOD 2189
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
16 GB memory and SSD hard disk. We implemented the
algortihm, which we call Ghost, in R version 3.5.
4.1 Datasets
Here we choose to evaluate our method on five real-world mul-
tivariate datasets. These datasets include mobile lifelogging,
wearable smartwatch, Internet of Thing (IoT) Smarthome,
clinical data and online news data about Europe real-estate
market. We design this algorithm with the intention to rec-
onstruct sensor data for inexpensive consumer-electronic
devices, such as smartphone or IoT devices. However, to
demonstrate the versatility of our algorithm, we experiment it
on clinical data and online news data as well.
The mobile, wearable and IoT datasets have a timestamp
in each record. We converted each timestamp to a time of the
day with a temporal granularity of five minutes [32]. This
approach was inspired by the Allen temporal logic [1], which
treats temporal data as discrete time intervals, and it is differ-
ent than time series data. We converted time series data
(such as ambient light or accelerometer numerical data) to
characters with the SAX algorithm [24].
Wearable Dataset (WD). Because wearable devices are
small, they have limited battery and limited sensor capabili-
ties [35]. To preserve battery, their operating system shuts
down background services frequently, which can result in
significant data loss during the data collection process. We
use eight days of data for a sample user of “Insight for Wear”
[36], smartwatch app.
2
The constraint is time of the day.
Online News Data (ND). Financial applications use news to
predict market fluctuations. Recent political issues in Europe
such as Brexit and European policy regarding refugees have
led to significant fluctuations in European real-estate markets.
These fluctuations and their patterns are not continuously
reported in the news media. Therefore, imputation can be
used to construct missing market data from online news
media. We get 3,000 real-estate records, extracted from
500,000 news articles from 5 years (2012 to 2017) in several
German online news media sources. We acquired this data
from a market prediction start-up, i.e., eMentalist
3
and
ordered based on their date of publication. The constraint in
this dataset is region (country name) and they are ordered
chronologically.
Mobile Sensing Dataset (MD). Due to the proximity of
mobile phones to their users, it is not possible to continuously
collect contextual data at all times from the user [33], and the
imputation can be used to reconstruct missing mobile sensing
data. We use only three days of a user data
4
collected using an
open source lifelogging tool, i.e., UbiqLog [37]. The constraint
is time of the day.
Smarthome Dataset (SD). One well-known use of IoT devices
is in a household setting, i.e., smart-homes. Similar to mobile
and wearable devices, inexpensive sensors that will be used
for the smarthome configuration are prone to malfunction
and disconnection. We use UMass Smart Home dataset [4]
that includes information from different sensors inside and
out side twohomes. The constraint is the time of the day.
Clinical Data (CD). A traditional application of imputa-
tion is clinical data [2]. To demonstrate the generalizability
of our algorithms we evaluate our algorithm on a dataset of
visits of diabetes patients to 130 US hospitals from 1998-
2000 [49]. The constraint is combination of Age group, race
and gender. Since our algorithm operates with sequential
data, we order this dataset based on the constraint and
encounter sequence of patients, then run the imputation
algorithm. Note for this dataset we define an additional con-
straint that a sequence of data that constitute prior, missing
and posterior segments should belong to one single patient
only and a second patient will not be included in a same
segment.
As it has been described we chose a small subset of data
from the wearable and mobile datasets, that do not follow
any normal statistical distribution, i.e., a multivariate
Shapiro-Wilk [45] test did not reject the null hypothesis. We
make such a selection by purpose, to demonstrate that we
can still reconstruct the missing data with superior preci-
sion, despite not having a Gaussian distribution. Table 1
summarizes our experiment datasets. The variety of con-
straints demonstrates the versatility of our algorithm for dif-
ferent settings and its independency from having explicit
notion of time.
To quantify the characteristics of our datasets we have
conducted a Shapiro-wilk test [45] to identify whether each
attribute have been normally distributed. Besides we have
used Shanon entropy [44] to identify the level of uncer-
tainty. Shanon entropy could be interpreted as a predictabil-
ity of the dataset. For the sake of space, we do not report
them in detail, but none of the dataset has its all attribute
data normally distributed. Therefore, we can not conclude
that whether the selected dataset are highly predictable or
unpredictable.
4.2 State of the Art Algorithms
We compared the efficiency and accuracy of our algorithms
with four well-known state-of-the-art imputation algo-
rithms. The criteria for our selection is to support categorical
data and not just numerical data, and to support multivari-
ate imputation, as is needed for real-world applications. We
choose two well-known algorithm that use machine learn-
ing, i.e., mi [50] and missForest [47], and two that use statis-
tical inferences, Amelia II [20] and MICE [8].
TABLE 1
Experiment Datasets and Their Attributes
Dataset Contraint Data Streams #Rec.
Wearable Time of Day Battery Utilization, Ambient
light, Average Number of Steps
1913
Online News
Media
Country Topic, Influence Level of the
Topic, Sentiment of the Content,
Year and Month, Sub-Topic
3000
Mobile Time of Day Accelerometer, Ambient Light,
Battery Use, Screen Interactions
2366
Smart-
Home
Time of Day Inside Temperature, Outside
Temperature, Raining Status,
Door Status (is open or close)
25905
Clinical Patient ID A1C result, Insulin Level,
Diabetes Medication Use
101768
2. http://insight4wear.com
3. http://ementalist.net
4. http://archive.ics.uci.edu/ml/datasets/UbiqLog+%28smart-
phone +lifelogging%29
2190 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 11, NOVEMBER 2020
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
“Amelia”[20] operates by bootstrapping expectation max-
imization and imputation posterior methods. Its imputation
posterior method draws random simulation on normally dis-
tributed data. It assumes that the variables of the multivari-
ate dataset are jointly following a normal distribution. It is a
useful and widely used method to reconstruct missing data
in data streams. “MICE” [8] performs multivariate imputa-
tion by using chained equations. The chained equation is
using joint modeling and fully conditional specifications
[43]. Joint modeling specifies a multivariate distribution and
performs imputation from their conditional distributions by
Markov chain Monte Carlo simulation. “mi” [50] creates an
Approximate Bayesian model from the data, then it recon-
structs data from the conditional distribution for each data
object given the observed and reconstructed data of the other
variables in the dataset. “missForest” [47] treats data as
a multi-dimensional matrix and performs the imputation
by predicting the missing value using the random-forest
algorithm trained on observed parts of the dataset.
To have a fair comparison we experiment with different
parameters several times and choose the best settings for
each state-of-the-art algorithm. Moreover, we have used
only one run for algorithms that support multiple run for the
imputation.
4.3 Accuracy
To measure the accuracy of the imputation process in our
algorithm, we have randomly removed records from each
dataset. In particular, we have removed 10 to 100 data
records from each dataset (10,20,30,...100) and compared the
imputed dataset with the original dataset. The accuracy esti-
mation will be reported as precision, recall and F-score. True
positives are missing data that successfully get recovered.
False positives are missing data that have been identified
and recovered but their original value is different than the
recovered one. False negatives, are missing data that have
not been recovered. True negative are data that are not miss-
ing and also not recovered, which is obvious in our scenario
and all algorithms can easily identify true negatives. There-
fore, in most analysis we report F-score that is independent
from true negative.
We provide five analyses for the accuracy, four of which
are focused on the characteristic of the algorithm including
constraint impact, sliding window parameter sensitivity and
dissimilarity () impact on precision and recall, and missing
length (or duration) impact on accuracy. The fifth analysis is
focused on comparison of our algorithm with the aforemen-
tioned state-of-the-art algorithms. It is notable that there are
no differences between the accuracy of baseline or cache-
based version of the algorithm, and using a cache does not
have any impact on the accuracy of the algorithm.
4.3.1 Tolerating Dissimilarity Impact on Accuracy
The data objects are multivariate and heterogeneous. There-
fore, the algorithm should treat them as discrete data objects.
In cases the user intends to tolerate a dissimilarity the
algorithm uses Jaccard Index to compare contents of two data
segments. The parameter will be used totolerate thedissimi-
larity of prior and posterior segments. Increasing the dissimi-
larity tolerance increases the recall and enable the algorithm to
identify more missing data segments. It, however, decreases
the precision of result significantly.
Fig. 4 presents precision and recall of our experiment
datasets with window size of two, while tolerating one (n-1)
or two (n-2) data object dissimilarity. Later we will describe
why we set window size to two. We choose to remove only
two data objects at most, because either by removing one or
two data objects we will get near to ideal recall for all datasets
(more than 0.8). This figure presents an average of precision
and recall. It means that we calculate the precision and recall
for all 10 different record removes (10 recrord removes,...,100
record removes) and report the mean of results.
Based on the very low precision we acquired by slight
toleration of dissimilarity, it is clear that exact similarity is
favored by this algorithm and tolerating dissimilarity, even
small dissimilarity is not recommended between. Further-
more, the randomness of results between (n-1) and (n-2)
questions the credibility of tolerating dissimilarity as well.
Therefore, we highly recommend avoid tolerating dissimi-
larity in this algorithm.
4.3.2 Constraint Impact on Accuracy
We introduced constraints as a means to reduce false-positive
errors. To demonstrate the impact of constraints, we mea-
sured the accuracy of our algorithm with and witout use of
constraint. Fig. 6 demonstrates the superior accuracy of using
the constraint over not using it on thesefive datasets we chose
use to test. Although not using a constraint increases the recall
in all datasets except news media, it also increases the false
positive errors.
The superior recall of notusing a constraint, over using the
constraint in news media dataset is due to the tight connec-
tion of real-estate changes in German regions of Europe.
Most of real-estate news where either about Germany or UK.
Therefore, a constraint plays an auxiliary role and increases
recall. Unlike other datasets it does not act as a filter for data.
4.3.3 Sliding Window Parameter Sensitivity
The only effective parameter that this algorithm is using is
the window size.
5
We analyzed precision, recall and F-score
Fig. 4. A comparison between precision and recall of using exact similarity (n) versus tolerating one dissimilarity (n-1) and two (n-2) dissimilarities.
5. is also a parameter, but we demonstrate that it is not useful to
tolerate dissimilarity.
RAWASSIZADEH ET AL.: GHOST IMPUTATION: ACCURATELY RECONSTRUCTING MISSING DATA OF THE OFF PERIOD 2191
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
for each dataset with different window sizes for the same
dataset with different number of missing records, i.e., 10, 20,
...,100. Then we get an average from the precision, recall and
F-score for each dataset and each window size and report itin
Fig. 5. We evaluate the algorithm for four different window
sizes, i.e., 1,2,3 and 4. Based on the results in Fig. 5 we can rec-
ommend that the optimal window size for these datasets, is
two or three. Window size of two or three was optimal for sen-
sor based datasets (mobile, wearableand smarthome). For the
clinical dataset, window size of two had the highest accuracy.
For the new media dataset, window size of three has the high-
est accuracy. Nevertheless, we recommend that the user
experiment the algorithm with different window sizes for
their target dataset. Window size and precision have mono-
tonic relation and not linear. Our initial assumption was that
recall will be improved by smaller window size, but due to
the real-world nature of the data such an assumption is not
valid and you can see the large window size, i.e., 4 decreases
the accuracy.
4.3.4 Length of the Missing Segment and Accuracy
One question might arise is the flexibility of the algorithm
on the size of the missing data segment. In other words,
while the target system is off, how much accuracy of our
imputation algorithm decrease based on the duration of off
period? To answer this question, we have experimented
with removing sequential number of data. By using 100
missing records we have experimented on all missing data
with the missing segment size 2,3,5 and 10 records. For
instance, missing records 2 indicate 50 missing sequences
that each has the size of two segments. Missing record 3
indicate 33 missing sequences that each has the size of three
segments. The result is presented in Fig. 8. It presents preci-
sion and recall based on the number of missing length in a
sequence. As it can been seen from the result of Fig. 8, there
is a slight precision decreases among datasets while we
move from one missing segment to three. Then there is a
significant decrease by having 10 missing segments. The
precision of mobile and smarthome dataset are getting zero.
The behavior of recall in some datasets is decreasing, but in
some datasets it is also not predictable and we can not gen-
eralize it. In summary, we can argue that the longer the off
time will be, the less accurate will be the imputation pro-
cess. This reveals that the imputation process has some sen-
sitivity to the length of missing data.
4.3.5 Comparison with Other Algorithms
To compare the accuracy of our algorithm with state-of-the-art
algorithms, we chose the optimal window size from the pre-
ceding evaluation and the optimal parameter settings for
state-of-the-art algorithms. Fig. 7 summarizes the F-score for
imputation by each of the algorithms. As it has been shown in
this Figure, our algorithm outperforms other algorithms sig-
nificantly in terms of F-score, except on the clinical dataset.
The recallof all other algorithms are ideal, because (i) it is easy
to identify all missing data and (ii) theyprovide a substitution
for all missing segments (although often incorrectly). We can
easily change our algorithm and substitute exact similarity
Fig. 6. A comparison between using or not using the constraint.
Fig. 5. Window size parameter sensitivity analysis.
Fig. 7. Comparison of accuracy between our algorithm with constraint and state-of-the-art methods.
2192 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 11, NOVEMBER 2020
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
with another similarity metrics and achieve perfect recall.
However, this approach introduces a high false-positive rate.
False-positive errors are severe errors for imputation algo-
rithms and can bias the imputed data [19].
Notably, the missForest algorithm returns zero precision
for the mobile dataset, due to the screen interaction data.
This data is rarely available and most of its time slots are
filled with zero value. Such a sparse stream of data affects
the random forest algorithm, which can not handle too many
null variables for a dataset. Although the missForest
algorithm and our algorithm had a comparable F-score for
the news media dataset, our algorithm had 25 percent higher
precision. On average, even considering the low score of
our algorithm for clinical data it had about 18 percent higher
F-score (across all five datasets) than other algorithms.
Not performing better than missForest and mi for clinical
data could be due to the non-sequantial nature of the dataset.
Our algorithm is designed for datasets include sequential
data.
Note that fluctuations in Fig. 7 are due to randomness of
missing data and it is not possible to get an smooth line by
repeating the experiment. Our algorithm performs signi-
ficantly better than other algorithms that have a time as a
constraint.
4.4 Efficiency
To demonstrate the efficiency of our algorithm, we analyze
the execution time and memory use of the algorithm in dif-
ferent settings.
First we analyze the response time and memory use of
the algorithm and the impact of dataset size on its perfor-
mance. Next, we compare the memory use and response
time of our algorithm with state-of-the-art algorithms. To
understand the impact of the number of missing data on the
memory use and response time, we report the experiment
with 10 to 100 missing records for each dataset. This helps
to quantify the sensitivity of imputation algorithms based
on the number of missing data.
4.4.1 Cache versus Baseline: Execution Time
Our cached optimization reduces the search space, but
increases memory use. We report the memory use and
response time of both versions of the algorithm for different
data sizes varied from 500 to 10,000 records, i.e., 500, 1,000,
2,000,... and 10,000. To prevent any bias originating from the
dataset structure, we created a synthetic dataset with four
data streams (data values varied from 1 to 10) and one stream
as a constraint. We keep the number of random missing
records fixed at 100 records in all datasets. Nevertheless, indi-
vidual missing records were randomly distributed among all
dataset records. Fig. 9 shows response times (in seconds) for
differentwindow sizes between the baseline and the cache.
As shown in the Fig. 9, increasing the dataset size increased
the response time for both the baseline and cache. The
increase for response time in the cache version, however, was
significantly lower than that the baseline.
Moreover, increasing the size of the window size length-
ened the response time, because a comparison is done for
each object (due to the use of exact similarity) and as the
window size gets larger, there are moreobjects to compare.
4.4.2 Cache versus Baseline: Memory Use
Our cached optimization reduces the search space, but
increases memory use. We report the memory use and
response time of both versions of the algorithm for different
data sizes varied from 500 to 10,000 records, i.e., 500, 1,000,
2,000,... and 10,000.
To prevent any bias originating from the dataset structure,
we created a synthetic dataset with four data streams (data
values varied from 1 to 10) and one stream as a constraint. We
keep the number of random missing records fixed at 100
records in all datasets. Nevertheless, individual missing
records were randomly distributed among all dataset records.
Fig. 9 shows response times (in seconds)for different window
sizes between the baseline and the cache.
As shown in the Fig. 9, increasingthe dataset size increased
the response time forboth the baseline and cache. The increase
Fig. 8. Missing data length impact on accuracy.
Fig. 9. Response time impact of using cache rather than baseline method, with different window sizes.
RAWASSIZADEH ET AL.: GHOST IMPUTATION: ACCURATELY RECONSTRUCTING MISSING DATA OF THE OFF PERIOD 2193
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
for response time in the cache version, however, was signifi-
cantly lower than that the baseline. Moreover, increasing the
size of the window size lengthened the response time, because
a comparison is done for each object (due to the use of exact
similarity) and as the window size gets larger, there are more
objects to compare.
Fig. 10 shows the baseline algorithm uses only 112 Bytes of
memory, independent from the dataset size. The cache version
starts from 508 bytes and grows slowly based on the size of the
dataset. At 10,000 records the cache version occupies about
856 bytes ofmemory. Developers who are planning to use this
algorithm can consider comprising on the amount of memory
in order to have a shorter execution time or vice versa.
4.4.3 Memory Use Comparison with Other Algorithms
Since all algorithms read the dataset into memory, we ignore
that memory use and focus on additional memory use by the
algorithm. Our experiments show that there was no signifi-
cant differences in memory use while using different window
sizes. Therefore, we report the memory overhead independent
from the window sizes; see Fig. 10. The memory allocation
policy in our implementation was based on the R compiler.
The memory overhead of the algorithm alone was insignifi-
cant. In particular, the baseline version was memory efficient,
and only used 112 Kilo Bytes (KB) of memory, independent
from the dataset size, because the dataset was not in memory.
The cache version used more memory, of course. At 10,000
records (w¼2) the cache version occupies about 856 KB of
memory. Developers who are planning to use this algorithm
can consider comprising on the amount of memory in order to
have a shorter execution time or vice versa.
Note for this experiment, we selected only 3,000 records
from the clinical and smarthome dataset, because the “mi”
algorithm memory utilization grows exponentially as the size
of the dataset grows and it can not operate on large dataset.
Besides, based on the findingof the previous section, that win-
dow size two and three provides the highest accuracy, we set
the window size to two.
Although the baseline algorithm usesless amount of mem-
ory, we refrain from comparing it with other algorithms due
to its slow response times. A large portion of the memory
overhead came from the cache. Therefore, in Table 2 for
reporting the Ghost memory use, we report “algorithm’s
used memory” + “memory used for the cache”. Otherwise,
missForest is the most efficient one followed by Amelia.
We do not report memory utilization while tolerating
dissimilarity because there will be no logical reason to have
a differences in memory utilization.
4.4.4 Execution Time Comparison with Other Algorithms
Table 3 compares our algorithm execution time against state-
of-the-art algorithms. In this study we report average execu-
tion time averaged across runs with varying number of miss-
ing records, i.e., 10 to 100 records. Results of this table
show that our algorithm was slower than both missForest and
Amelia and faster than MICE and mi. Although our algorithm
was not the fastest imputation algorithm, our implementation
can be slightly improved by using a more subtle hash function
instead of a list.
Despite significant decrease in precision, tolerating dis-
similarity decreases execution time significantly, because
the matching segment will be identified much faster and
search space is smaller. Table 4 reports the execution time
based on different dissimilarities.
For these evaluations we conclude that our algorithm is
appropriate for systems that need to reconstruct the missing
data offline (not in real-time) and systems that can tolerate
latency, but require high accuracy. Our algorithm was the
most accurate algorithm for all sequential datasets.
Furthermore, due to its low memory use it could be imple-
mented on small devices such as wearable devices. Our report
was focused on fully missing data, butour algorithm could be
easily extended to handle partial missing data as well.
5LIMITATIONS
Although not using exact equality and tolerate the dissimilar-
ity increase the recall, we recommend to use exact similarity.
Fig. 10. Memory use comparison between the cache and the baseline.
TABLE 2
Memory Use (in KB) of Different Algorithms
Dataset Ghost missForest Amelia MICE mi
SmartHome 0.64 + 230 49.23 132.58 288.41 1708.56
Mobile 1.06 + 234 60.59 142.23 74.40 1948.12
Wearable 0.86 + 241 49.28 133.06 291.66 1711.24
Clinical 0.86 + 225 75.30 156.85 236.41 2017.19
NewsMedia 0.85 + 261 99.86 154.24 753.65 2928.63
TABLE 3
Execution Time (in Seconds) of Different Algorithms
Dataset Ghost missForest Amelia MICE mi
SmartHome 10.39 1.24 0.28 10.57 35.89
Mobile 5.38 0.61 0.32 1.77 10.68
Wearable 13.64 0.69 0.82 2.91 10.06
Clinical 7.57 0.63 0.58 2.33 7.42
NewsMedia 6.91 0.68 0.57 10.34 32.21
TABLE 4
Execution Time (in Seconds) of Different Level
of Dissimilarity Tolerance in Ghost
Dataset Ghost (n-1) Ghost (n-2) Ghost
SmartHome 7.26 6.79 10.39
Mobile 5.38 3.91 5.38
Wearable 8.64 9.01 13.64
Clinical 5.13 6.00 7.57
NewsMedia 4.36 3.78 6.91
The last column (Ghost), which is same as Table 3, presents the algorithm
while not tolerating the dissimilarity at all and looking for exact equality.
2194 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 11, NOVEMBER 2020
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
This is due to the fact that similar to other algorithm our algo-
rithms also impose a need for smoothing the data, and to keep
data close to its original values, In our experimental evalua-
tion, we analyzed the trade-off between recall and precision
by not using exact equality (see Section 4.3.1).
Our approach cannot reconstruct a missing segment at
the beginning or end of the dataset, this limitation is not
likely significant in large real-world datasets, because the
missing segment would be small relative to the dataset.
We assume this algorithm will be used to reconstruct
a missing segment of battery powered devices, and not
machines which collect hundreds of different data types.
Therefore, in our experiments we choose datasets with not too
many different features. Our largest dataset has six features.
The maximum number of multivariate data we have
experimented was six attributes (in news media dataset).
However, such a number is not exact and it depends on the
correlation and repeatability of the data. Nevertheless, by
reducing the precision and relying on smoother similarity
those types of could be handled as well. Besides, because of
the same reason in our experiments we did not test on big
data and we assume the user of this algorithm is looking for
accurately reconstructing the missing data from the smallest
possible sample of data.
As it has been reported in the evaluation, our algorithm
cannot reconstruct the data in real-time. It can be useful for
data streams and datasets which are changing frequently,
but since it requires at least one scan on the entire dataset, it
cannot operate in real-time.
6RELATED WORK
Imputation has a longstanding history in statistical analysis
[40]. On the other hand, due to inherited noise in the real-
world data, there are several promising application specific
imputation approaches proposed to reconstruct missing
data for a single application.
We propose two categories of related works. One cate-
gory of works are holistic methods that are usually based
on statistical analysis and do not consider the underlying
application. The second category of works are application
specific works that are designed based on the target applica-
tion requirements. We have evaluated our approach on five
different datasets, thus we conclude our approach is not
application specific and belongs to the first category.
Holistic Algorithms. Basic approaches for imputations use
mathematical methods to calculate the missing data such as
mean imputation, regression analysis, and missing indicator
methods [11]. Practical algorithms usually use more com-
plex approaches. In particular, there are two categories of
imputation algorithms that are widely in use. The first cate-
gory features maximum likelihood algorithms (ML) [13],
[18] such as EM (E for conditional expectation and M for
maximum likelihood) algorithms [10] and their successors
algorithms, such as the work proposed by Enders [13]. EM
algorithms usually works well with numerical data, because
they use statistical model to find missing data.
The second category of algorithms are multiple imputation
(MI) algorithms [41] such as the work proposed by Buuren
and Groothuis-Oudshoorn [8] or Honaker et al. [20]. MI
algorithms combine different imputation methods into a
single procedure (mostly expectation maximization). There-
fore, it allows for the uncertainty by creating several plausible
imputed datasets. Repeating multiple experiments makes MI
algorithms approximately unbiased. Nevertheless, the repeat-
ing and combining of different processes increases the time
and memory complexity of such approaches and thus their
are not computationally efficient, despite their superior accu-
racy. Moreover, most of MI algorithms assume that the data
are following a normal distribution [48], which is not neces-
sary true for all real-world datasets. However, MI algorithms
could benefit from ML approaches, and thus there is no dis-
tinct border between them. For example, Amelia [20], which
has been described previously, is a the well-known algorithm
that combines the classical expectation maximization with
the bootstrap approach. Rubin [25] and Schafer [43] provide
a detailed comprehensive description about statistical appro-
aches for imputation. These algorithms are still advancing
and progressing. For instance, Yuan [53] uses “Propencity
score” [39], which is used when there is a vector of observed
covariates. This algorithm generates a score for variables with
missing values. Afterwards, observations are grouped based
on these scores, and an approximate Bayesian bootstrap
imputationwill be applied on each group. Song et al. [46] uses
approximate and exact neighboring information to identify
the missing information in a dataset. Both of these recent
approaches are useful when the there are other streams of
data available, and when they can not optimally operate
when the system is completely off. Mohan proposed a method
to perform imputation by using a directed acyclic graph [28].
This approach operates based on the causal relation of nodes
in a graph.
Our work is inspired by sequence mining algorithms
[17], [30] and [54] which focus on identified ordered pat-
terns of events in a sequence. Nevertheless, the objective of
our algorithm is significantly different than sequence min-
ing algorithms.
Application Specific Methods. Application specific efforts.
such as model based imputation, try to resolve the missing
data based on the assumption that the model provides. Some
example of application specific imputation include sensor net-
work [21], [29], clinical data [5] and genome wide association
[27]. Besides, due to the nature of real-world data streams,
these approaches handle multivariate and different data types
[12], [22], [43] that vary from categorical to binary and numeri-
cal data. Usually, the process of imputation will be done in
batch mode and most of the existing approaches in this cate-
gory are computationally complex. For instance, to recon-
struct the missing data, the data will be converted into
contingency table and will be inserted into a large matrix [25],
[43]. Another example is the use of compressed sensing [12]
for sensor-network data imputation [22], which has a high
computational complexity (in both time and space). Another
work proposed by Papadimitriou et al. [29] that applies
principal component analysis (PCA) to estimate the missing
time series based on its correlation with another time series in
time stamped sensor data. The space cost of their approach is
efficient, but because of its reliance on PCA, this approach
operates in a two- dimenstional space of numerical data.
Moreover, PCA has poly nominal time complexity. Kong
et al. [22] use a customized spatio-temporal compressed
sensing approach [12] for imputing environmental sensor
RAWASSIZADEH ET AL.: GHOST IMPUTATION: ACCURATELY RECONSTRUCTING MISSING DATA OF THE OFF PERIOD 2195
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
network data. Due to use of compressed sensing and nested
matrix iteration, this approach is a polynomial and computa-
tionally complex as well. Jeon et al. [21] proposes a noise
reduction model for removing audio noises, based on the
multi-band spectral subtraction. Marchini and Howie use a
reference panel for estimating the missing data in a genome
wide association [27]. Fryett et al. [14] propose a detailed sur-
vey on comparing transcriptome imputation. Wang et al. [51]
employs Petri net to recover missing events based on the time
constraints in a set of business process sequences. Their sim-
ple case recovery has a linear computational complexity, but
the approach they propose for general cases based on branch-
ing and indexing does not have linear complexity. Some of
these efforts can be generalized for different applications as
well. For instance, the work proposed by Batista and Monard
[5] use the k(th) nearest neighbor for reconstructing the miss-
ing data. They implement their approach on clinical data
while their approach has a higher accuracy rate than basic
imputation methods, i.e., mean and linear regression. A more
recent example of imputation algorithm is proposed by
Wellenzohn et al. [52], which focuses on imputation for time
series. It introduces a concept of anchor point which is similar
to our prior segment approach. Their approach is also benefit-
ing from a prior data of the missing event and therefore inde-
pendent from linear correlation. Nevertheless, since it uses
only prior window as a constraint its recall is higher, but our
precision is higher.
Several application specific imputation rely on the period-
icity of the data. There are promising approaches to quantify
periodic changes in a dataset [7], [26] and thus improve the
application efficiency. For instance, Boegl et al. [7] rely on the
periodicity of the data to perform the imputation.
In the experimental evaluation section, we described well-
known holistic state-of-the-art algorithms [8], [20], [47], [50]
and why we have selected them.
7CONCLUSION &FUTURE WORK
Inthispaperwehaveintroducedanaccurateimputationalgo-
rithm, Ghost, that can operate on multivariate datasets. It uses
a constraint and the first similar segments, adjacent to the
missing data segment to perform the imputation process. To
improve its efficiency of the algorithm we use a cache based
optimization. Our algorithm accuracy has outperformed state-
of-the-art algorithm by 18 percent in F-score and 25 percent in
Precision. Our proposed algorithm is appropriate for systems
that produce data streams and can not hold data for long
term. Moreover, it is useful for systems that prioritize accuracy
overtheresponsetime.Asafutureworkwewilltryto
develop a distance function that can identify prior and poste-
rior segments with are in the proximity (not adjacent) of the
missing segments. Finding priors and posteriors patterns and
their distance to the missing segments could increase number
of recovery segments, and thus accuracy of the algorithm.
ACKNOWLEDGMENTS
The authors acknowledge Thomas H. Cormen for his hint on
the design of our algorithm and David Kotz for formalizing the
problem and contributing in the process of writing the paper.
REFERENCES
[1] J. Allen, “Maintaining knowledge about temporal intervals,” Com-
mun. ACM, vol. 26, no. 11, pp. 832–843, 1983.
[2] C. Anagnostopoulos and P. Triantafillou, “Scaling out big data
missing value imputations: Pythia vs. Godzilla,” in Proc. 20th
ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014,
pp. 651–660.
[3] T. Baltru
saitis, C. Ahuja, and L. P. Morency, “Multimodal
machine learning: A survey and taxonomy,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019.
[4] S. Barker, A. Mishra, D. Irwin, E. Cecchet, P. Shenoy,and J. Albrecht,
“Smart*: An open data set and tools for enabling research in sustain-
able homes,” in Proc. KDD Workshop DataMining Appl. Sustainability,
2012, Art. no. 112.
[5] G. Batista and M. Monard, “An analysis of four missing data treat-
ment methods for supervised learning,” Appl. Artif. Intell., vol. 17,
no. 5/6, pp. 519–533, 2003.
[6] B. Berger, N. M. Daniels, and Y. W. Yu, “Computational biology in
the 21st century: Scaling with compressive algorithms,” Commun.
ACM, vol. 59, no. 8, pp. 72–80, 2016.
[7] M. B’ogl, P. Filzmoser, T. Gschwandtner, S. Miksch, W. Aigner,
A. Rind, and T. Lammarsch, “Visually and statistically guided
imputation of missing values in univariate seasonal time
series,” in Proc. IEEE Conf. Visual Analytics Sci. Technol., 2015,
pp. 189–190.
[8] S. Buuren and K. Groothuis-Oudshoorn, “mice: Multivariate
imputation by chained equations in R,” J. Statistical Softw., vol. 45,
no. 3, pp. 1–68, 2011.
[9] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to
Algorithms. Cambridge, MA, USA: MIT Press, 2009.
[10] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from
incomplete data via the EM algorithm,” J. Roy. Statistical Soc..
Series B (Methodological), vol. 39, pp. 1–38, 1977.
[11] A. Donders, G. van der Heijden, T. Stijnen, and K. Moons,
“Review: A gentle introduction to imputation of missing values,”
J. Clinical Epidemiology, vol. 59, no. 10, pp. 1087–1091, 2006.
[12] D. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52,
no. 4, pp. 1289–1306, Apr. 2006.
[13] C. Enders, “A primer on maximum likelihood algorithms available
for use with missing data,” Structural Equation Model., vol. 8, no. 1,
pp. 128–141, 2001.
[14] J. Fryett, J. Inshaw, A. Morris, and H. Cordell, “Comparison of
methods for transcriptome imputation through application to
two common complex diseases,” Eur. J. Human Genetics, vol. 26,
pp. 1658–1667, 2018.
[15] A. G
eron, Hands on Machine Learning with Scikit-Learn and Tensor-
Flow: Concepts, Tools, and Techniques to Build Intelligent Systems.
Sebastopol, CA, USA: O’Reilly Media, 2017.
[16] Z. Ghahramani, “Probabilistic machine learning and artificial
intelligence,” Nature, vol. 521, no. 7553, pp. 452–459, 2015.
[17] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu,
“FreeSpan: Frequent pattern-projected sequential pattern mining,”
in Proc. 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
2000, pp. 355–359.
[18] H. Hartley and R. Hocking, “The analysis of incomplete data,”
Biometrics, vol. 27, no. 4, pp. 783–823, 1971.
[19] J. Hern
andez-Lobato, N. Houlsby, and Z. Ghahramani, “Probabilistic
matrix factorization with non-random missing data,” in Proc. Int.
Conf. Mach. Learn., 2014, pp. 1512–1520.
[20] J. Honaker, G. King, and M. Blackwell, “Amelia II: A program for
missing data,” J. Statistical Softw., vol. 45, no. 7, pp. 1–47, 2011.
[21] K. Jeon, N. Park, D. Lee, and H. Kim, “Audio restoration based on
multi-band spectral subtraction and missing data imputation,” in
Proc. IEEE Int. Conf. Consum. Electron., 2014, pp. 522–523.
[22] L. Kong, M. Xia, X. Liu, M. Wu, and X. Liu, “Data loss and re-
construction in sensor networks,” in Proc. IEEE INFOCOM, 2013,
pp. 1654–1662.
[23] B. Lake,R. Salakhutdinov, and J. Tenenbaum, “Human-levelconcept
learning through probabilistic program induction,” Sci., vol. 350,
no. 6266, pp. 1332–1338, 2015.
[24] J. Lin, E.Keogh, S. Lonardi, and B. Chiu, “A symbolic representation
of time series, with implications for streaming algorithms,” in Proc.
8th ACM SIGMOD Workshop Res. Issues Data Mining Knowl. Discovery,
2003, pp. 2–11.
[25] R. Little and D. Rubin, Statistical Analysis with Missing Data.Hoboken,
NJ,USA:Wiley,2014.
2196 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 11, NOVEMBER 2020
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
[26] C. Loglisci and D. Malerba, “Mining periodic changes in complex
dynamic data through relational pattern discovery,” in Proc. Int.
Workshop New Frontiers Mining Complex Patterns, 2015, pp. 76–90.
[27] J. Marchini and B. Howie, “Genotype imputation for genome-
wide association studies,” Nature Rev. Genetics, vol. 11, no. 7,
pp. 499–511, 2010.
[28] K. Mohan, J. Pearl, and J. Tian, “Graphical models for inference
with missing data,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
2013, pp. 1277–1285.
[29] S. Papadimitriou, J. Sun, and C. Faloutsos, “Streaming pattern
discovery in multiple time-series,” in Proc. 31st Int. Conf. Very
Large Data Bases, 2005, pp. 697–708.
[30] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and
M. Hsu, “PrefixSpan: Mining sequential patterns efficiently by
prefix-projected pattern growth,” in Proc. Int. Conf. Comput. Com-
mun. Netw., 2001, pp. 215–224.
[31] R. Rawassizadeh, C. Dobbins, M. Akbari,and M. Pazzani, “Indexing
multivariate mobile data through spatio-temporal event detection
and clustering,” Sensors, vol. 19,no. 3, 2019, Art. no. 448.
[32] R. Rawassizadeh, E. Momeni, C. Dobbins, J. Gharibshah, and
M. Pazzani, “Scalable daily human behavioral pattern mining
from multivariate temporal data,” IEEE Trans. Knowl. Data Eng.,
vol. 28, no. 11, pp. 3098–3112, Nov. 2016.
[33] R. Rawassizadeh, E. Momeni, C. Dobbins, P. Mirza-Babaei, and
R. Rahnamoun, “Lesson learned from collecting quantified self
information via mobile and wearable devices,” J. Sensor Actuator
Netw., vol. 4, no. 4, 2015, Art. no. 315.
[34] R. Rawassizadeh, T. Pierson, R. Peterson, and D. Kotz, “NoCloud:
Exploring network disconnection through on-device data analysis,”
IEEE Pervasive Comput., vol. 17, no. 1, pp. 64–74, Jan.–Mar. 2018.
[35] R. Rawassizadeh, B. Price, and M. Petre, “Wearables: Has the age
of smartwatches finally arrived?,” Commun. ACM, vol. 58, no. 1,
pp. 45–47, 2015.
[36] R. Rawassizadeh, M. Tomitsch, M. Nourizadeh, E. Momeni,
A. Peery, L. Ulanova, and M. Pazzani, “Energy-efficient integration
of continuous context sensing and prediction into smartwatches,”
Sensors, vol. 15, no. 9, pp. 22616–22645, 2015.
[37] R. Rawassizadeh, M. Tomitsch, K. Wac, and A. Tjoa, “UbiqLog: A
generic mobile phone-based life-log framework,” Pers. Ubiquitous
Comput., vol. 17, no. 4, pp. 621–637, 2013.
[38] M. Resche-Rigon and I. R. White, “Multiple imputation by chained
equations for systematically and sporadically missing multilevel
data,” Statistical Methods Med. Res., vol. 27, no. 6, pp. 1634–1649, 2018.
[39] P. R. Rosenbaum and D. B.Rubin, “The central role of the propensity
score in observational studies for causal effects,” Biometrika,vol.70,
no. 1, pp. 41–55, 1983.
[40] D. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3,
pp. 581–592, 1976.
[41] D. Rubin, “Multiple imputations in sample surveys- A phenome-
nological Bayesian approach to nonresponse,” Proc. Survey Res.
Methods Section Amer. Statistical Assoc., vol. 1, pp. 20–34, 1978.
[42] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach,
3rd ed. London, U.K.: Pearson Education Limited, 2011.
[43] J. L. Schafer, Analysis of Incomplete Multivariate Data. Boca Raton,
FL, USA: CRC Press, 1997.
[44] C. E. Shannon, “A mathematical theory of communication,” Bell
Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948.
[45] S. Shaphiro and M. Wilk, “An analysis of variance test for normal-
ity,” Biometrika, vol. 52, no. 3, pp. 591–611, 1965.
[46] S. Song, A. Zhang, L. Chen, and J. Wang, “Enriching data imputa-
tion with extensive similarity neighbors,” Proc. VLDB Endowment,
vol. 8, no. 11, pp. 1286–1297, 2015.
[47] D. Stekhoven, “MissForest: Nonparametric missing value imputa-
tion using random forest,” Astrophysics Source Code Library, 2015.
[48] J. Sterne, I. White, J. Carlin, M. Spratt, P. Royston, M. Kenward,
A. Wood, and J. Carpenter, “Multiple imputation for missing data
in epidemiological and clinical research: Potential and pitfalls,”
Brit. Med. J., vol. 338, 2009, Art. no. b2393.
[49] B. Strack, J. DeShazo, C. Gennings, J. Olmo, S. Ventura, K. Cios, and
J. Clore, “Impact of HbA1c measurement on hospital readmission
rates: Analysis of 70,000 clinical database patient records,” BioMed
Res. Int.,vol. 2014, 2014, Art. no. 781670.
[50] Y. Su, A. Gelman, J. Hill, and M. Yajima, “Multiple imputation
with diagnostics (mi) in R: Opening windows into the black box,”
J. Statistical Softw., vol. 45, no. 2, pp. 1–31, 2011.
[51] J. Wang, S. Song, X. Zhu, and X. Lin, “Efficient recovery of missing
events,” Proc. VLDB Endowment, vol. 6, no. 10, pp. 841–852, 2013.
[52] K. Wellenzohn, M. H.B €
ohlen, A. Dign€
os, J. Gamper, andH. Mitterer,
“Continuous imputation of missing values in streams of pattern-
determining time series,” in Proc. Int. Conf. Extending Database Tech-
nol., 2017, pp. 330–341.
[53] Y. Yuan, “Multiple imputation for missing data: Concepts and
new development,” in Proc. Twenty-Fifth Annu. SAS Users Group
Int. Conf., vol. 267, 2010.
[54] M. Zaki, “SPADE: An efficient algorithm for mining frequent
sequences,” Mach. Learn., vol. 42, no. 1/2, pp. 31–60, 2001.
Reza Rawassizadeh received the BSc degree in
software engineering, the master’s degree in com-
puter science, and the PhD degree in computer
science from the University of Vienna, Austria, in
2012. He is an assistant professor with the Depart-
ment of Computer Science, Metropolitan College,
Boston University. His research interests include
data mining, ubiquitous computing, and applied
machine learning.
Hamidreza Keshavarz received the PhD degree
from Tarbiat Modares University, Tehran, Iran, in
2018. His research is focused on developing algo-
rithms and techniques for sentiment analysis and
data mining. His interests include metaheuristic
algorithms, information retrieval, computational intel-
ligence, pattern recognition, and machine learning.
Michael Pazzani received the PhDdegree in com-
puter science from the University of California, Los
Angeles (UCLA). He is vice chancellor for research
and economic development and a professor of
computer science with the University of California,
Riverside. He was a professor with the University
of California, Irvine, where he also served as chair
of information and computer science. His research
interests include machine learning, personalization,
and cognitive science.
"
For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/csdl.
RAWASSIZADEH ET AL.: GHOST IMPUTATION: ACCURATELY RECONSTRUCTING MISSING DATA OF THE OFF PERIOD 2197
Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on October 08,2020 at 19:23:07 UTC from IEEE Xplore. Restrictions apply.
























