Ghost Imputation: Accurately Reconstructing
Missing Data of the Off Period
Reza Rawassizadeh, Hamidreza Keshavarz, and Michael Pazzani
Abstract—Noise and missing data are intrinsic characteristics of real-world data, leading to uncertainty that negatively affects the quality of knowledge extracted from the data. The burden imposed by missing data is often severe in sensors that collect data from the physical world, where large gaps of missing data may occur when the system is temporarily off or disconnected. How can we reconstruct missing data for these periods? We introduce an accurate and efficient algorithm for missing data reconstruction (imputation) that is specifically designed to recover off-period segments of missing data. This algorithm, Ghost, searches the sequential dataset to find data segments whose prior and posterior segments match those of the missing data. If there is a similar segment that also satisfies the constraint, such as location or time of day, then it is substituted for the missing data. A baseline approach results in quadratic computational complexity, so we introduce a caching approach that reduces the search space and improves the computational complexity to linear in the common case. Experimental evaluations on five real-world datasets show that our algorithm significantly outperforms four state-of-the-art algorithms with an average of 18 percent higher F-score.
Index Terms—Imputation, multivariate, pattern mining
1INTRODUCTION
WITH recent technological advances and increases in
computing capabilities, data intensive scientific discov-
ery is being widely used. This has led to the introduction of
methods for analyzing data collected from multiple sources
of information, i.e., “multivariate data”.
1
One of the inevitable
challenges of real-world data analysis is uncertainty rising
from noise and missing data [16]. This uncertainty negatively
affects the quality of knowledge extracted from the data.
Indeed, the burden imposed by missing data is often severe in
applications collecting data from the physical world, e.g.,
mobile sensing [33] or genome sequencing [6].
For example, consider battery-powered devices, such as smartwatches, equipped with inexpensive sensors such as ambient light and accelerometer. Due to sensor quality, battery limits, and user preferences, context-sensing applications cannot continuously and effectively collect data [33], and there are often segments of missing data, e.g., when the device is turned off. These missing segments affect the quality of knowledge-extraction methods [31]. Although missing data reconstruction is an important requirement of these systems, it has not received much attention.
There are longstanding efforts in statistics [10], [40], [43] to reconstruct missing data. These imputation methods assume the missing data points occur at random, i.e., missing at random (MAR) or missing completely at random (MCAR). If the data is missing not at random (MNAR) [40], the imputation process is more challenging.
In this paper we propose a novel algorithm for imputation of multivariate sensor data. This algorithm only uses (i) a constraint, such as time of the day or location, (ii) the data values immediately prior to the missing event, and (iii) the data values immediately following the missing event. Since our method does not rely on statistical methods, it might be able to handle some MNAR data, but only if a similar segment exists in the dataset.
In particular, our algorithm operates on multivariate and sequential data streams. It reads two adjacent data segments, one before and one after the missing data (missing segment), and searches the dataset to find two segments similar to the adjacent segments of the missing segment. If the segment between these two similar segments is of the same length as the missing segment, it is a candidate recovery segment. Next, if the constraint values of the segment of interest match the constraint values of the missing segment, the algorithm substitutes the missing segment with the content of this candidate recovery segment. A naive approach imposes a quadratic computational complexity, so we add a pre-processing step that reads all data segments and their indexes into a cache, achieving a linear computational complexity in the common case.
The characteristics and contributions of our algorithm
are as follows.
1. Multivariate is slightly different than multimodal. Modality refers to the way an event happens or is experienced [3]. In other words, it refers to primary channels of sensation and communication, e.g., touch and vision.
R. Rawassizadeh is with the Department of Computer Science, Metropolitan College, Boston University, Boston, MA 02215. E-mail: rrawassizadeh@acm.org.
H. Keshavarz is with the Department of Electrical and Computer Engineering,
Tarbiat Modares University, Tehran, Iran. E-mail: keshavarz.h@modares.ac.ir.
M. Pazzani is with the Department of Computer Science, University of
California at Riverside, Riverside, CA 92521. E-mail: michael.pazzani@ucr.edu.
Manuscript received 15 Oct. 2018; revised 25 Apr. 2019; accepted 30 Apr.
2019. Date of publication 3 May 2019; date of current version 6 Oct. 2020.
(Corresponding author: Reza Rawassizadeh.)
Recommended for acceptance by S. Cohen.
Digital Object Identifier no. 10.1109/TKDE.2019.2914653
Heterogeneous Multivariate Real-World Data. Statistical imputation approaches are optimized to handle numerical data [40], [43]. Real-world systems, however, produce data in numerical, categorical, or binary forms. Our algorithm relies on a categorical abstraction of the original data by converting data values to categorical symbols, e.g., by bucketing numeric data into categories. Therefore, in contrast to statistical-based imputation, any data type, regardless of its distribution, can be fed into this algorithm, i.e., non-parametric imputation. All datasets we employed in this work are real-world datasets and most of them (wearable, mobile, IoT, and news media datasets) have not previously been used for imputation studies. We recommend using this algorithm mainly for multivariate sensor data from consumer IoT and mobile devices, but to demonstrate its versatility, we also experiment with it on two other real-world datasets (clinical data and real estate data).
Instance-Based Learning. Inspired by algorithms that learn from a single labeled instance [15], [23], our algorithm tries to estimate the missing data from the first similar instance that can be found in a sequential search of the dataset. Clearly, relying on a single label (similar instance) is prone to false-positive errors. Instead, we rely on a constraint as a controlling variable that significantly reduces false positives. Our definition of constraint is inspired by binary constraints in constraint satisfaction problems (CSP) [42], but, unlike in traditional CSP, it is not used to reduce the search space.
Search Space. Continuously collecting and storing data can be expensive in terms of resource usage, especially in battery-powered wireless devices. Data is typically not stored long-term on these devices, and most data processing is conducted in cloud servers [34]. Our algorithm can reconstruct the missing data merely by finding the first match for the missing segment, without a need to search the entire dataset. For instance, we have used only three days of a smartphone dataset and only seven days of a smartwatch dataset. These datasets are fairly small, but in both of these examples, our algorithm outperforms state-of-the-art algorithms and reconstructs the missing data with higher accuracy.
Note that all versions of our algorithm, i.e., the baseline and cache-based ones, have only one effective parameter: the window size. In the evaluation section we identify an optimal window size value for each dataset. Therefore, based on the target dataset (or application), this parameter could be assigned automatically, and there is no need for a user with domain knowledge to decide on its optimal value. There is another parameter used for tolerating slight dissimilarity; we will demonstrate why users should not tolerate dissimilarity.
2 PROBLEM STATEMENT AND DEFINITIONS
A system produces a dataset over time. For example, a smartwatch system produces sequential data about its wearer's activity, physiology, and environment. The system generates data intermittently, when it is on, but may be off for extended periods.
Data value is a data unit in numeric, categorical or binary form. For our purposes we convert all such data types into categorical form, so that all data values are represented by symbols from a finite alphabet. This has an inevitable impact on the accuracy of the data. However, all imputation methods which can handle multivariate data have such a limitation.
Dataset is a temporal sequence of data records, produced by a single system over time. Because typical systems produce multiple types of data (multivariate data), we envision each record comprises one value from each type of data produced by that system. We discretize time such that each data record $r_i$ represents a brief period of time beginning at time $t_i$, where $i$ represents the index of the record within the dataset ($1 \le i \le n$). We represent $r_i$ as a tuple comprising one data symbol for each of the streams produced by the system during time interval $t_i$, plus one special stream 0 called the constraint stream. Thus, we can write $r_i = (s_{i0}, s_{i1}, \ldots, s_{i\ell})$.
If a stream produces no data during time interval $t_i$, record $r_i$ includes the null symbol $\emptyset$ for that stream. For periods where the system is off, the records during that period will include the null symbol for every data stream.
Constraint stream is the portion of every data record $i$ presenting the value of the constraint, i.e., $s_{i0}$; for notational convenience, we refer to this stream as $S_0 = [s_{1,0}, s_{2,0}, \ldots, s_{n,0}]$. The constraint stream, however, is not produced by the system, and we assume it can be generated where the dataset is assembled for post-processing; for example, the constraint may be "time of day". The choice of a suitable constraint depends on the kind of data being processed. Other constraint examples are described in Section 4.1.
Segment is a subsequence of the dataset; thus for dataset $d \equiv [r_1, r_2, \ldots, r_i, \ldots, r_n]$ we can refer to a segment $d_{ij} \equiv [r_i, \ldots, r_j]$, for $1 \le i \le j \le n$.
Missing Segment. A segment in which all its data values (other than constraint values) are missing (null symbols); that is, $d_{ij}$ is a missing segment iff $\forall i \le k \le j, \forall 1 \le l \le \ell: s_{kl} = \emptyset$. A missing segment occurs when the target system is not collecting data at all, because it is off or unavailable, during time intervals $t_i$ through $t_j$.
Window size $w$ is a parameter of our algorithm; a small number $w$ such that a segment of $w$ records is a "window" into the dataset; window $d_i \equiv d_{i,\,i+w-1}$.
Prior window $\overleftarrow{d}_{ij}$ is the window of data immediately preceding a missing segment. If $d_{ij}$ is the missing segment, then the prior window is $\overleftarrow{d}_{ij} \equiv d_{i-w,\,i-1}$.
Posterior window $\overrightarrow{d}_{ij}$ is the window of data immediately following a missing segment. If $d_{ij}$ is the missing segment, then the posterior window is $\overrightarrow{d}_{ij} \equiv d_{j+1,\,j+w}$.
Segment Equality. Two segments are equal if they have the same size and identical values for all of their data values. For example, $d_{ij} \equiv d_{kl}$ iff $(l - k = j - i) \wedge (\forall\, 0 \le p \le j - i,\ \forall q: s_{i+p,q} = s_{k+p,q})$.
Segment Similarity. Two segments are similar if they have the same size, the similarity check function returns true while comparing their prior and posterior segments, and they have identical values in their constraint stream. That is, segment $d_{ij} \approx d_{kl}$ iff
$(l - k = j - i) \wedge (\overleftarrow{d}_{ij} = \overleftarrow{d}_{kl}) \wedge (\overrightarrow{d}_{ij} = \overrightarrow{d}_{kl}) \wedge (\forall\, 0 \le p \le j - i: \mathrm{sim}(s_{i+p,0}, s_{k+p,0}) = \mathrm{True})$.
Recovery segment $d_{rec}$ is a segment of data that can be substituted into the dataset in place of a missing segment. Our approach finds segments elsewhere in the dataset to use as recovery segments. For example, if $d_{ij}$ is the missing segment, and another segment $d_{kl}$ is similar ($d_{ij} \approx d_{kl}$), then $d_{kl}$ is a candidate recovery segment.
Cache $C$ is a table with one entry for each 'unique' window, listing the index (or indices) where that window occurs in the dataset. In the simplest case, one entry in the table looks like $(d_i, \{i\})$. If, however, there is another window in the dataset $d_j$ such that $d_j = d_i$, then the entry in the table would instead be $(d_i, \{i, j\})$. When complete, the cache is therefore the smallest set of entries $C \equiv \{(d_e, I)\}$ such that the cache represents the whole dataset, $\forall d_i \in d,\ \exists (d_e, I) \in C \mid (d_e = d_i \wedge i \in I)$, and such that the cache has as few entries as possible, $\forall (d_e, I) \in C,\ \nexists (d_f, J) \in C \mid d_e = d_f$. If there were two windows $d_e = d_f$ in the cache $C$, their entries would simply be combined into one entry representing the union of the indices where that window occurs.
To better understand the cache, consider Fig. 1. There is a missing segment at index 6, that is, $d_{miss} = d_{6,8}$. Given $w = 2$, for the prior window $\overleftarrow{d}_{miss} \equiv [(1, B, 2, X), (1, E, 1, X)]$, the cache holds two beginning indices: $\{4, 12\}$. This means that this window value has been repeated twice, at $d_4$ and at $d_{12}$. The same is applicable for the posterior window $\overrightarrow{d}_{miss}$, which appears at $d_9$ and $d_{17}$.
Problem. Given a missing segment $d_{miss}$, its prior $\overleftarrow{d}_{miss}$, its posterior $\overrightarrow{d}_{miss}$ and the constraint stream $S_0$, the objective of our algorithm is to accurately find a recovery segment $d_{rec}$ such that $d_{rec} \approx d_{miss}$, and replace $d_{miss}$ with $d_{rec}$.
Function 1. Segment similarity function. sim is a function that receives two segments $d_a$, $d_b$ and a confidence threshold $\epsilon$. This function uses the Jaccard measure to compare the two input segments; it is a binary comparison between the data members of each segment. If their similarity is greater than or equal to $\epsilon$, it returns true; otherwise it returns false. Since we are dealing with binary comparisons between the similarities of two sets of discrete data, Jaccard is the most suitable similarity measure for our application:
$$\mathrm{sim}(d_a, d_b, \epsilon) = \begin{cases} 1 & \text{if } \dfrac{|d_a \cap d_b|}{|d_a \cup d_b|} \ge \epsilon \\ 0 & \text{otherwise.} \end{cases}$$
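For illustration, the following minimal Python sketch shows one plausible reading of this check, treating each segment as the set of its data values; it is not the authors' R implementation, and the flattening of records into a value set and the parameter name epsilon are assumptions made here for clarity.

```python
def sim(segment_a, segment_b, epsilon):
    """Return True when the Jaccard similarity of the two segments' value sets
    is at least epsilon; epsilon = 1.0 demands identical value sets."""
    values_a = {value for record in segment_a for value in record}
    values_b = {value for record in segment_b for value in record}
    union = values_a | values_b
    if not union:
        return True  # two empty segments are trivially identical
    return len(values_a & values_b) / len(union) >= epsilon
```

Under this reading, lowering epsilon tolerates more dissimilarity between the compared segments, which corresponds to the (n-1) and (n-2) tolerance settings evaluated in Section 4.3.1.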
Function 2. Segment match function. $\mathrm{cmatch}(\overleftarrow{d}, \overrightarrow{d}, d_{miss})$ is a function that uses the hash table (cache) to compare the offsets of $\overleftarrow{d}$ and $\overrightarrow{d}$ with each other. If the distance between their offsets is exactly equal to the size of $d_{miss}$, then it returns "true"; otherwise it returns "false". More about this will be described when we present the caching algorithm.
3 ALGORITHM
Our algorithm has three main characteristics. First, it uses a constraint value, which we assume is known for all records. Consider a smartwatch app that continuously collects ambient light; when the user puts her hand inside her pocket while seated, the ambient light is zero. While she is sleeping, the ambient light is also zero. Sleeping and sitting with hands inside the pockets are two different behaviors. Therefore, an algorithm that tries to estimate missing values of ambient light should consider the time of the day (a constraint) while performing imputation. Multiple-imputation methods often have better accuracy than a single-imputation approach, but are often less efficient. Our algorithm mitigates the accuracy issue by relying on the constraint.
The algorithm's second characteristic is the discretization of data. Discretization enables our algorithm to handle any data type, including binary, numerical or categorical values. Discretizing numeric data may have some impact on accuracy depending on the granularity [24]. To avoid decreasing the accuracy further, our algorithm compares data objects based on equality and does not use any other similarity metrics. We also discretize the timestamp into buckets using the temporal granularity [32] (e.g., if the granularity is five minutes then 11:32 → 11:30), and to discretize time series data we use the SAX algorithm [24].
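As a small illustration of the temporal bucketing step only (the SAX discretization of numeric streams is not shown), here is a Python sketch assuming a five-minute granularity:

```python
from datetime import datetime

def bucket_time_of_day(timestamp: datetime, granularity_minutes: int = 5) -> str:
    """Round a timestamp down to the start of its bucket, e.g., 11:32 -> '11:30'."""
    minute = (timestamp.minute // granularity_minutes) * granularity_minutes
    return f"{timestamp.hour:02d}:{minute:02d}"

print(bucket_time_of_day(datetime(2018, 10, 15, 11, 32)))  # prints '11:30'
```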
The third characteristic is its statistical distribution independence, unlike several other state-of-the-art imputation algorithms [8], [20], [41], which require the data to be normally distributed. In traditional imputation methods, MCAR and MAR cause losses of statistical significance, and MNAR can cause bias in the result [11]. Recently, methods have been designed specifically to deal with MNAR [19], [28]. Since our algorithm relies on the first similar segment to reconstruct the missing data and does not use any statistical methods, it might be able to reconstruct MNAR data, but only if a similar segment exists in the dataset.
The quality of data reconstruction strongly depends on the choice of constraint. Therefore, it is important for the user of the algorithm to be familiar with the dataset and have some domain knowledge.
3.1 Baseline Algorithm
Fig. 1. (top) An abstract representation of a system with three data streams that goes offline from $t_n$ to $t_{n+2}$. (bottom) Based on the exactly similar posterior segment, prior segment and constraint, the missing segment is identified from $t_m$ to $t_{m+2}$ and recovered.
To better understand the algorithm, we first describe its baseline version. Consider the dataset in Fig. 1 (top), where three streams produce data, i.e., $S_1$, $S_2$ and $S_3$. The constraint stream is $S_0$. From index 6 to 8 there is no data available, thus $d_{miss} = d_{6,8}$. Given a window size of two, the algorithm reads the prior window $\overleftarrow{d}_{6,8} = [(1, B, 2, X), (1, E, 1, X)]$, the posterior window $\overrightarrow{d}_{6,8} = [(2, D, 3, Y), (2, C, 4, \cdot)]$, and the constraint content $[1, 2, 2]$. Then, the algorithm scans the dataset until it encounters a window similar to $\overleftarrow{d}_{6,8}$, which it finds at $d_{12}$; it next checks whether the corresponding window $d_{17}$ is similar to the posterior window of the missing segment (i.e., whether $d_{17} \approx \overrightarrow{d}_{6,8}$). If so, this newly found segment $d_{14,16}$
is a candidate recovery segment. If the constraint values in $d_{14,16}$, i.e., $[1, 2, 2]$, are equal to those in $d_{6,8}$, i.e., $[1, 2, 2]$, then this candidate is a recovery segment and it will be substituted for the missing data segment, as shown in Fig. 1 (bottom). The search process stops immediately after a recovery segment has been found; if there is more than one possible substitution, only the first similar segment will be chosen. Thus, this algorithm does not perform "multiple imputation" [38]. Also note that, to reconstruct the missing data, our algorithm relies on the comparison of all available data sources (multivariate) inside the posterior and prior segments. This means that each information source inside the prior and posterior will be a condition for comparison. The algorithm can operate on a single source of information, but it will be more accurate when used with multivariate datasets.
The baseline algorithm is simple and thus we do not
describe its pseudo code in detail.
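Since the pseudo code is omitted, the following Python sketch illustrates the baseline quadratic search under stated assumptions: records are lists or tuples with the constraint value at index 0, missing values are represented by None, and exact equality is used throughout. It is an illustration, not the authors' implementation.

```python
def is_missing(record):
    """A record is missing when every non-constraint value is absent (None)."""
    return all(value is None for value in record[1:])

def baseline_impute(dataset, i, j, w):
    """Scan the whole dataset for a segment whose prior window, posterior window,
    length, and constraint values match those of the missing segment
    dataset[i:j+1]; substitute it in place on success (exact-equality case)."""
    prior = dataset[i - w:i]
    posterior = dataset[j + 1:j + 1 + w]
    length = j - i + 1
    for k in range(0, len(dataset) - length - 2 * w + 1):
        candidate = dataset[k + w:k + w + length]
        if (dataset[k:k + w] == prior
                and dataset[k + w + length:k + 2 * w + length] == posterior
                and not any(is_missing(r) for r in candidate)
                and all(candidate[p][0] == dataset[i + p][0] for p in range(length))):
            dataset[i:j + 1] = candidate  # substitute the recovery segment
            return True
    return False
```

A full driver would first scan the dataset for contiguous runs of missing records and call this routine once per missing segment, which is what makes the overall cost quadratic in the worst case.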
3.2 Baseline Computational Complexity
The computational complexity of the algorithm is proportional to the number of missing segments, and to the amount of data searched to find a recovery segment for each missing segment. Assuming there are $n$ tuples in the dataset, the algorithm can find missing segments in a linear scan of the dataset, in $O(n)$ time; in a worst-case scenario it may need to scan the whole dataset to find a recovery segment for each missing segment, also in $O(n)$ time. This means that the computational complexity is quadratic, i.e., $O(n^2)$. Since this is not optimal, especially for battery-powered devices, we should improve it. By using a cache we can mitigate this performance overhead.
3.3 Caching Algorithm
Caching is widely used to improve the performance of algorithms [9]. Here we adopt a caching mechanism that reduces the search space and thus the time complexity. To implement the cache we use a dictionary data structure that hosts all unique window values and their beginning indices. Note that missing records are skipped when building the cache of window values.
By using a hash table to implement the dictionary, returning a list of indices where a window value occurs in the dataset, the cache allows the algorithm to consider only a limited set of possible candidate recovery segments. In other words, using the cache reduces the search space from two different perspectives: (i) instead of searching the entire dataset, which includes repetitive data, it searches a table of window values that are all unique, with no repeated data; (ii) instead of searching the dataset to find the similar posterior and prior segments, it first searches the list of indices for matches to the prior window, then, if a match is found, it checks the size match between the missing and candidate recovery segments, and then, if both previous conditions are true, it checks the posterior segments. If all three conditions are true, it accepts the candidate recovery segment as the final recovery segment for the missing segment.
Given a window size of two, Fig. 2 is an abstract representation that shows how the algorithm builds the cache. In particular, the cache creation process scans the dataset, sequentially examining each window and updating the dictionary to add an entry for new window values or to add a new index to an existing dictionary entry.
The table at the bottom of Fig. 2 presents the dictionary as it is being created. For the sake of simplicity, at the beginning we show all four possible data segments, and for the rest we only highlight repeated data segments. For the purpose of explanation, we have assigned a name to each window, a1, b2, ..., and because windows are overlapping we have used different colors and names (a, b) to distinguish them from each other.
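A minimal Python sketch of this cache-building pass, assuming the same record layout as above (constraint at index 0, None for missing values); the dictionary keys and the skipping rule follow the description in the text, while the helper name is illustrative:

```python
def build_cache(dataset, w):
    """Map every w-record window (as a hashable tuple of records) to the list
    of indices where it starts; windows that touch missing records are skipped."""
    cache = {}
    for start in range(len(dataset) - w + 1):
        window = dataset[start:start + w]
        if any(all(v is None for v in rec[1:]) for rec in window):
            continue  # do not cache windows containing missing records
        key = tuple(tuple(rec) for rec in window)
        cache.setdefault(key, []).append(start)
    return cache
```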
Algorithm 1. Cache-Based Implementation of the Algorithm
Data: dataset $d$ with $n$ records $r_i$ (where $1 \le i \le n$), window size $w$, cache $C$, similarity threshold $\epsilon$
Result: updated $d$
// outer loop: scan $d$ to find missing segments
1  $i := w + 1$  // we skip over the first window
2  while ($i \le n - w$) do
     // scan to find a contiguous missing segment starting at $i$
3    $j := i$
4    while (isMissing($d_{i,j+1}$) $\wedge$ ($j < n - w$)) do
5      $j := j + 1$
6    if (isMissing($d_{ij}$)) then
7      $l := j - i + 1$  // length of missing segment
8      $\overleftarrow{d}_{ij} := d_{i-w,\,i-1}$  // prior window
9      $\overrightarrow{d}_{ij} := d_{j+1,\,j+w}$  // posterior window
       // use cache to iterate over the set of indices where $\overleftarrow{d}_{ij}$ is found in $d$, if any
10     forall $k \in$ csearch($C, \overleftarrow{d}_{ij}, \epsilon$) do
         // exclude candidates close to the end of the dataset
11       if ($k + w + l - 1 \le n - w$) then
           // candidate recovery segment
12         $d_{rec} := d_{k+w,\,k+w+l-1}$
           // check whether posterior windows match, based on the epsilon threshold
13         if (sim($\overrightarrow{d}_{rec}, \overrightarrow{d}_{ij}, \epsilon$)) then
             // now check whether constraints match; recall $s_{il}$ are the fields of record $r_i$
14           if ($\forall\, 0 \le p \le j - i: s_{i+p,0} = s_{k+w+p,0}$) then
15             $d_{ij} := d_{rec}$  // substitute into the missing segment
16             break
17   $i := j + 1$  // continue the scan after this segment
Fig. 2. The cache creation process that loads the dataset into the hash table. It uses an overlapping sliding window, equal to the given ws size, to read all possible data segments and their beginning offsets.
The current implementation of the cache construction adds windows into the dictionary with a simple check. If
the window exists in the dictionary, the cache updates the existing entry by adding the new index to the existing list of indices. Therefore, there will be no collisions, unlike when using hash functions.
In the common case, there will be a small number of indices in each dictionary entry. Therefore, if the dictionary is implemented as a hash table, the insert and lookup time will be $O(1)$, but there is a risk of collisions. In our algorithm, the process of reading the dataset and caching its windows is a one-time pre-processing step. In other words, this cache creation step occurs once at the beginning and is not repeated. Nevertheless, we consider this overhead in our evaluation. In the worst case, its computational complexity is $O(n^2)$, but in common cases it is close to $O(n)$. In particular, unusual or degenerate datasets could lead to $O(n^2)$ behavior if, for instance, all data windows have the same value and the dictionary has only one entry with a list of $O(n)$ indices. After creating the dictionary (cache), we can run a more efficient algorithm to find and replace missing segments, as shown in Algorithm 1.
Note that most of the overhead lies in scanning the content of the dictionary and comparing the cached values, a scan that a hash function cannot improve either. Besides, our algorithm assumes that the given dataset has no predefined structure, such as ordered data streams. Considering the space and time complexity, ordering the data stream is not cost-effective. Besides, we need to store all unique (not all) data segments in the cache. Therefore, we do not recommend using a prefix tree or any other type of tree, due to the space and time inefficiency required for ordering data or relying on the structure of data.
In the first step, the algorithm iterates through the dataset and identifies missing segments. It checks the placement of the missing segment to locate its posterior and prior in line 4. Then, it reads the associated posterior and prior segments in lines 8 and 9. From line 10 the algorithm starts to search the cache to find offsets of the prior segment or a similar offset (it tolerates dissimilarity based on $\epsilon$). In line 11 it excludes prior segments which are close to the end of the dataset, because no associated posterior would exist for them. Next, it checks the window size, and based on the prior segment and window size it loads the candidate recovery segment into $d_{rec}$ in line 12. However, the algorithm is still not sure whether this is a correct recovery segment, and it should check the posterior as well. Line 13 compares the posterior segment of $d_{rec}$ with the posterior of the missing segment $d_{ij}$. If they match (based on the given epsilon), the next step is to check their constraints, which should be identical. Note that constraint matching checks for exact equality and there is no tolerance (no $\epsilon$ is used). If they match as well, all three conditions are true, and in line 15, $d_{rec}$ is returned as a recovery segment for the identified missing segment.
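To make this walkthrough concrete, here is a hedged Python sketch of the cache lookup for the exact-equality case; the comments map to the line numbers of Algorithm 1, the cache is the dictionary produced by the build_cache sketch above, and the function name is an assumption, not the authors' code.

```python
def find_recovery(dataset, cache, i, j, w):
    """Look up a recovery segment for the missing segment dataset[i:j+1] using
    the window cache; returns the recovered records or None (exact-match case)."""
    length = j - i + 1
    prior = tuple(tuple(r) for r in dataset[i - w:i])
    posterior = tuple(tuple(r) for r in dataset[j + 1:j + 1 + w])
    for k in cache.get(prior, []):                      # line 10: cached prior offsets
        if k + 2 * w + length > len(dataset):           # line 11: too close to the end
            continue
        candidate = dataset[k + w:k + w + length]       # line 12: candidate recovery segment
        if any(all(v is None for v in rec[1:]) for rec in candidate):
            continue                                    # candidate is itself missing data
        cand_post = tuple(tuple(r) for r in dataset[k + w + length:k + 2 * w + length])
        if cand_post != posterior:                      # line 13: posterior windows must match
            continue
        if all(candidate[p][0] == dataset[i + p][0] for p in range(length)):
            return candidate                            # lines 14-15: constraints match
    return None
```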
Fig. 3 provides an example to help understand the cmatch function. Assume that $\epsilon$ is set so that data segments must be identical (partial similarity is not tolerated) and the window size is $ws = 2$. We show two snapshots of the dataset, each with three segments. The first one includes the missing data segment and the second one includes its recovery data segment. The beginning offset (index) of the prior segment $x$ is 25. The missing event occurs at 27 and continues until offset 30.
According to the cache, data segment $x$ (prior) exists at offsets 16, 25 and 52, and data segment $y$ (posterior) exists at offsets 20, 24, 30, 60 and 67. The cache algorithm first compares the first prior plus $ws$ (16 + ws) with the first posterior (20); since the difference is not equal to the missing size, i.e., 3, the cursor moves to the next $x$. It then compares the prior at 25 with the posterior at 20; since the result is negative and smaller than the missing size, the cursor shifts to the next posterior, i.e., 30. The difference between 30 and 25 + ws is correct, but that prior belongs to the missing segment itself, so the algorithm uses the next prior, i.e., 52. This procedure continues until it reaches 60 from $y$ and 52 from $x$, where the difference matches the missing data size. Then the values from index 57 to 59 are read from the dataset to construct the recovery data segment.
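The offset arithmetic behind cmatch can be sketched independently of the figure. In the Python fragment below (an illustration, not the paper's code), a prior offset k is accepted when some cached posterior window starts exactly missing_length records after the prior window ends at k + w:

```python
def cmatch(prior_offsets, posterior_offsets, missing_length, w):
    """Return every cached prior-window offset k whose matching posterior window
    starts exactly missing_length records after the prior window ends."""
    posterior_set = set(posterior_offsets)
    return [k for k in sorted(prior_offsets)
            if k + w + missing_length in posterior_set]

# With the Fig. 1 cache (prior offsets {4, 12}, posterior offsets {9, 17},
# w = 2, missing length 3) this returns [4, 12]; the first hit is the missing
# segment's own location and is later discarded because its records are null.
print(cmatch({4, 12}, {9, 17}, 3, 2))  # [4, 12]
```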
3.4 Computational Complexity with Cache
The computational complexity of the baseline algorithm is super-linear and in the worst case quadratic. As described, assuming that there are $m$ missing segments in the dataset and each recovery segment is located at the $k$th element of the dataset, we have $O(k \cdot m)$. By using the cache-based optimization, the algorithm instead needs to search the hash table. The size of the hash table is the size of the dataset divided by the size of the sliding window, i.e., $(n/ws)$. Since the sliding windows are overlapping, instead of $n/ws$ we should multiply it by $ws - 1$. Therefore the computational complexity is $O((n/ws) \cdot (ws - 1))$. Besides, there is a very small overhead to search the offset lists for each data segment. Its impact is near zero and we do not consider it as an overhead of the algorithm. As a result, together with the cache construction phase, the overall algorithm runs in near-linear computational complexity, and in the worst case it is super-linear. The experimental evaluations analyze this in more detail.
In the Experimental Evaluation section, we provide a detailed analysis of the differences between using and not using the cache, the impact of dataset size, and so forth.
Fig. 3. A toy example that shows data segments and their offsets (indices) read and cached into the dictionary.
4 EXPERIMENTAL EVALUATION
Before describing our experimental evaluation, we first describe the datasets that we use. Then, we introduce the state-of-the-art algorithms that will be used for comparison with our algorithm. Next, the accuracy analysis is described in detail, followed by a detailed efficiency analysis.
All experiments reported in this section were conducted on a MacBook Pro laptop with a 2.8 GHz Intel Core i7 CPU,
16 GB of memory and an SSD hard disk. We implemented the algorithm, which we call Ghost, in R version 3.5.
4.1 Datasets
We chose to evaluate our method on five real-world multivariate datasets. These datasets include mobile lifelogging, a wearable smartwatch, Internet of Things (IoT) smart-home, clinical data, and online news data about the European real-estate market. We designed this algorithm with the intention of reconstructing sensor data for inexpensive consumer-electronic devices, such as smartphones or IoT devices. However, to demonstrate the versatility of our algorithm, we experiment with it on clinical data and online news data as well.
The mobile, wearable and IoT datasets have a timestamp in each record. We converted each timestamp to a time of the day with a temporal granularity of five minutes [32]. This approach was inspired by the Allen temporal logic [1], which treats temporal data as discrete time intervals, and it is different from time series data. We converted time series data (such as ambient light or accelerometer numerical data) to characters with the SAX algorithm [24].
Wearable Dataset (WD). Because wearable devices are small, they have limited battery and limited sensor capabilities [35]. To preserve battery, their operating system shuts down background services frequently, which can result in significant data loss during the data collection process. We use eight days of data for a sample user of the "Insight for Wear" [36] smartwatch app.2 The constraint is time of the day.
Online News Data (ND). Financial applications use news to predict market fluctuations. Recent political issues in Europe, such as Brexit and European policy regarding refugees, have led to significant fluctuations in European real-estate markets. These fluctuations and their patterns are not continuously reported in the news media. Therefore, imputation can be used to reconstruct missing market data from online news media. We use 3,000 real-estate records, extracted from 500,000 news articles spanning five years (2012 to 2017) in several German online news media sources. We acquired this data from a market prediction start-up, eMentalist,3 and ordered the records by their date of publication. The constraint in this dataset is the region (country name) and the records are ordered chronologically.
Mobile Sensing Dataset (MD). Due to the proximity of mobile phones to their users, it is not possible to continuously collect contextual data from the user at all times [33], and imputation can be used to reconstruct missing mobile sensing data. We use only three days of one user's data,4 collected using an open source lifelogging tool, UbiqLog [37]. The constraint is time of the day.
Smarthome Dataset (SD). One well-known use of IoT devices is in a household setting, i.e., smart homes. Similar to mobile and wearable devices, the inexpensive sensors used in a smarthome configuration are prone to malfunction and disconnection. We use the UMass Smart Home dataset [4], which includes information from different sensors inside and outside two homes. The constraint is the time of the day.
Clinical Data (CD). A traditional application of imputation is clinical data [2]. To demonstrate the generalizability of our algorithm, we evaluate it on a dataset of visits of diabetes patients to 130 US hospitals from 1998-2000 [49]. The constraint is a combination of age group, race and gender. Since our algorithm operates on sequential data, we order this dataset based on the constraint and the encounter sequence of patients, then run the imputation algorithm. Note that for this dataset we define an additional constraint: the sequence of data that constitutes the prior, missing and posterior segments must belong to one single patient only, and a second patient will not be included in the same segment.
As described above, we chose a small subset of data from the wearable and mobile datasets that does not follow a normal distribution, i.e., a multivariate Shapiro-Wilk [45] test rejected the null hypothesis of normality. We make such a selection on purpose, to demonstrate that we can still reconstruct the missing data with superior precision despite not having a Gaussian distribution. Table 1 summarizes our experiment datasets. The variety of constraints demonstrates the versatility of our algorithm for different settings and its independence from an explicit notion of time.
To quantify the characteristics of our datasets, we conducted a Shapiro-Wilk test [45] to identify whether each attribute is normally distributed. Besides, we used Shannon entropy [44] to identify the level of uncertainty. Shannon entropy can be interpreted as the predictability of the dataset. For the sake of space, we do not report these results in detail, but none of the datasets has all of its attributes normally distributed. Therefore, we cannot conclude whether the selected datasets are highly predictable or unpredictable.
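For reference, a minimal Python sketch of the Shannon entropy computation over one categorical attribute (illustrative only; this is not the authors' analysis script and the example values are made up):

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values;
    lower entropy suggests a more predictable attribute."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy(["on", "off", "off", "off"]))  # about 0.811 bits
```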
4.2 State of the Art Algorithms
We compared the efficiency and accuracy of our algorithm with four well-known state-of-the-art imputation algorithms. The criteria for our selection were support for categorical data, not just numerical data, and support for multivariate imputation, as is needed for real-world applications. We chose two well-known algorithms that use machine learning, i.e., mi [50] and missForest [47], and two that use statistical inference, Amelia II [20] and MICE [8].
TABLE 1
Experiment Datasets and Their Attributes

Dataset           | Constraint  | Data Streams                                                                              | #Rec.
Wearable          | Time of Day | Battery Utilization, Ambient Light, Average Number of Steps                              | 1913
Online News Media | Country     | Topic, Influence Level of the Topic, Sentiment of the Content, Year and Month, Sub-Topic | 3000
Mobile            | Time of Day | Accelerometer, Ambient Light, Battery Use, Screen Interactions                           | 2366
Smart-Home        | Time of Day | Inside Temperature, Outside Temperature, Raining Status, Door Status (open or closed)    | 25905
Clinical          | Patient ID  | A1C Result, Insulin Level, Diabetes Medication Use                                       | 101768
2. http://insight4wear.com
3. http://ementalist.net
4. http://archive.ics.uci.edu/ml/datasets/UbiqLog+%28smartphone+lifelogging%29
"Amelia" [20] operates by bootstrapping expectation maximization and imputation posterior methods. Its imputation posterior method draws random simulations on normally distributed data. It assumes that the variables of the multivariate dataset jointly follow a normal distribution. It is a useful and widely used method to reconstruct missing data in data streams. "MICE" [8] performs multivariate imputation by using chained equations. The chained equation uses joint modeling and fully conditional specification [43]. Joint modeling specifies a multivariate distribution and performs imputation from the conditional distributions by Markov chain Monte Carlo simulation. "mi" [50] creates an approximate Bayesian model from the data, then reconstructs data from the conditional distribution for each data object given the observed and reconstructed data of the other variables in the dataset. "missForest" [47] treats data as a multi-dimensional matrix and performs the imputation by predicting the missing value using the random-forest algorithm trained on the observed parts of the dataset.
To have a fair comparison, we experimented with different parameters several times and chose the best settings for each state-of-the-art algorithm. Moreover, we used only one run for algorithms that support multiple runs for the imputation.
4.3 Accuracy
To measure the accuracy of the imputation process in our algorithm, we randomly removed records from each dataset. In particular, we removed 10 to 100 data records from each dataset (10, 20, 30, ..., 100) and compared the imputed dataset with the original dataset. The accuracy estimation is reported as precision, recall and F-score. True positives are missing data that are successfully recovered. False positives are missing data that have been identified and recovered but whose original value differs from the recovered one. False negatives are missing data that have not been recovered. True negatives are data that are not missing and also not recovered, which is obvious in our scenario, and all algorithms can easily identify true negatives. Therefore, in most analyses we report the F-score, which is independent of true negatives.
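For clarity, a small sketch of how these counts turn into the reported scores (standard precision, recall and F-score definitions; the example counts are hypothetical):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-score from imputation outcome counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g., 70 correctly recovered, 20 recovered with a wrong value, 10 not recovered
print(precision_recall_f1(70, 20, 10))  # approximately (0.778, 0.875, 0.824)
```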
We provide five analyses of accuracy, four of which focus on the characteristics of the algorithm, including the constraint impact, the sliding window parameter sensitivity, the impact of dissimilarity tolerance ($\epsilon$) on precision and recall, and the impact of missing length (or duration) on accuracy. The fifth analysis focuses on a comparison of our algorithm with the aforementioned state-of-the-art algorithms. It is notable that there is no difference between the accuracy of the baseline and cache-based versions of the algorithm; using a cache does not have any impact on the accuracy of the algorithm.
4.3.1 Tolerating Dissimilarity: Impact on Accuracy
The data objects are multivariate and heterogeneous. Therefore, the algorithm should treat them as discrete data objects. In cases where the user intends to tolerate dissimilarity, the algorithm uses the Jaccard index to compare the contents of two data segments. The $\epsilon$ parameter is used to tolerate dissimilarity of the prior and posterior segments. Increasing the dissimilarity tolerance increases the recall and enables the algorithm to identify more missing data segments. It, however, decreases the precision of the result significantly.
Fig. 4 presents the precision and recall for our experiment datasets with a window size of two, while tolerating one (n-1) or two (n-2) data object dissimilarities. Later we will describe why we set the window size to two. We chose to remove only two data objects at most, because by removing either one or two data objects we already obtain near-ideal recall for all datasets (more than 0.8). This figure presents an average of precision and recall; that is, we calculate the precision and recall for all 10 record-removal settings (10 record removals, ..., 100 record removals) and report the mean of the results.
Based on the very low precision obtained with even slight toleration of dissimilarity, it is clear that exact similarity is favored by this algorithm, and tolerating dissimilarity, even small dissimilarity, is not recommended. Furthermore, the randomness of results between (n-1) and (n-2) questions the credibility of tolerating dissimilarity as well. Therefore, we highly recommend avoiding dissimilarity tolerance in this algorithm.
4.3.2 Constraint Impact on Accuracy
We introduced constraints as a means to reduce false-positive errors. To demonstrate the impact of constraints, we measured the accuracy of our algorithm with and without the use of the constraint. Fig. 6 demonstrates the superior accuracy of using the constraint over not using it on the five datasets we chose for testing. Although not using a constraint increases the recall in all datasets except news media, it also increases the false-positive errors.
The superior recall of not using a constraint over using the constraint in the news media dataset is due to the tight connection of real-estate changes in the German regions of Europe. Most of the real-estate news was about either Germany or the UK. Therefore, the constraint plays an auxiliary role and increases recall; unlike in other datasets, it does not act as a filter for the data.
Fig. 4. A comparison between the precision and recall of using exact similarity (n) versus tolerating one (n-1) and two (n-2) dissimilarities.
5. $\epsilon$ is also a parameter, but we demonstrate that it is not useful to tolerate dissimilarity.
4.3.3 Sliding Window Parameter Sensitivity
The only effective parameter that this algorithm uses is the window size.5 We analyzed precision, recall and F-score
for each dataset with different window sizes and with different numbers of missing records, i.e., 10, 20, ..., 100. Then we take the average of the precision, recall and F-score for each dataset and each window size and report it in Fig. 5. We evaluate the algorithm for four different window sizes, i.e., 1, 2, 3 and 4. Based on the results in Fig. 5, we can recommend that the optimal window size for these datasets is two or three. A window size of two or three was optimal for the sensor-based datasets (mobile, wearable and smarthome). For the clinical dataset, a window size of two had the highest accuracy. For the news media dataset, a window size of three had the highest accuracy. Nevertheless, we recommend that users experiment with different window sizes for their target dataset. Window size and precision have a monotonic, not linear, relation. Our initial assumption was that recall would improve with a smaller window size, but due to the real-world nature of the data such an assumption is not valid, and one can see that the largest window size, i.e., 4, decreases the accuracy.
4.3.4 Length of the Missing Segment and Accuracy
One question that might arise is the flexibility of the algorithm with respect to the size of the missing data segment. In other words, while the target system is off, how much does the accuracy of our imputation algorithm decrease with the duration of the off period? To answer this question, we experimented with removing sequential runs of data. Using 100 missing records, we experimented with missing segment sizes of 2, 3, 5 and 10 records. For instance, a missing size of 2 indicates 50 missing sequences, each two records long, and a missing size of 3 indicates 33 missing sequences, each three records long. The result is presented in Fig. 8, which shows precision and recall based on the missing length of a sequence. As can be seen from the results in Fig. 8, precision decreases slightly across datasets as the missing length grows from two to three records. Then there is a significant decrease at a missing length of 10, where the precision on the mobile and smarthome datasets drops to zero. The recall decreases in some datasets, but in other datasets it is not predictable and we cannot generalize it. In summary, we can argue that the longer the off time, the less accurate the imputation process. This reveals that the imputation process has some sensitivity to the length of the missing data.
Fig. 5. Window size parameter sensitivity analysis.
Fig. 6. A comparison between using and not using the constraint.
Fig. 7. Comparison of accuracy between our algorithm with constraint and state-of-the-art methods.
4.3.5 Comparison with Other Algorithms
To compare the accuracy of our algorithm with the state-of-the-art algorithms, we chose the optimal window size from the preceding evaluation and the optimal parameter settings for the state-of-the-art algorithms. Fig. 7 summarizes the F-score for imputation by each of the algorithms. As shown in this figure, our algorithm significantly outperforms the other algorithms in terms of F-score, except on the clinical dataset. The recall of all other algorithms is ideal, because (i) it is easy to identify all missing data and (ii) they provide a substitution for all missing segments (although often incorrectly). We could easily change our algorithm to substitute exact similarity
with another similarity metric and achieve perfect recall. However, this approach introduces a high false-positive rate. False-positive errors are severe errors for imputation algorithms and can bias the imputed data [19].
Notably, the missForest algorithm returns zero precision for the mobile dataset, due to the screen interaction data. This data is rarely available and most of its time slots are filled with a zero value. Such a sparse stream of data affects the random forest algorithm, which cannot handle too many null variables in a dataset. Although the missForest algorithm and our algorithm had comparable F-scores on the news media dataset, our algorithm had 25 percent higher precision. On average, even considering the low score of our algorithm on the clinical data, it had about 18 percent higher F-score (across all five datasets) than the other algorithms.
Not performing better than missForest and mi on the clinical data could be due to the non-sequential nature of that dataset. Our algorithm is designed for datasets that include sequential data.
Note that the fluctuations in Fig. 7 are due to the randomness of the missing data, and it is not possible to obtain a smooth line by repeating the experiment. Our algorithm performs significantly better than the other algorithms on the datasets that have time as the constraint.
4.4 Efficiency
To demonstrate the efficiency of our algorithm, we analyze the execution time and memory use of the algorithm in different settings.
First we analyze the response time and memory use of the algorithm and the impact of dataset size on its performance. Next, we compare the memory use and response time of our algorithm with the state-of-the-art algorithms. To understand the impact of the number of missing data records on memory use and response time, we report each experiment with 10 to 100 missing records per dataset. This helps to quantify the sensitivity of the imputation algorithms to the number of missing data records.
4.4.1 Cache versus Baseline: Execution Time
Our cached optimization reduces the search space, but increases memory use. We report the memory use and response time of both versions of the algorithm for different data sizes varying from 500 to 10,000 records, i.e., 500, 1,000, 2,000, ..., 10,000. To prevent any bias originating from the dataset structure, we created a synthetic dataset with four data streams (data values varying from 1 to 10) and one stream as a constraint. We kept the number of random missing records fixed at 100 in all datasets. Nevertheless, individual missing records were randomly distributed among all dataset records. Fig. 9 shows response times (in seconds) for different window sizes for the baseline and the cache.
As shown in Fig. 9, increasing the dataset size increased the response time for both the baseline and the cache. The increase in response time for the cache version, however, was significantly lower than that of the baseline.
Moreover, increasing the window size lengthened the response time, because a comparison is done for each object (due to the use of exact similarity) and, as the window size gets larger, there are more objects to compare.
4.4.2 Cache versus Baseline: Memory Use
Fig. 8. Missing data length impact on accuracy.
Fig. 9. Response time impact of using cache rather than baseline method, with different window sizes.
Fig. 10 shows that the baseline algorithm uses only 112 KB of memory, independent of the dataset size. The cache version starts at 508 KB and grows slowly with the size of the dataset. At 10,000 records the cache version occupies about 856 KB of memory. Developers who plan to use this algorithm can consider compromising on the amount of memory in order to obtain a shorter execution time, or vice versa.
4.4.3 Memory Use Comparison with Other Algorithms
Since all algorithms read the dataset into memory, we ignore that memory use and focus on the additional memory used by each algorithm. Our experiments show that there were no significant differences in memory use when using different window sizes. Therefore, we report the memory overhead independent of the window size; see Fig. 10. The memory allocation policy in our implementation was based on the R compiler. The memory overhead of the algorithm alone was insignificant. In particular, the baseline version was memory efficient and used only 112 kilobytes (KB) of memory, independent of the dataset size, because the dataset was not held in memory. The cache version used more memory, of course. At 10,000 records (w = 2) the cache version occupies about 856 KB of memory. Developers who plan to use this algorithm can consider compromising on the amount of memory in order to obtain a shorter execution time, or vice versa.
Note that for this experiment we selected only 3,000 records from the clinical and smarthome datasets, because the "mi" algorithm's memory utilization grows exponentially with the size of the dataset and it cannot operate on large datasets. Besides, based on the finding of the previous section that window sizes of two and three provide the highest accuracy, we set the window size to two.
Although the baseline algorithm uses less memory, we refrain from comparing it with the other algorithms due to its slow response times. A large portion of the memory overhead comes from the cache. Therefore, when reporting Ghost's memory use in Table 2, we report "algorithm's used memory" + "memory used for the cache". Otherwise, missForest is the most efficient algorithm, followed by Amelia.
We do not report memory utilization while tolerating dissimilarity, because there is no logical reason for a difference in memory utilization.
4.4.4 Execution Time Comparison with Other Algorithms
Table 3 compares our algorithm's execution time against the state-of-the-art algorithms. In this study we report execution time averaged across runs with varying numbers of missing records, i.e., 10 to 100 records. The results in this table show that our algorithm was slower than both missForest and Amelia and faster than MICE and mi. Although our algorithm was not the fastest imputation algorithm, our implementation could be slightly improved by using a more suitable hash function instead of a list.
Despite a significant decrease in precision, tolerating dissimilarity decreases execution time significantly, because the matching segment is identified much faster and the search space is smaller. Table 4 reports the execution time for different levels of dissimilarity.
From these evaluations we conclude that our algorithm is appropriate for systems that need to reconstruct missing data offline (not in real time) and for systems that can tolerate latency but require high accuracy. Our algorithm was the most accurate algorithm for all sequential datasets. Furthermore, due to its low memory use, it could be implemented on small devices such as wearable devices. Our report focused on fully missing data, but our algorithm could easily be extended to handle partially missing data as well.
5 LIMITATIONS
Although not using exact equality and tolerating dissimilarity increases recall, we recommend using exact similarity.
Fig. 10. Memory use comparison between the cache and the baseline.
TABLE 2
Memory Use (in KB) of Different Algorithms
Dataset Ghost missForest Amelia MICE mi
SmartHome 0.64 + 230 49.23 132.58 288.41 1708.56
Mobile 1.06 + 234 60.59 142.23 74.40 1948.12
Wearable 0.86 + 241 49.28 133.06 291.66 1711.24
Clinical 0.86 + 225 75.30 156.85 236.41 2017.19
NewsMedia 0.85 + 261 99.86 154.24 753.65 2928.63
TABLE 3
Execution Time (in Seconds) of Different Algorithms
Dataset Ghost missForest Amelia MICE mi
SmartHome 10.39 1.24 0.28 10.57 35.89
Mobile 5.38 0.61 0.32 1.77 10.68
Wearable 13.64 0.69 0.82 2.91 10.06
Clinical 7.57 0.63 0.58 2.33 7.42
NewsMedia 6.91 0.68 0.57 10.34 32.21
TABLE 4
Execution Time (in Seconds) of Different Levels
of Dissimilarity Tolerance in Ghost
Dataset Ghost (n-1) Ghost (n-2) Ghost
SmartHome 7.26 6.79 10.39
Mobile 5.38 3.91 5.38
Wearable 8.64 9.01 13.64
Clinical 5.13 6.00 7.57
NewsMedia 4.36 3.78 6.91
The last column (Ghost), which is the same as in Table 3, presents the algorithm without tolerating any dissimilarity, i.e., requiring exact equality.
This is because, similar to other algorithms, tolerating dissimilarity imposes a need to smooth the data, whereas exact equality keeps the reconstructed data close to its original values. In our experimental evaluation, we analyzed the trade-off between recall and precision when exact equality is not used (see Section 4.3.1).
Our approach cannot reconstruct a missing segment at the very beginning or end of the dataset. This limitation is unlikely to be significant in large real-world datasets, because the missing segment would be small relative to the dataset.
We assume this algorithm will be used to reconstruct missing segments of battery-powered devices, not of machines that collect hundreds of different data types. Therefore, in our experiments we chose datasets with a modest number of features; the largest number of attributes we experimented with was six (in the news media dataset). However, this number is not a hard limit, and it depends on the correlation and repeatability of the data. Nevertheless, by reducing the precision requirement and relying on a looser similarity, data with more attributes could be handled as well. For the same reason, we did not test on big data in our experiments, and we assume the user of this algorithm aims to accurately reconstruct missing data from the smallest possible sample of data.
As reported in the evaluation, our algorithm cannot reconstruct data in real time. It can be useful for data streams and datasets that change frequently, but since it requires at least one scan over the entire dataset, it cannot operate in real time.
6 RELATED WORK
Imputation has a longstanding history in statistical analysis [40]. In addition, due to the noise inherent in real-world data, several promising application-specific imputation approaches have been proposed to reconstruct missing data for a single application.
We group related work into two categories. The first category consists of holistic methods that are usually based on statistical analysis and do not consider the underlying application. The second category consists of application-specific methods that are designed around the requirements of a target application. We have evaluated our approach on five different datasets; thus we conclude that our approach is not application specific and belongs to the first category.
Holistic Algorithms. Basic approaches to imputation use mathematical methods to calculate the missing data, such as mean imputation, regression analysis, and missing-indicator methods [11]. Practical algorithms usually use more complex approaches. In particular, two categories of imputation algorithms are widely used. The first category features maximum likelihood (ML) algorithms [13], [18], such as EM (E for conditional expectation and M for maximization of likelihood) algorithms [10] and their successors, such as the work proposed by Enders [13]. EM algorithms usually work well with numerical data, because they use a statistical model to find the missing data.
The second category consists of multiple imputation (MI) algorithms [41], such as the works proposed by Buuren and Groothuis-Oudshoorn [8] and Honaker et al. [20]. MI algorithms combine different imputation methods into a single procedure (mostly expectation maximization). They therefore account for uncertainty by creating several plausible imputed datasets. Repeating multiple experiments makes MI algorithms approximately unbiased. Nevertheless, repeating and combining different processes increases the time and memory complexity of such approaches, and thus they are not computationally efficient, despite their superior accuracy. Moreover, most MI algorithms assume that the data follow a normal distribution [48], which is not necessarily true for all real-world datasets. However, MI algorithms can benefit from ML approaches, and thus there is no distinct border between them. For example, Amelia [20], which has been described previously, is a well-known algorithm that combines classical expectation maximization with a bootstrap approach. Little and Rubin [25] and Schafer [43] provide a comprehensive description of statistical approaches for imputation. These algorithms are still advancing. For instance, Yuan [53] uses the "propensity score" [39], which applies when there is a vector of observed covariates. This algorithm generates a score for variables with missing values; observations are then grouped based on these scores, and an approximate Bayesian bootstrap imputation is applied to each group. Song et al. [46] use approximate and exact neighboring information to identify the missing information in a dataset. Both of these recent approaches are useful when other streams of data are available, but they cannot operate optimally when the system is completely off. Mohan et al. [28] propose a method to perform imputation by using a directed acyclic graph. This approach operates based on the causal relations of nodes in a graph.
Our work is inspired by sequence mining algorithms [17], [30], [54], which focus on identifying ordered patterns of events in a sequence. Nevertheless, the objective of our algorithm is significantly different from that of sequence mining algorithms.
Application-Specific Methods. Application-specific efforts, such as model-based imputation, try to resolve missing data based on the assumptions that the model provides. Some examples of application-specific imputation include sensor networks [21], [29], clinical data [5], and genome-wide association studies [27]. Moreover, due to the nature of real-world data streams, these approaches handle multivariate data and different data types [12], [22], [43] that vary from categorical to binary and numerical. Usually, the imputation process is performed in batch mode, and most of the existing approaches in this category are computationally complex. For instance, to reconstruct the missing data, the data are converted into a contingency table and inserted into a large matrix [25], [43]. Another example is the use of compressed sensing [12] for sensor-network data imputation [22], which has a high computational complexity (in both time and space). Another work, proposed by Papadimitriou et al. [29], applies principal component analysis (PCA) to estimate a missing time series based on its correlation with another time series in time-stamped sensor data. Their approach is space efficient, but because of its reliance on PCA, it operates in a two-dimensional space of numerical data. Moreover, PCA has polynomial time complexity. Kong et al. [22] use a customized spatio-temporal compressed sensing approach [12] for imputing environmental sensor
network data. Due to the use of compressed sensing and nested matrix iterations, this approach is also polynomial and computationally complex. Jeon et al. [21] propose a noise reduction model for removing audio noise, based on multi-band spectral subtraction. Marchini and Howie use a reference panel to estimate the missing data in a genome-wide association study [27]. Fryett et al. [14] provide a detailed survey comparing transcriptome imputation methods. Wang et al. [51] employ Petri nets to recover missing events based on the time constraints in a set of business process sequences. Their simple-case recovery has a linear computational complexity, but the approach they propose for general cases, based on branching and indexing, does not have linear complexity. Some of these efforts can be generalized to other applications as well. For instance, the work proposed by Batista and Monard [5] uses the k-nearest neighbors to reconstruct missing data. They implement their approach on clinical data, where it achieves a higher accuracy than basic imputation methods, i.e., mean imputation and linear regression. A more recent example of an imputation algorithm is proposed by Wellenzohn et al. [52], which focuses on imputation for time series. It introduces the concept of an anchor point, which is similar to our prior-segment approach. Their approach also benefits from data prior to the missing event and is therefore independent of linear correlation. Nevertheless, since it uses only the prior window as a constraint, its recall is higher, but our precision is higher.
Several application-specific imputation methods rely on the periodicity of the data. There are promising approaches to quantify periodic changes in a dataset [7], [26] and thus improve application efficiency. For instance, Bögl et al. [7] rely on the periodicity of the data to perform the imputation.
In the experimental evaluation section, we described the well-known holistic state-of-the-art algorithms [8], [20], [47], [50] and explained why we selected them.
7 CONCLUSION & FUTURE WORK
In this paper we have introduced an accurate imputation algorithm, Ghost, that can operate on multivariate datasets. It uses a constraint and the first similar segment whose adjacent (prior and posterior) segments match those of the missing data segment to perform the imputation process. To improve the efficiency of the algorithm, we use a cache-based optimization. Our algorithm outperformed state-of-the-art algorithms by 18 percent in F-score and 25 percent in precision. Our proposed algorithm is appropriate for systems that produce data streams and cannot hold data for the long term. Moreover, it is useful for systems that prioritize accuracy over response time. As future work, we will try to develop a distance function that can identify prior and posterior segments that are in the proximity of (not adjacent to) the missing segment. Finding prior and posterior patterns and their distances to the missing segment could increase the number of recovered segments, and thus the accuracy of the algorithm.
ACKNOWLEDGMENTS
The authors acknowledge Thomas H. Cormen for his hint on the design of our algorithm, and David Kotz for formalizing the problem and contributing to the writing of the paper.
REFERENCES
[1] J. Allen, “Maintaining knowledge about temporal intervals,” Com-
mun. ACM, vol. 26, no. 11, pp. 832–843, 1983.
[2] C. Anagnostopoulos and P. Triantafillou, “Scaling out big data
missing value imputations: Pythia vs. Godzilla,” in Proc. 20th
ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2014,
pp. 651–660.
[3] T. Baltrušaitis, C. Ahuja, and L. P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019.
[4] S. Barker, A. Mishra, D. Irwin, E. Cecchet, P. Shenoy, and J. Albrecht, “Smart*: An open data set and tools for enabling research in sustainable homes,” in Proc. KDD Workshop Data Mining Appl. Sustainability, 2012, Art. no. 112.
[5] G. Batista and M. Monard, “An analysis of four missing data treat-
ment methods for supervised learning,” Appl. Artif. Intell., vol. 17,
no. 5/6, pp. 519–533, 2003.
[6] B. Berger, N. M. Daniels, and Y. W. Yu, “Computational biology in
the 21st century: Scaling with compressive algorithms,” Commun.
ACM, vol. 59, no. 8, pp. 72–80, 2016.
[7] M. Bögl, P. Filzmoser, T. Gschwandtner, S. Miksch, W. Aigner, A. Rind, and T. Lammarsch, “Visually and statistically guided imputation of missing values in univariate seasonal time series,” in Proc. IEEE Conf. Visual Analytics Sci. Technol., 2015, pp. 189–190.
[8] S. Buuren and K. Groothuis-Oudshoorn, “mice: Multivariate
imputation by chained equations in R,” J. Statistical Softw., vol. 45,
no. 3, pp. 1–68, 2011.
[9] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to
Algorithms. Cambridge, MA, USA: MIT Press, 2009.
[10] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from
incomplete data via the EM algorithm,” J. Roy. Statistical Soc..
Series B (Methodological), vol. 39, pp. 1–38, 1977.
[11] A. Donders, G. van der Heijden, T. Stijnen, and K. Moons,
“Review: A gentle introduction to imputation of missing values,”
J. Clinical Epidemiology, vol. 59, no. 10, pp. 1087–1091, 2006.
[12] D. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52,
no. 4, pp. 1289–1306, Apr. 2006.
[13] C. Enders, “A primer on maximum likelihood algorithms available
for use with missing data,” Structural Equation Model., vol. 8, no. 1,
pp. 128–141, 2001.
[14] J. Fryett, J. Inshaw, A. Morris, and H. Cordell, “Comparison of
methods for transcriptome imputation through application to
two common complex diseases,” Eur. J. Human Genetics, vol. 26,
pp. 1658–1667, 2018.
[15] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, CA, USA: O’Reilly Media, 2017.
[16] Z. Ghahramani, “Probabilistic machine learning and artificial
intelligence,” Nature, vol. 521, no. 7553, pp. 452–459, 2015.
[17] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu,
“FreeSpan: Frequent pattern-projected sequential pattern mining,”
in Proc. 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
2000, pp. 355–359.
[18] H. Hartley and R. Hocking, “The analysis of incomplete data,”
Biometrics, vol. 27, no. 4, pp. 783–823, 1971.
[19] J. Hernández-Lobato, N. Houlsby, and Z. Ghahramani, “Probabilistic matrix factorization with non-random missing data,” in Proc. Int. Conf. Mach. Learn., 2014, pp. 1512–1520.
[20] J. Honaker, G. King, and M. Blackwell, “Amelia II: A program for
missing data,” J. Statistical Softw., vol. 45, no. 7, pp. 1–47, 2011.
[21] K. Jeon, N. Park, D. Lee, and H. Kim, “Audio restoration based on
multi-band spectral subtraction and missing data imputation,” in
Proc. IEEE Int. Conf. Consum. Electron., 2014, pp. 522–523.
[22] L. Kong, M. Xia, X. Liu, M. Wu, and X. Liu, “Data loss and re-
construction in sensor networks,” in Proc. IEEE INFOCOM, 2013,
pp. 1654–1662.
[23] B. Lake, R. Salakhutdinov, and J. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
[24] J. Lin, E. Keogh, S. Lonardi, and B. Chiu, “A symbolic representation of time series, with implications for streaming algorithms,” in Proc. 8th ACM SIGMOD Workshop Res. Issues Data Mining Knowl. Discovery, 2003, pp. 2–11.
[25] R. Little and D. Rubin, Statistical Analysis with Missing Data. Hoboken, NJ, USA: Wiley, 2014.
[26] C. Loglisci and D. Malerba, “Mining periodic changes in complex
dynamic data through relational pattern discovery,” in Proc. Int.
Workshop New Frontiers Mining Complex Patterns, 2015, pp. 76–90.
[27] J. Marchini and B. Howie, “Genotype imputation for genome-
wide association studies,” Nature Rev. Genetics, vol. 11, no. 7,
pp. 499–511, 2010.
[28] K. Mohan, J. Pearl, and J. Tian, “Graphical models for inference
with missing data,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
2013, pp. 1277–1285.
[29] S. Papadimitriou, J. Sun, and C. Faloutsos, “Streaming pattern
discovery in multiple time-series,” in Proc. 31st Int. Conf. Very
Large Data Bases, 2005, pp. 697–708.
[30] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and
M. Hsu, “PrefixSpan: Mining sequential patterns efficiently by
prefix-projected pattern growth,” in Proc. Int. Conf. Comput. Com-
mun. Netw., 2001, pp. 215–224.
[31] R. Rawassizadeh, C. Dobbins, M. Akbari, and M. Pazzani, “Indexing multivariate mobile data through spatio-temporal event detection and clustering,” Sensors, vol. 19, no. 3, 2019, Art. no. 448.
[32] R. Rawassizadeh, E. Momeni, C. Dobbins, J. Gharibshah, and
M. Pazzani, “Scalable daily human behavioral pattern mining
from multivariate temporal data,” IEEE Trans. Knowl. Data Eng.,
vol. 28, no. 11, pp. 3098–3112, Nov. 2016.
[33] R. Rawassizadeh, E. Momeni, C. Dobbins, P. Mirza-Babaei, and
R. Rahnamoun, “Lesson learned from collecting quantified self
information via mobile and wearable devices,” J. Sensor Actuator
Netw., vol. 4, no. 4, 2015, Art. no. 315.
[34] R. Rawassizadeh, T. Pierson, R. Peterson, and D. Kotz, “NoCloud:
Exploring network disconnection through on-device data analysis,”
IEEE Pervasive Comput., vol. 17, no. 1, pp. 64–74, Jan.–Mar. 2018.
[35] R. Rawassizadeh, B. Price, and M. Petre, “Wearables: Has the age
of smartwatches finally arrived?,” Commun. ACM, vol. 58, no. 1,
pp. 45–47, 2015.
[36] R. Rawassizadeh, M. Tomitsch, M. Nourizadeh, E. Momeni,
A. Peery, L. Ulanova, and M. Pazzani, “Energy-efficient integration
of continuous context sensing and prediction into smartwatches,”
Sensors, vol. 15, no. 9, pp. 22616–22645, 2015.
[37] R. Rawassizadeh, M. Tomitsch, K. Wac, and A. Tjoa, “UbiqLog: A
generic mobile phone-based life-log framework,” Pers. Ubiquitous
Comput., vol. 17, no. 4, pp. 621–637, 2013.
[38] M. Resche-Rigon and I. R. White, “Multiple imputation by chained
equations for systematically and sporadically missing multilevel
data,” Statistical Methods Med. Res., vol. 27, no. 6, pp. 1634–1649, 2018.
[39] P. R. Rosenbaum and D. B.Rubin, “The central role of the propensity
score in observational studies for causal effects,” Biometrika,vol.70,
no. 1, pp. 41–55, 1983.
[40] D. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3,
pp. 581–592, 1976.
[41] D. Rubin, “Multiple imputations in sample surveys - A phenomenological Bayesian approach to nonresponse,” Proc. Survey Res. Methods Section Amer. Statistical Assoc., vol. 1, pp. 20–34, 1978.
[42] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach,
3rd ed. London, U.K.: Pearson Education Limited, 2011.
[43] J. L. Schafer, Analysis of Incomplete Multivariate Data. Boca Raton,
FL, USA: CRC Press, 1997.
[44] C. E. Shannon, “A mathematical theory of communication,” Bell
Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948.
[45] S. Shaphiro and M. Wilk, “An analysis of variance test for normal-
ity,” Biometrika, vol. 52, no. 3, pp. 591–611, 1965.
[46] S. Song, A. Zhang, L. Chen, and J. Wang, “Enriching data imputa-
tion with extensive similarity neighbors,” Proc. VLDB Endowment,
vol. 8, no. 11, pp. 1286–1297, 2015.
[47] D. Stekhoven, “MissForest: Nonparametric missing value imputa-
tion using random forest,” Astrophysics Source Code Library, 2015.
[48] J. Sterne, I. White, J. Carlin, M. Spratt, P. Royston, M. Kenward,
A. Wood, and J. Carpenter, “Multiple imputation for missing data
in epidemiological and clinical research: Potential and pitfalls,”
Brit. Med. J., vol. 338, 2009, Art. no. b2393.
[49] B. Strack, J. DeShazo, C. Gennings, J. Olmo, S. Ventura, K. Cios, and
J. Clore, “Impact of HbA1c measurement on hospital readmission
rates: Analysis of 70,000 clinical database patient records,” BioMed
Res. Int.,vol. 2014, 2014, Art. no. 781670.
[50] Y. Su, A. Gelman, J. Hill, and M. Yajima, “Multiple imputation
with diagnostics (mi) in R: Opening windows into the black box,”
J. Statistical Softw., vol. 45, no. 2, pp. 1–31, 2011.
[51] J. Wang, S. Song, X. Zhu, and X. Lin, “Efficient recovery of missing
events,” Proc. VLDB Endowment, vol. 6, no. 10, pp. 841–852, 2013.
[52] K. Wellenzohn, M. H. Böhlen, A. Dignös, J. Gamper, and H. Mitterer, “Continuous imputation of missing values in streams of pattern-determining time series,” in Proc. Int. Conf. Extending Database Technol., 2017, pp. 330–341.
[53] Y. Yuan, “Multiple imputation for missing data: Concepts and
new development,” in Proc. Twenty-Fifth Annu. SAS Users Group
Int. Conf., vol. 267, 2010.
[54] M. Zaki, “SPADE: An efficient algorithm for mining frequent
sequences,” Mach. Learn., vol. 42, no. 1/2, pp. 31–60, 2001.
Reza Rawassizadeh received the BSc degree in
software engineering, the master’s degree in com-
puter science, and the PhD degree in computer
science from the University of Vienna, Austria, in
2012. He is an assistant professor with the Depart-
ment of Computer Science, Metropolitan College,
Boston University. His research interests include
data mining, ubiquitous computing, and applied
machine learning.
Hamidreza Keshavarz received the PhD degree
from Tarbiat Modares University, Tehran, Iran, in
2018. His research is focused on developing algo-
rithms and techniques for sentiment analysis and
data mining. His interests include metaheuristic
algorithms, information retrieval, computational intel-
ligence, pattern recognition, and machine learning.
Michael Pazzani received the PhD degree in com-
puter science from the University of California, Los
Angeles (UCLA). He is vice chancellor for research
and economic development and a professor of
computer science with the University of California,
Riverside. He was a professor with the University
of California, Irvine, where he also served as chair
of information and computer science. His research
interests include machine learning, personalization,
and cognitive science.
"
For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/csdl.