Personal analytics: Time management using Google Maps
Achyuthuni Sri Harsha
EEP, Business Analytics and Intelligence, Indian Institute of Management,
Abstract. The modern world runs on data captured from millions of
individuals. Both governments and private companies capture data on each
individual, yet only a small percentage of that data is accessible to the
person it describes. This paper describes a seven-step, data-driven process
that any individual can follow to use this data to improve their life. Time
management, based on Google Maps and other data sources for one individual,
serves as the worked example.
Keywords: Personal analytics, regression, personal data, hypothesis testing,
time management, Google Maps, CRISP-DM.
1 Introduction
Data is an integral component of modern life. Analytics is used extensively for
problem-solving and for assisting decision making across many verticals and
companies (Kumar, 2017). Studies have shown that 1.7 MB of data is captured
every second on every person on earth (Miller, 2019). From engineering (Aho and
Uden, 2014; Chiang et al., 2017) to retail (Dinesh Kumar et al., 2012; Jeeson et
al., 2013), from transportation to finance, and from medicine (Pannu et al.,
2010) to insurance (Lixia, 2010), data has changed the way businesses work.
Behind this revolution is the personal data of every individual, captured for
various purposes (Schwartz, 2003). Some of this data is available to the
individual (Gurrin et al., 2014), and data science and analytics enthusiasts can
use it to make their lives better (Selke, 2014; Sellen and Whittaker, 2010). A
time management problem for an individual is taken as an example to explain the
process. Data collected by various external agents and organizations on the
person in the study (Person A) is aggregated. The CRISP-DM analytical approach
(Wirth and Hipp, 2000) is used to identify factors and make recommendations.
CRISP-DM is a widely used methodology for solving data science problems
(Azevedo and Santos, 2008).
Time management is a significant problem in every person's life. This paper
discusses the factors that influence the time at which an employee arrives at
the workplace. The goal is to explain as much of the variation in the
individual's in-time as possible.
2 Data and Methods
2.1 Set the goals
The goal is to identify and quantify the factors that affect the time at which
an employee (Person A) arrives at the workplace (in-time). In-time depends on
various factors, which are broadly classified into the following groups
(Ailabouni et al., 2009; Porter and Steers, 1973):
1. Personal factors
2. Commute-based factors (Olsson et al., 2013; van Hooff, 2015)
3. Work-related factors
4. Time and seasonality-based factors
2.2 Identifying data sources
For each factor, the availability and sources of data were considered. Two data
sources cover most of the factors taken into consideration:
1. Google location history
2. Workplace management tools at Person A's workplace
Google captures large amounts of data on every individual, including browser
activity, health (Google Fit), bank transactions (Google Pay), photos, emails,
and location history (Google Maps).
Data from Google Location History was downloaded for the person in the study.
The downloaded file is in JSON format and was converted to a data frame in R.
The date and time column, stored in POSIX milliseconds, was converted to a
human-readable format. Similarly, latitude and longitude, stored as integers
scaled by 10^7, were converted to GPS coordinates in decimal degrees. This data
was then filtered for the timeframe of travel to the workplace and for the
locations of the individual's workplace and home.
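As an illustration of the conversion described above, the following Python
sketch parses one location record and checks whether it lies near a place of
interest. It is not the code used in the study, which was carried out in R. The
field names timestampMs, latitudeE7 and longitudeE7 are assumed from the legacy
Google Takeout export layout, and the sample record, the workplace coordinates
and the 0.2 km radius are hypothetical.

```python
import math
from datetime import datetime, timezone

EARTH_RADIUS_KM = 6371.0


def parse_location_record(record):
    """Convert one raw location record to readable values."""
    # timestampMs (assumed field name) is POSIX time in milliseconds.
    timestamp = datetime.fromtimestamp(
        int(record["timestampMs"]) / 1000, tz=timezone.utc
    )
    # latitudeE7 / longitudeE7 (assumed field names) are degrees times 10^7.
    return {
        "timestamp": timestamp,
        "latitude": record["latitudeE7"] / 1e7,
        "longitude": record["longitudeE7"] / 1e7,
    }


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_near(point, place, radius_km=0.2):
    """True if a parsed point lies within radius_km of (lat, lon) place."""
    distance = haversine_km(
        point["latitude"], point["longitude"], place[0], place[1]
    )
    return distance <= radius_km


# Hypothetical record and workplace location, not data from the study.
sample = {
    "timestampMs": "1507104000000",
    "latitudeE7": 129715987,
    "longitudeE7": 775945627,
}
workplace = (12.9716, 77.5946)

parsed = parse_location_record(sample)
print(parsed["timestamp"].isoformat(), is_near(parsed, workplace))
```

Filtering the full history then amounts to keeping the parsed records for which
is_near holds for the home or the workplace within the commuting timeframe.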
Data from the workplace management systems, consisting of in-time and out-time,
was also collected from 4th October 2017 to 29th November 2018. This data was
joined with the data extracted from Google Location History.
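The join can be pictured as an inner join on the calendar date. The Python
sketch below is only an illustration under that assumption; the paper does not
state the join key, and the column names and values are hypothetical.

```python
from datetime import date


def join_by_date(commute_by_day, attendance_by_day):
    """Inner join of two {date: dict} mappings on the dates they share."""
    joined = {}
    for day in sorted(commute_by_day.keys() & attendance_by_day.keys()):
        # Merge the commute columns and the attendance columns for that day.
        joined[day] = {**commute_by_day[day], **attendance_by_day[day]}
    return joined


# Hypothetical values, not data from the study.
commute = {
    date(2017, 10, 4): {"travelling_time_s": 1500},
    date(2017, 10, 5): {"travelling_time_s": 1320},
}
attendance = {
    date(2017, 10, 5): {"in_time": "09:41", "out_time": "18:55"},
    date(2017, 10, 6): {"in_time": "10:02", "out_time": "19:10"},
}

joined = join_by_date(commute, attendance)
print(joined)
```

Only days present in both sources survive the join, so days without a recorded
commute or without an attendance entry drop out of the analysis.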
From the data available, the influence of the following factors on the in-time
of the employee is considered:
1. Commute-based factors
a. Travelling time
b. Vehicle type
c. Starting place and the route taken
2. Work-related factors
a. Nature of work
b. Previous day's out-time
c. Previous day's hours worked
3. Time-based factors
a. Deterministic and stochastic trend
b. Previous day's in-time
c. Previous day's error of in-time
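The previous-day factors in the list above are lagged versions of the
attendance columns. A minimal Python sketch of how such features could be
derived follows. It assumes times expressed as decimal hours and pairs each
record with the record before it, which is the previous working day only if no
days are missing; the key names and values are hypothetical.

```python
def add_previous_day_features(days):
    """Attach lagged features to each record; days must be sorted by date."""
    rows = []
    for previous, current in zip(days, days[1:]):
        rows.append({
            **current,
            "prev_in_time": previous["in_time"],
            "prev_out_time": previous["out_time"],
            # Hours worked is the gap between out-time and in-time.
            "prev_hours_worked": previous["out_time"] - previous["in_time"],
        })
    return rows


# Hypothetical attendance records in decimal hours, not data from the study.
attendance = [
    {"in_time": 9.5, "out_time": 18.5},
    {"in_time": 10.0, "out_time": 19.0},
    {"in_time": 9.0, "out_time": 17.0},
]

rows = add_previous_day_features(attendance)
for row in rows:
    print(row)
```

The first record has no predecessor and is therefore dropped, so the lagged
data set is one row shorter than the input.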
3.1 Exploratory data analysis
Initial EDA was carried out for all the available factors. Fig 1 shows the
distribution of the dependent variable, in-time, which is not normal. Fig 2
shows the variation of in-time over time; the mean appears to decrease
logarithmically. Fig 3 shows the relationship between travelling time and
in-time. Fig 4 displays the variation of in-time for different transportation
methods. On average, the individual arrives at the workplace earlier when
walking than when travelling by bicycle or by vehicle. Fig 5 presents the
variation of in-time for the two starting positions, A and B. The individual
arrives earlier from location A.
Fig 6 shows the variation of in-time with different types of work. There seems
to be a distinction between work types C and D when compared to A and B. From
Fig 7, a slight decrease in in-time with a decrease in the previous day's
out-time can be observed, especially for different types of work.
Fig 1: In-time distribution
Fig 2: Variation of in-time across time
Fig 3: Variation of travelling time across
Fig 4: Variation in in-time across different modes of transport
Fig 5: Variation of in-time across different
Fig 6: Variation of in-time across different
Fig 7: Change of in-time with the previous day's out-time
3.2 Confirmatory data analytics
The next step is to test the significance of every factor using hypothesis tests. The
conclusions from EDA can be validated using hypothesis tests.
In-time distribution: Fig 1 indicates that in-time is not normally distributed.
A chi-square goodness-of-fit test can be conducted to check whether the
distribution is normal. The null and alternative hypotheses are as follows:
$H_0$: in-time follows a normal distribution
$H_1$: in-time does not follow a normal distribution
The $\chi^2$ test statistic is 258.04, which is higher than the critical value
at α = 5%, indicating that in-time is not normally distributed.
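A goodness-of-fit statistic of this kind can be sketched in Python as follows.
This is an illustrative implementation, not the procedure used in the study:
the choice of ten equal-probability bins is an assumption, the data are
hypothetical, and the critical value 14.067 is the tabulated 5% point of the
chi-square distribution with 7 degrees of freedom.

```python
from statistics import NormalDist, mean, stdev


def chi_square_normality(values, bins=10):
    """Chi-square goodness-of-fit statistic against a fitted normal."""
    fitted = NormalDist(mean(values), stdev(values))
    # Equal-probability bin edges: each bin expects len(values) / bins points.
    edges = [fitted.inv_cdf(i / bins) for i in range(1, bins)]
    observed = [0] * bins
    for v in values:
        index = sum(v > edge for edge in edges)  # number of edges below v
        observed[index] += 1
    expected = len(values) / bins
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    # Mean and standard deviation were estimated, so df = bins - 1 - 2.
    return statistic, bins - 3


# Hypothetical, strongly non-normal sample (not data from the study).
values = [0] * 90 + [100] * 10
statistic, degrees_of_freedom = chi_square_normality(values)

CRITICAL_VALUE_DF7 = 14.067  # chi-square 5% critical value for df = 7
print(statistic, degrees_of_freedom, statistic > CRITICAL_VALUE_DF7)
```

A statistic above the critical value leads to rejecting the null hypothesis of
normality, which is the decision rule applied to the value 258.04 above.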
Travelling time: From Fig 3, no direct correlation between travelling time and
in-time is observed. A correlation test with the null and alternative hypotheses
$H_0: r_{\text{travelling time, in-time}} = 0$
$H_1: r_{\text{travelling time, in-time}} \neq 0$
gave a p-value of 0.43, indicating that travelling time might not be correlated
with in-time.
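For illustration, a correlation test of this form can be sketched in Python
with the standard library. The sketch uses the Fisher z approximation for the
p-value, which is an assumption made here for simplicity; the paper does not
state which variant of the test was used, and an exact t-based test would give
slightly different values. The data are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_p_value(r, n):
    """Two-sided p-value for H0: r = 0 via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical paired observations, not data from the study.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = pearson_r(x, y)
p_value = correlation_p_value(r, len(x))
print(r, p_value)
```

The null hypothesis of zero correlation is rejected when the p-value falls
below the chosen cut-off, 5% in this paper.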
Vehicle type: From Fig 1, it can be observed that in-time is not normally
distributed, and from Fig 4, that the variance is not constant across the
groups. As the assumptions of ANOVA are violated, multiple pairwise t-tests with
a Bonferroni correction for 3 comparisons were conducted. The null and
alternative hypotheses are as follows:
$H_0: \mu_{\text{walking}} = \mu_{\text{bicycle}} = \mu_{\text{vehicle}}$
$H_1$: Not all $\mu$ values are equal
The p-values of the tests are given in Table 1:
Table 1: p-values for t-tests for different vehicle types
Walking 1 1
From Table 1, it can be inferred that there is a statistically significant
difference in in-time between walking and travelling in a motor vehicle. The
same was visualized in Fig 4.
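The Bonferroni correction used for the vehicle-type comparison multiplies each
raw p-value by the number of comparisons and caps the result at 1, which is why
corrected p-values equal to 1 can appear in such a table. A minimal Python
sketch, with hypothetical raw p-values rather than the values from the study:

```python
def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}


# Hypothetical raw p-values for the three pairwise t-tests.
raw = {
    "walking vs bicycle": 0.5,
    "walking vs vehicle": 0.01,
    "bicycle vs vehicle": 0.2,
}

corrected = bonferroni(raw)
for pair, p in corrected.items():
    print(pair, round(p, 4))
```

A pair is declared significantly different only if its corrected p-value is
still below the 5% cut-off.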
Starting location: From Fig 5, it can be observed that there is a noticeable
difference in in-time between starting location A and starting location B. This
can be tested using a t-test with the hypotheses:
$H_0: \mu_A = \mu_B$
$H_1: \mu_A \neq \mu_B$
The p-value for the t-test is 4 × 10 which indicates that there might be a
significant difference in in-time between starting locations A and B.
Nature of work: From Fig 6, it can be inferred that work types A and B might
not be significantly different from each other, whereas work types C and D
differ from type A in terms of in-time. An ANOVA test is conducted to validate
the following hypotheses:
$H_0: \mu_A = \mu_B = \mu_C = \mu_D$
$H_1$: Not all $\mu$ values are equal
The p-value for ANOVA is 2 × 10 indicating that the nature of work is a
significant factor in identifying in-time.
Previous day out-time: Fig 7 shows that as the previous day out-time increases,
in-time decreases. A correlation test is used to validate this hypothesis.
$H_0: r_{\text{previous day out-time, in-time}} = 0$
$H_1: r_{\text{previous day out-time, in-time}} \neq 0$
The p-value is 0.009918, which is less than the 5% cut-off. The previous day
out-time is therefore a significant factor affecting in-time.
Stochastic and deterministic trend: From Fig 2, the time series does not appear
to be stationary. Dickey-Fuller unit root tests (Gujarati, 2009) with the
following null hypotheses:
$Y_t$ is a random walk: $\Delta Y_t = \delta Y_{t-1} + \epsilon_t$
$Y_t$ is a random walk with drift: $\Delta Y_t = \beta_1 + \delta Y_{t-1} + \epsilon_t$
$Y_t$ is a random walk with drift around a deterministic trend: $\Delta Y_t = \beta_1 + \beta_2 t + \delta Y_{t-1} + \epsilon_t$
are used to test whether the series is stationary. The results of the tests are
shown in Table 2. A summary of all the tests is given in Table 3.
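To make the first of these regressions concrete, the Python sketch below
estimates δ in ΔYt = δYt−1 + ϵt by least squares without an intercept and
returns the tau statistic δ divided by its standard error. It covers only the
pure random walk case, uses hypothetical data, and is not the implementation
used in the study. The tau statistic is compared with Dickey-Fuller critical
values (approximately -1.95 at the 5% level for this case), not with the usual
t table.

```python
import math


def dickey_fuller_tau(series):
    """Estimate delta and its tau statistic for dY_t = delta * Y_{t-1} + e_t."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series[:-1], series[1:])]
    sum_yy = sum(y * y for y in lagged)
    # Least-squares slope through the origin.
    delta = sum(d * y for d, y in zip(diffs, lagged)) / sum_yy
    residuals = [d - delta * y for d, y in zip(diffs, lagged)]
    # One parameter was estimated, so the residual df is len(diffs) - 1.
    sigma2 = sum(e * e for e in residuals) / (len(diffs) - 1)
    standard_error = math.sqrt(sigma2 / sum_yy)
    return delta, delta / standard_error


# Hypothetical short series, not data from the study.
series = [2, 1, 2, 1, 2]
delta, tau = dickey_fuller_tau(series)

DF_CRITICAL_5_PERCENT = -1.95  # approximate value, no drift and no trend
print(delta, tau, tau < DF_CRITICAL_5_PERCENT)
```

The unit root null hypothesis is rejected only when tau is more negative than
the critical value; otherwise the series is treated as non-stationary.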
Table 2: Dickey-Fuller test results
Model | Conclusion
Random walk | The series is not stationary
Random walk with drift | The series is not stationary, and there is drift
Random walk with drift around a deterministic trend | The series is not stationary, there is a trend, and there may or may not be drift
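Table 3: Summary of hypothesis tests
Variable | Test | Null hypothesis | p-value | Conclusion
Travelling time | Correlation | r = 0 | 0.4358 | Travelling time does not affect in-time
Vehicle type | Multiple t-tests | | | The difference in walking and motor
Starting place | t-test | | | Starting place affects in-time
Nature of work | ANOVA | | | The difference in in-time between C and D when compared to A
Previous day out-time | Correlation | r = 0 | 0.009918 | Previous day out-time is a significant factor which affects in-time
Trend | Dickey-Fuller | | | Series is not stationary and there is a deterministic trend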
3.3 Modelling and forecasting
A linear regression model was built using stepwise elimination based on AIC.
Travelling time, nature of work, starting location and deterministic trend were
observed to be essential factors affecting in-time. The model summary statistics
can be found in Table 4, Table 5 and Table 6.
Table 4: Model summary
Adjusted R Square 0.5337
Table 5: Regression ANOVA
Regression 5 52.51 Significant
Table 6: Regression coefficients
Term | Coefficient | Standard error | t value | p-value
Intercept | 52.052635 | 6.36202 | |
Travelling time | -0.006913 | 0.002165 | -3.193 |
Nature of work B | 20.008394 | 2.158982 | 9.268 | 2.00×10^-16
Nature of work C | 15.711124 | 3.252032 | 4.831 | 2.54×10^-06
Starting location B | 10.720798 | 3.272792 | 3.276 | 0.00122
log(t) | -10.077145 | 1.178391 | -8.552 |
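The coefficients in Table 6 define a linear predictor for in-time. The Python
sketch below evaluates it for given inputs. Several points are assumptions and
not stated in the paper: log is taken to be the natural logarithm, t is taken
to be a positive time index, nature of work A and starting location A are taken
to be the baseline levels, and the units of in-time and travelling time are
left as in the fitted model.

```python
import math

# Point estimates copied from Table 6.
COEFFICIENTS = {
    "intercept": 52.052635,
    "travelling_time": -0.006913,
    "work_B": 20.008394,
    "work_C": 15.711124,
    "start_B": 10.720798,
    "log_t": -10.077145,
}


def predict_in_time(travelling_time, nature_of_work, starting_location, t):
    """Evaluate the fitted linear predictor for one day."""
    if nature_of_work not in ("A", "B", "C"):
        # Table 6 lists coefficients for work types B and C only.
        raise ValueError("no coefficient available for this work type")
    if starting_location not in ("A", "B"):
        raise ValueError("starting location must be 'A' or 'B'")
    value = COEFFICIENTS["intercept"]
    value += COEFFICIENTS["travelling_time"] * travelling_time
    if nature_of_work != "A":
        value += COEFFICIENTS["work_" + nature_of_work]
    if starting_location == "B":
        value += COEFFICIENTS["start_B"]
    # Natural logarithm assumed for the deterministic trend term.
    value += COEFFICIENTS["log_t"] * math.log(t)
    return value


baseline = predict_in_time(0, "A", "A", 1)
with_work_b = predict_in_time(0, "B", "A", 1)
print(baseline, with_work_b - baseline)
```

Holding the other inputs fixed, switching from work type A to B shifts the
predicted in-time by the work B coefficient, which is how the individual
coefficients in Table 6 are read.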
Taken independently, factors such as the previous day out-time and vehicle type
were significant, while travelling time was not. Their significance changes,
however, in the presence of other variables. The model is significant, as the
p-value of the ANOVA F-test (from Table 5) is less than 5%. A variance inflation
factor, $\mathrm{VIF} = 1/(1 - R^2)$, greater than 4 indicates multicollinearity
between variables (Hair et al., 1998; Kumar, 2017). From Table 7, based on VIF,
there is no multicollinearity among the variables. The ideal number of
independent variables in the model to prevent overfitting is given by Mallows Cp
(Mallows, 1973). Mallows Cp is 4.82, while the number of variables in the
current model is 5, indicating no overfitting. From Table 4, the overall
variation in in-time explained by the model is 54%. The Durbin-Watson test
(Durbin and Watson, 1950; Kumar, 2017) was used to check for autocorrelation in
the residuals. The test statistic is 1.87, with a p-value of 0.22, indicating no
remaining autocorrelation in the residuals. The residuals in Fig 8 show no
Fig 8: Unexplained variation
Table 7: VIF among variables
Nature of work B
Nature of work C
Starting location B
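Two of the diagnostics used in this section have simple closed forms. The
Python sketch below computes the Durbin-Watson statistic from a list of
residuals and the variance inflation factor from the R-squared of regressing
one predictor on the others. It is an illustration with hypothetical inputs and
does not reproduce the values reported in the paper.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def vif_from_r_squared(r_squared):
    """VIF for a predictor, given R^2 of that predictor on the others."""
    return 1 / (1 - r_squared)


# Hypothetical residuals that alternate in sign (negative autocorrelation).
alternating = durbin_watson([1, -1, 1, -1])
# Hypothetical residuals that never change (positive autocorrelation).
constant = durbin_watson([1, 1, 1, 1])

print(alternating, constant, vif_from_r_squared(0.75))
```

Values of the Durbin-Watson statistic near 2 suggest no first-order
autocorrelation, values towards 0 suggest positive autocorrelation and values
towards 4 suggest negative autocorrelation. A VIF of 4 corresponds to an
R-squared of 0.75 between a predictor and the remaining predictors.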
This paper aimed to quantify the different factors affecting the in-time of an
individual using the CRISP-DM method. The regression results in Table 6
illustrate how the individual (Person A) can plan his/her time accordingly.
Travelling time, nature of work and starting location are important factors that
determine in-time. There is a logarithmic decrease in in-time over time. This
can cause problems if there is an unexpected delay in the future. As travelling
time is an essential factor, Person A should manage traffic-related delays
better by finding alternate routes.
This paper illustrates how an individual can use data collected by various
companies or agencies. As the needs and analytics capabilities of every
individual are different, the approach used in the in-time example can be
generalized. The CRISP-DM procedure, as implemented in this paper, can be
summarised as follows:
1. Identify the problem, measurement metrics, and success criterion
2. Identify the factors affecting the problem
3. Find data sources which capture the data for the different features
4. Perform exploratory data analysis to identify the relationships between
variables
5. Perform hypothesis tests to confirm the relationships between the variables
6. Build an explainable model that optimizes the success metrics
7. Incorporate learnings from the study
8. Repeat steps 3 to 7 with newer data sources and factors
Aho, A.-M., Uden, L., 2014. Developing data analytics to improve services in a mechanical
engineering company, in: International Conference on Knowledge Management
in Organizations. Springer, pp. 99–107.
Ailabouni, N., Gidado, K., Painting, N., 2009. Factors affecting employee productivity in
the UAE construction industry, in: 25th Annual ARCOM Conference,
Nottingham, UK. pp. 7–9.
Azevedo, A.I.R.L., Santos, M.F., 2008. KDD, SEMMA and CRISP-DM: a parallel
Chiang, L., Lu, B., Castillo, I., 2017. Big data analytics in chemical engineering. Annual
review of chemical and biomolecular engineering 8, 63–85.
Dinesh Kumar, U., Arun, P., Nachiappan, S.P., 2012. Supply chain optimization at Madurai
Aavin milk dairy.
Durbin, J., Watson, G.S., 1950. Testing for serial correlation in least squares regression: I.
Biometrika 37, 409–428.
Gujarati, D.N., 2009. Basic econometrics. Tata McGraw-Hill Education.
Gurrin, C., Smeaton, A.F., Doherty, A.R., 2014. Lifelogging: Personal big data.
Foundations and Trends® in information retrieval 8, 1–125.
Hair, J.F., Black, W.C., Babin, B.J., Anderson, R.E., Tatham, R.L., 1998. Multivariate data
analysis. Prentice Hall, Upper Saddle River, NJ.
Jeeson, K.J., Jathar, A., Dinesh Kumar, U., 2013. Consumer choice between house brands
and national brands in detergent purchases at Reliance retail.
Kumar, U.D., 2017. Business Analytics: The Science of Data-driven Decision Making.
Lixia, Q., 2010. Empirical research on the importance of incentive factors to life insurance
agents, in: 2010 International Conference On Computer Design and Applications.
IEEE, pp. V5-38-V5-41.
Mallows, C.L., 1973. Some comments on Cp. Technometrics 15, 661–675.
Miller, P.D., 2019. Introduction to Focus: The App Issue. American Book Review 40, 3–4.
Olsson, L.E., Gärling, T., Ettema, D., Friman, M., Fujii, S., 2013. Happiness and
satisfaction with work commute. Social indicators research 111, 255–263.
Pannu, H.S., Kumar, U.D., Farooquie, J.A., 2010. Impact of innovation on the performance
of Indian pharmaceutical industry using Data Envelopment Analysis. IIM
Bangalore Research Paper.
Porter, L.W., Steers, R.M., 1973. Organizational, work, and personal factors in employee
turnover and absenteeism. Psychological bulletin 80, 151.
Schwartz, P.M., 2003. Property, privacy, and personal data. Harv. L. Rev. 117, 2056.
Selke, S., 2014. Lifelogging. Wie die digitale Selbstvermessung unsere Gesellschaft
verändert. Berlin: ECON.
Sellen, A., Whittaker, S., 2010. Beyond total capture: a constructive critique of lifelogging.
Communications of the ACM.
van Hooff, M.L., 2015. The daily commute from work to home: examining employees'
experiences in relation to their recovery status. Stress and Health 31, 124–137.
Wirth, R., Hipp, J., 2000. CRISP-DM: Towards a standard process model for data mining,
in: Proceedings of the 4th International Conference on the Practical Applications
of Knowledge Discovery and Data Mining. Springer-Verlag London, UK, pp.