Statistical Lessons Learned in Personal Informatics Research
Zilu Liang
liang.zilu@kuas.ac.jp
Kyoto University of Advanced Science (KUAS)
Kyoto, Japan
1 Introduction
Personal informatics has emerged as one of the most popular topics in the HCI + health community, with its motto of "self-knowledge through numbers" resonating deeply with many researchers (myself included). As personal informatics researchers, we are fortunate to have an abundance of analytical methods at our disposal, helping us make sense of raw data. It has been a thrilling journey learning from various disciplines like statistics, psychometrics, and fundamental science. What I have learned has come from analyzing my own data collected with early generations of Fitbit wristbands, from digging through literature, and yes, from those painful yet valuable journal and conference reviews. In this paper, I hope to share some lessons I have learned and spark reflections on how we, as an emerging research community, can improve the rigor of our statistical practices. After all, sometimes a little bit of introspection can go a long way.
2 Three Lessons I Have Learned (Among Others)
2.1 Not All "Numbers" Are Numbers
Psychometric questionnaires, particularly the Likert scale, have been widely used in personal informatics research. Simple yet versatile, these scales are often employed to capture subjective perceptions, such as how much participants agree with statements like "Fitbit's sleep data reflect my true sleep quality." These scales commonly feature five levels (e.g., "strongly agree," "agree," "neutral," "disagree," and "strongly disagree"), which are coded with symbols from 1 to 5, giving the appearance of natural numbers.
Naturally, it's tempting to compute means and standard deviations, or to perform t-tests. However, I've learned that individual Likert scale items should not be treated as numbers. The distance between "disagree" (2) and "neutral" (3) is not necessarily equal to the distance between "neutral" (3) and "agree" (4). This is a key distinction: Likert scales provide ordinal data, while natural numbers represent interval data.
This subtle difference has major implications for data analysis. For instance, calculating an average Likert score, such as 4.45, feels odd: what does this number mean? Is it 1.1% more "agree" than 4.40? The extra decimal places have no meaningful interpretation and are better rounded away (as Cohen advocates with his "less is more" principle) [6].
There's good news, however. The Central Limit Theorem comes to the rescue when we aggregate multiple ordinal items into a scale. While still ordinal, the summed data can approximate a normal distribution, allowing us to apply traditional statistical methods [5]. This is why well-validated questionnaires often combine multiple Likert items to measure underlying constructs.
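To make this concrete, here is a minimal simulation (with hypothetical response probabilities, not data from any real study) showing how the sum of several skewed ordinal items approaches a roughly symmetric, normal-looking distribution:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# A skewed 5-point response distribution for one Likert item
# (hypothetical probabilities, chosen only for illustration).
levels = np.arange(1, 6)
probs = [0.05, 0.10, 0.15, 0.40, 0.30]
n_respondents = 10_000

# One ordinal item: clearly non-normal and skewed.
single_item = rng.choice(levels, size=n_respondents, p=probs)

# A 10-item scale score: summing items pulls the distribution
# toward normality, as the Central Limit Theorem predicts.
scale_score = rng.choice(levels, size=(n_respondents, 10), p=probs).sum(axis=1)

print(f"skewness of a single item:   {skew(single_item):.2f}")
print(f"skewness of the 10-item sum: {skew(scale_score):.2f}")  # near 0
```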
That said, I now resist the temptation to calculate means and standard deviations for individual Likert items. Instead, I use the median and quartiles, which are more appropriate for ordinal data [5, 10]. I've also developed the habit of inspecting data distributions visually before calculating any statistics. In my experience, a histogram or other graphical representation of Likert scale data can often reveal far more than summary statistics [9].
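As an illustration, the following sketch summarizes a single Likert item with its median and quartiles and plots its distribution. The responses are fabricated, and the plain bar chart merely stands in for the diverging stacked bar charts that Heiberger and Robbins recommend [9]:

```python
import numpy as np
import matplotlib.pyplot as plt

# Fabricated responses to a single 5-point Likert item.
responses = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5])

# Ordinal-appropriate summaries: median and quartiles, not mean/SD.
q1, median, q3 = np.percentile(responses, [25, 50, 75])
print(f"median = {median}, quartiles = [{q1}, {q3}]")

# Inspect the full distribution before reporting any statistic.
labels = ["strongly\ndisagree", "disagree", "neutral", "agree", "strongly\nagree"]
counts = np.bincount(responses, minlength=6)[1:]  # counts for levels 1..5
plt.bar(labels, counts)
plt.ylabel("Number of responses")
plt.title("Distribution of a single Likert item")
plt.show()
```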
2.2 Many Datasets Are Nested
The rise of wearable devices like Fitbit, Apple Watch, and Oura Ring has made it easier and more cost-effective to collect large volumes of physiological, behavioral, and contextual data. With this new opportunity comes a shift from traditional cross-sectional studies to more sophisticated longitudinal designs, where multiple data points are collected from each participant over time.
This shift from "1-of-N" to "N-of-1" (and eventually "N-of-N") necessitates a reevaluation of the statistical analysis techniques we apply. Many personal informatics studies naturally adopt repeated measures designs, but researchers often apply statistical methods intended for cross-sectional data, which can lead to incorrect conclusions.
For example, in one of my previous studies, I tested three different correlation analysis techniques using a nested dataset: (1) Pearson's correlation on individual data points (sscor), (2) Pearson's correlation after averaging each subject's data (smcor), and (3) repeated measures correlation (rmcor). The results were revealing: sscor and smcor could show correlations where none existed (or vice versa), whereas rmcor, which properly handled the nested data structure, provided more accurate insights [11]. This reinforced an important lesson: if we ignore the nested structure in our data, we risk drawing erroneous conclusions.
The key takeaway here is that we must account for within-subject variance when analyzing repeated measures data. Instead of removing this variance through averaging, which can obscure meaningful patterns, multilevel analysis techniques should be employed (e.g., mixed-effects models).
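The sketch below mirrors this three-way comparison on simulated data (the participant counts, effect sizes, and variable names are hypothetical). It contrasts sscor and smcor, both computed with scipy's pearsonr, against rmcor as implemented by the rm_corr function in the pingouin package:

```python
import numpy as np
import pandas as pd
import pingouin as pg  # pip install pingouin
from scipy.stats import pearsonr

# Simulated nested dataset: 16 participants, 14 daily observations each.
rng = np.random.default_rng(0)
frames = []
for subject in range(16):
    baseline = rng.normal(0, 2)            # between-subject offset
    x = rng.normal(baseline, 1, size=14)
    y = -0.3 * (x - baseline) + rng.normal(baseline, 1, size=14)
    frames.append(pd.DataFrame({"subject": subject, "x": x, "y": y}))
df = pd.concat(frames, ignore_index=True)

# (1) sscor: Pearson on all individual data points, ignoring nesting.
r_ss, p_ss = pearsonr(df["x"], df["y"])

# (2) smcor: Pearson on per-subject means; averaging removes the
# within-subject variance that carries the signal.
means = df.groupby("subject")[["x", "y"]].mean()
r_sm, p_sm = pearsonr(means["x"], means["y"])

# (3) rmcor: repeated measures correlation, which models the nesting.
rm = pg.rm_corr(data=df, x="x", y="y", subject="subject")

print(f"sscor: r = {r_ss:.2f} (p = {p_ss:.3f})")
print(f"smcor: r = {r_sm:.2f} (p = {p_sm:.3f})")
print(rm[["r", "pval"]])
```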
Things get trickier when applying machine learning to nested datasets, as many machine learning algorithms (with tree-based models being notable exceptions) assume data points are independently and identically distributed (the i.i.d. "curse"). This highlights the need for new algorithms that integrate multilevel analysis principles with machine learning approaches, which should be addressed in future personal informatics studies.
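In the meantime, one standard safeguard, which goes beyond what my original study covered, is subject-wise cross-validation: splitting data so that no participant appears in both the training and test sets. Here is a minimal sketch with scikit-learn's GroupKFold on fabricated data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

# Fabricated nested dataset: 16 subjects x 14 days, 5 features per day.
rng = np.random.default_rng(1)
subjects = np.repeat(np.arange(16), 14)
X = rng.normal(loc=subjects[:, None], scale=1.0, size=(len(subjects), 5))
y = X.mean(axis=1) + rng.normal(0, 0.5, size=len(subjects))

model = RandomForestRegressor(random_state=0)

# Naive KFold can place days from the same subject in both the training
# and test folds, leaking subject identity and inflating the score.
naive = cross_val_score(model, X, y,
                        cv=KFold(n_splits=5, shuffle=True, random_state=0))

# GroupKFold keeps each subject entirely within one fold, so the model
# is always evaluated on unseen participants.
grouped = cross_val_score(model, X, y, groups=subjects,
                          cv=GroupKFold(n_splits=5))

print(f"naive KFold R^2: {naive.mean():.2f}")
print(f"GroupKFold R^2:  {grouped.mean():.2f}")
```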
2.3 The P-Value Is Often Misinterpreted
Another lesson that truly hit home for me recently was the growing criticism of the p-value and its role in statistical analysis. Like many researchers, I had relied on the 0.05 threshold for determining statistical significance without fully appreciating its limitations. I used to think the 0.05 threshold was a neat, simple way to interpret results: if p > 0.05, we could safely conclude that the null hypothesis was true.
Figure 1: Correlation analysis between the mean absolute change in heart rate and wake ratio using a nested dataset comprising 229 days of sleep, steps, and heart rate data from 16 participants using Fitbit wristbands [11]. Left: The repeated measures correlation method correctly accounts for within-subject variance, revealing a weak negative correlation with a coefficient of -0.23. Middle: When the dataset was flattened by averaging each participant's data and Pearson's correlation was applied, no correlation was found. This averaging masked valuable insights. Right: Ignoring the nested structure and applying Pearson's correlation directly to the raw data also resulted in no correlation. This example highlights how improper statistical techniques and the failure to consider data structure can lead to misleading results.
It wasn't until recently that I realized what I could actually conclude from p > 0.05 is simply that I couldn't conclude that the null was false; in other words, I could "hardly conclude anything" [6].
The over-reliance on the p-value has contributed to the replication crisis in science and is now considered a major pitfall in modern statistical practice [7]. The backlash against the p-value has been ongoing for decades, perhaps ever since the concept was invented [6]. But it was not until very recently that a consensus began emerging across fields like psychology [8], statistics [3], and beyond [2]. Multiple publications in Nature have advocated for redefining [4] or even abandoning the idea of statistical significance altogether [1, 2, 12].
Rather than discarding the p-value entirely, I now interpret it with caution. A better approach, supported by the statistical community, is to report effect sizes and confidence intervals [2, 8]. These metrics provide a more nuanced understanding of the data and reduce the dangers of relying solely on p-values to make conclusions.
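For instance, here is a minimal sketch (with fabricated numbers) of reporting a paired-samples Cohen's d and a 95% confidence interval for the mean difference alongside, rather than instead of, the t-test's p-value:

```python
import numpy as np
from scipy import stats

# Fabricated paired data: a self-report score before and after an
# intervention for 30 participants.
rng = np.random.default_rng(2)
before = rng.normal(5.0, 1.2, size=30)
after = before + rng.normal(0.4, 1.0, size=30)
diff = after - before

# Effect size: paired-samples Cohen's d (mean difference divided by
# the standard deviation of the differences).
d = diff.mean() / diff.std(ddof=1)

# 95% confidence interval for the mean difference.
ci_low, ci_high = stats.t.interval(0.95, df=len(diff) - 1,
                                   loc=diff.mean(), scale=stats.sem(diff))

# The p-value is still reported, but no longer carries the story alone.
p = stats.ttest_rel(after, before).pvalue

print(f"mean difference = {diff.mean():.2f}, "
      f"95% CI [{ci_low:.2f}, {ci_high:.2f}]")
print(f"Cohen's d = {d:.2f}, p = {p:.3f}")
```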
3 Staying Current With Statistical Practices
Personal informatics is a rapidly evolving field that draws heavily from more traditional research disciplines. However, this "methodological mimicry" can be a double-edged sword: while it opens up new application avenues for statistical techniques, it also carries the risk of perpetuating outdated or incorrect statistical practices. The pace at which new methods are adopted can be slow, particularly in fields like HCI, where researchers may not always keep up with the latest developments in statistical analysis.
I've experienced this firsthand, particularly in relation to the statistical aspects of my work. While my training in informatics has given me solid technical skills, it left me with gaps in statistical knowledge, and I continue to learn and grow in this area. As the field of psychology, for example, undergoes significant reforms to improve statistical practices, I encourage personal informatics researchers to come together, share updated best practices, and develop guidelines that can support newcomers to the field. By working together to embrace better statistical practices, we can improve the quality of our research and ensure that we are using the best possible methods to derive meaningful conclusions from the data we collect. Statistical rigor will not only enhance the credibility of our findings but also support the larger mission of advancing personal informatics in ways that truly benefit individuals.
References
[1] Valentin Amrhein and Sander Greenland. 2018. Remove, rather than redefine, statistical significance. Nature Human Behaviour 2, 4 (2018), 4.
[2] Valentin Amrhein, Sander Greenland, and Blake McShane. 2019. Scientists rise up against statistical significance. Nature 567 (2019), 305–307.
[3] Monya Baker. 2016. Statisticians issue warning over misuse of P values. Nature 531 (2016), 151.
[4] Daniel Benjamin, James Berger, Magnus Johannesson, Brian Nosek, E.-J. Wagenmakers, Richard Berk, Kenneth Bollen, Björn Brembs, Lawrence Brown, Colin Camerer, David Cesarini, Christopher Chambers, Merlise Clyde, Thomas Cook, Paul De Boeck, Zoltan Dienes, Anna Dreber, Kenny Easwaran, Charles Efferson, and Valen Johnson. 2018. Redefine Statistical Significance. Nature Human Behaviour 2 (2018), 6–10. doi:10.1038/s41562-017-0189-z
[5] James Carifio and Rocco J. Perla. 2007. Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes. Journal of Social Sciences 3, 3 (2007), 106–116.
[6] Jacob Cohen. 1990. Things I have learned (so far). American Psychologist 45, 12 (1990), 1304–1312. doi:10.1037/0003-066X.45.12.1304
[7] Lincoln J. Colling and Dénes Szűcs. 2021. Statistical Inference and the Replication Crisis. Review of Philosophy and Psychology 12 (2021), 121–147. doi:10.1007/s13164-018-0421-4
[8] Geoff Cumming, Fiona Fidler, Martine Leonard, Pavel Kalinowski, Ashton Christiansen, Anita Kleinig, Jessica Lo, Natalie McMenamin, and Sarah Wilson. 2007. Statistical Reform in Psychology: Is Anything Changing? Psychological Science 18, 3 (2007), 230–232.
[9] Richard Heiberger and Naomi Robbins. 2014. Design of Diverging Stacked Bar Charts for Likert Scales and Other Applications. Journal of Statistical Software 57, 5 (2014), 1–32. doi:10.18637/jss.v057.i05
[10] Malcolm Koo and Shih-Wei Yang. 2025. Likert-type scale. Encyclopedia 5, 1 (2025), 18.
[11] Zilu Liang. 2022. Correlation Analysis of Nested Consumer Health Data: A New Look at an Old Problem. In 2022 IEEE 4th Global Conference on Life Sciences and Technologies (LifeTech). 46–47. doi:10.1109/LifeTech53646.2022.9754805
[12] Blakeley B. McShane and Andrew Gelman. 2017. Abandon statistical significance. Nature 551 (2017), 558.