Statistical Lessons Learned in Personal Informatics Research

Zilu Liang
liang.zilu@kuas.ac.jp
Kyoto University of Advanced Science (KUAS)
Kyoto, Japan

Abstract

Personal informatics has emerged as one of the most popular topics in the HCI + health community, with its motto of "self-knowledge through numbers" resonating deeply with many researchers (myself included). As personal informatics researchers, we are fortunate to have an abundance of analytical methods at our disposal, helping us make sense of raw data. It has been a thrilling journey learning from various disciplines like statistics, psychometrics, and fundamental science. What I have learned has come from analyzing my own data collected with early generations of Fitbit wristbands, from digging through the literature, and yes, from those painful yet valuable journal and conference reviews. In this paper, I hope to share some lessons I have learned and spark reflections on how we, as an emerging research community, can improve the rigor of our statistical practices. After all, sometimes a little bit of introspection can go a long way.
1 Introduction
Personal informatics has emerged as one of the most popular topics in the HCI + health community, with its motto of "self-knowledge through numbers" resonating deeply with many researchers (myself included). As personal informatics researchers, we are fortunate to have an abundance of analytical methods at our disposal, helping us make sense of raw data. It has been a thrilling journey learning from various disciplines like statistics, psychometrics, and fundamental science. What I have learned has come from analyzing my own data collected with early generations of Fitbit wristbands, from digging through the literature, and yes, from those painful yet valuable journal and conference reviews. In this paper, I hope to share some lessons I have learned and spark reflections on how we, as an emerging research community, can improve the rigor of our statistical practices. After all, sometimes a little bit of introspection can go a long way.
2 Three Lessons I Have Learned (Among Others)
2.1 Not All "Numbers" Are Numbers
Psychometric questionnaires, particularly the Likert scale, have been widely used in personal informatics research. Simple yet versatile, these scales are often employed to capture subjective perceptions, such as how much participants agree with statements like "Fitbit's sleep data reflect my true sleep quality." These scales commonly feature five levels (e.g., "strongly agree," "agree," "neutral," "disagree," and "strongly disagree"), which are coded with symbols from 1 to 5, giving the appearance of natural numbers.
Naturally, it's tempting to compute means and standard deviations, or to perform t-tests. However, I've learned that individual Likert scale items should not be treated as numbers. The distance between "disagree" (2) and "neutral" (3) is not necessarily equal to the distance between "neutral" (3) and "agree" (4). This is a key distinction: Likert scales provide ordinal data, while natural numbers represent interval data.
This subtle difference has major implications for data analysis. For instance, calculating an average Likert score, such as 4.45, feels odd: what does this number mean? Is it 1.1% more "agree" than 4.40? The decimal places have no meaningful interpretation and would be better rounded off (as Cohen advocates with his "less is more" principle) [6].
There's good news, however. The Central Limit Theorem comes to the rescue when we aggregate multiple ordinal items into a scale. While still ordinal, the summed data can approximate a normal distribution, allowing us to apply traditional statistical methods [5]. This is why well-validated questionnaires often combine multiple Likert items to measure underlying constructs.
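A quick simulation makes this concrete. The sketch below (in Python, with entirely hypothetical response probabilities) draws eight skewed five-point items per respondent and shows that the summed scale score spreads into a roughly bell-shaped distribution, even though each individual item remains lumpy and ordinal:

# A minimal sketch with made-up response probabilities: summing several
# ordinal Likert items yields a scale score that is far closer to normal
# than any single item, which is the Central Limit Theorem argument above.
import numpy as np

rng = np.random.default_rng(42)
n_respondents, n_items = 500, 8

# Each item: a skewed 5-point response (1 = strongly disagree ... 5 = strongly agree)
items = rng.choice([1, 2, 3, 4, 5], p=[0.05, 0.10, 0.20, 0.40, 0.25],
                   size=(n_respondents, n_items))
scale_score = items.sum(axis=1)  # aggregated scale score per respondent

print(np.bincount(items[:, 0])[1:])           # one item: five discrete, skewed bins
print(scale_score.mean(), scale_score.std())  # the sum: roughly bell-shaped near 30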
That said, I now resist the temptation to calculate means and standard deviations for individual Likert items. Instead, I use the median and quartiles, which are more appropriate for ordinal data [5, 10]. I've also developed the habit of inspecting data distributions visually before calculating any statistics. In my experience, a histogram or other graphical representation of Likert scale data can often reveal far more than summary statistics [9].
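As an illustration of this habit, here is a minimal sketch on a hypothetical array of single-item responses; reference [9] advocates diverging stacked bar charts for Likert data, but a plain bar chart of counts is shown here for brevity:

# A minimal sketch, assuming `responses` holds hypothetical codes for one
# Likert item: summarize with median and quartiles, then look at the
# distribution itself before reporting any statistic.
import numpy as np
import matplotlib.pyplot as plt

responses = np.array([5, 4, 4, 3, 5, 2, 4, 5, 3, 4, 1, 5, 4, 4, 2])

q1, med, q3 = np.percentile(responses, [25, 50, 75])
print(f"median = {med}, quartiles = [{q1}, {q3}]")

# The full distribution often reveals more than any summary number.
levels, counts = np.unique(responses, return_counts=True)
plt.bar(levels, counts)
plt.xticks([1, 2, 3, 4, 5], ["strongly\ndisagree", "disagree", "neutral",
                             "agree", "strongly\nagree"])
plt.ylabel("count")
plt.show()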
2.2 Many Datasets Are Nested
The rise of wearable devices like Fitbit, Apple Watch, and Oura Ring has made it easier and more cost-effective to collect large volumes of physiological, behavioral, and contextual data. With this new opportunity comes a shift from traditional cross-sectional studies to more sophisticated longitudinal designs, where multiple data points are collected from each participant over time.
This shift from "1-of-N" to "N-of-1" (and eventually "N-of-N") necessitates a reevaluation of the statistical analysis techniques we apply. Many personal informatics studies naturally adopt repeated measures designs, but researchers often apply statistical methods intended for cross-sectional data, which can lead to incorrect conclusions.
For example, in one of my previous studies, I tested three different correlation analysis techniques on a nested dataset: (1) Pearson's correlation on individual data points (sscor), (2) Pearson's correlation after averaging each subject's data (smcor), and (3) repeated measures correlation (rmcor). The results were revealing: sscor and smcor might have shown correlations where none existed (or vice versa), while rmcor, which properly handled the nested data structure, provided more accurate insights [11]. This reinforced an important lesson: if we ignore the nested structure in our data, we risk drawing erroneous conclusions.
The key takeaway here is that we must account for within-subject variance when analyzing repeated measures data. Instead of removing this variance through averaging, which can obscure meaningful patterns, we should employ multilevel analysis techniques (e.g., mixed-effects models).
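To make the comparison concrete, here is a minimal sketch of all three techniques, assuming a hypothetical pandas DataFrame df with columns "subject", "hr_change", and "wake_ratio", and using the pingouin package for the repeated measures correlation:

# A sketch of the three correlation approaches on a hypothetical nested
# DataFrame `df`; only rm_corr accounts for the within-subject structure.
import pingouin as pg
from scipy.stats import pearsonr

# (1) sscor: Pearson's correlation on the raw pooled data points (ignores nesting)
sscor_r, sscor_p = pearsonr(df["hr_change"], df["wake_ratio"])

# (2) smcor: Pearson's correlation after averaging each subject's data
means = df.groupby("subject")[["hr_change", "wake_ratio"]].mean()
smcor_r, smcor_p = pearsonr(means["hr_change"], means["wake_ratio"])

# (3) rmcor: repeated measures correlation, modeling within-subject variance
rmcor = pg.rm_corr(data=df, x="hr_change", y="wake_ratio", subject="subject")
print(rmcor)  # reports r, degrees of freedom, p-value, and a 95% CI

A mixed-effects model offers a more general route to the same goal; for instance, statsmodels' smf.mixedlm("wake_ratio ~ hr_change", df, groups=df["subject"]) fits a random intercept for each participant.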
Things get trickier when applying machine learning to nested datasets, as many machine learning algorithms (with tree-based models being notable exceptions) assume data points are independently and identically distributed (the i.i.d. "curse"). This highlights the need for new algorithms that integrate multilevel analysis principles with machine learning approaches, which future personal informatics studies should address.
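Until such algorithms mature, one pragmatic safeguard (a sketch of evaluation hygiene, not a full multilevel learner) is to at least respect the nesting when validating models. Assuming a feature matrix X, labels y, and an array of subject IDs groups, scikit-learn's GroupKFold keeps each participant's data points within a single fold, so test scores are not inflated by within-subject leakage:

# Group-aware cross-validation on nested data: no subject appears in both
# the training and the test fold. X, y, and groups are assumed to exist.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

cv = GroupKFold(n_splits=5)
scores = cross_val_score(RandomForestRegressor(), X, y, cv=cv, groups=groups)
print(scores.mean())  # average held-out score across subject-disjoint folds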
2.3 The P-Value Is Often Misinterpreted
Another lesson that truly hit home for me recently was the growing criticism of the p-value and its role in statistical analysis. Like many researchers, I had relied on the 0.05 threshold for determining statistical significance without fully appreciating its limitations. I used to think the 0.05 threshold was a neat, simple way to interpret results: if p > 0.05, we could safely conclude that the null hypothesis was true.
Figure 1: Correlation analysis between the mean absolute change in heart rate and wake ratio using a nested dataset comprising 229 days of sleep, steps, and heart rate data from 16 participants using Fitbit wristbands [11]. Left: The repeated measures correlation method correctly accounts for within-subject variance, revealing a weak negative correlation with a coefficient of -0.23. Middle: When the dataset was flattened by averaging each participant's data and Pearson's correlation was applied, no correlation was found. This averaging masked valuable insights. Right: Ignoring the nested structure and applying Pearson's correlation directly to the raw data also resulted in no correlation. This example highlights how improper statistical techniques and the failure to consider data structure can lead to misleading results.
It wasn't until recently that I realized that what I could actually conclude from p > 0.05 is simply that I couldn't conclude that the null was false; in other words, I could "hardly conclude anything" [6].
The over-reliance on the p-value has contributed to the replication crisis in science and is now considered a major pitfall in modern statistical practice [7]. The backlash against the p-value has been ongoing for decades, perhaps ever since the concept was invented [6]. But it is only very recently that a consensus has begun to emerge across fields like psychology [8], statistics [3], and beyond [2]. Multiple publications in Nature have advocated for redefining [4] or even abandoning the idea of statistical significance altogether [1, 2, 12].
Rather than discarding the p-value entirely, I now interpret it with caution. A better approach, supported by the statistical community, is to report effect sizes and confidence intervals [2, 8]. These metrics provide a more nuanced understanding of the data and reduce the danger of relying solely on p-values to draw conclusions.
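As a minimal illustration of what such reporting can look like, the sketch below computes Cohen's d and a 95% confidence interval for the mean difference between two hypothetical groups, assuming equal variances for simplicity:

# A sketch of reporting an effect size plus a confidence interval instead of
# leaning on p < 0.05 alone. The data below are made up for illustration.
import numpy as np
from scipy import stats

def d_and_ci(a, b, conf=0.95):
    """Cohen's d and a confidence interval for the raw mean difference,
    assuming equal variances (a simplification for this sketch)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    pooled_sd = np.sqrt(pooled_var)
    diff = np.mean(a) - np.mean(b)
    se = pooled_sd * np.sqrt(1 / na + 1 / nb)
    half = stats.t.ppf((1 + conf) / 2, df=na + nb - 2) * se
    return diff / pooled_sd, (diff - half, diff + half)

a = np.array([6.2, 7.1, 5.8, 6.9, 7.4, 6.0])   # hypothetical scores, condition A
b = np.array([5.4, 6.0, 5.1, 6.3, 5.7, 5.9])   # hypothetical scores, condition B
d, (lo, hi) = d_and_ci(a, b)
print(f"Cohen's d = {d:.2f}, 95% CI for the mean difference = [{lo:.2f}, {hi:.2f}]")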
3 Staying Current With Statistical Practices
Personal informatics is a rapidly evolving field that draws heavily from more traditional research disciplines. However, this "methodological mimicry" can be a double-edged sword: while it opens up new application avenues for statistical techniques, it also carries the risk of perpetuating outdated or incorrect statistical practices. The pace at which new methods are adopted can be slow, particularly in fields like HCI, where researchers may not always keep up with the latest developments in statistical analysis.
I've experienced this firsthand, particularly in relation to the statistical aspects of my work. While my training in informatics has given me solid technical skills, it left me with gaps in statistical knowledge, and I continue to learn and grow in this area. As the field of psychology, for example, undergoes significant reforms to improve its statistical practices, I encourage personal informatics researchers to come together to share updated best practices and develop guidelines that can support newcomers to the field. By working together to embrace better statistical practices, we can improve the quality of our research and ensure that we are using the best possible methods to derive meaningful conclusions from the data we collect. Statistical rigor will not only enhance the credibility of our findings but also support the larger mission of advancing personal informatics in ways that truly benefit individuals.
References
[1] Valentin Amrhein and Sander Greenland. 2018. Remove, rather than redefine, statistical significance. Nature Human Behaviour 2, 4 (2018), 4.
[2] Valentin Amrhein, Sander Greenland, and Blake McShane. 2019. Scientists rise up against statistical significance. Nature 567 (2019), 305–307.
[3] Monya Baker. 2016. Statisticians issue warning over misuse of P values. Nature 531 (2016), 151.
[4] Daniel Benjamin, James Berger, Magnus Johannesson, Brian Nosek, E.-J. Wagenmakers, Richard Berk, Kenneth Bollen, Björn Brembs, Lawrence Brown, Colin Camerer, David Cesarini, Christopher Chambers, Merlise Clyde, Thomas Cook, Paul De Boeck, Zoltan Dienes, Anna Dreber, Kenny Easwaran, Charles Efferson, and Valen Johnson. 2018. Redefine Statistical Significance. Nature Human Behaviour 2 (2018), 6–10. doi:10.1038/s41562-017-0189-z
[5] James Carifio and Rocco J. Perla. 2007. Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes. Journal of Social Sciences 3, 3 (2007), 106–116.
[6] Jacob Cohen. 1990. Things I have learned (so far). American Psychologist 45, 12 (1990), 1304–1312. doi:10.1037/0003-066X.45.12.1304
[7] Lincoln J. Colling and Dénes Szűcs. 2021. Statistical Inference and the Replication Crisis. Review of Philosophy and Psychology 12 (2021), 121–147. doi:10.1007/s13164-018-0421-4
[8] Geoff Cumming, Fiona Fidler, Martine Leonard, Pavel Kalinowski, Ashton Christiansen, Anita Kleinig, Jessica Lo, Natalie McMenamin, and Sarah Wilson. 2007. Statistical Reform in Psychology: Is Anything Changing? Psychological Science 18, 3 (2007), 230–232.
[9] Richard Heiberger and Naomi Robbins. 2014. Design of Diverging Stacked Bar Charts for Likert Scales and Other Applications. Journal of Statistical Software 57, 5 (2014), 1–32. doi:10.18637/jss.v057.i05
[10] Malcolm Koo and Shih-Wei Yang. 2025. Likert-type scale. Encyclopedia 5, 1 (2025), 18.
[11] Zilu Liang. 2022. Correlation Analysis of Nested Consumer Health Data: A New Look at an Old Problem. In 2022 IEEE 4th Global Conference on Life Sciences and Technologies (LifeTech). 46–47. doi:10.1109/LifeTech53646.2022.9754805
[12] Blakeley B. McShane and Andrew Gelman. 2017. Abandon statistical significance. Nature 551 (2017), 558.