Data Sharing in Wellness, Accessibility, and Aging
Hernisa Kacorri, Utkarsh Dwivedi, Rie Kamikubo
University of Maryland, College Park, MD 20742
Abstract
Curation and sharing of datasets are crucial for innovation, benchmarking, bias
mitigation, and understanding of real-world scenarios where AI-infused applications
are deployed. This is especially the case for datasets from underrepresented
populations typically studied in wellness, accessibility, and aging. However, such
datasets are scarce, and in this paper we highlight challenges for sharing or locating
them. They tend to come from smaller samples with highly variable characteristics,
require expert annotators, and pose more prominent privacy risks. We
discuss sharing practices as they pertain to specific user groups, access methods,
and licensing. Our analysis is based on 140 datasets that were manually located
across different sources: 56 available publicly, 31 available upon request, and 53
unshared but described in manuscripts. To promote discovery and transparency, all
datasets are described at IncluSet, our new data surfacing repository.
1 Introduction
Data lie at the heart of innovative solutions arising from advances in machine learning and artificial
intelligence. They are used to train models and benchmark performance. They are also employed
for understanding the different contexts in which AI-infused applications are deployed and cases of
exclusion [
], namely issues of fairness pertaining to underrepresented populations. Knowing the importance
and cost of data collection and annotation, as well as their potential benefit to the public, researchers
often release data publicly or upon request [
]. However, sharing practices vary considerably among
research communities and contexts [
]. In this work, we focus on wellness, accessibility and aging,
where data sharing can be a double-edged sword for the communities involved.
Sharing has often served to attract, nurture, and challenge the machine learning community to work
on specific problems. Many fields, including the health community, have seized this opportunity [
]. We observe the start of a similar trend in wellness, accessibility, and aging, e.g., the Parkinson’s
condition sub-challenge [
] calling for the speech community to work on neurological state estimation
and the VizWiz data challenge [
] calling for the computer vision community to work on visual
question answering problems for the blind. However, scarcity of large datasets directly sourced
from these underrepresented populations remains one of the biggest challenges [
]. While this
is partly due to smaller samples [
], there are other factors specific to these user groups. People
vary in their individual preferences and environments, but people with disabilities and older adults
lend further dimensions with disparate characteristics, even within a given disability or age group.
Moreover, data annotations often require domain knowledge that few possess, making it difficult to
fit them into crowdsourcing tasks. For instance, creating annotated video datasets for sign language
synthesis requires linguistic background and sign language fluency [
]. More importantly,
there are privacy and ethical concerns for creating and sharing accessibility datasets, as people who
have distinct data patterns may be more susceptible to data abuse and misuse [14, 15, 16, 17].
In this paper, we analyze 140 manually identified accessibility datasets from 1984 to 2019, with a few
examples illustrated in Figure 1; all were used to pre-populate our new data surfacing repository,
IncluSet [
], which aims to contribute to the transparency and discoverability of datasets in wellness,
accessibility, and aging. We hope this discussion moves us a step towards a better understanding of
data sharing practices and potential concerns for datasets from underrepresented communities.
NeurIPS 2020 Workshop on Dataset Curation and Security
Figure 1: Examples of datasets in wellness, accessibility, and aging including, from left to right, photos
taken by blind individuals [
], sign language videos and annotations [
], stroke gestures by
people with low vision [
], mobility app logs from people with visual impairments [
], audio
recordings from people with dysphonia [23], and text written by people with dyslexia [24].
2 Related Work
Prior efforts on collecting and analyzing disability data have mainly focused on demographics,
diagnoses, causes of injury, interventions, outcomes, and costs (e.g. BMS National Database [
]). Other researchers have also looked into survey data [
], accessible websites [
], and geographical
data on the accessibility of physical environments [
]. Our work is complementary to and qual-
itatively different from these prior efforts. It focuses on data resources, typically annotated, that
include raw data generated by people with disabilities (e.g. extrasensory data and images taken by
blind users), which can be used to drive AI-infused assistive technologies or to benchmark models
for inclusivity with respect to disability [
]. For example, video recordings of Deaf signers with
annotated facial expressions timestamps can be used to either train sign language avatars to be more
understandable [
] or explore the performance of facial expression recognition technologies that
might misread linguistically meaningful facial expressions during signing such as emotions [
]. Prior attempts to analyze such datasets from multiple sources tend to focus on a specific population
and task, e.g., sign language recognition [
] and object recognition data [
]. With a broader lens,
this work helps to gain insights on data sharing practices across different populations and tasks.
3 Sharing Risks and Practices in Accessibility Datasets
We characterize sharing of accessibility data as a double-edged sword. This tension between benefits
and harms for social data collection is not new and definitely not unique to wellness, accessibility, and
aging. However, it is more prominent [
], calling for better research practices as well as technical, legal, and
institutional privacy frameworks [
]. Naturally, reporting data collection on smaller populations
can increase the risk of deduction [
]. For example, the location of researchers’ institution when
combined with blind participants’ age, gender, visual acuity, onset, and use of mobility aids, typically
reported on data from navigation studies, may make the participants known to those living in that
area. As certain disability populations can be very small, the effectiveness of privacy-preserving
techniques can also be affected [
], calling for novel approaches (e.g., privacy-enhancing distortions
on sign language datasets [
]). Even when re-identification is not a risk, consent and disclosure can
be, as disability status is sensitive. The same datasets that are collected to mitigate bias against people
with disabilities or to support them through novel AI-infused assistive tech can be used against them
by “detecting” their disabilities. This can happen even when disclosure is not voluntary, posing
further discrimination risks, e.g., for one’s healthcare and employment [33].
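The re-identification concern described above can be made concrete with a small sketch. The code below is an illustration only and is not part of the authors' analysis: it uses k-anonymity, a standard privacy measure that the paper does not name, and a hypothetical participant table with invented values. It counts how many participants share each combination of commonly reported attributes; a participant whose combination is unique is the kind of case where, combined with the location of the researchers' institution, a reader from the same area might recognize the person.

```python
from collections import Counter

# Hypothetical participant table, shaped like the demographics often
# reported for navigation studies. All values are invented.
participants = [
    {"age_band": "30-39", "gender": "F", "acuity": "no light perception", "aid": "guide dog"},
    {"age_band": "30-39", "gender": "F", "acuity": "no light perception", "aid": "guide dog"},
    {"age_band": "40-49", "gender": "M", "acuity": "light perception", "aid": "white cane"},
    {"age_band": "60-69", "gender": "M", "acuity": "20/400", "aid": "white cane"},
]

# Attributes that are not names but can identify someone in combination.
QUASI_IDENTIFIERS = ("age_band", "gender", "acuity", "aid")


def equivalence_classes(rows, keys):
    """Count how many rows share each combination of values for `keys`."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def k_anonymity(rows, keys):
    """Return the size of the smallest group of indistinguishable rows."""
    return min(equivalence_classes(rows, keys).values())


def unique_rows(rows, keys):
    """Return the rows whose combination of values appears exactly once."""
    classes = equivalence_classes(rows, keys)
    return [row for row in rows if classes[tuple(row[k] for k in keys)] == 1]


k = k_anonymity(participants, QUASI_IDENTIFIERS)
singled_out = unique_rows(participants, QUASI_IDENTIFIERS)
print(f"k = {k}; {len(singled_out)} of {len(participants)} participants are unique")
```

With samples this small, coarsening the attributes (for example, reporting only gender) raises k but also removes detail that readers of an accessibility study usually need, which is one way to see the tension the section describes.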
To better understand how researchers are navigating these risks when collecting and sharing datasets
from such underrepresented communities, we started collecting and analyzing accessibility datasets
early in 2018. A key challenge we identified is that many are difficult to locate and require domain and
community knowledge. They are spread across different venues and do not surface in search engines.
Typically, they lack consistent descriptions and require manual screening and, on a few occasions,
guessing or further inquiries, highlighting the importance of a standardized process for documenting
datasets (e.g. [
]). More often, links or request information for the datasets are buried in a
footnote or a specific section of the manuscript, making them challenging to discover. Halfway through
our dataset collection, we were excited to see Google deploy the new Dataset Search engine [
]. Unfortunately, at that time only one dataset related to accessibility, VizWiz [
], surfaced. To promote
discovery and transparency for accessibility datasets, we use our initial dataset collection to pre-
populate and deploy IncluSet [
], an accessibility dataset repository that only stores metadata linking
to the data source and description while supporting data discovery through the Google Schema [
].
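As a rough illustration of what a metadata-only record for dataset discovery can look like, the sketch below builds schema.org `Dataset` markup as JSON-LD, the kind of structured data that dataset search engines read. It is not IncluSet's implementation (the reference list points to a React library, react-schemaorg); the helper name `dataset_jsonld`, the example URL, and all field values are hypothetical.

```python
import json


def dataset_jsonld(name, description, url, keywords, is_free):
    """Build a schema.org Dataset record that links to a dataset.

    Only metadata is stored here; the data itself stays at `url`.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
        "isAccessibleForFree": is_free,
    }


# Hypothetical entry; none of these values come from the paper.
record = dataset_jsonld(
    name="Example sign language video corpus",
    description="Videos of Deaf signers with gloss annotations (illustrative entry).",
    url="https://example.org/datasets/sign-corpus",
    keywords=["accessibility", "sign language", "video"],
    is_free=True,
)

# The serialized markup would typically be embedded in the page that
# describes the dataset, inside a script tag of type application/ld+json.
markup = json.dumps(record, indent=2)
print(markup)
```

A record like this carries no participant data, which matches the repository design described above: it stores a pointer and a description while the creators keep control over access.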
Out of the 140 datasets that were manually located over a two-year period, only 56 can be downloaded
directly (e.g., through a webpage from the dataset creators) and 31 are available upon request (e.g., a
given email by the creators). The remaining 53 don’t include any sharing intent or information (we
still link to them as they fit our criteria). This is not a surprise. The majority of human-computer
interaction researchers that work with these populations do not share data. When looking at 509
papers on wellness, accessibility, and aging published at ACM CHI 2010–2018, Abbott et al. [
] found that only 3 made their data publicly available. This number is quite low when compared to
prior work surveying CHI authors from the same period [
]; researchers found that out of 373
reporting or generating any type of data, 80 shared raw data. Reasons for not sharing included data
sensitivity, participant consent, and re-identification risks. This difference could be explained by
increased privacy risks for accessibility data that are amplified by the risk of disability disclosure.
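The gap between the two surveys is easier to see as percentages. The short calculation below uses only the counts quoted in this section; the variable and function names are ours, and the two surveys sampled papers differently, so the comparison is indicative rather than exact.

```python
def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Access status of the 140 datasets in the collection.
total = 140
by_access = {"public": 56, "upon request": 31, "unshared": 53}
assert sum(by_access.values()) == total

for status, count in by_access.items():
    print(f"{status}: {count} datasets ({share(count, total):.1f}%)")

# CHI papers on wellness, accessibility, and aging with public data.
accessibility_rate = share(3, 509)
# CHI authors reporting or generating data who shared raw data.
general_rate = share(80, 373)

print(f"accessibility-related papers sharing data: {accessibility_rate:.2f}%")
print(f"CHI authors sharing raw data: {general_rate:.1f}%")
```

Under these counts, sharing is well under one percent for the accessibility-related papers and roughly one in five for CHI authors overall.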
In our collection, we see that this non-sharing strategy is not unique to a specific population; it
is prevalent across data from different user groups including those with visual, hearing, cognitive,
speech, and mobility impairments as well as autism. Moreover, we observe that all of our datasets from
people with developmental impairments follow this strategy. Another observation is that children are
often involved in these unshared datasets (e.g., [
]). The only publicly available dataset collected
from children in our repository includes eye tracking measurements of autistic children [42].
3.1 Publicly Available Accessibility Datasets
Datasets in this group can be directly downloaded from personal and project-specific websites (36);
repositories like Kaggle (4), UCI Machine Learning Repository (3), and PhysioNet (5); OrtoLang (3);
Zenodo (3); (1); and Open Science Framework (1). This strategy was most commonly
found across datasets from people who are deaf/Deaf or hard-of-hearing (25), typically including
sign language videos and gloss annotations. The majority of them were shared by computational
linguists and computer vision researchers. We also see this sharing strategy for data from people
with motor impairments (14), e.g., providing touchscreen gestures for users with upper body motor
impairments [
]. The majority of these datasets fall under both motor and cognitive categories (9) as
they typically involve people with progressive conditions such as Parkinson’s and Huntington’s disease, where
symptoms relate to motor and cognitive abilities. Though the motivation for collecting these datasets
may differ, there is an underlying potential for “detecting” such conditions. This strategy of direct
download was also common among datasets sourced from people who are blind or have low vision
(7) sharing their photos (e.g., [
]), touchscreen gestures (e.g., [
]), and walking patterns (e.g., [
]).
Sharing License.
The majority of publicly available datasets did not provide any license information,
form of agreement, or requirement for downloading (31 out of 56). Those that did mainly opted for
the Creative Commons family of licenses (CC: 11, CC BY-NC 2.0: 3, CC 4.0: 1). A few chose the ODC
Public Domain Dedication and License (5) or the New BSD License (2). One dataset used a custom
license (Synapse Commons Governance), and one declared the data under a specific copyright
holder but did not provide a license.
We highlight two more recent efforts found among these datasets that consider
re-identification and privacy risks. The first one relates to detecting progression of Parkinson’s
disease, where researchers make a conscious decision to use only non-speech sounds like breathing,
clearing the throat, and swallowing to predict the risk of onset [
]. However, the risk of disability
disclosure remains. The second one relates to visual question answering systems, where researchers
attempt to recognize the presence of private information in images taken by blind individuals [44].
The majority of the datasets indicate support in their acknowledgments from public funding
such as NSF and the European Union. We observe that sharing efforts start around year 2000 for the
health population and populations related to vision, mobility, hearing, and cognitive impairments;
after 2010 for datasets related to speech impairments; and after 2015 for autism. We haven’t found
any publicly available datasets sourced from people with developmental or learning impairments.
3.2 Datasets Shared Upon Request
Datasets in this group can be accessed only upon request through specific procedures. The most
common practice we observe is to have a dedicated dataset webpage with a note to contact one of
the authors (typically the Project Investigator) given an email address without any further details
on eligibility or process. Another practice is to describe the license agreement, the requirements
to obtain the data, as well as the types of data that would be shared. This information was either
included on the project webpage or in a dedicated section of the publication where the data
were introduced, named Distribution. For example, in the BosphorusSign dataset [
] this section
reads: “The collected corpus will be available to download for academic purposes upon filling a
license agreement available from the BosphorusSign website. The provided data will include ... ”
A contrasting pattern across many of the populations sourced in datasets shared upon request is
that they fall under what is called “invisible disabilities,” disabilities that are less apparent to others
and perhaps more sensitive for disclosure. For example, we see here datasets from people with
language and learning impairments, which were not publicly shared. Overall, this strategy was most
often adopted for datasets sourced from people with cognitive impairments (16) such as people with
dementia, Alzheimer’s disease, or people with mental-health issues. Motivated by early diagnosis
or detection they include logs of daily activities, in video or audio formats, or interaction events
with computing devices. Though fewer in number than the publicly available datasets, we see here
datasets generated by people who are deaf/Deaf or hard-of-hearing (9) focusing mostly on analysis
of linguistic phenomena that can contribute to sign language synthesis (e.g., [
]) and recognition
(e.g., [
]). This sharing strategy also seems to be more prominent for datasets sourced from people
with speech impairments (7), with the goal of improving speech recognition (e.g., dysarthric speech [
]) and assessment tools (e.g., [50]).
Sharing License.
Almost all datasets available upon request did not provide any license information,
form of agreement, or conditions for access prior to contacting (28 out of 31). The few exceptions
included the DEVISIGN datasets [
], which detailed a procedure and specified an application
format, and the dataset by Avgerinakis et al. [
], which mentions non-commercial usage with
additional information to be revealed upon request.
We highlight in chronological order two efforts found among these datasets
that consider re-identification and privacy risks. The first one relates to dementia detection through
videos of activities, where researchers prohibit those requesting the dataset from linking individual
data to any other information, prevent them from contacting any participant, and forbid the use
of participants’ faces in publications of any kind [
]. The second one relates to sign language
corpora, where it is difficult to hide the visual appearance of the signers, as facial expressions and head
movements are critical for conveying meaning. Here, researchers attempt to anonymize named entities
by making relevant signs or mouthing components unrecognizable [53].
All datasets indicate support from public funding (29) (e.g., NSF and the European Union)
and industry (2) (e.g., Microsoft). We observe that one of the first sharing efforts was in 1984 for
language and cognitive impairments populations; around 1995 for speech impairments and health
populations; and 2010-2016 for hearing, mobility, learning, and vision impairments as well as autism.
We haven’t found any datasets available upon request from people with developmental impairments.
4 Conclusion
Datasets directly sourced from underrepresented communities such as people with disabilities and
older adults can contribute to more inclusive AI applications as well as innovative assistive tech-
nologies. However, they are scarce. We discuss challenges for locating such datasets and provide a
data surfacing repository to help with their discovery. Moreover, we present unique challenges and
privacy risks for collecting and sharing these datasets and discuss how strategies (shared publicly,
shared upon request, and unshared) across the 140 datasets prepopulating our repository differ across
populations and research communities. We find that, beyond the lack of standardization, the majority of
shared datasets lacked any license information, form of agreement, or condition for access. Also, very
few of them address potential re-identification and privacy risks. We call for better sharing practices
as well as technical, legal, and institutional privacy frameworks that are more attuned to concerns
from these communities, e.g., risks of inaccurate or non-consenting disclosure of a disability.
Broader Impact
As machine learning expands its role in decision-making processes, so does the impact of
underrepresentation in training and benchmarking data on the lives of people with disabilities, affecting their
employment, economic self-sufficiency, independence, inclusion, and integration into society. Given
the increasing attention in machine learning to concerns of fairness and ethics, we have an opportunity
to ensure that people with disabilities and other underrepresented communities involved in wellness,
accessibility, and aging are part of this conversation. In this paper we discuss why datasets directly
sourced from these communities are scarce, with a focus on data sharing risks. We hope that the insights
from our analysis of sharing practices across 140 datasets from 1984 to 2019 will inform and motivate
appropriate curation and use of such datasets. Moreover, to promote research and educational efforts
that can benefit these communities, we have deployed a data surfacing repository for accessibility
datasets. We note that our repository is not a call to include the underrepresented communities that
we aim to benefit in models that follow rigid categorization and can pose risks of non-voluntary
disability disclosure. On the contrary, we hope it will help us better understand sharing practices
and potential concerns that can feed into the conversations to follow.
Acknowledgments and Disclosure of Funding
We thank our anonymous reviewers for their insightful feedback on an earlier version of this paper.
This work is supported by the National Institute on Disability, Independent Living, and Rehabilitation
Research (NIDILRR), ACL, HHS (#90REGE0008). The opinions herein are those of the authors.
References
[1] Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Eric Horvitz. Identifying unknown unknowns in the open world: Representations and policies for guided exploration. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pages 2124–2132. AAAI Press, 2017.
[2] Benedikt Fecher, Sascha Friesike, and Marcel Hebing. What drives academic data sharing? volume 10, pages 1–25. Public Library of Science, 02 2015.
[3] Ingeborg Meijer, Stephane Berghmans, Helena Cousijn, Clifford Tatum, Gemma Deakin, Andrew Plume, Alex Rushforth, Adrian Mulligan, Sarah de Rijcke, Stacey Tobin, Thed Van Leeuwen, and Ludo Waltman. Open data: the researcher perspective. CWTS, Universiteit Leiden, Leiden, 04 2017.
[4] Carol C. Diamond, Farzad Mostashari, and Clay Shirky. Collecting and sharing data for population health: A new paradigm. volume 28, pages 454–466, 2009.
[5] Mark Walport and Paul Brest. Sharing research data to improve public health. In The Lancet, volume 377, pages 537–539. Elsevier, 2011.
[6] Björn Schuller, Stefan Steidl, Anton Batliner, Simone Hantke, Florian Hönig, Juan Rafael Orozco-Arroyave, Elmar Nöth, Yue Zhang, and Felix Weninger. The interspeech 2015 computational paralinguistics challenge: nativeness, parkinson’s & eating condition. In Sixteenth annual conference of the international speech communication association, pages 478–482, 2015.
[7] Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. IEEE, Jun 2018.
[8] Hernisa Kacorri. Teachable machines for accessibility. Number 119, pages 10–18, New York, NY, USA, November 2017. Association for Computing Machinery.
[9] Danielle Bragg, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, Matt Huenerfauth, Hernisa Kacorri, Tessa Verhoef, et al. Sign language recognition, generation, and translation: An interdisciplinary perspective. 2019.
[10] Meredith Ringel Morris. Ai and accessibility. volume 63, pages 35–37, New York, NY, USA, May 2020. Association for Computing Machinery.
[11] Andrew Sears and Vicki L. Hanson. Representing users in accessibility research. volume 4, pages 7:1–7:6, New York, NY, USA, March 2012. ACM.
[12] Carol Neidle, Ashwin Thangali, and Stan Sclaroff. Challenges in development of the american sign language lexicon video dataset (asllvd) corpus. In 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, LREC. Citeseer, 2012.
[13] Hernisa Kacorri. Data-Driven Synthesis and Evaluation of Syntactic Facial Expressions in American Sign Language Animation. PhD thesis, CUNY Academic Works, 2016.
[14] Foad Hamidi, Kellie Poneres, Aaron Massey, and Amy Hurst. Who should have access to my pointing data?: Privacy tradeoffs of adaptive assistive technologies. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’18, pages 203–216, New York, NY, USA, 2018. ACM.
[15] Jutta Treviranus. The value of being different. In Proceedings of the 16th Web For All 2019 Personalization - Personalizing the Web, W4A ’19, pages 1:1–1:7, New York, NY, USA, 2019.
[16] Anhong Guo, Ece Kamar, Jennifer Wortman Vaughan, Hanna Wallach, and Meredith Ringel Morris. Toward fairness in ai for people with disabilities: A research roadmap. 2019.
[17] Jacob Abbott, Haley MacLeod, Novia Nurain, Gustave Ekobe, and Sameer Patil. Local standards for anonymization practices in health, wellness, accessibility, and aging research at chi. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, pages 1–14, New York, NY, USA, 2019. Association for Computing Machinery.
[18] Hernisa Kacorri, Utkarsh Dwivedi, Sravya Amancherla, Mayanka Jha, and Riya Chanduka. IncluSet: A Data Surfacing Repository for Accessibility Datasets. Association for Computing Machinery, New York, NY, USA, 2020.
[19] Helen Cooper, Eng-Jon Ong, Nicolas Pugeault, and Richard Bowden. Sign Language Recognition Using Sub-units. In Sergio Escalera, Isabelle Guyon, and Vassilis Athitsos, editors, Gesture Recognition, The Springer Series on Challenges in Machine Learning, pages 89–118. Springer International Publishing, Cham, 2017.
[20] Pengfei Lu and Matt Huenerfauth. Cuny american sign language motion-capture corpus: first release. In Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, The 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 2012.
[21] R. Vatavu, B. Gheran, and M. D. Schipor. The Impact of Low Vision on Touch-Gesture Articulation on Mobile Devices. volume 17, pages 27–37, January 2018.
[22] Hernisa Kacorri, Sergio Mascetti, Andrea Gerino, Dragan Ahmetovic, Hironobu Takagi, and Chieko Asakawa. Supporting orientation of people with visual impairment: Analysis of large scale usage data. In Proceedings of the 18th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’16, pages 151–159, New York, NY, USA, 2016. ACM.
[23] Ugo Cesari, Giuseppe De Pietro, Elio Marciano, Ciro Niri, Giovanna Sannino, and Laura Verde. A new database of healthy and pathological voices. volume 68, pages 310–321, May 2018.
[24] Luz Rello, Ricardo Baeza-Yates, and Joaquim Llisterri. DysList: An annotated resource of dyslexic errors. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 1289–1296, Reykjavik, Iceland, May 2014. European Languages Resources Association (ELRA).
[25] Burn Model System National Data and Statistical Center. Burn model system: Advancing recovery through knowledge, 1994.
[26] Jason Markesich. Surveying persons with disabilities: A source guide, version ii. Princeton, NJ: Mathematica Policy Research, 10 2008.
[27] Christian Thomsen and Torben Bach Pedersen. Building a web warehouse for accessibility data. In Proceedings of the 9th ACM International Workshop on Data Warehousing and OLAP, DOLAP ’06, pages 43–50, New York, NY, USA, 2006. ACM.
[28] Chaohai Ding, Mike Wald, and Gary Wills. A survey of open accessibility data. In Proceedings of the 11th Web for All Conference, W4A ’14, pages 37:1–37:4, New York, NY, USA, 2014.
[29] Hernisa Kacorri and Matt Huenerfauth. Continuous profile models in ASL syntactic facial expression synthesis. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2084–2093, Berlin, Germany, August 2016. Association for Computational Linguistics.
[30] Irene Rogan Shaffer. Exploring the performance of facial expression recognition technologies on deaf adults and their children. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’18, pages 474–476, New York, NY, USA, 2018. ACM.
[31] Hamid Vaezi Joze and Oscar Koller. Ms-asl: A large-scale data set and benchmark for understanding american sign language. In The British Machine Vision Conference (BMVC), September 2019.
[32] Kyungjun Lee and Hernisa Kacorri. Hands holding clues for object recognition in teachable machines. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, pages 1–12, New York, NY, USA, 2019. Association for Computing Machinery.
[33] Meredith Whittaker, Meryl Alper, Cynthia L Bennett, Sara Hendren, Liz Kaziunas, Mara Mills, Meredith Ringel Morris, Joy Rankin, Emily Rogers, Marcel Salas, et al. Disability, bias, and ai.
[34] Eun Seo Jo and Timnit Gebru. Lessons from archives: Strategies for collecting sociocultural data in machine learning. FAT* ’20, pages 306–316, New York, NY, USA, 2020. Association for Computing Machinery.
[35] Danielle Bragg, Oscar Koller, Naomi Caselli, and William Thies. Exploring collection of sign language datasets: Privacy, participation, and model performance. In The 22nd International ACM SIGACCESS Conference on Computers and Accessibility, 2020.
[36] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets, 2020.
[37] Natasha Noy, Matthew Burgess, and Dan Brickley. Google dataset search: Building a search engine for datasets in an open web ecosystem. In 28th Web Conference (WebConf 2019), 2019.
[38] Google Inc. React-schemaorg, 2020.
[39] Chat Wacharamanotham, Lukas Eisenring, Steve Haroz, and Florian Echtler. Transparency of chi research artifacts: Results of a self-reported survey. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20, pages 1–14, New York, NY, USA, 2020. Association for Computing Machinery.
[40] Marco Leo, Marco Del Coco, Pierluigi Carcagni, Cosimo Distante, Massimo Bernava, Giovanni Pioggia, and Giuseppe Palestra. Automatic emotion recognition in robot-children interaction for asd treatment. In Proceedings of the 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), ICCVW ’15, pages 537–545, USA, 2015. IEEE Computer Society.
[41] Gaurav Aggarwal and Latika Singh. Evaluation of supervised learning algorithms based on speech features as predictors to the diagnosis of mild to moderate intellectual disability. volume 9, page 55. Springer, 2018.
[42] Huiyu Duan, Guangtao Zhai, Xiongkuo Min, Zhaohui Che, Yi Fang, Xiaokang Yang, Jesús Gutiérrez, and Patrick Le Callet. A dataset of eye movements for the children with autism spectrum disorder. In Proceedings of the 10th ACM Multimedia Systems Conference, pages 255–260, 2019.
[43] Radu-Daniel Vatavu and Ovidiu-Ciprian Ungurean. Stroke-gesture input for people with motor impairments: Empirical results & research roadmap. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, pages 1–14, New York, NY, USA, 2019. Association for Computing Machinery.
[44] D. Gurari, Q. Li, C. Lin, Y. Zhao, A. Guo, A. Stangl, and J. P. Bigham. Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 939–948, 2019.
[45] R. Vatavu, B. Gheran, and M. D. Schipor. The impact of low vision on touch-gesture articulation on mobile devices. volume 17, pages 27–37, 2018.
[46] German H Flores and Roberto Manduchi. Weallwalk: An annotated dataset of inertial sensor time series from blind walkers. volume 11, pages 1–28. ACM New York, NY, USA, 2018.
[47] Hanbin Zhang, Chen Song, Aosen Wang, Chenhan Xu, Dongmei Li, and Wenyao Xu. Pdvocal: Towards privacy-preserving parkinson’s disease detection using non-speech body sounds. In The 25th Annual International Conference on Mobile Computing and Networking, page 16. ACM, 2019.
[48] Necati Cihan Camgöz, Ahmet Alp Kındıroğlu, Serpil Karabüklü, Meltem Kelepir, Ayşe Sumru Özsoy, and Lale Akarun. BosphorusSign: A Turkish Sign Language recognition corpus in health and finance domains. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1383–1388, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA).
[49] JR Deller Jr, MS Liu, LJ Ferrier, and P Robichaud. The whitaker database of dysarthric (cerebral palsy) speech. volume 93, pages 3516–3518. Acoustical Society of America, 1993.
[50] Renee Peje Clapham, Lisette van der Molen, RJJH van Son, Michiel WM van den Brekel, Frans JM Hilgers, et al. Nki-ccrt corpus-speech intelligibility before and after advanced head and neck cancer treated with concomitant chemoradiotherapy. In LREC, volume 4, pages 3350–3355. Citeseer, 2012.
[51] Hanjie Wang, Xiujuan Chai, Xiaopeng Hong, Guoying Zhao, and Xilin Chen. Isolated sign language recognition with grassmann covariance matrices. volume 8, New York, NY, USA, May 2016. Association for Computing Machinery.
[52] Konstantinos Avgerinakis, Alexia Briassouli, and Ioannis Kompatsiaris. Activity detection and recognition of daily living events. In Proceedings of the 1st ACM International Workshop on Multimedia Indexing and Information Retrieval for Healthcare, MIIRH ’13, pages 3–10, New York, NY, USA, 2013. Association for Computing Machinery.
[53] Julian Bleicken, Thomas Hanke, Uta Salden, and Sven Wagner. Using a language technology infrastructure for german in order to anonymize german sign language corpus data. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 3303–3306, 2016.
While advances in technologies like artificial intelligence promise many possibilities for the disability community, they are centered around data-driven approaches. Datasets and data sharing play an important role in training and testing machine learning models and in helping deployed systems work better in the real world. However, sharing data sourced from people with disabilities or older adults poses ethical and privacy concerns, which significantly limit the availability and reuse of accessibility datasets. Given this tension between making such data accessible and restricting access to protect the people represented in the data, this paper serves as a starting point to call for action in developing guidelines and frameworks for the ethical use and sharing of accessibility datasets. The work proposes a mixed-method research approach to gain a deep understanding of the needs and challenges of shared resources in this field. The insights gained will facilitate discussions on the future of data sharing and ownership in accessibility research, helping to inform the development of inclusive AI applications and assistive technologies.
Datasets and data sharing play an important role in innovation, benchmarking, mitigating bias, and understanding the complexity of real-world AI-infused applications. However, there is a scarcity of available data generated by people with disabilities with the potential for training or evaluating machine learning models. This is partially due to smaller populations, disparate characteristics, lack of expertise for data annotation, as well as privacy concerns. Even when data are collected and publicly available, they are often difficult to locate. We present a novel data surfacing repository, called IncluSet, that allows researchers and the disability community to discover and link accessibility datasets. The repository is pre-populated with information about 139 existing datasets: 65 made publicly available, 25 available upon request, and 49 not shared by the authors but described in their manuscripts. More importantly, IncluSet is designed to expose existing and new dataset contributions so they may be discoverable through Google Dataset Search.
Several fields of science are experiencing a "replication crisis" that has negatively impacted their credibility. Assessing the validity of a contribution via replicability of its experimental evidence and reproducibility of its analyses requires access to relevant study materials, data, and code. Failing to share them limits the ability to scrutinize or build upon the research, ultimately hindering scientific progress. Understanding how the diverse research artifacts in HCI impact sharing can help produce informed recommendations for individual researchers and policy-makers in HCI. Therefore, we surveyed authors of CHI 2018-2019 papers, asking if they share their papers' research materials and data, how they share them, and why they do not. The results (34% response rate) show that sharing is uncommon, partly due to misunderstandings about the purpose of sharing and reliable hosting. We conclude with recommendations for fostering open research practices. This paper and all data and materials are freely available at
Parkinson's disease (PD) is a chronic neurodegenerative disorder resulting from the progressive loss of dopaminergic nerve cells. People with PD usually demonstrate deficits in performing basic daily activities, and the relevant annual social cost can reach about $25 billion in the United States. Early detection of PD plays an important role in symptom relief and in improving performance of activities of daily living (ADL), which eventually reduces the societal and economic burden. However, conventional PD detection methods are inconvenient in daily life (e.g., requiring users to wear sensors). To overcome this challenge, we propose and identify non-speech body sounds as a new PD biomarker, and utilize data from smartphone usage to realize passive PD detection in daily life without interrupting the user. Specifically, we present PDVocal, an end-to-end smartphone-based privacy-preserving system towards early PD detection. PDVocal can passively recognize PD digital biomarkers in the voice data during daily phone conversations. At the user end, PDVocal filters the audio stream and extracts only the non-speech body sounds (e.g., breathing, clearing the throat, and swallowing), which contain no privacy-sensitive content. At the cloud end, PDVocal analyzes the body sounds of interest and assesses the health condition using a customized residual network. For the sake of reliability in real-world PD detection, we investigate a performance optimizer that includes an opportunistic learning knob and a long-term tracking protocol. We evaluate our proposed PDVocal on a collected dataset from 890 participants and real-life conversations from publicly available data sources. Results indicate that non-speech body sounds are a promising digital biomarker for privacy-preserving PD detection in daily life.
Developing successful sign language recognition, generation, and translation systems requires expertise in a wide range of fields, including computer vision, computer graphics, natural language processing, human-computer interaction, linguistics, and Deaf culture. Despite the need for deep interdisciplinary knowledge, existing research occurs in separate disciplinary silos, and tackles separate portions of the sign language processing pipeline. This leads to three key questions: 1) What does an interdisciplinary view of the current landscape reveal? 2) What are the biggest challenges facing the field? and 3) What are the calls to action for people working in the field? To help answer these questions, we brought together a diverse group of experts for a two-day workshop. This paper presents the results of that interdisciplinary workshop, providing key background that is often overlooked by computer scientists, a review of the state-of-the-art, a set of pressing challenges, and a call to action for the research community.
A single common characteristic of disability is difference: sufficient difference from the hypothetical average that many things are not made to fit your needs. How do we move from a common accessibility checklist to a system that recognizes our individual differences, our entangled complex lives, and inevitable change? What are the risks and what are the possibilities? How will this benefit the Web? Note: The following is a summary of a keynote delivered in first-person narrative at the Web4All conference, in conjunction with the World Wide Web Conference, during the 30th anniversary of the World Wide Web.
AI technologies have the potential to dramatically impact the lives of people with disabilities (PWD). Indeed, improving the lives of PWD is a motivator for many state-of-the-art AI systems, such as automated speech recognition tools that can caption videos for people who are deaf and hard of hearing, or language prediction algorithms that can augment communication for people with speech or cognitive disabilities. However, widely deployed AI systems may not work properly for PWD, or worse, may actively discriminate against them. These considerations regarding fairness in AI for PWD have thus far received little attention. In this position paper, we identify potential areas of concern regarding how several AI technology categories may impact particular disability constituencies if care is not taken in their design, development, and testing. We intend for this risk assessment of how various classes of AI might interact with various classes of disability to provide a roadmap for future research that is needed to gather data, test these hypotheses, and build more inclusive algorithms.