Independently learned MMD coresets for two randomly selected datasets of 250 female-identified photographs from US yearbooks in the 1990s. The area of each bubble is proportional to the weight of the corresponding exemplar.

Source publication
Article
Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily c...

Contexts in source publication

Context 1
... demonstrate this, we constructed two datasets, each containing 250 randomly selected, female-identified US high-school yearbook photos from the 1990s. Figure 2 shows the MMD coresets obtained for the two datasets (see Section 4.2.1 for full details of dataset and coreset generation). While both datasets were drawn from the same distribution, there is no overlap in the support of the two coresets. ...
Context 2
... coresets in Figure 2 are hard to compare due to their disjoint supports (i.e., the fact that there are no shared exemplars). Comparing the two coresets involves comparing multiple individual photos and assessing their similarities, in addition to incorporating the information encoded in the associated weights. ...
Context 3
... set of candidate images, U, was the entire dataset of 15,367 images. The resulting coresets are shown in Figure 7; the areas of the bubbles correspond to the weights associated with each exemplar (the top row of Figure 2 duplicates Figure 7). ...
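The contexts above describe an MMD coreset as a small set of weighted exemplars chosen so that the weighted exemplar distribution is close, in maximum mean discrepancy, to the full dataset's distribution. As a minimal sketch of the quantity being minimized (not the authors' implementation; the RBF kernel and bandwidth here are illustrative assumptions), the squared MMD between a weighted coreset (Z, w) and a dataset X can be computed as:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF kernel: k(a, b) = exp(-gamma * ||a - b||^2).
    sq_dists = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq_dists)

def mmd_sq(Z, w, X, gamma=1.0):
    """Squared MMD between a weighted coreset (exemplars Z, weights w)
    and an empirical dataset X with uniform weights 1/n."""
    n = X.shape[0]
    term_zz = w @ rbf_kernel(Z, Z, gamma) @ w          # coreset vs. coreset
    term_zx = 2.0 * w @ rbf_kernel(Z, X, gamma).sum(1) / n  # cross term
    term_xx = rbf_kernel(X, X, gamma).sum() / n**2     # dataset vs. dataset
    return term_zz - term_zx + term_xx
```

When the coreset equals the dataset with uniform weights, the squared MMD is zero; a coreset drawn from a shifted distribution yields a strictly larger value, which is the sense in which the weighted exemplars "summarize" the data.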

Citations

... al. proposed a hierarchical dataset summarization method [32] that organizes group entities into a hierarchy. In [33], MMD coresets are proposed as a data-summarization method useful for understanding multiple related datasets. ...
Preprint
Artificial intelligence has become mainstream and its applications will only proliferate. Specific measures must be taken to integrate such systems into society for the general benefit. One tool for achieving this is explainability, which boosts trust in and understanding of decisions between humans and machines. This research offers an update on the current state of explainable AI (XAI). Recent XAI surveys in supervised learning show convergence of the main conceptual ideas. We list applications of XAI in the real world with concrete impact. The list is short, and we issue a call to action: to validate all the hard work done in the field with applications that go beyond experiments on datasets and instead drive decisions and changes. We identify new frontiers of research: explainability of reinforcement learning and of graph neural networks. For the latter, we give a detailed overview of the field.
... The relationship between data and phenomena is still assumed to be stable, and the technical work of data science projects is typically assumed to be the 'one-off application' of a statistical model to a given static dataset (Polyzotis et al. 2017). Changes to data (or its underlying distribution), called 'data drift,' are detrimental to model performance and are to be controlled (Hohman et al. 2020; Hoens et al. 2012; Williamson and Henderson 2021). While several studies explore ways to set up a 'pipeline' for data science activities with continuously incoming, and possibly changing, datasets (e.g., Roh et al. 2019; Breck et al. 2019; Lourenço et al. 2019), these activities still mostly center on data scientists working on their computers. ...
... This focus on analytic models or algorithms is consistent with the priorities of technical data science and machine learning practitioners on developing generalizable machine learning models that learn to make accurate predictions "independent of which dataset is used" (Hohman et al. 2020, p.1; Sambasivan et al. 2021). Changes in data are seen as detrimental to model performance and need to be tightly controlled (Gama et al. 2004; Sculley et al. 2015; Williamson and Henderson 2021). The data are rendered abstract and static in that a single data set can be statistically manipulated and modeled in many different ways. ...
... Second, the technical work of data science projects is typically approached as the 'one-off application' of a statistical model to a given dataset (Polyzotis et al. 2017). Built on an assumption of a 'largely stable world' (Marcus 2018), data science often views changes to data (or its underlying distribution), called 'data drift', as detrimental to model performance (Hohman et al. 2020; Hoens et al. 2012; Williamson and Henderson 2021). For data science activities to be sustainable, data need to be managed to account for changes to data in the dynamic world (e.g., Amershi et al. 2019b; Bopp et al. 2017). Research has also looked into how end-users work with their data and model settings (Seidelin et al. 2020; Ferreira and Monteiro 2020), and into how providing more interactive visualization techniques and interface design may allow domain experts to directly build and modify models relevant to the context of their data and problems (Amershi et al. 2014; Gil et al. 2019). ...
Thesis
How can we make data science systems more actionable? This dissertation explores this question by placing end-users and their data practices, rather than data scientists and their technical work of building models and algorithms, at the center of data science systems. Inspired by phenomenological views of technical systems from CSCW, HCI, and STS, I use ethnographic and other qualitative methods to understand how participants from four studies worked with data across three settings: craft brewers producing beers, people with visual impairments engaging with image descriptions of their photos on their smartphones, and repair workers repairing broken artifacts. I analyze implications for making data science systems actionable by framing the participants as potential end-users of these systems. My findings emphasize that actionability in data science systems concerns not just predictions made on mostly given datasets. Actionability in my settings arose from the ongoing work of making data relevant to artifacts and phenomena that end-users engaged with in their practices and settings. I show how this ongoing work of making data relevant was challenging. The properties of artifacts and phenomena were inherently multiple and their relevance was contingent on end-users’ situations. I describe end-users’ data practices as processes of “registering” (making intelligible) a contingent yet coherent set of properties to turn multiple, uncertain artifacts and phenomena into actionable versions. My dissertation makes several contributions to emerging research on actionability and data science in CSCW, HCI, and STS literature. First, based on my findings, I theorize an approach to data science systems that imagines actionability as driven not so much by data scientists generating predictions, or even by putting humans in the loop, but by placing end-users at the center. 
Second, my end-user approach to data science systems informs the technical work of data science by proposing requirements for models and algorithms to be accountable not just in their predictions but to end-users’ practices and settings. Third, my dissertation integrates into data science research foundational phenomenological views from CSCW that focus on how technological systems can account for and support end-users in their domains of practice, rather than the other way around.
Article
Domain experts play an essential role in data science by helping data scientists situate their technical work beyond the statistical analysis of large datasets. How domain experts themselves may engage with data science tools as a type of end-user remains largely invisible. Understanding data science as domain expert-driven depends on understanding how domain experts use data. Drawing on an ethnographic study of a craft brewery in Korea, we show how craft brewers worked with data by situating otherwise abstract data within their brewing practices and settings. We contribute theoretical insight into how domain experts use data distinctly from technical data scientists in terms of their view of data (situated vs. abstract), purposes for engaging with data (guiding processes over predicting outcomes), and overall goals of using data (flexible control vs. precision). We propose four ways in which working with data can be supported through the design of data science tools, and discuss how craftwork can be a useful lens for integrating domain expert-driven understandings of data science into CSCW and HCI research.