Marco Tulio Ribeiro's research while affiliated with Microsoft and other places

Publications (27)

Preprint
We introduce SpotCheck, a framework for generating synthetic datasets to use for evaluating methods for discovering blindspots (i.e., systemic errors) in image classifiers. We use SpotCheck to run controlled studies of how various factors influence the performance of blindspot discovery methods. Our experiments reveal several shortcomings of existi...
Article
Feature attribution methods are popular in interpretable machine learning. These methods compute the attribution of each input feature to represent its importance, but there is no consensus on the definition of "attribution", leading to many competing methods with little systematic evaluation, complicated in particular by the lack of ground truth a...
Preprint
Interpretability methods are developed to understand the working mechanisms of black-box models, which is crucial to their responsible deployment. Fulfilling this goal requires both that the explanations generated by these methods are correct and that people can easily and reliably understand them. While the former has been addressed in prior work,...
Conference Paper
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckLis...
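As a hedged illustration of the behavioral-testing idea (a minimal sketch in plain Python, not the paper's CheckList library or its API), a minimum functionality test can be built from a template of simple cases with known expected labels; predict_sentiment below is a stand-in for any black-box sentiment model:

# Minimal sketch of a behavioral (minimum functionality) test for a sentiment model.
# `predict_sentiment` is a hypothetical stand-in for any black-box NLP model;
# the template/expectation style loosely mirrors behavioral testing, not a real API.

def minimum_functionality_test(predict_sentiment):
    template = "The flight was {adj}."
    cases = {
        "positive": ["great", "wonderful", "pleasant"],
        "negative": ["awful", "terrible", "miserable"],
    }
    failures = []
    for expected, adjectives in cases.items():
        for adj in adjectives:
            text = template.format(adj=adj)
            if predict_sentiment(text) != expected:
                failures.append((text, expected))
    return failures  # an empty list means the model passes this capability test

# Usage with a trivially correct toy model:
toy = lambda t: "negative" if any(w in t for w in ["awful", "terrible", "miserable"]) else "positive"
assert minimum_functionality_test(toy) == []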
Preprint
Machine learning models often use spurious patterns such as "relying on the presence of a person to detect a tennis racket," which do not generalize. In this work, we present an end-to-end pipeline for identifying and mitigating spurious patterns for image classifiers. We start by finding patterns such as "the model's prediction for tennis racket c...
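A hedged sketch of the kind of check this pipeline implies, not the paper's actual method: mask out a candidate spurious region (e.g., the person) and measure how much the class score drops; model, region_mask, and the mean-fill choice are illustrative assumptions:

import numpy as np

def spurious_dependence(model, image, region_mask, class_idx):
    """Compare the model's score for a class before and after blanking a
    candidate spurious region (e.g., the person next to a tennis racket).
    A large drop suggests the prediction depends on that region."""
    masked = image.copy()
    masked[region_mask] = image.mean()            # replace region with a neutral fill
    original_score = model(image[None])[0, class_idx]
    masked_score = model(masked[None])[0, class_idx]
    return original_score - masked_score          # > 0: score fell when the region was removed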
Preprint
Feature attribution methods are exceedingly popular in interpretable machine learning. They aim to compute the attribution of each input feature to represent its importance, but there is no consensus on the definition of "attribution", leading to many competing methods with little systematic evaluation. The lack of attribution ground truth further...
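One way to make the missing-ground-truth point concrete (an illustrative sketch, not the paper's evaluation protocol): construct a synthetic dataset where the label depends on a single known feature, then check whether a simple ablation-style attribution ranks that feature first:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 2] > 0).astype(int)          # by construction, only feature 2 matters

model = LogisticRegression().fit(X, y)

# A crude "attribution": mean absolute change in predicted probability
# when each feature is ablated (set to its mean), one feature at a time.
base = model.predict_proba(X)[:, 1]
attributions = []
for j in range(X.shape[1]):
    X_ablate = X.copy()
    X_ablate[:, j] = X[:, j].mean()
    attributions.append(np.abs(base - model.predict_proba(X_ablate)[:, 1]).mean())

assert int(np.argmax(attributions)) == 2   # the known ground-truth feature is ranked first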
Preprint
Counterfactual examples have been shown to be useful for many applications, including calibrating, evaluating, and explaining model decision boundaries. However, previous methods for generating such counterfactual examples have been tightly tailored to a specific application, used a limited range of linguistic patterns, or are hard to scale. We pro...
Preprint
Increasingly, organizations are pairing humans with AI systems to improve decision-making and reduce costs. Proponents of human-centered AI argue that team performance can improve even further when the AI model explains its recommendations. However, a careful analysis of existing literature reveals that prior studies observed improvements due to...
Preprint
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckLis...
Preprint
Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing existence, properties, and spatial relationships of entities, a significant portion of questions pose challenges that correspond to reasoning tasks -- tasks that can only be answered through...
Article
We introduce a novel model-agnostic system that explains the behavior of complex models with high-precision rules called anchors, representing local, "sufficient" conditions for predictions. We propose an algorithm to efficiently compute these explanations for any black-box model with high-probability guarantees. We demonstrate the flexibility of a...
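To make "local, sufficient conditions" concrete, here is a hedged sketch (not the authors' anchor-search algorithm): estimate the precision of a candidate anchor by holding the anchored features fixed, resampling the remaining features from background data, and measuring how often the model's prediction is preserved; model, anchor_idx, and X_background are illustrative names:

import numpy as np

def anchor_precision(model, x, anchor_idx, X_background, n_samples=1000, seed=0):
    """Estimate precision of a candidate anchor: the fraction of perturbed
    instances (anchored features fixed to x's values, the rest resampled
    from data) on which the model keeps its original prediction."""
    rng = np.random.default_rng(seed)
    original = model(x[None])[0]
    rows = X_background[rng.integers(len(X_background), size=n_samples)].copy()
    rows[:, anchor_idx] = x[anchor_idx]          # hold anchored features fixed
    return float(np.mean(model(rows) == original))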
Article
Recent work in model-agnostic explanations of black-box machine learning has demonstrated that interpretability of complex models does not have to come at the cost of accuracy or model flexibility. However, it is not clear what kind of explanations, such as linear models, decision trees, and rule lists, are the appropriate family to consider, and d...
Article
At the core of interpretable machine learning is the question of whether humans are able to make accurate predictions about a model's behavior. Assumed in this question are three properties of the interpretable output: coverage, precision, and effort. Coverage refers to how often humans think they can predict the model's behavior, precision to how...
Conference Paper
Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model...
Article
Understanding why machine learning models behave the way they do empowers both system designers and end-users in many ways: in model selection, feature engineering, in order to trust and act upon the predictions, and in more intuitive user interfaces. Thus, interpretability has become a vital concern in machine learning, and work in the area of int...
Conference Paper
Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust in a model. Trust is fundamental if one plans to take action based on a prediction, or when choosing whether or not to deploy a new model. Such understanding further provides in...
Article
Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust in a model. Trust is fundamental if one plans to take action based on a prediction, or when choosing whether or not to deploy a new model. Such understanding further provides in...

Citations

... For example, feature visualization has conventionally relied on human interpretations, but labels and/or latents from outside models could be used to automate much of this process. Additional progress toward obtaining rigorous, quantitative results without a human in the loop has been made by [277], who proposed a mathematical framework for quantitative evaluation of model understanding, and [109], who introduced a statistical pipeline for quantifying interpretability based on proxy measures. ...
... A different technique to computationally evaluate saliency maps for classification models is to compare them with ground-truth saliency maps on modified datasets (Yang and Kim, 2019; Zhou et al., 2022). Here, a natural dataset is manipulated by adding artificial features that a model has to focus on to perfectly classify the dataset. ...
... When a bit is set to zero in the surrogate space, the conversion function η_x must map the resulting vector to the original space. For images, this can be achieved by replacing the toggled-off super-pixels with a baseline monochrome segment or with a patch from another image [16]. LIME weighs the neighbors in X̃ according to a kernel function π_x^σ (based on a distance D and a bandwidth hyper-parameter σ ∈ ℝ+) on the surrogate space, that is, ...
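The weighting step described in this excerpt can be sketched as follows (an assumption-laden illustration, not LIME's implementation): binary surrogate samples are weighted by an exponential kernel on their distance to the all-ones vector, and a weighted linear surrogate is fit on the black-box outputs; model_prob, the normalized Hamming distance, and Ridge regression are stand-in choices:

import numpy as np
from sklearn.linear_model import Ridge

def fit_local_surrogate(model_prob, z_samples, sigma=0.75):
    """Hedged sketch of the weighting + fitting step: z_samples are binary
    vectors in the surrogate space (1 = super-pixel kept, 0 = toggled off),
    model_prob(z) returns the black-box probability for the mapped instance.
    Weights follow an exponential kernel pi(z) = exp(-D(x', z)^2 / sigma^2),
    with x' the all-ones vector and D the normalized Hamming distance."""
    x_prime = np.ones(z_samples.shape[1])
    distances = np.mean(z_samples != x_prime, axis=1)       # D(x', z)
    weights = np.exp(-(distances ** 2) / sigma ** 2)        # kernel weights
    targets = np.array([model_prob(z) for z in z_samples])
    surrogate = Ridge(alpha=1.0).fit(z_samples, targets, sample_weight=weights)
    return surrogate.coef_                                  # per-super-pixel importance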
... On the other hand, recent criticism of the validity of many benchmarks for capturing real-world performance of AI systems [3] suggests that the development of fewer, but more quality-assured benchmarks covering multiple AI capabilities might be desirable [9]. Broad coverage of AI capabilities by single benchmarks might also extend the 'lifetime' of benchmarks before they become saturated with results from highly specialized AI models [1,8]. Ideally, future benchmarks would be increasingly developed through large collaborative teams from many institutions, knowledge domains and cultures, ensuring high quality, diversity and representativeness of benchmarks. ...
... Counterfactuals identify changes necessary for achieving alternative and possibly more desirable outcomes [170], for instance what should be changed in a loan application in order for it to be approved. They have become popular in XAI as a means to help stakeholders form actionable plans and control the system's behavior [102,82], but have recently shown promise as a means to design novel interaction strategies [84,183,44]. Attention mechanisms [18,166] also offer insight into the decision process of neural networks and, although their interpretation is somewhat controversial [20], they are a viable alternative to gradient-based attributions for integrating explanatory feedback into the model in an end-to-end manner [113,73]. ...
... Recent research has shown that explanations may increase blind trust [40]. Simply stating that conversation mode provides a better-quality source can be ethically fraught. ...
... Explainable or interpretable ML (Caruana, Lundberg, Ribeiro, Nori, & Jenkins, 2020; Mitchell, 2019; Molnar, 2020; Senoner, Netland, & Feuerriegel, 2021) can offer support by allowing practitioners to observe both the outputs of their models and the (potential) reasoning that supports them. However, explainable ML often delivers insights that are not perceived as useful by people (Molnar, 2020). ...
... Alternatively, [5,23,27] used relations between questions to impose constraints in the VQA's embedding space. To avoid needing to know the relation between questions, [20] proposed to enforce consistency by making the attention maps of reasoning and perception questions similar to one another. However, even though these approaches tackle unconstrained question relations, ensuring the consistency of VQA models remains limited and often reduces overall performance [20]. ...
... ACE is the first constraint-aware early stopping algorithm for HPO. Our experiments demonstrate that ACE obtains superior performance to constraint-agnostic early stopping baselines (Li et al., 2020) on the UCI credit card dataset (Yeh and Lien, 2009) with a fairness constraint (Bird et al., 2020) and on GLUE SST-2 with a robustness constraint (Ribeiro et al., 2020). ...
... Niven and Kao, 2019; Geva et al., 2019; Shah et al., 2020). Further, aggregate and thus abstract performance metrics such as accuracy and F1 score may obscure more specific model weaknesses (Wu et al., 2019). ...