Marcelo Arenas’s research while affiliated with University of Santiago Chile and other places


Publications (183)


Probabilistic Explanations for Linear Models
  • Article

April 2025 · 3 Reads · Proceedings of the AAAI Conference on Artificial Intelligence

Bernardo Subercaseaux · Marcelo Arenas · Kuldeep S. Meel

Formal XAI is an emerging field that focuses on providing explanations with mathematical guarantees for the decisions made by machine learning models. A significant amount of work in this area is centered on the computation of "sufficient reasons". Given a model M and an input instance x, a sufficient reason for the decision M(x) is a subset S of the features of x such that for any instance z that has the same values as x for every feature in S, it holds that M(x) = M(z). Intuitively, this means that the features in S are sufficient to fully justify the classification of x by M. For sufficient reasons to be useful in practice, they should be as small as possible, and a natural way to reduce their size is to consider a probabilistic relaxation: the probability of M(x) = M(z) must be at least some value δ ∈ (0,1], where z is a random instance that coincides with x on the features in S. Computing small δ-sufficient reasons (δ-SRs) is known to be a theoretically hard problem; even over decision trees, traditionally deemed simple and interpretable models, strong inapproximability results make the efficient computation of small δ-SRs unlikely. We propose the notion of (δ, ε)-SR, a simple relaxation of δ-SRs, and show that this kind of explanation can be computed efficiently over linear models.
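The probabilistic condition in the definition above can be illustrated with a small Monte Carlo sketch. This is not the paper's algorithm; the majority-vote model, the binary feature space, and the uniform sampling scheme are assumptions made here purely for illustration:

```python
import random

def is_delta_sufficient(model, x, S, delta, n_features, samples=10_000, seed=0):
    """Monte Carlo check of the delta-SR condition: estimate
    P[model(z) == model(x)] where z agrees with x on the features in S
    and is uniform over {0, 1} elsewhere."""
    rng = random.Random(seed)
    target = model(x)
    hits = sum(
        model([x[i] if i in S else rng.randint(0, 1) for i in range(n_features)]) == target
        for _ in range(samples)
    )
    return hits / samples >= delta

# Toy model: majority vote over three binary features.
majority = lambda v: int(v[0] + v[1] + v[2] >= 2)

# Fixing features 0 and 1 of x = (1, 1, 0) forces the majority to 1,
# so {0, 1} is a sufficient reason even for delta = 1.
print(is_delta_sufficient(majority, [1, 1, 0], {0, 1}, 1.0, 3))  # True
```

For δ < 1 the sampled estimate is only approximate; the hardness results discussed in the abstract concern exact guarantees, which sampling alone does not provide.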


Probabilistic Explanations for Linear Models
  • Preprint
  • File available

December 2024 · 17 Reads

Formal XAI is an emerging field that focuses on providing explanations with mathematical guarantees for the decisions made by machine learning models. A significant amount of work in this area is centered on the computation of "sufficient reasons". Given a model M and an input instance x, a sufficient reason for the decision M(x) is a subset S of the features of x such that for any instance z that has the same values as x for every feature in S, it holds that M(x) = M(z). Intuitively, this means that the features in S are sufficient to fully justify the classification of x by M. For sufficient reasons to be useful in practice, they should be as small as possible, and a natural way to reduce the size of sufficient reasons is to consider a probabilistic relaxation: the probability of M(x) = M(z) must be at least some value δ ∈ (0,1], for a random instance z that coincides with x on the features in S. Computing small δ-sufficient reasons (δ-SRs) is known to be a theoretically hard problem; even over decision trees, traditionally deemed simple and interpretable models, strong inapproximability results make the efficient computation of small δ-SRs unlikely. We propose the notion of (δ, ε)-SR, a simple relaxation of δ-SRs, and show that this kind of explanation can be computed efficiently over linear models.
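As a rough illustration of the linear-model setting, one can estimate the agreement probability by sampling and greedily fix high-weight features until a target δ is reached. This is a heuristic sketch, not the (δ, ε)-SR algorithm from the paper; the weights, threshold, and greedy order are invented here:

```python
import random

def prob_same_sign(w, b, x, S, samples=20_000, seed=1):
    """Estimate P[sign(w·z + b) = sign(w·x + b)] for a linear classifier,
    with z agreeing with x on S and uniform over {0, 1} elsewhere."""
    rng = random.Random(seed)
    sign = lambda v: v >= 0
    target = sign(sum(wi * xi for wi, xi in zip(w, x)) + b)
    hits = 0
    for _ in range(samples):
        z = [x[i] if i in S else rng.randint(0, 1) for i in range(len(x))]
        hits += sign(sum(wi * zi for wi, zi in zip(w, z)) + b) == target
    return hits / samples

def greedy_delta_sr(w, b, x, delta):
    """Heuristic: fix features in decreasing order of |weight| until the
    estimated agreement probability reaches delta."""
    S = set()
    for i in sorted(range(len(w)), key=lambda i: -abs(w[i])):
        if prob_same_sign(w, b, x, S) >= delta:
            break
        S.add(i)
    return S

# With w = (3, 1, 1, -1), b = -2 and x = (1, 1, 0, 0), fixing x0 = 1
# alone already forces a positive classification.
print(greedy_delta_sr([3, 1, 1, -1], -2, [1, 1, 0, 0], 0.9))  # {0}
```

The greedy order by |weight| is only a plausible heuristic; it carries none of the guarantees that the (δ, ε)-SR notion is designed to make attainable.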


Towards Tractability of the Diversity of Query Answers: Ultrametrics to the Rescue

November 2024 · Proceedings of the ACM on Management of Data

The set of answers to a query may be very large, potentially overwhelming users when presented with the entire set. In such cases, presenting only a small subset of the answers to the user may be preferable. A natural requirement for this subset is that it should be as diverse as possible to reflect the variety of the entire population. To achieve this, the diversity of a subset is measured using a metric that determines how different two solutions are and a diversity function that extends this metric from pairs to sets. In the past, several studies have shown that finding a diverse subset from an explicitly given set is intractable even for simple metrics (like Hamming distance) and simple diversity functions (like summing all pairwise distances). This complexity barrier becomes even more challenging when trying to output a diverse subset from a set that is only implicitly given (such as the query answers for a given query and a database). Until now, tractable cases have been found only for restricted problems and particular diversity functions. To overcome these limitations, we focus in this work on the notion of ultrametrics, which have been widely studied and used in many applications. Starting from any ultrametric d and a diversity function δ extending d, we provide sufficient conditions over δ for having polynomial-time algorithms to construct diverse answers. To the best of our knowledge, these conditions are satisfied by all the diversity functions considered in the literature. Moreover, we complement these results with lower bounds that show specific cases when these conditions are not satisfied and finding diverse subsets becomes intractable. We conclude by applying these results to the evaluation of conjunctive queries, demonstrating efficient algorithms for finding a diverse subset of solutions for acyclic conjunctive queries when the attribute order is used to measure diversity.
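A concrete instance of the setup above is the ultrametric induced by an attribute order, paired with the sum-of-pairwise-distances diversity function. The sketch below is illustrative only; the answer tuples are invented, and the 2^-k prefix distance is one standard way to build an ultrametric from an attribute order, not necessarily the construction used in the paper:

```python
def prefix_distance(a, b):
    """Ultrametric induced by attribute order: 2^-k, where k is the
    length of the longest common prefix of the two answer tuples."""
    k = 0
    while k < len(a) and a[k] == b[k]:
        k += 1
    return 0.0 if k == len(a) else 2.0 ** -k

# The strong (ultrametric) triangle inequality:
#   d(x, z) <= max(d(x, y), d(y, z))  for all x, y, z.
answers = [("cl", "scl", "2024"), ("cl", "scl", "2023"),
           ("cl", "val", "2024"), ("pe", "lim", "2024")]
for x in answers:
    for y in answers:
        for z in answers:
            assert prefix_distance(x, z) <= max(prefix_distance(x, y),
                                                prefix_distance(y, z))

def sum_diversity(subset):
    """Sum-of-pairwise-distances diversity function extending d."""
    return sum(prefix_distance(s, t)
               for i, s in enumerate(subset) for t in subset[i + 1:])
```

Tuples that diverge on an early attribute are far apart (distance 1), while tuples differing only in the last attribute are close (distance 0.25 here), which is exactly the sense in which the attribute order measures diversity.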


A Uniform Language to Explain Decision Trees

November 2024 · 23 Reads · 3 Citations

Marcelo Arenas · [...] · Bernardo Subercaseaux

The formal XAI community has studied a plethora of interpretability queries aiming to understand the classifications made by decision trees. However, a more uniform understanding of what questions we can hope to answer about these models, traditionally deemed to be easily interpretable, has remained elusive. In an initial attempt to understand uniform languages for interpretability, Arenas et al. proposed FOIL, a logic for explaining black-box ML models, and showed that it can express a variety of interpretability queries. However, we show that FOIL is limited in two important senses: (i) it is not expressive enough to capture some crucial queries, and (ii) its model-agnostic nature results in a high computational complexity for decision trees. In this paper, we carefully craft two fragments of first-order logic that allow for efficiently interpreting decision trees: Q-DT-FOIL and its optimization variant OPT-DT-FOIL. We show that our proposed logics can express not only a variety of interpretability queries considered by previous literature, but also elegantly allow users to specify different objectives the sought explanations should optimize for. Using finite model-theoretic techniques, we show that the different ingredients of Q-DT-FOIL are necessary for its expressiveness, and yet that queries in Q-DT-FOIL, as well as their optimization versions in OPT-DT-FOIL, can be evaluated with a polynomial number of queries to a SAT solver. Besides our theoretical results, we provide a SAT-based implementation of the evaluation of OPT-DT-FOIL that is performant on industry-size decision trees.
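One of the interpretability queries mentioned above, "is S a sufficient reason for T(x)?", can be evaluated directly on a decision tree by a traversal. This is a sketch of that single query only, with an invented dict-based tree encoding; the paper's Q-DT-FOIL evaluation instead goes through a SAT solver and covers a whole language of queries:

```python
def classify(tree, x):
    """Follow x through the tree (internal nodes are dicts, leaves are labels)."""
    while isinstance(tree, dict):
        tree = tree["hi"] if x[tree["feat"]] == 1 else tree["lo"]
    return tree

def is_sufficient_reason(tree, x, S):
    """Decide 'is S a sufficient reason for T(x)?' by checking that every
    leaf reachable when only the features in S are fixed carries the
    label T(x)."""
    target = classify(tree, x)
    def reachable_labels(t):
        if not isinstance(t, dict):
            return {t}
        if t["feat"] in S:
            return reachable_labels(t["hi"] if x[t["feat"]] == 1 else t["lo"])
        return reachable_labels(t["lo"]) | reachable_labels(t["hi"])
    return reachable_labels(tree) == {target}

# T(x) = x0 AND x1 as a depth-2 tree.
T = {"feat": 0, "lo": 0, "hi": {"feat": 1, "lo": 0, "hi": 1}}
print(is_sufficient_reason(T, [1, 1], {0, 1}))  # True: both features fixed
print(is_sufficient_reason(T, [1, 1], {0}))     # False: free x1 can flip T
```

The traversal visits each node at most once, so this single query is linear in the tree size; the difficulty the paper addresses lies in richer queries and in optimizing over all candidate explanations.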


Towards Tractability of the Diversity of Query Answers: Ultrametrics to the Rescue

August 2024 · 9 Reads

The set of answers to a query may be very large, potentially overwhelming users when presented with the entire set. In such cases, presenting only a small subset of the answers to the user may be preferable. A natural requirement for this subset is that it should be as diverse as possible to reflect the variety of the entire population. To achieve this, the diversity of a subset is measured using a metric that determines how different two solutions are and a diversity function that extends this metric from pairs to sets. In the past, several studies have shown that finding a diverse subset from an explicitly given set is intractable even for simple metrics (like Hamming distance) and simple diversity functions (like summing all pairwise distances). This complexity barrier becomes even more challenging when trying to output a diverse subset from a set that is only implicitly given, such as the query answers of a query and a database. Until now, tractable cases have been found only for restricted problems and particular diversity functions. To overcome these limitations, we focus on the notion of ultrametrics, which have been widely studied and used in many applications. Starting from any ultrametric d and a diversity function δ extending d, we provide sufficient conditions over δ for having polynomial-time algorithms to construct diverse answers. To the best of our knowledge, these conditions are satisfied by all diversity functions considered in the literature. Moreover, we complement these results with lower bounds that show specific cases when these conditions are not satisfied and finding diverse subsets becomes intractable. We conclude by applying these results to the evaluation of conjunctive queries, demonstrating efficient algorithms for finding a diverse subset of solutions for acyclic conjunctive queries when the attribute order is used to measure diversity.



Figure 1. Information on presidency of Chile.
Figure 2. Reified triples representing the duration of a presidency.
Figure 3. RDF* triples with another triple as the subject.
Figure 4. Property graph representing the information about the presidency of Chile.
Figure 5. Wikidata statement group for Michelle Bachelet.

(+10 more figures)

MillenniumDB: An Open-Source Graph Database System

August 2023 · 106 Reads · 12 Citations

Data Intelligence

In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported, thus providing a flexible data management engine for diverse types of knowledge graph. The engine itself is founded on a combination of tried and tested techniques from relational data management, state-of-the-art algorithms for worst-case-optimal joins, as well as graph-specific algorithms for evaluating path queries. In this paper, we present the main design principles underlying MillenniumDB, describing the abstract graph model and query semantics supported, the concrete data model and query syntax implemented, as well as the storage, indexing, query planning and query evaluation techniques used. We evaluate MillenniumDB over real-world data and queries from the Wikidata knowledge graph, where we find that it outperforms other popular persistent graph database engines (including both enterprise and open source alternatives) that support similar query features.


Figure 2: Reified triples representing the duration of a presidency.
Figure 5: Wikidata statement group for Michelle Bachelet
Figure 13: Boxplots showing distributions for BGP runtimes
Table: Summary of runtimes (in seconds) for path queries
MillenniumDB: An Open-Source Graph Database System

June 2023 · 212 Reads · 11 Citations

Data Intelligence

In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported, thus providing a flexible data management engine for diverse types of knowledge graph. The engine itself is founded on a combination of tried and tested techniques from relational data management, state-of-the-art algorithms for worst-case-optimal joins, as well as graph-specific algorithms for evaluating path queries. In this paper, we present the main design principles underlying MillenniumDB, describing the abstract graph model and query semantics supported, the concrete data model and query syntax implemented, as well as the storage, indexing, query planning and query evaluation techniques used. We evaluate MillenniumDB over real-world data and queries from the Wikidata knowledge graph, where we find that it outperforms other popular persistent graph database engines (including both enterprise and open source alternatives) that support similar query features.


Counting the Answers to a Query

November 2022 · 15 Reads · 2 Citations

ACM SIGMOD Record

Counting the answers to a query is a fundamental problem in databases, with several applications in the evaluation, optimization, and visualization of queries. Unfortunately, counting query answers is a #P-hard problem in most cases, so it is unlikely to be solvable in polynomial time. Recently, new results on approximate counting have been developed, specifically by showing that some problems in automata theory admit fully polynomial-time randomized approximation schemes. These results have several implications for the problem of counting the answers to a query; in particular, for graph and conjunctive queries. In this work, we present the main ideas of these approximation results, by using labeled DAGs instead of automata to simplify the presentation. In addition, we review how to apply these results to count query answers in different areas of databases.
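The DAG-based presentation mentioned in the abstract rests on the fact that counting source-to-sink paths in a DAG is easy by dynamic programming over a topological order. The sketch below shows only that generic core; the edge labels and the reduction from query answers to DAG paths are omitted, and the example graph is invented:

```python
from collections import defaultdict

def count_st_paths(edges, source, sink, order):
    """Count source-to-sink paths in a DAG by dynamic programming over a
    topological order: paths(v) = sum of paths(u) over all edges u -> v."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    paths = defaultdict(int)
    paths[source] = 1
    for u in order:           # process vertices in topological order
        for v in succ[u]:
            paths[v] += paths[u]
    return paths[sink]

# Diamond with a shortcut: s -> a, s -> b, a -> t, b -> t, s -> t.
edges = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t"), ("s", "t")]
print(count_st_paths(edges, "s", "t", ["s", "a", "b", "t"]))  # 3
```

When each path spells the label of a distinct query answer, this DP counts answers exactly; the approximation results surveyed in the paper kick in when distinct paths can spell the same string and exact counting becomes #P-hard.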


Figure 3: Example of the tree constructed in the proof of Lemma 2.
Figure 8: Time for computing a minimum sufficient reason (δ = 1) as a function of decision tree size. All datapoints correspond to an average of 10 different instances for decision trees trained to recognize the digit 1 in the MNIST dataset.
Table: Experimental results for the probabilistic encoding over positive instances of tall rectangles (each datapoint is the average of 3 instances): number of leaves, accuracy of the tree, size of smallest explanation, time.
Table: Experimental results for the deterministic encoding over negative instances of digit 1 in MNIST (each datapoint is the average of 10 instances): number of leaves, accuracy of the tree, size of smallest explanation, time.
On Computing Probabilistic Explanations for Decision Trees

June 2022 · 47 Reads · 5 Citations

Formal XAI (explainable AI) is a growing area that focuses on computing explanations with mathematical guarantees for the decisions made by ML models. Inside formal XAI, one of the most studied cases is that of explaining the choices taken by decision trees, as they are traditionally deemed one of the most interpretable classes of models. Recent work has focused on studying the computation of "sufficient reasons", a kind of explanation in which, given a decision tree T and an instance x, one explains the decision T(x) by providing a subset y of the features of x such that for any other instance z compatible with y, it holds that T(z) = T(x), intuitively meaning that the features in y are already enough to fully justify the classification of x by T. It has been argued, however, that sufficient reasons constitute a restrictive notion of explanation, and thus the community has started to study their probabilistic counterpart, in which one requires that the probability of T(z) = T(x) must be at least some value δ ∈ (0, 1], where z is a random instance that is compatible with y. Our paper settles the computational complexity of δ-sufficient-reasons over decision trees, showing that both (1) finding δ-sufficient-reasons that are minimal in size, and (2) finding δ-sufficient-reasons that are minimal inclusion-wise, do not admit polynomial-time algorithms (unless P = NP). This is in stark contrast with the deterministic case (δ = 1), where inclusion-wise minimal sufficient-reasons are easy to compute. By doing this, we answer two open problems originally raised by Izza et al. On the positive side, we identify structural restrictions of decision trees that make the problem tractable, and show how SAT solvers might be able to tackle these problems in practical settings.
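The contrast drawn in the abstract, that checking a candidate explanation is easy while finding a minimal one is hard, can be seen in a short sketch that computes P(T(z) = T(x)) exactly over a decision tree. The dict encoding is invented here, and the sketch assumes binary features with each feature tested at most once per root-to-leaf path, so that every free internal node splits the probability mass in half:

```python
def agreement_probability(tree, x, S):
    """Exact P[T(z) = T(x)] for z agreeing with x on S and uniform over
    {0, 1} on the remaining features (internal nodes are dicts, leaves
    are labels, and no feature repeats along a path)."""
    def classify(t):
        while isinstance(t, dict):
            t = t["hi"] if x[t["feat"]] == 1 else t["lo"]
        return t
    target = classify(tree)
    def mass(t):
        if not isinstance(t, dict):
            return 1.0 if t == target else 0.0
        if t["feat"] in S:  # fixed feature: follow x deterministically
            return mass(t["hi"] if x[t["feat"]] == 1 else t["lo"])
        return 0.5 * mass(t["lo"]) + 0.5 * mass(t["hi"])
    return mass(tree)

# T(x) = x0 AND x1; x = (1, 1); fixing only x0 leaves x1 uniform.
T = {"feat": 0, "lo": 0, "hi": {"feat": 1, "lo": 0, "hi": 1}}
print(agreement_probability(T, [1, 1], {0}))  # 0.5
```

This single traversal is linear in the tree size, so deciding whether a given y is a δ-SR is easy; the paper's hardness results concern searching for small or minimal such y, not verifying one.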


Citations (64)


... Third, it would be interesting to allow for a more declarative way of specifying the probabilistic guarantees or constraints on the explanations. While a recent line of research has studied the design of languages for defining explainability queries with a uniform algorithmic treatment (Arenas et al. 2021; Barceló, Pérez, and Subercaseaux 2020; Arenas et al. 2024), we are not aware of any work on that line that allows for probabilistic terms. ...

Reference:

Probabilistic Explanations for Linear Models
A Uniform Language to Explain Decision Trees
  • Citing Conference Paper
  • November 2024

... Many graph databases provide limited built-in support for global graph analytics, typically offering only basic functions while requiring external libraries for advanced algorithms [59]. They also often lack robust semantic similarity search capabilities [63], although systems like Neo4j have recently added vector indexing. ...

MillenniumDB: A Multi-modal, Multi-model Graph Database
  • Citing Conference Paper
  • June 2024

... In recent work [24] the MillenniumDB DBMS is presented. MillenniumDB is a multimodal, multimodel graph database that supports the popular property graph paradigm, RDF format, and a multilayer graph model that combines and extends both of these models [25], which is an abstraction over popular graph models. In fact, this model extends the edges of a graph by adding a type to each edge. ...

MillenniumDB: An Open-Source Graph Database System

Data Intelligence

... The One-Graph initiative [33] aims to bridge the different graph data models by promoting a unified graph data model for seamless interaction. Similarly, MillenniumDB's Domain Graph model [51] aims at covering RDF, RDF-star, and property graphs. These works seek a common supermodel, aiming to support both RDF and property graphs via more general solutions. ...

MillenniumDB: An Open-Source Graph Database System

Data Intelligence

... with edge labels over Σ and the aim is to support queries, which on input a regular expression , return information about walks in the graph whose labels match the regular 1 Formulating the NFA Acceptance problem as a colored walk problem is not uncommon, see, e.g., [8]. Here we use the Colored Walk problem as an umbrella to express various related problems in a common language. ...

Counting the Answers to a Query
  • Citing Article
  • November 2022

ACM SIGMOD Record

... Contributions. This paper argues that the key issue with SHAP scores is not the use of Shapley values in explainability per se, and shows that the identified shortcomings of SHAP scores can be solely attributed to the characteristic functions used in earlier works (Kononenko 2010, 2014; Lundberg and Lee 2017; Janzing, Minorics, and Blöbaum 2020; Sundararajan and Najmi 2020; Arenas et al. 2021; Van den Broeck et al. 2021, 2022; Arenas et al. 2023). As noted in the recent past (Janzing, Minorics, and Blöbaum 2020; Sundararajan and Najmi 2020), by changing the characteristic function, one is able to produce different sets of SHAP scores. Motivated by these observations, the paper outlines fundamental properties that characteristic functions ought to exhibit in the context of XAI. ...

The Tractability of SHAP-Score-Based Explanations for Classification over Deterministic and Decomposable Boolean Circuits
  • Citing Article
  • May 2021

Proceedings of the AAAI Conference on Artificial Intelligence

... More recently, AXp's have been studied in terms of their computational complexity [6], [7], [9], [19]. There is a growing body of recent work on formal explanations [1], [2], [5], [12], [13], [20], [24], [26], [52], [55], [59], [87]. Formally, given v = (v1, . . . ...

On Computing Probabilistic Explanations for Decision Trees

... The class spanL is contained in totP. It is known that spanL admits a fully polynomial randomized approximation scheme (FPRAS) via counting the number of strings of length n accepted by an NFA [2], [4], [33]. Since this reduction is parsimonious, it preserves the approximation. ...

#NFA Admits an FPRAS: Efficient Enumeration, Counting, and Uniform Generation for Logspace Classes
  • Citing Article
  • December 2021

Journal of the ACM

... Bitcoin [9] is a transaction-based blockchain that stores the history of all the accepted transactions. 3 Ownership is based on the custody of private keys, managed outside the blockchain. Valid transactions move bitcoins from one to another account, represented by addresses. ...

Is it Possible to Verify if a Transaction is Spendable?

Frontiers in Blockchain

... Despite being a validated screening tool for olfactory function in COVID-19 patients, the KOR test is not designed to establish a diagnosis or describe specific olfactory dysfunctions, such as anosmia, hyposmia, or microsmia. 25 Replication of our findings within a cohort using objective clinical assessments can provide additional validation and enhance the robustness of the results. 55 Another limitation of the study is that we could not access laboratory tests to determine the specific types of SARS-CoV-2 variants. ...

Screening of COVID-19 cases through a Bayesian network symptoms model and psychophysical olfactory test

iScience