Padhraic Smyth’s research while affiliated with University of California System and other places


Publications (337)


Semantic Probabilistic Control of Language Models
  • Preprint

May 2025 · 1 Read
Kareem Ahmed · Catarina G Belem · Padhraic Smyth · Sameer Singh

Semantic control entails steering LM generations towards satisfying subtle non-lexical constraints, e.g., toxicity, sentiment, or politeness, attributes that can be captured by a sequence-level verifier. It can thus be viewed as sampling from the LM distribution conditioned on the target attribute, a computationally intractable problem due to the non-decomposable nature of the verifier. Existing approaches to LM control either deal only with syntactic constraints, which cannot capture the aforementioned attributes, or rely on sampling to explore the conditional LM distribution, an ineffective estimator for low-probability events. In this work, we leverage a verifier's gradient information to efficiently reason over all generations that satisfy the target attribute, enabling precise steering of LM generations by reweighting the next-token distribution. Starting from an initial sample, we create a local LM distribution favoring semantically similar sentences. This approximation enables the tractable computation of an expected sentence embedding. We use this expected embedding, informed by the verifier's evaluation at the initial sample, to estimate the probability of satisfying the constraint, which directly informs the update to the next-token distribution. We evaluate the effectiveness of our approach in controlling the toxicity, sentiment, and topic-adherence of LMs, yielding generations that satisfy the constraint with high probability (>95%) without degrading their quality.
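A minimal sketch of the general reweighting idea (not the paper's algorithm; here the constraint-satisfaction probabilities are assumed to be given, whereas the paper estimates them from a verifier's gradient and an expected sentence embedding):

```python
import torch

def reweight_next_token(logits, sat_prob):
    """Toy sketch of semantic reweighting: scale the LM's next-token
    probabilities by an estimated probability that continuing with each
    candidate token yields a constraint-satisfying generation, then
    renormalize. `sat_prob` stands in for the verifier-informed estimate
    described in the abstract (here it is just a given tensor)."""
    probs = torch.softmax(logits, dim=-1)   # base LM distribution p(x_t | x_<t)
    reweighted = probs * sat_prob           # p(x_t | x_<t) * P(constraint | x_<t, x_t)
    return reweighted / reweighted.sum()    # renormalized conditional distribution
```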


IEEE_TGRS_Guilloteau_DifERS_supplemental_material.pdf
  • Data
  • File available

March 2025 · 10 Reads
Gavin Kerrigan · Kai Nelson · [...]


Overview of the evaluation methodology for assessing the calibration gap between model confidence and human confidence in the model
For multiple-choice questions, the approach works as follows: (1) prompt the LLM with a question to obtain the model’s internal confidence for each answer choice; (2) select the most likely answer and prompt the model a second time to generate an explanation for the given answer; (3) obtain the human confidence by showing users the question and the LLM's explanation and asking users to indicate the probability that the model is correct. In this toy example, the model confidence for the multiple-choice question is 0.46 for answer C, whereas the human confidence is 0.95. For short-answer questions, the approach is similar except that internal model confidence is obtained by an additional step where the LLM is prompted to evaluate whether the previously provided answer to the question is true or false [13]. In the short-answer question example, the LLM model explanation was modified with uncertainty language to convey the low model confidence (0.18). For the two toy examples, the correct answers are ‘A’ and ‘blue bird’.
Calibration error and discrimination for model confidence and human confidence across the behavioural experiments and LLMs
Calibration error is assessed by ECE (lower is better), while discrimination is assessed by AUC (higher is better). The vertical dashed lines represent the calibration and discrimination gap between model confidence and human confidence for unmodified explanations (experiments 1a, 1b and 1c). For human confidence, the data points represent the AUC values computed separately for each participant (n shown in figure) and the error bars represent the 95% confidence interval of the mean across participants. Because of data sparsity, the ECE values were computed at the group level.
Calibration diagrams for model confidence and human confidence across experiments 1 and 2
The top and middle rows show results for multiple-choice questions with the GPT-3.5 and PaLM2 models, respectively. The bottom row shows results for short-answer questions with the GPT-4o model. The histograms at the bottom of each plot show the proportion of observations in each confidence bin (values are scaled by 30% for visual clarity). The shaded regions represent the 95% confidence interval of the mean computed across participants and questions.
Mean human confidence across LLM explanation styles varying in uncertainty language and length
Data are presented as mean values of participant confidence in experiments 2a (n = 60), 2b (n = 60) and 2c (n = 59). For reference, the dashed lines show the average human confidence for the baseline explanations in experiments 1a, 1b, and 1c. The error bars represent the 95% confidence interval of the mean across participants.
What large language models know and what people think they know

January 2025 · 107 Reads · 26 Citations
Nature Machine Intelligence

As artificial intelligence systems, particularly large language models (LLMs), become increasingly integrated into decision-making processes, the ability to trust their outputs is crucial. To earn human trust, LLMs must be well calibrated such that they can accurately assess and communicate the likelihood of their predictions being correct. Whereas recent work has focused on LLMs’ internal confidence, less is understood about how effectively they convey uncertainty to users. Here we explore the calibration gap, which refers to the difference between human confidence in LLM-generated answers and the models’ actual confidence, and the discrimination gap, which reflects how well humans and models can distinguish between correct and incorrect answers. Our experiments with multiple-choice and short-answer questions reveal that users tend to overestimate the accuracy of LLM responses when provided with default explanations. Moreover, longer explanations increased user confidence, even when the extra length did not improve answer accuracy. By adjusting LLM explanations to better reflect the models’ internal confidence, both the calibration gap and the discrimination gap narrowed, significantly improving user perception of LLM accuracy. These findings underscore the importance of accurate uncertainty communication and highlight the effect of explanation length in influencing user trust in artificial-intelligence-assisted decision-making environments.
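For reference, the expected calibration error (ECE) reported in the figures above has a standard binned form; a minimal sketch, with binning details that may differ from the paper's:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: partition predictions by confidence and
    average |accuracy - mean confidence| over bins, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - confidence| in this bin, weighted by bin frequency
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# e.g., per-question model confidences vs. whether each answer was correct
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 0]))
```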


A Generative Diffusion Model for Probabilistic Ensembles of Precipitation Maps Conditioned on Multisensor Satellite Observations

January 2025 · 55 Reads · 1 Citation
IEEE Transactions on Geoscience and Remote Sensing

A generative diffusion model is used to produce probabilistic ensembles of precipitation intensity maps at 1-hour, 5-km resolution. The generation is conditioned on infrared and microwave radiometric measurements from the GOES and DMSP satellites and is trained with merged ground radar and gauge data over the southeastern United States. The generated precipitation maps reproduce the spatial autocovariance and other multiscale statistical properties of the gauge-radar reference fields on average. Conditioning the generation on the satellite measurements allows us to constrain the magnitude and location of each generated precipitation feature. The mean of the 128-member ensemble shows high spatial coherence with the reference fields, with 0.82 linear correlation between the two. On average, the coherence between any two ensemble members is approximately the same as the coherence between any ensemble member and the ground reference, indicating that the ensemble dispersion is a proper measure of the estimation uncertainty. From the generated ensembles we can easily derive the probability of the precipitation intensity exceeding any given threshold, at the 5-km resolution of the generation or at any desired aggregated resolution.
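The exceedance probabilities mentioned in the last sentence follow directly from the ensemble as member frequencies; a minimal sketch under assumed array shapes (not code from the paper):

```python
import numpy as np

def exceedance_probability(ensemble, threshold):
    """Per-pixel probability that precipitation exceeds `threshold`,
    estimated as the fraction of ensemble members above it.
    `ensemble`: array of shape (n_members, height, width), e.g. in mm/h."""
    return (ensemble > threshold).mean(axis=0)

# e.g., P(intensity > 5 mm/h) from a 128-member ensemble of 5-km maps:
# probs = exceedance_probability(generated_maps, threshold=5.0)
```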


Deep Linear Hawkes Processes

December 2024 · 2 Reads

Marked temporal point processes (MTPPs) are used to model sequences of different types of events with irregular arrival times, with broad applications ranging from healthcare and social networks to finance. We address shortcomings in existing point process models by drawing connections between modern deep state-space models (SSMs) and linear Hawkes processes (LHPs), culminating in an MTPP that we call the deep linear Hawkes process (DLHP). The DLHP modifies the linear differential equations in deep SSMs to be stochastic jump differential equations, akin to LHPs. After discretizing, the resulting recurrence can be implemented efficiently using a parallel scan. This brings parallelism and linear scaling to MTPP models. This contrasts with attention-based MTPPs, which scale quadratically, and RNN-based MTPPs, which do not parallelize across the sequence length. We show empirically that DLHPs match or outperform existing models across a broad range of metrics on eight real-world datasets. Our proposed DLHP model is the first instance of the unique architectural capabilities of SSMs being leveraged to construct a new class of MTPP models.
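The parallel-scan claim rests on the associativity of linear recurrences; a minimal JAX sketch of that generic mechanism (diagonal/scalar case, not the DLHP architecture itself):

```python
import jax
import jax.numpy as jnp

def linear_recurrence_scan(a, b):
    """Compute h_t = a_t * h_{t-1} + b_t (with h_0 = 0) for all t in
    O(log T) depth via an associative scan -- the mechanism that makes
    deep SSM-style recurrences parallel over the sequence length."""
    def combine(left, right):
        a_l, b_l = left
        a_r, b_r = right
        # composition of the affine maps h -> a_l*h + b_l and h -> a_r*h + b_r
        return a_r * a_l, a_r * b_l + b_r
    _, h = jax.lax.associative_scan(combine, (a, b))
    return h

a = jnp.array([0.9, 0.8, 0.7, 0.6])
b = jnp.array([1.0, 2.0, 3.0, 4.0])
print(linear_recurrence_scan(a, b))  # matches the sequential recurrence: [1.0, 2.8, 4.96, 6.976]
```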


Figure 4: The Iris dataset from the UCI ML Repository is widely used for evaluating clustering and classification algorithms [115]. Each observation corresponds to an iris flower, including sepal and petal measurements and its specific species (out of three classes). After years of use, it was discovered that there were multiple different widely-publicized versions of this dataset, with differing measurements for certain observations. Consequently, the reported performances of classification models on Iris (across a large number of published papers) are not necessarily comparable [116].
Figure 6: The BUPA Liver Disorders dataset is a popular classification benchmark from the UCI ML Repository [120]. Each row contains information on an individual's consumption of alcoholic drinks and their results on several blood tests targeting alcohol-related liver issues; the intended task is to predict alcohol consumption based on these test results. The last column of the dataset is an indicator, added by the dataset creators, intended to split the rows into training and test sets; however, the data documentation did not clearly explain the meaning of each column. It was subsequently found that many highly-cited papers using this dataset had mistakenly treated this last column as the class label, producing "meaningless results" [121].
Benchmark Data Repositories for Better Benchmarking

October 2024 · 37 Reads

In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for, and levies criticisms at, data and benchmarking practices in machine learning, comparatively less attention has been paid to the data repositories where these datasets are stored, documented, and shared. In this paper, we analyze the landscape of these benchmark data repositories and the role they can play in improving benchmarking. This role includes addressing issues with both datasets themselves (e.g., representational harms, construct validity) and the manner in which evaluation is carried out using such datasets (e.g., overemphasis on a few datasets and metrics, lack of reproducibility). To this end, we identify and discuss a set of considerations surrounding the design and use of benchmark data repositories, with a focus on improving benchmarking practices in machine learning.


Figure 1: Variational inference with SVGD-derived particles [Liu and Wang, 2016] versus with an SMI-derived probability density, formulated as a mixture model (this work). Left: SVGD uses m particles θ_ℓ to approximate the posterior p(θ|D). Right: SMI uses a mixture model (with uniform weights) of m guides q(θ|ψ_ℓ), parameterized by particles ψ_ℓ, to approximate p(θ|D). As a result, SMI approximates a Bayesian posterior with a richer model that alleviates variance collapse in higher-dimensional posteriors.
Figure 3: Top row: High-density interval (HDI) for the low-dimensional model inferred using SMI, SVGD, ASVGD and OVI on the 1D wave dataset (dotted line). SVGD, ASVGD, and SMI use five particles. The posteriors are inferred with data drawn from the In region, highlighted with vertical lines. Bottom row: HDI for the moderate-dimensional model. ASVGD and SVGD display collapse by a significant narrowing in HDI between the In regions when comparing the low to moderate dimensions. On the other hand, both OVI and SMI widen the HDI with the richer model. In contrast to SMI, OVI overestimates the variance in the In region for the mid-sized model.
Figure 4: Top row: The samples were drawn from the data-generating process for evaluating Between, In and Entire regions, respectively. The In region used for inferring the BNNs is highlighted in grey. Bottom row: The samples drawn from the data-generating process to infer BNN posteriors.
Figure 5: (a) High-density interval (HDI) for the low-dimensional model inferred using SMI, SVGD, ASVGD and OVI on the 1D wave dataset (dotted line). SVGD, ASVGD, and SMI use five particles. The posteriors are inferred with data drawn from the In region, highlighted with vertical lines. (b) HDI for the moderate-dimensional model. ASVGD and SVGD display collapse by a significant narrowing in HDI between the In regions when comparing the low to moderate dimensions. In low-dimensional models, initialization plays a role in narrowing or widening HDI for all methods. In mid-sized models, SMI is robust to initialization.
[Table: standard UCI datasets, listing train size, test size, number of features, standard splits, and Gap10 splits]
ELBOing Stein: Variational Bayes with Stein Mixture Inference

October 2024 · 25 Reads

Stein variational gradient descent (SVGD) [Liu and Wang, 2016] performs approximate Bayesian inference by representing the posterior with a set of particles. However, SVGD suffers from variance collapse, i.e. poor predictions due to underestimating uncertainty [Ba et al., 2021], even for moderately-dimensional models such as small Bayesian neural networks (BNNs). To address this issue, we generalize SVGD by letting each particle parameterize a component distribution in a mixture model. Our method, Stein Mixture Inference (SMI), optimizes a lower bound to the evidence (ELBO) and introduces user-specified guides parameterized by particles. SMI extends the Nonlinear SVGD framework [Wang and Liu, 2019] to the case of variational Bayes. SMI effectively avoids variance collapse, judging by a previously described test developed for this purpose, and performs well on standard data sets. In addition, SMI requires considerably fewer particles than SVGD to accurately estimate uncertainty for small BNNs. The synergistic combination of NSVGD, ELBO optimization and user-specified guides establishes a promising approach towards variational Bayesian inference in the case of tall and wide data.
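For context, a minimal sketch of the SVGD update that SMI generalizes, using a fixed-bandwidth RBF kernel (practical implementations typically use a median heuristic; this is not the SMI algorithm):

```python
import numpy as np

def svgd_step(particles, grad_logp, step_size=0.1, bandwidth=1.0):
    """One SVGD update [Liu and Wang, 2016]: each particle follows a
    kernel-weighted average of the particles' scores (attraction) plus a
    kernel-gradient term (repulsion) that keeps the particles diverse.
    particles: (m, d) array; grad_logp: maps (m, d) -> (m, d) scores."""
    m = particles.shape[0]
    diffs = particles[:, None, :] - particles[None, :, :]       # diffs[j, i] = x_j - x_i
    k = np.exp(-(diffs ** 2).sum(-1) / (2 * bandwidth ** 2))    # RBF kernel, (m, m)
    attraction = k[..., None] * grad_logp(particles)[:, None, :]  # k(x_j, x_i) * score(x_j)
    repulsion = -diffs * (k / bandwidth ** 2)[..., None]        # grad_{x_j} k(x_j, x_i)
    phi = (attraction + repulsion).sum(axis=0) / m              # average over j
    return particles + step_size * phi
```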


EventFlow: Forecasting Continuous-Time Event Data with Flow Matching

October 2024 · 20 Reads

Continuous-time event sequences, in which events occur at irregular intervals, are ubiquitous across a wide range of industrial and scientific domains. The contemporary modeling paradigm is to treat such data as realizations of a temporal point process, and in machine learning it is common to model temporal point processes in an autoregressive fashion using a neural network. While autoregressive models are successful in predicting the time of a single subsequent event, their performance can be unsatisfactory in forecasting longer horizons due to cascading errors. We propose EventFlow, a non-autoregressive generative model for temporal point processes. Our model builds on the flow matching framework in order to directly learn joint distributions over event times, side-stepping the autoregressive process. EventFlow is likelihood-free, easy to implement and sample from, and either matches or surpasses the performance of state-of-the-art models in both unconditional and conditional generation tasks on a set of standard benchmarks.
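The underlying flow matching objective is simple to state; a minimal linear-path sketch (EventFlow's event-sequence specifics, such as handling variable event counts, are not shown):

```python
import torch

def flow_matching_loss(model, x0, x1):
    """Minimal (linear-path) flow matching loss: regress the model's
    velocity field at a point interpolated between noise and data onto
    the constant target velocity of the straight-line path.
    x0: noise samples, x1: data samples, both of shape (batch, n)."""
    t = torch.rand(x0.shape[0], 1)      # random time in [0, 1] per sample
    xt = (1 - t) * x0 + t * x1          # point on the straight-line path
    v_target = x1 - x0                  # velocity of the linear path
    v_pred = model(xt, t)               # model predicts velocity at (xt, t)
    return ((v_pred - v_target) ** 2).mean()
```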


A Generative Diffusion Model for Probabilistic Ensembles of Precipitation Maps Conditioned on Multisensor Satellite Observations

September 2024 · 106 Reads

A generative diffusion model is used to produce probabilistic ensembles of precipitation intensity maps at 1-hour, 5-km resolution. The generation is conditioned on infrared and microwave radiometric measurements from the GOES and DMSP satellites and is trained with merged ground radar and gauge data over the southeastern United States. The generated precipitation maps reproduce the spatial autocovariance and other multiscale statistical properties of the gauge-radar reference fields on average. Conditioning the generation on the satellite measurements allows us to constrain the magnitude and location of each generated precipitation feature. The mean of the 128-member ensemble shows high spatial coherence with the reference fields, with 0.82 linear correlation between the two. On average, the coherence between any two ensemble members is approximately the same as the coherence between any ensemble member and the ground reference, indicating that the ensemble dispersion is a proper measure of the estimation uncertainty. From the generated ensembles we can easily derive the probability of the precipitation intensity exceeding any given threshold, at the 5-km resolution of the generation or at any desired aggregated resolution.


Perceptions of Linguistic Uncertainty by Language Models and Humans

July 2024 · 10 Reads · 1 Citation

Uncertainty expressions such as "probably" or "highly unlikely" are pervasive in human language. While prior work has established that there is population-level agreement in terms of how humans interpret these expressions, there has been little inquiry into the abilities of language models to interpret such expressions. In this paper, we investigate how language models map linguistic expressions of uncertainty to numerical responses. Our approach assesses whether language models can employ theory of mind in this setting: understanding the uncertainty of another agent about a particular statement, independently of the model's own certainty about that statement. We evaluate both humans and 10 popular language models on a task created to assess these abilities. Unexpectedly, we find that 8 out of 10 models are able to map uncertainty expressions to probabilistic responses in a human-like manner. However, we observe systematically different behavior depending on whether a statement is actually true or false. This sensitivity indicates that language models are substantially more susceptible to bias based on their prior knowledge (as compared to humans). These findings raise important questions and have broad implications for human-AI alignment and AI-AI communication.
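The core task can be pictured as eliciting a numeric probability for each uncertainty expression and comparing it to human population-level norms; a toy sketch with entirely hypothetical numbers:

```python
import numpy as np

# Hypothetical illustration: model-elicited probabilities for uncertainty
# expressions vs. assumed human population-level norms (made-up values).
human_norms = {"highly unlikely": 0.05, "probably": 0.70, "almost certain": 0.95}
model_probs = {"highly unlikely": 0.10, "probably": 0.65, "almost certain": 0.97}

# Mean absolute deviation of the model's mapping from the human norms
mae = np.mean([abs(model_probs[w] - human_norms[w]) for w in human_norms])
print(f"mean absolute deviation from human norms: {mae:.3f}")
```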


Citations (61)


... A first example is downscaling (Mardani et al., 2024a; Srivastava et al., 2023; Watt & Mansfield, 2024; Tomasi et al., 2024; Han et al., 2024; Lopez-Gomez et al., 2024). Another natural application of diffusion models in atmospheric sciences is precipitation nowcasting, due to the high stochasticity of the field (Yu et al., 2024; Zhao et al., 2024; Gong et al., 2024; Addison et al., 2024; Guilloteau et al., 2024). Other fields that have benefited from diffusion-based approaches are tropical cyclone forecasting (Nath et al., 2023; Huang et al., 2024a) and sea ice modeling (Finn et al., 2024). ...

Reference:

ArchesWeather & ArchesWeatherGen: a deterministic and generative model for efficient ML weather forecasting
A Generative Diffusion Model for Probabilistic Ensembles of Precipitation Maps Conditioned on Multisensor Satellite Observations

IEEE Transactions on Geoscience and Remote Sensing

... In addition, people form stronger beliefs from those who are confident and appear knowledgeable [26]. AI is exceedingly confident and, unlike people, rarely indicates anything less than 100% confidence in its output; for example, it does not use hedging language [48]. The persuasiveness of AI output is likely to increase as the text becomes even more fluent and accurate. ...

What large language models know and what people think they know

Nature Machine Intelligence

... Eisape et al. (2024) compared LLM syllogistic reasoning with human behavior results (Ragni et al., 2019) under the framework of the Mental Models Theory (Johnson-Laird, 1983). Lampinen et al. (2024) found similar content effects in human and LLM reasoning, supporting the need to control for common-sense knowledge in benchmarks (§3.2). Belem et al. (2024) studied human and LLM perception of uncertainty at a lexical level. Compared to them, we focus on the modal logic reasoning process and contribute new behavioral data. ...

Perceptions of Linguistic Uncertainty by Language Models and Humans
  • Citing Preprint
  • July 2024

... Satellite detection of fire activity will undoubtedly play an important role in efforts to produce accurate sub-daily fire records (see Liu et al., 2024, for an example spanning 2019-2021). Creation, verification, and advancement of such records for both operational products (e.g., Marsha and Larkin, 2022; French et al., 2014) and retrospective study (e.g., Freeborn et al., 2022; Mahood et al., 2022; Giglio et al., 2010) should continue. ...

Systematically tracking the hourly progression of large wildfires using GOES satellite observations

... Robles et al. [15] introduce a neural network for cell-level cancer detection, processing high-resolution cellular images to classify abnormalities. The use of image-focused preprocessing enhances the clarity of input data. ...

A cell-level discriminative neural network model for diagnosis of blood cancers

Bioinformatics

... Beyond the above-mentioned projects, Western North America has been the focus of several recent publications dealing with changes in precipitation characteristics under global warming [e.g., Liu et al. 2017, Huang et al. 2020, Gensini et al. 2023, Rahimi et al. 2024]. With the frequent occurrence of atmospheric rivers during the cold season [Rutz et al. 2014, Gershunov et al. 2019], Western North America is particularly exposed to hazards induced by intense precipitation over short durations (a few hours), such as flash floods, landslides and debris flows [Ralph and Dettinger 2011, Cordeira et al. 2019, Guilinger et al. 2023]. The typical cold-season atmospheric rivers occurring in Western North America can be classified as mesoscale convective systems or mesoscale convective complexes with pronounced spatial anisotropy. ...

Predicting Postfire Sediment Yields of Small Steep Catchments Using Airborne Lidar Differencing

... In particular, Bayesian treatments of theory of mind (ToM) have offered strong accounts of how we may rapidly reason about each other's beliefs, desires, goals, and intentions [33, 147, 183–185]. We may build mental models [186, 187] of our thought partners, which can in turn be used to support communication and collaboration, informing the way we teach [188–190], infer whether to rely on a partner for help [191], and support rapid, flexible adaptation to new conversation partners [192, 193]. We call particular attention to the Rational Speech Act (RSA) framework [59, 150], which models communicative partners as recursively reasoning about each other's minds to inform what to say (from the perspective of the speaker) and how to interpret a received utterance (as the listener). ...

Differentiating Mental Models of Self and Others: A Hierarchical Framework for Knowledge Assessment

... The results of this study are consistent with the literature, where it is stated that precipitation predictions are more challenging than the predictions of other weather variables, such as temperature, due to its greater spatial and temporal variability and its reliance on numerous small-scale atmospheric processes (Johnson et al., 2019; Calì Quaglia et al., 2021; Le et al., 2023). Ji et al. (2016) reported an underestimation of precipitation by WRF forecasts in winter-dominant precipitation zones of southeast Australia. ...

Climate-driven changes in the predictability of seasonal precipitation

... For instance, a model may be both over-confident for high confidence scores and under-confident for low scores (Calster et al., 2019). Alternatively, a model may be globally well-calibrated but miscalibrated with respect to a variable of interest (Kelly & Smyth, 2023). There are various reasons why a model may be poorly calibrated, including the architecture, the distribution of the training set and overfitting (Guo et al., 2017;Minderer et al., 2021;Mukhoti et al., 2020). ...

Variable-Based Calibration for Machine Learning Classifiers

Proceedings of the AAAI Conference on Artificial Intelligence