Andreas Bulling, PhD
Max Planck Institute for Informatics
About
247 Publications · 83,513 Reads · 14,075 Citations
Publications (247)
Human hand and head movements are the most pervasive input modalities in extended reality (XR) and are significant for a wide range of applications. However, prior works on hand and head modelling in XR only explored a single modality or focused on specific applications. We present HaHeAE - a novel self-supervised method for learning generalisable...
We propose MToMnet – a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input. ToM is key for effective nonverbal human communication and collaboration, yet existing methods for belief modelling have not included explicit ToM modelling or have typically been limited to on...
Recent work has highlighted the potential of modelling interactive behaviour analogously to natural language. We propose interactive behaviour summarisation as a novel computational task and demonstrate its usefulness for automatically uncovering latent user intentions while interacting with graphical user interfaces. To tackle this task, we introd...
High-frequency components in eye gaze data contain user-specific information promising for various applications, but existing gaze modelling methods focus on low frequencies of typically not more than 30 Hz. We present DiffEyeSyn -- the first computational method to synthesise high-frequency gaze data, including eye movement characteristics specifi...
Estimating the momentary level of a participant's engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e. training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and...
Multilevel models (MLMs) are a central building block of the Bayesian workflow. They enable joint, interpretable modeling of data across hierarchical levels and provide a fully probabilistic quantification of uncertainty. Despite their well-recognized advantages, MLMs pose significant computational challenges, often rendering their estimation and e...
Human motion prediction is important for many virtual and augmented reality (VR/AR) applications such as collision avoidance and realistic avatar generation. Existing methods have synthesised body motion only from observed past motion, despite the fact that human eye gaze is known to correlate strongly with body movements and is readily available i...
Interruptions are often pervasive and require attentional shifts from the primary task. Limited data are available on the factors influencing individuals' efficiency in resuming from interruptions during digital reading. The reported investigation – conducted using the InteRead dataset – examined whether individual differences in visuo-spatial workin...
We present GazeMotion - a novel method for human motion forecasting that combines information on past human poses with human eye gaze. Inspired by evidence from behavioural sciences showing that human eye and body movements are closely coordinated, GazeMotion first predicts future eye gaze from past gaze, then fuses predicted future gaze and past pos...
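Since this snippet describes a concrete two-stage pipeline (gaze forecasting, then gaze-conditioned pose forecasting), a minimal sketch of that pattern may help; module choices, sizes, and names below are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GazeForecaster(nn.Module):
    """Stage 1: predict future gaze from past gaze (assumed GRU backbone)."""
    def __init__(self, hidden=64, horizon=10):
        super().__init__()
        self.rnn = nn.GRU(2, hidden, batch_first=True)   # gaze as (yaw, pitch)
        self.head = nn.Linear(hidden, horizon * 2)
        self.horizon = horizon

    def forward(self, past_gaze):                        # (B, T, 2)
        _, h = self.rnn(past_gaze)
        return self.head(h[-1]).view(-1, self.horizon, 2)

class GazeConditionedPoseForecaster(nn.Module):
    """Stage 2: fuse predicted future gaze with past poses to forecast poses."""
    def __init__(self, pose_dim=63, hidden=128, horizon=10):
        super().__init__()
        self.pose_enc = nn.GRU(pose_dim, hidden, batch_first=True)
        self.gaze_enc = nn.GRU(2, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, horizon * pose_dim)
        self.horizon, self.pose_dim = horizon, pose_dim

    def forward(self, past_pose, future_gaze):           # (B, T, 63), (B, H, 2)
        _, hp = self.pose_enc(past_pose)
        _, hg = self.gaze_enc(future_gaze)
        fused = torch.cat([hp[-1], hg[-1]], dim=-1)      # late fusion of both streams
        return self.head(fused).view(-1, self.horizon, self.pose_dim)

gaze_net, pose_net = GazeForecaster(), GazeConditionedPoseForecaster()
future_gaze = gaze_net(torch.randn(4, 30, 2))                 # stage 1
future_pose = pose_net(torch.randn(4, 30, 63), future_gaze)   # stage 2
```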
We present HOIMotion - a novel approach for human motion forecasting during human-object interactions that integrates information about past body poses and egocentric 3D object bounding boxes. Human motion forecasting is important in many augmented reality applications but most existing methods have only used past body poses to predict future motion....
We present MST-MIXER - a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major respects: (1) They either track only one modality (mostly the visual input) or (2) they target synthetic datasets that do not reflect the complexity of r...
While numerous works have assessed the generative performance of language models (LMs) on tasks requiring Theory of Mind reasoning, research into the models' internal representation of mental states remains limited. Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others. However, these claims are accompa...
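The probing methodology referred to here typically means fitting a small classifier on frozen hidden states. A minimal sketch with synthetic stand-ins for LM activations (the dimension and the toy belief labels are assumptions):

```python
import torch
import torch.nn as nn

# Frozen "LM hidden states" and a synthetic belief label, for illustration only.
hidden = torch.randn(500, 768)                  # assumed hidden-state dimension
beliefs = (hidden[:, 0] > 0).long()             # toy stand-in for a belief label

# Linear probe: if it classifies well, the state is linearly decodable.
probe = nn.Linear(768, 2)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(probe(hidden), beliefs)
    loss.backward()
    opt.step()
print((probe(hidden).argmax(1) == beliefs).float().mean())  # probe accuracy
```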
We introduce the Overcooked Generalisation Challenge (OGC) - the first benchmark to study agents' zero-shot cooperation abilities when faced with novel partners and levels in the Overcooked-AI environment. This perspective contrasts starkly with a large body of previous work that has trained and evaluated cooperating agents only on the same level, faili...
Human eye gaze plays a significant role in many virtual and augmented reality (VR/AR) applications, such as gaze-contingent rendering, gaze-based interaction, or eye-based activity recognition. However, prior works on gaze analysis and prediction have only explored eye-head coordination and were limited to human-object interactions. We first report...
Question answering has recently been proposed as a promising means to assess the recallability of information visualisations. However, prior works are yet to study the link between visually encoding a visualisation in memory and recall performance. To fill this gap, we propose VisRecall++ -- a novel 40-participant recallability dataset that contain...
Latest gaze estimation methods require large-scale training data but their collection and exchange pose significant privacy risks. We propose PrivatEyes - the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze esti...
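The general recipe named here (federated learning, with client updates protected so the server only ever sees an aggregate) can be sketched as follows; the toy linear model and additive masking below are illustrative assumptions, not the PrivatEyes protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, data):
    X, y = data                                   # toy linear "gaze estimator"
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - 0.1 * grad

def masked(update, n_clients):
    # Split an update into additive shares: each share looks random on its
    # own, but they sum to the true update (the MPC intuition).
    shares = [rng.normal(size=update.shape) for _ in range(n_clients - 1)]
    shares.append(update - sum(shares))
    return shares

clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
w = np.zeros(3)
for _ in range(20):                               # federated rounds
    deltas = [local_update(w, d) - w for d in clients]
    all_shares = [s for u in deltas for s in masked(u, len(clients))]
    w = w + sum(all_shares) / len(clients)        # server sees only the aggregate
```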
Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and correspon...
Recent work on dialogue-based collaborative plan acquisition (CPA) has suggested that Theory of Mind (ToM) modelling can improve missing knowledge prediction in settings with asymmetric skill-sets and knowledge. Although ToM was claimed to be important for effective collaboration, its real impact on this novel task remains under-explored. By repres...
Eye movements during reading offer a window into cognitive processes and language comprehension, but the scarcity of reading data with interruptions – which learners frequently encounter in their everyday learning environments – hampers advances in the development of intelligent learning technologies. We introduce InteRead – a novel 50-participant...
We propose the Intuitive Reasoning Network (IRENE) - a novel neural model for intuitive psychological reasoning about agents' goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task conte...
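A minimal sketch of the architectural pattern this snippet names (a graph network over agent/world entities feeding a transformer); all layer choices and sizes are assumptions:

```python
import torch
import torch.nn as nn

class SimpleGNN(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)

    def forward(self, nodes, adj):                 # (B, N, D), (B, N, N)
        # One round of mean message passing over the adjacency matrix.
        agg = adj @ nodes / adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.msg(torch.cat([nodes, agg], -1)))

class GoalReasoner(nn.Module):
    def __init__(self, dim=32, n_actions=5):
        super().__init__()
        self.gnn = SimpleGNN(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.ctx = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, nodes, adj):
        h = self.ctx(self.gnn(nodes, adj))         # encode task context over entities
        return self.head(h.mean(1))                # predict the agent's next action

model = GoalReasoner()
logits = model(torch.randn(2, 6, 32), torch.ones(2, 6, 6))
```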
Analysing and modelling interactive behaviour is an important topic in human-computer interaction (HCI) and a key requirement for the development of intelligent interactive systems. Interactive behaviour has a sequential (actions happen one after another) and hierarchical (a sequence of actions forms an activity driven by interaction goals) structu...
Predicting the next action that a human is most likely to perform is key to human-AI collaboration and has consequently attracted increasing research interest in recent years. An important factor for next action prediction is human intention: if the AI agent knows the intention, it can predict future actions and plan collaboration more effectivel...
Automatic analysis of human behaviour is a fundamental prerequisite for the creation of machines that can effectively interact with and support humans in social interactions. In MultiMediate'23, we address two key human social behaviour analysis tasks for the first time in a controlled challenge: engagement estimation and bodily behaviour recognit...
While deep reinforcement learning (RL) agents outperform humans on an increasing number of tasks, training them requires data equivalent to decades of human gameplay. Recent hierarchical RL methods have increased sample efficiency by incorporating information inherent to the structure of the decision problem but at the cost of having to discover or...
Lifelogging is traditionally used for memory augmentation. However, recent research shows that users’ trust in the completeness and accuracy of lifelogs might skew their memories. Privacy-protection alterations such as body blurring and content deletion are commonly applied to photos to circumvent capturing sensitive information. However, their imp...
What do you have to keep in mind when developing or using eye-tracking technologies regarding privacy? In this article we discuss the main ethical, technical, and legal categories of privacy, which is much more than just data protection. We additionally provide recommendations about how such technologies might mitigate privacy risks and in which ca...
We propose Unified Model of Saliency and Scanpaths (UMSS) - a model that learns to predict multi-duration saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during the visual exploration process, prior work has bee...
Gaze estimation methods have significantly matured in recent years, but the large number of eye images required to train deep learning models poses significant privacy risks. In addition, the heterogeneous data distribution across different users can significantly hinder the training process. In this work, we propose the first federated learning ap...
Backchannels, i.e. short interjections of the listener, serve important meta-conversational purposes like signifying attention or indicating agreement. Despite their key role, automatic analysis of backchannels in group interactions has been largely neglected so far. The MultiMediate challenge addresses, for the first time, the tasks of backchannel...
Adaptive visualization and interfaces pervade our everyday tasks to improve interaction from the point of view of user performance and experience. This approach allows using several user inputs, whether physiological, behavioral, qualitative, or multimodal combinations, to enhance the interaction. Due to the multitude of approaches, we outline th...
Digital eye strain (DES), caused by prolonged exposure to digital screens, stresses the visual system and negatively affects users’ well-being and productivity. While DES is well-studied in computer displays, its impact on users of virtual reality (VR) head-mounted displays (HMDs) is largely unexplored, even though some of their key properties (e.g...
We propose Neuro-Symbolic Visual Dialog (NSVD) - the first method to combine deep learning and symbolic program execution for multi-round visually-grounded reasoning. NSVD significantly outperforms existing purely-connectionist methods on two key challenges inherent to visual dialog: long-distance co-reference resolution as well as vanishing questio...
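The neuro-symbolic split described here (a neural parser proposes a symbolic program, a deterministic executor runs it over the scene) can be illustrated with a toy executor; the scene, primitives, and program below are invented for illustration, not NSVD's implementation:

```python
# Toy scene representation and symbolic primitives.
SCENE = [{"shape": "cube", "color": "red"}, {"shape": "sphere", "color": "red"},
         {"shape": "cube", "color": "blue"}]

PRIMITIVES = {
    "filter_color": lambda objs, c: [o for o in objs if o["color"] == c],
    "filter_shape": lambda objs, s: [o for o in objs if o["shape"] == s],
    "count": lambda objs: len(objs),
}

def execute(program, objs):
    """Run a program such as [("filter_color", "red"), ("count",)] step by step."""
    result = objs
    for op, *args in program:
        result = PRIMITIVES[op](result, *args)
    return result

# A neural parser would map "How many red cubes are there?" to this program:
print(execute([("filter_color", "red"), ("filter_shape", "cube"), ("count",)], SCENE))
```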
Despite its importance for assessing the effectiveness of communicating information visually, fine-grained recallability of information visualisations has not been studied quantitatively so far. In this work, we propose a question-answering paradigm to study visualisation recallability and present VisRecall - a novel dataset consisting of 200 visua...
One approach to mitigate shoulder surfing attacks on mobile devices is to detect the presence of a bystander using the phone’s front-facing camera. However, a person’s face in the camera’s field of view does not always indicate an attack. To overcome this limitation, in a novel data collection study (N=16), we analysed the influence of three viewin...
Gaze-based analysis of areas of interest (AOIs) is widely used in information visualisation research to understand how people explore visualisations or assess the quality of visualisations concerning key characteristics such as memorability. However, nearby AOIs in visualisations amplify the uncertainty caused by the gaze estimation error, which st...
Mind wandering (MW) is defined as a shift of attention to task-unrelated internal thoughts that is pervasive and disruptive for learning performance. Current state-of-the-art gaze-based attention-aware intelligent systems are capable of detecting MW from eye movements and delivering interventions to mitigate its negative effects. However, the benef...
Emotional expressions are inherently multimodal -- integrating facial behavior, speech, and gaze -- but their automatic recognition is often limited to a single modality, e.g. speech during a phone call. While previous work proposed crossmodal emotion embeddings to improve monomodal recognition performance, despite its importance, an explicit repre...
Handheld mobile devices store a plethora of sensitive data, such as private emails, personal messages, photos, and location data. Authentication is essential to protect access to sensitive data. However, the majority of mobile devices are currently secured by single-modal authentication schemes, which are vulnerable to shoulder surfing, smudge attack...
Despite its importance for assessing the effectiveness of communicating information visually, fine-grained recallability of information visualisations has not been studied quantitatively so far. In this work we propose a visual question answering (VQA) paradigm to study visualisation recallability and present VisQA -- a novel VQA dataset consisting...
Understanding human visual attention in immersive virtual reality (VR) is crucial for many important applications, including gaze prediction, gaze guidance, and gaze-contingent rendering. However, previous works on visual attention analysis typically only explored one specific VR task and paid less attention to the differences between different tas...
We propose Unified Model of Saliency and Scanpaths (UMSS) -- a model that learns to predict visual saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during the visual exploration process, prior work has been lim...
Pupil dilation may reveal the outcomes of binary decisions: pupils dilate more strongly for stimuli that the beholder deems targets than for distractors. These findings are based on average pupil dynamics, aggregated over multiple trials and participants, rather than on single trials. Further, the reported differences between targets a...
Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to uni-modal integration - even for inherently multimodal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN) - the first method for multimodal integration of human-l...
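The usual mechanism behind such supervision is an auxiliary loss pulling the model's attention distribution towards human fixation maps. A minimal sketch (the KL formulation and region layout are assumptions, not necessarily MULAN's exact loss):

```python
import torch
import torch.nn.functional as F

def attention_supervision_loss(model_attn_logits, human_attn, eps=1e-8):
    """model_attn_logits: (B, N) unnormalised; human_attn: (B, N) fixation density."""
    p_model = F.log_softmax(model_attn_logits, dim=-1)
    p_human = human_attn / (human_attn.sum(-1, keepdim=True) + eps)
    # KL(human || model), added to the task loss during training.
    return F.kl_div(p_model, p_human, reduction="batchmean")

logits = torch.randn(4, 49, requires_grad=True)      # e.g. 7x7 image regions
human = torch.rand(4, 49)                            # human gaze heatmap, flattened
loss = attention_supervision_loss(logits, human)
loss.backward()
```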
We present VQA-MHUG - a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA) collected using a high-speed eye tracker. We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models: Modular Co-Attention Netwo...
We propose a novel method that leverages human fixations to visually decode the image a person has in mind into a photofit (facial composite). Our method combines three neural networks: An encoder, a scoring network, and a decoder. The encoder extracts image features and predicts a neural activation map for each face looked at by a human observer....
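A minimal sketch of the three-network pattern described above (encoder, scoring network, decoder); all layer shapes and the gaze-feature encoding are assumptions for illustration:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
                        nn.Linear(256, 128))         # face image -> feature vector
scorer = nn.Sequential(nn.Linear(128 + 16, 64), nn.ReLU(),
                       nn.Linear(64, 1))             # (features, gaze stats) -> relevance
decoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(),
                        nn.Linear(256, 64 * 64))     # aggregated features -> photofit

faces = torch.rand(10, 64, 64)        # faces the observer looked at
gaze_stats = torch.rand(10, 16)       # per-face fixation statistics (assumed encoding)

feats = encoder(faces)
weights = torch.softmax(scorer(torch.cat([feats, gaze_stats], -1)), dim=0)
photofit = decoder((weights * feats).sum(0)).view(64, 64)   # gaze-weighted decoding
```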
Although our pupils slightly dilate when we look at an intended target, they do not when we look at irrelevant distractors. This finding suggests that it may be possible to decode the intention of an observer, understood as the outcome of implicit covert binary decisions, from the pupillary dynamics over time. However, few previous works have inves...
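A toy single-trial baseline in the spirit of this line of work, on synthetic pupil traces (the feature choices and effect sizes below are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_trials, T = 200, 120                         # 120 samples ~ 2 s at 60 Hz
labels = rng.integers(0, 2, n_trials)          # 1 = target, 0 = distractor
base = rng.normal(3.5, 0.3, (n_trials, 1))     # baseline pupil diameter (mm)
dilation = 0.15 * labels[:, None] * np.linspace(0, 1, T)   # targets dilate more
pupil = base + dilation + rng.normal(0, 0.05, (n_trials, T))

# Simple per-trial features: baseline-corrected mean, peak, and late-window mean.
corrected = pupil - pupil[:, :20].mean(1, keepdims=True)
X = np.c_[corrected.mean(1), corrected.max(1), corrected[:, 60:].mean(1)]
print(cross_val_score(LogisticRegression(), X, labels, cv=5).mean())
```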
Human visual attention in immersive virtual reality (VR) is key for many important applications, such as content design, gaze-contingent rendering, or gaze-based interaction. However, prior works typically focused on free-viewing conditions that have limited relevance for practical applications. We first collect eye tracking data of 27 participants...
A lack of corpora has so far limited advances in integrating human gaze data as a supervisory signal in neural attention mechanisms for natural language processing (NLP). We propose a novel hybrid text saliency model (TSM) that, for the first time, combines a cognitive model of reading with explicit human gaze supervision in a single machine learnin...
While neural networks with attention mechanisms have achieved superior performance on many natural language processing tasks, it remains unclear to which extent learned attention resembles human visual attention. In this paper, we propose a new method that leverages eye-tracking data to investigate the relationship between human visual attention an...
An ever-growing body of work has demonstrated the rich information content available in eye movements for user modelling, e.g. for predicting users' activities, cognitive processes, or even personality traits. We show that state-of-the-art classifiers for eye-based user modelling are highly vulnerable to adversarial examples: small artificial pertu...
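The attack family in question is typically demonstrated with gradient-based perturbations such as FGSM; a minimal sketch against a toy gaze-feature classifier (model and feature layout are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

gaze_features = torch.randn(8, 20)            # e.g. fixation/saccade statistics
labels = torch.randint(0, 2, (8,))

# FGSM: one gradient step on the input in the direction that increases the loss.
gaze_features.requires_grad_(True)
loss = loss_fn(model(gaze_features), labels)
loss.backward()

eps = 0.05                                    # small, hard-to-notice perturbation
adversarial = gaze_features + eps * gaze_features.grad.sign()
print((model(adversarial).argmax(1) != labels).float().mean())  # flip rate
```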
Predicting the target of visual search from human eye fixations (gaze) is a difficult problem with many applications, e.g. in human-computer interaction. While previous work has focused on predicting specific search target instances, we propose the first approach to predict categories and attributes of search intents from gaze data and to visually...