James Pustejovsky, Ph.D.
Brandeis University · Department of Computer Science
About
333 Publications
81,549 Reads
15,914 Citations
Additional affiliations: September 1986 - present
Publications (333)
Our goal is to develop an AI Partner that can provide support for group problem solving and social dynamics. In multi-party working group environments, multimodal analytics is crucial for identifying non-verbal interactions of group members. In conjunction with their verbal participation, this creates a holistic understanding of collaboration and...
Semantic textual similarity (STS) is a fundamental NLP task that measures the semantic similarity between a pair of sentences. In order to reduce the inherent ambiguity posed by the sentences, a recent task, Conditional STS (C-STS), has been proposed to measure the sentences' similarity conditioned on a certain aspect. Despite the popularity...
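Illustrative sketch (not from the paper): a minimal bi-encoder baseline for C-STS, embedding each sentence together with the condition before comparing. The model name and the "condition: sentence" concatenation scheme are assumptions.

    # Minimal C-STS baseline sketch (illustrative; not the authors' model).
    # Assumes the sentence-transformers package; the model name and the
    # simple concatenation scheme are our own choices.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def conditional_similarity(sent_a: str, sent_b: str, condition: str) -> float:
        # Embed each sentence jointly with the condition so the comparison
        # is made only along the conditioned aspect.
        emb = model.encode([f"{condition}: {sent_a}", f"{condition}: {sent_b}"])
        return float(util.cos_sim(emb[0], emb[1]))

    # Same sentence pair, different conditions -> different similarities.
    s1 = "A man is playing a guitar on stage."
    s2 = "A woman is playing a violin in a park."
    print(conditional_similarity(s1, s2, "the activity"))  # higher: both play music
    print(conditional_similarity(s1, s2, "the location"))  # lower: stage vs. park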
The Institute for Student-AI Teaming (iSAT) addresses the foundational question of how to promote deep conceptual learning via rich socio-collaborative learning experiences for all students, a question that is ripe for AI-based facilitation and has the potential to transform classrooms. We advance research in speech, computer vision, human-agent team...
Cognitive science has evolved since early disputes between radical empiricism and radical nativism. The authors are reacting to the revival of radical empiricism spurred by recent successes in deep neural network (NN) models. We agree that language-like mental representations (language-of-thoughts [LoTs]) are part of the best game in town, but they...
The ability to understand and model human-object interactions is becoming increasingly important in advancing the field of human-computer interaction (HCI). To maintain more effective dialogue, embodied agents must utilize situated reasoning - the ability to ground objects in a shared context and understand their roles in the conversation [35]. In...
VoxML is a modeling language used to map natural language expressions into real-time visualizations using commonsense semantic knowledge of objects and events. Its utility has been demonstrated in embodied simulation environments and in agent-object interactions in situated multimodal human-agent collaboration and communication. It introduces the n...
Psychiatric electronic health records (EHRs) present a distinctive challenge in the domain of ML owing to their unstructured nature, with a high degree of complexity and variability. This study aimed to identify a cohort of patients with diagnoses of a psychotic disorder and posttraumatic stress disorder (PTSD), develop clinically-informed guidelin...
While affordance detection and Human-Object interaction (HOI) detection tasks are related, the theoretical foundation of affordances makes it clear that the two are distinct. In particular, researchers in affordances make distinctions between J. J. Gibson's traditional definition of an affordance, “the action possibilities” of the object within the...
Understanding inferences and answering questions from text requires more than merely recovering surface arguments, adjuncts, or strings associated with the query terms. As humans, we interpret sentences as contextualized components of a narrative or discourse, by both filling in missing information, and reasoning about event consequences. In this p...
Much progress in AI over the last decade has been driven by advances in natural language processing technology, in turn facilitated by large datasets and increased computation power used to train large neural language models. These systems demonstrate apparently sophisticated linguistic understanding or generation capabilities, but often fail to tr...
In this paper, we extend Abstract Meaning Representation (AMR) in order to represent situated multimodal dialogue, with a focus on the modality of gesture. AMR is a general-purpose meaning representation that has become popular for its transparent structure, its ease of annotation and available corpora, and its overall expressiveness. While AMR was...
In this paper, we argue that, as HCI becomes more multimodal with the integration of gesture, gaze, posture, and other nonverbal behavior, it is important to understand the role played by affordances and their associated actions in human-object interactions (HOI), so as to facilitate reasoning in HCI and HRI environments. We outline the requirement...
The need for deeper semantic processing of human language by our natural language processing systems is evidenced by their still-unreliable performance on inferencing tasks, even using deep learning techniques. These tasks require the detection of subtle interactions between participants in events, of sequencing of subevents that are often not expl...
We have recently begun a project to develop a more effective and efficient way to marshal inferences from background knowledge to facilitate deep natural language understanding. The meaning of a word is taken to be the entities, predications, presuppositions, and potential inferences that it adds to an ongoing situation. As words compose, the minim...
In this paper, we argue that embodiment can play an important role in the design and modeling of systems developed for Human Computer Interaction. To this end, we describe a simulation platform for building Embodied Human Computer Interactions (EHCI). This system, VoxWorld, enables multimodal dialogue systems that communicate through language, gest...
In this paper, we argue that embodiment can play an important role in the evaluation of systems developed for Human Computer Interaction. To this end, we describe a simulation platform for building Embodied Human Computer Interactions (EHCI). This system, VoxWorld, enables multimodal dialogue systems that communicate through language, gesture, acti...
In this paper series, we argue for the role embodiment plays in the evaluation of systems developed for Human Computer Interaction. We use a simulation platform, VoxWorld, for building Embodied Human Computer Interactions (EHCI). VoxWorld enables multimodal dialogue systems that communicate through language, gesture, action, facial expressions, and...
In this paper, we argue that the design and development of multimodal datasets for natural language processing (NLP) challenges should be enhanced in two significant respects: to more broadly represent commonsense semantic inferences; and to better reflect the dynamics of actions and events, through a substantive alignment of textual and visual inf...
In this paper we present Uniform Meaning Representation (UMR), a meaning representation designed to annotate the semantic content of a text. UMR is primarily based on Abstract Meaning Representation (AMR), an annotation framework initially designed for English, but also draws from other meaning representations. UMR extends AMR to other languages, p...
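For readers unfamiliar with the notation, a sketch of the kind of graph AMR uses and UMR builds on, in Penman notation; the example sentence and the use of the penman library are illustrative, not drawn from the paper.

    # Illustrative AMR graph in Penman notation (UMR extends this format,
    # e.g. with aspect and document-level annotation). Uses the `penman`
    # package; the example sentence is our own.
    import penman

    graph = penman.decode("""
    (w / want-01
       :ARG0 (b / boy)
       :ARG1 (g / go-02
                :ARG0 b))
    """)  # "The boy wants to go."

    # The graph is a set of (source, role, target) triples; note the
    # reentrancy: variable b is an argument of both want-01 and go-02.
    for source, role, target in graph.triples:
        print(source, role, target)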
In recent years, data-intensive AI, particularly the domain of natural language processing and understanding, has seen significant progress driven by the advent of large datasets and deep neural networks that have sidelined more classic AI approaches to the field. These systems can apparently demonstrate sophisticated linguistic understanding or ge...
In this paper we present Jarvis, a multimodal explorer and navigation system for biocuration data, from both curated sources and text-derived datasets. This system harnesses voice and haptic control for a bioinformatic research context, specifically manipulation of data visualizations such as heatmaps and word clouds showing related terms in the da...
We present a new interface for controlling a navigation robot in novel environments using coordinated gesture and language. We use a TurtleBot3 robot with a LIDAR and a camera, an embodied simulation of what the robot has encountered while exploring, and a cross-platform bridge facilitating generic communication. A human partner can deliver instruc...
We are developing semantic visualization techniques in order to enhance exploration and enable discovery over large datasets of complex networks of relations. Semantic visualization exploits the semantics of the relations in these networks to enable that exploration and discovery. This involves (i) NLP to...
State of the art unimodal dialogue agents lack some core aspects of peer-to-peer communication—the nonverbal and visual cues that are a fundamental aspect of human interaction. To facilitate true peer-to-peer communication with a computer, we present Diana, a situated multimodal agent who exists in a mixed-reality environment with a human interlocu...
In this paper, we present an analysis of computationally generated mixed-modality definite referring expressions using combinations of gesture and linguistic descriptions. In doing so, we expose some striking formal semantic properties of the interactions between gesture and language, conditioned on the introduction of content into the common groun...
We propose a new representation of predicate-argument structure, which encodes how the participants in the event expressed by the verb change as the event unfolds. This involves a dynamic interpretation of the predicate and its arguments, and the development of a new representation designed to encode such information, which we call dynamic predic...
We present an architecture for integrating real-time, multimodal input into a computational agent's contextual model. Using a human-avatar interaction in a virtual world, we treat aligned gesture and speech as an ensemble where content may be communicated by either modality. With a modified nondeterministic pushdown automaton architecture, the comp...
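A schematic of the idea, as our own simplification rather than the system's actual implementation: a stack-based machine that treats aligned speech and gesture events as one input ensemble, holding underspecified content on the stack until either modality resolves it.

    # Toy pushdown-automaton-style dialogue state machine (a simplification
    # of the idea, not the architecture described in the paper). Speech and
    # gesture events form one input ensemble; unresolved items are pushed
    # on a stack until a later event in either modality resolves them.
    class MultimodalPDA:
        def __init__(self):
            self.stack = []  # pending, underspecified content

        def feed(self, modality: str, event: str):
            if modality == "gesture" and event == "deixis":
                # A pointing gesture resolves the most recent pending referent.
                if self.stack:
                    pending = self.stack.pop()
                    print(f"resolved '{pending}' via gesture")
            elif event.startswith("ref:"):
                # An underspecified referent ("that one") waits for resolution.
                self.stack.append(event)
            else:
                print(f"processed {modality} event: {event}")

    pda = MultimodalPDA()
    pda.feed("speech", "ref:that_block")  # "move that block..."
    pda.feed("gesture", "deixis")         # ...accompanied by pointing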
Many modern machine learning approaches require vast amounts of training data to learn new concepts; conversely, human learning often requires few examples—sometimes only one—from which the learner can abstract structural concepts. We present a novel approach to introducing new spatial structures to an AI agent, combining deep learning over qualita...
In this paper, we argue that simulation platforms enable a novel type of embodied spatial reasoning, one facilitated by a formal model of object and event semantics that renders the continuous quantitative search space of an open-world, real-time environment tractable. We provide examples for how a semantically-informed AI system can exploit the pr...
The Lexicon, by James Pustejovsky (Cambridge University Press; Cambridge Core: Grammar and Syntax).
Many modern machine learning approaches require vast amounts of training data to learn new concepts; conversely, human learning often requires few examples, sometimes only one, from which the learner can abstract structural concepts. We present a novel approach to introducing new spatial structures to an AI agent, combining deep learning over quali...
Advances in artificial intelligence are fundamentally changing how we relate to machines. We used to treat computers as tools, but now we expect them to be agents, and increasingly our instinct is to treat them like peers. This paper is an exploration of peer-to-peer communication between people and machines. Two ideas are central to the approach e...
We describe an ongoing project in learning to perform primitive actions from demonstrations using an interactive interface. In our previous work, we have used demonstrations captured from humans performing actions as training samples for a neural network-based trajectory model of actions to be performed by a computational agent in novel setups. We...
In this paper, I argue that an important component of the language-ready brain is the ability to recognize and conceptualize events. By ‘event’, I mean any situation or activity in the world or our mental life, that we find salient enough to individuate as a thought or word. While this may sound either trivial or non-unique to humans, I hope to sho...
In this paper, we introduce a framework in which computers learn to enact complex temporal-spatial actions by observing humans, and outline our ongoing experiments in this domain. Our framework processes motion capture data of human subjects performing actions, and uses qualitative spatial reasoning to learn multi-level representations for these ac...
This paper presents a new classification of verbs of change and modification, proposing a dynamic interpretation of the lexical semantics of the predicate and its arguments. Adopting the model of dynamic event structure proposed in Pustejovsky (2013), and extending the model of dynamic selection outlined in Pustejovsky and Jezek (2011), we define a...
This paper details the technical functionality of VoxSim, a system for generating three-dimensional visual simulations of natural language motion expressions. We use a rich formal model of events and their participants to generate simulations that satisfy the minimal constraints entailed by an utterance and its minimal model, relying on real-world...
Selecting an optimal event representation is essential for event classification in real world contexts. In this paper, we investigate the application of qualitative spatial reasoning (QSR) frameworks for classification of human-object interaction in three dimensional space, in comparison with the use of quantitative feature extraction approaches fo...
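A minimal example of the contrast the paper investigates: a qualitative relation computed from raw 3D coordinates. The thresholds and relation names are our own, not the paper's feature set.

    # Quantitative vs. qualitative features for a human-object interaction
    # frame (illustrative; thresholds and relation labels are assumptions).
    import math

    def distance(p, q):
        return math.dist(p, q)  # raw quantitative feature

    def qsr_relation(p, q, touch_eps=0.05, near_eps=0.5):
        # Discretize the continuous distance into qualitative relations,
        # in the spirit of QSR calculi such as RCC.
        d = distance(p, q)
        if d <= touch_eps:
            return "touching"
        if d <= near_eps:
            return "near"
        return "apart"

    hand, cup = (0.10, 0.95, 0.40), (0.12, 0.93, 0.42)
    print(distance(hand, cup))      # ~0.035 (quantitative)
    print(qsr_relation(hand, cup))  # "touching" (qualitative)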
Event learning is one of the most important problems in AI. However, notwithstanding significant research efforts, it is still a very complex task, especially when the events involve the interaction of humans or agents with other objects, as it requires modeling human kinematics and object movements. This study proposes a methodology for learning c...
In this paper, we introduce an architecture for multimodal communication between humans and computers engaged in a shared task. We describe a representative dialogue between an artificial agent and a human that will be demonstrated live during the presentation. This assumes a multimodal environment and semantics for facilitating communication and i...
The extraction of spatial semantics is important in many real-world applications such as geographical information systems, robotics and navigation, semantic search, etc. Moreover, spatial semantics are the most relevant semantics related to the visualization of language. The goal of multimodal spatial role labeling task is to extract spatial inform...
In this paper, we examine a set of object interactions generated with a 3D natural language simulation and visualization platform, VoxSim (Krishnaswamy and Pustejovsky 2016b). These simulations all realize the natural language relations “touching” and “near” over a test set of various objects within a 3-dimensional world that interprets descri...
The project "Workset Creation for Scholarly Analysis and Data Capsules" is building an infrastructure where researchers have access to text processing tools that can then be used on a copyrighted set of digital data. The infrastructure is built on (1) the HathiTrust Research Center (HTRC) Data Capsule services that can be used to access the HathiTr...
In this paper we describe ISO-TimeML, an expressive and interoperable specification language for event and temporal expressions in natural language text. Besides annotating times and events, ISO-TimeML aims to capture three additional phenomena relating to temporal information in text: (1) it systematically anchors event predicates to a broad range...
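For illustration, a minimal ISO-TimeML-style fragment, with the attribute inventory abbreviated, parsed here with the Python standard library.

    # Simplified ISO-TimeML-style markup: an event anchored to a time
    # expression via a TLINK (abbreviated attribute set, for illustration).
    import xml.etree.ElementTree as ET

    fragment = """
    <TimeML>
      John <EVENT eid="e1" class="OCCURRENCE">arrived</EVENT>
      on <TIMEX3 tid="t1" type="DATE" value="2010-06-12">June 12, 2010</TIMEX3>.
      <TLINK lid="l1" eventID="e1" relatedToTime="t1" relType="IS_INCLUDED"/>
    </TimeML>
    """

    root = ET.fromstring(fragment)
    for tlink in root.iter("TLINK"):
        print(tlink.get("eventID"), tlink.get("relType"), tlink.get("relatedToTime"))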
In this chapter, we describe the method and process of transforming the theoretical formulations of a linguistic phenomenon, based on empirical observations, into a model that can be used for the development of a language annotation specification. We outline this procedure generally, and then examine the steps in detail by specific example. We look...
An understanding of spatial information in natural language is necessary for many computational linguistics and artificial intelligence applications. In this chapter, we describe an annotation scheme for the markup of spatial relations, both static and dynamic, as expressed in text and other media. The desiderata for such a specification language a...
In this paper, we describe a computational model for motion events in natural language that maps from linguistic expressions, through a dynamic event interpretation, into three-dimensional temporal simulations in a model. Starting with the model from (Pustejovsky and Moszkowicz, 2011), we analyze motion events using temporally-traced Labelled Trans...
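A toy rendering of the idea, as our own simplification of the paper's dynamic event model: a motion verb modeled as a labelled transition system traced over time, with invented state and label names.

    # Toy labelled transition system for a motion event such as
    # "the ball rolls to the wall" (invented state and label names).
    TRANSITIONS = {
        ("at_start", "move"): "in_motion",
        ("in_motion", "move"): "in_motion",
        ("in_motion", "arrive"): "at_goal",
    }

    def trace(state, labels):
        # Follow the labelled transitions, yielding the traced path of states.
        path = [state]
        for label in labels:
            state = TRANSITIONS[(state, label)]
            path.append(state)
        return path

    print(trace("at_start", ["move", "move", "arrive"]))
    # ['at_start', 'in_motion', 'in_motion', 'at_goal']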
In this paper, we describe a system for generating three-dimensional visual simulations of natural language motion expressions. We use a rich formal model of events and their participants to generate simulations that satisfy the minimal constraints entailed by the associated utterance, relying on semantic knowledge of physical objects and motion ev...
Annotated datasets are commonly used in the training and evaluation of tasks involving natural language and vision (image description generation, action recognition and visual question answering). However, many of the existing datasets reflect problems that emerge in the process of data selection and annotation. Here we point out some of the diffic...
As time cannot be observed directly, it must be analyzed in terms of mental categories, which manifest themselves on various linguistic levels. In this interdisciplinary volume, novel approaches to time are proposed that consider temporality without time, on the one hand, and the coding of time in language, including sign language, and gestures, on...
This paper introduces the Event Capture Annotation Tool (ECAT), a user-friendly, open-source interface tool for annotating events and their participants in video, capable of extracting the 3D positions and orientations of objects in video captured by Microsoft's Kinect® hardware. The modeling language VoxML (Pustejovsky and Krishnaswamy, 2016) un...
We present the specification for a modeling language, VoxML, which encodes semantic knowledge of real-world objects represented as three-dimensional models, and of events and attributes related to and enacted over these objects. VoxML is intended to overcome the limitations of existing 3D visual markup languages by allowing for the encoding of a br...
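Schematically, a VoxML-style object entry (a "voxeme") might look like the following. The attribute set is abbreviated from the published spec and the concrete values are invented for illustration.

    # Schematic VoxML-style voxeme for "cup" (attribute set abbreviated
    # from the published spec; the concrete values are invented).
    cup_voxeme = {
        "lex": {"pred": "cup", "type": "physobj"},
        "type": {
            "head": "cylindroid",
            "concavity": "concave",
            "rotatsym": ["Y"],          # rotationally symmetric about Y
        },
        "habitat": {
            "intrinsic": {"up": "+Y"},  # canonical upright orientation
        },
        "afford_str": [
            "grasp(agent, this)",
            "fill(agent, this, liquid)",
        ],
        "embodiment": {"scale": "<agent", "movable": True},
    }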
In the context of the Linguistic Applications (LAPPS) Grid project, we have undertaken the definition of a Web Service Exchange Vocabulary (WS-EV) specifying a terminology for a core of linguistic objects and properties exchanged among NLP tools that consume and produce linguistically annotated data. The goal is not to define a new set of terms, bu...
The Language Application (LAPPS) Grid project is establishing a framework that enables language service discovery, composition, and reuse and promotes sustainability, manageability, usability, and interoperability of natural language processing (NLP) components. It is based on the service-oriented architecture (SOA), a more recent, web-oriented ver...
We describe and motivate the LAPPS Interchange Format, a JSON-LD format that is used for data transfer between language services in the Language Application Grid. The LAPPS Interchange Format enables syntactic and semantic interoperability of language services by providing a uniform syntax for common linguistic data and by using the Linked Data asp...
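A schematic of what such a JSON-LD payload looks like; the field names approximate the published LAPPS Interchange Format but are simplified here.

    # Schematic LAPPS-Interchange-Format-style payload (simplified; field
    # names approximate the published LIF structure).
    import json

    lif = {
        "@context": "http://vocab.lappsgrid.org/context-1.0.0.jsonld",
        "text": {"@value": "Karen flew to New York."},
        "views": [{
            "metadata": {"contains": {"Token": {"producer": "example-tokenizer"}}},
            "annotations": [
                {"id": "tok0", "@type": "Token", "start": 0, "end": 5,
                 "features": {"word": "Karen"}},
            ],
        }],
    }
    print(json.dumps(lif, indent=2))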