Article

From Treebank to PropBank

Abstract

This paper describes our approach to the development of a Proposition Bank, which involves the addition of semantic information to the Penn English Treebank. Our primary goal is the labeling of syntactic nodes with specific argument labels that preserve the similarity of roles such as "the window" in "John broke the window" and "the window broke". After motivating the need for explicit predicate argument structure labels, we briefly discuss the theoretical considerations of predicate argument structure and the need to maintain consistency across syntactic alternations. The issues of consistency of argument structure across both polysemous and synonymous verbs are also discussed, and we present our actual guidelines for these types of phenomena, along with numerous examples of tagged sentences and verb frames. Metaframes are introduced as a technique for handling similar frames among near-synonymous verbs. We conclude with a summary of the current status of the annotation process.
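
To make the argument-labeling idea concrete, here is a minimal, hypothetical sketch (in Python, not the actual PropBank frame files or annotation format) of how a single roleset can assign the same label to "the window" in both the transitive and the intransitive alternation:

```python
# Minimal sketch of PropBank-style labeling across a syntactic alternation.
# The roleset id and role glosses are illustrative, not the official frame file.

roleset = {
    "id": "break.01",
    "roles": {
        "Arg0": "breaker",
        "Arg1": "thing broken",
    },
}

# Transitive: "John broke the window."
transitive = {
    "rel": "break.01",
    "Arg0": "John",
    "Arg1": "the window",
}

# Inchoative: "The window broke."
inchoative = {
    "rel": "break.01",
    "Arg1": "the window",   # same label as above, even though it is the subject here
}

for name, annotation in [("transitive", transitive), ("inchoative", inchoative)]:
    print(name, "->", {k: v for k, v in annotation.items() if k != "rel"})
```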

... AMRs are rooted Directed Acyclic Graphs (DAGs). AMR nodes represent two core concepts in a sentence: words (typically adjectives or stemmed nouns/adverbs), or frames extracted from Propbank [9]. For example, in Figure 3, nodes such as 'you' and 'thing' are English words, while 'find-01' and 'solve-01' represent Propbank framesets. ...
... Figure 3(a) shows the output AMR. Nodes are either words or concepts (such as 'find-01') from the PropBank framesets [9]. The edge labels show the relationship between the nodes. ...
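
As a rough illustration of the AMR structure these excerpts describe, the following sketch builds a small rooted DAG whose concept nodes are either plain words or PropBank-style framesets; the sentence, variable names, and edges are invented for illustration and are not the graph from the cited Figure 3:

```python
# Illustrative AMR-style DAG for "You found a thing that solves the problem."
# Frameset ids (find-01, solve-01) follow PropBank naming; everything else is
# a made-up example, not the graph from the cited paper.

nodes = {
    "f": "find-01",    # PropBank frameset
    "y": "you",        # plain English word
    "t": "thing",
    "s": "solve-01",   # PropBank frameset
    "p": "problem",
}

edges = [
    ("f", ":ARG0", "y"),     # finder
    ("f", ":ARG1", "t"),     # thing found
    ("t", ":ARG0-of", "s"),  # the thing is what does the solving
    ("s", ":ARG1", "p"),     # problem solved
]

# Print a rough PENMAN-like rendering of each edge.
for src, label, dst in edges:
    print(f"({src} / {nodes[src]}) {label} ({dst} / {nodes[dst]})")
```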
... Note that because some of our systems are trained using relevance judgments for ARQMath-1 and -2 topics, those results should be interpreted as training results rather than as a clean progress test since some models (and in particular MathAMR) may be over-fit to this data. To compare isolated vs. contextual formula search, we look at results from Tangent-CFT2TED and MathAMR runs. Using MathAMR can be helpful specifically for formulas for which context is important. ...
Preprint
There are two main tasks defined for ARQMath: (1) Question Answering, and (2) Formula Retrieval, along with a pilot task (3) Open Domain Question Answering. For Task 1, five systems were submitted using raw text with formulas in LaTeX and/or linearized MathAMR trees. MathAMR provides a unified hierarchical representation for text and formulas in sentences, based on the Abstract Meaning Representation (AMR) developed for Natural Language Processing. For Task 2, five runs were submitted: three of them using isolated formula retrieval techniques applying embeddings, tree edit distance, and learning to rank, and two using MathAMRs to perform contextual formula search, with BERT embeddings used for ranking. Our model with tree-edit distance ranking achieved the highest automatic effectiveness. Finally, for Task 3, four runs were submitted, which included the Top-1 results for two Task 1 runs (one using MathAMR, the other SVM-Rank with raw text and metadata features), each with one of two extractive summarizers.
... For instance, in the sentence in Figure 1, the word "(he)" acts as the A0 of the two predicates "(went to)" and "(left)", respectively. Most available semantic resources, such as PropBank (P-B) (Kingsbury and Palmer 2002), adopt this type of graph-structure representation, which makes structured prediction computationally more expensive, compared to linear or tree structures. As a result, many state-of-the-art SRL systems take SRL as several subtasks (Xue 2008; Björkelund, Hafdell, and Nugues 2009), including predicate selection, argument selection and role classification, thereby neglecting the relation between different roles the same word or phrase plays. ...
... Semantic role set induction. PropBank (Kingsbury and Palmer 2002), FrameNet (Baker, Fillmore, and Lowe 1998) and VerbNet (Schuler 2005) choose different types of semantic role sets. For English, the existence of a semantic dictionary makes it possible to map semantic roles from PropBank to VerbNet. ...
... Semantic Resources with annotated predicate-argument structures include the PropBank (Kingsbury and Palmer 2002; Hovy et al. 2006; Xue 2008) and FrameNet (Baker, Fillmore, and Lowe 1998). PropBank is an important lexical resource for semantic role labeling, and a part of the OntoNotes resource (Hovy et al. 2006), which includes semantic corpora for English, Chinese and Arabic. ...
Article
We present a novel annotation framework for representing predicate-argument structures, which uses dependency trees to encode the syntactic and semantic roles of a sentence simultaneously. The main contribution is a semantic role transmission model, which eliminates the structural gap between syntax and shallow semantics, making them compatible. A Chinese semantic treebank was built under the proposed framework, and the first release containing about 14K sentences is made freely available. The proposed framework enables semantic role labeling to be solved as a sequence labeling task, and experiments show that standard sequence labelers can give competitive performance on the new treebank compared with state-of-the-art graph structure models.
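
To illustrate what treating SRL as sequence labeling looks like, here is a minimal sketch using BIO-style role tags for a single predicate; the sentence, tags, and decoding helper are illustrative and are not the annotation scheme of the cited treebank:

```python
# Illustrative BIO-style encoding of semantic roles for one predicate.
# With this encoding, any off-the-shelf sequence labeler (CRF, BiLSTM, ...)
# can predict roles token by token.

tokens = ["John", "broke", "the", "window", "yesterday"]
predicate_index = 1  # "broke"

# Gold tags for this predicate (made-up example, PropBank-style labels).
tags = ["B-ARG0", "O", "B-ARG1", "I-ARG1", "B-ARGM-TMP"]

def decode_spans(tokens, tags):
    """Turn BIO tags back into (label, phrase) spans."""
    spans, current_label, current_words = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_label:
                spans.append((current_label, " ".join(current_words)))
            current_label, current_words = tag[2:], [tok]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_words.append(tok)
        else:
            if current_label:
                spans.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
    if current_label:
        spans.append((current_label, " ".join(current_words)))
    return spans

print(decode_spans(tokens, tags))
# [('ARG0', 'John'), ('ARG1', 'the window'), ('ARGM-TMP', 'yesterday')]
```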
... For the SRL task, or more precisely SRL of verbs, which are regarded as the backbone of a sentence, researchers have developed a range of lexical resources, including FrameNet (Baker, Fillmore, and Lowe 1998), PropBank (Kingsbury and Palmer 2002), VerbNet (Schuler 2006), etc. These resources can be classified into two types, since the linguistic theories they rely upon use different criteria for identifying arguments. ...
... Such a difference, from our perspective, is rooted in their forms: the appearance of DE relaxes the fixed syntactic pattern and incorporates more ARG0 unconditionally. [Table 11: With the situation-based approach, "accuracies" of verbal/nominal event identification and SRL tasks for both humans and machines in English and Mandarin Chinese. For verbal events, the IAA on the identification task in English is inferred from the annotation of Fan et al. (2011); the accuracy is inferred from the percentage of false predictions of the Stanford tagger reported by He, Lewis, and Zettlemoyer (2015); the IAA for identifying those in Mandarin is from our own annotation; machine performance is approximated by the F1 scores of two verb subcategories, VA and VC, reported in Sun and Wan (2016); and the SRL data are provided by PropBank (Kingsbury and Palmer 2002) and the Chinese PropBank (Xue and Palmer 2003), respectively. For nominal events, the QANom project (Klein et al. 2020) reports the IAAs and F1 scores on the identification task in English; the IAA for identifying those in Mandarin comes from our own annotation; machine performance is evaluated by the F1 scores in Zhao et al. (2007); IAAs on English nominal SRL are provided by NomBank (Meyers et al. 2004b,a), and Liu and Ng (2007) report the F1 score for this task; IAAs on Mandarin nominal SRL are from our own annotation experiment on a role classification task over non-adverbial arguments following the guidelines of the Chinese NomBank (Xue 2006), and Li et al. (2009) report the F1 score based on the data of the Chinese NomBank (Xue 2006).] ...
Article
Full-text available
Divergence of languages observed at the surface level is a major challenge encountered by multilingual data representation, especially when typologically distant languages are involved. Drawing inspiration from a formalist Chomskyan perspective towards language universals, Universal Grammar (UG), this article employs deductively pre-defined universals to analyse a multilingually heterogeneous phenomenon, event nominals. In this way, deeper universality of event nominals beneath their huge divergence in different languages is uncovered, which empowers us to break barriers between languages and thus extend insights from some synthetic languages to a non-inflectional language, Mandarin Chinese. Our empirical investigation also demonstrates this UG-inspired schema is effective: with its assistance, the inter-annotator agreement (IAA) for identifying event nominals in Mandarin grows from 88.02% to 94.99%, and automatic detection of event-reading nominalizations on the newly-established data achieves an accuracy of 94.76% and an F1 score of 91.3%, which significantly surpass those achieved on the pre-existing resource by 9.8% and 5.2% respectively. Our systematic analysis also sheds light on nominal semantic role labelling (SRL). By providing a clear definition and classification of event nominal arguments, the IAA of this task significantly increases from 90.46% to 98.04%.
... As an example where the grammatical representation and the non-linguistic meaning representation may encode argument information differently, we compare Double R grammatical representations to UMR (UMR = Uniform Meaning Representation) meaning representations (cf. Van Gysel et al., 2021;UMR guidelines, downloaded 2023), and to AMR (AMR = Abstract Meaning Representation) meaning representations (Kingsbury & Palmer, 2002;Banarescu et al., 2013), the predecessor of UMR. English encodes the subjective and objective case of pronouns like he (subjective case) and him (objective case) via lexical contrast. ...
... Since UMR meaning representations do not typically represent the meaning of prepositions, they often abstract away from the grammatical distinction between object referring expressions functioning as arguments and oblique referring expressions functioning as adjuncts. UMR uses the PropBank (Kingsbury & Palmer, 2002) to represent the meaning of events concepts as verb senses, in annotations of English. For each verb sense or event concept, PropBank provides a verb frame with "a verb specific set of semantic roles as the arguments" (Pradhan et al., 2022) (not highlighted in original). ...
Preprint
Full-text available
This preprint argues for the combining of Double R grammatical representations with UMR meaning representations to support fuller meaning determination. We provide numerous examples where both representations are needed, especially for co-reference resolution, and when grammatical features and semantic features do not align. We suggest improvements to UMR meaning representations with respect to event nominals, prepositions, and level of abstraction, which bring it into closer alignment with Double R grammatical representations. 2023-12-22: uploaded a new version with an expanded introduction and minor changes.
... Universal Proposition Bank (Jindal et al., 2022) provides semantic role annotation for 23 languages, based on their UD treebanks. As the name suggests, semantic role labels follow the PropBank (Kingsbury and Palmer, 2002). Second, a recent proposal by Evang (2023) defines the CRANS annotation scheme to annotate semantic roles on top of UD. ...
... 1. Meaning-Text Theory (Žolkovskij and Mel'čuk, 1965) 2. Functional Generative Description (Sgall, 1967) 3. PropBank (Kingsbury and Palmer, 2002) 4. Sequoia (Candito et al., 2014) In the present paper we focus on the perspective of the Meaning-Text Theory and Functional Generative Description, leaving the comparison to the other two frameworks for future work. ...
Preprint
Full-text available
This paper analyzes multiple deep-syntactic frameworks with the goal of creating a proposal for a set of universal semantic role labels. The proposal examines various theoretic linguistic perspectives and focuses on Meaning-Text Theory and Functional Generative Description frameworks. For the purpose of this research, data from four languages is used -- Spanish and Catalan (Taule et al., 2011), Czech (Hajic et al., 2017), and English (Hajic et al., 2012). This proposal is oriented towards Universal Dependencies (de Marneffe et al., 2021) with a further intention of applying the universal semantic role labels to the UD data.
... To further determine semantic roles in images, GSR also provides visually grounded information (i.e., bounding boxes) for noun entities. From the perspective of the theory of frame semantics [26], the GSR task can be considered a multimedia extension of earlier lexical databases such as FrameNet [3] and PropBank [33]. By describing activities with verbs and grounded semantic roles, GSR can provide a visually grounded verb-frame, which benefits many downstream scene understanding tasks, such as information retrieval [15,16,44,56], question answering [1,7,35,67], recommender systems [10,12,14,55], and multimedia understanding [29,51,77,78]. ...
... These predefined verb frames are filtered from PropBank [33] or FrameNet [3,27]. ...
Preprint
Full-text available
Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for ``human-like'' event understanding. Specifically, GSR task not only detects the salient activity verb (e.g. buying), but also predicts all corresponding semantic roles (e.g. agent and goods). Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework: 1) detect the activity verb, and then 2) predict semantic roles based on the detected verb. Obviously, this illogical framework constitutes a huge obstacle to semantic understanding. First, pre-detecting verbs solely without semantic roles inevitably fails to distinguish many similar daily activities (e.g., offering and giving, buying and selling). Second, predicting semantic roles in a closed auto-regressive manner can hardly exploit the semantic relations among the verb and roles. To this end, in this paper we propose a novel two-stage framework that focuses on utilizing such bidirectional relations within verbs and roles. In the first stage, instead of pre-detecting the verb, we postpone the detection step and assume a pseudo label, where an intermediate representation for each corresponding semantic role is learned from images. In the second stage, we exploit transformer layers to unearth the potential semantic relations within both verbs and semantic roles. With the help of a set of support images, an alternate learning scheme is designed to simultaneously optimize the results: update the verb using nouns corresponding to the image, and update nouns using verbs from support images. Extensive experimental results on challenging SWiG benchmarks show that our renovated framework outperforms other state-of-the-art methods under various metrics.
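
For concreteness, the structured output of a GSR system can be pictured roughly as below; the verb, role inventory, nouns, and box coordinates are invented for illustration and do not come from the SWiG benchmark:

```python
# Illustrative grounded-situation output: one activity verb plus its frame of
# semantic roles, each filled by a noun and optionally a bounding box
# (x1, y1, x2, y2). Role inventory and values are made up for illustration.

grounded_situation = {
    "verb": "buying",
    "roles": {
        "agent":  {"noun": "woman",  "bbox": (34, 50, 210, 400)},
        "goods":  {"noun": "apple",  "bbox": (120, 300, 180, 360)},
        "seller": {"noun": "vendor", "bbox": (250, 40, 460, 410)},
        "place":  {"noun": "market", "bbox": None},  # role present, not grounded
    },
}

for role, filler in grounded_situation["roles"].items():
    grounded = "grounded" if filler["bbox"] else "ungrounded"
    print(f"{role:>6}: {filler['noun']} ({grounded})")
```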
... The availability of accurate syntactic parses opened the door to richer, deeper representations. The second Human Language Technology conference included a presentation on Adding Predicate Argument Structure to the Penn Treebank and the Proposition Bank (PropBank) was born (Kingsbury and Palmer, 2002). Over the next few years, with the able guidance of a steering committee consisting of Ralph Weischedel, Mitch Marcus, Doug Appelt, Mark Villain and Ralph Grishman, the annotation guidelines and the annotation continued to grow, with the end result of over 110,000 predicate argument structures pointing directly to syntactic nodes in the phrase structure syntax trees of the roughly 50,000 sentences of the Penn Treebank. ...
... PropBank (Kingsbury and Palmer, 2002; Palmer et al., 2005a) is a paradigm for the development of corpora annotated with predicate argument structures. In its original form, these predicate argument structures were applied to the syntactic scaffolding provided by the Penn Treebank. ...
... The Framester ontology hub [19,32] provides a formal semantics to semantic frames [30] in a curated linked data version of multiple linguistic resources (e.g. FrameNet [43], WordNet [33], VerbNet [34], PropBank [44], a cognitive layer including MetaNet [35] and ImageSchemaNet [36], BabelNet [37], factual knowledge bases (e.g. DBpedia [38], YAGO [39], etc.), and ontology schemas (e.g. ...
Preprint
Full-text available
The development of artificial intelligence systems capable of understanding and reasoning about complex real-world scenarios is a significant challenge. In this work we present a novel approach to enhance and exploit LLM reactive capability to address complex problems and interpret deeply contextual real-world meaning. We introduce a method and a tool for creating a multimodal, knowledge-augmented formal representation of meaning that combines the strengths of large language models with structured semantic representations. Our method begins with an image input, utilizing state-of-the-art large language models to generate a natural language description. This description is then transformed into an Abstract Meaning Representation (AMR) graph, which is formalized and enriched with logical design patterns, and layered semantics derived from linguistic and factual knowledge bases. The resulting graph is then fed back into the LLM to be extended with implicit knowledge activated by complex heuristic learning, including semantic implicatures, moral values, embodied cognition, and metaphorical representations. By bridging the gap between unstructured language models and formal semantic structures, our method opens new avenues for tackling intricate problems in natural language understanding and reasoning.
... The clause type indicates how those extracted chunks are syntactically connected among one another. Before clause type identification for a given news item, we first apply VerbNet [45] and PropBank [22] to the news item to categorize its verbs and obtain the argument frames of the categorized verbs. We then map the extracted chunks of the news item to the argument frames given by VerbNet and PropBank. ...
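
A hypothetical sketch of the chunk-to-frame mapping step described in this excerpt; the roleset, glosses, and the simple order-based heuristic are our own illustration, not the cited NewsIE implementation:

```python
# Hypothetical mapping of extracted text chunks onto a verb's argument frame.
# The frame below imitates a PropBank roleset; assigning chunks in surface
# order is a toy heuristic for illustration only.

frame_for_verb = {
    "say.01": ["Arg0 (sayer)", "Arg1 (utterance)"],
}

def map_chunks(verb_sense, chunks):
    slots = frame_for_verb.get(verb_sense, [])
    return dict(zip(slots, chunks))

chunks = ["The regulator", "that prices will rise"]
print(map_chunks("say.01", chunks))
# {'Arg0 (sayer)': 'The regulator', 'Arg1 (utterance)': 'that prices will rise'}
```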
Preprint
Full-text available
Large Language Models (LLMs) have received considerable interest across a wide range of applications lately. During pre-training on massive datasets, such a model implicitly memorizes the factual knowledge of the training datasets in its hidden parameters. However, knowledge held implicitly in parameters often makes its use by downstream applications ineffective due to the lack of common-sense reasoning. In this article, we introduce a general framework that permits building knowledge bases with the aid of LLMs, tailored for processing Web news. The framework applies a rule-based News Information Extractor (NewsIE) to news items for extracting their relational tuples, referred to as knowledge bases, which are then graph-convoluted with the implicit knowledge facts of news items obtained by LLMs, for their classification. It involves two lightweight components: 1) NewsIE: for extracting the structural information of every news item, in the form of relational tuples; 2) BERTGraph: for graph convoluting the implicit knowledge facts with relational tuples extracted by NewsIE. We have evaluated our framework on different news-related datasets for news category classification, with promising experimental results.
... The other dyads are Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, Sanctity/Degradation, Liberty/Oppression. The original ValueNet MFT ontological module includes one class for each of these value violations and aligns it, following frame semantics principles, to entities from FrameNet [17], WordNet [18], VerbNet [19], PropBank [20] and several others in the Framester hub [21]. However, in this work, we use only those value-frames which are evoked by some existing frame in the Framester hub, and for which we can reuse the semantic roles, resulting in a final set of 10 value-frames out of the original 12. ...
Preprint
Full-text available
This work explores the integration of ontology-based reasoning and Machine Learning techniques for explainable value classification. By relying on an ontological formalization of moral values as in the Moral Foundations Theory, based on the DnS Ontology Design Pattern, the sandra neuro-symbolic reasoner is used to infer values (formalized as descriptions) that are satisfied by a certain sentence. Sentences, alongside their structured representation, are automatically generated using an open-source Large Language Model. The inferred descriptions are used to automatically detect the value associated with a sentence. We show that relying only on the reasoner's inference results in explainable classification comparable to other more complex approaches. We show that combining the reasoner's inferences with distributional semantics methods largely outperforms all the baselines, including complex models based on neural network architectures. Finally, we build a visualization tool to explore the potential of theory-based values classification, which is publicly available at http://xmv.geomeaning.com/.
... But the manual programming of such conversion rules is a tedious task and, therefore, rule-based approaches are hard to scale. The identification and validation of conversion rules may be alleviated by relying on existing syntactic parsers and lexical resources, such as WordNet [108], PropBank [43,109], VerbNet [83], FrameNet [80,7] and so on. A major obstacle is that parsers and lexical resources are of (extremely) limited use beyond the domain they were prepared for, and few-to-none were prepared with ACC in mind. ...
Article
Full-text available
The digitalisation of the regulatory compliance process has been an active area of research for several decades. However, more recently the level of activity in this area has increased considerably. In the UK, the tragic Grenfell fire in 2017 has been a major catalyst for this as a result of the Hackitt report's recommendations pointing a lot of the blame on the broken regulatory regime in the country. The Hackitt report emphasises the need to overhaul the building regulations, but the approach to do so remains an open research question. Existing work in this space tends to overlook the processing of actual regulatory documents, or limits its scope to solving a relatively small subtask. This paper presents a new comprehensive platform approach to the digitalisation of regulatory compliance processing. We present i-ReC (intelligent Regulatory Compliance), a platform approach to digitalisation of regulatory compliance that takes into consideration the enormous diversity of all the stakeholders' activities. A historical perspective on research in this area is first presented, which identifies the challenges in such an endeavour and the gaps in the state of the art. After enumerating all the challenges in implementing a platform-based approach to digitalising the regulatory compliance process, the implementation of some parts of the platform is described. Our research demonstrates that the identification and extraction of all relevant requirements from the corpus of several hundred regulatory documents is a key part of the whole process, underlying everything from authoring to eventual compliance checking of designs. Some of the issues that need addressing in this endeavour include ambiguous language, inconsistent use of terms, contradicting requirements and handling multi-word expressions. The implementation of these tools is driven by NLP, ML and Semantic Web technologies. A semantic search engine was developed and validated against other popular and comparable engines with a corpus of 420 (out of about 800) documents used in the UK for compliance checking of building designs. In every search scenario, our search engine performed better on all objective criteria. Limitations of the approach are discussed, including the challenges around licensing for all the documents in the corpus. Further work includes improving the performance of SPaR.txt (the tool created to identify multi-word expressions) as well as the information retrieval engine by increasing the dataset and providing the model with examples from more diverse formats of regulations. There is also a need to develop and align strategies to collect a comprehensive set of domain vocabularies to be combined in a Knowledge Graph.
... Knowledge-based annotation projection has also been proposed: Padó and Lapata (2009) used FrameNet (Baker 2017) to create an English-German parallel corpus with high-quality annotations, with minimal human effort. Van der Plas et al. (2011) generated semantic role annotations for the French language using PropBank (Kingsbury and Palmer 2002), an English corpus annotated with semantic propositions. ...
Conference Paper
Full-text available
Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in the development of large language models (LLMs) where there is still a need for larger annotated datasets. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection and is freely available on GitHub (link: https://github.com/JamilProg/crosslingual_bert_annotation_projection). Leveraging a language-agnostic BERT-based approach, it is an efficient solution to increase low-resource corpora with little human effort and by only using already available open data resources. Quantitative and qualitative evaluations are often lacking when it comes to evaluating the quality and effectiveness of semi-automatic data generation strategies. The evaluation of our crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As a practical application of this methodology, we present the creation of French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2'051 synthetic clinical cases in French. The corpus is now available for researchers and practitioners to develop and refine French natural language processing (NLP) applications in the clinical field (https://zenodo.org/record/8355629), making it the largest open annotated corpus with linked medical concepts in French.
... As a fundamental research topic in the domain of information extraction, event extraction aims to identify instances of events and their arguments from unstructured data [7,11,20,44,65]. An event refers to a specific incident that involves a change of state and is marked by triggers such as verbs. ...
... VerbNet [77] is also a hierarchical network of English words but focuses only on verbs. It maps WordNet, PropBank [131], and FrameNet [132] verb types to their corresponding verb classes. Both WordNet and VerbNet can easily be read using NLTK interfaces. ...
Article
Full-text available
Storytelling and narrative are fundamental to human experience, intertwined with our social and cultural engagement. As such, researchers have long attempted to create systems that can generate stories automatically. In recent years, powered by deep learning and massive data resources, automatic story generation has shown significant advances. However, considerable challenges, like the need for global coherence in generated stories, still hamper generative models from reaching the same storytelling ability as human narrators. To tackle these challenges, many studies seek to inject structured knowledge into the generation process, which is referred to as structured knowledge-enhanced story generation. Incorporating external knowledge can enhance the logical coherence among story events, achieve better knowledge grounding, and alleviate over-generalization and repetition problems in stories. This survey provides the latest and comprehensive review of this research field: (i) we present a systematic taxonomy regarding how existing methods integrate structured knowledge into story generation; (ii) we summarize involved story corpora, structured knowledge datasets, and evaluation metrics; (iii) we give multidimensional insights into the challenges of knowledge-enhanced story generation and cast light on promising directions for future study.
... In SRL, predicates are typically verbs that represent an action or a state, and arguments are phrases representing roles or entities taking part in that action or state. The semantic roles captured by SRL are defined in frameworks such as PropBank [140,141] and FrameNet [142]. Frameworks such as PropBank use arguments like Arg0, Arg1, Arg2 and ArgM to represent different arguments in a sentence. ...
... Finally, the amount of VerbNet training data is relatively small (compared to PropBank (Kingsbury and Palmer, 2002) or AMR (Banarescu et al., 2013)), leading to misclassifications due to sparse data. All of these limitations can be improved by expanding the coverage of the VerbNet lexicon, and expanding and updating the VerbNet labeled data accordingly. ...
... The AMRs are subsequently transformed into a symbolic form using a deterministic AMR-to-triples approach. Abstract Meaning Representation (AMR): AMR parsing produces rooted, directed acyclic graphs from the input sentences, where each node represents concepts from propbank frames (Kingsbury and Palmer, 2002) or entities from the text. The edges represent the arguments for the semantic frames. ...
Preprint
Full-text available
Text-based reinforcement learning agents have predominantly been neural network-based models with embeddings-based representation, learning uninterpretable policies that often do not generalize well to unseen games. On the other hand, neuro-symbolic methods, specifically those that leverage an intermediate formal representation, are gaining significant attention in language understanding tasks. This is because of their advantages ranging from inherent interpretability, the lesser requirement of training data, and being generalizable in scenarios with unseen data. Therefore, in this paper, we propose a modular, NEuro-Symbolic Textual Agent (NESTA) that combines a generic semantic parser with a rule induction system to learn abstract interpretable rules as policies. Our experiments on established text-based game benchmarks show that the proposed NESTA method outperforms deep reinforcement learning-based techniques by achieving better generalization to unseen test games and learning from fewer training interactions.
... For example, the above sentence contains the arguments buyer and purchased item, but it could also contain seller and price. The NLP community has long been creating machine-readable lexical resources like FrameNet [7] and PropBank [8] that contain manually-labeled predicates and arguments. ...
Article
Full-text available
When the entity processing personal data (the processor) differs from the one collecting personal data (the controller), processing personal data is regulated in Europe by the General Data Protection Regulation (GDPR) through data processing agreements (DPAs). Checking the compliance of DPAs contributes to the compliance verification of software systems as DPAs are an important source of requirements for software development involving the processing of personal data. However, manually checking whether a given DPA complies with GDPR is challenging as it requires significant time and effort for understanding and identifying DPA-relevant compliance requirements in GDPR and then verifying these requirements in the DPA. Legal texts introduce additional complexity due to convoluted language and inherent ambiguity leading to potential misunderstandings. In this paper, we propose an automated solution to check the compliance of a given DPA against GDPR. In close interaction with legal experts, we first built two artifacts: (i) the “shall” requirements extracted from the GDPR provisions relevant to DPA compliance and (ii) a glossary table defining the legal concepts in the requirements. Then, we developed an automated solution that leverages natural language processing (NLP) technologies to check the compliance of a given DPA against these “shall” requirements. Specifically, our approach automatically generates phrasal-level representations for the textual content of the DPA and compares them against predefined representations of the “shall” requirements. By comparing these two representations, the approach not only assesses whether the DPA is GDPR compliant but it further provides recommendations about missing information in the DPA. Over a dataset of 30 actual DPAs, the approach correctly finds 618 out of 750 genuine violations while raising 76 false violations, and further correctly identifies 524 satisfied requirements. The approach has thus an average precision of 89.1%, a recall of 82.4%, and an accuracy of 84.6%. Compared to a baseline that relies on off-the-shelf NLP tools, our approach provides an average accuracy gain of approximately 20 percentage points. The accuracy of our approach can be improved to approximately 94% with limited manual verification effort.
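
Using the counts reported above, a quick micro-level computation closely reproduces the stated figures (the paper reports averages over DPAs, so the values need not match exactly):

```python
# Back-of-the-envelope check of the reported figures using the stated counts.
# The paper reports per-DPA averages, so these micro-level numbers are only
# expected to be close to (not identical with) 89.1% / 82.4% / 84.6%.

true_positives = 618                     # genuine violations correctly found
false_positives = 76                     # false violations raised
false_negatives = 750 - true_positives   # genuine violations missed
true_negatives = 524                     # satisfied requirements correctly identified

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
accuracy = (true_positives + true_negatives) / (
    true_positives + true_negatives + false_positives + false_negatives
)

print(f"precision ~ {precision:.1%}, recall ~ {recall:.1%}, accuracy ~ {accuracy:.1%}")
# precision ~ 89.0%, recall ~ 82.4%, accuracy ~ 84.6%
```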
... We do this by calculating the frequency of "agenthood" for a noun (agent ratio), i.e. dividing the number of times the noun appears as an agent by the number of times it is either an agent or patient. The ideal annotated corpus for this would be one with semantic role labels such as Propbank (Kingsbury and Palmer, 2002), where the "ARG0" label corresponds to agent and "ARG1" to patient. However, many of the nouns in our data appeared only a few times in Propbank or not at all: out of all 233 nouns, only 166 of them occurred within an ARG0 or ARG1 span. Thus, we also tried utilizing syntax as a proxy using Google Syntactic Ngrams biarcs (Goldberg and Orwant, 2013), as it is significantly larger. ...
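
The agent ratio described in this excerpt is simple to compute once the ARG0/ARG1 counts are available; a minimal sketch with invented counts:

```python
# Agent ratio as described above: how often a noun occurs as agent (ARG0)
# out of all its agent-or-patient (ARG0 or ARG1) occurrences.
# The counts below are invented for illustration.

arg_counts = {
    # noun: (times as ARG0 / agent, times as ARG1 / patient)
    "teacher": (40, 10),
    "window": (1, 59),
}

def agent_ratio(agent_count, patient_count):
    total = agent_count + patient_count
    return agent_count / total if total else None

for noun, (a, p) in arg_counts.items():
    print(noun, round(agent_ratio(a, p), 2))
# teacher 0.8
# window 0.02
```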
Preprint
Full-text available
Recent advances in large language models have prompted researchers to examine their abilities across a variety of linguistic tasks, but little has been done to investigate how models handle the interactions in meaning across words and larger syntactic forms -- i.e. phenomena at the intersection of syntax and semantics. We present the semantic notion of agentivity as a case study for probing such interactions. We created a novel evaluation dataset by utilizing the unique linguistic properties of a subset of optionally transitive English verbs. This dataset was used to prompt varying sizes of three model classes to see if they are sensitive to agentivity at the lexical level, and if they can appropriately employ these word-level priors given a specific syntactic context. Overall, GPT-3 text-davinci-003 performs extremely well across all experiments, outperforming all other models tested by far. In fact, the results are even better correlated with human judgements than both syntactic and semantic corpus statistics. This suggests that LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery than select corpora for certain tasks.
... Another work explores parallel temporal and causal relations. They designed a corpus of causal and parallel temporal relations to fill a gap in the relation configuration annotated by existing resources including Penn Discourse Treebank (PDTB) (Prasad et al., n.d.), PropBank (Kingsbury and Palmer, 2002), and TimeBank (Pustejovsky et al., 2003b). This work defines the annotations for the corpus for both relation types, and an initial effort creates joint events. ...
Article
Full-text available
Researchers in natural language processing are paying more attention to causality mining. Numerous applications of the growing need for efficient and accurate causality mining include question answering, future event prediction, discourse comprehension, decision making, scenario generation, medical text mining, and textual entailment. Although causality has long been in the spotlight, there are still issues that need to be addressed. This study provides a comprehensive review of causality mining for various application domains available in the new-age literature from 1989 to 2022. We searched and rigorously examined numerous papers in the most reliable libraries for the review, and the terminologies that drive the context are described. Each paper underwent a thorough review process to extract the following meta-data: techniques, target domains, datasets, features, and limits of each approach. This meta-data will aid researchers in selecting the strategy that is most suited to their research needs. The literature is divided into three groups based on critical reviews including traditional, machine learning-based, and deep learning-based approaches. A concise taxonomy that can substantially help new scholars comprehend the field is developed. In order to make it simple for new researchers to start their research, various perspectives and suggestions are offered.
... Another rich source of additional training data is PropBank (Kingsbury and Palmer, 2002). PropBank is a similar frame parsing dataset to FrameNet, although PropBank tends to focus more on verbs than FrameNet and has simpler arguments. ...
Preprint
Full-text available
While the state-of-the-art for frame semantic parsing has progressed dramatically in recent years, it is still difficult for end-users to apply state-of-the-art models in practice. To address this, we present Frame Semantic Transformer, an open-source Python library which achieves near state-of-the-art performance on FrameNet 1.7, while focusing on ease-of-use. We use a T5 model fine-tuned on Propbank and FrameNet exemplars as a base, and improve performance by using FrameNet lexical units to provide hints to T5 at inference time. We enhance robustness to real-world data by using textual data augmentations during training.
... The OntoNotes corpus [10] has long been the standard for benchmarking both entity and event coreference resolution systems in the English language domain. This large-scale multi-domain text collection includes Treebank [11], Propbank [12,13] and within-document coreference annotations both at the entity and event level. A notable caveat of the OntoNotes corpus, however, is that no distinction is made between the entity and event labels; as such, they are both simply designated as a mention. ...
Article
Full-text available
In this paper, we present a benchmark result for end-to-end cross-document event coreference resolution in Dutch. First, the state of the art of this task in other languages is introduced, as well as currently existing resources and commonly used evaluation metrics. We then build on recently published work to fully explore end-to-end event coreference resolution for the first time in the Dutch language domain. For this purpose, two well-performing transformer-based algorithms for the respective detection and coreference resolution of Dutch textual events are combined in a pipeline architecture and compared to baseline scores relying on feature-based methods. The results are promising and comparable to similar studies in higher-resourced languages; however, they also reveal that in this specific NLP domain, much work remains to be done. In order to gain more insights, an in-depth analysis of the two pipeline components is carried out to highlight and overcome possible shortcomings of the current approach and provide suggestions for future work.
... Top-down curated knowledge databases such as WordNet [6] and its counterparts in other languages [12] form the basis for computational lexicons and the contextualization of paradigmatic hypernymy relations. WordNet lexical synsets, and later VerbNet [13], PropBank [14], BabelNet [15], and VerbAtlas [16] encode lexical semantic knowledge using word senses as units of meaning. A major problem with WordNet is the top-down structure of curated resource creation, which inevitably leads to less granularity and the static nature of the inventories. ...
Article
Full-text available
We present a graph-based method for the lexical task of labeling senses of polysemous lexemes. The labeling task aims at generalizing sense features of a lexical item in a corpus using more abstract concepts. In this method, a coordination dependency-based lexical graph is first constructed with clusters of conceptually associated lexemes representing related senses and conceptual domains of a source lexeme. The label abstraction is based on the syntactic patterns of the x is_a y dependency relation. For each sense cluster, an additional lexical graph is constructed by extracting label candidates from a corpus and selecting the most prominent is_a collocates in the constructed label graph. The obtained label lexemes represent the sense abstraction of the cluster of conceptually associated lexemes. In a similar graph-based procedure, the semantic class representation is validated by constructing a WordNet hypernym relation graph. These additional labels indicate the most appropriate hypernym category of a lexical sense community. The proposed labeling method extracts hierarchically abstract conceptual content and the sense semantic features of the polysemous source lexeme, which can facilitate lexical understanding and build corpus-based taxonomies.
... VerbNet [60] is also a hierarchical network of English words but focuses only on verbs. It maps WordNet, PropBank [125], and FrameNet [126] verb types to their corresponding verb classes. Both WordNet and VerbNet can easily be read using NLTK interfaces. ...
Preprint
Full-text available
Storytelling and narrative are fundamental to human experience, intertwined with our social and cultural engagement. As such, researchers have long attempted to create systems that can generate stories automatically. In recent years, powered by deep learning and massive data resources, automatic story generation has shown significant advances. However, considerable challenges, like the need for global coherence in generated stories, still hamper generative models from reaching the same storytelling ability as human narrators. To tackle these challenges, many studies seek to inject structured knowledge into the generation process, which is referred to as structured knowledge-enhanced story generation. Incorporating external knowledge can enhance the logical coherence among story events, achieve better knowledge grounding, and alleviate over-generalization and repetition problems in stories. This survey provides the latest and comprehensive review of this research field: (i) we present a systematic taxonomy regarding how existing methods integrate structured knowledge into story generation; (ii) we summarize involved story corpora, structured knowledge datasets, and evaluation metrics; (iii) we give multidimensional insights into the challenges of knowledge-enhanced story generation and cast light on promising directions for future study.
... For example, the above sentence contains the arguments buyer and purchased item, but it could also contain seller and price. The NLP community has long been creating machine-readable lexical resources like FrameNet [8] and PropBank [9] that contain manually-labeled predicates and arguments. ...
Preprint
Processing personal data is regulated in Europe by the General Data Protection Regulation (GDPR) through data processing agreements (DPAs). Checking the compliance of DPAs contributes to the compliance verification of software systems as DPAs are an important source of requirements for software development involving the processing of personal data. However, manually checking whether a given DPA complies with GDPR is challenging as it requires significant time and effort for understanding and identifying DPA-relevant compliance requirements in GDPR and then verifying these requirements in the DPA. In this paper, we propose an automated solution to check the compliance of a given DPA against GDPR. In close interaction with legal experts, we first built two artifacts: (i) the "shall" requirements extracted from the GDPR provisions relevant to DPA compliance and (ii) a glossary table defining the legal concepts in the requirements. Then, we developed an automated solution that leverages natural language processing (NLP) technologies to check the compliance of a given DPA against these "shall" requirements. Specifically, our approach automatically generates phrasal-level representations for the textual content of the DPA and compares it against predefined representations of the "shall" requirements. Over a dataset of 30 actual DPAs, the approach correctly finds 618 out of 750 genuine violations while raising 76 false violations, and further correctly identifies 524 satisfied requirements. The approach has thus an average precision of 89.1%, a recall of 82.4%, and an accuracy of 84.6%. Compared to a baseline that relies on off-the-shelf NLP tools, our approach provides an average accuracy gain of ~20 percentage points. The accuracy of our approach can be improved to ~94% with limited manual verification effort.
... Large-scale annotated data is a prerequisite to develop high-performance SRL systems (Fürstenau and Lapata, 2009; Xia et al., 2020). The most representative ones in English are FrameNet (Baker et al., 1998), PropBank (Kingsbury and Palmer, 2002), and NomBank (Meyers et al., 2004). FrameNet is a large-scale manually annotated semantic lexicon resource and uses semantic frames to represent meanings of words. ...
... One key abstraction we use is the predicate-argument structure provided by Semantic Role Labeling (SRL). Many SRL systems have been designed (Gildea and Jurafsky 2002; Punyakanok, Roth, and Yih 2008) using linguistic resources such as FrameNet (Baker, Fillmore, and Lowe 1998), PropBank (Kingsbury and Palmer 2002), and NomBank (Meyers et al. 2004). These systems are meant to convey high-level information about predicates (which can be a verb, a noun, etc.) and related elements in the text. ...
Article
We propose a novel method for exploiting the semantic structure of text to answer multiple-choice questions. The approach is especially suitable for domains that require reasoning over a diverse set of linguistic constructs but have limited training data. To address these challenges, we present the first system, to the best of our knowledge, that reasons over a wide range of semantic abstractions of the text, which are derived using off-the-shelf, general-purpose, pre-trained natural language modules such as semantic role labelers, coreference resolvers, and dependency parsers. Representing multiple abstractions as a family of graphs, we translate question answering (QA) into a search for an optimal subgraph that satisfies certain global and local properties. This formulation generalizes several prior structured QA systems. Our system, SEMANTICILP, demonstrates strong performance on two domains simultaneously. In particular, on a collection of challenging science QA datasets, it outperforms various state-of-the-art approaches, including neural models, broad coverage information retrieval, and specialized techniques using structured knowledge bases, by 2%-6%.
... In the graph, each node is labeled by a variable and a concept (e.g., "d / dog") or a constant (e.g. negation, a quantity), while each edge represents a relation, similar to PropBank arguments [10]. The representation can be roughly expressed as a grammar in Backus-Naur form as in Fig. 2. Notably, there isn't always a straightforward alignment between words in a sentence and its AMR nodes; the mapping from a graph node to a word is neither injective nor surjective (i.e., each word can correspond to zero or more nodes). ...
Preprint
The ability to understand and generate language sets human cognition apart from that of other known life forms. We study a way of combining two of the most successful routes to the meaning of language--statistical language models and symbolic semantics formalisms--in the task of semantic parsing. Building on a transition-based, Abstract Meaning Representation (AMR) parser, AmrEager, we explore the utility of incorporating pretrained context-aware word embeddings--such as BERT and RoBERTa--in the problem of AMR parsing, contributing a new parser we dub AmrBerger. Experiments find these rich lexical features alone are not particularly helpful in improving the parser's overall performance as measured by the SMATCH score when compared to the non-contextual counterpart, while additional concept information empowers the system to outperform the baselines. Through a lesion study, we found the use of contextual embeddings helps to make the system more robust against the removal of explicit syntactical features. These findings expose the strengths and weaknesses of the contextual embeddings and the language models in the current form, and motivate deeper understanding thereof.
Conference Paper
Full-text available
Universal Semantic Representation (USR) is designed as a language-independent information packaging system that captures information at three levels: (a) Lexico-conceptual, (b) Syntactico-Semantic, and (c) Discourse. Unlike other representations that mainly encode predicates and their argument structures, our proposed representation captures the speaker's vivakṣā: how the speaker views the activity. The idea of "speaker's vivakṣā" is inspired by the Indian Grammatical Tradition. There can be some amount of idiosyncrasy of the speaker in the annotation since it is the speaker's viewpoint that has been captured in the annotation. Hence the evaluation metrics of such resources need to also be thought through from scratch. This paper presents an extensive evaluation procedure of this semantic representation from two perspectives: (a) Inter-Annotator Agreement and (b) utility for the downstream task of multilingual Natural Language Generation. We also qualitatively evaluate the experience of natural language generation by manual parsing of USR, in order to understand the readability of USR. We have achieved above 80% Inter-Annotator Agreement for USR annotations and above 80% semantic similarity in multilingual generation tasks, suggesting the reliability of USR annotations and their utility for multilingual generation. The qualitative evaluation also suggests high readability and hence utility of USR as a semantic representation.
Chapter
In the field of Machine Reading Comprehension (MRC), existing models have already surpassed human average performance in many tasks such as SQuAD. In recent years, more challenging MRC datasets have been introduced, such as ReClor and LogiQA datasets. These datasets place a greater emphasis on evaluating the logical reasoning abilities of models. To enhance the model's logical reasoning capabilities, we propose the AMR-CL method, a contrastive learning pretraining approach based on AMR (Abstract Meaning Representation) logical graphs. We employ an AMR parser to construct AMR logical graphs that represent the semantic information implied in the text. Then, we enhance the logical relationships in the AMR graph based on logical predicates and perform logical expansion using the principle of logical equivalence. We create logically consistent positive examples and logically inconsistent negative examples using logical equivalences for data augmentation. Contrastive learning is applied to help models better understand logical information within the text. We conducted experiments on two logical reasoning datasets, ReClor and LogiQA, and the results confirm the effectiveness of our method.
Article
Full-text available
One of the most used and well-known semantic representation models is Abstract Meaning Representation (AMR). This representation has had numerous applications in natural language processing tasks in recent years. Currently, large annotated corpora are available for English and Chinese. In addition, smaller related corpora have been generated for some low-resource languages. However, to the best of our knowledge, there is as yet no AMR corpus for the Persian/Farsi language. Therefore, the aim of this paper is to create a Persian AMR (PAMR) corpus by translating English sentences and adjusting the AMR guidelines, and to solve the various challenges that are faced in this regard. The result of this research is a corpus containing 1020 Persian sentences and their related AMRs, which can be used in various natural language processing tasks. In this paper, to investigate the feasibility of using the corpus, we have applied it to two natural language processing tasks: Sentiment Analysis and Text Summarization.
Preprint
Contemporary news reporting increasingly features multimedia content, motivating research on multimedia event extraction. However, the task lacks annotated multimodal training data and artificially generated training data suffer from the distribution shift from the real-world data. In this paper, we propose Cross-modality Augmented Multimedia Event Learning (CAMEL), which successfully utilizes artificially generated multimodal training data and achieves state-of-the-art performance. Conditioned on unimodal training data, we generate multimodal training data using off-the-shelf image generators like Stable Diffusion and image captioners like BLIP. In order to learn robust features that are effective across domains, we devise an iterative and gradual annealing training strategy. Substantial experiments show that CAMEL surpasses state-of-the-art (SOTA) baselines on the M2E2 benchmark. On multimedia events in particular, we outperform the prior SOTA by 4.2% F1 on event mention identification and by 9.8% F1 on argument identification, which demonstrates that CAMEL learns synergistic representations from the two modalities.
Article
Design Rule Checking (DRC) is a critical step in integrated circuit design. DRC requires formatted scripts as the input to design rule checkers. However, these scripts are manually generated in the foundry, which is tedious and error-prone for generation of thousands of rules in advanced technology nodes. To mitigate this issue, we propose the first DRC script generation framework, leveraging a deep learning-based key information extractor to automatically identify essential arguments from rules and a script translator to organize the extracted arguments into executable DRC scripts. We further enhance the performance of the extractor with three specific design rule generation techniques and a multi-task learning-based rule classification module. Experimental results demonstrate that the framework can generate a single rule script in 5.46ms on average, with the extractor achieving 91.1% precision and 91.8% recall on the key information extraction. Compared with the manual generation, our framework can significantly reduce the turn-around time and speed up process design closure.
Preprint
Full-text available
Social norms underlie all human social interactions, yet formalizing and reasoning with them remains a major challenge for AI systems. We present a novel system for taking social rules of thumb (ROTs) in natural language from the Social Chemistry 101 dataset and converting them to first-order logic where reasoning is performed using a neuro-symbolic theorem prover. We accomplish this in several steps. First, ROTs are converted into Abstract Meaning Representation (AMR), which is a graphical representation of the concepts in a sentence, and align the AMR with RoBERTa embeddings. We then generate alternate simplified versions of the AMR via a novel algorithm, recombining and merging embeddings for added robustness against different wordings of text, and incorrect AMR parses. The AMR is then converted into first-order logic, and is queried with a neuro-symbolic theorem prover. The goal of this paper is to develop and evaluate a neuro-symbolic method which performs explicit reasoning about social situations in a logical form.
Article
Full-text available
Computational approaches to the study of language emergence can help us understand how natural languages are shaped by cognitive and sociocultural factors. Previous work focused on tasks where agents refer to a single entity. In contrast, we study how agents predicate, that is, how they express that some relation holds between several entities. We introduce a setup where agents talk about a variable number of entities that can be partially observed by the listener. In the presence of a least-effort pressure, they tend to discuss only entities that are not observed by the listener. Thus we can obtain artificial phrases that denote a single entity, as well as artificial sentences that denote several entities. In natural languages, if we ignore the verb, phrases are usually concatenated, either in a specific order or by adding case markers to form sentences. Our setup allows us to quantify how much this holds in emergent languages using a metric we call concatenability. We also measure transitivity, which quantifies the importance of word order. We demonstrate the usefulness of this new setup and metrics for studying factors that influence argument structure. We compare agents having access to input representations structured into pre-segmented objects with properties, versus unstructured representations. Our results indicate that the awareness of object structure yields a more natural sentence organization.
Conference Paper
This paper studies the use of NLP techniques, in particular neural language models, for the generation of question/answer exercises from English texts. The experiments aim to generate beginner-level exercises from simple texts, to be used in teaching ESL (English as a Second Language) to children. The approach presented in this paper is based on four stages: a pre-processing stage that, among other basic tasks, applies a co-reference resolution tool; an answer candidate selection stage based on semantic role labeling; a question generation stage that takes as input the text with resolved co-references and returns a set of questions for each answer candidate, using a language model based on the Transformer architecture; and a post-processing stage that adjusts the format of the generated questions. The question generation model was evaluated on a benchmark, obtaining results comparable to previous work, and the complete pipeline was evaluated on a corpus specifically created for this task, achieving good results. Pre-print: https://www.colibri.udelar.edu.uy/jspui/handle/20.500.12008/37155
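As a rough illustration of how such a four-stage pipeline might be wired together, the following is a minimal Python sketch. All helper functions (resolve_coreferences, label_semantic_roles, generate_question, postprocess) are hypothetical stubs standing in for the components described above; they are not the authors' implementation.

    # Minimal sketch of a four-stage question-generation pipeline.
    # Every helper below is a hypothetical placeholder for a coreference
    # resolver, an SRL system, a Transformer-based question generator,
    # and a formatting step.
    from typing import List, Tuple

    def resolve_coreferences(text: str) -> str:
        """Pre-processing: replace pronouns with their antecedents (stub)."""
        return text  # a real system would call a coreference resolution tool

    def label_semantic_roles(text: str) -> List[str]:
        """Answer-candidate selection: return argument spans found by SRL (stub)."""
        return ["the dog", "in the garden"]  # illustrative candidates only

    def generate_question(context: str, answer: str) -> str:
        """Question generation: placeholder for a seq2seq model query."""
        return f"what is mentioned together with '{answer}'?"

    def postprocess(question: str) -> str:
        """Post-processing: normalize casing and punctuation."""
        question = question.strip()
        return question[0].upper() + question[1:] if question else question

    def build_exercises(text: str) -> List[Tuple[str, str]]:
        resolved = resolve_coreferences(text)
        candidates = label_semantic_roles(resolved)
        return [(postprocess(generate_question(resolved, a)), a) for a in candidates]

    if __name__ == "__main__":
        for q, a in build_exercises("The dog sleeps in the garden. It likes the sun."):
            print(q, "->", a)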
Article
A discourse, consisting of one or more sentences, describes everyday issues and events through which people communicate their thoughts and opinions. As sentences normally consist of multiple text segments, correctly understanding the theme of a discourse requires taking into account the relations between those segments. Although raw text sometimes contains a connective that conveys such a relation, it is more often the case that no connective appears between two text segments even though an implicit relation holds between them. The task of implicit discourse relation recognition (IDRR) is to detect an implicit relation and classify its sense between two text segments without a connective. The IDRR task is important to diverse downstream natural language processing tasks, such as text summarization and machine translation. This article provides a comprehensive and up-to-date survey of the IDRR task. We first summarize the task definition and the data sources widely used in the field. We then categorize the main solution approaches to the IDRR task from the viewpoint of their development history. In each solution category, we present and analyze the most representative methods, including their origins, ideas, strengths, and weaknesses. We also present performance comparisons for solutions evaluated on a public corpus with standard data processing procedures. Finally, we discuss future research directions for discourse relation analysis.
Conference Paper
As part of the unprecedented wealth of data available nowadays, semi-formal reports in the domain of remote sensing can convey information important for decision making in both structured and unstructured text parts. For such reports, often kept in large data management systems, targeted information retrieval remains difficult, e.g., the extraction of text parts relevant to a question posed in natural language. The work presented in this paper therefore aims at finding the relevant documents in data management systems and extracting their relevant content based on natural language questions. For this purpose, an approach to semantic information retrieval based on Abstract Meaning Representation (AMR) is adapted, extended, and evaluated for the domain of remote sensing and image exploitation. Specifically, two metrics used with AMR, Smatch and SemBleu, are compared for their suitability in AMR-based search. The first results presented in this paper are promising. In addition, more detailed experiments on the behaviour of the metrics under differently formulated yet semantically equivalent questions reveal interesting insights into their ability to perform semantic comparison.
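For intuition about what a Smatch-style comparison measures, the sketch below computes precision, recall, and F1 over the relation triples of two AMRs once a variable mapping has been fixed. The actual Smatch metric additionally maximizes F1 over variable mappings (typically with restarts of a hill-climbing search), which is omitted here, and the triples are made-up examples.

    # Triple-overlap F1 between two AMRs for one fixed variable mapping.
    # Real Smatch scores are the maximum F1 over all variable mappings,
    # usually approximated with a hill-climbing search; that search is omitted.

    def triple_f1(candidate: set, gold: set) -> tuple:
        matched = len(candidate & gold)
        precision = matched / len(candidate) if candidate else 0.0
        recall = matched / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        return precision, recall, f1

    # Illustrative triples for "the analyst finds the object" vs. a paraphrase.
    candidate_triples = {
        ("instance", "f", "find-01"),
        ("instance", "a", "analyst"),
        ("instance", "o", "object"),
        ("ARG0", "f", "a"),
        ("ARG1", "f", "o"),
    }
    gold_triples = {
        ("instance", "f", "find-01"),
        ("instance", "p", "person"),
        ("instance", "o", "object"),
        ("ARG0", "f", "p"),
        ("ARG1", "f", "o"),
    }

    print(triple_f1(candidate_triples, gold_triples))  # approximately (0.6, 0.6, 0.6)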
Chapter
A temporal knowledge graph (TKG) comprises facts aligned with timestamps. Question answering over TKGs (TKGQA) finds an entity or timestamp to answer a question with certain temporal constraints. Current studies assume that the questions are fully annotated before being fed into the system, and treat question answering as a link prediction task. Moreover, the process of choosing answers is not interpretable due to the implicit reasoning in the latent space. In this paper, we propose a semantic parsing based method, namely AE-TQ, which leverages abstract meaning representation (AMR) for understanding complex questions, and produces question-oriented semantic information for explicit and effective temporal reasoning. We evaluate our method on CronQuestions, the largest known TKGQA dataset, and the experiment results demonstrate that AE-TQ empirically outperforms several competing methods in various settings. Keywords: temporal knowledge graphs, question answering, abstract meaning representation
Preprint
Full-text available
Development of task guidance systems for aiding humans in a situated task remains a challenging problem. The role of search (information retrieval) and conversational systems for task guidance has immense potential to help task performers achieve various goals. However, several technical challenges need to be addressed to deliver such conversational systems, and common supervised approaches fail to deliver the expected results in terms of overall performance, user experience, and adaptation to realistic conditions. In this preliminary work we first highlight some of the challenges involved in the development of such systems. We then provide an overview of existing datasets and highlight their limitations. Finally, we develop a model-in-the-loop, wizard-of-oz based data collection tool and perform a pilot experiment.
Chapter
Reading comprehension involves the ability to read a text, understand the meaning of the given passage, and answer questions based on it. It is a challenging task for machines, as it requires natural language understanding and knowledge about the world. In order to understand the meaning of the text, it is necessary to identify the context and organization of the given passage. It also involves the ability to draw inferences from the sentences in the paragraph. The reasoning required depends on the type of comprehension dataset: answers may be based on a single sentence or on a set of sentences. The complexity of the reasoning required by a reading comprehension dataset depends on several factors, such as the source of the paragraphs, the ordering of sentences within them, and the type of questions. Answering questions in context requires identification of linguistic features based on syntax, semantics, and the different sentence patterns appearing in the paragraph text. This work presents an exhaustive study of the various approaches used by different authors for feature extraction in the domain of reading comprehension. Most recent work in reading comprehension focuses on the application of deep learning algorithms, which may not work well with low-resource data. Low-resource datasets with diverse linguistic features require a deep understanding of the text. Finally, we discuss some of the open challenges in modeling comprehension systems. Keywords: reasoning, machine learning, word embedding, semantic features
Chapter
In this chapter, we describe some of the relevant past literature on Relation Extraction.
Article
This paper presents our basic approach to creating Proposition Bank, which involves adding a layer of semantic annotation to the Penn English TreeBank. Without attempting to confirm or disconfirm any particular semantic theory, our goal is to provide consistent argument labeling that will facilitate the automatic extraction of relational data. An argument such as the window in John broke the window and in The window broke would receive the same label in both sentences. In order to ensure reliable human annotation, we provide our annotators with explicit guidelines for labeling all of the syntactic and semantic frames of each particular verb. We give several examples of these guidelines and discuss the inter-annotator agreement figures. We also discuss our current experiments on the automatic expansion of our verb guidelines based on verb class membership. Our current rate of progress and our consistency of annotation demonstrate the feasibility of the task.
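As an illustration of the labeling described above, consider a simplified sketch in the style of a PropBank frames-file entry for break (the roleset text is paraphrased for illustration, not quoted from the released resource). The argument the window carries the same label in both the transitive and the inchoative use:

    Roleset break.01 (simplified sketch):
      Arg0: breaker (agent)
      Arg1: thing broken

    [Arg0 John] broke [Arg1 the window].
    [Arg1 The window] broke.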
Conference Paper
In this paper we specifically address questions of polysemy with respect to verbs, and how regular extensions of meaning can be achieved through the adjunction of particular syntactic phrases. We see verb classes as the key to making generalizations about regular extensions of meaning. Current approaches to English classification, Levin classes and WordNet, have limitations in their applicability that impede their utility as general classification schemes. We present a refinement of Levin classes, intersective sets, which are a more fine-grained classification and have more coherent sets of syntactic frames and associated semantic components. We have preliminary indications that the membership of our intersective sets will be more compatible with WordNet than the original Levin classes. We also have begun to examine related classes in Portuguese, and find that these verbs demonstrate similarly coherent syntactic and semantic properties.
Conference Paper
This paper describes an approach for handling structural divergences and recovering dropped arguments in an implemented Korean to English machine translation system. The approach relies on canonical predicate-argument structures (or dependency structures), which provide a suitable pivot representation for the handling of structural divergences and the recovery of dropped arguments. It can also be converted to and from the interface representations of many off-the-shelf parsers and...
Article
In this rich reference work, Beth Levin classifies over 3,000 English verbs according to shared meaning and behavior. Levin starts with the hypothesis that a verb's meaning influences its syntactic behavior and develops it into a powerful tool for studying the English verb lexicon. She shows how identifying verbs with similar syntactic behavior provides an effective means of distinguishing semantically coherent verb classes, and isolates these classes by examining verb behavior with respect to a wide range of syntactic alternations that reflect verb meaning. The first part of the book sets out alternate ways in which verbs can express their arguments. The second presents classes of verbs that share a kernel of meaning and explores in detail the behavior of each class, drawing on the alternations in the first part. Levin's discussion of each class and alternation includes lists of relevant verbs, illustrative examples, comments on noteworthy properties, and bibliographic references. The result is an original, systematic picture of the organization of the verb inventory. Easy to use, English Verb Classes and Alternations sets the stage for further explorations of the interface between lexical semantics and syntax. It will prove indispensable for theoretical and computational linguists, psycholinguists, cognitive scientists, lexicographers, and teachers of English as a second language.
Article
We present a language model in which the probability of a sentence is the sum of the individual parse probabilities, and these are calculated using a probabilistic context-free grammar (PCFG) plus statistics on individual words and how they fit into parses. We have used the model to improve syntactic disambiguation. After training on Wall Street Journal (WSJ) text we tested on about 200 WSJ sentences restricted to the 5,400 most common words from our training data. We observed a 41% reduction in bracket-crossing errors compared to the performance of our PCFG without the use of the word statistics.
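One way to write the core quantity described in the abstract, with $\tau(s)$ denoting the set of parses the grammar admits for sentence $s$:

    P(s) = \sum_{T \in \tau(s)} P(T, s)

where each parse probability $P(T, s)$ combines the PCFG rule probabilities with statistics on the individual words and how they fit into the parse $T$.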
Article
We present a system for identifying the semantic relationships, or semantic roles, filled by constituents of a sentence within a semantic frame. We use frame semantics as a level of representation intermediate between task-specific templates commonly used in information extraction and complete theories of language understanding using complex semantic structures.
Article
... bibliography and index) is therefore something like a dictionary, and accordingly difficult to review exhaustively. Some clarification of the book's contents may be in order. After an introduction, which sets out the theoretical perspective of the book, and which we shall discuss below, the first part deals with various essentially syntactic alternations that verbs are subject to. Beginning with transitivity alternations, such as the well-known middle, causative-inchoative, and some lesser-known ones such as the "characteristic property of instrument alternation," we move on to alternations involving arguments within the VP, such as dative shift, double object constructions, and spray paint constructions. Next come cases of "oblique" subjects (instruments, locations, etc.), reflexives, passives, subject inversions, cognate objects, and so on. In each case, the construction is explained and bibliographic references are provided, often in abundance (this is one of the book's major strengths) ...
Article
In this paper we first propose a new statistical parsing model, which is a generative model of lexicalised context-free grammar. We then extend the model to include a probabilistic treatment of both subcategorisation and wh-movement. Results on Wall Street Journal text show that the parser performs at 88.1/87.5% constituent precision/recall, an average improvement of 2.3% over (Collins 96).
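A simplified sketch of the head-driven decomposition underlying such a generative lexicalised model (our paraphrase; it omits the distance measures, subcategorisation frames, and wh-movement treatment that the full model adds): the probability of an expansion $P(h) \rightarrow L_n(l_n) \ldots L_1(l_1)\, H(h)\, R_1(r_1) \ldots R_m(r_m)$, where $H$ is the head child and $L_i$, $R_i$ are left and right modifiers with their head words, is generated as

    P_h(H \mid P, h) \times \prod_{i=1}^{n+1} P_l(L_i(l_i) \mid P, H, h) \times \prod_{i=1}^{m+1} P_r(R_i(r_i) \mid P, H, h)

with $L_{n+1}$ and $R_{m+1}$ fixed to a STOP symbol, so that the number of modifiers on each side is itself generated by the model.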
Article
FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a) a commitment to corpus evidence for semantic and syntactic generalizations, and (b) the representation of the valences of its target words (mostly nouns, adjectives, and verbs) in which the semantic portion makes use of frame semantics. The resulting database will contain (a) descriptions of the semantic frames underlying the meanings of the words described, and (b) the valence representation (semantic and syntactic) of several thousand words and phrases, each accompanied by (c) a representative collection of annotated corpus attestations, which jointly exemplify the observed linkings between "frame elements" and their syntactic realizations (e.g. grammatical function, phrase type, and other syntactic traits). This report will present the project's goals and workflow, and information about the computational tools that have been adapted or created in-house for this work.
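As an illustrative example of the kind of frame-element annotation such a database records (a simplified sketch; the frame-element inventory and labels only approximate FrameNet's conventions), a commercial transaction verb such as buy evokes the Commerce_buy frame:

    Frame: Commerce_buy   (simplified sketch)
      Frame elements: Buyer, Seller, Goods, Money
      Annotated attestation:
        [Buyer Chuck] bought [Goods a used car] [Seller from a dealer] [Money for $2,000].
      Valence pattern (approximate): Buyer = NP.Ext, Goods = NP.Obj, Seller = PP[from].Dep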
Building a large annotated corpus of English: the Penn Treebank
  • M Marcus
  • B Santorini
  • M A Marcinkiewicz
Marcus, M., Santorini, B., & Marcinkiewicz, M.A. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2), 313−330.
Five papers on WordNet
  • G Miller
  • R Beckwith
  • C Fellbaum
  • D Gross
  • K Miller
Miller, G., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. (1993). Five papers on WordNet. International Journal of Lexicography, 3(4), 235−312.
The Penn treebank: A revised corpus design for extracting predicate−argument structure
  • M Marcus
Marcus, M. (1994). The Penn treebank: A revised corpus design for extracting predicate−argument structure. In Proceedings of the ARPA Human Language Technology Workshop, Princeton, NJ.
Adding Predicate Argument Structure to the Penn TreeBank
  • P Kingsbury
  • M Marcus
  • M Palmer
Kingsbury, P., Marcus, M. & Palmer, M. (forthcoming). Adding Predicate Argument Structure to the Penn TreeBank. In Proceedings of the Human Language Technology Conference, San Diego, CA, March 2002.
SBIR−II: Korean−English Machine Translation of Battlefield Messages
  • R Kittredge
  • T Korelsky
  • B Lavoie
  • M Palmer
  • C Han
  • C Park
  • A Bies
Kittredge, R., Korelsky, T., Lavoie, B., Palmer, M., Han, C., Park, C., & Bies, A. (2001). SBIR−II: Korean−English Machine Translation of Battlefield Messages. Final Report to the Army Research Lab.