Conference Paper

Sangrahaka: a tool for annotating and querying knowledge graphs

Authors: Hrishikesh Terdalkar, Arnab Bhattacharya
... The tool doccano, although simple to set up and intuitive, only supports labeling tasks. Sangrahaka (Terdalkar and Bhattacharya, 2021), while easy to set up and use, focuses only on annotation for knowledge graph creation and lacks support for general-purpose NLP annotation tasks. ...
... Following Terdalkar and Bhattacharya (2021), we have evaluated our tool using a two-fold evaluation method of subjective and objective evaluation. We have not used the time taken for annotation as an evaluation metric, since annotators often spend more time processing the text to identify the relevant information than physically annotating. ...
... The objective evaluation utilized a scoring mechanism similar to that employed in previous studies (Neves and Ševa, 2021; Terdalkar and Bhattacharya, 2021). We retained the additional categories introduced by Terdalkar and Bhattacharya (2021) while incorporating supplementary categories pertinent to the comprehensive assessment of a general-purpose NLP tool. ...
... Each JSON object contains the text of the line and optional extra information such as word segmentation, verse id, and linguistic information. The structure of the JSON file corresponding to a chapter is explained in Appendix A of [27]. This information is then organized in a hierarchical structure with four levels: Corpus, Chapter, Verse and Line. ...
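For illustration, a minimal sketch (in Python) of what such a per-chapter JSON file might contain; the field names used here ("verse", "text", "split", "analysis") are assumptions for illustration, and the authoritative structure is the one given in Appendix A of [27]:

# Illustrative sketch of the per-chapter JSON layout described above.
# Field names are assumptions; see Appendix A of [27] for the real structure.
import json

chapter = [
    {
        "verse": 1,                         # verse id within the chapter
        "text": "first line of the verse",  # text of the line
        "split": ["first", "line", "of", "the", "verse"],  # optional word segmentation
        "analysis": {},                     # optional linguistic information
    },
    # ... one object per line; per-chapter files then stack into the
    # four-level hierarchy Corpus -> Chapter -> Verse -> Line
]

print(json.dumps(chapter, indent=2, ensure_ascii=False))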
... Placeholder variables represent values where user input is expected. Query templates are provided in a JSON file whose structure is explained in Appendix A of [27]. The natural language query template, combined with user input, forms a valid natural language question, and the same replacement in the Cypher query template forms a valid Cypher query. ...
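A minimal sketch of this substitution mechanism, assuming a hypothetical {0}-style placeholder, invented template field names, and an invented IS_SYNONYM_OF relationship type; the actual template format is specified in Appendix A of [27]:

# Hypothetical query template; the field names, placeholder syntax, and
# relationship type are assumptions for illustration only.
template = {
    "nlq": "What are the synonyms of {0}?",
    "cypher": "MATCH (a:ENTITY)-[:IS_SYNONYM_OF]-(b:ENTITY) "
              "WHERE a.lemma = '{0}' RETURN b.lemma",
}

user_input = "dhanya"
question = template["nlq"].format(user_input)   # a valid natural language question
query = template["cypher"].format(user_input)   # the same replacement yields valid Cypher
print(question)
print(query)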
... Sangrahaka outperformed the other tools with a score of 0.82, compared to FLAT (0.78), WebAnno (0.74) and BRAT (0.70). Further details about the evaluation can be found in Appendix B of [27]. ...
Preprint
Full-text available
In this work, we present Sangrahaka, a web-based annotation and querying tool. It annotates entities and relationships from text corpora and constructs a knowledge graph (KG). The KG is queried using templatized natural language queries. The application is language- and corpus-agnostic, but can be tuned for the special needs of a specific language or corpus. A customized version of the framework has been used in two annotation tasks. The application is available for download and installation. Besides having a user-friendly interface, it is fast, supports customization, and is fault-tolerant on both the client and server side. The code is available at https://github.com/hrishikeshrt/sangrahaka and the presentation with a demo is available at https://youtu.be/nw9GFLVZMMo.
... resolution capabilities. Ontologies (Bikaun et al. 2022; Terdalkar and Bhattacharya 2021; Tang et al. 2020; Oliveira and d'Aquin 2019) provide knowledge representations with configurable labels and knowledge graph capabilities but are not designed to handle heterogeneous data sources. Classification tools ...
Article
Full-text available
Accurate data annotation is essential to successfully implementing machine learning (ML) for regulatory compliance. Annotations allow organizations to train supervised ML algorithms and to adapt and audit the software they buy. The lack of annotation tools focused on regulatory data is slowing the adoption of established ML methodologies and process models, such as CRISP-DM, in various legal domains, including regulatory compliance. This article introduces Ant, an open-source annotation software for regulatory compliance. Ant is designed to adapt to complex organizational processes and to enable compliance experts to be in control of ML projects. Drawing on Business Process Modeling (BPM), we show that Ant can help remove major technical bottlenecks to implementing regulatory compliance through software, such as access to multiple sources of heterogeneous data and the integration of process complexities in the ML pipeline. We provide empirical data to validate the performance of Ant, illustrate its potential to speed up the adoption of ML in regulatory compliance, and highlight its limitations.
... Third, we annotate one complete chapter from the text (Dhānyavarga) and create a KG from the annotations. For this purpose, we deploy a customized instance of Sangrahaka, an annotation and querying framework we developed previously (Terdalkar and Bhattacharya, 2021). We also create 31 query templates in English and Sanskrit to feed into the templatized querying interface, which aids users in finding answers to objective questions related to the corpus. ...
Preprint
Full-text available
Knowledge bases (KB) are an important resource in a number of natural language processing (NLP) and information retrieval (IR) tasks, such as semantic search and automated question-answering. They are also useful for researchers trying to gain information from a text. Unfortunately, the state-of-the-art in Sanskrit NLP does not yet allow automated construction of knowledge bases, as the necessary tools and methods are either unavailable or not sufficiently accurate. Thus, in this work, we describe our efforts on manual annotation of Sanskrit text for the purpose of knowledge graph (KG) creation. We choose the chapter Dhanyavarga from Bhavaprakashanighantu of the Ayurvedic text Bhavaprakasha for annotation. The constructed knowledge graph contains 410 entities and 764 relationships. Since Bhavaprakashanighantu is a technical glossary text that describes various properties of different substances, we develop an elaborate ontology to capture the semantics of the entity and relationship types present in the text. To query the knowledge graph, we design 31 query templates that cover most of the common question patterns. For both manual annotation and querying, we customize the Sangrahaka framework previously developed by us. The entire system, including the dataset, is available from https://sanskrit.iitk.ac.in/ayurveda/ . We hope that the knowledge graph that we have created through manual annotation and subsequent curation will help in the development and testing of NLP tools in the future, as well as in the study of the Bhavaprakashanighantu text.
Conference Paper
Full-text available
Sanskrit, a language renowned for its profound literature and philosophical insights, has remained a cornerstone of ancient wisdom. In this digital age, the fusion of tradition and technology has led to the emergence of transformative tools such as the Sanskrit Teaching, Annotation, and Research Tool (START), being developed by the Department of Sanskrit Studies, University of Hyderabad. This research paper delves into the intricate features, methodologies, and implications of START in reshaping the landscape of Sanskrit research and teaching. By exploring its advanced annotation capabilities, collaborative potential and broader impact on the digital humanities, this paper demonstrates how START is redefining the boundaries of scholarly exploration and analysis.
Preprint
Full-text available
The primary focus of this thesis is to make Sanskrit manuscripts more accessible to the end-users through natural language technologies. The morphological richness, compounding, free word orderliness, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks, which are crucial for developing a robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any other downstream applications. However, it is challenging due to the sandhi phenomenon that modifies characters at word boundaries. Similarly, the existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes various contributions: (1) The thesis proposes linguistically-informed neural architectures for these tasks. (2) We showcase the interpretability and multilingual extension of the proposed systems. (3) Our proposed systems report state-of-the-art performance. (4) Finally, we present a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and web-based toolkit.
Preprint
Full-text available
We present a neural Sanskrit Natural Language Processing (NLP) toolkit named SanskritShala (a school of Sanskrit) to facilitate computational linguistic analyses for several tasks such as word segmentation, morphological tagging, dependency parsing, and compound type identification. Our systems currently report state-of-the-art performance on available benchmark datasets for all tasks. SanskritShala is deployed as a web-based application, which allows a user to get real-time analysis for the given input. It is built with easy-to-use interactive data annotation features that allow annotators to correct the system predictions when it makes mistakes. We publicly release the source codes of the 4 modules included in the toolkit, 7 word embedding models that have been trained on publicly available Sanskrit corpora and multiple annotated datasets such as word similarity, relatedness, categorization, analogy prediction to assess intrinsic properties of word embeddings. So far as we know, this is the first neural-based Sanskrit NLP toolkit that has a web-based interface and a number of NLP modules. We are sure that the people who are willing to work with Sanskrit will find it useful for pedagogical and annotative purposes. SanskritShala is available at: https://cnerg.iitkgp.ac.in/sanskritshala. The demo video of our platform can be accessed at: https://youtu.be/x0X31Y9k0mw4.
Article
Full-text available
Motivation: Annotation tools are applied to build training and test corpora, which are essential for the development and evaluation of new natural language processing algorithms. Further, annotation tools are also used to extract new information for a particular use case. However, owing to the high number of existing annotation tools, finding the one that best fits particular needs is a demanding task that requires searching the scientific literature and then installing and trying various tools. Methods: We searched for annotation tools and selected a subset of them according to five requirements with which they should comply, such as being Web-based or supporting the definition of a schema. We installed the selected tools (when necessary), carried out hands-on experiments and evaluated them using 26 criteria that covered functional and technical aspects. We defined each criterion on three levels of matches and a score for the final evaluation of the tools. Results: We evaluated 78 tools and selected the following 15 for a detailed evaluation: BioQRator, brat, Catma, Djangology, ezTag, FLAT, LightTag, MAT, MyMiner, PDFAnno, prodigy, tagtog, TextAE, WAT-SL and WebAnno. Full compliance with our 26 criteria ranged from only 9 up to 20 criteria, which demonstrated that some tools are comprehensive and mature enough to be used on most annotation projects. The highest score, 0.81 (out of a maximum of 1.0), was obtained by WebAnno.
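A small sketch of this kind of three-level scoring over 26 criteria; the exact criteria and weights are defined in the paper, and the 0.5 weight for a partial match is an assumption for illustration:

# Sketch of a three-level (no / partial / full match) scoring scheme.
# The 0.5 weight for a partial match is an assumption, not the paper's value.
LEVEL_WEIGHT = {"no": 0.0, "partial": 0.5, "full": 1.0}

def tool_score(ratings):
    """Normalize per-criterion match levels into a final score in [0, 1]."""
    return sum(LEVEL_WEIGHT[r] for r in ratings) / len(ratings)

# e.g. a hypothetical tool with 20 full, 2 partial, and 4 unmatched criteria:
print(round(tool_score(["full"] * 20 + ["partial"] * 2 + ["no"] * 4), 2))  # 0.81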
Article
Full-text available
This paper presents GATE Teamware, an open-source, web-based, collaborative text annotation framework. It enables users to carry out complex corpus annotation projects involving distributed annotator teams. Different user roles are provided (annotator, manager, administrator) with customisable user interface functionalities, in order to support the complex workflows and user interactions that occur in corpus annotation projects. Documents may be pre-processed automatically, so that human annotators can begin with text that has already been pre-annotated, making them more efficient. The user interface is simple to learn, aimed at non-experts, and runs in an ordinary web browser without need of additional software installation. GATE Teamware has been evaluated through the creation of several gold standard corpora and internal projects, as well as through external evaluation in commercial and EU text annotation projects. It is available as an on-demand service on GateCloud.net, as well as open-source for self-installation.
Article
Full-text available
As users struggle to navigate the wealth of on-line information now available, the need for automated question answering systems becomes more urgent. We need systems that allow a user to ask a question in everyday language and receive an answer quickly and succinctly, with sufficient context to validate the answer. Current search engines can return ranked lists of documents, but they do not deliver answers to the user. Question answering systems address this problem. Recent successes have been reported in a series of question-answering evaluations that started in 1999 as part of the Text Retrieval Conference (TREC). The best systems are now able to answer more than two thirds of factual questions in this evaluation.
Article
Full-text available
The Text REtrieval Conference (TREC) question answering track is an effort to bring the benefits of large-scale evaluation to bear on a question answering (QA) task. The track has run twice so far, first in TREC-8 and again in TREC-9. In each case, the goal was to retrieve small snippets of text that contain the actual answer to a question rather than the document lists traditionally returned by text retrieval systems. The best performing systems were able to answer about 70% of the questions in TREC-8 and about 65% of the questions in TREC-9. While the 65% score is a slightly worse result than the TREC-8 scores in absolute terms, it represents a very significant improvement in question answering systems. The TREC-9 task was considerably harder than the TREC-8 task because TREC-9 used actual users’ questions while TREC-8 used questions constructed for the track. Future tracks will continue to challenge the QA community with more difficult, and more realistic, question answering tasks.
Article
Full-text available
The TREC-8 Question Answering track was the first large-scale evaluation of domain-independent question answering systems. This paper summarizes the results of the track by giving a brief overview of the different approaches taken to solve the problem. The most accurate systems found a correct response for more than 2/3 of the questions. Relatively simple bag-of-words approaches were adequate for finding answers when responses could be as long as a paragraph (250 bytes), but more sophisticated processing was necessary for more direct responses (50 bytes).
Conference Paper
In this workshop we provide a hands-on introduction to the popular open-source graph database Neo4j [1] by fixing a series of increasingly sophisticated, but broken, test cases, each of which highlights an important graph modeling or API affordance.
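For readers new to Neo4j, a minimal sketch of creating and querying a graph from Python with the official neo4j driver; the connection URI, credentials, and the Person/KNOWS schema are placeholders:

# Minimal Neo4j usage sketch; URI, credentials, and schema are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Create two nodes and a relationship between them
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Alice", b="Bob",
    )
    # Read the relationship back
    for record in session.run(
        "MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name AS a, b.name AS b"
    ):
        print(record["a"], "knows", record["b"])
driver.close()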
Conference Paper
This paper describes a dialog-based QA system, Dialog Navigator, which can answer questions based on a large text knowledge base. In real-world QA systems, the vagueness of questions is a big problem. Our system navigates users to the desired answers using the following methods: asking users back with dialog cards, and extracting a description from each retrieved text. Another feature of the system is that it retrieves relevant texts precisely, using question types, a synonymous expression dictionary, and modifier-head relations in Japanese sentences.
Dan Jurafsky. 2000. Speech & Language Processing. Pearson Education India.
Hiroki Nakayama, Takahiro Kubo, Junya Kamura, Yasufumi Taniguchi, and Xu Liang. 2018. doccano: Text Annotation Tool for Human. https://github.com/doccano/doccano
Maarten van Gompel. FoLiA Linguistic Annotation Tool (FLAT).
Guido van Rossum and Fred L. Drake. 2009. Python 3 Reference Manual. CreateSpace.
Miguel Grinberg. 2018. Flask Web Development: Developing Web Applications with Python. O'Reilly Media.