Joseph Marvin Imperial
University of Bath | UB · Department of Computer Science
Doctor of Philosophy
Doctoral researcher in the UKRI CDT in Accountable, Responsible, and Transparent AI (ART-AI) program and the BathNLP Laboratory.
About
54 Publications
29,116 Reads
146 Citations
Introduction
I'm a UKRI CDT Doctoral Student at the University of Bath's Integrated PhD Program in ART-AI. I'm currently on study leave from the Department of Computer Science and Human Language Technology (HLT) Lab at the National University, Philippines. I conduct research at the intersection of machine learning and NLP, specifically in the areas of text complexity, story generation, and social media.
Additional affiliations
June 2018 - September 2022
National University
Position
- Instructor
Description
- Currently on study leave.
Education
September 2022 - October 2027
September 2018 - January 2021
Publications (54)
Proper identification of grade levels of children's reading materials is an important step towards effective learning. Recent studies in readability assessment for the English domain applied modern approaches in natural language processing (NLP) such as machine learning (ML) techniques to automate the process. There is also a need to extract the co...
In order to ensure quality and effective learning, fluency, and comprehension, the proper identification of the difficulty levels of reading materials should be observed. In this paper, we describe the development of automatic machine learning-based readability assessment models for educational Filipino texts using the most diverse set of linguisti...
Readability assessment is the process of identifying the level of ease or difficulty of a certain piece of text for its intended audience. Approaches have evolved from the use of arithmetic formulas to more complex pattern-recognizing models trained using machine learning algorithms. While using these approaches provides competitive results, limited...
Specialized lexicons are collections of words with associated constraints such as special definitions, specific roles, and intended target audiences. These constraints are necessary for content generation and documentation tasks (e.g., writing technical manuals or children's books), where the goal is to reduce the ambiguity of text content and incr...
In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies wh...
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evalua...
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a...
Domain experts across engineering, healthcare, and education follow strict standards for producing quality content such as technical manuals, medication instructions, and children's reading materials. However, current works in controllable text generation have yet to explore using these standards as references for control. Towards this end, we intr...
ChatGPT is a generative language model that serves as a conversational tool to assist users with various information-seeking activities. ChatGPT has gained traction as a promising technological advancement toward enhancing the educational experience of students and learners. However, there is limited understanding of ChatGPT in higher education ins...
We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross...
Despite being one of the most linguistically diverse groups of countries, computational linguistics and language processing research in Southeast Asia has struggled to match the level of countries from the Global North. Thus, initiatives such as open-sourcing corpora and the development of baseline models for basic language processing tasks are imp...
Understanding the perceptions and experiences of a community regarding disasters is crucial in effectively planning and implementing disaster strategies. There are two known approaches to analyzing perceptions: qualitative and automated thematic analysis. This paper aims to investigate the strengths and limitations of the mentioned a...
In recent years, the main focus of research on automatic readability assessment (ARA) has shifted towards using expensive deep learning-based methods with the primary goal of increasing models' accuracy. This, however, is rarely applicable for low-resource languages where traditional handcrafted features are still widely used due to the lack of exi...
The main objective of internship programs of every university is to enable students to learn the skills needed in a workplace regardless of the modality. Students want to experience the actual work environment in preparation for joining the knowledge workers after finishing the program. With the disruptions in education due to the pandemic, univers...
In the field of automatic readability assessment (ARA), the current trend in the research community focuses on the use of large neural language models such as BERT, as evidenced by its high performance in other downstream NLP tasks. In this study, we dissect the BERT model and apply it to readability assessment in a low-resource setting using a...
Powerful language models such as GPT-2 have shown promising results in tasks such as narrative generation which can be useful in an educational setup. These models, however, should be consistent with the linguistic properties of triggers used. For example, if the reading level of an input text prompt is appropriate for low-leveled learners (ex. A2...
In this study, we developed the first baseline readability model for the Cebuano language. Cebuano is the second most-used native language in the Philippines with about 27.5 million speakers. As the baseline, we extracted traditional or surface-based features, syllable patterns based on Cebuano's documented orthography, and neural embeddings from...
In this paper, we present a unified model that works for both multilingual and crosslingual prediction of reading times of words in various languages. The secret behind the success of this model is in the preprocessing step where all words are transformed to their universal language representation via the International Phonetic Alphabet (IPA). To t...
One of the most important humanitarian responsibilities of every individual is to protect the future of our children. This entails protection not only of physical welfare but also from ill events that can potentially affect the mental well-being of a child, such as sexual coercion and abuse, which, in worst-case scenarios, can result in lifelong trauma...
Assessing the proper difficulty levels of reading materials or texts in general is the first step towards effective comprehension and learning. In this study, we improve the conventional methodology of automatic readability assessment by incorporating the Word Mover's Distance (WMD) of ranked texts as an additional post-processing technique to furt...
Readability formulas consider word familiarity as one of the factors for predicting the readability of children’s books. Word familiarity is dependent on the frequency with which the words are encountered in daily reading. Often referred to as “sight words”, developing effective recognition of these high-frequency words can assist young readers to de...
Automatic readability assessment (ARA) is the task of evaluating the level of ease or difficulty of text documents for a target audience. For researchers, one of the many open problems in the field is to make such models trained for the task show efficacy even for low-resource languages. In this study, we propose an alternative way of utilizing the...
In this paper, we describe our efforts in establishing a simple knowledge base by building a semantic network composed of concepts and word relationships in the context of disasters in the Philippines. Our primary source of data is a collection of news articles scraped from various Philippine news websites. Using word embeddings, we extract semanti...
Handwriting is a skill to express thoughts, ideas, and language. Over the years, medical doctors have been well known for illegible cursive handwriting, and this has become a generally accepted matter. The datasets used in this paper are samples of doctors' cursive handwriting collected from several clinics and hospitals of Metro Manila, Quezon City...
In this paper, we present the compilation of a lexicon comprising feature words deemed most useful and informative when classifying Filipino storybooks based on their readability level. For this study, a total of 250 storybooks written in Filipino with varying grade levels were collected from a university library. The storybook data has been...
The Philippines is prone to natural calamities like typhoons, floods, volcanic eruptions, and earthquakes. With Twitter as one of the most used social media platforms in the Philippines, a total of 39,867 preprocessed tweets were obtained given a time frame starting from November 1, 2013 to January 31, 2014. Sentiment analysis determines th...
In this paper, disaster-related tweets published during the timeframe of Typhoon Yolanda (from November 1, 2013 to January 31, 2014) were mined for the analysis of sentiments and construction of a combined Filipino and English disaster polarity lexicon. A total of 92,040 tweets were obtained using the hashtags ‘#YolandaPH’, ‘#ReliefPH’, ‘#Bang...