
Sabrina Tiun
Universiti Kebangsaan Malaysia | UKM · Center for Artificial Intelligence Technology
PhD, Universiti Sains Malaysia
Publications: 87
Reads: 30,987
Citations: 684
Additional affiliations
November 2011 - present
Publications (87)
Interest in hate speech detection has increased substantially among researchers in natural language processing (NLP) and text mining, and the number of studies on this topic has been growing dramatically. Thus, the purpose of this analysis is to develop a resource that consists of an outline of the approaches, methods, and techniques employe...
The number of online documents has rapidly grown, and with the expansion of the Web, document analysis, or text analysis, has become an essential task for preparing, storing, visualizing and mining documents. The texts generated daily on social media platforms such as Twitter, Instagram and Facebook are vast and unstructured. Most of these generate...
COVID-19 (coronavirus disease 2019) is an ongoing global pandemic caused by severe acute respiratory syndrome coronavirus 2. Recently, it has been demonstrated that the voice data of the respiratory system (i.e., speech, sneezing, coughing, and breathing) can be processed via machine learning (ML) algorithms to detect respiratory system diseases,...
Other languages have influenced Arabic because of several factors, such as geographical nearness, trade communication, past Islamic conquests, science and technology, new devices, brand names, models, and fashion. As a result of these factors, foreign words are used in Arabic text and are known as Arabised words. Arabised words affect the Arabic na...
Many works have employed Machine Learning (ML) techniques in the detection of Diabetic Retinopathy (DR), a disease that affects the human eye. However, the accuracy of most DR detection methods still needs improvement. Gray Wolf Optimization-Extreme Learning Machine (GWO-ELM) is one of the most popular ML algorithms, and can be considered as an accu...
Twitter is a popular social media platform in Malaysia that allows for 280-character microblogging. Almost everything that happens in a single day is tweeted by users. Because of the popularity of Twitter, most Malaysians use it daily, providing researchers and developers with a wealth of data on Malaysian users. This paper explains why and how thi...
Automatic Emotion Speech Recognition (ESR) is an active research field in Human-Computer Interface (HCI). Typically, an ESR system consists of two main parts: the front end (feature extraction) and the back end (classification). However, most previous ESR systems have focused on the feature extraction part only and ignored th...
One of the most important phases in text processing is stemming, whose aim is to aggregate all variations in a word into one group to aid natural language processing. The morphological structure of the Arabic language is more challenging than that of the English language; thus, it requires superior stemming algorithms for Arabic stemmers to be effe...
In multilabel classification, each sample can be allocated to multiple class labels at the same time. However, one of the prominent problems of multilabel classification is missing labels (incomplete labels) in multilabel text. The multilabel classification performance is reduced significantly with the presence of missing labels. In order to addres...
Spoken language identification (LID) is the process of determining and classifying natural language from a given content and dataset. Data must be processed to extract useful features to perform LID. The mel-frequency cepstral coefficient (MFCC) is one of the most popular feature extraction techniques in LID. The MFCC features are generated to serv...
The technique used for recognizing a language from speech is called spoken Language Identification (LID). This field is highly significant for human-computer interaction. Besides, it can be implemented in several applications such as call centers, speaker diarization in multilingual environments, and in translati...
Superior stemming algorithms aid significantly in many natural language processing (NLP) applications such as information retrieval. The Arabic light-based stemmer is one of the most important stemming algorithms. However, partly due to the highly inflected and complex morphological structure of the Arabic language, most of the existing Arabic light-...
Simultaneous multiple labelling of documents, also known as multilabel text classification, will not perform optimally if the classes are highly imbalanced. Class imbalance entails skewness in the underlying data distribution, which makes classification more difficult. Random over-sampling and under-sampling are common approaches to solve t...
Removal of stop words is essential in Natural Language Processing and text-related analysis. Existing works on Malay stop words are based on standard Malay and Quranic/Arabic translations into Malay. Thus, there is a lack of domain-specific stop word lists, which makes the existing lists unsuitable for processing Malay parliamentary discourse. In this paper, we propo...
Existing text clustering methods utilize only one representation at a time (single view), whereas multiple views can represent documents. The multiview multirepresentation method enhances clustering quality. Moreover, existing clustering methods that utilize more than one representation at a time (multiview) use representation with the same nature....
The coronavirus disease (COVID-19) is an ongoing global pandemic caused by severe acute respiratory syndrome coronavirus 2. Chest Computed Tomography (CT) is an effective method for detecting lung illnesses, including COVID-19. However, the CT scan is expensive and time-consuming. Therefore, this work focuses on detecting COVID-19 using chest X-ray images because...
Aspect-based sentiment analysis (ABSA) has recently attracted increasing attention due to its extensive applications. Most of the existing ABSA methods have been applied to small labeled datasets. However, real datasets such as those from Amazon and TripAdvisor contain a massive number of reviews. Thus, applying these methods on large-scale datasets may...
In this study, we propose an alternative approach to analyzing a domain-specific time series corpus for detecting word evolution. The method trains a target corpus in time series into a temporal word embedding (TWE) model. The advantage of TWE is that one can see how the meaning of a word changes over time. We have chosen the TWEC approach to model...
Malay social media text is text written on social media networks like Twitter. Commonly, this text comprises non-standard words and is filled with dialects, foreign languages, word abbreviations, grammatical neglect, spelling errors, and many more. It is well known that this type of text is difficult to process due to its high noise and distinct text s...
The metaheuristic genetic algorithm (GA) is based on the natural selection process that falls under the umbrella category of evolutionary algorithms (EA). Genetic algorithms are typically utilized for generating high-quality solutions for search and optimization problems by depending on bio-oriented operators such as selection, crossover, and mutat...
User reviews are important resources for many processes such as recommender systems and decision-making programs. Sentiment analysis is very useful for extracting valuable information from these reviews. The data preprocessing step is important in the sentiment analysis process, in which suitable preprocessing metho...
The determination and classification of natural language based on specified content and data set involves a process known as spoken language identification (LID). To initiate the process, useful features of the given data need to be extracted first in a mature process where the standard LID features have been previously developed by employing the u...
One of the needs in adopting a crowdsourcing approach in software requirement system (SRS) is the ability to perform text analytics to gain insight or knowledge from the crowd’s feedback. One of the expected text analytic tasks is to analyze the feedback automatically, for example, whether the feedback concerns the functional requirem...
Information retrieval is a difficult process due to the overabundance of information on the web. Nowadays, search engines respond to user queries with too many results, although only a few are relevant. Therefore, the existing clustering methods that fail in clustering snippets (short texts) of web documents due to the low frequencies of document te...
In this paper, we present the process of training a word embedding (WE) model for a small, domain-specific Malay corpus. In this study, a Word2vec model was trained on the Hansard corpus of the Malaysian Parliament for specific years. However, a specific setting of the hyperparameters is required to obtain an accurate WE model because changing one of th...
The exponential growth of medical information on social networks has posed several challenging issues. One of these issues is configuring drug interactions and other medical-related entities. Adverse Drug Reaction (ADR) is one of these entities that is crucial to identify in order to help determine drug interactions. The liter...
Adverse Drug Reaction (ADR) extraction is the process of identifying drug implications mentioned in social posts. Handling medical text for the identification of ADR is vital to research in terms of configuring the side effect and other medical-related entities within any medical text. However, investigating the role of such effect in the context o...
The determination and classification of a recognized spoken language based on certain contents and datasets is known as the process of language identification (LID). The common process in carrying out LID entails the mandatory processing of data which enables the extraction of the necessary features for the process. The extraction involves a mature...
Word sense disambiguation (WSD) is the process of identifying an appropriate sense for an ambiguous word. Given the complexity of human languages, in which a single word could yield different meanings, WSD has been utilized in several domains of interest such as search engines and machine translation. The literature shows a vast number of technique...
Comparison of results of hybrid PSO based on the SensEval-2 and SensEval-3 corpora for each POS. (TIF)
Retrieving the contents of SemCor sentence lines with the JSemCor Java library. (TIF)
Sample of SemCor’s dataset. (TIF)
Processing the meaning of words in social media texts, such as tweets, is challenging in natural language processing. Malay tweets are no exception because they demonstrate distinct linguistic phenomena, such as the use of dialects from each state in Malaysia; borrowing foreign language terms in the context of Malay language; and using mixed langua...
Word Sense Disambiguation (WSD) is the process of computationally determining the exact sense of a particular word according to its context. This task plays an essential role in multiple fields of study such as Information Retrieval and Information Extraction. With the complexity of human language, WSD came up to solve the problem beh...
Computer vision (CV) refers to the study of the computer simulation of human visual science. A major task of CV is to collect images (or video) so that they can be used for analysis, gathering information, and making decisions or judgements. CV has greatly progressed and developed in the past few decades. In recent years, deep learning (DL) approac...
Currently, the high volume of international information exchange involves a wide range of localities. As each locality comes with its own distinctive dialect, the need for an effective means of language translation is becoming more and more apparent. Among the concerns of information professionals is the capacity of an interested party to access we...
An Information Retrieval (IR) system aims to extract information based on a query made by a user on a particular subject from an extensive collection of text. IR is a process through which information is retrieved by submitting a query by a user in the form of keywords or to match words. In the Al-Quran, verses of the same or comparable topics are...
Spoken Language Identification (LID) is the process of determining and classifying natural language from a given content and dataset. Typically, data must be processed to extract useful features to perform LID. Feature extraction for LID, according to the literature, is a mature process where the standard features for LID have already been developed u...
Provides the languages, YouTube channel names, and URLs for every channel that we used to collect our dataset. (TXT)
Stemming refers to the procedure of reducing all words appearing in different morphological variants to a common form. It is considered a functional technique in various areas of information retrieval and computational linguistics. In this paper, we introduced the Vocabulary Based Stemmer (VBS) as the alternative solution...
Automatic text categorization (ATC) has attracted the attention of the research community over the last decade as it frees organizations from the need to organize documents manually. Ensemble techniques, which combine the results of a number of individually trained base classifiers, always improve classification performance better than base cl...
One of the challenges of natural language processing is social media text like tweets. Conversational text, in contrast to the highly edited genres (standard language) for which traditional NLP tools have been developed, contains many syntactic patterns and non-standard lexical items. These are the outcomes of dialectal variation, diversity in t...
Nowadays, the use of short text has increased dramatically, and many applications rely on short text, such as mobile messaging, breaking news, social media and queries. The key challenge behind short text lies in the limitation of acquiring context information from such text. This limitation increases both sparsity and am...
Information retrieval is the process of analysing a typed query and retrieving relevant documents according to that query. Several issues can significantly affect the effectiveness of information retrieval. One of the common issues is the ambiguity of words, where a single word could yield several meanings. The process of identifyin...
With the dramatic expansion of information over the internet, users around the world express their opinions daily on social networks such as Facebook and Twitter. Large corporations nowadays invest in analyzing these opinions in order to assess their products or services by learning people's feedback toward such business. The process of knowing...
Named Entity Recognition (NER) is the field of recognizing nouns such as names of people, corporations, places and dates. The process of extracting NEs relies mainly on supervised machine learning techniques. Hence, utilizing proper features has a significant impact on the performance of recognizing the entities. Several approaches have been p...
Named Entity Recognition (NER) is the field of identifying proper nouns such as names of people, corporations, places and dates. Recently, extracting information from web pages has caught researchers’ attention because of the valuable information that lies on such pages. The most common valuable information is the NEs. However, web pages contain mor...
Developing a complete and usable Text-to-Speech (TTS) system requires years of time, many hours of human work and a great deal of knowledge from various fields. However, with software tools that are simple and easy to use and understand, the burden of developing a complete TTS for a specific language can be overcome. Thus, such existing softwa...
Multi-label text classification, in which each document can be given multiple labels concurrently, has become progressively more important in recent years. Multi-label text classification is a challenging task because of the large space of all potential label sets, which is exponential in the number of candidate labels. Among the disadvantages of...
The task of assigning proper meaning to an ambiguous word in a particular context is termed word sense disambiguation (WSD). We propose a genetic algorithm, improved by local search techniques, to maximise the overall semantic similarity or relatedness of a given text. Local search is used because of the inefficiency of population-based algorithms...
Cross-Language Plagiarism Detection (CLPD) is used to automatically identify and extract plagiarism among documents in different languages. The main challenge of cross-language plagiarism detection is the difference of text languages, where the original source can be analysed and translated, and plagiarism can be detected automatically by comparing su...
With the exponential growth of textual information available from the Internet, there has been an emergent need to find relevant, in-time and in-depth knowledge about business topics. The huge size of such data makes manually retrieving, analyzing and using the valuable information in such texts a very difficult task. In this pape...
Previous research proved that a complete and usable Malay Text-to-Speech (TTS) system based on formant synthesis could be developed within a short period of time without in-depth knowledge in relevant fields. The speech produced, however, was still influenced by Indonesian and English pronunciation. This has led to this research, which intended to imp...
Word sense disambiguation (WSD) is the process of eliminating the ambiguity of a word by identifying its exact sense. In natural languages, many words can yield multiple meanings depending on the context. WSD aims to identify the most accurate sense for such cases. In particular, when translating one language to another,...
Word Sense Disambiguation (WSD) is the task of determining which sense of an ambiguous word (word with multiple meanings) is chosen in a particular use of that word, by considering its context. A sentence is considered ambiguous if it contains ambiguous word(s). Practically, any sentence that has been classified as ambiguous usually has multiple in...
The methods and background introduced in this article concern the interpretation of the Quranic text in English translation using word sense disambiguation. Three semantic similarity measures, Wu-Palmer, Lin and Jiang-Conrath, and their combination were used to identify word senses in the English Quranic text, in which comparison and...
This article proposes a system for interpreting the Quranic text in English translation using word sense disambiguation. This system is based on a combination of three traditional semantic similarity measurements, which are Wu-Palmer (WUP), Lin (LIN), and Jiang-Conrath (JCN) for word sense disambiguation on the...
Named entity is a term that has been widely used in the field of Natural Language Processing (NLP). It contains the names of persons, organizations, locations, dates and currencies. The process of extracting such names is called Named Entity Recognition (NER). Biomedical Named Entity Recognition (BNER) is one of the fields that contains a variety of nam...
Instance-based matching is the process of finding the correspondence of schema elements by comparing the data from different data sources. It is used as an alternative option when the match between schema elements fails. Instance-based matching is applied in many application areas such as website creation and management, schema evolution and migrat...
Instance-based matching is the process of identifying the correspondences of schema elements by comparing the instances of different data sources. It is used as an alternative option when the schema-based matching fails. Instance-based matching is applied in many application areas such as website creation and management, data warehousing, database...
While a wide range of methods has been applied to English terminology extraction, relatively few studies have addressed Arabic term extraction in Islamic corpora. In this paper, we present an efficient approach for automatic extraction of Arabic Terminology (SWTs, MWTs). The approach relies on two main filtering steps: the linguistic filter...
Quranic text Information Retrieval (IR) is quite demanding yet very trivial because users will not always use the exact keywords to retrieve the relevant Quranic text (verse). Many have tried to overcome this problem by expanding or reformulating the query entered by users using semantic approaches with resources such as ontologies and thesauri....