
Burcu CanUniversity of Wolverhampton · Research Institute in Information and Language Processing (RIILP)
Burcu Can
PhD from University of York
Reader in Computational Linguistics
RGCL, University of Wolverhampton
About
54
Publications
9,383
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
400
Citations
Introduction
Unsupervised learning of morphology, syntax, semantics, semantic parsing.
Deep neural networks applied to a range of NLP tasks.
Additional affiliations
October 2013 - September 2014
Publications
Publications (54)
Social media plays an important role in expressing the thoughts and sentiments of users. Irony is a way of stating a sentiment about something by expressing the opposite of the intended literal meaning. Irony detection is a recent emerging task in low-resource languages, although other tasks related to sentiment, such as sentiment analysis and emot...
Semantic representation is the task of conveying the meaning of a natural language utterance by converting it to a logical form that can be processed and understood by machines. It is utilised by many applications in natural language processing (NLP), particularly in tasks relevant to natural language understanding (NLU). Due to the widespread use...
This paper summarises the differences and similarities found between humans and three natural language processing models when attempting to identify whether English online comments are sarcastic or not. Three models were used to analyse 300 comments from the FigLang 2020 Reddit Dataset, with and without context. The same 300 comments were also give...
Social media is a widely used platform that provides a huge amount of user-generated content that can be processed to extract information about users’ emotions. This has numerous benefits, such as understanding how individuals feel about certain news or events. It can be challenging to categorize emotions from text created on social media, especial...
Semantic representation is a way of expressing the meaning of a text that can be processed by a machine to serve a particular natural language processing (NLP) task that usually requires meaning comprehension such as text summarisation, question answering or machine translation. In this paper, we present a semantic parsing model based on neural net...
Social media is a widespread platform that provides a massive amount of user-generated content that can be mined to reveal the emotions of social media users. This has many potential benefits, such as getting a sense of people’s pulse on various events or news. Emotion classification from social media posts is challenging, especially when it comes...
Universal Conceptual Cognitive Annotation (UCCA) is a cross-lingual semantic annotation framework that provides an easy annotation without any requirement for linguistic background. UCCA-annotated datasets have been already released in English, French, and German. In this paper, we introduce the first UCCA-annotated Turkish dataset that currently i...
Adversarial learning deals with the performance of machine learning algorithms in the existence of adversarial parties. In adversarial attacks, inputs are carefully crafted in order to have detrimental effects on a machine learning system. Studying adversarial attacks is important especially for security systems that employ machine learning in thei...
We propose an integrated deep learning model for morphological segmentation, morpheme tagging, part-of-speech (POS) tagging, and syntactic parsing onto dependencies, using cross-level contextual information flow for every word, from segments to dependencies, with an attention mechanism at horizontal flow. Our model extends the work of Nguyen and Ve...
Since mobile applications make our lives easier, there is a large number of mobile applications customized for our needs in the application markets. While the application markets provide us a platform for downloading applications, it is also used by malware developers in order to distribute their malicious applications. In Android, permissions are...
Semantic parsing provides a way to extract the semantic structure of a text that could be understood by machines. It is utilized in various NLP applications that require text comprehension such as summarization and question answering. Graph-based representation is one of the semantic representation approaches to express the semantic structure of a...
Android is among the most targeted platform by attackers. While attackers are improving their techniques, traditional solutions based on static and dynamic analysis have been also evolving. In addition to the application code, Android applications have some metadata that could be useful for security analysis of applications. Unlike traditional appl...
Part of speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing, as it assigns a syntactic category to each word within a given sentence or context (such as noun, verb, adjective, etc.). Those syntactic categories could be used to further analyze the sentence-level syntax (e.g., dependency parsing) and thereby...
Android gives us opportunity to extract meaningful information from metadata. From the security point of view, the missing important information in metadata of an application could be a sign of suspicious application, which could be directed for extensive analysis. Especially the usage of dangerous permissions is expected to be explained in app des...
The earlier work on sarcastic texts mainly concentrated on detecting the sarcasm on a given text. With the spread of cyber-bullying with the use of social media, it becomes also essential to identify the target of the sarcasm besides detecting the sarcasm. In this study, we propose a deep learning model for target identification on sarcastic texts...
We investigate the usage of semantic information for morphological segmentation since words that are derived from each other will remain semantically related. We use mathematical models such as maximum likelihood estimate (MLE) and maximum a posteriori estimate (MAP) by incorporating semantic information obtained from dense word vector representati...
In this article, we investigate using deep neural networks with different word representation techniques for named entity recognition (NER) on Turkish noisy text. We argue that valuable latent features for NER can, in fact, be learned without using any hand-crafted features and/or domain-specific resources such as gazetteers and lexicons. In this r...
The significance of user satisfaction is increasing in the competitive open source software (OSS) market. Application stores let users send their feedbacks for applications, which are in the form of user reviews or ratings. Developers are informed about bugs or any additional requirements with the help of this feedback and use it to increase the qu...
Türkçe, morfem adı verilen birimlerin art arda eklenmesiyle sözcüklerin oluşturulduğu sondan eklemeli bir dildir. Sözcüklerin farklı parçaların birleştirilmesiyle oluşturulması makine tercümesi, duygu analizi ve bilgi çıkarımı gibi birçok doğal dil işleme uygulamasında seyreklik problemine yol açmaktadır çünkü sözcüğün her farklı formu farklı bir s...
Spelling errors are one of the crucial problems to be addressed in Natural Language Processing tasks. In this study, a context-based automatic spell correction method for Turkish texts is presented. The method combines the Noisy Channel Model with Hidden Markov Models to correct a given word. This study deviates from the other studies by also consi...
Turkish is an agglutinative language with rich morphology. A Turkish verb can have thousands of different word forms. Therefore, sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aimed to expand the lexicon by using a morphological segmentation s...
The number of possible word forms is theoretically infinite in agglutinative languages. This brings up the out-of-vocabulary (OOV) issue for part-of-speech (PoS) tagging in agglutinative languages. Since inflectional morphology does not change the PoS tag of a word, we propose to learn stems along with PoS tags simultaneously. Therefore, we aim to...
In this paper, we introduce a trie-structured Bayesian model for unsupervised morphological segmentation. We adopt prior information from different sources in the model. We use neural word embeddings to discover words that are morphologically derived from each other and thereby that are semantically similar. We use letter successor variety counts o...
In this paper, we investigate the effects of
using subword information in representation learning. We argue that using syntactic subword units effects the quality of the
word representations positively. We introduce a morpheme-based model and compare it against to word-based, characterbased, and character n-gram level models.
Our model takes a list...
This article presents a probabilistic hierarchical clustering model for morphological segmentation. In contrast to existing approaches to morphology learning, our method allows learning hierarchical organization of word morphology as a collection of tree structured paradigms. The model is fully unsupervised and based on the hierarchical Dirichlet p...
Sparsity is one of the major problems in natural language processing. The problem becomes even more severe in agglutinating languages that are highly prone to be inflected. We deal with sparsity in Turkish by adopting morphological features for part-of-speech tagging. We learn inflectional and derivational morpheme tags in Turkish by using conditio...
In this paper, we build morphological chains for agglutinative languages by using a log linear model for the morphological segmentation task. The model is based on the unsupervised morphological segmentation system called MorphoChains [1]. We extend MorphoChains log linear model by expanding the candidate space recursively to cover more split point...
The number of word forms in agglutinative languages is theoretically infinite and this variety in word forms introduces sparsity in many natural language processing tasks. Part-of-speech tagging (PoS tagging) is one of these tasks that often suffers from sparsity. In this paper, we present an unsupervised Bayesian model using Hidden Markov Models (...
For a language learner, any new word is a pseudoword. A pseudoword is a string of of letters or phonemes that sounds like an existing word in a language, though it has no meaning in the lexicon. On the other hand, speakers are well aware of permissible phonemes, their frequencies and collocations in their language due to the phonotactics inherent i...
The number of word forms in agglutinative languages is theoretically infinite and this variety in word forms introduces sparsity in many natural language processing tasks. Part-of-speech tagging (PoS tagging) is one of these tasks that often suffers from sparsity. In this paper, we present an unsupervised Bayesian model using Hidden Markov Models (...
In this paper, we build morphological chains for agglutinative languages by using a log-linear model for the morphological segmentation task. The model is based on the unsupervised morphological segmentation system called MorphoChains. We extend MorphoChains log linear model by expanding the candidate space recursively to cover more split points fo...
In this paper, we introduce a trie-structured Bayesian model for unsupervised morphological segmentation. We adopt prior information from different sources in the model. We use neural word embeddings to discover words that are morphologically derived from each other and thereby that are semantically similar. We use letter successor variety counts o...
Sparsity is one of the major problems in natural language processing. The problem becomes even more severe in agglutinating languages that are highly prone to be inflected. We deal with sparsity in Turkish by adopting morphological features for part-of-speech tagging. We learn inflectional and derivational morpheme tags in Turkish by using conditio...
One morpheme may have several surface forms that correspond to allomorphs. In English, ed and d are surface forms of the past tense morpheme, and s, es, and ies are surface forms of the plural or present tense morpheme. Turkish has a large number of allomorphs due to its morphophonemic processes. One morpheme can have tens of different surface form...
Morphemes are not independent units and attached to each other based on morphotactics. However, they are assumed to be independent from each other to cope with the complexity in most of the models in the literature. We introduce a language independent model for unsupervised morphological segmentation using hierarchical Dirichlet process (HDP). We m...
We present a fully unsupervised method for morphological segmentation. Unlike many morphological segmentation systems, our method is based on semantic features rather than orthographic features. In order to capture word meanings, word embeddings are obtained from a two-level neural network [11]. We compute the semantic similarity between words usin...
Distributional representation of words is used for both syntactic and semantic tasks. In this paper two different methods are presented for clustering word roots. In the first method, the distributional model word2vec [1] is used for clustering word roots, whereas distributional approaches are generally used for words. For this purpose, the distrib...
Turkish is an agglutinating language with heavy affixation. During affixation, morphophonemic operations change the surface forms of morphemes, leading to allomorphy. This paper explores the use of Turkish allomorphs in morphological segmentation task. The results show that aggregating morphemes in allomorph sets and treating them as the same morph...
In this paper, we present a model for Turkish speech recognition. The model is syllable-based, where the recognition is performed through syllables as speech recognition units. The main goal of the model is to recognize as much as possible of a given continuous speech by identifying only a small set of syllables in the language. For that purpose, o...
This paper is a survey of methods and algorithms for unsupervised learning of morphology. We provide a description of the methods and algorithms used for morphological segmentation from a computational linguistics point of view. We survey morphological segmentation methods covering methods based on MDL (minimum description length), MLE (maximum lik...
This paper presents a joint model for learning morphology and part-of-speech (PoS) tags simultaneously. The proposed method adopts a finite mixture model that groups words having similar contextual features thereby assigning the same PoS tag to those words. While learning PoS tags, words are analysed morphologically by exploiting similar morphologi...
In this paper, we present an agglomerative hierarchical clustering algorithm for labelling morphs. The algorithm aims to capture allomorphs and homophonous morphemes for a deeper analysis of segmentation results of a morphological segmentation system. Most morphological segmentation systems focus only on segmentation rather than labelling morphs ac...
In this paper, we present an agglomerative hierarchical clustering algorithm for labelling morphs. The algorithm aims to capture allomorphs and homophonous morphemes for a deeper analysis of segmentation results of a morphological segmentation system. Most morphological segmentation systems focus only on segmentation rather than labelling morphs ac...
We propose a novel method for learning morphological paradigms that are structured within a hierarchy. The hierarchical structuring of paradigms groups morphologically similar words close to each other in a tree structure. This allows detecting morphological similarities easily leading to improved morphological segmentation. Our evaluation using (K...
We propose a new clustering algorithm for the induction of the morphological paradigms. Our method is unsupervised and exploits
the syntactic categories of the words acquired by an unsupervised syntactic category induction algorithm [1]. Previous research
[2,3] on joint learning of morphology and syntax has shown that both types of knowledge affect...
This paper presents a method for unsupervised learning of morphology that exploits the syntactic categories of words. Previous research [4][12] on learning of morphology and syntax has shown that both kinds of knowledge affect each other making it possible to use one type of knowledge to help the other. In this work, we make use of syntactic inform...