Gerhard Beukes Van Huyssteen
North-West University | NWU · Centre for Text Technology
Bachelor of Arts; BA Hons; Magister Artium; PhD
About
86 publications
31,314 reads
365 citations
Additional affiliations
October 2010 - present
Publications (86)
Views on constructionalisation and constructional change are at the forefront of construction grammar approaches to language change. In order to be able to talk about constructionalisation and constructional changes in a particular part of the constructicon, it is necessary to have both a diachronic and synchronic view of that network of constructi...
Lexical units that are identical in form and that are traditionally referred to as either adpositions, adverbs, or particles (based on their morphosyntactic properties), can also be grouped together (based on semantic properties) under the term P-items (see for example Fontaine 2017). Although it is for most linguistic endeavours sufficient to refe...
In Afrikaans and other Germanic languages there is a subcategory of exocentric compounds that can be used evaluatively as personal names. The focus of this article is on such exocentric compounds that are used pejoratively as epithets (i.e., epithetic exocentric compounds (EECs)), and even more specifically on EECs that are based on one of two conc...
One of the central ideas in construction grammar (CxG) is that our linguistic knowledge is structured as a mental inventory of constructions, i.e., a constructicon. Descriptive representations of such constructicons are best characterised as a blend between CxG and lexicography, an approach that has come to be known as constructicography (cf. Lyngfelt, 2018:2)....
South Africa’s various multilingual communities – a mix of eleven official languages and many other, smaller languages – offer unique opportunities to study constructions in contact. The influence of two of the West Germanic languages on each other, English and Afrikaans, has been the subject of study for many years – see for example Bekker (2019),...
Censorship is the practice of suppressing or controlling the creation of and/or public access to artefacts based on their content – for political and/or moral reasons. Such artefacts include the fine arts; performing arts, including film and theatre; the art of words, science, and journalism; and digital products such as computer games and websites...
The overarching aim of this article is to give an exemplary summary of the existing linguistic descriptions and lexicographic treatments of contemporary fok and other formally related words (such as befok, fokken, fokkol and opfok). The article's specific aims are to (1) compile a construction network of the fok family that ...
For many decades now, automatic content classification has stood at the centre of artificial intelligence design and machine learning systems. One of the most famous examples is probably the Netflix Prize, which was an open competition to find the best algorithm that could predict user ratings for films (e.g., Greene 2006; Kasula 2020). Content classif...
Research on swear words (used here as a hypernym to include other phenomena and/or synonyms, among them cursing, scolding, blasphemy and foul language) has been conducted internationally for many years in a variety of scientific disciplines. By contrast, very little to no research on swearing has been done in the South African context....
Likert-type data is commonly used in many research fields in humanities: from gauging the usability of different user interface designs, to determining users’ likeliness to vote for a particular political party, to evaluation of course materials, to name but a few examples. Despite its prevalence, there is still some disagreement within the stati...
Within the blended learning environment, it is important to consolidate expert content and pedagogy inside and outside the classroom. Subject experts who serve as content developers play a vital role by contributing quality controlled subject content covered by the curriculum, which can be made available to students on digital platforms. However, i...
Research on swearing (used here as a hypernym to include other phenomena and/or synonyms like cursing, profanity, taboo language, etc.) has been prevalent for many years internationally, also from a variety of scientific disciplines. Most of the research literature, however, is on swearing in English, although studies have also been conducted on so...
Likert-type data is commonly used in many research fields in humanities: from gauging the usability of different user-interface designs, to determining users’ likeliness to vote for a particular political party, to evaluation of course materials, to name but a few examples. Despite its prevalence, there is still some disagreement within the statistic...
In Germanic languages the linking morpheme, like the ·s· in Afrikaans seun·s·naam ‘boy’s name’ or the ·en· in Dutch pann·en·koek ‘pancake’, is quite common. This word element has been the topic of discussion in the past, with no definite consensus about its origin or possible semantic input. There has been a renewed interest in this phenomenon, especia...
A corpus exploration of huidiglik. In tandem with Van Huyssteen (2018a), this article examines the current usage of the word huidiglik (‘currently’) (an alleged Anglicism), together with other associated words (e.g. its base, huidig ‘current’). Based on a comprehensive literature review, Van Huyssteen (2018a) concludes that apart from stylistic pre...
Norms of huidiglik. In Afrikaans, huidiglik is a truly Janus-faced word: it is being used with high frequency in especially spoken language, while at the same time being one of the biggest language pet peeves of language practitioners (and even ordinary speakers of Afrikaans). When asked why huidiglik should be avoided, these language practitioners...
Over the past more than 100 years, Afrikaans associative plural constructions – especially constructions with hulle (‘they’) and goed (‘things/stuff; good’) as right-hand components – have been studied from both diachronic and synchronic perspectives, but with the main interest in their origins, and what they could tell us about the genesis of Afri...
When the first Afrikaanse woordelys en spelreëls (AWS) was conceived more than a hundred years ago, it was not yet standard practice to spell out the nature, purpose and scope of dictionaries clearly, comprehensively and systematically in a dictionary conceptualisation plan. In 2006 the TK began compiling two documents, namely a style guide (...
Since 2006 the Taalkommissie (TK) has been compiling formal guidelines for the inclusion or elimination of lemmas in the Afrikaanse woordelys en spelreëls (AWS); these guidelines were formalised in an operationalisation framework by Van Huyssteen (2017b - see Part 1 in this issue). The aim of this article is to ... the operationalisation frame...
The Virtuele Instituut vir Afrikaans (VivA) is a research institute and service provider for Afrikaans in digital contexts. In order to make well-founded choices regarding VivA's product and service offering, quantitative and qualitative research was conducted to determine gaps in the Afrikaans market for digital language products. Seven ...
Compared to well-resourced languages such as English and Dutch, natural language processing (NLP) tools for Afrikaans are still not abundant. In the context of the AfriBooms project, KU Leuven and the North-West University collaborated to develop a first, small treebank, a dependency parser, and an easy-to-use online linguistic search engine for Afr...
Following Den Besten’s (2009) desiderata for historical linguistics of Afrikaans, this article aims to contribute some modern evidence to the debate regarding the founding dialects of Afrikaans. From an applied perspective (i.e. human language technology), we aim to determine which West Germanic language(s) and/or dialect(s) would be best suited fo...
Aan die and besig in Afrikaans progressive construction: a corpus investigation (2)
Progressive aspect is a grammatical category which signifies that an event is continuing or taking place (Comrie 1976:33-36; Bybee et al. 1994:126). In two articles, this article and (Breed & Van Huyssteen 2014), we investigate the manner in which two periphrastic constructions, namely the Vam, besig om te V and the Vcop aan die V constructions, ar...
Aan die and besig in Afrikaans progressive constructions: the origin and development
Progressive aspect is a grammatical category which signifies that an event is continuing or taking place (Comrie 1976:33-36; Bybee et al. 1994:126). In two articles, this article and Breed and Van Huyssteen (submitted), we investigate the manner in which two perip...
In most languages, new words can be created through the process of compounding, which combines two or more words into a new lexical unit. Whereas in languages such as English the components that make up a compound are separated by a space, in languages such as Finnish, German, Afrikaans and Dutch these components are concatenated into one word. Com...
Compounding, the process of combining several simplex words into a complex whole, is a productive process in a wide range of languages. In particular, concatenative compounding, in which the components are "glued" together, leads to problems, for instance, in computational tools that rely on a predefined lexicon. Here we present the AuCoPro projec...
The linguistic categorisation of compounds dates back to some of the earliest work in linguistics. The cross-linguistic compound taxonomy of Bisetto and Scalise (2005), later refined in Scalise and Bisetto (2009), is well-known in linguistics for understanding the grammatical relations in compounds. Although this taxonomy has not been used extensiv...
In the field of text processing, the metadata about a particular text plays an important role in many cases. Such metadata is often added with the aid of automatic text classifiers that, based on the content of a text, automatically assign one or more predefined classes or categories to that text. One of the dimensions according to which...
Compound semantic analysis is the task of finding the correct internal relation between the constituents of a compound [3, 10]. In this paper we use a measure of semantic similarity [14] based on the relations in the Afrikaans WordNet [2] to determine the similarity between two Afrikaans compounds. We infer that if the different constituents of two...
The computational processing of compound semantics poses several interesting challenges. Up to now, the processing of nominal compounds with non-noun left-hand constituents (henceforth XN compounds) has not received any attention, despite the fact that these also seem to be rather productive in Germanic languages. In our research project, we aim to...
In this paper we introduce a solution for disease surveillance and monitoring in the primary animal health care (PAHC) domain that uses inbound voice-based services and voice- and text-based outbound services for connecting rural veterinarians and livestock owners with a PAHC service provider. We describe our findings from the ongoing pilots, where...
Development of human language technologies for the indigenous South African languages is currently being undertaken in various projects across South Africa. In one such project a lemmatizer for Setswana is being developed, and this article reports on work towards the development of a first prototype. A prerequisite of lemmatization is to determine...
This article presents initial results on a supervised machine learning approach to determine the semantics of noun-noun compounds in Dutch and Afrikaans. After a discussion of previous research on the topic, we present our annotation methods used to provide a training set of compounds with the appropriate semantic class. The support vector machine...
Multilingual emerging markets hold many opportunities for the application of spoken language technologies, such as interactive voice response (IVR) systems. Designing such systems requires an in-depth understanding of the business drivers and salient design decisions pertaining to these markets. In this paper we analyze the business drivers and des...
Research using voice-based services as a technology platform for providing information access and services within developing world regions has shown much promise. The results for design and deployment of such voice-based services have varied depending on the application domain, user community and context. In this paper we describe our work on devel...
Automatic lemmatisation for Afrikaans. Automatic lemmatisation is a general normalisation procedure in text processing, where all inflected forms of a lexical word are normalised to a single lemma (i.e. a meaningful, uninflected base form from which more complex word forms could be formed). Traditionally, lemmatisers are developed by writing languag...
Volume 194 of the series Current Issues in Linguistic Theory, entitled Lexicology, Semantics and Lexicography, comprises a selection of eleven academic contributions that were originally presented at the Tenth International Conference of English Historical Linguistics. The disciplinary focus of this volume of texts is the diachronics of English vo...
Human language technology (HLT) has been identified as a priority area by the South African government. However, despite efforts by government and the research and development (R&D) community, South Africa has not yet been able to maximise the opportunities of HLT and create a thriving HLT industry. One of the key challenges is the fact that there...
Afrikaans is one of the eleven official languages of South Africa. It is classified as an under-resourced language. No annotated broadband speech corpora currently exist for Afrikaans. This article reports on the development of speech resources for Afrikaans, specifically a broadband speech corpus and an extended pronunciation dictionary. Baseline...
South Africa (SA) epitomises diversity, with the nation boasting eleven official languages. The field of human language technology (HLT) can play a vital role in bridging the digital divide and thus has been recognised as a priority area by the South African government. The current HLT landscape in South Africa consists mostly of a relatively young...
HLT resource development for a resource scarce language (L2) can be expedited by recycling existing technologies for a closely related language (L1). To improve the success of L1 technologies on L2 data, one can convert L2 data to make it appear more L1-like. We explore this possibility by developing an Afrikaans-to-Dutch lexical conversion module...
South Africa (SA) is one of the few countries in the world that boasts a large number of official languages. Due to the efforts of government and the local research and development (R&D) community (comprising universities, science councils and a few private sector companies) all the official languages are -- to varying degrees -- enabled with regar...
Human Language Technologies, Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2-4 June 2010, Los Angeles, California, USA. In this research, the authors use machine learning techniques to provide solutions for descriptive linguists in the domain of language standardization. With regard to the personal...
Telephone-based information access has the potential to deliver a significant positive impact in the developing world. We discuss some of the most important issues that must be addressed in order to realize this potential, including matters related to resource development, automatic speech recognition, text-to-speech systems, and user-interface des...
Julie Coleman and Christian J. Kay (Eds.). Lexicology, Semantics and Lexicography: Selected Papers from the Fourth G.L. Brook Symposium, Manchester, August 1998. 2000, xiv + 249 pp. Current Issues in Linguistic Theory Volume 194. ISBN 90 272 3701 8 (Eur.), 1 55619 972 4 (US). Amsterdam/Philadelphia: John Benjamins. Price US$ 75,00 (Hb).
7th International Conference on Language Resources and Evaluation, 19-21 May 2010, Valletta, Malta. Human language technologies (HLT) can play a vital role in bridging the digital divide and thus the HLT field has been recognised as a priority area by the South African government. The authors present the work on conducting a technology audit on the S...
In this article we give an overview of various aspects of a project developing a spelling checker for Afrikaans. We discuss two of the main aims of the project, viz. for researchers to obtain practical experience, and to further learning of both researchers and students. This article, therefore, consists of two relatively independent parts that eac...
Determining the algorithmic parameter combinations that deliver the best performance in applications using machine learning algorithms is a very important part in the development process. Exhaustive searches are slow and computationally expensive, which motivates the investigation of more efficient methods of automatic algorithmic parameter optimis...
This is the author's version of the work. The definitive version was published in the Proceedings of the RANLP'2009 International Conference Recent Advances in Natural Language Processing. Borovets, Bulgaria, 14-16 September, 2009. pp 65-70. Annotation of training data for machine learning is often a laborious and costly process. In Active Learning...
The development of a hyphenator and compound analyser for Afrikaans
The development of two core-technologies for Afrikaans, viz. a hyphenator and a compound analyser is described in this article. As no annotated Afrikaans data existed prior to this project to serve as training data for a machine learning classifier, the core-technologies in questi...
Developing digital resources is an expensive and time-consuming endeavour, especially in the case of less-resourced languages. We developed TurboAnnotate in an attempt to accelerate the annotation of linguistic data by means of bootstrapping linguistic data for machine-learning purposes. The design and functionality of the tool is given to sho...
This article is intended to address two under-explored aspects in Van Huyssteen (2000), namely, the verification of his findings and their refinement by means of a corpus of written Afrikaans, and also to provide a better description of the phonological pole of the Afrikaans reduplication construction. Within a usage-based approach it is shown that...
Within the South African context, e-learning provides various opportunities to contribute towards a multilingual society. This paper describes a new project, ICALLESAL (Intelligent Computer-Assisted Language Learning for Eleven South African Languages), where an e-learning system is being developed for the acquisition of the official South African...
In this article, aspects of a cognitive usage-based descriptive framework for Afrikaans grammar are spelled out in detail, with the aim of setting the agenda for the description of Afrikaans morphological (and other) constructions. Constructs and terms from cognitive grammar that are necessary for the description of such constructions are explicate...
The BA Language Technology program was recently introduced at the North-West University and is, to date, the only one of its kind in South Africa. This paper gives an overview of the program, which consists of computational linguistic subjects as well as subjects from languages, computer science, mathematics, and statistics. A brief discussion of the c...
Current spelling checkers for Afrikaans still do not provide full access to desired linguistic performance, especially with respect to high lexical recall and error precision. One of the main problems is that Afrikaans is an agglutinative language, with a high lexical generative power, using concatenative compound formation. This means that...
This contribution is aimed at an explanation of the motivation for the composition of Afrikaans grammatical and onomatopoeic reduplications within a cognitive grammar framework. It is illustrated that we can adequately describe the formation of reduplications in terms of four valence factors, viz. correspondence, profile determinacy, autonomy/...
Three theoretical presuppositions of a cognitive usage-based model of grammar are identified in this article, viz. that language is an integral part of human cognition, that grammar is inherently meaningful, and that usage-based data are essential for any linguistic description and explanation. These three presuppositions can be considered the most...
In this paper we describe the development of an improved spelling checker for Afrikaans. We compare two currently available spelling checkers and discuss their shortcomings. The existing applications are restricted in their suggestion capabilities, as well as their precision and recall, mainly because they cannot treat morphologically complex wor...
The vowels of Black South African English (BSAE) are investigated in this article. Existing research findings are summarised and discussed alongside data from two new experiments. The findings indicate that BSAE differs considerably from ‘White’ South African English with regard to both monophthongs and diphthongs. The main monophthong features of BSAE ...
Lexical items that refer to aspects of the sexual domain often pose problems for the lexicographer. For example, it is often difficult to account for the workings of extension mechanisms such as metaphor and metonymy in the lexicographic treatment of sexual expressions. In this article, four Afrikaans dictionaries are ...
Cognitive metaphor theories indicate that sexual metaphors are metaphors in which a taboo domain of knowledge is described in terms of a non-taboo domain of knowledge. Data from the Afrikaans language substantiate the idea that mappings between these two domains are motivated on cognitive as well as on pragmatic grounds. In this article, it will be...
20th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA). Stellenbosch, South Africa, 30 November - 01 December 2009. For fast-tracking the development of resources for resource-scarce languages, one could transfer existing technologies from one language to another well-sourced, closely-related language. In this contribut...
2nd AFLaT workshop at the Seventh International Conference on Language Resources and Evaluation (LREC) 2010. Valletta, Malta. 19-21 May 2010. Human language technologies (HLT) have been identified as a priority area by the South African government to enable its eleven official languages technologically. We present the results of a technology audit for t...