A Multi-Lingual Dictionary of Dirty Words
Jonas Sjöbergh and Kenji Araki
Graduate School of Information Science and Technology
Hokkaido University
Sapporo, Japan
{js, araki}@media.eng.hokudai.ac.jp
Abstract
We present a multi-lingual dictionary of dirty words. We have collected about 3,200 dirty words in several languages and built a database of them. The language with the most words in the database is English, though there are also several hundred dirty words in, for instance, Japanese. Words are classified by their general meaning, such as which part of the human anatomy they refer to. Words can also be assigned a nuance label indicating whether a word is a cute word used when speaking to children, a very rude word, a clinical word, etc. The database is available online and will hopefully be enlarged over time. It has already been used in research on, for instance, automatic joke generation and emotion detection.
1. Introduction
Dictionaries can be tremendously useful in many language processing tasks, and are also useful sources of information for human readers. One category of words that is rarely included in large numbers in most dictionaries is "dirty words". By that we mean words that are generally not used in polite company, referring to, for instance, sexually related things, bodily functions, or cuss words and insults.
We have collected dirty words in several languages and built a multi-lingual dictionary linking words in different languages with similar meanings. Other types of annotation are also possible; many words are, for instance, annotated with the nuance they carry, i.e. whether a word is really rude or perhaps a euphemism, and so on. Information on whether a word is unambiguously a dirty word or can also have non-dirty meanings can also be added.
This dictionary can of course be used in translation applications, to find appropriate translation candidates for words and phrases that are perhaps hard to find in other dictionaries. A dictionary of dirty words is also useful in many other ways; there have, for example, been quite a few products launched worldwide where the producer later found out that in some markets the product name was a dirty word or otherwise gave a bad impression. Such problems could be mitigated by having access to a large resource of dirty words in different languages.
The dictionary can also be used in monolingual natural
language processing applications where information about
dirty words is useful. Three examples of areas where our
database has already been used are:
Humor recognition (Sjöbergh and Araki, 2007b). In a
machine learning approach to classify texts as either
jokes or not, some features based on the presence of
dirty words in the text were used. A high presence of
dirty words was useful as an indication that the text
was a joke.
Humor generation (Sjöbergh and Araki, 2007a; Sjöbergh and Araki, 2008). Dirty words and euphemisms are
common in jokes, and are thus useful in automatic joke
generation. A system generating rather weak puns was
perceived as slightly funnier if the punch line of the
pun was a dirty word. Other joke generation methods were based on changing parts of idioms to similar-sounding euphemisms (dirty words) for sex, etc.
Emotion recognition (Ptaszynski et al., 2007). In a similar way to the humor recognition case, sentences with dirty words tended to carry emotive content in an experiment on emotion detection in Japanese.
Other uses of dirty words (though not of our database) in natural language processing include detecting whether a message is a flame (Spertus, 1997), and other machine learning approaches to humor recognition (Mihalcea and Strapparava, 2005).
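As a concrete illustration of the monolingual uses above, presence-based dirty-word features for text classification could be sketched as follows. The tiny word list and the feature names here are our own illustrative stand-ins, not the published systems' actual features.

```python
# Sketch of dirty-word presence features for text classification, in the
# spirit of the humor recognition work cited above. The word list and the
# feature names are toy examples, not the actual system.
DIRTY_EXPRESSIONS = {"dammit", "trouser anaconda"}

def dirty_word_features(text):
    tokens = text.lower().split()
    joined = " ".join(tokens)
    # Substring matching lets multi-word expressions match as well.
    hits = sum(1 for expr in DIRTY_EXPRESSIONS if expr in joined)
    return {
        "dirty_count": hits,
        "dirty_ratio": hits / max(len(tokens), 1),
        "has_dirty": hits > 0,
    }
```

A classifier could then use these counts alongside other features, as in the machine learning setup described for humor recognition.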
2. Collecting the Dirty Words
We have collected dirty words and short phrases from several different sources and in several different languages to add to our database. These were then annotated manually with various types of information. The original intended use was humor generation and humor recognition in English and Japanese, so these two languages received the most focus.
The single largest source of dirty words was a list collected by George Carlin1, containing about 2,400 dirty word expressions in English. Most of these are euphemisms, tending towards joke-like expressions, for example "trouser anaconda".
For Japanese we extracted all words in the EDICT
dictionary (Breen, 1995) marked with the “vulgar” flag, and
also added various short lists of dirty words found on the
Internet. We also had several native speakers of Japanese
simply write down a lot of dirty words that they could come
up with by looking at the other words in the list.
We have also found useful information in the Alternative Dictionaries2, the Swearsaurus3, and Wikicurse4, which are collections of "bad words" in many languages. There are also many bad words in these resources in other languages that we have not added to our database, mainly because of a lack of native speakers to check whether the words are really of the kinds we want. These could of course be added later if one so wishes.

1 http://www.georgecarlin.com/dirty/2443.html
2 http://www.notam02.no/~hcholm/altlang/
3 http://www.insultmonger.com/swearing/
4 http://www.wikicurse.com/
After collecting the dirty words, we annotated them by hand with different types of information. Not all words are annotated with all types of information yet. Annotation regarding the meaning, nuance, and ambiguity of a word or phrase is possible.
3. Structure
In the dictionary the words are annotated with the following information: how to write the word, how to pronounce the word, the meaning of the word, the nuance of the word, whether the word is ambiguous in the sense that it has non-dirty meanings too, what language the word comes from, and the part of speech of the word. The dictionary also contains many multi-word expressions, though they are treated as one unit and we will refer to these too as "words" in this paper, except when talking specifically about the number of words in the expressions.
The only information that is mandatory for a word is how to write it. All other fields can be left unspecified, though so far all words are also annotated with the language they come from. Pronunciation is currently only provided for the Japanese words, for which it can be non-trivial to figure out the reading of the ideographic characters used for writing. The same ideographic character sequence can have several different readings, some of which can be dirty words while others are not.
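The fields above can be sketched as a relational schema. The table and column names here are our own illustration, not the project's actual schema; per the description above, only the surface form is mandatory.

```python
import sqlite3

# A minimal sketch of the dictionary structure as a relational schema.
# Table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meaning (              -- interlingua-like meaning objects
    id          INTEGER PRIMARY KEY,
    description TEXT NOT NULL,      -- e.g. 'cuss word interjection'
    category    TEXT                -- 'sex', 'bodily functions', 'insults', or NULL
);
CREATE TABLE word (
    id            INTEGER PRIMARY KEY,
    surface       TEXT NOT NULL,    -- how to write the word (the only mandatory field)
    pronunciation TEXT,             -- currently given for Japanese words only
    language      TEXT,             -- so far annotated for all words
    nuance        TEXT,             -- 'euphemism', 'clinical', 'very bad', ...
    ambiguity     TEXT,             -- 'always dirty', 'ambiguous', ...
    pos           TEXT,             -- part of speech, mostly unannotated
    meaning_id    INTEGER REFERENCES meaning(id)
);
""")
conn.execute("INSERT INTO meaning VALUES (1, 'cuss word interjection', NULL)")
conn.execute("INSERT INTO word (surface, language, meaning_id) "
             "VALUES ('dammit', 'English', 1)")
```

Linking words to shared meaning objects, rather than to each other directly, is what lets entries in different languages be connected through a common meaning.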
The meanings are specified by links to special "interlingua"-like objects. These describe the general meaning of a word using English (though adding explanations in other languages is of course also possible). Currently only the general meaning is given, such as what part of the human anatomy a word refers to, or that it is some form of fornication. More detailed classifications can be added later if found necessary for a specific application.
These interlingua meaning objects are also grouped into three general groups: sex related, bodily functions, and insults. The meanings of some words do not fit into any of these three categories, in which case what group the meaning belongs to is left unspecified. An example of an interlingua object is "cuss word interjection" for things such as "dammit".
The nuances of words indicate whether a word is a clinical word used, for instance, in doctor-patient conversations, a "cute" word used when speaking to children, a euphemism, or an "extra bad" word (very rude), etc. We have found use for this type of information ourselves in other experiments, for instance in humor generation, where really bad words tended to offend rather than entertain, and clinical words did not sound very funny either. This type of information could also be useful, for instance, when selecting from different translation candidates, so as to find a translation with a similar nuance in the target language.
Nuance can of course be hard to determine for some words. Words can be perceived as very rude by some people and as fairly OK by others. The same word can also be very rude in some contexts and not rude at all in others. Currently we have not made any effort at more detailed descriptions of nuances, but if there is interest in the future it could be added. Many words are unproblematic though, and can fairly easily be annotated with a simple description of their nuance.

Language      Words
English        2402
Japanese        397
Swedish         158
Bulgarian       147
Polish          125
Total          3229

Table 1: The number of words and expressions in different languages currently in the database.
The ambiguity field indicates whether a word has both dirty meanings and non-dirty meanings. It is possible to just note that both are possible (e.g. "pussy") or that a word is always dirty (e.g. "fuck"), and it is also possible to specify in more detail whether the dirty meaning is much more common than any non-dirty meanings (as perhaps "cock"), or whether the word is generally not dirty but can be in special contexts (e.g. the words "it" or "there" in many languages). Which meaning is more common can of course in many cases be rather hard to judge, in which case just noting that the word is ambiguous is enough.
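The ambiguity levels just described form a small ordered scale, which an application could represent as an enumeration. The names below are our own shorthand for illustration, not labels used in the database.

```python
from enum import Enum

# The ambiguity levels described above; names are illustrative shorthand.
class Ambiguity(Enum):
    ALWAYS_DIRTY = "always dirty"                              # e.g. "fuck"
    MOSTLY_DIRTY = "ambiguous, dirty meaning most common"      # e.g. "cock"
    AMBIGUOUS = "ambiguous, can be either dirty or not"        # e.g. "pussy"
    MOSTLY_CLEAN = "ambiguous, non-dirty meaning most common"  # e.g. "it", "there"

def needs_context_check(a: Ambiguity) -> bool:
    """Only unambiguously dirty words can be flagged without looking at context."""
    return a is not Ambiguity.ALWAYS_DIRTY
```

A filtering or detection application would, for example, treat ALWAYS_DIRTY words as safe to flag directly, while the other levels call for some disambiguation first.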
Part of speech is currently mostly not given, though the field was added since this information was available in some of the sources we used to build the dictionary, and was used in some of our text generation experiments using the data. Other grammatical information could also be useful but is currently not given. For instance, it could be useful to know what forms the multi-word expressions can take, and whether parts of an expression can be modified by adjectives etc. without losing their dirty impression.
The language field simply indicates what language a word comes from. If the same string is a dirty word in several languages, a separate entry is made in the database for each language. The same is true if a word can mean several different dirty things in the same language.
The dictionary is stored in an SQL database. The database
has a primitive web interface that allows searching the
database, downloading the whole dictionary, adding new
words and meanings, and annotating existing words with
meanings, nuances, etc.
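One use the linked-meanings design enables is looking up translation candidates by shared meaning object. The following sketch uses an invented, simplified table and toy rows for illustration; "fan" is a Swedish cuss word, given here purely as sample data.

```python
import sqlite3

# Sketch: finding translation candidates through a shared meaning link.
# The schema and rows are invented toy data for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (surface TEXT, language TEXT, meaning_id INTEGER);
INSERT INTO word VALUES
    ('dammit', 'English', 1),
    ('fan',    'Swedish', 1),  -- a Swedish cuss word with the same meaning
    ('cock',   'English', 2);
""")

def translations(surface, target_language):
    """All words in the target language linked to the same meaning object."""
    return [r[0] for r in conn.execute(
        """SELECT w2.surface
           FROM word w1 JOIN word w2 ON w1.meaning_id = w2.meaning_id
           WHERE w1.surface = ? AND w2.language = ?""",
        (surface, target_language))]

print(translations("dammit", "Swedish"))
```

Filtering the candidates further by the nuance field would then let an application prefer, say, a euphemism over a very rude word in the target language.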
4. Statistics
Some statistics showing the contents of the dictionary can be found in Tables 1 to 5. As can be seen in Table 1, the bulk of the words are currently English words. That English has by far the largest number of words is probably caused by English being the most widely used language both on the Internet (a good source of dirty words) and in natural language research (where one could perhaps expect such resources to show up). Somewhere around 150 words seems to be the limit where people who collect dirty words mainly for fun get tired and give up. The differences in the number of words between the languages in our database most likely do not reflect any actual cultural differences in the number of dirty words. They are simply an effect of what purposes we have used the data for, and thus what languages we put the most work into collecting dirty words for.

Category             Words
Sex                   2652
Bodily Functions       261
Insults                211
Unspecified            105

Table 2: The general grouping of the meanings of the words in the database.

Nuance               Words
Euphemism             1462
Fairly bad word         87
Used normally           70
Children's speech       24
Very bad                22
Clinical                13
Unspecified           1551

Table 3: The nuances of the words.
In Table 2, it can be seen that, rather unsurprisingly, the overwhelming majority of the words are sex related. This varies a bit between languages though; some of the lists of "dirty words" we have collected contained mostly insults to be hurled at other people to make them angry. Many of these do have sexual connotations though.
Of the words that have so far been annotated with their nuances, euphemisms are by far the most common, see Table 3, though only about half the words have been annotated so far.
The annotation of the ambiguity of the words has only covered about a third of the data so far, see Table 4. Thus far, about half the words are ambiguous, though a large part of the remaining words are rather long euphemistic expressions that are likely not very ambiguous.
In the final table, Table 5, statistics on the lengths of the expressions are presented. About half the dictionary is made up of multi-word expressions, though not many are made up of four or more words. The longest expression so far is "choke the sheriff and wait for the posse to come", which is an English expression for (male) masturbation.

Ambiguity                                     Words
Ambiguous, can be either dirty or not           686
Always dirty                                    475
Ambiguous, non-dirty meaning most common         26
Ambiguous, dirty meaning most common             17
Unspecified                                    2025

Table 4: The ambiguity of the words.

Length in Words    Expressions
1                         1722
2                          902
3                          398
4                          132
5                           47
6                           20
7                            5
8                            2
10                           1
Multi-Word                1507

Table 5: The lengths of the expressions in the database. Average length is 1.8 words.
This data gives a general idea of the contents of the dictionary, but one should keep in mind that it is a bit complicated to gather this type of information for such different languages. The Swedish part contains many quite long compound words treated as only one word, while a similar word in English would be a multi-word expression. And Japanese has no spaces between words at all, so only a quick cursory check of roughly how many "words" a phrase contains was done for the Japanese part.
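Length statistics like those in Table 5 can be computed by simple whitespace tokenization, as sketched below with invented sample entries. As the caveat above notes, this only works for languages written with spaces; Japanese entries would need word segmentation first.

```python
from collections import Counter

# Sketch: length statistics in the style of Table 5, via whitespace
# tokenization. The sample entries are drawn from the paper's examples.
expressions = [
    "dammit",
    "trouser anaconda",
    "choke the sheriff and wait for the posse to come",
]

lengths = Counter(len(expr.split()) for expr in expressions)
multi_word = sum(n for length, n in lengths.items() if length > 1)
print(sorted(lengths.items()), multi_word)
```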
5. Availability
The dictionary is freely available on the web5, though the web interface is still very primitive. It is possible to download the whole dictionary, and also to add new words, and to change or add annotations to the words already in the database, etc.

We plan to extend the database ourselves, both by adding more words to the languages already included and by adding more languages. Any volunteers are of course also welcome to add more data. We also plan to improve the web interface.
6. Conclusions
We presented a dictionary of dirty words in several languages. The meanings of the words are linked, so it can be used to find, for instance, translations of dirty words into other languages. The nuances of the words (really rude, clinical, euphemism, etc.) are also annotated, which can help in selecting an appropriate translation. Words that are ambiguous, in the sense that they also have non-dirty meanings, can be annotated with this information as well.

The dictionary contains about 3,000 words and expressions: 2,400 in English, 400 in Japanese, and slightly over 100 words each in Bulgarian, Polish, and Swedish. It is freely available on the Internet, and it is also possible for volunteers to contribute new words to the dictionary.
5 http://dr-hato.se/projects/dirtywords/
So far, the contents of the dictionary have mainly been used
in monolingual applications, for instance humor generation
and emotion classification.
Acknowledgements
This work was done as part of a project funded by the
Japanese Society for the Promotion of Science (JSPS). We
would like to thank some of the anonymous reviewers for
interesting suggestions for extending our work. We would
also like to thank the volunteers who have contributed dirty
words to the dictionary, especially Svetoslav Dankov who
also helped out with various practical things.
7. References
Jim Breen. 1995. Building an electronic Japanese-English dictionary. In Japanese Studies Association of Australia Conference, Brisbane, Australia.
Rada Mihalcea and Carlo Strapparava. 2005. Making computers laugh: Investigations in automatic humor recognition. In Proceedings of HLT/EMNLP, Vancouver, Canada.
Michal Ptaszynski, Pawel Dybala, Wen Han Shi, Rafal Rzepka, and Kenji Araki. 2007. Lexical analysis of emotiveness in utterances for automatic joke generation. ITE Technical Report, Vol. 31, No. 47, pages 39–42, ME2007-204.
Jonas Sjöbergh and Kenji Araki. 2007a. Automatically creating word-play jokes in Japanese. In Proceedings of NL-178, pages 91–95, Nagoya, Japan.
Jonas Sjöbergh and Kenji Araki. 2007b. Recognizing humor without recognizing meaning. In Francesco Masulli, Sushmita Mitra, and Gabriella Pasi, editors, Proceedings of WILF 2007, volume 4578 of Lecture Notes in Computer Science, pages 469–476, Camogli, Italy. Springer.
Jonas Sjöbergh and Kenji Araki. 2008. What is poorly said is a little funny. In Proceedings of LREC-2008, Marrakech, Morocco.
Ellen Spertus. 1997. Smokey: Automatic recognition of hostile messages. In Innovative Applications of Artificial Intelligence (IAAI), pages 1058–1065, Providence, Rhode Island.
512
... The emotive element database was built using data from different research [13], [14], [15], [16] and divided into interjections, mimetics, endearments, vulgarities, and representations of non-verbal emotive elements, such as exclamation marks or ellipsis. The database of emotive expressions contains Nakamura's collection [11] ...
... Based on the linguistic approach and assumptions described above The emotive element database was built using data from different research [13], [14], [15], [16] and divided into interjections, mimetics, endearments, vulgarities, and representations of non-verbal emotive elements, such as exclamation marks or ellipsis. The database of emotive expressions contains Nakamura's collection [11]. ...
Conference Paper
Full-text available
The paper presents a support method for affect analysis of utterances in Japanese. One of the problems in the system for affect analysis developed by us before was confusing the valence of emotion types in the final stage of analysis. The cause of this problem was extracting from the utterance only the emotive expression keyword without its grammatical context. To solve this problem we enhance the emotion types extraction procedure in the baseline system with grammatical analysis using Contextual Valence Shifters (CVS). CVS are words, or phrases such as "not", "very much" "not quite", which determine the semantic orientation of the valence of emotive expressions.
... To achieve this objective and to answer the research question, there are three main challenges to adding a layer to the social media platform. The first is finding a multi-lingual dataset in order to obtain high accuracy in detecting inappropriate content [23,[131][132][133][134][135]. Such a dataset requires efforts to construct it using a large repository of multi-lingual text that contains inappropriate content by utilizing word synonyms such as WordNet and multi-lingual dictionaries. ...
Article
Full-text available
Background: With digital transformation and growing social media usage, kids spend considerable time on the web, especially watching videos on YouTube. YouTube is a source of education and entertainment media that has a significant impact on the skill improvement, knowledge, and attitudes of children. Simultaneously, harmful and inappropriate video content has a negative impact. Recently, researchers have given much attention to these issues, which are considered important for individuals and society. The proposed methods and approaches are to limit or prevent such threats that may negatively influence kids. These can be categorized into five main directions. They are video rating, parental control applications, analysis meta-data of videos, video or audio content, and analysis of user accounts. Objective: The purpose of this study is to conduct a systematic review of the existing methods, techniques, tools, and approaches that are used to protect kids and prevent them from accessing inappropriate content on YouTube videos. Methods: This study conducts a systematic review of research papers that were published between January 2016 and December 2022 in international journals and international conferences, especially in IEEE Xplore Digital Library, ACM Digital Library, Web of Science, Google Scholar, Springer database, and ScienceDirect database. Results: The total number of collected articles was 435. The selection and filtration process reduced this to 72 research articles that were appropriate and related to the objective. In addition, the outcome answers three main identified research questions. Significance: This can be beneficial to data mining, cybersecurity researchers, and peoples’ concerns about children’s cybersecurity and safety.
... {213} 3.[HY] Hypocorystrics (Kamei et al., 1996). {8} 4.[VU] Vulgarisms (Sjöbergh, 2008). {200} 5.[EM] Emotive marks (Kamei et al., 1996). ...
... This group is realised by various parts of speech, like nouns, verbs, adjectives, etc. Examples are: aijou (love), kanashimu (feel sad), ureshii (happy), respectively. The emotive element database was built using data from different research (Takane et al., 1995Takane et al., –1998 Tsuchiya, 1999; Baba, 2003; Sjöbergh and Araki, 2008) and divided into interjections, mimetic expressions, endearments, vulgarities, and representations of non-verbal emotive elements, such as exclamation marks or ellipsis. An algorithm detecting emoticons was also added, as they are symbols commonly used in everyday text-based communication tools. ...
Article
Full-text available
This paper presents a novel method for estimating speaker's affective states based on two contextual features: valence shifters and appropriateness. Firstly, a system for affect analysis is used to recognise specific types of emotions. We improve the baseline system with the analysis of Contextual Valence Shifters (CVS), which determine the semantic orientation of emotive expressions. Secondly, a web mining technique is used to verify the appropriateness of the recognised emotions for the particular context. Verification of contextual appropriateness of emotions is the next step towards implementation of Emotional Intelligence Framework in machines. The proposed method is evaluated using two conversational agents.
... For detection, there are many literatures discussing about detecting offensive language in sentence level [3] or message level [10] . For offensive lexicon generation, [9] presents a study. We believe offensive language detection a very challenging problem worthy of separate treatment. ...
Article
Offensive language has arisen to be a big issue to the health of both online communities and their users. To the online community, the spread of offensive language undermines its reputation, drives users away, and even directly affects its growth. To users, viewing offensive language brings negative influence to their mental health, especially for children and youth. When offensive language is detected in a user message, a problem arises about how the offensive language should be removed, i.e. the offensive language filtering problem. To solve this problem, manual filtering approach is known to produce the best filtering result. However, manual filtering is costly in time and labor thus can not be widely applied. In this paper, we analyze the offensive language in text messages posted in online communities, and propose a new automatic sentence-level filtering approach that is able to semantically remove the offensive language by utilizing the grammatical relations among words. Comparing with ex-isting automatic filtering approaches, the proposed filtering approach provides filtering results much closer to manual filtering. To demonstrate our work, we created a dataset by manu-ally filtering over 11,000 text comments from the YouTube website. Experiments on this dataset show over 90% agree-ment in filtered results between the proposed approach and manual filtering approach. Moreover, we show the overhead of applying proposed approach to user comments filtering is reasonable, making it practical to be adopted in real life applications.
Article
Full-text available
With globalization and cultural exchange around the globe, most of the population gained knowledge of at least two languages. The bilingual user base on the Social Media Platform (SMP) has significantly contributed to the popularity of code-mixing. However, apart from multiple vital uses, SMP also suffer with abusive text content. Identifying abusive instances for a single language is a challenging task, and even more challenging for code-mix. The abusive posts detection problem is more complicated than it seems due to its unseemly, noisy data and uncertain context. To analyze these contents, the research community needs an appropriate dataset. A small dataset is not a suitable sample for the research work. In this paper, we have analyzed the dimensions of Devanagari-Roman code-mix in short noisy text. We have also discussed the challenges of abusive instances. We have proposed a cost-effective methodology with 20.38% relevancy score to collect and annotate the code-mix abusive text instances. Our dataset is eight times to the related state-of-the-art dataset. Our dataset ensures the balance with 55.81% instances in the abusive class and 44.19% in the non-abusive class. We have also conducted experiments to verify the usefulness of the dataset. We have performed experiments with traditional machine learning techniques, traditional neural network architecture, recurrent neural network architectures, and pre-trained Large Language Model (LLM). From our experiments, we have observed the suitability of the dataset for further scientific work.
Article
Language is used by humans to carry out a wide range of social activities that reflect the social standards that have been established during various periods of time. Swearing is a linguistic device that serves both personal and interpersonal functions in almost all languages. As a result, linguists and anyone interested in language should pay attention to this linguistic phenomenon, which is also accompanied by a certain amount of taboo. The current study focuses on English and Iraqi Arabic societies because they are wealthy in the use of swear words. Its aims is to define, classify, state functions, and identify the history of swear words in language and culture. Also, it highlights and analyzes the most important swear words in English and Iraqi Arabic, and concludes the most essential findings. To fulfil the above aims, it is hypothesized that swear words come in a variety of forms, some of which are more widely used than others, culture influences the use of swear words, and the connotative meaning is determined by the context in which swear words are used. To sum, a range of elements influence the communicative act of swear words, including age, gender, social and economic status, as well as the speaker-listener relationship.
Chapter
The paper presents the distribution of pragmatic markers (PM) of Russian everyday speech in two types of discourse: dialogical and monologic. PMs are an essential part of any oral discourse, therefore, quantitative data on their distribution are necessary for solving both theoretical and practical tasks related to studies of speech communication, as well as for translation and teaching Russian as a foreign language. The article describes samples from two speech corpora: “One Speaker’s Day” (ORD corpus, consisting of mostly dialogue speech, the annotated subcorpus containing 321 504 tokens) and “Balanced Annotated Text Library” (SAT corpus, which consists only of monologues, the annotated subcorpus containing 50 128 tokens). Besides, it presents statistical data of PM distributions obtained for 60 basic (invariant) markers, PMs common in both dialogue and monologue (for example, hesitative marker such as vot, tam, tak) are identified, as well as those that are more typical for monologues (boundary markers like znachit, nu vot, vs’o) or dialogues (‘xeno’-markers like takoj, grit; and meta-communicative markers vidish’, (ja) ne znaju). Special attention is paid to PMs usage both in different communication situations and in speech of different sociolects.
Chapter
The abusive posts detection problem is more complicated than it seems due to its unseemly, unstructured noisy data and unpredictable context. The learning performance of the neural networks attracts researchers to get the highest performing output. Still, there are some limitations for noisy data while training for a neural network. In our work, we have proposed an approach that considers the assets of both the machine learning and neural network to get the most optimum result. Our approach performs with the F1 score of 92.79.
Article
Full-text available
We present ML-Ask – the first Open Source Affect Analysis system for textual input in Japanese. ML-Ask analyses the contents of an input (e.g., a sentence) and annotates it with information regarding the contained general emotive expressions, specific emotional words, valence-activation dimensions of overall expressed affect, and particular emotion types expressed with their respective expressions. ML-Ask also incorporates the Contextual Valence Shifters model for handling negation in sentences to deal with grammatically expressible shifts in the conveyed valence. The system, designed to work mainly under Linux and MacOS, can be used for research on, or applying the techniques of Affect Analysis within the framework Japanese language. It can also be used as an experimental baseline for specific research in Affect Analysis, and as a practical tool for written contents annotation. Funding statement: This research has been supported by: a Research Grant from the Nissan Science Foundation (years 2009–2010), The GCOE Program founded by Japan’s Ministry of Education, Culture, Sports, Science and Technology (years 2009–2010), (JSPS) KAKENHI Grant-in-Aid for JSPS Fellows (Project Number: 22-00358) (years 2010–2012), (JSPS) KAKENHI Grant-in-Aid for Scientific Research (Project Number: 24600001) (years 2012–2015), (JSPS) KAKENHI Grant-in-Aid for Research Activity Start-up (Project Number: 25880003) (years 2013–2015), and (JSPS) KAKENHI Grant-in-Aid for Encouragement of Young Scientists (B) (Project Number: 15K16044) (years 2015-present, project estimated to end in March 2018).
Conference Paper
Full-text available
We present a system for generating wordplay jokes in Japanese, which generates riddle style puns. By using different lexicons, different results can be achieved. Web searches are used to generate hints for the riddles. A subset of the generated riddles is evaluated manually. Using a naughty word lexicon gave funnier riddles than using normal words, though computer generated riddles in general were less funny than human generated riddles. Many computer generated riddles contain broken grammar or in other ways fail to make sense.
Conference Paper
Full-text available
We present a machine learning approach for classifying sentences as one-liner jokes or normal sentences. We use no deep analysis of the meaning to try to see if it is humorous, instead we rely on a combination of simple features to see if these are enough to detect humor. Features such as word overlap with other jokes, presence of words common in jokes, ambiguity and word overlap with common idioms turn out to be useful. When training and testing on equal amounts of jokes and sentences from the British National Corpus, a classification accuracy of 85% is achieved.
We implement several different methods for generating jokes in English. The common theme is to intentionally produce poor utterances by breaking Grice’s maxims of conversation. The generated jokes are evaluated and compared to human-made jokes. They are in general quite weak jokes, though there are a few high-scoring jokes and many jokes that score higher than the most boring human joke.
Humor is one of the most interesting and puzzling aspects of human behavior. Despite the attention it has received in fields such as philosophy, linguistics, and psychology, there have been only a few attempts to create computational models for humor recognition or generation. In this paper, we bring empirical evidence that computational approaches can be successfully applied to the task of humor recognition. Through experiments performed on very large data sets, we show that automatic classification techniques can be effectively used to distinguish between humorous and non-humorous texts, with significant improvements observed over a priori known baselines.
Abusive messages (flames) can be both a source of frustration and a waste of time for Internet users. This paper describes some approaches to flame recognition, including a prototype system, Smokey. Smokey builds a 47-element feature vector based on the syntax and semantics of each sentence, combining the vectors for the sentences within each message. A training set of 720 messages was used by Quinlan's C4.5 decision-tree generator to determine feature-based rules that were able to correctly categorize 64% of the flames and 98% of the non-flames in a separate test set of 460 messages. Additional techniques for greater accuracy and user customization are also discussed. Introduction: Flames are one of the current hazards of on-line communication. While some people enjoy exchanging flames, most users consider these abusive and insulting messages to be a nuisance or even upsetting. I describe Smokey, a prototype system to automatically recognize email flames. Smokey combines na...
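The overall pipeline shape described above (per-sentence feature vectors, combined per message, then rule-based classification) can be sketched as follows. The features, word list, and thresholds are invented stand-ins; Smokey's actual 47 features and its C4.5-learned rules are not reproduced here.

```python
# Sketch of the Smokey-style pipeline: build a small feature vector per
# sentence, sum the vectors over the message, then apply decision rules.
# INSULT_WORDS, the three features, and the rule are hypothetical.

INSULT_WORDS = {"idiot", "stupid", "moron"}  # hypothetical insult lexicon

def sentence_features(sentence):
    """Return a 3-element feature vector for one sentence."""
    tokens = sentence.lower().split()
    return [
        sum(t.strip("!.,?") in INSULT_WORDS for t in tokens),  # insult-word count
        sentence.count("!"),                                    # exclamation marks
        int("you" in tokens),                                   # second-person address
    ]

def message_vector(message):
    """Combine per-sentence vectors by element-wise summation."""
    vecs = [sentence_features(s) for s in message.split(".") if s.strip()]
    return [sum(col) for col in zip(*vecs)] if vecs else [0, 0, 0]

def is_flame(message):
    insults, bangs, second_person = message_vector(message)
    # Stand-in for rules learned by a decision-tree algorithm such as C4.5.
    return insults > 0 and (second_person > 0 or bangs > 1)

print(is_flame("You are an idiot."))       # True
print(is_flame("Thanks for the report."))  # False
```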
This paper describes an on-going project to develop and maintain a comprehensive electronic Japanese-English dictionary capable of use within a variety of search-and-display, electronic-text reading support, and machine translation environments. The project consists of two parts: (a) the compilation of two major data files, a Japanese-English lexicon (EDICT) and a kanji information database (KANJIDIC) — at the time of writing, the former has over 100,000 entries; and (b) the development of software to index, search, and display entries in the data files. This software, which has now been released on a variety of computing platforms, including Unix, PC (DOS and Windows), and Macintosh, can operate either as a stand-alone dictionary, providing the functions of both normal word/phrase dictionaries and character dictionaries, or as a support package for reading electronic text, by automatically glossing selected words and phrases. 1 INTRODUCTION This paper describes a project to build an elec...
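The classic EDICT data file stores one entry per line as a headword, an optional reading in square brackets, and slash-delimited glosses. A minimal sketch of parsing such a line is shown below; the sample entry is constructed for illustration.

```python
# Sketch of parsing one line of the classic EDICT format:
#   HEADWORD [READING] /gloss 1/gloss 2/
# The reading field is optional. The sample line is illustrative.
import re

EDICT_LINE = re.compile(r"^(\S+)(?:\s+\[([^\]]+)\])?\s+/(.+)/$")

def parse_edict_line(line):
    """Parse an EDICT-format line into headword, reading, and glosses."""
    m = EDICT_LINE.match(line.strip())
    if not m:
        return None
    headword, reading, glosses = m.groups()
    return {
        "headword": headword,
        "reading": reading,          # None when the entry has no reading field
        "glosses": glosses.split("/"),
    }

entry = parse_edict_line("辞書 [じしょ] /dictionary/lexicon/")
print(entry["glosses"])  # ['dictionary', 'lexicon']
```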
Michal Ptaszynski, Pawel Dybala, Wen Han Shi, Rafal Rzepka, and Kenji Araki. 2007. Lexical analysis of emotiveness in utterances for automatic joke generation. ITE Technical Report, Vol. 31, No. 47, pages 39–42, ME2007-204.
Jonas Sjöbergh and Kenji Araki. 2007a. Automatically creating word-play jokes in Japanese. In Proceedings of NL-178, pages 91–95, Nagoya, Japan.
Jonas Sjöbergh and Kenji Araki. 2007b. Recognizing humor without recognizing meaning. In Francesco Masulli, Sushmita Mitra, and Gabriella Pasi, editors, Proceedings of WILF 2007, volume 4578 of Lecture Notes in Computer Science, pages 469–476, Camogli, Italy.