
Modelling the Socialization of Creative Agents in a Master-Apprentice Setting: The Case of Movie Title Puns


Abstract

This paper presents work on modelling the social psychological aspect of socialization in the case of a computationally creative master-apprentice system. In each master-apprentice pair, the master, a genetic algorithm, is seen as a parent for its apprentice, which is an NMT-based sequence-to-sequence model. The focus of this study is the effect of different parenting styles on the creative output of each pair. This approach brings a novel viewpoint to computational social creativity, which has in the past mainly focused on computationally creative agents on a socially equal level, whereas our approach studies the phenomenon in the context of a social hierarchy.
Mika Hämäläinen
Department of Digital Humanities
Faculty of Arts
University of Helsinki
Khalid Alnajjar
Department of Computer Science
Faculty of Science
University of Helsinki
The master-apprentice approach to computational creativity, introduced by Alnajjar and Hämäläinen (2018), has been shown to achieve creative autonomy, and its creativity has been thoroughly discussed and motivated. However, the social nature of a master-apprentice pair and its effect on the creative outcome have remained an open question.
The approach consists of two parts: a master, which is
a genetic algorithm, and an apprentice, which is an LSTM
sequence-to-sequence model. While the master is in charge
of the internal appreciation of the overall system as imple-
mented in its fitness function, the apprentice plays a crucial
role in the creative autonomy as it can learn its standards
partially from its master and partially from its peers.
This paper focuses on the exploration of the master-
apprentice approach from a social psychological point of
view. By modelling the socialization of the apprentice into a
creative society consisting of the master and peers, we seek
to gain a deeper understanding of the phenomenon in terms
of the overall creativity of the system. In addition, modelling
the social aspects of a computationally creative system can
help in understanding creativity as a social phenomenon in
a broader sense (Saunders and Bown 2015).
We motivate the model of socialization with research conducted in the field of social psychology, namely developmental psychology. We select the categorization of parenting styles presented by Baumrind (1991) as the theoretical foundation of our work.
The creative task we tackle in this paper is the creation of humorous movie titles delivering a food-related pun. This consists of taking an existing movie title, such as Beauty and the Beast, and making a pun out of it, such as Beauty and the Beets. As people have written funny movie titles of this sort in great abundance on social media, we can gather parallel data easily.
Related Work
While pun generation has been widely studied in the field of computational creativity (Ritchie 2005; Yu, Tan, and Wan 2018; He, Peng, and Liang 2019), we see the most important contribution of our paper as lying in the realm of social creativity. Therefore, we dedicate this section to describing some of the practical research conducted in computational social creativity.
Research on an agent community consisting of self-
organizing maps (Honkela and Winter 2003), although out-
side of the computational creativity paradigm, presents a
way of simulating the emergence of language. The agents
are capable of meaning negotiation and converging into a
common language to communicate about edibility of differ-
ent food items in their shared world.
Multi-agent systems have been studied in the context of novelty seeking in creative artifact generation (Linkola, Takala, and Toivonen 2016). In that study, the agents exert self-criticism and can vote on and veto creative artifacts. The findings suggest that multiple creative agents can reach a higher level of novelty in their output than a single-agent system.
A recent study (Hantula and Linkola 2018) has been con-
ducted in social creativity in agent societies where the in-
dividuals are goal-aware. The individuals create artifacts of
their own and peer up to collaborate with another agent. The
agents are capable of learning a peer model that guides them
in selecting a collaboration partner.
The papers discussed in this section, as well as other similar previously conducted work (Gabora 1995; Corneli and Jordanous 2015; Pagnutti, Compton, and Whitehead 2016), mostly study the collaboration of agents that have an equal social status, in contrast to our case, where the social status is hierarchical. Therefore, we find that there is a need for the study presented in this paper to shed some light on asymmetrical social relations in computational creativity.
Social Development
The master-apprentice approach gives us an intriguing test bed for modelling different social interactions between the master and the apprentice. With such a complex phenomenon as human social behavior, we are bound to limit our focus to a subarea of the phenomenon. In this section, we describe different psychological approaches to understanding socialization.
Socialization, i.e. becoming a part of a social group, is an important part of the psychological development of an individual, to such a degree that a child who is never exposed to other people will develop neither a language nor an understanding of self. Socialization thus seems to play a crucial role in the higher-level cognitive development of everything that we consider to separate humans from other animals. Perhaps this level of importance is the reason why a great many researchers have dedicated effort to unraveling this mystery.
The ecological systems theory of social development (Bronfenbrenner 1979) highlights the importance of the bidirectionality of different social groups. An individual child is in the middle of the model, but just as the immediate family affects the child, the child is also an actor in the process of socialization. The theory identifies multiple systems, from the close family all the way to the level of society, that play a role in the social development of a child. This theory is quite complex to model computationally.
A take on social development that is simpler to model is that of parenting styles (Baumrind 1991). We find this categorization more suitable as a starting point for modelling the socialization of the apprentice in our master-apprentice approach. The parenting styles can be divided into four main categories: authoritative, authoritarian, permissive and rejecting-neglecting. These categories differ from each other along the two axes of demandingness and responsiveness, as seen in Figure 1.
Figure 1: Parenting styles
The authoritative parents are high on both demandingness and responsiveness. They set rules, but the rules are negotiable. The parenting is more supportive than punitive in nature. The authoritarian parents, on the other hand, are low on responsiveness and high on demandingness. They set non-negotiable rules and expect obedience without explanation. The permissive parents are low on demandingness and high on responsiveness. They are very lenient and avoid confrontation. The rejecting-neglecting parents, however, are low on both axes. They hardly engage in parenting, offer little support and do not set any rules.
The original research on the master-apprentice approach (Alnajjar and Hämäläinen 2018) used the creative tripod (Colton 2008) to define creativity in general and in the context of their work on creating movie titles satirical towards Saudi Arabia, following the notions of the SPECS approach (Jordanous 2012). We use the same creative tripod framework to adapt their definition to our similar task of creating movie titles with food puns. This definition provides us with a reasoned way of evaluating the overall creativity of our systems.
The creative tripod requires three key notions to be present in a system in order for it to achieve creativity. These are skill, imagination and appreciation. All of these components must be present simultaneously in a creative system, or the system will lack creativity.
For our systems to exhibit skill, they need to take a movie title as input and produce a new one with a food pun. As in the case of the earlier master-apprentice approach, the new humorous title should still communicate the original title, i.e. the original name of the movie should be recognizable.
The requirements for a pun are that it resembles the original word in pronunciation and that it is humorous. According to Oring (2003), incongruity results in humour if it is delivered in a playful fashion and accompanied by its resolution. Another, perhaps more concrete, way of looking at humour is to see incongruity as surprise and resolution as coherence, cf. (Brownell et al. 1983).
Surprise in the context of humor means that the brain forms an expectation and this expectation is then broken by the humorous element of the pun. Such is the case in the pun Harry Potter and the Deathly Marshmallows, where the surprise is caused by the fact that the expected word Hallows is replaced by Marshmallows. For the pun to be coherent, it should make sense in the context of the original movie. In this case, the thought of deathly marshmallows attacking Hogwarts, although bizarre, can still be seen as coherent.
For achieving appreciation, the systems will need to be
able to assess the humorousness of the created pun in terms
of surprise, coherence and sound similarity. In addition to
the humour, the system should be able to evaluate the recog-
nizability of the original title.
We define imagination by using the dichotomy of creativity introduced by Boden (2004). This way of understanding creativity divides it into two types: P-creativity and H-creativity. P-creativity is the minimal requirement we set for the imagination of the systems, and it means that a creative entity should be able to come up with something that is novel to itself. H-creativity, on the other hand, refers to an innovation that is novel on a more global scale, i.e. nobody else has come up with a similar creative artifact before. While P-creativity is the minimum requirement, we consider H-creativity a more desirable requirement for imagination.
The Data
As our approach is to generate food-related puns, we need a vocabulary consisting of food-related terms. For this purpose we use the Historical Thesaurus of the Oxford English Dictionary. We use all the nouns recorded under the topic “food and drink” in the external world taxonomy. This list contains 15,314 different nouns.
We extract real movie titles from IMDB² (Internet Movie Database). As we want our movie title corpus to consist only of well-known movies, we want to exclude the lesser-known indie movies. To achieve this, we filter out movies that have received fewer than 100,000 votes, leaving us with 1,661 movie titles. For the master and apprentices, this is further limited to 1,276 titles by filtering out the titles that consist of only one word.
For parallel humorous movie title data (later peer data), we crawl comments on an Instagram post by an entertainment account. People were encouraged to come up with creative movie titles containing a pun related to food. The total number of comments crawled is 16,088⁴. Then, we follow the same approach applied in (Alnajjar and Hämäläinen 2018) to map the crawled data to movie titles. In summary, we preprocess the text to remove any hashtags and mentions, and then we measure the character and word edit distances between the comments and movie titles. Finally, a comment is considered to be a punny variation of the matched movie title with the least edit distance only if it has at most three word differences and at least one word matching the movie title. This process yields 9,294 human-authored movie titles containing a pun.
The Master-Apprentice Model
The master-apprentice model consists of a computationally creative genetic algorithm that implements the criteria set for appreciation in its fitness function and an apprentice that is an NMT (neural machine translation) model. The master generates parallel data for the apprentice to learn from, while the apprentice can also learn from its peers. In our setting, we have four different apprentices, one for each parenting style.
Inspired by the work on slogan generation presented by Alnajjar, Hadaytullah, and Toivonen (2018), we employ a similar generator to act as a master in our model. In our case, the generator, which is a genetic algorithm, receives an original movie title as input and outputs an entire population of movie titles carrying a pun, based on the input movie title.
The master makes use of the food-related vocabulary described earlier to replace words in the original title while optimizing multiple parameters to increase the aptness of the substitution and the punniness of the title. The following subsections elucidate the algorithm.
² Dumps from
⁴ Crawled on the second of February
Evolutionary algorithm. The first step in the evolutionary algorithm is producing the initial population, which then goes through the process of evolution for a certain number of generations. The evolutionary algorithm employed is a standard (µ+λ)⁵ algorithm, in which mutation and crossover are applied to the current population to produce λ offspring. Individuals in the current population and their offspring are then evaluated by the algorithm to find the µ fittest individuals, which survive to the next generation. Once the specified number of generations (10, in our case) is reached, the evolutionary process ends and returns the final population.
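As a rough illustration of the (µ+λ) scheme, the loop below keeps the µ best individuals out of the parents and their λ offspring in every generation. It is a simplified sketch under stated assumptions: it ranks individuals with a single scalar fitness rather than the multi-objective selection used by the master, and `mu_plus_lambda`, `toy_vary` and the integer-list individuals are hypothetical names and data chosen only for the example.

```python
import random


def mu_plus_lambda(population, vary, fitness, mu, lam, generations, rng):
    # Plus-selection: parents compete with their offspring for survival.
    for _ in range(generations):
        offspring = [
            vary(rng.choice(population), rng.choice(population), rng)
            for _ in range(lam)
        ]
        pool = population + offspring
        population = sorted(pool, key=fitness, reverse=True)[:mu]
    return population


def toy_vary(a, b, rng):
    # Toy variation operator: single-point crossover followed by one mutation.
    point = rng.randrange(1, len(a))
    child = a[:point] + b[point:]
    child[rng.randrange(len(child))] = rng.randint(0, 9)
    return child


if __name__ == "__main__":
    rng = random.Random(0)
    start = [[0] * 5 for _ in range(4)]
    final = mu_plus_lambda(start, toy_vary, sum, mu=4, lam=8, generations=10, rng=rng)
    print(final)
```

Because the parents stay in the selection pool, the best fitness in the population can never decrease from one generation to the next.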
Initial Population. The initial population consists of µ copies of the input movie title. For each copy, a randomly selected noun, adjective or verb is replaced with a random word from the vocabulary. We use spaCy (Honnibal and Montani 2017) to parse the titles. We inflect the substituting words using Pattern (De Smedt and Daelemans 2012) to match the morphology of the original word when needed. The altered titles form the initial population.
Mutation and Crossover. In our evolutionary algorithm, we implement one kind of mutation and one kind of crossover. The mutation process substitutes words in the individual in the same fashion as in the creation of the initial population. The crossover employed is a standard single-point crossover, i.e. a random point in the individuals is selected and the words to the right of the point are switched between them.
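The single-point crossover can be illustrated on whitespace-tokenized titles as below. This is a sketch only: the function name is hypothetical, and restricting the cut point to the shorter of the two titles is an assumption made so that the example also works for titles of unequal length.

```python
import random


def single_point_crossover(a, b, rng):
    words_a, words_b = a.split(), b.split()
    limit = min(len(words_a), len(words_b))
    if limit < 2:
        # Nothing to exchange when one title has fewer than two words.
        return a, b
    # Pick a cut point and swap everything to the right of it.
    point = rng.randrange(1, limit)
    child_a = words_a[:point] + words_b[point:]
    child_b = words_b[:point] + words_a[point:]
    return " ".join(child_a), " ".join(child_b)


if __name__ == "__main__":
    rng = random.Random(1)
    print(single_point_crossover("how to train your pepperoni",
                                 "how to roast your dragon", rng))
```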
Evaluation. In our evaluation metric, we propose four internal evaluation dimensions to measure the fitness of an individual. These dimensions are (1) prosody, (2) semantic similarity to “food”, (3) semantic similarity to the original word, and (4) number of altered words. The first two dimensions are maximized, whereas the last two are minimized.
The prosody dimension is a weighted sum of four prosody sub-features: consonance, assonance, rhyme and alliteration. This dimension measures the sound similarity between the original word and its substitution. To measure the sound similarity, we use the espeak-ng tool to generate IPA (International Phonetic Alphabet) transcriptions for assessing the prosody.
To measure the semantic similarity between two words, we employ a pre-trained GloVe model with 6 billion tokens and a dimension size of 300. The model is trained on Wikipedia and the English Gigaword Fifth Edition corpus. Using this semantic model, the second dimension computes the maximum semantic similarity of the words in the title to the word “food”.
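The food-similarity dimension can be sketched as the maximum similarity between any word of the title and the word “food”. The example below assumes cosine similarity, which is the usual choice for word embeddings but is not stated explicitly in the text, and it uses tiny made-up two-dimensional vectors in place of the real 300-dimensional GloVe model; all names are hypothetical.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def food_similarity(title, vectors, anchor="food"):
    # Maximum similarity of any in-vocabulary title word to the anchor word.
    sims = [
        cosine(vectors[word], vectors[anchor])
        for word in title.split()
        if word in vectors
    ]
    return max(sims, default=0.0)


# Toy embeddings for illustration only; these are not GloVe values.
TOY_VECTORS = {
    "food": [1.0, 0.0],
    "beets": [0.9, 0.1],
    "beast": [0.1, 1.0],
    "beauty": [0.0, 1.0],
}

if __name__ == "__main__":
    print(food_similarity("beauty and the beets", TOY_VECTORS))
    print(food_similarity("beauty and the beast", TOY_VECTORS))
```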
The third dimension measures the mean of the semantic similarity of the new words to their corresponding original words. We minimize this dimension to increase surprise, with the idea that a lower semantic similarity between the original word and its substitute would result in a bigger surprise.
⁵ We set both µ and λ to 100 empirically.
The last dimension keeps track of the number of words modified in comparison to the original title. Minimizing this dimension encourages fewer substitutions in the title, which makes it more recognizable.
These are the criteria based on which the fitness of indi-
viduals is evaluated at the end of each generation to let only
the best ones survive to the next generation.
Selection and Filtering. To avoid a single dominating dimension and to encourage titles with diverse and balanced scores on all four dimensions, we opt for a non-dominated sorting algorithm, NSGA-II (Deb et al. 2002), as the selection algorithm.
During each iteration of the evolution, the current population and its offspring go through a filtering phase that removes any duplicate titles.
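The core idea behind non-dominated sorting is Pareto dominance: one individual dominates another if it is at least as good on every dimension and strictly better on at least one. The sketch below only extracts the first non-dominated front and is not a full NSGA-II implementation (it omits the ranking of later fronts and the crowding distance). It assumes that every score has been oriented so that larger is better, for example by negating the two minimized dimensions; the function names are hypothetical.

```python
def dominates(a, b):
    # a dominates b: no worse on every dimension, strictly better on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def first_front(scores):
    # Indices of individuals that no other individual dominates.
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


if __name__ == "__main__":
    # (prosody, food similarity, -similarity to original, -altered words)
    scores = [
        (0.8, 0.6, -0.2, -1),
        (0.5, 0.5, -0.4, -2),
        (0.3, 0.9, -0.1, -1),
    ]
    print(first_front(scores))
```

In this toy input the second individual is worse than the first on all four dimensions, so only the first and third remain in the front.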
Final Verdict. On top of the individual evaluation metrics, we introduce the master’s final verdict, which is a way of telling whether the master likes the generated title. The final verdict of the master is a binary decision, i.e. an individual is either good or not. In practice, the final verdict is defined as conditional thresholds on each dimension. These thresholds are (1) a positive non-zero value for prosody, (2) a positive non-zero semantic similarity to “food”, (3) a semantic similarity of less than 0.5 between the new word and its original, and (4) a change of no more than 50 percent of the content words.
The master uses this functionality to express its liking of titles outside of its own creations, such as those created by the apprentice. Whenever we talk about the master liking something in this paper, we mean that the final verdict has a Boolean value of true.
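The four thresholds translate directly into a Boolean check. The sketch below assumes that the four dimension scores have already been computed for a title; the `Scores` container and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    prosody: float              # sound similarity to the original word
    food_similarity: float      # semantic similarity to "food"
    original_similarity: float  # semantic similarity of new word to original
    changed_ratio: float        # share of content words that were altered


def final_verdict(s):
    # The master "likes" a title only if all four thresholds are met.
    return (
        s.prosody > 0
        and s.food_similarity > 0
        and s.original_similarity < 0.5
        and s.changed_ratio <= 0.5
    )


if __name__ == "__main__":
    print(final_verdict(Scores(0.4, 0.3, 0.2, 0.25)))
    print(final_verdict(Scores(0.4, 0.3, 0.7, 0.25)))
```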
For the apprentices we use OpenNMT (Klein et al. 2017), which implements an RNN-based sequence-to-sequence model. The model has two RNN encoding layers and two RNN decoding layers.
The attention mechanism is the general global attention formulated by Luong, Pham, and Manning (2015). The difference from the OpenNMT default parameters in our system is that we use the copy attention mechanism, which makes it possible for the model to copy words from the source. This is useful since the task is to translate within the same language.
All of the apprentice models described in this paper have been trained with the same random seed to make their comparison possible.
Different Parenting Styles
We computationally model the four parenting styles (authoritarian, authoritative, permissive and rejecting-neglecting) in the way the master interacts with the apprentice during the training process of the NMT model.
The training process is iterative. In each iteration, the apprentice is trained for 1000 training steps. After each iteration, the apprentice produces an output based on the 1276 popular IMDB movie titles. This output is then evaluated by the master according to the parenting style in question, and adjustments are made to the training data based on the master’s parenting. The apprentices are trained for 20 iterations.
The Authoritarian Master only lets the apprentice learn from the master’s own output. The apprentice is not exposed to any of the peer data, and the apprentice’s own creations are not taken into account.
The Authoritative Master lets the apprentice learn from the master’s own creations and from those peer-written titles that it considers good enough by the final verdict (this means 2446 titles). The apprentice can show its creations to the master after each training iteration, out of which the master picks the ones it likes and adds them to the training material of the apprentice. The training of the NMT model continues with the modified training data.
The Permissive Master lets the apprentice learn from the master’s own creations and all of the peer data. When the apprentice presents its own creations at the end of a training iteration, the master praises them all and adds them to the training data.
The Rejecting-Neglecting Master does not care about the apprentice. The apprentice has no choice but to learn from its peers. The apprentice does not learn from its own creations because it receives no support from the master.
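The four parenting styles differ only in how the training data for the next iteration is assembled, which can be summarized in a small selection function. This is a simplified sketch of a single iteration under stated assumptions: training examples are represented as (original title, pun) pairs, `likes` stands in for the master’s final verdict, and the function and argument names are hypothetical.

```python
def next_training_data(style, master_data, peer_data, apprentice_output, likes):
    if style == "authoritarian":
        # Only the master's own output; peers and apprentice are ignored.
        return list(master_data)
    if style == "authoritative":
        # Master's output plus peer titles and apprentice titles the master likes.
        liked_peers = [pair for pair in peer_data if likes(pair)]
        liked_own = [pair for pair in apprentice_output if likes(pair)]
        return list(master_data) + liked_peers + liked_own
    if style == "permissive":
        # Everything is accepted without criticism.
        return list(master_data) + list(peer_data) + list(apprentice_output)
    if style == "neglecting":
        # No support from the master: peer data only.
        return list(peer_data)
    raise ValueError(f"unknown parenting style: {style}")


if __name__ == "__main__":
    master = [("under the skin", "under the cereals")]
    peers = [("jaws", "slaws"), ("up", "cup")]
    own = [("alien", "a lime")]
    likes = lambda pair: pair[1] in {"slaws", "a lime"}
    for style in ("authoritarian", "authoritative", "permissive", "neglecting"):
        print(style, len(next_training_data(style, master, peers, own, likes)))
```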
Training the Apprentice
The master is run once to create its own movie titles with
food related puns. This parallel data of 8306 titles is shared
across the different parenting styles. During the training pro-
cess of the apprentice, the master does not generate new ti-
tles of its own, but only interferes in the selection of the
parallel data used in the next training iteration as described
in the sections above.
After each iteration, we calculate BLEU score (Papineni
et al. 2002) and a uni-gram PINC score (Chen and Dolan
2011) for the outputs of the apprentices. We compare the
outputs both to the training material coming from the master
and the material from the peers. For each title generated by
the apprentice, we take the maximum BLEU and minimum
PINC score and take an average of them for each iteration.
The BLEU score is traditionally used in machine translation to evaluate how good the final translation is with respect to a gold standard. We, however, do not use BLEU as a final evaluation metric, but rather to shed some light on how closely the outputs of the apprentices resemble those of the master or the peer-written titles. BLEU measures similarity, whereas PINC measures divergence from the original data. In other words, the higher the BLEU, the more closely the apprentice imitates the master or the peers, and the higher the PINC, the less it imitates them.
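A uni-gram PINC score can be sketched as one minus the share of the candidate’s distinct words that also occur in the reference, with the minimum taken over all reference titles as described above. Whitespace tokenization and the function names are assumptions made for the example; BLEU is left out because it would normally come from an existing toolkit.

```python
def unigram_pinc(reference, candidate):
    # 1 - (shared distinct unigrams / distinct unigrams in the candidate).
    ref_words = set(reference.split())
    cand_words = set(candidate.split())
    if not cand_words:
        return 0.0
    return 1.0 - len(ref_words & cand_words) / len(cand_words)


def min_pinc(candidate, references):
    # Lowest divergence of the candidate from any reference title.
    return min(unigram_pinc(ref, candidate) for ref in references)


if __name__ == "__main__":
    refs = ["how to train your pepperoni", "how to train your avocado"]
    print(min_pinc("how to train your bacon", refs))
    print(min_pinc("how to train your avocado", refs))
```

A score of 0 means the candidate introduces no word that is absent from the closest reference, i.e. the apprentice copies it at the word level.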
As indicated by Figure 2, the authoritarian scenario, where the training data consists only of the master’s output, quickly starts producing the output most similar to the master, whereas the authoritative scenario leads to slightly less similarity to the master. The effect of the peer data is clearly visible in the permissive and neglecting scenarios. The PINC scores in Figure 3 show the other side of the coin: the authoritative and authoritarian scenarios are the least divergent and the permissive and neglecting ones the most divergent.
Figure 2: BLEU when comparing to the master
Figure 3: PINC when comparing to the master
Figure 4: BLEU when comparing to peers
Figure 5: PINC when comparing to peers
When we do the BLEU comparison to the peer data, as seen in Figure 4, only the neglecting scenario leads to high similarity with the peers, whereas the other scenarios remain quite low, the lowest being the authoritarian scenario. The PINC scores in Figure 5 tell a similar story: the neglecting scenario leads to the least divergence, leaving the authoritarian scenario the most divergent.
Results and Evaluation
In this section, we show some of the results produced by the different systems. In addition, we evaluate the different parenting style scenarios after each iteration with the master’s appreciation function. Later, an evaluation is conducted by human judges.
Results from the approaches can be seen in Table 1. The
master did not produce any training data for the last two ti-
tles in the examples. Looking at these results qualitatively,
in broad lines, the permissive and neglecting scenarios pro-
duced worse output than the ones exposed to the master’s
training data. The apprentice exposed to authoritarian par-
enting struggles in producing output for titles not present
in the training data. The authoritative scenario leads to the
most consistent results. The quantitative human evaluation
in the next section is used to verify these initial observations.
Another way to look at the results is to use the apprecia-
tion metrics implemented in the master. Figure 6 shows the
percentage of how many titles the master liked after each
training iteration.
Figure 6: Master’s liking of the output
As we can see, the master’s appreciation ranks the authoritarian and authoritative scenarios higher than the permissive and neglecting ones. Even in the authoritarian case, the master does not like all of the output produced by the apprentice, which shows that the appreciation learned by the apprentices is different from the one implemented in the master.
It is interesting to see to what extent the master’s liking correlates with the evaluation results of the human judges. This can reveal more about the adequacy of the master’s appreciation in this creative task. Or does the master’s appreciation only tell about obedience when applied to the apprentices’ output?
original | master | authoritarian | authoritative | permissive | neglecting
the butterfly effect | the brewery effect | the butterfly kimchi | the butterfly chicken | the butterfly effect | the lasagna effect lazarus
how to train your dragon | how to train your pepperoni | how to train your avocado | how to train your pepperoni | how to train your bacon | how to train your bacon
fantastic beasts and where to find them | -- | fantastic ordinary and where to find | fantastic beets and where to find them | fantastic beefs and where to find them | fantastic beets and where to find them
under the skin | -- | under the cereals | under the silver cake | under the 13th | fryday the 13
Table 1: Examples of the final output of the different models
Evaluation Questions
In this section we provide some reasoning for our selection of the evaluation questions that are presented to the human judges. Earlier, we defined creativity in the case of pun generation using the creative tripod as our theoretical framework. This means that, on a higher level, our evaluation questions should evaluate skill, appreciation and imagination.
Skill. Our definition for skill stated that the system should be able to take an existing movie title and produce a food-related pun as an output. A further requirement was that the original title should be recognizable from the generated one.
1. The title has a pun in it
2. The title is related to food
3. The original title is recognizable
The evaluation questions described above are designed to
evaluate the requirements set for skill. We evaluate whether
a pun is perceived and whether the new title relates to food
separately, as it might be that the replacement word delivers
a pun, but is not food related or vice-versa.
Appreciation. We defined appreciation from the standpoint of humor. A good title with a pun is also funny. For something to be funny, i.e. humorous, the pun has to exhibit coherence and surprise.
4. The title is humorous
5. The pun is surprising
6. The pun makes sense in the context of the original movie
We choose to evaluate the overall humor value of the title
separately from the components that constitute it. The last
two questions are designed to evaluate surprise and coher-
ence respectively.
Imagination. We used Boden’s dichotomy to establish the definition of imagination. The minimal requirement was set to P-creativity. However, P-creativity can easily be verified by comparing the training data and the final output: if the output is different from the training material, there is P-creativity. Therefore, we use human judges to assess the H-creativity of the outputs.
7. The pun in the title is obvious
8. The pun in the title sounds familiar
If the pun is obvious, it is probably not very H-creative, as an obvious pun could be made by just about anyone. Likewise, if the pun sounds familiar, it has probably been said by someone else before.
We take a random sample of 20 original movie titles that were only present in the training data provided by the master, 20 titles that were only present in the peer data and 20 titles that were in both sources of parallel data. We evaluate the creative output of each apprentice for these randomly sampled titles. In addition, we evaluate the master’s output for the 40 titles of the sample for which it had generated movie title puns. As the master has generated multiple creative titles per original title, we pick one randomly for each original title. Altogether, we evaluate 280 computer-created titles.
The evaluation was conducted on a crowd-sourcing platform called Figure Eight. The platform assigned people to the evaluation in such a way that each title was evaluated by 35 different users. The users could choose how many titles they wanted to evaluate. The results of the evaluation are shown in Table 2. In the Training column, both, peer only and master only indicate whether the original title was present in both sources, only in the peer-produced training data or only in the master-produced training data, respectively.
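The summary statistics reported for the human evaluation, i.e. the mean and standard deviation of the scores and the percentage of titles whose average judge score is greater than 3, could be computed along the following lines. The details are assumptions: the text does not say whether the standard deviation is the sample or the population version (the sample version is used here), whether the mean is taken over all individual ratings, or what rating scale the judges used; the function name and the toy ratings are hypothetical.

```python
from statistics import mean, stdev


def summarise(ratings_by_title, threshold=3.0):
    # Pool every individual rating for the overall mean and standard deviation.
    all_ratings = [r for ratings in ratings_by_title.values() for r in ratings]
    # Average the judges per title, then count titles above the threshold.
    title_means = [mean(ratings) for ratings in ratings_by_title.values()]
    share_above = 100.0 * sum(m > threshold for m in title_means) / len(title_means)
    return mean(all_ratings), stdev(all_ratings), share_above


if __name__ == "__main__":
    toy_ratings = {
        "fantastic beets and where to find them": [4, 5, 3],
        "under the cereals": [2, 3, 2],
    }
    print(summarise(toy_ratings))
```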
The authoritarian scenario did not get the best average score for any of the test questions, and neither did the master. They both score particularly low on Q2, which reflects the fact that some of the words in the HTOED food and drink taxonomy, such as steam and spit, were only loosely related to food. It is interesting to note that the authoritarian scenario gets the best results for Q3, Q6 and Q7 for titles it did not encounter in the training data; in other words, it has developed an appreciation of its own and does not just mimic what the master produces and fail otherwise. In light of these results, we can deduce that the master produced worse titles with food puns than real people did, which left both the master and the authoritarian scenario without first place on any of the test questions.
The authoritative scenario, which was the highest-ranking one according to the master’s liking as seen in Figure 6, got the best results for Q1 and Q5. This means that it succeeds best in the main task of generating puns, and its puns end up being the most surprising ones. It is also the only scenario that produces consistently good results (above 3 on average) for all training test sets for Q1-Q6. Unfortunately, the results for Q7 and Q8 are also above 3 on average, meaning that it does not rank high on H-creativity.
Style          Training     Q1         Q2         Q3         Q4         Q5         Q6         Q7         Q8
                            Mean SD    Mean SD    Mean SD    Mean SD    Mean SD    Mean SD    Mean SD    Mean SD
authoritarian  both         3.35 1.08  2.65 1.26  3.33 1.06  3.02 1.10  3.08 0.99  3.03 1.01  3.16 0.99  3.10 1.03
authoritarian  peer only    2.97 1.18  2.14 1.17  3.44 1.13  2.66 1.15  2.82 1.08  3.10 1.11  3.03 1.12  3.11 1.13
authoritarian  master only  3.34 1.07  2.89 1.28  3.30 1.05  3.00 1.12  3.07 1.05  3.04 1.02  3.12 1.02  3.07 1.04
authoritative  both         3.43 1.08  3.09 1.31  3.28 1.12  3.08 1.16  3.08 1.05  3.02 1.05  3.22 1.06  3.13 1.05
authoritative  peer only    3.41 1.14  3.23 1.35  3.37 1.21  3.13 1.16  3.07 1.08  3.08 1.09  3.24 1.10  3.16 1.13
authoritative  master only  3.47 1.03  3.15 1.33  3.41 1.08  3.17 1.11  3.16 1.02  3.16 1.04  3.28 1.01  3.24 1.06
master         both         3.38 1.07  2.79 1.28  3.29 1.06  3.07 1.12  3.09 1.04  3.10 1.06  3.22 1.02  3.14 1.05
master         master only  3.40 1.07  2.61 1.30  3.34 1.11  3.10 1.15  3.09 1.02  3.04 1.03  3.17 1.04  3.11 1.06
neglecting     both         3.45 1.11  3.28 1.32  3.34 1.11  3.28 1.12  3.16 1.06  3.12 1.04  3.21 1.08  3.22 1.07
neglecting     peer only    3.36 1.07  3.02 1.37  3.31 1.11  3.12 1.15  3.09 1.02  3.14 1.04  3.19 1.02  3.14 1.06
neglecting     master only  3.28 1.13  2.87 1.35  3.34 1.12  3.09 1.14  3.05 1.05  3.07 1.06  3.15 1.06  3.18 1.08
permissive     both         3.23 1.18  2.67 1.38  3.59 1.06  3.06 1.13  3.08 1.08  3.30 1.04  3.21 1.07  3.30 1.10
permissive     peer only    3.05 1.19  2.87 1.39  3.25 1.18  2.88 1.13  2.88 1.09  3.00 1.08  2.99 1.11  3.04 1.14
permissive     master only  3.09 1.23  2.32 1.24  3.64 1.11  2.88 1.15  2.91 1.12  3.07 1.15  2.98 1.13  3.04 1.12
Table 2: Mean and standard deviation.

Style          Q1      Q2      Q3      Q4      Q5      Q6      Q7      Q8
authoritarian  78.33%  31.67%  83.33%  31.67%  48.33%  68.33%  70.00%  68.33%
authoritative  93.33%  60.00%  83.33%  56.67%  63.33%  66.67%  90.00%  70.00%
master         92.50%  37.50%  87.50%  62.50%  60.00%  65.00%  80.00%  67.50%
neglecting     86.67%  60.00%  81.67%  66.67%  65.00%  73.33%  75.00%  78.33%
permissive     66.67%  33.33%  85.00%  43.33%  46.67%  73.33%  68.33%  71.67%
Table 3: Percentage of movie titles having an average score by judges greater than 3.

The same consistency cannot be perceived in the permissive
case, as scores below 3 are common across the test
questions. It does, however, manage to score the best for Q3 and
Q7-Q8; in other words, it can achieve the best H-creativity,
and the original titles can be the most recognizable, although
not consistently so. This shows that even though the master's
appreciation might not be spot on, as it is not able
to produce the best scoring titles, having the master moderate the
peer data and critically assess the apprentice-generated
results during training has a positive effect
on the consistency of the results. In the permissive scenario,
the apprentice was exposed to everything without criticism,
whereas in the authoritative scenario some criticism was used to filter the
training data, which made the authoritative scenario more
consistent, but less H-creative.
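The contrast between exposing the apprentice to everything and filtering its training data through the master's appreciation can be sketched as follows. This is a minimal illustration under stated assumptions: `select_training_pairs`, `toy_appreciation`, the threshold and the example pairs are all hypothetical stand-ins, and the toy scoring rule is not the fitness function used by the master in this paper.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (original title, pun title)


def select_training_pairs(
    pairs: List[Pair],
    appreciation: Optional[Callable[[str, str], float]] = None,
    threshold: float = 0.5,
) -> List[Pair]:
    """Pick the parallel data the apprentice is trained on.

    With no appreciation function, everything is passed through
    (permissive-like). Otherwise only pairs scoring at least
    `threshold` are kept (authoritative-like filtering).
    """
    if appreciation is None:
        return list(pairs)
    return [p for p in pairs if appreciation(p[0], p[1]) >= threshold]


def toy_appreciation(original: str, pun: str) -> float:
    """Hypothetical stand-in for the master's appreciation.

    Accepts a pun only if it changes the title but keeps its word count.
    """
    same_length = len(original.split()) == len(pun.split())
    return 1.0 if same_length and original != pun else 0.0


# Invented example pairs, for illustration only.
peer_pairs = [
    ("The Godfather", "The Codfather"),
    ("Jurassic Park", "Jurassic Park"),
    ("Star Wars", "Naan Wars Forever"),
]

permissive_data = select_training_pairs(peer_pairs)
authoritative_data = select_training_pairs(peer_pairs, toy_appreciation)
print(len(permissive_data), len(authoritative_data))
```

In this sketch the unfiltered set keeps all three pairs, while the filtered set keeps only the first one. That is the kind of trade-off between exposure and consistency discussed above.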
Finally, the neglecting scenario gets the best scores for
Q2, Q4 and Q5. It is the best one at producing humorous,
surprising and food related titles. It is quite consistent, with
only the results for Q2 with previously unseen titles giving
a score below 3. The good results of the neglecting
scenario serve as additional evidence that the
output of the master is worse than human written titles.
Table 3 shows the results from another standpoint. The
table shows how many titles overall got an average rating
above 3 for each test question. These numbers are in line
with what was previously discussed about Table 2. The
authoritarian scenario leads to the worst performance, but
this time the master gets the highest percentage of titles
above 3 for Q3. In the authoritative scenario, most of the
titles have a clear pun and are related to food, with the highest
percentages. The permissive scenario holds the best
percentages for Q6 and Q7, and the neglecting scenario gets
the best percentages in Q2, Q4, Q5 and Q6.
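The quantity reported in Table 3 can be computed from per-judge ratings in a few lines. The sketch below assumes each title has a list of ratings on the 1-5 scale and uses a strict "greater than 3" cut-off, following the table caption. The function name `percent_above` and the example ratings are hypothetical and are not data from the study.

```python
from statistics import mean


def percent_above(ratings_per_title, threshold=3.0):
    """Percentage of titles whose mean judge rating is strictly above threshold."""
    averages = [mean(ratings) for ratings in ratings_per_title]
    above = sum(1 for avg in averages if avg > threshold)
    return 100.0 * above / len(averages)


# Invented ratings for four titles, three judges each (illustration only).
example_ratings = [
    [4, 3, 5],  # mean 4.0, counted
    [3, 3, 3],  # mean 3.0, not strictly above 3
    [2, 4, 4],  # mean about 3.33, counted
    [1, 2, 3],  # mean 2.0, not counted
]

print(f"{percent_above(example_ratings):.2f}%")  # 50.00%
```

Note that a title averaging exactly 3 is not counted. Read with Table 2, this view shows how scores are spread across titles rather than only their mean.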
Discussion and Future Work
The evaluation results were not completely in line with what
we can observe ourselves by looking at the titles output by the
different methods. This raises the question of whether
our definition of creativity in movie title puns is adequate
and whether the evaluation questions we formulated based
on the definition really measure what they were designed to
measure. Because we have worked with a clear definition of
creativity in this paper, it is possible to subject this to
critical study in the future. We also find it evident that qualitative
research on the output titles with respect to the quantitative
results we got from the human judges is needed to evaluate
the evaluation itself.
Having a master with appreciation filter the parallel data
of the apprentice was beneficial for consistency (see authoritative
vs. permissive). Although the evaluation results
showed that the appreciation is not on par with that of a real
human, the implication remains that good external appreciation
can be beneficial for the learning outcome of the apprentice
model. As we used a rather generic NMT model
for the apprentice, our findings might be of use in more
traditional contexts of sequence-to-sequence models such as
machine translation, text summarization or paraphrasing.
For now, the master and apprentice have been studied in
a social vacuum, where peer data is the only link to the sur-
rounding world. However, in the future it would be fruitful
to see how the creative outcome changes when the master
and the apprentice are exposed to a more complex social
system such as the one described by Bronfenbrenner (Bron-
fenbrenner 1979). In such a society, the master would also
be under a social pressure in changing its own standards of
Conclusions
This work has presented one of the first contributions to the
field of computational social creativity where the computa-
tionally creative agents are in a hierarchical social relation.
This asymmetry offers an intriguing setting for studying socialization
of computational agents from the creativity perspective.
Despite building our definition of creativity upon an ex-
isting theory and formulating the test questions based on
the definition, the quantitative evaluation left many ques-
tions unanswered. The results presented in this paper call
for qualitative evaluation to understand the phenomenon of
evaluation in this particular context.
Nevertheless, our findings suggest that having appreciation
in parenting, or training, an NMT model can be of
benefit. The applicability of these findings to sequence-to-sequence
deep learning models in a more generalized fashion
is an interesting research question in its own right.
References
Alnajjar, K., and Hämäläinen, M. 2018. A Master-Apprentice
Approach to Automatic Creation of Culturally
Satirical Movie Titles. In Proceedings of the 11th Interna-
tional Conference on Natural Language Generation (INLG),
Alnajjar, K.; Hadaytullah, H.; and Toivonen, H. 2018. “Tal-
ent, Skill and Support.” A method for automatic creation of
slogans. In Proceedings of the Ninth International Confer-
ence on Computational Creativity, 88–95.
Baumrind, D. 1991. Parenting styles and adolescent devel-
opment. The Encyclopedia of Adolescence 758–772.
Boden, M. A. 2004. The creative mind: Myths and mecha-
nisms. Routledge.
Bronfenbrenner, U. 1979. The ecology of human develop-
ment. Harvard university press.
Brownell, H. H.; Michel, D.; Powelson, J.; and Gardner,
H. 1983. Surprise but not coherence: Sensitivity to verbal
humor in right-hemisphere patients. Brain and Language
18(1):20 – 27.
Chen, D. L., and Dolan, W. B. 2011. Collecting highly
parallel data for paraphrase evaluation. In Proceedings of
the 49th Annual Meeting of the Association for Computa-
tional Linguistics: Human Language Technologies-Volume
1, 190–200.
Colton, S. 2008. Creativity Versus the Perception of Cre-
ativity in Computational Systems. In AAAI Spring Sympo-
sium: Creative Intelligent Systems, Technical Report SS-08-03,
14–20.
Corneli, J., and Jordanous, A. 2015. Implementing feedback
in creative systems: a workshop approach. In Proceedings
of the First International Conference on AI and Feedback-
Volume 1407, 10–17. CEUR-WS. org.
De Smedt, T., and Daelemans, W. 2012. Pattern for Python.
Journal of Machine Learning Research 13:2063–2067.
Deb, K.; Pratap, A.; Agarwal, S.; and Meyarivan, T. 2002.
A fast and elitist multiobjective genetic algorithm: NSGA-II.
IEEE Transactions on Evolutionary Computation 6(2):182–
Gabora, L. 1995. Meme and variations: A computer model
of cultural evolution. 1993 Lectures in Complex Systems
Hantula, O., and Linkola, S. 2018. Towards goal-aware col-
laboration in artistic agent societies. In Proceedings of the
Ninth International Conference on Computational Creativ-
He, H.; Peng, N.; and Liang, P. 2019. Pun generation with
surprise. arXiv preprint arXiv:1904.06828.
Honkela, T., and Winter, J. 2003. Simulating language
learning in community of agents using self-organizing maps.
Helsinki University of Technology.
Honnibal, M., and Montani, I. 2017. spaCy 2: Natural
Language Understanding with Bloom Embeddings, Convo-
lutional Neural Networks and Incremental Parsing. To ap-
Jordanous, A. 2012. A standardised procedure for evalu-
ating creative systems: Computational creativity evaluation
based on what it is to be creative. Cognitive Computation
Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; and Rush, A. M.
2017. OpenNMT: Open-Source Toolkit for Neural Machine
Translation. In Proc. ACL.
Linkola, S.; Takala, T.; and Toivonen, H. 2016. Novelty-
seeking multi-agent systems. In Proceedings of The Seventh
International Conference on Computational Creativity.
Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Ef-
fective approaches to attention-based neural machine trans-
lation. arXiv preprint arXiv:1508.04025.
Oring, E. 2003. Engaging humor. Urbana and Chicago:
University of Illinois Press.
Pagnutti, J.; Compton, K.; and Whitehead, J. 2016. Do
you like this art i made you: introducing techne, a creative
artbot commune. In Proceedings of 1st International Joint
Conference of DiGRA and FDG.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002.
Bleu: a method for automatic evaluation of machine transla-
tion. In Proceedings of the 40th annual meeting on associa-
tion for computational linguistics, 311–318.
Ritchie, G. 2005. Computational mechanisms for pun gen-
eration. In Proceedings of the Tenth European Workshop on
Natural Language Generation (ENLG-05).
Saunders, R., and Bown, O. 2015. Computational social
creativity. Artificial life 21(3):366–378.
Yu, Z.; Tan, J.; and Wan, X. 2018. A neural approach to
pun generation. In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1:
Long Papers), volume 1, 1650–1660.