Modelling the Socialization of Creative Agents in a Master-Apprentice Setting:
The Case of Movie Title Puns
Department of Digital Humanities
Faculty of Arts
University of Helsinki
Department of Computer Science
Faculty of Science
University of Helsinki
This paper presents work on modelling the social psychological aspect of socialization in a computationally creative master-apprentice system. In each master-apprentice pair, the master, a genetic algorithm, is seen as a parent for its apprentice, which is an NMT-based sequence-to-sequence model. The focus of this study is the effect of different parenting styles on the creative output of each pair. This approach brings a novel viewpoint to computational social creativity, which has so far mainly focused on computationally creative agents that are socially equal, whereas our approach studies the phenomenon in the context of a social hierarchy.
The master-apprentice approach to computational creativity, as introduced by Alnajjar and Hämäläinen (2018), has been shown to achieve creative autonomy, and its creativity has been thoroughly discussed and motivated. However, a question that has remained unanswered is the social nature of a master-apprentice pair and its effect on the
The approach consists of two parts: a master, which is a genetic algorithm, and an apprentice, which is an LSTM sequence-to-sequence model. While the master is in charge of the internal appreciation of the overall system, as implemented in its fitness function, the apprentice plays a crucial role in the creative autonomy, as it can learn its standards partially from its master and partially from its peers.
This paper focuses on the exploration of the master-
apprentice approach from a social psychological point of
view. By modelling the socialization of the apprentice into a
creative society consisting of the master and peers, we seek
to gain a deeper understanding of the phenomenon in terms
of the overall creativity of the system. In addition, modelling
the social aspects of a computationally creative system can
help in understanding creativity as a social phenomenon in
a broader sense (Saunders and Bown 2015).
We motivate the model of socialization based on research conducted in the field of social psychology, namely developmental psychology. We select the categorization of parenting styles presented by Baumrind (1991) as the theoretical foundation of our work.
The creative task we are tackling in this paper is the creation of humorous movie titles delivering a food-related pun. This consists of taking an existing movie title such as Beauty and the Beast and making a pun out of it such as Beauty and the Beets. As people have been writing funny movie titles of this sort in great abundance on social media, we can gather parallel data easily.
While pun generation has been widely studied in the field of computational creativity (Ritchie 2005; Yu, Tan, and Wan 2018; He, Peng, and Liang 2019), we see that the most important contribution of our paper lies in the realm of social creativity. Therefore, we dedicate this section to describing some of the practical research conducted in computational social creativity.
Research on an agent community consisting of self-
organizing maps (Honkela and Winter 2003), although out-
side of the computational creativity paradigm, presents a
way of simulating the emergence of language. The agents
are capable of meaning negotiation and converging into a
common language to communicate about edibility of differ-
ent food items in their shared world.
Multi-agent systems have been studied in the context of novelty seeking in creative artifact generation (Linkola, Takala, and Toivonen 2016). In their study, the agents exert self-criticism and can vote on and veto creative artifacts. Their findings suggest that multiple creative agents can reach a higher level of novelty in their output than a single-agent system.
A recent study (Hantula and Linkola 2018) has been con-
ducted in social creativity in agent societies where the in-
dividuals are goal-aware. The individuals create artifacts of
their own and peer up to collaborate with another agent. The
agents are capable of learning a peer model that guides them
in selecting a collaboration partner.
The papers discussed in this section, as well as other similar previously conducted work (Gabora 1995; Corneli and Jordanous 2015; Pagnutti, Compton, and Whitehead 2016), mostly study the collaboration of agents that have an equal social status, in contrast to our case where the social status is hierarchical. Therefore, we find that there is a need for the study presented in this paper to shed some light on asymmetrical social relations in computational creativity.
The master-apprentice approach gives us an intriguing test
bed for modelling different social interactions between the
master and the apprentice. With such a complex phe-
nomenon as human social behavior, we are bound to limit
our focus on a subarea of the phenomenon. In this sec-
tion, we describe different psychological approaches in un-
Socialization, i.e. becoming a part of a social group, is an important part of the psychological development of an individual, to such a degree that a child who is never exposed to other people will develop neither a language nor an understanding of self. Socialization thus seems to play a crucial role in the higher-level cognitive development of everything that we consider to separate humans from animals.
Perhaps this great level of importance has been the reason a
great many researchers have dedicated effort in unraveling
The ecological systems theory of social development (Bronfenbrenner 1979) highlights the importance of the bidirectionality of different social groups. An individual child is in the middle of the model, but just as the immediate close family affects the child, the child is also an actor in the process of socialization. The theory identifies multiple different systems, from close family all the way to the level of the society, that play a role in the social development of a child. This theory is quite complex to model computationally.
A view of social development that is simpler to model is that of parenting styles (Baumrind 1991). We find these findings more suitable as a starting point for modelling the socialization of the apprentice in our master-apprentice approach. The parenting styles can be divided into four main categories: authoritative, authoritarian, permissive and rejecting-neglecting. These categories differ from each other along the two axes of demandingness and responsiveness, as seen in Figure 1.
Figure 1: Parenting styles
The authoritative parents are high on both demandingness and responsiveness. They set rules, but the rules are negotiable. The parenting is more supportive than punitive in nature. The authoritarian parents, on the other hand, are low on responsiveness and high on demandingness. They set non-negotiable rules and expect obedience without explanation. The permissive parents are low on demandingness and high on responsiveness. They are very lenient and avoid confrontation. The rejecting-neglecting parents, however, are low on both axes. They hardly engage in parenting: they offer little support and do not set any rules.
The original research on the master-apprentice approach (Alnajjar and Hämäläinen 2018) used the creative tripod (Colton 2008) to define creativity in general and in the context of their work on creating movie titles satirical towards Saudi Arabia by following the notions of the SPECS approach (Jordanous 2012). We use the same creative tripod framework to adapt their definition to our similar task of creating movie titles with food puns. This definition provides us with a reasoned way of evaluating the overall creativity of our systems.
The creative tripod requires three key notions to be present in a system in order for it to achieve creativity. These are skill, imagination and appreciation. All of these components must be present simultaneously in a creative system, or the system will lack creativity.
For our systems to exhibit skill, they will need to take a movie title as input and produce a new one with a food pun. As in the case of the earlier master-apprentice approach, the new humorous title should still communicate the original title, i.e. the original name of the movie should be recognizable.
Requirements for a pun are that it resembles the original word in pronunciation and that it is humorous. According to Oring (2003), incongruity results in humour if it is delivered in a playful fashion and accompanied by its resolution. Another, perhaps more concrete, way of looking at humour is to see incongruity as surprise and resolution as coherence, cf. (Brownell et al. 1983).
Surprise in the context of humor means that the brain forms an expectation and this expectation is then broken by the humorous element of the pun. Such is the case in the pun Harry Potter and the Deathly Marshmallows, where the surprise is caused by the fact that the expected word Hallows is replaced by Marshmallows. For the pun to be coherent, it should make sense in the context of the original movie. In this case, the thought of deathly marshmallows attacking Hogwarts, although bizarre, can still be seen as coherent.
For achieving appreciation, the systems will need to be
able to assess the humorousness of the created pun in terms
of surprise, coherence and sound similarity. In addition to
the humour, the system should be able to evaluate the recog-
nizability of the original title.
We define imagination by using the dichotomy of creativity introduced by Boden (2004). This way of understanding creativity divides it into two different types: P-creativity and H-creativity. P-creativity is the minimal requirement we set for the imagination of the systems, and it means that a creative entity should be able to come up with something that is novel to itself. H-creativity, on the other hand, refers to an innovation that is novel on a more global scale, i.e. nobody else has come up with a similar creative artifact before. While P-creativity is the minimum requirement, we consider H-creativity a more desirable requirement for imagination.
As our approach is to generate food-related puns, we need a vocabulary consisting of food-related terms. For this purpose we use the Historical Thesaurus of the Oxford English Dictionary¹. We use all the nouns recorded under the topic food and drink in the external world taxonomy. This list contains 15,314 different nouns.
We extract real movie titles from IMDB² (Internet Movie Database). As we want our movie title corpus to consist only of well-known movies, we want to filter out all the lesser-known indie movies. To achieve this, we filter out movies that have received fewer than 100,000 votes, leaving us with 1,661 movie titles. For the master and apprentices, this is further limited to 1,276 titles by filtering out the titles that consist of only one word.
For parallel humorous movie title data (later peer data), we crawl comments on an Instagram post for an entertainment account³. People were encouraged to come up with creative movie titles containing a pun related to food. The total number of comments crawled is 16,088⁴. Then, we follow the same approach applied in (Alnajjar and Hämäläinen 2018) to map the crawled data to movie titles. In summary, we preprocess the text to remove any hashtags and mentions, and then we measure the character and word edit distances between the comments and movie titles. Finally, a comment is considered to be a punny variation of the matched movie title with the least edit distance only if it has at most three word differences and at least one word matching the movie title. This process yields 9,294 human-authored movie titles containing a pun.
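The matching step can be illustrated with a minimal sketch. It assumes that a word-level edit distance alone is used (the character-level distance mentioned above is left out), and the function names `word_edit_distance`, `clean` and `match_title` are hypothetical, not taken from the original implementation.

```python
import re

def word_edit_distance(a, b):
    """Levenshtein distance computed over lists of words."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def clean(comment):
    """Drop hashtags and mentions, lowercase, and split into words."""
    text = re.sub(r"[#@]\w+", " ", comment.lower())
    return re.findall(r"[a-z0-9']+", text)

def match_title(comment, titles, max_diff=3):
    """Return the closest title, or None if the comment is not a close variation."""
    words = clean(comment)
    if not words:
        return None
    # Pick the title with the smallest word-level edit distance.
    best = min(titles, key=lambda t: word_edit_distance(words, t.lower().split()))
    best_words = best.lower().split()
    close = word_edit_distance(words, best_words) <= max_diff
    shares_word = bool(set(words) & set(best_words))
    return best if close and shares_word else None

titles = ["Beauty and the Beast", "Under the Skin"]
print(match_title("Beauty and the Beets #foodpuns", titles))
```

A comment that shares no word with its closest title, or that differs from it by more than three words, is discarded by this sketch.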
The Master-Apprentice Model
The master-apprentice model consists of a computationally creative genetic algorithm that implements the criteria set for appreciation in its fitness function and an apprentice that is an NMT (neural machine translation) model. The master generates parallel data for the apprentice to learn from, while the apprentice can also learn from its peers. In our setting, we have four different apprentices, one for each parenting style.
Inspired by the work on slogan generation presented by Alnajjar, Hadaytullah, and Toivonen (2018), we employ a similar generator to act as a master in our model. In our case, the generator, which is a genetic algorithm, receives an original movie title as input and outputs an entire population of movie titles carrying a pun, based on the input movie title.
The master makes use of the food-related vocabulary described earlier to replace words in the original title while optimizing multiple parameters to increase the aptness of the substitution and the punniness of the title. The following subsections elucidate the algorithm.
² Dumps from https://datasets.imdbws.com/
⁴ Crawled on the second of February
Evolutionary algorithm The first step in the evolutionary algorithm is producing the initial population, which then goes through the process of evolution for a certain number of generations. The evolutionary algorithm employed is a standard (µ+λ)⁵ algorithm, where mutation and crossover are applied to the current population to produce λ offspring. Individuals in the current population and their offspring are then evaluated by the algorithm to find the fittest µ individuals to survive to the next generation. Once the specified number of generations (10, in our case) is reached, the evolutionary process ends and returns the final population.
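The (µ+λ) scheme can be sketched generically as follows. This is an illustration under simplifying assumptions, not the master itself: it uses a single scalar fitness and mutation only, and it is run on a toy bit-string problem; the names `mu_plus_lambda` and `flip_one_bit` are hypothetical.

```python
import random

def mu_plus_lambda(population, fitness, vary, mu, lam, generations, rng):
    """Generic (mu + lambda) loop: parents compete with their offspring."""
    for _ in range(generations):
        # Produce lam offspring by varying randomly chosen parents.
        offspring = [vary(rng.choice(population), rng) for _ in range(lam)]
        # Parents and offspring are pooled; the mu fittest survive.
        pool = population + offspring
        population = sorted(pool, key=fitness, reverse=True)[:mu]
    return population

def flip_one_bit(bits, rng):
    """Toy variation operator: flip one randomly chosen bit."""
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

# Toy problem: maximise the number of ones in an 8-bit string.
rng = random.Random(0)
start = [tuple([0] * 8) for _ in range(4)]
final = mu_plus_lambda(start, sum, flip_one_bit, mu=4, lam=4, generations=10, rng=rng)
print(len(final), max(sum(ind) for ind in final))
```

Because parents stay in the selection pool, the best fitness in the population can never decrease from one generation to the next.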
Initial Population The initial population consists of µ copies of the input movie title. For each copy, a randomly selected noun, adjective or verb is replaced with a random word from the vocabulary. We use spaCy (Honnibal and Montani 2017) to parse the titles. We inflect the substituting words using Pattern (De Smedt and Daelemans 2012) to match the morphology of the original word when needed. The altered titles form the initial population.
Mutation and Crossover In our evolutionary algorithm,
we implement one kind of mutation and crossover. The mu-
tation process substitutes words in the individual in the same
fashion as done in the creation of the initial population. The
crossover employed is a standard single-point crossover, i.e.
a random point in individuals is selected and words to the
right of the point are switched between them.
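The two operators can be sketched on titles represented as lists of words. This is a simplified illustration: the mutation here replaces any word, whereas the master restricts substitution to nouns, adjectives and verbs and inflects the new word; the function names are hypothetical.

```python
import random

def single_point_crossover(a, b, rng):
    """Swap the words to the right of a random cut point."""
    limit = min(len(a), len(b))
    if limit < 2:
        # Too short to cut; return unchanged copies.
        return list(a), list(b)
    point = rng.randrange(1, limit)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(words, vocabulary, rng):
    """Replace one randomly chosen word with a random vocabulary word."""
    child = list(words)
    child[rng.randrange(len(child))] = rng.choice(vocabulary)
    return child

rng = random.Random(0)
parent_a = "beauty and the beets".split()
parent_b = "bacon and the beast".split()
print(single_point_crossover(parent_a, parent_b, rng))
print(mutate(parent_a, ["kimchi", "lasagna"], rng))
```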
Evaluation In our evaluation metric, we propose four internal evaluation dimensions to measure the fitness of an individual. These dimensions are (1) prosody, (2) semantic similarity to “food”, (3) semantic similarity to the original word, and (4) number of altered words. The first two dimensions are maximized, whereas the last two are minimized.
The prosody dimension is a weighted sum of four prosody sub-features, which are consonance, assonance, rhyme and alliteration. This dimension measures the sound similarity between the original word and its substitution. To measure the sound similarity, we use the espeak-ng tool⁶ to generate IPA (international phonetic alphabet) transcriptions for assessing
To measure the semantic similarity between two words, we employ a pre-trained GloVe model⁷ with 6 billion tokens and a dimension size of 300. The model is trained on Wikipedia and the English Gigaword Fifth Edition corpus. Using the semantic model, the next dimension computes the maximum semantic similarity of words in the title to the
The third dimension measures the mean of the semantic similarity of new words to their original corresponding word. We minimize this dimension to increase surprise, with the idea that a lower semantic similarity between the original word and its substitute would result in a bigger surprise.
⁵ We set both to 100 empirically.
The last dimension keeps track of the number of words modified in comparison to the original title. Minimizing this dimension encourages fewer substitutions to the title, which makes it more recognizable.
These are the criteria based on which the fitness of individuals is evaluated at the end of each generation to let only the best ones survive to the next generation.
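Three of the four dimensions can be sketched with cosine similarity over word vectors. The sketch below leaves out prosody, uses made-up two-dimensional vectors in place of the GloVe model, and assumes that an unchanged title gets a similarity of 1.0 to its original; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def evaluate(original, candidate, vectors):
    """Return (food similarity, similarity to original, number of altered words)."""
    changed = [(o, c) for o, c in zip(original, candidate) if o != c]
    # Dimension 2: best similarity of any word in the title to "food".
    food_sim = max(cosine(vectors[w], vectors["food"]) for w in candidate)
    # Dimension 3: mean similarity of each new word to the word it replaced.
    if changed:
        orig_sim = sum(cosine(vectors[o], vectors[c]) for o, c in changed) / len(changed)
    else:
        orig_sim = 1.0  # assumption: nothing changed, so nothing is surprising
    # Dimension 4: number of altered words.
    return food_sim, orig_sim, len(changed)

# Made-up toy vectors, for illustration only.
vectors = {"food": (1.0, 0.0), "beets": (1.0, 0.0), "beast": (0.0, 1.0),
           "beauty": (0.0, 1.0), "and": (0.0, 1.0), "the": (0.0, 1.0)}
original = ["beauty", "and", "the", "beast"]
candidate = ["beauty", "and", "the", "beets"]
print(evaluate(original, candidate, vectors))
```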
Selection and Filtering To keep any single dimension from dominating and to encourage titles with diverse and balanced scores on all four dimensions, we opt for a non-dominated sorting algorithm, NSGA-II (Deb et al. 2002), as the selection algorithm.
During each iteration of the evolution, the current population and its offspring go through a filtering phase which filters out any duplicate titles.
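The idea behind non-dominated sorting can be sketched by extracting only the first Pareto front. This is not a full NSGA-II implementation (there is no ranking of further fronts and no crowding distance), and the scores below are made up; minimized dimensions are negated so that every value is to be maximized.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def first_front(scored):
    """Return the titles that are not dominated by any other title."""
    return [title for title, score in scored.items()
            if not any(dominates(other, score)
                       for t, other in scored.items() if t != title)]

# Keys of a dict are unique, so duplicate titles are filtered out for free.
# Tuples: (prosody, food similarity, -similarity to original, -altered words)
scored = {
    "the brewery effect": (0.9, 0.2, -0.1, -1),
    "the butterfly kimchi": (0.3, 0.8, -0.4, -1),
    "the lasagna": (0.2, 0.1, -0.5, -2),
}
print(first_front(scored))
```

In this toy example the third title is worse than the first on every dimension, so only the first two titles remain on the front.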
Final Verdict On top of the individual evaluation metrics, we introduce the master's final verdict, which is a way of telling whether the master likes the generated title. The final verdict of the master is a binary decision, i.e. an individual is either good or not. In practice, the final verdict is defined as conditional thresholds on each dimension. These thresholds are 1) a positive non-zero value for prosody, 2) a positive non-zero semantic similarity to “food”, 3) a semantic similarity less than 0.5 of the new word to its original and 4) not more than 50 percent change of content words.
The master uses this functionality to express its liking of titles outside of its own creations, such as those created by the apprentice. Whenever we talk about the master liking something in this paper, we mean that the final verdict has a Boolean value of true.
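The four thresholds translate directly into a Boolean function. In the sketch below the parameter names are hypothetical, and `changed_ratio` is assumed to be the fraction of content words that were altered.

```python
def final_verdict(prosody, food_similarity, original_similarity, changed_ratio):
    """Return True if a title passes all four thresholds of the final verdict."""
    return (prosody > 0                    # 1) positive non-zero prosody
            and food_similarity > 0        # 2) positive similarity to "food"
            and original_similarity < 0.5  # 3) new word far enough from the original
            and changed_ratio <= 0.5)      # 4) at most half of the content words changed

print(final_verdict(0.4, 0.3, 0.2, 0.25))
print(final_verdict(0.0, 0.3, 0.2, 0.25))
```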
For the apprentices we use OpenNMT (Klein et al. 2017), which implements an RNN-based sequence-to-sequence model. The model has two RNN encoding layers and two RNN decoding layers.
The attention mechanism is the general global attention formulated by Luong, Pham, and Manning (2015). The difference from the OpenNMT default parameters in our system is that we use the copy attention mechanism, which makes it possible for the model to copy words from the source. This is useful since the task is to translate within the same language.
All of the apprentice models described in this paper have
been trained by using the same random seed to make their
Different Parenting Styles
We model computationally the four different parenting
styles, authoritarian, authoritative, permissive and rejecting-
neglecting, in the way the master interacts with the appren-
tice during the training process of the NMT model.
The training process is done iteratively. In each iteration, the apprentice is trained for 1000 training steps. After each iteration, the apprentice produces an output based on the 1276 popular IMDB movie titles. This output is then evaluated by the master according to the parenting style in question and adjustments are made to the training data based
on the master’s parenting. The apprentices are trained for 20
The Authoritarian Master only lets the apprentice learn from the master's own output. The apprentice is not exposed to any of the peer data and the apprentice's own creations are not taken into account.
The Authoritative Master lets the apprentice learn from the master's own creations and from those peer titles that it considers good enough by the final verdict (this means 2446 titles). The apprentice can show its creations to the master after each training iteration, out of which the master picks the ones it likes and adds them to the training material of the apprentice. The training of the NMT model continues with the modified
The Permissive Master lets the apprentice learn from the master's own creations and all of the peer data. When the apprentice
presents its own creations at the end of a training iteration,
the master praises them all and adds them to the training
The Rejecting-Neglecting Master does not care about
the apprentice. The apprentice has no choice but to learn
from its peers. The apprentice does not learn from its own
creations because it receives no support from the master.
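The way each parenting style assembles the training data for the next iteration can be sketched as follows. Titles are represented as (original, pun) pairs, `likes` stands in for the master's final verdict, and all names are hypothetical.

```python
def training_data(style, master_pairs, peer_pairs, apprentice_pairs, likes):
    """Select the parallel data the apprentice trains on in the next iteration."""
    if style == "authoritarian":
        # Only the master's own output.
        return list(master_pairs)
    if style == "authoritative":
        # Master output plus the peer and apprentice titles the master likes.
        approved = [p for p in peer_pairs + apprentice_pairs if likes(p)]
        return master_pairs + approved
    if style == "permissive":
        # Everything is accepted without criticism.
        return master_pairs + peer_pairs + apprentice_pairs
    if style == "rejecting-neglecting":
        # No master involvement: peers only.
        return list(peer_pairs)
    raise ValueError(f"unknown parenting style: {style}")

master = [("beauty and the beast", "beauty and the beets")]
peers = [("under the skin", "under the cereals"), ("the butterfly effect", "the lasagna")]
own = [("the butterfly effect", "the butterfly kimchi")]
likes = lambda pair: pair[1] != "the lasagna"  # stand-in for the final verdict
for style in ("authoritarian", "authoritative", "permissive", "rejecting-neglecting"):
    print(style, len(training_data(style, master, peers, own, likes)))
```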
Training the Apprentice
The master is run once to create its own movie titles with food-related puns. This parallel data of 8306 titles is shared across the different parenting styles. During the training process of the apprentice, the master does not generate new titles of its own, but only intervenes in the selection of the parallel data used in the next training iteration, as described in the sections above.
After each iteration, we calculate the BLEU score (Papineni et al. 2002) and a uni-gram PINC score (Chen and Dolan 2011) for the outputs of the apprentices. We compare the outputs both to the training material coming from the master and to the material from the peers. For each title generated by the apprentice, we take the maximum BLEU and minimum PINC score and average them for each iteration.
The BLEU score is traditionally used in machine translation to evaluate how good the final translation is with respect to a gold standard. We, however, do not use BLEU as a final evaluation metric, but rather use it to shed some light on how closely the outputs of the apprentices resemble those of the master or the peer-written titles. BLEU measures similarity, whereas PINC measures divergence from the original data. In other words, the higher the BLEU, the more closely the apprentice imitates, and the higher the PINC, the less it imitates the master or the peers.
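The uni-gram PINC computation and the per-iteration averaging can be sketched as below. The sketch assumes that uni-grams are compared as sets of whitespace-separated words; BLEU is omitted, and the function names are hypothetical.

```python
def unigram_pinc(reference, candidate):
    """Share of candidate words that do not appear in the reference."""
    ref, cand = set(reference.split()), set(candidate.split())
    if not cand:
        return 0.0
    return 1.0 - len(ref & cand) / len(cand)

def mean_min_pinc(outputs, references):
    """For each output keep its lowest PINC over all references, then average."""
    scores = [min(unigram_pinc(r, o) for r in references) for o in outputs]
    return sum(scores) / len(scores)

references = ["beauty and the beets", "under the cereals"]
outputs = ["beauty and the beefs"]
print(mean_min_pinc(outputs, references))
```

Taking the minimum PINC per title means that each output is compared with the training title it is closest to, mirroring the use of the maximum BLEU.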
As indicated by Figure 2, the authoritarian scenario, where the training data consists only of the master's output, quickly starts producing the output most similar to the master, whereas the authoritative scenario leads to slightly less similarity to the master. The effect of the peer data is clearly visible in the permissive and neglecting scenarios. The PINC scores in Figure 3 show the other side of the coin: the authoritative and authoritarian scenarios are the least divergent and the permissive and neglecting ones the most.
Figure 2: BLEU when comparing to the master
Figure 3: PINC when comparing to the master
Figure 4: BLEU when comparing to peers
Figure 5: PINC when comparing to peers
When we do the BLEU comparison to the peer data, as seen in Figure 4, we can see that only the neglecting scenario leads to high similarity with the peers, whereas the other scenarios remain quite low, the lowest being the authoritarian scenario. The PINC scores tell a similar story in Figure 5, where the neglecting scenario leads to the least amount of divergence, leaving the authoritarian scenario the most divergent.
Results and Evaluation
In this section, we show some of the results produced by the
different systems. In addition, we evaluate the different par-
enting style scenarios after each iteration with the master’s
appreciation function. Later, an evaluation is conducted by
Results and Master’s Liking
Results from the approaches can be seen in Table 1. The master did not produce any training data for the last two titles in the examples. Looking at these results qualitatively, the permissive and neglecting scenarios broadly produced worse output than the ones exposed to the master's training data. The apprentice exposed to authoritarian parenting struggles to produce output for titles not present in the training data. The authoritative scenario leads to the most consistent results. The quantitative human evaluation in the next section is used to verify these initial observations.
Another way to look at the results is to use the apprecia-
tion metrics implemented in the master. Figure 6 shows the
percentage of how many titles the master liked after each
Figure 6: Master’s liking of the output
As we can see, the master's appreciation ranks the authoritarian and authoritative scenarios higher than the permissive and neglecting ones. Even in the authoritarian case, the master does not like all of the output produced by the apprentice, which shows that the appreciation learned by the apprentices is different from the one implemented in the master.
It is interesting to see to what extent the master’s liking
correlates with the evaluation results of the human judges.
This can reveal more information about the adequacy of the
appreciation of the master in this creative task. Or does the
master’s appreciation only tell about obedience when ap-
plied to the apprentices’ output?
original | master | authoritarian | authoritative | permissive | neglecting
---|---|---|---|---|---
the butterfly effect | the brewery effect | the butterfly kimchi | the butterfly chicken | the butterfly effect | the lasagna
how to train | how to train | how to train | how to train | how to train | how to train
fantastic beasts and where to find them | - | fantastic ordinary and where to find | fantastic beets and where to find them | fantastic beefs and where to find them | fantastic beets and where to find them
under the skin | - | under the cereals | under the silver cake | under the 13th | fryday the 13
Table 1: Examples of the final output of the different models
In this section we provide some reasoning for our selection of the evaluation questions that are presented to the human judges. Earlier, we defined creativity in the case of pun generation using the creative tripod as our theoretical framework. This means that on a higher level, our evaluation questions should evaluate skill, appreciation and imagination.
Skill Our definition of skill stated that the system should be able to take an existing movie title and produce a food-related pun as an output. A further requirement was that the original title should be recognizable from the generated one.
1. The title has a pun in it
2. The title is related to food
3. The original title is recognizable
The evaluation questions described above are designed to
evaluate the requirements set for skill. We evaluate whether
a pun is perceived and whether the new title relates to food
separately, as it might be that the replacement word delivers
a pun, but is not food related or vice-versa.
Appreciation We defined appreciation from the standpoint of humor. A good title with a pun is also funny. For something to be funny, i.e. humorous, the pun has to exhibit coherence and surprise.
4. The title is humorous
5. The pun is surprising
6. The pun makes sense in the context of the original movie
We choose to evaluate the overall humor value of the title
separately from the components that constitute it. The last
two questions are designed to evaluate surprise and coher-
Imagination We used Boden's dichotomy to establish the definition of imagination. The minimal requirement was set to P-creativity. However, P-creativity can easily be verified by looking at the training data and the final output: if the output is different from the training material, there is P-creativity. Therefore, we use human judges to assess the H-creativity of the outputs.
7. The pun in the title is obvious
8. The pun in the title sounds familiar
If the pun is obvious, it is probably not very H-creative, as an obvious pun could be made by just about anyone. Likewise, if the pun sounds familiar, it has probably been said by someone
We take a random sample of 20 original movie titles that
were only present in the training data provided by the mas-
ter, 20 titles that were only present in the peer data and 20
titles that were in both sources of parallel data. We evalu-
ate the creative output of each apprentice for these randomly
sampled titles. In addition, we evaluate the master’s output
for the 40 titles of the sample it had generated movie title
puns for. As the master has generated multiple creative titles
per original title, we pick one randomly for each original
title. Altogether, we are evaluating 280 computer created
The evaluation was conducted on a crowd-sourcing platform called Figure Eight⁸. The platform assigned people to conduct the evaluation in such a way that each title was evaluated by 35 different users. The users could choose how many titles they wanted to evaluate. The results of the evaluation are shown in Table 2. In the Training column, master only, peer only and both indicate whether the original title was present only in the master-produced training data, only in the peer-produced training data, or in both, respectively.
The authoritarian scenario did not get the best average score for any of the test questions, and neither did the master. They both score particularly low on Q2, which reflects the fact that some of the words in the HTOED food and drink taxonomy, such as steam and spit, are only loosely related to food. It is interesting to note that the authoritarian scenario gets its best results for Q3, Q6 and Q7 for titles it did not encounter in the training data; in other words, it has developed an appreciation of its own that does not just mimic what the master produces and fail otherwise. In light of these results, we can deduce that the master produced worse titles with food puns than real people, which left both the master and the authoritarian scenario without the first place on any of the test questions.
The authoritative scenario, which was the highest-ranking one according to the master's liking as seen in Figure 6, got the best results for Q1 and Q5. This means that it succeeds best in the main task of generating puns, and its puns end up being the most surprising ones. It is also the only one that produces consistently good results (above 3 on average) for all training test sets for Q1-Q6. Unfortunately, the results for Q7 and Q8 are also above 3 on average, meaning that it does not rank high on H-creativity.
Style | Training | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8
---|---|---|---|---|---|---|---|---|---
authoritarian | both | 3.35 (1.08) | 2.65 (1.26) | 3.33 (1.06) | 3.02 (1.10) | 3.08 (0.99) | 3.03 (1.01) | 3.16 (0.99) | 3.10 (1.03)
authoritarian | peer only | 2.97 (1.18) | 2.14 (1.17) | 3.44 (1.13) | 2.66 (1.15) | 2.82 (1.08) | 3.10 (1.11) | 3.03 (1.12) | 3.11 (1.13)
authoritarian | master only | 3.34 (1.07) | 2.89 (1.28) | 3.30 (1.05) | 3.00 (1.12) | 3.07 (1.05) | 3.04 (1.02) | 3.12 (1.02) | 3.07 (1.04)
authoritative | both | 3.43 (1.08) | 3.09 (1.31) | 3.28 (1.12) | 3.08 (1.16) | 3.08 (1.05) | 3.02 (1.05) | 3.22 (1.06) | 3.13 (1.05)
authoritative | peer only | 3.41 (1.14) | 3.23 (1.35) | 3.37 (1.21) | 3.13 (1.16) | 3.07 (1.08) | 3.08 (1.09) | 3.24 (1.10) | 3.16 (1.13)
authoritative | master only | 3.47 (1.03) | 3.15 (1.33) | 3.41 (1.08) | 3.17 (1.11) | 3.16 (1.02) | 3.16 (1.04) | 3.28 (1.01) | 3.24 (1.06)
master | both | 3.38 (1.07) | 2.79 (1.28) | 3.29 (1.06) | 3.07 (1.12) | 3.09 (1.04) | 3.10 (1.06) | 3.22 (1.02) | 3.14 (1.05)
master | master only | 3.40 (1.07) | 2.61 (1.30) | 3.34 (1.11) | 3.10 (1.15) | 3.09 (1.02) | 3.04 (1.03) | 3.17 (1.04) | 3.11 (1.06)
neglecting | both | 3.45 (1.11) | 3.28 (1.32) | 3.34 (1.11) | 3.28 (1.12) | 3.16 (1.06) | 3.12 (1.04) | 3.21 (1.08) | 3.22 (1.07)
neglecting | peer only | 3.36 (1.07) | 3.02 (1.37) | 3.31 (1.11) | 3.12 (1.15) | 3.09 (1.02) | 3.14 (1.04) | 3.19 (1.02) | 3.14 (1.06)
neglecting | master only | 3.28 (1.13) | 2.87 (1.35) | 3.34 (1.12) | 3.09 (1.14) | 3.05 (1.05) | 3.07 (1.06) | 3.15 (1.06) | 3.18 (1.08)
permissive | both | 3.23 (1.18) | 2.67 (1.38) | 3.59 (1.06) | 3.06 (1.13) | 3.08 (1.08) | 3.30 (1.04) | 3.21 (1.07) | 3.30 (1.10)
permissive | peer only | 3.05 (1.19) | 2.87 (1.39) | 3.25 (1.18) | 2.88 (1.13) | 2.88 (1.09) | 3.00 (1.08) | 2.99 (1.11) | 3.04 (1.14)
permissive | master only | 3.09 (1.23) | 2.32 (1.24) | 3.64 (1.11) | 2.88 (1.15) | 2.91 (1.12) | 3.07 (1.15) | 2.98 (1.13) | 3.04 (1.12)
Table 2: Mean and standard deviation. Each cell gives the mean followed by the standard deviation in parentheses.
Style | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8
---|---|---|---|---|---|---|---|---
authoritarian | 78.33% | 31.67% | 83.33% | 31.67% | 48.33% | 68.33% | 70.00% | 68.33%
authoritative | 93.33% | 60.00% | 83.33% | 56.67% | 63.33% | 66.67% | 90.00% | 70.00%
master | 92.50% | 37.50% | 87.50% | 62.50% | 60.00% | 65.00% | 80.00% | 67.50%
neglecting | 86.67% | 60.00% | 81.67% | 66.67% | 65.00% | 73.33% | 75.00% | 78.33%
permissive | 66.67% | 33.33% | 85.00% | 43.33% | 46.67% | 73.33% | 68.33% | 71.67%
Table 3: Percentage of movie titles having an average score by judges greater than 3
The same consistency cannot be perceived in the permissive case, as scores below 3 are common across the test questions. It does, however, manage to score the best for Q3 and Q7-Q8; in other words, it can achieve the best H-creativity and its original titles can be the most recognizable, although not consistently so. This shows that even though the master's appreciation might not be spot on, as the master is not able to produce the best-scoring titles, moderation of the peer data and critical assessment of the apprentice-generated results by the master during training have a positive effect on the consistency of the results. In the permissive scenario, the apprentice was exposed to everything without criticism, whereas in the authoritative scenario some criticism was used to filter the training data, which made the authoritative scenario more consistent, but less H-creative.
Finally, the neglecting scenario gets the best scores for Q2,
Q4 and Q5. It is the best one at producing humorous, surprising
and food-related titles. It is quite consistent, with only the
Q2 results for previously unseen titles scoring below 3. The
good results of the neglecting scenario provide additional
evidence that the output of the master is worse than
human-written titles.
Table 3 shows the results from another standpoint. The table
shows, for each test question, the percentage of titles whose
average rating was above 3. These numbers are in line with what
was previously discussed about Table 2. The authoritarian
scenario leads to the worst performance, but this time the
master gets the highest percentage of titles above 3 for Q3. In
the authoritative scenario, most of the titles have a clear pun
and are related to food, with the highest percentages. The
permissive scenario holds the best percentages for Q6 and Q7,
and the neglecting scenario gets the best percentages in Q2,
Q4, Q5 and Q6.
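The statistics reported in Tables 2 and 3 can be illustrated with a minimal sketch. It assumes that each title received several judge ratings on a 1-5 scale, that the mean and standard deviation are taken over all individual ratings, and that the sample standard deviation is used; none of these details is stated in the text. The function name summarize_question and the toy ratings are hypothetical.

```python
from statistics import mean, stdev


def summarize_question(ratings_per_title, threshold=3.0):
    """Summarize one test question for one scenario.

    ratings_per_title: one inner list of judge ratings per title.
    Returns the overall mean, the sample standard deviation, and the
    percentage of titles whose average rating is strictly above threshold.
    """
    # Pool every individual judge rating (assumption: Table 2 style).
    all_ratings = [r for title in ratings_per_title for r in title]
    # Average the judges per title (Table 3 style).
    title_means = [mean(title) for title in ratings_per_title]
    n_above = sum(m > threshold for m in title_means)
    return {
        "mean": mean(all_ratings),
        "sd": stdev(all_ratings),
        "pct_above": 100.0 * n_above / len(title_means),
    }


# Toy data invented for illustration: four titles, three judges each.
toy = [[4, 5, 3], [2, 3, 3], [3, 3, 3], [5, 4, 4]]
summary = summarize_question(toy)
print(summary)
```

Note that a title averaging exactly 3 does not count as "greater than 3" in this sketch, which matches the wording of the Table 3 caption.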
Discussion and Future Work
The evaluation results were not completely in line with what
we can observe by examining the titles output by the different
methods ourselves. This raises the question of whether our
definition of creativity in movie title puns is adequate and
whether the evaluation questions we formulated based on the
definition really measure what they were designed to measure.
Because we have worked with a clear definition of creativity in
this paper, it is possible to subject it to critical study in
the future. We also find it evident that qualitative research
on the output titles, with respect to the quantitative results
we got from the human judges, is needed to evaluate the
evaluation itself.
Having a master with appreciation filter the parallel data of
the apprentice was beneficial for consistency (see authoritative
vs. permissive). Although the evaluation results showed that
the appreciation is not on par with that of a real human, the
implication remains that a good external appreciation can be
beneficial for the learning outcome of the apprentice model. As
we used a rather generic NMT model for the apprentice, our
findings might be of use in more traditional contexts of
sequence-to-sequence models such as machine translation, text
summarization or paraphrasing.
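The idea of an appreciation filter over parallel training data can be sketched as follows. This is only an illustration under stated assumptions: toy_appreciation is a hypothetical stand-in for the master's fitness function, the FOOD_WORDS lexicon and the example title pairs are invented, and the threshold value is arbitrary.

```python
# Hypothetical food lexicon, used only by the toy scoring function below.
FOOD_WORDS = {"pie", "ham", "bread"}


def toy_appreciation(source, target):
    """Hypothetical score: 1.0 if exactly one word was replaced by a
    food-related word, otherwise 0.0."""
    src_words = source.lower().split()
    tgt_words = target.lower().split()
    if len(src_words) != len(tgt_words):
        return 0.0
    changed = [t for s, t in zip(src_words, tgt_words) if s != t]
    return 1.0 if len(changed) == 1 and changed[0] in FOOD_WORDS else 0.0


def filter_parallel_data(pairs, appreciation, threshold):
    """Keep only the (source, target) pairs that the appreciation
    function scores at or above the threshold."""
    return [(src, tgt) for src, tgt in pairs
            if appreciation(src, tgt) >= threshold]


# Invented peer data: original title paired with a candidate pun.
peer_pairs = [
    ("The Green Mile", "The Green Meal"),  # "meal" is not in the toy lexicon
    ("Star Wars", "Star Wars"),            # nothing was changed
    ("Life of Pi", "Life of Pie"),         # one food-related replacement
]
kept = filter_parallel_data(peer_pairs, toy_appreciation, threshold=0.5)
print(kept)
```

Any scoring function with the same two-argument signature could be passed in place of toy_appreciation, so the same filter applies to other sequence-to-sequence training sets.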
For now, the master and apprentice have been studied in a
social vacuum, where peer data is the only link to the
surrounding world. However, in the future it would be fruitful
to see how the creative outcome changes when the master and the
apprentice are exposed to a more complex social system such as
the one described by Bronfenbrenner (1979). In such a society,
the master would also be under social pressure to change its
own standards of
Conclusions
This work has presented one of the first contributions to the
field of computational social creativity in which the
computationally creative agents are in a hierarchical social
relation. This asymmetry offers an intriguing setting for
studying the socialization of computational agents from the
creativity perspective.
Despite building our definition of creativity upon an existing
theory and formulating the test questions based on the
definition, the quantitative evaluation left many questions
unanswered. The results presented in this paper call for
qualitative evaluation to understand the phenomenon of
evaluation in this particular context.
Nevertheless, our findings suggest that having appreciation in
parenting, or training, an NMT model can be beneficial. The
applicability of these findings to sequence-to-sequence deep
learning models in a more generalized fashion is an interesting
research question in its own right.
References
Alnajjar, K., and Hämäläinen, M. 2018. A Master-
Apprentice Approach to Automatic Creation of Culturally
Satirical Movie Titles. In Proceedings of the 11th Interna-
tional Conference on Natural Language Generation (INLG),
Alnajjar, K.; Hadaytullah, H.; and Toivonen, H. 2018. “Tal-
ent, Skill and Support.” A method for automatic creation of
slogans. In Proceedings of the Ninth International Confer-
ence on Computational Creativity, 88–95.
Baumrind, D. 1991. Parenting styles and adolescent devel-
opment. The Encyclopedia of Adolescence 758–772.
Boden, M. A. 2004. The creative mind: Myths and mechanisms.
Bronfenbrenner, U. 1979. The ecology of human develop-
ment. Harvard University Press.
Brownell, H. H.; Michel, D.; Powelson, J.; and Gardner,
H. 1983. Surprise but not coherence: Sensitivity to verbal
humor in right-hemisphere patients. Brain and Language
18(1):20–27.
Chen, D. L., and Dolan, W. B. 2011. Collecting highly
parallel data for paraphrase evaluation. In Proceedings of
the 49th Annual Meeting of the Association for Computa-
tional Linguistics: Human Language Technologies-Volume
Colton, S. 2008. Creativity Versus the Perception of Cre-
ativity in Computational Systems. In AAAI Spring Sympo-
sium: Creative Intelligent Systems, Technical Report SS-08-
Corneli, J., and Jordanous, A. 2015. Implementing feedback
in creative systems: a workshop approach. In Proceedings
of the First International Conference on AI and Feedback-
Volume 1407, 10–17. CEUR-WS.org.
De Smedt, T., and Daelemans, W. 2012. Pattern for Python.
Journal of Machine Learning Research 13:2063–2067.
Deb, K.; Pratap, A.; Agarwal, S.; and Meyarivan, T. 2002.
A fast and elitist multiobjective genetic algorithm: NSGA-II.
IEEE Transactions on Evolutionary Computation 6(2):182–
Gabora, L. 1995. Meme and variations: A computer model
of cultural evolution. 1993 Lectures in Complex Systems
Hantula, O., and Linkola, S. 2018. Towards goal-aware col-
laboration in artistic agent societies. In Proceedings of the
Ninth International Conference on Computational Creativity.
He, H.; Peng, N.; and Liang, P. 2019. Pun generation with
surprise. arXiv preprint arXiv:1904.06828.
Honkela, T., and Winter, J. 2003. Simulating language
learning in community of agents using self-organizing maps.
Helsinki University of Technology.
Honnibal, M., and Montani, I. 2017. spaCy 2: Natural
Language Understanding with Bloom Embeddings, Convo-
lutional Neural Networks and Incremental Parsing. To appear.
Jordanous, A. 2012. A standardised procedure for evalu-
ating creative systems: Computational creativity evaluation
based on what it is to be creative. Cognitive Computation
Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; and Rush, A. M.
2017. OpenNMT: Open-Source Toolkit for Neural Machine
Translation. In Proc. ACL.
Linkola, S.; Takala, T.; and Toivonen, H. 2016. Novelty-
seeking multi-agent systems. In Proceedings of The Seventh
International Conference on Computational Creativity.
Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Ef-
fective approaches to attention-based neural machine trans-
lation. arXiv preprint arXiv:1508.04025.
Oring, E. 2003. Engaging humor. Urbana and Chicago:
University of Illinois Press.
Pagnutti, J.; Compton, K.; and Whitehead, J. 2016. Do
you like this art I made you: introducing Techne, a creative
artbot commune. In Proceedings of 1st International Joint
Conference of DiGRA and FDG.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002.
BLEU: a method for automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting of the Association
for Computational Linguistics, 311–318.
Ritchie, G. 2005. Computational mechanisms for pun gen-
eration. In Proceedings of the Tenth European Workshop on
Natural Language Generation (ENLG-05).
Saunders, R., and Bown, O. 2015. Computational social
creativity. Artificial Life 21(3):366–378.
Yu, Z.; Tan, J.; and Wan, X. 2018. A neural approach to
pun generation. In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1:
Long Papers), volume 1, 1650–1660.