A Case Study of NLG from Multimedia Data Sources:
Generating Architectural Landmark Descriptions
Simon Mille1, Spyridon Symeonidis2, Maria Rousi2, Montserrat Marimon Felipe1,
Klearchos Stavrothanasopoulos2, Petros Alvanitopoulos2, Roberto Carlini1,
Jens Grivolla1, Georgios Meditskos2, Stefanos Vrochidis2, and Leo Wanner1,3
1Universitat Pompeu Fabra, Barcelona, Spain
Corresponding author UPF: simon.mille@upf.edu
2Information Technologies Institute - CERTH, Thessaloniki, Greece
Corresponding author ITI-CERTH: spyridons@iti.gr
3Catalan Institute for Research and Advanced Studies (ICREA)
Abstract
In this paper, we present a pipeline system that
generates architectural landmark descriptions
using textual, visual and structured data. The
pipeline comprises five main components: (i)
a textual analysis component, which extracts
information from Wikipedia pages; (ii) a vi-
sual analysis component, which extracts infor-
mation from copyright-free images and video
frames; (iii) a retrieval component, which gath-
ers relevant ⟨property, subject, object⟩ triples
from DBpedia; (iv) a fusion component, which
stores the contents from the different modali-
ties in a Knowledge Base (KB) and resolves
the conflicts that stem from using different
sources of information; (v) an NLG compo-
nent, which verbalises the resulting contents
of the KB. We show that thanks to the addition
of other modalities, we can make the verbali-
sation of DBpedia triples more relevant and/or
inspirational.
1 Introduction
Nowadays, the bulk of information reaches the
reader across different media. Most of the videos
uploaded to YouTube come accompanied by
written natural language comments, and so do most
of the audio podcasts; even (online) newspaper ar-
ticles can hardly be imagined without any visual
illustrative material. This means that to generate a
comprehensive but, at the same time, not repetitive
content summary of the provided infor-
mation, the input from all media needs to be taken
into account and merged.
The value of merging complementary infor-
mation from multiple media for the generation
of more informative texts was pointed out al-
ready early in the field; cf., e.g., (Huang et al.,
1999). However, since then, only a few works have
tackled the problem; see, e.g., (Das et al., 2013; Xu
et al., 2015). A few others merge multimedia con-
tent in the context of other tasks such as retrieval;
cf. (Clinchant et al., 2011). In most of these works,
the integration is done using similarity measures
in a multimedia vector space. To the best of our
knowledge, none aims for integration (or fusion)
at, and subsequent generation from, the level of on-
tological ⟨subj, predicate, obj⟩ triples – which is
crucial in order to be able to generalize, use inheri-
tance or apply advanced reasoning mechanisms.
In this paper, we present a work that addresses
this challenge. It fuses triples from DBpedia, tex-
tual information from Wikipedia, and visual infor-
mation obtained from images as input to a pipeline-
based generator. Our work is situated in the context
of a larger research initiative, in which the objec-
tive is to automatically reconstruct architectural
landmarks in 3D, such that they can be used by
architects, game designers, journalists, etc. Each
reconstructed landmark should be accompanied by
automatically generated information that describes
its main features. The goal is to convey information
such as the landmark’s architect, some of its facade
elements, its date of construction and/or renovation,
its architectural style, etc. in terms of a text like:
Petronas Towers, which César Pelli designed,
are commercial offices and a tourist attraction
in Kuala Lumpur. The building has 88 floors
and 40 elevators and a floor area of 395,000 m².
It was the highest building in the world between
1998 and 2004, and was restored in 2001.
which is a verbalisation of the triples in Table 1.
In what follows, we describe how this is done,
from content extraction to the use of a grammar-
based text generator. We thus do not focus exclu-
sively on text generation. Rather, we attempt to
show how richer input structures can be created
using different media for the benefit of more com-
prehensive and user-relevant texts.

Petronas Towers   location          Kuala Lumpur
Petronas Towers   restorationDate   2001
Petronas Towers   floorArea         395,000
Petronas Towers   floorCount        88
Petronas Towers   elevatorCount     40
Petronas Towers   buildingType      commercial offices
Petronas Towers   buildingType      tourist attraction
Petronas Towers   highestRegion     world
Petronas Towers   highestStart      1998
Petronas Towers   highestEnd        2004
Petronas Towers   architect         César Pelli

Table 1: A set of ⟨subj, predicate, obj⟩ input triples
2 Related work
As already pointed out above, to the best of our
knowledge, only a few works deal with fusion of
content from different media as input representa-
tion for downstream applications (in our case, text
generation) and when they do, they use similar-
ity measures in a multimedia vector space (Huang
et al.,1999;Clinchant et al.,2011;Das et al.,2013;
Xu et al.,2015) rather than mapping multimedia
content onto a common ontology. This does not
mean, though, that research related to text genera-
tion across different media is neglected. For in-
stance, generation of image captions (Hossain et al.,
2019) and video descriptions (Aafag et al.,2019)
has recently become a very popular research topic.
All of the proposals in this area use sequence-to-
sequence neural network models. In (Idaya Aspura
and Azman,2017), indexes of textual and visual
features are integrated via a multi-modality ontol-
ogy, which is further enriched by DBpedia triples
for the purpose of semantics-driven image retrieval.
On the other side, text generation from ontological
structures is on the rise; cf., e.g., (Bouayad-Agha
et al.,2014;Gatt and Krahmer,2018) for overviews
and the WebNLG challenge (Gardent et al.,2017a)
for state-of-the-art works.
In general, there are three main approaches to
generating texts from ontologies: (i) filling slot
values in predefined sentence templates (McRoy
et al.,2003), (ii) applying grammars that encode
different types of linguistic knowledge (Varges and
Mellish,2001;Wanner et al.,2010;Bouayad-Agha
et al.,2012;Androutsopoulos et al.,2013), and (iii)
predicting the most appropriate output based on
machine learning models (Gardent et al.,2017b;
Belz et al., 2011). Template-based generators are very
robust, but also limited in terms of portability since
new templates need to be defined for every new
domain, style, language, etc. Machine learning-
based generators have the best coverage, but the
relevance and the quality of the produced texts
cannot be ensured. Furthermore, they are fully de-
pendent on the available (still scarce and mostly
monolingual) training data. The development of
grammar-based generators is time-consuming and
they usually have coverage issues. However, they
do not require training material, allow for a greater
control over the outputs (e.g., for mitigating errors
or tuning the output to a desired style), and the lin-
guistic knowledge used for one domain or language
can be reused for other domains and languages. A
number of systems also combine (i) and (iii), fill-
ing the slot values of pre-existing templates using
neural network techniques (Nayak et al.,2017).
In what follows, we opt for a grammar-based
generator. We show that information from visual,
textual and structured (DBpedia) sources can be
successfully fused in order to generate informative
descriptions using a pipeline-based text generator.
3 System and dataset overview
Let us first introduce the architecture of our system
and then outline the creation of the datasets used
for development and testing.
3.1 General system architecture
The workflow of our system is illustrated in Figure
1. The initial input is the topic entity on which the
text is to be generated. Based on this, the data col-
lection module harvests relevant content from the
Web. The resources of interest are images from the
Flickr website and texts from Wikipedia, which are
processed by the visual and textual analysis mod-
ules respectively. The two modules extract a rich
set of features that describe the entity. The knowl-
edge integration and reasoning module stores the
extracted visual and textual features along with
additional metadata retrieved from DBpedia in ded-
icated ontologies. Semantic reasoning and fusion
operations are subsequently executed on top of the
saved data to aggregate the information coming
from the different media into a unified entity repre-
sentation. Text generation starts from this represen-
tation in order to generate a textual description.
3.2 Development and test datasets
The targeted entities are architectural landmarks
such as buildings, statues or stadiums. The goal
is to be able to generate a description which con-
Figure 1: System architecture.
tains, e.g., the date of creation, the location, the
architecture style or the popularity.
To create sufficiently diverse datasets, we
first manually compiled a list of 160 landmarks
that vary in terms of the aforementioned character-
istics. In the next step, we retrieved the available
multimedia content on these landmarks: images
(from Flickr), textual descriptions (from Wikipedia)
and ontological properties (from DBpedia),1 and
then selected a subset of 120 landmarks that had
either rich image or DBpedia contents, or both. We
used 101 landmarks to develop and optimise our
framework, whereas 19 randomly selected land-
marks were left aside for the evaluation stage. The
full list of landmarks is available in Appendix A.1.
4 Multimedia content acquisition
In this section, we describe the content that we ex-
tracted from the three sources (DBpedia, images
and texts) used for generating the landmark descrip-
tions, and how we extracted it.
4.1 DBpedia triple retrieval
DBpedia contains a lot of information that is po-
tentially relevant to the description of architectural
landmarks. We analysed manually the DBpedia
entries of the 101 landmarks in the development
set in order to see which properties are related to
the landmark and its architectural features.2 We
identified 39 features of interest, most of which are
consistently found across the landmarks in the list,
among them, e.g., features related to the type of
the landmark, its style, who built it, the dates of its
construction, renovation, or extension, its location,
its construction materials, its cost, its number of
floors, elevators, or towers, etc. On average, about
6 features per landmark can be obtained through
DBpedia (up to 13 for a single landmark). The
information that corresponds to one feature can be
encoded by a variety of property names (up to 10).
For instance, the cost of a building can be expressed
by dbo:cost, dbp:cost, or dbp:constructionCost.3
The 39 features and the corresponding 98 proper-
ties are listed in Table 7, Appendix A.2.

1 While the resources on Wikipedia and DBpedia are free
for use, we had to pay special attention to image collection
from Flickr to ensure that we gather media whose license
permits their reuse for our purposes.
2 See for illustration the Petronas Towers page:
http://dbpedia.org/page/Petronas_Towers.
3 There are two main types of properties in DBpedia:
“clean” properties that stem from the DBpedia ontology (dbo:
prefix), and properties automatically extracted from raw
Wikipedia infoboxes (dbp: prefix).
For the retrieval of the corresponding DBpedia
triples, we developed a component that applies
SPARQL queries to the DBpedia SPARQL end-
point.4 In some rare cases, the queries returned an
error message at the time they were performed; in
such a case, the property cannot be accessed and
the information is not retrieved.

4 http://dbpedia.org/sparql
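For illustration, a minimal retrieval sketch in Python, assuming the SPARQLWrapper library and a small hand-picked subset of the properties of Table 7, could look as follows; it is not the component used in the system.

# Sketch of per-property retrieval against the public DBpedia endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://dbpedia.org/sparql"
PROPERTIES = ["dbo:architect", "dbo:floorCount", "dbp:architecturalStyle"]

def retrieve_triples(resource):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    triples = []
    for prop in PROPERTIES:
        sparql.setQuery(f"""
            PREFIX dbo: <http://dbpedia.org/ontology/>
            PREFIX dbp: <http://dbpedia.org/property/>
            PREFIX dbr: <http://dbpedia.org/resource/>
            SELECT ?o WHERE {{ dbr:{resource} {prop} ?o }}""")
        try:
            results = sparql.query().convert()
        except Exception:
            continue  # queries that return an error are simply skipped
        for binding in results["results"]["bindings"]:
            triples.append((resource, prop, binding["o"]["value"]))
    return triples

print(retrieve_triples("Petronas_Towers"))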
4.2 Visual content acquisition
The performed visual analysis for the purpose of
visual content acquisition is twofold. First, an ob-
ject detection module classifies indoor and outdoor
scenes and detects landmark (in this case, building)
elements, and objects. Second, an architectural
style classification module assigns the related ar-
chitectural style label to each outdoor scene of the
selected dataset. Both classification modules are
based on state-of-the-art deep learning techniques.
4.2.1 Visual scene classification and labeling
For visual scene classification, we draw upon the
145 relevant indoor and outdoor scene classes from
the Places dataset (Zhou et al.,2018), which con-
tains 1,803,460 images, annotated with a total of
365 classes. The classifier is a VGG16 deep neu-
ral network (Simonyan and Zisserman,2014), pre-
trained on the Places dataset for the first 14 layers
(for all of its 365 classes) and fine-tuned on a subset
of 145 selected classes for the last two layers.
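A rough Keras sketch of such a fine-tuning setup is given below; ImageNet weights stand in for the Places-pretrained weights, so the layer split and initialisation are only approximations of the configuration described above.

# Approximate sketch of the scene classifier: a VGG16 backbone with its earlier
# layers frozen and only the last layers retrained on the 145 selected classes.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

backbone = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in backbone.layers[:-2]:
    layer.trainable = False            # keep the pre-trained layers fixed

model = models.Sequential([
    backbone,
    layers.Flatten(),
    layers.Dense(145, activation="softmax"),   # 145 indoor/outdoor scene classes
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])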
For labeling the landmark elements and objects
in the classified scenes, we use a Deeplab model,
pre-trained on the PASCAL VOC (Everingham
et al., 2009) dataset, where further training was
applied using a combination of building façade
segmentation datasets (Mapillary Vistas (Neuhold
et al., 2017), CMP (Tylecek and Šára, 2013), ECP,5
LabelMeFacade (Frohlich et al., 2010), eTRIMS
(Korč and Förstner, 2009)).6 This not only resulted
in a computationally efficient implementation for
the detection of architectural landmark-related arte-
facts, but also increased the classification accuracy
of the model. In order to further improve object
detection, we added a third module based on the
Mask RCNN model. We initialized this module
using pre-trained weights on the COCO dataset (Lin
et al., 2014) and then performed fine-tuning on a
customized set created by merging the LVIS (Gupta
et al., 2019) and ADE20K (Zhou et al., 2016)
datasets and removing all classes irrelevant to the
scope of our task. Details about the training set-
tings are provided in Appendix A.3.

5 http://vision.mas.ecp.fr/Personnel/teboul/data.php
6 These datasets contain up to 25,000 high-resolution im-
ages annotated with a variety of semantic classes and possibly
instance-specific labels.
Figure 2: (L) Visualisation of the detection module’s results:
the algorithm detects the building along with the sur-
roundings and generates the corresponding predicted
tags. (R) Architectural style recognition; predicted
label: Romanesque, true label: Romanesque.
4.2.2 Landmark style identification
In the context of this task, the visual analysis com-
ponent aims to assign to a landmark one of the 18
architectural styles listed in Table 2 and to iden-
tify the following seven additional types of features
that are later on used for text generation: (i) con-
struction type (e.g., ‘amphitheater’, ‘castle’, ‘ho-
tel’); (ii) similarities with other construction types
(same list as (i)); (iii) similarity of a part of the con-
struction with another construction (e.g., ‘bridge’,
‘arch’); (iv) facade elements (e.g., ‘balcony’, ‘fire
escape’); (v) interior components and objects (e.g.,
‘fireplace’, ‘elevator shaft’, etc.); (vi) environment
(e.g., ‘downtown’, ‘village’, ‘park’); (vii) proxim-
ity to a natural landmark (e.g., ‘river’, ‘park’). The
full list of detected features is provided in Table 8,
Appendix A.3. In total, eight different properties
are extracted for generation, two of which (style,
construction type) are fused with the properties ob-
tained through the other modalities, and the other
ones are used for text generation as such.
A list of the supported Architectural styles
Art Deco Art Nouveau Baroque Bauhaus
Biedermeier Corinthian Order Deconstructuvism Doric Order
Early Roman Gothic Hellinistic Ionic Order
Modernist Neoclassical Postmodernism Renaisance
Rococo Romanesque
Table 2: Architectural Styles
For training of the model for landmark style and
feature identification, images from Flickr, Euro-
peana and Wiki were collected. Annotators with
architectural expertise annotated the collected data.
A set of 11,368 newly annotated images was used
for training purposes, while a total of 1,276 newly
annotated images were used for testing. VGG16
and ResNet50 models were enhanced with 3 layers
(one GlobalAveragePooling2D and two Dense lay-
ers) and initialised with the pre-trained ImageNet
weights. For better training, K-Fold Validation and
Stratified Shuffle Split were applied.
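The head configuration described above can be sketched as follows with Keras; the size of the intermediate Dense layer and the training hyper-parameters are assumptions, not the values used for the reported models.

# Sketch of the style classifier: an ImageNet-initialised ResNet50 backbone
# extended with GlobalAveragePooling2D and two Dense layers, the last one
# covering the 18 architectural styles of Table 2.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

backbone = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),      # intermediate size is illustrative
    layers.Dense(18, activation="softmax"),    # one output per architectural style
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])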
4.3 Textual content acquisition
In addition to using visual features, we enriched
the DBpedia information by entity-relation-entity
triples extracted from the unstructured part of the
Wikipedia articles. This is done using a pipeline
that comprises concept detection, entity linking and
word sense disambiguation (WSD) to identify the proper entities (linking to DB-
pedia URIs); then, PoS tagging, dependency pars-
ing and coreference resolution to generate surface-
syntactic structures and to link mentions of entities
in the different parts of text; and, finally, semantic
parsing to generate deep-syntactic structures from
which we extract the triples.
In what follows, we outline the types of infor-
mation we aim to extract from the textual data and
how we do it.
4.3.1 Targeted information in textual data
Unlike the content extracted from visuals, which
is not expected to be found in DBpedia since it is
related to specific images and some “subjective”
features, the information extracted from Wikipedia
is supposed to be already captured in DBpedia.
However, we observe that some of the relevant
properties are often missing. The goal of the textual
analysis component is thus to recover these missing
properties, which concern, in particular, the type
of the landmark, its date(s) of construction and
renovation, its location, its architectural style and
its architect, designer or creator. In order to reduce
the load on textual analysis, we analyse only the
first paragraphs of the scraped Wikipedia articles.
As an example, consider the text “Rouen Cathe-
dral is a Roman Catholic church in Rouen, Nor-
mandy, France. It is the seat of the Archbishop of
Rouen, Primate of Normandy. The cathedral is in
the Gothic architectural tradition., for which the
following triples are extracted:
Rouen Cathedral Localisation Roman Catholic church
Rouen Cathedral Location Rouen, Normandy, France
Rouen Cathedral Style Gothic architectural tradition
4.3.2 Triple extraction from texts
In order to extract the targeted triples, we apply a
sequence of rule-based graph transducers on the
output of an off-the-shelf syntactic parser. More
specifically, we run the pipeline used for creating
the deep input representations of the Surface Re-
alisation shared tasks 2018 and 2019 (Mille et al.,
2018), with one additional component responsi-
ble for identifying the configurations that corre-
spond to the targeted information. Consider, for
illustration, a sample rule in Figure 3, which ex-
tracts the ‘Rouen Cathedral – Localisation – Ro-
man Catholic church’ triple from the predicate-
argument (PredArg) structure as encoded by light
verb be-constructions, where the first argument of
the light verb be becomes the first argument of the
predicate in the PredArg representation.
Figure 3: A sample rule to extract triples. The Left-
Side matches a part of the input tree, the RightSide
builds part of the output. Three types of objects are
used: nodes (?N{}), relations between nodes (?r)
and attribute-value pairs associated to a node (?a =
b), where question marks indicate variables and text in
black indicates literal strings.
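As a rough, self-contained illustration of the kind of configuration such a rule matches, the following Python sketch uses a spaCy dependency parse, rather than the graph transducers of the actual pipeline, to capture copular “X is a Y in Z” patterns and emit Localisation/Location triples; it is only a simplified stand-in.

# Simplified stand-in for the rule in Figure 3, over a spaCy dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm

def extract_copular_triples(sentence):
    doc = nlp(sentence)
    triples = []
    for attr in doc:
        # the attribute of a copular "be", e.g. "church" in the example sentence
        if attr.dep_ != "attr" or attr.head.lemma_ != "be":
            continue
        subjects = [c for c in attr.head.children if c.dep_ == "nsubj"]
        if not subjects:
            continue
        subj = " ".join(t.text for t in subjects[0].subtree)
        mods = [c.text for c in attr.lefts if c.dep_ in ("amod", "compound")]
        triples.append((subj, "Localisation", " ".join(mods + [attr.text])))
        # locations expressed as an "in ..." prepositional phrase on the attribute
        for prep in attr.children:
            if prep.dep_ == "prep" and prep.lower_ == "in":
                for pobj in prep.children:
                    if pobj.dep_ == "pobj":
                        loc = " ".join(t.text for t in pobj.subtree)
                        triples.append((subj, "Location", loc))
    return triples

print(extract_copular_triples(
    "Rouen Cathedral is a Roman Catholic church in Rouen, Normandy, France."))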
4.4 Quantitative analysis of the information
acquisition modules
We evaluated the text analysis component with a
set of 16 texts and measured average values of
83% for precision and 40% for recall on the triple
extraction for the targeted triples (see Section 4.3).
In other words, the coverage of the module needs
to be extended to get more information, and some
incorrect values would need to be filtered. The
main limitations here are the difficulty in covering
the wide variety of surface syntactic structures, and
the quality of syntactic parses.
For the evaluation of the architectural style clas-
sification model, a set of building images was se-
lected. The dataset for testing comprises 1,276 im-
ages and includes all 18 architectural styles. The
F1 score was taken into consideration, and 46.16%
of the images were correctly classified, which is
similar to the state-of-the-art results (Xu et al., 2014);
the confusion matrix for the architectural style clas-
sification is shown in Figure 7 in Appendix A.3.
Even though the classification is state-of-the-art, in
more than 50% of the cases the predicted architectural
style is wrong, which is one of the main issues raised
by the evaluators with respect to the correctness of contents.
5 Multimedia information fusion
The results of the visual and textual analyses and
the retrieved DBpedia properties are mapped us-
ing the Web Annotation Data Model.7 The model
creates a body and a target for each annotation.
As main interconnection point between the con-
tent from different media, we use the name of the
corresponding entity, which is mapped as the tar-
get of the annotation. The body of the annotation
contains all the other information, which varies ac-
cording to the nature of each input module. More
specifically: (i) for the visual analysis content, the
body contains information pertinent to scene, ob-
jects, façade, structure elements and architectural
styles. A mapping example is shown in Figure 4;
(ii) for the Wikipedia analysis outcome, the body
contains creator and localisation information re-
lated to the entity; (iii) for the retrieved DBpedia
triples, the information in the body is pertinent to
the landmark, the architecture, the location and
more general information about the landmark.
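For illustration, an annotation with the body/target structure described above (compare Figure 4 further below) could be assembled with the rdflib library as in the following sketch; the namespaces and class names follow the figure, but the construction code itself is not the project’s knowledge-integration module.

# Sketch of building one visual annotation following the Web Annotation Data Model.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

V4D = Namespace("https://v4design.eu/ontologies/")
EX = Namespace("https://v4design.eu/ontologies/examples#")
OA = Namespace("http://www.w3.org/ns/oa#")

g = Graph()
g.bind("v4d", V4D)
g.bind("oa", OA)

g.add((EX.VisualAnnotation_1, RDF.type, V4D.VisualAnnotation))
g.add((EX.VisualAnnotation_1, OA.hasBody, EX.VisualView_1))        # the analysis results
g.add((EX.VisualAnnotation_1, OA.hasTarget, EX.VisualFeature_1))   # the entity name
g.add((EX.VisualFeature_1, RDF.type, V4D.VisualFeature))
g.add((EX.VisualFeature_1, V4D.isRelatedWith, Literal("Alhambra")))
g.add((EX.VisualView_1, RDF.type, V4D.VisualView))
g.add((EX.VisualView_1, V4D.containsImage, V4D.Image_1))
g.add((V4D.Image_1, V4D.hasArchitecturalStyle, V4D.ArchitecturalStyle_1))
g.add((V4D.ArchitecturalStyle_1, RDFS.label, Literal("Baroque")))

print(g.serialize(format="turtle"))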
A reasoning mechanism applies the following
property-based semantic rules at the time of the
retrieval of DBpedia triples: (i) Extraction of class
information about creators and locations: the mech-
anism detects whether a creator is a person, an or-
ganisation or a company, and whether a location is
a region, a city or a country. (ii) Unit detection: in
7 https://www.w3.org/TR/annotation-model/
@prefix examples: <https://v4design.eu/ontologies/examples#> .
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix v4d: <https://v4design.eu/ontologies/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
examples:VisualAnnotation_1 a v4d:VisualAnnotation;
oa:hasBody examples:VisualView_1;
oa:hasTarget examples:VisualFeature_1.
examples:VisualView_1 a v4d:VisualView;
v4d:containsImage v4d:Image_1 .
examples:VisualFeature_1 a v4d:VisualFeature;
v4d:isRelatedWith "Alhambra" .
v4d:Image_1 a v4d:Image;
v4d:hasArchitecturalStyle v4d:ArchitecturalStyle_1;
v4d:hasFacadeElement v4d:FacadeElement_1;
v4d:hasObject v4d:Object_1;
v4d:hasScene v4d:Scene_1;
v4d:hasStructureElement v4d:StructureElement_1;
v4d:imageName "24.jpg" .
v4d:Scene_1 a v4d:Scene;
rdfs:label "palace";
v4d:hasGenericClass "http://www.semanticweb.org
/inlg-ontology#building";
v4d:isOutdoor "true";
v4d:probability "0.4205022" .
v4d:StructureElement_1 a v4d:StructureElement;
rdfs:label "structure--building";
v4d:hasGenericClass "http://www.semanticweb.org
/inlg-ontology#building";
v4d:probability "0.4401882570316443" .
v4d:FacadeElement_1 a v4d:FacadeElement;
rdfs:label "door";
v4d:probability "0.5884110748255492" .
v4d:Object_1 a v4d:Object;
rdfs:label "signboard";
v4d:probability "0.9469741" .
v4d:ArchitecturalStyle_1 a v4d:ArchitecturalStyle;
rdfs:label "Baroque";
Figure 4: Example of the mapping of visual data
case that DBpedia information contains the concept
of monetary cost, extracting the currency provides
the corresponding information to Text Generation.
The same rule is applied for literals. (iii) Filtering
of undesired values: for instance ‘buildingType’
cannot contain values such as ‘Cultural’, ‘style’
cannot contain an affirmation of type “yes”, etc.
(iv) Retrieval of one or more values according to
the property category: for example, for properties
such as ‘buildStartDate’, if more than one result
is found, only one is returned, while for properties
like ‘materials’, if more than one result is found,
all of them are returned.
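A minimal sketch of rules (iii) and (iv) in Python, with illustrative banned-value lists and an assumed split into single- and multi-valued properties, could be:

# Toy version of the value filtering and single-/multi-value retrieval rules.
BANNED = {"buildingType": {"Cultural"}, "style": {"yes"}}
SINGLE_VALUED = {"buildStartDate", "buildEndDate"}

def clean_values(prop, values):
    values = [v for v in values if v not in BANNED.get(prop, set())]
    if prop in SINGLE_VALUED:
        return values[:1]          # keep a single value
    return values                  # e.g. 'materials': keep all values

print(clean_values("buildingType", ["Cultural", "tourist attraction"]))
print(clean_values("buildStartDate", ["1998", "2015"]))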
During the fusion procedure, the content ob-
tained from Wikipedia, DBpedia and images is
merged per entity. For visuals, since for each entity
the results are analysed per image, we return the
five values that have the most occurrences in the
image collection per category, i.e., “scene”, “ob-
ject”, “façade” and “structure elements”. For the
information that belongs to the same category and
comes from different modules (e.g., type of build-
ing, creator, architectural style), we select the most
frequent entities, or if there is none, we use the
information from DBpedia or pick one randomly.
The properties that are fused and the analysis mod-
ule they come from are shown in Table 3. At the
end of the fusion procedure, the results contain
both the information from the individual modules
and the fusion selection.
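The selection strategy just described can be sketched as a simple frequency-based vote; the function names and the fallback order below are illustrative assumptions, not the system’s implementation.

# Frequency-based fusion sketch: keep the five most frequent labels per category,
# and for properties reported by several modules pick the most frequent value,
# falling back to the DBpedia value.
from collections import Counter

def top_values(per_image_labels, k=5):
    return [label for label, _ in Counter(per_image_labels).most_common(k)]

def fuse_property(candidates, dbpedia_value=None):
    counts = Counter(c for c in candidates if c)
    if counts and counts.most_common(1)[0][1] > 1:
        return counts.most_common(1)[0][0]
    return dbpedia_value or (candidates[0] if candidates else None)

print(top_values(["palace", "castle", "palace", "museum", "palace", "castle"]))
print(fuse_property(["Gothic", "Gothic", "Baroque"], dbpedia_value="Gothic"))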
Property             Text          DBpedia                       Visual
Building type        Localisation  hypernyms and buildingTypes   scene recognition features
Creator              creator       creator                       -
Architectural style  -             style                         architectural style

Table 3: Fused properties from different sources
6 Generation of landmark descriptions
Despite the advances in neural NLG, grammar-
based generation is still a valuable option when
training data are scarce and/or when a large cover-
age grammar-driven generator is already available.
6.1 Grammar-based generation
No annotated datasets of architectural landmark de-
scriptions are available for training machine learning-
based models or for extracting sentence templates for
template-based generation. Therefore, we tackle de-
scription generation from the fused ontological
triples presented above using FORGe, a portable
grammar-based generator that has been adapted to
structured data inputs, in particular DBpedia triple
sets (Mille et al.,2019). The input triples are in-
dividually mapped to minimal predicate-argument
templates (see Figure 5), which are then sent to the
generator. The generation consists of a sequence
of graph-transduction grammars that map succes-
sively the PredArg templates to linguistic struc-
tures of different levels of abstraction, in particular
syntax, topology, morphology, and finally texts.
PredArg structures are very similar to the Facts
in ILEX’s Content potential structures (O’Donnell
et al.,2001), or the Message triples in NaturalOWL
(Androutsopoulos et al.,2013), with the difference
that all predicates in the PredArg structures are in-
tended to represent atomic meanings (e.g. highest
+ building as opposed to highestBuilding), allowing
for more flexible aggregation and sentence structur-
ing. The first part of the generation pipeline, which
produces aggregated predicate-argument graphs, is
also comparable to ILEX, while the surface reali-
sation is largely inspired by MARQUIS (Wanner
et al.,2010). Our generator shares not only its gen-
eral architecture with these two systems, but also
the use of lexical resources with subcategorisation
information and of a multilingual core of rules. One
of the specificities of our pipeline is that two types
of aggregation take place during generation, one
at the predicate-argument level (in a NaturalOWL
fashion), and one at the syntactic level (see below).
6.2 Extension of an existing generator
The base generator covers about 400 DBpedia prop-
erties, but only a few had to do with architectural
landmarks, and generation of up to only 10 triples
had been tested. In this work, the inputs can contain
up to 19 triples, and most of the properties are new.
We thus extended the coverage of the generator
according to two main aspects: (i) addition of 38
manually crafted PredArg templates, (ii) addition
of domain-specific “semantic” aggregation rules.
For (i), the 38 new properties8 were each associ-
ated with a new PredArg template; see, for instance,
the templates corresponding to the ‘highestEnd’
and ‘interiorComponent’ properties in Figure 5. In-
stantiating the template 5(a) with the values of
Table 1 ([name] = Petronas Towers, [highestEnd] =
2004) and generating it would result in the sentence
The Petronas Towers were the highest building in
the world until 2004. Template 5(b) would be
realised as There is an elevator shaft in P. Towers.
Figure 5: Sample predicate-argument templates; (a) = DBpedia, (b) = visuals
For (ii), we designed new aggregation rules to
complement the generic rules already in place,
which are based on the identity between predicates
and/or entities only. In particular, properties that
involve dates need to be aggregated in a specific
way when there are both a start and an end date.
For instance, highestStart and highestEnd as seen
in Table 1 trigger a rule that introduces a between:
the highest building in the world between 1998 and
2004. In parallel, some other rules aggregate some
properties in priority, if found in the input, in the
following order: [style + date], [creator + style], [generic rules].
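As an illustration of this date-specific aggregation, a toy Python version of the rule, operating on string triples rather than on the PredArg graphs actually used by the generator, could look like this:

# Toy version of the rule merging a start and an end date into "between X and Y".
def aggregate_highest(triples):
    props = {p: o for _, p, o in triples}
    start, end = props.get("highestStart"), props.get("highestEnd")
    region = props.get("highestRegion", "world")
    if start and end:
        return f"the highest building in the {region} between {start} and {end}"
    if end:
        return f"the highest building in the {region} until {end}"
    return None

triples = [("Petronas Towers", "highestStart", "1998"),
           ("Petronas Towers", "highestEnd", "2004"),
           ("Petronas Towers", "highestRegion", "world")]
print(aggregate_highest(triples))
# -> the highest building in the world between 1998 and 2004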
Specific rules (e.g., for linearisation) were also
improved to cover the generation of texts that are
not only larger due to the input size, but also more
complex due to the more complex syntactic con-
structions (e.g., non-projective trees, as in highest
building in the world). Finally, we crafted a new
syntactic aggregation module, which aggregates
coordinated and relative clauses based on identity
of syntactic subjects/objects, locations and verbs.9

8 32 from DBpedia (7 out of the 39 were already covered;
see Table 7, Appendix A.2), and 6 coming from the visual
content analysis (see Section 4.2).
9 In the case of the ‘interiorComponent’ property, as seen
in Figure 5, there is no verb at the semantic level; it is only in-
troduced in the syntactic structure. The syntactic aggregation
module covers such cases.
7 Evaluation
We evaluate the quality of the generated descrip-
tions from fused representations, first of all, against
monomodal descriptions generated solely from DB-
pedia triples. The goal is to assess to what extent
architectural landmark descriptions benefit from ad-
ditional content from other media. A comparison
with Wikipedia texts is also carried out.
7.1 Evaluation method
Six journalists, architects and architectural land-
mark content providers were recruited for the eval-
uation.10 They were asked to evaluate descriptions
with respect to their correctness of form and content
and their level of interest by rating the following
statements on a 6-value Likert scale:11
Correctness of Form: Independently of Correct-
ness of content and Interestingness, (i) the surface
form of the text is free of grammatical and spelling
errors, (ii) the text is easy to read and understand,
and (iii) it flows well.
Correctness of Content: Independently of Cor-
rectness of form and Interestingness, and using
only my current knowledge on the topic, I do not
identify information that looks obviously incorrect.
Interestingness: Independently of Correctness of
form and content, I find the information provided
in the text interesting, relevant and inspirational.
The evaluation test set consisted of the descrip-
tions of 19 landmarks: 19 descriptions generated
from fused multimedia representations, 19 gener-
ated from DBpedia triples and the first or second
paragraph (whichever was the most informative
in terms of architectural descriptions) from the
Wikipedia articles on the 19 landmarks in question.
For each building, all evaluators were presented
with the three descriptions, in a random order, and
scored each description. That is, each system re-
ceived 114 ratings for each of the 3 criteria, which
we believe makes the evaluation trustworthy.

10 All evaluators are fluent in English and familiar with the
described landmarks.
11 Answers from 1: strongly disagree to 6: strongly agree.
7.2 Evaluation against DBpedia-based
descriptions
The comparison between the descriptions gener-
ated from fused multimedia representations and the
descriptions generated from DBpedia triples (see
Table 4) is shown in Figure 6. A 2-tailed Mann-
Whitney U test indicates that only for Form the
difference is not statistically significant at p < 0.05.
            Form   Content   Interest.
DBpedia      4.6       3.8        2.7
Fusion       4.9       3.4        3.1

Figure 6: Results of the human evaluation
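For reference, the significance test reported above corresponds to the following call, sketched with scipy and hypothetical rating lists; the actual 114 ratings per system and criterion are not reproduced here.

# Sketch of the two-tailed Mann-Whitney U test with illustrative Likert ratings.
from scipy.stats import mannwhitneyu

dbpedia_form = [5, 4, 5, 4, 5, 5, 4, 4]
fusion_form = [5, 5, 4, 5, 5, 4, 5, 5]

stat, p = mannwhitneyu(dbpedia_form, fusion_form, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")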
7.3 Evaluation against Wikipedia
The 6 evaluators were also asked to rate the
Wikipedia articles. In most cases, one single
Wikipedia paragraph was longer and richer than
either of our generated descriptions, so the texts
are not fully comparable, but our objective has
been to define some upper bound scores for a short
text. Wikipedia paragraphs scored 5.5, 5.3 and
4.4 for the correctness of form, of contents and
interestingness respectively, that is, 0.6, 1.9 and
1.3 points higher than our fused descriptions. We
also asked the evaluators to pick which text they
preferred among the 3 candidates, and interestingly,
Wikipedia articles were not always chosen: in 15
cases out of 114, an automatically generated text
was picked (DBpedia: 4, Fusion: 11).
7.4 Discussion
Table 4 shows texts from the different sources. Fig-
ure 6 shows that while the scores for the correctness
of form are rather high for both the fused and DBpe-
dia descriptions (close to 5), the scores for the other
two criteria are low, in particular for interestingness
Wikipedia (human)
The Sydney Opera House is a multi-venue performing arts
centre at Sydney Harbour in Sydney, New South Wales, Aus-
tralia. It is one of the 20th century’s most famous and distinc-
tive buildings.
DBpedia
Sydney Opera House, which Jørn Utzon designed, is a Per-
forming arts center in Sydney. Sydney Opera House, the
architectural style of which is Expressionist architecture, was
built between 1 March 1959 and 1973. Its structure is made
of Concrete frame & precast concrete ribbed roof.
Fused
Sydney Opera House, which Jørn Utzon designed, is a centre
in Sydney. Its structure is made of Concrete frame & pre-
cast concrete ribbed roof. An element of the structure is
like a bridge. Sydney Opera House has similarities with a
beach house, an amusement park and a museum. Sydney
Opera House, the architectural style of which is Deconstruc-
tuvism, was built between 1 March 1959 and 1973.

Table 4: Sydney Opera House descriptions (eval set)
(2.7 and 3.1). However, adding information from
different sources does increase slightly the interest-
ingness of the texts, and even their correctness in
terms of form, but at the expense of the correctness
of the contents.
The lower scores obtained for interestingness
(when compared to the other two criteria) even
for the human-written Wikipedia texts highlight
the difficulty for short texts to be considered in-
teresting and inspirational. But the fact that the
automatically generated texts score more than one
point lower than Wikipedia shows that the content
representation of the 44 features is still not suffi-
cient; more content needs to be provided. Other
DBpedia properties such as ‘owner’, ‘tenant’, or
information related to other landmarks or persons
such as ‘architecturalStyle (of)’, ‘birthPlace (of)’,
‘influencedBy’, ‘location (of)’, etc. could augment
the interestingness of the descriptions.
The low scores of our generator for the correct-
ness of the contents, in particular for the fused
descriptions (3.4) are due to several causes. First
of all, as shown by the scores of the DBpedia-only
descriptions (3.8), not all the information in DBpe-
dia is factually correct, in particular the informa-
tion extracted automatically from Infoboxes: build-
ings can be assigned types such as “Series”, “Nick-
name” or “Mixed-use” (see Tables 9 and 11, A.4);
construction dates can be irrelevant (“between 2015
and 532”); locations sometimes refer to a relative
location (e.g., “right”), etc. Second, the informa-
tion extracted from texts and visuals, tasks which
are traditionally difficult to solve, is also not per-
fect (a detailed error analysis is provided in Section
4.4); incorrect architectural styles (e.g., Tables 10
and 12, A.4) and comparisons between supposedly
similar buildings (see Table 10, A.4) were found
particularly disconcerting by the evaluators. Fi-
nally, the performance of the fusion component is
currently heavily dependent on the cases seen in
the development dataset. In the development set, in
most cases, the selected entities were valuable and
supported the Text Generation as expected, but in
the evaluation set, many cases had not been seen,
such that ill-informed decisions were taken, some-
times triggering the replacement of a correct value
from DBpedia by an incorrect value from visual or
textual analysis (see Table 4and Tables 9and 12,
A.4). A larger development set would be needed
in order to identify more erroneous configurations.
Another solution may be more generic strategies to
foresee the possible mistakes in the inputs.
8 Conclusions
We presented the case of the generation of architec-
tural landmark descriptions from ontological struc-
tures that contain fused content from visual, textual
and ontological sources. The evaluation showed
that when compared to descriptions generated from
the DBpedia RDF-triples obtained from textual
material only (i.e., Wikipedia), descriptions that
communicate fused content are considered more
interesting and better in terms of textual quality.
However, also due to the limited content features
that were considered in the experiments, these de-
scriptions cannot compete, in general, with more
comprehensive well-written descriptions as encoun-
tered in Wikipedia. Still, it needs to be taken into ac-
count that by far not all architectural landmarks
that are of interest from a professional or cultural
viewpoint are covered by Wikipedia. Fused content
descriptions are then a welcome solution.
Acknowledgements
This work was supported by the European Commis-
sion in the context of its H2020 Program under the
grant numbers 870930-RIA, 779962-RIA, 825079-
RIA, 786731-RIA at Universitat Pompeu Fabra and
Information Technologies Institute - CERTH.
References
Nayyer Aafag, Ajmal Mian, Wei Liu, Syed Zulqarnain
Gilani, and Mubarak Shah. 2019. Video description:
A survey of methods, datasets and evaluation met-
rics. ACM Computing Surveys, 52(6).
Ion Androutsopoulos, Gerasimos Lampouras, and Dim-
itrios Galanis. 2013. Generating natural language
descriptions from owl ontologies: the naturalowl
system. Journal of Artificial Intelligence Research,
48:671–715.
Anja Belz, Mike White, Dominic Espinosa, Eric Kow,
Deirdre Hogan, and Amanda Stent. 2011. The first Sur-
face Realisation Shared Task: Overview and evalua-
tion results. In Proceedings of the Generation Chal-
lenges Session at the 13th European Workshop on
Natural Language Generation (ENLG).
Nadjet Bouayad-Agha, Gerard Casamayor, Simon
Mille, Marco Rospocher, Horacio Saggion, Luciano
Serafini, and Leo Wanner. 2012. From ontology to
nl: Generation of multilingual user-oriented environ-
mental reports. In International Conference on Ap-
plication of Natural Language to Information Sys-
tems, pages 216–221. Springer.
Nadjet Bouayad-Agha, Gerard Casamayor, and Leo
Wanner. 2014. Natural language generation in
the context of the semantic web. Semantic Web,
5(6):493–513.
Stéphane Clinchant, Julien Ah-Pine, and Gabriela
Csurka. 2011. Semantic combination of textual and
visual information in multimedia retrieval. In Pro-
ceedings of the 1st ACM international conference on
multimedia retrieval, pages 1–8.
Pradipto Das, Rohini Kesavan Srihari, and Jason J.
Corso. 2013. Translating related words to videos
and back through latent topics. In Proc. of WSDM.
Mark Everingham, Luc Van Gool, Christopher K. I.
Williams, John Winn, and Andrew Zisserman. 2009.
The pascal visual object classes (voc) challenge. In-
ternational Journal of Computer Vision, 88:303–
308. Printed version publication date: June 2010.
Bjorn Frohlich, Erik Rodner, and Joachim Denzler.
2010. A fast approach for pixelwise labeling of fa-
cade images. In Proceedings of the 2010 20th Inter-
national Conference on Pattern Recognition, ICPR
’10, page 3029–3032, USA. IEEE Computer Soci-
ety.
Claire Gardent, Anastasia Shimorina, Shashi Narayan,
and Laura Perez-Beltrachini. 2017a. Creating train-
ing corpora for micro-planners. In Proceedings
of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
Vancouver, Canada. Association for Computational
Linguistics.
Claire Gardent, Anastasia Shimorina, Shashi Narayan,
and Laura Perez-Beltrachini. 2017b. The WebNLG
challenge: Generating text from RDF data. In Pro-
ceedings of the 10th International Conference on
Natural Language Generation, pages 124–133.
Albert Gatt and Emiel Krahmer. 2018. Survey of the
state of the art in natural language generation: Core
tasks, applications and evaluation. Journal of Artifi-
cial Intelligence Research, 61:65–170.
Agrim Gupta, Piotr Dollár, and Ross Girshick. 2019.
Lvis: A dataset for large vocabulary instance seg-
mentation.
Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratud-
din, and Hamid Laga. 2019. A comprehensive sur-
vey of deep learning for image captioning. ACM
Computing Surveys, 51(6).
Qian Huang, Zhu Liu, Aaron Rosenberg, David Gib-
bon, and Behzad Shahraray. 1999. Automated gen-
eration of news content hierarchy by integrating au-
dio, video, and text information. In 1999 IEEE In-
ternational Conference on Acoustics, Speech, and
Signal Processing. Proceedings. ICASSP99 (Cat. No.
99CH36258), volume 6, pages 3025–3028. IEEE.
Yanti Idaya Aspura and Shahrul Azman. 2017. Seman-
tic text-based image retrieval with multi-modality
ontology and dbpedia. The Electronic Library.
Filip Korč and Wolfgang Förstner. 2009. eTRIMS Im-
age Database for interpreting images of man-made
scenes. Technical Report TR-IGG-P-2009-01.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James
Hays, Pietro Perona, Deva Ramanan, Piotr Dollár,
and C. Lawrence Zitnick. 2014. Microsoft coco:
Common objects in context. In Computer Vision –
ECCV 2014, pages 740–755, Cham. Springer Inter-
national Publishing.
Susan W McRoy, Songsak Channarukul, and Syed S
Ali. 2003. An augmented template-based approach
to text realization. Natural Language Engineering,
9(4):381.
Simon Mille, Anja Belz, Bernd Bohnet, and Leo Wan-
ner. 2018. Underspecified Universal Dependency
Structures as Inputs for Multilingual Surface Real-
isation. In Proceedings of the 11th International
Conference on Natural Language Generation, pages
199–209, Tilburg, The Netherlands.
Simon Mille, Stamatia Dasiopoulou, Beatriz Fisas, and
Leo Wanner. 2019. Teaching FORGe to verbalize
DBpedia properties in Spanish. In Proceedings of
the 12th International Conference on Natural Lan-
guage Generation, pages 473–483, Tokyo, Japan.
Association for Computational Linguistics.
Neha Nayak, Dilek Hakkani-Tür, Marilyn A Walker,
and Larry P Heck. 2017. To plan or not to plan?
discourse planning in slot-value informed sequence
to sequence models for language generation. In
Proceedings of INTERSPEECH, pages 3339–3343,
Stockholm, Sweden.
Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo,
and Peter Kontschieder. 2017. The mapillary vistas
dataset for semantic understanding of street scenes.
In Proceedings of the IEEE International Confer-
ence on Computer Vision (ICCV).
Mick O’Donnell, Chris Mellish, Jon Oberlander, and
Alistair Knott. 2001. Ilex: an architecture for a dy-
namic hypertext generation system. Natural Lan-
guage Engineering, 7(3):225.
Karen Simonyan and Andrew Zisserman. 2014. Very
deep convolutional networks for large-scale image
recognition.
Radim Tylecek and Radim Šára. 2013. Spatial pat-
tern templates for recognition of objects with regular
structure. In Pattern Recognition, Lecture Notes in
Computer Science, pages 364–374. Springer Berlin
Heidelberg.
Sebastian Varges and Chris Mellish. 2001. Instance-
based natural language generation. In Second Meet-
ing of the North American Chapter of the Associa-
tion for Computational Linguistics.
Leo Wanner, Bernd Bohnet, Nadjet Bouayad-Agha,
Francois Lareau, and Daniel Nicklaß. 2010. MAR-
QUIS: Generation of user-tailored multilingual air
quality bulletins. Applied Artificial Intelligence,
24(10):914–952.
Ran Xu, Caiming Xiong, Wei Chen, and Jason J Corso.
2015. Jointly modeling deep video and composi-
tional text to bridge vision and language in a unified
framework. In Twenty-Ninth AAAI Conference on
Artificial Intelligence.
Xu Z., Tao D., Zhang Y., Wu J., and Tsoi A.C. 2014.
Architectural style classification using multinomial
latent logistic regression. In Computer Vision –
ECCV 2014. Lecture Notes in Computer Science, vol
8689, Cham. Springer International Publishing.
Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude
Oliva, and Antonio Torralba. 2018. Places: A 10
million image database for scene recognition. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 40(6):1452–1464.
Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao,
Sanja Fidler, Adela Barriuso, and Antonio Torralba.
2016. Semantic understanding of scenes through the
ade20k dataset.
A Appendices
A.1 List of buildings in the datasets
Tables 5 and 6 show the buildings used for devel-
opment and evaluation respectively.
Alhambra Arc de Triomphe
Belém Tower Blue Church Bratislava
Borobudur Temple Bran Castle
Brandenburg Gate Bratislava Castle
Buckingham Palace Buda Castle
Burj Al Arab CN Tower
Canton Tower Casa Batlló
Casa Milà Castel Sant’Angelo
Catherine Palace Chichen Itza
Chrysler Building Château Frontenac
Château de Chenonceau Cologne Cathedral
Colosseum Dome of the Rock
Dresden Frauenkirche Edinburgh Castle
Eiffel Tower Elbphilharmonie
Empire State Building Faisal Mosque
Fallingwater Fisherman’s Bastion
Florence Cathedral Forbidden City
Gatchina Palace Giza Pyramids
Grand Place Brussels Harpa Concert Hall
Helsinki Cathedral Heydar Aliyev Center
Himeji Castle Jin Mao Tower
Kiev Pechersk Lavra Knossos Palace
Konark Sun Temple Kronborg Castle
Lincoln Center Lincoln Memorial
Lloyd’s Building London Eye
Madrid Palace Marina Bay Sands
Metropolitan Cathedral of Brasília Milan Cathedral
Mosque of Córdoba Musée d’Orsay
Niterói Contemporary Art Museum Notre Dame
Odeon of Herodes Atticus One World Trade Center
Oriental Pearl Tower Palace of Versailles
Peles Castle Pena Palace
Petra Jordan Porta Nigra
Potala Palace Prague Castle
Rouen Cathedral Royal Liver Building Liverpool
Royal Observatory (Greenwich) Sacré-Coeur
Sagrada Familia Cathedral Space Needle
St. Basil’s Cathedral Statue of Liberty
Stonehenge Sultan Ahmed Mosque
Taipei 101 Taj Mahal
Tech Tower The Atomium
The Cristo Rei The Flatiron Building
The Gherkin The Guggenheim New York
The Lotus Temple The Pantheon Rome
The Shard The Sistine Chapel
The Temple of Olympian Zeus (Athens) The White House
Tokyo Skytree Tokyo Tower
Tower of Pisa Villa Savoye
Wembley Stadium Westminster Abbey
Wilanów Palace Windsor Castle
Wuppertal Schwebebahn
Table 5: The 101 buildings used for development
Angkor Wat Big Ben
Burj Khalifa Camp Nou
Christ the Redeemer Dancing House Prague
Guggenheim Museum Bilbao Hagia Sophia
Hungarian Parliament Building Kremlin
Louvre Machu Picchu
Neuschwanstein Castle Parthenon
Peterhof Palace Petronas Towers
Sydney Opera House Walt Disney Concert Hall
White Tower Thessaloniki
Table 6: The 19 buildings used for the evaluation
A.2 List of retrieved DBpedia properties
Table 7 lists the 39 features used for generation,
roughly grouped by topic, and their correspondence
Features (count) Properties
building type (54)
dbo:buildingType, dbo:type, dbp:buildingType,
dbp:type, dbp:architecturalType,
dbp:architectureType, dbp:category,
hypernym (89) http://purl.org/linguistics/gold/hypernym
architectural style
dbo:architecturalStyle,
dbp:architecturalStyle,
(49) dbp:architectureStyle,
dbp:style, dbp:architecture
architect (61)
dbo:architect, dbo:builder, dbp:architect,
dbp:author, dbp:builder, dbp:engineer,
dbp:foundedBy, dbp:renArchitect,
dbp:renOthDesigners
architecture firm (2) dbp:architectureFirm
sculptor (1) dbo:sculptor, dbp:sculptor
other name (17)
dbo:synonym, dbp:alternateName,
dbp:alternateNames, dbp:designation1Offname,
dbp:designation2Offname,
dbp:nativeName, dbp:otherName
former name (2) dbo:formerName
completion date (45)
dbo:buildingEndDate, dbp:built,
dbp:completionDate, dbp:completedDate,
dbp:dateComplete, dbp:dateConstructionEnds,
dbp:established, dbp:founded,
dbp:used, dbp:yearCompleted
construction start date
dbo:buildingStartDate, dbo:yearOfConstruction,
dbp:beginningDate, dbp:brokeGround, dbp:date,
(37) dbp:dateConstructionBegins, dbp:groundbreaking,
dbp:startDate, dbp:yearsBuilt
demolition date (1) dbp:demolished
extension date (1) dbp:extension
restoration date (4) dbp:restored, dbp:dateRenovated,
dbp:renovationDate
UNESCO designation dbp:year, dbp:whsYear,
date (12) dbp:designation1Date
location (86) dbo:location, dbp:location
country (27) dbo:country, dbp:country, dbp:locationCountry,
dbp:state, dbp:stateParty
culture (2) dbp:cultures
bell count (4) dbp:bells
dome count (2) dbp:domeQuantity
elevator count (16) dbp:elevatorCount
floor count (18) dbo:floorCount
minaret count (3) dbp:minaretQuantity
room count (2) dbp:roomCount, dbp:rooms
spire count (6) dbp:spireQuantity
step count (1) dbp:stepCount
suite count (1) dbp:suites
tower count (3) dbp:towerQuantity
cost (17) dbo:cost, dbp:cost, dbp:constructionCost
elevation (1) dbo:elevation
floor area (11) dbo:floorArea
height (9) dbo:height, dbp:height
seating capacity (10) dbp:capacity, dbp:seatingCapacity, dbp:garrison
building confused owl:differentFrom
with (2)
facade direction (3) dbp:facadeDirection
highest building dbp:highestStart
start date (7)
highest building dbp:highestEnd
end date (7)
highest building dbp:highestRegion
region (1)
construction dbo:material, dbp:material,
material (8) dbp:materials
structural system (2) dbp:structuralSystem
Table 7: List of retrieved features from DBpedia, and
number of occurrences in the development set (in grey,
properties already covered by the base generator)
with the 98 properties from DBpedia. In parenthe-
ses, the number of times each property had one
or more value(s) for a building. There can be two
reasons why there is more than one value for a fea-
ture: (i) one property is given more than one value,
and/or (ii) multiple properties have one value each.
Alley alcove amphitheater
amusement park apartment building outdoor aqueduct
Arch archaelogical excavation atrium public
Attic auditorium balcony exterior
balcony interior ball Bar
barn barrier–curb barrier–fence
barrier–guard-rail barrier–wall bathroom
bazaar indoor bazaar outdoor beach house
bedroom berth bow window indoor
box building facade cafeteria
campus car case
castle catacomb cemetery
chalet chest of drawers children room
church indoor church outdoor classroom
cloister computerroom concert hall
cornice corridor cottage
courthouse courtyard crosswalk
deco department store dining hall
dining room discotheque doorway outdoor
downtown driveway eiffel-tower
elevator embassy engine room
entrance hall escalator indoor excavation
fabric store façade/wall farm
fire escape fire station fireplace
flat–bike-lane flat–crosswalk-plain flat–curb-cut
flat–parking flat–pedestrian-area flat–rail-track
flat–road flat–sidewalk food court
formal garden gameroom garage indoor
garage outdoor gazebo exterior general store indoor
general store outdoor golden-gate-bridge greenhouse indoor
greenhouse outdoor gymnasium indoor home office
home theater hospital hotel outdoor
hotel room house hunting lodge outdoor
igloo indoor industrial area
inn outdoor kasbah kindergarden classroom
kitchen library indoor library outdoor
living room lighthouse mansion
lobby mausoleum manufactured home
market indoor market outdoor meeting room
mirror mosque outdoor motel
movie theater indoor oast house museum indoor
museum outdoor nursery office
office building palace pagoda
painting pantry park
person parking lot patio
pavilion pier playground
pool/inside pub indoor pyramid
restaurant restaurant kitchen restaurant patio
River rock arch rope bridge
Ruin Schoolhouse sculpture
Shed Shopfront shopping mall indoor
sill ski resort Skyscraper
smokestack Stable stained-glass
staircase structure–bridge structure–building
structure–tunnel swimming pool indoor swimming pool outdoor
synagogue outdoor temple asia throne room
Tower tower-pisa train station platform
tree house Village water tower
waterfall wind farm windmill
window youth hostel zen garden
Table 8: List of classes supported by the object detec-
tion module
A.3 Details on the visuals analysis
List of extracted visual features.
Table 8 shows
the list of all features extracted from images.
Training of models.
The training settings of each
component’s model involve a batch size of 2, a
learning rate of 0.0001, a momentum value equal
to 0.9, a weight decay of 0.0005 and weights initial-
isation as described in the section above. For the
architectural style recognition task (see the confu-
sion matrix in Figure 7), the experiments involved
Stochastic Gradient Descent and Adam as opti-
misers. Different epochs, batch size and learning
rates were tested. Finally a VGG19 model was
trained for 130 epochs. The training includes 3-fold
cross validation, and SGD optimiser of learning
rate equal to 0.001. All trainings and evaluations
were conducted on a 1080Ti GPU.
A.4 Sample output texts
Tables 9, 10, 11 and 12 show sample texts for a few
buildings; the parts of the text that come from the
textual and visual analysis are shown in bold, and
incorrect content is shown in red.
Wikipedia (human)
The Burj Khalifa, known as the Burj Dubai prior to its in-
auguration in 2010, is a skyscraper in Dubai, United Arab
Emirates. With a total height of 829.8 m (2,722 ft, just over
half a mile) and a roof height (excluding antenna, but includ-
ing a 244 m spire) of 828 m (2,717 ft), the Burj Khalifa has
been the tallest structure and building in the world since its
topping out in 2009 (preceded by Taipei 101).
DBpedia
Burj Khalifa, which Adrian Smith (architect) designed, is a
Mixed-use in Dubai. It costed 1,500,000,000$. It has 2 floors
and 57 elevators and a floor area of 309,473m2. It was the
highest building in the world. Burj Khalifa, the architectural
style of which is Neo-futurism, was built between 6 January
2004 and 31 December 2009. It was built of glass, steel,
aluminium, reinforced concrete. It was formerly called Burj
Dubai.
Fused
Burj Khalifa, which Adrian Smith (architect) designed, is a
skyscraper in a downtown environment in Dubai. It costed
1,500,000,000$. It has 2 floors and 57 elevators and a floor
area of 309,473m2. It has similarities with a tower and a
train station. It was the highest building in the world. Burj
Khalifa, the architectural style of which is Deconstructu-
vism, was built between 6 January 2004 and 31 December
2009. It was built of glass, steel, aluminium, reinforced con-
crete. It was formerly called Burj Dubai.

Table 9: Burj Khalifa (eval set)
Wikipedia (human)
Christ the Redeemer is an Art Deco statue of Jesus Christ
in Rio de Janeiro, Brazil, created by French sculptor Paul
Landowski and built by Brazilian engineer Heitor da Silva
Costa, in collaboration with French engineer Albert Caquot.
Romanian sculptor Gheorghe Leonida fashioned the face.
Constructed between 1922 and 1931, the statue is 30 metres
(98 ft) high, excluding its 8-metre (26 ft) pedestal. The arms
stretch 28 metres (92 ft) wide.
DBpedia
Christ the Redeemer (statue), which was built of Soapstone,
is a Statue in Brazil.
Fused
Christ the Redeemer (statue), which was built of Soapstone, is
a statue in a zen garden environment in Brazil. Its archi-
tectural style is Hellinistic. Christ the Redeemer (statue)
has similarities with a windmill and a beach house. There
is an elevator shaft in it.

Table 10: Christ the Redeemer (eval set)
Figure 7: Confusion matrix of the architectural style recognition model
Wikipedia (human)
The Dancing House, or Fred and Ginger, is the nickname given to the Nationale-Nederlanden building on the Rašínovo nábřeží (Rašín Embankment) in Prague, Czech Republic. It was designed by the Croatian-Czech architect Vlado Milunić in cooperation with Canadian-American architect Frank Gehry on a vacant riverfront plot. The building was designed in 1992 and was completed four years later in 1996.
DBpedia
Dancing House, which Frank Gehry designed, is a Nickname in CzechRepublic (Prague). It was built between 1992 and 1996. It was formerly called Fred and Ginger.
Fused
Dancing House, which Frank Gehry designed, is a nickname in CzechRepublic (Prague). It has similarities with an embassy, a palace and a parking garage. A fire escape can be seen on its facade. Dancing House, the architectural style of which is Art Nouveau, was built between 1992 and 1996. It was formerly called Fred and Ginger.
Table 11: Dancing house (eval set)
Wikipedia (human)
The Sydney Opera House is a multi-venue performing arts centre at Sydney Harbour in Sydney, New South Wales, Australia. It is one of the 20th century's most famous and distinctive buildings.
DBpedia
Sydney Opera House, which Jørn Utzon designed, is a Performing arts center in Sydney. Sydney Opera House, the architectural style of which is Expressionist architecture, was built between 1 March 1959 and 1973. Its structure is made of Concrete frame & precast concrete ribbed roof.
Fused
Sydney Opera House, which Jørn Utzon designed, is a centre in Sydney. Its structure is made of Concrete frame & precast concrete ribbed roof. An element of the structure is like a bridge. Sydney Opera House has similarities with a beach house, an amusement park and a museum. Sydney Opera House, the architectural style of which is Deconstructuvism, was built between 1 March 1959 and 1973.
Table 12: Sydney Opera house (eval set)