EXPLOITING DOMAIN KNOWLEDGE IN
PREDICTIVE LEARNING FROM FOOD AND
NUTRITION DATA
Gordana Ispirova
Doctoral Dissertation
Jožef Stefan International Postgraduate School
Ljubljana, Slovenia
Supervisor: Prof. Dr. Barbara Koroušić Seljak, Computer Systems Department, Jožef
Stefan Institute, Ljubljana, Slovenia
Co-Supervisor: Assist. Prof. Dr. Tome Eftimov, Computer Systems Department, Jožef
Stefan Institute, Ljubljana, Slovenia
Evaluation Board:
Prof. Dr. Sašo Džeroski, Chair, Computer Systems Department, Jožef Stefan Institute,
Ljubljana, Slovenia
Prof. Dr. Zoran Bosnić, Member, Faculty of Computer and Information Science, University
of Ljubljana, Ljubljana, Slovenia
Assoc. Prof. Dr. Riste Stojanov, Member, Faculty of Computer Science and Engineering,
Ss. Cyril and Methodius University, Skopje, North Macedonia
Gordana Ispirova
EXPLOITING DOMAIN KNOWLEDGE IN PREDICTIVE
LEARNING FROM FOOD AND NUTRITION DATA
Doctoral Dissertation
IZKORIŠČANJE DOMENSKEGA ZNANJA PRI NAPOVEDNEM
UČENJU IZ PODATKOV O ŽIVILIH IN PREHRANI
Doktorska disertacija
Supervisor: Prof. Dr. Barbara Koroušić Seljak
Co-Supervisor: Assist. Prof. Dr. Tome Eftimov
Ljubljana, Slovenia, November 2022
To my family.
Acknowledgments
Getting a PhD was never a chapter in the life that I imagined I would have, but, as it
turns out, you cannot skip chapters; that is not how life works. You have to read every
line. Meet every character. You will not enjoy all of it. Some chapters will make you cry.
Some will make you smile. Some will bring you to the verge of giving up. Some will show
you endless opportunities. In some chapters, you will read things you do not want to read.
In some you will not want the pages to end. My PhD was all of these chapters for me; it
was a chapter that could not be skipped.
If there is one person who showed me that this is possible, it is my supervisor, Prof. Dr.
Barbara Koroušić Seljak. I would like to express my heartfelt gratitude and appreciation
to her for never failing me, for allowing this to be my own work, for helping me in every
step of the way, for showing me how to be patient and forgiving and truly a good person,
as I could not have had a better example of it than her.
I would also like to express my sincere gratitude to the members of the evaluation board:
Prof. Dr. Sašo Džeroski, Prof. Dr. Zoran Bosnić, and Assoc. Prof. Dr. Riste Stojanov
for their constructive feedback on this work.
Achieving a goal requires people who bring passion and positivity into your life, people
who understand what you are going through and share your burdens, people who give you
courage and laughter, and people who give you unconditional love. I am fortunate to have
them all. First, a thank you from the bottom of my heart goes to my co-supervisor and
friend – Assist. Prof. Dr. Tome Eftimov for being the person who can change my mood
immediately with his positivity and passion for science, for making all of this happen, for
believing in me when I did not, for showing me how to be humble and reach for the stars
at the same time, and for teaching me one big lesson: "If you were not ready, you would
not have the opportunity, and if you were not capable, you would not have the desire."
Tough times require a reminder that we are not alone. I had incredible girls to share my
struggles with and find ways to cope with the stressful times of a PhD journey. This thank
you goes to Ana Kostovska, Nataša Malinova and Milka Ljoncheva. Tough times also
require encouragement and a reminder to laugh. This thank you goes to my best friends
Georgi Kostov and Gjorgi Peev – thanks for always pushing me towards better things.
We are nothing without love and family. Thanks to my four grandparents, my four aunts,
and my four uncles for always believing in me and making me feel loved and appreciated all
the time. Thanks to my cousins, all nine of them, for their true love and trust, their
contagious smiles, and their never-failing ability to cheer me up. My very profound gratitude
goes also to my parents, Betka and Panče, and my brother, Igor, for their unconditional
love and support in everything I do. I love you infinitely!
Lastly, thanks to my love, Martin, for finding me when I was lost, for enduring with me
in the last year of my PhD, for allowing me to see myself with different eyes, and showing
me that the best way to predict the future is to create it.
In the end, a reminder that persistence makes all the difference, so the last thank you goes
to ME: past ME, present ME and future ME, none of whom is any the wiser.
Abstract
Human knowledge about food and nutrition has evolved drastically over time. With food-
and nutrition-related data being mass-produced and easily accessible, the next step is to
use Artificial Intelligence (AI) to translate data into knowledge. The majority of AI
research is model-driven: classical Machine Learning (ML) pipelines take a model-centric
approach, prioritizing training the best model for a specific task, with the main focus on
improving model parameters, while overlooking the importance of data.
We propose a novel ML pipeline that fuses data-driven and domain-driven knowledge for a
predictive task from the Food and Nutrition domain: fast prediction of nutrient values from
unstructured recipe text. The proposed pipeline consists of three parts: representation
learning (RL), unsupervised ML, and supervised ML. In the RL part, word and paragraph
embeddings are learned for short text descriptions of foods (recipe titles). In the
unsupervised ML part, the recipes are separated into clusters based on a domain-specific
coding (the FoodEx2 classification) from an external domain resource. In the supervised ML
part, the two parts are combined: separate predictive models are trained for each cluster
and each nutrient, using the learned embeddings as input features. The pipeline is evaluated
with criteria defined using domain knowledge (nutrient tolerance levels) and compared
to baselines calculated using the same criteria.
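The tolerance-based evaluation described above can be illustrated with a minimal sketch. It assumes a single absolute tolerance per nutrient (`abs_tol`, a hypothetical placeholder rather than the tolerance levels used in the thesis) and baselines that predict each cluster's own mean or median. All function names and the toy numbers are illustrative.

```python
from statistics import mean, median


def tolerance_accuracy(actuals, predictions, abs_tol):
    """Percentage of predictions that fall within +/- abs_tol of the actual value."""
    hits = sum(abs(a - p) <= abs_tol for a, p in zip(actuals, predictions))
    return 100.0 * hits / len(actuals)


def per_cluster_report(clusters, abs_tol):
    """Compare model predictions with mean/median baselines, cluster by cluster.

    clusters maps a cluster id to a pair (actual values, model predictions).
    Assumption: each baseline constant is computed on the same values it is
    scored on, which a real experiment would replace with training-set values.
    """
    report = {}
    for cluster_id, (actuals, predictions) in clusters.items():
        n = len(actuals)
        report[cluster_id] = {
            "model": tolerance_accuracy(actuals, predictions, abs_tol),
            "baseline_mean": tolerance_accuracy(actuals, [mean(actuals)] * n, abs_tol),
            "baseline_median": tolerance_accuracy(actuals, [median(actuals)] * n, abs_tol),
        }
    return report


# Toy data: nutrient values (g per 100 g) for one hypothetical cluster.
toy_clusters = {"A": ([10.0, 12.0, 30.0, 11.0], [11.0, 12.5, 29.0, 10.0])}
report = per_cluster_report(toy_clusters, abs_tol=2.0)
print(report)
```

Scoring the model and the baselines with the same tolerance criterion keeps the comparison on the scale that the domain knowledge defines, rather than on a generic error metric.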
Because the evaluation results showed that including domain knowledge in the unsupervised
ML part improved the results compared to the baseline, we propose an alteration of the
ML pipeline: we include two different external sources of domain knowledge for clustering
in the unsupervised ML part, to explore the domain bias for the same prediction task.
To further improve the ML pipeline, we include domain knowledge in the RL part of the
pipeline. Instead of obtaining recipe title embeddings, we introduce a domain heuristic
for merging the embeddings of the ingredients of the recipe. This proved to be a successful
way to train well-performing models for predicting nutrient values, as the
accuracies obtained were significantly higher than the baseline.
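The abstract does not spell out the merging heuristic. Purely as an illustration, the sketch below contrasts a plain average of ingredient embeddings with a quantity-weighted average. Weighting by ingredient quantity is an assumption made here for the example, and the function names, vectors and gram amounts are hypothetical.

```python
def merge_average(vectors):
    """Merge ingredient embeddings into one recipe embedding by a plain average."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def merge_weighted(vectors, grams):
    """Merge ingredient embeddings weighted by ingredient quantity (assumed heuristic)."""
    total = sum(grams)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, grams)) / total for i in range(dim)]


# Two hypothetical 2-dimensional ingredient embeddings and their quantities in grams.
ingredient_vectors = [[1.0, 0.0], [0.0, 1.0]]
quantities = [300, 100]

plain = merge_average(ingredient_vectors)
weighted = merge_weighted(ingredient_vectors, quantities)
print(plain, weighted)
```

Under this assumption, two recipes with the same ingredient list but different quantities receive different vectors, whereas the plain average maps them to the same point.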
As the domain-specific embeddings proved to perform well, we composed two predefined
corpora of embeddings, one of ingredient embeddings and one of recipe embeddings, from six
heterogeneous multilingual recipe datasets. The datasets were normalized using dictionary-
and rule-based Named Entity Recognition and mapped to a Food Composition Database.
Training embeddings tailored to a specific task is a very time-consuming process;
these corpora of predefined embeddings can therefore be used for research purposes as well
as transferred to other tasks for application purposes.
To explore the major impact data has on model performance, we focused on the generalization
of predictive models by defining a generalizability index that indicates how far a
predictive model learned on one dataset can be trusted when transferred to another. Going a
step further to show the importance of data in predictive modeling, we present different
ways of selecting a representative training dataset; the results show that different
selections of the training dataset produce different outcomes. The training data should be
representative of the data expected in deployment, covering all variations that deployment
data will present.
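One way of selecting a representative training dataset can be sketched as follows, assuming the recipes are already embedded and clustered: keep only the instances whose cosine distance to their cluster centroid is at most a threshold `eps`. The function names, the toy vectors and the threshold value are illustrative assumptions, not the settings used in the thesis.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def select_near_centroid(clusters, eps):
    """Keep, per cluster, the instances within cosine distance eps of the centroid."""
    selected = {}
    for cluster_id, vectors in clusters.items():
        center = centroid(vectors)
        selected[cluster_id] = [v for v in vectors if cosine_distance(v, center) <= eps]
    return selected


# One hypothetical cluster of 2-dimensional recipe embeddings; the last one is an outlier.
toy = {"a": [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]}
training_subset = select_near_centroid(toy, eps=0.2)
print(training_subset)
```

A smaller `eps` yields a tighter but smaller training set, so the threshold trades coverage of the variation in the data against closeness to the cluster centers.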
Povzetek (Slovene Abstract)
Expert knowledge about food and nutrition has increased drastically in recent times.
Artificial intelligence (AI) makes it possible to extend this knowledge further with
knowledge extracted (semi-)automatically from mass-collected food and nutrition data, which
are relatively easy to access. However, research in the field of AI often neglects the
importance of data and is more oriented towards modeling itself. The classical development
of machine learning (ML) pipelines focuses on learning the best possible models for selected
tasks and on optimizing the parameters of the selected models. In this doctoral dissertation,
we present a new methodology for developing an ML pipeline which, by fusing expert knowledge
with knowledge obtained from data, enables fast prediction of nutrient values from recipes
written in unstructured text form, a demanding task in the field of food and nutrition. The
proposed pipeline is based on representation learning (RL), unsupervised ML, and supervised
ML. RL enables the initial learning of embeddings of the words and paragraphs that describe
foods or recipe titles. Unsupervised ML distributes the embeddings of the recipes under
consideration into clusters, taking into account domain knowledge about the classification
of foods and dishes according to the standardized FoodEx2 system. Supervised ML is intended
for learning predictive models for each cluster separately, taking into account individual
nutrients. We evaluated the pipeline by comparing a criterion based on domain knowledge
(i.e., nutrient tolerance levels) with the baseline. Since it turned out that taking domain
knowledge into account in unsupervised ML improves the prediction results, we proposed an
upgrade of the ML pipeline. In order to study possible domain bias, we included two different
external sources of domain knowledge for clustering in the unsupervised ML. We also included
domain knowledge in RL and thereby further increased the effectiveness of the ML pipeline.
Thus, instead of recipe title embeddings, we introduce a domain heuristic for merging the
embeddings of the individual ingredients of a recipe. This proved to be a successful way of
effectively learning models for predicting nutrient values, as the accuracy was significantly
higher than the baseline values. Since the domain-specific embeddings proved to be
high-performing in the process of data normalization and mapping, we built two separate
corpora of predefined embeddings, one with ingredient embeddings and the other with recipe
embeddings. Data normalization was based on named entity recognition using a dictionary and
rules, while the data from six datasets of international recipes were mapped to food
composition data. Learning embeddings tailored to a specific task is a time-consuming
process, so the two corpora of predefined embeddings can be used for research purposes, and
they can also be transferred to other applied tasks. To study the major impact of data on
model performance, we focused on the problem of generalizing predictive models. We
introduced a generalizability index, which estimates the level of trust in transferring a
predictive model obtained on one dataset for use on another. In the next step, we
investigated the importance of data in predictive modeling. It turned out that the choice of
the training dataset has an important influence on the final result. Above all, it is
important that the training data are sufficiently representative and cover as much
variability as is expected in final use.
Contents
List of Figures
List of Tables
List of Algorithms
Abbreviations
Glossary
1 Introduction
1.1 State-of-the-art and Thesis Ambitions
1.1.1 State-of-the-art in Textual Representations
1.1.2 Thesis ambition in textual representations
1.1.3 State-of-the-art in learning predictive models
1.1.4 Thesis ambition in learning predictive models
1.2 Purpose of the Thesis
1.3 Goals of the Thesis
1.4 Hypothesis
1.5 Scientific Contributions
1.6 Methodology
1.7 Structure of the Thesis
2 Background
2.1 Food- and Nutrition-related data
2.1.1 FoodEx2 classification system
2.1.2 StandFood
2.2 Representation Learning
2.2.1 Word embeddings
2.2.2 Paragraph embeddings
2.2.3 Graph-based representation learning
2.3 Machine Learning
2.3.1 Supervised machine learning
2.3.2 Unsupervised machine learning
3 Machine-Learning Pipeline for Integrating Domain and Data Driven Knowledge
3.1 Problem Definition
3.2 Related Work
3.3 Methodology
3.3.1 Tolerance levels for nutrient values
3.3.2 Representation learning
3.3.3 Unsupervised machine learning
3.3.4 Supervised machine learning
3.4 Results and Discussion
3.4.1 Data
3.4.2 Data pre-processing
3.4.3 Experimental setup
3.4.4 Evaluation outcomes
3.4.5 Discussion
4 Exploring Knowledge Domain Bias on a Prediction Task
4.1 Problem Definition
4.2 Related Work
4.3 Methodology
4.3.1 Domain knowledge criteria
4.3.2 Tolerance for nutrient values
4.4 Results and Discussion
4.4.1 Data
4.4.2 Data pre-processing
4.4.3 Experimental setup
4.4.4 Evaluation outcomes
4.4.5 Discussion
5 Domain Heuristic Fusion of Multi-word Embeddings for Nutrient Value Prediction
5.1 Problem Definition
5.2 Related Work
5.3 Methodology
5.3.1 Domain-specific embeddings
5.4 Results and Discussion
5.4.1 Data
5.4.2 Data pre-processing
5.4.3 Experimental setup
5.4.4 Evaluation outcomes
5.4.5 Discussion
6 Predefined Domain-Specific Embeddings of Food Concepts and Recipes: A Case Study on Heterogeneous Recipe Datasets
6.1 Problem Definition
6.2 Related Work
6.2.1 USDA Food Composition Database
6.2.2 Domain dictionaries
6.2.2.1 Units of measurement dictionary
6.2.2.2 Redundant words dictionary
6.2.2.3 Branded foods dictionary
6.2.2.4 Conversion dictionary
6.3 Methodology
6.3.1 Data
6.3.2 Data normalization
6.3.2.1 Extracting information from unstructured recipe data
6.3.2.2 Data mapping to USDA FCDB
6.4 Results and Discussion
6.4.1 Generating the predefined corpus of ingredient embeddings
6.4.2 Generating the predefined corpus of recipe embeddings
6.4.3 Predictive modeling
6.4.4 Discussion
7 Generalizing the Knowledge Learned by Predictive Modeling on Heterogeneous Recipe Data
7.1 Problem Definition
7.2 Related Work
7.3 Methodology
7.4 Results and Discussion
7.4.1 Discussion
8 Conclusions
8.1 Research Hypotheses and Their Confirmation
8.2 Final Conclusions and Future Work
Appendix A Results from Chapter 6
Appendix B Results from Chapter 7
References
Bibliography
Biography
List of Figures
Figure 1.1: Flowchart of a classical ML pipeline.
Figure 3.1: Flowchart of the P-NUT methodology.
Figure 3.2: Best prediction accuracies for carbohydrates predictions obtained from
the embeddings for the English names and Slovene names for each cluster
compared to the baseline mean and median for the particular cluster.
Figure 3.3: Best prediction accuracies for fat predictions obtained from the
embeddings for the English names and Slovene names for each cluster
compared to the baseline mean and median for the particular cluster.
Figure 3.4: Best prediction accuracies for protein predictions obtained from the
embeddings for the English names and Slovene names for each cluster
compared to the baseline mean and median for the particular cluster.
Figure 3.5: Best prediction accuracies for water predictions obtained from the
embeddings for the English names and Slovene names for each cluster
compared to the baseline mean and median for the particular cluster.
Figure 3.6: Best prediction accuracies for each nutrient obtained from the
embeddings for the English and Slovene names compared to the baseline mean
and median from the whole dataset.
Figure 4.1: Inferring domain knowledge in P-NUT.
Figure 4.2: Highest accuracy percentages obtained with the FoodEx2 clustering
method compared to the baseline mean and baseline median.
Figure 4.3: Highest accuracy percentages obtained with the FSA traffic light
clustering method compared to the baseline mean and baseline median.
Figure 4.4: Comparing the highest accuracies obtained for each nutrient prediction
with the FoodEx2 clustering and the FSA traffic light clustering.
Figure 5.2: Flowchart of the presented approach.
Figure 5.3: Visual representation of the vector representations when using the
domain-specific heuristic of a group of recipes with the same ingredients but
different nutritional values (The nutritional values of each recipe
represented in this picture are given in Table 5.4).
Figure 5.4: Visual representation of the vector representations without the
domain-specific heuristic of a group of recipes with the same ingredients.
Figure 5.5: Visual representation of the vector representations with the
domain-specific heuristic of a group of recipes with the same name.
Figure 5.6: Visual representation of the vector representations without the
domain-specific heuristic of a group of recipes with the same name.
Figure 7.1: Flowchart of the methodology.
Figure 7.2: Reduced recipe embeddings for all six datasets obtained with the Word2Vec
algorithm (architecture: CBOW, dimension: 100, sliding window: 3,
merging heuristic: average) presented separately in the same feature
space.
Figure 7.3: Reduced recipe embeddings for all six datasets obtained with the Word2Vec
algorithm (architecture: CBOW, dimension: 100, sliding window: 3,
merging heuristic: average) presented together in the same feature space.
Figure 7.4: Curve of the average silhouette for values of the number of clusters k
from 3 to 12.
Figure 7.5: Clustering into 8 clusters of the reduced embedding produced with
Word2Vec – architecture: CBOW, dimension: 100, sliding window: 3,
merging heuristic: average.
Figure 7.6: Percentage of instances of each dataset per cluster.
Figure 7.7: Number of instances from each dataset per cluster.
Figure 7.8: Distributions in the feature space of the Indian recipes dataset and the
Yummly28K dataset.
Figure 7.9: Distributions in the feature space of the Indian recipes dataset and the
Salad recipes dataset.
Figure 7.10: Distributions in the feature space of the Indian recipes, Recipe1M,
Epicurious and Recipe box datasets.
Figure 7.11: Distributions in the feature space of the Recipe1M, Epicurious and
Recipe box datasets.
Figure 7.12: Distributions in the feature space of the Recipe1M, Epicurious, Recipe
box and Yummly28K datasets.
Figure 7.13: Distributions in the feature space of the instances belonging to cluster
number 6 and the instances from Indian recipes, Recipe1M, Epicurious,
Yummly28K and Recipe box datasets.
List of Tables
Table 3.1: Tolerated differences in nutrition content in foods besides food
supplements.
Table 3.2: Subset from the dataset used in the experiments.
Table 3.3: Examples of pre-processed English descriptions.
Table 3.4: Example instances from each cluster.
Table 3.5: Accuracy percentages after k-fold cross validation on each cluster
obtained with the embeddings for the English names of the food products.
Target: C – Carbohydrates, F – Fat, P – Protein, W – Water.
The numbers shown in bold in the table represent the overall best
performance for each nutrient in the given cluster.
Table 3.6: Accuracy percentages after k-fold cross validation on each cluster
obtained with the embeddings for the Slovene names of the food products.
Target: C – Carbohydrates, F – Fat, P – Protein, W – Water.
The numbers shown in bold in the table represent the overall best
performance for each nutrient in the given cluster.
Table 3.7: Embedding and regression algorithms which yielded highest accuracies
for each nutrient prediction in each cluster. Target: C – Carbohydrates,
F – Fat, P – Protein, W – Water. Regression Algorithm: EN – Elastic
Net, R – Ridge, L – Lasso, LN – Linear Regression.
Table 3.8: Embedding and regression algorithms which yielded highest accuracies
for each nutrient prediction on the whole dataset (without clustering).
Target: C – Carbohydrates, F – Fat, P – Protein, W – Water. Regression
Algorithm: EN – Elastic Net, R – Ridge, L – Lasso, LN – Linear Regression.
Table 4.1: Tolerated differences in nutrition content in foods besides food
supplements.
Table 4.2: Dataset structure and example data instances.
Table 5.1: Example instance from the Recipe1M dataset.
Table 5.2: Nutrient values per total weight of the ingredients in the recipe given in
Table 5.1.
Table 5.3: Results from the evaluation on Recipe1M.
Table 5.4: Nutrient values for recipes with the same ingredient list, but same or
different quantities.
Table 5.5: Differences in performance space (A – Actual value, DH – Predicted value
when using the domain heuristic, No DH – Predicted values without
using the domain heuristic).
Table 6.1: Examples with redundant words and/or phrases.
Table 6.2: Example of well-structured not separated ingredient list in a recipe dataset.
Table 6.3: Examples with phrases depicting more than one ingredient.
Table 6.4: Average accuracies obtained with the embeddings merged with the
domain heuristic.
Table 7.1: Number of instances in each cluster.
Table 7.2: Generalizability matrix.
Table 7.3: Results from the predictive models trained on each one of the recipe
datasets separately and tested on the rest obtained with the Word2Vec
embeddings merged with the domain heuristic. Max is the maximum
accuracy obtained, and Average is the average accuracy obtained.
Table 7.4: Number of instances from each cluster in the generalized dataset.
Table 7.5: Results from the evaluation on the models trained when using the
generalized training dataset obtained when using the instances closest to the
centroids from each cluster.
Table 7.6: Maximum cosine distance from each cluster’s centroid to an instance
belonging to the same cluster.
Table 7.7: Number of instances from each cluster for ϵ ≤ 0.186.
Table 7.8: Results from the evaluation on the models trained when using the
generalized training dataset obtained when using the instances in a defined
ϵ neighbourhood of the centroids from each cluster.
Table A.1: Results for the Recipe1M dataset obtained with the embeddings merged
with the domain heuristic.
Table A.2: Results for the Indian recipes dataset obtained with the embeddings
merged with the domain heuristic.
Table A.3: Results for the Epicurious recipe dataset obtained with the embeddings
merged with the domain heuristic.
Table A.4: Results for the Salad recipes dataset obtained with the embeddings
merged with the domain heuristic.
Table A.5: Results for the Yummly28K recipe dataset obtained with the
embeddings merged with the domain heuristic.
Table A.6: Results for the Recipe box dataset obtained with the embeddings
merged with the domain heuristic.
Table A.7: Results for the Recipe1M dataset obtained without the domain heuristic.
Table A.8: Results for the Indian recipes dataset obtained without the domain
heuristic.
Table A.9: Results for the Epicurious dataset obtained without the domain heuristic.
Table A.10: Results for the Salad recipes dataset obtained without the domain
heuristic.
Table A.11: Results for the Yummly28K recipes dataset obtained without the
domain heuristic.
Table A.12: Results for the Recipe box dataset obtained without the domain heuristic.
Table B.1: Results from training on one recipe dataset and testing on other five
recipe datasets with embeddings merged with the domain heuristic as
features.
List of Algorithms
Algorithm 3.1: RL part of the P-NUT methodology.
Algorithm 3.2: Unsupervised learning part of the P-NUT methodology.
Algorithm 3.3: Supervised learning part of the P-NUT methodology.
Algorithm 4.1: Unsupervised learning part when using FSA clustering.
Algorithm 5.1: RL part of the methodology.
Algorithm 5.2: Representation learning part of the methodology.
Algorithm 6.1: Lexical similarity measure.
Algorithm 6.2: Pre-process food items from USDA.
Algorithm 6.3: Mapping ingredients to USDA.
Abbreviations
AI . . . Artificial Intelligence
ANNs . . . Artificial Neural Networks
CBOW . . . Continuous Bag-of-Words Model
CS . . . Computer Science
DM . . . Data Mining
DS . . . Data Science
EFSA . . . European Food Safety Authority
EHRs . . . Electronic Health Records
FCDB . . . Food Composition Database
FDA . . . Food and Drug Administration
FSA . . . Food Standards Agency
IBS . . . Irritable Bowel Syndrome
ICT . . . Information and Communication Technologies
IE . . . Information Extraction
LASSO . . . Least Absolute Shrinkage and Selection Operator
ML . . . Machine Learning
NER . . . Named Entity Recognition
NLP . . . Natural Language Processing
NNs . . . Neural Networks
P-NUT . . . Predicting Nutrient content
PAM . . . Partition Around Medoids algorithm
PCA . . . Principal Component Analysis
PV-DBOW . . . Distributed Bag of Words version of Paragraph Vector
PV-DM . . . Distributed Memory version of Paragraph Vector
RL . . . Representation Learning
SG . . . Skip-gram Model
SI . . . The International System of Units
USDA . . . United States Department of Agriculture
WHO . . . World Health Organization
Glossary
Artificial intelligence (AI) – is the ability of a digital computer or computer-controlled
robot to perform tasks commonly associated with intelligent beings.
BERT – a deep learning model in which every output element is connected to every input
element, and the weightings between them are dynamically calculated based upon their
connection.
Clustering – is an ML technique that groups the instances of an unlabeled dataset.
Cross validation – is a procedure that combines (averages) measures of fitness in
prediction to derive a more accurate estimate of model prediction performance.
Data mining – is the process of extracting and discovering patterns in large data sets
involving methods at the intersection of machine learning, statistics, and database systems.
Decision tree regression – trains a model in the structure of a tree on the observed
features of objects in order to predict continuous output values for future data.
Doc2Vec – is an NLP tool for representing documents as vectors and is a generalization
of the Word2Vec method.
Elastic Net regression – is an extension of linear regression that adds regularization
penalties to the loss function during training.
Electronic health record (EHR) – is the systematized collection of patient and popu-
lation electronically stored health information in a digital format.
EuroFIR AISBL – is a non-profit international association, which supports use of exist-
ing food composition data and future resources through cooperation and harmonization of
data quality, functionality and global standards.
European Food Safety Authority (EFSA) – is the agency of the European Union
that provides independent scientific advice and communicates on existing and emerging
risks associated with the food chain.
Food and Drug Administration (FDA) – is a federal agency of the Department of
Health and Human Services, responsible for protecting the public health by ensuring the
safety, efficacy, and security of human and veterinary drugs, biological products, and medi-
cal devices; and by ensuring the safety of our nation’s food supply, cosmetics, and products
that emit radiation.
Food composition database (FCDB) – is a detailed set of information on the nutritionally
important components of foods, providing values for energy and nutrients,
including protein, carbohydrates, fat, vitamins and minerals, and for other important food
components such as fibre.
Food Standards Agency traffic light system – is a system for indicating the status of a
variable using the red, amber or green of traffic lights.
FoodEx2 classification system – is a standardized food classification and description
system developed by EFSA to describe foods in a consistent way across different food
safety domains.
GloVe – is an unsupervised learning algorithm developed at Stanford for generating word
embeddings by aggregating a global word-word co-occurrence matrix from a corpus.
Hyperparameter tuning – is the problem of choosing a set of optimal hyperparameters
for a learning algorithm.
Hyperparameter – is a parameter whose value is used to control the learning process of
a ML algorithm.
Information extraction (IE) – is the process of extracting information from unstruc-
tured textual sources to enable finding entities as well as classifying and storing them in a
database.
International System of Units (SI) – a system of physical units (SI units) based on the
metre, kilogram, second, ampere, kelvin, candela, and mole, together with a set of prefixes
to indicate multiplication or division by a power of ten.
K-means clustering – is a method of vector quantization, originally from signal process-
ing, that aims to partition n observations into k clusters.
LASSO regression – is a regression analysis method that performs both variable selec-
tion and regularization in order to enhance the prediction accuracy and interpretability of
the resulting statistical model.
Linear regression – is a linear approach for modelling the relationship between a scalar
response and one or more explanatory variables (also known as dependent and independent
variables).
Machine learning (ML) – is defined as a discipline of AI that provides machines the
ability to automatically learn from data and past experiences to identify patterns and make
predictions with minimal human intervention.
Named entity recognition (NER) – is a sub-task of information extraction (IE) that
seeks out and categorizes specified entities in a body or bodies of texts.
Natural language processing (NLP) – is a sub-field of linguistics, computer science,
and artificial intelligence concerned with the interactions between computers and human
language, in particular how to program computers to process and analyze large amounts
of natural language data.
Neural network regression – predicts an output variable as a function of the inputs of
a neural network. The input features (independent variables) are numeric types.
Neural networks – are a subset of machine learning and are at the heart of deep learning
algorithms. Their name and structure are inspired by the human brain, mimicking the
way that biological neurons signal to one another.
Nutrient values – is the measure of a well-balanced ratio of the essential nutrients car-
bohydrates, fat, protein, minerals, and vitamins in items of food or diet concerning the
nutrient requirements of their consumer.
Nutrition label – is a label required on most packaged food in many countries, showing
what nutrients and other ingredients (to limit and get enough of) are in the food. Labels
are usually based on official nutritional rating systems.
Partition around medoids (PAM) algorithm – is an algorithm intended to find a
sequence of objects called medoids that are centrally located in clusters.
Principal Component Analysis (PCA) – is an unsupervised, non-parametric statisti-
cal technique primarily used for dimensionality reduction in machine learning.
Random forest regression – is an ensemble technique capable of performing both re-
gression and classification tasks with the use of multiple decision trees and a technique
called Bootstrap and Aggregation, commonly known as bagging.
Regression – is a statistical technique that relates a dependent variable to one or more
independent (explanatory) variables.
Regular expression (regex or regexp) – is a
string of characters that indicates a search pattern in text.
Representation learning (RL) – is a set of techniques that allows a system to automatically
discover the representations needed for feature detection or classification from raw
data.
Ridge regression – is a method of estimating the coefficients of multiple-regression mod-
els in scenarios where the independent variables are highly correlated.
Supervised machine learning – is a subcategory of ML that learns from labeled training
data to help you predict outcomes for unforeseen data.
United States Department of Agriculture (USDA) – is the federal executive depart-
ment in the USA responsible for developing and executing federal laws related to farming,
forestry, rural economic development, and food.
Unsupervised machine learning – is a subcategory of ML that is used to identify pat-
terns in data sets containing data points that are neither classified nor labeled.
Word2Vec – is a two-layer neural network that processes text by “vectorizing” words.
Chapter 1
Introduction
We live in a time of a global epidemic of obesity, diabetes, and physical inactivity, all
connected to poor dietary habits. Many other non-communicable chronic diseases, such
as high blood pressure, cardiovascular disease, some cancers [1], and bone-health diseases,
are likewise linked to poor dietary habits [2]. Dietary assessment is essential for patients
suffering from many diseases (especially diet- and nutrition-related ones) and for
professional athletes. Because meal-tracking mobile applications are so accessible, it is
also becoming an everyday habit for many individuals, whether for health, fitness, or
weight loss or gain. Obesity is rising steadily in developed western countries, which
raises public health concern about some subcategories of nutrients, specifically saturated
fats and added or free sugars. Nutritional epidemiologists are also raising concern about
micronutrients such as sodium, whose intake should be monitored in individuals suffering
from specific diseases such as osteoporosis, stomach cancer, and kidney disease, and about
dietary fiber, whose intake is critical for patients suffering from irritable bowel
syndrome (IBS). This being said, there is no denying that
nutrition has become a core factor in today’s society, and one of the clear solutions to the
global health crisis [3]–[6]. The path towards making the average human diet healthier and
environmentally sustainable is a fundamental part of the solution to numerous challenges
from the ecological, environmental, societal and economic perspectives, and awareness
of this has only just started to grow and be fully appreciated. Human understanding and
knowledge about food and nutrition is constantly evolving and has significantly improved
recently; one of the main contributors to this is data. However, the possibilities of gaining
knowledge from food and nutrition-related data are yet to be explored.
Predictive modeling of issues connected to human health has received a lot of attention
in recent decades, and a substantial amount of research has been published in this direction.
The availability of numerous biomedical vocabularies and standards, which
play a critical role in comprehending health information, as well as a huge volume of health
data, makes it possible to resolve difficulties linked to healthcare. However, concentrating
primarily on healthcare data restricts the potential advantages that artificial intelligence
(AI) and its vast capabilities might bring to our lives. As a result, the Lancet Planetary
Health journal in 2019 stated that exploring the linkages between food systems, human
health, and the environment will be the focus of future advances in our welfare and society.
Although extensive data are available, the disconnect between collecting data and
putting the information they contain to use remains a real barrier, though not an
insurmountable one. In domains such as biomedicine, which is connected to the healthcare
industry, vigorous steps have been taken towards putting the generated and available
data to use. In the wake of a situation like a global pandemic, there is a spike
in the interest in predictive analysis for improving healthcare. With a strong integration
of biomedical and healthcare data, including electronic health records, modern healthcare
organizations could revolutionize medical therapies and personalized medicine.
Different biomedical annotated corpora have been produced in recent years [7]–[11], where
the main aim is to challenge and encourage research teams to work on Natural Language
Processing (as the data mainly come in text format) and data analysis problems.
Despite the large number of accessible resources and the extensive research on applying AI
to health and biomedical tasks [12], the situation in the Food and Nutrition domain is
quite different: there is a scarcity of resources. Consequently, only a few predictive
studies use food and nutrition data to predict some targets [13], [14]. This is especially vital during
the current world situation with COVID-19, when food security, as well as healthy diet and
the environment, are critical for speedy recovery and long-term sustainability of our society
[15]. Addressing these challenges, however, requires the development of data-driven AI
methods. All of this is in line with the focus of the European Commission project calls
under the Horizon Europe funding framework, where different funding programmes highlight
the importance of using food and nutrition data together with AI methods
to address the world’s big challenges in health and make the world a better place to live.
The backbone of Data Science (DS) and Data Mining (DM) is machine learning. Machine
Learning (ML) is a part of AI whose algorithms can model data in a non-linear and
non-parametric way. This can provide us with new understandings and insights and open
new possibilities in many domains. A classical supervised learning prediction task
involves several steps: i) pre-processing data, ii) feature extraction, iii) training a
predictive model, and iv) evaluating the performance of the model (presented in
Figure 1.1). When dealing with textual data (an NLP task), the second step consists of
applying representation learning algorithms to obtain text embedding vectors. The
ultimate goal of having such a model is to use it to predict the outcomes when new data,
which has not been involved in the training process, becomes available. Using ML
algorithms to solve a specific application in some domain is a task of DM.
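The four steps listed above can be illustrated with a minimal sketch. It is not the pipeline developed in this thesis: the bag-of-words features, the 1-nearest-neighbour regressor, the function names, and the toy protein values are all assumptions chosen so that the example runs with the Python standard library alone.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Step i: lowercase the description and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def extract_features(tokens, vocabulary):
    """Step ii: turn tokens into a bag-of-words count vector."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train(texts, targets):
    """Step iii: 'fit' a 1-nearest-neighbour regressor by storing the data."""
    vocabulary = sorted({tok for t in texts for tok in preprocess(t)})
    vectors = [extract_features(preprocess(t), vocabulary) for t in texts]
    return {"vocabulary": vocabulary, "vectors": vectors, "targets": list(targets)}


def predict(model, text):
    """Return the target of the most similar training description."""
    query = extract_features(preprocess(text), model["vocabulary"])
    sims = [cosine(query, v) for v in model["vectors"]]
    return model["targets"][sims.index(max(sims))]


def mean_absolute_error(y_true, y_pred):
    """Step iv: average absolute difference between truth and prediction."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy data: (description, protein in g per 100 g); the numbers are invented.
train_texts = ["grilled chicken breast", "boiled white rice", "fried chicken wings"]
train_protein = [31.0, 2.7, 27.0]
test_texts = ["roasted chicken breast", "steamed white rice"]
test_protein = [30.0, 2.5]

model = train(train_texts, train_protein)
predictions = [predict(model, t) for t in test_texts]
mae = mean_absolute_error(test_protein, predictions)
```

The test descriptions are never seen during training, which mirrors the stated goal of predicting outcomes for new data. In practice, step ii would use learned text embeddings instead of raw counts.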
Despite the significant effort put into developing ML/DM pipelines for biomedical studies,
only a small percentage of them are put into practice [16]. One of the most difficult
challenges is generalizing knowledge learned from one domain dataset to a new dataset
whose data instances are described with the same characteristics. This means that the
ML pipelines that are developed are biased by the quality of the data used for learning the
models, and applying them to new data requires fine-tuning (i.e., making the models
adaptive). In line with this, the Food and Drug Administration (FDA) recently proposed an
adaptive AI systems framework [17], which emphasizes that using ML/DM pipelines
in healthcare predictive modeling requires updating the models over time and personalizing
them to specific health applications.
Modeling a domain-specific application is a task that requires a deep understanding
of the problem at hand, domain expertise, and domain knowledge (represented with
semantic resources and taxonomies created by the domain experts). In this thesis, our goal
is to explore how the synergism of domain and data-driven knowledge for food and
nutrition prediction tasks can improve their effectiveness. Moreover, we investigate whether
incorporating domain knowledge in different steps (more specifically, feature
extraction, model training, and model evaluation) of developing an ML/DM pipeline
improves the performance of the model. The thesis focuses on exploring this synergism
using textual data related to food and nutrition.
Figure 1.1: Flowchart of a classical ML pipeline.
1.1 State-of-the-art and Thesis Ambitions
In this section, we review the state of the art in the different steps of developing ML/DM
pipelines in healthcare predictive modeling.
1.1.1 State-of-the-art in Textual Representations
Nowadays, representation of textual data is a task of representation learning [18]. In gen-
eral, the idea behind representation learning is to learn a vector of continuous numbers (i.e.,
embeddings) which represents a text instance. It has been shown that such representations
improve the state of the art in natural language processing (NLP) compared with the
classical sparse text representations used in the 90s. The textual embeddings can be trained
on different levels (word – e.g., Word2Vec [19], [20], GloVe [21], and sentence/paragraph
– Doc2Vec [22], XLNet [23], BERT [24], etc.). In most cases, co-occurrence methods and
deep neural architectures are used to calculate them. All these methodologies are further
applied on different corpora related to a specific domain of biomedicine from which a vector
representation (i.e., an embedding) for each concept is learned and further the embeddings
are published to be reused for future predictive modeling studies. For example, by using
different biomedical text corpora, biomedical concept representations have been learned us-
ing Word2Vec methodology [25]. Further, it has been shown that learning text embeddings
on a specific biomedical dataset cannot transfer the knowledge to other predictive studies
performed on new data from other domains [26]. For this purpose, cui2vec [27] has been
proposed in the biomedical domain which applies GloVe methodology using multi-modal
health textual data (i.e., including electronic health records (EHRs), health insurance data,
full text biomedical journals, etc.). However, in 2019, one study [28] pointed out that learning
biomedical concepts using graph-based embeddings of a biomedical semantic resource
can improve the predictive performance compared to using cui2vec embeddings. The rea-
son reported is that learning representation from the semantic resource is actually learning
from domain knowledge, since all concepts and relations in the semantic resource are
determined by domain experts [28]. Furthermore, it has been shown that training BERT, the
state of the art in representation learning in NLP, on biomedical data yields BioBERT,
which improves the performance in predictive healthcare. In the Food and Nutrition domain,
there are a few smaller predefined corpora of embeddings; in [29], [30], the embeddings are
learned from the instructions of the recipes (contextualized embeddings). Other embeddings
in this direction were obtained as a byproduct of the recipe-image retrieval task [31]–[34].
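To make the co-occurrence idea mentioned above concrete, the following sketch builds raw co-occurrence count vectors for a tiny invented corpus and compares words by cosine similarity. It is only an illustration of the underlying statistic: it is not Word2Vec, GloVe, or any of the cited methods, which learn dense vectors rather than using raw counts. The corpus, the window size, and the function names are assumptions.

```python
import math


def cooccurrence_vectors(sentences, window=2):
    """For every word, count how often each vocabulary word occurs within
    `window` tokens of it. Returns {word: count vector over the vocabulary}."""
    vocab = sorted({word for sentence in sentences for word in sentence})
    position = {word: i for i, word in enumerate(vocab)}
    vectors = {word: [0.0] * len(vocab) for word in vocab}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start, stop = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][position[sentence[j]]] += 1.0
    return vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented, already tokenized recipe-like sentences.
corpus = [
    ["bake", "the", "bread", "in", "the", "oven"],
    ["bake", "the", "cake", "in", "the", "oven"],
    ["whisk", "the", "eggs", "with", "sugar"],
]

vectors = cooccurrence_vectors(corpus)
sim_bread_cake = cosine(vectors["bread"], vectors["cake"])
sim_bread_eggs = cosine(vectors["bread"], vectors["eggs"])
```

Here "bread" and "cake" appear in identical contexts and therefore receive the same count vector, while "eggs" shares only the word "the" with them, so its similarity to "bread" is lower.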
1.1.2 Thesis ambition in textual representations
The thesis will go beyond the state of the art in biomedical representations by learning
representations for food concepts, thereby filling in a missing part of the resources required
in predictive healthcare. For this purpose, we explore text representation learning
methods to learn food and nutrition concept embeddings. Moreover, since a food
concept can be complex (e.g., a recipe consists of several foods and is therefore composed
of several food concepts), its representation is defined by fusing the separate food concept
embeddings with a heuristic that is defined using the domain knowledge. We could
not explore graph-based representations, since there is a lack of semantic resources in the
food domain, and those available are under development [35], [36]. Having such domain
representations can further help the predictive studies that involve them.
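The fusion of separate food concept embeddings can be sketched as follows. The sum and average merges are the conventional baselines; the quantity-weighted merge is only an assumed stand-in for a domain heuristic, since the exact formula used in the thesis is not given in this section. The function names and the two-dimensional vectors are hypothetical.

```python
def merge_sum(vectors):
    """Baseline: element-wise sum of the ingredient embeddings."""
    return [sum(component) for component in zip(*vectors)]


def merge_average(vectors):
    """Baseline: element-wise mean of the ingredient embeddings."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def merge_quantity_weighted(vectors, quantities):
    """Assumed domain heuristic: weight each ingredient by its share of the
    total quantity (all quantities in the same unit, e.g. grams)."""
    total = sum(quantities)
    if total <= 0:
        raise ValueError("quantities must sum to a positive value")
    weights = [q / total for q in quantities]
    return [
        sum(w * x for w, x in zip(weights, component))
        for component in zip(*vectors)
    ]


# Invented 2-dimensional embeddings for two ingredients of one recipe.
flour = [1.0, 0.0]
butter = [0.0, 1.0]
grams = [300.0, 100.0]

recipe_sum = merge_sum([flour, butter])
recipe_avg = merge_average([flour, butter])
recipe_weighted = merge_quantity_weighted([flour, butter], grams)
```

With these inputs the baselines treat both ingredients equally, whereas the weighted merge pulls the recipe vector towards flour, which makes up three quarters of the recipe by weight.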
1.1.3 State-of-the-art in learning predictive models
Learning predictive models involves training an ML algorithm that links an input vector
of a data instance's characteristics to an end target. Most healthcare predictive studies are
performed by training only one general model that is used to predict the outcome for
all data instances. The challenge that appears here is that the learned knowledge cannot
be transferred to new data, since a lot of bias exists, starting from the quality of the data
and extending through the modeling decisions that are taken [37]. One way to go beyond such
learning is to apply predictive clustering, which differs from a classical clustering approach in
that it finds clusters of instances that are similar based on the descriptive and target
variables, and further learns a separate predictive model for each cluster [38].
1.1.4 Thesis ambition in learning predictive models
The thesis goes beyond state-of-the-art by developing ML/DM pipelines that can be used
for predicting outcomes (e.g. nutrient value prediction) from food and nutrition data.
For this purpose, instead of performing predictive clustering, the pipeline first uses
the domain knowledge encoded in available external semantic resources to find
clusters of similar data instances (using the food domain knowledge), and then uses the clusters
to learn predictive models using the descriptive features (i.e., the learned representation)
and the target variables. In addition, since the domain semantic resource can be developed
for different applications (i.e., involving different bias), we test the sensitivity of such
modeling using different resources in this step, including data-driven and deterministic
rules. Data-driven clustering utilizes a semantic resource with graph-based embeddings,
while the deterministic rules are all possible combinations that can appear in the resource.
Furthermore, we explore if the same text representation methodology provides the best
results for each cluster, or if some representations are more specific to some of the clusters.
All in all, the development of the ML/DM pipelines is led by combining the domain
knowledge and the data-driven modeling approach across its different steps. The ML/DM
pipelines are tested using recipe description data, from which we predict nutrient values.
The predictive learning task is supervised regression.
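The idea of clustering first and then learning one model per cluster can be sketched as below. The cluster labels stand in for groups derived from a domain resource. The labels "dairy" and "cereal", the single numeric feature, and the target values are invented, and a one-feature least-squares line replaces the regression models actually used in the thesis.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, mean_y - a * mean_x


def fit_per_cluster(rows):
    """rows: (cluster_label, feature, target). Fits one line per cluster."""
    grouped = defaultdict(lambda: ([], []))
    for label, x, y in rows:
        xs, ys = grouped[label]
        xs.append(x)
        ys.append(y)
    return {label: fit_line(xs, ys) for label, (xs, ys) in grouped.items()}


def squared_error(rows, predict_fn):
    """Sum of squared residuals of predict_fn(label, x) over all rows."""
    return sum((y - predict_fn(label, x)) ** 2 for label, x, y in rows)


# Invented data: the two hypothetical food groups follow different trends.
rows = [
    ("dairy", 1.0, 2.0), ("dairy", 2.0, 4.0), ("dairy", 3.0, 6.0),
    ("cereal", 1.0, 10.0), ("cereal", 2.0, 9.0), ("cereal", 3.0, 8.0),
]

# Baseline: one global model that ignores the domain clusters.
global_a, global_b = fit_line([x for _, x, _ in rows], [y for _, _, y in rows])
# Domain-informed alternative: one model per cluster.
cluster_models = fit_per_cluster(rows)

global_sse = squared_error(rows, lambda label, x: global_a * x + global_b)
cluster_sse = squared_error(
    rows, lambda label, x: cluster_models[label][0] * x + cluster_models[label][1]
)
```

On this toy data the per-cluster lines fit their groups exactly, while the single global line cannot follow both trends, which is the effect the clustering step is meant to exploit. Both errors are training errors here, so the comparison only illustrates the mechanism.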
1.2 Purpose of the Thesis
The purpose of the thesis is to:
P1. Explore the synergism of domain and data-driven knowledge for prediction tasks in
the food and nutrition domain.
P2. Explore the domain bias impact in different stages of predictive food data modeling.
P3. Narrow a critical gap in Data Science applications for the Food and Nutrition domain,
by estimating the trust of the developed ML/DM pipeline when applied to new data.
1.3 Goals of the Thesis
The goals of this thesis are:
G1. Develop an ML/DM pipeline for learning predictive models that incorporates the
domain knowledge encoded in external semantic resources.
G2. Introduce a domain knowledge-based heuristic for fusing multi-word representations
for learning food concept representation that improves the prediction results.
G3. Perform a sensitivity analysis of incorporating the domain knowledge bias in different
stages of the food predictive modeling.
G4. Develop quantitative indicators to evaluate the effectiveness of the developed ML/DM
pipeline on new, unseen data, as well as estimate the trust of applying it.
1.4 Hypothesis
The hypotheses of the thesis are related to methods for resolving the problem at hand and
exploring the chosen domain. The first hypothesis is related to including domain knowledge
when constructing an ML/DM pipeline. The second is related to the effect of the domain
bias on the predictive task and is split in two parts, depending on the type of domain
knowledge incorporated in the ML/DM pipeline: the first part is related to semantic domain
resources, and the second to the domain heuristic for merging representation vectors.
Finally, the third hypothesis is related to generalizing predictive models:
H1. Including the domain knowledge into different stages of data-driven modeling im-
proves the performance in a prediction task.
H2. Handling the domain knowledge bias can improve the results of a predictive task.
H2.a. Incorporating different semantic information about the domain knowledge in
the modeling process has a significant impact on the prediction task and has
the potential to enhance prediction outcomes.
H2.b. Fusion of multi-word representations with a domain knowledge-based heuris-
tic provides representations (i.e., embeddings) that improve the results from a
prediction task.
H3. Integrating domain knowledge into data-driven modeling generalizes the predictive
models and allows generalization of the knowledge over food and nutrition data that
come from different sources.
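One simple way to probe a generalization hypothesis such as H3 is to train a model on each data source and test it on every other source, collecting the errors in a matrix. The sketch below does this with a one-feature least-squares line and mean absolute error. The datasets, the names, and the choice of error measure are assumptions, and the resulting matrix is not the generalizability index defined later in the thesis.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx if sxx else 0.0
    return a, mean_y - a * mean_x


def mean_absolute_error(model, points):
    """Average absolute residual of the line `model` on `points`."""
    a, b = model
    return sum(abs(y - (a * x + b)) for x, y in points) / len(points)


def cross_dataset_errors(datasets):
    """matrix[i][j] = error of the model trained on dataset i, tested on j.
    Diagonal entries are training errors, not held-out errors."""
    models = {name: fit_line(points) for name, points in datasets.items()}
    return {
        train_name: {
            test_name: mean_absolute_error(model, points)
            for test_name, points in datasets.items()
        }
        for train_name, model in models.items()
    }


# Invented sources: B is a shifted version of A, C follows the opposite trend.
datasets = {
    "A": [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
    "B": [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)],
    "C": [(1.0, -1.0), (2.0, -2.0), (3.0, -3.0)],
}
error_matrix = cross_dataset_errors(datasets)
```

Reading a row of the matrix shows how well knowledge learned from one source transfers to the others. In this toy case a model trained on A transfers far better to B than to C.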
1.5 Scientific Contributions
The research presented in this thesis results in several scientific contributions. The
Bibliography section contains an exhaustive list of all publications that are connected to this
thesis. Each of the hypotheses presented is related to one or more scientific contributions
(SC) that are relevant to the scientific community.
SC1. A novel ML/DM pipeline for learning predictive models that incorporates the domain
knowledge encoded in external semantic resources.
This work has been presented in a peer-reviewed journal article [39].
Assessing nutritional content is very relevant for patients suffering from various
diseases and for professional athletes, and, for other health reasons, it is becoming part
of everyday life for many. However, it is a very challenging task, as it requires
complete and reliable sources. We introduce a machine learning pipeline for
predicting nutrient values of foods using learned vector representations from
short text descriptions of food products. On a dataset used by health specialists,
containing short descriptions of foods and nutrient values, we generate
paragraph embeddings, introduce clustering in food groups using graph-based
vector representations that include food domain knowledge information, and
train regression models for each cluster. The predictions are for four nutrients:
carbohydrates, fat, protein and water. The results from this study suggest that
incorporating domain knowledge before the predictive modeling in an ML pipeline
for the task of predicting nutrient values improves the results compared to the
baseline (without the inclusion of domain knowledge).
SC2. Exploring domain knowledge bias in the ML/DM modeling by including domain
semantic resources that represent different information about the domain.
This work has been presented in a conference paper [40].
We explore the effect of domain bias in a predictive study in the food and nutri-
tion domain. Having a ML pipeline for predicting nutrient values with learned
vector representations from short text description of recipes, we introduce do-
main knowledge before the prediction algorithms are applied. On a large corpus
of recipe data containing short description and nutrient values (both nutrients
and others) we introduce word and paragraph embeddings, learn concept repre-
sentations for the textual descriptions, introduce domain knowledge for cluster-
ing the data, and apply machine learning algorithms for predicting the nutrient
content of the recipes. We explore the impact of the domain knowledge by intro-
ducing two different criteria of clustering the dataset – using graph embedding
of the FoodEx2 codes, and using the traffic light labelling system from the FSA;
at the end we compare the two different criteria. Evaluating the ML pipeline
with the incorporation of both types of domain knowledge showed higher prediction
accuracies compared to the baseline, slightly favoring the version that
clusters the recipes with the FSA traffic light system, which is expected since
it is based on the nutrient values.
SC3. A representation learning pipeline for learning food concept embeddings by intro-
ducing a domain knowledge-based heuristic for fusing multi-word representations for
improving the prediction results.
This work has been presented in a peer-reviewed journal article [41].
Using the concept of our proposed ML/DM pipeline we constructed a represen-
tation learning pipeline in order to explore how the prediction results change
when, instead of using the vector representations of the recipe description, we
use the embeddings of the list of ingredients. The nutrient content of one food
depends on its ingredients; therefore, the text of the ingredients contains more
relevant information. We define a domain-specific heuristic for merging the em-
beddings of the ingredients, which combines the quantities of each ingredient in
order to use them as features in machine learning models for nutrient prediction.
The results from the experiments indicate that the prediction results improve
when using the domain-specific heuristic. The prediction models for protein
prediction were highly effective, with accuracies up to 97.98%. Implementing a
domain-specific heuristic for combining multi-word embeddings yields better re-
sults than using conventional merging heuristics, with up to 60% more accuracy
in some cases.
SC4. Evaluation of the proposed ML/DM pipeline through benchmarking it against models
obtained without the incorporation of the domain knowledge in a predictive task.
This work has been presented in a peer-reviewed journal article [39], and conference
paper [40].
When evaluating the three aforementioned pipelines, we benchmarked the re-
sults obtained from the proposed methodology with and without the incorpora-
tion of domain knowledge. When evaluating the ML/DM pipeline we compared
the results obtained from the proposed ML/DM pipeline against results ob-
tained using conventional heuristics. In the process of evaluation of the extended
ML/DM pipeline with the two different clustering approaches, we compared the
results obtained from the proposed extension of the ML/DM pipeline with re-
sults obtained without incorporating the clustering according to the external
semantic resources.
SC5. Evaluation of the newly proposed representation learning pipeline that incorporates
the domain knowledge against conventional textual representations in a predictive
task.
This work has been presented in a peer-reviewed journal article [41].
In the evaluation process of the newly proposed representation learning pipeline
for comparison purposes, we repeated the same steps with the same experi-
mental setup but using conventional embedding merging heuristics – sum and
average, which we consider as baselines. The results show that using the domain
heuristic for merging the embeddings yields better results than the baselines.
SC6. Creating predefined domain-specific embeddings of food concepts and recipes and
testing the ML/DM pipeline on heterogeneous recipe datasets.
This work has been presented at two conferences [42]–[44].
Although recipe data are very easy to come by nowadays, it is really hard to
find a complete recipe dataset – with a list of ingredients, nutrient values per
ingredients, and per recipe, allergens, etc. Recipe datasets are usually collected
from social media websites where users post and publish recipes, which are usually written
with little to no structure, using both standardized and non-standardized units
of measurement. We collect six different recipe datasets, all publicly available,
all in different formats and some including data in different languages. Bringing
all of these datasets to the needed format for applying the ML/DM pipeline and
the RL pipeline – includes data normalization using dictionary-based named
entity recognition, rule-based named entity recognition, as well as conversions
using external domain-specific resources. After the normalization, the domain-
specific embeddings are created using the same embedding space for all recipes –
one ingredient dataset is generated. The result of this normalization process
is two corpora – one with predefined ingredient embeddings and one with
predefined recipe embeddings. The ML/DM pipeline is then evaluated on all
recipe datasets. The results from this use case also confirm that the embeddings
merged using the domain heuristic yield better results than the baselines.
SC7. Estimating the generalization of the performance of the ML/DM pipeline across
different datasets. In addition, determining quantitative indicators of the trust
with which the developed pipelines can be used on new data that will become available in
the future. By this, we bring the scientific relevance of the proposed thesis closer
to the industry. This work is presented in a journal article [45].
On our predefined corpus of recipe embeddings from the normalized six recipe
datasets we apply a pipeline for testing the generalization of predictive models.
We train predictive models on one of the six recipe datasets and test the mod-
els on the rest of the datasets. After the predictive modeling, we define and
calculate generalizability indexes which form a generalizability matrix. These
numbers indicate the trust with which predictive models trained on one dataset
can be transferred to the other. The evaluation results prove the validity of
these indexes – their correlation with the accuracy of the<