FoodIE: A Rule-based Named-entity Recognition Method for Food
Information Extraction
Gorjan Popovski1, Stefan Kochev1, Barbara Koroušič Seljak2 and Tome Eftimov2
1Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovikj 16, 1000 Skopje, Macedonia
2Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
{gorjan.popovski, stefan.kochev}, {barbara.korousic, tome.eftimov}
Keywords: Information Extraction, Rule-based Named-entity Recognition, Food-entity Recognition.
Abstract: The application of Natural Language Processing (NLP) methods and resources to biomedical textual data
has received growing attention over the past years. Previously organized biomedical NLP-shared tasks (such
as, for example, BioNLP Shared Tasks) are related to extracting different biomedical entities (like genes,
phenotypes, drugs, diseases, chemical entities) and finding relations between them. However, to the best of
our knowledge there are limited NLP methods that can be used for information extraction of entities related to
food concepts. For this reason, to extract food entities from unstructured textual data, we propose a rule-based
named-entity recognition method for food information extraction, called FoodIE. It is comprised of a small
number of rules based on computational linguistics and semantic information that describe the food entities.
Experimental results from the evaluation performed using two different datasets showed that very promising
results can be achieved. The proposed method achieved 97% precision, 94% recall, and 96% F1 score.
1 Introduction

Nowadays, a large amount of textual information is
available in digital form and published in public web
repositories (e.g., online news, scientific publications,
social media). The textual information is presented as
unstructured data, meaning that the data has no predefined data model. Working with textual data is a challenge because of its variability: the same concepts can be mentioned in different ways, depending on how people express themselves and use different writing styles.
Information Extraction (IE) is a task of automat-
ically extracting information from unstructured data
and, in most cases, is concerned with the processing
of human language text by means of natural language
processing (NLP) (Aggarwal and Zhai, 2012). The
idea behind IE is to provide a structured representa-
tion of extracted information obtained from analyzed
text. The information to be extracted is defined by
users, and consists of predefined concepts of interest
and related entities, as well as relationships between
entities and events.
One of the classic IE tasks is named-entity recog-
nition (NER), which addresses the problem of iden-
tification and classification of predefined concepts
(Nadeau and Sekine, 2007). It aims to determine and
identify words or phrases in text into predefined labels
(classes) that describe concepts of interest in a given
domain. Various NER methods exist: terminological-driven, rule-based, corpus-based, methods based on active learning (AL), and methods based on deep neural networks (DNNs).
In this paper, we focus on IE of food entities. To the best of our knowledge, relatively little research has focused on food entities. However, nowadays, the knowledge about extracted food
entities and their relations with other biomedical enti-
ties (like genes, drugs, diseases, etc.) is important for
improving public health.
The main contributions of this paper are:
A rule-based NER method for IE of food entities.
Evaluation of the proposed method, which pro-
vides promising results on unstructured data,
without a need for an annotated corpus.
In the remainder of the paper, we first present an
overview of the related work. Then, we present the
proposed rule-based NER method for IE of food en-
tities. Next, the data used for evaluation is explained,
followed by the results and discussion. Finally, the
conclusions of the paper and a discussion for future
work are presented.
2 Related Work

IE from biomedical literature is a very important task
with the goal of improving public health. Because
NER methods which have the best performances are
usually corpus-based NER methods, there is a need
for an annotated corpus from biomedical literature
that includes the entities of interest. For this purpose,
different annotated corpora are produced by shared
tasks, where the main aim is to challenge and encour-
age research teams on NLP problems.
In comparison with the extensive work done for
biomedical tasks, in the food science domain the situation is different. Several studies have been conducted, but with different goals. For example, in (Xia et al., 2013) the authors presented an approach to identify rice proteins resistant to Xanthomonas oryzae pv. oryzae, enhancing gene prioritization by combining text mining technologies with a sequence-based approach. Co-occurrence methods were also used to identify ingredients mentioned in food labels and to extract food-chemical and food-disease relationships (do Nascimento et al., 2013; Jensen et al., 2014).
A ML approach to Japanese recipe text processing was proposed in (Mori et al., 2012), where one of the evaluated tasks was food-named entity recognition. This approach used the r-FG corpus, which is composed solely of Japanese food recipes. Another similar approach, for generating graph structures from food recipes, was proposed in (Chen, 2017), where the authors manually annotated a recipe corpus that was then used for training a ML model.
The UCREL Semantic Analysis System (USAS) is a framework for automatic semantic analysis of text that distinguishes between 21 major categories, one of which is “food and farming” (Rayson et al., 2004); it is heavily utilized in our rule-based system, FoodIE. The USAS can provide additional
information about the food entity, but the limitation is
that it works on a token level. For example, if in the
text two words (i.e. tokens), like “grilled chicken”,
denote one food entity that needs to be extracted and
analyzed, the semantic tagger would actually parse
the words “grilled” and “chicken” as separate entities
and obtain separate semantic tags.
In (Eftimov et al., 2017), a rule-based NER method used for IE from evidence-based dietary recommendations, called drNER, is presented; among other entities, food entities were also of interest.
Figure 1: The flowchart of the FoodIE methodology (recipe description → food-related text pre-processing → POS-tagging and tag post-processing → semantic tagging of food tokens in the text → food-named entity recognition → food entities).
3 Methodology

To enable food-named entity recognition, in this paper, we propose a rule-based approach, called FoodIE.
It works with unstructured data (more specifically,
with a recipe that includes textual data in form of in-
structions on how to prepare the dish) and consists of
four steps:
Food-related text pre-processing
Text POS-tagging and post-processing of the tag set
Semantic tagging of food tokens in the text
Food-named entity recognition
The flowchart of the methodology is presented in Fig-
ure 1. Further, we are going to explain each part in
more detail.
3.1 Food-related Text Pre-processing
The pre-processing step takes into account the dis-
crepancies that exist between the outputs of the tag-
gers we are utilizing, coreNLP tagger from the R pro-
gramming language (Arnold and Tilton, 2016) and the
UCREL Semantic Analysis System (USAS) (Rayson
et al., 2004). It is also used to remove any characters
that are unknown to the taggers.
Firstly, quotation marks should be removed from the raw text, for the simple reason that they are treated differently by the two NLP libraries used, causing a discrepancy. Secondly, every white space sequence (including tabulation, newlines, etc.) is converted into a single white space to provide a consistent structure to the text.
Additionally, ASCII transliteration is performed, which means characters that are equivalent to ASCII characters are transliterated. An example of such characters is [è, ö, à], which are transliterated to [e, o, a], respectively.
Finally, fractions should be converted into real numbers. Usually, when a food-related text (e.g., a recipe) is written, fractions are used when discussing quantities. However, they are usually written in plain ASCII format and in a manner that confuses NLP taggers. For example, “2.5” is usually written as “2 1/2” in such texts, which neither coreNLP nor the USAS semantic tagger handles well. Thus, in the pre-processing step, all fractions are converted into standard mathematical decimal notation for real numbers.
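The four pre-processing operations can be sketched as follows. This is an illustrative Python sketch; the function name and the regular expressions are ours, not the paper's actual implementation:

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Normalize food-related text before POS and semantic tagging."""
    # 1. Remove quotation marks, which the two taggers treat differently.
    text = re.sub(r'["\u201c\u201d\']', "", text)
    # 2. Collapse every whitespace sequence (tabs, newlines, ...) to one space.
    text = re.sub(r"\s+", " ", text).strip()
    # 3. ASCII transliteration: strip diacritics, e.g. "è" -> "e".
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # 4. Convert plain-ASCII fractions such as "2 1/2" to decimal "2.5".
    def to_decimal(m):
        whole = int(m.group(1)) if m.group(1) else 0
        return str(whole + int(m.group(2)) / int(m.group(3)))
    return re.sub(r"(?:(\d+)\s+)?(\d+)/(\d+)", to_decimal, text)
```

For example, `preprocess('Add  2 1/2 cups of "crème fraîche"')` yields `'Add 2.5 cups of creme fraiche'`.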
3.2 Text POS-tagging and Post-processing of the Tag Set
To obtain the morphological information from textual data, we use the UCREL Semantic Analysis System (USAS) and coreNLP.
The USAS semantic tagger provides word tokens
associated with their POS tags, lemmas, and seman-
tic tags. The semantic tags show semantic fields that
group together word senses that are related at some
level of generality with the same contextual concept.
The groups include not only synonyms and antonyms
but also hypernyms and hyponyms. More details
about semantic tags can be found in (Rayson et al.,
2004; Alexander and Anderson, 2012).
Furthermore, the same is done using the coreNLP
library, which includes all of the above except seman-
tic tags.
For example, the sentence “Heat the beef soup un-
til it boils” is processed by both libraries. The results
from the coreNLP library for the above mentioned ex-
ample sentence are presented in Table 1, while the re-
sults from USAS are presented in Table 2. Observing
the results presented in the tables, it is obvious that
there is a discrepancy between the POS tags for the
token “Heat”.
Table 1: Tags obtained from coreNLP for one recipe sentence.
Token ID Token Lemma POS tag
1 Heat heat NN
2 the the DT
3 beef beef NN
4 soup soup NN
5 until until IN
6 it it PRP
7 boils boil VBZ
8 . . .

As is evident, neither the USAS semantic tagger nor the coreNLP library provides perfect tags (e.g., sometimes verbs are misclassified as nouns, as is the case with the first token in the example given in Table 1). For this reason, the tags returned by both taggers are post-processed and modified using the following linguistic rules:
If at least one of the taggers classifies a token as a verb, mark it as a verb.
If there exists a discrepancy between the tags for
a specific token, prioritize the tag given by the
USAS semantic tagger.
If a past participle form or a past simple form of a verb precedes and is adjacent to a noun, and it is classified as a verb, change its tag from verb to adjective.
Finally, we keep two versions of the modified tag set, one in each format. These modified tags in the coreNLP format and USAS format are presented in Table 3 and Table 4, respectively.
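The three post-processing rules can be sketched as follows. This is an illustrative sketch only: it assumes both taggers produce the same tokenization, collapses merged verb tags to a generic "VB", and uses an adjective retag ("JJ") for the modifier rule; the helper names are ours:

```python
def merge_pos_tags(corenlp_tags, usas_tags):
    """Reconcile the POS tags of coreNLP and USAS for one sentence.

    Both arguments are lists of (token, pos) pairs over the same tokens.
    """
    def is_verb(pos: str) -> bool:
        return pos.startswith("V")  # Penn "VB*" and CLAWS "VV0", "VVZ", ...

    merged = []
    for (tok, c_pos), (_, u_pos) in zip(corenlp_tags, usas_tags):
        if is_verb(c_pos) or is_verb(u_pos):
            merged.append((tok, "VB"))   # rule 1: any verb vote wins
        elif c_pos != u_pos:
            merged.append((tok, u_pos))  # rule 2: prefer the USAS tag
        else:
            merged.append((tok, c_pos))
    # Rule 3: a past-participle/past-simple verb directly before a noun
    # acts as a modifier ("grilled chicken"), so retag it as an adjective.
    for i in range(len(merged) - 1):
        tok, pos = merged[i]
        if pos == "VB" and merged[i + 1][1].startswith("NN") \
                and corenlp_tags[i][1] in ("VBD", "VBN"):
            merged[i] = (tok, "JJ")
    return merged
```

On the example sentence, “Heat” (NN from coreNLP, VV0 from USAS) is merged to a verb tag, matching the modified tag in Table 3.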
3.3 Semantic Tagging of Food Tokens in the Text
To define phrases in the text related to food enti-
ties, we first need to find tokens that are related
to food entities. For this purpose, the USAS se-
mantic tagger is utilized. Using it, a specific rule
is defined to determine the food tokens in the text.
Food tokens are predominantly nouns or adjectives, so we account for this in order to reduce the false-positive rate, i.e., a token may be categorized as a food token if and only if it is either a noun or an adjective. The decision rule combines three conditions using the Boolean expression (Condition 1 ∨ Condition 2) ∧ Condition 3. If the expression is true, then the token is classified as a food token. For clarity, let us assume that t is a token and st is the semantic tag assigned to it by the USAS semantic tagger. Each condition is constructed using the following rules:

Condition 1:
Food tag F(1|2|3|4), or
Living tag L(2|3), or
Substance tag (liquid and solid) O1.(1|2).

Condition 2:
Body part tag B1, and
Not Linear order tag N4, and
Not Location and direction tag M6, and
Not Texture tag O4.5.

Condition 3:
Not General Object tag O2, and
Not Quantities tag N5, and
Not Clothing tag B5, and
Not Equipment for food preparation tag AG.01.t.08, and
Not Container for food, place for storing food tag AG.01.u, and
Not Clothing tag AH.02.

More formally, using Boolean algebra, we can write these rules as:

Condition 1: st ∈ {F1, F2, F3, F4} ∨ st ∈ {L2, L3} ∨ st ∈ {O1.1, O1.2}
Condition 2: st = B1 ∧ st ≠ N4 ∧ st ≠ M6 ∧ st ≠ O4.5
Condition 3: st ≠ O2 ∧ st ≠ N5 ∧ st ≠ B5 ∧ st ≠ AG.01.t.08 ∧ st ≠ AG.01.u ∧ st ≠ AH.02

Table 2: Tags obtained from USAS for one recipe sentence.
Token ID Token Lemma POS tag Semantic tag 1 Semantic tag 2
1 Heat heat VV0 O4.6+ AJ.03.c.02 [Heat]; AJ.03.c.02 [Heat]; AJ.03.c.02.a [Heating/making hot/warm];
2 the the AT Z5 ZC [Grammatical Item];
3 beef beef NN1 F1 AG.01.d.03 [Beef]; AE.14.m.03 [Subfamily Bovinae (bovines)]; AE.14.m.03 [Subfamily Bovinae (bovines)];
4 soup soup NN1 F1 AG.01.n.02 [Soup/pottage]; AA.04.g.04 [Wave]; AA.11.h [Cloud];
5 until until CS Z5 ZC [Grammatical Item];
6 it it PPH1 Z8 ZF [Pronoun];
7 boils boil VVZ O4.6+ E3- AJ.03.c.02.b [Action of boiling]; AJ.03.c.02.b [Action of boiling]; AJ.03.c.02.b [Action of boiling];

Table 3: Modified tags from coreNLP for one recipe sentence.
Token ID Token Lemma POS tag
1 Heat heat VB
2 the the DT
3 beef beef NN
4 soup soup NN
5 until until IN
6 it it PRP
7 boils boil VBZ
8 . . .

Additionally, we define one rule to determine object tokens. Determining the object tokens will further help us in the definition of food entities, mainly to avoid false positives. The rule consists of:
General Object tag O2, or
Clothing tag B5, and
Not Body Part tag B1, and
Not Living tag L(2|3), and
Not a food token as defined by the aforementioned first rule.

Using Boolean algebra, this rule is represented as (st = O2 ∨ st = B5) ∧ st ≠ B1 ∧ st ≠ L2 ∧ st ≠ L3 ∧ ¬Rule1. If this condition is met, the token is tagged as a general object.

The single rule for defining a color noun consists of:
Color tag O4.3.

The rule for defining a color noun is then formally defined as st = O4.3.
These tags are useful when food entities ending in a color, such as “egg whites” or “hash browns”, appear in the text; such phrases are indeed to be treated as food entities.
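Treating the set of semantic tags assigned to a token as a flat set, the food-token decision rule can be sketched as follows (an illustrative simplification; the actual system evaluates the rule over the USAS tagger's per-token output):

```python
FOOD_OR_LIVING = {"F1", "F2", "F3", "F4", "L2", "L3", "O1.1", "O1.2"}

def is_food_token(pos: str, sem_tags: set) -> bool:
    """(Condition1 OR Condition2) AND Condition3, restricted to nouns/adjectives."""
    if not (pos.startswith("NN") or pos.startswith("JJ")):
        return False  # food tokens must be nouns or adjectives
    cond1 = bool(sem_tags & FOOD_OR_LIVING)                 # food/living/substance
    cond2 = "B1" in sem_tags and not sem_tags & {"N4", "M6", "O4.5"}
    cond3 = not sem_tags & {"O2", "N5", "B5",               # general object, quantity,
                            "AG.01.t.08", "AG.01.u", "AH.02"}  # clothing, equipment, ...
    return (cond1 or cond2) and cond3
```

For instance, “soup” (NN1, tags {F1, AG.01.n.02}) is accepted, while a token carrying the Quantities tag N5 is vetoed by Condition 3.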
At the end, one additional rule is constructed, defining what is explicitly disallowed to be the main token in a food entity:
Equipment for food preparation tag AG.01.t.08, or
Container for food, place for storing food tag AG.01.u, or
Clothing tag AH.02, or
Temperature tag O4.6, or
Measurement tag N3.

This rule can be represented as st = AG.01.t.08 ∨ st = AG.01.u ∨ st = AH.02 ∨ st = O4.6 ∨ st = N3.

Table 4: Modified tags from USAS for one recipe sentence.
Token ID Token Lemma POS tag Semantic Tag 1 Semantic tag 2
1 Heat heat VV0 O4.6+ AJ.03.c.02 [Heat]; AJ.03.c.02 [Heat]; AJ.03.c.02.a [Heating/making hot/warm];
2 the the AT Z5 ZC [Grammatical Item];
3 beef beef NN1 F1 AG.01.d.03 [Beef]; AE.14.m.03 [Subfamily Bovinae (bovines)]; AE.14.m.03 [Subfamily Bovinae (bovines)];
4 soup soup NN1 F1 AG.01.n.02 [Soup/pottage]; AA.04.g.04 [Wave]; AA.11.h [Cloud];
5 until until CS Z5 ZC [Grammatical Item];
6 it it PPH1 Z8 ZF [Pronoun];
7 boils boil VVZ O4.6+ E3- AJ.03.c.02.b [Action of boiling]; AJ.03.c.02.b [Action of boiling]; AJ.03.c.02.b [Action of boiling];
This rule is utilized when isolating entities that
could be potential false positives. An example of
this would be “oil temperature” or “cake pan”. Ad-
ditionally, there are some manually added resources
in this disallowed category, which frequently occur in
the texts.
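The disallowed-head rule, together with the manually added resources, can be sketched as follows (illustrative; the two word entries are hypothetical examples, not the paper's actual resource list):

```python
# Semantic tags that may never belong to the main token of a food entity.
DISALLOWED_TAGS = {
    "AG.01.t.08",  # equipment for food preparation
    "AG.01.u",     # container for food / place for storing food
    "AH.02",       # clothing
    "O4.6",        # temperature
    "N3",          # measurement
}

# Manually curated frequent false-positive heads (hypothetical entries).
DISALLOWED_WORDS = {"temperature", "pan"}

def disallowed_head(token: str, sem_tags: set) -> bool:
    """True when the token must not head a food chunk ("oil temperature", "cake pan")."""
    return bool(sem_tags & DISALLOWED_TAGS) or token.lower() in DISALLOWED_WORDS
```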
3.4 Food-named Entity Recognition
To obtain food chunks, we used the modified tag set
from the USAS semantic tagger obtained in Subsec-
tion 3.2 in combination with the food tokens obtained
in Subsection 3.3. The process of food-named entity
recognition consists of three steps.
Firstly, we iterate through every food token which
we extracted previously from the text, and for each
token we define a set of rules that constitute a food chunk.
Adjacent to the left of the food token we allow
chaining of adjectives (JJ), nouns (NN), proper nouns
(NP), genitive tag (GE), unknown tags (Z99) and gen-
eral tokens tagged as food, but explicitly omit general
objects. The purpose of including the unknown POS
tag (Z99) is to catch tokens that do not concisely fall
into one of the tags in the standard POS tag set, yet
still are of importance to the semantics of the food en-
tity. Such an example would be “Colby-Jack cheese”,
whose POS tags are Z99 and NN, respectively.
Adjacent to the right, the logic is the same, differing only by allowing general objects to be part of the food entity, as well as tokens that have been tagged as a color noun by the rule engine. We also take care not to use a token twice.
Then, to determine if it truly is a food entity chunk
or just a chunk related to food but not a food entity in
and of itself, we check the last token of the chunk.
The whole chunk is discarded if the last token is:
A noun (starting with NN) and a general non-food object, or
in the disallowed category as defined by the rule engine, or
in the disallowed category as defined by the resource set.
Some examples where this would be a false pos-
itive are “muffin liner”, “casserole dish” or “egg
timer”. If this check passes and the last token is not
a general object, we mark each token in the new food
chunk with an index unique to the whole chunk and
continue iterating through the remaining food tokens.
After the first step, we now must concatenate all
relevant information for each food entity. For each
indexed food entity, we join all the instances into one
entry, thus creating a vector where each token is its
own entry, except for the food entities which are rep-
resented as one entry. If initially we had a vector of
tokens such as [Chop, the, hot, Italian, sausage, into,
pieces, .] the output would be [Chop, the, hot Ital-
ian sausage, into, pieces, .]. This also applies to other
relevant information we might want to track, such as
lemmas, POS tags, sentence indexes or even individ-
ual token indexes.
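The concatenation step can be sketched as follows (illustrative; we assume each token carries the index of its food chunk, or None for non-food tokens):

```python
def merge_chunks(tokens, chunk_ids):
    """Collapse tokens sharing a food-chunk index into a single entry.

    tokens: list of token strings; chunk_ids: parallel list in which tokens
    of the same food entity share an integer id, and other tokens have None.
    """
    merged, current_id, buffer = [], None, []
    for tok, cid in zip(tokens, chunk_ids):
        if cid is not None and cid == current_id:
            buffer.append(tok)        # still inside the same food chunk
            continue
        if buffer:                    # a food chunk just ended: flush it
            merged.append(" ".join(buffer))
            buffer = []
        current_id = cid
        if cid is None:
            merged.append(tok)        # ordinary token, kept as its own entry
        else:
            buffer = [tok]            # a new food chunk starts here
    if buffer:
        merged.append(" ".join(buffer))
    return merged
```

On the paper's example, `merge_chunks(["Chop", "the", "hot", "Italian", "sausage", "into", "pieces", "."], [None, None, 1, 1, 1, None, None, None])` returns `["Chop", "the", "hot Italian sausage", "into", "pieces", "."]`.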
For additional robustness, we perform a check to
assure that each food chunk we have isolated indeed
contains a food token, and that the token is marked
under some food chunk. For this we only mark a
chunk as a food entity if it contains at least one word
that has previously been tagged as a food token and
has been indexed as part of the respective chunk.

4 Evaluation

The evaluation was performed manually, since there
is no pre-existing method to evaluate such a text cor-
pus. To avoid any kind of bias when evaluating food-
related text, one person was tasked with manually
performing food chunk extraction from each individ-
ual text, while another person cross referenced those
manually obtained chunks with the ones obtained
from FoodIE. Using this method, a figure for true pos-
itives (TPs), false negatives (FNs) and false positives
(FPs) was procured, while it was decided that the cat-
egory true negative was not applicable to the nature
of the problem and its evaluation. Additionally, it was
decided that a “partial (inconclusive)” category was
necessary, as some of the food chunks were incom-
plete, but nevertheless caught, thus including signifi-
cant information. This category encompasses all the
extracted food chunks which were caught, but missed
at least one token. An example would be “bell pep-
per”, where FoodIE would only catch “pepper”.
We would have liked to compare our results with the model presented in (Chen, 2017), but we were unable to obtain the requested model and corpus. We
provide a small example of comparing FoodIE with
drNER (Eftimov et al., 2017), in order to show that
they provide food entities on different level, so a fair
comparison cannot be made.
While the evaluation was being done, we kept
track of all the False Negative instances and have con-
structed a resource set that will improve the perfor-
mance of FoodIE in future implementations.
4.1 Data
Firstly, a total of 200 recipes were processed and eval-
uated. The original 100 recipes, which were analyzed
and upon which the rule engine was built, were taken
into consideration, as well as 100 new recipes which
had not been analyzed beforehand. The recipes were
taken from two separate user-based sites, Allrecipes
( and MyRecipes (, where there is no standardized format for the recipe description. This was chosen to ensure that the linguistic constructs utilized in each written piece varied and had no pattern behind them. The texts were chosen from a variety of
topics, as to provide further diversity.
Secondly, we selected 1,000 independently ob-
tained recipes from Allrecipes (Groves, 2013), which
is the largest food-focused social network, where everyone plays a part in helping cooks discover and share home cooking. We selected Allrecipes because there is no limitation as to who can post recipes, so we have variability in how users express themselves.
The recipes were selected from five recipe categories:
Appetizers and snacks, Breakfast and Lunch, Dessert,
Dinner, and Drinks. From each recipe category 200
recipes were included in the evaluation set.
The evaluation datasets, including the obtained
results, are publicly available at
4.2 Results and Discussion
The results for TPs, FPs, and FNs from evaluating FoodIE on the dataset of 200 recipes are presented in Table 5. The group “Partial (Inconclusive)” was left out of these evaluations, as some would argue they should be counted as TPs, while others that they should be included in the FNs. Some examples included here are: “empty passion fruit juice”, “cinnamon” and “soda”, where the actual food entity chunks would be “passion fruit juice”, “cinnamon sticks” and “club soda”, respectively. These are mostly due to the dual nature of words: a word may function as both a noun and a verb, or as both an adjective and a verb. For such words, the tagger sometimes incorrectly classifies the tokens. In these examples,
“empty” is tagged as an adjective, where in context
it, in fact, is a verb. The same explanation holds for
the other two examples. For these reasons, when the
evaluation metrics were calculated, this category was
simply omitted. Moreover, even if they are grouped
with either TPs or FNs, this does not significantly af-
fect the results.
Regarding the FN category (type II error), there
were some specific patterns that produced the most
instances. One very simple type of a FN instance
is where the author of the text refers to a specific
food using the brand name, such as “allspice” or “Jägermeister”. These are difficult to catch if there is no additional information following the brand name. However, if the user includes the general classification of the branded food, FoodIE will catch it. An example of this would be simply writing “Jägermeister liqueur”. Another instance of a type II
error is when the POS taggers give incorrect tags, as
was the case with some “Partial (Inconclusive)” in-
stances. An example of this is when the tagger misses
chunks such as “mint leaves” and “sweet glazes”,
where both “leaves” and “glazes” are incorrectly clas-
sified as verbs when in this context they should be
tagged as nouns. Another example would be when
the semantic tagger incorrectly classifies some token
within the given context, such as “date” being clas-
sified as a noun meaning day of year, as opposed to
it being a certain fruit. Furthermore, there exist FNs
which are simply due to the rarity of the food, such
as “kefir”, “couscous” or “stevia”, the last one being
of immense importance to people suffering from dia-
betes, as it is a safe sugar substitute. Another category
of type II errors is due to the fact that some foods are often referred to by their colloquial name, such as “half-
and-half” and “spring greens”. The final category of
this type of error is where there exist spelling varia-
tions for a single food, such as “eggnog”, “egg nog”,
“egg-nog”. These are very difficult, if not impossible,
to predict correctly, since grammatical and morphological styles vary with each user, extending as far as outright improper use of the English language. This is a separate problem in and of itself, i.e., spellchecking and spelling correction.
The second type of error to discuss is the FP cate-
gory (type I error), which is often due to the existence
of objects that are not foods, but are closely related to
food entities. These include instances such as “dol-
lop” or “milk frother”, where the first example has a
meaning very closely related to food, thus making it
difficult to distinguish using the semantic tags. The
second chunk is simply an instrument related to food
and cooking, while being rare enough such that the
semantic tagger does not classify it properly as an object.
Table 5: Predictions (200 recipes).
True Positive (TP) 3063
False Positive (FP) 75
False Negative (FN) 185
Partial (Inconclusive) 97
Using the results reported in Table 5, the evaluation metrics for F1 score, precision, and recall are presented in Table 6.

Table 6: Evaluation metrics (200 recipes).
F1 Score Precision Recall
0.9593 0.9761 0.9430
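The metrics in Table 6 follow directly from the counts in Table 5, with the partial matches omitted (a minimal Python check):

```python
def ner_metrics(tp: int, fp: int, fn: int):
    """Entity-level precision, recall and F1 (true negatives are not
    applicable to this NER evaluation)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from Table 5 (200 recipes).
p, r, f1 = ner_metrics(tp=3063, fp=75, fn=185)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.9761 0.943 0.9593
```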
The results from evaluating FoodIE on the dataset with 1000 recipes are reported in Tables 7 and 8.
Table 7: Predictions (1000 recipes).
True Positive (TP) 11461
False Positive (FP) 258
False Negative (FN) 684
Partial (Inconclusive) 359
Comparing the results obtained from the evalua-
tions (tables 6 and 8), we can conclude that FoodIE
behaves consistently. Evaluating the dataset with 200
recipes, which consists of 100 recipes that were ana-
lyzed to build the rule engine and 100 new recipes that
were not analyzed beforehand, we obtained 0.9761
precision, 0.9430 recall, and 0.9593 F1 score. Furthermore, by evaluating it on a dataset that consists
of 1000 new recipes, it obtained 0.9780 for precision,
0.9437 for recall, and 0.9605 for F1 score.

Table 8: Evaluation metrics (1000 recipes).
F1 Score Precision Recall
0.9605 0.9780 0.9437

Comparing these results shows that FoodIE gives very promising and consistent results.
We also provided the TPs, FPs, FNs, and Par-
tial predictions, together with the evaluation metrics
for each recipe category separately (Table 9). Us-
ing them, we can see that Dinner category provides
most FNs (223), while the Breakfast/lunch category
provides the least FNs (82). Regarding the partial predictions, the Breakfast/lunch category provides the most (108), while the Drinks category provides the least (31). Looking at the results, it is evident that
FoodIE retains the aforementioned consistency, even
when comparing the evaluation metrics from each cat-
egory between themselves.
Table 9: Predictions and evaluation metrics for each recipe category.
Recipe category TP FP FN Partial F1 Score Precision Recall
Appetizers/snacks 2147 27 162 45 0.9578 0.9876 0.9298
Breakfast/lunch 2443 33 82 108 0.9770 0.9876 0.9675
Desserts 2612 87 127 124 0.9607 0.9678 0.9536
Dinner 3176 47 223 51 0.9592 0.9854 0.9344
Drinks 1083 64 90 31 0.9336 0.9442 0.9233
In Table 10, we present the results obtained for
10 sentences (i.e., evidence-based dietary recommendations) previously used in (Eftimov et al., 2016; Eftimov et al., 2017), in order to present the difference between FoodIE and drNER. A semicolon is used to separate distinct food entities. Using the table, we can
see that drNER and FoodIE provide results on a dif-
ferent level. For example, let us consider the sixth rec-
ommendation. drNER extracted only one food entity,
which is “Milk, cheese, yogurt and other dairy prod-
ucts”, while FoodIE extracted four separate food enti-
ties, i.e. “Milk”, “cheese”, “yogurt”, and “other dairy
products”. From this, it follows that FoodIE provides
more precise results, which means it can also be used
as a post-processing tool for drNER in order to extract
the food entities on an individual level.
The performance of the rule-based system FoodIE heavily depends on the taggers used, so improving the quality of the POS-tagging and semantic tagging methods will also improve the evaluation metrics for FoodIE.

Table 10: Food entities extracted by drNER and FoodIE.

Recommendation: “Good sources of magnesium are: fruits or vegetables, nuts, peas and beans, soy products, whole grains and milk.”
drNER: fruits or vegetables, nuts, peas and beans; soy products; whole grains and milk
FoodIE: fruits; vegetables; nuts; peas; beans; whole grains; milk

Recommendation: “The RDAs for Mg are 300 mg for young women and 350 mg for young men.”
drNER: -
FoodIE: -

Recommendation: “Increase potassium by ordering a salad, extra steamed or roasted vegetables, bean-based dishes, fruit salads, and low-fat milk instead of soda.”
drNER: extra steamed or roasted vegetables; fruit salads; low-fat milk
FoodIE: roasted vegetables; bean-based dishes; fruit salads; low-fat milk

Recommendation: “Babies need protein, about 10 g a day.”
drNER: -
FoodIE: -

Recommendation: “1 teaspoon of table salt contains 2300 mg of sodium.”
drNER: table salt
FoodIE: table salt

Recommendation: “Milk, cheese, yogurt and other dairy products are good sources of calcium and protein, plus many other vitamins and minerals.”
drNER: Milk, cheese, yogurt and other dairy products
FoodIE: Milk; cheese; yogurt; other dairy products

Recommendation: “Breast milk provides sufficient zinc, 2 mg/day, for the first 4-6 months of life.”
drNER: Breast milk
FoodIE: milk

Recommendation: “If you’re trying to get more omega-3, you might choose salmon, tuna or eggs enriched with omega-3.”
drNER: salmon, tuna
FoodIE: salmon; tuna

Recommendation: “If you need to get more fiber, look to beans, vegetables, nuts and legumes.”
drNER: beans, vegetables, nuts, and legumes
FoodIE: nuts; legumes

Recommendation: “Excellent sources of alpha-linolenic acid, ALA, include flaxseeds and walnuts.”
drNER: flaxseeds and walnuts
FoodIE: alpha-linolenic acid

5 Conclusion

To extract food entities from unstructured textual data, we propose a rule-based named-entity recognition method for food information extraction, called FoodIE. It is a rule engine, where the rules are based on computational linguistics and semantic information that describe the food entities. Evaluation showed that FoodIE behaves consistently across different independent evaluation datasets and that very promising results have been achieved.
To the best of our knowledge, there is a lim-
ited number of NLP tools that can be used for IE
of food entities. Moreover, there is a lack of anno-
tated corpora that can be used to train corpus-based
NER methods. Motivated by the evaluation results
obtained, we are planning to use it in order to build
an annotated corpus that can be further used for ex-
tracting food entities together with their relations to
other biomedical entities. By performing this, we can
more easily keep up with the new knowledge that arrives daily in scientific publications aimed at improving public health.
Acknowledgments

This work was supported by the Slovenian Research
Agency Program P2-0098 and ERA Chair ISO-
FOOD for isotope techniques in food quality, safety
and traceability [grant agreement no. 621329].
Aggarwal, C. C. and Zhai, C. (2012). Mining text data. Springer Science & Business Media.
Alexander, M. and Anderson, J. (2012). The Hansard Corpus, 1803-2003.
Arnold, T. and Tilton, L. (2016). coreNLP: Wrappers Around Stanford CoreNLP Tools. R package version 0.4-2.
Chen, Y. (2017). A Statistical Machine Learning Approach to Generating Graph Structures from Food Recipes. PhD thesis.
do Nascimento, A. B., Fiates, G. M. R., dos Anjos, A., and Teixeira, E. (2013). Analysis of ingredient lists of commercially available gluten-free and gluten-containing food products using the text mining technique. International Journal of Food Sciences and Nutrition, 64(2):217-222.
Eftimov, T., Koroušič Seljak, B., and Korošec, P. (2017). A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations. PLoS One, 12(6):e0179488.
Eftimov, T., Koroušič Seljak, B., and Korošec, P. (2016). Grammar and dictionary based named-entity linking for knowledge extraction of evidence-based dietary recommendations. In KDIR, pages 150-157.
Groves, S. (2013). How Allrecipes.com became the world's largest food/recipe site. ROI of Social Media (blog).
Jensen, K., Panagiotou, G., and Kouskoumvekaki, I. (2014). Integrated text mining and chemoinformatics analysis associates diet to health benefit at molecular level. PLoS Computational Biology, 10(1):e1003432.
Mori, S., Sasada, T., Yamakata, Y., and Yoshino, K. (2012). A machine learning approach to recipe text processing. In Proceedings of the 1st Cooking with Computer Workshop, pages 29-34.
Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3-26.
Rayson, P., Archer, D., Piao, S., and McEnery, A. (2004). The UCREL semantic analysis system.
Xia, J., Zhang, X., Yuan, D., Chen, L., Webster, J., and Fang, A. C. (2013). Gene prioritization of resistant rice gene against Xanthomonas oryzae pv. oryzae by using text mining technologies. BioMed Research International, 2013.