Semantic Data Understanding with Character Level Learning
Michael J. Mior
Rochester Institute of Technology
102 Lomb Memorial Drive
Rochester, New York 14623–5608
Ken Q. Pu
Ontario Tech University
2000 Simcoe Street North
Oshawa, Ontario L1G 0C5
Abstract—Databases are growing in size and complexity. With
the emergence of data lakes, databases have become open, fast
evolving and highly heterogeneous. Understanding the complex
relationships among different entity types in such scenarios is
both challenging and necessary to data scientists. We propose an
approach that utilizes a convolutional neural network to learn
patterns associated with each entity type in the database at the
character level. We demonstrate that the learned character-level
patterns can capture sufficient semantic information for many
useful applications including data lake schema exploration, and
interactive data cleaning.
I. INTRODUCTION

Following the explosion of big data, the new trend is data lakes, which are characterized by heterogeneous, dirty, and evolving schemas. Data lakes present a number of new challenges [1]. For example, there is limited information recorded
on relationships between data. In order to navigate data stored
in the lake, we need to understand semantic relationships
among different schemas. However, trying to understand these
relationships purely at the schema level is unreliable. An
attribute name could refer to a wide variety of concepts.
Without examining the data, we cannot reliably determine which of the many different concepts the attribute name may refer to.
However, even analysis of values is not always sufficient
for disambiguation without expert input, especially in the
presence of dirty data. Our goal therefore is to derive semantic
information from data values at a substring level. We train a
data-driven semantic model, which can then assist analysts
in gaining insights into relationships among schemas without
any prior knowledge of these relationships. It also allows us to reason about uncertainties within the data and to identify anomalies which may require cleaning. Since each data lake
may have a different set of semantic types, our goal is to learn
a model of semantic types based on labelled examples. The
model is used to organize types based on semantic similarity,
either into a visual graph layout or a hierarchy of clusters
of types. We then use this model to identify data which is
potentially dirty as well as explanations why the data may be
labelled incorrectly. This allows data analysts to make more
informed decisions when performing data cleaning.
Existing approaches often rely on matching entire strings
inside data values. In contrast, our method works at the
character level which enables it to detect salient features
within substrings. We also use this character-level information
to provide meaningful explanations for misclassified data.
Our model does not require the use of any pre-trained word
embeddings but only labelled input data. This makes it suitable
for semantic domains that do not align with standard corpora
used to train these models.
This paper provides three main contributions:
• Use of a neural network model to capture character features in data values within a data lake
• A methodology to construct a similarity graph of entity types based on the result of classification
• Applications of the model and similarity graph in data understanding and anomaly detection
II. RELATED WORK

There has been significant past work in trying to create representations of meaning from text. For example, word embeddings such as Word2vec [2], GloVe [3], ELMo [4], and fastText [5], [6] embed textual representations of words into a latent vector space. These vectors can be used to
compare the semantic similarity of words [7] and link related
concepts [8]. Further work such as ConVec [9] generated
embeddings not only for single words, but semantic concepts.
While we don’t use any of the specific embeddings described
above, our network does contain an embedding layer which
learns a similar representation specific to the dataset being
analyzed. A number of other works have tried to use such
embeddings to infer semantic information from attributes of
entities. Nobari et al. [10] used word embeddings to identify
similarities between attributes in a query and a knowledge
base. Most similar to our approach is Sherlock [11] which
attempts to infer a semantic data type from a set of attribute
values. One important distinction compared to our work is that
Sherlock relies on pre-trained GloVe embeddings whereas all
our embeddings are trained using only the provided training
data. This makes our approach suitable to any database of
strings even when their semantic meaning does not match with
any external source.

978-1-7281-1054-7/20/$31.00 ©2020 IEEE
One of the primary goals of this work is to identify the
semantics behind specific attribute values. This is highly
useful in the context of matching semantically related data or
schemas. For example, InfoGather [12] uses semantic informa-
tion about attributes to find additional attributes semantically
related to a given entity. Venetis et al. [13] used information
extracted from the Web to assign semantic relationships to
tables. Related to our exploration of interactive approaches,
MWEAVER [14] uses sample instances from a target schema
which are provided iteratively to the algorithm. Our approach
also infers the semantic relationships of entity types, and uses
them for labelling as we show in Section III-B.
Semantic relationships within a dataset can be useful for
data cleaning. Prokoshyna et al. use metric functional dependencies to identify poor data quality by finding semantically related attributes. These relationships are then
used to suggest possible repairs. While their approach ana-
lyzes relationships between attributes of tuples in a relation,
we focus on the semantics of individual values. One early
interactive data cleaning system is Potter’s wheel [15] which
provides a spreadsheet interface that allows users to interac-
tively transform the data. However, it does not attempt to
learn data semantics which is incredibly useful information
for data scientists. ActiveClean [16] further demonstrated
the usefulness of interactive data cleaning in the context of
statistical model training where decisions made by users in
conjunction with an iterative optimization algorithm improved
model accuracy by up to 2.5×.
III. METHODOLOGY

Our objective is to gain semantic understanding of a given
data collection. We start with a simple and widely applicable
data model consisting of sets of strings.
Definition 1 (Entity type and labels): An entity type is a named set of strings. The strings associated with an entity type t are called the entity labels of t. Let T be the set of all entity types, and labels(t) be all its labels.
We define types(x) to be the set of types that x belongs to: {t ∈ T : x ∈ labels(t)}.
Example 1 (Entity types and labels in DBpedia): DBpedia [17] is a community-driven database of entities that are extracted from Wikipedia. Each entity has a type and a label. DBpedia has 463 types:
T = {Country, TimePeriod, College, . . .}
Consider the entity type Country. It consists of country names and aliases as strings, e.g.: “United States”, “United States of America”, “Canada”, etc. Many other strings, either historic names or phrases related to countries, may also appear in the set. Here are some examples of entity labels found in the DBpedia dataset:¹ second east turkestan republic, second empire of haiti, second federal republic of mexico, second french
¹DBpedia dataset available at:
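The data model in Definition 1 can be sketched in a few lines of Python. The types and labels below are illustrative stand-ins, not actual DBpedia contents:

```python
# Entity types modeled as named sets of strings (Definition 1).
# The example types and labels here are illustrative placeholders.
entity_types = {
    "Country": {"united states", "canada", "second empire of haiti"},
    "TimePeriod": {"1984", "19th century"},
}

def labels(t):
    """All entity labels of the entity type t."""
    return entity_types[t]

def types(x):
    """All entity types that the string x belongs to."""
    return {t for t in entity_types if x in labels(t)}
```

Note that a single string may belong to more than one type, which is exactly the ambiguity the model must cope with.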
V : the alphabet size
T : the number of entity types
L1 = Embedding(input = V, output = 300)
L2 = Convolution1D(filters = 64, kernel size = 4)
L3 = Dense(256)
L4 = Dense(T)
model(x) = Softmax ∘ L4 ∘ L3 ∘ Maxpool ∘ L2 ∘ L1(x)
Fig. 1: CNN architecture for type classification
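A minimal NumPy sketch of the forward pass in Figure 1, with randomly initialized weights standing in for trained parameters. The ReLU activations and weight scales are our assumptions; the figure only names the layers:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, D, F, K = 461, 144, 300, 64, 4   # alphabet, types, embed dim, filters, kernel size

# Random parameters stand in for trained weights (illustrative only).
E  = rng.normal(size=(V, D))            # L1: embedding table
W2 = rng.normal(size=(K, D, F)) * 0.01  # L2: 1D convolution kernels
W3 = rng.normal(size=(F, 256)) * 0.01   # L3: dense layer
W4 = rng.normal(size=(256, T)) * 0.01   # L4: output layer

def model(x):
    """x: sequence of integer-encoded characters -> distribution over T types."""
    h = E[np.array(x)]                                      # (L, D) embedded characters
    conv = np.stack([np.tensordot(h[i:i + K], W2, axes=([0, 1], [0, 1]))
                     for i in range(len(x) - K + 1)])       # (L-K+1, F) convolution
    h = np.maximum(conv, 0).max(axis=0)                     # ReLU + global max pool -> (F,)
    h = np.maximum(h @ W3, 0)                               # dense + ReLU -> (256,)
    z = h @ W4                                              # logits -> (T,)
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()                                      # softmax over entity types
```

Training these weights against label-type pairs is what forces the convolution kernels to encode substring patterns.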
Our objective is to learn prominent patterns at the substring
level from the entity labels. Based on the patterns, we train a
model that can probabilistically infer the entity type of a given
string. In this paper, we use the trained model to analyze the
relationships among different entity types as a similarity graph.
The similarity graph is then used to uncover hidden structures
in a collection of entity types. Furthermore, we demonstrate
how the model can also be used to perform dirty data discovery
and inspection.
A. Pattern capture using neural networks
Each entity type has its own distinct patterns in its labels. For example, one may suspect that the entity type Year will have short strings with digits, while Biomolecule labels may have the prefix oxy or the suffix zyme. To capture a wide
range of patterns without any prior domain knowledge, we
choose to use a convolutional neural network (CNN). The
CNN is trained to perform entity type prediction based on the
labels at the character level. This forces the network to learn
kernels over the substrings of the input. Effectively, substring
patterns are encoded as the weight matrices of the network. Our
approach makes no assumption on the substring patterns or
their location in the entity label, as the model parameters are
determined by the data during the training phase.
The network architecture consists of an embedding layer,
followed by a 1D convolutional layer with max pooling.
Finally, we use a two layer MLP to convert the output of the convolutional layer to a probability distribution over all entity
types. The input to the network is a vector of integers which
correspond to the integer encoding of the characters of an
entity label. The network architecture is given in Figure 1.
Given an input x (a sequence of integer-encoded characters), we denote model(x) as the inferred probability distribution over all possible types. The model is trained with label-type pairs taken from {(x, t) : t ∈ T, x ∈ labels(t)}.
To deal with differences in the cardinalities of labels(t)for
different types, we oversample the labels of underrepresented
types, and undersample those of overrepresented types. This
is necessary to avoid biasing the generated confusion matrix,
which as we show later, is core to our similarity analysis.
Both oversampling and undersampling are necessary due to the
highly skewed distribution in the training data. Oversampling alone would result in overwhelmingly large and redundant training data, while undersampling alone would result in too little training data.
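The combined resampling can be sketched as follows, under the assumption of a single per-type target count (the actual target used in the paper is not specified):

```python
import random

def balance(labels_by_type, target, seed=0):
    """Resample each type's labels to exactly `target` examples:
    undersample overrepresented types and oversample (with replacement)
    underrepresented ones. `target` is a tuning choice, not from the paper."""
    rnd = random.Random(seed)
    balanced = {}
    for t, lbls in labels_by_type.items():
        lbls = list(lbls)
        if len(lbls) >= target:
            balanced[t] = rnd.sample(lbls, target)                    # undersample
        else:
            balanced[t] = [rnd.choice(lbls) for _ in range(target)]   # oversample
    return balanced
```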
Cluster Homogeneity size type 1 type 2 type 3 type 4
1.11 1.00 2 NaturalPlace BodyOfWater
1.12 1.00 3 UnitOfWork Case LegalCase
1.13 1.00 3 ChemicalSubstance ChemicalCompound Drug
1.14 1.00 2 Biomolecule Protein
1.15 0.17 51 Award Disease EthnicGroup Holiday
1.16 1.00 4 ArchitecturalStructure HistoricPlace Building Infrastructure
Average 0.93
(a) Samples of first level clusters
Cluster Homogeneity size type 1 type 2 type 3 type 4
1.15.1 1.00 5 Artwork Film RadioProgram Musical
1.15.2 1.00 2 Place Region
1.15.3 0.11 20 Holiday Enzyme FictionalCharacter InformationAppliance
1.15.4 1.00 3 Monarch Noble Royalty
1.15.5 1.00 2 Park ProtectedArea
Average 0.72
(b) Samples of Second level clusters
TABLE I: Results of spectral clustering (at most four types are shown for each cluster)
B. Similarity graph
The most typical application of a trained neural network
model is to make predictions (in this case, of the entity type)
from an unseen input (an entity label). While this is still a valid
use case, we also use the trained model to obtain semantic
relations between types and discover any hidden structures
in the data. Let C be the confusion matrix produced by the model:
C[t, t′] = |{x ∈ labels(t′) : argmax(model(x)) = t}|
From C, we derive a similarity measure. We assume that any misclassification between t and t′ is the result of the model being confused by the patterns found in labels in t and t′, and this suggests a high degree of semantic similarity between t and t′. So, it makes sense to use the percentage of misclassification as a similarity measure. We then construct a similarity matrix S based on the number of misclassified examples in C:
S[t, t′] = (C[t, t′] + C[t′, t]) / |labels(t′)| if t ≠ t′, and S[t, t′] = 1 otherwise.
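The construction of C and S follows directly from the definitions; in this sketch, `predict` stands in for argmax(model(x)):

```python
def confusion_and_similarity(labels_by_type, predict):
    """Build the confusion matrix C[t, t'] = number of labels of t'
    predicted as t, and the symmetric similarity measure S.
    `predict` maps a string to a type name (a stand-in for argmax(model(x)))."""
    types = sorted(labels_by_type)
    C = {(t, u): 0 for t in types for u in types}
    for u in types:
        for x in labels_by_type[u]:
            C[(predict(x), u)] += 1
    S = {}
    for t in types:
        for u in types:
            if t == u:
                S[(t, u)] = 1.0
            else:
                S[(t, u)] = (C[(t, u)] + C[(u, t)]) / len(labels_by_type[u])
    return C, S
```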
Fig. 2: Histogram of test accuracy
Next, we construct the similarity graph G = (V, E, w), an undirected weighted graph derived from S and a hyperparameter ε > 0 as follows:
Vertices: V = T.
Edges: The edges are the type pairs that have a fraction of labels misclassified between them:
E = {{t, t′} : t ∈ T, t′ ∈ T, S[t, t′] > ε}
The hyperparameter ε controls the sparsity of the graph.
Edge weights: The edge weights w(t, t′) represent the semantic similarity between the types t and t′. We want to transform the similarity measure S[t, t′] to edge weights in the range of [0, 1] via a nonlinear transformation. The nonlinearity allows us to control the sensitivity and threshold of semantic similarity of any two types. We choose to use a shifted and scaled sigmoid function as the transformation:
w(t, t′) = 1 / (1 + exp(−a(S[t, t′] − b)))
where a and b are constants. Later we will see that these weights will be used as the spring stiffness when laying out a visual representation of the similarity graph, so intuitively the values a and b provide control over the entity type spatial placements. For the DBpedia entity types, we used the values of a = 10 and b = 0.5, which work well for this dataset but can be tuned as needed.
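Putting the threshold ε and the sigmoid weighting together can be sketched as below. We take the sigmoid to be increasing in S so that more similar pairs receive stiffer springs; the sign convention in the exponent is our reading of the transformation:

```python
import math

def similarity_graph(S, types, eps=0.01, a=10.0, b=0.5):
    """Build edges between type pairs with similarity above eps, weighted
    by a shifted, scaled sigmoid of S (increasing in S, centered at b)."""
    edges = {}
    for i, t in enumerate(types):
        for u in types[i + 1:]:
            if S[(t, u)] > eps:
                edges[(t, u)] = 1.0 / (1.0 + math.exp(-a * (S[(t, u)] - b)))
    return edges
```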
C. Discovery of semantic relations
The similarity graph Gdefined in Section III-B can be used
to assist users in gaining insights into relations among entity
types. For example, it would be helpful to discover semantic
similarity between types Plant and Fungus, or ComicChar-
acter and FictionalCharacter. We argue that analysis of G
reveals groups of types that are mutually semantically similar.
1) Spatial Layout of Similarity Graph: We visually render
the similarity graph using the spring graph layout [18]. The
layout models the edges as springs and places the vertices by
simulating a mass-spring system. We use the edge weights as
spring constants. This way, types that are more similar will be
placed spatially closer to each other due to the stiffer spring
that connects them. Spring layout is sufficiently scalable to accommodate a large number of entity types. We can tune ε to control the sparsity of the graph. Some samples of the spatial
layout of entity types are shown in Figure 3.
2) Spectral Clustering: While visually viewing the types as
a similarity graph is valuable for end users, we also want to
generate clusters of semantically similar types in an automated
fashion. This can be done via graph-based clustering algo-
rithms. Using spectral clustering [19], we partition the graph
into k-components with cuts involving the least similar edges.
Treating the algorithm as a function SC : G → Powerset(T), we can apply it recursively. Given a cluster T = {t1, t2, . . .} of types, the subgraph G|T is the graph obtained from G by restricting the vertices to the types in T and the edges between these types. No changes are made to the edge weights. SC is then applied to G|T to partition T into k sub-clusters.
By training the neural network to perform entity type classification, we were able to obtain a useful similarity graph
from the resulting confusion matrix. Based on the similarity
graph, one can discover clusters of entity types that are seman-
tically similar. The discovery process can be done by visually
inspecting the graph layout, or it can be done automatically
using spectral clustering.
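One level of the recursive partitioning can be sketched with a Fiedler-vector bipartition. This is a simplification: the paper partitions into k components, and standard spectral clustering would use k eigenvectors followed by k-means, but the two-way split conveys the idea:

```python
import numpy as np

def spectral_bipartition(W):
    """One level of spectral clustering (k = 2) on a weighted adjacency
    matrix W: split vertices by the sign of the Fiedler vector of the
    graph Laplacian. Recursing on each side yields a cluster hierarchy."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)          # eigenvectors, ascending eigenvalues
    fiedler = vecs[:, 1]                 # eigenvector of second-smallest eigenvalue
    return fiedler >= 0                  # boolean cluster assignment per vertex
```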
D. Discovering anomalies in data
Data in real life is inherently dirty. Traditional database
constraints typically check for data integrity (primary key and
foreign key constraints in relational databases), and almost
never check for correctness at the semantic level. We are
motivated to find possible candidates of dirty entity labels by
identifying anomalies. Anomalies are labels that have been
misclassified by the model. That is,
anomalies(t) = {x ∈ labels(t) : argmax(model(x)) ≠ t}.
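As a sketch, with `predict` again standing in for argmax(model(x)):

```python
def anomalies(t, labels_by_type, predict):
    """Labels of type t that the model classifies as some other type;
    these are the dirty-data candidates."""
    return [x for x in labels_by_type[t] if predict(x) != t]
```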
In a data cleaning scenario, not only do we wish to present
a set of dirty label candidates but we also wish to provide an
explanation of why these labels are tagged as anomalies.
1) Model response over substrings: The model learns patterns from substrings as kernel weights using a 1D convolution layer (L2). We can generate an explanation of a misclassification by looking at how the model responds to all substrings. Suppose that the convolution kernel size is n, and the input x is a label with length L. Let t_true be the true type of x, and t_pred be the predicted type.
Let ngrams(x) = {x[i : i + n] : i ∈ [0, L − n]}, where x[i : j] denotes the substring of x from position i to j − 1. We can examine how the model interprets each n-gram x[i : i + n] ∈ ngrams(x) by its output probability model(x[i : i + n])[t_true]. The user can gain insight into how the model perceives the entity label x by plotting each n-gram position i against the model responses. An example of such a plot is shown in Figure 4a. The model response shows the key portions of x that make the label an anomaly based on a sliding window of 4-grams from the input label.
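The sliding-window scan can be sketched as follows; `model_prob(s, t)` is a hypothetical stand-in for model(s)[t]:

```python
def ngrams(x, n=4):
    """All length-n substrings of x (the sliding n-gram window)."""
    return [x[i:i + n] for i in range(len(x) - n + 1)]

def ngram_response(x, model_prob, t_true, n=4):
    """Model probability of the true type for each n-gram; plotting these
    against position i yields a plot like Figure 4a.
    `model_prob(s, t)` is a hypothetical stand-in for model(s)[t]."""
    return [model_prob(g, t_true) for g in ngrams(x, n)]
```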
(a) Plant related types
(b) Person related types
Fig. 3: Semantically similar type groups
2) Model response over prefixes: To better understand
where the model goes astray in classification, we can also
examine the model response to the prefixes of x.
Let prefixes(x) = {x[0 : i] : i ∈ [n, L]}. By plotting the position i and the responses model(x[0 : i])[t_true] and model(x[0 : i])[t_pred], the user sees where the model deviates from the true type of x. An example response to prefixes is shown in Figure 4b. We
will demonstrate in Section IV that these response plots assist
the user in finding and cleaning dirty data.
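The prefix trace can be sketched in the same style; `model_prob` is again a hypothetical stand-in for evaluating the model on a string:

```python
def prefixes(x, n=4):
    """All prefixes of x of length at least the kernel size n."""
    return [x[:i] for i in range(n, len(x) + 1)]

def prefix_responses(x, model_prob, t_true, t_pred, n=4):
    """Trace the model's probability for the true and predicted types as
    the prefix grows, as in Figure 4b. `model_prob` is a stand-in."""
    return [(model_prob(p, t_true), model_prob(p, t_pred))
            for p in prefixes(x, n)]
```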
IV. EVALUATION

We have applied the methods in Section III to analyze data
collected from DBpedia as of October 2016. The dataset has
over 15 million labels grouped into 463 types. These types are
organized into a hierarchy which we use as the ground truth
of semantic similarity during our evaluation.
A. Training the model
There are 460 distinct characters across all labels in the input, so each label is encoded as a sequence of integers in the range of [0, 461]. Since some labels are extremely long, we limit our analysis to the first 30 characters. We use 0 to encode symbols outside the input vocabulary, and also for padding as needed.
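The encoding step can be sketched as below; the mapping from characters to integers is hypothetical, built from whatever alphabet the data lake contains:

```python
def encode(label, vocab, max_len=30):
    """Integer-encode the first max_len characters of a label.
    0 marks out-of-vocabulary symbols and pads short labels."""
    codes = [vocab.get(c, 0) for c in label[:max_len]]
    return codes + [0] * (max_len - len(codes))
```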
We trained the same model shown in Figure 1 restricted
to 144 types which are part of the DBpedia ontology and
contain at least 1000 instances. The distribution of accuracy
over these types is shown in Figure 2. Accuracy is low for
many types which are semantically very similar. This is not a
problem since we do not rely on the model to make accurate
predictions of the entity type, but rather to discover semantic
relations through its classification errors.
B. Similarity Graph for DBpedia
To examine which types are most similar, we consider those
most confused by our model based on a sample of 1000 entity
labels for each type. Two types are considered confounded if
the model misclassifies at least 1% of the test data. The most
confounded pairs are shown in the following table.
Type Type Misclassification
Year TimePeriod 646
GivenName Name 645
Game Activity 615
Protein Biomolecule 606
Case UnitOfWork 472
Based on these confusions, we construct a similarity graph
with edges connecting pairs of types if they are confounded, enabling a visualization of the semantic relations among
the types. Figure 3 shows some interesting subgraphs when
visualizing the graph using a spring layout [18] with spring stiffness proportional to the rate of misclassification between types.
One can see in Figure 3a that our method suggests to the
analyst that the types Plant,Fungus,Eukaryote,Animal, and
Species are closely related. Their spatial layout suggests that
Plant is more related to Fungus than Animal. The subgraph
in Figure 3b shows how our method groups person-related
types into a common region. Other subgraphs in Figure 3a
also suggest we correctly identify strong semantic similarity
among types.
C. Spectral Clustering
We applied spectral clustering on the semantic similarity
graph to partition the graph into k= 20 clusters. Recall from
Section III-C2 that we first apply SC to the entire graph G,
and can optionally apply SC recursively to the subgraph G|T
of any cluster T. Some clusters from the first two levels of
the hierarchical spectral clustering are shown in Table I.
To measure how semantically homogeneous these clusters
are, we compare them with an existing type hierarchy given by
DBpedia. We define a measure of homogeneity of a cluster as
the percentage of distinct pairs that share a common supertype
in that cluster. Formally, let T = {t1, t2, . . .} be a cluster of types. We define the homogeneity of the cluster T as:
|{(t, t′) ∈ pairs(T) : t and t′ share a supertype}| / |pairs(T)|
where pairs(T) = {(t, t′) : t ∈ T, t′ ∈ T, t ≠ t′}.
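The homogeneity measure can be sketched as below; the supertype sets are hypothetical stand-ins for the DBpedia hierarchy:

```python
from itertools import permutations

def homogeneity(cluster, supertypes):
    """Fraction of ordered pairs of distinct types in the cluster that
    share a common supertype. `supertypes` maps each type to its set of
    ancestors in the type hierarchy (hypothetical here)."""
    prs = list(permutations(cluster, 2))
    if not prs:
        return 1.0
    shared = sum(1 for t, u in prs if supertypes[t] & supertypes[u])
    return shared / len(prs)
```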
Observe that most clusters produced by spectral clustering
of G have a homogeneity of 1, which implies that all types
(a) Model response to n-grams of the input. A higher value indicates a stronger prediction of the respective entity type from that n-gram.
(b) Model response to prefixes of the input. A higher value indicates a stronger prediction of the respective entity type from that prefix.
Fig. 4: An anomaly in DBpedia
in each cluster are a subtype of a common supertype. The
average homogeneity of top level clusters is 0.93. This strongly
supports our hypothesis that the similarity graph G captures
the semantic similarity. We can conclude that the visualization in Figure 3 and the clusters produced by spectral clustering in Table I provide valuable insights.
D. Substring-level anomaly explanation
We applied the methods from Section III-D to the same
dataset. To better understand how the misclassification oc-
curred, we plot the model response with respect to a sliding
window of n-grams with n= 4. We also plot the model
response with respect to increasingly long prefixes of each
label in Figure 4. Below are two anomalies identified by
our approach. The first anomaly is a false positive identified by the system. The label pac ct/4 airtrainer is misclassified as type Work, but its true type is MeansOfTransportation. Based on the response to the prefixes, we see that the prediction of the model was accurate up to the prefix pac ct/4 air. However, upon encountering the substring trainer, the model changed its prediction to the type Work.
Another anomaly is the label x = 10.5cm LEfh 18/40. The
dataset assigned the type TimePeriod, but our model predicts
the type to be Device. This is an example where our model
correctly identified dirty data in the dataset. The true type of
x is Weapon, which is a subtype of Device in the DBpedia
type hierarchy. In this case, not only did our model correctly
identify dirty data, it also suggested a reasonably correct entity
type, which can be presented to the user in an interactive
(a) Model response to n-grams
(b) Model response to prefixes
Fig. 5: A dirty label in DBpedia
data cleaning scenario. Furthermore, the plots of the model
responses to n-grams and prefixes of x provide insight into
how the error occurred. In Figure 5, we see that the substring
that contributes maximally to the type TimePeriod is 18/4.
(We note that DBpedia has since corrected this error and
correctly assigned the type Weapon.)
V. CONCLUSION

We presented a novel application of a neural network based
approach to semantic understanding of entity labels and their
types. Using a convolutional neural network, we can encode
key patterns at the substring level to infer entity types from
their labels. Our approach focuses on the confusion matrix
produced by the neural network, and produces a similarity
graph of the entity types.
We have demonstrated that the similarity graph can assist
users in gaining valuable insights into semantic relationships
among the entity types. These insights can be presented
visually or analytically. Our model can also be used to identify
data anomalies and provide explanations. Users can use our
approach to generate a list of dirty data candidates, and
examine the model responses to their n-grams and prefixes
to understand why they are tagged by the system. We demon-
strated the effectiveness of our methods using a dataset of
labelled entities from DBpedia with promising results.
As future work, we intend to explore more powerful model
architectures such as the transformer architecture [20], sequen-
tial convolutional architectures [21], and attention in neural
networks [22].
REFERENCES

[1] F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, and P. C. Arocena,
“Data lake management: Challenges and opportunities,” PVLDB, vol. 12,
no. 12, pp. 1986–1989, 2019.
[2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed
representations of words and phrases and their compositionality,” in
Proceedings of the 26th International Conference on Neural Information
Processing Systems - Volume 2, ser. NIPS’13. USA: Curran Associates
Inc., 2013, pp. 3111–3119.
[3] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors
for word representation,” in EMNLP. Doha, Qatar: Association for
Computational Linguistics, 2014, pp. 1532–1543.
[4] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee,
and L. Zettlemoyer, “Deep contextualized word representations,” in
Proceedings of the 2018 Conference of the NAACL: Human Language
Technologies. New Orleans, Louisiana: NAACL, 2018, pp. 2227–2237.
[5] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word
vectors with subword information,” Transactions of the Association for
Computational Linguistics, vol. 5, pp. 135–146, 2017.
[6] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for
efficient text classification,” in Proceedings of the 15th Conference of
the European Chapter of the Association for Computational Linguistics:
Volume 2, Short Papers. Valencia, Spain: Association for Computational
Linguistics, Apr. 2017, pp. 427–431.
[7] T. Pedersen, S. Patwardhan, and J. Michelizzi, “Wordnet::similarity:
Measuring the relatedness of concepts,” in Demonstration Papers at
HLT-NAACL 2004. Stroudsburg, PA, USA: Association for Compu-
tational Linguistics, 2004, pp. 38–41.
[8] R. Castro Fernandez, E. Mansour, A. A. Qahtan, A. Elmagarmid,
I. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang, “Seeping
semantics: Linking datasets using word embeddings for data discovery,”
in 2018 IEEE 34th International Conference on Data Engineering
(ICDE). Paris, France: IEEE, Apr 2018, pp. 989–1000.
[9] E. Sherkat and E. E. Milios, “Vector embedding of wikipedia concepts
and entities,” CoRR, vol. abs/1702.03470, 2017.
[10] A. D. Nobari, A. Askari, F. Hasibi, and M. Neshati, “Query understand-
ing via entity attribute identification,” 2018.
[11] M. Hulsebos, K. Z. Hu, M. A. Bakker, E. Zgraggen, A. Satyanarayan,
T. Kraska, C¸ . Demiralp, and C. A. Hidalgo, “Sherlock: A deep learning
approach to semantic data type detection,” in Proceedings of the 25th
ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, KDD 2019, A. Teredesai, V. Kumar, Y. Li, R. Rosales,
E. Terzi, and G. Karypis, Eds. Anchorage, AK: ACM, 2019, pp. 1500–
[12] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri, “InfoGather:
entity augmentation and attribute discovery by holistic matching with
web tables,” in Proceedings of the 2012 ACM SIGMOD International
Conference on Management of Data, ACM. Scottsdale, AZ, USA:
ACM, 2012, pp. 97–108.
[13] P. Venetis, A. Halevy, J. Madhavan, M. Pas¸ca, W. Shen, F. Wu, G. Miao,
and C. Wu, “Recovering semantics of tables on the web,” Proceedings
of the VLDB Endowment, vol. 4, no. 9, pp. 528–538, 2011.
[14] L. Qian, M. J. Cafarella, and H. Jagadish, “Sample-driven schema
mapping,” in Proceedings of the 2012 ACM SIGMOD International
Conference on Management of Data. Scottsdale, AZ, USA: ACM,
2012, pp. 73–84.
[15] V. Raman and J. M. Hellerstein, “Potter’s wheel: An interactive data
cleaning system,” in VLDB, vol. 1. Roma, Italy: VLDB Endowment,
2001, pp. 381–390.
[16] S. Krishnan, J. Wang, E. Wu, M. J. Franklin, and K. Goldberg, “Active-
Clean: Interactive data cleaning for statistical modeling,” Proceedings
of the VLDB Endowment, vol. 9, no. 12, pp. 948–959, 2016.
[17] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives,
“DBpedia: A nucleus for a web of open data,” in The semantic web.
Busan, Korea: Springer, 2007, pp. 722–735.
[18] T. M. Fruchterman and E. M. Reingold, “Graph drawing by force-
directed placement,” Software: Practice and experience, vol. 21, no. 11,
pp. 1129–1164, 1991.
[19] M. Meila, “Spectral clustering: a tutorial for the 2010’s,” in Handbook
of cluster analysis. Cleveland, OH: CRC Press, 2016, pp. 1–23.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, “Attention is all you need,” CoRR, vol.
abs/1706.03762, 2017.
[21] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional,
long short-term memory, fully connected deep neural networks,” in
2015 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). Brisbane, Australia: IEEE, 2015, pp. 4580–4584.
[22] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh, “Attention
models in graphs: A survey,” ACM Transactions on Knowledge Discov-
ery from Data (TKDD), vol. 13, no. 6, pp. 1–25, 2019.
ResearchGate has not been able to resolve any citations for this publication.
Conference Paper
Full-text available
Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on 686,765 data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
Conference Paper
Full-text available
Understanding searchers' queries is an essential component of semantic search systems. In many cases, search queries involve specific attributes of an entity in a knowledge base (KB), which can be further used to find query answers. In this study, we aim to move forward the understanding of queries by identifying their related entity attributes from a knowledge base. To this end, we introduce the task of entity attribute identification and propose two methods to address it: (i) a model based on Markov Random Field, and (ii) a learning to rank model. We develop a human annotated test collection and show that our proposed methods can bring significant improvements over the baseline methods.
Graph-structured data arise naturally in many different application domains. By representing data as graphs, we can capture entities (i.e., nodes) as well as their relationships (i.e., edges) with each other. Many useful insights can be derived from graph-structured data as demonstrated by an ever-growing body of work focused on graph mining. However, in the real-world, graphs can be both large—with many complex patterns—and noisy, which can pose a problem for effective graph mining. An effective way to deal with this issue is to incorporate “attention” into graph mining solutions. An attention mechanism allows a method to focus on task-relevant parts of the graph, helping it to make better decisions. In this work, we conduct a comprehensive and focused survey of the literature on the emerging field of graph attention models. We introduce three intuitive taxonomies to group existing work. These are based on problem setting (type of input and output), the type of attention mechanism used, and the task (e.g., graph classification, link prediction). We motivate our taxonomies through detailed examples and use each to survey competing approaches from a unique standpoint. Finally, we highlight several challenges in the area and discuss promising directions for future work.
The ubiquity of data lakes has created fascinating new challenges for data management research. In this tutorial, we review the state-of-the-art in data management for data lakes. We consider how data lakes are introducing new problems including dataset discovery and how they are changing the requirements for classic problems including data extraction, data cleaning, data integration, data versioning, and metadata management.
Conference Paper
Employees that spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas, aggravates this problem. Similar to how we navigate the Web today, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other, to facilitate navigating the schemas. They also relate data to external data sources such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher which leverages word embeddings to find objects that are semantically related. To leverage word embeddings, we introduce coherent groups, a novel technique to combine them which works better than other state-of-the-art alternatives for this problem. We implement SEMPROP as part of a discovery system we are building and conduct user studies, real deployments and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups to find those links.
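The core idea of matching schema elements by word-embedding similarity can be sketched in a few lines (a toy illustration using hypothetical random vectors, not the SEMPROP implementation or its coherent-groups technique; a real matcher would load pretrained embeddings such as word2vec or fastText):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy embedding table; in practice these vectors
# come from a pretrained model, not random initialization.
vocab = ["customer", "client", "name", "invoice", "temperature"]
emb = {w: rng.normal(size=16) for w in vocab}
# Make "customer" and "client" deliberately similar for the demo.
emb["client"] = emb["customer"] + 0.05 * rng.normal(size=16)

def phrase_vec(phrase):
    """Average the word vectors of a multi-token schema name."""
    vs = [emb[t] for t in phrase.split("_") if t in emb]
    return np.mean(vs, axis=0)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related column names score higher than unrelated ones.
print(cosine(phrase_vec("customer_name"), phrase_vec("client_name")))
print(cosine(phrase_vec("customer_name"), phrase_vec("invoice_temperature")))
```

Pairs of columns whose similarity exceeds a threshold would become candidate semantic links in the knowledge graph.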
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
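The skip-gram objective with negative sampling described above can be illustrated with a minimal NumPy sketch (toy vocabulary size, learning rate, and sample count are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy vocabulary and the two embedding tables skip-gram maintains.
vocab_size, dim = 10, 4
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors

def sgns_step(center, context, k=3, lr=0.05):
    """One SGD step: attract the true (center, context) pair,
    repel k randomly sampled negative contexts."""
    v = W_in[center].copy()
    # Positive pair: gradient of -log sigmoid(v . u_ctx).
    u = W_out[context].copy()
    g = sigmoid(v @ u) - 1.0          # negative value: attract
    W_out[context] -= lr * g * v
    grad_v = g * u
    # Negative samples: gradient of -log sigmoid(-v . u_neg).
    for n in rng.integers(0, vocab_size, size=k):
        u_n = W_out[n].copy()
        g_n = sigmoid(v @ u_n)        # positive value: repel
        W_out[n] -= lr * g_n * v
        grad_v += g_n * u_n
    W_in[center] -= lr * grad_v

# Repeatedly train on a single co-occurring pair.
for _ in range(200):
    sgns_step(center=1, context=2)

# The trained pair should now receive a high co-occurrence score.
print(sigmoid(W_in[1] @ W_out[2]))
```

Each update touches only one positive and k negative rows, which is what makes negative sampling much cheaper than a full softmax over the vocabulary.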
Using deep learning for different machine learning tasks such as image classification and word embedding has recently gained much attention. Its appealing performance across specific Natural Language Processing (NLP) tasks in comparison with other approaches is the reason for its popularity. Word embedding is the task of mapping words or phrases to a low-dimensional numerical vector. In this paper, we use deep learning to embed Wikipedia Concepts and Entities. The English version of Wikipedia contains more than five million pages, which suggests its capability to cover many English Entities, Phrases, and Concepts. Each Wikipedia page is considered as a concept. Some concepts correspond to entities, such as a person's name, an organization or a place. Contrary to word embedding, Wikipedia Concept Embedding is not ambiguous, so there are different vectors for concepts with similar surface forms but different mentions. We propose several approaches and evaluate their performance on Concept Analogy and Concept Similarity tasks. The results show that the proposed approaches achieve performance comparable to, and in some cases higher than, state-of-the-art methods.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
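The scaled dot-product attention at the heart of the Transformer can be sketched in a few lines of NumPy (a minimal single-head illustration of Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, not the paper's full multi-head implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by the softmax-normalized, scaled
    dot product between its key and the query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # query-key compatibility
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

# 3 query positions attending over 4 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # one d_k-dimensional output per query position
```

The 1/√d_k scaling keeps the dot products from saturating the softmax as the key dimension grows, which is what allows training without recurrence or convolutions.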