School of Systems Engineering
Investigating Elastic Cloud Based RDF Processing
by
Omer Dawelbeit
Thesis submitted for the degree of Doctor of Philosophy
School of Systems Engineering
April 2016
University of Reading
Investigating Elastic Cloud Based RDF Processing
Omer Dawelbeit
PhD Thesis, School of Systems Engineering,
University of Reading, April 2016
Copyright © 2016 Omer Dawelbeit. All Rights Reserved.
Abstract
The Semantic Web was proposed as an extension of the traditional Web to give Web
data context and meaning by using the Resource Description Framework (RDF) data
model. The recent growth in the adoption of RDF, together with the massive growth
of RDF data, has led numerous efforts to focus on the challenges of processing this
data. To this extent, many approaches have focused on vertical scalability by utilis-
ing powerful hardware, or horizontal scalability utilising always-on physical computer
clusters or peer to peer networks. However, these approaches utilise fixed and high
specification computer clusters that require considerable upfront and ongoing invest-
ments to deal with the data growth. In recent years cloud computing has seen wide
adoption due to its unique elasticity and utility billing features.
This thesis addresses some of the issues related to the processing of large RDF datasets
by utilising cloud computing. Initially, the thesis reviews the background literature
of related distributed RDF processing work and issues, in particular distributed rule-
based reasoning and dictionary encoding, followed by a review of the cloud computing
paradigm and related literature. Then, in order to fully utilise features that are spe-
cific to cloud computing such as elasticity, the thesis designs and fully implements
a Cloud-based Task Execution framework (CloudEx), a generic framework for effi-
ciently distributing and executing tasks on cloud environments. Subsequently, some
of the large-scale RDF processing issues are addressed by using the CloudEx frame-
work to develop algorithms for processing RDF using cloud computing. These algo-
rithms perform efficient dictionary encoding and forward reasoning using cloud-based
columnar databases. The algorithms are collectively implemented as an Elastic Cost
Aware Reasoning Framework (ECARF), a cloud-based RDF triple store. This thesis
presents original results and findings that advance the state of the art of performing
distributed cloud-based RDF processing and forward reasoning.
Declaration
I confirm that this is my own work and the use of all material from other sources has
been properly and fully acknowledged.
Omer Dawelbeit
Dedication
To Amna and Ibrahim, my wonderful mother and father for their everlasting love,
support and motivation.
Acknowledgements
“In the name of God, the Most Gracious, the Most Merciful”
All praise be to my God and Lord who created me and taught me that which I knew not,
who helped me get through difficult times.
Being part-time, the PhD journey has been a long one for me, but it was the journey of
a lifetime, a journey of discovery and exploration. Since I was young, doing a PhD was
my dream, to be like my dad whose PhD journey has inspired the whole family. There are
many people that supported, motivated and inspired me throughout this journey.
Firstly, I would like to thank my supervisor, Professor Rachel McCrindle, her kind encour-
agements and motivation gave me the strength and energy needed to complete the journey,
her guidance made the path very clear.
I’m ever so grateful to my parents Amna and Ibrahim for their supplications and for teaching
me to always seek knowledge and aspire to achieve. I would like to thank all my brothers
and sisters for looking up to me as their elder brother, Emad, Mariam, Mohammed, Huda,
Tasneem, Yousra and Osman. I would like to thank Osman in particular for the interesting
discussions and for patiently listening to me talk about my research challenges.
My special thanks go to my companions in this journey, my wife Amna and my children
Laila, Sarah and Muhammad. Amna has been the fuel of this journey with her love, patience,
support and understanding, despite the fact that she was also busy on her own PhD. My
children have inspired me to carry on and achieve to make them proud. They have been
patient and lived with me through every moment of this journey; even Muhammad, my
four-year-old son, regularly asked me how the thesis chapters were progressing.
I would like to thank Google for their great technologies, developer support and documentation.
Google support kindly and promptly lifted quota restrictions that I requested
to complete the evaluation of this research. They have kindly provided free credits to use
the Google Cloud Platform.
Research Outcome
Publications
•’ECARF: Distributed RDFS Reasoning in the Google Cloud’ - IEEE Transac-
tions on Cloud Computing Journal - Jun 2015. Submitted.
Conference Papers
•’A Novel Cloud Based Elastic Framework for Big Data Preprocessing’ - Sixth
Computer Science and Electronic Engineering Conference 2014. Proceedings to
be published by the IEEE - Sep 2014.
•’Efficient Dictionary Compression for Processing RDF Big Data Using Google
BigQuery’ - IEEE Communications Conference (Globecom 2016) - Dec 2016.
Submitted.
Posters
•Investigating Elastic Cloud Based Reasoning for the Semantic Web - University
of Reading postgraduate conference poster - Jul 2014.
Presentations
•The British Computer Society doctoral consortium presentation - May 2014.
•University of Reading 3 minutes thesis competition presentation - Jul 2014.
•PhD proposal accepted for presentation at the International Semantic Web Con-
ference (ISWC) 2014 - Oct 2014.
•’Google BigQuery, processing big data without the infrastructure’ - Reading &
Thames Valley GDG group - Aug 2015.
Recognitions and Awards
•Recognised by Google as a Google Developer Expert on the Google Cloud Plat-
form - Nov 2015.
•Supplied with monthly Google Cloud Platform credit.
Open Source Contributions and Impact
•ECARF1, an elastic cloud-based RDF triple store for RDF processing and RDFS
reasoning.
•CloudEx2, a generic elastic cloud-based framework for the execution of embar-
rassingly parallel tasks.
1http://ecarf.io
2http://cloudex.io
Table of Contents
Abstract ii
Declaration iv
Dedication v
Acknowledgments vi
Research Outcome vii
Table of Contents xviii
List of Tables xx
List of Figures xxii
1 Introduction 1
1.1 The Semantic Web ............................ 1
1.2 The Resource Description Framework . . . . . . . . . . . . . . . . . . 2
1.2.1 RDF Data on the Web . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Use Cases ............................. 4
1.3 The Growth to Big Data . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Challenges with Physical Computing . . . . . . . . . . . . . . . . . . 6
1.5 The Potential of Cloud Computing . . . . . . . . . . . . . . . . . . . 7
1.6 Motivation................................. 8
1.7 Research Aim and Questions . . . . . . . . . . . . . . . . . . . . . . . 9
1.7.1 Scope of Research . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.7.2 Impact of Research . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8 Technical Approach ........................... 11
1.9 Contributions ............................... 13
1.10 Outline of the Thesis .......................... 14
2 Background on the Resource Description Framework 17
2.1 Resource Description Framework . . . . . . . . . . . . . . . . . . . . 18
2.1.1 RDF Schema ........................... 20
2.1.2 The Semantic Web and Ontologies . . . . . . . . . . . . . . . 22
2.1.2.1 Description Logics . . . . . . . . . . . . . . . . . . . 23
2.1.2.2 Web Ontology Language (OWL) . . . . . . . . . . . 24
2.1.3 RDFS Entailment Rules . . . . . . . . . . . . . . . . . . . . . 25
2.1.3.1 A Little Semantics Goes a Long Way . . . . . . . . . 27
2.1.4 Minimal and Efficient Subset of RDFS . . . . . . . . . . . . . 27
2.1.5 RDF Triple Stores . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 Distributed Semantic Web Reasoning . . . . . . . . . . . . . . . . . . 29
2.2.1 Peer to Peer Networks . . . . . . . . . . . . . . . . . . . . . . 29
2.2.1.1 Forward Reasoning on Top of DHTs . . . . . . . . . 30
2.2.1.2 Alternatives to Term Based Partitioning . . . . . . . 30
2.2.2 Grid and Parallel Computing . . . . . . . . . . . . . . . . . . 31
2.2.2.1 MapReduce Based Reasoning . . . . . . . . . . . . . 31
2.2.2.2 Spark Based Reasoning . . . . . . . . . . . . . . . . 32
2.2.2.3 Authoritative Distributed Reasoning . . . . . . . . . 33
2.2.2.4 Embarrassingly Parallel Reasoning . . . . . . . . . . 34
2.2.3 Centralised Systems . . . . . . . . . . . . . . . . . . . . . . . 34
2.3 RDF Data Management on the Cloud . . . . . . . . . . . . . . . . . . 35
2.4 RDF Data Compression . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.1 MapReduce-based Dictionary Encoding . . . . . . . . . . . . . 36
2.4.2 Supercomputers-based Dictionary Encoding . . . . . . . . . . 37
2.4.3 DHT-based Dictionary Encoding . . . . . . . . . . . . . . . . 38
2.4.4 Centralised Dictionary Encoding . . . . . . . . . . . . . . . . 38
2.5 The Challenges .............................. 39
2.5.1 Dictionary Encoding . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.2 Data Storage ........................... 40
2.5.3 Workload Distribution . . . . . . . . . . . . . . . . . . . . . . 41
2.6 Summary ................................. 42
3 The Cloud Computing Paradigm 44
3.1 Background ................................ 45
3.1.1 Cloud Service Models . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.2 Cloud Benefits .......................... 46
3.2 Public Cloud Deployment Overview . . . . . . . . . . . . . . . . . . . 47
3.3 Cloud Services .............................. 48
3.3.1 Virtual Machines . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.1.1 VM Image ........................ 49
3.3.1.2 CPU Cores and RAM . . . . . . . . . . . . . . . . . 50
3.3.1.3 Virtual Disk and Snapshots . . . . . . . . . . . . . . 50
3.3.1.4 VM Metadata ...................... 51
3.3.2 Cloud Storage .......................... 51
3.3.3 Big Data Services ........................ 52
3.4 Google Cloud Platform . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.1 Google Compute Engine . . . . . . . . . . . . . . . . . . . . . 53
3.4.1.1 Compute Engine Metadata . . . . . . . . . . . . . . 54
3.4.1.2 Compute Engine Pricing . . . . . . . . . . . . . . . . 55
3.4.1.3 Compute Engine API . . . . . . . . . . . . . . . . . 55
3.4.2 Google Cloud Storage . . . . . . . . . . . . . . . . . . . . . . 56
3.4.2.1 Cloud Storage API . . . . . . . . . . . . . . . . . . . 56
3.4.3 Google BigQuery . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.3.1 BigQuery SQL . . . . . . . . . . . . . . . . . . . . . 57
3.4.3.2 BigQuery API . . . . . . . . . . . . . . . . . . . . . 58
3.4.3.3 BigQuery pricing . . . . . . . . . . . . . . . . . . . . 59
3.5 The Potential of Cloud Computing . . . . . . . . . . . . . . . . . . . 59
3.5.1 Migration to the Cloud . . . . . . . . . . . . . . . . . . . . . . 60
3.5.1.1 High Performance Computing . . . . . . . . . . . . . 60
3.5.1.2 Interoperability and Migration of Cluster Frameworks 61
3.5.2 Observing Cost and Time Constraints . . . . . . . . . . . . . 62
3.5.3 Cloud-First Frameworks . . . . . . . . . . . . . . . . . . . . . 63
3.6 Summary ................................. 64
4 Research Methodology 65
4.1 Design Science Research Methodology . . . . . . . . . . . . . . . . . . 66
4.2 Addressing RDF Processing Issues . . . . . . . . . . . . . . . . . . . . 68
4.3 Utilising Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Cloud-First Design . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Utilising Cloud Elasticity . . . . . . . . . . . . . . . . . . . . . 70
4.3.3 A Divide-Conquer Strategy . . . . . . . . . . . . . . . . . . . 70
4.4 Model Development ........................... 71
4.4.1 CloudEx Framework Requirements . . . . . . . . . . . . . . . 72
4.5 Prototype Development and Evaluation . . . . . . . . . . . . . . . . . 74
4.6 Summary ................................. 74
5 CloudEx, a Cloud First Framework 76
5.1 High Level Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.1 Key CloudEx Definitions . . . . . . . . . . . . . . . . . . . . . 78
5.1.2 The Lifecycle of Coordinators and Processors . . . . . . . . . 81
5.1.3 The Tasks Flow .......................... 83
5.2 Dealing with Tasks Input and Output . . . . . . . . . . . . . . . . . . 84
5.2.1 The Job Context ......................... 84
5.2.2 Input and Output Resolution . . . . . . . . . . . . . . . . . . 86
5.2.3 Handling Input .......................... 87
5.2.4 Handling Output ......................... 88
5.3 Partitioning the Workload . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1 Bin Packing Partitioning . . . . . . . . . . . . . . . . . . . . . 89
5.3.1.1 Full Bin Strategy . . . . . . . . . . . . . . . . . . . . 90
5.3.1.2 Calculating The Bin Capacity . . . . . . . . . . . . . 90
5.3.1.3 Calculating The Number of Bins . . . . . . . . . . . 91
5.4 Defining Jobs ............................... 91
5.4.1 Job Data ............................. 92
5.4.2 Virtual Machine Configurations . . . . . . . . . . . . . . . . . 92
5.4.3 Tasks Definition ......................... 93
5.4.3.1 Task Partitioning Configuration . . . . . . . . . . . . 96
5.5 CloudEx Job Execution in Detail . . . . . . . . . . . . . . . . . . . . 96
5.5.1 Duties of the Coordinator . . . . . . . . . . . . . . . . . . . . 96
5.5.1.1 Running Coordinator Tasks . . . . . . . . . . . . . . 97
5.5.1.2 Running Processor Tasks . . . . . . . . . . . . . . . 97
5.5.1.3 Error Handling . . . . . . . . . . . . . . . . . . . . . 99
5.5.2 Elastic Processors . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5.3 The Processor Duties . . . . . . . . . . . . . . . . . . . . . . . 101
5.5.3.1 Error Handling . . . . . . . . . . . . . . . . . . . . . 101
5.6 Implementation .............................. 102
5.6.1 cloudex-core Component . . . . . . . . . . . . . . . . . . . . . 103
5.6.2 cloudex-google Component . . . . . . . . . . . . . . . . . . . . 104
5.6.3 User-Defined Tasks . . . . . . . . . . . . . . . . . . . . . . . . 104
5.6.4 User-Defined Partitioning Functions . . . . . . . . . . . . . . . 105
5.6.5 VM Image Setup ......................... 105
5.6.6 Running the Framework . . . . . . . . . . . . . . . . . . . . . 106
5.7 Summary ................................. 106
6 ECARF, Processing RDF on the Cloud 108
6.1 ECARF Overview ............................. 110
6.1.1 Distributed Processing . . . . . . . . . . . . . . . . . . . . . . 111
6.1.2 Dictionary Encoding . . . . . . . . . . . . . . . . . . . . . . . 112
6.1.3 Forward Reasoning . . . . . . . . . . . . . . . . . . . . . . . . 112
6.2 Dictionary Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2.1 Reducing the Size of URIRefs . . . . . . . . . . . . . . . . . . 114
6.2.2 Efficient URI Reference Compression . . . . . . . . . . . . . . 115
6.2.3 Encoding Terms .......................... 116
6.2.4 Decoding Terms .......................... 118
6.2.5 Storage Considerations . . . . . . . . . . . . . . . . . . . . . . 119
6.2.6 Dictionary Encoding Tasks . . . . . . . . . . . . . . . . . . . . 119
6.2.6.1 Extract Terms Task . . . . . . . . . . . . . . . . . . 119
6.2.6.2 Assemble Dictionary Task . . . . . . . . . . . . . . 120
6.2.6.3 Encode Data Task . . . . . . . . . . . . . . . . . . . 120
6.3 Distributed RDFS Reasoning . . . . . . . . . . . . . . . . . . . . . . 120
6.3.1 Handling Schema Triples . . . . . . . . . . . . . . . . . . . . . 121
6.3.2 Handling Instance Triples . . . . . . . . . . . . . . . . . . . . 122
6.3.3 Performing Forward Reasoning Using BigQuery . . . . . . . . 123
6.3.3.1 The Relevant Term . . . . . . . . . . . . . . . . . . . 125
6.3.3.2 Query Optimisation . . . . . . . . . . . . . . . . . . 126
6.3.3.3 Distributing the Reasoning Process . . . . . . . . . . 127
6.3.4 The Schema Terms Analysis Step . . . . . . . . . . . . . . . . 127
6.3.5 The Instance Triples Count Step . . . . . . . . . . . . . . . . 128
6.3.6 Workload Partitioning . . . . . . . . . . . . . . . . . . . . . . 129
6.4 ECARF Architecture Walkthrough . . . . . . . . . . . . . . . . . . . 130
6.4.1 The Schema Term Analysis Task . . . . . . . . . . . . . . . . 131
6.4.2 Distributing the Count Task . . . . . . . . . . . . . . . . . . . 131
6.4.3 The Instance Triple Count / Extract Dictionary Parts Task . 132
6.4.4 Assemble Dictionary Task . . . . . . . . . . . . . . . . . . . . 133
6.4.5 Transform and Encode Data Tasks . . . . . . . . . . . . . . . 134
6.4.6 Aggregate Processors Results Task . . . . . . . . . . . . . . . 134
6.4.7 Load Files into BigQuery Task . . . . . . . . . . . . . . . . . . 135
6.4.8 The Forward Reasoning Task . . . . . . . . . . . . . . . . . . 136
6.5 Summary ................................. 137
7 Evaluation of Cloud Based RDF Processing 139
7.1 Research Questions ........................... 140
7.2 Experiments Setup ............................ 141
7.2.1 Implementation.......................... 142
7.2.2 Platform Setup .......................... 143
7.2.2.1 Virtual Machine and Disk Types . . . . . . . . . . . 143
7.2.3 Results Gathering . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.2.4 Datasets.............................. 145
7.3 Common Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . 146
7.3.1 Runtime and Scalability . . . . . . . . . . . . . . . . . . . . . 146
7.3.2 Cost of Computing Resources . . . . . . . . . . . . . . . . . . 147
7.3.3 Multiple CPU Cores . . . . . . . . . . . . . . . . . . . . . . . 147
7.3.4 Triples Throughput . . . . . . . . . . . . . . . . . . . . . . . . 148
7.4 Distributed RDF Processing . . . . . . . . . . . . . . . . . . . . . . . 148
7.4.1 Partitioning Factor . . . . . . . . . . . . . . . . . . . . . . . . 149
7.4.1.1 ExtractCountTerms2PartTask . . . . . . . . . . . . . 149
7.4.1.2 ProcessLoadTask . . . . . . . . . . . . . . . . . . . . 151
7.4.1.3 Partitioning Factor Summary . . . . . . . . . . . . . 152
7.4.2 Horizontal Scalability . . . . . . . . . . . . . . . . . . . . . . . 154
7.4.3 Comparison of LUBM 8K Load . . . . . . . . . . . . . . . . . 155
7.4.4 Vertical Scalability . . . . . . . . . . . . . . . . . . . . . . . . 156
7.5 Dictionary Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.5.1 URIRefs Split Strategy . . . . . . . . . . . . . . . . . . . . . . 159
7.5.2 Dictionary Assembly Memory Footprint . . . . . . . . . . . . 160
7.5.3 BigQuery Scanned Bytes Improvements . . . . . . . . . . . . . 162
7.5.4 Comparison of Dictionary Encoding . . . . . . . . . . . . . . . 162
7.6 Forward Reasoning ............................ 163
7.6.1 Forward Reasoning Optimisations . . . . . . . . . . . . . . . . 164
7.6.2 Results of Performing Forward Reasoning . . . . . . . . . . . 167
7.6.3 Runtime, Load Balancing and Cost . . . . . . . . . . . . . . . 168
7.6.3.1 Load Balancing . . . . . . . . . . . . . . . . . . . . . 170
7.6.3.2 Cost of Performing Forward Reasoning . . . . . . . . 171
7.6.4 BigQuery Data Import and Export . . . . . . . . . . . . . . . 171
7.6.5 Comparison of Forward RDFS Reasoning Throughput . . . . . 172
7.6.6 Forward Reasoning Conclusion . . . . . . . . . . . . . . . . . 173
7.7 CloudEx Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.7.1 CloudEx Improvements . . . . . . . . . . . . . . . . . . . . . . 174
7.7.2 Cloud Platform Observations . . . . . . . . . . . . . . . . . . 175
7.7.3 Cost Considerations . . . . . . . . . . . . . . . . . . . . . . . 175
7.8 Summary ................................. 176
8 Conclusions 178
8.1 The CloudEx Framework Contributions . . . . . . . . . . . . . . . . . 180
8.1.1 Summary of CloudEx . . . . . . . . . . . . . . . . . . . . . . . 180
8.1.2 Effectiveness and Feasibility . . . . . . . . . . . . . . . . . . . 181
8.2 The ECARF Triple Store Contributions . . . . . . . . . . . . . . . . 181
8.2.1 Summary of Dictionary Encoding . . . . . . . . . . . . . . . . 182
8.2.2 Effectiveness of Dictionary Encoding . . . . . . . . . . . . . . 183
8.2.3 Summary of Forward Reasoning . . . . . . . . . . . . . . . . . 183
8.2.4 Effectiveness of Forward Reasoning . . . . . . . . . . . . . . . 184
8.3 Open Source Contributions and Impact . . . . . . . . . . . . . . . . . 184
8.4 Future Work ................................ 185
8.4.1 CloudEx Improvements . . . . . . . . . . . . . . . . . . . . . . 185
8.4.1.1 Resilience ........................ 186
8.4.1.2 Autoscaling of Resources . . . . . . . . . . . . . . . . 186
8.4.1.3 Implementation for Other Clouds . . . . . . . . . . . 186
8.4.2 ECARF Improvements . . . . . . . . . . . . . . . . . . . . . . 187
8.4.2.1 Triple Store Capability . . . . . . . . . . . . . . . . . 187
8.4.2.2 Redundant Triples . . . . . . . . . . . . . . . . . . . 188
References 188
Appendices 206
Appendix A ECARF Tasks And Job Definition 207
A.1 ECARF Tasks ............................... 207
A.2 ECARF End-to-End Job Definition . . . . . . . . . . . . . . . . . . . 209
Appendix B CloudEx and ECARF Source Code 216
List of Tables
2.1 Schema-Instance (SI) entailment rules of the ρdf fragment of RDFS,
(sc = subClassOf, sp = subPropertyOf, dom = domain)........ 26
2.2 Schema entailment rules of the ρdf fragment of RDFS (sc = subClassOf,
sp = subPropertyOf ) ........................... 26
5.1 Example CloudEx processor task metadata . . . . . . . . . . . . . . . 99
5.2 Example CloudEx processor error metadata . . . . . . . . . . . . . . 99
6.1 Encoded example of the first two rows in Table 6.3 . . . . . . . . . . 115
6.2 Information Bits for URI Reference Compression . . . . . . . . . . . . 117
6.3 Example Triple Table for the Instance Data in Figure 2.1(c) . . . . . 122
6.4 BigQuery SQL-like queries for the SI rules reasoning, (sc = subClassOf,
sp = subPropertyOf, dom = domain) .................. 124
6.5 Sample ECARF instance triples count / extract dictionary parts task
processor metadata ........................... 132
7.1 Evaluation Datasets ........................... 145
7.2 Partitioning Factor runtime (seconds) comparison for DBpedia using 8
n1-highmem-4 processors. . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3 LUBM 8K coordinator (C) and processors (P) runtime (seconds), speedup
(Sp.) and efficiency (Eff.) for ExtractCountTerms2PartTask and Pro-
cessLoadTask on up to 16 n1-standard-2 processors. . . . . . . . . . . 153
7.4 LUBM dataset load runtime comparison . . . . . . . . . . . . . . . . 156
7.5 LUBM 8K load runtime, speedup and efficiency on a single n1-standard-1
to n1-standard-16 processor. . . . . . . . . . . . . . . . . . . . . . . . 157
7.6 Comparison of Standard, Hostname + first path part (1stPP) and Host-
name dictionary split strategies for Swetodblp, DBpedia and LUBM
datasets................................... 158
7.7 Dictionary encoding metrics, such as encoded data and dictionary sizes,
compression rate and BigQuery scanned bytes improvements for Swe-
todblp, DBpedia and LUBM. . . . . . . . . . . . . . . . . . . . . . . 158
7.8 Comparison of large scale RDF dictionary encoding. . . . . . . . . . . 163
7.9 BigQuery reasoning improvements comparison for Swetodblp. . . . . 166
7.10 BigQuery reasoning results for LUBM. . . . . . . . . . . . . . . . . . 169
7.11 BigQuery reasoning results for DBpedia. . . . . . . . . . . . . . . . . 169
7.12 LUBM and DBpedia BigQuery reasoning on one n1-standard-8 processor.169
7.13 Comparison of Forward RDFS Reasoning . . . . . . . . . . . . . . . . 173
A.1 ECARF Tasks Summary . . . . . . . . . . . . . . . . . . . . . . . . . 208
List of Figures
1.1 Research Context ............................. 11
1.2 Research Methodology and Phases. . . . . . . . . . . . . . . . . . . . 12
1.3 Thesis Outline .............................. 16
2.1 Running example RDF data and graph representation. . . . . . . . . 21
3.1 Cloud Deployment Overview. . . . . . . . . . . . . . . . . . . . . . . 48
4.1 The general methodology of design science research (Source: Vaishnavi
andKuechler[1]) ............................. 67
4.2 Proposed RDF Triple Store Design . . . . . . . . . . . . . . . . . . . 68
4.3 Research Design Methodology . . . . . . . . . . . . . . . . . . . . . . 73
5.1 CloudEx High Level Architecture. . . . . . . . . . . . . . . . . . . . . 78
5.2 CloudEx coordinator and processor lifecycle. . . . . . . . . . . . . . . 82
5.3 CloudEx tasks flow ............................ 84
5.4 CloudEx tasks input and output. . . . . . . . . . . . . . . . . . . . . 85
5.5 Bin packing workload partitioning. . . . . . . . . . . . . . . . . . . . 90
5.6 CloudEx job entities ........................... 93
5.7 CloudEx distributed job execution. . . . . . . . . . . . . . . . . . . . 95
5.8 CloudEx high level components. . . . . . . . . . . . . . . . . . . . . . 102
6.1 ECARF High Level Activities . . . . . . . . . . . . . . . . . . . . . . 109
6.2 ECARF High Level Architecture . . . . . . . . . . . . . . . . . . . . 111
6.3 Dictionary Encoding Using Binary Interleaving . . . . . . . . . . . . 117
6.4 ECARF Architecture Walkthrough . . . . . . . . . . . . . . . . . . . 130
6.5 ECARF Schema Terms Analysis and Distribute Count Tasks . . . . . 132
6.6 ECARF Assemble Dictionary and Transform/Encode Data Tasks . . 134
6.7 ECARF Aggregate Results and Reason Tasks . . . . . . . . . . . . . 135
7.1 Partitioning factor on DBpedia using 8 n1-highmem-4 processors (4
cores, 26GB RAM). ........................... 150
7.2 LUBM 8K end to end and processors runtime (log-scale) vs. number
of processors for ExtractCountTerms2PartTask and ProcessLoadTask
on up to 16 n1-standard-2 processors. . . . . . . . . . . . . . . . . . . 153
7.3 Comparison of dictionary size (logscale) when using Standard, Host-
name + first path part (1stPP) and Hostname split strategies. . . . . 160
7.4 Dictionary assembly memory usage for DBpedia and LUBM datasets. 161
7.5 BigQuery reasoning improvements comparison for SwetoDblp. . . . . 166
7.6 BigQuery export and import times vs. retrieved / inserted rows for the
DBpedia and LUBM datasets. . . . . . . . . . . . . . . . . . . . . . 168
7.7 Reason task time (mins), BigQuery scanned bytes (GB) and Cost (USD
cent) for forward reasoning on DBpedia and LUBM using BigQuery. 168
Chapter 1
Introduction
The World Wide Web is a collection of interlinked hypertext documents designed
primarily for humans to read, with search engines being used to crawl and index
these pages to provide search capability. However, this search is usually based on
keyword matching, rather than the meaning of the content. For example, searching
to get an intelligent answer to the question "Does John work at iNetria?" would
simply return Web pages with text that matches the words in the question. There
is, therefore, a need for a mechanism to formally express the meaning of data on the
Web; hence the Semantic Web was proposed [2]. In 2001, Berners-Lee et al. outlined
the vision of the Semantic Web as follows:
“The Semantic Web is not a separate Web but an extension of the
current one, in which information is given well-defined meaning, better
enabling computers and people to work in cooperation”
1.1 The Semantic Web
The Semantic Web was proposed as an extension of the traditional Web to give Web
data context and meaning (semantics). Consequently, intelligent software applications
that can parse these semantics can be developed to assist humans with many tasks.
One of these tasks is question answering, to answer questions like the one posed
previously. Other examples include semantic search, social discovery, and content
enrichment and publishing, to mention just a few.
to become a reality, the meaning of data must be expressed in a formal way. This
requires the definition of common vocabulary and formal semantics that both humans
and software applications can understand. Software applications can then understand
the meaning of this data and deduce new knowledge from it. Consider for example
the following statements available on the Web and represented in plain English:
•John is an employee
•John works at iNetria
•John’s full name is ”John Smith”
It is easy for humans to understand these statements and to deduce that the in-
dividual named John is a person working at a business named iNetria. A software
application, however, will not be able to deduce such implicit knowledge straightaway.
These statements need to be represented in a formal language that the application
understands. Additionally, the application needs to be supplied with a set of rules
on how to deduce that extra implicit knowledge. Another issue that might face such
an application is the context as to which John the user has in mind; there might be
many. To express such knowledge, the Resource Description Framework (RDF) [3]
data model can be used.
1.2 The Resource Description Framework
RDF provides a data model that can be used on the Web to add meaning to existing
data such that it can be understood by both software applications and humans. The
RDF Schema (RDFS) [4] is a semantic extension of RDF that enables users to define
a group of related resources and the relationships between them. Information in
RDF is primarily represented using XML [5], however other compact and readable
representations also exist such as the Notation3 (N3) [6] syntax. Resources in RDF can
be identified by using Uniform Resource Identifiers (URIs), for example the individual
named John can be identified by using <http://inetria.org/directory/employee/
smithj>. To avoid using long URIs these are usually shortened using namespaces such
as employee:smithj.
To demonstrate these concepts, consider for example the plain English statements
presented previously: it can be said that the individual John is of type Employee and
the entity iNetria is of type Company. It can also be said that the relationship between
an Employee and a Company is represented through the works at relationship. These
statements of knowledge — usually known as triples — contain three parts, a subject,
a predicate and an object, and can be written as follows:
- employee:smithj rdf:type inetria:Employee
- business:inetria rdf:type inetria:Company
- employee:smithj inetria:works_at business:inetria
Usually, basic explicit knowledge about resources is encoded as RDF by humans, then
software applications are used to deduce further implicit knowledge by using rule-
based reasoning. Rule-based reasoning requires a set of rules — such as the RDFS
entailment rules [7] — to be applied repeatedly to a set of statements to infer new
knowledge. For example, given the extra information that an Employee is a sub class
of Person an application can use rule-based reasoning to deduce that the employee
employee:smithj is actually of type Person. This knowledge may seem trivial to
humans; however, for software applications, inferring such knowledge requires computational
resources, and with more statements and rules this task can become computationally difficult.
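As an illustration of what rule-based reasoning involves, the following minimal sketch (it is illustrative only, and not the distributed approach developed in this thesis) applies the RDFS subclass entailment rule to the example statements, representing triples as plain Python tuples:

    # Minimal sketch of one RDFS entailment rule: if (C, rdfs:subClassOf, D)
    # and (x, rdf:type, C) hold, then (x, rdf:type, D) can be inferred.
    TYPE = "rdf:type"
    SUBCLASS = "rdfs:subClassOf"

    triples = {
        ("employee:smithj", TYPE, "inetria:Employee"),
        ("inetria:Employee", SUBCLASS, "foaf:Person"),
    }

    def subclass_rule(triples):
        """Apply the subclass rule repeatedly until nothing new is inferred."""
        inferred = set(triples)
        changed = True
        while changed:
            changed = False
            subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS]
            for c, d in subclass_pairs:
                for s, p, o in list(inferred):
                    if p == TYPE and o == c and (s, TYPE, d) not in inferred:
                        inferred.add((s, TYPE, d))  # newly derived implicit knowledge
                        changed = True
        return inferred

    print(subclass_rule(triples) - triples)
    # prints: {('employee:smithj', 'rdf:type', 'foaf:Person')}

A real reasoner applies a whole set of such rules, and the difficulty addressed in this thesis is doing so over billions of statements rather than a handful.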
1.2.1 RDF Data on the Web
There is currently a great deal of community, organisation and government1 effort
to provide RDF datasets on the Web covering a wide range of topics, such as government
data, publications, life sciences, geographic data and the social web, to mention just a
few. Additionally, ongoing effort is also focusing on connecting pieces of information
on the Semantic Web, widely known as Linked Data2. As of 2014,3 this Linked
Data cloud spanned more than 1014 publicly available datasets. To give an indication
of the size of these datasets, which is usually measured in terms of the number of
statements (RDF triples) in each dataset, consider the DBpedia4 [8] dataset, which
is a collection of structured data extracted from Wikipedia. This dataset contains more
than 3 billion statements covering 125 languages. Another example is Linked Life
Data5 (LLD), which has 10 billion statements, providing a collection of biomedical data
such as genes, proteins and diseases.
1.2.2 Use Cases
Many applications use Semantic Web technologies, in particular RDF, to provide capa-
bilities such as semantic search, content aggregation and discovery. Perhaps some of
the notable usages of these technologies for content enrichment and publishing were
the BBC websites for both the 2010 World Cup [9] and 2012 Olympics [10]. Using
these technologies the BBC was able to automatically aggregate web pages that con-
1https://data.gov.uk/
2http://linkeddata.org/
3http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/
4http://wiki.dbpedia.org/Datasets
5http://linkedlifedata.com/
tain links to relevant resources with minimal human management. Such applications
usually utilise a special type of database for storing and managing RDF triples called
triple stores, which provide the ability to load, update and delete RDF statements.
Additionally, triple stores can perform forward rule-based reasoning to infer addi-
tional knowledge and support querying of the stored statements using the SPARQL
Protocol and RDF Query Language (SPARQL) [11]. For applications to be able to process,
integrate or query large datasets such as DBpedia and Linked Life Data, high spec-
ification hardware and expensive software are usually required. To make matters
worse, these datasets are constantly growing, forcing users to constantly upgrade their
software and hardware in order to keep up with the data growth.
1.3 The Growth to Big Data
Initially, RDF data was processed using centralised systems that run on a single computer;
however, due to the constant growth of Semantic Web RDF data, this
data can now be described as Big Data [12]. To this extent, the Semantic Web Challenge
[13], which is concerned with building end-user Semantic Web applications and was
formerly known as the Billion Triples Challenge, has now been renamed the Big Data Challenge.
This data growth means that the average size of RDF datasets that an application
needs to process well exceeds a billion statements of knowledge. Therefore, the com-
putational processing power required to process these datasets — such as deducing
implicit knowledge through rule-based reasoning — far exceeds the capabilities of a
single computer. Consequently, commercially available large triple stores [14] have
focused on solutions that utilise high specification hardware.
Moreover, recent research work on large scale RDF processing in general and RDFS
reasoning in particular, has moved from focusing on centralised algorithms to focusing
on distributed and parallel algorithms. These algorithms utilise either peer-to-peer
networks using Distributed Hash Tables (DHT) [15, 16], or computer clusters using
MapReduce [17] and other programming models [18, 19, 20, 21]. The aforementioned
work has primarily focused on addressing the challenges of efficient data partitioning
and assimilation between the various computing resources, which in most cases, are
either part of an always-on computer cluster, or peer to peer network.
1.4 Challenges with Physical Computing
Some of the existing algorithms for large scale RDF processing utilise high specification,
dedicated computer clusters [19, 20, 21] that require large upfront in-
frastructure investments. These clusters usually contain a fixed number of always-on
computing resources that are billed for 24/7 regardless of utilisation. This presents
a number of challenges, for example, it is not possible to simply “switch off” these
resources when not needed and then switch them on when there is enough workload.
Besides, jobs executed on such clusters are constrained by the maximum number of
computing resources available, which means larger jobs utilising all available resources
will need to execute over longer time periods. Increasing the capacity of such clusters
requires further investment and effort to purchase, install and configure additional
computing resources.
Other algorithms that perform distributed RDF processing utilise computers in peer-to-peer
networks [15, 16], namely Distributed Hash Tables (DHTs) [22]. In these settings,
computing resources perform multiple tasks that are not related. In such heteroge-
neous computing environments, there are no performance or deadline guarantees to
any one particular task. Moreover, due to the shared nature of such environments, it
is difficult to measure the overall cost exclusively required to perform one particular
task. In addition to the aforementioned challenges, the Semantic Web data is often
highly skewed which causes a number of workload partitioning strategies to suffer
from load balancing issues [19]. Furthermore, the continuous growth of the Semantic
Web data presents the challenges of storing both the explicit and the inferred data.
Some of the issues described in this section are inherent in the nature of physical
computing; cloud computing, on the other hand, is a promising computing paradigm
that can be utilised to address some of these issues.
1.5 The Potential of Cloud Computing
Traditional distributed or grid computing is based on fixed, physical and interconnected
servers hosted in data centres; cloud computing, on the other hand, is a computing
paradigm built on top of these servers. Public cloud computing provides a pool
of shared resources that can be rapidly acquired or released with minimal effort, ei-
ther manually or programmatically [23], a feature unique to cloud computing termed
Elasticity [24, 25, 26]. Elasticity is the ability to acquire computing resources with
varying configurations (e.g. number of CPU cores, RAM size, disk type) almost instantly
when needed, and to release them instantly when not needed, by using an Application
Programming Interface (API) exposed by the cloud provider. Public cloud providers,
most notably Amazon Web Services [27], Google Cloud Platform [28] and Microsoft
Azure [29], utilise their data centres to commercially offer on-demand computing re-
sources.
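To make the notion of elasticity concrete, the sketch below acquires a single virtual machine through the Google Compute Engine API and deletes it once the work is done, so that billing stops. It is only an illustration: the project, zone, machine type and image values are placeholders, authentication setup, waiting for operations to complete and error handling are omitted, and this is not the CloudEx implementation presented later in the thesis.

    from googleapiclient import discovery

    # Placeholder values, not taken from this research.
    project, zone = 'my-project', 'europe-west1-b'
    compute = discovery.build('compute', 'v1')

    body = {
        'name': 'processor-1',
        'machineType': f'zones/{zone}/machineTypes/n1-standard-2',
        'disks': [{
            'boot': True,
            'autoDelete': True,
            'initializeParams': {
                'sourceImage': 'projects/debian-cloud/global/images/family/debian-11',
            },
        }],
        'networkInterfaces': [{'network': 'global/networks/default'}],
    }

    # Acquire the resource only when there is work to do...
    compute.instances().insert(project=project, zone=zone, body=body).execute()

    # ... run the task on the new instance (omitted) ...

    # ... and release the resource as soon as the work is done.
    compute.instances().delete(project=project, zone=zone,
                               instance='processor-1').execute()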
In addition to elasticity, the other key benefit of cloud computing is usage-based billing
so that consumers only pay for what they use, which is highly cost efficient compared
to a physical computer environment being billed even when the resources are not being
used. Cloud computing offers many other on demand services, such as low latency
mass cloud storage capable of storing unlimited amounts of data and analytical big
data services. These big data services such as Amazon Redshift [30] and Google
BigQuery [31] provide a promising platform that is able to readily analyse web-scale
datasets in a few seconds. For example, it was shown that the algorithms behind
Google BigQuery are able to achieve a scan throughput of 100 billion records per
second on a shared cluster, which outperforms MapReduce by an order of magnitude
[32]. Additionally, it was shown that using public clouds, users can scale to over 475
computing nodes in less than 15 minutes [33].
1.6 Motivation
As noted in the previous section, the cloud computing paradigm provides a promising
approach for storing, managing and processing large RDF datasets. Furthermore,
cloud computing is both available and affordable to users and organisations, whereas,
dedicated physical computing clusters may not be practically available for everyone
to use due to the required upfront infrastructure investment and ongoing maintenance
cost. Currently end-user applications that process, use and query large datasets such
as DBpedia and Linked Life Data require high specification hardware to run com-
mercially available triple stores, requiring massive upfront infrastructure investments.
Utilising an approach based on cloud computing can be a step forward towards
making RDF processing available for mainstream use, without any expensive upfront
investment. Despite this, little research has been done to explore the use of cloud
computing for RDF processing, storage and performing RDFS rule-based reasoning.
It may be tempting to replicate approaches developed in physical computing on cloud
computing; such approaches are primarily centred on computing clusters with the
storage and processing contained therein. In doing so, it is very likely that such
approaches will suffer from reduced performance due to the virtualised nature of
cloud resources [20], which adds yet another layer on top of the native computing re-
sources. Instead of such replication, this thesis introduces and utilises the notion of
“cloud-first frameworks” to signify frameworks that are entirely cloud based, specifi-
cally designed from the ground-up for cloud computing and utilise cloud services such
as Infrastructure as a Service, cloud storage and Big Data services.
1.7 Research Aim and Questions
Based on the challenges and motivation noted in the previous sections, the aim of
this research is to develop and evaluate an elastic cloud-based triple store for RDF
processing in general and RDFS rule-based reasoning in particular. This aim has
resulted in a number research questions being asked, firstly, if the processing of RDF
is performed on cloud computing resources, can these resources be efficiently acquired
for task execution and released. Secondly, as noted earlier that RDF contains long
URIs which occupy storage space, network bandwidth and take longer to process. A
question worth asking is, can these URIs be compressed rapidly and efficiently by
using an efficient dictionary encoding. Finally, provided that cloud big data service
such as Google BigQuery provide analytical capabilities for massive datasets, can
these services be used to perform rule-based reasoning. These research questions can
be summarised as follows:
•Q1. How can a cloud-based system efficiently distribute and process tasks with
different computational requirements?
•Q2. How can an efficient dictionary that fits in-memory be created for encoding
RDF URIs from strings to integers?
•Q3. How can cloud-based big data services such as Google BigQuery be used
to perform RDFS rule-based reasoning?
An overarching research question that encompasses all of the above questions is: can cloud
computing provide an efficient RDF triple store that offers storage and processing and
can perform RDFS rule-based reasoning?
1.7.1 Scope of Research
To provide answers to the research questions, this research is conducted over two
domains, RDF and cloud computing as shown in Figure 1.1. The research develops
a Cloud-based Task Execution framework (CloudEx), which is a generic cloud-first
framework for efficiently executing tasks on cloud environments. Subsequently, the
CloudEx framework is used to develop algorithms for providing an RDF triple store
on the cloud; these algorithms are collectively implemented as an Elastic Cost Aware
Reasoning Framework (ECARF). Both CloudEx and ECARF provide answers to the
research questions mentioned in the previous section.
1.7.2 Impact of Research
This thesis presents original results and findings that advance the state of the art of
distributed cloud-based RDF processing and forward RDFS reasoning. The CloudEx
framework developed as part of this thesis enables users to run generic tasks with
different computational requirements on the cloud. Additionally, the ECARF triple
store described in this thesis makes it possible to commercially offer a pay-as-you-go
cloud hosted RDF triple store, by using a Triple store as a Service (TaaS) model
similar to Software as a Service. Furthermore, these approaches enable applications
to harness the powers of the Semantic Web and RDF without any upfront investment
[Figure: a diagram placing this research (CloudEx and ECARF) at the intersection of RDF and cloud computing, within the wider context of physical computing.]
Figure 1.1: Research Context.
in expensive hardware or software. Some of the findings of this research have already
attracted a great deal of interest (more than 10,000 views over a few days) from the
technical community, in particular regarding the potential savings with public clouds that
charge per minute versus the ones that charge per hour6.
1.8 Technical Approach
To accomplish the aforementioned research aim and answer the research questions,
this research follows the general methodology of design science research as outlined by
[1] and summarised in Figure 1.2. As shown in the figure, the research is conducted
over five iterative phases and can be summarised as follows:
6http://omerio.com/2016/03/16/saving-hundreds-of-hours-with-google-compute-engine-per-minute-billing/
[Figure: the five iterative phases (Awareness of Problem, Suggestion, Development, Evaluation, Conclusion) mapped to this research: awareness that processing large RDF datasets requires high specification hardware while cloud computing can be used without upfront investment; literature review (RDF and cloud computing); conceptual/theoretical model development (CloudEx model and ECARF model); prototype development (CloudEx and ECARF prototypes); experiments on the Google Cloud Platform using real-world and synthetic datasets with results and analysis; and conclusion, discussion and write-up.]
Figure 1.2: Research Methodology and Phases.
1. Literature review:
•RDF literature - to review and identify the issues with approaches de-
veloped for distributed RDF processing and RDFS reasoning both in peer-
to-peer networks and computer clusters.
•Cloud computing literature - to review key services provided by public
cloud providers with a review of the Google Cloud Platform [28], addition-
ally, to survey and categorise related work on utilising cloud computing.
2. Conceptual / theoretical model development:
•CloudEx model - To design a generic cloud-based architecture for tasks
execution and workload partitioning on public clouds.
•ECARF model - To design algorithms based on CloudEx for cloud-based
RDF processing and performing RDFS forward reasoning using columnar
databases.
3. Prototype development: to develop both the CloudEx and ECARF prototypes.
4. Evaluation:
•To run experiments on the Google Cloud Platform using both real-world
and synthetic datasets.
•To gather and analyse the results.
5. Conclusion: to discuss the findings, draw conclusions and suggest future work.
1.9 Contributions
The contributions of this thesis are summarised as follows:
•This thesis has designed, implemented and evaluated CloudEx, a generic cloud
based task execution framework that can be implemented on any public cloud.
It is shown that CloudEx can acquire computing resources with various configu-
rations to efficiently execute RDF tasks on the cloud, then release these resources
once the work is done. CloudEx has successfully utilised many cloud services
and acquired 1,086 computing resources with various specifications at a total
cost of $290.29. The CloudEx framework is implemented as a publicly available
open source framework7.
•This thesis has designed, implemented and evaluated a dictionary encoding algo-
rithm to generate an efficient dictionary for compressing RDF Uniform Resource
Identifiers (URIs) using cloud computing resources. It is shown that this ap-
proach generates an efficient dictionary that can readily fit in-memory during
the encoding and decoding process, which speeds up these processes consider-
ably in comparison with other work. Additionally, the dictionaries generated
7https://cloudex.io/
using this approach are smaller than the ones created by other approaches that
utilise dedicated hardware (a minimal sketch of the general dictionary encoding
idea is given after this list).
•This thesis has presented an algorithm to perform RDF processing and RDFS
rule-based reasoning using distributed computing resources and big data ser-
vices, in particular Google BigQuery [31]. It is shown that this approach is
promising and provides the benefit of permanent storage for RDF datasets with
the potential to query and develop a full triple store capability on top of Big-
Query. In comparison, other related work output RDF datasets to files, with the
data yet to be imported into a permanent storage with triple store capability.
•The last two contributions are implemented as ECARF, a cloud based RDF
triple store with both processing and forward RDFS reasoning capabilities.
ECARF is implemented on the Google Cloud Platform as a publicly available
open source framework8. ECARF is the first to run RDFS reasoning entirely on
cloud computing; unlike other approaches that simply replicate a physical com-
puting setup — such as MapReduce [20] — on the cloud, ECARF is entirely
based on cloud computing features and services.
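For readers unfamiliar with the term, the sketch below shows only the general idea behind dictionary encoding: long URI strings are replaced with compact integer identifiers, and the mapping is kept so that the data can be decoded again. The algorithm contributed by this thesis (Chapter 6) is considerably more involved, since it must build such a dictionary for billions of triples using distributed cloud resources while keeping it small enough to fit in memory.

    # Illustrative dictionary encoding: URI strings <-> integer identifiers.
    class Dictionary:
        def __init__(self):
            self.to_id = {}    # URI string -> integer id
            self.to_term = []  # integer id -> URI string

        def encode(self, term):
            if term not in self.to_id:
                self.to_id[term] = len(self.to_term)
                self.to_term.append(term)
            return self.to_id[term]

        def decode(self, term_id):
            return self.to_term[term_id]

    d = Dictionary()
    triple = ("http://inetria.org/directory/employee/smithj",
              "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
              "http://inetria.org/directory/schema#Employee")

    encoded = tuple(d.encode(t) for t in triple)   # e.g. (0, 1, 2)
    decoded = tuple(d.decode(i) for i in encoded)  # original URIs restored
    assert decoded == triple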
1.10 Outline of the Thesis
The outline of this thesis is shown in Figure 1.3. The remainder of the thesis is organised
as follows: Chapters 2 and 3 provide a background literature review on the Resource
Description Framework and cloud computing respectively. Chapter 2 introduces the
key concepts of RDF, RDFS and other key definitions that are used throughout this
thesis. Additionally, the chapter provides a survey of related distributed RDF pro-
cessing work and issues, in particular around distributed RDFS reasoning. Chapter 3
8http://ecarf.io
provides a review of the cloud computing paradigm including a high level overview of
public cloud deployments. The key cloud services utilised in this research are reviewed
with an introduction to the key Google Cloud Platform services. A background on
some of the related research and commercial work utilising cloud elasticity and utility
billing features is also provided. Chapter 4 introduces the design methodology used
in the model development for both the CloudEx framework and the ECARF triple
store.
Chapters 5, 6 and 7 present the original contributions and results of this thesis.
Chapter 5 fully introduces the architecture of the CloudEx framework and presents
its various components. Subsequently, Chapter 6 presents a number of algorithms
for RDF processing and rule-based RDFS reasoning using the CloudEx framework.
These algorithms collectively define the ECARF RDF cloud-based triple store. Then,
Chapter 7 presents the results and discussions of the experimental evaluation of both
CloudEx and ECARF. Finally, Chapter 8 summarises the achievements of this thesis,
discusses open issues and presents possible future work.
[Figure: thesis outline showing Chapter 1 (Introduction), Chapter 2 (Background on the Resource Description Framework), Chapter 3 (The Cloud Computing Paradigm), Chapter 4 (Design Method), Chapter 5 (CloudEx, a Cloud First Framework), Chapter 6 (ECARF, Processing RDF on the Cloud), Chapter 7 (Evaluation of Cloud Based RDF Processing) and Chapter 8 (Conclusions), grouped under Literature Review, Conceptual Model Development, Original Contributions, and Results and Discussions.]
Figure 1.3: Thesis Outline.
Chapter 2
Background on the Resource
Description Framework
The Resource Description Framework and cloud computing are both central to
the work done in this research, as illustrated in Figure 1.1. More specifically,
the primary aim of this research is to develop and evaluate an RDF triple store for
RDF processing in general and RDFS rule-based reasoning in particular using cloud
computing. The focus on RDF is motivated by the recent growth in the adoption of
Semantic Web and RDF technologies, which is evident from efforts like Linked Data1
and Schema.org2. Linked Data focuses on connecting RDF data on the Web and
Schema.org focuses on providing common schemas for structured data on the Web
using RDF and other frameworks. Schema.org for example, is sponsored by major
search engines providers like Google, Microsoft and Yahoo to enable search engines
to intelligently understand the meaning of structured data on the Web.
Another motivation for this research is the explosive growth of RDF datasets on the
Web, which has also led to numerous other research and commercial efforts to effec-
1http://linkeddata.org/
2http://schema.org/
tively process this data. With large RDF datasets of a billion and more statements,
efforts have focused on using distributed computing on dedicated computer clusters;
however, this is not without challenges, for example the challenge of dealing with
the ever increasing storage and processing requirements of such big data. Addition-
ally, most research effort on large scale distributed RDF reasoning does not provide a
mechanism to be able to query this data, which is a key requirement for applications
to make use and benefit from such data.
This chapter is the first of the literature review chapters and introduces the key
concepts of RDF that are used in this research to develop ECARF; furthermore, a
survey of related literature is also provided. Firstly, in Section 2.1, some of the RDF
key concepts and definitions used throughout this thesis are introduced, such as the
RDF Schema (RDFS) and RDFS reasoning. Additionally, the concept of ontology is
also introduced. Then in Section 2.2, a review of the literature covering the distributed
processing and management of RDF and RDFS is provided.
Section 2.3 reviews RDF data management efforts that utilise cloud computing; subsequently,
in Section 2.4, a review of the large scale RDF dictionary encoding literature is
provided. This is followed by Section 2.5, which summarises the challenges facing large scale
RDF processing presented in this chapter. Finally, in Section 2.6, this chapter is
concluded with a summary of the issues and challenges with the approaches surveyed
herein.
2.1 Resource Description Framework
RDF [3] is recommended by the W3C for representing information about resources
in the World Wide Web. RDF is intended for use cases where information needs to be
processed by and exchanged between applications rather than people. Information in
RDF is primarily represented in XML, however other compact representations also
exist such as the Notation3 (N3) [6]. Resources in RDF can be identified by using
Uniform Resource Identifier (URI) references (URIrefs), for example:
<http://inetria.org/directory/employee/smithj>
As a convention, to avoid writing long URIrefs, they are shortened using namespaces
to replace the common part of the URI with a prefix. The previous example can
be written as employee:smithj by using the prefix employee: to replace the URI
<http://inetria.org/directory/employee/>.
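For illustration only (the helper below is not from the thesis), this convention amounts to a simple prefix table lookup:

    # Expanding a prefixed name back to its full URIref using a prefix map.
    prefixes = {
        "employee": "http://inetria.org/directory/employee/",
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    }

    def expand(qname):
        prefix, local = qname.split(":", 1)
        return prefixes[prefix] + local

    assert expand("employee:smithj") == "http://inetria.org/directory/employee/smithj"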
Unknown resources or resources that do not need to be explicitly identified are called
blank nodes. Blank nodes are referenced using an identifier prefixed with an under-
score such as _:nodeId. Constant values such as strings, dates or numbers are referred
to as literals. Information about a particular resource is represented in a statement,
called a triple, that has the format (subject, predicate, object) abbreviated as (s, p, o).
The subject represents a resource, either a URI or a blank node; the predicate represents a
property linking the subject to an object, which could be another resource or a literal.
More formally, let there be pairwise disjoint infinite sets of URIrefs (U), blank nodes
(B), and literals (L). An RDF triple is a tuple:
(s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L)    (2.1)
A set of RDF triples is called an RDF graph (Figure 2.1 (a)), in which each triple
is represented as a node-arc-node link. The node (ellipse or rectangle) represents the
subject and object, the arc represents the predicate and is directed towards the object
node. RDF triples can be exchanged in a number of formats, primarily RDF/XML
[5] which is based on XML documents. Other formats include line based, plain text
encoding of RDF graphs such as N-Quads [34] and N-Triples [35], which are simplified
subsets of N3. RDF defines a built-in vocabulary of URIrefs that have a special
meaning. For example, to indicate that a resource is an instance of a particular kind
or class, RDF defines the predicate http://www.w3.org/1999/02/22-rdf-syntax-ns#type,
shortened as rdf:type.
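Read as a type constraint, the definition above can be illustrated with the short, informal sketch below (not part of the thesis), which models URIrefs, blank nodes and literals as distinct Python classes and checks that a tuple is a well-formed triple in the sense of Equation 2.1:

    # Informal rendering of Equation 2.1: subjects come from U or B,
    # predicates from U, and objects from U, B or L.
    class URIRef(str): pass   # elements of U
    class BNode(str): pass    # elements of B
    class Literal(str): pass  # elements of L

    def is_rdf_triple(s, p, o):
        return (isinstance(s, (URIRef, BNode))
                and isinstance(p, URIRef)
                and isinstance(o, (URIRef, BNode, Literal)))

    s = URIRef("http://inetria.org/directory/employee/smithj")
    p = URIRef("http://xmlns.com/foaf/0.1/name")
    o = Literal("John Smith")

    assert is_rdf_triple(s, p, o)      # a valid triple
    assert not is_rdf_triple(o, p, s)  # a literal cannot be a subject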
Figure 2.1 shows a running example of a simple RDF dataset of employees and their
relationships. The RDF graph (Figure 2.1 (a)) shows URIref resources as ellipses
with text, blank nodes as empty ellipses and literals as rectangles. The example uses
a number of well-known vocabularies such as RDF (rdf:), RDFS (rdfs:), which will
be explained in the next section, and Friend of a Friend3 (foaf:). The example also
defines a custom vocabulary (inetria:), which is summarised in Figure 2.1 (c).
The RDF graph describes a number of resources, mainly an employee (employee:smithj)
and his properties, such as name (foaf:name) and start date (inetria:start_date),
which are both literals. The example also describes properties that refer to
other resources rather than literals, such as manages (inetria:manages) and works
at (inetria:works_at). The RDF triples in Figure 2.1 (c) and (d) are represented in
N-Triples format, with the long URI namespaces shortened with the prefixes shown
in Figure 2.1 (b) for readability.
2.1.1 RDF Schema
The RDF Schema (RDFS) [4] is a semantic extension of RDF for defining groups of related resources and their relationships. RDFS provides a vocabulary of URIrefs starting with http://www.w3.org/2000/01/rdf-schema# and shortened as rdfs:. RDFS URIrefs can be used to define a hierarchy of classes (e.g. rdfs:subClassOf), a hierarchy of properties (e.g. rdfs:subPropertyOf) and how classes and properties are intended to be used together (e.g. rdfs:range, rdfs:domain).
(a) RDF graph: a node-arc-node diagram linking employee:smithj, employee:doej, business:inetria, inetria:Employee and a blank node for the car, together with the literal values "John Smith", "John Doe", "2002-04-17", "Ford" and "ABC123", via the properties rdf:type, foaf:name, inetria:start_date, inetria:works_at, inetria:manages, inetria:drives, inetria:make and inetria:reg_number.

(b) URI namespace prefixes:
prefix rdf:, namespace URI: http://www.w3.org/1999/02/22-rdf-syntax-ns#
prefix rdfs:, namespace URI: http://www.w3.org/2000/01/rdf-schema#
prefix foaf:, namespace URI: http://xmlns.com/foaf/0.1/
prefix inetria:, namespace URI: http://inetria.org/directory/schema#
prefix employee:, namespace URI: http://inetria.org/directory/employee/
prefix business:, namespace URI: http://inetria.org/directory/business/
prefix xsd:, namespace URI: http://www.w3.org/2001/XMLSchema#

(c) Schema data N-Triples:
inetria:Employee rdfs:subClassOf foaf:Person .
inetria:Company rdfs:subClassOf foaf:Organization .
inetria:drives rdfs:domain inetria:Employee .
inetria:drives rdfs:range inetria:Car .
inetria:manages rdfs:domain inetria:Employee .
inetria:manages rdfs:range inetria:Employee .
inetria:works_at rdfs:domain inetria:Employee .
inetria:works_at rdfs:range inetria:Company .

(d) Instance data N-Triples for the RDF graph shown in (a):
employee:smithj rdf:type inetria:Employee .
employee:smithj foaf:name "John Smith" .
employee:smithj inetria:manages employee:doej .
employee:smithj inetria:works_at business:inetria .
employee:smithj inetria:start_date "2002-04-17"^^xsd:date .
employee:smithj inetria:drives _:jA5492297 .
_:jA5492297 inetria:make "Ford" .
_:jA5492297 inetria:reg_number "ABC123" .
employee:doej rdf:type inetria:Employee .
employee:doej inetria:works_at business:inetria .
employee:doej foaf:name "John Doe" .

Figure 2.1: Running example RDF data and graph representation.
As an example, the statements in Figure 2.1 (c) describe that the class Employee (inetria:Employee) is a subclass of Person (foaf:Person). The statements also describe that the property works at (inetria:works_at) connects instances of type inetria:Employee (subject) to instances of type inetria:Company (object).
The Semantic Web aims to represent knowledge about resources in a machine read-
able format that automated agents can understand. Knowledge about a particular
domain of interest is described as an ontology, which is a machine readable specifi-
cation with a formally defined meaning [36]. An ontology enables different agents to
agree on a common understanding of certain terms, for example Person or Employee
in Figure 2.1. As noted, RDFS provides the capability to specify background knowledge about terms, such as rdfs:subClassOf and rdfs:subPropertyOf, which renders RDFS a knowledge representation language, or ontology language. Having said this, RDFS has its limitations as an ontology language; for example, it is not possible to express negated statements [36]. The following section provides a brief background on
the concept of ontologies and the more expressive Web Ontology Language (OWL)
[37].
2.1.2 The Semantic Web and Ontologies
In philosophy, an Ontology is a systematic account of Existence. In computer science,
Gruber [38] defines ontology as “an explicit specification of a conceptualisation” and
conceptualisation as an abstract view of the world represented for a particular pur-
pose. Guarino [39], on the other hand, argues that this definition of conceptualisation relies on an extensional notion and alternatively suggests an intensional definition of conceptualisation. Guarino defines conceptualisation as “an intensional semantic structure that encodes the implicit rules constraining the structure of a piece of reality” and ontology as “a logical theory which gives an explicit, partial account of
a conceptualisation”. In terms of the Semantic Web, an ontology is a collection of
definitions and concepts such that all the agents interpret the concepts with regard
to the same ontology and hence have a shared understanding of these concepts [40].
2.1.2.1 Description Logics
Existing work in the Artificial Intelligence community has explored the use of formal ontologies in knowledge engineering [41, 42]. For the Semantic Web, a family of knowledge representation languages known as Description Logics (DLs) [43, 44] is used to formally represent the knowledge in ontologies [40]. In Description Logic (DL), classes are called concepts, properties or predicates are called roles, and objects are called individuals. The building blocks of DL knowledge bases are called axioms; these are logical statements relating concepts or roles.
The expressivity of a DL is indicated by labels; for example, ALC (Attribute Language with Complement) is a family of DLs that supports class expressions such as stating that two classes are equivalent, that an individual is an instance of a particular class, or that a role connects two individuals. For example, let A be an atomic class name, R be an abstract role, ⊤ refer to the class that contains all objects (the logical equivalent of true), ⊥ refer to the empty class (the logical equivalent of false), ¬, ∩ and ∪ refer to the class constructors of negation, intersection and union respectively, and ∀ and ∃ be the universal and existential quantifiers that express property restrictions. The class expressions C, D in the ALC DL can be constructed as follows [36]:

Axiom ::= C ⊆ D | C(A) | R(A, A)
C, D ::= A | ⊤ | ⊥ | ¬C | C ∩ D | C ∪ D | ∀R.C | ∃R.C
One of the most expressive DLs in the literature is SROIQ [46], which is a superset of ALC. Generally, DL axioms can be divided into three categories [45]:
• Assertional or instance data (ABox), which captures knowledge about named individuals, i.e. concept assertions such as smithj ⊆ Employee, or role assertions between named individuals such as worksAt(smithj, inetria).
• Terminological or schema data (TBox), which describes relationships between concepts, such as concept inclusion, for example Employee ⊆ Person, or concept equivalence, for example Person ≡ Human.
• Relationships between roles (RBox), which describe relationships such as role inclusion and role equivalence.
2.1.2.2 Web Ontology Language (OWL)
As discussed previously, RDFS has limited expressivity as an ontology language; for this reason, the W3C proposed the Web Ontology Language (OWL) [37, 50]. OWL has two semantics, the RDF-based semantics and the direct semantics, which relates OWL axioms to DL [52]. The first version of OWL, OWL 1 [37], has three sublanguages: OWL Full, OWL DL and OWL Lite. The computational complexity [47] of OWL Lite is ExpTime and of OWL DL is NExpTime, and whilst both are decidable, OWL Full is undecidable. These complexities have led many researchers to propose subsets of OWL 1 that can be processed in Polynomial Time (PTime), for example the widely used OWL Horst [48] semantics and Description Logic Programs (DLP) [49].
OWL 2 was proposed to address the computational complexity issues with the OWL 1 sublanguages. The W3C proposed three profiles of OWL 2 that can be processed in PTime, namely OWL EL, OWL RL and OWL QL [50]. OWL EL, which belongs to the EL++ [51] family of DLs, is mainly used in biomedical ontologies with a very large number of TBox triples such as classes and/or properties. On the other hand, OWL RL is proposed as the preferred approach for representing Web ontologies that contain very large amounts of instance data (ABox). Finally, OWL QL provides database applications with an ontological data access layer [52].
2.1.3 RDFS Entailment Rules
RDF statements also have a formal meaning that determines the conclusions, or entailments, a software application can draw from a particular RDF graph. The RDFS entailment rules [7] can be applied repeatedly to a set of triples (a graph) to infer new triples. Other entailment rules, related to OWL, include the OWL Horst [48] and OWL 2 [50] semantics. The process of repeatedly applying a set of rules to a set of triples is known as forward chaining or forward reasoning, and continues until no further new triples are inferred; this process is also referred to as materialisation [36] in the literature. At this point the closure of the RDF graph under the RDFS semantics is reached and the reasoning process stops.
Each of the rules in Tables 2.1 and 2.2 has a body and a head. The forward reasoning process adds the triple in the head column if the RDF graph contains the triples in the body column; the added triple is usually referred to as an inferred triple. A dataset that contains both the initial triples and all the possible inferred triples is referred to as a materialised dataset. Some of the rules might infer the same knowledge as others, resulting in duplicate statements being added to the RDF graph.

Table 2.1: Schema-Instance (SI) entailment rules of the ρdf fragment of RDFS (sc = subClassOf, sp = subPropertyOf, dom = domain)

RDFS Name | Body: Schema Triple | Body: Instance Triple | Head
rdfs2     | ?p, rdfs:dom, ?c    | ?x, ?p, ?y            | ?x, rdf:type, ?c
rdfs3     | ?p, rdfs:range, ?c  | ?x, ?p, ?y            | ?y, rdf:type, ?c
rdfs7     | ?p1, rdfs:sp, ?p2   | ?x, ?p1, ?y           | ?x, ?p2, ?y
rdfs9     | ?c1, rdfs:sc, ?c2   | ?x, rdf:type, ?c1     | ?x, rdf:type, ?c2

Table 2.2: Schema entailment rules of the ρdf fragment of RDFS (sc = subClassOf, sp = subPropertyOf)

RDFS Name | Body: Schema Triple 1 | Body: Schema Triple 2 | Head
rdfs5     | ?p1, rdfs:sp, ?p2     | ?p2, rdfs:sp, ?p3     | ?p1, rdfs:sp, ?p3
rdfs11    | ?c1, rdfs:sc, ?c2     | ?c2, rdfs:sc, ?c3     | ?c1, rdfs:sc, ?c3

This forward reasoning process can be illustrated using the example in Figure 2.1. By applying rule rdfs9 to the statements in Figure 2.1 (c) and (d), it can be seen that the following two statements match the rule's body:
1. inetria:Employee rdfs:subClassOf foaf:Person
2. employee:smithj rdf:type inetria:Employee
The reasoning process will then add the triple in the rule's head to the RDF graph, which in this case is (employee:smithj rdf:type foaf:Person). If rule rdfs3 is applied to the statements in Figure 2.1 (c) and (d), it can be seen that the following two statements match the rule's body:
1. inetria:works_at rdfs:range inetria:Company
2. employee:smithj inetria:works_at business:inetria
The reasoning process will infer the statement (business:inetria rdf:type inetria:Company). Then, by applying rule rdfs9 to this inferred statement and the statement (inetria:Company rdfs:subClassOf foaf:Organization), a new statement (business:inetria rdf:type foaf:Organization) can be inferred. Once all the implicit knowledge in an RDF dataset has been inferred, the dataset can be queried by using the SPARQL Protocol and RDF Query Language (SPARQL) [11].
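The fixpoint nature of forward reasoning can be illustrated with a small, self-contained sketch. The following Java fragment, an illustration only rather than the approach developed in this thesis, repeatedly applies rules rdfs9 and rdfs3 to a handful of triples from Figure 2.1 until no new triples are produced; triples are represented as simple lists of prefixed terms.

    import java.util.*;

    // Illustrative fixpoint forward-reasoning loop over a few triples from Figure 2.1.
    // Not an efficient implementation: it simply rescans the whole graph until no
    // new triples are produced, i.e. until the closure is reached.
    public class ForwardChainingSketch {

        public static void main(String[] args) {
            Set<List<String>> graph = new LinkedHashSet<>(List.of(
                List.of("inetria:Employee", "rdfs:subClassOf", "foaf:Person"),
                List.of("inetria:Company", "rdfs:subClassOf", "foaf:Organization"),
                List.of("inetria:works_at", "rdfs:range", "inetria:Company"),
                List.of("employee:smithj", "rdf:type", "inetria:Employee"),
                List.of("employee:smithj", "inetria:works_at", "business:inetria")));

            boolean changed = true;
            while (changed) {                       // repeat until no rule fires
                changed = false;
                List<List<String>> inferred = new ArrayList<>();
                for (List<String> schema : graph) {
                    for (List<String> instance : graph) {
                        // rdfs9: (?c1 sc ?c2) + (?x rdf:type ?c1) -> (?x rdf:type ?c2)
                        if (schema.get(1).equals("rdfs:subClassOf")
                                && instance.get(1).equals("rdf:type")
                                && instance.get(2).equals(schema.get(0)))
                            inferred.add(List.of(instance.get(0), "rdf:type", schema.get(2)));
                        // rdfs3: (?p range ?c) + (?x ?p ?y) -> (?y rdf:type ?c)
                        if (schema.get(1).equals("rdfs:range")
                                && instance.get(1).equals(schema.get(0)))
                            inferred.add(List.of(instance.get(2), "rdf:type", schema.get(2)));
                    }
                }
                for (List<String> t : inferred)
                    if (graph.add(t)) changed = true;   // Set.add removes duplicates
            }
            graph.forEach(System.out::println);
        }
    }

Running this sketch infers (employee:smithj rdf:type foaf:Person), (business:inetria rdf:type inetria:Company) and, on the second pass, (business:inetria rdf:type foaf:Organization), matching the worked example above.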
In addition to forward reasoning, backward reasoning, or backward chaining, can also be used to infer new knowledge in RDF datasets. However, backward reasoning is only conducted at query time rather than upfront as with forward reasoning. Forward reasoning has the advantage of fast queries, as no reasoning is required at query time and all the inferred triples are already included in the dataset. The disadvantage of this approach arises when the knowledge base is updated regularly, in which case the closure needs to be recomputed each time, which can be computationally expensive. Backward reasoning, on the other hand, starts from the query and then builds a knowledge graph for the answer. The advantage of this approach is that the knowledge base can be updated at any time; the disadvantage is that query answering can take a long time, as reasoning is done at query time.
2.1.3.1 A Little Semantics Goes a Long Way
As seen from the previous examples, new knowledge can be deduced through reasoning, which means it is not required to state all the implicit knowledge, such as "iNetria is an Organisation", upfront. Basic facts can be represented, and software applications can then be used to deduce further knowledge through reasoning. This feature is best expressed by the so-called Hendler Hypothesis [53]:

"A little semantics goes a long way."
2.1.4 Minimal and Efficient Subset of RDFS
Some of the RDF and RDFS entailment rules [7] serve the purpose of reasoning about the structure of RDF itself rather than about the data it describes [54]. For example, single antecedent rules (rules with one triple in the body), container-management properties and RDF axiomatic triples are usually ignored in other published work [15, 16, 19, 55], because they are easy to implement and are less frequently used. A minimal and efficient fragment of the RDFS entailment rules, known as the ρdf fragment, has been formalised in [54]. The work done in this thesis utilises this
fragment as summarised in Tables 2.1 and 2.2. Below, some of the definitions that
will be used throughout the rest of this thesis are introduced:
Definition 2.1.1. A schema triple is a triple that contains in its predicate one of the RDFS vocabulary terms, such as rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range, etc. Schema data is a set of schema triples, such as the triples shown in Figure 2.1 (c).
Definition 2.1.2. An instance triple is a triple that is not a schema triple. Instance data is a set of instance triples, such as the triples shown in Figure 2.1 (d).
Definition 2.1.3. A schema-instance (SI) rule is a rule that contains in its body one schema triple and one instance triple, and contains in its head an instance triple. The RDFS rules in Table 2.1 are SI rules.
Definition 2.1.4. A schema rule is a rule that contains schema triples both in its body and its head. The RDFS rules in Table 2.2 are schema rules.
Definition 2.1.5. A term refers to a part of a triple, namely the subject, the predicate or the object.
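These definitions suggest a purely syntactic test: a triple can be classified by inspecting its predicate. The following sketch is illustrative only and is not necessarily the exact split used later in this thesis; it separates ρdf schema triples from instance triples.

    import java.util.*;

    // Illustrative classification of triples into schema and instance triples
    // (Definitions 2.1.1 and 2.1.2) based on the ρdf predicates in Tables 2.1 and 2.2.
    public class TripleClassifier {

        private static final Set<String> SCHEMA_PREDICATES = Set.of(
            "rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range");

        static boolean isSchemaTriple(String[] triple) {
            return SCHEMA_PREDICATES.contains(triple[1]);   // triple = {subject, predicate, object}
        }

        public static void main(String[] args) {
            String[][] triples = {
                {"inetria:Employee", "rdfs:subClassOf", "foaf:Person"},
                {"employee:smithj", "rdf:type", "inetria:Employee"},
                {"inetria:works_at", "rdfs:range", "inetria:Company"}
            };
            for (String[] t : triples)
                System.out.println(Arrays.toString(t) + " -> "
                    + (isSchemaTriple(t) ? "schema" : "instance"));
        }
    }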
2.1.5 RDF Triple Stores
RDF datasets in the form of triples are stored and managed in databases usually referred to as triple stores. Triple stores are databases optimised for the storage, processing and querying of triples, with the ability to perform forward reasoning under any of the aforementioned entailment rules. Moreover, triple stores can also support the querying of stored triples using the SPARQL query language. In addition to reasoning and SPARQL support, these stores can also provide RDF management capabilities such as the ability to update the knowledge base by adding, modifying or removing triples. Commercial implementations of large triple stores [14] include BigOWLIM [56], Virtuoso [57], AllegroGraph and Oracle 11g [58].
2.2 Distributed Semantic Web Reasoning
The current explosive growth of Web data has led to similar growth in Semantic Web RDF data. Consequently, recent work on large scale RDFS reasoning has primarily focused on distributed and parallel algorithms utilising peer to peer networks and dedicated computing clusters. The following sections provide a brief survey of related work in distributed Semantic Web reasoning and data management.
2.2.1 Peer to Peer Networks
Distributed RDF and OWL reasoning on Peer to Peer (P2P) networks has mainly focused on the use of Distributed Hash Tables (DHTs) [22, 59, 60]. DHTs provide distributed storage for key-value pairs in P2P networks, where each node is responsible for a number of keys. The mapping between nodes and keys is done using a standard hash function applied to both node IP addresses and keys. This mapping is usually referred to as term based partitioning, because it is based on fixed mappings from triple terms to nodes. For example, one node would be responsible for any triples with the predicate rdf:type, while another node might be responsible for rdfs:subPropertyOf. Each node keeps a lookup table for its successor node, so when a request is received for a key that is larger than the ones handled by this node, it is forwarded to the nearest successor. This process continues until the key reaches the node responsible for it. Each node supports Get(Key) and Put(Key, Value) operations on its own database, and lookup is performed in O(log N), where N is the number of nodes.
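Term based partitioning can be pictured with a minimal sketch in which a triple term is hashed to one of N nodes, so that all triples sharing that term are routed to the same node. This is a simplification for illustration only; real DHTs hash keys onto an identifier ring and route requests via successor pointers, which is omitted here.

    import java.util.*;

    // Simplified illustration of term-based partitioning: a triple term is hashed
    // to one of N nodes, so every triple containing that term is routed to the
    // same node. Routing via successor pointers is omitted.
    public class TermPartitioningSketch {

        static int nodeFor(String term, int numberOfNodes) {
            // Math.floorMod keeps the result non-negative for negative hash codes.
            return Math.floorMod(term.hashCode(), numberOfNodes);
        }

        public static void main(String[] args) {
            int nodes = 4;
            String[][] triples = {
                {"employee:smithj", "rdf:type", "inetria:Employee"},
                {"employee:doej",   "rdf:type", "inetria:Employee"},
                {"employee:smithj", "inetria:works_at", "business:inetria"}
            };
            for (String[] t : triples) {
                // partitioning on the predicate means popular predicates such as
                // rdf:type concentrate on a single node, which is the data-skew
                // problem discussed in Section 2.2.1.2
                System.out.println(Arrays.toString(t) + " -> node " + nodeFor(t[1], nodes));
            }
        }
    }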
2.2.1.1 Forward Reasoning on Top of DHTs
Fang et al. [15] presented a system called DORS for forward reasoning on a subset of the OWL entailment rules. They compute the closure of the schema triples using an off-the-shelf reasoner. The schema triples are then replicated across all the nodes and each node performs reasoning on the instance triples assigned to it. Each node stores these triples in a local relational database and forwards the generated triples to other nodes. This process of triple exchange and reasoning continues until no further triples are sent through the network. Results are reported for datasets of up to 2 million triples of synthetic data using a 32-node system, and showed scalability problems due to excessive network traffic.
Similar work by Kaoudi et al. [61, 16] performed forward reasoning on top of DHTs using the ρdf fragment of RDFS. They show that forward reasoning on top of DHTs results in excessive network traffic due to redundant triple generation. Their approach also suffers from load balancing issues, as some terms are more popular than others, which results in extremely large workloads for some of the nodes compared to others. Although DHTs provide a loosely coupled network where nodes can join and leave, the approaches summarised previously suffer from such load balancing issues.
2.2.1.2 Alternatives to Term Based Partitioning
Kotoulas et al. [19] and Oren et al. [62] have shown that RDF datasets can exhibit a high level of data skew due to the popularity of some terms, such as rdf:type. This leads to load balancing issues when fixed term partitioning is used, as some nodes are overloaded with work compared to others. Oren et al. [62] propose a divide-conquer-swap algorithm called SpeedDate that clusters similar terms around a region of peers rather than at a particular peer. They use an in-memory implementation, which speeds up the reasoning process considerably, and report results for up to 200M triples using 64 nodes. However, this approach does not provide a permanent storage mechanism for the generated in-memory data. Also, due to the nature of the algorithm used, the system has to be stopped manually when no new triples are generated.
Kulahcioglu and Bulut [55] propose a two-step method to provide a schema sensitive, RDFS specific term-based partitioning. They show that, compared to standard term partitioning, they can eliminate non-productive partitions accounting for up to 45% of the dataset. They analyse each of the ρdf entailment rules and identify the term used by the rule for reasoning. This approach avoids using popular terms such as rdf:type for DHT partitioning if the term is not used in the reasoning process. Subsequently, they implemented a forward reasoning approach [63] and evaluated it using up to 14M triples. However, no details are provided on loading the data onto the cluster or on the storage implications of larger datasets.
2.2.2 Grid and Parallel Computing
Work utilising dedicated computer clusters has mainly focused on large scale forward
reasoning over datasets that are 1 billion triples or more. In this section a brief review
of such related work is provided.
2.2.2.1 MapReduce Based Reasoning
Urbani et al. [64, 65, 20] have focused on utilising the MapReduce programming model [17] for forward reasoning. They presented Web-scale inference under the RDFS and OWL Horst [48] semantics by implementing a system called WebPIE on top of the Hadoop MapReduce framework. They focus on achieving high throughput, reporting the closure of a billion triples of real-world datasets and 100 billion triples of the synthetic LUBM [66] dataset. Their approach showed that high throughput can be achieved by utilising only 16 nodes. However, they attributed the reduction in performance when increasing the number of nodes beyond 16 to platform overhead related to the Hadoop framework.

For RDFS reasoning, they find that a naive, straightforward implementation suffers from a number of issues, including the derivation of duplicates, the need for joins with schema triples and fixed point iterations. They provide optimisations for these issues, loading the schema triples in memory due to the low ratio of schema to instance triples. Duplication is avoided by grouping triples by subject and then executing a join over a single group. Other optimisations are applied to rules with two antecedents, which require a join over parts of the data.
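Two of these optimisations, keeping the small schema in memory and grouping intermediate results by a key so that duplicates can be eliminated within each group, can be sketched outside Hadoop as a plain map-and-group pass. The following fragment is only an illustration of the idea and is considerably simpler than WebPIE's actual MapReduce jobs.

    import java.util.*;
    import java.util.stream.*;

    // Plain-Java illustration of two MapReduce-style optimisations: schema triples
    // are held in an in-memory map, and inferred triples are grouped by subject so
    // duplicates can be removed within each group. A sketch of the idea only.
    public class SchemaJoinSketch {

        public static void main(String[] args) {
            // schema: class -> superclasses (rdfs:subClassOf), kept in memory
            Map<String, List<String>> subClassOf = Map.of(
                "inetria:Employee", List.of("foaf:Person"),
                "inetria:Company",  List.of("foaf:Organization"));

            List<String[]> instanceTriples = List.of(
                new String[]{"employee:smithj", "rdf:type", "inetria:Employee"},
                new String[]{"employee:doej",   "rdf:type", "inetria:Employee"},
                new String[]{"employee:smithj", "rdf:type", "inetria:Employee"}); // duplicate input

            // "map": join each instance triple with the in-memory schema, keyed by subject
            Map<String, Set<String>> inferredBySubject = instanceTriples.stream()
                .filter(t -> t[1].equals("rdf:type"))
                .flatMap(t -> subClassOf.getOrDefault(t[2], List.of()).stream()
                    .map(superClass -> Map.entry(t[0], superClass)))
                // "reduce": grouping by subject into a Set removes duplicates per group
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                         Collectors.mapping(Map.Entry::getValue, Collectors.toSet())));

            inferredBySubject.forEach((subject, classes) ->
                classes.forEach(c -> System.out.println(subject + " rdf:type " + c)));
        }
    }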
They address the issue of storing the large amount of inferred data by using an upfront dictionary encoding [67] based on MapReduce to reduce the data size, as will be discussed in Section 2.4. It is worth noting that although this MapReduce approach has achieved very high throughput compared to other systems, it is not without challenges. Performing rule based reasoning using MapReduce jobs requires complex and non-trivial optimisations, which makes this approach difficult to extend to richer logics such as OWL 2 RL. Additionally, the output of this approach is stored in flat files, which adds the further challenge of providing SPARQL query support.
2.2.2.2 Spark Based Reasoning
To address the issues with the batch processing nature of MapReduce, Jagvaral and Park [68] propose an approach based on Spark [69]. Spark is a cluster computing framework that utilises parallel data structures where intermediate results can be stored in memory or on disk. Additionally, Spark enables users to manage their workload partitioning strategy in order to optimise the processing. They show that such an approach is faster than MapReduce by using a cluster of 8 machines, each with 8 cores and 93GB of memory. However, no details are provided as to the preprocessing or the loading process required to get the data onto the cluster, or provision for long term storage.
2.2.2.3 Authoritative Distributed Reasoning
Hogan et al. [70, 18] have focused on authoritative OWL reasoning over linked data [71] crawled from the Web. They provide incomplete OWL reasoning by selecting a fragment of OWL and then only considering 'authoritative sources' to counteract what they call 'ontology hijacking'. They use two scans over the data: the first scan separates and filters the schema triples from the instance triples. The schema data is then stored in memory throughout the reasoning process due to its small size compared to the instance data; from the sample data used, it was found that the size of the schema data is less than 2% of the overall statements. The instance data is stored on disk and accessed through the second scan, during which they use on-disk sorts and file scans.
Hogan et al. also presented a template rule optimisation for distributed, rule-based reasoning [18]. Instead of simply specifying a list of template rules, they use the schema data to create a generic template rule function that encodes the schema triples themselves into a set of new templated rules. This approach eliminates the need to repeatedly access the schema patterns during the instance reasoning process. The processing is distributed by using the Java Remote Method Invocation (RMI) architecture and flooding all nodes with the linear template rules. The only communication required between the nodes is to aggregate the schema data and to create a shared template rule index. They reason over 1.12 billion triples of linked data in 3.35 hours using 8 nodes, inferring 1.58 billion triples.
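The templated rule idea can be pictured as compiling the schema into an index that maps an instance triple pattern directly to its conclusions, so that the schema never needs to be rescanned. The sketch below illustrates that idea only; the index layout and names are assumptions made for the example and do not reflect Hogan et al.'s implementation.

    import java.util.*;

    // Illustration of compiling schema triples into "templated rules": the schema
    // is turned into an index keyed by the pattern an instance triple must match,
    // so reasoning over instance triples becomes a single lookup per triple rather
    // than a scan over the schema. An assumption-laden sketch only.
    public class TemplatedRulesSketch {

        public static void main(String[] args) {
            List<String[]> schema = List.of(
                new String[]{"inetria:Employee", "rdfs:subClassOf", "foaf:Person"},
                new String[]{"inetria:works_at", "rdfs:range", "inetria:Company"});

            // compile: "type|C" or "pred|p" -> conclusion templates
            Map<String, List<String[]>> index = new HashMap<>();
            for (String[] s : schema) {
                if (s[1].equals("rdfs:subClassOf"))
                    // rdfs9 template: ?x rdf:type s[0]  =>  ?x rdf:type s[2]
                    index.computeIfAbsent("type|" + s[0], k -> new ArrayList<>())
                         .add(new String[]{"?x", "rdf:type", s[2]});
                else if (s[1].equals("rdfs:range"))
                    // rdfs3 template: ?x s[0] ?y  =>  ?y rdf:type s[2]
                    index.computeIfAbsent("pred|" + s[0], k -> new ArrayList<>())
                         .add(new String[]{"?y", "rdf:type", s[2]});
            }

            String[] instance = {"employee:smithj", "inetria:works_at", "business:inetria"};
            String key = instance[1].equals("rdf:type") ? "type|" + instance[2] : "pred|" + instance[1];
            for (String[] template : index.getOrDefault(key, List.of())) {
                String subject = template[0].equals("?x") ? instance[0] : instance[2];
                System.out.println(subject + " " + template[1] + " " + template[2]);
            }
        }
    }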
2.2.2.4 Embarrassingly Parallel Reasoning
Weaver and Hendler [21] presented an “embarrassingly parallel” algorithm for RDF processing. An embarrassingly parallel [72] algorithm is one whose tasks can easily be divided into independent processes that can be executed concurrently without any dependencies or communication between them. In this work, two types of supercomputers were used: one with four 16-core AMD Opteron 6272 processors and 512 GB of RAM, and an IBM Blue Gene/Q (http://www-03.ibm.com/systems/technicalcomputing/solutions/bluegene/), in which each node has 16 cores at 1.6 GHz and 16 GB of RAM. The system used gigabit ethernet and InfiniBand [73] interconnects and utilised a parallel file system. The data is preprocessed beforehand by carrying out compression and dictionary encoding, as will be discussed in Section 2.4.

For the reasoning process they utilise up to 128 processes, replicate the schema data on all of them and then partition the instance data between them to execute independently. The reasoning process utilises a Message Passing Interface (MPI) architecture and applies a number of rules in a particular order to derive new triples. This process continues until no further data is inferred. They reported results on the closure of up to 346 million triples of synthetic LUBM datasets in ≈ 291 seconds. One of the disadvantages of their approach is that the results of each process are written to separate files, leading to duplicates. Additionally, similar to the approach by Urbani et al. [20] mentioned previously, the output is stored in flat files which are yet to be imported into a system that supports querying.
2.2.3 Centralised Systems
Systems based on a single high specification machine include BigOWLIM [56], a proprietary large triple store that supports the full management of RDF, OWL Horst and a subset of OWL 2 RL ontologies. When the data is loaded, BigOWLIM performs rule based forward reasoning upfront; once reasoning is done, the data is stored to provide SPARQL query support. In terms of scalability and resilience, BigOWLIM exhibits a centralised architecture and vertical scalability. BigOWLIM provides a replication cluster with a number of master and slave nodes. In this architecture the data is replicated across all the nodes and query answering is load balanced between these nodes. The practical limit of BigOWLIM is around 20 billion statements on a node with 64GB RAM; reported results include reasoning over 12 billion triples of synthetic data using such a node in around 290 hours.
2.3 RDF Data Management on the Cloud
Most of the Semantic Web work that utilises cloud computing services is related to RDF data storage and management [74, 75] rather than to performing forward reasoning. Aranda-Andújar et al. have presented AMADA, a platform for storing RDF graphs on the Amazon Web Services (AWS) [27] cloud, utilising a number of AWS services to index, store and query RDF data. Bugiotti et al. [76] have focused on providing RDF data management in the AWS cloud by utilising SimpleDB [77], a key-value store provided by AWS for small data items. They presented and assessed a number of indexing strategies and their analytical cost models for RDF data stored in SimpleDB, for the purpose of providing query answering using an out-of-the-box query processor.
Stein and Zacharias [78] presented Stratustore, an RDF store based on the Amazon cloud that stores and indexes triples in SimpleDB and integrates with the Jena API (https://jena.apache.org/) to provide SPARQL query support. They highlighted the limited expressiveness of SimpleDB for this purpose and point to the need for more complex cloud based database services. Kritikos et al. [79] have presented a cloud based architecture for the management of geospatial linked data. They utilised an approach based on the autoscaling capabilities provided by Amazon Web Services to scale up virtual machines running the Virtuoso [57] triple store, with the primary focus on providing SPARQL endpoints.
2.4 RDF Data Compression
As noted in Section 2.1, RDF datasets are comprised of triples, each containing three terms (subject, predicate and object) that are represented as strings. When dealing with large datasets, these string representations occupy many bytes and take a large amount of storage space; this is particularly true for datasets in N-Triples format that have long URI references (URIrefs) or literals. Additionally, there are increased network latencies when transferring such data over the network. Although Gzip compression can be used to compress RDF datasets, it is difficult to parse and process these datasets without decompressing them first, which imposes a computational overhead. There is, therefore, a need for a compression mechanism that maintains the semantics of the data; consequently, many large scale reasoners such as BigOWLIM [56] and WebPIE [20] adopt dictionary encoding. Dictionary encoding encodes each of the unique URIs in an RDF dataset using numerical identifiers, such as integers that occupy only 8 bytes each.
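A minimal, single machine dictionary encoder can be sketched as follows: each distinct term is assigned the next 64-bit identifier on first occurrence, and a triple of strings becomes three long values. The class and method names are illustrative only; the distributed encoders reviewed below differ mainly in how this assignment is partitioned across nodes.

    import java.util.*;

    // Minimal single-machine dictionary encoder: every distinct term is assigned
    // a 64-bit identifier on first occurrence, so a triple of (possibly long) URI
    // strings is reduced to three 8-byte values. Illustrative only.
    public class DictionaryEncoderSketch {

        private final Map<String, Long> dictionary = new HashMap<>();
        private long nextId = 1L;

        long encode(String term) {
            return dictionary.computeIfAbsent(term, t -> nextId++);
        }

        long[] encode(String[] triple) {
            return new long[]{encode(triple[0]), encode(triple[1]), encode(triple[2])};
        }

        public static void main(String[] args) {
            DictionaryEncoderSketch encoder = new DictionaryEncoderSketch();
            String[][] triples = {
                {"http://inetria.org/directory/employee/smithj",
                 "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
                 "http://inetria.org/directory/schema#Employee"},
                {"http://inetria.org/directory/employee/smithj",
                 "http://xmlns.com/foaf/0.1/name", "\"John Smith\""}
            };
            for (String[] t : triples)
                System.out.println(Arrays.toString(encoder.encode(t)));
            // repeated terms (e.g. the subject) map to the same identifier
            System.out.println(encoder.dictionary);
        }
    }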
2.4.1 MapReduce-based Dictionary Encoding
In order to perform large scale reasoning, Urbani et al. adopt an upfront dictionary encoding [67] based on MapReduce to reduce the data size. The creation of the dictionary and the encoding of the data are distributed between a number of nodes running the Hadoop framework. Initially, the most popular terms are sampled and encoded into a dictionary table; since these are small, this dictionary is held in main memory on each of the nodes. The system then deconstructs the statements and encodes each of the terms whilst building up the dictionary table. To avoid a clash of IDs, each node assigns IDs from the range of numbers allocated to it: the first 4 bytes of the identifier store the identifier of the task that processed the term, and the last 4 bytes are used as an incremental counter within the task.
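The identifier layout described above, a task identifier in the upper 4 bytes and a local counter in the lower 4 bytes, can be sketched with simple bit operations; the method names below are illustrative only.

    // Sketch of a composite 64-bit identifier: the upper 32 bits hold the identifier
    // of the task (or node) that encoded the term and the lower 32 bits hold that
    // task's local counter, so concurrently assigned identifiers never clash.
    public class CompositeIdSketch {

        static long composeId(int taskId, int localCounter) {
            return ((long) taskId << 32) | (localCounter & 0xFFFFFFFFL);
        }

        static int taskIdOf(long id) {
            return (int) (id >>> 32);   // upper 32 bits
        }

        static int counterOf(long id) {
            return (int) id;            // lower 32 bits
        }

        public static void main(String[] args) {
            long id = composeId(7, 42); // task 7, the 42nd term encoded by that task
            System.out.printf("id=%d task=%d counter=%d%n", id, taskIdOf(id), counterOf(id));
        }
    }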
A similar approach was adopted for decompression, and experiments were conducted with a number of settings such as the popular-term cache. They report the compression of 1.1 billion triples of the LUBM [66] dataset in 1 hour and 10 minutes, with a 1.9 GB dictionary. Although this approach is scalable both in terms of input and computing nodes, the generated dictionaries take more than an hour to build and are in most cases larger than 1 GB. This is a challenge when considering loading the whole dictionary into main memory, and imposes an IO overhead as the dictionary file otherwise needs to be searched on disk. These large dictionaries are due to the fact that no special consideration is given to the common parts of the URIs, such as namespaces, hence these namespaces are included numerous times in the dictionary.
2.4.2 Supercomputers-based Dictionary Encoding
Weaver and Hendler [21] use a parallel dictionary encoding approach on the IBM Blue Gene/Q by utilising the IBM General Parallel File System (GPFS). Due to disk quota restrictions, they perform LZO [80] compression on the datasets before the dictionary encoding. LZO is a fast block compression algorithm that enables the compressed data to be split into blocks; this feature is utilised so that processes can directly operate on the compressed data blocks. Subsequently, the compressed file blocks are partitioned equally between the processors, which collectively access the file and start encoding the data. The encoded data is written to separate output files, one for each processor; additionally, when encoding the data, processors communicate with each other using MPI to resolve the numeric identifier for each of the terms. For the dictionary encoding of 1.3 billion triples of the LUBM dataset [81], they report a total runtime of approximately 28 minutes when utilising 64 processors, and 50 seconds when utilising 32,768 processors. Both reported runtimes exclude the time required to perform the LZO compression on the datasets. The total size reported is 23 GB for the dictionary and 29.8 GB for the encoded data.
2.4.3 DHT-based Dictionary Encoding
A dictionary encoding based on a DHT network is presented by Kaoudi et al. [82] to provide efficient encoding for SPARQL queries. In this approach the dictionary numerical IDs are composed of two parts, a unique peer identifier and a local numerical counter. When new triples are added to the network, they are encoded by the receiving peers and are resent through the network alongside their dictionary entries. As noted previously, this approach further exacerbates the network congestion issue known with DHTs, due to the traffic of not only the triples but also their encodings.
2.4.4 Centralised Dictionary Encoding
A comparison of RDF compression approaches is provided by Fernández et al. [83]. They compare three approaches, namely gzip compression, adjacency lists and dictionary encoding. Adjacency lists concentrate the repeatability of some of the RDF statements and achieve high compression rates when the data is further compressed using gzip. They also show that datasets with a large number of URIs that are named sequentially can result in a dictionary that is highly compressible. Additionally, it was shown that dictionary encoding of literals can increase the size of the triple representation, especially when the dataset contains a variety of literals, and they hence conclude that literals need finer-grained approaches.

A dictionary approach for the compression of long URIRefs in RDF/XML documents was presented by Lee et al. [84]. The compression is carried out in two stages: firstly, the namespace URIs in the document are dictionary encoded using numerical IDs, then any of the URIRefs are encoded by using the URI ID as a reference. Two dictionaries are thus created, one for the URIs and another for the URIRefs. The encoded data is then compressed further by using an XML specific compressor. Although this approach shows compression rates that are up to 39.5% better than Gzip, it is primarily aimed at compacting RDF/XML documents rather than providing an encoding that both reduces the size and enables the data to be processed in compressed form.
2.5 The Challenges
As seen from the previous sections, large scale RDFS reasoning is mainly concerned with the distribution of work amongst a number of nodes or processes. Some of the challenges that existing work tries to address are: 1. an efficient strategy for workload distribution that deals with data skew and dependencies, 2. a shared storage to accommodate existing and generated data, and 3. dictionary encoding to reduce the size of massive datasets.

Additionally, the majority of large scale distributed RDF processing and reasoning approaches utilise high specification, dedicated computer clusters [19, 20, 21] that require large upfront infrastructure investments. An additional restriction with the approaches surveyed in this section relates to the number of processing nodes being used. The number of computing nodes is pre-setup with the required applications and