Design of a Vertical Search Engine for Synchrotron Data: a Big Data Approach using Hadoop Ecosystem

Ali Khaleghi1 [0000-0003-4944-3585], Kamran Mahmoudi2 [0000-0002-1749-7354], and Sonia Mozaffari3 [0000-0003-1770-2976]

Imam Khomeini International University, Qazvin, Iran
1 akhaleghi@eng.ikiu.ac.ir
2 kmahmoudi@edu.ikiu.ac.ir
3 soniamozaffari@edu.ikiu.ac.ir
Abstract. A synchrotron, as an experimental physics facility, provides the opportunity for multidisciplinary research and collaboration between scientists in various fields of study such as physics and chemistry. During the construction and operation of such a facility, valuable data regarding the design of the facility, its instruments, and the experiments conducted there are published and stored. Researchers spend a long time sifting through results from general-purpose search engines to find the scientific information they need, so a domain-specific search engine can help them find their desired information with greater precision. It also provides the opportunity to use the crawled data to create a knowledge base and to generate different datasets required by researchers. Several other vertical search engines have been designed for scientific data search, for example in medical information retrieval. In this paper we propose the design of such a search engine on top of the Apache Hadoop framework. The Hadoop ecosystem provides the necessary features of scalability, fault tolerance, and availability. It also abstracts away the complexities of search engine design by offering different open-source tools as building blocks, among them Apache Nutch for the crawling block and Apache Solr for indexing and query processing.
Keywords: Synchrotron, Search Engine, Information Retrieval, Big Data, Hadoop, Solr, Nutch.
1 Introduction
1.1 Particle Accelerator and Synchrotron
Experimental physicists working in different fields of study conduct experiments in a variety of laboratories. A synchrotron, as an experimental physics facility, enables scientists to conduct experiments and study materials at the nanoscale. Synchrotron radiation has become an essential part of research in multiple disciplines that depend on a light source for their studies [1]. The radiation produced by a synchrotron can be used to study samples with high precision through various experiments in physics, chemistry, biology, medicine, and other fields.
Synchrotron experiments fall into three major categories: spectroscopy, imaging, and scattering [2].
1.2 Synchrotron Data
Synchrotron experiments are conducted at beamlines, where the synchrotron radiation is directed onto the sample; several detectors then generate the experiment data, which is stored in the data center for further analysis. Depending on the category of the experiment, a notable amount of data is generated, especially in the imaging category. The data generated by the detectors of the CERN particle accelerator was expected to reach around 50 to 70 terabytes in 2018 [3]. This huge amount of data covers only the experiments at the LHC, one part of the synchrotron community. The use and development of Big Data tools is therefore essential for the analysis of such data. One of the important needs of scientists working at synchrotrons is the ability to search various datasets and documents related to their specific topic of research. In this paper we propose a design for a domain-specific search engine as a solution to this need.
Synchrotrons share many documents and data related to their design and experiments. These documents are useful for scientists who are designing such facilities, such as the researchers at the Iranian Light Source Facility (ILSF). The data published on synchrotron websites is also useful for beamline scientists and for the scientific directors of other synchrotrons around the world. One of the main problems is the difficulty of finding the desired information, and in this paper we propose the design of a domain-specific search engine to address this need. The rest of the paper is organized as follows: Section 2 presents related work, Section 3 presents our methodology, and Section 4 holds our conclusions.
2 Background
A domain-specific search engine creates a searchable index of content related to a particular subject. Given the huge amount of data published on the internet, such search engines can find more accurate results more easily than a general search engine. Section 2.1 surveys several use cases for which domain-specific search engines have been designed. Section 2.2 then presents the architecture of a search engine and its main modules, and introduces HVSE, a vertical search engine built on the Hadoop framework with an approach similar to that of the current paper.
2.1 Domain Specific Search Engines
Widyantoro and Yen introduced a domain-specific search engine for searching the abstracts of academic papers, using a fuzzy ontology for query refinement [4]. For searching academic papers in the field of particle accelerators, CERN has used the FAST search engine, introduced in 2007 and owned by Microsoft since 2008; it enables researchers to eliminate unwanted results using full Boolean queries [5]. As another example of domain-specific search, Mišutka and Galamboš [6] proposed a method for searching mathematical content that can be adopted by any full-text search engine.
Researchers use Google Scholar to find academic papers, but an important limitation of Google Scholar is the lack of a custom search technique. In [7] a domain-specific search engine is introduced that uses a new search methodology called the n-paged-m-items partial crawling algorithm, a faster real-time algorithm; the authors report better performance than Google Scholar.
Because most medical queries are long and finding relevant results is difficult, Luo et al. proposed a specialized search engine for medical information, called MedSearch, to simplify medical search. It splits long queries into several short ones and then finds more relevant results for them [8].
2.2 Search Engine Design
A vertical search engine called HVSE has been proposed in [9], in which the authors improve topic-oriented web crawler algorithms and develop a search engine based on the Hadoop platform. On the distributed Hadoop platform, this search engine achieves higher efficiency on massive amounts of data because the Hadoop cluster can be expanded.
The architecture of a search engine consists of four main parts, as shown in Fig. 1. The first part is the crawler, which is responsible for collecting data from web pages. The second is the indexer, which creates a searchable index of the collected raw data. The third is the query parser, which parses the user's input query and retrieves the related information. The fourth and last part is the user interface, which can take the form of a web application or mobile app that facilitates the search and shows the results to the end user.
Fig. 1. Architecture of a search engine
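To make this division of responsibilities concrete, the following minimal Python sketch models the four parts as interfaces. All class and method names here are our own illustration, not part of any of the tools discussed later in the paper.

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Document:
    """A single crawled page: its address and extracted text."""
    url: str
    text: str


class Crawler(Protocol):
    def fetch(self, seeds: list[str]) -> list[Document]:
        """Collect raw documents from the web, starting from seed URLs."""
        ...


class Indexer(Protocol):
    def index(self, docs: list[Document]) -> None:
        """Build a searchable index over the collected documents."""
        ...


class QueryParser(Protocol):
    def search(self, query: str) -> list[Document]:
        """Parse the user's query and retrieve related documents."""
        ...


def answer_query(crawler: Crawler, indexer: Indexer, parser: QueryParser,
                 seeds: list[str], query: str) -> list[Document]:
    """What the fourth part, the user interface, ultimately calls."""
    indexer.index(crawler.fetch(seeds))
    return parser.search(query)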
3 Methodology
As mentioned in the previous sections, one of the problems of scientists working in synchrotron facilities is finding documents and datasets related to their research. It is difficult to use a general search engine to find specific scientific data, and to our knowledge no domain-specific search engine has been created for the field of particle accelerator physics. Synchrotron information comes from various sources, most of which are publicly accessible through the websites of different light sources and laboratories. Other sources of valuable information are the facilities' data centers, which store experimental data; in addition, each facility has its own information system for storing the status of devices and infrastructure.
In this section we propose the architecture of a domain-specific search engine that can be used as a solution to the above issue. The architecture uses the Apache Hadoop framework and HDFS as the basis of the search engine, with Apache Nutch and Apache Solr deployed over Hadoop for crawling and indexing respectively, as shown in Fig. 2.
Fig. 2. Architecture of proposed search engine
3.1 Hadoop Ecosystem
Hadoop is an Apache project developed in Java as a framework for processing Big Data; it implements a distributed file system (HDFS) and the MapReduce programming model [10]. In our architecture, Hadoop HDFS is used by Apache Solr to store the index of the documents retrieved by Apache Nutch, an open-source web crawler. With Nutch, you can easily create your own search engine and customize it to your needs [11].
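As an illustrative sketch, a Nutch 1.x crawl that sends its results to Solr could be launched from Python as below. The script flags, directories, and Solr core name are assumptions that vary with the Nutch version and deployment; consult the Nutch documentation for the exact invocation.

import subprocess

# Assumed Solr endpoint; "synchrotron" is a hypothetical core name.
SOLR_URL = "http://localhost:8983/solr/synchrotron"

subprocess.run(
    [
        "bin/crawl",
        "-i",                                 # index fetched pages into Solr
        "-D", f"solr.server.url={SOLR_URL}",  # where Nutch sends documents
        "urls/",                              # directory holding seed URLs
        "crawl/",                             # crawl database directory
        "2",                                  # number of crawl rounds
    ],
    check=True,
)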
3.2 Crawler Module
Apache Nutch, an extensible and scalable web crawler, is used for crawling synchrotron websites to collect the required data. It can be configured to use HDFS, which provides the scalability needed to store huge amounts of data on the distributed file system. Nutch has a number of plug-ins for processing a variety of file types such as plain text, XML, OpenDocument, Microsoft Office, PDF, and RTF. This feature is very useful, since it covers most synchrotron data formats.
Most researchers implement their own web crawlers when it comes to creating a focused crawler. Supervised and semi-supervised machine learning techniques are used to improve the retrieval of relevant documents, but there is little literature on the performance of such crawlers at web scale. A minimal sketch of the focused-crawling idea follows.
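The sketch below illustrates the general idea rather than any particular published algorithm: it crawls outward from seed pages and only expands the frontier from pages whose text passes a simple keyword relevance test. A real focused crawler would replace the naive scoring function with a trained classifier; the topic terms, threshold, and page limit are illustrative assumptions.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Hypothetical topic vocabulary for a synchrotron-focused crawl.
TOPIC_TERMS = {"synchrotron", "beamline", "accelerator", "light source"}


class LinkExtractor(HTMLParser):
    """Collects href targets and visible text from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []
        self.text_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def relevance(text: str) -> float:
    """Naive stand-in for a trained relevance classifier."""
    lowered = text.lower()
    return sum(term in lowered for term in TOPIC_TERMS) / len(TOPIC_TERMS)


def focused_crawl(seeds: list[str], max_pages: int = 50,
                  threshold: float = 0.25) -> list[str]:
    frontier = deque(seeds)
    seen = set(seeds)
    relevant_pages: list[str] = []
    while frontier and len(relevant_pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        parser = LinkExtractor()
        parser.feed(html)
        if relevance(" ".join(parser.text_parts)) >= threshold:
            relevant_pages.append(url)
            # Only expand the frontier from pages judged on-topic.
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
    return relevant_pages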
3.3 Indexing and Query Parsing Modules
To index the fetched documents, Nutch can be configured to send them to Apache Solr, which is built on the Apache Lucene information retrieval library. Lucene uses a vector space model to score documents against queries, and it supports a variety of query types such as fielded terms with boosts, wildcards, fuzzy matching (using Levenshtein distance), proximity searches, and Boolean operators.
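As background, in the vector space model both the query q and each document d are represented as weight vectors over the index terms, and a document is scored by the cosine of the angle between the two vectors:

\[ \mathrm{sim}(q,d) \;=\; \frac{\vec{q}\cdot\vec{d}}{\lVert\vec{q}\rVert\,\lVert\vec{d}\rVert} \;=\; \frac{\sum_{t} w_{t,q}\, w_{t,d}}{\sqrt{\sum_{t} w_{t,q}^{2}}\,\sqrt{\sum_{t} w_{t,d}^{2}}} \]

where the weights are typically TF-IDF values, \( w_{t,d} = \mathrm{tf}_{t,d}\cdot \log(N/\mathrm{df}_{t}) \) for a collection of N documents. Lucene's practical scoring refines this basic formula with document-length normalization and boost factors, and recent Lucene versions default to the related BM25 scheme.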
3.4 User Interface
Once data has been collected by Apache Nutch and an index has been generated, users can describe their intent in the form of a query consisting of several keyword terms. The user's input query is then sent to the search engine to retrieve the related information, through a user interface that captures the query input. Apache Solr is accessed by sending an HTTP request containing the user query in the form of fields and values; Solr runs the query and returns the results in JSON format. This enables UI designers to build a variety of applications for end users on different platforms, such as web pages or mobile applications.
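As a concrete sketch of this interface, the Python snippet below sends a query to a Solr core over HTTP and reads the JSON response. The host, port, core name, and field names are assumptions for illustration and depend on the deployed schema.

import requests

# Assumed Solr endpoint; "synchrotron" is a hypothetical core name.
SOLR_SELECT = "http://localhost:8983/solr/synchrotron/select"

params = {
    "q": 'title:"insertion device" OR content:undulator~1',  # fielded + fuzzy query
    "fl": "url,title,score",  # fields to return (assumed schema)
    "rows": 10,               # number of results
    "wt": "json",             # response format
}

response = requests.get(SOLR_SELECT, params=params, timeout=10)
response.raise_for_status()

# Solr wraps the hits in a "response" object with a "docs" list.
for doc in response.json()["response"]["docs"]:
    print(doc.get("title"), doc.get("url"))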
4 Conclusion
Due to the vast amount of information published on the web, general search engines are less effective when searching for very specific data such as scientific information. Data integration methods such as web data integration can provide means for running complex queries over various integrated data sources; here, instead, we propose a search engine for indexing the web pages of different synchrotrons for keyword search. Vertical crawlers are one way to provide an easy and precise search tool. Various vertical search engines exist for searching scientific papers, medical records, and so on, but to our knowledge no vertical search engine has been developed for accelerator physics and the synchrotron community. In this paper we proposed a domain-specific search engine to be used by scientists working in particle physics and related disciplines at experimental physics facilities to find their desired information and datasets with more precision. We reviewed the literature on designing such search engines for various applications, then presented our proposed architecture using open-source tools developed by the Apache Software Foundation. We use Apache Nutch for crawling different synchrotron data sources, such as synchrotron websites and their local repositories. By deploying an Apache Solr instance we are able to index the collected data with Apache Lucene and run queries against it. These tools are configured to run over a Hadoop cluster, which provides the scalability and fault tolerance required in the design of a search engine. As future work we will implement the proposed design on a three-node Hadoop cluster with a total of 200 CPU cores, 280 GB of memory, and 9 TB of disk space, to be used by scientists at the Iranian Light Source Facility (ILSF) and other laboratories of the synchrotron community worldwide.
References
1. Rahighi, J., et al.: ILSF, a third generation light source laboratory in Iran. TUOAB202, IPAC'13 (2013).
2. Alizada, S., Khaleghi, A.: The study of Big Data tools usages in synchrotrons. In: Proceedings of the 16th Int. Conf. on Accelerator and Large Experimental Control Systems (ICALEPCS 2017), Barcelona, Spain (2017).
3. CERN About, http://wlcg-public.web.cern.ch/about, last accessed 2018/12/5.
4. Widyantoro, D.H., Yen, J.: A fuzzy ontology-based abstract search engine and its user studies. In: The 10th IEEE International Conference on Fuzzy Systems, vol. 3. IEEE (2001).
5. Particle accelerator conference proceedings, https://accelconf.web.cern.ch/accelconf/JACoW/proceedingsnew.htm, last accessed 2018/12/5.
6. Mišutka, J., Galamboš, L.: Extending full text search engine for mathematical content. In: Towards Digital Mathematics Library, Birmingham, United Kingdom, pp. 55-67 (2008).
7. Saha, T.K., Ali, A.B.M.S.: Domain specific custom search for quicker information retrieval. International Journal of Information Retrieval Research (IJIRR) 3(3), 26-39 (2013).
8. Luo, G., Tang, C., Yang, H., Wei, X.: MedSearch: a specialized search engine for medical information retrieval. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), Napa Valley, California, USA (2008).
9. Lin, C., Ma, Y.: Design and implementation of vertical search engine based on Hadoop. In: 2016 Eighth International Conference on Measuring Technology and Mechatronics Automation (ICMTMA). IEEE (2016).
10. Zalte, S.A., Takate, V.R., Chaudhari, S.R.: Study of distributed file system for Big Data. International Journal of Innovative Research in Computer and Communication Engineering 5(2), 1435-1438 (2017).
11. Laliwala, Z., Shaikh, A.: Web Crawling and Data Mining with Apache Nutch. Packt Publishing Ltd., UK (2013).