Chapter

A Hybrid Supervised/Unsupervised Machine Learning Approach to Classify Web Services


Abstract

Reusing software is a promising way to reduce software development costs. Nowadays, applications compose available web services to build new software products, and service composition faces the challenge of proper service selection. This paper presents a model for classifying web services. The service dataset was collected from ProgrammableWeb, a well-known public service registry. The results were obtained by breaking service classification into a two-step process. First, web service data pre-processed with Natural Language Processing (NLP) techniques were clustered using the agglomerative hierarchical clustering algorithm. Second, several supervised learning algorithms were applied to determine service categories. The findings show that the hybrid approach combining hierarchical clustering and SVM provides acceptable results in comparison with other unsupervised/supervised combinations.
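
For illustration, a minimal sketch of such a two-step pipeline in Python with scikit-learn follows. The toy descriptions, the cluster count, and the choice to append cluster ids as an extra feature for the SVM are assumptions for illustration; the paper's exact pre-processing and hyperparameters are not reproduced here.

```python
# Minimal sketch of the two-step hybrid pipeline from the abstract, using
# scikit-learn. Descriptions, cluster count, and the decision to append the
# cluster id as an extra feature are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

descriptions = [
    "send and receive SMS messages worldwide",
    "programmable voice calls and phone numbers",
    "real-time currency exchange rates",
    "stock market quotes and historical prices",
]
categories = ["Telephony", "Telephony", "Finance", "Finance"]

# Step 0: NLP pre-processing (here reduced to TF-IDF with stop-word removal).
X = TfidfVectorizer(stop_words="english").fit_transform(descriptions).toarray()

# Step 1: unsupervised grouping via agglomerative hierarchical clustering.
clusters = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Step 2: supervised category prediction with an SVM; each service's cluster
# id is appended as one more feature.
X_hybrid = np.hstack([X, clusters.reshape(-1, 1)])
svm = LinearSVC().fit(X_hybrid, categories)
print(svm.predict(X_hybrid))
```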

... Such services need to be discovered, usually through a manual search by the developer. In SmartCLIDE, service discovery is assisted by an AI agent that receives as input the necessary functional information about the service that is going to be used, and goes through several sources (e.g., ProgrammableWeb) to identify fitting services [2], [3]. ...
... After the user sends a query with the service specification through a Natural Language Query Interface, the discovery will draw results from web pages, code repositories, and Service Registries by invoking third-party search APIs. SmartCLIDE can rewrite the provided user input query on the basis of indexed popular search queries, leveraging AI-based techniques, and display the identified services to the user as a ranked list [2], [3]. ...
Conference Paper
Nowadays the majority of cloud applications are developed based on the Service-Oriented Architecture (SOA) paradigm. Large-scale applications are structured as a collection of well-integrated services that are deployed in public, private, or hybrid clouds. Despite the inherent benefits that service-based cloud development provides, the process is far from trivial, in the sense that it requires the software engineer to be (at least) comfortable with the use of various technologies in the long cloud development toolchain: programming in various languages, testing tools, build/CI tools, repositories, deployment mechanisms, etc. In this paper, we propose an approach and corresponding toolkit (termed SmartCLIDE, as part of the results of an EU-funded research project) for facilitating SOA-based software development for the cloud, by extending a well-known cloud IDE from Eclipse. The approach aims at shortening the toolchain for cloud development, hiding the process complexity, and lowering the required level of knowledge from software engineers. The approach and tool underwent an initial validation by professional cloud software developers. The results underline the potential of such an automation approach, as well as the usability of the research prototype, opening further research opportunities and providing benefits for practitioners.
... Approaching these limitations requires the combination of different algorithms and Big Data analytics, such as combining a supervised ML model like Random Forest (RF) with non-supervised models such as Hierarchical Clustering (HCA) and Fuzzy Deformable Prototypes (FDP) to overcome the confidence and data complexity problems [38]. HCA has been used as a step prior to RF training to reduce features in very complex and high-dimensional datasets [41,42]. Additionally, and despite their limitations [30], black-box models have been used to interpret the predictions of complex ML algorithms, such as deducing the patterns learned by deep neural networks to understand how the algorithm works once trained and to detect biases [43]. ...
Article
Full-text available
Infectious diseases are a major threat to human and animal health worldwide. Artificial Intelligence (AI) algorithms, including Machine Learning (ML) and Big Data analytics, have emerged as a potential solution for analysing diverse datasets and facing the challenges posed by infectious diseases. In this commentary, we explore the potential applications and limitations of ML for the management of infectious diseases, considering challenges in key areas such as outbreak prediction, pathogen identification, drug discovery, and personalized medicine. We propose potential solutions to mitigate these hurdles and applications of ML to identify biomolecules for the effective treatment and prevention of infectious diseases. Beyond the management of infectious diseases, potential applications are based on catastrophic evolution events for the identification of biomolecular targets to reduce infectious disease risks, and on vaccinomics for the discovery and characterization of vaccine protective antigens using intelligent Big Data analytics techniques. These considerations set a foundation for developing effective strategies for managing infectious diseases in the future.
... In recent years, studies have focused on using AI-based techniques [18], [20], [23]. Although most earlier approaches used information extraction to obtain service features from WSDL [15], [21], [22], [24], REST has become the prevalent solution for providing web services and APIs [25]. In RESTful service implementations, service description text has become a significant feature for service classification. ...
Conference Paper
Developing software based on services is one of the most emerging programming paradigms in software development. Service-based software development relies on the composition of services (i.e., pieces of code already built and deployed in the cloud) through orchestrated API calls. Black-box reuse can play a prominent role when using this programming paradigm, in the sense that identifying and reusing already existing/deployed services can save substantial development effort. According to the literature, identifying reusable assets (i.e., components, classes, or services) is more successful and efficient when the discovery process is domain-specific. To facilitate domain-specific service discovery, we propose a service classification approach that can categorize services into an application domain, given only the service description. To validate the accuracy of our classification approach, we have trained a machine-learning model on thousands of open-source services and tested it on 67 services developed within two companies employing service-based software development. The study results suggest that the classification algorithm can perform adequately on a test set that does not overlap with the training set, thus being (with some confidence) transferable to other industrial cases. Additionally, we expand the body of knowledge on software categorization by highlighting sets of domains that constitute 'grey zones' in service classification.
Article
Full-text available
Background: Measles, a highly contagious viral infection, is resurging in the United States, driven by international importation and declining domestic vaccination coverage. Despite this resurgence, measles outbreaks are still rare events that are difficult to predict. Improved methods to predict outbreaks at the county level would facilitate the optimal allocation of public health resources. Objective: We aimed to validate and compare extreme gradient boosting (XGBoost) and logistic regression, 2 supervised learning approaches, to predict the US counties most likely to experience measles cases. We also aimed to assess the performance of hybrid versions of these models that incorporated additional predictors generated by 2 clustering algorithms, hierarchical density-based spatial clustering of applications with noise (HDBSCAN) and unsupervised random forest (uRF). Methods: We constructed a supervised machine learning model based on XGBoost and unsupervised models based on HDBSCAN and uRF. The unsupervised models were used to investigate clustering patterns among counties with measles outbreaks; these clustering data were also incorporated into hybrid XGBoost models as additional input variables. The machine learning models were then compared to logistic regression models with and without input from the unsupervised models. Results: Both HDBSCAN and uRF identified clusters that included a high percentage of counties with measles outbreaks. XGBoost and XGBoost hybrid models outperformed logistic regression and logistic regression hybrid models, with area under the receiver operating characteristic curve values of 0.920-0.926 versus 0.900-0.908, area under the precision-recall curve values of 0.522-0.532 versus 0.485-0.513, and F2 scores of 0.595-0.601 versus 0.385-0.426. Logistic regression or logistic regression hybrid models had higher sensitivity than XGBoost or XGBoost hybrid models (0.837-0.857 vs 0.704-0.735) but a lower positive predictive value (0.122-0.141 vs 0.340-0.367) and specificity (0.793-0.821 vs 0.952-0.958). The hybrid versions of the logistic regression and XGBoost models had slightly higher areas under the precision-recall curve, specificity, and positive predictive values than the respective models that did not include any unsupervised features. Conclusions: XGBoost provided more accurate predictions of measles cases at the county level compared with logistic regression. The threshold of prediction in this model can be adjusted to align with each county's resources, priorities, and risk for measles. While clustering pattern data from unsupervised machine learning approaches improved some aspects of model performance in this imbalanced data set, the optimal approach for the integration of such approaches with supervised machine learning models requires further investigation.
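
The hybrid setup described above, feeding unsupervised cluster labels to a supervised learner as extra input variables, can be sketched as follows. The data is synthetic, the HDBSCAN parameters and model settings are illustrative, and the xgboost package is assumed to be installed; nothing here reproduces the study's configuration.

```python
# Sketch: unsupervised cluster labels appended as an extra input variable for
# the supervised learners. Synthetic data; parameters are illustrative.
import numpy as np
from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                # stand-in for county-level features
y = (X[:, 0] + X[:, 1] > 1.2).astype(int)    # skewed positive class (~outbreak counties)

# Unsupervised step: cluster the counties; HDBSCAN labels noise points as -1.
labels = HDBSCAN(min_cluster_size=10).fit_predict(X)
X_hybrid = np.column_stack([X, labels])

# Supervised step: fit both model families on the hybrid feature set.
for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("xgboost", XGBClassifier(n_estimators=100, eval_metric="logloss")),
]:
    model.fit(X_hybrid, y)
    print(name, "training accuracy:", round(model.score(X_hybrid, y), 3))
```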
Article
In this survey, 60 research papers are reviewed based on various web data classification techniques, which are used for effective classification of web data and for measuring the semantic relatedness between two words. The web data classification techniques are classified into three types, namely the semantic-based approach, the search engine-based approach, and the WordNet-based approach, and the research issues and challenges confronted by the existing techniques are reported. Moreover, an analysis of the research works is carried out based on the categorized web data classification techniques, the datasets, and the evaluation metrics used. From the analysis, it is clear that the semantic-based approach is the most widely used technique for the classification of web data. Similarly, the Miller-Charles dataset is the most commonly used dataset, and evaluation metrics like precision, recall, and F-measure are widely utilized in web data classification. The insights from this manuscript can be utilized to understand various research gaps and problems in this area, which could be addressed in the future by developing novel optimization algorithms that might enhance the performance of web data classification.
Article
Full-text available
Over the last decades, web services have been used to perform specific tasks demanded by users. The most important task of a service classification system is to match an anonymous input service with the stored pre-classified web services. The most challenging issue is that web services are currently organized and classified according to syntax, while the context of the requested service is ignored. Motivated by this, a Cloud-based Classification Methodology is proposed that presents a new methodology based on semantic classification of web services. Furthermore, cloud computing is used not only for storing but also for allocating web services at large scale, with both high availability and accessibility. Fog technology is employed to reduce latency and speed up response time. The experimental results using the suggested methodology show better performance of the proposed system regarding both precision and accuracy in comparison with most of the methods discussed in the literature.
Article
Full-text available
One of the main assets of the Service-Oriented Architecture (SOA) is composition, which consists in developing higher-level services by re-using well-known functionality provided by other services in a low-cost and rapid development process. In this paper, we present IDECSE, a new integrated approach for composite service engineering. By considering semantic Web services, IDECSE addresses the challenge of fully automating classification, discovery, and composition while reducing development time and cost. The classification and discovery processes rely on adequate semantic similarity measures. Both semantic and syntactic descriptions are integrated through specific techniques for computing similarity measures between services. Formal Concept Analysis (FCA) is then used to classify Web services into concept lattices in order to facilitate the identification of relevant services. A graph-based semantic Web service composition process is proposed within the IDECSE framework. Using semantic similarities in grouping classes of services and in composing services shows a significant improvement compared to other approaches.
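
To make the FCA step concrete, the toy sketch below builds all formal concepts of a small, invented service context; each concept is a maximal (extent, intent) pair, and together they form the concept lattice. IDECSE's similarity measures and real contexts are not reproduced here.

```python
# Toy Formal Concept Analysis sketch: services (objects) described by
# capability keywords (attributes) organized into formal concepts. The
# context below is invented for illustration.
from itertools import chain, combinations

context = {
    "WeatherAPI":  {"forecast", "rest", "json"},
    "ClimateData": {"forecast", "rest", "xml"},
    "SmsGateway":  {"messaging", "rest", "json"},
}

def intent(objects):
    """Attributes shared by all given objects (derivation operator)."""
    if not objects:
        return set().union(*context.values())  # derivation of the empty set
    return set.intersection(*(context[o] for o in objects))

def extent(attrs):
    """Objects possessing all given attributes (dual derivation)."""
    return {o for o, a in context.items() if attrs <= a}

# Every concept has the form (extent(A'), A') for some object set A;
# brute-force enumeration over object subsets is fine at toy scale.
objs = list(context)
concepts = set()
for subset in chain.from_iterable(combinations(objs, r) for r in range(len(objs) + 1)):
    i = intent(set(subset))
    concepts.add((frozenset(extent(i)), frozenset(i)))

for ext, att in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(ext), "share", sorted(att))
```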
Article
Full-text available
Supervised machine learning studies have gained significance recently because of the increasing number of electronic documents available from different resources. Text classification can be defined as the task of automatically categorizing a group of documents into one or more predefined classes according to their subjects. The major objective of text classification is thus to enable users to extract information from textual resources, bringing together processes such as retrieval, classification, and machine learning techniques in order to classify different patterns. In text classification, term weighting methods assign suitable weights to specific terms to enhance classification performance. This paper surveys text classification, describes different term weighting methods, and compares different classification techniques.
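
As a concrete example of one widely used term weighting scheme, the sketch below computes TF-IDF weights by hand on a toy corpus; the idf = log(N / df) convention used here is one of several variants found across papers.

```python
# Hand-computed TF-IDF, one of the most common term weighting schemes.
import math
from collections import Counter

docs = [
    "web service discovery and classification",
    "text classification with machine learning",
    "semantic web service composition",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Weight each term by (term frequency) * (inverse document frequency)."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * math.log(N / df[t]) for t, c in tf.items()}

for doc in tokenized:
    print({t: round(w, 3) for t, w in tfidf(doc).items()})
```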
Conference Paper
Full-text available
How to classify and organize semantic Web services to help users find the services that meet their needs quickly and accurately is a key issue to be solved in the era of service-oriented software engineering. This paper makes full use of the solid mathematical foundation and stable classification efficiency of the naive Bayes classification method. It proposes a semantic Web service classification method based on naive Bayes theory and elaborates the concrete process of using the three stages of Bayesian classification to classify semantic Web services, taking service interface and execution capacity into consideration. Information gain theory is used to determine the classification influence of different features. Finally, experiments are used to validate the proposed method.
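
The two ingredients named above can be sketched as follows: naive Bayes classification combined with information-gain-based feature selection, where information gain is approximated by the mutual information between each term and the class. The service texts are invented stand-ins for semantic interface/capability descriptions.

```python
# Naive Bayes with information-gain-style feature selection (mutual
# information between terms and classes); toy service descriptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "get weather forecast by city",
    "current temperature and humidity lookup",
    "send sms message to phone number",
    "deliver text message notification",
]
labels = ["weather", "weather", "messaging", "messaging"]

model = make_pipeline(
    CountVectorizer(),
    SelectKBest(mutual_info_classif, k=5),  # keep the most informative terms
    MultinomialNB(),
)
model.fit(texts, labels)
print(model.predict(["city weather lookup"]))
```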
Article
Full-text available
A Web service is Web-accessible software that can be published, located, and invoked by using standard Web protocols. Automatically determining the category of a Web service, from several pre-defined categories, is an important problem with many applications such as service discovery, semantic annotation, and service matching. This paper describes AWSC (Automatic Web Service Classification), an automatic classifier of Web service descriptions. AWSC exploits the connections between the category of a Web service and the information commonly found in standard descriptions. In addition, AWSC bridges different styles for describing services by combining text mining and machine learning techniques. Experimental evaluations show that this combination helps our classification system improve its precision. In addition, we report an experimental comparison of AWSC with related work.
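
One plausible ingredient of a text-mining pipeline over standard service descriptions, sketched below purely as an assumption rather than the paper's actual code, is splitting camelCase operation names into word tokens before classification.

```python
# Hypothetical preprocessing step: break WSDL-style identifiers into words.
import re

def split_identifier(name: str) -> list[str]:
    """Break a WSDL-style identifier into lowercase word tokens."""
    spaced = re.sub(r"(?<!^)(?=[A-Z])", " ", name)  # space before capitals
    return [w.lower() for w in re.split(r"[\s_]+", spaced) if w]

print(split_identifier("GetCurrencyExchangeRate"))
# -> ['get', 'currency', 'exchange', 'rate']
```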
Article
Full-text available
The Web is gradually evolving into a provider of services along with its text and image processing functions. Web services markup is proposed in the Defense Advanced Research Projects Agency's agent markup language (DAML) family of semantic Web markup languages. The markup provides an agent-independent declarative API to capture the data and metadata associated with a service. Sharing, reuse, composition, mapping, and succinct local Web service markup are facilitated by the markup's exploitation of ontologies. A wide variety of agent technologies for automated Web service discovery, execution, composition, and interoperation is enabled by this markup.
Article
Full-text available
The rapid evolution and expansion of wireless-enabled environments have increased the need for sophisticated service discovery protocols (SDPs). Typically, service discovery involves a client, a service provider, and a lookup or directory server. The paper discusses Bluetooth (http://www.bluetooth.com) short-range wireless technology. The Bluetooth protocol stack includes specifications that define the SDP, RFCOMM (for cable replacement), the logical link control and adaptation protocol (L2CAP), a host controller interface (HCI), the link manager protocol (LMP), the baseband protocol, and a radio frequency (RF) protocol. The paper considers Bluetooth service discovery improvements with semantic matching.
Article
As per the global digital report, 52.9% of the world population is using the internet, and 42% of the world population is actively using e-commerce, banking, and other online applications. Web services are software components accessed using networked communications that provide services to end users. To meet user requirements, a developer must ensure both quality architecture and quality of service, which users measure through the ranking of web services. In this paper, we analyzed the QWS dataset and found that the important parameters are best practices, successability, availability, response time, reliability, throughput, and compliance. We applied various data mining techniques and conducted experiments to classify the QWS dataset into four categorical values (class 1, 2, 3, and 4). The results were compared across several techniques: random forest, artificial neural network, J48 decision tree, extreme gradient boosting, K-nearest neighbor, and support vector machine. Among the classifiers analyzed, it was observed that eXtreme gradient boosting achieved the maximum accuracy of 98.44%, and random forest achieved 98.13%. In the future, this work can be extended to web service quality with mixed attributes.
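
A minimal sketch of this kind of comparison follows: several classifiers trained on QoS-style features and scored on held-out data. The data is synthetic (the QWS dataset is not bundled), a plain decision tree stands in for J48, and the neural network and gradient boosting models are omitted to keep the example dependency-light.

```python
# Compare classifiers on synthetic QoS-style data split into 4 classes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(400, 6))              # 6 QoS parameters, 0-100
y = np.digitize(X[:, 0] + X[:, 5], [60, 100, 140])  # 4 quality classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
models = {
    "random forest": RandomForestClassifier(n_estimators=100),
    "decision tree (J48 stand-in)": DecisionTreeClassifier(),
    "k-nearest neighbor": KNeighborsClassifier(),
    "support vector machine": SVC(),
}
for name, m in models.items():
    print(name, "accuracy:", round(m.fit(X_tr, y_tr).score(X_te, y_te), 3))
```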
Conference Paper
With the development of Web service technology, the number of web services published on the Internet is increasing rapidly. Recognizing each web service intelligently has become key to using the Internet efficiently, and the first step of recognition is to classify web services accurately. Classifying a huge number of web services is a difficult job; therefore, in order to support applications of web services more effectively, an automatic web service classification method is needed. In this paper, common WSDL files are the object of study. Since a web service is described by WSDL, traditional document classification methods cannot be applied directly. A new method is proposed that applies automatic web service semantic annotation and uses three classification methods, Naive Bayes, SVM, and REP Tree, with ensemble learning applied on top. In an experiment on 951 WSDL files and 19 categories, the accuracy was 87.39%.
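
The ensemble step can be sketched as below: Naive Bayes, SVM, and a pruned decision tree (standing in for REP Tree, which has no scikit-learn equivalent) combined by majority voting over text features. The service snippets are invented; real input would be text extracted from WSDL files.

```python
# Majority-voting ensemble of three classifiers over TF-IDF text features.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

texts = [
    "currency conversion rate service", "stock price quote lookup",
    "email delivery gateway", "sms text message sender",
]
labels = ["finance", "finance", "messaging", "messaging"]

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        [
            ("nb", MultinomialNB()),
            ("svm", SVC()),
            ("tree", DecisionTreeClassifier(ccp_alpha=0.01)),  # pruned tree
        ],
        voting="hard",  # majority vote over the three predictions
    ),
)
ensemble.fit(texts, labels)
print(ensemble.predict(["international money transfer quote"]))
```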
Article
With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.
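
Two of the remedies commonly covered under this topic, cost-sensitive class weights and random oversampling of the minority class, are sketched below on synthetic skewed data; neither is drawn from this survey's own experiments, and threshold tuning or SMOTE-style synthesis are further options.

```python
# Two common imbalanced-learning remedies on synthetic skewed data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 1.8).astype(int)  # roughly 3-4% positives: severe skew
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Remedy 1: cost-sensitive learning via inverse-frequency class weights.
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Remedy 2: random oversampling of minority examples before fitting.
minority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=4 * len(minority))
oversampled = LogisticRegression().fit(
    np.vstack([X_tr, X_tr[extra]]),
    np.concatenate([y_tr, y_tr[extra]]),
)

for name, m in [("class weights", weighted), ("oversampling", oversampled)]:
    print(name, "minority-class F1:", round(f1_score(y_te, m.predict(X_te)), 3))
```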