Book

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage

Authors: Zdravko Markov, Daniel T. Larose

Introduction
Number of Visit Actions
Session Duration
Relationship between Visit Actions and Session Duration
Average Time per Page
Duration for Individual Pages
References
Exercises
... The use of TF alone gives no information about the context of the words, so it is more useful to also use the inverse document frequency (IDF), which is computed from the ratio of the total number of documents to the number of documents containing a given term (usually log-scaled). Thus, the IDF indicates whether a term is rare or common across a set of documents [12], [13]. ...
... By merging the TF and IDF, we obtain the so-called term frequency-inverse document frequency (TF-IDF), which gives more information, including the number of terms, their positions and the contextual information. The main tasks of the text processing component are therefore tokenisation, stemming and producing the TF-IDF vector space model for a given set of text documents, where an individual vector consists of m dimensions as follows [12], [13]: ...
... For similar text documents, the value will be equal to unity [12]. It should also be noted that when the vectors are normalised, this measure is equivalent to the dot product between the vectors [13]. ...
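As an illustration of the TF-IDF weighting and cosine-similarity comparison described in the excerpts above, a minimal scikit-learn sketch might look as follows; the example documents are invented for illustration and are not from the cited papers.

```python
# Minimal sketch: TF-IDF vector space model and cosine similarity between documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "buffer overflow vulnerability in industrial control systems",
    "industrial control systems advisory on buffer overflow",
    "holiday resorts reviewed on social media",
]

vectorizer = TfidfVectorizer(stop_words="english")   # tokenisation + TF-IDF weighting
tfidf = vectorizer.fit_transform(docs)               # one m-dimensional vector per document

# Cosine similarity; for identical (normalised) documents the value is 1.0.
print(cosine_similarity(tfidf))
```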
Article
Full-text available
This paper presents a framework for text clustering and categorisation. The proposed clustering approach is based on a modified existing similarity-based clustering algorithm, which was originally developed for well-structured data. In this study, the clustering algorithm is used to map text documents into clusters, in order to discover groups of topical documents. The clusters produced in this way are also used for the categorisation of new documents that are uploaded to the system. The algorithms are discussed using as an example the analysis of text documents including Industrial Control Systems (ICS) Advisory Reports and Common Vulnerabilities and Exposures (CVE) recommendations, both of which are made available by the Cybersecurity and Infrastructure Security Agency (CISA). Experiments are carried out, although the main focus is on the clustering algorithm. Based on the experimental results, it can be concluded that the proposed similarity-based clustering algorithm can be considered an alternative approach for text clustering.
... When we use data mining for clustering, classification, prediction or regression of large volumes of web data with the aim of extracting business value, we speak of the discipline of web mining (Markov & Larose, 2007; Liu, 2007; Palau, Montaner, Lopez & de la Rosa, 2016). It has three distinct aspects: ...
... An association rule is a rule of the form X => Y, interpreted to mean that the purchase of item X entails the purchase of item Y in the same transaction. In the web environment, the same kind of rule indicates a relationship between HTML pages X and Y that frequently appear together in user sessions (Markov & Larose, 2007). An association rule carries no information about the chronology of the visits to the site. ...
... For these kinds of analyses, techniques for generating sequential association rules, which also include a temporal component, are used. In sequential rules, the visit to the web site named in the antecedent of the rule occurred before the visits to the web sites named in the consequent; in other words, the sequence of clicks over time is analysed (Markov & Larose, 2007). Existing web usage mining algorithms generate too many rules, making it hard to distinguish which rules are important and which have no potential value for the user (Cheng, Healey, McHugh & Wang, 2001). ...
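The rules described above can be made concrete with a small sketch. The following is a rough, purely illustrative computation of support and confidence for sequential rules X => Y over hypothetical clickstream sessions (page X preceding page Y); it is not the algorithm of any of the cited papers.

```python
# Support and confidence for ordered page pairs (X before Y) in user sessions.
from itertools import combinations
from collections import Counter

sessions = [
    ["home", "products", "cart"],
    ["home", "about", "products"],
    ["home", "products", "cart", "checkout"],
]

pair_counts = Counter()        # sessions in which X appears before Y
page_counts = Counter()        # sessions in which X appears at all
for session in sessions:
    ordered_pairs = {(session[i], session[j])
                     for i, j in combinations(range(len(session)), 2)}  # i < j keeps click order
    pair_counts.update(ordered_pairs)
    page_counts.update(set(session))

n = len(sessions)
for (x, y), count in pair_counts.items():
    support, confidence = count / n, count / page_counts[x]
    if support >= 0.5:
        print(f"{x} => {y}  support={support:.2f}  confidence={confidence:.2f}")
```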
Article
Full-text available
Much of one's online behavior, including browsing, shopping and posting, is recorded daily in databases on companies' computers. These data sets are referred to as web data. Patterns that indicate one's interests, habits, preferences or behaviors are stored within those data. More useful than any individual indicator is when a company records data on all its users and gains insight into their habits and tendencies. Detecting and interpreting such patterns can help managers to make informed decisions and serve their customers better. Applying data mining to web data is said to turn it into web knowledge. The research study conducted in this paper demonstrates how data mining methods and models can be applied to web-based forms of data, on the one hand, and what the implications of uncovering patterns in web content, structure and usage are for management, on the other.
... It is organized as a graph structure in which the web pages are the nodes and the relations among them are the links. These web pages are stored across a network of computers rather than in a single database or computer, which follows from the idea of building the Internet on a distributed infrastructure; the web is thus the most important repository of information that must be navigated when answering any query [1]. The web has several characteristics, as explained below [2]: ...
... Data mining can be defined as the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. It is used to extract knowledge from the web automatically [1]. ...
... In other words, clustering is a type of learning that is unsupervised because it does not use labeled objects. Clustering seeks to find generic patterns, group objects into patterns, or organize them into hierarchies [1]. ...
... Each line in this text file shows a specific operation requested by the browser of a user and received by an EPA web server in Research Triangle Park, North Carolina. Each record includes the IP address, Date/Time field, HTTP request, the status code field, and the transfer size field [24]. ...
... In other words, the raw web log files are not in a format appropriate for data mining, so data preprocessing must be performed [24]. In this study, Matlab R2010A software is used for the pre-processing operations and the implementation of the process. ...
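A hedged sketch of this kind of log pre-processing is shown below: parsing log records (IP address, date/time, HTTP request, status code, transfer size) and grouping them into per-IP sessions. The exact line format and the 30-minute session timeout are assumptions for illustration; the cited study performed its pre-processing in Matlab.

```python
# Parse web server log lines and group them into sessions (illustrative only).
import re
from datetime import datetime, timedelta

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("ip"), ts, m.group("request"), int(m.group("status"))

def sessionize(records, timeout=timedelta(minutes=30)):
    """Split each IP's requests into sessions whenever the gap exceeds `timeout`."""
    sessions = {}
    for ip, ts, request, status in sorted(records, key=lambda r: (r[0], r[1])):
        per_ip = sessions.setdefault(ip, [])
        if not per_ip or ts - per_ip[-1][-1][1] > timeout:
            per_ip.append([])                       # start a new session for this IP
        per_ip[-1].append((ip, ts, request, status))
    return sessions

sample = '192.168.1.1 - - [29/Aug/1995:23:53:25 -0400] "GET /index.html HTTP/1.0" 200 1839'
print(parse_line(sample))
```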
Article
Full-text available
In this study, we present a new approach to Web Usage Mining using Case-Based Reasoning. Case-based reasoning is a knowledge-based problem-solving approach based on the reuse of previous experience; past experience can thus be deemed an efficient guide for solving new problems. Web personalization systems that can adapt the next set of visited pages to individual users according to their interests and navigational behaviors have been proposed. The proposed architecture consists of a number of components, namely basic log preprocessing, pattern discovery methods (case-based reasoning with peer-to-peer similarity, clustering, and association rule mining), and recommendations. One of the issues considered in this study is that no recommendations are available for users who differ from the existing users in the log file, which is one of the challenges facing recommendation systems. To deal with this problem, the Apriori algorithm was applied separately to generate recommendations; in other words, in cases where recommendations may be inadequate, association rules can enhance the overall recommendation performance of the system. This study also uses clustering algorithms for nominal web data. Our evaluations show that the proposed method, together with the case-classified log, provides more effective recommendations for users than logs with no case classification.
... All the new pages are analyzed to gather information, and the process then continues using the obtained pages as new seeds, which are stored in the queue (Mirtaheri et al., 2013). The crawler basically works on the principle of a simple graph search algorithm, such as breadth-first search (BFS) and depth-first search (DFS) (Singh et al., 2014), assuming that all the web pages are linked together and that there are no redundant links (Markov & Larose, 2007). ...
... Moreover, there also exists the possibility to have multiple links to the same web page. Thus, a crawler can maintain a cache of links or page content to check content similarity between two web pages (Markov & Larose, 2007). ...
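A minimal breadth-first crawler matching this description might look like the sketch below; fetch_links is a hypothetical placeholder for downloading a page and extracting its outgoing links, and a real crawler would also need politeness rules, robots.txt handling and content-similarity checks.

```python
# Breadth-first crawl over a frontier queue with a cache of already-seen URLs.
from collections import deque

def fetch_links(url):
    """Placeholder: download `url` and return its outgoing links."""
    return []

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)      # pages still to visit (BFS order)
    visited = set()              # cache of seen URLs, avoids revisiting duplicates
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return visited

print(crawl(["https://example.org/"]))
```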
... Many researchers have published works about WCM [15], [16], [17], [18]. Their studies aim to solve different problems. ...
... Their studies aim to solve different problems. Materna, Qi and Davison work on automated web page classification [19], [20]; Markov and Larose focus on text document grouping [18]; Hariharan, Srinivasan and Lakshmi on similarity discovery among text documents [21], [22]; Liu, Medhat, Hassan, Korashy, D'Avanzo, Pilato, Patel, Prabhu and Bhowmick conduct research in two areas: sentiment analysis and opinion mining [23], [24], [25], [26], [27]. ...
Article
Full-text available
The impact of social networks on our lives keeps increasing because they provide content, generated and controlled by users, that is constantly evolving. They help us spread news, statements, ideas and comments very quickly. Social platforms are currently one of the richest sources of customer feedback on a variety of topics. A frequently discussed topic is resorts and holiday villages and the tourist services offered there. Customer comments are valuable to both travel planners and tour operators. The accumulation of opinions in the web space is a prerequisite for applying appropriate tools for their computer processing and for extracting useful knowledge from them. When working with unstructured data, such as social media messages, there is no universal text processing algorithm, because each social network and its resources have their own characteristics. In this article, we propose a new approach for automated analysis of a static set of historical user messages about holiday and vacation resorts published on Twitter. The approach is based on natural language processing techniques and the application of machine learning methods. The experiments are conducted using the software product RapidMiner.
... Rules that reflect associations between several attributes are often called affinity analysis or market basket analysis. Association analysis, or association rule mining, is a data mining technique used to discover rules for combinations of items [13]. One stage of association analysis that has attracted the attention of many researchers is high-frequency pattern analysis (frequent pattern mining). ...
Article
Full-text available
The success of the development process requires optimal support for the exchange of data and information between institutions in order to achieve balanced system integration between the government and its users. SAKTI, an institution-level financial application, has been designed to manage all aspects of finance, from planning to budget accountability. The SAKTI application integrates all existing work-unit applications, with the aim of improving effectiveness, efficiency, transparency and accountability in financial management. Although it has been in use since the beginning of 2022, commitment operators still face difficulties in determining item codes, mainly because of their lack of familiarity with the task and the large number of items available as references. Errors made by commitment operators can affect the asset detailing process in the inventory and asset modules. In this study, the researchers used the Apriori algorithm and frequent pattern growth (FP-growth) as tools to discover a set of association rules from the item transaction data stored in the SAKTI application database. The simulation results show that, among the rules satisfying the minimum support and minimum confidence, the most frequently selected items are Ballpoint Standar Tecno, plastic tissue refill, Lak Ban Hitam 2 Inchi Merk Daimaru, and Ballpoint Kenko K1 (0.5), at 100%.
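To make the minimum-support / minimum-confidence idea concrete, the sketch below shows a level-wise, Apriori-style search for frequent itemsets and simple one-item-consequent rules. The transactions are hypothetical office-supply baskets, not the SAKTI transaction data, and the code is illustrative rather than the study's implementation.

```python
# Level-wise (Apriori-style) frequent itemsets plus simple association rules.
transactions = [
    {"ballpoint", "tissue refill", "duct tape"},
    {"ballpoint", "tissue refill"},
    {"ballpoint", "duct tape"},
    {"tissue refill", "duct tape", "stapler"},
]
min_support, min_confidence = 0.5, 0.6
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

items = {frozenset([i]) for t in transactions for i in t}
frequent, level = [], {s for s in items if support(s) >= min_support}
while level:
    frequent.extend(level)
    size = len(next(iter(level))) + 1
    candidates = {a | b for a in level for b in level if len(a | b) == size}
    level = {c for c in candidates if support(c) >= min_support}

for itemset in frequent:
    if len(itemset) < 2:
        continue
    for y in itemset:                      # rules of the form X => y
        x = itemset - {y}
        confidence = support(itemset) / support(x)
        if confidence >= min_confidence:
            print(f"{set(x)} => {y}  support={support(itemset):.2f}  confidence={confidence:.2f}")
```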
... The neurons are associated with weights in every layer, and the weighted output is fed as input to the next layer. ANN performs well on noisy data sets; hence ANN is robust with respect to erroneous and noisy data [98]. It can function even if some part of the network fails. ...
Article
Full-text available
In this digital era, users and service providers face numerous decisions that lead to data overload. The choices should be filtered, prioritised or adapted so that the relevant data, with meaningful detail, reach the service provider or the intended user. A recommender framework or engine handles the information overload problem by customising and filtering the large volume of data and dynamically generating appropriate, personalised content for the customer. This comprehensive study focuses on several recommender system (RecSys) methodologies and discusses the problems or issues associated with different principles and techniques. In addition, this study elaborates on several similarity measures, both conventional and non-conventional, with their merits and demerits, and points out both ranking and non-ranking performance metrics. Further, we have studied a range of articles, including journal and conference papers. Based on these studies, we outline current research challenges as future directions. We also briefly discuss various datasets used in the recommender domain for evaluating and validating the recommendation task.
... Website texts have been investigated by many researchers and examples can be found in 'Web mining' books, such as Larose and Markov (2007), and in more text-oriented Web mining books, such as Song and Wu (2008). To obtain more information on the activities of companies, several applications have been described. ...
Article
Full-text available
A text-based, bag-of-words, model was developed to identify drone company websites for multiple European countries in different languages. A collection of Spanish drone and non-drone websites was used for initial model development. Various classification methods were compared. Supervised logistic regression (L2-norm) performed best with an accuracy of 87% on the unseen test set. The accuracy of the later model improved to 88% when it was trained on texts in which all Spanish words were translated into English. Retraining the model on texts in which all typical Spanish words, such as names of cities and regions, and words indicative for specific periods in time, such as the months of the year and days of the week, were removed did not affect the overall performance of the model and made it more generally applicable. Applying the cleaned, completely English word-based, model to a collection of Irish and Italian drone and non-drone websites revealed, after manual inspection, that it was able to detect drone websites in those countries with an accuracy of 82 and 86%, respectively. The classification of Italian texts required the creation of a translation list in which all 1560 English word-based features in the model were translated to their Italian analogs. Because the model had a very high recall, 93, 100, and 97% on Spanish, Irish and Italian drone websites respectively, it was particularly well suited to select potential drone websites in large collections of websites.
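The general bag-of-words plus L2-regularised logistic regression setup described in this abstract can be sketched with scikit-learn as below; the texts, labels and train/test split are placeholders, not the authors' website collection or exact pipeline.

```python
# Bag-of-words features + L2 logistic regression, evaluated on a held-out test set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

texts = ["drone aerial photography services", "pizza delivery in Madrid",
         "UAV mapping and surveying", "handmade leather shoes"] * 10
labels = [1, 0, 1, 0] * 10          # 1 = drone website, 0 = other (toy labels)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

model = make_pipeline(
    CountVectorizer(),                               # bag-of-words representation
    LogisticRegression(penalty="l2", max_iter=1000), # L2-norm logistic regression
)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```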
... As Field (2013, p. 263) states, relationship analyses are types of analysis based on whether two or more variables under consideration vary together. In data mining, applications such as association rules and cluster analysis are used to detect individual behaviours that change together or decisions related to preferences (Markov & Larose, 2007). ...
Article
Full-text available
It can be said that understanding the past and the present helps us look at the future more clearly. Especially in the information age, the huge volumes of data generated with the contribution of digitalisation make this interpretation even more important. One of the most effective methods at our disposal for achieving this is data mining. Data mining is a tool for increasing productivity based on discovering meaningful relationships, patterns and trends within the data in question. Frequently used in the social sciences and in marketing, data mining develops foresight for predicting customers' future behaviour through the meaningful patterns and relationships it discovers, and creates many advantages for businesses by supporting sales and service functions, such as how product offers should be structured. In this context, the study aims first to provide general information on data mining and its applications in the social sciences and then to evaluate the use of data mining in marketing. In this way, it is expected to support a clearer understanding and adoption of the concept of data mining by social scientists, an increase in data mining applications in marketing and, consequently, a greater contribution to theory and industry.
... At the end, each example's actual class label is compared against its predicted one to find the total number of correct predictions. To evaluate MGREPD, the leave-one-out cross-validation (LOOCV) method [32] and measures such as accuracy and f-score are used [33]. Accuracy A measures the ability of the model to match the actual value of the class label with its predicted one (e.g. ...
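Leave-one-out cross-validation with accuracy and F-score, as mentioned in the excerpt, can be sketched with scikit-learn as follows; the toy dataset and classifier stand in for the MGREPD model and its data.

```python
# LOOCV: each example is predicted by a model trained on all the remaining examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

y_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())

print("accuracy:      ", accuracy_score(y, y_pred))
print("macro F-score: ", f1_score(y, y_pred, average="macro"))
```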
... Due to the massification of information on the World Wide Web (WWW), this has become an important global database. Web scraping, used for web content mining, obtains information from the content of web pages, with two basic objectives: extract information to improve search engines and information retrieval fields [33] and analyze and explore information to gain useful content knowledge [34]. In this content mining technique, opinions, feelings, and emotions are extracted from the text to understand the context of web content [35]. ...
Article
Full-text available
The objective of this work is to generate an HJ-biplot representation for the content analysis, obtained by latent Dirichlet allocation (LDA), of the headlines of three Spanish newspapers in their web versions referring to the topic of the pandemic caused by the SARS-CoV-2 virus (COVID-19), with more than 500 million affected and almost six million deaths to date. The HJ-biplot is used to give an extra analytical boost to the model. It is an easy-to-interpret multivariate technique which does not require in-depth knowledge of statistics and allows capturing the relationship between the topics in the COVID-19 news and the three digital newspapers. Compared with LDAvis and heatmap representations, the HJ-biplot provides a better representation and visualization, allowing us to analyze the relationship between each newspaper analyzed (column markers represented by vectors) and the 14 topics obtained from the LDA model (row markers represented by points), represented in the plane with the greatest informative capacity. It is concluded that the newspapers El Mundo and 20 M present greater homogeneity between the topics published during the pandemic, while El País presents topics that are less related to the other two newspapers, highlighting topics such as t_12 (Government_Madrid) and t_13 (Government_millions).
... Guerbas et al. (2013) proposed a KNN-based navigation pattern prediction approach in which the sessions are treated as documents and the web pages in the sessions as the terms in the document. Recently, a recommendation system based on K-Nearest Neighbor classification [70] was developed by Adeniyi et al. [1]. The method decides the class to which a user belongs by comparing the current click stream of a user with the click streams of previous users whose class labels are known. ...
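The k-NN idea in the excerpt, classifying a user by comparing the current clickstream with labeled clickstreams of previous users, can be sketched as follows; the pages, sessions and class labels are hypothetical.

```python
# Encode sessions as binary page-visit vectors and classify a new user with k-NN.
from sklearn.neighbors import KNeighborsClassifier

pages = ["home", "products", "cart", "checkout", "blog", "support"]

def to_vector(session):
    return [1 if page in session else 0 for page in pages]

past_sessions = [["home", "products", "cart", "checkout"],
                 ["home", "products", "cart"],
                 ["home", "blog"],
                 ["home", "support", "blog"]]
labels = ["buyer", "buyer", "browser", "browser"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit([to_vector(s) for s in past_sessions], labels)

current_clickstream = ["home", "products", "checkout"]
print(knn.predict([to_vector(current_clickstream)]))   # predicted class for the current user
```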
Article
Full-text available
The burgeoning e-commerce market has presented companies with the opportunity to grow their businesses through online platforms. But, the researchers have concluded that just 2.86% of e-commerce website visits lead to a purchase and one of the reasons for this missed opportunity is an unpleasant website browsing experience. Therefore, a pleasant browsing experience is the need of the hour whereby the web page recommendation systems (WPRS) provide high-quality navigation experience by providing suggestions about the web pages of interest and by taking the website users to their desired web pages in fewer clicks. In this context, this paper presents a method to improve the browsing experience of the website users by proposing two hybrid algorithms based on clustering for web page recommendation systems, namely a hybrid partitioning-based heuristic sequence clustering (HSC) algorithm inspired from K-medoid and DBSCAN algorithms and a hybrid tree-based sequence clustering (TSC) algorithm inspired from B-Trees and BIRCH algorithm. The testing has been performed using CTI, BMSWebView1, BMSWebView2 and MSNBC datasets. To measure the performance, the algorithm considered for the study has been evaluated using parameters like precision, recall, F1 measures and execution time. Also, an in-depth comparative analysis of state-of-the-art web page recommendation systems with the recommendation system considered for the study has been done. The results indicate that the proposed clustering-based framework was able to generate superior results than the other classes of algorithms.
... The concept of geoportals has become key for accessing and sharing spatial data and geoinformation. We perform geoportal navigation analysis based on geoportal web server logs (click-stream data), following the guidelines given by (Markov and Larose, 2007; Bhavani et al., 2017; Bhuvaneswari and Muneeswaran, 2021). ...
Article
Full-text available
The Friedman Test is used for problems similar to a wine contest, where we want to check whether there is any difference between the wines. We have analyzed the problems where the judges might find ties between the wines, and produced exact tables for the problem. Using the tables instead of an asymptotic estimate might circumvent errors, at least for the case of a small number of wines and judges.
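For reference, the ordinary (asymptotic) Friedman test is available in SciPy; the hypothetical judge scores below only illustrate the call, and the exact tables for tied rankings derived in the paper are not reproduced here.

```python
# Asymptotic Friedman test: do the wines differ, given scores from the same judges?
from scipy.stats import friedmanchisquare

# Each list holds one wine's scores from the same panel of five judges.
wine_a = [7, 8, 6, 7, 9]
wine_b = [6, 7, 6, 5, 8]
wine_c = [8, 9, 7, 8, 9]

stat, p_value = friedmanchisquare(wine_a, wine_b, wine_c)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.3f}")
```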
... Data Conversion. This step concerns the user-session representation [38]. Not all sessions and web pages are involved. ...
Article
Full-text available
The problem of finding relevant data while searching the internet represents a big challenge for web users due to the enormous amounts of available information on the web. These difficulties are related to the well-known problem of information overload. In this work, we propose an online web assistant called OWNA. We developed a fully integrated framework for making recommendations in real-time based on web usage mining techniques. Our work starts with preparing raw data, then extracting useful information that helps build a knowledge base as well as assigns a specific weight for certain factors. The experiments show the advantages of the proposed model against alternative approaches.
... According to the definition of big data [19], [20], it can be argued that geophysical studies, which consider dozens of different parameters accumulated over several decades, are a classic example of such "big data", for the processing of which special machine learning and data mining methods have been developed [21]-[23]. Therefore, machine learning methods have been chosen as the tool for achieving the aim of the present research. ...
Article
Full-text available
Well logging, also known as a geophysical survey, is one of the main components of a nuclear fuel cycle. This survey follows directly after the drilling process, and the operational quality assessment of its results is a very serious problem. Any mistake in this survey can lead to the culling of the whole well. This paper examines the feasibility of applying machine learning techniques to quickly assess the well logging quality results. The studies were carried out by a reference well modelling for the selected uranium deposit of the Republic of Kazakhstan and further comparing it with the results of geophysical surveys recorded earlier. The parameters of the geophysical methods and the comparison rules for them were formulated after the reference well modelling process. The classification trees and the artificial neural networks were used during the research process and the results obtained for both methods were compared with each other. The results of this paper may be useful to the enterprises engaged in the geophysical well surveys and data processing obtained during the logging process.
... The emergence of more complex types of data led to the development of new methods and models to cope with the new task of mining complex data. As examples, we can point out text mining (do Prado & Ferneda, 2008), web mining (content, structure, and usage) (Markov & Larose, 2007), spatial data mining (Nlenanya, 2009), graph mining (Zhang, Hu, Xia, Zhou, & Achananuparp, 2008), mining time-series data (Liabotis, Theodoulidis, & Saraaee, 2006), among others. In (Kumar, 2011) some trends and new domains are explored. ...
Chapter
The term knowledge discovery in databases or KDD, for short, was coined in 1989 to refer to the broad process of finding knowledge in data, and to emphasize the “high-level” application of particular data mining (DM) methods. The DM phase concerns, mainly, the means by which the patterns are extracted and enumerated from data. Nowadays, the two terms are, usually, indistinctly used. Efforts are being developed in order to create standards and rules in the field of DM with great relevance being given to the subject of inductive databases. Within the context of inductive databases, a great relevance is given to the so-called DM languages. This chapter explores DM in KDD.
... The emergence of more complex types of data led to the development of new methods and models to cope with the new task of mining complex data like text mining (Prado & Ferneda, 2008), web mining (content, structure, and usage) (Markov & Larose, 2007), spatial data mining (Nlenanya, 2009), graph mining (Zhang, Hu, Xia, Zhou, & Achananuparp, 2008), mining time-series data (Liabotis, Theodoulidis, & Saraaee, 2006) etc., ...
Chapter
Business Intelligence (BI) is an emergent area of the Decision Support Systems (DSS) discipline. Over the past years, the evolution in this area has been considerable. Similarly, in the last years, there has been a huge growth and consolidation of the Data Mining (DM) field. DM is being used with success in BI systems, but a true integration of DM with BI is lacking. The purpose of this chapter is to discuss the relevance of DM integration with BI, and its importance to business users. From the literature review, it was observed that the definition of an underlying structure for BI is missing, and therefore a framework is presented. It was also observed that some efforts are being made towards the establishment of standards in the DM field, both by academics and by people in the industry. Supported by those findings, this chapter introduces an architecture that can lead to an effective usage of DM in BI. This architecture includes a DM language that is iterative and interactive in nature. This chapter suggests that the effective usage of DM in BI can be achieved by making DM models accessible to business users, through the use of the presented DM language.
... Back propagation, offered by neural networks, is also an effective algorithm for the task of text classification; it can deal with high-dimensional data spaces [3], [6], [19]. Text categorization is an active research area of text mining in which text is prepared with supervised, unsupervised or semi-supervised knowledge [4]. ...
Article
Full-text available
Text classification is the process of assigning text to one or more categories. Text categorization has many significant applications, mostly in the field of document organization and for browsing within large groups of documents. It is often carried out by means of machine learning, since the system is built from a wide range of document features. Feature selection is an important step in this process, because there are typically several thousand possible feature terms. In text categorization, the goal of feature selection is to improve the efficiency of the procedures and the reliability of classification by deleting irrelevant and non-essential terms, while keeping terms that carry enough information to support the classification task. The goal of this work is to build more efficient text categorization models. In text mining algorithms, a document is represented as a vector whose dimension is the number of distinct keywords in it, which can be very large, so classic document categorization may be computationally costly. Therefore, feature extraction through singular value decomposition is employed to reduce the dimensionality of the documents, and we apply classification algorithms based on back propagation and on the Support Vector Machine methodology. Before classification, we applied Principal Component Analysis in order to improve the accuracy of the results. We then compared the performance of these two algorithms by computing standard precision and recall for the document collection.
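The pipeline this abstract describes, TF-IDF document vectors reduced with singular value decomposition and then classified, with precision and recall as the evaluation measures, can be approximated with scikit-learn as below; a small public dataset and a linear SVM stand in for the authors' document collection and exact configuration.

```python
# TF-IDF -> truncated SVD (dimensionality reduction) -> SVM, scored by precision/recall.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_score, recall_score

categories = ["sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=100, random_state=0),   # SVD-based dimensionality reduction
    LinearSVC(),
)
model.fit(train.data, train.target)
pred = model.predict(test.data)

print("precision:", precision_score(test.target, pred))
print("recall:   ", recall_score(test.target, pred))
```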
... Web structure mining: the web can be represented as a graph whose nodes are documents and whose edges are the links between documents. Web structure mining is the process of extracting structural information from the web [13]. ...
Article
With the growing volume of data available on the Internet, customization of website information has become a requirement for users. Appropriate customization of web data can be achieved by automatically extracting and combining knowledge from the log file and from user profile information. In this paper, we integrate decision trees and association rules over the user profile information and the website log information of an online shopping store. The tangible results of such a framework for decision makers and marketers are the customization of web pages and statistical analysis for improving sales. Applying association rules, the website users' patterns are mined; using a decision tree, users are classified and their interests are determined. By combining the results of the two algorithms and analysing them, behaviour models can be obtained from the user profiles, user interests can be described in terms of age and gender, and the most visited web pages by subject can be identified.
... The application of k-NN as a machine learning approach spans more than 50 years [35]. Although k-NN is believed to have been introduced in 1951 in an unpublished medical report, it did not gain much traction until 1960. ...
Article
Energy generation from biomass requires a nexus of different sources irrespective of origin. A detailed and scientific understanding of the class to which a biomass resource belongs is therefore highly essential for energy generation. An intelligent classification of biomass resources based on their properties offers high prospects for analytical, operational and strategic decision-making. This study proposes the k-Nearest Neighbour (k-NN) classification model to classify biomass based on its properties. The study scientifically classified a dataset of 214 biomass samples obtained from several articles published in reputable journals. Four different values of k (k=1,2,3,4) were tested with various self-normalizing distance functions, and their results were compared for effectiveness and efficiency in order to determine the optimal model. The k-NN model based on the Mahalanobis distance function showed high accuracy at k=3, with Root Mean Squared Error (RMSE), accuracy, error, sensitivity, specificity, false positive rate, Kappa statistic and computation time (in seconds) of 1.42, 0.703, 0.297, 0.580, 0.953, 0.047, 0.622 and 4.7, respectively. The authors concluded that the k-NN based classification model is feasible and reliable for biomass classification. The implementation of these classification models shows that k-NN can serve as a handy tool for classifying biomass resources irrespective of their sources and origins.
... In the classification phase, the classification model is used to correctly classify a new unlabeled document. The text classification process is divided into four main steps: data collection and preprocessing, building the model (feature selection), model evaluation, and model testing (classification of new documents with unknown class labels) [6], detailed in the following sub-sections. ...
Conference Paper
Rain is of paramount importance for Indian agriculture, as it serves as the primary source of water for crops, sustaining agricultural productivity and ensuring food security for millions of people. In India’s predominantly rain-fed agriculture, timely and adequate rainfall is crucial for successful crop growth, making it a lifeline for farmers and a determining factor in the country’s agricultural output. Accurate rainfall predictions are essential for various applications, including agriculture, water resource management, and disaster preparedness. Ensemble machine learning models have demonstrated their capability to enhance the accuracy and reliability of rainfall predictions compared to single models. This article presents a comparative analysis of different ensemble techniques applied to rainfall prediction tasks. We explore various ensemble approaches, including Averaging, Max Voting, and Stacking, and evaluate their performance using rainfall day-wise datasets from Pantnagar (29.0222° N, 79.4908° E), Uttarakhand, India, from 2010 to 2022.
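Two of the ensemble schemes mentioned here, averaging and stacking, can be sketched with scikit-learn as follows; synthetic regression data stands in for the Pantnagar rainfall dataset and the study's actual base models.

```python
# Averaging (VotingRegressor) and stacking (StackingRegressor) ensembles, compared by MSE.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = [("lr", LinearRegression()), ("rf", RandomForestRegressor(random_state=0))]

ensembles = {
    "averaging": VotingRegressor(estimators=base),                           # mean of base predictions
    "stacking": StackingRegressor(estimators=base, final_estimator=Ridge()), # meta-learner on top
}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse:.1f}")
```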
Article
Full-text available
Wind energy presents a high growth potential in the EU as an emission reduction strategy and to achieve the climate neutrality goal by 2050. Wind farms suitability analysis is one of the primary goals in the spatial planning of wind energy developments. This research paper introduces a hybrid spatial multicriteria GIS-based framework that combines Analytic Hierarchy Process (AHP), PROMETHEE II and Machine Learning algorithms to determine and predict the most efficient onshore wind farm locations by generating suitability index mappings. The methodology allows to overcome PROMETHEE II limitations in raster driven suitability analysis, utilizing machine learning regression methods as the k Nearest Neighbor and Support Vector Machines to predict a graduating mapping of suitability index for wind farm locations in northeastern Greece. The best configured models presented a RMSE of 0.0344 and 0.0154 respectively, indicating a quite high predictive performance. Suitability results indicate that 56.10% of the feasible locations in the Thrace area present a positive outranking character for the kNN model and 56.79% for the SVR model. The proposed framework, enriched by PROMETHEE II capabilities, assists energy and spatial planners in identifying suitable sites for wind farm siting and enables rational decision making that enhances efficient wind energy investments.
Article
Full-text available
In recent years, Brazilian citizens have become increasingly interested in politics, especially when the subject affects the economy, and have thus been looking for news on online portals. In this work, we analyse several national news portals to identify which topics are most present in the most-read news section and to understand user preferences. To this end, the developed system collects the most-read pages of each portal and categorises the news items found. To test and validate the system, we catalogued the most-read news of 2017 and 2018 from three portals: UOL, Veja and Estadão. The data show that users of the UOL portal prefer entertainment news, while at Estadão the audience is divided between politics and entertainment, and at Veja the preference is for politics, economy and opinion news. The results show that political news had an increase in readership in 2018 compared with 2017.
Chapter
The research is devoted to solving the problem of linking the virtual information space and the physical world in terms of data retrieval, where the methods for extracting data from the virtual information space are determined by the data themselves (data-driven). The paper discusses ways to solve the problem of obtaining thematic content (data retrieval) from an unstructured set of information resources or news feeds. The problems of the "growing bubble" of unprocessed documents that arise during the "blind" collection of documents are discussed, and ways to solve these problems are proposed. To reduce the resource consumption of the task of forming a periodically updated search base, three approaches to the automatic collection of "raw data" are proposed. The proposed approaches to developing sub-search systems are part of a large class of modern methods and algorithms for adaptive heterogeneous data filtering for content retrieval and aggregation in the formation of subject-oriented knowledge bases. A possible field of application and the relevance of the work are determined by the fact that, for closed cyber-physical systems, the use of sub-search systems is proposed with an unlimited and unstructured information space as input that is processed in real time. One possible implementation of the proposed methods is in the development of knowledge bases for scientific and technical documentation.
Chapter
Full-text available
Quite often, developers face low performance, hanging, and other problems when they are developing sites. To solve such problems, we need to trace site requests. Existing tracing methods do not allow tracing the progress of requests from a client's web browser to a server or group of servers. In this paper, we propose a distributed tracing mechanism that allows tracking requests starting from the browser. To generate complete client-to-server traces, the client application must be able to initiate the appropriate request, and a dedicated library is needed to perform these actions. In the paper, we consider the algorithm of such a library. A popular tracer (OpenTracing) is used on the server side. Based on the proposed methodology, a library was developed and tested. Testing has shown that, using the library, we can track the complete chain of requests from a browser to the server. The trace result is presented in a graphical view, which allows analyzing the received data and finding bottlenecks as queries are processed. The novelty of the proposed solution is that the request is traced from the client application and back to the client application; that is, the full path of the request is shown. The result is presented in a graphical form that is convenient for analysis. The library is designed primarily for the development of client-server applications and for support services.
Article
It is known that, in the former Soviet Union, Azerbaijan was a country that exported cotton, fruit, tobacco, wine, and canned fruit and vegetable products. In the early 1990s, the collapse of the union, the deterioration of economic relations between the former republics and the loss of traditional markets led to a sharp decrease in production. In that situation, Armenia's groundless territorial claim against Azerbaijan, the coming to power of the Popular Front, its incompetent management and internal strife further strained the political and economic situation in the country. As a result, Armenia's occupation of 20 percent of our territories and the creation of more than one million refugees and internally displaced persons dealt a heavy blow to the agricultural sector. The material and technical base created over many years weakened, product markets were lost, and the production of agricultural products decreased sharply. Thus, in 1990-1993, the balance between the prices of industrial products and agricultural products was disturbed in the country, and a difficult situation arose in the development of the social and production infrastructure of the villages. The construction of schools, cultural facilities, household services and health facilities practically stopped. During this period, the depreciation of the main funds accelerated, and the level of technical equipment of the agricultural and processing industry decreased. The application of the achievements of scientific and technical progress in production processes was limited. For these reasons, our country turned from an exporting country into an importing country. A number of measures were taken to overcome the crisis, and from 1993 confident steps were taken to strengthen state building and revive the economy. The main task was to form market relations, develop entrepreneurship and improve domestic production by effectively using existing potential. National Leader Heydar Aliyev, who returned to the leadership of the country, decided to take decisive steps, establish stability and implement economic reforms. For this purpose, under the leadership of the great leader, the directions of agrarian policy for the next 5-10 years were determined in 1993-1995 and a number of measures were implemented. Azerbaijan has been under the aggression of Armenia for 30 years. The purpose of the study is to assess the condition of the agricultural sector in the territories freed from occupation by our victorious army, to determine the measures to be implemented and to prepare proposals for socio-economic development goals. The methodology of the research is based on the analysis of the measures implemented by Azerbaijan after regaining its independence, the creation of a legal framework leading the country from recession to dynamic development, and the analysis of a number of consistent and systematic relationships. The applied importance of the research is that it can be used in the preparation of socio-economic development programs and measures for the liberated territories. The result of the study is to use the positive experience gained in Azerbaijan for the socio-economic development of the territories freed from occupation; on this basis, the development of agricultural production in Karabakh can be achieved using new techniques and technologies.
As a result of the implementation of the proposals put forward, modern agricultural production and processing enterprises and specialized cooperatives can be created in these areas. Originality and scientific innovation of the research: the article considers three factors in the development of the agricultural sector in the liberated territories of Azerbaijan. Keywords: de-occupied territories, investment, resources, users, targets, reforms.
Article
The article investigates the dehydrogenation of ethanol over binary cerium-copper oxide catalysts. It was established that over Ce-Cu-O catalysts ethanol is converted mainly into acetaldehyde, acetone, ethyl acetate, ethylene and carbon dioxide. Under the conditions studied, the products of the isomerisation of butene-1 to butene-2 over cerium-copper oxide catalysts are trans- and cis-butene-2, while at temperatures above 350 °C deep oxidation products, namely CO and CO2, are formed. It was found that as the cerium content of the catalysts increases, the yield of trans- and cis-butene-2 decreases, which indicates a decrease in the acidity of the catalyst surface. It was also found that, in the dehydrogenation of ethanol over cerium-copper oxide catalysts, the yield of acetaldehyde and its selectivity pass through a minimum as the surface acidity increases. The studies showed that over the Ce-Cu-O catalyst ethanol gives mainly acetaldehyde and ethyl acetate at low temperatures, and acetaldehyde and acetone at temperatures above 350 °C. It was further determined that cerium-copper oxide catalysts have low activity in the isomerisation of butene-1 to butene-2: for cerium-rich samples the isomerisation of butene-1 begins at 250 °C and the total yield of butene-2 does not exceed 15%. The trans/cis isomer ratio of the Ce-Cu-O catalysts varies in the range 0.17-0.56%; for the catalyst with equal component ratios the isomerisation is at a minimum. At 250 °C, samples with a high copper content are active in the isomerisation of butene-1 to butene-2, while at higher temperatures cerium-rich samples also show activity. Heterogeneous catalytic reactions are complex processes involving the interaction of the initial gaseous substances with the solid catalyst surface. It is known that the number of active centres affects the reaction rate; therefore, increasing the number of active centres is an important issue for increasing catalyst activity. The specific surface area of the synthesised catalysts was measured by the thermal desorption of nitrogen. For the Ce-Cu oxide catalysts, the specific surface area first decreases and then increases as the cerium content of the catalyst grows: for the Ce-Cu = 3:7 sample the specific surface area decreases to 7.1 m2/g, while at a 9:1 ratio it increases to 16.5 m2/g. The values for the initial oxides entering the catalytic system, i.e. the Ce and Cu oxides, were also measured and are 6.5 and 0.7 m2/g, respectively. Keywords: ethanol, dehydrogenation, acetaldehyde, binary catalysts, isomerisation.
Article
At present, the deposition of asphaltene-resin-paraffin in the well-gathering system, the easing of the transport of such oils and the increase of oil pipeline throughput remain topical problems at the centre of attention. In many countries, the formation of asphaltene-resin-paraffin deposits not only complicates transport but also leads to increased corrosion problems, hydrocarbon losses and higher costs. One of the factors that increases the stability of these deposits and contributes to their amount is a high water content in the oil: as the percentage of dispersed water phases in the oil increases, so does its colloidal character. Asphaltenes, resins and paraffins remain suspended in the oil system and form a colloidal system; when the dispersed water phases enter this colloid, the resulting deposits are bound together by denser bonds, which makes it difficult to take measures against the formation of such deposits. Oil is a nanosystem combining various components: asphaltenes, resins, paraffins, mechanical impurities and dispersed water phases. The deposit-forming components are the same components that make up this nanosystem; they form a specific structure and affect the rheology of the oils. Asphaltenes concentrate at the centre of the structure, then resins and then paraffins; mechanical impurities adhere around this structure, making it even more complex, and finally the water phase envelops the structure. As a result, the bonds within the water bring about various transformations within the mechanical impurities, paraffins, resins and asphaltenes. At the same time, the water is adsorbed by these components and peptised, which is one of the factors that most affects the rheology of the oil. In rheology the most important parameter is viscosity, and the peptised structures increase its value. The structures formed join with other structures to create a chain-like system; with its formation, the kinetic and aggregative stability within the oil is disturbed. The loss of stability leads to coagulation; once the interconnected structures form a large mass within the oil system, sedimentation begins. Sedimentation means deposit formation, and the faster it proceeds, the greater the amount and thickness of the deposit. One of the main tasks ahead is to increase the efficiency of the transport process for such oils by applying chemical reagents, without requiring large capital investment. For this purpose, the individual reagents Difron-3970 and ND-NDP-1 and the newly prepared nano-containing compositions Difron-3970+Cu and ND-NDP-1+Cu were studied, separately and together, under laboratory conditions on an oil sample taken from well No. 412 of SOCAR's 28 May field. Of the surfactants applied, the new nanocomposition was more effective with respect to the pour point of the high-paraffin oil sample, its dewatering and the asphaltene-resin-paraffin deposition from it. Thus, the composition of nanoparticles and surfactant reagent lowers the pour point of the oil from +15 °C to -2 °C for the Difron-3970+Cu reagent and to -8 °C for the ND-NDP-1+Cu reagent. At the same time, in terms of demulsifying ability, the individual reagents Difron-3970 and ND-NDP-1 dewater oil with a 60% water cut down to 15% and 4%, respectively, whereas the nanocomposites dewater it down to 10% and 2%.
For the first time, the effect of the new composition, prepared by adding Cu nanoparticles to the Difron-3970 and ND-NDP-1 reagents, on the deposition of asphaltene-resin-paraffin deposits on a metallic surface was studied under laboratory conditions by the "cold tube" method. The experiments were carried out at a cold-tube temperature of 20 °C at various concentrations of the depressant additives (100-700 g/t). Based on the experimental results, the effectiveness of the individual reagents and of the composition prepared with the nanoparticle additive was calculated, together with the maximum percentage of paraffin deposits accumulating on the tube surface. The highest effectiveness, 99%, was observed for the "ND-NDP-1+Cu" depressant additive at a concentration of 700 g/t. Keywords: nanoparticle, composition, cold tube, high-paraffin oil, demulsification, pour point.
Article
The efficiency of analytical information processing depends primarily on the quality of the input data array. Cleaning the input data is an important step in any analysis. The presence of noise and anomalies can significantly affect the result of a study and lead to erroneous conclusions, but so can excessive cleaning, accompanied by the loss of potentially valuable observations. Despite the constant optimization of systems for collecting and processing information, the development of an effective methodology for eliminating inaccuracies in a data set is still an area of heightened interest in the scientific community. The continuously increasing volume of information flows predetermines the need for an adequate tool for cleaning time series from noise. The task of improving the accuracy of identifying noise elements is especially relevant in modern conditions. The article provides an overview of existing methods for identifying and eliminating the noise component in one-dimensional and multidimensional time series used in foreign practice, and emphasizes their features and shortcomings. Foreign approaches to the classification of these technologies are considered and analyzed. Based on the results of the analysis, a set of the most effective techniques was determined. Keywords: time series, data cleaning, noise, filtering and noise removal, array of data.
Article
Full-text available
Classification is one of the major functionalities in data mining; it is performed either by predicting the value of unknown class labels on the basis of previously labeled data or by grouping a dataset on the basis of some implicit similarity measure. Clustering works on unsupervised datasets and converts them into groups on the basis of measures such as Euclidean distance in K-Means clustering. The performance of K-Means can be significantly affected by outliers, which the K-Means algorithm does not handle. This paper proposes a change to the K-Means algorithm to accommodate a method for outlier detection based on a threshold value. The outlier threshold, named clus_span, is computed by taking the distance of each point from every other point and dividing by the total number of points. All points of a dataset that do not meet the minimum threshold are considered outliers. The new K-Means with this add-in is tested on a benchmark dataset for the identification of outliers and compared with the existing K-Means algorithm in terms of accuracy. An improvement in performance is evident.
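A rough sketch, under assumptions, of how such a mean-pairwise-distance threshold could be combined with K-Means is given below; the synthetic data are invented and the code illustrates the general idea rather than the paper's exact clus_span procedure.

```python
# Mark points whose nearest neighbour lies beyond the mean pairwise distance as outliers,
# then cluster the remaining points with K-Means (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(8, 1, (50, 2)),
               [[30.0, 30.0]]])                 # one obvious outlier

dist = pairwise_distances(X)
threshold = dist.sum() / (len(X) * len(X))      # mean pairwise distance used as threshold
np.fill_diagonal(dist, np.inf)
is_outlier = dist.min(axis=1) > threshold       # no other point within the threshold

labels = np.full(len(X), -1)                    # -1 marks outliers
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[~is_outlier])
labels[~is_outlier] = kmeans.labels_

print("outliers detected at indices:", np.where(is_outlier)[0])
```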
Book
Full-text available
From an interdisciplinary perspective of computing and public administration, this book presents a descriptive and exploratory overview of the websites (web portals, web pages) of Mexico's municipal governments as they stood between 2020 and 2021. It initially faces the challenge of finding all (or as many as possible of) these websites, given that the few lists or directories of this information available from government sources are in an uncertain state of currency. The challenge is overcome by applying web data mining techniques and manual techniques. An important product of this research is a dataset that gathers the digital addresses of a large number of municipal websites, greater than that of the official sources. Two other datasets are produced: one of sociodemographic information about the inhabitants and another of the administrative and political characteristics of the municipal governments. With these three datasets (which the authors offer for free download on the web), the sociodemographic and governmental profiles of the municipalities that have an official website are analysed, discovered and represented using statistical and machine learning techniques, making it possible to differentiate them from those that do not. The results consist of a series of descriptive statistical analyses, maps and supervised machine learning models. The models are produced using algorithms that generate classification trees and rules (the resulting models can be downloaded free of charge). The results are complemented with a proposal to facilitate the continuous study of Mexican municipal websites in the medium and long term. The proposal consists of implementing a directory of these websites that can be continuously updated in a semi-automated way, together with an automated repository of replicas of these websites. This will facilitate their observation and analysis, both cross-sectional and longitudinal.
Conference Paper
Full-text available
In this paper, a diagnostic agent is defined that finds information about the constructional and technological preferences of customers for a reference technical object. Based on this agent, a system that handles uncertainty can be implemented and used to define customer preferences. Given imprecise information from users, this system would return their preferences and exact data that can be used to improve production.
Article
Full-text available
Developments in information technology have impacted on all areas of modern life and in particular facilitated the growth of globalisation in commerce and communication. Within the drugs area this means that both drugs discourse and drug markets have become increasingly digitally enabled. In response to this, new methods are being developed that attempt to research and monitor the digital environment. In this commentary we present three case studies of innovative approaches and related challenges to software-automated data mining of the digital environment: (i) an e-shop finder to detect e-shops offering new psychoactive substances, (ii) scraping of forum data from online discussion boards, (iii) automated sentiment analysis of discussions in online discussion boards. We conclude that the work presented brings opportunities in terms of leveraging data for developing a more timely and granular understanding of the various aspects of drug-use phenomena in the digital environment. In particular, combining the number of e-shops, discussion posts, and sentiments regarding particular substances could be used for ad hoc risk assessments as well as longitudinal drug monitoring and indicate “online popularity”. The main challenges of digital data mining involve data representativity and ethical considerations.
Chapter
In this paper, we address the importance of classification and social media mining of human emotions. We compare different theories of basic emotions and the application of emotion theory in practice. Based on Plutchik's classification, we suggest creating a specialized lexicon of terms and phrases to identify emotions for research into general attitudes towards mobile learning in social media. The approach can also be applied to other areas of scientific knowledge that aim to explore the emotional attitudes of users in social media. It is based on Natural Language Processing and, more specifically, uses text mining classification algorithms. For test purposes, we retrieved a number of tweets on users' attitudes towards mobile learning.
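As a rough illustration of how such a lexicon-based emotion tagger could be wired up (the lexicon entries and the tweet below are invented placeholders, not the authors' resources):

```python
# Minimal sketch of lexicon-based emotion tagging with a hand-built,
# Plutchik-style lexicon. All entries and example text are illustrative only.
from collections import Counter
import re

EMOTION_LEXICON = {
    "joy": {"love", "great", "excited", "fun"},
    "trust": {"reliable", "recommend", "helpful"},
    "fear": {"worried", "afraid", "risky"},
    "anger": {"hate", "annoying", "frustrating"},
}

def tag_emotions(text: str) -> Counter:
    """Count lexicon hits per emotion in a single post or tweet."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for emotion, words in EMOTION_LEXICON.items():
        counts[emotion] = sum(tok in words for tok in tokens)
    return counts

tweet = "Mobile learning is great but the app is so annoying sometimes"
print(tag_emotions(tweet))  # e.g. joy: 1, anger: 1, the rest: 0
```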
Chapter
Security issues in e-commerce web applications are still exploratory, and in spite of an increase in e-commerce application research and development, many security challenges remain unanswered. Botnets are among the most malicious threats to web applications, especially e-commerce applications. A botnet is a network of bots; it executes automated scripts to launch different types of attack on web applications. Botnets are typically controlled by one or more hackers known as bot masters and are exploited for different types of attack, including DoS (denial of service), DDoS (distributed denial of service), phishing, spreading of malware, adware, spyware, identity fraud, and logic bombs. The aim of this chapter is to scrutinize to what degree botnets can threaten e-commerce security. The first section presents an overview of botnets in the context of e-commerce security in order to give the reader an understanding of the background for the remaining sections.
Chapter
Data Mining (DM) is being applied with success in Business Intelligence (BI) environments, and several examples of applications can be found. BI and DM have different roots and, as a consequence, significantly different characteristics. DM emerged from scientific environments; thus, it is not business oriented, and DM tools still demand heavy work in order to obtain the intended results. By contrast, BI is rooted in industry and business, and as a result BI tools are user-friendly. This chapter reflects on this difference from a historical perspective. Starting with separate historical perspectives of BI and DM, the author then discusses how they converged into the current situation, in which DM is successfully used and integrated in BI environments.
Article
Full-text available
Market forecasting, such as stock market forecasting, involves high transaction volumes and attracts the attention of researchers and investors alike. Risk and turnover are two important factors in any investment decision. Understanding market momentum gives the ability to predict future movements, and the ability to predict in a market economy makes it possible to achieve higher turnover by reducing risk and avoiding financial losses. News plays an important role in evaluating the current stock price, and the development of data mining methods, computational intelligence, and machine learning algorithms has led to new prediction models. phpCrawler is a PHP-based content crawler built on the DomCrawler and Guzzle packages for collecting and storing web data. With this tool, news releases from 17 news agencies are stored and categorized; text mining and support vector machines with different kernels are then used to predict stock price direction. In this research, 948,990 news items from 17 news agencies were stored. More than 300,000 news items in the political and economic categories were used, and the stock prices of chemical companies between November 2017 and March 2018 (123 trading days) were studied. The results show that a Support Vector Machine with a linear kernel reached a prediction accuracy of 83% for average price movement; a nonlinear Support Vector Machine with a polynomial kernel increased accuracy by two percentage points to 85% on average, while other kernels performed worse.
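A minimal, assumption-laden sketch of this kind of pipeline (TF-IDF features plus SVMs with linear and polynomial kernels, using scikit-learn and an invented labeled_news.csv) might look like:

```python
# Sketch: predicting price direction from news text with TF-IDF + SVM.
# The CSV name, columns, and labels are assumptions for illustration only.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

news = pd.read_csv("labeled_news.csv")   # columns: text, direction (up/down)
X_train, X_test, y_train, y_test = train_test_split(
    news["text"], news["direction"], test_size=0.2, random_state=0)

for kernel in ("linear", "poly"):
    # Vectorize the raw text and train an SVM with the chosen kernel
    model = make_pipeline(TfidfVectorizer(max_features=20000), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    print(kernel, "accuracy:", model.score(X_test, y_test))
```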
Preprint
Full-text available
This article addresses a problem in the electronic government discipline of special interest in Mexico: the need for a concentrated and updated information source about municipal e-government websites. One reason for this is the lack of a complete and updated database containing the electronic addresses (web domain names) of the municipal governments that have a website. Due to diverse causes, not all Mexican municipalities have one, and a number of those that do present information corresponding not to the current government but to previous ones. The scarce official lists of municipal websites are not updated frequently enough, and manually determining which municipalities have an operating and valid website at a given moment is a time-consuming process. Besides, website contents do not always comply with legal requirements and are considerably heterogeneous. In turn, the development level of municipal websites is valuable information that can be harnessed for diverse theoretical and practical purposes in the public administration field. Obtaining all these pieces of information requires website content analysis. Therefore, this article investigates the need for, and the feasibility of, automating the implementation and updating of a digital repository for performing diverse analyses of these websites. Its technological feasibility is addressed by means of a literature review about web scraping and by proposing a preliminary manual methodology. This takes into account known, proven techniques and software tools for web crawling and scraping. No new techniques for crawling or scraping are proposed because the existing ones satisfy the current needs. Finally, software requirements are specified in order to automate the creation, updating, indexing, and analysis of the repository.
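A minimal sketch of the kind of repository-building step the article describes, assuming only the requests library and an invented municipal_urls.txt list of candidate addresses, could be:

```python
# Sketch: checking a list of candidate municipal domains and archiving their
# homepages for later longitudinal analysis. File and folder names are
# illustrative assumptions only.
import pathlib
import requests

candidates = pathlib.Path("municipal_urls.txt").read_text().split()
archive = pathlib.Path("snapshots")
archive.mkdir(exist_ok=True)

for url in candidates:
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        print(url, "unreachable")
        continue
    if resp.ok:
        # Store a snapshot of the homepage under a filesystem-safe name
        fname = url.replace("://", "_").replace("/", "_") + ".html"
        (archive / fname).write_text(resp.text, encoding="utf-8")
        print(url, "archived")
    else:
        print(url, "returned HTTP", resp.status_code)
```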
Article
Full-text available
The concept of the archive as an object of study in the humanities has been addressed by late twentieth-century approaches such as structuralism and post-structuralism. Distancing themselves from the grand modern theoretical constructions, these approaches understand communication and culture through the analysis of signs and their factual arrangement in records of all kinds (texts, images, symbols), avoiding any transcendental orientation. With the digital revolution, new technologies have arisen that give rise to new types of archive, and through data mining new correlations and textual structures are discovered. This article asks whether pre-digital problems persist in the new digital medium, and how structuralist and post-structuralist debates can rethink some of the controversies that affect the digital humanities today.
Article
Full-text available
As an interdisciplinary subject of study, data mining has become a new and intriguing field among researchers. As our capabilities for both generating and collecting data increase rapidly, it has become a dynamic and fast-expanding field of great strength. The demand for data stems from the computerization of business, scientific, and government transactions; informative searches on different topics; digital images; online purchasing of products; and more. In addition, the popular use of the World Wide Web as a global information system has flooded us with a tremendous amount of data and information. This explosive growth in stored and transient data has generated an urgent need for new techniques and automated tools that can intelligently assist us in transforming vast amounts of data into useful information and knowledge.
Chapter
This chapter focuses on predicting web user behaviors. When web users enter a website, every move they make on that website is stored in web log files. Unlike a focus group or questionnaire, the log files reflect real user behavior, and having actual user behavior is of great value to organizations. This chapter examines ways of extracting user patterns (user behavior) from the log files. In this context, the web usage mining process is explained and some web usage mining techniques are discussed.
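A small sketch of the first step of such a process, assuming Common Log Format lines and a conventional 30-minute inactivity timeout for splitting sessions, could look like:

```python
# Sketch: grouping web server log hits into per-visitor sessions.
# Assumes Common Log Format lines and a 30-minute inactivity timeout.
import re
from collections import defaultdict
from datetime import datetime, timedelta

LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)')
TIMEOUT = timedelta(minutes=30)

def sessionize(lines):
    sessions = defaultdict(list)   # ip -> list of sessions (each a list of pages)
    last_seen = {}
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, ts_raw, page = m.groups()
        ts = datetime.strptime(ts_raw.split()[0], "%d/%b/%Y:%H:%M:%S")
        if ip not in last_seen or ts - last_seen[ip] > TIMEOUT:
            sessions[ip].append([])    # start a new session for this visitor
        sessions[ip][-1].append(page)
        last_seen[ip] = ts
    return sessions

sample = ['127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 512']
print(dict(sessionize(sample)))
```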
Chapter
The amount of information on the web is increasing day by day, and managing such a vast amount of information is a difficult task. Users find it hard to capture the information they need, and much of their time is spent framing a proper query and filtering the resulting web pages. The search engine plays a major role in filtering information and ranking the desired results. Fully accurate retrieval remains an open problem, and in this regard this paper presents an approach that tries to optimize the ranking algorithm by employing document clustering and similarity measures. We present an outline of different ranking algorithms and propose an approach in which the PageRank algorithm is optimized using document clustering. The approach also employs content mining along with structure mining, which helps to reduce the computational complexity of the algorithm and thereby the time needed to rank the web pages.
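As a reminder of the baseline that such clustering-based optimizations start from, a minimal power-iteration PageRank over a toy link graph (the graph itself is invented for illustration) can be written as:

```python
# Minimal power-iteration PageRank over a toy link graph (illustrative only).
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                   # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_graph))
```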
Conference Paper
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from 3 years ago. This paper provides an in-depth description of our large-scale web search engine - the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections, where anyone can publish anything they want.
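Although the abstract stays at the system level, the core data structure behind any such full-text engine is the inverted index; a toy, purely illustrative sketch (invented documents, not the prototype's internals) is shown below:

```python
# Illustrative sketch of an inverted index: term -> set of documents
# containing it. Documents and queries here are toy data only.
from collections import defaultdict
import re

docs = {
    "page1.html": "web data mining uncovers patterns in web usage",
    "page2.html": "hypertext structure helps rank web pages",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in re.findall(r"[a-z]+", text.lower()):
        index[term].add(doc_id)

def search(term):
    """Return the documents containing the query term."""
    return sorted(index.get(term.lower(), set()))

print(search("web"))        # ['page1.html', 'page2.html']
print(search("hypertext"))  # ['page2.html']
```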