Figure 1 - uploaded by Trung Hieu Tran
Source publication
Machine learning and data mining are research areas of computer science whose quick development is due to the advances in data analysis research, growth in the database industry and the resulting market needs for methods that are capable of extracting valuable knowledge from large data stores. A vast amount of research work has been done in the mul...
Contexts in source publication
Context 1
... mining refers to the analysis of large amounts of multimedia information in order to extract patterns based on their statistical relationships. Figure 1 shows the categories of multimedia data mining ...
Context 2
... will talk about random forest in classification, since classification is sometimes considered the building block of machine learning. Figure 10 shows what a random forest looks like. Random Forest is also considered a very handy and easy-to-use algorithm, because its default hyperparameters often produce a good prediction result. ...
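To make the ensemble idea concrete, here is a hedged, from-scratch sketch of bagging with majority voting, using one-feature threshold stumps in place of full decision trees; a real random forest also grows deeper trees and randomizes the feature choice at each split.

```python
import random

# Minimal sketch of the random-forest idea: train many weak learners on
# bootstrap samples and combine them by majority vote. Each "tree" here is
# just a one-feature threshold stump, chosen for brevity.

def fit_stump(X, y):
    """Pick the (feature, threshold, flip) with the fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for row in X:
            t = row[f]
            for flip in (False, True):
                preds = [(x[f] > t) != flip for x in X]
                err = sum(p != bool(lab) for p, lab in zip(preds, y))
                if best is None or err < best[0]:
                    best = (err, f, t, flip)
    _, f, t, flip = best
    return lambda x: int((x[f] > t) != flip)

def fit_forest(X, y, n_trees=15, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    def predict(x):                                 # majority vote
        votes = sum(s(x) for s in stumps)
        return int(votes * 2 >= len(stumps))
    return predict

# Toy data: label is 1 when the first feature exceeds 5.
X = [[1, 9], [2, 1], [3, 7], [6, 2], [7, 8], [9, 0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_forest(X, y)
print([model(x) for x in X])
```

The sketch mirrors the "handy defaults" point: `n_trees` and the stump depth are the only knobs, and the defaults already fit the toy data.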
Context 3
... a training set has to select a learning model to learn from and make the multimedia mining model more iterative [38]. Figure 11 below shows the multimedia mining process: data collection is the very first step in the multimedia mining process. It provides the raw data that is then input to the data preprocessing stage, which includes several tasks such as data cleaning and feature selection. ...
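The collection-then-preprocessing flow described above can be sketched in a few lines; the variance-threshold criterion below is an illustrative stand-in for whatever feature-selection method a real multimedia mining pipeline would use.

```python
# Sketch of the first two pipeline stages: data cleaning drops incomplete
# records, then feature selection keeps only informative columns.

def clean(records):
    """Data cleaning: drop records containing missing (None) values."""
    return [r for r in records if all(v is not None for v in r)]

def select_features(rows, min_variance=0.01):
    """Keep feature columns whose variance exceeds a threshold."""
    n = len(rows)
    keep = []
    for col in range(len(rows[0])):
        vals = [r[col] for r in rows]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        if var > min_variance:
            keep.append(col)
    return [[r[c] for c in keep] for r in rows], keep

raw = [[1.0, 5.0, 3.0], [2.0, 5.0, None], [3.0, 5.0, 1.0], [4.0, 5.0, 2.0]]
cleaned = clean(raw)                   # the record with None is dropped
selected, kept_cols = select_features(cleaned)
print(kept_cols)                       # the constant column is removed
```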
Context 4
... consists of huge, widely distributed sources of advertisements, consumer information, education, government, e-commerce, and many other services. The main tasks of web mining include mining of web contents, web access patterns, and web linkage structures, as shown in Figure 12. This involves mining the web page layout structure, mining the web's link structures to identify authoritative web pages, mining multimedia data on the web, automatic classification of web documents, and web usage mining [38]. ...
Context 5
... after this, mining is performed on the collected data, and then, after interpretation and evaluation, the knowledge is generated. The entire process is described in Figure 13 below. ...
Context 6
... two important issues in video mining are developing a representational scheme for the content and a human-friendly query/interface [92]. Figure 14 shows a general framework for video data mining. There are many video mining approaches, and they are roughly classified into five categories. ...
Context 7
... phoneme-based indexing does not deal with conversion from speech to text, but instead works only with sound. Figure 15 shows the process of audio mining. The main objective of audio mining technology is to search through speech to identify specific characteristics. ...
Context 8
... the main use of this technology is in the field of security; hence it can be utilized by the military, police, and private companies that provide security services. Figure 16 shows the present architecture, which includes the types of multimedia mining processes [109]. Data collection is the initial stage of the learning system; pre-processing extracts significant features from raw data and includes data cleaning, transformation, normalization, feature extraction, etc. Learning can be direct if informative types can be recognized at the pre-processing stage. ...
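As one concrete instance of the normalization task listed in the pre-processing stage, here is a min-max rescaling sketch (illustrative, not from the paper): each feature column is mapped to the [0, 1] range so that features with large raw magnitudes do not dominate learning.

```python
# Min-max normalization: rescale every feature column to [0, 1].

def min_max_normalize(rows):
    cols = list(zip(*rows))                   # column-wise view of the data
    out_cols = []
    for vals in cols:
        lo, hi = min(vals), max(vals)
        span = hi - lo or 1.0                 # guard against constant columns
        out_cols.append([(v - lo) / span for v in vals])
    return [list(r) for r in zip(*out_cols)]  # back to row-wise records

data = [[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]]
print(min_max_normalize(data))
```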
Context 9
... the difference between unstructured data mining and structured data mining is the sequence or time element. The architecture for converting unstructured data to structured data, which is used for extracting information from unstructured databases, is shown in Figure 17. Data mining tools are then applied to the stored structured databases. ...
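The unstructured-to-structured conversion described above can be illustrated with a small pattern-extraction sketch; the field names and sentence pattern are hypothetical, chosen only to show free text becoming typed records that ordinary data mining tools could then query.

```python
import re

# Pull typed fields out of free text with a pattern and store them as
# structured records. The schema (name, qty, item) is illustrative.

PATTERN = re.compile(r"(?P<name>[A-Z][a-z]+) bought (?P<qty>\d+) (?P<item>\w+)")

def to_structured(lines):
    records = []
    for line in lines:
        m = PATTERN.search(line)
        if m:                                 # keep only parseable lines
            records.append({"name": m["name"],
                            "qty": int(m["qty"]),
                            "item": m["item"]})
    return records

raw_text = [
    "Alice bought 3 books last Tuesday.",
    "no useful structure here",
    "Bob bought 12 apples at the market.",
]
print(to_structured(raw_text))
```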
Citations
... Many previous studies have been conducted, and various classification schemes as well as clustering algorithms have been employed to predict breast cancer. To deliver continuous results for particular data, H. Tran employed logistic regression, a supervised learning approach that includes additional dependent variables [3]. Shen et al. chose the most crucial features, used the feature selection method Interactive Autonomy and Collaborative Technologies Laboratory (INTERACT) to create a model, and found that it outperformed the other model in terms of accuracy [4]. ...
Breast cancer, whose incidence rate is increasing year by year, is one of the malignant tumours with the highest incidence rate in women. Every year, about 1,300,000 people worldwide are newly diagnosed with breast cancer and 400,000 die from it. For the sake of those who may be at risk of breast cancer, it is of critical importance to establish a model that can make predictions of breast cancer. This study utilizes the Random Forest algorithm and the Logistic Regression algorithm to construct an analysis model. The study is conducted on a breast cancer dataset containing data derived from Wisconsin. Specifically, the research conducts feature selection, works out the relationship between various features and tumour types, and selects the 5 most significant features. Based on the data of those 5 features, the accuracy of the two models is compared, and the Logistic Regression model is further optimized to reach a higher prediction accuracy. This study is highly significant for the medical community, since the model it created can help with breast cancer prediction, allowing for early intervention and a higher survival rate for possible breast cancer patients.
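As an illustration of the logistic-regression half of the study, here is a from-scratch gradient-descent sketch on synthetic two-feature data; it is not the authors' code, and the toy data merely mimics a two-class tumour split on a couple of measurements.

```python
import math

# Logistic regression trained by per-sample gradient descent on log-loss.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg(X, y, lr=0.5, epochs=2000):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - t                         # dLoss/dz for log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)

# Synthetic data: class 1 when the feature sum is large (loosely mimicking
# a "malignant vs benign" split on two scaled tumour measurements).
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logreg(X, y)
print([predict(w, b, x) for x in X])
```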
... (1) f(x1, x2) = exp(−‖x1 − x2‖² / (2σ²)). Naive Bayes (NB): Naive Bayes, a popular supervised classification learning algorithm, is built on Bayes' theorem. Naive Bayes frequently performs well in reality and is particularly effective with high-dimensional data despite its naive assumption of feature independence [57,58]. It implies: ...
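The kernel formula quoted in the excerpt above is, in standard form, the Gaussian (RBF) kernel f(x1, x2) = exp(−‖x1 − x2‖² / (2σ²)). A minimal stdlib transcription, with the bandwidth σ left as a free parameter:

```python
import math

def rbf_kernel(x1, x2, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x1 - x2||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))   # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))   # distance 5 -> exp(-12.5)
```

The kernel equals 1 for identical inputs and decays with squared distance, which is why SVMs use it as a similarity measure.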
Breast cancer is still a major problem for medical research, science, and society. Breast cancer is the most common form of cancer among women and has a high rate of mortality. Early detection will lessen its impact and could urge victims to receive immediate medical treatment, which will significantly improve the prognosis and likelihood of recovery. However, early detection models suffer from many constraints, like a high-dimensional feature set, imbalanced data, the integration of different data, and generalization. All these constraints make early detection models a challenge. In this review article, we point out the breast cancer detection model’s open research issues. Also, highlight the conventional framework using machine and deep learning along with the method of feature selection, evaluate the conventional model based on accuracy, and concluded with a possible future research direction for breast cancer detection or classification.
... The multimedia data is classified into two broad categories: static media and dynamic media. Static media consists of text and images, whereas dynamic media consists of audio and video [5], as shown in Figure 1. ...
Over recent years, multimedia data has become a cornerstone for insightful data analysis, yielding vital information crucial for informed decision-making processes. This diverse data format encompasses audio, video, images, and text, offering a wealth of valuable knowledge. Advancements in multimedia acquisition, storage, and processing technologies have significantly enhanced analytical capabilities, overcoming challenges posed by semi-structured and unstructured data formats. Various entities including corporations, governmental bodies, and academic institutions are keenly interested in harnessing insights from the vast reservoirs of multimedia data generated across diverse sources. Consequently, researchers have delved into data mining methodologies, uncovering effective strategies for extracting insights from multimedia datasets. This study aims to probe the conceptual and practical dimensions of multimedia data mining within surveillance contexts, elucidating its transformative impact on diverse sectors by facilitating efficient data collection, analysis, and dissemination processes. Moreover, it underscores the significance of incorporating relevant cryptography methods to bolster the system’s integrity and completeness.
... It has more dependent variables. It is a popular choice for modeling and has the major advantage of accepting binary responses [49]. In paper [52], the author used modified LR to analyze microarray gene expression for the classification of BC. ...
Cancer is a complex global health problem that causes a high death rate. Breast cancer (BC) is the second most common death-causing disease in women worldwide. BC develops in the cells of the ducts or lobules of the glandular tissue when breast cells become uncontrollably proliferative. It can be controlled if diagnosed early enough. There are many techniques used to diagnose or classify BC. Machine learning (ML) has a significant effect on BC classification. This article provides a comparative study of different ML approaches for BC prediction based on medical imaging and microarray gene expression (MGE) data. DT, KNN, RF, SVM, Naïve Bayes, ANN, etc. perform much better in their respective fields. Another method named ensemble, incorporates more than one single classifier to solve the same problem. The study shows how ML with supervised, unsupervised, and ensemble learning might help with BC prognosis. This paper observes ensemble methods provide better performance than a single classifier. Finally, a comprehensive review of various imaging modalities and microarray gene expression, different datasets, performance metrics and outcomes, challenges, and prospective research directions are provided for the new researchers in this fast-growing field.
... Tran [37] implemented an Intelligent System based on Naive Bayes, which similarly predicts heart disease diagnoses, enhancing clinical decision-making and lowering treatment expenses. Gnaneswar [38] emphasized the importance of monitoring heart rate during cycling to manage exercise intensity and avoid overtraining and cardiac stress. ...
Identifying and predicting the risk of Cardiovascular Diseases (CVD) in healthy individuals is crucial for effective disease management. Leveraging extensive health data available in hospital databases offers significant potential for early detection and diagnosis of CVD, which can greatly improve disease outcomes. Integrating machine learning techniques shows considerable promise in advancing clinical practices for CVD management. These methods enable the development of evidence-based clinical guidelines and management algorithms, potentially reducing the need for costly and extensive clinical and laboratory investigations, thereby easing financial burdens on patients and healthcare systems. To enhance early prediction and intervention for CVD, this study proposes the development of novel, robust, efficient machine learning algorithms tailored for automatic feature selection and early-stage heart disease detection. The proposed Catboost model achieves an F1-score of approximately 92.3% and an average accuracy of 90.94%. Compared to many current state-of-the-art approaches, it demonstrates superior classification performance with higher accuracy and precision.
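The abstract above reports model quality as an F1-score (about 92.3%) and accuracy (90.94%). For reference, both metrics derive from the confusion matrix; the sketch below is a stdlib illustration on toy labels, not the study's evaluation code.

```python
# Accuracy and F1 from true/predicted binary labels.
# F1 is the harmonic mean of precision and recall.

def metrics(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(metrics(y_true, y_pred))
```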
... Tran's [37] study built an Intelligent System using the Naive Bayes data mining modeling technique. It is a web application in which the user answers pre-programmed questions. ...
The identification and prognosis of the potential for developing Cardiovascular Diseases (CVD) in healthy individuals is a vital aspect of disease management. Accessing the comprehensive health data on CVD currently available within hospital databases holds significant potential for the early detection and diagnosis of CVD, thereby positively impacting disease outcomes. Therefore, the incorporation of machine learning methods holds significant promise for the advancement of clinical practice in the management of CVDs. By providing a means to develop evidence-based clinical guidelines and management algorithms, these techniques can eliminate the need for costly and extensive clinical and laboratory investigations, reducing the associated financial burden on patients and the healthcare system. In order to optimize early prediction and intervention for CVDs, this study proposes the development of novel, robust, effective, and efficient machine learning algorithms, specifically designed for the automatic selection of key features and the detection of early-stage heart disease. The proposed Catboost model yields an F1-score of about 92.3% and an average accuracy of 90.94%. Compared to many other existing state-of-the-art approaches, it successfully achieved and maximized classification performance with higher percentages of accuracy and precision.
... This is because the raw data have not been processed. Due to the fluidity of the data and the absence of any underlying structure, the usual data mining techniques may not be appropriate in this particular instance [73,75]. ...
This study covered considerable ground on the most crucial qualities of artificial intelligence, such as its capability to act and to explain which choices had the highest chance of being successful. This, in turn, relies on systems that are quickly evolving in terms of their applications, processing speeds, positive changes, and capacity levels. The incorporation of intelligence connected to teaching into machines allows such machines to carry out tasks that had previously required the user's mental processing power. The term "artificial intelligence" is often used to refer to this concept. The primary objective of this research is to analyse the ways in which robots are steadily becoming more capable of doing routine jobs. During this time, the human brain is "taking in" all of the information it needs in order to arrive at the most appropriate conclusion. Artificial intelligence (AI) is reduced to its most elemental form when it consists of nothing more than "picking" the most effective course of action in every specific case. This, in turn, gives the system the capacity to discover new and creative methods to solve problems, which is something that people simply do not have. The findings of this research indicate that the following approaches to artificial intelligence and machine learning are the most successful ones in their respective fields.
... It offers the highest accuracy rate when predicting on large datasets. It is a famous machine learning algorithm built on 3D and 2D modeling [27]. SVM algorithms utilize a set of mathematical functions known as kernels. ...
Breast cancer is one of the main causes of mortality for women around the world. Such mortality rate could be reduced if it is possible to diagnose breast cancer at the primary stage. It is hard to determine the causes of this disease that may lead to the development of breast cancer. But it is still important in predicting the probability of cancer. We can assess the likelihood of occurrence of breast cancer using machine learning algorithms and routine diagnosis data. Although a variety of patient information attributes are stored in cancer datasets not all of the attributes are important in predicting cancer. In such situations, feature selection approaches can be applied to keep the pertinent feature set. In this research, a comprehensive analysis of Machine Learning (ML) classification algorithms with and without feature selection on Wisconsin Breast Cancer Original (WBCO), Wisconsin Diagnosis Breast Cancer (WDBC), and Wisconsin Prognosis Breast Cancer (WPBC) datasets is performed for breast cancer prediction. We employed wrapper-based feature selection and three different classifiers Logistic Regression (LR), Linear Support Vector Machine (LSVM), and Quadratic Support Vector Machine (QSVM) for breast cancer prediction. Based on experimental results, it is shown that the LR classifier with feature selection performs significantly better with an accuracy of 97.1% and 83.5% on WBCO and WPBC datasets respectively. On WDBC datasets, the result reveals that the QSVM classifier without feature selection achieved an accuracy of 97.9% and these results outperform the existing methods.
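As a sketch of the wrapper-based feature selection the study describes, the snippet below greedily adds the feature that most improves a wrapped scorer. The nearest-centroid scorer is an illustrative stand-in (an assumption, not the paper's method, which wraps LR/SVM classifiers).

```python
# Greedy forward (wrapper-based) feature selection: repeatedly add the
# feature whose inclusion most improves the wrapped classifier's score.

def centroid_score(X, y, cols):
    """Training accuracy of a nearest-centroid rule on selected columns."""
    proj = [[r[c] for c in cols] for r in X]
    cents = {}
    for lab in set(y):
        pts = [p for p, t in zip(proj, y) if t == lab]
        cents[lab] = [sum(v) / len(pts) for v in zip(*pts)]
    def nearest(p):
        return min(cents, key=lambda l: sum((a - b) ** 2
                                            for a, b in zip(p, cents[l])))
    return sum(nearest(p) == t for p, t in zip(proj, y)) / len(y)

def forward_select(X, y, k, score=centroid_score):
    chosen = []
    while len(chosen) < k:
        rest = [c for c in range(len(X[0])) if c not in chosen]
        best = max(rest, key=lambda c: score(X, y, chosen + [c]))
        chosen.append(best)
    return chosen

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 7.0], [0.2, 1.0], [0.1, 5.0], [1.0, 6.0], [0.9, 2.0], [1.1, 4.0]]
y = [0, 0, 0, 1, 1, 1]
print(forward_select(X, y, k=1))
```

Swapping `centroid_score` for a function that cross-validates an LR or SVM model recovers the wrapper scheme the paper uses.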
... of Internet users. According to Digimind, The Leading Social Media Listening and Analytics Solution, a specialist in business intelligence software [6], E-reputation is "the perception that Internet users have of your company, your brand, or the people who collaborate with it (managers, employees), which is potentially visible on many supports of the net" [11]. 66% of consumers seek advice before buying a product, and 96% are influenced by the E-reputation of a brand during a purchase. ...
... In our project, we propose a case study of a telecom leader in Algeria, namely Djezzy. The aim is to analyze the different opinions, in the form of comments found on the social network Twitter, using Machine Learning [6] and Data Mining techniques [6] to detect the strengths and anomalies of the company, and to create a dashboard offering an overview of its E-reputation (positive or negative) and its distribution by geographical area. The proposed solution can be implemented on a Cloud server for various advantages, such as execution time, storage capacity and, above all, ease of access, for which a solution based on cloud computing is highly recommended. ...
In a competitive world, companies are looking to gain a positive reputation with their clients. Electronic reputation is part of this reputation, mainly on social networks, where everyone is free to express their opinion. Sentiment analysis of the data collected on these networks is necessary to identify and understand the reputation of a company. This paper focused on one type of data, tweets on Twitter, which the authors analyzed for the company Djezzy (a mobile operator in Algeria) to gauge customer satisfaction. The study is divided into two parts. The first part was the pre-processing phase, where this research filtered the tweets (eliminating useless words, using tokenization) to keep the necessary information for better accuracy. The second part was the application of machine learning algorithms (SVM and logistic regression) for supervised classification, since the results are binary. The strong point of this study was the possibility of running the chosen algorithms on a cloud in order to save execution time; the solution also supports three languages: Arabic, English, and French.
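A rough sketch of the pre-processing phase the abstract describes (filtering and tokenization); the regexes and the tiny English stop-word list are illustrative assumptions, not the study's actual resources, which also cover Arabic and French.

```python
import re

# Tweet pre-processing: lowercase, strip URLs/mentions/hash signs,
# tokenize, and drop stop words.

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "my"}  # tiny sample

def preprocess(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+|@\w+|#", " ", tweet)  # URLs, mentions, '#'
    tokens = re.findall(r"[a-z']+", tweet)              # tokenization
    return [t for t in tokens if t not in STOP_WORDS]   # stop-word filtering

print(preprocess("@Djezzy the network is down AGAIN!! #fail http://t.co/x"))
```

The surviving tokens would then be vectorized and fed to the SVM or logistic-regression classifier.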
... Simple neural networks and deep networks [54] (Alduailej and Alothaim, Journal of Big Data (2022) 9:72) ...
The Arabic language is a complex language with few resources; therefore, its limitations create a challenge to produce accurate text classification tasks such as sentiment analysis. The main goal of sentiment analysis is to determine the overall orientation of a given text in terms of whether it is positive, negative, or neutral. Recently, language models have shown great results in promoting the accuracy of text classification in English. The models are pre-trained on a large dataset and then fine-tuned on the downstream tasks. In particular, XLNet has achieved state-of-the-art results for diverse natural language processing (NLP) tasks in English. In this paper, we hypothesize that such parallel success can be achieved in Arabic. The paper aims to support this hypothesis by producing the first XLNet-based language model in Arabic, called AraXLNet, demonstrating its use in Arabic sentiment analysis in order to improve the prediction accuracy of such tasks. The results showed that the proposed model, AraXLNet, with the Farasa segmenter achieved accuracy results of 94.78%, 93.01%, and 85.77% on the sentiment analysis task for Arabic using multiple benchmark datasets. This result outperformed AraBERT, which obtained 84.65%, 92.13%, and 85.05% on the same datasets, respectively. The improved accuracy of the proposed model was evident across multiple benchmark datasets, thus offering promising advancement in Arabic text classification tasks.