Book

Finding Groups in Data: An Introduction To Cluster Analysis

Abstract

This is a book, not a book review.
... We first explored the application of cluster analysis, or clustering [21]. Its principle consists of dividing relevant data into groups so that in each group, there are data similar to each other with respect to some similarity function. ...
... [22] iii) d(p, q) ≤ d(p, r) + d(r, q): The proof of this property follows from the definition of the xor operation. It is stated in [21] that this axiom does not have to be fulfilled for the purposes of clustering. ...
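The snippet above does not show how the distance d is defined. As a minimal sketch, assuming d(p, q) is the number of set bits in p xor q (the Hamming distance on bit vectors), the triangle inequality can be checked exhaustively for short bit strings. The names `xor_distance` and `triangle_inequality_holds` are illustrative and not taken from the cited paper.

```python
from itertools import product


def xor_distance(p: int, q: int) -> int:
    """Count the bit positions in which p and q differ (assumed definition of d)."""
    return bin(p ^ q).count("1")


def triangle_inequality_holds(bits: int) -> bool:
    """Check d(p, q) <= d(p, r) + d(r, q) for every triple of bits-wide integers."""
    values = range(2 ** bits)
    return all(
        xor_distance(p, q) <= xor_distance(p, r) + xor_distance(r, q)
        for p, r, q in product(values, repeat=3)
    )


# 0b1010 and 0b0110 differ in two bit positions.
example = xor_distance(0b1010, 0b0110)
holds_for_4_bits = triangle_inequality_holds(4)
```

An exhaustive check of this kind only covers small bit widths; it illustrates the property but does not replace the proof the snippet refers to.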
Preprint
This paper deals with reducing the secret key computation time of small scale variants of the AES cipher using algebraic cryptanalysis, which is accelerated by data mining methods. This work is based on the known plaintext attack and aims to speed up the calculation of the secret key by processing the polynomial equations extracted from plaintext-ciphertext pairs. Specifically, we propose to transform the overdefined system of polynomial equations over GF(2) into a new system so that the computation of the Gröbner basis using the F4 algorithm takes less time than in the case of the original system. The main idea is to group similar polynomials into clusters, and for each cluster, sum the two most similar polynomials, resulting in simpler polynomials. We compare different data mining techniques for finding similar polynomials, such as clustering or locality-sensitive hashing (LSH). Experimental results show that using the LSH technique, we get a system of equations for which we can calculate the Gröbner basis the fastest compared to the other methods that we consider in this work. Experimental results also show that the time to calculate the Gröbner basis for the transformed system of equations is significantly reduced compared to the case when the Gröbner basis was calculated from the original non-transformed system. This paper demonstrates that reducing an overdefined system of equations reduces the computation time for finding a secret key.
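The abstract does not say how similarity between polynomials is measured. The following sketch assumes each polynomial over GF(2) is stored as a set of monomials, each monomial as a set of variable names (valid when x² = x), and it uses Jaccard similarity as a stand-in measure. Under that representation, addition over GF(2) is the symmetric difference of the monomial sets, so shared monomials cancel. All names here are hypothetical.

```python
from itertools import combinations


def add_gf2(f: frozenset, g: frozenset) -> frozenset:
    """Add two GF(2) polynomials: monomials present in both cancel out."""
    return f ^ g


def jaccard(f: frozenset, g: frozenset) -> float:
    """Assumed similarity: shared monomials divided by all monomials."""
    union = f | g
    return len(f & g) / len(union) if union else 1.0


def most_similar_pair(polys):
    """Return the index pair (i, j), i < j, with the highest Jaccard similarity."""
    return max(
        combinations(range(len(polys)), 2),
        key=lambda ij: jaccard(polys[ij[0]], polys[ij[1]]),
    )


# Monomials: x1*x2, x3, x4, and the constant 1 (the empty set of variables).
X1X2 = frozenset({"x1", "x2"})
X3 = frozenset({"x3"})
X4 = frozenset({"x4"})
ONE = frozenset()

cluster = [
    frozenset({X1X2, X3, ONE}),  # x1*x2 + x3 + 1
    frozenset({X1X2, X3, X4}),   # x1*x2 + x3 + x4
    frozenset({X4, ONE}),        # x4 + 1
]
i, j = most_similar_pair(cluster)
reduced = add_gf2(cluster[i], cluster[j])  # x4 + 1: the shared terms cancel
```

The sum of the two most similar polynomials has fewer monomials than either input, which is the simplification the abstract describes.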
... On the other hand, k-Medoids selects one of the nearest objects to the cluster center as the prototype. Both algorithms create "hard" clusters, where an object belongs to only one cluster (Kaufman & Rousseeuw, 1990; MacQueen, 1967). In contrast, fuzzy clustering algorithms like Fuzzy c-Means and Fuzzy c-Medoids build "soft" clusters, where objects have degrees of membership in each cluster (Bezdek, 1981; Dembélé & Kastner, 2003). ...
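To make the hard versus soft distinction concrete, the sketch below compares a hard assignment with the standard Fuzzy c-Means membership formula, u_i = 1 / Σ_k (d_i / d_k)^(2/(m−1)), for one object given its distances to each cluster prototype. The function names and the fuzzifier default m = 2 are illustrative choices, not taken from the cited work.

```python
def hard_assignment(distances):
    """Return the index of the single nearest prototype."""
    return min(range(len(distances)), key=distances.__getitem__)


def fuzzy_memberships(distances, m=2.0):
    """Return one membership degree per cluster; the degrees sum to 1."""
    zero = [d == 0 for d in distances]
    if any(zero):
        # The object coincides with a prototype: split membership among those.
        return [1.0 / sum(zero) if z else 0.0 for z in zero]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


distances = [1.0, 3.0]              # distances to prototypes 0 and 1
label = hard_assignment(distances)  # the object belongs only to cluster 0
degrees = fuzzy_memberships(distances)  # roughly 0.9 and 0.1
```

With a hard assignment the object belongs to cluster 0 only, while the fuzzy memberships keep a small degree of membership in cluster 1.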
Thesis
Corruption is a pervasive societal issue, entailing the misuse of public authority for personal benefit. Traditionally, corruption was estimated via perception surveys, which rely on asking individuals about their views on corruption rather than measuring it directly. Such assessments struggle to capture corruption accurately and often diverge from actual corruption levels. Recent advancements in data collection, spurred by calls for transparency in public institutions and fueled by enhanced computational and storage capabilities, have opened unprecedented opportunities for a far more precise analysis of corrupt processes. By quantitatively analyzing concrete datasets, such as transactions between the public sector and private companies, contractual documents, public procurement records, bid outcomes, and healthcare product prices, novel avenues have emerged for both addressing and predicting corruption. These scientific endeavors aim to discover the best policies to mitigate corruption and rebuild trust in public institutions. This doctoral dissertation pioneers this novel approach, forging a collaborative partnership with the Commission for the Prevention of Corruption in Slovenia (CPC). Harnessing state-of-the-art data mining, statistical analysis, and machine learning, we analyze large CPC datasets detailing 17 years of public spending on private companies and the reported receipt of gifts by public officials. We uncover an array of findings along three research directions: 1. We reveal the presence of self-organizing principles that govern Slovenian public expenditure. Such mechanisms are usually observed in more orderly (e.g. physical) systems and come across as surprising in this context, where interactions are dominated by human factors. 2. We construct an interactive framework tailored for CPC's use. It enables quick identification of suspicious private companies whose revenues from public sources exhibit visible disparities that correlate with changes of government. 3. Finally, employing natural language processing, we uncover how seemingly innocent ceremonial gifts can foster favoritism and enable misuse of public positions for personal gain. We illustrate the disparities between the laws regulating gift reporting and the actual practices. In conclusion, this research contributed: (i) new computational methods for data-driven analysis of corruption, and (ii) a better understanding of the societal processes that govern public spending in Slovenia. Our work delivers valuable recommendations to governmental, public, and administrative bodies. We hope these insights will bolster the use of transparent public data as the key tool in the fight against corruption.
... As a feature, PAM allows the use of dissimilarities as well as of distances and is helpful when variables of different metrics are used, as in this case. The number of clusters was decided using the Gap statistic (Tibshirani et al., 2001) and the Silhouette width (Kaufman and Rousseeuw, 1990) methods (Fig. 7), which allow the researcher to see how well the datapoints are grouped into clusters and to assess whether the separation between clusters is good. We carefully analysed the dissimilarities between different numbers of potential clusters, aiming for clear and distinct groupings. ...
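The silhouette width mentioned above can be computed from any dissimilarity, which is why it pairs well with PAM. For object i, with a(i) the mean dissimilarity to its own cluster and b(i) the smallest mean dissimilarity to any other cluster, s(i) = (b(i) − a(i)) / max(a(i), b(i)). The following is a minimal sketch in plain Python. The function name, the toy data, and the choice of 0 for singleton clusters are assumptions made for illustration.

```python
def silhouette_widths(points, labels, dist):
    """Return s(i) for every object; requires at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention used here for singleton clusters
            continue
        # a(i): mean dissimilarity to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean dissimilarity to the members of another cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return widths


points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
widths = silhouette_widths(points, labels, lambda x, y: abs(x - y))
average_width = sum(widths) / len(widths)
```

Values close to 1 indicate well-separated clusters. Comparing the average width across candidate numbers of clusters is one common way to choose among them.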
Article
Addressing digital exclusion first requires an in-depth understanding of the factors leading to it. In this paper, we explore to what extent new digital mobility solutions can be considered inclusive, by taking into account the diverse perspectives of the users of transport services. Specifically, our work contributes to understanding end-user needs and capabilities in digital mobility by presenting a set of personas developed using data from a population-representative survey of 601 Barcelona metropolitan area residents in the framework of the EU Horizon 2020 programme's DIGNITY project. Overall, roughly 15% of this population cannot access and effectively use digital technologies, which hinders their use of many digital mobility services. This work provides information about the diversity of potential users by analysing different stories and travel experiences of the personas; this in turn can inspire decision makers, developers, and other stakeholders along the design process. The methodological approach for developing personas could also be useful for mobility service providers and policymakers who aim to create more inclusive and user-centred transport ecosystems that meet the needs of diverse users.
... 1) Based on the k-means algorithm (k-M): This algorithm utilizes the MDs matrix to form k clusters or groups of peptides based on their similarity in terms of Euclidean distance, separating peptides with different structures. Then, the test set consists of one or more peptides per cluster, each approximating the centroid within its respective cluster, while the remaining compounds are included in the training set [27]. 3) Based on random selection (RS): In this case, the peptides of the test sets are chosen by random selection. This method allows the selection of a suitable number of cases. ...
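The centroid-based split described in the snippet can be sketched as follows, assuming cluster labels are already available and that exactly one compound per cluster goes to the test set (the snippet allows one or more). The descriptor vectors and function name are made up for illustration.

```python
import math


def pick_test_indices(vectors, labels):
    """Per cluster, pick the item closest (Euclidean) to the cluster centroid."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)

    chosen = []
    for lab in sorted(groups):
        members = groups[lab]
        dim = len(vectors[members[0]])
        centroid = [
            sum(vectors[i][d] for i in members) / len(members) for d in range(dim)
        ]
        chosen.append(min(members, key=lambda i: math.dist(vectors[i], centroid)))
    return chosen


# Hypothetical two-descriptor vectors for six compounds in two clusters.
vectors = [(0, 0), (1, 0), (2, 0), (10, 10), (10, 12), (10, 11)]
labels = [0, 0, 0, 1, 1, 1]
test_idx = pick_test_indices(vectors, labels)
train_idx = [i for i in range(len(vectors)) if i not in test_idx]
```

Choosing the members nearest to each centroid gives a test set that spans every structural group, in contrast to the random selection the snippet lists as an alternative.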
Preprint
Antioxidant agents play an essential role in the food industry by improving the oxidative stability of food products. In recent years, the search for new natural antioxidants has increased due to the potentially high toxicity of chemical additives. Therefore, the synthesis and evaluation of antioxidant activity in peptides is a field of current research. In this study, we performed a Quantitative Structure Activity Relationship (QSAR) analysis of 19 cysteine-containing dipeptides and 19 tripeptides. The main objective is to provide information on the relationship between the structure of peptides and their antioxidant activity. For this purpose, 1D and 2D molecular descriptors were calculated using the PaDEL software, which provides information about the structure, shape, size, charge, polarity, solubility and other aspects of the compounds. Different QSAR models for di- and tripeptides were developed. The statistical parameters for the dipeptide model (R² train = 0.947 and R² test = 0.804) and for the tripeptide model (R² train = 0.863 and R² test = 0.789) indicate that the generated models have high predictive capacity. Then, the influence of the cysteine position was analyzed by predicting the antioxidant activity for new di- and tripeptides and comparing it with that of glutathione.
... This method of classifying MFI values has been used in neglected tropical diseases and malaria studies 33,34. The method partitions data into a predetermined number of clusters, and the optimal number of clusters for each antigen was determined using within-cluster sum of squares and average silhouette testing 35. For a specific antigen, individuals grouped within the cluster of higher MFI values were classified as seropositive, and individuals within the cluster of lower MFI values were defined as seronegative. ...
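As an illustration of the two-cluster classification described above, the sketch below splits one-dimensional MFI values into a low and a high group by minimizing the within-cluster sum of squares. In one dimension the optimal two-cluster split is always a cut of the sorted values, so every cut can be tried directly. The MFI readings, the midpoint threshold, and the function names are assumptions made for illustration, not details from the cited study.

```python
def wcss(values):
    """Within-cluster sum of squares of one group of values."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def two_cluster_threshold(mfi):
    """Return a cut-off between the low and high cluster (needs >= 2 values)."""
    ordered = sorted(mfi)
    best_cut = min(
        range(1, len(ordered)),
        key=lambda c: wcss(ordered[:c]) + wcss(ordered[c:]),
    )
    # Midpoint between the largest "low" and the smallest "high" value.
    return (ordered[best_cut - 1] + ordered[best_cut]) / 2


def classify(mfi, threshold):
    return ["seropositive" if v > threshold else "seronegative" for v in mfi]


mfi = [100, 120, 90, 5000, 5200, 110]  # hypothetical readings for one antigen
threshold = two_cluster_threshold(mfi)
status = classify(mfi, threshold)
```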
Article
Despite progress towards malaria reduction in Peru, measuring exposure in low transmission areas is crucial for achieving elimination. This study focuses on two very low transmission areas in Loreto (Peruvian Amazon) and aims to determine the relationship between malaria exposure and proximity to health facilities. Individual data were collected from 38 villages in Indiana and Belen, including geo-referenced households and blood samples for microscopy, PCR and serological analysis. A segmented linear regression model identified significant changes in seropositivity trends among different age groups. The local Getis-Ord Gi* statistic revealed clusters of households with high (hotspots) or low (coldspots) seropositivity rates. Findings from 4000 individuals showed a seropositivity level of 2.5% (95%CI: 2.0%-3.0%) for P. falciparum and 7.8% (95%CI: 7.0%-8.7%) for P. vivax, indicating recent or historical exposure. The segmented regression showed exposure reductions in the 40–50 age group (β1 = 0.043, p = 0.003) for P. vivax and the 50–60 age group (β1 = 0.005, p = 0.010) for P. falciparum. Villages at long and extreme distances from the Regional Hospital of Loreto exhibited higher malaria exposure than proximate and medium-distance villages (p < 0.001). This study showed the seropositivity of malaria in two very low transmission areas and confirmed the spatial pattern of hotspots as villages become more distant.
... In addition to the descriptive analysis, we addressed heterogeneity by conducting an ancillary cluster analysis using the K-means [36] algorithm, incorporating information on the category of the hospital based on the number of hospital beds and their role in their corresponding health district. The average silhouette [37] method was used to determine the optimal number of clusters. ...
Article
Background: Hospital at home (HaH) was increasingly implemented in Catalonia (7.7 M citizens, Spain), achieving regional adoption within the 2011-2015 Health Plan. This study aimed to assess population-wide HaH outcomes over five years (2015-2019) in a consolidated regional program and provide context-independent recommendations for continuous quality improvement of the service. Methods: A mixed-methods approach was adopted, combining population-based retrospective analyses of registry information with qualitative research. HaH (admission avoidance modality) was compared with a conventional hospitalization group using propensity score matching techniques. We evaluated the 12-month period before the admission, the hospitalization, and use of healthcare resources at 30 days after discharge. A panel of experts discussed the results and provided recommendations for monitoring HaH services. Results: The adoption of HaH steadily increased from 5,185 episodes/year in 2015 to 8,086 episodes/year in 2019 (total episodes 31,901; mean age 73 (SD 17) years; 79% high-risk patients). Mortality rates were similar between HaH and conventional hospitalization within the episode [76 (0.31%) vs. 112 (0.45%)] and at 30 days after discharge [973 (3.94%) vs. 1112 (3.24%)]. Likewise, the rates of hospital re-admissions [2,00 (8.08%) vs. 1,63 (6.58%)] and ER visits [4,11 (16.62%) vs. 3,97 (16.03%)] at 30 days after discharge were similar between groups. The 27 hospitals assessed showed high variability in patients’ age, multimorbidity, severity of episodes, recurrences, and length of stay of HaH episodes. Recommendations aiming at enhancing service delivery were produced. Conclusions: Besides confirming safety and value generation of HaH for selected patients, we found that this service is delivered in a case-mix of different scenarios, encouraging hospital-profiled monitoring of the service.
... Cluster analysis was used to group similar objects together into homogeneous clusters and to reveal hidden dependencies and structure in a sample dataset (Kaufman & Rousseeuw, 1990). Particularly, Ward's method was applied to implement cluster analysis based on a measure of the distance between them (Ward, 1963). ...
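Ward's method merges, at each step, the two clusters whose union increases the total within-cluster sum of squares the least. For clusters A and B with sizes n_A, n_B and centroids c_A, c_B, that increase is n_A·n_B/(n_A+n_B)·‖c_A − c_B‖². The naive implementation below is a sketch for small datasets only. The function names and toy points are illustrative and unrelated to the cited study's data.

```python
import math
from itertools import combinations


def centroid(cluster):
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    size_factor = len(a) * len(b) / (len(a) + len(b))
    return size_factor * math.dist(centroid(a), centroid(b)) ** 2


def ward_clustering(points, n_clusters):
    """Agglomerate singleton clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = ward_clustering(points, 3)
```

Recording the cost of each merge gives the agglomeration distances that are usually plotted alongside the dendrogram.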
Article
The article’s purpose is to analyse the implementation of the knowledge economy and innovation through business education, based on cluster analysis. The roles of the knowledge economy, innovation transfer, entrepreneurship and business-education coopetition in achieving economic growth and sustainable development are substantiated. Input data on the distribution of the knowledge economy through business education include data for 23 countries on the following indicators: new registered enterprises, labour force, employment in industry, proportion of population studying ‘Business, Administration and Law’, proportion of population studying ‘Services’ and proportion of population studying ‘Economics’. Using data normalization, the Ward and Sturges methods and the Statgraphics Centurion 19 software, five clusters were determined to show hidden dependencies and structure in the sample of countries in this research context. The first cluster includes 2 countries (Austria and the United Kingdom), the second – 11 countries (Belgium, Portugal, Denmark, Italy, Lithuania, Latvia, Poland, Ukraine, Croatia, Norway, and the Netherlands), the third – 5 countries (Bulgaria, Spain, France, Switzerland, and Finland), the fourth – 3 countries (Estonia, Germany and Sweden), and the fifth – 2 countries (the Czech Republic and Hungary). The quality of the distribution of countries into clusters was confirmed by building a dendrogram of the distribution into clusters and a graph of agglomeration distance. The results obtained can be useful for further research and for improving state innovation, information and educational policy based on the positive experience of neighbouring countries within the same cluster.
... The values within a cluster should be as similar to each other as possible, while the clusters themselves should be as dissimilar as possible. [5] By applying k-means clustering to the dataset, comparable data points are systematically grouped together to enable a deeper comprehension of underlying trends. The k-means algorithm, a partitioning technique, requires the number of clusters (k) to be specified before the data are divided into discrete groups. ...
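The snippet describes k-means only in outline. A minimal sketch of the usual Lloyd iteration follows: assign each point to its nearest center, recompute each center as the mean of its points, and stop when the centers no longer move. Initial centers are passed in explicitly to keep the example deterministic. The function name, parameters, and toy data are illustrative.

```python
import math


def k_means(points, centers, max_iter=100):
    """Return (labels, centers) after Lloyd iterations from the given centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: mean of the assigned points (keep empty clusters in place).
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


points = [(1, 1), (1, 2), (8, 8), (9, 8)]
labels, centers = k_means(points, centers=[(0, 0), (10, 10)])
```

The number of clusters is fixed by the length of `centers`, which reflects the requirement that k be specified in advance.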
Research Proposal
This study analyzes the complicated landscape of Manhattan real estate sales in 2008 using an extensive dataset from the Department of Finance (DOF). Applying k-means clustering, the study concentrates on residences in the Class 1-, 2-, and 3-Family categories and offers novel insights into the geographic distribution of house sales. The dataset includes all sales with a sale price of at least $150,000 that occurred between January 1st and December 31st, 2008.
Article
Indonesia has a tropical climate with two seasons: dry and rainy. Prolonged dry periods can cause drought disasters, and rain can cause floods and landslides. According to information from the Meteorology, Climatology, and Geophysics Agency (BMKG), natural disasters such as floods and landslides due to heavy rains have been a severe problem in Indonesia for the past five years. Different regional characteristics can affect the intensity of rain that falls in each province in Indonesia. Provinces can be grouped to determine which of them have similar characteristics with respect to rainfall-related natural disasters. This can later provide information to the government and the public so that they are more aware of natural disasters. It is therefore necessary to classify the provinces of Indonesia by rainfall using cluster analysis. The data used are secondary rainfall data taken from the official BMKG website. In this study, cluster analysis of rainfall in 34 provinces in Indonesia used hierarchical and non-hierarchical methods. The approach used in this research limits our clustering of the data; further research with a machine learning approach is recommended. For the clustering method, the agglomerative hierarchical method includes single, average, and complete linkage. The non-hierarchical method includes k-medoids and fuzzy C-means. The cluster analysis results show that the dynamic time warping (DTW) distance with the average linkage method gives the best clustering, with a silhouette coefficient of 0.813.
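Dynamic time warping compares two series by allowing points to be matched out of step, so similar rainfall patterns that are shifted in time can still come out as close. The sketch below is the textbook dynamic-programming form with absolute difference as the local cost and no window constraint. Whether the study used these exact settings is not stated, and the series here are made up.

```python
def dtw_distance(a, b):
    """Return the DTW distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] matched again with a later b
                cost[i][j - 1],      # b[j-1] matched again with a later a
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # stretched copy
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

A matrix of pairwise DTW distances between provinces can then be passed to a hierarchical method such as average linkage.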
Article
BACKGROUND The Alzheimer's Disease COMposite Score (ADCOMS) is more sensitive in clinical trials than conventional measures when assessing pre‐dementia. This study compares ADCOMS trajectories using clustered progression characteristics to better understand different patterns of decline. METHODS Post‐baseline ADCOMS values were analyzed for sensitivity using mean‐to‐standard deviation ratio (MSDR), partitioned by baseline diagnosis, comparing with the original scales upon which ADCOMS is based. Because baseline diagnosis was not a particularly reliable predictor of progression, individuals were also grouped into similar ADCOMS progression trajectories using clustering methods and the MSDR compared for each progression group. RESULTS ADCOMS demonstrated increased sensitivity for clinically important progression groups. ADCOMS did not show statistically significant sensitivity or clinical relevance for the less‐severe baseline diagnoses and marginal progression groups. CONCLUSIONS This analysis complements and extends previous work validating the sensitivity of ADCOMS. The large data set permitted evaluation–in a novel approach–by the clustered progression group.
Article
Cluster analysis involves the problem of optimal partitioning of a given set of entities into a pre-assigned number of mutually exclusive and exhaustive clusters. Here the problem is formulated in two different ways in terms of the distance function: (a) minimizing the within-group sums of squares and (b) minimizing the maximum distance within groups. These lead to different kinds of linear and non-linear (0–1) integer programming problems. Computational difficulties are discussed and efficient algorithms are provided for some special cases.
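The two objectives named in the abstract can be stated as functions of a partition and, for very small inputs, minimized by trying every assignment of entities to clusters. The brute-force search below is only a sketch to make the objectives concrete. It is not the integer-programming approach of the paper, and its cost grows as k^n, which hints at the computational difficulties the abstract mentions. All names and data are illustrative.

```python
import math
from itertools import combinations, product


def within_ss(groups):
    """Objective (a): total squared distance of points to their group centroid."""
    total = 0.0
    for g in groups:
        center = [sum(coord) / len(g) for coord in zip(*g)]
        total += sum(math.dist(p, center) ** 2 for p in g)
    return total


def max_within_distance(groups):
    """Objective (b): largest distance between two points sharing a group."""
    return max(
        (math.dist(p, q) for g in groups for p, q in combinations(g, 2)),
        default=0.0,
    )


def best_partition(points, k, objective):
    """Exhaustively search all assignments that use exactly k non-empty groups."""
    best = None
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:
            continue
        groups = [[p for p, lab in zip(points, labels) if lab == g] for g in range(k)]
        score = objective(groups)
        if best is None or score < best[0]:
            best = (score, labels)
    return best


points = [(0,), (1,), (2,), (10,)]
ss_score, ss_labels = best_partition(points, 2, within_ss)
diam_score, diam_labels = best_partition(points, 2, max_within_distance)
```

On this toy input both objectives select the same partition, but in general they need not agree.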