Book

Finding Groups in Data: An Introduction To Cluster Analysis

January 1990
Biometrics

DOI:10.2307/2532178

Source
DBLP

Publisher: John Wiley, New York.
ISBN: 0-471-87876-6

Authors:

Leonard Kaufman

Vrije Universiteit Brussel

Peter Rousseeuw

KU Leuven

This is a book, not a book review.

Reducing Overdefined Systems of Polynomial Equations Derived from Small Scale Variants of the AES via Data Mining Methods

Preprint

May 2024

Martin Jureček

This paper deals with reducing the secret key computation time of small scale variants of the AES cipher using algebraic cryptanalysis, which is accelerated by data mining methods. This work is based on the known plaintext attack and aims to speed up the calculation of the secret key by processing the polynomial equations extracted from plaintext-ciphertext pairs. Specifically, we propose to transform the overdefined system of polynomial equations over GF(2) into a new system so that the computation of the Gröbner basis using the F4 algorithm takes less time than in the case of the original system. The main idea is to group similar polynomials into clusters, and for each cluster, sum the two most similar polynomials, resulting in simpler polynomials. We compare different data mining techniques for finding similar polynomials, such as clustering or locality-sensitive hashing (LSH). Experimental results show that the LSH technique yields the system of equations whose Gröbner basis is computed fastest among the methods considered in this work. Experimental results also show that the time to calculate the Gröbner basis for the transformed system of equations is significantly reduced compared to the original non-transformed system. This paper demonstrates that reducing an overdefined system of equations reduces the computation time for finding a secret key.
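The summing step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the set-based polynomial representation, the Jaccard similarity, and the helper names (`add_gf2`, `jaccard`, `reduce_cluster`, `poly`) are assumptions chosen for illustration, since the abstract does not say how similarity is measured.

```python
from itertools import combinations

# A monomial is a frozenset of variable indices (x1*x3 -> frozenset({1, 3}));
# the constant term 1 is the empty frozenset. A polynomial over GF(2) is a
# frozenset of monomials, because every coefficient is either 0 or 1.

def add_gf2(p, q):
    """Sum over GF(2): monomials present in both operands cancel (1 + 1 = 0)."""
    return p ^ q

def jaccard(p, q):
    """Assumed similarity measure: shared monomials over all monomials."""
    union = p | q
    return len(p & q) / len(union) if union else 1.0

def reduce_cluster(cluster):
    """Return the sum of the two most similar polynomials in a cluster."""
    p, q = max(combinations(cluster, 2), key=lambda pair: jaccard(*pair))
    return add_gf2(p, q)

def poly(*monomials):
    """Build a polynomial from an iterable of variable-index sets."""
    return frozenset(frozenset(m) for m in monomials)

# Hypothetical cluster of three polynomials in x1, x2, x3.
f = poly({1, 2}, {2, 3}, {1}, set())   # x1x2 + x2x3 + x1 + 1
g = poly({1, 2}, {2, 3}, {3})          # x1x2 + x2x3 + x3
h = poly({1, 3}, {2})                  # x1x3 + x2
simpler = reduce_cluster([f, g, h])    # f + g = x1 + x3 + 1
```

Here `f` and `g` share both quadratic monomials, so their sum drops to degree 1. This shows why adding similar polynomials tends to give simpler ones.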

PATTERNS DISCOVERY IN SLOVENIAN PUBLIC SPENDING: A DATA-DRIVEN APPROACH TO CORRUPTION DETECTION

Thesis

Nov 2023

Jelena Joksimović

Corruption is a pervasive societal issue, entailing the misuse of public authority for personal benefit. Traditionally, corruption was estimated via perception surveys, which probe individuals' views on corruption rather than measuring it directly. Such assessments struggle to capture corruption accurately and often diverge from actual corruption levels. Recent advancements in data collection, spurred by calls for transparency in public institutions and fueled by enhanced computational and storage capabilities, opened unprecedented opportunities for a far more precise analysis of corruptive processes. By quantitatively analyzing concrete datasets, such as transactions between the public sector and private companies, contractual documents, public procurement records, bid outcomes, and healthcare product prices, novel avenues emerged for both addressing and predicting corruption. These scientific endeavors aim to discover the best policies to mitigate corruption and rebuild trust in public institutions. This doctoral dissertation pioneers this novel approach, forging a collaborative partnership with the Commission for the Prevention of Corruption in Slovenia (CPC). Harnessing state-of-the-art data mining, statistical analysis, and machine learning, we analyze large CPC datasets detailing 17 years of public spending on private companies and the reported receipt of gifts by public officials. We uncover an array of findings along three research directions: 1. We reveal the presence of self-organizing principles that govern Slovenian public expenditure. Such mechanisms are usually observed in more orderly (e.g. physical) systems and come across as surprising in this context, where interactions are dominated by human factors. 2. We construct an interactive framework tailored for CPC's use. It enables quick identification of suspicious private companies whose revenues from public sources exhibit visible disparities that correlate with changes of the government. 3. Finally, employing natural language processing, we uncover how seemingly innocent ceremonial gifts can foster favoritism and enable misuse of public positions for personal gains. We illustrate the disparities between the laws regulating gift reporting and the actual practices. In conclusion, this research contributed: (i) new computational methods for data-driven analysis of corruption, and (ii) better understanding of societal processes that govern public spending in Slovenia. Our work delivers valuable recommendations to governmental, public, and administrative bodies. We hope these insights will bolster the use of transparent public data as the key tool in the fight against corruption.

Exploring the diversity of users of digital mobility services by developing personas - A case study of the Barcelona metropolitan area

Article

May 2024

Addressing digital exclusion first requires an in-depth understanding of the factors leading to it. In this paper, we explore to what extent new digital mobility solutions can be considered inclusive, by taking into account the diverse perspectives of the users of transport services. Specifically, our work contributes to understanding end-user needs and capabilities in digital mobility, by presenting a set of personas developed using data from a population-representative survey of 601 Barcelona metropolitan area residents in the framework of the EU Horizon 2020 programme's DIGNITY project. Overall, roughly 15% of this population cannot access and effectively use digital technologies, which hinders their use of many digital mobility services. This work provides information about the diversity of potential users by analysing different stories and travel experiences of the personas; this in turn can inspire decision makers, developers, and other stakeholders along the design process. The methodological approach for developing personas could also be useful for mobility service providers and policymakers who aim to create more inclusive and user-centred transport ecosystems that meet the needs of diverse users.

Development of QSARs for Cysteine-containing di- and tripeptides with antioxidant activity. Influence of the cysteine position

Preprint

Feb 2024

Antioxidant agents play an essential role in the food industry, improving the oxidative stability of food products. In recent years, the search for new natural antioxidants has increased due to the potentially high toxicity of chemical additives. Therefore, the synthesis and evaluation of antioxidant activity in peptides is a field of current research. In this study, we performed a Quantitative Structure Activity Relationship (QSAR) analysis of 19 cysteine-containing dipeptides and 19 cysteine-containing tripeptides. The main objective is to provide information on the relationship between the structure of peptides and their antioxidant activity. For this purpose, 1D and 2D molecular descriptors were calculated using the PaDEL software; these descriptors provide information about the structure, shape, size, charge, polarity, solubility and other aspects of the compounds. Separate QSAR models for di- and tripeptides were developed. The statistical parameters for the dipeptide model (R² train = 0.947 and R² test = 0.804) and for the tripeptide model (R² train = 0.863 and R² test = 0.789) indicate that the generated models have high predictive capacity. Then, the influence of the cysteine position was analyzed by predicting the antioxidant activity of new di- and tripeptides and comparing it with that of glutathione.
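For readers unfamiliar with the R² values quoted above, the following is a minimal sketch of the usual coefficient of determination, R² = 1 - SS_res / SS_tot. The abstract does not say exactly how the train and test values were computed, so this definition and the sample numbers are assumptions for illustration only.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot

# Hypothetical activities and model predictions (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
```

Evaluating the same function on held-out compounds gives a test-set value, which is typically lower than the training value, as in the figures reported above.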

Malaria seroepidemiology in very low transmission settings in the Peruvian Amazon

Article

Feb 2024

Despite progress towards malaria reduction in Peru, measuring exposure in low transmission areas is crucial for achieving elimination. This study focuses on two very low transmission areas in Loreto (Peruvian Amazon) and aims to determine the relationship between malaria exposure and proximity to health facilities. Individual data were collected from 38 villages in Indiana and Belen, including geo-referenced households and blood samples for microscopy, PCR and serological analysis. A segmented linear regression model identified significant changes in seropositivity trends among different age groups. The local Getis-Ord Gi* statistic revealed clusters of households with high (hotspots) or low (coldspots) seropositivity rates. Findings from 4000 individuals showed a seropositivity level of 2.5% (95%CI: 2.0%-3.0%) for P. falciparum and 7.8% (95%CI: 7.0%-8.7%) for P. vivax, indicating recent or historical exposure. The segmented regression showed exposure reductions in the 40–50 age group (β1 = 0.043, p = 0.003) for P. vivax and the 50–60 age group (β1 = 0.005, p = 0.010) for P. falciparum. Villages at long and extreme distances from the Regional Hospital of Loreto exhibited higher malaria exposure than proximate and medium-distance villages (p < 0.001). This study showed the seropositivity of malaria in two very low transmission areas and confirmed the spatial pattern of hotspots as villages become more distant.

Five years of Hospital at Home adoption in Catalonia: impact, challenges, and proposals for quality assurance

Article

Feb 2024
BMC HEALTH SERV RES

Background: Hospital at home (HaH) was increasingly implemented in Catalonia (7.7 M citizens, Spain), achieving regional adoption within the 2011-2015 Health Plan. This study aimed to assess population-wide HaH outcomes over five years (2015-2019) in a consolidated regional program and provide context-independent recommendations for continuous quality improvement of the service. Methods: A mixed-methods approach was adopted, combining population-based retrospective analyses of registry information with qualitative research. HaH (admission avoidance modality) was compared with a conventional hospitalization group using propensity score matching techniques. We evaluated the 12-month period before the admission, the hospitalization, and use of healthcare resources at 30 days after discharge. A panel of experts discussed the results and provided recommendations for monitoring HaH services. Results: The adoption of HaH steadily increased from 5,185 episodes/year in 2015 to 8,086 episodes/year in 2019 (total episodes 31,901; mean age 73 (SD 17) years; 79% high-risk patients). Mortality rates were similar between HaH and conventional hospitalization within the episode [76 (0.31%) vs. 112 (0.45%)] and at 30 days after discharge [973 (3.94%) vs. 1112 (3.24%)]. Likewise, the rates of hospital re-admissions [2,00 (8.08%) vs. 1,63 (6.58%)] and ER visits [4,11 (16.62%) vs. 3,97 (16.03%)] at 30 days after discharge were similar between groups. The 27 hospitals assessed showed high variability in patients’ age, multimorbidity, severity of episodes, recurrences, and length of stay of HaH episodes. Recommendations aiming at enhancing service delivery were produced. Conclusions: Besides confirming safety and value generation of HaH for selected patients, we found that this service is delivered in a case-mix of different scenarios, encouraging hospital-profiled monitoring of the service.

Implementation of knowledge economy and innovation through business education

Article

Dec 2023

The article’s purpose is to analyse the implementation of the knowledge economy and innovation through business education, based on cluster analysis. The roles of the knowledge economy, innovation transfer, entrepreneurship and business-education coopetition in achieving economic growth and sustainable development are substantiated. The input data cover 23 countries and the following indicators: new registered enterprises, labour force, employment in industry, proportion of population studying ‘Business, Administration and Law’, proportion of population studying ‘Services’ and proportion of population studying ‘Economics’. Using data normalization, the Ward and Sturges methods and the Statgraphics Centurion 19 software, five clusters were determined, revealing hidden dependencies and structure in the sample of countries. The first cluster includes 2 countries (Austria and the United Kingdom), the second – 11 countries (Belgium, Portugal, Denmark, Italy, Lithuania, Latvia, Poland, Ukraine, Croatia, Norway, and the Netherlands), the third – 5 countries (Bulgaria, Spain, France, Switzerland, and Finland), the fourth – 3 countries (Estonia, Germany and Sweden), and the fifth – 2 countries (the Czech Republic and Hungary). The quality of the distribution of countries into clusters was confirmed by the dendrogram and the graph of agglomeration distances. The results can be useful for further research and for improving state innovation, information and educational policy based on the positive experience of neighbouring countries within the same cluster.
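The abstract does not explain how the Sturges method was applied. One plausible reading, shown here as a hedged sketch, is Sturges' rule for the number of groups, k = 1 + log2(n). Rounding down is an assumption, and `sturges_group_count` is a hypothetical helper name.

```python
import math

def sturges_group_count(n_objects: int) -> int:
    """Sturges' rule, k = 1 + log2(n), rounded down (rounding is an assumption)."""
    if n_objects < 1:
        raise ValueError("need at least one object")
    return int(1 + math.log2(n_objects))

# 23 countries: 1 + log2(23) is about 5.52, which rounds down to 5.
k = sturges_group_count(23)
```

Under this reading the rule gives five groups for 23 countries, which matches the five clusters reported.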

Manhattan Neighborhood Segmentation: Unveiling Property Sales Dynamics in 2008 for Class 1-, 2-, and 3-Family Homes

Research Proposal

Dec 2023

Turgud Valiyev

This study analyzes the complex landscape of Manhattan real estate sales in 2008 using an extensive dataset from the Department of Finance (DOF). Applying k-means clustering, the study concentrates on Class 1-, 2-, and 3-Family residences and offers new insights into the geographic distribution of home sales. The dataset includes all sales with a sale price of at least $150,000 that occurred between January 1 and December 31, 2008.

ANALYSIS OF RAINFALL IN INDONESIA USING A TIME SERIES- BASED CLUSTERING APPROACH

Article

Jun 2024

Khusnia Nurul Khikmah

Indonesia has a tropical climate with two seasons: dry and rainy. Prolonged dry spells can cause drought disasters, and heavy rain can cause floods and landslides. According to information from the Meteorology, Climatology, and Geophysics Agency (BMKG), natural disasters such as floods and landslides due to heavy rains have been a severe problem in Indonesia for the past five years. Different regional characteristics can affect the intensity of rain that falls in each province of Indonesia. Provinces can be grouped to determine which of them have similar characteristics with respect to rainfall-related natural disasters. This can inform the government and the public so that they are more aware of natural disasters. It is therefore necessary to group the provinces of Indonesia by rainfall using cluster analysis. The data used are secondary rainfall data taken from the official BMKG website. Cluster analysis of rainfall in 34 provinces in Indonesia used hierarchical and non-hierarchical methods in this study. The approach that is used in this research limits our clustering of the data. Further research with a machine learning approach is recommended. The agglomerative hierarchical methods are single, average, and complete linkage. The non-hierarchical methods are k-medoids and fuzzy C-means. The cluster analysis results show that the dynamic time warping (DTW) distance combined with average linkage gives the best clustering, with a silhouette coefficient of 0.813.
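As an illustration of the DTW distance named above, here is a minimal sketch of the classic dynamic-programming formulation with an absolute-difference step cost. The step cost, the absence of a warping window, and the sample rainfall profiles are assumptions; the abstract does not specify the variant used.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] is matched to one more element of b
                cost[i][j - 1],      # b[j-1] is matched to one more element of a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]

# Hypothetical monthly rainfall profiles (mm): same shape, shifted by a month.
wet_early = [300, 250, 100, 50, 50, 50]
wet_late = [300, 300, 250, 100, 50, 50]
shifted = dtw_distance(wet_early, wet_late)
pointwise = sum(abs(x - y) for x, y in zip(wet_early, wet_late))
```

The pointwise difference between the two profiles is 250, while the DTW distance is 0 because warping absorbs the one-month shift. That property makes DTW suitable for grouping provinces whose rainy seasons have a similar shape but different timing.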

ADCOMS sensitivity versus baseline diagnosis and progression phenotypes

Article

Feb 2024

BACKGROUND The Alzheimer's Disease COMposite Score (ADCOMS) is more sensitive in clinical trials than conventional measures when assessing pre‐dementia. This study compares ADCOMS trajectories using clustered progression characteristics to better understand different patterns of decline. METHODS Post‐baseline ADCOMS values were analyzed for sensitivity using mean‐to‐standard deviation ratio (MSDR), partitioned by baseline diagnosis, comparing with the original scales upon which ADCOMS is based. Because baseline diagnosis was not a particularly reliable predictor of progression, individuals were also grouped into similar ADCOMS progression trajectories using clustering methods and the MSDR compared for each progression group. RESULTS ADCOMS demonstrated increased sensitivity for clinically important progression groups. ADCOMS did not show statistically significant sensitivity or clinical relevance for the less‐severe baseline diagnoses and marginal progression groups. CONCLUSIONS This analysis complements and extends previous work validating the sensitivity of ADCOMS. The large data set permitted evaluation–in a novel approach–by the clustered progression group.
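The mean-to-standard-deviation ratio (MSDR) used as the sensitivity measure can be sketched as follows. Applying it to change-from-baseline scores and using the sample standard deviation are assumptions, and the two groups of values are hypothetical rather than taken from the study.

```python
from statistics import mean, stdev

def msdr(changes):
    """Mean-to-standard-deviation ratio of a list of score changes."""
    return mean(changes) / stdev(changes)

# Hypothetical change-from-baseline scores for two progression groups.
fast_progressors = [0.30, 0.35, 0.25, 0.40, 0.20]
marginal_progressors = [0.05, -0.10, 0.10, -0.05, 0.00]

fast_msdr = msdr(fast_progressors)
marginal_msdr = msdr(marginal_progressors)
```

A group with a consistent decline has a large mean change relative to its spread and therefore a high MSDR. A group whose changes scatter around zero has an MSDR near zero, in line with the reported lack of sensitivity for marginal progression groups.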

Location-allocation problems

Article

Jan 1963

L. Cooper

Hierarchical grouping to optimize an objective function.

Article

Jan 1963

J.H. Ward

Some distance properties of latent root and vector methods used in multivariate data analysis

Article

Jan 1966
BIOMETRIKA

John Gower

Cluster analysis and mathematical programming

Article

Sep 1971

M.R. Rao

Cluster analysis involves the problem of optimal partitioning of a given set of entities into a pre-assigned number of mutually exclusive and exhaustive clusters. Here the problem is formulated in two different ways, with the objective of (a) minimizing the within-groups sum of squares and (b) minimizing the maximum distance within groups. These lead to different kinds of linear and non-linear (0–1) integer programming problems. Computational difficulties are discussed and efficient algorithms are provided for some special cases.
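The two objectives in this abstract can be made concrete by evaluating them for a fixed partition. The sketch below only scores a given assignment and does not solve the integer programs. Euclidean distance, the function names, and the sample points are assumptions for illustration.

```python
import math

def within_groups_sum_of_squares(points, labels):
    """Objective (a): squared distances of points to their cluster centroid."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members
        )
    return total

def max_within_group_distance(points, labels):
    """Objective (b): largest distance between two points in the same cluster."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] == labels[j]:
                worst = max(worst, math.dist(points[i], points[j]))
    return worst

# Hypothetical data: two tight pairs, scored under a good and a bad partition.
points = [(0, 0), (0, 2), (10, 0), (10, 4)]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]
```

For these points the good partition scores 10.0 on objective (a) and 4.0 on objective (b), while the bad partition scores 102.0 and about 10.2. An exact method would search over all assignments for the minimum of the chosen objective.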

Cluster analysis of cases

Article