Article

# A dendrite method for cluster analysis


... The number of classes was considered in a range from 5 to 30. The comparison of the different segmentation methods was performed using the relative validation indexes Mean Index Adequacy (MIA) [2], Calinski index (CH) [20], and Davies-Bouldin index (DBI) [21]. MIA and DBI are minimization indexes, whereas CH is a maximization index. ...
... The relationship between within-class and total (SB/ST) variability [20] has also been represented for all the segmentation algorithms used (Fig. 6b). ...
... As can be seen, once again the best values, except DW and K starting from 25 classes, were obtained by HK (p = 1), although the average results obtained by the algorithms used are very high. The SW/SB variability relationship [20] (not shown) provides information on the relationship between class compaction (SW) and class separation (SB), a value that should be minimized. The best results, except DW and K starting with 25 classes, were obtained by the HK (p = 1) algorithm. ...
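The excerpt does not define SW and SB. As a minimal sketch, assuming they are the usual within-class and between-class sums of squared Euclidean distances (so that the total scatter ST equals SW + SB), the quantities can be computed as follows. The function names and the toy data are illustrative only.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scatter_decomposition(points, labels):
    """Return (SW, SB, ST) for the hard partition given by `labels`.

    Assumed definitions (not taken from the excerpt):
      SW = sum of squared distances from each point to its class centroid
      SB = sum over classes of class size * squared distance between the
           class centroid and the overall centroid
      ST = sum of squared distances from each point to the overall centroid
    """
    overall = centroid(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    sw = 0.0
    sb = 0.0
    for members in clusters.values():
        c = centroid(members)
        sw += sum(sq_dist(p, c) for p in members)
        sb += len(members) * sq_dist(c, overall)
    st = sum(sq_dist(p, overall) for p in points)
    return sw, sb, st


# Toy data: two tight, well-separated classes.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
sw, sb, st = scatter_decomposition(points, labels)
ratio = sw / sb  # small when classes are compact and well separated
```

Under these assumed definitions, a smaller SW/SB ratio corresponds to more compact and better separated classes, which matches the excerpt's statement that the value should be minimized.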
Article
Customer classification aims at providing electric utilities with a volume of information to enable them to establish different types of tariffs. Several methods have been used to segment electricity customers, including, among others, the hierarchical clustering, Modified Follow the Leader and K-Means methods. These, however, entail problems with the pre-allocation of the number of clusters (Follow the Leader), randomness of the solution (K-Means) and improvement of the solution obtained (hierarchical algorithm). Another segmentation method used is Hopfield's autonomous recurrent neural network, although the solution obtained only guarantees that it is a local minimum. In this paper, we present the Hopfield–K-Means algorithm in order to overcome these limitations. This approach eliminates the randomness of the initial solution provided by K-Means based algorithms and it moves closer to the global optimum. The proposed algorithm is also compared against other customer segmentation and characterization techniques, on the basis of relative validation indexes. Finally, the results obtained by this algorithm with a set of 230 electricity customers (residential, industrial and administrative) are presented.
... Hierarchical methods [19] [20] and partition methods [21] [22]. The number of partitions must be defined beforehand by criteria such as the silhouette [23] or the Calinski-Harabasz index [24]. Halfway between supervised and unsupervised approaches, semi-supervised learning has emerged. ...
... Examples include hard and soft permanent, intermittent, and transient. More recently, based on deep learning methods, Bazzi et al. [24] proposed a model to diagnose intermittent faults in wireless sensor networks. The performance of the diagnosis method was measured by false-positive rate, detection accuracy, and false alarm rate. ...
Preprint
Full-text available
The supervised and semi-supervised learning framework does not always correspond to the situations encountered in vehicular networks. The labeling work is therefore often laborious and expensive. This is why the development of solutions to deal with imperfect labels was of particular interest to us in this article. We introduce in this paper the formulation of the classification problem when the information available on the labels of the examples used for learning is imperfect. We also present an efficient way to solve classification problems, even when some of the labels provided to learn the classification function are wrong. Furthermore, we present our work on the extension of statistical learning methods in the environment of vehicular networks. To build an expectation-maximization (EM) algorithm capable of optimizing the marginal log-likelihood of the observed data, the path we will take follows the classic approach encountered in the probabilistic framework. Different experiments were carried out to analyze the behavior of our algorithm when “soft” labels in vehicles are used. These experiments have allowed us in particular to highlight the contribution of “soft” labels in this context to represent information on the reliability of the labels and thus significantly improve the performance of vehicular networks.
... Moreover, to filter out task-irrelevant information, clustering-based methods should assign observations corresponding to similar states to the same group and encode them as neighboring points in the latent space. Therefore, we can evaluate the robustness against distractions by the Calinski-Harabasz index (CH index) (Caliński and Harabasz 1974) with respect to the low-dimensional physical states. The CH index is the ratio of between-clusters dispersion and within-cluster dispersion. ...
... The Calinski-Harabasz index (Caliński and Harabasz 1974) is the ratio of between-clusters dispersion and within-cluster dispersion for all clusters, where dispersion is defined as the sum of distances squared. The CH index is higher when clusters are dense and well separated, which relates to a standard concept of a cluster. ...
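The ratio described in these excerpts can be sketched in a few lines. The version below assumes the common normalisation of the index, in which the between-cluster dispersion is divided by k - 1 and the within-cluster dispersion by n - k. The function name `calinski_harabasz` and the sample data are illustrative, not taken from any of the cited works.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index of a hard partition (higher is better).

    Dispersion is the sum of squared Euclidean distances: within-cluster
    dispersion is measured to each cluster centroid, between-cluster
    dispersion from each cluster centroid to the overall centroid,
    weighted by cluster size.
    """
    def mean(pts):
        return [sum(col) / len(pts) for col in zip(*pts)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    n, k = len(points), len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k <= n - 1 clusters")

    overall = mean(points)
    within = 0.0
    between = 0.0
    for members in clusters.values():
        c = mean(members)
        within += sum(sq_dist(p, c) for p in members)
        between += len(members) * sq_dist(c, overall)
    # Note: `within` is zero only if every point coincides with its centroid.
    return (between / (k - 1)) / (within / (n - k))


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])  # dense, separated
bad = calinski_harabasz(points, [0, 1, 0, 1, 0, 1])   # mixed-up clusters
```

On this toy data the partition that follows the two visible groups scores far higher than the scrambled one, which is the behaviour the excerpt describes.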
Preprint
Recent work has shown that representation learning plays a critical role in sample-efficient reinforcement learning (RL) from pixels. Unfortunately, in real-world scenarios, representation learning is usually fragile to task-irrelevant distractions such as variations in background or viewpoint. To tackle this problem, we propose a novel clustering-based approach, namely Clustering with Bisimulation Metrics (CBM), which learns robust representations by grouping visual observations in the latent space. Specifically, CBM alternates between two steps: (1) grouping observations by measuring their bisimulation distances to the learned prototypes; (2) learning a set of prototypes according to the current cluster assignments. Computing cluster assignments with bisimulation metrics enables CBM to capture task-relevant information, as bisimulation metrics quantify the behavioral similarity between observations. Moreover, CBM encourages the consistency of representations within each group, which facilitates filtering out task-irrelevant information and thus induces robust representations against distractions. An appealing feature is that CBM can achieve sample-efficient representation learning even if multiple distractions exist simultaneously. Experiments demonstrate that CBM significantly improves the sample efficiency of popular visual RL algorithms and achieves state-of-the-art performance on both multiple and single distraction settings. The code is available at https://github.com/MIRALab-USTC/RL-CBM.
... To resolve the most parsimonious solution, the Caliński-Harabasz pseudo F-statistic (Caliński and Harabasz 1974; Will 2016) was employed (Figs. 7, 8), available in the clusterSim package (Walesiak and Dudek 2020). This test explores the relationship of between-clusters sums of squares compared with within-cluster sums of squares across several cluster solutions (k). ...
... Making a well-grounded decision on the number of clusters resulting from clustering algorithms is critical. Even though the Caliński and Harabasz pseudo F-statistic (Caliński and Harabasz 1974) traditionally shows good performance, it bears limitations, as it cannot select the one-cluster solution (Will 2016). This obstacle was solved in this approach by monothetic clustering (see Tran 2019). ...
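The one-cluster limitation mentioned in this excerpt follows from the usual form of the statistic. As a sketch, write $B_k$ and $W_k$ for the between-cluster and within-cluster sums of squares of a partition of $n$ objects into $k$ clusters (this notation is assumed here and is not taken from the excerpt):

```latex
\mathrm{CH}(k) = \frac{B_k / (k - 1)}{W_k / (n - k)}, \qquad 2 \le k \le n - 1 .
```

For $k = 1$ the single cluster centroid coincides with the overall centroid, so $B_1 = 0$ while the divisor $k - 1$ is also $0$. The numerator is then the undefined form $0/0$, which is why the one-cluster solution cannot be scored or selected with this criterion.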
Article
Full-text available
Fossils from the deep-sea Ediacaran biotas of Newfoundland are among the oldest architecturally complex soft-bodied macroorganisms on Earth. Most organisms in the Mistaken Point–type biotas of Avalonia—particularly the fractal-branching frondose Rangeomorpha— have been traditionally interpreted as living erect within the water column during life. However, due to the scarcity of documented physical sedimentological proxies associated with fossiliferous beds, Ediacaran paleocurrents have been inferred in some instances from the preferential orientation of fronds. This calls into question the relationship between frond orientation and paleocurrents. In this study, we present an integrated approach from a newly described fossiliferous surface (the “Melrose Surface” in the Fermeuse Formation at Melrose, on the southern portion of the Catalina Dome in the Discovery UNESCO Global Geopark) combining: (1) physical sedimentological evidence for paleocurrent direction in the form of climbing ripple cross-lamination and (2) a series of statistical analyses based on modified polythetic and monothetic clustering techniques reflecting the circular nature of the recorded orientation of Fractofusus misrai specimens. This study demonstrates the reclining rheotropic mode of life of the Ediacaran rangeomorph taxon Fractofusus misrai and presents preliminary inferences suggesting a similar mode of life for Bradgatia sp. and Pectinifrons abyssalis based on qualitative evidence. These results advocate for the consideration of an alternative conceptual hypothesis for position of life of Ediacaran organisms in which they are interpreted as having lived reclined on the seafloor, in the position that they are preserved.
... Calinski-Harabasz index [21] measures the similarity of a software repository to other repositories in its cluster (cohesion) as compared to other clusters (separation). Cohesion is estimated based on the distances from the repositories in a cluster to its cluster centroid and separation is based on the distance of the cluster centroids from the global centroid. ...
... We have aggregated the matched cluster prototypes from different repository sets by taking the mean of the matched prototypes for each cluster. The result is presented in Figure 1 (where the prototypes are normalized relative to each other for better visualization). The metrics on the radar plots are numbered in the following order: issues, then commits metrics: full history (1-7 on the radar plots), past month (8-14), past two weeks (15-21), the latest date (22-28). Compared to the results generated on random data, the discrepancy for c 1 shows relatively consistent results in terms of cosine distance between the cluster prototypes. ...
Article
Full-text available
Software repositories contain a wealth of information about the aspects related to software development process. For this reason, many studies analyze software repositories using methods of data analytics with a focus on clustering. Software repository clustering has been applied in studying software ecosystems such as GitHub, defect and technical debt prediction, software remodularization. Although some interesting insights have been reported, the considered studies exhibited some limitations. The limitations are associated with the use of individual clustering methods and manifesting in the shortcomings of the obtained results. In this study, to alleviate the existing limitations we engage multiple cluster validity indices applied to multiple clustering methods and carry out consensus clustering. To our knowledge, this study is the first to apply the consensus clustering approach to analyze software repositories and one of the few to apply the consensus clustering to software metrics. Intensive experimental studies are reported for software repository metrics data consisting of a number of software repositories each described by software metrics. We revealed seven clusters of software repositories and relate them to developers’ activity. It is advocated that the proposed clustering environment could be useful for facilitating the decision making process for business investors and open-source community with the help of the Gartner’s hype cycle.
... The higher purity or NMI indicates better cluster quality. Besides, we used Calinski-Harabasz (CH) indexes [28] and Silhouette scores (SS) [29] as the internal metrics to evaluate the clustering performance without using any actual labels. A higher CH or SS indicates that the clustering has high intra-class compactness and interclass separability. ...
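For the silhouette score mentioned alongside CH, a minimal label-free sketch is given below. It assumes Euclidean distance and the common convention that a point in a singleton cluster contributes 0. The function name and the one-dimensional sample are illustrative only.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def silhouette_score(points, labels):
    """Mean silhouette width over all points; values near 1 are better."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # assumed convention: singletons contribute 0
        # a: mean distance to the other members of the point's own cluster
        # (the point itself adds 0 to the sum, hence the len(own) - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
score = silhouette_score(points, [0, 0, 1, 1])
```

Each point's width compares how close it is to its own cluster (a) with how close it is to the nearest other cluster (b), so compact and well separated clusters push the mean toward 1.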
Preprint
Full-text available
Recent studies have shown that pseudo labels can contribute to unsupervised domain adaptation (UDA) for speaker verification. Inspired by the self-training strategies that use an existing classifier to label the unlabeled data for retraining, we propose a cluster-guided UDA framework that labels the target domain data by clustering and combines the labeled source domain data and pseudo-labeled target domain data to train a speaker embedding network. To improve the cluster quality, we train a speaker embedding network dedicated for clustering by minimizing the contrastive center loss. The goal is to reduce the distance between an embedding and its assigned cluster center while enlarging the distance between the embedding and the other cluster centers. Using VoxCeleb2 as the source domain and CN-Celeb1 as the target domain, we demonstrate that the proposed method can achieve an equal error rate (EER) of 8.10% on the CN-Celeb1 evaluation set without using any labels from the target domain. This result outperforms the supervised baseline by 39.6% and is the state-of-the-art UDA performance on this corpus.
... Various statistical methods and indices for testing variable selection and MZs quality have been developed (Boydell & McBratney, 2002;Gavioli et al., 2016;Peralta et al., 2015;Zhou et al., 2014). The current study applied the Calinski-Harabasz Index (CHI) to determine the optimal number of clusters in each plot (Caliński & Harabasz, 1974). The CHI is a dissimilarity index based on the degree of dispersion between clusters compared to the within-cluster similarity (Wang & Xu, 2019) and was used to select the number of clusters. ...
Article
Full-text available
Estimating crop nitrogen status to optimize production and minimize environmental pollution is a major challenge for modern agriculture. The study objective was to develop a multivariate spatiotemporal dynamic clustering approach to generate Nitrogen (N) Management Zones (MZs) in a citrus orchard during the growing season. The research was conducted in four citrus plots in the coastal area of Israel. Five variables were selected to characterize each plot’s spatiotemporal variability of canopy N content. These were split into constant (i.e., elevation, northness, and slope) and non-constant (i.e., canopy N content and tree height) variables. The non-constant data were obtained via bi-monthly imaging campaigns with a multispectral camera mounted on an unmanned aerial vehicle (UAV) throughout the growing season of 2019. The selected variables were then standardized to define the clusters by applying the Getis-Ord Gi* z-score. These were used to develop a spatiotemporal dynamic clustering model using Fuzzy C-means (FCM). Four input variables were investigated in this final stage, including the constant variables only and different combinations of constant and non-constant variables. The support vector machine regression model results for estimating canopy N-content from multispectral images were R² = 0.771 and RMSE = 0.227. This model was used to predict monthly canopy-level N content and classify the N content levels based on the October N-to-yield content envelope curve. Delineating MZs was followed by the comparison of spatial association among cluster maps. This process may support site-specific and time-specific nitrogen management.
... On the one hand, this result broadens the application of the α-connections in the context of cluster analysis; on the other, it raises the problem of how to select a good value of α in real applications where the ground truth is unknown. A possible way of handling this problem is to select the value of α by using internal validity indices such as the pseudo F index [19] or the GAP statistic [20]. The preliminary results have also shown the validity of the presented numerical approximation of the geodesics based on the quadratic Taylor expansion. ...
Article
Full-text available
According to Information Geometry, we represent landmarks of a complex shape, as probability densities in a statistical manifold where geometric structures from α-connections are considered. In particular the 0-connection is the Riemannian connection with respect to the Fisher metric. In the setting of shapes clustering, we compare the discriminative power of different shapes distances induced by geodesic distances derived from α-connections. The methodology is analyzed in an application to a data set of aeroplane shapes.
... Even in the domain of unsupervised learning, some metrics do exist. For clustering algorithms, there are metrics such as Silhouette Coefficient [1], Calinski-Harabasz Index [2], and Davies-Bouldin index [3]. However, for dimensionality reduction, to our best knowledge the only proposed metric is known as Neighborhood Preserving Ratio (NPR) [4,5]. ...
Preprint
Unsupervised machine learning lacks ground truth by definition. This poses a major difficulty when designing metrics to evaluate the performance of such algorithms. In sharp contrast with supervised learning, for which plenty of quality metrics have been studied in the literature, in the field of dimensionality reduction only a few over-simplistic metrics have been proposed. In this work, we aim to introduce the first highly non-trivial dimensionality reduction performance metric. This metric is based on the sectional curvature behaviour arising from Riemannian geometry. To test its feasibility, this metric has been used to evaluate the performance of the most commonly used dimension reduction algorithms in the state of the art. Furthermore, to make the evaluation of the algorithms robust and representative, using curvature properties of planar curves, a new parameterized problem instance generator has been constructed in the form of a function generator. Experimental results are consistent with what could be expected based on the design and characteristics of the evaluated algorithms and the features of the data instances used to feed the method.
... The Calinski-Harabasz Index [53] (Variance Ratio Criterion) is a ratio of the sum of the inter-clusters (between-group) dispersion and the intra-cluster dispersion (within-group) for all clusters. ...
Article
Full-text available
Narrow band imaging is an established non-invasive tool used for the early detection of laryngeal cancer in surveillance examinations. Most images produced from the examination are useless, such as blurred, specular reflection, and underexposed. Removing the uninformative frames is vital to improve detection accuracy and speed up computer-aided diagnosis. It often takes a lot of time for the physician to manually inspect the informative frames. This issue is commonly addressed by a classifier with task-specific categories of the uninformative frames. However, the definition of the uninformative categories is ambiguous, and tedious labeling still cannot be avoided. Here, we show that a novel unsupervised scheme is comparable to the current benchmarks on the dataset of NBI-InfFrames. We extract feature embedding using a vanilla neural network (VGG16) and introduce a new dimensionality reduction method called UMAP that distinguishes the feature embedding in the lower-dimensional space. Along with the proposed automatic cluster labeling algorithm and cost function in Bayesian optimization, the proposed method coupled with UMAP achieves state-of-the-art performance. It outperforms the baseline by 12% absolute. The overall median recall of the proposed method is currently the highest, 96%. Our results demonstrate the effectiveness of the proposed scheme and the robustness of detecting the informative frames. It also suggests the patterns embedded in the data help develop flexible algorithms that do not require manual labeling.
... Consequently, we fix the number of clusters k = 4 of our method. This decision was validated using the Caliński-Harabasz (CH) Index (Caliński and Harabasz 1974), specifically tailored for situations in which ground truth labels are unknown as in our case study. The CH Index measures the cohesion and separation of clusters. ...
Article
Full-text available
E-scooter services have multiplied worldwide as a form of urban transport. Their use has grown so quickly that policymakers and researchers still need to understand their interrelation with other transport modes. At present, e-scooter services are primarily seen as a first-and-last-mile solution for public transport. However, we demonstrate that 50% of e-scooter trips are either substituting it or covering areas with little public transportation infrastructure. To this end, we have developed a novel data-driven methodology that autonomously classifies e-scooter trips according to their relation to public transit. Instead of predefined design criteria, the blind nature of our approach extracts the city’s intrinsic parameters from real data. We applied this methodology to Rome (Italy), and our findings reveal that e-scooters provide specific mobility solutions in areas with particular needs. Thus, we believe that the proposed methodology will contribute to the understanding of e-scooter services as part of shared urban mobility.
... Additionally, it is robust to outliers and groups are not too dissimilar in size (Balaguer-Coll et al., 2013). Finally, the Caliński and Harabasz (1974) stopping rule was used in order to determine the number of clusters. ...
Article
Full-text available
This study aims to assess whether Ecuadorian health reforms carried out since 2008 have affected the efficiency performance of public hospitals in the country. We contribute to the literature by shedding new light on the effects on public healthcare efficiency for developing countries when policies move toward health equity and universal coverage. We follow a two-stage approach, wherein the first stage we make use of factor and cluster analysis to obtain three clusters of public hospitals based on their technological endowment; we exploit Data Envelopment Analysis for panel data in the second stage to estimate robust efficiency measures over time. Our innovative empirical strategy considers the heterogeneity of healthcare institutions in the analysis of their efficiency performance. The results show a significant decrease in the average efficiency of low and intermediate technology hospitals after the new constitution was adopted in 2008. The decline in efficiency coincides with the two reforms of 2010 and 2011 that brought on higher social security coverage.
... To address this, a feasible solution is to first enumerate the possible values of the number of categories $C_t$ in the target domain and divide the target domain into the corresponding clusters by applying a clustering algorithm like K-means [31]. Then the clustering evaluation criteria [3,11,41,52] can be employed to determine the appropriate number of target domain categories $C_t$. ...
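The enumerate-then-evaluate procedure described in this excerpt can be sketched as follows. This is not the cited authors' implementation. It is a small self-contained illustration that assumes the Calinski-Harabasz index as the evaluation criterion and a plain Lloyd-style K-means with deterministic farthest-point seeding, a choice made here only to keep the example reproducible. All function names and data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, k, iters=50):
    """Lloyd's algorithm; seeds are chosen by farthest-point traversal."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster empties
                centers[j] = mean(members)
    return labels


def ch_index(points, labels):
    """Calinski-Harabasz index; assumes 2 <= k < n and nonzero within scatter."""
    n, k = len(points), len(set(labels))
    overall = mean(points)
    within = between = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        c = mean(members)
        within += sum(sq_dist(p, c) for p in members)
        between += len(members) * sq_dist(c, overall)
    return (between / (k - 1)) / (within / (n - k))


def select_k(points, candidates):
    """Cluster once per candidate k and keep the k with the highest index."""
    scores = {k: ch_index(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 10), (5, 11), (6, 10), (6, 11),
]
best, scores = select_k(points, range(2, 6))
```

Any other internal criterion could replace `ch_index` in `select_k`. For criteria that are minimized rather than maximized, `max` would become `min`.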
Preprint
Full-text available
Deep neural networks (DNNs) often perform poorly in the presence of domain shift and category shift. How to upcycle DNNs and adapt them to the target task remains an important open problem. Unsupervised Domain Adaptation (UDA), especially recently proposed Source-free Domain Adaptation (SFDA), has become a promising technology to address this issue. Nevertheless, existing SFDA methods require that the source domain and target domain share the same label space, consequently being only applicable to the vanilla closed-set setting. In this paper, we take one step further and explore the Source-free Universal Domain Adaptation (SF-UniDA). The goal is to identify "known" data samples under both domain and category shift, and reject those "unknown" data samples (not present in source classes), with only the knowledge from standard pre-trained source model. To this end, we introduce an innovative global and local clustering learning technique (GLC). Specifically, we design a novel, adaptive one-vs-all global clustering algorithm to achieve the distinction across different target classes and introduce a local k-NN clustering strategy to alleviate negative transfer. We examine the superiority of our GLC on multiple benchmarks with different category shift scenarios, including partial-set, open-set, and open-partial-set DA. Remarkably, in the most challenging open-partial-set DA scenario, GLC outperforms UMAD by 14.8% on the VisDA benchmark. The code is available at https://github.com/ispc-lab/GLC.
... Different combinations of attributes (morphometric variables) and numbers of classes were trialed iteratively to determine an optimal combination. A ratio representing within-group similarity and between-group differences, the Calinski-Harabasz pseudo F-statistic (Calinski and Harabasz, 1974), indicated the optimal number of groups for this dataset to be four. K-means clustering defines classes in a way that maximizes variability between classes (classification criterion #4, Section 3.3, paragraph 1) and minimizes variability within classes (classification criterion #5). ...
Thesis
Full-text available
The fore reef spur and groove (SaG) zone is an important, but poorly understood, zone of coral reefs worldwide. Spurs (parallel ridges of carbonate material), are separated by grooves (regularly spaced channels) to form a distinctive finger-like pattern around the margins of coral reefs. Few studies have collected quantitative data regarding the morphometrics, hydrodynamics or reef growth of SaG systems and thus many questions remain about their formation and evolution. Additionally, the majority of SaG studies have focused only on one location at a single reef. Thus, findings are localised and it is difficult to differentiate common underlying trends in SaG eco-morphodynamics from site-specific factors. This thesis aims to describe the eco-morphodynamic evolution of spurs and grooves by establishing, and measuring, the interactions and feedbacks between their geomorphology, hydrodynamics and reef growth. A multifaceted approach is used to determine the eco-morphodynamic evolution of SaGs by quantifying their geomorphology (form), hydrodynamics (function) and reef growth (evolution) across multiple spatial and temporal scales at multiple reefs in the Indo-Pacific region. This constitutes the most extensive and comprehensive study of SaG features to date. This study presents in-situ hydrodynamic data from two SaG zones at One Tree Reef, in the southern GBR, Australia and three SaG zones at Moorea in French Polynesia. It also presents a unique suite of 38 cores from three SaG zones at One Tree Reef, three SaG zones at Heron Reef (also in the southern GBR) and two SaG zones in Moorea. A remote sensing analysis and classification of SaG morphometrics is also presented, which shows that remotely sensed data can provide insight into SaG evolution and potentially upscale the findings of in-situ hydrodynamic and reef growth studies. 
Studying SaG features at multiple sites on multiple reefs, which differ in size, reef type, wave energy regime, tidal range and broad scale morphology, provides globally applicable insights into SaG ecomorphodynamics across a broad spectrum of environmental conditions.
... To evaluate the performance on identifying the number of clusters in a dataset, we compared the result by the algorithms with the true number. We used two internal indices of Calinski-Harabasz index (CHI) [39] and Davies-Bouldin index (DBI) [40], and three external indices of adjusted mutual information (AMI) [41], adjusted rand index (ARI) [41], and Fowlkes-Mallows index (FMI) [42] to evaluate the clustering accuracy of our two ensemble methods and the five comparison algorithms. The specific definitions of these indices can be found in the reference papers. ...
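The excerpt defers the definitions to the cited papers. As an illustration of the second internal index it names, a minimal sketch of the Davies-Bouldin index in its usual Euclidean form is shown below. The function name and data are illustrative, and the cited study's exact variant is not known from the excerpt.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a hard partition (lower is better).

    For each cluster, take the worst-case ratio of summed spreads to
    centroid separation against every other cluster, then average.
    Assumes no two cluster centroids coincide.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    centroids = {}
    spreads = {}
    for lab, members in clusters.items():
        c = tuple(sum(col) / len(members) for col in zip(*members))
        centroids[lab] = c
        # Spread: mean distance from the members to their centroid.
        spreads[lab] = sum(dist(p, c) for p in members) / len(members)

    total = 0.0
    for i in clusters:
        total += max(
            (spreads[i] + spreads[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
dbi = davies_bouldin(points, [0, 0, 1, 1])
```

Unlike CH, this index is minimized: tight clusters with distant centroids give values close to 0.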
Article
Clustering a big distributed dataset of hundred gigabytes or more is a challenging task in distributed computing. A popular method to tackle this problem is to use a random sample of the big dataset to compute an approximate result as an estimation of the true result computed from the entire dataset. In this paper, instead of using a single random sample, we use multiple random samples to compute an ensemble result as the estimation of the true result of the big dataset. We propose a distributed computing framework to compute the ensemble result. In this framework, a big dataset is represented in the RSP data model as random sample data blocks managed in a distributed file system. To compute the ensemble clustering result, a set of RSP data blocks is randomly selected as random samples and clustered independently in parallel on the nodes of a cluster to generate the component clustering results. The component results are transferred to the master node, which computes the ensemble result. Since the random samples are disjoint and traditional consensus functions cannot be used, we propose two new methods to integrate the component clustering results into the final ensemble result. The first method uses component cluster centers to build a graph and the METIS algorithm to cut the graph into subgraphs, from which a set of candidate cluster centers is found. A hierarchical clustering method is then used to generate the final set of $k$ cluster centers. The second method uses the clustering-by-passing-messages method to generate the final set of $k$ cluster centers. Finally, the $k$-means algorithm was used to allocate the entire dataset into $k$ clusters. Experiments were conducted on both synthetic and real-world datasets. The results show that the new ensemble clustering methods performed better than the comparison methods and that the distributed computing framework is efficient and scalable in clustering big datasets.
... The unsupervised ML approach was applied for clustering as follows. First, the optimal number of classes was determined by using non-parametric criteria (44-46). Then, the k-means algorithm was applied to partition individual trajectories (babies) into those identified classes (47). ...
Article
Full-text available
Background: Non-nutritive suck (NNS) is used to promote ororhythmic patterning and assess oral feeding readiness in preterm infants in the neonatal intensive care unit (NICU). While time domain measures of NNS are available in real time at cribside, our understanding of suck pattern generation in the frequency domain is limited. The aim of this study is to model the development of NNS in the frequency domain using Fourier and machine learning (ML) techniques in extremely preterm infants (EPIs). Methods: A total of 117 EPIs were randomized to a pulsed or sham orocutaneous intervention during tube feedings 3 times/day for 4 weeks, beginning at 30 weeks post-menstrual age (PMA). Infants were assessed 3 times/week for NNS dynamics until they attained 100% oral feeding or NICU discharge. Digitized NNS signals were processed in the frequency domain using two transforms: the Welch power spectral density (PSD) method and the Yule-Walker PSD method. Data analysis proceeded in two stages. Stage 1: ML longitudinal cluster analysis was conducted to identify groups (classes) of infants, each showing a unique pattern of change in Welch and Yule-Walker calculations during the interventions. Stage 2: linear mixed modeling (LMM) was performed for the Welch and Yule-Walker dependent variables to examine the effects of gestational age (GA), PMA, sex (male, female), patient type [respiratory distress syndrome (RDS), bronchopulmonary dysplasia (BPD)], treatment (NTrainer, Sham), intervention phase [1, 2, 3], cluster class, and phase-by-class interaction. Results: ML analysis of the Welch and Yule-Walker PSD measures revealed three membership classes of NNS growth patterns. The dependent measures peak_Hz, PSD amplitude, and area under the curve (AUC) are highly dependent on PMA, but show little relation to respiratory status (RDS, BPD) or somatosensory intervention. Thus, neural regulation of NNS in the frequency domain is significantly different for each identified cluster (classes A, B, C) during this developmental period. Conclusions: Efforts to increase our knowledge of the evolution of the suck central pattern generator (sCPG) in preterm infants, including NNS rhythmogenesis, will help us better understand the observed phenotypes of NNS production in both the frequency and time domains. Knowledge of those features of the ...
Pediatric Medicine, 2023
... presence of noise, density differences, arbitrary cluster shapes (Sheng et al. 2005)). To the best of our knowledge, the closest clustering validity index (CVI) to this scenario is the Quality Index used in (Artoni et al. 2014, 2018), which is further inspired by the Calinski-Harabasz criterion (Caliński and Harabasz 1974), defined as the difference between the average within-cluster similarities and the average between-cluster similarities: ...
Article
Full-text available
Clustering of independent component (IC) topographies of Electroencephalograms (EEG) is an effective way to find brain-generated IC processes associated with a population of interest, particularly for those cases where event-related potential features are not available. This paper proposes a novel algorithm for the clustering of these IC topographies and compares its results with the most currently used clustering algorithms. In this study, 32-electrode EEG signals were recorded at a sampling rate of 500 Hz for 48 participants. EEG signals were pre-processed and IC topographies computed using the AMICA algorithm. The algorithm implements a hybrid approach where genetic algorithms are used to compute more accurate versions of the centroids and the final clusters after a pre-clustering phase based on spectral clustering. The algorithm automatically selects the optimum number of clusters by using a fitness function that involves local-density along with compactness and separation criteria. Specific internal validation metrics adapted to the use of the absolute correlation coefficient as the similarity measure are defined for the benchmarking process. Assessed results across different ICA decompositions and groups of subjects show that the proposed clustering algorithm significantly outperforms the (baseline) clustering algorithms provided by the software EEGLAB, including CORRMAP.
... We share the same sentiment with Reimers and Gurevych [36] that utterance representation with good expressiveness can help semantically similar sentences cluster. Since ground truth labels of clusters are unknown, internal evaluation metrics including the Calinski-Harabasz Index [5] and the Davies-Bouldin Index [8] are used to assess the quality of clusters. A higher Calinski-Harabasz Index score and a lower Davies-Bouldin Index score indicate better cluster definition and separation. ...
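The "higher Calinski-Harabasz, lower Davies-Bouldin" reading in the excerpt above can be checked on a toy example. The following is a minimal pure-Python sketch, assuming Euclidean distance and the usual textbook forms of both indices; the helper names (`centroid`, `group_by_label`) and the six sample points are illustrative and do not come from the cited paper.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def group_by_label(points, labels):
    """Map each cluster label to the list of points carrying it."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(m) * dist(centroid(m), overall) ** 2 for m in clusters.values())
    within = sum(dist(p, centroid(m)) ** 2 for m in clusters.values() for p in m)
    return (between / (k - 1)) / (within / (n - k))


def davies_bouldin(points, labels):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / centroid-distance ratio."""
    members = list(group_by_label(points, labels).values())
    cents = [centroid(m) for m in members]
    scatter = [sum(dist(p, c) for p in m) / len(m) for m, c in zip(members, cents)]
    k = len(members)
    worst = [
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k


# Two tight, well-separated groups (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
compact = [0, 0, 0, 1, 1, 1]   # labels that follow the two groups
shuffled = [0, 1, 0, 1, 0, 1]  # labels that mix the two groups

for name, labels in (("compact", compact), ("shuffled", shuffled)):
    print(name, calinski_harabasz(points, labels), davies_bouldin(points, labels))
```

On this data the labelling that follows the two groups scores higher on the Calinski-Harabasz index and lower on the Davies-Bouldin index than the shuffled one, which matches the interpretation given in the excerpt.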
Preprint
Full-text available
Dialogue structure discovery is essential in dialogue generation. Well-structured topic flow can leverage background information and predict future topics to help generate controllable and explainable responses. However, most previous work focused on dialogue structure learning in task-oriented dialogue rather than open-domain dialogue, which is more complicated and challenging. In this paper, we present a new framework, CTRLStruct, for dialogue structure learning to effectively explore topic-level dialogue clusters as well as their transitions with unlabelled information. Precisely, dialogue utterances encoded by a bi-directional Transformer are further trained through a specially designed contrastive learning task to improve representation. Then we perform clustering on utterance-level representations and form topic-level clusters that can be considered as vertices in the dialogue structure graph. The edges in the graph, indicating transition probability between vertices, are calculated by mimicking expert behavior in datasets. Finally, the dialogue structure graph is integrated into the dialogue model to perform controlled response generation. Experiments on two popular open-domain dialogue datasets show our model can generate more coherent responses compared to some excellent dialogue models, as well as outperform some typical sentence embedding methods in dialogue utterance representation. Code is available on GitHub.
... We indicate with X(K) the partition corresponding to a given K value and we will identify the optimal K value by looking at the maximum or minimum of the specified IVI. More precisely, we consider the Silhouette index SI(K) [15], the Calinski-Harabasz index CH(K) [16], the Davies-Bouldin index DB(K) [17] and the Dunn index Dunn(K) [18]. We choose SLINK, among the many other clustering algorithms, since it allows us to explore a wide range of values of K ∈ [1, N], in a relatively short computational time. ...
Article
Clustering represents a fundamental procedure to provide users with meaningful insights from an original data set. The quality of the resulting clusters is largely dependent on the correct estimation of their number, K∗, which must be provided as an input parameter in many clustering algorithms. Only very few techniques provide an automatic detection of K∗ and are usually based on cluster validity indexes which are expensive with regard to computation time. Here, we present a new algorithm which allows one to obtain an accurate estimate of K∗, without partitioning data into the different clusters. This makes the algorithm particularly efficient in handling large-scale data sets from both the perspective of time and space complexity. The algorithm, indeed, highlights the block structure which is implicitly present in the similarity matrix, and associates K∗ to the number of blocks in the matrix. We test the algorithm on synthetic data sets with or without a hierarchical organization of elements. We explore a wide range of K∗ and show the effectiveness of the proposed algorithm to identify K∗, even more accurate than existing methods based on standard internal validity indexes, with a huge advantage in terms of computation time and memory storage. We also discuss the application of the novel algorithm to the de-clustering of instrumental earthquake catalogs, a procedure finalized to identify the level of background seismic activity useful for seismic hazard assessment.
... The three cell types that did not expose a clear cyclic signal (NG2, microglia and tanycytes) exhibit the lowest fraction of rhythmic gene expression [18]. Moreover, we measured the separation of cells that were sampled at different time points, before and after cyclic filtering/enhancement using the Calinski and Harabasz score [19]. Overall, as expected, separation increased substantially following cyclic enhancement and decreased following cyclic filtering, which, as above, is least substantial for the three cell types exhibiting the lowest fraction of rhythmic genes (Fig. 5d). ...
Article
Full-text available
Single-cell RNA sequencing has been instrumental in uncovering cellular spatiotemporal context. This task is challenging as cells simultaneously encode multiple, potentially cross-interfering, biological signals. Here we propose scPrisma, a spectral computational method that uses topological priors to decouple, enhance and filter different classes of biological processes in single-cell data, such as periodic and linear signals. We apply scPrisma to the analysis of the cell cycle in HeLa cells, circadian rhythm and spatial zonation in liver lobules, diurnal cycle in Chlamydomonas and circadian rhythm in the suprachiasmatic nucleus in the brain. scPrisma can be used to distinguish mixed cellular populations by specific characteristics such as cell type and uncover regulatory networks and cell–cell interactions specific to predefined biological signals, such as the circadian rhythm. We show scPrisma’s flexibility in incorporating prior knowledge, inference of topologically informative genes and generalization to additional diverse templates and systems. scPrisma can be used as a stand-alone workflow for signal analysis and as a prior step for downstream single-cell analysis.
... Three algorithms for estimating the number of clusters in each image were evaluated and compared with results from human observers. The Caliński-Harabasz index [37], also known as the variance ratio criterion (VRC), determines the ratio of the between-cluster dispersion to the within-cluster dispersion over all clusters, where the dispersion is the sum of squared distances. A range was set for the number of clusters and the number within the range with the highest VRC was chosen as the optimum (Fig. 1). ...
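The selection rule described in the excerpt above (score every candidate number of clusters in a range and keep the one with the highest VRC) can be sketched as follows. This is only an illustration under stated assumptions: the data are one-dimensional, and the partition for each k is produced by a stand-in rule that cuts the sorted values at the k-1 widest gaps, which is not the clustering method of the cited study. The function names and sample values are hypothetical.

```python
def split_at_largest_gaps(values, k):
    """Stand-in clustering: sort 1-D values and cut at the k-1 widest gaps."""
    xs = sorted(values)
    # Indices i such that a cut falls between xs[i-1] and xs[i], widest gap first.
    by_gap = sorted(range(1, len(xs)), key=lambda i: xs[i] - xs[i - 1], reverse=True)
    cuts = sorted(by_gap[: k - 1])
    groups, start = [], 0
    for cut in cuts:
        groups.append(xs[start:cut])
        start = cut
    groups.append(xs[start:])
    return groups


def vrc(groups):
    """Variance ratio criterion: [between / (k - 1)] / [within / (n - k)]."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (between / (k - 1)) / (within / (n - k))


def best_k(values, k_range):
    """Return the k in k_range with the highest VRC, plus all scores."""
    scores = {k: vrc(split_at_largest_gaps(values, k)) for k in k_range}
    return max(scores, key=scores.get), scores


# Three visibly separated groups of values (illustrative data).
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 9.0, 8.7]
k_opt, scores = best_k(values, range(2, 6))
print(k_opt, scores)
```

The range must keep k between 2 and n-1 and leave at least one cluster with more than one distinct value, otherwise the within-cluster dispersion is zero and the ratio is undefined.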
Article
Full-text available
Realistic images often contain complex variations in color, which can make economical descriptions difficult. Yet human observers can readily reduce the number of colors in paintings to a small proportion they judge as relevant. These relevant colors provide a way to simplify images by effectively quantizing them. The aim here was to estimate the information captured by this process and to compare it with algorithmic estimates of the maximum information possible by colorimetric and general optimization methods. The images tested were of 20 conventionally representational paintings. Information was quantified by Shannon’s mutual information. It was found that the estimated mutual information in observers’ choices reached about 90% of the algorithmic maxima. For comparison, JPEG compression delivered somewhat less. Observers seem to be efficient at effectively quantizing colored images, an ability that may have applications in the real world.
... We performed a k-means analysis based on Ward's method using the fpc package [33] coupled with the Calinski Harabasz index (CH index) [34] to determine the minimal parsimonious number of ecological clusters of species. Based on a Principal Component Analysis (PCA) using the FactoMineR package [35], we performed an Ascendant Hierarchical Clustering (AHC) to determine and visualize species' clusters likely to occur together within the same micro-ecological niche. ...
Preprint
In Africa, vector-borne diseases (VBDs) are still a major public health issue especially in cities that gather an increasing human population. Market gardening practices, for example, can favor the transmission of urban malaria, while insufficient water supply and waste management favor the circulation of arboviroses related to Aedes mosquitoes. Urban planning is a major challenge to mitigate vector risks. As a planning strategy, greening is a concept that is increasingly considered as a major element impacting the well-being of inhabitants, but also for the restoration of biodiversity in cities. Nevertheless, the impact of urban green spaces on vector risk remains poorly investigated, as they may serve as refuge for vectors. This is why the diversity of mosquitoes in terms of species and larval habitat, through larval prospections in environmental water collections and human landing catches, is studied here at an intra-urban interface area between a forest and an urban ecosystem in order to assess the vector risk generated by preserving a forest patch in the heart of Libreville, capital of Gabon, central Africa. Out of 104 water containers explored, 94 (90.4%) were artificial, mainly comprising gutters, used tires, and plastic bottles, while 10 (9.6%) were natural, comprising a puddle, streams, and tree holes. The majority of the water collections recovered (73.1%) were found outside of the forested area, natural and artificial ones considered together. A total of 770 mosquitoes belonging to 14 species were collected from water collections. The mosquito community was largely dominated by anthropophilic species like Aedes albopictus (33.5%), Culex quinquefasciatus (30.4%), and Lutzia tigripes (16.5%). The Shannon index of diversity showed that mosquitoes were about half as diverse inside the forest (0.7) as outside (1.3). However, both communities were quite similar in terms of common species and relative abundance (Morisita-Horn index = 0.7). 
Regarding human landing catches, Aedes albopictus (86.1%) was the most aggressive species, putting people at risk of Aedes-borne viruses. This study uncovered the importance of considering urban forested ecosystems as potential drivers of disease emergence and spread in urban areas, as they might locally boost urban mosquito densities through poor environmental practices, mainly maintained by humans. In Gabon, this study should contribute to guide targeted vector control strategies, especially regarding the implementation of policies for a better environmental management and vector surveillance in urbanized areas.
... • Silhouette score [12], • Calinski and Harabasz index [13], • Dunn index [14], • Davies-Bouldin index [15], • Density-Based Clustering Validation (DBCV) [16]. ...
... The number of clusters is set by the user and is generally chosen by considering the change in some metric of the analysis as a function of the chosen number of clusters. Here we use the Calinski-Harabasz score [38], which is roughly the ratio of the dispersion between clusters to the dispersion among members of a cluster. One seeks the maximum amount of information available before we move from delineating truly isolated clusters to partitioning randomly distributed points within clusters. ...
Article
Full-text available
We use a globally consistent, time-resolved data set of CO2 emission proxies to quantify urban CO2 emissions in 91 cities. We decompose emission trends into contributions from changes in urban extent, population density and per capita emission. We find that urban CO2 emissions are increasing everywhere but that the dominant contributors differ according to development level. A cluster analysis of factors shows that developing countries were dominated by cities with rapid increases in area and per capita CO2 emissions. Cities in the developed world, by contrast, show slow area and per capita CO2 emissions growth. China is an important intermediate case with rapid urban area growth combined with slower per capita CO2 emissions growth. Urban per capita emissions are often lower than their national average for many developed countries, suggesting that urbanisation may reduce overall emissions. However, trends in per capita urban emissions are higher than their national equivalent almost everywhere, suggesting that urbanisation will become a more serious problem in the future. An important exception is China, whose per capita urban emissions are growing more slowly than the national value. We also see a negative correlation between trends in population density and per capita CO2 emissions, highlighting a strong role for densification as a tool to reduce CO2 emissions.
... In order to determine the optimal number of clusters, we calculated several classic clustering quality criteria (Supplementary section S3). In the raw-data-based and the feature-based approaches, we calculated the Calinski-Harabasz criterion [40], the Kryszczuk variant of Calinski-Harabasz criterion [41], the Genolini variant of Calinski-Harabasz criterion [37], the opposite of Ray-Turi criterion [42] and the opposite of Davies-Bouldin criterion [43]. In the model-based approach, we calculated the Akaike Information Criterion (AIC) [44] and the Bayesian Information Criterion (BIC) [45]. ...
Article
Context: Identifying clusters (i.e., subgroups) of patients from the analysis of medico-administrative databases is particularly important to better understand disease heterogeneity. However, these databases contain different types of longitudinal variables which are measured over different follow-up periods, generating truncated data. It is therefore fundamental to develop clustering approaches that can handle this type of data. Objective: We propose here cluster-tracking approaches to identify clusters of patients from truncated longitudinal data contained in medico-administrative databases. Material and methods: We first cluster patients at each age. We then track the identified clusters over ages to construct cluster-trajectories. We compared our novel approaches with three classical longitudinal clustering approaches by calculating the silhouette score. As a use-case, we analyzed antithrombotic drugs used from 2008 to 2018 contained in the Échantillon Généraliste des Bénéficiaires (EGB), a French national cohort. Results: Our cluster-tracking approaches allow us to identify several cluster-trajectories with clinical significance without any imputation of data. The comparison of the silhouette scores obtained with the different approaches highlights the better performances of the cluster-tracking approaches. Conclusion: The cluster-tracking approaches are a novel and efficient alternative to identify patient clusters from medico-administrative databases by taking into account their specificities.
... After clustering, they were divided into 3, 4, 5 classes in order, and then their clustering effects were evaluated. In this paper, we used the following three evaluation methods: Silhouette Coefficient [7], Calinski-Harabasz Index [1], Davies-Bouldin Index [4]. ...
... Between March and September, daily predicted DMS data from GPR are used at each pixel in the NA basin (spatial resolution of 0.25°× 0.25°) in this analysis. The optimal number of clusters is set according to the novel Elbow approach by Shi et al. (2021), which performs better than the Calinski-Harabasz index (Caliński and Harabasz, 1974) and the classic Elbow graph (Syakur et al., 2017) in this analysis, setting the optimal cluster number to 7 (see Section 2.5 and Fig. S7 for details). ...
Article
Full-text available
As the most ubiquitous natural source of sulfur in the atmosphere, dimethylsulfide (DMS) promotes aerosol formation in marine environments, impacting cloud radiative forcing and precipitation, eventually influencing regional and global climate. In this study, we propose a machine learning predictive algorithm based on Gaussian process regression (GPR) to model the distribution of daily DMS concentrations in the North Atlantic waters over 24 years (1998–2021) at 0.25° × 0.25° spatial resolution. The model was built using DMS observations from cruises, combined with satellite-derived oceanographic data and Copernicus-modelled data. Further comparison was made with the previously employed machine learning methods (i.e., artificial neural network and random forest regression) and the existing empirical DMS algorithms. The proposed GPR outperforms the other methods for predicting DMS, displaying the highest coefficient of determination (R²) value of 0.71 and the lowest root mean square error (RMSE) of 0.21. Notably, DMS regional patterns are associated with the spatial distribution of phytoplankton biomass and the thickness of the ocean mixed layer, displaying high DMS concentrations above 50°N from June to August. The amplitude, onset, and duration of the DMS annual cycle vary significantly across different regions, as revealed by the k-means++ clustering. Based on the GPR model output, the sea-to-air flux in the North Atlantic from March to September is estimated to be 3.04 Tg S, roughly 44% lower than the estimates based on extrapolations of in-situ data. The present study demonstrates the effectiveness of a novel method for estimating seawater DMS surface concentration at unprecedented space and time resolutions. As a result, we are able to capture high-frequency spatial and temporal patterns in DMS variability. 
Better predictions of DMS concentration and derived sea-to-air flux will improve the modeling of biogenic sulfur aerosol concentrations in the atmosphere and reduce aerosol-cloud interaction uncertainties in climate models.
... For consistency with the literature presented above, we implemented a cluster analysis using the K-means algorithm with a number of groups varying from two to six and trying multiple random starting points. The optimal number of groups was selected by computing several clustering performance measures, such as the GAP statistic (Tibshirani et al., 2001), the silhouettes criterion and the majority rule-of-thumb for several indicators as proposed by Caliński and Harabasz (1974) and Krzanowski and Lai (1988). ...
... of mobility within the city of Milan. First, there is a strong similarity in the color intensities obtained from the geometric mean, Mean 0-1, and AMPI indices. ...
Article
Full-text available
We evaluate the level of mobility services and infrastructures in Milan to identify which areas are best equipped to serve citizens. We explore the overall degree of smart mobility by ranking the 88 administrative districts according to their transportation services. A statistical analysis both quantifies and groups the neighborhoods by their degree of mobility. We first built a set of composite indicators, including the AMPI and the Static Jevons Index. The robustness of the index is validated through a sensitivity analysis of behavior when varying the underlying indicators. A spatial cross-correlation analysis is conducted to contextualize the degree of mobility estimated in the neighborhoods with respect to some infrastructural variables. Second, the composite indices are used to cluster the districts into homogeneous groups with similar mobility levels. The results show that, whether using the indices individually or in combination, the cluster analyses successfully distinguish key areas of the city, such as the interchange hubs, university zones, city center, workplaces, and suburbs. We identify four classes of districts characterized by increasing levels of smart mobility, and highlight critical differences between the city center and the peripheral areas of Milan.
... To assess the clustering performance in a quantitative manner, we computed multiple unsupervised cluster-separation metrics for evaluation. To start with, the Calinski-Harabasz index (CH) [54] for a set of data E with n_E pixels and split into k clusters is defined as the ratio of the dispersion between and within clusters. ...
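The excerpt above breaks off before the formula. For orientation, the index is usually written as below, using the excerpt's symbols E, n_E and k. The remaining notation is assumed here rather than taken from the cited paper: C_q is the set of points in cluster q, n_q its size, c_q its centroid, and c_E the centroid of all of E.

```latex
\mathrm{CH}(k) = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \times \frac{n_E - k}{k - 1},
\qquad
B_k = \sum_{q=1}^{k} n_q \,(c_q - c_E)(c_q - c_E)^{\top},
\qquad
W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^{\top}
```

Here tr(B_k) is the between-cluster sum of squared distances and tr(W_k) the within-cluster sum, so a larger value indicates clusters that are compact relative to their separation.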
Article
Full-text available
In this paper, we expand upon our previous research on unsupervised learning algorithms to map the spectral parameters of the Martian surface. Previously, we focused on the VIS-NIR range of hyperspectral data from the CRISM imaging spectrometer instrument onboard NASA’s Mars Reconnaissance Orbiter to relate to other correspondent imager data sources. In this study, we generate spectral cluster maps on a selected CRISM datacube in a NIR range of 1050–2550 nm. This range is suitable for identifying the most dominant mineralogy formed in ancient wet environments, such as phyllosilicates, pyroxene and smectites. In the machine learning community, the UMAP method for dimensionality reduction has recently gained attention because of its computing efficiency and speed. We apply this algorithm in combination with k-Means to data from Jezero Crater. Such studies of Jezero Crater are a priority to support the planning of NASA’s current Perseverance rover mission. We compare our results with other methodologies based on a suitable metric and can identify an optimal cluster size of six for the selected datacube. Our proposed approach outperforms comparable methods in efficiency and speed. To show the geological relevance of the different clusters, the so-called “summary products” derived from the hyperspectral data are used to correlate each cluster with its mineralogical properties. We show that clustered regions relate to different mineralogical compositions (e.g., carbonates and pyroxene). Finally, the generated spectral cluster map shows a qualitatively strong resemblance to a given manual compositional expert map. In conclusion, the presented method can be implemented for automated region-based analysis to extend our understanding of Martian geological history.
Chapter
Statistical and machine learning methods have many applications in the environmental sciences, including prediction and data analysis in meteorology, hydrology and oceanography; pattern recognition for satellite images from remote sensing; management of agriculture and forests; assessment of climate change; and much more. With rapid advances in machine learning in the last decade, this book provides an urgently needed, comprehensive guide to machine learning and statistics for students and researchers interested in environmental data science. It includes intuitive explanations covering the relevant background mathematics, with examples drawn from the environmental sciences. A broad range of topics is covered, including correlation, regression, classification, clustering, neural networks, random forests, boosting, kernel methods, evolutionary algorithms and deep learning, as well as the recent merging of machine learning and physics. End‑of‑chapter exercises allow readers to develop their problem-solving skills, and online datasets allow readers to practise analysis of real data.
Article
How pain emerges from human brain remains an unresolved question in pain neuroscience. Neuroimaging studies have suggested that all brain areas activated by painful stimuli were also activated by tactile stimuli, and vice versa. Nonetheless, pain-preferential spatial patterns of voxel-level activation in the brain have been observed when distinguishing painful and tactile brain activations using multivariate pattern analysis (MVPA). According to two hypotheses, the neural activity pattern preferentially encoding pain could exist at a global, coarse-grained, regional level, corresponding to the "pain connectome" hypothesis proposing that pain-preferential information may be encoded by the synchronized activity across multiple distant brain regions, and/or exist at a local, fine-grained, voxel level, corresponding to the "intermingled specialized/preferential neurons" hypothesis proposing that neurons responding specially or preferentially to pain could be present and intermingled with non-pain neurons within a voxel. Here, we systematically investigated the spatial scales of pain-distinguishing information in the human brain measured by fMRI using machine learning techniques, and found that pain-distinguishing information could be detected at both coarse-grained spatial scales across widely distributed brain regions and fine-grained spatial scales within many local areas. Importantly, the spatial distribution of pain-distinguishing information in the brain varies across individuals and such inter-individual variations may be related to a person's trait about pain perception, particularly the pain vigilance and awareness. These results provide new insights into the long-standing question of how pain is represented in the human brain and help the identification of characteristic neuroimaging measurements of pain.
Thesis
Context: In health care, a complex intervention is defined by the interaction between a number of distinct elements that produces a result not limited to the sum of the effects of each component. Some FSEF services offer specific care through the coordinated work of teams from psychiatry and other disciplines, in particular teaching provided by the Education Nationale. Studying these complex systems requires a particular evaluation method. The objective of this work is to begin their evaluation by describing the programs themselves, the populations they care for, and certain elements of their clinical course during or after care. Method: We conducted a systematic literature review to synthesize existing data on the evaluation of soins-études (combined care and schooling) programs. We then carried out two clinical epidemiology studies in two types of services offering complex interventions: a psychiatric soins-études service and a transdisciplinary service. The latter provides coordinated psychiatric and rehabilitation care to people who have made a serious suicide attempt, with major physical sequelae. We studied indicators related to the functioning of these programs or to the clinical course of the subjects (such as whether or not hospitalization was continued, length of hospitalization, and long-term mortality). We analyzed the clinical elements associated with these outcomes. Results: The literature review on soins-études identified eleven publications. They described the soins-études program, the particular features of the populations cared for, and the clinical course during and after this care. The first study then presented a tool for evaluating the appropriateness of continuing hospitalization in soins-études, and its application. 
The predictive factors of discharge for inappropriateness were a fragile therapeutic alliance, low autonomy, and difficulties in fitting into a communal living environment and in adhering to the care plan. The second study described people hospitalized in the transdisciplinary service after a suicide attempt. On admission, they presented severe psychiatric and somatic disorders. Length of stay in the unit and excess mortality at five years were related to the subjects' sociodemographic characteristics and to the severity of physical sequelae. Discussion: Our results support the value of the complex care systems studied here. Nevertheless, these initial evaluations are limited by their methods and small samples. We therefore propose ways in which prospective studies could be designed to address these interventions more fully. Perspectives are proposed so that the complexity of psychiatric care programs is matched by suitable evaluations.
Chapter
This study maps the scientific production on the performance of regional innovation systems from 1989 to 2017. Qualitative and quantitative procedures are employed to reveal key characteristics of the research field to complement previous systematic reviews. The evolution in the absolute number of articles has been non-monotone. Complementarity between the literature of national innovation systems and regional innovation systems is not always verified given that a negative elasticity is observed in the period [2004, 2006]. Leading contributors are open to collaborative research with follower researchers thereby suggesting consistency between the theory disseminated by the field and professional conduct of main contributors. Results indicate high level of receptivity by international journals and gradual relevance of empirical analysis over time. Empirical studies are classified based on three distinct analytical approaches—case studies, benchmarking and scoring—resulting from co-occurrence analysis of text data. Main methods and indicators to measure regional performance are identified. A trend to adopt composite indicators is observed. Scoring articles use more indicators and cover higher number of dimensions relative to benchmarking articles, but efforts are currently developed to reduce the gap. This review confirms that the research field is characterised by a dichotomy between theoretical and empirical contributions since the survey of most used indicators suggests that empirical studies mainly adopt those capable of capturing the impact of top-down processes on regional innovation, while theoretical contributions disclose the need to use indicators that capture the impact of bottom-up processes. Overcoming difficulties related to performance measurement is also mandatory.
Preprint
Unsupervised classification is becoming an increasingly common method to objectively identify coherent structures within both observed and modelled climate data. However, in most applications using this method, the user must choose the number of classes into which the data are to be sorted in advance. Typically, a combination of statistical methods and expertise is used to choose the appropriate number of classes for a given study; however, it may not be possible to identify a single 'optimal' number of classes. In this work, we present a heuristic method, the Ensemble Difference Criterion, for determining the maximum number of classes unambiguously for modelled data where more than one ensemble member is available. This method requires robustness in the class definition between simulated ensembles of the system of interest. For demonstration, we apply this to the clustering of Southern Ocean potential temperatures in a CMIP6 climate model, and show that the data supports between four and seven classes of a Gaussian Mixture Model.
Conference Paper
The article describes an approach to determining the homogeneity of a set of elements based on the quality characteristics of dividing a set into groups. We have formulated a concept of homogeneity based on the introduced characteristics and proposed a general approach to determine the boundaries of such characteristics to achieve a high-quality division using training samples. We also give an example of the practical application of the described approach to determine the homogeneity of batches of electronic components in the process of additional testing for the space industry.
Article
Full-text available
Chalk, an undesirable grain quality trait in rice, is primarily formed due to high temperatures during the grain-filling process. Owing to the disordered starch granule structure, air spaces and low amylose content, chalky grains are easily breakable during milling thereby lowering head rice recovery and its market price. Availability of multiple QTLs associated with grain chalkiness and associated attributes, provided us an opportunity to perform a meta-analysis and identify candidate genes and their alleles contributing to enhanced grain quality. From the 403 previously reported QTLs, 64 Meta-QTLs encompassing 5262 non-redundant genes were identified. MQTL analysis reduced the genetic and physical intervals and nearly 73% meta-QTLs were narrower than 5cM and 2Mb, revealing the hotspot genomic regions. By investigating expression profiles of 5262 genes in previously published datasets, 49 candidate genes were shortlisted on the basis of their differential regulation in at least two of the datasets. We identified non-synonymous allelic variations and haplotypes in 39 candidate genes across the 3K rice genome panel. Further, we phenotyped a subset panel of 60 rice accessions by exposing them to high temperature stress under natural field conditions over two Rabi cropping seasons. Haplo-pheno analysis uncovered haplotype combinations of two starch synthesis genes, GBSSI and SSIIa, significantly contributing towards the formation of grain chalk in rice. (2023) Meta-QTL and haplo-pheno analysis reveal superior haplotype combinations associated with low grain chalkiness under high temperature in rice.
Preprint
The relationship between clinically accessible epileptic biomarkers and neuronal activity underlying the seizure transition is complex, potentially leading to imprecise delineation of epileptogenic brain areas. In particular, the pattern of interneuronal firing at seizure onset remains under debate, with some studies demonstrating increased firing while others suggest reductions. Previous study of neocortical sites suggests that seizure recruitment occurs upon failure of inhibition, with intact feedforward inhibition in non-recruited territories. We investigated whether the same principles applied also in limbic structures. We analyzed simultaneous ECoG and neuronal recordings during 34 seizures in a cohort of 19 patients (10 male, 9 female) undergoing surgical evaluation for pharmacoresistant focal epilepsy. A clustering approach with five quantitative metrics computed from ECoG and multiunit data was used to distinguish three types of site-specific activity patterns during seizures, at times co-existing within seizures. 156 single-units were isolated, subclassified by cell-type, and tracked through the seizure using our previously published methods to account for impacts of increased noise and single-unit waveshape changes caused by seizures. One cluster was closely associated with clinically defined seizure onset or spread. Entrainment of high-gamma activity to low-frequency ictal rhythms was the only metric that reliably identified this cluster at the level of individual seizures (p < 0.001). A second cluster demonstrated multi-unit characteristics resembling those in the first cluster, without concomitant high-gamma entrainment, suggesting feedforward effects from the seizure. The last cluster captured regions apparently unaffected by the ongoing seizure. Across all territories, the majority of both excitatory and inhibitory neurons reduced (69.2%) or ceased firing (21.8%). 
Transient increases in interneuronal firing rates were rare (13.5%) but showed evidence of intact feedforward inhibition with maximal firing rate increases and waveshape deformations in territories not fully recruited but showing feedforward activity from the seizure, and a shift to burst-firing in seizure-recruited territories (p = 0.014). This study provides evidence for entrained high gamma activity as an accurate biomarker of ictal recruitment in limbic structures. However, our results of reduced neuronal firing suggest preserved inhibition in mesial temporal structures despite simultaneous indicators of seizure recruitment, in contrast to the inhibitory collapse scenario documented in neocortex. Further study is needed to determine if this activity is ubiquitous to hippocampal seizures or if it indicates a "seizure-responsive" state in which the hippocampus is not the primary driver. If the latter, distinguishing such cases may help refine surgical treatment of mesial temporal lobe epilepsy.
Preprint
Full-text available
Most deep clustering methods, despite providing complex networks to learn better from data, use a shallow clustering method. These methods have difficulty finding good clusters because they cannot balance local search and global search to prevent premature convergence. In other words, they do not consider different aspects of the search, which causes them to get stuck in a local optimum. In addition, the majority of existing deep clustering approaches perform clustering with knowledge of the number of clusters, which is not practical in most real scenarios where such information is not available. To address these problems, this paper presents a novel automatic deep sparse clustering approach based on an evolutionary algorithm called Multi-Trial Vector-based Differential Evolution (MTDE). A sparse auto-encoder is first applied to extract embedded features. Manifold learning is then adopted to obtain a representation and extract the spatial structure of the features. Afterward, MTDE clustering is performed without prior information on the number of clusters to find the optimal clustering solution. The proposed approach was evaluated on various datasets, including images and time-series. The results demonstrate that the proposed method improved MTDE by 18.94% on average and, compared to the most recent deep clustering algorithms, is consistently among the top three in the majority of datasets. Source code is available on GitHub: https://github.com/parhamhadikhani/ADSMTDE_Clustering.
Article
Full-text available
This paper introduces an algorithm for the detection of change-points and the identification of the corresponding subsequences in transient multivariate time-series data (MTSD). The analysis of such data has become increasingly important due to growing availability in many industrial fields. Labeling, sorting or filtering highly transient measurement data for training Condition-based Maintenance (CbM) models is cumbersome and error-prone. For some applications it can be sufficient to filter measurements by simple thresholds or to find change-points based on changes in mean value and variation. But for a robust diagnosis of, for example, a component within a component group that has a complex non-linear correlation between multiple sensor values, such a simple approach would not be feasible: no meaningful and coherent measurement data that could be used for training a CbM model would emerge. Therefore, we introduce an algorithm that uses a recurrent neural network (RNN) based Autoencoder (AE) which is iteratively trained on incoming data. The scoring function uses the reconstruction error and latent space information. A model of the identified subsequence is saved and used for recognition of repeating subsequences as well as fast offline clustering. For evaluation, we propose a new similarity measure based on the curvature for a more intuitive time-series subsequence clustering metric. A comparison with seven other state-of-the-art algorithms and eight datasets shows the capability and the increased performance of our algorithm to cluster MTSD online and offline in conjunction with mechatronic systems.
Conference Paper
Full-text available
With the growing consumer awareness of healthy dieting, consumers' interest in concrete information about it is also growing. Therefore, consumers are getting involved in various virtual communities (VC) on social networks through which they receive information from other consumers, share experiences and give recommendations for certain products. This leads to electronic word of mouth (eWOM), which influences consumer attitudes and beliefs. In this regard, this paper aims to explore how VC Low carb high fat (LCHF) and Paleo diets, passed down through eWOM, influence consumer attitudes about dieting and the perception of products they consider healthy. The empirical research was conducted through an online survey questionnaire in VC on Facebook on a random sample of 137 respondents. The results of the study were obtained by a descriptive statistical analysis and indicate that practitioners of VC LCHF and Paleo diet have attitudes consistent with the views of the community, i.e. they adopt attitudes that the community promotes and influence each other by encouraging healthy food purchases through eWOM. The limitations of the study mainly concern the small sample size and possible other factors that influence the attitudes of VC members. Future studies should include a larger number of claims in the verification of respondents' attitudes and, for example, compare them with the attitudes of non-VC respondents. The findings of the study could provide marketers with a better understanding of consumer behavior in VC and provide guidelines for creating an effective marketing mix in the context of healthy dieting.
Conference Paper
Full-text available
Successful management of personal finances requires developing the financial literacy of young generations from an early age to ensure that they achieve financial competencies. Significant efforts have been made through the education system in the last five years to raise the level of financial literacy in Croatia. Responsibility for the financial education of young people is placed mostly on the regular and formal education system, while parental education is to some extent neglected. Parental conversation, setting an example, and rewarding or punishing children are just some of the educational factors. This paper investigates the financial literacy of high school students as well as the impact of parental involvement on their financial competencies. The aim of this paper is to examine whether family factors, such as parents' conversations with children, their personal examples and financial behaviour, influence the child's financial literacy and which components of financial literacy are affected the most. In the empirical part of the paper, the survey method is used, while statistical methods are used to process the research results, from which we single out Spearman's rank correlation coefficient and the two-sample t-test with (approximately) equal variances. This methodology determined the positive influence of parents as educational factors on the financial literacy of high school students in the observed research sample, especially on the component of financial behaviour. We can conclude that it is desirable for parents to discuss financial topics. In addition to the conversation, it is necessary that they show their responsible financial behaviour to the child by their own example. For example, it is desirable to involve the child in financial decisions, show him the bills, not fulfill every whim of the child and reward him financially when he deserves it.
Article
Lung cancer is the leading cause of cancer-related deaths worldwide. Medical imaging technologies such as computed tomography (CT) and positron emission tomography (PET) are routinely used for non-invasive lung cancer diagnosis. In clinical practice, physicians investigate the characteristics of tumors such as the size, shape and location from CT and PET images to make decisions. Recently, scientists have proposed various computational image features that can capture more information than that directly perceivable by human eyes, which promotes the rise of radiomics. Radiomics is a research field on the conversion of medical images into high-dimensional features with data-driven methods to help subsequent data mining for better clinical decision support. Radiomic analysis has four major steps: image preprocessing, tumor segmentation, feature extraction and clinical prediction. Machine learning, including the high-profile deep learning, facilitates the development and application of radiomic methods. Various radiomic methods have been proposed recently, such as the construction of radiomic signatures, tumor habitat analysis, cluster pattern characterization and end-to-end prediction of tumor properties. These methods have been applied in many studies aiming at lung cancer diagnosis, treatment and monitoring, shedding light on future non-invasive evaluations of the nodule malignancy, histological subtypes, genomic properties and treatment responses. In this review, we summarized and categorized the studies on the general workflow, methods for clinical prediction and clinical applications of machine learning in lung cancer radiomic studies, introduced some commonly-used software tools, and discussed the limitations of current methods and possible future directions.
Article
An analysis of approaches, technologies and models developed over the past three years for assessing the professionalism and competence level of teachers is presented. Attention is focused on research and development based on working with big data and on the use of technical means to automate the assessment process. An approach to assessing the professional competencies of teachers based on the analysis of the results of their students is formulated. Within the framework of this paper, the third stage of the technology for assessing the professionalism and competence level of general education teachers has been improved, tested and described. Three clustering algorithms (k-means, spectral clustering and agglomerative clustering) were applied to the learning results of Tomsk region secondary students in the national Unified State Exam (USE). The learning results under study covered four subjects: Russian, Mathematics (profile level), Physics and Social Studies, for the period from 2015 to 2019. These data were divided into two arrays, Science and Humanities, and the clustering algorithms were applied to each. The validity of the clusters was assessed using the silhouette coefficient and the Calinski-Harabasz index. The metrics of the algorithms' results and the expediency of their use within the framework of the task were evaluated.
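The abstract above assesses cluster validity with the Calinski-Harabasz index. As a rough illustration only, the following is a minimal sketch of the index in its commonly stated form: between-cluster dispersion divided by within-cluster dispersion, each scaled by its degrees of freedom. The function name `calinski_harabasz` and the toy data are hypothetical and are not taken from the cited study.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Return (SSB / (k - 1)) / (SSW / (n - k)) for a labelled partition.

    points: list of equal-length numeric tuples; labels: one label per point.
    Higher values indicate more compact, better separated clusters.
    """
    n = len(points)
    dim = len(points[0])

    # Group the points by cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k < n clusters")

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    ssb = 0.0  # between-cluster sum of squares
    ssw = 0.0  # within-cluster sum of squares
    for rows in clusters.values():
        centroid = mean(rows)
        ssb += len(rows) * sqdist(centroid, overall)
        ssw += sum(sqdist(r, centroid) for r in rows)

    if ssw == 0.0:
        return float("inf")  # every point coincides with its centroid
    return (ssb / (k - 1)) / (ssw / (n - k))


# Toy data (illustrative): two tight groups far apart along the x-axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # split by x
bad = calinski_harabasz(points, [0, 1, 0, 1])   # split by y
```

In this toy case the partition that follows the visible grouping scores higher than the one that cuts across it, which is why the index is read as a maximization criterion when comparing candidate partitions.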
Article
The unprecedented systemic disruptions that occurred in the last years are highlighting a structural lack of resilience in most organisations. In this context, there is an increasing scholars’ interest in understanding to what extent capabilities to anticipate, respond and thrive in unprecedented situations represent a strategic lever for business continuity management in most sectors of activity. To contribute to this debate, this research adopts a dynamic capabilities perspective to investigate the specific capabilities that organisations build in the pursuit of business continuity. Based on multi-sectoral primary data collected in 2021 from HR professionals of 419 organisations operating in Italy, the outcomes of our quantitative study show that the business continuity requirements expressed by ISO22301 are perceived as interrelated and indivisible. Furthermore, our results suggest that the ambition to fulfil the business continuity requirements depends on the organisational capabilities to improvise and coordinate the use of its assets (i.e. people, technologies, premises, information) in the face of disruptions. Besides the managerial implications concerning how to build favourable organisational conditions to reduce the vulnerability to external risks, the research contributes to the literature by building new measurement scales to assess business continuity and disentangling the rationale behind the related dynamic capabilities framework.