Article

Performance Profiles based on Archetypal Athletes

Taylor & Francis
International Journal of Performance Analysis in Sport
Authors:
To read the full-text of this research, you can request a copy directly from the author.

Abstract

Performance indicators and the performance profiles built on them are among the foundations of performance analysis in sports. The crux is to develop performance profiles that allow the subjects of interest to be evaluated accurately. The present paper contributes a further approach to the existing toolbox of profiling methods. Performance profiles based on archetypal athletes are not based on typical (i.e., mean) performances but on extreme performances, which are usually the most interesting aspect in sports. Archetypal athletes (outstanding performers, whether positively or negatively) are computed, and performers are related to these archetypal athletes. As the archetypal athletes are interpretable, an easy interpretation of the performers' profiles follows. The method is demonstrated on basketball statistics and soccer skill ratings.


... Traditional means of evaluating players and teams are best used in conjunction with rigorous statistical methods. One interesting approach to provide objective evidence about how good (or bad) the players perform based on the statistics collected for them is described by Eugster (2012). The author uses archetype analysis (AA) to obtain outstanding athletes (both positively and negatively). ...
... Archetypes are data-driven extreme points. As is rightly pointed out by Eugster (2012), in sports these extreme points correspond to positively or negatively prominent players. However, AA has an important drawback: archetypes are a convex combination of the sampled individuals, but they are not necessarily observed individuals. ...
... On the one hand, our approach is a natural extension and improvement of the methodology proposed by Eugster (2012) with regard to multivariate data. On the other hand, the methodology can also be used with other available information, such as asymmetric relations and sparse functional data. ...
Article
We intend to understand the growing amount of sports performance data by finding extreme data points, which makes human interpretation easier. In archetypoid analysis each datum is expressed as a mixture of actual observations (archetypoids). Therefore, it allows us to identify not only extreme athletes and teams, but also the composition of other athletes (or teams) according to the archetypoid athletes, and to establish a ranking. The utility of archetypoids in sports is illustrated with basketball and soccer data in three scenarios. Firstly, with multivariate data, where they are compared with other alternatives and give the best results. Secondly, despite the fact that functional data are common in sports (time series or trajectories), functional data analysis has not been exploited until now, due to the sparseness of functions. In the second scenario, we extend archetypoid analysis for sparse functional data, furthermore showing the potential of functional data analysis in sports analytics. Finally, in the third scenario, features are not available, so we use proximities. We extend archetypoid analysis when asymmetric relations are present in data. This study provides valuable knowledge about player/team/league performance so that we can analyze athletes' careers.
... Ideal types are extreme points that synthesize data representing "pure individual types", and are assigned by the most discriminating features of data points. Recent works on archetypal analysis [13,18,31,17,21,48,11,19,37] show the importance of the concept of 'archetypal type' in machine learning and data mining applications. The model of archetypal analysis postulates that entities are convex combinations of the archetypes to be found, and that the magnitudes of the K coefficients for each case reflect the relative proximity to each archetype. ...
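The convex-combination model described in this excerpt can be illustrated with a minimal sketch. The archetype values, the variable names, and the helper `convex_combination` are hypothetical and chosen only for illustration; they do not come from any of the cited papers.

```python
from math import isclose


def convex_combination(archetypes, alphas):
    """Return sum_k alphas[k] * archetypes[k] after checking the simplex constraints."""
    if len(archetypes) != len(alphas):
        raise ValueError("need exactly one coefficient per archetype")
    if any(a < 0 for a in alphas) or not isclose(sum(alphas), 1.0, abs_tol=1e-9):
        raise ValueError("coefficients must be non-negative and sum to 1")
    dim = len(archetypes[0])
    return [sum(a * z[j] for a, z in zip(alphas, archetypes)) for j in range(dim)]


# Hypothetical archetypes described by (points per game, rebounds per game).
archetypes = [[30.0, 2.0], [4.0, 12.0], [2.0, 1.0]]
# Hypothetical mixture coefficients of one player: K = 3 values on the simplex.
alphas = [0.5, 0.25, 0.25]

player = convex_combination(archetypes, alphas)
# The largest coefficient marks the archetype the player is relatively closest to.
closest = max(range(len(alphas)), key=alphas.__getitem__)
```

In a real analysis the coefficients are estimated from the data rather than given; the sketch only shows how a fitted profile is read.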
... In order to solve this problem, we assume, without any loss of generality, that $d_1 \ge d_2 \ge \cdots \ge d_K$. The optimal $u^*$ of problem (19) has the same order of components, that is, $u^*_1 \ge u^*_2 \ge \cdots \ge u^*_K$. Let us assume that $u^*_1 \ge u^*_2 \ge \cdots \ge u^*_{k^+} > 0$ for some $k^+ \le K$. ...
Chapter
This paper presents a fuzzy proportional membership model for clustering (FCPM). Unlike other clustering models, FCPM requires that each entity may express an extent of each prototype, which makes its criterion lose the conventional prototype-additive structure. The methods for fitting the model at different fuzziness parameter values are presented. Because of the complexity of the clustering criterion, minimization of the errors requires the gradient projection method (GPM). We discuss how to find the projection of a vector on the simplex of the fuzzy membership vectors and how the step size of the GPM is fixed. The properties of the clusters found with the FCPM are discussed. Especially appealing is the property that the extremal cluster prototypes remain stable even after the addition of many entities around the grand mean.
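Projecting a vector onto the simplex of membership vectors can be done with a widely used sort-based routine. The sketch below is that generic routine, not necessarily the procedure used in the chapter; the function name `project_to_simplex` is an assumption made for illustration.

```python
def project_to_simplex(d):
    """Euclidean projection of d onto {u : u_k >= 0 and sum(u) == 1}."""
    ordered = sorted(d, reverse=True)      # d_1 >= d_2 >= ... >= d_K
    cumulative = 0.0
    theta = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        if value - candidate > 0:          # the k-th largest component stays positive
            theta = candidate              # threshold of the largest such k wins
    # Shift every component by the threshold and clip negatives to zero.
    return [max(x - theta, 0.0) for x in d]


inside = project_to_simplex([0.2, 0.3, 0.5])     # already a membership vector
corner = project_to_simplex([2.0, 0.0, -1.0])    # pulled to a vertex
shared = project_to_simplex([1.0, 1.0])          # split evenly
```

The positive components of the result are exactly the largest components of the input, which matches the ordering argument quoted in the excerpt above.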
... Archetypes therefore are i) a few (not necessarily observed) pure individual types, ii) they are well separated among themselves, as they lie on the data convex hull, iii) they are as close as possible to the other data, as they ... (Footnote 2: A detailed discussion of archetypal analysis properties in the case of real-valued data may be found in [28][29][30].) Archetypes have been used in many fields [31][32][33][34][35][36][37][38][39][40][41][42]. Two modifications of the original definition of archetypes have been introduced: the first for coping with the presence of outlying observations [38][39][40][41][42][43], obtaining robust archetypes; the second for coping with the fact that archetypes are not necessarily observed, yielding so-called archetypoids that are coincident with some original observation [44,45]. ...
... Archetypes have been used in many fields [31][32][33][34][35][36][37][38][39][40][41][42]. Two modifications of the original definition of archetypes have been introduced: the first for coping with the presence of outlying observations [38][39][40][41][42][43], obtaining robust archetypes; the second for coping with the fact that archetypes are not necessarily observed, yielding so-called archetypoids that are coincident with some original observation [44,45]. Further extensions of archetypes have been proposed for interval-valued [24], functional [25,26], network [27], and discrete data [46]. ...
Article
Prototypes, as Rosch (1973) defined the term in the cognitive sciences field, are ideal exemplars that summarize and represent groups of objects (or categories) and that are “typical” according to their internal resemblance and external dissimilarity vis-à-vis other groups or categories. In line with the cognitive approach, we propose a data-driven procedure for identifying prototypes that is based on archetypal analysis and compositional data analysis. The procedure presented here exploits the properties of archetypes, both in terms of their external dissimilarity in relation to other points in the data set and in terms of their ability to represent the data through compositions in a simplex in which it is possible to cluster all of the observations. The proposed procedure is useful not only for the usual real data points; it may also be used for interval-valued data, functional data, and relational data, and it provides well-separated and clearly profiled prototypes. © 2016 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016
... Although there are already several methods available aiming to develop meaningful profiles (Butterworth, O'Donoghue, & Cropley, 2013; Eugster, 2012; Hughes, Evans, & Wells, 2001; James, Mellalieu, & Jones, 2005; Liu, Yi, Gimenez, Gómez, & Lago-Peñas, 2015c; O'Donoghue, 2005, 2013), performance profiling is still required to be further explored. The main sports profiling techniques were discussed by O'Donoghue (2013). ...
... The basic principle of these techniques is to combine a set of valid and reliable performance-related variables within a given sport to properly describe a certain performance/performer by using normative match data (O'Donoghue, 2013). In fact, combining different types of variables altogether makes the profiling technique an appropriate procedure to evaluate the technical performance of football players (Eugster, 2012; Liu et al., 2015c). However, the current profiling technique is more focused on representing the typical performance and the spread of the performance of a single performer/player (O'Donoghue, 2013). ...
Article
Recent research suggests that match-to-match variation adds important information to performance descriptors in team sports, as it helps measure how players fine-tune their tactical behaviours and technical actions to the extreme dynamical environments. The current study aims to identify the differences in technical performance of players from strong and weak teams and to explore match-to-match variation of players’ technical match performance. Performance data of all the 380 matches of season 2012–2013 in the Spanish First Division Professional Football League were analysed. Twenty-one performance-related match actions and events were chosen as variables in the analyses. Players’ technical performance profiles were established by unifying count values of each action or event of each player per match into the same scale. Means of these count values of players from Top3 and Bottom3 teams were compared and plotted into radar charts. Coefficient of variation of each match action or event within a player was calculated to represent his match-to-match variation of technical performance. Differences in the variation of technical performances of players across different match contexts (team and opposition strength, match outcome and match location) were compared. All the comparisons were achieved by the magnitude-based inferences. Results showed that technical performances differed between players of strong and weak teams from different perspectives across different field positions. Furthermore, the variation of the players’ technical performance is affected by the match context, with effects from team and opposition strength greater than effects from match location and match outcome.
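The match-to-match variation measure named in this abstract, the coefficient of variation, is the standard deviation divided by the mean. A minimal sketch follows; the match counts are made up, and the use of the sample standard deviation is an assumption, since the abstract does not say which variant was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed variant)."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(values) / m


# Hypothetical pass counts of one player over three matches.
passes_per_match = [40, 50, 60]
cv = coefficient_of_variation(passes_per_match)
```

A larger value indicates that the player's count for that action fluctuates more from match to match relative to its average level.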
... This is an important characteristic that distinguishes AA from center-oriented clustering techniques which may produce several nearby clusters, not easily interpretable (Li, Wang, Louviere, & Carson, 2003). The most interesting point in AA is that each observation is expressed as a convex combination of the archetypes, with mixture coefficients revealing the amount of participation of each archetype to the observation (Eugster, 2012). These mixture coefficients can be also used to define the proximity of an observation to an archetype, allowing for a kind of probabilistic clustering (Bauckhage & Thurau, 2009). ...
... where A [n×k] is the matrix of the mixture (or alpha) coefficients showing the degree of each archetype's contribution to the expression of each and every observation (Eugster, 2012). Furthermore, the archetypes as convex combinations of the original data points are expressed by the following equation: ...
Article
The objective of this paper is to present a methodology for deriving an intelligible synopsis of single-trial (ST) variability in brain responses. An algorithmic procedure, relying on temporal patterning and built over archetypal analysis, is introduced. Archetypical brain waves are first derived from the ensemble of brain responses and then used to unfold the observed variability. Using these archetypes as anchor points, homogeneous groups of ST-responses are detected and contrasted with each other. The new methodology incorporates steps for organizing the variability and presenting it by means of low-dimensional maps. Magnetoencephalographic responses from a visual stimulation paradigm are used for demonstrating and validating the approach. The results show that a small number of archetypes is sufficient for describing reliably the response variability. The groups of ST-responses, delineated around these archetypes, reflect differences in the way the ongoing activity interacts with the incoming stimulus. Estimates of signal-to-noise ratio are utilized in order to demonstrate that there is a significant information loss when response variability is left untreated. Moreover, ensemble averaging is employed for uniquely recovering the “true” response. Archetypal analysis provides a concise description of response variability which potentially can contribute in the understanding of its origin.
... Archetypal analysis is a very useful tool for this purpose. Archetypes were defined in Cutler and Breiman (1994) and they have been applied in different fields such as market research (Li et al. (2003), Porzio et al. (2008), Midgley and Venaik (2013)), biology (D'Esposito et al. (2012)), genetics (Thøgersen et al. (2013)), sports (Eugster (2012)), industrial engineering (ergonomic design and evaluation) (Epifanio, Vinué, and Alemany (2013)), the evaluation of scientists (Seiler and Wohlrabe (2013)), astrophysics (Chan et al. (2003), Richards et al. (2012)), e-learning (Theodosiou et al. (2013)), multi-document summarization (Canhasi and Kononenko (2013, 2014)) and different machine learning problems (Mørup and Hansen (2012), Stone (2002)). The archetypes returned by archetypal analysis are a convex combination of the sampled individuals, but they are not necessarily observed individuals. ...
... The second choice identifies the individuals with the maximum α value for each archetype, i.e. the individuals with the largest relative share for the respective archetype. We refer to this set as cand_α; it is used in Eugster (2012) and Seiler and Wohlrabe (2013). The third choice identifies the individuals with the maximum β value for each archetype, i.e. the major contributors in the generation of the archetypes. ...
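The selection rule behind the α-based candidate set can be sketched in a few lines. The α matrix below is invented, and the function name `cand_alpha` is a hypothetical label for this illustration rather than an identifier from the cited work.

```python
def cand_alpha(alpha):
    """For each archetype k, return the index of the individual with the largest alpha[i][k].

    alpha[i][k] is the share of archetype k in individual i; each row sums to 1.
    """
    n_individuals = len(alpha)
    n_archetypes = len(alpha[0])
    return [
        max(range(n_individuals), key=lambda i: alpha[i][k])
        for k in range(n_archetypes)
    ]


# Hypothetical coefficients for four individuals and three archetypes.
alpha = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
    [0.90, 0.05, 0.05],
]
candidates = cand_alpha(alpha)
```

Ties are resolved in favor of the first individual encountered, which is a choice of this sketch and not something the excerpt specifies.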
Article
The new concept archetypoids is introduced. Archetypoid analysis represents each observation in a dataset as a mixture of actual observations in the dataset, which are pure type or archetypoids. Unlike archetype analysis, archetypoids are real observations, not a mixture of observations. This is relevant when existing archetypal observations are needed, rather than fictitious ones. An algorithm is proposed to find them and some of their theoretical properties are introduced. It is also shown how they can be obtained when only dissimilarities between observations are known (features are unavailable). Archetypoid analysis is illustrated in two design problems and several examples, comparing them with the archetypes, the nearest observations to them and other unsupervised methods.
... These archetypes are themselves constrained to be convex combinations of actual data points and lie on the data set boundary, i.e., the convex hull. This statistical method was first introduced by Cutler and Breiman (1994) and has found applications in different areas, e.g., in economics (Porzio et al, 2008), astrophysics (Chan et al, 2003), pattern recognition (Bauckhage and Thurau, 2009) and sports analytics (Eugster, 2012;Seth and Eugster, 2016). ...
Preprint
The standard mixture modelling framework has been widely used to study heterogeneous populations, by modelling them as being composed of a finite number of homogeneous sub-populations. However, the standard mixture model assumes that each data point belongs to one and only one mixture component, or cluster, but when data points have fractional membership in multiple clusters this assumption is unrealistic. It is in fact conceptually very different to represent an observation as partly belonging to multiple groups instead of belonging to one group with uncertainty. For this purpose, various soft clustering approaches, or individual-level mixture models, have been developed. In this context, Heller et al (2008) formulated the Bayesian partial membership model (PM) as an alternative structure for individual-level mixtures, which also captures partial membership in the form of attribute specific mixtures, but does not assume a factorization over attributes. Our work proposes using the PM for soft clustering of count data arising in football performance analysis and compares the results with those achieved with the mixed membership model and finite mixture model. Learning and inference are carried out using Markov chain Monte Carlo methods. The method is applied to Serie A football player data from the 2022/2023 football season, to estimate the positions on the field where the players tend to play, in addition to their primary position, based on their playing style. The application of the partial membership model to football data could have practical implications for coaches, talent scouts, team managers and analysts. These stakeholders can utilize the findings to make informed decisions related to team strategy, talent acquisition, and statistical research, ultimately enhancing performance and understanding in the field of football.
... On the one hand, there has been an increasing number of papers proposing efficient computational methods to calculate AA [20], [21], [22], [23], with applications in computer vision. On the other hand, AA has been applied in other very diverse fields, such as, climatology [24], [25], ergonomics [26], [27], genetics [28], [29], [30], image processing [31], [32], [33], [34], machine learning problems [20], [35], [36], [37], market research [38], multi-document summarization [39], nanotechnology [40], neuroscience [41], [42], sports [43], [44], [45] and sustainability [4]. Finally, other papers have proposed extensions and new methodologies derived from AA with applications in a broad spectrum of fields: kernel AA [20], AA with missing data [20], [46], robust AA [47], [48], interval archetypes [49], archetypoid analysis (ADA) [50], functional AA [51], data-driven prototype identification [52], archetypal networks [35], probabilistic AA [53], AA for nominal [6], [54] and ordinal observations [55], directional AA [56], AA for shapes [57], deep AA [37], [58], and outlier detection [59], [60], [61]. ...
Article
We introduce a novel exploratory technique, termed biarchetype analysis, which extends archetype analysis to simultaneously identify archetypes of both observations and features. This innovative unsupervised machine learning tool aims to represent observations and features through instances of pure types, or biarchetypes, which are easily interpretable as they embody mixtures of observations and features. Furthermore, the observations and features are expressed as mixtures of the biarchetypes, which makes the structure of the data easier to understand. We propose an algorithm to solve biarchetype analysis. Although clustering is not the primary aim of this technique, biarchetype analysis is demonstrated to offer significant advantages over biclustering methods, particularly in terms of interpretability. This is attributed to biarchetypes being extreme instances, in contrast to the centroids produced by biclustering, which inherently enhances human comprehension. The application of biarchetype analysis across various machine learning challenges underscores its value, and both the source code and examples are readily accessible in R and Python at https://github.com/aleixalcacer/JA-BIAA .
... Next to this common method, Tittonell et al. (2020) propose using archetypal analysis (AA) to construct farm typologies. AA is an unsupervised learning method designed to find extremal points in a multivariate data set, called "archetypes," by minimizing the squared error, such that all the individuals are represented as a convex combination of the archetypes (Cutler and Breiman 1994;Eugster 2012;Tittonell et al. 2020). An individual's proximity to an archetype is reflected by a coefficient that determines whether they should be assigned to that archetype. ...
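For orientation, the squared-error criterion mentioned in this excerpt is commonly written as follows, in the formulation usually attributed to Cutler and Breiman (1994). The symbols are assumptions of this sketch: $X$ is the $n \times m$ data matrix, $A$ the $n \times k$ matrix of mixture coefficients, $B$ a $k \times n$ matrix of weights, and $Z = BX$ the $k$ archetypes.

```latex
\min_{A,\,B}\; \bigl\lVert X - A B X \bigr\rVert_F^2
\quad \text{subject to} \quad
\alpha_{ij} \ge 0,\; \sum_{j=1}^{k} \alpha_{ij} = 1 \;\; (i = 1,\dots,n),
\qquad
\beta_{ji} \ge 0,\; \sum_{i=1}^{n} \beta_{ji} = 1 \;\; (j = 1,\dots,k).
```

The constraints on $A$ make every individual a convex combination of the archetypes, and the constraints on $B$ make every archetype a convex combination of the individuals.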
Article
Conservation Agriculture (CA) is actively promoted as an alternative farming system that combines environmental, economic, and social sustainability. Three pillars define CA: (i) minimum mechanical soil disturbance, (ii) permanent soil organic cover, and (iii) species diversification. The local context, constraints, and needs of the farmers influence the translation of the pillars into practices. Currently, there is no method for categorizing this diversity of CA practices, which hampers impact assessment, understanding of farmer choices and pathways, stakeholder communication, and policymaking. This paper presents a systematic method to identify and categorize the diversity of CA practices at the regional level, anchored in the three pillars and based on practices implemented by CA farmers. The classification method is grounded on the intersection of an archetypal analysis and a hierarchical clustering analysis. This method was used to study CA practices in Wallonia, Belgium, based on a survey of practices in a sample of 48 farmers. Combining the two clustering methods increases the proportion of classified farmers while allowing for the distinction between three CA-types with extreme and salient practices, and two intermediate CA-types comprising farmers whose practices fall between these references. The study reveals that three explanatory factors influence the implementation of CA practices in Wallonia: (i) the proportion of tillage-intensive crops and (ii) temporary grasslands in the crop sequence, and (iii) the organic certification. These factors lead to trade-offs that hinder the three pillars of CA from being fully implemented simultaneously. This new classification method can be replicated in other regions where CA is practiced, by adapting input variables according to context and local knowledge.
... There are authors who state that combining several variables is the right procedure to increase team and player complexity (Eugster, 2012;Liu et al., 2015). The fact that football offers players the opportunity to alternate between individual and team play contributes to the development of personality traits that facilitate this transition. ...
Article
There is a growing interest in extracurricular and out-of-school activities for pupils, given their positive effects on the development of individuals. In the present study we aimed to develop a training program for the game of football as an extracurricular activity that integrates both physical skills and personality traits in high school students. The research design assumed a pre-posttest design, with the dependent samples t-test used as the statistical method of investigation. The physical abilities tracked in the study were body mass index, body harmony, exercise endurance, fitness, body recovery after exercise and endurance running. As personality traits we looked at dominance, emotional stability, spontaneity, sensitivity, intelligence, conformity and discipline, sense of self, and refocusing on planning as a positive cognitive-emotional coping strategy. The results show that football, as an extracurricular activity, offers the possibility of harmonious physical and mental development of young people.
... AA and ADA are applied in many diverse fields, such as biology [22], computer vision [23][24][25][26][27][28], education [29,30], engineering [31], genetics [32], machine learning problems [15], market research [17], neuroscience [33][34][35], psychology [36], and sports [37,38]. Since the proposal by [21], AA has become a standard in the accommodation problem in industrial design [39], where extreme cases are searched to give designers an efficient way to develop and assess a product design. ...
Article
Curves are complex data. Tools for visualizing, exploring, and discovering the structure of a data set of curves are valuable. In this paper, we propose a scalable methodology to solve this challenge. On the one hand, we consider two distances in the shape and size space, one well-known distance and another recently proposed, which differentiate the contribution in shape and in size of the elements considered to compute the distance. On the other hand, we use archetypoid analysis (ADA) for the first time in elastic shape analysis. ADA is a recent technique in unsupervised statistical learning, whose objective is to find a set of archetypal observations (curves in this case), in such a way that we can describe the data set as convex combinations of these archetypal curves. This makes interpretation easy, even for non-experts. Archetypal curves or pure types are extreme cases, which also facilitates human understanding. The methodology is illustrated with a simulated data set and applied to a real problem. It is important to know the distribution of foot shapes to design suitable footwear that accommodates the population. For this purpose, we apply our proposed methodology to a real data set composed of foot contours from the adult Spanish population.
... The simultaneous combination of several variables is seemingly the right procedure to increase team complexity in football athletes (Eugster, 2012;Liu et al., 2015). This study was in line with previous studies which stated that teamwork is reciprocally working with two or more team members to achieve a goal (Collins & Durand-Bush, 2019;Gaffney, 2015). ...
Article
This study aims to examine the effectiveness of games experience learning-based training on the teamwork, respect attitude, skills, and physical ability of young footballers. Materials and methods. Participants were young male football players who had participated in matches at the district, provincial and national levels. The 46 participants were divided into control and intervention groups of 23 each. The footballers, aged 10-12 years, had a height of 140.98 ± 7.23 cm and an average body weight of 34.33 ± 8.64 kg. The instruments consisted of expert-validated observation rubrics to measure teamwork and respect attitude; passing and dribbling skill tests; and 30-meter sprints, vertical jumps, and multistage fitness tests to measure the physical abilities of speed, leg power, and aerobic endurance. Results. Descriptive data showed differences between the mean values and standard deviations of the control and intervention groups, with better values in the intervention group, while the hypothesis test using the Mann-Whitney test obtained a teamwork attitude score of 0.039
... 3,5,7,8,13-17 These variables include the following: number of yellow and red cards received; number of conceded fouls that originated from dangerous or non-dangerous situations; number of defensive actions followed by head or body pass; defensive actions resulting in ball out of the field or ball loss to the opponent; tackles; successful interceptions; number of shots blocked; duels and aerial duels; successful or unsuccessful 1 × 1 defensive actions; and anticipation and ball recovery. As stated by Eugster,18 the combination of different variables may comprise a fine technique to assess individual performance of players in respect to team collective performance. ...
Article
This study sought to create a formula called the Defensive Golden Index (DGI) that assigns an index to each player according to his individual and collective defensive performance. This method helps identify the best-ranked players in defending play, designated as Defensive Golden Players. The study was organized in two main phases: (1) selection, definition, and validation, including weighting assignment, of a set of variables associated with players’ defensive performance by applying statistical techniques to standardize variable values that compose the DGI formula; and (2) applicability of the DGI formula to rank players’ performance of Fútbol Club Barcelona (FCB) according to the opponents’ level in 38 games of La Liga 2018–2019. A set of variables (e.g. tackles, anticipation, successful interception, successful 1 × 1 defensive) were included in a questionnaire and (re)conceptualized based on previous literature and author’s experience as coaches and match analysts. The survey was reviewed by 11 football experts to ascertain the importance of variables, as well as validate and determine the weight of each variable. Each defensive variable was divided by the time played. Then the individual values were subtracted from the mean and divided by the standard deviation of the variable. The results of the DGI formula were obtained by multiplying the weights from the experts for each of the 12 standardized defensive variables by the sum of each variable. In general, the best-ranked FCB players (Defensive Golden Players) were Gerard Piqué, Clement Lenglet, and Sergio Busquets. Moreover, considering the opponents’ levels, the same players were best-ranked, however, with different order in the top three positions. The DGI formula is a valuable method for capturing individual and collective performance of football players in defending play.
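The processing steps listed in this abstract (per-time normalization, standardization, expert weighting) can be sketched as a generic weighted z-score index. This is one plausible reading under stated assumptions, not the published DGI formula: the final aggregation as a weighted sum of z-scores, the use of the sample standard deviation, the function name `weighted_z_index`, and all numbers are hypothetical.

```python
from statistics import mean, stdev


def weighted_z_index(counts, minutes, weights):
    """Weighted sum of per-minute z-scores for each player (assumed aggregation).

    counts[p][v]: raw count of variable v for player p
    minutes[p]:   time played by player p
    weights[v]:   expert weight of variable v
    """
    players = list(counts)
    n_vars = len(weights)
    # Step 1: divide each variable by the time played.
    rates = {p: [counts[p][v] / minutes[p] for v in range(n_vars)] for p in players}
    index = {p: 0.0 for p in players}
    for v in range(n_vars):
        column = [rates[p][v] for p in players]
        m, s = mean(column), stdev(column)
        for p in players:
            # Step 2: subtract the mean and divide by the standard deviation.
            z = (rates[p][v] - m) / s if s > 0 else 0.0
            # Step 3: apply the expert weight and accumulate.
            index[p] += weights[v] * z
    return index


# Hypothetical data: two defensive variables (e.g. tackles, interceptions).
counts = {"A": [9, 0], "B": [18, 9], "C": [27, 18]}
minutes = {"A": 90, "B": 90, "C": 90}
weights = [2.0, 1.0]

index = weighted_z_index(counts, minutes, weights)
ranking = sorted(index, key=index.get, reverse=True)
```

Standardizing before weighting puts variables with very different typical counts on a common scale, so the weights express importance rather than magnitude.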
... AA was defined by [15] and has been applied in a broad spectrum of fields, such as biology [18], developmental psychology [69], didactics [11], engineering [25,85,81,24,60,2,84], finance [61], genetics [78], global development [22], image processing [10], machine learning problems [64], market research [68], multidocument summarization [13], neuroscience [80,38] and sports [27,82,83]. With AA we can see all samples by looking at a few based on extreme profiles, but these extreme profiles should not be outliers. ...
Article
The problem of detecting outliers in multivariate data sets with continuous numerical features is addressed by a new method. This method combines projections into relevant subspaces by archetype analysis with a nearest neighbor algorithm, through an appropriate ensemble of the results. Our method is able to detect an anomaly in a simple data set with a linear correlation of two features, while other methods fail to recognize that anomaly. Our method performs among top in an extensive comparison with 23 state-of-the-art outlier detection algorithms with several benchmark data sets. Finally, a novel industrial data set is introduced, and an outlier analysis is carried out to improve the fit of footwear, since this kind of analysis has never been fully exploited in the anthropometric field.
... Due to the intuitive rationale and interesting properties of AA, the method has been widely used for benchmarking purposes in many scientific domains (Moliner and Epifanio, 2019), such as marketing (Li et al., 2003), astrophysics (Chan et al., 2003), sports analytics (Eugster, 2012), biology (Thøgersen et al., 2013), medicine (Elze Tobias et al., 2015), scientometrics and bibliometrics (Seiler and Wohlrabe, 2013), multi-document summarization (Canhasi and Kononenko, 2014), neuroscience (Tsanousa et al., 2015), etc. In Software Engineering, AA has been introduced in (Mittas et al., 2014; Mittas and Angelis, 2020), in which the objectives were the evaluation of the predictive capabilities of a set of Software Effort Estimation (SEE) models and the building of ensembles using a subset of inferior models, whereas in (Kosti et al., 2016), the authors explored psychometric data in order to extract different software engineers' profiles based on measurements from their personality and behavioral characteristics. ...
Article
Full-text available
Software teams are often asked to deliver new features within strict deadlines leading developers to deliberately or inadvertently serve “not quite right code” compromising software quality and maintainability. This non-ideal state of software is efficiently captured by the Technical Debt (TD) metaphor, which reflects the additional effort that has to be spent to maintain software. Although several tools are available for assessing TD, each tool essentially checks software against a particular ruleset. The use of different rulesets can often be beneficial as it leads to the identification of a wider set of problems; however, for the common usage scenario where developers or researchers rely on a single tool, diverse estimates of TD and the identification of different mitigation actions limits the credibility and applicability of the findings. The objective of this study is two-fold: First, we evaluate the degree of agreement among leading TD assessment tools. Second, we propose a framework to capture the diversity of the examined tools with the aim of identifying few “reference assessments” (or class/file profiles) representing characteristic cases of classes/files with respect to their level of TD. By extracting sets of classes/files exhibiting similarity to a selected profile (e.g., that of high TD levels in all employed tools) we establish a basis that can be used either for prioritization of maintenance activities or for training more sophisticated TD identification techniques. The proposed framework is illustrated through a case study on fifty (50) open source projects and two programming languages (Java and JavaScript) employing three leading TD tools.
... These kinds of organizations are categorized as amateur ones, since their emphasis is on fun through introducing the kind of sport they present [26]. The next type of approach is based on an expert with long experience in a certain field of sport, who uses this experience to make his/her own prediction on a match, player quality or even a player's movement season. ...
Technical Report
Full-text available
Due to the rapidly growing use of technology, modern sports have become more complex. Both neural networks and data mining are frequently used as effective approaches for analyzing the huge amounts of data collected in sports. Fencing is one of the most popular sports. Due to its dynamics, a large number of events occur during a single match. Fencing statisticians have the task of noting as many of these events as possible. In this paper, the main aim is to use neural networks to collect statistics through Kinect, a line of motion-sensing input devices, in order to create a dataset for training a software system that improves fencers' movement through a training course. A software system was designed and applied for this purpose.
... AA and ADA have been applied to many different fields, such as astrophysics (Chan, Mitchell and Cram, 2003), biology (D'Esposito, Palumbo and Ragozini, 2012), climate (Steinschneider and Lall, 2015; Su et al., 2017), developmental psychology (Ragozini, Palumbo and D'Esposito, 2017), e-learning (Theodosiou et al., 2013), finance (Moliner and Epifanio, 2019), genetics (Thøgersen et al., 2013), human development (Epifanio, 2016; Epifanio, Ibáñez and Simó, 2020), industrial engineering (Epifanio, Ibáñez and Simó, 2018; Millán-Roures, Epifanio and Martínez, 2018; Alcacer et al., 2020), machine learning (Mørup and Hansen, 2012; Seth and Eugster, 2016a,b; Ragozini and D'Esposito, 2015; Cabero and Epifanio, 2019), market research (Li et al., 2003; Porzio, Ragozini and Vistocco, 2008; Midgley and Venaik, 2013), multidocument summarization (Kononenko, 2013, 2014), nanotechnology (Fernandez and Barnard, 2015), neuroscience (Tsanousa, Laskaris and Angelis, 2015) and sports (Eugster, 2012; Vinué and Epifanio, 2017, 2019). ...
Article
Full-text available
Archetypal analysis is an exploratory tool that explains a set of observations as mixtures of pure (extreme) patterns. If the patterns are actual observations of the sample, we refer to them as archetypoids. For the first time, we propose to use archetypoid analysis for binary observations. This tool can contribute to the understanding of a binary data set, as in the multivariate case. We illustrate the advantages of the proposed methodology in a simulation study and two applications, one exploring objects (rows) and the other exploring items (columns). One is related to determining student skill set profiles and the other to describing item response functions. MSC: 62H99, 62P25, 97D60.
... A small group of human models that represents the anthropometric variability of the target population is commonly used in ergonomic design and evaluation. Working with a small group of cases, the test cases, ... AA and ADA applications have been growing at a great rate and they can be found in a diverse range of disciplines, such as biology [39], computer vision [40][41][42][43][44][45], developmental psychology [46], engineering [11,13,47,48], finance [49], genetics [50], global development [51], machine learning problems [52], market research [53], multi-document summarization [54], neuroscience [55,56] and sports [57][58][59]. ...
Article
Full-text available
The taxonomy of foot shapes or other parts of the body is important, especially for design purposes. We propose a methodology based on archetypoid analysis (ADA) that overcomes the weaknesses of previous methodologies used to establish typologies. ADA is an objective, data-driven methodology that seeks extreme patterns, the archetypal profiles in the data. ADA also explains the data as percentages of the archetypal patterns, which makes this technique understandable and accessible even for non-experts. Clustering techniques are usually considered for establishing taxonomies, but we will show that finding the purest or most extreme patterns is more appropriate than using the central points returned by clustering techniques. We apply the methodology to an anthropometric database of 775 3D right foot scans representing the Spanish adult female and male population for footwear design. Each foot is described by a 5626 × 3 configuration matrix of landmarks. No multivariate features are used for establishing the taxonomy, but all the information gathered from the 3D scanning is employed. We use ADA for shapes described by landmarks. Women’s and men’s feet are analyzed separately. We have analyzed 3 archetypal feet for both men and women. These archetypal feet could not have been recovered using multivariate techniques.
... Archetypal Analysis (AA) has been applied to various impactful real-world problems (e.g. [15,5,2,6,20]). ...
Conference Paper
Full-text available
This paper further investigates and compares a method for fuzzy clustering which retrieves pure individual types from data, known as the fuzzy clustering with proportional membership (FCPM), with the FurthestSum Archetypal Analysis algorithm (FS-AA). The Anomalous Pattern (AP) initialization algorithm, an algorithm that sequentially extracts clusters one by one in a manner similar to principal component analysis, is shown to outperform the FurthestSum, not only by improving the convergence of the FCPM and AA algorithms but also by being able to model the number of clusters to extract from data. A study comparing nine information-theoretic validity indices and the soft ARI has shown that the soft Normalized Mutual Information max (NMIsM) and the Adjusted Mutual Information (AMI) indices are more adequate to assess the quality of FCPM and AA partitions than soft internal validity indices. The experimental study was conducted by exploring a collection of 99 synthetic data sets generated from a proper data generator, the FCPM-DG, covering various dimensionalities, as well as 18 benchmark data sets from machine learning.
... AA was proposed by Cutler and Breiman (1994) and it has found applications in diverse fields, such as computer vision (Chen et al., 2014;Bauckhage et al., 2015;Sun et al., 2017a;b;Mair et al., 2017), developmental psychology (Ragozini et al., 2017), engineering (Epifanio et al., 2013;Vinué et al., 2015;Vinué, 2017;Epifanio et al., 2018b;Millán-Roures et al., 2018), finance (Moliner and Epifanio, 2019), genetics (Thøgersen et al., 2013), global development (Epifanio, 2016;Epifanio et al., 2018a), machine learning problems (Mørup and Hansen, 2012), neuroscience (Tsanousa et al., 2015;Hinrich et al., 2016) and sports (Eugster, 2012;Vinué and Epifanio, 2017). ...
Article
Full-text available
Texture segmentation is one of the main tasks in image applications, specifically in remote sensing, where the objective is to segment high-resolution images of natural landscapes into different cover types. Often the focus is on the selection of discriminant textural features, and although these are really fundamental, there is another part of the process that is also influential, partitioning different homogeneous textures into groups. A methodology based on archetype analysis (AA) of the local textural measurements is proposed. AA seeks the purest textures in the image and it can find the borders between pure textures, as those regions composed of mixtures of several archetypes. The proposed procedure has been tested on a remote sensing image application with local granulometries, providing promising results.
... In order to define the optimal number of ecological strategies necessary to characterize the global lizard assemblage, we used Archetypal Analysis (AA) which is increasingly used in economics (Li, Wang, Louviere, & Carson, 2003), human health science (Prabhakaran, 2014), sport (Eugster, 2012) and astronomy (Chan, Mitchell, & Cram, 2003). ...
Article
Aim: Understanding the mechanisms determining species richness is a primary goal of biogeography. Richness patterns of sub‐groups within a taxon are usually assumed to be driven by similar processes. However, if richness of distinct ecological strategies respond differently to the same processes, inferences made for an entire taxon may be misleading. We deconstruct the global lizard assemblage into functional groups and examine the congruence among richness patterns between them. We further examine the species richness – functional richness relationship to elucidate the way functional diversity contributes to the overall species richness patterns. Location: Global. Methods: Using comprehensive biological trait databases we classified the global lizard assemblage into ecological strategies based on body size, diet, activity times and microhabitat preferences, using Archetypal Analysis. We then examined spatial gradients in the richness of each strategy at the one‐degree grid cell, biome, and realm scales. Results: We found that lizards can best be characterized by seven “ecological strategies”: scansorial, terrestrial, nocturnal, herbivorous, fossorial, large, and semi‐aquatic. There are large differences among the global richness patterns of these strategies. While the major richness hotspot for lizards in general is in Australia, several strategies exhibit highest richness in the Amazon Basin. Importantly, the global maximum in lizard species richness is achieved at intermediate values of functional diversity, and increasing functional diversity further results in a slow decline of species richness. Main conclusions: The deconstruction of the global lizard assemblage along multiple ecological axes offers a new way to conceive lizard diversity patterns.
It suggests that local lizard richness mostly increases when species belonging to particular ecological strategies become hyper‐diverse there, and not because more ecological types are present in the most species rich localities. Thus maximum richness and maximum ecological diversity do not overlap. These results shed light on the global richness pattern of lizards, and highlight previously unidentified spatial patterns in understudied functional groups.
... AA and ADA have therefore aroused the interest of researchers working in different fields, such as astrophysics (Chan et al. (2003)), biology (D'Esposito et al. (2012)), climate (Steinschneider and Lall (2015); Su et al. (2017); Epifanio et al. (2018a)), developmental psychology (Ragozini et al. (2017)), e-learning (Theodosiou et al. (2013)), engineering (Epifanio et al. (2013, 2018b); Millán-Roures et al. (2018)), genetics (Thøgersen et al. (2013)), machine learning (Mørup and Hansen (2012); Seth and Eugster (2016a,b); Ragozini and D'Esposito (2015)), multi-document summarization (Kononenko (2013, 2014)), nanotechnology (Fernandez and Barnard (2015)), neuroscience (Tsanousa et al. (2015); Hinrich et al. (2016)) and sports (Eugster (2012)). AA has also been applied in market research (Li et al. (2003); Porzio et al. (2008); Midgley and Venaik (2013)) in the multivariate context. ...
Article
Full-text available
Archetypal analysis approximates data by means of mixtures of actual extreme cases (archetypoids) or archetypes, which are a convex combination of cases in the data set. Archetypes lie on the boundary of the convex hull. This makes the analysis very sensitive to outliers. A robust methodology by means of M-estimators for classical multivariate and functional data is proposed. This unsupervised methodology allows complex data to be understood even by non-experts. The performance of the new procedure is assessed in a simulation study, where a comparison with a previous methodology for the multivariate case is also carried out, and our proposal obtains favorable results. Finally, robust bivariate functional archetypoid analysis is applied to a set of companies in the S&P 500 described by two time series of stock quotes. A new graphic representation is also proposed to visualize the results. The analysis shows how the information can be easily interpreted and how even non-experts can gain a qualitative understanding of the data.
... This has meant that the applications of AA and ADA have spread to many different fields, such as astrophysics (Chan et al., 2003), biology (D'Esposito et al., 2012), climate (Steinschneider and Lall, 2015; Su et al., 2017), developmental psychology (Ragozini et al., 2017), e-learning (Theodosiou et al., 2013), genetics (Thøgersen et al., 2013), human development (Epifanio, 2016), industrial engineering (Millán-Roures et al., 2018), machine learning (Mørup and Hansen, 2012; Seth and Eugster, 2016a,b; Ragozini and D'Esposito, 2015), market research (Li et al., 2003; Porzio et al., 2008; Midgley and Venaik, 2013), multi-document summarization (Kononenko, 2013, 2014), nanotechnology (Fernandez and Barnard, 2015), neuroscience (Tsanousa et al., 2015; Hinrich et al., 2016) and sports (Eugster, 2012; Vinué and Epifanio, 2017). ...
Article
Full-text available
In this paper we propose several methodologies for handling missing or incomplete data in Archetype analysis (AA) and Archetypoid analysis (ADA). AA seeks to find archetypes, which are convex combinations of data points, and to approximate the samples as mixtures of those archetypes. In ADA, the representative archetypal data belong to the sample, i.e. they are actual data points. With the proposed procedures, missing data are not discarded or previously filled by imputation and the theoretical properties regarding location of archetypes are guaranteed, unlike the previous approaches. The new procedures adapt the AA algorithm either by considering the missing values in the computation of the solution or by skipping them. In the first case, the solutions of previous approaches are modified in order to fulfill the theory and a new procedure is proposed, where the missing values are updated by the fitted values. In this second case, the procedure is based on the estimation of dissimilarities between samples and the projection of these dissimilarities in a new space, where AA or ADA is applied, and those results are used to provide a solution in the original space. A comparative analysis is carried out in a simulation study, with favorable results. The methodology is also applied to two real data sets: a well-known climate data set and a global development data set. We illustrate how these unsupervised methodologies allow complex data to be understood, even by non-experts.
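For readers unfamiliar with the two methods named in this abstract, the verbal description (archetypes as convex combinations of data points, samples as mixtures of archetypes) matches the standard formulation of Cutler and Breiman (1994), which can be sketched as follows. The symbols $X$, $A$, $B$, $n$, $m$ and $k$ are notation assumed for this illustration and are not taken from the cited paper.

```latex
% X : n x m data matrix (n samples, m variables)
% A : n x k mixture coefficients, B : k x n archetype coefficients
\min_{A,\,B}\; \lVert X - A B X \rVert_F^2
\quad \text{subject to} \quad
\begin{cases}
a_{ij} \ge 0, \;\; \sum_{j=1}^{k} a_{ij} = 1, & i = 1,\dots,n, \\[4pt]
b_{jl} \ge 0, \;\; \sum_{l=1}^{n} b_{jl} = 1, & j = 1,\dots,k.
\end{cases}
% The k archetypes are the rows of Z = B X.
% In ADA the second constraint is tightened to b_{jl} \in \{0, 1\},
% so each row of Z is an actual observation (an archetypoid).
```

Under these constraints each fitted sample is a convex combination of the archetypes, which is why the coefficients in a row of $A$ can be read as percentages.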
... Class-based grouping methods such as Principal Component Analysis and Cluster Analysis are very common in sports analyses and are based mainly on performance indicators (Kubatko, Oliver, Pelton, & Rosenbaum, 2007). These values are usually explored in the literature by means of normative data, which describe the means and infer the most frequent indicators among athletes and/or teams (Eugster, 2012). Archetypal Analysis (AA), unlike the methods mentioned, considers the extreme values and multivariate characteristics of a large set of variables (Cutler & Breiman, 1994 ...
... Archetypal Analysis [5] is a statistical method that synthesizes a set of multivariate data points through a few representatives, called archetypes, in such a way that the data points are a convex combination of the archetypes, and the archetypes are also a convex combination of the data points, so that those representatives lie on the convex hull of the data set. Archetypal Analysis (AA) has been applied in ecology, benchmarking for performance analysis, talent analysis in sports and in education, profiling of scientific activities, to name but a few [6,21,10,4]. ...
Chapter
Full-text available
This paper presents a comparative study between a method for fuzzy clustering which retrieves pure individual types from data, the fuzzy clustering with proportional membership (FCPM), and an archetypal analysis algorithm based on the Furthest-Sum approach (FS-AA). A simulation study comprising 82 data sets is conducted with a proper data generator, FCPM-DG, whose goal is twofold: first, to analyse the ability of the archetypal clustering algorithm to recover archetypes from data of distinct dimensionality; second, to analyse the robustness of the FCPM and FS-AA algorithms to outliers. The effectiveness of these algorithms is further compared on clustering 12 diverse benchmark data sets from machine learning. The evaluation conducted with five primer unsupervised validation indices shows the good quality of the clustering solutions.
... AA and ADA have therefore aroused the interest of researchers working in different fields, such as astrophysics (Chan et al. (2003)), biology (D'Esposito et al. (2012)), climate (Steinschneider and Lall (2015); Su et al. (2017)), developmental psychology (Ragozini et al. (2017)), e-learning (Theodosiou et al. (2013)), genetics (Thøgersen et al. (2013)), industrial engineering (Epifanio et al. (2013)), machine learning (Mørup and Hansen (2012); Seth and Eugster (2016a,b); Ragozini and D'Esposito (2015)), multi-document summarization (Kononenko (2013, 2014)), nanotechnology (Fernandez and Barnard (2015)), neuroscience (Tsanousa et al. (2015); Hinrich et al. (2016)) and sports (Eugster (2012)). AA has also been applied in market research (Li et al. (2003); Porzio et al. (2008); Midgley and Venaik (2013)) in the multivariate context. ...
Preprint
Full-text available
Archetypal analysis approximates data by means of mixtures of actual extreme cases (archetypoids) or archetypes, which are a convex combination of cases in the data set. Archetypes lie on the boundary of the convex hull. This makes the analysis very sensitive to outliers. A robust methodology by means of M-estimators for classical multivariate and functional data is proposed. This unsupervised methodology allows complex data to be understood even by non-experts. The performance of the new procedure is assessed in a simulation study, where a comparison with a previous methodology for the multivariate case is also carried out, and our proposal obtains favorable results. Finally, robust bivariate functional archetypoid analysis is applied to a set of companies in the S&P 500 described by two time series of stock quotes. A new graphic representation is also proposed to visualize the results. The analysis shows how the information can be easily interpreted and how even non-experts can gain a qualitative understanding of the data.
... The decomposed archetypes are restricted to be a mixture of actual observations, which can ease their physical interpretability. AA has also been widely expanded into data analysis in many scientific research fields [17][18][19][20]. ...
Article
Full-text available
Understanding the spatiotemporal variability of seasonal extreme precipitation and its linkage with climate patterns is of great importance for water resource management over the Yangtze River Basin. Hence, this study examined the spatiotemporal variability of seasonal extreme precipitation through the archetypal analysis (AA), by which observations were decomposed and characterized as several extreme modes. Six archetypes were identified and can obviously exhibit the features of events with above average or below average precipitation. Summer precipitation is the most variable compared to the winter, spring, and autumn precipitation through the trend analysis. It ranged from extremely dry (A6) to normal (A1 and A2) to extremely wet (A4). Climate teleconnections to the four archetypes for summer precipitation and relative importance of climate patterns were thus investigated. Results show that El Niño Southern Oscillation index is the strongest determinant of the ensuing archetypes representing the events with above average precipitation, while the Atlantic Multi-decadal Oscillation (AMO) contributes most to the events with below-average precipitation. A warm phase of the Pacific Decadal Oscillation (PDO) is significantly correlated with the above-average precipitation.
... It is an unsupervised method, since the algorithm is given only the raw data, not any descriptive or outcome-related information (e.g., diagnoses). It has been used in a wide variety of fields, including sports analytics (33), astrophysics (34), marketing (35), and bioinformatics/medicine (16,36,37). Although it is common to show archetype results on a PCA plot as we do here, AA is conceptually very different from PCA. ...
Article
Conventional histologic diagnosis of rejection in kidney transplants has limited repeatability due to its inherent requirement for subjective assessment of lesions, in a rule-based system that does not acknowledge diagnostic uncertainty. Molecular phenotyping affords opportunities for increased precision and improved disease classification to address the limitations of conventional histologic diagnostic systems and quantify levels of uncertainty. Microarray data from 1,208 kidney transplant biopsies were collected prospectively from 13 centers. Cross-validated classifier scores predicting the presence of antibody-mediated rejection (ABMR), T cell-mediated rejection (TCMR), and 5 related histologic lesions were generated using supervised machine learning methods. These scores were used as input for archetypal analysis, an unsupervised method similar to cluster analysis, to examine the distribution of molecular phenotypes related to rejection. Six archetypes were generated: no rejection, TCMR, 3 associated with ABMR (early-stage, fully developed, and late-stage), and mixed rejection (TCMR plus early-stage ABMR). Each biopsy was assigned 6 scores, one for each archetype, representing a probabilistic assessment of that biopsy based on its rejection-related molecular properties. Viewed as clusters, the archetypes were similar to existing histologic Banff categories, but there was 32% disagreement, much of it probably reflecting the "noise" in the current histologic assessment system. Graft survival was lowest for fully developed and late-stage ABMR, and it was better predicted by molecular archetype scores than histologic diagnoses. The results provide a system for precision molecular assessment of biopsies and a new standard for recalibrating conventional diagnostic systems.
... ADA is available in the R package Anthropometry (Vinué, Epifanio, Simó, Ibáñez, Domingo, and Ayala (2015b)). The fields of application include, for instance, market research (Li et al. (2003), Porzio et al. (2008), Midgley and Venaik (2013)), biology (D'Esposito et al. (2012)), genetics (Thøgersen et al. (2013)), sports (Eugster (2012)), industrial engineering (Epifanio et al. (2013); Vinué et al. (2015a)), the evaluation of scientists (Seiler and Wohlrabe (2013)), astrophysics (Chan et al. (2003), Richards et al. (2012)), e-learning (Theodosiou et al. (2013)), multidocument summarization (Kononenko (2013, 2014)) and different machine learning problems (Mørup and Hansen (2012), Stone (2002)). In the seminal paper by Cutler and Breiman (1994), one of the illustrative examples worked with functional observations, i.e., data consisting of a set of functions, although they converted them into a matrix by considering a set of values of each curve (after being smoothed) at certain points. ...
Article
Full-text available
Archetype and archetypoid analysis can be extended to functional data. Each function is represented as a mixture of actual observations (functional archetypoids) or functional archetypes, which are a mixture of observations in the data set. Well-known Canadian temperature data are used to illustrate the analysis developed. Computational methods are proposed for performing these analyses, based on the coefficients of a basis. They are compared with other alternatives in a simulation study using a well-known curve discrimination problem, achieving better or similar performance. Unlike a previous attempt to compute functional archetypes, which was only valid for an orthogonal basis, the proposed methodology can be used for any basis. It is computationally less demanding than the simple approach of discretizing the functions. Multivariate functional archetype and archetypoid analysis are also introduced and applied in an interesting problem about the study of human development around the world over the last 50 years. These tools can contribute to the understanding of a functional data set, as in the multivariate case.
... Excellence in sports is elusive by definition and both coaches and sports scientists know that for sure. The science of coaching searches for performance indicators and models (Eugster, 2012;Franks and Goodman, 1986) that may help in the decision making of coaches and trainers. The characteristics of the game of basketball (Parlebas, 1999) make this search a challenging venture that must take into account not only its structural complexity (Martín Acero and Lago Peñas, 2005) but also the psychosocial issues related to players, teams and coaches themselves. ...
Article
Full-text available
The purpose of this study was to examine the relationships between collective efficacy, cohesion and team performance in Spanish amateur female basketball. Players (N = 68) from 8 basketball teams completed the Spanish versions of the Collective Efficacy Questionnaire for Sport and the Group Environment Questionnaire for Sports at the end of the 2011/2012 season. Collective efficacy was a powerful explanatory factor of success whereas general group cohesion was not. Even so, the players from successful teams declared higher levels of attraction to and integration with the group in relation to the task.
... In recent years, researchers made great efforts to develop meaningful and useful performance profiles in sport (Butterworth et al., 2013; Eugster, 2012; Hughes et al., 2001; James et al., 2005; O'Donoghue, 2005). The main profiling techniques were discussed and compared by O'Donoghue (2013). ...
Article
Full-text available
Performance of football teams varies constantly due to the dynamic nature of this sport, whilst the typical performance and its spread can be represented by profiles combining different performance-related variables based on data from multiple matches. The current study aims to use a profiling technique to evaluate and compare match performance of football teams in the UEFA Champions League incorporating three situational variables (i.e. strength of team and opponent, match outcome and match location). Match statistics of 72 teams, 496 games across four seasons (2008-09 to 2012-13) of this competition were analysed. Sixteen performance-related events were included: shots, shots on target, shots from open play, shots from set piece, shots from counter attack, passes, pass accuracy (%), crosses, through balls, corners, dribbles, possession, aerial success (%), fouls, tackles, and yellow cards. Teams were classified into three levels of strength by a k-cluster analysis. Profiles of overall performance and profiles incorporating three situational variables for teams of all three levels of strength were set up by presenting the mean, standard deviation, median, lower and upper quartiles of the counts of each event to represent their typical performances and spreads. Means were compared by using one-way ANOVA and independent sample t test (for match location, home and away differences), and were plotted into the same radar charts after unifying all the event counts by standardised score. Established profiles can present straightforwardly typical performances of football teams of different levels playing in different situations, which could provide detailed references for coaches and analysts to evaluate performances of upcoming opposition and of their own.
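The profile construction described in this abstract (mean, standard deviation, median and quartiles per event, followed by standardised scores for plotting) can be illustrated with a minimal Python sketch. The function names `profile` and `standardise` and the shot counts are hypothetical, and the quartile rule used here (the default of `statistics.quantiles`) is an assumption, since the abstract does not say which rule the authors applied.

```python
import statistics


def profile(counts):
    """Summarise one performance indicator across several matches."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    lower_q, median, upper_q = statistics.quantiles(counts, n=4)
    return {
        "mean": statistics.mean(counts),
        "sd": statistics.stdev(counts),  # sample standard deviation
        "median": median,
        "lower_quartile": lower_q,
        "upper_quartile": upper_q,
    }


def standardise(value, counts):
    """Standardised (z) score of one match value relative to the sample."""
    return (value - statistics.mean(counts)) / statistics.stdev(counts)


# Hypothetical shot counts for one team over five matches.
shots = [10, 12, 14, 16, 18]

if __name__ == "__main__":
    print(profile(shots))
    print(standardise(18, shots))
```

Standardising every indicator in this way puts counts with very different ranges (for example passes and yellow cards) on one scale, which is what allows them to share a radar chart.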
... Porzio et al. [18] used AA for defining reference performers of leading universities across the world on the basis of six performance indicators. Towards the direction of benchmarking, Eugster [19] proposed a method for analyzing the performance profiles of sports analytic data. ...
Article
Full-text available
Research on software cost estimation has produced not only a large number of prediction methodologies and improvement techniques, but also numerous methods for evaluating and comparing them. The identification of the best prediction model for a specific dataset is still an open issue since the evaluation of candidate models is essentially a multi-criteria problem. Model comparison usually involves statistical hypothesis tests with respect to a single criterion, while for multiple criteria, aggregating methods are usually employed. In the current study, we investigate the alternative approach of benchmarking, which is different from model comparison. The general idea is first to choose among the competitors a few "reference models" with special, preferably divergent performance characteristics with respect to multiple criteria and then to examine the placement of all the other models in relation to the reference ones. For solving this problem, we utilize a multivariate statistical method, known as Archetypal Analysis (AA), which provides an appealing and intuitive approach for the identification of the reference models and the subsequent benchmarking of all the competitors. The competitor models are considered as points in a multi-dimensional space, defined by the prediction performance criteria, while AA locates the archetypes, i.e. the reference models which determine the convex hull of the swarm of all points (competitors). Apart from identifying reference models for benchmarking with superior or inferior predictive power according to several accuracy measures, the proposed methodology utilizes the similarity of a subset of models to a "superior" archetype in order to provide a mechanism for building ensembles. The proposed methodology is applied to a dataset containing performance measures of seventy-five models which were initially trained and tested on 195 Web projects of the TUKUTUKU database. The application illustrates the straightforwardness and the intuitively attractive interpretation of the derived results.
Technical Report
Full-text available
In this short note, we discuss the use of archetypal analysis in clustering. The underlying ideas are straightforward and very easy to implement in NumPy.
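As a minimal sketch of one common way to turn archetypal analysis into a clustering (not necessarily the procedure used in the note above), each observation can be assigned to the archetype with the largest convex coefficient. The function name `cluster_by_dominant_archetype` and the coefficient matrix `A_demo` are hypothetical; the coefficients are assumed to come from any archetypal analysis solver.

```python
import numpy as np

def cluster_by_dominant_archetype(A):
    """Assign each observation to the archetype with the largest coefficient.

    A is an (n_observations, n_archetypes) matrix whose rows are convex
    coefficients (non-negative, summing to one), assumed to be precomputed.
    """
    A = np.asarray(A, dtype=float)
    if (A < 0).any() or not np.allclose(A.sum(axis=1), 1.0):
        raise ValueError("rows of A must be non-negative and sum to one")
    return A.argmax(axis=1)

# Hypothetical coefficients for four observations and three archetypes.
A_demo = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
])
labels = cluster_by_dominant_archetype(A_demo)
print(labels)
```

A hard assignment like this discards the mixture information in each row; the full coefficient vector can be kept alongside the label when soft memberships matter.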
Technical Report
Full-text available
Archetypal analysis is an increasingly popular tool for data mining and pattern recognition. In this note, we first discuss how to solve the underlying optimization problem using plain vanilla Frank-Wolfe optimization and then present an efficient NumPy implementation of this approach.
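The following is a rough sketch of how alternating Frank-Wolfe updates can be applied to archetypal analysis in NumPy. It is an illustration under our own assumptions, not the implementation from the report above: the function names, the row-wise layout (observations in the rows of `X`), the step size 2/(t+2), and the iteration counts are all choices made here.

```python
import numpy as np

def _frank_wolfe_rows(grad_fn, M, n_steps):
    """Frank-Wolfe on a matrix whose rows each live on the unit simplex."""
    rows = np.arange(M.shape[0])
    for t in range(n_steps):
        G = grad_fn(M)
        # Linear minimisation over the simplex picks, per row, the vertex
        # (one-hot vector) at the coordinate with the smallest gradient.
        vertex = np.zeros_like(M)
        vertex[rows, G.argmin(axis=1)] = 1.0
        gamma = 2.0 / (t + 2.0)  # standard step size; equals 1 at t = 0
        M = M + gamma * (vertex - M)  # convex combination stays feasible
    return M

def archetypal_analysis(X, k, n_outer=30, n_inner=50, seed=0):
    """Approximately minimise ||X - A B X||_F^2 with row-stochastic A and B."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Start with k randomly chosen observations as archetypes.
    B = np.zeros((k, n))
    B[np.arange(k), rng.choice(n, size=k, replace=False)] = 1.0
    A = np.zeros((n, k))
    A[:, 0] = 1.0
    for _ in range(n_outer):
        Z = B @ X  # current archetypes, shape (k, n_features)
        # Update the coefficients A with the archetypes held fixed.
        A = _frank_wolfe_rows(lambda M: 2.0 * (M @ Z - X) @ Z.T, A, n_inner)
        # Update the archetype weights B with the coefficients held fixed.
        B = _frank_wolfe_rows(
            lambda M: 2.0 * A.T @ (A @ M @ X - X) @ X.T, B, n_inner
        )
    return A, B, B @ X

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 2))
A_fit, B_fit, Z_fit = archetypal_analysis(X_demo, k=3, n_outer=10, n_inner=30)
print(Z_fit)
```

Because every update is a convex combination of the current iterate and a simplex vertex, the constraints hold throughout without any projection step, which is the main appeal of Frank-Wolfe in this setting.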
Article
The objective of this review was to systematically describe the traditional and contemporary data capture and analytic methods employed in performance analysis research in team invasion sports, evaluate the practicality of these methods, and formulate practical recommendations on methods for analysing tactics and strategies in team invasion sports. A systematic search of the databases SPORTDiscus, Web of Science, Scopus, MEDLINE and PubMed was performed. Keywords addressed performance analysis methods and team invasion sports, with all other disciplines of sports science excluded. A total of 537 articles were included in the review and six main themes of research identified. Themes included game actions, dynamic game actions, movement patterns, collective team behaviours, social network analysis and game styles. Performance analysis research has predominantly focused on identifying key performance indicators related to success by analysing differences in game actions between successful and less successful teams. However, these measures are outcome-focused and only provide limited insight into winning team’s strategy. Team invasion sports are now viewed as dynamic, complex systems with opposing teams as interacting parts. Strategies and tactics should be analysed using a holistic process-orientated approach by recording dynamic actions, collective team behaviours and passing networks, and viewing them in game styles.
Chapter
This paper further investigates and compares a method for fuzzy clustering which retrieves pure individual types from data, known as the fuzzy clustering with proportional membership (FCPM), with the FurthestSum Archetypal Analysis algorithm (FS-AA). The Anomalous Pattern (AP) initialization algorithm, an algorithm that sequentially extracts clusters one by one in a manner similar to principal component analysis, is shown to outperform the FurthestSum not only by improving the convergence of FCPM and AA algorithms but also by being able to model the number of clusters to extract from data. A study comparing nine information-theoretic validity indices and the soft ARI has shown that the soft Normalized Mutual Information max and the Adjusted Mutual Information (AMI) indices are more adequate to assess the quality of FCPM and AA partitions than soft internal validity indices. The experimental study was conducted exploring a collection of 99 synthetic data sets generated from a proper data generator, the FCPM-DG, covering various dimensionalities as well as 18 benchmark data sets from machine learning.
Article
In this study, a formula called the Golden Index was developed that calculates a value for each player according to their individual and collective performances. This method identifies the most important players, designated as Golden Players, in football attacking plays. This study was organized in two main phases: (1) selection, definition and validation, including weighting assignment, of a set of variables associated with players’ performance through application of statistical techniques to uniformize variables values that compose the Golden Index formula and (2) applicability of the Golden Index formula to quantify players’ performance of Club Atlético de Madrid over 2016-2017 season. A questionnaire was given to football experts in order to validate and determine the weight of each of the 12 variables selected. Descriptive statistics with standardization techniques were used to set the weights and uniformize each variable of the Golden Index formula. Applying the Golden Index formula to Club Atlético de Madrid 2016-2017 season named Koke, Yannick Carrasco and Filipe Luís as the Golden Players, while the centre-backs Lucas Hernández, Stefan Savić and Diego Godín and striker Fernando Torres received negative indexes. Results suggested that the Golden Index formula is a valuable and useful tool in capturing the individual and collective performance of players in attacking play in football.
Book
Full-text available
The dynamic development of the football game, especially of its technical and tactical dimensions as well as the multi-factorial physical features of players, has led to the need to conduct studies and analyse them in order to deepen the understanding of the phenomena involved. The studies and their results presented in this work were aimed at a better understanding of the changes taking place and at indicating trends in the modern football game at the highest sports level. The purpose of the studies was to indicate trends in the monitoring and evaluation of the football game over the last decade, which were characterized as systematic changes in the efficiency of performance. The research material used for the analysis was obtained from the most modern systems for analysing players' game as well as from the exploration of a large number of factors across five championships: the World Championships in 2010 and 2014 and the European Championships in 2008, 2012 and 2016. The calculated game performance indicators made it possible to compare the efficiency of football teams and players participating in individual tournaments. The effectiveness of winning and losing teams was determined. The efficiency of players in specific positions on the pitch was compared. The factors of the game that most strongly determined victory in direct competition, and those that were most important for achieving good results in championships, were also indicated. The studies have shown that since the European Championships in 2008 the activity of shots on goal has decreased, the efficiency of counterattacks, especially among side midfielders, has decreased, and the activity and effectiveness of duels in the air that aim at taking over the ball passed high have increased. It has also been indicated that since the European Championships in 2008 the differences in the efficiency of passing the ball between winning and losing teams, previously beneficial for the winners, have decreased with a countervailing trend. Since the European Championships in 2008 the average distance a football player runs in a match has been lengthening, reaching an average level of 11 000 meters in the World Championships in 2014. The factors that had the greatest impact on the game efficiency of teams during the analyzed tournaments were the parameters of shots, such as the activeness and effectiveness of shots as well as the activeness of shots from the penalty area and from outside of the penalty area. The parameters determining the efficiency of passes included the efficiency of passes that gain space performed in the middle zone, and the efficiency of passes that maintain game space in the zone of gaining space and creating goal situations. The studies also showed that game effectiveness was influenced by defensive actions such as the efficiency of sliding tackles, duels in the air (the fight for taking over the ball passed high) and clearances, whereas among offensive actions it was influenced by the efficiency of crosses gaining the game space.
Article
Full-text available
The evaluation of scientific performance plays an increasingly important role in appointment procedures and in the evaluation of research institutions. Alongside high-quality publications, rankings are taking on ever greater importance in academia. This article presents a new method for classifying researchers and illustrates it using data from the RePEc network, one of the most comprehensive databases of scientific output in economics. Finally, several economists in the German-speaking world are classified using this method.
Article
Full-text available
Archetype and archetypoid analysis are extended to shapes. The objective is to find representative shapes. Archetypal shapes are pure (extreme) shapes. We focus on the case where the shape of an object is represented by a configuration matrix of landmarks. As shape space is not a vectorial space, we work in the tangent space, the linearized space about the mean shape. Then, each observation is approximated by a convex combination of actual observations (archetypoids) or archetypes, which are a convex combination of observations in the data set. These tools can contribute to the understanding of shapes, as in the usual multivariate case, since they lie somewhere between clustering and matrix factorization methods. A new simplex visualization tool is also proposed to provide a picture of the archetypal analysis results. We also propose new algorithms for performing archetypal analysis with missing data and its extension to incomplete shapes. A well-known data set is used to illustrate the methodologies developed. The proposed methodology is applied to an apparel design problem in children.
Article
Full-text available
The development of powerful new 3D scanning techniques has enabled the generation of large up-to-date anthropometric databases which provide highly valued data to improve the ergonomic design of products adapted to the user population. As a consequence, Ergonomics and Anthropometry are two increasingly quantitative fields, so advanced statistical methodologies and modern software tools are required to get the maximum benefit from anthropometric data. This paper presents a new R package, called Anthropometry, which is available on the Comprehensive R Archive Network. It brings together some statistical methodologies concerning clustering, statistical shape analysis, statistical archetypal analysis and the statistical concept of data depth, which have been especially developed to deal with anthropometric data. They are proposed with the aim of providing effective solutions to some common anthropometric problems, such as clothing design or workstation design (focusing on the particular case of aircraft cockpits). The utility of the package is shown by analyzing the anthropometric data obtained from a survey of the Spanish female population performed in 2006 and from the 1967 United States Air Force survey. This manuscript is also contained in Anthropometry as a vignette.
Conference Paper
In this paper we propose a method to analyze and synthesize a set of N networks that refer to a common scenario and that are comparable with each other. Examples of this type of data are: a set of collaboration networks, each defined for a different scientific field; a set of ego networks, where egos belong to the same category; a set of governance networks, etc. For these kinds of sets of networks it can be of interest to find a small number of representative networks that can serve as a condensed view of the data set. From a statistical perspective this goal amounts to finding a small number of networks that are able to typify the network structures starting from the observed ones. In addition, these networks should have a clear and interpretable profile in terms of their most relevant features and their specificity in contrast to the others. Given the set of N networks, we propose to find these representative networks by using archetypal analysis, yielding what we call Archetypal Networks. The Archetypal Networks can serve to understand the data structure and act as benchmarks for the other networks, and are also useful for comparing networks with each other. We exemplify the proposed procedure by analyzing a set of 36 governance networks of public structures devoted to providing youth services and referring to 36 different territorial districts in the Campania region in Italy. Our results highlight the presence of different network structures that can be interpreted in terms of the governance forms established in the literature.
Article
Archetypal analysis is a popular exploratory tool that explains a set of observations as compositions of a few 'pure' patterns. The standard formulation of archetypal analysis addresses this problem for real-valued observations by finding the approximate convex hull. Recently, a probabilistic formulation has been suggested which extends this framework to other observation types such as binary and count. In this article we further extend this framework to address the general case of nominal observations which includes, for example, multiple-option questionnaires. We view archetypal analysis in a generative framework: this allows explicit control over choosing a suitable number of archetypes by assigning appropriate prior information, and finding efficient update rules using variational Bayes. We demonstrate the efficacy of this approach extensively on simulated data, and three real-world examples: the Austrian guest survey dataset, the German credit dataset, and the SUN attribute image dataset.
Article
Full-text available
Database files and additional log files of Learning Management Systems (LMSs) contain an enormous volume of data which usually remain unexploited. A new methodology is proposed in order to analyse these data on the level of both the courses and the learners. Specifically, regression analysis is proposed as a first step in the methodology in order to explore how e-learning contents and characteristics of the course (such as a theory or lab course, a first- or second-year course, etc.) influence performance. Further investigation of each course, according to learners’ usage, is achieved by archetypal analysis, which pinpoints the typical usage. The proposed methodology was successfully applied to LMS data from a Greek University. The results confirmed the validity of the approach and showed a relationship between the educational content which was provided and its usage by the learners.
Article
Full-text available
This article presents the basic results of using the parallel coordinate representation as a high-dimensional data analysis tool. Several alternatives are reviewed. The basic algorithm for parallel coordinates is laid out and a discussion of its properties as a projective transformation is given. Several duality results are discussed along with their interpretations as data analysis tools. Permutations of the parallel coordinate axes are discussed, and some examples are given. Some extensions of the parallel coordinate idea are given. The article closes with a discussion of implementation and some of my experiences.
Conference Paper
Full-text available
Archetypal analysis represents the members of a set of multivariate data as a convex combination of extremal points of the data. It allows for dimensionality reduction and clustering and is particularly useful whenever the data are superpositions of basic entities. However, since its computation costs grow quadratically with the number of data points, the original algorithm hardly applies to modern pattern recognition or data mining settings. In this paper, we introduce ways of notably accelerating archetypal analysis. Our experiments are the first successful application of the technique to large scale data analysis problems.
Article
Full-text available
Archetypal analysis has the aim to represent observations in a multivariate data set as convex combinations of extremal points. This approach was introduced by Cutler and Breiman (1994); they defined the concrete problem, laid out the theoretical foundations and presented an algorithm written in Fortran. In this paper we present the R package archetypes which is available on the Comprehensive R Archive Network. The package provides an implementation of the archetypal analysis algorithm within R and different exploratory tools to analyze the algorithm during its execution and its final result. The application of the package is demonstrated on two examples.
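For orientation, the optimisation problem behind archetypal analysis can be written as follows. The notation is ours rather than that of the package paper: $X \in \mathbb{R}^{n \times m}$ holds the $n$ observations in its rows, $A \in \mathbb{R}^{n \times k}$ holds the coefficients that mix archetypes into observations, and $B \in \mathbb{R}^{k \times n}$ holds the coefficients that mix observations into the $k$ archetypes $Z = BX$.

```latex
\min_{A,\,B}\; \lVert X - A B X \rVert_F^2
\quad \text{subject to} \quad
a_{ij} \ge 0,\ \ \sum_{j=1}^{k} a_{ij} = 1 \ \ (i = 1,\dots,n),
\qquad
b_{ji} \ge 0,\ \ \sum_{i=1}^{n} b_{ji} = 1 \ \ (j = 1,\dots,k).
```

The constraints on $B$ make each archetype a convex combination of observations, and the constraints on $A$ make each observation's approximation a convex combination of archetypes.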
Article
Full-text available
The aim of this study was to construct a valid and reliable methodology for the analysis of performance profiles of individual playing positions within rugby union. Twenty-two matches were sampled from the domestic season of a professional male rugby union team. Key performance indicators for individual positions were developed and notated using a computerized behavioural analysis system. Performance profiles of playing positions containing data from one or more individuals were then constructed to compare intra-positional differences. Significant differences (chi-square) were observed between individuals within all the tested playing positions for the principal performance indicators (passing, carrying and tackling for the forward positions, and passing, carrying, tackling and kicking for the backs). For example, the difference between the two outside-halves in the study (P < 0.001) was illustrated by one of the players having a median of 6 successful carries for the season (95% confidence limits of 13 and 3), whereas the other had a median of 2 successful carries (95% confidence limits of 5 and 1). The findings suggest that while general positional performance profiles appear to exist, intra-positional differences may occur due to variations in an individual's style of play, the decision-making demands of the position and the effects of potential confounding variables. Multiple profiles may therefore be necessary for some playing positions to account for variation in factors such as playing conditions and the strength of the opposition.
Book
It is difficult to imagine that the statistical analysis of compositional data has been a major issue of concern for more than 100 years. It is even more difficult to realize that so many statisticians and users of statistics are unaware of the particular problems affecting compositional data, as well as their solutions. The issue of "spurious correlation", as the situation was phrased by Karl Pearson back in 1897, affects all data that measures parts of some whole, such as percentages, proportions, ppm and ppb. Such measurements are present in all fields of science, ranging from geology, biology, environmental sciences, forensic sciences, medicine and hydrology. This book presents the history and development of compositional data analysis along with Aitchison's log-ratio approach. Compositional Data Analysis describes the state of the art both in theoretical fields as well as applications in the different fields of science. Key Features: • Reflects the state-of-the-art in compositional data analysis. • Gives an overview of the historical development of compositional data analysis, as well as basic concepts and procedures. • Looks at advances in algebra and calculus on the simplex. • Presents applications in different fields of science, including genomics, ecology, biology, geochemistry, planetology, chemistry and economics. • Explores connections to correspondence analysis and the Dirichlet distribution. • Presents a summary of three available software packages for compositional data analysis. • Supported by an accompanying website featuring R code. Applied scientists working on compositional data analysis in any field of science, both in academia and professionals will benefit from this book, along with graduate students in any field of science working with compositional data.
Article
Establishing normative profiles in performance analysis. Mike Hughes, Steve Evans & Julia Wells. https://doi.org/10.1080/24748668.2001.11868245. Published online 03 April 2017.
It is an implicit assumption in notational analysis that in presenting a performance profile of a team or an individual a 'normative profile' has been achieved. Inherently this implies that all the variables that are to be analysed and compared have stabilised. Most researchers assume that this will have happened if they analyse enough performances. But how many is enough? In the literature there are large differences in sample sizes. Just trawling through some of the analyses in soccer shows the differences (Table 1: Some examples of sample sizes for profiling in sport). There must be some way of assessing how data within a study is stabilising. The nature of the data itself will also affect how many performances are required: 5 matches may be enough to analyse passing in field hockey, but would you need 10 to analyse crossing or perhaps 30 for shooting? The way in which the data is analysed will also affect the stabilisation of performance means: data that is analysed across a multi-cell representation of the playing area will require far more performances to stabilise than data that is analysed on overall performance descriptors (e.g. shots per match). It is misleading to test the latter and then go on to analyse the data in further detail. This study aimed to explore strategies for solving these problems in two sports, squash and badminton, in depth and then present further examples from a multiplicity of types of sports. A computerised notation system (Brown and Hughes, 1995) was used to record and analyse play, post event, for elite (N=20), county (N=20) and recreational (N=20) players. T-tests were used to examine the inter- and intra-reliability of the data collection processes. In addition, to establish that a normative profile had been reached, the profiles of 8 matches were compared with those of 9 and 10 matches, using dependent t-tests, for each of the categories of players. This method clearly demonstrated that those studies assuming that 5, 6 or 8 matches or performances were enough for a normative profile, without resorting to this sort of test, are clearly subject to possible flaws. The number of matches required for a normative profile of a subject population to be reached is dependent upon the nature of the data and, in particular, the nature of the performers. A notation system, designed to record rally-end variables in badminton, was shown to be both valid and reliable. Inter- and intra-reliability ranged from 98.6% (rally length) to 91.3% (position). Percentage differences between data from side and end observations of the same match were not greater than for the intra-reliability data, thus different court viewing angles had little effect on notation. Previous literature declared profiles of performance without adequately tackling the problem of quantifying the data required in creating a normative template. The badminton notation system was used to examine the cumulative means of selected variables over a series of 11 matches of a player. A template, at match N(E), was established when these means became stable within set limits of error (LE). T-tests on the variable means in games won and games lost established the existence of winning and losing templates for winners and errors. Match descriptors (rallies, shots and shots per rally) were independent of match outcome. General values of N(E) established for data types (10% LE) were 3 matches (descriptive variables), 4 (winners/errors, w/e), 6 (smash + w/e) and 7 (position + w/e). Respective values at 5% LE were 7, 5, 8 and 10. There was little difference in the values of N(E) when variable means were analysed by game rather than by match. For the working performance analyst the results provide an estimate of the minimum number of matches to profile an opponent’s rally-end play. Whilst these results may be limited to badminton, men’s singles and the individual, the methodology of using graphical plots of cumulative means in attempting to establish templates of performance has been served. Further examples will be presented from different sports.
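The cumulative-mean idea described above can be sketched in a few lines of Python. This is only an illustration under assumptions made here: the stability rule (every cumulative mean from match N onwards lies within a fractional limit of error of the final cumulative mean), the function name `matches_to_stabilise`, and the example counts are all hypothetical and are not taken from the study.

```python
import numpy as np

def matches_to_stabilise(values, limit_of_error=0.10):
    """Return the first match N from which the cumulative mean stays stable.

    Assumed rule: stable means within limit_of_error (as a fraction) of the
    cumulative mean over all available matches, for match N and every
    later match.
    """
    values = np.asarray(values, dtype=float)
    cum_means = np.cumsum(values) / np.arange(1, len(values) + 1)
    final_mean = cum_means[-1]
    within = np.abs(cum_means - final_mean) <= limit_of_error * abs(final_mean)
    outside = np.flatnonzero(~within)
    # The match after the last out-of-band cumulative mean (1-based numbering).
    n_stable = 1 if outside.size == 0 else int(outside[-1]) + 2
    return n_stable, cum_means

# Hypothetical per-match counts of one performance variable.
counts = [10, 20, 12, 14, 13, 13, 12, 14]
n_stable, cum_means = matches_to_stabilise(counts, limit_of_error=0.10)
print(n_stable, np.round(cum_means, 2))
```

Because the band is centred on the final cumulative mean, the result depends on how many matches were recorded in total, so the estimate should be treated as a lower bound that may move as more matches are added.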
Article
Performance indicators in sport are not stable properties of individual competitors or teams and a single athlete's (or team's) performance will vary from match to match. For this reason, Hughes, Evans and Wells (2001) developed a technique for determining "normative profiles in performance analysis". This paper proposes an alternative technique that represents not only the typical performance of a team or individual but also the spread of performances. The technique also relates the set of performance indicators for a team or individual to normative data for a relevant population of teams or individuals. This provides a useful means of interpreting sports performance data. Grand Slam singles tennis is used as an example of the application of the proposed technique for determining a normative profile of a team's or individual's performance. It is recognised that the mean value for each performance indicator within the typical performance is critically important information that should be supported by rather than replaced by percentile bandings.
Article
Sports performance variables are unstable, with opposition quality being the main source of player variability. Therefore, performance indicators should be evaluated addressing quality of opposition. This paper uses the men's singles at the 2010 and 2011 Australian Open tennis championships to show how opposition effect can be modelled. The models are for expected performance indicator values given the World rankings of the players involved in the match. The residual values determine how much better or worse a player did than expected for each performance indicator. The residuals can be mapped onto percentage evaluation scores that address opposition quality. These evaluation scores can be used to interpret individual performances and to determine performance profiles or trends in performance. This idea has been extended to address the relative strengths and weaknesses of particular opponents rather than applying the same model to all players.
Article
Benchmarking plays a relevant role in performance analysis, and statistical methods can be fruitfully exploited for its aims. While clustering, regression, and frontier analysis may serve some benchmarking purposes, we propose to consider archetypal analysis as a suitable technique. Archetypes are extreme points that synthesize data and that, in our opinion, can be profitably used as benchmarks. That is, they may be viewed as key reference performers in the comparison process. We suggest a three-step data driven benchmarking procedure, which enables users: (i) to identify some reference performers, (ii) to analyze their features, (iii) to compare observed performers with them. An exploratory point of view is preferred, and graphical devices are adopted throughout the procedure. Finally, our approach is presented by means of an illustrative example based on The Times league table of the world top 200 universities. Copyright © 2008 John Wiley & Sons, Ltd.
Article
Archetypal analysis represents observations in a multivariate data set as convex combinations of a few extremal points lying on the boundary of the convex hull. Data points which vary from the majority have great influence on the solution; in fact one outlier can break down the archetype solution. The original algorithm is adapted to be a robust M-estimator and an iteratively reweighted least squares fitting algorithm is presented. As a required first step, the weighted archetypal problem is formulated and solved. The algorithm is demonstrated using an artificial example, a real world example and a detailed simulation study.
Article
A methodological and computational framework for centroid-based partitioning cluster analysis using arbitrary distance or similarity measures is presented. The power of high-level statistical computing environments like R enables data analysts to easily try out various distance measures with only minimal programming effort. A new variant of centroid neighborhood graphs is introduced which gives insight into the relationships between adjacent clusters. Artificial examples and a case study from marketing research are used to demonstrate the influence of distance measures on partitions and the usage of neighborhood graphs.
Article
By means of Parallel Coordinates, planar “graphs” of multivariate relations are obtained. Certain properties of the relationship correspond to the geometrical properties of its graph. On the plane a point ←→ line duality with several interesting properties is induced. A new duality between bounded and unbounded convex sets and hstars (a generalization of hyperbolas) and between Convex Unions and Intersections is found. This motivates some efficient Convexity algorithms and other results in Computational Geometry. There is also a surprising “cusp” ←→ “inflection point” duality. The narrative ends with a preview of the corresponding results in R^N.
Article
Segmentation results derived using cluster analysis depend on (1) the structure of the data and (2) algorithm parameters. Typically, neither the data structure nor the sensitivity of the analysis to changes in algorithm parameters is assessed in advance of clustering. We propose a benchmarking framework based on bootstrapping techniques that accounts for sample and algorithm randomness. This provides much needed guidance both to data analysts and users of clustering solutions regarding the choice of the final clusters from computations that are exploratory in nature.
Article
The aims of this paper are to examine the application of performance indicators in different sports and, using the different structural definitions of games, to make general recommendations about the use and application of these indicators. Formal games are classified into three categories: net and wall games, invasion games, and striking and fielding games. The different types of sports are also sub-categorized by the rules of scoring and ending the respective matches. These classes are analysed further, to enable definition of useful performance indicators and to examine similarities and differences in the analysis of the different categories of game. The indices of performance are sub-categorized into general match indicators, tactical indicators, technical indicators and biomechanical indicators. Different research examples and the accuracy of their presentation are discussed. We conclude that, to enable a full and objective interpretation of the data from the analysis of a performance, comparisons of data are vital. In addition, any analysis of the distribution of actions across the playing surface should also be presented normalized, or non-dimensionalized, to the total distribution of actions across the area. Other normalizations of performance indicators should also be used more widely in conjunction with the accepted forms of data analysis. Finally, we recommend that biomechanists should pay more attention to games to enrich the analysis of performance in these sports.
Article
This paper synthesizes the results, methodology, and research conducted concerning the K-means clustering method over the last fifty years. The K-means method is first introduced, various formulations of the minimum variance loss function and alternative loss functions within the same class are outlined, and different methods of choosing the number of clusters and initialization, variable preprocessing, and data reduction schemes are discussed. Theoretic statistical results are provided and various extensions of K-means using different metrics or modifications of the original algorithm are given, leading to a unifying treatment of K-means and some of its extensions. Finally, several future studies are outlined that could enhance the understanding of numerous subtleties affecting the performance of the K-means method.
Article
The quantitative analysis of sports is a growing branch of science and, in many ways, one that has developed through non-academic and non-traditionally peer-reviewed work. The aim of this paper is to bring to a peer-reviewed journal the generally accepted basics of the analysis of basketball, thereby providing a common starting point for future research in basketball. The possession concept, in particular the concept of equal possessions for opponents in a game, is central to basketball analysis. Estimates of possessions have existed for approximately two decades, but the various formulas have sometimes created confusion. We hope that by showing how most previous formulas are special cases of our more general formulation, we shed light on the relationship between possessions and various statistics. Also, we hope that our new estimates can provide a common basis for future possession estimation. In addition to listing data sources for statistical research on basketball, we also discuss other concepts and methods, including offensive and defensive ratings, plays, per-minute statistics, pace adjustments, true shooting percentage, effective field goal percentage, rebound rates, Four Factors, plus/minus statistics, counterpart statistics, linear weights metrics, individual possession usage, individual efficiency, Pythagorean method, and Bell Curve method. This list is not an exhaustive list of methodologies used in the field, but we believe that they provide a set of tools that fit within the possession framework and form the basis of common conversations on statistical research in basketball.
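Two of the shooting measures named in this abstract have simple closed forms, sketched below using the definitions in common use. The free-throw weight of 0.44 is the conventional value and is an assumption here; the abstract does not state which coefficient the paper adopts. Function and parameter names are illustrative.

```python
def effective_fg_pct(fg_made, three_pt_made, fg_attempted):
    """Effective field goal percentage: a made three-pointer
    counts as 1.5 made field goals."""
    return (fg_made + 0.5 * three_pt_made) / fg_attempted


def true_shooting_pct(points, fg_attempted, ft_attempted, ft_weight=0.44):
    """True shooting percentage: points per two shooting attempts,
    where free-throw attempts are converted to shooting attempts
    using ft_weight (0.44 by convention, assumed here)."""
    return points / (2.0 * (fg_attempted + ft_weight * ft_attempted))
```

Both functions assume at least one attempt; a player with no attempts has no defined percentage and should be handled by the caller.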
Article
Archetypal analysis represents each individual member of a set of data vectors as a mixture (a constrained linear combination) of the pure types or archetypes of the data set. The archetypes are themselves required to be mixtures of the data vectors. Archetypal analysis may be particularly useful in analysing data sets comprising galaxy spectra, since each spectrum is, presumably, a superposition of the emission from the various stellar populations, nebular emissions and nuclear activity making up that galaxy, and each of these emission sources corresponds to a potential archetype of the entire data set. We demonstrate archetypal analysis using sets of composite synthetic galaxy spectra, showing that the method promises to be an effective and efficient way to classify spectra. We show that archetypal analysis is robust in the presence of various types of noise. Comment: 6 pages, 5 figures, 1 style-file. Accepted for publication by MNRAS
Archetypal analysis: A new way to segment markets based on extreme individuals
  • S L P Wang
  • J Louviere
  • R Carson
archetypes: Archetypal Analysis, URL http://CRAN.R-project.org/package=archetypes
  • M J A Eugster
SportsAnalytics: Sports Analytics, URL http://CRAN.R-project.org/package=SportsAnalytics
  • M J A Eugster
Doug’s NBA & MLB statistics home page. http://dougstats.com/; visited on 10
  • D Steele