Figure - available via license: Creative Commons Attribution 2.0 Generic
The Lehmer code. A complete translation from permutation to
decimal, by way of the factoradic, for a permutation of size 4. Each
permutation is mapped to a single unique decimal representation. Two
additional translations from permutation to factoradic are shown in
Additional file 1: Figure S2.
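
The translation described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the figure's exact procedure: it assumes the common convention in which each factoradic digit counts the elements to the right that are smaller than the current one, and the function names (`lehmer_digits`, `factoradic_to_decimal`, `permutation_to_decimal`) are hypothetical. The figure and Additional file 1 may use a different but equivalent digit convention.

```python
from itertools import permutations
from math import factorial


def lehmer_digits(perm):
    """Return the factoradic digits (Lehmer code) of a permutation.

    Assumed convention: digit i counts the elements to the right of
    position i that are smaller than perm[i].
    """
    return [
        sum(1 for later in perm[i + 1:] if later < perm[i])
        for i in range(len(perm))
    ]


def factoradic_to_decimal(digits):
    """Convert factoradic digits to an integer.

    Digit i (most significant first) carries the weight (n - 1 - i)!.
    """
    n = len(digits)
    return sum(d * factorial(n - 1 - i) for i, d in enumerate(digits))


def permutation_to_decimal(perm):
    """Map a permutation to its integer index via the factoradic."""
    return factoradic_to_decimal(lehmer_digits(perm))


if __name__ == "__main__":
    example = (2, 0, 3, 1)
    # Digits [2, 0, 1, 0] give 2*3! + 0*2! + 1*1! + 0*0! = 13.
    print(lehmer_digits(example), permutation_to_decimal(example))

    # Uniqueness check for size 4: the 24 permutations should map
    # onto the 24 distinct integers 0..23.
    codes = sorted(permutation_to_decimal(p) for p in permutations(range(4)))
    print(codes == list(range(24)))
```

Under this convention the identity permutation (0, 1, 2, 3) maps to 0 and the fully reversed permutation (3, 2, 1, 0) maps to 23, so the 4! = 24 permutations of size 4 map one-to-one onto the integers 0 to 23.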


Source publication
Article
Relative expression algorithms such as the top-scoring pair (TSP) and the top-scoring triplet (TST) have several strengths that distinguish them from other classification methods, including resistance to overfitting, invariance to most data normalization methods, and biological interpretability. The top-scoring ‘N’ (TSN) algorithm is a generalized...


Citations

... Over time, this approach has been expanded to include the use of multiple feature pairs, multi-class targets, and has even been integrated with other machine learning techniques such as SVMs, decision trees, and random forests [21][22][23][24][25][26]. ...
Article
Background: Previous studies have described sex-specific patient subtyping in glioblastoma. The cluster labels associated with these "legacy data" were used to train a predictive model capable of recapitulating this clustering in contemporary contexts. Methods: We used robust ensemble machine learning to train a model using gene microarray data to perform multi-platform predictions including RNA-seq and potentially scRNA-seq. Results: The engineered feature set was composed of many previously reported genes that are associated with patient prognosis. Interestingly, these well-known genes formed a predictive signature only for female patients, and the application of the predictive signature to male patients produced unexpected results. Conclusions: This work demonstrates how annotated "legacy data" can be used to build robust predictive models capable of multi-target predictions across multiple platforms.
... Furthermore, these methods were not primarily designed for individual sample concordance, leading to potential inconsistencies in patient-to-molecular subtype assignments [22]. In contrast, TSP methods and their extensions [19][23][24][25][26] offer scalability, interpretability, and robust feature selection. They generate gene rules by comparing expression values within a single sample, thus avoiding normalization with another dataset. ...
Article
Building Single Sample Predictors (SSPs) from gene expression profiles presents challenges, notably due to the lack of calibration across diverse gene expression measurement technologies. However, recent research indicates the viability of classifying phenotypes based on the order of expression of multiple genes. Existing SSP methods often rely on Top Scoring Pairs (TSP), which are platform-independent and easy to interpret through the concept of “relative expression reversals”. Nevertheless, TSP methods face limitations in classifying complex patterns involving comparisons of more than two gene expressions. To overcome these constraints, we introduce a novel approach that extends TSP rules by constructing rank-based trees capable of encompassing extensive gene-gene comparisons. This method is bolstered by incorporating two ensemble strategies, boosting and random forest, to mitigate the risk of overfitting. Our implementation of ensemble rank-based trees employs boosting with LogitBoost cost and random forests, addressing both binary and multi-class classification problems. In a comparative analysis across 12 cancer gene expression datasets, our proposed methods demonstrate superior performance over both the k-TSP classifier and nearest template prediction methods. We have further refined our approach to facilitate variable selection and the generation of clear, precise decision rules from rank-based trees, enhancing interpretability. The cumulative evidence from our research underscores the significant potential of ensemble rank-based trees in advancing disease classification via gene expression data, offering a robust, interpretable, and scalable solution. Our software is available at https://CRAN.R-project.org/package=ranktreeEnsemble.
... All evolutionary algorithms rely on the idea of biological evolution when solving problems. When conducting research, researchers usually prefer to use genetic algorithms in evolutionary algorithms [9]. The concept of genetic algorithm and self-adjustment came into being at the same time in the middle of the 20th century. ...
Article
According to the traditional data mining method, it is no longer applicable to obtain knowledge from the database, and the knowledge mined in the past must be constantly updated. In the last few years, Internet technology and cloud computing technology have emerged. The emergence of these two technologies has brought about Earth-shaking changes in certain industries. In order to efficiently retrieve and count a large amount of data at a lower cost, big data technology is proposed. Big data technology has played an important role for data with various types, huge quantities, and extremely fast changing speeds. However, big data technology still has some limitations, and researchers still cannot obtain the value of data in a short period of time with low cost and high efficiency. The sports database constructed in this paper can effectively carry out statistics and analysis on the data of sports learning. In the prototype system, log files can be mined, classified, and preprocessed. For the incremental data obtained by preprocessing, incremental data mining can be performed, a classification model can be established, and the database can be updated to provide users with personalized services. Through the method of data survey, the author studied the students’ exercise status, and the feedback data show that college students lack the awareness of physical exercise and have no fitness habit. It is necessary to accelerate the reform of college sports and cultivate students’ good sports awareness.
... To address this, we propose two algorithms to estimate H+, both referred to as an "h-plus estimator" or (HPE): (i) a brute force approach inspired by the Top-Scoring Pair (Leek, 2009; Magis and Price, 2012) algorithms, which use relative ranks to classify observations with O(p^2) comparisons and (ii) a grid search approach with O(p) comparisons, where p refers to percentiles of the data (rather than the n observations themselves). Typically, p is chosen such that p << n, leading to significant improvements in the computational speed to calculate H+. ...
Article
A standard unsupervised analysis is to cluster observations into discrete groups using a dissimilarity measure, such as Euclidean distance. If there does not exist a ground-truth label for each observation necessary for external validity metrics, then internal validity metrics, such as the tightness or separation of the clusters, are often used. However, the interpretation of these internal metrics can be problematic when using different dissimilarity measures as they have different magnitudes and ranges of values that they span. To address this problem, previous work introduced the “scale-agnostic” G+ discordance metric; however, this internal metric is slow to calculate for large data. Furthermore, in the setting of unsupervised clustering with k groups, we show that G+ varies as a function of the proportion of observations assigned to each of the groups (or clusters), referred to as the group balance, which is an undesirable property. To address this problem, we propose a modification of G+, referred to as H+, and demonstrate that H+ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data. Finally, we provide scalable approaches to estimate H+, which are available in the fasthplus R package.
... To address this, we propose two algorithms to estimate H+ (an 'h-plus estimator' or HPE) inspired by the Top-Scoring Pair (TSP) (Leek, 2009; Magis, Price, 2012) algorithms, which use relative ranks to classify observations: a brute force approach with O(p^2) comparisons, and a grid search approach with O(p) comparisons where p refers to percentiles of the data (rather than the n observations themselves) where p << n, leading to significant improvements in the computational speed to calculate H+. Specifically, both algorithms estimate H+ (referred to as H_e or HPE) assume D(n × n) has been pre-calculated and provide faster ways to approximate s (Figure 3B). ...
Preprint
A standard unsupervised analysis is to cluster observations into discrete groups using a dissimilarity measure, such as Euclidean distance. If there does not exist a ground-truth label for each observation necessary for external validity metrics, then internal validity metrics, such as the tightness or consistency of the cluster, are often used. However, the interpretation of these internal metrics can be problematic when using different dissimilarity measures as they have different magnitudes and ranges of values that they span. To address this problem, previous work introduced the ‘scale-agnostic’ G+ discordance metric, however this internal metric is slow to calculate for large data. Furthermore, we show that G+ varies as a function of the proportion of observations in the predicted cluster labels (group balance), which is an undesirable property. To address this problem, we propose a modification of G+, referred to as H+, and demonstrate that H+ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data. Finally, we provide scalable approaches to estimate H+, which are available in the fasthplus R package.
... Different approaches for the TSP extension focus on the relationships between more than two genes. Algorithms Top Scoring Triplet (TST) [19] and Top Scoring N (TSN) [21] analyze all possible ordering relationships between the genes, however, the general concept of TSP is retained. One of the first heuristic approaches that applied the RXA concept was the evolutionary algorithm called EvoTSP [6] where the authors proposed an evolutionary search for the TSP-like rules. ...
... In the case of RXA there exists also research considering GPU parallelization. In [21] authors managed to speed up calculations of basic TSP and TST solutions by two orders of magnitude. ...
Chapter
This paper presents a new concept for biomarker discovery and gene expression data classification that rises from the Relative Expression Analysis (RXA). The basic idea of RXA is to focus on simple ordering relationships between the expression of small sets of genes rather than their raw values. We propose a paradigm shift as we extend RXA concept to tree-based Advanced Relative Expression Analysis (ARXA). The main contribution is a decision tree with splitting nodes that consider relative fraction comparisons between multiple gene pairs. In addition, to face the enormous computational complexity of RXA, the most time-consuming part which is scoring all possible gene pairs in each splitting node is parallelized using GPU. This way the algorithm allows searching for more tailored interactions between sub-groups of genes in a reasonable time. Experiments carried out on 8 cancer-related datasets show not only significant improvement in accuracy and speed of our approach in comparison to various RXA solutions but also new interesting patterns between subgroups of genes.
... A preferable solution is employing a mapping algorithm that assigns a number to each permutation. One of the most efficient codes of this type is the Lehmer Code [24], which is shown in Appendix A. With aid of Algorithm 1, once the value of S is determined, the dealer converts it to the corresponding permutation in s, spreading the x_i values over the range D according to pairs (x_i, N_i) and following v(N_i) = x_i, for i ∈ {1, 2, ..., k}. ...
... The Lehmer Code [24] allows for achieving a bijective relationship between an integer-valued number S and a sequence of k elements in a vector s. Since k! permutations of the k elements are possible, the range of values of S is {0, 1, . . . ...
... In k-TSP, each top gene pair gives a prediction and the final predicted label is given by a majority vote. Other variations further extended the idea of TSP, e.g. by comparing the relative abundance of three or more genes simultaneously (Lin et al., 2009;Magis and Price, 2012;Wang et al., 2013;Yang and Naiman, 2014). The success of TSP and its variations has been shown in numerous real applications. ...
Article
Motivation: Scaling by sequencing depth is usually the first step of analysis of bulk or single-cell RNA-seq data, but estimating sequencing depth accurately can be difficult, especially for single-cell data, risking the validity of downstream analysis. It is thus of interest to eliminate the use of sequencing depth and analyze the original count data directly. Results: We call an analysis method "scale-invariant" (SI) if it gives the same result under different estimates of sequencing depth and hence can use the original count data without scaling. For the problem of classifying samples into pre-specified classes, such as normal versus cancerous, we develop a deep-neural-network based SI classifier named SINC. On nine bulk and single-cell datasets, the classification accuracy of SINC is better than or competitive to the best of eight other classifiers. SINC is easier to use and more reliable on data where proper sequencing depth is hard to determine. Availability: This source code of SINC is available at https://www.nd.edu/~jli9/SINC.zip. Supplementary information: Supplementary data are available at Bioinformatics online.
... The RXA algorithms managed to identify many interesting gene-gene interactions and played important role in a biomarker discovery [19]. The influence of RXA solutions could be even greater, however, the computational complexity of the algorithms that use exhaustive search strongly limits the number of genes that can be analyzed [23]. This indicated the direction of most of the current solutions to perform rigorous feature selection and to limit the complexity of the analyzed relations to a minimum. ...
... Different approaches for the TSP extension focus on the relationships between more than two genes. Algorithms Top Scoring Triplet (TST) [19] and Top Scoring N (TSN) [23] analyze all possible ordering relationships between the genes, however, the general concept of TSP is retained. ...
... In addition, calculation of all possible gene pairs or gene groups strongly limits the number of genes and inter-relations that can be analyzed. The largest so far ordering relationship tested a group of 4 genes (N=4) but only when the total number of analyzed genes was heavily reduced by the feature selection to a few hundred [23]. Although the parallelization of the algorithm managed to speed up calculation by two orders of magnitude [22], it is still computationally infeasible to calculate on a full gene expression dataset. ...
Conference Paper
Relative Expression Analysis (RXA) focuses on finding interactions among a small group of genes and studies the relative ordering of their expression rather than their raw values. Algorithms based on that idea play an important role in biomarker discovery and gene expression data classification. We propose a new evolutionary approach and a paradigm shift for RXA applications in data mining as we redefine the inter-gene relations using the concept of a cluster of co-expressed genes. The global hierarchical classification allows finding various sub-groups of genes, unifies the main variants of RXA algorithms and explores a much larger solution space compared to current solutions based on exhaustive search. Finally, the multi-objective fitness function, which includes accuracy, discriminative power of genes and clusters consistency, as well as specialized variants of genetic operators improve evolutionary convergence and reduce model underfitting. Importantly, patterns in predictive structures are kept comprehensible and may have direct applicability. Experiments carried out on 8 cancer-related gene expression datasets show that the proposed approach allows finding interesting patterns and significantly improves the accuracy of predictions.
... For each experimental station, the use of transcript levels for the top 20 pairs of genes clustered AI-pregnant heifers separately from the others with 100% confidence of cluster formation. Because this approach is parameter free [66,67] with the exception of the binary variable that separates subjects into two categories, we used the algorithm to identify TSPs in all 23 samples that could identify AI-pregnant heifers. The ratio between the expression levels for four gene pairs misclassified only two out of the 12 AI-pregnant heifers. ...
Article
Background: Infertility is a longstanding limitation in livestock production with important economic impact for the cattle industry. Female reproductive traits are polygenic and lowly heritable in nature, thus selection for fertility is challenging. Beef cattle operations leverage estrous synchronization in combination with artificial insemination (AI) to breed heifers and benefit from an early and uniform calving season. A couple of weeks following AI, heifers are exposed to bulls for an opportunity to become pregnant by natural breeding (NB), but they may also not become pregnant during this time period. Focusing on beef heifers, in their first breeding season, we hypothesized that: a- at the time of AI, the transcriptome of peripheral white blood cells (PWBC) differs between heifers that become pregnant to AI and heifers that become pregnant late in the breeding season by NB or do not become pregnant during the breeding season; and b- the ratio of transcript abundance between genes in PWBC classifies heifers according to pregnancy by AI, NB, or failure to become pregnant. Results: We generated RNA-sequencing data from 23 heifers from two locations (A: six AI-pregnant and five NB-pregnant; and B: six AI-pregnant and six non-pregnant). After filtering out lowly expressed genes, we quantified transcript abundance for 12,538 genes. The comparison of gene expression levels between AI-pregnant and NB-pregnant heifers yielded 18 differentially expressed genes (DEGs) (ADAM20, ALDH5A1, ANG, BOLA-DQB, DMBT1, FCER1A, GSTM3, KIR3DL1, LOC107131247, LOC618633, LYZ, MNS1, P2RY12, PPP1R1B, SIGLEC14, TPPP, TTLL1, UGT8, eFDR≤0.02). The comparison of gene expression levels between AI-pregnant and non-pregnant heifers yielded six DEGs (ALAS2, CNKSR3, LOC522763, SAXO2, TAC3, TFF2, eFDR≤0.05). We calculated the ratio of expression levels between all gene pairs and assessed their potential to classify samples according to experimental groups. 
Considering all samples, relative expression from two gene pairs correctly classified 10 out of 12 AI-pregnant heifers (P = 0.0028) separately from the other 11 heifers (NB-pregnant, or non-pregnant). Conclusion: The transcriptome profile in PWBC, at the time of AI, is associated with the fertility potential of beef heifers. Transcript levels of specific genes may be further explored as potential classifiers, and thus selection tools, of heifer fertility.