Main flowchart of the GLISTER-ONLINE framework for Data Selection.

Source publication
Preprint
Full-text available
Large scale machine learning and deep models are extremely data-hungry. Unfortunately, obtaining large amounts of labeled data is expensive, and training state-of-the-art models (with hyperparameter tuning) requires significant computing resources and time. Secondly, real-world data is noisy and imbalanced. As a result, several recent papers try to...

Contexts in source publication

Context 1
... algorithm is iterative: it proceeds by simultaneously updating the model parameters and selecting subsets. Figure 1 gives a flowchart of GLISTER-ONLINE. Note that, for computational reasons, we perform data selection every L epochs rather than every epoch. ...
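A minimal sketch of this alternating schedule, assuming PyTorch; `greedy_select` is a hypothetical stand-in for the paper's subset-selection step, and all names and hyperparameters here are illustrative, not the authors' implementation:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset

def train_glister_online(model, train_set, val_set, budget, epochs, L=20, lr=0.01):
    """Alternate model updates with subset re-selection every L epochs (sketch)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    # Start from a random subset of size `budget`.
    subset_idx = torch.randperm(len(train_set))[:budget].tolist()
    for epoch in range(epochs):
        if epoch % L == 0:
            # greedy_select (hypothetical) stands in for the paper's selection
            # step, which picks a subset that approximately minimizes val loss.
            subset_idx = greedy_select(model, train_set, val_set, budget)
        loader = DataLoader(Subset(train_set, subset_idx), batch_size=32, shuffle=True)
        for x, y in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```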
Context 2
... extensive analysis, we used L = 20 in our experiments. Convergence with varying r values: in this section, we analyze the effect of varying r on the convergence rate of our model on various datasets, as shown in Figure 10. From the results, it is evident that our GLISTER-ONLINE framework is highly unstable for low values of r and that its stability increases with r. ...
Context 3
... it is not always advisable to use large r values. This is also evident from the results shown in Figure 10, since our GLISTER-ONLINE framework converges faster with r = 10 than with r = 20 on the majority of the datasets. ...
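One plausible reading of the role of r is sketched below, under the assumption that r controls how many times the Taylor approximation of the validation loss is refreshed during a single greedy selection pass; `taylor_gains` is hypothetical, and this is not the authors' exact algorithm:

```python
def greedy_select(model, train_set, val_set, budget, r=10):
    """Greedy subset selection with Taylor-approximated gains that are
    refreshed r times per pass (an assumption, for illustration only)."""
    selected, remaining = [], set(range(len(train_set)))
    refresh_every = max(1, budget // r)
    gains = None
    for step in range(budget):
        if step % refresh_every == 0:
            # taylor_gains (hypothetical) re-expands the validation loss
            # around the current subset to score each candidate element.
            gains = taylor_gains(model, train_set, val_set, selected)
        best = max(remaining, key=lambda i: gains[i])
        selected.append(best)
        remaining.remove(best)
    return selected
```

Under this reading, larger r means more frequent refreshes and hence more stable but slower selection, which is consistent with the trade-off observed above.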
Context 4
... synthetic datasets include linearly separable binary data, multi-class separable data, and linearly separable binary data with slack. The linearly separable binary dataset, shown in Fig. 11a, comprises two-dimensional feature points drawn from two non-overlapping clusters belonging to class 0 and class 1. ...
Context 5
... linearly separable binary dataset, shown in Fig. 11a, comprises two-dimensional feature points drawn from two non-overlapping clusters belonging to class 0 and class 1. The multi-class separable dataset comprises two-dimensional feature points drawn from four non-overlapping clusters of classes 0, 1, 2, and 3, shown in Fig. 12a. An overlapping version of the same is shown in Fig. 14a. ...
Context 6
... multi-class separable dataset comprises two-dimensional feature points drawn from four non-overlapping clusters of classes 0, 1, 2, and 3, shown in Fig. 12a. An overlapping version of the same is shown in Fig. 14a. The linearly separable dataset with slack, by contrast, comprises two-dimensional feature points drawn from two overlapping clusters of class 0 and class 1, as shown in Fig. 13a. ...
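Datasets of this shape can be approximated with scikit-learn's `make_blobs`; the cluster centers, sample counts, and spreads below are illustrative guesses, not the paper's generation parameters:

```python
from sklearn.datasets import make_blobs

# Linearly separable binary data: two well-separated 2-D clusters (classes 0, 1).
X_bin, y_bin = make_blobs(n_samples=1000, centers=[(-3, -3), (3, 3)],
                          cluster_std=0.8, random_state=0)

# Multi-class separable data: four non-overlapping 2-D clusters (classes 0-3).
X_multi, y_multi = make_blobs(n_samples=2000,
                              centers=[(-4, -4), (-4, 4), (4, -4), (4, 4)],
                              cluster_std=0.8, random_state=0)

# Binary data with slack: a larger spread makes the two clusters overlap.
X_slack, y_slack = make_blobs(n_samples=1000, centers=[(-1.5, -1.5), (1.5, 1.5)],
                              cluster_std=2.0, random_state=0)
```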
Context 7
... overlapping version of the same is shown in Fig. 14a. The linearly separable dataset with slack, by contrast, comprises two-dimensional feature points drawn from two overlapping clusters of class 0 and class 1, as shown in Fig. 13a. Figure 11 shows the subsets selected by the various methods for the linearly separable binary dataset. ...
Context 8
... An overlapping version of the same is shown in Fig. 14a. The linearly separable dataset with slack, by contrast, comprises two-dimensional feature points drawn from two overlapping clusters of class 0 and class 1, as shown in Fig. 13a. Figure 11 shows the subsets selected by the various methods for the linearly separable binary dataset. From Fig. 11b, we can see that our GLISTER-ONLINE framework selects a subset close to the decision boundary, whereas other methods such as CRAIG (Mirzasoleiman, Bilmes, and Leskovec 2020), Random, and KNNSubmod (Wei, Iyer, and Bilmes 2015) select subsets that are representative of the training dataset, as shown in Figures 11c, 11e, and 11d respectively. ...
Context 9
... From Fig. 11b, we can see that our GLISTER-ONLINE framework selects a subset close to the decision boundary, whereas other methods such as CRAIG (Mirzasoleiman, Bilmes, and Leskovec 2020), Random, and KNNSubmod (Wei, Iyer, and Bilmes 2015) select subsets that are representative of the training dataset, as shown in Figures 11c, 11e, and 11d respectively. Similarly, Figure 12 highlights the points selected by each method for the linearly separable dataset with four classes, Figure 13 does so for the binary dataset with outliers, and Figure 14 for the overlapping dataset with four classes. ...
Context 10
... Similarly, Figure 12 highlights the points selected by each method for the linearly separable dataset with four classes, Figure 13 does so for the binary dataset with outliers, and Figure 14 for the overlapping dataset with four classes. ...
Context 11
... GLISTER-ONLINE is driven by validation data, so it can better handle situations where the test and validation data exhibit a distribution shift relative to the training data; this is known as covariate shift. To illustrate this, we use two synthetic datasets that are very similar to those shown in Figures 14a and 13a, except that their validation sets are shifted, as shown in Figures 15b and 16b respectively. Figures 15c and 15d show how effectively the methods CRAIG (Mirzasoleiman, Bilmes, and Leskovec 2020), Random, KNNSubmod (Wei, Iyer, and Bilmes 2015), and GLISTER-ONLINE reduce the validation and test loss, respectively. ...
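A covariate-shifted validation set of this kind can be simulated by sampling the validation data from translated cluster centers, so the feature distribution moves while the labeling rule does not; the shift vector below is illustrative, not the paper's value:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Training data: two overlapping 2-D clusters, as in the slack dataset.
centers = np.array([(-1.5, -1.5), (1.5, 1.5)])
X_train, y_train = make_blobs(n_samples=1000, centers=centers,
                              cluster_std=2.0, random_state=0)

# Validation/test data drawn from shifted centers (covariate shift).
shift = np.array([3.0, 0.0])  # illustrative shift, chosen for this sketch
X_val, y_val = make_blobs(n_samples=300, centers=centers + shift,
                          cluster_std=2.0, random_state=1)
```

Because GLISTER-ONLINE scores subsets by their validation performance, selection under this setup is pulled toward training points that remain informative for the shifted region.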
Context 12
... To illustrate this, we use two synthetic datasets that are very similar to those shown in Figures 14a and 13a, except that their validation sets are shifted, as shown in Figures 15b and 16b respectively. Figures 15c and 15d show how effectively the methods CRAIG (Mirzasoleiman, Bilmes, and Leskovec 2020), Random, KNNSubmod (Wei, Iyer, and Bilmes 2015), and GLISTER-ONLINE reduce the validation and test loss, respectively. Clearly, GLISTER-ONLINE outperforms the other methods. ...
Context 13
... Clearly, GLISTER-ONLINE outperforms the other methods. A similar trend is seen in Figures 16c and 16d for the binary dataset. ...

Similar publications

Chapter
Full-text available
To identify heart disease, several contributory risk factors and pieces of information about the patient must be taken into account. Diagnosing heart disease requires medical specialists, who are often unavailable in a timely manner or too costly for some patients in remote and rural areas, but also in overly populated urban areas. For this reason, over the years, new approach...