Article

Abstract

Sequence classification is essential for domains from medical diagnosis to online advertising. In these settings, data are typically proprietary, and annotations are expensive to acquire. Oftentimes, so few annotations are available that training a robust model from scratch is impractical. Recently, knowledge amalgamation (KA) has emerged as a promising strategy for training models without this hard-to-come-by labeled training dataset. To achieve this, KA methods combine the knowledge of multiple pre-trained teacher models (trained on different classification tasks and proprietary datasets) into one student model that becomes an expert on the union of all teachers’ classes. However, we demonstrate that the state-of-the-art solutions fail in the presence of overconfident teachers, which make confident but incorrect predictions for instances from classes upon which they were not trained. Additionally, to date, no work has explored KA for sequence models. Therefore, we propose and then solve the open problem of semi-supervised KA for sequence classification (SKA). Our SKA approach first learns to estimate how trustworthy each teacher is for a given instance, then rescales the predicted probabilities from all teachers to supervise a student model. Our solution overcomes overconfident teachers through careful use of a very small number of labeled instances. We demonstrate that this approach beats eight state-of-the-art alternatives on four real-world datasets by on average 15% in accuracy with as little as 2% of the training data annotated.
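To make the abstract's high-level description concrete, a minimal sketch of the central idea (an instance-wise trust estimator whose scores rescale the teachers' predicted probabilities into one soft target over the union label set) could look roughly as follows. This is an illustration under simplifying assumptions, not the authors' implementation; all names (TrustNet, combine_teacher_targets) are hypothetical, and disjoint teacher label sets are assumed.

```python
import torch
import torch.nn as nn

class TrustNet(nn.Module):
    """Hypothetical trust estimator: scores how reliable each teacher is for a
    given input sequence. It would be fit on the small labeled subset."""
    def __init__(self, input_dim, hidden_dim, num_teachers):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_teachers)

    def forward(self, x):                       # x: (batch, time, input_dim)
        _, (h, _) = self.encoder(x)             # h: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h[-1]))  # (batch, num_teachers), trust in [0, 1]

def combine_teacher_targets(teacher_probs, trust, class_index):
    """Rescale each teacher's predicted probabilities by its trust score, place
    them on the union label space, and renormalize into one soft target.

    teacher_probs: list of (batch, |C_t|) tensors, one per teacher
    trust:         (batch, num_teachers) trust scores
    class_index:   list of LongTensors mapping each teacher's classes to union classes
    (for simplicity, the teachers' class sets are assumed disjoint here)
    """
    batch = teacher_probs[0].size(0)
    num_union = sum(idx.numel() for idx in class_index)
    target = torch.zeros(batch, num_union)
    for t, (probs, idx) in enumerate(zip(teacher_probs, class_index)):
        target[:, idx] += trust[:, t:t + 1] * probs   # weight teacher t by its trust
    return target / target.sum(dim=1, keepdim=True).clamp_min(1e-8)
```

The student would then be trained with a distillation-style loss (e.g., KL divergence) between its predictions over the union label set and these combined soft targets, plus a standard cross-entropy term on the small labeled subset used to fit the trust estimator.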




... Two types of neural networks with recurrent layers, LSTM [9] and LSTNet [10], were built to predict electricity imbalances, followed by two linear layers with hyperbolic tangent and sigmoid activation functions, respectively. ...
Article
Full-text available
Improving the quality of short-term forecasts of electricity imbalances in the modern electricity market of Ukraine is a pressing problem. To address it, two types of neural networks with recurrent layers, LSTM and LSTNet, were analyzed in this work. Short-term forecasts of daily schedules of electricity imbalances produced by the LSTM and LSTNet neural networks were compared with those of a vector autoregression model (VARMA). Actual data from the balancing market were used for the research. Analysis of the results shows that the smallest forecast error was achieved with the LSTM artificial neural network architecture.
... Instead, knowledge integration removes this restriction by merging knowledge from multiple teachers with various label sets to train a versatile student model. Recently, the idea of integrating knowledge from models with different skills has been explored in computer vision (Shen et al., 2019; Ye et al., 2019; Luo et al., 2019; Vongkulbhisal et al., 2019) and graph neural networks (Jing et al., 2021), or extended to a semi-supervised setting (Thadajarassiri et al., 2021). To the best of our knowledge, we are the first to explore knowledge integration for PLMs, which is of great practical value as there are abundant released PLMs. ...
Preprint
Full-text available
Investigating better ways to reuse the released pre-trained language models (PLMs) can significantly reduce the computational cost and the potential environmental side-effects. This paper explores a novel PLM reuse paradigm, Knowledge Integration (KI). Without human annotations available, KI aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model. To achieve this, we first derive the correlation between virtual golden supervision and teacher predictions. We then design a Model Uncertainty-aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student. Specifically, MUKI adopts Monte-Carlo Dropout to estimate model uncertainty for the supervision integration. An instance-wise re-weighting mechanism based on the margin of uncertainty scores is further incorporated to deal with potentially conflicting supervision from teachers. Experimental results demonstrate that MUKI achieves substantial improvements over baselines on benchmark datasets. Further analysis shows that MUKI generalizes well for merging teacher models with heterogeneous architectures, and even teachers specializing in cross-lingual datasets.
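As a rough illustration of the two mechanisms named above, a Monte-Carlo Dropout uncertainty estimate and a margin-based instance re-weighting could be sketched as follows; this is a simplified reading of the abstract, not the MUKI reference code, and it omits the step that maps each teacher's label set onto the union label space.

```python
import torch
import torch.nn.functional as F

def mc_dropout_uncertainty(model, x, num_samples=10):
    """Predictive mean and uncertainty from several stochastic forward passes
    with dropout kept active at inference time (Monte-Carlo Dropout)."""
    model.train()                                  # keep dropout layers active
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(num_samples)])
    mean = probs.mean(dim=0)                       # (batch, num_classes)
    entropy = -(mean * mean.clamp_min(1e-8).log()).sum(dim=-1)   # predictive entropy
    return mean, entropy

def margin_weights(teacher_entropies):
    """Instance-wise re-weighting from the margin between the most and least
    certain teacher: a larger margin means the integrated supervision for that
    instance is less ambiguous and is weighted more heavily."""
    stacked = torch.stack(teacher_entropies, dim=1)   # (batch, num_teachers)
    margin = stacked.max(dim=1).values - stacked.min(dim=1).values
    return torch.sigmoid(margin)                      # (batch,) per-instance weights
```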
Preprint
Vision Foundation Models (VFMs) have demonstrated outstanding performance on numerous downstream tasks. However, due to their inherent representation biases originating from different training paradigms, VFMs exhibit advantages and disadvantages across distinct vision tasks. Although amalgamating the strengths of multiple VFMs for downstream tasks is an intuitive strategy, effectively exploiting these biases remains a significant challenge. In this paper, we propose a novel and versatile "Swiss Army Knife" (SAK) solution, which adaptively distills knowledge from a committee of VFMs to enhance multi-task learning. Unlike existing methods that use a single backbone for knowledge transfer, our approach preserves the unique representation bias of each teacher by pairing lightweight Teacher-Specific Adapter Path modules with a Teacher-Agnostic Stem. Through dynamic selection and combination of representations with Mixture-of-Representations Routers, our SAK is capable of synergizing the complementary strengths of multiple VFMs. Extensive experiments show that our SAK remarkably outperforms the prior state of the art in multi-task learning by 10% on the NYUD-v2 benchmark, while also providing a flexible and robust framework that can readily accommodate more advanced model designs.
Article
Most voice spoofing detection methods are designed for specific kinds of spoofing attacks, synthetic or replay. In practice, however, there is no prior information about these two kinds of spoofing attacks. To this end, this paper proposes a generalized voice spoofing detection method based on integral knowledge amalgamation to jointly detect synthetic attacks and replay attacks. Two amalgamation mechanisms, feature amalgamation and structure amalgamation, are designed from different perspectives, so that the model generalizes better and runs fast. Specifically, the feature amalgamation transfers high-level semantic knowledge from two teacher models to the compact model. The structure amalgamation employs adversarial learning to ensure the global structure consistency of the two teacher models and a student model. In addition, a feature matching loss is introduced to capture the distinctive features of synthetic attacks and replay attacks. We conduct extensive experiments on the logical access (LA) and physical access (PA) scenarios of the ASVspoof 2019 dataset to verify the validity of the proposed method. The experimental results show that, compared with the most advanced generalized voice spoofing detection methods, the proposed method achieves comparable or even better performance. In particular, our method achieves state-of-the-art detection capability in the LA scenario. Moreover, our method achieves similar or even better detection performance when compared with specialized anti-spoofing methods.
Article
Heterogeneous Knowledge Amalgamation (HKA) algorithms attempt to learn a versatile and lightweight student neural network from multiple pre-trained heterogeneous teachers. They encourage the student not only to produce the same prediction as the teachers but also to imitate each teacher’s features separately in a learned Common Feature Space (CFS) by using Maximum Mean Discrepancy (MMD). However, there is no theoretical guarantee of the Out-of-Distribution robustness of teacher models in CFS, which can cause an overlap of feature representations when mapping unknown category samples. Furthermore, global alignment MMD can easily result in a negative transfer without considering class-level alignment and the relationships among all teachers. To overcome these issues, we propose a Dual Discriminative Feature Alignment (DDFA) framework, consisting of a Discriminative Centroid Clustering Strategy (DCCS) and a Joint Group Feature Alignment method (JGFA). DCCS promotes the class-separability of the teachers’ features to alleviate the overlap issue. Meanwhile, JGFA decouples the complex discrepancy among teachers and the student at both category and group levels, extending MMD to align the features discriminatively. We test our model on a list of benchmarks and demonstrate that the learned student is robust and even outperforms its teachers in most cases.
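For context, the global-alignment Maximum Mean Discrepancy referred to above can be computed between two sets of feature vectors with a kernel; a minimal RBF-kernel version is sketched below (the class-level and group-level alignments that DDFA adds are not shown).

```python
import torch

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel matrix between the rows of a and the rows of b."""
    return torch.exp(-gamma * torch.cdist(a, b) ** 2)

def mmd_squared(x, y, gamma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between two
    feature samples x and y (e.g., a teacher's and the student's features)."""
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return k_xx + k_yy - 2.0 * k_xy
```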
Article
Heterogeneous model aggregation (HMA) is an effective paradigm that integrates on-device trained models heterogeneous in architecture and target task into a comprehensive model. Recent works adopt knowledge distillation to amalgamate the knowledge of learned features and predictions from heterogeneous on-device models to realize HMA. However, most of them ignore that the disclosure of learned features exposes on-device models to privacy attacks. Moreover, the aggregated model may suffer from the imbalanced supervision caused by the uneven distribution of amalgamated knowledge about each class and show class bias. In this paper, to address these issues, we propose a response-based class-balanced heterogeneous model aggregation mechanism, called CBHMA. It can effectively achieve HMA in a privacy-preserving manner and alleviate class bias in the aggregated model. Specifically, CBHMA aggregates on-device models by using only their response information to reduce their privacy leakage risk. To mitigate the impact of imbalanced supervision, CBHMA quantitatively measures the imbalanced supervision level for each class. Based on that, CBHMA customizes fine-grained misclassification costs for each class and utilizes such costs to adjust the importance of each class (more importance to classes with weaker supervision) in the response-based HMA algorithm. Extensive experiments on two real-world datasets demonstrate the effectiveness of CBHMA.
Article
Full-text available
In addition to a variety of exceptional sensors, smartphones now open rich opportunities in data mining and machine learning for Human Activity Recognition (HAR) systems. As a follow-up to disease treatment, a HAR monitoring system can be used to recognize mental depression, which until now has been overlooked in HAR applications. In this study, smartphone sensor data were collected at a 1 Hz frequency from 20 subjects of different ages. HAR was performed using standard machine learning methods, namely Support Vector Machine, Random Forest, K-Nearest Neighbors, and Artificial Neural Network, to recognize physical activities associated with mental depression. Random Forest performed best, recognizing daily activity patterns with 99.80% accuracy on the validation data set. In addition, sensor data covering the activities performed over the most recent 14 days were collected continuously from the target subjects' smartphones. These data were fed to the optimized Random Forest model to quantify the duration of each activity symptomatic of mental depression. A risk factor was then computed for the probability that an individual is experiencing mental depression. To this end, a questionnaire asking for the duration of activities related to mental depression was used to collect data from 50 patients suffering from mental depression. The similarity of the experimental subjects' activity patterns to those of the 50 depressed patients was then measured. Finally, data collected from the target subjects were compared with the depressed patients using this similarity approach. An average similarity value of 90.94% for the depressed subject and 34.99% for the typical subject indicates that this robust system achieves good performance in measuring risk factors.
Conference Paper
Full-text available
Many well-trained Convolutional Neural Network (CNN) models have now been released online by developers for the sake of effortless reproduction. In this paper, we treat such pre-trained networks as teachers and explore how to learn a target student network for customized tasks, using multiple teachers that handle different tasks. We assume no human-labelled annotations are available, and each teacher model can be either a single- or multi-task network, where the former is a degenerate case of the latter. The student model, depending on the customized tasks, learns the related knowledge filtered from the multiple teachers, and eventually masters the complete or a subset of the expertise of all teachers. To this end, we adopt a layer-wise training strategy, which entangles the student's network block to be learned with the corresponding teachers. As demonstrated on several benchmarks, the learned student network achieves very promising results, even outperforming the teachers on the customized tasks.
Conference Paper
Full-text available
An increasing number of well-trained deep networks have been released online by researchers and developers, enabling the community to reuse them in a plug-and-play way without accessing the training annotations. However, due to the large number of network variants, such publicly available trained models are often of different architectures, each of which is tailored for a specific task or dataset. In this paper, we study a deep-model reusing task, where we are given as input pre-trained networks of heterogeneous architectures specializing in distinct tasks, as teacher models. We aim to learn a multi-talented and lightweight student model that is able to grasp the integrated knowledge from all such heterogeneous-structure teachers, again without accessing any human annotation. To this end, we propose a common feature learning scheme, in which the features of all teachers are transformed into a common space and the student is enforced to imitate them all so as to amalgamate the intact knowledge. We test the proposed approach on a list of benchmarks and demonstrate that the learned student is able to achieve very promising performance, superior to those of the teachers in their specialized tasks.
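A compact sketch of the common-feature-learning idea (project every teacher's features and the student's features into a shared space, then make the student imitate all teachers there) might look like the following; the adapter architecture, the plain L2 imitation loss, and all names are placeholder simplifications of the scheme described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceAdapter(nn.Module):
    """Maps one network's features (teacher or student) into a shared space."""
    def __init__(self, in_dim, common_dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, common_dim), nn.ReLU(),
                                  nn.Linear(common_dim, common_dim))

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)

def amalgamation_loss(student_feats, teacher_feats_list, student_adapter, teacher_adapters):
    """Student imitates every (frozen) teacher in the common space."""
    s = student_adapter(student_feats)
    loss = torch.zeros((), device=s.device)
    for feats, adapter in zip(teacher_feats_list, teacher_adapters):
        t = adapter(feats.detach())        # teacher backbones stay frozen
        loss = loss + F.mse_loss(s, t)
    return loss / len(teacher_feats_list)
```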
Conference Paper
Full-text available
Transfer learning for deep neural networks is the process of first training a base network on a source dataset, and then transferring the learned features (the network's weights) to a second network to be trained on a target dataset. This idea has been shown to improve deep neural networks' generalization capabilities in many computer vision tasks such as image recognition and object localization. Apart from these applications, deep Convolutional Neural Networks (CNNs) have also recently gained popularity in the Time Series Classification (TSC) community. However, unlike for image recognition problems, transfer learning techniques have not yet been investigated thoroughly for the TSC task. This is surprising, as the accuracy of deep learning models for TSC could potentially be improved if the model were fine-tuned from a pre-trained neural network instead of being trained from scratch. In this paper, we fill this gap by investigating how to transfer deep CNNs for the TSC task. To evaluate the potential of transfer learning, we performed extensive experiments using the UCR archive, which is the largest publicly available TSC benchmark, containing 85 datasets. For each dataset in the archive, we pre-trained a model and then fine-tuned it on the other datasets, resulting in 7140 different deep neural networks. These experiments revealed that transfer learning can improve or degrade the model's predictions depending on the dataset used for transfer. Therefore, in an effort to predict the best source dataset for a given target dataset, we propose a new method relying on Dynamic Time Warping to measure inter-dataset similarities. We describe how our method can guide the transfer to choose the best source dataset, leading to an improvement in accuracy on 71 out of 85 datasets.
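The dataset-selection step described above relies on Dynamic Time Warping; a minimal sketch of ranking candidate source datasets by DTW distance to a representative target series is given below. It illustrates the general idea only, not the authors' exact procedure, and the prototype construction is left to the caller.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

def rank_source_datasets(target_prototype, source_prototypes):
    """Rank candidate source datasets by DTW distance between representative
    (prototype) series; the closest dataset is the suggested transfer source."""
    scored = [(name, dtw_distance(target_prototype, proto))
              for name, proto in source_prototypes.items()]
    return sorted(scored, key=lambda pair: pair[1])
```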
Conference Paper
Full-text available
Inspired by the tremendous success of deep Convolutional Neural Networks as generic feature extractors for images, we propose TimeNet: a deep recurrent neural network (RNN) trained on diverse time series in an unsupervised manner using sequence to sequence (seq2seq) models to extract features from time series. Rather than relying on data from the problem domain, TimeNet attempts to generalize time series representation across domains by ingesting time series from several domains simultaneously. Once trained, TimeNet can be used as a generic off-the-shelf feature extractor for time series. The representations or embeddings given by a pre-trained TimeNet are found to be useful for time series classification (TSC). For several publicly available datasets from UCR TSC Archive and an industrial telematics sensor data from vehicles, we observe that a classifier learned over the TimeNet embeddings yields significantly better performance compared to (i) a classifier learned over the embeddings given by a domain-specific RNN, as well as (ii) a nearest neighbor classifier based on Dynamic Time Warping.
Article
Full-text available
Health care is one of the most exciting frontiers in data mining and machine learning. Successful adoption of electronic health records (EHRs) created an explosion in digital clinical data available for analysis, but progress in machine learning for healthcare research has been difficult to measure because of the absence of publicly available benchmark data sets. To address this problem, we propose four clinical prediction benchmarks using data derived from the publicly available Medical Information Mart for Intensive Care (MIMIC-III) database. These tasks cover a range of clinical problems including modeling risk of mortality, forecasting length of stay, detecting physiologic decline, and phenotype classification. We formulate a heterogeneous multitask problem where the goal is to jointly learn multiple clinically relevant prediction tasks based on the same time series data. To address this problem, we propose a novel recurrent neural network (RNN) architecture that leverages the correlations between the various tasks to learn a better predictive model. We validate the proposed neural architecture on this benchmark, and demonstrate that it outperforms strong baselines, including single task RNNs.
Conference Paper
Full-text available
Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91% on CIFAR-10).
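The training trick described above (randomly bypass residual blocks with the identity during training, keep all blocks at test time) can be sketched roughly as follows; the block internals, the single uniform survival probability, and the test-time scaling are simplifications, since the paper uses a linearly decaying survival probability across depth.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that is skipped (identity only) with probability
    1 - survival_prob during training; at test time the residual branch is
    always used but scaled by its survival probability."""
    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return torch.relu(x + self.body(x))
            return x                          # block dropped: shortcut only
        return torch.relu(x + self.survival_prob * self.body(x))
```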
Article
Full-text available
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
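In modern notation (including the forget gate added in later work), the gating mechanism described above reduces to the following cell update; this is a didactic NumPy sketch rather than the original 1997 formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*hidden, input_dim + hidden), b: (4*hidden,).
    The cell state c is the 'constant error carousel': it is updated additively,
    so error can flow back through many steps without vanishing."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate (added after the 1997 paper)
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate cell update
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```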
Article
Smart City is an emerging concept in global urban development. A Smart City applies ICT technologies to provide greater efficiencies for its urban areas and civilian population. One of the key requirements for a Smart City is to exploit data from its ICT infrastructure (such as Internet of Things connected sensors) to improve city services and features such as accessibility and sustainability. To address this requirement, the City of Melbourne (COM) Smart City office maintains several hundred data sets relating to urban activity and development. These datasets address parking, mobility, land use, 3D data, statistics, environment, and major city developments such as rail projects. One promising dataset relates to pedestrian traffic. Data are obtained from sensors and updated on the COM website (City of Melbourne Open Data Platform: https://data.melbourne.vic.gov.au/.) at regular intervals. These data include the number of pedestrians passing 53 specific locations in the central business district and also their times and directions of travel. In a 24 h period, over 650,000 pedestrians were counted passing all locations. Peak rates of several thousand pedestrians per minute are regularly recorded during city rush hours at hotspots making the data amenable to Big Data analysis techniques. Results are obtained in graphical format as heatmaps and charts of city pedestrian traffic using both Microsoft Excel® for static analysis and PowerBI® for more advanced interactive visualisation and analysis. These findings can identify pedestrian hotspots and inform future locations of traffic lights and street configurations to make the city more pedestrian friendly. Further, the experience gained can be used to examine other data sets such as bicycle traffic that can be analysed to inform city infrastructure projects. Future work is suggested that could link these pedestrian flow data with social media data from smartphones and potentially wearable devices such as fitness monitors to correlate pedestrian satisfaction with traffic flow. The ‘happiness’ effect of pedestrians passing through green areas such as city parks can also be quantified. This research was undertaken with the assistance of Swinburne University under its Capstone Project scheme.
Conference Paper
As an important and challenging problem in machine learning and computer vision, neural network acceleration essentially aims to enhance the computational efficiency without sacrificing the model accuracy too much. In this paper, we propose a progressive blockwise learning scheme for teacher-student model distillation at the subnetwork block level. The proposed scheme is able to distill the knowledge of the entire teacher network by locally extracting the knowledge of each block in terms of progressive blockwise function approximation. Furthermore, we propose a structure design criterion for the student subnetwork block, which is able to effectively preserve the original receptive field from the teacher network. Experimental results demonstrate the effectiveness of the proposed scheme against the state-of-the-art approaches.
Conference Paper
We introduce DropConnect, a generalization of Dropout (Hinton et al., 2012), for regularizing large fully-connected layers within neural networks. When training with Dropout, a randomly selected subset of activations are set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit thus receives input from a random subset of units in the previous layer. We derive a bound on the generalization performance of both Dropout and DropConnect. We then evaluate DropConnect on a range of datasets, comparing to Dropout, and show state-of-the-art results on several image recognition benchmarks by aggregating multiple DropConnect-trained models.
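A minimal sketch of the idea follows: instead of zeroing activations as in Dropout, a random mask is applied to the weights of a fully connected layer during training. The class name is illustrative, and the inverted-scaling trick and plain mean-weight inference used here are common simplifications; the paper itself derives a sampling-based Gaussian approximation for inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropConnectLinear(nn.Module):
    """Fully connected layer whose weights (not activations) are randomly
    zeroed during training, in the spirit of DropConnect."""
    def __init__(self, in_features, out_features, keep_prob=0.5):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.keep_prob = keep_prob

    def forward(self, x):
        if self.training:
            mask = torch.bernoulli(torch.full_like(self.linear.weight, self.keep_prob))
            weight = self.linear.weight * mask / self.keep_prob   # keep expectation unchanged
            return F.linear(x, weight, self.linear.bias)
        return self.linear(x)   # simplified inference: use the mean weights
```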
Conference Paper
While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times fewer parameters outperforms a larger, state-of-the-art teacher network.
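The hint-based stage described above can be sketched as follows: a small regressor maps the thinner student's intermediate features to the teacher's hint-layer dimensionality, and an L2 loss matches them before the usual distillation stage. Names and the 1x1-convolution regressor are illustrative choices.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintRegressor(nn.Module):
    """Maps the student's intermediate feature map to the teacher's hint-layer
    channel count so the two can be compared with an L2 (hint) loss."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.adapt = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feats):
        return self.adapt(student_feats)

def hint_loss(student_feats, teacher_feats, regressor):
    """Hint-training objective: match the frozen teacher's intermediate features."""
    return F.mse_loss(regressor(student_feats), teacher_feats.detach())
```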
Conference Paper
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.
Conference Paper
Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.
Conference Paper
The construction of appearance-based object detection systems is time-consuming and difficult because a large number of training examples must be collected and manually labeled in order to capture variations in object appearance. Semi-supervised training is a means for reducing the effort needed to prepare the training set by training the model with a small number of fully labeled examples and an additional set of unlabeled or weakly labeled examples. In this work we present a semi-supervised approach to training object detection systems based on self-training. We implement our approach as a wrapper around the training process of an existing object detector and present empirical results. The key contribution of this empirical study is to demonstrate that a model trained in this manner can achieve results comparable to a model trained in the traditional manner using a much larger set of fully labeled data, and that a training data selection metric that is defined independently of the detector greatly outperforms a selection metric based on the detection confidence generated by the detector.
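The wrapper described above amounts to a standard self-training loop; a schematic Python version is shown below. The detector API (fit/predict) and the selection_score function are placeholders, the latter standing in for the detector-independent selection metric that the paper found to outperform detector confidence.

```python
def self_training(detector, labeled, pool, selection_score, rounds=5, top_k=100):
    """Schematic self-training wrapper around an existing detector.
    `detector` exposes fit(labeled_pairs) and predict(example) (placeholder API);
    `selection_score(example, detector)` is the data-selection metric, ideally
    defined independently of the detector's own confidence."""
    for _ in range(rounds):
        detector.fit(labeled)
        # score the unlabeled pool and promote the best-scoring examples
        ranked = sorted(pool, key=lambda ex: selection_score(ex, detector), reverse=True)
        promoted, pool = ranked[:top_k], ranked[top_k:]
        labeled = labeled + [(ex, detector.predict(ex)) for ex in promoted]
    return detector
```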
Alcock, R. J.; Manolopoulos, Y.; et al. 1999. Time-series similarity queries employing a feature-based approach. In Hellenic Conference on Informatics, 27-29.
Chen, X.; Su, J.; and Zhang, J. 2019. A Two-Teacher Framework for Knowledge Distillation. In Proceedings of ISNN, 58-66.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. In Proceedings of ICLR.
Lines, J.; and Bagnall, A. 2014. Ensembles of elastic distance measures for time series classification. In Proceedings of SDM, 524-532.
Razavian, N.; Marcus, J.; and Sontag, D. 2016. Multi-task prediction of disease onsets from longitudinal laboratory tests. In Proceedings of MLHC, 73-100.
Shen, C.; Wang, X.; Song, J.; Sun, L.; and Song, M. 2019a. Amalgamating knowledge towards comprehensive classification. In Proceedings of AAAI, 3068-3075.
Singh, S.; Hoiem, D.; and Forsyth, D. 2016. Swapout: Learning an ensemble of deep architectures. In Proceedings of NeurIPS, 28-36.
Yang, X.; Chen, Y.; Yu, H.; Zhang, Y.; Lu, W.; and Sun, R. 2020. Instance-Wise Dynamic Sensor Selection for Human Activity Recognition. In Proceedings of AAAI, 1104-1111.