Article

DeepType: On-Device Deep Learning for Input Personalization Service with Minimal Privacy Concern

Abstract

Mobile users spend an extensive amount of time typing. A more efficient text input method can significantly enhance the user experience. Deep learning techniques have recently been applied to suggesting the next words of input, but to achieve more accurate predictions, these models should be customized for individual users. Personalization, however, often comes at the expense of privacy. Existing solutions require users to upload the historical logs of their input text to the cloud so that a deep learning predictor can be trained. In this work, we propose a novel approach, called DeepType, to personalize text input with better privacy. The basic idea is intuitive: train deep learning predictors on the device instead of on the cloud, so that the model is personalized while private data never leaves the device. With DeepType, a global model is first trained on the cloud using massive public corpora, and personalization is done by incrementally customizing the global model with data on individual devices. We further propose a set of techniques that effectively reduce the computation cost of training deep learning models on mobile devices with negligible accuracy loss. Experiments using real-world text input from millions of users demonstrate that DeepType significantly improves input efficiency for individual users, and the computation and energy costs it incurs are within the performance and battery constraints of typical COTS mobile devices.
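
The workflow described above can be pictured with a small sketch. The model structure, the choice to freeze all but the output layer, and the hyperparameters below are illustrative assumptions, not the paper's exact design; the sketch only shows the cloud-pretrained, on-device-personalized pattern.
```python
# Hypothetical sketch of a cloud-pretrained, on-device-personalized next-word predictor.
# Layer-freezing choice and hyperparameters are assumptions, not the paper's actual techniques.
import torch
import torch.nn as nn

class NextWordLM(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                     # next-word logits at each position

def personalize_on_device(model, local_batches, epochs: int = 1, lr: float = 1e-3):
    # Freeze the large embedding and recurrent layers; adapt only the output layer,
    # which keeps on-device training cheap (one plausible reading of the approach).
    for p in model.parameters():
        p.requires_grad = False
    for p in model.out.parameters():
        p.requires_grad = True
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for tokens, targets in local_batches:  # targets: tokens shifted by one
            logits = model(tokens)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```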

... Supervised subspace approaches to the SSS problem comprise algorithm-based and transform-based strategies. In an algorithm-based strategy, an alternative formulation of LDA is used to solve the SSS problem [65], [72], [81], [89], [90]. In a transform-based method such as PCA+LDA, the original image data is first projected into a lower-dimensional subspace, and LDA is then used to extract the features. ...
... Therefore, subspace methods are applied in transform domains such as the Fourier [96], Gabor [98], and discrete cosine [67], [90] transforms. These references decompose the palmprint image into the wavelet domain and extract angular radial energy information and sets of global statistical signatures to characterize its directional context; [67] also proposed training an advanced correlation filter for each palm. ...
Article
Full-text available
Information systems in organizations traditionally require users to remember their secret pins (passwords), tokens, card numbers, or a combination of these to confirm their identities. However, the technological trend has been moving towards personal identification based on individual behavioural attributes (such as gait, signature, and voice) or physiological ones (such as palmprint, fingerprint, face, iris, or ear). These attributes (biometrics) offer many advantages over knowledge- and possession-based approaches. For example, palmprint images have rich, unique features for reliable human identification, and palmprint recognition has received significant attention due to its stability, reliability, uniqueness, and non-intrusiveness. This paper provides an overview and evaluation of contactless palmprint recognition systems, the state-of-the-art performance of existing works, different types of “Region of Interest” (ROI) extraction algorithms, feature extraction, and matching algorithms. Finally, the findings obtained are presented and discussed.
... It is also fundamentally challenging because DL models are known to be very complex and cumbersome [21,30,51]. Consequently, optimizing the inference performance has been the theme of both academia [22,73,77] and industry [8,12,14] in recent years. ...
... Mobile DL In recent years, there has been a notable trend to move DL inference onto local devices instead of offloading it to remote servers [41,43,[73][74][75][76][77]. A fundamental challenge of this trend is the constrained resources of smartphones. ...
Preprint
Full-text available
Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast inference of on-device DL, DL libraries play a critical role as algorithms and hardware do. Unfortunately, no prior work ever dives deep into the ecosystem of modern DL libs and provides quantitative results on their performance. In this paper, we first build a comprehensive benchmark that includes 6 representative DL libs and 15 diversified DL models. We then perform extensive experiments on 10 mobile devices, which help reveal a complete landscape of the current mobile DL libs ecosystem. For example, we find that the best-performing DL lib is severely fragmented across different models and hardware, and the gap between those DL libs can be rather huge. In fact, the impacts of DL libs can overwhelm the optimizations from algorithms or hardware, e.g., model quantization and GPU/DSP-based heterogeneous computing. Finally, atop the observations, we summarize practical implications to different roles in the DL lib ecosystem.
... Use cases of ubiquitous learning abound: input method [1], virtual assistant [2], item recommendation [13], etc. While the learning protocols may diversify, e.g., federated learning [12], split learning [15], and local transfer learning [19], they all rely on a fundamental system component: ondevice training library. Given the highly constrained resources on edge devices and the huge resource demand of deep learning training stage, it's intuitive to ask whether edge devices can really afford training modern NN models and if so, do current libraries efficiently support that? ...
... To reach a usable accuracy, the training phase often takes a substantial period of time, e.g., minutes for each round of federated learning [4] or even hours for continuous local transfer learning [19]. Such a long duration of intensive computation may lead to thermal issues and therefore the CPU frequency change due to dynamic voltage and frequency scaling (DVFS) even without any other applications running. ...
Conference Paper
Full-text available
We are witnessing the emergence of ubiquitous learning, where each device (smartphones, wearables, IoTs, etc.) can learn from its environment either alone or collaboratively. Such a new paradigm is enabled by deep learning techniques, or more specifically, on-device training. Given its popularity in the machine learning community, unfortunately, there is no systematic understanding of a critical question: how much cost does it take to train typical deep models on commodity end devices? Therefore, this work performs comprehensive measurements of on-device training with the state-of-the-art training library, 6 mobile phones, and 5 classical neural networks. Our measurements report metrics of training time, energy consumption, memory footprint, hardware utilization, and thermal dynamics, thus helping reveal a complete landscape of on-device training performance. The observations from the measurements guide us to several promising future directions for efficiently enabling ubiquitous learning.
... This baseline is currently deployed in the mobile WeChat IME app, guarantees data anonymization, and is introduced to validate the necessity of personalization. (2) On-Device Learning for Model Personalization (On-Device) [35,38], which lets each mobile device download the cloud-based model and fine-tune it on the local training data. Each user's personalized model needs to be deployed on the mobile device for inference, because it is unaffordable for the cloud to maintain a large number of user-specific models. ...
Preprint
In many practical natural language applications, user data are highly sensitive, requiring anonymous uploads of text data from mobile devices to the cloud without user identifiers. However, the absence of user identifiers restricts the ability of cloud-based language models to provide personalized services, which are essential for catering to diverse user needs. The trivial method of replacing an explicit user identifier with a static user embedding as model input still compromises data anonymization. In this work, we propose to let each mobile device maintain a user-specific distribution to dynamically generate user embeddings, thereby breaking the one-to-one mapping between an embedding and a specific user. We further theoretically demonstrate that to prevent the cloud from tracking users via uploaded embeddings, the local distributions of different users should either be derived from a linearly dependent space to avoid identifiability or be close to each other to prevent accurate attribution. Evaluation on both public and industrial datasets using different language models reveals a remarkable improvement in accuracy from incorporating anonymous user embeddings, while preserving real-time inference requirement.
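As a rough illustration of the idea of sampling embeddings from a device-local distribution (not the paper's exact construction), one could keep a per-user Gaussian on the device and draw a fresh embedding for every upload:
```python
# Illustrative sketch: a device-local Gaussian over embeddings; a new sample is drawn for
# each request, so no fixed vector identifies the user. Dimensions and scales are made up.
import numpy as np

class LocalUserEmbedding:
    def __init__(self, dim=64, scale=0.1, seed=None):
        rng = np.random.default_rng(seed)
        self.mean = rng.normal(0.0, 1.0, size=dim)   # user-specific mean, never uploaded
        self.scale = scale                            # larger scale -> harder attribution

    def sample(self) -> np.ndarray:
        # A fresh embedding per upload breaks the one-to-one mapping between an uploaded
        # vector and a specific user.
        return self.mean + self.scale * np.random.default_rng().normal(size=self.mean.shape)

local = LocalUserEmbedding(dim=64, scale=0.2)
anonymous_embedding = local.sample()   # sent to the cloud model instead of a user ID
```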
... Top layer fine tuning (Top): This method finetunes only the parameters of the task-specific layers, which is widely used for most on-device models (Xu et al., 2018). ...
... The fine-tuning is often performed on data generated by applications on user devices, e.g., input methods [89], emails [46,73], and instant messaging [71]. Such data is private by nature and cannot be collected arbitrarily, in order to respect users' privacy concerns and legal regulations like GDPR [79]. ...
... Still, the models used in these applications are generic, which (in most cases) are not tailored to the data currently being experienced at the device. As the next step of further improving sensing application performance, system designers are now examining the possibilities of performing some level of local model training [26,83]. Despite the computing resource constraints of embedded and mobile sensing platforms, locally training deep learning models (even a small amount) can accelerate the design of more personalized and customized services [87]. ...
Article
Full-text available
This work presents AttFL, a federated learning framework designed to continuously improve a personalized deep neural network for efficiently analyzing time-series data generated from mobile and embedded sensing applications. To better characterize time-series data features and efficiently abstract model parameters, AttFL appends a set of attention modules to the baseline deep learning model and exchanges their feature map information to gather collective knowledge across distributed local devices at the server. The server groups devices with similar contextual goals using cosine similarity, and redistributes updated model parameters for improved inference performance at each local device. Specifically, unlike previously proposed federated learning frameworks, AttFL is designed specifically to perform well for various recurrent neural network (RNN) baseline models, making it suitable for many mobile and embedded sensing applications producing time-series sensing data. We evaluate the performance of AttFL and compare it with five state-of-the-art federated learning frameworks using three popular mobile/embedded sensing applications (e.g., physiological signal analysis, human activity recognition, and audio processing). Our results obtained from CPU core-based emulations and a 12-node embedded platform testbed show that AttFL outperforms all alternative approaches in terms of model accuracy and communication/computational overhead, and is flexible enough to be applied in various application scenarios exploiting different baseline deep learning model architectures.
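A minimal sketch of the server-side grouping step described above might look as follows; the similarity threshold and the simple within-group averaging are assumptions, not AttFL's actual aggregation rule.
```python
# Sketch: cluster devices whose uploaded feature summaries are similar (cosine similarity),
# then aggregate parameters within each group. Threshold and averaging are illustrative.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def group_devices(summaries, threshold=0.8):
    """summaries: dict device_id -> 1-D feature-summary vector."""
    groups = []                                   # list of lists of device ids
    for dev, vec in summaries.items():
        for g in groups:
            if cosine(vec, summaries[g[0]]) >= threshold:
                g.append(dev)                     # join the first sufficiently similar group
                break
        else:
            groups.append([dev])                  # otherwise start a new group
    return groups

def aggregate(group, params):
    """params: dict device_id -> parameter vector; simple within-group average."""
    return np.mean([params[d] for d in group], axis=0)
```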
... Developing machine learning models on a generic population, where data is collected from various data sources, and fine-tuning these models to specific populations or individuals is of interest [123][124][125]. At its core, this idea of personalization is aimed at adapting to specific persons to provide for better activity monitoring. ...
Article
Full-text available
With the growing interest in smart home environments and in providing seamless interactions with various smart devices, robust and reliable human activity recognition (HAR) systems are becoming essential. Such systems provide automated assistance to residents or to longitudinally monitor their daily activities for health and well-being assessments, as well as for tracking (long-term) behavior changes. These systems thus contribute towards an understanding of the health and continued well-being of residents. Smart homes are personalized settings where residents engage in everyday activities in their very own idiosyncratic ways. In order to provide a fully functional HAR system that requires minimal supervision, we provide a systematic analysis and a technical definition of the lifespan of activity recognition systems for smart homes. Such a designed lifespan provides for the different phases of building the HAR system, where these different phases are motivated by an application scenario that is typically observed in the home setting. Through the aforementioned phases, we detail the technical solutions that are required to be developed for each phase such that it becomes possible to derive and continuously improve the HAR system through data-driven procedures. The detailed lifespan can be used as a framework for the design of state-of-the-art procedures corresponding to the different phases.
... The solutions that have already tackled similar challenges, such as keyboard prediction [51], language translation [61], or even chatbots [62,63], have shown that deep learning approaches provide sufficiently strong prediction capabilities. Again, various structures of neural networks can be applied to predict or anticipate the outcome of an unstructured or semi-structured string of inputs [64]. Here, recurrent neural network (RNN) architectures, transformer-based architectures, or long short-term memory (LSTM) architectures have been applied [65][66][67] to varied degrees of success. ...
Article
Full-text available
The benefits of design decision support systems (DDSSs) in the architectural planning context have been proven in research and are increasingly used in practice. The sense and purpose are apparent. The weighing of the most diverse ideas and approaches is required for design problems that cannot be solved unambiguously and are characterized by complex, open issues of architectural design tasks, coupled with contradictory criteria. DDSSs support planners/decision-makers with objective information to support the decision-making process with well-founded data and statements. This is becoming increasingly necessary, especially given increasingly complex construction tasks, and thus the difficult-to-predict effects of decisions. Taking this maxim into account, however, also reveals challenges in the planning context, as well as the immense potential and fields of application. Building on these issues, this article presents a perspective for DDSSs. The paper discusses the current focus and advancements of such systems, highlighting the challenges such tools still face, and provides a vision of the future of these systems, from reactive systems to proactive assistance.
... speedup on Jetson TX2 compared to full training. On all datasets, its training time is within 2.7 hours, which matches the average idle time available for on-device training per day [87]. Its training speedup also outperforms that of BN+Bias by up to 30%. ...
Conference Paper
Full-text available
On-device training is essential for neural networks (NNs) to continuously adapt to new online data, but can be time-consuming due to the device's limited computing power. To speed up on-device training, existing schemes select the trainable NN portion offline or conduct unrecoverable selection at runtime, but the evolution of the trainable NN portion is constrained and cannot adapt to the current need for training. Instead, runtime adaptation of on-device training should be fully elastic, i.e., every NN substructure can be freely removed from or added to the trainable NN portion at any time in training. In this paper, we present ElasticTrainer, a new technique that enforces such elasticity to achieve the required training speedup with the minimum NN accuracy loss. Experiment results show that ElasticTrainer achieves up to 3.5× more training speedup in wall-clock time and reduces energy consumption by 2×-3× more compared to the existing schemes, without noticeable accuracy loss.
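The elasticity idea, i.e., any substructure can be removed from or added to the trainable portion at any time, can be sketched as a periodic re-selection of which submodules carry gradients. The importance score and parameter budget below are placeholders, not ElasticTrainer's algorithm.
```python
# Hedged sketch of elastic trainable-portion selection at runtime: every interval, re-pick
# which submodules are trainable under a parameter budget. Scoring rule is illustrative.
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

def reselect_trainable(model: nn.Module, scores: dict, budget: int):
    """scores: submodule name -> importance; budget: max number of trainable parameters."""
    for _, m in model.named_children():
        set_trainable(m, False)                       # any substructure can be removed...
    used = 0
    children = dict(model.named_children())
    for name, _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        sub = children[name]
        n = sum(p.numel() for p in sub.parameters())
        if used + n <= budget:
            set_trainable(sub, True)                  # ...or added back at any time
            used += n
```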
... LMs and Deep learning methods have been used for a plethora of downstream tasks for a long time (Yin et al., 2018; Li et al., 2017; Das, 2015; Gupta et al., 2020, 2021b; Husain et al., 2019; Feng et al., 2020; Vijayakumar et al., 2018). Several recent works have leveraged NLP methods and simple sampling methods for different downstream results (Xu et al., 2018; Alon et al., 2018; Allamanis et al., 2017; Balog et al., 2016; Ogundokun et al., 2022; Kehinde et al., 2022; Gupta et al., 2019). ...
Preprint
Full-text available
Event detection refers to identifying event occurrences in a text and comprises two subtasks: event identification and classification. We present EDM3, a novel approach for Event Detection that formulates three generative tasks: identification, classification, and combined detection. We show that EDM3 helps to learn transferable knowledge that can be leveraged to perform Event Detection and its subtasks concurrently, mitigating the error propagation inherent in pipelined approaches. Unlike previous dataset- or domain-specific approaches, EDM3 utilizes the existing knowledge of language models, allowing it to be trained over any classification schema. We evaluate EDM3 on multiple event detection datasets: RAMS, WikiEvents, MAVEN, and MLEE, showing that EDM3 outperforms 1) single-task performance by 8.4% on average and 2) multi-task performance without instructional prompts by 2.4% on average. We obtain SOTA results on RAMS (71.3% vs. 65.1% F-1) and competitive performance on other datasets. We analyze our approach to demonstrate its efficacy in low-resource and multi-sentence settings. We also show the effectiveness of this approach on non-standard event configurations such as multi-word and multi-class event triggers. Overall, our results show that EDM3 is a promising approach for Event Detection that has the potential for real-world applications.
... Concurrently, with the advancement of hardware and compression techniques on mobile devices, the offline inference of lightweight DNNs has become feasible [6], enabling locally deployed inference that eliminates the risk of information leakage and the high latency caused by internet connections. Meanwhile, more companies regard on-device deployment of DNNs as a promising trend to offer more diverse and secure services to customers, such as personalized text prediction [62], personal item recommendation [21], and image classification [24]. However, limited resources remain a major challenge in practical, runtime mobile environments, where massive and consistent model inference can cause high energy consumption and device overheating, resulting in excessive latency and decision-making failure, especially in time-sensitive tasks. ...
Preprint
In recent years, on-device deep learning has gained attention as a means of developing affordable deep learning applications for mobile devices. However, on-device models are constrained by limited energy and computation resources. In the meantime, a poisoning attack known as sponge poisoning has been developed. This attack involves feeding the model with poisoned examples to increase the energy consumption during inference. As previous work focuses on server hardware accelerators, in this work we extend the sponge poisoning attack to an on-device scenario to evaluate the vulnerability of mobile device processors. We present an on-device sponge poisoning attack pipeline to simulate the streaming and consistent inference scenario to bridge the knowledge gap in the on-device setting. Our exclusive experimental analysis with processors and on-device networks shows that sponge poisoning attacks can effectively pollute the modern processor with its built-in accelerator. We analyze the impact of different factors in the sponge poisoning algorithm and highlight the need for improved defense mechanisms to prevent such attacks on on-device deep learning applications.
... LMs and Deep learning methods have been used for a plethora of downstream tasks for a long time (Yin et al., 2018; Li et al., 2017; Das, 2015; Gupta et al., 2020, 2021b; Husain et al., 2019; Feng et al., 2020; Vijayakumar et al., 2018). Several recent works have leveraged NLP methods and simple sampling methods for different downstream results (Xu et al., 2018; Alon et al., 2018; Allamanis et al., 2017; Balog et al., 2016; Ogundokun et al., 2022; Kehinde et al., 2022; Gupta et al., 2019). The study of whether existing LMs can understand instructions by Efrat and Levy (2020) ... (2022) showed that adding knowledge with instruction helps LMs understand the context better. ...
Preprint
Full-text available
In this paper, we present InstructABSA, Aspect-Based Sentiment Analysis (ABSA) using the instruction learning paradigm for all ABSA subtasks: Aspect Term Extraction (ATE), Aspect Term Sentiment Classification (ATSC), and Joint Task modeling. Our method introduces positive, negative, and neutral examples to each training sample, and instruction-tunes the model (Tk-Instruct Base) for each ABSA subtask, yielding significant performance improvements. Experimental results on the SemEval 2014 dataset demonstrate that InstructABSA outperforms the previous state-of-the-art (SOTA) approaches on all three ABSA subtasks (ATE, ATSC, and Joint Task) by a significant margin, outperforming 7x larger models. In particular, InstructABSA surpasses the SOTA on the restaurant ATE subtask by 7.31 percentage points and on the Laptop Joint Task by 8.63 percentage points. Our results also suggest a strong generalization ability to unseen tasks across all three subtasks.
... DL model serving has become a pivotal module in modern intelligent applications and has received extensive attention from both academia and industry [13,17,56,58,64,105]. Existing work has focused on optimizing DL performance on traditional data-center-level CPUs and GPUs. Meanwhile, on-device DL is rising towards low inference delay and user privacy preservation [79,82,88,115,123,124] powered by the continuously improving SoC hardware. This trend has inspired us to treat DL serving as a killer workload to be supported by SoC-Cluster. ...
Preprint
Huge electricity consumption is a severe issue for edge data centers. To this end, we propose a new form of edge server, namely SoC-Cluster, that orchestrates many low-power mobile system-on-chips (SoCs) through an on-chip network. For the first time, we have developed a concrete SoC-Cluster server that consists of 60 Qualcomm Snapdragon 865 SoCs in a 2U rack. Such a server has been commercialized successfully and deployed at large scale on edge clouds. The current dominant workload on those deployed SoC-Clusters is cloud gaming, as mobile SoCs can seamlessly run native mobile games. The primary goal of this work is to demystify whether SoC-Cluster can efficiently serve more general-purpose, edge-typical workloads. Therefore, we built a benchmark suite that leverages state-of-the-art libraries for two killer edge workloads, i.e., video transcoding and deep learning inference. The benchmark comprehensively reports the performance, power consumption, and other application-specific metrics. We then performed a thorough measurement study and directly compared SoC-Cluster with traditional edge servers (with Intel CPU and NVIDIA GPU) with respect to physical size, electricity, and billing. The results reveal the advantages of SoC-Cluster, especially its high energy efficiency and the ability to proportionally scale energy consumption with various incoming loads, as well as its limitations. The results also provide insightful implications and valuable guidance to further improve SoC-Cluster and land it in broader edge scenarios.
... The technique of fine-tuning, which was initially proposed in the context of big model pre-training to quickly adapt to a specific downstream task (Devlin et al. 2019; Feichtenhofer et al. 2022), is quite convenient and efficient to apply to resource-constrained mobile devices. For example, Xu et al. (2018) focused on the next-word prediction task and proposed to leverage on-device fine-tuning for model personalization, thereby improving prediction accuracy. Their focus was on how to reduce overhead by vocabulary and model compression. ...
Preprint
Full-text available
To meet the practical requirements of low latency, low cost, and good privacy in online intelligent services, more and more deep learning models are offloaded from the cloud to mobile devices. To further deal with cross-device data heterogeneity, the offloaded models normally need to be fine-tuned with each individual user's local samples before being put into real-time inference. In this work, we focus on the fundamental click-through rate (CTR) prediction task in recommender systems and study how to effectively and efficiently perform on-device fine-tuning. We first identify the bottleneck issue that each individual user's local CTR (i.e., the ratio of positive samples in the local dataset for fine-tuning) tends to deviate from the global CTR (i.e., the ratio of positive samples in all the users' mixed datasets on the cloud for training the initial model). We further demonstrate that such a CTR drift problem makes on-device fine-tuning even harmful to item ranking. We thus propose a novel label correction method, which requires each user only to change the labels of the local samples ahead of on-device fine-tuning and can well align the local prior CTR with the global CTR. The offline evaluation results over three datasets and five CTR prediction models as well as the online A/B testing results in Mobile Taobao demonstrate the necessity of label correction in on-device fine-tuning and also reveal the improvement over cloud-based learning without fine-tuning.
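A naive way to picture the label-correction idea (not the paper's exact method) is to relabel a few local samples so that the local positive ratio matches the global CTR before fine-tuning:
```python
# Naive illustration: adjust a user's local labels so the local positive ratio matches the
# global CTR ahead of on-device fine-tuning. Flipping rule is an assumption for the sketch.
import numpy as np

def correct_labels(labels: np.ndarray, global_ctr: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    target_pos = int(round(global_ctr * len(labels)))
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    if len(pos_idx) > target_pos:                      # local CTR too high: demote extras
        flip = rng.choice(pos_idx, len(pos_idx) - target_pos, replace=False)
        labels[flip] = 0
    elif len(pos_idx) < target_pos:                    # local CTR too low: promote some
        flip = rng.choice(neg_idx, target_pos - len(pos_idx), replace=False)
        labels[flip] = 1
    return labels
```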
... There are mainly two paths to address the preceding problems raised by FL. One path is on-device training with no raw data or intermediate results uploaded [10,47], which suffers from the over-fitting problem. The other path is decentralized FL [14,15,17,19,30,39,40], a new FL paradigm that leverages a blockchain to coordinate the model aggregation and update parameters in a decentralized manner. ...
Preprint
Full-text available
Federated learning (FL) is an emerging and promising privacy-preserving machine learning paradigm and has attracted more and more attention from researchers and developers. FL keeps users' private data on devices and exchanges the gradients of local models to cooperatively train a shared Deep Learning (DL) model on central custodians. However, the security and fault tolerance of FL have been increasingly discussed, because its central custodian mechanism or star-shaped architecture can be vulnerable to malicious attacks or software failures. To address these problems, Swarm Learning (SL) introduces a permissioned blockchain to securely onboard members and dynamically elect the leader, which allows performing DL in an extremely decentralized manner. Despite the tremendous attention to SL, there are few empirical studies on SL or blockchain-based decentralized FL that provide comprehensive knowledge of best practices and precautions for deploying SL in real-world scenarios. Therefore, we conduct the first comprehensive study of SL to date, to the best of our knowledge, to fill the knowledge gap between SL deployment and developers. In this paper, we conduct various experiments on 3 public datasets around 5 research questions, present interesting findings, quantitatively analyze the reasons behind these findings, and provide developers and researchers with practical suggestions. The findings evidence that SL is suitable for most application scenarios, no matter whether the dataset is balanced, polluted, or biased over irrelevant features.
Article
Generative tasks, such as text generation and question answering, are essential for mobile applications. Given their inherent privacy sensitivity, executing them on devices is demanded. Nowadays, the execution of these generative tasks heavily relies on Large Language Models (LLMs). However, scarce device memory severely hinders the scalability of these models. We present EdgeLLM, an efficient on-device LLM inference system for models whose sizes exceed the device's memory capacity. EdgeLLM is built atop speculative decoding, which delegates most tokens to a smaller, memory-resident (draft) LLM. EdgeLLM integrates three novel techniques: (1) Instead of generating a token tree of fixed width and depth, EdgeLLM proposes compute-efficient branch navigation and verification to pace the progress of different branches according to their acceptance probability, preventing the wasteful allocation of computing resources to wrong branches and verifying them all at once efficiently. (2) It uses a self-adaptive fallback strategy that promptly initiates the verification process when the smaller LLM generates an incorrect token. (3) To avoid blocking generation, EdgeLLM speculatively generates tokens during large-LLM verification with a compute-IO pipeline. Through extensive experiments, EdgeLLM exhibits impressive token generation speed, up to 9.3× faster than existing engines.
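For intuition, a generic speculative-decoding loop is sketched below; EdgeLLM's branch navigation, fallback strategy, and compute-IO pipelining are not modeled. `draft_next` and `target_next` are hypothetical greedy-decoding callables, and real systems verify all proposals in one batched forward pass rather than one token at a time.
```python
# Generic speculative decoding: a small draft model proposes k tokens, the large model
# verifies them, and the longest agreeing prefix is kept. Simplified greedy verification.
def speculative_decode(prompt, draft_next, target_next, k=4, max_new=64, eos=None):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) The small, memory-resident draft model proposes k tokens cheaply.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Verify with the large model; accept the agreeing prefix, then emit the large
        #    model's own token at the first disagreement.
        accepted = 0
        for i, t in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if expected == t:
                accepted += 1
            else:
                tokens.extend(proposal[:accepted] + [expected])
                break
        else:
            tokens.extend(proposal)                   # all k proposals accepted
        if eos is not None and tokens[-1] == eos:
            break
    return tokens
```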
Article
The rapid development of Low Earth Orbit (LEO) satellite constellations offers significant potential for in-orbit services, particularly in mitigating the impact of sudden natural disasters. However, the massive data collected by these satellites are often large and severely constrained by limited transmission capabilities when sending data to the ground. Satellite computing, which utilizes onboard computational capacity to process data before transmission, presents a promising solution to alleviate the downlink burden. Nonetheless, this paradigm introduces another bottleneck: limited onboard computing capacity, resulting in slow in-orbit processing and poor results. Current satellite computing systems struggle to efficiently address both data transmission and computing bottlenecks, particularly for urgent disaster services that demand accurate and timely results. Thus, we introduce an efficient satellite computing system designed to jointly mitigate these bottlenecks, thereby providing better service. The core idea is to utilize onboard computing capacity for swift in-orbit annotation of image regions, enabling adaptive compression and download based on annotation confidence and perceived downlink availability. Once the data is downloaded, image restoration and re-inference are performed on the ground to enhance accuracy. Compared to satellite-only inference, our system demonstrates an average improvement in inference accuracy of 3.8%. Furthermore, compared to ground-only inference, with only a 2.8% accuracy loss, our system achieves a 38.4% reduction in response time and saves 71.6% of downlink volume on average.
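The confidence-aware download policy described above can be illustrated with a toy planner; the confidence threshold and per-region byte costs below are invented for the example.
```python
# Toy downlink planner: regions the onboard model is unsure about are sent at higher quality
# (more bytes) for ground re-inference; confident regions are compressed aggressively.
def plan_downlink(regions, budget_bytes):
    """regions: list of dicts {"id", "confidence", "hi_bytes", "lo_bytes"}."""
    plan, used = [], 0
    # Prioritize low-confidence regions: they benefit most from restoration on the ground.
    for r in sorted(regions, key=lambda r: r["confidence"]):
        want_hi = r["confidence"] < 0.7
        cost = r["hi_bytes"] if want_hi else r["lo_bytes"]
        if used + cost > budget_bytes:                # fall back to the cheaper option
            want_hi, cost = False, r["lo_bytes"]
            if used + cost > budget_bytes:
                continue                              # skip region when budget is exhausted
        plan.append((r["id"], "high" if want_hi else "low"))
        used += cost
    return plan
```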
Article
Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference is performed directly within Web browsers. The actual performance of in-browser inference and its impact on the quality of experience (QoE) remain unexplored, and urgently require new QoE measurements beyond traditional ones, e.g., those mainly focusing on page load time. To bridge this gap, we make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. Our extensive analysis involves 9 representative DL models across Web browsers of 50 popular PC devices and 20 mobile devices. The results reveal that in-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices. The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively. Furthermore, we identify contributing factors to such a latency gap, including underutilized hardware instruction sets, inherent overhead in the runtime environment, resource contention within the browser, and inefficiencies in software libraries and GPU abstractions. Additionally, in-browser inference imposes significant memory demands, at times exceeding 334.6 times the size of the DL models themselves, partly attributable to suboptimal memory management. We also observe that in-browser inference leads to a significant 67.2% increase in the time it takes for GUI components to render within Web browsers, significantly affecting the overall user QoE of Web applications reliant on this technology.
Article
In the realm of industrial edge computing, a novel server architecture known as SoC-Cluster, characterized by its aggregation of numerous mobile systems-on-chips (SoCs), has emerged as a promising solution owing to its enhanced energy efficiency and seamless integration with prevalent mobile applications. Despite its advantages, the utilization of SoC-Cluster servers remains unsatisfactory, primarily attributed to the tidal patterns of user-initiated workloads. To address such inefficiency, we introduce SoCFlow+, a pioneering framework designed to facilitate the co-location of deep learning training tasks on SoC-Cluster servers, thereby optimizing resource utilization. SoCFlow+ incorporates three novel techniques tailored to mitigate the inherent limitations of commercial SoC-Cluster servers. First, it employs group-wise parallelism complemented by delayed aggregation, a strategy engineered to enhance the training efficiency and scalability of deep learning models, effectively circumventing network bottlenecks. Second, it integrates a data-parallel mixed-precision training algorithm, optimized to fully exploit the heterogeneous processing capabilities inherent to mobile SoCs. Third, SoCFlow+ employs an underclocking-aware workload re-balancing mechanism to tackle the training performance degradation caused by the thermal control of mobile SoCs. Through rigorous experimental validation, SoCFlow+ achieves a convergence speedup ranging from 1.6× to 740× across 32 SoCs, compared to conventional benchmarks. Furthermore, when juxtaposed with commodity GPU servers (e.g., NVIDIA V100) under identical power constraints, SoCFlow+ not only exhibits comparable training speed but also achieves a remarkable reduction in energy consumption by a factor of 2.31× to 10.23×, all while preserving convergence accuracy.
Article
With the rapid development of location acquisition technologies, massive mobile trajectories have been collected and made available to us, which support a fantastic way of understanding and modeling individuals’ mobility. However, existing data-driven methods either fail to capture the long-range dependency or suffer from a high computational cost. To overcome these issues, we propose a knowledge-driven framework for mobility prediction, which leverages knowledge graphs (KG) to formulate the mobility prediction task into the KG completion problem through integrating the structured “knowledge” from the mobility data. However, most related mobility prediction works only focus on the structured information encoded in existing triples, which ignores the rich semantic information of relation paths composed of multiple relation triples. In this paper, we apply a dedicated module to extract the supplementary semantic structure of paths in KG, which contributes to the interpretability and accuracy of our model. Specifically, the extracted rules are applied to capture the dependencies between relational facts. Moreover, by incorporating user information in the entity-relation space with the corresponding hyperplane, our method could capture diverse user mobility patterns and model the personal characteristics of users to improve the accuracy of mobility prediction. Extensive evaluations illustrate that our proposed model beats state-of-the-art mobility prediction algorithms, which verifies the superiority of utilizing logical rules and user hyperplanes. Our implementation code is available at https://github.com/tsinghua-fib-lab/RulekG-MobiPre.git
Chapter
Deep neural networks (DNNs) are usually trained on servers, and then the trained model is deployed to edge devices. However, the pre-trained models are static and can be inaccurate when the inputs from the new environment are very different from the pre-training data. Therefore, the models on devices need to be continuously adapted by on-device training. To achieve efficient on-device learning, both software and hardware issues need to be considered. On the software level, the real-time streaming data usually follow a non-independent and identically distributed (non-iid) pattern, and simply learning from the latest data can result in forgetting previous data. Besides, the storage on edge devices is usually too small to store all the input data for rehearsal. As for the hardware-level design, implementing on-device training on resource-limited edge devices is challenging because of the complex memory access with different patterns among forward propagation, backward propagation, and weight update. To solve these problems, we first propose a framework to automatically select the most representative data from the unlabeled input stream, which only requires a small data buffer for dynamic learning. Then, we propose an efficient DNN training accelerator, EF-Train, to achieve end-to-end training on resource-constrained, low-power edge devices.
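One plausible (assumed, not the chapter's exact criterion) way to select representative samples from an unlabeled stream into a small buffer is to keep items that add feature-space diversity:
```python
# Rough sketch: maintain a small buffer of diverse feature vectors from a stream. A new item
# replaces one member of the most redundant buffered pair only if it adds diversity.
import numpy as np

class RepresentativeBuffer:
    def __init__(self, capacity: int):
        assert capacity >= 2
        self.capacity = capacity
        self.items = []                         # list of 1-D feature vectors

    def offer(self, x: np.ndarray):
        if len(self.items) < self.capacity:
            self.items.append(x)
            return
        dists = [np.linalg.norm(x - b) for b in self.items]
        nearest = int(np.argmin(dists))
        # Distance (and index) of the most redundant pair already in the buffer.
        pair = min(
            (np.linalg.norm(a - b), i)
            for i, a in enumerate(self.items)
            for j, b in enumerate(self.items) if i < j
        )
        if dists[nearest] > pair[0]:
            self.items[pair[1]] = x             # replace one member of the closest pair
```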
Article
The charging protocol design is a control problem between charge time and capacity retention, solved with numerous methods. However, the outcome is an optimal yet rigid charge protocol. Ever-changing user behaviour limits the acceptability of one rigid optimal protocol to affect the growing market of LIBs: electric vehicles (EVs), embedded systems, etc. It is imperative to redefine optimal charging by incorporating the user behaviour, resulting in a dynamic charging strategy. We have formulated the personalized charging strategy problem as a Markov Decision Process (MDP). Q-learning and the Deep Deterministic Policy Gradient (DDPG) method are used to solve the MDP. We then present a full spectrum of charging strategies based on perceived user requirements. Three representative charge protocols are demonstrated. The aggressive protocol can charge 77% SOC in ~15 minutes, faster than MSCCCV (baseline). ~1.5 years of extra battery life is offered by the life-saver protocol, which takes only 30 minutes more than MSCCCV to fully charge. The balanced protocol provides a quick boost and yet maintains a similar charge time and health as MSCCCV. We present an online methodology to retune the protocol on-device based on battery dynamics and user behavior. Finally, the claims are validated using real-world experiments. Note to Practitioners: This work was motivated by the problem of dynamic optimization of the battery-charging trajectory for each user of a mobile device or electric vehicle. The existing mechanism of an optimal charging problem revolves around finding a trade-off between charge time and battery health, not taking into account every user's unique perspective. This offers only a fixed way of charging a device. This paper proposes to design an optimal charging protocol by taking user behaviour also into account. At each charging phase, the trajectory used for battery charging is optimized w.r.t. expected charge time (user behaviour), battery health, and safety constraints of the battery at that point of time. The method is trained to learn the battery ageing mechanism for each individual user. Existing methodologies find it tough to accommodate the process of optimal charging for fresh battery characteristics while also considering the constraints of an ageing battery. The proposed solution improves productivity because existing ways of optimizing require significant man-months. An automatic way of optimizing the charging protocol for each charging cycle is an efficient and reliable way to capture degradation as well as battery-to-battery variability. The outcome of our work is a spectrum of charging protocols, each optimized w.r.t. battery characteristics and user expectations. Thus, each user can be catered to with a unique charging protocol optimally designed for their immediate requirement. The practical limitation is to ensure the model converges during on-device retraining. The future direction of the work will be to incorporate a more detailed battery model for optimizing the charging protocol.
Article
Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast inference of on-device DL, DL libraries play a critical role, as algorithms and hardware do. Unfortunately, no prior work dives deep into the ecosystem of modern DL libraries and provides quantitative results on their performance. In this paper, we first build a comprehensive benchmark that includes 6 representative DL libraries and 15 diversified DL models. Then we perform extensive experiments on 10 mobile devices, and the results reveal the current landscape of mobile DL libraries. For example, we find that the best-performing DL library is severely fragmented across different models and hardware, and the gap between DL libraries can be rather huge. In fact, the impacts of DL libraries can overwhelm the optimizations from algorithms or hardware, e.g., model quantization and GPU/DSP-based heterogeneous computing. Motivated by the fragmented performance of DL libraries across models and hardware, we propose an effective DL library selection framework to obtain the optimal library on a newly created dataset. We evaluate the DL library selection algorithm, and the results show that the framework can improve prediction accuracy by about 10% on average over benchmark approaches.
Article
Retrieving similar trajectories aims to search for the trajectories that are close to a query trajectory in the spatio-temporal domain from a large trajectory dataset. This is critical for a variety of applications, like transportation planning and mobility analysis. Unlike previous studies that perform similar trajectory retrieval on fine-grained GPS data or a single cellular carrier, we investigate the feasibility of finding similar trajectories from cellular data of multiple carriers, which provide more comprehensive coverage of population and space. To handle the issues of spatial bias of cellular data from multiple carriers, coarse spatial granularity, and irregular sparse temporal sampling, we develop a holistic system, cellSim. Specifically, to avoid the issue of spatial bias, we first propose a novel map matching approach, which transforms the cell tower sequences from multiple carriers to routes on a unified road map. Then, to address the issue of temporally sparse sampling, we generate multiple routes with different confidences to increase the probability of finding truly similar trajectories. Finally, a new trajectory similarity measure is developed for similar trajectory search by calculating the similarities between the irregularly-sampled trajectories. Extensive experiments on a large-scale cellular dataset from two carriers and real-world 1,701-km query trajectories reveal that cellSim provides state-of-the-art performance for similar trajectory retrieval.
Preprint
Machine learning (ML) is becoming more frequent in mobile software applications. ML-based mobile applications are software applications that embed ML models trained on data. Using ML can increase the efficiency and effectiveness of operations and generate new suggestions for development. ML programmers are used to analyzing data: a machine learning programmer specifies the structure of an ideal machine learning model as well as the process of training the model with training data. However, converting a prediction algorithm into a machine learning model that can be used operationally is a time-consuming and difficult job. Because modern mobile apps are becoming increasingly reliant on machine learning, software engineering (SE) for Android ML apps is becoming more critical. At present, however, most attempts by the SE community have focused only on the development of ML models and the analysis of flaws in ML algorithms; errors that occur in the deployment of ML models have received little consideration. ML apps are used by millions of people on a daily basis for many purposes, and these users encounter the flaws in deployment. We first use a systematic literature analysis to identify relevant problems, and then, from the literature and from Stack Overflow and GitHub, two frequently utilized data sources for investigating defects in software, we discover 100+ genuine deployment issues. We divide these issues into three phases: pre-deployment, deployment, and non-technical. Further, we suggest a structure of 6 categories (infrastructure, data structure, governance, implementation, customer relations, and economic implications), which are further divided into sub-categories, and examine which categories affect the pre-deployment, deployment, and non-technical phases of standard ML models and the deployment process. In the future, this will help developers find the issue they are looking for in the related category, because the data in GitHub and Stack Overflow is not organized according to ML knowledge.
Article
Detecting actions in videos has been widely applied in on-device applications, such as cars, robots, etc. Practical on-device videos are always untrimmed, containing both action and background. It is desirable for a model to both recognize the class of an action and localize the temporal position where the action happens. Such a task is called temporal action localization (TAL), and a TAL model is usually trained on the cloud, where multiple untrimmed videos are collected and labeled. It is desirable for a TAL model to continuously and locally learn from new data, which can directly improve the action detection precision while protecting customers' privacy. However, directly training a TAL model on the device is nontrivial. To train a TAL model which can precisely recognize and localize each action, a tremendous number of video samples with temporal annotations are required. However, annotating videos frame by frame is exorbitantly time-consuming and expensive. Although weakly supervised temporal action localization (W-TAL) has been proposed to learn from untrimmed videos with only video-level labels, such an approach is also not suitable for on-device learning scenarios. In practical on-device learning applications, data are collected in streaming fashion. For example, the camera on the device keeps collecting video frames for hours or days, and the actions of nearly all classes are included in a single long video stream. Dividing such a long video stream into multiple video segments requires lots of human effort, which hinders the exploration of applying TAL tasks to realistic on-device learning applications. To enable W-TAL models to learn from a long, untrimmed streaming video, we propose an efficient video learning approach that can directly adapt to new environments. We first propose a self-adaptive video dividing approach with a contrast score-based segment merging approach to convert the video stream into multiple segments. Then, we explore different sampling strategies on the TAL tasks to request as few labels as possible. To the best of our knowledge, this is the first attempt to directly learn from a long, on-device video stream. Experimental results on the THUMOS'14 dataset show that the performance of our approach is comparable to the current W-TAL state-of-the-art (SOTA) work without any laborious manual video splitting.
Article
Full-text available
Conventionally, DNN models are trained once in the cloud and deployed in edge devices such as cars, robots, or unmanned aerial vehicles (UAVs) for real-time inference. However, there are many cases that require the models to adapt to new environments, domains, or new users. In order to realize such domain adaption or personalization, the models on devices need to be continuously trained on the device. In this work, we design EF-Train, an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel that can achieve end-to-end training on resource-limited low-power edge-level FPGAs. It is challenging to implement on-device training on resource-limited FPGAs due to the low efficiency caused by different memory access patterns among forward, backward propagation, and weight update. Therefore, we developed a data reshaping approach with intra-tile continuous memory allocation and weight reuse. An analytical model is established to automatically schedule computation and memory resources to achieve high energy efficiency on edge FPGAs. The experimental results show that our design achieves 46.99 GFLOPS and 6.09 GFLOPS/W in terms of throughput and energy efficiency, respectively.
Article
With the rapid development of the mobile communication technology, mobile trajectories of humans are massively collected by Internet service providers (ISPs) and application service providers (ASPs). On the other hand, the rising paradigm of knowledge graph (KG) provides us a promising solution to extract structured "knowledge" from massive trajectory data. In this paper, we focus on modeling users' spatio-temporal mobility patterns based on knowledge graph techniques, and predicting users' future movement based on the "knowledge" extracted from multiple sources in a cohesive manner. Specifically, we propose a new type of knowledge graph, i.e., spatio-temporal urban knowledge graph (STKG), where mobility trajectories, category information of venues, and temporal information are jointly modeled by the facts with different relation types in STKG. The mobility prediction problem is converted to the knowledge graph completion problem in STKG. Further, a complex embedding model with elaborately designed scoring functions is proposed to measure the plausibility of facts in STKG to solve the knowledge graph completion problem, which considers temporal dynamics of the mobility patterns and utilizes PoI categories as the auxiliary information and background knowledge. Extensive evaluations confirm the high accuracy of our model in predicting users' mobility, i.e., improving the accuracy by 5.04% compared with the state-of-the-art algorithms. In addition, PoI categories as the background knowledge and auxiliary information are confirmed to be helpful by improving the performance by 3.85% in terms of accuracy. Additionally, experiments show that our proposed method is time-efficient by reducing the computational time by over 43.12% compared with existing methods.
Article
Full-text available
How does individual mobility in the urban environment impact health status? Previous works have explored the correlation between human mobility behaviour and individual health, yet the study of the underlying causal effect is woefully inadequate. However, correlation analysis can sometimes be bewildering because of confounding effects. For example, older people visit parks more often but have worse health status than younger people. The common associations with age will lead to a counter-intuitive negative correlation between park visits and health status. Obtaining causal effects from confounded observations remains a challenge. In this paper, we construct a causal framework based on propensity score matching on multi-level treatment to eliminate the bias brought by confounding effects and estimate the total treatment effects of mobility behaviours on health status. We demonstrate that the matching procedure approximates a de-confounded randomized experiment where confounding variables are balanced substantially. The analysis of the directions of estimated causal effects reveals that fewer neighbouring tobacco shops and frequent visits to sports facilities are related to higher risk in health status, which differs from their correlation directions. Physical mobility behaviours and environment features have more significant estimated effects on health status than contextual mobility behaviours. Moreover, we embed our causal analysis framework in health prediction models to filter out features with superficial correlation but insignificant effects that might lead to over-fitting. This strategy achieves better model robustness with more features filtered out than L1-regularization. Our findings shed light on individual healthy lifestyles and mobility-related health policymaking.
Chapter
Designing an interactive system tailored appropriately to each user's physical and cognitive characteristics is important for providing an optimal user experience. In this chapter, we discuss how we could address such problems leveraging modern interactive machine learning techniques. As a case study, we introduce a method to individualize 3D spatial sound rendering with perceptual feedback. 3D spatial sound rendering traditionally required time-consuming measurement of each individual user with an expensive device. By taking a data-driven approach, one can replace such expensive measurement with simple calibration. We first describe how to train a generic deep learning model with an existing measured data set. We then describe how to adapt the model to a specific user with a simple calibration process consisting of pairwise comparisons. Through this case study, the readers will gain insight into how to adapt an interactive system to a specific user's characteristics, taking advantage of the high expressiveness of modern machine learning techniques.
Article
Full-text available
Smartphone app developers often access and use privacy-sensitive data to create apps with rich and meaningful interactions. However, it can be challenging for auditors and end-users to know what granularity of data is being used and how, thereby hindering assessment of potential risks. Furthermore, developers lack easy ways of offering transparency to users regarding how personal data is processed, even if their intentions are to make their apps more privacy friendly. To address these challenges, we introduce PrivacyStreams, a functional programming model for accessing and processing personal data as a stream. PrivacyStreams is designed to make it easy for developers to make use of personal data while simultaneously making it easier to analyze how that personal data is processed and what granularity of data is actually used. We present the design and implementation of PrivacyStreams, as well as several user studies and experiments to demonstrate its usability, utility, and support for privacy.
Article
Full-text available
Deep convolutional neural networks (CNNs) are indispensable to state-of-the-art computer vision algorithms. However, they are still rarely deployed on battery-powered mobile devices, such as smartphones and wearable gadgets, where vision algorithms can enable many revolutionary real-world applications. The key limiting factor is the high energy consumption of CNN processing due to its high computational complexity. While there are many previous efforts that try to reduce the CNN model size or amount of computation, we find that they do not necessarily result in lower energy consumption, and therefore do not serve as a good metric for energy cost estimation. To close the gap between CNN design and energy consumption optimization, we propose an energy-aware pruning algorithm for CNNs that directly uses energy consumption estimation of a CNN to guide the pruning process. The energy estimation methodology uses parameters extrapolated from actual hardware measurements that target realistic battery-powered system setups. The proposed layer-by-layer pruning algorithm also prunes more aggressively than previously proposed pruning methods by minimizing the error in output feature maps instead of filter weights. For each layer, the weights are first pruned and then locally fine-tuned with a closed-form least-square solution to quickly restore the accuracy. After all layers are pruned, the entire network is further globally fine-tuned using back-propagation. With the proposed pruning method, the energy consumption of AlexNet and GoogLeNet are reduced by 3.7x and 1.6x, respectively, with less than 1% top-5 accuracy loss. Finally, we show that pruning the AlexNet with a reduced number of target classes can greatly decrease the number of weights but the energy reduction is limited.
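The "prune, then restore with a closed-form least-squares fit" step can be illustrated on a single fully connected layer; the energy model and layer ordering that drive the actual algorithm are not modeled here.
```python
# Simplified, layer-local illustration: drop the smallest-magnitude weights, then refit the
# surviving weights so the layer's outputs on sample activations stay close to the originals.
import numpy as np

def prune_and_refit(W: np.ndarray, X: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """W: (in_dim, out_dim) weights; X: (n_samples, in_dim) sample input activations."""
    Y = X @ W                                          # original output feature maps
    W_new = np.zeros_like(W)
    for j in range(W.shape[1]):                        # refit each output unit separately
        w = W[:, j]
        k = max(1, int(keep_ratio * len(w)))
        keep = np.argsort(np.abs(w))[-k:]              # indices of surviving weights
        # Closed-form least squares on the kept inputs to locally restore accuracy.
        coef, *_ = np.linalg.lstsq(X[:, keep], Y[:, j], rcond=None)
        W_new[keep, j] = coef
    return W_new
```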
Conference Paper
Full-text available
An efficient algorithm for recurrent neural network training is presented. The approach increases the training speed for tasks where the length of the input sequence may vary significantly. The proposed approach is based on optimal batch bucketing by input sequence length and data parallelization on multiple graphics processing units. The baseline training performance without sequence bucketing is compared with the proposed solution for different numbers of buckets. An example is given for the online handwriting recognition task using an LSTM recurrent neural network. The evaluation is performed in terms of the wall clock time, number of epochs, and validation loss value.
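A minimal sketch of the bucketing idea, assuming integer-encoded sequences; the bucket count, batch size, and padding value are arbitrary choices for illustration.

```python
import numpy as np

def bucket_batches(sequences, n_buckets=4, batch_size=8, pad_value=0):
    """Group variable-length sequences into length buckets, then pad per batch.

    Compared with padding everything to the global maximum length, padding
    only to each bucket's maximum wastes far fewer RNN time steps.
    """
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    bucket_size = int(np.ceil(len(order) / n_buckets))
    batches = []
    for b in range(0, len(order), bucket_size):
        bucket = [sequences[i] for i in order[b:b + bucket_size]]
        for s in range(0, len(bucket), batch_size):
            batch = bucket[s:s + batch_size]
            max_len = max(len(x) for x in batch)
            padded = np.full((len(batch), max_len), pad_value, dtype=np.int64)
            for row, seq in enumerate(batch):
                padded[row, :len(seq)] = seq
            batches.append(padded)
    return batches

rng = np.random.default_rng(0)
seqs = [rng.integers(1, 100, size=rng.integers(5, 80)) for _ in range(100)]
batches = bucket_batches(seqs)
print("distinct padded lengths:", sorted({b.shape[1] for b in batches}))
```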
Conference Paper
Full-text available
Recent advances in eye tracking technologies opened the way to design novel attention-based user interfaces. This is promising for pro-active and assistive technologies for cyber-physical systems in the domains of, e.g., healthcare and industry 4.0. Prior approaches to recognize a user's attention are usually limited to the raw gaze signal or sensors in instrumented environments. We propose a system that (1) incorporates the gaze signal and the egocentric camera of the eye tracker to identify the objects the user focuses on; (2) employs object classification based on deep learning, which we recompiled for our purposes on a GPU-based image classification server; (3) detects whether the user actually pays attention to that object; and (4) combines these modules for constructing episodic memories of egocentric events in real-time.
Conference Paper
Full-text available
Neural networks are among the state-of-the-art techniques for language modeling. Existing neural language models typically map discrete words to distributed, dense vector representations. After information processing of the preceding context words by hidden layers, an output layer estimates the probability of the next word. Such approaches are time- and memory-intensive because of the large numbers of parameters for word embeddings and the output layer. In this paper, we propose to compress neural language models by sparse word representations. In the experiments, the number of parameters in our model increases very slowly with the growth of the vocabulary size, which is almost imperceptible. Moreover, our approach not only reduces the parameter space to a large extent, but also improves the performance in terms of the perplexity measure.
Conference Paper
Full-text available
We consider applying computer vision to video on cloud-backed mobile devices using Deep Neural Networks (DNNs). The computational demands of DNNs are high enough that, without careful resource management, such applications strain device battery, wireless data, and cloud cost budgets. We pose the corresponding resource management problem, which we call Approximate Model Scheduling, as one of serving a stream of heterogeneous (i.e., solving multiple classification problems) requests under resource constraints. We present the design and implementation of an optimizing compiler and runtime scheduler to address this problem. Going beyond traditional resource allocators, we allow each request to be served approximately, by systematically trading off DNN classification accuracy for resource use, and remotely, by reasoning about on-device/cloud execution trade-offs. To inform the resource allocator, we characterize how several common DNNs, when subjected to state-of-the art optimizations, trade off accuracy for resource use such as memory, computation, and energy. The heterogeneous streaming setting is a novel one for DNN execution, and we introduce two new and powerful DNN optimizations that exploit it. Using the challenging continuous mobile vision domain as a case study, we show that our techniques yield significant reductions in resource usage and perform effectively over a broad range of operating conditions.
Article
Full-text available
Recently, convolutional neural networks (CNN) have demonstrated impressive performance in various computer vision tasks. However, high performance hardware is typically indispensable for the application of CNN models due to the high computation complexity, which prohibits their further extensions. In this paper, we propose an efficient framework, namely Quantized CNN, to simultaneously speed up the computation and reduce the storage and memory overhead of CNN models. Both filter kernels in convolutional layers and weighting matrices in fully-connected layers are quantized, aiming at minimizing the estimation error of each layer's response. Extensive experiments on the ILSVRC-12 benchmark demonstrate 4–6× speed-up and 15–20× compression with merely a one-percentage-point loss of classification accuracy. With our quantized CNN model, even mobile devices can accurately classify images within one second.
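The sketch below illustrates weight quantization with a shared codebook using plain k-means in numpy. Note that it minimizes weight reconstruction error, whereas the paper minimizes each layer's response error, so this is only a simplified stand-in; codebook size and iteration count are illustrative.

```python
import numpy as np

def kmeans_quantize(W, n_codes=16, n_iters=20, seed=0):
    """Quantize a weight matrix to a small shared codebook (weight sharing).

    Returns the codebook and per-weight code indices; storage drops from
    32 bits per weight to log2(n_codes) bits plus the tiny codebook.
    """
    rng = np.random.default_rng(seed)
    flat = W.ravel()
    codebook = rng.choice(flat, size=n_codes, replace=False)
    for _ in range(n_iters):
        codes = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for c in range(n_codes):
            members = flat[codes == c]
            if members.size:
                codebook[c] = members.mean()
    return codebook, codes.reshape(W.shape)

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64)).astype(np.float32)
codebook, codes = kmeans_quantize(W, n_codes=16)
W_q = codebook[codes]
print("relative quantization error:",
      np.linalg.norm(W - W_q) / np.linalg.norm(W))
```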
Conference Paper
Full-text available
The wide adoption of smart devices has stimulated a fast shift of security-critical data from desktop to mobile devices. However, recurrent device theft and loss expose mobile devices to various security threats and even physical attacks. This paper presents TinMan, a system that protects confidential data such as web site passwords and credit card numbers (we use the term cor, short for Confidential Record, to represent these data) from being leaked or abused even under device theft. TinMan separates accesses of cor from the rest of the functionalities of an app, by introducing a trusted node to store cor and offloading any code from a mobile device to the trusted node to access cor. This completely eliminates the exposure of cor on the mobile devices. The key challenges to TinMan include deciding when and how to efficiently and transparently offload execution; TinMan addresses these challenges with security-oriented offloading based on a low-overhead tainting scheme called asymmetric tainting to track accesses to cor and trigger offloading, as well as transparent SSL session injection and TCP payload replacement to offload accesses to cor. We have implemented a prototype of TinMan based on Android and demonstrated how TinMan protects the information of a user's bank account and credit card number without modifying the apps. Evaluation results also show that TinMan incurs only a small amount of performance and power overhead.
Article
Full-text available
We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers) is based on an elastic force which links the parameter vectors they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e., it allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide a theoretical analysis of the synchronous variant in the quadratic case and prove that it achieves the highest possible asymptotic rate of convergence for the center variable. We additionally propose a momentum-based version of the algorithm that can be applied in both synchronous and asynchronous settings. An asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and, furthermore, is very communication efficient.
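A toy numpy rendering of the elastic-averaging update on quadratic local objectives; the step size eta, elasticity rho, and loss functions are illustrative, and the actual algorithm runs (a)synchronously across machines rather than in a single loop.

```python
import numpy as np

# Toy EASGD-style updates: several local workers optimize their own quadratic
# loss while being elastically pulled toward a shared center variable.
rng = np.random.default_rng(0)
n_workers, dim = 4, 10
targets = rng.standard_normal((n_workers, dim))   # each worker's local optimum
x = rng.standard_normal((n_workers, dim))         # local parameter vectors
center = np.zeros(dim)                            # parameter-server variable
eta, rho = 0.1, 0.5

for step in range(200):
    for i in range(n_workers):
        grad = x[i] - targets[i]                  # gradient of 0.5*||x - t||^2
        # Local update: gradient step plus elastic pull toward the center.
        x[i] -= eta * (grad + rho * (x[i] - center))
    # Center update: moves toward the average of the local variables.
    center += eta * rho * np.sum(x - center, axis=0)

print("distance of center to mean of local optima:",
      np.linalg.norm(center - targets.mean(axis=0)))
```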
Article
Full-text available
Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
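A minimal PyTorch sketch of the transfer recipe studied above: reuse the lower ("general") layers, freeze them, and fine-tune only the upper, task-specific layers. The architecture and the random batch are placeholders, not the networks from the paper.

```python
import torch
import torch.nn as nn

# A small network standing in for a pretrained model: lower layers are treated
# as "general" features and frozen; only the task-specific head is fine-tuned.
model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),                         # task-specific head
)

for layer in list(model.children())[:-1]:       # freeze everything but the head
    for p in layer.parameters():
        p.requires_grad = False

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-2)

x = torch.randn(32, 128)                        # placeholder target-task batch
y = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
print("fine-tuning loss:", loss.item())
```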
Conference Paper
Full-text available
The internet is increasingly becoming a standard medium for both the production and consumption of data, while at the same time cyber-crime involving the theft of private data is growing. Therefore, to transact in data securely, privacy and security concerns must be taken into account to ensure that the confidentiality of the individuals and entities involved is not compromised and that the published data comply with privacy laws. In this paper, we look at noise addition as one of the privacy-providing techniques for data. Our aim in this overview is to give a foundational perspective on noise addition techniques, provide statistical considerations for them, and survey the current state of the art in the field, while outlining future areas of research.
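As a concrete (if simplified) instance of noise addition, the sketch below releases the mean of a bounded attribute with Laplace noise calibrated to the query's sensitivity, in the spirit of differential privacy. The bounds, privacy budget, and data are illustrative and not taken from the survey.

```python
import numpy as np

def noisy_mean(values, lower, upper, epsilon):
    """Release the mean of a bounded numeric attribute with Laplace noise.

    The noise scale is the query's sensitivity (how much one record can move
    the mean) divided by the privacy budget epsilon.
    """
    values = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)
    noise = np.random.default_rng().laplace(0.0, sensitivity / epsilon)
    return values.mean() + noise

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1000)
print("true mean:", ages.mean())
print("noisy mean (eps=0.5):", noisy_mean(ages, 18, 90, epsilon=0.5))
```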
Article
Full-text available
Mobile devices are frequently used as terminals to interact with many security-critical services such as mobile payment and online banking. However, the large client software stack and the continuous proliferation of malware expose such interaction to various threats, including passive attacks like phishing and active ones like direct code manipulation. This paper proposes TrustUI, a new trusted path design for mobile devices that enables secure interaction between end users and services based on ARM's TrustZone technology. TrustUI is built with a combination of key techniques including cooperative randomization of the trusted path and secure delegation of network interaction. With such techniques, TrustUI not only requires no trust of the commodity software stack, but also takes a step further by excluding drivers for user-interacting devices such as the touch screen from its trusted computing base (TCB). Hence, TrustUI has a much smaller TCB, requires no access to device driver code, and may easily adapt to many devices. A prototype of TrustUI has been implemented on a Samsung Exynos 4412 board, and evaluation shows that TrustUI provides strong protection of user interaction.
Article
Full-text available
Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend it by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
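The (soft-)search can be written in a few lines. The numpy sketch below computes additive attention scores between a decoder state and the encoder annotations and returns the weighted context vector; all dimensions and weight matrices are random placeholders rather than trained parameters.

```python
import numpy as np

def additive_attention(s, H, W_s, W_h, v):
    """Soft alignment over source annotations H given decoder state s.

    score_j = v^T tanh(W_s s + W_h h_j); weights = softmax(scores);
    the context vector is the weighted sum of the annotations.
    """
    scores = np.tanh(s @ W_s + H @ W_h) @ v           # (src_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ H                             # (hidden,)
    return context, weights

rng = np.random.default_rng(0)
src_len, hidden, attn = 7, 32, 16
H = rng.standard_normal((src_len, hidden))            # encoder annotations
s = rng.standard_normal(hidden)                       # current decoder state
W_s = rng.standard_normal((hidden, attn))
W_h = rng.standard_normal((hidden, attn))
v = rng.standard_normal(attn)
context, weights = additive_attention(s, H, W_s, W_h, v)
print("attention weights:", np.round(weights, 3))
```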
Article
Full-text available
Social recommender systems leverage collaborative filtering (CF) to serve users with content that is of potential interest to them. A wide spectrum of CF schemes has been proposed. However, most of them cannot deal with the cold-start problem, which denotes a situation in which social media sites fail to produce recommendations for new items, new users, or both. In addition, they assume that all ratings contribute equally to the recommendation. This supposition contradicts the fact that low-level ratings contribute little to suggesting items that are likely to be of interest to users. To this end, we propose bi-clustering and fusion (BiFu), a new scheme for the cold-start problem that combines bi-clustering and fusion techniques under a cloud computing setting. To identify the rating sources for recommendation, it introduces the concepts of popular items and frequent raters. To reduce the dimensionality of the rating matrix, BiFu leverages the bi-clustering technique. To overcome data sparsity and rating diversity, it employs the smoothing and fusion technique. Finally, BiFu recommends social media content from both item and user clusters. Experimental results show that BiFu significantly alleviates the cold-start problem in terms of accuracy and scalability.
Conference Paper
Full-text available
This paper presents the design, implementation, and evaluation of the Trusted Language Runtime (TLR), a system that protects the confidentiality and integrity of .NET mobile applications from OS security breaches. TLR enables separating an application's security-sensitive logic from the rest of the application, and isolates it from the OS and other apps. TLR provides runtime support for the secure component based on a .NET implementation for embedded devices. TLR reduces the TCB of an open source .NET implementation by a factor of 78 with a tolerable performance cost. The main benefit of the TLR is to bring the developer benefits of managed code to trusted computing. With the TLR, developers can build their trusted components with the productivity benefits of modern high level languages, such as strong typing and garbage collection.
Article
Full-text available
We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the linear structure present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups by a factor of 2x, while keeping the accuracy within 1% of the original model.
Article
This paper presents the design, implementation, and evaluation of the Trusted Language Runtime (TLR), a system that protects the confidentiality and integrity of .NET mobile applications from OS security breaches. TLR enables separating an application's security-sensitive logic from the rest of the application, and isolates it from the OS and other apps. TLR provides runtime support for the secure component based on a .NET implementation for embedded devices. TLR reduces the TCB of an open source .NET implementation by a factor of 78 with a tolerable performance cost. The main benefit of the TLR is to bring the developer benefits of managed code to trusted computing. With the TLR, developers can build their trusted components with the productivity benefits of modern high level languages, such as strong typing and garbage collection.
Article
Despite the ubiquity of mobile and wearable text messaging applications, the problem of keyboard text decoding has not been tackled sufficiently in light of the enormous success of deep learning Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) for natural language understanding. In particular, the requirement that keyboard decoders operate on devices with memory and processor resource constraints makes it challenging to deploy industrial-scale deep neural network (DNN) models. This paper proposes a sequence-to-sequence neural attention network system for automatic text correction and completion. Given an erroneous sequence, our model encodes character-level hidden representations and then decodes the revised sequence, thus enabling auto-correction and completion. We achieve this with a combination of a character-level CNN and gated recurrent unit (GRU) encoder along with a word-level GRU attention decoder. Unlike traditional language models that learn from billions of words, our corpus size is only 12 million words, an order of magnitude smaller. The memory footprint of our learnt model for inference and prediction is also an order of magnitude smaller than that of conventional language-model-based text decoders. We report baseline performance for neural keyboard decoders in such a limited domain. Our models achieve a word-level accuracy of 90% and a character error rate (CER) of 2.4% over the Twitter typo dataset. We present a novel dataset of noisy-to-corrected mappings by inducing the noise distribution from the Twitter data over the OpenSubtitles 2009 dataset, on which our model predicts with a word-level accuracy of 98% and a sequence accuracy of 68.9%. In our user study, our model achieved an average CER of 2.6%, compared with the state-of-the-art non-neural touch-screen keyboard decoder at a CER of 1.6%.
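The word-accuracy and CER figures above are standard edit-distance metrics. A short, dependency-free sketch of computing CER with Levenshtein distance, assuming plain character strings, is shown below.

```python
def levenshtein(a, b):
    """Minimum number of character edits (insert/delete/substitute) a -> b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis, reference):
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / max(1, len(reference))

print(cer("thsi is a tset", "this is a test"))   # two transpositions -> 4 edits
```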
Conference Paper
Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
Conference Paper
We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the linear structure present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2x, while keeping the accuracy within 1% of the original model.
Conference Paper
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving; Exploiting sparsity saves 10x; Weight sharing gives 8x; Skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10⁴ frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.
Article
We propose an efficient and unified framework, namely ThiNet, to simultaneously accelerate and compress CNN models in both training and inference stages. We focus on filter-level pruning, i.e., the whole filter is discarded if it is less important. Our method does not change the original network structure, so it can be perfectly supported by any off-the-shelf deep learning libraries. We formally establish filter pruning as an optimization problem, and reveal that we need to prune filters based on statistics computed from the next layer, not the current layer, which differentiates ThiNet from existing methods. Experimental results demonstrate the effectiveness of this strategy, which has advanced the state-of-the-art. We also show the performance of ThiNet on the ILSVRC-12 benchmark. ThiNet achieves 3.31× FLOPs reduction and 16.63× compression on VGG-16, with only 0.52% top-5 accuracy drop. Similar experiments with ResNet-50 reveal that even for a compact network, ThiNet can also reduce more than half of the parameters and FLOPs, at the cost of roughly 1% top-5 accuracy drop. Moreover, the original VGG-16 model can be further pruned into a very small model with only 5.05MB model size, preserving AlexNet-level accuracy but showing much stronger generalization ability.
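A simplified numpy sketch of the greedy selection idea: channels are discarded so that the summed contributions of the kept channels still reconstruct the next layer's input well. The linear setting and the way contributions are sampled are simplifications of the paper's convolutional formulation; the channel count and keep budget are illustrative.

```python
import numpy as np

def select_channels(contrib, keep):
    """Greedily discard channels so the removed part has minimal energy.

    contrib: (n_samples, n_channels) per-channel contributions to the next
    layer's pre-activation at sampled locations.  Keeping channels whose
    removal adds the least reconstruction error mirrors the next-layer
    statistics criterion described above.
    """
    n_channels = contrib.shape[1]
    removed, removed_sum = [], np.zeros(contrib.shape[0])
    for _ in range(n_channels - keep):
        candidates = [c for c in range(n_channels) if c not in removed]
        # Error introduced if channel c is also removed.
        errors = [np.sum((removed_sum + contrib[:, c]) ** 2) for c in candidates]
        worst = candidates[int(np.argmin(errors))]
        removed.append(worst)
        removed_sum = removed_sum + contrib[:, worst]
    return [c for c in range(n_channels) if c not in removed]

rng = np.random.default_rng(0)
# Channels with widely varying magnitudes: weak ones should be pruned first.
scales = np.linspace(0.1, 2.0, 16)
contrib = rng.standard_normal((512, 16)) * scales
print("kept channels:", sorted(select_channels(contrib, keep=8)))
```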
Conference Paper
Microphones are remarkably powerful sensors of human behavior and context. However, audio sensing is highly susceptible to wild fluctuations in accuracy when used in the diverse acoustic environments (such as bedrooms, vehicles, or cafes) that users encounter on a daily basis. Towards addressing this challenge, we turn to the field of deep learning, an area of machine learning that has radically changed related audio modeling domains like speech recognition. In this paper, we present DeepEar - the first mobile audio sensing framework built from coupled Deep Neural Networks (DNNs) that simultaneously perform common audio sensing tasks. We train DeepEar with a large-scale dataset including unlabeled data from 168 place visits. The resulting learned model, involving 2.3M parameters, enables DeepEar to significantly increase inference robustness to background noise beyond conventional approaches present in mobile devices. Finally, we show DeepEar is feasible for smartphones by building a cloud-free DSP-based prototype that runs continuously, using only 6% of the smartphone's battery daily.
Article
In information retrieval, query auto completion (QAC), also known as type-ahead [Xiao et al., 2013, Cai et al., 2014b] and auto-complete suggestion [Jain and Mishne, 2010], refers to the following functionality: given a prefix consisting of a number of characters entered into a search box, the user interface proposes alternative ways of extending the prefix to a full query. Ranking query completions is a challenging task due to the limited length of prefixes entered by users, the large volume of possible query completions matching a prefix, and the broad range of possible search intents. In recent years, a large number of query auto completion approaches have been proposed that produce ranked lists of alternative query completions by mining query logs. In this survey, we review work on query auto completion that has been published before 2016. We focus mainly on web search and provide a formal definition of the query auto completion problem. We describe two dominant families of approaches to the query auto completion problem, one based on heuristic models and the other based on learning to rank. We also identify dominant trends in published work on query auto completion, viz. the use of time-sensitive signals and the use of user-specific signals. We describe the datasets and metrics that are used to evaluate algorithms for query auto completion. We also devote a chapter to efficiency and a chapter to presentation and interaction aspects of query auto completion. We end by discussing related tasks as well as potential research directions to further the area.
Conference Paper
Deep learning has revolutionized the way sensor data are analyzed and interpreted. The accuracy gains these approaches offer make them attractive for the next generation of mobile, wearable and embedded sensory applications. However, state-of-the-art deep learning algorithms typically require a significant amount of device and processor resources, even just for the inference stages that are used to discriminate high-level classes from low-level data. The limited availability of memory, computation, and energy on mobile and embedded platforms thus pose a significant challenge to the adoption of these powerful learning techniques. In this paper, we propose SparseSep, a new approach that leverages the sparsification of fully connected layers and separation of convolutional kernels to reduce the resource requirements of popular deep learning algorithms. As a result, SparseSep allows large-scale DNNs and CNNs to run efficiently on mobile and embedded hardware with only minimal impact on inference accuracy. We experiment using SparseSep across a variety of common processors such as the Qualcomm Snapdragon 400, ARM Cortex M0 and M3, and Nvidia Tegra K1, and show that it allows inference for various deep models to execute more efficiently; for example, on average requiring 11.3 times less memory and running 13.3 times faster on these representative platforms.
Conference Paper
Machine learning techniques based on neural networks are achieving remarkable results in a wide variety of domains. Often, the training of models requires large, representative datasets, which may be crowdsourced and contain sensitive information. The models should not expose private information in these datasets. Addressing this goal, we develop new algorithmic techniques for learning and a refined analysis of privacy costs within the framework of differential privacy. Our implementation and experiments demonstrate that we can train deep neural networks with non-convex objectives, under a modest privacy budget, and at a manageable cost in software complexity, training efficiency, and model quality.
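The core mechanism can be sketched compactly: clip each example's gradient, add Gaussian noise to the clipped sum, and average. The toy logistic-regression example below omits the privacy accounting entirely; the clip norm, noise multiplier, and data are illustrative.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_multiplier=1.1, rng=None):
    """One differentially private SGD step on logistic regression.

    Each example's gradient is clipped to L2 norm `clip`, the clipped
    gradients are summed, Gaussian noise scaled by noise_multiplier * clip
    is added, and the noisy sum is averaged over the batch.
    """
    rng = rng or np.random.default_rng()
    preds = 1.0 / (1.0 + np.exp(-(X @ w)))
    per_example_grads = (preds - y)[:, None] * X               # (batch, dim)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip)
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        0.0, noise_multiplier * clip, size=w.shape)
    return w - lr * noisy_sum / len(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))
true_w = rng.standard_normal(10)
y = (X @ true_w > 0).astype(float)
w = np.zeros(10)
for _ in range(200):
    w = dp_sgd_step(w, X, y, rng=rng)
print("training accuracy:", ((X @ w > 0) == (y > 0.5)).mean())
```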
Article
We consider a set of learning agents in a collaborative peer-to-peer network, where each agent learns a personalized model according to its own learning objective. The question addressed in this paper is: how can agents improve upon their locally trained model by communicating with other agents that have similar objectives? We introduce and analyze two asynchronous gossip algorithms running in a fully decentralized manner. Our first approach, inspired by label propagation, aims to smooth pre-trained local models over the network while accounting for the confidence that each agent has in its initial model. In our second approach, agents jointly learn and propagate their model by making iterative updates based on both their local dataset and the behavior of their neighbors. Our algorithm to optimize this challenging objective in a decentralized way is based on ADMM.
Article
We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are unevenly distributed over an extremely large number of nodes. The goal is to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of the utmost importance and minimizing the number of rounds of communication is the principal goal. A motivating example arises when we keep the training data locally on users' mobile devices instead of logging it to a data center for training. In federated optimization, the devices are used as compute nodes performing computation on their local data in order to update a global model. We suppose that we have an extremely large number of devices in the network: as many as the number of users of a given service, each of which has only a tiny fraction of the total data available. In particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, it is reasonable to assume that no device has a representative sample of the overall distribution. We show that existing algorithms are not suitable for this setting, and propose a new algorithm which shows encouraging experimental results for sparse convex problems. This work also sets a path for future research needed in the context of federated optimization.
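A toy numpy sketch of this setting: each simulated device runs a few local steps on its own small, skewed dataset, and the server averages the resulting models weighted by local data size. This follows the generic FedAvg-style recipe rather than the specific algorithm proposed in the paper; device counts, learning rates, and the least-squares objective are illustrative.

```python
import numpy as np

def local_update(w, X, y, lr=0.05, epochs=5):
    """A few local least-squares gradient steps on one device's private data."""
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
dim, n_devices = 8, 20
true_w = rng.standard_normal(dim)
# Each device holds a tiny, skewed slice of data (non-IID, few points).
devices = []
for _ in range(n_devices):
    n = rng.integers(3, 12)
    X = rng.standard_normal((n, dim)) + rng.standard_normal(dim)  # device bias
    devices.append((X, X @ true_w + 0.1 * rng.standard_normal(n)))

w_global = np.zeros(dim)
for _ in range(50):                    # communication rounds
    local_models, sizes = [], []
    for X, y in devices:
        local_models.append(local_update(w_global.copy(), X, y))
        sizes.append(len(X))
    # Server aggregates: data-size-weighted average of the local models.
    w_global = np.average(np.stack(local_models), axis=0,
                          weights=np.array(sizes, dtype=float))

print("error vs. true model:", np.linalg.norm(w_global - true_w))
```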
Conference Paper
Maintaining a clean and hygienic civic environment is an indispensable yet formidable task, especially in developing countries. With the aim of engaging citizens to track and report on their neighborhoods, this paper presents a novel smartphone app, called SpotGarbage, which detects and coarsely segments garbage regions in a user-clicked geo-tagged image. The app utilizes the proposed deep architecture of fully convolutional networks for detecting garbage in images. The model has been trained on a newly introduced Garbage In Images (GINI) dataset, achieving a mean accuracy of 87.69%. The paper also proposes optimizations in the network architecture resulting in a reduction of 87.9% in memory usage and 96.8% in prediction time with no loss in accuracy, facilitating its usage in resource constrained smartphones.
Article
Microphones are remarkably powerful sensors of human behavior and context. However, audio sensing is highly susceptible to wild fluctuations in accuracy when used in the diverse acoustic environments (such as bedrooms, vehicles, or cafes) that users encounter on a daily basis. Towards addressing this challenge, we turn to the field of deep learning, an area of machine learning that has radically changed related audio modeling domains like speech recognition. In this paper, we present DeepEar – the first mobile audio sensing framework built from coupled Deep Neural Networks (DNNs) that simultaneously perform common audio sensing tasks. We train DeepEar with a large-scale dataset including unlabeled data from 168 place visits. The resulting learned model, involving 2.3M parameters, enables DeepEar to significantly increase inference robustness to background noise beyond conventional approaches present in mobile devices. Finally, we show DeepEar is feasible for smartphones by building a cloud-free DSP-based prototype that runs continuously, using only 6% of the smartphone’s battery daily.
Conference Paper
Selecting and prioritizing major device models are critical for mobile app developers to select testbeds and optimize resources such as marketing and quality-assurance resources. The heavily fragmented distribution of Android devices makes it challenging to select a few major device models out of thousands of models available on the market. Currently app developers usually rely on some reported or estimated general market share of device models. However, these estimates can be quite inaccurate, and more problematically, can be irrelevant to the particular app under consideration. To address this issue, we propose PRADA, the first approach to prioritizing Android device models for individual apps, based on mining large-scale usage data. PRADA adapts the concept of operational profiling (popularly used in software reliability engineering) for mobile apps -- the usage of an app on a specific device model reflects the importance of that device model for the app. PRADA includes a collaborative filtering technique to predict the usage of an app on different device models, even if the app is entirely new (without its actual usage in the market yet), based on the usage data of a large collection of apps. We empirically demonstrate the effectiveness of PRADA over two popular app categories, i.e., Game and Media, covering over 3.86 million users and 14,000 device models collected through a leading Android management app in China.
Conference Paper
Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy. In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.
Article
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; Exploiting sparsity saves 10×; Weight sharing gives 8×; Skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10⁴ frames/sec with a power dissipation of only 600mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.
Conference Paper
Breakthroughs from the field of deep learning are radically changing how sensor data are interpreted to extract the high-level information needed by mobile apps. It is critical that the gains in inference accuracy that deep models afford become embedded in future generations of mobile apps. In this work, we present the design and implementation of DeepX, a software accelerator for deep learning execution. DeepX significantly lowers the device resources (viz. memory, computation, energy) required by deep learning that currently act as a severe bottleneck to mobile adoption. The foundation of DeepX is a pair of resource control algorithms, designed for the inference stage of deep learning, that: (1) decompose monolithic deep model network architectures into unit-blocks of various types, which are then more efficiently executed by heterogeneous local device processors (e.g., GPUs, CPUs); and (2) perform principled resource scaling that adjusts the architecture of deep models to shape the overhead each unit-block introduces. Experiments show that DeepX can allow even large-scale deep learning models to execute efficiently on modern mobile processors and significantly outperform existing solutions, such as cloud-based offloading.
Conference Paper
Convolutional neural network (CNN) has been widely employed for image recognition because it can achieve high accuracy by emulating the behavior of optic nerves in living creatures. Recently, the rapid growth of modern applications based on deep learning algorithms has further spurred research and implementations. In particular, various accelerators for deep CNNs have been proposed based on FPGA platforms because of their advantages of high performance, reconfigurability, and fast development cycles. Although current FPGA accelerators have demonstrated better performance over generic processors, the accelerator design space has not been well exploited. One critical problem is that the computation throughput may not well match the memory bandwidth provided by an FPGA platform. Consequently, existing approaches cannot achieve the best performance due to under-utilization of either logic resources or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. In order to overcome this problem, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of the roofline model, we can identify the solution with the best performance and lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches. Our implementation achieves a peak performance of 61.62 GFLOPS under a 100MHz working frequency, which outperforms previous approaches significantly.
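The roofline bound that drives the design-space search fits in a few lines: attainable throughput is the minimum of the compute roof and memory bandwidth times operational intensity. The numbers below are illustrative, not measurements from the VC707 board.

```python
def roofline(peak_gflops, bandwidth_gbs, ops, bytes_moved):
    """Attainable performance (GFLOP/s) under the roofline model.

    Operational intensity = ops / bytes moved; the design is memory-bound
    whenever bandwidth * intensity falls below the compute roof.
    """
    intensity = ops / bytes_moved                      # FLOP per byte
    return min(peak_gflops, bandwidth_gbs * intensity)

# Illustrative comparison of two loop-tiling choices for the same layer.
ops = 2.0e9                                            # ~2 GFLOP per image
print(roofline(100.0, 4.5, ops, bytes_moved=2.0e9))    # poor reuse: memory-bound
print(roofline(100.0, 4.5, ops, bytes_moved=5.0e7))    # good tiling: compute-bound
```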
Article
Recently proposed deep neural networks (DNNs) obtain significant accuracy improvements in many large vocabulary continuous speech recognition (LVCSR) tasks. However, DNNs require many more parameters than traditional systems, which brings huge cost during online evaluation and also limits the application of DNNs in many scenarios. In this paper we present our new effort on DNNs aimed at reducing the model size while keeping the accuracy improvements. We apply singular value decomposition (SVD) to the weight matrices in the DNN, and then restructure the model based on the inherent sparseness of the original matrices. After restructuring, we can reduce the DNN model size significantly with negligible accuracy loss. We also fine-tune the restructured model using the regular back-propagation method to recover the accuracy when the DNN model size is reduced heavily. The proposed method has been evaluated on two LVCSR tasks, with a context-dependent DNN hidden Markov model (CD-DNN-HMM). Experimental results show that the proposed approach dramatically reduces the DNN model size by more than 80% without losing any accuracy.
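A numpy sketch of the SVD restructuring: a large weight matrix is replaced by two thin factors from a truncated SVD, which would then be fine-tuned with back-propagation (omitted here). The synthetic matrix is built to be approximately low rank, mimicking the structure the paper exploits; sizes and rank are illustrative.

```python
import numpy as np

def svd_restructure(W, rank):
    """Split an (m, n) weight matrix into (m, rank) and (rank, n) factors."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (m, rank), singular values folded in
    B = Vt[:rank, :]                    # (rank, n)
    return A, B

rng = np.random.default_rng(0)
# Synthetic approximately low-rank weight matrix.
W = rng.standard_normal((1024, 128)) @ rng.standard_normal((128, 1024))
W += 0.05 * rng.standard_normal((1024, 1024))

A, B = svd_restructure(W, rank=160)
reduction = 1 - (A.size + B.size) / W.size
print("parameter reduction: %.1f%%" % (100 * reduction))
print("relative approximation error:",
      np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```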
Article
An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems which already incorporate known techniques such as dropout. Our ensemble model using different attention architectures has established a new state-of-the-art result in the WMT'15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.
Conference Paper
Although deep neural networks (DNNs) have achieved significant accuracy improvements in speech recognition, it is computationally expensive to deploy large-scale DNNs in decoding due to the huge number of parameters. Weight truncation and decomposition methods have been proposed to speed up decoding by exploiting the sparseness of DNNs. This paper summarizes different approaches to restructuring DNNs and proposes a new node-pruning approach to reshape DNNs for fast decoding. In this approach, hidden nodes of a fully trained DNN are pruned with a certain importance function and the reshaped DNN is retuned using back-propagation. The approach requires no modification of code and can directly save computational costs during decoding. Furthermore, it is complementary to weight decomposition methods. Experiments on a Switchboard task show that, by using the proposed node-pruning approach, DNN complexity can be reduced to 37.9%. The complexity can be further reduced to 12.3% without accuracy loss when node pruning is combined with weight decomposition.
Article
Stochastic gradient descent (SGD) is a popular technique for large-scale optimization problems in machine learning. In order to parallelize SGD, minibatch training needs to be employed to reduce the communication cost. However, an increase in minibatch size typically decreases the rate of convergence. This paper introduces a technique based on approximate optimization of a conservatively regularized objective function within each minibatch. We prove that the convergence rate does not decrease with increasing minibatch size. Experiments demonstrate that with suitable implementations of approximate optimization, the resulting algorithm can outperform standard SGD in many scenarios.
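A small numpy sketch of the idea: within each minibatch, instead of taking a single gradient step, approximately minimize the minibatch loss plus a conservative proximal term that keeps the iterate close to the current point. The quadratic loss, inner solver, and step sizes are illustrative.

```python
import numpy as np

def conservative_minibatch_step(w, X, y, lam=1.0, inner_steps=10, inner_lr=0.05):
    """Approximately minimize  f_B(z) + lam/2 * ||z - w||^2  on minibatch B.

    The proximal term keeps the solution conservative (close to w), which is
    what allows larger minibatches without slowing convergence.
    """
    z = w.copy()
    for _ in range(inner_steps):
        grad = X.T @ (X @ z - y) / len(X) + lam * (z - w)
        z -= inner_lr * grad
    return z

rng = np.random.default_rng(0)
true_w = rng.standard_normal(16)
w = np.zeros(16)
for _ in range(100):
    X = rng.standard_normal((512, 16))          # a large minibatch
    y = X @ true_w + 0.05 * rng.standard_normal(512)
    w = conservative_minibatch_step(w, X, y)
print("distance to true model:", np.linalg.norm(w - true_w))
```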
Conference Paper
We consider application scenarios where an untrusted aggregator wishes to continually monitor the heavy hitters across a set of distributed streams. Since each stream can contain sensitive data, such as the purchase history of customers, we wish to guarantee the privacy of each stream, while allowing the untrusted aggregator to accurately detect the heavy hitters and their approximate frequencies. Our protocols are scalable in settings where the volume of streaming data is large, since we guarantee low memory usage and processing overhead by each data source, and low communication overhead between the data sources and the aggregator.
Conference Paper
This paper presents the design, implementation, and evaluation of the Trusted Language Runtime (TLR), a system that protects the confidentiality and integrity of .NET mobile applications from OS security breaches. TLR enables separating an application’s security-sensitive logic from the rest of the application, and isolates it from the OS and other apps. TLR provides runtime support for the secure component based on a .NET implementation for embedded devices. TLR reduces the TCB of an open source .NET implementation by a factor of 78 with a tolerable performance cost. The main benefit of the TLR is to bring the developer benefits of managed code to trusted computing. With the TLR, developers can build their trusted components with the productivity benefits of modern high-level languages, such as strong typing and garbage collection.
Conference Paper
A trusted execution environment (TEE) is a secure processing environment that is isolated from the normal processing environment where the device operating system and applications run. The first mobile phones with hardware-based TEEs appeared almost a decade ago, and today almost every smartphone and tablet contains a TEE like ARM TrustZone. Despite such a large-scale deployment, the use of TEE functionality has been limited for developers. With emerging standardization this situation is about to change. In this tutorial, we explain the security features provided by mobile TEEs and describe On-board Credentials (ObC) system that enables third-party TEE development. We discuss ongoing TEE standardization activities, including the recent Global Platform standards and the Trusted Platform Module (TPM) 2.0 specification, and identify open problems for the near future of mobile hardware security.
Conference Paper
Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02 mm2 and 485 mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
Conference Paper
Privacy has become an issue of paramount importance for many users. As a result, encryption tools such as TrueCrypt, OS-based full-disk encryption such as FileVault, and privacy modes in all modern browsers have become popular. However, although such tools are useful, they are not perfect. For example, prior work has shown that browsers still leave many traces of user information on disk even if they are started in private browsing mode. In addition, disk encryption alone is not sufficient, as key disclosure through coercion remains possible. Clearly, it would be useful and highly desirable to have OS-level support that provides strong privacy guarantees for any application -- not only browsers. In this paper, we present the design and implementation of PrivExec, the first operating system service for private execution. PrivExec provides strong, general guarantees of private execution, allowing any application to execute in a mode where storage writes, either to the filesystem or to swap, will not be recoverable by others during or after execution. PrivExec does not require explicit application support, recompilation, or any other preconditions. We have implemented a prototype of PrivExec by extending the Linux kernel; it is performant and practical, and it secures sensitive data against disclosure.
Conference Paper
Recommender systems have been recognized as the most effective method for the information overload problem. Although much effort has been devoted to the "cold-start" problem, it is still an open problem and has become a very pressing issue in social network analysis. In this paper, we propose a novel approach which applies character capture and clustering methods to address the cold-user problem (producing recommendations for new users who have no preference for any item). We use the vector cosine method to obtain the users' similarity matrix and cluster users into different groups. For each group, we produce the top-N recommendation by averaging the ratings of every item and choosing the top N items on the list. The experimental results on the MovieLens-1M data demonstrate that our approach achieves remarkable and consistent improvements in overcoming the cold-start problem.
Conference Paper
Perceptual, "context-aware" applications that observe their environment and interact with users via cameras and other sensors are becoming ubiquitous on personal computers, mobile phones, gaming platforms, household robots, and augmented-reality devices. This raises new privacy risks. We describe the design and implementation of DARKLY, a practical privacy protection system for the increasingly common scenario where an untrusted, third-party perceptual application is running on a trusted device. DARKLY is integrated with OpenCV, a popular computer vision library used by such applications to access visual inputs. It deploys multiple privacy protection mechanisms, including access control, algorithmic privacy transforms, and user audit. We evaluate DARKLY on 20 perceptual applications that perform diverse tasks such as image recognition, object tracking, security surveillance, and face detection. These applications run on DARKLY unmodified or with very few modifications and minimal performance overheads vs. native OpenCV. In most cases, privacy enforcement does not reduce the applications' functionality or accuracy. For the rest, we quantify the tradeoff between privacy and utility and demonstrate that utility remains acceptable even with strong privacy protection.