Article

Recommender system implementations for embedded collaborative filtering applications


Abstract

This paper first proposes a complete recommender system implemented on reconfigurable hardware for testing on-chip, low-energy embedded collaborative filtering applications. Although its computing speed is lower than that of usual multicore microprocessors, the proposal has the advantage of addressing any prediction problem based on collaborative filtering with an off-line, highly portable, lightweight computing environment. The approach has been successfully tested on state-of-the-art datasets. Next, as a result of improving certain tasks of the on-chip recommender system, we propose a custom, fine-grained parallel circuit for fast floating-point matrix multiplication. This circuit was designed to accelerate predictions from the model obtained by the recommender system, and was tested on two small datasets for experimental purposes. The accelerator is built around two levels of parallelism. On the one hand, several predictions run in parallel through the simultaneous multiplication of different vectors of the two matrices. On the other hand, each vector operation is itself executed in parallel: pairs of floating-point values are multiplied simultaneously, and the corresponding partial results are then added in parallel as well. The circuit was compared with other approaches designed for the same purpose: circuits built with automated high-level synthesis tools, a general-purpose microprocessor, and high-performance graphics processing units. The prediction accelerator surpassed the other approaches in terms of execution time. We also evaluated the scalability of the circuit to practical problems using the high-level synthesis approach, and confirmed that implementations based on reconfigurable hardware achieve acceptable speedups over multicore processors.
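A minimal software sketch (not the authors' circuit) of those two levels of parallelism: several vector products proceed independently, and within each product the element-wise multiplies are followed by a parallel tree reduction of the partial sums.

```python
import numpy as np

def tree_reduce(v):
    # pairwise parallel addition: log2(n) adder levels instead of n-1 serial adds
    v = list(v)
    while len(v) > 1:
        if len(v) % 2:
            v.append(0.0)
        v = [v[i] + v[i + 1] for i in range(0, len(v), 2)]
    return v[0]

def predict_all(P, Q):
    # level 1: each (user, item) prediction is independent -> runs in parallel
    # level 2: each dot product multiplies all pairs at once, then tree-reduces
    return np.array([[tree_reduce(P[u] * Q[i]) for i in range(Q.shape[0])]
                     for u in range(P.shape[0])])

rng = np.random.default_rng(0)
P, Q = rng.random((4, 8)), rng.random((5, 8))   # toy user/item factor matrices
assert np.allclose(predict_all(P, Q), P @ Q.T)  # matches ordinary matmul
```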


... In those heterogeneous architectures, the FPGA is generally used as an accelerator to implement the most demanding task of the system. Although fixed-point arithmetic has traditionally been used in FPGAs, in this model floating-point arithmetic is preferred for many applications, such as advanced signal processing [6,7,15], industrial applications [16,17,20], wireless communication [8,21], and other advanced applications [11-14,18,19,22]. Although fixed-point operations have the advantages of fast computation and easy implementation, floating-point (FP) arithmetic offers a larger dynamic range and higher numeric stability. ...
... It has two modules, namely a look-up table and a binary right shifter. Since our final device is an FPGA, we implement conversion functions (18) and (19) with a small look-up table instead of the corresponding logic circuits, as shown in the Look-up table module of Fig. 2. The input to this module is the exponent of the IEEE-754 number, E_x (8 bits, 256 entries), and its outputs are the new exponent E'_x (8 − log2(r) bits wide) and the number of bits by which the mantissa must be right-shifted (the exponent of equation (19), log2(r) bits wide). A binary variable shifter module is required for this operation. ...
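A minimal sketch of how such a conversion table could be populated, assuming a high radix r = 2^k; the value of k, the table name, and the function below are illustrative, not taken from the paper.

```python
# Illustrative sketch (not the paper's exact table): for radix r = 2**k,
# a value m * 2**E can be regrouped as (m * 2**(E % k)) * r**(E // k),
# so a 256-entry table can map each 8-bit IEEE-754 exponent E to the
# narrower high-radix exponent E // k plus the shift count E % k that
# drives the binary variable shifter on the mantissa.
k = 4  # assumed log2(radix); the paper leaves r as a design parameter
lut = [(E // k, E % k) for E in range(256)]

def to_high_radix(exp_bits, mantissa_bits):
    new_exp, shift = lut[exp_bits]
    # the shift direction depends on how the high-radix significand is
    # aligned; a right shift is shown, matching the module described above
    return new_exp, mantissa_bits >> shift
```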
Article
Full-text available
This article proposes a family of high-radix floating-point representations to efficiently handle floating-point addition in FPGA devices with no native floating-point support. Since the variable shifter required by any FP adder is very costly in FPGAs, high-radix formats considerably reduce the number of possible shifts, greatly decreasing execution time and area. Although the high-radix format also incurs a significant penalty in the implementation of multipliers, the experimental results show that the adder improvement outweighs the multiplication penalty for most practical and common cases (digital filters, matrix multiplications, etc.). We also provide the designer with guidelines for selecting a suitable radix as a function of the ratio between the number of additions and multiplications in the targeted algorithm. For applications with similar numbers of additions and multiplications, the high-radix version can be up to 26% faster, while offering a wider dynamic range and more significant bits. Furthermore, thanks to the proposed efficient converters between the standard IEEE-754 format and our internal high-radix format, the cost of input/output conversions in FPGA accelerators is negligible.
... The main idea is to decompose the original user-commodity rating matrix into low-rank latent matrices through techniques like singular value decomposition (SVD) and then obtain the prediction result through analysis [12]. Matrix decomposition has relieved the sparsity problem of CF algorithms to a certain degree, but the cold-start problem remains unsolved: similarity is difficult to compute for new commodities, so the probability that a user will purchase them cannot be predicted [12]. Furthermore, CF algorithms transform the prediction of user purchasing behavior into a rating prediction problem, and the prediction result depends heavily on users' rating information for commodities. ...
Preprint
Full-text available
In recent years, with the rapid development of wireless communication networks, M-Commerce has achieved great success. Online shopping through mobile phones, tablets and other wireless communication devices has become a mainstream way for users to consume. Users leave a large amount of historical behavior data when shopping on an M-Commerce platform. Using these data to predict users' future purchasing behavior is of great significance for improving user experience and realizing mutual benefit between merchant and user. Therefore, this study proposes sample-balance-based multi-perspective feature ensemble learning as a solution for predicting user purchasing behavior, specifically including: 1) "Sliding-window" centroid under-sampling was combined with sample balancing: the positive sample size was enlarged using the sliding window, while centroid under-sampling reduced the negative sample size within the window, so as to acquire sample-balanced historical purchasing data. 2) Features influencing user purchasing behavior were extracted from three perspectives—user, commodity and interaction—to further enrich the feature dimensions; feature selection was carried out using the XGBSFS algorithm. 3) An ensemble learning model—five-fold cross-validation stacking—was proposed to predict user purchasing behavior, combining three prediction models—XGBoost-Logistics, LightGBM-L2 and a cascaded deep forest—so that they collaborate and the overall prediction ability of the ensemble is improved. 4) Experiments were conducted on large-scale real datasets from the Alibaba M-Commerce platform. The experimental results show that the proposed method achieves a better prediction effect on evaluation indexes such as precision and recall.
... RSs help decision-makers increase people's willingness to take the vaccines. RSs are information-filtering systems applied to predict a user's preference for an item [6][7]. Thus, in this paper we implement an Enhanced RS for the vaccine to prevent the COVID-19 pandemic, defined as ECRSs-19. ...
... It provides a competitive advantage in low-power design and real-time data processing. Furthermore, it shows powerful processing ability in the massive matrix operations and multiply-accumulate operations that are the main building blocks of deep neural networks [17]. ...
Article
Full-text available
The size of neural networks in deep learning techniques is increasing and varies significantly according to the requirements of real-life applications. The increasing network size, along with the scalability requirements, poses significant challenges for a high-performance implementation of deep neural networks (DNN). Conventional implementations, such as graphical processing units and application-specific integrated circuits, are either less efficient or less flexible. Consequently, this article presents a system-on-chip (SoC) solution for the acceleration of DNN, where an ARM processor controls the overall execution and off-loads computationally intensive operations to a hardware accelerator. The system implementation is performed on a SoC development board. Experimental results show that the proposed system achieves a speed-up of 22.3, with a network architecture size of 64x64, in comparison with the native implementation on a dual-core Cortex ARM-A9 processor. In order to generalize the performance of the complete system, a mathematical formula is presented which allows computing the total execution time for any architecture size. The validation is performed by taking epileptic seizure recognition as the target case study. Finally, the results of the proposed solution are compared with various state-of-the-art solutions in terms of execution time, scalability, and clock frequency.
... The root mean square error (RMSE) function is employed for performance evaluation. RMSE has been utilized in many prediction approaches [15], [28] to evaluate the performance of the CF technique. A lower RMSE value indicates a higher prediction accuracy. ...
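For reference, a minimal statement of the metric, with T the test set of user-item pairs, r_ui the observed rating and r̂_ui the predicted one:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left( r_{ui} - \hat{r}_{ui} \right)^{2}}
```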
Article
Full-text available
Collaborative Filtering (CF) is a widely used technique in recommendation systems. It provides personal recommendations for users based on their preferences. However, this technique suffers from the sparsity issue, which occurs due to a high proportion of missing rating scores in a rating matrix. Several factorization approaches have been used to address the sparsity issue. Such techniques have also been considered to tackle other challenges such as overfitted predicted scores. Nevertheless, they suffer from setbacks such as drift in user preferences and items' popularity decay. These challenges can be solved by prediction approaches that accurately learn long-term and short-term preferences integrated with factorization features. Nonetheless, current temporal-based factorization approaches do not accurately learn the convergence of the assigned k clusters due to the low number of short-term periods. Additionally, the use of optimization algorithms in the learning process to reduce prediction errors is time-consuming, which necessitates a faster optimization algorithm. To address these issues, a new temporal-based approach named TWOCF is proposed in this paper. TWOCF utilizes the elbow clustering method to define the optimal number of clusters for the temporal activities of both users and items. This approach deploys the whale optimization algorithm to accurately learn short-term preferences within other factorization and temporal features. Experimental results indicate that TWOCF exhibits a superior CF prediction accuracy achieved within a shorter execution time when compared to the benchmark approaches.
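As an illustration of the elbow step, here is a generic sketch; the data, features and parameters below are placeholders, not TWOCF's actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_k(X, k_max=10):
    # inertia (within-cluster sum of squares) for each candidate k
    inertias = np.array([KMeans(n_clusters=k, n_init=10, random_state=0)
                         .fit(X).inertia_ for k in range(1, k_max + 1)])
    # the "elbow" is where the curve bends most: maximum second difference
    curvature = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    return int(np.argmax(curvature)) + 2  # +2 maps the index back to k

X = np.random.default_rng(0).random((200, 3))  # toy temporal-activity features
print(elbow_k(X))
```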
... The main idea is to decompose the original user-commodity rating matrix into low-rank latent matrices through techniques like singular value decomposition and then obtain the prediction result through analysis [15]. Matrix decomposition has relieved the sparsity problem of CF algorithms to a certain degree, but the cold-start problem remains unsolved: similarity is difficult to compute for new commodities, so the probability that a user will purchase them cannot be predicted [16,17]. Furthermore, CF algorithms transform the prediction of user purchasing behavior into a rating prediction problem, and the prediction result depends heavily on users' rating information for commodities. ...
Article
Full-text available
With the rapid development of wireless communication networks, M-Commerce has achieved great success. Users leave a large amount of historical behavior data when shopping on an M-Commerce platform. Using these data to predict users' future purchasing behavior is of great significance for improving user experience and realizing mutual benefit between merchant and user. Therefore, this study proposes sample-balance-based multi-perspective feature ensemble learning as a solution for predicting user purchasing behavior, acquiring sample-balanced historical purchasing data. Features influencing user purchasing behavior were extracted from three perspectives—user, commodity and interaction—to further enrich the feature dimensions, and feature selection was carried out using the XGBSFS algorithm. Experiments were conducted on large-scale real datasets from the Alibaba M-Commerce platform. The experimental results show that the proposed method achieves a better prediction effect on evaluation indexes such as precision and recall.
... The main parallelization strategy for PMF is described in [49]. As we can see in Algorithm 1, two consecutive loops can be parallelized after initialization in order to update the corresponding factorized matrices for each user/item. ...
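A schematic sketch of that strategy follows; hyperparameters and data are illustrative, and Python threads serialize on the GIL, but the loop structure shows which updates are independent: each user's factor row depends only on that user's ratings, and likewise for items, so each of the two loops can be distributed across parallel units.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def update_user(u, R, P, Q, lr=0.005, reg=0.02):
    # user u's factor row depends only on u's own observed ratings
    for i in np.nonzero(R[u])[0]:
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])

def update_item(i, R, P, Q, lr=0.005, reg=0.02):
    # item i's factor row depends only on i's own observed ratings
    for u in np.nonzero(R[:, i])[0]:
        err = R[u, i] - P[u] @ Q[i]
        Q[i] += lr * (err * P[u] - reg * Q[i])

rng = np.random.default_rng(0)
R = rng.choice([0, 0, 0, 3, 4, 5], size=(100, 50)).astype(float)  # 0 = unobserved
P = 0.1 * rng.standard_normal((100, 8))   # user factors
Q = 0.1 * rng.standard_normal((50, 8))    # item factors

with ThreadPoolExecutor() as ex:
    list(ex.map(lambda u: update_user(u, R, P, Q), range(R.shape[0])))  # loop 1: users
    list(ex.map(lambda i: update_item(i, R, P, Q), range(R.shape[1])))  # loop 2: items
```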
Article
Full-text available
Nowadays, highly portable, low-energy computing environments require applications able to satisfy computing-time and energy constraints. Furthermore, recommender systems based on collaborative filtering are intelligent systems that use large databases and perform extensive matrix arithmetic. In this research, we present an optimized algorithm and a parallel hardware implementation as a good approach for running embedded collaborative filtering applications. To this end, we considered high-level synthesis programming for reconfigurable hardware technology. The design was tested in environments with usual parameters and real-world datasets, and compared against usual microprocessors running similar implementations. The performance of the different implementations was analyzed in terms of computing time and energy consumption. The main conclusion is that the optimized algorithm is competitive in embedded applications when considering large datasets and parallel implementations based on reconfigurable hardware.
Article
Embedded applications built on stand-alone devices and interconnected over TCP/IP networks need a high degree of integration with other systems. Web services provide such integration through a service-oriented distributed architecture, and TCP/IP networks are widely employed to integrate business applications, but this integration is still not provided by embedded applications themselves. Interpreting sensor data in Internet-connected applications is an especially difficult problem in embedded systems. Real-time sensor processing generates training/classification results for a selected IoT application with a customized data-structure design, using an integrated hardware/software system to achieve continuous training, real-time data analysis, and re-training of Machine Learning (ML) algorithms. The paper also draws on a database of English words collected from native and non-native speakers, reports various methods used in English vocabulary recognition systems, and examines a corpus-based study of learner mediation, observing that students use advanced technical and general vocabulary in their writing and that contributing to a professional corpus improves their discourse. To stimulate the different means of integration, several current network technologies are evaluated, discussing the balance between adaptability and built-in support for this integration in today's Web network technologies.
Article
This article has been withdrawn: please see Elsevier Policy on Article Withdrawal (https://www.elsevier.com/about/our-business/policies/article-withdrawal). This article has been withdrawn at the request of the Editor in Chief. Subsequent to acceptance of this special issue paper by the responsible Guest Editor Sundhararajan Mahalingam, the integrity and rigor of the peer-review process was investigated and confirmed to fall beneath the high standards expected by Microprocessors & Microsystems. There are also indications that much of the Special Issue includes unoriginal and heavily paraphrased content. Due to a configuration error in the editorial system, unfortunately the Editor in Chief did not receive these papers for approval as per the journal's standard workflow.
Article
Full-text available
Generalized sparse non-negative matrix factorization (SNMF) has been proven useful in extracting information and representing sparse data with various types of probabilistic distributions from industrial applications, e.g., recommender systems and social networks. However, current solution approaches for generalized SNMF are based on the manipulation of whole sparse matrices and factor matrices, which will result in large-scale intermediate data. Thus, these approaches cannot describe the high-dimensional and sparse (HiDS) matrices in mainstream industrial and big data platforms, e.g., Graphics Processing Unit (GPU) and multi-GPU, in an online and scalable manner. To overcome these issues, an online, scalable and single-thread-based SNMF for CUDA parallelization on GPU (CUSNMF) and multi-GPU (MCUSNMF) is proposed. First, theoretical derivation is conducted, which demonstrates that the CUSNMF depends only on the products and sums of the involved feature tuples. Next, the compactness, which can follow the sparsity pattern of sparse matrices, endows the CUSNMF with online learning capability and the fine granularity gives it high parallelization potential on GPU and multi-GPU. Finally, the performance results on several real industrial data sets demonstrate the linear scalability of the time overhead and the space requirement and the validity of the extension to online learning. Moreover, CUSNMF obtains speedup of 7X on a P100 GPU compared to that of the state-of-the-art parallel approaches on a shared memory platform.
Article
Full-text available
Real-time accurate recommendation of large-scale recommender systems is a challenging task. Matrix factorization (MF), as one of the most accurate and scalable techniques to predict missing ratings, has become popular in the collaborative filtering (CF) community. Currently, stochastic gradient descent (SGD) is one of the most famous approaches for MF. However, it is non-trivial to parallelize SGD for large-scale problems due to the dependence on the user and item pair, which can cause parallelization over-writing. To remove the dependence on the user and item pair, we propose a multi-stream SGD (MSGD) approach, for which the updating process is theoretically convergent. On that basis, we propose a CUDA (Compute Unified Device Architecture) parallelization MSGD (CUMSGD) approach. CUMSGD can obtain high parallelism and scalability on Graphics Processing Units (GPUs). On Tesla K20m and K40c GPUs, the experimental results show that CUMSGD outperforms prior works that accelerated MF on shared memory systems, e.g., DSGD, FPSGD, Hogwild!, and CCD++. For large-scale CF problems, we propose a multiple-GPU (multi-GPU) CUMSGD (MCUMSGD). The experimental results show that MCUMSGD can improve MSGD performance further.
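To see where the over-writing comes from, consider the per-rating SGD update below (a generic sketch, not MSGD itself): two ratings that share a user row or an item row both read-modify-write the same vector, so they cannot safely run in parallel without the kind of restructuring MSGD performs.

```python
import numpy as np

def sgd_step(P, Q, u, i, r, lr=0.01, reg=0.02):
    # one stochastic update for the observed rating r of user u on item i
    err = r - P[u] @ Q[i]
    P[u] += lr * (err * Q[i] - reg * P[u])  # every rating of user u writes row P[u]
    Q[i] += lr * (err * P[u] - reg * Q[i])  # every rating of item i writes row Q[i]
```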
Article
Full-text available
Autonomous control systems onboard planetary rovers and spacecraft benefit from having cognitive capabilities like learning so that they can adapt to unexpected situations in-situ. Q-learning is a form of reinforcement learning and it has been efficient in solving certain class of learning problems. However, embedded systems onboard planetary rovers and spacecraft rarely implement learning algorithms due to the constraints faced in the field, like processing power, chip size, convergence rate and costs due to the need for radiation hardening. These challenges present a compelling need for a portable, low-power, area-efficient hardware accelerator to make learning algorithms practical onboard space hardware. This paper presents an FPGA implementation of Q-learning with Artificial Neural Networks (ANN). This method matches the massive parallelism inherent in neural network software with the fine-grain parallelism of FPGA hardware, thereby dramatically reducing processing time. Mars Science Laboratory currently uses Xilinx space-grade Virtex FPGA devices for image processing, pyrotechnic operation control and obstacle avoidance. We simulate and program our architecture on a Xilinx Virtex 7 FPGA. Architectural implementations for a single-neuron Q-learning accelerator and a more complex Multilayer Perceptron (MLP) Q-learning accelerator have been demonstrated. The results show up to a 43-fold speed up by Virtex 7 FPGAs compared to a conventional Intel i5 2.3 GHz CPU. Finally, we simulate the proposed architecture using the Symphony simulator and compiler from Xilinx, and evaluate the performance and power consumption.
Article
Full-text available
Reconfigurable architectures can bring unique capabilities to computational tasks. They offer the performance and energy efficiency of hardware with the flexibility of software. In some domains, they are the only way to achieve the required, real-time performance without fabricating custom integrated circuits. Their functionality can be upgraded and repaired during their operational lifecycle and specialized to the particular instance of a task. We survey the field of reconfigurable computing, providing a guide to the body-of-knowledge accumulated in architecture, compute models, tools, run-time reconfiguration, and applications.
Article
Full-text available
This study presents the architecture and implementation of a field-programmable gate array (FPGA) accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm, which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering and simplifies placement and routing on the chip. The authors show that such an architecture is especially well suited for full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources, in order to allow for a large number of processing elements (PEs). Each PE uses eight Virtex-6 DSP blocks. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area size and high clock frequency. Finally, the authors quantify the performance of the accelerator implemented in a Xilinx Virtex-6 FPGA, with 252 PEs running at 403 MHz (achieving 203.1 Giga FLOPS (GFLOPS)), by comparing it to the double-precision matrix multiplication function from the MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core2Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
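The block algorithm referred to is the standard one; a compact reference sketch (sizes and block width below are arbitrary) showing the host-side accumulation of partial result blocks:

```python
import numpy as np

def blocked_matmul(A, B, b=32):
    # the "accelerator" produces one partial block product at a time and the
    # host accumulates it into C as soon as it arrives, so no output buffering
    # of the full result is needed
    n, m, p = A.shape[0], B.shape[1], A.shape[1]
    C = np.zeros((n, m))
    for i in range(0, n, b):
        for j in range(0, m, b):
            for kk in range(0, p, b):
                C[i:i+b, j:j+b] += A[i:i+b, kk:kk+b] @ B[kk:kk+b, j:j+b]
    return C

rng = np.random.default_rng(0)
A, B = rng.random((100, 70)), rng.random((70, 90))
assert np.allclose(blocked_matmul(A, B), A @ B)
```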
Article
Full-text available
Recommender systems are widely used in many areas, especially in e-commerce. Recently, they are also applied in e-learning for recommending learning objects (e.g. papers) to students. This chapter introduces state-of-the-art recommender system techniques which can be used not only for recommending objects like tasks/exercises to the students but also for predicting student performance. We formulate the problem of predicting student performance as a recommender system problem and present matrix factorization methods, which are currently known as the most effective recommendation approaches, to implicitly take into account the prevailing latent factors (e.g. "slip" and "guess") for predicting student performance. As a learner's knowledge improves over time, too, we propose tensor factorization methods to take the temporal effect into account. Finally, some experimental results and discussions are provided to validate the proposed approach.
Chapter
Full-text available
Recommender Systems (RSs) are software tools and techniques providing suggestions for items to be of use to a user. In this introductory chapter we briefly discuss basic RS ideas and concepts. Our main goal is to delineate, in a coherent and structured way, the chapters included in this handbook and to help the reader navigate the extremely rich and detailed content that the handbook offers.
Article
Full-text available
Machine-learning algorithms are employed in a wide variety of applications to extract useful information from data sets, and many are known to suffer from super-linear increases in computational time with increasing data size and number of signals being processed (data dimension). Certain principal machine-learning algorithms are commonly found embedded in larger detection, estimation, or classification operations. Three such principal algorithms are the Parzen window-based, non-parametric estimation of Probability Density Functions (PDFs), K-means clustering and correlation. Because they form an integral part of numerous machine-learning applications, fast and efficient execution of these algorithms is extremely desirable. FPGA-based reconfigurable computing (RC) has been successfully used to accelerate computationally intensive problems in a wide variety of scientific domains to achieve speedup over traditional software implementations. However, this potential benefit is quite often not fully realized because creating efficient FPGA designs is generally carried out in a laborious, case-specific manner requiring a great amount of redundant time and effort. In this paper, an approach using pattern-based decomposition for algorithm acceleration on FPGAs is proposed that offers significant increases in productivity via design reusability. Using this approach, we design, analyze, and implement a multi-dimensional PDF estimation algorithm using Gaussian kernels on FPGAs. First, the algorithm's amenability to a hardware paradigm and expected speedups are predicted. After implementation, actual speedup and performance metrics are compared to the predictions, showing speedup on the order of 20× over a 3.2 GHz processor. Multi-core architectures are developed to further improve performance by scaling the design. Portability of the hardware design across multiple FPGA platforms is also analyzed. After implementing the PDF algorithm, the value of pattern-based decomposition to support reuse is demonstrated by rapid development of the K-means and correlation algorithms. Keywords: FPGA, design patterns, machine learning, pattern recognition, hardware acceleration, performance prediction
Conference Paper
Full-text available
Matrix multiplication is a computation intensive operation and plays an important role in many scientific and engineering applications. For high performance applications, this operation must be realized in hardware. This paper presents a parallel architecture for the multiplication of two matrices using field programmable gate array (FPGA). The proposed architecture employs advanced design techniques and exploits architectural features of FPGA. Results show that it provides performance improvements over previously reported hardware implementation. FPGA implementation results are presented and discussed.
Conference Paper
Full-text available
Sparse matrix factorization is a critical step for the circuit simulation problem, since it is time consuming and computed repeatedly in the flow of circuit simulation. To accelerate the factorization of sparse matrices, a parallel CPU+FPGA based architecture is proposed in this paper. While the pre-processing of the matrix is implemented on CPU, the parallelism of numeric factorization is explored by processing several columns of the sparse matrix simultaneously on a set of processing elements (PE) in FPGA. To cater for the requirements of circuit simulation, we also modified the Gilbert/Peierls (G/P) algorithm and considered the scalability of our architecture. Experimental results on circuit matrices from the University of Florida Sparse Matrix Collection show that our architecture achieves speedup of 0.5x-5.36x compared with the CPU KLU results.
Article
Full-text available
Recommender systems have been evaluated in many, often incomparable, ways. In this article, we review the key decisions in evaluating collaborative filtering recommender systems: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole. In addition to reviewing the evaluation strategies used by prior researchers, we present empirical results from the analysis of various accuracy metrics on one content domain where all the tested metrics collapsed roughly into three equivalence classes. Metrics within each equivalency class were strongly correlated, while metrics from different equivalency classes were uncorrelated.
Article
Full-text available
Several high-performance computers now use field-programmable gate arrays as reconfigurable coprocessors. The authors describe the two major contemporary HPRC architectures and explore the pros and cons of each using representative applications from remote sensing, molecular dynamics, bioinformatics, and cryptanalysis.
Article
Non-negative Matrix Factorization (NMF) plays an important role in many data mining applications for low-rank representation and analysis. Due to the sparsity caused by missing information in many high-dimension scenes, e.g., social networks or recommender systems, NMF cannot mine a more accurate representation from the explicit information alone. Manifold learning can incorporate the intrinsic geometry of the data, combined with neighborhood-based implicit information. Thus, manifold-regularized NMF (MNMF) can realize a more compact representation for sparse data. However, MNMF suffers from (a) the formation of large-scale Laplacian matrices, (b) frequent large-scale matrix manipulation, and (c) the involved K-nearest neighbor points, which result in the over-writing problem in parallelization. To address these issues, a single-thread-based MNMF model is proposed for two types of divergence, i.e., Euclidean distance and Kullback–Leibler (KL) divergence, which depends only on the involved feature-tuples' multiplications and summations and can avoid large-scale matrix manipulation. Furthermore, this model can remove the dependence among the feature vectors with fine-grain parallelization inherence. On that basis, a CUDA parallelization MNMF (CUMNMF) is presented on GPU computing. From the experimental results, CUMNMF achieves a 20X speedup compared with MNMF, as well as a lower time complexity and space requirement.
Article
Deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to the large volume of data, the extensive amount of computation and frequent memory accesses. Although existing high-level synthesis tools (e.g. HLS, OpenCL) for FPGAs dramatically reduce the design time, the resulting implementations are still inefficient with respect to resource allocation for maximizing parallelism and throughput. Manual hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration, but that requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. This work presents a scalable solution that achieves the flexibility and reduced design time of high-level synthesis and the near-optimality of an RTL implementation. The proposed solution is a compiler that analyzes the algorithm structure and parameters, and automatically integrates a set of modular and scalable computing primitives to accelerate the operation of various deep learning algorithms on an FPGA. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed RTL compiler, named ALAMO, is demonstrated on an Altera Stratix-V GXA7 FPGA for the inference tasks of the AlexNet and NiN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9X improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
Article
In this work, we present PolyBlaze, a scalable and configurable multicore platform for FPGA-based embedded systems and systems research. PolyBlaze is an extension of the MicroBlaze soft processor, leveraging the configurability of the MicroBlaze and bringing it into the multicore era with Linux Symmetric Multi-Processor (SMP) support. This work details the hardware modifications required for the MicroBlaze processor and its software stack to enable fully validated SMP operations, including atomic operation support, shared interrupts and timers, and exception handling. New in this work, we present a scalable and flexible memory hierarchy optimized for Field Programmable Gate Arrays (FPGAs), which manages atomic operations and provides support for future flexible memory hierarchies and heterogeneous systems. Also new is an in-depth analysis of key performance characteristics, including memory bandwidth, latency, and resource usage. For all system configurations, bandwidth is found to scale linearly with the addition of processor cores until the memory interface is saturated. Additionally, average memory latency remains constant until the memory interface is saturated; after which, it scales linearly with each additional processor core.
Article
This paper presents an overview of the field of recommender systems and describes the current generation of recommendation methods that are usually classified into the following three main categories: content-based, collaborative, and hybrid recommendation approaches. This paper also describes various limitations of current recommendation methods and discusses possible extensions that can improve recommendation capabilities and make recommender systems applicable to an even broader range of applications. These extensions include, among others, an improvement of understanding of users and items, incorporation of the contextual information into the recommendation process, support for multicriteria ratings, and a provision of more flexible and less intrusive types of recommendations.
Article
Group Recommender Systems are becoming very popular in the social web owing to their ability to provide a set of recommendations to a group of users. Several group recommender systems have been proposed by extending traditional KNN based Collaborative Filtering. In this paper we explain how to perform group recommendations using Matrix Factorization (MF) based Collaborative Filtering (CF). We propose three original approaches to map the group of users to the latent factor space and compare the proposed methods in three different scenarios: when the group size is small, medium and large. We also compare the precision of the proposed methods with state-of-the-art group recommendation systems using KNN based Collaborative Filtering. We analyze group movie ratings on MovieLens and Netflix datasets. Our study demonstrates that the performance of group recommender systems varies depending on the size of the group, and MF based CF is the best option for group recommender systems.
Article
Matrix multiplication is a kernel and fundamental operation in many applications including image, robotic and digital signal processing. The key component of matrix multiplication is Multiplier Accumulator (MAC) which is a decisive component for the performance of matrix multiplication. This study proposes a pipelined floating-point MAC architecture on Field Programmable Gate Array (FPGA) using a novel accumulating method. By adding the last N-stage results of the pipelined adder, the accumulation of the multiplier products can be obtained. Then, a matrix multiplication is implemented by employing parallel systolic structure based on the proposed MAC. Experimental results demonstrate that the proposed MAC architecture achieves higher clock speed and consumes less hardware resources than previous designs and the matrix multiplier has a good performance and scalability. It also can be concluded that the efficiency of the matrix multiplier is even higher when the matrices are larger.
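The accumulation trick described above can be sketched in software as follows; the pipeline depth N and names are illustrative, not taken from the paper.

```python
import math

N = 8  # assumed adder pipeline depth

def pipelined_mac(xs, ys, n_stages=N):
    # products entering the adder land in one of N independent partial sums,
    # one per pipeline slot, so the accumulator never stalls on a dependency
    partials = [0.0] * n_stages
    for idx, (x, y) in enumerate(zip(xs, ys)):
        partials[idx % n_stages] += x * y
    # final reduction: "adding the last N-stage results of the pipelined adder"
    return sum(partials)

xs, ys = [0.5 * i for i in range(32)], [1.0 / (i + 1) for i in range(32)]
assert math.isclose(pipelined_mac(xs, ys), sum(x * y for x, y in zip(xs, ys)))
```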
Article
In this paper we present a novel technique for predicting the tastes of users in recommender systems based on collaborative filtering. Our technique is based on factorizing the rating matrix into two non-negative matrices whose components lie within the range [0, 1] with an understandable probabilistic meaning. Thanks to this decomposition we can accurately predict the ratings of users, find out some groups of users with the same tastes, as well as justify and understand the recommendations our technique provides.
Article
The MovieLens datasets are widely used in education, research, and industry. They are downloaded hundreds of thousands of times each year, reflecting their use in popular press programming books, traditional and online courses, and software. These datasets are a product of member activity in the MovieLens movie recommendation system, an active research platform that has hosted many experiments since its launch in 1997. This article documents the history of MovieLens and the MovieLens datasets. We include a discussion of lessons learned from running a long-standing, live research platform from the perspective of a research organization. We document best practices and limitations of using the MovieLens datasets in new research.
Conference Paper
Floating-point computing with more than one TFLOP of peak performance is already a reality in recent Field-Programmable Gate Arrays (FPGAs). General-Purpose Graphics Processing Units (GPGPUs) and recent many-core CPUs have also taken advantage of the recent technological innovations in integrated circuit (IC) design and have also dramatically improved their peak performances. In this paper, we compare the trends of these computing architectures for high-performance computing and survey these platforms in the execution of algorithms belonging to different scientific application domains. Trends in peak performance, power consumption and sustained performance, for particular applications, show that the gap between FPGAs and GPUs or many-core CPUs is widening, moving FPGAs away from high-performance computing with intensive floating-point calculations. FPGAs remain competitive for custom floating-point or fixed-point representations, for smaller input sizes of certain algorithms, for combinational logic problems and for parallel map-reduce problems.
Conference Paper
This paper presents case studies on the application of the Xilinx Vivado High Level Synthesis (HLS) tool-suite for C++-based design capture, simulation and synthesis to Hardware Description Language (HDL) format, and further to FPGA hardware implementation. HLS reduces the effort of HDL design capture and debug while allowing flexibility in the final hardware implementation in order to meet design constraints. HLS is not yet widely used. This paper demonstrates the practical steps in using HLS and the resulting hardware implementation. Case studies illustrate the effectiveness of HLS as an efficient and flexible route from design capture to FPGA implementation. The paper presents four HLS design examples, including a multiplexer, counter, register block and a skin detection image processing algorithm. The Xilinx PlanAhead EDA tool-suite is used to generate a Xilinx Spartan-6 FPGA bitstream from the Xilinx Vivado HLS-synthesised HDL model. Each design has been implemented and tested in FPGA hardware using the Vicilogic automation and prototyping tools developed by the authors. These tools automate the integration of designs with an FPGA IP core, which supports Ethernet I/O, an SDRAM interface and a register-based I/O system. The Vicilogic Python client application environment enables GUI-based development and testing of the hardware implementation.
Article
Many organizations today are faced with the challenge of processing and distilling information from huge and growing collections of data. Such organizations are increasingly deploying sophisticated mathematical algorithms to model the behavior of their business processes, to discover correlations in the data, to predict trends and ultimately drive decisions to optimize their operations. These techniques are known collectively as analytics, and draw upon multiple disciplines, including statistics, quantitative analysis, data mining, and machine learning. In this survey paper, we identify some of the key techniques employed in analytics, both to serve as an introduction for the non-specialist and to explore the opportunity for greater optimizations for parallelization and acceleration using commodity and specialized multi-core processors. We are interested in isolating and documenting repeated patterns in analytical algorithms, data structures and data types, and in understanding how these could be most effectively mapped onto parallel infrastructure. To this end, we focus on analytical models that can be executed using different algorithms. For most major model types, we study implementations of key algorithms to determine common computational and runtime patterns. We then use this information to characterize and recommend suitable parallelization strategies for these algorithms, specifically when used in data management workloads.
Conference Paper
K-means clustering is a popular technique for partitioning a data set into subsets of similar features. Due to their simple control flow and inherent fine-grain parallelism, K-means algorithms are well suited for hardware implementations, such as on field programmable gate arrays (FPGAs), to accelerate the computationally intensive calculation. However, the available hardware resources in massively parallel implementations are easily exhausted for large problem sizes. This paper presents an FPGA implementation of an efficient variant of K-means clustering which prunes the search space using a binary kd-tree data structure to reduce the computational burden. Our implementation uses on-chip dynamic memory allocation to ensure efficient use of memory resources. We describe the trade-off between data-level parallelism and search space reduction at the expense of increased control overhead. A data-sensitive analysis shows that our approach requires up to five times fewer computational FPGA resources than a conventional massively parallel implementation for the same throughput constraint.
Conference Paper
One of the challenges to data mining raised by technology development is that both data size and dimensionality are growing rapidly. K-means, one of the most popular clustering algorithms in data mining, suffers in computational time when used for large data sets and data with high dimensionality. In this paper, we propose a hardware architecture for K-means with triangle inequality optimization on FPGA. An optimal 8-bit square calculator for 6-LUT architectures is described to minimize the hardware cost, and an approximation is proposed to avoid the square root calculation in the original triangle inequality optimization. Our software and hardware experiments are tested with the MNIST benchmark and uniform random numbers of various sizes. This approximation results in 2% more distance calculations for MNIST and 5% for uniform random numbers than the original optimization. Compared to the baseline hardware system without optimization, our approach achieves up to 77% improvement in processing time with about 10% logic overhead. We demonstrate that the hardware can achieve a 55-fold speed-up compared to software for the 1024 MNIST.
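For reference, the original (exact) triangle-inequality pruning works as sketched below; the paper's hardware approximation, which avoids the square root, is not reproduced here. The lemma used is: if d(c_a, c_b) >= 2*d(x, c_a), then c_b cannot be closer to x than c_a, so d(x, c_b) need not be computed.

```python
import numpy as np

def assign_with_pruning(X, C):
    # precompute centroid-to-centroid distances once per assignment pass
    cc = np.linalg.norm(C[:, None] - C[None, :], axis=-1)
    labels, skipped = np.empty(len(X), int), 0
    for n, x in enumerate(X):
        best, d_best = 0, np.linalg.norm(x - C[0])
        for j in range(1, len(C)):
            if cc[best, j] >= 2 * d_best:
                skipped += 1            # pruned: distance never computed
                continue
            d = np.linalg.norm(x - C[j])
            if d < d_best:
                best, d_best = j, d
        labels[n] = best
    return labels, skipped

rng = np.random.default_rng(0)
labels, skipped = assign_with_pruning(rng.random((500, 2)), rng.random((8, 2)))
```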
Article
Recommender systems have developed in parallel with the web. They were initially based on demographic, content-based and collaborative filtering. Currently, these systems are incorporating social information. In the future, they will use implicit, local and personal information from the Internet of things. This article provides an overview of recommender systems as well as collaborative filtering methods and algorithms; it also explains their evolution, provides an original classification for these systems, identifies areas of future implementation and develops certain areas selected for past, present or future importance.
Article
A new variant ‘PMF’ of factor analysis is described. It is assumed that X is a matrix of observed data and σ is the known matrix of standard deviations of the elements of X. Both X and σ are of dimensions n × m. The method solves the bilinear matrix problem X = GF + E, where G is the unknown left factor matrix (scores) of dimensions n × p, F is the unknown right factor matrix (loadings) of dimensions p × m, and E is the matrix of residuals. The problem is solved in the weighted least squares sense: G and F are determined so that the Frobenius norm of E divided (element-by-element) by σ is minimized. Furthermore, the solution is constrained so that all the elements of G and F are required to be non-negative. It is shown that the solutions by PMF are usually different from any solutions produced by the customary factor analysis (FA, i.e. principal component analysis (PCA) followed by rotations). Usually PMF produces a better fit to the data than FA. Also, the result of PMF is guaranteed to be non-negative, while the result of FA often cannot be rotated so that all negative entries would be eliminated. Different possible application areas of the new method are briefly discussed. In environmental data, the error estimates of data can be widely varying and non-negativity is often an essential feature of the underlying models. Thus it is concluded that PMF is better suited than FA or PCA in many environmental applications. Examples of successful applications of PMF are shown in companion papers.
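In the notation above, the problem PMF solves can be stated compactly as a non-negativity-constrained weighted least-squares objective:

```latex
\min_{G \ge 0,\; F \ge 0} \;
\sum_{i=1}^{n} \sum_{j=1}^{m}
\left( \frac{x_{ij} - \sum_{k=1}^{p} g_{ik} f_{kj}}{\sigma_{ij}} \right)^{2}
```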
Book
The explosive growth of e-commerce and online environments has made the issue of information search and selection increasingly serious; users are overloaded by options to consider and they may not have the time or knowledge to personally evaluate these options. Recommender systems have proven to be a valuable way for online users to cope with the information overload and have become one of the most powerful and popular tools in electronic commerce. Correspondingly, various techniques for recommendation generation have been proposed. During the last decade, many of them have also been successfully deployed in commercial environments. Recommender Systems Handbook, an edited volume, is a multi-disciplinary effort that involves world-wide experts from diverse fields, such as artificial intelligence, human computer interaction, information technology, data mining, statistics, adaptive user interfaces, decision support systems, marketing, and consumer behavior. Theoreticians and practitioners from these fields continually seek techniques for more efficient, cost-effective and accurate recommender systems. This handbook aims to impose a degree of order on this diversity, by presenting a coherent and unified repository of recommender systems major concepts, theories, methodologies, trends, challenges and applications. Extensive artificial applications, a variety of real-world applications, and detailed case studies are included. Recommender Systems Handbook illustrates how this technology can support the user in decision-making, planning and purchasing processes. It works for well known corporations such as Amazon, Google, Microsoft and AT&T. This handbook is suitable for researchers and advanced-level students in computer science as a reference.
Article
In October, 2006 Netflix released a dataset containing 100 million anonymous movie ratings and challenged the data mining, machine learning and computer science communities to develop systems that could beat the accuracy of its recommendation system, Cinematch. We briefly describe the challenge itself, review related work and efforts, and summarize visible progress to date. Other potential uses of the data are outlined, including its application to the KDD Cup 2007.
Article
During the last decade, the data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.
Chapter
The importance of contextual information has been recognized by researchers and practitioners in many disciplines, including e-commerce personalization, information retrieval, ubiquitous and mobile computing, data mining, marketing, and management. While a substantial amount of research has already been performed in the area of recommender systems, most existing approaches focus on recommending the most relevant items to users without taking into account any additional contextual information, such as time, location, or the company of other people (e.g., for watching movies or dining out). In this chapter we argue that relevant contextual information does matter in recommender systems and that it is important to take this information into account when providing recommendations. We discuss the general notion of context and how it can be modeled in recommender systems. Furthermore, we introduce three different algorithmic paradigms – contextual prefiltering, post-filtering, and modeling – for incorporating contextual information into the recommendation process, discuss the possibilities of combining several contextaware recommendation techniques into a single unifying approach, and provide a case study of one such combined approach. Finally, we present additional capabilities for context-aware recommenders and discuss important and promising directions for future research.
Article
As the Netflix Prize competition has demonstrated, matrix factorization models are superior to classic nearest neighbor techniques for producing product recommendations, allowing the incorporation of additional information such as implicit feedback, temporal effects, and confidence levels.
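This connects directly to the accelerator theme of the surveyed paper: once the user and item factor matrices are learned, every prediction is an inner product, and predicting all ratings at once is a single matrix product. A minimal sketch, assuming factor matrices P and Q with one row per user and item respectively (illustrative names, not from the article):

    import numpy as np

    # P: (num_users x f) user factors, Q: (num_items x f) item factors.
    # Predicting every user-item rating at once is one matrix product,
    # which is exactly the operation the surveyed hardware accelerates.
    def predict_all(P, Q):
        return P @ Q.T                      # entry (u, i) approximates r_ui

    def predict_one(P, Q, u, i):
        return float(P[u] @ Q[i])           # one prediction = one dot product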
Article
Collaborative filtering is one of the most successful and widely used methods of automated product recommendation in online stores. The most critical component of the method is the mechanism for finding similarities among users from product ratings data, so that products can be recommended based on those similarities. The calculation of similarities has relied on traditional distance and vector similarity measures such as Pearson's correlation and cosine, which, however, have seldom been questioned in terms of their effectiveness in the recommendation problem domain. This paper presents a new heuristic similarity measure that focuses on improving recommendation performance under cold-start conditions, where only a small number of ratings are available for similarity calculation for each user. Experiments using three different datasets show the superiority of the measure in new-user cold-start conditions.
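For reference, the two baseline measures the paper questions can be sketched in a few lines of Python over the items two users have co-rated (representing each user's ratings as a dictionary mapping item to rating is an assumption of this sketch):

    import numpy as np

    def corated(ru, rv):
        """Ratings of the two users restricted to their co-rated items."""
        items = sorted(set(ru) & set(rv))
        return np.array([ru[i] for i in items]), np.array([rv[i] for i in items])

    def cosine_sim(ru, rv):
        a, b = corated(ru, rv)
        if a.size == 0:
            return 0.0
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def pearson_sim(ru, rv):
        a, b = corated(ru, rv)
        if a.size < 2 or a.std() == 0 or b.std() == 0:
            return 0.0                      # undefined on constant vectors
        return float(np.corrcoef(a, b)[0, 1])

Both measures degrade when the co-rated set is tiny, which is precisely the cold-start weakness the proposed heuristic targets.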
Article
Recommender systems are typically provided as Web 2.0 services and are part of the range of applications that support large-scale social networks, enabling on-line recommendations to be made from networked databases. The operating core of recommender systems is the collaborative filtering stage, which, in current user-to-user recommender processes, usually uses the Pearson correlation metric. In this paper, we present a new metric which combines the numerical information of the votes with information independent of those values, based on the proportions of common and uncommon votes between each pair of users. We also describe the reasoning and experiments on which the design of the metric is based, as well as its restriction to recommender systems where the range of possible votes is not greater than 5. In order to demonstrate the superiority of the proposed metric, we provide comparative results from a set of experiments based on the MovieLens, FilmAffinity and NetFlix databases. In addition to the traditional levels of accuracy, results are also provided on the metrics' coverage, the percentage of hits obtained, and precision/recall.
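The paper's exact formulation is not reproduced here; the following sketch is merely in the same spirit, combining a Jaccard-style proportion of common votes with a mean-squared-difference term on the numerical values (our own illustrative combination, with votes assumed to lie in 1..5):

    def overlap_msd_sim(ru, rv, max_vote=5):
        """Illustrative only: Jaccard overlap of the rated item sets times
        an MSD-based similarity on the common votes.  `ru`, `rv` are dicts
        mapping item -> vote, votes in 1..max_vote."""
        common = set(ru) & set(rv)
        union = set(ru) | set(rv)
        if not common:
            return 0.0
        jaccard = len(common) / len(union)
        msd = sum((ru[i] - rv[i]) ** 2 for i in common) / len(common)
        return jaccard * (1.0 - msd / (max_vote - 1) ** 2)

The second factor stays in [0, 1] only because the vote range is bounded, which is one way to read the paper's restriction to vote ranges no greater than 5.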
Conference Paper
Recommender systems provide users with personalized suggestions for products or services. These systems often rely on Collaborative Filtering (CF), where past transactions are analyzed in order to establish connections between users and products. The two most successful approaches to CF are latent factor models, which directly profile both users and products, and neighborhood models, which analyze similarities between products or users. In this work we introduce innovations to both approaches. The factor and neighborhood models can now be smoothly merged, thereby building a more accurate combined model. Further accuracy improvements are achieved by extending the models to exploit both explicit and implicit feedback from the users. The methods are tested on the Netflix data, with results better than those previously published on that dataset. In addition, we suggest a new evaluation metric, based on performance at a top-K recommendation task, which highlights the differences among methods.
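A representative predictor from this line of work augments the user factor with implicit-feedback terms. In the usual notation, with global mean \mu, user and item biases b_u and b_i, item factors q_i, user factors p_u, and N(u) the set of items for which user u provided implicit feedback:

    \hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top}\Big(p_u + |N(u)|^{-1/2}\sum_{j \in N(u)} y_j\Big)

where the y_j are implicit-feedback item factors. The |N(u)|^{-1/2} normalization keeps the implicit term comparable across users with very different activity levels.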
Conference Paper
Regularized matrix factorization models are known to generate high-quality rating predictions for recommender systems. One of the major drawbacks of matrix factorization is that, once computed, the model is static. For real-world applications, dynamically updating a model is one of the most important tasks; especially when ratings from new users or on new items come in, updating the feature matrices is crucial. In this paper, we generalize regularized matrix factorization (RMF) to regularized kernel matrix factorization (RKMF). Kernels provide a flexible method for deriving new matrix factorization methods; furthermore, with kernels, nonlinear interactions between feature vectors are possible. We propose a generic method for learning RKMF models. From this method we derive an online-update algorithm for RKMF models that solves the new-user/new-item problem. Our evaluation indicates that the proposed online-update methods accurately approximate a full retrain of an RKMF model, while the runtime of online updating is in the range of milliseconds even for huge datasets like Netflix.
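A minimal sketch of the new-user case, restricted to the linear-kernel special case (the paper generalizes to other kernels): keep the item factor matrix Q fixed and fit only the new user's factor with a few regularized SGD steps. Names and hyperparameters are illustrative assumptions, not the paper's settings:

    import numpy as np

    def online_new_user(Q, new_ratings, lr=0.05, reg=0.1, steps=50):
        """Fold in a new user without a full retrain: keep the item factor
        matrix Q (num_items x f) fixed and fit only the new user's factor p
        by a few regularized SGD steps."""
        p = np.zeros(Q.shape[1])
        for _ in range(steps):
            for i, r in new_ratings.items():        # item index -> observed rating
                err = r - p @ Q[i]                   # prediction error on item i
                p += lr * (err * Q[i] - reg * p)     # gradient step with L2 penalty
        return p

Because only one small factor vector is optimized, the cost is independent of the number of existing users, which is consistent with the millisecond-range update times reported in the abstract.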
Conference Paper
The importance of contextual information has been recognized by researchers and practitioners in many disciplines, including e-commerce personalization, information retrieval, ubiquitous and mobile computing, data mining, marketing, and management. While a substantial amount of research has already been performed in the area of recommender systems, many existing approaches focus on recommending the most relevant items to users without taking into account any additional contextual information, such as time, location, or the company of other people (e.g., for watching movies or dining out). There is growing understanding that relevant contextual information does matter in recommender systems and that it is important to take this information into account when providing recommendations. We discuss the general notion of context and how it can be modeled in recommender systems. We also discuss three popular algorithmic paradigms for incorporating contextual information into the recommendation process, namely contextual pre-filtering, post-filtering, and modeling, survey recent work on context-aware recommender systems, and discuss important directions for future research.
Article
Several case studies have demonstrated the effectiveness of formulating computing models for field-programmable gate array (FPGA) based accelerators. FPGAs are widely considered as accelerators for compute-intensive applications, and it is essential to find and map the appropriate computing model when developing an FPGA-based application. FPGA computing enables highly flexible models with fine-grained parallelism and associative operations such as broadcast and collective response. A computing model is an abstraction of a target machine used to facilitate application development; the abstraction lets the developer separate an application's design, including its algorithms, from its coding and compilation. The random-access machine (RAM) is one of the most common computing models for single-threaded computers.
Conference Paper
Recommender systems based on collaborative filtering predict user preferences for products or services by learning past user-item relationships. A predominant approach to collaborative filtering is neighborhood-based ("k-nearest neighbors"), where a user-item preference rating is interpolated from ratings of similar items and/or users. We enhance the neighborhood-based approach, leading to a substantial improvement in prediction accuracy without a meaningful increase in running time. First, we remove certain so-called "global effects" from the data to make the ratings more comparable, thereby improving interpolation accuracy. Second, we show how to derive interpolation weights for all nearest neighbors simultaneously, unlike previous approaches where each weight is computed separately. By globally solving a suitable optimization problem, this simultaneous interpolation accounts for the many interactions between neighbors, leading to improved accuracy. Our method is very fast in practice, generating a prediction in about 0.2 milliseconds. Importantly, it does not require training many parameters or lengthy preprocessing, making it very practical for large-scale applications. Finally, we show how to apply these methods to the perceivably much slower user-oriented approach; to this end, we suggest a novel scheme for low-dimensional embedding of the users. We evaluate these methods on the Netflix dataset, where they deliver significantly better results than the commercial Netflix Cinematch recommender system.
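The key idea, deriving all interpolation weights simultaneously, amounts to solving one small linear system rather than computing each weight in isolation. A least-squares sketch under simplifying assumptions (a dense NumPy matrix with NaN for missing ratings; our own illustration, not the paper's exact estimator, which also handles sparsity and shrinkage):

    import numpy as np

    def joint_weights(R, neighbors, i):
        """Solve for all k interpolation weights at once.  R: users x items
        rating matrix with NaN for missing values; `neighbors`: indices of
        items similar to the target item i."""
        mask = ~np.isnan(R[:, i])
        for j in neighbors:
            mask &= ~np.isnan(R[:, j])       # users who rated i and all neighbors
        N = R[np.ix_(mask, neighbors)]       # their ratings on the neighbor items
        t = R[mask, i]                       # their ratings on the target item
        A = N.T @ N + 1e-6 * np.eye(len(neighbors))   # regularize for stability
        return np.linalg.solve(A, N.T @ t)

    # A prediction for user u on item i is then R[u, neighbors] @ w.

Solving the k x k system couples the weights, so redundant neighbors share credit instead of each being weighted as if it were the only one.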
Conference Paper
The first MEMOCODE hardware/software co-design contest posed the following problem: optimize matrix-matrix multiplication in such a way that it is split between the FPGA and the PowerPC on a Xilinx Virtex-II Pro 30. In this paper we discuss our solution, which we implemented on a Xilinx XUP development board with 256 MB of DRAM. The design was done by the five authors over a span of approximately 3 weeks, though of the 15 possible man-weeks, about 9 were actually spent working on this problem. All hardware design was done using Bluespec SystemVerilog (BSV), with the exception of an imported Verilog multiplication unit, necessary only due to the limitations of the Xilinx FPGA toolflow optimizations.
Conference Paper
We develop new algorithms and architectures for matrix multiplication on configurable hardware. These designs significantly reduce both latency and area. Our designs improve on previous designs in terms of the area/speed metric, where speed denotes the maximum achievable running frequency. The area/speed metrics for the previous designs and ours are 14.45, 4.93, and 2.35, respectively, for 4 × 4 matrix multiplication. The latency of one of the previous designs is 0.57 μs, while our design takes 0.15 μs using 18% less area. The area of our designs is smaller by 11%-46% compared with the best-known systolic designs with the same latency for matrices of sizes 3 × 3 to 12 × 12. The performance improvements tend to grow with the problem size.
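Although these latency and area figures are hardware-specific, the dataflow common to such designs, and to the prediction accelerator of the surveyed paper, can be modeled in software: multiply all element pairs at once, then reduce the partial products through a binary adder tree in log2(n) levels. A minimal sketch (our own model, not taken from either paper):

    import numpy as np

    def tree_dot(a, b):
        """Software model of a hardware dot product: the elementwise
        multiplies correspond to n parallel multipliers, and each loop
        iteration corresponds to one level of a binary adder tree."""
        p = a * b                       # level 0: n parallel multiplications
        while len(p) > 1:
            if len(p) % 2:              # pad odd lengths with the additive identity
                p = np.append(p, 0.0)
            p = p[0::2] + p[1::2]       # one adder-tree level per iteration
        return float(p[0])

In hardware, the multiplies and each tree level take one pipeline stage each, so an n-element dot product completes in O(log n) stages rather than the O(n) sequential additions of a single accumulator.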