September 2023
June 2023
ACM Transactions on Recommender Systems
Most recommender systems rely on user interaction data for personalization. Usually, the recommendation quality improves with more data. In this work, we study the quality implications of limiting user interaction data for personalization purposes. We formalize this problem and provide algorithms for selecting a smaller subset of user interaction data. We propose a selection method that picks the subset of a user’s history items that maximizes the expected recommendation quality. We show on well-studied benchmarks that it is possible to achieve high-quality results with small subsets of fewer than ten items per user.
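A minimal sketch of the selection idea in plain NumPy: greedily keep the k history items whose induced profile best approximates the full-history profile. The mean-embedding objective here is a hypothetical stand-in for the paper's expected-recommendation-quality criterion.

```python
import numpy as np

def select_history_subset(history_vecs, k):
    """Greedily pick k history items whose mean embedding best
    approximates the full-history user profile (a stand-in
    objective for expected recommendation quality)."""
    full_profile = history_vecs.mean(axis=0)
    chosen, remaining = [], list(range(len(history_vecs)))
    for _ in range(min(k, len(history_vecs))):
        errs = [np.linalg.norm(history_vecs[chosen + [i]].mean(axis=0)
                               - full_profile) for i in remaining]
        best = remaining[int(np.argmin(errs))]
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

The greedy loop mirrors the paper's setting of selecting under ten items per user; the actual method optimizes recommendation quality directly rather than profile reconstruction.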
June 2023
The need to compactly and robustly represent item-attribute relations arises in many important tasks, such as faceted browsing and recommendation systems. A popular machine learning approach for this task denotes that an item has an attribute by a high dot-product between vectors for the item and attribute -- a representation that is not only dense, but also tends to correct noisy and incomplete data. While this method works well for queries retrieving items by a single attribute (such as "movies that are comedies"), we find that vector embeddings do not so accurately support compositional queries (such as "movies that are comedies and British but not romances"). To address these set-theoretic compositions, this paper proposes to replace vectors with box embeddings, a region-based representation that can be thought of as learnable Venn diagrams. We introduce a new benchmark dataset for compositional queries, and present experiments and analysis providing insights into the behavior of both representations. We find that, while vector and box embeddings are equally suited to single-attribute queries, for compositional queries box embeddings provide substantial advantages over vectors, particularly at the moderate and larger retrieval set sizes that are most useful for users' search and browsing.
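A toy illustration of the region-based idea: each attribute is an axis-aligned box, conjunction is box intersection, and an item is retrieved when its embedding falls inside the query region. The coordinates below are made up for illustration; in the paper the boxes are learned, and negation uses the complement under the model's probabilistic semantics.

```python
import numpy as np

def intersect(lo1, hi1, lo2, hi2):
    """Intersection of two axis-aligned boxes (may be empty)."""
    return np.maximum(lo1, lo2), np.minimum(hi1, hi2)

def volume(lo, hi):
    """Volume of a box; zero if the box is empty in any dimension."""
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

# Hand-picked 2-D boxes standing in for learned attribute regions.
comedy  = (np.array([0.0, 0.0]), np.array([2.0, 1.0]))
british = (np.array([1.0, 0.0]), np.array([3.0, 1.0]))

# "comedy AND british" is just the intersection of the two boxes.
lo, hi = intersect(*comedy, *british)
item = np.array([1.5, 0.5])      # an item's embedding point
retrieved = bool(np.all((item >= lo) & (item <= hi)))
```

The "learnable Venn diagram" intuition is visible here: set operations on attributes become geometric operations on regions, which dot-product vectors cannot express directly.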
February 2023
We study the problem of multi-task learning under user-level differential privacy, in which n users contribute data to m tasks, each involving a subset of users. One important aspect of the problem, which can significantly impact quality, is the distribution skew among tasks: certain tasks may have far fewer data samples than others, making them more susceptible to the noise added for privacy. It is natural to ask whether algorithms can adapt to this skew to improve the overall utility. We give a systematic analysis of the problem by studying how to optimally allocate a user's privacy budget among tasks. We propose a generic algorithm based on an adaptive reweighting of the empirical loss, and show that when there is task distribution skew, this gives a quantifiable improvement of excess empirical risk. Experimental studies on recommendation problems that exhibit a long tail of small tasks demonstrate that our methods significantly improve utility, achieving the state of the art on two standard benchmarks.
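The budget-allocation intuition can be sketched as follows: each user splits a total privacy budget across their tasks, giving relatively more budget to data-poor tasks so their noisy statistics remain usable. The sqrt-of-size weighting and the Laplace-noised mean below are illustrative heuristics, not the paper's adaptive reweighting algorithm.

```python
import numpy as np

def noisy_task_means(task_data, eps_total=1.0, clip=1.0, seed=0):
    """Release a noisy mean per task under a shared total budget.
    Weights ~ sqrt(task size): a heuristic skew-aware split."""
    rng = np.random.default_rng(seed)
    sizes = np.array([len(d) for d in task_data], dtype=float)
    w = np.sqrt(sizes) / np.sqrt(sizes).sum()
    means = []
    for d, eps in zip(task_data, eps_total * w):
        clipped = np.clip(d, -clip, clip)
        # Laplace mechanism: changing one clipped value moves the
        # mean by at most 2*clip/len(d); noise scale = sensitivity/eps.
        noise = rng.laplace(scale=(2.0 * clip / len(d)) / eps)
        means.append(float(clipped.mean() + noise))
    return means
```

With a strongly skewed split (one large task, one tiny task), the large task's estimate stays accurate while the small task still receives a workable share of the budget.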
September 2022
December 2021
We present ALX, an open-source library for distributed matrix factorization using Alternating Least Squares, written in JAX. Our design allows for efficient use of the TPU architecture and scales well to matrix factorization problems of O(B) rows/columns by scaling the number of available TPU cores. In order to spur future research on large-scale matrix factorization methods and to illustrate the scalability properties of our own implementation, we also built a real-world web link prediction dataset called WebGraph, which can be naturally modeled as a matrix factorization problem. We created several variants of this dataset based on locality and sparsity properties of sub-graphs. The largest variant of WebGraph has around 365M nodes, and training a single epoch finishes in about 20 minutes with 256 TPU cores. We include speed and performance numbers of ALX on all variants of WebGraph. Both the framework code and the dataset are open-sourced.
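The core computation ALX distributes is the alternating least-squares solve. A single-machine NumPy sketch of one sweep (the library itself shards the factors and this solve across TPU cores in JAX):

```python
import numpy as np

def als_half_step(R, V, reg=0.1):
    """Solve for one side's factors with the other side fixed:
    U = R V (V^T V + reg*I)^{-1}; one Gram matrix shared by all rows."""
    d = V.shape[1]
    G = V.T @ V + reg * np.eye(d)
    return np.linalg.solve(G, V.T @ R.T).T

rng = np.random.default_rng(0)
R = rng.random((60, 40))                      # dense toy "link" matrix
U, V = rng.random((60, 8)), rng.random((40, 8))
for _ in range(20):                           # alternate user/item solves
    U = als_half_step(R, V)
    V = als_half_step(R.T, U)
err = np.linalg.norm(R - U @ V.T)             # reconstruction error
```

Each half-step is embarrassingly parallel across rows, which is what makes the method amenable to scaling by simply adding TPU cores.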
October 2021
iALS is a popular algorithm for learning matrix factorization models from implicit feedback with alternating least squares. This algorithm was invented over a decade ago but still shows competitive quality compared to recent approaches like VAE, EASE, SLIM, or NCF. Due to a computational trick that avoids negative sampling, iALS is very efficient, especially for large item catalogues. However, iALS does not scale well with large embedding dimensions, d, due to its cubic runtime dependency on d. Coordinate descent variations, iCD, have been proposed to lower the complexity to quadratic in d. In this work, we show that iCD approaches are not well suited for modern processors and can be an order of magnitude slower than a careful iALS implementation for small to mid-scale embedding sizes (d ~ 100), and only perform better than iALS on large embeddings (d ~ 1000). We propose a new solver, iALS++, that combines the advantages of iALS in terms of vector processing with a low computational complexity as in iCD. iALS++ is an order of magnitude faster than iCD both for small and large embedding dimensions. It can solve benchmark problems like Movielens 20M or the Million Song Dataset even for 1000-dimensional embedding vectors in a few minutes.
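The computational trick mentioned above is visible in the user update: a Gram matrix over all items is computed once per sweep, so each user's solve only touches that user's observed items. A NumPy sketch, simplified from the actual iALS/iALS++ solvers:

```python
import numpy as np

def ials_user_update(obs_items, V, G, alpha=10.0, reg=0.1):
    """One iALS user solve. G = V^T V summarizes the implicit
    negatives over *all* items; only observed items add the extra
    confidence-weighted terms, avoiding negative sampling."""
    d = V.shape[1]
    Vs = V[obs_items]                          # observed item factors
    A = G + alpha * (Vs.T @ Vs) + reg * np.eye(d)
    b = (1.0 + alpha) * Vs.sum(axis=0)
    return np.linalg.solve(A, b)

V = np.random.default_rng(0).random((1000, 16))
G = V.T @ V              # computed once per sweep, reused for every user
u = ials_user_update([3, 17, 256], V, G)
```

The per-user system is still an O(d^3) solve, which is exactly the cubic dependency that iALS++ targets with block coordinate updates.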
October 2021
Matrix factorization learned by implicit alternating least squares (iALS) is a popular baseline in recommender system research publications. iALS is known to be one of the most computationally efficient and scalable collaborative filtering methods. However, recent studies suggest that its prediction quality is not competitive with the current state of the art, in particular autoencoders and other item-based collaborative filtering methods. In this work, we revisit the iALS algorithm and present a bag of tricks that we found useful when applying iALS. We revisit four well-studied benchmarks where iALS was reported to perform poorly and show that with proper tuning, iALS is highly competitive and outperforms any method on at least half of the comparisons. We hope that these high-quality results, together with iALS's known scalability, spark new interest in applying and further improving this decade-old technique.
September 2021
Journal of Privacy and Confidentiality
Differential privacy provides a rigorous framework for privacy-preserving data analysis. This paper proposes the first differentially private procedure for controlling the false discovery rate (FDR) in multiple hypothesis testing. Inspired by the Benjamini-Hochberg procedure (BHq), our approach is to first repeatedly add noise to the logarithms of the p-values to ensure differential privacy and to select an approximately smallest p-value serving as a promising candidate at each iteration; the selected p-values are further supplied to the BHq and our private procedure releases only the rejected ones. Moreover, we develop a new technique that is based on a backward submartingale for proving FDR control of a broad class of multiple testing procedures, including our private procedure, and both the BHq step-up and step-down procedures. As a novel aspect, the proof works for arbitrary dependence between the true null and false null test statistics, while FDR control is maintained up to a small multiplicative factor.
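For intuition, a toy version of the noise-on-log-p idea: BH step-up applied to Laplace-noised log p-values. This is an illustrative simplification; the paper's procedure iteratively selects candidates and releases only the rejections.

```python
import numpy as np

def noisy_bh(pvals, q=0.05, eps=1.0, seed=0):
    """BH step-up on Laplace-noised log p-values.
    Illustrative sketch, not the exact private procedure."""
    rng = np.random.default_rng(seed)
    logp = np.log(np.asarray(pvals)) + rng.laplace(scale=1.0 / eps,
                                                   size=len(pvals))
    order = np.argsort(logp)                   # ascending noisy log p
    m = len(pvals)
    k = 0
    for rank, idx in enumerate(order, start=1):
        if logp[idx] <= np.log(q * rank / m):  # step-up condition
            k = rank                           # keep the largest passing rank
    return sorted(order[:k].tolist())          # indices of rejections
```

As eps grows, the noise vanishes and the output approaches the ordinary BHq rejection set; smaller eps trades rejections for privacy.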
July 2021
We study the problem of differentially private (DP) matrix completion under user-level privacy. We design a joint differentially private variant of the popular Alternating-Least-Squares (ALS) method that achieves: i) (nearly) optimal sample complexity for matrix completion (in terms of number of items and users), and ii) the best known privacy/utility trade-off both theoretically and on benchmark data sets. In particular, we provide the first global convergence analysis of ALS with noise introduced to ensure DP, and show that, in comparison to the best known alternative (the Private Frank-Wolfe algorithm by Jain et al. (2018)), our error bounds scale significantly better with respect to the number of items and users, which is critical in practical problems. Extensive validation on standard benchmarks demonstrates that the algorithm, in combination with carefully designed sampling procedures, is significantly more accurate than existing techniques, thus promising to be the first practical DP embedding model.
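One standard route to such guarantees, sketched below: clip each user's factor contribution and add Gaussian noise to the sufficient statistics of the item-side least-squares solve. A simplified illustration; the paper's joint-DP algorithm and its noise calibration differ in the details.

```python
import numpy as np

def noisy_item_update(user_vecs, ratings, sigma=0.1, clip=1.0,
                      reg=0.1, seed=0):
    """Item factor solved from noised sufficient statistics:
    A = sum_u u u^T + noise,  b = sum_u r_u * u + noise."""
    rng = np.random.default_rng(seed)
    d = user_vecs.shape[1]
    # Bound each user's contribution by clipping their factor norm.
    norms = np.linalg.norm(user_vecs, axis=1, keepdims=True)
    U = user_vecs / np.maximum(1.0, norms / clip)
    N = rng.standard_normal((d, d)) * sigma
    A = U.T @ U + (N + N.T) / 2 + reg * np.eye(d)   # symmetrized noise
    b = U.T @ ratings + rng.standard_normal(d) * sigma
    return np.linalg.solve(A, b)
```

Noising the statistics rather than the output lets the ALS structure survive, which is what makes a convergence analysis of the noisy iterates possible.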
... A detailed analysis of these computational costs is provided in Section 3.4. Unlike other domains, user profile modeling in educational contexts primarily depends on interaction histories to construct user profiles, allowing for a degree of data compression, as not all information requires lossless transmission (Rendle and Zhang, 2023; Purificato et al., 2024). ...
June 2023
ACM Transactions on Recommender Systems
... Moreover, beyond the classical (ε, δ)-differential privacy framework, several variants have been developed, for example, concentrated differential privacy (Dwork and Rothblum 2016), Rényi differential privacy (Mironov 2017), Gaussian differential privacy (Dong et al. 2022), to list a few. In particular, extensive studies have been conducted on developing differentially private algorithms within the realms of computer science, machine learning, and statistics, see Wasserman and Zhou (2010), Lei (2011), Abadi et al. (2016), Avella-Medina (2021), Cai et al. (2021), Dwork et al. (2021), Wang and Xu (2021), Avella-Medina et al. (2023), Butucea et al. (2023), Guo et al. (2023), Xia and Cai (2023), Chang et al. (2024), among many others. For further discussions on differential privacy, we refer to Dwork and Roth (2014), Slavković and Seeman (2023) and Liu et al. (2024). ...
September 2021
Journal of Privacy and Confidentiality
... In this paper, we focus on whole-data models with weighted square loss. Weighted Matrix Factorization (WMF), also called iALS [14,20], pioneered this class of models and is still known to achieve competitive results while having highly scalable learning and prediction routines [22]. After its introduction, many extensions were proposed, among which three variants for context-aware recommender systems (CARS) [5,10,11], where each variant uses a different tensor decomposition method. ...
September 2022
... Gazetteers may have omissions, and have been shown to have inconsistent coverage spatially and across feature types (Acheson et al. 2017). An alternative approach is to train models that use linguistic features to predict the most likely location on a local or global grid (Hulden et al. 2015, Gritta et al. 2018b, Fize et al. 2021, Kulkarni et al. 2021). ...
January 2021
... In recent years, more effective alternatives to classical representations have emerged in the context of deep learning, known as embeddings. Initially applied in natural language processing, where words are projected onto a concept space, embeddings have gained popularity in deep-learning-based recommendation systems such as NCF [22] and GMF [23]. GMF and NCF are generic neural embedding versions of matrix factorization. ...
September 2020
... Following them, RCDFM [46] uses Stacked Denoising Autoencoders (SDAEs) to fuse semantic representations from user reviews and item contents with rating matrices, creating richer latent factors. Furthermore, [222] introduces zero-shot heterogeneous transfer learning to align semantic spaces between a recommender system and a retrieval system, leveraging item co-consumption correlations to generate domain-invariant embeddings. ...
October 2020
... However, this method relies on a geographic name dictionary as a form of external data and does not leverage the hierarchical relationships of place names on a spatial scale. To address this problem, Kulkarni et al. [15] employed S2 geometry to partition the Earth's surface. They proposed a multi-level neural geographic encoder called MLG based on a CNN to overcome the limitations of previous models. ...
August 2020
... Music genre classification, a fundamental task in music information retrieval, remains of paramount importance, [1], [2]. It serves as a means for humans to categorize and describe various music collections, [3], offering applications in music recommendation, [4], and music curation, [5], among others. Traditional approaches to music genre classification have often focused solely on either the textual (lyrics) or audio modalities, [2], with some studies like that of [6], concentrating on audio modality, while others, such as [1], emphasizing lyrics/textual modality. ...
May 2020
... where M is a mechanism and D and D ′ are two neighboring data sets. The statement that RDP provides guarantees for the composition of many steps of a private process is presented in [46]: a composition of a number of mechanisms m i with each (α, ϵ i )-RDP satisfies (α, ∑ i ϵ i )-RDP and it is a tighter bound in comparison with the Gaussian mechanism for composition. ...
August 2019
... Previous works on defending against backdoor attacks in FL rely on post-processing techniques. After collecting updates from clients, the server either tries to limit the influence of backdoor updates on the global model [47], [5], [35], [37], [52], or identify and further filter out backdoors through comparing model parameters [42], [3], [39], [14], [50], [36], [44], [54], [30], [4], [13], [27], [41]. However, it is widely believed that influence-reduction-based methods can merely slow the rate of backdoor success, rather than eliminating the injected backdoors entirely. ...
October 2017