Article

Deep reinforcement learning in recommender systems: A survey and new perspectives

Author affiliations:
  • CSIRO Data61 and UNSW

... This capability allows for more nuanced and precise predictions. For example, models that incorporate user interests at various levels (item-level, category-level, etc.) and utilize feedforward neural networks have shown improved performance over traditional methods [7]. To create a more adaptive and responsive user experience, reinforcement learning approaches are being explored as a means to dynamically fine-tune recommendations based on real-time user feedback [8]. ...
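As a minimal, hypothetical illustration of such feedback-driven fine-tuning (not the cited method), the sketch below shows an ε-greedy recommender that nudges per-item scores after every observed interaction; the class name, update rule, and hyperparameters are assumptions for illustration only.

```python
import random

class OnlineRecommender:
    """Toy bandit-style recommender that fine-tunes item scores from live feedback."""
    def __init__(self, item_ids, epsilon=0.1, lr=0.05):
        self.scores = {i: 0.0 for i in item_ids}
        self.epsilon, self.lr = epsilon, lr

    def recommend(self):
        if random.random() < self.epsilon:              # explore occasionally
            return random.choice(list(self.scores))
        return max(self.scores, key=self.scores.get)    # otherwise exploit best score

    def feedback(self, item, reward):
        # incremental update toward the observed reward (e.g. click = 1, skip = 0)
        self.scores[item] += self.lr * (reward - self.scores[item])
```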
... Example: Netflix's recommender system. Netflix's recommendation engine employs a hybrid strategy that combines deep learning models, collaborative filtering, and content-based filtering to deliver personalized content to users [7][8]. ...
... Explainable AI: Creating models that offer clear justifications for their recommendations in addition to precise recommendations. This openness can foster user trust and enable more effective troubleshooting of recommendation errors [7]. Example: An explainable AI model in an e-commerce recommender system might show users why certain products are recommended based on their past purchases and browsing history. ...
Article
Full-text available
In the contemporary digital landscape, recommender systems have become essential for navigating the overwhelming volume of information and connecting users with relevant content. Powered by machine learning, these systems analyze extensive user data to predict preferences and personalize experiences across diverse platforms, spanning e-commerce, entertainment, and education. This study provides a comprehensive review of the principal machine learning algorithms employed in recommender systems, including hybrid approaches, content-based filtering, and collaborative filtering, while also examining their respective advantages and disadvantages. Recent advancements, particularly in deep learning and reinforcement learning, have significantly enhanced the capabilities of these systems. Techniques such as Neural Collaborative Filtering and autoencoders enable the capture of complex user-item interactions and improve scalability, while reinforcement learning allows for dynamic adaptation to real-time user feedback, optimizing long-term engagement. The paper delves into the practical applications of recommender systems across multiple domains, highlighting their role in facilitating product discovery and driving sales in e-commerce, enhancing user retention through personalized content suggestions in entertainment, and tailoring learning resources to individual needs in education. Despite their efficacy, these systems face challenges, including scalability problems, cold-start issues, and privacy concerns. The study explores solutions, such as distributed computing, hybrid approaches, and robust data protection measures. Furthermore, the paper discusses future directions, emphasizing the importance of explainable AI, real-time personalization, cross-domain recommendations, enhanced user control, and ethical considerations. This review examines recent advancements in ML-based recommender systems to provide insights into their current state, identify key performance factors, and suggest potential areas for further innovation and improvement.
... (1) Query-based attacks: In these attacks, adversaries meticulously design queries to the target recommendation model. Recent research has exposed certain attack vulnerabilities [42,184]; for example, reinforcement learning-based queries employ a trial-and-error approach to optimize the query strategy and mount effective attacks [3,34]. (2) Decision boundary attacks: These attacks influence the decision power of the recommendation process. ...
... Recent research has explored ensemble-based surrogate models, which can provide a more accurate approximation of the target recommendation model [129]. Several evolutionary optimization techniques have been employed for surrogate model construction [34]. (4) Model inversion attacks: The goal of these attacks is to reconstruct training data from the recommendation results. ...
... This could result in incorrect recommendations being made, potentially leading to a loss of trust in the system. Several researchers comprehensively reviewed adversarial attacks and defense mechanisms for recommender systems [34,42,106]. ...
Article
Full-text available
Recommender systems (RS) play an integral role in many online platforms. Exponential growth and potential commercial interests are raising significant concerns around privacy, security, fairness, and overall responsibility. The existing literature around responsible recommendation services is diverse and multidisciplinary. Most literature reviews cover a specific aspect or a single technology for responsible behavior, such as federated learning or blockchain. This study integrates relevant concepts across disciplines to provide a broader representation of the landscape. We review the latest advancements toward building privacy-preserved and responsible recommendation services for the e-commerce industry. The survey summarizes recent, high-impact works on diverse aspects and technologies that ensure responsible behavior in RS through an interconnected taxonomy. We contextualize potential privacy threats, practical significance, industrial expectations, and research remedies. From the technical viewpoint, we analyze conventional privacy defenses and provide an overview of emerging technologies including differential privacy, federated learning, and blockchain. The methods and concepts across technologies are linked based on their objectives, challenges, and future directions. In addition, we also develop an open-source repository that summarizes a wide range of evaluation benchmarks, codebases, and toolkits to aid further research. The survey offers a holistic perspective on this rapidly evolving landscape by synthesizing insights from both recommender systems and responsible AI literature.
... RL-based approaches can continuously learn and adapt from each user interaction, adjusting quickly to changing user preferences and behaviors [13] and optimizing users' long-term engagement [44]. Despite the effectiveness of RL, it is unrealistic for an incompletely trained recommender system to perform expensive online interactions with users to collect training data [4,5,7,10]. ...
... Since supervised RSs have difficulty capturing the dynamics of user preferences [5,[32][33][34][35]51], more deep RL [38] methods are being employed. Some directions [46,52] focus on formulating the RS as a Markov Decision Process (MDP), investigating the state representation [14,24], user profiling [22,56], and action representation [23]. ...
... where i is the index into the world model ensemble E, and σ_i^2 is the variance of the corresponding GPM. Thus, the estimated reward in Equation (5) has been changed to: ...
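A minimal sketch, under assumptions, of the ensemble-variance-style uncertainty penalty this excerpt alludes to (not ROLeR's exact formulation): the reward estimate is the ensemble mean minus a weighted ensemble variance; the function name and the penalty weight beta are illustrative.

```python
import numpy as np

def penalized_reward(ensemble_rewards, beta=1.0):
    """Combine reward predictions from a world-model ensemble.

    ensemble_rewards: array of shape (K,), the reward predicted by each of the
    K ensemble members for the same (state, action) pair.
    beta: weight of the uncertainty penalty (hypothetical hyperparameter).
    """
    mean_r = np.mean(ensemble_rewards)
    var_r = np.var(ensemble_rewards)   # ensemble variance as an uncertainty proxy
    return mean_r - beta * var_r       # pessimistic (penalized) reward estimate
```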
Preprint
Offline reinforcement learning (RL) is an effective tool for real-world recommender systems with its capacity to model the dynamic interest of users and its interactive nature. Most existing offline RL recommender systems focus on model-based RL through learning a world model from offline data and building the recommendation policy by interacting with this model. Although these methods have made progress in the recommendation performance, the effectiveness of model-based offline RL methods is often constrained by the accuracy of the estimation of the reward model and the model uncertainties, primarily due to the extreme discrepancy between offline logged data and real-world data in user interactions with online platforms. To fill this gap, a more accurate reward model and uncertainty estimation are needed for the model-based RL methods. In this paper, a novel model-based Reward Shaping in Offline Reinforcement Learning for Recommender Systems, ROLeR, is proposed for reward and uncertainty estimation in recommendation systems. Specifically, a non-parametric reward shaping method is designed to refine the reward model. In addition, a flexible and more representative uncertainty penalty is designed to fit the needs of recommendation systems. Extensive experiments conducted on four benchmark datasets showcase that ROLeR achieves state-of-the-art performance compared with existing baselines. The source code can be downloaded at https://github.com/ArronDZhang/ROLeR.
... , robotics [2], and recommendation systems [3], even outperforming humans in some fields. Early researchers [1][2][3][4][5] have proposed diverse online reinforcement learning methods tailored to these scenarios, as depicted in Fig. 1. However, online learning entails real-time interaction with the environment to collect data, resulting in a reliance on a large number of samples. ...
Article
Full-text available
Offline Reinforcement Learning (Offline RL) is able to learn from pre-collected offline data without real-time interaction with the environment by policy regularization via distributional constraints or support set constraints. However, since the policy learned from offline data under the constraints of the support set is usually similar to the behavioral policy due to the overly conservative constraints, offline RL confronts challenges in active behavioral exploration. Moreover, without online interaction, policy evaluation becomes prone to inaccuracy, and the learned policy may lack robustness in the presence of sub-optimal state-action pairs or noise in a dataset. In this paper, we propose an Offline-to-Online Reinforcement Learning Approach based on Multi-action Evaluation with Policy Extension (MAERL) for improving the policy's exploration ability and the effective value evaluation of state-action pairs in offline RL. In MAERL, we develop four modules: (1) in the policy extension module, we design a policy extension method, which uses the online policy to extend the offline policy; (2) in the multi-action evaluation module, we present an adaptive manner to merge the offline and online policies to generate an action of the agent; (3) in the action-oriented module, we learn the action trajectories of the agent from the dataset, mitigating the issue of actions deviating excessively during environmental exploration; (4) to maintain the consistency in the agent’s actions, we propose an action temporally-aligned representation learning method to maintain the trend of actions of agents. This approach ensures that the agent’s actions align with the learned trajectories, preventing significant deviations during exploration. Extensive experiments are conducted on 15 scenarios of the D4RL/mujoco environment. Results demonstrate that our proposed methods achieve the best performance in 12 scenarios and the second-best performance in 3 scenarios compared to state-of-the-art methods. The project’s code can be found at https://github.com/FrankGod111/Policy-Expansion.git
... In RLRS, the efficiency hinges on three essential components: State Representation, Policy Optimization, and Reward Formulation [1,8]. While much of the current research in RLRS is centered on policy optimization [5,7,33,39] and reward formulation [4,9,16], the role of state representation should not be understated. ...
... DRL-based recommender systems model the interaction recommendation process as Markov Decision Processes (MDPs), utilizing deep learning to estimate the value function and tackle high-dimensional MDPs [8,22]. Recognizing the significance of negative feedback in understanding user preferences, Zhao et al. [40] introduced DEERS, which processes positive and negative signals separately at the input layer to avoid negative feedback overwhelming positive signals due to their sparsity. ...
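To make the separate treatment of positive and negative signals concrete, here is a toy sketch in the spirit of (but not identical to) DEERS; all class names, layer sizes, and the fusion scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DualSignalQNetwork(nn.Module):
    """Toy Q-network that keeps positive and negative feedback in separate input
    branches so that sparse signals are not drowned out by the other kind."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.pos_branch = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.neg_branch = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden + action_dim, 1)

    def forward(self, pos_state, neg_state, action):
        # concatenate the two feedback branches with the candidate action
        h = torch.cat([self.pos_branch(pos_state),
                       self.neg_branch(neg_state), action], dim=-1)
        return self.head(h)  # Q(s_pos, s_neg, a)
```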
Preprint
In Reinforcement Learning-based Recommender Systems (RLRS), the complexity and dynamism of user interactions often result in high-dimensional and noisy state spaces, making it challenging to discern which aspects of the state are truly influential in driving the decision-making process. This issue is exacerbated by the evolving nature of user preferences and behaviors, requiring the recommender system to adaptively focus on the most relevant information for decision-making while preserving generalizability. To tackle this problem, we introduce an innovative causal approach for decomposing the state and extracting Causal-InDispensable State Representations (CIDS) in RLRS. Our method concentrates on identifying the Directly Action-Influenced State Variables (DAIS) and Action-Influence Ancestors (AIA), which are essential for making effective recommendations. By leveraging conditional mutual information, we develop a framework that not only discerns the causal relationships within the generative process but also isolates critical state variables from the typically dense and high-dimensional state representations. We provide theoretical evidence for the identifiability of these variables. Then, by making use of the identified causal relationship, we construct causal-indispensable state representations, enabling the training of policies over a more advantageous subset of the agent's state space. We demonstrate the efficacy of our approach through extensive experiments, showcasing that our method outperforms state-of-the-art methods.
... Challenge with CNNs: CNNs in RS face challenges such as data sparsity, scalability, privacy, and domain-specific issues [156]. Researchers continue to explore solutions to enhance CNN-based RS performance and usability [48,134,169,170,171,64,172,47,173,30,174,175,176,177,178,179,118,180,181,182,183,184,185,186,187,188,189,120,190,54,191,46,192,193,122,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,157,212,213,214,215,216,217,168,155,89,218,219,194,220,221,222,223,224]. (Reference groups from the survey's taxonomy table: Sequential [225,38,226,227,228,163,179,39,117,181,186,187,229,230,160,165,159,231,232,161,233,37,234,95,235,236,216,158,237,238,239,142,240,219,241,242,243,244,245,246,167,247,248,249]; KG [173,30,176,180,183,196,197,198,199,206,157,212,213,215,168,89,218,221]; RL [250,251,106,252,253,254,255,256,257,258,259,260,261,119,230,262,263,264,265,266,267,268,269,270,214,271,43,272,273,274,275,276,277,217,278,279,167,280,281,223]; LLM [282,51,283,284,285,286,287,288,289,50,290,291,292,293,8,294,295]; Multi-modal [296,204,297,298,299,300,249,301,218,243,244,302].) Recurrent Neural Networks (RNNs) are adept at capturing complex user-item interactions within sequential data [157]. The evolution of RNN-based RS began with GRU4Rec, utilizing Gated Recurrent Units (GRUs) for session-based recommendations [158,159]. ...
... where t indexes the time steps, ranging from 0 to T, the maximum time step in a finite Markov Decision Process (MDP), s_t and a_t represent the state and action at time t, respectively, r(s_t, a_t) is the immediate reward received after taking action a_t in state s_t, and γ^t applies the discount factor to future rewards, making them worth less than immediate rewards. Applying these RL concepts to RS, the RS itself acts as the RL agent [254] through an environment constituted by user interactions and data, as detailed in a related survey [43]. ...
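The equation this excerpt annotates is elided from the snippet; from the definitions given, it is presumably the standard finite-horizon discounted return, which in LaTeX reads:

```latex
% Discounted cumulative return over a finite-horizon MDP (standard form,
% reconstructed from the definitions in the excerpt above)
G = \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)
```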
Preprint
Full-text available
Recommender Systems (RS) play an integral role in enhancing user experiences by providing personalized item suggestions. This survey reviews the progress in RS inclusively from 2017 to 2024, effectively connecting theoretical advances with practical applications. We explore the development from traditional RS techniques like content-based and collaborative filtering to advanced methods involving deep learning, graph-based models, reinforcement learning, and large language models. We also discuss specialized systems such as context-aware, review-based, and fairness-aware RS. The primary goal of this survey is to bridge theory with practice. It addresses challenges across various sectors, including e-commerce, healthcare, and finance, emphasizing the need for scalable, real-time, and trustworthy solutions. Through this survey, we promote stronger partnerships between academic research and industry practices. The insights offered by this survey aim to guide industry professionals in optimizing RS deployment and to inspire future research directions, especially in addressing emerging technological and societal trends
... By integrating reinforcement learning principles with deep neural networks [9], DQNs provide a powerful framework for recommender systems, enabling them to optimise decision-making processes in complex and uncertain environments [10]. This new application of DQNs in e-commerce recommender systems represents a significant advance towards creating more intelligent, adaptive and user-centric shopping experiences in the smart city digital domains [12]. ...
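A minimal, hypothetical sketch of how a DQN can act as the scoring component of a recommender: one Q-value per candidate item, with a greedy top-k selection. The class name, dimensions, and top-k rule are assumptions for illustration, not the cited system.

```python
import torch
import torch.nn as nn

class ItemDQN(nn.Module):
    """Illustrative DQN head for recommendation: scores every candidate item
    given a user-state embedding."""
    def __init__(self, state_dim, n_items, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_items))            # one Q-value per candidate item

    def recommend(self, state, k=5):
        q_values = self.net(state)                 # Q(s, a) for all items
        return torch.topk(q_values, k).indices     # greedy top-k recommendation
```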
... This pertains especially to the energy domain, where the complexity of systems and the risks associated with allowing algorithms to interact directly with production systems have restricted the implementation of RL [29]. Nonetheless, the potential of RL in further developing artificial intelligence has continued to garner interest in the last ten years [30]. Thus, there is some research that has focused on the usage of RL in the energy sector, with applications focused on either energy decision management or flow control objectives. ...
Article
Full-text available
Waterflooding optimization is a critical process for enhancing oil recovery in mature oil fields, where conventional approaches often rely on fixed injection rates over an extended period. However, this may not be the most efficient strategy due to reservoir heterogeneity and complexity. In this study, we propose a multi-agent physics informed reinforcement learning (MAPIRL) framework to optimize the waterflooding process. The MAPIRL approach utilizes a Markov decision process to formulate the optimization problem, where multiple RL agents are trained to interact with a reservoir simulation model and receive rewards for each action. The proposed approach uses an actor-critic RL architecture to train the agents to find the optimal strategy. The agents interact with the environment during several episodes until convergence is achieved. We evaluated the effectiveness of the MAPIRL approach based on the improvement in net present value (NPV), which reflects the economic benefits of the optimized waterflooding strategy. Then, we compared the MAPIRL approach with the multi-objective particle swarm optimization (MOPSO) algorithm. The comparison revealed that the MAPIRL approach outperformed the MOPSO algorithm in terms of net present value. In conclusion, the MAPIRL approach is a scientifically accurate method for optimizing waterflooding in mature oil fields, providing a more efficient and robust waterflooding strategy that reduces water consumption and associated costs while maximizing the economic benefits. The ability of the MAPIRL approach to optimize the waterflooding process with a high degree of complexity makes it a promising tool for the energy industry, and further research is needed to explore its potential for addressing other complex problems in this domain.
... Reinforcement Learning (RL) has emerged as a powerful approach for developing recommender systems (RS), where the objective is to learn a policy that maximizes long-term rewards, typically measured by user satisfaction or engagement. Unlike traditional recommendation methods that primarily aim to optimize immediate rewards, RLRS focuses on learning a recommendation strategy that adapts to user preferences over time [5]. This allows RLRS to dynamically update recommendations based on user feedback, aiming to improve long-term outcomes and enhance user experiences. ...
Preprint
In offline reinforcement learning-based recommender systems (RLRS), learning effective state representations is crucial for capturing user preferences that directly impact long-term rewards. However, raw state representations often contain high-dimensional, noisy information and components that are not causally relevant to the reward. Additionally, missing transitions in offline data make it challenging to accurately identify features that are most relevant to user satisfaction. To address these challenges, we propose Policy-Guided Causal Representation (PGCR), a novel two-stage framework for causal feature selection and state representation learning in offline RLRS. In the first stage, we learn a causal feature selection policy that generates modified states by isolating and retaining only the causally relevant components (CRCs) while altering irrelevant components. This policy is guided by a reward function based on the Wasserstein distance, which measures the causal effect of state components on the reward and encourages the preservation of CRCs that directly influence user interests. In the second stage, we train an encoder to learn compact state representations by minimizing the mean squared error (MSE) loss between the latent representations of the original and modified states, ensuring that the representations focus on CRCs. We provide a theoretical analysis proving the identifiability of causal effects from interventions, validating the ability of PGCR to isolate critical state components for decision-making. Extensive experiments demonstrate that PGCR significantly improves recommendation performance, confirming its effectiveness for offline RL-based recommender systems.
... • Actor-Critic: Combines value-based and policy-based methods by using two different RL networks: the Actor uses a policy-based method to propose a set of possible actions given a state, and the Critic estimates a value function, which evaluates the actions taken by the Actor under the given policy. The Actor then uses the feedback from the Critic to adjust its policy and make more informed decisions, leading to improved overall performance [133,134]. ...
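A compact sketch of the Actor-Critic pattern just described, with an advantage-based update in which the Critic's value estimate feeds back into the Actor's loss; the module names, network sizes, and update rule are illustrative assumptions rather than any cited implementation.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal actor-critic: the actor proposes an action distribution,
    the critic scores the current state."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)   # policy logits
        self.critic = nn.Linear(hidden, 1)          # state value V(s)

    def forward(self, state):
        h = self.shared(state)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h)

def losses(model, state, action, reward, next_value, gamma=0.99):
    # the critic's advantage estimate is the feedback the actor learns from
    dist, value = model(state)
    advantage = reward + gamma * next_value - value
    actor_loss = -dist.log_prob(action) * advantage.detach()
    critic_loss = advantage.pow(2)
    return actor_loss, critic_loss
```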
Article
Full-text available
Human-AI collaboration has evolved into a complex, multidimensional paradigm shaped by research in various domains. Key areas such as human-in-the-loop systems, Interactive Machine Learning (IML), Hybrid Intelligence, and Human-Agent Interaction have significantly contributed to this development. However, these fields often lack cohesion, underscoring the need for a cohesive perspective to advance. This work addresses this gap by integrating insights from diverse aspects of collaboration to present a holistic approach to fostering effective and adaptive interactions between humans and artificial agents. It emphasizes empowering end-users with greater control and involvement in decision-making processes, thereby enhancing both the levels of interactivity and adaptability within intelligent systems. Moving beyond a focus on AI training techniques, this paper presents a broader perspective on incorporating human input into AI decision-making and learning processes, highlighting the importance of flexibility in systems and user engagement. The manuscript proposes a framework encompassing five levels of human integration and examines their relationship with core collaboration aspects, including the system purpose, participant expertise, and system proactivity. By synthesizing current knowledge on human-AI collaboration and outlining essential design principles, this work aims to advance the field and foster interdisciplinary collaboration among researchers, practitioners, and designers.
... Reinforcement learning (RL) stands out as a potent tool in these domains, enabling agents to acquire knowledge through trial and error and responses from the environment. Its versatility has resulted in its widespread utilization across a range of sectors, encompassing automated vehicles (Guan et al., 2020), robotics (Kormushev et al., 2013), healthcare (Yu et al., 2021), finance (Deng et al., 2016), and various natural language processing (NLP) applications (Uc-Cetina et al., 2023), recommender systems (Chen et al., 2023) and so on. Typically, an RL task can be expressed using a Markov decision process (MDP) (Sutton and Barto, 2018) framework. ...
Preprint
Full-text available
Bayesian reinforcement learning (BRL) is a method that merges principles from Bayesian statistics and reinforcement learning to make optimal decisions in uncertain environments. Similar to other model-based RL approaches, it involves two key components: (1) Inferring the posterior distribution of the data generating process (DGP) modeling the true environment and (2) policy learning using the learned posterior. We propose to model the dynamics of the unknown environment through deep generative models assuming Markov dependence. In absence of likelihood functions for these models we train them by learning a generalized predictive-sequential (or prequential) scoring rule (SR) posterior. We use sequential Monte Carlo (SMC) samplers to draw samples from this generalized Bayesian posterior distribution. In conjunction, to achieve scalability in the high dimensional parameter space of the neural networks, we use the gradient based Markov chain Monte Carlo (MCMC) kernels within SMC. To justify the use of the prequential scoring rule posterior we prove a Bernstein-von Misses type theorem. For policy learning, we propose expected Thompson sampling (ETS) to learn the optimal policy by maximizing the expected value function with respect to the posterior distribution. This improves upon traditional Thompson sampling (TS) and its extensions which utilize only one sample drawn from the posterior distribution. This improvement is studied both theoretically and using simulation studies assuming discrete action and state-space. Finally we successfully extend our setup for a challenging problem with continuous action space without theoretical guarantees.
... Consider an element of the LSTM weight matrices for the hidden states and inputs of the gates. Using these, a function for the LSTM gates, comprising the cell states and hidden states, is mathematically represented using (8). The gradient comprises various gradient-ascending paths (i.e., through the cell state and its candidate) because of the temporal paths of the various sessions. ...
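For context, these are the LSTM gate equations in their standard textbook form; the cited paper's Eq. (8) and its weight-matrix notation may differ.

```latex
% Standard LSTM gate equations (textbook form)
\begin{aligned}
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \\
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad
h_t = o_t \odot \tanh(c_t).
\end{aligned}
```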
Article
Full-text available
Recommendation systems are pivotal for personalized user experiences, employing algorithms to predict and suggest items aligned with user preferences. Deep learning (DL) models, such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), excel in capturing sequential dependencies, enhancing recommendation accuracy. However, challenges persist in session-based recommendation systems, particularly with gradient descent and class imbalances. Addressing these challenges, this work introduces dynamic LSTM (D-LSTM), a novel DL-based recommendation system tailored for dynamic E-commerce environments. The primary objective is to optimize recommendation accuracy by effectively capturing temporal dependencies within user sessions. The methodology involves the integration of D-LSTM with weight matrix optimization and a Bayesian personalized ranking (BPR) adaptable learning rate optimizer to enhance learning efficiency. Experimental results demonstrate the efficacy of D-LSTM, showing significant improvements over existing models. Specifically, comparisons with the hybrid time-centric prediction (HTCP) model reveal a performance enhancement of 19.4%, 17.2%, 35.41%, and 21.99% for hit-rate (HR) and mean reciprocal rank (MRR) in 10k and 20k recommendation sets using the Tmall dataset. These findings underscore the superior performance of D-LSTM, highlighting its potential to advance personalized recommendations in dynamic E-commerce settings.
... Furthermore, in algorithmic trading and portfolio management, advances in DRL should also be considered (see X. Chen et al., 2023;Zhu et al., 2023). For example, Hierarchical DRL breaks down tasks, like asset selection and risk management, into smaller steps, making strategies more flexible. ...
Preprint
Full-text available
This review systematically examines deep learning applications in financial asset management. Unlike prior reviews, this study focuses on identifying emerging trends, such as the integration of explainable artificial intelligence (XAI) and deep reinforcement learning (DRL), and their transformative potential. It highlights new developments, including hybrid models (e.g., transformer-based architectures) and the growing use of alternative data sources such as ESG indicators and sentiment analysis. These advancements challenge traditional financial paradigms and set the stage for a deeper understanding of the evolving landscape. We use the Scopus database to select the most relevant articles published from 2018 to 2023. The inclusion criteria encompassed articles that explicitly apply deep learning models within financial asset management. We excluded studies focused on physical assets. This review also outlines our methodology for evaluating the relevance and impact of the included studies, including data sources and analytical methods. Our search identified 934 articles, with 612 meeting the inclusion criteria based on their focus and methodology. The synthesis of results from these articles provides insights into the effectiveness of deep learning models in improving portfolio performance and price forecasting accuracy. The review highlights the broad applicability and potential enhancements deep learning offers to financial asset management. Despite some limitations due to the scope of model application and variation in methodological rigour, the overall evidence supports deep learning as a valuable tool in this field. Our systematic review underscores the progressive integration of deep learning in financial asset management, suggesting a trajectory towards more sophisticated and impactful applications.
... policy π is a rule that the agent uses to decide what action to take next, Q is the estimate of the expected cumulative reward, and the value function V is used to predict future rewards and evaluate the quality of different policies. DRL can be categorized into value-based algorithms and policy-based algorithms (Chen et al., 2023), based on whether the optimal policy for the agent is determined by solving the Bellman equation (Gerstenberg et al., 2023) for the state-value (Eq. (1)) or action-value (Eq. ...
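The excerpt's Eq. (1) and the truncated action-value equation are not shown here; for reference, the textbook Bellman expectation equations for V and Q under a policy π are:

```latex
% Bellman expectation equations (textbook form) for the state-value and
% action-value functions under policy \pi
\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{a \sim \pi,\, s' \sim P}\!\left[ r(s,a) + \gamma\, V^{\pi}(s') \right], \\
Q^{\pi}(s,a) &= \mathbb{E}_{s' \sim P}\!\left[ r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi}\, Q^{\pi}(s',a') \right].
\end{aligned}
```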
... The RL agent's objective is to learn optimal policies or strategies, guiding its decision-making process to maximize cumulative rewards over time. Reinforcement learning has diverse applications across domains like healthcare and recommendation systems [40,41]. For example, it has excelled in game-playing tasks, often surpassing human performance in chess, Go, and video games. ...
Article
Full-text available
In the fourth industrial revolution, artificial intelligence and machine learning (ML) have increasingly been applied to manufacturing, particularly additive manufacturing (AM), to enhance processes and production. This study provides a comprehensive review of the state-of-the-art achievements in this domain, highlighting not only the widely discussed supervised learning but also the emerging applications of semi-supervised learning and reinforcement learning. These advanced ML techniques have recently gained significant attention for their potential to further optimize and automate AM processes. The review aims to offer insights into various ML technologies employed in current research projects and to promote the diverse applications of ML in AM. By exploring the latest advancements and trends, this study seeks to foster a deeper understanding of ML’s transformative role in AM, paving the way for future innovations and improvements in manufacturing practices.
... This adaptive capability makes RL-based algorithms particularly suitable for applications in the human-building interaction, where user preferences, environmental conditions, and energy consumption patterns frequently change [8]. In such applications, value-based methods like Deep Q-Network (DQN) and Double Deep Q-Network (DDQN) algorithms are extensively utilized for energy performance optimization [88] and recommender system design [89]. Owing to their online learning capabilities, which include real-time strategy updates and experience replay mechanisms, these algorithms ensure that the models remain current and efficient [90]. ...
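A minimal sketch (not from the cited works) of the Double DQN target with the experience-replay buffer mentioned above; the networks, tensor shapes, and hyperparameters are illustrative assumptions.

```python
import random
from collections import deque
import torch

replay_buffer = deque(maxlen=10_000)  # experience replay memory of transitions

def ddqn_target(online_net, target_net, batch, gamma=0.99):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it (reduces overestimation bias)."""
    states, actions, rewards, next_states, dones = batch
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
    return rewards + gamma * (1 - dones) * next_q
```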
Preprint
Full-text available
The indoor environment significantly impacts human health and well-being; enhancing health and reducing energy consumption in these settings is a central research focus. With the advancement of Information and Communication Technology (ICT), recommendation systems and reinforcement learning (RL) have emerged as promising approaches to induce behavioral changes to improve the indoor environment and energy efficiency of buildings. This study aims to employ text mining and Natural Language Processing (NLP) techniques to thoroughly examine the connections among these approaches in the context of human-building interaction and occupant context-aware support. The study analyzed 27,595 articles from the ScienceDirect database, revealing extensive use of recommendation systems and RL for space optimization, location recommendations, and personalized control suggestions. Furthermore, this review underscores the vast potential for expanding recommender systems and RL applications in buildings and indoor environments. Fields ripe for innovation include predictive maintenance, building-related product recommendation, and optimization of environments tailored for specific needs, such as sleep and productivity enhancements based on user feedback. The study also notes the limitations of the method in capturing subtle academic nuances. Future improvements could involve integrating and fine-tuning pre-trained language models to better interpret complex texts.
... We examine two principal types of algorithms in this field, namely offline and online learning algorithms. We categorize collaborative filtering and matrix factorization based algorithms as offline (we refer the reader to some online applications of these algorithms (Zangerle and Bauer, 2022;Raza and Ding, 2022;Chen et al., 2023)). Then, we survey the collaboration-enabled algorithms under online learning. ...
Preprint
This paper offers a comprehensive analysis of collaborative bandit algorithms and provides a thorough comparison of their performance. Collaborative bandits aim to improve the performance of contextual bandits by introducing relationships between arms (or items), allowing effective propagation of information. Collaboration among arms allows the feedback obtained through a single user (item) to be shared across related users (items). Introducing collaboration also alleviates the cold user (item) problem, i.e., the lack of historical information when a new user (item) arrives at the platform with no prior record of interactions. In the context of modeling the relationships between arms (items), there are two main approaches: hard and soft clustering. We refer to approaches that model the relationship between arms (items) in an absolute manner as hard clustering, i.e., the relationship is binary. Soft clustering relaxes membership constraints, allowing fuzzy assignment. Focusing on the latter, we provide extensive experiments on the state-of-the-art collaborative contextual bandit algorithms and investigate the effect of sparsity and how the exploration intensity acts as a correction mechanism. Our numerical experiments demonstrate that controlling for sparsity in collaboration improves data efficiency and performance as it better informs learning. Meanwhile, increasing the exploration intensity acts as a correction because it effectively reduces variance due to potentially misspecified relationships among users. We observe that this misspecification is further remedied by introducing latent factors, and thus, increasing the dimensionality of the bandit parameters.
... The demand for recommender systems has increased in online services since they can effectively alleviate information overload [5]. As a new branch of recommender systems, sequential recommender systems capture users' dynamic preferences and provide potential recommendations from massive candidates by modeling users' historical interactions as temporally ordered sequences. ...
Preprint
Full-text available
Federated sequential recommendation (FedSeqRec) has gained growing attention due to its ability to protect user privacy. Unfortunately, the performance of FedSeqRec is still unsatisfactory because the models used in FedSeqRec have to be lightweight to accommodate communication bandwidth and clients' on-device computational resource constraints. Recently, large language models (LLMs) have exhibited strong transferable and generalized language understanding abilities and therefore, in the NLP area, many downstream tasks now utilize LLMs as a service to achieve superior performance without constructing complex models. Inspired by this successful practice, we propose a generic FedSeqRec framework, FELLAS, which aims to enhance FedSeqRec by utilizing LLMs as an external service. Specifically, FELLAS employs an LLM server to provide both item-level and sequence-level representation assistance. The item-level representation service is queried by the central server to enrich the original ID-based item embedding with textual information, while the sequence-level representation service is accessed by each client. However, invoking the sequence-level representation service requires clients to send sequences to the external LLM server. To safeguard privacy, we implement dx-privacy satisfied sequence perturbation, which protects clients' sensitive data with guarantees. Additionally, a contrastive learning-based method is designed to transfer knowledge from the noisy sequence representation to clients' sequential recommendation models. Furthermore, to empirically validate the privacy protection capability of FELLAS, we propose two interacted item inference attacks. Extensive experiments conducted on three datasets with two widely used sequential recommendation models demonstrate the effectiveness and privacy-preserving capability of FELLAS.
... Prior art on user response modeling can be broadly classified into: (i) probabilistic models, which were historically first [2,19,25], based on clustering [42,44,52,53], learning skewed distributions [51], resampling real data [37], context-aware models [17,18,30,45], causal models [36,59] or inhomogeneous Poisson processes [39]; (ii) generative adversarial networks (GAN), namely their variations for discrete and tabular data [20,43,57,57], used to generate data for recommender systems in the form of user preference vectors [7,8], user profiles [56], individual interactions [23,49,55], temporal sequences [3,60], or all of the above [4]; (iii) full-scale synthetic data generators, usually intended to serve as an environment for training a reinforcement learning agent, which is a popular framing for modern RS [1,11,12,21,29,35]; here we note RecSim [28], RecSim NG [40], RecoGym [32,46], SOFA [26], Virtual Taobao [48], and SARDINE [16] among other works [58]. ...
Preprint
Full-text available
We develop and evaluate neural architectures to model the user behavior in recommender systems (RS) inspired by click models for Web search but going beyond standard click models. Proposed architectures include recurrent networks, Transformer-based models that alleviate the quadratic complexity of self-attention, adversarial and hierarchical architectures. Our models outperform baselines on the ContentWise and RL4RS datasets and can be used in RS simulators to model user response for RS evaluation and pretraining.
... Reinforcement learning methods are used to learn a sequence of actions that performs specific tasks, defining which actions are rewarded and which are punished. Deep learning methods are used to capture non-linear relationships between users and objects and to deal with different types of data sources, such as images and text [47]. All these methods and techniques are nowadays exploited to formulate and solve complex supply chain problems. ...
Article
Full-text available
The economic growth of developed or emerging countries through globalization has prompted them to increase their supply chain performance. A large number of concepts, tools, and methodologies have been proposed in support of this performance improvement. They are mainly based on the use of classical optimization or enterprise modeling methods. However, environmental and social issues, not to mention digital transformation, are often ignored or not sufficiently integrated. Indeed, the world geopolitical situation, the increase in oil prices, and the commitment to protect our earth require the integration of sustainability aspects and Industry 4.0 concepts like digital twin and artificial intelligence in transforming the supply chain. This paper focuses on defining a conceptual framework to support sustainable supply chain management and digital transformation. It aims to exploit the sustainability and digital maturity of companies to transform their supply chains and enhance their performance to meet the challenges of Industry 5.0. Several practices related to sustainability, as well as two use cases on optimization and digital twin, are presented to illustrate this framework. Finally, based on the previous practices and use cases, an adapted framework for the supply chain manager to support the transition from Industry 4.0 to Industry 5.0 has been developed, as well as a performance dashboard.
... Since the early implementations of recommendation algorithms (e.g., [230,231]), these systems analyze past usage behavior in order to build user models, and to suggest items, or even people in social networks [77,145,165], to individual users or to groups of users (e.g., [190,191]). To build these user models, different techniques have been employed, including traditional approaches such as collaborative filtering [81], content-based filtering [182], and hybrid recommendations [45], and more recent approaches based on latent representations (or embeddings) and deep learning [57,276,289]. Thus, also different types of data sources are utilized for generating recommendations, e.g., preference information such as ratings, and content features of items (see Section 2.1 for more details on recommender systems in general). ...
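As an illustration of the latent-representation (embedding) approach mentioned here, the following is a minimal matrix-factorization scoring sketch; the factor matrices, shapes, and numbers are purely illustrative assumptions.

```python
import numpy as np

def mf_predict(P, Q, u, i):
    """Latent-factor prediction: the score of user u for item i is the dot
    product of their embeddings. P and Q are user/item factor matrices,
    typically learned by alternating least squares or SGD."""
    return P[u] @ Q[i]

# Tiny example with random factors (purely illustrative numbers)
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 16))   # 100 users, 16 latent factors
Q = rng.normal(size=(500, 16))   # 500 items, 16 latent factors
print(mf_predict(P, Q, u=3, i=42))
```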
Thesis
Full-text available
This is a habilitation (post-doctoral) thesis accepted at Graz University of Technology for the scientific subject "Applied Computer Science"
... Most existing reviews and surveys focus on XAI, with some exploring implementation variations involving KG and RL. Specifically, Chen [3] presents a survey paper on XRS, primarily focusing on approaches employing deep RL techniques. Guo [4] offers a survey paper on the detailed examination of KG incorporation and a few approaches utilizing RL for XRS development. ...
Article
Full-text available
This review paper addresses the research question of the significance of explainability in AI and the role of integrating KG and RL to enhance Explainable Recommender Systems (XRS). It surveys articles published from January 2015 to March 2024 on XRS, focusing on knowledge graphs (KGs) and reinforcement learning (RL) for achieving explainability in recommender systems. Employing a systematic methodology, it introduces a custom Python-based web scraper to efficiently navigate and extract relevant academic research papers from IEEE, ScienceDirect (Elsevier), ACM, and Springer online databases. The study encompasses the PRISMA methodology to conduct a thorough analysis and identify pertinent research works. This systematic literature review aims to provide a unified view of the field by reviewing eight existing XRS literature reviews and 29 pertinent XRS studies involving KG and RL from the specified period. It categorizes and analyses relevant research papers based on their implementation methodologies and explores significant contributions, encompassing perspectives on model-agnostic and model-intrinsic explanations.
... The advent of big data, coupled with strides in artificial intelligence and deep learning, has significantly refined human-computer interactions. These advancements have provided a rich reservoir of source data and have bolstered the technical underpinnings necessary for the integration of reinforcement learning within recommendation systems, thereby enhancing their efficacy and customization [8]. Prospective research in this domain is poised to concentrate on the development of RL recommendation architectures that can navigate expansive action spaces efficiently. ...
Article
Full-text available
This article commences by elucidating the concept and algorithms underpinning reinforcement learning (RL), laying the foundation for a comprehensive understanding of RL's principles. It then draws a detailed comparison between RL and traditional machine learning paradigms, using illustrative examples to highlight the distinctive methodologies and outcomes of these approaches. This comparison not only clarifies the unique attributes of RL but also contextualizes its position within the broader landscape of machine learning techniques. Subsequently, the focus shifts to recommendation systems (RS), where both the conceptual framework and algorithmic foundations are thoroughly examined. This exploration is not limited to the technical mechanics of RS but extends to an appraisal of their diverse applications, showcasing how these systems have become integral in various domains, from e-commerce to content curation. The core of the discussion then converges on the application of RL within the realm of Chinese RS. This section delves into how RL's dynamic learning capabilities and adaptability enhance the functionality and effectiveness of RS, particularly in the context of the unique market dynamics and user behaviors observed in China. The synergy between RL and RS in this context is dissected, offering insights into how RL-driven RS can lead to more personalized, context-aware, and efficient user experiences.
... Deep reinforcement learning (DRL) is attracting interest as a technique for deriving optimal behavior for agents in this complex environment. DRL agents have achieved high performance in various control tasks [1]-[6], but DRL has a black-box problem in that it is very difficult for users to understand the agent's decision-making. There are two main components to this problem. ...
Article
Full-text available
Deep reinforcement learning (DRL) can learn an agent’s optimal behavior from the experience it gains through interacting with its environment. However, since the decision-making process of DRL agents is a black box, it is difficult for users to understand the reasons for the agents’ actions. To date, conventional visual explanation methods for DRL agents have focused only on the policy and not on the state value. In this work, we propose a DRL method called Mask-Attention A3C (Mask A3C) to analyze agents’ decision-making by focusing on both the policy and value branches, which have different outputs. Inspired by the Actor-Critic method, our method introduces an Attention mechanism that applies mask processing to the feature map of the policy and value branches using mask-attention, which is a heat-map representation of the basis for judging the policy and state values. We also propose the introduction of a Mask-attention Loss to obtain highly interpretable mask-attention. By introducing this loss function, the agent learns not to gaze at regions that do not affect its decision-making. Our evaluations with Atari 2600 as a video game strategy task and robot manipulation as a robot control task showed that visualizing the mask-attention of an agent during its action selection facilitates the analysis of the agent’s decision-making. We also investigated the effect of Mask-attention Loss and confirmed that it is useful for analyzing agents’ decision-making. In addition, we showed that these mask-attentions are highly interpretable to the user by conducting a user survey on predicting the agent’s behavior.
... Since the early implementations of recommendation algorithms (e.g., [229,230]), these systems analyze past usage behavior in order to build user models, and to suggest items, or even people in social networks [77,145,164], to individual users or to groups of users (e.g., [189,190]). To build these user models, different techniques have been employed, including traditional approaches such as collaborative filtering [81], content-based filtering [181], and hybrid recommendations [45], and more recent approaches based on latent representations (or embeddings) and deep learning [57,275,288]. Thus, also different types of data sources are utilized for generating recommendations, e.g., preference information such as ratings, and content features of items (see Section 2.1 for more details on recommender systems in general). ...
Preprint
Full-text available
Recommender systems have become a pervasive part of our daily online experience, and are one of the most widely used applications of artificial intelligence and machine learning. Therefore, regulations and requirements for trustworthy artificial intelligence, for example, the European AI Act, which includes notions such as transparency, privacy, and fairness are also highly relevant for the design of recommender systems in practice. This habilitation elaborates on aspects related to these three notions in the light of recommender systems, namely: (i) transparency and cognitive models, (ii) privacy and limited preference information, and (iii) fairness and popularity bias in recommender systems. Specifically, with respect to aspect (i), we highlight the usefulness of incorporating psychological theories for a transparent design process of recommender systems. We term this type of systems psychology-informed recommender systems. In aspect (ii), we study and address the trade-off between accuracy and privacy in differentially-private recommendations. We design a novel recommendation approach for collaborative filtering based on an efficient neighborhood reuse concept, which reduces the number of users that need to be protected with differential privacy. Furthermore, we address the related issue of limited availability of user preference information, e.g., click data, in the settings of session-based and cold-start recommendations. With respect to aspect (iii), we analyze popularity bias in recommender systems. We find that the recommendation frequency of an item is positively correlated with this item's popularity. This also leads to the unfair treatment of users with little interest in popular content. Finally, we study long-term fairness dynamics in algorithmic decision support in the labor market using agent-based modeling techniques.
Chapter
The ability for humans to work in close contact with robots in a manufacturing environment has been limited due to safety concerns and the robot’s inability to sense, react, and coordinate with a human without explicit, rigid programming. However, advances in Deep Reinforcement Learning (DRL) have shown considerable promise in developing processes that allow robots to work in a dynamic environment, solving problems and adapting to the actions and communication from human counterparts. This chapter explores the current state of the art for Human Robot Interaction (HRI), discussing the tools, algorithms, and methods being explored. Representative use cases are discussed to better understand what can be accomplished in today’s manufacturing environment and what challenges could be faced. The concerns around safety, ethics, and unintended consequences are identified. Finally, the chapter looks ahead at the obstacles that still need to be overcome before HRI can be fully scalable and widely used.
Article
Reinforcement Learning-based Recommender Systems (RLRS) have shown promise across a spectrum of applications, from e-commerce platforms to streaming services. Yet, they grapple with challenges, notably in crafting reward functions and harnessing large pre-existing datasets within the RL framework. Recent advancements in offline RLRS provide a solution for how to address these two challenges. However, existing methods mainly rely on the transformer architecture, which, as sequence lengths increase, can introduce challenges associated with computational resources and training costs. Additionally, the prevalent methods employ fixed-length input trajectories, restricting their capacity to capture evolving user preferences. In this study, we introduce a new offline RLRS method to deal with the above problems. We reinterpret the RLRS challenge by modeling sequential decision-making as an inference task, leveraging adaptive masking configurations. This adaptive approach selectively masks input tokens, transforming the recommendation task into an inference challenge based on varying token subsets, thereby enhancing the agent’s ability to infer across diverse trajectory lengths. Furthermore, we incorporate a multi-scale segmented retention mechanism that facilitates efficient modeling of long sequences, significantly enhancing computational efficiency. Our experimental analysis, conducted on both online simulator and offline datasets, clearly demonstrates the advantages of our proposed method.
Article
Planning personalized travel itineraries for groups with diverse preferences is indeed challenging. This article proposes a novel group tour trip recommender model (GTTRM), which uses ant colony optimization (ACO) to optimize group satisfaction while minimizing conflicts between group members. Unlike existing models, the proposed GTTRM allows dynamic subgroup formation during the trip to handle conflicting preferences and provide tailored recommendations. Experimental results show that GTTRM significantly improves satisfaction levels for individual group members, outperforming state-of-the-art models in terms of both subgroup management and optimization efficiency.
Article
In e-learning, the rapid expansion of learning resources poses challenges for learners in finding suitable materials due to their diverse preferences and cognitive abilities. Consequently, personalized learning path recommendation has emerged as a pivotal research area, especially for advancing e-learning systems. This paper introduces an algorithmic framework that integrates deep reinforcement learning with a graph attention mechanism to tailor learning paths to individual learners. An online course dataset is selected, a series of controlled experiments is conducted against common recommendation models proposed in the past, and the experimental results are analyzed using a combination of two evaluation indices, namely the data evaluation results and the model variance. The experimental results show that adding the attention mechanism significantly improves recommendation accuracy: compared with the deep reinforcement learning model without the graph attention mechanism, the comprehensive scores of the students in the test set improved by 5.8 and 12.8 points, respectively, and the accuracy improved by 5.3% over the previous deep learning model; the deep reinforcement model used in this paper, with the labeling feedback mechanism added, improved by 5.3% over deep learning with a feedback mechanism, and under that recommendation model the final scores of the students improved by 3.7 and 8.2 points, respectively. In addition, the score improvement on the Advanced test set for the recommended learning paths is more than twice that on the Middle test set, indicating that the richer the knowledge points covered by the recommended learning objects, the higher the model’s recommendation accuracy. By merging graph attention mechanisms with deep reinforcement learning, our system provides precise recommendations, offering insights into the development of efficient personalized learning path systems and accelerating their educational applications.
Article
With the rapid development of artificial intelligence technology, recommendation systems have been widely applied in various fields. However, in the art field, art similarity search and recommendation systems face unique challenges, namely data privacy and copyright protection issues. To address these problems, this article proposes a cross-institutional artwork similarity search and recommendation system, the AI-based Collaborative Recommendation System (AICRS) framework, that combines multimodal data fusion and federated learning. This system uses pre-trained convolutional neural network (CNN) and Bidirectional Encoder Representations from Transformers (BERT) models to extract features from image and text data. It then uses a federated learning framework to train models locally at each participating institution and aggregate parameters to optimize the global model. Experimental results show that the AICRS framework achieves a final accuracy of 92.02% on the SemArt dataset, compared to 81.52% and 83.44% for traditional CNN and Long Short-Term Memory (LSTM) models, respectively. The final loss value of the AICRS framework is 0.1284, which is better than the 0.248 and 0.188 of the CNN and LSTM models. The research results of this article not only provide an effective technical solution but also offer strong support for the recommendation and protection of artworks in practice.
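The federated aggregation step mentioned above can be sketched with a generic FedAvg-style parameter average, shown below in Python. This is a minimal sketch of the idea (only parameters leave each institution, never raw images or texts), not the AICRS implementation; the weighting scheme and model are illustrative.

import copy
import torch

def federated_average(local_state_dicts, weights=None):
    """FedAvg-style aggregation of locally trained model parameters."""
    if weights is None:
        weights = [1.0 / len(local_state_dicts)] * len(local_state_dicts)
    global_state = copy.deepcopy(local_state_dicts[0])
    for key in global_state:
        # Weighted average of each parameter tensor across institutions.
        global_state[key] = sum(w * sd[key] for w, sd in zip(weights, local_state_dicts))
    return global_state

# Usage: aggregate two institutions' copies of the same small model.
model_a = torch.nn.Linear(4, 2)
model_b = torch.nn.Linear(4, 2)
global_params = federated_average([model_a.state_dict(), model_b.state_dict()])
model_a.load_state_dict(global_params)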
Article
Full-text available
Over the last several decades, recommender systems have become an integral part of both our daily lives and the research frontier in machine learning. In this survey, we explore various approaches to developing simulators for recommendation systems, especially for modeling the user response function. We consider simple probabilistic models, approaches based on generative adversarial networks, and full-scale simulators, and also review the datasets available to the research community.
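A simple probabilistic user-response model of the kind the survey discusses can be sketched as follows; the latent-factor click model, dimensions, and parameters are synthetic assumptions chosen only for illustration.

import numpy as np

class SimpleUserSimulator:
    """Toy user-response simulator: click probability is a logistic function of
    the dot product between latent user and item vectors."""
    def __init__(self, n_users=100, n_items=500, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.users = rng.normal(size=(n_users, dim))
        self.items = rng.normal(size=(n_items, dim))
        self.rng = rng

    def respond(self, user_id, item_id):
        score = self.users[user_id] @ self.items[item_id] / np.sqrt(self.users.shape[1])
        p_click = 1.0 / (1.0 + np.exp(-score))
        return int(self.rng.random() < p_click)   # 1 = click, 0 = no click

sim = SimpleUserSimulator()
print([sim.respond(0, i) for i in range(5)])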
Preprint
Full-text available
In the fourth industrial revolution, artificial intelligence (AI) and machine learning (ML) have increasingly been applied to manufacturing, particularly additive manufacturing (AM), to enhance processes and production. This study provides a comprehensive review of the state-of-the-art achievements in this domain, highlighting not only the widely discussed supervised learning but also the emerging applications of semi-supervised learning and reinforcement learning (RL). These advanced ML techniques have recently garnered significant attention due to their potential to further optimize and automate AM processes. The review aims to offer insights into various ML technologies employed in current research projects and to promote the diverse applications of ML in AM. By exploring the latest advancements and trends, this study seeks to foster a deeper understanding of ML’s transformative role in AM, paving the way for future innovations and improvements in manufacturing practices.
Article
The continuous evolution of technology is transforming the way libraries interact with their users, and the way users interact with books. Recommender systems are conceived as information-filtering systems whose goal is to provide access to personalized information (books of interest, journals, databases, scientific articles, rooms, etc.) in order to improve the user experience, encourage the use of bibliographic resources, and optimize services. This article proposes a hybrid model for automatic book recommendation that assembles three processes in two phases: user identification, collaborative filtering, and content-based filtering. In the first phase, user recognition is carried out with techniques that implement deep learning; in the second phase, the collaborative and content-based recommendation processes are integrated. A case study was developed in a library environment to recommend books and was evaluated using classical information-retrieval metrics. The results were compared with other, more robust recommendation models, with satisfactory results.
Article
Full-text available
Deep reinforcement learning (DRL) has shown promising results in modeling dynamic user preferences in RS in recent literature. However, training a DRL agent in the sparse RS environment poses a significant challenge. This is because the agent must balance between exploring informative user-item interaction trajectories and using existing trajectories for policy learning, a known exploration and exploitation trade-off. This trade-off greatly affects the recommendation performance when the environment is sparse. In DRL-based RS, balancing exploration and exploitation is even more challenging as the agent needs to deeply explore informative trajectories and efficiently exploit them in the context of RS. To address this issue, we propose a novel intrinsically motivated reinforcement learning (IMRL) method that enhances the agent’s capability to explore informative interaction trajectories in the sparse environment. We further enrich these trajectories via an adaptive counterfactual augmentation strategy with a customised threshold to improve their efficiency in exploitation. Our approach is evaluated on six offline datasets and three online simulation platforms, demonstrating its superiority over existing state-of-the-art methods. The extensive experiments show that our IMRL method outperforms other methods in terms of recommendation performance in the sparse RS environment.
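The exploration bonus at the heart of intrinsically motivated RL can be illustrated with a generic curiosity-style reward: the prediction error of a learned forward model on the next state. This is a common formulation shown only as a sketch; it is not the paper's exact IMRL design, and the dimensions are assumptions.

import torch
import torch.nn as nn

class ForwardModelCuriosity(nn.Module):
    """Generic curiosity-style intrinsic reward: hard-to-predict transitions
    (large forward-model error) earn a larger exploration bonus."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
        self.opt = torch.optim.Adam(self.net.parameters(), lr=1e-3)

    def intrinsic_reward(self, state, action, next_state):
        pred = self.net(torch.cat([state, action], dim=-1))
        error = ((pred - next_state) ** 2).mean(dim=-1)
        # Train the forward model on the same transition.
        self.opt.zero_grad()
        error.mean().backward()
        self.opt.step()
        return error.detach()    # larger error -> larger exploration bonus

curiosity = ForwardModelCuriosity(state_dim=8, action_dim=4)
s, a, s_next = torch.randn(2, 8), torch.randn(2, 4), torch.randn(2, 8)
print(curiosity.intrinsic_reward(s, a, s_next).shape)   # torch.Size([2])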
Conference Paper
Full-text available
In this paper, we investigate the task of aggregating search results from heterogeneous sources in an E-commerce environment. First, unlike traditional aggregated web search that merely presents multi-sourced results in the first page, this new task may present aggregated results in all pages and has to dynamically decide which source should be presented in the current page. Second, as pointed out by many existing studies, it is not trivial to rank items from heterogeneous sources because the relevance scores from different source systems are not directly comparable. To address these two issues, we decompose the task into two subtasks in a hierarchical structure: a high-level task for source selection where we model the sequential patterns of user behaviors onto aggregated results in different pages so as to understand user intents and select the relevant sources properly; and a low-level task for item presentation where we formulate a slot filling process to sequentially present the items instead of giving each item a relevance score when deciding the presentation order of heterogeneous items. Since both subtasks can be naturally formulated as sequential decision problems and learn from the future user feedback on search results, we build our model with hierarchical reinforcement learning. Extensive experiments demonstrate that our model obtains remarkable improvements in search performance metrics, and achieves a higher user satisfaction.
Conference Paper
Full-text available
Building interpretable parameterizations of real-world decision-making on the basis of demonstrated behavior, i.e. trajectories of observations and actions made by an expert maximizing some unknown reward function, is essential for introspecting and auditing policies in different institutions. In this paper, we propose learning explanations of expert decisions by modeling their reward function in terms of preferences with respect to "what if" outcomes: Given the current history of observations, what would happen if we took a particular action? To learn these cost-benefit tradeoffs associated with the expert's actions, we integrate counterfactual reasoning into batch inverse reinforcement learning. This offers a principled way of defining reward functions and explaining expert behavior, and also satisfies the constraints of real-world decision-making, where active experimentation is often impossible (e.g. in healthcare). Additionally, by estimating the effects of different actions, counterfactuals readily tackle the off-policy nature of policy evaluation in the batch setting, and can naturally accommodate settings where the expert policies depend on histories of observations rather than just current states. Through illustrative experiments in both real and simulated medical environments, we highlight the effectiveness of our batch, counterfactual inverse reinforcement learning approach in recovering accurate and interpretable descriptions of behavior.
Article
Full-text available
This work considers the question of how convenient access to copious data impacts our ability to learn causal effects and relations. In what ways is learning causality in the era of big data different from—or the same as—the traditional one? To answer this question, this survey provides a comprehensive and structured review of both traditional and frontier methods in learning causality and relations along with the connections between causality and machine learning. This work points out on a case-by-case basis how big data facilitates, complicates, or motivates each approach.
Conference Paper
Full-text available
In session-based or sequential recommendation, it is important to consider a number of factors like long-term user engagement and multiple types of user-item interactions such as clicks, purchases, etc. The current state-of-the-art supervised approaches fail to model them appropriately. Casting the sequential recommendation task as a reinforcement learning (RL) problem is a promising direction. A major component of RL approaches is to train the agent through interactions with the environment. However, it is often problematic to train a recommender in an online fashion due to the requirement to expose users to irrelevant recommendations. As a result, learning the policy from logged implicit feedback is of vital importance, which is challenging due to the pure off-policy setting and lack of negative rewards (feedback). In this paper, we propose self-supervised reinforcement learning for sequential recommendation tasks. Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL. The RL part acts as a regularizer to drive the supervised layer to focus on specific rewards (e.g., recommending items which may lead to purchases rather than clicks), while the self-supervised layer with cross-entropy loss provides strong gradient signals for parameter updates. Based on such an approach, we propose two frameworks, namely Self-Supervised Q-learning (SQN) and Self-Supervised Actor-Critic (SAC). We integrate the proposed frameworks with four state-of-the-art recommendation models. Experimental results on two real-world datasets demonstrate the effectiveness of our approach.
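The "two output layers" idea can be sketched as a shared sequence encoder with a supervised (cross-entropy) head and a Q-learning head whose losses are summed, as described for SQN above. The encoder choice, dimensions, and the toy training step below are illustrative assumptions rather than the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadRecommender(nn.Module):
    """Shared GRU encoder with a next-item (supervised) head and a Q-value head."""
    def __init__(self, n_items, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(n_items, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.supervised_head = nn.Linear(hidden, n_items)   # next-item prediction
        self.q_head = nn.Linear(hidden, n_items)            # Q-value per item

    def forward(self, item_seq):
        _, h = self.encoder(self.emb(item_seq))             # h: [1, batch, hidden]
        h = h.squeeze(0)
        return self.supervised_head(h), self.q_head(h)

def joint_loss(model, seq, target_item, reward, next_seq, gamma=0.9):
    logits, q = model(seq)
    with torch.no_grad():
        _, q_next = model(next_seq)
        td_target = reward + gamma * q_next.max(dim=-1).values
    ce = F.cross_entropy(logits, target_item)                          # supervised signal
    td = F.mse_loss(q.gather(1, target_item.unsqueeze(1)).squeeze(1), td_target)
    return ce + td                                                     # RL head as regularizer

model = TwoHeadRecommender(n_items=1000)
seq = torch.randint(0, 1000, (4, 10))
loss = joint_loss(model, seq, torch.randint(0, 1000, (4,)),
                  torch.rand(4), torch.randint(0, 1000, (4, 10)))
loss.backward()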
Article
Recent advances in recommender systems have proved the potential of reinforcement learning (RL) to handle the dynamic evolution processes between users and recommender systems. However, learning to train an optimal RL agent is generally impractical with the commonly sparse user feedback data in the context of recommender systems. To circumvent the lack of interaction of current RL-based recommender systems, we propose to learn a general model-agnostic counterfactual synthesis (MACS) policy for counterfactual user interaction data augmentation. The counterfactual synthesis policy aims to synthesize counterfactual states while preserving significant information in the original state relevant to the user’s interests, building upon two different training approaches we designed: learning with expert demonstrations and joint training. As a result, the synthesis of each counterfactual data point is based on the current recommendation agent’s interaction with the environment to adapt to users’ dynamic interests. We integrate the proposed policy with deep deterministic policy gradient (DDPG), soft actor critic (SAC), and twin delayed DDPG (TD3) in an adaptive pipeline with a recommendation agent that can generate counterfactual data to improve the performance of recommendation. The empirical results on both online simulation and offline datasets demonstrate the effectiveness and generalization of our counterfactual synthesis policy and verify that it improves the performance of RL recommendation agents.
Article
Integrated recommendation aims to jointly recommend heterogeneous items in the main feed from different sources via multiple channels, which needs to capture user preferences on both item and channel levels. It has been widely used in practical systems by billions of users, while few works concentrate on the integrated recommendation systematically. In this work, we propose a novel Hierarchical reinforcement learning framework for integrated recommendation (HRL-Rec), which divides the integrated recommendation into two tasks to recommend channels and items sequentially. The low-level agent is a channel selector, which generates a personalized channel list. The high-level agent is an item recommender, which recommends specific items from heterogeneous channels under the channel constraints. We design various rewards for both recommendation accuracy and diversity, and propose four losses for fast and stable model convergence. We also conduct an online exploration for sufficient training. In experiments, we conduct extensive offline and online experiments on a billion-level real-world dataset to show the effectiveness of HRL-Rec. HRL-Rec has also been deployed on WeChat Top Stories, affecting millions of users. The source codes are released in https://github.com/modriczhang/HRL-Rec.
Article
With the recent prevalence of Reinforcement Learning (RL), there has been tremendous interest in utilizing RL for online advertising in recommendation platforms (e.g., e-commerce and news feed sites). However, most RL-based advertising algorithms focus on optimizing ad revenue while ignoring the possible negative influence of ads on the user experience of recommended items (products, articles, and videos). Developing an optimal advertising algorithm in recommendations faces immense challenges because interpolating ads improperly or too frequently may degrade the user experience, while interpolating fewer ads will reduce the advertising revenue. Thus, in this paper, we propose a novel advertising strategy for the rec/ads trade-off. To be specific, we develop an RL-based framework that can continuously update its advertising strategies and maximize reward in the long run. Given a recommendation list, we design a novel Deep Q-network architecture that can determine three internally related tasks jointly, i.e., (i) whether to interpolate an ad or not in the recommendation list, and if yes, (ii) the optimal ad and (iii) the optimal location to interpolate. The experimental results based on real-world data demonstrate the effectiveness of the proposed framework.
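One simple way to picture the three coupled decisions listed above is a Q-network over a flattened joint action space: one "skip" action plus one action per (ad, slot) pair. The sketch below is an illustrative simplification under that assumption, not the paper's architecture.

import torch
import torch.nn as nn

class AdInsertionQNetwork(nn.Module):
    """Toy Q-network deciding whether to insert an ad, which ad, and at which slot."""
    def __init__(self, state_dim, n_ads, n_slots, hidden=128):
        super().__init__()
        self.n_ads, self.n_slots = n_ads, n_slots
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + n_ads * n_slots),   # action 0 = "do not insert"
        )

    def act(self, state):
        q = self.net(state)
        a = int(torch.argmax(q))
        if a == 0:
            return {"insert": False}
        a -= 1
        return {"insert": True, "ad": a // self.n_slots, "slot": a % self.n_slots}

qnet = AdInsertionQNetwork(state_dim=20, n_ads=5, n_slots=4)
print(qnet.act(torch.randn(20)))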
Article
In recent years, there has been great interest, as well as many challenges, in applying reinforcement learning (RL) to recommendation systems (RS). In this paper, we summarize three key practical challenges of large-scale RL-based recommender systems: massive state and action spaces, a high-variance environment, and the unspecific reward setting in recommendation. All these problems remain largely unexplored in the existing literature and make the application of RL challenging. We develop a model-based reinforcement learning framework, called GoalRec. Inspired by the ideas of the world model (model-based), value function estimation (model-free), and goal-based RL, a novel disentangled universal value function designed for item recommendation is proposed. It can generalize to the various goals that the recommender may have, and disentangle the stochastic environmental dynamics and high-variance reward signals accordingly. As a part of the value function, free from the sparse and high-variance reward signals, a high-capacity reward-independent world model is trained to simulate complex environmental dynamics under a certain goal. Based on the predicted environmental dynamics, the disentangled universal value function is related to the user's future trajectory instead of a monolithic state and a scalar reward. We demonstrate the superiority of GoalRec over previous approaches with respect to the above three practical challenges in a series of simulations and a real application.
Article
The development of autonomous agents which can interact with other agents to accomplish a given task is a core area of research in artificial intelligence and machine learning. Towards this goal, the Autonomous Agents Research Group develops novel machine learning algorithms for autonomous systems control, with a specific focus on deep reinforcement learning and multi-agent reinforcement learning. Research problems include scalable learning of coordinated agent policies and inter-agent communication; reasoning about the behaviours, goals, and composition of other agents from limited observations; and sample-efficient learning based on intrinsic motivation, curriculum learning, causal inference, and representation learning. This article provides a broad overview of the ongoing research portfolio of the group and discusses open problems for future directions.
Preprint
Recommender System (RS) is an important online application that affects billions of users every day. The mainstream RS ranking framework is composed of two parts: a Multi-Task Learning model (MTL) that predicts various user feedback, i.e., clicks, likes, and shares, and a Multi-Task Fusion model (MTF) that combines the multi-task outputs into one final ranking score with respect to user satisfaction. There has been little research on the fusion model, even though it has a great impact on the final recommendation as the last crucial step of the ranking process. To optimize long-term user satisfaction rather than obtain instant returns greedily, we formulate the MTF task as a Markov Decision Process (MDP) within a recommendation session and propose a Batch Reinforcement Learning (RL) based Multi-Task Fusion framework (BatchRL-MTF) that includes a Batch RL framework and an online exploration. The former exploits Batch RL to learn an optimal recommendation policy from fixed batch data offline for long-term user satisfaction, while the latter explores potential high-value actions online to break through the local optimum dilemma. With a comprehensive investigation of user behaviors, we model the user satisfaction reward with subtle heuristics from two aspects of user stickiness and user activeness. Finally, we conduct extensive experiments on a billion-sample level real-world dataset to show the effectiveness of our model. We propose a conservative offline policy estimator (Conservative-OPEstimator) to test our model offline. Furthermore, we run online experiments in a real recommendation environment to compare the performance of different models. As one of the few Batch RL studies applied successfully to the MTF task, our model has also been deployed on a large-scale industrial short video platform, serving hundreds of millions of users.
Preprint
Recent studies have shown that deep neural network-based recommender systems are vulnerable to adversarial attacks, where attackers can inject carefully crafted fake user profiles (i.e., sets of items that fake users have interacted with) into a target recommender system to achieve malicious purposes, such as promoting or demoting a set of target items. Due to security and privacy concerns, it is more practical to perform adversarial attacks under the black-box setting, where the architecture/parameters and training data of target systems cannot be easily accessed by attackers. However, generating high-quality fake user profiles under the black-box setting is rather challenging with limited resources to target systems. To address this challenge, in this work, we introduce a novel strategy that leverages items' attribute information (i.e., items' knowledge graph), which can be publicly accessible and provides rich auxiliary knowledge to enhance the generation of fake user profiles. More specifically, we propose a knowledge graph-enhanced black-box attacking framework (KGAttack) to effectively learn attacking policies through deep reinforcement learning techniques, in which the knowledge graph is seamlessly integrated into hierarchical policy networks to generate fake user profiles for performing adversarial black-box attacks. Comprehensive experiments on various real-world datasets demonstrate the effectiveness of the proposed attacking framework under the black-box setting.
Article
Interactive recommendation with natural-language feedback can provide richer user feedback and has demonstrated advantages over traditional recommender systems. However, the classical online paradigm involves iteratively collecting experience via interaction with users, which is expensive and risky. We consider offline interactive recommendation, which exploits arbitrary experience collected by multiple unknown policies. A direct application of policy learning with such fixed experience suffers from distribution shift. To tackle this issue, we develop a behavior-agnostic off-policy correction framework that makes offline interactive recommendation possible. Specifically, we leverage a conservative Q-function to perform off-policy evaluation, which enables learning effective policies from fixed datasets without further interactions. Empirical results on a simulator derived from real-world datasets demonstrate the effectiveness of our proposed offline training framework.
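The conservative Q-function mentioned above is commonly realised, in the offline RL literature, by adding a conservatism penalty to the standard temporal-difference loss so that Q-values on actions outside the logged data are pushed down. The following Python sketch shows that generic CQL-style regularizer under the assumption of discrete actions; it is not claimed to be the paper's exact objective.

import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """TD loss plus a CQL-style penalty: logsumexp over all Q-values minus the
    Q-value of the logged action, averaged over the batch."""
    s, a, r, s_next = batch
    q_all = q_net(s)                                             # [B, n_actions]
    q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        td_target = r + gamma * target_q_net(s_next).max(dim=1).values
    td_loss = F.mse_loss(q_taken, td_target)
    cql_penalty = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return td_loss + alpha * cql_penalty

# Usage with a toy Q-network over 10 actions and a 16-dim state.
q_net = torch.nn.Linear(16, 10)
target_q_net = torch.nn.Linear(16, 10)
batch = (torch.randn(32, 16), torch.randint(0, 10, (32,)),
         torch.rand(32), torch.randn(32, 16))
conservative_q_loss(q_net, target_q_net, batch).backward()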
Article
Recommender systems (RSs) have become an inseparable part of our everyday lives. They help us find our favorite items to purchase, our friends on social networks, and our favorite movies to watch. Traditionally, the recommendation problem was considered to be a classification or prediction problem, but it is now widely agreed that formulating it as a sequential decision problem can better reflect the user-system interaction. Therefore, it can be formulated as a Markov decision process (MDP) and be solved by reinforcement learning (RL) algorithms. Unlike traditional recommendation methods, including collaborative filtering and content-based filtering, RL is able to handle the sequential, dynamic user-system interaction and to take into account the long-term user engagement. Although the idea of using RL for recommendation is not new and has been around for about two decades, it was not very practical, mainly because of scalability problems of traditional RL algorithms. However, a new trend has emerged in the field since the introduction of deep reinforcement learning (DRL), which made it possible to apply RL to the recommendation problem with large state and action spaces. In this paper, a survey on reinforcement learning based recommender systems (RLRSs) is presented. Our aim is to present an outlook on the field and to provide the reader with a fairly complete knowledge of key concepts of the field. We first recognize and illustrate that RLRSs can be generally classified into RL- and DRL-based methods. Then, we propose an RLRS framework with four components, i.e., state representation, policy optimization, reward formulation, and environment building, and survey RLRS algorithms accordingly. We highlight emerging topics and depict important trends using various graphs and tables. Finally, we discuss important aspects and challenges that can be addressed in the future.
Article
User profile perturbation protects privacy in the release of user profiles to receive recommendation services, in which the privacy budget as a privacy parameter can be controlled to effect a tradeoff between the recommendation quality and privacy protection against inference attacks. In this article, we propose a deep reinforcement learning (RL)-based user profile perturbation scheme for recommendation systems. This scheme applies differential privacy to protect user privacy and uses deep RL to choose the privacy budget against inference attackers. Based on an evaluated neural network (NN) and a target NN, this scheme enables a user device to optimize the privacy budget over time based on the sensitivity level of the clicked item, the similarities among the recommended items, and the estimated privacy loss. We provide an upper bound on the privacy protection performance of this scheme in the recommendation game and evaluate its computational complexity. Simulation results for a movie recommendation system show that this scheme increases the user privacy protection level for a given recommendation quality compared with benchmark schemes.
Article
Deep neural network-based systems are now state-of-the-art in many robotics tasks, but their application in safety-critical domains remains dangerous without formal guarantees on network robustness. Small perturbations to sensor inputs (from noise or adversarial examples) are often enough to change network-based decisions, which was recently shown to cause an autonomous vehicle to swerve into another lane. In light of these dangers, numerous algorithms have been developed as defensive mechanisms against these adversarial inputs, some of which provide formal robustness guarantees or certificates. This work leverages research on certified adversarial robustness to develop an online certifiably robust defense for deep reinforcement learning algorithms. The proposed defense computes guaranteed lower bounds on state-action values during execution to identify and choose a robust action under a worst-case deviation in input space due to possible adversaries or noise. Moreover, the resulting policy comes with a certificate of solution quality, even though the true state and optimal action are unknown to the certifier due to the perturbations. The approach is demonstrated on a deep Q-network (DQN) policy and is shown to increase robustness to noise and adversaries in pedestrian collision avoidance scenarios, a classic control task, and Atari Pong. This article extends our prior work with new performance guarantees, extensions to other reinforcement learning algorithms, expanded results aggregated across more scenarios, an extension into scenarios with adversarial behavior, comparisons with a more computationally expensive method, and visualizations that provide intuition about the robustness algorithm.
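The idea of choosing an action by its guaranteed lower-bound Q-value can be illustrated with interval bound propagation through a small Linear/ReLU Q-network, as in the Python sketch below. This is a generic certified-bound construction under an L-infinity input perturbation, used here purely for illustration; the article's exact certification procedure may differ.

import torch
import torch.nn as nn

def interval_bounds(layers, x, eps):
    """Propagate element-wise lower/upper bounds through Linear/ReLU layers for
    any input within an L-infinity ball of radius eps around x."""
    lower, upper = x - eps, x + eps
    for layer in layers:
        if isinstance(layer, nn.Linear):
            center, radius = (lower + upper) / 2, (upper - lower) / 2
            c = layer(center)
            r = radius @ layer.weight.abs().t()   # |W| applied to the radius
            lower, upper = c - r, c + r
        elif isinstance(layer, nn.ReLU):
            lower, upper = lower.clamp(min=0), upper.clamp(min=0)
    return lower, upper

def robust_action(q_layers, state, eps):
    # Pick the action whose worst-case Q-value (lower bound) is highest.
    lower_q, _ = interval_bounds(q_layers, state, eps)
    return int(torch.argmax(lower_q))

q_layers = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))
state = torch.randn(8)
print(robust_action(list(q_layers), state, eps=0.05))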
Article
Deep reinforcement learning (RL) has recently led to many breakthroughs on a range of complex control tasks. However, the decision-making process is generally not transparent. The lack of interpretability hinders its applicability in safety-critical scenarios. While several methods have attempted to interpret vision-based RL, most come without a detailed explanation of the agent's behaviour. In this paper, we propose a self-supervised interpretable framework, which can discover causal features to enable easy interpretation of RL even for non-experts. Specifically, a self-supervised interpretable network is employed to produce fine-grained masks for highlighting task-relevant information, which constitutes most of the evidence for the agent's decisions. We verify and evaluate our method on several Atari-2600 games and Duckietown, which is a challenging self-driving car simulator environment. The results show that our method renders causal explanations and empirical evidence about how the agent makes decisions and why the agent performs well or badly. Overall, our method provides valuable insight into the decision-making process of RL. In addition, our method does not use any external labelled data, and thus demonstrates the possibility of learning high-quality masks in a self-supervised manner, which may shed light on new paradigms for label-free vision learning such as self-supervised segmentation and detection.
Article
Reinforcement learning (RL) techniques have recently been introduced to recommender systems. Most existing research works focus on designing policies and learning algorithms for the recommender agent but seldom care about the top-aware issue, i.e., that the performance at the top positions is not satisfying, which is crucial for real applications. To address this drawback, we propose a Supervised deep Reinforcement learning Recommendation framework named SRR. Within this framework, we utilize a supervised learning (SL) model to partially guide the learning of the recommendation policy, where the supervision signal and the RL signal are jointly employed and updated in a complementary fashion. We empirically find that suitable weak supervision helps to balance the immediate reward and the long-term reward, which nicely addresses the top-aware issue in RL-based recommendation. Moreover, we perform a further investigation of how different supervision signals impact the recommendation policy. Extensive experiments are carried out on two real-world datasets under both offline and simulated online evaluation settings, and the results demonstrate that the proposed methods indeed resolve the top-aware issue without much performance sacrifice in the long run, compared with state-of-the-art methods.
Article
Dynamic taxi route recommendation aims at recommending cruising routes to vacant taxis so that they can quickly find and pick up new passengers. Given citizens’ huge but unbalanced riding demand and the very limited number of taxis in a city, dynamic taxi route recommendation is essential for its ability to reduce the waiting time of passengers and increase the earnings of taxi drivers. Thus, in this paper we study the dynamic taxi route recommendation problem as a sequential decision-making problem and design an effective two-step method to tackle it. First, we propose to consider and extract multiple real-time spatio-temporal features, which relate to how easily vacant taxis can pick up new passengers. Second, we design an adaptive deep reinforcement learning method, which learns a carefully designed deep policy network to better fuse the extracted spatio-temporal features so that effective route recommendations can be made. Extensive experiments using real-world data from San Francisco and New York are conducted. Compared with the state of the art, our method increases the average earnings of taxi drivers by at least 15.8% and reduces the average waiting time of passengers by at least 29.6%.
Article
Reinforcement learning techniques have recently been introduced to interactive recommender systems to capture the dynamic patterns of user behavior during the interaction with recommender systems and perform planning to optimize long-term performance. Most existing research work focuses on designing policy and learning algorithms of the recommender agent but seldom cares about the state representation of the environment, which is indeed essential for the recommendation decision making. In this paper, we first formulate the interactive recommender system problem with a deep reinforcement learning recommendation framework. Within this framework, we then carefully design four state representation schemes for learning the recommendation policy. Inspired by recent advances in feature interaction modeling in user response prediction, we discover that explicitly modeling user-item interactions in state representation can largely help the recommendation policy perform effective reinforcement learning. Extensive experiments on four real-world datasets are conducted under both the offline and simulated online evaluation settings. The experimental results demonstrate the proposed state representation schemes lead to better performance over the state-of-the-art methods.
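One possible way to model user-item interactions explicitly in the state, in the spirit of the schemes discussed above, is to concatenate the user vector, a pooled history-item vector, and their element-wise product. The following sketch shows that one option under assumed dimensions; it is an illustration, not the paper's exact state representation.

import torch
import torch.nn as nn

class InteractionStateEncoder(nn.Module):
    """State = [user embedding, mean history-item embedding, element-wise product]."""
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, user_id, history_items):
        u = self.user_emb(user_id)                       # [B, dim]
        h = self.item_emb(history_items).mean(dim=1)     # [B, dim] pooled history
        return torch.cat([u, h, u * h], dim=-1)          # explicit interaction term

enc = InteractionStateEncoder(n_users=100, n_items=1000)
state = enc(torch.tensor([3]), torch.tensor([[10, 42, 7]]))
print(state.shape)   # torch.Size([1, 96])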
Article
Prominent theories in cognitive science propose that humans understand and represent the knowledge of the world through causal relationships. In making sense of the world, we build causal models in our mind to encode cause-effect relations of events and use these to explain why new events happen by referring to counterfactuals — things that did not happen. In this paper, we use causal models to derive causal explanations of the behaviour of model-free reinforcement learning agents. We present an approach that learns a structural causal model during reinforcement learning and encodes causal relationships between variables of interest. This model is then used to generate explanations of behaviour based on counterfactual analysis of the causal model. We computationally evaluate the model in 6 domains and measure performance and task prediction accuracy. We report on a study with 120 participants who observe agents playing a real-time strategy game (Starcraft II) and then receive explanations of the agents' behaviour. We investigate: 1) participants' understanding gained by explanations through task prediction; 2) explanation satisfaction and 3) trust. Our results show that causal model explanations perform better on these measures compared to two other baseline explanation models.
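A deliberately simplified picture of the counterfactual-explanation idea above: fit a structural equation for the outcome from logged (state, action) pairs, then answer "what would have happened under a different action" by re-evaluating the fitted equation with the action swapped. The linear model, variable names, and data below are illustrative assumptions, not the paper's structural causal model.

import numpy as np

class LinearSCM:
    """Toy structural causal model over (state, action, outcome)."""
    def fit(self, states, actions, outcomes):
        X = np.column_stack([states, actions, np.ones(len(actions))])
        self.coef, *_ = np.linalg.lstsq(X, outcomes, rcond=None)
        return self

    def counterfactual(self, state, actual_action, alt_action):
        predict = lambda a: np.append(np.append(state, a), 1.0) @ self.coef
        return {"actual": predict(actual_action), "what_if": predict(alt_action)}

rng = np.random.default_rng(0)
states = rng.normal(size=(200, 3))
actions = rng.integers(0, 2, size=200)
outcomes = states @ np.array([0.5, -0.2, 0.1]) + 0.8 * actions + rng.normal(scale=0.1, size=200)
scm = LinearSCM().fit(states, actions, outcomes)
print(scm.counterfactual(states[0], actual_action=1, alt_action=0))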