Article

Thompson Sampling for Dynamic Pricing

Abstract

In this paper we apply active learning algorithms for dynamic pricing in a prominent e-commerce website. Dynamic pricing involves changing the price of items on a regular basis, and uses the feedback from the pricing decisions to update prices of the items. Most popular approaches to dynamic pricing use a passive learning approach, where the algorithm uses historical data to learn various parameters of the pricing problem, and uses the updated parameters to generate a new set of prices. We show that one can use active learning algorithms such as Thompson sampling to more efficiently learn the underlying parameters in a pricing problem. We apply our algorithms to a real e-commerce system and show that the algorithms indeed improve revenue compared to pricing algorithms that use passive learning.
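The abstract does not include pseudocode. A minimal sketch of the idea, assuming a finite grid of candidate prices with Bernoulli purchase feedback and an independent Beta posterior per price (the grid, priors, and simulated demand curve below are illustrative assumptions, not the paper's actual setup), could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

prices = np.array([5.0, 7.5, 10.0, 12.5, 15.0])  # candidate prices (assumed grid)
alpha = np.ones(len(prices))                      # Beta posterior parameters for the
beta = np.ones(len(prices))                       # purchase probability at each price

def true_purchase_prob(p):
    """Illustrative ground-truth demand curve used only to simulate feedback."""
    return 1.0 / (1.0 + np.exp(0.5 * (p - 9.0)))

for t in range(10_000):
    # Active learning step: sample a plausible purchase probability per price
    theta = rng.beta(alpha, beta)
    # Post the price whose sampled expected revenue is highest
    k = int(np.argmax(prices * theta))
    # Observe purchase feedback and update the posterior for that price
    sold = rng.random() < true_purchase_prob(prices[k])
    alpha[k] += sold
    beta[k] += 1 - sold

posterior_mean = alpha / (alpha + beta)
print("posterior purchase prob per price:", np.round(posterior_mean, 3))
print("estimated best price:", prices[np.argmax(prices * posterior_mean)])
```

Unlike a passive policy that re-estimates parameters from historical data and then prices greedily, the posterior-sampling step injects exploration exactly where the revenue estimate is still uncertain.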


... In the last few decades dynamic pricing has been an active area of research, with substantial contributions in the fields of operations research, management science, economics, and computer science [1]. Dynamic pricing is defined as setting a time-varying price for a certain product or service [2], [3]. ...
... Another piece of work that exploits multi-armed bandit algorithms for dynamic pricing of e-commerce applications is presented by Ganti et al. in [1]. The authors apply two different multi-armed bandit algorithms: Upper Confidence Bound (UCB) [59] and Thompson Sampling (TS) [53]. ...
Article
Full-text available
Dynamic pricing is a beneficial strategy for firms seeking to achieve high revenues. It has been widely applied to various domains such as the airline industry, the hotel industry, and e-services. Dynamic pricing is essentially the problem of setting time-varying prices for a certain product or service for the purpose of optimizing revenue. However, a major challenge encountered when applying dynamic pricing is the lack of knowledge of the demand-price curve, which describes how customers respond to price changes. In this work, we address the dynamic pricing problem in the case of an unknown demand-price relation. We introduce a less myopic pricing approach based on looking ahead one or several future steps in order to optimize revenue. Specifically, the proposed formulation maximizes the sum of the immediate revenue and the expected future revenues of one or multiple look-ahead steps. A key benefit of the proposed approach is that it automatically strikes a balance between the conflicting goals of revenue maximization and demand learning, producing less myopic and more profitable prices. We provide a formulation for the presented look-ahead pricing approach and implement two variants of it: one-step and two-step look-ahead methods. Experiments are conducted on synthetic and real datasets to compare the proposed pricing methods to other pricing strategies in the literature. The experimental results indicate that the proposed look-ahead methods outperform their counterparts in terms of the achieved revenue gain.
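A minimal sketch of the one-step look-ahead idea, assuming Bernoulli purchase feedback with an independent Beta belief per candidate price (the discrete price grid and all numbers are illustrative assumptions, not the authors' formulation): the value of a price is its immediate expected revenue plus the best myopic revenue expected after the belief is updated with the observed outcome.

```python
import numpy as np

prices = np.array([4.0, 6.0, 8.0, 10.0])  # assumed candidate prices
alpha = np.ones(len(prices))               # Beta(alpha, beta) belief over the purchase
beta = np.ones(len(prices))                # probability at each price

def myopic_value(a, b):
    """Best immediate expected revenue under the belief (a, b)."""
    return np.max(prices * a / (a + b))

def one_step_lookahead_value(k, a, b):
    """Expected revenue of posting prices[k] now, plus the best myopic revenue
    after the belief is updated with the observed sale / no-sale outcome."""
    p_buy = a[k] / (a[k] + b[k])
    a_sale, b_sale = a.copy(), b.copy()
    a_sale[k] += 1                          # belief after observing a sale
    a_no, b_no = a.copy(), b.copy()
    b_no[k] += 1                            # belief after observing no sale
    immediate = prices[k] * p_buy
    future = p_buy * myopic_value(a_sale, b_sale) + (1 - p_buy) * myopic_value(a_no, b_no)
    return immediate + future

scores = [one_step_lookahead_value(k, alpha, beta) for k in range(len(prices))]
print("one-step look-ahead price to post:", prices[int(np.argmax(scores))])
```

The extra term rewards prices whose outcome is informative for future pricing, which is how the approach trades off revenue maximization against demand learning.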
... Dynamic pricing. Many market leaders, including Groupon [16] and Walmart [17], extensively investigate and utilize dynamic pricing algorithms to obtain and maintain a competitive advantage over time. Kemper and Breuer [18] combined the mathematical principles of dynamic pricing with empirical marketing research methods on a dataset of auctions (from ebay.de), to determine demand functions of football tickets during the 2013-14 Bundesliga season. ...
... If the random number is less than the probability of purchase and if there are free seats on the considered bus, the tickets are considered sold (lines 14-19). Otherwise, we consider that the user has left the platform without buying any tickets. ...
Article
Full-text available
In recent years the demand for collective mobility services has registered significant growth. In particular, the long-distance coach market has undergone an important change in Europe since FlixBus adopted a dynamic pricing strategy, providing low-cost transport services and an efficient and fast information system. This paper presents a methodology, called DA4PT (Data Analytics for Public Transport), for discovering the factors that influence travelers in booking and purchasing bus tickets. Starting from a set of 3.23 million user-generated event logs of a bus ticketing platform, the methodology shows the correlation rules between booking factors and ticket purchases. Such rules are then used to train a machine learning model for predicting whether or not a user will buy a ticket. The rules are also used to define various dynamic pricing strategies with the purpose of increasing the number of ticket sales on the platform and the related revenues. The methodology reaches an accuracy of 95% in forecasting the purchase of a ticket, with low variance in results. Exploiting a dynamic pricing strategy, DA4PT is able to increase the number of purchased tickets by 6% and the total revenue by 9%, demonstrating the effectiveness of the proposed approach.
Chapter
Reinforcement learning (RL) is a mathematical framework for developing computer agents that can learn an optimal behavior by relating generic rewards with its past actions. With numerous successful applications in business intelligence, health care, finance, and gaming, the RL framework is ideal for sequential decision-making in unknown environments with large amounts of data. Multiarmed bandits are the simplest form of reinforcement learning. This chapter provides a systematic bridge between RL and multiarmed bandits. We also discuss the state-of-the-art results in nonstationary environment, which is practical in several real-life applications.
Article
We study a family of adversarial (a.k.a. nonstochastic) multi-armed bandit (MAB) problems in which the player not only cannot observe the reward of the played arm (a self-unaware player) but also incurs switching costs when shifting to another arm. We study two cases. In Case 1, at each round, the player is able to either play or observe the chosen arm, but not both. In Case 2, the player can choose an arm to play and, in the same round, choose another arm to observe. In both cases, the player incurs a cost for consecutive arm switching due to playing or observing the arms. We propose two novel online learning-based algorithms, each addressing one of the aforementioned MAB problems. We theoretically prove that the proposed algorithms for Case 1 and Case 2 achieve sublinear regret of $O(\sqrt[4]{K T^3 \ln K})$ and $O(\sqrt[3]{(K-1) T^2 \ln K})$, respectively, where the latter regret bound is order-optimal in time, $K$ is the number of arms, and $T$ is the total number of rounds. In Case 2, we extend the player's capability to multiple $m > 1$ observations and show that more observations do not necessarily improve the regret bound because of the switching costs incurred. However, we derive an upper bound on the switching cost, $c \leq 1/\sqrt[3]{m^2}$, for which the regret bound improves as the number of observations increases. Finally, through this study we find that a generalized version of our approach gives an interesting sublinear regret upper bound of $\tilde{O}(T^{\frac{s+1}{s+2}})$ for any self-unaware bandit player who faces $s$ binary decisions before taking an action. To further validate and complement the theoretical findings, we conduct extensive performance evaluations over synthetic data constructed by nonstochastic MAB environment simulations and wireless spectrum measurement data collected in a real-world experiment.
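The paper's own algorithms are not reproduced in this listing. Purely as a point of reference for the setting, a standard Exp3 adversarial-bandit learner augmented with a running tally of switching costs (the arm count, horizon, reward sequence, and cost value below are illustrative assumptions, and this is not the proposed method) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, gamma, switch_cost = 5, 5000, 0.1, 0.05   # all values assumed for illustration

def adversarial_reward(arm, t):
    """Illustrative nonstochastic reward sequence in [0, 1]."""
    return 0.5 + 0.5 * np.sin(0.01 * t + arm)

weights = np.ones(K)
prev_arm, total_reward, total_switch_cost = None, 0.0, 0.0

for t in range(T):
    probs = (1 - gamma) * weights / weights.sum() + gamma / K
    arm = int(rng.choice(K, p=probs))
    r = adversarial_reward(arm, t)
    total_reward += r
    if prev_arm is not None and arm != prev_arm:
        total_switch_cost += switch_cost        # cost charged on every arm switch
    prev_arm = arm
    # importance-weighted reward estimate drives the exponential weight update
    weights[arm] *= np.exp(gamma * (r / probs[arm]) / K)
    weights /= weights.max()                    # renormalize to avoid numerical overflow

print("cumulative reward:", round(total_reward, 1),
      "switching cost paid:", round(total_switch_cost, 1))
```

Frequent arm changes directly erode the net payoff here, which is the tension the paper's switching-cost-aware algorithms are designed to handle.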
Article
Full-text available
Reinforcement learning (RL) is an area of machine learning concerned with how agents take actions to optimize a given long-term reward by interacting with the environment they are placed in. Some well-known recent applications include self-driving cars and computers playing games with super-human performance. One of the main advantages of this approach is that there is no need to explicitly model the nature of the interactions with the environment. In this work, we present a new airline Revenue Management System (RMS) based on RL, which does not require a demand forecaster. The optimization module remains but works in a different way. It is theoretically proven that RL converges to the optimal solution; however, in practice, the system may require a significant amount of data (a booking history with millions of daily departures) to learn the optimal policies. To overcome these difficulties, we present a novel model that integrates domain knowledge with a deep neural network trained on GPUs. The results are very encouraging in different scenarios and open the door for a new generation of RMSs that could automatically learn by directly interacting with customers.
Article
Full-text available
We consider a monopolist who sells a set of products over a time horizon of T periods. The seller initially does not know the parameters of the products' linear demand curve, but can estimate them based on demand observations. We first assume that the seller knows nothing about the parameters of the demand curve, and then consider the case where the seller knows the expected demand under an incumbent price. It is shown that the smallest achievable revenue loss in T periods, relative to a clairvoyant who knows the underlying demand model, is of order √T in the former case and of order log T in the latter case. To derive pricing policies that are practically implementable, we take as our point of departure the widely used policy called greedy iterated least squares (ILS), which combines sequential estimation and myopic price optimization. It is known that the greedy ILS policy itself suffers from incomplete learning, but we show that certain variants of greedy ILS achieve the minimum asymptotic loss rate. To highlight the essential features of well-performing pricing policies, we derive sufficient conditions for asymptotic optimality.
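The paper's specific ILS variants are not reproduced here. As a rough sketch of the baseline it builds on, the following simulates greedy iterated least squares for a single product with a linear demand curve, plus a small decaying price perturbation as one illustrative way to maintain price dispersion (the demand parameters and the perturbation schedule are assumptions, not the authors' policy):

```python
import numpy as np

rng = np.random.default_rng(2)
a_true, b_true, sigma = 50.0, 2.0, 3.0           # illustrative linear demand d = a - b*p + noise

prices_hist, demand_hist = [8.0, 14.0], []       # two initial exploratory prices
for p in prices_hist:
    demand_hist.append(a_true - b_true * p + sigma * rng.standard_normal())

for t in range(500):
    # Sequential estimation: least-squares fit of the linear demand curve
    X = np.column_stack([np.ones(len(prices_hist)), prices_hist])
    coef, *_ = np.linalg.lstsq(X, np.array(demand_hist), rcond=None)
    a_hat, b_hat = coef[0], -coef[1]
    # Myopic price optimization for revenue p * (a - b*p): p* = a / (2b)
    p_greedy = a_hat / (2 * max(b_hat, 1e-6))
    # Decaying perturbation keeps prices dispersed, one illustrative remedy for
    # the incomplete-learning problem of pure greedy ILS
    p_t = p_greedy + (1.0 / (t + 1) ** 0.25) * rng.choice([-1.0, 1.0])
    d_t = a_true - b_true * p_t + sigma * rng.standard_normal()
    prices_hist.append(p_t)
    demand_hist.append(d_t)

print("final price:", round(prices_hist[-1], 2), "true optimal:", a_true / (2 * b_true))
```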
Article
Full-text available
Motivated by applications in financial services, we consider a seller who offers prices sequentially to a stream of potential customers, observing either success or failure in each sales attempt. The parameters of the underlying demand model are initially unknown, so each price decision involves a trade-off between learning and earning. Attention is restricted to the simplest kind of model uncertainty, where one of two demand models is known to apply, and we focus initially on performance of the myopic Bayesian policy (MBP), variants of which are commonly used in practice. Because learning is passive under the MBP (that is, learning only takes place as a by-product of actions that have a different purpose), it can lead to incomplete learning and poor profit performance. However, under one additional assumption, a constrained variant of the myopic policy is shown to have the following strong theoretical virtue: the expected performance gap relative to a clairvoyant who knows the underlying demand model is bounded by a constant as the number of sales attempts becomes large. This paper was accepted by Gérard P. Cachon, stochastic models and simulation.
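A minimal sketch of the myopic Bayesian policy in the simplest setting the abstract describes, with two candidate demand models and Bernoulli sales feedback (the two demand curves and price grid below are made up, and this illustrates only the plain MBP, not the constrained variant with the performance guarantee):

```python
import numpy as np

rng = np.random.default_rng(3)
prices = np.linspace(1.0, 10.0, 50)

def demand_A(p):                 # two candidate demand models (illustrative)
    return np.clip(0.9 - 0.08 * p, 0.01, 0.99)

def demand_B(p):
    return np.clip(0.7 - 0.05 * p, 0.01, 0.99)

true_model = demand_A            # unknown to the seller
q = 0.5                          # prior probability that model A is the true one

for t in range(2000):
    # Myopic step: post the price that maximizes expected revenue under the belief
    exp_rev = prices * (q * demand_A(prices) + (1 - q) * demand_B(prices))
    p = prices[int(np.argmax(exp_rev))]
    sale = rng.random() < true_model(p)
    # Passive learning: Bayes update of the model belief from the observed outcome
    like_A = demand_A(p) if sale else 1 - demand_A(p)
    like_B = demand_B(p) if sale else 1 - demand_B(p)
    q = q * like_A / (q * like_A + (1 - q) * like_B)

print("posterior P(model A):", round(q, 3), "last posted price:", round(p, 2))
```

Because learning happens only as a by-product of the myopic price, the belief can stop moving at an uninformative price, which is the incomplete-learning failure mode the paper analyzes.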
Article
Full-text available
Price experimentation is an important tool for firms to find the optimal selling price of their products. It should be conducted properly, since experimenting with selling prices can be costly. A firm, therefore, needs to find a pricing policy that optimally balances between learning the optimal price and gaining revenue. In this paper, we propose such a pricing policy, called controlled variance pricing (CVP). The key idea of the policy is to enhance the certainty equivalent pricing policy with a taboo interval around the average of previously chosen prices. The width of the taboo interval shrinks at an appropriate rate as the amount of data gathered gets large; this guarantees sufficient price dispersion. For a large class of demand models, we show that this procedure is strongly consistent, which means that eventually the value of the optimal price will be learned, and derive upper bounds on the regret, which is the expected amount of money lost due to not using the optimal price. Numerical tests indicate that CVP performs well on different demand models and time scales. This paper was accepted by Assaf Zeevi, stochastic models and simulation.
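A minimal sketch of the taboo-interval idea, assuming a linear demand model with Gaussian noise and a particular shrink rate for the interval width (both are illustrative assumptions rather than the paper's exact specification):

```python
import numpy as np

rng = np.random.default_rng(4)
a_true, b_true, sigma = 40.0, 1.5, 2.0            # illustrative linear demand d = a - b*p + noise

prices, demands = [5.0, 20.0], []                 # two initial exploration prices
for p in prices:
    demands.append(a_true - b_true * p + sigma * rng.standard_normal())

for t in range(2, 1000):
    # Certainty equivalent step: least-squares estimate of the demand curve
    X = np.column_stack([np.ones(len(prices)), prices])
    coef, *_ = np.linalg.lstsq(X, np.array(demands), rcond=None)
    a_hat, b_hat = coef[0], -coef[1]
    p_ce = a_hat / (2 * max(b_hat, 1e-6))
    # Taboo interval around the running average of past prices; its half-width
    # shrinks over time (the t**-0.25 rate is an assumed choice for illustration)
    center = float(np.mean(prices))
    half_width = 2.0 * t ** -0.25
    if abs(p_ce - center) < half_width:
        direction = 1.0 if p_ce >= center else -1.0
        p_ce = center + direction * half_width     # push the price out of the taboo interval
    prices.append(p_ce)
    demands.append(a_true - b_true * p_ce + sigma * rng.standard_normal())

print("learned price:", round(prices[-1], 2), "true optimal price:", a_true / (2 * b_true))
```

Keeping the new price away from the average of past prices guarantees enough dispersion for the parameter estimates to keep improving, while the shrinking width limits the revenue sacrificed to experimentation.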
Article
Full-text available
Reinforcement learning policies face the exploration versus exploitation dilemma, i.e., the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss due to the fact that the globally optimal policy is not followed all the time. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
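One of the simple index policies from this line of work, commonly known as UCB1, adds a confidence radius to each arm's empirical mean. A compact Bernoulli-bandit simulation of it (the reward means below are made up) looks like this:

```python
import numpy as np

rng = np.random.default_rng(5)
true_means = np.array([0.3, 0.5, 0.7, 0.6])      # illustrative Bernoulli reward means
K, T = len(true_means), 10_000
counts = np.zeros(K)
sums = np.zeros(K)

for t in range(T):
    if t < K:
        arm = t                                   # play each arm once to initialize
    else:
        means = sums / counts
        # UCB1 index: empirical mean plus confidence radius sqrt(2 ln t / n_i)
        ucb = means + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    sums[arm] += reward

regret = T * true_means.max() - sums.sum()
print("pulls per arm:", counts.astype(int), "empirical regret:", round(regret, 1))
```

Suboptimal arms are pulled only O(log T) times because their confidence radii shrink as they are sampled, which is how the logarithmic regret arises.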
Article
Full-text available
We consider a single product revenue management problem where, given an initial inventory, the objective is to dynamically adjust prices over a finite sales horizon to maximize expected revenues. Realized demand is observed over time, but the underlying functional relationship between price and mean demand rate that governs these observations (otherwise known as the demand function or demand curve) is not known. We consider two instances of this problem: (i) a setting where the demand function is assumed to belong to a known parametric family with unknown parameter values; and (ii) a setting where the demand function is assumed to belong to a broad class of functions that need not admit any parametric representation. In each case we develop policies that learn the demand function "on the fly" and optimize prices accordingly. The performance of these algorithms is measured in terms of the regret: the revenue loss relative to the maximal revenues that can be extracted when the demand function is known prior to the start of the selling season. We derive lower bounds on the regret that hold for any admissible pricing policy, and then show that our proposed algorithms achieve a regret that is "close" to this lower bound. The magnitude of the regret can be interpreted as the economic value of prior knowledge on the demand function; manifested as the revenue loss due to model uncertainty.
Article
Full-text available
A retailer is endowed with a finite inventory of a nonperishable product. Demand for this product is driven by a price-sensitive Poisson process that depends on an unknown parameter that is a proxy for the market size. The retailer has a prior belief on the value of this parameter that he updates as time and available information (prices and sales) evolve. The retailer's objective is to maximize the discounted long-term average profits of his operation using dynamic pricing policies. We consider two cases. In the first case, the retailer is constrained to sell the entire initial stock of the nonperishable product before a different assortment is considered. In the second case, the retailer is able to stop selling the nonperishable product at any time and switch to a different menu of products. For both cases, we formulate the retailer's problem as a (Poisson) intensity control problem, derive structural properties of an optimal solution, and suggest a simple and efficient approximate solution. We use numerical computations, together with asymptotic analysis, to evaluate the performance of our proposed policy.
Article
Full-text available
Multitask Learning is an approach to inductive transfer that improves learning for one task by using the information contained in the training signals of other related tasks. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better. In this thesis we demonstrate multitask learning for a dozen problems. We explain how multitask learning works and show that there are many opportunities for multitask learning in real domains. We show that in some cases features that would normally be used as inputs work better if used as multitask outputs instead. We present suggestions for how to get the most out of multitask learning in artificial neural nets, present an algorithm for multitask learning with case-based methods such as k-nearest neighbor and kernel regression, and sketch an algorithm for multitask learning in decision trees. Multitask learning improves generalization performance, can be applied in many different kinds of domains, and can be used with different learning algorithms. We conjecture there will be many opportunities for its use on real-world problems.
Article
Full-text available
This paper considers the problem of changing prices over time to maximize expected revenues in the presence of unknown demand distribution parameters. It provides and compares several methods that use the sequence of past prices and observed demands to set price in the current period. A Taylor series expansion of the future reward function explicitly illustrates the tradeoff between short-term revenue maximization and future information gain and suggests a promising pricing policy referred to as a one-step look-ahead rule. An in-depth Monte Carlo study compares several different pricing strategies and shows that the one-step look-ahead rules dominate other heuristic policies and produce good short-term performance. The reasons for the observed bias of parameter estimates are also investigated.
Article
We consider a price-based network revenue management problem in which a retailer aims to maximize revenue from multiple products with limited inventory over a finite selling season. As is common in practice, we assume the demand function contains unknown parameters that must be learned from sales data. In the presence of these unknown demand parameters, the retailer faces a trade-off commonly referred to as the “exploration-exploitation trade-off.” Toward the beginning of the selling season, the retailer may offer several different prices to try to learn demand at each price (“exploration” objective). Over time, the retailer can use this knowledge to set a price that maximizes revenue throughout the remainder of the selling season (“exploitation” objective). We propose a class of dynamic pricing algorithms that builds on the simple, yet powerful, machine learning technique known as “Thompson sampling” to address the challenge of balancing the exploration-exploitation trade-off under the presence of inventory constraints. Our algorithms have both strong theoretical performance guarantees and promising numerical performance results when compared with other algorithms developed for similar settings. Moreover, we show how our algorithms can be extended for use in general multiarmed bandit problems with resource constraints as well as in applications in other revenue management settings and beyond.
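As a rough, single-resource sketch of combining Thompson sampling with an inventory constraint (the price grid, demand probabilities, inventory figures, and the simple LP over a randomized price menu below are illustrative assumptions and a simplification of the paper's multi-product formulation):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
prices = np.array([6.0, 9.0, 12.0])
true_d = np.array([0.8, 0.5, 0.25])               # illustrative purchase probabilities
T, inventory = 2000, 600                          # finite stock over the selling season
rate = inventory / T                              # per-period inventory budget
alpha = np.ones(len(prices))
beta = np.ones(len(prices))

for t in range(T):
    if inventory <= 0:
        break
    d = rng.beta(alpha, beta)                     # Thompson sample of demand at each price
    # LP over a randomized price menu x: maximize sampled revenue subject to the
    # average-inventory constraint and sum(x) <= 1 (leftover mass means "do not offer")
    res = linprog(c=-(prices * d),
                  A_ub=np.vstack([d, np.ones_like(d)]),
                  b_ub=np.array([rate, 1.0]),
                  bounds=[(0, 1)] * len(prices), method="highs")
    x = np.clip(res.x, 0, 1)
    if x.sum() > 1e-9 and rng.random() < min(x.sum(), 1.0):
        k = int(rng.choice(len(prices), p=x / x.sum()))
        sold = rng.random() < true_d[k]
        alpha[k] += sold
        beta[k] += 1 - sold
        inventory -= int(sold)

print("posterior demand means:", np.round(alpha / (alpha + beta), 2))
```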
Conference Paper
In this paper, we describe an exploration system that was implemented by the search-advertising team of a prominent web-portal to address the cold ads problem. The cold ads problem refers to the situation where, when new ads are injected into the system by advertisers, the system is unable to assign an accurate quality to the ad (in our case, the click probability). As a consequence, the advertiser may suffer from low impression volumes for these cold ads, and the overall system may perform sub-optimally if the click probabilities for new ads are not learnt rapidly. We designed a new exploration system that was adapted to search advertising and the serving constraints of the system. In this paper, we define the problem, discuss the design details of the exploration system, new evaluation criteria, and present the performance metrics that were observed by us.
Article
Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, dynamic pricing, recommendation, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.
Conference Paper
Matrix factorization (MF) collaborative filtering is an effective and widely used method in recommendation systems. However, the problem of finding an optimal trade-off between exploration and exploitation (otherwise known as the bandit problem), a crucial problem in collaborative filtering from cold start, has not been previously addressed. In this paper, we present a novel algorithm for online MF recommendation that automatically combines finding the most relevant items with exploring new or less-recommended items. Our approach, called Particle Thompson Sampling for Matrix Factorization, is based on the general Thompson sampling framework, but augmented with a novel efficient online Bayesian probabilistic matrix factorization method based on the Rao-Blackwellized particle filter. Extensive experiments in collaborative filtering using several real-world datasets demonstrate that our proposed algorithm significantly outperforms the current state of the art.
Article
One of the core problems of modern statistics is to approximate difficult-to-compute probability distributions. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation about the posterior. In this paper, we review variational inference (VI), a method from machine learning that approximates probability distributions through optimization. VI has been used in myriad applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of distributions and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this widely-used class of algorithms.
Article
This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. The approach is based on a bootstrap technique that uses a combination of observed and artificially generated data. The latter serves to induce a prior distribution which, as we will demonstrate, is critical to effective exploration. We explain how the approach can be applied to multi-armed bandit and reinforcement learning problems and how it relates to Thompson sampling. The approach is particularly well-suited for contexts in which exploration is coupled with deep learning, since in these settings, maintaining or generating samples from a posterior distribution becomes computationally infeasible.
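A minimal sketch of the bootstrap flavor of exploration described above, where a few artificially generated pseudo-observations per arm play the role of a prior (the arm means, the number and values of the pseudo-observations, and the horizon are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
true_means = np.array([0.2, 0.45, 0.6])           # illustrative Bernoulli reward means
K, T = len(true_means), 5000

# Artificial prior data: a few optimistic pseudo-observations per arm. These
# induce exploration in place of an explicit posterior (an assumed choice).
histories = [[1.0, 0.0, 1.0] for _ in range(K)]

for t in range(T):
    # Bootstrap step: resample each arm's history with replacement and act
    # greedily with respect to the resampled means
    boot_means = [np.mean(rng.choice(h, size=len(h), replace=True)) for h in histories]
    arm = int(np.argmax(boot_means))
    reward = float(rng.random() < true_means[arm])
    histories[arm].append(reward)

pulls = [len(h) - 3 for h in histories]            # subtract the 3 pseudo-observations
print("pulls per arm:", pulls)
```

The randomness of the resample plays the same role as drawing from a posterior in Thompson sampling, which is why the approach scales to settings where an explicit posterior is intractable.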
Article
This chapter aims to supplement the book Convex Optimization Theory, Athena Scientific, 2009 with material on convex optimization algorithms.
Article
In this paper, we develop a stylized partially observed Markov decision process (POMDP) framework, to study a dynamic pricing problem faced by sellers of fashion-like goods. We consider a retailer that plans to sell a given stock of items during a finite sales season. The objective of the retailer is to dynamically price the product in a way that maximizes expected revenues. Our model brings together various types of uncertainties about the demand, some of which are resolvable through sales observations. We develop a rigorous upper bound for the seller’s optimal dynamic decision problem and use it to propose an active-learning heuristic pricing policy. We conduct a numerical study to test the performance of four different heuristic dynamic pricing policies, in order to gain insights into several important managerial questions that arise in the context of revenue management.
Article
Contents: Quantity-Based RM; Single-Resource Capacity Control; Network Capacity Control; Overbooking; Price-Based RM; Dynamic Pricing; Auctions; Common Elements; Customer-Behavior and Market-Response Models; The Economics of RM; Estimation and Forecasting; Industry Profiles; Implementation.
Article
We consider a stylized dynamic pricing model in which a monopolist prices a product to a sequence of T customers who independently make purchasing decisions based on the price offered according to a general parametric choice model. The parameters of the model are unknown to the seller, whose objective is to determine a pricing policy that minimizes the regret, which is the expected difference between the seller's revenue and the revenue of a clairvoyant seller who knows the values of the parameters in advance and always offers the revenue-maximizing price. We show that the regret of the optimal pricing policy in this model is Θ(√T), by establishing an Ω(√T) lower bound on the worst-case regret under an arbitrary policy, and presenting a pricing policy based on maximum-likelihood estimation whose regret is O(√T) across all problem instances. Furthermore, we show that when the demand curves satisfy a "well-separated" condition, the T-period regret of the optimal policy is Θ(log T). Numerical experiments show that our policies perform well.
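A rough sketch of a maximum-likelihood pricing loop in this spirit, assuming a logistic choice model, a sparse forced-exploration schedule, and made-up parameters (this is an illustration of the general estimate-then-price pattern, not the paper's exact policy):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
price_grid = np.linspace(1.0, 10.0, 100)
theta_true = np.array([-4.0, 0.6])                 # illustrative logistic demand parameters

def buy_prob(theta, p):
    return 1.0 / (1.0 + np.exp(theta[0] + theta[1] * p))

def neg_log_lik(theta, p_hist, y_hist):
    q = np.clip(buy_prob(theta, p_hist), 1e-9, 1 - 1e-9)
    # small ridge term keeps the MLE well-posed when the data are nearly separable
    return -np.sum(y_hist * np.log(q) + (1 - y_hist) * np.log(1 - q)) + 1e-3 * np.sum(theta ** 2)

explore_prices = [3.0, 8.0]                        # two forced exploration prices (assumed)
p_hist, y_hist, theta_hat = [], [], np.zeros(2)

for t in range(1, 2000):
    explore = int(np.sqrt(t)) ** 2 == t            # sparse forced-exploration schedule (assumed)
    if explore or len(p_hist) < 4:
        p = explore_prices[t % 2]
    else:
        # exploit: post the revenue-maximizing price under the current estimate
        p = price_grid[int(np.argmax(price_grid * buy_prob(theta_hat, price_grid)))]
    y = float(rng.random() < buy_prob(theta_true, p))
    p_hist.append(p)
    y_hist.append(y)
    if explore and len(p_hist) >= 4:               # re-estimate after exploration rounds
        theta_hat = minimize(neg_log_lik, theta_hat,
                             args=(np.array(p_hist), np.array(y_hist)), method="BFGS").x

print("estimated parameters:", np.round(theta_hat, 2), "true parameters:", theta_true)
```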
Article
This paper considers the problem of setting prices dynamically to maximise expected revenues in a finite horizon model in which the demand parameters are unknown. At each decision epoch, the manager chooses a price and observes a binary response (buy or not) for each consumer visiting the website during that period. This paper focuses on comparing several good, easy-to-implement pricing policies. A Taylor series expansion of the future reward function explicitly illustrates the trade-off between short-term revenue maximisation and future information gains and suggests a pricing policy referred to as a one-step look-ahead rule. A Monte Carlo study compares several different pricing strategies and shows that the one-step look-ahead rule dominates other policies and produces good short-term performance. Journal of Revenue & Pricing Management (2005) 3, 320-336; doi:10.1057/palgrave.rpm.5170118
Article
Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave the highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this survey, we focus on two extreme cases in which the analysis of regret is particularly simple and elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model.
Article
We study a problem of dynamic pricing faced by a vendor with limited inventory, uncertain about demand, aiming to maximize expected discounted revenue over an infinite time horizon. The vendor learns from purchase data, so his strategy must take into account the impact of price on both revenue and future observations. We focus on a model in which customers arrive according to a Poisson process of uncertain rate, each with an independent, identically distributed reservation price. Upon arrival, a customer purchases a unit of inventory if and only if his reservation price equals or exceeds the vendor’s prevailing price.
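A minimal discrete-time sketch of the learning dynamic in this setting, assuming exponential reservation prices with a known mean, a Thompson-sampled arrival rate, and a crude fluid inventory heuristic (all of these are illustrative assumptions; the paper works in continuous time with discounting):

```python
import numpy as np

rng = np.random.default_rng(9)
mu = 8.0                                   # known mean of exponential reservation prices (assumed)
lam_true = 3.0                             # unknown Poisson arrival rate per period
T, inventory = 200, 400
a, b = 1.0, 1.0                            # Gamma(a, b) prior on the arrival rate
price_grid = np.linspace(1.0, 30.0, 200)

for t in range(T):
    if inventory <= 0:
        break
    lam_sample = rng.gamma(a, 1.0 / b)      # Thompson sample of the arrival rate
    frac_buy = np.exp(-price_grid / mu)     # P(reservation price >= posted price)
    # Fluid heuristic: expected sales capped by the remaining inventory per period
    exp_sales = np.minimum(lam_sample * frac_buy, inventory / (T - t))
    p = price_grid[int(np.argmax(price_grid * exp_sales))]
    # Thinned Poisson sales at the posted price, then a conjugate Gamma update
    sales = rng.poisson(lam_true * np.exp(-p / mu))
    sales = min(sales, inventory)           # cannot sell more than the remaining stock
    a += sales
    b += np.exp(-p / mu)
    inventory -= sales

print("posterior mean arrival rate:", round(a / b, 2), "true rate:", lam_true)
```

Because the seller only observes thinned sales, higher prices reveal less about the arrival rate, which is exactly the revenue-versus-information tension the abstract describes.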
Agarwal, Alekh, Bird, Sarah, Cozowicz, Markus, Hoang, Luong, Langford, John, Lee, Stephen, Li, Jiaji, Melamed, Dan, Oshri, Gal, Ribas, Oswaldo, et al. A multiworld testing decision service. arXiv preprint arXiv:1606.03966, 2016.
Aghion, Philippe, Bolton, Patrick, Harris, Christopher, and Jullien, Bruno. Optimal learning by experimentation. The Review of Economic Studies, 58(4):621-654, 1991.
Gopalan, Aditya, Mannor, Shie, and Mansour, Yishay. Thompson sampling for complex online problems. In International Conference on Machine Learning, pp. 100-108, 2014.