Business analytics meets artificial intelligence:
Assessing the demand effects of discounts on Swiss train tickets
Martin Huber*, Jonas Meier**, and Hannes Wallimann+
*University of Fribourg, Dept. of Economics and
Center for Econometrics and Business Analytics, St. Petersburg State University
**University of Bern, Dept. of Economics
+University of Applied Sciences and Arts Lucerne, Competence Center for Mobility
Abstract: We assess the demand effects of discounts on train tickets issued by the Swiss Federal Railways, the so-called
‘supersaver tickets’, based on machine learning, a subfield of artificial intelligence. Considering a survey-based sample of
buyers of supersaver tickets, we investigate which customer- or trip-related characteristics (including the discount rate)
predict buying behavior, namely: booking a trip otherwise not realized by train, buying a first- rather than second-class
ticket, or rescheduling a trip (e.g. away from rush hours) when being offered a supersaver ticket. Predictive machine learning
suggests that customer’s age, demand-related information for a specific connection (like departure time and utilization),
and the discount level permit forecasting buying behavior to a certain extent. Furthermore, we use causal machine learning
to assess the impact of the discount rate on rescheduling a trip, which seems relevant in the light of capacity constraints at
rush hours. Assuming that (i) the discount rate is quasi-random conditional on our rich set of characteristics and (ii) the
buying decision increases weakly monotonically in the discount rate, we identify the discount rate’s effect among ‘always
buyers’, who would have traveled even without a discount, based on our survey that asks about customer behavior in the
absence of discounts. We find that on average, increasing the discount rate by one percentage point increases the share
of rescheduled trips by 0.16 percentage points among always buyers. Investigating effect heterogeneity across observables
suggests that the effects are higher for leisure travelers and during peak hours when controlling several other characteristics.
Keywords: Causal Machine Learning, Double Machine Learning, Treatment Effect, Business Analytics, Causal Forest, Public Transportation.
JEL classification: C21, R41, R48.
Acknowledgments: We are grateful to the SBB Research Fund for financial support. Furthermore, we are indebted to Pierre Chevallier and
Philipp Wegelin for their helpful discussions. Addresses for correspondence: Martin Huber, University of Fribourg, Bd. de Pérolles 90, 1700 Fribourg,
Switzerland; Jonas Meier, University of Bern, Schanzeneckstrasse 1, 3001 Bern, Switzerland;
Hannes Wallimann, University of Applied Sciences and Arts Lucerne, Rösslimatte 48, 6002 Luzern, Switzerland.
arXiv:2105.01426v1 [econ.GN] 4 May 2021
1 Introduction
Organizing public transport involves a well-known trade-off between consumer welfare and
provider revenue. Typically, consumers value frequency, reliability, space, and low fares (Red-
man, Friman, G¨
arling, and Hartig,2013) while suppliers aim at operating with a minimum
number of vehicles to maximize profits. In general, the allocation can be improved as providers
do not account for the positive externalities on consumers (Mohring,1972). In particular, service
frequency reduces travelers’ access and waiting costs. This so-called ‘Mohring-effect’ leads to
economies of scale, implying the need for subsidies to achieve the first-best solution in terms of
welfare. Consequently, it may be socially optimal to subsidize railway companies to reduce fares
(Parry and Small, 2009). To assess such a measure's effectiveness on demand, policy-makers
would need to know how individuals respond to lower fares. However, it is generally challenging
to identify causal effects of discounts on train tickets (or on goods and services in general) due
to confounding or selection. For instance, discounts might typically be provided for dates or hours
with low train utilization such that connections with and without discount are not comparable
in terms of baseline demand. A naive comparison of sold tickets with and without discount
would therefore mix the influence of the discount with that of baseline demand. In this con-
text, we apply machine learning (a subfield of artificial intelligence) to convincingly assess how
discounts on train tickets for long-distance connections in Switzerland, the so-called ‘supersaver
tickets’, affect demand, by exploiting a unique data set of the Swiss Federal Railways (SBB)
that combines train utilization records with a survey of supersaver buyers.
More concisely, our study provides two use cases of machine learning for business analytics
in the railway industry: (i) Predicting buying behavior among supersaver customers, namely
whether customers booked a trip otherwise not realized by train (additional trip), bought a
first-class rather than a second-class ticket (upselling), or rescheduled their trip e.g. away from
rush hours (demand shift); (ii) analysing the causal effect of the discount on demand shifts
among customers that would have booked the trip even without discount. This is feasible
because our unique survey contains information on how supersaver buyers would have decided
in the absence of a discount, e.g. whether they are so-called ‘always buyers’ and would have
booked the connection even at the regular fare. For both prediction and causal analysis, we
make use of appropriately tailored machine learning techniques, which learn the associations
between the demand outcomes of interest, the discount rate, and further customer- or trip-related
characteristics in a data-driven way and help avoid model misspecification. Such a
targeted combination of predictive and causal machine learning can therefore improve demand
forecasting and decision-making in companies and organizations. While predictive machine
learning permits optimizing forecasts about demand and customer behavior as a function of
observed characteristics, causal machine learning permits evaluating the causal effect of specific
interventions like a discount regime for optimizing the offer of such discounts. Concerning the
prediction task, we use the so-called random forest, see Breiman (2001), as machine learner to
forecast the supersaver customers’ behavior and obtain an accuracy or correct (out of sample)
classification rate of 58% (demand shift), 65% (additional trip), and 82% (upselling), respectively.
Trip-related characteristics like seat capacity, utilization, departure time, and the discount rate,
but also customer’s age turn out to be strong predictors.
Concerning the causal analysis (which is more challenging than mere prediction), we impose
(i) a selection-on-observables assumption implying that the discount rate is as good as randomly
assigned when controlling for our rich set of trip- and demand-related characteristics and (ii)
weak monotonicity of any individual’s decision to purchase an additional trip (otherwise not
realized) in the discount rate, implying that a higher (rather than lower) discount does either
positively or not affect any customer’s buying decision. As a methodological contribution, we
formally show how these assumptions permit tackling the selectivity of discount rates and survey
response to identify the discount rate’s effect on demand shifts (rescheduling away from rush
hours) for the subgroup ‘always buyers’, based on the survey information on how customers
would have behaved in the absence of a discount. Furthermore, we discuss testable implications
of monotonicity, namely that among all survey respondents, the share of additional trips must
increase in the discount rate, and of the selection-on-observables assumption, requiring that conditional on trip- and demand-related characteristics, the discount must not be associated with
personal characteristics (like age or gender) among always buyers. Hypothesis tests do not point
to a violation of these implications.
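The first implication can be illustrated with a simple slope test on synthetic data; the linear specification and all variables below are illustrative assumptions, not the paper's actual test:

```python
# Sketch of the monotonicity check: among all survey respondents, the share
# of 'additional trips' should weakly increase in the discount rate. One
# simple version regresses the additional-trip indicator on the discount
# rate and inspects the slope's sign and p-value. Data are synthetic.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)
n = 5000
discount = rng.uniform(0.0, 0.7, n)   # discount rate in [0, 0.7]
# Synthetic buying behavior that weakly increases in the discount
additional_trip = (rng.uniform(0, 1, n) < 0.2 + 0.3 * discount).astype(float)

res = linregress(discount, additional_trip)
# Monotonicity is not rejected if the slope is non-negative
print(res.slope > 0, res.pvalue < 0.05)
```

A negative and statistically significant slope would cast doubt on the monotonicity assumption; a positive slope is consistent with it.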
Based on our causal identification strategy, we estimate the marginal effect of slightly increas-
ing the (continuously distributed) discount rate based on the causal forest (CF), see Wager and
Athey (2018) and Athey, Tibshirani, and Wager (2019), and find that on average, increasing the
discount rate by one percentage point increases the share of rescheduled trips by 0.16 percentage
points among always buyers. In a second approach, we binarize the discount rates by splitting
them into two discount categories of less than 30% (relative to the regular fare) and 30% or
more. Applying double machine learning (DML), see Chernozhukov, Chetverikov, Demirer, Du-
flo, Hansen, Newey, and Robins (2018), we find that discount rates of 30% and more on average
increase the share of rescheduled trips 3.6 percentage points, which is in line with the CF-based
results. Our paper therefore provides the first empirical evidence (at least for Switzerland) that
such discounts can help balancing out train utilization across time and reducing overload during
peak hours, albeit the magnitude of the impact on always buyers appears limited.
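The DML step can be sketched in a partially linear model with cross-fitting; the data are synthetic, and the random-forest learners, fold choice, and effect size are illustrative assumptions rather than the paper's exact specification:

```python
# Minimal sketch of double machine learning (Chernozhukov et al., 2018) in a
# partially linear model: cross-fitted random forests residualize both the
# binarized discount D and the demand-shift outcome Y on controls X; the
# effect is the residual-on-residual regression coefficient. Data synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
n, true_effect = 4000, 0.036
X = rng.normal(size=(n, 5))                 # controls (trip/demand features)
# Treatment: indicator for a discount of 30% or more, confounded by X
D = (rng.uniform(size=n) < 1 / (1 + np.exp(-X[:, 0]))).astype(float)
Y = true_effect * D + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=n)

res_D, res_Y = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=1).split(X):
    m_d = RandomForestRegressor(n_estimators=200, random_state=1).fit(X[train], D[train])
    m_y = RandomForestRegressor(n_estimators=200, random_state=1).fit(X[train], Y[train])
    res_D[test] = D[test] - m_d.predict(X[test])  # treatment residuals
    res_Y[test] = Y[test] - m_y.predict(X[test])  # outcome residuals

theta = np.sum(res_D * res_Y) / np.sum(res_D ** 2)  # residual-on-residual OLS
print(round(theta, 3))
```

Cross-fitting (estimating the nuisance models on one fold and residualizing the other) is what allows flexible machine learners without overfitting bias in the effect estimate.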
When investigating the heterogeneity of effects across all of our observed characteristics using
the CF, our results suggest that demand-related trip characteristics (like seat capacity, utiliza-
tion, departure time, and distance) have some predictive power for the size of the discounts’
impact on shifting demand. Such information on heterogeneous effects appears interesting for
optimizing the allocation of discounts for the purpose of shifting demand, as the SBB has (due to
its monopoly in the Swiss long-distance passenger rail market) agreed with the Swiss price mon-
itoring agency to provide a fixed amount of discounted tickets per year, but is free to choose the
timing and connections for discounts. In a second heterogeneity analysis, we investigate whether
effects differ systematically across a pre-selected set of characteristics, namely: age, gender, pos-
session of a half fare travel card, travel distance, whether the purpose is business, commute, or
leisure, and whether the departure time is during peak hours. Using the regression approach
of Semenova and Chernozhukov (2020), we find that conditional on the other characteristics,
the effects of increasing the discount by one percentage point on rescheduling are by more than
0.2 percentage points higher during peak hours and for leisure travelers, differences that are
statistically significant at the 10% level, albeit without controlling for multiple hypothesis
testing. These effects appear plausible as leisure travelers are likely more flexible and discounts
during peak hours make trips at times of increased demand even more attractive. We do not find
statistically significant effect differences for the other pre-selected characteristics, which could,
however, be due to the (for the purpose of investigating effect heterogeneity) limited sample of
several thousand observations.
Our paper is related to a growing literature applying statistical and machine learning meth-
ods for analyzing transport systems, as well as to methodological studies on causal inference for
so-called principal strata, see Frangakis and Rubin (2002), i.e. endogenous subgroups like the al-
ways buyers. Typically, it is hard to identify the causal effect of some treatment (or intervention)
like a discount on such a non-randomly selected subgroup that is defined in terms of how a post-treatment
variable (e.g. the buying decision) depends on the treatment (e.g. the discount). One approach is to
give up on point identification and instead derive upper and lower bounds on a set of possible
effects for groups alike the always buyers based on the aforementioned monotonicity assumption
(and possibly further assumptions about the ordering of outcomes of always buyers and other
individuals), see for instance Zhang and Rubin (2003), Zhang, Rubin, and Mealli (2008), Imai
(2008), Lee (2009), and Blanco, Flores, and Flores-Lagunes (2011). Alternatively, the treatment
effect on always buyers is point-identified when invoking a selection-on-observables or instru-
mental variable assumption for selection into the survey, see for instance Huber (2014), which
requires sufficiently rich data on both survey participants and non-participants for modeling
survey participation. In contrast to these previous studies, the approach in this paper point-
identifies the treatment effect by exploiting the rather unique survey feature that customers were
asked about their behavior in the absence of the discount, which under monotonicity permits
identifying the principal stratum of always buyers directly in the data.
Furthermore, our work is related to conceptual studies on transport systems, considering for
instance the previously mentioned positive externalities of an increased service for customers
that are not accounted for by transportation providers. Such externalities typically arise from
economies of scale due to fixed costs and a 'Mohring effect', implying that a higher service frequency
reduces waiting costs (Mohring, 1972). The study by Parry and Small (2009) suggests that
lower fares can boost overall welfare by increasing economies of scale (off-peak) and decreasing
pollution and accidents (at peaks). Similarly, De Palma, Lindsey, and Monchambert (2017) argue
that time-dependent ticket prices may increase overall welfare as overcrowding during peak hours
is suboptimal for both consumers and providers. As public transport is usually highly subsidized,
governments may directly manage the trade-off mentioned above. As this involves taxpayer
money, it is a question of general interest how the subsidies should be designed. Based on their
results, Parry and Small (2009) conclude that even substantial subsidies are justified due to
lower fares’ positive welfare effect. In contrast, Basso and Silva (2014) find that the contribution
of transit subsidies to welfare diminishes once congestion is taxed and alternatives, such as
dedicated bus lanes, are available. Irrespective of the specific policy instrument, the consumer's willingness to shift
demand drives these policies’ effectiveness. While many factors affect this willingness, most
studies conclude that consumers are price sensitive (Paulley, Balcombe, Mackett, Titheridge,
Preston, Wardman, Shires, and White, 2006). In this context, we aim to contribute to a
better understanding of how time-dependent pricing translates to consumer decisions.
More broadly, our paper relates to the literature on policies targeting demand shifts. Among
these, the setting of car parking costs, fiscal regulations, or even free public transport has
been analyzed (e.g. Batty, Palacin, and González-Gil, 2015; Rotaris and Danielis, 2014; Zhang,
Lindsey, and Yang, 2018; De Witte, Macharis, Lannoy, Polain, Steenberghen, and Van de Walle,
2006). Another stream of literature applies machine learning algorithms in the context of public
transport. Examples are short-term traffic flow forecasts for bus rapid transit (Liu and Chen,
2017) or metro (Liu, Liu, and Jia, 2019) services. Further, Hagenauer and Helbich (2017) and
Omrani (2015) implement machine learning algorithms to predict travel mode choices. Yap and
Cats (2020) predict disruptions and their passenger delay impacts for public transport stops. In
other research fields, applications of causal (rather than predictive) machine learning are also on
the rise (see for instance Yang, Chuang, and Kuan, 2020; Knaus, 2021). This is, to the best of
our knowledge, the first study using causal machine learning in the context of public transport.
Finally, a growing literature discusses the opportunities of data-driven business decision-making
(Brynjolfsson and McElheran, 2016) by assessing the relevance of predictive and causal machine
learning. Ascarza (2018) and Hünermund, Kaminski, and Schmitt (2021) show that companies
may gain by designing their policies based on causal machine learning. For instance, firms can
may gain by designing their policies based on causal machine learning. For instance, firms can
target the relevant consumers much more effectively when accounting for their heterogeneity in
terms of reaction to a treatment. Our study provides a use case of how the machine learning-based
assessment of discounts could also be implemented in other businesses and industries
facing capacity constraints.
This paper proceeds as follows. Section 2 presents the institutional setting of passenger railway transport in Switzerland. Section 3 describes our data, coming from a unique combination
of a customer survey and transport utilization data. Section 4 discusses the identifying assumptions underlying the causal machine learning approach as well as testable implications. Section
5 outlines the predictive and causal machine learning methods. Section 6 presents the empirical
results. Section 7 concludes.
2 Institutional Background
The railway system in Switzerland is known for its high quality of service. Examples include the
high level of system integration with frequent services, synchronized timetables, and comprehen-
sive fare integration, see Desmaris (2014). In Switzerland, a country of railway tradition, the
state-owned incumbent Swiss Federal Railways (SBB) operates the long-distance passenger rail
market as a monopolist (Thao, von Arx, and Frölicher, 2020). Furthermore, nationally operating
long-distance coaches may only be approved if they do not ‘substantially’ compete with existing
services. Thus, the SBB competes exclusively with motorized private transport in Swiss long-
distance traffic. The company also owns most of the rail infrastructure, which is funded by the
Federal Government. However, since the end of 2020 the companies Berne-Lötschberg-Simplon
Railways (BLS) and Southeast Railways (SOB) have operated a few links on behalf of the SBB. Unlike regional public transport, which Swiss taxpayers subsidize with approximately CHF 1.9
bn per year, the operation of long-distance public transport itself has to be self-sustaining.
Because of the monopoly position of the SBB in long distance passenger transport, the
prices are screened by the Swiss ‘price watchdog’ (or price monitoring agency) to prevent abuse.
Based on the price monitoring act, the watchdog keeps a permanent eye on how prices and profits
develop. By the end of 2014, the watchdog concluded that the SBB's prices were too high. As
a consequence and through a mutual agreement, the SBB and the Swiss price watchdog agreed
on a significantly higher supply of supersaver tickets, which were first offered in 2009. Using a
supersaver ticket, customers can travel on long distance public transport routes with a discount
of up to 70%. Thereafter, additional agreements were regularly reached regarding number and
scope of the supersaver tickets. While only a few thousand supersaver tickets were sold in 2014,
sales increased to about 8.8 million in 2019, see Lüscher (2020).
From the SBB’s perspective, these tickets can serve two purposes. First, the tickets might
be used as means to balance out the utilization of transport services. For instance, supersaver
tickets could reduce the high demand during peak hours which is a key challenge for public
transport. Thus, balancing the demand may reduce delays and increase the number of free seats,
which is valued by consumers. The average seat load of the SBB amounts to 30% in long-distance
passenger transport.1 For this reason, there is, in the literal sense, room for improving
the allocation. Second, price-sensitive customers can be acquired during off-peak hours at rather
negligible marginal costs.
Despite the increasing interest in supersaver tickets in recent years, many users of the
Swiss public transport network purchase a so-called 'general abonnement' travel ticket
(GA). This (annually renewed) subscription provides free and unlimited access to the public
transport network in Switzerland. In 2019, about 0.5 million individuals owned a GA in Switzer-
land, roughly 6% of the Swiss population. The GA’s cost amounts to 3,860 and 6,300 Swiss
francs for the second and first class, respectively. In the same year, about 2.7 million individuals
held a relatively cheap half fare travel ticket amounting to 185 Swiss francs. The latter implies
a price reduction of 50% for public transport tickets in Switzerland. Overall, discounts provided
through supersaver tickets are slightly lower for owners of half fare tickets, as the SBB aims to
attract non-regular public transport users. In our causal analysis, we therefore also control for
the possession of a half fare ticket.
3 Data
To investigate supersaver tickets’ effect, we use a unique cross-sectional data set provided by the
SBB. Our sample consists of randomly surveyed buyers of supersaver tickets that purchased their
tickets between January 2018 and December 2019. These survey data are matched with data on
distances between any two railway stops as well as utilization-related information relevant for
the supply and calculation of discounts. In Section 6, we provide descriptive statistics for these data.
3.1 Survey Data
The customer survey is our primary data source. It for instance includes the outcome variable
‘demand shift’, a binary indicator of whether an interviewee rescheduled her or his trip due to
buying a supersaver ticket. ‘Yes’ means that the departure time has been advanced or postponed
because of the discount. A second variable characterizing customer behavior is an indicator for
upselling, i.e. whether someone purchased a first rather than a second class ticket as a reaction
to the discount. Another question asks whether an interviewee would have bought the train
trip in the same or a higher class even without being offered a discount, which permits judging
whether an additional trip has been sold through offering the discount and allows identifying
the subgroup of always buyers under the assumptions outlined further below. Our continuously
distributed treatment variable is the discount rate of a supersaver ticket relative to the standard
fare, which may take positive values of up to 70%.
Furthermore, we observe two kinds of covariates, namely trip- or demand-related factors
and personal characteristics of the interviewee. The former are important control variables for
our causal identification strategy outlined below and include the difference between the days of
purchase and travel, the weekday, month, and year, an indicator for buying a half fare ticket,
departure time, peak hour,2number of tickets purchased per person, class (first or second),
indicators for leisure trips, commutes, or business trips, the number of companions (by children
and adults if any) and a judgment of how complicated the ticket purchase was on a scale from
1 (complicated) to 10 (easy). Furthermore, it consists of indicators for the point of departure,
destination, and public holidays. The personal characteristics include age, gender, migrant
status, language (German, French, Italian), and indicators for owning a half fare travel ticket
or other subscriptions like those of regional tariff associations, specific connections, and Gleis 7
(‘rail 7’). The latter is a travelcard for young adults not older than 25, providing free access to
public transport after 7pm.
3.2 Factors Driving the Supply of Supersaver Tickets
In addition to the survey, we have access to factors determining the supply of supersaver tickets
with various discounts. This is crucial for our causal analysis, which hinges on controlling for
all characteristics jointly affecting the discount rate and the demand-shift outcome. While
information on the distances between railway stops in Switzerland is publicly available,3the SBB
provides us for the various connections with information on utilization data, the number of offered
seats, and contingency schemes, which define the quantity of offered discounts. This allows us to
2 Peak hour is defined as a departure time between 6:00am and 8:59am or between 4:00pm and 6:59pm, from Monday to Friday. These time windows are chosen on the basis of the SBB's train-path prices. For further details, see
train-path-price.html (accessed on March 24, 2021).
3 See the Open Data Platform of the SBB: (accessed on March 24, 2021), which provides the distances between any stops on a railway route.
account for travel distance, offered seats, capacity utilization, and quantities of offered supersaver
tickets for various discount levels as well as quantities of supersaver tickets already sold (both
quantities at the time of purchase). Furthermore, we create binary indicators for the 27 different
contingency schemes of the SBB present in our data, which change approximately every month.
The variables listed in the previous paragraph are important, as the SBB calculates the
supply of supersaver tickets based on an algorithm considering four types of inputs: demand
forecasts, advance booking deadlines, the number of supersaver tickets already sold, and contingency
schemes defining the amount and the size of offered discounts based on the three previous inputs.
The schemes are set as a function of the SBB’s self-imposed goals such as customer satisfaction
but also depend on the requirements imposed by the price watchdog. The algorithm calculates a
journey’s final discount as a weighted average of all discounts between any two adjacent railway
stops along a journey. The weights depend on the distances of the respective subsections of the
trip. To approximate the (not directly available) demand forecasts of the SBB, we consider the
quarterly average of capacity utilization and the number of offered seats for any two stops, which
are available by (exact) departure time, workday, class, and weekend. In addition, we make use
of indicators for place of departure, destination, month, year, weekday and public holidays. We
use this information to reconstruct the amount and size of offered discounts by taking values
from the contingency schemes that correspond to our demand forecast approximation combined
with the difference between buying and travel days. Comparing this amount and size of offered
discounts with a buyer’s discount, we estimate the number of supersaver tickets already sold for
the exact date of purchase.
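The distance-weighted averaging rule for a journey's final discount can be sketched as follows; the section distances and discounts are hypothetical:

```python
# Sketch of the journey-level discount rule described above: the final
# discount is the distance-weighted average of the discounts offered on the
# sections between adjacent stops along the journey.
def journey_discount(sections):
    """sections: list of (distance_km, section_discount) pairs."""
    total_km = sum(km for km, _ in sections)
    return sum(km * disc for km, disc in sections) / total_km

# Hypothetical three-section journey: 40 km at 30%, 25 km at 50%, 35 km at 20%
sections = [(40, 0.30), (25, 0.50), (35, 0.20)]
print(round(journey_discount(sections), 3))  # -> 0.315
```

Longer sections thus dominate the final rate, which is why the 25 km section at 50% moves the journey discount only moderately above 30%.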
3.3 Sample Construction
Our initial sample contains 12,966 long-distance train trips that cover 61,469 sections between
two adjacent stops. For 12.2% of these sections, there is no information on the capacity utiliza-
tion available, which can be due to various reasons. First, for some cases, capacity utilization
data is missing. Second, passengers traveling long-distance may switch to regional transport in
exceptional cases causing problems for determining utilization. A further reason could be issues
in data processing. Altogether, missing information occurs in 3,967 trips of our initial sample.
We tackle this problem by dropping all journeys with more than 50% of missing information,
which is the case for 320 trips or 2.5% of our initial sample. After this step, our evaluation
sample consists of 12,646 trips. For the remaining 3,647 trips with missing information (which
now account for a maximum of 50% of all sections of a journey), we impute capacity utilization
as the average of the remaining sections of a trip. In our empirical analysis, we include an
indicator for whether some trip information has been imputed as well as the share of imputed
values for a specific trip as control variables. Finally, we note that our causal analysis makes (in
contrast to the predictive analysis) only use of a subsample, namely observations identified as
always buyers who would have traveled even without a discount, all in all 6,112 observations.
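The sample-construction rules above (drop journeys with more than 50% of sections missing utilization data, impute the rest with the trip average, and keep imputation indicators as controls) can be sketched with a toy table; the column names and values are illustrative, not the SBB data:

```python
# Toy illustration of the sample-construction steps for section-level
# capacity utilization with missing values.
import numpy as np
import pandas as pd

sections = pd.DataFrame({
    "trip_id": [1, 1, 1, 2, 2, 2, 2],
    "utilization": [0.4, np.nan, 0.6, np.nan, np.nan, np.nan, 0.5],
})

# Share of missing sections per trip; drop trips with more than 50% missing
share_missing = sections.groupby("trip_id")["utilization"].apply(lambda u: u.isna().mean())
keep = share_missing[share_missing <= 0.5].index
df = sections[sections["trip_id"].isin(keep)].copy()

df["imputed"] = df["utilization"].isna()  # indicator used as a control variable
df["utilization"] = df.groupby("trip_id")["utilization"].transform(
    lambda u: u.fillna(u.mean())          # trip-average imputation
)
print(sorted(df["trip_id"].unique()), df["utilization"].round(2).tolist())
```

Here trip 2 (three of four sections missing) is dropped, while trip 1's single missing section is imputed with the trip's average utilization of 0.5.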
4 Identification
We subsequently formally discuss the identification strategy and assumptions underlying our
causal analysis of the discounts among always buyers.
4.1 Definition of Causal Effects
Let D denote the continuously distributed treatment 'discount rate' and Y the outcome 'demand
shift', a binary indicator for rescheduling a trip due to being offered a discount. More generally,
capital letters represent random variables in our framework, while lower case letters represent
specific values of these variables. To define the treatment effects of interest, we make use of
the potential outcome framework, see for instance Rubin (1974). To this end, Y(d) denotes the
potential outcome hypothetically realized when the treatment is set to a specific value d in the
interval [0, Q], with 0 indicating no discount and Q indicating the maximum possible discount.
For instance, Q = 0.7 would imply the maximum discount of 70% of a regular ticket fare. The
realized outcome corresponds to the potential outcome under the treatment actually received,
i.e. Y = Y(D), while the potential outcomes under discounts different to the one received remain
unknown without further statistical assumptions.
A further complication for causal inference is that our survey data only consist of individuals
that purchased a supersaver ticket, a decision that is itself an outcome of the treatment, i.e.
the size of the discount. Denoting by S a binary indicator for purchasing a supersaver ticket
and by S(d) the potential buying decision under discount rate d, this implies that we only
observe outcomes Y for individuals with S = 1. In general, making the survey conditional on
buying introduces Heckman-type sample selection (or collider) bias, see Heckman (1976) and
Heckman (1979), if unobserved characteristics affecting the buying decision S also affect
the inclination to shift the timing of the train journey Y. Furthermore, it is worth noting
that S = S(D) implies that buying a supersaver ticket is conditional on receiving a non-zero
discount. For this reason, non-treated subjects paying regular fares (with D = 0) are not
observed in our data. Yet, the outcome in our sample is defined relative to the behavior without
treatment, as Y indicates whether a passenger has changed the timing of the trip because of
a discount. This implies that Y(0) = 0 by definition, such that the causal effect of some positive
discount d vs. no discount, Y(d) − Y(0) = Y(d), is directly observable among observations that
actually received d. However, it also appears interesting to investigate whether the demand shift
effect varies across different (non-zero) discount rates d ∈ (0, Q] to see whether the size matters.
This is complicated by the fact that supersaver customers with different discount rates that are
observed in our data might in general differ importantly in terms of background characteristics
also affecting the outcome, exactly because they bought their trip and were selected into the
survey under non-comparable discount regimes. Our causal approach aims at tackling exactly
this issue to establish customer groups that are comparable across discount rates in order to
identify the effect of the latter.
Based on the potential outcome notation, we can define different causal parameters of interest. For
instance, the average treatment effect (ATE) of providing discount level d vs. d′ (for d ≠ d′) on
outcome Y, denoted by Δ(d, d′), corresponds to

Δ(d, d′) = E[Y(d) − Y(d′)].    (1)
Furthermore, the average partial effect (APE) of marginally increasing the discount level at
D = d, denoted by θ(d), is defined as

θ(d) = E[∂Y(d)/∂d].    (2)

Accordingly, θ(D) corresponds to the APE when marginally increasing the actually received
discount of any individual (rather than imposing some hypothetical value d for everyone).
The identification of these causal parameters based on observable information requires rather
strong assumptions. First, it implies that confounders jointly affecting D and Y can be con-
trolled for by conditioning on observed characteristics. In our context, this appears plausible,
as treatment assignment is based on variables related to demand (like weekday or month),
contingency schemes, capacity utilization, and supersaver tickets already sold, all of which are
available in our data, as described in Section 3. Second, identification requires that selection
S is as good as random (i.e., not associated with outcome Y) given the observed characteristics
and the treatment, an assumption known as missing at random (MAR), see for instance Rubin
(1976) and Little and Rubin (1987). However, the latter condition appears unrealistic in our
framework, as our data lack important socio-economic characteristics likely affecting preferences
and reservation prices for public transport, namely education, wealth, or income. For this rea-
son, we argue that the ATE and APE among the individuals selected for the survey (S = 1),
i.e. conditional on buying a supersaver ticket, which are defined as

Δ_{S=1}(d, d′) = E[Y(d) − Y(d′) | S = 1],   θ_{S=1}(D) = E[∂Y(D)/∂D | S = 1],    (3)

cannot be plausibly identified either. The reason is that if an increase in the discount rate
induces some customers to buy a supersaver ticket, then buyers with lower and higher discounts
will generally differ in terms of their average reservation prices and related characteristics (such
as education or income), which likely also affect the demand-shift outcome Y.
To tackle this sample selection issue, we exploit the fact that our data provide information
on whether the supersaver customers would have purchased a ticket for this specific train trip
also in the absence of any discount. Provided that the interviewees give accurate responses, we
thus have information on S(0), the hypothetical buying decision without treatment. Under the
assumption that each customer's buying decision is weakly monotonic in the treatment, in the
sense that anyone purchasing a trip in a specific travel class (e.g., second class) without discount
would also buy it for that class in the case of any positive discount, this permits identifying the
group of always buyers. Importantly, we therefore define always buyers as those who would buy
the trip without discount and would not switch to a lower travel class (namely second rather than first class). For
always buyers, S(0) = S(d) = 1 for any d > 0, such that their buying decision is always one
and thus not affected by the treatment, implying the absence of the selection problem. In the
denomination of Frangakis and Rubin (2002), the always buyers constitute a so-called principal
stratum, i.e., a subpopulation defined in terms of how the selection reacts to different treatment
intensities. Therefore, sample selection bias does not occur within such a stratum, in which
selection behavior is by definition homogeneous. For this reason, we aim at identifying the ATE
and APE on the always buyers:

Δ_{S(0)=1}(d, d′) = E[Y(d) − Y(d′) | S(0) = 1] = E[Y(d) − Y(d′) | S(0) = S(d″) = 1] for d″ ∈ (0, Q],
θ_{S(0)=1}(D) = E[∂Y(D)/∂D | S(0) = 1] = E[∂Y(D)/∂D | S(0) = S(d″) = 1],    (4)

where the second equality follows from the monotonicity of S in D that is formalized further below.
Figure 1: Causal framework
Figure 1 provides a graphical illustration of our causal framework based on a directed acyclic
graph, with arrows representing causal effects. Observed covariates X that are related to demand
are allowed to jointly affect the discount rate D and the demand-shift outcome Y. X may also
influence the potential purchasing decision under a hypothetical treatment S(d), implying that
buying a ticket given a specific discount depends on observed demand drivers like weekday,
month, etc. Furthermore, unobserved socio-economic characteristics V (like the reservation
price) likely affect both S(d) and Y. This introduces sample selection when conditioning on S,
e.g. by only considering survey respondents (S = 1), even when controlling for X. We also note
that S is deterministic in D and S(d) (as S = S(D)). Sample selection arises because conditional
on S = 1, D is associated with V, which also affects Y, thus entailing confounding of the
treatment-outcome relation. A reason for this is for instance that buyers under higher and lower
discounts are generally not comparable in terms of their reservation prices. In the terminology
of Pearl (2000), S is a collider that opens up a backdoor path between D and Y through V.
Theoretically, this could be tackled by jointly conditioning on the potential selection states under
the treatment values d vs. d′ considered in the causal analysis, namely S(d) and S(d′), as controls for
the selection behavior. This is typically not feasible in empirical applications, where only the
potential selection state corresponding to the actual treatment assignment is observed, S = S(D). In
our application, however, we do have information on S(0) and can thus condition on being an
always buyer under the mentioned monotonicity assumption.
4.2 Identifying Assumptions
We now formally introduce the identification assumptions underlying our causal analysis.
Assumption 1 (identifiability of selection under non-treatment):
S(0) is known for all subjects with S = 1.
Assumption 1 is satisfied in our data in the absence of misreporting, as subjects have been asked
whether they would have bought the train trip even in the absence of a discount.
Assumption 2 (conditional independence of the treatment):
Y(d), S(d) ⊥ D | X for all d ∈ (0, Q].
By Assumption 2, there are no unobservables jointly affecting the treatment assignment on the
one hand and the potential outcomes or selection states under any positive treatment value
on the other hand, conditional on covariates X. This assumption is satisfied if the treatment
is quasi-random conditional on our demand-related factors X. Note that the assumption also
implies that Y(d) ⊥ D | X, S(0) = 1 for all d ∈ (0, Q].
Assumption 3 (weak monotonicity of selection in the treatment):
Pr(S(d) ≥ S(d′) | X) = 1 for all d > d′ and d, d′ ∈ (0, Q].
By Assumption 3, selection is weakly monotonic in the treatment, implying that a higher treat-
ment state can never decrease selection for any individual. In our context, this means that a
higher discount cannot induce a customer to not buy a ticket that would have been purchased
under a lower discount. An analogous assumption has been made in the context of nonpara-
metric instrumental variable models, see Imbens and Angrist (1994) and Angrist, Imbens, and
Rubin (1996), where, however, it is the treatment that is assumed to be monotonic in its instru-
ment. Note that monotonicity entails the testable implication that E[S − S(0) | X, S = 1, D =
d] = E[1 − S(0) | X, S = 1, D = d] weakly increases in the treatment value d. In words, the share of
customers that bought the ticket because of the discount must increase in the discount rate in
our survey population when controlling for X.
Assumption 4 (common support):
f(d | X, S(0) = 1) > 0 for all d ∈ (0, Q].
Assumption 4 is a common support restriction requiring that f(d | X, S(0) = 1), the conditional
density of receiving a specific treatment intensity d given X and S(0) = 1 (or the conditional prob-
ability if the treatment takes discrete values), henceforth referred to as treatment propensity
score, is larger than zero among always buyers for the treatment doses to be evaluated. This
implies that the demand-related covariates X do not deterministically affect the discount rate
received, such that there exists variation in the rates conditional on X.
Our assumptions permit identifying the conditional ATE given X (CATE), denoted by
Δ_{X,S(0)=1}(d, d′) = E[Y(d) − Y(d′) | X, S(0) = 1] for d ≠ d′ and d, d′ ∈ (0, Q]. To see this,
note that

Δ_{X,S(0)=1}(d, d′) = E[Y | D = d, X, S(0) = 1] − E[Y | D = d′, X, S(0) = 1]
                    = E[Y | D = d, X, S(0) = 1, S = 1] − E[Y | D = d′, X, S(0) = 1, S = 1],    (5)

where the first equality follows from Assumption 2 and the second from Assumption 3, as
monotonicity implies that S = 1 if S(0) = 1. Together with Assumption 1,
which postulates the identifiability of S(0), it follows that the causal effect on always buyers
is nonparametrically identified, given that common support (Assumption 4) holds. It follows
that the ATE among always buyers is identified by averaging over the distribution of X given
S(0) = 1, S = 1:

Δ_{S(0)=1}(d, d′) = E[ E[Y | D = d, X, S(0) = 1, S = 1] − E[Y | D = d′, X, S(0) = 1, S = 1] | S(0) = 1, S = 1 ].    (6)
Furthermore, considering (5) and letting d′ → d identifies the conditional average partial
effect (CAPE) of marginally increasing the treatment at D = d given X, S(0) = 1, denoted by
θ_{X,S(0)=1}(D) = E[∂Y(D)/∂D | X, S(0) = 1]:

θ_{X,S(0)=1}(d) = ∂E[Y | D = d, X, S(0) = 1, S = 1]/∂D.    (7)

Accordingly, the APE among always buyers, which averages over the distributions of X and D, is
identified by

θ_{S(0)=1}(D) = E[ ∂E[Y | D, X, S(0) = 1, S = 1]/∂D | S(0) = 1, S = 1 ].    (8)
Our identifying assumptions yield a testable implication if some personal characteristics (like
customer's age) that affect S(d) are observed, which we henceforth denote by W. In fact, D must
be statistically independent of W conditional on X, S(0) = 1, S = 1 if X is sufficient for avoiding
any confounding of the treatment-outcome relation. To see this, note that by Assumption 2,
personal characteristics must not influence the treatment assignment conditional on X. This statistical
independence must also hold within subgroups (or principal strata) in which sample selection
behavior (and thus sample selection/collider bias) is controlled for, like the always buyers, i.e.
conditional on S(0) = 1 and S = 1.
5 Estimation based on machine learning
In this section, we outline the predictive and causal machine learning approaches used in our
empirical analysis of the evaluation sample.
5.1 Predictive Machine Learning
Let i ∈ {1, ..., n} be an index for the different interviewees in our sample of size n and {Y_i, D_i, X_i, W_i, S_i(0)}
denote the outcome, the treatment, the covariates related to the treatment and the outcome, the
observed personal characteristics, and the buying decision without discount of these interviewees,
who by the sampling design all satisfy S_i = 1 (because they are part of the survey). Therefore,
Y_i represents customer i's demand shift (rescheduling behavior) under customer i's received dis-
count rate D_i relative to no discount. We in a first step investigate which observed predictors
among the covariates X, W as well as the size of the discount D are most powerful for predicting
demand shifts by machine learning algorithms. We point out that this analysis is of descriptive
nature, as it does not yield the causal effects of the various predictors, but merely their capability
of forecasting Y. In particular, our approach averages the predictions of Y over different levels
of treatment intensity D and thus over different customer types in terms of reservation price (related
to S(0)) and unobserved background characteristics that likely vary with the treatment level.
Therefore, we also perform the prediction analysis within subgroups defined upon the treat-
ment level to see whether the set of important predictors is affected by the treatment intensity.
To this end, we binarize the treatment such that it consists of two categories, namely (non-zero)
discounts below 30%, i.e. covering the treatment range d ∈ (0, 0.3), and more substantial dis-
counts of 30% and more, d ∈ [0.3, 0.7], as 70% is the highest discount observed in our data. In
the same manner, we also assess the predictive power when considering the decision to buy a
trip that would not have been realized without discount (additional trip), i.e. S_i − S_i(0), as
outcome. As S_i is equal to one for everyone, the outcome corresponds to 1 − S_i(0) and
indicates whether someone has been induced to purchase the ticket because of the discount, i.e. is
not an always buyer. As a further consumer behavior-related outcome to be predicted, we also
consider buying a first class rather than second class ticket because of the discount (upselling).
Prediction is based on the random forest, a nonparametric machine learner suggested by
Breiman (2001) for predicting outcomes as a function of covariates. Random forests rely on
repeatedly drawing subsamples from the original data and averaging over the predictions in
each subsample obtained by a decision tree, see Breiman, Friedman, Olshen, and Stone (1984).
The idea of decision trees is to recursively split the covariate space, i.e. the set of possible
values of X, W, into a number of non-overlapping subsets (or nodes). Recursive splitting is
performed such that after each split, a statistical goodness-of-fit criterion like the sum of squared
residuals, i.e. the squared differences between the outcome and the subset-specific average outcome, is
minimized across the newly created subsets. Intuitively, this can be thought of as a regression
of the outcome on a data-driven choice of indicator functions for specific (brackets of) covariate
values. At each split of a specific tree, only a random subset of covariates is chosen as potential
variables for splitting in order to reduce the correlation of tree structures across subsamples,
which together with averaging predictions over all subsamples reduces the estimation variance
of the random forest when compared to running a single tree in the original data. Even when
using an excessive number of splits (or indicator functions for covariate values) such that some
of them do not importantly predict the outcome, averaging over many subsamples will cancel out
those non-predictive splits that are only due to sampling noise. Forest-based predictions can be
represented by smooth weighting functions that bear some resemblance to kernel regression,
with the important difference that random forests detect predictive covariates in a data-driven
way. We use the randomForest package by Breiman (2018) for the statistical software R to
implement the random forest based on growing 1,000 decision trees.
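As an illustration of this prediction step, the following sketch mimics the setup in Python with scikit-learn's RandomForestClassifier instead of the randomForest package in R used in the paper. All data are synthetic stand-ins for the survey sample, and the covariate names are hypothetical:

```python
# Illustrative sketch only: synthetic data stand in for the survey sample,
# and scikit-learn replaces the randomForest package used in the paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
# Hypothetical covariates: discount rate D, age, departure hour, utilization
X = np.column_stack([
    rng.uniform(0.1, 0.7, n),          # discount rate D
    rng.integers(18, 80, n),           # customer age
    rng.integers(5, 23, n),            # departure hour
    rng.uniform(0.0, 1.0, n),          # capacity utilization
])
# Synthetic demand-shift outcome, loosely driven by discount and utilization
p = 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] + X[:, 3] - 1.0)))
y = rng.binomial(1, p)

# The paper grows 1,000 trees; fewer are used here to keep the sketch fast
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Gini-based variable importance, the measure underlying Table 2
importance = dict(zip(["discount", "age", "dep_hour", "utilization"],
                      forest.feature_importances_.round(3)))
print(importance)
```

The importance values are normalized to sum to one; ranking covariates by them reproduces the logic of the variable importance ordering in Table 2.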
5.2 Causal Machine Learning
Our second part of the analysis assesses the causal effect of increasing discount rates on demand
shifts among always buyers, while controlling for the selection into the survey and the non-random
assignment of the treatment based on Assumptions 1 to 4 of Section 4. We apply the causal
forest (CF) approach of Wager and Athey (2018) and Athey, Tibshirani, and Wager (2019)
to estimate the CAPE and APE of the continuous treatment, as well as the double machine
learning (DML) approach of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and
Robins (2018) to estimate the ATE of a binary treatment, a discount of ≥30% vs. <30%, in the
sample of always buyers.
The CF adapts the random forest to the purpose of causal inference. It is based on first
running separate random forests for predicting the outcome Y and the treatment D as a function
of the covariates X using leave-one-out cross-fitting. The latter implies that the outcome or
treatment of each observation is predicted based on all observations in the data but its own,
in order to safeguard against overfitting bias. Second, the predictions are used for computing
residuals of the outcomes and treatments, in which the influence of X has been partialled out.
Finally, a further random forest is applied to average over so-called causal trees, see Athey
and Imbens (2016), in order to estimate the CAPE. The causal tree approach contains two key
modifications compared to standard decision trees. First, instead of an outcome variable,
it is the coefficient of regressing the residual of Y on the residual of D, i.e. the causal effect
estimate of the treatment, that is to be predicted. Recursive splitting aims to find the largest
effect heterogeneities across subsets defined in terms of X in order to estimate the CAPE accurately.
Second, within each subset, different parts of the data are used for estimating (a) the tree's
splitting structure (i.e., the definition of covariate indicator functions) and (b) the causal effect
of the treatment, to prevent spuriously large effect heterogeneities due to overfitting.
The CAPE estimate obtained by the CF can be thought of as a weighted regression of the
outcome residual on the treatment residual. The random forest-determined weight reflects the
importance of a sample observation for assessing the causal effect at specific values of the covari-
ates. After estimating the CAPE given X, the APE is obtained by appropriately averaging over
the distribution of X among the always buyers. For implementing CAPE and APE estimation,
we use the grf package by Tibshirani, Athey, Friedberg, Hadad, Hirshberg, Miner, Sverdrup,
Wager, and Wright (2020) for the statistical software R. We set the number of trees to be used
in a forest to 1,000. We select any other tuning parameters, like the number of randomly chosen
covariates considered for splitting or the minimum number of observations per subset (or node),
by the built-in cross-validation procedure.
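To illustrate the partialling-out idea behind the CF (the actual analysis uses grf in R), the following Python sketch cross-fits random forest predictions for Y and D and regresses the outcome residual on the treatment residual; the data are synthetic, with a true average effect of 0.5 built in:

```python
# Sketch of the residual-on-residual (partialling-out) step behind the
# causal forest; synthetic data only, not the paper's grf implementation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 5))
D = 0.3 + 0.1 * X[:, 0] + 0.2 * rng.normal(size=n)      # confounded treatment
Y = 0.5 * D + 0.2 * X[:, 0] + 0.3 * rng.normal(size=n)  # true APE = 0.5

# Cross-fitting: each observation is predicted from folds not containing it
rf = RandomForestRegressor(n_estimators=200, random_state=0)
res_y = Y - cross_val_predict(rf, X, Y, cv=5)
res_d = D - cross_val_predict(rf, X, D, cv=5)

# Average partial effect from the residual-on-residual regression slope
ape_hat = (res_d @ res_y) / (res_d @ res_d)
print(round(ape_hat, 2))
```

The CF additionally lets this slope vary with X via causal trees; the unweighted slope above corresponds to a single, homogeneous partial effect.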
We also estimate the ATE among always buyers in our sample based on DML for a binary
treatment defined as D̃ = I{D ≥ 0.3}, with I{·} denoting the indicator function that is equal
to one if its argument is satisfied and zero otherwise. Furthermore, let μ_d(X) =
E[Y | D̃ = d, X, S(0) = 1, S = 1] denote the conditional mean outcome and p_d(X) = Pr(D̃ = d | X, S(0) =
1, S = 1) the propensity score of receiving treatment category d (with d = 1 for a discount
≥30% and d = 0 otherwise) in that population, such that p_0(X) = 1 − p_1(X). Estimation is based on the sample analog of
the doubly robust identification expression for the ATE, see Robins, Rotnitzky, and Zhao (1994)
and Robins and Rotnitzky (1995):

Δ_{S(0)=1}(1, 0) = E[ μ_1(X) − μ_0(X) + D̃ · (Y − μ_1(X))/p_1(X) − (1 − D̃) · (Y − μ_0(X))/p_0(X) | S(0) = 1, S = 1 ].    (9)

We estimate (9) using the causalweight package for the statistical software R by Bodory and Huber
(2018). As machine learners for the conditional mean outcomes μ_d(X) and the propensity
scores p_d(X), we use the random forest with the default options of the SuperLearner package of
van der Laan, Polley, and Hubbard (2007), which itself imports the ranger package by Wright
and Ziegler (2017) for random forests. To impose common support in the data used for ATE
estimation, we apply a trimming threshold of 0.01, implying that we drop observations with es-
timated propensity scores smaller than 0.01 (or 1%) or larger than 0.99 (or 99%) from our
sample.
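As an illustration of the doubly robust expression in (9), the following Python sketch computes its sample analog with random forest nuisance estimates and the 0.01 trimming rule on synthetic data. The actual estimation uses the causalweight package in R; cross-fitting of the nuisance models is omitted here for brevity:

```python
# Illustrative doubly robust (AIPW) ATE sketch on synthetic data with a
# built-in true ATE of 0.3; not the paper's causalweight implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 4))
p_true = 1.0 / (1.0 + np.exp(-X[:, 0]))            # true propensity score
Dtil = rng.binomial(1, p_true)                     # binary treatment indicator
Y = 0.3 * Dtil + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=n)

# Nuisance estimates: propensity score and conditional mean outcomes
ps_model = RandomForestClassifier(n_estimators=200, min_samples_leaf=50,
                                  random_state=0).fit(X, Dtil)
p1 = ps_model.predict_proba(X)[:, 1]
m1 = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                           random_state=0).fit(X[Dtil == 1], Y[Dtil == 1]).predict(X)
m0 = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                           random_state=0).fit(X[Dtil == 0], Y[Dtil == 0]).predict(X)

# Trimming: drop estimated propensity scores outside [0.01, 0.99]
keep = (p1 > 0.01) & (p1 < 0.99)
D_, Y_, p_, m1_, m0_ = Dtil[keep], Y[keep], p1[keep], m1[keep], m0[keep]

# Sample analog of the doubly robust ATE expression
ate_hat = np.mean(m1_ - m0_
                  + D_ * (Y_ - m1_) / p_
                  - (1 - D_) * (Y_ - m0_) / (1 - p_))
print(round(ate_hat, 2))
```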
6 Empirical results
6.1 Descriptive Statistics
Before discussing the results of our machine learning approaches, we first present some descriptive
statistics for our data in Table 1, namely the mean and the standard deviation of selected
variables by always buyer status and binary discount category (≥30% and <30%). We see
that discounts and regular ticket fares of always buyers are on average lower than those of other
customers. Another interesting observation is that in either discount category, we observe fewer
leisure travelers among the always buyers than among other customers, which can be rationalized
by business travelers responding less to price incentives by discounts. This is also in line with
the finding that always buyers tend to purchase more second class tickets. More generally, we
see non-negligible variation in demand-related covariates across the four subsamples defined in
terms of buying behavior and discount rates. For instance, among always buyers, the total
amount of supersaver tickets offered is on average larger in the higher discount category, while
it is lower among the remaining clients. This suggests that neither the treatment nor being an
always buyer is quasi-random, a problem we aim to tackle based on our identification strategy
outlined in Section 4. Concerning the demand-shift outcome, we see that always buyers change
the departure time less frequently than others. With regard to upselling, we recognise that the
relative amount of individuals upgrading their 2nd class to a 1st class ticket is the same for both
discount categories, i.e. ≥30% and <30%.
6.2 Predicting buying decisions
We subsequently present our predictive analysis based on the random forest and investigate
which covariates importantly predict three outcomes, namely whether customers booked a trip
otherwise not realized by train (additional trip), bought a first-class rather than a second-class
ticket (upselling), or rescheduled their trip e.g. away from rush hours (demand shift). For this
purpose, we create three distinct datasets in which the values of the respective binary outcome are
balanced, i.e. 1 (for instance, upselling) for 50% and 0 (no upselling) for 50% of the observations.
We balance our data because we aim to train a model that predicts both outcome values equally
well. Taking the demand shift outcome as an example, our data with non-missing covariate or
outcome information contain 3481 observations with Y = 1 and 9576 observations with Y = 0.
Table 1: Mean and standard deviation by discount and type
discount <30% ≥30%
always buyers No Yes No Yes
discount 0.21 0.19 0.57 0.53
(0.07) (0.08) (0.12) (0.13)
regular ticket fare 44.36 36.14 47.19 32.91
(29.38) (25.47) (30.14) (23.78)
age 47.22 47.68 45.59 48.77
(15.36) (16.14) (15.80) (16.49)
gender 0.51 0.55 0.53 0.59
(0.50) (0.50) (0.50) (0.49)
diff. purchase travel 3.42 3.23 7.72 7.19
( 6.96) ( 6.76) (11.23) (10.30)
distance 136.49 127.86 126.15 116.76
(77.38) (71.49) (69.98) (66.04)
capacity utilization 35.51 39.19 26.46 33.15
(14.16) (14.31) (13.24) (13.75)
seat capacity 328.28 429.57 303.83 445.14
(196.19) (196.10) (185.42) (188.54)
offer total 33.95 44.10 70.97 98.34
(42.57) (50.68) (69.57) (84.45)
sold total 28.04 37.29 13.70 25.75
(41.92) (50.31) (36.37) (53.67)
half fare travel ticket 0.74 0.79 0.62 0.74
(0.44) (0.40) (0.49) (0.44)
leisure 0.77 0.69 0.82 0.76
(0.42) (0.46) (0.39) (0.43)
class 1.38 1.65 1.33 1.73
(0.48) (0.48) (0.47) (0.44)
Swiss 0.89 0.92 0.88 0.88
(0.31) (0.28) (0.33) (0.32)
demand shift 0.31 0.19 0.31 0.23
(0.46) (0.40) (0.46) (0.42)
upselling 0.49 0.00 0.49 0.00
(0.50) (0.00) (0.50) (0.00)
obs. 1151 2221 5529 3745
Notes: Regular ticket fare is in Swiss francs. ‘diff. purchase travel’ denotes the difference between purchase and
travel day. ‘Offer total’ and ‘sold total’ denote the total amount of supersaver tickets offered and the total amount
of supersaver tickets sold respectively.
We retain all observations with Y = 1 and randomly draw 3481 observations with Y = 0 to
obtain such a balanced data set. In the next step, we randomly split these 6962 observations into
a training set consisting of 75% of the data and a test set (25%). In the training set, we train
the random forest using the treatment D and all covariates X, W as predictors. In the test set,
we predict the outcomes based on the trained forest, classifying e.g. observations with a demand
shift probability of ≥0.5 as 1. We then compare the predictions to the actually observed outcomes
to assess model performance based on the correct classification rate (also known as accuracy),
i.e. the share of observations in the test data for which the predicted outcome corresponds to
the actual one.
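The balancing, splitting, and accuracy computation can be sketched as follows in Python (synthetic stand-in data whose only realistic feature is the class sizes of the demand-shift outcome; the actual analysis is run in R):

```python
# Sketch of the balanced-sample prediction workflow; the predictors are
# synthetic stand-ins, only the class sizes match the paper's data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n1, n0 = 3481, 9576                       # Y = 1 and Y = 0 counts
X1 = rng.normal(loc=0.4, size=(n1, 5))    # hypothetical covariates, Y = 1
X0 = rng.normal(loc=0.0, size=(n0, 5))    # hypothetical covariates, Y = 0

# Retain all Y = 1 observations and draw an equally sized Y = 0 subsample
idx = rng.choice(n0, size=n1, replace=False)
X = np.vstack([X1, X0[idx]])
y = np.concatenate([np.ones(n1), np.zeros(n1)])

# 75% training / 25% test split, then fit the forest on the training set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Classify as 1 whenever the predicted probability reaches 0.5
pred = (forest.predict_proba(X_te)[:, 1] >= 0.5).astype(float)
accuracy = np.mean(pred == y_te)          # correct classification rate
print(round(accuracy, 2))
```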
For each of the outcomes, Table 2 presents the 30 most predictive covariates in the training
set, ordered in decreasing order according to a variable importance measure. The latter is defined
as the total decrease in the Gini index (as a measure of node impurity in terms of outcome values)
in a tree when including the respective covariate for splitting, averaged over all trees in the forest.
The results suggest that trip- and demand-related characteristics like seat capacity, utilization,
departure time, and distance are important predictors. Concerning personal characteristics,
customer's age also appears to be relevant. Furthermore, the treatment intensity D also has
considerable predictive power. Interestingly, specific connections (defined by indicators for points
of departure and destination) turn out to be less important characteristics conditional on the
other covariates already mentioned.
At the bottom of Table 2, we also report the correct classification rates for the three outcomes.
While the accuracy in predicting a demand shift amounts to 58%, which is somewhat better
than random guessing but not particularly impressive, the performance is more satisfactory
for predicting decisions about additional trips, with an accuracy of 65%, and quite decent for
upselling (82%). We note that when predicting upselling, we drop the variables ‘class’, which
indicates whether someone travels in the first or second class, and ‘seat capacity’, which refers to
the capacity in the chosen class, from the predictors. The reason is that upselling is defined as
switching from second to first class, and therefore, the chosen class and the related seat capacities
are actually part of the outcome to be predicted. Tables B.2 and B.1 in the Appendix present
the predictive outcome analysis separately for subsamples with discounts ≥30% and <30%,
respectively. In terms of which classes of variables are most predictive (trip- and demand-related
characteristics, age, discount rate) and also in terms of accuracy, the findings are rather similar to
those in Table 2. In general, machine learning appears useful for forecasting customer behavior
in the context of demand for train trips, albeit not equally well for all aspects of interest. Such
forecasts may for instance serve as a basis for customer segmentation, e.g. into customer groups
more and less inclined to book an additional trip or switch classes or departure times.
Table 2: Predictive outcome analysis
demand shift upselling additional trip
variable importance variable importance variable importance
departure time 142.694 capacity utilization 295.924 seat capacity 147.037
seat capacity 121.42 offer level B 188.861 D 128.086
age 119.846 offer level C 149.911 age 123.948
capacity utilization 119.606 D 132.095 departure time 123.516
D 112.474 age 100.258 capacity utilization 113.160
distance 112.143 departure time 98.909 distance 101.730
offer level B 84.142 offer level A 93.303 offer level B 84.989
diff. purchase travel 81.167 distance 87.319 diff. purchase travel 80.236
offer level C 76.238 offer level D 85.408 offer level A 78.507
offer level A 75.971 diff. purchase travel 62.841 offer level C 77.097
number of sub-journeys 73.096 number of sub-journeys 55.978 number of sub-journeys 69.443
offer level D 61.763 rel. sold level A 44.505 ticket purchase complexity 64.498
ticket purchase complexity 57.071 ticket purchase complexity 41.819 offer level D 56.888
rel. sold level A 51.377 offer level E 37.159 class 51.456
rel. amount imputed values 42.222 rel. sold level B 34.462 rel. sold level A 46.969
rel. sold level B 38.144 rel. amount imputed values 30.747 rel. amount imputed values 38.869
adult companions 34.176 rel. sold level C 28.635 rel. sold level B 36.484
rel. sold level C 28.201 adult companions 25.115 half fare 35.785
offer level E 25.714 2019 18.88 adult companions 34.446
gender 23.707 gender 18.47 half fare travel ticket 28.465
amount purchased tickets 19.575 rush hour 17.448 gender 25.419
German 18.659 Saturday 16.173 rel. sold level C 24.679
travel alone 18.605 German 15.457 offer level E 22.556
2019 18.082 leisure 15.304 leisure 20.438
French 17.906 amount purchased tickets 15.112 no subscriptions 19.793
Saturday 17.487 travel alone 14.792 amount purchased tickets 19.283
Friday 17.272 half fare 14.306 German 19.119
peak hour 17.064 French 14.161 travel alone 18.139
class 16.973 Thursday 13.413 2019 17.192
leisure 16.892 scheme 20 13.411 French 17.026
correct prediction rate 0.581 0.817 0.653
balanced sample size 6962 6738 7000
Notes: ‘Offer level A’, ‘offer level B’, ‘offer level C’, ‘offer level D’ and ‘offer level E’ denote the amount of supersaver tickets with discount A, B, C, D and E respectively. ‘Diff.
purchase travel’ denotes the difference between purchase and travel day. ‘Rel. sold level A’, ‘rel. sold level B’, ‘rel. sold level C’ and ‘rel. sold level D’ denote the relative amount
of supersaver tickets offered with discount A, B, C and D respectively. The relative amounts are in relation to the seats offered. ‘No subscriptions’ indicates not possessing any
subscription. For predicting upselling, the covariates ‘class’ and ‘seat capacity’ are dropped.
6.3 Testing the identification strategy
Before presenting the results for the causal analysis, we consider two different methods to par-
tially test the assumptions underlying our identification strategy. First, we test Assumption 3
(weak monotonicity) by running the CF and DML procedures as well as a conventional OLS
regression in which we use buying an additional trip (1 S(0)), i.e. not being an always buyer,
as outcome variable and Xas control variables in our sample of supersaver customers. The CF
permits estimating the conditional change in the share of surveyed customers induced to buy
an additional trip by modifying the discount rate Dgiven X, i.e. E[(1S(0))|D,X,S=1]
∂D , as well as
the average thereof across Xconditional on sample selection, EhE[(1S(0))|D,X,S =1]
∂D S= 1i.
DML, on the other hand, yields an estimate of the average difference in the share of ad-
ditional trips across the high and low treatment categories conditional on sample selection,
E[E[(1 S(0))|D < 0.3, X, S = 1] E[(1 S(0))|D0.3, X, S = 1]|S= 1]. Finally, the OLS
regression of (1S(0)) on Dand all Xin our sample tests monotonicity when assuming a linear
Table 3 reports the results, which do not provide any evidence against the monotonicity assump-
tion. When considering the continuous treatment D, the CF-based estimate of E[ ∂E[1 − S(0) | D, X, S = 1]/∂D | S = 1 ]
is highly statistically significant and suggests that augmenting the discount by one percentage
point increases the share of customers otherwise not buying the ticket by 0.56 percentage points
on average. Furthermore, all estimates of the conditional change ∂E[1 − S(0) | D, X, S = 1]/∂D are posi-
tive, as displayed in the histogram of Figure 2, and 82.2% of them are statistically significant at
the 10% level, 69.1% at the 5% level. Furthermore, the OLS coefficient of 0.544 is highly signif-
icant. Likewise, the statistically significant DML estimate points to an increase in the share of
additional trips by 18.6 percentage points when switching the binary treatment indicator from
D < 0.3 to D ≥ 0.3.
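The OLS variant of this monotonicity check can be sketched as follows; the data are synthetic with a built-in positive slope of 0.5 on D and hypothetical variable names (the estimate of 0.544 above refers to the actual survey data):

```python
# Sketch of the OLS monotonicity check: regress the additional-trip
# indicator (1 - S(0)) on the discount rate D and controls X. Synthetic
# data with a true slope of 0.5 on D.
import numpy as np

rng = np.random.default_rng(4)
n = 5000
X = rng.normal(size=(n, 3))
D = np.clip(0.3 + 0.1 * X[:, 0] + 0.1 * rng.normal(size=n), 0.01, 0.7)
# Share of induced buyers increasing in the discount rate
p = np.clip(0.1 + 0.5 * D + 0.05 * X[:, 1], 0.0, 1.0)
extra = rng.binomial(1, p)                 # 1 - S(0): not an always buyer

# OLS of (1 - S(0)) on an intercept, D, and X via least squares
Z = np.column_stack([np.ones(n), D, X])
coef, *_ = np.linalg.lstsq(Z, extra, rcond=None)
print(round(coef[1], 2))                   # coefficient on D
```

A positive and significant coefficient on D is consistent with weak monotonicity; a negative one would speak against it.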
We also test the statistical independence of D and W conditional on X in our sample of always buyers, as implied by our identifying assumptions; see the discussion at the end of Section 4. To this end, we randomly split the evaluation data into a training set (25% of observations) and a test set (75% of observations). In the training data set, we run a linear lasso regression (Tibshirani, 1996) of D on X in order to identify important predictors by means of 10-fold
cross-validation. In the next step, we select all covariates in X with non-zero lasso coefficients and run an OLS regression of D on the selected covariates in the test data. Finally, we add W to that regression in the test data and run a Wald test to compare the predictive power of the models with and without W. We repeat the procedure of splitting the data, performing the lasso regression in the training set, and running the OLS regressions and the Wald test in the test set 100 times. This yields an average p-value of 0.226, with 15 out of 100 p-values being smaller than 5%. These results do not provide compelling statistical evidence that W is associated with D conditional on X, even though the training sample is relatively small, which favors selecting too few predictors in X (as the cross-validation trades off the bias from including fewer predictors against the variance from including more).

Table 3: Monotonicity tests

                         CF: average change   OLS: coefficient   DML: D ≥ 0.3 vs D < 0.3
change in (1 − S(0))           0.564                0.544                 0.186
standard error                 0.060                0.031                 0.007
p-value                        0.000                0.000                 0.000
trimmed observations                                                       1760
number of observations                              12924

Notes: ‘CF’, ‘OLS’, and ‘DML’ stand for estimates based on causal forests, linear regression, and double machine learning, respectively. ‘Trimmed observations’ is the number of trimmed observations in DML when setting the propensity score-based trimming threshold to 0.01. Control variables consist of X.

Figure 2: Monotonicity given X
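The split-sample placebo test just described can be sketched as follows on synthetic data (covariate dimensions, coefficients, and the number of replications are invented; a nested-model F-test serves as the Wald-type comparison):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic stand-ins: X (trip/demand covariates) drives D, while W (personal
# characteristics) is independent of D given X; all dimensions are invented.
n, p_x, p_w = 2000, 20, 5
X = rng.normal(size=(n, p_x))
W = rng.normal(size=(n, p_w))
D = 0.2 + 0.05 * X[:, 0] + 0.02 * rng.normal(size=n)

def independence_test(X, W, D, rng, train_frac=0.25):
    """One split: lasso-select predictors of D in X on the training part,
    then F-test whether W adds predictive power in the test part."""
    n = len(D)
    idx = rng.permutation(n)
    tr, te = idx[: int(train_frac * n)], idx[int(train_frac * n):]
    sel = np.flatnonzero(LassoCV(cv=10).fit(X[tr], D[tr]).coef_)
    Xs = X[te][:, sel]

    def rss(Z):
        Z1 = np.column_stack([np.ones(len(te)), Z])
        coef, *_ = np.linalg.lstsq(Z1, D[te], rcond=None)
        return np.sum((D[te] - Z1 @ coef) ** 2), Z1.shape[1]

    rss0, k0 = rss(Xs)                                # model without W
    rss1, k1 = rss(np.column_stack([Xs, W[te]]))      # model with W added
    f_stat = ((rss0 - rss1) / (k1 - k0)) / (rss1 / (len(te) - k1))
    return stats.f.sf(f_stat, k1 - k0, len(te) - k1)

pvals = [independence_test(X, W, D, rng) for _ in range(20)]
print(f"average p-value: {np.mean(pvals):.3f}")
```

Under conditional independence of W and D given X, the p-values are approximately uniform, so a high average p-value and a rejection rate near the nominal level are what the test should deliver.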
We note that performing lasso-based variable selection and OLS-based testing in different (training and test) data avoids correlations between these steps that could entail an overestimation of the goodness of fit. Nonetheless, our findings remain qualitatively unchanged when performing both steps in all of the evaluation data. Repeating the cross-validation step for the lasso-based covariate selection 100 times and testing in the total sample yields an even higher average p-value of 0.360. Finally, we run a standard OLS regression of D on all elements of X (rather than selecting the important ones by lasso) in the total sample and compare its predictive power to a model additionally including W. Also in this case, the Wald test entails a rather high p-value of 0.343. In summary, we conclude that our tests do not point to a violation of our identifying assumptions.

6.4 Assessing the causal effect of discounts
Table 4 presents the main results of our causal analysis, namely the estimates of the discount rate’s effect on the demand shift outcome, which is equal to one if the discount induced rescheduling the departure time and zero otherwise. We note that all covariates, i.e. both the trip- and demand-related factors X and the personal characteristics W, are used as control variables, even though we have previously claimed that X is sufficient for identification. There are, however, good reasons for including W as well in the estimations. First, conditioning on the personal characteristics available in the data may reduce estimation bias if X is, contrary to our assumptions and to what our tests suggest, not fully sufficient to account for confounding. Second, it can also reduce the variance of the estimator, e.g. if some factors like age are strong predictors of the outcome. Third, having W in the CF allows for a more fine-grained analysis of effect heterogeneity based on computing more ‘individualized’ partial effects that (also) vary across personal characteristics.
Table 4: Effects on demand shift

                         CF: APE   DML: ATE, D ≥ 0.3 vs D < 0.3
effect                    0.161               0.036
standard error            0.072               0.014
p-value                   0.025               0.010
trimmed observations                            151
number of observations              5903

Notes: ‘CF’ and ‘DML’ stand for estimates based on causal forests and double machine learning, respectively. ‘Trimmed observations’ is the number of trimmed observations in DML when setting the propensity score-based trimming threshold to 0.01. Control variables consist of both X and W.
Considering the estimates of the CF, we obtain an average partial effect (APE) of 0.161, suggesting that increasing the current discount rate among always buyers by one percentage point increases the share of rescheduled trips by 0.16 percentage points. This effect is statistically significant at the 5% level. As a word of caution, however, we point out that the standard error is non-negligible, such that the magnitude of the impact is not very precisely estimated. When applying DML, we obtain an average treatment effect (ATE) of 0.036 that is significant at the 1% level, suggesting that discounts of 30% and more on average increase the number of demand shifts by 3.6 percentage points compared to lower discounts, which is qualitatively in line with the CF. Furthermore, we find decent overlap, or common support, in most of our sample in terms of the estimated propensity scores across the lower and higher discount categories considered in DML; see the propensity score histograms in Appendix A. This is important, as ATE evaluation hinges on the availability of observations with comparable propensity scores across treatment groups. Only 151 out of our 5903 observations are dropped due to too extreme propensity scores below 0.01 or above 0.99 (pointing to a violation of common support).4 In summary, our results clearly point to a positive average effect of the discount rate on trip rescheduling among always buyers, which is, however, not overwhelmingly large.
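A minimal sketch of the DML step with a binary treatment: cross-fitted doubly robust (AIPW) scores with random-forest nuisance estimates and propensity score trimming. The data-generating process, forest settings, and true effect of 0.04 below are invented to mimic, not reproduce, our application.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)

# Synthetic always-buyer sample: binary 'high discount' indicator T and a
# binary rescheduling outcome Y with a true ATE of 0.04 (all numbers invented).
n, p = 4000, 10
X = rng.normal(size=(n, p))
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.8 * X[:, 0])))
Y = rng.binomial(1, np.clip(0.3 + 0.04 * T + 0.1 * X[:, 1], 0.01, 0.99))

def dml_ate(X, T, Y, n_folds=2, trim=0.01):
    """Cross-fitted AIPW (doubly robust) ATE with propensity score trimming."""
    scores = []
    for tr, te in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        rf = dict(n_estimators=200, min_samples_leaf=50, random_state=0)
        # Nuisance estimates on the training fold, predicted on the held-out fold:
        ps = RandomForestClassifier(**rf).fit(X[tr], T[tr]).predict_proba(X[te])[:, 1]
        mu1 = RandomForestRegressor(**rf).fit(X[tr][T[tr] == 1], Y[tr][T[tr] == 1]).predict(X[te])
        mu0 = RandomForestRegressor(**rf).fit(X[tr][T[tr] == 0], Y[tr][T[tr] == 0]).predict(X[te])
        keep = (ps > trim) & (ps < 1 - trim)          # drop extreme propensity scores
        p_k, t_k, y_k = ps[keep], T[te][keep], Y[te][keep]
        m1, m0 = mu1[keep], mu0[keep]
        scores.append(m1 - m0 + t_k * (y_k - m1) / p_k - (1 - t_k) * (y_k - m0) / (1 - p_k))
    psi = np.concatenate(scores)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(psi))

ate, se = dml_ate(X, T, Y)
print(f"ATE: {ate:.3f} (se {se:.3f})")
```

In the paper itself, the nuisance parameters and standard errors follow Chernozhukov et al. (2018); the snippet only conveys the cross-fitting and trimming logic.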
6.5 Effect heterogeneity
In this section, we assess the heterogeneity of the effects of D on Y across interviewees and observed characteristics. Figure 3 shows the distribution of the CF-based conditional average partial effects (CAPEs) of marginally increasing the discount rate, given the covariate values of the always buyers in our sample (which are also the basis for the estimation of the APE). While the CAPEs are predominantly positive, they are quite imprecisely estimated. Only 2.9% and 0.8% of the positive ones are statistically significant at the 10% and 5% levels, respectively. Further, only 0.1% of the negative ones are statistically significant at the 10% level. Yet, the distribution points to a positive marginal effect for most always buyers and also suggests that the magnitude of the effects varies non-negligibly across individuals.
Footnote 4: Our findings of a positive ATE remain robust when setting the propensity score-based trimming threshold to 0.02 (ATE: 0.039) or 0.05 (ATE: 0.043).

Figure 3: CAPEs

Next, we assess the effect heterogeneity across observed characteristics based on the CF results. First, we run a conventional random forest with the estimated CAPEs as the outcome and the covariates as predictors to assess the covariates’ relative importance for predicting the
CAPE, using the decrease in the Gini index as importance measure, as also considered in Section 6.2. Table 5 reports the 20 most predictive covariates in decreasing order of the importance measure. Demand-related characteristics (like seat capacity, utilization, departure time, and distance) turn out to be the most important predictors for the size of the effects; customer’s age also has some predictive power. As for the outcome predictions in Section 6.2, specific connections (characterized by points of departure or destination) are less important predictors of the CAPEs given the other information available in the data.
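This second-stage importance analysis can be sketched as follows. We use scikit-learn's impurity-based `feature_importances_` (variance reduction for regression trees, normalized to sum to one) as a stand-in for the Gini-based measure; the CAPEs and covariate names below are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Hypothetical estimated CAPEs that vary mainly with the first and third
# covariates; names and coefficients are invented for illustration.
n = 3000
X = rng.normal(size=(n, 8))
cape = 0.15 + 0.1 * X[:, 0] - 0.05 * X[:, 2] + 0.02 * rng.normal(size=n)

names = ["seat capacity", "capacity utilization", "departure time", "distance",
         "age", "class", "leisure", "gender"]
# Regression forest with the CAPEs as outcome and the covariates as predictors;
# impurity-based importances reveal which covariates drive effect heterogeneity.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, cape)
ranking = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])
for name, imp in ranking[:3]:
    print(f"{name:22s} {imp:.3f}")
```

Note that such importances describe predictive relevance for the estimated effects, not causal drivers of the heterogeneity.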
Table 5: Most important covariates for predicting CAPEs

covariate                     importance
seat capacity                     11.844
offer level C                     11.164
capacity utilization               5.144
departure time                     5.122
distance                           4.287
offer level D                      4.015
class                              3.434
saturday                           2.933
age                                2.429
number of sections                 2.373
diff. purchase travel              2.110
offer level A                      1.634
offer level B                      1.610
half fare                          1.524
scheme 17                          1.496
half fare travel ticket            1.373
rel. sold level B                  0.901
ticket purchase complexity         0.847
leisure                            0.773
rel. sold level A                  0.770

Notes: ‘Offer level A’, ‘offer level B’, ‘offer level C’ and ‘offer level D’ denote the amount of supersaver tickets offered with discount A, B, C and D, respectively. ‘Rel. sold level A’ and ‘rel. sold level B’ denote the relative amount of supersaver tickets sold with discount A and B, in relation to the seats offered.

While Table 5 provides information on the best predictors of effect heterogeneity, it does not give insight into whether effects differ importantly and statistically significantly across specific observed characteristics of interest. For instance, one question relevant for designing discount schemes is whether (marginally) increasing the discounts is more effective among always buyers so far exposed to rather small or rather large discounts. Therefore, we investigate whether the CAPEs are different across our binary treatment categories defined by D̃ (30% or more vs. less than 30%). To this end, we apply the approach of Semenova and Chernozhukov (2020), based on (i) plugging the CF-based predictions into a modified version of the doubly robust functions provided within the expectation operator of (9) that is suitable for a continuous D and (ii) linearly regressing the doubly robust functions on the treatment indicator D̃. The results
are reported in the upper panel of Table 6. While the point estimate of −0.104 suggests that the demand shifting effect of increasing the discount is on average smaller when discounts are already quite substantial (at or above 30%), the difference is far from statistically significant at any conventional level.
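A simplified stand-in for this heterogeneity regression: OLS of doubly robust scores on a small set of covariates with heteroskedasticity-robust (HC0) standard errors. The actual procedure of Semenova and Chernozhukov (2020) involves additional steps (orthogonalized signals, simultaneous inference); the scores and coefficients below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical doubly robust scores psi_i whose conditional mean equals the
# CAPE; heterogeneity runs through invented leisure and peak-hour dummies.
n = 5000
leisure = rng.binomial(1, 0.5, n)
peak = rng.binomial(1, 0.3, n)
psi = 0.1 + 0.3 * leisure + 0.2 * peak + rng.normal(scale=1.0, size=n)

# OLS of the scores on the covariates of interest ...
Z = np.column_stack([np.ones(n), leisure, peak])
beta, *_ = np.linalg.lstsq(Z, psi, rcond=None)
resid = psi - Z @ beta
# ... with HC0 robust covariance: (Z'Z)^-1 Z' diag(e^2) Z (Z'Z)^-1
bread = np.linalg.inv(Z.T @ Z)
meat = Z.T @ (Z * resid[:, None] ** 2)
se = np.sqrt(np.diag(bread @ meat @ bread))
for name, b, s in zip(["constant", "leisure", "peak hours"], beta, se):
    print(f"{name:10s} {b: .3f} ({s:.3f})")
```

Because the regressors enter a plain linear model, each coefficient is interpretable as the difference in the conditional average partial effect associated with that characteristic, holding the others in the regression fixed.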
Table 6: Effect heterogeneity analysis

                                                        effect   standard error   p-value
Discount categories (D ≥ 0.3 vs D < 0.3)
APE for D < 0.3 (constant)                               0.209        0.089        0.019
Difference APE D ≥ 0.3 vs D < 0.3 (slope coefficient)   -0.104        0.122        0.395
Customer and travel characteristics
constant                                                -0.154        0.295        0.602
age                                                     -0.002        0.004        0.556
gender                                                  -0.022        0.129        0.866
distance                                                -0.000        0.001        0.697
leisure trip                                             0.297        0.165        0.072
commute                                                  0.241        0.241        0.316
half fare travel ticket                                  0.228        0.142        0.109
peak hours                                               0.222        0.133        0.094

Notes: Business trip is the reference category for the indicators ‘leisure trip’ and ‘commute’.
Using again the method of Semenova and Chernozhukov (2020), we also investigate the heterogeneity among a limited and pre-selected set of covariates that appears interesting for characterizing customers and their travel purpose, namely age, gender, and travel distance, as well as indicators for leisure trip and commute (with business trip being the reference category), traveling during peak hours, and possession of a half fare travel ticket. As displayed in the lower panel of Table 6, we find no important effect heterogeneities across the age or gender of always buyers or as a function of travel distance conditional on the other information included in the regression, as the coefficients on these variables are close to zero. In contrast, the effect on demand shift is (given the other characteristics) substantially larger among always buyers with a half fare travel ticket and among commuters; however, neither coefficient is statistically significant at the 10% level (even though the half fare coefficient is close).
For leisure trips, the coefficient is even larger (0.297), suggesting that all other included
variables equal, a one percentage point increase in the discount rate increases the share of
rescheduled trips by 0.29 percentage points more among leisure travelers than among always
buyers traveling for business. The coefficient is statistically significant at the 10% level, even
though we point out that the p-value does not account for multiple hypothesis testing of several
covariates. This finding can be rationalized by leisure travelers being likely more flexible in terms
of timing than business travelers. The coefficient on peak hours is also substantially positive (0.222) and statistically significant at the 10% level (again, without controlling for multiple hypothesis testing). This could be due to peak hours being the most attractive travel time, implying that customers are more willing to reschedule their trips when being offered a discount within peak hours. We conclude that even though several coefficients appear non-negligible, statistical significance in our heterogeneity analysis is overall limited, which could be due to the (for the purpose of investigating effect heterogeneity) limited sample of several thousand observations.
7 Conclusion
In this study, we applied causal and predictive machine learning to assess the demand effects of discounts on train tickets issued by the Swiss Federal Railways (SBB), the so-called ‘supersaver tickets’, based on a unique dataset that combines a survey of supersaver customers with rail trip- and demand-related information provided by the SBB. In a first step, we analyzed which customer- or trip-related characteristics (including the discount rate) are predictive for three outcomes characterizing buying behavior, namely: booking a trip otherwise not realized by train (additional trip), buying a first- rather than second-class ticket (upselling), or rescheduling a trip (e.g. a demand shift away from rush hours) when being offered a supersaver ticket. The random forest-based results suggested that customer’s age, demand-related information for a specific connection (like seat capacity, departure time, and utilization), and the discount level permit forecasting buying behavior to a certain extent, with correct classification rates amounting to 58% (demand shift), 65% (additional trip), and 82% (upselling), respectively.
As predictive machine learning cannot provide the causal effects of the predictors involved,
we in a second step applied causal machine learning to assess the impact of the discount rate
on the demand shift among always buyers (who would have traveled even without a discount),
which appears interesting in the light of capacity constraints at rush hours. To this end, we
invoked the identifying assumptions that (i) the discount rate is quasi-random conditional on
our covariates and (ii) the buying decision increases weakly monotonically in the discount rate
and exploited survey information about customer behavior in the absence of discounts. We
also considered two approaches for partially testing our assumptions, which did not point to a
violation of the latter. Our main results based on the causal forest suggested that increasing
the discount rate by one percentage point entails an average increase of 0.16 percentage points
in the share of rescheduled trips among always buyers. This finding was corroborated by double
machine learning with just two discount categories, suggesting that discount rates of 30% and
more on average increase the share of rescheduled trips by 3.6 percentage points when compared
to lower discounts. Furthermore, when investigating effect heterogeneity across a pre-selected set
of characteristics, we found the causal forest-based effects to be higher (with marginal statistical
significance when not controlling for multiple hypothesis testing) for leisure travelers and during
peak hours when also controlling for customer’s age, gender, possession of a half fare travel card,
and travel distance. Finally, our effect heterogeneity analysis also revealed that demand-related
information is most predictive for the size of the effect of the discount rate.
Using state-of-the-art machine learning tools, our study appears to be the first (at least for Switzerland) to provide empirical evidence on how discounts on train tickets affect customers’ willingness to reschedule trips, important information for designing discount schemes aiming at balancing out train utilization across time and reducing overload during peak hours. Even though the overall impact on demand shifts among always buyers might not be as large as one could hope for, the causal forest pointed to the existence of customer segments that are likely more responsive and could be scrutinized further when collecting a larger amount of data than available
for our analysis. Furthermore, our empirical approach may also be applied to other countries or
transport industries facing capacity constraints. For instance, we would expect that in a setting
with higher competition from alternative public transport modes like long distance bus services
(not present in Switzerland), the impact of train discounts may well be different. More generally,
our study can be regarded as a use case for how predictive and in particular causal machine
learning can be fruitfully applied for business analytics and as decision support for optimizing
specific interventions like discount schemes based on impact evaluation.
References

Angrist, J., G. Imbens, and D. Rubin (1996): “Identification of Causal Effects using Instrumental Variables,” Journal of the American Statistical Association, 91, 444–472.
Ascarza, E. (2018): “Retention futility: Targeting high-risk customers might be ineffective,”
Journal of Marketing Research, 55(1), 80–98.
Athey, S., and G. Imbens (2016): “Recursive partitioning for heterogeneous causal effects,”
Proceedings of the National Academy of Sciences, 113, 7353–7360.
Athey, S., J. Tibshirani, and S. Wager (2019): “Generalized random forests,” The Annals
of Statistics, 47, 1148–1178.
Basso, L. J., and H. E. Silva (2014): “Efficiency and substitutability of transit subsidies and
other urban transport policies,” American Economic Journal: Economic Policy, 6(4), 1–33.
Batty, P., R. Palacin, and A. González-Gil (2015): “Challenges and opportunities in developing urban modal shift,” Travel Behaviour and Society, 2(2), 109–123.
Blanco, G., C. A. Flores, and A. Flores-Lagunes (2011): “Bounds on Quantile Treat-
ment Effects of Job Corps on Participants’ Wages,” Discussion paper.
Bodory, H., and M. Huber (2018): “The causalweight package for causal inference in R,”
SES Working Paper 493, University of Fribourg.
Breiman, L. (2001): “Random forests,” Machine Learning, 45, 5–32.
Breiman, L. (2018): “randomForest: Breiman and Cutler’s Random Forests for Classification and Regression. R package version 4.6-12,” Software available at URL: https://cran.r-project.org/package=randomForest.
Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984): Classification and Regression
Trees. Wadsworth, Belmont, California.
Brynjolfsson, E., and K. McElheran (2016): “The rapid adoption of data-driven decision-
making,” American Economic Review, 106(5), 133–39.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey,
and J. Robins (2018): “Double/debiased machine learning for treatment and structural pa-
rameters,” The Econometrics Journal, 21, C1–C68.
De Palma, A., R. Lindsey, and G. Monchambert (2017): “The economics of crowding in
rail transit,” Journal of Urban Economics, 101, 106–122.
De Witte, A., C. Macharis, P. Lannoy, C. Polain, T. Steenberghen, and S. Van de
Walle (2006): “The impact of “free” public transport: The case of Brussels,” Transportation
Research Part A: Policy and Practice, 40(8), 671–689.
Desmaris, C. (2014): “The reform of passenger rail in Switzerland: More performance without
competition,” Research in Transportation Economics, 48, 290–297.
Frangakis, C., and D. Rubin (2002): “Principal Stratification in Causal Inference,” Biomet-
rics, 58, 21–29.
Hagenauer, J., and M. Helbich (2017): “A comparative study of machine learning classifiers
for modeling travel mode choice,” Expert Systems with Applications, 78, 273–282.
Heckman, J. J. (1976): “The Common Structure of Statistical Models of Truncation, Sample
Selection, and Limited Dependent Variables, and a Simple Estimator for such Models,” Annals
of Economic and Social Measurement, 5, 475–492.
Heckman, J. J. (1979): “Sample selection bias as a specification error,” Econometrica, 47, 153–161.
Huber, M. (2014): “Treatment evaluation in the presence of sample selection,” Econometric
Reviews, 33, 869–905.
Hünermund, P., J. Kaminski, and C. Schmitt (2021): “Causal Machine Learning and Business Decision Making.”
Imai, K. (2008): “Sharp bounds on the causal effects in randomized experiments with
‘truncation-by-death’,” Statistics & Probability Letters, 78, 144–149.
Imbens, G. W., and J. Angrist (1994): “Identification and Estimation of Local Average
Treatment Effects,” Econometrica, 62, 467–475.
Knaus, M. C. (2021): “A double machine learning approach to estimate the effects of musical
practice on student’s skills,” Journal of the Royal Statistical Society: Series A (Statistics in
Society), 184(1), 282–300.
Lee, D. S. (2009): “Training, Wages, and Sample Selection: Estimating Sharp Bounds on
Treatment Effects,” Review of Economic Studies, 76, 1071–1102.
Little, R., and D. Rubin (1987): Statistical Analysis with Missing Data. Wiley, New York.
Liu, L., and R.-C. Chen (2017): “A novel passenger flow prediction model using deep learning
methods,” Transportation Research Part C: Emerging Technologies, 84, 74–91.
Liu, Y., Z. Liu, and R. Jia (2019): “DeepPF: A deep learning based architecture for metro
passenger flow prediction,” Transportation Research Part C: Emerging Technologies, 101, 18–
Lüscher, R. (2020): “10 Jahre Sparbillette – Attraktive Preise und höhere Nachfrage für den öV” [10 years of supersaver tickets – attractive prices and higher demand for public transport], Discussion paper.
Mohring, H. (1972): “Optimization and scale economies in urban bus transportation,” The
American Economic Review, 62(4), 591–604.
Omrani, H. (2015): “Predicting travel mode of individuals by machine learning,” Transportation
Research Procedia, 10, 840–849.
Parry, I. W., and K. A. Small (2009): “Should urban transit subsidies be reduced?,” Amer-
ican Economic Review, 99(3), 700–724.
Paulley, N., R. Balcombe, R. Mackett, H. Titheridge, J. Preston, M. Wardman, J. Shires, and P. White (2006): “The demand for public transport: The effects of fares, quality of service, income and car ownership,” Transport Policy, 13(4), 295–306.
Pearl, J. (2000): Causality: Models, Reasoning, and Inference. Cambridge University Press.
Redman, L., M. Friman, T. Gärling, and T. Hartig (2013): “Quality attributes of public transport that attract car users: A research review,” Transport Policy, 25, 119–127.
Robins, J. M., and A. Rotnitzky (1995): “Semiparametric Efficiency in Multivariate Regres-
sion Models with Missing Data,” Journal of the American Statistical Association, 90, 122–129.
Robins, J. M., A. Rotnitzky, and L. Zhao (1994): “Estimation of Regression Coefficients
When Some Regressors Are not Always Observed,” Journal of the American Statistical Asso-
ciation, 90, 846–866.
Rotaris, L., and R. Danielis (2014): “The impact of transportation demand management
policies on commuting to college facilities: A case study at the University of Trieste, Italy,”
Transportation Research Part A: Policy and Practice, 67, 127–140.
Rubin, D. B. (1974): “Estimating Causal Effects of Treatments in Randomized and Nonran-
domized Studies,” Journal of Educational Psychology, 66, 688–701.
Rubin, D. B. (1976): “Inference and Missing Data,” Biometrika, 63, 581–592.
Semenova, V., and V. Chernozhukov (2020): “Debiased machine learning of conditional av-
erage treatment effects and other causal functions,” forthcoming in the Econometrics Journal.
Thao, V. T., W. von Arx, and J. Frölicher (2020): “Swiss cooperation in the travel and tourism sector: long-term relationships and superior performance,” Journal of Travel Research, 59(6), 1044–1060.
Tibshirani, J., S. Athey, R. Friedberg, V. Hadad, D. Hirshberg, L. Miner, E. Sver-
drup, S. Wager, and M. Wright (2020): “grf: Generalized Random Forests,” R package
version 1.2.0.
Tibshirani, R. (1996): “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B, 58, 267–288.
van der Laan, M. J., E. C. Polley, and A. E. Hubbard (2007): “Super Learner,” Statistical
Applications in Genetics and Molecular Biology, 6.
Wager, S., and S. Athey (2018): “Estimation and Inference of Heterogeneous Treatment
Effects using Random Forests,” Journal of the American Statistical Association, 113, 1228–
Wegelin, P. (2018): “Is the mere threat enough? An empirical analysis about competitive ten-
dering as a threat and cost efficiency in public bus transportation,” Research in Transportation
Economics, 69, 245–253.
Wright, M. N., and A. Ziegler (2017): “ranger: A fast implementation of random forests
for high dimensional data in C++ and R,” Journal of Statistical Software, 77, 1–17.
Yang, J.-C., H.-C. Chuang, and C.-M. Kuan (2020): “Double machine learning with gradi-
ent boosting and its application to the Big N audit quality effect,” Journal of Econometrics,
216(1), 268–283.
Yap, M., and O. Cats (2020): “Predicting disruptions and their passenger delay impacts for
public transport stops,” Transportation, pp. 1–29.
Zhang, J., R. Lindsey, and H. Yang (2018): “Public transit service frequency and fares
with heterogeneous users under monopoly and alternative regulatory policies,” Transportation
Research Part B: Methodological, 117, 190–208.
Zhang, J., and D. B. Rubin (2003): “Estimation of Causal Effects via Principal Stratification
When Some Outcomes are Truncated by ‘Death’,” Journal of Educational and Behavioral
Statistics, 4, 353–368.
Zhang, J., D. B. Rubin, and F. Mealli (2008): “Evaluating The Effects of Job Training
Programs on Wages through Principal Stratification,” in Advances in Econometrics: Mod-
elling and Evaluating Treatment Effects in Econometrics, ed. by D. Millimet, J. Smith, and
E. Vytlacil, vol. 21, pp. 117–145. Elsevier Science Ltd.
A Propensity score plots
Figure A.1: Propensity score estimates in the higher discount category (D ≥ 0.3)
Figure A.2: Propensity score estimates in the lower discount category (D < 0.3)
B Further tables
Table B.1: Predictive outcome analysis, D < 0.3
demand shift upselling additional trip
variable importance variable importance variable importance
departure time 37.33 capacity utilization 41.387 seat capacity 25.342
seat capacity 27.871 offer level D 27.669 age 21.639
capacity utilization 26.508 age 22.145 capacity utilization 20.168
distance 26.31 D 19.077 distance 18.970
age 26.223 offer level C 17.324 departure time 18.527
D 25.08 departure time 16.538 D 18.076
number of sub-journeys 17.403 distance 15.897 ticket purchase complexity 16.637
offer level C 15.643 offer level B 15.696 offer level C 12.085
diff. purchase travel 15.299 rel. sold level B 10.992 rel. sold level B 11.709
ticket purchase complexity 15.116 number of sub-journeys 10.019 offer level D 11.641
rel. sold level B 15.03 diff. purchase travel 9.573 number of sub-journeys 11.347
offer level D 15.012 rel. sold level C 9.367 diff. purchase travel 10.328
rel. sold level C 14.625 offer level A 7.857 offer level B 10.185
offer level B 14.413 rel. sold level D 7.33 rel. sold level C 8.993
rel. sold level A 11.856 ticket purchase complexity 7.319 rel. sold level A 8.162
offer level A 11.329 offer level E 7.183 offer level A 7.643
rel. amount imputed values 10.079 rel. sold level A 6.769 class 7.381
rel. sold level D 9.625 rel. amount imputed values 5.422 rel. sold level D 6.964
adult companions 8.503 adult companions 4.881 rel. amount imputed values 6.189
offer level E 6.511 rush hour 4.143 adult companions 5.785
gender 5.214 leisure 3.602 leisure 5.398
leisure 5.154 gender 3.599 offer level E 4.866
destination Geneva Airport 4.83 2019 3.597 gender 4.692
departure Zuerich 4.736 travel alone 3.047 German 3.801
class 4.598 Friday 2.92 half fare travel ticket 3.686
travel alone 4.59 German 2.825 travel alone 3.540
peak hour 4.545 French 2.479 French 3.493
Friday 4.524 departure Zuerich 2.429 Friday 3.419
German 4.522 destination Zuerich Airport 2.427 half fare 3.163
amount purchased tickets 4.349 scheme 20 2.427 2019 3.136
correct prediction rate 0.555 0.772 0.605
balanced sample size 1642 1140 1202
Notes: ‘Diff. purchase travel’ denotes the difference between purchase and travel day. ‘Rel. sold level A’, ‘rel. sold level B’, ‘rel. sold level C’ and ‘rel. sold level D’ denote the relative amount of supersaver tickets sold with discount A, B, C and D, respectively, in relation to the seats offered. ‘Offer level A’, ‘offer level B’, ‘offer level C’, ‘offer level D’ and ‘offer level E’ denote the amount of supersaver tickets offered with discount A, B, C, D and E, respectively. ‘No subscription’ indicates not possessing any subscription. For predicting upselling, the covariates ‘class’ and ‘seat capacity’ are dropped.
Table B.2: Predictive outcome analysis, D0.3
demand shift upselling additional trip
variable importance variable importance variable importance
departure time 114 seat capacity 246.396 capacity utilization 133.936
seat capacity 95.799 offer level B 178.212 age 105.107
age 95.209 offer level C 127.327 departure time 100.091
capacity utilization 89.422 D 100.618 capacity utilization 97.889
distance 85.503 offer level A 88.947 distance 85.647
D 80.447 time of day in min 82.658 D 83.671
diff. purchase travel 69.276 age 78.886 offer level B 73.399
offer level B 68.75 distance 73.885 offer level A 69.936
offer level A 65.766 offer level D 72.452 diff. purchase travel 67.823
offer level C 60.513 diff. purchase travel 55.321 offer level C 64.689
number of sub-journeys 57.767 number of sub-journeys 48.622 number of sub-journeys 58.100
offer level D 44.626 rel. sold level A 38.997 class 54.857
ticket purchase complexity 44.434 ticket purchase complexity 31.991 ticket purchase complexity 49.348
rel. sold level A 39.796 offer level E 27.041 offer level D 46.685
rel. amount imputed values 32.144 rel. amount imputed values 25.586 rel. sold level A 42.589
adult companions 25.925 adult companions 24.73 rel. amount imputed values 35.411
rel. sold level B 20.536 rel. sold level B 17.099 half fare 31.308
gender 18.629 gender 15.835 adult companions 28.732
offer level E 17.521 2019 15.416 half fare travel ticket 23.900
travel alone 15.15 Saturday 14.256 gender 20.562
amount purchased tickets 14.939 amount purchased tickets 13.396 rel. sold level B 20.036
German 14.855 rush hour 12.947 offer level E 18.595
French 14.415 German 12.878 German 16.257
2019 14.387 leisure 12.344 amount purchased tickets 15.847
Sunday 13.821 travel alone 12.28 leisure 15.500
destination Zuerich Airport 13.387 half fare 11.882 no subscription 15.434
Saturday 13.378 Friday 11.646 travel alone 15.132
class 13.27 scheme 20 11.559 Swiss 14.951
half fare 13.258 French 11.477 Saturday 14.613
rel. amount imputed values 13.048 Sunday 11.39 2019 14.279
correct prediction rate 0.589 0.809 0.629
balanced sample size 5320 5598 5798
Notes: ‘Diff. purchase travel’ denotes the difference between purchase and travel day. ‘Rel. sold level A’, ‘rel. sold level B’, ‘rel. sold level C’ and ‘rel. sold level D’ denote the relative amount of supersaver tickets sold with discount A, B, C and D, respectively, in relation to the seats offered. ‘Offer level A’, ‘offer level B’, ‘offer level C’, ‘offer level D’ and ‘offer level E’ denote the amount of supersaver tickets offered with discount A, B, C, D and E, respectively. ‘No subscription’ indicates not possessing any subscription. For predicting upselling, the covariates ‘class’ and ‘seat capacity’ are dropped.
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
This paper provides estimation and inference methods for the best linear predictor (approximation) of a structural function, such as conditional average structural and treatment effects, and structural derivatives, based on modern machine learning (ML) tools. We represent this structural function as a conditional expectation of an unbiased signal that depends on a nuisance parameter, which we estimate by modern machine learning techniques. We first adjust the signal to make it insensitive (Neyman-orthogonal) with respect to the first-stage regularization bias. We then project the signal onto a set of basis functions, which grows with sample size, to get the best linear predictor of the structural function. We derive a complete set of results for estimation and simultaneous inference on all parameters of the best linear predictor, conducting inference by Gaussian bootstrap. When the structural function is smooth and the basis is sufficiently rich, our estimation and inference results automatically targets this function. When basis functions are group indicators, the best linear predictor reduces to the group average treatment/structural effect, and our inference automatically targets these parameters. We demonstrate our method by estimating uniform confidence bands for the average price elasticity of gasoline demand conditional on income.
Disruptions in public transport can have major implications for passengers and service providers. Our study objective is to develop a generic approach to predict how often different disruption types occur at different stations of a public transport network, and to predict the impact related to these disruptions as measured in terms of passenger delays. We propose a supervised learning approach to perform these predictions, as this allows for predictions for individual stations for each time period, without the requirement of having sufficient empirical disruption observations available for each location and time period. This approach also enables a fast prediction of disruption impacts for a large number of disruption instances, hence addressing the computational challenges that arise when typical public transport assignment or simulation models would be used for real-world public transport networks. To improve transferability of our study results, we cluster stations based on their contribution to network vulnerability using unsupervised learning. This helps public transport agencies apply the appropriate type of measure aimed to reduce disruptions or to mitigate disruption impacts for each station type. Applied to the Washington metro network, we predict a yearly passenger delay of 5.9 million hours for the total metro network. Based on the clustering, five different types of station are distinguished. Stations with high train frequencies and high passenger volumes located at central trunk sections of the network prove to be most critical, along with start/terminal and transfer stations. Intermediate stations located at branches of a line are least critical.
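The two-stage pipeline sketched in this abstract (supervised prediction of disruption impact, then unsupervised clustering of stations) can be mimicked with off-the-shelf scikit-learn components. The features, target, and cluster count below are hypothetical stand-ins, not the paper's actual variables.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n = 500

# Hypothetical station-period features: train frequency, passenger volume,
# transfer-station flag; target: passenger delay hours from disruptions.
X = np.column_stack([
    rng.integers(4, 30, n),        # trains per hour
    rng.integers(100, 20000, n),   # passengers per period
    rng.integers(0, 2, n),         # transfer station (0/1)
]).astype(float)
delay = 0.01 * X[:, 0] * X[:, 1] / 1000 + rng.normal(0, 1, n)

# Supervised step: predict disruption impact per station/period.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, delay)
pred = model.predict(X)

# Unsupervised step: cluster stations by their predicted contribution to
# network vulnerability (3 clusters here instead of the paper's 5).
stations = np.column_stack([X[:, 0], X[:, 1], pred])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(stations)
print(np.bincount(labels))
```

In a real application the model would be trained on logged disruption data and predictions made per station and time period, as the abstract describes.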
This study investigates the dose–response effects of making music on youth development. Identification is based on the conditional independence assumption and estimation is implemented using a recent double machine learning estimator. The study proposes solutions to two questions of high practical relevance that arise for these new methods: (i) How to investigate sensitivity of estimates to tuning parameter choices in the machine learning part? (ii) How to assess covariate balancing in high‐dimensional settings? The results show that improvements in objectively measured cognitive skills require at least medium intensity, while improvements in school grades are already observed for low intensity of practice.
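Question (ii) above, assessing covariate balance, is commonly answered with standardized mean differences before and after reweighting. Below is a minimal numpy sketch on simulated data with a known propensity score; the data-generating process and function name are illustrative, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2000, 5

# Simulated covariates and a treatment that depends on the first covariate,
# so the raw treated/control groups are imbalanced in that dimension.
X = rng.normal(size=(n, p))
prop = 1 / (1 + np.exp(-X[:, 0]))      # true propensity score
D = rng.binomial(1, prop)

def smd(X, D, w=None):
    """Absolute standardized mean difference per covariate, optionally weighted."""
    if w is None:
        w = np.ones(len(D))
    w1, w0 = w * D, w * (1 - D)
    m1 = X.T @ w1 / w1.sum()
    m0 = X.T @ w0 / w0.sum()
    s = np.sqrt((X[D == 1].var(axis=0) + X[D == 0].var(axis=0)) / 2)
    return np.abs(m1 - m0) / s

raw = smd(X, D)
ipw = smd(X, D, w=D / prop + (1 - D) / (1 - prop))
print(raw.round(2), ipw.round(2))
```

The first covariate is strongly imbalanced in the raw comparison and close to balanced after inverse-probability weighting, which is the diagnostic pattern one looks for.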
In this paper, we study the double machine learning (DML) approach of Chernozhukov et al. (2018) for estimating average treatment effect and apply this approach to examine the Big N audit quality effect in the accounting literature. This approach relies on machine learning methods and is suitable when a high dimensional nuisance function with many covariates is present in the model. This approach does not suffer from the “regularization bias” when a learning method with a proper convergence rate is used. We demonstrate by simulations that, for the DML approach, the gradient boosting method is fairly robust and preferable to other methods, such as regression tree, random forest, support vector regression machine, and the conventional Nadaraya–Watson nonparametric estimator. We then apply the DML approach with gradient boosting to estimate the Big N effect. We find that Big N auditors have a positive effect on audit quality and that this effect is not only statistically significant but also economically important. We further show that, in contrast to the results of propensity score matching, our estimates of said effect are quite robust to the hyper-parameters in the gradient boosting algorithm.
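The DML recipe with gradient-boosted nuisances can be sketched for the partially linear model: cross-fit E[Y|X] and E[D|X], then regress outcome residuals on treatment residuals. This is a generic illustration on simulated data with a known effect of 2, not the paper's Big N application.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
n = 2000

# Simulated partially linear model with a true treatment effect of 2.
X = rng.normal(size=(n, 3))
e = 1 / (1 + np.exp(-X[:, 0]))        # propensity depends on the covariates
D = rng.binomial(1, e).astype(float)
Y = 2 * D + X[:, 0] + rng.normal(0, 1, n)

# Cross-fitting: out-of-fold predictions of E[Y|X] and E[D|X] with
# gradient boosting avoid overfitting bias in the residuals.
m_y = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=2)
m_d = cross_val_predict(GradientBoostingRegressor(), X, D, cv=2)

# Residual-on-residual regression yields the Neyman-orthogonal estimate.
ry, rd = Y - m_y, D - m_d
theta = (rd @ ry) / (rd @ rd)
print(round(theta, 2))  # close to the true effect of 2
```

Swapping `GradientBoostingRegressor` for a regression tree or random forest here is how a robustness comparison like the one in the abstract would be run.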
Despite a growing body of research on the interface and relationship between transport and tourism, this research area remains underdeveloped. Using Switzerland as a case study, the present study aims to investigate the level of integration between public transport and tourism companies, the enablers of their long-term cooperative relationship and outstanding performance, seen from the perspective of the public transport companies. A mixed methods approach is used to provide greater insights into how these companies cooperate with each other. Our findings suggest that public transport companies adopt different cooperative strategies with different types of partners. They are able to maintain long-term cooperative relationships due to strong cooperation in sales, a long tradition of cooperation, a high degree of involvement in national public organizations, and their central focus on the customer. Type of partner, sales, product design and pricing, and service provision have statistically significant effects on cooperative performance.
We propose generalized random forests, a method for nonparametric statistical estimation based on random forests (Breiman [Mach. Learn. 45 (2001) 5–32]) that can be used to fit any quantity of interest identified as the solution to a set of local moment equations. Following the literature on local maximum likelihood estimation, our method considers a weighted set of nearby training examples; however, instead of using classical kernel weighting functions that are prone to a strong curse of dimensionality, we use an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest. We propose a flexible, computationally efficient algorithm for growing generalized random forests, develop a large sample theory for our method showing that our estimates are consistent and asymptotically Gaussian and provide an estimator for their asymptotic variance that enables valid confidence intervals. We use our approach to develop new methods for three statistical tasks: nonparametric quantile regression, conditional average partial effect estimation and heterogeneous treatment effect estimation via instrumental variables. A software implementation, grf for R and C++, is available from CRAN.
Many scientific and engineering challenges---ranging from personalized medicine to customized marketing recommendations---require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. Given a potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms that, to our knowledge, is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially as the number of covariates increases.
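For context on the baselines causal forests improve upon, here is the simpler "T-learner": separate random forests fit to treated and control units, with the effect estimated as the difference of their predictions. This is deliberately not the honest causal forest algorithm of the paper, just a classical comparison point on simulated data with a known effect function.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 3000

# Randomized treatment and a known heterogeneous effect tau(x) = max(x1, 0).
X = rng.uniform(-1, 1, (n, 2))
D = rng.binomial(1, 0.5, n)
tau = np.maximum(X[:, 0], 0)
Y = X[:, 1] + tau * D + rng.normal(0, 0.5, n)

# T-learner: one outcome forest per treatment arm.
f1 = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
f0 = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
f1.fit(X[D == 1], Y[D == 1])
f0.fit(X[D == 0], Y[D == 0])

# Effect estimates at two test points: tau is about 0.8 at the first, 0 at the second.
X_test = np.array([[0.8, 0.0], [-0.8, 0.0]])
tau_hat = f1.predict(X_test) - f0.predict(X_test)
print(tau_hat.round(2))
```

Causal forests instead build the splitting criterion around effect heterogeneity directly and use honest subsampling, which is what yields the pointwise consistency and valid confidence intervals described in the abstract.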
We present a model of public transit service under monopoly when potential users differ in their willingness to pay and value of time. The transit operator chooses service frequency and the fare to maximize a weighted sum of profit and consumers’ surplus. Profit-maximizing and social-surplus-maximizing frequency decisions are compared using a unified framework that includes results of previous studies as special cases. The prevalence of the Mohring Effect and the need for subsidization are investigated. Four types of regulatory policies are then considered: fare regulation, frequency regulation, goal or objective function regulation, and fiscal regulation whereby the operator receives a subsidy based on consumers’ surplus or demand. A numerical example is used to assess the relative efficiency of the regulatory regimes, and illustrate how the solutions depend on the joint distribution of willingness to pay and value of time.
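The fare/frequency trade-off in this framework can be illustrated with a toy numerical example: linear demand in the generalized price (fare plus waiting time valued at a value of time), an operating cost rising with frequency, and a welfare weight w on consumers' surplus in the operator's objective. All functional forms and parameter values below are illustrative, not taken from the paper.

```python
import numpy as np

# Stylized parameters (hypothetical): demand intercept/slope, value of
# time per hour, cost per unit of service frequency, welfare weight.
a, b = 100.0, 2.0
v = 20.0
k = 50.0
w = 0.5

def objective(p, f, w):
    wait = 0.5 / f                      # expected wait = half the headway
    q = max(a - b * (p + v * wait), 0)  # linear demand in generalized price
    profit = p * q - k * f
    cs = q ** 2 / (2 * b)               # consumers' surplus under linear demand
    return profit + w * cs

# Grid search over fare and frequency for the weighted objective.
fares = np.linspace(0, 50, 201)
freqs = np.linspace(1, 20, 191)
best, p_star, f_star = max((objective(p, f, w), p, f)
                           for p in fares for f in freqs)
print(round(float(p_star), 2), round(float(f_star), 2))
```

Raising w toward pure surplus maximization pushes the optimal fare down (toward marginal cost, here zero) while the optimal frequency reflects the waiting-time externality, mirroring the profit-versus-welfare comparison in the abstract.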