
Business analytics meets artiﬁcial intelligence:

Assessing the demand eﬀects of discounts on Swiss train tickets

Martin Huber*, Jonas Meier**, and Hannes Wallimann+

*University of Fribourg, Dept. of Economics and

Center for Econometrics and Business Analytics, St. Petersburg State University

**University of Bern, Dept. of Economics

+University of Applied Sciences and Arts Lucerne, Competence Center for Mobility

Abstract: We assess the demand eﬀects of discounts on train tickets issued by the Swiss Federal Railways, the so-called

‘supersaver tickets’, based on machine learning, a subﬁeld of artiﬁcial intelligence. Considering a survey-based sample of

buyers of supersaver tickets, we investigate which customer- or trip-related characteristics (including the discount rate)

predict buying behavior, namely: booking a trip otherwise not realized by train, buying a ﬁrst- rather than second-class

ticket, or rescheduling a trip (e.g. away from rush hours) when being oﬀered a supersaver ticket. Predictive machine learning

suggests that customer’s age, demand-related information for a speciﬁc connection (like departure time and utilization),

and the discount level permit forecasting buying behavior to a certain extent. Furthermore, we use causal machine learning

to assess the impact of the discount rate on rescheduling a trip, which seems relevant in the light of capacity constraints at

rush hours. Assuming that (i) the discount rate is quasi-random conditional on our rich set of characteristics and (ii) the

buying decision increases weakly monotonically in the discount rate, we identify the discount rate’s eﬀect among ‘always

buyers’, who would have traveled even without a discount, based on our survey that asks about customer behavior in the

absence of discounts. We ﬁnd that on average, increasing the discount rate by one percentage point increases the share

of rescheduled trips by 0.16 percentage points among always buyers. Investigating eﬀect heterogeneity across observables

suggests that the effects are higher for leisure travelers and during peak hours when controlling for several other characteristics.

Keywords: Causal Machine Learning, Double Machine Learning, Treatment Eﬀect, Business Analytics, Causal Forest, Public Transportation.

JEL classiﬁcation: C21, R41, R48.

Acknowledgments: We are grateful to the SBB Research Fund for ﬁnancial support. Furthermore, we are indebted to Pierre Chevallier and

Philipp Wegelin for their helpful discussions. Addresses for correspondence: Martin Huber, University of Fribourg, Bd. de Pérolles 90, 1700 Fribourg, Switzerland, martin.huber@unifr.ch; Jonas Meier, University of Bern, Schanzeneckstrasse 1, 3001 Bern, Switzerland, jonas.meier@vwi.unibe.ch; Hannes Wallimann, University of Applied Sciences and Arts Lucerne, Rösslimatte 48, 6002 Luzern, Switzerland, hannes.wallimann@hslu.ch.

arXiv:2105.01426v1 [econ.GN] 4 May 2021

1 Introduction

Organizing public transport involves a well-known trade-oﬀ between consumer welfare and

provider revenue. Typically, consumers value frequency, reliability, space, and low fares (Red-

man, Friman, G¨

arling, and Hartig,2013) while suppliers aim at operating with a minimum

number of vehicles to maximize proﬁts. In general, the allocation can be improved as providers

do not account for the positive externalities on consumers (Mohring,1972). In particular, service

frequency reduces travelers’ access and waiting costs. This so-called ‘Mohring-eﬀect’ leads to

economies of scale, implying the need for subsidies to achieve the ﬁrst-best solution in terms of

welfare. Consequently, it may be socially optimal to subsidize railway companies to reduce fares

(Parry and Small, 2009). To assess such a measure's effectiveness on demand, policy-makers

would need to know how individuals respond to lower fares. However, it is generally challenging

to identify causal effects of discounts on train tickets (or goods and services in general) due to confounding or selection. For instance, discounts might typically be provided for dates or hours

with low train utilization such that connections with and without discount are not comparable

in terms of baseline demand. A naive comparison of sold tickets with and without discount

would therefore mix the influence of the discount with that of baseline demand. In this context, we apply machine learning (a subfield of artificial intelligence) to convincingly assess how

discounts on train tickets for long-distance connections in Switzerland, the so-called ‘supersaver

tickets’, aﬀect demand, by exploiting a unique data set of the Swiss Federal Railways (SBB)

that combines train utilization records with a survey of supersaver buyers.

More specifically, our study provides two use cases of machine learning for business analytics

in the railway industry: (i) Predicting buying behavior among supersaver customers, namely

whether customers booked a trip otherwise not realized by train (additional trip), bought a

ﬁrst-class rather than a second-class ticket (upselling), or rescheduled their trip e.g. away from

rush hours (demand shift); (ii) analyzing the causal effect of the discount on demand shifts

among customers who would have booked the trip even without a discount. This is feasible

because our unique survey contains information on how supersaver buyers would have decided

in the absence of a discount, e.g. whether they are so-called ‘always buyers’ and would have

booked the connection even at the regular fare. For both prediction and causal analysis, we

make use of appropriately tailored machine learning techniques, which learn the associations between the demand outcomes of interest, the discount rate, and further customer- or trip-related characteristics in a data-driven way and help avoid model misspecification. Such a

targeted combination of predictive and causal machine learning can therefore improve demand

forecasting and decision-making in companies and organizations. While predictive machine

learning permits optimizing forecasts about demand and customer behavior as a function of

observed characteristics, causal machine learning permits evaluating the causal eﬀect of speciﬁc

interventions like a discount regime for optimizing the oﬀer of such discounts. Concerning the

prediction task, we use the so-called random forest, see Breiman (2001), as a machine learner to

forecast the supersaver customers’ behavior and obtain an accuracy or correct (out of sample)

classiﬁcation rate of 58% (demand shift), 65% (additional trip), and 82% (upselling), respectively.

Trip-related characteristics like seat capacity, utilization, departure time, and the discount rate,

but also customers' age turn out to be strong predictors.
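The prediction task above can be sketched as follows. This is a minimal illustration of out-of-sample classification with a random forest, not the authors' code: the feature set and the synthetic data-generating process are hypothetical stand-ins for the survey variables.

```python
# Illustrative sketch: classify a binary buying-behavior outcome (e.g. 'demand
# shift') with a random forest and report out-of-sample accuracy. Feature names
# and the synthetic data are hypothetical, not the paper's actual data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.uniform(0, 0.7, n),    # discount rate (up to 70%)
    rng.integers(18, 80, n),   # customer age
    rng.uniform(0, 1, n),      # capacity utilization
    rng.integers(0, 24, n),    # departure hour
])
# Synthetic outcome depending on discount and utilization (for illustration only)
p = 1 / (1 + np.exp(-(2 * X[:, 0] + X[:, 2] - 1)))
y = rng.binomial(1, p)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))  # out-of-sample classification rate
```

The out-of-sample accuracy computed this way is the analogue of the 58%/65%/82% rates reported for the three outcomes.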

Concerning the causal analysis (which is more challenging than mere prediction), we impose

(i) a selection-on-observables assumption implying that the discount rate is as good as randomly

assigned when controlling for our rich set of trip- and demand-related characteristics and (ii)

weak monotonicity of any individual’s decision to purchase an additional trip (otherwise not

realized) in the discount rate, implying that a higher (rather than lower) discount either positively affects or does not affect any customer's buying decision. As a methodological contribution, we

formally show how these assumptions permit tackling the selectivity of discount rates and survey

response to identify the discount rate’s eﬀect on demand shifts (rescheduling away from rush

hours) for the subgroup ‘always buyers’, based on the survey information on how customers

would have behaved in the absence of a discount. Furthermore, we discuss testable implications

of monotonicity, namely that among all survey respondents, the share of additional trips must

increase in the discount rate, and of the selection-on-observables assumption, requiring that conditional on trip- and demand-related characteristics, the discount must not be associated with

personal characteristics (like age or gender) among always buyers. Hypothesis tests do not point

to the violation of these implications.
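The first testable implication can be sketched numerically. The snippet below, a hedged illustration on synthetic data (not the paper's test), checks whether the share of additional trips increases in the discount rate via the slope of a linear probability model; the positive relation is built into the simulated data.

```python
# Hedged sketch of the monotonicity check: among survey respondents, the share
# of additional trips should increase in the discount rate. Synthetic data with
# a built-in positive relation; variable names are ours, not the paper's.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
discount = rng.uniform(0.1, 0.7, n)           # discount rate
p_additional = 0.2 + 0.5 * discount           # true positive relation (assumed)
additional_trip = rng.binomial(1, p_additional)

# Slope of a linear probability model of the additional-trip indicator
slope, intercept = np.polyfit(discount, additional_trip, 1)
```

In practice one would complement the point estimate with a one-sided hypothesis test on the slope, as the paper does.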

Based on our causal identification strategy, we estimate the marginal effect of slightly increasing the (continuously distributed) discount rate based on the causal forest (CF), see Wager and

Athey (2018) and Athey, Tibshirani, and Wager (2019), and ﬁnd that on average, increasing the

discount rate by one percentage point increases the share of rescheduled trips by 0.16 percentage


points among always buyers. In a second approach, we binarize the discount rates by splitting

them into two discount categories of less than 30% (relative to the regular fare) and 30% or

more. Applying double machine learning (DML), see Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018), we find that discount rates of 30% and more on average increase the share of rescheduled trips by 3.6 percentage points, which is in line with the CF-based results. Our paper therefore provides the first empirical evidence (at least for Switzerland) that such discounts can help balance out train utilization across time and reduce overload during peak hours, although the magnitude of the impact on always buyers appears limited.
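The DML logic can be illustrated with a minimal partialling-out estimator in the spirit of Chernozhukov et al. (2018): predict the outcome and the treatment from the controls with a machine learner under cross-fitting, then regress outcome residuals on treatment residuals. This is a sketch on synthetic data with an assumed true effect, not the authors' implementation.

```python
# Minimal DML partialling-out sketch with 2-fold cross-fitting (synthetic data;
# the true effect theta = 0.5 is assumed for illustration only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 5))                # confounders (e.g. utilization, month)
D = 0.5 * X[:, 0] + rng.normal(size=n)     # treatment depends on confounders
theta = 0.5                                # assumed true effect
Y = theta * D + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

res_Y, res_D = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    m = RandomForestRegressor(n_estimators=100, random_state=0)  # E[Y|X]
    g = RandomForestRegressor(n_estimators=100, random_state=0)  # E[D|X]
    m.fit(X[train], Y[train]); g.fit(X[train], D[train])
    res_Y[test] = Y[test] - m.predict(X[test])   # outcome residuals
    res_D[test] = D[test] - g.predict(X[test])   # treatment residuals

theta_hat = np.sum(res_D * res_Y) / np.sum(res_D ** 2)  # residual-on-residual OLS
```

Cross-fitting (estimating the nuisance functions on one fold, residualizing the other) is what allows flexible learners without overfitting bias in the effect estimate.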

When investigating the heterogeneity of eﬀects across all of our observed characteristics using

the CF, our results suggest that demand-related trip characteristics (like seat capacity, utilization, departure time, and distance) have some predictive power for the size of the discounts'

impact on shifting demand. Such information on heterogeneous eﬀects appears interesting for

optimizing the allocation of discounts for the purpose of shifting demand, as the SBB has (due to

its monopoly in the Swiss long-distance passenger rail market) agreed with the Swiss price monitoring agency to provide a fixed amount of discounted tickets per year, but is free to choose the

timing and connections for discounts. In a second heterogeneity analysis, we investigate whether

effects differ systematically across a pre-selected set of characteristics, namely: age, gender, possession of a half fare travel card, travel distance, whether the purpose is business, commute, or

leisure, and whether the departure time is during peak hours. Using the regression approach

of Semenova and Chernozhukov (2020), we ﬁnd that conditional on the other characteristics,

the eﬀects of increasing the discount by one percentage point on rescheduling are by more than

0.2 percentage points higher during peak hours and for leisure travelers, diﬀerences that are

statistically significant at the 10% level when, however, not controlling for multiple hypothesis

testing. These eﬀects appear plausible as leisure travelers are likely more ﬂexible and discounts

during peak hours make trips at times of increased demand even more attractive. We do not ﬁnd

statistically signiﬁcant eﬀect diﬀerences for the other pre-selected characteristics, which could,

however, be due to the (for the purpose of investigating eﬀect heterogeneity) limited sample of

several thousand observations.
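The second heterogeneity analysis can be sketched as a best-linear-predictor regression in the spirit of Semenova and Chernozhukov (2020): regress individual-level effect signals on the pre-selected moderators by OLS. The effect scores below are synthetic placeholders; in the paper they would be orthogonal scores from the CF/DML step.

```python
# Illustrative sketch: OLS of (hypothetical) individual effect scores on
# pre-selected moderators (peak hour, leisure), yielding a best linear
# predictor of effect heterogeneity. Scores and coefficients are synthetic.
import numpy as np

rng = np.random.default_rng(3)
n = 3000
peak_hour = rng.binomial(1, 0.3, n)
leisure = rng.binomial(1, 0.6, n)
# Synthetic scores: effects assumed higher during peak hours and for leisure
scores = 0.16 + 0.2 * peak_hour + 0.2 * leisure + rng.normal(0, 0.5, n)

Z = np.column_stack([np.ones(n), peak_hour, leisure])   # intercept + moderators
beta, *_ = np.linalg.lstsq(Z, scores, rcond=None)       # BLP coefficients
```

The coefficients on the moderators play the role of the "more than 0.2 percentage points higher" effect differences reported above; inference would additionally require (heteroscedasticity-robust) standard errors.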

Our paper is related to a growing literature applying statistical and machine learning methods for analyzing transport systems, as well as to methodological studies on causal inference for so-called principal strata, see Frangakis and Rubin (2002), i.e. endogenous subgroups like the always buyers. Typically, it is hard to identify the causal effect of some treatment (or intervention) like a discount on such a non-randomly selected subgroup defined in terms of how a post-treatment variable (e.g. the buying decision) depends on the treatment (e.g. the discount). One approach is to

give up on point identiﬁcation and instead derive upper and lower bounds on a set of possible

effects for groups like the always buyers based on the aforementioned monotonicity assumption

(and possibly further assumptions about the ordering of outcomes of always buyers and other

individuals), see for instance Zhang and Rubin (2003), Zhang, Rubin, and Mealli (2008), Imai

(2008), Lee (2009), and Blanco, Flores, and Flores-Lagunes (2011). Alternatively, the treatment

effect on always buyers is point-identified when invoking a selection-on-observables or instrumental variable assumption for selection into the survey, see for instance Huber (2014), which

requires suﬃciently rich data on both survey participants and non-participants for modeling

survey participation. In contrast to these previous studies, the approach in this paper point-identifies the treatment effect by exploiting the rather unique survey feature that customers were

asked about their behavior in the absence of the discount, which under monotonicity permits

identifying the principal stratum of always buyers directly in the data.

Furthermore, our work is related to conceptual studies on transport systems, considering for

instance the previously mentioned positive externalities of an increased service for customers

that are not accounted for by transportation providers. Such externalities typically arise from

economies of scale due to ﬁxed costs and a ’Mohring eﬀect’, implying that service frequency

reduces waiting costs (Mohring, 1972). The study by Parry and Small (2009) suggests that

lower fares can boost overall welfare by increasing economies of scale (oﬀ-peak) and decreasing

pollution and accidents (at peaks). Similarly, De Palma, Lindsey, and Monchambert (2017) argue

that time-dependent ticket prices may increase overall welfare as overcrowding during peak hours

is suboptimal for both consumers and providers. As public transport is usually highly subsidized,

governments may directly manage the trade-oﬀ mentioned above. As this involves taxpayer

money, it is a question of general interest how the subsidies should be designed. Based on their

results, Parry and Small (2009) conclude that even substantial subsidies are justiﬁed due to

lower fares’ positive welfare eﬀect. In contrast, Basso and Silva (2014) ﬁnd that the contribution

of transit subsidies to welfare diminishes once congestion is taxed and alternatives are available,

such as bus lanes. Irrespective of the specific policy instrument, the consumer's willingness to shift

demand drives these policies’ eﬀectiveness. While many factors aﬀect this willingness, most


studies conclude that consumers are price sensitive (Paulley, Balcombe, Mackett, Titheridge,

Preston, Wardman, Shires, and White, 2006). In this context, we aim at contributing to a

better understanding of how time-dependent pricing translates to consumer decisions.

More broadly, our paper relates to the literature on policies targeting demand shifts. Among

these, the setting of car parking costs, ﬁscal regulations, or even free public transport has

been analyzed (e.g. Batty, Palacin, and González-Gil, 2015; Rotaris and Danielis, 2014; Zhang, Lindsey, and Yang, 2018; De Witte, Macharis, Lannoy, Polain, Steenberghen, and Van de Walle, 2006). Another stream of literature applies machine learning algorithms in the context of public

transport. Examples are short-term traﬃc ﬂow forecasts for bus rapid transit (Liu and Chen,

2017) or metro (Liu, Liu, and Jia, 2019) services. Further, Hagenauer and Helbich (2017) and

Omrani (2015) implement machine learning algorithms to predict travel mode choices. Yap and

Cats (2020) predict disruptions and their passenger delay impacts for public transport stops. In

other research fields, applications of causal (rather than predictive) machine learning are also on the rise (see for instance Yang, Chuang, and Kuan, 2020; Knaus, 2021). This is, to the best of

our knowledge, the ﬁrst study using causal machine learning in the context of public transport.

Finally, a growing literature discusses the opportunities of data-driven business decision-making

(Brynjolfsson and McElheran, 2016) by assessing the relevance of predictive and causal machine learning. Ascarza (2018) and Hünermund, Kaminski, and Schmitt (2021) show that companies

may gain by designing their policies based on causal machine learning. For instance, ﬁrms can

target the relevant consumers much more eﬀectively when accounting for their heterogeneity in

terms of reaction to a treatment. Our study provides a use case of how the machine learning-

based assessment of discounts could be implemented also in other businesses and industries

facing capacity constraints.

This paper proceeds as follows. Section 2 presents the institutional setting of passenger railway transport in Switzerland. Section 3 describes our data, coming from a unique combination of a customer survey and transport utilization data. Section 4 discusses the identifying assumptions underlying the causal machine learning approach as well as testable implications. Section 5 outlines the predictive and causal machine learning methods. Section 6 presents the empirical results. Section 7 concludes.

2 Institutional Background

The railway system in Switzerland is known for its high quality of service. Examples include the

high level of system integration with frequent services, synchronized timetables, and comprehensive fare integration, see Desmaris (2014). In Switzerland, a country of railway tradition, the state-owned incumbent Swiss Federal Railways (SBB) operates the long-distance passenger rail market as a monopolist (Thao, von Arx, and Frölicher, 2020). Furthermore, nationally operating

long-distance coaches may only be approved if they do not ‘substantially’ compete with existing

services. Thus, the SBB competes exclusively with motorized private transport in Swiss long-distance traffic. The company also owns most of the rail infrastructure, which is funded by the Federal Government. However, since the end of 2020 the companies Berne-Lötschberg-Simplon Railways (BLS) and Southeast Railways (SOB) have operated a few links on behalf of the SBB. Unlike regional public transport, which Swiss taxpayers subsidize with approximately CHF 1.9 bn per year, long-distance public transport has to be self-sustaining (Wegelin, 2018).

Because of the monopoly position of the SBB in long distance passenger transport, the

prices are screened by the Swiss ‘price watchdog’ (or price monitoring agency) to prevent abuse.

Based on the price monitoring act, the watchdog keeps a permanent eye on how prices and proﬁts

develop. By the end of 2014, the watchdog concluded that the SBB charged excessively high prices. As

a consequence and through a mutual agreement, the SBB and the Swiss price watchdog agreed

on a signiﬁcantly higher supply of supersaver tickets, which were ﬁrst oﬀered in 2009. Using a

supersaver ticket, customers can travel on long distance public transport routes with a discount

of up to 70%. Thereafter, additional agreements were regularly reached regarding number and

scope of the supersaver tickets. While only a few thousand supersaver tickets were sold in 2014,

sales increased to about 8.8 million in 2019, see Lüscher (2020).

From the SBB’s perspective, these tickets can serve two purposes. First, the tickets might

be used as a means to balance out the utilization of transport services. For instance, supersaver tickets could reduce the high demand during peak hours, which is a key challenge for public

transport. Thus, balancing the demand may reduce delays and increase the number of free seats

which is valued by the consumers. The average seat load of the SBB amounts to 30% in long-distance passenger transport.1 For this reason, there is, in the literal sense, room for improving

the allocation. Second, price sensitive customers can be acquired during oﬀ-peak hours at rather

negligible marginal costs.

Despite the increasing interest in the supersaver tickets in recent years, many users of the

Swiss public transport network purchase a so-called 'general abonnement' travel ticket

(GA). This (annually renewed) subscription provides free and unlimited access to the public

transport network in Switzerland. In 2019, about 0.5 million individuals owned a GA in Switzerland, roughly 6% of the Swiss population. The GA costs 3,860 and 6,300 Swiss

francs for the second and ﬁrst class, respectively. In the same year, about 2.7 million individuals

held a relatively cheap half fare travel ticket amounting to 185 Swiss francs. The latter implies

a price reduction of 50% for public transport tickets in Switzerland. Overall, discounts provided

through supersaver tickets are slightly lower for owners of half fare tickets, as the SBB aims to

attract non-regular public transport users. In our causal analysis, we therefore also control for

the possession of a half fare ticket.

3 Data

To investigate supersaver tickets’ eﬀect, we use a unique cross-sectional data set provided by the

SBB. Our sample consists of randomly surveyed buyers of supersaver tickets who purchased their

tickets between January 2018 and December 2019. These survey data are matched with data on

distances between any two railway stops as well as utilization-related information relevant for

the supply and calculation of discounts. In Section 6, we provide descriptive statistics for these

data.

3.1 Survey Data

The customer survey is our primary data source. Among other things, it includes the outcome variable

‘demand shift’, a binary indicator of whether an interviewee rescheduled her or his trip due to

buying a supersaver ticket. ‘Yes’ means that the departure time has been advanced or postponed

because of the discount. A second variable characterizing customer behavior is an indicator for

upselling, i.e. whether someone purchased a ﬁrst rather than a second class ticket as a reaction

1See https://reporting.sbb.ch/verkehr.


to the discount. Another question asks whether an interviewee would have bought the train

trip in the same or a higher class even without being oﬀered a discount, which permits judging

whether an additional trip has been sold through oﬀering the discount and allows identifying

the subgroup of always buyers under the assumptions outlined further below. Our continuously

distributed treatment variable is the discount rate of a supersaver ticket relative to the standard

fare, which may take positive values of up to 70%.

Furthermore, we observe two kinds of covariates, namely trip- or demand-related factors

and personal characteristics of the interviewee. The former are important control variables for

our causal identiﬁcation strategy outlined below and include the diﬀerence between the days of

purchase and travel, the weekday, month, and year, an indicator for buying a half fare ticket,

departure time, peak hour,2 number of tickets purchased per person, class (first or second),

indicators for leisure trips, commutes, or business trips, the number of companions (by children

and adults if any) and a judgment of how complicated the ticket purchase was on a scale from

1 (complicated) to 10 (easy). Furthermore, it consists of indicators for the point of departure,

destination, and public holidays. The personal characteristics include age, gender, migrant

status, language (German, French, Italian), and indicators for owning a half fare travel ticket

or other subscriptions like those of regional tariﬀ associations, speciﬁc connections, and Gleis 7

(‘rail 7’). The latter is a travelcard for young adults not older than 25, providing free access to

public transport after 7pm.
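The peak-hour covariate above can be constructed with a small helper following the definition in footnote 2 (departures 6:00-8:59am and 4:00-6:59pm, Monday to Friday). This is hypothetical illustration code, not part of the original data pipeline.

```python
# Minimal helper for the binary peak-hour indicator, per footnote 2's
# definition (hypothetical code, not the SBB's or the authors').
from datetime import datetime

def is_peak_hour(departure: datetime) -> bool:
    """True for weekday departures in the morning or evening peak window."""
    weekday = departure.weekday() < 5      # Monday=0 ... Friday=4
    morning = 6 <= departure.hour <= 8     # 6:00-8:59
    evening = 16 <= departure.hour <= 18   # 16:00-18:59
    return weekday and (morning or evening)
```

For example, a Monday 7:30am departure is flagged as peak, while the same departure on a Saturday is not.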

3.2 Factors Driving the Supply of Supersaver Tickets

In addition to the survey, we have access to factors determining the supply of supersaver tickets

with various discounts. This is crucial for our causal analysis, which hinges on controlling for all characteristics jointly affecting the discount rate and the demand shift outcome. While

information on the distances between railway stops in Switzerland is publicly available,3 the SBB provides us, for the various connections, with utilization data, the number of offered

seats, and contingency schemes, which deﬁne the quantity of oﬀered discounts. This allows us to

2Peak hour is defined as a departure time between 6am and 8:59am or between 4pm and 6:59pm, from Monday to Friday. These time windows are chosen on the basis of the SBB's train-path prices. For further details, see https://company.sbb.ch/en/sbb-as-business-partner/services-rus/onestopshop/services-and-prices/the-train-path-price.html (accessed on March 24, 2021).

3See the Open Data Platform of the SBB: https://data.sbb.ch/explore/dataset/linie-mitbetriebspunkte (accessed

on March 24, 2021), which provides the distances between any stops on a railway route.


account for travel distance, oﬀered seats, capacity utilization, and quantities of oﬀered supersaver

tickets for various discount levels as well as quantities of supersaver tickets already sold (both

quantities at the time of purchase). Furthermore, we create binary indicators for the 27 diﬀerent

contingency schemes of the SBB present in our data, which change approximately every month.

The variables listed in the previous paragraph are important, as the SBB calculates the

supply of supersaver tickets based on an algorithm considering four types of inputs: demand

forecasts, advance booking deadlines, number of supersaver tickets already sold, and contingency

schemes deﬁning the amount and the size of oﬀered discounts based on the three previous inputs.

The schemes are set as a function of the SBB’s self-imposed goals such as customer satisfaction

but also depend on the requirements imposed by the price watchdog. The algorithm calculates a

journey’s ﬁnal discount as a weighted average of all discounts between any two adjacent railway

stops along a journey. The weights depend on the distances of the respective subsections of the

trip. To approximate the (not directly available) demand forecasts of the SBB, we consider the

quarterly average of capacity utilization and the number of oﬀered seats for any two stops, which

are available by (exact) departure time, workday, class, and weekend. In addition, we make use

of indicators for place of departure, destination, month, year, weekday and public holidays. We

use this information to reconstruct the amount and size of oﬀered discounts by taking values

from the contingency schemes that correspond to our demand forecast approximation combined

with the diﬀerence between buying and travel days. Comparing this amount and size of oﬀered

discounts with a buyer’s discount, we estimate the number of supersaver tickets already sold for

the exact date of purchase.
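The distance-weighted averaging described above can be sketched in a few lines. The function and its inputs are illustrative (not the SBB's algorithm): it merely shows a journey's final discount as the average of subsection discounts weighted by subsection distances.

```python
# Sketch of the distance-weighted discount described in the text
# (illustrative function, not the SBB's pricing algorithm).

def journey_discount(section_discounts, section_distances):
    """Distance-weighted average discount over the subsections of a journey."""
    total = sum(section_distances)
    return sum(d * w for d, w in zip(section_discounts, section_distances)) / total

# e.g. a 30 km section at a 20% discount plus a 10 km section at 60%
# yields a 30% discount for the whole journey
```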

3.3 Sample Construction

Our initial sample contains 12,966 long-distance train trips that cover 61,469 sections between

two adjacent stops. For 12.2% of these sections, there is no information on the capacity utiliza-

tion available, which can be due to various reasons. First, for some cases, capacity utilization

data is missing. Second, passengers traveling long-distance may switch to regional transport in

exceptional cases causing problems for determining utilization. A further reason could be issues

in data processing. Altogether, missing information occurs in 3,967 trips of our initial sample.

We tackle this problem by dropping all journeys with more than 50% of missing information,

which is the case for 320 trips or 2.5% of our initial sample. After this step, our evaluation


sample consists of 12,646 trips. For the remaining 3,647 trips with missing information (which

now account for a maximum of 50% of all sections of a journey), we impute capacity utilization

as the average of the remaining sections of a trip. In our empirical analysis, we include an

indicator for whether some trip information has been imputed as well as the share of imputed

values for a specific trip as control variables. Finally, we note that our causal analysis (in contrast to the predictive analysis) makes use of only a subsample, namely observations identified as always buyers who would have traveled even without a discount, 6,112 observations in total.
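The cleaning steps of this section can be summarized in a short sketch: drop trips with more than 50% missing section utilization, impute the remaining gaps with the trip mean, and record an imputation indicator plus the imputed share as controls. Variable names and the data layout are our own illustration, not the authors' code.

```python
# Hedged numpy sketch of the sample-construction steps described above
# (illustrative data layout; NaN marks a missing section utilization).
import numpy as np

def clean_trips(utilization_by_trip):
    """utilization_by_trip: list of 1-D arrays, one array of sections per trip."""
    kept, imputed_flag, imputed_share = [], [], []
    for util in utilization_by_trip:
        share_missing = np.mean(np.isnan(util))
        if share_missing > 0.5:
            continue                            # drop trip (>50% missing)
        filled = np.where(np.isnan(util), np.nanmean(util), util)
        kept.append(filled)                     # impute with trip average
        imputed_flag.append(share_missing > 0)  # indicator: anything imputed?
        imputed_share.append(share_missing)     # share of imputed values
    return kept, imputed_flag, imputed_share
```

Both the indicator and the share then enter the empirical analysis as control variables, as described in the text.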

4 Identiﬁcation

We subsequently formally discuss the identiﬁcation strategy and assumptions underlying our

causal analysis of the discounts among always buyers.

4.1 Deﬁnition of Causal Eﬀects

Let D denote the continuously distributed treatment 'discount rate' and Y the outcome 'demand shift', a binary indicator for rescheduling a trip due to being offered a discount. More generally, capital letters represent random variables in our framework, while lower case letters represent specific values of these variables. To define the treatment effects of interest, we make use of the potential outcome framework, see for instance Rubin (1974). To this end, Y(d) denotes the potential outcome hypothetically realized when the treatment is set to a specific value d in the interval [0, Q], with 0 indicating no discount and Q indicating the maximum possible discount. For instance, Q = 0.7 would imply the maximum discount of 70% of a regular ticket fare. The realized outcome corresponds to the potential outcome under the treatment actually received, i.e. Y = Y(D), while the potential outcomes under discounts different from the one received remain unknown without further statistical assumptions.

A further complication for causal inference is that our survey data only consist of individuals that purchased a supersaver ticket, a decision that is itself an outcome of the treatment, i.e. the size of the discount. Denoting by S a binary indicator for purchasing a supersaver ticket and by S(d) the potential buying decision under discount rate d, this implies that we only observe outcomes Y for individuals with S = 1. In general, making the survey conditional on buying introduces Heckman-type sample selection (or collider) bias, see Heckman (1976) and Heckman (1979), if unobserved characteristics affecting the buying decision S also affect the inclination to shift the timing of the train journey, Y. Furthermore, it is worth noting that S = S(D) implies that buying a supersaver ticket is conditional on receiving a non-zero discount. For this reason, non-treated subjects paying regular fares (with D = 0) are not observed in our data. Yet, the outcome in our sample is defined relative to the behavior without treatment, as Y indicates whether a passenger has changed the timing of the trip because of a discount. This implies that Y(0) = 0 by definition, such that the causal effect of some positive discount d vs. no discount, Y(d) − Y(0) = Y(d), is directly observable among observations that actually received d. However, it also appears interesting to investigate whether the demand shift effect varies across different (non-zero) discount rates d ∈ (0, Q] to see whether the size matters.

This is complicated by the fact that supersaver customers with diﬀerent discount rates that are

observed in our data might in general diﬀer importantly in terms of background characteristics

also aﬀecting the outcome, exactly because they bought their trip and were selected into the

survey under non-comparable discount regimes. Our causal approach aims at tackling exactly

this issue to establish customer groups that are comparable across discount rates in order to

identify the eﬀect of the latter.

Based on the potential outcome notation, we can define different causal parameters of interest. For instance, the average treatment effect (ATE) of providing discount level d vs. d′ (for d ≠ d′) on outcome Y, denoted by ∆(d, d′), corresponds to

∆(d, d′) = E[Y(d) − Y(d′)].   (1)

Furthermore, the average partial effect (APE) of marginally increasing the discount level at D = d, denoted by θ(d), is defined as

θ(d) = E[∂Y(d)/∂d].   (2)

Accordingly, θ(D) corresponds to the APE when marginally increasing the actually received discount of any individual (rather than imposing some hypothetical value d for everyone).
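To make the distinction between these two parameters concrete, the following minimal sketch computes both for a hypothetical linear response function (illustrative values only, not the paper's data or model):

```python
# Illustrative sketch (hypothetical linear response, not the paper's data):
# with potential outcomes Y(d) = 0.2 + 0.4 * d for every customer,
# the ATE of d vs. d' is 0.4 * (d - d') and the APE is the constant slope 0.4.

def potential_outcome(d, intercept=0.2, slope=0.4):
    """Potential rescheduling propensity under discount rate d."""
    return intercept + slope * d

def ate(d, d_prime):
    """Average treatment effect Delta(d, d') = E[Y(d) - Y(d')]."""
    return potential_outcome(d) - potential_outcome(d_prime)

def ape(d, eps=1e-6):
    """Average partial effect theta(d) via a finite-difference derivative."""
    return (potential_outcome(d + eps) - potential_outcome(d - eps)) / (2 * eps)

print(round(ate(0.5, 0.2), 6))  # 0.4 * 0.3 = 0.12
print(round(ape(0.3), 6))       # constant slope 0.4
```

With a nonlinear response function, the APE would vary with d while the ATE would still compare two fixed discount levels.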

The identification of these causal parameters based on observable information requires rather strong assumptions. First, it implies that confounders jointly affecting D and Y can be controlled for by conditioning on observed characteristics. In our context, this appears plausible, as treatment assignment is based on variables related to demand (like weekday or month), contingency schemes, capacity utilization, and supersaver tickets already sold, all of which are available in our data, as described in Section 3. Second, identification requires that selection S is as good as random (i.e., not associated with outcome Y) given the observed characteristics and the treatment, an assumption known as missing at random (MAR), see for instance Rubin (1976) and Little and Rubin (1987). However, the latter condition appears unrealistic in our framework, as our data lack important socio-economic characteristics likely affecting preferences and reservation prices for public transport, namely education, wealth, or income. For this reason, we argue that the ATE and APE among the individuals selected for the survey (S = 1), i.e. conditional on buying a supersaver ticket, which are defined as

∆_{S=1}(d, d′) = E[Y(d) − Y(d′) | S = 1],   θ_{S=1}(D) = E[∂Y(D)/∂D | S = 1],   (3)

cannot be plausibly identified either. The reason is that if an increase in the discount rate induces some customers to buy a supersaver ticket, then buyers with lower and higher discounts will generally differ in terms of their average reservation prices and related characteristics (such as education or income), which likely also affect the demand-shift outcome Y.

To tackle this sample selection issue, we exploit the fact that our data provide information on whether the supersaver customers would have purchased a ticket for this specific train trip also in the absence of any discount. Provided that the interviewees give accurate responses, we thus have information on S(0), the hypothetical buying decision without treatment. Under the assumption that each customer's buying decision is weakly monotonic in the treatment, in the sense that anyone purchasing a trip in a specific travel class (e.g., second class) without a discount would also buy it in that class in the case of any positive discount, this permits identifying the group of always buyers. Importantly, we therefore define always buyers as those who would buy the trip without a discount and not in a lower travel class (i.e., second rather than first class). For always buyers, S(0) = S(d) = 1 for any d > 0, such that their buying decision is always one and thus not affected by the treatment, implying the absence of the selection problem. In the denomination of Frangakis and Rubin (2002), the always buyers constitute a so-called principal stratum, i.e., a subpopulation defined in terms of how the selection reacts to different treatment intensities. Therefore, sample selection bias does not occur within such a stratum, in which selection behavior is by definition homogeneous. For this reason, we aim at identifying the ATE and APE on the always buyers:

∆_{S(0)=1}(d, d′) = E[Y(d) − Y(d′) | S(0) = 1] = E[Y(d) − Y(d′) | S(0) = S(d″) = 1] for d″ ∈ (0, Q],
θ_{S(0)=1}(D) = E[∂Y(D)/∂D | S(0) = 1] = E[∂Y(D)/∂D | S(0) = S(d″) = 1],   (4)

where the second equality follows from the monotonicity of S in D that is formalized further below.

Figure 1: Causal framework

Figure 1 provides a graphical illustration of our causal framework based on a directed acyclic graph, with arrows representing causal effects. Observed covariates X that are related to demand are allowed to jointly affect the discount rate D and the demand-shift outcome Y. X may influence the potential purchasing decision under a hypothetical treatment S(d), implying that buying a ticket given a specific discount depends on observed demand drivers like weekday, month, etc. Furthermore, unobserved socio-economic characteristics V (like the reservation price) likely affect both S(d) and Y. This introduces sample selection when conditioning on S, e.g. by only considering survey respondents (S = 1). We also note that S is deterministic in D and S(d) (as S = S(D)), even when controlling for X. Conditional on S = 1, D is associated with V, which also affects Y, thus entailing confounding of the treatment-outcome relation. One reason is, for instance, that buyers under higher and lower discounts are generally not comparable in terms of their reservation prices. In the terminology of Pearl (2000), S is a collider that opens up a backdoor path between D and Y through V. Theoretically, this could be tackled by jointly conditioning on the potential selection states under the treatment values d vs. d′ considered in the causal analysis, namely S(d) and S(d′), as controls for the selection behavior. This is typically not feasible in empirical applications, where only the potential selection corresponding to the actual treatment assignment is observed, S = S(D). In our application, however, we do have information on S(0) and can thus condition on being an always buyer under the mentioned monotonicity assumption.

4.2 Identifying Assumptions

We now formally introduce the identification assumptions underlying our causal analysis.

Assumption 1 (identifiability of selection under non-treatment):
S(0) is known for all subjects with S = 1.

Assumption 1 is satisfied in our data in the absence of misreporting, as subjects have been asked whether they would have bought the train trip even in the absence of a discount.

Assumption 2 (conditional independence of the treatment):
Y(d), S(d) ⊥ D | X for all d ∈ (0, Q].

By Assumption 2, there are no unobservables jointly affecting the treatment assignment on the one hand and the potential outcomes or selection states under any positive treatment value on the other hand, conditional on covariates X. This assumption is satisfied if the treatment is quasi-random conditional on our demand-related factors X. Note that the assumption also implies that Y(d) ⊥ D | X, S(0) = 1 for all d ∈ (0, Q].

Assumption 3 (weak monotonicity of selection in the treatment):
Pr(S(d) ≥ S(d′) | X) = 1 for all d > d′ and d, d′ ∈ (0, Q].

By Assumption 3, selection is weakly monotonic in the treatment, implying that a higher treatment state can never decrease selection for any individual. In our context, this means that a higher discount cannot induce a customer to not buy a ticket that would have been purchased under a lower discount. An analogous assumption has been made in the context of nonparametric instrumental variable models, see Imbens and Angrist (1994) and Angrist, Imbens, and Rubin (1996), where, however, it is the treatment that is assumed to be monotonic in its instrument. Note that monotonicity entails the testable implication that E[S − S(0) | X, S = 1, D = d] = E[1 − S(0) | X, S = 1, D = d] weakly increases in the treatment value d. In words, the share of customers that bought the ticket because of the discount must increase in the discount rate in our survey population when controlling for X.
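As a toy illustration of this testable implication (entirely hypothetical survey records, and ignoring the conditioning on X for brevity), one can bin survey respondents by discount rate and check that the share of induced buyers weakly increases:

```python
# Sketch of the testable implication (hypothetical survey records, ignoring X
# for brevity): the share of induced buyers, E[1 - S(0) | D = d, S = 1],
# should weakly increase in the discount rate d under monotonicity.

# Each record: (discount rate D, would-have-bought-anyway indicator S(0)).
survey = [
    (0.1, 1), (0.1, 1), (0.1, 0),            # low discounts: few induced buyers
    (0.3, 1), (0.3, 0), (0.3, 0),
    (0.6, 0), (0.6, 0), (0.6, 1), (0.6, 0),  # high discounts: many induced buyers
]

def induced_share(records, d):
    """Share of customers with S(0) = 0 (i.e., not always buyers) at discount d."""
    group = [s0 for disc, s0 in records if disc == d]
    return sum(1 - s0 for s0 in group) / len(group)

shares = [induced_share(survey, d) for d in (0.1, 0.3, 0.6)]
print(shares)  # weakly increasing under monotonicity
assert all(a <= b for a, b in zip(shares, shares[1:]))
```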

Assumption 4 (common support):
f(d | X, S(0) = 1) > 0 for all d ∈ (0, Q].

Assumption 4 is a common support restriction requiring that f(d | X, S(0) = 1), the conditional density of receiving a specific treatment intensity d given X and S(0) = 1 (or the conditional probability if the treatment takes discrete values), henceforth referred to as the treatment propensity score, is larger than zero among always buyers for the treatment doses to be evaluated. This implies that the demand-related covariates X do not deterministically affect the discount rate received, such that there exists variation in the rates conditional on X.

Our assumptions permit identifying the conditional ATE given X (CATE), denoted by ∆_{X,S(0)=1}(d, d′) = E[Y(d) − Y(d′) | X, S(0) = 1] for d ≠ d′ and d, d′ ∈ (0, Q]. To see this, note that

∆_{X,S(0)=1}(d, d′) = E[Y | D = d, X, S(0) = 1] − E[Y | D = d′, X, S(0) = 1]
= E[Y | D = d, X, S(0) = 1, S = 1] − E[Y | D = d′, X, S(0) = 1, S = 1],   (5)

where the first equality follows from Assumption 2 and the second from Assumption 3, as monotonicity implies that S = 1 if S(0) = 1. Together with Assumption 1, which postulates the identifiability of S(0), it follows that the causal effect on always buyers is nonparametrically identified, given that common support (Assumption 4) holds. It follows that the ATE among always buyers is identified by averaging over the distribution of X given S(0) = 1, S = 1:

∆_{S(0)=1}(d, d′) = E[E[Y | D = d, X, S(0) = 1, S = 1] − E[Y | D = d′, X, S(0) = 1, S = 1] | S(0) = 1, S = 1].   (6)
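The identification result in (6) can be sketched on hypothetical data: among always buyers, mean outcomes are compared across discount rates within cells of X and then averaged over the X distribution of always buyers. All names and values below are illustrative:

```python
# Sketch of the identification result in (6) on hypothetical data: among
# always buyers (S(0) = 1, hence S = 1 by monotonicity), compare mean outcomes
# across discount rates within cells of X and average over the X distribution.

# Records: (X cell, discount D, outcome Y), all with S(0) = 1 and S = 1.
always_buyers = [
    ("weekday", 0.2, 0), ("weekday", 0.2, 0), ("weekday", 0.5, 1),
    ("weekday", 0.5, 0), ("weekend", 0.2, 1), ("weekend", 0.2, 0),
    ("weekend", 0.5, 1), ("weekend", 0.5, 1),
]

def cell_mean(data, x, d):
    """Conditional mean outcome E[Y | D = d, X = x] in the sample."""
    ys = [y for xi, di, y in data if xi == x and di == d]
    return sum(ys) / len(ys)

def ate_always_buyers(data, d, d_prime):
    """Average E[Y|D=d,X] - E[Y|D=d',X] over the sample's X distribution."""
    cells = [xi for xi, _, _ in data]  # one entry per observation
    effects = [cell_mean(data, x, d) - cell_mean(data, x, d_prime) for x in cells]
    return sum(effects) / len(effects)

print(ate_always_buyers(always_buyers, 0.5, 0.2))
```

In practice the conditioning on X is high-dimensional, which is exactly why the paper replaces the cell means with machine learning estimators.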

Furthermore, considering (5) and letting d − d′ → 0 identifies the conditional average partial effect (CAPE) of marginally increasing the treatment at D = d given X, S(0) = 1, denoted by θ_{X,S(0)=1}(D) = E[∂Y(D)/∂D | X, S(0) = 1]:

θ_{X,S(0)=1}(d) = ∂E[Y | D = d, X, S(0) = 1, S = 1]/∂D.   (7)

Accordingly, the APE among always buyers, which averages over the distributions of X and D, is identified by

θ_{S(0)=1}(D) = E[∂E[Y | D, X, S(0) = 1, S = 1]/∂D].   (8)

Our identifying assumptions yield a testable implication if some personal characteristics (like customer's age) that affect S(d) are observed, which we henceforth denote by W. In fact, D must be statistically independent of W conditional on X, S(0) = 1, S = 1 if X is sufficient for avoiding any confounding of the treatment-outcome relation. To see this, note that by Assumption 2, personal characteristics must not influence the treatment decision conditional on X. This statistical independence must also hold within subgroups (or principal strata) in which sample selection behavior (and thus sample selection/collider bias) is controlled for, like the always buyers, i.e. conditional on S(0) = 1 and S = 1.

5 Estimation based on machine learning

In this section, we outline the predictive and causal machine learning approaches used in our

empirical analysis of the evaluation sample.

5.1 Predictive Machine Learning

Let i ∈ {1, ..., n} be an index for the different interviewees in our sample of size n, and let {Yi, Di, Xi, Wi, Si(0)} denote the outcome, the treatment, the covariates related to the treatment and the outcome, the observed personal characteristics, and the buying decision without discount of these interviewees, who by the sampling design all satisfy Si = 1 (because they are part of the survey). Therefore, Yi represents customer i's demand shift (rescheduling behavior) under customer i's received discount rate Di relative to no discount. In a first step, we investigate which observed predictors among the covariates X, W as well as the size of the discount D are most powerful for predicting demand shifts by machine learning algorithms. We point out that this analysis is of a descriptive nature, as it does not yield the causal effects of the various predictors, but merely their capability of forecasting Y. In particular, our approach averages the predictions of Y over different levels of treatment intensity D and thus over different customer types in terms of reservation price (related to S(0)) and unobserved background characteristics that likely vary with the treatment level.

Therefore, we also perform the prediction analysis within subgroups defined by the treatment level to see whether the set of important predictors is affected by the treatment intensity. To this end, we binarize the treatment such that it consists of two categories, namely (non-zero) discounts below 30%, i.e. covering the treatment range d ∈ (0, 0.3), and more substantial discounts of 30% and more, d ∈ [0.3, 0.7], as 70% is the highest discount observed in our data. In the same manner, we also assess the predictive power when considering the decision to buy a trip that would not have been realized without a discount (additional trip), i.e. Si − Si(0), as outcome. As Si is equal to one for everyone, this outcome corresponds to 1 − Si(0) and indicates whether someone has been induced to purchase the ticket because of the discount, i.e. is not an always buyer. As a further consumer behavior-related outcome to be predicted, we also consider buying a first-class rather than a second-class ticket because of the discount (upselling).

Prediction is based on the random forest, a nonparametric machine learner suggested by Breiman (2001) for predicting outcomes as a function of covariates. Random forests rely on repeatedly drawing subsamples from the original data and averaging over the predictions in each subsample obtained by a decision tree, see Breiman, Friedman, Olshen, and Stone (1984). The idea of decision trees is to recursively split the covariate space, i.e. the set of possible values of X, W, into a number of non-overlapping subsets (or nodes). Recursive splitting is performed such that after each split, a statistical goodness-of-fit criterion like the sum of squared residuals (the residual being the difference between the outcome and the subset-specific average outcome) is minimized across the newly created subsets. Intuitively, this can be thought of as a regression of the outcome on a data-driven choice of indicator functions for specific (brackets of) covariate values. At each split of a specific tree, only a random subset of covariates is considered as potential splitting variables in order to reduce the correlation of tree structures across subsamples, which, together with averaging predictions over all subsamples, reduces the estimation variance of the random forest when compared to running a single tree in the original data. Even when using an excessive number of splits (or indicator functions for covariate values) such that some of them do not importantly predict the outcome, averaging over many samples will cancel out those non-predictive splits that are only due to sampling noise. Forest-based predictions can be represented by smooth weighting functions that bear some resemblance to kernel regression, with the important difference that random forests detect predictive covariates in a data-driven way. We use the randomForest package by Breiman (2018) for the statistical software R to implement the random forest based on growing 1,000 decision trees.
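The mechanics described above (bootstrap subsamples, SSE-minimizing splits over a random covariate subset, averaged predictions) can be sketched in a few lines of Python. This toy version with depth-one trees is purely illustrative; the actual analysis uses the R randomForest package with 1,000 full trees:

```python
import random

# Toy random forest: bootstrap subsamples, SSE-minimizing splits on a random
# covariate subset, averaged predictions (illustrative sketch only).

def fit_stump(sample, feature_ids):
    """Best single split over a random subset of covariates (minimum SSE)."""
    best = None
    for j in feature_ids:
        for x, _ in sample:
            thr = x[j]
            left = [y for xi, y in sample if xi[j] <= thr]
            right = [y for xi, y in sample if xi[j] > thr]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # degenerate bootstrap sample: predict the sample mean
        m = sum(y for _, y in sample) / len(sample)
        return lambda x: m
    _, j, thr, ml, mr = best
    return lambda x: ml if x[j] <= thr else mr

def fit_forest(data, n_trees=200, seed=1):
    rng = random.Random(seed)
    n_feat = len(data[0][0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]               # bootstrap subsample
        feats = rng.sample(range(n_feat), max(1, n_feat // 2))  # random covariate subset
        trees.append(fit_stump(sample, feats))
    return lambda x: sum(t(x) for t in trees) / len(trees)      # average predictions

# Toy data: outcome driven by the first covariate only.
data = [((x1, x2), 1.0 if x1 > 0.5 else 0.0)
        for x1 in (0.1, 0.3, 0.7, 0.9) for x2 in (0.0, 1.0)]
forest = fit_forest(data)
print(round(forest((0.9, 0.0)), 2), round(forest((0.1, 1.0)), 2))
```

Trees built on the uninformative second covariate pull the averaged prediction toward the overall mean, which is the variance-reduction effect described above.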

5.2 Causal Machine Learning

The second part of our analysis assesses the causal effect of increasing discount rates on demand shifts among always buyers, while controlling for the selection into the survey and the non-random assignment of the treatment based on Assumptions 1 to 4 of Section 4. We apply the causal forest (CF) approach of Wager and Athey (2018) and Athey, Tibshirani, and Wager (2019) to estimate the CAPE and APE of the continuous treatment, as well as the double machine learning (DML) approach of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018) to estimate the ATE of a binary treatment, a discount ≥ 30% vs. < 30%, in the sample of always buyers.

The CF adapts the random forest to the purpose of causal inference. It is based on first running separate random forests for predicting the outcome Y and the treatment D as a function of the covariates X using leave-one-out cross-fitting. The latter implies that the outcome or treatment of each observation is predicted based on all observations in the data but its own, in order to safeguard against overfitting bias. Second, the predictions are used for computing residuals of the outcomes and treatments, in which the influence of X has been partialled out. Finally, a further random forest is applied to average over so-called causal trees, see Athey and Imbens (2016), in order to estimate the CAPE. The causal tree approach contains two key modifications compared to standard decision trees. First, instead of an outcome variable, it is the coefficient of regressing the residual of Y on the residual of D, i.e. the causal effect estimate of the treatment, that is to be predicted. Recursive splitting aims to find the largest effect heterogeneities across subsets defined in terms of X in order to estimate the CAPE accurately. Second, within each subset, different parts of the data are used for estimating (a) the tree's splitting structure (i.e., the definition of covariate indicator functions) and (b) the causal effect of the treatment, to prevent spuriously large effect heterogeneities due to overfitting.

The CAPE estimate obtained by the CF can be thought of as a weighted regression of the outcome residual on the treatment residual. The random forest-determined weight reflects the importance of a sample observation for assessing the causal effect at specific values of the covariates. After estimating the CAPE given X, the APE is obtained by appropriately averaging over the distribution of X among the always buyers. For implementing CAPE and APE estimation, we use the grf package by Tibshirani, Athey, Friedberg, Hadad, Hirshberg, Miner, Sverdrup, Wager, and Wright (2020) for the statistical software R. We set the number of trees to be used in a forest to 1,000. We select any other tuning parameters, like the number of randomly chosen covariates considered for splitting or the minimum number of observations per subset (or node), by the built-in cross-validation procedure.
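The partialling-out logic at the core of the CF can be sketched on toy data in which a single binary covariate makes the nuisance predictions E[Y|X] and E[D|X] simple group means (the actual CF estimates these by cross-fitted forests and additionally applies forest-based weights and causal-tree splits):

```python
# Pure-Python sketch of the partialling-out idea behind the causal forest
# (hypothetical data). With one binary covariate X, the nuisance predictions
# E[Y|X] and E[D|X] reduce to within-group means.

# Records: (X, D, Y) with a true treatment slope of 0.5 built in.
obs = [(x, d, 0.2 * x + 0.5 * d) for x in (0, 1) for d in (0.1, 0.3, 0.6)]

def group_mean(data, idx, x):
    """Mean of the idx-th record entry within the group X = x."""
    vals = [rec[idx] for rec in data if rec[0] == x]
    return sum(vals) / len(vals)

# Residualize Y and D with respect to X, then regress residual on residual.
res = [(d - group_mean(obs, 1, x), y - group_mean(obs, 2, x)) for x, d, y in obs]
slope = sum(rd * ry for rd, ry in res) / sum(rd ** 2 for rd, _ in res)
print(round(slope, 6))  # recovers the treatment slope 0.5
```

Because the confounding influence of X is removed from both Y and D before the final regression, the residual-on-residual slope recovers the treatment effect.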

We also estimate the ATE among always buyers in our sample based on DML for a binary treatment defined as D̃ = I{D ≥ 0.3}, with I{·} denoting the indicator function that is equal to one if its argument is satisfied and zero otherwise. Furthermore, let μ_d(X) = E[Y | D̃ = d, X, S(0) = 1, S = 1] denote the conditional mean outcome and p_d(X) = Pr(D̃ = d | X, S(0) = 1, S = 1) the propensity score of receiving treatment category d (with d = 1 for a discount ≥ 30% and d = 0 otherwise) in that population. Estimation is based on the sample analog of the doubly robust identification expression for the ATE, see Robins, Rotnitzky, and Zhao (1994) and Robins and Rotnitzky (1995):

∆_{S(0)=1}(1, 0) = E[μ_1(X) − μ_0(X) + (Y − μ_1(X))·D̃/p_1(X) − (Y − μ_0(X))·(1 − D̃)/p_0(X) | S(0) = 1, S = 1].   (9)

We estimate (9) using the causalweight package for the statistical software R by Bodory and Huber (2018). As machine learners for the conditional mean outcomes μ_d(X) and the propensity scores p_d(X), we use the random forest with the default options of the SuperLearner package of van der Laan, Polley, and Hubbard (2007), which itself imports the ranger package by Wright and Ziegler (2017) for random forests. To impose common support in the data used for ATE estimation, we apply a trimming threshold of 0.01, implying that we drop observations with estimated propensity scores smaller than 0.01 (or 1%) or larger than 0.99 (or 99%) from our sample.
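A sample-analog sketch of the doubly robust expression in (9), with the nuisance parameters taken as known functions rather than random forest estimates, and with the 0.01 trimming rule applied (all data and functions below are hypothetical):

```python
# Sample-analog sketch of the doubly robust expression (9) on hypothetical
# data, with the nuisance parameters mu_d(X) and p_d(X) taken as given
# (in the paper they are estimated by random forests) and a 0.01 trimming
# threshold on the propensity score.

def doubly_robust_ate(data, mu1, mu0, pscore, trim=0.01):
    """data: list of (x, d_binary, y); mu1/mu0/pscore: functions of x."""
    scores = []
    for x, d, y in data:
        p = pscore(x)
        if p < trim or p > 1 - trim:  # drop observations off common support
            continue
        score = (mu1(x) - mu0(x)
                 + d * (y - mu1(x)) / p
                 - (1 - d) * (y - mu0(x)) / (1 - p))
        scores.append(score)
    return sum(scores) / len(scores)

# Hypothetical always-buyer sample where the true effect of a >= 30% discount
# on rescheduling is 0.2 and the nuisance functions are known exactly.
data = [(x, d, 0.1 * x + 0.2 * d) for x in (0, 1) for d in (0, 1)]
ate = doubly_robust_ate(data,
                        mu1=lambda x: 0.1 * x + 0.2,
                        mu0=lambda x: 0.1 * x,
                        pscore=lambda x: 0.5)
print(round(ate, 6))  # 0.2
```

The correction terms vanish here because the outcome models are exact; with estimated nuisances, they protect against misspecification of either the outcome model or the propensity score.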


6 Empirical results

6.1 Descriptive Statistics

Before discussing the results of our machine learning approaches, we first present some descriptive statistics for our data in Table 1, namely the mean and the standard deviation of selected variables by always-buyer status and binary discount category (≥ 30% and < 30%). We see that discounts and regular ticket fares of always buyers are on average lower than those of other customers. Another interesting observation is that in either discount category, we observe fewer leisure travelers among the always buyers than among other customers, which can be rationalized by business travelers responding less to price incentives through discounts. This is also in line with the finding that always buyers tend to purchase more second-class tickets. More generally, we see non-negligible variation in demand-related covariates across the four subsamples defined in terms of buying behavior and discount rates. For instance, among always buyers, the total amount of supersaver tickets offered is on average larger in the higher discount category, while it is lower among the remaining clients. This suggests that neither the treatment nor being an always buyer is quasi-random, a problem we aim to tackle based on our identification strategy outlined in Section 4. Concerning the demand-shift outcome, we see that always buyers change the departure time less frequently than others. With regard to upselling, we recognize that the relative amount of individuals upgrading their second-class to a first-class ticket is the same for both discount categories, i.e. ≥ 30% and < 30%.

6.2 Predicting buying decisions

We subsequently present our predictive analysis based on the random forest and investigate which covariates importantly predict three outcomes, namely whether customers booked a trip otherwise not realized by train (additional trip), bought a first-class rather than a second-class ticket (upselling), or rescheduled their trip, e.g. away from rush hours (demand shift). For this purpose, we create three distinct datasets in which the values of the respective binary outcome are balanced, i.e. 1 (for instance, upselling) for 50% and 0 (no upselling) for 50% of the observations. We balance our data because we aim to train a model that predicts both outcome values equally well. Taking the demand-shift outcome as an example, our data with non-missing covariate or outcome information contain 3481 observations with Y = 1 and 9576 observations with Y = 0.


Table 1: Mean and standard deviation by discount and type

discount                   <30%               ≥30%
always buyers              No       Yes       No       Yes
discount                   0.21     0.19      0.57     0.53
                           (0.07)   (0.08)    (0.12)   (0.13)
regular ticket fare        44.36    36.14     47.19    32.91
                           (29.38)  (25.47)   (30.14)  (23.78)
age                        47.22    47.68     45.59    48.77
                           (15.36)  (16.14)   (15.80)  (16.49)
gender                     0.51     0.55      0.53     0.59
                           (0.50)   (0.50)    (0.50)   (0.49)
diff. purchase travel      3.42     3.23      7.72     7.19
                           (6.96)   (6.76)    (11.23)  (10.30)
distance                   136.49   127.86    126.15   116.76
                           (77.38)  (71.49)   (69.98)  (66.04)
capacity utilization       35.51    39.19     26.46    33.15
                           (14.16)  (14.31)   (13.24)  (13.75)
seat capacity              328.28   429.57    303.83   445.14
                           (196.19) (196.10)  (185.42) (188.54)
offer total                33.95    44.10     70.97    98.34
                           (42.57)  (50.68)   (69.57)  (84.45)
sold total                 28.04    37.29     13.70    25.75
                           (41.92)  (50.31)   (36.37)  (53.67)
half fare travel ticket    0.74     0.79      0.62     0.74
                           (0.44)   (0.40)    (0.49)   (0.44)
leisure                    0.77     0.69      0.82     0.76
                           (0.42)   (0.46)    (0.39)   (0.43)
class                      1.38     1.65      1.33     1.73
                           (0.48)   (0.48)    (0.47)   (0.44)
Swiss                      0.89     0.92      0.88     0.88
                           (0.31)   (0.28)    (0.33)   (0.32)
demand shift               0.31     0.19      0.31     0.23
                           (0.46)   (0.40)    (0.46)   (0.42)
upselling                  0.49     0.00      0.49     0.00
                           (0.50)   (0.00)    (0.50)   (0.00)
obs.                       1151     2221      5529     3745

Notes: Regular ticket fare is in Swiss francs. 'Diff. purchase travel' denotes the difference between purchase and travel day. 'Offer total' and 'sold total' denote the total amount of supersaver tickets offered and the total amount of supersaver tickets sold, respectively.

We retain all observations with Y = 1 and randomly draw 3481 observations with Y = 0 to obtain such a balanced data set. In the next step, we randomly split these 6962 observations into a training set consisting of 75% of the data and a test set (25%). In the training set, we train the random forest using the treatment D and all covariates X, W as predictors. In the test set, we predict the outcomes based on the trained forest, classifying e.g. observations with a demand-shift probability ≥ 0.5 as 1. We then compare the predictions to the actually observed outcomes to assess model performance based on the correct classification rate (also known as accuracy), i.e. the share of observations in the test data for which the predicted outcome corresponds to the actual one.
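The evaluation protocol just described (50/50 downsampling, 75/25 split, classification at a 0.5 probability threshold, accuracy on the test set) can be sketched as follows, with simulated data in place of the survey and a trivial threshold classifier standing in for the random forest:

```python
import random

# Sketch of the evaluation protocol (hypothetical data; a trivial threshold
# classifier stands in for the random forest): downsample the majority class
# to a 50/50 balance, split 75/25, classify, and report test-set accuracy.

def balance(data, rng):
    """Retain all Y = 1 records and draw as many Y = 0 records at random."""
    pos = [rec for rec in data if rec[1] == 1]
    neg = rng.sample([rec for rec in data if rec[1] == 0], len(pos))
    merged = pos + neg
    rng.shuffle(merged)
    return merged

rng = random.Random(7)
# Hypothetical records (x, y): y tends to be 1 when x > 0.5, with 3x more y = 0.
data = ([(rng.random() * 0.5 + 0.5, 1) for _ in range(300)]
        + [(rng.random() * 0.5, 0) for _ in range(900)])

balanced = balance(data, rng)
split = int(0.75 * len(balanced))
train, test = balanced[:split], balanced[split:]

threshold = sum(x for x, _ in train) / len(train)  # stand-in "trained model"
predict = lambda x: 1 if x >= threshold else 0     # probability >= 0.5 analogue
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(len(balanced), round(accuracy, 2))
```

Without the balancing step, a classifier could reach a deceptively high accuracy by always predicting the majority class, which is exactly what the 50/50 design rules out.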

For each of the outcomes, Table 2 presents the 30 most predictive covariates in the training set, ordered in decreasing order according to a variable importance measure. The latter is defined as the total decrease in the Gini index (as a measure of node impurity in terms of outcome values) in a tree when including the respective covariate for splitting, averaged over all trees in the forest. The results suggest that trip- and demand-related characteristics like seat capacity, utilization, departure time, and distance are important predictors. Concerning personal characteristics, customer's age also appears to be relevant. Furthermore, the treatment intensity D also has considerable predictive power. Interestingly, specific connections (defined by indicators for points of departure and destination) turn out to be less important characteristics conditional on the other covariates already mentioned.

At the bottom of Table 2, we also report the correct classification rates for the three outcomes. While the accuracy in predicting a demand shift amounts to 58%, which is somewhat better than random guessing but not particularly impressive, the performance is more satisfactory for predicting decisions about additional trips, with an accuracy of 65%, and quite decent for upselling (82%). We note that when predicting upselling, we drop the variables 'class', which indicates whether someone travels in the first or second class, and 'seat capacity', which refers to the capacity in the chosen class, from the predictors. The reason is that upselling is defined as switching from second to first class, and therefore, the chosen class and the related seat capacities are actually part of the outcome to be predicted. Tables B.2 and B.1 in the Appendix present the predictive outcome analysis separately for subsamples with discounts ≥ 30% and < 30%, respectively. In terms of which classes of variables are most predictive (trip- and demand-related characteristics, age, discount rate) and also in terms of accuracy, the findings are rather similar to those in Table 2. In general, machine learning appears useful for forecasting customer behavior in the context of demand for train trips, albeit not equally well for all aspects of interest. Such forecasts may for instance serve as a basis for customer segmentation, e.g. into customer groups more and less inclined to book an additional trip or switch classes or departure times.


Table 2: Predictive outcome analysis

demand shift | upselling | additional trip
variable, importance | variable, importance | variable, importance
departure time 142.694 | capacity utilization 295.924 | seat capacity 147.037
seat capacity 121.42 | offer level B 188.861 | D 128.086
age 119.846 | offer level C 149.911 | age 123.948
capacity utilization 119.606 | D 132.095 | departure time 123.516
D 112.474 | age 100.258 | capacity utilization 113.160
distance 112.143 | departure time 98.909 | distance 101.730
offer level B 84.142 | offer level A 93.303 | offer level B 84.989
diff. purchase travel 81.167 | distance 87.319 | diff. purchase travel 80.236
offer level C 76.238 | offer level D 85.408 | offer level A 78.507
offer level A 75.971 | diff. purchase travel 62.841 | offer level C 77.097
number of sub-journeys 73.096 | number of sub-journeys 55.978 | number of sub-journeys 69.443
offer level D 61.763 | rel. sold level A 44.505 | ticket purchase complexity 64.498
ticket purchase complexity 57.071 | ticket purchase complexity 41.819 | offer level D 56.888
rel. sold level A 51.377 | offer level E 37.159 | class 51.456
rel. amount imputed values 42.222 | rel. sold level B 34.462 | rel. sold level A 46.969
rel. sold level B 38.144 | rel. amount imputed values 30.747 | rel. amount imputed values 38.869
adult companions 34.176 | rel. sold level C 28.635 | rel. sold level B 36.484
rel. sold level C 28.201 | adult companions 25.115 | half fare 35.785
offer level E 25.714 | 2019 18.88 | adult companions 34.446
gender 23.707 | gender 18.47 | half fare travel ticket 28.465
amount purchased tickets 19.575 | rush hour 17.448 | gender 25.419
German 18.659 | Saturday 16.173 | rel. sold level C 24.679
travel alone 18.605 | German 15.457 | offer level E 22.556
2019 18.082 | leisure 15.304 | leisure 20.438
French 17.906 | amount purchased tickets 15.112 | no subscriptions 19.793
Saturday 17.487 | travel alone 14.792 | amount purchased tickets 19.283
Friday 17.272 | half fare 14.306 | German 19.119
peak hour 17.064 | French 14.161 | travel alone 18.139
class 16.973 | Thursday 13.413 | 2019 17.192
leisure 16.892 | scheme 20 13.411 | French 17.026
correct prediction rate 0.581 | 0.817 | 0.653
balanced sample size 6962 | 6738 | 7000

Notes: 'Offer level A' to 'offer level E' denote the amount of supersaver tickets offered with discounts A to E, respectively. 'Diff. purchase travel' denotes the difference between purchase and travel day. 'Rel. sold level A', 'rel. sold level B', 'rel. sold level C' and 'rel. sold level D' denote the relative amount of supersaver tickets sold with discounts A, B, C and D, respectively; the relative amounts are in relation to the seats offered. 'No subscriptions' indicates not possessing any subscription. For predicting upselling, the covariates 'class' and 'seat capacity' are dropped.


6.3 Testing the identiﬁcation strategy

Before presenting the results of the causal analysis, we consider two different methods to partially test the assumptions underlying our identification strategy. First, we test Assumption 3 (weak monotonicity) by running the CF and DML procedures as well as a conventional OLS regression in which we use buying an additional trip (1 − S(0)), i.e. not being an always buyer, as outcome variable and X as control variables in our sample of supersaver customers. The CF permits estimating the conditional change in the share of surveyed customers induced to buy an additional trip by modifying the discount rate D given X, i.e. ∂E[1 − S(0) | D, X, S = 1]/∂D, as well as the average thereof across X conditional on sample selection, E[∂E[1 − S(0) | D, X, S = 1]/∂D | S = 1]. DML, on the other hand, yields an estimate of the average difference in the share of additional trips across the high and low treatment categories conditional on sample selection, E[E[1 − S(0) | D ≥ 0.3, X, S = 1] − E[1 − S(0) | D < 0.3, X, S = 1] | S = 1]. Finally, the OLS regression of (1 − S(0)) on D and all X in our sample tests monotonicity when assuming a linear model.

Table 3 reports the results, which do not provide any evidence against the monotonicity assumption. When considering the continuous treatment D, the CF-based estimate of E[∂E[1 − S(0) | D, X, S = 1]/∂D | S = 1] is highly statistically significant and suggests that augmenting the discount by one percentage point increases the share of customers otherwise not buying the ticket by 0.56 percentage points on average. Furthermore, all estimates of the conditional change ∂E[1 − S(0) | D, X, S = 1]/∂D are positive, as displayed in the histogram of Figure 2, and 82.2% of them are statistically significant at the 10% level, 69.1% at the 5% level. Furthermore, the OLS coefficient of 0.544 is highly significant. Likewise, the statistically significant DML estimate points to an increase in the share of additional trips by 18.6 percentage points when switching the binary treatment indicator from D < 0.3 to D ≥ 0.3.

We also test the statistical independence of D and W conditional on X in our sample of always buyers, as implied by our identifying assumptions, see the discussion at the end of Section 4. To this end, we randomly split the evaluation data into a training set (25% of observations) and a test set (75% of all observations). In the training data set, we run a linear lasso regression (Tibshirani, 1996) of D on X in order to identify important predictors by means of 10-fold cross-validation. In the next step, we select all covariates in X with non-zero lasso coefficients


Table 3: Monotonicity tests

                          CF: average change   OLS: coefficient   DML: D ≥ 0.3 vs. D < 0.3
change in (1 − S(0))      0.564                0.544              0.186
standard error            0.060                0.031              0.007
p-value                   0.000                0.000              0.000
trimmed observations      1760
number of observations    12924

Notes: 'CF', 'OLS', and 'DML' stand for estimates based on causal forests, linear regression, and double machine learning, respectively. 'Trimmed observations' is the number of trimmed observations in DML when setting the propensity score-based trimming threshold to 0.01. Control variables consist of X.

Figure 2: Monotonicity given X

and run an OLS regression of D on the selected covariates in the test data. Finally, we add W to that regression in the test data and run a Wald test to compare the predictive power of the models with and without W. We repeat the procedure of splitting the data, performing the lasso regression in the training set, and running the OLS regressions and the Wald test in the test set 100 times. This yields an average p-value of 0.226, with 15 out of 100 p-values being smaller than 5%. These results do not provide compelling statistical evidence that W is associated with D conditional on X, even though the training sample is relatively small and thus favors selecting rather few predictors in X (due to the cross-validation, which trades off the bias from including fewer predictors against the variance from including more predictors).
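The testing procedure above can be sketched as follows, using simulated placeholder data and scikit-learn's LassoCV; the variable names X, W, and D mirror the paper's notation, and a single train/test split stands in for the 100 repetitions described in the text:

```python
# Sketch of the independence test: cross-validated lasso selects predictors of
# D among X in a 25% training split; an F-test (Wald test) in the 75% test
# split then compares OLS models of D on the selected covariates with and
# without the personal characteristics W. All data are simulated placeholders.
import numpy as np
from scipy.stats import f
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p, q = 4000, 20, 5
X = rng.normal(size=(n, p))                       # trip/demand covariates
W = rng.normal(size=(n, q))                       # personal characteristics (independent of D)
D = 0.3 + 0.1 * X[:, 0] - 0.05 * X[:, 1] + 0.1 * rng.normal(size=n)

Xtr, Xte, Wtr, Wte, Dtr, Dte = train_test_split(X, W, D, train_size=0.25, random_state=1)

# 10-fold cross-validated lasso of D on X; keep covariates with non-zero coefficients
sel = np.flatnonzero(LassoCV(cv=10).fit(Xtr, Dtr).coef_)

def rss(Z, y):
    """Residual sum of squares and parameter count of OLS of y on [1, Z]."""
    Z1 = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    r = y - Z1 @ beta
    return r @ r, Z1.shape[1]

rss0, k0 = rss(Xte[:, sel], Dte)                           # restricted: selected X only
rss1, k1 = rss(np.column_stack([Xte[:, sel], Wte]), Dte)   # unrestricted: plus W
F = ((rss0 - rss1) / (k1 - k0)) / (rss1 / (len(Dte) - k1))
pval = f.sf(F, k1 - k0, len(Dte) - k1)
# under conditional independence of D and W given X, the p-value has no
# tendency to be small
```

Repeating the split, the lasso selection, and the test 100 times and averaging the p-values mimics the aggregation described above.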

We note that performing lasso-based variable selection and OLS-based testing in different (training and test) data avoids correlations between these steps that could entail an overestimation of the goodness of fit. Nonetheless, our findings remain qualitatively unchanged when performing both steps in all of the evaluation data. Repeating the cross-validation step for the lasso-based covariate selection 100 times and testing in the total sample yields an even higher average p-value of 0.360. Finally, we run a standard OLS regression of D on all elements of X (rather than selecting the important ones by lasso) in the total sample and compare its predictive power to a model that additionally includes W. Also in this case, the Wald test entails a rather high p-value of 0.343. In summary, we conclude that our tests do not point to a violation of our identifying assumptions.

6.4 Assessing the causal effect of discounts

Table 4 presents the main results of our causal analysis, namely the estimates of the discount rate's effect on the demand shift outcome, which is equal to one if the discount induced rescheduling the departure time and zero otherwise. We note that all covariates, i.e. both the trip- or demand-related factors X and the personal characteristics W, are used as control variables, even though we have previously argued that X is sufficient for identification. There are, however, good reasons for including W as well in the estimations. First, conditioning on the personal characteristics available in the data may reduce estimation bias if X is, contrary to our assumptions and to what our tests suggest, not fully sufficient to account for confounding. Second, it can also reduce the variance of the estimator, e.g. if some factors like age are strong predictors of the outcome. Third, having W in the CF allows for a more fine-grained analysis of effect heterogeneity based on computing more ‘individualized’ partial effects that (also) vary across personal characteristics.

Table 4: Effects on demand shift

                             CF: APE   DML: ATE, D ≥ 0.3 vs D < 0.3
effect                       0.161     0.036
standard error               0.072     0.014
p-value                      0.025     0.010
trimmed observations (DML)   151
number of observations       5903

Notes: ‘CF’ and ‘DML’ stand for estimates based on causal forests and double machine learning, respectively. ‘Trimmed observations’ is the number of observations trimmed in DML when setting the propensity score-based trimming threshold to 0.01. Control variables consist of both X and W.


Considering the estimates of the CF, we obtain an average partial effect (APE) of 0.161, suggesting that increasing the current discount rate among always buyers by one percentage point increases the share of rescheduled trips by 0.16 percentage points. This effect is statistically significant at the 5% level. As a word of caution, however, we point out that the standard error is non-negligible, such that the magnitude of the impact is not very precisely estimated. When applying DML, we obtain an average treatment effect (ATE) of 0.036 that is significant at the 1% level, suggesting that discounts of 30% and more on average increase the number of demand shifts by 3.6 percentage points compared to lower discounts, which is qualitatively in line with the CF. Furthermore, we find decent overlap or common support in most of our sample in terms of the estimated propensity scores across the lower and higher discount categories considered in DML, see the propensity score histograms in Appendix A. This is important, as ATE evaluation hinges on the availability of observations with comparable propensity scores across treatment groups. Only 151 out of our 5903 observations are dropped due to extreme propensity scores below 0.01 or above 0.99 (pointing to a violation of common support).4 In summary, our results clearly point to a positive average effect of the discount rate on trip rescheduling among always buyers, which is, however, not overwhelmingly large.
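For the binary treatment contrast, the DML step can be illustrated with a doubly robust (AIPW) sketch including propensity score trimming at the 0.01 threshold used above. The data, the choice of a logistic propensity model and a random forest outcome model, and the true effect of 0.036 are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of a doubly robust (AIPW) ATE estimate for a binary
# high-discount indicator (D >= 0.3) with propensity score trimming at 0.01.
# Simulated data; the true treatment effect is set to 0.036 for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n = 3000
X = rng.normal(size=(n, 4))
T = (rng.uniform(size=n) < 1 / (1 + np.exp(-0.8 * X[:, 0]))).astype(int)
Y = (rng.uniform(size=n) < 0.2 + 0.036 * T + 0.05 * (X[:, 1] > 0)).astype(float)

# cross-fitted propensity scores and plug-in outcome regressions
ps = cross_val_predict(LogisticRegression(), X, T, cv=5, method="predict_proba")[:, 1]
mu = RandomForestRegressor(n_estimators=100, min_samples_leaf=50,
                           random_state=0).fit(np.column_stack([T, X]), Y)
mu1 = mu.predict(np.column_stack([np.ones(n), X]))    # predicted outcome under treatment
mu0 = mu.predict(np.column_stack([np.zeros(n), X]))   # predicted outcome under control

keep = (ps > 0.01) & (ps < 0.99)                      # propensity score trimming
psi = (mu1 - mu0 + T * (Y - mu1) / ps - (1 - T) * (Y - mu0) / (1 - ps))[keep]
ate, se = psi.mean(), psi.std(ddof=1) / np.sqrt(keep.sum())
print(f"ATE: {ate:.3f} (SE {se:.3f}), trimmed: {n - keep.sum()}")
```

The mean of the doubly robust score psi estimates the ATE, and trimming observations with extreme propensity scores mirrors the common-support restriction discussed above.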

6.5 Effect heterogeneity

In this section, we assess the heterogeneity of the effects of D on Y across interviewees and observed characteristics. Figure 3 shows the distribution of the CF-based conditional average partial effects (CAPEs) of marginally increasing the discount rate given the covariate values of the always buyers in our sample (which are also the basis for the estimation of the APE). While the CAPEs are predominantly positive, they are quite imprecisely estimated. Only 2.9% and 0.8% of the positive ones are statistically significant at the 10% and 5% levels, respectively. Further, only 0.1% of the negative ones are statistically significant at the 10% level. Yet, the distribution points to a positive marginal effect for most always buyers and also suggests that the magnitude of the effects varies non-negligibly across individuals.

Next, we assess the effect heterogeneity across observed characteristics based on the CF results. First, we run a conventional random forest with the estimated CAPEs as the outcome

4 Our findings of a positive ATE remain robust when setting the propensity score-based trimming threshold to 0.02 (ATE: 0.039) or 0.05 (ATE: 0.043).


Figure 3: CAPEs

and the covariates as predictors to assess the covariates’ relative importance for predicting the CAPEs, using the decrease in the Gini index as importance measure, as also considered in Section 6.2. Table 5 reports the 20 most predictive covariates in decreasing order of the importance measure. Demand-related characteristics (like seat capacity, utilization, departure time, and distance) turn out to be the most important predictors for the size of the effects; the customer’s age also has some predictive power. Similarly as for the outcome predictions in Section 6.2, specific connections (characterized by points of departure or destination) are less important predictors of the CAPEs given the other information available in the data.
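This screening step can be mimicked with an off-the-shelf random forest, using scikit-learn's impurity-based importance as a stand-in for the Gini-decrease measure; the CAPEs, covariates, and their hypothetical dependence on the demand-related variables are simulated for illustration:

```python
# Sketch of the heterogeneity screening step: fit an ordinary random forest
# with (already estimated) CAPEs as the outcome and the covariates as
# predictors, then rank covariates by impurity-based importance.
# All data are simulated; the covariate names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 3000
cols = ["seat_capacity", "capacity_utilization", "departure_time", "age", "gender"]
X = rng.normal(size=(n, len(cols)))
# hypothetical CAPEs driven mainly by the demand-related covariates
cape = 0.1 + 0.05 * X[:, 0] + 0.03 * X[:, 1] + 0.01 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, cape)
ranking = sorted(zip(cols, rf.feature_importances_), key=lambda t: -t[1])
for name, imp in ranking:
    print(f"{name:22s} {imp:.3f}")
```

Covariates that drive the simulated heterogeneity dominate the importance ranking, which is the logic behind reading Table 5 as a screen for effect modifiers.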

While Table 5 provides information on the best predictors of effect heterogeneity, it does not give insights into whether the effects differ importantly and statistically significantly across specific observed characteristics of interest. For instance, one question relevant for designing discount schemes is whether (marginally) increasing the discounts is more effective among always buyers so far exposed to rather small or rather large discounts. Therefore, we investigate whether the CAPEs differ across our binary treatment categories defined by D̃ (30% or more vs. less than 30%). To this end, we apply the approach of Semenova and Chernozhukov (2020) based on (i) plugging the CF-based predictions into a modified version of the doubly robust functions provided within the expectation operator of (9) that is suitable for a continuous D


Table 5: Most important covariates for predicting CAPEs

covariate importance

seat capacity 11.844

oﬀer level C 11.164

capacity utilization 5.144

departure time 5.122

distance 4.287

oﬀer level D 4.015

class 3.434

saturday 2.933

age 2.429

number of sections 2.373

diﬀ. purchase travel 2.110

oﬀer level A 1.634

oﬀer level B 1.610

half fare 1.524

scheme 17 1.496

half fare travel ticket 1.373

rel. sold level B 0.901

ticket purchase complexity 0.847

leisure 0.773

rel. sold level A 0.770

Notes: ‘Offer level A’, ‘offer level B’, ‘offer level C’, and ‘offer level D’ denote the amount of supersaver tickets offered with discount A, B, C, and D, respectively. ‘Rel. sold level A’ and ‘rel. sold level B’ denote the relative amount of supersaver tickets sold with discount A and B, in relation to the seats offered.

and (ii) linearly regressing the doubly robust functions on the treatment indicator D̃. The results are reported in the upper panel of Table 6. While the point estimate of −0.104 suggests that the demand-shifting effect of increasing the discount is on average smaller when discounts are already quite substantial (above 30%), the difference is far from statistically significant at any conventional level.
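The second step of the Semenova and Chernozhukov (2020) approach, regressing the doubly robust scores on a constant and the high-discount indicator with heteroskedasticity-robust standard errors, can be sketched as follows; the scores and the group indicator are hypothetical placeholders:

```python
# Sketch of the best-linear-predictor step: OLS of (hypothetical) doubly
# robust scores on a constant and a binary group indicator, with an HC0
# sandwich estimator for heteroskedasticity-robust standard errors.
import numpy as np

rng = np.random.default_rng(4)
n = 5000
psi = 0.2 + rng.normal(scale=1.0, size=n)       # simulated doubly robust scores
g = (rng.uniform(size=n) < 0.5).astype(float)   # indicator: discount >= 30%

# OLS of the scores on a constant and the group indicator
Z = np.column_stack([np.ones(n), g])
beta = np.linalg.solve(Z.T @ Z, Z.T @ psi)      # [APE in low group, difference]
resid = psi - Z @ beta

# HC0 sandwich variance: (Z'Z)^-1 Z' diag(resid^2) Z (Z'Z)^-1
bread = np.linalg.inv(Z.T @ Z)
meat = Z.T @ (Z * (resid ** 2)[:, None])
se = np.sqrt(np.diag(bread @ meat @ bread))
print(f"constant: {beta[0]:.3f} (SE {se[0]:.3f}), slope: {beta[1]:.3f} (SE {se[1]:.3f})")
```

The constant corresponds to the APE in the low-discount group and the slope coefficient to the group difference, which is the structure of the upper panel of Table 6.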

Table 6: Effect heterogeneity analysis

                                                          effect   standard error   p-value
Discount categories (D ≥ 0.3 vs D < 0.3)
  APE for D < 0.3 (constant)                               0.209    0.089            0.019
  Difference APE D ≥ 0.3 vs D < 0.3 (slope coefficient)   -0.104    0.122            0.395
Customer and travel characteristics
  constant                                                -0.154    0.295            0.602
  age                                                     -0.002    0.004            0.556
  gender                                                  -0.022    0.129            0.866
  distance                                                -0.000    0.001            0.697
  leisure trip                                             0.297    0.165            0.072
  commute                                                  0.241    0.241            0.316
  half fare travel ticket                                  0.228    0.142            0.109
  peak hours                                               0.222    0.133            0.094

Notes: Business trip is the reference category for the indicators ‘leisure trip’ and ‘commute’.

Using again the method of Semenova and Chernozhukov (2020), we also investigate the heterogeneity across a limited and pre-selected set of covariates that appears interesting for characterizing customers and their travel purpose, namely age, gender, and travel distance, as well as indicators for leisure trip and commute (with business trip being the reference category), traveling during peak hours, and possession of a half fare travel ticket. As displayed in the lower panel of Table 6, we find no important effect heterogeneities across the age or gender of always buyers or as a function of travel distance conditional on the other information included in the regression, as the coefficients on these variables are close to zero. In contrast, the demand shift effect is (given the other characteristics) substantially larger among always buyers with a half fare travel ticket and among commuters; however, neither coefficient is statistically significant at the 10% level (even though the half fare coefficient is close).

For leisure trips, the coefficient is even larger (0.297), suggesting that, all other included variables being equal, a one percentage point increase in the discount rate increases the share of rescheduled trips by 0.29 percentage points more among leisure travelers than among always buyers traveling for business. The coefficient is statistically significant at the 10% level, even though we point out that the p-value does not account for multiple hypothesis testing across several covariates. This finding can be rationalized by leisure travelers likely being more flexible in terms of timing than business travelers. Also the coefficient on peak hours is substantially positive (0.222) and statistically significant at the 10% level (again, without controlling for multiple hypothesis testing). This could be due to peak hours being the most attractive travel time, implying that customers are more willing to reschedule their trips when being offered a discount within peak hours. We conclude that even though several coefficients appear non-negligible, statistical significance in our heterogeneity analysis is overall limited, which could be due to the sample of several thousand observations being limited for the purpose of investigating effect heterogeneity.

7 Conclusion

In this study, we applied causal and predictive machine learning to assess the demand effects of discounts on train tickets issued by the Swiss Federal Railways (SBB), the so-called ‘supersaver tickets’, based on a unique data set that combines a survey of supersaver customers with rail trip- and demand-related information provided by the SBB. In a first step, we analyzed which customer- or trip-related characteristics (including the discount rate) are predictive for three outcomes characterizing buying behavior, namely: booking a trip otherwise not realized by train (additional trip), buying a first- rather than second-class ticket (upselling), or rescheduling a trip (e.g. a demand shift away from rush hours) when being offered a supersaver ticket. The random forest-based results suggested that the customer’s age, demand-related information for a specific connection (like seat capacity, departure time, and utilization), and the discount level permit forecasting buying behavior to a certain extent, with correct classification rates amounting to 58% (demand shift), 65% (additional trip), and 82% (upselling), respectively.

As predictive machine learning cannot provide the causal effects of the predictors involved, we applied, in a second step, causal machine learning to assess the impact of the discount rate on the demand shift among always buyers (who would have traveled even without a discount), which appears interesting in the light of capacity constraints at rush hours. To this end, we

invoked the identifying assumptions that (i) the discount rate is quasi-random conditional on our covariates and (ii) the buying decision increases weakly monotonically in the discount rate, and exploited survey information about customer behavior in the absence of discounts. We also considered two approaches for partially testing our assumptions, which did not point to a violation of the latter. Our main results based on the causal forest suggested that increasing the discount rate by one percentage point entails an average increase of 0.16 percentage points in the share of rescheduled trips among always buyers. This finding was corroborated by double machine learning with just two discount categories, suggesting that discount rates of 30% and more on average increase the share of rescheduled trips by 3.6 percentage points when compared to lower discounts. Furthermore, when investigating effect heterogeneity across a pre-selected set of characteristics, we found the causal forest-based effects to be higher (with marginal statistical significance when not controlling for multiple hypothesis testing) for leisure travelers and during peak hours when also controlling for the customer’s age, gender, possession of a half fare travel card, and travel distance. Finally, our effect heterogeneity analysis also revealed that demand-related information is most predictive for the size of the effect of the discount rate.

Using state-of-the-art machine learning tools, our study appears to be the first (at least for Switzerland) to provide empirical evidence on how discounts on train tickets affect customers’ willingness to reschedule trips, which is important information for designing discount schemes aiming at balancing out train utilization across time and reducing overload during peak hours. Even though the overall impact of the discounts on demand shifts among always buyers might not be as large as one could hope for, the causal forest pointed to the existence of customer segments that are likely more responsive and could be scrutinized further when collecting a larger amount of data than was available for our analysis. Furthermore, our empirical approach may also be applied to other countries or transport industries facing capacity constraints. For instance, we would expect that in a setting with higher competition from alternative public transport modes like long-distance bus services (not present in Switzerland), the impact of train discounts may well be different. More generally, our study can be regarded as a use case for how predictive and in particular causal machine learning can be fruitfully applied for business analytics and as decision support for optimizing specific interventions like discount schemes based on impact evaluation.

References

Angrist, J., G. Imbens, and D. Rubin (1996): “Identification of Causal Effects using Instrumental Variables,” Journal of the American Statistical Association, 91, 444–472.

Ascarza, E. (2018): “Retention futility: Targeting high-risk customers might be ineﬀective,”

Journal of Marketing Research, 55(1), 80–98.

Athey, S., and G. Imbens (2016): “Recursive partitioning for heterogeneous causal eﬀects,”

Proceedings of the National Academy of Sciences, 113, 7353–7360.

Athey, S., J. Tibshirani, and S. Wager (2019): “Generalized random forests,” The Annals

of Statistics, 47, 1148–1178.

Basso, L. J., and H. E. Silva (2014): “Eﬃciency and substitutability of transit subsidies and

other urban transport policies,” American Economic Journal: Economic Policy, 6(4), 1–33.

Batty, P., R. Palacin, and A. González-Gil (2015): “Challenges and opportunities in developing urban modal shift,” Travel Behaviour and Society, 2(2), 109–123.

Blanco, G., C. A. Flores, and A. Flores-Lagunes (2011): “Bounds on Quantile Treat-

ment Eﬀects of Job Corps on Participants’ Wages,” Discussion paper.

Bodory, H., and M. Huber (2018): “The causalweight package for causal inference in R,”

SES Working Paper 493, University of Fribourg.


Breiman, L. (2001): “Random forests,” Machine Learning, 45, 5–32.

Breiman, L. (2018): “randomForest: Breiman and Cutler’s Random Forests for Classification and Regression. R package version 4.6-12,” software available at URL: https://cran.r-project.org/package=randomForest.

Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984): Classiﬁcation and Regression

Trees. Wadsworth, Belmont, California.

Brynjolfsson, E., and K. McElheran (2016): “The rapid adoption of data-driven decision-

making,” American Economic Review, 106(5), 133–39.

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): “Double/debiased machine learning for treatment and structural parameters,” The Econometrics Journal, 21, C1–C68.

De Palma, A., R. Lindsey, and G. Monchambert (2017): “The economics of crowding in

rail transit,” Journal of Urban Economics, 101, 106–122.

De Witte, A., C. Macharis, P. Lannoy, C. Polain, T. Steenberghen, and S. Van de

Walle (2006): “The impact of “free” public transport: The case of Brussels,” Transportation

Research Part A: Policy and Practice, 40(8), 671–689.

Desmaris, C. (2014): “The reform of passenger rail in Switzerland: More performance without

competition,” Research in Transportation Economics, 48, 290–297.

Frangakis, C., and D. Rubin (2002): “Principal Stratification in Causal Inference,” Biometrics, 58, 21–29.

Hagenauer, J., and M. Helbich (2017): “A comparative study of machine learning classiﬁers

for modeling travel mode choice,” Expert Systems with Applications, 78, 273–282.

Heckman, J. J. (1976): “The Common Structure of Statistical Models of Truncation, Sample

Selection, and Limited Dependent Variables, and a Simple Estimator for such Models,” Annals

of Economic and Social Measurement, 5, 475–492.

(1979): “Sample selection bias as a speciﬁcation error,” Econometrica, 47, 153–161.


Huber, M. (2014): “Treatment evaluation in the presence of sample selection,” Econometric

Reviews, 33, 869–905.

Hünermund, P., J. Kaminski, and C. Schmitt (2021): “Causal Machine Learning and Business Decision Making.”

Imai, K. (2008): “Sharp bounds on the causal eﬀects in randomized experiments with

‘truncation-by-death’,” Statistics & Probability Letters, 78, 144–149.

Imbens, G. W., and J. Angrist (1994): “Identiﬁcation and Estimation of Local Average

Treatment Eﬀects,” Econometrica, 62, 467–475.

Knaus, M. C. (2021): “A double machine learning approach to estimate the eﬀects of musical

practice on student’s skills,” Journal of the Royal Statistical Society: Series A (Statistics in

Society), 184(1), 282–300.

Lee, D. S. (2009): “Training, Wages, and Sample Selection: Estimating Sharp Bounds on

Treatment Eﬀects,” Review of Economic Studies, 76, 1071–1102.

Little, R., and D. Rubin (1987): Statistical Analysis with Missing Data. Wiley, New York.

Liu, L., and R.-C. Chen (2017): “A novel passenger ﬂow prediction model using deep learning

methods,” Transportation Research Part C: Emerging Technologies, 84, 74–91.

Liu, Y., Z. Liu, and R. Jia (2019): “DeepPF: A deep learning based architecture for metro

passenger ﬂow prediction,” Transportation Research Part C: Emerging Technologies, 101, 18–

34.

Lüscher, R. (2020): “10 Jahre Sparbillette – Attraktive Preise und höhere Nachfrage für den öV,” Discussion paper.

Mohring, H. (1972): “Optimization and scale economies in urban bus transportation,” The

American Economic Review, 62(4), 591–604.

Omrani, H. (2015): “Predicting travel mode of individuals by machine learning,” Transportation

Research Procedia, 10, 840–849.

Parry, I. W., and K. A. Small (2009): “Should urban transit subsidies be reduced?,” Amer-

ican Economic Review, 99(3), 700–724.


Paulley, N., R. Balcombe, R. Mackett, H. Titheridge, J. Preston, M. Wardman,

J. Shires, and P. White (2006): “The demand for public transport: The eﬀects of fares,

quality of service, income and car ownership,” Transport policy, 13(4), 295–306.

Pearl, J. (2000): Causality: Models, Reasoning, and Inference. Cambridge University Press,

Cambridge.

Redman, L., M. Friman, T. Gärling, and T. Hartig (2013): “Quality attributes of public transport that attract car users: A research review,” Transport policy, 25, 119–127.

Robins, J. M., and A. Rotnitzky (1995): “Semiparametric Efficiency in Multivariate Regression Models with Missing Data,” Journal of the American Statistical Association, 90, 122–129.

Robins, J. M., A. Rotnitzky, and L. Zhao (1994): “Estimation of Regression Coefficients When Some Regressors Are not Always Observed,” Journal of the American Statistical Association, 89, 846–866.

Rotaris, L., and R. Danielis (2014): “The impact of transportation demand management

policies on commuting to college facilities: A case study at the University of Trieste, Italy,”

Transportation Research Part A: Policy and Practice, 67, 127–140.

Rubin, D. B. (1974): “Estimating Causal Eﬀects of Treatments in Randomized and Nonran-

domized Studies,” Journal of Educational Psychology, 66, 688–701.

(1976): “Inference and Missing Data,” Biometrika, 63, 581–592.

Semenova, V., and V. Chernozhukov (2020): “Debiased machine learning of conditional average treatment effects and other causal functions,” forthcoming in the Econometrics Journal.

Thao, V. T., W. von Arx, and J. Frölicher (2020): “Swiss cooperation in the travel and tourism sector: long-term relationships and superior performance,” Journal of Travel Research, 59(6), 1044–1060.

Tibshirani, J., S. Athey, R. Friedberg, V. Hadad, D. Hirshberg, L. Miner, E. Sverdrup, S. Wager, and M. Wright (2020): “grf: Generalized Random Forests,” R package version 1.2.0.


Tibshirani, R. (1996): “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B, 58, 267–288.

van der Laan, M. J., E. C. Polley, and A. E. Hubbard (2007): “Super Learner,” Statistical

Applications in Genetics and Molecular Biology, 6.

Wager, S., and S. Athey (2018): “Estimation and Inference of Heterogeneous Treatment

Eﬀects using Random Forests,” Journal of the American Statistical Association, 113, 1228–

1242.

Wegelin, P. (2018): “Is the mere threat enough? An empirical analysis about competitive ten-

dering as a threat and cost eﬃciency in public bus transportation,” Research in Transportation

Economics, 69, 245–253.

Wright, M. N., and A. Ziegler (2017): “ranger: A fast implementation of random forests

for high dimensional data in C++ and R,” Journal of Statistical Software, 77, 1–17.

Yang, J.-C., H.-C. Chuang, and C.-M. Kuan (2020): “Double machine learning with gradi-

ent boosting and its application to the Big N audit quality eﬀect,” Journal of Econometrics,

216(1), 268–283.

Yap, M., and O. Cats (2020): “Predicting disruptions and their passenger delay impacts for

public transport stops,” Transportation, pp. 1–29.

Zhang, J., R. Lindsey, and H. Yang (2018): “Public transit service frequency and fares

with heterogeneous users under monopoly and alternative regulatory policies,” Transportation

Research Part B: Methodological, 117, 190–208.

Zhang, J., and D. B. Rubin (2003): “Estimation of Causal Eﬀects via Principal Stratiﬁcation

When Some Outcomes are Truncated by ‘Death’,” Journal of Educational and Behavioral

Statistics, 4, 353–368.

Zhang, J., D. B. Rubin, and F. Mealli (2008): “Evaluating The Eﬀects of Job Training

Programs on Wages through Principal Stratiﬁcation,” in Advances in Econometrics: Mod-

elling and Evaluating Treatment Eﬀects in Econometrics, ed. by D. Millimet, J. Smith, and

E. Vytlacil, vol. 21, pp. 117–145. Elsevier Science Ltd.


Appendices

A Propensity score plots

Figure A.1: Propensity score estimates in the higher discount category (D ≥ 0.3)

Figure A.2: Propensity score estimates in the lower discount category (D < 0.3)

B Further tables


Table B.1: Predictive outcome analysis, D < 0.3

demand shift upselling additional trip

variable importance variable importance variable importance

departure time 37.33 capacity utilization 41.387 seat capacity 25.342

seat capacity 27.871 oﬀer level D 27.669 age 21.639

capacity utilization 26.508 age 22.145 capacity utilization 20.168

distance 26.31 D 19.077 distance 18.970

age 26.223 oﬀer level C 17.324 departure time 18.527

D 25.08 departure time 16.538 D 18.076

number of sub-journeys 17.403 distance 15.897 ticket purchase complexity 16.637

oﬀer level C 15.643 oﬀer level B 15.696 oﬀer level C 12.085

diﬀ. purchase travel 15.299 rel. sold level B 10.992 rel. sold level B 11.709

ticket purchase complexity 15.116 number of sub-journeys 10.019 oﬀer level D 11.641

rel. sold level B 15.03 diﬀ. purchase travel 9.573 number of sub-journeys 11.347

oﬀer level D 15.012 rel. sold level C 9.367 diﬀ. purchase travel 10.328

rel. sold level C 14.625 oﬀer level A 7.857 oﬀer level B 10.185

oﬀer level B 14.413 rel. sold level D 7.33 rel. sold level C 8.993

rel. sold level A 11.856 ticket purchase complexity 7.319 rel. sold level A 8.162

oﬀer level A 11.329 oﬀer level E 7.183 oﬀer level A 7.643

rel. amount imputed values 10.079 rel. sold level A 6.769 class 7.381

rel. sold level D 9.625 rel. amount imputed values 5.422 rel. sold level D 6.964

adult companions 8.503 adult companions 4.881 rel. amount imputed values 6.189

oﬀer level E 6.511 rush hour 4.143 adult companions 5.785

gender 5.214 leisure 3.602 leisure 5.398

leisure 5.154 gender 3.599 oﬀer level E 4.866

destination Geneva Airport 4.83 2019 3.597 gender 4.692

departure Zuerich 4.736 travel alone 3.047 German 3.801

class 4.598 Friday 2.92 half fare travel ticket 3.686

travel alone 4.59 German 2.825 travel alone 3.540

peak hour 4.545 French 2.479 French 3.493

Friday 4.524 departure Zuerich 2.429 Friday 3.419

German 4.522 destination Zuerich Airport 2.427 half fare 3.163

amount purchased tickets 4.349 scheme 20 2.427 2019 3.136

correct prediction rate 0.555 0.772 0.605

balanced sample size 1642 1140 1202

Notes: ‘Diff. purchase travel’ denotes the difference between purchase and travel day. ‘Rel. sold level A’, ‘rel. sold level B’, ‘rel. sold level C’, and ‘rel. sold level D’ denote the relative amount of supersaver tickets sold with discount A, B, C, and D, respectively, in relation to the seats offered. ‘Offer level A’, ‘offer level B’, ‘offer level C’, ‘offer level D’, and ‘offer level E’ denote the amount of supersaver tickets offered with discount A, B, C, D, and E, respectively. ‘No subscription’ indicates not possessing any subscription. For predicting upselling, the covariates ‘class’ and ‘seat capacity’ are dropped.


Table B.2: Predictive outcome analysis, D≥0.3

demand shift upselling additional trip

variable importance variable importance variable importance

departure time 114 seat capacity 246.396 capacity utilization 133.936

seat capacity 95.799 oﬀer level B 178.212 age 105.107

age 95.209 oﬀer level C 127.327 departure time 100.091

capacity utilization 89.422 D 100.618 capacity utilization 97.889

distance 85.503 oﬀer level A 88.947 distance 85.647

D 80.447 Tageszeitinmin (time of day in min.) 82.658 D 83.671

diﬀ. purchase travel 69.276 age 78.886 oﬀer level B 73.399

oﬀer level B 68.75 distance 73.885 oﬀer level A 69.936

oﬀer level A 65.766 oﬀer level D 72.452 diﬀ. purchase travel 67.823

oﬀer level C 60.513 diﬀ. purchase travel 55.321 oﬀer level C 64.689

number of sub-journeys 57.767 number of sub-journeys 48.622 number of sub-journeys 58.100

oﬀer level D 44.626 rel. sold level A 38.997 class 54.857

ticket purchase complexity 44.434 ticket purchase complexity 31.991 ticket purchase complexity 49.348

rel. sold level A 39.796 oﬀer level E 27.041 oﬀer level D 46.685

rel. amount imputed values 32.144 rel. amount imputed values 25.586 rel. sold level A 42.589

adult companions 25.925 adult companions 24.73 rel. amount imputed values 35.411

rel. sold level B 20.536 rel. sold level B 17.099 half fare 31.308

gender 18.629 gender 15.835 adult companions 28.732

oﬀer level E 17.521 2019 15.416 half fare travel ticket 23.900

travel alone 15.15 Saturday 14.256 gender 20.562

amount purchased tickets 14.939 amount purchased tickets 13.396 rel. sold level B 20.036

German 14.855 rush hour 12.947 oﬀer level E 18.595

French 14.415 German 12.878 German 16.257

2019 14.387 leisure 12.344 amount purchased tickets 15.847

Sunday 13.821 travel alone 12.28 leisure 15.500

destination Zuerich Airport 13.387 half fare 11.882 no subscription 15.434

Saturday 13.378 Friday 11.646 travel alone 15.132

class 13.27 scheme 20 11.559 Swiss 14.951

half fare 13.258 French 11.477 Saturday 14.613

rel. amount imputed values 13.048 Sunday 11.39 2019 14.279

correct prediction rate 0.589 0.809 0.629

balanced sample size 5320 5598 5798

Notes: ‘Diff. purchase travel’ denotes the difference between purchase and travel day. ‘Rel. sold level A’, ‘rel. sold level B’, ‘rel. sold level C’, and ‘rel. sold level D’ denote the relative amount of supersaver tickets sold with discount A, B, C, and D, respectively, in relation to the seats offered. ‘Offer level A’, ‘offer level B’, ‘offer level C’, ‘offer level D’, and ‘offer level E’ denote the amount of supersaver tickets offered with discount A, B, C, D, and E, respectively. ‘No subscription’ indicates not possessing any subscription. For predicting upselling, the covariates ‘class’ and ‘seat capacity’ are dropped.
