Title:
Predicting procedure step performance from operator and text features: A critical first step
towards machine learning-driven procedure design
Author Names and Affiliations:
Anthony D. McDonald1, mcdonald@tamu.edu
Nilesh Ade2, nilesh14@tamu.edu
S. Camille Peres2,3 peres@tamu.edu
1 Wm Michael Barnes '64 Department of Industrial and Systems Engineering, Texas A&M
University, College Station, TX USA
2 Mary Kay O’Connor Process Safety Center, Department of Chemical Engineering, Texas A&M
University, College Station, TX USA
3 Department of Environmental and Occupational Health, Texas A&M University, College Station,
TX USA
Running Head:
Towards machine learning-driven procedure design
Manuscript Type:
Research article
Word Count:
5,715
Acknowledgments:
This research was co-funded by the Next Generation Advanced Procedures
Initiative (http://advancedprocedures.tamu.edu/). We would further like to acknowledge the
assistance of Shell USA for access to the BOOST training facility in Robert, LA. We would also
like to acknowledge Noor Quddus, Timothy Neville, Pranav Bagaria, Changwon Son, M. Sam
Mannan, Sarah Thomas, Lena Clark, and Whitney Mantooth for their valuable contributions to this
project.
Corresponding Author:
Anthony D. McDonald, Texas A&M University,
Department of Industrial and Systems Engineering,
3131 TAMU,
College Station, TX 77843, USA;
E-mail: mcdonald@tamu.edu.
Objective:
The goal of this study is to assess machine learning for predicting procedure performance from operator and procedure characteristics.
Background:
Procedures are vital for the performance and safety of high-risk industries. Current procedure design guidelines are insufficient because they rely on subjective assessments and qualitative analyses that struggle to integrate and quantify the diversity of factors that influence procedure performance.
Method:
We used data from a 25-participant study with 4 procedures, conducted on a high-fidelity oil extraction simulation, to develop logistic regression, random forest, and decision tree algorithms that predict procedure step performance from operator, step, readability, and natural language processing-based features. Features were filtered using the Boruta approach. The algorithms were trained and optimized with a repeated 10-fold cross-validation. Following the algorithm assessment, inference was performed using variable importance and partial dependence plots.
Results:
The random forest, decision tree, and logistic regression algorithms with all features had an AUC of 0.78, 0.77, and 0.75, respectively, and significantly outperformed logistic regression with only operator features (AUC of 0.61). The most important features were experience, familiarity, total words, and character-based metrics. The partial dependence plots showed that steps with fewer words, abbreviations, and characters were correlated with correct step performance.
Conclusion:
Machine learning algorithms are a promising approach for predicting step-level procedure performance and may help guide procedure design after validation with additional data on further tasks.
Application:
After validation, the inferences from these models can be used to generate procedure design alternatives.
Précis:
We develop a machine learning approach for predicting procedure performance from operator, readability, and natural language processing-based features.
Keywords:
machine learning, procedure design, operator performance, random forest, decision tree
Introduction
Inadequate procedures have been cited as a primary or secondary cause of several major incidents including the Macondo blowout (Hickman et al., 2012), the Texas City refinery explosion (US Chemical Safety and Hazard Investigation Board, 2007), and the 2007 American Airlines Flight 1400 engine fire (Baron, 2009). Beyond these incidents, the National Aeronautics and Space Administration (NASA) found that 44% to 73% of aviation maintenance errors were associated with procedures (Hobbs & Kanki, 2008). A recent study of the healthcare industry observed that around 16% of undesired incidents resulted from the poor quality of procedures (Deufel et al., 2017).
The role of procedures in these failures is complex and can be attributed to multiple factors including the quality of the procedure (e.g., accuracy, clarity, ease of use); workers’ adherence to the procedure; and the availability of the procedure (Bullemer & Laberge, 2010; Poisson & Chinniah, 2015; Sasangohar et al., 2018; Siegel & Schraagen, 2017; Vicente & Burns, 1995; Wright & McCarthy, 2003). The influence of these factors on system performance is also complex. For instance, prior research has found that some deviations from procedures can improve large system operations and that rote procedure adherence has, in some cases, contributed to failures (Suchman, 1983; Vicente & Burns, 1995). These seemingly conflicting findings speak to the need to design procedures that consider all aspects of the complex sociotechnical systems in which they operate (Hale & Borys, 2013b, 2013a). Our long-term goal is to address this challenge by introducing a novel method of machine learning-driven procedure design. This work explores a critical first step in that process: predicting procedure step performance from procedure text and operator characteristics.
Background
Procedure design has typically focused on the design of the procedure document itself rather than other relevant attributes of the entire procedural system (e.g., operator experience, task complexity) that may impact task performance and overall system safety (Ahmed et al., 2020; Peres, Quddus, et al., 2016). A recent review from Ahmed et al. (2020) found that few procedure writing guides based their guidance on empirical evidence or documented best practices. However, more recent recommendations have advocated for the integration of Human Factors design approaches (UK Health and Safety Executive, 2015). Considerations of human and system factors will likely result in significant improvements in procedure quality, adherence, and overall task performance, as research suggests that successful procedure performance depends on attributes of the procedure, the task, and characteristics of the operator (Novatsis & Skilling, 2016; Peres et al., 2016; Sasangohar, Peres, Williams, Smith, & Mannan, 2018). Indeed, studies investigating performance and satisfaction with procedures and procedural systems regularly find that attributes of the worker (e.g., experience) and attributes of the task (e.g., frequency) are reliably related to objective and subjective outcomes (Bates & Holroyd, 2012; Bullemer & Hajdukiewicz, 2004; Carim et al., 2016; Dekker, 2003; HFRG, 1995; Noroozi et al., 2014). Thus, to identify those attributes of a procedural system that reliably impact worker performance and, more importantly, system safety, attributes of the entire system must be considered.
Understanding how these diverse features are differentially associated with workers’ performance with procedures represents a significant challenge. One methodology that may address this challenge is machine learning. While machine learning is not a traditional Human Factors method, its use has been growing, and it has been shown to be superior to traditional statistical analyses (i.e., logistic regression) in some cases (Carnahan et al., 2003). Machine learning has also been successful in aiding the design process in other domains such as drug design (Burbidge et al., 2001; Sanchez-Lengeling & Aspuru-Guzik, 2018). The most common type of machine learning employed for design is supervised machine learning, in which classes of the data are known and the goal is to build algorithms to make predictions about the classes (James et al., 2013). While prediction is traditionally the focus of supervised machine learning approaches, there has been recent interest in using fitted supervised machine learning algorithms for causal inference (Wager & Athey, 2018). Much of this work has focused on tree-based approaches such as decision trees and random forests.
Decision trees learn to classify the data by iteratively segmenting the predictor space into regions corresponding to a class (James et al., 2013). They have been successfully implemented in the past in Human Factors research to identify common patterns across specific automotive crashes (Clarke et al., 1998) and quantify human and organizational aspects of accident management (Baumont et al., 2000). Random forests are an extension of decision trees in which many individual trees are fit to different subsets of the data and predictors, and classifications are made by a plurality vote among the different trees (Breiman, 2001). In the Human Factors domain, random forests have been applied to the detection of driver drowsiness (McDonald et al., 2014), driver distraction (McDonald et al., 2019), and the prediction of anxiety in computer users (Yamauchi, 2013). Random forests are advantageous relative to decision trees because they are less sensitive to overfitting the training data and typically have higher accuracy when predicting new unseen data (James et al., 2013). In addition, random forests can be used to calculate variable importance, a measure of the prediction accuracy lost by eliminating a predictor from the algorithm, and partial dependence, a measure analogous to beta coefficients in linear regression, which can be used for inference (Friedman, 2001; Zhao & Hastie, 2019). Decision trees, in contrast, provide a clearer logical structure that is more interpretable for procedure designers. To benefit from the strengths of both methods, recent analyses suggest the use of random forests to select predictors that are subsequently used to fit a simpler algorithm (Chen & Lin, 2006; Kursa & Rudnicki, 2010; Menze et al., 2009).
The extent to which the predictive and inference capabilities of decision trees and random forests may be used to understand procedure step performance remains an open question. The goal of this study is to attempt to answer this question in two parts. First, we fit a series of algorithms to predict procedure step performance from predictors that describe the task, the text of the procedure’s steps, and operators. We use these variables because they represent different attributes of the procedural system and will provide a proof of concept with regard to whether these algorithms can effectively identify the importance of and interactions between these variables. Second, we use the fitted algorithm to identify the most important predictors and their quantitative impact on procedure performance.
Method
When workers perform the steps in a procedure, they are doing so to support their performance of a particular task that itself is part of a larger system or process. Performance may be assessed across each of these levels (step level, task level, and system level). Performance across these levels is often correlated, but it is not always consistent. For example, if workers follow an incorrect procedure exactly (high step-level performance), it may lead to an unsafe situation (low task- and system-level performance; Dekker, 2003). Additionally, workers may complete a task successfully without following the procedure exactly (low step-level performance; Dekker, 2003; Sasangohar et al., 2018). In the current analysis, we focus on step-level performance. While this focus is somewhat limited, it is justified by our goals of providing step-level procedure design guidance and understanding interactions between the task (e.g., frequency), the worker (e.g., experience), and procedure steps (e.g., word length). The remainder of this section describes the dataset used for this analysis, discusses the measurement of step-level performance and the features calculated, and introduces the machine learning process.
Dataset
The data used in this study were collected as part of an experiment conducted at Shell’s Robert training center, Louisiana, USA, in the Basic Offshore Operations Simulator Training (BOOST) facility. The BOOST facility is a high-fidelity training environment for offshore oil production platforms, where participants perform tasks on real equipment to process mineral oil (as opposed to petroleum on offshore facilities). The study was approved by the Texas A&M Institutional Review Board and complied with the American Psychological Association’s code of ethics. The experiment involved the observation of the step-level performance of 25 operators performing 4 procedures. The procedures included fluid sampling using a centrifuge, column flushing, Level Control Valve (LCV) replacement, and pressure testing. Fluid sampling consists of separating water from mineral oil. Column flushing comprises draining gas and liquid from a column attached to a secondary vessel. LCV replacement is a maintenance task that replaces a level control valve. Pressure testing is a task that tests the high- and low-pressure warning indicators of a digital sensor. These procedures were selected because they are integral to the safe operation of an offshore oil rig. The procedures varied in their frequency of performance, step content, and number of steps. Frequency was classified as frequent (approximately weekly) or infrequent (approximately yearly) based on the expertise of the training team at the facility. The number of procedure steps varied from 8 to 23. Table 1 summarizes these metrics for the four procedures.
Table 1 Descriptive information about the procedures.

Procedure | Number of steps | Frequency | Example step
Fluid sampling using a centrifuge | 8 | Frequent | “Place tubes onto opposite sides of the centrifuge to maintain balance”
Column flushing | 14 | Frequent | “Open manual column valves M101 - 11. Upper isolation valve on level column”
Level Control Valve replacement | 23 | Infrequent | “Open drain valve downstream of check Valve FSV-M111-29”
Pressure testing | 14 | Infrequent | “Make sure test connection is depressured”
Participants
Three attributes of the participants were included in the machine learning algorithms: their experience, their familiarity with the procedures, and their tendency to acknowledge a step by signing off. An employment agency recruited the participants, with the goal of having an equal number of Experienced and Inexperienced workers. Experience level was based on the judgment of subject matter experts and the trainers at the facility. Participants with less than 6 years of experience were considered inexperienced, whereas those with more than 6 years of experience were considered experienced. Participants rated their familiarity with the actual procedures, following their completion of all the procedures, using a 5-point Likert scale with 1 denoting complete unfamiliarity with the task and 5 denoting complete familiarity. The study authors manually annotated participants’ sign-off behavior (checking off each step to acknowledge that it had been done) from video recordings.
Step performance
Step performance was assigned a binary label of correct or incorrect in accordance with the definitions established in Neville et al. (2018). The correctness assessment was conducted through video coding. There were four total reviewers: two coded the Column Flushing and Sampling tasks and two coded the Level Control Valve and Pressure Testing tasks. Interrater reliability was measured with Cohen’s Kappa. The Kappa between the first two coders was 0.57 (81% agreement) and the Kappa between the second pair of coders was 0.38 (68% agreement). Where the two reviewers initially did not agree on a code, they used a consensus method to decide on the most appropriate code.
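The agreement figures above pair a raw percent agreement with a chance-corrected Kappa. As a minimal, stdlib-only sketch of that computation, the following Python function computes Cohen’s Kappa from two coders’ label sequences; the labels in the example are hypothetical, not drawn from the study data.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's Kappa for two raters assigning categorical labels.

    Kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the raters' marginals.
    """
    assert len(coder_a) == len(coder_b) and len(coder_a) > 0
    n = len(coder_a)
    # Observed agreement: proportion of items both coders labeled identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two coders label four steps correct ("c") / incorrect ("i").
kappa = cohens_kappa(["c", "c", "i", "c"], ["c", "i", "i", "c"])
```

Percent agreement alone overstates reliability when one label dominates (as correct steps do here), which is why the Kappa values reported above are lower than the raw agreement percentages.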
Steps were considered to be performed incorrectly if operators took an extended period of time to complete a step (approximately double the nominal duration), needed assistance from the training instructor to perform the step, performed the step out of order from the sequence described in the procedure, performed the step incorrectly, or never completed the step. Steps were considered to be correctly performed if the operator performed the step completely in the right manner, in the prescribed sequence, without struggling and without assistance from the instructor (Neville, Peres, Ade, et al., 2018). The distribution of step performance of the operators is shown in Table 2. While the overall step accuracy rate (66%) may seem low, it is important to acknowledge that almost every participant eventually completed the tasks correctly. The successful task completion and the distribution of types of failures suggest that the observed failures are representative of how workers interact with actual procedures rather than simply unfamiliarity with the procedures.
Table 2 Distribution of step performance by error type. Each cell shows the number of steps and the percentage.

Procedure | Steps done out of order | Steps done with struggle or assistance | Steps done incorrectly, skipped, or with extended period
Fluid sampling using a centrifuge | 23 (17%) | 4 (3%) | 28 (20%)
Column flushing | 20 (8%) | 13 (5%) | 29 (12%)
Level control valve replacement | 68 (19%) | 23 (6%) | 31 (9%)
Pressure testing | 20 (13%) | 10 (5%) | 72 (16%)
Total | 131 (13%) | 50 (5%) | 160 (16%)
Feature identification and reduction
The complete dataset consisted of 1,009 step performances: 668 correct instances and 341 incorrect instances. In order to reduce the likelihood of algorithm overfitting and Type-1 errors, a limited set of 30 features was considered. The features were associated with one of four categories: operator characteristics, procedure characteristics, step readability, and standard natural language processing-based features. The goal of these categories was to capture theoretically motivated, domain-specific factors (e.g., learning and retention) alongside domain-agnostic text processing features, and to provide a comparison between the two types. These features and the rationale for their inclusion are described in detail in the following sections.
Operator characteristic features
Prior research on procedures in high-risk industries suggests that three of the most significant operator characteristics related to procedure adherence are 1) field experience, 2) familiarity with the procedure, and 3) the operator’s tendency to acknowledge the completion of a step in writing (Novatsis & Skilling, 2016; Sasangohar et al., 2018). Recent qualitative analyses suggest that many experienced operators in high-risk industries find procedures extraneous and that they may prefer to rely on their extensive training, prior knowledge, and efficient teamwork with other operators rather than adhere to procedures (Sasangohar et al., 2018). In contrast, inexperienced operators may follow procedures more closely because procedures directly support the proceduralization phase (i.e., rule-based behavior (Rasmussen, 1983) or the associative stage (Anderson, 1982)) of learning (Ritter et al., 2013). Familiarity may impact procedure step performance through a similar pathway, as the need to rely on declarative knowledge fades and operators increasingly rely on procedural learning (Ritter et al., 2013). Step acknowledgement has been directly linked to correct procedure step performance and is recommended industry practice (Novatsis & Skilling, 2016). These three characteristics were included as individual features in the machine learning analysis. Experience and step acknowledgement were represented as binary variables, and familiarity was included on a five-point scale. Experience was represented as a binary feature rather than a continuous feature due to sparsity in the distribution of the years of experience of the operators.
Procedure characteristic features
Procedure characteristics were included in three ways: the frequency of performance of the procedure (Frequent/Infrequent), the total number of steps of the procedure, and the procedure step complexity. Procedure step complexity is a content-based approximation of the cognitive load imposed by a step. This complexity has been shown to impact system performance by influencing decision making, information processing, intrinsic motivation, and satisfaction of the task performer (Campbell, 1991; Gill & Hicks, 2006). The complexity measure employed in this study was developed by Kannan, Quddus, Peres, and Mannan (2018). The measure calculates 5 binary dimensions of complexity: decision, judgment, interdependency, step-size, and step information. Decision complexity reflects that a step requires an operator to observe and respond to a cue. Judgment complexity requires an operator to evaluate a quantity. Interdependency comprises the dependency of one step on another. Step-size indicates the presence (or absence) of multiple instructions in a step. Step information indicates that additional information in the form of notes or cautions is provided to an operator in a step. The complexity of each dimension was calculated based on the presence of identifiers (keywords) in a step. For example, steps including the words “if” and “then” indicate the presence of decision complexity. The complexity calculations are independent, and more than one type of complexity may be present in any given step. The types of step-level complexity and example identifiers are shown in Table 3. Each type of complexity was included as a binary feature in the dataset due to a sparsity of instances where multiple examples of a given type of complexity were present. The number of steps in the procedure was included as a continuous variable and the frequency of performance was included as a binary variable.
Table 3 Types, context, and example identifiers of step-level complexity

Type of complexity | Context | Example identifiers
Decision | Presence of a decision in a step resulting from a multiplicity of outcomes | ‘If-then’
Judgment | The requirement of an operator’s judgment in a step due to the presence of uncertainty in the information in a step | ‘Raise’, ‘Reduce’
Interdependency | Dependence of a step on another step (or steps) through order or cascade | ‘Go to’, ‘Proceed to’
Step size | The multiplicity of instructions in a step | ‘and’, ‘,’
Step information (Step info) | The multiplicity of information in a step | ‘Note’, ‘Caution’
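The keyword-based complexity coding described above can be sketched as a simple lookup over a step’s text. This is an illustration only: the identifier lists below are assumptions built from the examples in Table 3, not the full identifier sets of Kannan et al. (2018), and a production implementation would tokenize rather than match raw substrings.

```python
# Illustrative identifier lists based on Table 3's examples (assumed, not the
# complete sets from Kannan et al., 2018).
COMPLEXITY_IDENTIFIERS = {
    "decision": ["if", "then"],
    "judgment": ["raise", "reduce"],
    "interdependency": ["go to", "proceed to"],
    "step_size": [" and ", ","],
    "step_info": ["note", "caution"],
}

def complexity_flags(step_text):
    """Return the 5 binary complexity dimensions for one step (1 = present).

    A dimension is flagged when any of its identifiers occurs in the step;
    dimensions are independent, so several may be 1 for the same step.
    """
    text = step_text.lower()
    return {dim: int(any(key in text for key in keys))
            for dim, keys in COMPLEXITY_IDENTIFIERS.items()}

# Example step with decision, judgment, step-size, and step-info complexity.
flags = complexity_flags("If the level rises, then reduce flow and note the reading")
```

Each flag then enters the feature set as one binary column, matching the encoding described in the text.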
Readability features
Readability is a measure of the ease of reading a step (DuBay, 2008). Whereas the step complexity features provide an index of the difficulty of executing the task, readability measures the difficulty of processing the instructions. Readability was included in this analysis because it is a common focus of procedure design for high-risk industries (Sharit, 1998) and has been shown to influence operator performance (Novatsis & Skilling, 2016). Readability is typically measured through a scoring formula, which is often a function of counts or averages of words, sentences, syllables, and characters in a text. Given that there is no singular gold-standard metric for the readability of procedures in the oil and gas industry, we initially calculated a set of 29 metrics for each step in the dataset. These metrics were selected based on their commonality and ease of implementation. The metrics were calculated with the Quanteda package in R (Benoit et al., 2018).
The volume and overlapping nature of the readability formulas create a challenge because many of the metrics are highly correlated, which may undermine the generalizability and performance of decision tree algorithms (Kuhn & Johnson, 2013). To address this challenge, we performed a principal component feature selection process to identify the readability metric that explained the most variance in the dataset. This method is an established technique for feature selection with correlated features (Song et al., 2010). The principal components analysis, summarized in Table 4, found that the Flesch-Kincaid metric was the most informative in this context and thus this metric was included
in the final feature set. The Flesch-Kincaid metric measures readability from the average sentence length (ASL), number of words (nw), and number of syllables (nsy) according to Equation 1.

FK = 0.39 × ASL + 11.8 × (nsy / nw) − 15.59    (1)

Table 4 Summary of principal components analysis of readability features.

Parameter | Principal component 1
Standard deviation | 3.39
Proportion of variance | 39.66%

Loading of individual readability metrics in principal component 1:
ARI (Smith & Senter, 1967) | 0.19
Coleman (E. B. Coleman, 1971) | -0.20
Coleman.C2 (E. B. Coleman, 1971) | -0.24
Coleman.Liau.ECP (M. Coleman & Liau, 1975) | -0.23
Farr.Jenkins.Paterson (Farr et al., 1951) | 0.03
Flesch (Flesch, 1948) | -0.23
Flesch.PSK (Powers et al., 1958) | 0.08
Flesch.Kincaid (Kincaid et al., 1975) | 0.27
FOG (Gunning, 1968) | 0.23
FOG.PSK (Powers et al., 1958) | 0.23
FOG.NRI (Kincaid et al., 1975) | 0.11
FORCAST (Caylor & Sticht, 1973) | 0.09
FORCAST.RGL (Caylor & Sticht, 1973) | 0.19
Fucks (Fucks, 1955) | 0.11
Linsear.Write (Klare G, 1974) | 0.16
LIW (Björnsson, 1968) | 0.18
nWS (Bamberger & Vanecek, 1984) | 0.20
nWS.2 (Bamberger & Vanecek, 1984) | 0.24
nWS.3 (Bamberger & Vanecek, 1984) | 0.24
nWS.4 (Bamberger & Vanecek, 1984) | 0.24
RIX (Anderson, 1983) | 0.22
Scrabble (Benoit et al., 2018) | 0.05
SMOG (McLaughlin, 1969) | 0.19
SMOG.C (McLaughlin, 1969) | 0.25
Spache.old (Spache, 2005) | -0.05
Strain (Solomon, 2006) | 0.18
Traenkle.Bailer (Tränkle & Bailer, 1984) | -0.12
meanSentenceLength (Benoit et al., 2018) | 0.11
meanWordSyllables (Benoit et al., 2018) | 0.11
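Equation 1 can be computed directly from a step’s text. The sketch below is a rough Python approximation: the syllable counter is a crude vowel-group heuristic (an assumption; the study used Quanteda’s implementation in R), so its scores will not exactly match Quanteda’s values, but the structure of the formula is the same.

```python
import re

def count_syllables(word):
    """Crude heuristic: count runs of consecutive vowels (minimum 1).
    An assumption standing in for Quanteda's syllable counter."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid(text):
    """Flesch-Kincaid grade level per Equation 1:
    FK = 0.39 * ASL + 11.8 * (nsy / nw) - 15.59,
    where ASL is the average sentence length in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    asl = len(words) / len(sentences)
    nsy = sum(count_syllables(w) for w in words)
    return 0.39 * asl + 11.8 * nsy / len(words) - 15.59

# Short, monosyllabic steps score lower (easier) than long polysyllabic ones.
score = flesch_kincaid("Make sure test connection is depressured.")
```

In the study this score was computed per procedure step and entered the feature set as a single continuous column.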
Natural language processing features
Natural language processing features were included in this analysis to provide a domain-knowledge-agnostic comparison for the operator and procedure complexity features. Eighteen natural language processing features were calculated, including word and character counts, parts of speech, character types (e.g., capital letters), and sentiment scores. The words, characters, parts of speech, and character types were calculated as count frequencies. The sentiment scores were calculated by looking up the words in each step in a sentiment dictionary and summing the scores; Table 5 illustrates this calculation for the step “Close drain valve FSV-M111-29.” Four sentiment scores were calculated using the Bing (Liu, 2012), Vader (Hutto & Gilbert, 2014), Syuzhet (Jockers, 2015), and Afinn (Nielsen, 2011) dictionaries. These dictionaries represent a sample of the most widely used sentiment dictionaries. The process of calculation was consistent across the sentiment scores, although individual sentiment scores for a given word differed (e.g., the Vader sentiment score of “Drain” is 0).
Table 5 Illustration of sentiment calculation for the step “Close drain valve FSV-M111-29.”

Step text: “Close drain valve FSV-M111-29”
Step words | Close | Drain | Valve | FSV-M111-29
Bing sentiment score | 0 | -1 | 0 | 0
Total sentiment score | -1
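The lookup-and-sum calculation in Table 5 can be sketched in a few lines. The miniature lexicon below is a hypothetical stand-in for the Bing dictionary, which contains thousands of entries; only the mechanics of the calculation carry over.

```python
# Tiny stand-in lexicon (assumed entries). The study used the full Bing,
# Vader, Syuzhet, and Afinn dictionaries, which are far larger.
BING_LIKE = {"drain": -1, "leak": -1, "safe": 1, "correct": 1}

def step_sentiment(step_text, lexicon=BING_LIKE):
    """Sum the lexicon score of each word in the step (0 if absent),
    mirroring the calculation illustrated in Table 5."""
    words = step_text.lower().replace(".", " ").split()
    return sum(lexicon.get(word, 0) for word in words)

# Matches Table 5: only "drain" is in the lexicon, so the total is -1.
total = step_sentiment("Close drain valve FSV-M111-29")
```

Repeating this with each of the four dictionaries yields the four sentiment features used in the analysis.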
Machine learning analysis
The machine learning analysis used in this study consisted of four phases: feature selection, algorithm fitting, algorithm evaluation, and inferential analysis. Feature selection was performed using the Boruta algorithm, a random forest-based feature selection approach that identifies important features through two-sided tests of equality against random variables (Kursa & Rudnicki, 2010). Following the Boruta feature selection, four algorithms, a conditional inference decision tree (DT), a random forest (RF), and two logistic regressions (LR), were fit to the data using the caret (Kuhn, 2020) package in R 3.6.0 (R Core Team, 2014). Conditional inference trees were used because of their formal statistical foundation and because they may reduce the variable bias associated with standard recursive partitioning tree algorithms (Hothorn et al., 2006). The two logistic regression algorithms differed in their feature sets. The first LR algorithm used only operator features, and the second used the features selected by the Boruta process. The LR algorithms were used as a benchmark comparison to justify the additional complexity of the DT and random forest algorithms, following the example in Carnahan et al. (2003).
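Boruta’s core idea, comparing each real feature against shuffled “shadow” copies of the features, can be sketched as follows. This is a simplification under stated assumptions: it uses absolute correlation with the label as a stand-in importance measure (Boruta proper uses random-forest importance) and a simple majority-of-iterations rule instead of Boruta’s formal two-sided tests.

```python
import random
import statistics

def importance(feature, labels):
    """Stand-in importance: absolute Pearson correlation with the label.
    (Boruta itself uses random-forest variable importance.)"""
    mx, my = statistics.mean(feature), statistics.mean(labels)
    cov = sum((x - mx) * (y - my) for x, y in zip(feature, labels))
    sx = sum((x - mx) ** 2 for x in feature) ** 0.5
    sy = sum((y - my) ** 2 for y in labels) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0

def boruta_like_select(features, labels, n_iter=50, seed=0):
    """Keep features whose importance beats the best shuffled shadow copy
    in a majority of iterations (a simplification of Boruta's tests)."""
    rng = random.Random(seed)
    wins = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadow features: values shuffled to destroy any real association.
        shadow_max = 0.0
        for values in features.values():
            shuffled = values[:]
            rng.shuffle(shuffled)
            shadow_max = max(shadow_max, importance(shuffled, labels))
        for name, values in features.items():
            if importance(values, labels) > shadow_max:
                wins[name] += 1
    return [name for name, w in wins.items() if w > n_iter // 2]
```

A feature that cannot outperform randomly permuted versions of the data is, by this logic, carrying no usable signal and is dropped before algorithm fitting.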
The algorithms were implemented with the party (Hothorn et al., 2006), randomForest (Liaw &
271
Wiener, 2002), and statsvia the glm functionpackages, respectively. For each algorithm, a 10-
272
fold repeated cross validation approach with 10 repetitions was used to fit hyperparameters and
273
estimate the algorithm generalizability. In each repetition and fold, the incorrect instances in the
274
training data were upsampled to create a balanced dataset and further reduce bias in the algorithm
275
fitting process. The optimal hyperparameters were selected using the maximum area under the
276
receiver operating characteristic curve (AUC). The final tuned hyperparameters for the DT and
277
random forest summarized in Table 6—note that the LR algorithm does not have hyperparameters.
278
The overall algorithm fits were assessed with the AUC, sensitivity, and specificity calculated from
279
Towards machine learning-driven procedure design, Page 16
the cross-validation test set samples. Statistical tests for these metrics were calculated with the
280
DeLong method (DeLong et al., 1988).
281
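A minimal Python sketch of this evaluation scheme, using scikit-learn in place of caret and synthetic stand-in data: repeated stratified 10-fold cross-validation in which only the training folds are upsampled to class balance, with each fit scored by AUC. The exact resampling caret performs internally may differ.

```python
# Repeated stratified CV with training-fold-only upsampling, scored by AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.utils import resample

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0.8).astype(int)  # imbalanced

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
aucs = []
for train_idx, test_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Upsample the minority class within the training fold only, so each
    # test fold keeps its natural class balance.
    minority = np.flatnonzero(y_tr == np.argmin(np.bincount(y_tr)))
    extra = resample(minority, n_samples=len(y_tr) - 2 * len(minority),
                     random_state=1)
    keep = np.concatenate([np.arange(len(y_tr)), extra])
    model = LogisticRegression().fit(X_tr[keep], y_tr[keep])
    aucs.append(roc_auc_score(y[test_idx],
                              model.predict_proba(X[test_idx])[:, 1]))

print(round(float(np.mean(aucs)), 2))
```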
Table 6 Hyperparameter settings for the random forest and decision tree algorithms.

Algorithm       Hyperparameter   Final value   Definition
Decision tree   Max depth        15            The maximum depth of any branch of the tree
                Min criterion    0.22          The threshold of the test statistic for creating a split
Random forest   mtry             2             The number of randomly selected parameters considered for splitting at each node
Machine learning inference
Following the algorithm fitting and performance evaluation, the random forest variable importance and partial dependence plots were used for inference. The variable importance illustrates the expected loss in accuracy associated with removing a feature from the random forest algorithm. Variable importance is calculated by fitting the algorithm with the feature and calculating the accuracy, fitting the algorithm again without the feature and recalculating the accuracy, and then taking the difference between the two accuracy values. This iterative process does not control for the accuracy contributions of other features, so the total variable importance across all features may exceed 100%. This lack of control limits the use of variable importance for inference. Partial dependence addresses this limitation by calculating the expected algorithm prediction across the values of a feature in the dataset while averaging over the observed values of the other features. Partial dependence is analogous to a linear regression coefficient, although it typically takes a more complex functional form given the underlying complexity of the machine learning algorithm. Plotting partial dependence over the range of feature values illustrates how changes in a feature affect the algorithm's class prediction likelihood and provides more granular insight into the algorithm's predictions (Zhao & Hastie, 2019). Together, these methods can be used to identify important features for procedure design and thresholds that may inform design guidelines.
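The mean-decrease-in-accuracy idea can be sketched with permutation importance in Python: shuffle one feature at a time and measure how far accuracy drops. This is the shuffle-based variant rather than the refit procedure described above, and the data and model are synthetic stand-ins, not the study's.

```python
# Permutation-style variable importance (mean decrease in accuracy).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=400) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, y)
baseline = rf.score(X, y)

importance = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature j's link to y
    importance.append(baseline - rf.score(X_perm, y))

# Features 0 and 1 drive the label, so shuffling them should cost accuracy;
# feature 2 is noise and should cost little.
print([round(v, 2) for v in importance])
```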
Results

Feature selection

The Boruta feature selection method identified 25 relevant features and 4 irrelevant features for correct step performance. The irrelevant features included the number of first-person pronouns in the step, the step size and interdependence complexity, and whether the step was checked off. The relevant features are summarized in Table 7 according to their feature type.
Table 7 Selected features and their associated categories.

Operator characteristics: Experience; Frequency
Procedure characteristics: Number of steps; Decision complexity; Judgement complexity; Step info complexity; Familiarity
Readability: Flesch Kincaid score
Natural language processing: Total characters; Unique characters; Total digits; Lowercase characters; Periods; Unique words; Uppercase characters; Punctuation marks; Mean characters/word; Afinn dict. sentiment; Bing dict. sentiment; Syuzhet dict. sentiment; Vader dict. sentiment; Second person pronouns; Third person pronouns; To be verbs; Prepositions; Number of words
Algorithm fitting results

Figure 1 illustrates the ROC curves for the logistic regression with only operator features (LROP), logistic regression (LR), decision tree (DT), and random forest (RF) algorithms. The logistic regression with only operator features had an AUC of 0.61 (95% CI: 0.57-0.64), the logistic regression with all features had an AUC of 0.75 (95% CI: 0.72-0.78), the DT had an AUC of 0.77 (95% CI: 0.74-0.80), and the RF algorithm had an AUC of 0.78 (DeLong 95% CI: 0.75-0.81). The sensitivity and specificity of the algorithms and their standard deviations across cross-validation fold test sets are summarized in Table 8. In all cases, the algorithms performed significantly better than a random classifier. In addition, the algorithms with all features significantly outperformed the logistic regression with operator-only features (LR: D = -5.95, df = 1977.1, p < 0.001; DT: D = -6.50, df = 1958.8, p < 0.001; RF: Z = -7.92, p < 0.001). Pairwise comparisons between the AUCs of the LR, RF, and DT were not significant.
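The comparisons above use the DeLong test for correlated AUCs. As an informal, assumption-light alternative (not the DeLong method itself), a paired bootstrap over the same test cases can gauge whether one model's AUC advantage is robust; the scores below are synthetic stand-ins, not the study's predictions.

```python
# Paired bootstrap for the difference between two AUCs on shared test cases.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 500
y = rng.integers(0, 2, size=n)
score_a = y + rng.normal(scale=1.0, size=n)  # stronger classifier scores
score_b = y + rng.normal(scale=2.0, size=n)  # weaker classifier scores

diffs = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)      # resample cases with replacement
    if y[idx].min() == y[idx].max():      # AUC needs both classes present
        continue
    diffs.append(roc_auc_score(y[idx], score_a[idx]) -
                 roc_auc_score(y[idx], score_b[idx]))

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(round(float(np.mean(diffs)), 3))  # positive values favor model A
```

An interval (lo, hi) that excludes zero plays the same practical role as a significant DeLong comparison.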
Figure 1 Receiver operating characteristic curves for the logistic regression with only operator features (LROP), logistic regression with all features (LR), decision tree (DT), and random forest (RF) algorithms. [Figure: ROC curves plotting true positive rate against false positive rate for the four algorithms.]
Table 8 The area under the curve (AUC), sensitivity, and specificity of the logistic regressions, decision tree, and random forest algorithms. The numbers in parentheses are the standard deviations across the 10 cross-validation fold test sets.

Algorithm                             AUC           Sensitivity   Specificity
Logistic regression (operator only)   0.62 (0.05)   0.68 (0.05)   0.53 (0.11)
Logistic regression                   0.75 (0.05)   0.71 (0.07)   0.70 (0.07)
Decision tree                         0.77 (0.05)   0.73 (0.05)   0.68 (0.08)
Random forest                         0.78 (0.05)   0.76 (0.04)   0.67 (0.04)
Inferential analysis

The random forest variable importance plot, shown in Figure 2, illustrates that familiarity with the procedure is the most important feature; that is, omitting the feature results in the largest loss of predictive accuracy. Beyond familiarity, the other top ten most important features include experience, total words, character-based metrics (e.g., lowercase and total characters), Flesch Kincaid readability, and the Vader dictionary sentiment score. Notably, while the procedure-based features do appear to be important for classification, they are considerably less so than the operator-based features, the readability feature, and several of the natural language processing features.
Figure 2 Feature importance for the random forest algorithm based on mean decrease in accuracy. [Figure: horizontal bar chart of mean decrease in accuracy per feature; Familiarity, Experience, and Total words rank highest.]
Figure 3 illustrates the partial dependence plots for the top ten features, ordered left-to-right, top-to-bottom, by importance. Several notable trends emerge from the figure. Familiarity and experience show expected trends: increased familiarity and experience result in a higher likelihood that the algorithm predicts that the step will be correctly performed. In contrast, as total words increase, the algorithm is less likely to predict that the step will be correctly performed. Three surprising trends emerge from the Flesch Kincaid readability score, the uppercase characters, and the Vader sentiment. The Flesch Kincaid score graph suggests that as a step becomes more difficult to read, the algorithm is more likely to predict that it will be performed correctly. Further analysis suggests that this trend may be related to the use of multisyllabic domain words, e.g., "operator," and abbreviations. For example, one step with a Flesch Kincaid score of 10.9 (approximately equivalent to the highest likelihood that the algorithm predicts correct performance) reads, "Notify the Control Room Operator that equipment is ready to be removed from service." In contrast, the step "Close M112-9 and M112-10" has a Flesch Kincaid score of -1.45. Additional context on this analysis can be found in the uppercase characters chart (bottom left in the figure), which shows that the algorithm predicts that steps containing between 8 and 12 uppercase characters are likely to be performed incorrectly. These steps, e.g., "Have CRO place ILIC-101 in Manual Control," generally contain at least 2 abbreviations. Thus, these results suggest that the algorithm predicts that steps with many abbreviations are likely to be performed incorrectly. The trend in the Vader dictionary sentiment scores suggests that the algorithm predicts that steps with neutral sentiment are more likely to be performed correctly, whereas steps with more positive sentiment are more likely to be performed incorrectly.
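For readers who want to reproduce this kind of readability scoring, a minimal sketch of the Flesch Kincaid grade formula follows. The syllable counter is a naive vowel-group heuristic, so scores only approximate published values, and the example sentences are illustrative rather than drawn from the studied procedures.

```python
# Flesch Kincaid grade level with an approximate syllable counter.
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels (minimum one per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / sentences
            + 11.8 * syllables / len(words) - 15.59)

simple = flesch_kincaid_grade("The cat sat on the mat.")
complex_ = flesch_kincaid_grade("Notify the responsible operator immediately.")
print(simple < complex_)  # True: longer words push the grade level up
```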
Figure 3 Partial dependence plots for the most important features of the random forest algorithm. Points show values of the feature where partial dependence was calculated; the lines illustrate trends.
Additional context for the univariate partial dependence plots can be gained by analyzing and plotting the partial dependence across two variables. Figure 4 illustrates four such plots for experience and familiarity across the number of words in the procedure and the number of uppercase characters. The top-left graph, experience by number of words, shows that the algorithm is more likely to predict correct step performance for experienced operators (Experience = 1) when the step contains fewer than 20 words. Inexperienced operators (Experience = 0) show a similar trend, although the figure suggests that inexperienced operators are least likely to perform a step correctly if it has 20 words. The trend is similar in the top-right graph, which shows the prediction trends for uppercase characters and experience. Given the findings on abbreviations discussed above, it is notable that the algorithm predicts that incorrect step performance is most likely for inexperienced operators conducting steps with many abbreviations, and that experienced operators are less impacted by abbreviations than inexperienced operators. The bottom graphs suggest that the algorithm is most likely to predict correct performance when procedure steps have fewer than 20 words and when operators are more familiar (4 or 5 familiarity rating).
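A two-way partial dependence surface like those in Figure 4 can be hand-rolled by clamping a pair of features to grid values and averaging the model's predicted probability over the dataset. The sketch below uses synthetic data and scikit-learn, not the study's model or features.

```python
# Hand-rolled two-way partial dependence surface.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(int)
rf = RandomForestClassifier(n_estimators=100, random_state=4).fit(X, y)

def partial_dependence_2d(model, X, i, j, grid_i, grid_j):
    """Average predicted probability of class 1 over all rows, with
    features i and j clamped to each grid combination."""
    surface = np.empty((len(grid_i), len(grid_j)))
    for a, vi in enumerate(grid_i):
        for b, vj in enumerate(grid_j):
            X_mod = X.copy()
            X_mod[:, i] = vi
            X_mod[:, j] = vj
            surface[a, b] = model.predict_proba(X_mod)[:, 1].mean()
    return surface

grid = np.linspace(-2, 2, 5)
pd_surface = partial_dependence_2d(rf, X, 0, 1, grid, grid)
# The corner where both features are high should show the highest
# predicted probability of the positive class.
print(pd_surface[-1, -1] > pd_surface[0, 0])
```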
Figure 4 Two-dimensional partial dependence plots for Experience, Familiarity, Number of words, and Uppercase characters. The shading indicates the likelihood of the algorithm predicting correct step performance: white corresponds to a high likelihood of a correct step performance prediction and black corresponds to a high likelihood of predicting incorrect step performance.
Discussion

The goal of this study was to take steps towards machine learning-driven procedure design by investigating machine learning for predicting procedure performance from operator, readability, and natural language processing-based features. The findings provide evidence that machine learning can effectively integrate these features and accurately predict step performance. In addition, the results provide an initial quantitative description of the correlations between these factors.
Algorithm performance

The AUCs of the random forest, decision tree, and logistic regression with all of the features (0.78, 0.77, and 0.75, respectively) indicate acceptable algorithm prediction performance relative to
common benchmarks (Mandrekar, 2010). It is notable that these algorithms significantly outperformed the logistic regression algorithm with only operator features. This finding validates the need for features derived from the procedure steps as well as operator characteristics when predicting procedure step performance. The consistency in performance across the algorithms including all features suggests that the features included in the algorithm are more important than the machine learning approach for predicting procedure step performance. This finding is consistent with analyses in other domains (McDonald et al., 2019) and highlights the need for careful feature identification and selection in future analyses of procedures.

It is somewhat surprising that there were no significant differences between the logistic regression, decision tree, and random forest algorithms containing all features. While this may be an artifact of the limited dataset, the result is important because the random forest requires considerably more parameters and complexity. This additional complexity reduces the likelihood of overfitting, but it also makes the algorithm less interpretable. In contrast, decision tree and logistic regression algorithms are generally considered human readable and interpretable by non-experts, and thus they may be more directly useful to procedure designers in high-risk industries without training in machine learning. This idea is supported by prior work from Bevilacqua, Ciarapica, and Giacchetta (2008), who used decision trees to inform a refinery of operational safety issues. As suggested in that work, the decision tree approach may be a complementary method used in conjunction with current procedure design practices.
Feature selection and inference

The feature selection and inferential analyses highlight the importance of operator characteristics (e.g., experience, familiarity), readability, and characteristics of the procedure step text (e.g., total characters). The significance of these features alone is not surprising given that prior analyses of procedures have also found them to impact procedure performance (Novatsis & Skilling, 2016; Peres, Mannan, et al., 2016; Sasangohar et al., 2018; Sharit, 1998). However, the alignment of this analysis with prior work is important because the automated feature selection of the Boruta approach, and the iterative construction of the random forest and decision trees, relies on the data rather than domain constructs. The alignment provides at least some evidence that the algorithm performance here would be replicated with a broader sample.
Beyond the qualitative alignment, the inferential analysis here highlights novel correlations between human factors and language-based features. In particular, the partial dependence results illustrate that procedure steps over 20 words in length correlate with a decline in step performance. The results also suggest that inexperience correlates with incorrect step performance, which supports earlier assessments of procedures (Leplat, 1985; Sasangohar et al., 2018). The correlation between declined step performance and abbreviations provides additional clarity on word and abbreviation limitations and their effects on operators.
The results also show that step complexity metrics and sentiment may play a substantial role in procedure step performance. Although there has been some previous research investigating the relationships between complexity and performance (Campbell, 1991; Chan et al., 2015; Park & Jung, 2015), the findings regarding the relationship between a type of complexity, worker experience, and attributes of the procedure design are novel and need to be pursued further to be more clearly understood. Similarly, the significance of the sentiment findings must be explored further in future work. While it is notable that the findings suggest that steps with neutral sentiment are more likely to lead to correct performance than steps with positive or negative sentiment, more detailed analysis is needed to assess the role of sentiment in procedure performance. Sentiment dictionaries, such as the ones used in this analysis, are generally based on subjective ratings and general or popular texts (Hutto & Gilbert, 2014; Nielsen, 2011). As such, they should be used with caution in professional domains because the meaning and experience associated with a word may change significantly between general conversation and high-risk industry practice. For example, the word "ensure" in the Vader dictionary is mapped to a positive sentiment score of 1.6. In the procedures analyzed here, the word generally refers to ensuring a setting or the readiness of equipment, which one may expect to be neutral in sentiment.
Application

Although it is premature to directly extend the findings into specific procedure design guidelines, the correlations identified in this analysis warrant consideration in the procedure design process. The inferential findings suggest that procedure steps will be more likely to be performed correctly if they contain 10 to 20 total words, fewer than 25 total characters (15 unique characters), fewer than 5 uppercase characters, and neutral sentiment. Practitioners may consider these bounds as heuristics to guide design alternatives as part of a larger design and evaluation process. The findings also suggest that familiarity with a procedure substantially increases the likelihood of correct step performance, further emphasizing the need for specific training and deliberate practice (Boot & Ericsson, 2011). When considered alongside other recent findings from Sasangohar et al. (2018) and Peres, Smith, and Sasangohar (in press), the importance of operator experience suggests that procedure designers should consider alternate procedure designs for experienced and inexperienced operators.
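If practitioners do treat these bounds as heuristics, they can be encoded as a simple screening function. The sketch below covers only the word-count, uppercase-character, and sentiment bounds reported above (the character-count bound is omitted for brevity), and the thresholds remain hypotheses to validate against new data, not a standard.

```python
# Illustrative screener for the step-design heuristics suggested above.
def step_design_flags(step_text, sentiment=0):
    """Return per-criterion booleans; sentiment is supplied externally
    (e.g., from a dictionary-based scorer), with 0 meaning neutral."""
    words = step_text.split()
    uppercase = sum(c.isupper() for c in step_text)
    return {
        "word_count_ok": 10 <= len(words) <= 20,   # 10-20 total words
        "uppercase_ok": uppercase < 5,             # proxy for abbreviations
        "sentiment_neutral": sentiment == 0,
    }

flags = step_design_flags(
    "Notify the control room operator that the equipment "
    "is ready to be removed from service.")
print(flags)
```

A step that fails a flag is not necessarily poorly designed; the output is a prompt for review within a larger design and evaluation process.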
Limitations and future work

There are several limitations to the present analysis. Most importantly, the size and scope of the current dataset are limited. The focus on a set of 4 procedures in the oil and gas industry limits the extension of the results to a broader set of procedures. Before the results here are generalized to other procedure designs, they must be validated with procedures not used in the algorithm training process. Specifically, these procedures should include systematic manipulations aligned with the correlations identified in this study; for example, the same procedure steps should be offered with varying numbers of acronyms. Additionally, while the number of participants was reasonable, the training process could be refined further with data from additional operators and more granular measures of complexity and experience.

Another limitation of the current work is the environment. The simulation facility used in this work is high fidelity, but it does not fully reflect the conditions on a real offshore facility. This may result in differences in procedure performance, particularly when additional operator factors such as fatigue and stress occur. Finally, the low initial reliability of the step-performance coding is concerning and may have impacted the results. To address these concerns, future work should explore the implementation of the data collection and design procedure here on a large sample of workers in real environments with a more clearly defined step-performance evaluation. Future work should additionally explore the utility of these findings in the presence of operator stress and physical and mental fatigue, as well as additional NLP features (e.g., word embeddings), to provide additional linguistic insights.
Conclusion

The machine learning approaches analyzed in this study suggest that random forest, decision tree, and logistic regression algorithms can be used to predict procedure step performance from operator, procedure, readability, and natural language-based features. The inferential analysis suggests that short procedure steps with few characters and abbreviations are correlated with improved procedure step performance. While these results are a promising step towards machine learning-driven procedure design, they must be validated with data containing additional procedures before they are broadly extended to the field.
Key points

- Machine learning can be an effective approach for analyzing procedure step performance based on the characteristics of the operator and procedure steps.
- Partial dependence analysis can be used to understand correlations between features and step performance.
- Procedure steps with minimal words, abbreviations, and characters and higher levels of experience and familiarity are correlated with correct procedure step performance for the procedures investigated.
REFERENCES
485
Ahmed, L., Quddus, N., Kannan, P., Peres, S. C., & Mannan, M. S. (2020). Development of a procedure
486
writers’ guide framework: Integrating the procedure life cycle and reflecting on current industry practices.
487
International Journal of Industrial Ergonomics, 76(February).
488
https://doi.org/10.1016/j.ergon.2020.102930
489
Anderson, J. (1983). Lix and Rix: variations on a little-known readability index. Journal of Reading, 26(6),
490
490496. Retrieved from http://www.jstor.org/stable/40031755
491
Anderson, J. R. (1982). Acquisition of cognitive skill. Psychological Review, 89(4), 369406.
492
https://doi.org/10.1037/0033-295X.89.4.369
493
Bamberger, R., & Vanecek, E. (1984). Lesen-Verstehen-Lernen-Schreiben [Read-Understand-Learn-Write].
494
Diesterweg.
495
Baron, R. (2009). Failure to follow procedures: Deviations are a significant factor in maintenance errors.
496
Retrieved from the Federal Aviation Administration website:
497
https://www.faa.gov/about/initiatives/maintenance_hf/library/documents/media/roi/failure_to_follo
498
w_procedures_deviations_are_a_significant_factor_in_maintenance_errors.pdf
499
Bates, S., & Holroyd, J. (2012). Human factors that lead to non-compliance with standard operating
500
procedures (Research Report RR 919). Health and Safety Executive Laboratory.
501
Baumont, G., Ménage, F., Schneiter, J. R., Spurgin, A., & Vogel, A. (2000). Quantifying human and
502
organizational factors in accident management using decision trees: the HORAAM method. Reliability
503
Engineering & System Safety, 70(2), 113124. https://doi.org/10.1016/S0951-8320(00)00051-X
504
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). quanteda: An R
505
Towards machine learning-driven procedure design, Page 29
package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774.
506
https://doi.org/10.21105/joss.00774
507
Bevilacqua, M., Ciarapica, F. E., & Giacchetta, G. (2008). Industrial and occupational ergonomics in the
508
petrochemical process industry: A regression trees approach. Accident Analysis and Prevention, 40(4),
509
14681479. https://doi.org/10.1016/j.aap.2008.03.012
510
Björnsson, C. H. (1968). Läsbarhet [Readability]. Liber (6th ed.). Stockholm: Liber.
511
Boot, W., & Ericsson, K. A. (2011). Expertise. In J. D. Lee & A. Kirlik (Eds.), Oxford handbook of cognitive
512
engineering (pp. 143158). New York: Oxford University press.
513
Breiman, L. (2001). Random forests. Machine learning, 45(1), 532.
514
https://doi.org/10.1023/A:1010933404324
515
Bullemer, P. T., & Hajdukiewicz, J. R. (2004). A Study of effective procedural practices in refining and
516
chemical operations. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 48(20),
517
24012405. https://doi.org/10.1177/154193120404802006
518
Bullemer, P. T., & Laberge, J. C. (2010). Common operations failure modes in the process industries. Journal
519
of Loss Prevention in the Process Industries, 23(6), 928935. https://doi.org/10.1016/j.jlp.2010.05.008
520
Burbidge, R., Trotter, M., Holden, S., & Buxton, B. (2001). Drug design by machine learning: Support vector
521
machines for pharmaceutical data analysis. Computers and Chemistry, 26, 514.
522
Campbell, D. J. (1991). Goal levels, complex tasks, and strategy development: A review and analysis. Human
523
Performance, 4(1), 131. https://doi.org/10.1207/s15327043hup0401_1
524
Carim, G. C., Saurin, T. A., Havinga, J., Rae, A., Dekker, S. W. A., & Henriqson, É. (2016). Using a procedure
525
doesn’t mean following it: A cognitive systems approach to how a cockpit manages emergencies. Safety
526
Science, 89, 147–157. https://doi.org/10.1016/j.ssci.2016.06.008
527
Carnahan, B., Meyer, G., & Kuntz, L.-A. (2003). Comparing statistical and machine learning classifiers:
528
Alternatives for predictive modeling in Human Factors research. Human Factors, 45(3), 408423.
529
https://doi.org/10.1518/hfes.45.3.408.27248
530
Caylor, J. S., & Sticht, T. G. (1973). Development of a simple readability index for job reading material.
531
Department of the Army. Retrieved from ERIC database. (ED076707)
532
Chan, S. H., Song, Q., & Yao, L. J. (2015). The moderating roles of subjective (perceived) and objective task
533
complexity in system use and performance. Computers in Human Behavior, 51(Part A), 393402.
534
https://doi.org/10.1016/j.chb.2015.04.059
535
Chen, Y.-W., & Lin, C.-J. (2006). Combining SVMs with various feature selection strategies. In I. Guyon, M.
536
Nikravesh, S. Gunn, & L. A. Zadeh (Eds.), Feature Extraction: Foundations and Applications (pp. 315
537
324). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-35488-8_13
538
Clarke, D. D., Forsyth, R., & Wright, R. (1998). Machine learning in road accident research: Decision trees
539
describing road accidents during cross-flow turns. Ergonomics, 41(7), 10601079.
540
https://doi.org/10.1080/001401398186603
541
Coleman, E. B. (1971). Developing a technology of written instruction: Some determiners of the complexity
542
of prose. In E. Z. Rothkopf & P. E. Johnson (Eds.), Verbal learning research and the technology of
543
written instruction (pp. 155204). New York: Teachers College Press.
544
Coleman, M., & Liau, T. L. (1975). A computer readability formula designed for machine scoring. Journal of
545
Towards machine learning-driven procedure design, Page 30
Applied Psychology, 60(2), 283284. https://doi.org/10.1037/h0076540
546
Dekker, S. (2003). Failure to adapt or adaptations that fail: Contrasting models on procedures and safety.
547
Applied Ergonomics, 34(3), 233238. https://doi.org/10.1016/S0003-6870(03)00031-0
548
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more
549
correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44, 837845.
550
Deufel, C. L., McLemore, L. B., de los Santos, L. E. F., Classic, K. L., Park, S. S., & Furutani, K. M. (2017).
551
Patient safety is improved with an incident learning systemclinical evidence in brachytherapy.
552
Radiotherapy and Oncology, 125(1), 94100.
553
https://doi.org/https://doi.org/10.1016/j.radonc.2017.07.032
554
DuBay, W. (2008). The principles of readability. Costa Mesa: Impact Information, (949), 77.
555
https://doi.org/10.1.1.91.4042
556
Farr, J. N., Jenkins, J. J., & Paterson, D. G. (1951). Simplification of Flesh reading ease formula. Journal of
557
Applied Psychology., 35, 333337.
558
Friedman, J. H. (2001). Greedy function approximation: the gradient boosting machine. The Annals of
559
Statistics, 29(5), 11891232.
560
Fucks, W. (1955). Unterschied des Prosastils von Dichtern und Schriftstellern [Difference in the prose style of
561
poets and writers]. Ein Beispiel mathematischer [An example of math] Stilanalyse. Sprachforum 1, 234
562
241.
563
Gill, T. G., & Hicks, R. C. (2006). Task complexity and informing science: A synthesis. Informing Science, 9,
564
1–30. https://doi.org/10.28945/469
565
Gunning, R. (1952). The technique of clear writing. McGraw Hill
566
Hale, A., & Borys, D. (2013a). Working to rule, or working safely? Part 1: A state of the art review. Safety
567
Science, 55, 207221. https://doi.org/10.1016/j.ssci.2012.05.011
568
Hale, A., & Borys, D. (2013b). Working to rule or working safely? Part 2: The management of safety rules
569
and procedures . Safety Science, 55, 222-231. https://doi.org/10.1016/j.ssci.2012.05.013.
570
HFRG. (1995). Improving compliance with safety procedures, reducing industrial violations. HSE Books
571
Sudbury, Suffolk, UK.
572
Hickman, S. H., Hsieh, P. A., Mooney, W. D., Enomoto, C. B., Nelson, P. H., Mayer, L. A., … McNutt, M.
573
K. (2012). Scientific basis for safely shutting in the Macondo Well after the April 20, 2010 Deepwater
574
Horizon blowout. Proceedings of the National Academy of Sciences, 109(50), 2026820273. Retrieved
575
from http://www.pnas.org/content/109/50/20268.abstract
576
Hobbs, A., & Kanki, B. G. (2008). Patterns of error in confidential maintenance incident reports. The
577
International Journal of Aviation Psychology, 18(1), 516.
578
https://doi.org/10.1080/10508410701749365
579
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference
580
framework. Journal of Computational and Graphical Statistics, 15(3), 651674.
581
https://doi.org/10.1198/106186006X133933
582
Hutto, C. J., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social
583
media text. In Eighth International AAAI Conference on Weblogs and Social Media. Ann Arbor, MI,
584
United states. https://doi.org/10.1210/en.2011-1066
585
Towards machine learning-driven procedure design, Page 31
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. New York: Springer.
Jockers, M. L. (2015). Syuzhet: Extract sentiment and plot arcs from text. Retrieved from https://github.com/mjockers/syuzhet
Kannan, P., Quddus, N., Peres, S. C., & Mannan, M. S. (2018). Can we simplify complexity measurement? A primer toward a usable framework for industry implementation. In Proceedings of the Human Factors and Ergonomics Society 2018 Annual Meeting. Philadelphia.
Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel (Research Branch Report 8-75). https://doi.org/10.21236/ADA006655
Klare, G. R. (1974). Assessing readability. Reading Research Quarterly, 10(1), 62–102.
Kuhn, M. (2020). caret: Classification and regression training. Retrieved from https://cran.r-project.org/package=caret
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York, NY: Springer.
Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1–13.
Leplat, J. (1985). Erreur humaine, fiabilité humaine dans le travail [Human error, human reliability in work]. Paris: Armand Colin.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22.
Liu, B. (2012). Sentiment analysis and opinion mining. In G. Hirst (Ed.), Synthesis Lectures on Human Language Technologies (pp. 1–167). New York, NY: Morgan & Claypool. https://doi.org/10.2200/S00416ED1V01Y201204HLT016
Mandrekar, J. N. (2010). Receiver operating characteristic curve in diagnostic test assessment. Journal of Thoracic Oncology, 5(9), 1315–1316. https://doi.org/10.1097/JTO.0b013e3181ec173d
McDonald, A. D., Ferris, T. K., & Wiener, T. A. (2019). Classification of driver distraction: A comprehensive analysis of feature generation, machine learning, and input measures. Human Factors. https://doi.org/10.1177/0018720819856454
McDonald, A. D., Lee, J. D., Schwarz, C., & Brown, T. L. (2014). Steering in a random forest: Ensemble learning for detecting drowsiness-related lane departures. Human Factors, 56(5), 986–998. https://doi.org/10.1177/0018720813515272
McLaughlin, G. H. (1969). SMOG grading: A new readability formula. Journal of Reading, 12(8), 639–646.
Menze, B. H., Kelm, B. M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., & Hamprecht, F. A. (2009). A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics, 10, 213. https://doi.org/10.1186/1471-2105-10-213
Neville, T. J., Peres, S. C., Ade, N., Son, C., Bagaria, P., Quddus, N., & Mannan, M. S. (2018). Assessing procedure adherence under training conditions in high risk industrial operations. In Proceedings of the Human Factors and Ergonomics Society 2018 Annual Meeting. Philadelphia. https://doi.org/10.1177/1541931218621362
Neville, T. J., Peres, S. C., Quddus, N., Hendricks, J., Shortz, A., Ade, N., … Mannan, M. S. (2018). Behavior assessment technique for procedural industrial tasks: Using mixed reality to develop a method to understand work-as-done under normal operating conditions. In Proceedings of the Human Factors and Ergonomics Society 2018 Annual Meeting. Philadelphia.
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In CEUR Workshop Proceedings (Vol. 718, pp. 93–98). Retrieved from http://arxiv.org/abs/1103.2903
Noroozi, A., Khan, F., Mackinnon, S., Amyotte, P., & Deacon, T. (2014). Determination of human error probabilities in maintenance procedures of a pump. Process Safety and Environmental Protection, 92(2), 131–141. https://doi.org/10.1016/j.psep.2012.11.003
Novatsis, E., & Skilling, E. J. (2016). Human factors in the design of procedures. In J. Edmonds (Ed.), Human factors in the chemical and process industries (pp. 291–307). Elsevier. https://doi.org/10.1016/B978-0-12-803806-2.00017-0
Park, J., & Jung, W. (2015). Identifying objective criterion to determine a complicated task: A comparative study. Annals of Nuclear Energy, 85, 205–212. https://doi.org/10.1016/j.anucene.2015.05.012
Peres, S. C., Mannan, M. S., & Quddus, N. (2016). Effective procedure design and use: What do operators need, when do they need it, and how should it be provided? In Proceedings of the Annual Offshore Technology Conference.
Peres, S. C., Quddus, N., Kannan, P., Ahmed, L., Ritchey, P., Johnson, W., … Mannan, M. S. (2016). A summary and synthesis of procedural regulations and standards: Informing a procedures writer's guide. Journal of Loss Prevention in the Process Industries, 44, 726–734. https://doi.org/10.1016/j.jlp.2016.08.003
Peres, S. C., Smith, A., & Sasangohar, F. (in press). Worker-centered investigation of issues with procedural systems: Implications for the revision process and safety culture. Journal of Loss Prevention in the Process Industries.
Poisson, P., & Chinniah, Y. (2015). Observation and analysis of 57 lockout procedures applied to machinery in 8 sawmills. Safety Science, 72, 160–171. https://doi.org/10.1016/j.ssci.2014.09.005
Powers, R., Sumner, W., & Kearl, B. E. (1958). A recalculation of four adult readability formulas. Journal of Educational Psychology, 49(2), 99.
R Core Team. (2014). R: A language and environment for statistical computing. Retrieved from http://www.r-project.org/
Rasmussen, J. (1983). Skills, rules, and knowledge; signals, signs, and symbols, and other distinctions in human performance models. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(3), 257–266. https://doi.org/10.1109/TSMC.1983.6313160
Ritter, F. E., Baxter, G. D., Kim, J. W., & Srinivasmurthy, S. (2013). Learning and retention. In J. D. Lee & A. Kirlik (Eds.), The Oxford handbook of cognitive engineering (pp. 125–142). New York, NY: Oxford University Press.
Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221. https://doi.org/10.1037/h0057532
Sanchez-Lengeling, B., & Aspuru-Guzik, A. (2018). Inverse molecular design using machine learning: Generative models for matter engineering. Science, 361(6400), 360–365. https://doi.org/10.1126/science.aat2663
Sasangohar, F., Peres, S. C., Williams, J. P., Smith, A., & Mannan, M. S. (2018). Investigating written procedures in process safety: Qualitative data analysis of interviews from high risk facilities. Process Safety and Environmental Protection, 113, 30–39. https://doi.org/10.1016/j.psep.2017.09.010
668
Sharit, J. (1998). Applying human and system reliability analysis to the design and analysis of written procedures
669
in high-risk industries. Human Factors and Ergonomics in Manufacturing, 8(3), 265281.
670
Siegel, A. W., & Schraagen, J. M. C. (2017). Beyond procedures: Team reflection in a rail control centre to
671
enhance resilience. Safety Science, 91, 181191. https://doi.org/10.1016/j.ssci.2016.08.013
672
Smith, E. A., & Senter, R. J. (1967). Automated readability index. AMRL-TR66-22. Wright-Patterson AFB,
673
OH: Aerospace Medical Division. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/5302480
674
Solomon, N. W. (2006). A qualitative analysis of media language. LAP Lambert Academic Publishing
675
Song, F., Guo, Z., & Mei, D. (2010). Feature selection using principal component analysis. Proceedings - 2010
676
International Conference on System Science, Engineering Design and Manufacturing Informatization,
677
ICSEM 2010, 1, 2730. https://doi.org/10.1109/ICSEM.2010.14
678
Spache, G. (2005). A new readability formula for primary-grade reading materials. The Elementary School
679
Journal, 53(7), 410413. https://doi.org/10.1086/458513
680
Suchman, L. A. (1983). Office procedure as practical action: Models of work and system design. ACM
681
Transactions on Information Systems (TOIS), 1(4), 320328. https://doi.org/10.1145/357442.357445
682
Tränkle, U. & Bailer, H. (1984). Kreuzvalidierung und Neuberechnung von Lesbarkeitsformeln für die deutsche
683
Sprache [Cross-validation and recalculation of readability formulas for the German language]. Journal of
684
Developmental Psychology and Educational Psychology, 16(3), 231244.
685
UK Health and Safety Executive. (2015). HSE Human Factors briefing note no 4: procedures. Retrieved May
686
9, 2018, from http://www.hse.gov.uk/humanfactors/topics/04procedures.pdf
687
US Chemical Safety and Hazard Investigation Board. (2007). Investigation report, refinery explosion and fire,
688
BP Texas city. Retrieved from https://www.csb.gov/file.aspx?DocumentId=5596
689
Vicente, K. J., & Burns, C. M. (1995). A field study of operator cognitive monitoring at Pickering nuclear
690
generating station.
691
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random
692
forests. Journal of the American Statistical Association, 113(523), 12281242.
693
https://doi.org/10.1080/01621459.2017.1319839
694
Wright, P., & McCarthy, J. (2003). Analysis of procedure following as concerned work. In Handbook of
695
cognitive task design (pp. 679700). London: Lawrence Erlbaum Associates London.
696
Yamauchi, T. (2013). Mouse trajectories and state anxiety: Feature selection with random forest. Proceedings
697
- 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, ACII 2013,
698
399404. https://doi.org/10.1109/ACII.2013.72
699
Zhao, Q., & Hastie, T. J. (2019). Causal interpretations of black-box models. Journal of Business & Economic
700
Statistics, 110. https://doi.org/10.1080/07350015.2019.1624293
701
702
703
Anthony D. McDonald is an assistant professor in the Wm Michael Barnes '64 Department of Industrial and Systems Engineering at Texas A&M University and the director of the Human Factors and Machine Learning Laboratory. He received his PhD in industrial engineering from the University of Wisconsin-Madison in 2014.

Nilesh Ade is a PhD candidate in the Mary Kay O'Connor Process Safety Center, Department of Chemical Engineering, Texas A&M University. He obtained his BS in chemical engineering from the Institute of Chemical Technology, Mumbai, in 2015.

S. Camille Peres is an associate professor at Texas A&M University in the Department of Environmental and Occupational Health. She obtained her PhD in psychology from Rice University.