Predicting procedure step performance from operator and text features: A critical first step
towards machine learning-driven procedure design
Author Names and Affiliations:
Anthony D. McDonald1, firstname.lastname@example.org
Nilesh Ade2, email@example.com
S. Camille Peres2,3 firstname.lastname@example.org
1 Wm Michael Barnes '64 Department of Industrial and Systems Engineering, Texas A&M
University, College Station, TX USA
2 Mary Kay O’Connor Process Safety Center, Department of Chemical Engineering, Texas A&M
University, College Station, TX USA
3 Department of Environmental and Occupational Health, Texas A&M University, College Station, TX USA
This research was co-funded by the Next Generation Advanced Procedures
Initiative (http://advancedprocedures.tamu.edu/). We would further like to acknowledge the
assistance of Shell USA for access to the BOOST training facility in Robert, LA. We would also
like to acknowledge Noor Quddus, Timothy Neville, Pranav Bagaria, Changwon Son, M. Sam
Mannan, Sarah Thomas, Lena Clark, and Whitney Mantooth for their valuable contributions to this work.
Anthony D. McDonald, Texas A&M University,
Department of Industrial and Systems Engineering,
College Station, TX 77843, USA;
The goal of this study is to assess machine learning for predicting procedure performance
from operator and procedure characteristics.
Procedures are vital for the performance and safety of high-risk industries. Current
procedure design guidelines are insufficient because they rely on subjective assessments and
qualitative analyses that struggle to integrate and quantify the diversity of factors that influence procedure performance.
We used data from a 25-participant study with 4 procedures, conducted on a high-fidelity
oil extraction simulation to develop logistic regression, random forest, and decision tree algorithms
that predict procedure step performance from operator, step, readability, and natural language
processing-based features. Features were filtered using the Boruta approach. The algorithms were
trained and optimized with a repeated 10-fold cross-validation. Following the algorithm assessment,
inference was performed using variable importance and partial dependence plots.
The random forest, decision tree, and logistic regression algorithms with all features had an
AUC of 0.78, 0.77, and 0.75, respectively, and significantly outperformed logistic regression with
only operator features—AUC of 0.61. The most important features were experience, familiarity, total
words, and character-based metrics. The partial dependence plots showed that steps with fewer
words, abbreviations, and characters were correlated with correct step performance.
Machine learning algorithms are a promising approach for predicting step-level procedure
performance and may help guide procedure design after validation with additional data on further procedures.
After validation, the inferences from these models can be used to generate procedure design guidance.
We develop a machine learning approach for predicting procedure performance from operator,
readability, and natural language processing-based features.
Keywords: machine learning, procedure design, operator performance, random forest, decision tree
Inadequate procedures have been cited as a primary or secondary cause of several major incidents
including the Macondo blowout (Hickman et al., 2012), the Texas City refinery explosion (US
Chemical Safety and Hazard Investigation Board, 2007), and the 2007 American Airlines Flight 1400
engine fire (Baron, 2009). Beyond these incidents, the National Aeronautics and Space
Administration (NASA) found that 44% to 73% of aviation maintenance errors were associated with
procedures (Hobbs & Kanki, 2008). A recent study of the healthcare industry observed that around
16% of undesired incidents resulted from the poor quality of procedures (Deufel et al., 2017).
The role of procedures in these failures is complex and can be attributed to multiple factors
including the quality of the procedure (e.g., accuracy, clarity, ease of use); workers’ adherence to
the procedure; and the availability of the procedure (Bullemer & Laberge, 2010; Poisson & Chinniah,
2015; Sasangohar et al., 2018; Siegel & Schraagen, 2017; Vicente & Burns, 1995; Wright &
McCarthy, 2003). The influence of these factors on system performance is also complex. For
instance, prior research has found that some deviations from procedures can improve large system
operations and that rote procedure adherence has, in some cases, contributed to failures (Suchman,
1983; Vicente & Burns, 1995). These seemingly conflicting findings speak to the need to design
procedures that consider all aspects of the complex sociotechnical systems in which they operate
(Hale & Borys, 2013b, 2013a). Our long-term goal is to address this challenge by introducing a novel
method of machine learning-driven procedure design. This work explores a critical first step in that
process: predicting procedure step performance from procedure text and operator characteristics.
Procedure design has typically focused on the design of the procedure document itself rather than
other relevant attributes of the entire procedural system (e.g., operator experience, task
complexity) that may impact task performance and overall system safety (Ahmed et al., 2020; Peres,
Quddus, et al., 2016). A recent review from Ahmed et al. (2020) found that few procedure writing
guides based their guidance on empirical evidence or documented best practices. However, more
recent recommendations have advocated for the integration of Human Factors design approaches
(UK Health and Safety Executive, 2015). Considerations of human and system factors will likely
result in significant improvements in procedure quality, adherence, and overall task performance, as
research suggests that successful procedure performance depends on attributes of the procedure, the
task, and characteristics of the operator (Novatsis & Skilling, 2016; Peres et al., 2016; Sasangohar,
Peres, Williams, Smith, & Mannan, 2018). Indeed, studies investigating performance and satisfaction
with procedures and procedural systems regularly find that attributes of the worker (e.g., experience)
and attributes of the task (e.g., frequency) are reliably related to objective and subjective outcomes
(Bates & Holroyd, 2012; Bullemer & Hajdukiewicz, 2004; Carim et al., 2016; Dekker, 2003; HFRG,
1995; Noroozi et al., 2014). Thus, to identify those attributes of a procedural system that reliably
impact worker performance and, more importantly, system safety, attributes of the entire system must be considered.
Understanding how these diverse features are differentially associated with workers' performance
with procedures represents a significant challenge. One methodology that may address this challenge
is machine learning. While machine learning is not a traditional Human Factors method, its use has
been growing, and it has been shown to be superior to traditional statistical analyses (i.e., logistic
regression) in some cases (Carnahan et al., 2003). Machine learning has also been successful in aiding
the design process in other domains such as drug design (Burbidge et al., 2001; Sanchez-Lengeling
& Aspuru-Guzik, 2018). The most common type of machine learning employed for design is
supervised machine learning, in which classes of the data are known and the goal is to build algorithms
to make predictions about the classes (James et al., 2013). While prediction is traditionally the focus
of supervised machine learning approaches, there has been recent interest in using fitted supervised
machine learning algorithms for causal inference (Wager & Athey, 2018). Much of this work has
focused on tree-based approaches such as decision trees and random forests.
Decision trees learn to classify the data through iteratively segmenting predictor space into
regions corresponding to a class (James et al., 2013). They have been successfully implemented in
the past in Human Factors research to identify common patterns across specific automotive crashes
(Clarke et al., 1998) and quantify human and organizational aspects of accident management
(Baumont et al., 2000). Random forests are an extension of decision trees in which many individual
trees are fit to different subsets of the data and predictors, and classifications are made by a plurality
vote among the different trees (Breiman, 2001). In the Human Factors domain, random forests have
been applied to the detection of driver drowsiness (McDonald et al., 2014), driver distraction
(McDonald et al., 2019), and the prediction of anxiety in computer users (Yamauchi, 2013). Random
forests are advantageous relative to decision trees because they are less sensitive to overfitting the
training data and typically have higher accuracy on predicting new unseen data (James et al., 2013).
In addition, random forests can be used to calculate variable importance, a measure of the prediction
accuracy lost by eliminating a predictor from the algorithm, and partial dependence, a measure
analogous to Beta coefficients in linear regression, which can be used for inference (Friedman, 2001;
Zhao & Hastie, 2019). Decision trees, in contrast, provide a clearer logical structure that is more
interpretable for procedure designers. To benefit from the strengths of both methods, recent analyses
suggest the use of random forests to select predictors that are subsequently used to fit a simpler
algorithm (Chen & Lin, 2006; Kursa & Rudnicki, 2010; Menze et al., 2009).
The extent to which the predictive and inference capabilities of decision trees and random forests
may be used to understand procedure step performance remains an open question. The goal of this
study is to attempt to answer this question in two parts. First, we fit a series of algorithms to predict
procedure step performance from predictors that describe the task, the text of the procedure’s steps,
and operators. We use these variables because they represent different attributes of the procedural
system and will provide a proof of concept with regard to whether these algorithms can effectively
identify the importance of and interactions between these variables. Second, we use the fitted
algorithm to identify the most important predictors and their quantitative impact on procedure step performance.
Method
When workers perform the steps in a procedure, they are doing so to support their performance of
a particular task that itself is part of a larger system or process. Performance may be assessed across
each of these levels (step level, task level, and system level). Performance across these levels is often
correlated, but it is not always consistent. For example, if workers follow an incorrect procedure
exactly—high step level performance—it may lead to an unsafe situation—low task and system level
performance (Dekker, 2003). Additionally, workers may complete a task successfully without
following the procedure exactly (low step level performance; Dekker, 2003; Sasangohar et al., 2018).
In the current analysis, we focus on step level performance. While this focus is somewhat limited, it
is justified by our goals of providing step-level procedure design guidance and understanding
interactions between the task (e.g., frequency), the worker (e.g., experience), and procedure steps
(e.g., word length). The remainder of this section describes the dataset used for this analysis,
discusses the measurement of step level performance and features calculated, and introduces the
machine learning process.
The data used in this study were collected as part of an experiment conducted at Shell's training
center in Robert, Louisiana, USA, in the Basic Offshore Operations Simulator Training (BOOST)
facility. The BOOST facility is a high-fidelity training environment for offshore oil production
platforms, where participants perform tasks on real equipment to process mineral oil (rather than
the petroleum processed on actual offshore facilities). The study was approved by the Texas A&M Institutional Review
Board and complied with the American Psychological Association’s code of ethics. The experiment
involved the observation of the step-level performance of 25 operators performing 4 procedures. The
procedures included: Fluid sampling using a centrifuge, column flushing, Level Control Valve (LCV)
replacement, and pressure testing. Fluid sampling consists of separating water from mineral oil.
Column flushing comprises draining gas and liquid from a column attached to a secondary vessel.
LCV replacement is a maintenance task that replaces a level control valve. Pressure testing is a task
that tests high- and low-pressure warning indicators of a digital sensor. These procedures were
selected as they are integral to the safe operation of an offshore oil rig. The procedures varied in
their frequency of performance, step content, and number of steps. Frequency was classified as
frequent (approximately weekly) or infrequent (approximately yearly) based on the expertise of the
training team at the facility. The number of procedure steps varied from 8 to 23. Table 1 summarizes
these metrics for the four procedures.
Table 1 Descriptive information about the procedures. Example steps:
• Fluid sampling using a centrifuge: "Place tubes onto opposite sides of the centrifuge to maintain balance"
• Column flushing: "Open manual column valves M101 - 11. Upper isolation valve on level column"
• Level Control Valve (LCV) replacement: "Open drain valve downstream of check valve"
• Pressure testing: "Make sure test connection is depressured"
There were three attributes of the participants included in the machine learning algorithms—their
experience, their familiarity with the procedures, and their tendency to acknowledge a step by signing
off. An employment agency recruited the participants, with the goal of having an equal number of
Experienced and Inexperienced workers. Experience level was based on the judgment of subject
matter experts and the trainers at the facility. Participants with less than 6 years of experience were
considered inexperienced, whereas those with more than 6 years of experience were considered
experienced. Participants rated their familiarity with the actual procedures following their completion
of all the procedures using a 5-point Likert scale, with 1 denoting complete unfamiliarity with the
task and 5 denoting complete familiarity. The study authors manually annotated participants' sign-off
behavior (checking off each step as it was completed) from video recordings.
Step performance was assigned a binary label of correct or incorrect in accordance with the
definitions established in Neville et al. (2018). The correctness assessment was conducted through
video coding. There were four total reviewers, two coded the Column Flushing and Sampling tasks
and two coded the Level Control Valve and Pressure Testing tasks. The interrater reliability was
measured by Cohen’s Kappa. The Kappa between the first two coders was 0.57 (81% agreement)
and the Kappa between the second pair of coders was 0.38 (68% agreement). Where the two
reviewers initially did not agree on a code, they used a consensus method to decide on the most appropriate code.
Steps were considered to be performed incorrectly if operators took an extended period of time
to complete a step (approximately double the nominal duration), needed assistance from the training
instructor to perform the step, performed the step out of order from the sequence described in the
procedure, or performed the step incorrectly or never completed the step. Steps were considered to
be correctly performed if the operator performed the step completely in the right manner, in
prescribed sequence, without struggling and without assistance from the instructor (Neville, Peres,
Ade, et al., 2018). The distribution of step performance of the operators is shown in Table 2. While
the overall step accuracy rate (66%) may seem low, it is important to acknowledge that almost every
participant eventually completed the tasks correctly. The successful task completion and the
distribution of types of failures suggest that the observed failures are representative of how workers
interact with actual procedures rather than reflecting simple unfamiliarity with the procedures.
Table 2 Distribution of step performance by error type (e.g., steps done out of order, steps done with assistance, steps skipped or performed incorrectly) for each procedure. Each cell shows the number of steps and the percentage of steps in that category.
Feature identification and reduction
The complete dataset consisted of 1,009 step performances, 668 correct instances and 341 incorrect
instances. To reduce the likelihood of algorithm overfitting and Type 1 errors, a limited set
of 30 features was considered. The features were associated with one of four categories: operator
characteristics, procedure characteristics, step readability, and standard natural language processing-
based features. The goal of these categories was to capture theoretically motivated domain specific
factors (e.g., learning and retention) alongside domain-agnostic text processing features, and
provide a comparison between the two types. These features and the rationale for their inclusion are
described in detail in the following sections.
Operator characteristic features
Prior research of procedures in high-risk industries suggests that three of the most significant
operator characteristics related to procedure adherence are 1) field experience, 2) familiarity with the
procedure, and 3) the operator’s tendency to acknowledge the completion of a step in writing
(Novatsis & Skilling, 2016; Sasangohar et al., 2018). Recent qualitative analyses suggest that many
experienced operators in high-risk industries find procedures extraneous and that they may prefer to
rely on their extensive training, prior knowledge, and efficient teamwork with other operators rather
than adhere to procedures (Sasangohar et al., 2018). In contrast, inexperienced operators may follow
procedures more closely because procedures directly support the proceduralization phase (i.e., rule-
based behavior; Rasmussen, 1983) or the associative stage (Anderson, 1982) of learning (Ritter et
al., 2013). Familiarity may impact procedure step performance through a similar pathway—as the need
to rely on declarative knowledge fades and operators increasingly rely on procedural learning (Ritter
et al., 2013). Step acknowledgement has been directly linked to correct procedure step performance
and is recommended industry practice (Novatsis & Skilling, 2016). These three characteristics were
included as individual features in the machine learning analysis. Experience and step acknowledgement
were represented as binary variables, and familiarity was included on a five-point scale. Experience
was represented as a binary feature rather than a continuous feature due to sparsity in the distribution
of the years of experience of the operators.
Procedure characteristic features
Procedure characteristics were included in three ways: the frequency of performance of the procedure
(Frequent/Infrequent), the total number of steps of the procedure, and the procedure step
complexity. Procedure step complexity is a content-based approximation of the cognitive load
imposed by a step. This complexity has been shown to impact system performance by influencing
decision making, information processing, intrinsic motivation, and satisfaction of the task performer
(Campbell, 1991; Gill & Hicks, 2006). The complexity measure employed in this study was developed
by Kannan, Quddus, Peres, and Mannan (2018). The measure calculates 5 binary dimensions of
complexity: decision, judgment, interdependency, step-size, and step information. Decision
complexity reflects that a step requires an operator to observe and respond to a cue. Judgment
complexity requires an operator to evaluate a quantity. Interdependency comprises the dependency
of one step on another. Step-size indicates the presence (or absence) of multiple instructions in a
step. Step information indicates that additional information in the form of notes or cautions is
provided to an operator in a step. The complexity of each dimension was calculated based on the
presence of identifiers (keywords) in a step. For example, steps including the words “if” and “then”
indicate the presence of decision complexity. The complexity calculations are independent, and more
than one type of complexity may be present in any given step. The types of step level complexity
and example identifiers are shown in Table 3. Each type of complexity was included as a binary feature
in the dataset due to a sparsity of instances where multiple examples of a given type of complexity
were present. The number of steps in the procedure was included as a continuous variable and the
frequency of performance was included as a binary variable.
Table 3 Types, context, and example identifiers of task level complexity.
• Decision: presence of a decision in a step resulting from a multiplicity of possible responses to a cue (example identifiers: "if", "then")
• Judgment: the requirement of an operator's judgment in a step due to the presence of uncertainty in information in the step
• Interdependency: dependence of a step on another step(s) through order or cascade (example identifiers: "Go to", "Proceed to")
• Step-size: the multiplicity of instructions in a step
• Step information: the multiplicity of information in a step (e.g., notes or cautions)
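To make the keyword-based coding concrete, the following is a minimal sketch in R. The identifier lists are illustrative: only the "if"/"then" (decision) and "Go to"/"Proceed to" (interdependency) keywords come from the descriptions above, and a full implementation would use the complete identifier lists from Kannan et al. (2018).

# Minimal sketch of keyword-based complexity flags for a single step
step_complexity <- function(step_text) {
  txt <- tolower(step_text)
  c(
    decision        = grepl("\\bif\\b", txt) && grepl("\\bthen\\b", txt),  # "if"/"then" identifiers
    interdependency = grepl("go to|proceed to", txt)                       # "Go to"/"Proceed to" identifiers
  )
}

step_complexity("If the level exceeds the set point, then proceed to step 12.")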
Readability features
Readability is a measure of the ease of reading a step (DuBay, 2008). Whereas the step complexity
features provide an index of the difficulty of execution of the task, readability measures the difficulty
of processing the instructions. Readability was included in this analysis as it is a common focus of
procedure designs for high-risk industries (Sharit, 1998) and has been shown to influence operator
performance (Novatsis & Skilling, 2016). Readability is typically measured through a scoring formula
which is often a function of the number or mean number of words, sentences, syllables, and characters in a
text. Given that there is no singular gold standard metric for readability of procedures in oil and gas
industries, we initially calculated a set of 29 metrics for each step in the dataset. These metrics were
selected based on their commonality and ease of implementation. The metrics were calculated with
the Quanteda package in R (Benoit et al., 2018).
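As an illustration of this step, the sketch below computes a handful of the candidate metrics with quanteda; the step texts are taken from the examples above. Note that in recent quanteda releases textstat_readability() has moved to the companion quanteda.textstats package.

library(quanteda)
library(quanteda.textstats)   # textstat_readability() lives here in newer quanteda releases

steps <- c("Close drain valve FSV-M111-29.",
           "Make sure test connection is depressured")

# A subset of the 29 candidate readability metrics, computed per step
textstat_readability(steps,
                     measure = c("Flesch.Kincaid", "ARI", "FOG", "SMOG", "Coleman.Liau.ECP"))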
The volume and overlapping nature of the readability formulas create a challenge because many
of the metrics are highly correlated, which may undermine the generalizability and performance of
decision tree algorithms (Kuhn & Johnson, 2013). To address this challenge, we performed a principal
component feature selection process to identify the readability metric that explained the most
variance in the dataset. This method is an established technique for feature selection with correlated
features (Song et al., 2010). The principal components analysis, summarized in Table 4, found that
the Flesch-Kincaid metric was the most informative in this context and thus this metric was included
in the final feature set. The Flesch-Kincaid score measures readability from the average sentence length
(ASL), the number of words (nw), and the number of syllables (nsy) according to Equation 1:

FK = 0.39 × ASL + 11.8 × (nsy / nw) − 15.59     (1)
Table 4 Summary of the principal components analysis of readability features (proportion of variance explained by principal component 1 and the loadings of each readability metric on that component). Metrics included:
ARI (Smith & Senter, 1967)
Coleman (E. B. Coleman, 1971)
Coleman.C2 (E. B. Coleman, 1971)
Coleman.Liau.ECP (M. Coleman & Liau, 1975)
Farr.Jenkins.Paterson (Farr et al., 1951)
Flesch (Flesch, 1948)
Flesch.PSK (Powers et al., 1958)
Flesch.Kincaid (Kincaid et al., 1975)
FOG (Gunning, 1968)
FOG.PSK (Powers et al., 1958)
FOG.NRI (Kincaid et al., 1975)
FORCAST (Caylor & Sticht, 1973)
FORCAST.RGL (Caylor & Sticht, 1973)
Fucks (Fucks, 1955)
Linsear.Write (Klare G, 1974)
LIW (Björnsson, 1968)
nWS (Bamberger & Vanecek, 1984)
nWS.2 (Bamberger & Vanecek, 1984)
nWS.3 (Bamberger & Vanecek, 1984)
nWS.4 (Bamberger & Vanecek, 1984)
RIX (Anderson, 1983)
Scrabble (Benoit et al., 2018)
SMOG (McLaughlin, 1969)
SMOG.C (McLaughlin, 1969)
Spache.old (Spache, 2005)
Strain (Solomon, 2006)
Traenkle.Bailer (Tränkle & Bailer, 1984)
meanSentenceLength (Benoit et al., 2018)
meanWordSyllables (Benoit et al., 2018)
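A minimal sketch of the selection logic follows, assuming `readability` is a data frame with one row per step and one column per metric (as returned by textstat_readability(), minus the document identifier) and that every metric column varies across steps.

# Principal components analysis of the standardized readability metrics
metrics <- readability[, setdiff(names(readability), "document")]
pca     <- prcomp(metrics, center = TRUE, scale. = TRUE)

# Proportion of variance explained by the first component
summary(pca)$importance["Proportion of Variance", 1]

# Metric with the largest absolute loading on the first component
loadings_pc1 <- pca$rotation[, 1]
names(which.max(abs(loadings_pc1)))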
Natural language processing features
Natural language processing features were included in this analysis to provide a domain knowledge-
agnostic comparison for the operator and procedure complexity features. Eighteen natural language
processing features were calculated including word and character counts, parts of speech, character
types (e.g., capital letters), and sentiment scores. The word, character, part-of-speech, and
character-type features were calculated as count frequencies. The sentiment scores were calculated by
looking up the words in each step in a sentiment dictionary and summing the scores; Table 5
illustrates this calculation for the step “Close drain valve FSV-M111-29.” Four sentiment scores were
calculated using the Bing (Liu, 2012), Vader (Hutto & Gilbert, 2014), Syuzhet (Jockers, 2015), and
Afinn (Nielsen, 2011) dictionaries. These dictionaries represent a sample of the most widely used
sentiment dictionaries. The process of calculation was consistent across the sentiment scores,
although individual word scores differed across dictionaries (e.g., the dictionaries assign different scores to the word "Drain").
Table 5 Illustration of the sentiment calculation for the step "Close drain valve FSV-M111-29": the Bing sentiment score of each word and the total sentiment score for the step.
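The exact implementation used for the dictionary lookups is not stated; the sketch below reproduces the idea with the syuzhet package (Bing, Afinn, and Syuzhet dictionaries) and the vader package for the VADER lexicon.

library(syuzhet)
library(vader)

step <- "Close drain valve FSV-M111-29"

# Dictionary look-up and summation of word-level scores
get_sentiment(step, method = "bing")
get_sentiment(step, method = "afinn")
get_sentiment(step, method = "syuzhet")

# VADER returns several components; the compound score is the overall value
as.numeric(get_vader(step)["compound"])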
Machine learning analysis
The machine learning analysis used in this study consisted of four phases: feature selection, algorithm
fitting, algorithm evaluation, and inferential analysis. Feature selection was performed using the
Boruta algorithm—a random forest-based feature selection approach which identifies important
features through two-sided tests of equality against random variables (Kursa & Rudnicki, 2010).
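A sketch of this step, assuming `step_data` holds one row per step performance with the binary outcome `performance` and the 30 candidate features:

library(Boruta)

set.seed(42)                                   # Boruta is stochastic
boruta_fit <- Boruta(performance ~ ., data = step_data)

# Features confirmed as more informative than their randomized shadow copies
getSelectedAttributes(boruta_fit, withTentative = FALSE)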
Following the Boruta feature selection, four algorithms were fit to the data using the caret (Kuhn,
2020) package in R 3.6.0 (R Core Team, 2014): a conditional inference decision tree (DT), a
random forest (RF), and two logistic regressions (LR). Conditional inference trees were used because of
their formal statistical foundation and because they may reduce variable bias associated with standard
recursive partitioning tree algorithms (Hothorn et al., 2006). The two logistic regression algorithms
differed in their feature sets. The first LR algorithm used only operator features, and the second
used the features selected by the Boruta process. The LR algorithms were used as a benchmark
comparison to justify the additional complexity of the DT and random forest algorithms following
the example in Carnahan et al., (2003).
The algorithms were implemented with the party (Hothorn et al., 2006), randomForest (Liaw &
Wiener, 2002), and stats—via the glm function—packages, respectively. For each algorithm, a 10-
fold repeated cross validation approach with 10 repetitions was used to fit hyperparameters and
estimate the algorithm generalizability. In each repetition and fold, the incorrect instances in the
training data were upsampled to create a balanced dataset and further reduce bias in the algorithm
fitting process. The optimal hyperparameters were selected using the maximum area under the
receiver operating characteristic curve (AUC). The final tuned hyperparameters for the DT and
random forest are summarized in Table 6—note that the LR algorithm does not have hyperparameters.
The overall algorithm fits were assessed with the AUC, sensitivity, and specificity calculated from
the cross-validation test set samples. Statistical tests for these metrics were calculated with the
DeLong method (DeLong et al., 1988).
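The following sketch shows how this fitting scheme can be expressed with caret. The data frame and variable names (step_data, performance) are assumptions, and the outcome is assumed to be a factor with levels "correct" and "incorrect".

library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     sampling = "up",                 # upsample incorrect steps within each fold
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

rf_fit <- train(performance ~ ., data = step_data, method = "rf",
                metric = "ROC", trControl = ctrl)
dt_fit <- train(performance ~ ., data = step_data, method = "ctree2",   # conditional inference tree
                metric = "ROC", trControl = ctrl)
lr_fit <- train(performance ~ ., data = step_data, method = "glm",      # logistic regression (binomial glm)
                metric = "ROC", trControl = ctrl)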
Table 6 Hyperparameter settings for the random forest and decision tree algorithms.
• Decision tree: the maximum depth of any branch of the tree.
• Decision tree: the threshold of the test statistic for creating a split.
• Random forest: the number of randomly selected parameters considered for splitting at each node.
Machine learning inference
Following the algorithm fitting and performance evaluation, the random forest variable importance
and partial dependence plots were used for inference. The variable importance illustrates the expected
loss in accuracy associated with removing a feature from the random forest algorithm. Variable
importance is calculated by fitting an algorithm including the feature and calculating the accuracy,
then fitting the algorithm again without the feature and re-calculating accuracy, and finally
calculating the difference between the two accuracy values. This iterative process does not control
for the accuracy contributions of other features and thus the total variable importance across all
features may be more than 100%. This lack of control limits the use of variable importance for
inference. Partial dependence addresses this limitation by calculating the expected algorithm
prediction across the values of a feature in the dataset while holding all other features constant at
their mean. The calculation of partial dependence is analogous to linear regression coefficients,
although they are typically more complex functions given the underlying complexity of the machine
learning algorithm. Plotting partial dependence over the range of feature values illustrates how
changes in a feature impact the algorithm’s class prediction likelihood and provides more granular
insight into the algorithm’s predictions (Zhao & Hastie, 2019). Together these methods can be used
to illustrate important features for procedure design and thresholds that may be used to create design guidelines.
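A minimal sketch of these two inference steps with the randomForest package (feature names such as n_words are illustrative):

library(randomForest)

rf <- randomForest(performance ~ ., data = step_data,
                   ntree = 500, importance = TRUE)

importance(rf, type = 1)          # permutation-based mean decrease in accuracy per feature
varImpPlot(rf, type = 1)

# Partial dependence of the predicted class on a single feature
partialPlot(rf, pred.data = step_data, x.var = "n_words",
            which.class = "correct")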
Results
Feature selection results
The Boruta feature selection method identified 25 relevant features and 4 irrelevant features for
correct step performance. The irrelevant features included the number of first-person pronouns in
the step, the step size and interdependence complexity, and whether the step was checked off. The
relevant features are summarized in Table 7 according to their feature type.
Table 7 Selected features and their associated categories.
• Procedure: number of steps; step info complexity
• Readability: Flesch Kincaid score
• Natural language processing: Afinn dict. sentiment; Bing dict. sentiment; Syuzhet dict. sentiment; Vader dict. sentiment; second person pronouns; third person pronouns; to be verbs; number of words
Algorithm fitting results
Figure 1 illustrates the ROC curves for the logistic regression with only operator features (LROP),
logistic regression (LR), decision tree (DT) and random forest (RF) algorithms. The logistic
regression with only operator features had an AUC of 0.61 (95% CI: 0.57-0.64), the logistic
regression with all features had an AUC of 0.75 (95% CI: 0.72-0.78) the DT had an AUC of 0.77
(95% CI: 0.74-0.80), and the random forest (RF) algorithm had an AUC of 0.78 (DeLong 95% CI:
0.75-0.81). The sensitivity and specificity of the algorithms and their standard deviations across
cross-validation fold test sets are summarized in Table 8. In all cases, the algorithms performed
significantly better than a random classifier. In addition, the algorithms with all features significantly
outperformed the logistic regression with operator only features (LR: D = -5.95, df = 1977.1, p <
0.001; DT: D = -6.50, df = 1958.8, p < 0.001; RF: Z = -7.92, p < 0.001). Pairwise comparisons
between the AUC of the LR, RF, and DT were not significant.
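The AUC confidence intervals and pairwise comparisons reported above can be reproduced with the pROC package; the sketch below assumes vectors of held-out labels and predicted probabilities for two of the algorithms (object names are illustrative).

library(pROC)

roc_lrop <- roc(test_labels, lrop_probs)        # logistic regression, operator features only
roc_rf   <- roc(test_labels, rf_probs)          # random forest, all features

ci.auc(roc_rf, method = "delong")               # DeLong 95% CI for the AUC
roc.test(roc_lrop, roc_rf, method = "delong")   # DeLong test of the AUC difference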
Figure 1 Receiver operating characteristic curves for the logistic regression with only operator features (LROP), logistic
regression with all features (LR), decision tree (DT), and random forest (RF) algorithms.
Table 8 The area under the curve (AUC), sensitivity, and specificity of the logistic
regressions, decision tree, and random forest algorithms. The numbers in parentheses
are the standard deviations across the 10 cross-validation fold test sets.
The random forest variable importance plot, shown in Figure 2, illustrates that familiarity with the
procedure is the most important feature, i.e., the omission of the feature results in the largest loss
of predictive accuracy. Beyond familiarity, the other top ten most important features include
experience, total words, character-based metrics (e.g., lowercase and total characters), Flesch
Kincaid readability, and the Vader dictionary sentiment score. It is notable that while the procedure-
based features do appear to be important for classification, they are considerably less so than the
operator-based features, readability feature, and several of the natural language processing features.
Figure 2 Feature importance for the random forest algorithm based on mean decrease in accuracy.
Figure 3 illustrates the partial dependence plots for the top ten features, ordered left-to-right,
top-to-bottom, by importance. Several notable trends emerge from the figure. Familiarity and
experience show expected trends in that increased familiarity and experience result in a higher
likelihood that the algorithm predicts that the step will be correctly performed. In contrast, as total
words increase, it is less likely that the algorithm predicts the step will be correctly performed. Three
surprising trends emerge from the Flesch Kincaid readability score, the uppercase characters, and
the Vader sentiment. The Flesch Kincaid reading score graph suggests that as a step becomes more
difficult to read, the algorithm is more likely to predict it will be performed correctly. Further analysis
suggests that this trend may be related to the use of multi-syllable domain words (e.g., "operator")
and abbreviations. For example, one step with a Flesch Kincaid score of 10.9 (approximately
equivalent to the highest likelihood that the algorithm predicts correct performance) reads, “Notify
the Control Room Operator that equipment is ready to be removed from service.” In contrast, the
step, “Close M112-9 and M112-10,” has a Flesch Kincaid score of -1.45. Additional context on this
analysis can be found in the uppercase characters chart (bottom left in the figure), which shows that
the algorithm predicts that steps containing between 8 and 12 uppercase characters are likely to be
performed incorrectly. These steps, e.g., “Have CRO place ILIC - 101 in Manual Control,” generally
contain at least 2 abbreviations. Thus, these results suggest that the algorithm predicts that steps with
many abbreviations are likely to be performed incorrectly. The trend in the Vader dictionary sentiment
scores suggests that the algorithm predicts that steps with neutral sentiment are more likely to be
performed correctly, whereas steps with more positive sentiment are more likely to be performed incorrectly.
Figure 3 Partial dependence plots for the most important features of the random forest algorithm. Points show values of
the feature where partial dependence was calculated, the lines illustrate trends.
Additional context for the univariate partial dependence plots can be gained by analyzing and
plotting the partial dependence across two variables. Figure 4 illustrates four such plots for experience
and familiarity across the number of words in the procedure and the number of uppercase characters.
The top left graph—experience by number of words—shows that the algorithm is more likely to
predict correct step performance in experienced operators (Experience = 1) when the step contains
fewer than 20 words compared to more than 20 words. Inexperienced operators (Experience = 0)
show a similar trend, although the figure suggests that inexperienced operators are least likely to
perform a step correctly if it has 20 words. The trend is similar in the top-right graph which shows
the prediction trends for uppercase characters and experience. Given the findings on abbreviations
discussed above, it is notable that the algorithm predicts incorrect step performance is most likely in
inexperienced operators conducting steps with many abbreviations. It is notable that experienced
operators are less impacted by abbreviations than inexperienced operators. The bottom graphs
suggest that the algorithm is most likely to predict correct performance when procedure steps have
less than 20 words and when operators are more familiar (4 or 5 familiarity rating).
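Two-variable partial dependence surfaces such as those in Figure 4 can be computed with the pdp package; a sketch under assumed variable names:

library(pdp)

pd2 <- partial(rf, pred.var = c("experience", "n_words"),
               which.class = 1L,     # first class assumed to be "correct"
               prob = TRUE,
               train = step_data)
plotPartial(pd2)                      # shaded surface of predicted correct-step probability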
Figure 4 Two-dimensional partial dependency plots for Experience, Familiarity, Number of words, and Uppercase
characters. The shading indicates the likelihood of the algorithm predicting correct step performance—white corresponds
to high likelihood of a correct step performance prediction and black corresponds to a high likelihood of predicting
incorrect step performance.
Discussion
The goal of this study was to take steps towards machine learning-driven procedure design by
investigating machine learning for predicting procedure performance from operator, readability, and
natural language processing-based features. The findings provide evidence that machine learning can
effectively integrate these features and accurately predict step-performance. In addition, the results
provide an initial quantitative description of the correlations between these factors.
The AUCs of the random forest, decision tree, and logistic regression with all of the features (0.78,
0.77, and 0.75, respectively) indicate acceptable algorithm prediction performance relative to
common benchmarks (Mandrekar, 2010). It is notable that these algorithms significantly
outperformed the logistic regression algorithm with only operator features. This finding validates the
need for both features derived from the procedure steps and operator characteristics in
predicting procedure step performance. The consistency in performance across the algorithms
including all features suggests that the features included in the algorithm are more important than
the machine learning approach for predicting procedure step performance. This finding is consistent
with analyses in other domains (McDonald et al., 2019) and highlights the need for careful feature
identification and selection in future analyses of procedures.
It is somewhat surprising that there were no significant differences between the logistic regression,
decision tree, and random forest algorithms containing all features. While this may be an artifact of
the limited dataset, the result is important as the random forest requires considerably more
parameters and complexity. This additional complexity reduces the likelihood of overfitting, but it also
makes the algorithm less interpretable. In contrast, decision tree and logistic regression algorithms
are generally considered human readable and interpretable by non-machine learning experts and thus
they may be more directly useful to procedure designers in high-risk industries without training in
machine learning. This idea is supported by prior work from Bevilacqua, Ciarapica, and Giacchetta
(2008) who used decision trees to inform a refinery of operational safety issues. As suggested in that
work, the decision tree approach may be a complementary method to be used in conjunction with
current procedure design practices.
Feature selection and inference
The feature selection and inferential analyses highlight the importance of operator characteristics
(e.g., experience, familiarity), readability, and characteristics of the procedure step text (e.g., total
characters). The significance of these features alone is not surprising given that prior analyses of
procedures have also found them to impact procedure performance (Novatsis & Skilling, 2016; Peres,
Mannan, et al., 2016; Sasangohar et al., 2018; Sharit, 1998). However, the alignment of this analysis
with prior work is important because the automated feature selection of the Boruta approach and
the iterative construction of the random forest and decision trees rely on the data rather than
domain constructs. The alignment provides at least some evidence that the algorithm performance
here would be replicated with a broader sample.
Beyond the qualitative alignment, the inferential analysis here highlights novel correlations
between human factors and language-based features. In particular, the partial dependency results
illustrate that procedure steps over 20 words in length correlate with a decline in step performance.
The results also suggest that inexperience correlates with incorrect step performance, which provides
support for earlier assessments of procedures (Leplat, 1985; Sasangohar et al., 2018). The correlation
between degraded step performance and abbreviations provides additional clarity on word and
abbreviation limitations and their effects on operators.
The results also show that step complexity metrics and sentiment may play a substantial
role in procedure step performance. Although there has been some previous research investigating
the relationships between complexity and performance (Campbell, 1991; Chan et al., 2015; Park &
Jung, 2015), the findings regarding the relationship between a type of complexity, worker experience,
and attributes of the procedure design are novel and need to be pursued further to be more clearly
understood. Similarly, the significance of the sentiment findings must be explored further in future
work. While it is notable that the findings suggest that steps with neutral sentiment are more likely
to lead to correct performance compared to positive or negative sentiment steps, more detailed
analysis is needed to assess the role of sentiment in procedure performance. Sentiment dictionaries,
such as the ones used in this analysis, are generally based on subjective ratings and general or popular
texts (Hutto & Gilbert, 2014; Nielsen, 2011). As such, they should be used with caution in
professional domains because the meaning and experience associated with a word may change
significantly between general conversation and high-risk industry practice. For example, the word
“ensure” in the Vader dictionary is mapped to a positive sentiment score of 1.6. In the procedures
analyzed here, the word generally refers to ensuring a setting or the readiness of equipment, which
one may expect to be a neutral sentiment.
Although it is premature to directly extend the findings into specific procedure design guidelines, the
correlations identified in this analysis warrant consideration in the procedure design process. The
inferential findings suggest that procedure steps will be more likely to be performed correctly if they
contain 10 to 20 total words, fewer than 25 total characters (15 unique characters), fewer than 5
uppercase characters, and neutral sentiment. Practitioners may consider these bounds as heuristics
to guide design alternatives as part of a larger design and evaluation process. The findings also
suggest that familiarity with a procedure substantially increases the likelihood of correct step
performance, further emphasizing the need for specific training and deliberate practice (Boot &
Ericsson, 2011). When considered alongside other recent findings from Sasangohar et al. (2018)
and Peres, Smith, and Sasangohar (in press), the importance of operator experience suggests that
procedure designers should consider alternate procedure designs for experienced and inexperienced operators.
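As an illustration only, the heuristic bounds above could be operationalized as a simple screening check on draft steps; these are correlational thresholds from the present data, not validated design rules, and the example step is taken from the results section.

check_step <- function(step_text) {
  n_words <- length(strsplit(trimws(step_text), "\\s+")[[1]])
  n_chars <- nchar(gsub("\\s", "", step_text))        # total non-space characters
  n_upper <- nchar(gsub("[^A-Z]", "", step_text))     # uppercase characters (abbreviation proxy)
  c(words_10_to_20 = n_words >= 10 && n_words <= 20,
    chars_under_25 = n_chars < 25,
    upper_under_5  = n_upper < 5)
}

check_step("Have CRO place ILIC - 101 in Manual Control")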
Limitations and future work
There are several limitations with the present analysis. Most importantly, the size and scope of the
current dataset is limited. The focus on a set of 4 procedures in the oil and gas industry limits the
extension of the results to a broader set of procedures. Before the results here are generalized to
other procedure designs, they must be validated with procedures not used in the algorithm training
process. Specifically, these procedures should include systematic manipulations aligned with the
correlations identified in this study. For example, the same procedure steps should be offered with
varying numbers of acronyms. Additionally, while the number of participants was reasonable, the
training process could be refined further with data from additional operators and more granular
measures of complexity and experience.
Another limitation of the current work is the environment. The simulation facility used in this
work is high fidelity, but it does not fully reflect the conditions on a real offshore facility. This may
result in differences in procedure performance, particularly when additional operator factors such as
fatigue and stress occur. Finally, the low initial reliability of the step-performance coding is concerning
and may have impacted the results. To address these concerns, future work should explore the
implementation of the data collection and design procedure here on a large sample of workers in real
environments with a more clearly defined step-performance evaluation. Future work should
additionally explore the utility of these findings in the presence of operator stress and physical and
mental fatigue, as well as additional NLP features (e.g., word embeddings) that may provide additional linguistic context.
Conclusion
The machine learning approaches analyzed in this study suggest that random forest, decision tree,
and logistic regression algorithms can be used to predict procedure step performance from operator,
procedure, readability, and natural language-based features. The inferential analysis suggests that
short procedure steps with few characters and abbreviations are correlated with improved procedure
step performance. While these results are a promising step towards machine learning-driven procedure
design, they must be validated with additional data covering further procedures before they are
broadly extended to the field.
• Machine learning can be an effective approach for analyzing procedure step performance
based on the characteristics of the operator and procedure steps.
• Partial dependence analysis can be used to understand correlations between features and step performance.
• Procedure steps with minimal words, abbreviations, and characters and higher levels of
experience and familiarity are correlated with correct procedure step performance for the procedures examined in this study.
Ahmed, L., Quddus, N., Kannan, P., Peres, S. C., & Mannan, M. S. (2020). Development of a procedure
writers’ guide framework: Integrating the procedure life cycle and reflecting on current industry practices.
International Journal of Industrial Ergonomics, 76(February).
Anderson, J. (1983). Lix and Rix: variations on a little-known readability index. Journal of Reading, 26(6),
490–496. Retrieved from http://www.jstor.org/stable/40031755
Anderson, J. R. (1982). Acquisition of cognitive skill. Psychological Review, 89(4), 369–406.
Bamberger, R., & Vanecek, E. (1984). Lesen-Verstehen-Lernen-Schreiben [Read-Understand-Learn-Write].
Baron, R. (2009). Failure to follow procedures: Deviations are a significant factor in maintenance errors.
Retrieved from the Federal Aviation Administration website:
Bates, S., & Holroyd, J. (2012). Human factors that lead to non-compliance with standard operating
procedures (Research Report RR 919). Health and Safety Executive Laboratory.
Baumont, G., Ménage, F., Schneiter, J. R., Spurgin, A., & Vogel, A. (2000). Quantifying human and
organizational factors in accident management using decision trees: the HORAAM method. Reliability
Engineering & System Safety, 70(2), 113–124. https://doi.org/10.1016/S0951-8320(00)00051-X
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). quanteda: An R
package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774.
Bevilacqua, M., Ciarapica, F. E., & Giacchetta, G. (2008). Industrial and occupational ergonomics in the
petrochemical process industry: A regression trees approach. Accident Analysis and Prevention, 40(4),
Björnsson, C. H. (1968). Läsbarhet [Readability]. Liber (6th ed.). Stockholm: Liber.
Boot, W., & Ericsson, K. A. (2011). Expertise. In J. D. Lee & A. Kirlik (Eds.), Oxford handbook of cognitive
engineering (pp. 143–158). New York: Oxford University press.
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5–32.
Bullemer, P. T., & Hajdukiewicz, J. R. (2004). A Study of effective procedural practices in refining and
chemical operations. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 48(20),
Bullemer, P. T., & Laberge, J. C. (2010). Common operations failure modes in the process industries. Journal
of Loss Prevention in the Process Industries, 23(6), 928–935. https://doi.org/10.1016/j.jlp.2010.05.008
Burbidge, R., Trotter, M., Holden, S., & Buxton, B. (2001). Drug design by machine learning : Support vector
machines for pharmaceutical data analysis. Computers and Chemistry, 26, 5–14.
Campbell, D. J. (1991). Goal levels, complex tasks, and strategy development: A review and analysis. Human
Performance, 4(1), 1–31. https://doi.org/10.1207/s15327043hup0401_1
Carim, G. C., Saurin, T. A., Havinga, J., Rae, A., Dekker, S. W. A., & Henriqson, É. (2016). Using a procedure
doesn’t mean following it: A cognitive systems approach to how a cockpit manages emergencies. Safety
Science, 89, 147–157. https://doi.org/10.1016/j.ssci.2016.06.008
Carnahan, B., Meyer, G., & Kuntz, L.-A. (2003). Comparing statistical and machine learning classifiers:
Alternatives for predictive modeling in Human Factors research. Human Factors, 45(3), 408–423.
Caylor, J. S., & Sticht, T. G. (1973). Development of a simple readability index for job reading material.
Department of the Army. Retrieved from ERIC database. (ED076707)
Chan, S. H., Song, Q., & Yao, L. J. (2015). The moderating roles of subjective (perceived) and objective task
complexity in system use and performance. Computers in Human Behavior, 51(Part A), 393–402.
Chen, Y.-W., & Lin, C.-J. (2006). Combining SVMs with various feature selection strategies. In I. Guyon, M.
Nikravesh, S. Gunn, & L. A. Zadeh (Eds.), Feature Extraction: Foundations and Applications (pp. 315–
324). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-35488-8_13
Clarke, D. D., Forsyth, R., & Wright, R. (1998). Machine learning in road accident research: Decision trees
describing road accidents during cross-flow turns. Ergonomics, 41(7), 1060–1079.
Coleman, E. B. (1971). Developing a technology of written instruction: Some determiners of the complexity
of prose. In E. Z. Rothkopf & P. E. Johnson (Eds.), Verbal learning research and the technology of
written instruction (pp. 155–204). New York: Teachers College Press.
Coleman, M., & Liau, T. L. (1975). A computer readability formula designed for machine scoring. Journal of
Applied Psychology, 60(2), 283–284. https://doi.org/10.1037/h0076540
Dekker, S. (2003). Failure to adapt or adaptations that fail: Contrasting models on procedures and safety.
Applied Ergonomics, 34(3), 233–238. https://doi.org/10.1016/S0003-6870(03)00031-0
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more
correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44, 837–845.
Deufel, C. L., McLemore, L. B., de los Santos, L. E. F., Classic, K. L., Park, S. S., & Furutani, K. M. (2017).
Patient safety is improved with an incident learning system—clinical evidence in brachytherapy.
Radiotherapy and Oncology, 125(1), 94–100.
DuBay, W. (2008). The principles of readability. Costa Mesa: Impact Information, (949), 77.
Farr, J. N., Jenkins, J. J., & Paterson, D. G. (1951). Simplification of Flesch reading ease formula. Journal of
Applied Psychology., 35, 333–337.
Friedman, J. H. (2001). Greedy function approximation: the gradient boosting machine. The Annals of
Statistics, 29(5), 1189–1232.
Fucks, W. (1955). Unterschied des Prosastils von Dichtern und Schriftstellern [Difference in the prose style of
poets and writers]. Ein Beispiel mathematischer [An example of math] Stilanalyse. Sprachforum 1, 234–
Gill, T. G., & Hicks, R. C. (2006). Task complexity and informing science: A synthesis. Informing Science, 9,
Gunning, R. (1952). The technique of clear writing. McGraw Hill
Hale, A., & Borys, D. (2013a). Working to rule, or working safely? Part 1: A state of the art review. Safety
Science, 55, 207–221. https://doi.org/10.1016/j.ssci.2012.05.011
Hale, A., & Borys, D. (2013b). Working to rule or working safely? Part 2: The management of safety rules
and procedures. Safety Science, 55, 222–231. https://doi.org/10.1016/j.ssci.2012.05.013
HFRG. (1995). Improving compliance with safety procedures, reducing industrial violations. HSE Books
Sudbury, Suffolk, UK.
Hickman, S. H., Hsieh, P. A., Mooney, W. D., Enomoto, C. B., Nelson, P. H., Mayer, L. A., … McNutt, M.
K. (2012). Scientific basis for safely shutting in the Macondo Well after the April 20, 2010 Deepwater
Horizon blowout. Proceedings of the National Academy of Sciences, 109(50), 20268–20273. Retrieved
Hobbs, A., & Kanki, B. G. (2008). Patterns of error in confidential maintenance incident reports. The
International Journal of Aviation Psychology, 18(1), 5–16.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference
framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.
Hutto, C. J., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social
media text. In Eighth International AAAI Conference on Weblogs and Social Media. Ann Arbor, MI,
United states. https://doi.org/10.1210/en.2011-1066
James, G., Witten, D., Hastie, T. J., & Tibshirani, R. (2013). An introduction to statistical learning. New York: Springer.
Jockers, M. L. (2015). Syuzhet: extract sentiment and plot arcs from text. Retrieved from
Kannan, P., Quddus, N., Peres, S. C., & Mannan, M. S. (2018). Can we simplify complexity measurement? a
primer toward usable framework for industry implementation. In Proceedings of the Human Factors and
Ergonomics Society 2018 Annual Meeting. Philadelphia.
Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability
formulas (automated readability index, Fog count and Flesch reading ease formula) for Navy Enlisted
Personnel (Research Branch Report 8-75). Retrieved from https://doi.org/10.21236/ADA006655
Klare G. (1974). Assessing readability. Reading Research Quarterly, 10(1), 62–102.
Kuhn, M. (2020). Caret: classification and regression training. Retrieved from https://cran.r-
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York, New York, USA: Springer.
Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical
Software, 36(11), 1–13.
Leplat, J. (1985). Erreur humaine, fiabilite humaine dans Ie travail [Human error, human reliability in work].
Paris: Armand Colin.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22.
Liu, B. (2012). Sentiment Analysis and Opinion Mining. In G. Hirst (Ed.), Synthesis Lectures on Human
Language Technologies (pp. 1–167). New York, New York, USA: Morgan & Claypool.
Mandrekar, J. N. (2010). Receiver operating characteristic curve in diagnostic test assessment. Journal of
Thoracic Oncology, 5(9), 1315–1316. https://doi.org/10.1097/JTO.0b013e3181ec173d
McDonald, A. D., Ferris, T. K., & Wiener, T. A. (2019). Classification of driver distraction: A comprehensive
analysis of feature generation, machine learning, and input measures. Human Factors.
McDonald, A. D., Lee, J. D., Schwarz, C., & Brown, T. L. (2014). Steering in a random forest: ensemble
learning for detecting drowsiness-related lane departures. Human Factors, 56(5), 986–998.
McLaughlin, G. H. (1969). SMOG Grading – a new readability formula. Journal of Reading, 12(8), 639–646.
Menze, B. H., Kelm, B. M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., & Hamprecht, F. A.
(2009). A comparison of random forest and its Gini importance with standard chemometric methods for
the feature selection and classification of spectral data. BMC Bioinformatics, 10, 213.
Neville, T. J., Peres, S. C., Ade, N., Son, C., Bagaria, P., Quddus, N., & Mannan, M. S. (2018). Assessing
procedure adherence under training conditions in high risk industrial operations. In Proceedings of the
Human Factors and Ergonomics Society 2018 Annual Meeting. Philadelphia.
Neville, T. J., Peres, S. C., Quddus, N., Hendricks, J., Shortz, A., Ade, N., … Mannan, M. S. (2018). Behavior
assessment technique for procedural industrial tasks: using mixed reality to develop a method to
understand work-as-done under normal operating conditions. In Proceedings of the Human Factors and
Ergonomics Society 2018 Annual Meeting. Philadelphia.
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In CEUR
Workshop Proceedings (Vol. 718, pp. 93–98). Retrieved from http://arxiv.org/abs/1103.2903
Noroozi, A., Khan, F., Mackinnon, S., Amyotte, P., & Deacon, T. (2014). Determination of human error
probabilities in maintenance procedures of a pump. Process Safety and Environmental Protection, 92(2),
Novatsis, E., & Skilling, E. J. (2016). Human factors in the design of procedures. In J. Edmonds (Ed.),
Human factors in the chemical and process industries (pp. 291–307). Elsevier. https://doi.org/10.1016/B978-0-12-803806-2.00017-
Park, J., & Jung, W. (2015). Identifying objective criterion to determine a complicated task - A comparative
study. Annals of Nuclear Energy, 85, 205–212. https://doi.org/10.1016/j.anucene.2015.05.012
Peres, S. C., Mannan, M. S., & Quddus, N. (2016). Effective procedure design and use: What do operators
need, when do they need it, and how should it be provided? In Proceedings of the Annual Offshore
Peres, S. C., Quddus, N., Kannan, P., Ahmed, L., Ritchey, P., Johnson, W., … Mannan, M. S. (2016). A
summary and synthesis of procedural regulations and standards—informing a procedures writer’s guide.
Journal of Loss Prevention in the Process Industries, 44, 726–734.
Peres, S. C., Smith, A., & Sasangohar, F. (in press). Worker-centered investigation of issues with procedural
systems: Implications for the revision process and safety culture. Journal of Loss Prevention in the Process Industries.
Poisson, P., & Chinniah, Y. (2015). Observation and analysis of 57 lockout procedures applied to machinery
in 8 sawmills. Safety Science, 72, 160–171. https://doi.org/10.1016/j.ssci.2014.09.005
Powers, R., Sumner, W., & Kearl, B. E. (1958). A recalculation of four adult readability formulas. Journal of
Educational Psychology, 49(2), 99.
R Core Team. (2014). R: A language and environment for statistical computing. Retrieved from http://www.r-
Rasmussen, J. (1983). Skills, rules, and knowledge; signals, signs, and symbols, and other distinctions in human
performance models. IEEE Transactions on Systems, Man and Cybernetics, SMC-13(3), 257–266.
Ritter, F. E., Baxter, G. D., Kim, J. W., & Srinivasmurthy, S. (2013). Learning and retention. In J. D. Lee &
A. Kirlik (Eds.), Oxford handbook of cognitive engineering (pp. 125–142). New York, New York, USA:
Oxford University Press.
Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221.
Sanchez-Lengeling, B., & Aspuru-Guzik, A. (2018). Inverse molecular design using machine learning: Generative
models for matter engineering. Science, 361(6400), 360–365. https://doi.org/10.1126/science.aat2663
Sasangohar, F., Peres, S. C., Williams, J. P., Smith, A., & Mannan, M. S. (2018). Investigating written
procedures in process safety: Qualitative data analysis of interviews from high risk facilities. Process
Safety and Environmental Protection, 113, 30–39. https://doi.org/10.1016/j.psep.2017.09.010
Sharit, J. (1998). Applying human and system reliability analysis to the design and analysis of written procedures
in high-risk industries. Human Factors and Ergonomics in Manufacturing, 8(3), 265–281.
Siegel, A. W., & Schraagen, J. M. C. (2017). Beyond procedures: Team reflection in a rail control centre to
enhance resilience. Safety Science, 91, 181–191. https://doi.org/10.1016/j.ssci.2016.08.013
Smith, E. A., & Senter, R. J. (1967). Automated readability index. AMRL-TR66-22. Wright-Patterson AFB,
OH: Aerospace Medical Division. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/5302480
Solomon, N. W. (2006). A qualitative analysis of media language. LAP Lambert Academic Publishing.
Song, F., Guo, Z., & Mei, D. (2010). Feature selection using principal component analysis. Proceedings - 2010
International Conference on System Science, Engineering Design and Manufacturing Informatization,
ICSEM 2010, 1, 27–30. https://doi.org/10.1109/ICSEM.2010.14
Spache, G. (2005). A new readability formula for primary-grade reading materials. The Elementary School
Journal, 53(7), 410–413. https://doi.org/10.1086/458513
Suchman, L. A. (1983). Office procedure as practical action: Models of work and system design. ACM
Transactions on Information Systems (TOIS), 1(4), 320–328. https://doi.org/10.1145/357442.357445
Tränkle, U., & Bailer, H. (1984). Kreuzvalidierung und Neuberechnung von Lesbarkeitsformeln für die deutsche
Sprache [Cross-validation and recalculation of readability formulas for the German language]. Journal of
Developmental Psychology and Educational Psychology, 16(3), 231–244.
UK Health and Safety Executive. (2015). HSE Human Factors briefing note no 4: procedures. Retrieved May
9, 2018, from http://www.hse.gov.uk/humanfactors/topics/04procedures.pdf
US Chemical Safety and Hazard Investigation Board. (2007). Investigation report, refinery explosion and fire,
BP Texas city. Retrieved from https://www.csb.gov/file.aspx?DocumentId=5596
Vicente, K. J., & Burns, C. M. (1995). A field study of operator cognitive monitoring at Pickering nuclear
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random
forests. Journal of the American Statistical Association, 113(523), 1228–1242.
Wright, P., & McCarthy, J. (2003). Analysis of procedure following as concerned work. In Handbook of
cognitive task design (pp. 679–700). London: Lawrence Erlbaum Associates London.
Yamauchi, T. (2013). Mouse trajectories and state anxiety: Feature selection with random forest. Proceedings
- 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, ACII 2013,
Zhao, Q., & Hastie, T. J. (2019). Causal interpretations of black-box models. Journal of Business & Economic
Statistics, 1–10. https://doi.org/10.1080/07350015.2019.1624293
Anthony D. McDonald is an assistant professor in the Wm Michael Barnes '64 Department of
Industrial and Systems Engineering at Texas A&M University and the director of the Human
Factors and Machine Learning Laboratory. He received his PhD in industrial engineering from the
University of Wisconsin-Madison in 2014.
Nilesh Ade is a PhD candidate in the Mary Kay O'Connor Process Safety Center, Department of
Chemical Engineering, Texas A&M University. He obtained his BS in chemical engineering from the
Institute of Chemical Technology, Mumbai, in 2015.
S. Camille Peres is an associate professor at Texas A&M University in the Department of
Environmental and Occupational Health. She obtained her PhD in psychology from Rice University.