Content uploaded by Scott Coffin
Author content
All content in this area was uploaded by Scott Coffin on Apr 26, 2021
Content may be subject to copyright.
California Water Boards
Designing and Administrating
the COVID-19 Financial
Impacts Survey
Bethany Robinson, Ph.D (Water Resource Control Engineer, DDW)
Scott Coffin, Ph.D (Research Scientist III, DDW)
Marielle Pinheiro, Ph.D (Research Data Specialist II, ORPP)
California Water Boards
Key Survey Design/Analysis Tools
Survey Design
•Clearly defining data objectives
•Question design
•Statistical power analysis
•Survey distribution and collection
Survey Analysis
•Correlation and regression
•Imputation
•Extrapolation and uncertainty estimation (calibration and weighting)
2
California Water Boards
Background
•Executive Order N-42-20 (April 2020): Shutoff moratorium
•Preliminary voluntary survey in summer 2020
•More expansive survey effort in winter 2020
•Motivation:
•Understand financial impacts to drinking water systems
•Understand household bill debt
•Inform development of assistance options
California Water Boards
Summer 2020: Initial Voluntary Survey
•Shiny app, similar to a Google form
(but with user authentication for data
quality/PII purposes)
•MongoDB backend for response storage
•Publicized via listservs (Water Board, water
agency organizations)
•Results not statistically significant
•Skewed sample (overrepresented larger systems,
spatial gaps)
•Only captured 30% of state population served
California Water Boards
Winter 2020: Targeted sample of systems
Small/medium systems
(<10,000 service connections)
Large systems (10,000+
service connections)
Number (percent) of
community water systems
2661 systems (92%) 223 systems (8%)
Number (percent) of state
population served
7.1 million (18%) 32.4 million (82%)
Sample size 510 150
Number of responses
(systems represented*)
406 (428) 131 (151)
Number (percent) of
subset population
represented
4.0 million (56% of
small/medium system
population)
24.2 million (75% of large
system population)
* Some agencies provided responses for multiple systems as they were unable to separate their billing data
California Water Boards
Survey Administration
Small/Medium Water Systems Large Water Systems
450 systems received an email on 11/9,
notifying them that staff would be
following up
150 systems received an email with the
survey on 11/9
DDW staff workshop 11/5: instructed
field staff on specifics of filling out
small/medium system survey
Public workshop 11/16: instructed
agency representatives on specifics of
filling out large system survey
DDW field office staff called systems and
filled out survey forms
Agencies completed survey forms and
returned to ORPP staff
California Water Boards
Sampling and Survey Design
7
California Water Boards
Probabilistic Sampling Approaches
•Simple Random Sampling
•Stratified Random Sampling
•Multi-Stage Sampling
8
Which should you choose?
California Water Boards
Probabilistic Sampling Approaches
9
Groupings of #Service Connections Makes Case for STRATIFIED Random Sampling
California Water Boards
Sampling Procedure
•Kept the two populations (small and large) separate
•Small/medium sample: 4 bins determined using Jenk’s Natural Breaks
•Large sample: random sample of large system population (stratified by
service connections and region)
•Final survey results also included 5 voluntary responses from large systems
California Water Boards
Survey Design:
Binning Small/Medium Systems
Bins (Service Connection Breaks):
•Bin A: 0 –1,008
•Bin B: 1,009 –3,090
•Bin C: 3,091 –5,868
•Bin D: 5,869 –10,000
California Water Boards
Margins of Error
12
Bin Service connections Population Sample Size Margin of Error
for 90% CI
Margin of Error
for 95% CI
Bin A SC < 1,009 1944 184 1.5% 1.7%
Bin B 1,009 <= SC < 3,091 265 107 0.8% 1.0%
Bin C 3,091 <= SC < 5,869 108 70 1.2% 1.4%
Bin D 5,869 <= SC < 10,000 72 55 1.2% 1.4%
z = z-score
σ = standard deviation
P = population
n = sample size
∗
∗ −
−1
California Water Boards
Question Design Considerations
•Number and length of questions
•Limits on staff time (both Water Board and local agencies)
•Data availability
•Some billing systems didn’t have historical records
•Some systems declined to participate or provide data
•Billing cycle and resultant data uniformity
•PII protection
•Guidance document for staff to ensure consistent responses
13
California Water Boards
Survey Questions
Small/Medium System Questions Large System Questions
System finances
(Revenues, expenses, cash reserves,
months before assistance needed)
Household debt
(Delinquent accounts, debt levels, total
water bill debt, drinking water debt,
debt per billing cycle, debt levels per zip
code)
Household debt distribution
(Number of accounts with debt, total
amount of debt, debt amount per zip
code)
Fees and repayment options
(Late fees charged, number of accounts
on repayment plans)
Questions created by ORPP/DDW working group and vetted by various
stakeholder groups
California Water Boards
Survey Distribution, Collection, and
Data Extraction
15
California Water Boards
Survey Distribution Considerations
•Time constraints
•no time to build and test a new site
•Couldn’t use existing Shiny app for legal reasons
•Platform considerations: Survey needed to be:
•Usable for all parties involved (Water Board staff, agencies, general public)
•Customizable
•Accessible
•Data quality control and management: Ensure that responses
•Are the right type (i.e. numeric vs string)
•Are consistent across different billing cycle types
•Can be extracted programmatically and compiled into a single database
California Water Boards
Survey Distribution Solution
•Excel spreadsheet with
data validation
•Flexible design, but still
machine readable
•No new skills needed to
create survey form
•Retains data type
information
•Created separate template
files for large and small
system responses
•Used Python to automate
personalized survey form
generation
California Water Boards
Question Types
18
•Yes/No
•Multiple Choice
•Comment boxes
•Tables
California Water Boards
Binned Delinquent Accounts Table
•For each zip code
in the service area:
How many
accounts have less
than $100 debt,
between $100-
$200, etc.
•Provided ability to
join data with zip
code level ACS
data, examine
spatial trends
19
California Water Boards
File Generation and Management
1. Created separate file trees for
small/large survey responses in the
S Drive
2. Used openpyxl to fill in template
cells with
1. PWSID
2. system name
3. billing cycle length
3. Saved prefilled files as
[MAIN]/[DISTRICT]/Incomplete/
Survey_[DISTRICT]_[PWSID]_[NAME].xlsx
4. Once a survey was finished, Water
Board staff moved file to “Complete”
folder
Main
Folder
District
subfolder
Incomplete Complete
California Water Boards
Data Extraction
21
Excel file: Numeric/string
responses
Dictionary [DICT_LOC] of
{[varname]:[cell_loc]} pairs
Excel file: table with binned
number of delinquent accounts,
per zip code
extract_function([FILE],[DICT_LOC])
1. Open file using openpyxl
2. For each [varname]in [DICT_LOC]:
1. Obtain value [ans] located at
[cell_loc]
2. Store [varname]:[ans] in
dictionary [DICT_ANS]
3. Store filepath (keep track of original
records)
1. Use pandas to read table [ZIP_TAB]
2. Store [ZIP_TAB] in [DICT_ANS]
Return dictionary [DICT_ANS]
California Water Boards
Data Storage and Release
22
Keylist Table:
•3-digit ID
•System Name
•PWSID
Public Release:
•3-digit ID
•Data with PII
removed
Private Record:
•3-digit ID
•Full, unredacted
data
•References back
to original files
stored on S drive
•Use pyodbc to write data to SQL tables on the Water Board server
•Data errors were fixed by re-reading corrected file using extract_function()
and changing the table values in the row corresponding to the survey ID
•Also added correction details in a “Notes” column
•Use pandas.read_sql_query() and pandas.to_excel() to read public
release table from the server and write to Excel for publishing to website
California Water Boards
Imputation, Weighting, and
Extrapolation
23
California Water Boards
Survey Population Did Not Perfectly
Match State Population
California Water Boards
Statewide Extrapolation Workflow
25
Valliant et al. (2018). Springer.
1.Impute missing data
2.Calculate base weights
3.Adjust weights for non-response
4.Calibrate weights
California Water Boards
Statewide Extrapolation - Weighting
26
=�
t = total
wi= Weight
yi= Response Valliant et al. (2018). Springer.
California Water Boards
Statewide Extrapolation Workflow
27
Valliant et al. (2018). Springer.
1.Impute missing data
2.Calculate base weights
3.Adjust weights for non-response
4.Calibrate weights
California Water Boards
Reasons for Missing Data
•Billing software limitations resulted in incomplete survey responses
•79 systems:
•Do not keep historic data
•Had difficulty sorting accounts by customer class
•Had difficulty with zip code extraction
•Some systems were unable to isolate drinking water portion of total debt
•Time limitations–some systems unable to retrieve data before deadline
28
California Water Boards
Response Data were Incomplete
Mice package (R).
California Water Boards
Types of Missingness
University of Virgina Library
•MCAR: Missing Completely at Random
•The missingness of data points are at random
•MAR: Missing at Random
•The missingness is not completely random, but the propensity of
missingness depends on the observed data, not the missing data.
•MNAR: Missing Not at Random
•The missing is not random, it correlates with unobservable characteristics
unknown to a researcher
California Water Boards
Data were Missing at Random
Missingness
Fraction
Figure from pheatmap package (R)
California Water Boards
Filling in Missing Data with Multiple Imputation
University of Virgina Library
California Water Boards
Imputing Missing Data
California Water Boards
Imputation Via Random Forest Modelling
Imputed Data
Observed Data
Data imputed with Mice package (R).
California Water Boards
Statewide Extrapolation Workflow
35
Valliant et al. (2018). Springer.
1.Impute missing data
2.Calculate base weights
3.Adjust weights for non-response
4.Calibrate weights
California Water Boards
Statewide Extrapolation –Base Weights
36
Valliant et al. (2018). Springer.
Base weight = probability of being sample = proportion in population
California Water Boards
Statewide Extrapolation Workflow
37
Valliant et al. (2018). Springer.
1.Impute missing data
2.Calculate base weights
3.Adjust weights for non-response
4.Calibrate weights
California Water Boards
(0 –1,008) (1,009 –3,090) (3,091 –5,868) (5,869 –10,000)
Service Connection Breaks:
Survey Bins
Survey Response Was Not Random
Figure made with ggplot2 (R)
Response Rate
California Water Boards
Predicting Non-response with Auxiliary Variables
Non-Response Weight
Median 12-Month Household Income ($USD) Figure made with ggplot2 (R)
California Water Boards
Non-Response Adjustment using Machine-Learning
Zhao et al. (2016). Contemporary Clinical Trials Decision Tree generated with rpart package (R)
California Water Boards
Statewide Extrapolation Workflow
41
Valliant et al. (2018). Springer.
1.Impute missing data
2.Calculate base weights
3.Adjust weights for non-response
4.Calibrate weights
California Water Boards
Calibration – Evaluate Correlative Factors
42
California Water Boards
Calibration –
Identify Candidate Calibration Variables
43
Psych package (R))
California Water Boards
Weighting and Calibration Decreases Uncertainties
Less weighting More weighting
California Water Boards
Data Analysis and Example Results
45
California Water Boards
Analysis Techniques
•Random Forest: Variable importance
•Weighted Extrapolation
46
California Water Boards
Perceived Assistance Prediction
A random forest machine
learning algorithm of 10,000 trees
determined that the most
important variables for predicting
assistance are:
1. Median household income and;
2. Service connections.
47
California Water Boards
Financial Vulnerability Metrics
Perceived Months before
assistance Days of unrestricted cash Revenue to expense ratio
No assistance >365 days >2
>12 180-365 days 1.5-2
9-12 90-180 days 1.2-1.5
6-9 60-90 days 0.95-1.2
3-6 30-60 days 0.5-0.95
<3 <30 days <0.5
48
Vulnerability scores were determined by the count of “high-risk” indicators per system (red text in table).
•Systems with 0 out of 3 high-risk indicators have low financial vulnerability.
•Systems with 1 out of 3 high-risk indicators have medium financial vulnerability
•Systems with 2 out of 3 high-risk indicators have high financial vulnerability
•Systems with 3 out of 3 high-risk indicators have extreme financial vulnerability
California Water Boards
Vulnerability Scores and Statewide
Estimates
49
Vulnerability Survey sample Survey % Statewide
estimate
Statewide %
Low 218 52% 1172 49%
Medium 116 28% 592 25%
High 38 9% 270 11%
Extreme 51% 25 1%
Used weights (explained earlier) to extrapolate from the
survey results to a statewide estimate of how many systems
were at high or extreme financial vulnerability status
California Water Boards
Total Statewide Household Debt:
March – October 2020
50
Data System size Survey Total Statewide Estimate
Total Debt
($USD)
Large $671.8 million $731.7 million +/-$1.4 million
Small/Medium $33.0 million $57.6 million +/-$2.5 million
Tota l $704.8 million $789.3 million +/-$3.9 million
California Water Boards
Demographic Correlations
• Racial disparities:
•Zip codes with higher percentages of Hispanic and Black
households have:
• A higher percentage of households with some level
of debt
• A higher average level of debt
• A higher percentage of households with very high levels
of debt (above $600 and above $1,000)
•Strong correlation with median household income
51
California Water Boards
Key Survey Design/Analysis Tools
Survey Design
•Clearly defining data objectives
•Question design
•Statistical power analysis
•Survey distribution and collection
Survey Analysis
•Correlation and regression
•Imputation
•Extrapolation and uncertainty estimation (calibration and weighting)
52
California Water Boards
Further Information
•COVID-19 Financial Impacts Survey Website:
https://www.waterboards.ca.gov/drinking_water/certlic/drinkingw
ater/covid-19watersystemsurvey.html
•Marielle: Marielle.Pinheiro@Waterboards.ca.gov
•Bethany: Bethany.Robinson@Waterboards.ca.gov
•@DrSCoffin ScottCoffin
53