Occupational Injury Surveillance Methods Using Free Text Data and Machine Learning: Creating a Gold Standard Data Set

... The creation of a "gold-standard" dataset has been described in detail elsewhere and will be briefly summarized here [21]. Over 50,000 PCR records were visually inspected and tagged as to their occupational injury status. ...
Purpose: Current injury surveillance efforts in agriculture are considerably hampered by the limited quantity of occupation or industry data in current health records. This has impeded efforts to develop more accurate injury burden estimates and has negatively impacted the prioritization of workplace health and safety in state and federal public health efforts. This paper describes the development of a Naïve Bayes machine learning algorithm to identify occupational injuries in agriculture using existing administrative data, specifically pre-hospital care reports (PCR). Methods: A Naïve Bayes machine learning algorithm was trained on PCR datasets from 2008–2010 from Maine and New Hampshire and tested on newer data from those states between 2011 and 2016. Further analyses were devoted to establishing the generalizability of the model across various states and years. Dual visual inspection was used to verify the subset of records selected by the algorithm. Results: The Naïve Bayes machine learning algorithm reduced the volume of cases that required visual inspection by 69.5 percent compared with a keyword search strategy alone. Coders identified 341 true agricultural injury records (Case class = 1) (Maine 2011–2016, New Hampshire 2011–2015). In addition, there were 581 records (Case class = 2 or 3) that were suspected to be agricultural acute/traumatic events but lacked the necessary detail to make a certain distinction. Conclusions: Applying the trained algorithm to newer data reduced the volume of records requiring visual inspection by roughly two thirds compared with the previous keyword search strategy, making it a sustainable and cost-effective way to understand injury trends in agriculture.
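To make the classification step concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing, written in plain Python. It is not the authors' implementation: the tokenizer, the function names `train_nb` and `predict_nb`, the binary labels, and the toy narratives are all assumptions made for illustration, and the abstract does not say which features or software were used.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase whitespace tokens, alphabetic only.
    return [t for t in text.lower().split() if t.isalpha()]

def train_nb(records):
    """Fit a multinomial Naive Bayes model on (narrative, label) pairs."""
    class_counts = Counter()            # number of records per class
    word_counts = defaultdict(Counter)  # token counts per class
    vocab = set()
    for text, label in records:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return {"class_counts": class_counts,
            "word_counts": word_counts,
            "vocab": vocab}

def predict_nb(model, text):
    """Return the class with the highest log posterior for a narrative."""
    total = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    best_label, best_score = None, -math.inf
    for label, n in model["class_counts"].items():
        score = math.log(n / total)  # log prior
        denom = sum(model["word_counts"][label].values()) + vocab_size
        for tok in tokens:
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            score += math.log((model["word_counts"][label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy narratives (invented for illustration, not real PCR data).
# Label 1 = agricultural injury, 0 = other.
TRAIN = [
    ("patient kicked by cow while milking in barn", 1),
    ("tractor rollover on farm field", 1),
    ("hand caught in hay baler", 1),
    ("fell on ice in parking lot", 0),
    ("chest pain at home", 0),
    ("car crash on highway", 0),
]

model = train_nb(TRAIN)
print(predict_nb(model, "kicked by cow in barn"))
print(predict_nb(model, "chest pain at home"))
```

In a workflow like the one described, records the classifier flags as likely agricultural would then go to human coders for visual inspection, which is where the reduction in review volume would come from.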
Introduction: Specialized occupational injury surveillance systems are filling the gap in the undercount of work-related injuries in industries such as agriculture and forestry. To ensure data quality and maximize efficiency in the operation of a regional occupational injury surveillance system, the need for continued dual coding of occupational injury records was assessed. Methods: Kappa scores and percent agreement were used to compare interrater reliability for assigned variables in 1,259 agricultural and forestry injuries identified in pre-hospital care reports. The variables used for the comparison included type of event, source of injury, nature of injury, part of body, injury location, intentionality, and farm and agriculture injury classification (FAIC). Results: Kappa (κ) ranged from 0.2605 for secondary source to 0.8494 for event and exposure. Individual coder accuracy ranged from medium to high levels of agreement. Agreement beyond the first digit of OIICS coding was measured as percent agreement; type of event or exposure, body part, and primary source of injury continued to reach 70% or greater agreement between all coders and the final choice, even at the most detailed fourth digit of OIICS. Conclusions: This research supports evidence-based decision making in customizing an occupational injury surveillance system, ultimately making it less costly while maintaining data quality. We foresee these methods being applicable to any surveillance system where visual inspection and human decisions are required. Practical Applications: Assessing the rigor of occupational injury record coding provides critical information to tailor surveillance protocols, especially those intended to make the system less costly. System administrators should consider evaluating the quality of coding, especially when dealing with free-text narratives, before deciding on single-coder protocols. Further, quality checks should remain a part of the system going forward.
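The two agreement measures named in the Methods can be sketched as follows. This is a minimal illustration that assumes Cohen's kappa for exactly two coders, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The abstract does not state which kappa variant was used, and the function names and example codes below are hypothetical.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Share of records on which two coders assigned the same code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty record set")
    return sum(x == y for x, y in zip(coder_a, coder_b)) / len(coder_a)

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(coder_a)
    p_o = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement from each coder's marginal code frequencies.
    categories = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical event codes for ten records (illustrative, not study data).
coder_a = ["fall"] * 5 + ["struck"] * 5
coder_b = ["fall"] * 4 + ["struck"] * 5 + ["fall"]

print(round(percent_agreement(coder_a, coder_b), 2))  # 0.8
print(round(cohen_kappa(coder_a, coder_b), 2))        # 0.6
```

In this toy case the coders agree on 8 of 10 records, but because each uses both codes half the time, chance agreement is 0.5 and kappa drops to 0.6. This is why a variable can show high percent agreement and still have a modest kappa.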