Objectives
To present our AI-based symptom checker, rigorously measure its accuracy, and compare it against existing popular symptom checkers and seasoned primary care physicians.
Design
Vignette study.
Setting
400 gold-standard primary care vignettes.
Intervention/Comparator
We evaluated the performance of 6 symptom checkers using 7 standard accuracy metrics. To this end, we developed and peer-reviewed 400 vignettes, each approved by at least 5 out of 7 independent and experienced general practitioners. To the best of our knowledge, this constitutes the largest benchmark vignette suite in the field thus far. To establish a frame of reference and interpret the results of the symptom checkers accordingly, we further directly compared the best-performing symptom checker against 3 primary care physicians with an average of 16.6 years of experience.
Primary Outcome Measures
We thoroughly studied the diagnostic accuracies of symptom checkers and physicians from 7 standard angles, including: (a) M1, M3, and M5, which measure a symptom checker’s or a physician’s ability to return a vignette’s main diagnosis at the top of, among the first 3 diseases of, or among the first 5 diseases of their differential diagnosis, respectively; (b) recall, which measures the percentage of relevant diseases that are returned in a symptom checker’s or a physician’s differential diagnosis; (c) precision, which measures the percentage of diseases in a symptom checker’s or a physician’s differential diagnosis that are relevant; (d) the F1-measure, which trades off recall and precision; and (e) Normalized Discounted Cumulative Gain (NDCG), which measures the ranking quality of a symptom checker’s or a physician’s differential diagnosis.
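For illustration only, the following minimal Python sketch shows one way these per-vignette metrics can be computed; it is not the study’s evaluation code, and the function names, relevance grades, and example diagnoses are hypothetical assumptions introduced here.

import math

def top_k_accuracy(differential, main_diagnosis, k):
    # M1/M3/M5: 1 if the vignette's main diagnosis appears among the top-k
    # diseases of the returned differential diagnosis, else 0.
    return 1.0 if main_diagnosis in differential[:k] else 0.0

def recall_precision_f1(differential, relevant_diseases):
    # Recall: fraction of the relevant diseases that appear in the differential.
    # Precision: fraction of the differential's diseases that are relevant.
    # F1: harmonic mean of recall and precision.
    hits = [d for d in differential if d in relevant_diseases]
    recall = len(hits) / len(relevant_diseases) if relevant_diseases else 0.0
    precision = len(hits) / len(differential) if differential else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1

def ndcg(differential, relevance_grades):
    # NDCG: discounted cumulative gain of the returned ranking divided by the
    # gain of an ideal ranking of the same length (graded relevance assumed).
    dcg = sum(relevance_grades.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(differential))
    ideal = sorted(relevance_grades.values(), reverse=True)[:len(differential)]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Hypothetical example for a single vignette (all values invented).
differential = ["disease B", "disease A", "disease C"]        # returned ranking
main_diagnosis = "disease A"                                   # vignette's main diagnosis
relevant_diseases = {"disease A", "disease B", "disease D"}    # gold-standard relevant set
relevance_grades = {"disease A": 3, "disease B": 2, "disease D": 1}

print(top_k_accuracy(differential, main_diagnosis, 1))   # M1 -> 0.0
print(top_k_accuracy(differential, main_diagnosis, 3))   # M3 -> 1.0
print(recall_precision_f1(differential, relevant_diseases))
print(ndcg(differential, relevance_grades))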
Results
Our AI-based symptom checker, Avey, significantly outperformed 5 popular symptom checkers, namely, Ada, WebMD, K Health, Buoy, and Babylon, by averages of 24.5%, 175.5%, 142.8%, 159.6%, and 2968.1% using M1; 22.4%, 114.5%, 123.8%, 118.2%, and 3392% using M3; 18.1%, 79.2%, 116.8%, 125%, and 3114.2% using M5; 25.2%, 65.6%, 109.4%, 154%, and 3545% using recall; 8.7%, 88.9%, 66.4%, 88.9%, and 2084% using F1-measure; and 21.2%, 93.4%, 113.3%, 136.4%, and 3091.6% using NDCG, respectively. Under precision, Ada outperformed Avey by an average of 0.9%, while Avey surpassed WebMD, K Health, Buoy, and Babylon by averages of 103.2%, 40.9%, 49.6%, and 1148.5%, respectively. In contrast to the symptom checkers, the physicians outperformed Avey by averages of 37.1% and 1.2% using precision and F1-measure, respectively, while Avey exceeded them by averages of 10.2%, 20.4%, 23.4%, 56.4%, and 25.1% using M1, M3, M5, recall, and NDCG, respectively. To facilitate the reproducibility of our study and support future related studies, we have made all our gold-standard vignettes publicly and freely available. Moreover, we have posted online all the results of the symptom checkers and physicians (i.e., 45 sets of experiments) to establish a standard of full transparency and enable verifying and cross-validating our results.
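A phrase such as "outperformed by an average of X%" plausibly denotes a relative improvement in the metrics' raw scores; the short sketch below illustrates that reading with invented numbers. It is an assumption introduced here, not the study's stated derivation.

# Hypothetical illustration only: the scores are invented and the formula is an
# assumed interpretation of "outperformed by X%", not the study's stated method.
avey_score = 0.60       # e.g., Avey's average score under some metric (invented)
other_score = 0.48      # e.g., a comparator's average score under the same metric (invented)
relative_improvement = (avey_score - other_score) / other_score * 100
print(relative_improvement)   # -> 25.0 (%)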
Conclusions
Avey substantially outperformed the considered symptom checkers. In addition, it compared favourably to the physicians: it underperformed them under some accuracy metrics (e.g., precision and F1-measure) but outperformed them under others (e.g., M1, M3, M5, recall, and NDCG). We will continue evolving Avey’s AI model. Furthermore, we will study its usability with real patients, examine how they respond to its suggestions, and measure its impact on their subsequent choices for care, among other directions.