Stroke has been a leading cause of death globally for the last several decades, and early recognition of the disease together with ongoing surveillance can reduce the death rate. However, the largest obstacle to performing advanced analytics with conventional approaches is the massive and growing volume of data from various sources, including patient histories, wearable sensor devices, and
medical data. Integrating machine learning with big data analytics (scalable machine learning) is a current technology that could have a large impact on the healthcare sector, particularly in the early diagnosis of this disease. To address this issue, this work presents a scalable stroke disease prediction model for a multinode distributed environment. The model combines big data analytics concepts with machine learning to handle extensive healthcare datasets, an aspect not seen in the prior literature on stroke disease detection. We implemented four scalable algorithms (logistic regression, random forest, gradient-boosted tree, and decision tree) using a dataset collected from a Medical Quality Improvement Consortium database. The dataset was analyzed on a cluster of one master node and two worker nodes. The model’s performance was assessed using metrics including the area under the curve (AUC) and the confusion matrix. With an accuracy of 94.3% and an AUC of 99%, random forest performed best in the experiments. The results also indicated that the main risk factor for stroke is diabetes, followed by hypertension. This study demonstrates the effectiveness of Spark’s scalable machine learning techniques for predicting stroke and identifying risk factors earlier. Physicians can use the findings as clinical decision aids for more accurate identification of stroke.
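The two evaluation measures named in the abstract, the confusion matrix and the AUC, can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: the function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical and chosen only to show how the quantities are defined.

```python
from typing import Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, where 1 means stroke."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case receives a
    higher score than a random negative case (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and model scores for eight patients.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1, 0.8, 0.3]
# Assumed decision threshold of 0.5 turns scores into class predictions.
preds = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_matrix(labels, preds))  # (tp, fp, fn, tn)
print(accuracy(labels, preds))
print(roc_auc(labels, scores))
```

The pairwise AUC above is quadratic in the number of examples, so it is suitable only for illustration. On a dataset of the scale the abstract describes, a distributed evaluator would be used instead.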