Stroke Prediction Analysis

Machine Learning Models for Healthcare Risk Assessment

Anit Mathew | Master's Thesis Research

Dataset Size

4,981

Patient Records Analyzed

Models Tested

4

ML Algorithms Evaluated

Best Accuracy

95%

Original Data Performance

Features

10

Clinical Variables

Research Overview

Objective

🎯 Primary Goal

Develop and evaluate machine learning models to predict stroke risk using clinical and demographic data, enabling early intervention and improved patient outcomes.

📊 Clinical Variables

Age & Gender
Hypertension Status
Heart Disease History
Marital Status
Work Type & Residence
Average Glucose Level
Body Mass Index (BMI)
Smoking Status

Methodology

Process

Step 01

Data Collection & Preprocessing

Collected 4,981 patient records with 10 clinical features. Performed data cleaning, handled missing values, and encoded categorical variables for model training.
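The cleaning and encoding step above can be sketched without any dependencies. This is only an illustration: the column names (`bmi`, `smoking_status`), the median imputation, and the one-hot encoding are assumptions, since the source does not state which imputation or encoding scheme was used.

```python
from statistics import median


def impute_median(rows, column):
    """Replace missing (None) values in `column` with the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Expand a categorical column into one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Toy records, invented for illustration only (not taken from the study data).
records = [
    {"age": 67, "bmi": 36.6, "smoking_status": "formerly"},
    {"age": 45, "bmi": None, "smoking_status": "never"},
    {"age": 80, "bmi": 32.5, "smoking_status": "smokes"},
]

clean = one_hot(impute_median(records, "bmi"), "smoking_status")
```

After this step every row is fully numeric, which is what the models in the later steps expect as input.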

Step 02

Oversampling Strategy

Applied oversampling to address the class imbalance between stroke and non-stroke cases. Created balanced datasets for comprehensive model evaluation across age groups (65 and older vs. under 65).
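One simple way to balance the classes is random oversampling, sketched below. The source does not say which oversampling technique was applied, so treat this as an assumed variant; the function name `random_oversample` and the toy data are hypothetical.

```python
import random


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        # Draw with replacement from the existing rows of this class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data: 8 non-stroke rows (label 0) and 2 stroke rows (label 1).
X = [[age] for age in (34, 41, 52, 58, 29, 47, 63, 55, 71, 79)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X, y)
```

A common precaution is to oversample only the training portion of each split, so that duplicated rows cannot appear in both training and test data.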

Step 03

Model Development

Implemented four machine learning algorithms: Logistic Regression, Decision Trees, Random Forest, and Deep Neural Networks (DNN). Each model was trained on both the original and the oversampled data.

Step 04

Cross-Validation & Evaluation

Performed 5-fold cross-validation for robust evaluation. Assessed models using accuracy, precision, recall, and F1-score metrics across different data configurations.
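The 5-fold split can be illustrated with a small dependency-free sketch. Stratifying by class is an assumption here (the source only says "5-fold"), chosen because it keeps the rare stroke cases spread evenly across folds; `stratified_kfold` is a hypothetical helper name.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions similar in every fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Toy labels: 90 non-stroke (0) and 10 stroke (1) cases.
labels = [0] * 90 + [1] * 10
splits = list(stratified_kfold(labels, k=5))
```

Each of the five test folds in this toy example holds 20 rows, 2 of them positive, so every fold evaluates the model on some stroke cases.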

Model Performance

Results

Logistic Regression (Best Overall)

Original Data Accuracy: 95.0%
Oversampled Accuracy: 68.0%
Age ≥65 Accuracy: 56.0%
Age <65 Accuracy: 70.0%

Decision Tree (Moderate)

Original Data Accuracy: 94.5%
Oversampled Accuracy: 66.8%
Clear Decision Rules: ✓
Interpretability: High

Random Forest (Balanced)

Original Data Accuracy: 94.8%
Oversampled Accuracy: 67.5%
Ensemble Performance: Stable
Feature Importance: ✓

Deep Neural Network (Advanced)

Architecture: 64-32-1
Activation: ReLU + Sigmoid
Optimizer: Adam
Cross-Validation: 5-Fold
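To make the 64-32-1 architecture concrete, here is a minimal forward pass in plain Python with randomly initialized weights. It is a sketch under stated assumptions: the input width of 10 (one input per clinical feature, before any one-hot expansion) and the initialization range are not given in the source, and training with Adam is omitted entirely.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation):
    """Apply one fully connected layer followed by an activation function."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
n_features = 10  # assumption: one input per clinical feature
layers = [init_layer(n_features, 64, rng),
          init_layer(64, 32, rng),
          init_layer(32, 1, rng)]


def predict_proba(x):
    """Forward pass: 64 ReLU units, 32 ReLU units, then a single sigmoid output."""
    h1 = dense(x, layers[0], relu)
    h2 = dense(h1, layers[1], relu)
    return dense(h2, layers[2], sigmoid)[0]


# Count trainable parameters: weights plus biases in each layer.
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
p = predict_proba([0.5] * n_features)
```

Under the 10-input assumption the network has 2,817 trainable parameters, and the sigmoid output can be read as an estimated stroke probability between 0 and 1.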

Key Predictive Features

Analysis

Age: High
Average Glucose Level: High
Hypertension: High
Heart Disease: Moderate
BMI: Moderate
Smoking Status: Low

Research Insights

Findings

📈 Age-Dependent Risk

Stroke risk increases significantly with age. Models showed better predictive performance for patients under 65 (Logistic Regression: 70% accuracy versus 56% for patients 65 and older), suggesting different risk patterns across age groups.

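⚖️ Class Imbalance Impact

On the original data, accuracy reached 95% but stroke detection was poor: on an imbalanced dataset a model can score well overall while missing most minority-class cases. Oversampling improved detection of stroke cases but reduced overall accuracy to 68%, highlighting the trade-off between overall accuracy and minority-class detection.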
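A short worked example shows why high accuracy and poor stroke detection can coexist. The 5% positive rate below is an assumed toy value, not the study's actual class distribution, and the example does not claim that the study's models behaved this way; it only shows what a trivial "always predict no stroke" baseline would score.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels with 5% positives (assumed value for illustration).
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100

baseline = binary_metrics(y_true, always_negative)
```

The baseline reaches 95% accuracy while its recall and F1 for the stroke class are both zero, which is why recall and F1 matter more than accuracy for the minority class.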

🎯 Clinical Markers

Hypertension and glucose levels emerged as strong predictors. Combined with age, these features form the foundation of effective stroke risk assessment.

🧠 Model Interpretability

Logistic Regression provided the best balance of accuracy and interpretability, making it well suited to clinical deployment, where explainability is crucial.

🔬 Ensemble Methods

Random Forest showed stable performance across datasets, demonstrating the value of ensemble approaches for robust medical predictions.

💡 Deep Learning Potential

DNN models captured complex patterns but required careful tuning. Future work with larger datasets could unlock their full potential.

Research Conclusion

This research demonstrates that machine learning models can predict stroke risk from clinical data. Logistic Regression achieved the highest accuracy on the original data (95%) and remains interpretable, making it a candidate for clinical screening, although stroke detection on the original data was poor. The study highlights age, glucose levels, and hypertension as the key predictive factors. Oversampling improved minority-class detection but lowered overall accuracy, exposing the inherent challenge of balancing sensitivity and specificity in imbalanced medical datasets. Future work should focus on larger datasets, real-time deployment, and integration with electronic health records for practical clinical impact.