Stroke Prediction Analysis

Machine Learning Models for Healthcare Risk Assessment

Anit Mathew | Master's Thesis Research

Dataset Size: 4,981 (Patient Records Analyzed)
Models Tested: 4 (ML Algorithms Evaluated)
Best Accuracy: 95% (Original Data Performance)
Features: 10 (Clinical Variables)

Research Overview

Objective

🎯 Primary Goal

Develop and evaluate machine learning models to predict stroke risk using clinical and demographic data, enabling early intervention and improved patient outcomes.

📊 Clinical Variables

  • Age & Gender
  • Hypertension Status
  • Heart Disease History
  • Marital Status
  • Work Type & Residence
  • Average Glucose Level
  • Body Mass Index (BMI)
  • Smoking Status

Methodology

Process
Step 01: Data Collection & Preprocessing
Collected 4,981 patient records with 10 clinical features. Performed data cleaning, handled missing values, and encoded categorical variables for model training.
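The preprocessing step can be illustrated with a minimal, dependency-free sketch. The source does not say how missing values were handled or which encoding was used, so median imputation and one-hot encoding are assumptions here, and the column names and sample records are hypothetical.

```python
from statistics import median

# Hypothetical records; column names and values are illustrative only.
records = [
    {"age": 67, "gender": "Male", "bmi": 36.6, "smoking_status": "formerly smoked", "stroke": 1},
    {"age": 45, "gender": "Female", "bmi": None, "smoking_status": "never smoked", "stroke": 0},
    {"age": 52, "gender": "Female", "bmi": 28.4, "smoking_status": "smokes", "stroke": 0},
]


def impute_median(rows, column):
    """Return copies of rows with missing values in `column` replaced by the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Return copies of rows with `column` replaced by one 0/1 indicator per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}={c}"] = int(r[column] == c)
        encoded.append(new_row)
    return encoded


# Assumed pipeline: impute BMI first, then encode each categorical column.
clean = impute_median(records, "bmi")
for col in ("gender", "smoking_status"):
    clean = one_hot(clean, col)
```

In a real pipeline the imputation value and the category list would be computed on the training split only and then reused on the test split.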
Step 02: Oversampling Strategy
Applied oversampling to address the class imbalance between stroke and non-stroke cases. Created balanced datasets so that models could be evaluated both overall and by age group (65 and older, under 65).
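The source does not name the oversampling technique. As one possibility, the sketch below shows plain random oversampling, which duplicates minority-class rows until both classes are the same size. The function name `random_oversample`, the `stroke` label key, and the toy data are hypothetical.

```python
import random


def random_oversample(rows, label_key="stroke", seed=0):
    """Duplicate minority-class rows at random until both classes are the same size.

    Assumes binary 0/1 labels and at least one row in each class.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    # Sample with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 2 stroke cases among 20 patients.
toy = [{"id": i, "stroke": int(i < 2)} for i in range(20)]
balanced = random_oversample(toy)
```

Oversampling is normally applied to the training portion only. Duplicating rows before the train/test split lets copies of the same patient appear on both sides and inflates the measured performance.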
Step 03: Model Development
Implemented four machine learning algorithms: Logistic Regression, Decision Trees, Random Forest, and Deep Neural Networks (DNN). Each model was trained on both the original and the oversampled data.
Step 04: Cross-Validation & Evaluation
Performed 5-fold cross-validation for robust evaluation. Assessed models using accuracy, precision, recall, and F1-score metrics across different data configurations.
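A 5-fold split can be sketched as follows. The source says only "5-fold", so stratification (keeping the stroke ratio roughly equal in every fold) is an assumption, chosen because it is a common choice for imbalanced labels. The function name and the toy labels are hypothetical.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep roughly the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin across folds
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Hypothetical labels: 10 stroke cases among 100 patients.
labels = [1] * 10 + [0] * 90
splits = list(stratified_kfold(labels))
```

Each patient appears in exactly one test fold, so averaging a metric over the five folds uses every record once for evaluation.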

Model Performance

Results
Logistic Regression (Best Overall)
  • Original Data Accuracy: 95.0%
  • Oversampled Accuracy: 68.0%
  • Age ≥65 Accuracy: 56.0%
  • Age <65 Accuracy: 70.0%

Decision Tree (Moderate)
  • Original Data Accuracy: 94.5%
  • Oversampled Accuracy: 66.8%
  • Clear Decision Rules
  • Interpretability: High

Random Forest (Balanced)
  • Original Data Accuracy: 94.8%
  • Oversampled Accuracy: 67.5%
  • Ensemble Performance: Stable
  • Feature Importance

Deep Neural Network (Advanced)
  • Architecture: 64-32-1
  • Activation: ReLU + Sigmoid
  • Optimizer: Adam
  • Cross-Validation: 5-Fold
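To make the 64-32-1 architecture concrete, here is a minimal forward-pass sketch in plain Python. It assumes ReLU on the two hidden layers and a sigmoid on the single output unit, which is the usual reading of "ReLU + Sigmoid" for binary classification but is not spelled out in the source. The input width of 10, the random weights, and all function names are illustrative. Training with Adam is not shown.

```python
import math
import random


def dense(inputs, weights, biases):
    """Affine layer: one output per (weight row, bias) pair."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases; a trained model would learn these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_features, rng):
    """Layer sizes n_features -> 64 -> 32 -> 1."""
    sizes = [n_features, 64, 32, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict_proba(network, features):
    """Return the network's output in (0, 1), read as a stroke probability."""
    (w1, b1), (w2, b2), (w3, b3) = network
    hidden1 = relu(dense(features, w1, b1))
    hidden2 = relu(dense(hidden1, w2, b2))
    return sigmoid(dense(hidden2, w3, b3)[0])


def count_parameters(network):
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)


rng = random.Random(0)
network = build_network(n_features=10, rng=rng)  # 10 inputs is an assumption
probability = predict_proba(network, [0.5] * 10)
```

With 10 inputs this network has 2,817 trainable parameters (704 + 2,080 + 33). One-hot encoding the categorical variables would widen the input layer and raise that count.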

Key Predictive Features

Analysis
  • Age: High
  • Average Glucose Level: High
  • Hypertension: High
  • Heart Disease: Moderate
  • BMI: Moderate
  • Smoking Status: Low

Research Insights

Findings
📈
Age-Dependent Risk
Stroke risk increases significantly with age. Models predicted better for patients under 65 than for those 65 and older (70.0% vs. 56.0% accuracy for Logistic Regression), suggesting different risk patterns across age groups.
⚖️
Class Imbalance Impact
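On the original data, models reached 95% accuracy but detected stroke cases poorly, because accuracy is dominated by the majority (non-stroke) class. Oversampling improved balanced prediction but reduced overall accuracy to 68%, highlighting the trade-off between overall accuracy and minority-class detection.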
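A small worked example shows why high accuracy and poor stroke detection can coexist. The 5% prevalence and the always-negative classifier below are hypothetical and chosen only for illustration. The metric definitions are the standard ones.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary 0/1 labels (1 = stroke)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical cohort: 5 stroke cases among 100 patients.
y_true = [1] * 5 + [0] * 95
# A classifier that never predicts stroke.
always_negative = [0] * 100
baseline = classification_metrics(y_true, always_negative)
```

Here `baseline` reports an accuracy of 0.95 with a recall of 0.0: the classifier misses every stroke case yet still looks accurate. This is why recall and F1 are reported alongside accuracy.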
🎯
Clinical Markers
Hypertension and glucose levels emerged as strong predictors. Combined with age, these features form the foundation of effective stroke risk assessment.
🧠
Model Interpretability
Logistic Regression provided the best balance of accuracy and interpretability, making it well suited to clinical deployment, where explainability is crucial.
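The interpretability of Logistic Regression comes from the form of the model itself. The notation below is generic and not taken from the thesis: $x_j$ stands for the $j$-th clinical variable and $\beta_j$ for its fitted coefficient.

```latex
\[
P(\text{stroke} = 1 \mid x) = \sigma\Big(\beta_0 + \sum_{j} \beta_j x_j\Big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
\[
\log \frac{P(\text{stroke} = 1 \mid x)}{1 - P(\text{stroke} = 1 \mid x)}
  = \beta_0 + \sum_{j} \beta_j x_j
\]
```

Because the log-odds are linear in the inputs, a one-unit increase in $x_j$ multiplies the odds of stroke by $e^{\beta_j}$ when the other variables are held fixed. This gives each coefficient a direct clinical reading.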
🔬
Ensemble Methods
Random Forest showed stable performance across datasets, demonstrating the value of ensemble approaches for robust medical predictions.
💡
Deep Learning Potential
DNN models captured complex patterns but required careful tuning. Future work with larger datasets could unlock their full potential.

Research Conclusion

This research shows that machine learning models can predict stroke risk from clinical and demographic data. Logistic Regression achieved 95% accuracy on the original data and offered the best balance of accuracy and interpretability, making it a candidate for clinical screening. Age, glucose level, and hypertension were the key predictive factors. Oversampling improved minority-class detection but lowered overall accuracy to 68%, which reflects the inherent difficulty of balancing sensitivity and specificity on imbalanced medical datasets. Future work should focus on larger datasets, real-time deployment, and integration with electronic health records for practical clinical impact.