Stroke Prediction Dataset Analysis and Visualization

Data Set Characteristics:

Name:

Stroke Prediction Dataset from Kaggle: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset

Potential Goal:

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient.

:Features: 12 clinical features

:Number of Instances: 5110

:Number of Attributes: 12 numeric/categorical predictive. 

:Attribute Information (in order):

    1) id: unique identifier
    2) gender: "Male", "Female" or "Other"
    3) age: age of the patient
    4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
    5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
    6) ever_married: "No" or "Yes"
    7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
    8) Residence_type: "Rural" or "Urban"
    9) avg_glucose_level: average glucose level in blood
    10) bmi: body mass index
    11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
    12) stroke: 1 if the patient had a stroke or 0 if not

    *Note: "Unknown" in smoking_status means that the information is unavailable for this patient.


Import Libraries

Constants

Load healthcare stroke dataset
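
A minimal sketch of what these three cells might contain, assuming the CSV from the Kaggle page above has been downloaded locally; the file name and the RANDOM_STATE value are assumptions, not taken from the original notebook:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Constants (assumed values; adjust to your local setup)
DATA_PATH = "healthcare-dataset-stroke-data.csv"  # CSV downloaded from Kaggle
RANDOM_STATE = 42                                 # reproducible sampling/splits

# Load the healthcare stroke dataset
df = pd.read_csv(DATA_PATH)
print(df.shape)  # expected: (5110, 12)
```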

Data preprocessing:

Remove rows with missing values, and remove "Other" from gender, since there is only one row with gender = "Other".

Add a label column with the value "yes" if stroke = 1 and "no" if stroke = 0.
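
A sketch of these preprocessing steps, using the column names from the attribute list above:

```python
# Drop rows with missing values (bmi contains "N/A" entries,
# which pandas parses as NaN by default)
df = df.dropna()

# Drop the single observation with gender == "Other"
df = df[df["gender"] != "Other"]

# Add a readable label column derived from the stroke flag
df["label"] = np.where(df["stroke"] == 1, "yes", "no")
```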

Summarize data
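
For example, pandas' built-in summaries cover this step:

```python
print(df.describe())               # statistics for the numeric features
print(df["label"].value_counts())  # class balance: stroke yes vs. no
```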

Correlation matrix:

We will focus on the following subset of three features (a heatmap sketch follows the list):
    1. Age
    2. Average glucose
    3. BMI
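
A minimal sketch of the correlation matrix over this three-feature subset, drawn with seaborn's heatmap:

```python
# Pearson correlations between the three numeric features
corr = df[["age", "avg_glucose_level", "bmi"]].corr()

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix: age, avg_glucose_level, bmi")
plt.show()
```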

Inferences

Pairwise plots for stroke = 0 and stroke = 1.
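
A sketch using seaborn's pairplot, with the hue channel separating the two stroke classes:

```python
# Pairwise scatter plots of the three features, colored by stroke outcome
sns.pairplot(
    df,
    vars=["age", "avg_glucose_level", "bmi"],
    hue="stroke",        # stroke = 0 vs. stroke = 1
    diag_kind="kde",     # density curves on the diagonal
)
plt.show()
```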

Analyze features

The following 7 features are taken into account for the analysis and a feature subset is created from them (subset creation is sketched together with the encoding step below):

    1) gender: "Male", "Female" or "Other"
    2) age: age of the patient
    3) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
    4) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
    5) avg_glucose_level: average glucose level in blood
    6) bmi: body mass index
    7) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

Encode categorical data

The stroke dataset has both numeric and categorical features, so the categorical features need to be converted into numerical ones using a method like one-hot encoding.
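
A sketch of the subset creation and encoding; using pandas get_dummies here is an assumed implementation of one-hot encoding, and drop_first is an assumed choice to avoid redundant dummy columns:

```python
FEATURES = ["gender", "age", "hypertension", "heart_disease",
            "avg_glucose_level", "bmi", "smoking_status"]

X = df[FEATURES]
y = df["stroke"]

# One-hot encode the two categorical columns in the subset
X = pd.get_dummies(X, columns=["gender", "smoking_status"], drop_first=True)
```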

Balance dataset

The stroke prediction dataset is highly imbalanced: there are only 209 observations with stroke = 1 and 4700 observations with stroke = 0. A balanced sample dataset is created by combining all 209 observations with stroke = 1 with a 10% random sample of the 4700 observations with stroke = 0. The resulting sample dataset is then split into training and test sets. Different classifiers are trained on the training set, applied to the sample test set, and then to the whole dataset excluding the training set.
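
A sketch of this sampling scheme; the 70/30 split ratio is an assumption:

```python
# All stroke-positive rows plus a 10% random sample of stroke-negative rows
pos = df[df["stroke"] == 1]                                    # 209 rows
neg = df[df["stroke"] == 0].sample(frac=0.10, random_state=RANDOM_STATE)
sample = pd.concat([pos, neg])

X_sample = pd.get_dummies(sample[FEATURES],
                          columns=["gender", "smoking_status"],
                          drop_first=True)
y_sample = sample["stroke"]

X_train, X_test, y_train, y_test = train_test_split(
    X_sample, y_sample, test_size=0.3, random_state=RANDOM_STATE)
```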

Evaluation metrics

Provided by Nabeel Shaikh (github.com/nshaikh99)

Accuracy tells us how often our model issues a correct prediction. This is accomplished by dividing the sum of the number of true positives and the number of true negatives by the number of total predictions. We value maximizing accuracy when we want to maximize the number of true positives and the number of true negatives.

Recall (TPR or sensitivity) is the fraction of positive labels predicted correctly. It tells us how often our model doesn't issue a false negative prediction. This is accomplished by dividing the number of true positives by the sum of the number of true positives and the number of false negatives. We value maximizing recall when we want to minimize the number of false negatives. For example, we would want to maximize recall when predicting if someone has a life-threatening disease.

Precision tells us how often our model doesn't issue a false positive prediction. This is accomplished by dividing the number of true positives by the sum of the number of true positives and the number of false positives. We value maximizing precision when we want to minimize the number of false positives. For example, we would want to maximize precision when predicting if someone is who they say they are when opening a credit card.

F1 score tells us how often our model doesn't issue a false positive or false negative prediction. This is accomplished by dividing two times recall times precision by the sum of recall and precision. We value maximizing F1 score when we want to minimize both the number of false positives and the number of false negatives. For example, we would want to maximize F1 score when predicting if someone committed a crime.
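
All four metrics map directly onto scikit-learn helpers; a small reporting function in that spirit (the function name is our own, used in the sketches below):

```python
from sklearn.metrics import (accuracy_score, recall_score,
                             precision_score, f1_score)

def report(y_true, y_pred):
    """Print the four evaluation metrics described above."""
    print(f"accuracy : {accuracy_score(y_true, y_pred):.3f}")   # (TP+TN)/total
    print(f"recall   : {recall_score(y_true, y_pred):.3f}")     # TP/(TP+FN)
    print(f"precision: {precision_score(y_true, y_pred):.3f}")  # TP/(TP+FP)
    print(f"f1 score : {f1_score(y_true, y_pred):.3f}")         # 2PR/(P+R)
```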

KNN Classifier

Take k = 3, 5, 7, 9, 11. Split the dataset into training and test sets. For each k, train the k-NN classifier on Xtrain and compute its accuracy on Xtest.
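
A sketch of the k sweep described above:

```python
from sklearn.neighbors import KNeighborsClassifier

for k in (3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    acc = knn.score(X_test, y_test)   # mean accuracy on the test set
    print(f"k = {k:2d}: accuracy = {acc:.3f}")
```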

Logistic Regression Classifier

Decision Tree Classifier

Random Forest Classifier
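
A compact sketch covering these three classifiers; the hyperparameters shown (max_iter, random_state) are assumptions, and everything else is scikit-learn defaults:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=RANDOM_STATE),
    "Random Forest": RandomForestClassifier(random_state=RANDOM_STATE),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name)
    report(y_test, clf.predict(X_test))  # metrics helper defined above
```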

Summary of performance measures using the different classifiers on the whole dataset with the selected feature subset:

Although the K-NN, Logistic Regression, and Random Forest classifiers provide the highest overall prediction accuracies (82.5% and 85.7%), Logistic Regression gives the highest recall/TPR = 0.8 along with TNR = 0.8. Here TPR represents the accuracy of predicting patients who had a stroke, and TNR the accuracy of predicting patients who did not. We want to maximize recall when predicting whether someone could have a stroke; recall/TPR is especially important given the highly imbalanced dataset, in which only around 4% of observations are patients who had a stroke. That is why the Logistic Regression classifier performs best among the four classifiers for stroke prediction.

Feature selection: dropping the gender feature

The gender feature is dropped from the dataset to see its impact on prediction accuracy. Classifiers are trained on the "truncated" training set using the remaining features and are used to predict labels on the test set.
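
A sketch of the truncated training set, assuming gender was one-hot encoded into columns prefixed with "gender":

```python
# Drop every one-hot column derived from gender, then retrain
gender_cols = [c for c in X_train.columns if c.startswith("gender")]
X_train_trunc = X_train.drop(columns=gender_cols)
X_test_trunc = X_test.drop(columns=gender_cols)

for name, clf in classifiers.items():
    clf.fit(X_train_trunc, y_train)
    print(name)
    report(y_test, clf.predict(X_test_trunc))
```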

Accuracy of prediction when the age feature is removed from the dataset

Removing the age feature caused the largest loss of TPR (recall). Hence, the age feature plays a significant role in stroke prediction.

Explaining predictions with the lime package

The LIME algorithm is used to explain the predictions of the Logistic Regression classifier. LIME explains a prediction by approximating it locally with an interpretable model.

Analyze features

10 clinical features are used:

1) gender: "Male", "Female" or "Other"
2) age: age of the patient
3) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
4) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
5) ever_married: "No" or "Yes"
6) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
7) Residence_type: "Rural" or "Urban"
8) avg_glucose_level: average glucose level in blood
9) bmi: body mass index
10) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

Data preprocessing

Create a copy of the original dataframe and remove the id feature and the stroke label from the dataset.

Encode categorical data

Balance dataset

As in the earlier section, the dataset is balanced by combining all 209 observations with stroke = 1 with a 10% random sample of the 4700 observations with stroke = 0, and the resulting sample is split into training and test sets.

Logistic Regression

LIME explanation of prediction of logistic regression classifier

The LIME explanation provides intuition into the inner workings of machine learning algorithms: which features are being used to arrive at a prediction. It assigns a weight to each feature based on its contribution to the prediction of a label within the local structure of the data.
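
A minimal sketch of this workflow with the lime package, assuming the encoded, balanced train/test split constructed above and a freshly fitted Logistic Regression model:

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train.values.astype(float),   # dummies may be bool in newer pandas
    feature_names=list(X_train.columns),
    class_names=["no stroke", "stroke"],
    mode="classification",
)

i = 0  # arbitrary test instance to explain
exp = explainer.explain_instance(
    X_test.values[i].astype(float), log_reg.predict_proba, num_features=10)
print(exp.as_list())       # (feature condition, weight) pairs
# exp.show_in_notebook()   # interactive view inside Jupyter
```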