Data Mastercourse & Consulting
Diabetes Case Study Analysis
This analysis uses Python to examine different aspects of diabetes among the Pima Indians through exploratory data analysis (EDA).
​
CONTEXT:
Diabetes is one of the most common diseases worldwide, and the number of diabetic patients is growing over the years. The main cause of diabetes remains unknown, yet scientists believe that both genetic factors and environmental and lifestyle factors play a major role in diabetes.
A few years ago, research was done on a tribe in America called the Pima tribe (also known as the Pima Indians). In this tribe, it was found that women are prone to developing diabetes very early. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients were females at least 21 years old of Pima Indian heritage.
​
The dataset has the following information:
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history
- Age: Age in years
- Outcome: Class variable (0: a person is not diabetic or 1: a person is diabetic)
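The column descriptions and selection constraints above can be turned into a quick sanity check before any analysis. The following is a minimal sketch, not part of the original notebook: the helper name `check_schema` and the single sample row are hypothetical, and the row's values are made up purely for illustration.

```python
import pandas as pd

# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def check_schema(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    # Outcome is described as a 0/1 class variable.
    if "Outcome" in df.columns and not df["Outcome"].isin([0, 1]).all():
        problems.append("Outcome contains values other than 0 and 1")
    # All patients are described as being at least 21 years old.
    if "Age" in df.columns and (df["Age"] < 21).any():
        problems.append("Age contains values below 21")
    return problems


# Hypothetical sample row (illustrative values only, not taken from the dataset).
row = {
    "Pregnancies": 2, "Glucose": 120, "BloodPressure": 70, "SkinThickness": 25,
    "Insulin": 90, "BMI": 28.5, "DiabetesPedigreeFunction": 0.4, "Age": 33,
    "Outcome": 0,
}
good = pd.DataFrame([row])
bad = pd.DataFrame([{**row, "Age": 19, "Outcome": 2}]).drop(columns="BMI")

print(check_schema(good))
print(check_schema(bad))
```

On the real data, the same helper could be called on the DataFrame returned by `pd.read_csv("diabetes.csv")`, assuming the file uses the column names listed above.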
​
# import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Load the dataset and preview the first rows
dataset = pd.read_csv("diabetes.csv")
dataset.head()
dataset.tail(758)
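# Column totals for the eight predictor columns (everything except Outcome)
dataset.iloc[:, 0:8].sum()
# Summary statistics, transposed so that each variable is a row
dataset.describe().T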
sns.displot(dataset['BloodPressure'], kind = 'kde')
plt.show()
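# Pairwise relationships between three variables, coloured by Outcome
sns.pairplot(data=dataset, vars=['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction'], hue='Outcome')
plt.show()
# Scatter plot of Glucose against Insulin
plt.scatter(x='Glucose', y='Insulin', data=dataset)
plt.show()
# Boxplot of Age for all women in the dataset
plt.boxplot(dataset['Age'])
plt.title('Boxplot of Age')
plt.ylabel('Age')
plt.show()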
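# Histogram of Age for women with diabetes; the title and the Age/Frequency
# axis labels describe a histogram, so plt.hist is used instead of plt.boxplot
plt.hist(dataset[dataset['Outcome'] == 1]['Age'])
plt.title('Distribution of Age for Women Who Have Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()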
# Pairwise correlations between all columns
corr_matrix = dataset.corr()
corr_matrix
# Heatmap of the correlation matrix with the coefficients annotated
plt.figure(figsize=(8, 8))
sns.heatmap(corr_matrix, annot=True)
plt.show()
Observations: The heatmap above highlights five variables: age, pregnancies, skin thickness, BMI, and glucose. Age and pregnancies are correlated with each other (0.54), meaning they contain similar information, and the same holds for BMI and skin thickness (0.53). The variable most strongly correlated with diabetes is glucose level (0.49), followed by insulin level (0.40).
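Reading coefficients off a heatmap by eye makes it easy to mix up correlations with the outcome and correlations between predictors. As a minimal sketch of how the two can be separated, the helper below ranks each feature by its correlation with `Outcome` and lists the most strongly correlated feature pairs. The function name `summarize_correlations` and the toy DataFrame are hypothetical; the toy values are made up for illustration and are not taken from the diabetes dataset.

```python
from itertools import combinations

import pandas as pd


def summarize_correlations(df, target="Outcome", top_pairs=3):
    """Rank features by |correlation with target| and list the strongest feature pairs."""
    corr = df.corr()
    # Correlation of every feature with the target, strongest first.
    target_corr = corr[target].drop(target).sort_values(key=abs, ascending=False)
    # Correlation of every distinct feature pair (target excluded), strongest first.
    features = [col for col in corr.columns if col != target]
    pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(features, 2)})
    pairs = pairs.sort_values(key=abs, ascending=False)
    return target_corr, pairs.head(top_pairs)


# Toy data (made-up values): Pregnancies is an exact linear function of Age,
# and Glucose separates the two Outcome groups clearly.
toy = pd.DataFrame({
    "Glucose": [85, 89, 95, 100, 140, 150, 165, 180],
    "Age": [22, 30, 26, 34, 24, 36, 28, 32],
    "Pregnancies": [0, 4, 2, 6, 1, 7, 3, 5],
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

target_corr, strongest_pairs = summarize_correlations(toy)
print(target_corr)
print(strongest_pairs)
```

Applied to the full dataset, `summarize_correlations(dataset)` would give the correlations with `Outcome` and the inter-feature correlations as two separate tables, which makes it easier to check each figure quoted in the observations.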