
Diabetes Case Study Analysis

This analysis uses Python to explore different aspects of diabetes among the Pima Indians through exploratory data analysis (EDA).


CONTEXT:

Diabetes is one of the most common diseases worldwide, and the number of diabetic patients has been growing over the years. Its main cause remains unknown, but scientists believe that genetic factors as well as environmental and lifestyle factors play a major role.

A few years ago, research was carried out on a tribe in America called the Pima (also known as the Pima Indians). It found that women in this tribe are prone to developing diabetes at a very early age. Several constraints were placed on the selection of the records in this dataset from a larger database. In particular, all patients are females at least 21 years old of Pima Indian heritage.
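
The selection constraints can be illustrated with a small pandas filter. This is only a sketch: the `records` table, its `Sex` column, and the values in it are hypothetical, since the published dataset already contains only the selected patients.

```python
import pandas as pd

# Hypothetical records from a larger database (illustrative values only)
records = pd.DataFrame({
    "Age": [19, 21, 34, 50],
    "Sex": ["F", "F", "M", "F"],
})

# Keep only female patients who are at least 21 years old
selected = records[(records["Sex"] == "F") & (records["Age"] >= 21)]
print(selected)
```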


The dataset has the following information:

  • Pregnancies: Number of times pregnant

  • Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test

  • BloodPressure: Diastolic blood pressure (mm Hg)

  • SkinThickness: Triceps skin fold thickness (mm)

  • Insulin: 2-Hour serum insulin (mu U/ml)

  • BMI: Body mass index (weight in kg/(height in m)^2)

  • DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history.

  • Age: Age in years

  • Outcome: Class variable (0: a person is not diabetic or 1: a person is diabetic)
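
The BMI definition in the list above can be written as a short function. This is a minimal sketch for illustration; the function name `bmi` and the example weight and height are assumptions, not values from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical example: a person weighing 70 kg who is 1.75 m tall
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 1))
```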


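# Import the libraries used in this analysis
# import numpy as np  (not needed for the steps below)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Render plots inline in the Jupyter notebook
%matplotlib inline

# Load the dataset and preview the first five rows
dataset = pd.read_csv("diabetes.csv")
dataset.head()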

df.png

dataset.tail(758)

ds tail.png

dataset.iloc[:, 0:8].sum()

dataset.describe().T

summary stat.png

sns.displot(dataset['BloodPressure'], kind = 'kde')

plt.show()

plot graph.png

sns.pairplot(data = dataset, vars = ['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction'], hue = 'Outcome')
plt.show()

pairplot.png

plt.scatter(x = 'Glucose', y = 'Insulin', data = dataset)
plt.show()

scatter plot.png

plt.boxplot(dataset['Age'])
plt.title('Boxplot of Age')
plt.ylabel('Age')
plt.show()

boxplot.png

# A boxplot shows the age values on the y-axis, so only that axis is labelled
plt.boxplot(dataset[dataset['Outcome'] == 1]['Age'])
plt.title('Distribution of Age for Women Who Have Diabetes')
plt.ylabel('Age')
plt.show()

boxplot age freq.png

corr_matrix = dataset.corr()
corr_matrix

corr matrix.png

plt.figure(figsize = (8, 8))
sns.heatmap(corr_matrix, annot = True)
plt.show()

heatmap python.png

Observations: The heatmap above highlights five variables with notable correlations: age, pregnancies, skin thickness, BMI, and glucose. Age and pregnancies are correlated with each other (0.54), which suggests they carry similar information. The same applies to BMI and skin thickness (0.53). The variable most strongly correlated with diabetes is glucose level (0.49), followed by insulin level (0.40).
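
Reading these values off a heatmap can be error-prone, so one option is to rank the features by their correlation with `Outcome` directly. The sketch below assumes a helper named `rank_by_outcome_correlation` and uses a small made-up table `toy`; in the analysis itself the same function could be applied to `dataset`.

```python
import pandas as pd

# Hypothetical toy data (not taken from diabetes.csv)
toy = pd.DataFrame({
    "Glucose": [85, 90, 120, 150, 160, 100],
    "Age": [22, 25, 40, 50, 45, 30],
    "Outcome": [0, 0, 1, 1, 1, 0],
})


def rank_by_outcome_correlation(df, target="Outcome"):
    """Return each feature's Pearson correlation with the target,
    sorted from strongest to weakest by absolute value."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranking = rank_by_outcome_correlation(toy)
print(ranking)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a purely descending sort would push to the bottom.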
