Analysis; A Chronic Clinic Dataset I
I was given a dataset to work on. The task involved doing an analysis and dashboard for the organization. These are my findings and although not exhaustive I welcome suggestions and opinions on how it might have been improved.
The Data Set
The Data set had 4 Columns namely Patient ID, Age, Gender and Chronic Conditions. The ages of the Patients ranged from 2 years old to 80 years old. The Chronic Conditions were COPD, Asthma, Diabetes, Hypertension and Pregnant (being pregnant is not a medical condition). The Data set had 500 male and female patients.
Data Clean Up & Preparation
I did a slight clean up in Excel first removed duplicates (found no duplicated patient Ids) and looked for any missing data like age and gender and found none missing. I pulled the cleaned Excel dataset into Power Bi and used the transform data option to do another clean up. The Data set had 500 patients and some of them had null values in the Chronic Conditions column . I went ahead and removed the null values and was left with only 267 patients from ages 2 to 80 who had some sort of Chronic Condition. I decided to leave the ones who were pregnant to see if I can pull out any analysis from them. I also didn't change the datatype as they had the right datatype in the Column Heading numeric for Patient Id and Ages and text for Gender and Chronic Conditions respectively. There also was no need to delete any columns as I will be using all the attributes for analysis.
Analysis Tools
I will be using Power BI mainly together with Excel to do some visualizations and analyze the data. In addition I converted the Excel file into a database using the SQLIzer and named it flu_csv.
CREATE TABLE IF NOT EXISTS flu_csv (
Patient_ID INTEGER,
Age INTEGER,
Gender TEXT,
Chronic_Conditions TEXT
);
I then uploaded the data to https://sqliteonline.com/ to run SQL queries on the dataset and do further analysis on the dataset.
Analysis and Findings : Power BI & Excel
Table 1 below shows the Total Number of Patients and Different Chronic Conditions. Using Power BI I was able to pull the table below depicting total number of patients with different Chronic Conditions.
I also wanted to find out the average age of the patients by gender and found an interesting insight that the average age of the patients was 24 years old for females and 41 years old for males. (But this was actually expected as the pregnant women comprised almost a third of the data).
To further analyze the dataset I pulled another table this time showing Chronic Conditions by gender .From Chart 1 below 136 are male while 131 are Female with 82 being pregnant .This will further skew the results as (131– 82) that is 49 do have real Chronic Conditions (Asthma, COPD, Diabetes and Hypertension).
A further analysis using a Tree Map revealed the Count of Patients by Chronic Conditions which actually gives the same result as Table 1 above. A similar visual was done using a Pie chart and this time I looked at the Percentages instead of absolute numbers.
From the pie Chart below we can see that 31% the largest percentage comprised of pregnant women and while 13% comprised the smallest number with a COPD medical condition.
Chart 4 below give us another view of Chronic Conditions by Gender and again the glaring number is the number of pregnant women who dominate the dataset.
I also decided to take a look at Chronic conditions at various ages and came up with Chart 5 below
From Chart 5 above we see that 30 patients aged 5 had the highest number of Chronic conditions followed closely by those aged 65. The ones with the lowest chronic conditions were the ones who were at age 26 and 42 respectively at 3 each. Although we had 8 who were 80 years and 7 who were 70 who only had Chronic Condition these actually represented 100 % of those in the sample. In otherwise all the 8 and 7 respectively at 80 and 70 years old had a Chronic Conditions which happens to be COPD.
Analysis & Finding : SQL
I run the Select * Command to have a look at the table and list the columns
SELECT * FROM flu_csv;
To count total number of patients and I got a total number of 500 patients in the database.
SELECT COUNT(*)
FROM flu_csv;
to look at age and chronic condition
SELECT age, chronic_conditions from flu_csv;
I wanted to find out what chronic condition those aged 80 had.
SELECT age,chronic_conditions from flu_csv
where age = 80;
We found out that those aged 80 had only COPD as their chronic condition.
I ran another query to find out chronic condition between those aged 5 and 50 and gender chronic conditions.
SELECT age, chronic_conditions from flu_csv
where age BETWEEN 5 and 50
ORDER by age DESC;
SELECT age, gender, chronic_conditions from flu_csv
where age BETWEEN 5 and 50
ORDER by gender;
SELECT age, gender, chronic_conditions from flu_csv
where age BETWEEN 5 and 50
ORDER by gender DESC;
SELECT age, gender, chronic_conditions from flu_csv
where gender = "F";
SELECT patient_id, chronic_conditions FROM flu_csv
WHERE chronic_conditions is "Pregnant";
SELECT patient_id, age ,chronic_conditions FROM flu_csv
WHERE chronic_conditions IN ("Pregnant", "COPD" , "Asthma", "Diabetes" );
SELECT patient_id, age ,chronic_conditions FROM flu_csv
WHERE chronic_conditions is NOT "NULL";
SELECT patient_id ,age, chronic_conditions FROM flu_csv
where chronic_conditions IS "Pregnant";
SELECT MIN(age)
FROM flu_csv where chronic_conditions is "Pregnant";
SELECT MAX (age)
FROM flu_csv where chronic_conditions is "Pregnant";
Using the code above I was able to find out that the youngest Pregnant patient is 22 years old and the oldest Pregnant patient is 39 years old.
Calculating the average age of patients below, I got a figure of 25.5 this differed slightly from the figure in the Power BI analysis above as we included NULL values when the data set was converted into a database or a csv file.
SELECT AVG (age)
FROM flu_csv where gender is "F" OR "M";
I used this query to count the number of chronic conditions for patients.
SELECT COUNT(chronic_conditions) AS chronic_patients
FROM flu_csv
WHERE chronic_conditions = 'Pregnant';
82 for Pregnant patients
SELECT COUNT(chronic_conditions) AS chronic_patients
FROM flu_csv
WHERE chronic_conditions = "Hypertension";
49 for Hypertension patients
SELECT COUNT(chronic_conditions) AS chronic_patients
FROM flu_csv
WHERE chronic_conditions = 'COPD';
36 for COPD patients
SELECT COUNT(chronic_conditions) AS chronic_patients
FROM flu_csv
WHERE chronic_conditions = 'Asthma';
46 for Asthma patients
SELECT COUNT(chronic_conditions) AS chronic_patients
FROM flu_csv
WHERE chronic_conditions = 'Diabetes';
54 for Diabetic Patients
SELECT COUNT(chronic_conditions) AS chronic_patients
FROM flu_csv
WHERE chronic_conditions IS NOT NULL;
267 for all Chronic Condition patients
This gives us the above numbers that I got when I did the clean up using Power BI and removed all the Null Values in the Chronic Conditions Column.
I wanted to find out how many patients between the ages of 5 and 50 had a Chronic condition by using the following sql query.
SELECT COUNT(chronic_conditions) AS chronic_patients
FROM flu_csv
WHERE age BETWEEN 5 and 50 ;
177 patients had a Chronic Condition aged between 5 and 50 years old.
SELECT COUNT(chronic_conditions) AS chronic_patients
FROM flu_csv
WHERE age BETWEEN 2 and 50 ;
212 patients have a Chronic Condition aged between 2 and 50 years, we also know that the youngest pregnant girl is 22 and the oldest is 39 . Can we find out the number of patients who have a Chronic Condition that is anything other than being pregnant ?
SELECT COUNT(chronic_conditions) AS chronic_patients
FROM flu_csv
WHERE age BETWEEN 22 and 39 ;
107 patients have a Chronic Condition between this age and most of them were pregnant and a few 4 had Diabetes and 6 had Hypertension none had COPD or Asthma.
Conclusion
Based on my findings we can safely conclude that most of the elderly aged 70 and above had COPD and the younger ones had Diabetes together with those in their middle 60s. Apart from the Chronic Condition listed as ‘pregnant’ I think the dataset was good enough to give insights that I didn't expect like the youngest pregnant girl being 22 and the oldest pregnant women being 39 years old. Better analysis and conclusions could have been arrived at if multiple Chronic Conditions had been listed for each patient. As you are aware most pregnant women might have Gestational Diabetes and Hypertension while the elderly might have Diabetes and Hypertension.
In Part II of this series An Analysis : A Chronic Clinic Dataset I will you create a separate table where the Chronic Conditions is Null and try and do Joins with the Chronic Conditions table using SQL , Power BI and Excel to see what more insights I can get from this database.