Complete Guide to Healthcare Data Analysis: Real Heart Disease Dataset Tutorial [2025]
Learn data science fundamentals through real medical data analysis with Python code examples
Introduction
An often-cited estimate is that 2.5 quintillion bytes of data are created worldwide every day, and healthcare is a major contributor. How do we transform raw patient numbers into life-saving insights?
In this comprehensive tutorial, we'll analyze real heart disease data from 918 patients to demonstrate essential data science concepts every healthcare professional should know. You'll learn practical skills that hospitals and research institutions use to improve patient outcomes and optimize treatment protocols.
What You'll Learn:
✅ How to classify and analyze different types of medical data
✅ Statistical techniques for healthcare research (with Python code)
✅ Probability calculations for medical diagnosis
✅ Sample size determination for clinical studies
✅ Real-world applications in cardiovascular health
🔬 Dataset Overview:
918 patients across multiple healthcare institutions
12 clinical variables including demographics and test results
Heart disease prediction with validated medical outcomes
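No empty cells, which makes it convenient for learning (though a few 0 entries for cholesterol and blood pressure look like placeholders for unmeasured values)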
💡 Pro Tip: Bookmark this page and follow along with our Google Colab notebook for hands-on practice!
Understanding Healthcare Data Types: The Foundation of Medical Analytics
Before diving into analysis, understanding data measurement scales is crucial for selecting appropriate statistical methods and avoiding analytical errors that could impact patient care.
The DIKW Pyramid in Healthcare
Healthcare data transformation follows a clear hierarchy:
📊 Data → 📋 Information → 🧠 Knowledge → 💡 Wisdom
Real Example - Heart Disease Assessment:
Data Level: Raw patient record
"65, M, ATA, 120, 177, 0, Normal, 140, N, 0.4, Flat, 0"
Information Level: Contextualized medical record
"65-year-old male with atypical angina, resting BP 120 mmHg, cholesterol 177 mg/dl, fasting blood sugar flag 0, normal resting ECG, max heart rate 140 bpm, no exercise angina, ST depression 0.4, flat ST slope, no heart disease diagnosis"
Knowledge Level: Pattern recognition
"Males aged 60+ with atypical angina and flat ST slopes show 23% heart disease probability in our dataset"
Wisdom Level: Clinical decision-making
"Implement enhanced cardiac monitoring for asymptomatic males 60+ with multiple risk factors, prioritizing preventive care over reactive treatment"
Four Types of Medical Data Variables
Understanding variable types determines which statistical tests are valid and clinically meaningful.
1. Nominal Variables (Categories)
Examples from our dataset:
Sex: 725 males (79.0%), 193 females (21.0%)
Chest Pain Type: Asymptomatic (496 patients), Non-anginal (203), Atypical angina (173), Typical angina (46)
Key Insight: Use chi-square tests for associations, frequency analysis for distributions.
2. Ordinal Variables (Ranked Categories)
Example: ST Slope (Exercise ECG response)
Up (395 patients) = Best cardiac response ✅
Flat (460 patients) = Intermediate response ⚠️
Down (63 patients) = Concerning response ❌
Clinical Significance: Natural health ranking where "Up" indicates healthier cardiac function.
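Because ordinal categories have an order but no fixed spacing, the median category is a safer summary than a mean. The sketch below finds it from the ST Slope counts listed above; the function name `median_category` is illustrative, and the ordering Up < Flat < Down follows the ranking described in this section.

```python
# ST Slope categories ordered from best to most concerning response.
ST_SLOPE_ORDER = ["Up", "Flat", "Down"]
st_slope_counts = {"Up": 395, "Flat": 460, "Down": 63}


def median_category(order, counts):
    """Return the category containing the middle observation of an ordered scale."""
    total = sum(counts[c] for c in order)
    midpoint = total / 2
    running = 0
    for category in order:
        running += counts[category]
        if running >= midpoint:
            return category
    raise ValueError("counts must contain at least one observation")


median_slope = median_category(ST_SLOPE_ORDER, st_slope_counts)
print(median_slope)  # Flat
```

The same helper works for any ranked variable as long as the order list is supplied explicitly, since Python has no way to infer a clinical ranking from the labels.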
3. Interval Variables (Equal Intervals, No True Zero)
Example: Patient Age
Range: 28-77 years
Mean: 53.5 years
Standard Deviation: 9.4 years
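Why It Matters: Differences in age are always meaningful. Strictly speaking, age has a true zero and so also qualifies as a ratio variable; it is treated as interval here because clinical risk does not scale in proportion to age (a 60-year-old is not at "twice the risk" of a 30-year-old).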
4. Ratio Variables (True Zero Point)
Examples:
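Resting Blood Pressure: 0-200 mmHg (a true zero would mean no pressure at all; a recorded 0 in a living patient is almost certainly an unmeasured value)
Heart Rate: 60-202 bpm (zero = cardiac arrest)
Cholesterol: 0-603 mg/dl (a true zero would mean no cholesterol; recorded zeros are physiologically implausible and most likely mark unmeasured values)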
Advantage: All mathematical operations valid, including meaningful clinical ratios.
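To connect the four scales back to the raw record from the DIKW example, here is a minimal sketch that parses one row into named fields and groups them by measurement scale. The column names and their order are an assumption based on the Kaggle CSV header, and only the columns this tutorial classifies are assigned a scale.

```python
# Column order is an assumption taken from the Kaggle CSV header.
COLUMNS = [
    "Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
    "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope", "HeartDisease",
]

# Scales as assigned in this tutorial; unclassified columns are left out on purpose.
SCALES = {
    "Sex": "nominal",
    "ChestPainType": "nominal",
    "ST_Slope": "ordinal",
    "Age": "interval",
    "RestingBP": "ratio",
    "MaxHR": "ratio",
    "Cholesterol": "ratio",
}

INTEGER_COLUMNS = {"Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "HeartDisease"}


def parse_record(raw):
    """Turn one comma-separated row into a dict of named, typed fields."""
    values = [v.strip() for v in raw.split(",")]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    record = {}
    for name, value in zip(COLUMNS, values):
        if name in INTEGER_COLUMNS:
            record[name] = int(value)
        elif name == "Oldpeak":
            record[name] = float(value)
        else:
            record[name] = value
    return record


def fields_by_scale(record):
    """Group the classified fields of a record by measurement scale."""
    grouped = {}
    for name, scale in SCALES.items():
        grouped.setdefault(scale, {})[name] = record[name]
    return grouped


record = parse_record("65, M, ATA, 120, 177, 0, Normal, 140, N, 0.4, Flat, 0")
grouped = fields_by_scale(record)
print(grouped["ordinal"])  # {'ST_Slope': 'Flat'}
```

Keeping the scale mapping in one dictionary makes it easy to check, before running a test, that the chosen statistic is valid for the column's scale.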
Step-by-Step Statistical Analysis: From Data to Discovery
Let's analyze our heart disease dataset using proven statistical methods. Each technique answers specific clinical questions.
Phase 1: Descriptive Statistics - Understanding Your Population
Age Distribution Analysis
Our analysis reveals compelling demographic patterns:
| Age Group | Patients | Percentage | Clinical Significance |
|---|---|---|---|
| 25-34 | 32 | 3.5% | Young adult onset |
| 35-44 | 164 | 17.9% | Early risk factors |
| 45-54 | 316 | 34.4% | Peak incidence |
| 55-64 | 324 | 35.3% | Highest risk |
| 65-74 | 78 | 8.5% | Survivor population |
| 75+ | 4 | 0.4% | Rare in acute care |
🔍 Key Finding: 69.7% of patients fall between ages 45-64, aligning with cardiovascular disease epidemiology.
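The age-group percentages can be checked directly from the counts. The following sketch also includes a small binning helper; `age_group` is an illustrative name, and the bin edges are the ones used in the table above.

```python
age_group_counts = {
    "25-34": 32, "35-44": 164, "45-54": 316,
    "55-64": 324, "65-74": 78, "75+": 4,
}


def age_group(age):
    """Map an age in years to the ten-year bins used in the table."""
    if age < 25:
        raise ValueError("ages below 25 fall outside the tabulated bins")
    if age >= 75:
        return "75+"
    lower = 25 + 10 * ((age - 25) // 10)
    return f"{lower}-{lower + 9}"


total_patients = sum(age_group_counts.values())
percentages = {
    group: round(100 * count / total_patients, 1)
    for group, count in age_group_counts.items()
}
share_45_to_64 = round(
    100 * (age_group_counts["45-54"] + age_group_counts["55-64"]) / total_patients, 1
)
print(share_45_to_64)  # 69.7
```

With the full dataset loaded, the same helper could be applied to each patient's age and the results tallied to rebuild the table.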
Summary Statistics Interpretation
Core metrics from our analysis:
Mean Age: 53.51 years (middle-aged population)
Median Age: 54.00 years (symmetric distribution)
Standard Deviation: 9.43 years (moderate variability)
Range: 49 years (diverse age spectrum)
Clinical Insight: The near-identical mean and median indicate a roughly symmetric age distribution, with no strong skew toward younger or older patients.
Phase 2: Bivariate Analysis - Discovering Relationships
Sex vs Heart Disease: Chi-Square Analysis
| Sex | No Heart Disease | Heart Disease | Total | Rate |
|---|---|---|---|---|
| Male | 267 | 458 | 725 | 63.2% |
| Female | 143 | 50 | 193 | 25.9% |
Statistical Results:
Chi-square statistic: χ² = 84.145
p-value: < 0.000001 ⭐ (highly significant)
Effect size: relative risk ≈ 2.4 (63.2% / 25.9%), i.e. males in this sample are about 2.4 times as likely to have heart disease
🚨 Clinical Alert: This represents a massive gender disparity in cardiovascular risk, consistent with medical literature showing earlier onset and higher prevalence in males.
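For readers who want to see where the statistic comes from, here is a hand-rolled sketch of the 2×2 chi-square test using only the standard library. The reported value of 84.145 is reproduced when the Yates continuity correction is applied, which is also the default for 2×2 tables in `scipy.stats.chi2_contingency`; the function name `chi_square_2x2` is illustrative.

```python
import math


def chi_square_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value) with one degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Continuity correction: shrink each deviation by 0.5.
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    # For 1 degree of freedom, the chi-square tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: male, female. Columns: no heart disease, heart disease.
sex_by_disease = [[267, 458], [143, 50]]
chi2, p_value = chi_square_2x2(sex_by_disease)
relative_risk = (458 / 725) / (50 / 193)

print(round(chi2, 2), round(relative_risk, 2))
```

The relative risk is computed separately because the chi-square statistic measures evidence of an association, not its size.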
Age vs Maximum Heart Rate: Correlation Analysis
Results:
Pearson correlation: r = -0.382 (moderate negative)
R-squared: 14.6% of heart rate variation explained by age
p-value: < 0.0000000001 (extremely significant)
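Translation: Maximum heart rate tends to fall with age, by roughly 1 bpm per year on average in this sample (slope ≈ r × SD of max heart rate / SD of age = -0.382 × 25.5 / 9.43). Age explains only about 15% of the variation, so individual patients can deviate widely from the trend; this matters for exercise prescription and cardiac rehabilitation.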
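The correlation coefficient can be turned into more tangible quantities using only the summary statistics reported in this tutorial. This is a sketch under the assumption of a simple linear relationship; it uses the standard deviations quoted elsewhere in the article (9.43 years for age, 25.5 bpm for maximum heart rate).

```python
import math

r = -0.382          # Pearson correlation between age and max heart rate
n = 918             # number of patients
sd_age = 9.43       # years
sd_max_hr = 25.5    # bpm

# Share of heart-rate variance explained by age.
r_squared = r ** 2

# Least-squares slope of max heart rate on age: r * (SD of y / SD of x).
slope_bpm_per_year = r * sd_max_hr / sd_age

# t statistic for testing whether the correlation differs from zero.
t_statistic = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)

print(round(r_squared * 100, 1), round(slope_bpm_per_year, 2))
```

A slope of about one beat per minute per year is an average trend; with R-squared near 15%, most of the spread between patients comes from factors other than age.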
Probability in Medical Diagnosis: Making Data-Driven Decisions
Healthcare decisions often involve uncertainty. Probability theory provides the mathematical framework for evidence-based medicine.
Basic Probability from Real Patient Data
From our 918-patient dataset:
Fundamental Probabilities:
P(Heart Disease) = 55.3% – Baseline risk in this population
P(Male) = 79.0% – Population composition
P(Female) = 21.0% – Female representation
Conditional Risk Assessment:
P(Heart Disease | Male) = 63.2% – Male-specific risk
P(Heart Disease | Female) = 25.9% – Female-specific risk
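These probabilities all come from the same four counts in the sex-by-disease table. The sketch below derives them and checks the law of total probability, which says the overall risk is the sex-specific risks weighted by each group's share; the helper names are illustrative.

```python
# Keys are (sex, heart_disease) with heart_disease coded 1 = yes, 0 = no.
counts = {
    ("Male", 1): 458, ("Male", 0): 267,
    ("Female", 1): 50, ("Female", 0): 143,
}
total = sum(counts.values())


def p_sex(sex):
    """Marginal probability of a given sex."""
    return sum(v for (s, _), v in counts.items() if s == sex) / total


def p_disease_given(sex):
    """Conditional probability of heart disease within one sex."""
    group_size = counts[(sex, 1)] + counts[(sex, 0)]
    return counts[(sex, 1)] / group_size


# Law of total probability: P(D) = sum over sexes of P(D | sex) * P(sex).
p_disease = sum(p_disease_given(s) * p_sex(s) for s in ("Male", "Female"))

print(round(100 * p_disease, 1))  # 55.3
```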
Bayes' Theorem in Action: Diagnostic Probability
Clinical Question: "If a patient has heart disease, what's the probability they're male?"
Bayes' Formula:
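P(Male | Heart Disease) = P(Heart Disease | Male) × P(Male) / P(Heart Disease)
= 0.632 × 0.790 / 0.553
≈ 0.903 with these rounded inputs; the unrounded counts give 458 / 508 = 0.902, or 90.2%
Clinical Impact: In this population, heart disease patients are overwhelmingly male (90.2%). Part of this reflects the sample itself, which is 79% male, so the figure should inform screening protocols and resource allocation only for populations with a similar composition.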
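As a quick check of the Bayes calculation, the sketch below applies the formula twice: once with the rounded percentages and once with the raw counts. The function name `bayes` is illustrative. Rounding the inputs shifts the third decimal place, so the count-based figure is the one to quote.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A | B) from P(B | A), P(A) and P(B)."""
    if not 0 < p_b <= 1:
        raise ValueError("P(B) must be in (0, 1]")
    return p_b_given_a * p_a / p_b


# A = male, B = heart disease.
from_rounded = bayes(0.632, 0.790, 0.553)
from_counts = bayes(458 / 725, 725 / 918, 508 / 918)

print(round(from_rounded, 3), round(from_counts, 3))  # 0.903 0.902
```

With raw counts the formula collapses to 458 / 508, the share of heart disease patients who are male, which is a useful sanity check on any Bayes computation from a contingency table.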
Risk Factor Analysis: Discrete Random Variables
We created a cardiac risk score by counting five major factors per patient:
Age > 55 years
Male sex
Exercise-induced angina
High cholesterol (>240 mg/dl)
Elevated blood pressure (>140 mmHg)
📊 Risk Distribution:
| Risk Factors | Patients | Probability | Risk Level |
|---|---|---|---|
| 0 factors | 49 | 5.3% | Low risk |
| 1 factor | 210 | 22.9% | Mild risk |
| 2 factors | 279 | 30.4% | Moderate |
| 3 factors | 242 | 26.4% | High |
| 4 factors | 107 | 11.7% | Very high |
| 5 factors | 31 | 3.4% | Critical |
Key Insight: Most patients (56.8%) have 2-3 risk factors, suggesting moderate risk profiles are most common.
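The risk score is a simple count, so it can be sketched in a few lines. The dictionary keys below (`age`, `sex`, `exercise_angina`, `cholesterol`, `resting_bp`) are hypothetical field names chosen for this illustration, and the thresholds are the five listed above. The second half treats the tabulated counts as a probability mass function and computes its expected value.

```python
def risk_score(patient):
    """Count how many of the five listed risk factors a patient has."""
    factors = [
        patient["age"] > 55,
        patient["sex"] == "M",
        patient["exercise_angina"] == "Y",
        patient["cholesterol"] > 240,
        patient["resting_bp"] > 140,
    ]
    return sum(factors)


# The 65-year-old male from the DIKW example: two factors (age and sex).
example_patient = {
    "age": 65, "sex": "M", "exercise_angina": "N",
    "cholesterol": 177, "resting_bp": 120,
}
example_score = risk_score(example_patient)

# Observed distribution of the score across all patients.
score_counts = {0: 49, 1: 210, 2: 279, 3: 242, 4: 107, 5: 31}
n_patients = sum(score_counts.values())
pmf = {score: count / n_patients for score, count in score_counts.items()}

expected_score = sum(score * p for score, p in pmf.items())
share_two_or_three = pmf[2] + pmf[3]

print(example_score, round(expected_score, 2))  # 2 2.26
```

An expected value of about 2.26 factors per patient is consistent with the observation that two or three factors is the most common profile.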
Heart Rate Distribution: Continuous Variables
Modeled as: Maximum Heart Rate ~ Normal(μ = 136.8 bpm, σ = 25.5 bpm)
Clinical Probabilities:
P(MaxHR < 130) = 39.5% → Reduced exercise capacity
P(MaxHR > 160) = 18.1% → Preserved cardiac function
P(120 < MaxHR < 150) = 44.3% → Typical response
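Application: These probabilities help establish reference ranges and flag patients with unusually low or high exercise capacity. They assume maximum heart rate is approximately normally distributed, which should be checked against a histogram.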
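The three probabilities follow from the normal cumulative distribution function. A minimal sketch with the standard library's `statistics.NormalDist`, assuming the normal model with the mean and standard deviation stated above:

```python
from statistics import NormalDist

# Normal model for maximum heart rate (bpm), parameters from the text.
max_hr = NormalDist(mu=136.8, sigma=25.5)

p_below_130 = max_hr.cdf(130)
p_above_160 = 1 - max_hr.cdf(160)
p_120_to_150 = max_hr.cdf(150) - max_hr.cdf(120)

for label, p in [
    ("P(MaxHR < 130)", p_below_130),
    ("P(MaxHR > 160)", p_above_160),
    ("P(120 < MaxHR < 150)", p_120_to_150),
]:
    print(f"{label} = {100 * p:.1f}%")
```

The same object also exposes `inv_cdf`, which would return the heart rate below which a given fraction of patients is expected to fall.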
Research Design & Sampling: Planning Effective Studies
Proper sampling determines whether your findings apply beyond your immediate dataset. Here's how to calculate sample sizes and choose sampling methods for healthcare research.
Sample Size Calculations
For Estimating Population Means
Question: "How many patients needed to estimate average age within ±1 year with 95% confidence?"
Formula: n = (Z_{α/2}² × σ²) / E²
Calculation:
Using our dataset parameters:
Confidence level: 95% (Z = 1.96)
Population SD: 9.43 years
Margin of error: ±1.0 year
n = (1.96² × 9.43²) / 1.0² = 341.6, rounded up to 342 patients
🎯 Answer: 342 patients required
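The calculation is easy to wrap in a reusable function. This sketch uses Z = 1.96 as in the text and rounds up, since a fractional patient cannot be recruited; the function name `sample_size_for_mean` is illustrative.

```python
import math


def sample_size_for_mean(sigma, margin, z=1.96):
    """Patients needed to estimate a mean within +/- margin.

    sigma  : assumed population standard deviation
    margin : desired half-width of the confidence interval
    z      : critical value (1.96 for 95% confidence)
    """
    if sigma <= 0 or margin <= 0:
        raise ValueError("sigma and margin must be positive")
    return math.ceil((z * sigma / margin) ** 2)


n_age = sample_size_for_mean(sigma=9.43, margin=1.0)
print(n_age)  # 342
```

Because the margin enters as a square, halving it to ±0.5 years roughly quadruples the required sample.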
For Estimating Disease Prevalence
Question: "Sample size to estimate heart disease prevalence within ±5%?"
Using observed rate (55.3%):
n = (1.96² × 0.553 × 0.447) / 0.05² = 380 patients
Conservative approach (worst-case p = 0.5):
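n = (1.96² × 0.5 × 0.5) / 0.05² = 384.16, i.e. about 384 patients (385 if rounded up to guarantee the margin)
📋 Recommendation: at least 384 patients (using the worst-case p = 0.5 keeps the margin of error at about ±5% whatever the true prevalence turns out to be)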
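The prevalence formula can be sketched the same way. The function below returns both the raw value and the rounded-up count, because strict rounding up gives 385 for the worst case, one more than the 384 obtained by rounding to the nearest integer. The function name is illustrative.

```python
import math


def sample_size_for_proportion(p, margin, z=1.96):
    """Return (raw, rounded_up) sample size for estimating a proportion.

    p      : anticipated prevalence (use 0.5 for the worst case)
    margin : desired half-width of the confidence interval
    z      : critical value (1.96 for 95% confidence)
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    raw = z ** 2 * p * (1 - p) / margin ** 2
    return raw, math.ceil(raw)


raw_observed, n_observed = sample_size_for_proportion(0.553, 0.05)
raw_worst, n_worst = sample_size_for_proportion(0.5, 0.05)

print(n_observed, round(raw_worst, 2), n_worst)  # 380 384.16 385
```

The worst case is p = 0.5 because p × (1 − p) is largest there, so any other prevalence needs fewer patients for the same margin.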
Multi-Center Study Design Example
Scenario: Estimating heart disease prevalence across 4 hospitals with ±3% precision
🏥 Target Populations:
Hospital A: 2,500 patients (26.0%)
Hospital B: 1,800 patients (18.8%)
Hospital C: 3,200 patients (33.3%)
Hospital D: 2,100 patients (21.9%)
Total: 9,600 patients
Required Sample Size:
n = (1.96² × 0.55 × 0.45) / 0.03² = 1,056 patients
🎯 Proportional Allocation:
Hospital A: 275 patients (26.0%)
Hospital B: 198 patients (18.8%)
Hospital C: 352 patients (33.3%)
Hospital D: 231 patients (21.9%)
This gives each hospital a share of the sample equal to its share of the patient population, so the pooled prevalence estimate is not dominated by any single site.
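Proportional allocation is a one-line formula per hospital, but rounding can make the parts miss the target total. The sketch below uses a largest-remainder step to guarantee the allocations sum to the required sample; that step is a design choice made here, not something the tutorial prescribes, and the function name is illustrative.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes."""
    total = sum(strata_sizes.values())
    exact = {name: sample_size * size / total for name, size in strata_sizes.items()}
    # Start from the whole-number part of each exact share.
    allocation = {name: int(value) for name, value in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # Give any remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - allocation[k], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


hospitals = {"A": 2500, "B": 1800, "C": 3200, "D": 2100}
allocation = proportional_allocation(hospitals, 1056)
print(allocation)  # {'A': 275, 'B': 198, 'C': 352, 'D': 231}
```

In this scenario every share happens to be a whole number, so the remainder step changes nothing; it matters when the target sample is not a convenient multiple of the population shares.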
Sampling Method Comparison
✅ Simple Random Sampling
Pros: Mathematically unbiased, easy to implement
Cons: May miss important subgroups
Best for: Homogeneous populations
✅ Stratified Sampling
Pros: Ensures subgroup representation
Cons: Requires population knowledge
Best for: Diverse populations with known strata
✅ Cluster Sampling
Pros: Practical for multi-site studies
Cons: Reduced statistical efficiency
Best for: Geographically dispersed studies
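To make the difference between the first two methods concrete, here is a small sketch using the standard library's `random` module. The toy population of 100 patients (79 male, 21 female, mirroring the dataset's sex split) and the function names are illustrative.

```python
import random
from collections import defaultdict


def simple_random_sample(population, n, seed=0):
    """Draw n units without replacement, every unit equally likely."""
    return random.Random(seed).sample(population, n)


def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum so each is represented."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


# Toy population: (sex, patient_id) pairs with a 79/21 split.
population = [("M", i) for i in range(79)] + [("F", i) for i in range(21)]

srs = simple_random_sample(population, 20)
stratified = stratified_sample(population, lambda unit: unit[0], fraction=0.2)

males_in_stratified = sum(1 for sex, _ in stratified if sex == "M")
females_in_stratified = sum(1 for sex, _ in stratified if sex == "F")
print(males_in_stratified, females_in_stratified)  # 16 4
```

The stratified draw always contains 16 males and 4 females, whereas the number of females in the simple random sample varies from draw to draw and can occasionally be very small.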
Key Takeaways: Actionable Insights for Healthcare Professionals
🎯 Statistical Findings
Gender Disparity Alert: Males have 2.4x higher heart disease risk (63.2% vs 25.9%)
Age-Related Decline: Maximum heart rate falls moderately with age (r = -0.382)
Risk Distribution: Most patients have 2-3 cardiac risk factors (56.8%)
Diagnostic Probability: 90.2% of heart disease patients are male in this population
🔬 Methodological Lessons
Variable Classification Matters: Choose statistics based on measurement scale
Sample Size Planning: 342 patients for age estimates, 384 for prevalence studies
Probability Applications: Bayes' theorem quantifies diagnostic uncertainty
Sampling Strategy: Stratified sampling ensures representative subgroups
🏥 Clinical Applications
Screening Protocols: Focus on males 45-64 with multiple risk factors
Exercise Testing: Use age-adjusted heart rate expectations
Resource Allocation: Prioritize preventive care for high-risk demographics
Study Design: Multi-center stratified sampling for generalizable results
Download Resources
🔗 Links & Tools
📓 Complete Google Colab Notebook: Run all analyses step-by-step with detailed explanations
📊 Heart Disease Dataset: Original 918-patient dataset from Kaggle
🎓 Next Steps
Advanced Topics to Explore:
Machine learning for medical prediction
Survival analysis for time-to-event data
Multi-level modeling for hospital comparisons
Clinical trial design and analysis
Recommended Learning Path:
Master descriptive statistics (this tutorial)
Learn regression analysis for risk factors
Explore machine learning applications
Study clinical trial methodology
About This Analysis
This tutorial analyzed real heart disease data from 918 patients across multiple healthcare institutions. All statistical analyses were performed using Python in Google Colab, with emphasis on practical applications in healthcare settings.
Data Source: Heart Failure Prediction Dataset (Kaggle)
Analysis Tools: Python, pandas, scipy, matplotlib
Statistical Methods: Descriptive statistics, chi-square tests, correlation analysis, probability theory
Want to dive deeper? Check out our complete Colab notebook with executable code and detailed explanations.

