Heart Disease Prediction with Python and Machine Learning

Introduction:

Heart disease is a major global health concern, and early detection is key to preventing severe outcomes. This project aims to build machine learning models to predict the risk of heart disease using clinical data such as age, blood pressure, cholesterol, heart rate, and more.

Using a public dataset from Kaggle, we perform exploratory data analysis and data preprocessing, and apply several classification algorithms, including K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree, and Random Forest. The models are evaluated using accuracy, precision, recall, and F1-score to identify the most effective one for early diagnosis and clinical support.

Step 01: Import Necessary Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib.cm import rainbow
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

  • numpy: A package for numerical computing in Python, useful for handling large, multi-dimensional arrays and matrices, and performing mathematical operations.
  • pandas: A library for data manipulation and analysis, especially for structured data like tables (DataFrames).
  • matplotlib.pyplot: A plotting library used to create static, interactive, and animated visualizations in Python.
  • matplotlib.rcParams: A configuration tool used to customize the default styles of Matplotlib plots such as figure size, font, or color.
  • matplotlib.cm.rainbow: A built-in colormap in Matplotlib used to apply a rainbow color scheme to plots for better visualization.
  • %matplotlib inline: A magic command in Jupyter Notebook to display plots directly below the code cell where the plot is called.
  • warnings.filterwarnings('ignore'): Suppresses warning messages that may arise during code execution, keeping the output cleaner.
  • sklearn.model_selection.train_test_split: A function that splits the dataset into training and testing sets, useful for evaluating machine learning models.
  • sklearn.preprocessing.StandardScaler: Used to normalize or standardize data by removing the mean and scaling to unit variance—important for many ML algorithms.
  • sklearn.neighbors.KNeighborsClassifier: Implements the K-Nearest Neighbors algorithm, which classifies data points based on the majority label of their nearest neighbors.
  • sklearn.svm.SVC: A Support Vector Machine (SVM) classifier from Scikit-learn used for building powerful classification models.
  • sklearn.tree.DecisionTreeClassifier: A model that uses a tree-like structure to make decisions and classify data based on feature values.
  • sklearn.ensemble.RandomForestClassifier: An ensemble learning method that builds multiple decision trees and merges them to improve accuracy and prevent overfitting.

Step 02: Load The Dataset

dataset = pd.read_csv('Heart.csv')
  • Loads the dataset from a CSV file named Heart.csv into a pandas DataFrame called dataset.
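
As a quick sanity check (an optional sketch, not part of the original code), you can preview the shape and first few rows of the loaded DataFrame:

print(dataset.shape)   # expected: (303, 14)
print(dataset.head())  # first five rows of the clinical features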

Step 03: Data Information

dataset.info()
  • It has 303 rows and 14 columns.
  • All columns have non-null values (no missing data).
  • Most columns are of data type int64, with one column (oldpeak) as float64.
  • This is useful for checking data completeness and types.

Step 04: Descriptive Statistics

dataset.describe()
  • We use .describe() to generate basic statistical metrics (mean, std, min, max, quartiles) for each numerical column. This helps understand the range and distribution of values.

Step 05: Correlation Matrix Heatmap of Dataset Features

rcParams['figure.figsize'] = 20, 14
plt.matshow(dataset.corr())
plt.yticks(np.arange(dataset.shape[1]), dataset.columns)
plt.xticks(np.arange(dataset.shape[1]), dataset.columns)
plt.colorbar()
  • The code sets the figure size, generates a heatmap of the dataset’s correlation matrix, labels both axes with feature names, and adds a color bar to indicate the correlation values, making it easier to interpret the relationships between features visually.

Step 06: Histogram

dataset.hist()  
  • This command automatically generates histograms for all numerical columns in your DataFrame named dataset.
  • The output is a grid of subplots, where each subplot shows the distribution of values for a particular column.

Step 07: Visualization of Target Class Distribution

rcParams['figure.figsize'] = 8,6
target_counts = dataset['target'].value_counts().sort_index()
plt.bar(target_counts.index, target_counts.values, color = ['red', 'green'])
plt.xticks([0, 1])
plt.xlabel('Target Classes')
plt.ylabel('Count')
plt.title('Count of each Target Class')
  • This code generates a bar chart to visualize the distribution of the target variable in the dataset, which typically represents the presence (1) or absence (0) of heart disease.
  • First, rcParams['figure.figsize'] = 8,6 sets the overall size of the plot to 8 inches by 6 inches for better readability.
  • The plt.bar() function then creates a bar chart using the class labels of the target column (0 and 1) as the x-axis and their corresponding counts (frequency of each class) as the y-axis; sorting the counts by label keeps each bar aligned with the correct class.
  • The bars are colored red and green to visually differentiate between the two classes. plt.xticks([0, 1]) ensures that the x-axis only shows ticks at 0 and 1, representing the two classes.
  • Labels are added to both axes with plt.xlabel('Target Classes') and plt.ylabel('Count'), and the chart is given a title, “Count of each Target Class”, using plt.title(), to make the purpose of the visualization clear.

Step 08: Data Preprocessing

dataset = pd.get_dummies(dataset, columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'])
  • The code converts categorical columns in your dataset into numerical columns by creating new columns for each category.
  • Each new column will have 1 if that category is present in the row, and 0 if it’s not.
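
The modeling steps below reference X, y, X_train, X_test, y_train, and y_test, which are not created in the snippets shown here. A minimal sketch of the missing preprocessing, assuming the standard Heart.csv column names and a typical scale-then-split workflow (the scaled columns, test size, and random seed are assumptions, not taken from the original code):

# Scale the continuous columns so distance-based models like KNN and SVM
# are not dominated by large-valued features (assumed column list)
standard_scaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = standard_scaler.fit_transform(dataset[columns_to_scale])

# Separate features and label, then hold out a test set for the classifiers below
y = dataset['target']
X = dataset.drop(['target'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)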

Step 09: Algorithms

i) K Nearest Neighbours

knn_scores = []
for k in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors = k)
    knn_classifier.fit(X_train, y_train)
    knn_scores.append(knn_classifier.score(X_test, y_test))
plt.plot([k for k in range(1, 21)], knn_scores, color = 'red')
for i in range(1, 21):
    plt.text(i, knn_scores[i-1], (i, knn_scores[i-1]))
plt.xticks([i for i in range(1, 21)])
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')

  • The code tests the KNN classifier for different values of K (from 1 to 20), calculates the accuracy for each, and plots the results.
  • It shows how the model’s performance (accuracy) changes with different numbers of neighbors, helping identify the best value of K.
  • The plot includes labels for each point to display the accuracy score; the short sketch below shows how to read off the best K programmatically.
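
If you prefer the best K as a number rather than reading it off the plot (a small optional follow-up, assuming knn_scores from the loop above):

best_k = int(np.argmax(knn_scores)) + 1  # +1 because K started at 1
print(f"Best K = {best_k}, test accuracy = {max(knn_scores):.3f}")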

ii) Support Vector Classifier

svc_scores = []
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for i in range(len(kernels)):
    svc_classifier = SVC(kernel = kernels[i])
    svc_classifier.fit(X_train, y_train)
    svc_scores.append(svc_classifier.score(X_test, y_test))
colors = rainbow(np.linspace(0, 1, len(kernels)))
plt.bar(kernels, svc_scores, color = colors)
for i in range(len(kernels)):
    plt.text(i, svc_scores[i], svc_scores[i])
plt.xlabel('Kernels')
plt.ylabel('Scores')
plt.title('Support Vector Classifier scores for different kernels')
  • The code tests the Support Vector Classifier (SVC) using four different kernels—linear, poly, rbf, and sigmoid.
  • For each kernel, it trains the model, calculates its accuracy on test data, and stores the results.
  • Then it plots a bar chart with different colors to compare how well each kernel performs, with the exact scores shown on top of each bar.

iii) Decision Tree Classifier

dt_scores = []
for i in range(1, len(X.columns) + 1):
    dt_classifier = DecisionTreeClassifier(max_features = i, random_state = 0)
    dt_classifier.fit(X_train, y_train)
    dt_scores.append(dt_classifier.score(X_test, y_test))
plt.plot([i for i in range(1, len(X.columns) + 1)], dt_scores, color = 'green')
for i in range(1, len(X.columns) + 1):
    plt.text(i, dt_scores[i-1], (i, dt_scores[i-1]))
plt.xticks([i for i in range(1, len(X.columns) + 1)])
plt.xlabel('Max features')
plt.ylabel('Scores')
plt.title('Decision Tree Classifier scores for different number of maximum features')
  • The code tests how well a Decision Tree model performs when it is allowed to use different numbers of features, from 1 up to all features.
  • It checks the accuracy for each case and plots the results to see which number of features gives the best accuracy.

iv) Random Forest Classifier

rf_scores = []
estimators = [10, 100, 200, 500, 1000]
for i in estimators:
    rf_classifier = RandomForestClassifier(n_estimators = i, random_state = 0)
    rf_classifier.fit(X_train, y_train)
    rf_scores.append(rf_classifier.score(X_test, y_test))
colors = rainbow(np.linspace(0, 1, len(estimators)))
plt.bar([i for i in range(len(estimators))], rf_scores, color = colors, width = 0.8)
for i in range(len(estimators)):
    plt.text(i, rf_scores[i], rf_scores[i])
plt.xticks(ticks = [i for i in range(len(estimators))], labels = [str(estimator) for estimator in estimators])
plt.xlabel('Number of estimators')
plt.ylabel('Scores')
plt.title('Random Forest Classifier scores for different number of estimators')
  • The code checks how well a Random Forest model performs using different numbers of trees (10, 100, 200, 500, 1000), records the accuracy for each, and shows the results in a bar chart to compare their performance.
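
To compare the four model families at a glance, a short summary sketch (assuming the score lists built in the loops above) could print the best test accuracy of each:

# Best test accuracy per model family, taken from the score lists above
print(f"KNN           : {max(knn_scores):.3f}")
print(f"SVC           : {max(svc_scores):.3f}")
print(f"Decision Tree : {max(dt_scores):.3f}")
print(f"Random Forest : {max(rf_scores):.3f}")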

Step 10: Feature Importance Visualization

import seaborn as sns  # seaborn for the horizontal bar plot

importances = rf_classifier.feature_importances_
features = X.columns
plt.figure(figsize=(10,6))
sns.barplot(x=importances, y=features)
plt.title('Feature Importance')
plt.show()
  • The code plots a bar chart to visualize the importance of each feature in the Random Forest Classifier’s decision-making process.
  • It uses the feature importance scores from the trained model and displays them on the X-axis, with the feature names on the Y-axis, helping to identify which features had the most influence on the model’s predictions.

Step 11: ROC Curve & AUC Score

from sklearn.metrics import roc_curve, auc
model = rf_classifier # Assuming you want to use the RandomForestClassifier
y_proba = model.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend()
plt.show()
  • The code generates and displays a Receiver Operating Characteristic (ROC) curve for the Random Forest Classifier, evaluating its ability to distinguish between the positive and negative classes.
  • It calculates the False Positive Rate (FPR) and True Positive Rate (TPR) at various thresholds, computes the Area Under the Curve (AUC), and plots the ROC curve with the AUC value shown, helping assess the model’s performance.

Step 12: Cross Validation

from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"Average CV Score: {cv_scores.mean():.2f}")
  • The code performs 5-fold cross-validation on the model using the entire dataset, calculating the accuracy for each fold and then printing the average accuracy score to evaluate the model’s overall performance.
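
For a fuller picture than the mean alone, you can also inspect the individual fold scores and their spread (a small optional addition):

print("Per-fold CV scores:", np.round(cv_scores, 3))
print(f"CV score standard deviation: {cv_scores.std():.3f}")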

Step 13: Overall Accuracy

from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = model.predict(X_test)  # generate predictions for the test set
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Disease', 'Disease'], yticklabels=['No Disease', 'Disease'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
accuracy = accuracy_score(y_test, y_pred)
print(f"Overall Accuracy: {accuracy * 100:.2f}%")
  • The code generates predictions for the test set, computes the confusion matrix to compare actual vs. predicted values, visualizes the matrix as a heatmap, and calculates the model’s overall accuracy, displaying the result as a percentage.
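
The introduction also lists precision, recall, and F1-score as evaluation metrics, which are not computed in the snippets above. A minimal sketch using scikit-learn's classification_report (an addition, assuming y_test and y_pred from this step):

from sklearn.metrics import classification_report

# Precision, recall, and F1-score for each class (0 = no disease, 1 = disease)
print(classification_report(y_test, y_pred, target_names=['No Disease', 'Disease']))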

Conclusion:

The implementation of machine learning models for heart disease prediction demonstrates the potential of data-driven approaches in healthcare. By analyzing clinical attributes such as age, blood pressure, cholesterol levels, and heart rate, the system provides valuable insights that can assist in early diagnosis and intervention. Through the use of algorithms such as K-Nearest Neighbors, Support Vector Machine (SVM), Decision Tree, and Random Forest, the model achieves reliable performance in classifying patients at risk. This not only supports medical professionals in making informed decisions but also contributes to reducing the overall burden of heart disease. As more accurate data becomes available, such models can be further refined to enhance their predictive capabilities and real-world impact.
