Loan Eligibility Prediction Using Machine Learning

Introduction:

We developed a machine learning system to predict loan eligibility based on factors like income, loan amount, credit history, and employment status.

Using Python and scikit-learn, we implemented and compared models such as Logistic Regression, Decision Tree, Random Forest, and SVM. The aim is to help financial institutions make faster and more accurate loan decisions while reducing manual effort and bias.


Step 1: Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import metrics
from sklearn.svm import SVC
from imblearn.over_sampling import RandomOverSampler
import warnings
warnings.filterwarnings('ignore')

  • numpy: A package for numerical computing in Python, useful for handling large, multi-dimensional arrays and matrices, and performing mathematical operations.
  • pandas: A library for data manipulation and analysis, especially for structured data like tables (DataFrames).
  • matplotlib.pyplot: A plotting library used to create static, interactive, and animated visualizations in Python.
  • seaborn: A data visualization library built on top of matplotlib that makes it easier to generate complex visualizations with less code.
  • sklearn.model_selection.train_test_split: A utility from Scikit-learn to split datasets into training and test sets for model training and evaluation.
  • sklearn.preprocessing.LabelEncoder: A utility for encoding categorical labels (e.g., transforming labels like “yes”/”no” to 1/0).
  • sklearn.preprocessing.StandardScaler: Used to standardize features by removing the mean and scaling to unit variance. This is often important when using algorithms like Support Vector Machines (SVM).
  • sklearn.metrics: Contains various functions to assess the performance of a model (e.g., accuracy, precision, recall, etc.).
  • sklearn.svm.SVC: The Support Vector Machine (SVM) classifier, which is useful for classification tasks.
  • imblearn.over_sampling.RandomOverSampler: A technique for handling imbalanced datasets. It oversamples the minority class by duplicating examples to balance the dataset.
  • warnings.filterwarnings('ignore'): Suppresses any warnings that might pop up during execution of the code. This is useful when you are not interested in seeing minor warnings.

Step 2: Load The Dataset

df = pd.read_csv('loan_data.csv')
df.head()

  • df = pd.read_csv('loan_data.csv'): This line reads the CSV file loan_data.csv into a pandas DataFrame, which is a tabular data structure (similar to a table or a spreadsheet).
  • pd.read_csv() is a function provided by pandas to read CSV files. The df variable stores the resulting DataFrame.
  • df.head(): This method returns the first five rows of the DataFrame by default. It gives you a quick overview of the structure of the data (the columns and their values).

Step 3: Dataset Information

df.info()
  • The df.info() command provides a concise summary of the DataFrame df.
  • It’s used to understand the structure and basic properties of the dataset.
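
Because df.info() also reports the non-null count for each column, a quick companion check is to count missing values explicitly. The line below is an optional addition to the original steps and uses only standard pandas:

# Optional follow-up to df.info(): number of missing values per column
df.isnull().sum()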

Step 4: Getting The Dimensions Of The Dataframe

df.shape

  • df.shape returns a tuple indicating the number of rows and columns in the DataFrame.

Step 5: Descriptive Statistics Of Numerical Columns

df.describe()
  • df.describe() provides statistical summaries (like count, mean, std, min, max, and percentiles) for all numerical columns in the DataFrame.

Step 6: Exploratory Data Analysis

i) Pie Chart:

temp = df['Loan_Status'].value_counts()
plt.pie(temp.values,
        labels=temp.index,
        autopct='%1.1f%%')
plt.show()
  • This code generates a pie chart to visualize the distribution of loan statuses in the dataset.
  • It first uses value_counts() to count the occurrences of each unique value in the 'Loan_Status' column, then plots these counts as a pie chart using matplotlib.
  •  The labels parameter assigns the category names to the chart slices, and autopct='%1.1f%%' displays the percentage of each category on the chart, formatted to one decimal place.
  • Finally, plt.show() displays the pie chart.

ii) Bar plot:

plt.subplots(figsize=(15, 5))
for i, col in enumerate(['Gender', 'Married']):
    plt.subplot(1, 2, i+1)
    sb.countplot(data=df, x=col, hue='Loan_Status')
plt.tight_layout()
plt.show()
  • This code creates side-by-side bar plots to compare loan approval status based on the applicant’s Gender and Married status.
  •  A figure of size 15×5 inches is created using plt.subplots().
  • Then, a for loop iterates over the columns 'Gender' and 'Married', and for each, a subplot is generated using plt.subplot(1, 2, i+1) to place them side-by-side (1 row, 2 columns).
  • The seaborn.countplot() function is used to plot the frequency of each category, further divided by Loan_Status using the hue parameter.
  •  plt.tight_layout() ensures proper spacing between subplots, and plt.show() displays the final figure.

iii) Histogram:

plt.subplots(figsize=(15, 5))
for i, col in enumerate(['ApplicantIncome', 'LoanAmount']):
    plt.subplot(1, 2, i+1)
    sb.histplot(df[col], kde=True)
plt.tight_layout()
plt.show()
  • This code generates side-by-side distribution plots to visualize the distribution of ApplicantIncome and LoanAmount in the dataset.
  • The plt.subplots(figsize=(15, 5)) creates a figure with a size of 15×5 inches.
  • The for loop iterates over the columns 'ApplicantIncome' and 'LoanAmount', and for each column it creates a subplot using plt.subplot(1, 2, i+1) to place the plots side by side (1 row, 2 columns). seaborn.histplot() with kde=True plots the distribution (histogram + KDE) of each feature; the older seaborn.distplot() is deprecated and has been removed in recent seaborn releases.
  • plt.tight_layout() ensures there’s no overlap between subplots, and plt.show() displays the plots.

iv) Box Plot:

plt.subplots(figsize=(15, 5))
for i, col in enumerate(['ApplicantIncome', 'LoanAmount']):
    plt.subplot(1, 2, i+1)
    sb.boxplot(df[col])
plt.tight_layout()
plt.show()
  • This code creates side-by-side box plots to visualize the spread and detect outliers in the ApplicantIncome and LoanAmount columns.
  • The plt.subplots(figsize=(15, 5)) defines the figure size.
  • Then, a for loop iterates over the two numerical columns, creating a subplot for each using plt.subplot(1, 2, i+1).
  • The seaborn.boxplot() function is used to draw the box plots, which show the median, quartiles, and any outliers in the data.
  • plt.tight_layout() adjusts the spacing to prevent overlap, and plt.show() displays the figure.
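
To complement the visual check, one common way to quantify the outliers seen in a box plot is the interquartile range (IQR) rule. The snippet below is an illustrative sketch, not part of the original tutorial, and assumes only the two columns plotted above:

# Illustrative IQR-based outlier count for the columns shown in the box plots
for col in ['ApplicantIncome', 'LoanAmount']:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    print(col, 'outliers beyond 1.5*IQR:', n_outliers)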

Step 7: Average Loan Amount By Gender
df = df[df['ApplicantIncome'] < 25000]
df = df[df['LoanAmount'] < 400000]
df.groupby('Gender').mean(numeric_only=True)['LoanAmount']
  • The code filters the DataFrame df to include only rows where ApplicantIncome < 25,000 and LoanAmount < 400,000.
  • It then groups the filtered data by the Gender column and calculates the mean LoanAmount for each gender, using only numeric columns.

Step 8: Average Loan Amount By Marital Status And Gender

df.groupby(['Married', 'Gender']).mean(numeric_only=True)['LoanAmount']
  • The code groups the DataFrame df by both Married and Gender columns.
  • It then calculates the average LoanAmount for each combination of marital status and gender, using only numeric columns.

Step 9: Heatmap Of Strong Feature Correlations After Label Encoding

def encode_labels(data):
    for col in data.columns:
        if data[col].dtype == 'object':
            le = LabelEncoder()
            data[col] = le.fit_transform(data[col])
    return data
df = encode_labels(df)
sb.heatmap(df.corr() > 0.8, annot=True, cbar=False)
plt.show()
  • The code defines a function to label encode all categorical columns in the DataFrame, enabling correlation computation across all features.
  • It then creates a heatmap to visualize correlations greater than 0.8, highlighting strong relationships between variables.
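
To illustrate what LabelEncoder does inside encode_labels, here is a tiny standalone example using toy values (not taken from the dataset):

# Toy example of LabelEncoder: categories are mapped to integers 0..n-1 (alphabetical order)
le = LabelEncoder()
print(le.fit_transform(['Yes', 'No', 'Yes', 'Yes']))  # -> [1 0 1 1]
print(le.classes_)                                    # -> ['No' 'Yes']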

Step 10: Data Preprocessing

features = df.drop('Loan_Status', axis=1)
target = df['Loan_Status'].values
X_train, X_val, Y_train, Y_val = train_test_split(features, target, test_size=0.2, random_state=10)
ros = RandomOverSampler(sampling_strategy='minority',
                        random_state=0)
X, Y = ros.fit_resample(X_train, Y_train)
X_train.shape, X.shape

  • Splits the dataset into features and target, where Loan_Status is the target variable.
  • Splits the data into training and validation sets using train_test_split, with 20% of the data used for validation.
  • Handles class imbalance in the training data using RandomOverSampler, which duplicates samples of the minority class to balance the dataset.
  • Finally, it checks the shapes of X_train (original) and X (resampled).
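
To see the effect of RandomOverSampler directly, you can compare the class counts before and after resampling. This small check is an optional addition to the original step:

# Optional check: class distribution before and after oversampling
print('Before:', pd.Series(Y_train).value_counts().to_dict())
print('After :', pd.Series(Y).value_counts().to_dict())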

Step 11: Model Training

from sklearn.metrics import roc_auc_score
model = SVC(kernel='rbf')
model.fit(X, Y)
print('Training ROC AUC : ', metrics.roc_auc_score(Y, model.predict(X)))
print('Validation ROC AUC : ', metrics.roc_auc_score(Y_val, model.predict(X_val)))
print()
  • Trains an SVM classifier with an RBF kernel on the balanced training data and evaluates it using the ROC AUC score for both the training and validation sets.
  • This metric measures how well the model distinguishes between the classes, especially useful for imbalanced datasets.
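
The introduction mentions comparing Logistic Regression, Decision Tree, Random Forest, and SVM. A minimal sketch of such a comparison on the same resampled training data and ROC AUC metric is shown below; the specific model settings here are illustrative defaults, not part of the original code:

# Illustrative comparison of the models mentioned in the introduction (default parameters)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
    'SVM (RBF)': SVC(kernel='rbf')
}
for name, clf in candidates.items():
    clf.fit(X, Y)
    val_auc = metrics.roc_auc_score(Y_val, clf.predict(X_val))
    print(name, '- Validation ROC AUC:', round(val_auc, 3))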

Step 12: Model Evaluation And Performance Visualization

from sklearn.metrics import confusion_matrix
training_roc_auc = roc_auc_score(Y, model.predict(X))
validation_roc_auc = roc_auc_score(Y_val, model.predict(X_val))
print('Training ROC AUC Score:', training_roc_auc)
print('Validation ROC AUC Score:', validation_roc_auc)
print()
cm = confusion_matrix(Y_val, model.predict(X_val))
plt.figure(figsize=(6, 6))
sb.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
  • This code evaluates a machine learning model (assumed to be an SVC) by calculating the ROC AUC score for both the training and validation sets using roc_auc_score.
  • It then computes the confusion matrix for the validation set using confusion_matrix, and visualizes it as a heatmap with seaborn.heatmap.
  • The heatmap displays the performance of the model, showing how well the predicted labels match the true labels.
  • The ROC AUC scores provide an overall measure of the model’s ability to discriminate between classes, with higher values indicating better performance.
  • The confusion matrix offers a detailed breakdown of true positives, true negatives, false positives, and false negatives, helping to assess the model’s accuracy, precision, recall, and other metrics.
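
Since the confusion matrix for this binary task is a 2×2 array, the four counts mentioned above can be pulled out directly. The lines below are a small optional addition following scikit-learn's ordering convention:

# Optional: unpack the binary confusion matrix into its four counts
tn, fp, fn, tp = cm.ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)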

Step 13: Classification Report For Model Evaluation

from sklearn.metrics import classification_report
print(classification_report(Y_val, model.predict(X_val)))
  • This section of the code prints the classification report, which provides key performance metrics such as precision, recall, F1-score, and support for each class in the validation set.
  • It helps assess the model’s effectiveness in distinguishing between classes and highlights areas for improvement in its predictions.

Step 14: Cross Validation

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
scores = cross_val_score(pipeline, features, target, cv=5, scoring='roc_auc')
print("Cross-Validation ROC AUC Scores:", scores)
print("Mean ROC AUC Score:", scores.mean())
  • This code evaluates the model’s stability and generalization ability using 5-fold cross-validation with ROC AUC as the scoring metric.
  • It first creates a pipeline that standardizes the features using StandardScaler and applies an SVC with an RBF kernel.
  • Then, cross_val_score splits the dataset into 5 parts, trains the model on 4 parts, and tests it on the remaining part in each iteration.
  • It returns the ROC AUC scores for each fold, and the mean score gives an overall estimate of the model’s performance across different data splits, helping to ensure the model is not overfitting or underperforming on unseen data.

Conclusion:

In conclusion, a machine learning-based loan eligibility prediction system enhances decision-making in the financial sector by offering faster, more accurate, and unbiased evaluations. By analyzing key applicant data and comparing multiple classification models, the system streamlines loan approvals and supports regulatory compliance. As banking operations grow more data-driven, such solutions become essential for secure and efficient financial services.
