Standard Bank Data Science Virtual Experience Programme

Credit / Home Loans

Standard Bank is embracing the digital transformation wave and intends to use new and exciting technologies to give its customers a complete set of services from the convenience of their mobile devices. As Africa’s biggest lender by assets, the bank aims to improve the current process by which potential borrowers apply for a home loan. Currently, loan officers process home loan applications manually. This takes 2 to 3 days, after which the applicant is told whether they have been granted the loan for the requested amount. To improve the process, Standard Bank wants to use machine learning to assess an applicant’s creditworthiness by implementing a model that predicts whether the potential borrower will default on their loan, so that the applicant receives a response immediately after completing the application.

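You will be required to follow the data science lifecycle to fulfill the objective. The data science lifecycle (https://www.datascience-pm.com/crisp-dm-2/) includes six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.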

You now know the Cross-Industry Standard Process for Data Mining (CRISP-DM), have an idea of the business needs and objectives, and understand the data. Next comes the tedious task of preparing the data for modeling, building the models, and evaluating them. Luckily, just like EDA, the first of these phases can be automated. But, also like EDA, automating it is not always the best choice.

Import Libraries

Import Datasets

Part One: EDA

Sweetviz

Overview of the data

Data Quality Evaluation

Both the train and test datasets contain some null values, though not an alarming number. Neither dataset contains duplicate rows.
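
A check like this can be sketched with pandas as follows. The small frame is a hypothetical stand-in for the training data, and apart from Credit_History its column names are assumptions rather than names confirmed by this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training data; the real file is not shown here.
train = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})


def quality_summary(df: pd.DataFrame) -> dict:
    """Count missing values per column and fully duplicated rows."""
    return {
        "nulls_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


summary = quality_summary(train)
print(summary)
```

Running the same helper on both the train and test frames gives a like-for-like view of where values are missing.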

Loan Statuses Distribution
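
One possible way to look at the loan status distribution, overall and within each category of a feature, is sketched below. The data is made up for illustration, and Loan_Status and Gender are assumed column names.

```python
import pandas as pd

# Hypothetical sample; Loan_Status and Gender are assumed column names.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "Y"],
})

# Overall share of each loan status.
status_share = df["Loan_Status"].value_counts(normalize=True)

# Share of each status within every category of a feature (each row sums to 1).
by_gender = pd.crosstab(df["Gender"], df["Loan_Status"], normalize="index")

print(status_share)
print(by_gender)
```

The same crosstab call can be repeated for the other categorical features covered in this part by swapping the first argument.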

By gender

Married

Has Dependents

Loan by Employment type

Education

By Credit History

Property Area

Loan Amount

Part Two: Data Preparation

Data Preparation

Feature Selection

Handle Missing Values
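
A minimal sketch of one common imputation strategy, assuming the mode is used for categorical columns and the median for numeric ones. This is an illustrative choice, not necessarily the approach taken for each column below. The rows are hypothetical, and only Credit_History is a column name that appears in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only Credit_History is a column name confirmed by the text.
df = pd.DataFrame({
    "Married": ["Yes", None, "Yes", "No"],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
    "LoanAmount": [100.0, 200.0, np.nan, 150.0],
})


def impute(df, categorical, numeric):
    """Fill categorical columns with the mode and numeric columns with the median."""
    out = df.copy()
    for col in categorical:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in numeric:
        out[col] = out[col].fillna(out[col].median())
    return out


clean = impute(df, categorical=["Married", "Credit_History"], numeric=["LoanAmount"])
print(clean.isnull().sum())
```

Credit_History is treated as categorical here because it is a 0/1 flag. In practice the fill values would usually be computed on the training data and reused for the test data.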

Credit_History

Married

Dependents

Loan Amount Term

Loan Amount

Update Engineered Features

Check Missing Values Again

Dummy, Split, Scale
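
The three steps named in this heading could look roughly like the sketch below, using pandas and scikit-learn. The data, the column names, the 25% test size, and the choice of StandardScaler are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical cleaned data; column names are assumptions.
df = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban",
                      "Rural", "Urban", "Semiurban", "Rural"],
    "LoanAmount": [120.0, 80.0, 150.0, 200.0, 95.0, 110.0, 130.0, 70.0],
    "Loan_Status": [1, 0, 1, 1, 0, 1, 1, 0],
})

# Dummy: one-hot encode categoricals, dropping one level per feature.
X = pd.get_dummies(df.drop(columns="Loan_Status"), drop_first=True)
y = df["Loan_Status"]

# Split: hold out a stratified test set before any scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale: fit on the training split only, then apply to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X.columns.tolist())
print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training split alone keeps information from the test split out of the model, which matters for distance-based models such as KNN and SVM.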

Logistic Regression

KNN

Random Forests

Support Vector Machine
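
A hedged sketch of how the four model families above might be fitted and compared on a common split. The data is synthetic (generated with make_classification), and every hyperparameter shown is an assumption, so the scores it prints say nothing about the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the prepared loan data (an assumption, not the real set).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters are illustrative defaults, not tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Support Vector Machine": SVC(),
}

# Fit each model on the same training split and score it on the same test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```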

The best model is Logistic Regression, with 80% accuracy. However, its recall for class 0 is very poor, which is a serious problem when predicting whether an applicant should receive a loan. More class 0 data is needed to improve model performance.
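
To illustrate how high accuracy can coexist with poor recall on the minority class, the sketch below computes both on made-up labels. The counts are invented so that accuracy comes out at 80%; they are not this notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels for illustration only; not the notebook's actual predictions.
# 30 true class 0 cases: 12 predicted correctly, 18 predicted as class 1.
# 70 true class 1 cases: 68 predicted correctly, 2 predicted as class 0.
y_true = [0] * 30 + [1] * 70
y_pred = [0] * 12 + [1] * 18 + [1] * 68 + [0] * 2

accuracy = accuracy_score(y_true, y_pred)
recall_0 = recall_score(y_true, y_pred, pos_label=0)
recall_1 = recall_score(y_true, y_pred, pos_label=1)

print(f"accuracy={accuracy:.2f} recall_0={recall_0:.2f} recall_1={recall_1:.2f}")
# prints: accuracy=0.80 recall_0=0.40 recall_1=0.97
```

Because class 1 dominates, its high recall carries the accuracy figure even though most class 0 cases are missed, which is why per-class recall is worth reporting alongside accuracy.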