LendingClub Loan Default Prediction, Deep Learning

The Data

We will be using a subset of the LendingClub dataset obtained from Kaggle: https://www.kaggle.com/wordsforthewise/lending-club

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.

Goal

Given historical data on loans, including whether or not the borrower defaulted (charge-off), can we build a model that can predict whether or not a borrower will pay back their loan? That way, when a new potential customer applies in the future, we can assess whether they are likely to pay back the loan. Keep classification metrics in mind when evaluating the performance of your model!

The "loan_status" column contains our label.

Data Overview



There are many LendingClub data sets on Kaggle. Here is the information on this particular data set:

# | LoanStatNew | Description
0 | loan_amnt | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
1 | term | The number of payments on the loan. Values are in months and can be either 36 or 60.
2 | int_rate | Interest Rate on the loan
3 | installment | The monthly payment owed by the borrower if the loan originates.
4 | grade | LC assigned loan grade
5 | sub_grade | LC assigned loan subgrade
6 | emp_title | The job title supplied by the Borrower when applying for the loan.
7 | emp_length | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
8 | home_ownership | The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
9 | annual_inc | The self-reported annual income provided by the borrower during registration.
10 | verification_status | Indicates if income was verified by LC, not verified, or if the income source was verified
11 | issue_d | The month which the loan was funded
12 | loan_status | Current status of the loan
13 | purpose | A category provided by the borrower for the loan request.
14 | title | The loan title provided by the borrower
15 | zip_code | The first 3 numbers of the zip code provided by the borrower in the loan application.
16 | addr_state | The state provided by the borrower in the loan application
17 | dti | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
18 | earliest_cr_line | The month the borrower's earliest reported credit line was opened
19 | open_acc | The number of open credit lines in the borrower's credit file.
20 | pub_rec | Number of derogatory public records
21 | revol_bal | Total credit revolving balance
22 | revol_util | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
23 | total_acc | The total number of credit lines currently in the borrower's credit file
24 | initial_list_status | The initial listing status of the loan. Possible values are – W, F
25 | application_type | Indicates whether the loan is an individual application or a joint application with two co-borrowers
26 | mort_acc | Number of mortgage accounts.
27 | pub_rec_bankruptcies | Number of public record bankruptcies

Preparation

Section 1: Exploratory Data Analysis

Section Goal: Get an understanding of which variables are important, view summary statistics, and visualize the data.

Countplot of loan_status

Histogram of the loan_amnt column.

Explore correlation between the continuous feature variables.

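Note the almost perfect correlation between the "installment" feature and another feature (most plausibly loan_amnt, since the monthly payment is derived from the loan amount). Explore this feature further.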
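A minimal sketch of the correlation check with pandas. The rows below are invented for illustration and are not taken from the LendingClub file; only the column names come from the data overview.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the LendingClub file).
df = pd.DataFrame({
    "loan_amnt":   [5000, 10000, 15000, 20000, 25000],
    "installment": [160.0, 325.0, 480.0, 650.0, 800.0],
    "int_rate":    [13.5, 7.9, 11.2, 16.0, 9.4],
    "grade":       ["C", "A", "B", "D", "B"],
})

# Restrict to numeric columns before computing pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()

# Correlation of every numeric feature with installment, strongest first.
print(corr["installment"].sort_values(ascending=False))
```

The resulting matrix can be passed to a heatmap function for a visual check.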

Explore loan_status and the Loan Amount.

Explore the Grade and SubGrade columns that LendingClub attributes to the loans.

Countplot per grade.

Countplot per subgrade.

Isolate the F and G subgrades, as these loans don't get paid back often.
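One way to isolate these grades, sketched on invented rows. The status strings "Fully Paid" and "Charged Off" are an assumption about how loan_status is encoded.

```python
import pandas as pd

# Toy data, invented for illustration only.
# ASSUMPTION: loan_status uses the strings "Fully Paid" and "Charged Off".
df = pd.DataFrame({
    "grade":       ["A", "F", "G", "B", "F", "G"],
    "sub_grade":   ["A1", "F2", "G1", "B3", "F1", "G1"],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid",
                    "Fully Paid", "Fully Paid", "Charged Off"],
})

# Keep only the rows whose grade is F or G.
f_and_g = df[df["grade"].isin(["F", "G"])]

# Count loans per subgrade and status, with subgrades in alphabetical order.
counts = (
    f_and_g.groupby(["sub_grade", "loan_status"])
    .size()
    .unstack(fill_value=0)
    .sort_index()
)
print(counts)
```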

Map loan_status to 1 and 0 in a new loan_repaid column (1 if the loan was repaid, 0 if it was charged off).

Show the correlation of the numeric features to the loan_repaid column.
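A sketch of the mapping and the correlation with the new label, on invented rows. The label strings and the 1/0 direction (1 = repaid) are assumptions based on the column name loan_repaid.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid",
                    "Fully Paid", "Charged Off"],
    "int_rate":    [7.5, 18.2, 9.1, 11.0, 21.4],
    "annual_inc":  [85000, 32000, 60000, 72000, 28000],
})

# ASSUMPTION: "Fully Paid" maps to 1 and "Charged Off" maps to 0.
df["loan_repaid"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

# Correlation of each numeric feature with the label, excluding the label itself.
corr_with_label = (
    df.select_dtypes(include="number")
    .corr()["loan_repaid"]
    .drop("loan_repaid")
    .sort_values()
)
print(corr_with_label)
```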

Section 2: Data PreProcessing

Section Goals: Remove or fill any missing data. Remove unnecessary or repetitive features. Convert categorical string features to dummy variables.

Missing Data

See if we should keep, discard, or fill in the missing data.

Length of the dataframe

Display the total count of missing values per column.

Percentage of missing data
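The three steps above (length, missing count, missing percentage) can be sketched as follows. The rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "loan_amnt": [5000, 10000, 15000, 20000],
    "emp_title": ["Teacher", None, None, "Nurse"],
    "mort_acc":  [0.0, np.nan, 2.0, 1.0],
})

n_rows = len(df)                          # length of the dataframe
missing_count = df.isnull().sum()         # missing values per column
missing_pct = 100 * missing_count / n_rows  # same, as a percentage of all rows

print(pd.DataFrame({"missing": missing_count, "percent": missing_pct}))
```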

Check if we can drop emp_title and emp_length

Unique employment job titles

Too many unique titles to convert to dummy variables. Drop emp_title.

Countplot of the emp_length.

Separate countplot with loan_status

Percentage of charge offs per category.
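A sketch of the per-category charge-off percentage. The rows and the emp_length strings are invented for illustration, and "Charged Off" is an assumed label value.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "emp_length":  ["< 1 year", "< 1 year", "10+ years",
                    "10+ years", "10+ years", "3 years"],
    "loan_status": ["Charged Off", "Fully Paid", "Fully Paid",
                    "Fully Paid", "Charged Off", "Fully Paid"],
})

# Boolean flag per loan: True when the loan was charged off.
charged_off = df["loan_status"].eq("Charged Off")

# The mean of a boolean flag within each group is the charge-off fraction.
pct_charged_off = 100 * charged_off.groupby(df["emp_length"]).mean()
print(pct_charged_off)
```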

Default percentages are extremely similar across all employment lengths. Drop emp_length.

Revisit missing data.

Check title and purpose column

The title column basically repeats the information in the purpose column. Drop title.

Handle mort_acc

value_counts of mort_acc.

Fill in mort_acc by mean imputation

Calculate the mean value for the mort_acc per total_acc entry and fill
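A minimal sketch of this group-wise mean imputation, on invented rows. Using groupby with transform is one possible way to do it, not necessarily the one used in the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "total_acc": [10, 10, 10, 20, 20, 20],
    "mort_acc":  [1.0, 3.0, np.nan, 4.0, np.nan, 6.0],
})

# Mean mort_acc for each total_acc value, aligned back to the original rows.
# Missing values are ignored when the mean is computed.
group_mean = df.groupby("total_acc")["mort_acc"].transform("mean")

# Fill only the missing entries with the mean of their total_acc group.
df["mort_acc"] = df["mort_acc"].fillna(group_mean)
print(df)
```

If a total_acc group has no observed mort_acc at all, its mean is NaN and those rows stay missing, so a fallback would be needed in that case.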

revol_util and pub_rec_bankruptcies have missing data points, but these account for less than 0.5% of the total data. Remove the rows with missing values.

Categorical Variables and Dummy Variables

List all currently non-numeric columns.
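One way to list them, sketched on invented rows:

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "loan_amnt": [5000, 10000],
    "term":      [" 36 months", " 60 months"],
    "grade":     ["A", "C"],
    "int_rate":  [7.5, 13.2],
})

# Every column whose dtype is not numeric still needs encoding or dropping.
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```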

term feature

grade feature

Convert the subgrade into dummy variables and concatenate
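A sketch of the dummy conversion and concatenation, on invented rows. The drop_first=True setting is an assumption: it drops one category to avoid a redundant column and is not stated in the text above.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "loan_amnt": [5000, 10000, 15000],
    "sub_grade": ["A1", "B2", "A1"],
})

# ASSUMPTION: drop_first=True, so the first category (A1) becomes the baseline.
dummies = pd.get_dummies(df["sub_grade"], drop_first=True)

# Replace the original string column with its dummy columns.
df = pd.concat([df.drop("sub_grade", axis=1), dummies], axis=1)
print(df)
```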

Convert verification_status, application_type, initial_list_status, purpose into dummy variables and concatenate

home_ownership

Replace NONE and ANY with OTHER. Convert into dummy variables and concatenate
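The replacement step can be sketched as follows, on invented rows. The "home" prefix for the dummy columns is a hypothetical naming choice.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "home_ownership": ["RENT", "NONE", "MORTGAGE", "ANY", "OWN", "OTHER"],
})

# Fold the rare NONE and ANY categories into OTHER before encoding.
df["home_ownership"] = df["home_ownership"].replace(["NONE", "ANY"], "OTHER")

# "home" is a hypothetical prefix, chosen here only to keep column names readable.
dummies = pd.get_dummies(df["home_ownership"], prefix="home")
df = pd.concat([df.drop("home_ownership", axis=1), dummies], axis=1)
print(df.columns.tolist())
```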

Address

Convert the zip_code column into dummy variables and concatenate

Issue_d

earliest_cr_line

TASK: Drop the loan_status column, since loan_repaid now holds the label.

Section 3: Train Test Split and Normalizing

Import train_test_split from sklearn.

Set X and y variables to the .values of the features and label.

Normalizing the Data.
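A sketch of the split and normalization on random stand-in arrays. MinMaxScaler, the 20% test size, and the random_state value are assumptions; the text above only says the data is normalized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random stand-in for the feature matrix and label vector (not real loan data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(100, 3))
y = rng.integers(0, 2, size=100)

# ASSUMPTION: 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# Fit the scaler on the training set only, then apply it to both sets,
# so no information from the test set leaks into the scaling parameters.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```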

Section 4: Creating the Model and Evaluation

Import libraries

Build a sequential model to train on the data, with Dropout layers.

TASK: Fit the model with validation data (for later plotting) and an early stopping callback.
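A minimal sketch of such a model with Keras, trained on random stand-in data. The layer sizes, dropout rate, patience, epoch count, and the use of the test split as validation data are all assumptions made for illustration, not the settings of the original notebook.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Sequential

# Random stand-in data with a learnable binary label (not real loan data).
rng = np.random.default_rng(0)
X_train = rng.random((200, 8)).astype("float32")
y_train = (X_train[:, 0] > 0.5).astype("float32")
X_test = rng.random((50, 8)).astype("float32")
y_test = (X_test[:, 0] > 0.5).astype("float32")

# ASSUMPTION: two hidden layers, each followed by 20% dropout.
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(16, activation="relu"),
    Dropout(0.2),
    Dense(8, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),  # single probability for binary classification
])
model.compile(loss="binary_crossentropy", optimizer="adam")

# Stop when validation loss has not improved for 3 epochs, keeping the best weights.
early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_data=(X_test, y_test),
    callbacks=[early_stop],
    verbose=0,
)

# Per-epoch training and validation loss, ready for plotting.
losses = history.history
```

The losses dictionary holds the "loss" and "val_loss" series used for the training versus validation loss plot.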

Plot out the validation loss versus the training loss.

Create predictions from the X_test.

classification report and confusion matrix
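A sketch of the evaluation step with scikit-learn. The labels and predicted probabilities are invented for illustration, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Invented true labels and predicted probabilities (1 = repaid, 0 = charged off).
y_test = np.array([1, 1, 1, 1, 0, 0, 1, 0])
probs = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.6, 0.95, 0.1])

# ASSUMPTION: a probability above 0.5 is classified as repaid.
predictions = (probs > 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, predictions)
print(cm)
print(classification_report(y_test, predictions))
```

With an imbalanced label, precision and recall on the minority class are more informative than overall accuracy.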

Deploy model on a random customer

Thank you