
Predicting Card Approval with SmartPredict

1- Introduction

The creation of a platform like SmartPredict, on which we can carry out our artificial intelligence projects from scratch WITHOUT CODING, is an ingenious idea for us data scientists and AI practitioners. Now that it is available for free in its beta version, it is worth a try.
Hence, in this blog post, I walk you through a machine learning project, "Card approval prediction", built with SmartPredict. Let's do it together and you'll see how easy it is. Ready? Let's get started.

2- Presentation of the project

A commercial bank must analyze many pieces of information, in particular credit reports, in order to approve or reject a credit card application. When applications are numerous, analyzing them manually is tedious and time-consuming.
Thanks to machine learning, this task can be automated. In this project, we will build a model that predicts whether a credit card application should be approved or not, based on details provided by the applicant. It is, therefore, a classification problem.

In this project, we use the Credit Card Approval dataset from the UCI Machine Learning Repository.

So, with SmartPredict, let's carry it out in four steps:

First step: we explore and analyze our data with the Dataset Processor app and with the notebook integrated into SmartPredict.

Second step: we preprocess the data with a Processing Pipeline.

Third step: we build and train the model in the Build tab.

Last step: we deploy and test our model in the Deploy tab and the Test tab.

3- Performing the whole project in SmartPredict

First of all, you should create an account and an empty project in SmartPredict, then upload the data.

As the file type is ".data", let's enter the following snippet of code in SmartPredict's notebook to load it directly from the web and give the columns names.

import pandas as pd

# Load the raw ".data" file directly from the UCI repository; it has no header row.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data", header=None)
# Name the 15 anonymized attribute columns and the target column.
df.columns = ['A1','A2','A3','A4','A5','A6','A7','A8','A9','A10','A11','A12','A13','A14','A15','LABEL']
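
As a quick sanity check (a small snippet of my own, not part of the SmartPredict flow), we can confirm that the load produced the expected 690 rows and 16 columns:

# Quick check: 690 applications, 15 attributes plus the LABEL column.
print(df.shape)   # expected: (690, 16)
print(df.head())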

Then, we use the "Dataset save" snippet already available in our notebook to save the data as a CSV under the name "DATA".

from smartpredict import api

# Save the dataframe into the project's datasets under the name "DATA".
dataset = api.dataset_save(dataset=df, name='DATA',
                           dataset_type='csv')

Now, our data, named "DATA", is available among SmartPredict's datasets, as shown in the following picture.

DATA available in SmartPredict's dataset

Step 1: Exploratory Data Analysis with the Dataset Processor app and SmartPredict's notebook

SmartPredict provides several apps that make a project like this one easier, such as the Dataset Processor app, with which we can explore and analyze our dataset without coding.

Let's first have a look at our dataset in the Processing tab.

Visualizing data in the Processing tab

DATA representation in the Processing tab

Our dataset is represented in a table that indicates the type and the quality of each column, as shown in the figure above.

As the UCI repository indicates in the crx.names file, all attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. The dataset thus has 15 attribute columns (which we have named A1, A2, A3, and so forth) and a column indicating whether the applicant was approved (+) or not (-), which we have named "LABEL", with 690 instances representing 690 individuals applying for a credit card.

As shown in the figure above, and with some research, we have the following attribute information:

A1: type String, nominal values (b, a), represents the applicant's sex,
A2: type String, continuous, represents the age (which we will convert to float later in the processing pipeline),
A3: type Float, continuous, represents the debt,
A4: type String, nominal values (u, y, l, t), represents the marital status,
A5: type String, nominal values (p, gg), indicates whether the applicant is a bank customer,
A6: type String, nominal values (c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff), represents the education level,
A7: type String, nominal values (v, h, bb, j, n, z, dd, ff, o), represents the ethnicity,
A8: type Float, continuous, represents the YearsEmployed,
A9: type String, nominal values (t, f), represents the PriorDefault,
A10: type String, nominal values (t, f), indicates whether the applicant is employed,
A11: type Integer, continuous, represents the CreditScore,
A12: type String, nominal values (t, f), represents the DriversLicense,
A13: type String, nominal values (g, p, s), represents the Citizen status,
A14: type String, continuous, represents the ZipCode (which we will also convert to float later),
A15: type Integer, continuous, represents the Income,
LABEL: type String, nominal values (+, -), represents the approval.

Let's look at the summary statistics of the numerical and categorical columns by entering this code in our notebook.

import numpy as np  # needed for np.number

# Summary statistics for the numerical columns.
df.describe(include=[np.number], percentiles=[.5]).transpose().drop("count", axis=1)

Statistical values for the numerical columns: debt, YearsEmployed, CreditScore, and Income respectively

# Summary statistics for the categorical (object-typed) columns.
df.describe(include=[object]).transpose().drop("count", axis=1)

Statistical values for the categorical (object) columns

In the table above, the "unique" column indicates the number of distinct symbols used in each column, and the "top" column shows the most frequent symbol, with its frequency in the "freq" column.

Let's analyze the dataset with the Visualization tab, in which you can visualize your data with pie, bar, scatter, and line charts without coding. For more flexibility, you can also use SmartPredict's notebook.

Let's see the "LABEL" distribution with a bar chart.

Distribution of the LABEL column

As we can see, the "LABEL" column has 307 entries (44.5%) with the "+" value and 383 entries (55.5%) with the "-" value.
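
The same distribution can be reproduced in the notebook with plain pandas (a minimal sketch):

# Count and share of each LABEL value.
print(df['LABEL'].value_counts())                 # '-': 383, '+': 307
print(df['LABEL'].value_counts(normalize=True))   # '-': ~0.555, '+': ~0.445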

Since I have already shown you how it works, I leave you to explore the dataset as you see fit; let's move on to the next step.

Step 2: Preprocessing the Data with a Processing Pipeline

Preprocessing the data is a crucial step to make it suitable for a machine learning model. Depending on the dataset, many processes can be applied to it, such as formatting, cleaning, sampling, normalization, feature engineering, and so forth. Thanks to SmartPredict, many of these processes can be done without worrying about coding, as shown in the GIF below.

List of data preprocessing operations in SmartPredict

As the values in A2 (age) and A14 (zip code) are continuous, let's change the type of these columns to float.

Changing the type of the A2 and A14 columns
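
In plain pandas, this conversion would look roughly like the sketch below. Since A2 and A14 still contain "?" placeholders at this point, I assume a conversion with errors='coerce', which turns them into NaN, rather than a plain astype(float):

# Convert A2 and A14 from strings to floats; '?' placeholders become NaN.
df[['A2', 'A14']] = df[['A2', 'A14']].apply(pd.to_numeric, errors='coerce')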

As indicated in the dataset description in the crx.names file, some attributes in our dataset have missing values represented by "?"; let's clean them.
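
To see how many values are affected, we can count the "?" placeholders per column (run this on the raw data, before the float conversion above turns the A2 and A14 placeholders into NaN):

# Number of '?' placeholders in each column.
print((df == '?').sum())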

There are different ways to deal with missing values, such as:

- deleting the rows that have missing values,

- imputing all missing values with the mean or median of the attribute if the variable is continuous,

- imputing them with the most common value of the attribute if the variable is categorical.

As our dataset has only 690 entries, we will not delete rows with missing values but replace them. Missing values in the A1, A4, A5, A6, and A7 attributes (categorical) will be replaced by the most common value, and missing values in the A2 and A14 attributes (continuous) will be replaced by the mean and the median respectively.
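
In pandas, this imputation strategy might look like the following sketch (assuming A2 and A14 have already been converted to float as above):

# Categorical attributes: replace '?' with the most frequent genuine value.
for col in ['A1', 'A4', 'A5', 'A6', 'A7']:
    df[col] = df[col].replace('?', df.loc[df[col] != '?', col].mode()[0])

# Continuous attributes: fill NaN with the mean (A2) and the median (A14).
df['A2'] = df['A2'].fillna(df['A2'].mean())
df['A14'] = df['A14'].fillna(df['A14'].median())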

The following figures show the configuration of the processing pipeline for these three actions:

Replacing missing values with the most frequent value in the A1, A4, A5, A6, and A7 columns

Replacing missing values with the median value in the A14 column

Replacing missing values with the mean value in the A2 column

These four processing pipelines are now available in the right pane and can be exported as a module, which we have named "data_process", as shown in the GIF below.

Exporting the four processing pipelines under the name "data_process"

Data preprocessing can also be done with the "Data Preprocessing" modules in the Core Modules; we will therefore handle the encoding of the categorical values in the next step.

Step 3: Building and Training the Model in the Build Tab

If you don't know it yet, building and training a machine learning model in SmartPredict consists of dragging and dropping modules from the right pane and configuring them in the Build tab. The figure below shows the flowchart that I have built for this problem.

Flowchart in the Build tab

Let's walk through the idea behind this flowchart.

Every flowchart in the Build tab begins with the dataset module on which our model will be trained, so from the Datasets menu we drag and drop "DATA", our dataset for this card approval project.

Then we drag, drop, and interconnect the "data_process" module available in the Processing Pipelines menu. To ensure there are no remaining missing values in our dataset, we interconnect another preprocessing pipeline that handles them, which I have exported and named "Missing_value_handling".

After that, with the Feature Selector module, we select the columns that constitute the features and the label; this module therefore has two outputs (the feature columns and the label column). Here is its configuration.

Feature Selector's configuration

Then we encode the categorical feature columns with the One Hot Encoder module and the label column with the Ordinal Encoder module. As the output of the Feature Selector module is an array, columns are indicated by their index (number), as shown in the figures below.

One Hot Encoder's configuration

Ordinal Encoder's configuration
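
For readers who want a plain scikit-learn equivalent of this encoding step, here is a minimal sketch. The categorical column indices are derived from the attribute list above and are my own assumption; SmartPredict's modules may differ in detail:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Split the cleaned dataframe into a feature array and a label array.
features = df.drop(columns='LABEL').values
label = df['LABEL'].values

# Indices of the categorical feature columns (A1, A4, A5, A6, A7, A9, A10, A12, A13);
# the numerical columns pass through unchanged.
categorical_idx = [0, 3, 4, 5, 6, 8, 9, 11, 12]
feature_encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_idx)],
    remainder='passthrough')
X = feature_encoder.fit_transform(features)

# Encode the label: with alphabetical ordering, '+' becomes 0 and '-' becomes 1.
y = OrdinalEncoder().fit_transform(label.reshape(-1, 1)).ravel()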

At this point, we have a dataset ready to train our model. Let's split it into training and test sets with the Labeled data splitter module, in which we set the rate to "0.2", meaning that 20% of the data is held out for testing.

Labeled data splitter's configuration
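
The equivalent split in scikit-learn would be something like this (the random seed is an arbitrary choice of mine):

from sklearn.model_selection import train_test_split

# Hold out 20% of the encoded data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)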

As you can imagine, every flowchart in the Build tab ends with three modules:

- the module that represents our machine learning model, which you can choose from the Machine Learning Algorithms drop-down list; for this project we choose the Support Vector Classifier model,

- the Trainer ML models module, which trains our machine learning model and whose inputs are the features and the label of the training data,

- the Evaluator for ML models module, which evaluates our model with a metric of our choice. It receives as inputs the features and the label of the evaluation data.

Figures below show you their configurations.

Support Vector Classifier model's configuration

Trainer ML models' configuration

Evaluator for ML models configured with the accuracy metric

When we run the flowchart, a tooltip indicating the accuracy appears. Here we obtain an accuracy of 0.86.
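
Outside the platform, the same train-and-evaluate step could be sketched with scikit-learn as follows; the default SVC hyperparameters are an assumption, so the exact score will depend on the preprocessing details:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train a support vector classifier and measure accuracy on the held-out data.
model = SVC()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))   # SmartPredict reported 0.86 here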

Note:

As you may have noticed, at the output of the One Hot Encoder and Trainer ML models modules, a module named Item Saver is interconnected. This module is available in the Basic Operations drop-down list. Its function is to save an item so that it becomes available in the Trained Models drop-down list.

We have thus saved the One Hot Encoder encoding module under the name "One_Hot_Card" and our trained machine learning model under the name "SVM_card", as shown in the two figures below.

Item Saver which saves the One Hot Encoder module

Item Saver which saves the trained machine learning model

These saved items will indeed be used in the deployment step. So let's go to the next and last step.

Step 4: Deploying and Testing Our Model in the Deploy Tab and Test Tab

As I said before, with the SmartPredict platform we can carry out our machine learning project from scratch. So we are now at the end of our project, which consists of deploying and testing our model.

Deploying a model in SmartPredict still consists of creating a flowchart, but this time in the Deploy tab, where all modules in the right pane can be used. Once deployed, our model is accessible as a web service, and an access token is generated for you in the Monitor tab.
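
Once deployed, the service can be called over HTTP. The endpoint URL, payload shape, and header below are purely hypothetical placeholders for illustration; use the actual URL and access token shown in your Monitor tab:

import requests

# Hypothetical endpoint and token; replace with the values from the Monitor tab.
url = "https://<your-smartpredict-endpoint>/predict"
token = "<your-access-token>"

# One credit card application as a JSON instance (illustrative values).
instance = {"A1": "b", "A2": 30.83, "A3": 0.0, "A4": "u", "A5": "p",
            "A6": "w", "A7": "v", "A8": 1.25, "A9": "t", "A10": "t",
            "A11": 1, "A12": "f", "A13": "g", "A14": 202.0, "A15": 0}

response = requests.post(url, json=instance, headers={"Authorization": f"Bearer {token}"})
print(response.json())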

No need to keep you in suspense: the figure below shows the flowchart that I built to deploy our model for this project.

Flowchart in the Deploy tab

Every flowchart in the Deploy tab begins with a Web Service IN module and ends with a Web Service OUT module.

The Web Service IN module receives as input a data instance in JSON format and returns it as a dictionary.

We therefore need to convert the data instance into a dataframe with a DataFrame loader/converter module.

After that, we use the Feature Selector module to select the features we need from the data instance.

Features with categorical values must be encoded, so we use a One Hot Encoder module that we no longer need to configure: it receives as input the "One_Hot_Card" encoder saved earlier.

Finally, our web service is meant to predict the output for the given instance, so a flowchart in the Deploy tab always includes the Predictor ML models module. It receives as input the trained model "SVM_card" that we saved earlier.
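
Conceptually, the deploy flowchart performs the same steps as this local sketch, which reuses the feature_encoder and model objects from the earlier scikit-learn sketches as stand-ins for the saved One_Hot_Card and SVM_card items:

import pandas as pd

def predict_application(instance_dict):
    # Convert the JSON-style dictionary (keys A1..A15, in order) into a one-row dataframe.
    row = pd.DataFrame([instance_dict])
    # Apply the saved one-hot encoding, then the trained model.
    encoded = feature_encoder.transform(row.values)
    return model.predict(encoded)[0]   # 0 means the application is approved ('+')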

The figures below show the configurations of the DataFrame loader/converter and Feature Selector modules; the other modules are left at their default configuration.

DataFrame loader/converter's configuration in the deploy flowchart

Feature Selector's configuration in the deploy flowchart

Once the modules are interconnected, we just have to click on the rocket icon and go to the Test tab.

In the Test tab, a data instance is already provided; its columns should be arranged in order of their index. When we click on the arrow icon, our deployed model makes a prediction and the output is displayed.

As shown in the figure below, the output is "0", which indicates that the client's credit card application will be approved.

Test result in Test Tab

4- Conclusion

I hope that through this blog post I was able to show you how to carry out an AI project using the SmartPredict platform, and that this ingeniously conceived platform now makes it easy.

Thanks for reading, see you in another blog.
