Predicting Card Approval with SmartPredict
The creation of a platform like SmartPredict, which lets us carry out Artificial Intelligence projects from scratch WITHOUT CODING, is an ingenious idea for data scientists and AI practitioners. Now that it is available for free in its beta version, it is worth a try.
Hence, in this blog post, I walk you through a machine learning project, "Card approval prediction", built with SmartPredict. Let's do it together and you'll see how easy it is. Ready? Let's get started.
2- Presentation of the project
A commercial bank must analyze many pieces of information, in particular credit reports, in order to decide whether to approve a credit card application. When applications are numerous, analyzing them manually is tedious and time-consuming.
Thanks to machine learning, this task can be automated. In this project, we will build a model that predicts whether a credit card application should be approved, based on details provided by the applicant. It is, therefore, a classification problem.
In this project, we use the Credit Card Approval dataset from the UCI Machine Learning Repository.
With SmartPredict, let's perform it in four steps:
- First step: explore and analyze the data with the Dataset Processor app and the notebook integrated into SmartPredict,
- Second step: preprocess the data with a Processing Pipeline,
- Third step: build and train the model in the Build Tab,
- Last step: deploy and test the model in the Deploy Tab and Test Tab.
3- Performing the whole project in SmartPredict
First of all, you should create an account and an empty project in SmartPredict and upload data.
As the data file has the ".data" extension, let's enter the following snippet in SmartPredict's notebook to load it directly from the web and assign column names.

import pandas as pd

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data",
                 header=None)
df.columns = ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10',
              'A11', 'A12', 'A13', 'A14', 'A15', 'LABEL']
Then, we use the "Dataset save" snippet already available in the notebook to save the data as a CSV under the name "DATA".

from smartpredict import api

dataset = api.dataset_save(dataset=df, name='DATA', dataset_type='csv')
Now, our data named "DATA" is available in SmartPredict's dataset as shown in the following picture.
DATA available in SmartPredict's dataset
Step 1: Exploratory Data Analysis in Dataset Processor app and with SmartPredict's notebook
SmartPredict provides several apps that help carry out a project, such as the Dataset Processor app, with which we can explore and analyze our dataset without coding.
Let's have a view of our dataset in the Processing Tab first.
Visualizing data in Processing Tab
DATA representation in processing Tab
Our dataset is displayed as a table that indicates the type and the quality of each column, as shown in the figure above.
As indicated in the UCI repository's crx.names file, all attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. The dataset therefore has 15 attribute columns (which we have named A1, A2, A3, and so forth), plus a column, which we have named "LABEL", indicating whether the client's application is approved (+) or not (-). It contains 690 instances, each representing an individual applying for a credit card.
As shown in the figure above, and based on common interpretations of this dataset, the attributes are as follows:
A1: type String, nominal value (b, a) represents the client's sex,
A2: type String, continuous, represents the age (we will convert it to float later in the processing pipeline),
A3: type Float, continuous, represents debt,
A4: type String, nominal value (u, y, l, t), represents the marital status,
A5: type String, nominal value (p, gg), represents if the client is a Bank customer,
A6: type String, nominal value (c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff), represents the education level,
A7: type String, nominal value (v, h, bb, j, n, z, dd, ff, o), represents the Ethnicity,
A8: type Float, continuous, represents the YearsEmployed,
A9: type String, nominal value (t, f), represents the PriorDefault,
A10: type String, nominal value (t, f), represents if the client is employed,
A11: type Integer, continuous, represents the CreditScore,
A12: type String, nominal (t, f), represents the DriversLicense,
A13: type String, nominal (g, p, s), represents the Citizen,
A14: type String, continuous, represents the ZipCode (that we should convert to float type later),
A15: type Integer, continuous, represents the Income,
LABEL: type String, nominal (+,-), represents the approval.
Let's look at summary statistics for the numerical and categorical columns by entering the following code in our notebook.

import numpy as np

dataset.describe(include=[np.number], percentiles=[.5]) \
    .transpose().drop("count", axis=1)
Statistical value with numerical columns
dataset.describe(include=['object']).transpose() \
    .drop("count", axis=1)
Statistical value with column with an object value
In the table above, the "unique" column indicates the number of distinct symbols used in each column, the "top" column shows the most frequent symbol, and the "freq" column gives its frequency.
Let's analyze the dataset with the Visualization Tab, in which you can visualize your data with pie, bar, scatter, and line charts without coding. For more flexibility, you can also use SmartPredict's notebook.
Let's see the "LABEL" distribution with a bar chart.
Distribution of the LABEL column
As we can see, the "LABEL" column has 307 entries (44.5%) with the "+" value and 383 entries (55.5%) with the "-" value.
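The same distribution can be cross-checked with plain pandas in the notebook. Here is a minimal sketch on a hypothetical sample; the real LABEL column holds the 690 entries loaded earlier.

```python
import pandas as pd

# Hypothetical sample standing in for the real LABEL column.
label = pd.Series(['+', '-', '-', '+', '-', '-', '+', '-'])

counts = label.value_counts()                 # absolute counts per class
shares = label.value_counts(normalize=True)   # relative frequencies

print(counts['-'], counts['+'])   # 5 3
print(round(shares['-'], 3))      # 0.625
```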
As I have already shown you how it works, I leave you to explore the dataset as you see fit, but let's take the next step.
Step 2: Preprocessing Data with Processing Pipeline
Preprocessing is a crucial step to make the data suitable for a machine learning model. Depending on the dataset, many operations may apply, such as formatting, cleaning, sampling, normalization, feature engineering, and so forth. Thanks to SmartPredict, many of these can be done without worrying about coding, as shown in the gif below.
List of Preprocessing data in SmartPredict
As the values in A2 (age) and A14 (Zip Code) are continuous, let's change the type of these columns to float.
Changing A2, A14 columns' type
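Outside the pipeline, the equivalent conversion can be sketched in pandas; `pd.to_numeric` with `errors='coerce'` also turns the "?" markers into NaN. The frame below is toy data whose column names merely mirror the real ones.

```python
import pandas as pd

# Toy frame standing in for the real dataset: A2 and A14 arrive as
# strings, with '?' marking missing values (as in crx.data).
df = pd.DataFrame({'A2': ['30.83', '?', '24.50'],
                   'A14': ['00202', '00043', '?']})

# errors='coerce' maps '?' to NaN and parses the rest as floats,
# mirroring the pipeline's type-change step.
for col in ['A2', 'A14']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print(df.dtypes.tolist())          # two float64 columns
print(int(df['A2'].isna().sum()))  # 1
```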
As indicated in the dataset description in the crx.names file, some attributes have missing values represented by "?", so let's clean them up.
There are different ways to deal with missing values, such as:
- deleting the rows that have missing values,
- imputing all missing values with the mean or median of the attribute if the variable is continuous,
- imputing them with the most common value of the attribute if the variable is categorical.
As our dataset has only 690 entries, we will not delete rows with missing values but replace them instead. Missing values in the categorical attributes A1, A4, A5, A6, and A7 will be replaced by the most common value, while missing values in the continuous attributes A2 and A14 will be replaced by the mean and the median respectively.
The following figures show the configuration of the processing pipeline for these three actions :
Replacing missing value by the frequent value in A1, A4, A5, A6, A7 columns
Replacing missing value by median value in A14 column
Replacing missing value by mean value in A2 column
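In plain pandas, the three imputation actions configured above can be sketched as follows. This is a toy frame, not the real dataset; only the mode/mean/median choices mirror the pipeline configuration.

```python
import numpy as np
import pandas as pd

# Toy frame with '?' marking missing categorical values, as in crx.data.
df = pd.DataFrame({
    'A1':  ['b', 'a', '?', 'b'],            # categorical -> mode
    'A2':  [30.83, np.nan, 24.50, 21.00],   # continuous  -> mean
    'A14': [202.0, 43.0, np.nan, 120.0],    # continuous  -> median
})

df['A1'] = df['A1'].replace('?', np.nan)

# Categorical: impute with the most frequent value;
# continuous: impute with the mean or the median.
df['A1'] = df['A1'].fillna(df['A1'].mode()[0])
df['A2'] = df['A2'].fillna(df['A2'].mean())
df['A14'] = df['A14'].fillna(df['A14'].median())

print(int(df.isna().sum().sum()))  # 0: no missing values remain
```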
These four processing steps are available in the right pane and can be exported as a module, which we have named "data_process", as shown in the gif below.
Exporting processing pipelines under the name "data_process"
Data preprocessing can also be done with the "Data Preprocessing" modules in the Core Module; we will handle the encoding of the categorical values in the next step.
Step 3: Building and Training Model in Build Tab
If you don't know it yet, building and training a machine learning model in SmartPredict consists of dragging and dropping modules from the right pane and configuring them in the Build tab. The figure below shows the flowchart that I have built for this problem.
Flowchart in build tab
Let's walk through the idea behind this flowchart.
Every flowchart in the Build tab begins with the dataset module on which the model will be trained, so from the Datasets menu we drag and drop "DATA", our dataset for this card approval project.
Then we drag, drop, and interconnect the "data_process" module available in the Processing pipeline menu. To ensure no missing values remain in the dataset, we interconnect another preprocessing pipeline that handles them, which I have exported and named "Missing_value_handling".
After that, with the Feature Selector module, we select the columns that constitute the features and the label; this module therefore has two outputs (the feature columns and the label column). Here is its configuration.
Feature selector's configuration
Then we encode the categorical feature columns with the One Hot Encoder module, and the label column with the Ordinal Encoder module. As the output of the Feature Selector module is an array, columns are referred to by their index (number), as shown in the figure below.
One Hot Encoder's configuration
Ordinal Encoder's configuration
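If you want to reproduce these two encoders outside SmartPredict, scikit-learn offers equivalents. The snippet below is a sketch on toy values, not the platform's internal implementation.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical feature column and toy labels.
X_cat = np.array([['b'], ['a'], ['b']])
y = np.array([['+'], ['-'], ['+']])

# One-hot: one output column per distinct category.
onehot = OneHotEncoder()
X_enc = onehot.fit_transform(X_cat).toarray()

# Ordinal: maps each label to an integer code.
ordinal = OrdinalEncoder()
y_enc = ordinal.fit_transform(y)

print(X_enc.shape)                  # (3, 2): categories 'a' and 'b'
print(y_enc.ravel().tolist())       # '+' sorts before '-', so [0.0, 1.0, 0.0]
```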
At this point, we have a dataset ready to train our model. Let's split it into training and test sets with the Labeled data splitter module, in which the rate "0.2" indicates the proportion of data held out for testing.
Labeled data splitter's configuration
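The same 0.2 split can be sketched with scikit-learn's `train_test_split` (toy arrays here; `test_size=0.2` holds out 20% of the rows for evaluation):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy features: 10 rows
y = np.array([0, 1] * 5)           # toy labels

# test_size=0.2 matches the rate set in the Labeled data splitter.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))   # 8 2
```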
As you can imagine, every flowchart in the Build tab ends with three modules:
- the module representing our machine learning model, chosen from the Machine Learning Algorithms drop-down list; for this project we choose the Support Vector Classifier,
- the Trainer ML models module, which trains our model and takes as inputs the features and label of the training data,
- the Evaluator for ML models module, which evaluates our model with a metric of our choice; it takes as inputs the features and label of the evaluation data.
Figures below show you their configurations.
Support Vector Classifier model's configuration
Trainer ML model's configuration
Evaluator for ML models configured with accuracy metric
When we run the flowchart, a tooltip indicating the accuracy appears. Here we obtain an accuracy of 0.86.
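As a rough outside-the-platform analogue of the model/Trainer/Evaluator trio, here is a scikit-learn sketch. It uses a bundled stand-in dataset, since the card data requires a download, so its score will differ from the 0.86 obtained above.

```python
from sklearn.datasets import load_breast_cancer   # stand-in binary dataset
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = SVC()                                       # Support Vector Classifier
clf.fit(X_train, y_train)                         # the Trainer step
acc = accuracy_score(y_test, clf.predict(X_test)) # the Evaluator step, accuracy metric

print(0.0 <= acc <= 1.0)  # True
```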
As you may notice, a module named Item Saver is connected to the output of the One Hot Encoder module and of the Trainer ML models module. This module, available in the Basic Operation drop-down list, saves an item so that it becomes available in the Trained Models drop-down list.
We have saved the fitted One Hot Encoder under the name "One_Hot_Card" and our trained machine learning model under the name "SVM_card", as shown in the two figures below.
Item Saver which saves the one-hot encoder module
Item Saver which saves the trained machine learning model module
These saved items will be used in the deployment step. So let's move on to the last step.
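A plausible stand-alone analogue of the Item Saver step is persisting fitted objects with joblib (which ships with scikit-learn). The file name below only mirrors the name we chose in SmartPredict.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import OneHotEncoder

# Fit a small encoder, then persist it so a deployment flow can
# reload it unchanged (as the Deploy flowchart does with One_Hot_Card).
enc = OneHotEncoder().fit([['b'], ['a']])

path = os.path.join(tempfile.mkdtemp(), 'One_Hot_Card.joblib')
joblib.dump(enc, path)            # save
restored = joblib.load(path)      # reload at deployment time

print(restored.categories_[0].tolist())  # ['a', 'b']
```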
Step 4: Deploying and testing our model in Deploy Tab and Test Tab
As I said before, with the SmartPredict platform we can carry out a machine learning project from scratch. We are now at the end of our project, which consists of deploying and testing our model.
Deploying a model in SmartPredict still consists of creating a flowchart, this time in the Deploy tab, where all modules in the right pane can be used. Once deployed, our model is accessible as a web service, and an access token is generated for you in the Monitor tab.
No need to keep you in suspense: the figure below shows the flowchart that I built to deploy the model for this project.
Our flowchart in Deploy Tab
Every flowchart in the Deploy tab begins with a Web Service IN module and ends with a Web Service OUT module.
The Web Service IN module receives a data instance in JSON format and passes it along as a dictionary.
We therefore need to convert the data instance into a DataFrame with a DataFrame loader/converter module.
After that, we use the Feature Selector module to select the features we need from the data instance.
Features with categorical values must be encoded, so we use a One Hot Encoder module that we no longer configure; instead, it receives as input the One_Hot_Card encoder that we saved before.
Finally, our web service is intended to predict the output for the instance, so the flowchart in the Deploy tab must include the Predictor ML models module. It receives as input the trained model SVM_card that we saved before.
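Putting the deploy flowchart's steps together outside the platform, a sketch might look like this, with a tiny stand-in model instead of the saved SVM_card and a hypothetical JSON payload:

```python
import json

import pandas as pd
from sklearn.svm import SVC

# Tiny stand-in model (the real flow loads the saved SVM_card).
X_train = pd.DataFrame({'A2': [22.0, 35.0, 28.0, 50.0],
                        'A15': [0, 1200, 100, 5000]})
y_train = [1, 0, 1, 0]
model = SVC().fit(X_train, y_train)

# A JSON payload like the one Web Service IN would receive.
payload = '{"A2": 30.0, "A15": 800}'
instance = pd.DataFrame([json.loads(payload)])   # DataFrame loader/converter step

pred = model.predict(instance)[0]                # Predictor ML models step
print(pred in (0, 1))  # True: the model outputs an encoded label
```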
The figures below show the configurations of the DataFrame loader/converter and Feature Selector modules; the other modules keep their default configuration.
DataFrame loader/converter's configuration in deploy flowchart
Feature Selector's configuration in the deploy flowchart
Once the modules are interconnected, we just have to click the rocket icon and go to the Test tab.
In the Test tab, a data instance is already provided; its columns should be arranged in the order of their index. When we click the arrow icon, our deployed model makes a prediction and the output is displayed.
As shown in the figure below, the output is "0", which indicates that the client's card application will be approved.
Test result in Test Tab
I hope that through this blog post I was able to show how to carry out an AI project using the SmartPredict platform, and that it is now easy thanks to this platform born of an ingenious idea.
Thanks for reading, see you in another blog.