In this blog post, we are going to predict insurance charges from people's personal data. The original use case and dataset are available on Kaggle, where many solutions have already been suggested. What differentiates ours is that, this time, we are going to solve it without writing a single line of code. Can you imagine it? Without code!
Yes, thanks to SmartPredict, your partner for low-code, high-level Data Science and Machine Learning, we are going to unravel this problem in the easiest possible way. Without further ado, let's go!
N.B: The project is available for you to try in the sample projects menu.
To solve such a use case, we first need to divide it into steps that we will tackle one by one. I will guide you through this by connecting the steps you would usually take with traditional code-based methods to SmartPredict's condensed, high-level visual flowcharting approach. Let's define these steps beforehand.
- Problem definition
- Data exploration and visualization
- Data pre-processing
- Feature engineering
- Model building, parametrization and training
- Model deployment
- Prediction and test
To begin with, let's create our project and name it Medical Cost Prediction. Afterwards, let's upload our dataset by fetching it from the Kaggle repository.
1. Problem definition
We have observed that certain personal characteristics influence the insurance cost, namely a particular body state, whether a person smokes or not, etc.
Therefore, we are going to take these variables into account. These estimates also serve as parameters for setting reasonable prices at the higher and lower ends of yearly premiums.
In other words, we are facing a regression problem.
2. Data exploration and visualization
To ease our task and get an overview of our data, let's visualize our train dataset with the help of the data visualizer module. Based on the Pandas profiling tool, it is a handy, portable module for inspecting a dataset's content.
From the processing menu, we can see the column names, whereas from the profiling menu we can see the quality of the data as well as its distribution, from which we can already derive essential insights.
So, we have 1070 rows and 7 columns: age, sex, bmi, children, smoker, region, charges.
1. age: age of the primary beneficiary
2. sex: insurance contractor gender (female, male)
3. bmi: Body Mass Index, an index of body weight relative to height, ideally in the range of 18.5 to 24.9
4. children: number of children under the responsibility and covered by health insurance
5. smoker: smoking habit or not
6. region: the beneficiary's dwelling area: northeast, southeast, southwest, or northwest
7. charges: individual medical costs billed by health insurance
These columns constitute the input features. As our goal is to predict insurance costs, then charges represent our target feature.
For a detailed description of the dataset, we can consult either Kaggle or the markdown shown when opening the sample project.
We may also wish to have a glimpse of the descriptive statistics, so let's check the summary from the profiling tab.
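For readers curious about what the profiling module computes under the hood, here is a minimal pandas sketch. The rows below are illustrative only, not the real Kaggle data; only the column names match the dataset.

```python
import pandas as pd

# A tiny sample in the same shape as the insurance dataset
# (column names match the real data; the rows are made up).
df = pd.DataFrame({
    "age": [19, 33, 45, 62],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 26.3],
    "children": [0, 1, 2, 0],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "northeast", "southeast", "northwest"],
    "charges": [16884.92, 4449.46, 8606.22, 27808.73],
})

print(df.shape)                      # (rows, columns)
print(df.describe())                 # descriptive statistics, numeric columns
print(df["smoker"].value_counts())   # distribution of a categorical feature
```

`describe()` and `value_counts()` together cover most of what the profiling tab shows: counts, means, quartiles, and category frequencies.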
As for the categorical features, the number of subjects in each category is almost the same, except for smoker, where there are fewer smokers than non-smokers.
From the distributions, we can already see that smokers incur markedly higher charges than non-smokers. Thus, the feature smoker has the strongest impact on the medical charges.
Charges also grow with age, bmi, and the number of children.
Finally, the correlation matrix confirms the strong correlation between smoker and charges.
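Outside SmartPredict, the same correlation check can be reproduced with pandas. A small sketch on illustrative data, with smoker encoded as 0/1 so it can enter the correlation matrix:

```python
import pandas as pd

# Illustrative rows only; smokers are given visibly higher charges
df = pd.DataFrame({
    "age": [19, 33, 45, 62, 28, 51],
    "bmi": [27.9, 22.7, 30.1, 26.3, 33.0, 25.8],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "charges": [16884.92, 4449.46, 8606.22, 27808.73, 4687.80, 23563.02],
})

# Encode smoker as 0/1 so it can appear in the correlation matrix
df["smoker"] = (df["smoker"] == "yes").astype(int)

corr = df.corr()
print(corr["charges"].sort_values(ascending=False))
```

On data shaped like this, smoker shows by far the highest correlation with charges, matching what the profiling view suggests.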
3. Data pre-processing
With SmartPredict, all the libraries we need are included inside modules, from the simplest to the most complex. Machine learning algorithms are nested inside them, each serving a specific role. We no longer need to import libraries, but if we do, we can still add some thanks to custom modules.
To preprocess our dataset, let's first check whether there are any duplicated or missing values. We notice at a glance that each feature has already been assigned its correct type.
To do so, we are going to use the data processor's functions, such as the feature selector.
Let's insert all the features into the selector.
Indeed, there is one pair of duplicates, so we let the module remove it. Fortunately, there are no missing values at all.
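In plain pandas, the duplicate and missing-value check that the module performs would look roughly like this (illustrative rows only, with one duplicate planted on purpose):

```python
import pandas as pd

# Illustrative sample with one deliberately duplicated row
df = pd.DataFrame({
    "age": [19, 33, 33, 62],
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [16884.92, 4449.46, 4449.46, 27808.73],
})

print(df.duplicated().sum())   # number of fully duplicated rows
print(df.isna().sum())         # missing values per column
df = df.drop_duplicates()      # remove duplicates, as the module does
print(len(df))
```

Here the planted duplicate is detected and dropped, and `isna().sum()` confirms there are no missing values to impute.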
Then, as categorical features are present, we need to use the categorical encoder. In the settings, let's just specify them, then save to apply the changes.
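The categorical encoder's job corresponds, in code, to one-hot encoding the text columns. A minimal pandas sketch on illustrative rows:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["female", "male"],
    "smoker": ["yes", "no"],
    "region": ["southwest", "northeast"],
    "bmi": [27.9, 22.7],
})

# One-hot encode the categorical columns; each category becomes
# its own indicator column, while numeric columns pass through
encoded = pd.get_dummies(df, columns=["sex", "smoker", "region"])
print(encoded.columns.tolist())
```

After encoding, every feature is numeric, which is what the downstream model expects.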
Apart from that, we also need to normalize our data, so let's do that with the data processor and select the "normalize" operation.
To divide our data into random train and test sets, we use the dataset splitter. We can set the percentage of each.
Let's drag and drop both of these modules into the build workspace then link them to each other.
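For reference, the code equivalent of these two modules in scikit-learn would be along these lines; note that the scaler is fitted on the train split only, to avoid leaking information from the test set (the rows below are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age": [19, 33, 45, 62, 28, 51],
    "bmi": [27.9, 22.7, 30.1, 26.3, 33.0, 25.8],
    "charges": [16884.92, 4449.46, 8606.22, 27808.73, 4687.80, 23563.02],
})

X = df[["age", "bmi"]]
y = df["charges"]

# 80/20 split, mirroring the dataset splitter's percentage setting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features to [0, 1]; fit on the train split only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.min(), X_train_scaled.max())
```

Connecting the splitter after the normalizer in the flowchart plays the same role as chaining these calls.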
4. Model building, parameterization and training
As a model, we are going to build and train an XGBoost Regression model. This choice is motivated by the finding that it captures the relationships between the predictor variables and the target much better than linear regressors do.
Let's retrieve it under the models drop-down and drag and drop it into the build space.
Metrics quantify the gap between predicted and actual values. We have the choice between Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Root Mean Squared Logarithmic Error (RMSLE). We choose RMSE, a standard metric for regression.
N.B: If we wish to obtain sharper results, we may try Polynomial Regression, combining feature engineering with principal component analysis.
We need to specify the type of problem both in the model evaluator and in the model trainer.
Let's now add the other modules: the data logger to record the conditions of the run, the item saver to save our model, and the model evaluator to provide the metrics, then link them. Once done, let's run the whole flowchart.
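As a rough code analogue of this training-and-evaluation step, here is a gradient boosting regressor fitted on synthetic data shaped like ours; we use scikit-learn's GradientBoostingRegressor so the sketch only needs scikit-learn, but the xgboost package's XGBRegressor exposes the same fit/predict pattern.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: charges grow with age and jump for smokers
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 65, n)
smoker = rng.integers(0, 2, n)
charges = 250 * age + 20000 * smoker + rng.normal(0, 1000, n)

X = np.column_stack([age, smoker])
X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.2, random_state=42
)

# Train the gradient boosting model, then score it with RMSE
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE: {rmse:.2f}")
```

On this synthetic data, the model recovers the smoker jump and the age trend, leaving an RMSE close to the injected noise level, which is the kind of number the model evaluator reports.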
Now, to deploy the model as well as the data science processing pipeline, prepare the model for deployment.
Once redirected to the deploy space, click on the rocket icon to launch the operation.
Choose the right compute resource according to your plan.
Then deploy the data science processing pipeline as a web service.
Notice that the dataframe loader and the feature selector need a bit of tuning: the input is a dataframe, whereas the output should be a dictionary.
And in the model predictor, we should specify that it is of the regression type.
From this space, you can copy the active URL and access token to the clipboard and share them at will.
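To give an idea of how a client might call such a web service with that URL and token, here is a sketch. The endpoint path, the payload shape, and the Bearer-token scheme are all assumptions for illustration; check SmartPredict's own deployment documentation for the actual API.

```python
import json

# Hypothetical endpoint and token: copy the real values from the
# deploy space of your own SmartPredict project.
URL = "https://<your-deployment-url>/predict"
TOKEN = "<your-access-token>"

def build_request(token, rows):
    """Assemble the headers and JSON body for a prediction call.

    The Bearer scheme and the {"data": rows} envelope are assumed
    for illustration, not taken from SmartPredict's documentation.
    """
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"data": rows})
    return headers, body

rows = [{"age": 19, "sex": "female", "bmi": 27.9,
         "children": 0, "smoker": "yes", "region": "southwest"}]
headers, body = build_request(TOKEN, rows)
print(headers["Content-Type"])
print(body)

# The call itself could then be made with the requests library:
# response = requests.post(URL, headers=headers, data=body)
# print(response.json())
```

Separating request construction from the network call makes the payload easy to inspect before anything is sent.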
Now, landing on this last space, we are ready to test the performance of our model and ask it to return predictions. Insert prediction data or lines from your test dataset into the field provided for this purpose. In our case, we are going to use a CSV dataset.
Wait a moment while the machine processes it, then get the results in real time.
We obtain the estimated insurance cost in the charges column on the far right.
Today we have seen how to carry out the "Predicting Medical Costs" use case with SmartPredict, and we have completed it successfully! Easy, isn't it?
See you for other tutorials!