
How to perform Medical Cost Prediction using SmartPredict?

Published on Apr 20, 2021 by Christelle Julias

In this post, we try to predict the insurance charges to assign to a person based on their personal data. The original use case and dataset are available on Kaggle, where many methods have already been suggested. What differs this time is that we are going to solve it without writing a single line of code. Can you imagine it? Without code!

Yes, thanks to SmartPredict, your partner for low-code and high-level Data Science and Machine Learning, we are going to unravel this problem in the easiest possible manner. Without further ado, let's go!

N.B: The project is available for you to try in the sample projects menu.


To solve such a use case, we first need to divide it into parts that we will tackle one by one. I will guide you through this by mapping the steps you usually take with traditional code-based methods onto the condensed, high-level visual flowchart method. To do so, let's define these steps beforehand.

  1. Problem definition
  2. Data exploration and visualization
  3. Data pre-processing
  4. Feature engineering
  5. Model building, parametrization and training
  6. Model deployment
  7. Prediction and test

To begin with, let's create our project and name it Medical Cost Prediction. Afterward, let's upload our dataset by fetching it from the Kaggle repository.

1. Problem definition

Some personal conditions clearly influence the insurance cost, such as a person's physical condition or whether they smoke.

Therefore, we are going to take these variables into account. The resulting estimates then serve as a basis for setting reasonable prices at the higher and lower ends of yearly premiums.

Also, since the target is a continuous amount, we are facing a regression problem.

2. Data exploration and visualization

To ease our task and get an overview of our data, let's visualize our training dataset with the help of the data visualizer module. Based on the Pandas profiling tool, it is a handy, portable module for inspecting the dataset's content.

From the processing menu, we can see the names of the columns, whereas from the profiling menu, we can see the quality of the data as well as their distribution, from which we can already derive essential insights.


So, we have 1070 rows and 7 columns: age, sex, bmi, children, smoker, region, charges.

The content of the medical cost dataset

The columns:

1. age: the primary beneficiary's age

2. sex: insurance contractor gender (female, male)

3. bmi: Body Mass Index, body weight relative to height (kg/m²), ideally in the range of 18.5 to 24.9

4. children: number of children under the responsibility and covered by health insurance

5. smoker: whether the beneficiary smokes or not

6. region: the beneficiary's residential area (northeast, southeast, southwest, northwest)

7. charges: individual medical costs billed by health insurance
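For readers coming from the code-based approach, the column layout above could be read into typed records roughly as follows. This is only a sketch in plain Python: the three rows are made up to mimic the layout (they are not taken from the Kaggle file), and the names SAMPLE_CSV, COLUMN_TYPES and load_rows are ours.

```python
import csv
import io

# Made-up rows that only mimic the column layout described above.
# They are NOT taken from the Kaggle file.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,northeast,3000.0
40,male,30.1,2,yes,southeast,25000.0
58,female,27.8,1,no,northwest,12000.0
"""

# How each column is parsed: numeric columns are converted,
# categorical columns stay as strings.
COLUMN_TYPES = {
    "age": int,
    "sex": str,
    "bmi": float,
    "children": int,
    "smoker": str,
    "region": str,
    "charges": float,
}


def load_rows(text):
    """Parse CSV text into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: COLUMN_TYPES[name](value) for name, value in row.items()}
        for row in reader
    ]


rows = load_rows(SAMPLE_CSV)

# Every column except the target is an input feature.
features = [name for name in COLUMN_TYPES if name != "charges"]
print(features)
print(rows[0])
```

In SmartPredict, none of this is needed: the dataset is uploaded once and the data visualizer module shows the same information.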

The first six columns constitute the input features. As our goal is to predict insurance costs, charges is our target.

For a detailed description of the dataset, we can consult either Kaggle or the markdown at the sample project opening.

We may also wish to have a glimpse of the descriptive statistics, so let's check the summary from the profiling tab. 

Regarding the categorical features, the subjects are spread almost evenly across the categories, except for smoker, where there are fewer smokers than non-smokers.

We can already see that smokers are charged noticeably more than non-smokers. The smoker feature therefore has the strongest impact on the medical charges.

Charges also tend to increase with age, bmi, and the number of children.

Finally, the correlation matrix confirms the strong correlation between smoker and charges.
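To make these observations concrete for code-minded readers, here is a minimal sketch of the two computations behind them: the average charge per smoker group, and the Pearson correlation between a 0/1 smoker flag and charges. The five records are made up for illustration and the helper names are ours; the profiling tab does the equivalent work for us on the real data.

```python
from math import sqrt
from statistics import mean

# Made-up (smoker, charges) pairs used only to show the computation.
records = [
    ("yes", 30000.0),
    ("no", 4000.0),
    ("no", 6000.0),
    ("yes", 26000.0),
    ("no", 5000.0),
]


def mean_by_group(pairs):
    """Average the values of each group key."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return {key: mean(values) for key, values in groups.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


group_means = mean_by_group(records)

# Encode smoker as 1.0 / 0.0 so that it can be correlated with charges.
smoker_flag = [1.0 if smoker == "yes" else 0.0 for smoker, _ in records]
charges = [charge for _, charge in records]
corr = pearson(smoker_flag, charges)

print(group_means)
print(round(corr, 3))
```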

3. Data pre-processing

With SmartPredict, all the libraries we need are packaged inside modules, from the simplest to the most complex. Machine Learning algorithms are embedded in them, each serving a specific role. We no longer need to import libraries, but if necessary we can still add some through custom modules.

To preprocess our dataset, let's first check whether there are any duplicated rows or missing values. To do so, we are going to use data processor functions such as the feature selector, into which we insert all the features.

We notice at a glance that each feature has already been assigned its correct type.

Indeed, there is one pair of duplicates, so we let the module remove it. Fortunately, there are no missing values at all.
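For comparison with the code-based route, the two checks above could look like the sketch below. It is written in plain Python on made-up rows (one of them deliberately duplicated), and it treats None and empty strings as missing, which is our assumption rather than a description of what the SmartPredict module does internally.

```python
def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def count_missing(rows):
    """Count None or empty-string values per column."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            is_missing = value is None or value == ""
            counts[name] = counts.get(name, 0) + int(is_missing)
    return counts


# Made-up rows: the first and third are identical on purpose.
sample = [
    {"age": 30, "sex": "male", "smoker": "no", "charges": 4500.0},
    {"age": 52, "sex": "female", "smoker": "yes", "charges": 31000.0},
    {"age": 30, "sex": "male", "smoker": "no", "charges": 4500.0},
]

unique_rows = drop_duplicates(sample)
missing = count_missing(unique_rows)
print(len(sample), "->", len(unique_rows))
print(missing)
```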

Then, as categorical features are present (sex, smoker, and region), we need to use the categorical encoder. In the settings, let's just specify them, then save to apply the changes.

Apart from that, we need to normalize our data, so let's do that with the data processor by selecting the operation "normalize".

To divide our data randomly into a training dataset and a test dataset, we use the dataset splitter. We can set the percentage of each one.
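What a categorical encoder does can be sketched in a few lines. We assume one-hot encoding here, which is one common scheme; the source does not say which scheme the SmartPredict module applies, and the function name and sample rows are ours.

```python
def one_hot_encode(rows, columns):
    """Replace each categorical column with one 0/1 column per category."""
    # Sort the categories so that the generated column order is stable.
    categories = {c: sorted({row[c] for row in rows}) for c in columns}
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in columns}
        for c in columns:
            for category in categories[c]:
                new_row[f"{c}_{category}"] = 1 if row[c] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up rows for illustration.
sample = [
    {"age": 25, "sex": "female", "smoker": "no"},
    {"age": 40, "sex": "male", "smoker": "yes"},
]

encoded_rows = one_hot_encode(sample, ["sex", "smoker"])
print(encoded_rows[0])
```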
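As an illustration of what normalizing a numeric column means, the sketch below rescales values to the range 0 to 1 (min-max scaling). This is an assumption on our side: the exact formula behind the "normalize" operation is not stated here, and other schemes such as standardization exist.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Made-up ages for illustration.
ages = [18, 41, 64]
scaled_ages = min_max_scale(ages)
print(scaled_ages)  # [0.0, 0.5, 1.0]
```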
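In code, a random split of this kind could be sketched as follows. The function name, the default 20% test share and the fixed seed are our own choices for the example; in the dataset splitter, the percentage is simply a setting.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for real records here.
all_rows = list(range(10))
train_rows, test_rows = split_train_test(all_rows, test_ratio=0.2)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible from one run to the next, which helps when comparing models.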

Let's drag and drop both of these modules into the build workspace then link them to each other.

4. Model building, parameterization, and training

Model choice:

As a model, we are going to build and train an XGBoost regression model. We choose it because it can capture nonlinear relationships between the predictor variables and the target, and therefore fits this kind of data much better than linear regressors.
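To give an intuition of why boosted trees handle nonlinear relationships, here is a toy gradient boosting regressor in plain Python. It is deliberately not XGBoost: it uses a single numeric feature, one-split trees (stumps) and squared error, with no regularization. All names and the tiny step-shaped dataset are ours and purely illustrative.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the threshold split on x that minimizes squared error."""
    candidates = sorted(set(xs))[:-1]
    if not candidates:
        raise ValueError("need at least two distinct x values")
    best = None
    for threshold in candidates:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = mean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


# Made-up step-shaped data that a straight line cannot fit well.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]
model = fit_boosting(xs, ys)
print(round(predict(model, 2.0), 2), round(predict(model, 5.0), 2))
```

Each round corrects part of the error left by the previous ones, which is how the ensemble ends up following the jump between x = 3 and x = 4.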

Let's retrieve it under the models drop-down and drag and drop it into the build space.


Metrics measure the gap between the predicted and the actual values. As metrics, we have the choice between Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Root Mean Squared Logarithmic Error (RMSLE). We choose RMSE, a common choice for regression: it is expressed in the same unit as the target and penalizes large errors more heavily than MAE.
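The three metrics follow their standard definitions, sketched below in plain Python. The sample values are made up, and RMSLE is written with log(1 + x), the usual convention, so inputs must be greater than -1.

```python
from math import log1p, sqrt


def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: squares errors, so large ones weigh more."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(squared / len(actual))


def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error: RMSE on log(1 + value)."""
    squared = sum((log1p(a) - log1p(p)) ** 2 for a, p in zip(actual, predicted))
    return sqrt(squared / len(actual))


# Made-up values for illustration.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
rmsle_value = rmsle(actual, predicted)
print(round(mae_value, 2), round(rmse_value, 2), round(rmsle_value, 4))
```

Here the single error of 30 pulls RMSE above MAE, which shows how RMSE reacts more strongly to large errors.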

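N.B: If we wish to obtain better scores, we may also try Polynomial Regression, which relies on feature engineering (adding polynomial terms of the input features before fitting a linear model) and can be combined with principal component analysis to reduce the expanded feature set.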
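As a rough illustration of the feature engineering side of polynomial regression, the sketch below expands a row of numeric features into all products up to degree 2. The function name and the sample values are ours; a linear model would then be fitted on the expanded features.

```python
from itertools import combinations_with_replacement


def polynomial_features(row, degree=2):
    """Return all products of the inputs up to the given degree (no bias term)."""
    expanded = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(row, d):
            product = 1.0
            for value in combo:
                product *= value
            expanded.append(product)
    return expanded


# For inputs (a, b) the degree-2 expansion is a, b, a*a, a*b, b*b.
expanded = polynomial_features([2.0, 3.0])
print(expanded)  # [2.0, 3.0, 4.0, 6.0, 9.0]
```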

We need to specify the type of problem both in the model evaluator and in the model trainer.

Model building:

Let's add the other modules now: the data logger to record the run conditions, the item saver to save our model, and the model estimator to provide the metrics. Then let's link them. Once done, let's run the whole flowchart.


Now, to deploy the model together with the data science processing pipeline, let's prepare the model for deployment.

Once redirected to the deploy space, click on the rocket icon to launch the operation.

Choose the right compute resource according to your plan.

Then deploy the data science processing pipeline as a web service.


Notice that the data frame loader and the features selector need a bit of tuning.

The input is a data frame, whereas the output should be a dictionary.

In the model predictor, we should also specify that the problem is of the regression type.

Monitor space

In this space, you can copy the active URL and the access token to the clipboard, then share them as needed.
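Once we have the URL and the token, a client could call the web service along the lines of the sketch below. Everything specific in it is hypothetical: the URL is a placeholder, the Bearer authorization header is a guess, and the "records" payload shape is invented for the example. Please rely on the values and instructions shown in the Monitor space for the real format. The request is only built here, not sent.

```python
import json
import urllib.request

# Placeholders: replace with the URL and token copied from the Monitor space.
SERVICE_URL = "https://example.invalid/predict"  # hypothetical URL
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # hypothetical token


def build_prediction_request(url, token, records):
    """Build (but do not send) a JSON POST request carrying the input rows."""
    # Assumed payload shape; the actual service may expect another one.
    payload = json.dumps({"records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; check how the token must really be passed.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Made-up input row using the six feature columns.
row = {
    "age": 25,
    "sex": "female",
    "bmi": 22.5,
    "children": 0,
    "smoker": "no",
    "region": "northeast",
}
request = build_prediction_request(SERVICE_URL, ACCESS_TOKEN, [row])
print(request.get_method(), request.full_url)
```

Sending it would then be a matter of passing the request to urllib.request.urlopen and decoding the JSON response.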

Predict space

Now that we have landed on this last space, we are ready to test the performance of our model and ask it to return predictions. Insert prediction data or lines from your test dataset into the field provided for this purpose. In our case, we are going to use a CSV dataset.

Wait a while until the machine finishes processing it. Then get the results in real time.

Here we have obtained the estimated insurance cost, in the charges column on the far right.

We have seen today how to perform the "Predicting Medical Costs" use case with SmartPredict and we have completed it successfully! Easy, isn't it?

See you for other tutorials!