
Predicting a Kaggle competition baseline: the Spaceship Titanic dataset using the SmartPredict platform

Published on May 31, 2022 by Niaina Tahiana A


Introduction:

In this tutorial, we will learn how to build a prediction baseline for a Kaggle competition with the Spaceship Titanic dataset. The task consists of predicting whether a passenger was transported to an alternate dimension during the Spaceship Titanic’s collision with a spacetime anomaly. You can download the dataset at this link: https://www.kaggle.com/competitions/spaceship-titanic/data. It contains a set of personal records recovered from the ship’s damaged computer system.
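If you want to follow along outside the platform as well, here is a minimal sketch, assuming you have downloaded train.csv from the Kaggle link above, of loading the data with pandas (this is only for exploration on your side, not something SmartPredict requires):

```python
import pandas as pd

# Load the training data downloaded from the Kaggle competition page.
# The file path is an assumption; adjust it to wherever you saved the file.
train_df = pd.read_csv("train.csv")

print(train_df.shape)   # number of rows and columns
print(train_df.head())  # first few records recovered from the ship's computer
```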

With this use case, let us help you explore the SmartPredict platform: how to create projects, how to run them, how to make predictions, and so on. SmartPredict provides many features and functionalities useful for AI projects, including drag-and-drop functionality and many modules for building projects.

Now, let’s get started :) 

Log in to SmartPredict:

If you don’t have a SmartPredict account, you can sign up here: https://cloud.smartpredict.ai/register-choice. If you already have an account, simply sign in here: https://cloud.smartpredict.ai/login.

Once you are on the dashboard, you can start by creating a new project from the “My projects” menu:

After creating a project, we obtain a workspace like this:

We have to select a dataset:

Create the build flowchart:

Insert dataset: 

Let’s drag and drop the dataset onto the workspace:

We can see an overview of this dataset by clicking the “View Summary” option, either in the workspace or in the dataset menu:

This option allows us to see the list of columns and their respective types, the number of rows, and the table size.

Column descriptions:

Let’s take a look at each column:

  • Age: the age of the passenger
  • Spa, RoomService, FoodCourt, VRDeck, ShoppingMall: the amount the passenger has billed at each of the Spaceship Titanic’s many luxury amenities
  • VIP: whether the passenger has paid for special VIP service during the voyage
  • Name: the first and last names of the passenger
  • Cabin: the cabin number where the passenger is staying
  • CryoSleep: whether the passenger elected to be put into suspended animation for the duration of the voyage; passengers in cryosleep are confined to their cabins
  • HomePlanet: the planet the passenger departed from, typically their planet of permanent residence
  • Destination: the planet the passenger will be debarking to
  • PassengerId: a unique ID for each passenger
  • Transported: whether the passenger was transported to another dimension; this is the target column to predict

As we can see, some columns contain missing values.
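For reference, the same kind of summary can be reproduced with pandas; this is only an illustrative equivalent of the “View Summary” option, not what SmartPredict runs internally:

```python
# Column names, dtypes and non-null counts (mirrors the "View Summary" information)
train_df.info()

# Number of missing values per column
print(train_df.isnull().sum())
```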

Data preprocessing:

Data preprocessing covers the steps we need to follow to clean and organize the data so that it is suitable for machine learning models. A raw dataset typically contains:

- unnecessary columns, for example columns with mostly unique values or constant values

- columns with missing values

- categorical columns

- numerical columns on different scales, which can affect the performance of a model

Data preprocessing increases the efficiency of a machine learning model. 

Delete unnecessary columns:

This step is necessary here because some columns, such as PassengerId and Name, contain a distinct value for almost every row and therefore carry no useful signal for the model.

Drag and drop the “Dataset processor” module and connect it to the dataset. By clicking on this module, we can see the many options available to process our dataset. Here, let’s choose the “Delete column” option and select the columns to delete in the “Columns to delete” field.
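As a rough pandas equivalent of the “Delete column” option (the exact columns you drop may differ; PassengerId, Name, and Cabin are used here as examples of columns with many distinct values):

```python
# Drop identifier-like columns that carry little predictive information
columns_to_delete = ["PassengerId", "Name", "Cabin"]
train_df = train_df.drop(columns=columns_to_delete)
```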

Handling missing values: 

As we can see from the dataset’s summary, some categorical and numerical attributes have missing values. To handle them, let’s use the “Dataset processor” module again, this time with another option: “Handle missing values”. We use one processor for the numerical features and one for the categorical features.

We can configure the parameters as we want. Let’s start with the numerical features: insert their column names in the “Subset columns” field and choose an imputation strategy; here, we have used the median method.

Let’s now handle the missing values in the categorical columns.

In the above illustration, we have used the frequency (most frequent value) method to impute them.
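Outside SmartPredict, the same two imputation steps could be sketched with pandas as follows (median for the numerical columns, most frequent value for the categorical ones):

```python
# Numerical columns: fill missing values with the median
numeric_cols = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
for col in numeric_cols:
    train_df[col] = train_df[col].fillna(train_df[col].median())

# Categorical columns: fill missing values with the most frequent value (mode)
categorical_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP"]
for col in categorical_cols:
    train_df[col] = train_df[col].fillna(train_df[col].mode()[0])
```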

Normalization:

Variables measured on different scales do not contribute equally to the model fitting. To deal with this potential problem, we apply a normalization method, the Min-Max scaler, to the numerical features. Normalization rescales each feature into the range [0, 1], which can also make training faster.

The “Normalize” option is chosen among the existing options. We configure the module and insert the names of the columns to process.
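For readers who prefer code, here is a minimal scikit-learn sketch of the same idea; each value is rescaled as x' = (x - min) / (max - min):

```python
from sklearn.preprocessing import MinMaxScaler

# Same numerical columns as in the imputation step above
numeric_cols = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Scale each numerical feature into the [0, 1] range
scaler = MinMaxScaler()
train_df[numeric_cols] = scaler.fit_transform(train_df[numeric_cols])
```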

One Hot Encoding:

Machine learning algorithms require everything to be provided to them in numerical form.

Since our categorical variables have no ordinal relationship to each other and cannot be used directly by machine learning algorithms, let’s one-hot encode them.

To use this method in SmartPredict, we simply select the “One hot encode” option in the “Processor to use” field.

We don’t have to scale the one-hot encoded features because they are already between 0 and 1.
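A pandas sketch of the same encoding (the column list is an assumption; adapt it to the categorical columns you kept):

```python
# One-hot encode the categorical columns: each category becomes its own 0/1 column
train_df = pd.get_dummies(
    train_df, columns=["HomePlanet", "CryoSleep", "Destination", "VIP"]
)
```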

So far, we have built the processing pipeline in the flowchart.

Here is an illustration of it.

Save the processing pipeline:

We need to save this whole processing pipeline so that we can apply it again at deployment time.

SmartPredict has a convenient module for saving items such as processing pipelines, datasets, trained models, and feature engineering models. Let’s save the processing pipeline using the “Item Saver” module from the “Control Modules” category (in the Core Modules Features) on the right.

We have given it a name.

Select features and labels:

Select features:

To select all the features for training, let’s use the “Delete column” processor again. This processor should be connected to the “One hot encode” processor’s output, and the “Transported” column will be removed so that only the feature columns remain.

Select label:

The label column contains boolean values with two classes, True and False, indicating whether the passenger was transported. Since the “Item Saver” module that we will use to save the trained model in the next step does not support the boolean type, we encode the label with the ordinal encoding functionality available in the “Dataset Processor” module.

The “Keep these columns only” option of the “Dataset processor” is used to keep only the label column.

In the “Data Selection” category, we drag and drop the “Features selector” module to select the label.

In the “Features and labels” group, we simply select the “Transported” column in the “Selected labels” field. This module generates 2 outputs, features and label, but we only keep the label output for model training.
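In plain Python, selecting the features and the encoded label could look like this sketch:

```python
# Label: encode the boolean Transported column as 0/1 integers
y = train_df["Transported"].astype(int)

# Features: every remaining column except the target
X = train_df.drop(columns=["Transported"])
```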

For now, we have this flowchart architecture:

After selecting features and labels, let’s apply a train/test split to divide the data into training and testing sets.

Train and test split:

The “Data splitter” module in the “Data Selection” category is used to split the data. We simply choose the “Train/Test Split” option.

This option takes 2 inputs, the features (from the “Delete column” output) and the label (from the “Features selector” output), and has 4 outputs:

  1. Features for training
  2. Label for training
  3. Features for testing
  4. Label for testing

We will need these outputs for training and evaluating a model.
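An equivalent sketch with scikit-learn (the 80/20 split ratio is an assumption, not necessarily the module’s default):

```python
from sklearn.model_selection import train_test_split

# Split the features and labels into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```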

Model training:

First of all, let’s use a Random Forest classifier to train our model. It builds decision trees on randomly selected data samples, obtains a prediction from each tree, and selects the best solution through a vote. Each tree is built on an independent random sample, and the most voted class is chosen as the final result.

Choose the classifier to train: the “Random Forest Classifier” in the “Machine Learning Algorithms” category. Then, connect it to the “Model Trainer” module in the “Training & Prediction” category.

“Model Trainer” takes as inputs:

  1. Trainable Machine Learning, Deep Learning, Prophet, or Keras algorithms.
  2. Features of the training data
  3. Labels of the training data
  4. Features data used to validate a model
  5. Label of the validation dataset

Connect the first three inputs: the classifier from the “Random Forest Classifier” module, then the features for training and the label for training from the “Train/Test Split” outputs.
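The same training step, sketched with scikit-learn (the hyperparameters are left at their defaults, which may differ from what SmartPredict’s module exposes):

```python
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest on the training features and labels
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
```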

Save model:

Don’t forget to save the trained model; we will need it for deployment.

Let’s drag and drop a new instance of the “Item Saver” module, choose the “Trained model (ML, DL, FE, PRC, LANG)” option, and give it a name.

Evaluate model:

To evaluate the model’s performance, let’s use the “Model Evaluator” module in the “Evaluation & Fine Tuning” category, which takes 3 mandatory inputs:

  1. Trained model
  2. Features for test
  3. Label for test

This module generates 2 outputs:

  1. Model’s score
  2. Trained and evaluated model

In the above illustration, we have chosen 3 metrics for evaluating our model: F1 score, Precision score, and Recall score.
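For reference, the same three metrics can be computed with scikit-learn, as in this sketch:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Predict on the held-out test set and compute the evaluation metrics
y_pred = model.predict(X_test)
print("F1 score:       ", f1_score(y_test, y_pred))
print("Precision score:", precision_score(y_test, y_pred))
print("Recall score:   ", recall_score(y_test, y_pred))
```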

We can log the score by using the “Data/Object/Type Logger” module.

We have finished our build flowchart. Let’s run it by clicking the “Run” option on the left and wait for it to complete.

We can see the score in the Build Logs at the bottom.

Deployment:

When the run is done, a popup appears asking whether we want to create a deployment. Select the “Flowchart 1” radio button to choose the flowchart to translate into a deployment flowchart.

Then, we will be redirected to the deployment space.

We obtain the following flowchart.

We have removed the “Features selector” module from this flowchart, since we don’t need it here.

As we can see, the processing pipeline and the trained model that were saved in the build space appear here.

- Click on “Deploy as web service” at the top left

- Click on the Deploy button

- Choose a model for the flowchart deployment

Monitoring:

- Click on the “Go To MONITOR SPACE” button

Once the web service is ready, you can go to the Predict Space and create a request.

Create and launch a request:

After creating a request, we have 4 data formats to choose from; let’s use an example in JSON format.

Click on “Launch request” and wait for the result.

We can see the outcome in the TABLE tab in the next illustration.

This outcome contains the column names with the values inserted in the request, the predicted variable, and the confidence. We can add as many requests as we want and try the other data formats.
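As an illustration only, a request to the deployed web service could be sent from Python roughly like this; the URL, token, and exact payload schema are hypothetical placeholders, so replace them with the ones shown in your own deployment space:

```python
import requests

# Hypothetical endpoint and token: copy the real ones from your deployment space
url = "https://<your-deployment-endpoint>/predict"  # placeholder, not a real URL
headers = {
    "Authorization": "Bearer <your-api-token>",  # placeholder token
    "Content-Type": "application/json",
}

# One passenger record in JSON format (field names follow the dataset's columns)
payload = {
    "HomePlanet": "Europa",
    "CryoSleep": False,
    "Destination": "TRAPPIST-1e",
    "Age": 33,
    "VIP": False,
    "RoomService": 0,
    "FoodCourt": 120,
    "ShoppingMall": 0,
    "Spa": 549,
    "VRDeck": 44,
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())  # predicted Transported value and confidence
```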

Don’t forget to turn off your web service when you have finished using a project.

Conclusion:

In this tutorial, we created a project on the SmartPredict platform, used several of its modules, and explored the dataset. We have treated a famous Kaggle competition baseline using SmartPredict. In the next part, we will interpret the model used in this project with the “Model Interpreter” module and explain each prediction instance. Thank you!