Christelle Julias - Documentation engineer at SmartPredict
in SmartPredict
2 months ago

How to design a Titanic Survival Predictor using SmartPredict (Step-by-Step)

[Image: The Titanic sailing majestically during its maiden voyage]

  • Reading time: 8 min
  • Realization time: ~15 min
  • Prerequisites: a SmartPredict subscription

Project description

In this brief tutorial, we are going to recreate the Titanic survival prediction project step by step with the help of the AI platform SmartPredict.

For a detailed walkthrough of the project and module references, please check this link.

This data science and machine learning project is based on a Kaggle challenge, the objective of which is to determine which passengers were likely to survive the shipwreck, based on a certain number of factors.

As a reminder, the tragedy occurred after the RMS Titanic collided with an iceberg, resulting in a huge loss of life, mainly due to a shortage of lifeboats.

Historical facts

Historically, a passenger's chance of survival was linked to a set of initial features relating to their social status, the number of relatives on board, their gender, and their age.

The order of priority for rescue was ranked by socio-economic class, from the wealthiest to the neediest, a prejudice which doomed many of the latter.

Regarding gender, female passengers were given priority over male ones, while regarding age, children below 14 were granted privilege over their elders.

All in all, of the roughly 2,224 passengers and crew of the infamous ship, only about 710 survived.

Workflow

Just like for any ML project, let us proceed with the typical workflow as follows:

  1. Classify or Define the problem
  2. Acquire and Process the Dataset
  3. Model the Problem
  4. Validate or Test and Execute
  5. Deploy the Pipeline

N.B.: Steps 3 and 4 can be repeated until we obtain the desired prediction accuracy, preferably the highest possible.

The SmartPredict concept: generating a machine learning pipeline

SmartPredict was conceived from a simple principle: to enable the design and generation of a machine learning pipeline easily and quickly from flowcharts.

Thanks to its large palette of modules, SmartPredict contains all the tools to complete our project from end to end effortlessly.

For this project, the pipeline to be issued takes the form of a REST API web service that will be used for prediction.

To generate our pipeline, our data will go through these steps:

  1. Dataset preprocessing, including feature selection and missing value handling
  2. Flowchart assembly with dedicated modules (build and deploy flowcharts)
  3. Model testing (by running it to return the accuracy)
  4. and finally, inference with our model (and eventual fine-tuning).

#0 Create the Titanic project with SmartPredict

Let us create our Titanic survival prediction project with SmartPredict.

To do so, let us directly borrow the classification template which already includes the basic modules for starting such a project.

  1. Go to the Dashboard (home icon)
  2. Click on "Create a new project"
  3. Select type "Classification" in the list
  4. Insert the project information: Name (compulsory), Description (optional)
  5. Click on "Create" to validate.

[Image: Creating a new project in SmartPredict]

#1 Acquire and Preprocess the Dataset

1. Upload the dataset

Before we can begin processing the dataset, we need to upload it into SmartPredict. To begin with, let us fetch and download the dataset from the Kaggle challenge page.

N.B.: This is the dataset we are going to use in this project.

It consists of three files: a training set, a test set, and an example submission. If you already have your dataset in an on-premise database, just select and load your data into the application.

Drop the dataset files or browse to upload. Take heed that it is the training set that we need to upload. To avoid excessive preprocessing, the training set is provided and available for download here.

[Image: Uploading the dataset]

The exhaustive list of features is provided here:

  • PassengerId: a unique identification number
  • Ticket: the number under which the passenger was registered
  • Pclass: the ticket class (first, second, or third)
  • Cabin: the cabin number
  • Name: the passenger's official name (married or maiden name for a woman)
  • Sex: gender, male or female
  • Age: the passenger's age in years
  • SibSp: the number of siblings and spouses aboard
  • Parch: the number of parents and children aboard
  • Embarked: the port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • Fare: the ticket fare, useful for deducing the passenger's wealth (presumably higher for a higher class)

The training set furthermore displays one more column: Survived (whether the passenger survived or not).

This is the kind of result we expect our predictor to give back.
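If you would like to inspect the training set locally before uploading it, here is a minimal pandas sketch (this is purely optional and assumes the Kaggle file is saved as train.csv; SmartPredict itself does not require it):

import pandas as pd

# Load the Kaggle Titanic training set from a local file
df = pd.read_csv("train.csv")

print(df.shape)             # (891, 12): 891 passengers, 12 columns
print(df.columns.tolist())  # the features listed above, plus Survived
print(df.head())            # a quick look at the first rows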

2. Delete irrelevant features

To treat our dataset more easily, we need to shrink it: eliminate the useless features while keeping the relevant ones. To do so, we simply delete the unused columns in the dataset with the help of the Data Processor.

Given the criteria previously mentioned, the features chosen as relevant are:

Sex, Age, Pclass, Parch, SibSp, and Fare.

The rest will be ignored, because those columns are not relevant enough to influence the prediction outcome.

  1. Go to the SmartApps and choose Dataset Processing and Visualization.
  2. Add a processor step. Select "Delete columns" >> Choose the columns to process by adding the useless columns.
  3. Apply the processing step to the dataset.

[Image: Deleting the useless columns]
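For reference, here is roughly what this processing step amounts to, sketched in pandas outside of SmartPredict (the column names are those of the Kaggle file; the platform performs the same operation through its UI):

import pandas as pd

# Drop the columns judged irrelevant for the prediction
df = pd.read_csv("train.csv")
useless = ["PassengerId", "Name", "Ticket", "Cabin", "Embarked"]
df = df.drop(columns=useless)

print(df.columns.tolist())
# ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']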

3. Handle missing data

We need to process our dataset further by removing its remaining flaws.

Indeed, our dataset contains missing values, and this next step consists of eliminating them. Only the "Age" column is concerned. The no-brainer recourse is to delete the rows containing blank values. This is also possible through the Data Processor in the Dataset Processing and Visualization SmartApp.

  1. Open the Dataset Processing and Visualization SmartApp.
  2. Select the dataset >> Click on the cog icon for processing.
  3. Click on "Add new step" (+ button).
  4. Select a processor >> Handle missing values.
  5. As a strategy: delete rows.

[Image: Handling missing values]

  6. Apply the processing step to the dataset.
  7. Export the obtained set of processing steps as a processing pipeline by clicking on "Export processing pipeline to SmartPredict".

[Image: Exporting the processing pipeline]
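Again for reference, the "delete rows" strategy corresponds to a simple dropna in pandas (a sketch under the same assumptions as before):

import pandas as pd

df = pd.read_csv("train.csv")
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin", "Embarked"])

# Handle missing values with the "delete rows" strategy;
# among the kept features, only the Age column contains blanks
df = df.dropna(subset=["Age"])

print(df.isna().sum())  # confirm that no missing values remain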

#2 Assemble the Build and Deploy Flowcharts with dedicated modules

We will perform our flowcharting in two steps:

  1. First, let us assemble a pipeline build flowchart,
  2. then, deal with the pipeline deploy flowchart.

Build flowchart

Open the Titanic project.

The default build flowchart for a classification project is composed of the following elements:

  • the Dataframe Loader
  • the Features Selector
  • the ML Trainer
  • the Item Saver
  • the ML Evaluator
  • the Labeled Data Splitter
  • the Data Object Logger
  • and the Support Vector Classifier

We furthermore need to add:

  • an Ordinal Encoder, in order to correctly handle the categorical features present in this type of data.

Then, to include the processing steps in our pipeline, we must choose between these two options:

  • [a processing pipeline + the original dirty dataset] OR [a dataframe loader + a clean dataset]

1st Option: [a processing pipeline + the original dirty dataset]

[Image: Build flowchart without the dataframe loader]

2nd Option: [a dataframe loader + a clean dataset]

As an alternative to using the processing pipeline, we can use the dataframe loader, which performs similar functions while loading the project's dataframe.

Module configuration

Let us explore the respective configurations of:

  1. the dataframe loader
  2. the feature selector
  3. the ordinal encoder

1 - The Dataframe Loader

The Dataframe loader is an important component of the flowchart. It is a specialized SmartPredict module that can directly perform processing steps such as handling missing values.

For the BUILD:

[Image: Dataframe loader configuration for the build flowchart]

For the DEPLOY:

2 - The Feature Selector

The feature selector, as its name suggests, is used to select the relevant features and define the output label.

Here is its configuration for our project.

[Image: Feature selector configuration]

3 - The Ordinal Encoder

The purpose of the ordinal encoder is to enable the use of categorical features that come in string form and cannot be expressed otherwise. Here, for instance, we have the case of "Sex", which will be represented by the binary opposition male/female. For more details, check the scikit-learn documentation on the use of ordinal encoders.

Here is the Ordinal Encoder's configuration.

[Image: Ordinal encoder configuration]
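To see what this encoding does concretely, here is a small sketch with scikit-learn's OrdinalEncoder, the class the documentation above describes (the toy data below is purely illustrative):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# A toy "Sex" column, as found in the Titanic dataset
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

encoder = OrdinalEncoder()
df[["Sex"]] = encoder.fit_transform(df[["Sex"]])

print(df["Sex"].tolist())   # [1.0, 0.0, 0.0, 1.0]: female -> 0, male -> 1
print(encoder.categories_)  # [array(['female', 'male'], dtype=object)]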

Run the project

Before transitioning to the deploy tab, we first need to run our project.

To do so, just click on the Run button.

[Image: Running the project in SmartPredict]

The model accuracy is then displayed.

[Image: Model accuracy]
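For the curious, the whole build flowchart (split, train, evaluate) can be approximated outside SmartPredict with scikit-learn. The sketch below is only an approximation: the split ratio and SVC parameters actually used by the platform are assumptions here.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Reproduce the preprocessing: keep the relevant features, drop blank rows
df = pd.read_csv("train.csv")
df = df[["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]].dropna()
df[["Sex"]] = OrdinalEncoder().fit_transform(df[["Sex"]])

X = df.drop(columns=["Survived"])
y = df["Survived"]

# Labeled Data Splitter + ML Trainer + ML Evaluator, in scikit-learn terms
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = SVC().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))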

Pipeline Deploy flowchart

The Deploy flowchart is composed of modules quite similar to those of the build flowchart; only their parameters are slightly different.

For instance:

  1. Dataframe loader (to find how to set it up, check the previous 'Build flowchart' section); however, its input type will now be "Dictionary" instead of "Dataframe" (see the sketch below)
  2. Ordinal encoder (same configuration as seen before in the build flowchart)
  3. Features selector (same as previously)
  4. ML predictor (found in the core modules)
  5. and, of course, our trained model (found in the 'Trained Models' sub-tab).

Let us begin by dragging and dropping them into the workspace. Only after this can we start parameterizing them.
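Regarding the "Dictionary" input type: conceptually, the deploy-side dataframe loader turns the incoming JSON body into a one-row dataframe that the rest of the flowchart can consume. Here is a rough pandas illustration of that idea (the module's real internals belong to SmartPredict):

import pandas as pd

# The JSON request body, as a Python dictionary
payload = {"Pclass": 3, "Sex": ["male"], "Age": 34.5,
           "SibSp": 0, "Parch": 0, "Fare": 7.8292}

# Scalars are broadcast against the list-valued "Sex" field,
# yielding a single-row dataframe ready for the encoder and predictor
row = pd.DataFrame(payload)
print(row)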

Here is the final layout of the deploy flowchart:

To deploy, we need:

1. first, to prepare for deployment;

2. then, to deploy our pipeline per se by clicking on the rocket icon as shown below:

Test the success of our pipeline by making an inference with our REST API

For this final step, to test the success of our pipeline, let us pick a random passenger from the test set, identified only by their distinct features. The request is made in JSON format:

{
   "Pclass":3,
   "Sex":[
      "male"
   ],
   "Age":34.5,
   "SibSp":0,
   "Parch":0,
   "Fare":7.8292
}
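Any HTTP client can send this request; here is a hypothetical Python sketch using the requests library (the endpoint URL and token below are placeholders, not real SmartPredict values; copy the actual ones from your deployment page):

import requests

URL = "https://<your-smartpredict-endpoint>/predict"   # placeholder
HEADERS = {"Authorization": "Bearer <your-api-token>"}  # placeholder

passenger = {
    "Pclass": 3,
    "Sex": ["male"],
    "Age": 34.5,
    "SibSp": 0,
    "Parch": 0,
    "Fare": 7.8292,
}

response = requests.post(URL, json=passenger, headers=HEADERS)
print(response.json())  # expected to indicate "Not Survived" for this profile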

After making an inference, our predictor returns the value "Not Survived", as expected. Indeed, our fated passenger belonged to the third class, traveled alone, and was a man, a profile far from that of a survivor.

That is all for this brief step-by-step SmartPredict walkthrough. I hope you enjoyed reproducing it!

Subscribe to our newsletter to receive the latest news and join our Slack community for fresh updates about SmartPredict!
