House Price Prediction with SmartPredict
Have you ever wanted to build a machine learning project from scratch, from data exploration to deployment, on a single platform, without worrying about installing packages and without even coding? Then the SmartPredict SaaS platform is made for you. In this blog, we will discover how easy it is to tackle house price prediction with SmartPredict. I invite you to work through it with me on the SmartPredict beta version, which is free. Ready? Let's get started.
2-Presentation of the project and the dataset
Whether you are a buyer or a seller, it is not easy to estimate the price of a valuable property, especially one like a house, whose price depends on many characteristics. Thanks to artificial intelligence, a machine learning model can predict a house price from data about the house itself (its size, the year it was built, etc.). This is a regression problem, which we will solve with SmartPredict.
We will use the housing dataset presented by De Cock (2011). This dataset describes the sales of residential units in Ames, Iowa from 2006 to 2010. It contains a large number of variables (82 columns, or features, and 2930 rows, or records) that are involved in determining a house price. We can obtain a CSV copy of the data at Kaggle and its description here.
So our project with SmartPredict will take place in five stages as follows:
- First of all, we get a feel for the dataset with the Dataset Processor application and with the notebook integrated into SmartPredict
- Secondly, we clean up the data in the Processing tab
- Third, we move on to exploratory data analysis with the Visualization tab and with the notebook integrated into SmartPredict
- Fourth, we build a flowchart in our Build tab for training our model
- Finally, we deploy and test our model in each corresponding tab
3- Step by step let's complete the whole project in the single platform: SmartPredict
First step: getting a feel for the data with the Dataset Processor app and the SmartPredict notebook
Let's have a look at our dataset in the Processing tab of the Dataset Processor app.
As you can see in the figure below, the table gives us a clearer view of the dataset. The type and quality of the data are indicated in the header of each column, so we note that we have data of Integer, String, and Float types and that around twenty columns have missing values. "Alley", "Pool QC", "Fence", and "Misc Feature" are the columns with the most missing values, as they have the lowest data quality (6%, 0%, 19%, and 3% respectively).
Let's analyze our dataset with more flexibility, using the notebook integrated into SmartPredict.
We need to enter these three lines of code, one of which gives the name of our dataset (AmesHousing1), to load it into our notebook.
from smartpredict import api
import smartpredict as sp

dataset = sp.api.dataset_load('AmesHousing1')
We can get statistical information about the numeric columns in our dataset by entering this code.
import numpy as np

dataset.describe(include=[np.number], percentiles=[.5]) \
    .transpose().drop("count", axis=1)
Thus we can see from the table that the mean SalePrice is around 180 000.00 $, the minimum is 12 789.00 $, and the maximum is 755 000.00 $. Similarly, we can get a lot of information about the other variables of our dataset from the table.
Then, let's see the statistical information about the non-numerical columns in our dataset.
# The np.object alias was removed from recent NumPy versions; the built-in object works everywhere
dataset.describe(include=[object]).transpose() \
    .drop("count", axis=1)
In the table above, "unique" represents the number of unique values, "top" represents the most frequent element, and "freq" represents the frequency of the most frequent element.
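To make these three statistics concrete, here is a minimal sketch in plain Python that computes them for a single column. The sample values are made up for illustration and are not taken from the Ames data, and the helper name describe_categorical is hypothetical.

```python
from collections import Counter

def describe_categorical(values):
    """Return unique/top/freq for a list of labels, skipping missing (None) entries."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top, freq = counts.most_common(1)[0]
    return {"unique": len(counts), "top": top, "freq": freq}

# Made-up sample column; None marks a missing value
street = ["Pave", "Pave", "Grvl", None, "Pave"]
summary = describe_categorical(street)
print(summary)  # {'unique': 2, 'top': 'Pave', 'freq': 3}
```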
Second step: Cleaning data in the Processing Tab
As we saw previously, missing data are easy to locate because they appear in pink cells. But for more agility, we want to list the names of the columns with missing values, so let's enter this snippet of code in the SmartPredict notebook.
import pandas as pd

# Getting the number of missing values in each column
num_missing = dataset.isna().sum()
# Excluding columns that contain 0 missing values
num_missing = num_missing[num_missing > 0]
# Getting the percentages of missing values (shape[0] is the number of rows)
percent_missing = num_missing * 100 / dataset.shape[0]
# Concatenating the number and percentage of missing values
# into one dataframe and sorting it
pd.concat([num_missing, percent_missing], axis=1,
          keys=['Missing Values', 'Percentage']).\
    sort_values(by="Missing Values", ascending=False)
Before getting into data cleaning, let's take a closer look to decide how best to treat the missing values:
- "Pool QC" has 2917 missing values, and if we check the value counts of "Pool Area" (dataset["Pool Area"].value_counts()), 2917 entries are also 0. This means that the 2917 houses recorded without a swimming pool appear as missing values in "Pool QC", which we can therefore replace with "No features" (meaning "No Pool").
- "Misc Feature" has 2824 missing values, and if we check dataset["Misc Val"].value_counts(), "Misc Val" has 2827 entries with a value of 0. As with "Pool QC", we can say that a house without a miscellaneous feature has a missing value in the "Misc Feature" column and a value of 0 in the "Misc Val" column. So let's fill the missing values in the "Misc Feature" column with "No Feature".
- Looking at the other categorical columns, a missing value likewise means that the house does not have the corresponding feature. So the missing values in "Alley", "Fence", "Fireplace Qu", "Garage Cond", "Garage Qual", "Garage Finish", "Garage Type", "Bsmt Exposure", "BsmtFin Type 2", "BsmtFin Type 1", "Bsmt Qual", "Bsmt Cond", and "Mas Vnr Type" will be replaced by "No features".
- The exception is "Electrical", in which we notice only one missing value, so we will replace it with the most frequent value.
- According to the dataset's description, a missing value in a numerical column can be replaced by 0. This is the case for "Lot Frontage", "Garage Yr Blt", "Mas Vnr Area", "Bsmt Half Bath", "Bsmt Full Bath", "Total Bsmt SF", "Bsmt Unf SF", "Garage Cars", "Garage Area", "BsmtFin SF 2", and "BsmtFin SF 1".
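For readers who like to see the rules written down, the list above can be sketched in a few lines of plain Python. This is only an illustration of the three rules (constant label for categorical columns, 0 for numerical columns, most frequent value for "Electrical"), not what SmartPredict runs internally. The three rows and the helper name fill_missing are made up.

```python
from collections import Counter

# Made-up miniature of the dataset; None marks a missing value
rows = [
    {"Pool QC": None, "Pool Area": 0, "Lot Frontage": 80.0, "Electrical": "SBrkr"},
    {"Pool QC": "Gd", "Pool Area": 512, "Lot Frontage": None, "Electrical": "SBrkr"},
    {"Pool QC": None, "Pool Area": 0, "Lot Frontage": 60.0, "Electrical": None},
]

def fill_missing(rows, constant_fill, mode_fill):
    """Return a cleaned copy of rows.

    constant_fill maps a column name to its replacement value;
    mode_fill lists the columns filled with their most frequent value.
    """
    # Most frequent non-missing value of each mode-filled column
    modes = {}
    for col in mode_fill:
        present = [r[col] for r in rows if r[col] is not None]
        modes[col] = Counter(present).most_common(1)[0][0]

    cleaned = []
    for r in rows:
        new = dict(r)  # copy so the input rows stay untouched
        for col, value in constant_fill.items():
            if new[col] is None:
                new[col] = value
        for col in mode_fill:
            if new[col] is None:
                new[col] = modes[col]
        cleaned.append(new)
    return cleaned

cleaned = fill_missing(
    rows,
    constant_fill={"Pool QC": "No features", "Lot Frontage": 0},
    mode_fill=["Electrical"],
)
print(cleaned)
```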
As I have already mentioned, with SmartPredict there is no need to code! To deal with missing values, we use the preprocessing pipeline in the Processing tab; the gif below shows a demonstration:
The following figures show, respectively, the configuration for handling missing values in the numerical columns, then in the categorical columns, and finally in the "Electrical" column.
The processing pipeline appears in the right pane. We can export this pipeline as a module (which we name "Cleaning_house_dataset") for our flowchart in the fourth step.
Now let's visualize our data in the same place, with the Visualization tab.
Third step: Exploratory Data Analysis in the Visualization tab
In this section, we will explore the data using visualizations. This will allow us to understand the data and the relationships between variables better, which will help us build a better model.
First, let's check the distribution of the target variable with a histogram. If you have trouble handling the application in the Visualization tab, I invite you to read my blog on it. Don't worry, it is only a matter of handling, not of coding, as the gif below shows.
We get this graph, which shows that most house prices are in the range of 100 000.00 $ to 180 000.00 $.
Now, let's look at the correlation between variables with a heatmap. At the time of writing, the Visualization tab only supports bar, pie, line, and scatter charts, so let's enter this code in our notebook.
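import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(12, 9))
# Correlation is only defined for numeric columns, so we select them first
sns.heatmap(dataset.select_dtypes(include="number").corr(), ax=ax);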
We notice that there are many correlated variables in our dataset, especially "Garage Cars" and "Garage Area", and "Gr Liv Area" and "TotRms AbvGrd". Let's check these two cases with a scatter plot.
We observe that the two variables in each pair are indeed positively correlated, so we had better delete one column from each pair to avoid multicollinearity. This deletion will be done in the flowchart-building stage, so let's go to the next step.
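The same reasoning can be sketched in plain Python: compute the Pearson correlation of each pair of columns and mark one column of every strongly correlated pair for deletion. The values below are made up, the 0.8 threshold is my own assumption for the example, and the helper names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def columns_to_drop(columns, threshold=0.8):
    """For every pair whose |r| exceeds threshold, mark the second column for removal."""
    names = list(columns)
    drop = []
    for i, first in enumerate(names):
        if first in drop:
            continue
        for second in names[i + 1:]:
            if second in drop:
                continue
            if abs(pearson(columns[first], columns[second])) > threshold:
                drop.append(second)
    return drop

# Made-up values, only to illustrate the idea
columns = {
    "Garage Area": [200, 400, 480, 600, 850],
    "Garage Cars": [1, 2, 2, 2, 3],
    "Lot Frontage": [70, 60, 85, 65, 75],
}
print(columns_to_drop(columns))  # ['Garage Cars']
```

Which column of a pair to keep is a judgment call; here the sketch simply keeps the first one it meets.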
Fourth Step: Building a flowchart for training our model in our Build tab
Training a machine learning model in SmartPredict is no longer a question of coding but of "drag and drop, set up, and interconnect modules".
That's fun; however, we should be careful to avoid making mistakes.
In the Build tab we have a workspace into which we drag and drop the modules available in the right pane. Every flowchart in the Build tab begins with our dataset (here the AmesHousing dataset). Then the dataset must be cleaned, so we drag and drop and connect the second module, "Cleaning_house_dataset", created before.
The next module is the Dataset Processor, which we set up to delete the columns that are not useful ("Order" and "PID") as well as "Garage Cars" and "TotRms AbvGrd", as mentioned previously, to avoid multicollinearity.
I hope you have understood the rule by now: each step of training a machine learning model is translated into a module from Core Modules, which we drag and drop and configure in the build space. We can run our flowchart as we go along to check for possible bugs, which are indicated in the logs.
So the following modules are, in order:
- Feature Selector, whose role is to distinguish the columns used as features from the column used as label; we indicate them in its parameters. For this project the "SalePrice" column is the label, and the rest are the features, except the deleted columns of course. Don't forget to check "Use feature as test". Its parameters are presented in the following figures:
- Ordinal Encoder, which encodes ordinal values in the following columns: "Land Slope", "Exter Qual", "Exter Cond", "Bsmt Qual", "Bsmt Cond", "Bsmt Exposure", "BsmtFin Type 1", "BsmtFin Type 2", "Heating QC", "Central Air", "Kitchen Qual", "Functional", "Fireplace Qu", "Garage Finish", "Garage Qual", "Garage Cond", "Pool QC", "Fence".
Since the previous module, Feature Selector, returns an array, we should indicate in Ordinal Encoder that its input is an array. The columns above are therefore given by their index, counting from 1 and ignoring the four deleted columns.
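Counting indexes by hand is error-prone, so here is a small sketch of how the positions could be worked out in the notebook. The header below is a shortened, illustrative list rather than the full 82 columns, and one_based_positions is a hypothetical helper, not a SmartPredict function.

```python
def one_based_positions(all_columns, dropped, wanted):
    """Positions (counting from 1) of the wanted columns once the dropped ones are ignored."""
    remaining = [c for c in all_columns if c not in dropped]
    return {name: remaining.index(name) + 1 for name in wanted}

# Shortened, illustrative header; the real dataset has many more columns
header = ["Order", "PID", "MS SubClass", "MS Zoning",
          "Lot Frontage", "Lot Area", "Street", "Alley"]

positions = one_based_positions(
    header,
    dropped={"Order", "PID"},
    wanted=["MS Zoning", "Street", "Alley"],
)
print(positions)  # {'MS Zoning': 2, 'Street': 5, 'Alley': 6}
```

With the real header and the four deleted columns, the same counting gives the indexes to type into the encoder modules.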
In addition, we connect an Item Saver module to its output, whose role is to generate a module, which we call "ordinal_H_P3" here and which becomes available under Trained model in the right pane.
For information, once our flowchart runs, it creates a module that saves the configuration of the "Ordinal Encoder" module in this project, which will be used for deployment in the next step.
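To picture what the encoder and the saved module do, here is a minimal sketch in plain Python: a mapping from quality levels to integers is built once, stored, and reused later on new data. The ordering of the levels is my assumption (worst to best, with "No features" standing for the former missing values), and JSON only stands in for whatever format Item Saver really uses.

```python
import json

def fit_ordinal(order):
    """Build a level -> rank mapping from levels listed in increasing order."""
    return {level: rank for rank, level in enumerate(order)}

def transform(values, mapping):
    """Replace each level by its rank."""
    return [mapping[v] for v in values]

# Assumed order, from worst to best
quality_order = ["No features", "Po", "Fa", "TA", "Gd", "Ex"]
mapping = fit_ordinal(quality_order)

# Stand-in for saving the fitted configuration and loading it at deployment time
saved = json.dumps(mapping)
restored = json.loads(saved)

encoded = transform(["TA", "Ex", "No features"], restored)
print(encoded)  # [3, 5, 0]
```

Reusing the stored mapping is what guarantees that a value is encoded the same way at prediction time as during training.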
- One Hot Encoder, which encodes nominal values in the following columns: "MS Zoning", "Street", "Alley", "Lot Shape", "Land Contour", "Utilities", "Lot Config", "Neighborhood", "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Electrical", "Garage Type", "Paved Drive", "Misc Feature", "Sale Type", "Sale Condition".
As with Ordinal Encoder, the input is an array, so we indicate these columns by their index, counting from 1 and ignoring the deleted columns:
As before, we connect an Item Saver module to generate a module that we can name as we like; let's name it "One_hot_HP". At this stage, our flowchart looks like this:
Are you still following me? Come on, there are still a few modules left to interconnect, as shown in the figure below.
- Labeled Data Splitter, which splits our dataset into training and test data. We choose the rate "0.1", which is the share of the data set aside for testing.
- XGBoost Regressor, which is the model we train. We choose this model here, but there are many others to choose from under "Machine learning Algorithm" in the right pane.
- Trainer ML models, which receives the features and the label of the training dataset, respectively, and the machine learning model to train.
As we want to save the trained model, we connect an Item Saver module to its output, which we name "XGboost_for_regression_HP".
- Evaluator for ML, the module that evaluates our model. In its configuration we indicate our metric; here we choose the Mean Absolute Error. Once the flowchart is launched, a tooltip appears indicating the metric's value. As we can see in the figure above, this module has three inputs: the trained model, and the features and the label of the test dataset.
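The split and the metric from the modules above can be sketched in plain Python. To keep the example self-contained, a trivial baseline that always predicts the mean training price stands in for the XGBoost model, so the error it prints says nothing about our real model. The prices are made up, and I assume the rate is applied to the whole dataset.

```python
import random

def split_rows(rows, test_fraction=0.1, seed=0):
    """Shuffle a copy of rows and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up sale prices
prices = [105000, 172000, 189900, 195500, 213500, 191500, 236500, 189000,
          175900, 185000, 180400, 171500, 126000, 115000, 120000, 99500,
          149000, 141000, 210000, 216000]

train, test = split_rows(prices, test_fraction=0.1)

# Baseline "model": always predict the mean of the training prices
baseline = sum(train) / len(train)
mae = mean_absolute_error(test, [baseline] * len(test))
print(len(train), len(test), round(mae, 2))
```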
Finally, once launched (click on the run icon), our flowchart in the Build tab looks like this:
As we can see, we have a Mean Absolute Error of 13057.47, which means that our model predicts the sale price of a house with an average error of about 13 000.00 $, which is about 7% of the mean sale price. We can say that it predicts well.
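The 7% figure can be checked by dividing the reported error by the mean sale price of about 180 000.00 $ seen in the statistics table. A quick sketch, using the rounded mean as an approximation:

```python
reported_error = 13057.47   # metric value shown by the Evaluator tooltip
mean_price = 180_000        # approximate mean SalePrice from the describe() table

relative_error = reported_error / mean_price
print(f"{relative_error:.1%}")  # 7.3%
```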
Last step: Deploy and Test our model in our Deploy tab and Test tab
As I always say, SmartPredict is a platform on which we can carry out an AI project from scratch, so we can also deploy and test our model with it.
Deploying a model in SmartPredict also consists of creating a flowchart, but this time in the Deploy tab, where all the modules in Core Modules can be used. Once deployed, our model is accessible as a web service, for which an access token is generated for you in the Monitor tab.
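Outside the platform, calling such a web service would roughly amount to sending a JSON body together with the access token. The sketch below only builds the request with the Python standard library and does not send it. The URL, the token value, the "Bearer" header scheme, and the payload layout are all placeholders I am assuming for illustration; the real values are the ones shown in the Monitor tab.

```python
import json
import urllib.request

def build_prediction_request(url, token, record):
    """Prepare (but do not send) a JSON POST request carrying an access token."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed scheme; check how the platform expects the token to be passed
            "Authorization": f"Bearer {token}",
        },
    )

# Placeholder endpoint and token, not real SmartPredict values
request = build_prediction_request(
    "https://example.invalid/predict",
    "YOUR_ACCESS_TOKEN",
    {"Lot Area": 8450},
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it once the real URL and token are filled in
```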
Without keeping you in suspense, the figure below shows the flowchart that I built to deploy our model for this project:
As we can see, the flowchart in the Deploy tab always begins with the "Web Service IN" module, which receives data in JSON format, and always ends with "Web Service OUT", which returns the predicted data in the same format.
The second module must always be a "DataFrame loader/converter", which reads this JSON as a dictionary and converts it to a dataframe.
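Here is a rough idea of that conversion in plain Python: the JSON text becomes a dictionary, and the dictionary becomes a table of rows. I assume a column-oriented payload (column name mapped to a list of values); the layout the module actually expects may differ, and json_to_rows is a hypothetical helper.

```python
import json

def json_to_rows(text):
    """Parse column-oriented JSON into a list of row dictionaries."""
    columns = json.loads(text)
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of values")
    n_rows = lengths.pop()
    return [
        {name: values[i] for name, values in columns.items()}
        for i in range(n_rows)
    ]

# Made-up payload with two houses
payload = '{"Lot Area": [8450, 9600], "Street": ["Pave", "Pave"]}'
table = json_to_rows(payload)
print(table)
```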
The next modules depend on our build flowchart: we think about how the data is processed there, and we mainly reuse the modules saved by Item Saver before.
So the third module is the Feature Selector, since we use it in our build flowchart.
Then the ordinal and nominal data must be encoded, so we use the "Ordinal Encoder" and "One Hot Encoder" respectively, to which we pass the trained models "ordinal_H_P3" and "One_hot_HP" saved before.
Naturally, the next module is "Predictor ML models", which receives the machine learning model that we trained and saved before.
The following figures show the parameters of the "DataFrame loader/converter" and "Feature Selector" modules. The other modules are left at their default settings.
Finally, we click on the rocket icon to deploy (or the arrow icon to update).
But the icing on the cake is that with SmartPredict we can test our model once it is deployed. Go to the Test tab, where the test data is presented automatically. Just click on the arrow icon to see the predicted house price.
The Data / Object Logger module can be used to log the output of a module. For example, if we want to know the output of the "Feature Selector" module, we do this:
I hope that this blog has shown you how to carry out a whole project on the single SmartPredict platform; if not, I invite you to read my blogs on "Bank marketing" and "Telecom churn" as well.
Try it and you will see that it is very practical, and it is still free.
Thank you for reading. See you soon for other projects that we will explore with SP.