How to perform Data profiling quickly with SmartPredict?

Published on Apr 20, 2021 by Christelle Julias

As a data professional, your core tasks include monitoring and cleansing data, improving data quality, converting data, and presenting reports to your team. Data profiling is required for those final reports, but it also serves other equally important purposes in data projects, such as content discovery, structure discovery, and relationship discovery. Indeed, you need it right at the beginning, for preliminary insights while you are still at the data preprocessing stage. In this blog article, let's see how you can get the most out of it.

Pandas Profiling is a Python library that generates profiling reports from a pandas data frame. If you have limited coding experience or no time to remember lines of commands, you surely wish you could create data profiling reports easily, don't you? Well, you can. With the data visualizer module of SmartPredict, you get the Pandas Profiling features inside a simple module:

  • Type inference: detect the types of columns in a data frame.
  • Essentials: type, unique values, missing values
  • Quantile statistics
  • Descriptive statistics like mean, mode, standard deviation
  • Most frequent values
  • Histogram
  • Correlations (Spearman, Pearson and Kendall matrices)
  • Missing values matrix, count, heatmap and dendrogram

This simple portable module offers you the full experience without the code worries.
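For readers curious about what these report elements look like in code, here is a minimal sketch using plain pandas. It is not SmartPredict's or Pandas Profiling's implementation; the toy dataset and all variable names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy dataset, invented for illustration only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, None, 38],
    "income": [30000, 42000, 58000, 61000, 45000, None],
    "city": ["Paris", "Lyon", "Paris", None, "Paris", "Nice"],
})

# Essentials: inferred type, unique values and missing values per column.
essentials = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "unique": df.nunique(),
    "missing": df.isna().sum(),
})

# Keep only the numeric columns for the statistics below.
numeric = df.select_dtypes("number")

# Quantile statistics.
quantiles = numeric.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

# Descriptive statistics: mean and standard deviation.
descriptive = numeric.agg(["mean", "std"])

# Most frequent values of a categorical column.
top_cities = df["city"].value_counts()

# Histogram: number of rows in each of three equal-width bins.
age_hist = pd.cut(df["age"], bins=3).value_counts(sort=False)

# Correlation matrices (method="kendall" is also accepted by pandas).
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

# Missing values matrix: True wherever a cell is empty.
missing_matrix = df.isna()

print(essentials)
print(quantiles)
print(descriptive)
print(top_cities)
print(age_hist)
print(spearman)
```

A profiling report gathers these tables and draws them as charts, which is the part the module handles for you.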

SmartPredict data visualizer for profiling datasets and processing pipelines

Whether you are a data analyst, a data scientist or a product manager, SmartPredict has designed a special tool for you to produce colorful graphical reports in no time.

As you already know, in SmartPredict, data science pipelines are represented as flowcharts, which makes it easier than ever to create and deploy artificial intelligence projects.

Based on the Pandas Profiling tool, the drag-and-drop visualization module can be placed anywhere on the flowchart and linked to any output to display the data before and after processing.

To see how, follow this short tutorial on getting visual reports with the data visualizer.

1-Create a new project in Manualflow 

2-Drag and drop your dataset into the build workspace 

3-Look for the data visualizer under the control modules drop-down list

4-Link the dataset to the data visualizer then run the project

5-Open the data visualizer's menu and go to the processing tab, then the profiling tab.

Note that the data visualizer can show the content of a dataset anywhere on the flowchart, even after processing operations, just like a data processor.
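The idea of profiling a dataset before and after a processing step can also be sketched in plain pandas. This is only an illustration under stated assumptions: the helper `quick_profile`, the toy dataset and the fill-in step are all hypothetical and are not part of SmartPredict.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })


# Hypothetical raw dataset, invented for illustration only.
raw = pd.DataFrame({
    "price": [10.0, None, 12.5, 12.5, 40.0],
    "category": ["a", "b", None, "b", "b"],
})

# A stand-in processing step: fill missing values.
processed = raw.assign(
    price=raw["price"].fillna(raw["price"].median()),
    category=raw["category"].fillna("unknown"),
)

# Profile both versions and put them side by side.
before = quick_profile(raw)
after = quick_profile(processed)
comparison = before.join(after, lsuffix="_before", rsuffix="_after")

print(comparison)
```

Placing the two profiles side by side makes the effect of the processing step visible at a glance, which is what linking the data visualizer to different outputs of the flowchart achieves.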

You have now seen how to use the data visualizer to gain valuable insights from your data exploration.

Now start using it to present your reports brilliantly.

Find the video here that explains it all in detail.

Conclusion:

We have seen that data profiling is necessary for data warehousing and business intelligence, and is an extremely important step before diving into a data science project. Because it provides an overview of the data you deal with, it helps pinpoint the quantitative and qualitative relationships between the elements of your dataset. It also offers an intuitive roadmap of where to clean your dataset or where to aggregate features. To make it even more playful, SmartPredict has designed a high-level module based on the Pandas Profiling tool for visualizing the distribution of your data.

Try it out!