How GCP Dataprep Assists with Cleansing a Dataset

Customer connectivity issues need a proactive solution. In our approach to building an ML model, the first step is to use GCP Dataprep for cleansing, feature engineering, and related data preparation.

Business Problem – Connectivity Issue

  • The customer has multiple Enterprise-to-Store connections for exchanging information, and these connections are critical to the customer’s business. Currently, there are connectivity issues between the Enterprise (EAI layer) and the Stores (Store layer) in every region.
    • Most of the time, the error is a connection timeout: the firewall rules at the Stores were either configured incorrectly or never added, as the network playbook requires, to accept traffic between the Stores and the Enterprise.
    • A script that automates the connectivity check has been running for a month. It produces a log file containing a timestamp and the result of each transaction (successful or unsuccessful).
  • Given the number of stores the customer has, ensuring connectivity to and from the stores has become a tedious and challenging task for our team. Proactive measures would be very helpful.
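
To make the automated connectivity check concrete, here is a minimal Python sketch of what such a script might look like. It is not the script used in the project: the function names, the status labels (SUCCESS, TIMEOUT, ERROR), and the comma-separated log layout are all assumptions made for illustration.

```python
import socket
from datetime import datetime, timezone


def check_connection(host, port, timeout=3.0):
    """Attempt one TCP connection and return 'SUCCESS', 'TIMEOUT', or 'ERROR'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "SUCCESS"
    except socket.timeout:
        # No answer within the timeout, e.g. a firewall silently dropping traffic.
        return "TIMEOUT"
    except OSError:
        # Any other network failure, e.g. connection refused or host unreachable.
        return "ERROR"


def format_log_line(store_id, host, port, status):
    """Build one log line: timestamp,store_id,host,port,status (hypothetical layout)."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{timestamp},{store_id},{host},{port},{status}"


if __name__ == "__main__":
    # Hypothetical store endpoint, used only to show the call pattern.
    status = check_connection("127.0.0.1", 9, timeout=1.0)
    print(format_log_line("store-001", "127.0.0.1", 9, status))
```

Separating timeouts from other errors matters here because, as noted above, timeouts are the dominant symptom of missing firewall rules.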

Solution Approach

The connectivity results are generated in log files, which serve as the input dataset for an ML model that predicts connectivity issues.
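
As a rough illustration of how log files become a dataset, the sketch below parses log text into typed records with a failure label. The column order (timestamp, store_id, host, port, status) and the timestamp format are assumptions; the real log layout is not shown in this post.

```python
import csv
import io
from datetime import datetime

# Assumed (hypothetical) column order; adjust to the real log layout.
FIELDS = ["timestamp", "store_id", "host", "port", "status"]


def parse_log(text):
    """Turn raw log text into a list of typed records, skipping malformed lines."""
    records = []
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    for row in reader:
        # DictReader marks short rows with None values and long rows with a None key.
        if None in row or None in row.values():
            continue
        try:
            records.append({
                "timestamp": datetime.strptime(row["timestamp"], "%Y-%m-%dT%H:%M:%SZ"),
                "store_id": row["store_id"],
                "host": row["host"],
                "port": int(row["port"]),
                # The label an ML model would learn to predict.
                "failed": row["status"] != "SUCCESS",
            })
        except ValueError:
            # Unparseable timestamp or port: treat the line as malformed.
            continue
    return records
```

Each record carries a boolean `failed` field, which is the kind of target column a failure-prediction model would need.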

Solution Steps

  • Step 1 – Use Google’s Dataprep to prepare the data for the ML model (cleansing, feature engineering, etc.)
  • Step 2 – Try multiple approaches and algorithms to find the ML model best suited to the use case
  • Step 3 – Deploy the model and predict failures so that the project team can proactively monitor the connections

It is commonly estimated that 70-80% of the effort in building an ML model goes into data preparation. Our dataset also needs cleansing, reformatting, and conditional formatting, all of which Dataprep handles in a few easy steps.
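
For readers who want to see what cleansing and feature engineering mean in code terms, here is a plain-Python sketch of the kind of transformations involved. It is not Dataprep's interface and not the recipe from this project. The field names, the timestamp format, and the chosen features (hour, day of week, failure flag) are assumptions for illustration.

```python
from datetime import datetime


def cleanse(rows):
    """Trim whitespace, normalise casing, and drop incomplete rows and duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        timestamp = (row.get("timestamp") or "").strip()
        store_id = (row.get("store_id") or "").strip().lower()
        status = (row.get("status") or "").strip().upper()
        if not (timestamp and store_id and status):
            continue  # incomplete row
        key = (timestamp, store_id, status)
        if key in seen:
            continue  # duplicate after normalisation
        seen.add(key)
        cleaned.append({"timestamp": timestamp, "store_id": store_id, "status": status})
    return cleaned


def add_features(rows):
    """Derive simple time-based features and a numeric failure flag."""
    featured = []
    for row in rows:
        when = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
        featured.append({
            **row,
            "hour": when.hour,
            "day_of_week": when.weekday(),  # Monday is 0
            "is_failure": int(row["status"] != "SUCCESS"),
        })
    return featured
```

Time-based features are a natural first choice for this problem, since connection failures may cluster around particular hours or days.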

Please watch the video to see how Dataprep assisted with data cleansing and feature engineering.

Once the Dataprep recipe is created, it can be scheduled and reused in the future.
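
The idea of a reusable recipe can be pictured as an ordered list of transformation steps that is defined once and applied to each new batch of data. The sketch below shows that idea in plain Python. It is only an analogy, not how Dataprep stores or runs recipes, and the step names and field names are hypothetical. It assumes every field value is a string.

```python
def strip_whitespace(rows):
    """Remove leading and trailing whitespace from every field."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def uppercase_status(rows):
    """Normalise the status field to upper case."""
    return [{**row, "status": row["status"].upper()} for row in rows]


def drop_incomplete(rows):
    """Drop rows in which any field is empty."""
    return [row for row in rows if all(row.values())]


# The recipe: an ordered list of steps, defined once and reused for every batch.
RECIPE = [strip_whitespace, uppercase_status, drop_incomplete]


def run_recipe(recipe, rows):
    """Apply each step in order and return the transformed rows."""
    for step in recipe:
        rows = step(rows)
    return rows


if __name__ == "__main__":
    todays_batch = [{"store_id": " store-001 ", "status": "timeout"}]
    print(run_recipe(RECIPE, todays_batch))
```

Because the recipe is data rather than hard-wired logic, the same steps can run on tomorrow's log file without modification.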

Munira Gandhi is a data & analytics practice manager at Miracle Software Systems, with over 16 years of experience as an Enterprise Information/Data architect focused on all aspects of data (ingestion, integration, analytics). She is a certified AWS cloud architect and is working toward the Google GCP Data Engineer certification.

Specialties:
- Big Data Architecture and Strategy (Hadoop ecosystem; Google BigQuery)
- Data Science and Analytics (Python)
- Cloud
- Business Intelligence (Power BI, Tableau, Spotfire)
- Oil & Gas domain knowledge expert
