Case study
Delivering a data lake suited to both data analysis (including root cause analysis) and machine learning (including feature generation).
Industry: Automotive
Customer: British car maker
Competence domains:
- Cloud architecture
- Data analysis
- Big Data processing
- Data security
Challenge
Today’s main challenge for automotive companies is not simply storing data, but storing it efficiently, so that data science and data analytics teams can extract insights as soon as possible. Each vehicle produces terabytes of data per day, and analyzing it with standard tools can be either expensive or slow.
CloudMade provides a solution (an ETL pipeline and a data lake architecture) that lets data science and analytics teams work on the data simultaneously and receive insights in minutes, without spending heavily on infrastructure.
All information from every signal on the CAN and MOST buses was stored in the data lake for analysis.
Solution
Areas of responsibility
CloudMade provides full end-to-end solution development, from architecture definition to support for the data analytics and data science teams.
The solution was split into 4 main components (the overall solution was developed in Amazon Web Services and was then migrated to the Google Cloud Platform):
- ETL pipeline
  A scalable component that handles terabytes of raw data per day and performs data downsampling and cleaning.
- Data lake
  A specially designed data lake format that lets the customer analyze data, identify the root cause of vehicle issues, generate features for machine learning, and run machine learning itself.
- Feature generation pipeline
  A framework that enables data analytics and data science teams to write complex queries (including custom data types, aggregation functions, and joins of signals by time, VIN, and other parameters). Each query is then translated into PySpark transformations and actions.
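To make the idea of joining signals by time and VIN more concrete, here is a minimal sketch in plain Python. It is not the CloudMade framework and it does not use PySpark; it only illustrates the kind of operation such a query might express: aggregating each signal per vehicle and time window, then joining the results. The signal names, sample values, and the helpers `bucket_mean` and `join_signals` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical signal samples: (vin, timestamp_seconds, value).
speed = [
    ("VIN001", 0.2, 50.0),
    ("VIN001", 0.7, 54.0),
    ("VIN001", 1.4, 60.0),
    ("VIN002", 0.1, 30.0),
]
rpm = [
    ("VIN001", 0.5, 2000.0),
    ("VIN001", 1.1, 2400.0),
    ("VIN002", 0.3, 1500.0),
]


def bucket_mean(samples, window_s=1.0):
    """Average a signal per (VIN, time window index)."""
    groups = defaultdict(list)
    for vin, ts, value in samples:
        groups[(vin, int(ts // window_s))].append(value)
    return {key: mean(values) for key, values in groups.items()}


def join_signals(left, right, window_s=1.0):
    """Inner-join two aggregated signals on (VIN, time window index)."""
    left_agg = bucket_mean(left, window_s)
    right_agg = bucket_mean(right, window_s)
    shared_keys = sorted(left_agg.keys() & right_agg.keys())
    return {key: (left_agg[key], right_agg[key]) for key in shared_keys}


# One feature row per vehicle and one-second window.
features = join_signals(speed, rpm)
for (vin, window), (avg_speed, avg_rpm) in features.items():
    print(vin, window, avg_speed, avg_rpm)
```

In a Spark-based pipeline the same shape would typically map to a group-by aggregation followed by a join on the VIN and window columns, which is what lets the work be distributed across a cluster.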
Technology
- ETL pipeline
  Flume, Kafka, custom modules written in Scala with the Akka Actors framework
- Data lake
  Cloudera Hadoop with HBase and Hive
- Feature generation pipeline
  Flume, Rundeck, PySpark, Scala Spark
- Spark data analysis wrappers
  PySpark, Python
The total size of the data in the data lake was approximately 100 TB of raw data (measured after downsampling and before compression).
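The case study does not describe how downsampling and cleaning were implemented. As a rough illustration of what these two steps mean for a single signal, the sketch below drops missing or non-finite values and then keeps the first sample in each fixed time window. The strategy, the sample data, and the function names `clean` and `downsample` are assumptions made for this example; the production pipeline was built in Scala with Akka, not in Python.

```python
import math


def clean(samples):
    """Drop samples whose value is missing or not a finite number."""
    return [
        (ts_ms, value)
        for ts_ms, value in samples
        if value is not None and math.isfinite(value)
    ]


def downsample(samples, period_ms):
    """Keep the first sample in each period_ms-wide time window."""
    kept = []
    last_window = None
    for ts_ms, value in sorted(samples, key=lambda s: s[0]):
        window = ts_ms // period_ms
        if window != last_window:
            kept.append((ts_ms, value))
            last_window = window
    return kept


# Hypothetical raw signal: (timestamp in milliseconds, value).
raw = [
    (0, 1.0),
    (10, 1.1),
    (20, None),          # missing reading
    (100, 1.2),
    (110, float("nan")),  # corrupt reading
    (250, 1.5),
]

# Clean first, then reduce the signal to one sample per 100 ms window.
reduced = downsample(clean(raw), period_ms=100)
print(reduced)
```

Integer millisecond timestamps are used here so that window boundaries are not affected by floating-point rounding.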