Case study

Delivering a data lake that serves both data analysis (including root cause analysis) and machine learning (including feature generation).

Industry: Automotive

Customer: British car maker

Competence domains:

  • Cloud architecture
  • Data analysis
  • Big Data processing
  • Data security

Challenge

Today’s main challenge for automotive companies is not simply storing data, but storing it efficiently, so that data science and data analytics teams can extract insights as quickly as possible. Each vehicle produces terabytes of data per day, and analyzing it with standard tools is either expensive or slow.

CloudMade provides a solution (an ETL pipeline and a data lake architecture) that lets data science and analytics teams work on the data simultaneously and receive insights in minutes, without spending heavily on infrastructure.

All information from every signal on the CAN and MOST buses was stored in the data lake for analysis.

Solution

Areas of responsibility

CloudMade provides full end-to-end solution development, from architecture definition to support for the data analytics and data science teams.

The solution was split into 4 main components (the overall solution was developed on Amazon Web Services and later migrated to Google Cloud Platform):

  • ETL pipeline

    This component was developed as a scalable solution (it handles terabytes of raw data per day) that performs data downsampling and cleaning.

  • Data lake

    A specially designed data lake format that lets the customer analyze data, identify the root cause of vehicle issues, generate features for machine learning, and run machine learning itself.

  • Feature generation pipeline

    This is a framework that enables data analytics and data science teams to write complex queries (including custom data types, aggregation functions, and joins of signals by time, VIN, and other parameters). Each query is then translated into PySpark transformations and actions.
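To make the idea of joining signals by time and VIN more concrete, here is a minimal sketch. It is not the framework's query language, which this case study does not show, and it uses plain Python rather than PySpark so that it runs without a cluster. The signal names, the sample values, the 1.5-second tolerance, and the helper functions asof_join and mean_per_vin are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean

# Hypothetical signal samples: (vin, timestamp_seconds, value).
speed = [
    ("VIN1", 0.0, 50.0),
    ("VIN1", 1.0, 52.0),
    ("VIN1", 2.0, 54.0),
    ("VIN2", 0.0, 30.0),
]
rpm = [
    ("VIN1", 0.9, 2000.0),
    ("VIN1", 2.1, 2200.0),
    ("VIN2", 0.2, 1500.0),
]


def asof_join(left, right, tolerance):
    """For each left sample, attach the latest right sample of the same VIN
    whose timestamp is <= the left timestamp and within `tolerance` seconds.
    Left samples without such a match are dropped."""
    by_vin = defaultdict(list)
    for vin, ts, value in sorted(right):
        by_vin[vin].append((ts, value))

    joined = []
    for vin, ts, value in left:
        candidates = by_vin.get(vin, [])
        times = [t for t, _ in candidates]
        idx = bisect_right(times, ts) - 1  # last right sample at or before ts
        if idx >= 0 and ts - candidates[idx][0] <= tolerance:
            joined.append((vin, ts, value, candidates[idx][1]))
    return joined


def mean_per_vin(rows, column):
    """Aggregate one column of the joined rows into a per-VIN mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row[column])
    return {vin: mean(values) for vin, values in groups.items()}


joined = asof_join(speed, rpm, tolerance=1.5)
features = mean_per_vin(joined, column=2)  # mean speed per VIN
```

In a Spark-based setting the same logic would typically be expressed as DataFrame joins and grouped aggregations; the sketch only shows the shape of the computation.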

Technology

  • ETL pipeline

    Flume, Kafka, custom modules written in Scala with the Akka Actors framework

  • Data Lake

    Cloudera Hadoop with HBase and Hive

  • Feature generation pipeline

    Flume, Rundeck, PySpark, Scala Spark

  • Spark data analysis wrappers

    PySpark, Python

The total size of the data in the data lake was ~100 TB of raw data (after downsampling, before compression).
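As an illustration of what downsampling and cleaning in the ETL pipeline can look like, here is a minimal sketch in plain Python. It is not CloudMade's implementation, which was built with Flume, Kafka, and Scala. The cleaning rule (drop missing or non-finite values), the one-second averaging window, and the function names clean and downsample are assumptions chosen for the example.

```python
import math
from collections import defaultdict


def clean(samples):
    """Drop samples whose value is missing or not a finite number."""
    return [
        (ts, value)
        for ts, value in samples
        if value is not None and math.isfinite(value)
    ]


def downsample(samples, window_seconds):
    """Average all samples that fall into the same fixed-size time window."""
    buckets = defaultdict(list)
    for ts, value in samples:
        window_start = math.floor(ts / window_seconds) * window_seconds
        buckets[window_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical raw samples for one signal: (timestamp_seconds, value).
raw = [
    (0.0, 10.0),
    (0.4, 12.0),
    (0.9, None),
    (1.1, 20.0),
    (1.6, float("nan")),
    (2.5, 30.0),
]
result = downsample(clean(raw), window_seconds=1.0)
```

Here six raw samples are reduced to three, one per one-second window. Applied across every signal, this kind of reduction is what keeps the stored volume manageable.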

Tell us about your challenge. At CloudMade we are ready to provide expertise and support.
