We have seen a full tutorial about Apache NiFi and SPARQL.

Today we are going to create a full real-world data-driven application using Apache NiFi, Kafka, and Spark ML.

1. Context & Introduction

In today’s world, data is everywhere, and extracting value from it has become a critical aspect of business success. Apache NiFi, Kafka, and Spark ML are three powerful tools that can be used together to build a comprehensive data pipeline for processing and analyzing large volumes of data.

  • Apache NiFi is used for collecting, processing, and distributing data from various sources, such as databases, files, and IoT devices.
  • Kafka, on the other hand, is used to build real-time data pipelines and streaming applications. It handles large volumes of data and provides features such as fault tolerance, scalability, and high throughput.
  • Finally, Spark ML is a machine learning library built on top of Apache Spark that provides a set of APIs for building and training machine learning models. Spark ML includes algorithms for classification, regression, clustering, and collaborative filtering, as well as tools for feature extraction and transformation.

By combining these three tools, it is possible to build a comprehensive data pipeline that collects data from various sources, processes and transforms it with Apache NiFi, streams it to Kafka for real-time analysis, and then uses Spark ML to build machine learning models that analyze the data and extract insights.
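To make the shape of that pipeline concrete, here is a minimal PySpark sketch that reads one of the raw gateway topics (introduced in the next section) into a streaming DataFrame. The broker address is a placeholder and the spark-sql-kafka connector is assumed to be on the classpath; this is only meant to show where the Kafka and Spark ends of the pipeline meet.

```python
# Minimal sketch (PySpark): read one raw sensor topic from Kafka into a
# streaming DataFrame. Broker and topic names are placeholders for this article.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-ingest-sketch").getOrCreate()

raw_events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
              .option("subscribe", "gateway-west-raw-sensors")    # raw gateway topic
              .load())

# Kafka delivers key/value as binary; the CSV payload lives in `value`.
csv_lines = raw_events.selectExpr("CAST(value AS STRING) AS csv_line")

query = (csv_lines.writeStream
         .format("console")   # print to stdout while prototyping
         .outputMode("append")
         .start())
query.awaitTermination()
```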

2. Requirements and Needs

To build a complex streaming analytics application for an airline company, we will work with a fictional use case.

The airline company has a large fleet of airplanes and wants to perform real-time analytics on the sensor data from the airplanes and to monitor them in real-time.

The analytics application for the airline company has the following requirements:

  1. Outfit each airplane with two sensors that emit event data such as timestamp, pilot ID, airplane ID, route, geographic location, and event type.
  2. The geographic event sensor emits geographic information (latitude and longitude coordinates) and events such as excessive turbulence or altitude changes.
  3. The speed sensor emits the speed of the airplane.
  4. Stream the sensor events to an IoT gateway. The data-producing app (e.g., an airplane) will send CSV events from each sensor to one of three gateway topics (gateway-west-raw-sensors, gateway-east-raw-sensors, or gateway-central-raw-sensors). Each event will pass the schema name for the event as a Kafka event header (a minimal producer sketch follows this list).
  5. Use NiFi to consume the events from the gateway topics, and then route, transform, enrich, and deliver the data to syndication topics (e.g., syndicate-geo-event-avro, syndicate-speed-event-avro, syndicate-geo-event-json, syndicate-speed-event-json) that various downstream analytics applications can subscribe to.
  6. Connect to the two streams of data to perform analytics on the stream.
  7. Join the two sensor streams using attributes in real-time. For example, join the geo-location stream of an airplane with the speed stream of a pilot.
  8. Filter the stream on only events that are infractions or violations.
  9. All infraction events need to be available for descriptive analytics (dash-boarding, visualizations, or similar) by a business analyst. The analyst needs the ability to perform analysis on the streaming data.
  10. Detect complex patterns in real-time. For example, over a three-minute period, detect if the average speed of a pilot is more than 500 knots on routes known to be dangerous.
  11. When each of the preceding rules fires, create alerts, and make them instantly accessible.
  12. Execute a logistic regression Spark ML model on the events in the stream to predict whether a pilot is going to commit a violation. If a violation is predicted, generate an alert (see the streaming sketch after this list).
  13. Monitor and manage the entire application using Streams Messaging Manager and Stream Operations.
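
To illustrate requirement 4, the sketch below sends one CSV sensor event to a gateway topic and attaches the schema name as a Kafka record header. It uses the kafka-python client; the broker address, header key, schema name, and CSV layout are assumptions made for this article rather than part of any reference implementation.

```python
# Hypothetical producer sketch (kafka-python): one CSV event per sensor reading,
# with the schema name carried as a Kafka record header (requirement 4).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker1:9092")  # placeholder broker

# Example geo event: timestamp, pilotId, airplaneId, route, lat, lon, eventType
csv_event = "2019-03-01T10:15:00Z,P123,A42,PAR-NYC,48.85,2.35,EXCESSIVE_TURBULENCE"

producer.send(
    "gateway-west-raw-sensors",                        # one of the three gateway topics
    value=csv_event.encode("utf-8"),
    headers=[("schema.name", b"airplane-geo-event")],  # schema name as event header
)
producer.flush()
```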
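To make requirements 6 through 12 more tangible, here is a hedged PySpark Structured Streaming sketch (referenced in item 12 above). It assumes the syndication topics carry JSON with the fields listed below, that any event type other than "Normal" counts as an infraction, and that the spark-sql-kafka connector is on the classpath. The topic names come from the requirements; everything else (broker, routes, thresholds, model path) is a placeholder.

```python
# Sketch (PySpark Structured Streaming) of requirements 6-12: consume the two
# JSON syndication topics, join the geo and speed streams, keep infractions,
# and apply the 3-minute average-speed rule. Broker address, field names, and
# thresholds are assumptions made for this article.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("airline-streaming-analytics").getOrCreate()

geo_schema = StructType([
    StructField("eventTime", TimestampType()), StructField("pilotId", StringType()),
    StructField("airplaneId", StringType()), StructField("route", StringType()),
    StructField("latitude", DoubleType()), StructField("longitude", DoubleType()),
    StructField("eventType", StringType()),
])
speed_schema = StructType([
    StructField("eventTime", TimestampType()), StructField("pilotId", StringType()),
    StructField("airplaneId", StringType()), StructField("route", StringType()),
    StructField("speed", DoubleType()),
])

def read_topic(topic, schema):
    """Read one syndication topic and parse its JSON payload."""
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
            .option("subscribe", topic)
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
            .select("e.*")
            .withWatermark("eventTime", "1 minute"))

geo = read_topic("syndicate-geo-event-json", geo_schema).alias("g")
speed = read_topic("syndicate-speed-event-json", speed_schema).alias("s")

# Requirements 7-8: join the two streams on pilot/airplane within a time bound,
# then keep only infraction events.
violations = (geo.join(speed, F.expr("""
                  g.pilotId = s.pilotId AND g.airplaneId = s.airplaneId AND
                  s.eventTime BETWEEN g.eventTime - interval 30 seconds
                                  AND g.eventTime + interval 30 seconds"""))
              .filter(F.col("g.eventType") != "Normal"))

# Requirement 12 (sketch only): a pre-trained logistic regression pipeline could
# score each joined event; the model path below is hypothetical.
# from pyspark.ml import PipelineModel
# predictions = PipelineModel.load("hdfs:///models/violation-model").transform(violations)

# Requirement 10: over a 3-minute window, flag pilots whose average speed
# exceeds 500 knots on routes assumed here to be dangerous.
speeding = (speed.filter(F.col("route").isin("ROUTE-27", "ROUTE-66"))  # placeholder routes
            .groupBy(F.window("eventTime", "3 minutes"), "pilotId")
            .agg(F.avg("speed").alias("avgSpeed"))
            .filter(F.col("avgSpeed") > 500))

# Requirement 11: in the real application both queries would feed an alert topic;
# the console sink is used here only for illustration.
violations.writeStream.format("console").outputMode("append").start()
speeding.writeStream.format("console").outputMode("update").queryName("speeding").start()
spark.streams.awaitAnyTermination()
```

Keeping the join/filter query and the windowed aggregation as separate streaming queries means each one contains a single stateful operator, which avoids the restrictions older Spark releases place on chaining a stream-stream join with an aggregation.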

3. Preparing the Environment

3.1 Cluster Deployment

Step 1: Install Ambari 2.7.0

Ambari is a management platform that allows you to provision, manage, and monitor Hadoop clusters. Follow these steps to install Ambari 2.7.0:

  • Go to the Ambari website and download the Ambari 2.7.0 installation package.
  • Extract the package to a directory of your choice.
  • Change to the extracted directory and run the following command to start the Ambari Server installation: sudo ./setup.sh
  • Follow the prompts to complete the installation of Ambari Server.
  • After the installation completes, start the Ambari Server: sudo ambari-server start

Step 2: Install HDP 3.0.0

**HDP** (Hortonworks Data Platform) is an enterprise-grade Hadoop distribution that includes Hadoop ecosystem components, such as HDFS, YARN, MapReduce, Hive, and HBase. Follow these steps to install HDP 3.0.0:

  • Go to the HDP website and download the HDP 3.0.0 installation package.
  • Extract the package to a directory of your choice.
  • Change to the extracted directory and run the following command to start the HDP installation: sudo ./setup.sh
  • Follow the prompts to complete the installation of HDP.
  • After the installation completes, start the HDP services from the Ambari web UI.

I really recommend reading this IBM document:

Hortonworks Data Platform .pdf

Step 3: Install the HDF 3.2.0 Management Pack onto the HDP cluster

The HDF Management Pack provides a set of management tools and services for HDF components, such as Apache NiFi, Kafka, and Storm. Follow these steps to install the HDF 3.2.0 Management Pack onto the HDP cluster:

  • Download the HDF 3.2.0 Management Pack installation package from the HDF website.
  • Extract the package to a directory of your choice.
  • Change to the extracted directory and run the following command to install the HDF Management Pack onto the HDP cluster: sudo ambari-server install-mpack --mpack=/path/to/hdf-3.2.0-management-pack.tar.gz
  • Follow the prompts to complete the installation of the HDF Management Pack.
  • After the installation completes, restart the Ambari Server: sudo ambari-server restart

Step 4: Review your HDF and HDP cluster deployment options in Planning Your Deployment

After completing the installation of HDF, you should review your HDF and HDP cluster deployment options.

This will ensure that your deployment is optimized for your use case and that you are taking advantage of the latest features and enhancements.

The Planning Your Deployment guide provides detailed information on how to plan and optimize your HDF and HDP cluster deployment.

Step 5: Check the HDF Support Matrices for compatibility

Before using HDF with HDP, it is important to check the HDF Support Matrices to ensure that you are using compatible versions of HDF and HDP. The HDF Support Matrices provide a detailed list of the supported versions and configurations for HDF and HDP.

By following these steps, you should be able to install HDF for your use case and start building the airline streaming application.

If you encounter any issues during the installation process, consult the HDF documentation or seek assistance from the HDF community.

The following Cloudera guide explains how to register the sensor schemas in Schema Registry: https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.3.0/getting-started-with-streaming-analytics/content/registering_schemas_in_schema_registry.html
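
As a rough companion to that guide, the sketch below registers one of the sensor schemas over Schema Registry's REST API using the requests library. The endpoint paths, port, and payload fields follow my understanding of the Hortonworks/Cloudera Schema Registry conventions and should be treated as assumptions; check the linked guide for your registry version.

```python
# Hedged sketch: register the airplane geo-event schema in Schema Registry.
# Host, port, endpoint paths, and payload fields are assumptions for this article.
import json
import requests

REGISTRY = "http://registry-host:7788/api/v1/schemaregistry"  # placeholder host/port

geo_avro_schema = {
    "type": "record",
    "name": "AirplaneGeoEvent",
    "fields": [
        {"name": "eventTime", "type": "string"},
        {"name": "pilotId", "type": "string"},
        {"name": "airplaneId", "type": "string"},
        {"name": "route", "type": "string"},
        {"name": "latitude", "type": "double"},
        {"name": "longitude", "type": "double"},
        {"name": "eventType", "type": "string"},
    ],
}

# 1. Create the schema metadata entry.
requests.post(f"{REGISTRY}/schemas", json={
    "name": "airplane-geo-event",
    "type": "avro",
    "schemaGroup": "airline",
    "description": "Geo events emitted by the airplane geo sensor",
    "compatibility": "BACKWARD",
})

# 2. Add the first schema version (the Avro text itself).
requests.post(f"{REGISTRY}/schemas/airplane-geo-event/versions", json={
    "schemaText": json.dumps(geo_avro_schema),
    "description": "initial version",
})
```

The schema name used here matches the schema.name header attached in the producer sketch earlier, so a consumer such as NiFi can look the schema up by name.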

In the tutorial in the following video, you can find another example of using Spark and Kafka, but this time with Hadoop.

Written by

Albert Oplog

Hi, I'm Albert Oplog. I would humbly like to share my tech journey with people all around the world.