Apache NiFi Crash Course IntroRafael Coss - @racossHadoop Summit – Tokyo
Oct 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AgendaData Flow & Streaming Fundamentals
What is dataflow and what are the challenges?
Apache NiFi
Architecture
Lab
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Flow & Streaming Fundamentals
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Connected Data World Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars– Weather Stations, Smart Grids– RFID Tags, Beacons, Wearables
User Generated Content (Web & Mobile)– Twitter, Facebook, Snapchat, YouTube– Clickstream, Ads, User Engagement– Payments: Paypal, Venmo
44ZB in 2020
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Let’s Connect A to BProducers A.K.A Things
AnythingAND
Everything
Internet!
Consumers• User• Storage• System• …More Things
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is Stream Processing?
Batch Processing• Ability to process and analyze data at-rest (stored data)• Request-based, bulk evaluation and short-lived processing• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing• Ability to ingest, process and analyze data in-motion in real- or near-real-time• Event or micro-batch driven, continuous evaluation and long-lived processing• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best
Action
Stream Processing + Batch Processing = All Data Analyticsreal-time (now) historical (past)
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Modern Data ApplicationsCustom or Off the Shelf
Real-Time Cyber Securityprotects systems with superior threat detectionSmart Manufacturingdramatically improves yields by managing more variables in greater detailConnected, Autonomous Carsdrive themselves and improve road safetyFuture Farmingoptimizing soil, seeds and equipment to measured conditions on each square footAutomatic Recommendation Enginesmatch products to preferences in milliseconds
DATA ATREST
DATA IN MOTION
ACTIONABLEINTELLIGENCE
Modern Data Applications
Hortonworks DataFlow
Hortonworks Data Platform
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Store Data
Process and Analyze Data
Acquire Data
Simplistic View of DataFlows: Easy, Definitive
Dataflow
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Unassuming Line: A Case StudyWe’ve seen a few lines show up in the wild thus far
Internet! Inter- & Intra- connections inour global courier enterprise
Spotlight: Arthur Lacôte, https://thenounproject.com/turo/
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dataflow Line Anatomy 101Let’s dissect what this line typically represents
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
Script or Application
Script or Application
Data Data
Disparate TransportMechanisms
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dataflow Line Anatomy 201Sometimes that transport is just more lines
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
Script or Application
Script or Application
Line Inception
Data Data
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Realistic View of Dataflows: Complex, Convoluted
Store Data
Process and Analyze Data
Acquire Data
Store DataStore Data
Store Data
Store Data
Acquire Data
Acquire Data
Acquire Data
Dataflow
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Streaming Architecture
IngestionSimple Event Processing
EngineStream Processing
DestinationData Bus
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
High-Level Overview
IoT Edge(single node)
IoT Edge(single node)
IoT Devices
IoT Devices
NiFi Hub Data Broker
Column DB
Data Store
Live Dashboard
Data Center(on premises/cloud)
HDFS/S3 HBase/Cassandra
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AgendaWhat is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Moving data effectively is hard
Standards: http://xkcd.com/927/
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why is moving data effectively hard?
Standards Formats “Exactly Once” Delivery Protocols Veracity of Information Validity of Information Ensuring Security Overcoming Security
Compliance Schemas Consumers Change Credential Management “That [person|team|group]” Network “Exactly Once” Delivery
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕsLet’s consider the needs of a courier service
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center Core Data Center at HQ
Server Cluster
On Delivery Routes
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/Deliverer: Rigo Peter, https://thenounproject.com/rigo/Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Great! I am collecting all this data! Let’s use it!Finding our needles in the haystack
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Kafka
Core Data Center at HQ
Server Cluster
Others
Storm / Spark / Flink / Apex
Kafka
Storm / Spark / Flink / Apex
On Delivery Routes
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/Deliverer: Rigo Peter, https://thenounproject.com/rigo/Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕsOh, that courier service is global
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AgendaWhat is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Capabilities/Gaps
Use cases collected from the field since last release (HDF 1.2)
Major business drivers behind the use case
Problems, challenges and major pain points
How does NiFi help solve the problems
What are the remaining gaps
Use Cases
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi High Level Capabilities Web-based user interface
– Design, control, feedback & monitoring
Highly configurable– Loss tolerant vs guaranteed delivery– Low latency vs high throughput– Dynamic prioritization– Flow can be modified at runtime– Back pressure
Data provenance– Track dataflow from beginning to end
Designed for extension– Build your own processors
Secure– SSL, SSH, HTTPS, etc.
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFiKey Features
• Guaranteed delivery• Data buffering
- Backpressure- Pressure release
• Prioritized queuing• Flow specific QoS
- Latency vs. throughput- Loss tolerance
• Data provenance• Supports push and pull
models
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates• Pluggable/multi-role
security• Designed for extension• Clustering
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Deeper Ecosystem Integration: 170+ Processors
HTTP
Syslog
HTML
Image
Hash Encrypt
Extract
TailMerge
Evaluate
Duplicate Execute
Scan
GeoEnrich
Replace
ConvertSplit
Translate
HL7
FTP
UDP
XML
SFTP
Route Content
Route Context
Route Text
Control Rate
Distribute LoadAMQP
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Revisit: Courier service from the perspective of NiFi
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center Core Data Center at HQ
Server Cluster
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/Deliverer: Rigo Peter, https://thenounproject.com/rigo/Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
NiFi NiFi NiFi NiFi NiFi NiFi
On Delivery Routes
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Courier service from the perspective of NiFi & MiNiFi
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center Core Data Center at HQ
Server Cluster
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/Deliverer: Rigo Peter, https://thenounproject.com/rigo/Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
Client Libraries
Client Libraries
MiNiFi
MiNiFi NiFi NiFi NiFi NiFi NiFi NiFi
Client Libraries
On Delivery Routes
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi Subproject: MiNiFi
Let me get the key parts of NiFi close to where data begins and provide bi-directional communication
NiFi lives in the data center. Give it an enterprise server or a cluster of them.
MiNiFi lives as close to where data is born and is a guest on that device or system
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Visual Command and Controlvs.
Design and Deploy
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi Managed DataflowSOURCES REGIONAL
INFRASTRUCTURECORE
INFRASTRUCTURE
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi is based on Flow Based Programming (FBP)FBP Term NiFi Term DescriptionInformation Packet
FlowFile Each object moving through the system.
Black Box FlowFile Processor
Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer
Connection The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler Flow Controller
Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet Process Group
A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new component simply by composition of its components.
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
FlowFiles & Data Agnosticism
NiFi is data agnostic! But, NiFi was designed understanding that users
can care about specifics and provides tooling
to interact with specific formats, protocols, etc.
ISO 8601 - http://xkcd.com/1179/
Robustness principle
Be conservative in what you do, be liberal in what you accept from others“
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
FlowFiles are like HTTP dataHTTP Data FlowFile
HTTP/1.1 200 OKDate: Sun, 10 Oct 2010 23:26:07 GMTServer: Apache/2.2.8 (CentOS) OpenSSL/0.9.8gLast-Modified: Sun, 26 Sep 2010 22:04:35 GMTETag: "45b6-834-49130cc1182c0"Accept-Ranges: bytesContent-Length: 13Connection: closeContent-Type: text/html
Hello world!
Standard FlowFile AttributesKey: 'entryDate’ Value: 'Fri Jun 17 17:15:04 EDT 2016'Key: 'lineageStartDate’ Value: 'Fri Jun 17 17:15:04 EDT 2016'Key: 'fileSize’ Value: '23609'FlowFile Attribute Map ContentKey: 'filename’Value: '15650246997242'Key: 'path’ Value: './’
Binary Content *
Header
Content
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The need for data provenance
For Operators• Traceability, lineage• Recovery and replayFor Compliance• Audit trail• RemediationFor Business / Mission• Value sources • Value IT investment
BEGIN
ENDLINEAGE
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Provenance– Improved Navigation and Clearer Interaction
• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in/fan-out, i.e. merging and splitting data
• View attributes and content at given points in time
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AgendaWhat is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Zero-master ClusteringFramework
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi vs MiNiFi Java Processes
NiFi Framework
Components
MiNiFi
NiFi Framework
User Interface
Components
NiFi
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
MiNiFi
Java agent
Java implementation
Availability– GA HDF 2.0 (built from scratch, ~ 10MB)
Native agent
C++ implementation
Availability– TP HDF 2.0 – GA post HDF 2.0
Resource efficient (focus on memory and disk)
Near term (HDF 2.0)
Design & deploy– Push updates– Config file driven/REST API (MiNiFi API – post
configurations and receive information, etc.) access
Long term
Centralized command and control
MiNiFi Agent MiNiFi Management
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why NiFi?
Moving data is multifaceted in its challenges and these are present in different contexts at varying scopes– Think of our courier example and organizations like it: inter vs intra, domestically, internationally
Provide common tooling and extensions that are commonly needed but be flexible for extension– Leverage existing libraries and expansive Java ecosystem for functionality– Allow organizations to integrate with their existing infrastructure
Empower folks managing your infrastructure to make changes and reason about issues that are occurring– Data Provenance to show context and data’s journey– User Interface/Experience a key component
NiFi Traffic Patterns Demo
NiFi Traffic Patterns Lab
46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Smart Cities: Traffic Congestion
Monitor: Public transportation vehicles Pedestrian levels Optimize public transit duration
and walking routes
47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Our Lab for Today
We will be exploring some examples to work through creating a dataflow with Apache NiFi
Use Case: An urban planning board is evaluating the need for a new highway, dependent on current traffic patterns, particularly as other roadwork initiatives are under way. Integrating live data poses a problem because traffic analysis has traditionally been done using historical, aggregated traffic counts. To improve traffic analysis, the city planner wants to leverage real-time data to get a deeper understanding of traffic patterns. NiFi was selected for for this real-time data integration.
Labs are available at http://tinyurl.com/nificrashcourse
Getting Started Resources
49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Connected Data Architecture with HDC for AWS
C L O U DIdeal Use Cases:Data Science and Exploration(Spark, Zeppelin)
ETL and Data Preparation(Hive, Spark)
Analytics and Reporting(Hive2 w/LLAP, Zeppelin)
Cloud Data Processing
(HDC for AWS)
Technical Preview
hortonworks.github.io/hdp-aws
50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Learn more and join us!
Apache NiFi sitehttp://nifi.apache.org
Subproject MiNiFi sitehttp://nifi.apache.org/minifi/
Subscribe to and collaborate [email protected]@nifi.apache.org
Submit Ideas or Issueshttps://issues.apache.org/jira/browse/NIFI
Follow us on Twitter@apachenifi
51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Big Data Tutorials
Get Started– hortonworks.com/tutorials– Apache Hadoop & Ecosystem
• tinyurl.com/hello-hdp– Apache Spark
• tinyurl.com/hwx-spark-intro– Apache NiFi
• tinyurl.com/nifi-intro– Use Case
• IoT• Social Media
52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hortonworks Nourishes the CommunityH O R T O N W O R K S
C O M M U N I T Y C O N N E C T I O NH O R T O N W O R K S PA R T N E R W O R K S
53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Want to continue the technical Introduction?
Hadoop Summit Crash Courses– Replays– Free
hadoopsummit.org/san-jose/agenda– Apache Hadoop– Apache Spark– Apache NiFi– IoT & Streaming– Data Science
55 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank you!
56 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AgendaWhat is dataflow and what are the challenges?
Apache NiFi
Architecture
Demo
Community
57 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Matured at NSA 2006-2014
Brief history of the Apache NiFi Community
• Contributors from Government and several commercial industries
• Releases on a 6-8 week schedule
Code developed at NSA
2006
Today
Achieved TLP
status in just 7 months
July 2015
Code available open source
ASL v2
November 2014
58 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
MiNiFi Prospective Plans - Centralized Command and Control
Design at a centralized place, deploy on the edge– Flow deployment– NAR deployment– Agent deployment
Version control of flows Agent status monitoring Bi-directional command and control
Centralized management console with a UI
Top Related