Big Data & Hadoop Environment · 2018-01-31

Page 1: Big Data & Hadoop Environment

Big Data & Hadoop Environment

These training presentations were produced as part of the İstanbul Big Data Eğitim ve Araştırma Merkezi Projesi (project no. TR10/16/YNY/0036), carried out under the İstanbul Kalkınma Ajansı's 2016 Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı. Sole responsibility for the content lies with Bahçeşehir Üniversitesi; the content does not reflect the views of İSTKA or the Kalkınma Bakanlığı.

Page 2: Big Data & Hadoop Environment

What is big data?

Why do we need big data analytics?

How to set up infrastructure for big data?

Page 3: Big Data & Hadoop Environment

How much data?

• Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)
• Processes 20 PB a day (2008)
• Crawls 20B web pages a day (2012)
• Search index is 100+ PB (5/2014)
• Bigtable serves 2+ EB, 600M QPS (5/2014)
• 300 PB data in Hive + 600 TB/day (4/2014)
• 400B pages, 10+ PB (2/2014)
• LHC: ~15 PB a year
• LSST: 6–10 PB a year (~2020)
• 150 PB on 50k+ servers running 15k apps (6/2011)
• S3: 2T objects, 1.1M requests/second (4/2013)
• SKA: 0.3–1.5 EB per year (~2020)
• Hadoop: 365 PB, 330K nodes (6/2014)

"640K ought to be enough for anybody."

Page 4: Big Data & Hadoop Environment

What percentage of all the data in the world has been generated in the last 2 years?

Page 5: Big Data & Hadoop Environment

Big Data

What are the key features of big data? The 4 Vs:

• Volume: petabyte scale
• Variety: structured, semi-structured, unstructured
• Velocity: social media, sensors, throughput
• Veracity: unclean, imprecise, unclear

Page 6: Big Data & Hadoop Environment

[Figure: a sequencer turns a subject genome into billions of overlapping short reads, e.g. GATGCTTACTATGCGGGCCCC, which must be assembled back into the genome]

• Human genome: 3 Gbp
• A few billion short reads (~100 GB compressed data)

Page 7: Big Data & Hadoop Environment

What to do with more data?

• Answering factoid questions
  • Pattern matching on the Web works amazingly well
  • "Who shot Abraham Lincoln?" → search for the pattern "X shot Abraham Lincoln"
  (Brill et al., TREC 2001; Lin, ACM TOIS 2007)
• Learning relations
  • Start with seed instances: Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879)
  • Search for patterns on the Web: "Wolfgang Amadeus Mozart (1756 – 1791)", "Einstein was born in 1879"
  • Use the patterns to find more instances: "PERSON (DATE –", "PERSON was born in DATE"
  (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; …)
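The pattern-based relation learning above can be sketched in a few lines. This is a toy illustration, not any of the cited systems: the corpus sentences and the two surface patterns ("PERSON (DATE –" and "PERSON was born in DATE") are made up to mirror the slide's examples.

```python
import re

# Toy corpus standing in for Web text (hypothetical sentences).
corpus = [
    "Wolfgang Amadeus Mozart (1756 - 1791) was a prolific composer.",
    "Einstein was born in 1879 in Ulm.",
    "Charles Darwin was born in 1809.",
]

# Seed-derived surface patterns: "PERSON (DATE -" and "PERSON was born in DATE".
patterns = [
    re.compile(r"([A-Z][\w .]+?) \((\d{4}) -"),
    re.compile(r"([A-Z][\w .]+?) was born in (\d{4})"),
]

def extract_birthdays(texts):
    """Apply each pattern to each sentence and collect (person, year) pairs."""
    found = set()
    for text in texts:
        for pat in patterns:
            for person, year in pat.findall(text):
                found.add((person.strip(), int(year)))
    return found

facts = extract_birthdays(corpus)
```

Real systems would bootstrap: the newly found instances generate further patterns, which in turn find more instances.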

Page 8: Big Data & Hadoop Environment

Page 9: Big Data & Hadoop Environment

No data like more data!

(Banko and Brill, ACL 2001)

(Brants et al., EMNLP 2007)

s/knowledge/data/g;

Page 10: Big Data & Hadoop Environment

Database Workloads

• OLTP (online transaction processing)
  • Typical applications: e-commerce, banking, airline reservations
  • User-facing: real-time, low-latency, highly concurrent
  • Tasks: relatively small set of "standard" transactional queries
  • Data access pattern: random reads, updates, writes (involving relatively small amounts of data)
• OLAP (online analytical processing)
  • Typical applications: business intelligence, data mining
  • Back-end processing: batch workloads, less concurrency
  • Tasks: complex analytical queries, often ad hoc
  • Data access pattern: table scans, large amounts of data per query

Page 11: Big Data & Hadoop Environment

One Database or Two?

• Downsides of co-existing OLTP and OLAP workloads
  • Poor memory management
  • Conflicting data access patterns
  • Variable latency
• Solution: separate databases
  • User-facing OLTP database for high-volume transactions
  • Data warehouse for OLAP workloads
• How do we connect the two?

Page 12: Big Data & Hadoop Environment

OLTP/OLAP Architecture

[Figure: the OLTP database feeds the OLAP data warehouse through an ETL (Extract, Transform, and Load) pipeline]

Page 13: Big Data & Hadoop Environment

Structure of Data Warehouses

SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country;

Source: Wikipedia (Star Schema)
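The star-schema query above runs unchanged on any SQL engine once the fact and dimension tables exist. As a self-contained check, here is the same query against an in-memory SQLite database; the sample rows (brand 'Acme', countries, unit counts) are invented for illustration.

```python
import sqlite3

# Build a miniature star schema in memory; columns mirror the slide's query,
# the rows are made-up sample data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Date (Id INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Dim_Store (Id INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand TEXT, Product_Category TEXT);
CREATE TABLE Fact_Sales (Date_Id INTEGER, Store_Id INTEGER, Product_Id INTEGER, Units_Sold INTEGER);

INSERT INTO Dim_Date VALUES (1, 1997), (2, 1998);
INSERT INTO Dim_Store VALUES (1, 'Germany'), (2, 'France');
INSERT INTO Dim_Product VALUES (1, 'Acme', 'tv'), (2, 'Acme', 'radio');
INSERT INTO Fact_Sales VALUES (1, 1, 1, 10), (1, 2, 1, 5), (2, 1, 1, 7), (1, 1, 2, 99);
""")

rows = con.execute("""
SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country
""").fetchall()
# Only 1997 'tv' sales survive the filter: the 1998 row and the radio row are excluded.
```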

Page 14: Big Data & Hadoop Environment

OLAP Cubes

[Figure: a data cube over product, store, and time dimensions]

Common operations:
• slice and dice
• roll up / drill down
• pivot
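The cube operations can be made concrete without an OLAP engine. The sketch below models a tiny fact table as tuples and implements roll-up (aggregate away dimensions) and slice (fix one dimension) in plain Python; all names and numbers are illustrative.

```python
from collections import defaultdict

# A tiny fact table over (product, store, month) -> units; values are made up.
facts = [
    ("tv",    "Berlin", "2024-01", 10),
    ("tv",    "Berlin", "2024-02", 12),
    ("tv",    "Paris",  "2024-01", 7),
    ("radio", "Berlin", "2024-01", 3),
]

def roll_up(rows, keep):
    """Roll up: aggregate away dimensions, keeping only the indices in `keep`."""
    totals = defaultdict(int)
    for *dims, units in rows:
        totals[tuple(dims[i] for i in keep)] += units
    return dict(totals)

def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dim] == value]

# Roll up over store and month: total units per product.
per_product = roll_up(facts, keep=[0])
# Slice on store = "Berlin", then roll up to per-month totals.
berlin_by_month = roll_up(slice_(facts, 1, "Berlin"), keep=[2])
```

Drill-down is the inverse of roll-up (keep more dimensions); dicing is slicing on several dimensions at once.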

Page 15: Big Data & Hadoop Environment

Fast forward…

Page 16: Big Data & Hadoop Environment

ETL Bottleneck

• ETL is typically a nightly task: what happens if processing 24 hours of data takes longer than 24 hours?
• Hadoop is perfect:
  • Ingest is limited by the speed of HDFS
  • Scales out with more nodes
  • Massively parallel
  • Ability to use any processing tool
  • Cheaper than parallel databases
  • ETL is a batch process anyway!

Page 17: Big Data & Hadoop Environment

What's changed?

• Dropping cost of disks
  • Cheaper to store everything than to figure out what to throw away
• Types of data collected
  • From data that's obviously valuable to data whose value is less apparent
• Rise of social media and user-generated content
  • Large increase in data volume
• Growing maturity of data mining techniques
  • Demonstrates the value of data analytics

Page 18: Big Data & Hadoop Environment · 2018-01-31 · Big Data & Hadoop Environment Bu eğitim sunumları İstanbul Kalkınma Ajansı’nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul

a useful service

analyze user

behavior to extract

insights

transform

insights into

action

$ (hopefully)

data science data products

Virtuous Product Cycle

Page 19: Big Data & Hadoop Environment

OLTP/OLAP/Hadoop Architecture

[Figure: the OLTP database feeds both Hadoop and the OLAP data warehouse through ETL (Extract, Transform, and Load)]

Page 20: Big Data & Hadoop Environment

ETL: Redux

• Often, with noisy datasets, ETL is the analysis!
• Note that ETL necessarily involves brute-force data scans
• L, then E and T? (i.e., ELT: load the raw data first, then extract and transform inside the cluster)

Page 21: Big Data & Hadoop Environment

Big Data Ecosystem

Source: datafloq.com

Page 22: Big Data & Hadoop Environment

Cloud Computing

• Before clouds…
  • Grids
  • Connection Machine
  • Vector supercomputers
• Cloud computing means many different things:
  • Big data
  • Rebranding of Web 2.0
  • Utility computing
  • Everything as a service

Page 23: Big Data & Hadoop Environment

Utility Computing

• What?
  • Computing resources as a metered service ("pay as you go")
  • Ability to dynamically provision virtual machines
• Why?
  • Cost: capital vs. operating expenses
  • Scalability: "infinite" capacity
  • Elasticity: scale up or down on demand
• Does it make sense?
  • Benefits to cloud users
  • Business case for cloud providers

"I think there is a world market for about five computers."

Page 24: Big Data & Hadoop Environment

Enabling Technology: Virtualization

• Traditional stack: Hardware → Operating System → Apps
• Virtualized stack: Hardware → Hypervisor → multiple guest OSes, each running its own apps

Page 25: Big Data & Hadoop Environment

Everything as a Service

• Utility computing = Infrastructure as a Service (IaaS)
  • Why buy machines when you can rent cycles?
  • Examples: Amazon's EC2, Rackspace
• Platform as a Service (PaaS)
  • Give me a nice API and take care of the maintenance, upgrades, …
  • Example: Google App Engine
• Software as a Service (SaaS)
  • Just run it for me!
  • Examples: Gmail, Salesforce

Page 26: Big Data & Hadoop Environment

Who cares?

• A source of problems…
  • Cloud-based services generate big data
  • Clouds make it easier to start companies that generate big data
• As well as a solution…
  • Ability to provision analytics clusters on demand in the cloud
  • Commoditization and democratization of big data capabilities

Page 27: Big Data & Hadoop Environment

Parallelization Challenges

• How do we assign work units to workers?
• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?

What's the common theme of all of these problems?

Page 28: Big Data & Hadoop Environment

Where the rubber meets the road

• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason about
  • At the scale of datacenters and across datacenters
  • In the presence of failures
  • In terms of multiple interacting services
• Not to mention debugging…
• The reality:
  • Lots of one-off solutions, custom code
  • Write your own dedicated library, then program with it
  • Burden on the programmer to explicitly manage everything

Page 29: Big Data & Hadoop Environment

The datacenter is the computer!

Source: Barroso and Hölzle (2009)

Page 30: Big Data & Hadoop Environment

The datacenter is the computer

• It's all about the right level of abstraction
  • Moving beyond the von Neumann architecture
  • What's the "instruction set" of the datacenter computer?
• Hide system-level details from the developers
  • No more race conditions, lock contention, etc.
  • No need to explicitly worry about reliability, fault tolerance, etc.
• Separating the what from the how
  • Developer specifies the computation that needs to be performed
  • Execution framework ("runtime") handles actual execution

Page 31: Big Data & Hadoop Environment

Scaling "up" vs. "out"

• No single machine is large enough
  • Smaller cluster of large SMP machines vs. larger cluster of commodity machines (e.g., 16 128-core machines vs. 128 16-core machines)
• Nodes need to talk to each other!
  • Intra-node latencies: ~100 ns
  • Inter-node latencies: ~100 μs
• Move processing to the data
  • Clusters have limited bandwidth
• Process data sequentially, avoid random access
  • Seeks are expensive; disk throughput is reasonable
• Seamless scalability

Source: analysis on this and subsequent slides from Barroso and Hölzle (2009)

Page 32: Big Data & Hadoop Environment

Common Big Data Analytics Use Cases

• Batch
  • Use case 1: ETL / batch query (single silo)
  • Use case 2: distributed log aggregation
• Batch + real time
  • Use case 3: real-time data store
  • Use case 4: real-time data store + batch analytics
• Real time / streaming
  • Use case 5: streaming

Page 33: Big Data & Hadoop Environment · 2018-01-31 · Big Data & Hadoop Environment Bu eğitim sunumları İstanbul Kalkınma Ajansı’nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul

USE CASE 1 : ETL AND BATCH ANALYTICS at SCALE

• Data collected in various databases

• Data is scattered across multiple silos !

• Need a single silo to bring all data together and analyze

Page 34: Big Data & Hadoop Environment

USE CASE 1

• Uses core Hadoop components
• No vendor lock-in (works on all Hadoop distributions)
• HDFS (Hadoop Distributed File System) for storage
• Data ingest with Sqoop
• Processing done by MapReduce & cousins
• Results are exported back to the DB

Page 35: Big Data & Hadoop Environment · 2018-01-31 · Big Data & Hadoop Environment Bu eğitim sunumları İstanbul Kalkınma Ajansı’nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul

USE CASE 2 : DATA COMING FROM MULTIPLE SOURCES

• Data coming in from multiple sources.

• Data is ‘streaming in’

• Capture data in Hadoop

• Do batch analytics

Page 36: Big Data & Hadoop Environment

USE CASE 2

• Flume
  • Brings in logs from multiple sources
  • Distributed, reliable way to collect and move data
  • If uplinks are disconnected, Flume agents will store and forward data
• HDFS
  • Flume can write data directly to HDFS
  • Files are segmented or 'rolled' by size / time, e.g.:
    • Data-2015-01-01_10-00-00.log
    • Data-2015-01-01_11-00-00.log
    • Data-2015-01-01_12-00-00.log
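A Flume agent of this shape is declared in a properties file. The sketch below is a minimal, illustrative configuration (agent/source/sink names, the tailed log path, and the HDFS path are all placeholders to adapt): an exec source tails a log, a channel buffers events, and the HDFS sink rolls a new file every hour, which produces hourly files like the ones listed above.

```properties
# Minimal Flume agent sketch (illustrative names and paths).
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Tail a log file (exec source is simple but not restart-safe).
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Buffer events in memory (use a file channel for durability).
a1.channels.c1.type = memory

# Write to HDFS, rolling a new file every hour.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /data/logs/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = Data
a1.sinks.k1.hdfs.rollInterval = 3600
```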

Page 37: Big Data & Hadoop Environment

USE CASE 2

• Analytics stack: Pig / Hive / Oozie / Spark (same as in use case 1)
• Oozie
  • Workflow manager
  • "Run this workflow every 1 hour"
  • "Run this workflow when data shows up in the input directory"
  • Can manage complex workflows
  • Sends alerts when processes fail
  • etc.
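The "run this workflow every 1 hour" case is expressed in Oozie as a coordinator. The XML below is a hedged sketch only (the app name, dates, and workflow path are invented; check the schema version against your Oozie release):

```xml
<!-- Illustrative Oozie coordinator: run a workflow every hour. -->
<coordinator-app name="hourly-etl" frequency="${coord:hours(1)}"
                 start="2018-01-01T00:00Z" end="2018-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>hdfs:///apps/etl-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The data-triggered variant ("when data shows up in the input directory") adds dataset and input-event declarations to the same coordinator.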

Page 38: Big Data & Hadoop Environment · 2018-01-31 · Big Data & Hadoop Environment Bu eğitim sunumları İstanbul Kalkınma Ajansı’nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul

USE CASE 3 : REAL TIME DATA STORE • Events are coming in

• Need to store the events

• Can be billions of events

• And query them in real time

e.g. last 10 events by user

Page 39: Big Data & Hadoop Environment

USE CASE 3

• HDFS is not ideal for updating data in real time or for random access
• A scalable real-time store is needed, e.g. HBase or Cassandra, to support real-time updates
• Data comes trickling in (as a stream)
• Saved data becomes queryable immediately
• Use the HBase APIs (Java / REST) to build dashboards
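The key idea behind serving "last 10 events by user" from a store like HBase is row-key design: prefix each key by user and invert the timestamp so that a prefix scan returns the newest events first. The pure-Python sketch below only simulates the lexicographic ordering of such a scan; `row_key`, the `MAX_TS` bound, and the sample events are all illustrative, not HBase API.

```python
MAX_TS = 10**13  # an upper bound on epoch-millis timestamps, used to invert ordering

def row_key(user_id, ts_millis):
    """user prefix + zero-padded (MAX_TS - ts): newer events sort first per user."""
    return f"{user_id}#{MAX_TS - ts_millis:013d}"

# Simulate a store scan: keys come back in lexicographic order.
events = [("alice", 1000), ("alice", 3000), ("alice", 2000), ("bob", 9000)]
keys = sorted(row_key(u, t) for u, t in events)

# A prefix scan for alice returns her events newest-first; taking the
# first 10 of this scan answers "last 10 events by user" in one read.
alice = [k for k in keys if k.startswith("alice#")]
```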

Page 40: Big Data & Hadoop Environment

USE CASE 4: REAL TIME + BATCH

• Building on use case 3
• Do extensive analysis on the data in HBase
  • E.g. 'scoring user models', 'flagging credit card transactions'

Page 41: Big Data & Hadoop Environment

USE CASE 4

• HBase is the real-time store
• Analytics is done via the MapReduce stack (Pig / Hive)
• Can we do them in a single stack?
  • May not be a good idea
  • Don't mix real-time and batch analytics
  • Batch analytics will impede real-time performance

Page 42: Big Data & Hadoop Environment

USE CASE 4

• How to replicate data?
  • Option 1: periodic synchronization of data between clusters
  • Option 2: data goes to both clusters at the same time

Page 43: Big Data & Hadoop Environment

USE CASE 5: STREAMING

• Decision times with batch: hours / days
• Batch use cases:
  • Modeling
  • ETL
  • Reporting

Page 44: Big Data & Hadoop Environment · 2018-01-31 · Big Data & Hadoop Environment Bu eğitim sunumları İstanbul Kalkınma Ajansı’nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul

MOVING TOWARDS FAST DATA

• Decision time : (near) real time

• seconds (or milli seconds)

• Use Cases

• Alerts (medical / security)

• Fraud detection

• Streaming is becoming

more prevalent

• ‘Connected Devices’

• ‘Internet of Things’

• ‘Beyond Batch’

• We need faster processing

/ analytics

Page 45: Big Data & Hadoop Environment

STREAMING ARCHITECTURE – DATA BUCKET

• The 'data bucket'
  • Captures incoming data
  • Acts as a 'buffer' – smooths out bursts
  • So even if our processing is offline, we won't lose data
• Data bucket choices
  • Kafka
  • MQ (RabbitMQ, etc.)
  • Amazon Kinesis

Page 46: Big Data & Hadoop Environment

KAFKA ARCHITECTURE

• Producers write data to brokers
• Consumers read data from brokers
• All of this is distributed / parallel
• Failure tolerant
• Data is stored as topics
  • "sensor_data"
  • "alerts"
  • "emails"
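The producer/consumer model above can be illustrated without a running cluster. The toy `Broker` class below is not Kafka and ignores partitions and replication; it only demonstrates the core contract: topics are append-only logs, and each consumer tracks its own read offset, so consuming does not remove data and independent consumers each see the full stream.

```python
from collections import defaultdict

class Broker:
    """Toy single-broker model of a Kafka-style log: topics are append-only
    lists, and each consumer tracks its own offset per topic."""
    def __init__(self):
        self.topics = defaultdict(list)      # topic -> append-only log
        self.offsets = defaultdict(int)      # (consumer, topic) -> next offset

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, consumer, topic, max_messages=10):
        off = self.offsets[(consumer, topic)]
        batch = self.topics[topic][off:off + max_messages]
        self.offsets[(consumer, topic)] = off + len(batch)
        return batch

broker = Broker()
broker.produce("sensor_data", {"id": 1, "temp": 21.5})
broker.produce("sensor_data", {"id": 2, "temp": 22.0})

first = broker.consume("dashboard", "sensor_data")   # both messages
again = broker.consume("dashboard", "sensor_data")   # nothing new yet
```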

Page 47: Big Data & Hadoop Environment

STREAMING ARCHITECTURE – PROCESSING ENGINE

• Need to process events with low latency
• So many to choose from!
• Choices
  • Storm
  • Spark
  • NiFi
  • Flink

Page 48: Big Data & Hadoop Environment

STREAMING ARCHITECTURE – DATA STORE

• Where processed data ends up
• Needs to absorb data in real time
• Usually NoSQL storage
  • HBase
  • Cassandra
  • Lots of NoSQL stores

Page 49: Big Data & Hadoop Environment

LAMBDA ARCHITECTURE

• Each component is scalable
• Each component is fault tolerant
• Incorporates best practices
• All open source!

Page 50: Big Data & Hadoop Environment

LAMBDA ARCHITECTURE

1. All new data is sent to both the batch layer and the speed layer
2. Batch layer
   • Holds the master data set (immutable, append-only)
   • Answers batch queries
3. Serving layer
   • Updates batch views so they can be queried ad hoc
4. Speed layer
   • Handles new data
   • Facilitates fast / real-time queries
5. Query layer
   • Answers queries using batch & real-time views
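The query layer's job reduces to one idea: merge a precomputed batch view with a speed-layer view covering only the events since the last batch run. A minimal sketch of that merge (the user names, per-user event counts, and the `query` helper are invented for illustration):

```python
# Toy lambda architecture views: a batch view from the last nightly job,
# plus a speed-layer view of events that arrived after that job.
batch_view = {"alice": 100, "bob": 40}   # e.g. per-user event counts, precomputed
speed_view = {"alice": 3, "carol": 1}    # counts from events since the batch run

def query(user):
    """Query layer: combine the batch and real-time views at query time."""
    return batch_view.get(user, 0) + speed_view.get(user, 0)
```

When the next batch run completes, its output replaces `batch_view` and the corresponding entries are dropped from `speed_view`, keeping the merge cheap.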