Apache Samza Past, Present and Future


Kartik Paramasivam, Director of Engineering, Streams Infra @ LinkedIn

Agenda
1. Stream Processing
2. State of the Union
3. Apache Samza: Key Differentiators
4. Apache Samza Futures

Stream Processing: Processing events as soon as they happen.

● Stateless Processing
  ■ Transformation etc.
  ■ Lookup adjunct data (lookup databases / call services)
  ■ Producing results for every event

● Stateful Processing
  ■ Triggering/Producing results periodically (time-windows)

● Maintain intermediate state
  ■ E.g. Joining across multiple streams of events.

● Common Issues
  ■ Scale !! Scale !! Scale !!
  ■ Reliability !!
  ■ Everything else (upgrades, debugging, diagnostics, security, ……)

Stream Processing: State of the Union

[Timeline diagram: Millwheel, Storm, Spark Streaming, Dempsey, Dataflow, Azure Stream Analytics, AWS Kinesis Analytics, GearPump, Kafka Streams, Orleans]

Not meant to be an accurate timeline.

Yes It is CROWDED !!

Apache Samza

● Top level Apache project since Dec 2014
● 5 big releases (0.7, 0.8, 0.9, 0.10, 0.11)
● 62 Contributors
● 14 Committers
● Companies using: LinkedIn, Uber, MetaMarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely …. https://cwiki.apache.org/confluence/display/SAMZA/Powered+By
● Applications at LinkedIn: from ~20 to ~200 in 2 years.

Key Differentiators for Apache Samza

● Performance !!

● Stability

● Support for a variety of input sources

● Stream processing as a service AND as an embedded library

Performance : Accessing Adjunct Data

Performance : Maintaining Temporary State

Performance : Let us talk numbers !

● 100x difference between using local state vs. a remote NoSQL store
● Local state details:
  ○ 1.1 million TPS on a single processing machine (SSD)
  ○ Used a 3-node Kafka cluster for storing the durable changelog
● Remote state details:
  ○ 8,500 TPS when the Samza job was changed to access a remote NoSQL store
  ○ The NoSQL store was also on a 3-node (SSD) cluster
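The local-state-plus-durable-changelog idea behind these numbers can be sketched in plain Java. This is only an illustration under stated assumptions: ChangelogBackedStore and ChangelogEntry are hypothetical names, the HashMap stands in for an on-disk local store, and the List stands in for a durable log such as a Kafka topic. It is not Samza's store API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record of one write; a null value marks a delete.
final class ChangelogEntry {
    final String key;
    final Long value;
    ChangelogEntry(String key, Long value) { this.key = key; this.value = value; }
}

// Hypothetical store: reads are local, every write is also appended to a log.
final class ChangelogBackedStore {
    private final Map<String, Long> local = new HashMap<>(); // stand-in for a local on-disk store
    private final List<ChangelogEntry> changelog;            // stand-in for a durable log

    ChangelogBackedStore(List<ChangelogEntry> changelog) {
        this.changelog = changelog;
        // Restore: replay the log in order so the latest write per key wins.
        for (ChangelogEntry e : changelog) {
            if (e.value == null) local.remove(e.key); else local.put(e.key, e.value);
        }
    }

    Long get(String key) { return local.get(key); } // no network hop on the read path

    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new ChangelogEntry(key, value));
    }

    void delete(String key) {
        local.remove(key);
        changelog.add(new ChangelogEntry(key, null));
    }
}

final class ChangelogStoreDemo {
    public static void main(String[] args) {
        List<ChangelogEntry> log = new ArrayList<>();
        ChangelogBackedStore store = new ChangelogBackedStore(log);
        store.put("member-1", 3L);
        store.put("member-1", 4L);
        store.put("member-2", 7L);
        store.delete("member-2");

        // Simulate losing the machine: rebuild a fresh store from the log alone.
        ChangelogBackedStore restored = new ChangelogBackedStore(log);
        System.out.println(restored.get("member-1")); // prints 4
        System.out.println(restored.get("member-2")); // prints null
    }
}
```

Reads never leave the process, which is where the throughput gap against a remote store would come from; the log is only read back when state has to be rebuilt.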

Remote State : Asynchronous Event Processing

[Diagram: Event Loop (single thread), ProcessAsync, Remote DB / Services]

Asynchronous I/O calls, using Java NIO, Netty...

Responses sent to main thread via callback

Event loop is woken up to process next message

task.max.concurrency > 1 to enable pipelining

Available with Samza 0.11
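A minimal plain-Java sketch of the pattern on this slide: one loop thread dispatches non-blocking calls, callbacks free a slot, and a cap on in-flight calls plays the role that task.max.concurrency plays in Samza. AsyncEventLoop and its remoteCall parameter are hypothetical names for illustration, not Samza classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Illustrative only: a hypothetical single-threaded dispatcher with bounded pipelining.
final class AsyncEventLoop {
    private final int maxConcurrency;
    private final Semaphore slots; // one permit per allowed in-flight call
    private final Function<String, CompletableFuture<String>> remoteCall;

    AsyncEventLoop(int maxConcurrency, Function<String, CompletableFuture<String>> remoteCall) {
        this.maxConcurrency = maxConcurrency;
        this.slots = new Semaphore(maxConcurrency);
        this.remoteCall = remoteCall;
    }

    // Called from one thread. Blocks only when maxConcurrency calls are outstanding.
    List<String> run(List<String> messages) {
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        for (String message : messages) {
            slots.acquireUninterruptibly(); // wait for a free slot
            remoteCall.apply(message).whenComplete((response, error) -> {
                results.add(error == null ? response : "failed:" + message);
                slots.release(); // the callback wakes the loop if it was waiting
            });
        }
        // Drain: once all permits can be taken, every callback has fired.
        slots.acquireUninterruptibly(maxConcurrency);
        slots.release(maxConcurrency);
        return new ArrayList<>(results); // completion order, not input order
    }
}

final class AsyncEventLoopDemo {
    public static void main(String[] args) {
        AsyncEventLoop loop = new AsyncEventLoop(4,
                key -> CompletableFuture.supplyAsync(() -> "profile-for-" + key));
        System.out.println(loop.run(Arrays.asList("a", "b", "c")).size()); // prints 3
    }
}
```

With a cap of 1 the loop handles one message at a time; raising it lets several remote calls overlap while dispatch still happens on a single thread.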

Remote State: Synchronous Processing on Multiple Threads

[Diagram: Event Loop (single thread), Schedule Process(), Built-in thread pool, Remote DB / Services]

Blocking I/O calls

Event loop is woken up by the worker thread

job.container.thread.pool.size = N

Available with Samza 0.11
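For contrast, a simplified plain-Java sketch of the thread-pool variant: blocking calls are handed to a fixed pool whose size corresponds to N in job.container.thread.pool.size = N. ThreadPoolDispatcher and blockingProcess are hypothetical names, and details such as per-task ordering are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative only: a hypothetical dispatcher that runs blocking work on a fixed pool.
final class ThreadPoolDispatcher {
    static List<String> processAll(List<String> messages, int poolSize,
                                   Function<String, String> blockingProcess) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String message : messages) {
                // Each blocking call occupies one worker thread until it returns.
                Callable<String> call = () -> blockingProcess.apply(message);
                pending.add(pool.submit(call));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // results come back in input order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}

final class ThreadPoolDispatcherDemo {
    public static void main(String[] args) {
        List<String> out = ThreadPoolDispatcher.processAll(
                Arrays.asList("a", "b", "c"), 2, String::toUpperCase);
        System.out.println(out); // prints [A, B, C]
    }
}
```

Here parallelism is bounded by the number of threads rather than by the number of outstanding callbacks, which suits client libraries that only offer blocking calls.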

Incremental Checkpointing : MVP for stateful apps

Input stream (e.g. Kafka)

Key Differentiators for Apache Samza

● Performance !!

● Stability

Speed Thrills .. but can kill

● Local State Considerations:
  ○ State should NOT be reseeded under normal operations (e.g. upgrades, application restarts)
  ○ Minimal state should be reseeded
    - If a container dies or is removed
    - If a container is added

How does Samza keep local state 'stable'?

[Diagram: Samza Job, Input Stream, Change-log]

Enable Continuous Scheduling

Backpressure in a Pipeline

● Kafka or durable intermediate queues are leveraged to avoid backpressure issues in a pipeline.
● Allows each stage to be independent of the next stage

Key Differentiators for Apache Samza

● Performance !!

● Stability

Pluggable system consumers

… Azure EventHub, Azure Document DB, Google Pub-Sub etc.

Batch processing in Samza!! (NEW)

● HDFS system consumer for Samza
● Same Samza processor can be used for processing events from Kafka and HDFS with no code changes
● Scenarios:
  ○ Experimentation and testing
  ○ Re-processing of large datasets
  ○ Some datasets are readily available on HDFS (company specific)

Samza - HDFS support

[Diagram: HDFS input, HDFS output]

Available since Samza 0.10

The batch job auto-terminates when the input is fully processed.
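The "same processor, different source" idea can be sketched in plain Java as follows. All names here (EventProcessor, CountByField, RunLoop) are hypothetical and the Iterator stands in for a system consumer; a bounded source such as a set of files eventually runs out and ends the loop, while an unbounded stream would keep it running.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical processing interface; the processor does not know where events come from.
interface EventProcessor {
    void process(String event);
}

// Counts records grouped by the first comma-separated field.
final class CountByField implements EventProcessor {
    final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void process(String event) {
        String field = event.split(",")[0];
        counts.merge(field, 1, Integer::sum);
    }
}

final class RunLoop {
    // Returns once the source is exhausted, which only happens for bounded input.
    static long run(Iterator<String> source, EventProcessor processor) {
        long processed = 0;
        while (source.hasNext()) {
            processor.process(source.next());
            processed++;
        }
        return processed;
    }
}

final class RunLoopDemo {
    public static void main(String[] args) {
        CountByField counter = new CountByField();
        // A small in-memory list stands in for records read from files.
        long n = RunLoop.run(Arrays.asList("us,1", "in,2", "us,3").iterator(), counter);
        System.out.println(n);                        // prints 3
        System.out.println(counter.counts.get("us")); // prints 2
    }
}
```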

[Diagram: Brooklin, Databus, Backup, Database Backup (HDFS); set offset=0]

Samza batch pipelines

[Diagram: HDFS input, HDFS output]

Samza - HDFS Early Performance Results!!

Benchmark: Count number of records grouped by <Field>

Data size: 250 GB
Number of files: 487

[Chart: axes are number of containers and time (seconds); series shown include Map/Reduce and Spark]

Key Differentiators for Apache Samza

● Performance !!

● Stability

● Support for a variety of input sources (batch and streaming)

● Stream processing as a service AND (coming soon) as an embedded library

Stream Processing as a Service

● Based on YARN
  ○ YARN RM high availability
  ○ Work-preserving RM
  ○ Support for heterogeneous hardware with Node Labels (NEW)
● Easy upgrade of Samza framework: use the Samza version deployed on the machine instead of packaging it with the application.
● Disk quotas for local state (e.g. RocksDB state)
● Samza Management Service (SAMZA-REST) -> Next Slide

Samza Management Service (Samza REST) (NEW)

[Diagram: YARN Resource Managers (RM, SRR) exposing /v1/jobs; nodes in the YARN cluster (NM, SRN) running Samza containers. Legend: SamzaREST, YARN processes (RM/NM)]

1. Exposes /jobs resource to start, stop, get status of jobs etc.

2. Cleans up stores from dead jobs

Agenda
1. Stream processing
2. State of the union
3. Apache Samza: Key differentiators
4. Apache Samza Futures

Coming Soon : Samza as a Library

[Diagram: Stream Processor, Code, Job Coordinator, Leader]

● No YARN dependency
● Will use ZK for leader election
● Embed stream processing into your bigger application

StreamProcessor processor = new StreamProcessor(config, "job-name", "job-id");
processor.start();
processor.awaitStart();
…
processor.stop();

Coming Soon: High Level API and Event Time (SAMZA-914/915)

Count the number of PageViews by Region, every 30 minutes.

@Override
public void init(Collection<SystemMessageStream> sources) {
  sources.forEach(source -> {
    Function<PageView, String> keyExtractor = view -> view.getRegion();
    source.map(msg -> new PageViewMessage(msg))
      .window(Windows.<PageViewMessage, String>intoSessionCounter(keyExtractor,
        WindowType.Tumbling, 30*60 ))
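Since the high-level API above was still in progress, the underlying computation is sketched here in plain Java rather than against that API: a tumbling 30-minute count per region, bucketed by each event's own timestamp. PageViewEvent and TumblingRegionCounter are hypothetical names used only for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical event type: a region plus the event time in seconds.
final class PageViewEvent {
    final String region;
    final long eventTimeSeconds;

    PageViewEvent(String region, long eventTimeSeconds) {
        this.region = region;
        this.eventTimeSeconds = eventTimeSeconds;
    }
}

// Hypothetical counter: fixed-size, non-overlapping (tumbling) windows keyed by region.
final class TumblingRegionCounter {
    private final long windowSeconds;
    // window start (seconds) -> region -> count
    private final Map<Long, Map<String, Long>> countsByWindow = new TreeMap<>();

    TumblingRegionCounter(long windowSeconds) {
        this.windowSeconds = windowSeconds;
    }

    void add(PageViewEvent view) {
        // Each event falls into exactly one window, chosen by its own timestamp.
        long windowStart = Math.floorDiv(view.eventTimeSeconds, windowSeconds) * windowSeconds;
        countsByWindow.computeIfAbsent(windowStart, w -> new TreeMap<>())
                      .merge(view.region, 1L, Long::sum);
    }

    long count(long windowStart, String region) {
        Map<String, Long> regions = countsByWindow.get(windowStart);
        return regions == null ? 0L : regions.getOrDefault(region, 0L);
    }
}

final class TumblingRegionCounterDemo {
    public static void main(String[] args) {
        TumblingRegionCounter counter = new TumblingRegionCounter(30 * 60);
        counter.add(new PageViewEvent("us", 0));
        counter.add(new PageViewEvent("us", 1799)); // still in the first window
        counter.add(new PageViewEvent("us", 1800)); // first event of the second window
        counter.add(new PageViewEvent("eu", 10));
        System.out.println(counter.count(0, "us"));    // prints 2
        System.out.println(counter.count(1800, "us")); // prints 1
    }
}
```

Bucketing by the timestamp carried in the event, rather than by arrival time, is what makes this an event-time window; late or out-of-order events still land in the window they belong to.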

Coming Soon: First class support for Pipelines (SAMZA-1041)

public class MyPipeline implements PipelineFactory {

  public Pipeline create(Config config) {
    Processor myShuffler = getShuffle(config);
    Processor myJoiner = getJoin(config);

    Stream inStream1 = getStream(config, "inStream1");
    // … omitted for brevity

    PipelineBuilder builder = new PipelineBuilder();
    return builder.addInputStreams(myShuffler, inStream1)
        .addOutputStreams(myShuffler, intermediateOutStream)
        .addInputStreams(myJoiner, intermediateOutStream, inStream2)
        .addOutputStreams(myJoiner, finalOutStream)
        .build();
  }
}

Shuffle

input output

Future: Miscellaneous

● Exactly once processing
● Making it easier to auto-scale even with Local State (on-demand Standby containers)
● Turnkey Disaster Recovery for stateful applications
  ○ Easy Restore of changelog and checkpoints from some other datacenter.
● Improved support for Batch jobs
● SQL over Streams
● A default Dashboard :)

Questions ?
