JRuby with Java Code in Data Processing World

JRuby with Java Code in Data Processing World JRubyConf.EU at 31 Jul 2015 Satoshi Tagomori (@tagomoris)


Page 1: JRuby with Java Code in Data Processing World

JRuby with Java Code in Data Processing World

JRubyConf.EU at 31 Jul 2015, Satoshi Tagomori (@tagomoris)

Page 2: JRuby with Java Code in Data Processing World

Satoshi "Moris" Tagomori (@tagomoris)

Fluentd, Norikra, MessagePack-Ruby,... Docker logging driver for Fluentd (docker v1.8)

Treasure Data, Inc.

Page 3: JRuby with Java Code in Data Processing World

https://jobs.lever.co/treasure-data

We're hiring!

OSS team (developer / community manager)

Distributed system engineer (Hadoop, queue/workers)

Front-end engineer (RoR)

Page 4: JRuby with Java Code in Data Processing World

Data Processing World

Page 5: JRuby with Java Code in Data Processing World

Data Processing World

Page 6: JRuby with Java Code in Data Processing World

Java

Data Processing World

Page 7: JRuby with Java Code in Data Processing World

Data Processing World

Hadoop, Spark, Tez, Flink, Storm, Kafka, ...

Hive, Pig, Drill, Impala, Presto, ....

Page 8: JRuby with Java Code in Data Processing World

Java + Scala, Clojure + C++, ....

Data Processing World

on JVM

Page 9: JRuby with Java Code in Data Processing World

Data Processing World

Many CPU cores, Large memory, High rate Disk I/O, ...

High throughput data processing

Hadoop YARN/MapReduce/HDFS API compatibility

Page 10: JRuby with Java Code in Data Processing World

Two OSS projects using Java & JRuby

Page 11: JRuby with Java Code in Data Processing World

Norikra: Stream Processing with SQL for everybody

Server software, written in JRuby, runs on JVM

Open source software (GPLv2)

http://norikra.github.io/

https://github.com/norikra/norikra

Distributed on rubygems.org

"gem i norikra"

Page 12: JRuby with Java Code in Data Processing World

What Norikra does:

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

Page 13: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

{"path":"/", "status":200, "bytes":301, "duration":0.03, "referer":"...", "user-agent":"...."}

path:"/", s:301

1

Page 14: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

{"path":"/download/a", "status":200, "bytes":10240, "duration":0.53, "referer":"...", "user-agent":"...."}

path:"/", s:301 path:"/download/a", s:10240

2

Page 15: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

{"path":"/", "status":404, "bytes":0, "duration":0.08, "referer":"...", "user-agent":"...."}

path:"/", s:301 path:"/download/a", s:10240

3

Page 16: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

{"path":"/", "status":200, "bytes":301, "duration":0.01, "referer":"...", "user-agent":"...."}

path:"/", s:602 path:"/download/a", s:10240

4

Page 17: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

{"path":"/download/b", "status":200, "bytes":678, "duration":0.11, "referer":"...", "user-agent":"...."}

path:"/", s:602 path:"/download/a", s:10240 path:"/download/b", s:678

5

Page 18: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

{"path":"/download/b", "status":200, "bytes":678, "duration":0.13, "referer":"...", "user-agent":"...."}

path:"/", s:602 path:"/download/a", s:10240 path:"/download/b", s:1356

6

Page 19: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

{"path":"/", "status":200, "bytes":301, "duration":0.02, "referer":"...", "user-agent":"...."}

path:"/", s:903 path:"/download/a", s:10240 path:"/download/b", s:1356

7

Page 20: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

{"path":"/", "status":200, "bytes":301, "duration":0.09, "referer":"...", "user-agent":"...."}

path:"/", s:1204 path:"/download/a", s:10240 path:"/download/b", s:1356

8

Page 21: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

{"path":"/download/a", "status":200, "bytes":10240, "duration":1.1, "referer":"...", "user-agent":"...."}

path:"/", s:1204 path:"/download/a", s:20480 path:"/download/b", s:1356

9

Page 22: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

{"path":"/", "status":200, "bytes":301, "duration":0.05, "referer":"...", "user-agent":"...."}

path:"/", s:1505 path:"/download/a", s:20480 path:"/download/b", s:1356

10

Page 23: JRuby with Java Code in Data Processing World

SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)

WHERE status=200 GROUP BY path ORDER BY s DESC

10

{"path":"/download/a", "s":20480}

{"path":"/", "s":1505}

{"path":"/download/b", "s":1356}
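The walk-through above can be sketched in plain Ruby. This is only a simulation of what the query computes, not Norikra or Esper code: the names `BATCH_SIZE`, `aggregate_batch`, and `process_stream` are made up for illustration, and the events carry only the three fields the query touches. As on the slides, the status 404 event still counts toward the batch of 10; it is only dropped from the sums.

```ruby
require "json"

# Simulates:
#   SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
#   WHERE status=200 GROUP BY path ORDER BY s DESC
BATCH_SIZE = 10 # hypothetical constant standing in for length_batch(10)

# Aggregate one full batch of events into result rows.
def aggregate_batch(events)
  events
    .select { |e| e["status"] == 200 }                                    # WHERE status=200
    .group_by { |e| e["path"] }                                           # GROUP BY path
    .map { |path, es| { "path" => path, "s" => es.sum { |e| e["bytes"] } } } # SUM(bytes) AS s
    .sort_by { |row| -row["s"] }                                          # ORDER BY s DESC
end

# Emit one result set per complete batch; an incomplete batch emits nothing.
def process_stream(events)
  results = []
  events.each_slice(BATCH_SIZE) do |batch|
    next if batch.size < BATCH_SIZE
    results << aggregate_batch(batch)
  end
  results
end

# The ten events from the slides, reduced to path, status, bytes.
EVENTS = [
  ["/", 200, 301],
  ["/download/a", 200, 10240],
  ["/", 404, 0],
  ["/", 200, 301],
  ["/download/b", 200, 678],
  ["/download/b", 200, 678],
  ["/", 200, 301],
  ["/", 200, 301],
  ["/download/a", 200, 10240],
  ["/", 200, 301],
].map { |path, status, bytes| { "path" => path, "status" => status, "bytes" => bytes } }

OUTPUT = process_stream(EVENTS)
OUTPUT.first.each { |row| puts row.to_json }
```

Feeding only the first nine events returns no result set at all, which matches the slides: nothing is emitted until the tenth event closes the batch.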

Page 24: JRuby with Java Code in Data Processing World

Norikra and Java

Norikra is written in JRuby and uses Esper

Key factor: productivity (33 days until first release)

Esper: Java library that provides Complex Event Processing

SQL parser, executor

Many features and good performance

Licensed under GPLv2

Page 25: JRuby with Java Code in Data Processing World

Plugins as rubygems

Norikra Server (on JVM)

Esper (Query Engine)

Type Definition Manager

Output Event Pool

Norikra Engine

RPC Server: mizuno (Jetty + Rack)

Rack RPC Handler

Listener

UDF UDF

UDF: User-Defined Functions, installed with "gem i norikra-udf-xxx"

written in Java, or in JRuby (compiled to Java)

works in the Esper instance, so it must be a Java class

Listener: handler for output data of queries, written in JRuby, installed with "gem i norikra-listener-xxx"

Page 26: JRuby with Java Code in Data Processing World

Embulk

"Embulk is a open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services."

http://www.embulk.org/docs/

Page 27: JRuby with Java Code in Data Processing World

Embulk: makes painful data integration work relaxed

Plugin-based parallel bulk data loader

Open source software (Apache License v2.0)

http://www.embulk.org/

https://github.com/embulk/embulk

Distributed as .jar or on rubygems.org

Plugins are on rubygems.org

http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed

http://www.slideshare.net/HiroshiNakamura/embulk-20150411

Page 28: JRuby with Java Code in Data Processing World

HDFS

MySQL

Amazon S3

Embulk

CSV Files

SequenceFile

Salesforce.com

Elasticsearch

Cassandra

Hive

Redis

✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Idempotent retrying

Plugins Plugins

bulk load

Page 29: JRuby with Java Code in Data Processing World

#ccc_cd4 / #embulk

InputPlugin OutputPlugin

Executor plugin

Filter plugins

records

Threads, MapReduce

records

convert, …

input, … output.


records

config

Page 30: JRuby with Java Code in Data Processing World


InputPlugin

FileInput plugin

OutputPlugin

FileOutput plugin

Encoder plugin

Formatter plugin

Decoder plugin

Parser plugin

HDFS, S3, Riak CS, …

gzip, bzip2, aes, …

CSV, JSON, pcap, …

buffer

Filter plugins

records

Executor plugin


records

config

Page 31: JRuby with Java Code in Data Processing World

Embulk and Java

Embulk core is written in Java

mainly for performance

Embulk plugins:

are loaded through a JRuby-based API

are written in JRuby or Java

JRuby for early release

Java for performance

Page 32: JRuby with Java Code in Data Processing World

InputPlugin

module Embulk
  class InputExample < InputPlugin
    Plugin.register_input('example', self)

    def self.transaction(config, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: nil)
      }
      threads = config.param('threads', :int, default: 2)

      columns = [
        Column.new(0, 'col0', :long),
        Column.new(1, 'col1', :double),
        Column.new(2, 'col2', :string),
      ]

      # BEGIN here

      commit_reports = yield(task, columns, threads)

      # COMMIT here
      puts "Example input finished"

      return {}
    end

    def run(task, schema, index, page_builder)
      # use the method arguments; the slide mixed them with @index / @page_builder
      puts "Example input thread #{index}…"

      10.times do |i|
        page_builder.add([i, 10.0, "example"])
      end
      page_builder.finish

      commit_report = { }
      return commit_report
    end
  end
end

Page 33: JRuby with Java Code in Data Processing World

OutputPlugin

module Embulk
  class OutputExample < OutputPlugin
    Plugin.register_output('example', self)

    def self.transaction(config, schema, processor_count, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: "record")
      }

      puts "Example output started."
      commit_reports = yield(task)
      puts "Example output finished. Commit reports = #{commit_reports.to_json}"

      return {}
    end

    def initialize(task, schema, index)
      puts "Example output thread #{index}..."
      super
      @message = task.prop('message', :string)
      @records = 0
    end

    def add(page)
      page.each do |record|
        hash = Hash[schema.names.zip(record)]
        puts "#{@message}: #{hash.to_json}"
        @records += 1
      end
    end

    def finish
    end

    def abort
    end

    def commit
      commit_report = { "records" => @records }
      return commit_report
    end
  end
end
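The control flow around `yield` in `transaction` may be easier to see with a tiny stand-alone harness. The sketch below is not Embulk's implementation: `TinyOutput` and `run_with` are hypothetical names, and the "framework" side is reduced to one thread per task. It only shows the shape of the protocol: code before the `yield` runs once (BEGIN), the block runs every task and returns their commit reports, and code after the `yield` runs once (COMMIT).

```ruby
# Hypothetical plugin class that follows the same transaction/add/commit shape.
class TinyOutput
  def self.transaction(config)
    task = { "message" => config.fetch("message", "record") }
    # BEGIN: runs once, before any task starts
    commit_reports = yield(task)
    # COMMIT: runs once, after every task has returned its report
    { "total_records" => commit_reports.sum { |r| r["records"] } }
  end

  def initialize(task, index)
    @message = task["message"]
    @index = index
    @records = 0
  end

  # A page is an array of records; this sketch only counts them.
  def add(page)
    page.each { |_record| @records += 1 }
  end

  def commit
    { "records" => @records }
  end
end

# Stand-in for the framework: one plugin instance per task, each in its own thread.
def run_with(plugin_class, config, pages_per_task)
  plugin_class.transaction(config) do |task|
    threads = pages_per_task.each_with_index.map do |pages, index|
      Thread.new do
        plugin = plugin_class.new(task, index)
        pages.each { |page| plugin.add(page) }
        plugin.commit # the thread's value is this task's commit report
      end
    end
    threads.map(&:value) # wait for all tasks and collect their reports
  end
end

PAGES_PER_TASK = [
  [[[1, "a"], [2, "b"]]], # task 0: one page holding two records
  [[[3, "c"]]],           # task 1: one page holding one record
]

RESULT = run_with(TinyOutput, { "message" => "row" }, PAGES_PER_TASK)
puts RESULT["total_records"]
```

Because the plugin never starts threads itself, the same plugin code could in principle be driven by a different executor, which is the role the slides give to the Executor plugin (Threads, MapReduce).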

Page 34: JRuby with Java Code in Data Processing World

Plugin management: Norikra

Esper instance

Engine

Plugin management

UDF Listener

plugins as gems

plugin loader written in JRuby

Java JRuby

Page 35: JRuby with Java Code in Data Processing World

Plugin management: Embulk

Embulk core

Plugin management

input/output/filter parser/formatter

Java JRuby

decoder/encoder file-input/output executor

plugins as gems

plugin loader written in JRuby

Page 36: JRuby with Java Code in Data Processing World

Pluggable software on JVM & Java API

Java? Scala? Clojure? JRuby?: JRuby

Plugin packaging: jar? gem?: gem

rubygems.org >>> maven central (or others)

especially for plugin authors

Plugin loader: Class Loader? "require"?: require
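A minimal sketch of the "require" style of plugin loading, in plain Ruby. Everything here is hypothetical and is not the loader of Norikra or Embulk: the `TinyApp` module, the `tinyapp/input/<name>.rb` naming convention, and the temporary directory that stands in for an installed plugin gem. The idea is that the host looks a plugin up by name, requires the file that the convention points to, and the plugin registers itself when its file is evaluated.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical host application with a name -> class registry.
module TinyApp
  module Plugin
    REGISTRY = {}

    # Called by plugin files when they are loaded.
    def self.register_input(name, klass)
      REGISTRY[name] = klass
    end

    # Load by naming convention: tinyapp/input/<name>.rb somewhere on $LOAD_PATH.
    def self.load_input(name)
      require "tinyapp/input/#{name}" unless REGISTRY.key?(name)
      REGISTRY.fetch(name)
    end
  end
end

# Simulate an installed plugin gem: write one file into a temp dir and put
# that dir on the load path, as RubyGems would do for a gem's lib directory.
PLUGIN_DIR = Dir.mktmpdir
FileUtils.mkdir_p(File.join(PLUGIN_DIR, "tinyapp", "input"))
File.write(File.join(PLUGIN_DIR, "tinyapp", "input", "example.rb"), <<~RUBY)
  module TinyApp
    class InputExample
      Plugin.register_input("example", self)

      def run
        "hello from example plugin"
      end
    end
  end
RUBY
$LOAD_PATH.unshift(PLUGIN_DIR)

LOADED = TinyApp::Plugin.load_input("example")
puts LOADED.new.run
```

With this shape a plugin author only has to ship a gem whose file path follows the convention; the host needs no class loader configuration, and a missing plugin surfaces as an ordinary `LoadError`.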

Page 37: JRuby with Java Code in Data Processing World

JRuby in Japan

Not so many users :(

CRuby is by far the mainstream in Japan

Java -> Ruby -> Scala? Golang?

Page 38: JRuby with Java Code in Data Processing World

Make your software pluggable. Make eco-system & community.

with JRuby!

Thanks!