Data Analytics Service Company and Its Ruby Usage

Data Analytics Service Company and Its Ruby Usage RubyKaigi 2015 (Dec 12, 2015) Satoshi Tagomori (@tagomoris)

Transcript of Data Analytics Service Company and Its Ruby Usage

Page 1: Data Analytics Service Company and Its Ruby Usage

Data Analytics Service Company and Its Ruby Usage

RubyKaigi 2015 (Dec 12, 2015)

Satoshi Tagomori (@tagomoris)

Page 2: Data Analytics Service Company and Its Ruby Usage

Satoshi "Moris" Tagomori (@tagomoris)

Fluentd, MessagePack-Ruby, Norikra, ...

Treasure Data, Inc.

Page 3: Data Analytics Service Company and Its Ruby Usage
Page 4: Data Analytics Service Company and Its Ruby Usage

http://www.treasuredata.com/

Page 5: Data Analytics Service Company and Its Ruby Usage

http://www.fluentd.org/

Fluentd: Unified Logging Layer

For Stream Data

Written in CRuby

http://www.slideshare.net/treasure-data/the-basics-of-fluentd-35681111

Page 6: Data Analytics Service Company and Its Ruby Usage

Embulk: Bulk Data Loader

High Throughput & Reliability

Written in Java/JRuby

http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed

http://www.embulk.org/

Page 7: Data Analytics Service Company and Its Ruby Usage

Data Analytics Platform Data Analytics Service

Page 8: Data Analytics Service Company and Its Ruby Usage
Page 9: Data Analytics Service Company and Its Ruby Usage
Page 10: Data Analytics Service Company and Its Ruby Usage

Services

Page 11: Data Analytics Service Company and Its Ruby Usage

Services

JVM

Page 12: Data Analytics Service Company and Its Ruby Usage

Services

JVM, C++

Page 13: Data Analytics Service Company and Its Ruby Usage

Data Analytics Flow

Collect Store Process Visualize

Data source

Reporting

Monitoring

Page 15: Data Analytics Service Company and Its Ruby Usage

Data Analytics Platform

• Data collection, storage

• Console & API endpoints

• Schema management

• Processing (batch, query, ...)

• Queuing & Scheduling

• Data connector/exporter

Page 16: Data Analytics Service Company and Its Ruby Usage

Treasure Data Internals

Page 17: Data Analytics Service Company and Its Ruby Usage

Data Analytics Platform

• Data collection, storage: Ruby(OSS), Java/JRuby(OSS)

• Console & API endpoints: Ruby(RoR)

• Schema management: Ruby/Java (MessagePack)

• Processing (batch, query, ...): Java(Hadoop,Presto)

• Queuing & Scheduling: Ruby(OSS)

• Data connector/exporter: Java, Java/JRuby(OSS)

Page 18: Data Analytics Service Company and Its Ruby Usage

Treasure Data Architecture: Overview

http://www.slideshare.net/tagomoris/data-analytics-service-company-and-its-ruby-usage

[Architecture diagram. Components: Console & API, EventCollector, DataConnector, PlazmaDB, Worker, Scheduler, Hadoop Cluster, Presto Cluster. External labels: USERS, TD SDKs, SERVERS, CUSTOMER's SYSTEMS.]

Page 19: Data Analytics Service Company and Its Ruby Usage

[Architecture diagram. Components: Console & API, EventCollector, DataConnector, PlazmaDB, Worker, Scheduler, Hadoop Cluster, Presto Cluster. External labels: USERS, TD SDKs, SERVERS, CUSTOMER's SYSTEMS.]

Page 20: Data Analytics Service Company and Its Ruby Usage

[Architecture diagram. Components: Console & API, EventCollector, DataConnector, PlazmaDB, Worker, Scheduler, Hadoop Cluster, Presto Cluster. External labels: USERS, TD SDKs, SERVERS, CUSTOMER's SYSTEMS.]

Annotations: 50k/day, 200k/day

Page 21: Data Analytics Service Company and Its Ruby Usage

[Architecture diagram. Components: Console & API, EventCollector, DataConnector, PlazmaDB, Worker, Scheduler, Hadoop Cluster, Presto Cluster. External labels: USERS, TD SDKs, SERVERS, CUSTOMER's SYSTEMS.]

Annotations: 50k/day, 200k/day, 12M/day (138/sec)
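The per-second figure is consistent with dividing the daily count by the number of seconds in a day and rounding down. A quick check in Ruby (the constant and method names are illustrative only, not from the talk):

```ruby
# Convert a daily count into an average per-second rate.
SECONDS_PER_DAY = 24 * 60 * 60 # 86_400

def per_second(count_per_day)
  count_per_day / SECONDS_PER_DAY.to_f
end

# 12M/day is a little under 139 per second; the slide rounds down.
puts per_second(12_000_000).floor # => 138
```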

Page 22: Data Analytics Service Company and Its Ruby Usage

Queue/Worker, Scheduler

• Treasure Data: multi-tenant data analytics service

• executes many jobs in shared clusters (queries, imports, ...)

• CORE: queues-workers & schedulers

• Clusters have their own queues/schedulers... but that's not enough
  • resource limitations for each price plan
  • priority queues for job types
  • and many others

Page 23: Data Analytics Service Company and Its Ruby Usage

PerfectSched

https://github.com/treasure-data/perfectsched

Page 24: Data Analytics Service Company and Its Ruby Usage

PerfectSched

• Provides periodic/scheduled queries for customers
  • it's like a reliable "cron"

• Highly available distributed scheduler using RDBMS
  • Written in CRuby

• At-least-once semantics

• PerfectSched enqueues jobs into PerfectQueue
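One way to picture "at-least-once" scheduling that feeds a queue is the sketch below. It is a hypothetical in-memory model, not the PerfectSched or PerfectQueue API: the class names, the `tick` method, and the idea of deduplicating on a "schedule name + scheduled time" key are all assumptions made for illustration.

```ruby
# Hypothetical in-memory model of an at-least-once scheduler.
# Neither class mirrors the real PerfectSched/PerfectQueue API.
class TinyQueue
  attr_reader :jobs

  def initialize
    @jobs = {} # job key => payload
  end

  # Enqueue is idempotent per key, so re-submitting the same slot is harmless.
  def enqueue(key, payload)
    return false if @jobs.key?(key)
    @jobs[key] = payload
    true
  end
end

class TinyScheduler
  Schedule = Struct.new(:name, :interval, :next_time)

  def initialize(queue)
    @queue = queue
    @schedules = {}
  end

  def add(name, interval, first_time)
    @schedules[name] = Schedule.new(name, interval, first_time)
  end

  # Enqueue every slot that is due, then advance next_time.
  # If a real process died between those two steps, the slot would be
  # enqueued again on the next tick (at-least-once); the job key
  # absorbs the duplicate.
  def tick(now)
    @schedules.each_value do |s|
      while s.next_time <= now
        @queue.enqueue("#{s.name}.#{s.next_time}", { scheduled_time: s.next_time })
        s.next_time += s.interval
      end
    end
  end
end

queue = TinyQueue.new
sched = TinyScheduler.new(queue)
sched.add("daily_report", 60, 0)
sched.tick(120) # slots 0, 60 and 120 are due
puts queue.jobs.keys.inspect
```

The design point is that the scheduler is allowed to submit a slot more than once, as long as the receiving side can recognize the repeat.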

Page 25: Data Analytics Service Company and Its Ruby Usage

Jobs in TD

                     LOST   Duplicated   Retried for Errors   Throughput   Execution time
DATA import/export   NG     NG           OK or NG             HIGH         SHORT (secs-mins)
QUERY                NG     OK           OK                   LOW          SHORT (secs) or LONG (mins-hours)

Page 26: Data Analytics Service Company and Its Ruby Usage

PerfectQueue

https://github.com/treasure-data/perfectqueue

Page 27: Data Analytics Service Company and Its Ruby Usage

PerfectQueue overview

[Diagram labels: MySQL, 1 table for 1 queue; workers for queues (five Worker boxes); hypervisor process; worker processes]

Page 28: Data Analytics Service Company and Its Ruby Usage

Features

• Priorities for query types

• Resource limits per account

• Graceful restarts
  • Queries must be able to run for a long time (<= 1d)
  • New worker code should be loaded while running jobs continue with the older code
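Priorities and per-account limits can be combined in a single selection step. The following is a minimal sketch under stated assumptions: the `Job` struct, the `pick_next` method, and the shape of the limits table are hypothetical and do not come from PerfectQueue.

```ruby
# Hypothetical job selection: highest priority first, but skip accounts
# that already run as many jobs as their limit allows.
Job = Struct.new(:id, :account, :priority)

def pick_next(waiting, running_counts, limits)
  candidates = waiting.select do |job|
    running_counts.fetch(job.account, 0) < limits.fetch(job.account, 1)
  end
  # max_by returns nil when no job is eligible.
  candidates.max_by(&:priority)
end

waiting = [
  Job.new(1, "acct_a", 10),
  Job.new(2, "acct_b", 5),
  Job.new(3, "acct_a", 1)
]
running = { "acct_a" => 2 } # acct_a is already at its limit
limits  = { "acct_a" => 2, "acct_b" => 4 }

puts pick_next(waiting, running, limits).id # => 2
```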

Page 29: Data Analytics Service Company and Its Ruby Usage

PerfectQueue

• Highly available distributed queue using RDBMS
  • Enqueue by INSERT INTO
  • Dequeue/Commit by UPDATE
  • Using transactions

• Flexible scheduling rather than scalability

• Workers do many things
  • PlazmaDB operations (including importing data)
  • Building job parameters
  • Handling results of jobs + kicking other jobs

• Using Amazon RDS (MySQL) internally (+ Workers on EC2)
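The INSERT/UPDATE pattern above can be sketched without a database. In this hypothetical model a Mutex stands in for the transaction and an array stands in for the queue table; the column names and the lease timeout used for retries are assumptions for illustration, not PerfectQueue's actual schema.

```ruby
# Hypothetical in-memory stand-in for a "one table per queue" design.
# Comments show the kind of SQL each step corresponds to.
class TableQueue
  Row = Struct.new(:id, :data, :owner, :timeout, :finished)

  def initialize
    @rows = []
    @lock = Mutex.new # plays the role of the database transaction
    @next_id = 0
  end

  # ~ INSERT INTO queue (data) VALUES (?)
  def enqueue(data)
    @lock.synchronize do
      @next_id += 1
      @rows << Row.new(@next_id, data, nil, 0, false)
      @next_id
    end
  end

  # ~ UPDATE queue SET owner = ?, timeout = ? WHERE id = ?
  # A row whose lease has expired can be taken again, so a job held by
  # a dead worker is eventually retried.
  def dequeue(worker, now, lease)
    @lock.synchronize do
      row = @rows.find { |r| !r.finished && r.timeout <= now }
      return nil unless row
      row.owner = worker
      row.timeout = now + lease
      row
    end
  end

  # ~ UPDATE queue SET finished = 1 WHERE id = ? AND owner = ?
  def commit(id, worker)
    @lock.synchronize do
      row = @rows.find { |r| r.id == id && r.owner == worker && !r.finished }
      return false unless row
      row.finished = true
    end
  end
end

q = TableQueue.new
id = q.enqueue("query-1")
first  = q.dequeue("worker-a", 0, 30)  # worker-a leases the row until t=30
second = q.dequeue("worker-b", 10, 30) # nothing available yet
third  = q.dequeue("worker-b", 31, 30) # lease expired, row is handed out again
done   = q.commit(id, "worker-b")
puts [second.inspect, third.id, done].inspect
```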

Page 30: Data Analytics Service Company and Its Ruby Usage

Building Jobs/Parameters

• Parameters
  • for job types, accounts, price plans and clusters
  • to control performance/parallelism, permissions and data types
  • ex: Java properties

• Jobs
  • to prepare for customers' queries
  • to make queries safer/faster
  • ex: Hive queries (HiveQL, a dialect of SQL)
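Layered parameters of this kind can be sketched as hash merges that end in a list of `-hiveconf key=value` arguments. The layer names (`DEFAULTS`, `plan_overrides`, `account_overrides`) and the `build_hiveconf` method are hypothetical; only the `-hiveconf` flag and the property names are taken from the Hive example in this talk.

```ruby
# Hypothetical layered configuration: later layers override earlier ones.
DEFAULTS = {
  "mapreduce.job.priority" => "NORMAL",
  "hive.auto.convert.join" => "false"
}.freeze

def build_hiveconf(*layers)
  params = layers.reduce({}) { |acc, layer| acc.merge(layer) }
  # Sort by key so the generated command line is stable across runs.
  params.sort.flat_map { |key, value| ["-hiveconf", "#{key}=#{value}"] }
end

plan_overrides    = { "mapreduce.job.priority" => "HIGH" }
account_overrides = { "mapreduce.job.queuename" => "root.q221.high" }

args = build_hiveconf(DEFAULTS, plan_overrides, account_overrides)
puts args.join(" ")
```

Keeping the result as an argument array, rather than one string, means it can be passed to a process launcher without shell quoting.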

Page 31: Data Analytics Service Company and Its Ruby Usage

Example: Hive job

env HADOOP_CLASSPATH=test.jar:td-hadoop-1.0.jar \
    HADOOP_OPTS="-Xmx738m -Duser.name=221" \
  hive --service jar td-hadoop-1.0.jar \
    com.treasure_data.hadoop.hive.runner.QueryRunner \
    -hiveconf td.jar.version= \
    -hiveconf plazma.metadb.config={} \
    -hiveconf plazma.storage.config={} \
    -hiveconf td.worker.database.config={} \
    -hiveconf mapreduce.job.priority=HIGH \
    -hiveconf mapreduce.job.queuename=root.q221.high \
    -hiveconf mapreduce.job.name=HiveJob379515 \
    -hiveconf td.query.mergeThreshold=1333382400 \
    -hiveconf td.query.apikey=12345 \
    -hiveconf td.scheduled.time=1342449253 \
    -hiveconf td.outdir=./jobs/379515 \
    -hiveconf hive.metastore.warehouse.dir=/user/hive/221/warehouse \
    -hiveconf hive.auto.convert.join.noconditionaltask=false \
    -hiveconf hive.mapjoin.localtask.max.memory.usage=0.7 \
    -hiveconf hive.mapjoin.smalltable.filesize=25000000 \
    -hiveconf hive.resultset.use.unique.column.names=false \
    -hiveconf hive.auto.convert.join=false \
    -hiveconf hive.optimize.sort.dynamic.partition=false \
    -hiveconf mapreduce.job.reduces=-1 \
    -hiveconf hive.vectorized.execution.enabled=false \
    -hiveconf mapreduce.job.ubertask.enable=true \
    -hiveconf yarn.app.mapreduce.am.resource.mb=2048 \
    -hiveconf mapreduce.job.ubertask.maxmaps=1 \
    -hiveconf mapreduce.job.ubertask.maxreduces=1 \

Page 32: Data Analytics Service Company and Its Ruby Usage

env HADOOP_CLASSPATH=test.jar:td-hadoop-1.0.jar \
    HADOOP_OPTS="-Xmx738m -Duser.name=221" \
  hive --service jar td-hadoop-1.0.jar \
    com.treasure_data.hadoop.hive.runner.QueryRunner \
    -hiveconf td.jar.version= \
    -hiveconf plazma.metadb.config={} \
    -hiveconf plazma.storage.config={} \
    -hiveconf td.worker.database.config={} \
    -hiveconf mapreduce.job.priority=HIGH \
    -hiveconf mapreduce.job.queuename=root.q221.high \
    -hiveconf mapreduce.job.name=HiveJob379515 \
    -hiveconf td.query.mergeThreshold=1333382400 \
    -hiveconf td.query.apikey=12345 \
    -hiveconf td.scheduled.time=1342449253 \
    -hiveconf td.outdir=./jobs/379515 \
    -hiveconf hive.metastore.warehouse.dir=/user/hive/221/warehouse \
    -hiveconf hive.auto.convert.join.noconditionaltask=false \
    -hiveconf hive.mapjoin.localtask.max.memory.usage=0.7 \
    -hiveconf hive.mapjoin.smalltable.filesize=25000000 \
    -hiveconf hive.resultset.use.unique.column.names=false \
    -hiveconf hive.auto.convert.join=false \
    -hiveconf hive.optimize.sort.dynamic.partition=false \
    -hiveconf mapreduce.job.reduces=-1 \
    -hiveconf hive.vectorized.execution.enabled=false \
    -hiveconf mapreduce.job.ubertask.enable=true \
    -hiveconf yarn.app.mapreduce.am.resource.mb=2048 \
    -hiveconf mapreduce.job.ubertask.maxmaps=1 \
    -hiveconf mapreduce.job.ubertask.maxreduces=1 \
    -hiveconf mapreduce.job.ubertask.maxbytes=536870912 \
    -hiveconf td.hive.insertInto.dynamic.partitioning=false \
    -outdir ./jobs/379515

Page 33: Data Analytics Service Company and Its Ruby Usage

Example: Hive job (cont)

ADD JAR 'td-hadoop-1.0.jar';
CREATE DATABASE IF NOT EXISTS `db`;
USE `db`;
CREATE TABLE tagomoris (`v` MAP<STRING,STRING>, `time` INT)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='*,time')
TBLPROPERTIES (
  'td.storage.user'='221',
  'td.storage.database'='dfc',
  'td.storage.table'='users_20100604_080812_ce9203d0',
  'td.storage.path'='221/dfc/users_20100604_080812_ce9203d0',
  'td.table_id'='2',
  'td.modifiable'='true',
  'plazma.data_set.name'='221/dfc/users_20100604_080812_ce9203d0'
);
CREATE TABLE tbl1 (
  `uid` INT,
  `key` STRING,
  `time` INT
)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='uid,key,time')
TBLPROPERTIES (
  'td.storage.user'='221',
  'td.storage.database'='dfc',

Page 34: Data Analytics Service Company and Its Ruby Usage

ADD JAR 'td-hadoop-1.0.jar';
CREATE DATABASE IF NOT EXISTS `db`;
USE `db`;
CREATE TABLE tagomoris (`v` MAP<STRING,STRING>, `time` INT)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='*,time')
TBLPROPERTIES (
  'td.storage.user'='221',
  'td.storage.database'='dfc',
  'td.storage.table'='users_20100604_080812_ce9203d0',
  'td.storage.path'='221/dfc/users_20100604_080812_ce9203d0',
  'td.table_id'='2',
  'td.modifiable'='true',
  'plazma.data_set.name'='221/dfc/users_20100604_080812_ce9203d0'
);
CREATE TABLE tbl1 (
  `uid` INT,
  `key` STRING,
  `time` INT
)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='uid,key,time')
TBLPROPERTIES (
  'td.storage.user'='221',
  'td.storage.database'='dfc',
  'td.storage.table'='contests_20100606_120720_96abe81a',
  'td.storage.path'='221/dfc/contests_20100606_120720_96abe81a',
  'td.table_id'='4',
  'td.modifiable'='true',
  'plazma.data_set.name'='221/dfc/contests_20100606_120720_96abe81a'
);
USE `db`;

Page 35: Data Analytics Service Company and Its Ruby Usage

USE `db`;
CREATE TEMPORARY FUNCTION MSGPACK_SERIALIZE AS 'com.treasure_data.hadoop.hive.udf.MessagePackSerialize';
CREATE TEMPORARY FUNCTION TD_TIME_RANGE AS 'com.treasure_data.hadoop.hive.udf.GenericUDFTimeRange';
CREATE TEMPORARY FUNCTION TD_TIME_ADD AS 'com.treasure_data.hadoop.hive.udf.UDFTimeAdd';
CREATE TEMPORARY FUNCTION TD_TIME_FORMAT AS 'com.treasure_data.hadoop.hive.udf.UDFTimeFormat';
CREATE TEMPORARY FUNCTION TD_TIME_PARSE AS 'com.treasure_data.hadoop.hive.udf.UDFTimeParse';
CREATE TEMPORARY FUNCTION TD_SCHEDULED_TIME AS 'com.treasure_data.hadoop.hive.udf.GenericUDFScheduledTime';
CREATE TEMPORARY FUNCTION TD_X_RANK AS 'com.treasure_data.hadoop.hive.udf.Rank';
CREATE TEMPORARY FUNCTION TD_FIRST AS 'com.treasure_data.hadoop.hive.udf.GenericUDAFFirst';
CREATE TEMPORARY FUNCTION TD_LAST AS 'com.treasure_data.hadoop.hive.udf.GenericUDAFLast';
CREATE TEMPORARY FUNCTION TD_SESSIONIZE AS 'com.treasure_data.hadoop.hive.udf.UDFSessionize';
CREATE TEMPORARY FUNCTION TD_PARSE_USER_AGENT AS 'com.treasure_data.hadoop.hive.udf.GenericUDFParseUserAgent';
CREATE TEMPORARY FUNCTION TD_HEX2NUM AS 'com.treasure_data.hadoop.hive.udf.UDFHex2num';
CREATE TEMPORARY FUNCTION TD_MD5 AS 'com.treasure_data.hadoop.hive.udf.UDFmd5';
CREATE TEMPORARY FUNCTION TD_RANK_SEQUENCE AS 'com.treasure_data.hadoop.hive.udf.UDFRankSequence';
CREATE TEMPORARY FUNCTION TD_STRING_EXPLODER AS 'com.treasure_data.hadoop.hive.udf.GenericUDTFStringExploder';
CREATE TEMPORARY FUNCTION TD_URL_DECODE AS

Page 36: Data Analytics Service Company and Its Ruby Usage

CREATE TEMPORARY FUNCTION TD_URL_DECODE AS 'com.treasure_data.hadoop.hive.udf.UDFUrlDecode';
CREATE TEMPORARY FUNCTION TD_DATE_TRUNC AS 'com.treasure_data.hadoop.hive.udf.UDFDateTrunc';
CREATE TEMPORARY FUNCTION TD_LAT_LONG_TO_COUNTRY AS 'com.treasure_data.hadoop.hive.udf.UDFLatLongToCountry';
CREATE TEMPORARY FUNCTION TD_SUBSTRING_INENCODING AS 'com.treasure_data.hadoop.hive.udf.GenericUDFSubstringInEncoding';
CREATE TEMPORARY FUNCTION TD_DIVIDE AS 'com.treasure_data.hadoop.hive.udf.GenericUDFDivide';
CREATE TEMPORARY FUNCTION TD_SUMIF AS 'com.treasure_data.hadoop.hive.udf.GenericUDAFSumIf';
CREATE TEMPORARY FUNCTION TD_AVGIF AS 'com.treasure_data.hadoop.hive.udf.GenericUDAFAvgIf';
CREATE TEMPORARY FUNCTION hivemall_version AS 'hivemall.HivemallVersionUDF';
CREATE TEMPORARY FUNCTION perceptron AS 'hivemall.classifier.PerceptronUDTF';
CREATE TEMPORARY FUNCTION train_perceptron AS 'hivemall.classifier.PerceptronUDTF';
CREATE TEMPORARY FUNCTION train_pa AS 'hivemall.classifier.PassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_pa1 AS 'hivemall.classifier.PassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_pa2 AS 'hivemall.classifier.PassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_cw AS 'hivemall.classifier.ConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_arow AS 'hivemall.classifier.AROWClassifierUDTF';
CREATE TEMPORARY FUNCTION train_arowh AS 'hivemall.classifier.AROWClassifierUDTF';

Page 37: Data Analytics Service Company and Its Ruby Usage

CREATE TEMPORARY FUNCTION train_arowh AS 'hivemall.classifier.AROWClassifierUDTF';
CREATE TEMPORARY FUNCTION train_scw AS 'hivemall.classifier.SoftConfideceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_scw2 AS 'hivemall.classifier.SoftConfideceWeightedUDTF';
CREATE TEMPORARY FUNCTION adagrad_rda AS 'hivemall.classifier.AdaGradRDAUDTF';
CREATE TEMPORARY FUNCTION train_adagrad_rda AS 'hivemall.classifier.AdaGradRDAUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_perceptron AS 'hivemall.classifier.multiclass.MulticlassPerceptronUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_pa AS 'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_pa1 AS 'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_pa2 AS 'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_cw AS 'hivemall.classifier.multiclass.MulticlassConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_arow AS 'hivemall.classifier.multiclass.MulticlassAROWClassifierUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_scw AS 'hivemall.classifier.multiclass.MulticlassSoftConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_scw2 AS 'hivemall.classifier.multiclass.MulticlassSoftConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION cosine_similarity AS 'hivemall.knn.similarity.CosineSimilarityUDF';
CREATE TEMPORARY FUNCTION cosine_sim AS 'hivemall.knn.similarity.CosineSimilarityUDF';
CREATE TEMPORARY FUNCTION jaccard AS 'hivemall.knn.similarity.JaccardIndexUDF';

Page 38: Data Analytics Service Company and Its Ruby Usage

CREATE TEMPORARY FUNCTION jaccard AS 'hivemall.knn.similarity.JaccardIndexUDF';
CREATE TEMPORARY FUNCTION jaccard_similarity AS 'hivemall.knn.similarity.JaccardIndexUDF';
CREATE TEMPORARY FUNCTION angular_similarity AS 'hivemall.knn.similarity.AngularSimilarityUDF';
CREATE TEMPORARY FUNCTION euclid_similarity AS 'hivemall.knn.similarity.EuclidSimilarity';
CREATE TEMPORARY FUNCTION distance2similarity AS 'hivemall.knn.similarity.Distance2SimilarityUDF';
CREATE TEMPORARY FUNCTION hamming_distance AS 'hivemall.knn.distance.HammingDistanceUDF';
CREATE TEMPORARY FUNCTION popcnt AS 'hivemall.knn.distance.PopcountUDF';
CREATE TEMPORARY FUNCTION kld AS 'hivemall.knn.distance.KLDivergenceUDF';
CREATE TEMPORARY FUNCTION euclid_distance AS 'hivemall.knn.distance.EuclidDistanceUDF';
CREATE TEMPORARY FUNCTION cosine_distance AS 'hivemall.knn.distance.CosineDistanceUDF';
CREATE TEMPORARY FUNCTION angular_distance AS 'hivemall.knn.distance.AngularDistanceUDF';
CREATE TEMPORARY FUNCTION jaccard_distance AS 'hivemall.knn.distance.JaccardDistanceUDF';
CREATE TEMPORARY FUNCTION manhattan_distance AS 'hivemall.knn.distance.ManhattanDistanceUDF';
CREATE TEMPORARY FUNCTION minkowski_distance AS 'hivemall.knn.distance.MinkowskiDistanceUDF';
CREATE TEMPORARY FUNCTION minhashes AS 'hivemall.knn.lsh.MinHashesUDF';
CREATE TEMPORARY FUNCTION minhash AS 'hivemall.knn.lsh.MinHashUDTF';
CREATE TEMPORARY FUNCTION bbit_minhash AS 'hivemall.knn.lsh.bBitMinHashUDF';
CREATE TEMPORARY FUNCTION voted_avg AS 'hivemall.ensemble.bagging.VotedAvgUDAF';

Page 39: Data Analytics Service Company and Its Ruby Usage

CREATE TEMPORARY FUNCTION voted_avg AS 'hivemall.ensemble.bagging.VotedAvgUDAF';
CREATE TEMPORARY FUNCTION weight_voted_avg AS 'hivemall.ensemble.bagging.WeightVotedAvgUDAF';
CREATE TEMPORARY FUNCTION wvoted_avg AS 'hivemall.ensemble.bagging.WeightVotedAvgUDAF';
CREATE TEMPORARY FUNCTION max_label AS 'hivemall.ensemble.MaxValueLabelUDAF';
CREATE TEMPORARY FUNCTION maxrow AS 'hivemall.ensemble.MaxRowUDAF';
CREATE TEMPORARY FUNCTION argmin_kld AS 'hivemall.ensemble.ArgminKLDistanceUDAF';
CREATE TEMPORARY FUNCTION mhash AS 'hivemall.ftvec.hashing.MurmurHash3UDF';
CREATE TEMPORARY FUNCTION sha1 AS 'hivemall.ftvec.hashing.Sha1UDF';
CREATE TEMPORARY FUNCTION array_hash_values AS 'hivemall.ftvec.hashing.ArrayHashValuesUDF';
CREATE TEMPORARY FUNCTION prefixed_hash_values AS 'hivemall.ftvec.hashing.ArrayPrefixedHashValuesUDF';
CREATE TEMPORARY FUNCTION polynomial_features AS 'hivemall.ftvec.pairing.PolynomialFeaturesUDF';
CREATE TEMPORARY FUNCTION powered_features AS 'hivemall.ftvec.pairing.PoweredFeaturesUDF';
CREATE TEMPORARY FUNCTION rescale AS 'hivemall.ftvec.scaling.RescaleUDF';
CREATE TEMPORARY FUNCTION rescale_fv AS 'hivemall.ftvec.scaling.RescaleUDF';
CREATE TEMPORARY FUNCTION zscore AS 'hivemall.ftvec.scaling.ZScoreUDF';
CREATE TEMPORARY FUNCTION normalize AS 'hivemall.ftvec.scaling.L2NormalizationUDF';
CREATE TEMPORARY FUNCTION conv2dense AS 'hivemall.ftvec.conv.ConvertToDenseModelUDAF';
CREATE TEMPORARY FUNCTION to_dense_features AS 'hivemall.ftvec.conv.ToDenseFeaturesUDF';

Page 40: Data Analytics Service Company and Its Ruby Usage

CREATE TEMPORARY FUNCTION to_dense_features AS 'hivemall.ftvec.conv.ToDenseFeaturesUDF';
CREATE TEMPORARY FUNCTION to_dense AS 'hivemall.ftvec.conv.ToDenseFeaturesUDF';
CREATE TEMPORARY FUNCTION to_sparse_features AS 'hivemall.ftvec.conv.ToSparseFeaturesUDF';
CREATE TEMPORARY FUNCTION to_sparse AS 'hivemall.ftvec.conv.ToSparseFeaturesUDF';
CREATE TEMPORARY FUNCTION quantify AS 'hivemall.ftvec.conv.QuantifyColumnsUDTF';
CREATE TEMPORARY FUNCTION vectorize_features AS 'hivemall.ftvec.trans.VectorizeFeaturesUDF';
CREATE TEMPORARY FUNCTION categorical_features AS 'hivemall.ftvec.trans.CategoricalFeaturesUDF';
CREATE TEMPORARY FUNCTION indexed_features AS 'hivemall.ftvec.trans.IndexedFeatures';
CREATE TEMPORARY FUNCTION quantified_features AS 'hivemall.ftvec.trans.QuantifiedFeaturesUDTF';
CREATE TEMPORARY FUNCTION quantitative_features AS 'hivemall.ftvec.trans.QuantitativeFeaturesUDF';
CREATE TEMPORARY FUNCTION amplify AS 'hivemall.ftvec.amplify.AmplifierUDTF';
CREATE TEMPORARY FUNCTION rand_amplify AS 'hivemall.ftvec.amplify.RandomAmplifierUDTF';
CREATE TEMPORARY FUNCTION addBias AS 'hivemall.ftvec.AddBiasUDF';
CREATE TEMPORARY FUNCTION add_bias AS 'hivemall.ftvec.AddBiasUDF';
CREATE TEMPORARY FUNCTION sortByFeature AS 'hivemall.ftvec.SortByFeatureUDF';
CREATE TEMPORARY FUNCTION sort_by_feature AS 'hivemall.ftvec.SortByFeatureUDF';
CREATE TEMPORARY FUNCTION extract_feature AS 'hivemall.ftvec.ExtractFeatureUDF';

Page 41: Data Analytics Service Company and Its Ruby Usage

CREATE TEMPORARY FUNCTION extract_feature AS 'hivemall.ftvec.ExtractFeatureUDF';
CREATE TEMPORARY FUNCTION extract_weight AS 'hivemall.ftvec.ExtractWeightUDF';
CREATE TEMPORARY FUNCTION add_feature_index AS 'hivemall.ftvec.AddFeatureIndexUDF';
CREATE TEMPORARY FUNCTION feature AS 'hivemall.ftvec.FeatureUDF';
CREATE TEMPORARY FUNCTION feature_index AS 'hivemall.ftvec.FeatureIndexUDF';
CREATE TEMPORARY FUNCTION tf AS 'hivemall.ftvec.text.TermFrequencyUDAF';
CREATE TEMPORARY FUNCTION train_logregr AS 'hivemall.regression.LogressUDTF';
CREATE TEMPORARY FUNCTION train_pa1_regr AS 'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_pa1a_regr AS 'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_pa2_regr AS 'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_pa2a_regr AS 'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_arow_regr AS 'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION train_arowe_regr AS 'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION train_arowe2_regr AS 'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION train_adagrad_regr AS 'hivemall.regression.AdaGradUDTF';
CREATE TEMPORARY FUNCTION train_adadelta_regr AS 'hivemall.regression.AdaDeltaUDTF';
CREATE TEMPORARY FUNCTION train_adagrad AS 'hivemall.regression.AdaGradUDTF';

Page 42: Data Analytics Service Company and Its Ruby Usage

CREATE TEMPORARY FUNCTION train_adagrad AS 'hivemall.regression.AdaGradUDTF';
CREATE TEMPORARY FUNCTION train_adadelta AS 'hivemall.regression.AdaDeltaUDTF';
CREATE TEMPORARY FUNCTION logress AS 'hivemall.regression.LogressUDTF';
CREATE TEMPORARY FUNCTION pa1_regress AS 'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION pa1a_regress AS 'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION pa2_regress AS 'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION pa2a_regress AS 'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION arow_regress AS 'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION arowe_regress AS 'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION arowe2_regress AS 'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION adagrad AS 'hivemall.regression.AdaGradUDTF';
CREATE TEMPORARY FUNCTION adadelta AS 'hivemall.regression.AdaDeltaUDTF';
CREATE TEMPORARY FUNCTION float_array AS 'hivemall.tools.array.AllocFloatArrayUDF';
CREATE TEMPORARY FUNCTION array_remove AS 'hivemall.tools.array.ArrayRemoveUDF';
CREATE TEMPORARY FUNCTION sort_and_uniq_array AS 'hivemall.tools.array.SortAndUniqArrayUDF';
CREATE TEMPORARY FUNCTION subarray_endwith AS 'hivemall.tools.array.SubarrayEndWithUDF';
CREATE TEMPORARY FUNCTION subarray_startwith AS 'hivemall.tools.array.SubarrayStartWithUDF';
CREATE TEMPORARY FUNCTION collect_all AS

Page 43: Data Analytics Service Company and Its Ruby Usage

CREATE TEMPORARY FUNCTION collect_all AS 'hivemall.tools.array.CollectAllUDAF';
CREATE TEMPORARY FUNCTION concat_array AS 'hivemall.tools.array.ConcatArrayUDF';
CREATE TEMPORARY FUNCTION subarray AS 'hivemall.tools.array.SubarrayUDF';
CREATE TEMPORARY FUNCTION array_avg AS 'hivemall.tools.array.ArrayAvgGenericUDAF';
CREATE TEMPORARY FUNCTION array_sum AS 'hivemall.tools.array.ArraySumUDAF';
CREATE TEMPORARY FUNCTION to_string_array AS 'hivemall.tools.array.ToStringArrayUDF';
CREATE TEMPORARY FUNCTION map_get_sum AS 'hivemall.tools.map.MapGetSumUDF';
CREATE TEMPORARY FUNCTION map_tail_n AS 'hivemall.tools.map.MapTailNUDF';
CREATE TEMPORARY FUNCTION to_map AS 'hivemall.tools.map.UDAFToMap';
CREATE TEMPORARY FUNCTION to_ordered_map AS 'hivemall.tools.map.UDAFToOrderedMap';
CREATE TEMPORARY FUNCTION sigmoid AS 'hivemall.tools.math.SigmoidGenericUDF';
CREATE TEMPORARY FUNCTION taskid AS 'hivemall.tools.mapred.TaskIdUDF';
CREATE TEMPORARY FUNCTION jobid AS 'hivemall.tools.mapred.JobIdUDF';
CREATE TEMPORARY FUNCTION rowid AS 'hivemall.tools.mapred.RowIdUDF';
CREATE TEMPORARY FUNCTION generate_series AS 'hivemall.tools.GenerateSeriesUDTF';
CREATE TEMPORARY FUNCTION convert_label AS 'hivemall.tools.ConvertLabelUDF';
CREATE TEMPORARY FUNCTION x_rank AS 'hivemall.tools.RankSequenceUDF';
CREATE TEMPORARY FUNCTION each_top_k AS 'hivemall.tools.EachTopKUDTF';
CREATE TEMPORARY FUNCTION tokenize AS 'hivemall.tools.text.TokenizeUDF';
CREATE TEMPORARY FUNCTION is_stopword AS 'hivemall.tools.text.StopwordUDF';
CREATE TEMPORARY FUNCTION split_words AS

Page 44: Data Analytics Service Company and Its Ruby Usage

CREATE TEMPORARY FUNCTION split_words AS 'hivemall.tools.text.SplitWordsUDF';
CREATE TEMPORARY FUNCTION normalize_unicode AS 'hivemall.tools.text.NormalizeUnicodeUDF';
CREATE TEMPORARY FUNCTION lr_datagen AS 'hivemall.dataset.LogisticRegressionDataGeneratorUDTF';
CREATE TEMPORARY FUNCTION f1score AS 'hivemall.evaluation.FMeasureUDAF';
CREATE TEMPORARY FUNCTION mae AS 'hivemall.evaluation.MeanAbsoluteErrorUDAF';
CREATE TEMPORARY FUNCTION mse AS 'hivemall.evaluation.MeanSquaredErrorUDAF';
CREATE TEMPORARY FUNCTION rmse AS 'hivemall.evaluation.RootMeanSquaredErrorUDAF';
CREATE TEMPORARY FUNCTION mf_predict AS 'hivemall.mf.MFPredictionUDF';
CREATE TEMPORARY FUNCTION train_mf_sgd AS 'hivemall.mf.MatrixFactorizationSGDUDTF';
CREATE TEMPORARY FUNCTION train_mf_adagrad AS 'hivemall.mf.MatrixFactorizationAdaGradUDTF';
CREATE TEMPORARY FUNCTION fm_predict AS 'hivemall.fm.FMPredictGenericUDAF';
CREATE TEMPORARY FUNCTION train_fm AS 'hivemall.fm.FactorizationMachineUDTF';
CREATE TEMPORARY FUNCTION train_randomforest_classifier AS 'hivemall.smile.classification.RandomForestClassifierUDTF';
CREATE TEMPORARY FUNCTION train_rf_classifier AS 'hivemall.smile.classification.RandomForestClassifierUDTF';
CREATE TEMPORARY FUNCTION train_randomforest_regr AS 'hivemall.smile.regression.RandomForestRegressionUDTF';
CREATE TEMPORARY FUNCTION train_rf_regr AS 'hivemall.smile.regression.RandomForestRegressionUDTF';
CREATE TEMPORARY FUNCTION tree_predict AS 'hivemall.smile.tools.TreePredictByStackMachineUDF';

Page 45: Data Analytics Service Company and Its Ruby Usage

CREATE TEMPORARY FUNCTION tree_predict AS 'hivemall.smile.tools.TreePredictByStackMachineUDF';
CREATE TEMPORARY FUNCTION vm_tree_predict AS 'hivemall.smile.tools.TreePredictByStackMachineUDF';
CREATE TEMPORARY FUNCTION rf_ensemble AS 'hivemall.smile.tools.RandomForestEnsembleUDAF';
CREATE TEMPORARY FUNCTION train_gradient_boosting_classifier AS 'hivemall.smile.classification.GradientTreeBoostingClassifierUDTF';
CREATE TEMPORARY FUNCTION guess_attribute_types AS 'hivemall.smile.tools.GuessAttributesUDF';
CREATE TEMPORARY FUNCTION tokenize_ja AS 'hivemall.nlp.tokenizer.KuromojiUDF';
CREATE TEMPORARY MACRO max2(x DOUBLE, y DOUBLE) if(x>y,x,y);
CREATE TEMPORARY MACRO min2(x DOUBLE, y DOUBLE) if(x<y,x,y);
CREATE TEMPORARY MACRO rand_gid(k INT) floor(rand()*k);
CREATE TEMPORARY MACRO rand_gid2(k INT, seed INT) floor(rand(seed)*k);
CREATE TEMPORARY MACRO idf(df_t DOUBLE, n_docs DOUBLE) log(10, n_docs / max2(1,df_t)) + 1.0;
CREATE TEMPORARY MACRO tfidf(tf FLOAT, df_t DOUBLE, n_docs DOUBLE) tf * (log(10, n_docs / max2(1,df_t)) + 1.0);

SELECT time, COUNT(1) AS cnt FROM tbl1 WHERE TD_TIME_RANGE(time, '2015-12-11', '2015-12-12', 'JST');

Page 46: Data Analytics Service Company and Its Ruby Usage

Do you still love Java / SQL ?

Page 47: Data Analytics Service Company and Its Ruby Usage

PQ written in Ruby

• Building jobs/parameters is so complex!
  • using data from many configurations (YAML, JSON), internal APIs and RDBMSs
  • with many ext syntaxes/rules to tune performance, override configurations for tests, ...

• Ruby empowers us to write fat/complex worker code

• Testing!
  • Unit tests using RSpec
  • System tests (executing real queries/jobs) using RSpec

Page 48: Data Analytics Service Company and Its Ruby Usage

For further improvement of workers

• More performance for more customers and lower costs

• More scalability for many other kinds of jobs

• Better and well-controlled tests (indented here documents!)
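"Indented here documents" presumably refers to the squiggly heredoc (`<<~`) available in Ruby 2.3 and later, which strips the common leading indentation. A minimal sketch of how it keeps expected query text readable inside test code (the `expected_query` method is hypothetical; the query text is the one shown earlier in this talk):

```ruby
# Squiggly heredoc (<<~) removes the common leading whitespace, so the
# expected text can be indented along with the surrounding test code.
def expected_query
  <<~SQL
    SELECT time, COUNT(1) AS cnt
    FROM tbl1
    WHERE TD_TIME_RANGE(time, '2015-12-11', '2015-12-12', 'JST')
  SQL
end

puts expected_query
```

With a plain `<<-SQL` heredoc the same body would keep its four leading spaces on every line, which makes string comparisons in tests awkward.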

Page 49: Data Analytics Service Company and Its Ruby Usage

"Done is better than Perfect."

Page 50: Data Analytics Service Company and Its Ruby Usage

PerfectQueue DoneQueue

?

Page 51: Data Analytics Service Company and Its Ruby Usage

We'll improve our code step by step, along with improvements in Ruby and the developer community <3

Thanks!