Zh tw cloud computing era
-
Upload
trendprogcontest13 -
Category
Technology
-
view
8.473 -
download
0
description
Transcript of Zh tw cloud computing era
![Page 1: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/1.jpg)
Trend Micro
Cloud Computing Era
![Page 2: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/2.jpg)
Three Major Trends to Chang the World
Cloud Computing
Mobile Big Data
![Page 3: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/3.jpg)
什麼是雲端運算?
以服務(as-a-service)的商業模式,透過Internet技術,提供具有擴充性(scalable)和彈性(elastic)的IT相關功能給使用者
Essential Characteristics
Service Models
Deployment Models
美國國家標準技術研究所 (NIST)的定義:
![Page 4: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/4.jpg)
It’s About the Ecosystem
IaaS
PaaS
SaaS
Cloud Computing
Generate
Big Data
Lead
Business Insights
create
Competition, Innovation, Productivity
Structured, Semi-structured
Enterprise Data Warehouse
![Page 5: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/5.jpg)
What is Big Data?
![Page 6: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/6.jpg)
What is the problem
• Getting the data to the processors becomes the bottleneck
• Quick calculation
– Typical disk data transfer rate:
• 75MB/sec – Time taken to transfer 100GB of data
to the processor:
• approx. 22 minutes!
![Page 7: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/7.jpg)
The Era of Big Data – Are You Ready
Data for business commercial analysis
• 2011: multi-terabyte (TB)
• 2020: 35.2 ZB (1 ZB = 1 billion TB)
![Page 8: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/8.jpg)
Who Needs It?
When to use?
• Affordable Storage/Compute
• Unstructured or Semi-structured
• Resilient Auto Scalability
When to use?
• Ad-hoc Reporting (<1sec)
• Multi-step Transactions
• Lots of Inserts/Updates/Deletes
Enterprise Database Hadoop
![Page 9: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/9.jpg)
Hadoop!
![Page 10: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/10.jpg)
– inspired by
• Apache Hadoop project
– inspired by Google's MapReduce and Google File System papers.
• Open sourced, flexible and available architecture for large scale computation and data processing on a network of commodity hardware
• Open Source Software + Hardware Commodity
– IT Costs Reduction
![Page 11: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/11.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
Hadoop Core
HDFS
MapReduce
![Page 12: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/12.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
HDFS
• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool
![Page 13: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/13.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
MapReduce
13
• Two Phases of Functional Programming
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Java API
![Page 14: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/14.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
Hadoop Core
14
HDFS
MapReduce
Java Java
Java
Java
![Page 15: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/15.jpg)
Word Count Example
Key: offset
Value: line
Key: word
Value: count
Key: word
Value: sum of count
0:The cat sat on the mat
22:The aardvark sat on the sofa
![Page 16: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/16.jpg)
The Hadoop Ecosystems
![Page 17: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/17.jpg)
The Ecosystem is the System
• Hadoop has become the kernel of the distributed operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
![Page 18: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/18.jpg)
Relation Map
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zo
ok
ee
pe
r (C
oo
rdin
atio
n)
Hbase (Column NoSQL DB)
Sqoop/Flume (Data integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
![Page 19: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/19.jpg)
Zookeeper – Coordination Framework
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zo
ok
ee
pe
r (C
oo
rdin
atio
n)
Hbase (Column NoSQL DB)
Sqoop/Flume (Data integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
![Page 20: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/20.jpg)
What is ZooKeeper
• A centralized service for maintaining
– Configuration information
– Providing distributed synchronization
• A set of tools to build distributed applications that can safely handle partial failures
• ZooKeeper was designed to store coordination data
– Status information
– Configuration
– Location information
![Page 21: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/21.jpg)
Flume / Sqoop – Data Integration Framework
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zo
ok
ee
pe
r (C
oo
rdin
atio
n)
Hbase (Column NoSQL DB)
Sqoop/Flume (Data integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
![Page 22: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/22.jpg)
What’s the problem for data collection
• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own collection path
![Page 23: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/23.jpg)
(and how can it help?)
• A distributed data collection service
• It efficiently collecting, aggregating, and moving large amounts of data
• Fault tolerant, many failover and recovery mechanism
• One-stop solution for data collection of all formats
![Page 24: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/24.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
Flume Architecture
Log
Flume Node
Log
Flume Node
...
HDFS
![Page 25: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/25.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
Flume Sources and Sinks
• Local Files
• HDFS
• Stdin, Stdout
• IRC
• IMAP
![Page 26: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/26.jpg)
Sqoop
• Easy, parallel database import/export
• What you want do?
– Insert data from RDBMS to HDFS
– Export data from HDFS back into RDBMS
![Page 27: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/27.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
Sqoop
28
RDBMS
Sqoop
HDFS
![Page 28: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/28.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
Sqoop Examples
29
$ sqoop import --connect jdbc:mysql://localhost/world --
username root --table City
...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,17800002,Qandahar,AFG,Qandahar,2375003,He
rat,AFG,Herat,1868004,Mazar-e-
Sharif,AFG,Balkh,1278005,Amsterdam,NLD,Noord-Holland,731200
...
![Page 29: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/29.jpg)
Pig / Hive – Analytical Language
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zo
ok
ee
pe
r (C
oo
rdin
atio
n)
Hbase (Column NoSQL DB)
Sqoop/Flume (Data integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
![Page 30: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/30.jpg)
Why Hive and Pig?
• Although MapReduce is very powerful, it can also be complex to master
• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code
• Many organizations have programmers who are skilled at writing code in scripting languages
• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!
![Page 31: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/31.jpg)
Hive – Developed by
• What is Hive?
– An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
– MapRuduce for execution
– HDFS for storage
• Hive Query Language
– Basic-SQL : Select, From, Join, Group-By
– Equi-Join, Muti-Table Insert, Multi-Group-By
– Batch query
SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
![Page 32: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/32.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
Hive
33
MapReduce
Hive
SQL
![Page 33: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/33.jpg)
Pig
• A high-level scripting language (Pig Latin)
• Process data one step at a time
• Simple to write MapReduce program
• Easy understand
• Easy debug A = load ‘a.txt’ as (id, name, age, ...)
B = load ‘b.txt’ as (id, address, ...)
C = JOIN A BY id, B BY id;STORE C into ‘c.txt’
– Initiated by
![Page 34: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/34.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
Pig
MapReduce
Pig
Script
![Page 35: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/35.jpg)
Hive vs. Pig
Hive Pig
Language HiveQL (SQL-like) Pig Latin, a scripting language
Schema Table definitions
that are stored in a
metastore
A schema is optionally defined
at runtime
Programmait Access JDBC, ODBC PigServer
![Page 36: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/36.jpg)
• Input
• For the given sample input the map emits
• the reduce just sums up the values
Hello World Bye World
Hello Hadoop Goodbye Hadoop
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
WordCount Example
![Page 37: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/37.jpg)
WordCount Example In MapReduce public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
![Page 38: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/38.jpg)
WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) as count;
DUMP C;
![Page 39: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/39.jpg)
WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH ’wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT count(*) FROM wordcount GROUP BY token;
![Page 40: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/40.jpg)
4
1 © 2011 Cloudera, Inc. All Rights Reserved.
The Story So Far
RDBMS
Hive Pig
Sqoop
MapReduce
HDFS
FS SQL
SQL Script
Posix
Java
Java
Flume
![Page 41: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/41.jpg)
Hbase – Column NoSQL DB
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zo
ok
ee
pe
r (C
oo
rdin
atio
n)
Hbase (Column NoSQL DB)
Sqoop/Flume (Data integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
![Page 42: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/42.jpg)
Structured-data vs Raw-data
![Page 43: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/43.jpg)
I – Inspired by
• Coordinated by Zookeeper
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API
– PUT
– GET
– DELETE
– SCAN
![Page 44: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/44.jpg)
Hbase – Data Model
• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
![Page 45: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/45.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
HBase Examples
hbase> create 'mytable', 'mycf‘
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1‘
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2‘
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3‘
hbase> scan 'mytable‘
hbase> disable 'mytable‘
hbase> drop 'mytable'
![Page 46: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/46.jpg)
Oozie – Job Workflow & Scheduling
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zo
ok
ee
pe
r (C
oo
rdin
atio
n)
Hbase (Column NoSQL DB)
Sqoop/Flume (Data integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
![Page 47: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/47.jpg)
What is ?
• A Java Web Application
• Oozie is a workflow scheduler for Hadoop
• Crond for Hadoop
• Triggered
– Time
– Data
Job 1
Job 3
Job 2
Job 4 Job 5
![Page 48: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/48.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
Oozie Features
• Component Independent
– MapReduce
– Hive
– Pig
– SqoopStreaming
![Page 49: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/49.jpg)
Mahout – Data Mining
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zo
ok
ee
pe
r (C
oo
rdin
atio
n)
Hbase (Column NoSQL DB)
Sqoop/Flume (Data integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
![Page 50: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/50.jpg)
What is
• Machine-learning tool
• Distributed and scalable machine learning algorithms on the Hadoop platform
• Building intelligent applications easier and faster
![Page 51: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/51.jpg)
© 2011 Cloudera, Inc. All Rights Reserved.
Mahout Use Cases
• Yahoo: Spam Detection
• Foursquare: Recommendations
• SpeedDate.com: Recommendations
• Adobe: User Targetting
• Amazon: Personalization Platform
![Page 52: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/52.jpg)
Hue – developed by
• Hadoop User Experience
• Apache Open source project
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI library
![Page 53: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/53.jpg)
Hue
• HUE comes with a suite of applications
– File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files.
– Job Browser: View jobs, tasks, counters, logs, etc.
– Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.
![Page 54: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/54.jpg)
Hue: File Browser UI
![Page 55: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/55.jpg)
Hue: Beewax UI
![Page 56: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/56.jpg)
Use case Example
• Predict what the user likes based on
– His/Her historical behavior
– Aggregate behavior of people similar to him
![Page 57: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/57.jpg)
Conclusion
Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop ecosystem
![Page 58: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/58.jpg)
Recap – Hadoop Ecosystem
MapReduce Runtime (Dist. Programming Framework)
Hadoop Distributed File System (HDFS)
Zo
ok
ee
pe
r (C
oo
rdin
atio
n)
Hbase (Column NoSQL DB)
Sqoop/Flume (Data integration)
Oozie (Job Workflow & Scheduling)
Pig/Hive (Analytical Language)
Hue (Web Console)
Mahout (Data Mining)
![Page 59: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/59.jpg)
趨勢科技雲端防毒 Case Study
![Page 60: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/60.jpg)
Collaboration in the underground
![Page 61: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/61.jpg)
網路威脅呈現爆炸性的成長
各式各樣的變種病毒、垃圾郵件、不明的下載來源等等,這些來自網路上的威脅,躲過傳統安全防護系統的偵測,一直持續呈現爆炸性的成長,形成嚴重的資安威脅
New Unique Malware Discovered
1M
unique
Malwares
every
month
![Page 62: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/62.jpg)
![Page 63: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/63.jpg)
New Design Concept for Threat Intelligence
Web Crawler
Trend Micro Endpoint Protection
Trend Micro Mail Protection
Trend Micro Web Protection
Honeypot
CDN / xSP Human Intelligence
150M+ Worldwide Endpoints/Sensors
![Page 64: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/64.jpg)
Challenges We Are Faced
The Concept is Great but …. 6TB of data and 15B lines of logs received daily by
It becomes the Big Data Challenge!
![Page 65: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/65.jpg)
Raw Data Information Threat
Intelligence/Solution
Volume: Infinite Time: No Delay Target: Keep Changing Threats
Issues to Address
![Page 66: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/66.jpg)
SPN Feedback
Log Receiver
L4
Message Bus
Log Receiver
HBase MapReduce
Hadoop Distributed File System (HDFS)
CDN Log
SPAM
HTTP POST
Feedback Information
SPN High Level Architecture
Log Post Processing
Log Post Processing
L4
Email Reputation Service
SPN
infrastru
cture
A
pp
lication
Web Reputation Service
File Reputation Service
Adhoc-Query (Pig)
Circus (Ambari)
Log Post Processing
![Page 67: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/67.jpg)
Trend Micro Big Data process capacity
雲端防毒每日需要處理的資料量 • 85 億個 Web Reputation 查詢
• 30 億個 Email Reputation查詢
• 70 億個 File Reputation 查詢
• 處理 6 TB 從全世界收集到的 raw logs
• 來自1.5億台終端裝置的連線
![Page 68: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/68.jpg)
Trend Micro: Web Reputation Services
User Traffic | Honeypot
Akamai
Rating Server for Known Threats
Unknown & Prefilter
Page Download
Threat Analysis
8 billions/day
4.8 billions/day
860 millions/day
40% filtered
82% filtered
25,000 malicious URL /day
99.98% filtered
Trend Micro Products / Technology
CDN Cache
High Throughput Web Service
Hadoop Cluster
Web Crawling
Machine Learning Data Mining
Technology Process Operation
Block malicious URL within 15 minutes once it goes online!
15
Min
utes
![Page 69: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/69.jpg)
Big Data Cases
![Page 70: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/70.jpg)
Google vs. Hadoop Ecosystem
Chubby vs. Zookeeper
MapReduce vs. MapReduce
BigTable vs. HBase
GFS vs. HDFS
Ref: Google Cluster
![Page 71: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/71.jpg)
Pioneer of Big Data Infrastructure – Google
![Page 72: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/72.jpg)
Hbase use Case@Facebook - Messages HBase Use Cases @ Facebook
Messages
Facebook Insights
Self-service
Hashout
Operational Data Store
More Analytics/Hashout apps
Site Integrity
2010 2011 2012 2013
Social Graph Search Indexing Realtime Hive Updates Cross-system Tracing … and more
![Page 73: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/73.jpg)
Flagship App:Facebook Messages
• Monthly data volume prior to launch
Monthly data volume prior to launch
15B x 1,024 bytes = 14TB
120B x 100 bytes = 11TB
![Page 74: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/74.jpg)
Facebook Messages Now
• Quick Stats
– 11B+ messages/day
• 90B+ data accesses
• Peak:1.5M ops/sec
• ~55% Read, 45% Write
– 20PB+ of total data
• Grows 400TB/month
Facebook Messages NOW
Emails
Chats
SMS
Messages Quick Stats • 11B+ messages/day
• 90B+ data accesses • Peak: 1.5M ops/sec • ~55%Rd, 45% Wr
• 20PB+ of total data • Grows 400TB/month
![Page 75: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/75.jpg)
Facebook Messages:Requirements
• Very High Write Volume
– Previously, chat was not persisted to disk
• Ever-growing data sets(Old data rarely gets accessed)
• Elasticity & automatic failover
• Strong consistency within a single data center
• Large scans/map-reduce support for migrations & schema conversions
• Bulk import data
![Page 76: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/76.jpg)
Physical Multi-tenancy
• Real-time Ads Insights
– Real-time analytics for social plugins on top of Hbase
– Publishers get real-time distribution/engagement metrics:
• # of impressions, likes
• analytics by domain/URL/demographics and time periods
– Uses HBase capabilities:
• Efficient counters (single-RPC increments)
• TTL for purging old data
– Needs massive write throughput & low latencies
• Billions of URLs
• Millions of counter increments/second
• Operational Data Store
![Page 77: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/77.jpg)
Facebook Open Source Stack
• Memcached --> App Server Cache
• ZooKeeper --> Small Data Coordination Service
• HBase --> Database Storage Engine
• HDFS --> Distributed FileSystem
• Hadoop --> Asynchronous Map-Reduce Jobs
![Page 78: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/78.jpg)
![Page 79: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/79.jpg)
Questions?
![Page 80: Zh tw cloud computing era](https://reader033.fdocument.pub/reader033/viewer/2022051110/54b7aca64a795993718b49ca/html5/thumbnails/80.jpg)
Thank you!