001 hbase introduction
![Page 1: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/1.jpg)
HBase Introduction
Scott Miao
2012/06/25
![Page 2: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/2.jpg)
Agenda
• Course Credit
• One common web site story
• Why RDB not affordable?
• Big Data
• Why use noSQL?
• HBase Introduction
• Hands-on
• noSQL architecture common practices
• Case study
![Page 3: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/3.jpg)
A web site's story (1/3)
• An RDBMS is the natural choice for the persistence tier
  • It handles transactions (ACID), enforces integrity constraints, offers the standard SQL language, and even provides stored procedures
• The first time your user base keeps growing…
  • Use a cluster of AP servers that share one DB server
• The second time your user base keeps growing…
  • Split the DB server into a Master-Slave architecture
    • Read data from the Slave servers
    • Write data to the Master server
HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1
![Page 4: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/4.jpg)
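A web site's story (2/3)
• The third time your user base keeps growing…
  • For the read bottleneck
    • Add a cache between the server program and the DB, for example Memcached (an in-memory DB)
    • But the cache and the DB can easily become inconsistent
  • For the write bottleneck
    • Upgrade the DB server's hardware (CPU, memory, disk, etc.; vertical scaling)
    • Don't forget: the Slave servers have to be upgraded along with it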
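The consistency risk between a cache and the DB can be sketched as follows. This is a minimal in-memory illustration, not Memcached itself: `db` and `cache` are plain dictionaries, and the function names are hypothetical.

```python
# Toy cache-aside pattern: dicts stand in for the DB and for Memcached.
db = {"user:1": "Alice"}
cache = {}

def read(key):
    """Serve from the cache when possible; otherwise load from the DB and fill the cache."""
    if key not in cache:
        cache[key] = db[key]
    return cache[key]

def write_without_invalidation(key, value):
    """Update only the DB; the cached copy is left untouched (the risky path)."""
    db[key] = value

def write_with_invalidation(key, value):
    """Update the DB, then drop the cached copy so the next read refetches it."""
    db[key] = value
    cache.pop(key, None)

first = read("user:1")                       # cache miss, loads "Alice"
write_without_invalidation("user:1", "Bob")
stale = read("user:1")                       # still "Alice": cache and DB disagree
write_with_invalidation("user:1", "Carol")
fresh = read("user:1")                       # cache refilled from the DB
print(first, stale, fresh)
```

Invalidating on write is only one possible remedy; expiry times or write-through caching are alternatives with different trade-offs.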
![Page 5: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/5.jpg)
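A web site's story (3/3)
• The fourth time your user base keeps growing…
  • Use database sharding
    • Move from vertical scaling to horizontal scaling
    • The management nightmare begins
    • An RDBMS is inherently ill-suited to distributed storage (ACID, fixed schema)
    • The DBA has to define a set of sharding rules
      • When one DB server goes down or runs out of storage, resharding has to be done by hand
      • Resharding means adjusting the sharding rules and then copying and migrating data with heavy I/O, while either keeping the site in service or taking it offline for a bounded period
    • This is usually a last resort taken after the fact, and one of very few options available
    • Who knew my web site would become this popular?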
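Why resharding implies so much data movement can be sketched with a toy sharding rule. The rule below (CRC32 hash modulo the number of servers) is an assumption chosen for illustration; the slides do not say which rule a DBA would pick, and the key names are made up.

```python
# Toy sharding rule: shard = stable_hash(key) % number_of_shards.
import zlib

def shard_for(key, num_shards):
    """Map a key to a shard index with a stable hash (CRC32) and modulo."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

def moved_keys(keys, old_shards, new_shards):
    """Return the keys whose shard changes when the shard count changes."""
    return [k for k in keys
            if shard_for(k, old_shards) != shard_for(k, new_shards)]

keys = [f"user-{i}" for i in range(10_000)]
moved = moved_keys(keys, old_shards=4, new_shards=5)
print(f"{len(moved)} of {len(keys)} keys must be copied to another server")
```

With a plain modulo rule, adding a single server relocates a large share of the keys, which is the copy-and-migrate work described above.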
![Page 6: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/6.jpg)
Why RDB not affordable? (1/6)
• Bottleneck of relational DBs
  • 90s vs. recent years (Web 2.0)
• Memcached + MySQL
  • Mitigates read stress effectively, but not write stress
• MySQL cluster solutions
  • Master/Slave
    • Not affordable for high-concurrency scenarios
  • Vertical partitioning
  • Vertical/horizontal partitioning (database sharding)
    • Complex
    • Hard to scale out and to adapt to changing requirements
    • Low availability
• Some kinds of simple but very large data cause this condition
http://www.infoq.com/cn/news/2011/01/nosql-why
![Page 7: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/7.jpg)
Why RDB not affordable ? (2/6) – A general HA system architecture design
Software Project Quality, Part 4: Overall Design, an Architecture Design Case - http://takeshi-experience.blogspot.tw/2012/04/blog-post.html
![Page 8: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/8.jpg)
Why RDB not affordable ? (3/6) – Master/Slave
![Page 9: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/9.jpg)
Why RDB not affordable ? (4/6) – Vertical Partitioning
![Page 10: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/10.jpg)
Why RDB not affordable ? (5/6) – Master/Slave + Vertical Partitioning
![Page 11: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/11.jpg)
Why RDB not affordable ? (6/6) – Vertical/Horizontal Partitioning
![Page 12: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/12.jpg)
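The Big Data Era!
• The amount of data generated in the past 3 years exceeds the amount created in the previous 40,000 years!
• Walmart holds 167 times as much data as the US Library of Congress!
• eBay's analytics platform processes up to 100 PB of data per day! (about 100,000,000 GB)
• As of 2010, the world's stored electronic data totaled 1.2 ZB! (1,200,000 PB)
• IDC forecasts that by 2020 the world's stored electronic data will grow to 44 times the 2009 level, reaching 35 trillion GB!
  • 35,000,000,000,000 gigabytes
Architect (架构师), October issue - http://www.infoq.com/cn/minibooks/architect-oct-10-2011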
![Page 13: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/13.jpg)
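Trend Micro’s problem
• Each person visits about 20 to 60 HTML pages per day
• Each HTML page contains about 15 to 30 URIs
• Each URI object is about 10 to 150 KB
• For one million users:
  • 1 million × 20 = 20 million HTML pages
  • 20 million HTML pages × 15 = 300 million URIs
  • 300 million URI objects × 10 KB = 3,000,000,000 KB (3 TB)
• The figures above cover the Taiwan region alone
• Trend Micro is a global company
  • So the daily data volume is in the tens of TB
Trend Micro's Journey of Cloud Discovery (趨勢的雲端發現之旅) - http://findbook.tw/book/9789866126185/basic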
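The estimate can be checked with a few lines of arithmetic. This sketch takes the lower bound of each range quoted on the slide and assumes decimal units (1 TB = 10^9 KB); the variable names are only for illustration.

```python
# Lower-bound estimate, using the minimum of each range on the slide.
users = 1_000_000
pages_per_user = 20      # 20 to 60 HTML pages per user per day
uris_per_page = 15       # 15 to 30 URIs per page
kb_per_uri = 10          # 10 to 150 KB per URI object

pages = users * pages_per_user
uris = pages * uris_per_page
total_kb = uris * kb_per_uri
total_tb = total_kb / 1000**3    # decimal units: 1 TB = 10^9 KB

print(pages, uris, total_kb, total_tb)
```

Taking the upper bound of each range instead (60 pages, 30 URIs, 150 KB) gives a far larger figure, so 3 TB per day is the floor of the estimate for one million users.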
![Page 14: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/14.jpg)
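The new favorite of the big data era: noSQL
• Not only SQL
• Started in 2009
• Characteristics
  • Does not use the relational data model
  • Distributed storage by design
  • Easy to scale horizontally
  • Open source
  • Easy to extend
  • Simple API operations (CRUD, usually without SQL support)
  • CAP (as opposed to ACID)
    • Eventual consistency, availability, partition tolerance
  • Stores huge volumes of heterogeneous data
http://nosql-database.org/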
![Page 15: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/15.jpg)
Why use noSQL?
• Easy to scale out
  • Unlike an RDB, there are no relationships between records, so it is easy to scale out
• High performance even with big data
  • Table-level cache (RDB) vs. record-level cache (noSQL)
• Elastic data model
  • Schema vs. schema-less/dynamic schema
• High availability
  • Easy to add new machines (nodes) without any performance impact
![Page 16: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/16.jpg)
Comparison between RDB and noSQL
If given a really huge amount of data…

| Aspects | RDB | noSQL |
| --- | --- | --- |
| Performance | Gets lower | Sustained, as with a small amount of data |
| Scalability | Mainly for scale up | Mainly for scale out |
| Reliability | ACID | CAP |
| Availability | Hard to maintain SLA | Easy to maintain SLA |
| Security | Robust | Depends |
| Economics | High-end machines | Commodity machines |
| Data Model | Relational, fixed schema | Depends, but more likely simple and schema-less |
| Maturity | Very mature | Not mature, various products |
| Commercial support | Global companies | Small start-ups |
| OLAP/BI | Mature | Immature |
| Human resource | Easy to find | Hard to find |
![Page 17: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/17.jpg)
noSQL basic categories
iTcloud: The New Cloud Era - http://www.ithome.com.tw/002/cloud/cloud.html
![Page 18: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/18.jpg)
Apache HBase Introduction
• A top-level project of the ASF
• Belongs to the Key-Value type of noSQL DB
• Derived from Google's
  • Bigtable: A Distributed Storage System for Structured Data
  • "a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers"
  • "a sparse, distributed, persistent multi-dimensional sorted map"
HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1
![Page 19: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/19.jpg)
Apache HBase Concepts – Column-Oriented (1/2)
http://ofps.oreilly.com/titles/9781449396107/intro.html
![Page 20: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/20.jpg)
Apache HBase Concepts – Column-Oriented (2/2)
• A sparse, distributed, persistent multi-dimensional sorted map
  • Indexed by a row key, a column key (column family + qualifier), and a timestamp
Figure label: Column Families
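The "multi-dimensional sorted map" can be pictured with a small in-memory model. This is only a sketch of the idea, not HBase's storage engine: `ToyTable` and its methods are hypothetical names, row keys are plain strings rather than byte arrays, and the column key is written as `family:qualifier`.

```python
# Toy model of the map: (row key, column key, timestamp) -> value.
import bisect

class ToyTable:
    def __init__(self):
        self._keys = []      # sorted list of (row, column, -timestamp)
        self._cells = {}     # same tuple -> value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)   # negate so newer versions sort first
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        for row, column, neg_ts in self._keys[i:]:
            if row >= stop_row:
                break
            yield row, column, -neg_ts, self._cells[(row, column, neg_ts)]

t = ToyTable()
t.put("r1", "f1:c1", 1, "old")
t.put("r1", "f1:c1", 2, "new")   # an update is a put with a newer timestamp
t.put("r2", "f2:c9", 1, "x")     # sparse: r2 has no f1:c1 cell at all
print(t.get("r1", "f1:c1"))
print(list(t.scan("r1", "r2")))
```

Keeping the keys sorted is what makes a range scan over adjacent rows cheap, and storing only the cells that exist is what "sparse" refers to.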
![Page 21: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/21.jpg)
Apache HBase Concepts - Architecture
http://ofps.oreilly.com/titles/9781449396107/architecture.html
![Page 22: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/22.jpg)
Hands-on (1/3) – Use your VM (Virtual Machine) to install tm-puppet
• Please refer to the SPN Dev hbase training program again
• Install git on your PC
• Install tm-puppet on your VM
![Page 23: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/23.jpg)
Hands-on (2/3) – Use HBase shell
• Basic operations
  • help, list, scan
• Create
  • A table 'MY_FIRST_TABLE'
  • Two column families 'FAM_1', 'FAM_2'
  • Ex.
    • create 't1', {NAME => 'f1'}, {NAME => 'f2'}
    • create 't1', 'f1', 'f2'
• Put two records (columns)
  • Ex. put 't1', 'r1', 'c1', 'value'
• Update a record (column) (it is also a put)
• Delete a record (column)
  • delete 't1', 'r1', 'c1'
![Page 24: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/24.jpg)
Hands-on (3/3) – Requirements
• Put a screenshot of your successfully installed tm-puppet into git
  • Use the following commands
    • jps
    • ifconfig
  • Capture the screen as an image
  • Path: ${git_home}/hbase-training/001/hands-on/${your_name}/hands-on-001.jpg
• Put a screenshot of your hbase shell records into git
  • Use the following commands
    • scan 'MY_TEST_TABLE'
    • ifconfig
  • Capture the screen as an image
  • Path: ${git_home}/hbase-training/001/hands-on/${your_name}/hands-on-002.jpg
• Commit and push to git
![Page 25: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/25.jpg)
noSQL architecture practices (1/8) – Use noSQL as complement
• Use noSQL as a mirror (implemented by code)
  • The RDB is still the primary store, and noSQL acts as a mirror
NoSQL Architecture Practice (1): NoSQL as a Complement - http://www.infoq.com/cn/news/2011/02/nosql-architecture-practice
![Page 26: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/26.jpg)
noSQL architecture practices (2/8) – Use noSQL as complement

    // PSEUDO CODE for noSQL as a mirror
    // We want to store the data object
    bool status = false;
    DB.startTransaction();              // start transaction
    id = DB.Insert(data);               // write data object to RDB
    if (id > 0) {
        status = NoSQL.Add(id, data);   // write data object to noSQL by id
    }
    if (id > 0 && status == true) {
        DB.commit();                    // commit transaction
    } else {
        DB.rollback();                  // failed, roll back transaction
    }
![Page 27: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/27.jpg)
noSQL architecture practices (3/8) – Use noSQL as complement
• Use noSQL as a mirror (implemented by synchronization)
![Page 28: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/28.jpg)
noSQL architecture practices (4/8) – Use noSQL as complement
• Combine RDB & noSQL
![Page 29: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/29.jpg)
noSQL architecture practices (5/8) – Use noSQL as complement

    // PSEUDO CODE for RDB & noSQL combination
    // We want to store the data object
    data.title = "title";
    data.name = "name";
    data.time = "2009-12-01 10:10:01";
    data.from = "1";
    bool status = false;
    DB.startTransaction();              // start transaction
    // write into RDB; data.from is a value needed by the search criteria,
    // passed as a bound parameter rather than embedded in the SQL string
    id = DB.Insert("INSERT INTO table (from) VALUES (?)", data.from);
    if (id > 0) {
        status = NoSQL.Add(id, data);   // write data object to noSQL by id
    }
    if (id > 0 && status == true) {
        DB.commit();                    // commit transaction
    } else {
        DB.rollback();                  // failed, roll back transaction
    }
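A runnable version of the RDB & noSQL combination might look like the sketch below. It is an illustration under stated assumptions, not the author's implementation: SQLite (Python's standard `sqlite3` module) stands in for the RDB, the hypothetical `ToyKeyValueStore` class stands in for a noSQL client, and the table `items` with its column `source` are invented names (`from` is a reserved word in SQL).

```python
import sqlite3

class ToyKeyValueStore:
    """Hypothetical stand-in for a noSQL client; add() returns True on success."""
    def __init__(self, fail=False):
        self.data = {}
        self.fail = fail

    def add(self, key, value):
        if self.fail:
            return False
        self.data[key] = value
        return True

def save(conn, nosql, record):
    """Store the searchable column in the RDB and the full record in noSQL."""
    cur = conn.cursor()
    try:
        # Parameterized insert: only the field needed for searching goes to the RDB.
        cur.execute("INSERT INTO items (source) VALUES (?)", (record["from"],))
        row_id = cur.lastrowid
        if not nosql.add(row_id, record):
            raise RuntimeError("noSQL write failed")
        conn.commit()
        return row_id
    except Exception:
        conn.rollback()   # undo the RDB insert so both stores stay in step
        return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, source TEXT)")
conn.commit()

record = {"title": "title", "name": "name",
          "time": "2009-12-01 10:10:01", "from": "1"}
ok_store, bad_store = ToyKeyValueStore(), ToyKeyValueStore(fail=True)
ok_id = save(conn, ok_store, record)
bad_id = save(conn, bad_store, record)
row_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(ok_id, bad_id, row_count)
```

The RDB row holds only the key and the searchable value, while the full object lives in the key-value store under the same id. A failed noSQL write rolls the RDB insert back; a failure after the noSQL write succeeds would still need separate cleanup, which this sketch does not attempt.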
![Page 30: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/30.jpg)
noSQL architecture practices (6/8) – Use noSQL as complement
• What benefits can we get from combining RDB & noSQL?
  • Decreases the I/O of the RDB and saves RDB storage space
  • Increases the RDB table-level cache hit rate: only updates to key values (PK, FK, search-criteria-related values) refresh the cache
  • Makes synchronization more efficient in an RDB Master/Slave architecture
  • Makes RDB backup/recovery more efficient
  • Increases the scalability/performance of the whole system
![Page 31: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/31.jpg)
noSQL architecture practices (7/8) – Use noSQL as master
• Use noSQL only
  • Mainly for systems with simple query requirements
  • But some noSQL products can fulfill more complex queries
    • MongoDB, Tokyo Cabinet, etc.
NoSQL Architecture Practice (2): NoSQL as the Primary Store - http://www.infoq.com/cn/news/2011/03/nosql-architecture-practice-2
![Page 32: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/32.jpg)
noSQL architecture practices (8/8) – Use noSQL as master
• Use noSQL as the major data source
  • APs write data only into noSQL
  • The data is then synchronized from noSQL to other data stores, depending on what each application needs
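The "write to noSQL, then synchronize elsewhere" flow can be sketched with in-memory structures. Everything here is hypothetical and simplified: dictionaries stand in for the noSQL store and for a secondary search index, and the change log is a plain list rather than a real replication mechanism.

```python
# Toy "noSQL as master" flow: the application writes to one primary store,
# and a separate step replays its change log into a secondary store.
primary = {}          # stand-in for the noSQL store
change_log = []       # ordered record of written keys, waiting to be replayed
search_index = {}     # hypothetical secondary store: word -> set of keys

def write(key, text):
    """Application path: write to the primary store only."""
    primary[key] = text
    change_log.append(key)

def synchronize():
    """Background path: push pending changes into the secondary store."""
    while change_log:
        key = change_log.pop(0)
        for word in primary[key].split():
            search_index.setdefault(word, set()).add(key)

write("doc1", "hbase intro")
write("doc2", "hbase shell")
before = dict(search_index)   # still empty: synchronization has not run yet
synchronize()
print(before, search_index)
```

Until `synchronize()` runs, the secondary store lags behind the primary one, so readers of the secondary store see the data only after a delay.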
![Page 33: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/33.jpg)
Case Study (1/4) – Facebook’s Real-time Message System
• Uses HBase to store 135+ billion messages a month
  • HBase beat a few competitors such as Cassandra, MySQL sharding, etc.
• Data patterns
  • A short set of temporal data that tends to be volatile
  • An ever-growing set of data that rarely gets accessed
Facebook's New Real-time Messaging System: HBase to Store 135+ Billion Messages a Month - http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
![Page 34: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/34.jpg)
Case Study (2/4) – Facebook’s Real-time Message System
• Some key aspects of their system:
  • HBase
    • Has a simpler consistency model than Cassandra.
    • Very good scalability and performance for their data patterns.
    • Most feature-rich for their requirements: auto load balancing and failover, compression support, multiple shards per server, etc.
    • HDFS, the filesystem used by HBase, supports replication, end-to-end checksums, and automatic rebalancing.
    • Facebook's operational teams have a lot of experience using HDFS because Facebook is a big user of Hadoop and Hadoop uses HDFS as its distributed file system.
![Page 35: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/35.jpg)
Case Study (3/4) – Facebook’s Real-time Message System
• Haystack is used to store attachments.
• A custom application server was written from scratch in order to service the massive inflows of messages from many different sources.
• A user discovery service was written on top of ZooKeeper.
• Infrastructure services are accessed for: email account verification, friend relationships, privacy decisions, and delivery decisions.
• In keeping with their small-teams-doing-amazing-things approach, 20 new infrastructure services are being released by 15 engineers in one year.
• Facebook is not going to standardize on a single database platform; they will use separate platforms for separate tasks.
![Page 36: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/36.jpg)
Case Study (4/4) – Alibaba China Site architecture
http://www.infoq.com/cn/presentations/hl-alibaba-cn-architecture-design-practice
![Page 37: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/37.jpg)
![Page 38: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/38.jpg)
Data access pattern as the key for noSQL
• Data structure
  • Structured
  • Semi-structured
  • Unstructured
• Size
  • How many and how often writes/reads (proportion)
• Data writing
  • Transaction
• Data reading
  • Random access
  • Sequential access
  • Relationship
![Page 39: 001 hbase introduction](https://reader036.fdocument.pub/reader036/viewer/2022062703/554f8aa5b4c9052a518b5111/html5/thumbnails/39.jpg)
Q & A