Bigdata: Big Data Analytics in Practice (Advanced Hands-on Lab)
Outline
• Another operating system: Linux
• Starting Hadoop
• Using a distributed storage system: HDFS
• Receiving a continuous stream of data: Flume
• Using a distributed computing system: MapReduce
• Using off-the-shelf tools for classification and recommendation: Mahout
Basic Linux commands: ls, cp
• Copy files: cp
• List files: ls
http://linux.vbird.org/linux_basic/0220filemanager/0220filemanager-fc4.php#

Basic Linux commands: mv, rm
• Move or rename files: mv
• Delete files: rm
http://linux.vbird.org/linux_basic/0220filemanager/0220filemanager-fc4.php

Basic Linux commands: cat, mkdir
• Create a directory: mkdir
• View file contents: cat
http://linux.vbird.org/linux_basic/0220filemanager/0220filemanager-fc4.php
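A short session tying the six commands above together (the directory and file names are illustrative only):

```shell
# Create a working directory and a sample file
mkdir demo
echo "hello linux" > demo/notes.txt

# Copy the file, then list the directory
cp demo/notes.txt demo/notes.bak
ls demo

# Rename (move) the copy, then view its contents
mv demo/notes.bak demo/notes.old
cat demo/notes.old

# Remove the files, then the now-empty directory
rm demo/notes.txt demo/notes.old
rmdir demo
```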
The Vim text editor
• Run "vi filename" to open the file in normal (command) mode
• Press i to enter insert mode and start editing text
• Press [ESC] to return to normal mode
• Press : to enter command-line mode, then save (w) and quit (q) the vi session
http://linux.vbird.org/linux_basic/0310vi.php#vi
Hadoop system architecture
• Master/slave architecture
  – NameNode, DataNode
  – ResourceManager, NodeManager
[Diagram: the master node runs the NameNode (NN) and ResourceManager (RM); slave1 and slave2 each run a DataNode (DN) and NodeManager (NM)]
http://www.ewdna.com/2013/04/Hadoop-HDFS-Comics.html
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
HDFS command-line operations
• Basic commands
  – hadoop fs -ls <file_in_hdfs>
  – hadoop fs -lsr <dir_in_hdfs>
  – hadoop fs -rm <file_in_hdfs>
  – hadoop fs -rmr <dir_in_hdfs>
  – hadoop fs -mkdir <dir_in_hdfs>
  – hadoop fs -cat <file_in_hdfs>
  – hadoop fs -get <file_in_hdfs> <file_in_local>
  – hadoop fs -put <file_in_local> <file_in_hdfs>
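Each of these subcommands mirrors an ordinary shell file operation. The sketch below replays the same flow against the local filesystem, so it can be tried without a cluster; the `fake_hdfs` directory and file names are invented for the illustration, and each comment shows the `hadoop fs` call being imitated:

```shell
# Local-filesystem analogy of the HDFS commands above (no Hadoop needed).
echo "sample data" > local.txt            # a file on the local disk
mkdir -p fake_hdfs/input                  # hadoop fs -mkdir input
cp local.txt fake_hdfs/input/             # hadoop fs -put local.txt input
ls -R fake_hdfs                           # hadoop fs -lsr fake_hdfs
cat fake_hdfs/input/local.txt             # hadoop fs -cat input/local.txt
cp fake_hdfs/input/local.txt fetched.txt  # hadoop fs -get input/local.txt .
rm -r fake_hdfs                           # hadoop fs -rmr fake_hdfs
```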
17
No programming needed: it runs automatically
• Just define a config file

#vim example
agent.sources = source1
agent.channels = channel1
agent.sinks = sink1
agent.sources.source1.type = spooldir
agent.sources.source1.channels = channel1
agent.sources.source1.spoolDir = /home/hadoop/flumedata
agent.sources.source1.fileHeader = false
agent.sinks.sink1.type=hdfs
agent.sinks.sink1.channel=channel1
agent.sinks.sink1.hdfs.path=hdfs://master:9000/user/hadoop
agent.sinks.sink1.hdfs.fileType=DataStream
agent.sinks.sink1.hdfs.writeFormat=TEXT
agent.sinks.sink1.hdfs.rollSize = 0
agent.sinks.sink1.hdfs.rollCount = 0
agent.sinks.sink1.hdfs.idleTimeout = 0
agent.channels.channel1.type = memory
agent.channels.channel1.capacity = 100
#cd ~/flume/conf
#flume-ng agent -n agent -c . -f ./example …
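What the spooldir source does can be sketched as a toy shell loop: every file dropped into the spool directory is delivered to the sink, then renamed with a .COMPLETED suffix (which is how Flume's spooling source actually marks ingested files). The local paths here merely stand in for the spoolDir and HDFS path from the config:

```shell
# Toy sketch of a spooldir source feeding a sink (assumed local paths).
SPOOL_DIR=./flumedata     # stands in for /home/hadoop/flumedata
SINK=./hdfs_sink.txt      # stands in for hdfs://master:9000/user/hadoop

mkdir -p "$SPOOL_DIR"
echo "event one" > "$SPOOL_DIR/a.log"
echo "event two" > "$SPOOL_DIR/b.log"

for f in "$SPOOL_DIR"/*.log; do
  cat "$f" >> "$SINK"       # deliver the file's contents to the sink
  mv "$f" "$f.COMPLETED"    # mark the file as ingested, as Flume does
done

cat "$SINK"
```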
1. The ResourceManager (RM) performs global resource allocation
2. Each NodeManager (NM) periodically reports its current resource usage
3. Each job has its own ApplicationMaster (AppMaster) that controls that job
4. Resource management is separated from job control
5. YARN is a general-purpose resource management system, so multiple frameworks can run on top of YARN
Step by Step
#vim wordcount.data
aaa bbb ccc ddd
bbb ccc ddd eee
# hadoop fs -mkdir mr.wordcount
# hadoop fs -put wordcount.data mr.wordcount
# hadoop fs -ls mr.wordcount
# hadoop jar MR-sample.jar org.nchc.train.mr.wordcount.WordCount mr.wordcount/wordcount.data output
...omit...
File Input Format Counters
  Bytes Read=32
File Output Format Counters
  Bytes Written=30
# hadoop fs -cat output/part-r-00000
aaa 1
bbb 2
ccc 2
ddd 2
eee 1
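The same counts can be checked locally with standard shell tools, which mirrors what the WordCount job computes: split words onto lines (map), sort by word (shuffle), then count each run (reduce):

```shell
# Recreate the input file and reproduce the WordCount result locally.
printf 'aaa bbb ccc ddd\nbbb ccc ddd eee\n' > wordcount.data

# tr splits words onto lines, sort groups them, uniq -c counts,
# awk reorders to "word count" to match the MapReduce output.
tr -s ' ' '\n' < wordcount.data | sort | uniq -c | awk '{print $2, $1}'
# Prints:
# aaa 1
# bbb 2
# ccc 2
# ddd 2
# eee 1
```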
Hands-on: classifying the data

        Chinese  Math
ID 1      0       10
ID 2     10        0
ID 3     10       10
ID 4     20       10
ID 5     10       20
ID 6     20       20
ID 7     50       60
ID 8     60       50
ID 9     60       60
ID 10    90       90
Step by Step
#vi clustering.data
0 10
10 0
10 10
20 10
10 20
20 20
50 60
60 50
60 60
90 90
# hadoop fs -mkdir testdata
# hadoop fs -put clustering.data testdata
# hadoop fs -ls -R testdata
-rw-r--r-- 3 root hdfs 288374 2014-02-05 21:53 testdata/clustering.data
# mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job -t1 3 -t2 2 -i testdata -o output
...omit...
14/09/08 01:31:07 INFO clustering.ClusterDumper: Wrote 3 clusters
14/09/08 01:31:07 INFO driver.MahoutDriver: Program took 104405 ms (Minutes: 1.7400833333333334)
#mahout clusterdump --input output/clusters-0-final --pointsDir output/clusteredPoints
C-0{n=1 c=[9.000, 9.000] r=[]}
  Weight : [props - optional]: Point:
  1.0: [9.000, 9.000]
C-1{n=2 c=[5.833, 5.583] r=[0.167, 0.083]}
  Weight : [props - optional]: Point:
  1.0: [5.000, 6.000]
  1.0: [6.000, 5.000]
  1.0: [6.000, 6.000]
C-2{n=4 c=[1.313, 1.333] r=[0.345, 0.527]}
  Weight : [props - optional]: Point:
  1.0: [1:1.000]
  1.0: [0:1.000]
  1.0: [1.000, 1.000]
  1.0: [2.000, 1.000]
  1.0: [1.000, 2.000]
  1.0: [2.000, 2.000]
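Canopy's full t1/t2 threshold logic is more involved, but the final step, assigning each point to its nearest cluster center, can be sketched with awk. The three centers below are round numbers assumed for this data set, not the exact centers Mahout computed:

```shell
# Assign each 2-D point to the nearest of three assumed centers
# (10,10), (55,57) and (90,90) by squared Euclidean distance,
# then report how many points each cluster received.
printf '0 10\n10 0\n10 10\n20 10\n10 20\n20 20\n50 60\n60 50\n60 60\n90 90\n' > clustering.data

awk '
BEGIN { cx[0]=10; cy[0]=10; cx[1]=55; cy[1]=57; cx[2]=90; cy[2]=90 }
{
  best = 0; bestd = -1
  for (k = 0; k < 3; k++) {
    d = ($1-cx[k])^2 + ($2-cy[k])^2   # squared distance to center k
    if (bestd < 0 || d < bestd) { bestd = d; best = k }
  }
  n[best]++                           # count the point in its cluster
}
END { for (k = 0; k < 3; k++) print "cluster", k, "size", n[k] }
' clustering.data
# Prints:
# cluster 0 size 6
# cluster 1 size 3
# cluster 2 size 1
```

The resulting sizes (6, 3, 1) match the point listings in the clusterdump output above.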
         book-a  book-b  book-c
User 1     5       4       5
User 2     4       5       4
User 3     5       4      4~5
User 4     1       2      1~2
User 5     2       1       1
How the recommender system works
         book-a  book-b  book-c
User 1     5       4       5
User 2     4       5       4
User 3     5       4
User 4     1       2
User 5     2       1       1
Step by Step
#vi recom.data
1,1,5
1,2,4
1,3,5
2,1,4
2,2,5
2,3,4
3,1,5
3,2,4
4,1,1
4,2,2
5,1,2
5,2,1
5,3,1
# hadoop fs -mkdir testdata
# hadoop fs -put recom.data testdata
# hadoop fs -ls -R testdata
-rw-r--r-- 3 root hdfs 288374 2014-02-05 21:53 testdata/recom.data
# mahout recommenditembased -s SIMILARITY_EUCLIDEAN_DISTANCE -i testdata -o output
...omit…
File Input Format Counters
  Bytes Read=287
File Output Format Counters
  Bytes Written=32
14/09/04 05:46:56 INFO driver.MahoutDriver: Program took 434965 ms (Minutes: 7.249416666666667)
# hadoop fs -cat output/part-r-00000
3 [3:4.4787264]
4 [3:1.5212735]
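Behind recommenditembased, every item pair gets a similarity score computed over the users who rated both items. With Euclidean-distance similarity, one common form is 1/(1 + d); Mahout's exact formula additionally weights by the number of co-rating users, so treat this as an illustration of the idea, not the precise score. For book-a and book-c in recom.data above, the co-rating users are 1, 2 and 5:

```shell
# Similarity of book-a and book-c over the users who rated both.
# Ratings are hard-coded from recom.data: users 1, 2, 5.
awk '
BEGIN {
  a[1]=5; a[2]=4; a[5]=2   # book-a ratings by users 1, 2, 5
  c[1]=5; c[2]=4; c[5]=1   # book-c ratings by the same users
  ss = 0
  for (u in a) ss += (a[u]-c[u])^2            # squared differences
  printf "similarity(book-a, book-c) = %.3f\n", 1/(1+sqrt(ss))
}'
# Prints: similarity(book-a, book-c) = 0.500
```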
         book-a  book-b  book-c
User 1     5       4       5
User 2     4       5       4
User 3     5       4      4~5
User 4     1       2      1~2
User 5     2       1       1
Analysis results
# hadoop fs -cat output/part-r-00000
3 [3:4.4787264]
4 [3:1.5212735]
1. We predict that User 4 will not like book-c, so we would not recommend book-c to User 4
2. We predict that User 3 will like book-c, so we would recommend book-c to User 3
Try It!
book1 book2 book3 book4 book5 book6 book7 book8 book9
User 1: 3 2 1 5 5 1 3 1
User 2: 2 3 1 3 5 4 3
User 3: 1 2 3 3 2 1
User 4: 2 1 2 1 1 2
User 5: 3 3 1 3 2 2 3 3 2
User 6: 1 3 2 2 1
User 7: 4 4 1 5 1 3 3 4
Table of user ratings for books
Summary
• Virtual machine skill +1
• Linux skill +1
• HDFS skill +1
• Flume skill +1
• MapReduce skill +1
• Mahout clustering skill +1
• Mahout recommendation skill +1