Open the Window and Let the Elephant Step In - Microsoft HDInsight

description

Microsoft TechDays 2013 conference session

Transcript of Open the Window and Let the Elephant Step In - Microsoft HDInsight

  • 1. Profile: Java, XML/Web Services, Design Patterns, EJB/JPA, Java EE, Struts/Spring/Hibernate, Open Source Framework, JBoss AS, GlassFish Application Server, Apache Hadoop, Google App Engine, Cloudbees, jQuery Mobile, Node.js, Responsive Web Design, Mobile Web, iOS, Android, Smart Handheld

2. Agenda: 1 Apache Hadoop; 2 Microsoft HDInsight; 3 Software Installation; 4 Command-Line Interface; 5 Web Application; 6 Windows Integration; 7 Cloud Service; 8 Summary
3. Apache Hadoop (http://hadoop.apache.org/): Lucene; Nutch; Doug Cutting; Hadoop Distributed File System; MapReduce Programming Paradigm; Java; Java SE 6 JDK; Pig; Hive; ZooKeeper
4. Hadoop: Node; Scalability; Failure
5. HDFS: Self-Healing Distributed File System; Block; Disk; Replication; Reliability; API; RESTful Web Services Interface for HDFS (WebHDFS)
6. MapReduce: Computing Model; Job; Server; Task; Code; Data; Server; Network Traffic; Task
7. MapReduce
    Input Splitting: input -> (k1, v1)
    Mapper: (n, XXX) -> (XXX, 1), that is, (k1, v1) -> Mapper -> (k2, v2)
    Shuffling: n pairs (XXX, 1) -> (XXX, list of n ones), that is, List(k2, v2) grouped by k2 -> (k2, List(v2))
    Reducer: (XXX, list of n ones) -> (XXX, m), that is, Reducer -> List(k3, v3)
8. MapReduce (Word Count): http://www.rabidgremlin.com/data20/
9. Hadoop Ecosystem: http://hadoop.apache.org/
10. Hive (http://hive.apache.org/): Facebook; SQL Interface for Hadoop; HiveQL; MapReduce Job; HDFS; MapReduce Overhead; Real-Time Analysis; Excel; ODBC; Business Intelligence; Data Warehousing
11. Pig (http://pig.apache.org/): Yahoo!; Scripting Language for Hadoop; Pig Latin; Grunt Shell; Relation; MapReduce Job; ETL
12. Sqoop (http://sqoop.apache.org/): SQL to Hadoop; Cloudera; Apache; Get Data to/from SQL Database; RDBMS; Connector; JDBC Connection; MySQL; PostgreSQL; Fast Connector; MapReduce
13. Hadoop vs. RDBMS: Hadoop; RDBMS; Hadoop Archive Storage; Hadoop Data Preprocessing; Hadoop Data Source
14. Hadoop Distribution: Apache Hadoop; Cloudera Distribution for Apache Hadoop; Hortonworks Data Platform; MapR Distribution for Apache Hadoop; Intel Distribution for Apache Hadoop; Microsoft Windows Azure HDInsight Service
15.
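The map, shuffle, and reduce steps listed on the MapReduce slide can be illustrated without a cluster. What follows is a minimal in-memory sketch in plain Java, not Hadoop code; the class name `MiniMapReduce` and its method names are made up for this illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map -> shuffle -> reduce flow (no Hadoop involved).
class MiniMapReduce {

    // Map: one input line -> list of (k2, v2) pairs, here (word, 1).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group the values by key, List(k2, v2) -> (k2, List(v2)).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: (k2, List(v2)) -> v3, here the sum of the ones.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases over all input lines and returns word -> count.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(wordCount(Arrays.asList("to be or not to be", "to do")));
    }
}
```

In a real job the framework performs the shuffle across machines; here it is only a grouping step in one process.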
Windows; Hadoop; Linux: Hadoop Development/Production Platform; Win32 Development Platform; Distributed Operation; Win32; Win32 Production Platform; Microsoft Windows Azure; Hadoop
16. Agenda (repeated)
17. Microsoft Big Data Solution: http://www.windowsazure.com/en-us/solutions/big-data/
18. HDInsight: 2012-10 Developer Preview; Hortonworks Data Platform (HDP); Windows; HDP; Developer; Single Node Deployment
19. HDInsight: Windows Azure HDInsight Service (https://www.hadooponazure.com/); HDInsight for Windows Server (http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW)
20. Agenda (repeated)
21. HDInsight Server: 64-Bit Windows; Windows 7/8/8.1; Windows Server 2008 R2/2012; Web Platform Installer 4.6; Visual Studio 2012; Windows Azure SDK 2.1; IIS 8.0; ASP.NET and Web Frameworks 2012.2; PowerShell
22. VS 2012 with Azure SDK 2.1
23. HDInsight Server: IIS; Visual C++ 2010 SP1 (x64); .NET Framework 3.5; Python 2.7.3 (x32); Hortonworks Data Platform for Windows 1.0.1; Java SE 6 Development Kit Update 31; Microsoft HDInsight for Windows Server
24. HDInsight Server
25. HDInsight Command-Line Interface
26. HDInsight Web Application: http://localhost:8085/
27. Agenda (repeated)
28. HDInsight 1.5: Hadoop 1.1.0; Hive 0.9.0; Pig 0.9.3; Sqoop 1.4.2; Oozie 3.2.0
29.
Hadoop Daemon: start-onebox.cmd; start-all.sh
    Starting Hadoop Core services
    Starting Hadoop services
    Starting namenode
    Starting datanode
    Starting secondarynamenode
    Starting jobtracker
    Starting tasktracker
    Starting historyserver
    Starting Hive services
    Starting Hive services
    Starting hwi
    Starting derbyserver
    Starting metastore
    Starting hiveserver
    Starting hiveserver2
    Starting Oozie service
    Starting IsotopeJS services
    Starting isotopejs
  Autorun; Windows
30. HDInsight
31. HDInsight Hadoop Command Line: Windows; Hadoop
    C:\Windows\system32\cmd.exe /k pushd "c:\hadoop\hadoop-1.1.0-SNAPSHOT" && IF EXIST "c:\hadoop\GettingStarted\init.cmd" ( "c:\hadoop\GettingStarted\init.cmd" ) ELSE ( "c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd" )
32. Hadoop: Standalone Mode; Pseudo-Distributed Mode; Fully-Distributed Mode
33. HDInsight Server: Pseudo-Distributed Mode
    core-site.xml: fs.default.name = hdfs://localhost:8020
    hdfs-site.xml: dfs.replication = 1
    mapred-site.xml: mapred.job.tracker = localhost:50300
    hive-site.xml: hive.server2.http.port = 10001
    PS: Job Tracker Port = HDFS Port + 1
34. Hadoop Eclipse Plugin
35. Demo: Command-Line Interface
36. Netflix Dataset (1998-2005)
    1,2003,Dinosaur Planet
    2,2004,Isle of Man TT 2004 Review
    3,1997,Character
    4,1994,Paula Abdul's Get Up & Dance
    5,2004,The Rise and Fall of ECW
    6,1997,Sick
    7,1992,8 Man
    8,2004,What the #$*! Do We Know!?
    9,1991,Class of Nuke 'Em High 2
    10,2001,Fighter
    11,1999,Full Frame: Documentary Shorts
    12,1947,My Favorite Brunette
    13,2003,Lord of the Rings: The Return of the King: Extended Edition: Bonus Material
    14,1982,Nature: Antarctica
    15,1988,Neil Diamond: Greatest Hits Live
    16,1996,Screamers
    17,2005,7 Seconds
    18,1994,Immortal Beloved
    19,2000,By Dawn's Early Light
    20,1972,Seeta Aur Geeta
    ...
37.
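Each record in the title list above has three comma-separated fields: id, year, and title. A title can contain commas of its own, so a parser that splits with a limit of 3 keeps the title intact. This is a minimal plain-Java sketch under that assumption; the class `MovieTitle` and its `parse` method are hypothetical helpers and not part of the demo code.

```java
// Hypothetical helper: parses one "id,year,title" record of the title list.
class MovieTitle {
    final int id;
    final int year;
    final String title;

    MovieTitle(int id, int year, String title) {
        this.id = id;
        this.year = year;
        this.title = title;
    }

    // Splits into at most 3 fields so that commas inside the title are preserved.
    static MovieTitle parse(String line) {
        String[] f = line.trim().split(",", 3);
        if (f.length < 3) {
            throw new IllegalArgumentException("expected id,year,title: " + line);
        }
        // Assumes id and year are numeric; a non-numeric value raises NumberFormatException.
        return new MovieTitle(Integer.parseInt(f[0]), Integer.parseInt(f[1]), f[2]);
    }

    public static void main(String[] args) {
        MovieTitle a = parse("13,2003,Lord of the Rings: The Return of the King: Extended Edition: Bonus Material");
        System.out.println(a.id + " | " + a.year + " | " + a.title);
        // Made-up record (not from the dataset) to show a comma inside the title.
        MovieTitle b = parse("99,2000,Hypothetical Title, With Comma");
        System.out.println(b.title);
    }
}
```

By contrast, taking element `[2]` of an unlimited `split(",")` keeps only the part of the title before its first comma.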
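MapReduce Demo - Mapper
    package tw.techdays.hottestword;

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HWMapper extends Mapper<Object, Text, Text, IntWritable> {
        private Text word = new Text();
        private final static IntWritable one = new IntWritable(1);

        // Emits (word, 1) for every word in the title, the third comma-separated field.
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String title = value.toString().toLowerCase().trim().split(",")[2];
            String[] words = title.split("\\W");
            for (String str : words) {
                if (!str.equals("")) {
                    word.set(str);
                    context.write(word, one);
                }
            }
        }
    }
38. MapReduce Demo - Reducer
    package tw.techdays.hottestword;

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class HWReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Sums the counts emitted for each word.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
39. MapReduce Demo - Job Driver
    package tw.techdays.hottestword;

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class HWDriver {
        public static void main(String[] args)
                throws IOException, InterruptedException, ClassNotFoundException {
            Configuration conf = new Configuration();
            // The transcript uses otherArgs without defining it; obtaining it from
            // GenericOptionsParser is an assumption.
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            Job job = new Job(conf, "Hottest Word");
            job.setJarByClass(tw.techdays.hottestword.HWDriver.class);
            job.setMapperClass(tw.techdays.hottestword.HWMapper.class);
            job.setReducerClass(tw.techdays.hottestword.HWReducer.class);

            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
40.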
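The reducer writes one count per word, and with Hadoop's default text output each result line has the form `word<TAB>count`. Finding the single hottest word is therefore a separate step after the job. One possible way to do it is sketched below in plain Java; the class `HottestWordPicker` is hypothetical and the sample lines are made up rather than real job output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing step: picks the most frequent word from
// "word<TAB>count" lines such as those found in a part-r-00000 file.
class HottestWordPicker {

    // Returns the word with the highest count, or null if no valid line exists.
    // On a tie the first word encountered wins.
    static String hottest(List<String> lines) {
        String best = null;
        long bestCount = -1;
        for (String line : lines) {
            String[] f = line.split("\t");
            if (f.length != 2) {
                continue; // skip lines that are not word<TAB>count
            }
            long count = Long.parseLong(f[1].trim());
            if (count > bestCount) {
                bestCount = count;
                best = f[0];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not actual job output.
        List<String> sample = Arrays.asList("dance\t1", "of\t4", "the\t3");
        System.out.println(hottest(sample)); // prints of
    }
}
```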
MapReduce Demo (Hadoop Command Line)
    mkdir C:\Hadoop\data
    mkdir C:\Hadoop\data\movies
    (copy movie_titles.txt to C:\Hadoop\data\movies)
    hadoop jar C:/Hadoop/workspace/HottestWord.jar file:///C:/Hadoop/data/movies file:///C:/Hadoop/data/hottest-word

    hadoop fs -mkdir movies
    hadoop fs -put C:\Hadoop\data\movies\movie_titles.txt movies
    hadoop jar HottestWord.jar movies hottest-word-hdfs

    hadoop fs -cat hottest-word-hdfs/part-r-00000
    hadoop fs -rmr -skipTrash hottest-word-hdfs
41. Hive: http://www.orzota.com/hive-for-beginners/
    Hadoop Command Line:
    hadoop fs -mkdir /tmp
    hadoop fs -mkdir /user/hive/warehouse
    hadoop fs -chmod g+w /tmp
    hadoop fs -chmod g+w /user/hive/warehouse
  Command Line; Web
42. Hive Demo: hive -e "statement"; hive -f sample.hql
    Hadoop Command Line:
    hive
    hive> create external table movie_data(id int, year int, title string)
          row format delimited fields terminated by ',' lines terminated by '\n';
    hive> show tables;
    hive> load data local inpath 'movies' overwrite into table movie_data;
    hive> select * from movie_data;
    hive> select year, count(*) from movie_data group by year;
    hive> select title, size(split(title, ' ')) from movie_data where year = 2005;
    hive> select size(split(title, ' ')), count(*) from movie_data
          group by size(split(title, ' '));
    hive> insert overwrite directory 'title-length'
          select size(split(title, ' ')), count(*) from movie_data
          group by size(split(title, ' '));
    hive> quit;

    hadoop fs -ls title-length
    hadoop fs -cat title-length/000000_0
    hadoop fs -rmr -skipTrash title-length
43. Hive: http://stackoverflow.com/questions/16459790/hive-insert-overwrite-directory-command-output-is-not-separated-by-a-delimiter
  Hive Directory; Delimiter \001; Unix; Table; Directory; concat_ws Function
    hive> insert overwrite directory 'title-length'
          select concat_ws(',', cast(size(split(title, ' ')) as string), cast(count(*) as string))
          from movie_data
          group by size(split(title, ' '));
44.
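The Hive slide above points out that `insert overwrite directory` separates fields with the \001 (Ctrl-A) character, and works around it with `concat_ws` in the query. If the files have already been written, they could also be converted afterwards. The following is a minimal plain-Java sketch of that alternative, assuming no field itself contains a comma; the class name `CtrlAToCsv` is hypothetical.

```java
// Hypothetical helper: converts one line of Ctrl-A separated Hive output to CSV.
class CtrlAToCsv {

    // Replaces the Ctrl-A (U+0001) field separator with a comma.
    // The limit of -1 keeps empty trailing fields.
    static String toCsv(String hiveLine) {
        return String.join(",", hiveLine.split("\u0001", -1));
    }

    public static void main(String[] args) {
        // A two-field line as the title-length query would produce it (values made up).
        System.out.println(toCsv("3\u00015")); // prints 3,5
    }
}
```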
Agenda (repeated)
45. HDInsight: local (hdfs) Cluster; http://localhost:8085/
46. Demo: Web Application
47. Create MapReduce Job: http://localhost:8085/14/WebHCat/CreateMapReduceJob
48. Create MapReduce Job: args[0] Class Name; args[1] -D; args[2] TITLE = Job Name; args[3] Job Arguments (ENTER)
49. Interactive JavaScript: http://localhost:8085/14/Cluster/InteractiveJS
50. Interactive JavaScript: http://social.msdn.microsoft.com/Forums/windowsazure/zh-tw/8eca1948-4036-4105-8c28-b0653c630013/hdinsight-interactive-javascript-ls-throwing-error
  Interactive JavaScript; HDInsight; start-onebox.cmd; Apache Hadoop; isotopejs; iexplore.exe Process; Chrome; http://localhost:8085; Interactive JavaScript; IE; Firefox
51. Agenda (repeated)
52. Microsoft (http://www.microsoft.com/taiwan/sqlserver/big-data-solution.aspx): Hadoop; Excel; PowerPivot; Power View; BI; Hadoop; SQL Server; SQL Server Analysis Service; Reporting Service; BI
53. HDInsight vs. Excel: http://www.microsoft.com/en-us/download/details.aspx?id=37134
54. HDInsight vs. SQL Server: http://www.microsoft.com/en-us/download/details.aspx?id=27584
55. Demo: Windows Integration
56. Sqoop vs. SQL Server
  SQL Server to HDFS:
    sqoop import --connect "jdbc:sqlserver://localhost:1433;databaseName=AdventureWorks2012" --username USERNAME -P --table DatabaseLog -m 1 --target-dir databaselog
  HDFS to SQL Server:
    sqoop export --connect "jdbc:sqlserver://localhost:1433;databaseName=TechDays2013" --username USERNAME -P --export-dir title-length --table TitleLength --input-lines-terminated-by \n --input-fields-terminated-by \t
57.
Agenda (repeated)
58. Windows Azure HDInsight Service: http://www.windowsazure.com/zh-tw/services/hdinsight/
59. Deliver Solution with Services
60. Cloud Storage, HDFS vs. ASV: HDInsight Service
    Hadoop Distributed File System (HDFS): hdfs:///
    Azure Storage Vault (ASV): asv[s]://[@].blob.core.windows.net/
61. Azure Storage Vault: HDFS Application; HDFS; ASV; Blob Storage; HDFS Extension; HDFS on Windows Azure Blob Storage; Windows Azure Blob Storage; ASV
62. Blob Storage: Unstructured Data Storage; High Availability/Scalability/Capacity; Low Cost; Storage Account; 200 TB; HTTP; HTTPS
63. Blob Storage: http://www.windowsazure.com/en-us/develop/net/how-to-guides/blob-storage/?fb=zh-tw
    Blob Storage: http://sally.blob.core.windows.net/movies/MOV1.AVI
    ASV: asv://[email protected]/MOV1.AVI
64. Blob Storage: HDInsight Cluster; MapReduce Job; HDFS; Cluster; Blob Storage; Cluster; Blob Storage; Cluster; Blob Storage; Cluster Scale
65. Azure Storage Vault vs. JavaScript
    #cat asv://[email protected]/movie_titles.txt
66. Azure Storage Vault: asv[s]://[@].blob.core.windows.net/; Cluster; Blob Storage Container; Default File System; core-site.xml
    asvs://[email protected]/input/log1.txt
    asvs://myaccount.blob.core.windows.net/result.txt
    asv:///output/result.txt
    hadoop fs -ls /output/result.txt
67. Windows Azure Marketplace: http://datamarket.azure.com/
68. Demo: Cloud Service
69. HDInsight Service: http://www.windowsazure.com/zh-tw/pricing/details/hdinsight/
  Windows Azure (http://www.windowsazure.com); Data Services; HDInsight Service; Storage Account; Location; Cluster (4/8/16/32 Data Nodes)
  PS: SQL Server Management Studio; Silverlight
70. Agenda (repeated)
71.
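The Blob Storage slides show the same blob addressed two ways: as an HTTP URL and as an ASV URI. The relationship between the two can be sketched in plain Java, assuming the ASV form is `asv[s]://container@account.blob.core.windows.net/path` and that the `asvs` scheme corresponds to HTTPS. Both points are assumptions, since the transcript lost the placeholders. The class `AsvUri` and its method are hypothetical, and the container `movies` and account `sally` are taken from the HTTP URL on the Blob Storage slide.

```java
import java.net.URI;

// Hypothetical helper: maps an ASV URI to the HTTP(S) URL of the same blob.
class AsvUri {

    // Assumed form: asv[s]://<container>@<account>.blob.core.windows.net/<path>
    static String toBlobUrl(String asv) {
        URI u = URI.create(asv);
        String scheme = u.getScheme();
        if (!"asv".equals(scheme) && !"asvs".equals(scheme)) {
            throw new IllegalArgumentException("not an ASV URI: " + asv);
        }
        // URIs such as asv:///output/result.txt rely on the cluster's default
        // container and cannot be resolved without the cluster configuration.
        if (u.getHost() == null || u.getUserInfo() == null) {
            throw new IllegalArgumentException("container and account required: " + asv);
        }
        String http = "asvs".equals(scheme) ? "https" : "http";
        return http + "://" + u.getHost() + "/" + u.getUserInfo() + u.getPath();
    }

    public static void main(String[] args) {
        // prints http://sally.blob.core.windows.net/movies/MOV1.AVI
        System.out.println(toBlobUrl("asv://movies@sally.blob.core.windows.net/MOV1.AVI"));
    }
}
```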
Hadoop in the Enterprise: http://www.ebizq.net/blogs/enterprise/2009/09/10_ways_to_complement_the_ente.php
72. http://www.windowsazure.com/zh-tw/pricing/details/hdinsight/
73. Resources
74. Wenming's Blog: http://blogs.msdn.com/b/hpctrekker/archive/2013/03/19/let-there-be-windows-azure-hdinsight.aspx
75. Hadoop Sessions
76. http://www.iiiedu.org.tw/taipei
77. Slide: http://www.slideshare.net/KuoChunSu/hdinsight