Hokkaido Summer H7: Introduction to Data...

31
H7: Introduction to Data Mining Hokkaido Summer School 2017 photomorgueFile - Can a computer learn from huge amount of data? Keywords: Aritificial Intelligence Machine Learning Data Mining Computer Game Hiroki Arimura GSB & IST, Hokkaido Univerisity IST bld. 7F, Rm.7-06 tel: 011-706-7680 [email protected] 2017/08/03 HSI2017, Hiroki Arimura, Hokkaido University 1 slide http://www-ikn.ist.hokudai.ac.jp/~arim/HSI2017/arimura.pdf

Transcript of Hokkaido Summer H7: Introduction to Data...

Page 1: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

H7: Introduction to Data Mining Hokkaido Summer School 2017

photo:morgueFile

- Can a computer learn from huge amount of data?

Keywords: Aritificial Intelligence Machine Learning Data Mining Computer Game

Hiroki Arimura

GSB & IST, Hokkaido Univerisity IST bld. 7F, Rm.7-06 tel: 011-706-7680 [email protected] 2017/08/03 HSI2017,HirokiArimura,Hokkaido

University 1

slide http://www-ikn.ist.hokudai.ac.jp/~arim/HSI2017/arimura.pdf

Page 2: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

H7: Introduction to Data Mining Hokkaido Summer School 2017 2017HirokiArimura

- Can a computer learn from huge amount of data?

Keywords: Aritificial Intelligence Machine Learning Data Mining Computer Game

Hiroki Arimura

GSB & IST, Hokkaido Univerisity IST bld. 7F, Rm.7-06 tel: 011-706-7680 [email protected]

Anothertopicoftoday:WhatisAI?

(AI: AriCficialIntelligence)

photo:morgueFile2017/08/03 HSI2017,HirokiArimura,HokkaidoUniversity 2

Page 3: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

DATA MINING: FROM PAST TO PRESENT - WHAT IS DATA MINING?

ISCIS2010, Discovery Science Session, Sep. 2010. Hiroki Arimura, Hokkaido University

3

Page 4: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Quiz:Whoisthis?

Hint:2012washis100years'anniversary(bornin1912)2017/08/03 HSI2017,HirokiArimura,HokkaidoUniversity 4

Page 5: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Answer:Dr.AlanTuring

(AlanM.Turing,1912-1954,GB)

• Oneofthepioneersofcomputerscienceinearly20C.

• KnownasageniusscienCstinmanyareas.

• "Enigma"project• Alsofamousinhis"Turingtest"inAI.

•  in1930s,heinventedamathemaCcalmodelofcomputers,"Turingmachine"

ーWhatdidhethinkabout?

2017/08/03 HSI2017,HirokiArimura,HokkaidoUniversity 5

Page 6: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

TheFirstDigitalComputersin1940s•  TheFirstDigitalComputers

in1940beforeW.W.II–  Programmable–  Soawarelibrary

•  ACE,MarkI(1946,GB)–  AlanTuringjoined

•  EDSAC(1949,USA)–  vonNeumannが影響

–  Wilkesetal.(1967TuringAward)

•  ENIAC(1946,USA)–  Oneofthefirstgeneral

purposedigitalcomputers2017/08/03 HSI2017,HirokiArimura,Hokkaido

University 6

Page 7: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

ArCficialIntelligence(AI)•  Studiesonthepossibilityand

limitaConofimplemenCnghuman'sintelligentacCviCes,suchaswatching,listening,speaking,andthinking.

•  AIresearchstartedin1950srightaaerthebirthofdigitalcomputers

JohnMcCarthy(1927-2011)hep://www.sis.pie.edu

•  1947,AlanTuring–  ProposedthenoConofAI–  1950,proposed"Turingtest"for

tesCngintelligence•  1951,MarvinMinsky

–  InventedarCficialneurons(withD.Edmonds)

•  1956年 JohnMcCarthy–  Proposedtheterm"AriCficial

Intelligence(AI)"atDartmouthConferenin1956.

–  1958,developedtheLISPprogramminglanguage

•  1952-62,A.Samuel–  Inventedacomputerprogram

forplaying"Checker"game.

MarvinMinsky(1927-2016)

heps://en.wikipedia.org/wiki/Marvin_Minsky

2017/08/03 HSI2017,HirokiArimura,HokkaidoUniversity 7

Page 8: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Artificial Intelligence and Big Data

IBM's "Watson" System•  IBM Research (16 Feb. 2011)•  Won human masters in a TV

quiz contest "Jeopardy!"•  Answers English questions by

reading millions of books.•  Technology: AI, NLP, & Search

Google's Cloud Computing•  Computation based on data and

information collected from all over the world!

From geek.com: Google server firm http://www.geek.com/articles/chips/up-next-for-google-enterprise-wars-2009078/

From http://www-06.ibm.com/ibm/jp/lead/ideasfromibm/watson/�

Watson beat human masters in a popular TV program "Jeopardy!"

2017/08/03 8

HSI2017HSI2017, Hiroki Arimura, Hokkaido University

Page 9: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Centralized?

Quiz 2:

Distributed?

9

or

Consider the present information technology and its environment in the world. Question: Is it

?? ??

2017/08/03

Page 10: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Centralized

l  Centralized l  Huge amount of data l  Many CPUs l  Massive Computation

Answer: Either is OK: Two views of the world

Distributed

l  Many devices (iphones etc.) l  Diverse activities of people l  Heterogeneous Time/Space l  Incomplete & complex data

10 Different Characteristics 2017/08/03

Page 11: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

11

情報知識ネットワーク特論 DM1

Backgrounds

Data Miningl  Study on efficient “semi-automatic”

methods for extracting “interesting and useful” patterns and rules from massive data sets

l  Emerged in the mid 1990s.• Apriori algorithm [Agrawal, Srikant, VLDB1994]

l  Potentially, a collection of existing studies.

• But, emphasis on efficient computation for massive data

l  Boundary of Machine Learning, Statistics, and Databases

Page 12: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

12

情報知識ネットワーク特論 DM1

Backgrounds

The whole process of Data Mining l  1.Understanding the domain of data

l  2.Preprocessing of data sets

l  3. Mining of patterns(Data Mining in narrow sense)

l  4.Analysis of discovered patterns

l  5.Use of the analyzed results

Page 13: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

13

HSI2017, Hiroki Arimura, Hokkaido University

Discovering hidden knowledge/ rules from massive data

Data Mining

<vessels >

<shipping >

<iranian >

<port >

<iran >

<the gulf >

<strike >

<attack > <silk worm missile>

<u.s.>

<wheat>

<dallers>

<sea men>

<strike >

<ships>

<gulf >

Traditional Information

Retrieval

Inspection by human

Data Mining

2017/08/03

Page 14: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

CASE STUDY: CAN A COMPUTER LEARN ASTORONOMY? - SUPERVISED LEARNING

ISCIS2010, Discovery Science Session, Sep. 2010. Hiroki Arimura, Hokkaido University

14

Page 15: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Can a computer learn astoronomy?

星雲 恒星

人工衛星惑星

分類

SKICAT Project •  (SKy Image Cataloging and Analysis Tool) in 1990s, NASA JPL, USA •  One of the earliest attempt of large-scale data mining •  Learning of a computer probram ("a classifier")

–  to automatically classify star imagesinto categories of stars. –  by using machine learning based on 1700 training examples

2017/08/03 HSI2017, Hiroki Arimura, Hokkaido University

15

We can make automatic classification of photo images of stars!

Page 16: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Automatic Classification by Machine Learning

•  We use a class of rules, called Decision trees. ▼DT classifies data into categories based on its characteristics

("features")▼Classification: Given a data, traverse a path from the root to

some leaf according to the results of tests. The label of the leaf reached gives the category

A>2.5

A≦5.0B =

"circle"

star quasar star galaxy

yes no

yes no yes no

category/class

Test on a feature

2017/08/03 HSI2017, Hiroki Arimura, Hokkaido University

16

Features •  A: radius •  B: shape

Page 17: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Learning Algorithm for Decision Trees

A<2 yes no

C>4 yes no

B>4 noyes

B>5 yes no

B>5 yes no A>1 yes

no

A<2

yes no

C>4

yes no

Recursively constructs such a tree that minimizes the classification error from given ○ positive and × negative examples

2017/08/03 HSI2017, Hiroki Arimura, Hokkaido

University 17

Page 18: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Can we learn customer preference from purchase data?

•  Acomputercananalyzethecontentsofbasketsforonemillioncustomersinseveraltensofminutes.

•  Whichitemsareboughttogetherinabasket?•  AprioriAlgorithm(in1990sbyIBMAlmaden)•  Oneoftherootofdataminingresearch

2017/08/03 HSI2017,HirokiArimura,HokkaidoUniversity 18

Page 19: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

19 HSI2017,HirokiArimura,HokkaidoUniversity

AdvancedMachineLearningAlgorithmsBoosCng[Freund,Shapire1996]–  PredicConbyaggregaConofmanylearningalgorithms

1919

SVM [Vapnik 1996]

l  Margin maximization and kernel methods

•  V. Vapnik, Statistical Learning Theory, Wiley, 1998. (SVN)

•  Y. Freund and R. E. Schapire, A decisiontheoretic generalization of on-line learning and an application to boosting, JCSS, 55, 119-139, 1997. (AdaBoost)

•  All the above algorithms are kinds of neural networks形) •  DemonstratedtheirhighperformanceintheoryandpracCce

Deep Learning [Hinton et al.]

l  Neural nets with many layers of different functionalities

2017/08/03

Page 20: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Classification: SVM

l  We can learn rules from complex data such as genome sequences and chemical compounds once appropriate features are designed

l  Applications: Medical diagnosis, pharmacy design

H

H

H

Fe

H

H

H

N HH

HH

X XC C

TCGCGAGGCTAGCT

GCAGAGTAT

TCGCGAGGCTAT

TCGCGAGGCTAT

TCGCGAGGT

+1

+1

+1

+1

-1

-1

-1

-1

H

H

H

N H+1

H

H

H

X XC C Classify

20

Page 21: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

21

HSI2017, Hiroki Arimura, Hokkaido University

有用な規則・パタ

ーン・知識

H H H X X C C

H H

H X X C C 大規模データ

データマイニング

Map of classic & modern DM/ML methods

Learning an unknown classification rule from labeled data sets

n SVM [Vapnik ‘96],

n Boosting [Shapire & Kearns ‘96]

n C4.5 [Quinlan ‘96]

A. Supervised Learning Finding common / interesting patterns in a given data set

n Frequent pattern mining [Agrawal et a. ’94]

C. Pattern Discovery

Statistical Modeling.

Learning statistical models from data

n Bayesian Network [Pearl ’90s]

n Topic models [Blei, Ng, Jordan, 2003]

Graph Mining [Zaki ’02], [Uno, Arimura]

n Emerging pattern mining

n Statistically significant pattern mining

Grouping a given unlabeled data set into subgroups of similar objects (clusters)n 大規模・不完全なデー

タからの高速クラスタリング

n K-means, CLARANS, DBSCAN

2017/08/03

n Deep Learning (Deep Neural Networks)

n Random Forests [Breiman 2001]

B. Custering

n Text Mining

n Stream Mining, etc.

Applications

DM = data mining, ML = machine learningClassic methods

Mordern methods

Page 22: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

APPLICATIONS OF DATA MINING/MACHINE LEARNING

ISCIS2010, Discovery Science Session, Sep. 2010. Hiroki Arimura, Hokkaido University

22

Page 23: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

ApplicaConsofmachinelearningQues7ons•  FindanexampleofmachinelearningapplicaConsinyourlife

•  Whatcanbedoneinfuture

Bio-technology•  Rapidgrowthofgenomedatasuchassequencingdata,

andgeneexpressiondata•  PredicConofthefuncConsofunknowngenesfrom

sequences.•  AutomaCcallyfindingcandidatesofmedicinesfromthe

structuresofchemicalcompounds

2017/08/03 HSI2017,HirokiArimura,HokkaidoUniversity 23

Page 24: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

ApplicaConsofmachinelearningFinance:Creditcardfrauddetec7on•  DisicoveringsuspicioustransacConsandcashwithdrawalfrom

massivetransacConrecords.

SecurityandTransporta7on:ImageRecogni7on•  Recognizingfacesandtrackingmovingpeopleandcarsfromimages

usingmachinelearningtechniques.

Marke7ng•  Wecanpredictcustomers'preferenceandtrendsfrompurchase

data.•  AsapplicaCons,recommendaConservicesforyourfavoritemusics

andbooksarenowavailableinAmazon.Spamdetec7on•  FilteroutadverCsementmessagesfromhugecollecConsofe-mails.2017/08/03 HSI2017,HirokiArimura,Hokkaido

University 24

Page 25: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

ApplicaConsofmachinelearningTextMining: •  First,weextractthestructureofasentenceasa"datatree"

byanalyzingalargecollecConoffreetext(ReputaConinblogsandopinionsinfree-textquesConary)byNLPtechnique.

•  Next,wecanextractcommonopinionbyfindingfrequentcommonsub-structures(treepa2erns)inthedatatree.

"TheupcomingSumsong'ssmartphonelookscool"

"ThemetalpanelofthesmartphonebySumsonglookscool!"

"ThesmartphonethatIsawyesterdaylookedcoolwithitsopeningpanel."

メタルの

かっこいい

パネルが 携帯は

S社の

S社の

かっこいい

携帯は

今度の

S社の

かっこいい

携帯は

昨日見た

パネルが

開く

変換発見

S社の

かっこいい

携帯は パネルが

共通したところだけを取り出す

"ThesmartphonebySumsonglookscool!"

NLP=NaturalLanguageProcessing

InputTexts CommonOpinions

Morinaga,Arimura,Ikedaetal.,ACMKDD'05,2005.2017/08/03 HSI2017,HirokiArimura,Hokkaido

University

Page 26: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Discussion:DesigningmachinelearningapplicaCons

ForeachofthepreviousapplicaCons,pleasethinkaboutthefollowingquesCons1.  Howtopreprocessthedataintofeature

vectors(atable)2.  Whatisthelabel(category)informaCon?3.  Whichlearningalgorithmdoyouuse?4.  Howtoevaluatetheresults

2017/08/03 HSI2017,HirokiArimura,HokkaidoUniversity 26

Page 27: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Hokkaido Summer School 2017

photo:morgueFile

2017 Hiroki Arimura

2017/08/03

HSI2017, Hiroki Arimura, Hokkaido University

27

Can a computer learn games from data?

Page 28: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Can computer learn chess?

•  In1930,AlanTuringdiscussedthepossiblityofcomputerprogramsplayingchessgames

•  In1950s,Samuelpresentedamachinelearningprogramforplayingchecker– simplerthanchess

hep://en.wikipedia.org/wiki/,"Checker/Draughts"の項目2017/08/03 HSI2017,HirokiArimura,HokkaidoUniversity 28

Page 29: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Can computer learn chess?

•  In1950,Samuel'scheckerprogramwonahumanamatureplayer.

•  In1990,acomputerbeatsthecheckerworldchampionforthefirstCme.(1)

•  In1997,acomputer(DeepBluebyIBM)wontheChessworldchampionforthefirstCmeinchessgame.(2)

1)Chinookproject@ualberta:hep://webdocs.cs.ualberta.ca/~chinook/project/

–  DeepBluecanmake200Mlookupsperseconds(IBMRS600x32+customVLSIx512).

–  1stmatch:human(Kasparov)won(3win1lose2draw).–  2ndmatch:computer(DeepBlue)won(2win1lose

3draw).

2)GarryKasparov(1963~):atWorldchampionin1985–1993and1993–2000.G.Kasparov(wikipedia)

2017/08/03 HSI2017,HirokiArimura,HokkaidoUniversity 29

Page 30: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Can computer learn GO?

2017/08/03 30

ComputerGO•  In2016,acomputerprogramAlphaGowonthe

world'sbesthumanGoplayer(KeJie柯潔)throughthree-matchseries.

•  Itwaswidelyexpectedthatcomputerscannotwinhumantopplayerforthenexttenyears.

•  AlphaGohasbeendevelopedbyDeepMindteamofGoogleforafewyears.

Technology•  Combininggamesearchwithseveral

machinelearningtechniques•  Montecarlotreesearch(MCTS)•  Reinforcementlearning photoof

AlphaGovs.KeJie

Seealso"AlphaGoversusKeJie"inWikipedia

Page 31: Hokkaido Summer H7: Introduction to Data Mininggi-core.oia.hokudai.ac.jp/gsb/wp-content/uploads/... · Automatic Classification by Machine Learning • We use a class of rules, called

Summary:IntroducContoDataMining

•  HistoryofAIandDataMining•  SKICATProject:ApplicaConofmachinelearningtoastronomicalbigdata

•  ClassificaConofdataminingalgorithms– Supervisedlearning(classificaCon)– Unsupervisedlearning(clustering)– Paeernmining

•  ApplicaConsofdatamining•  DatamininginComputerGame2017/08/03 HSI2017,HirokiArimura,Hokkaido

University 31