Predictive Model - Ministry of Public Health
bps.moph.go.th/new_bps/sites/default/files/Predictive...

PREDICTIVE MODEL with RapidMiner

Wanida Saetang, Ph.D. (candidate)
King Mongkut's University of Technology North Bangkok

THE PROGRESSION OF ANALYTICS

Presenter

Presentation Notes

The analytics process; diagnostic analytics (diagnostic analysis); "How can we make it happen?"

CRISP-DM

http://mlwiki.org/index.php/CRISP-DM

Presenter

Presentation Notes

CRISP-DM is a standard process for data mining analysis. It serves as a widely used blueprint, much like ISO processes in industrial plants or CMMI, a standard for software development. It has six steps:

1. Business understanding. This first step is very important: you must understand what the problem is, what direction or form the answer should take, and then translate the problem into a data mining task.
2. Data understanding. Understand what characteristics the data to be used should have, where the data sources are and, most importantly, what the data will cost (Costs of Data).
3. Data preparation. Transform the data into a correct form (data cleaning), for example by rescaling values to the same range (scale) or filling in missing values. This step takes the most time.
4. Modeling. Apply data analysis techniques such as clustering, association rules, and classification.
5. Evaluation. Before the resulting model is put to use, measure its performance to check whether the results meet the objectives set in the first step and how reliable they are.
6. Deployment. Put the knowledge gained to real use in the organization or company, for example by producing reports that executives or marketers can easily understand and use to launch promotions.
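The two data preparation tasks named in the notes (filling in missing values and rescaling to a common range) can be sketched in plain Python. This is a minimal illustration, not the RapidMiner workflow: the function names, the choice of mean imputation and min-max scaling, and the sample `ages` column are all assumptions made for the example.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical column with one missing entry.
ages = [20, 30, None, 40]
filled = fill_missing(ages)     # the gap is replaced by the mean of 20, 30, 40
scaled = min_max_scale(filled)  # every value now lies in [0, 1]
```

Other imputation and scaling choices (median, z-score) would fit the same step; which one is appropriate depends on the data.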

DATA MINING TECHNIQUES

Decision Tree, Naive Bayes, Neural Network, Support Vector Machines (SVM)

K-Means, DBSCAN, EM Clustering using GMMs, Agglomerative Hierarchical

Apriori algorithm, Eclat algorithm, FP-growth algorithm

Presenter

Presentation Notes

Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM)

Prediction tasks: models that predict unknown or future values

• Classification models: predict a categorical value

• Regression models: predict a continuous value

http://dataminingtrend.com/2014/decision-tree-model/

http://dataminingtrend.com/2014/naive-bayes/

http://dataminingtrend.com/2014/neural-network-weka-meaning-part1/

http://dataminingtrend.com/2014/support-vector-machine-svm/

https://en.wikipedia.org/wiki/Association_rule_learning#Apriori_algorithm

https://en.wikipedia.org/wiki/Association_rule_learning#Eclat_algorithm

https://en.wikipedia.org/wiki/Association_rule_learning#FP-growth_algorithm

DECISION TREE

http://dataminingtrend.com/2014/data-mining-techniques/ensemble-model/

APPLY MODEL

VALIDATE MODEL

Self Consistency Test

Split-validation

Cross-validation

http://dataminingtrend.com/2014/data-mining-techniques/cross-validation/

Presenter

Presentation Notes

The Self Consistency Test, sometimes called Use Training Set, is the simplest method: the data used to build the model and the data used to test it are the same set. It is not commonly used, because the measured performance comes out very high (possibly close to 100%); the system has already learned from this same data, so the result is not suitable for reporting. The method is useful for checking the general tendency of the model that was built: if the measured performance is low, the model does not fit the data.
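The optimism of testing on the training set can be seen in a small sketch. A 1-nearest-neighbor classifier is used here purely as a stand-in model (an assumption, it is not one of the models in the slides), and the toy data points are hypothetical.

```python
def nearest_neighbor_predict(train, point):
    """Return the label of the training row closest to `point` (1-NN)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda row: sq_dist(row[0], point))
    return label

def accuracy(train, test):
    """Fraction of test rows whose predicted label matches the true label."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)

# Hypothetical toy data: (features, label)
data = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
    ((3.0, 3.2), "B"), ((2.9, 3.1), "B"),
    ((2.0, 2.1), "A"), ((2.1, 2.0), "B"),
]

# Self consistency test: build and test on the same rows.
self_consistency = accuracy(data, data)

# For contrast: build on the first four rows, test on the two unseen rows.
held_out = accuracy(data[:4], data[4:])
```

Each training row is its own nearest neighbor, so the self consistency score is perfect here, while the score on the unseen rows is lower. That gap is the reason the notes advise against reporting the training-set figure.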


Presenter

Presentation Notes

Split Test divides the data at random into two parts, for example 70% to 30% or 80% to 20%. The first part (70% or 80%) is used to build the model and the second part (30% or 20%) is used to test the model's performance.
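A random 70/30 split like the one described can be sketched as follows. The function name `split_data`, the fixed seed, and the ten-row sample are assumptions for illustration; RapidMiner performs the equivalent step through its own operators.

```python
import random

def split_data(rows, train_ratio=0.7, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) parts."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data set of ten row identifiers.
rows = list(range(10))
train, test = split_data(rows, train_ratio=0.7)
```

The model would be built on `train` only and scored on `test` only, so no row is used for both purposes.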


Presenter

Presentation Notes

Cross-validation is a popular method in research for testing model performance, because its results are reliable. It divides the data into several parts of equal size: 5-fold cross-validation divides the data into 5 parts, and 10-fold cross-validation divides it into 10 parts. One part is then held out to test the model's performance while the remaining parts are used to build it, and this is repeated until every part has been used as the test set.
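The fold rotation described above can be sketched as an index generator. The name `k_fold_indices` is hypothetical, and the sketch assumes the rows have already been shuffled; when the row count is not divisible by k, the first folds receive one extra row.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) once per fold, k times in total."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

# 5-fold cross-validation over ten rows: each fold tests on two rows
# and builds the model on the other eight.
folds = list(k_fold_indices(10, 5))
```

A model would be built and scored once per fold. Averaging the per-fold scores into a single figure is a common convention, though the notes do not specify it.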

Presenter

Presentation Notes

RapidMiner Studio combines data analysis and preparation with custom business deployment. This code-optional software lets you automate reporting based on time intervals or have events trigger changes in your visualizations. Import your own data sets and export to other programs thanks to the platform’s 60+ native integrations. Extensions give you greater flexibility (such as anomaly detection, text processing, and web mining), but may fall outside of the basic subscription price. Although RapidMiner is unapologetically built for data scientists, it’s easy to install and get started. You can even download a basic, free version of the product directly from the site (limited to one logical processor; no customer support).

Workshop: RapidMiner

• Lab1 Decision Tree

• Lab2 Apply Model

• Lab3 Test Model

• Lab4 Validate Model
