
Advanced Data Mining and Integration Research for Europe

SAMI 2011, January 2011, Smolenice, Slovakia

ADMIRE – Framework 7 ICT 215024

Using Advanced Data Mining and Integration in Environmental Risk Management

Ladislav Hluchy

Ondrej Habala, Martin Šeleng, Peter Krammer, Viet Tran

Institute of Informatics, Slovak Academy of Sciences

...making data-mining easier

Contents

• EU FP7 project ADMIRE – overview
• Architecture of DMI solution in ADMIRE
• New DMI process language – DISPEL
• Pilot application scenarios – ORAVA, RADAR
– goals, architecture, experimental results
• Tools in ADMIRE


ADMIRE - Advanced Data Mining and Integration Research for Europe

• 7th Framework Program
• ICT, Call 1.2.A
• Started in February 2008, running for 36 months
• €4.3 million in costs, of which €3 million is EC funding


Collaborators

• University of Edinburgh, UK (Coordinator)
– NeSC - National e-Science Centre
– EPCC - Edinburgh Parallel Computing Centre
• Fujitsu Labs of Europe, UK
• University of Vienna, Austria
– Institute of Scientific Computing
• Universidad Politécnica de Madrid, Spain
– Facultad de Informática
• Slovak Academy of Sciences, Slovakia
– Institute of Informatics
• ComArch S.A., Poland


ADMIRE Goals

• Accelerate access to and increase the benefits from data exploitation;

• Deliver consistent and easy to use technology for extracting information and knowledge;

• Cope with complexity, distribution, change and heterogeneity of services, data, and processes, through abstract view of data mining and integration; and

• Provide power to users and developers of data mining and integration processes.


ADMIRE Structure

– WP1: High-Level Model and Language Research
• Incremental development of models and languages with a goal of describing Data Mining and Integration (DMI) processes abstractly

– WP2: Architecture Research
• Incremental development of a flexible, scalable and open DMI architecture

– WP3: Platform Support & Delivery
• Deliver robust service platforms, support users and encapsulate knowledge in a book

– WP4: Service Infrastructure Development and Enhancement
• Develop technology and services to enhance the DMI service infrastructure based on Fujitsu’s USMT


ADMIRE Structure

– WP5: Data Mining and Integration Tools Development
• Develop and integrate tools that make the technology easier to use and reduce the frequency of failures

– WP6: Integrated Applications
• Demonstration of validation and performance of architecture, language, platform and tools as an integrated environment for Data Mining and Integration

– WP7: Project Management
• Management and coordination of the project


ADMIRE Architecture: Separation of Concerns


ADMIRE Architecture


DISPEL – Data Intensive Systems Process-Engineering Language

• Data-intensive distributed systems
• Connection point of complex application requests and complex enactment systems
– Benefit: method development, engineering and evolution of supported practices can take place independently in each world
• Describes enactment requests for streaming-data workflow processes
• “Process-engineering time” – transform and optimize the process in preparation for the enactment period


DISPEL: Simple Example

String sql1 = "SELECT * FROM some_table";
String sql2 = "SELECT * FROM table2";
String resource = "128.18.128.255";

SQLQuery query = new SQLQuery;
|- sql1, sql2 -| => query.expression;
|- resource -| => query.resource;

Tee tee = new Tee;
query.result => tee.connectInput;

Callouts on the slide: "Creating streams of literals" (the |- ... -| expressions) and "Creating connections" (the => operator).


DISPEL – real use


ADMIRE’s High-Level Architecture


ADMIRE Gateways

USMT


Security

• The framework is built on top of a formal Grid infrastructure; available security mechanisms include:
– Transport-level security: SSL, HTTPS (currently available)
– Message-level security: Web Services Security (SOAP Message Security)
– X.509 certificate authentication
– Multiple stakeholder authorization
– Explicit Trust Delegation (ETD)


Pilot Applications

• ADMIRE has 2 pilot applications
– CRM
– FloodApp
• FloodApp
– Orava
– Radar
– SVP


ACRM Application

• Large-scale, distributed churn scenario
– 4 database parts, distributed among ADMIRE partners
– Graphical UI for business analysts
– Using the ADMIRE workbench, DISPEL and framework to create predictions of customer churn
• Mining over distributed data


Flood Application

Data sets used in hydrological scenarios

Dataset | Domain | Description | Volume | Temporal coverage | Spatial coverage
--- | --- | --- | --- | --- | ---
HUSAV | Hydrology | Data from two probes, containing water saturation of soil | 10s of MB | 1998-2007 | Two distinct points
MARS | Meteorology | Historical meteorological data (temperature, rainfall, etc.) for Slovakia | 100s of MB | 1975-2007 | Slovakia (grid 50x50 km)
SVP | Hydrology | Data from waterworks in western Slovakia (mainly river Váh) – outflows, water levels, temperature, rainfall | 100s of MB | 1998-2007 | 15 distinct waterworks
DAISY | Pedology | Various pedological parameters for one probe in southern Slovakia | 10s of MB | 1961-2000 | One point
WOFOST | Pedology | Crop data (with attached soil and meteorological data) for Slovakia, year 2006 | 10s of MB | 2006 | Slovakia (grid)
SHMU_CURR | Meteorology | On-line database of meteorological data – copied from SHMI web; including radar imagery | 10s of GB + | 2008- | Slovakia (about 100 distinct probes)
SHMU_HIST | Meteorology | Historical meteorological data from SHMI probes | 100s of MB | 1998-2007 | Slovakia (more than 100 distinct probes)
SHMU_GRIB | Meteorology | Historical temperatures and rainfall amounts in a gridded binary format | 100s of GB | 1998-2007 | Slovakia (grid, various sizes)
RADAR | Meteorology | Weather radar imagery | 100s of GB | 2005-2008 | Slovakia
SHMU_HYDRO | Hydrology | Historical data from hydrological measurement stations | 10s of MB | 1998-2007 | Orava and upper Vah river
SOIL_RET | Pedology | Water retention capacities of soil | 10s of MB | current (no time series applicable) | Vah river watershed area


Scenarios deployment in testbed

• Two scenarios (ORAVA, RADAR) completely deployed in testbed
• Data for the other scenarios are partially deployed
• 5 nodes (1 real + 4 virtual nodes)
• Databases (MySQL + PostgreSQL), GRIB files in file storage
• USMT (Unified System Management Technology - Jetty container), OGSA-DAI (Apache Tomcat)


Orava scenario

• Legend
– Green area – Orava (part of north Slovakia)
– Blue – Orava reservoir and local rivers
– Red dots – hydrological measurement stations
• Notes
– We are interested only in hydrological stations below the Orava reservoir
– In our tests we will use the hydrological station 5830 (Tvrdosin)


ORAVA – data mining concept

• Predictors – rainfall amount (reservoir and station), air temperature (reservoir and station), reservoir discharge, reservoir temperature
• Targets – water level and temperature at a station below the reservoir

Time | Water temp Orava | Rainfall Orava | Air temp Orava | Air temp Station | Rainfall Station | Outflow Orava | Water level Station | Water temp Station
--- | --- | --- | --- | --- | --- | --- | --- | ---
T-4 | E-4 | R-4 | A-4 | B-4 | S-4 | D-4 | X-4 | Y-4
T-3 | E-3 | R-3 | A-3 | B-3 | S-3 | D-3 | X-3 | Y-3
T-2 | E-2 | R-2 | A-2 | B-2 | S-2 | D-2 | X-2 | Y-2
T-1 | E-1 | R-1 | A-1 | B-1 | S-1 | D-1 | X-1 | Y-1
T | E | R | A | B | S | D | X | Y
T+1 | | R+1 | A+1 | B+1 | S+1 | D+1 | X+1 | Y+1
T+2 | | R+2 | A+2 | B+2 | S+2 | D+2 | X+2 | Y+2
T+3 | | R+3 | A+3 | B+3 | S+3 | D+3 | X+3 | Y+3
T+4 | | R+4 | A+4 | B+4 | S+4 | D+4 | X+4 | Y+4
T+5 | | R+5 | A+5 | B+5 | S+5 | D+5 | X+5 | Y+5
T+6 | | R+6 | A+6 | B+6 | S+6 | D+6 | X+6 | Y+6

Callouts on the slide: "Predicted by a meteo model", "Given in a schedule", "Targets of data mining".
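
The windowed layout above can be turned into one flat feature row per time step. The following is a minimal sketch of that idea only; the function name, the `future_columns` parameter and the dictionary-based input are illustrative assumptions, not part of ADMIRE.

```python
from typing import Dict, Sequence


def build_lagged_row(series: Dict[str, Sequence[float]], t: int,
                     lags: int, horizon: int,
                     future_columns: Sequence[str]) -> Dict[str, float]:
    """Flatten the window around hour t into a single feature row.

    Every column contributes its values at t-lags .. t. Only the columns
    named in future_columns (hypothetical forecast or scheduled inputs)
    also contribute t+1 .. t+horizon.
    """
    length = min(len(values) for values in series.values())
    if t - lags < 0 or t + horizon >= length:
        raise ValueError("window does not fit inside the series")
    row: Dict[str, float] = {}
    for name, values in series.items():
        last = horizon if name in future_columns else 0
        for offset in range(-lags, last + 1):
            # Keys follow the slide's notation: "R-2", "R", "R+1", ...
            key = f"{name}{offset:+d}" if offset else name
            row[key] = values[t + offset]
    return row


example = build_lagged_row(
    {"R": [0.0, 1.0, 2.0, 3.0, 4.0], "E": [10.0, 11.0, 12.0, 13.0, 14.0]},
    t=2, lags=2, horizon=1, future_columns=["R"],
)
```

Here `example` holds `R-2` to `R+1` but only `E-2` to `E`, mirroring the empty future cells in the first data column of the table.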


ORAVA – data integration

• Integration of data from
– GRIB files
– Reservoirs
• Inputs
– Time period of experiment
– Reservoir ID
– List of hydro stations
– Geo coordinates
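
As a rough illustration of what integrating two sources over a time period involves, the sketch below joins reservoir and station records on their timestamp. It is not the ADMIRE/OGSA-DAI implementation; the record shape, the hour-index timestamps and the `integrate` function are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (hour index, {measurement name: value})
Record = Tuple[int, Dict[str, float]]


def integrate(reservoir: Iterable[Record], station: Iterable[Record],
              start: int, end: int) -> List[Dict[str, float]]:
    """Inner-join two hourly record streams on timestamp within [start, end]."""
    station_by_time = {t: values for t, values in station}
    merged: List[Dict[str, float]] = []
    for t, values in reservoir:
        if start <= t <= end and t in station_by_time:
            row: Dict[str, float] = {"time": t}
            # Suffix each measurement with its source to avoid name clashes.
            row.update({f"{k}_reservoir": v for k, v in values.items()})
            row.update({f"{k}_station": v for k, v in station_by_time[t].items()})
            merged.append(row)
    return merged


rows = integrate(
    reservoir=[(0, {"outflow": 30.0}), (1, {"outflow": 30.0}), (2, {"outflow": 50.0})],
    station=[(1, {"water_level": 28.62}), (2, {"water_level": 28.0})],
    start=0, end=1,
)
```

Only hour 1 survives in `rows`: hour 0 has no station record and hour 2 lies outside the requested period.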


ORAVA – data sets

Dataset | Domain | Description | Volume | Temporal coverage | Spatial coverage
--- | --- | --- | --- | --- | ---
SVP | Hydrology | Data from waterworks in western Slovakia (mainly river Váh) – outflows, water levels, temperature, rainfall | 100s of MB | 1998-2007 | 15 distinct waterworks
SHMU_CURR | Meteorology | On-line database of meteorological data – copied from SHMI web; including radar imagery | 10s of GB + | 2008- | Slovakia (about 100 distinct probes)
SHMU_HIST | Meteorology | Historical meteorological data from SHMI probes | 100s of MB | 1998-2007 | Slovakia (more than 100 distinct probes)
SHMU_GRIB | Meteorology | Historical temperatures and rainfall amounts in a gridded binary format | 100s of GB | 1998-2007 | Slovakia (grid, various sizes)
SHMU_HYDRO | Hydrology | Historical data from hydrological measurement stations | 10s of MB | 1998-2007 | Orava and upper Vah river


ORAVA – integrated and preprocessed data

Integrated raw data (rows in time order; blank cells are values missing from the raw data)

Water_temp Orava | Air_temp Orava | Rainfall Orava | Outflow Orava | Rainfall Station | Air_temp Station | Flow/Height Station | Water_temp Station
--- | --- | --- | --- | --- | --- | --- | ---
 | -4 | | 30 | -5.55E-20 | 269.0278 | 28 | 0.7
1 | -4 | | 30 | -5.55E-20 | 269.0476 | 28.62 | 0.7
 | -5 | | 30 | -4.24E-20 | 269.5059 | 28.62 | 0.7
 | -5 | | 30 | -8.47E-20 | 270.2394 | 28.62 | 0.7
 | -5 | | 30 | -8.47E-20 | 270.8507 | 28 | 0.7
 | -3 | | 50 | -8.47E-20 | 271.2792 | 28 | 0.7
 | -3 | | 50 | -8.47E-20 | 271.9238 | 28 | 0.8

Filters applied between the two tables: ZeroEpsilon Filter, LinearTrend Filter, ReplaceMissingValues Filter, Kelvin2Celsius Filter

Integrated preprocessed data (rows in time order)

Water_temp Orava | Air_temp Orava | Rainfall Orava | Outflow Orava | Rainfall Station | Air_temp Station | Flow/Height Station | Water_temp Station
--- | --- | --- | --- | --- | --- | --- | ---
1.0 | -4.0 | 0.0 | 30.0 | 0.0 | -3.12223 | 28.0 | 0.7
1.0 | -4.0 | 0.0 | 30.0 | 0.0 | -3.1024 | 28.62 | 0.7
0.995833 | -5.0 | 0.0 | 30.0 | 0.0 | -2.64408 | 28.62 | 0.7
0.991667 | -5.0 | 0.0 | 30.0 | 0.0 | -1.91062 | 28.62 | 0.7
0.9875 | -5.0 | 0.0 | 30.0 | 0.0 | -1.29926 | 28.0 | 0.7
0.983333 | -3.0 | 0.0 | 50.0 | 0.0 | -0.87076 | 28.0 | 0.7
0.979167 | -3.0 | 0.0 | 50.0 | 0.0 | -0.22617 | 28.0 | 0.8
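
The slides name the preprocessing filters but do not define them. The sketch below shows one plausible reading of three of them, inferred from the names alone: the threshold `eps`, the standard 273.15 offset, and mean substitution (which is what Weka's filter of the same name does for numeric attributes) are all assumptions, so the output is not expected to reproduce the slide's figures.

```python
from typing import List, Optional

Column = List[Optional[float]]  # None marks a missing value


def zero_epsilon(column: Column, eps: float = 1e-9) -> Column:
    """Snap values whose magnitude is below eps to exactly 0.0 (noise removal)."""
    return [0.0 if v is not None and abs(v) < eps else v for v in column]


def kelvin_to_celsius(column: Column) -> Column:
    """Convert temperatures using the standard offset of 273.15."""
    return [None if v is None else v - 273.15 for v in column]


def replace_missing_with_mean(column: Column) -> Column:
    """Fill missing entries with the mean of the known entries."""
    known = [v for v in column if v is not None]
    if not known:
        return list(column)  # nothing to estimate the mean from
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]


rain = zero_epsilon([-5.55e-20, 0.5, None])
celsius = kelvin_to_celsius([273.15, None])
filled = replace_missing_with_mean([1.0, None, 3.0])
```

Keeping each filter as a pure function over one column makes them easy to chain in whatever order a scenario needs.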

ORAVA – data mining

• Input - Integrated data
• Data Mining Phases:
– Data understanding
  • Data visualization
  • Data quality exploration
– Data preparation
  • Missing values substitution (ReplaceMissingValues filter)
  • Noise reduction (ZeroEpsilon filter)
  • Switching from one scale to another (Kelvin2Celsius filter)
  • Data modifying (LinearTrend filter)
– Model training
  • Training on historical data (8760 records)
  • Linear Regression model
  • Neural networks - multilayer perceptron without hidden layers
– Model Evaluation
  • Testing of the trained model
  • N-fold cross validation
  • Using training sets
• Output - Prediction model

[Figure: workflow diagram with the stages Integrated Data, Data Preparation, Data Visualization, Data Cleaning, Clean Data, Model Training, Model Visualization, Model Evaluation, Prediction Model]
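
The N-fold cross validation step in the evaluation phase can be sketched as an index splitter. This is a generic illustration, not the tooling used in the project: the function name is hypothetical, and the folds are contiguous and unshuffled for simplicity, whereas data mining toolkits typically randomize the records first.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_records: int,
                   n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    if not 2 <= n_folds <= n_records:
        raise ValueError("need 2 <= n_folds <= n_records")
    base, extra = divmod(n_records, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional record each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_records))
        yield train, test
        start += size


folds = list(k_fold_indices(8760, 10))
```

With the 8760 hourly records mentioned above and 10 folds, each model would be trained on 7884 records and tested on the remaining 876.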


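Orava – data mining results: prediction of temperature

• Linear Regression model equation (terms in slide order; the operator symbols between terms did not survive, so only the coefficient magnitudes are shown):

Water_temp_station =
  0.6473 · Water_temp_Orava
  0.0239 · Air_temp_Orava
  0.0359 · Rainfall_Orava
  0.0055 · Outflow_Orava
  0.0418 · Rainfall_station
  0.0117 · Air_temp_station
  0.0503 · Flow_station
  2.4324 (constant term)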


Orava – temperature prediction model comparison

Property | Linear regression | Multilayer perceptron
--- | --- | ---
Correlation coefficient | 0.9639 | 0.9821
Mean absolute error | 1.1791 | 0.7748
Root mean squared error | 1.4607 | 1.0386
Relative absolute error | 23.8739 % | 15.6884 %
Root relative squared error | 26.609 % | 18.9195 %
Total number of instances | 8760 | 8760

Validation data | Linear regression: predicted | Linear regression: error | Multilayer perceptron: predicted | Multilayer perceptron: error
--- | --- | --- | --- | ---
11.6 | 13.071 | 1.471 | 12.446 | 0.846
15.2 | 14.335 | -0.865 | 14.494 | -0.706
6.4 | 7.614 | 1.214 | 5.766 | -0.634
0.7 | 2.284 | 1.584 | 0.926 | 0.226
11.7 | 10.948 | -0.752 | 10.266 | -1.434
14.3 | 16.526 | 2.226 | 13.671 | -0.629
15.6 | 12.891 | -2.709 | 14.502 | -1.098
15.7 | 12.838 | -2.862 | 13.353 | -2.347
0.8 | 1.752 | 0.952 | 0.826 | 0.026
15.8 | 15.188 | -0.612 | 14.005 | -1.795
15.4 | 16.553 | 1.153 | 13.129 | -2.271
14.9 | 12.795 | -2.105 | 14.599 | -0.301
15.4 | 15.660 | 0.260 | 13.696 | -1.704
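
For readers unfamiliar with the five quality measures compared above, the sketch below computes them from paired actual and predicted values using their textbook definitions. One simplification is assumed: the relative errors use the mean of the supplied actual values as the baseline predictor, which may differ from the baseline the authors' toolkit used. The function name is hypothetical.

```python
import math
from typing import Dict, Sequence


def regression_metrics(actual: Sequence[float],
                       predicted: Sequence[float]) -> Dict[str, float]:
    """Correlation, MAE, RMSE and the two relative errors (in percent)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(var_a * var_p),
        "mae": sum(abs_err) / n,
        "rmse": math.sqrt(sum(sq_err) / n),
        # Relative errors compare the model against always predicting mean_a.
        "rae_percent": 100.0 * sum(abs_err) / sum(abs(a - mean_a) for a in actual),
        "rrse_percent": 100.0 * math.sqrt(sum(sq_err) / var_a),
    }


# First four validation rows of the linear regression model from the table.
metrics = regression_metrics([11.6, 15.2, 6.4, 0.7],
                             [13.071, 14.335, 7.614, 2.284])
```

Four rows are far too few to reproduce the table's summary figures; the call only shows how the error columns feed the measures.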


Orava – prediction of water level

• Neural network model – multilayer perceptron
• Input parameters (6)
– Rainfall ([S+1]), Water-Level ([X])
– Outflows ([D], [D+1] – [D], ln([D]), sqrt([D]))
• Output – Difference of water level ([X+1] – [X])

Time | Water temp Orava | Rainfall Orava | Air temp Orava | Air temp Station | Rainfall Station | Outflow Orava | Water level Station | Water temp Station
--- | --- | --- | --- | --- | --- | --- | --- | ---
T-3 | E-3 | R-3 | A-3 | B-3 | S-3 | D-3 | X-3 | Y-3
T-2 | E-2 | R-2 | A-2 | B-2 | S-2 | D-2 | X-2 | Y-2
T-1 | E-1 | R-1 | A-1 | B-1 | S-1 | D-1 | X-1 | Y-1
T | E | R | A | B | S | D | X | Y
T+1 | | R+1 | A+1 | B+1 | S+1 | D+1 | X+1 | Y+1
T+2 | | R+2 | A+2 | B+2 | S+2 | D+2 | X+2 | Y+2
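
The six inputs and the difference target listed above can be derived from three hourly series. The sketch below is illustrative only; the function and argument names are assumptions, and it presumes strictly positive outflow values so that the logarithm is defined.

```python
import math
from typing import List, Sequence, Tuple


def water_level_samples(rain_station: Sequence[float],
                        outflow: Sequence[float],
                        level: Sequence[float]) -> List[Tuple[List[float], float]]:
    """Build (inputs, target) pairs for every hour t that has a successor t+1."""
    samples: List[Tuple[List[float], float]] = []
    for t in range(len(level) - 1):
        d = outflow[t]
        inputs = [
            rain_station[t + 1],   # [S+1] rainfall at the station, next hour
            level[t],              # [X]   current water level
            d,                     # [D]   current reservoir outflow
            outflow[t + 1] - d,    # [D+1] - [D] change in outflow
            math.log(d),           # ln([D]), requires D > 0
            math.sqrt(d),          # sqrt([D])
        ]
        target = level[t + 1] - level[t]  # [X+1] - [X]
        samples.append((inputs, target))
    return samples


samples = water_level_samples(
    rain_station=[0.0, 1.0, 2.0],
    outflow=[30.0, 30.0, 50.0],
    level=[28.0, 28.62, 28.0],
)
```

Predicting the difference rather than the level itself means the model only has to learn the hourly change; the absolute level is recovered by adding the prediction to [X].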


Orava – water level prediction

• Data count: 8735 records
• Activation function of the feed-forward neural network: sigmoid
• Correlation coefficient: 0.9816
• Mean absolute error: 0.4105
• Root mean squared error: 0.9673
• Relative absolute error: 30.5869 % (from difference)
• Root relative squared error: 19.2384 % (from difference)


RADAR

• Very short-term rainfall prediction from weather radar data
– Movement of areas with higher air moisture content, and thus also higher precipitation potential
• Mining of matrices of data

Time | Potential precipitation (RADAR) | Measured precipitation (STATION) | Temperature (MODEL) | Wind (MODEL)
--- | --- | --- | --- | ---
T-3 | R-3 | S-3 | H-3 | W-3
T-2 | R-2 | S-2 | H-2 | W-2
T-1 | R-1 | S-1 | H-1 | W-1
T | R | S | H | W
T+1 | R+1 | S+1 | |
T+2 | R+2 | S+2 | |

Callout on the slide: "Targets of data mining".

Meteorological data

• Network of synoptic stations in Slovakia
– 27 stations in Slovakia
– Used data from the years 2007 and 2008
– Rainfall, humidity, atmospheric pressure and temperature values for each hour


RADAR isotonic model

• Actual model for rainfall prediction
– Isotonic regression model structure
– Training on historical data
– Correlation coefficient: 0.4593
– Mean absolute error: 0.1105
– Root mean squared error: 0.5490
– Total number of instances: 89700
– Validation: 10-fold cross-validation
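
Isotonic regression fits a non-decreasing step function to the data. A common way to compute the least-squares fit is the pool-adjacent-violators algorithm, sketched below under the assumptions that the samples are already sorted by the predictor and carry equal weight. This shows the general technique and is not the implementation used in the project; the function name is hypothetical.

```python
from typing import List, Sequence


def isotonic_fit(y: Sequence[float]) -> List[float]:
    """Least-squares non-decreasing fit to y (pool adjacent violators)."""
    blocks: List[List[float]] = []  # each block is [sum of values, count]
    for value in y:
        blocks.append([value, 1.0])
        # Merge backwards while the previous block's mean exceeds the newest one.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: List[float] = []
    for total, count in blocks:
        # Every sample in a block receives the block mean.
        fitted.extend([total / count] * int(count))
    return fitted


fitted = isotonic_fit([1.0, 3.0, 2.0, 4.0])
```

The boundaries between the resulting blocks play the role of cut points: each block maps a range of the predictor to a single predicted value.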


Table of isotonic model

index | Prediction (rainfall) | cut point (reflective) | index | Prediction (rainfall) | cut point (reflective) | index | Prediction (rainfall) | cut point (reflective)
--- | --- | --- | --- | --- | --- | --- | --- | ---
1 | 0.01 | 1.78 | 15 | 0.23 | 96.91 | 29 | 1.35 | 355.91
2 | 0.03 | 1.84 | 16 | 0.28 | 97.47 | 30 | 1.40 | 377.19
3 | 0.03 | 8.28 | 17 | 0.30 | 129.63 | 31 | 1.52 | 381.78
4 | 0.03 | 16.97 | 18 | 0.33 | 129.72 | 32 | 2.13 | 395.31
5 | 0.03 | 24.28 | 19 | 0.42 | 147.94 | 33 | 2.23 | 399.16
6 | 0.03 | 36.91 | 20 | 0.44 | 168.59 | 34 | 2.28 | 447.06
7 | 0.05 | 37.53 | 21 | 0.50 | 187.13 | 35 | 2.60 | 447.69
8 | 0.05 | 38.72 | 22 | 0.51 | 187.47 | 36 | 2.60 | 467.66
9 | 0.06 | 44.53 | 23 | 0.62 | 211.56 | 37 | 2.98 | 515.19
10 | 0.07 | 59.03 | 24 | 0.72 | 268.38 | 38 | 3.75 | 625.56
11 | 0.08 | 61.16 | 25 | 0.93 | 281.28 | 39 | 4.93 | 665.41
12 | 0.10 | 61.78 | 26 | 1.00 | 297.72 | 40 | 5.24 | 901.25
13 | 0.14 | 81.59 | 27 | 1.14 | 314.47 | 41 | 5.40 | 934.41
14 | 0.19 | 89.22 | 28 | 1.26 | 344.59 | 42 | 6.30 | 971.5


Hydrometeorological performance

Probability of detection (POD) with thresholds of 0.3 and 0.6 mm of rainfall per hour:

• POD_0.3 = 63.87 %
• POD_0.6 = 56.22 %

Miss rate (MR) with thresholds of 0.3 and 0.6 mm of rainfall per hour:

• MR_0.3 = 1.85 %
• MR_0.6 = 1.58 %
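
Probability of detection is conventionally defined as hits divided by the sum of hits and misses, where an event is an observed rainfall at or above the threshold. The sketch below follows that convention; treating the threshold as inclusive is an assumption, and the function name is hypothetical. The slides do not state how their miss rate is defined, so only POD is sketched.

```python
from typing import Sequence


def probability_of_detection(observed: Sequence[float],
                             predicted: Sequence[float],
                             threshold: float) -> float:
    """Fraction of observed events (>= threshold) that were also predicted."""
    hits = misses = 0
    for obs, pred in zip(observed, predicted):
        if obs >= threshold:          # an event actually occurred
            if pred >= threshold:
                hits += 1             # and the model predicted it
            else:
                misses += 1           # but the model did not
    if hits + misses == 0:
        raise ValueError("no observed events at this threshold")
    return hits / (hits + misses)


# Made-up hourly rainfall values in mm, for illustration only.
observed_mm = [0.0, 0.5, 1.2, 0.4]
predicted_mm = [0.1, 0.6, 0.2, 0.3]
pod_03 = probability_of_detection(observed_mm, predicted_mm, 0.3)
pod_06 = probability_of_detection(observed_mm, predicted_mm, 0.6)
```

Raising the threshold leaves fewer, heavier events to detect, which is why POD is reported separately for each threshold.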


RADAR model

• Other tested models
– Neural networks, SMOreg, linear regression, ...
– Reached correlation coefficient between 0.35 and 0.42
– Validation: 10-fold cross-validation

Problems in model creation:
– The process is significantly stochastic
– Some input variables are backwards dependent on output
– The meteorological process is very sensitive
– The reflection matrix represents the quantity of water in the atmosphere, not the exact rainfall rate in a specified area, as opposed to data from synoptic stations


ADMIRE Tools

• Registry client GUI
• Process designer
• SKSA
• Gateway Process Manager
• DMI Model Visualizer


Registry client GUI

• Read-only access to ADMIRE Registry
– list PEs (processing elements) and view their properties
– search, sort PEs
• Write access to Registry is done via DISPEL documents


Process Designer

Manage your DMI project (files, directories – project structure)

Edit your DMI process graphically

View the canonical (DISPEL) representation of your DMI process in real time

Select elements from the Registry

View the properties of your chosen elements


Semantic Knowledge Sharing Assistant

• Context the user works in
– Several reservoirs, one settlement
• Knowledge that may be useful in this context
– previously entered by other users

Provides access to existing users' knowledge, sorting and selecting it automatically according to the user's current working context


Gateway Process Manager

• Keep track of running processes
– stop/pause/cancel the process
– view the process's source DISPEL
• Access the process's results (if available) in several ways – raw or visualized


DMI Model Visualizer

• Visualization of data mining models
– Read Weka classifier object
– Produce PMML (Predictive Model Markup Language) description of the model
– Show the PMML as a graphical tree


Admire Project

Thank you for your attention.
