Advanced Data Mining and Integration Research for Europe
SAMI 2011, January 2011, Smolenice, Slovakia
ADMIRE – Framework 7 ICT 215024
Using Advanced Data Mining and Integration in Environmental Risk Management
Ladislav Hluchy
Ondrej Habala, Martin Šeleng, Peter Krammer, Viet Tran
Institute of Informatics, Slovak Academy of Sciences
...making data-mining easier
Contents
• EU FP7 project ADMIRE – overview
• Architecture of DMI solution in ADMIRE
• New DMI process language – DISPEL
• Pilot application scenarios – ORAVA, RADAR
  – goals, architecture, experimental results
• Tools in ADMIRE
ADMIRE - Advanced Data Mining and Integration Research for Europe
• 7th Framework Programme
• ICT, Call 1.2.A
• Commenced in February 2008, running over 36 months
• €4.3 million in costs, and €3 million in EC funding
Collaborators
• University of Edinburgh, UK (Coordinator)
  – NeSC - National e-Science Centre
  – EPCC - Edinburgh Parallel Computing Centre
• Fujitsu Labs of Europe, UK
• University of Vienna, Austria
  – Institute of Scientific Computing
• Universidad Politécnica de Madrid, Spain
  – Facultad de Informatica
• Slovak Academy of Sciences, Slovakia
  – Institute of Informatics
• ComArch S.A., Poland
ADMIRE Goals
• Accelerate access to and increase the benefits from data exploitation;
• Deliver consistent and easy-to-use technology for extracting information and knowledge;
• Cope with complexity, distribution, change and heterogeneity of services, data, and processes, through an abstract view of data mining and integration; and
• Provide power to users and developers of data mining and integration processes.
ADMIRE Structure
– WP1: High-Level Model and Language Research
  • Incremental development of models and languages with the goal of describing Data Mining and Integration (DMI) processes abstractly
– WP2: Architecture Research
  • Incremental development of a flexible, scalable and open DMI architecture
– WP3: Platform Support & Delivery
  • Deliver robust service platforms, support users and encapsulate knowledge in a book
– WP4: Service Infrastructure Development and Enhancement
  • Develop technology and services to enhance the DMI service infrastructure based on Fujitsu’s USMT
ADMIRE Structure
– WP5: Data Mining and Integration Tools Development
  • Develop and integrate tools that make the technology easier to use and reduce the frequency of failures
– WP6: Integrated Applications
  • Demonstration of validation and performance of architecture, language, platform and tools as an integrated environment for Data Mining and Integration
– WP7: Project Management
  • Management and coordination of the project
ADMIRE Architecture: Separation of Concerns
ADMIRE Architecture
DISPEL – Data Intensive Systems Process-Engineering Language
• Data-intensive distributed systems
• Connection point of complex application requests and complex enactment systems
  – Benefit: method development, engineering and evolution of supported practices can take place independently in each world
• Describes enactment requests for streaming-data workflow processes
• “Process-engineering time” – transform and optimize the process in preparation for the enactment period
DISPEL: Simple Example
String sql1 = "SELECT * FROM some_table";
String sql2 = "SELECT * FROM table2";
String resource = "128.18.128.255";
SQLQuery query = new SQLQuery;
|- sql1, sql2 -| => query.expression;
|- resource -| => query.resource;
Tee tee = new Tee;
query.result => tee.connectInput;
Slide callouts: "Creating streams of literals" (the |- ... -| expressions) and "Creating connections" (the => operator).
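The wiring in this example can be mimicked in ordinary Python to show the idea of literal streams feeding a processing element whose output is fanned out. This is only an analogy and not DISPEL: `literal_stream` and `sql_query` are hypothetical names, `sql_query` returns a labelled string instead of contacting a database, and reusing the single resource for both expressions is an assumption.

```python
from itertools import tee


def literal_stream(*values):
    """Analogue of a stream of literals: yields each value in order."""
    yield from values


def sql_query(expressions, resources):
    """Hypothetical stand-in for the SQLQuery element.

    Pairs each expression with a resource and emits a fake result string.
    Assumption: if fewer resources than expressions arrive, the last
    resource is reused.
    """
    resource_list = list(resources)
    for i, expr in enumerate(expressions):
        res = resource_list[min(i, len(resource_list) - 1)]
        yield f"result of [{expr}] on {res}"


sql1 = "SELECT * FROM some_table"
sql2 = "SELECT * FROM table2"
resource = "128.18.128.255"

# Connect the literal streams to the query element.
results = sql_query(literal_stream(sql1, sql2), literal_stream(resource))

# Analogue of the Tee element: duplicate the result stream for two consumers.
branch_a, branch_b = tee(results)
out_a = list(branch_a)
out_b = list(branch_b)
```

Both branches see the same two results, which is the point of the Tee element in the example.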
DISPEL – real use
ADMIRE’s High-Level Architecture
ADMIRE Gateways
USMT
Security
• Framework built on top of a formal Grid infrastructure; available security mechanisms include:
  – Transport-level security: SSL, HTTPS (currently available)
  – Message-level security: Web Services Security: SOAP Message Security
  – X.509 certificate authentication
  – Multiple stakeholder authorization
  – Explicit Trust Delegation (ETD)
Pilot Applications
• ADMIRE has 2 pilot applications
  – CRM
  – FloodApp
• FloodApp
  – Orava
  – Radar
  – SVP
ACRM Application
• Large-scale, distributed churn scenario
  – 4 database parts, distributed among ADMIRE partners
  – Graphical UI for business analysts
  – Using the ADMIRE workbench, DISPEL and framework to create predictions of customer churn
• Mining over distributed data
Flood Application: data sets used in hydrological scenarios
Dataset | Domain | Description | Volume | Temporal coverage | Spatial coverage
HUSAV | Hydrology | Data from two probes, containing water saturation of soil | 10s of MB | 1998-2007 | Two distinct points
MARS | Meteorology | Historical meteorological data (temperature, rainfall, etc.) for Slovakia | 100s of MB | 1975-2007 | Slovakia (grid 50x50 km)
SVP | Hydrology | Data from waterworks in western Slovakia (mainly river Váh) – outflows, water levels, temperature, rainfall | 100s of MB | 1998-2007 | 15 distinct waterworks
DAISY | Pedology | Various pedological parameters for one probe in southern Slovakia | 10s of MB | 1961-2000 | One point
WOFOST | Pedology | Crop data (with attached soil and meteorological data) for Slovakia, year 2006 | 10s of MB | 2006 | Slovakia (grid)
SHMU_CURR | Meteorology | On-line database of meteorological data – copied from SHMI web; including radar imagery | 10s of GB + | 2008- | Slovakia (about 100 distinct probes)
SHMU_HIST | Meteorology | Historical meteorological data from SHMI probes | 100s of MB | 1998-2007 | Slovakia (more than 100 distinct probes)
SHMU_GRIB | Meteorology | Historical temperatures and rainfall amounts in a gridded binary format | 100s of GB | 1998-2007 | Slovakia (grid, various sizes)
RADAR | Meteorology | Weather radar imagery | 100s of GB | 2005-2008 | Slovakia
SHMU_HYDRO | Hydrology | Historical data from hydrological measurement stations | 10s of MB | 1998-2007 | Orava and upper Vah river
SOIL_RET | Pedology | Water retention capacities of soil | 10s of MB | current (no time series applicable) | Vah river watershed area
Scenarios deployment in testbed
• Two scenarios (ORAVA, RADAR) completely deployed in the testbed
• Data for the other scenarios are partially deployed
• 5 nodes (1 real + 4 virtual nodes)
• Databases (MySQL + PostgreSQL), GRIB files in file storage
• USMT (Unified System Management Technology, Jetty container), OGSA-DAI (Apache Tomcat)
Orava scenario
• Legend
  – Green area – Orava (part of north Slovakia)
  – Blue – Orava reservoir and local rivers
  – Red dots – hydrological measurement stations
• Notes
  – We are interested only in hydrological stations below the Orava reservoir
  – In our tests we will use the hydrological station 5830 (Tvrdosin)
ORAVA – data mining concept
• Predictors – rainfall amount (reservoir and station), air temperature (reservoir and station), reservoir discharge, reservoir temperature
• Targets – water level and temperature at a station below the reservoir
Time | Water temp Orava | Rainfall Orava | Air temp Orava | Air temp Station | Rainfall Station | Outflow Orava | Water level Station | Water temp Station
T-4 | E-4 | R-4 | A-4 | B-4 | S-4 | D-4 | X-4 | Y-4
T-3 | E-3 | R-3 | A-3 | B-3 | S-3 | D-3 | X-3 | Y-3
T-2 | E-2 | R-2 | A-2 | B-2 | S-2 | D-2 | X-2 | Y-2
T-1 | E-1 | R-1 | A-1 | B-1 | S-1 | D-1 | X-1 | Y-1
T | E | R | A | B | S | D | X | Y
T+1 | - | R+1 | A+1 | B+1 | S+1 | D+1 | X+1 | Y+1
T+2 | - | R+2 | A+2 | B+2 | S+2 | D+2 | X+2 | Y+2
T+3 | - | R+3 | A+3 | B+3 | S+3 | D+3 | X+3 | Y+3
T+4 | - | R+4 | A+4 | B+4 | S+4 | D+4 | X+4 | Y+4
T+5 | - | R+5 | A+5 | B+5 | S+5 | D+5 | X+5 | Y+5
T+6 | - | R+6 | A+6 | B+6 | S+6 | D+6 | X+6 | Y+6
Legend for the future rows (T+1 onward): predicted by a meteo model; given in a schedule; targets of data mining
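The time-offset layout of this slide can be read as a sliding window over the measurement series. The following is a minimal sketch of that reading, not the project's code: the function name `lagged_rows`, the window sizes, and the toy data are all assumptions made for illustration.

```python
def lagged_rows(series, past=4, future=6):
    """Build one row per time step T from a dict of equally long lists.

    Each row holds, for every column, the values at T-past .. T+future,
    keyed like "R-4", "R", "R+1" (same naming as on the slide).
    """
    n = len(next(iter(series.values())))
    rows = []
    for t in range(past, n - future):
        row = {}
        for name, values in series.items():
            for offset in range(-past, future + 1):
                key = name if offset == 0 else f"{name}{offset:+d}"
                row[key] = values[t + offset]
        rows.append(row)
    return rows


# Toy data: two columns with 20 time steps each (values are made up).
data = {"R": list(range(20)), "X": [v * 10 for v in range(20)]}
rows = lagged_rows(data)
```

A real pipeline would additionally drop the future values of columns that are not available ahead of time.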
ORAVA – data integration
• Integration of data from
  – GRIB files
  – Reservoirs
• Inputs
  – Time period of experiment
  – Reservoir ID
  – List of hydro stations
  – Geo coordinates
ORAVA – data sets
Dataset | Domain | Description | Volume | Temporal coverage | Spatial coverage
SVP | Hydrology | Data from waterworks in western Slovakia (mainly river Váh) – outflows, water levels, temperature, rainfall | 100s of MB | 1998-2007 | 15 distinct waterworks
SHMU_CURR | Meteorology | On-line database of meteorological data – copied from SHMI web; including radar imagery | 10s of GB + | 2008- | Slovakia (about 100 distinct probes)
SHMU_HIST | Meteorology | Historical meteorological data from SHMI probes | 100s of MB | 1998-2007 | Slovakia (more than 100 distinct probes)
SHMU_GRIB | Meteorology | Historical temperatures and rainfall amounts in a gridded binary format | 100s of GB | 1998-2007 | Slovakia (grid, various sizes)
SHMU_HYDRO | Hydrology | Historical data from hydrological measurement stations | 10s of MB | 1998-2007 | Orava and upper Vah river
ORAVA – integrated and preprocessed data
Filters applied: ZeroEpsilon, LinearTrend, ReplaceMissingValues, Kelvin2Celsius

Integrated raw data (one row per time step; "-" marks a value absent from the raw data)
Water_temp Orava | Air_temp Orava | Rainfall Orava | Outflow Orava | Rainfall Station | Air_temp Station | Flow/Height Station | Water_temp Station
- | -4 | - | 30 | -5.55E-20 | 269.0278 | 28 | 0.7
1 | -4 | - | 30 | -5.55E-20 | 269.0476 | 28.62 | 0.7
- | -5 | - | 30 | -4.24E-20 | 269.5059 | 28.62 | 0.7
- | -5 | - | 30 | -8.47E-20 | 270.2394 | 28.62 | 0.7
- | -5 | - | 30 | -8.47E-20 | 270.8507 | 28 | 0.7
- | -3 | - | 50 | -8.47E-20 | 271.2792 | 28 | 0.7
- | -3 | - | 50 | -8.47E-20 | 271.9238 | 28 | 0.8

Integrated preprocessed data (one row per time step)
Water_temp Orava | Air_temp Orava | Rainfall Orava | Outflow Orava | Rainfall Station | Air_temp Station | Flow/Height Station | Water_temp Station
1.0 | -4.0 | 0.0 | 30.0 | 0.0 | -3.12223 | 28.0 | 0.7
1.0 | -4.0 | 0.0 | 30.0 | 0.0 | -3.1024 | 28.62 | 0.7
0.995833 | -5.0 | 0.0 | 30.0 | 0.0 | -2.64408 | 28.62 | 0.7
0.991667 | -5.0 | 0.0 | 30.0 | 0.0 | -1.91062 | 28.62 | 0.7
0.9875 | -5.0 | 0.0 | 30.0 | 0.0 | -1.29926 | 28.0 | 0.7
0.983333 | -3.0 | 0.0 | 50.0 | 0.0 | -0.87076 | 28.0 | 0.7
0.979167 | -3.0 | 0.0 | 50.0 | 0.0 | -0.22617 | 28.0 | 0.8
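The four filters named on this slide are not specified further, so the following is only a sketch of what such preprocessing steps could look like. All function names, the epsilon threshold, the fill value, and the reading of LinearTrend as linear interpolation across gaps are assumptions. The Kelvin conversion uses the standard offset of 273.15; the slide's own numbers (269.0278 becoming -3.12223) differ from that by about one degree, so the original filter may have used another constant.

```python
def zero_epsilon(values, eps=1e-9):
    """Replace numerically negligible values (|v| < eps) by exactly 0.0."""
    return [0.0 if v is not None and abs(v) < eps else v for v in values]


def kelvin_to_celsius(values, offset=273.15):
    """Convert Kelvin to Celsius; missing values (None) stay missing."""
    return [None if v is None else v - offset for v in values]


def replace_missing(values, fill=0.0):
    """Substitute missing values (None) with an assumed fill value."""
    return [fill if v is None else v for v in values]


def linear_trend(values):
    """Fill gaps between known values by linear interpolation.

    Gaps before the first or after the last known value are left as None.
    """
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        step = (out[b] - out[a]) / (b - a)
        for i in range(a + 1, b):
            out[i] = out[a] + step * (i - a)
    return out


rain = zero_epsilon([-5.55e-20, 30.0])
temps = kelvin_to_celsius([273.15, None])
filled = replace_missing([None, 2.0])
trend = linear_trend([1.0, None, None, 0.7])
```

Applied column by column, steps like these would turn a raw table with gaps and tiny non-zero noise into a complete table suitable for model training.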
ORAVA – data mining
• Input: integrated data
• Data mining phases:
  – Data understanding
    • Data visualization
    • Data quality exploration
  – Data preparation
    • Missing values substitution (ReplaceMissingValues filter)
    • Noise reduction (ZeroEpsilon filter)
    • Switching from one scale to another (Kelvin2Celsius filter)
    • Data modifying (LinearTrend filter)
  – Model training
    • Training on historical data (8760 records)
    • Linear Regression model
    • Neural networks - multilayer perceptron without hidden layers
  – Model evaluation
    • Testing of the trained model
    • N-fold cross validation
    • Using training sets
• Output: prediction model
Workflow diagram stages: Integrated Data, Data Preparation, Data Visualization, Data Cleaning, Clean Data, Model Training, Model Visualization, Model Evaluation, Prediction Model
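N-fold cross validation, mentioned in the evaluation phase, splits the records into N parts and uses each part once for testing while training on the rest. A minimal sketch of the index bookkeeping follows; the function name `k_fold_indices`, the choice of 10 folds, and the use of contiguous unshuffled folds are assumptions, and data mining toolkits typically randomize the order first.

```python
def k_fold_indices(n_records, k):
    """Yield (train, test) index lists for k contiguous folds.

    Fold sizes differ by at most one record when k does not divide n_records.
    """
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_records))
        yield train, test
        start += size


# 8760 records as on the slide; 10 folds is an assumed choice.
folds = list(k_fold_indices(8760, 10))
```

Every record appears in exactly one test fold, so the averaged error covers the whole data set.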
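Orava – data mining results: prediction of temperature
• Linear Regression model equation: Water_temp (station) is a combination of the following terms (the signs between the terms are not recoverable from this transcript):
    0.6473 × Water_temp (Orava)
    0.0239 × Air_temp (Orava)
    0.0359 × Rainfall (Orava)
    0.0055 × Outflow (Orava)
    0.0418 × Rainfall (station)
    0.0117 × Air_temp (station)
    0.0503 × Flow (station)
    2.4324 (constant term)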
Orava – temperature prediction model comparison
Model properties | Linear regression | Multilayer perceptron
Correlation coefficient | 0.9639 | 0.9821
Mean absolute error | 1.1791 | 0.7748
Root mean squared error | 1.4607 | 1.0386
Relative absolute error | 23.8739 % | 15.6884 %
Root relative squared error | 26.609 % | 18.9195 %
Total number of instances | 8760 | 8760

Validation data | Linear regression: predicted | Linear regression: error | Multilayer perceptron: predicted | Multilayer perceptron: error
11.6 | 13.071 | 1.471 | 12.446 | 0.846
15.2 | 14.335 | -0.865 | 14.494 | -0.706
6.4 | 7.614 | 1.214 | 5.766 | -0.634
0.7 | 2.284 | 1.584 | 0.926 | 0.226
11.7 | 10.948 | -0.752 | 10.266 | -1.434
14.3 | 16.526 | 2.226 | 13.671 | -0.629
15.6 | 12.891 | -2.709 | 14.502 | -1.098
15.7 | 12.838 | -2.862 | 13.353 | -2.347
0.8 | 1.752 | 0.952 | 0.826 | 0.026
15.8 | 15.188 | -0.612 | 14.005 | -1.795
15.4 | 16.553 | 1.153 | 13.129 | -2.271
14.9 | 12.795 | -2.105 | 14.599 | -0.301
15.4 | 15.660 | 0.260 | 13.696 | -1.704
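The error measures compared on this slide can be computed from pairs of actual and predicted values. The sketch below uses the common textbook definitions, in which the relative errors compare the model against always predicting the mean of the actual values; whether the original tool used exactly these definitions is an assumption. The sample uses the first four validation rows of the linear regression model from the slide, so its results are not comparable with the figures for all 8760 instances.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return correlation, MAE, RMSE, RAE and RRSE for paired value lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    errors = [p - a for a, p in zip(actual, predicted)]

    abs_err = sum(abs(e) for e in errors)
    sq_err = sum(e * e for e in errors)
    # Baseline: error of always predicting the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)

    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "errors": errors,
        "corr": cov / sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": sqrt(sq_err / n),
        "rae": abs_err / base_abs,
        "rrse": sqrt(sq_err / base_sq),
    }


# First four validation rows from the slide (linear regression model).
actual = [11.6, 15.2, 6.4, 0.7]
predicted = [13.071, 14.335, 7.614, 2.284]
m = regression_metrics(actual, predicted)
```

The `errors` entry reproduces the slide's error column, which is the predicted value minus the validation value.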
Orava – prediction of water level
• Neural network model – multilayer perceptron
• Input parameters (6)
  – Rainfall ([S+1]), Water-Level ([X])
  – Outflows ([D], [D+1] – [D], ln([D]), sqrt([D]))
• Output – Difference of water level ([X+1] – [X])
Time | Water temp Orava | Rainfall Orava | Air temp Orava | Air temp Station | Rainfall Station | Outflow Orava | Water level Station | Water temp Station
T-3 | E-3 | R-3 | A-3 | B-3 | S-3 | D-3 | X-3 | Y-3
T-2 | E-2 | R-2 | A-2 | B-2 | S-2 | D-2 | X-2 | Y-2
T-1 | E-1 | R-1 | A-1 | B-1 | S-1 | D-1 | X-1 | Y-1
T | E | R | A | B | S | D | X | Y
T+1 | - | R+1 | A+1 | B+1 | S+1 | D+1 | X+1 | Y+1
T+2 | - | R+2 | A+2 | B+2 | S+2 | D+2 | X+2 | Y+2
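The six inputs and the difference target listed on this slide can be assembled per time step as follows. This is a sketch only: the function name `water_level_sample` and the sample numbers are made up for illustration, and the series are assumed to be plain lists indexed by time step.

```python
from math import log, sqrt


def water_level_sample(S, D, X, t):
    """Return (inputs, target) for time index t.

    S: station rainfall, D: reservoir outflow, X: station water level.
    The outflow must be positive because of the logarithm.
    """
    inputs = [
        S[t + 1],         # rainfall at T+1
        X[t],             # current water level
        D[t],             # current outflow
        D[t + 1] - D[t],  # change of outflow to the next step
        log(D[t]),        # ln of the outflow
        sqrt(D[t]),       # square root of the outflow
    ]
    target = X[t + 1] - X[t]  # difference of water level
    return inputs, target


# Made-up illustrative values, three time steps.
S = [0.0, 0.2, 0.0]
D = [4.0, 9.0, 9.0]
X = [28.0, 28.5, 28.25]
inputs, target = water_level_sample(S, D, X, 0)
```

Predicting the difference rather than the level itself means the predicted level is recovered by adding the model output to the current level.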
Orava – water level prediction
• Data count: 8735 records
• Activation function of the feed-forward neural network: sigmoid
• Correlation coefficient: 0.9816
• Mean absolute error: 0.4105
• Root mean squared error: 0.9673
• Relative absolute error: 30.5869 % (from difference)
• Root relative squared error: 19.2384 % (from difference)
RADAR
• Very short-term rainfall prediction from weather radar data
  – Movement of areas with higher air moisture content, and thus also higher precipitation potential
• Mining of matrices of data
Time | Potential precipitation (RADAR) | Measured precipitation (STATION) | Temperature (MODEL) | Wind (MODEL)
T-3 | R-3 | S-3 | H-3 | W-3
T-2 | R-2 | S-2 | H-2 | W-2
T-1 | R-1 | S-1 | H-1 | W-1
T | R | S | H | W
T+1 | R+1 | S+1 | - | -
T+2 | R+2 | S+2 | - | -
Legend: targets of data mining
Meteorological data
• Network of synoptic stations in Slovakia
  – 27 stations in Slovakia
  – Data used from the years 2007 and 2008
  – Rainfall, humidity, atmospheric pressure and temperature values for each hour
RADAR isotonic model
• Actual model for rainfall prediction
  – Isotonic regression model structure
  – Training on historical data
  – Correlation coefficient: 0.4593
  – Mean absolute error: 0.1105
  – Root mean squared error: 0.5490
  – Total number of instances: 89700
  – Validation: 10-fold cross-validation
Table of isotonic model
index | prediction (rainfall) | cut point (reflectivity) || index | prediction (rainfall) | cut point (reflectivity) || index | prediction (rainfall) | cut point (reflectivity)
1 | 0.01 | 1.78 || 15 | 0.23 | 96.91 || 29 | 1.35 | 355.91
2 | 0.03 | 1.84 || 16 | 0.28 | 97.47 || 30 | 1.40 | 377.19
3 | 0.03 | 8.28 || 17 | 0.30 | 129.63 || 31 | 1.52 | 381.78
4 | 0.03 | 16.97 || 18 | 0.33 | 129.72 || 32 | 2.13 | 395.31
5 | 0.03 | 24.28 || 19 | 0.42 | 147.94 || 33 | 2.23 | 399.16
6 | 0.03 | 36.91 || 20 | 0.44 | 168.59 || 34 | 2.28 | 447.06
7 | 0.05 | 37.53 || 21 | 0.50 | 187.13 || 35 | 2.60 | 447.69
8 | 0.05 | 38.72 || 22 | 0.51 | 187.47 || 36 | 2.60 | 467.66
9 | 0.06 | 44.53 || 23 | 0.62 | 211.56 || 37 | 2.98 | 515.19
10 | 0.07 | 59.03 || 24 | 0.72 | 268.38 || 38 | 3.75 | 625.56
11 | 0.08 | 61.16 || 25 | 0.93 | 281.28 || 39 | 4.93 | 665.41
12 | 0.10 | 61.78 || 26 | 1.00 | 297.72 || 40 | 5.24 | 901.25
13 | 0.14 | 81.59 || 27 | 1.14 | 314.47 || 41 | 5.40 | 934.41
14 | 0.19 | 89.22 || 28 | 1.26 | 344.59 || 42 | 6.30 | 971.5
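An isotonic model of this kind is a non-decreasing step function, so a prediction is a table lookup. The sketch below encodes the 42 rows of the table and looks up the rainfall prediction for a given radar value. The table alone does not say on which side of a cut point each prediction applies; this sketch assumes a prediction holds from its cut point up to the next one, and that inputs below the first cut point get the first prediction. The function name `predict_rainfall` is hypothetical.

```python
from bisect import bisect_right

# (cut point, prediction) pairs copied from the table on this slide.
TABLE = [
    (1.78, 0.01), (1.84, 0.03), (8.28, 0.03), (16.97, 0.03), (24.28, 0.03),
    (36.91, 0.03), (37.53, 0.05), (38.72, 0.05), (44.53, 0.06), (59.03, 0.07),
    (61.16, 0.08), (61.78, 0.10), (81.59, 0.14), (89.22, 0.19), (96.91, 0.23),
    (97.47, 0.28), (129.63, 0.30), (129.72, 0.33), (147.94, 0.42), (168.59, 0.44),
    (187.13, 0.50), (187.47, 0.51), (211.56, 0.62), (268.38, 0.72), (281.28, 0.93),
    (297.72, 1.00), (314.47, 1.14), (344.59, 1.26), (355.91, 1.35), (377.19, 1.40),
    (381.78, 1.52), (395.31, 2.13), (399.16, 2.23), (447.06, 2.28), (447.69, 2.60),
    (467.66, 2.60), (515.19, 2.98), (625.56, 3.75), (665.41, 4.93), (901.25, 5.24),
    (934.41, 5.40), (971.5, 6.30),
]
CUTS = [cut for cut, _ in TABLE]
PREDS = [pred for _, pred in TABLE]


def predict_rainfall(reflectivity):
    """Return the prediction of the last cut point not above the input."""
    i = bisect_right(CUTS, reflectivity) - 1
    return PREDS[max(i, 0)]  # clamp inputs below the first cut point
```

Because both columns are sorted, a binary search is enough; no arithmetic on the input is involved.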
Hydrometeorological performance
Probability of detection (POD) with thresholds of 0.3 and 0.6 mm rainfall per hour:
• POD0.3 = 63.87 %
• POD0.6 = 56.22 %
Miss rate (MR) with thresholds of 0.3 and 0.6 mm rainfall per hour:
• MR0.3 = 1.85 %
• MR0.6 = 1.58 %
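Probability of detection is commonly defined as hits divided by hits plus misses, where an event is an hour with observed rainfall at or above the threshold. The sketch below follows that common definition; treating the threshold as inclusive is an assumption, the sample values are made up, and the slide does not state how its miss rate is defined, so it is not computed here.

```python
def probability_of_detection(observed, predicted, threshold):
    """POD = hits / (hits + misses) over hours with observed >= threshold.

    Returns None when no event was observed at all.
    """
    hits = 0
    misses = 0
    for obs, pred in zip(observed, predicted):
        if obs >= threshold:          # an event actually occurred
            if pred >= threshold:
                hits += 1             # event was also predicted
            else:
                misses += 1           # event was not predicted
    if hits + misses == 0:
        return None
    return hits / (hits + misses)


# Made-up hourly rainfall values in mm.
observed = [0.0, 0.4, 0.7, 1.2, 0.1]
predicted = [0.1, 0.5, 0.2, 0.9, 0.0]
pod_03 = probability_of_detection(observed, predicted, 0.3)
pod_06 = probability_of_detection(observed, predicted, 0.6)
```

Raising the threshold leaves fewer observed events, which is why the two POD figures on the slide differ.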
RADAR model
• Other tested models
  – Neural networks, SMOreg, linear regression, ...
  – Reached correlation coefficients between 0.35 and 0.42
  – Validation: 10-fold cross-validation
Problems in model creation:
  – The process is significantly stochastic
  – Some input variables are backwards dependent on the output
  – The meteorological process is very sensitive
  – The reflection matrix represents the quantity of water in the atmosphere, not the exact rainfall rate in a specified area, as opposed to data from synoptic stations
ADMIRE Tools
• Registry client GUI
• Process designer
• SKSA
• Gateway Process Manager
• DMI Model Visualizer
Registry client GUI
• Read-only access to ADMIRE Registry
  – list PEs and view their properties
  – search, sort PEs
• Write access to Registry is done via DISPEL documents
Process Designer
Manage your DMI project (files, directories – project structure)
Edit your DMI process graphically
View the canonical (DISPEL) representation of your DMI process in real time
Select elements from the Registry
View the properties of your chosen elements
Semantic Knowledge Sharing Assistant
• Context the user works in
  – Several reservoirs, one settlement
• Knowledge that may be useful in this context
  – previously entered by other users
Provides access to existing user knowledge, sorting and selecting it automatically according to the user’s current working context
Gateway Process Manager
• Keep track of running processes
  – stop/pause/cancel the process
  – view the process’ source DISPEL
• Access the process’ results (if available) in several ways – raw or visualized
DMI Model Visualizer
• Visualization of data mining models
  – Read a Weka classifier object
  – Produce a PMML (Predictive Model Markup Language) description of the model
  – Show the PMML as a graphical tree