Как приготовить тестовые данные для Big Data проекта....
description
Transcript of Как приготовить тестовые данные для Big Data проекта....
Как приготовить тестовые данные для Big Data проекта. Пример из практики.
Dmitriy Romanov. Itera Consulting UA
Part 1. Introduction
Part 2. Data Sampling Approach Description
Part 3. Data Sampling Tool Implementation
List of Contents
Testing BI and DWH07.03.2013
ETL and DWH
CSV
STG2STG1 DWHData-mart
Cube
BISourcesystem
Mapping & report logic
DB2 ETL
File
Report
Report
Sources of Errors
ETL ETL
How we test ETL processes ?
Test Data DSTSRC
ETL
Error ?
CHECK TYPES
Row Count
Field Values
Aggregations Validation
…
Extraction – all data extracted correctlyTransformation – works as expectedLoad – whole data arrived to destination
Testing BI and DWH07.03.2013
How to get Test Data
“Random”Production Data
Subset
Full sized Production
Data
Sampled
Production data
Coverage
Preparation Cost orExecution Time
GeneratedTest Data
Part 1
Data Sampling Approach description
8
Data sampling. Main idea
"Compress" similar data in source dataset and take for further analysis only "unique" rows
Clustering is one of the data mining techniques, which helps in grouping of most similar rows
9
Data sampling. Clustering
1. Three main data types in sources- Numeric data- Categorical (unordered) data- Linked data
2. At the start numeric and categorical data sets can be split into proportional parts (quartiles).
3. Linked data requires linked content analyses and currently treated as unordered data
Data sampling. Defining quartiles
Partitioning can be done with T-SQL NTILE function.Example for ordered data
With some limitations such method also can be used for categorical data too:
FieldVal Qrt. FieldVal Qrt. FieldVal Qrt. FieldVal Qrt. FieldVal Qrt. FieldVal Qrt. FieldVal Qrt. FieldVal Qrt.
Ansvar 1 Delkasko 2 Innbo 3 Leiebil 4Naturskd Bed 5
Næringsbygg 6 Standard 7
Utvidet m/UP 8
Bedrift 1 Død 2 Invaliditet 3 Med. invalid 4Naturskd Ko 5
Panthaver 6 Super 7 Verdisak 8
Brann Tyveri 1 Død/bruk 2 Kasko 3 MLV 4
Naturskd MLV 5
Reise/Helse 6
Till. k.inbo 7
Veterinær 8
Brann/Tyveri 1
Førerulykke 2 Kontor 3 Naturskade 4
Naturskd NBy 5
Reisedekning 6
Utgår Syk 7
Yrkesskade 8
11
Data sampling. Quartiles properties.
Every quartile can be described by:
Quartiles definitions Examples:
Quartile Attribute Description
Qmin Minimal value in quartile
Qmed Median value in quartile
Qmax Maximal value in quartile
Records count records in original dataset belonged to quartile
"Quartilized" view of source data.
- Quartiles should be defined for all fields in the source table- Every field value should be replaced by quartile code which this value belongs to- Quartiles Minimum, Maximum, Median values should get own code
Example :Q1 – value from quartile #1 QnMin/QnMax/QnMed – fields with Min/Max/Med values for N quartile correspondently
Col001 Col002 Col003 Col004 Col005 Col006 Col007 Col008 Col009 Col010 Col011 Col012Q1 Q2Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1MinQ1 Q4Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1Q1 Q4Min Q0 Q2 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1MinQ1 Q2Min Q0 Q1 Q1Min Q1Min Q1Med Q1Med Q1 Q1 Q1Min Q1Q1 Q4Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1Q1 Q2Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1Q1 Q2Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1MinQ1 Q4Min Q0 Q2 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1Q1 Q4Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1Min
Representative set built from clusters
- For more reliable result set must contain all (if exists) Qmin, Qmed, Qmax values for every quartile. - All other values can be ignored in further analysis and will be replaced by Q0 (other) code.- Each unique combination of quartile codes forms cluster
Data sampling. Clustered view resulting set
Cluster Col001 Col002 Col003 Col004 Col005 Col006 Col007 RecCNT1 Q0 Q4Min Q0 Q0 Q1Min Q1Min Q1Med 1512 Q0 Q4Med Q0 Q0 Q1Min Q1Min Q1Med 23 Q0 Q4Min Q0 Q0 Q1Min Q1Min Q1Med 284 Q0 Q2Min Q0 Q0 Q1Min Q1Min Q1Med 155 Q0 Q2Min Q0 Q3Min Q1Min Q1Min Q1Med 126 Q0 Q2Min Q0 Q0 Q1Min Q1Min Q1Med 117 Q0 Q2Min Q0 Q4Max Q1Min Q1Min Q1Med 1
Perform grouping of similar records and calculate how many rows each cluster contains (RecCNT column)
15
Data sampling. Storing resulting set
Take only one row from every cluster and place it to Resulting set
Resulting set can be stored in following forms:- Set of "INSERT" statements
- Cut original CSV file (if available) and left only one row from every cluster
DELETE from J0.STG1_P_LIV_SALDOGOinsert J0.STG1_P_LIV_SALDO(Col001,Col002,Col003,Col004,Col005,Col006,Col007,Col008,Col009,Col010,Col011,Col012,Col013,Col014,Lastet_dat) values('0A319121','20','20360',null,null,'20','2A','00003207H','320040704','30115335797',null,'000000B','30115335797','02','May 3 2012 1:26PM')…
Part 2
Data Sampling Tool Implementation
17
Data sampling tool. Main idea
1. Keep things as simple as possible
- Dynamic SQL source generation, parameterization widely used
- Results of each step are fixed as SQL statement (level of code
understandability/control remain high as possible)
- Fully automated partitioning/clustering/filtering.
2. Break down process into independent parts. Keep in mind possibility of
future extension and customization
3. All data manipulations performed on T-SQL and massive data tasks
delegated to the SQL Server
18
Data sampling tool. Brief overview
- Control logic implemented on Powershell, database part on T-SQL
- Analyze data from Stage1 table and CSV files
- Received number of quartiles/column as parameter
- Generate quartiles definitions and build clusters
- Only one record from cluster goes into representative set
- Representative set can be stored in:
a) Set of insert SQL statements
b) Copy of source text file with only meaningful rows
19
Data sampling tool. Structure
Create clustered view of source data
Analyze STG1 table
Filter & Store result
STG1 Table nameNumber of quartiles
List of quartiles{XML}
List of quartiles{XML}
List of RowID's{FILE}
List of RowID's{FILE}
Inserts_to_stg1_tbl.sqlOr<Copy of source file with only lines corresponded to RowID's list>
Input Process Output
Grouping similar records to clusters
Source for "view"{SQL}
Source for "view"{SQL}
Quartile Definition Example
"Clusterised" View Example
Saving Results Example
Results Example
24
Data sampling tool. Structure
Create clustered view of source data
Analyze STG1 table
Filter & Store result
STG1 Table nameNumber of quartiles
List of quartiles{XML}
List of quartiles{XML}
List of RowID's{FILE}
List of RowID's{FILE}
Inserts_to_stg1_tbl.sqlOr<Copy of source file with only lines corresponded to RowID's list>
Input Process Output
Grouping similar records to clusters
Source for "view"{SQL}
Source for "view"{SQL}
Data sampling tool. Execution results
Original datasetTable Name: STG1_GEN_ORDRE_TRecords : 11,912,182Columns: 35
2 4 8 12 160
20000400006000080000
100000
Result Set Row Count
2 4 8 12 160
100
200
300
400
500
Total Quartile Count
2 4 8 12 160
50
100
150
200
250
300
Time (min)
Qrt/ColumnTotal
Quartiles Count
Result Set Row Count
Time
2 67 8754 86
4 126 19459 95
8 242 32231 100
12 346 57258 143
16 436 83605 255
Questions ?