厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica,...

17
学校编号:10384 分类号 密级 号:B200431001 UDC 厦门大学工学博士学位论文 支持数据仓库国际化的ETL技术中若干关 键问题研究和实践 许 威 指导教师 李茂青 教授 厦门大学自动化系 申请学位级别 博士 控制理论与工程 论文提交日期 2007. 06 论文答辩日期 2007. 07 学位授予单位 厦门大学 答辩委员会主席 教授 评阅人 教授 教授 教授 教授 教授 200706厦门大学博硕士论文摘要库

Transcript of 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica,...

Page 1: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

学校编号:10384 分类号 密级

学 号:B200431001 UDC

厦门大学工学博士学位论文

支持数据仓库国际化的ETL技术中若干关

键问题研究和实践

许 威

指导教师 李茂青 教授 厦门大学自动化系

申请学位级别 博士 专 业 名 称 控制理论与工程

论文提交日期 2007. 06 论文答辩日期 2007. 07

学位授予单位 厦门大学

答辩委员会主席 教授

评阅人 教授 教授 教授

教授 教授

2007年06月

厦门大学博硕士论文摘要库

Page 2: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

Research and Implementation on a number of key issues of

Data Warehouse ETL technology Internationalization

A Thesis

Submitted to the Graduate School in Partial Fulfillment

of the Requirements for the Degree of

Doctor of Philosophy

By

Wei Xu

Directed by Prof. MaoQing Li

Department of Automation, Xiamen University

June, 2007 厦门大学博硕士论文摘要库

Page 3: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

厦门大学学位论文原创性声明

兹呈交的学位论文,是本人在导师指导下独立完成的研究成果。本人在论文

写作中参考的其他个人或集体的研究成果,均在文中以明确方式标明。本人依法

享有和承担由此论文产生的权利和责任。

声明人(签名):

年 月 日

厦门大学博硕士论文摘要库

Page 4: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

厦门大学学位论文著作权使用声明

本人完全了解厦门大学有关保留、使用学位论文的规定。厦门大学有权保留

并向国家主管部门或其指定机构送交论文的纸质版和电子版,有权将学位论文用

于非赢利目的的少量复制并允许论文进入学校图书馆被查阅,有权将学位论文的

内容编入有关数据库进行检索,有权将学位论文的标题和摘要汇编出版。保密的

学位论文在解密后适用本规定。

本学位论文属于

1.保密(),在年解密后适用本授权书。

2.不保密( )

(请在以上相应括号内打“√”)

作者签名: 日期: 年 月 日

导师签名: 日期: 年 月 日

厦门大学博硕士论文摘要库

Page 5: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

摘要

I

摘 要

伴随着数据仓库的深入应用和中国企业国际化进程的深化,如何对来

自国内外的客户数据的进行有效的管理和应用已经成为关系到成长中的中

国跨国企业进一步推进自身信息化建设的重要关键问题, 本文开展支持数

据仓库系统国际化的一系列关键技术研究,一方面探索数据分析和应用过

程中需要解决的一系列技术难题,如数据抽取和存储过程中常见的多国语

言字符集兼容性,构建健壮性的数据装载体系,如何对多数据源 ETL 加载

数据的过程进行统一控制和事后审计处理等问题,另一方面本论文选择了

目前比较流行的商务智能产品 Informatica、Oracle、Teradata 为基础平

台构建一个原型库,希望以其为切入点,可以进一步推进数据仓库技术的

相关理论和应用研究,有助于推广该技术从而满足企业全球化进程中面向

客户需求和切入国际市场的要求。为有兴趣部署和实施类似系统的单位/组

织提供一个技术实践。

本论文开展支持数据仓库系统国际化的一系列关键技术研究,主要取

得了以下几个方面的主要成果:

第一、 为解决源数据和目标数据库的字符集转换的问题,本论文特

别提出了将多语言字符集数据装载技术(MEETL)应用于数据

库和装载工具的数据流加工,将该方法引向新的重要领域从

而在国际上占有一席之地;

第二、 对源和目标数据库以及装载工具之间的字符集转换的问题进

行探索;在 MEETL 技术的基础上提出并证明了正确移动数据

条件下的 ETL 字符集兼容性定理(Character Set Conversion

Compatible Theorem,简称 CSCCT)和代码页兼容性定理(Code

Page Compatible Theorem 简称 CPCT)。

第三、 提出改进的 UTF-8 与编码 Unicode 转换算法

UTF8DecodeUnicode,并通过引入自定义 UTF 字符集转换函数

的办法,绕过 Teradata 数据库自身的字符集转换机制,从而

厦门大学博硕士论文摘要库

Page 6: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

厦门大学工学博士学位论文

II

解决了从 Oracle 源数据到 Teradata 目标数据仓库 ETL 过程

中在处理某些非兼容性 Unicode 字符时产生 6705 错误的问

题。

第四、 通过实现一个具体的应用模型系统,完成了多数据源 ETL 加

载数据的控制处理。本文提出了三种数据处理控制技术:过

程同步控制(Process Synchronization Control,简称 PSC)、

时间范围控制(Time Range Control,简称 TRC)和通用审

计控制(Common Audit Control,简称 CAC)。它们能够优

化 ETL 架构、提高数据装载的效率、并且定义审计规则和保

存审计历史数据用于日后分析和跟踪调错,防止错误数据载

入数据仓库的情况发生。

第五、 建立具有高处理能力和高扩展能力的数据仓库平台,以适应

处理装载日益庞大市场数据的要求。这就要求有一套能够缩

短生产系统的运行时间和占用更少的系统资源的数据装载体

系架构。现有的大部分数据仓库系统所有的生产程序使用相

同的目标数据库用户名/密码,这样不仅系统安全性会有问

题,同时由于用户名相同,不同的生产程序对系统资源的占

用也存在相同的优先级。新的设计思路是不同主题域的生产

程序运行时可以使用与该主题域相对应的账户,减少生产程

序之间运行过程的相互依赖性反过来也就保证了生产系统加

载过程的稳定性。

关键词:ETL;国际化;统一字符编码

厦门大学博硕士论文摘要库

Page 7: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

Abstract

III

ABSTRACT

Chinese Companies today have an unprecedented opportunity to

optimize and expand their business exponentially. With the advent

of the worldwide web internet and intranet our data warehouse

applications have the possibility of global exposure. But how to

effectively manage the domestic and foreign data is a crucial problem

for those fast growing Chinese enterprise’s IT construction. This

paper will focus on a series of key technology research of data

warehousing globalization, such as multi-language character set

compatibility, robust ETL data system and unified control and audit

mechanism derived from multiple autonomous data sources. On the other

hand, we have chosen some popular business intelligence products such

as Informatica, Oracle and Teradata to build up a prototype platform.

We hope our practice will be helpful to data warehouse’s theoretical

research and application, expect our deliverables to satisfy

enterprise’s need of meeting the global market requirements and have

been looking forward to providing a technical practice for those

units / organizations who are interested in deploying and

implementing a similar system.

In this paper, based upon our research on data warehouse

globalization/internationalization, the following achievements

have been obtained:

First, we will have a comprehensive view at the MEETL

(Multilingual Enabled Extract, Transform and Load) technology. The

problem begins with a source Oracle database for which the code page

is 8859-1 (Western European) while we implemented a global data

warehouse project, but the actual data is a mixture of code pages,

厦门大学博硕士论文摘要库

Page 8: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

厦门大学工学博士学位论文

IV

including shift-JIS, 8859-x, GB2312 and Big5. The goal is to move

this data into a target Oracle database for which the code page is

UTF-8. It is possible to manipulate this data within Oracle code page

8859-1 and Informatica ASCII data movement mode. It is even possible

to view the data using an Oracle client with the appropriate code

page. However, it has not been possible to move this data from 8859-1

to UTF-8 using Informatica. The essence of the problem is that both

Oracle and Informatica attempt to convert the data to the correct

code page, but they each defeat the other's attempt. The key

contribution of this paper is to deliver the MEETL

solution/methodology to resolve such character set conversion

problems.

Second, this paper discusses the character set conversion problem

occurring among the source DB, target DB and ETL tool. Furthermore,

based on the MEETL technology, we have put forwarded and proved the

CSCCT (Character Set Conversion Compatible Theorem) and CPCT (Code

Page Compatible Theorem) theorems.

Third, we will bring forward the UTF8toUnicode algorithm:

UTF8DecodeUnicode. Its purpose is to validate the characters stored

in the UTF-8 format and converts them to the right Unicode values

and input character strings are expected in UTF8 format. Originally

within Teradata DB, when characters are not translatable between

client and server, they will be replaced with an error character.

Meanwhile an error message is occurred: 6705 an illegally formed

character string was encountered during translation. If we embed our

UDF (user defined function) into regular ETL process, it will

function properly to replace Teradata character set mechanism and

increase the error tolerant capability.

Fourthly, we will introduce a practical production instance on

how to accomplish an efficient, scalable, controllable and

厦门大学博硕士论文摘要库

Page 9: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

Abstract

V

maintainable ETL (extract, transform, loading) architecture. Within

this infrastructure, we have adopted three new control techniques:

Process Synchronization Control (PSC), Time Range Control (TRC) and

Common Audit Control (CAC). These techniques will contribute to

optimize the ETL architecture, improve the efficiency of data loading,

and generate the audit rules for log tracking and analysis.

Fifthly, we will propose a new way of setting up environment

variables and separate DB login passwords for the DW ETL process.

It will be more efficient and less resource consuming. It will allow

for segregation of DB user logins based on subject area and / or

process unit name, replacing most of situations where all the DW ETL

units of work are logging to the database using the same generic id.

Otherwise spool space is shared amongst all processes using the same

user id, so a badly skewed process can hog the entire spool space

and cause all the others that are running at the same time to error

out with "No more spool space" error. In addition, there is no way

for the DBA to discriminate between different processes and allocate

resources differently (based on criteria like: critical /

non-critical path job, length of queries and current status of the

system). Our purpose is to build up a robust data warehouse ETL

platform, along with high availability and expansibility in the need

of enterprise’s future development.

Key Words: ETL; Internationalization; Unicode

厦门大学博硕士论文摘要库

Page 10: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

厦门大学工学博士学位论文

VI

目 录

摘 要............................................................................................. I

ABSTRACT..............................................................................III

第 1 章 导言 ........................................................................... 1

§1.1 研究背景...................................................................................... 1

§1.2 研究目标和内容.......................................................................... 8

§1.3 研究方法...................................................................................... 9

§1.4 论文的特色与创新之处............................................................ 11

§1.5 论文的组织结构........................................................................ 12

第 2 章 字符集转换兼容性定理和 ETL 国际化技术框架14

§2.1 字符集发展简史........................................................................ 14

§2.2 研究背景和问题提出................................................................ 19

§2.2.1 采用单一国家或地区的本地编码字符集........................ 19

§2.2.2 使用 ISO88591 字符集来存放多国语言数据 ................. 22

§2.3 ETL 字符集转换兼容性定理(Character Set Conversion

Compatible Theorem,简称 CSCCT)........................................................ 25

§2.4 ETL 国际化技术框架 ............................................................... 28

§2.4.1 Informatica 概 述 ............................................................. 28

§2.4.2 ETL 技术国际化 ............................................................... 32

§2.5 本章小结.................................................................................... 41

第 3 章 含多国语言字符集 Oracle 数据仓库基础构建工作

42

§3.1 源数据库数据流 ETL 加工原理 .............................................. 43

§3.2 目标数据库数据流 ETL 加工原理 .......................................... 45

§3.3 数据源采用单一国家或地区的本地编码字符集.................... 46

厦门大学博硕士论文摘要库

Page 11: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

目录

VII

§3.3.1 ETL 实现方案....................................................................46

§3.3.2 实验和分析 ........................................................................47

§3.4 数据源使用 ISO88591 字符集来存放多国语言数据..............51

§3.4.1 实验处理和步骤 ................................................................51

§3.4.2 解决方案 ............................................................................53

§3.4.3 实验过程和结果 ................................................................58

§3.5 本章小结 ....................................................................................59

第 4 章 Teradata 数据仓库 ETL 国际化技术研究和实践

61

§4.1 问题定义和概述 ........................................................................61

§4.2 解决方案 ....................................................................................62

§4.3 UTF8 转换为 Unicode 的算法与分析 .....................................65

§4.3.1 UTF8 编码规则描述..........................................................65

§4.3.2 UTF8 转换为 Unicode 的算法描述 .................................66

§4.3.3 Unicode 转换为 UTF8 的算法描述 .................................69

§4.4 测试用例和实验结果 ................................................................72

§4.5 本章小结 ....................................................................................73

第 5 章 多数据源 ETL 系统架构设计与实现 ....................74

§5.1 基本概念 ....................................................................................75

§5.2 关系代数 ....................................................................................80

§5.2.1 关系运算符 ........................................................................80

§5.2.2 选择运算 ............................................................................80

§5.2.3 投影运算 ............................................................................81

§5.2.4 连接 ....................................................................................82

§5.2.5 赋值运算 ............................................................................82

§5.2.6 聚集函数 ............................................................................82

§5.2.7 删除数据 ............................................................................83

§5.2.8 插入数据 ............................................................................83

厦门大学博硕士论文摘要库

Page 12: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

厦门大学工学博士学位论文

VIII

§5.2.9 更新数据............................................................................ 84

§5.3 多数据源数据仓库应用系统概述............................................ 84

§5.3.1 系统架构............................................................................ 88

§5.3.2 ETL 实施层次 ................................................................... 89

§5.3.3 过程同步控制技术(Process Synchronization Control

Technology) 92

§5.3.4 时间范围控制技术(Time Range Control Technology)93

§5.3.5 通用审计控制技术(Common Audit Control Technology)

93

§5.4 过程同步控制(Process Synchronization Control) ............... 95

§5.4.1 例程表................................................................................ 95

§5.4.2 数据提供控制表................................................................ 97

§5.4.3 ETL 参数表 ...................................................................... 98

§5.4.4 预处理存储过程(Preliminary Stored Procedures).......... 101

§5.4.5 创建目录.......................................................................... 106

§5.5 时间范围控制(Time Range Control, TRC) ....................... 107

§5.5.1 时间范围控制数据.......................................................... 110

§5.5.2 时间范围控制流程.......................................................... 110

§5.6 通用审计控制(Common Audit Control,CAC ) ............. 112

§5.7 本章小结.................................................................................. 117

第 6 章 高性能的数据装载和维护体系设计 ................... 118

§6.1 问题阐述.................................................................................. 118

§6.2 分析和讨论.............................................................................. 120

§6.3 方案一:基于变量形式获取数据库登录字符串.................. 122

§6.4 方案二:基于函数形式获取数据库登录字符串.................. 125

§6.4.1 函数文件目录~dw_adm/.logins...................................... 129

§6.4.2 变量设置文件~dw_adm/.dwProfile................................ 130

§6.4.3 登录信息文件 td_logins.dat............................................ 130

§6.5 实验分析和比较...................................................................... 131

厦门大学博硕士论文摘要库

Page 13: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

目录

IX

§6.6 本章小结 ..................................................................................132

第 7 章 结论与展望 ............................................................133

§7.1 论文主要研究成果 ..................................................................133

§7.2 相关研究工作展望 ..................................................................135

参考文献 ..................................................................................136

附录 A 插图目录 ....................................................................143

附录 B 表格目录.....................................................................146

致谢 ..........................................................................................147

作者攻读博士学位期间发表的论文 ......................................149

厦门大学博硕士论文摘要库

Page 14: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

厦门大学工学博士学位论文

X

Contents

Chinese Abstract ......................................................................... I

English Abstract.......................................................................III

Chapter 1: Preface ..................................................................... 1

§1.1 Research Background .................................................................. 1

§1.2 Purpose and Contents................................................................... 8

§1.3 Methodology ................................................................................ 9

§1.4 Innovation and Creativity of This Paper .................................... 11

§1.5 Organization............................................................................... 12

Chapter 2: Character Set Conversion Compatible Theorem

and ETL Globalization Infrastructure ...................................... 14

§2.1 Overview of Character Set......................................................... 14

§2.2 Problem Statement ..................................................................... 19

§2.2.1 Monolingual Database Scenario ........................................ 19

§2.2.2 Multilingual Database with ISO88591 Character Set........ 22

§2.3 Character Set Conversion Compatible Theorem,CSCCT....... 25

§2.4 ETL Globalization Infrastructure............................................... 28

§2.4.1 Informatica Introduction .................................................... 28

§2.4.2 ETL Technology Internationlization.................................. 32

§2.5 Brief Summary........................................................................... 41

Chapter 3: Building up Multi-Language DW on Oracle

Platform ................................................................................ 42

§3.1 ETL Principle For Source Database .......................................... 43

§3.2 ETL Principle For Target Database ........................................... 45

§3.3 Monolingual Source Database ................................................... 46

§3.3.1 ETL Solution...................................................................... 46

§3.3.2 Test and Analysis............................................................... 47

§3.4 Source Database With ISO88591 Character Set........................ 51

§3.4.1 Workarounds...................................................................... 51

厦门大学博硕士论文摘要库

Page 15: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

Contents

XI

§3.4.2 Solution...............................................................................53

§3.4.3 Test and Result ...................................................................58

§3.5 Brief Summary ...........................................................................59

Chapter 4: Research and Implementaion on Teradata DW

ETL Technology............................................................................61

§4.1 Problem Statement......................................................................61

§4.2 Solution.......................................................................................62

§4.3 Unicode Conversion Algorithm .................................................65

§4.3.1 UTF8 Encoding Description...............................................65

§4.3.2 Algorithm of UTF8 Converting to Unicode.......................66

§4.3.3 Algorithm of Unicode Converting to UTF8.......................69

§4.4 Workarounds and Test Result ....................................................72

§4.5 Brief Summary ...........................................................................73

Chapter 5: Architecture Design for DW Derived From

Multiple Data Sources ....................................................................74

§5.1 Basic Concept .............................................................................75

§5.2 Relational Algebra......................................................................80

§5.2.1 Operator..............................................................................80

§5.2.2 Select ..................................................................................80

§5.2.3 Projection............................................................................81

§5.2.4 Join .....................................................................................82

§5.2.5 Assignment .........................................................................82

§5.2.6 Aggregate Functions...........................................................82

§5.2.7 Delete..................................................................................83

§5.2.8 Insert ...................................................................................83

§5.2.9 Update.................................................................................84

§5.3 Introdcution of Multiple Data Sources DW ...............................84

§5.3.1 System Infrastructure..........................................................88

§5.3.2 ETL Implementation Level ................................................89

§5.3.3 Process Synchronization Control Technology ...................92

§5.3.4 Time Range Control Technology .......................................93

§5.3.5 Common Audit Control Technology..................................93

厦门大学博硕士论文摘要库

Page 16: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

厦门大学工学博士学位论文

XII

§5.4 Process Synchronization Control ............................................... 95

§5.4.1 Instance Table .................................................................... 95

§5.4.2 Data Feed Interface Table.................................................. 97

§5.4.3 ETL Parameter Table......................................................... 98

§5.4.4 Preliminary Stored Procedures ........................................ 101

§5.4.5 Creating The Directories.................................................. 106

§5.5 Time Range Control, TRC....................................................... 107

§5.5.1 Control Data..................................................................... 110

§5.5.2 Control Flow .................................................................... 110

§5.6 Common Audit Control,CAC ............................................. 112

§5.7 Breif Summary......................................................................... 117

Chapter 6: High Performance ETL System........................ 118

§6.1 Problem Statement ................................................................... 118

§6.2 Analysis and Discussion .......................................................... 120

§6.3 Solution I:Variables .............................................................. 122

§6.4 Solution II:Fuctions............................................................... 125

§6.4.1 Function Directory:~dw_adm/.logins .............................. 129

§6.4.2 Variable Locaiton:~dw_adm/.dwProfile.......................... 130

§6.4.3 Logon file: td_logins.dat .................................................. 130

§6.5 Workarounds and Solutions..................................................... 131

§6.6 Brief Summary......................................................................... 132

Chapter7:Conclusions and Directions of Future Research133

§7.1 Achievements of This Paper .................................................... 133

§7.2 Related and Future Work ......................................................... 135

Reference ................................................................................ 136

List of Graphs......................................................................... 143

List of Tables .......................................................................... 146

Acknowledgements ................................................................ 147

Publication Lists..................................................................... 149

厦门大学博硕士论文摘要库

Page 17: 厦门大学工学博士学位论文 ETL 厦门大学博硕士论文摘要库 · as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be

Degree papers are in the “Xiamen University Electronic Theses and Dissertations Database”. Fulltexts are available in the following ways: 1. If your library is a CALIS member libraries, please log on http://etd.calis.edu.cn/ and submitrequests online, or consult the interlibrary loan department in your library. 2. For users of non-CALIS member libraries, please mail to [email protected] for delivery details.

厦门大学博硕士论文摘要库