
Extending Practical Pre-Aggregation in On-Line Analytical Processing

Torben Bach Pedersen

Christian S. Jensen    Curtis E. Dyreson*

Center for Health Information Services, Kommunedata, P. O. Pedersens Vej 2, DK-8200 Århus N, Denmark, [email protected]

Department of Computer Science, Aalborg University, Fredrik Bajers Vej 7E, DK-9220 Aalborg Ø, Denmark, [email protected]

* Department of Computer Science, James Cook University, Townsville, QLD 4811, Australia, [email protected]

September 20, 1999


Abstract

On-Line Analytical Processing (OLAP) based on a dimensional view of data is being used increasingly in traditional business applications as well as in applications such as health care for the purpose of analyzing very large amounts of data. Pre-aggregation, the prior materialization of aggregate queries for later use, is an essential technique for ensuring adequate response time during data analysis. Full pre-aggregation, where all combinations of aggregates are materialized, is infeasible. Instead, modern OLAP systems adopt the practical pre-aggregation approach of materializing only select combinations of aggregates and then re-using these for efficiently computing other aggregates. However, this re-use of aggregates is contingent on the dimension hierarchies and the relationships between facts and dimensions satisfying stringent constraints. This severely limits the scope of the practical pre-aggregation approach. This paper significantly extends the scope of practical pre-aggregation to cover a much wider range of realistic situations. Specifically, algorithms are given that transform “irregular” dimension hierarchies and fact-dimension relationships, which often occur in real-world OLAP applications, into well-behaved structures that, when used by existing OLAP systems, enable practical pre-aggregation. The algorithms have low computational complexity and may be applied incrementally to reduce the cost of updating OLAP structures.


Contents

1 Introduction

2 Motivation—A Case Study

3 Method Context
  3.1 A Concrete Data Model Context
  3.2 Hierarchy Properties

4 Dimension Transformation Techniques
  4.1 Non-Covering Hierarchies
  4.2 Non-Onto Hierarchies
  4.3 Non-Strict Hierarchies

5 Fact-Dimension Transformation Techniques
  5.1 Mixed Granularity Mappings
  5.2 Many-To-Many Relationships

6 Architectural Context

7 Conclusion and Future Work

A Incremental Computation
  A.1 Covering Hierarchies
  A.2 Onto Hierarchies
  A.3 Strict Hierarchies


1 Introduction

On-Line Analytical Processing (OLAP) systems, which aim to ease the process of extracting useful information from large amounts of detailed transactional data, have gained widespread acceptance in traditional business applications as well as in new applications such as health care. These systems generally offer a dimensional view of data, in which measured values, termed facts, are characterized by descriptive values drawn from a number of dimensions; and the values of a dimension are typically organized in a containment-type hierarchy. A prototypical query applies an aggregate function, such as average, to the facts characterized by specific values from the dimensions.

Fast response times are required from these systems, even for queries that aggregate large amounts of data. Perhaps the most central technique used for meeting this requirement is termed pre-aggregation, where the results of aggregate queries are pre-computed and stored, i.e., materialized, for later use during query processing. Pre-aggregation has attracted substantial attention in the research community, where it has been investigated how to optimally use pre-aggregated data for query optimization [7, 3] and how to maintain the pre-aggregated data when base data is updated [19, 23]. Further, the latest versions of commercial RDBMS products offer query optimization based on pre-computed aggregates and automatic maintenance of the stored aggregates when base data is updated [29].

The fastest response times may be achieved when materializing aggregate results corresponding to all combinations of dimension values across all dimensions, termed full (or eager) pre-aggregation. However, the required storage space grows rapidly, to quickly become prohibitive, as the complexity of the application increases. This phenomenon is called data explosion [4, 26, 21] and occurs because the number of possible aggregation combinations grows rapidly when the number of dimensions increases, while the sparseness of the multidimensional space decreases in higher dimension levels, meaning that aggregates at higher levels take up nearly as much space as lower-level aggregates. In some commercial applications, full pre-aggregation takes up as much as 200 times the space of the raw data [21]. Another problem with full pre-aggregation is that it takes too long to update the materialized aggregates when base data changes.

With the goal of avoiding data explosion, research has focused on how to select the best subset of aggregation levels given space constraints [11, 9, 31, 1, 27, 25] or maintenance time constraints [10], or the best combination of aggregate data and indices [8]. This approach is commonly referred to as practical (or partial or semi-eager [5, 11, 28]) pre-aggregation. Commercial OLAP systems now also exist that employ practical pre-aggregation, e.g., Microsoft Decision Support Services (Plato) [18] and Informix MetaCube [13].

The premise underlying the applicability of practical pre-aggregation is that lower-level aggregates can be re-used to compute higher-level aggregates, known as summarizability [16]. Summarizability occurs when the mappings in the dimension hierarchies are onto (all paths in the hierarchy have equal lengths), covering (only immediate parent and child values can be related), and strict (each child in a hierarchy has only one parent), and when also the relationships between facts and dimensions are many-to-one and facts are always mapped to the lowest levels in the dimensions [16].


However, the data encountered in many real-world applications fail to comply with this rigid regime. This motivates the search for techniques that allow practical pre-aggregation to be used for a wider range of applications, the focus of this paper.

Specifically, this paper leverages research such as that cited above. It does so by showing how to transform dimension hierarchies to obtain summarizability, and by showing how to integrate the transformed hierarchies into current systems, transparently to the user, so that standard OLAP technology is re-used. Specifically, algorithms are presented that automatically transform dimension hierarchies to achieve summarizability for hierarchies that are non-onto, non-covering, and non-strict. The algorithms have low computational complexity, and are thus applicable to even very large databases. It is also described how to use the algorithms to contend with non-summarizable relationships between facts and dimensions, and it is shown how the algorithms may be modified to accommodate incremental computation, thus minimizing the maintenance cost associated with base-data updates.

To our knowledge, this work is the first to present algorithms to automatically achieve summarizability for non-covering and non-onto hierarchies. The research reported here is also the first to demonstrate techniques and algorithms for achieving summarizability in non-strict hierarchies. The integration of the techniques into current systems, transparently to the user, is, we believe, a novel feature. The only past research on the topic has been on how to manually, and not transparently to the user, achieve summarizability for non-covering hierarchies [24].

The next section presents a real-world clinical case study that exemplifies the non-summarizable properties of real-world applications. Section 3 proceeds to define the aspects of a multidimensional data model necessary for describing the new techniques, and also defines important properties related to summarizability. Sections 4 and 5 present algorithms that transform dimension hierarchies to achieve summarizability, then apply the algorithms to fix non-summarizable relationships between facts and dimensions. Section 6 demonstrates how the techniques may be integrated into current systems, transparently to the user. Section 7 summarizes and points to topics for future research. Appendix A describes how to modify the algorithms to accommodate incremental computation.

2 Motivation—A Case Study

This section presents a case study that illustrates the properties of real-world dimension hierarchies. The case study concerns patients in a hospital, their associated diagnoses, and their places of residence. The data analysis goal is to investigate whether some diagnoses occur more often in some areas than in others, in which case environmental or lifestyle factors might be contributing to the disease pattern. An ER diagram illustrating the underlying data is seen in Figure 1.

The most important entities are the patients, for which we record the name. We always want to count the number of patients, grouped by some properties of the patients. Thus, in multidimensional terms, the patients are the facts, and the other, describing, entities constitute the dimensions.

Each patient has a number of diagnoses, leading to a many-to-many relationship between facts and the diagnosis dimension.


[Figure: ER diagram relating Patient (Name) via the Has relationship to the Diagnosis supertype with subtypes Low-level Diagnosis, Diagnosis Family, and Diagnosis Group (Code, Text, Type on the Belongs to and Grouping relationships), and via Lives at to Address, which is Located in city (City, Name) or Located in rural area (County, Name), with City located in County.]

Figure 1: ER Schema of Case Study

When registering diagnoses of patients, physicians use different levels of granularity, ranging from very precise diagnoses, e.g., “Insulin dependent diabetes during pregnancy,” to more imprecise diagnoses, e.g., “Diabetes,” which cover wider ranges of patient conditions. To model this, the relationship from patient to diagnoses is to the supertype “Diagnosis,” which then has three subtypes, corresponding to different levels of granularity: the low-level diagnosis, the diagnosis family, and the diagnosis group. Examples of these are “Insulin dependent diabetes during pregnancy,” “Insulin dependent diabetes,” and “Diabetes,” respectively. The higher-level diagnoses are both (imprecise) diagnoses in their own right and serve as groups of lower-level diagnoses.

Each diagnosis has an alphanumeric code and a descriptive text, which are specified by some standard, here the World Health Organization’s International Classification of Diseases (ICD-10) [30], or by the physicians themselves. Indeed, two hierarchies are captured: the standard hierarchy specified by the WHO, and the user-defined hierarchy, which is used for grouping diagnoses on an ad-hoc basis in other ways than given by the standard. The Type attribute on the relationships determines whether the relation between two entities is part of the standard or the user-defined hierarchy.

The hierarchy groups low-level diagnoses into diagnosis families, each of which consists of 2–20 related diagnoses. For example, the diagnosis “Insulin dependent diabetes during pregnancy”¹ is part of the family “Diabetes during pregnancy.”

¹ The reason for having a separate pregnancy related diagnosis is that diabetes must be monitored and controlled particularly intensely during a pregnancy to assure good health of both mother and child.


In the WHO hierarchy, a low-level diagnosis belongs to exactly one diagnosis family, whereas the user-defined hierarchy does not have this restriction. Thus, a low-level diagnosis can belong to several diagnosis families; e.g., the “Insulin dependent diabetes during pregnancy” diagnosis belongs to both the “Diabetes during pregnancy” and the “Insulin dependent diabetes” family. Next, diagnosis families are grouped into diagnosis groups, consisting of 2–10 families, and one family may be part of several groups. For example, the family “Diabetes during pregnancy” may be part of the “Diabetes” and “Other pregnancy related diseases” groups.

In the WHO hierarchy, a family belongs to exactly one group. Thus, in the WHO hierarchy, a lower-level value belongs to exactly one higher-level value, making it strict and covering. In the user-defined hierarchy, a lower-level value may belong to zero or more higher-level values, making it non-strict and non-covering. Properties of the hierarchies will be discussed in more detail in Section 3.2.

We also record the addresses of the patients. If the address is located in a city, we record the city; otherwise, if the address is in a rural area, we record the county in which the address is located. A city is located in exactly one county. As not all addresses are in cities, we cannot find all addresses in a county by going through the “City located in” relationship. Thus, the mapping from cities to counties is non-covering w.r.t. addresses. For cities and counties, we just record the name. Not all counties have cities in them, so the mapping from cities to counties is into rather than onto.

In order to exemplify the data, we assume a standard mapping of the ER diagram to relational tables, i.e., one table per entity and relationship type. We also assume the use of surrogate keys, named ID, with globally unique values. The three subtypes of the Diagnosis type are mapped to a common Diagnosis table, and because of this, the “belongs to” and “grouping” relationships are mapped to a common “Grouping” table. The resulting tables with sample data are shown in Table 1 and will be used in examples throughout the paper.

If we apply pre-aggregation to the data from the case study, several problems occur. For example, if the counts of patients by City are pre-computed and we use these for computing the numbers of patients by County, an incorrect result will occur. In the data, the addresses “123 Rural Road” and “1 Sandy Dunes” (one of them is the address of a patient) are not in any city, making the mapping from City to County not covering w.r.t. addresses.
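To see this concretely, the following hypothetical Python sketch (the variable names and structures are ours, chosen only for illustration over the sample data in Table 1) counts patients per county directly from the base data and then again by re-using city-level counts; the patient at the non-covered address “123 Rural Road” is silently lost in the re-used result.

from collections import Counter

# Sample data from Table 1.
lives_at = {1: 50, 2: 51, 3: 52}            # patient -> address
located_in_city = {50: 20, 51: 21}          # address -> city
city_located_in = {20: 30, 21: 31}          # city -> county
located_in_rural_area = {52: 31, 53: 32}    # address -> county (no city)

# Correct county-level counts, computed directly from the base data.
def county_of(address):
    if address in located_in_city:
        return city_located_in[located_in_city[address]]
    return located_in_rural_area[address]

direct = Counter(county_of(a) for a in lives_at.values())

# Counts derived by re-using city-level pre-aggregates: patients at addresses
# without a city never reach the City level, so they are dropped.
by_city = Counter(located_in_city[a] for a in lives_at.values() if a in located_in_city)
derived = Counter()
for city, count in by_city.items():
    derived[city_located_in[city]] += count

print(direct)   # Counter({31: 2, 30: 1})  -- three patients in total
print(derived)  # Counter({30: 1, 31: 1})  -- patient 3 ("123 Rural Road") is missing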

Next, if the counts of patients by Low-level Diagnosis are pre-computed and we use these for computing the total count of patients, an incorrect result again ensues. First, patients with only lung cancer are not counted, as lung cancer is not present at the level of Low-level Diagnosis; the mapping from Low-level Diagnosis to Diagnosis Family is into. Second, patients such as “Jim Doe” only have higher-level diagnoses and will not be counted; the fact-to-dimension mapping has varying granularity. Third, patients such as “Jane Doe” have several diagnoses and will be counted several times; the relationship between facts and dimensions is many-to-many. Fourth, low-level diagnoses such as “Insulin dependent diabetes during pregnancy” are part of several diagnosis families, which may also lead to “double” counting when computing higher-level counts; the dimension hierarchy is non-strict.

These problems yield “non-summarizable” dimension hierarchies that severely limit the applicability of practical pre-aggregation, leaving only full pre-aggregation, requiring huge amounts of storage, or no pre-aggregation, resulting in long response times for queries.



Patient:
ID  Name
1   John Doe
2   Jane Doe
3   Jim Doe

LivesAt:
PatientID  AddressID
1          50
2          51
3          52

Address:
ID  Address
50  21 Central Street
51  34 Main Street
52  123 Rural Road
53  1 Sandy Dunes

Has:
PatientID  DiagnosisID  Type
1          9            Primary
2          5            Secondary
2          9            Primary
3          11           Primary

Grouping:
ParentID  ChildID  Type
4         5        WHO
4         6        WHO
9         5        User-defined
10        6        User-defined
11        9        WHO
11        10       WHO
12        4        WHO
13        14       WHO

Diagnosis:
ID  Code   Text                                              Type
4   O24    Diabetes during pregnancy                         Family
5   O24.0  Insulin dependent diabetes during pregnancy       Low-Level
6   O24.1  Non insulin dependent diabetes during pregnancy   Low-Level
9   E10    Insulin dependent diabetes                        Family
10  E11    Non insulin dependent diabetes                    Family
11  E1     Diabetes                                          Group
12  O2     Other pregnancy related diseases                  Group
13  A1     Cancer                                            Group
14  A11    Lung cancer                                       Family

City:
ID  Name
20  Sydney
21  Melbourne

LocatedInCity:
AddressID  CityID
50         20
51         21

County:
ID  Name
30  Sydney
31  Melbourne
32  Outback

LocatedInRuralArea:
AddressID  CountyID
52         31
53         32

CityLocatedIn:
CityID  CountyID
20      30
21      31

Table 1: Tables for the Case Study



The properties described above are found in many other real-world applications. Many-to-many relationships between facts and dimensions occur between bank customers and accounts, between companies and Standard Industry Classifications (SICs), and between students and departments [15, 16]. Non-strict dimension hierarchies occur from cities to states in a Geography dimension [24] and from weeks to months in a Time dimension. In addition, hierarchies where the change over time is captured are generally non-strict. The mapping from holidays to weeks as well as organization hierarchies of varying depth [12] offer examples of “into” mappings. Non-covering relationships exist for days-holidays-weeks and for counties-cities-states, as well as in organization hierarchies [12].

Even though many real-world cases possess the properties described above, current techniques for practical pre-aggregation require that facts be in many-to-one relationships to dimensions and that all hierarchies be strict, onto, and covering. Thus, current techniques cannot be applied when the hierarchies have these properties.

3 Method Context

This section defines the aspects of a multidimensional data model that are necessary to define the techniques that enable practical pre-aggregation in applications such as the one just described. The full model is described elsewhere [22]. Next, the data model context is exploited for defining properties of hierarchies relevant to the techniques.

The particular data model has been chosen over other multidimensional data models because it quite naturally captures the data described in the case study and because it includes explicit concepts of dimensions and dimension hierarchies, which is very important for clearly presenting the techniques. However, the techniques are also applicable to other multidimensional or statistical data models, as will be discussed in Section 6.

3.1 A Concrete Data Model Context

For each part of the model, we define the intension and the extension, and we give an illustrating example.

An n-dimensional fact schema is a two-tuple $\mathcal{S} = (\mathcal{F}, \mathcal{D})$, where $\mathcal{F}$ is a fact type and $\mathcal{D} = \{\mathcal{T}_i, i = 1, \ldots, n\}$ is its corresponding set of dimension types.

Example 1 In the case study from Section 2, Patient is the fact type, and Diagnosis, Residence, and Name are the dimension types. The intuition is that everything that characterizes the fact type is considered to be dimensional.

A dimension type $\mathcal{T}$ is a four-tuple $(\mathcal{C}, \sqsubseteq_{\mathcal{T}}, \top_{\mathcal{T}}, \bot_{\mathcal{T}})$, where $\mathcal{C} = \{\mathcal{C}_j, j = 1, \ldots, k\}$ are the category types of $\mathcal{T}$, $\sqsubseteq_{\mathcal{T}}$ is a partial order on the $\mathcal{C}_j$'s, and $\top_{\mathcal{T}} \in \mathcal{C}$ and $\bot_{\mathcal{T}} \in \mathcal{C}$ are the top and bottom element of the ordering, respectively. Thus, the category types form a lattice.


The intuition is that one category type is “greater than” another category type if members of the former's extension logically contain members of the latter's extension, i.e., they have a larger value size. The top element of the ordering corresponds to the largest possible value size; that is, there is only one value in its extension, logically containing all other values.

We say that $\mathcal{C}_j$ is a category type of $\mathcal{T}$, written $\mathcal{C}_j \in \mathcal{T}$, if $\mathcal{C}_j \in \mathcal{C}$.

Example 2 Low-level diagnoses are contained in diagnosis families, which are contained in diagnosis groups. Thus, the Diagnosis dimension type has the following order on its category types: $\bot_{\mathit{Diagnosis}} = $ Low-level Diagnosis $\sqsubseteq$ Diagnosis Family $\sqsubseteq$ Diagnosis Group $\sqsubseteq \top_{\mathit{Diagnosis}}$. Other examples of category types are Address, City, and County. Figure 2, to be discussed in detail later, illustrates the dimension types of the case study.

A category $C_j$ of type $\mathcal{C}_j$ is a set of dimension values $e$. A dimension $D$ of type $\mathcal{T} = (\{\mathcal{C}_j\}, \sqsubseteq_{\mathcal{T}}, \top_{\mathcal{T}}, \bot_{\mathcal{T}})$ is a two-tuple $D = (C, \sqsubseteq)$, where $C = \{C_j\}$ is a set of categories $C_j$ such that $\mathrm{Type}(C_j) = \mathcal{C}_j$ and $\sqsubseteq$ is a partial order on $\bigcup_j C_j$, the union of all dimension values in the individual categories. We assume a function $\mathrm{Pred}: C \to 2^{C}$ that gives the set of immediate predecessors of a category $C_j$. Similarly, we assume a function $\mathrm{Desc}: C \to 2^{C}$ that gives the set of immediate descendants of a category $C_j$. For both $\mathrm{Pred}$ and $\mathrm{Desc}$, we “count” from the category of type $\top_{\mathcal{T}}$, so that it is the ultimate predecessor and the category of type $\bot_{\mathcal{T}}$ is the ultimate descendant.

The definition of the partial order is: given two values $e_1, e_2$, then $e_1 \sqsubseteq e_2$ if $e_1$ is logically contained in $e_2$. We say that $C_j$ is a category of $D$, written $C_j \in D$, if $C_j \in C$. For a dimension value $e$, we say that $e$ is a dimension value of $D$, written $e \in D$, if $e \in \bigcup_j C_j$.

The category of type $\bot_{\mathcal{T}}$ in a dimension of type $\mathcal{T}$ contains the values with the smallest value size. The category with the largest value size, of type $\top_{\mathcal{T}}$, contains exactly one value, denoted $\top$. For all values $e$ of the dimension $D$, $e \sqsubseteq \top$. Value $\top$ is similar to the ALL construct of Gray et al. [6]. When the context is clear, we refer to a category of type $\top_{\mathcal{T}}$ as a $\top$ category, not to be confused with the $\top$ dimension value.

Example 3 In our Diagnosis dimension we have the following categories, named by their type. The numbers in parentheses are the ID values from the Diagnosis table in Table 1. Low-level Diagnosis = {“Insulin dependent diabetes during pregnancy” (5), “Non insulin dependent diabetes during pregnancy” (6)}, Diagnosis Family = {“Diabetes during pregnancy” (4), “Insulin dependent diabetes” (9), “Non insulin dependent diabetes” (10), “Lung cancer” (14)}, Diagnosis Group = {“Diabetes” (11), “Other pregnancy related diseases” (12), “Cancer” (13)}, and $\top_{\mathit{Diagnosis}} = \{\top\}$. We have that $\mathrm{Pred}($Low-level Diagnosis$) = \{$Diagnosis Family$\}$. The partial order $\sqsubseteq$ is obtained by combining the WHO and user-defined hierarchies, as given by the Grouping table in Table 1. Additionally, the top value $\top$ is greater than, i.e., logically contains, all the other diagnosis values.

Let $F$ be a set of facts, and $D = (C = \{C_j\}, \sqsubseteq)$ a dimension. A fact-dimension relation between $F$ and $D$ is a set $R = \{(f, e)\}$, where $f \in F$ and $e \in \bigcup_j C_j$. Thus $R$ links facts to dimension values. We say that fact $f$ is characterized by dimension value $e$, written $f \leadsto e$, if $\exists e_1 \in D\ ((f, e_1) \in R \wedge e_1 \sqsubseteq e)$. We require that $\forall f \in F\ (\exists e \in \bigcup_j C_j\ ((f, e) \in R))$; thus, all facts map to at least one dimension value in every dimension.


The $\top$ value is used to represent an unknown or missing value, as $\top$ logically contains all dimension values, and so a fact $f$ is mapped to $\top$ if it cannot be characterized within the particular dimension.

Example 4 The fact-dimension relation $R$ links patient facts to diagnosis dimension values as given by the Has table from the case study, so that $R = \{$(“John Doe” (1), “Insulin dependent diabetes” (9)), (“Jane Doe” (2), “Insulin dependent diabetes during pregnancy” (5)), (“Jane Doe” (2), “Insulin dependent diabetes” (9)), (“Jim Doe” (3), “Diabetes” (11))$\}$. Note that facts may be related to values in higher-level categories; we do not require that $e$ belong to $\bot_{\mathit{Diagnosis}}$. For example, the fact “John Doe” (1) is related to the diagnosis “Insulin dependent diabetes” (9), which belongs to the Diagnosis Family category. This feature will be used later to explicitly capture the different granularities in the data. If no diagnosis was known for patient “John Doe” (1), we would have added the pair (“John Doe” (1), $\top$) to $R$.

A multidimensional object (MO) is a four-tuple $M = (\mathcal{S}, F, D, R)$, where $\mathcal{S} = (\mathcal{F}, \mathcal{D} = \{\mathcal{T}_i\})$ is the fact schema, $F = \{f\}$ is a set of facts $f$ where $\mathrm{Type}(f) = \mathcal{F}$, $D = \{D_i, i = 1, \ldots, n\}$ is a set of dimensions where $\mathrm{Type}(D_i) = \mathcal{T}_i$, and $R = \{R_i, i = 1, \ldots, n\}$ is a set of fact-dimension relations, such that $\forall i\ ((f, e) \in R_i \Rightarrow f \in F \wedge \exists C_j \in D_i\ (e \in C_j))$.

[Figure: the Patient fact type connected to three dimensions — the Diagnosis dimension (Low-level Diagnosis = ⊥, Diagnosis Family, Diagnosis Group), the Residence dimension (Address = ⊥, City, County), and the Name dimension (Name = ⊥).]

Figure 2: Schema of the Case Study

Example 5 For the case study, we get a three-dimensional MO $M = (\mathcal{S}, F, D, R)$, where $\mathcal{S} = ($Patient, $\{$Diagnosis, Name, Residence$\})$ and $F = \{$“John Doe” (1), “Jane Doe” (2), “Jim Doe” (3)$\}$. The definition of the Diagnosis dimension and its corresponding fact-dimension relation was given in the previous examples. The Residence dimension has the categories Address ($= \bot_{\mathit{Residence}}$), City, County, and $\top_{\mathit{Residence}}$. The values of the categories are given by the corresponding tables in Table 1. The partial order is given by the relationship tables.


Additionally, the only value in the $\top_{\mathit{Residence}}$ category is $\top$, which logically contains all the other values in the Residence dimension. The Name dimension is simple, i.e., it just has a Name category ($= \bot_{\mathit{Name}}$) and a $\top$ category. We will refer to this MO as the “Patient” MO. A graphical illustration of the schema of the “Patient” MO is seen in Figure 2. Because some addresses map directly to counties, County is an immediate predecessor of Address.

The facts in an MO are objects with value-independent identity. We can test facts for equality, but do not assume an ordering on the facts. The combination of dimension values that characterize the facts of a fact set is not a “key” for the fact set. Thus, several facts may be characterized by the same combination of dimension values. But the facts of an MO form a set, so an MO does not have duplicate facts. The model formally defines quite general concepts of dimensions and dimension hierarchies, which is ideal for the presentation of our techniques. The presented techniques are not limited by the choice of data model.
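To make the model concrete, here is a minimal, hypothetical Python encoding of a fragment of the “Patient” MO; the names and structures are only illustrative and are not part of the formal model.

# A dimension is sketched as: categories (name -> set of values), an immediate-
# predecessor function on categories, and the child-to-parents links between values.
diagnosis_categories = {
    "Low-level Diagnosis": {5, 6},
    "Diagnosis Family": {4, 9, 10, 14},
    "Diagnosis Group": {11, 12, 13},
    "T": {"T"},                       # the top category, holding the single value T
}
diagnosis_pred = {                    # immediate predecessors of each category
    "Low-level Diagnosis": ["Diagnosis Family"],
    "Diagnosis Family": ["Diagnosis Group"],
    "Diagnosis Group": ["T"],
    "T": [],
}
diagnosis_parents = {                 # value -> set of immediate parent values
    5: {4, 9}, 6: {4, 10},            # combined WHO and user-defined hierarchies
    4: {12}, 9: {11}, 10: {11}, 14: {13},
    11: {"T"}, 12: {"T"}, 13: {"T"},
}

# Facts and the fact-dimension relation R for the Diagnosis dimension (the Has table).
facts = {1, 2, 3}
R_diagnosis = {(1, 9), (2, 5), (2, 9), (3, 11)}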

3.2 Hierarchy Properties

In this section, important properties of MOs that relate to the use of pre-computed aggregates are defined. The properties will be used in the following sections to state exactly what problems the proposed algorithms solve. The first important concept is summarizability, which intuitively means that higher-level aggregates may be obtained directly from lower-level aggregates.

Definition 1 Given a type $T$, a set $S = \{S_j, j = 1, \ldots, k\}$ where $S_j \in 2^{T}$, and a function $g: 2^{T} \to T$, we say that $g$ is summarizable for $S$ if $g(\{\!\{ g(S_1), \ldots, g(S_k) \}\!\}) = g(S_1 \cup \cdots \cup S_k)$. The argument on the left-hand side of the equation is a multiset, i.e., the same value may occur multiple times.

Summarizability is important as it is a condition for the flexible use of pre-computed aggregates. Without summarizability, lower-level results generally cannot be directly combined into higher-level results. This means that we cannot choose to pre-compute only a relevant selection of the possible aggregates and then use these to (efficiently) compute higher-level aggregates on-the-fly. Instead, we have to pre-compute all the aggregate results of queries that we need fast answers to, while other aggregates must be computed from the base data. Space and time constraints can be prohibitive for pre-computing all results, while computing aggregates from base data is often inefficient.
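As a small illustration (ours, not from the paper), consider SUM and AVG over the disjoint sets $S_1 = \{1, 2\}$ and $S_2 = \{3\}$: $\mathrm{SUM}(\{\!\{\mathrm{SUM}(S_1), \mathrm{SUM}(S_2)\}\!\}) = \mathrm{SUM}(\{\!\{3, 3\}\!\}) = 6 = \mathrm{SUM}(S_1 \cup S_2)$, so SUM is summarizable here, whereas $\mathrm{AVG}(\{\!\{\mathrm{AVG}(S_1), \mathrm{AVG}(S_2)\}\!\}) = \mathrm{AVG}(\{\!\{1.5, 3\}\!\}) = 2.25 \neq 2 = \mathrm{AVG}(S_1 \cup S_2)$, so AVG is not. If the sets overlap, e.g., $S_2 = \{2, 3\}$, even SUM fails ($3 + 5 = 8 \neq 6$), which is exactly the double counting caused by non-strict hierarchies.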

It has been shown that summarizability is equivalent to the aggregate function ($g$) being distributive, all paths being strict, and the mappings between dimension values in the hierarchies being covering and onto [16]. These concepts are formally defined below. The definitions assume a dimension $D = (C, \sqsubseteq)$ and an MO $M = (\mathcal{S}, F, D, R)$.

Definition 2 Given two categories $C_1, C_2$ such that $C_2 \in \mathrm{Pred}(C_1)$, we say that the mapping from $C_1$ to $C_2$ is onto iff $\forall e_2 \in C_2\ (\exists e_1 \in C_1\ (e_1 \sqsubseteq e_2))$. Otherwise, it is into. If all mappings in a dimension are onto, we say that the dimension hierarchy is onto.

Mappings that are into typically occur when the dimension hierarchy has varying height.


In the case study, there is no low-level cancer diagnosis, meaning that some parts of the hierarchy have height 2, while most have height 3. It is thus not possible to use aggregates at the Low-level Diagnosis level for computing aggregates at the two higher levels. Mappings that are into also occur often in organization hierarchies.

Definition 3 Given three categories $C_1$, $C_2$, and $C_3$ such that $\mathrm{Type}(C_1) \sqsubseteq \mathrm{Type}(C_2) \sqsubseteq \mathrm{Type}(C_3)$, we say that the mapping from $C_2$ to $C_3$ is covering with respect to $C_1$ iff $\forall e_1 \in C_1\ (\forall e_3 \in C_3\ (e_1 \sqsubseteq e_3 \Rightarrow \exists e_2 \in C_2\ (e_1 \sqsubseteq e_2 \wedge e_2 \sqsubseteq e_3)))$. Otherwise, it is non-covering with respect to $C_1$. If all mappings in a dimension are covering w.r.t. any category, we say that the dimension hierarchy is covering.

Non-covering mappings occur when some of the links between dimension values skip one or more levels and map directly to a value located higher up in the hierarchy. In the case study, this happens for the “1 Sandy Dunes” address, which maps directly to “Outback County” (there are no cities in Outback County). Thus, we cannot use aggregates at the City level for computing aggregates at the County level.

Definition 4 Given an MO $M = (\mathcal{S}, F, D, R)$, and two categories $C_1$ and $C_2$ that belong to the same dimension $D_i \in D$ such that $\mathrm{Type}(C_1) \sqsubseteq \mathrm{Type}(C_2)$, we say that the mapping from $C_1$ to $C_2$ is covering with respect to $F$, the set of facts, iff $\forall f \in F\ (\forall e_2 \in C_2\ (f \leadsto e_2 \Rightarrow \exists e_1 \in C_1\ (f \leadsto e_1 \wedge e_1 \sqsubseteq e_2)))$.

This case is similar to the one above, but now it is the mappings between facts and dimension values that may skip one or more levels and map facts directly to dimension values in categories above the bottom level. In the case study, the patients can map to diagnoses anywhere in the Diagnosis dimension, not just to Low-level Diagnoses. This means that we cannot use aggregates at the Low-level Diagnosis level for computing aggregates higher up in the hierarchy.

Definition 5 Given two categories $C_1$ and $C_2$ such that $C_2 \in \mathrm{Pred}(C_1)$, we say that the mapping from $C_1$ to $C_2$ is strict iff $\forall e_1 \in C_1\ (\forall e_2, e_3 \in C_2\ (e_1 \sqsubseteq e_2 \wedge e_1 \sqsubseteq e_3 \Rightarrow e_2 = e_3))$. Otherwise, it is non-strict. The hierarchy in dimension $D$ is strict if all mappings in it are strict; otherwise, it is non-strict. Given an MO $M = (\mathcal{S}, F, D, R)$ and a category $C_j$ in some dimension $D_i \in D$, we say that there is a strict path from the set of facts $F$ to $C_j$ iff $\forall f \in F\ (f \leadsto e_1 \wedge f \leadsto e_2 \wedge e_1 \in C_j \wedge e_2 \in C_j \Rightarrow e_1 = e_2)$. (Note that the paths to the $\top_{\mathcal{T}}$ categories are always strict.)

Non-strict hierarchies occur when a dimension value has multiple parents. This occurs in the Diagnosis dimension in the case study, where the “Insulin dependent diabetes during pregnancy” low-level diagnosis is part of both the “Insulin dependent diabetes” and the “Diabetes during pregnancy” diagnosis families, which in turn both are part of the “Diabetes” diagnosis group. This means that we cannot use aggregates at the Diagnosis Family level to compute aggregates at the Diagnosis Group level, since data for “Insulin dependent diabetes during pregnancy” would then be counted twice.


Definition 6 If the dimension hierarchy for a dimension $D$ is onto, covering, and strict, we say that $D$ is normalized. Otherwise, it is un-normalized. For an MO $M = (\mathcal{S}, F, D, R)$, if all dimensions $D_i \in D$ are normalized and $\forall R_i \in R\ ((f, e) \in R_i \Rightarrow e \in \bot_{D_i})$ (i.e., all facts map to dimension values in the bottom category), we say that $M$ is normalized. Otherwise, it is un-normalized.

For normalized hierarchies and MOs, all mappings are summarizable, meaning that we can pre-aggregate values at any combination of dimension levels and safely re-use the pre-aggregated values to compute higher-level aggregate results. Thus, we want to normalize the dimension hierarchies and MOs for which we want to apply practical pre-aggregation.
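A hypothetical Python sketch of the three hierarchy checks, over a simple child-to-parents representation of two adjacent levels, might look as follows (the function and variable names are assumptions made here for illustration, not part of the paper):

def is_onto(child_values, parent_values, parents_of):
    # Every parent value has at least one child.
    covered = {p for c in child_values for p in parents_of.get(c, set())}
    return parent_values <= covered

def is_strict(child_values, parents_of):
    # Every child value has at most one parent.
    return all(len(parents_of.get(c, set())) <= 1 for c in child_values)

def is_covering(child_values, parents_of, grandparents_of, grandparents_direct):
    # Every direct child-to-grandparent link is also reachable via some parent.
    for c in child_values:
        via_parent = {g for p in parents_of.get(c, set())
                        for g in grandparents_of.get(p, set())}
        if not grandparents_direct.get(c, set()) <= via_parent:
            return False
    return True

# The Residence hierarchy from the case study: address 52 maps directly to
# county 31 without passing through a city, so the hierarchy is non-covering.
parents_of = {50: {20}, 51: {21}}                 # address -> cities
grandparents_of = {20: {30}, 21: {31}}            # city -> counties
grandparents_direct = {52: {31}, 53: {32}}        # address -> counties (direct links)
print(is_covering({50, 51, 52, 53}, parents_of, grandparents_of, grandparents_direct))  # False
print(is_onto({20, 21}, {30, 31, 32}, grandparents_of))   # False: county 32 has no city
print(is_strict({5, 6}, {5: {4, 9}, 6: {4, 10}}))          # False: two parent families each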

We proceed to describe how the normalization of the dimension hierarchies and MOs used for aggregation is achieved. We first show how to perform transformations on dimension hierarchies, then later describe how the same techniques may be applied to eliminate the non-summarizable properties of fact-dimension relations.

4 Dimension Transformation Techniques

This section describes how dimensions can be transformed to achieve summarizability. Transforming dimensions on their own, separately from the facts, results in well-behaved dimensions that can be applied in a number of different systems or sold to third-party users. The transformation of the dimension hierarchies is a three-step operation. First, all mappings are transformed to be covering, by introducing extra “intermediate” values. Second, all mappings are transformed to be onto, by introducing “placeholder” values at lower levels for values without any children. Third, mappings are made strict, by “fusing” values together. The three steps are treated in separate sections. None of the algorithms introduce any non-summarizable properties, so applying each once is sufficient.

In general, the algorithms take as input a set of tables $R_{C_1,C_2}$ that specify the mapping from dimension values in category $C_1$ to values in category $C_2$. The input need not contain all pairs of ancestors and descendants—only direct parent-child relationships are required. If there are non-covering mappings in the hierarchy, we have categories $C, P, H$ such that $\{P, H\} \subseteq \mathrm{Pred}(C)$ and $\mathrm{Type}(P) \sqsubset \mathrm{Type}(H)$. In this case, the input must also contain $R_{P,H}$ tables that map $P$ values to $H$ values.

4.1 Non-Covering Hierarchies

The first algorithm renders all mappings in a dimension hierarchy covering w.r.t. any category. When a dimension value is mapped directly to another value in a category higher than the one immediately above it in the hierarchy, a new intermediate value is inserted into the category immediately above, and the two original dimension values are linked to this new value, rather than to each other.

Example 6 In the hierarchy for the Residence dimension, two links go from Address directly to County.


The address “123 Rural Road” (52) is in “Melbourne County” (31), but not in a city, and the address “1 Sandy Dunes” (53) is in “Outback County” (32), which does not have any cities at all. The algorithm inserts two new dimension values in the City category, C31 and C32, which represent Melbourne and Outback county, respectively, and links them to their respective counties. The addresses “123 Rural Road” and “1 Sandy Dunes” are then linked to C31 and C32, respectively. This occurs in the first call of procedure MakeCovering (on the Address category; the procedure is given below). When MakeCovering is called recursively on the City, County, and $\top$ categories, nothing happens, as all mappings are already covering. The transformation is illustrated graphically in Figure 3. The dotted lines show the “problematic” links, and the bold-face values and thick lines show the new dimension values and links.


Figure 3: Transformations by the MakeCovering Algorithm

In the algorithm, $C$ is a child category, $P$ is a parent category, $H$ is a “higher” category, $L$ is the set of non-covering links from $C$ to $H$, and $N$ is the set of “higher” dimension values in $L$. The $\bowtie$ operator denotes natural join. The algorithm works as follows. Given the argument category $C$ (initially the bottom category) in line (1), the algorithm goes through all of $C$'s parent categories $P$ (2). For each parent category $P$, it looks for predecessor categories $H$ of $C$ that are “higher” in the hierarchy than $P$ (4). If such an $H$ exists, there might be links in the mapping from $C$ to $H$ that are not available by going through $P$. Line (6) finds these “non-covered” links, $L$, in the mapping from $C$ to $H$ by “subtracting” the links that are available by going through $P$ from all the links in the mapping from $C$ to $H$. Line (7) uses $L$ to find the dimension values $N$ in $H$ that participate in the “non-covered” mappings. For each value in $N$, line (8) inserts a corresponding marked value into $P$; these marked values represent the $N$ values in $P$. The marked values in $P$ are then linked to the original values in $H$ (9) and $C$ (10). Line (12) contains a recursive call to the algorithm on $P$, thus fixing mappings higher up in the hierarchy. The algorithm terminates when it reaches the $\top$ category, which has no predecessors.

All steps in the algorithm are expressed using standard relational algebra operators. The general worst-case complexity of join is $O(n^2)$, where $n$ is the size of the input.


However, because the inputs to the algorithm are hierarchy definitions, the complexity of the join in the algorithm will only be $O(n \log n)$. Thus, all the operators used can be evaluated in time $O(n \log n)$, where $n$ is the size of the input. The Mark operation can be performed in $O(1)$ time. The inner loop of the algorithm is evaluated at most once for each link between categories, i.e., at most $k^2/2$ times, where $k$ is the number of categories (if all categories are directly linked to all others). Thus, the overall big-$O$ complexity of the algorithm is $O(k^2 n \log n)$, where $k$ is the number of categories and $n$ is the size of the largest participating $R_{C_1,C_2}$ relation. The worst-case complexity will not apply very often; in most cases, the inner loop will only be evaluated at most $k$ times.

(1) procedure MakeCovering(C)
(2)   for each P ∈ Pred(C) do
(3)   begin
(4)     for each H ∈ Pred(C) where Type(H) ⊐ Type(P) do
(5)     begin
(6)       L ← R_{C,H} ∖ π_{C,H}(R_{C,P} ⋈ R_{P,H})
(7)       N ← π_{H}(L)
(8)       P ← P ∪ {Mark(h) | h ∈ N}
(9)       R_{P,H} ← R_{P,H} ∪ {(Mark(h), h) | h ∈ N}
(10)      R_{C,P} ← R_{C,P} ∪ {(c, Mark(h)) | (c, h) ∈ L}
(11)    end
(12)    MakeCovering(P)
(13)  end
(14) end
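As a rough illustration, the following hypothetical Python sketch mirrors the relational-algebra steps above on dictionaries of parent-child links, with Mark simulated by tagging the higher value; it is only a sketch under those assumptions, not the authors' implementation.

def make_covering(C, pred, level, R, marked=None):
    # Insert marked placeholder values so that every link from C to a "higher"
    # category H also passes through each intermediate parent category P.
    marked = set() if marked is None else marked
    for P in list(pred.get(C, [])):
        for H in list(pred.get(C, [])):
            if level[H] <= level[P]:
                continue
            # Links from C to H that are reachable via P (the join R_{C,P} |><| R_{P,H}).
            via_P = {(c, h) for (c, p) in R.get((C, P), set())
                            for (p2, h) in R.get((P, H), set()) if p == p2}
            L = R.get((C, H), set()) - via_P          # non-covered links
            N = {h for (_, h) in L}                   # their higher-level values
            for h in N:
                m = ("mark", h)                       # marked placeholder inserted into P
                marked.add(m)
                R.setdefault((P, H), set()).add((m, h))
            for (c, h) in L:
                R.setdefault((C, P), set()).add((c, ("mark", h)))
        make_covering(P, pred, level, R, marked)
    return marked

# Residence hierarchy from the case study: Address links 52 and 53 directly to County.
pred = {"Address": ["City", "County"], "City": ["County"], "County": ["T"], "T": []}
level = {"Address": 0, "City": 1, "County": 2, "T": 3}
R = {("Address", "City"): {(50, 20), (51, 21)},
     ("City", "County"): {(20, 30), (21, 31)},
     ("Address", "County"): {(52, 31), (53, 32)},
     ("County", "T"): {(30, "T"), (31, "T"), (32, "T")}}
make_covering("Address", pred, level, R)
print(R[("Address", "City")])  # now also contains (52, ("mark", 31)) and (53, ("mark", 32))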

The algorithm inserts new values into the $P$ category to ensure that the mappings from $P$ to higher categories are summarizable, i.e., that pre-aggregated results for $P$ can be directly combined into higher-level aggregate results. The new values in $P$ mean that the cost of materializing aggregate results for $P$ is higher for the transformed hierarchy than for the original. However, if the hierarchy was not transformed to achieve summarizability, we would have to materialize aggregates for $C$, and perhaps also for higher-level categories. At most one new value is inserted into $P$ for every value in $C$, meaning that the extra cost of materializing results for $P$ is never greater than the cost of the (otherwise necessary) materialization of results for $C$. This is a very unlikely worst-case scenario—in the most common cases, the extra cost for $P$ will be much lower than the cost of materializing results for $C$, and the savings will be even greater because materialization of results for higher-level categories may also be avoided.

The correctness argument for the algorithm has two aspects. First, the mappings in the hierarchy should be covering upon termination. Second, the algorithm should only make transformations that are semantically correct, i.e., we should get the same results when computing results with the new hierarchy as with the old. The correctness follows from Theorems 1 and 2, below. As new values are inserted in the $P$ category, we will get aggregate values for both the new and the original values when “grouping” by $P$. Results for the original values will be the same as before, so the original result set is a subset of the result set obtained with the transformed hierarchy.


Theorem 1 Algorithm MakeCovering terminates and the hierarchy for the resulting dimension $D'$ is covering.

Proof: By induction in the height of the dimension lattice. Base: The height is 1, making the statement trivially true. Induction Step: We assume the statement is true for dimension lattices of height $n$, and consider lattices of height $n + 1$. For termination, we note that there is a finite number of $(P, H)$ pairs, all operations in the inner loop terminate, and the algorithm is called recursively on $P$, which is the root of a lattice of height $n$. For the covering property, we note that the insertion of intermediate, marked values into $P$ means that the mapping from $P$ to $H$ is covering w.r.t. $C$. By the induction hypothesis, the mappings higher in the hierarchy are fixed by the recursive call of the algorithm.

Theorem 2 Given dimensions $D$ and $D'$ such that $D'$ is the result of running MakeCovering on $D$, an aggregate result obtained using $D$ is a subset of the result obtained using $D'$.

Proof: Follows easily from Lemma 1, next, as the inserted values are “internal” in the hierarchy.

Lemma 1 For the dimension $D' = (C', \sqsubseteq')$ resulting from applying algorithm MakeCovering to dimension $D = (C, \sqsubseteq)$, the following holds: $\forall e_1, e_2 \in D\ (e_1 \sqsubseteq' e_2 \Leftrightarrow e_1 \sqsubseteq e_2)$ (there is a path between any two original dimension values in the new dimension hierarchy iff there was a path between them in the original hierarchy).

Proof: By induction in the height of the dimension lattice. Base: The height is 1, making the statement trivially true. Induction Step: We assume the statement is true for dimension lattices of height $n$, and consider lattices of height $n + 1$. Examining the inner loop, we see that the insertion of intermediate values into $P$, and the linking of values in $C$ and $H$ to these, only links values in $C$ and $H$ that were linked before. No links or values are destroyed by the inner loop. Thus, the statement is true for the links from $C$ to $P$, and from $C$ to $H$. By the induction hypothesis, the statement holds true for the transformations made by the recursive call on $P$.

We see that the original values in the hierarchy are still linked to exactly the same original values as before, as stated by Lemma 1, although new values might have been inserted in-between the original values. Thus, when evaluating a query using the transformed hierarchy, the results for the original values will be the same as when using the original hierarchy.

Assuming only the original result set is desired, results for the new values must be excluded, which is easy to accomplish. The new, “internal” values are marked with “mark=internal,” whereas the original values have “mark=original.” In order to exclude the new, internal values from the result set, the equivalent of an SQL HAVING clause condition of “mark=original” is introduced into the original query.

4.2 Non-Onto Hierarchies

The second algorithm renders all mappings in hierarchies onto, i.e., all dimension values in non-bottom categories have children.


This is ensured by inserting placeholder values in lower categories to represent the childless values. These new values are marked with the original values, making it possible to map facts to the new placeholder values instead of to the original values. This makes it possible to only map facts to the bottom category.

Example 7 In the Diagnosis dimension, the “Lung cancer” diagnosis family (ID = 14) has no children. When the algorithm reaches the Diagnosis Family category, it inserts a placeholder value (L14) into the Low-level Diagnosis category, representing the “Lung cancer” diagnosis, and links it to the original value. Facts mapped to the “Lung cancer” value may then instead be mapped to the new placeholder value, ensuring that facts are mapped only to the Low-level Diagnosis category. A graphical illustration of the transformation is seen in Figure 4. The bold-faced L14 value is the new value inserted, and the thick line between 14 and L14 is the new link inserted.


Figure 4: Transformations by the MakeOnto Algorithm

In the algorithm below, $P$ is a parent category, $C$ is a child category, and $N$ holds the parent values with no children. The algorithm works as follows. Given a category $P$ (initially the $\top$ category) in line (1), the algorithm goes through all categories $C$ that are (immediate) descendants of $P$ (2). For each $C$, line (4) finds the values $N$ in $P$ that have no children in $C$, by “subtracting” the values with children in $C$ from the values in $P$. For each “childless” value in $N$, lines (5) and (6), respectively, insert into $C$ a placeholder value marked with the parent value, and link the new value to the original. MakeOnto is then called recursively on $C$ (7). The algorithm terminates when it reaches the $\bot$ category, which has no descendants.

Following the reasoning in Section 4.1, we find that the overall big-$O$ complexity is equal to $O(k^2 n \log n)$, where $k$ is the number of categories and $n$ is the size of the largest participating $R_{C_1,C_2}$ relation. However, the complexity will only be $O(k\, n \log n)$ for the most common cases.

The MakeOnto algorithm inserts new values into $C$ to ensure that the mapping from $C$ to $P$ is summarizable.


Again, this means that the cost of materializing results for $C$ will be higher for the transformed hierarchy than for the original. However, if the new values were not inserted, we would have to materialize results for $P$, and perhaps also higher categories, as well as $C$. At most one value is inserted in $C$ for every value in $P$, meaning that the extra cost for $C$ is never greater than the cost of materializing results for $P$. As before, this is a very unrealistic scenario, as it corresponds to the case where no values in $P$ have children in $C$. In most cases, the extra cost for $C$ will be a small percentage of the cost of materializing results for $P$, and the potential savings will be even greater, because pre-aggregation for higher-level categories may be avoided.

(1) procedure MakeOnto(P)
(2)   for each C ∈ Desc(P) do
(3)   begin
(4)     N ← P ∖ π_{P}(R_{C,P})
(5)     C ← C ∪ {Mark(p) | p ∈ N}
(6)     R_{C,P} ← R_{C,P} ∪ {(Mark(p), p) | p ∈ N}
(7)     MakeOnto(C)
(8)   end
(9) end
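A corresponding hypothetical Python sketch of MakeOnto over dictionaries of category values and links (again an illustration under assumed names, not the authors' code) could be:

def make_onto(P, desc, values, R):
    # Insert a placeholder child for every childless value of P, so that the
    # mapping from each child category C up to P becomes onto.
    for C in desc.get(P, []):
        links = R.setdefault((C, P), set())
        N = values[P] - {p for (_, p) in links}       # childless values in P
        for p in N:
            placeholder = ("mark", p)                 # placeholder value inserted into C
            values[C].add(placeholder)
            links.add((placeholder, p))
        make_onto(C, desc, values, R)

# Diagnosis hierarchy from the case study: family 14 ("Lung cancer") has no
# low-level diagnosis, so a placeholder ("mark", 14) is inserted below it.
desc = {"T": ["Diagnosis Group"], "Diagnosis Group": ["Diagnosis Family"],
        "Diagnosis Family": ["Low-level Diagnosis"], "Low-level Diagnosis": []}
values = {"T": {"T"}, "Diagnosis Group": {11, 12, 13},
          "Diagnosis Family": {4, 9, 10, 14}, "Low-level Diagnosis": {5, 6}}
R = {("Diagnosis Group", "T"): {(11, "T"), (12, "T"), (13, "T")},
     ("Diagnosis Family", "Diagnosis Group"): {(4, 12), (9, 11), (10, 11), (14, 13)},
     ("Low-level Diagnosis", "Diagnosis Family"): {(5, 4), (5, 9), (6, 4), (6, 10)}}
make_onto("T", desc, values, R)
print(values["Low-level Diagnosis"])  # {5, 6, ('mark', 14)}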

As before, the correctness argument for the algorithm has two aspects. First, the mappings in the hierarchy should be onto upon termination. Second, the algorithm should only make transformations that are semantically correct. The correctness follows from Theorems 3 and 4, below. Again, the result set for the original values obtained using the original hierarchy will be a subset of the result set obtained using the transformed hierarchy. The results for the new values can be excluded from the result set by adding a HAVING clause condition.

Theorem 3 Algorithm MakeOnto terminates and the hierarchy for the resulting dimension $D'$ is onto.

Proof: By induction in the height of the dimension lattice. Base: The height is 1, making the statement trivially true. Induction Step: We assume the statement is true for dimension lattices of height $n$, then consider lattices of height $n + 1$. For termination, we note that there is a finite number of descendants $C$ for each $P$, that all operations in the loop terminate, and that the algorithm is called recursively on $C$, which is the top element in a lattice of height $n$. For the onto property, we note that the insertion of placeholder values into $C$ makes the mapping from $C$ to $P$ onto. By the induction hypothesis, the mappings further down in the lattice are handled by the recursive call.

Theorem 4 Given dimensions $D$ and $D'$ such that $D'$ is the result of applying the MakeOnto algorithm to $D$, an aggregate result obtained using $D$ is a subset of the result obtained using $D'$.

Proof: Follows easily from the observation that “childless” dimension values are linked to new, placeholder values in lower categories in one-to-one relationships, meaning that data for childless values will still be counted exactly once in aggregate computations that use the new dimension.


4.3 Non-Strict Hierarchies

The third algorithm renders mappings in hierarchies strict, meaning that problems of “double-counting” will not occur. Non-strict hierarchies occur when one dimension value has several parent values.

The basic idea is to “fuse” a set of parent values into one “fused” value, then link the child value to this new value instead. The fused values are inserted into a new category in-between the child and parent categories. Data for the new fused category may safely be re-used for computation of higher-level aggregate results, as the hierarchy leading up to the new category is strict.

The fused value is also linked to the relevant parent values. This mapping is by nature non-strict, but this non-strictness is not a problem, as we prevent aggregate results for the parent category from being re-used higher up in the hierarchy. This is done by “unlinking” the parent category from its predecessor categories.

The categories higher up are instead reached through the fused category. This means that we can still get results for any original category, while being able to apply practical pre-aggregation throughout the hierarchy. In pre-aggregation terms, the “unlinking” of the parent categories means that we must prevent results for this category from being materialized—only “safe” categories may be materialized. This should be given as a constraint to the pre-aggregation system that chooses which levels of aggregation to materialize.

We note that the algorithm does not introduce more levels in the hierarchy, only more categories, and that the number of “safe” categories in the result is the same as the number of original categories. This means that the complexity of the task of selecting the optimal aggregation levels to materialize is unaffected by the algorithm.

[Figure: the Diagnosis hierarchy before and after the transformation — new Set-of Diagnosis Family and Set-of Diagnosis Group categories hold the fused values in-between Low-level Diagnosis, Diagnosis Family, and Diagnosis Group, and the original Diagnosis Family and Diagnosis Group categories are unlinked from their predecessors.]

Figure 5: Schema and Value Transformations by the MakeStrict Algorithm.

Example 8 The result of running the algorithm on the Diagnosis dimension is seen in Figure 5.


Because of the non-strictness in the mapping from Low-level Diagnosis to Diagnosis Family, and from Diagnosis Family to Diagnosis Group, two new category types and the corresponding categories are introduced. The third picture indicates the argument to the algorithm; in addition, its dotted lines indicate the links deleted by the algorithm. The fourth picture gives the result of applying the algorithm; here, the bold-face values and thick lines indicate the values and links inserted by the algorithm.

In the first call of the algorithm, the three Low-level Diagnosis values—“(low-level) Lung cancer” (L14), “Insulin dependent diabetes during pregnancy” (5), and “Non insulin dependent diabetes during pregnancy” (6)—are linked to the three new fused values—“(low-level) Lung cancer” (14), “Diabetes during pregnancy, Insulin dependent diabetes” (4, 9), and “Diabetes during pregnancy, Non insulin dependent diabetes” (4, 10)—and these are in turn linked to “Lung cancer” (14), “Diabetes during pregnancy” (4), “Insulin dependent diabetes” (9), and “Non insulin dependent diabetes” (10). These latter four values in the Diagnosis Family category are unlinked from their parents, as the Diagnosis Family category is “unsafe.”

When called recursively on the Set-of Diagnosis Family category, the algorithm creates the new fused values “Cancer” (13) and “Diabetes, Other pregnancy related diseases” (11, 12) in the Set-of Diagnosis Group category. These new values are linked to the values “Cancer” (13), “Diabetes” (11), and “Other pregnancy related diseases” (12) in the Diagnosis Group category, and to the $\top$ value; and the values in the Diagnosis Group category are unlinked from their parents. Note the importance of having a $\top$ value: the values not linked to $\top$ are exactly the unsafe values, for which aggregate results should not be re-used.

The algorithm assumes that all paths in the dimension hierarchy have equal length, i.e., all direct links are from children to their immediate parents. This is ensured by the MakeCovering and MakeOnto algorithms. In the algorithm below, C is a child category, P is a parent category, G is a grandparent category, N is the new category introduced to hold the “fused” values, and ⋈ denotes natural join.

The algorithm takes a category C (initially the ⊥ category) as input. It then goes through the set of immediate parent categories P of C (line (2)). Line (4) tests if there is non-strictness in the mapping from C to P and if P has any parents. If this test fails, there is no problem, as aggregate results for P can either be safely re-used or are guaranteed not to be re-used; the algorithm is then invoked recursively, in line (20).

If the test succeeds, the algorithm creates a new fused category. First, a new, empty category N with domain 2^P is created in line (6). The values inserted into this category represent sets of values of P. For example, the value “1, 2” represents the set consisting of precisely the values 1 and 2. Values in C are then linked to new, fused values, representing their particular combination of parents in P (line (7)). The new values are constructed using a Fuse function that creates a distinct value for each combination of P values and stores the corresponding P values along with it.

The resulting links are used in line (8) to insert the fused values into their category N, and an Unfuse function, mapping fused values from N into the corresponding P values, is used in line (9) to map the values in N to those in P. In line (10), N is included in, and P is excluded from, the set of predecessors of C. The set of predecessors of N is set to {P} in line (11), meaning that the new category N resides in-between C and P in the hierarchy.

For each grandparent category G, the algorithm links values in N to values in G, in line (14), includes G in the predecessors of N, in line (15), and excludes G from the predecessors of P, in line (16), thereby also deleting the links from P to G from the hierarchy. The exclusion of the G categories from the predecessors of P means that aggregate results for P will not be re-used to compute results for the G categories.

In the end, the algorithm is called recursively on the new category, N. Note that the test for Pred(P) ≠ ∅ in line (4) ensures that the mapping from N to P will not be altered, as P now has no predecessors.

(1)  procedure MakeStrict(C)
(2)    for each P ∈ Pred(C) do
(3)    begin
(4)      if ∃e1 ∈ C ∃e2, e3 ∈ P (e1 ⊑ e2 ∧ e1 ⊑ e3 ∧ e2 ≠ e3) ∧ Pred(P) ≠ ∅ then
(5)      begin
(6)        N := NewCategory(2^P)
(7)        L_{C,N} := {(e1, Fuse({e2 | (e1, e2) ∈ L_{C,P}})) | e1 ∈ C}
(8)        N := π_N(L_{C,N})
(9)        L_{N,P} := {(e1, e2) | e1 ∈ N ∧ e2 ∈ Unfuse(e1)}
(10)       Pred(C) := Pred(C) ∪ {N} \ {P}
(11)       Pred(N) := {P}
(12)       for each G ∈ Pred(P) do
(13)       begin
(14)         L_{N,G} := π_{N,G}(L_{N,P} ⋈ L_{P,G})
(15)         Pred(N) := Pred(N) ∪ {G}
(16)         Pred(P) := Pred(P) \ {G}
(17)       end
(18)       MakeStrict(N)
(19)     end
(20)     else MakeStrict(P)
(21)   end
(22) end
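The following Python sketch mirrors the structure of the pseudocode. It is only illustrative: the hierarchy is kept in in-memory dictionaries (pred maps a category to its predecessor categories, links maps a pair of categories to a set of value pairs), a fused value is simply represented as a frozenset of parent values, and issues such as persistent storage, the ⊤ category, and incremental maintenance are ignored.

from collections import defaultdict

# pred[c]       : set of immediate predecessor (parent) categories of category c
# links[(c, p)] : set of (child_value, parent_value) pairs for the mapping from c to p
pred = defaultdict(set)
links = defaultdict(set)

def parent_values(c, p, v):
    # the set of p-values that the c-value v maps to
    return frozenset(pv for (cv, pv) in links[(c, p)] if cv == v)

def make_strict(c):
    for p in list(pred[c]):
        child_values = {cv for (cv, _) in links[(c, p)]}
        non_strict = any(len(parent_values(c, p, cv)) > 1 for cv in child_values)
        if non_strict and pred[p]:
            n = ("Set-of", p)                       # the new fused category
            for cv in child_values:
                fused = parent_values(c, p, cv)     # a fused value is a set of p-values
                links[(c, n)].add((cv, fused))      # Fuse step
                for pv in fused:
                    links[(n, p)].add((fused, pv))  # Unfuse step
            pred[c] = (pred[c] - {p}) | {n}
            pred[n] = {p}
            for g in list(pred[p]):
                # derive the n-to-g links by joining the n-to-p and p-to-g links
                for (nv, pv) in links[(n, p)]:
                    for (pv2, gv) in links[(p, g)]:
                        if pv == pv2:
                            links[(n, g)].add((nv, gv))
                pred[n].add(g)
                pred[p].discard(g)                  # "unlink" the unsafe category p
            make_strict(n)
        else:
            make_strict(p)

Calling make_strict on the bottom category of the Diagnosis dimension would, for example, introduce the Set-of Diagnosis Family category of Figure 5.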

Following the reasoning in Section 4.1, we find that the overall big-O complexity is O(k n p log n log p), where k is the number of immediate parent and children categories in the dimension type lattice, n is the size of the largest mapping in the hierarchy, and p is the maximum number of values fused together. For most realistic scenarios, k and p are small constants, yielding a low O(n log n) complexity for the algorithm.

The MakeStrict algorithm constructs a new category N and inserts fused values in N to achieve summarizability for the mapping from N to P, and from N to G. The algorithm only inserts the fused values for the combinations that are actually present in the mapping from C to P. This means that the cost of materializing results for N is never higher than the cost of materializing results for C. This is a worst-case scenario; in the most common cases, the cost of materializing results for N will be close to the cost of materializing results for P. However, without the introduction of N, we would have to materialize results not only for P, but also for G and all higher-level categories. Thus, the potential savings in materialization costs are very high indeed.

Considering correctness, the mappings in the hierarchy should be strict upon termination, and the algorithm should only make transformations that are semantically correct. More specifically, it is acceptable that some mappings be non-strict, namely the ones from the new, fused categories to the unsafe parent categories. This is so because unsafe categories do not have predecessors in the resulting hierarchy, meaning that aggregate results for these categories will not be re-used.

The correctness follows from Theorems 5 and 6, below. When evaluating queries, we get the same result for original values as when evaluating on the old hierarchy. The values that are deleted by the algorithm were not linked to any facts, meaning that these values did not contribute to the results in the original hierarchy. As all the new values are inserted into new categories that are unknown to the user, the aggregate result obtained will be the same for the original and transformed hierarchy. Thus, we do not need to modify the original query.

Theorem 5 Let D′ be the dimension resulting from applying algorithm MakeStrict on dimension D. Then the following hold: Algorithm MakeStrict terminates, and the hierarchy for the dimension D′′, obtained by removing unsafe categories from D′, is strict.

Proof: By induction in the height of the dimension lattice. Base: The height is 1, making the statement trivially true. Induction Step: Assuming that the statement is true for lattices of height n, lattices of height n + 1 are considered. All steps in the algorithm terminate, and the algorithm is called recursively on either P (in the strict case) or N (in the non-strict case), both of which are the root of a lattice of height n, thus guaranteeing termination.

For the strictness property, there are three cases. If the mapping from C to P is already strict, this mapping is not changed, and by the induction hypothesis, the statement holds for the recursive call on P. If the mapping from C to P is non-strict, but P does not have any parents, strictness is ensured, as P is excluded from D′′. If the mapping is non-strict and P has parents, the resulting mapping from C to N is strict. By the induction hypothesis, the statement holds true for the recursive call on N, as the introduction of N has not increased the height of the lattice.

Theorem 6 Given dimensions D and D′ such that D′ is the result of applying the MakeStrict algorithm to D, an aggregate obtained using D′ is the same as that obtained using D.

Proof: Follows from Lemma 2, as all facts are mapped to values in the ⊥ category, which is a safe category. Thus, there will be a path from a fact f to an original dimension value e iff there was one in the original hierarchy, meaning that aggregate results computed using the original and the new hierarchy will be the same.

Lemma 2 For the dimension D′ = (C′, ⊑′) resulting from applying algorithm MakeStrict to dimension D = (C, ⊑), the following holds: ∀e1, e2 ∈ D (e1 ∈ C1 ∧ Safe(C1) ⇒ (e1 ⊑′ e2 ⇔ e1 ⊑ e2)) (there is a path between an original dimension value in a safe category and any other original dimension value in the new dimension hierarchy iff there was a path between them in the original hierarchy).


Proof: By induction in the height of the dimension lattice. Base: The height of the lattice is 1, making the statement trivially true. Induction Step: If either the mapping from C to P is strict, or P does not have any parents, the algorithm does not change the mappings, and by the induction hypothesis, the statement is true for the recursive call on P. Otherwise, we observe that the creation of fused values in N, and the linking of C, P, and G values to these, only links exactly the values in C and P, or C and G, that were linked before. Because P is not safe, the links from P to G may be deleted. By the induction hypothesis, the statement is true for the recursive call on N.

5 Fact-Dimension Transformation Techniques

This section explains how the set of algorithms from Section 4 may also be applied to the relationships between facts and dimensions, thus providing a basis for enabling practical pre-aggregation on concrete MOs that include fact data.

The basic idea is to view the set of facts F as the bottom granularity in the lattice. The input to the algorithms then consists of the facts F, the fact-dimension tables, describing the mappings from facts to dimension values, and the category and link tables, describing the dimension categories and the mappings between them.
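For illustration, such an input could be encoded with plain Python sets and dictionaries as below; the values are an ad hoc, fragmentary abbreviation of the case-study data (cf. Examples 8 and 9), and the table names are not meant to match any particular implementation.

# Facts (the three patients) form the bottom granularity of the Diagnosis dimension.
facts = {1, 2, 3}                                    # John Doe, Jane Doe, Jim Doe

# Fact-dimension table: which diagnosis value characterizes which patient.
fact_links = {(1, "9"), (2, "9"), (2, "5"), (3, "11")}

# Category and link tables for the dimension itself (only a fragment shown).
link_tables = {
    ("Low-level Diagnosis", "Diagnosis Family"):
        {("5", "4"), ("5", "9"), ("6", "4"), ("6", "10")},
    ("Diagnosis Family", "Diagnosis Group"):
        {("4", "11"), ("4", "12"), ("9", "11"), ("10", "11"), ("14", "13")},
}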

Only the covering and strictness properties are considered because, for the fact-dimension relationships, a mapping between facts and dimension values that is into means that not all dimension values in the bottom category have associated facts, which does not affect summarizability. As before, we first apply the MakeCovering algorithm, then the MakeStrict algorithm.

The computational complexity of the algorithms will now be dominated by the size, n, of the mapping between facts and dimension values, i.e., the complexity will be O(n log n) if we assume the height of the lattice and the maximum number of values fused together to be small constants. This means that the algorithms can be applied to even very large databases.

5.1 Mixed Granularity Mappings

The first case to consider is the one where some of the mappings are non-covering w.r.t. the facts, meaning that not all facts can be reached through these mappings and thus resulting in these facts not being accounted for in aggregate computations. This occurs when some facts are mapped directly to dimension values in categories higher than the ⊥ category, i.e., the facts are mapped to values of mixed granularities.

We use the MakeCovering algorithm to make the mappings covering, initially calling it on F, which is now the bottom of the lattice. The algorithm makes the mappings covering w.r.t. the facts by inserting new marked values, representing the parent values, in the intermediate categories, and by linking the facts to the new values instead of the parent values. As in Section 4.1, the marked values keep information about their original values, so that when new fact-dimension mappings are added, the links that are supposed to go directly to the original parent values can instead be set to go to the marked value in the ⊥ category.
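A minimal sketch of this marking step, using the same kind of set-based tables as above (the helper name and the "L" prefix are just illustrative conventions, echoing the marked values of Example 9), is:

def insert_marked_value(fact_links, low_level_links, fact, high_value):
    # Replace a direct link from a fact to a higher-level value by a link to a
    # new marked value in the bottom category; the marked value remembers the
    # original value (e.g., "L9" representing Diagnosis Family value 9) and is
    # itself linked to that original value.
    marked = "L" + str(high_value)
    fact_links.discard((fact, high_value))
    fact_links.add((fact, marked))
    low_level_links.add((marked, high_value))
    return marked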


Example 9 In the case study, the mapping between Patients and Diagnoses is of mixed granularity: “John Doe” (1) and “Jane Doe” are both mapped to the Diagnosis Family “Insulin dependent diabetes” (9); “Jane Doe” is additionally mapped to the Low-level Diagnosis “Insulin dependent diabetes during pregnancy” (5); and “Jim Doe” is mapped to “Diabetes” (11), a Diagnosis Group.

In the first call of the algorithm, two new Low-level Diagnoses are inserted: “L9,” representing “Insulin dependent diabetes,” and “L11,” representing “Diabetes”; and the facts are mapped to these instead of the original values. In the recursive call on Low-level Diagnosis, an “F11” value representing “Diabetes” at the Diagnosis Family level is inserted between “Diabetes” and value “L11.”

The transformations are illustrated in Figure 6, where dotted lines indicate links that are deleted by the algorithm, and bold-face values and thick lines indicate dimension values and links inserted by the algorithm.

Figure 6: Transformations for Varying Granularities.

5.2 Many-To-Many Relationships

The second case occurs when relationships between facts and dimension values are many-to-many. This means that the hierarchy, with the facts as the bottom category, is non-strict, leading to possible double-counting of facts. It is enough to make the hierarchy partly strict, as described in Section 4.3. The MakeStrict algorithm is initially called on F, which is now the bottom of the hierarchy lattice. Because the MakeCovering algorithm has already been applied, all paths from facts to the ⊤ value have equal length, as required by the MakeStrict algorithm.


Some dimension values have no facts mapped to them, leading to an interesting side effect of the algorithm. When the algorithm fuses values and places the fused values in-between the original values, it also deletes the child-to-parent and parent-to-grandparent links. The fact-less dimension values are then left disconnected from the rest of the hierarchy, with no links to other values.

These fact-less dimension values do not contribute to any aggregate computations and are thus superfluous. To minimize the dimensions, a “Delete-unconnected” algorithm that deletes the fact-less dimension values by traversing the hierarchy starting at the facts is invoked in a postprocessing step. For a hierarchy of height h, this can be done in time O(h n log n), where n is the size of the mapping between facts and dimensions. Thus, the overall computational complexity is not altered.
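One possible realization of such a post-processing step is sketched below in Python; it assumes the set-based representation used earlier and that value identifiers are globally unique across categories, and it simply keeps the values reachable from the facts.

def delete_unconnected(facts, fact_links, links):
    # Keep only the dimension values reachable from some fact; 'links' maps a
    # (child_category, parent_category) pair to a set of (child, parent) value pairs.
    reachable = {v for (f, v) in fact_links if f in facts}
    frontier = set(reachable)
    while frontier:                                   # walk upwards, one level per round
        next_frontier = set()
        for pairs in links.values():
            for (cv, pv) in pairs:
                if cv in frontier and pv not in reachable:
                    reachable.add(pv)
                    next_frontier.add(pv)
        frontier = next_frontier
    for key in list(links):                           # drop links touching unreachable values
        links[key] = {(cv, pv) for (cv, pv) in links[key]
                      if cv in reachable and pv in reachable}
    return reachable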

Figure 7: Transformations for Many-to-many Fact-Dimension Relationships

Example 10 The relationship between patients and diagnoses is many-to-many. In Example 9, the MO was transformed so that all mappings were covering, as seen in Figure 6; algorithm MakeStrict is applied to this MO. The final result of the application of the algorithm is seen to the right in Figure 7. Values in italics, e.g., L14, and dotted lines indicate deleted values and links. Bold-face values and thick lines denote values and links inserted by the algorithm.

Three new categories are introduced: “Set-of Low-level Diagnosis,” “Set-of Diagnosis Family,” and “Set-of Diagnosis Group,” as non-strictness occurs at all levels. Fused values are inserted into these fused categories. For example, values “(low-level) Lung Cancer” (L14), “Insulin dependent diabetes during pregnancy, (low-level) Insulin dependent diabetes” (5, L9), and “(low-level) Insulin dependent diabetes” (L9) are inserted into the “Set-of Low-level Diagnosis” category; and the original values are linked to the new values.

Values “(low-level) Lung cancer” (L14), “Lung cancer” (14), “Cancer” (13), “Non insulin dependent diabetes during pregnancy” (6), and “Non insulin dependent diabetes” (10) do not characterize any facts and are deleted by “Delete-unconnected.”

6 Architectural Context

The overall idea presented in this paper is to take un-normalized MOs and transform them into normalized MOs that are well supported by the practical pre-aggregation techniques available in current OLAP systems. Queries are then evaluated on the transformed MOs. However, we still want the users to see only the original MOs, as they reflect the users’ understanding of the domain. This prompts the need for means of handling both the original and the transformed MOs. This section explores this coexistence.

A current trend in commercial OLAP technology is the separation of the front-end presentation layer from the back-end database server. Modern OLAP applications consist of an OLAP client that handles the user interface and an OLAP server that manages the data and processes queries. The client communicates with the server using a standardized application programming interface (API), e.g., Microsoft’s OLE DB for OLAP [17] or the OLAP Council’s MDAPI [20]. The architecture of such a system is given to the left in Figure 8.

Figure 8: Architecture of Integration

This separation of client and server facilitates our desire to have the user see the original MO while queries are evaluated against the transformed MO. Studies have shown that queries on a data warehouse consist of 80% navigational queries that explore the dimension hierarchies and 20% aggregation queries that summarize the data at various levels of detail [14]. Examples of navigational and aggregation queries are “Show me the Low-Level Diagnoses contained in the Insulin-Dependent Diabetes Diagnosis Family” and “Show me the count of patients, grouped by Diagnosis Family,” respectively. The navigational queries must be performed on the original MO, while the aggregation queries must be performed on the transformed MO. This is achieved by introducing an extra “Query Handler” component between the client and the server. The OLAP client sends a query to the query handler, the primary task of which is to determine whether the query is a navigational query (internal to a dimension) or an aggregation query (involving the facts). Navigational queries are passed to one OLAP server that handles the original (navigational) data, while aggregation queries are passed to another OLAP server that manages the transformed (aggregation) data. This extended system architecture is seen to the right in Figure 8.
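A query handler of this kind can be sketched as a thin dispatcher, as below; the classification test and the server interface are deliberately simplified assumptions, since an actual handler would inspect the structure of the multidimensional query.

class QueryHandler:
    # Routes navigational queries to the server holding the original MO and
    # aggregation queries to the server holding the transformed MO.

    def __init__(self, navigation_server, aggregation_server):
        self.navigation_server = navigation_server
        self.aggregation_server = aggregation_server

    def execute(self, query):
        # 'involves_facts' is a stand-in for a real structural test on the query.
        if query.involves_facts:
            return self.aggregation_server.execute(query)
        return self.navigation_server.execute(query)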

The OLAP server for the navigational data needs to support dimension hierarchies with non-summarizable properties, a requirement that many commercial systems do not yet meet. However, relational OLAP systems using snow-flake schemas [14] are able to support this type of hierarchy, as are some other OLAP systems, e.g., Hyperion (Arbor) Essbase [12]. If the OLAP system available does not have sufficiently flexible hierarchy support, one solution is to build a special-purpose OLAP server that conforms to the given API. This task is not as daunting as it may seem at first because only navigational queries need to be supported, meaning that multidimensional queries can be translated into simple SQL “lookup” queries.

We note that the only data needed to answer navigational queries is the hierarchy definitions. Thus, we only need to store the fact data (facts and fact-dimension relations, in our model) once, in the aggregational data, meaning that the overall storage requirement is only slightly larger than storing just the aggregational data. Navigational queries are evaluated on the original hierarchy definitions and do not need to be re-written by the query handler.

As described in Section 4, aggregation queries need to be re-written slightly by adding an extra HAVING clause condition to exclude results for the new values inserted by the transformation algorithms. This can easily be done automatically by the query handler, giving total transparency for the user. Even though the added HAVING clause conditions are only necessary for the covering and onto transformations, they can also be applied to hierarchies transformed to achieve strictness; this has no effect, but simplifies the query rewriting.
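The rewriting itself can be as simple as appending one condition per level; the following Python fragment sketches the idea for a SQL back-end, assuming, purely for illustration, that the inserted marker values share a common prefix.

def add_marker_filter(sql, level_column, marker_prefix="L"):
    # Append a HAVING condition that excludes result rows for marker values
    # inserted by the transformation algorithms.
    condition = f"{level_column} NOT LIKE '{marker_prefix}%'"
    if "HAVING" in sql.upper():
        return f"{sql} AND {condition}"
    return f"{sql} HAVING {condition}"

query = "SELECT DiagnosisFamily, COUNT(*) FROM Patients GROUP BY DiagnosisFamily"
print(add_marker_filter(query, "DiagnosisFamily"))

For the example query, the rewritten statement simply ends in HAVING DiagnosisFamily NOT LIKE 'L%'.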

7 Conclusion and Future Work

Motivated by the increasing use of OLAP systems in many different applications, including business and health care, this paper provides transformation techniques for multidimensional databases that leverage the existing performance-enhancing technique known as practical (or partial, or semi-eager) pre-aggregation, by making this technique relevant to a much wider range of real-world applications.

Current pre-aggregation techniques assume that the dimensional structures are summarizable. Specifically, the mappings in dimension hierarchies must be onto, covering, and strict; the relationships between facts and dimensions must be many-to-one; and the facts must always be mapped to the lowest categories in dimensions. The paper presents novel transformation techniques that render dimensions with hierarchies that are non-onto, non-covering, and non-strict summarizable. The transformations have low computational complexity in practice and may be implemented using standard relational database technology; the paper also describes how to integrate the transformed hierarchies in current OLAP systems, transparently to the user.

The paper also describes how to apply the transformations to the cases of non-summarizable relationships between facts and dimensions, which also occur often in real-world applications. Finally, it is shown how to modify the algorithms to incrementally maintain the transformed hierarchies when the underlying data is modified.

Several directions for future research appear promising. The current techniques render the entire dimension hierarchies summarizable; extending the techniques to consider only the parts that have been selected for pre-aggregation appears attractive and possible. Another direction is to take into account the different types of aggregate functions to be applied, leading to local relaxation of the summarizability requirement. For example, max and min are insensitive to duplicate values, thus relaxing summarizability.

References

[1] E. Baralis, S. Paraboschi, and E. Teniente. Materialized View Selection in a Multidimensional Database. In Proceedings of the Twenty-Third International Conference on Very Large Data Bases, pp. 156–165, 1997.

[2] E. F. Codd. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. Technical report, E. F. Codd and Associates, 1993.

[3] S. Dar, H. V. Jagadish, A. Y. Levy, and D. Srivastava. Answering SQL Queries Using Views. In Proceedings of the Twenty-Second International Conference on Very Large Data Bases, pp. 318–329, 1996.

[4] P. M. Deshpande, J. F. Naughton, K. Ramasamy, A. Shukla, K. Tufte, and Y. Zhao. Cubing Algorithms, Storage Estimation, and Storage and Processing Alternatives for OLAP. IEEE Data Engineering Bulletin, 20(1):3–11, 1997.

[5] C. E. Dyreson. Information Retrieval from an Incomplete Data Cube. In Proceedings of the Twenty-Second Conference on Very Large Databases, pp. 532–543, 1996.

[6] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals. Data Mining and Knowledge Discovery, 1(1):29–54, 1997.

[7] A. Gupta, V. Harinarayan, and D. Quass. Aggregate Query Processing in Data Warehousing Environments. In Proceedings of the Twenty-First International Conference on Very Large Data Bases, pp. 358–369, 1995.

[8] H. Gupta, V. Harinarayan, A. Rajaraman, and J. Ullman. Index Selection for OLAP. In Proceedings of the Thirteenth International Conference on Data Engineering, pp. 208–219, 1997.

[9] H. Gupta. Selection of Views to Materialize in a Data Warehouse. In Proceedings of the Sixth International Conference on Database Theory, pp. 98–112, 1997.

[10] H. Gupta and I. S. Mumick. Selection of Views to Materialize Under a Maintenance-Time Constraint. In Proceedings of the Seventh International Conference on Database Theory, pp. 453–470, 1999.

[11] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing Data Cubes Efficiently. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pp. 205–216, 1996.

[12] Hyperion Corporation. Hyperion Essbase OLAP Server. URL: <www.hyperion.com/downloads/essbaseolap.pdf>. Current as of February 17, 1999.

[13] Informix Corporation. Data Warehouse Administrator’s Guide: MetaCube ROLAP Option for Informix Dynamic Server. URL: <www.informix.com/answers/english/pdf_docs/metacube/4189.pdf>. Current as of February 15, 1999.

[14] R. Kimball. The Data Warehouse Toolkit. Wiley Computer Publishing, 1996.

[15] R. Kimball. Data Warehouse Architect: Help with Multi-Valued Dimension. DBMS Magazine, 11(9), 1998.

[16] H. Lenz and A. Shoshani. Summarizability in OLAP and Statistical Data Bases. In Proceedings of the Ninth International Conference on Statistical and Scientific Database Management, pp. 39–48, 1997.

[17] Microsoft Corporation. OLE DB for OLAP Version 1.0 Specification. Microsoft Technical Document, 1998.

[18] Microsoft Corporation. OLAP Services White Paper. URL: <www.microsoft.com/sql/70/whpprs/olapoverview.htm>. Current as of February 9, 1999.

[19] I. S. Mumick, D. Quass, and B. S. Mumick. Maintenance of Data Cubes and Summary Tables in a Warehouse. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pp. 100–111, 1997.

[20] The OLAP Council. MDAPI Specification Version 2.0. OLAP Council Technical Document, 1998.

[21] The OLAP Report. Database Explosion. URL: <www.olapreport.com/DatabaseExplosion.htm>. Current as of February 10, 1999.

[22] T. B. Pedersen and C. S. Jensen. Multidimensional Data Modeling for Complex Data. In Proceedings of the Fifteenth International Conference on Data Engineering, 1999. Extended version available as TimeCenter Technical Report TR-37, URL: <www.cs.auc.dk/TimeCenter>, 1998.

[23] D. Quass and J. Widom. On-Line Warehouse View Maintenance for Batch Updates. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pp. 393–404, 1997.

[24] M. Rafanelli and A. Shoshani. STORM: A Statistical Object Representation Model. In Proceedings of the Fifth International Conference on Statistical and Scientific Database Management, pp. 14–29, 1990.

[25] A. Segev and J. L. Zhao. Selective View Materialization in Data Warehousing Systems. Working paper, URL: <ftp://segev.lbl.gov/pub/LBL_DB_PUBLICATIONS/1997/aggreg-dw.ps>. Current as of February 9, 1999.

[26] A. Shukla, P. M. Deshpande, J. F. Naughton, and K. Ramasamy. Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies. In Proceedings of the Twenty-Second International Conference on Very Large Data Bases, pp. 522–531, 1996.

[27] D. Theodoratos and T. Sellis. Data Warehouse Configuration. In Proceedings of the Twenty-Third International Conference on Very Large Data Bases, pp. 126–135, 1997.

[28] J. Widom. Research Problems in Data Warehousing. In Proceedings of the Fourth International Conference on Information and Knowledge Management, pp. 25–30, 1995.

[29] R. Winter. Databases: Back in the OLAP Game. Intelligent Enterprise Magazine, 1(4):60–64, 1998.

[30] World Health Organization. International Classification of Diseases (ICD-10). Tenth Revision, 1992.

[31] J. Yang, K. Karlapalem, and Q. Li. Algorithms for Materialized View Design in a Data Warehousing Environment. In Proceedings of the Twenty-Third International Conference on Very Large Data Bases, pp. 136–145, 1997.

A Incremental Computation

When dimension hierarchies or fact data are updated, the transformed hierarchies must be updated correspondingly. One solution is to recompute the hierarchies using the new data. This straightforward solution is attractive when updating small dimension hierarchies that only change infrequently, or when large bulks of updates are processed. However, for massive hierarchies and frequent updates, and for updates of small parts of the hierarchies in general, it is desirable that the algorithms need only consider the changed parts of the data, which will typically be only a small fraction of the total data volume. This section briefly describes how to incrementalize the algorithms.

In addition to modifying the transformed hierarchies, it is also necessary to update the actual pre-aggregated data when the underlying base data is modified. The modified hierarchies resulting from the algorithms given in this section differ only locally from the argument hierarchies. This means that the cost of updating the pre-aggregated data will not be greatly affected by the hierarchy transformations.

In the incremental algorithms, updates are modeled as deletions followed by insertions, so we consider only the latter two modification operations. We use the prefix Δ⁺ to denote inserted values, Δ⁻ to denote deleted values, and Δ to denote all modifications. For example, Δ⁺C denotes the values inserted into C. The category and link tables in the algorithms refer to the states after modifications; and when a hierarchy value is deleted, all links to that value are also assumed to be deleted in the same set of modifications.

Covering:
              Insert   Delete
  C           yes      no
  P           no       yes
  H           no       yes
  L_{C,H}     yes      no
  L_{C,P}     no       yes
  L_{P,H}     no       yes

Onto:
              Insert   Delete
  C           no       yes
  P           yes      no
  L_{C,P}     no       yes

Strict:
              Insert   Delete
  C           no       yes
  P           no       yes
  G           no       yes
  L_{C,P}     yes      yes
  L_{P,G}     yes      yes

Table 2: Effects of Insertions and Deletions on the Covering, Onto, and Strictness Properties

A.1 Covering Hierarchies

Modifications may render covering hierarchies non-covering in several ways. The left-most table in Table 2, named “Covering” and discussed next, indicates whether an insertion (“Insert”) or a deletion (“Delete”) on the different parts of the input to MakeCovering may render the modified hierarchy non-covering.

Problems may arise if links are inserted into L_{C,H} that are not covered by insertions into L_{C,P} and L_{P,H}, or if links are deleted in L_{C,P} or L_{P,H}, but the corresponding C-to-H links are not deleted in L_{C,H}. If values are deleted in P or H, their links will be deleted too, which is handled by the case above. Values cannot be inserted into C without any links, as all values in the original hierarchy must at least be linked to the ⊤ value.

The incremental version of the MakeCovering algorithm starts by finding (in line (6)) the set N of links from C to H that are not covered by the links from C to P and from P to H. These links are used as the basis for the rest of the transformation. Thus, line (6) of the algorithm becomes the following expression.

N := Δ⁺L_{C,H} ∪ π_{C,H}(Δ⁻L_{C,P} ⋈ L_{P,H}) ∪ π_{C,H}(L_{C,P} ⋈ Δ⁻L_{P,H}) \ π_{C,H}(Δ⁺L_{C,P} ⋈ Δ⁺L_{P,H}) \ Δ⁻L_{C,H}
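Operationally, the expression gathers the newly inserted C-to-H links and the C-to-H links whose cover disappeared because a C-to-P or P-to-H link was deleted, and then discards links that are newly covered or that were themselves deleted. A small Python sketch of the same computation, over sets of value pairs, assuming the delta sets are given:

def uncovered_links_delta(ins_CH, del_CH, ins_CP, del_CP, L_CP, ins_PH, del_PH, L_PH):
    # C-to-H links that MakeCovering must re-examine after a batch of modifications.
    def join(ab, bc):
        # natural join of two sets of pairs on the shared middle component
        return {(a, c) for (a, b1) in ab for (b2, c) in bc if b1 == b2}

    result = set(ins_CH)              # newly inserted C-to-H links
    result |= join(del_CP, L_PH)      # cover lost: a C-to-P link was deleted
    result |= join(L_CP, del_PH)      # cover lost: a P-to-H link was deleted
    result -= join(ins_CP, ins_PH)    # links that are covered by the new insertions
    result -= set(del_CH)             # links that were themselves deleted
    return result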

A.2 Onto Hierarchies

The effects on the onto property of insertions and deletions are outlined in the middle table in Table 2. Insertion of values into P, deletion of values in C, and deletion of links in L_{C,P} may cause the hierarchy to become non-onto. The incremental version of the MakeOnto algorithm thus starts by finding (in line (4)) the “childless” values N from P with no children in C. As a result, line (4) of the algorithm becomes the following expression.

N := Δ⁺P ∪ π_P(Δ⁻L_{C,P}) \ Δ⁻P \ π_P(Δ⁺L_{C,P})
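In the same set-based style, the childless parent values can be collected as follows (a sketch that mirrors the expression above):

def childless_delta(ins_P, del_P, ins_LCP, del_LCP):
    # Parent values that may have become childless after a batch of modifications.
    candidates = set(ins_P) | {p for (_, p) in del_LCP}
    candidates -= set(del_P)                  # deleted parent values are gone entirely
    candidates -= {p for (_, p) in ins_LCP}   # values that received new child links
    return candidates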

A.3 Strict Hierarchies

The case of maintaining the strictness property of hierarchies is more complicated because a new category N is introduced by the algorithm. We assume that all new categories have already been created before the incremental algorithm is used, i.e., if non-strictness is introduced in new parts of the hierarchy, we have to recompute the transformed hierarchy. The introduction of non-strictness requires major restructuring of both the hierarchy and the pre-aggregated data, so this is reasonable.

An overview of the effect on strictness of insertions and deletions in the input to algorithm MakeStrict is given in the right-most table in Table 2. If links are inserted into, or deleted from, L_{C,P} or L_{P,G}, the links to N for the affected C, P, and G values must be recomputed.

Insertions into, or deletions from, C, P, or G will be accompanied by corresponding link insertions and deletions, so they are handled by the above case. The incremental MakeStrict, given below, works by finding the affected C, P, and G values, recomputing their links to N, deleting the old links, and finally inserting the new links. As before, it is followed by a step that deletes the disconnected parts of the hierarchy.

(1)  procedure IncrementalMakeStrict(C)
(2)    for each P ∈ Pred(C) such that Pred(P) ≠ ∅ do
(3)    begin
(4)      ΔC := π_C(ΔL_{C,P})
(5)      ΔL_{C,N} := {(e, Fuse({p | (e, p) ∈ ΔC ⋈ L_{C,P}}))}
(6)      ΔN := π_N(ΔL_{C,N})
(7)      N := N ∪ ΔN
(8)      L_{C,N} := L_{C,N} \ {(e, n) | e ∈ ΔC} ∪ ΔL_{C,N}
(9)      ΔP := π_P(ΔL_{C,P})
(10)     ΔL_{N,P} := {(n, p) | n ∈ ΔN ∧ p ∈ ΔP ∧ p ∈ Unfuse(n)}
(11)     L_{N,P} := L_{N,P} \ {(n, p) | p ∈ ΔP} ∪ ΔL_{N,P}
(12)     for each G ∈ Pred(P) do
(13)     begin
(14)       ΔG := π_G(ΔL_{P,G} ∪ (ΔP ⋈ L_{P,G}))
(15)       L_{N,G} := L_{N,G} \ {(n, g) | g ∈ ΔG} ∪ π_{N,G}(L_{N,P} ⋈ L_{P,G} ⋈ ΔG)
(16)     end
(17)     IncrementalMakeStrict(N)
(18)   end
(19) end
