DataStage-PX ClassBook V0.1ds

299
 ® IBM WebSphere DataStage Introduction To Enterprise Edition

Transcript of DataStage-PX ClassBook V0.1ds

Page 1: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 1/299

 ® 

IBM WebSphere DataStage

Introduction To Enterprise Edition

Page 2: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 2/299

 Course Contents 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Module 01: Introduction

Module 02: Setting Up Your DataStage Environment

Module 03: Creating Parallel Jobs

Module 04: Accessing Sequential Data

Module 05: Platform Architecture

Module 06: Combining Data

Module 07: Sorting and Aggregating Data

Module 08: Transforming Data

Module 09: Standards and Techniques

Module 10: Accessing Relational Data

Module 11: Compilation and Execution

Module 12: Testing and Debugging

Module 13: Metadata in Enterprise Edition

Module 14: Job Control

Page 3: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 3/299

 

Course 

 

 

 

 

 

 

 

 

 

 

 

 

 

Objectives 

DataStage Clients and ServerSetting up the parallel environment

Importing metadata

Building DataStage jobs

Loading metadata into job stages

Accessing Sequential data

Accessing Relational data

Introducing the Parallel frameworkarchitecture

Transforming data

Sorting and aggregating data

Merging data

Configuration files

Page 4: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 4/299

 ® 

IBM WebSphere DataStage

Module  01: Introduction

Page 5: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 5/299

 

What is IBM WebSphere DataStage? 

Design jobs for Extraction, Transformation, and Loading (ETL)

Ideal tool for data integration projects – such as, data warehouses, data marts,and system migrations

Import, export, create, and manage metadata for use within jobs

Schedule, run, and monitor jobs all within DataStage

 Administer your DataStage development and execution environments

Create batch (controlling) jobs

Page 6: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 6/299

 

DataStage Server  and Clients 

Windows or  Unix Server  

Microsoft Windows 

Page 7: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 7/299

 

Client Logon 

Page 8: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 8/299

 

DataStage  Administrator  

Page 9: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 9/299

 

DataStage Manager  

Page 10: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 10/299

 

DataStage Designer  

Page 11: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 11/299

 

DataStage Director  

Page 12: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 12/299

 

Developing in DataStage 

Define global and project properties in AdministratorImport metadata into the Repository

Manager

Designer Repository View

Build job in Designer

Compile job in Designer

Run and monitor job in Director

Page 13: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 13/299

 

DataStage Projects 

Page 14: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 14/299

 

DataStage Jobs 

Parallel jobsExecuted under  control of  DataStage Server  runtime environment

Built-in functionality for  Pipeline and Partitioning Parallelism

Compiled into OSH (Orchestrate Scripting Language) 

 OSH executes Operators 

 –  Executable C++ class instances 

Runtime monitoring in DataStage Director  

Job Sequences (Batch jobs, Controlling jobs)Master  Server   jobs that kick-off   jobs and other  activities 

Can kick-off  Server  or  Parallel  jobs 

Runtime monitoring in DataStage Director  

Server jobs (Requires Server Edition license)Executed by the DataStage Server  Edition

Compiled into Basic (interpreted pseudo-code)

Runtime monitoring in DataStage Director  

Mainframe jobs (Requires Mainframe Edition license)Compiled into COBOL 

Executed on the Mainframe, outside of  DataStage 

Page 15: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 15/299

 

Design Elements of Parallel Jobs 

StagesImplemented as OSH operators (pre-built components)

Passive stages (E and L of ETL)

Read data

Write data

E.g., Sequential File, Oracle, Peek stagesProcessor (active) stages (T of ETL)

Transform data

Filter data

 Aggregate data

Generate data

Split / Merge data

E.g., Transformer, Aggregator, Join, Sort stages

Links

“Pipes” through which the data moves from stage to stage

Page 16: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 16/299

 

Quiz  – True or False? 

DataStage Designer is used to build and compile your ETL jobs

Manager is used to execute your jobs after you build them

Director is used to execute your jobs after you build them

 Administrator is used to set global and project properties

Page 17: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 17/299

 

Introduction to the Lab Exercises 

Two types of exercises in this course:

 Conceptual exercises

Designed to reinforce a specific module‟s topics

Provide hands-on experiences with DataStage

Introduced by the word “Concept” 

 E.g., “Conceptual Lab 01A” 

Solution Development exercises

Based on production applications

Provide development examples

Introduced by the word “Solution” 

 E.g., “Solution Lab 05A” 

The Solution Development exercises are introduced and discussed in a latermodule

 

Page 18: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 18/299

 

Lab Exercises 

 

Conceptual Lab 01AInstall DataStage clients

Test connection to the DataStage Server

Install lab files

Page 19: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 19/299

 

® 

IBM WebSphere DataStage

Module 02: SettingEnvironment up Your DataStage

Page 20: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 20/299

 

Module Objectives 

 

 

 

 

Setting project properties in AdministratorDefining Environment Variables

Importing / Exporting DataStage objects in Manager

Importing Table Definitions defining sources and targets in Manager

Page 21: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 21/299

 

Setting Project Properties 

 

Page 22: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 22/299

Project Properties 

 

Projects can be created and deleted in AdministratorEach project is associated with a directory on the DataStage Server

Project properties, defaults, and environmental variables are specifiedin Administrator

Can be overridden at the job level

 

01/15/06

 

Page 23: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 23/299

Setting Project Properties 

 

To set project properties, log onto Administrator, select your project,and then click “Properties” 

Page 24: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 24/299

 

Project Properties General Tab 

Page 25: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 25/299

 

Environment Variables 

Page 26: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 26/299

 

Permissions Tab 

 

Page 27: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 27/299

Tracing Tab 

 

Page 28: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 28/299

Parallel Tab 

 

Page 29: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 29/299

Sequence Tab 

Page 30: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 30/299

 

Importing and 

DataStage 

Exporting 

Objects 

 

Page 31: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 31/299

What Is Metadata? 

Data

Source Target

Metadata Metadata

Metadata

Repository

Transform

Page 32: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 32/299

 

Page 33: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 33/299

Manager Contents 

 Metadata

Describing sources and targets: Table definitions

Describing inputs / outputs from external routines

Describing inputs and outputs to BuildOp and CustomOp stages

DataStage objects

JobsRoutines

Compiled jobs / objects

Stages

 

 

Page 34: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 34/299

Import and Export 

 

 

 

 

 

 

 Any object in Manager can be exported to a fileCan export whole projects

Use for backup

Sometimes used for version control

Can be used to move DataStage objects from one project to another

Use to share DataStage jobs and projects with other developers

Page 35: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 35/299

 

Export 

 

 

 

Procedure 

In Manager, click “Export>DataStage Components” Select DataStage objects for export

Specify type of export:

DSX: Default format

XML: Enables processing of export file by XML applications, e.g., for

generating reportsSpecify file path on client machine 

 

Page 36: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 36/299

Quiz - True or False? 

 You can export DataStage objects such as jobs, but you can‟t export metadata, such as field definitions of a sequential file.

 

Page 37: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 37/299

Quiz - True or False? 

 

The directory to which you export is on the DataStage client machine,not on the DataStage server machine.

 

Page 38: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 38/299

Exporting DataStage Objects 

 

Page 39: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 39/299

Select Objects for Export 

Page 40: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 40/299

 

Page 41: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 41/299

Import Procedure 

 In Manager, click “Import>DataStage Components” 

Or “Import>DataStage Components (XML)” if you are importing an XML-format export file

Select DataStage objects for import 

 

Page 42: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 42/299

Importing DataStage Objects 

 

Page 43: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 43/299

Import Options 

 

Page 44: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 44/299

Importing Metadata 

 

Page 45: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 45/299

Metadata Import 

 

 

 

 

 

Import format and column definitions from sequential files

Import relational table column definitions

Imported as “Table Definitions” 

Table definitions can be loaded into job stages

Table definitions can be used to define Routine and Stage interfaces

 

Page 46: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 46/299

Sequential File Import Procedure 

 

 

 

 

In Manager, click Import>Table Definitions>Sequential File Definitions

Select directory containing sequential file and then the file

Select Manager category

Examined format and column definitions and edit is necessary

 

Page 47: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 47/299

Importing Sequential Metadata 

 

Page 48: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 48/299

Sequential Import Window 

 

Page 49: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 49/299

Specify Format 

 

Page 50: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 50/299

Specify Column Names and Types 

Double-click to define extended properties 

 

Page 51: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 51/299

Extended Properties window 

Property categories 

 Available properties 

 

Page 52: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 52/299

Table Definition General Tab 

Second levelcategory

Top levelcategory

 

Page 53: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 53/299

Table Definition Columns Tab 

 

f

Page 54: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 54/299

Table Definition Parallel Tab 

 

T bl D fi iti F t T b

Page 55: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 55/299

Table Definition Format Tab 

 

L b E i

Page 56: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 56/299

Lab Exercises 

 

Conceptual Lab 02A

Set up your DataStage environment

Conceptual Lab 02B

Import a sequential file Table Definition

 

Page 57: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 57/299

 

® 

IBM WebSphere DataStage

Module 03: Creating Parallel Jobs

 

Mod le Objecti es

Page 58: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 58/299

Module Objectives 

 

 

 

 

Design a simple Parallel job in Designer

Compile your job

Run your job in Director

View the job log

 

Page 59: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 59/299

Creating Parallel Jobs 

 

What Is a Parallel Job?

Page 60: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 60/299

What Is a Parallel Job? 

 

 

Executable DataStage program

Created in DataStage Designer

Can use components from Manager Repository

Built using a graphical user interface

Compiles into Orchestrate shell language (OSH) and object code

(from generated C++)

 

 

 

Job Development Overview

Page 61: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 61/299

Job Development Overview 

 Import metadata defining sources and targets

Can be done within Designer or Manager

In Designer, add stages defining data extractions and loads

 Add processing stages to define data transformations

 Add links defining the flow of data from sources to targets

Compile the job

In Director, validate, run, and monitor your job

Can also run the job in Designer

Can only view the job log in Director

 

 

 

 

 

 

Designer Work Area

Page 62: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 62/299

Designer  Work  Area 

Canvas

Repository

Tools

Palette

 

Designer Toolbar

Page 63: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 63/299

Designer Toolbar  

Provides quick access to the main functions of Designer

Show/hide metadata markers 

Run 

Job properties  Compile 

 

Tools Palette

Page 64: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 64/299

Tools Palette 

 

Adding Stages and Links

Page 65: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 65/299

 Adding Stages and Links 

 Drag stages from the Tools Palette to the diagram

Can also be dragged from Stage Type branch to the diagram

Draw links from source to target stage

Right mouse over source stage

Release mouse button over target stage

 

 

Job Creation Example Sequence

Page 66: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 66/299

Job Creation Example Sequence 

 

 

Brief walkthrough of procedure

 Assumes table definition of source already exists in the repository

 

Create New Job

Page 67: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 67/299

Create New Job 

 

Drag Stages and Links From Palette

Page 68: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 68/299

Drag Stages and Links From Palette 

Peek

Row

Generator

 Annotation

 

Renaming Links and Stages 

Page 69: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 69/299

e a g s a d S ages

 

 Click on a stage or link to rename itMeaningful names have manybenefits

Documentation

Clarity

Fewer development errors

 

RowGenerator Stage 

Page 70: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 70/299

g

 

 

 

 

Produces mock data for specified columns

No inputs link; single output link

On Properties tab, specify number of rows

On Columns tab, load or specify column definitions

Click Edit Row over a column to specify the values to be generated for thatcolumn

 A number of algorithms for generating values are available depending on thedata type

 Algorithms for Integer type

Random: seed, limit

Cycle: Initial value, increment

 Algorithms for string type: Cycle , alphabet

 Algorithms for date type: Random, cycle

 

 

 

 

Inside the Row Generator Stage 

Page 71: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 71/299

g

Propertiestab

Set propertyvalue

Property

 

Columns Tab 

Page 72: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 72/299

View data

Load aTable

definition

Select TableDefinition

 

Extended Properties 

Page 73: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 73/299

Specifiedproperties and

their values

 Additionalproperties to add

 

Peek Stage 

Page 74: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 74/299

 Displays field values 

Displayed in job log or sent to a

Skip records option

file

Can control number of records to be displayed

Shows data in each partition, labeled 0, 1, 2, … 

Useful stub stage for  iterative  job development  

Develop job to a stopping point and check the data

 

Peek Stage Properties 

Page 75: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 75/299

Output to job log

 

Job Parameters 

Page 76: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 76/299

 

 

 

Defined in Job Properties window

Makes the job more flexible

Parameters can be:

Used in directory and file names

Used to specify property values

Used in constraints and derivations

Parameter values are determined at run time

When used for directory and files names and names of properties,surround with pound signs (#)

E.g., #NumRows#

Job parameters can reference DataStage and system environment

variables$PROJDEF

$ENV

 

 

 

 

Defining a Job Parameter  

Page 77: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 77/299

Parameters tab

Parameter

 

Using a Job Parameter  in a Stage 

Page 78: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 78/299

Job parameter surroundedwith pound signs

 

 Adding Job Documentation

Page 79: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 79/299

  Job Properties

Short and long descriptions

Shows in Manager

 Annotation stage

 Added from the Tools Palette

Display formatted text descriptions on diagram

 

 

Job Properties Documentation 

Page 80: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 80/299

Documentation

 

 Annotation Stage Properties 

Page 81: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 81/299

 

Compiling a Job Compile 

Page 82: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 82/299

p

 

Errors or Successful Message 

Page 83: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 83/299

Highlight stage with error   Click for  more info 

 

Page 84: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 84/299

Running Jobs and Viewing the Job 

Log in Designer  

 

Prerequisite to Job Execution 

Page 85: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 85/299

 

DataStage Director  

Page 86: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 86/299

 

 

 

Use to run and schedule jobs

View runtime messages

Can invoke from DataStage Manager or Designer

Tools > Run Director

 

Run Options 

Page 87: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 87/299

Stop after  number  of warnings 

Stop after  number  of rows 

 

Director  Log View Click the open 

Page 88: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 88/299

book icon to view 

log messages 

Peek messages 

 

Message Details 

Page 89: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 89/299

 

Other Director Functions 

Page 90: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 90/299

 

 

 

 

Schedule job to run on a particular date/time

Clear job log of messages

Set job log purging conditions

Set Director options

Row limits

 Abort after x warnings

 

Running Jobs from Command Line 

Page 91: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 91/299

 

 

 

Use dsjob –run

Use dsjob –logsum to display messages in the log

Documented in “Parallel Job Advanced Developer‟s Guide”, ch. 7 

 

Lab Exercises 

Conceptual Lab 03A

Page 92: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 92/299

 

Conceptual Lab 03A

Design a simple job in DesignerDefine a job parameter

Document the job

Compile

Run

Monitor the job in Director

 ® 

Page 93: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 93/299

IBM WebSphere DataStage

Module 04: Accessing Sequential Data

 

Module Objectives 

Understand the stages for accessing different kinds of sequential data

Page 94: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 94/299

 

 

 

 

 

 

 

Understand the stages for accessing different kinds of sequential data

Sequential File stage

Data Set stage

Complex Flat File stage

Create jobs that read from and write to sequential files

Read from multiple files using file patternsUse multiple readers

 

Types of Sequential Data Stages 

Page 95: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 95/299

 

Sequential

Fixed or variable length

Data Set

Complex Flat File

 

 

 

The Framework and Sequential Data 

Th EE F k l d t t

Page 96: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 96/299

 

 

The EE Framework processes only datasets 

For files other than datasets, such as sequential flat files, import andexport operations are done

Import and export OSH operators are generated by Sequential andComplex Flat File stages

During import or export DataStage performs format translations – 

into, or out of, the EE internal format

 

 Internally, the format of data is described by

Like Table Definitions

schemas 

 

Using the Sequential File Stage 

Page 97: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 97/299

Both import and export of general files (text, binary) areperformed by the SequentialFile Stage.

 –  Data import:

 –  Data export EE internal format

EE internal format

 

Features of Sequential File Stage 

Normally executes in sequential mode

Page 98: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 98/299

 

 

 

Normally executes in sequential mode

Executes in parallel when reading multiple files

Can use multiple readers within a node

Reads chunks of a single file in parallel

The stage needs to be told:

How file is divided into rows (record format)

How row is divided into columns (column format)

 

File Format Example 

Page 99: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 99/299

1 1

1 1 

Record delimiter

Final Delimiter = end

Field Delimiter

Final Delimiter = comma

nl , Last field , Field 3 , Field 2 , Field 1 

nl Last field , Field 3 , Field 2 , Field 1 

 

Sequential File Stage Rules 

One input link

Page 100: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 100/299

 

 

 

One input link

One stream output link

Optionally, one reject link

Will reject any records not matching metadata in the column definitions

  Example: You specify three columns separated by commas, but the rowthat‟s read had no commas in it 

 

Job Design Using Sequential Stages 

Page 101: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 101/299

Reject link

 

Sequential Source Columns Tab 

Page 102: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 102/299

View data

Load Table Definition

Save as a newTable Definition

 

Input Sequential Stage Properties 

Output tab

Page 103: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 103/299

File toaccess

Column namesin first row

Click to add more files havingthe same format

 

Format Tab 

Page 104: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 104/299

Record format

Column format

 

Reading Using a File Pattern 

U ild

Page 105: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 105/299

Use wildcards

Select FilePattern

 

Properties - Multiple Readers 

Page 106: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 106/299

Multiple readers option allowsyou to set number of readers

per node

 

Sequential Stage  As a Target 

Input Tab

Page 107: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 107/299

p

 Append /Overwrite

 

Reject Link 

Page 108: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 108/299

 Reject mode =

Continue: Continue reading records

Fail: Abort job

Output: Send down output link

In a source stage

 All records not matching the

metadata (column definitions) arerejected

In a target stage

 All records that fail to be written forany reason

Rejected records consist of onecolumn, datatype = raw

 

 

 

Reject mode property

 

Inside the Copy Stage 

Page 109: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 109/299

Column mappings

 

Page 110: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 110/299

DataSet Stage 

 

Data Set 

Page 111: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 111/299

 

 Operating system (Framework) filePreserves partitioning

Component dataset files are written to on each partition 

Suffixed by .ds

Referred to by a header file

Managed by Data Set Management utility from GUI (Manager, Designer,Director)

Represents persistent data

Key to good performance in set of linked jobs

No import / export conversions are needed 

No repartitioning needed 

 

 

 

 

 

 

Persistent Datasets 

 Accessed using DataSet Stage. 

Page 112: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 112/299

Two parts:Descriptor file:

  contains metadata, data location, but NOT the data itself  

 

input.ds 

Data file(s) record  

(  partno:

 

int32; 

description: 

string; ) 

 

 

contain the data 

multiple Unix files (one per  node), accessible in parallel 

node1:/local/disk1/… 

node2:/local/disk2/… 

 

Data Translation 

 Occurs on import

Page 113: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 113/299

From sequential files or file setsFrom RDBMS

Occurs on export

From datasets to file sets or sequential files

From datasets to RDBMS

DataStage engine is most efficient when processing internallyformatted records (i.e. datasets)

 

 

Page 114: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 114/299

 

File Set Stage 

  Can read or write file sets

Page 115: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 115/299

 

 

Files suffixed by .fs

File set consists of:

Descriptor file – contains location of raw data files + metadata

Individual raw data files

Can be processed in parallel

Similar to a dataset

Main difference is that file sets are not in the internal format andtherefore more accessible to external applications

1. 

2. 

 

 

File Set Stage Example 

Page 116: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 116/299

 

Descriptor file

Lab Exercises 

 Conceptual Lab 04A

Page 117: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 117/299

 

Read and write to a sequential fileCreate reject links

Create a data set

Conceptual Lab 04B

Read multiple files using a file path

Conceptual Lab 04CRead a file using multiple readers

 

 

 

DataStage Data Types 

Standard types  Complex types 

Page 118: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 118/299

 

 

 

 

 

 

 

 

 

Char  

VarChar  

Integer  

Decimal (Numeric) 

Floating point 

Date 

Time 

Timestamp 

VarBinary (raw) 

 

 

Vector  (array, occurs) 

Subrecord (group) 

 

Standard Types 

  Char  

Fixed length string

Page 119: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 119/299

VarChar  

Variable length string

Specify maximum length

Integer  

Decimal (Numeric) 

Precision (length including numbers after the decimal point)

Scale (number of digits after the decimal point)Floating point 

Date 

 

 

 

 

 

Default string format:

Time 

Default string format:

Timestamp Default string format:

VarBinary (raw) 

%yyyy-%mm-%dd

 

%hh:%nn:%ss

 

%yyyy-%mm-%dd %hh:%nn:%ss

 

 

Complex Data Types 

 Vector A one-dimensional array

Page 120: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 120/299

Elements are numbered 0 to nElements can be of any single type

 All elements must have the same type

Can have fixed or variable number of elements

Subrecord A group or structure of elements

Elements of the subrecord can be of any type

Subrecords can be embedded

 

 

Schema With Complex Types 

Page 121: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 121/299

subrecord

vector

 

 

 

Table Definition with complex types

 Authors is a subrecord

Books is a vector of 3 strings of length 5

 

Complex Types Column Definitions 

subrecord

Page 122: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 122/299

Elements of subrecord Vector

 Reading and Writing Complex Data

Page 123: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 123/299

Complex FlatFile target

stage

Complex Flat

File sourcestage

 Importing Cobol Copybooks

 Click Import>TableDefinitions>COBOL File Definitions

Page 124: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 124/299

to begin the import

Each level 01 item begins a TableDefinition

Specify position of level 01 items

 

 

Level 01 startposition

Path tocopybook fileWhere to store theTable Definition

 

Page 125: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 125/299

Reading and Writing NULL Values 

 Working with NULLs

 

Internally, NULL is represented by a special value outside the range ofi i l i i l

Page 126: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 126/299

any existing, legitimate values

If NULL is written to a non-nullable column, the job will abort

Columns can be specified as nullable

NULLs can be written to nullable columns

You must “handle” NULLs written to non-nullable columns in aSequential File stage

You need to tell DataStage what value to write to the file

Unhandled rows are rejected

In a Sequential source stage, you can specify values you want

DataStage to convert to NULLs

 

 

 

 

 Specifying a Value for NULL

Page 127: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 127/299

Nullablecolumn

 Addedproperty

 

Page 128: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 128/299

Managing DataSets 

 Managing DataSets

 

 

GUI (Manager, Designer, Director) – tools > data set management

Dataset management from the system command line

Page 129: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 129/299

Dataset management from the system command lineOrchadmin

 

 

 

Unix command line utility

List records

Remove datasets

 –  Removes all component files, not just the header fileDsrecords

 Lists number of records in a dataset

 

Displaying Data and Schema 

Page 130: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 130/299

Display data

Schema

Page 131: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 131/299

 

Lab Exercises 

 Conceptual Lab 04D

Use the dsrecords utility

Use Data Set Management tool

Page 132: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 132/299

Use Data Set Management tool

Conceptual Lab 04E

Reading and Writing NULLs

 

 ® 

Page 133: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 133/299

IBM WebSphere DataStage

Module 05: Platform Architecture

© 2005 IBM Corporation 

 

Module Objectives 

 

Parallel processing architecture

Pipeline parallelism

Page 134: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 134/299

 

 

 

 

Pipeline parallelismPartition parallelism

Partitioning and collecting

Configuration files

 

Key EE Concepts 

 

Parallel processing:

Executing the job on multiple CPUs

Page 135: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 135/299

Executing the job on multiple CPUs

Scalable processing:

 Add more resources (CPUs and disks) to increase system performance

 

•  Example system: 6 CPUs (processing nodes) and disks 

Scale up by adding more CPUs 

 Add CPUs as individual nodes or  to

an SMP system 

• 

• 

1  2 

3  4 

5  6 

 

Scalable Hardware Environments 

Page 136: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 136/299

 

 

 

SMPMulti-CPU (2-64+)

Shared memory & disk

 

GRID / ClustersMultiple, multi-CPU systems

Dedicated memory per node

Typically SAN-based shared storage

MPPMultiple nodes with dedicated memory,storage

2 – 1000‟s of CPUs 

 

 

Single CPU

Dedicated memory &disk

 

 

 

Pipeline Parallelism 

Page 137: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 137/299

 

 

Transform, clean, load processes execute simultaneouslyLike a conveyor belt moving rows from process to process

Start downstream process while upstream process is running 

 Advantages:

Reduces disk usage for  staging areas 

Keeps processors busy Still has limits on scalability

 

 

Page 138: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 138/299

 

Three-Node Partitioning 

Node   1  

Operation subset1

Page 139: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 139/299

psubset1   

Node   2  

Operation subset2   

Data  subset3   Node   3  

Operation 

 

 

 

Here the data is partitioned into three partitions

The operation is performed on each partition of data separately and in parallel

If the data is evenly distributed, the data will be processed three times faster

EE Combines Partitioning and Pipelining 

Page 140: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 140/299

 

Within EE, pipelining, partitioning, and repartitioning are automatic

Job developer only identifies:

 

 

 

 

Sequential vs. Parallel operations (by stage) 

Method of  data partitioning 

Configuration file (which identifies resources) 

 Advanced stage options (buffer  tuning, operator  combining, etc.) 

 

Job Design v. Execution User assembles the flow using DataStage Designer

Page 141: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 141/299

… at runtime, this job runs in parallel for any configuration

(1 node, 4 nodes, N nodes)

No need to modify or  recompile the  job design! 

 

Configuration File 

 Configuration file separates configuration (hardware / software) from job design

Specified per job at runtime by $APT_CONFIG_FILE

Change hardware and resources without changing job design

Page 142: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 142/299

Defines number of nodes (logical processing units) with their resources (need notmatch physical CPUs)

Dataset, Scratch, Buffer disk (file systems)

Optional resources (Database, SAS, etc.)

 Advanced resource optimizations

  “Pools” (named subsets of nodes)

Multiple configuration files can be used at runtime

Optimizes overall throughput and matches job characteristics to overall hardware resources

 Allows runtime constraints on resource usage on a per job basis

 

 

Page 143: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 143/299

 

Page 144: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 144/299

Partitioning and Collecting 

Page 145: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 145/299

 

Partitioning / Collecting Algorithms  Partitioning algorithms include: 

Round robin

Hash: Determine partition based on key value Requires key specification

Page 146: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 146/299

  Requires key specification

Entire: Send all rows down all partitions

Same: Preserve the same partitioning

 Auto: Let DataStage choose the algorithm

Collecting algorithms include: Round robin

Sort Merge

  Read in by key

  Presumes data is sorted by the key in each partition

  Builds a single sorted stream based on the key

Ordered

  Read all records from first partition, then second, … 

 

 

Keyless V. Keyed Partitioning Algorithms 

 

Keyless: Rows are distributed independently of data values

Round Robin

Page 147: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 147/299

Entire

Same

Keyed: Rows are distributed based on values in the specified key

Hash: Partition based on key

 Example: Key is State.  All “CA” rows go into the same par tition; all “MA”

rows go in the same partition. Two rows of the same state never go intodifferent partitions

Modulus: Partition based on modulus of key divided by the number ofpartitions. Key is a numeric type.

 Example: Key is OrderNumber (numeric type). Rows with the sameorder number will all go into the same partition.

DB2: Matches DB2 EEE partitioning

 

 

Partitioning Requirements for Related Records 

 Misplaced records 

Using Aggregator stage to sum customer sales by customer numberIf there are 25 customers, 25 records should be output

Page 148: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 148/299

If there are 25 customers, 25 records should be output

But suppose records with the same customer numbers are spreadacross partitions

  This will produce more than 25 groups (records)

Solution: Use hash partitioning algorithm

Partition imbalances 

Peek stage shows number of records going down each partition

 

 

Unequal Distribution Example 

 Same key values are assigned tothe same partition

 Hash on LName, with 2-node config file

P

Page 149: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 149/299

Partition

1

ID   LName   FName   Address  

1  Ford  Henry  66 Edison  Avenue 

2  Ford  Clara  66 Edison  Avenue 

3  Ford  Edsel  7900 Jefferson 

4  Ford  Eleanor   7900 Jefferson 

7  Ford  Henry  4901 Evergreen 

8  Ford  Clara  4901 Evergreen 

9  Ford  Edsel  1100 Lakeshore 

10  Ford  Eleanor   1100 Lakeshore 

SourceData

ID   LName   FName   Address  

1  Ford  Henry  66 Edison  Avenue 

2  Ford  Clara  66 Edison  Avenue 

3  Ford  Edsel  7900 Jefferson 

4  Ford  Eleanor   7900 Jefferson 

5  Dodge  Horace  17840 Jefferson 

6  Dodge  John  75 Boston Boulevard 

7  Ford  Henry  4901 Evergreen 

8  Ford  Clara  4901 Evergreen 

9  Ford  Edsel  1100 Lakeshore 

10  Ford  Eleanor   1100 Lakeshore 

Part0

ID   LName   FName   Address  

5  Dodge  Horace  17840 Jefferson 

6  Dodge  John  75 Boston Boulevard 

 

Partitioning / Collecting Link Icons 

Partitioning icon

Page 150: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 150/299

Collecting icon

 

More Partitioning Icons 

“fan-out” Sequential to Parallel

Page 151: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 151/299

SAME  partitioner

Re-partitionwatch for this!  

 AUTO partitioner

 

Partitioning Tab 

Key specification

Page 152: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 152/299

 Algorithms

Collecting Specification 

Key specification

Page 153: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 153/299

 

 Algorithms

 

Quiz True or False? 

Everything that has been data-partitioned must becollected in same job

Page 154: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 154/299

 

Data Set Stage 

Page 155: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 155/299

Is the data partitioned? 

 

Page 156: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 156/299

Introduction to the Solution Exercises 

Development 

 

Solution Development Jobs 

 

 

Series of 4 jobs extracted from production jobs

Use a variety of stages in interesting, realistic configurations

Sort, Aggregator stages

Join lookup stage

Page 157: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 157/299

Join, lookup stage

Peek, Filter stages

Modify stage

Oracle stage

Contain useful techniques

Use of Peeks

Datasets used to “connect” jobs

Use of project environment variables in job parameters

Fork Joins

Lookups for auditing

 

Page 158: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 158/299

Glimpse Into the Sort Stage 

 Algorithms

Page 159: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 159/299

 

Sort key to add

 

Copy Stage With Multiple Output Links 

Select output link

Page 160: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 160/299

 

Filter  Stage 

 

 

 

Used with Peek stage to select a portion of data for checking

On Properties tab, specify a Where clause to filter the data

On Mapping tab, map input columns to output columns

Page 161: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 161/299

 

Setting the Filtering Condition 

Filteringcondition

Page 162: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 162/299

Page 163: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 163/299

 

Warehouse Job 03 

Page 164: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 164/299

 

Warehouse Job 04 

Page 165: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 165/299

 

Warehouse Job 02 With Lookup 

Page 166: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 166/299

 Lab Exercises 

 Conceptual Lab 05A

Experiment with partitioning / collecting

Solution Lab 05B (Build Warehouse_01 Job) Add environment variables as job parameters

Read multiple sequential files

 

Page 167: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 167/299

Read multiple sequential files

Use the Sort stage

Use Filter and Peek stages

Write to a DataSet stage

 

® 

IBM WebSphere DataStage

Page 168: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 168/299

Module 06: Combining Data

 Module Objectives

 

 

 

Combine data using the Lookup stage

Combine data using Merge stage

Combine data using the Join stage

Combine data using the Funnel stage

Page 169: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 169/299

 Combine data using the Funnel stage

 

Ways to combine data:

 Horizontally:

Multiple input links

One output link made of columns from different input links.

J i

Combining Data 

Page 170: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 170/299

Joins

Lookup

Merge

Vertically:

One input link, one output link combining groups of related records into asingle record

 Aggregator

Remove Duplicates

Funneling: Multiple input streams funneled into a single output stream

Funnel stage

 

 

 

Lookup, Merge, Join Stages 

 These stages combine two or more input links

Data is combined by designated "key" column(s)

These stages differ mainly in:Memory usage

T t t f ith t h d k l

 

Page 171: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 171/299

Treatment of rows with unmatched key values

Input requirements (sorted, de-duplicated)

 

Not all Links are Created Equal 

•  DataStage distinguishes between:

- The Primary input:  (Framework port 0) - Secondary inputs: ports)

in some cases "Reference"  (other  Framework 

Page 172: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 172/299

ports) 

Conventions:• 

•  Tip: Check “Link Ordering" tab to make sure intendedPrimary is listed first

Joins  Lookup  Merge 

Primary Input: port 0 

Secondary Input(s): ports 1,… 

Left  Source  Master  

Right  Lookup table(s)  Update(s) 

Page 173: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 173/299

 

Lookup Features 

 

 

 

One Stream Input link (Source)

Multiple Reference links (Lookup files)One output link

Optional Reject link

Page 174: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 174/299

 Optional Reject link

Only one per Lookup stage, regardless of number of reference links

Lookup Failure options

Continue, Drop, Fail, Reject

Can return multiple matching rows

Hash tables are built in memory from the lookup files

Indexed by key

Should be small enough to fit into physical memory

 

 

 

 

The Lookup Stage Uses one or more key columns as an index into a table

Usually contains other  values associated with each key.

The lookup table is created in memory before any lookup source rows are processed

 

 

Lookup table 

Associated ValueInde

Page 175: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 175/299

 Associated Value Index 

[…] 

SC SD TN TXUT 

VT[…] 

Key  column of source state_code “TN” 

South Carolina South Dakota Tennessee Texas Utah 

Vermont 

 

Lookup from Sequential File Example 

Reference link

Driver (Source)link

(lookup table)

Page 176: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 176/299

 

Lookup Key Column in Sequential File 

Lookup key

Page 177: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 177/299

Page 178: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 178/299

 

Handling Lookup Failures 

Page 179: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 179/299

Select action

 

Lookup Failure Actions 

•  If the lookup fails to find a matching key column, one of these actions can be taken: 

 –  fail: the lookup Stage reports an error and the job fails immediately.This is the default.

drop: the input row with the failed lookup(s) is dropped – 

Page 180: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 180/299

continue: the input row is transferred to the output, together with the successful tableentries. The failed table entry(s) are not transferred, resulting in either default outputvalues or null output values.

reject: the input row with the failed lookup(s) is transferred to a second output link, the"reject" link.

 – 

 – 

•  There is no option to capture unused table entries  –  Compare with the Join and Merge stages

 

Lookup Stage Behavior  

We shall first use a simplest case, optimal input: 

• 

• • 

Two input links:  “Source" as primary, “Look up" as secondary 

sorted on key column (here "Citizen"), without duplicates on key 

Page 181: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 181/299

Source link (primary input) Lookup link (secondary input)

Citizen Exchange 

M_B_Dextrous Nasdaq

Righty NYSE

Revolution Citizen 

1789 Lefty

1776 M_B_Dextrous

 

Lookup Stage 

Output of Lookup with continue option on key Citizen 

Revolution Citizen Exchange1789 Lefty

1776 M_B_Dextrous Nasdaq

Page 182: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 182/299

Same outpu t as outer join and merge/keep 

Empty string

or NULLOutput of Lookup with drop option on key Citizen 

Same output as inner join and merge/drop 

Revolution  Citizen  Exchange 

1776  M_B_Dextrous  Nasdaq 

 

The Lookup Stage 

 

 

Lookup Tables should be small enough to fit into physical memory

On a MPP you should partition the lookup tables using entire partitioning method

or partition them by the same hash key as the source linkEntire results in multiple copies (one for  each partition) 

On a SMP, choose entire or accept the default (which is entire) 

Page 183: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 183/299

Entire does not result in multiple copies because memory is shared 

 

Join Stage

Page 184: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 184/299

Join Stage 

 

The Join Stage 

•  Four types:

• 

• 

• 

• 

Inner

Left outer

Right outer

Full outer

2 or more sorted input links, 1 output link

Page 185: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 185/299

 2 or more sorted input links, 1 output link

"left" on primary input, "right" on secondary input 

Pre-sort make joins "lightweight": few rows need to be in RAM

Follow the RDBMS-style relational modelCross-products in case of duplicates

Matching entries are reusable for multiple matches

Non-matching entries can be captured (Left, Right, Full)

No fail/reject option for missed matches

 

 

Page 186: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 186/299

 

Join Stage Behavior  

We shall first use a simplest case, optimal input: 

• 

• • 

two input links: "left" as primary, "right" as secondary 

sorted on key column (here without duplicates on key 

"Citizen"), 

Page 187: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 187/299

Left link (primary input) Right link (secondary input)

Citizen Exchange 

M_B_Dextrous Nasdaq

Righty NYSE

Revolution Citizen 

1789 Lefty

1776 M_B_Dextrous

 

Inner Join 

 Transfers rows from both data sets whose key columnscontain equal values to the output

Treats both inputs symmetrically

link

 

Output of inner join on key Citizen

Page 188: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 188/299

Output of inner   join on key Citizen 

Same outpu t as lookup/reject and merge/drop 

Revolution  Citizen  Exchange 

1776  M_B_Dextrous  Nasdaq 

 

Left 

 

Outer Join 

Transfers all  values from the left link and transfers values from the right linkonly where key columns match. 

Revolution Citizen Exchange

1789 Lefty

1776 M B Dextrous Nasdaq

Page 189: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 189/299

Same outpu t  as lookup/continue and merge/keep 

1776 M_B_Dextrous Nasdaq

 

Left Outer Join  Check Link Ordering Tab intended Primary to make sure  is listed first

Page 190: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 190/299

 

Right Outer Join 

 Transfers all  values from the right link and transfers values from the left link onlywhere key columns match.

Revolution  Citiz en  Ex c hange 

1776 M B Dextrous Nasdaq

Page 191: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 191/299

1776  M_B  _Dex trous  Nas daq 

Null or  0  Righty  NYSE 

 

Full Outer Join 

 Transfers rows fr om both data sets, whose key columns contain equal values, tothe output link. 

It also transfers rows, whose key columns contain unequal values, from both inputlinks to the output link. 

Treats both input symmetrically. 

Creates new columns, with new column names! 

 

 

 

Page 192: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 192/299

Revolution leftRec_Citizen rightRec_Citizen Exchange1789 Lefty

1776 M_B_Dextrous M_B_Dextrous Nasdaq

0 Righty NYSE

 

Merge Stage 

Page 193: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 193/299

g g

 

Merge Stage Job 

Page 194: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 194/299

The Merge Stage 

 

 

 

 

 Allows composite keys

Multiple update links

Matched update rows are consumed

Unmatched updates ininput port n can be captured in outputport n

One or moreupdates

Master

Page 195: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 195/299

 

port n 

Lightweight: 

Merge 

RejectsOutput

0

1 2 

0 2 

 

Merge Stage Editor  

Page 196: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 196/299

Unmatched Master rows

One of two options:

Unmatched Update rows option:

 – Capture in reject link(s).Implemented by addingoutgoing links

 –  – 

Keep [default]Drop

(Capture in reject link is NOT an option)

 

Comparison:  Joins, Lookup, Merge 

Jo ins  Lookup  Merge Model 

M emory us age 

#  and nam es of  Inputs 

M andatory  Input S ort 

RDBMS-s ty le  relational  S ourc e - in RAM LU Table  Mas ter  -Update(s ) 

light  heavy  light 

2 or  more:  left, right  1 S ourc e, N   LU Tables  1 Mas ter, N   Update(s ) 

all  inputs  no  all  inputs 

Page 197: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 197/299

Duplic ates  in prim ary  input 

Duplic ates  in s ec ondary  input(s ) 

Options on unmatc hed prim ary 

Options on unmatc hed s ec ondary 

On m atc h, s ec ondary entries are 

#  Outputs 

Captured in rejec t s et(s ) 

OK  (x -produc t)  OK  W arning! 

OK  (x -produc t)  W arning!  OK only when N   = 1 

K eep (left outer), Drop (Inner)  [fail]  | c ontinue | drop | rejec t  [k eep]  | drop 

K eep (right outer), Drop (Inner)  NONE   c apture in rejec t s et(s ) 

c aptured  c aptured  c ons um ed 

1  1 out, (1 rejec t)  1 out, (N   rejec ts ) 

Nothing  (N/A)  unmatc hed primary entries  unm atc hed s ec ondary entries 

 

Funnel Stage 

Page 198: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 198/299

 

What is a Funnel Stage? 

  A processing stage that combines data from multiple input links to a

single output link

Useful to combine data from several identical data sources into a singlelarge dataset

Operates in three modes

 

 

Page 199: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 199/299

Continuous

SortFunnel

Sequence

 

Three Funnel modes  Continuous: 

Combines the records of the input link in no guaranteed order.

It takes one record from each input link in turn. If data is not available on an input link,

the stage skips to the next link rather than waiting.Does not attempt to impose any order on the data it is processing.

Sort Funnel: Combines the input records in the order defined by the value(s) of one ormore key columns and the order of the output records is determined by these sorting

 

Page 200: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 200/299

keys.

Sequence: Copies all records from the first input link to the output link, then all therecords from the second input link and so on.

 

 

Sort Funnel Method 

 

 

 

Produces a sorted output (assuming input links are all sorted on key)

Data from all input links must be sorted on the same key column

Typically data from all input links are hash partitioned before they are sortedSelecting “Auto” partition type under  Input  Par titioning tab defaults to this 

Hash partitioning guarantees that all the records with same key columnvalues are located in the same partition and are processed on the samenode

Page 201: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 201/299

node. 

 Allows for multiple key columns

1  primary  key column, n secondary  key columns 

Funnel stage first examines the primary key in each input record. 

For  records with multiple records with same  primary  key value, it will thenexamine secondary  keys to determine the order  of records it will output 

 

 

Funnel Stage Example 

Page 202: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 202/299

 

Funnel Stage Properties 

Page 203: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 203/299

 

Lab Exercises 

 Conceptual Lab 06A

Use a Lookup stage

Handle lookup failures

Use a Merge stage

Use a Join stage

Use a Funnel stage

Page 204: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 204/299

Solution Lab 06B (Build Warehouse_02 Job)

Use a Join stage

 

 ® 

IBM WebSphere DataStage

Module 07: Sorting and Aggregating Data

Page 205: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 205/299

 

Module Objectives 

 

 

 

Sort data using in-stage sorts and Sort stage

Combine data using Aggregator stage

Combine data Remove Duplicates stage

Page 206: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 206/299

 

Sort Stage 

Page 207: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 207/299

 

Sorting Data 

 

Uses

Some stages require sorted input

 Join, merge stages require sorted input

Some stages use less memory with sorted input

 E.g., Aggregator

Sorts can be done: 

Page 208: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 208/299

Within stages

 On input link Partitioning tab, set partitioning to anything other than AutoIn a separate Sort stage

 Makes sort more visible on diagram

 Has more options

 

Sorting  Alternatives 

S t t S t ithi

Page 209: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 209/299

Sort stage Sort within

stage

 

In-Stage Sorting Partitioning

ort

Preserve-key

ordering

tab Do s 

non

Page 210: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 210/299

Removedups

Can‟t when sorting

be Auto Sort key

 

Sort Stage 

Sort key

Page 211: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 211/299

Sort options

Page 212: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 212/299

 

Sort Options 

 Sort Utility

DataStage – the default

Unix: Don‟t use. Slower than DataStage sort utility

Stable

 Allow duplicates

Memory usage

Sorting takes advantage of the available memory for increased performance

 

 

 

Page 213: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 213/299

So g a es ad a age o e a a ab e e o y o c eased pe o a ce

 Uses disk if necessary

Increasing amount of memory can improve performance

Create key change column

 Add a column with a value of 1 / 0

1 indicates that the key value has changed

0 mean that the key value hasn‟t changed 

Useful for processing groups of rows in a Transformer

 

 

Sort Stage Mapping Tab 

Page 214: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 214/299

 

Partitioning V. Sorting Keys 

 

Partitioning keys are often different than Sorting keys

Keyed partitioning (e.g., Hash) is used to group related records into the

same partition

Sort keys are used to establish order within each partition

For example, partition on HouseHoldID, sort on HouseHoldID,PaymentDate

 

Page 215: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 215/299

Important when removing duplicates. Sorting within each partition is uses to

establish order for duplicate retention (first or last in the group)

 

 Aggregator  Stage 

Page 216: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 216/299

 

 Aggregator Stage 

Purpose: Perform data aggregations 

Specify: 

 

Zero or  more key columns that define groups) 

Columns to be aggregated 

the aggregation units (or  

 

Aggregation functions, include among many others:

Page 217: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 217/299

  Aggregation functions, include among many others: 

count (nulls/non-nulls)Sum

Max / Min / Range

 The grouping method 

issue 

(hash table or   pre-sort ) is a performance 

 

Job with Aggregator  Stage 

 Aggregator stage

Page 218: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 218/299

 

 Aggregator  Stage Properties 

Group columns

Page 219: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 219/299

Group method

 Aggregationfunctions

 

 Aggregator Functions 

 

 Aggregation type = Count rows

Count rows in each group

Put result in a specified output column

 Aggregation type = Calculation

Select column

Put result of calculation in a specified output column

Calculations include:

 

Page 220: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 220/299

 

 

 

 

 

 

 

SumCount

Min, max

Mean

Missing value count

Non-missing value countPercent coefficient of variation

 

Grouping Methods  Hash (default) 

Intermediate results for each group are stored in a hash table

Final results are written out after all input has been processed

No sort required

Use when number of unique groups is small

 Running tally for each gr oup‟s aggregate calculations needs to fit into

memory. Requires about 1K RAM / group

Page 221: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 221/299

E.g. average family income by state requires .05MB of RAM

Sort 

Only a single aggregation group is kept in memory

  When a new group is seen, the current group is written out

Requires input to be sorted by grouping keys

Can handle unlimited numbers of groups

Example: average daily balance by credit card

 

 

 Aggregation Types 

Page 222: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 222/299

Calculation types

 

Remove Duplicates Stage 

Page 223: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 223/299

 

Removing Duplicates 

 

Can be done by Sort stage

Use unique option

 

 

 

No choice on which to keep

Stable sort always retains the first row in the group

Non-stable sort is indeterminate

OR 

Page 224: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 224/299

 

Remove Duplicates stage

Has more sophisticated ways to remove duplicates

 Can choose to retain first or last

 

Remove Duplicates Stage Job 

Page 225: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 225/299

Remove Duplicates

stage

Remove Duplicates Stage Properties 

Key that definesduplicates

Page 226: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 226/299

 Retain first or last

duplicate

 

Lab Exercises 

 Solution Development Lab 07A 

Use Sort stage

Use Aggregator stage

Use RemoveDuplicates stage

(Build Warehouse_03  job) 

Page 227: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 227/299

 

® 

IBM WebSphere DataStage

Module 08: Transforming Data

Page 228: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 228/299

 

Module Objectives 

 

 

Understand ways DataStage allows you to transform data

Use this understanding to:

Create column derivations using user-defined code and system functions

Filter records based on business criteria

Control data flow based on data conditions

Page 229: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 229/299

 

Transformed Data 

 Derivations may include incoming fields or  parts of incoming fields 

Derivations may reference system variables and constants  

 

Frequently uses 

Date and time

Mathematical

Logical

functions performed on  incoming values 

Page 230: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 230/299

Null handlingMore

 

Stages Review 

 Stages that can transform data 

Transformer

Modify

 Aggregator

Stages that do not transform data 

File stages: Sequential, Dataset, Peek, etc.

Sort

 

Page 231: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 231/299

Remove Duplicates

Copy

Filter

Funnel

 

Transformer Stage 

 

 

Column mappings

Derivations

Written in Basic

Final compiled code is C++ generated object code

Constraints

Filter data

Direct data down different output links

 For different processing or storage

 

Page 232: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 232/299

Expressions for constraints and derivations can referenceInput columns

Job parameters

Functions

System variables and constants

Stage variablesExternal routines

 

 

Transformer Stage Uses  Transformer withmultiple outputs

 

Control data flow

Constrain data

Direct data

Derivations 

Page 233: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 233/299

 

Inside the Transformer  Input columns

Stage  Stage variables

Output columns

Constraints

Output

Page 234: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 234/299

Derivations / Mappings

Input / Output column defs

 

Defining a Constraint 

Page 235: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 235/299

Input column

Job parameter

Page 236: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 236/299

 

IF THEN ELSE Derivation 

 

 

Use IF THEN ELSE to conditionally derive a value

Format:

IF <condition> THEN <expression1> ELSE <expression1>

If the condition evaluates to true then the result of expression1 will be copiedto the target column or stage variable

If the condition evaluates to false then the result of expression2 will becopied to the target column or stage variable

Page 237: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 237/299

Example:Suppose the source column is named In.OrderID and the target column isnamed Out.OrderID

Replace In.OrderID values of 3000 by 4000

IF In.OrderID = 3000 THEN 4000 ELSE Out.OrderID

 

 

String Functions and Operators 

 

Substring operator

Format: “String” [loc, length]

Example:

 Suppose In.Description contains the string “Orange Juice” 

 InDescription[8,5] “Juice” 

UpCase(<string>) / DownCase(<string>)

Example: UpCase(In.Description) “ORANGE JUICE” 

 

Page 238: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 238/299

Len(<string>)

Example: Len(In.Description) 12

 

 

Checking for NULLs 

 Nulls can be introduced into the data flow fromlookups

Mismatches (lookup f ailures) can produce nulls 

Can be handled in constraints, derivations,stage variables, or a combination of these

NULL functions

Testing for  NULL 

 

 

  IsNull(<column>)

Page 239: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 239/299

 

IsNotNull(<column>)

Replace NULL with a value 

  NullToValue(<column>, <value>)

Set to NULL:  SetNull() 

  Example: IF In.Col = 5 THEN SetNull()

ELSE In.Col

 

Transformer Functions 

 

 

 

 

 

 

Date & Time

Logical

Null Handling

Number

String

Type Conversion

Page 240: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 240/299

 

Transformer Execution Order  

 

 

 

 

Derivations in stage variables

Constraints are executed before derivations

Column derivations in earlier links are executed before later links

Derivations in higher columns are executed before lower columns

Page 241: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 241/299

 

Transformer Stage Variables 

 Derivations execute in order  from top to bottom 

Later stage variables can reference earlier stage variables

Earlier stage variables can reference later stage variables

  These variables will contain a value derived from the previous rothat came into the Transformer

Multi-purpose 

Counters

 

Page 242: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 242/299

Store values from previous rows to make comparisonsStore derived values to be used in multiple target field derivations

Can be used to control execution of constraints

 

Stage Variables Toggle 

Page 243: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 243/299

Show/Hide button

Page 244: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 244/299

 

Otherwise Link 

Otherwise link

Page 245: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 245/299

 

Defining an Otherwise Link 

Page 246: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 246/299

Check to createotherwise link Can specify abort

condition

 

Specifying Link Ordering 

Link ordering toolbar icon

Page 247: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 247/299

Last inorder

 

Transformer Stage Tips 

 Suggestions - 

Include reject links

Test for NULL values before using a column in a function

Use RCP (Runtime Column Propogation)

  Map columns that have derivations (not just copies).

  More on RCP later.

Be aware of column and stage variable data types.

Oft d l d t tt ti t t i bl t

Page 248: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 248/299

  Often developers do not pay attention to stage variable types

 Avoid type conversions.

  Try to maintain the data type as imported.

 

Modify Stage 

Page 249: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 249/299

Modify Stage  

 

Modify column types

Perform some types of derivations

Null handling

Date / time handling

String handling

 Add or drop columns 

Page 250: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 250/299

 

 

Job With Modify Stage 

Page 251: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 251/299

Modify stage

Page 252: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 252/299

 

Lab Exercises  Conceptual Lab 08A

 Add a Transformer to a job

Define a constraint

Work with null valuesDefine a rejects link

Define a stage variable

Define a derivation

Page 253: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 253/299

 

IBM WebSphere DataStage

Module 09: Standards and Techniques

Page 254: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 254/299

Page 255: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 255/299

 

Job Presentation 

Page 256: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 256/299

Document using theDocument using

annotation stage

 

Job Properties Documentation 

Organize jobs intocategories

Description is displayed inManager and MetaStage

Page 257: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 257/299

 

Naming Conventions  Stages named after the

Data they access

Function they perform

DO NOT leave default stage names like Sequential_File_0One possible convention:

 Use 2-character prefixes to indicate stage type, e.g.,

 – 

 – 

 – 

“SF_” for Sequential File stage

“DS_” for Dataset stage

“CP_” for Copy stage

Page 258: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 258/299

_ py g Links named for the data they carry

DO NOT leave default link names like DSLink3

One possible convention:

 Prefix all link names with “lnk_” 

 Name links after the data flowing through them

Stage and Link Names 

Page 259: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 259/299

 

handle

Name stages andlinks for the data they

 

Iterative Job Design  

 

Use Copy and Peek stages as stubs 

Test job in phases 

Small sections first, then increasing in complexity

Use Peek stage to examine records 

Check data at various locations

Check before and after processing stages

 

Page 260: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 260/299

 

Copy Stage Stub Example 

Copy stage

Page 261: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 261/299

Page 262: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 262/299

 

Developing Jobs 1. Keep it simple

a)  Jobs with many stages are hard to debug and maintain

Start small and build to final solution

Use view data, copy, and peekStart from source and work out

Develop with a 1 node configuration file

Solve the business problem before the performance problem

Don‟t worry too much about partitioning until the sequential flow worksas expected

If you land data in order to break complex jobs into smaller sets of

2. 

a) b) 

c) 

3. 

a) 

4.

Page 263: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 263/299

If you land data in order to break complex jobs into smaller sets of jobs for purposes of restartability or maintainability, use persistentdatasets

Retains partitioning and internal data types

This is tr ue only as long as you don‟t need to read the data outside of

DataStage

4. 

a) 

b) 

 

Final Result 

Page 264: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 264/299

 

Good Things to Have in each Job  

 

Job parameters 

Useful environmental variables to add 

$APT_DUMP_SCORE

  Report OSH to message log

$APT_CONFIG_FILE

to job parameters 

 

 

Establishes runtime parameters to EE engine

Establishes degree of parallelization

Page 265: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 265/299

Page 266: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 266/299

 

DUMP SCORE Output 

Setting  APT_DUMP_SCORE yields: 

Double-click  Partitioner   And 

Collector  

Mapping Node > partition

Page 267: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 267/299

Node--> partition 

 

Use Multiple Configuration Files  

 

 

Make a set for 1X, 2X,…. 

Use different ones for test versus production

Include as a parameter in each job

Page 268: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 268/299

 

Containers  Two varieties 

Local

Shared

Local 

Simplifies a large, complex diagram

Shared 

Creates reusable object that many jobs within the project can

include

 

 

Page 269: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 269/299

 

Reusable Job Components 

 

Use Shared Containers for repeatedly used components

Page 270: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 270/299

Container

Page 271: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 271/299

 

Lab Exercises  

Conceptual Lab 07A

 Apply best practices when naming links and stages

Page 272: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 272/299

 

IBM WebSphere DataStageModule 10: Accessing Relational Data

Page 273: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 273/299

 

Module Objectives  Understand how DataStage  jobs RDBMS tables 

Import relational table definitions 

read and write records to a 

 

 

 

Read from and write to database tables 

Use database tables to lookup data 

Page 274: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 274/299

Page 275: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 275/299

 

Supported Database Access 

Enterprise Edition provides high performance / scalable interfaces for:

 

 

 

 

 

 

DB2 / UDB

Informix

Oracle

Teradata

SQL Server

ODBC

Page 276: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 276/299

 

Importing Table Definitions  Can import using ODBC or using Orchestrate schema definitions

Orchestrate schema imports are better because the data types are moreaccurate

Import>Table Definitions>Orchestrate Schema Definitions

Import>Table Definitions>ODBC Table Definitions

 

 

Page 277: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 277/299

 

Orchestrate Schema Import 

Page 278: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 278/299

 

ODBC Import 

Select ODBC datasource name

Page 279: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 279/299

 

RDBMS Access  

 Automatically convert RDBMS table layouts to/from DataStage TableDefinitions

RDBMS NULLs converted to/from DataStage NULLs

Support for standard SQL syntax for specifying:SELECT clause list WHERE clause filter  condition INSERT / UPDATE 

Supports user-defined queries

 

 

 

Page 280: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 280/299

 

Native Parallel RDBMS Stages 

 

 

 

 

 

 

DB2/UDB Enterprise 

Informix Enterprise 

Oracle Enterprise 

Teradata Enterprise 

ODBC Enterprise 

SQL Server  Enterprise 

Page 281: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 281/299

 

RDBMS Usage  

 As a source

Extract data from table (stream link)

 –  Read methods include: Table, Generated SQL SELECT, or User-defined SQL

 –  User-defined can perform joins, access views

Lookup (reference link)

 – 

 – 

 – 

Normal lookup is memory-based (all table data read into memory)

Can perform one lookup at a time in DBMS (sparse option)

Continue/drop/fail options

 

 As a target

I t

Page 282: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 282/299

Inserts

Upserts (Inserts and updates)

Loader

DB2 Enterprise Stage Source 

 Auto-generatedSELECT

Connectioni f ti

Page 283: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 283/299

 

information

Job example

 

Sourcing with User-Defined SQL User-definedread method

Page 284: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 284/299

Columns in SQL mustmatch definitions on

Columns tab

 

DBMS Source  – Lookup 

Reference

link

Page 285: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 285/299

Page 286: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 286/299

Page 287: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 287/299

 

DB2 Stage Target Properties  SQL INSERT

Drop table and

create

Database specifiedby job parameter

Page 288: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 288/299

Optional CLOSE command

 

DB2 Target Stage Upsert  SQL INSERT

SQL UPDATE

Upsert method

Page 289: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 289/299

Upsert method

 ################################# 

Generated OSH for first 2 stages 

Generated OSH Primer   Comment blocks introduce each operator

Operator order is determined by the order stageswere added to the canvas

OSH uses the familiar syntax of the UNIX shell

Operator name

SchemaOperator options ( “-name value” format) 

Input (indicated by n< where n is the input #)

Output (indicated by n> where n is the output #)

 may include modify 

For every operator, input and/or output datasets arenumbered sequentially starting from 0. E.g.:

op1 0> dstop1 1< src

Virtual datasets are generated to connect operators

 

 ################### 

## Operator  

## Operator  options 

#################################################### #### STAGE: Row_Generator_0 ## Operatorgenerator  ## Operator  options -schema record ( a:int32; 

b:string[max=12];c:nullable decimal[10,2] {nulls=10}; ) -records 50000 

## General options [ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]## Outputs 0> [] 'Row_Generator_0:lnk_gen.v' ;

Virtual  dataset  is #### STAGE: SortSt used  to connect  

tsort  output  of  one -key 'a' operator  to input  of  -asc  another  

Page 290: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 290/299

Virtual datasets are generated to connect operators ## General options [ident('SortSt'); jobmon_ident('SortSt'); par] ## Inputs 0< 'Row_Generator_0:lnk_gen.v'## Outputs 0> [modify ( keepa,b,c; )] 'SortSt:lnk_sorted.v' ;

 

Framework v. DataStage Terminology 

Framework  schema

propertytype

virtual dataset

Record / field

operator

step, flow, OSH command

Framework

DataStage table definition

formatSQL type and length

link

row / column

stage

 job

DS Parallel Engine

Page 291: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 291/299

Framework DS Parallel Engine

• GUI uses both terminologies

• Log messages (info, warnings, errors) use Framework terminology

 

Elements of a Framework Program 

• 

• 

• 

Operators

Virtual datasets: set of rows processed

Schema:

by Framework

•  data description (metadata) for datasets and links

Page 292: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 292/299

 

Enterprise Edition Runtime  Architecture 

Page 293: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 293/299

 

Enterprise Edition Job Startup  Generated OSH and configuration file are used to “compose” a job “Score”  

Think of “Score” as in musical score, not game score

Similar to the way an RDBMS builds a query optimization plan

Identifies degree of parallelism and node assignments for  each operator  

Inserts sorts and partitioners as needed to ensure correct results 

Defines connection topology (virtual datasets) between adjacent operators 

Inserts buffer  operators to prevent deadlocks 

  E.g., in fork-joins 

Defines number  of actual OS processes 

 Where possible, multiple operators are combined within a single OS process to improve performance and optimize resource requirements

Job Score is used to fork processes with communication interconnects fordata message and control

 

Page 294: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 294/299

data, message, and controlSet $APT_STARTUP_STATUS to show each step of job startup 

Set $APT_PM_SHOW_PIDS to show process IDs in DataStage log 

 

Enterprise Edition Runtime  It is only after the job Score and processes are created that

processing begins

“Startup overhead” of an EE job 

 Job processing ends when either:

Last row of data is processed by final operator  

 A fatal error  is encountered by any operator  

Job is halted (SIGINT) by DataStage Job Control or  human intervention (e.g. DataStage Director  STOP)

Page 295: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 295/299

 

Viewing the Job Score • 

• 

Set $APT_DUMP_SCORE to output the Score to the job log

For each job run, 2 separate Score dumps are written

•  First score is for the license operator

•  Second score entry is the real job score

To identify the Score dump, look for “main program: This step …” 

You don‟t see anywhere the word „Score‟ 

 

License operator job score 

Page 296: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 296/299

Job score 

Page 297: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 297/299

Page 298: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 298/299

Thank You

Page 299: DataStage-PX ClassBook V0.1ds

8/11/2019 DataStage-PX ClassBook V0.1ds

http://slidepdf.com/reader/full/datastage-px-classbook-v01ds 299/299

 

Thank You