
  • Parallel Framework Standard Practices

    Investigate, Design, Develop: Data Flow Job Development

    Prepared by IBM Information Integration Solutions Center of Excellence July 17, 2006

    CONFIDENTIAL, PROPRIETARY, AND TRADE SECRET NATURE OF ATTACHED DOCUMENTS

    This document is Confidential, Proprietary and Trade Secret Information (Confidential Information) of IBM, Inc. and is provided solely for the purpose of evaluating IBM products with the understanding that such Confidential Information will be disclosed only to those who have a need to know. The attached documents constitute Confidential Information as they include information relating to the business and/or products of IBM (including, without limitation, trade secrets, technical, business, and financial information) and are trade secret under the laws of the State of Massachusetts and the United States.

    Copyrights

© 2006 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language in any form by any means without the written permission of IBM. While every precaution has been taken in the preparation of this document to reflect current information, IBM assumes no responsibility for errors or omissions or for damages resulting from the use of information contained herein.


• Document Goals

    Intended Use: This document presents a set of standard practices, methodologies, and examples for IBM WebSphere DataStage Enterprise Edition (DS/EE) on UNIX, Windows, and USS. Except where noted, this document is intended to supplement, not replace, the installation documentation.

Target Audience: The primary audience for this document is DataStage developers who have been trained in Enterprise Edition. Information in certain sections may also be relevant for Technical Architects, System Administrators, and Developers.

Product Version: This document is intended for the following product releases:
    - WebSphere DataStage Enterprise Edition 7.5.1 (UNIX, USS)
    - WebSphere DataStage Enterprise Edition 7.5x2 (Windows)

Document Revision History

    April 16, 2004, Rev. 1.0: Initial Services release.

    June 30, 2005, Rev. 2.0: First version based on separation of EE BP into four separate documents; merged new material on Remote DB2 and configuring DS for multiple users.

    December 9, 2005, Rev. 3.0: Significant updates, additional material.

    January 31, 2006, Rev. 3.1: Updates based on review feedback. Added patch install checklist item (7.10) and Windows 7.5x2 patch list.

    February 17, 2006, Rev. 4.0: Significant updates; new material on ETL overview, data types, naming standards, USS, design standards, database stage usage, database data type mappings; updated styles and use of cross-references.

    March 10, 2006, Rev. 4.1: Corrected missing Figure 9.

    March 31, 2006, Rev. 4.2: Added new material on establishing job boundaries, balancing job resource requirements / startup time with required data volume and processing windows, and minimizing the number of runtime processes. Moved Baselining Performance discussion to Performance Tuning BP. Expanded performance tuning section.

    May 08, 2006, Rev. 4.3: Removed Architecture Overview (now a separate document). Expanded file stage recommendations.

    July 17, 2006, Rev. 5.0: Updated directory naming standards for consistency with DS/EE Automation Standards and Toolkit. Segmented content into Red Book and Standards. Clarified terminology (Best Practices). Incorporated additional field feedback.

Document Conventions

    This document uses the following conventions:

    Bold: In syntax, bold indicates commands, function names, keywords, and options that must be input exactly as shown. In text, bold indicates keys to press, function names, and menu selections.

    Italic: In syntax, italic indicates information that you supply. In text, italic also indicates UNIX commands and options, file names, and pathnames.

    Plain: In text, plain indicates Windows NT commands and options, file names, and pathnames.

    Bold Italic: Indicates important information.


• Lucida Console: Lucida Console text indicates examples of source code and system output.

    Lucida Bold: In examples, Lucida Console bold indicates characters that the user types or keys the user presses (for example, ).

    Lucida Blue: In examples, Lucida Blue is used to illustrate the operating system command-line prompt.

    A right arrow (→) between menu commands indicates that you should choose each command in sequence. For example, Choose File → Exit means you should choose File from the menu bar, and then choose Exit from the File pull-down menu.

    This line continues

    The continuation character is used in source code examples to indicate a line that is too long to fit on the page, but must be entered as a single line on screen.

    The following are also used:

Syntax definitions and examples are indented for ease in reading. All punctuation marks included in the syntax (for example, commas, parentheses, or quotation marks) are required unless otherwise indicated.

    Syntax lines that do not fit on one line in this manual are continued on subsequent lines. The continuation lines are indented. When entering syntax, type the entire syntax entry, including the continuation lines, on the same input line.

    Text enclosed in parentheses and underlined (like this) following the first use of proper terms will be used instead of the proper term.

    Interaction with our example system will usually include the system prompt (in blue) and the command, most often on 2 or more lines.

    If appropriate, the system prompt will include the user name and directory for context. For example:

%etl_node%:dsadm /usr/dsadm/Ascential/DataStage >
    /bin/tar cvf /dev/rmt0 /usr/dsadm/Ascential/DataStage/Projects


• Table of Contents

    1 DATA INTEGRATION OVERVIEW
    1.1 JOB SEQUENCES
    1.2 JOB TYPES

    2 STANDARDS
    2.1 DIRECTORY STRUCTURES
    2.2 NAMING CONVENTIONS
    2.3 DOCUMENTATION AND ANNOTATION
    2.4 WORKING WITH SOURCE CODE CONTROL SYSTEMS
    2.5 UNDERSTANDING A JOB'S ENVIRONMENT

    3 DEVELOPMENT GUIDELINES
    3.1 MODULAR DEVELOPMENT
    3.2 ESTABLISHING JOB BOUNDARIES
    3.3 JOB DESIGN TEMPLATES
    3.4 DEFAULT JOB DESIGN
    3.5 JOB PARAMETERS
    3.6 PARALLEL SHARED CONTAINERS
    3.7 ERROR AND REJECT RECORD HANDLING
    3.8 COMPONENT USAGE

    4 DATASTAGE DATA TYPES
    4.2 NULL HANDLING
    4.3 RUNTIME COLUMN PROPAGATION

    5 PARTITIONING AND COLLECTING
    5.1 PARTITION TYPES
    5.2 MONITORING PARTITIONS
    5.3 PARTITION METHODOLOGY
    5.4 PARTITIONING EXAMPLES
    5.5 COLLECTOR TYPES
    5.6 COLLECTING METHODOLOGY

    6 SORTING
    6.1 PARTITION AND SORT KEYS
    6.2 COMPLETE (TOTAL) SORT
    6.3 LINK SORT AND SORT STAGE
    6.4 STABLE SORT
    6.5 SUB-SORTS
    6.6 AUTOMATICALLY-INSERTED SORTS
    6.7 SORT METHODOLOGY
    6.8 TUNING SORT

    7 FILE STAGE USAGE
    7.1 WHICH FILE STAGE TO USE
    7.2 DATA SET USAGE
    7.3 SEQUENTIAL FILE STAGES (IMPORT AND EXPORT)
    7.4 COMPLEX FLAT FILE STAGE

    8 TRANSFORMATION LANGUAGES
    8.1 TRANSFORMER STAGE
    8.2 MODIFY STAGE

    9 COMBINING DATA
    9.1 LOOKUP VS. JOIN VS. MERGE
    9.2 CAPTURING UNMATCHED RECORDS FROM A JOIN
    9.3 THE AGGREGATOR STAGE

    10 DATABASE STAGE GUIDELINES
    10.1 DATABASE DEVELOPMENT OVERVIEW
    10.2 DB2 GUIDELINES
    10.3 INFORMIX DATABASE GUIDELINES
    10.4 ODBC ENTERPRISE GUIDELINES
    10.5 ORACLE DATABASE GUIDELINES
    10.6 SYBASE ENTERPRISE GUIDELINES
    10.7 TERADATA DATABASE GUIDELINES

    11 TROUBLESHOOTING AND MONITORING
    11.1 WARNING ON SINGLE-NODE CONFIGURATION FILES
    11.2 DEBUGGING ENVIRONMENT VARIABLES
    11.3 HOW TO ISOLATE AND DEBUG A PARALLEL JOB
    11.4 VIEWING THE GENERATED OSH
    11.5 INTERPRETING THE PARALLEL JOB SCORE

    12 PERFORMANCE TUNING JOB DESIGNS
    12.1 HOW TO DESIGN A JOB FOR OPTIMAL PERFORMANCE
    12.2 UNDERSTANDING OPERATOR COMBINATION
    12.3 MINIMIZING RUNTIME PROCESSES AND RESOURCE REQUIREMENTS
    12.4 UNDERSTANDING BUFFERING

    APPENDIX A: STANDARD PRACTICES SUMMARY
    APPENDIX B: DATASTAGE NAMING REFERENCE
    APPENDIX C: UNDERSTANDING THE PARALLEL JOB SCORE
    APPENDIX D: ESTIMATING THE SIZE OF A PARALLEL DATA SET
    APPENDIX E: ENVIRONMENT VARIABLE REFERENCE
    APPENDIX F: SORTING AND HASHING ADVANCED EXAMPLE


• 1 Data Integration Overview

    Work performed by Data Integration jobs falls into four general categories:

    Reading input data including sequential files, databases and DS/EE Data Sets;

    Performing row validation to support data quality;

    Performing transformation from data sources to data targets; and

Provisioning data targets.

    Here is the general flow diagram for DataStage Enterprise Edition jobs:

[Flow diagram: the job runs a Before Job Subroutine, reads input data, performs validations, performs transformations, then performs loads and/or creates intermediate Data Sets, and finally runs an After Job Subroutine. The read and load phases create reject files (limited); the validation and transformation phases create error and reject files. After each phase, if Halt on Error is set the job exits with failure, and if errors and warnings exceed the job warning threshold the job also exits with failure.]


• 1.1 Job Sequences

    As shown in the previous diagram, ETL development is intended to be modular: individual Parallel jobs built in DataStage Enterprise Edition (DS/EE) are controlled as modules from master DataStage Sequence jobs, as illustrated in the example below:

    These job Sequences control the interaction and error handling between individual DataStage jobs, and together form a single end-to-end module within a DataStage application.

Job sequences also provide the recommended level of integration with external schedulers (such as AutoSys, Cron, CA7, etc.). This provides a level of granularity and control that is easy to manage and maintain, and leverages each technology appropriately.

    In most production deployments, Job Sequences require a level of integration with various production automation technologies (scheduling, auditing/capture, error logging, etc). These topics are discussed in Parallel Framework Standard Practices: Administration, Management, and Production Automation.
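As a hedged illustration of this integration, a scheduler wrapper script might launch a master Sequence job through the dsjob command-line interface (the project name, Sequence name, and parameter below are hypothetical):

    # Run the master Sequence and wait for completion; -jobstatus makes the
    # exit status reflect the finishing status of the job.
    $DSHOME/bin/dsjob -run -jobstatus \
        -param Staging_Path_Parm=/Staging/prod/Project_A \
        ECRPROD MasterLoadSeq

    The scheduler can then test the exit status to decide whether dependent Sequences should run.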


• 1.2 Job Types

    Nearly all data integration jobs fall into three major types: Transformation, Hybrid, and Provisioning.

- Transformation jobs prepare data for provisioning jobs,
    - Provisioning jobs load transformed data, and
    - Hybrid jobs do both.

    The following table defines when each type should be used:

Type: Transformation
    Data Requirements: Data must NOT be changed by any method unless jobs transforming an entire subject area have successfully completed, or where the resource requirements for data transformation are very large.
    Example: Reference tables upon which all subsequent jobs and/or the current data target (usually a database) will depend, or long-running provisioning processes. This prevents partial replacement of reference data in the event of transformation failure, and preserves the compute effort of long-running transformation jobs.

    Type: Hybrid
    Data Requirements: Data can be changed regardless of success or failure.
    Example: Non-reference data or independent data are candidates. The data target (usually a database) must allow subsequent processing of error or reject rows and tolerate partial or complete non-update of targets. Neither the transformation nor provisioning requirements are large.

    Type: Provisioning
    Data Requirements: Data must NOT be changed by any method unless jobs transforming an entire subject area have successfully completed, or where the resource requirements for data provisioning are very large.
    Example: Any target where either all sources have been successfully transformed, or where the resources required to transform the data must be preserved in the event of a load failure, or where the provisioning will take so long that it increases the probability of job failure if the job includes both transformation and provisioning.


• 1.2.1 Transformation Jobs

    In transformation jobs, data sources, some of which may be write-through cache Data Sets, are processed to produce a load-ready Data Set that represents either the entire target table or new records to be appended to the target table. If the entire target table is held in the load-ready Data Set, that Data Set qualifies as write-through cache and may be used as source data instead of the target table.

    The following example transformation job demonstrates the use of write-through cache DS/EE Data Sets:


    The target table is among the inputs.

• The following example transformation job does NOT produce write-through cache; its sources do NOT include the target table.


• 1.2.2 Hybrid Jobs

    The following example hybrid job demonstrates several techniques that might be used in more complex jobs. The more interesting solutions in this job are circled and described below, following the highlighted areas from left to right:

    A column generator inserts the key column for a join and generates a single value guaranteed to never appear in the other input(s) to the join. By specifying a full-outer join we produce a Cartesian product dataset. In this case, we replicated the Oracle structure (lower input) for each country found in the write-through cache country dataset (upper input).

    The key column for a Referential Integrity check is validated by a Transformer stage. If the key column is NULL, it is rejected by the transformer to a reject port and the validation is not performed for those records. The non-validated records, the validated records, and the write-through cache records from the last load of the target database are merged.

    The merged records are grouped and ordered before being de-duplicated to remove obsolete records.

    The de-duplicated records are re-grouped and ordered before calculation of the terminating keys, producing an ordered and linked associative table.

    This job also loads the target database table and creates write-through cache. In this case, if the load fails, the cache is deleted, forcing other jobs that might depend on this data to access the existing (not updated) target database table. This enforces a coherent view of the subject area from either cache (current state if all jobs complete successfully) or target tables (previous state if any job fails).


• 1.2.3 Provisioning Jobs

    This example provisioning job demonstrates the straightforward approach to simple provisioning tasks.


• 2 Standards

    Establishing consistent development standards helps to improve developer productivity and reduce ongoing maintenance costs. Development standards can also make it easier to integrate external processes such as automated auditing and reporting, and to build technical and support documentation.

    2.1 Directory Structures

2.1.1 Data, Install, and Project Directory Structures

    The following diagrams depict the IBM WebSphere DataStage software directory structures and the support directory structures. These directories are configured during product installation.

[Diagram: the Install file system contains /Ascential/DataStage, which holds /DSEngine, /PXEngine, /Configurations, /patches, and /Projects (with a subdirectory for each project, /Project_A through /Project_Z). Separate Scratch file systems (/Scratch0 through /ScratchN) and Data file systems (/Data0 through /DataN) each contain a subdirectory per project.]

    Figure 1: Recommended DataStage Install, Scratch, and Data Directories


  • By default, the DataStage Administrator client creates its projects (repositories) in the Projects directory of the DataStage installation directory. In general, it is a bad practice to create DataStage projects in the default directory, as disk space is typically limited in production install directories. For this reason, projects should be installed in their own file system.

NOTE: On some operating systems, it is possible to create separate file systems at non-root levels. This is illustrated in the above diagram as a separate file system for the Projects subdirectory within the DataStage installation.

The DataStage installation creates the following two directories:

    $DSHOME/../Scratch
    $DSHOME/../Datasets

The DataStage Administrator should ensure that these default directories are never used by any parallel configuration files. Scratch is used by the EE framework for temporary files such as buffer overflow and sort memory overflow. It is a bad practice to share the DataStage project file system and conductor file system with volatile files like scratch files and Parallel data set part files, because they increase the risk of filling the DataStage project file systems.
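As a hedged example, an administrator might scan the installed configuration files for references to these default directories (the install path matches the earlier tar example, and the check assumes configuration files carry the conventional .apt extension):

    # List any parallel configuration file that still points at the default
    # Scratch or Datasets directories under the install tree.
    grep -l -e "/usr/dsadm/Ascential/DataStage/Scratch" \
            -e "/usr/dsadm/Ascential/DataStage/Datasets" \
            $DSHOME/../Configurations/*.apt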

To scale I/O performance within DataStage, the administrator should consider creating separate file systems for each Scratch and Data resource partition, naming the file systems in accordance with the partition numbers in the DataStage EE configuration file. As a standard practice, create subdirectories for each project within each scratch and disk partition, as illustrated in the configuration file sketch below.
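For illustration only, a two-node parallel configuration file following this convention might look like the sketch below (the node names, host name, and paths are examples rather than required values):

    {
        node "node1"
        {
            fastname "etl_node"
            pools ""
            resource disk "/Data0/Project_A" {pools ""}
            resource scratchdisk "/Scratch0/Project_A" {pools ""}
        }
        node "node2"
        {
            fastname "etl_node"
            pools ""
            resource disk "/Data1/Project_A" {pools ""}
            resource scratchdisk "/Scratch1/Project_A" {pools ""}
        }
    }

    Giving each partition its own disk and scratchdisk file system spreads temporary and Data Set I/O across devices.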


[Diagram: the /Staging file system contains /dev, /si, /qa, and /prod trees; each tree contains a subdirectory per project (/Project_A through /Project_Z), and each project subdirectory contains an /archive subdirectory.]

    Figure 2: DataStage Staging Directories

    Within the separate Staging file system, data directories are implemented for each deployment phase of a job (development, system integration, qa, and production) as appropriate. If the file system is not shared across multiple servers, not all of these development phases may be present on a local file system. Within each deployment directory, files are separated by Project name as shown below.

    /Staging Top-Level Directory

    /dev development data tree, location of source data files, target data files, error and reject files.

    /Project_A subdirectory created for each project

    /archive location of compressed archives created by archive process of previously processed files

    /si System Integration (also known as test) data tree

    /qa Quality Assurance data tree

    /prod Production data tree
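As a minimal sketch only, the staging tree for a single project might be created with commands like the following (the project name and the set of phases are examples; create only the phases deployed on a given server):

    # Create the per-phase project directories and their archive subdirectories
    for phase in dev si qa prod
    do
        mkdir -p /Staging/${phase}/Project_A/archive
    done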


  • 2.1.2 Extending the DataStage Project for External Entities

It is quite common for a DataStage application to be integrated with external entities, such as the operating system, another application, or middleware. The integration can be as simple as a file system for housing source files, or it may require scripts, for example when integrating with an enterprise scheduler.

To completely integrate all aspects of a DataStage application, the directory structure used for integration with external entities should be defined in a way that provides a complete and separate structure in the same spirit as a DataStage project. A directory structure should be created that organizes external entities and is directly associated with one and only one DataStage project. This provides a convenient vehicle to group and manage the resources used by a project. The directory structure is made transparent to the DataStage application through the use of environment variables. Environment variables are a critical portability tool that enables DataStage applications to move through the life cycle without any code changes.

[Diagram: the Project_Plus hierarchy contains /dev, /si, /qa, and /prod trees; each tree contains a subdirectory per project (/Project_A through /Project_Z), and each project subdirectory contains /bin, /datasets, /doc, /logs, /params, /schemas, /scripts, /sql, and /src.]

    Figure 3: Project_Plus Directory Structure


  • Within the Project_Plus hierarchy, directories are created for each deployment phase of a job (development, system integration, qa, and production) as appropriate. If the file system is not shared across multiple servers, not all of these development phases may be present on a local file system.

Project_Plus Top-Level of Directory Hierarchy

    /dev development code tree

    /Project_A subdirectory created for each project

    /bin location of custom programs, DataStage routines, BuildOps, utilities, and shells

    /doc location of documentation for programs found in /bin subdirectory

    /src location of source code and makefiles for items found in /bin subdirectory (Note: depending on change management policies, this directory may only be present in the /dev development code tree)

    /datasets location of DataSet header files (.ds file)

    /logs location of custom job logs and reports

    /params location of parameter files for automated program control, a copy of dsenv and copies of DSParams.$ProjectName project files

    /schemas location of Orchestrate schema files

    /scripts location of operating system (shell) script files

    /sql location of maintenance or template SQL

    /si system integration (aka test) code tree

    /qa quality assurance code tree

    /prod production code tree

In support of a Project_Plus directory structure, environment variable parameters should be configured. For example, the following diagram shows Project_Plus variables as defined in the DataStage Administrator.


  • Figure 4: Project_Plus Environment Variables
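As an illustrative sketch only, such variables might resolve to values like the following for a development code tree (the variable names and paths are examples, not product-defined settings):

    # Example Project_Plus variable values for the development tree of Project_A
    PROJECT_PLUS_HOME=/Project_Plus/dev/Project_A
    PROJECT_PLUS_SCRIPTS=$PROJECT_PLUS_HOME/scripts
    PROJECT_PLUS_PARAMS=$PROJECT_PLUS_HOME/params
    PROJECT_PLUS_DATASETS=$PROJECT_PLUS_HOME/datasets
    PROJECT_PLUS_SCHEMAS=$PROJECT_PLUS_HOME/schemas
    PROJECT_PLUS_LOGS=$PROJECT_PLUS_HOME/logs

    Because jobs reference only the variables, the same job can promote across the /dev, /si, /qa, and /prod trees without code changes.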

In some implementations, there may be external entities that are shared with other DataStage projects, for example when all jobs are invoked with the same script. In this case, a directory structure similar to the Project_Plus structure could be configured and referred to as DataStage_Plus.

2.2 Naming Conventions

    As a graphical development environment, DataStage offers (within certain restrictions) flexibility to developers when naming the various objects and components used to build a data flow. By default, the Designer tool assigns default names based on the object type and the order in which the item is placed on the design canvas. While the default names may create a functional data flow, they do not facilitate ease of maintenance over time, nor do they adequately document the business rules or subject areas.

A consistent naming standard is essential to:
    - maximize the speed of development
    - minimize the effort and cost of downstream maintenance
    - enable consistency across multiple teams and projects
    - facilitate concurrent development
    - maximize the quality of the developed application
    - increase the readability of the objects in the visual display medium
    - increase the understanding of components when seen in external systems, for example in WebSphere MetaStage or an XML extract


  • This section presents a set of standards and guidelines to apply to developing data integration applications using DataStage Enterprise Edition.

Any set of standards needs to take on the culture of an organization and be tuned to its needs, so it is envisaged that these standards will develop and adapt over time to suit both the organization and the purpose.

    There are a number of benefits from using a graphical development tool like DataStage, and many of these benefits were used to establish this naming standard:

    With rapid development, more effort can be put into analysis and design, enabling a greater understanding of the requirements and greater control over how they are delivered.

There can be a much tighter link between design and development. Since much of the development work is done using a click, drag, and drop paradigm, there is less typing involved, and hence the opportunity to use longer, more meaningful, and more readable names while maintaining quality.

    Throughout this section, the term Standard refers to those principles that are required, while the term Guideline refers to recommended, but not required, principles.

2.2.1 Key Attributes of the Naming Convention

    This naming convention is based on a three-part convention:

    Subject, Subject Modifier, and Class Word

    In the context of DataStage, the class word is used to identify either a type of object or the function that a particular type of object will perform. In some cases where appropriate, objects can be sub-typed (for example, a Left Outer Join). In these cases the class word represents the subtype.

    For example, in the case of a link object, the class word refers to the functions of Reading, Reference (Lookup), Moving or Writing data (or within a Sequence Job, the moving of a message).

    In the case of a data store the class word will refer to the type of data store, for example: Data Set, Sequential File, Table, View, and so forth.

Where no sub-classification is required, the class word will simply refer to the object. As an example, a transformer might be named: Data_Block_Split_Tfm

    As a guideline, the Class Word is represented as a two, three or four letter abbreviation. Where it is a three or four letter abbreviation then it should be word capitalized. Where it is a two letter abbreviation both letters should be capitalized.


• A list of frequently-used Class Word abbreviations is provided in Appendix B: DataStage Naming Reference.

One benefit of using the Subject, Subject Modifier, Class Word approach over the Prefix approach is that it enables two levels of sorting or grouping. In WebSphere MetaStage, the object type is defined in a separate field: there is a field that denotes whether the object is a column, a derivation, a link, a stage, a job design, and so forth. This is the same or similar information that would be carried in a prefix approach. Carrying this information as a separate attribute enables the first word of the name to be used as the subject matter, allowing a sort either by subject matter or by object type. Secondly, the class word approach enables sub-classification by object type to provide additional information.

For the purposes of documentation, all word abbreviations should be referenced by their long form, so that readers get used to saying the name in full even when reading the abbreviation. When creating the object, however, the abbreviated form is used, as with a logical name. This will help reinforce wider understanding of the subjects.

The key issue is readability. Though DataStage imposes some limitations on the type of characters and length of various object names, the standard, where possible, is to separate words with an underscore, which allows clear identification of each word in a name. This should be enhanced by also using word capitalization, that is, capitalizing the first letter of each word.

2.2.2 Designer Object Layout

    The effective use of naming conventions means that objects need to be spaced appropriately on the DataStage Designer canvas. For stages with multiple links, expanding the icon border can significantly improve readability. This type of approach takes extra effort at first, so a pattern of work needs to be identified and adopted to help development. The Snap to Grid feature of Designer can help improve development speed. When development is more or less complete, attention should be given to the layout to enhance readability before it is handed over to versioning.

Where possible, consideration should be given to providing DataStage developers with higher resolution screens, as this gives them more screen real estate. This can help make them more productive and makes their work easier to read.

2.2.3 Documentation and Metadata Capture

    One of the major problems with any development effort, whatever tool you use, is maintaining documentation. Though best intentions are always apparent, documentation is often left until later and inadequately carried out.

DataStage provides the ability to document during development with the use of meaningful naming standards (as outlined in this section). Establishing standards also eases the use of external tools and processes such as WebSphere MetaStage, which can provide impact analysis, as well as documentation and auditing.

    2.2.4 Naming Conventions by Object Type

2.2.4.1 Projects

    Each DataStage Project is a standalone repository. It may or may not have a one-to-one relationship with an organization's project of work. This factor can often cause terminology issues, especially in teamwork where both business and developers are involved. The suffix of a Project name should be used to identify Development (Dev), Test (Test), and Production (Prod).

The name of a DataStage Project may be at most 18 characters in length; it can contain alphanumeric characters and underscores. With the 18-character limit, the name is most often composed of abbreviations.

Examples of Project naming where the project is single-application focused are:
    Accounting Engine NAB Development would be named: Acct_Eng_NAB_Dev
    Accounting Engine NAB Production would be named: Acct_Eng_NAB_Prod

    Examples of Project naming where the project is multi-application focused are:
    Accounting Engine Development: Acct_Engine_Dev
    Accounting Engine Production: Acct_Engine_Prod

2.2.4.2 Category Hierarchy

    DataStage organizes objects in its repository by Categories, allowing related objects to be grouped together. Category names can be long, are alphanumeric, and can also contain both spaces and underscores. Therefore, Directory names should be word capitalized and separated by either an underscore or a space.

    DataStage enforces the top level Directory Structure for different types of Objects (for example, Jobs, Routines, Shared Containers, Table definitions). Below this level, developers have the flexibility to define their own Directory or Category hierarchy.

2.2.4.3 Job Category Naming

    Within Designer, dialog box fields that specify a new category have only one input area for defining the Category name. Multiple levels of hierarchy are named by specifying the hierarchy levels separated by a backslash (\). For example, the structure A Test\Lower\Lower Still is shown below:


  • Figure 5: Creating Category Hierarchies

    The main reason for having Categories is to group related objects. Where possible, a Category level should only contain objects that are directly related. For example, a job category might contain a Job Sequence and all the jobs and only those jobs that are contained in that sequence. Organizing related DataStage objects within categories also facilitates backup/export/import/change control strategies for projects since Manager can import/export objects by category grouping.

Categorization by Functional Module

    For a given application, all Jobs and Job Sequences will be grouped in a single parent Category, with sub-levels for individual functional modules. Note that Job names must be unique within a DataStage project, not within a category. Within each functional module category, Jobs and Job Sequences are grouped together in the same scope as the technical design documents. For example, jobs in the ECRDEV project that read write-through cache for an ECRP subset, cleanse and load multi-family mortgage data, and are driven by a sequencer might have a hierarchy that looks like the following example:


  • Figure 6: Categorization by Functional Module

Categorization by Developer

    In development projects, categories will be created for each developer as a personal sandbox, the place where they perform unit test activities on the jobs they are developing. It is the responsibility of each developer to delete unused or obsolete code, and the responsibility of the development manager assigned the DataStage Manager role to ensure that projects are not obese with unused jobs, categories, and metadata.

Remembering that Job names must be unique within a given project, two developers cannot save a copy of the same job with the same name within their individual sandbox categories; a unique Job name must be given.

In the previous illustration, the project manager and two developers have private categories for sandbox and development activities, and there are two additional high-level categories, ECRP and Templates.

2.2.4.4 Table Definition Categories

    Unlike other types of DataStage objects, Table Definitions are always categorized using two-level names. By default, DataStage assigns the level names based on the source of the metadata import (for example, Orchestrate, PlugIn, Saved, etc.), but this can be overridden during import.

    Although the default table definition categories are useful from a functional perspective, establishing a Table Definition categorization that matches project development organization is recommended.

    New Table Definition categories can be created within the repository by right-clicking within the Table Definitions area of the DataStage project repository and choosing the New Category command.

When implementing a customized Table Definition categorization, care must be taken to override the default choices for category names during Table Definition import. On import, the first-level Table Definition category is identified as the Data Source Type, and the second-level categorization is referred to as the Data Source Name, as shown in the example below. The placement of these fields varies with the method of metadata import.

Temporary TableDefs created by developers to assist with job creation appear under the Saved category by default. Once created, if these TableDefs are to be used by other jobs, they must be moved to the appropriate category and re-imported from that category in every job where they are used. TableDefs that remain in the Saved category should be deleted as soon as possible.


In this example, the TableDefs have been grouped into a master category of Custom, with sub-categories intended to identify the type of the source, e.g., Datasets. Each subject area will have a master category, e.g., DWPH1 or ECRP. The following is one of the TableDefs from this project, showing how to correctly specify the category and sub-category.

    An alternative implementation is to set the Data source name to that of the source system or schema.

    Figure 7: Table Definition Categories

2.2.4.5 Jobs and Job Sequences

    Job names must begin with a letter and can contain letters, numbers, and underscores only. Because the name can be long, Job and Job Sequence names should be descriptive and should use word capitalization to make them readable.

    Jobs and Job Sequences are all held under the Category Directory Structure of which the top level is the category Jobs.

    A Job will be suffixed with the class word Job and a Job Sequence will be suffixed with the class word Seq.

Examples of Job naming are:
    CodeBlockAggregationJob
    CodeBlockProcessingSeq

    Jobs should be organized under Category Directories to provide grouping such that a Directory should contain a Sequence Job and all the Jobs that are contained within that sequence. This will be discussed further in Section 2.2.4.2 Category Hierarchy.


• 2.2.4.6 Shared Containers

    Shared containers have the same naming constraints as jobs in that the name can be long but cannot contain underscores, so word capitalization should be used for readability. Shared containers have their own Category Directory, and consideration should be given to a meaningful Directory Hierarchy. When a Shared Container is used, a character code is automatically added to that instance of its use throughout the project. It is optional whether you decide to change this code to something meaningful.

    To differentiate between Parallel Shared Containers and Server Shared Containers, the following Class Word naming is recommended:

Psc = Parallel (Enterprise Edition) Shared Container
    Ssc = Server Edition Shared Container

    IMPORTANT: Use of Server Edition Shared Containers is discouraged within a parallel job.

Examples of Shared Container naming are:
    AuditTrailPsc (the original, as seen in the Category Directory)
    AuditTrailPscC1 (an instance of use of the above shared container)
    AuditTrailPscC2 (another instance of use of the same shared container)

In the above examples, the characters C1 and C2 are automatically applied to the Shared Container stage by DataStage Designer when it is dragged onto the design canvas.

2.2.4.7 Parameters

    A Parameter can be a long name consisting of alphanumeric characters and underscores. Therefore, the parameter name must be made readable using capitalized words separated by underscores. The class word suffix is Parm.

Examples of Parameter naming are:
    Audit_Trail_Output_Path_Parm

Note that where this is used in a stage property, the parameter name is delimited by the # sign:

    #Audit_Trail_Output_Path_Parm#

2.2.4.8 Links

Within a DataStage Job, links are objects that represent the flow of data from one stage to the next. Within a Job Sequence, links represent the flow of a message from one activity / step to the next.

It is particularly important to establish a consistent naming convention for link names, instead of using the default DSLink# (where # is an assigned number). Within the graphical Designer environment, stage editors identify links by name; having a descriptive link name reduces the chance for errors (for example, during Link Ordering). Furthermore, when sharing data with external applications (for example, through Job reporting), establishing standardized link names makes it easier to understand results and audit counts.

The following rules can be used to establish a link name:
- Use the prefix lnk_ before the subject name to differentiate links from stage objects
- The link name should define the subject of the data that is being moved
- For non-stream links, the link name should include the link type (reference, reject) to reinforce the visual cues of the Designer canvas:
  o Ref for reference links (Lookup)
  o Rej for reject links (Lookup, Merge, Transformer, Sequential File, Database, etc.)
- The type of movement may optionally be part of the Class Word, for example:
  o In for input
  o Out for output
  o Upd for updates
  o Ins for inserts
  o Del for deletes
  o Get for shared container inputs
  o Put for shared container outputs

    As data is enriched through stages, the same name may be appropriate for multiple links. In this case, always specify a unique link name within a particular Job or Job Sequence by including a number. (The DataStage Designer does not require link names on different stages to be unique.)

Example link names:
    Input Transactions: lnk_Txn_In
    Reference Account Numbers: lnk_Account_Ref
    Customer File Rejects: lnk_Customer_Rej
    Reception Succeeded Message: lnk_Reception_Succeeded_Msg

2.2.4.9 Stage Names

DataStage assigns default names to stages as they are dragged onto the Designer canvas. These names are based on the type of stage (object) and a unique number, based on the order the object was added to the flow. Within a Job or Job Sequence, stage names must be unique.

    Instead of using the full object name, a 2, 3, or 4 character abbreviation should be used for the Class Word suffix, after the subject name and subject modifier. A list of frequently-used stages and their corresponding Class Word abbreviation may be found in 12.4.2 DataStage Naming Reference.

2.2.4.10 Data Stores

For the purposes of this section, a data store is a physical piece of disk storage where data is held for some period of time. In DataStage terms, this can be either a table in a database structure or a file contained within a disk directory or catalog structure. Data held in a database structure is referred to as either a Table or a View. In data warehousing, two additional subclasses of table might be used: Dimension and Fact. Data held in a file in a directory structure will be classified according to its type, for example: Sequential File, Parallel Data Set, Lookup File Set, etc.

The concept of source and target can be applied in a couple of ways. Every job in a series of jobs could consider the data it reads to be a source and the data it writes to be a target. However, for the sake of this naming convention, a Source is only data that is extracted from an original system, and a Target is a data structure that is produced or loaded as the final result of a particular series of jobs. This reflects the purpose of the project: to move data from a source to a target.

    Data Stores used as temporary structures to land data between jobs, supporting restart and modularity, should use the same names in the originating job and any downstream jobs reading the structure. Examples of Data Store naming are:

    Transaction Header Sequential File or Txn_Header_SF
    Customer Dimension or Cust_Dim (this could optionally be further qualified as Cust_Dim_Tgt if you wish to identify it as a final target)
    Customer Table or Cust_Tab
    General Ledger Account Number View or GL_Acctno_View

2.2.4.11 Transformer Stage and Stage Variables

A Transformer Stage name can be long (over 50 characters) and can contain underscores. Therefore the name can be descriptive and readable through word capitalization and underscores. DataStage Enterprise Edition supports two types of Transformers:

    Tfm: Parallel (Enterprise Edition) Transformer
    BTfm: BASIC (Server Edition) Transformer

    IMPORTANT: For maximum performance and scalability, BASIC Transformers should be avoided in Enterprise Edition data flows.

    A Transformer Stage Variable can have a long name consisting of alphanumeric characters but not underscores. Therefore the Stage Variable name must be made readable only by using Capitalized words. The Class Word suffix is Stage Variable or SV. Stage Variables should be named according to their purpose.

When developing Transformer derivation expressions, it is important to remember that Stage Variable names are case sensitive.

2.2.4.12 DataStage Routines

DataStage BASIC routine names will indicate their function, and they will be grouped in sub-categories by function under a main category of Custom, for example:

    Routines/Custom/SetDSParamsFromFile.

    A How-To document describing the appropriate use of the routine must be provided by the author of the routine, and placed in a documentation repository.

    DataStage Custom Transformer routine names will indicate their function and they will be grouped in sub-categories by function under a main category of Custom, for example:

    Routines/Custom/DetectTeradataUnicode.

    Source code, a makefile, and the resulting object for each Custom Transformer routine must be placed in the project phase source directory, e.g.: /home/dsadm/dev/bin/source.
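For illustration only, the layout in the source directory might look like the following (the file names are hypothetical; only the directory path and the routine name come from the examples above):

    /home/dsadm/dev/bin/source/DetectTeradataUnicode.c
    /home/dsadm/dev/bin/source/Makefile
    /home/dsadm/dev/bin/source/DetectTeradataUnicode.o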

2.2.4.13 File Names

Source file names should include the name of the source database or system and the source table name or copybook name. The goal is to connect the name of the file with the name of the storage object on the source system. Source flat files will have a unique serial number composed of the date, _ETL_, and the time, for example:

    Client_Relationship_File1_In_20060104_ETL_184325.psv.

Intermediate datasets are created between modules. Their names will include the name of the module that created the dataset OR the contents of the dataset, since more than one module may use the dataset after it is written, for example:

    BUSN_RCR_CUST.ds

Target output files will include the name of the target database or system and the target table name or copybook name. The goal is the same as with source files: to connect the name of the file with the name of the file on the target system. Target flat files will have a unique serial number composed of the date, _ETL_, and the time, for example:

    Client_Relationship_File1_Out_20060104_ETL_184325.psv

Files and datasets will have suffixes that allow easy identification of the content and type. DataStage proprietary format files (for example, datasets and filesets) have required suffixes. The table below defines the types of files and their suffixes.

File Type                                        Suffix
Flat delimited and non-delimited files           .dat
Flat pipe (|) delimited files                    .psv
Flat comma-and-quote delimited files             .csv
DataStage datasets                               .ds
DataStage filesets                               .fs
DataStage hash files                             .hash
Orchestrate schema files                         .schema
Flat delimited or non-delimited REJECT files     .rej
DataStage REJECT datasets                        _rej.ds
Flat delimited or non-delimited ERROR files      .err
DataStage ERROR datasets                         _err.ds
Flat delimited or non-delimited LOG files        .log

2.3 Documentation and Annotation

DataStage Designer provides description fields for each object type. These fields allow the developer to provide additional descriptions that can be captured and used by administrators and other developers.

    The Short Description field is also displayed on summary lines within the Director and Manager clients. At a minimum, description annotations must be provided in the Job Properties Short Description field for each job and job sequence, as shown below:

    Figure 8: Job Level Short Description

    Within a job, the Annotation tool should be used to highlight steps in a given job flow. Note that by changing the vertical alignment properties (for example, Bottom) the annotation can be drawn around the referenced stage(s), as shown in the following example.

Figure 9: Example Job Annotation

    DataStage also allows descriptions to be attached to each stage within the General tab of the stage properties.

    Each stage should have a short description of its function specified within the stage properties. These descriptions will appear in the job documentation automatically generated from jobs and sequencers adhering to the standards in this document. More complex operators or operations should have correspondingly longer and more complex explanations on this tab.

Examples of such annotations include:

Job short description:

    This Job takes the data from GBL Oracle Table AD_TYP and does a truncate load into Teradata Table AD_TYP.

ODBC Enterprise stage read:
Read the GLO.RcR_GLOBAL_BUSN_CAT_TYP table from jpORACLE_SERVER using the ODBC driver. There are no selection criteria in the WHERE clause.

Oracle Enterprise stage read:
Read the GLOBAL.GLOBAL_REST_CHAR table from jpORACLE_SERVER using the Oracle Enterprise operator. There are no selection criteria in the WHERE clause.

Remove Duplicates stage:
This stage removes all but one record with duplicate BUSN_OWN_TYP_ID keys.

Lookup stage:
This stage validates the input and writes rejects.
This stage validates the input and continues.
This stage identifies changes and drops records not matched (not updated).

Copy stage:
This stage sends data to the TDMLoadPX stage for loading into Teradata, and to a dataset for use as write-through cache.
This stage renames and/or drops columns and is NOT optimized out.
This stage is cosmetic and is optimized out.

Sequential File stage:
This is the source file for the LANG table.
This is the target file for business qualification process rejects.

Transformer stage:
This stage generates sequence numbers that have a less-than-file scope.
This stage converts null dates.

Modify stage:
This stage performs data conversions not requiring a transformer.

Teradata MultiLoad stage:
Load the RcR_GLOBAL_LCAT_TYP table.

Data Set stage:
This stage writes the GLOBAL_Ad_Typ dataset, which is used as write-through cache to avoid the use of Teradata in subsequent jobs.
This stage reads the GLOBAL_Lcat dataset, which is used as write-through cache to avoid the use of Teradata.

2.4 Working with Source Code Control Systems

DataStage's built-in repository manages objects (jobs, sequences, table definitions, routines, custom components) during job development. However, this repository is not capable of managing non-DataStage components (for example, UNIX shell scripts, environment files, job scheduler configurations, etc.) that may be part of a completed application.

Source code control systems (such as ClearCase, PVCS, SCCS) are useful for managing the development lifecycle of all components of an application, organized into specific releases for version control.

    DataStage does not directly integrate with source code control systems, but it does offer the ability to exchange information with these systems. It is the responsibility of the DataStage developer to maintain DataStage objects within the source code system.

    The Manager client is the primary interface to the DataStage object repository. Using Manager, you can export objects (job designs, table definitions, custom stage types, user-defined routines, etc.) from the repository as clear-text format files. These files can then be checked into the external source code control system.

    The export file format for DataStage objects can be either .DSX (DataStage eXport format) or .XML. Both formats contain the same information, although the XML file is generally much larger. Unless there is a need to parse information in the export file, .DSX is the recommended export format.

2.4.1 Source Code Control Standards

The first step to effective integration with source code control systems is to establish standards and rules for managing this process:

a) Establish a Category naming and organization standard
DataStage objects can be exported individually or by category (folder hierarchy). Grouping related objects by folder can simplify the process of exchanging information with the external source code control system. This object grouping also helps establish a manageable middle ground between entire project exports and individual object exports.

b) Define rules for exchange with source code control
As a graphical development environment, Designer facilitates iterative job design. It would be cumbersome to require the developer to check in every change to a DataStage object in the external source code control system. Rather, rules should be defined for when this transfer should take place. Typically, milestone points in the development lifecycle are a good time for transferring objects to the source code control system - for example, when a set of objects has completed initial development, unit test, and so on.

c) Don't rely on the source code control system for backups
Because the rules defined for transfer to the source code control system will typically apply only at milestones in the development cycle, they would not be an effective backup strategy. Furthermore, operating system backups of the project repository files only establish a point in time, and cannot be used to restore individual objects.

    For these reasons, it is important that an identified individual maintains backup copies of the important job designs using .DSX file exports to a local or (preferably) shared file system.

These backups can be done on a scheduled basis by an Operations support group, or by the individual DataStage developer. In either case, the developer should create a local backup prior to implementing any extensive changes.

2.4.2 Using Object Categorization Standards

As discussed in Section 2.2.4.2 Category Hierarchy, establishing and following a consistent naming and categorization standard is essential to the change management process. The DataStage Manager can export at the Project, Category, and individual Object levels. Assigning related objects to the same category provides a balanced level of granularity when exporting and importing objects with external source code control systems.

2.4.3 Export to Source Code Control System

Exporting DataStage objects to a source code control system is a straightforward process. It can be done interactively by the developer or project manager using the Manager client, as explained in this section.

    The DataStage client includes Windows command-line utilities for automating the export process. These utilities (dsexport and dscmdexport) are documented in the DataStage Manager Guide.
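For example, a scripted export from the client workstation might be invoked as follows (a sketch only; the host, user, password, project, and file names are illustrative, and the exact option syntax should be verified against the DataStage Manager Guide for your release):

    dscmdexport /H=dsserver /U=dsadm /P=xxxxxx DWPH1 C:\exports\DWPH1_Jobs.dsx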

    All exports from the DataStage repository are performed on the Windows workstation. There is no server-side project export facility.

    Select the object or category in the Manager browser.

    Figure 10: Manager Category browser

    Choose Export DataStage Components from the Export menu.

NOTE: Objects cannot be exported from DataStage if they are open in Designer. Make sure all objects are saved and closed before exporting.

    Figure 11: Manager Export Options

    To export a group of objects to a single export file, the option Selection: By category should be specified in the Options tab.

    The filename for export is specified in the Export to file: field at the top of the Export dialog.

    If you wish to include compiled Transformer objects for a selected job, make sure the Job Executables category is checked.

Using your source code control utilities, check in the exported .DSX file.
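For example, with ClearCase the check-in might look like the following (a sketch only; the element name and comment are illustrative, and the .DSX file must already exist as a versioned element, created initially with cleartool mkelem):

    cleartool checkout -nc CodeBlockProcessingSeq.dsx
    cleartool checkin -c "Completed unit test" CodeBlockProcessingSeq.dsx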

2.4.4 Import from Source Code Control System

In a similar manner, the import of objects from an external source code control system is a straightforward process. Import can be interactive through the Manager client (as described in this section), or automated through command-line utilities.

Unlike the export process, command-line import utilities are available for both Windows workstation and DataStage server platforms. The Windows workstation utilities (dsimport and dscmdimport) are documented in the DataStage Manager Guide.

For test and production environments, it is possible to import the job executables from the DataStage server host using the dsjob command line, as documented in the DataStage Development Kit chapter of the Parallel Job Advanced Developer's Guide. Note that using dsjob will only import job executables - job designs can only be imported using the Manager client or the dsimport or dscmdimport client tools.
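In the same spirit as the export example above, a scripted import from the client workstation might look like the following (again a sketch; names are illustrative and the option syntax should be checked against the Manager Guide):

    dscmdimport /H=dsserver /U=dsadm /P=xxxxxx DWPH1 C:\exports\DWPH1_Jobs.dsx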

    Use the source code control system to check-out (or export) the .DSX file to your client workstation.

Import objects in the .DSX file using Manager. Choose Import DataStage Components from the Import menu. Select the file you checked out of your source code control system by clicking on the ellipsis (...) next to the filename field in the import dialog. After selecting your file, click OK to import.

    Figure 12: Manager Import options

The import of the .DSX file will place the object in the same DataStage category it originated from, creating the Job Category if it does not already exist.

    If the objects were not exported with the Job Executables, then compile the imported objects from Designer, or using the Multi-Job Compile tool.

2.5 Understanding a Job's Environment

DataStage Enterprise Edition provides a number of environment variables to enable / disable product features and to fine-tune job performance.
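For example (an illustrative sketch; the values shown are examples only), APT_CONFIG_FILE points jobs at a parallel configuration file and APT_DUMP_SCORE writes the runtime score to the job log. On UNIX, project-wide defaults for such variables can be placed in the dsenv file sourced by the DataStage engine:

    APT_CONFIG_FILE=/opt/Ascential/DataStage/Configurations/4node.apt; export APT_CONFIG_FILE
    APT_DUMP_SCORE=1; export APT_DUMP_SCORE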

    Although operating system environment variables can be set in multiple places, there is a defined order of precedence t