7/31/2019 Kirk Lafker
1/44
1
a presentation by
Kirk Paul Lafler
SAS
Consultant, Author, and TrainerE-mail: [email protected]
7/31/2019 Kirk Lafker
2/44
Copyright 1992-2010 by Kirk Paul Lafler 2
Copyright Kirk Paul Lafler, 1992-2010.
All rights reserved.
SAS is the registered trademark of SAS Institute Inc., Cary, NC, USA.
SAS Certified Professional is the trademark of SAS Institute Inc., Cary,NC, USA.
All other company and product names mentioned are used foridentification purposes only and may be trademarks of theirrespective owners.
7/31/2019 Kirk Lafker
3/44
Copyright 1992-2010 by Kirk Paul Lafler 3
Objectives
What is data federation? Characteristics associated with data federation
What is a join?
Why join?
SASDATA step merge versus a Join
Cartesian product joins
Two table joins
Table aliases to reference tables
Three table joins
Left and Right outer joins
What happens during a join?
Available join algorithms
7/31/2019 Kirk Lafker
4/44
Copyright 1992-2010 by Kirk Paul Lafler 4
Tables Used in Examples
Movies
Actors
7/31/2019 Kirk Lafker
5/44
Copyright 1992-2010 by Kirk Paul Lafler
What is Data Federation?
Process of integrating data from many different sources Leaves data in place without using resources to copy data
Makes access to data sources as easy as possible
Provides a degree of reusability
A data federation approach often replaces a data warehouse
7/31/2019 Kirk Lafker
6/44
Copyright 1992-2010 by Kirk Paul Lafler
Data Federation Characteristics
It represents an integration approach Ability to aggregate data from many different sources
Data sources can be in any location
It can be implemented within any lifecycle methodology
Provides flexibility It is user-centric
Federated data does not copy and store data like a data
warehouse
It contains metadata (information about the actual data andits location)
Is NOT always designed with optimality in mind
7/31/2019 Kirk Lafker
7/44Copyright 1992-2010 by Kirk Paul Lafler 7
What is a Join?
Table One Table Two
Visually, it would look something like this:
Process of combining tables side-by-side (horizontally) Consists of a matching process between rows in tables
Some or all of the tables contents are brought together
Gather and manipulate data from across tables
. . .
7/31/2019 Kirk Lafker
8/44Copyright 1992-2010 by Kirk Paul Lafler 8
Why Join?
Data in a database is often stored in separate tables Joins allow data to be combined as if it were stored in one
huge file
Provide exciting insights into data relationships
Types of joins: Inner joins a maximum of 256 tables can be joined
Outer joins a maximum of 2 tables can be joined
7/31/2019 Kirk Lafker
9/44Copyright 1992-2010 by Kirk Paul Lafler 9
DATA Step Merge versus a Join
Merges process data differently than a standard join The merge process overlays the duplicate by-column
Joins adhere to ANSI guidelines
Joins do not automatically overlay the duplicate matching
column
7/31/2019 Kirk Lafker
10/44Copyright 1992-2010 by Kirk Paul Lafler 10
DATA Step Merge Process
DATA merged;MERGE customers (IN=c)
movies (IN=m);
BY cust_no;IF c AND m;
RUN;
Customers Movies
Cust_no Name
3 Ryan
5 Anna-liese
10 Ronnie
Cust_no Category
3 Adventure
5 Comedy
7 Suspense
Merged
Cust_no Name Category
3 Ryan Adventure5 Anna-liese Comedy
==X
7/31/2019 Kirk Lafker
11/44Copyright 1992-2010 by Kirk Paul Lafler 11
Join Process
PROC SQL;SELECT *FROM customers,
movies
WHEREcustomers.cust_no =movies.cust_no;
QUIT;
Customers Movies
Cust_no Name
3 Ryan
5 Anna-liese
10 Ronnie
Cust_no Name Cust_no Category3 Ryan 3 Adventure
5 Anna-liese 5 Comedy
==X
Cust_no Category
3 Adventure
5 Comedy
7 Suspense
7/31/2019 Kirk Lafker
12/44Copyright 1992-2010 by Kirk Paul Lafler 12
Merge versus Join Results
Cust_no Name Cust_no Category
3 Ryan 3 Adventure
5 Anna-liese 5 Comedy
Merge
Cust_no Name Category
3 Ryan Adventure
5 Anna-liese Comedy
Features
1. Data must be sorted using by-value.
2. Requires variable names to be same.
3. Duplicate matching column is overlaid.
4. Results are not automatically printed.
Features
1. Data does not have to be sorted using by-value.
2. Does not require variable names to be same.
3. Duplicate matching column is not overlaid.
4. Results are automatically printed unless
NOPRINT option is specified.
Versus
Join
7/31/2019 Kirk Lafker
13/44Copyright 1992-2010 by Kirk Paul Lafler 13
Cartesian Product Join (Cross Join)
PROC SQL;SELECT *
FROM customers,movies;QUIT;
Cust_no Name Cust_no Movie_no Category
3 Ryan 3 1011 Adventure
3 Ryan 5 3090 Comedy
3 Ryan 7 4456 Suspense
5 Anna-liese 3 1011 Adventure
5 Anna-liese 5 3090 Comedy5 Anna-liese 7 4456 Suspense
10 Ronnie 3 1011 Adventure
10 Ronnie 5 3090 Comedy
10 Ronnie 7 4456 Suspense
Result represents allpossible combinationsof rows and columns
Absence of WHEREclause produces aCartesian product
No WHEREor Key Used
Customers Movies
Cust_no Name
3 Ryan
5 Anna-liese
10 Ronnie
Cust_no Category
3 Adventure
5 Comedy
7 Suspense
7/31/2019 Kirk Lafker
14/44Copyright 1992-2010 by Kirk Paul Lafler 14
ExampleCartesian Product
A Cartesian product join, sometimes referred to as across join, can be very large because it represents allthe possible combinations of rows and columns.
PROC SQL;
SELECT *
FROM MOVIES,
ACTORS;
QUIT;
7/31/2019 Kirk Lafker
15/44Copyright 1992-2010 by Kirk Paul Lafler 15
The result of anEqui-joinis illustrated by the shadedarea (AB) in the Venn diagram.
A BAB
Equi-Join with Two Tables
7/31/2019 Kirk Lafker
16/44Copyright 1992-2010 by Kirk Paul Lafler 16
Equi-Join with Two Tables
=
EquiJoin
PROC SQL;SELECT *FROM customers,
moviesWHEREcustomers.cust_no =movies.cust_no;
QUIT;
Cust_no Name Cust_no Movie_no Category
3 Ryan 3 1011 Adventure5 Anna-liese 5 3090 Comedy
Customers
Cust_no Name
3 Ryan
5 Anna-liese
10 Ronnie
Movies
Cust_no Category
3 Adventure
5 Comedy
7 Suspense
7/31/2019 Kirk Lafker
17/44Copyright 1992-2010 by Kirk Paul Lafler 17
ExampleWHERE-clause
The most reliable way to join two tables together,and to avoid creating a Cartesian product, is to useaWHEREclause with common columns or keys.
7/31/2019 Kirk Lafker
18/44Copyright 1992-2010 by Kirk Paul Lafler 18
Table Aliases
PROC SQL;SELECT *FROM customers c,
movies mWHERE c.cust_no =
m.cust_no;QUIT;
Cust_no Name Cust_no Movie_no Category
3 Ryan 3 1011 Adventure
5 Anna-liese 5 3090 Comedy
Aliases
=
EquiJoin
Customers
Cust_no Name
3 Ryan
5 Anna-liese
10 Ronnie
Movies
Cust_no Category
3 Adventure
5 Comedy
7 Suspense
7/31/2019 Kirk Lafker
19/44Copyright 1992-2010 by Kirk Paul Lafler 19
ExampleTable Alias
Assigning atable aliasis a not only a useful way toreference a table, but can reduce the number ofkeystrokes typed.
7/31/2019 Kirk Lafker
20/44Copyright 1992-2010 by Kirk Paul Lafler 20
Joining Three Tables
Customers Movies
Cust_no
Name
Cust_no
Movie_no
Category
=
EquiJoin
PROC SQL;SELECT c.cust_no, c.name,m.movie_no, c.category,a.lead_actor
FROM customers c,
movies m,actors aWHERE
c.cust_no = m.cust_no ANDm.movie_no = a.movie_no;
QUIT;
Cust_no Name Movie_no Category Lead_actor
3 Ryan 1011 Adventure Mel Gibson
Actors
Movie_no
Lead_actor=
EquiJoin
7/31/2019 Kirk Lafker
21/44Copyright 1992-2010 by Kirk Paul Lafler 21
The result of aLeft Outer joinis illustrated by theshaded areas (A and AB) in the Venn diagram.
A BAB
Left Outer Joins
7/31/2019 Kirk Lafker
22/44
Copyright 1992-2010 by Kirk Paul Lafler 22
ExampleLeft Outer Join
The result of aLeft Outer joinproduces bothmatched rows from both tables plus any unmatchedrows from the left table.
7/31/2019 Kirk Lafker
23/44
Copyright 1992-2010 by Kirk Paul Lafler 23
The results of aRight Outer joinis illustrated by theshaded areas (B and AB) in the Venn diagram.
A BAB
Right Outer Joins
7/31/2019 Kirk Lafker
24/44
Copyright 1992-2010 by Kirk Paul Lafler 24
ExampleRight Outer Join
The result of aRight Outer joinproduces matchedrows from both tables while preserving unmatchedrows from the right table.
7/31/2019 Kirk Lafker
25/44
Copyright 1992-2010 by Kirk Paul Lafler 25
What Happens during a Join?
When joining two tables:An intermediate Cartesian product is built from the two tables
Rows are selected that match the WHERE clause, if present
When joining more than two tables: SQL query optimizer evaluates the available methods for retrieving
the data and attempts to use the most efficient method
The join is reconstructed into several two-way joins
Removes unwanted rows and columns from the intermediate tables
Determines the order of processing to reduce the size of the
intermediate Cartesian product
7/31/2019 Kirk Lafker
26/44
Copyright 1992-2010 by Kirk Paul Lafler 26
Join Algorithms
Users supply the list of tables for joining along with the joinconditions, and the PROC SQL optimizer determines whichjoin algorithm to use for performing the join. The algorithmsinclude:
7/31/2019 Kirk Lafker
27/44
Copyright 1992-2010 by Kirk Paul Lafler 27
Join Algorithms
Users supply the list of tables for joining along with the joinconditions, and the PROC SQL optimizer determines which ofthe join algorithms to use for performing the join. Thealgorithms include:
Nested Loop Join (brute-force join) When an equalitycondition is not specified, a read of the complete contents of theright table is processed for each row in the left table.
7/31/2019 Kirk Lafker
28/44
Copyright 1992-2010 by Kirk Paul Lafler 28
Nested Loop Join - Features
Used with join relations of two tables One or both of the tables is relatively small
I/O intensive
This join generally performs fairly well with smaller tables,
but generally performs poorly with larger join relations
7/31/2019 Kirk Lafker
29/44
Copyright 1992-2010 by Kirk Paul Lafler 29
Join Algorithms
Users supply the list of tables for joining along with the joinconditions, and the PROC SQL optimizer determines which ofthe join algorithms to use for performing the join. Thealgorithms include:
Nested Loop Join When an equality condition is not specified, a
read of the complete contents of the right table is processed foreach row in the left table.
Sort-Merge Join When the specified tables are already in thedesired sort order, resources are not expended for resorting.
7/31/2019 Kirk Lafker
30/44
Copyright 1992-2010 by Kirk Paul Lafler 30
Sort-Merge Join - Features
Used with joins of two tables Works best when one or both of the join relations are in the
desired order
One or both of the tables are of moderate size
If the optimizer determines a sort is not needed no sort will
be performed, otherwise sort resources are expended:
- using an explicit sort operation
- by taking advantage of pre-existing ordering
Generally performs well, particularly when the majority of
the rows are being joined
7/31/2019 Kirk Lafker
31/44
Copyright 1992-2010 by Kirk Paul Lafler 31
Users supply the list of tables for joining along with the joinconditions, and the PROC SQL optimizer determines which ofthe join algorithms to use for performing the join. Thealgorithms include:
Nested Loop Join When an equality condition is not specified, a
read of the complete contents of the right table is processed foreach row in the left table.
Sort-Merge Join When the specified tables are already in thedesired sort order, resources are not expended for resorting.
Indexed Join When an index exists on >=1 variable(s) torepresent a key, matching rows may be accessed using the index.
Join Algorithms
7/31/2019 Kirk Lafker
32/44
Copyright 1992-2010 by Kirk Paul Lafler 32
Indexed Join - Features
Used with joins of two tables An index must be defined that produces a small subset of
the total number of rows in a table
Matching rows are accessed directly using the index
One or both of the tables are of moderate to large size
Generally performs well, particularly when a small number of
rows are being joined
7/31/2019 Kirk Lafker
33/44
Copyright 1992-2010 by Kirk Paul Lafler 33
Join Algorithms
Users supply the list of tables for joining along with the joinconditions, and the PROC SQL optimizer determines which ofthe join algorithms to use for performing the join. Thealgorithms include:
Nested Loop Join When an equality condition is not specified, a
read of the complete contents of the right table is processed foreach row in the left table.
Sort-Merge Join When the specified tables are already in thedesired sort order, resources are not expended for resorting.
Indexed Join When an index exists on >=1 variable(s) torepresent a key, matching rows may be accessed using the index.
Hash Join When an equality relationship exists and thesmaller table is able to fit in memory, set-matching operationsgenerally perform well.
7/31/2019 Kirk Lafker
34/44
Copyright 1992-2010 by Kirk Paul Lafler 34
Hash Join - Features
Used with joins of two tables The SQL optimizer attempts to estimate the amount memory
required to build a hash table in memory
A hash table data structure associates keys with values
The optimizer tries to use the smaller of the tables as the
hash table
Its purpose is to perform more efficient lookups
Requires an equi-join predicate
This algorithm generally performs well with small to medium
join relations
7/31/2019 Kirk Lafker
35/44
Copyright 1992-2010 by Kirk Paul Lafler 35
Options MSGLEVEL=IBy specifying the MSGLEVEL=I option, helpfulnotes describing index usage, sort utilities, andmerge processing are displayed on the SAS Log.
OPTIONS MSGLEVEL=I;
PROC SQL;SELECT MOVIES.TITLE, RATING, LENGTH, ACTOR_LEADING
FROM MOVIES,
ACTORS
WHERE MOVIES.TITLE = ACTORS.TITLE AND
RATING = PG;QUIT;
7/31/2019 Kirk Lafker
36/44
Copyright 1992-2010 by Kirk Paul Lafler 36
SAS Log ResultsOPTIONS MSGLEVEL=I;PROC SQL;
SELECT MOVIES.TITLE, RATING, LENGTH, ACTOR_LEADING
FROM MOVIES,
ACTORS
WHERE MOVIES.TITLE = ACTORS.TITLE ANDRATING = 'PG';
INFO: Index Rating selected for WHERE clause optimization.
QUIT;
MSGLEVEL=I Log
7/31/2019 Kirk Lafker
37/44
Copyright 1992-2010 by Kirk Paul Lafler 37
_METHOD Option
A_METHOD option can be specified on the PROCSQL statement to display the hierarchy of processingthat takes place. Results are displayed on the Log.
Codes Description
sqxcrta Create table as Selectsqxslct Select
sqxjsl Step loop join (Cartesian)
sqxjm Merge join
sqxjndx Index join
sqxjhsh Hash join
sqxsort Sort
sqxsrc Source rows from table
sqxfil Filter rows
sqxsumg Summary statistics with GROUP BY
sqxsumn Summary statistics with no GROUP BY
7/31/2019 Kirk Lafker
38/44
Copyright 1992-2010 by Kirk Paul Lafler 38
Program Example using _METHOD
SAS Log Results
NOTE: SQL execution methods chosen are:
sqxslctsqxjhsh
sqxsrc( MOVIES )
sqxsrc( ACTORS )
7/31/2019 Kirk Lafker
39/44
Copyright 1992-2010 by Kirk Paul Lafler 39
Conclusion
Data federation characteristics A merge and join do not process data the same way
A join combines tables side-by-side (horizontally)
Joins adhere to ANSI guidelines
Cartesian product represents all possiblecombinations of rows from the underlying tables
256 tables can be joined using an inner join construct
2 tables can be joined using an outer join construct
Four types of join algorithms: Nested loop join
Sort-Merge join
Indexed join
Hash join
7/31/2019 Kirk Lafker
40/44
Copyright 1992-2010 by Kirk Paul Lafler 40
PROC SQL
Beyond the BasicsUsing SAS
Kirk Paul Lafler
PROC SQL
Examples
Book
Coming Winter 2004!Available at www.sas.com!
7/31/2019 Kirk Lafker
41/44
Copyright 1992-2010 by Kirk Paul Lafler 41
Coming Winter 2004!Coming in September 2010!
A Book of
Data Transfer
Methods withExamples
SASand Excel
Transferring Data betweenSAS and Excel Using SAS
Kirk Paul Lafler
William E. Benjamin, Jr.
7/31/2019 Kirk Lafker
42/44
Questions ?Joining
Worlds of Datais exciting and fun!
Kirk Paul Lafler
SAS Consultant, Author, and Trainer
Software Intelligence Corporation
E-mail: [email protected]
7/31/2019 Kirk Lafker
43/44
We kindly thank the WUSS Leadership and the following organizations for their
assistance and support for this symposium:
BASAS
http://www.tmcsoftware.com/index.htmlhttp://www.genomichealth.com/Index.aspxhttp://www.meta-x.com/index.htmlhttp://www.advancedclinical.com/default.aspxhttp://www.sas.com/7/31/2019 Kirk Lafker
44/44
Mark yourcalendars for
WUSS 2010!Three days ofclasses, workshops andpresentations introducing the latest in
SAS technology and applications.
November 3-5
Hyatt Regency Mission Bay
San Diego, CA
More information will be